In [None]:
%%R
options(htmltools.dir.version = FALSE)
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  dev = "svg",
  fig.align = "center",
  #fig.width = 11,
  #fig.height = 5
  cache = FALSE
)

# define vars
om = par("mar")
lowtop = c(om[1],om[2],0.1,om[4])
library(tidyverse)
library(knitr)
library(reticulate)
use_python("C:\\Users\\jbpost2\\AppData\\Local\\Programs\\Python\\Python310\\python.exe")
#use_python("C:\\python\\python.exe")
options(dplyr.print_min = 5)
options(reticulate.repl.quiet = TRUE)

layout: false
class: title-slide-section-red, middle

# Plotting with `matplotlib`
Justin Post

---
layout: true

<div class="my-footer"><img src="img/logo.png" style="height: 60px;"/></div> 


---

# First Steps with Data

- EDA generally consists of a few steps:

    + Understand how your data is stored
    + Do basic data validation
    + Determine rate of missing values
    + Clean up data as needed
    + Investigate distributions
        - Univariate measures/graphs
        - Multivariate measures/graphs
    + Apply transformations and repeat previous step
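
A minimal sketch of the first few steps with `pandas`, using a small made-up data frame (the `toy` data and its column values are assumptions for illustration, not the titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing values
toy = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, 30.0],
    "embarked": ["S", "S", None, "C"],
    "fare": [211.3, 151.6, 151.6, 26.0]
})

# How is the data stored? (column types and non-null counts)
toy.info()

# Rate of missing values in each column
missing_rate = toy.isna().mean()
print(missing_rate)

# Univariate summaries: numeric columns, then counts for a categorical column
print(toy.describe())
print(toy.embarked.value_counts())
```
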
   
---

# Making Sense of Data

- Understand types of data and their distributions  

- Graphical summaries (across subgroups)  

    + Bar plots (categorical data)   
    + Histograms  
    + Box plots  
    + Scatter plots
    


---

# Graphical Summaries

Some major systems for plotting:

- `matplotlib`: [based on MATLAB](https://matplotlib.org/) plotting.  Similar to base R plotting

- `seaborn`: an abstraction of `matplotlib` but [still growing](https://seaborn.pydata.org/nextgen/)

- `Bokeh`: for [interactive visuals via HTML](https://bokeh.org/)

- `plotly`: general plotting system that has a [python module](https://plotly.com/python/)

- `plotnine`: [a ggplot port](https://plotnine.readthedocs.io/en/stable/)


---

# Plotting with `matplotlib`

- Two APIs...
    + Explicit axes interface (object oriented api)
    + **Implicit pyplot interface**

- Implicit
    + `plt.figure()`, `plt.plot(...)`, `plt.scatter()`, `plt.bar()`, or `plt.hist()`
    + Determine *axes* and *artist* elements    
    + Add labels, legends, and annotations
    + Produce the plot and then close it
        + `plt.show()` then `plt.close()` (not usually needed in `jupyter`)
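
As a rough comparison of the two interfaces, the sketch below draws the same bar plot both ways. The categories and counts are made up for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical counts, made up purely for illustration
categories = ["A", "B", "C"]
counts = [5, 9, 2]

# Implicit (pyplot) interface: pyplot keeps track of the "current" figure and axes
plt.figure()
plt.bar(x = categories, height = counts)
plt.xlabel("Category"); plt.ylabel("Count")
plt.title("Implicit pyplot interface")
implicit_ax = plt.gca()   # the axes pyplot was drawing on
plt.show()
plt.close()

# Explicit (object oriented) interface: call methods on the axes object directly
fig, ax = plt.subplots()
ax.bar(x = categories, height = counts)
ax.set_xlabel("Category"); ax.set_ylabel("Count")
ax.set_title("Explicit axes interface")
plt.show()
plt.close(fig)
```

Both versions should give the same picture; the implicit interface is shorter, while the explicit one makes it clear which axes each call modifies.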
    
---

# Reading in Data

- Consider data on titanic passengers in `titanic.csv`
- Start with a focus on plotting categorical data

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
#read in data
titanic_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/titanic.csv")
#remove some columns and a bad row (the last one)
sub_titanic_data = titanic_data.drop(columns = ["body", "cabin", "boat"]) \
                               .iloc[:(titanic_data.shape[0]-1)]
#create category versions of the variables (some code omitted)
sub_titanic_data["embarkedC"] = sub_titanic_data.embarked.astype("category")
sub_titanic_data.embarkedC = sub_titanic_data.embarkedC.cat.rename_categories(
                                    ["Cherbourg", "Queenstown", "Southampton"])

In [None]:
sub_titanic_data["sexC"] = sub_titanic_data.sex.astype("category")
sub_titanic_data.sexC = sub_titanic_data.sexC.cat.rename_categories(["Female", "Male"])
sub_titanic_data["survivedC"] = sub_titanic_data.survived.astype("category")
sub_titanic_data.survivedC = sub_titanic_data.survivedC.cat.rename_categories(["Died", "Survived"])

---

# Barplots

Categorical variable - entries are a label or attribute   

- Create summary counts
- A barplot gives a visual of those counts 
    + `plt.bar()`
        - `x` represents the categories 
        - `height` the corresponding heights

In [None]:
#sort = False keeps the counts in category order so they line up with cat.categories
table = sub_titanic_data.embarkedC.value_counts(sort = False)
print(table)

---

layout: false

# Barplots with `matplotlib`

In [None]:
plt.bar(x = sub_titanic_data.embarkedC.cat.categories,  height = table)
plt.show()

---

# Barplots with `matplotlib`

In [None]:
plt.bar(x = sub_titanic_data.embarkedC.cat.categories,  height = table)
plt.xlabel("Port Embarked"); plt.ylabel("Number of People")
plt.title("Most Embarked in the Southampton Port"); plt.show()

---

# Barplots with `matplotlib`

In [None]:
plt.subplots(figsize = (12,7))
plt.bar(x = sub_titanic_data.embarkedC.cat.categories,  height = table)
plt.xlabel("Port Embarked"); plt.ylabel("Number of People")
plt.title("Most Embarked in the Southampton Port"); plt.show()

---

# Stacked Barplot with `matplotlib`

- Color the bars by another categorical variable

In [None]:
stack_table = pd.crosstab(sub_titanic_data.embarkedC, sub_titanic_data.survivedC)
print(stack_table)
print(stack_table.loc[:, "Survived"])
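
To compare groups of different sizes, row proportions are often easier to read than raw counts. A minimal sketch using `pd.crosstab()` with `normalize = "index"` on a small made-up data frame (the `toy` data is hypothetical, not the titanic data):

```python
import pandas as pd

# Hypothetical toy data standing in for the two categorical columns
toy = pd.DataFrame({
    "port": ["Cherbourg", "Cherbourg", "Southampton",
             "Southampton", "Southampton", "Queenstown"],
    "status": ["Survived", "Died", "Died", "Died", "Survived", "Died"]
})

# Counts for each port/status combination
counts = pd.crosstab(toy.port, toy.status)
print(counts)

# Proportions within each port (each row sums to 1)
props = pd.crosstab(toy.port, toy.status, normalize = "index")
print(props)
```
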

---

# Stacked Barplot with `matplotlib`

In [None]:
plt.subplots(figsize = (12,7))
p1 = plt.bar(
  x = sub_titanic_data.embarkedC.cat.categories,  
  height = stack_table.loc[:, "Died"],
  label = "Died")
p2 = plt.bar(
  x = sub_titanic_data.embarkedC.cat.categories,  
  height = stack_table.loc[:, "Survived"], 
  bottom = stack_table.loc[:, "Died"],
  label = "Survived"
  )
plt.xlabel("Port Embarked")
plt.ylabel("Number of People")
plt.title("Most Embarked in the Southampton Port \n A higher proportion survived from Cherbourg")
plt.legend(loc = 0)
plt.show()

---

# Side-by-Side Barplot with `matplotlib`

- Place bars next to each other for easier comparison
- Need to have different x locations for each bar

In [None]:
labels = sub_titanic_data.embarkedC.cat.categories
length = len(labels)
width = 0.4
plt.subplots(figsize = (12,7))
p1 = plt.bar(
  x = [i - width/2 for i in range(0, length)],  
  height = stack_table.loc[:, "Died"],
  width = width,
  label = "Died")
p2 = plt.bar(
  x = [i + width/2 for i in range(0, length)],  
  height = stack_table.loc[:, "Survived"],
  width = width,
  label = "Survived")
plt.xticks(range(0, length), labels) #(list of positions, list of labels)
plt.xlabel("Port Embarked")
plt.ylabel("Number of People")
plt.title("Most Embarked in the Southampton Port \n A higher proportion survived from Cherbourg")
plt.legend(loc = 0) #needed to show the "Died" and "Survived" labels
plt.show()
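
The x locations used above can be checked on their own. The sketch below repeats the same arithmetic for three categories; `n_categories` is a stand-in for `len(labels)`.

```python
# Each pair of bars is centered on an integer tick position (0, 1, 2, ...)
n_categories = 3   # stand-in for len(labels), e.g. three ports
width = 0.4

# First group sits half a bar width to the left of each tick,
# second group half a bar width to the right
left = [i - width/2 for i in range(0, n_categories)]
right = [i + width/2 for i in range(0, n_categories)]
print(left)
print(right)
```

Because the two groups are exactly `width` apart, the bars in each pair touch without overlapping. A `width` of at most 0.5 keeps neighboring pairs from overlapping each other.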

---

# Plotting Numeric Variables    

With two numeric variables, use a scatter plot to describe the shape of their relationship
- `plt.scatter()` from `matplotlib`

In [None]:
plt.subplots(figsize = (12,7))
plt.scatter(sub_titanic_data.age, sub_titanic_data.fare)
plt.xlabel("Age"); plt.ylabel("Fare"); plt.show()
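
For a single numeric variable, `plt.hist()` gives a view of the distribution. A minimal sketch with made-up ages (not the titanic data); with real data, missing values would likely need to be dropped first, for example with `.dropna()`.

```python
import matplotlib.pyplot as plt

# Hypothetical ages, made up for illustration
ages = [2, 18, 22, 25, 29, 30, 31, 35, 40, 47, 54, 63]

plt.subplots(figsize = (12,7))
# plt.hist() returns the bin counts, the bin edges, and the drawn bars
counts, bin_edges, patches = plt.hist(ages, bins = 5)
plt.xlabel("Age"); plt.ylabel("Number of People")
plt.show()
plt.close()
```
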

---

# To JupyterLab!  

- Add a plot to our LLN function

---

# Recap

- Must understand the type of data you have to visualize it
- Goal: Describe the distribution

- `matplotlib` can create custom plots

    + Lots of work to specify everything yourself
    
- Many other plotting paradigms to consider!
    + `pandas` and `seaborn` next
