In [None]:
%%R
options(htmltools.dir.version = FALSE)
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  dev = "svg",
  fig.align = "center",
  #fig.width = 11,
  #fig.height = 5
  cache = FALSE
)

# define vars
om = par("mar")
lowtop = c(om[1],om[2],0.1,om[4])
library(tidyverse)
library(knitr)
library(reticulate)
use_python("C:\\Users\\jbpost2\\AppData\\Local\\Programs\\Python\\Python310\\python.exe")
#use_python("C:\\python\\python.exe")
options(dplyr.print_min = 5)
options(reticulate.repl.quiet = TRUE)

layout: false
class: title-slide-section-red, middle

# Plotting with `pandas`
Justin Post

---

# Making Sense of Data

- Understand types of data and their distributions  

- Graphical summaries (across subgroups)  

    + Bar plots (categorical data)   
    + Histograms  
    + Box plots  
    + Scatter plots

- We'll create some of the same plots from the previous lecture (same data processing done)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#read in data
titanic_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/titanic.csv")
#remove some columns and a bad row
sub_titanic_data = titanic_data.drop(columns = ["body", "cabin", "boat"]) \
                               .iloc[:(titanic_data.shape[0]-1)]
#create category versions of the variables (some code omitted)
sub_titanic_data["embarkedC"] = sub_titanic_data.embarked.astype("category")
sub_titanic_data.embarkedC = sub_titanic_data.embarkedC.cat.rename_categories(
                                    ["Cherbourg", "Queenstown", "Southampton"])
sub_titanic_data["sexC"] = sub_titanic_data.sex.astype("category")
sub_titanic_data.sexC = sub_titanic_data.sexC.cat.rename_categories(["Female", "Male"])
sub_titanic_data["survivedC"] = sub_titanic_data.survived.astype("category")
sub_titanic_data.survivedC = sub_titanic_data.survivedC.cat.rename_categories(["Died", "Survived"])

---

# Barplots with `pandas`

- Barplots via `.plot.bar()` method on a `series` or `dataframe`
- Alternative: `.plot()` method with `kind = 'bar'`

.left50[

In [None]:
table = sub_titanic_data.embarkedC.value_counts()
table #a series

]
.right50[

In [None]:
table.plot.bar()

]

---

# Barplots with `pandas`

- Barplots via `.plot.bar()` method on a `series` or `dataframe`
- Alternative: `.plot()` method with `kind = 'bar'`

In [None]:
table.plot.bar()
plt.xticks(rotation = 0)
plt.show()

---

# Barplots with `pandas`

- Barplots via `.plot.bar()` method on a `series` or `dataframe`
- Alternative: `.plot()` method with `kind = 'bar'`

In [None]:
table.plot(kind = "bar", rot = 0) #can use additional arg rather than additional function call

---

# Stacked Barplot with `pandas`

- Color the bars by another categorical variable in the `dataframe`

In [None]:
table = pd.crosstab(sub_titanic_data["embarkedC"], sub_titanic_data["survivedC"])
table.plot.bar(stacked = True, rot = 0) # or table.plot(stacked = True, kind = "bar", rot = 0)

---

# Side-by-Side Barplots with `pandas`

- Place bars next to each other for easier comparison

In [None]:
table = pd.crosstab(sub_titanic_data["embarkedC"], sub_titanic_data["survivedC"])
table.plot.bar(rot = 0)

---

# Plotting Numeric Variables    

Numeric variable - entries are numerical values on which math can be performed

Goal: describe the shape, center, and spread
- Generally, via a histogram or boxplot!  
- **Histogram** - Bins data to show distribution of observations
- via `.plot.hist()` or `.plot(kind = "hist")` method
- A `.hist()` method also exists!
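
The three spellings listed above can be compared side by side. This is a minimal sketch on made-up data: the `ages` series is a hypothetical stand-in for a numeric column, not the Titanic data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical numeric series standing in for a column such as age
rng = np.random.default_rng(0)
ages = pd.Series(rng.normal(loc = 30, scale = 12, size = 200), name = "age")

fig, axes = plt.subplots(1, 3, figsize = (12, 3))
ages.plot.hist(bins = 15, ax = axes[0], title = ".plot.hist()")
ages.plot(kind = "hist", bins = 15, ax = axes[1], title = ".plot(kind = 'hist')")
ages.hist(bins = 15, ax = axes[2])  # separate method, draws grid lines by default
axes[2].set_title(".hist()")
```

All three panels bin the same values, so the bar heights should match; only the default styling differs.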

---

# Histogram with `pandas`

- **Histogram** - Bins data to show distribution of observations

In [None]:
sub_titanic_data["age"].plot.hist()
plt.xlabel("Age")
plt.title("Histogram of Age for Titanic Passengers"); plt.show()

---

# Histogram with `pandas`

- Specify # of bins

In [None]:
#can add label/title here (xlabel doesn't seem to work as intended...)
sub_titanic_data.age.plot.hist(bins = 20, title = "Histogram of Age for Titanic Passengers") \
    .set(xlabel = "Age") 

---

# Histogram with `pandas`

- To overlay on the same graph, create two histograms and set `alpha` to a value between 0 and 1 so both stay visible
    + First set up the bins manually so they are the same bins

In [None]:
bin_ends = 10 #Ideally want same bins, set those ourselves
bins = [i*sub_titanic_data.age.max()/bin_ends for i in range(0, bin_ends + 1)] #.max() skips NaN ages
print(bins)
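
As an alternative sketch, `np.linspace()` builds the same equally spaced bin edges in one call. The `ages` series and `n_bins` below are hypothetical stand-ins, not the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including a missing value
ages = pd.Series([22.0, 38.0, np.nan, 80.0, 4.0])
n_bins = 10

# List-comprehension version: n_bins + 1 edges from 0 to the largest age
bins_loop = [i * ages.max() / n_bins for i in range(n_bins + 1)]

# np.linspace version: same edges, returned as a numpy array
bins_np = np.linspace(0, ages.max(), n_bins + 1)
print(bins_np)
```

Either object can be passed as `bins =` to a histogram call.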

- Obtain subsets of data needed

In [None]:
age_died = sub_titanic_data.loc[sub_titanic_data.survivedC == "Died", "age"] #series for died
age_survived = sub_titanic_data.loc[sub_titanic_data.survivedC == "Survived", "age"] #series for survived

---

# Histogram with `pandas`

- To overlay on the same graph, create two histograms and set `alpha` to a value between 0 and 1 so both stay visible

In [None]:
age_died.plot.hist(bins = bins, alpha = 0.5, label = "Died", 
                   title = "Ages for those that survived vs those that died") \
                   .set(xlabel = "Age")
age_survived.plot.hist(bins = bins, alpha = 0.5, label = "Survived")
plt.legend(); plt.show()

---

# Histogram with `pandas`

- `pandas` will automatically overlay data from different columns of the **same** data frame
- Just to show that:

In [None]:
age_died = sub_titanic_data.loc[sub_titanic_data.survivedC == "Died", "age"] #809 values
age_survived = sub_titanic_data.loc[sub_titanic_data.survivedC == "Survived", "age"] #500 values
temp = pd.DataFrame(zip(age_died, age_survived), columns = ["Died", "Survived"]) #only has 500 rows
temp.plot.hist(alpha = 0.5)

---

# Histogram with `pandas`

- Kind of funky when the number in each group differs though...

Just a quick note:

In [None]:
age_survived
age_survived.loc[0:5] #Note: .loc matches index labels, so only labels 0 through 5 that exist are returned!

---

# Histogram with `pandas`

- Kind of funky when the number in each group differs though...

Just a quick note:

In [None]:
age_survived.iloc[0:5] #use iloc when indices aren't 0, 1, 2, ...
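
The difference between label-based and position-based selection is easier to see on a tiny series. This is a minimal sketch; the series `s` and its index values are made up for illustration.

```python
import pandas as pd

# Hypothetical series whose index has gaps, like a filtered subset of rows
s = pd.Series([10, 20, 30, 40], index = [1, 5, 6, 9])

by_label = s.loc[0:5]      # labels from 0 through 5 (inclusive) that exist: 1 and 5
by_position = s.iloc[0:3]  # first three rows regardless of label: 1, 5, 6

print(by_label)
print(by_position)
```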

---

# Histogram with `pandas`

- Work around: Obtain same length `series` with `NaN` inserted

.left45[

In [None]:
list_age_survived = list(age_survived)
list_age_survived.extend([np.nan] * (len(age_died) - len(age_survived))) #pad to length of age_died
temp = pd.Series(list_age_survived) 
temp

]

---

# Histogram with `pandas`

- Work around: Obtain same length `series` with `NaN` inserted

.left50[

In [None]:
list_age_survived = list(age_survived)
list_age_survived.extend([np.nan] * (len(age_died) - len(age_survived))) #pad to length of age_died
temp = pd.Series(list_age_survived) 
temp

]
.right50[

In [None]:
plotting_df = pd.DataFrame(zip(age_died, temp), 
                      columns = ["Died", "Survived"])
plotting_df.tail()

]
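
Another possible route to the same padded data frame, sketched here on made-up values, is to reset each index and let `pd.concat()` fill the shorter column with `NaN`. The names `died`, `survived`, and `padded_df` are hypothetical.

```python
import pandas as pd

# Hypothetical subsets of unequal length with non-consecutive indices
died = pd.Series([30.0, 45.0, 22.0, 60.0], index = [2, 3, 4, 7])
survived = pd.Series([29.0, 1.0], index = [0, 1])

# Resetting the index lines the values up by position;
# concat then pads the shorter column with NaN
padded_df = pd.concat(
    [died.reset_index(drop = True), survived.reset_index(drop = True)],
    axis = 1, keys = ["Died", "Survived"])
print(padded_df)
```

This avoids counting by hand how many `NaN` values are needed.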

---

# Histogram with `pandas`

- Overlay automatic when `.plot.hist()` method used with two numeric columns

In [None]:
plotting_df.plot.hist(alpha = 0.5, title = "Ages for those that survived vs those that died") \
    .set(xlabel = "Age")

---

# Histogram with `pandas`

- Can place two graphs next to each other with the `.hist()` method (notice this is a different method, and the bin widths differ between panels too!)

In [None]:
sub_titanic_data.hist(column = "age", by = "survivedC")
plt.show()
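
If matching bins across the panels are wanted, one option is to pass explicit bin edges. This is a sketch on simulated data, assuming `bins =` is forwarded to each panel; `df` and its columns are hypothetical stand-ins for the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a numeric column and a two-level grouping column
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "age": rng.uniform(0, 80, size = 100),
    "survivedC": rng.choice(["Died", "Survived"], size = 100)})

# Same eight bins in every panel, with shared axes for easier comparison
bins = np.linspace(0, 80, 9)
axes = df.hist(column = "age", by = "survivedC", bins = bins,
               sharex = True, sharey = True)
```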

---

# Kernel smoother with `pandas`

- **Kernel Smoother** - Smoothed version of a histogram  
- 'Kernel' determines weight given to nearby points    
    + Use `.plot.density()` or `.plot(kind = "density")` method
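
To make the idea of kernel weights concrete, here is a hand-rolled sketch of a Gaussian kernel smoother. It is an illustration only, not the code `pandas` runs: the function `gaussian_kde_sketch` and the `ages` values are hypothetical, and `bandwidth` here is an absolute width, whereas a scalar `bw_method` in `pandas` is a factor that gets scaled by the spread of the data.

```python
import numpy as np

def gaussian_kde_sketch(data, grid, bandwidth):
    """Average a Gaussian bump centred on every observation."""
    data = np.asarray(data, dtype = float)
    # Standardized distance from every grid point to every observation
    z = (grid[:, None] - data[None, :]) / bandwidth
    # The kernel gives more weight to observations close to the grid point
    weights = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return weights.mean(axis = 1) / bandwidth

# Hypothetical ages
ages = np.array([2.0, 18.0, 22.0, 25.0, 31.0, 40.0, 58.0])
grid = np.linspace(-40, 100, 701)

narrow = gaussian_kde_sketch(ages, grid, bandwidth = 2.0)   # wiggly curve
wide = gaussian_kde_sketch(ages, grid, bandwidth = 10.0)    # smooth curve
```

A smaller bandwidth follows individual points closely, while a larger one smooths them into a single broad shape.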

---

# Kernel smoother with `pandas`

- **Kernel Smoother** - Smoothed version of a histogram  
- 'Kernel' determines weight given to nearby points    
    + Use `.plot.density()` or `.plot(kind = "density")` method

In [None]:
sub_titanic_data.age.plot.density(bw_method = 0.1, label = "bw = 0.1", 
                                  title = "Density Plots of Age for Titanic Passengers")
sub_titanic_data.age.plot.density(bw_method = 0.25, label = "bw = 0.25")
sub_titanic_data.age.plot.density(bw_method = 0.5, label = "bw = 0.5")
plt.legend()

---

# Kernel smoother with `pandas`

- **Kernel Smoother** - Smoothed version of a histogram  

In [None]:
sub_titanic_data.age.plot.density(bw_method = 0.1, label = "bw = 0.1", 
                                  title = "Density Plots of Age for Titanic Passengers")
sub_titanic_data.age.plot.density(bw_method = 0.25, label = "bw = 0.25")
sub_titanic_data.age.plot.density(bw_method = 0.5, label = "bw = 0.5")
plt.legend(); plt.show()

---

# Boxplots with `pandas`

One numerical and one categorical variable

- **Boxplot** - Provides the five number summary in a graph
    - Min, Q1, Median, Q3, Max  
    - Often shows possible outliers as well  
    - Use `.plot.box()` or `.plot(kind = "box")` method
    - A `.boxplot()` method also exists!
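
The numbers behind a boxplot can be computed directly. This sketch uses made-up ages and assumes the common rule of flagging points more than 1.5 IQRs beyond the quartiles, which matches the default whisker setting in `matplotlib`.

```python
import pandas as pd

# Hypothetical ages
ages = pd.Series([2, 18, 22, 25, 31, 40, 58, 80], name = "age")

# Five number summary: Min, Q1, Median, Q3, Max
five_num = ages.quantile([0, 0.25, 0.5, 0.75, 1])
print(five_num)

# Points beyond 1.5 * IQR from the quartiles are drawn as possible outliers
iqr = five_num.loc[0.75] - five_num.loc[0.25]
lower_fence = five_num.loc[0.25] - 1.5 * iqr
upper_fence = five_num.loc[0.75] + 1.5 * iqr
flagged = ages[(ages < lower_fence) | (ages > upper_fence)]
print(flagged)
```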

---

# Boxplots with `pandas`

In [None]:
sub_titanic_data.age.plot.box()

---

# Boxplots with `pandas`

- Compare across another variable

In [None]:
sub_titanic_data.boxplot(column = ["age"], by = "survivedC")
plt.show()

---

# Scatter Plots with `pandas`

- **Scatter Plot** - graphs points corresponding to each observation
    + Use `.plot.scatter()` or `.plot(kind = "scatter")` method with `x =`, and `y =`

In [None]:
sub_titanic_data.plot.scatter(x = "age", y = "fare", title = "Scatter plots rule!")

---

# Modifying Scatter Plots

- Easy to modify

In [None]:
#c = color, marker is a matplotlib option
sub_titanic_data.plot.scatter(x = "age", y = "fare", c = "Red", marker = "v", title = "Oh, V's!") 

---

# Modifying Scatter Plots

- Modify based on a variable

In [None]:
#s for size (should be a numeric column), cmap can be used with c for specifying color scales
sub_titanic_data.plot.scatter(x = "age", y = "fare", c = "survivedC", cmap = "viridis", s = 10)
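
Marker size can also vary by observation when `s` is given the name of a numeric column. This is a minimal sketch on a made-up data frame; `df` and the `marker_size` column are hypothetical.

```python
import pandas as pd

# Hypothetical data with a numeric column to drive marker size
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.93, 53.10],
    "marker_size": [20, 40, 60, 80]})

# s = column name: each point is sized by its value in that column
ax = df.plot.scatter(x = "age", y = "fare", s = "marker_size")
```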

---

# Matrix of Scatter Plots

- `pd.plotting.scatter_matrix()` function will produce basic graphs showing pairwise relationships

In [None]:
pd.plotting.scatter_matrix(sub_titanic_data[["age", "fare", "survived", "sibsp"]])
plt.show()

---

# Matrix of Scatter Plots

In [None]:
%%R
knitr::include_graphics("img/matrix_plot.png")

---

# To JupyterLab!  

- Read in some data

- Create some plots using `pandas`!

---

# Recap

- Creating visualizations is an important part of an EDA

- Goal: Describe the distribution

- `pandas` has nice functionality for creating common plots

    + `.plot()` method

- May want to [check out seaborn for quick ways to do fancier plots!](https://seaborn.pydata.org/tutorial/introduction.html)