In [None]:
%%R
options(htmltools.dir.version = FALSE)
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  dev = "svg",
  fig.align = "center",
  #fig.width = 11,
  #fig.height = 5
  cache = FALSE
)

# define vars
om = par("mar")
lowtop = c(om[1],om[2],0.1,om[4])
library(tidyverse)
library(knitr)
library(reticulate)
use_python("C:\\Users\\jbpost2\\AppData\\Local\\Programs\\Python\\Python310\\python.exe")
#use_python("C:\\python\\python.exe")
options(dplyr.print_min = 5)
options(reticulate.repl.quiet = TRUE)

layout: false
class: title-slide-section-red, middle

# Numerical Summaries 
Justin Post

---
layout: true

<div class="my-footer"><img src="img/logo.png" style="height: 60px;"/></div> 

---

# Exploratory Data Analysis (EDA)

- Usual first step in an analysis is to get to know your data

- EDA generally consists of a few steps:

    + Understand how your data is stored
    + Do basic data validation
    + Determine rate of missing values
    + Clean up data as needed
    + Investigate distributions
        - Univariate measures/graphs
        - Multivariate measures/graphs
    + Apply transformations and repeat previous step
   
---

# Understand How Data is Stored

Read in some data

In [None]:
import pandas as pd
titanic_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/titanic.csv")

.left45[

In [None]:
titanic_data.info()

]
.left45[

In [None]:
titanic_data.head()
titanic_data.tail()

]

---

# Do Basic Data Validation

- Use the `describe()` method on a data frame

In [None]:
titanic_data.describe()

---

# Do Basic Data Validation

- Use the `describe()` method on a data frame

In [None]:
titanic_data.columns
titanic_data[["age", "sibsp", "parch", "fare"]].describe(percentiles = [0.05, 0.25])

---

# Determine Rate of Missing Values

- Use the `.isnull()` method

In [None]:
titanic_data.isnull()

---

# Determine Rate of Missing Values

- Use the `.isnull()` method

In [None]:
titanic_data.isnull().sum()
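
- `.sum()` gives counts; taking `.mean()` of the Boolean frame gives the *rate* instead. A minimal sketch on a small made-up data frame (`toy` and its values are hypothetical, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "fare": [7.25, 71.28, np.nan, 8.05],
    "sex": ["male", "female", "female", "male"],
})

# The mean of True/False values is the proportion of missing entries per column
missing_rate = toy.isnull().mean()
print(missing_rate)
```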

---

# Clean Up Data As Needed

- `.drop()` method can remove variables (or use other subsetting)

In [None]:
titanic_data.drop(columns = ["body", "cabin", "boat"])

---

# Clean Up Data As Needed

- `.drop()` method can remove variables (or use other subsetting)

In [None]:
sub_titanic_data = titanic_data.drop(columns = ["body", "cabin", "boat"]) \
                               .iloc[:(titanic_data.shape[0]-1)]
sub_titanic_data

---

# Clean Up Data As Needed

- Can be dangerous to impute the missing values... but can be done with `.fillna()` method

In [None]:
sub_titanic_data.fillna(value = 0)

---

# Clean Up Data As Needed

- Can be dangerous to impute the missing values... but can be done with `.fillna()` method

In [None]:
sub_titanic_data.fillna(value = {"home.dest": "Unknown"})
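
- One way to see the danger: compare a mean before and after filling with 0. A small sketch with hypothetical ages (not the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical ages, two of them missing
ages = pd.Series([20.0, 40.0, np.nan, np.nan])

# .mean() skips missing values by default
mean_skipping_na = ages.mean()

# After filling, the zeros are treated as real ages and pull the mean down
mean_after_fill = ages.fillna(0).mean()

print(mean_skipping_na, mean_after_fill)
```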

---

# Clean Up Data As Needed

- Can remove rows with missing values using the `.dropna()` method

.left45[

In [None]:
sub_titanic_data.shape
sub_titanic_data.isnull().sum()

]
.left45[

In [None]:
temp = sub_titanic_data.dropna()
temp.shape
temp.isnull().sum()

]
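
- Dropping every incomplete row can discard a lot of data. The `subset` argument of `.dropna()` limits the check to chosen columns. A sketch on hypothetical data (`toy` and the `home_dest` column name are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing age, one missing destination
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0],
    "home_dest": ["NY", "London", None],
})

# Drops any row with at least one missing value
all_complete = toy.dropna()

# Only requires age to be present
age_complete = toy.dropna(subset=["age"])

print(all_complete.shape, age_complete.shape)
```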

---

# Exploratory Data Analysis (EDA)

- Usual first step in an analysis is to get to know your data

- EDA generally consists of a few steps:

    + Understand how your data is stored
    + Do basic data validation
    + Determine rate of missing values
    + Clean up data as needed
    + Investigate distributions
        - Univariate measures/graphs
        - Multivariate measures/graphs
    + Apply transformations and repeat previous step
   

---

# Investigate distributions  

- Numerical summaries (across subgroups)  

    + Contingency Tables  
    + Mean/Median  
    + Standard Deviation/Variance/IQR
    + Quantiles/Percentiles
    

---

# Investigate distributions  

- Numerical summaries (across subgroups)  

    + Contingency Tables  
    + Mean/Median  
    + Standard Deviation/Variance/IQR
    + Quantiles/Percentiles

- Graphical summaries (across subgroups)  

    + Bar plots  
    + Histograms  
    + Box plots  
    + Scatter plots


---

# Types of Data

- How to summarize data depends on the type of data  

    + Categorical (Qualitative) variable - entries are a label or attribute   
    + Numeric (Quantitative) variable - entries are a numerical value where math can be performed



In [None]:
%%R
knitr::include_graphics("img/variableTypes.png")

---

# Categorical Data

Goal: Describe the **distribution** of the variable  

- Distribution = pattern and frequency with which you observe a variable  
- Categorical variable - entries are a label or attribute   

---

# Categorical Data

Goal: Describe the **distribution** of the variable  

- Distribution = pattern and frequency with which you observe a variable  
- Categorical variable - entries are a label or attribute   

    + Describe the relative frequency (or count) for each category
    + Using `pandas` `.value_counts()` method and `crosstab()` function
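
A quick sketch of counts versus relative frequencies, using the `normalize` argument of `.value_counts()` on a hypothetical series of port codes:

```python
import pandas as pd

# Hypothetical port codes (not the Titanic data)
port = pd.Series(["S", "C", "S", "Q", "S", "C", "S", "S"])

# Count for each category
counts = port.value_counts()

# Relative frequency for each category (proportions sum to 1)
rel_freq = port.value_counts(normalize=True)

print(counts)
print(rel_freq)
```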

    
---

# Category Type 

- Variables of interest:

    + embarked (where journey started)  


In [None]:
sub_titanic_data.embarked[0:2]
type(sub_titanic_data.embarked[0])

---

# Category Type 

- Variables of interest:

    + embarked (where journey started)  


In [None]:
sub_titanic_data.embarked[0:2]
type(sub_titanic_data.embarked[0])
sub_titanic_data["embarkedC"] = sub_titanic_data.embarked.astype("category")
sub_titanic_data.embarkedC[0:2]

---

# Category Type 

- Variables of interest:

    + embarked (where journey started)  


In [None]:
sub_titanic_data.embarkedC = sub_titanic_data.embarkedC.cat.rename_categories(["Cherbourg", "Queenstown", "Southampton"])
sub_titanic_data.embarkedC[0:2]
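
- A list passed to `.rename_categories()` is matched by position against the existing categories, which are sorted for string data. Passing a dictionary makes the old-to-new mapping explicit. A sketch assuming the codes are `C`, `Q`, and `S` (the series below is hypothetical):

```python
import pandas as pd

# Hypothetical port codes stored as a category
port = pd.Series(["S", "C", "Q", "S"]).astype("category")

# A dict maps old label -> new label, so the result does not depend on category order
port_named = port.cat.rename_categories(
    {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}
)

print(port_named.tolist())
```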

---

# Category Type 

- Variables of interest:

    + sex (Male or Female)  


In [None]:
sub_titanic_data["sexC"] = sub_titanic_data.sex.astype("category")
sub_titanic_data.sexC = sub_titanic_data.sexC.cat.rename_categories(["Female", "Male"])
sub_titanic_data.sexC[0:2]

---

# Category Type 

- Variables of interest:

    + survived (survive or not)    
  

In [None]:
type(sub_titanic_data.survived[0])
sub_titanic_data.survived[0:2]
sub_titanic_data["survivedC"] = sub_titanic_data.survived.astype("category")
sub_titanic_data.survivedC = sub_titanic_data.survivedC.cat.rename_categories(["Died", "Survived"])
sub_titanic_data.survivedC[0:2]

---

# Contingency tables 

- Create **one-way contingency tables** (`.value_counts()` method) 

    + embarked (where journey started)  
    + survived (survive or not)    
    + sex (Male or Female)  

.left45[

In [None]:
sub_titanic_data.embarkedC.value_counts(dropna = False)
sub_titanic_data.survivedC.value_counts()

]
.right45[

In [None]:
sub_titanic_data.sexC.value_counts()

]

---

# Contingency tables 

- Create **one-way contingency tables** (`crosstab()` function) 

    + embarked (where journey started)  
    + survived (survive or not)    

.left45[

In [None]:
sub_titanic_data["dummy"] = 0
pd.crosstab(index = sub_titanic_data.embarkedC, columns = sub_titanic_data.dummy)

]
.right45[

In [None]:
pd.crosstab(index = sub_titanic_data.sexC, columns = sub_titanic_data.dummy)

]

---

# Two-way contingency tables 

- Create **two-way contingency tables** (`crosstab()` function)  

In [None]:
pd.crosstab(
  sub_titanic_data.embarkedC,
  sub_titanic_data.survivedC)
pd.crosstab(
  sub_titanic_data.sexC, 
  sub_titanic_data.survivedC)

---

# Two-way contingency tables 

- Create **two-way contingency tables** (`crosstab()` function)  
- Add marginal totals with `margins = True` argument

In [None]:
pd.crosstab(
  sub_titanic_data.embarkedC, 
  sub_titanic_data.survivedC, 
  margins = True)

---

# Two-way contingency tables 

- Create **two-way contingency tables** (`crosstab()` function)  
- Add row and column names


In [None]:
pd.crosstab(
  sub_titanic_data.embarkedC, 
  sub_titanic_data.survivedC, 
  margins = True, 
  rownames = ["Embarked Port"], 
  colnames = ["Survival Status"]
  )
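
- Counts can be turned into proportions with the `normalize` argument of `pd.crosstab()`. A sketch on hypothetical data, where `normalize = "index"` makes each row sum to 1:

```python
import pandas as pd

# Hypothetical toy data (not the Titanic data)
toy = pd.DataFrame({
    "port": ["S", "S", "S", "S", "C", "C"],
    "status": ["Died", "Died", "Died", "Survived", "Died", "Survived"],
})

# Proportion of each survival status within each port (rows sum to 1)
row_props = pd.crosstab(toy.port, toy.status, normalize="index")

print(row_props)
```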

---

# Three-way contingency tables 

- Create **three-way contingency tables** (`crosstab()` function)  

In [None]:
pd.crosstab(
  [sub_titanic_data.embarkedC, sub_titanic_data.survivedC], 
  sub_titanic_data.sexC, 
  margins = True)

---

# Three-way contingency tables 

- Create **three-way contingency tables** (`crosstab()` function)  
    + Add in names for more clarity

In [None]:
my_tab = pd.crosstab(
  [sub_titanic_data.embarkedC, sub_titanic_data.survivedC], 
  sub_titanic_data.sexC, 
  margins = True,
  rownames = ['Embarked Port', 'Survival Status'],
  colnames = ['Sex'])
my_tab

---

# Conditional contingency tables

- `crosstab()` returns a data frame!

In [None]:
type(my_tab)

In [None]:
my_tab.columns
my_tab.index

---

# Conditional contingency tables

- Can obtain **conditional** bivariate info!
- Returns the embarked vs survived table for females

.left50[

In [None]:
my_tab

]
.left5[
&nbsp;
]
.left45[

In [None]:
my_tab.iloc[:, 0]

]

---

# Conditional contingency tables

- Can obtain **conditional** bivariate info!

In [None]:
my_tab.columns
my_tab.index

- Returns the embarked vs survived table for males

In [None]:
my_tab.loc[:, "Male"]

---

# Conditional contingency tables

- Can obtain **conditional** bivariate info!

In [None]:
my_tab

- Returns the sex vs embarked table for those that died

In [None]:
my_tab.iloc[0:5:2, :]

---

# Conditional contingency tables

- Can obtain **conditional** bivariate info!

In [None]:
my_tab.columns
my_tab.index

- Returns the sex vs embarked table for those that died

In [None]:
my_tab.loc[(("Cherbourg", "Queenstown", "Southampton"), "Died"), :]

---

# Conditional contingency tables

- Can obtain **conditional** bivariate info!

In [None]:
my_tab.columns
my_tab.index

- Returns the sex vs survived table for embarked of Cherbourg

In [None]:
my_tab.loc[('Cherbourg', ("Died", "Survived")), :]

---

# Conditional contingency tables

- Can obtain **conditional** univariate info too!

In [None]:
my_tab

- Return the sex table for those that died and embarked at Cherbourg

In [None]:
my_tab.iloc[0, :]

---

# Conditional contingency tables

- Can obtain **conditional** univariate info too!

In [None]:
my_tab.columns
my_tab.index

- Return the sex table for those that died and embarked at Cherbourg

In [None]:
my_tab.loc[('Cherbourg', 'Died')]
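
- Another option for slicing one level of a row `MultiIndex` is the `.xs()` (cross-section) method. A sketch on a hypothetical three-way table built without margins (`toy` and `tab` are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data (not the Titanic data)
toy = pd.DataFrame({
    "port": ["Cherbourg", "Cherbourg", "Southampton", "Southampton", "Southampton"],
    "status": ["Died", "Survived", "Died", "Died", "Survived"],
    "sex": ["Male", "Female", "Male", "Female", "Female"],
})

tab = pd.crosstab(
    [toy.port, toy.status],
    toy.sex,
    rownames=["Embarked Port", "Survival Status"],
    colnames=["Sex"],
)

# Sex vs embarked table for those that died; the selected level is dropped
died = tab.xs("Died", level="Survival Status")

print(died)
```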

---

# To JupyterLab!  

- Read in some data

- Use `pd.cut()` function to split a numeric variable 

- Create some contingency tables

    + Subset data and create some too!
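
A small preview of `pd.cut()`, sketched with hypothetical ages and made-up bin edges and labels. By default each bin includes its right edge:

```python
import pandas as pd

# Hypothetical ages (not the Titanic data)
ages = pd.Series([2, 15, 30, 45, 70])

# Bins are (0, 18], (18, 65], (65, 120]; edges and labels are illustrative choices
age_group = pd.cut(
    ages,
    bins=[0, 18, 65, 120],
    labels=["Child", "Adult", "Senior"],
)

# The result is categorical, so it works with .value_counts() and pd.crosstab()
print(age_group.value_counts())
```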


---

# Exploratory Data Analysis (EDA)

- Usual first step in an analysis is to get to know your data

- EDA generally consists of a few steps:

    + Understand how your data is stored
    + Do basic data validation
    + Determine rate of missing values
    + Clean up data as needed
    + Investigate distributions
        - Univariate measures/graphs
        - Multivariate measures/graphs
    + Apply transformations and repeat previous step
   

---

# Investigate distributions  

- Numerical summaries (across subgroups)  

    + Contingency Tables  
    + Mean/Median  
    + Standard Deviation/Variance/IQR
    + Quantiles/Percentiles
    

---

# Numeric Data

Goal: Describe the **distribution** of the variable  

- Distribution = pattern and frequency with which you observe a variable  
- Numeric variable - entries are a numerical value where math can be performed

---

# Numeric Data

Goal: Describe the **distribution** of the variable  

- Distribution = pattern and frequency with which you observe a variable  
- Numeric variable - entries are a numerical value where math can be performed

For a single numeric variable, describe the distribution via 

+ Shape: Histogram, Density plot, ...
+ **Measures of center: Mean, Median, ...**
+ **Measures of spread: Variance, Standard Deviation, Quartiles, IQR, ...**

For two numeric variables, describe the distribution via 

+ Shape: Scatter plot, ...
+ **Measures of linear relationship: Covariance, Correlation, ...**

---

# Measures of Center

Find mean and median 

.left45[

In [None]:
sub_titanic_data['fare'].mean()
sub_titanic_data['fare'].median()

]
.left45[

In [None]:
sub_titanic_data.age.mean()
sub_titanic_data.age.median()

]

---

# Measures of Spread

Standard Deviation, Quartiles, & IQR

.left50[

In [None]:
sub_titanic_data.age.std()
sub_titanic_data.age.quantile(q = [0.2, 0.25, 0.5, 0.95])

]
.right45[

In [None]:
q1 = sub_titanic_data.age.quantile(q = [0.25])
q3 = sub_titanic_data.age.quantile(q = [0.75])
q1
q3
type(q1)
q3[0.75] - q1[0.25]

]
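
Passing a single number (rather than a list) to `.quantile()` returns a scalar, which makes the IQR a plain subtraction. A sketch with hypothetical values:

```python
import pandas as pd

# Hypothetical values (not the Titanic data)
ages = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# A scalar q returns a scalar rather than a Series
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)

iqr = q3 - q1
print(q1, q3, iqr)
```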

---

# Measures of Linear Relationship

Correlation   

In [None]:
# numeric_only = True skips text/category columns (needed in pandas 2.0+)
sub_titanic_data.corr(numeric_only = True)

In [None]:
sub_titanic_data[["age", "fare", "sibsp", "parch"]].corr()
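
Covariance is available through `.cov()`, and correlation is the covariance divided by the product of the two standard deviations. A sketch that checks this on hypothetical data:

```python
import pandas as pd

# Hypothetical toy data (not the Titanic data)
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 5.0, 9.0]})

cov_xy = toy.x.cov(toy.y)    # sample covariance
corr_xy = toy.x.corr(toy.y)  # Pearson correlation

# Correlation rebuilt from the covariance and the sample standard deviations
manual = cov_xy / (toy.x.std() * toy.y.std())

print(cov_xy, corr_xy, manual)
```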

---

# Summaries Across Groups

Usually want summaries for different **subgroups of data** 
- Ex: Get similar fare summaries for each *survival status*

---

# Summaries Across Groups

Usually want summaries for different **subgroups of data** 
- Ex: Get similar fare summaries for each *survival status*

Idea: 

- Use `.groupby()` method and then use a summarization function 
- Use `pd.crosstab()` function with `values` and `aggfunc` arguments


---

# Summaries Across Groups Using `.groupby()`

- Ex: Get summary for numeric type variables for each *survival status*

In [None]:
sub_titanic_data.groupby("survivedC")[["age", "fare", "sibsp", "parch"]].mean()

---

# Summaries Across Groups Using `.groupby()`

- Ex: Get summary for numeric type variables for each *survival status*

In [None]:
sub_titanic_data.groupby("survivedC")[["age", "fare", "sibsp", "parch"]].mean()

In [None]:
sub_titanic_data.groupby("survivedC")[["age", "fare", "sibsp", "parch"]].std()

---

# Summaries Across Groups Using `.groupby()`

- `.unstack()` method can sometimes make the output clearer

In [None]:
sub_titanic_data.groupby("survivedC")[["age", "fare", "sibsp", "parch"]].mean().unstack()

---

# Summaries Across Groups Using `.groupby()`

- Ex: Get summary for numeric type variables for each *survival status* and *embarked port*

In [None]:
sub_titanic_data.groupby(["survivedC", "embarkedC"])[["age", "fare", "sibsp", "parch"]].mean()
sub_titanic_data.groupby(["survivedC", "embarkedC"])[["age", "fare", "sibsp", "parch"]].std()

---

# Formatting Code

- Can be a good place to use `\`

In [None]:
sub_titanic_data \
  .groupby(["survivedC", "embarkedC"]) \
   [["age", "fare", "sibsp", "parch"]] \
   .mean()
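
- An alternative to `\` is to wrap the whole chain in parentheses, which avoids errors from stray whitespace after a backslash. A sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical toy data (not the Titanic data)
toy = pd.DataFrame({
    "status": ["Died", "Died", "Survived", "Survived"],
    "fare": [10.0, 20.0, 30.0, 50.0],
})

# Inside parentheses, line breaks need no continuation character
group_means = (
    toy
    .groupby("status")
    [["fare"]]
    .mean()
)

print(group_means)
```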

---

# Summaries Across Groups Using `pd.crosstab()`

- Ex: Get summary for numeric type variables for each *survival status*

In [None]:
pd.crosstab(
  sub_titanic_data.survivedC, 
  columns = ["mean" for _ in range(sub_titanic_data.shape[0])],
  values = sub_titanic_data.fare,
  aggfunc = 'mean')

---

# Summaries Across Groups Using `pd.crosstab()`

- Ex: Get summary for numeric type variables for each *survival status*

In [None]:
pd.crosstab(
  sub_titanic_data.survivedC, 
  columns = ["stat" for _ in range(sub_titanic_data.shape[0])],
  values = sub_titanic_data.fare,
  aggfunc = ['mean', 'median', 'std', 'count'])

---

# Summaries Across Groups Using `pd.crosstab()`

- Ex: Get summary for numeric type variables for each *survival status* and *embarked port*

In [None]:
pd.crosstab(
  sub_titanic_data.embarkedC,
  sub_titanic_data.survivedC, 
  values = sub_titanic_data.fare,
  aggfunc = ['mean', 'count'])

---

# To JupyterLab!  

- Read in some data

- Find some numeric summaries!

    + Check out `.pivot_table()` method and `.agg()` method
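
A starting point for both, sketched on hypothetical data (`toy` and its column names are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data (not the Titanic data)
toy = pd.DataFrame({
    "status": ["Died", "Died", "Survived", "Survived"],
    "port": ["S", "C", "S", "S"],
    "fare": [10.0, 20.0, 30.0, 50.0],
})

# Named aggregation: new column name = (input column, function)
by_status = toy.groupby("status").agg(
    mean_fare=("fare", "mean"),
    max_fare=("fare", "max"),
    n=("fare", "count"),
)

# Mean fare for each status (rows) and port (columns); empty cells become NaN
fare_table = toy.pivot_table(
    values="fare", index="status", columns="port", aggfunc="mean"
)

print(by_status)
print(fare_table)
```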
    
<!-- df.pivot_table('tip', 'time', 'day', aggfunc='count', margins=True)
df.pivot_table('tip', ['sex', 'smoker'], ['time', 'day'], 
               aggfunc='median')
               
data[data['item'] == 'call'].groupby('month').agg(
    # Get max of the duration column for each group
    max_duration=('duration', max),
    # Get min of the duration column for each group
    min_duration=('duration', min),
    # Get sum of the duration column for each group
    total_duration=('duration', sum),
    # Apply a lambda to date column
    num_days=("date", lambda x: (max(x) - min(x)).days)    
)
-->

---

# Recap

EDA is often the first step to an analysis:

- Must understand the type of data you have/missingness/data validation
- Then describe the distributions of the variables
- Numerical summaries

    + Contingency Tables: `pd.crosstab()`  
    + Mean/Median: `.mean()`, `.median()` methods on a data frame
    + Standard Deviation/quantiles: `.std()`, `.quantile()`  methods

- Across subgroups with `.groupby()` method or `pd.crosstab()` with the `values` and `aggfunc` arguments

- You can [fancy up output](https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html) too!
