In [None]:
%%R
options(htmltools.dir.version = FALSE)
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  dev = "svg",
  fig.align = "center",
  #fig.width = 11,
  #fig.height = 5
  cache = TRUE
)

# define vars
om = par("mar")
lowtop = c(om[1],om[2],0.1,om[4])
library(tidyverse)
library(knitr)
library(reticulate)
use_python("C:\\Users\\jbpost2\\AppData\\Local\\Programs\\Python\\Python310\\python.exe")
#use_python("C:\\python\\python.exe")
options(dplyr.print_min = 5)
options(reticulate.repl.quiet = TRUE)

layout: false
class: title-slide-section-red, middle

# Summarizing Data Goals
Justin Post

---
layout: true

<div class="my-footer"><img src="img/logo.png" style="height: 60px;"/></div> 

---

# Course Plan

- Course split into four topics

    1. Programming in `python`

    2. Big Data Management

    3. Modeling Big Data (with `Spark` via `pyspark`)

    4. Streaming Data


---

# Programming in Python (Prep for Dealing with Big Data)

- `JupyterLab` as our IDE (interactive development environment)
- Basic Use of Python
- Markdown capabilities
- Python Modules

<br>

- Summarizing Data Ideas
- Basic data types & Writing Functions
- Control flow (if/then/else, Looping)

<br>

- Compound data types (including `Numpy` arrays, `pandas` data frames)
- Summarizing data
- Common models and model evaluation

---

# Uses for Data

Four major goals with data:
1. Description
2. Inference
3. Prediction/Classification
4. Pattern Finding

- **Descriptive Statistics** try to summarize the distribution of the variable

- Supervised Learning methods try to relate predictors to a response variable through a model
    - Some models used for inference and prediction/classification
    - Some used just for prediction/classification


---

# Making Sense of Data  

Goal: Understand types of data and their distributions  

- Numerical summaries  

In [None]:
%%R
knitr::include_graphics("img/summarizeAllF.png")

---

# Making Sense of Data  

Goal: Understand types of data and their distributions  

- Numerical summaries (across subgroups)  

In [None]:
%%R
knitr::include_graphics("img/summarizeGroupsF.png")

---

# Types of Data

- How to summarize data depends on the type of data  

    + Categorical (Qualitative) variable - entries are a label or attribute   
    + Numeric (Quantitative) variable - entries are a numerical value where math can be performed



In [None]:
%%R
knitr::include_graphics("img/variableTypes.png")

---

# Making Sense of Data  

Goal: Understand types of data and their distributions  

- Numerical summaries (across subgroups)  

    + Contingency Tables  
    + Mean/Median  
    + Standard Deviation/Variance/IQR
    + Quantiles/Percentiles
    

---

# Categorical Data

Goal: Describe the **distribution** of the variable  

- Distribution = pattern and frequency with which you observe a variable  

- Categorical variable - entries are a label or attribute   

--

    + Describe the relative frequency (or count) for each category

    + Called a **contingency table**

---

# Read in Data

Read in some data

In [None]:
import pandas as pd
wine_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/winequality-full.csv")
wine_data.head()

---

# Categorical Variable Summary - One-way Table

- Count the \# of times each category of **one** variable appears!

In [None]:
#hacky way to do this
pd.crosstab(wine_data.quality,columns = ["freq" for _ in range(len(wine_data.quality))])

--

In [None]:
#hacky way to do this
pd.crosstab(wine_data.type, columns = ["freq" for _ in range(len(wine_data.type))])

---

# Categorical Variable Summary - One-way Table

- Count the \# of times each category of **one** variable appears!

- Easy! Create counters and **loop** through observations

In [None]:
wine_data.head()
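
The counter-and-loop idea can be sketched in plain Python. This is only an illustration: the list `wine_types` is a made-up stand-in for a column such as `wine_data.type`, and a dictionary plays the role of the counters.

```python
# Hypothetical stand-in for a categorical column (values made up for illustration)
wine_types = ["red", "white", "white", "red", "white", "white"]

# One counter per category, created the first time that category is seen
type_counts = {}
for wine_type in wine_types:
    type_counts[wine_type] = type_counts.get(wine_type, 0) + 1

print(type_counts)  # {'red': 2, 'white': 4}
```

Dividing each count by `len(wine_types)` would give relative frequencies instead of counts.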

---

# Categorical Variable Summary - Two-way Table

- Count the \# of times each **combination** of categories for *two* variables appears!

In [None]:
# two-way table: rows are quality levels, columns are wine types
pd.crosstab(wine_data.quality, wine_data.type)

--

- Same idea: Create counters and **loop** through observations
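
A minimal sketch of that loop, assuming the two variables arrive as pairs. The `observations` list is made up for illustration and the counters are keyed by the pair of categories.

```python
# Hypothetical (quality, type) pairs standing in for two categorical columns
observations = [(5, "red"), (6, "white"), (5, "red"),
                (6, "red"), (5, "white"), (6, "white")]

# One counter per combination of categories, keyed by the pair itself
pair_counts = {}
for quality, wine_type in observations:
    key = (quality, wine_type)
    pair_counts[key] = pair_counts.get(key, 0) + 1

# Print the cells of the two-way table in sorted order
for key in sorted(pair_counts):
    print(key, pair_counts[key])
```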
    

---

# Numeric Data

Goal: Describe the **distribution** of the variable  

- Distribution = pattern and frequency with which you observe a variable  

- Numeric variable - entries are a numerical value where math can be performed

--

For a single numeric variable, describe the distribution via 

+ Shape: Histogram, Density plot, ...

+ Measures of center: Mean, Median, ...

+ Measures of spread: Variance, Standard Deviation, Quartiles, IQR, ...

--

For two numeric variables, describe the distribution via 

+ Shape: Scatter plot

+ Measures of linear relationship: Covariance, Correlation

---

# Numerical Variable Location Summary - Mean

- Sample mean: for a variable in our data set (call it $y$)

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n}y_i$$

--

- Idea: Loop through and sum up values, then divide by \# of values
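
As a sketch, the loop could look like the following. The `alcohol` values are made up for illustration and are not taken from `wine_data`.

```python
# Hypothetical numeric values (made up for illustration)
alcohol = [9.0, 10.0, 11.0, 12.5, 10.5]

total = 0.0  # running sum of the values
n = 0        # running count of the values
for value in alcohol:
    total += value
    n += 1

y_bar = total / n
print(y_bar)
```

On a `pandas` Series the same quantity comes from the `.mean()` method.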

---

# Numerical Variable Location Summary - Mean

In [None]:
wine_data.head()
wine_data.loc[0:4].alcohol.mean()

---

# Numerical Variable Location Summary - Trimmed Mean

- Sample mean

    + Outlier values greatly affect the mean
    
.left50[ 

In [None]:
(1+2+3+4+5+6+7+8+9+100)/10

]

.right50[

In [None]:
(1+2+3+4+5+6+7+8+9+10)/10

]

<br>
<br>

--

- x% **trimmed mean** 
    + Sort the data
    + Remove bottom x% and top x% of data
    + Find sample mean on remaining values

--

In [None]:
#10% trimmed mean of 1, 2, 3, 4, 5, 6, 7, 8, 9, 100 and 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
(2+3+4+5+6+7+8+9)/8
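
The three steps can be sketched as a small function. The name `trimmed_mean` is hypothetical, and truncating the number of trimmed values to a whole number is an assumption made here; other conventions exist. The proportion is assumed to be below 0.5 so that some values remain.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the sorted values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)   # number of values cut from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
without_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

print(trimmed_mean(with_outlier, 0.10))     # 5.5
print(trimmed_mean(without_outlier, 0.10))  # 5.5
```

Both data sets give the same 10% trimmed mean because the only value that differs is trimmed away.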

---

# Numerical Variable Location Summary - Median

- Sample median 
    + Sort values
    + The value with half of the data below it and half above it is the median

--


In [None]:
wine_data.head()

.left45[

In [None]:
wine_data.loc[0:4].alcohol.median()

]

.right45[

In [None]:
wine_data.loc[0:4, ["fixed acidity"]].median()

]

---

# Numerical Variable Location Summary - Median

- Sample median 
    + Sort values
    + The value with half of the data below it and half above it is the median

- If even number of observations, average middle two values

In [None]:
wine_data.head(n = 6)

.left45[

In [None]:
wine_data.loc[0:5].alcohol.median()

]
.right45[

In [None]:
wine_data.loc[0:5, "fixed acidity"].median()

]
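
A minimal sketch of the sort-then-pick-the-middle rule, covering both the odd and even cases. The function name `sample_median` and the example values are made up for illustration.

```python
def sample_median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

print(sample_median([3, 1, 2, 5, 4]))     # odd count: the middle value, 3
print(sample_median([3, 1, 2, 5, 4, 6]))  # even count: average of 3 and 4, 3.5
```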

---

# Numerical Variable Spread Summary - Variance

- Sample variance is *almost* the average squared deviation from the mean
    
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2$$

--

In [None]:
wine_data.head()

.left45[

In [None]:
wine_data.loc[0:4].alcohol.mean()

]

.right45[

In [None]:
wine_data.loc[0:4].alcohol.var()

]
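
To see why the sample variance is only *almost* an average, the formula can be computed by hand. This sketch uses made-up values and compares dividing by $n-1$ with dividing by $n$.

```python
# Hypothetical values (made up for illustration)
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(values)
y_bar = sum(values) / n

# Sum of squared deviations from the mean
squared_deviations = 0.0
for y in values:
    squared_deviations += (y - y_bar) ** 2

sample_variance = squared_deviations / (n - 1)        # S^2 as defined above
average_squared_deviation = squared_deviations / n    # a true average, for comparison

print(sample_variance)
print(average_squared_deviation)  # 4.0
```

The sample variance is slightly larger than the plain average because the divisor is $n-1$ rather than $n$.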

---

# Numerical Variable Spread Summary - Standard Deviation

- Sample Standard Deviation = square root of sample variance

    + Puts metric on the scale of the variable 
    

--

In [None]:
wine_data.head()
wine_data.loc[0:4].alcohol.var()
wine_data.loc[0:4].alcohol.std()
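
A short sketch of the relationship, using the standard library's `statistics` module on made-up values. The comments about units assume the values are measured in percent alcohol.

```python
import math
import statistics

# Hypothetical measurements (made up for illustration)
alcohol = [9.0, 10.0, 11.0, 12.5, 10.5]

variance = statistics.variance(alcohol)  # sample variance, in squared units
std_dev = math.sqrt(variance)            # back on the scale of the variable

print(variance)
print(std_dev)
print(statistics.stdev(alcohol))         # agrees with std_dev up to rounding
```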

---

# Numerical Variable Spread Summary - Quantiles/Percentiles

- Sample quantile - a generalization of the median

    + $p^{th}$ quantile (for $0 \le p \le 1$) - value with 100*p% of the values below it
    + Also called the 100*p-th percentile

--

In [None]:
wine_data.head()
wine_data.loc[0:4].alcohol.quantile(q = [0.25, 0.5, 0.75])

---

# Numerical Variable Spread Summary - Quantiles/Percentiles

In [None]:
wine_data.head()

In [None]:
wine_data.loc[0:4, "volatile acidity"].quantile(q = [0.75, 0.9])
wine_data.loc[0:4, "volatile acidity"].quantile(q = [0.75, 0.9], interpolation = "midpoint")
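
When a quantile falls between two observed values, some rule has to choose a value in between. The sketch below is one way to write the two rules named above: `"linear"` interpolates between the neighboring sorted values and `"midpoint"` averages them. The function `quantile` here is a hypothetical helper, not the `pandas` implementation, and the data are made up.

```python
import math

def quantile(values, p, interpolation="linear"):
    """p-th sample quantile (0 <= p <= 1) using a fractional index into the sorted data."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * p   # fractional index of the quantile
    lower = math.floor(position)
    upper = math.ceil(position)
    if interpolation == "midpoint":
        # Average of the two neighboring sorted values
        return (ordered[lower] + ordered[upper]) / 2
    # Linear: move part of the way from the lower neighbor to the upper one
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

data = [10, 20, 30, 40, 50]
print(quantile(data, 0.5))                            # the median
print(quantile(data, 0.9))                            # between 40 and 50
print(quantile(data, 0.9, interpolation="midpoint"))  # halfway between 40 and 50
```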

---

# Numerical Variable Relationship Summary - Correlation

- Sample correlation - a measure of the **linear** relationship between two variables

    + Call the variables $x$ and $y$
    + $(x_i, y_i)$ are numeric variables observed on the same $n$ units, $i=1,...,n$
    + Pearson's correlation coefficient: 

$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

---

# Numerical Variable Relationship Summary - Correlation

- Sample correlation - a measure of the **linear** relationship between two variables

In [None]:
wine_data.head()
wine_data.loc[0:4, ["fixed acidity", "alcohol"]].corr()

--

.right75[
- Idea more involved:
    - Find sample means for $x$ and $y$
    - Compute numerator sum
    - Compute denominator sums
    - Find quotient
]
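
The four steps can be sketched as a single loop over the paired values. The function name `pearson_r` and the data are made up for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient computed from its definition."""
    n = len(x)
    x_bar = sum(x) / n   # sample means
    y_bar = sum(y) / n

    numerator = 0.0      # sum of cross-products of deviations
    sum_sq_x = 0.0       # sum of squared deviations for x
    sum_sq_y = 0.0       # sum of squared deviations for y
    for x_i, y_i in zip(x, y):
        numerator += (x_i - x_bar) * (y_i - y_bar)
        sum_sq_x += (x_i - x_bar) ** 2
        sum_sq_y += (y_i - y_bar) ** 2

    return numerator / math.sqrt(sum_sq_x * sum_sq_y)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(r)  # 6 / sqrt(60), roughly 0.77
```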

---

# Numerical Variable Relationship Summary - Correlation

- Sample correlation - a measure of the **linear** relationship between two variables

    + Sensitive to outliers
    + Spearman's correlation coefficient simply uses Pearson's correlation on the ranks of the data!


In [None]:
wine_data.head()
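
A sketch of the rank-based idea: rank each variable, then apply Pearson's formula to the ranks. The helpers `average_ranks`, `pearson_r`, and `spearman_r` are hypothetical names, giving tied values the average of their ranks is an assumed convention, and the data are made up so that one extreme $y$ value pulls Pearson's correlation down.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with this one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks

def pearson_r(x, y):
    """Pearson's correlation coefficient computed from its definition."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    numerator = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sum_sq_x = sum((a - x_bar) ** 2 for a in x)
    sum_sq_y = sum((b - y_bar) ** 2 for b in y)
    return numerator / math.sqrt(sum_sq_x * sum_sq_y)

def spearman_r(x, y):
    """Spearman's correlation: Pearson's correlation applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 1000]   # increasing, with one extreme value

print(pearson_r(x, y))     # below 1: the extreme value distorts the linear fit
print(spearman_r(x, y))    # 1.0: the ranks line up perfectly
```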

---

# Recap

Goal: Understand types of data and their distributions  

- Numerical summaries (across subgroups)  

    + Contingency Tables  
    + Mean/Median  
    + Standard Deviation/Variance/IQR
    + Quantiles/Percentiles
    

- Usually want summaries for different **subgroups of data**!!
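
The same counter-and-loop idea extends to subgroups by keeping one running total and one count per group. This is a sketch on made-up `(type, alcohol)` records, not on `wine_data`.

```python
# Hypothetical (type, alcohol) records (made up for illustration)
records = [("red", 9.0), ("white", 10.0), ("red", 11.0),
           ("white", 12.0), ("white", 11.0)]

totals = {}  # running sum of alcohol within each type
counts = {}  # number of observations within each type
for wine_type, alcohol in records:
    totals[wine_type] = totals.get(wine_type, 0.0) + alcohol
    counts[wine_type] = counts.get(wine_type, 0) + 1

# Mean of alcohol within each subgroup
group_means = {wine_type: totals[wine_type] / counts[wine_type] for wine_type in totals}
print(group_means)  # {'red': 10.0, 'white': 11.0}
```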