In [None]:
%%R
options(htmltools.dir.version = FALSE)
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  dev = "svg",
  fig.align = "center",
  #fig.width = 11,
  #fig.height = 5
  cache = TRUE
)

# define vars
om = par("mar")
lowtop = c(om[1],om[2],0.1,om[4])
library(tidyverse)
library(knitr)
library(reticulate)
use_python("C:\\Users\\jbpost2\\AppData\\Local\\Programs\\Python\\Python310\\python.exe")
#use_python("C:\\python\\python.exe")
options(dplyr.print_min = 5)
options(reticulate.repl.quiet = TRUE)

layout: false
class: title-slide-section-red, middle

# Common Uses for Data

Justin Post

---
layout: true

<div class="my-footer"><img src="img/logo.png" style="height: 60px;"/></div> 

---

# Big Picture

.left45[
- 5 V's of Big Data
    + Volume
    + Variety
    + Velocity
    + Veracity (Variability)
    + Value

- Will look at the Big Data pipeline later
    + Databases/Data Lakes/Data Warehouses/etc.
    + SQL basics
    + Hadoop & Spark
]

.right50[

In [None]:
%%R
knitr::include_graphics("img/big-data-characteristics.png")

]

--

- **What to do with the data?**

---

# Standard Rectangular Data

In [None]:
%%R
knitr::include_graphics('img/rectangular_data.png')

---

# Data Driven Goals

Four major goals when using data:  

1. Description 

<div style="float: left; width: 45%;">

In [None]:
%%R
knitr::include_graphics('img/summary_stats.png')

</div>
<div style="float: right; width: 45%;">

In [None]:
%%R
knitr::include_graphics('img/graph.png')

</div>
<!--comment-->

---

# Data Driven Goals

Four major goals when using data:  

<ol start = "2">
<li> Inference</li>
</ol>


---

# Data Driven Goals

Four major goals when using data:  

<ol start = "3">
<li> Prediction/Classification</li>
</ol>

<div style="float: left; width: 45%;">

In [None]:
%%R
knitr::include_graphics('img/slr.png')

</div>
<div style="float: left; width: 45%;">

In [None]:
%%R
knitr::include_graphics('img/tree.png')

</div>

---

# Data Driven Goals

Four major goals when using data:  

<ol start = "4">
<li> Pattern Finding</li>
</ol>

In [None]:
%%R
knitr::include_graphics('img/clustering.png')

---

# Statistical Learning

**Statistical learning** - Inference, prediction/classification, and pattern finding

- Supervised learning - a variable (or variables) represents an **output** or **response** of interest

--

    + May model response and
        - Make **inference** on the model parameters  
        - **Predict** a value or **classify** an observation

--

- Unsupervised learning - **No output or response variable** to shoot for  

    + Goal - learn about patterns and relationships in the data
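
As a rough sketch of pattern finding with no response variable, the code below groups a handful of numbers into two clusters. The method (two-cluster k-means in one dimension), the helper name `kmeans_1d`, and the toy values are all assumptions made for illustration; they are not taken from these slides.

```python
def kmeans_1d(values, n_iter=10):
    """Two-cluster k-means on a list of numbers (illustrative only)."""
    centers = [min(values), max(values)]  # simple deterministic starting centers
    groups = [[], []]
    for _ in range(n_iter):
        groups = [[], []]
        for v in values:
            # Assign each value to its nearest center
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep the old center if a group is empty)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical toy values with two visible clumps; there is no response variable
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers, groups = kmeans_1d(values)
print(centers, groups)
```

Note that only `values` is supplied: the grouping comes from the data alone rather than from a known output.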


---

# 1. Describing Data

Goal: Describe the **distribution** of the variable  

- Distribution = pattern and frequency with which you observe a variable  

- Numeric variable - entries are numerical values on which math can be performed

--

For a single numeric variable,
+ Shape: Histogram, Density plot, ...
+ Measures of center: Mean, Median, ...
+ Measures of spread: Variance, Standard Deviation, Quartiles, IQR, ...

--

For two numeric variables,
+ Shape: Scatter plot
+ Measures of Dependence: Correlation
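
The summaries listed above can be sketched with Python's standard library alone. The toy lists `x` and `y` are hypothetical values (not the wine data), and the correlation is computed directly from its definition rather than with a library call.

```python
import math
import statistics as st

# Hypothetical toy values, not taken from the wine data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Single numeric variable: measures of center and spread
center = {"mean": st.mean(x), "median": st.median(x)}
spread = {"variance": st.variance(x), "sd": st.stdev(x)}  # sample versions (divide by n - 1)
q1, q2, q3 = st.quantiles(x, n=4)  # quartiles
iqr = q3 - q1

# Two numeric variables: Pearson correlation from its definition
x_bar, y_bar = st.mean(x), st.mean(y)
s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
s_xx = sum((a - x_bar) ** 2 for a in x)
s_yy = sum((b - y_bar) ** 2 for b in y)
r = s_xy / math.sqrt(s_xx * s_yy)

print(center, spread, iqr, round(r, 3))
```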

---

# Quick Example

Read in some data

In [None]:
import pandas as pd
wine_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/winequality-full.csv")
wine_data.head()

---

# Lots of Summaries! 

- Use the `describe()` method on a `pandas` data frame 

In [None]:
wine_data.describe()

---

# Graphs

- Many standard graphs can be used to summarize the data as well

.left45[

In [None]:
wine_data.alcohol.plot.density()

]
.right45[

In [None]:
wine_data.plot.scatter(x = "alcohol", y = "residual sugar")

]

---

# 3. Statistical Model

- A mathematical representation of some phenomenon on which you've observed data
- Form of the model can vary greatly!

--

## Simple Linear Regression Model

$$\mbox{response = intercept + slope*predictor + Error}$$
$$Y_i = \beta_0+\beta_1x_i+E_i$$

--

- Assumptions about the data generating process are often made in order to do inference (they are not required to fit the model)
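
A minimal sketch of fitting this model, assuming the intercept and slope are estimated by least squares (the approach `LinearRegression` in scikit-learn uses). The toy data and the helper `predict` are hypothetical and only illustrate the formulas.

```python
# Hypothetical toy data: predictor x and response y
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares estimates of the slope (beta_1) and intercept (beta_0)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

def predict(x_new):
    """Predicted response at a new predictor value."""
    return intercept + slope * x_new

# Residuals are the observed analogues of the error terms E_i
residuals = [yi - predict(xi) for xi, yi in zip(x, y)]

print(round(intercept, 3), round(slope, 3), round(predict(6), 3))
```

The estimates themselves need no distributional assumptions; those come in only when making inference on the parameters.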

---

# Simple Linear Regression Model

- This model can be used for inference or prediction  

In [None]:
from sklearn import linear_model
# Create a regression object, then fit residual sugar (response) on alcohol (predictor)
reg = linear_model.LinearRegression()
reg.fit(X = wine_data['alcohol'].values.reshape(-1,1), y = wine_data['residual sugar'].values)
# Print the estimated intercept and slope
print(round(reg.intercept_, 3), round(reg.coef_[0], 3))

---

# Simple Linear Regression Model

- This model can be used for inference or prediction  

In [None]:
import seaborn as sns
sns.regplot(x = wine_data["alcohol"], y = wine_data["residual sugar"], scatter_kws={'s':2})

---

# Regression Tree

- This model can be used for prediction

In [None]:
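from sklearn.tree import DecisionTreeRegressor
# Regression tree with at most 3 levels of splits
reg_tree = DecisionTreeRegressor(max_depth=3)
# Fit residual sugar (response) on alcohol (predictor)
reg_tree.fit(X = wine_data['alcohol'].values.reshape(-1,1), y = wine_data['residual sugar'].values)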
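
To give a feel for what a regression tree does, here is a sketch of choosing a single split: try each candidate threshold on the predictor and keep the one with the smallest total squared error around the two group means. The helper `best_split` and the toy data are hypothetical; this shows the idea only and is not scikit-learn's implementation.

```python
def best_split(x, y):
    """Return (threshold, sse) for the single split minimizing squared error."""
    def sse(values):
        # Sum of squared deviations from the group mean (the group's prediction)
        if not values:
            return 0.0
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values)

    xs = sorted(set(x))
    best = None
    # Candidate thresholds: midpoints between consecutive distinct x values
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (t, total)
    return best

# Hypothetical toy data with a jump in the response between x = 3 and x = 4
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
threshold, error = best_split(x, y)
print(threshold, round(error, 3))
```

A deeper tree repeats the same search within each resulting group.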

---

# Regression Tree

- This model can be used for prediction

In [None]:
from sklearn.tree import plot_tree
plot_tree(reg_tree)

---

# 3. Statistical Model

- A mathematical representation of some phenomenon on which you've observed data

- *Predict* a **numeric response** or *Classify* an observation into a **category**
    + Depends on whether your response is numeric or categorical!

--

- Form of the model can vary greatly (consider binary response)
$$P(\mbox{success}|\mbox{predictor}) = \frac{e^{\mbox{intercept+slope*predictor}}}{1+e^{\mbox{intercept+slope*predictor}}}$$
$$P(\mbox{success}|\mbox{predictor}) = \frac{e^{\beta_0+\beta_1x}}{1+e^{\beta_0+\beta_1x}}$$
--

We'll investigate a number of different models later in the course!
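
The binary-response formula above (the logistic regression form) can be evaluated directly. In this sketch the coefficient values `b0` and `b1` and the helper `p_success` are hypothetical, and labelling an observation a success when the probability exceeds 0.5 is one common rule, assumed here for illustration.

```python
import math

def p_success(x, intercept, slope):
    """P(success | predictor = x) under the binary-response model."""
    eta = intercept + slope * x
    return math.exp(eta) / (1 + math.exp(eta))

# Hypothetical coefficient values, chosen only for illustration
b0, b1 = -4.0, 1.0

for x in [2, 4, 6]:
    p = p_success(x, b0, b1)
    # Classify as a success when the estimated probability is larger than 0.5
    label = "success" if p > 0.5 else "failure"
    print(x, round(p, 3), label)
```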

<!--- Classify result as a 'success' for values of the predictor where this probability is larger than 0.5 (otherwise classify as a 'failure')-->

---

# Recap

Four major goals with data:
1. Description
2. Inference
3. Prediction/Classification
4. Pattern Finding

- Descriptive Statistics try to summarize the distribution of the variable

- Supervised Learning methods try to relate predictors to a response variable through a model
    - Some models are used for both inference and prediction/classification
    - Some are used just for prediction/classification
