In [None]:
%%R
options(htmltools.dir.version = FALSE)
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  dev = "svg",
  fig.align = "center",
  #fig.width = 11,
  #fig.height = 5
  cache = FALSE
)

# define vars
om = par("mar")
lowtop = c(om[1],om[2],0.1,om[4])
library(tidyverse)
library(knitr)
library(reticulate)
use_python("C:\\Users\\jbpost2\\AppData\\Local\\Programs\\Python\\Python310\\python.exe")
#use_python("C:\\python\\python.exe")
options(dplyr.print_min = 5)
options(reticulate.repl.quiet = TRUE)

layout: false
class: title-slide-section-red, middle

# Prediction and Training/Test Set Ideas
Justin Post

---
layout: true

<div class="my-footer"><img src="img/logo.png" style="height: 60px;"/></div> 


---

# Common Uses for Data

Four major goals with data:
1. Description (EDA)
2. Inference
3. Prediction/Classification
    - Supervised Learning methods try to relate predictors to a response variable through a model
        - Some models used for inference and prediction/classification
        - Some used just for prediction/classification
4. Pattern Finding

---

# Statistical Learning

**Statistical learning** - Inference, prediction/classification, and pattern finding

- Supervised learning - a variable (or variables) represents an **output** or **response** of interest

--

    + May model response and
        - Make **inference** on the model parameters  
        - **predict** a value or **classify** an observation

Goal: Understand what it means to be a good predictive model

---

# Simple Linear Regression Model

Basic model for relating a numeric predictor to a numeric response

$$\mbox{response = intercept + slope*predictor + Error}$$
$$Y_i = \beta_0+\beta_1x_i+E_i$$

---

# Simple Linear Regression Model

Basic model for relating a numeric predictor to a numeric response

$$\mbox{response = intercept + slope*predictor + Error}$$
$$Y_i = \beta_0+\beta_1x_i+E_i$$

- Obtain data $(x,y)$ pairs and **fit** (or train) the model
- Assumptions are often made about the data generating process in order to make inference (not required for prediction)

In [None]:
import numpy as np
import pandas as pd
wine_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/winequality-full.csv")
wine_data[["alcohol", "residual sugar", "quality", "type"]].head()

---

# Simple Linear Regression Model

- This model can be used for inference or prediction!
  + Use `alcohol` content (predictor) to predict `residual sugar` content (response)

---

# Simple Linear Regression Model

- This model can be used for inference or prediction!
  + Use `alcohol` content (predictor) to predict `residual sugar` content (response)

In [None]:
import seaborn as sns
sns.regplot(x = wine_data["alcohol"], y = wine_data["residual sugar"], scatter_kws={'s':2})

---

# Simple Linear Regression Model

- Can use `linear_model` from `sklearn`

In [None]:
from sklearn import linear_model
reg = linear_model.LinearRegression() #Create a reg object
reg.fit(X = wine_data['alcohol'].values.reshape(-1,1), y = wine_data['residual sugar'].values) 

---

# Simple Linear Regression Model

- Can use `linear_model` from `sklearn`

In [None]:
from sklearn import linear_model
reg = linear_model.LinearRegression() #Create a reg object
reg.fit(X = wine_data['alcohol'].values.reshape(-1,1), y = wine_data['residual sugar'].values) 

- `reg` object has info about the fit!

In [None]:
print(round(reg.intercept_, 3), round(reg.coef_[0], 3))
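As a rough check on what `intercept_` and `coef_` hold, the least squares estimates for simple linear regression can be written out by hand and compared with the `sklearn` fit. This is a minimal sketch on synthetic data (an assumption for illustration, not the wine data); the names `x`, `y`, and `reg_check` are made up here.

```python
import numpy as np
from sklearn import linear_model

# Synthetic (x, y) pairs standing in for a predictor and a response
rng = np.random.default_rng(0)
x = rng.uniform(8, 15, size=200)
y = 20 - 1.2 * x + rng.normal(0, 2, size=200)

# Least squares estimates computed directly from the usual formulas
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Same model fit through scikit-learn
reg_check = linear_model.LinearRegression()
reg_check.fit(X=x.reshape(-1, 1), y=y)

print(round(intercept, 3), round(slope, 3))
print(round(reg_check.intercept_, 3), round(reg_check.coef_[0], 3))
```

The two printed lines should agree up to floating point error.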

---

# Simple Linear Regression Model

- Can use line for prediction with `.predict()` method!

In [None]:
print(round(reg.intercept_, 3), round(reg.coef_[0], 3))
pred1 = reg.predict(np.array([[10], [12], [14]]))
pred1 #each of these represents a 'y-hat' for the given value of x

In [None]:
import seaborn as sns
sns.regplot(x = wine_data["alcohol"], y = wine_data["residual sugar"], scatter_kws={'s':2})
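For a fitted line, `.predict()` should give the same values as plugging each $x$ into intercept + slope*x. A small sketch on synthetic data (assumed for illustration; `reg_demo` and `x_new` are hypothetical names):

```python
import numpy as np
from sklearn import linear_model

# Synthetic data standing in for a predictor/response pair
rng = np.random.default_rng(1)
x = rng.uniform(8, 15, size=100)
y = 20 - 1.2 * x + rng.normal(0, 2, size=100)

reg_demo = linear_model.LinearRegression().fit(x.reshape(-1, 1), y)

# Predict at three new x values, two ways
x_new = np.array([10, 12, 14])
by_method = reg_demo.predict(x_new.reshape(-1, 1))          # uses the fitted object
by_hand = reg_demo.intercept_ + reg_demo.coef_[0] * x_new   # plugs into the fitted line
print(by_method)
print(by_hand)
```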

---

# Recenter

Supervised Learning methods try to relate predictors to a response variable through a model
- Some used for inference and prediction/classification
- Some used just for prediction/classification

Lots of common models

- Regression models
- Tree based methods
- Naive Bayes
- k Nearest Neighbors

---

# Recenter

Supervised Learning methods try to relate predictors to a response variable through a model
- Some used for inference and prediction/classification
- Some used just for prediction/classification

Lots of common models

- Regression models
- Tree based methods
- Naive Bayes
- k Nearest Neighbors

Given a particular model type, we *fit* (or train) the model, use it to predict (call prediction $\hat{y}$)

Goal: Understand what it means to be a good predictive model. **How do we evaluate the model?**
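In `sklearn`, several of these model types follow the same fit-then-predict pattern. The sketch below assumes synthetic data and arbitrary tuning values (`max_depth=3`, `n_neighbors=5`), chosen only for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data: one numeric predictor, one numeric response
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(150, 1))
y = 3 + 2 * X[:, 0] + rng.normal(0, 1, size=150)

models = {
    "linear regression": LinearRegression(),
    "regression tree": DecisionTreeRegressor(max_depth=3, random_state=0),
    "k nearest neighbors": KNeighborsRegressor(n_neighbors=5),
}

# Each model is fit (trained) and then used to produce y-hat values
X_new = np.array([[2.0], [5.0], [8.0]])
y_hats = {}
for name, model in models.items():
    model.fit(X, y)
    y_hats[name] = model.predict(X_new)
    print(name, np.round(y_hats[name], 2))
```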

---

# Quantifying How Well the Model Predicts

Need a way to quantify how well our prediction ($\hat{y}_i$) is doing (need a **loss** function)

- Want something 'close' to all points
- For a given **numeric** response value, $y_i$, and its prediction, $\hat{y}_i$, we could use
$$y_i - \hat{y}_i, (y_i-\hat{y}_i)^2, |y_i-\hat{y}_i|$$

---

# Quantifying How Well the Model Predicts

Need a way to quantify how well our prediction ($\hat{y}_i$) is doing (need a **loss** function)

- Want something 'close' to all points
- For a given **numeric** response value, $y_i$, and its prediction, $\hat{y}_i$, we could use
$$y_i - \hat{y}_i, (y_i-\hat{y}_i)^2, |y_i-\hat{y}_i|$$
- Incorporate all points via
$$\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i), \frac{1}{n}\sum_{i=1}^{n} (y_i-\hat{y}_i)^2, \frac{1}{n}\sum_{i=1}^{n} |y_i-\hat{y}_i|$$
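A tiny made-up example of the three averages above. It also hints at why the average of raw errors is a poor choice: positive and negative errors cancel.

```python
import numpy as np

# Made-up responses and predictions (for illustration only)
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([4.0, 4.0, 8.0, 8.0])

errors = y - y_hat                   # [-1., 1., -1., 1.]
mean_error = errors.mean()           # 0.0: the errors cancel out
mean_sq_error = (errors**2).mean()   # 1.0
mean_abs_error = np.abs(errors).mean()  # 1.0
print(mean_error, mean_sq_error, mean_abs_error)
```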

---

# Loss Function

- For a numeric response, we commonly use squared error loss to evaluate a prediction
$$L(y_i,\hat{y}_i) = (y_i-\hat{y}_i)^2$$

- Use Root Mean Square Error as a metric across all observations
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} L(y_i, \hat{y}_i)} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}$$
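The RMSE formula can be computed directly with `numpy` or via `mean_squared_error` from `sklearn.metrics`. A small sketch with made-up values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up responses and predictions (for illustration only)
y = np.array([2.0, 4.0, 6.0, 8.0])
y_hat = np.array([3.0, 4.0, 4.0, 8.0])

# Squared errors are 1, 0, 4, 0, so the mean is 1.25
rmse_by_hand = np.sqrt(np.mean((y - y_hat) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true=y, y_pred=y_hat))
print(round(rmse_by_hand, 3), round(rmse_sklearn, 3))  # both about 1.118
```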

---

# Commonly Used Loss Functions

For prediction (numeric response)
- Mean Squared Error (MSE) or Root Mean Squared Error (RMSE)
- Mean Absolute Error (MAE), also called Mean Absolute Deviation (MAD)
$$L(y_i,\hat{y}_i) = |y_i-\hat{y}_i|$$
- [Huber Loss](https://en.wikipedia.org/wiki/Huber_loss)
    + Doesn't penalize large mistakes as much as MSE
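A minimal sketch of the Huber loss, following the standard definition (quadratic for small errors, linear for large ones). The function name `huber_loss` and the threshold `delta=1.0` are choices made here for illustration:

```python
import numpy as np

def huber_loss(y, y_hat, delta=1.0):
    """Huber loss for each observation: quadratic near zero, linear beyond delta."""
    resid = np.abs(y - y_hat)
    quadratic = 0.5 * resid**2
    linear = delta * (resid - 0.5 * delta)
    return np.where(resid <= delta, quadratic, linear)

# Made-up values: small, borderline, and very large mistakes
y = np.array([0.0, 0.0, 0.0])
y_hat = np.array([0.5, 1.0, 10.0])

huber_vals = huber_loss(y, y_hat)
squared_vals = (y - y_hat) ** 2
print(huber_vals)    # 0.125, 0.5, 9.5
print(squared_vals)  # 0.25, 1.0, 100.0
```

The large mistake costs 9.5 under Huber loss but 100 under squared error loss.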

---

# Training vs Test Sets

Ideally we want our model to predict well for observations **it has yet to see**

---

# Training vs Test Sets

Ideally we want our model to predict well for observations **it has yet to see**
- Evaluation of predictions over the observations used to *fit or train the model* is called the **training (set) error**

$$\mbox{Training RMSE} = \sqrt{\frac{1}{\mbox{# of obs used to fit model}}\sum_{\mbox{obs used to fit model}}(y-\hat{y})^2}$$

---

# Training vs Test Sets

Ideally we want our model to predict well for observations **it has yet to see**
- Evaluation of predictions over the observations used to *fit or train the model* is called the **training (set) error**

$$\mbox{Training RMSE} = \sqrt{\frac{1}{\mbox{# of obs used to fit model}}\sum_{\mbox{obs used to fit model}}(y-\hat{y})^2}$$

- If we only consider this, we'll have no idea how the model will fare on data it hasn't seen!
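One way to see the problem: for least squares fits, making the model more flexible never increases the training RMSE, even when the extra flexibility is only chasing noise. A sketch on synthetic data where the truth is a straight line (the data, the degree 8 polynomial, and the helper `train_rmse` are all assumptions for illustration):

```python
import numpy as np

# Synthetic data: the true relationship is a line plus noise
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=30)
y = 1 + 2 * x + rng.normal(0, 0.5, size=30)

def train_rmse(degree):
    """Fit a polynomial of the given degree and return RMSE on the same data."""
    coefs = np.polyfit(x, y, deg=degree)
    y_hat = np.polyval(coefs, x)
    return np.sqrt(np.mean((y - y_hat) ** 2))

rmse_line = train_rmse(1)      # matches the true model form
rmse_flexible = train_rmse(8)  # far more flexible than needed
print(round(rmse_line, 3), round(rmse_flexible, 3))
```

Judged by training RMSE alone, the degree 8 fit always looks at least as good as the line.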



---

# Training vs Test Sets

One method is to split the data into a **training set** and **test set**
- On the training set we can fit (or train) our models
- We can then predict for the test set observations and judge effectiveness with RMSE
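Conceptually, a split just shuffles the row indices and sets some aside. A bare-bones sketch with `numpy` (the 80/20 proportion and variable names are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10                          # pretend we have 10 observations
indices = rng.permutation(n)    # shuffled row indices 0..9

n_test = int(0.2 * n)           # hold out 20% of the rows
test_idx = indices[:n_test]     # rows only used to evaluate predictions
train_idx = indices[n_test:]    # rows used to fit (train) the model
print(train_idx, test_idx)
```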

In [None]:
%%R
knitr::include_graphics("img/trainingtest.png")

---

# Example of Fitting and Evaluating Models

Consider a data set on motorcycle sale prices

In [None]:
import pandas as pd
bike_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/bikeDetails.csv")
print(bike_data.columns)
bike_data.head()

---

# Example of Fitting and Evaluating Models

- Response variable of `log_selling_price = ln(selling_price)`

- Consider two simple linear regression models
$$\mbox{Model 1: log_selling_price = intercept + slope*year + Error}$$
$$\mbox{Model 2: log_selling_price = intercept + slope*log_km_driven + Error}$$

---

layout: false

# Example of Fitting and Evaluating Models

$$\mbox{Model 1: log_selling_price = intercept + slope*year + Error}$$
$$\mbox{Model 2: log_selling_price = intercept + slope*log_km_driven + Error}$$

.left45[

In [None]:
bike_data['log_selling_price'] = np.log(bike_data['selling_price'])
bike_data['log_km_driven'] = np.log(bike_data['km_driven'])
sns.regplot(x = bike_data["year"], y = bike_data["log_selling_price"])

]
.right45[

In [None]:
sns.regplot(x = bike_data["log_km_driven"], y = bike_data["log_selling_price"])

]



---

# Example of Fitting and Evaluating Models

$$\mbox{Model 1: log_selling_price = intercept + slope*year + Error}$$
$$\mbox{Model 2: log_selling_price = intercept + slope*log_km_driven + Error}$$

- Fit SLR models using **scikit-learn**

In [None]:
import numpy as np
#create response and new predictor
bike_data['log_selling_price'] = np.log(bike_data['selling_price'])
bike_data['log_km_driven'] = np.log(bike_data['km_driven'])

---

# Example of Fitting and Evaluating Models

$$\mbox{Model 1: log_selling_price = intercept + slope*year + Error}$$
$$\mbox{Model 2: log_selling_price = intercept + slope*log_km_driven + Error}$$

- Fit SLR models using **scikit-learn**

In [None]:
print(bike_data['log_selling_price'].values) #can pass the response as an array (1D is ok)
print(bike_data['year'].values.reshape(-1,1)) #pass each predictor as an array (must be 2D)
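The 1D vs 2D distinction is easiest to see through `.shape`. A sketch on a tiny made-up data frame (`toy` is hypothetical); selecting with double brackets is another way to get a 2D predictor:

```python
import numpy as np
import pandas as pd

# Tiny made-up data frame (for illustration only)
toy = pd.DataFrame({"year": [2015, 2017, 2019],
                    "log_selling_price": [10.1, 10.6, 11.2]})

y_1d = toy["log_selling_price"].values    # response as a 1D array
X_2d = toy["year"].values.reshape(-1, 1)  # predictor as a 2D array (one column)
X_df = toy[["year"]]                      # double brackets keep a 2D data frame
print(y_1d.shape, X_2d.shape, X_df.shape)  # (3,) (3, 1) (3, 1)
```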

---

# Example of Fitting and Evaluating Models

$$\mbox{Model 1: log_selling_price = intercept + slope*year + Error}$$
$$\mbox{Model 2: log_selling_price = intercept + slope*log_km_driven + Error}$$

In [None]:
reg1 = linear_model.LinearRegression() #Create a reg object
reg1.fit(bike_data['year'].values.reshape(-1,1), bike_data['log_selling_price']) 

In [None]:
print(reg1.intercept_, reg1.coef_)

---

# Example of Fitting and Evaluating Models

$$\mbox{Model 1: log_selling_price = intercept + slope*year + Error}$$
$$\mbox{Model 2: log_selling_price = intercept + slope*log_km_driven + Error}$$

In [None]:
reg2 = linear_model.LinearRegression() #Create a reg object
reg2.fit(bike_data['log_km_driven'].values.reshape(-1,1), bike_data['log_selling_price'])

In [None]:
print(reg2.intercept_, reg2.coef_)

---

# Example of Fitting and Evaluating Models

- Now we have the fitted models.  Want to use them to predict the response
$$\mbox{Model 1: } \widehat{\mbox{log_selling_price}} = -201.06 + 0.105*\mbox{year}$$
$$\mbox{Model 2: } \widehat{\mbox{log_selling_price}} = 14.64 -0.391*\mbox{log_km_driven}$$

- Use the `.predict()` method

In [None]:
pred1 = reg1.predict(bike_data['year'].values.reshape(-1,1))
pred2 = reg2.predict(bike_data['log_km_driven'].values.reshape(-1,1))
pd.DataFrame(zip(pred1, pred2, bike_data['log_selling_price']), columns = ["Model1", "Model2", "Actual"])

---

# Example of Fitting and Evaluating Models

- Now we have the fitted models.  Want to use them to predict the response
$$\mbox{Model 1: } \widehat{\mbox{log_selling_price}} = -201.06 + 0.105*\mbox{year}$$
$$\mbox{Model 2: } \widehat{\mbox{log_selling_price}} = 14.64 -0.391*\mbox{log_km_driven}$$

- Find **training** RMSE

In [None]:
from sklearn.metrics import mean_squared_error
RMSE1 = np.sqrt(mean_squared_error(y_true = bike_data['log_selling_price'], y_pred = pred1))
RMSE2 = np.sqrt(mean_squared_error(y_true = bike_data['log_selling_price'], y_pred = pred2))
print(round(RMSE1, 3), round(RMSE2, 3))

---

# Example of Fitting and Evaluating Models

- Now we have the fitted models.  Want to use them to predict the response
$$\mbox{Model 1: } \widehat{\mbox{log_selling_price}} = -201.06 + 0.105*\mbox{year}$$
$$\mbox{Model 2: } \widehat{\mbox{log_selling_price}} = 14.64 -0.391*\mbox{log_km_driven}$$

- Find **training** RMSE

In [None]:
from sklearn.metrics import mean_squared_error
RMSE1 = np.sqrt(mean_squared_error(y_true = bike_data['log_selling_price'], y_pred = pred1))
RMSE2 = np.sqrt(mean_squared_error(y_true = bike_data['log_selling_price'], y_pred = pred2))
print(round(RMSE1, 3), round(RMSE2, 3))

- Training RMSE is too **optimistic** an estimate of how the model would perform on new data!  Redo with a train/test split!

---

# Train/Test Split

- `sklearn` has a function to make splitting data easy
- Commonly use 80/20 or 70/30 split

---

# Train/Test Split

- `sklearn` has a function to make splitting data easy
- Commonly use 80/20 or 70/30 split

In [None]:
from sklearn.model_selection import train_test_split
#Function returns four objects, in this order:
#Train/test sets for the predictors (X)
#Train/test sets for the response (y)
X_train, X_test, y_train, y_test = train_test_split(
  bike_data[["year", "log_km_driven"]],
  bike_data["log_selling_price"], 
  test_size=0.20, 
  random_state=42)
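To see what comes back from `train_test_split()`, it can help to check shapes on a small stand-in data set. The sketch below uses synthetic data with the same column names (an assumption for illustration, not the motorcycle data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 100 rows
rng = np.random.default_rng(5)
toy = pd.DataFrame({
    "year": rng.integers(2000, 2021, size=100),
    "log_km_driven": rng.normal(10, 1, size=100),
})
toy["log_selling_price"] = rng.normal(10.5, 0.7, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["year", "log_km_driven"]],
    toy["log_selling_price"],
    test_size=0.20,
    random_state=42)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)  # (80, 2) (20, 2) (80,) (20,)

# Using the same random_state reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    toy[["year", "log_km_driven"]],
    toy["log_selling_price"],
    test_size=0.20,
    random_state=42)
print(X_tr.index.equals(X_tr2.index))  # True
```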

---

# Fit or Train Model

- We then fit the model on the training set

In [None]:
reg1 = linear_model.LinearRegression() 
reg2 = linear_model.LinearRegression() 
reg1.fit(X_train["year"].values.reshape(-1,1), y_train.values)
reg2.fit(X_train["log_km_driven"].values.reshape(-1,1), y_train.values)

---

# Fit or Train Model

- We then fit the model on the training set

In [None]:
reg1 = linear_model.LinearRegression() 
reg2 = linear_model.LinearRegression() 
reg1.fit(X_train["year"].values.reshape(-1,1), y_train.values)
reg2.fit(X_train["log_km_driven"].values.reshape(-1,1), y_train.values)

- Can look at training RMSE if we want

In [None]:
train_RMSE1 = np.sqrt(mean_squared_error(y_true = y_train.values, 
                           y_pred = reg1.predict(X_train['year'].values.reshape(-1,1))))
train_RMSE2 = np.sqrt(mean_squared_error(y_true = y_train.values, 
                           y_pred = reg2.predict(X_train['log_km_driven'].values.reshape(-1,1))))
print(round(train_RMSE1, 3), round(train_RMSE2, 3))

---

# Test Error

- Now we look at predictions on the test set
    + Test data **not** used when training model

In [None]:
test_RMSE1 = np.sqrt(mean_squared_error(y_true = y_test.values, 
                          y_pred = reg1.predict(X_test['year'].values.reshape(-1,1))))

test_RMSE2 = np.sqrt(mean_squared_error(y_true = y_test.values, 
                          y_pred = reg2.predict(X_test['log_km_driven'].values.reshape(-1,1))))

print(round(test_RMSE1, 3), round(test_RMSE2, 3))

---

# Test Error

- Now we look at predictions on the test set
    + Test data **not** used when training model

In [None]:
test_RMSE1 = np.sqrt(mean_squared_error(y_true = y_test.values, 
                          y_pred = reg1.predict(X_test['year'].values.reshape(-1,1))))

test_RMSE2 = np.sqrt(mean_squared_error(y_true = y_test.values, 
                          y_pred = reg2.predict(X_test['log_km_driven'].values.reshape(-1,1))))

print(round(test_RMSE1, 3), round(test_RMSE2, 3))

- When choosing a model, if the RMSE values were 'close', we'd want to consider the interpretability of the model (and perhaps the assumptions if we wanted to do inference too!)
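The fit-on-train, evaluate-on-test steps can be wrapped in a small helper so that candidate models are compared in the same way. This is only a sketch: `holdout_rmse` is a hypothetical helper, and the data are synthetic with made-up predictor names.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def holdout_rmse(predictor, X_train, X_test, y_train, y_test):
    """Fit an SLR with one predictor on the training set; return its test RMSE."""
    reg = linear_model.LinearRegression()
    reg.fit(X_train[[predictor]], y_train)
    preds = reg.predict(X_test[[predictor]])
    return np.sqrt(mean_squared_error(y_true=y_test, y_pred=preds))

# Synthetic data: one strongly related predictor, one weakly related
rng = np.random.default_rng(6)
n = 300
toy = pd.DataFrame({"x_strong": rng.normal(size=n), "x_weak": rng.normal(size=n)})
toy["y"] = 2 * toy["x_strong"] + 0.1 * toy["x_weak"] + rng.normal(0, 0.5, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["x_strong", "x_weak"]], toy["y"], test_size=0.20, random_state=42)

# Test RMSE for each candidate model; smaller is better
results = {name: holdout_rmse(name, X_tr, X_te, y_tr, y_te)
           for name in ["x_strong", "x_weak"]}
best = min(results, key=results.get)
print(results, best)
```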

---

# Recap

- Model is fit using some criterion

- Must determine a method to judge the model's effectiveness
    + Use a **Loss** function (usually averaged or summed over all observations in a given set)
    
- To obtain a better understanding of the predictive power of a model, we split our data into a training and test set
    + Evaluate the model using the loss function on the **test set**
    
