In [None]:
%%R
options(htmltools.dir.version = FALSE)
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  dev = "svg",
  fig.align = "center",
  #fig.width = 11,
  #fig.height = 5
  cache = FALSE
)

# define vars
om = par("mar")
lowtop = c(om[1],om[2],0.1,om[4])
library(tidyverse)
library(knitr)
library(reticulate)
use_python("C:\\Users\\jbpost2\\AppData\\Local\\Programs\\Python\\Python310\\python.exe")
#use_python("C:\\python\\python.exe")
options(dplyr.print_min = 5)
options(reticulate.repl.quiet = TRUE)

layout: false
class: title-slide-section-red, middle

# Cross-Validation
Justin Post 

---
layout: true

<div class="my-footer"><img src="img/logo.png" style="height: 60px;"/></div> 

---

# Recap

- Judge the model's effectiveness using a **Loss** function

- Often split data into a training and test set
    + Perhaps 70/30 or 80/20
    
- Next: Cross-validation as an alternative to just train/test (and why we might do both!)

---

# Issues with Training vs Test Sets

Why may we not want to just do a basic training/test set?

- If we don't have much data, we aren't using it all when fitting the models

- Data is randomly split into training/test

    + May just get a weird split by chance
    + Makes the test-set loss a noisy measurement, especially when the test set has few data points

---

# Issues with Training vs Test Sets

Why may we not want to just do a basic training/test set?

- If we don't have much data, we aren't using it all when fitting the models

- Data is randomly split into training/test

    + May just get a weird split by chance
    + Makes the test-set loss a noisy measurement, especially when the test set has few data points

- Instead, we could consider splitting the data multiple ways and averaging over the results!
    + Exactly the idea for cross validation
    + A less variable measurement that uses all the data when fitting
    + Higher computational cost!
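
To see how much a single split can move the result, here is a minimal sketch that repeats an 80/20 split with different random seeds and records the test MSE each time. The data are synthetic stand-ins (an assumption, not the data set used in these slides), and the sample size and number of repeats are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the data set from these slides)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 1))
y = 2.0 + 0.5 * X[:, 0] + rng.normal(scale=1.0, size=60)

# Repeat an 80/20 split with different seeds and record the test MSE each time
test_mses = []
for seed in range(20):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    model = LinearRegression().fit(X_train, y_train)
    test_mses.append(mean_squared_error(y_test, model.predict(X_test)))

test_mses = np.array(test_mses)
print("min test MSE:", round(test_mses.min(), 3))
print("max test MSE:", round(test_mses.max(), 3))
```

The gap between the smallest and largest test MSE comes only from which rows landed in the test set, which is the variability that averaging over several splits is meant to reduce.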

---

# Cross-validation

Common method for assessing a predictive model

In [None]:
%%R
knitr::include_graphics("img/cv.png")

---

# Cross-Validation Idea

$k$ fold Cross-Validation (CV)

- Split data into k folds
- Train model on first k-1 folds, test on kth, find sum of loss function
- Train model on first k-2 folds and kth fold, test on (k-1)st fold, find sum of loss function
- ...

---

# Cross-Validation Idea

$k$ fold Cross-Validation (CV)

- Split data into k folds
- Train model on first k-1 folds, test on kth, find sum of loss function 
- Train model on first k-2 folds and kth fold, test on (k-1)st fold, find sum of loss function
- ...

Find CV error 
- Combine the test errors from all $k$ folds (Ex: for RMSE, sum the squared errors over all folds, average, then take the square root)

Key: no prediction used in the CV error was made on data used to train that model!
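
The steps above can be sketched by hand with `KFold` from `sklearn.model_selection`. This is an illustrative sketch under stated assumptions: the data are synthetic, the folds are shuffled with a fixed seed, and the pooled squared errors are averaged over all observations before taking the square root (one common convention for a CV RMSE).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption: not the data set from these slides)
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 1))
y = 1.0 + 0.3 * X[:, 0] + rng.normal(scale=0.5, size=100)

k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=1)

# Each observation is predicted exactly once, by a model that never saw it
squared_errors = np.empty(len(y))
for train_idx, test_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    squared_errors[test_idx] = (y[test_idx] - preds) ** 2

# Pool the squared errors across folds, average, then take the square root
cv_rmse = np.sqrt(squared_errors.mean())
print("CV RMSE:", round(cv_rmse, 4))
```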

---

# CV on MLR Models

- Let's consider three linear regression models
    + Our two models from last video:
    
$$\mbox{Model 1: log_selling_price = intercept + slope*year + Error}$$

$$\mbox{Model 2: log_selling_price = intercept + slope*log_km_driven + Error}$$
- And a multiple linear regression model with both variables as predictors
    
$$\mbox{Model 3: log_selling_price = intercept + slope1*year + slope2*log_km_driven+ Error}$$
---

# CV on MLR Models

- Can use CV error to choose between these models
- In `scikit-learn` use the `cross_validate()` function from the `model_selection` submodule
    + Uses a `scoring` [input](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter) to determine the loss function

---

# CV on MLR Models

- Can use CV error to choose between these models
- In `scikit-learn` use the `cross_validate()` function from the `model_selection` submodule
    + Uses a `scoring` [input](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter) to determine the loss function

In [None]:
import pandas as pd
import numpy as np

bike_data = pd.read_csv("https://www4.stat.ncsu.edu/~online/datasets/bikeDetails.csv")
#create response and new predictor
bike_data['log_selling_price'] = np.log(bike_data['selling_price'])
bike_data['log_km_driven'] = np.log(bike_data['km_driven'])

In [None]:
from sklearn.model_selection import cross_validate
from sklearn import linear_model
reg1 = linear_model.LinearRegression() 

---

# CV on MLR Models

- Can use CV error to choose between these models
- In `scikit-learn` use the `cross_validate()` function from the `model_selection` submodule
    + Uses a `scoring` [input](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter) to determine the loss function

In [None]:
from sklearn.model_selection import cross_validate
from sklearn import linear_model
reg1 = linear_model.LinearRegression() 
cv1 = cross_validate(reg1, 
    bike_data["year"].values.reshape(-1,1), 
    bike_data["log_selling_price"].values, 
    cv=5,
    scoring=('r2', 'neg_mean_squared_error'),     
    return_train_score=True)
print(sorted(cv1.keys()))
# scores are negative MSE (closer to 0 is better); sum over the 5 test folds
round(sum(cv1["test_neg_mean_squared_error"]),4)

---

# CV on MLR Models

In [None]:
reg2 = linear_model.LinearRegression() 
cv2 = cross_validate(reg2,   
    bike_data["log_km_driven"].values.reshape(-1,1), 
    bike_data["log_selling_price"].values, 
    cv=5, scoring=('r2', 'neg_mean_squared_error'))

reg3 = linear_model.LinearRegression() 
cv3 = cross_validate(reg3, bike_data[["year", "log_km_driven"]].values, 
    bike_data["log_selling_price"].values, 
    cv = 5, scoring=('r2','neg_mean_squared_error'))

---

# CV on MLR Models

In [None]:
reg2 = linear_model.LinearRegression() 
cv2 = cross_validate(reg2,   
    bike_data["log_km_driven"].values.reshape(-1,1), 
    bike_data["log_selling_price"].values, 
    cv=5, scoring=('r2', 'neg_mean_squared_error'))

reg3 = linear_model.LinearRegression() 
cv3 = cross_validate(reg3, bike_data[["year", "log_km_driven"]].values, 
    bike_data["log_selling_price"].values, 
    cv = 5, scoring=('r2','neg_mean_squared_error'))

# scores are negative MSE, so the value closest to 0 marks the best model
print(round(sum(cv1["test_neg_mean_squared_error"]),4),
      round(sum(cv2["test_neg_mean_squared_error"]),4),
      round(sum(cv3["test_neg_mean_squared_error"]),4))

- The 'best' model is the one with the smallest CV error (negative MSE closest to 0); we would now refit it on the full data set!
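
A minimal sketch of that last step: compare the candidates by mean CV MSE, then refit the winner on all rows. To keep it self-contained, it uses a synthetic data frame named `demo` with the same column names as the slides (an assumption; these are not the real bike data), and the `candidates` dictionary is a hypothetical helper for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in with the slides' column names (assumption: not the bike data)
rng = np.random.default_rng(2)
n = 200
demo = pd.DataFrame({
    "year": rng.integers(2000, 2021, size=n),
    "log_km_driven": rng.normal(10, 1, size=n),
})
demo["log_selling_price"] = (
    0.1 * (demo["year"] - 2000) - 0.3 * demo["log_km_driven"]
    + rng.normal(scale=0.4, size=n)
)

# Hypothetical lookup of predictor sets for the three models
candidates = {
    "Model 1": ["year"],
    "Model 2": ["log_km_driven"],
    "Model 3": ["year", "log_km_driven"],
}

y = demo["log_selling_price"].values
cv_mse = {}
for name, cols in candidates.items():
    res = cross_validate(LinearRegression(), demo[cols].values, y,
                         cv=5, scoring="neg_mean_squared_error")
    # scikit-learn reports the negative MSE, so flip the sign before averaging
    cv_mse[name] = -res["test_score"].mean()

# Pick the smallest CV MSE and refit that model on every row
best_name = min(cv_mse, key=cv_mse.get)
final_model = LinearRegression().fit(demo[candidates[best_name]].values, y)
print(cv_mse)
print("refit:", best_name, final_model.coef_)
```

With a single `scoring` string the result key is `test_score`; with a tuple of scorers, as in the slides, the keys are `test_r2` and `test_neg_mean_squared_error`.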

---

# Recap

Cross-validation gives a way to use more of the data while still seeing how the model does on test data

- 5-fold or 10-fold CV is common
- Once a best model is chosen, the model is refit on the entire data set
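
One practical note, sketched under stated assumptions: when `cv` is an integer and the model is a regressor, `cross_validate()` uses consecutive, unshuffled folds. If the rows are stored in a meaningful order, passing a `KFold` object with `shuffle=True` may be preferable. The example below runs 10-fold CV this way on synthetic data (an assumption, not the data from these slides).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data (assumption: not the data set from these slides)
rng = np.random.default_rng(3)
X = rng.normal(size=(120, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.5, size=120)

# 10 shuffled folds with a fixed seed so the split is reproducible
folds = KFold(n_splits=10, shuffle=True, random_state=3)
res = cross_validate(LinearRegression(), X, y, cv=folds,
                     scoring="neg_mean_squared_error")

# Flip the sign to get one ordinary MSE per fold
fold_mse = -res["test_score"]
print(len(fold_mse), "folds, mean MSE:", round(fold_mse.mean(), 4))
```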
