# Lab: K-Fold CV 
## CMSE 381 - Fall 2022
## Oct 5, 2022. Lecture 11



In [None]:
# Everyone's favorite standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# 1. Roll your own $k$-fold

Ok, let's try to get a handle on what $k$-fold CV is doing with our data. To do that, we're going to build our own $k$-fold splitter before we use the provided tools in `scikit-learn`. Of course, this is not going to be optimized at all; the goal is just to figure out how the innards work. 

&#9989; **<font color=red>Do this:</font>** Below is the skeleton of code that will return the $k$-fold train/test splits. Update the code where noted to make it work. 

How do you check that your code is doing what you want? 
- Make sure you end up with $k$ splits 
- Make sure that each of the testing splits has $n/k$ data points
- Make sure that the rest of the data points end up in the training set. 
- A good check is to see that you have all $n$ data points between the training and testing set every time.
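
One way to automate the checks above is a small helper. This is only a sketch: `check_splits` is a hypothetical name, and it assumes your function returns a list of dictionaries with `'train'` and `'test'` keys, as in the skeleton below.

```python
def check_splits(all_splits, n, k):
    # Hypothetical helper (not part of the lab skeleton).
    # Assumes each split is a dict with 'train' and 'test' index collections.
    assert len(all_splits) == k, "expected exactly k splits"

    seen_in_test = []
    for split in all_splits:
        train = [int(i) for i in split['train']]
        test = [int(i) for i in split['test']]

        # Each test fold has n//k points, or one more when k doesn't divide n
        assert len(test) in (n // k, n // k + 1), "test fold has the wrong size"

        # No index may appear in both sets
        assert set(train).isdisjoint(test), "train and test overlap"

        # Together, train and test must cover all n indices exactly once
        assert sorted(train + test) == list(range(n)), "an index is missing or repeated"

        seen_in_test.extend(test)

    # Every data point should land in a test fold exactly once
    assert sorted(seen_in_test) == list(range(n)), "test folds don't partition the data"
    return True
```

Once your `mykfold` is filled in, `check_splits(mykfold(30, 5), 30, 5)` should return `True`; a failed check raises an `AssertionError` naming the problem.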

In [None]:
def mykfold(n,k):
    # Input: integers n and k.
    #        This version only handles the case where
    #        n is evenly divisible by k.
    # Output: a list of the train/test splits to be used.
    
    # This assertion raises an error if k does not divide n evenly,
    # so the code will kick you out.
    assert (n % k == 0), "k doesn't divide n, this code can't handle it"
    
    # Make an array of the indices:
    all_my_indices = np.arange(n)
    
    
    # First, shuffle your array to make sure we're working with randomized order.
    # ----your code here to shuffle----# 
    
    
    # Write an equation that will figure out the length of each fold below
    length_of_fold = np.nan #<----- fix this
    
    
    # Now we're going to keep a list of all your splits. Modify the code below so that 
    # you can keep track of the training and testing splits.
    AllSplits = []
    for i in range(k):
        
        test_set = [] #<------ fix this
        training_set = [] #<------ fix this, too
        AllSplits.append({'train': training_set, 'test':test_set})
    
    return AllSplits
 
n = 30
k = 5
mykfold(n,k)
    

Now we are going to fix the code above to allow for $n$ not divisible by $k$. We want to take the leftover data points from dividing the folds evenly (there are `n % k` of them) and add one to each of the first folds. Below is one way to figure out how long each fold should be in this more general case. 

In [None]:
n = 33
k = 5

length_of_each_fold = [n//k for i in range(k)]

for i in range(n % k):
    length_of_each_fold[i]+=1
    
print(length_of_each_fold)
print(np.sum(length_of_each_fold))
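
As a sanity check on the fold lengths above, NumPy's `np.array_split` follows the same convention when the split is uneven: the first `n % k` pieces each get one extra element. The sketch below just compares the two; it is a cross-check, not a replacement for writing `mykfold` yourself.

```python
import numpy as np

n = 33
k = 5

# Fold lengths computed by hand, as in the cell above
length_of_each_fold = [n // k for i in range(k)]
for i in range(n % k):
    length_of_each_fold[i] += 1

# np.array_split allows uneven pieces; the larger pieces come first
pieces = np.array_split(np.arange(n), k)
lengths_from_numpy = [len(piece) for piece in pieces]

print(length_of_each_fold)
print(lengths_from_numpy)
```

Both lines should print `[7, 7, 7, 6, 6]`, since $33 = 5 \cdot 6 + 3$ leaves three extra points for the first three folds.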

&#9989; **<font color=red>Do this:</font>** Copy your `mykfold` function down here. Modify it so that it can accept an $n$ that isn't divisible by $k$. 

In [None]:
# Your code here #

n = 33
k = 5
mykfold(n,k)
    

# 2. Letting scikit-learn do the work for us. 

Ok, now that we understand the innards, we can let `scikit-learn` do this for us. Let's get our toy data set back to mess with this.  



In [None]:
# Set the seed so everyone has the same numbers
np.random.seed(42)

def f(t, m = -3, b = 2):
    return m*t+b

n = 300
X_toy = np.random.uniform(0,5,n)
y_toy = f(X_toy) + np.random.normal(0,2,n)

plt.scatter(X_toy,y_toy)
plt.plot(X_toy,f(X_toy),c = 'red')

In [None]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=3)

# Notice that like the leave one out version, trying to print kf still doesn't 
# give us much that's useful
print(kf)

In [None]:
for train_index, test_index in kf.split(X_toy):
    print("TRAIN:", train_index, "\nTEST:", test_index, '\n')
    X_train, X_test = X_toy[train_index], X_toy[test_index]
    y_train, y_test = y_toy[train_index], y_toy[test_index]

There is a BIG PROBLEM with this code.  We haven't done something!!! Something important!!!

&#9989; **<font color=red>Q:</font>** What didn't we do? This is an easy fix: check out the [documentation for `KFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html), then modify the code below to fix the problem. 



In [None]:
# Fix this code! 
kf = KFold(n_splits=3)

for train_index, test_index in kf.split(X_toy):
    print("TRAIN:", train_index, "\nTEST:", test_index, '\n')
    X_train, X_test = X_toy[train_index], X_toy[test_index]
    y_train, y_test = y_toy[train_index], y_toy[test_index]

Now that we have our train/test split generator set up, let's take a look at the result. Note that this is just going to color by the last split generated in that for loop up above. 

In [None]:
plt.scatter(X_train,y_train, marker = '+', label = "Training")
plt.scatter(X_test,y_test, marker = '*', label = "Testing")
plt.legend()

&#9989; **<font color=red>Q:</font>** Below is my code from last class to train our linear regression model, again just using that last train/test split. Fix this so that it uses every k-fold train/test split ($k=5$) and returns the average of the MSEs. 


In [None]:
# Your code goes here

model = LinearRegression()
model.fit(X_train.reshape(-1,1),y_train)
y_hat = model.predict(X_test.reshape(-1,1))

# scikit-learn's argument order is (y_true, y_pred)
mean_squared_error(y_test,y_hat)

&#9989; **<font color=red>Q:</font>** What happens if you set `n_splits = n`? 

*Your answer here*


![Stop Icon](https://upload.wikimedia.org/wikipedia/commons/thumb/1/1e/Vienna_Convention_road_sign_B2a.svg/180px-Vienna_Convention_road_sign_B2a.svg.png)

Great, you got to here! Hang out for a bit, there's more lecture before we go on to the next portion. 

# 3. Setting this up on a slightly more complicated data set. 

Ok, let's see how this is used for determining parameters. Below, we're going to generate a data set that is clearly non-linear. 

In [None]:
# Set the seed so everyone has the same numbers
np.random.seed(42)

def f(t, m1 = -7,m2 = 5, m3 = -.8, b = 6):
    return m3 * t**3 + m2*t**2 + m1*t+b

n = 300
X_toy = np.random.uniform(0,5,n)
y_toy = f(X_toy) + np.random.normal(0,2,n)

plt.scatter(X_toy,y_toy)

# Doing this so the plot isn't ugly
X_plot = X_toy.copy()
X_plot.sort()
plt.plot(X_plot,f(X_plot),c = 'red')

&#9989; **<font color=red>Do this:</font>** Using $k$-fold cross validation for $k=5$, set up code to approximate the test error for each of the polynomial models below. 
- $y = \beta_0 + \beta_1 X$
- $y = \beta_0 + \beta_1 X + \beta_2 X^2$
- $y = \beta_0 + \beta_1 X+ \beta_2 X^2+ \beta_3 X^3$
- $y = \beta_0 + \beta_1 X+ \beta_2 X^2+ \beta_3 X^3+ \beta_4 X^4$
- $y = \beta_0 + \beta_1 X+ \beta_2 X^2+ \beta_3 X^3+ \beta_4 X^4+ \beta_5 X^5$
- $y = \beta_0 + \beta_1 X+ \beta_2 X^2+ \beta_3 X^3+ \beta_4 X^4+ \beta_5 X^5+ \beta_6 X^6$

Then plot your resulting test errors for each to determine the best choice of polynomial for this data set. 
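
Each model in the list above is still linear in its coefficients, so `LinearRegression` can fit it once the powers of $X$ are supplied as separate columns. Here is a minimal sketch of building those columns; `poly_columns` is a hypothetical helper name, not something from the lab or from `scikit-learn`.

```python
import numpy as np

def poly_columns(x, degree):
    # Hypothetical helper: stack x, x^2, ..., x^degree as columns.
    # The intercept beta_0 is left out because LinearRegression
    # fits it by default.
    x = np.asarray(x, dtype=float)
    return np.column_stack([x**p for p in range(1, degree + 1)])

x_small = np.array([1.0, 2.0, 3.0])
print(poly_columns(x_small, 3))
```

The result has one row per data point and `degree` columns, which is already the 2D shape that `model.fit` expects, so no `reshape(-1,1)` is needed.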

In [None]:
# Your code here 

If you still have some time, try to see if you can figure out the test errors for everything through a degree 10 polynomial. 



-----
### Congratulations, we're done!
Written by Dr. Liz Munch, Michigan State University

<a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">Creative Commons Attribution-NonCommercial 4.0 International License</a>.