### Module 9-1 Learning Notebook: Cross-Validation

<B>How to get maximum benefit out of your data:</B><P>
    
Cross-validation is a method for squeezing the most value out of your dataset. Typically, when we split our dataset into train and test sets, we do so randomly. By chance, the training data may be an especially good (or bad) representation of the true data, and there is no way to know in advance which split is optimal.<P>
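The effect of a lucky or unlucky split can be seen in a small experiment. The sketch below is an illustration only: it uses synthetic data from `make_regression` (an assumption, not the cars dataset) and refits the same decision tree on five different random 80/20 splits. The names `X_demo`, `y_demo`, and `split_scores` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for illustration, not the cars dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=20.0, random_state=0)

split_scores = []
for seed in range(5):
    # Each seed produces a different random 80/20 split of the same data
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    model = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
    split_scores.append(model.score(X_te, y_te))  # R^2 on the held-out 20%

print([round(s, 3) for s in split_scores])
```

The model and the data are the same in every run; only the split changes, yet the scores differ. A single train/test split therefore gives one noisy estimate of performance.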
    
This is where cross-validation helps improve the evaluation of our model. Technically, cross-validation is a resampling method that uses different portions of the data to test and train a model on different iterations. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform.

<img src="images/crossv.png" alt="Cross Validation" style="width: 600px;"/>
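To see how the data is partitioned, the following minimal sketch prints the train and test indices that scikit-learn's `KFold` produces for ten toy samples and k=5. The array `samples` and the list `folds` are hypothetical names used only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(10)  # ten toy samples, labeled 0..9
kf = KFold(n_splits=5)   # no shuffling: test folds are consecutive blocks

folds = []
for fold, (train_idx, test_idx) in enumerate(kf.split(samples), start=1):
    folds.append((train_idx, test_idx))
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
```

Every sample lands in the test set exactly once and in the training set for the other four iterations, so all of the data is used for both training and evaluation.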

In this activity, we will take a familiar model and cross-validate it k times to get a robust estimate of how well the model really performs.
- Load the data and isolate features
- Create a decision tree model and cross-validate with k-folds to get k estimates of model performance
- Discuss where cross-validation fits into the Machine Learning process

In [1]:
from sklearn.model_selection import cross_val_score
from sklearn import tree
import boto3
import pandas as pd
import numpy as np

### 1. Load the data, isolate features

In [2]:
# Get the data from the S3 bucket: machinelearning-read-only
# Create session and S3 client
sess = boto3.session.Session()
s3 = sess.client('s3')
# Set variables 
source_bucket = 'machinelearning-read-only'
source_key = 'data/cars.csv'
# Load the dataframe
response = s3.get_object(Bucket=source_bucket, Key=source_key)
# The 'Body' is of type streaming body. We can put this right into a dataframe
df = pd.read_csv(response.get("Body")) 
print('The size of the complete dataset:',df.shape)
df.head(3)

The size of the complete dataset: (7906, 8)


| | name | year | selling_price | km_driven | km/liter | engine | max_power | seats |
|---|---|---|---|---|---|---|---|---|
| 0 | Maruti Swift Dzire VDI | 2014 | 450000 | 145500 | 23.4 | 1248.0 | 74.0 | 5.0 |
| 1 | Skoda Rapid 1.5 TDI Ambition | 2014 | 370000 | 120000 | 21.14 | 1498.0 | 103.52 | 5.0 |
| 2 | Honda City 2017-2020 EXi | 2006 | 158000 | 140000 | 17.7 | 1497.0 | 78.0 | 5.0 |


In [3]:
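# Create the feature dataframe and the target series
# Note: 'selling_price' appears in X and is also the target y, so the target
# leaks into the features; this explains the near-perfect scores below
X = df[['year','selling_price','engine']]
y = df['selling_price'] 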

### 2. Create a decision tree model and cross-validate
Here, we are not training a model to keep; we are estimating how well it would perform if we did train it. <P>
In effect, `cross_val_score` trains and evaluates a copy of the model 5 times on 5 different training/test splits, but in the end, we are left with 5 scores rather than a trained model.
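The idea can be spelled out by hand. The sketch below approximates what a five-fold cross-validation does for a regressor: split with unshuffled `KFold`, fit a fresh clone on each training part, and score it on the held-out part. It uses synthetic data and hypothetical names (`X_demo`, `y_demo`, `dtr_demo`), and it simplifies the library's internals.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration, not the cars dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=20.0, random_state=0)
dtr_demo = DecisionTreeRegressor(random_state=0)  # fixed seed so both runs match

manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X_demo):
    fold_model = clone(dtr_demo)  # fresh, unfitted copy for each fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

auto_scores = cross_val_score(dtr_demo, X_demo, y_demo, cv=5)

print("Manual loop matches cross_val_score:", np.allclose(manual_scores, auto_scores))
print("Original estimator is fitted:", hasattr(dtr_demo, "tree_"))
```

Because each fold works on a clone, the estimator passed in is never fitted. To get a usable model, call `fit` separately once an algorithm has been chosen.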

In [4]:
# Create a decision tree regressor model
dtr = tree.DecisionTreeRegressor() # default hyperparameters
# Cross-validate with k=5
scores = cross_val_score(dtr, X, y, cv=5)
print('All scores:', scores)
print('Mean score:', round(scores.mean(),3))
# The score computed here is the coefficient of determination (R^2).
# This is the default metric returned by the regressor's score method,
# which cross_val_score uses when no scoring argument is given

All scores: [0.99376221 0.99946997 0.99995255 0.99956506 0.99998371]
Mean score: 0.999
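R^2 is only the default. As a hedged sketch on synthetic data (not the cars dataset), the `scoring` parameter of `cross_val_score` can request another metric, such as mean absolute error. scikit-learn reports error metrics negated so that larger is always better, so the sign is flipped back here. The names `model_demo`, `r2_scores`, and `mae_scores` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=20.0, random_state=0)
model_demo = DecisionTreeRegressor(random_state=0)

# Default scoring for a regressor: R^2 from the estimator's score method
r2_scores = cross_val_score(model_demo, X_demo, y_demo, cv=5)

# Mean absolute error is returned negated; flip the sign to get the usual MAE
mae_scores = -cross_val_score(
    model_demo, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error"
)

print("R^2 per fold:", r2_scores.round(3))
print("MAE per fold:", mae_scores.round(3))
print("R^2 mean and std:", round(r2_scores.mean(), 3), round(r2_scores.std(), 3))
```

Reporting the standard deviation alongside the mean shows how much the estimate varies from fold to fold.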


### 3. Discuss where cross-validation fits into the Machine Learning process
Cross-validation happens early in the machine learning process. It is used to robustly evaluate competing models as you learn about your data and how to best model it.<P>

So, let's review our machine learning process so far:
1. Load the data
2. Prepare the data for modeling:
    - Clean it (we haven't done this very much yet)
    - Depending on the algorithm and the data, scale the features
3. Select a model to evaluate
4. Use cross-validation to get multiple estimates of how well the model performs
5. Repeat steps 3 and 4 to evaluate several competing algorithms
6. Select an algorithm
7. Split the data into training and test sets
8. Train the selected model on the training set
    - Tune hyperparameters, if necessary
9. Evaluate the selected model on the test set
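Steps 3 through 9 above can be sketched end to end. This is a minimal illustration under stated assumptions: the data is synthetic, linear regression is an arbitrary stand-in for a competing algorithm, hyperparameter tuning is skipped, and all names (`candidates`, `cv_means`, `best_name`, `final_model`) are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration, not the cars dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=3, noise=20.0, random_state=0)

# Steps 3-5: cross-validate each competing algorithm
candidates = {
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "linear_regression": LinearRegression(),
}
cv_means = {
    name: cross_val_score(est, X_demo, y_demo, cv=5).mean()
    for name, est in candidates.items()
}
print("Mean CV scores:", {name: round(score, 3) for name, score in cv_means.items()})

# Step 6: select the algorithm with the best mean cross-validation score
best_name = max(cv_means, key=cv_means.get)

# Step 7: split the data into training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Step 8: train the selected model on the training set (no tuning in this sketch)
final_model = candidates[best_name].fit(X_tr, y_tr)

# Step 9: evaluate the selected model on the test set
test_score = final_model.score(X_te, y_te)
print("Selected:", best_name, "| test R^2:", round(test_score, 3))
```

Cross-validation is used here only to choose between algorithms. The final fit on the training set is what produces a model that can make predictions.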
  
<P><P>    
In the next lesson, we'll practice this entire process.