### Module 9-1 Learning Notebook: Cross-validation for competing model evaluation
In this lesson, we'll use cross-validation to evaluate three candidate models. We'll pick the best model, then proceed with training, evaluation, and prediction.

Our process to follow:
1. Load the data
2. Prepare the data for modeling
3. Select 3 models to evaluate
    - Stochastic Gradient Descent (SGD)
    - Decision Tree
    - Random Forest
4. Use cross-validation to get multiple estimates of how well each model performs
5. Select the best model
6. Split the data into training and test sets
7. Train the selected model on the training set
8. Evaluate the selected model on the test set
9. Use the trained model to predict new values

In [2]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn import tree
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import boto3
import pandas as pd
import numpy as np

### 1. Load the data

In [3]:
# Get the data from the S3 bucket: machinelearning-read-only
# Create session and S3 client
sess = boto3.session.Session()
s3 = sess.client('s3')
# Set variables 
source_bucket = 'machinelearning-read-only'
source_key = 'data/cars.csv'
# Load the dataframe
response = s3.get_object(Bucket=source_bucket, Key=source_key)
# The 'Body' is of type streaming body. We can put this right into a dataframe
df = pd.read_csv(response.get("Body")) 
print('The size of the complete dataset:',df.shape)
df.head(3)

The size of the complete dataset: (7906, 8)


| | name | year | selling_price | km_driven | km/liter | engine | max_power | seats |
|---|---|---|---|---|---|---|---|---|
| 0 | Maruti Swift Dzire VDI | 2014 | 450000 | 145500 | 23.4 | 1248.0 | 74.0 | 5.0 |
| 1 | Skoda Rapid 1.5 TDI Ambition | 2014 | 370000 | 120000 | 21.14 | 1498.0 | 103.52 | 5.0 |
| 2 | Honda City 2017-2020 EXi | 2006 | 158000 | 140000 | 17.7 | 1497.0 | 78.0 | 5.0 |


In [4]:
# Create features and target dataframes. 
# The selected features are intuitive choices; they were not selected through analysis.
X = df[['year','engine','seats']]
y = df['selling_price'] 

### 2. Prepare the data for modeling

In [5]:
# The data is clean, but the scaling is varied
X.describe()

| | year | engine | seats |
|---|---|---|---|
| count | 7906.0 | 7906.0 | 7906.0 |
| mean | 2013.983936 | 1458.708829 | 5.416393 |
| std | 3.863695 | 503.893057 | 0.959208 |
| min | 1994.0 | 624.0 | 2.0 |
| 25% | 2012.0 | 1197.0 | 5.0 |
| 50% | 2015.0 | 1248.0 | 5.0 |
| 75% | 2017.0 | 1582.0 | 5.0 |
| max | 2020.0 | 3604.0 | 14.0 |


In [6]:
# Since we will use the SGD algorithm, which is sensitive to scale, let's normalize the data.
norm_scaler_all_data = MinMaxScaler()
X_norm = pd.DataFrame(data = norm_scaler_all_data.fit_transform(X), columns = X.columns)
X_norm.describe()

| | year | engine | seats |
|---|---|---|---|
| count | 7906.0 | 7906.0 | 7906.0 |
| mean | 0.768613 | 0.280104 | 0.284699 |
| std | 0.148604 | 0.169092 | 0.079934 |
| min | 0.0 | 0.0 | 0.0 |
| 25% | 0.692308 | 0.192282 | 0.25 |
| 50% | 0.807692 | 0.209396 | 0.25 |
| 75% | 0.884615 | 0.321477 | 0.25 |
| max | 1.0 | 1.0 | 1.0 |
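
To make the normalization step concrete, here is a minimal sketch of what `MinMaxScaler` computes for one column: `(x - min) / (max - min)`. The toy `year` array is an assumption built from the min, quartile, and max values reported by `X.describe()`; it is not the full dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy 'year' column: min, 25%, 50%, 75% and max taken from X.describe()
year = np.array([[1994.0], [2012.0], [2015.0], [2017.0], [2020.0]])

# Manual min-max scaling, computed per column
manual = (year - year.min(axis=0)) / (year.max(axis=0) - year.min(axis=0))

# The same transformation through scikit-learn
scaled = MinMaxScaler().fit_transform(year)

print(np.round(manual.ravel(), 6))
print('Manual result matches MinMaxScaler:', np.allclose(manual, scaled))
```

The scaled values line up with the `year` quartiles in the normalized summary, for example (2015 - 1994) / (2020 - 1994) ≈ 0.807692.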


### 3. Select 3 models to evaluate

In [13]:
# Create 3 model objects with default hyperparameters (except max_iter for SGD)
sgd = linear_model.SGDRegressor(max_iter = 2000) # increasing this from 1000 to 2000 prevents a warning message below
dtr = tree.DecisionTreeRegressor() 
rfr = RandomForestRegressor()
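
This notebook fits the scaler once on the full dataset. One optional variant, not used in this notebook, is to wrap the scaler and the SGD model in a pipeline so that the scaler is refit on the training portion of each cross-validation fold. The sketch below assumes synthetic data from `make_regression` as a stand-in for the car data; the names `X_demo`, `y_demo`, and `sgd_pipeline` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the car data (assumption: 3 numeric features)
X_demo, y_demo = make_regression(n_samples=500, n_features=3,
                                 noise=10.0, random_state=0)

# The scaler is part of the model, so each fold fits it on its own training rows
sgd_pipeline = make_pipeline(MinMaxScaler(),
                             SGDRegressor(max_iter=2000, random_state=0))

# The pipeline is passed to cross_val_score like any other estimator
pipeline_scores = cross_val_score(sgd_pipeline, X_demo, y_demo, cv=5)
print('All scores:', pipeline_scores)
print('Mean score:', round(pipeline_scores.mean(), 4))
```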

### 4. Use cross-validation to get multiple estimates of how well each model performs

In [26]:
k = 5 # k-fold parameter
#
# Eval the Stochastic Gradient Descent
print("SGD:")
scores = cross_val_score(sgd, X_norm, y, cv=k) # Use the scaled features
print('All scores:', scores)
print('Mean score:', round(scores.mean(),4))
#
# Eval dtr
print("\nDecision Tree:")
scores = cross_val_score(dtr, X, y, cv=k) # No need to use the normalized data here
print('All scores:', scores)
print('Mean score:', round(scores.mean(),4))
# Eval rfr
print("\nRandom Forest:")
scores = cross_val_score(rfr, X, y, cv=k) # No need to use the normalized data here
print('All scores:', scores)
print('Mean score:', round(scores.mean(),4))

SGD:
All scores: [0.45269862 0.45051387 0.47295811 0.40396368 0.44126419]
Mean score: 0.4443

Decision Tree:
All scores: [0.94673545 0.91910088 0.95791052 0.85484953 0.88806976]
Mean score: 0.9133

Random Forest:
All scores: [0.94320926 0.92737515 0.95775353 0.85022265 0.92760405]
Mean score: 0.9212
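
To show what `cross_val_score` does with `cv=5` for a regressor, the following sketch repeats the same steps by hand: split the rows into 5 consecutive folds, train on 4, and score on the held-out fold with the model's default `score` method (R² for regressors). It uses synthetic data as a stand-in for the car data, and `random_state=0` is an assumption added so that both approaches grow identical trees.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the car data (not the notebook's dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=3,
                                 noise=5.0, random_state=0)

model = DecisionTreeRegressor(random_state=0)

# Manual k-fold loop: fit on k-1 folds, score on the held-out fold
manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X_demo):
    model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

# The one-line equivalent used in this notebook
library_scores = cross_val_score(model, X_demo, y_demo, cv=5)

print('Manual scores: ', np.round(manual_scores, 4))
print('Library scores:', np.round(library_scores, 4))
print('Identical:', np.allclose(manual_scores, library_scores))
```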


### 5. Select the best model
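The Decision Tree and the Random Forest both score well, with mean R² (the default score that `cross_val_score` reports for regressors) of about 0.91 and 0.92, far ahead of SGD at about 0.44.

In practice, we would use additional metrics or statistical tests to determine whether there is a real difference between the two tree-based models.

For demonstration purposes, we'll use the Decision Tree model.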
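
As a first, informal look at whether the two tree-based models really differ, the fold scores printed in step 4 can be compared fold by fold. This is only a sketch of one simple check, not a formal statistical test; the score arrays are copied from the output above.

```python
import numpy as np

# Fold scores copied from the cross-validation output in step 4
dtr_scores = np.array([0.94673545, 0.91910088, 0.95791052, 0.85484953, 0.88806976])
rfr_scores = np.array([0.94320926, 0.92737515, 0.95775353, 0.85022265, 0.92760405])

# Mean and sample standard deviation across the 5 folds
for name, s in [('Decision Tree', dtr_scores), ('Random Forest', rfr_scores)]:
    print(f'{name}: mean={s.mean():.4f}, std={s.std(ddof=1):.4f}')

# Fold-by-fold difference (Random Forest minus Decision Tree)
diff = rfr_scores - dtr_scores
print('Per-fold difference:', np.round(diff, 4))
print('Mean difference:', round(diff.mean(), 4))
```

The mean difference is small compared with the spread of scores across folds, and the Random Forest wins only 2 of the 5 folds, so these numbers alone do not clearly separate the two models.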

### 6. Split the data into training and test sets

In [20]:
# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
# Verify the sizes of the split datasets
print('X_train:', X_train.shape)
print('y_train:', y_train.shape)
print('X_test:', X_test.shape)
print('y_test:', y_test.shape)

X_train: (6324, 3)
y_train: (6324,)
X_test: (1582, 3)
y_test: (1582,)
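
Because `train_test_split` shuffles the rows, each run of the cell above produces a different split. If a repeatable split is wanted, one option is to pass `random_state`. The sketch below uses placeholder arrays with the same shape as this notebook's `X` and `y`; the seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows and features as X and y
X_demo = np.arange(7906 * 3).reshape(7906, 3)
y_demo = np.arange(7906)

# Same seed, same split (42 is an arbitrary choice)
split_a = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)

print('X_train:', split_a[0].shape)
print('X_test:', split_a[1].shape)
print('Same test rows both times:', np.array_equal(split_a[1], split_b[1]))
```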


### 7. Train the selected model on the training set

In [21]:
# Create a new model
final_dtr = tree.DecisionTreeRegressor() 
# Train the model using the training data. 
final_dtr.fit(X_train, y_train)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')

### 8. Evaluate the selected model on the test set 

In [22]:
# Make predictions for the test set
y_pred = final_dtr.predict(X_test)
# Evaluate performance
r2 = round(r2_score(y_test, y_pred),2)
mse = round(mean_squared_error(y_test, y_pred),2)
print("Coefficient of determination: %.2f" % r2)
print("MSE:",mse)

Coefficient of determination: 0.95
MSE: 31432695514.87
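
The MSE is expressed in squared price units, which makes it hard to read. Taking the square root gives the RMSE, which is in the same units as `selling_price`. A small sketch using the MSE value printed above (the value will differ from run to run because the split is random):

```python
import math

# MSE copied from the evaluation output above
mse = 31432695514.87

# RMSE is in the same units as the target (selling price)
rmse = math.sqrt(mse)
print('RMSE:', round(rmse, 2))
```

Here the RMSE comes out at roughly 177,000 in the dataset's currency units.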


### 9. Use the trained model to predict new values
We now have a trained model. We can ask it to predict the sales price of a vehicle from its features.

In [23]:
# Year, Engine Size, # of seats
unseen_used_vehicle = [[2014.0, 2400.0, 1.0]]
predicted_sales_price = int(final_dtr.predict(unseen_used_vehicle).item()) # Extract the number and truncate it to an integer
print('Predicted sales price in some currency:', predicted_sales_price)

Predicted sales price in some currency: 580000
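
To predict for several vehicles at once, one option is to pass a DataFrame with the same column names and order as the training features. The sketch below is self-contained and therefore trains a tiny tree on made-up values; `X_toy`, `y_toy`, and `toy_dtr` are illustrative names, and none of the numbers come from `cars.csv`.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Tiny made-up training set (illustrative values only, not from cars.csv)
X_toy = pd.DataFrame({'year': [2010, 2014, 2018, 2019],
                      'engine': [1200.0, 1500.0, 2000.0, 2400.0],
                      'seats': [5.0, 5.0, 7.0, 5.0]})
y_toy = pd.Series([200000, 400000, 900000, 1200000])
toy_dtr = DecisionTreeRegressor(random_state=0).fit(X_toy, y_toy)

# New vehicles: same columns, same order as the training features
new_vehicles = pd.DataFrame([[2014, 2400.0, 5.0],
                             [2019, 1200.0, 5.0]],
                            columns=X_toy.columns)

# predict returns one value per row
predictions = toy_dtr.predict(new_vehicles)
print(predictions.astype(int))
```

Each prediction from a fully grown decision tree is the target value stored in the leaf the row falls into, which is why the predicted prices match values seen during training.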


### Summary:
In this exercise:
- We used cross-validation to quickly evaluate 3 models
- Once we had selected a model, we trained and evaluated it as usual
- Finally, we used the trained model to make new predictions