
<a href="https://colab.research.google.com/github/bpesquet/machine-learning-katas/blob/master/notebooks/training_models/diabetes.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open in Google Colaboratory"></a>


## Instructions

This is a self-correcting notebook generated by [nbgrader](https://github.com/jupyter/nbgrader). 

Fill in any place that says `YOUR CODE HERE` or `YOUR ANSWER HERE`. Run subsequent cells to check your code.

# Predict Diabetes Evolution

In this notebook, you'll train several regression models to predict disease progression one year after baseline.

The [Diabetes](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html) dataset contains ten baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) obtained for each of n = 442 diabetes patients, together with the response of interest: a quantitative measure of disease progression one year after baseline.

## Environment setup

In [1]:
# Import base packages
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

In [2]:
# Import ML packages
import sklearn

print(f"scikit-learn version: {sklearn.__version__}")

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error

scikit-learn version: 0.22.2.post1


## Step 1: Loading the data

In [3]:
dataset = load_diabetes()

# Put data in a pandas DataFrame
df_diab = pd.DataFrame(dataset.data, columns=dataset.feature_names)
# Add target to DataFrame
df_diab["target"] = dataset.target
# Show 10 random samples
df_diab.sample(n=10)

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 50 | 0.034443 | -0.044642 | -0.007284 | 0.014987 | -0.044223 | -0.037326 | -0.002903 | -0.039493 | -0.021394 | 0.007207 | 155.0 |
| 190 | 0.009016 | -0.044642 | -0.012673 | 0.028758 | -0.01808 | -0.005072 | -0.047082 | 0.034309 | 0.023375 | -0.00522 | 292.0 |
| 187 | -0.067268 | -0.044642 | -0.054707 | -0.026328 | -0.07587 | -0.082106 | 0.04864 | -0.076395 | -0.086829 | -0.10463 | 143.0 |
| 124 | -0.005515 | -0.044642 | 0.023973 | 0.008101 | -0.034592 | -0.038892 | 0.022869 | -0.039493 | -0.015998 | -0.013504 | 121.0 |
| 25 | -0.067268 | 0.05068 | -0.012673 | -0.040099 | -0.015328 | 0.004636 | -0.058127 | 0.034309 | 0.019199 | -0.034215 | 202.0 |
| 386 | 0.019913 | -0.044642 | -0.040696 | -0.015999 | -0.008449 | -0.017598 | 0.052322 | -0.039493 | -0.030751 | 0.003064 | 72.0 |
| 34 | 0.016281 | -0.044642 | -0.06333 | -0.057314 | -0.057983 | -0.048912 | 0.008142 | -0.039493 | -0.059473 | -0.067351 | 65.0 |
| 371 | 0.052606 | 0.05068 | -0.009439 | 0.049415 | 0.050717 | -0.019163 | -0.013948 | 0.034309 | 0.119344 | -0.017646 | 197.0 |
| 228 | -0.052738 | -0.044642 | -0.012673 | -0.060757 | -0.000193 | 0.008081 | 0.011824 | -0.002592 | -0.027129 | -0.050783 | 160.0 |
| 374 | -0.107226 | -0.044642 | -0.034229 | -0.067642 | -0.063487 | -0.07052 | 0.008142 | -0.039493 | -0.000609 | -0.079778 | 140.0 |


## Step 2: Preparing the data

### Question

Split the dataset into a training set (variables `x_train`, `y_train`) and a test set (variables `x_test`, `y_test`), holding out 20% of the samples for testing.

In [4]:
# Features are all columns except the last one; the target is the last column
x_train, x_test, y_train, y_test = train_test_split(
    df_diab.iloc[:, :-1], df_diab.iloc[:, -1], test_size=0.2, random_state=15
)

In [5]:
print(f"x_train: {x_train.shape}. y_train: {y_train.shape}")
print(f"x_test: {x_test.shape}. y_test: {y_test.shape}")

assert x_train.shape == (353, 10)
assert y_train.shape == (353,)
assert x_test.shape == (89, 10)
assert y_test.shape == (89,)

x_train: (353, 10). y_train: (353,)
x_test: (89, 10). y_test: (89,)
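
The expected shapes in the assertions follow from the split arithmetic. The sketch below reproduces the counts by hand, assuming the test count is rounded up and the remaining samples go to the training set (this is how scikit-learn is understood to treat a fractional `test_size`; the variable names are illustrative only).

```python
import math

n_samples = 442  # number of patients in the dataset
test_size = 0.2  # fraction of samples held out for testing

# Assumption: the test count is rounded up, the rest is used for training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"train: {n_train}, test: {n_test}")
```

Because 20% of 442 is not a whole number, the two sets cannot be in an exact 80/20 proportion.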


## Step 3: Training several models

In [6]:
def eval_model(model):
    y_train_pred = model.predict(x_train)
    y_test_pred = model.predict(x_test)

    # Train and test MSE
    train_mse = mean_squared_error(y_train, y_train_pred)
    test_mse = mean_squared_error(y_test, y_test_pred)

    print(f"Training MSE: {train_mse:.2f}. Test MSE: {test_mse:.2f}")
    
    return train_mse, test_mse
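
For reference, the mean squared error is the average of the squared differences between targets and predictions. The minimal pure-Python sketch below shows the computation on made-up numbers; the helper name `mse` and the toy values are hypothetical and are not part of the notebook.

```python
def mse(y_true, y_pred):
    """Return the mean of the squared differences between two sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values, invented for illustration only
toy_true = [150.0, 200.0, 100.0]
toy_pred = [140.0, 210.0, 100.0]

# Squared errors are 100, 100 and 0, so the mean is 200 / 3
toy_mse = mse(toy_true, toy_pred)
print(f"Toy MSE: {toy_mse:.2f}")
```

Since the errors are squared, a few large mistakes weigh much more heavily on the score than many small ones.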

### Question

Create and train a Decision Tree, a MultiLayer Perceptron and a Random Forest on the training data.

Compute their MSE on the training and test data.

In [9]:
# Import the needed scikit-learn packages
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

In [10]:
# Create and train a Decision Tree
dt_model = DecisionTreeRegressor()
dt_model.fit(x_train, y_train)
eval_model(dt_model)

Training MSE: 0.00. Test MSE: 6855.78


(0.0, 6855.775280898876)

In [11]:
# Create and train an MLP
mlp_model = MLPRegressor()
mlp_model.fit(x_train, y_train)
eval_model(mlp_model)

Training MSE: 23651.85. Test MSE: 23135.23




(23651.84577041807, 23135.226273102093)

In [12]:
# Create and train a Random Forest
rf_model = RandomForestRegressor()
rf_model.fit(x_train, y_train)
eval_model(rf_model)

Training MSE: 476.13. Test MSE: 3095.40


(476.12523881019825, 3095.395695505618)
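
MSE is expressed in squared target units, which makes the three scores hard to read at a glance. Taking the square root gives an error in the same units as the target. The sketch below does this for the test scores reported above; the numbers are copied from this particular run and will vary between runs, since the models are created without a fixed random seed.

```python
import math

# Test MSE values copied from the outputs of this run
test_mse_by_model = {
    "Decision Tree": 6855.78,
    "MLP": 23135.23,
    "Random Forest": 3095.40,
}

# RMSE is in the same units as the target variable
test_rmse_by_model = {
    name: math.sqrt(value) for name, value in test_mse_by_model.items()
}

# Print models from lowest to highest test error
for name, rmse in sorted(test_rmse_by_model.items(), key=lambda item: item[1]):
    print(f"{name}: test RMSE = {rmse:.1f}")

best_model = min(test_rmse_by_model, key=test_rmse_by_model.get)
```

On these numbers the Random Forest has the lowest test error, which is why it is the model tuned in the next step.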

## Step 4: Tuning the most promising model

### Question

Choose the most promising model and tune it, using a `GridSearchCV` instance stored in the `grid_search_cv` variable.

Your training MSE should be less than 1000 and your test MSE less than 3500.

In [38]:
estimator = RandomForestRegressor()

# Hyperparameter values to try: 1 x 1 x 2 x 3 x 2 x 4 = 48 combinations
param_grid = {
    'n_estimators': [50],
    # 'mse' is the name used by scikit-learn 0.22; newer releases call it 'squared_error'
    'criterion': ['mse'],
    'max_depth': [10, None],
    'min_samples_split': [2, 4, 8],
    'max_features': [None, 'sqrt'],
    'max_samples': [0.2, 0.5, 0.8, None]
}
grid_search_cv = GridSearchCV(estimator, param_grid, verbose=1, n_jobs=-1)

In [39]:
# Search for the best parameters with the specified regressor on training data
grid_search_cv.fit(x_train, y_train)

# Print the best combination of hyperparameters found
print(grid_search_cv.best_params_)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    2.4s


{'criterion': 'mse', 'max_depth': 10, 'max_features': None, 'max_samples': 0.2, 'min_samples_split': 4, 'n_estimators': 50}


[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed:    5.4s finished
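
The figures in the log ("48 candidates, totalling 240 fits") can be checked by hand: a grid search tries every combination of the listed values, and each combination is fitted once per cross-validation fold. The sketch below enumerates the combinations with the standard library only; the fold count of 5 is taken from the log line above.

```python
from itertools import product

# Same grid as the one passed to GridSearchCV
param_grid = {
    'n_estimators': [50],
    'criterion': ['mse'],
    'max_depth': [10, None],
    'min_samples_split': [2, 4, 8],
    'max_features': [None, 'sqrt'],
    'max_samples': [0.2, 0.5, 0.8, None],
}

# One candidate per element of the Cartesian product of all value lists
candidates = list(product(*param_grid.values()))

n_folds = 5  # taken from the log: "Fitting 5 folds for each of 48 candidates"
n_fits = len(candidates) * n_folds

print(f"{len(candidates)} candidates, {n_fits} fits")
```

Adding one more value to any list multiplies the number of fits, so it is worth keeping grids small.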


In [40]:
# Evaluate best estimator
train_mse, test_mse = eval_model(grid_search_cv.best_estimator_)

assert train_mse < 1000
assert test_mse < 3500

Training MSE: 2307.92. Test MSE: 2637.07


AssertionError: 
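
The bare `AssertionError` does not say which check failed. Plugging in the values printed above suggests it is the training threshold, not the test one. The sketch below repeats both checks with the reported numbers and names the failing one; the `checks` dictionary is only an illustration, not part of the grading code.

```python
# Values copied from the evaluation output above
train_mse, test_mse = 2307.92, 2637.07

# Same thresholds as the assertions in the evaluation cell
checks = {
    "train_mse < 1000": train_mse < 1000,
    "test_mse < 3500": test_mse < 3500,
}

failed = [name for name, passed in checks.items() if not passed]
print("Failed checks:", failed)
```

One plausible explanation, offered as an assumption and not a diagnosis, is that the selected `max_samples=0.2` trains each tree on only 20% of the training rows, so the forest fits the training set much less tightly than the untuned Random Forest did (training MSE 476.13), even though it generalizes better on the test set.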