> This is a self-correcting activity generated by [nbgrader](https://nbgrader.readthedocs.io). Fill in any place that says `YOUR CODE HERE` or `YOUR ANSWER HERE`. Run subsequent cells to check your code.

---

# Predict diabetes evolution

In this activity, you'll train several regression models to predict disease progression one year after baseline.

The [Diabetes](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html) dataset contains ten baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) obtained for each of n = 442 diabetes patients, as well as the response of interest: a quantitative measure of disease progression one year after baseline.

## Environment setup

In [1]:
# Import base packages
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

In [2]:
# Import ML packages
import sklearn

print(f"scikit-learn version: {sklearn.__version__}")

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error

scikit-learn version: 0.23.2


## Step 1: Loading the data

In [3]:
dataset = load_diabetes()

# Put data in a pandas DataFrame
df_diab = pd.DataFrame(dataset.data, columns=dataset.feature_names)
# Add target to DataFrame
df_diab["target"] = dataset.target
# Show 10 random samples
df_diab.sample(n=10)

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 198 | -0.052738 | -0.044642 | 0.054152 | -0.026328 | -0.055231 | -0.033881 | -0.013948 | -0.039493 | -0.074089 | -0.059067 | 142.0 |
| 288 | 0.070769 | 0.05068 | -0.016984 | 0.021872 | 0.043837 | 0.056305 | 0.037595 | -0.002592 | -0.070209 | -0.017646 | 80.0 |
| 53 | -0.009147 | -0.044642 | -0.015906 | 0.070073 | 0.012191 | 0.022172 | 0.015505 | -0.002592 | -0.033249 | 0.048628 | 104.0 |
| 292 | 0.009016 | -0.044642 | -0.022373 | -0.032066 | -0.049727 | -0.068641 | 0.078093 | -0.070859 | -0.062913 | -0.038357 | 84.0 |
| 156 | -0.016412 | -0.044642 | -0.010517 | 0.001215 | -0.037344 | -0.03576 | 0.011824 | -0.039493 | -0.021394 | -0.034215 | 25.0 |
| 283 | -0.016412 | -0.044642 | -0.052552 | -0.033214 | -0.044223 | -0.036387 | 0.019187 | -0.039493 | -0.06833 | -0.030072 | 181.0 |
| 104 | -0.02731 | -0.044642 | 0.06493 | -0.002228 | -0.02496 | -0.017284 | 0.022869 | -0.039493 | -0.061177 | -0.063209 | 95.0 |
| 362 | 0.019913 | 0.05068 | 0.104809 | 0.070073 | -0.035968 | -0.026679 | -0.024993 | -0.002592 | 0.003712 | 0.040343 | 321.0 |
| 361 | 0.041708 | -0.044642 | -0.007284 | 0.028758 | -0.042848 | -0.048286 | 0.052322 | -0.076395 | -0.072128 | 0.023775 | 182.0 |
| 401 | 0.016281 | -0.044642 | -0.045007 | -0.057314 | -0.034592 | -0.053923 | 0.074412 | -0.076395 | -0.042572 | 0.040343 | 93.0 |
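
The sampled feature values do not look like raw ages or blood pressures because scikit-learn ships this dataset with each feature mean-centered and scaled so that the squared values in each column sum to 1. The following is a minimal sketch of how one might check this; the names `data`, `col_means`, and `col_sq_sums` are illustrative and not part of the activity.

```python
import numpy as np
from sklearn.datasets import load_diabetes

data = load_diabetes()

# 442 patients, 10 baseline variables
print(data.data.shape)
print(data.feature_names)

# Per-column mean and sum of squares of the feature matrix
col_means = data.data.mean(axis=0)
col_sq_sums = (data.data ** 2).sum(axis=0)

# Both checks should print True if the features are centered and scaled
print(np.allclose(col_means, 0.0, atol=1e-6))
print(np.allclose(col_sq_sums, 1.0, atol=1e-6))
```

The target column is left in its original units, which is why it holds values such as 142.0 while the features stay close to zero.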


## Step 2: Preparing the data

### Question

Split the dataset into a training set (variables `x_train`, `y_train`) and a test set (variables `x_test`, `y_test`), holding out 20% of the samples for the test set.

In [4]:
# YOUR CODE HERE
x_train, x_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.2)

In [5]:
print(f"x_train: {x_train.shape}. y_train: {y_train.shape}")
print(f"x_test: {x_test.shape}. y_test: {y_test.shape}")

assert x_train.shape == (353, 10)
assert y_train.shape == (353,)
assert x_test.shape == (89, 10)
assert y_test.shape == (89,)

x_train: (353, 10). y_train: (353,)
x_test: (89, 10). y_test: (89,)
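
Because `train_test_split` shuffles the samples, the MSE values reported in the rest of this activity will change from run to run unless a seed is fixed. As a minimal sketch, passing `random_state` makes the split repeatable; the seed `42` and the names `X`, `y`, `split_a`, and `split_b` are arbitrary choices for this illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# Two independent calls with the same (arbitrary) seed
split_a = train_test_split(X, y, test_size=0.2, random_state=42)
split_b = train_test_split(X, y, test_size=0.2, random_state=42)

# Each call returns [x_train, x_test, y_train, y_test]; compare them pairwise
same = all((a == b).all() for a, b in zip(split_a, split_b))
print(f"Identical splits: {same}")
print(f"x_train: {split_a[0].shape}. x_test: {split_a[1].shape}")
```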


## Step 3: Training several models

In [6]:
def eval_model(model):
    y_train_pred = model.predict(x_train)
    y_test_pred = model.predict(x_test)

    # Train and test MSE
    train_mse = mean_squared_error(y_train, y_train_pred)
    test_mse = mean_squared_error(y_test, y_test_pred)

    print(f"Training MSE: {train_mse:.2f}. Test MSE: {test_mse:.2f}")
    
    return train_mse, test_mse
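
MSE is expressed in squared target units, which can make the numbers hard to relate to the progression score itself. One common complement is the root mean squared error (RMSE), which is back in the units of the target. The sketch below uses made-up toy values (`y_true`, `y_pred`) purely to show the arithmetic; they are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values for illustration only
y_true = np.array([100.0, 150.0, 200.0])
y_pred = np.array([110.0, 140.0, 230.0])

# Errors are 10, 10 and 30, so MSE = (100 + 100 + 900) / 3
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)

print(f"MSE: {mse:.2f}. RMSE: {rmse:.2f}")
```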

### Question

Create and train a Decision Tree, a MultiLayer Perceptron and a Random Forest on the training data.

Compute their MSE on the training and test data.

In [7]:
# Import the needed scikit-learn packages
# YOUR CODE HERE
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

In [9]:
# Create and train a Decision Tree
# YOUR CODE HERE
dt_model = DecisionTreeRegressor()
dt_model.fit(x_train, y_train)

eval_model(dt_model)

Training MSE: 0.00. Test MSE: 6982.52


(0.0, 6982.516853932584)
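
A training MSE of 0 next to a much larger test MSE suggests that the unconstrained tree has memorized the training samples. One possible way to explore this, sketched here under assumptions, is to compare it with a depth-limited tree. The depth of 3 and the seed `0` are arbitrary illustrative choices, and this sketch makes no claim about which setting gives the better test MSE on a given split.

```python
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

results = {}
for depth in (None, 3):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, tree.predict(X_tr))
    test_mse = mean_squared_error(y_te, tree.predict(X_te))
    results[depth] = (train_mse, test_mse)
    print(f"max_depth={depth}: training MSE {train_mse:.2f}, test MSE {test_mse:.2f}")
```

The gap between the two columns, rather than the training MSE alone, is the quantity to watch when judging overfitting.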

In [16]:
# Create and train a MLP
# YOUR CODE HERE
mlp_model = MLPRegressor(max_iter=10000)
mlp_model.fit(x_train, y_train)

eval_model(mlp_model)

Training MSE: 2828.97. Test MSE: 2681.01




(2828.966729797429, 2681.0112021081236)

In [15]:
# Create and train a Random Forest
# YOUR CODE HERE
rf_model = RandomForestRegressor()
rf_model.fit(x_train, y_train)

eval_model(rf_model)

Training MSE: 464.85. Test MSE: 3679.00


(464.85330963172805, 3678.999741573034)
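
The comparison above rests on a single random split of 442 samples, so the ranking of the models may shift between runs. A minimal sketch of a more stable comparison is to cross-validate each candidate on the training data only. The names `models` and `cv_mse`, the seed `0`, and `n_estimators=50` are assumptions made for this illustration, and the MLP is left out only to keep the example fast.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "decision tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

cv_mse = {}
for name, model in models.items():
    # The scorer returns negated MSE, so flip the sign to get an MSE
    scores = cross_val_score(
        model, X_tr, y_tr, cv=5, scoring="neg_mean_squared_error"
    )
    cv_mse[name] = -scores.mean()
    print(f"{name}: mean CV MSE {cv_mse[name]:.2f}")
```

The test set is not touched here, so it remains available for a final, unbiased evaluation of whichever model is selected.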

## Step 4: Tuning the most promising model

### Question

Choose the most promising model and tune it, using a `GridSearchCV` instance stored in the `grid_search_cv` variable.

Your test MSE should be less than 3500.

In [19]:
# YOUR CODE HERE
param_grid = [
    {"n_estimators": [3, 10, 30, 100], "max_features": [2, 4, 6, 8, 10]},
]

grid_search_cv = GridSearchCV(
    rf_model,
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
    return_train_score=True,
)
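
The cost of this search grows with the product of the grid sizes: 4 values of `n_estimators` times 5 values of `max_features` gives 20 candidates, each fitted on 5 folds. As a small sketch, `ParameterGrid` can be used to count the candidates before launching the search; `n_candidates` and `n_fits` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {"n_estimators": [3, 10, 30, 100], "max_features": [2, 4, 6, 8, 10]},
]

# Number of hyperparameter combinations the search will try
n_candidates = len(ParameterGrid(param_grid))

# One fit per candidate and per fold (cv=5), before the final refit
n_fits = n_candidates * 5

print(f"{n_candidates} candidates, {n_fits} cross-validation fits")
```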

In [20]:
# Search for the best hyperparameters for the specified regressor on the training data
grid_search_cv.fit(x_train, y_train)

# Print the best combination of hyperparameters found
print(grid_search_cv.best_params_)

{'max_features': 4, 'n_estimators': 100}


In [21]:
# Evaluate best estimator
train_mse, test_mse = eval_model(grid_search_cv.best_estimator_)

assert train_mse < 1000
assert test_mse < 3500

Training MSE: 444.59. Test MSE: 3477.12
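
Beyond `best_params_`, the fitted search object keeps the score of every candidate in `cv_results_`, which helps to see how close the runners-up were. The sketch below runs a deliberately smaller grid than the one above so that it finishes quickly; the names `search`, `results`, and `best_cv_mse` and the seed `0` are assumptions for this illustration.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    {"n_estimators": [10, 30], "max_features": [2, 4]},  # reduced grid
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_tr, y_tr)

# Scores are negated MSE values; flip the sign to read them as MSE
results = pd.DataFrame(search.cv_results_)
results["mean_cv_mse"] = -results["mean_test_score"]
best_cv_mse = -search.best_score_

cols = ["param_n_estimators", "param_max_features", "mean_cv_mse", "rank_test_score"]
print(results[cols].sort_values("rank_test_score"))
print(f"Best mean CV MSE: {best_cv_mse:.2f}")
```

Note that `mean_test_score` refers to the held-out fold inside cross-validation, not to the test set created in Step 2.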
