# Predict diabetes evolution

In this activity, you'll train several regression models to predict disease progression one year after baseline.

The [Diabetes](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html) dataset contains ten baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) obtained for each of n = 442 diabetes patients, along with the response of interest: a quantitative measure of disease progression one year after baseline.

## Environment setup

In [1]:
# Import base packages
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

In [2]:
# Import ML packages
import sklearn

print(f"scikit-learn version: {sklearn.__version__}")

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error

scikit-learn version: 0.22.1


## Step 1: Loading the data

In [3]:
dataset = load_diabetes()

# Put data in a pandas DataFrame
df_diab = pd.DataFrame(dataset.data, columns=dataset.feature_names)
# Add target to DataFrame
df_diab["target"] = dataset.target
# Show 10 random samples
df_diab.sample(n=10)

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 51 | 0.059871 | 0.05068 | 0.016428 | 0.028758 | -0.041472 | -0.029184 | -0.028674 | -0.002592 | -0.002397 | -0.021788 | 225.0 |
| 80 | 0.070769 | -0.044642 | 0.012117 | 0.04253 | 0.071357 | 0.053487 | 0.052322 | -0.002592 | 0.025393 | -0.00522 | 143.0 |
| 254 | 0.030811 | 0.05068 | 0.056307 | 0.076958 | 0.049341 | -0.012274 | -0.036038 | 0.07121 | 0.120053 | 0.090049 | 310.0 |
| 269 | 0.009016 | -0.044642 | -0.032073 | -0.026328 | 0.042462 | -0.010395 | 0.159089 | -0.076395 | -0.011901 | -0.038357 | 87.0 |
| 314 | -0.023677 | -0.044642 | 0.04014 | -0.012556 | -0.009825 | -0.001001 | -0.002903 | -0.002592 | -0.011901 | -0.038357 | 147.0 |
| 213 | 0.001751 | -0.044642 | -0.070875 | -0.022885 | -0.001569 | -0.001001 | 0.02655 | -0.039493 | -0.022512 | 0.007207 | 49.0 |
| 379 | -0.001882 | -0.044642 | -0.03854 | 0.021872 | -0.108893 | -0.115613 | 0.022869 | -0.076395 | -0.046879 | 0.023775 | 40.0 |
| 434 | 0.016281 | -0.044642 | 0.001339 | 0.008101 | 0.005311 | 0.010899 | 0.030232 | -0.039493 | -0.045421 | 0.032059 | 49.0 |
| 426 | 0.030811 | 0.05068 | -0.034229 | 0.043677 | 0.057597 | 0.068831 | -0.032356 | 0.057557 | 0.035462 | 0.085907 | 120.0 |
| 375 | 0.045341 | 0.05068 | -0.002973 | 0.107944 | 0.035582 | 0.022485 | 0.02655 | -0.002592 | 0.028017 | 0.019633 | 217.0 |
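
The feature values in the sample above are small and centered around zero because scikit-learn ships this dataset with each feature mean-centered and scaled so that the squared values of a column sum to 1. The following sketch checks this on the loaded data; the variable names `column_means` and `column_sq_sums` are illustrative and not part of the activity.

```python
import numpy as np
from sklearn.datasets import load_diabetes

dataset = load_diabetes()

# 442 patients, 10 baseline features
print(dataset.data.shape)
print(dataset.feature_names)

# Each feature column is mean-centered...
column_means = dataset.data.mean(axis=0)
print(np.round(column_means, 6))

# ...and scaled so that its squared values sum to 1
column_sq_sums = (dataset.data ** 2).sum(axis=0)
print(np.round(column_sq_sums, 6))
```

The target column is left in its original units, which is why its values are several orders of magnitude larger than the features.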


## Step 2: Preparing the data

### Question

Split the dataset into a training set (variables `x_train`, `y_train`) and a test set (variables `x_test`, `y_test`), holding out 20% of the samples for testing.

In [4]:
# BEGIN SOLUTION CODE
# Split data into training and test sets, holding out 20% of the samples for testing
x_train, x_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.2
)
# END SOLUTION CODE

In [5]:
print(f"x_train: {x_train.shape}. y_train: {y_train.shape}")
print(f"x_test: {x_test.shape}. y_test: {y_test.shape}")

assert x_train.shape == (353, 10)
assert y_train.shape == (353,)
assert x_test.shape == (89, 10)
assert y_test.shape == (89,)

x_train: (353, 10). y_train: (353,)
x_test: (89, 10). y_test: (89,)
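
Because `train_test_split` shuffles the samples, each run of the split above produces different sets, and therefore different MSE values later on. If reproducible results are wanted, one option is to fix the `random_state` parameter, as in this minimal sketch. The seed value 42 and the short names `x_tr`, `x_te`, `y_tr`, `y_te` are arbitrary choices made here for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

dataset = load_diabetes()

# Fixing random_state makes the shuffle, and thus the split, repeatable
x_tr, x_te, y_tr, y_te = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=42
)
print(f"train: {x_tr.shape}, test: {x_te.shape}")

# Splitting again with the same seed yields exactly the same test set
_, x_te_again, _, _ = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=42
)
print(np.array_equal(x_te, x_te_again))
```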


## Step 3: Training several models

In [6]:
def eval_model(model):
    """Print and return the training and test MSE of a fitted model.

    Relies on the global x_train, y_train, x_test and y_test variables.
    """
    y_train_pred = model.predict(x_train)
    y_test_pred = model.predict(x_test)

    # Train and test MSE
    train_mse = mean_squared_error(y_train, y_train_pred)
    test_mse = mean_squared_error(y_test, y_test_pred)

    print(f"Training MSE: {train_mse:.2f}. Test MSE: {test_mse:.2f}")

    return train_mse, test_mse
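
To make the metric concrete, here is a small sketch of what `mean_squared_error` computes: the average of the squared differences between true and predicted values. The toy arrays below are made up for illustration and are unrelated to the diabetes data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values, chosen only to keep the arithmetic easy to follow
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])

# Errors are 1, 0 and -2; their squares are 1, 0 and 4
manual_mse = np.mean((y_true - y_pred) ** 2)
sk_mse = mean_squared_error(y_true, y_pred)
print(manual_mse, sk_mse)

# The square root (RMSE) is expressed in the same units as the target
rmse = np.sqrt(manual_mse)
print(rmse)
```

Since the MSE is in squared target units, taking its square root can make the values easier to relate to the target column.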

### Question

Create and train a Decision Tree, a Multilayer Perceptron (MLP), and a Random Forest on the training data.

Compute their MSE on the training and test data.

In [7]:
# Import the needed scikit-learn packages
# BEGIN SOLUTION CODE
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor
# END SOLUTION CODE

In [8]:
# Create and train a Decision Tree
# BEGIN SOLUTION CODE
dt_model = DecisionTreeRegressor()
dt_model.fit(x_train, y_train)
# END SOLUTION CODE

eval_model(dt_model)

Training MSE: 0.00. Test MSE: 6080.69


(0.0, 6080.685393258427)
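
A training MSE of 0 next to a much larger test MSE is a typical sign of overfitting: an unconstrained tree keeps splitting until it reproduces the training targets exactly. One common way to limit this is the `max_depth` parameter. The sketch below compares an unconstrained tree with a shallow one; the depth of 3, the seed 42 and the names `results`, `x_tr`, `x_te`, `y_tr`, `y_te` are assumptions made for illustration, and a shallower tree usually, though not always, generalizes better.

```python
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

dataset = load_diabetes()
x_tr, x_te, y_tr, y_te = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=42
)

results = {}
for depth in (None, 3):  # None means the tree grows without a depth limit
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(x_tr, y_tr)
    train_mse = mean_squared_error(y_tr, tree.predict(x_tr))
    test_mse = mean_squared_error(y_te, tree.predict(x_te))
    results[depth] = (train_mse, test_mse)
    print(f"max_depth={depth}: train MSE {train_mse:.2f}, test MSE {test_mse:.2f}")
```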

In [9]:
# Create and train an MLP
# BEGIN SOLUTION CODE
mlp_model = MLPRegressor(max_iter=1000)
mlp_model.fit(x_train, y_train)
# END SOLUTION CODE

eval_model(mlp_model)

Training MSE: 3475.93. Test MSE: 3422.36




(3475.9303220311904, 3422.358970306721)

In [10]:
# Create and train a Random Forest
# BEGIN SOLUTION CODE
rf_model = RandomForestRegressor()
rf_model.fit(x_train, y_train)
# END SOLUTION CODE

eval_model(rf_model)

Training MSE: 485.26. Test MSE: 3224.32


(485.2633679886685, 3224.319452808989)
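
The comparison above rests on a single train/test split, so the ranking of the models can shift from run to run. Cross-validation on the training data gives a steadier estimate without touching the test set. The following is a minimal sketch using `cross_val_score` for two of the three models; the seed 42 and the names `candidates`, `cv_mse`, `x_tr` and `y_tr` are illustrative, and the MLP could be added to the dictionary in the same way.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

dataset = load_diabetes()
x_tr, x_te, y_tr, y_te = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=42
)

candidates = {
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
}

cv_mse = {}
for name, model in candidates.items():
    # scikit-learn maximizes scores, so the MSE is returned negated
    scores = cross_val_score(
        model, x_tr, y_tr, cv=5, scoring="neg_mean_squared_error"
    )
    cv_mse[name] = -scores.mean()
    print(f"{name}: mean CV MSE = {cv_mse[name]:.2f}")
```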

## Step 4: Tuning the most promising model

### Question

Choose the most promising model and tune it, using a `GridSearchCV` instance stored in the `grid_search_cv` variable.

Your test MSE should be less than 3500.

In [11]:
# BEGIN SOLUTION CODE
# Grid search explores a user-defined set of hyperparameter values
param_grid = [
    {"n_estimators": [3, 10, 30, 100], "max_features": [2, 4, 6, 8, 10]},
]

# Evaluate each combination with 5-fold cross-validation
grid_search_cv = GridSearchCV(
    rf_model,
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
    return_train_score=True,
)
# END SOLUTION CODE

In [12]:
# Search for the best hyperparameters for the specified regressor on the training data
grid_search_cv.fit(x_train, y_train)

# Print the best combination of hyperparameters found
print(grid_search_cv.best_params_)

{'max_features': 2, 'n_estimators': 100}
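
Beyond `best_params_`, the fitted search object exposes a `cv_results_` dictionary with the scores of every combination, which loads directly into a DataFrame. The sketch below uses a deliberately reduced grid to keep it quick; the grid values, the seed 42 and the names `small_grid`, `search` and `summary` are assumptions for illustration. Note that `mean_test_score` refers to the validation folds of the cross-validation, not to the held-out test set.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

dataset = load_diabetes()
x_tr, x_te, y_tr, y_te = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=42
)

# Reduced grid (4 combinations) so the example runs quickly
small_grid = {"n_estimators": [10, 30], "max_features": [2, 6]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    small_grid,
    cv=5,
    scoring="neg_mean_squared_error",
    return_train_score=True,
)
search.fit(x_tr, y_tr)

results = pd.DataFrame(search.cv_results_)
# Scores are negated MSE values; flip the sign to read them as MSE
results["mean_train_mse"] = -results["mean_train_score"]
results["mean_val_mse"] = -results["mean_test_score"]

columns = [
    "param_n_estimators",
    "param_max_features",
    "mean_train_mse",
    "mean_val_mse",
    "rank_test_score",
]
summary = results[columns].sort_values("rank_test_score")
print(summary)
```

A large gap between the training and validation columns for a given row suggests that combination still overfits.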


In [13]:
# Evaluate best estimator
train_mse, test_mse = eval_model(grid_search_cv.best_estimator_)

assert train_mse < 1000
assert test_mse < 3500

Training MSE: 464.79. Test MSE: 3173.31
