<a href="https://colab.research.google.com/github/bpesquet/machine-learning-katas/blob/master/notebooks/katas/algorithms/RandomForest_Diabetes.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open in Google Colaboratory"></a>


## Instructions

This is a self-correcting exercise generated by [nbgrader](https://github.com/jupyter/nbgrader). 

Fill in any place that says `YOUR CODE HERE` or `YOUR ANSWER HERE`. Run subsequent cells to check your code.

---

# Kata: Predict Diabetes Evolution

In this kata, you'll use several regression models to predict disease progression one year after baseline.

The Diabetes dataset contains ten baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) obtained for each of n = 442 diabetes patients, as well as the response of interest: a quantitative measure of disease progression one year after baseline.

## Package setup

In [1]:
# Import base packages
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

### Question

Import the needed packages.

In [26]:
# Import ML packages (edit this list if needed)
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

## Step 1: Loading the data

In [3]:
dataset = load_diabetes()

# Put data in a pandas DataFrame
df_diab = pd.DataFrame(dataset.data, columns=dataset.feature_names)
# Add target to DataFrame
df_diab['target'] = dataset.target
# Show 10 random samples
df_diab.sample(n=10)

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 41 | -0.099961 | -0.044642 | -0.067641 | -0.108957 | -0.074494 | -0.072712 | 0.015505 | -0.039493 | -0.049868 | -0.009362 | 55.0 |
| 307 | 0.067136 | 0.05068 | -0.030996 | 0.004658 | 0.024574 | 0.035638 | -0.028674 | 0.034309 | 0.023375 | 0.081764 | 172.0 |
| 253 | 0.081666 | -0.044642 | 0.033673 | 0.008101 | 0.052093 | 0.056619 | -0.017629 | 0.034309 | 0.034864 | 0.069338 | 150.0 |
| 371 | 0.052606 | 0.05068 | -0.009439 | 0.049415 | 0.050717 | -0.019163 | -0.013948 | 0.034309 | 0.119344 | -0.017646 | 197.0 |
| 225 | 0.030811 | 0.05068 | 0.032595 | 0.049415 | -0.040096 | -0.043589 | -0.069172 | 0.034309 | 0.063017 | 0.003064 | 208.0 |
| 306 | 0.009016 | 0.05068 | -0.001895 | 0.021872 | -0.03872 | -0.0248 | -0.006584 | -0.039493 | -0.03981 | -0.013504 | 44.0 |
| 120 | -0.049105 | -0.044642 | 0.004572 | 0.011544 | -0.037344 | -0.018537 | -0.017629 | -0.002592 | -0.03981 | -0.021788 | 200.0 |
| 382 | 0.048974 | -0.044642 | 0.060618 | -0.022885 | -0.023584 | -0.072712 | -0.043401 | -0.002592 | 0.104138 | 0.036201 | 132.0 |
| 55 | -0.04184 | -0.044642 | -0.049318 | -0.036656 | -0.007073 | -0.022608 | 0.085456 | -0.039493 | -0.066488 | 0.007207 | 128.0 |
| 125 | -0.005515 | 0.05068 | -0.008362 | -0.002228 | -0.033216 | -0.06363 | -0.036038 | -0.002592 | 0.080585 | 0.007207 | 161.0 |


## Step 2: Preparing the data

### Question

Split the dataset into training and test sets, holding out 20% of the samples for testing.

In [8]:
x_train, x_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.2)

In [9]:
print(f'x_train: {x_train.shape}. y_train: {y_train.shape}')
print(f'x_test: {x_test.shape}. y_test: {y_test.shape}')

assert x_train.shape == (353, 10)
assert y_train.shape == (353,)
assert x_test.shape == (89, 10)
assert y_test.shape == (89,)

x_train: (353, 10). y_train: (353,)
x_test: (89, 10). y_test: (89,)
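
The shapes checked by the assertions follow from how the 20% fraction is rounded. As a minimal sketch, assuming `train_test_split` rounds the test set size up and assigns the remaining samples to the training set (an assumption that matches the shapes printed above), the arithmetic can be reproduced with the standard library alone:

```python
import math

n_samples = 442   # number of patients in the dataset
test_size = 0.2   # fraction held out for testing

# Assumed rounding rule: the test size is rounded up,
# and the training set receives the remaining samples
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # prints: 353 89
```

Since 20% of 442 is 88.4, the test set cannot hold exactly that share, which is why the assertions expect 89 test samples and 353 training samples.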


## Step 3: Training several models

### Question

Create and train a `DecisionTreeRegressor`, a `GradientBoostingRegressor` and a `RandomForestRegressor` on the training data.

Compute their MSE on the training and test data.

In [27]:
from sklearn import tree
tree_reg = tree.DecisionTreeRegressor(random_state=42)
tree_reg.fit(x_train, y_train)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=42, splitter='best')

In [35]:
# For regressors, score() returns the coefficient of determination (R^2), not an accuracy
train_r2 = tree_reg.score(x_train, y_train)
test_r2 = tree_reg.score(x_test, y_test)

print("Train R^2", train_r2)
print("Test R^2", test_r2)

Train R^2 1.0
Test R^2 0.15061418165837814
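
For scikit-learn regressors, `score` returns the coefficient of determination R² rather than a classification accuracy. The sketch below computes R² by hand to show what the number means. The helper `r2_score_manual` and the toy values are hypothetical, introduced only for illustration.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1.0 - ss_res / ss_tot

# Toy targets, invented for illustration only
y_true = [100.0, 150.0, 200.0]

perfect = r2_score_manual(y_true, y_true)                   # predictions match exactly
baseline = r2_score_manual(y_true, [150.0, 150.0, 150.0])   # always predict the mean

print(perfect, baseline)  # prints: 1.0 0.0
```

A value of 1.0 on the training set therefore means the tree reproduces every training target exactly, while a much lower value on the test set indicates that the fit does not generalize well.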


In [23]:
from sklearn.ensemble import GradientBoostingRegressor
grad_boost_reg = GradientBoostingRegressor(random_state=42)
grad_boost_reg.fit(x_train, y_train)

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
                          learning_rate=0.1, loss='ls', max_depth=3,
                          max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=100,
                          n_iter_no_change=None, presort='auto',
                          random_state=42, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)

In [36]:
# For regressors, score() returns the coefficient of determination (R^2), not an accuracy
train_r2 = grad_boost_reg.score(x_train, y_train)
test_r2 = grad_boost_reg.score(x_test, y_test)

print("Train R^2", train_r2)
print("Test R^2", test_r2)

Train R^2 0.8568189555005084
Test R^2 0.40057156115115466


In [17]:
from sklearn.ensemble import RandomForestRegressor
random_forest_reg = RandomForestRegressor(random_state=42)
random_forest_reg.fit(x_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=42, verbose=0,
                      warm_start=False)

In [37]:
# For regressors, score() returns the coefficient of determination (R^2), not an accuracy
train_r2 = random_forest_reg.score(x_train, y_train)
test_r2 = random_forest_reg.score(x_test, y_test)

print("Train R^2", train_r2)
print("Test R^2", test_r2)

Train R^2 0.8937713185701133
Test R^2 0.3843917870106892
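
The cells in this step report the value returned by `score`, whereas the question asks for the MSE. One possible way to compute it is sketched below with `mean_squared_error`. The seeded split (`random_state=42`), the variable names `x_tr`, `x_te`, `y_tr`, `y_te`, and the `mse_results` dictionary are assumptions added so the sketch is self-contained and reproducible. Because the split earlier in this kata is not seeded, and default hyperparameters vary between scikit-learn versions, the numbers will not line up with the scores printed above.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Assumption: a seeded split, so repeated runs give the same partition
X, y = load_diabetes(return_X_y=True)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "decision tree": DecisionTreeRegressor(random_state=42),
    "gradient boosting": GradientBoostingRegressor(random_state=42),
    "random forest": RandomForestRegressor(random_state=42),
}

# Map each model name to its (train MSE, test MSE) pair
mse_results = {}
for name, model in models.items():
    model.fit(x_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(x_tr))
    test_mse = mean_squared_error(y_te, model.predict(x_te))
    mse_results[name] = (train_mse, test_mse)
    print(f"{name}: train MSE = {train_mse:.1f}, test MSE = {test_mse:.1f}")
```

Unlike R², MSE is expressed in squared units of the target, so lower is better and a large gap between the training and test values points to overfitting.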
