# 2A.ml - Reducing a random forest - exercise

The Lasso model selects variables, while a random forest produces its prediction as the average of a set of regression trees. What if we combined the two?

In [2]:
from jyquickhelper import add_notebook_menu
add_notebook_menu()

In [3]:
%matplotlib inline

## Datasets

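Since we always need data, we use the [Boston](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html) dataset. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in version 1.2, so the cells below require an older release.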

In [4]:
from sklearn.datasets import load_boston
data = load_boston()
X, y = data.data, data.target

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [6]:
data.feature_names

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

## Q1: fit a random forest

In [32]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

In [33]:
model = RandomForestRegressor(n_estimators=100)
model.fit(X_train, y_train)

# Optional: a single decision tree, for comparison with the forest.
# dt = DecisionTreeRegressor()
# dt.fit(X_train, y_train)

In [34]:
model.estimators_

[DecisionTreeRegressor(criterion='mse', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       presort=False, random_state=31267168, splitter='best'),
 DecisionTreeRegressor(criterion='mse', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       presort=False, random_state=602421061, splitter='best'),
 DecisionTreeRegressor(criterion='mse', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, m

In [35]:
len(model.estimators_)

100

## Q2: compute the mean of the forest's tree predictions yourself

This may seem unnecessary, but it confirms that a random forest's prediction is indeed the mean of the predictions of a set of regression trees.

In [10]:
pred_rf = model.predict(X_test)

In [36]:
print(X_test.shape)

(127, 13)


In [37]:
print(pred_rf.shape)

(127,)


In [11]:
pred_rf

array([ 9.367, 26.58 , 24.825, 33.738, 45.49 , 33.322, 31.738, 20.821,
       24.351, 33.17 , 22.483, 11.738, 19.396, 23.49 , 30.196, 22.001,
       19.488, 18.652, 34.887,  8.479, 11.707, 12.521, 13.833, 24.067,
       20.749, 20.85 , 20.801, 31.87 , 26.807, 13.632, 18.414, 23.376,
       33.891, 11.834, 20.874, 15.865, 26.599, 25.612, 19.753, 14.754,
       23.165, 28.709, 46.21 , 14.516, 19.079, 22.467, 11.281, 23.899,
       21.523, 28.515, 29.422, 47.124, 24.883, 39.905, 15.201, 23.013,
       12.716, 18.38 , 20.125, 41.052, 18.603, 24.233, 23.938, 35.174,
       17.37 , 26.977, 23.731, 45.005, 32.307, 43.609, 41.246, 42.073,
       28.101, 21.094, 20.266, 12.739, 14.451, 16.474, 21.222, 32.893,
       25.484, 22.748, 17.781, 20.272, 20.232, 15.03 , 19.888, 45.092,
       21.533, 26.526, 23.642, 26.456, 19.116, 33.611, 20.643, 26.403,
       22.259, 20.775, 24.772, 11.555,  7.435, 23.559, 15.089, 13.995,
       13.446,  9.903, 23.324, 10.117, 19.942, 14.026, 46.435, 15.702,
      

In [38]:
import numpy as np
pred_manuel = np.zeros(y_test.shape)

In [22]:
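for i, Xi in enumerate(X_test):
    # predict expects a 2D array, so reshape Xi into a single row
    Xi = Xi.reshape(1, -1)

    # Average the predictions of all the trees for this sample
    arbres_out_Xi = [arbre.predict(Xi) for arbre in model.estimators_]
    pred_manuel[i] = np.mean(arbres_out_Xi)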

In [23]:
pred_manuel

array([ 9.367, 26.58 , 24.825, 33.738, 45.49 , 33.322, 31.738, 20.821,
       24.351, 33.17 , 22.483, 11.738, 19.396, 23.49 , 30.196, 22.001,
       19.488, 18.652, 34.887,  8.479, 11.707, 12.521, 13.833, 24.067,
       20.749, 20.85 , 20.801, 31.87 , 26.807, 13.632, 18.414, 23.376,
       33.891, 11.834, 20.874, 15.865, 26.599, 25.612, 19.753, 14.754,
       23.165, 28.709, 46.21 , 14.516, 19.079, 22.467, 11.281, 23.899,
       21.523, 28.515, 29.422, 47.124, 24.883, 39.905, 15.201, 23.013,
       12.716, 18.38 , 20.125, 41.052, 18.603, 24.233, 23.938, 35.174,
       17.37 , 26.977, 23.731, 45.005, 32.307, 43.609, 41.246, 42.073,
       28.101, 21.094, 20.266, 12.739, 14.451, 16.474, 21.222, 32.893,
       25.484, 22.748, 17.781, 20.272, 20.232, 15.03 , 19.888, 45.092,
       21.533, 26.526, 23.642, 26.456, 19.116, 33.611, 20.643, 26.403,
       22.259, 20.775, 24.772, 11.555,  7.435, 23.559, 15.089, 13.995,
       13.446,  9.903, 23.324, 10.117, 19.942, 14.026, 46.435, 15.702,
      

At first glance, the two arrays are identical.
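The visual comparison of the two arrays can be backed by a numerical check. The sketch below is a minimal, self-contained illustration on synthetic data; the names `X_demo`, `forest` and `per_tree` are assumptions introduced for the example, not part of the notebook. It stacks the per-tree predictions in one array and compares their mean with `predict` using `np.allclose`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption, used only to keep the sketch self-contained).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0,
                                 random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=0)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(Xd_train, yd_train)

# One row per tree, one column per test sample.
per_tree = np.array([tree.predict(Xd_test) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)

# Compare up to floating-point tolerance rather than by eye.
same = np.allclose(manual_mean, forest.predict(Xd_test))
print("Manual average matches forest.predict:", same)
```

Calling `predict` once per tree on the whole test matrix also avoids the per-row loop, which tends to be much slower on larger datasets.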

In [48]:
loss_rf = np.mean((y_test - pred_rf)**2)
print("Loss RF = {}".format(loss_rf))

Loss RF = 10.057618488188961


In [49]:
loss_manuel = np.mean((y_test - pred_manuel)**2)
print("Manual loss = {}".format(loss_manuel))

Manual loss = 10.057618488188961


In [51]:
from sklearn.metrics import r2_score

In [52]:
print("R2 score random forest = {}".format(r2_score(y_test, pred_rf)))

R2 score random forest = 0.9082731624788479


## Q3: weight the trees with a linear regression

The random forest can be seen as a way of creating new features, exactly 100 of them here (one per tree), which we then use to fit a linear regression. Your turn.

In [6]:
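import numpy as np
from sklearn.linear_model import LinearRegression


def new_features(model, X):
    # One column per tree: X_new[i, j] is the prediction of tree j for sample i
    X_new = np.zeros((X.shape[0], len(model.estimators_)))

    for i, Xi in enumerate(X):
        # predict expects a 2D array, so reshape Xi into a single row
        Xi = Xi.reshape(1, -1)

        # predict returns an array of length 1; keep the scalar value
        arbres_out_Xi = [arbre.predict(Xi)[0] for arbre in model.estimators_]
        X_new[i, :] = arbres_out_Xi

    return X_new


# Fit the linear regression on the per-tree predictions of the forest from Q1
X_train_2 = new_features(model, X_train)
lr = LinearRegression()
lr.fit(X_train_2, y_train)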
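The feature construction of Q3 can also be written without the per-row loop, and the weighted forest can then be compared with the plain average on held-out data. The following is a minimal sketch on synthetic data; `tree_features`, `X_demo` and `stacked` are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def tree_features(forest, X):
    """Return an (n_samples, n_trees) matrix of per-tree predictions."""
    return np.column_stack([tree.predict(X) for tree in forest.estimators_])


# Synthetic data (an assumption, used only to keep the sketch self-contained).
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0,
                                 random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=0)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(Xd_train, yd_train)

F_train = tree_features(forest, Xd_train)
F_test = tree_features(forest, Xd_test)

# The linear regression learns one weight per tree instead of a uniform 1/n.
stacked = LinearRegression().fit(F_train, yd_train)

r2_forest = r2_score(yd_test, forest.predict(Xd_test))
r2_stacked = r2_score(yd_test, stacked.predict(F_test))
print(f"R2 forest  = {r2_forest:.3f}")
print(f"R2 stacked = {r2_stacked:.3f}")
```

Because the weights are fitted on the same samples that were used to grow the trees, the training fit is likely optimistic, so the comparison is only meaningful on the test set.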

## Q4: what happens if we replace the linear regression with a Lasso?

A quick reminder: the [Lasso](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html) is a way of selecting variables, because its L1 penalty drives some coefficients to exactly zero.
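One possible way to approach Q4 is sketched below on synthetic data. It is not the reference solution: the value `alpha=1.0`, the synthetic dataset and the names `F_train`, `F_test` and `kept` are assumptions chosen for illustration. Each tree whose Lasso coefficient is zero no longer contributes to the prediction, so counting the nonzero coefficients gives the size of the reduced forest.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption, used only to keep the sketch self-contained).
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0,
                                 random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=0)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(Xd_train, yd_train)

# Per-tree predictions used as features: one column per tree.
F_train = np.column_stack([t.predict(Xd_train) for t in forest.estimators_])
F_test = np.column_stack([t.predict(Xd_test) for t in forest.estimators_])

# alpha=1.0 is an arbitrary choice; larger values remove more trees.
lasso = Lasso(alpha=1.0, max_iter=10000)
lasso.fit(F_train, yd_train)

# Indices of the trees that keep a nonzero weight.
kept = np.flatnonzero(lasso.coef_)
print("Trees kept:", len(kept), "out of", len(forest.estimators_))
print("R2 on test:", lasso.score(F_test, yd_test))
```

The trees listed in `kept` are the only ones that need to be evaluated at prediction time, which is where the reduction of the forest comes from.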

## Q5: plot the performance and the number of trees as a function of alpha
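A possible starting point for Q5 is to sweep a grid of `alpha` values and record, for each one, the test R2 and the number of trees with a nonzero coefficient. The sketch below only computes the two curves; the grid `np.logspace(-2, 2, 5)`, the synthetic data and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption, used only to keep the sketch self-contained).
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0,
                                 random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=0)

forest = RandomForestRegressor(n_estimators=30, random_state=0)
forest.fit(Xd_train, yd_train)

F_train = np.column_stack([t.predict(Xd_train) for t in forest.estimators_])
F_test = np.column_stack([t.predict(Xd_test) for t in forest.estimators_])

# Arbitrary logarithmic grid of regularisation strengths.
alphas = np.logspace(-2, 2, 5)
r2_scores, n_trees = [], []

for alpha in alphas:
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(F_train, yd_train)
    r2_scores.append(lasso.score(F_test, yd_test))          # R2 on test data
    n_trees.append(int(np.count_nonzero(lasso.coef_)))      # trees still used

for alpha, r2, n in zip(alphas, r2_scores, n_trees):
    print(f"alpha={alpha:g}  R2={r2:.3f}  trees={n}")
```

The lists `r2_scores` and `n_trees` can then be plotted against `alphas`, for instance with matplotlib on a logarithmic x-axis and two y-axes, to see how far the forest can be reduced before the performance degrades.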