OpenClassrooms
Project 4, Data Scientist
Author : Oumeima EL GHARBI
Date : August 2022

One notebook per prediction (CO2 emissions and total energy consumption) presenting a cleaned-up version of the model tests, in which the final model chosen is clearly identified.


The goal is to do without future annual consumption readings (watch out for data leakage). A reference reading will be taken anyway during the first year for any new building, so nothing prevents you from deriving structural variables of the buildings from it, for example the nature and proportions of the energy sources used.

Pay close attention to how the different variables are processed, both to find new information (can anything interesting be deduced from a simple address?) and to optimise performance by applying simple transformations to the variables (normalisation, log transform, etc.).

Set up a rigorous evaluation of the regression performance, and optimise the hyperparameters and the choice of ML algorithms using cross-validation.



Models to test: **linear regression (with different regularisations: Ridge, Lasso, Elastic Net), Random Forest, XGBoost**
Compare the performance of the different models using the **MAE** (Mean Absolute Error).
Also optimise the hyperparameters of each model with **GridSearch**.
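
As a rough illustration of the two requirements above (MAE as the comparison metric, grid search for the hyperparameters), here is a minimal sketch. It runs on synthetic data from `make_regression`, not on the project dataset, and the names `X_demo`, `y_demo` and the `alpha` grid are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the building features (assumption: not the project data).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},  # illustrative grid
    scoring="neg_mean_absolute_error",  # scikit-learn maximises scores, hence the negated MAE
    cv=5,
)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["alpha"]
cv_mae = -search.best_score_  # flip the sign back to obtain a positive MAE
print("best alpha:", best_alpha, "cross-validated MAE:", round(cv_mae, 3))
```

The same `scoring` string can be passed to every model's grid search so that all candidates are ranked on the same metric.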


Evaluate :

https://cloud.google.com/automl-tables/docs/evaluate?hl=fr


KFOLD

Input: data X (dimension n x p), labels y (dimension n), number of folds k

Split [0, 1, ..., n-1] into k parts of size (n/k). (The last part will be slightly smaller if n is not a multiple of k.)

for i=0 to (k-1):
    Build the test set (X_test, y_test) by restricting X and y to the indices in the i-th part.
    Build the training set (X_train, y_train) by restricting X and y to the other indices.
    Train the algorithm on the training set
    Use the resulting model to predict on the test set
    Compute the model's error by comparing the predicted labels with the true labels in y_test

Output: the mean of the errors computed over the k folds.
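
The K-fold procedure above can be sketched in a few lines of Python. This is only an illustration: the helper name `kfold_mean_error`, the toy data and the choice of the mean absolute error are assumptions. `np.array_split` makes the first folds slightly larger instead of making the last one smaller, which is a minor variation on the description.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def kfold_mean_error(model, X, y, k=5):
    """Return the mean absolute error averaged over k contiguous folds."""
    n = len(y)
    # np.array_split copes with n not being a multiple of k.
    folds = np.array_split(np.arange(n), k)
    errors = []
    for test_idx in folds:
        # Every index that is not in the current fold goes to the training set.
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        model.fit(X[train_idx], y[train_idx])
        y_pred = model.predict(X[test_idx])
        errors.append(np.mean(np.abs(y_pred - y[test_idx])))
    return float(np.mean(errors))

# Toy data (assumption): y is an exact linear function of X, so the error is close to 0.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + 3.0
cv_error = kfold_mean_error(LinearRegression(), X_toy, y_toy, k=5)
print(cv_error)
```

In practice `sklearn.model_selection.KFold` or `cross_val_score` does the same job and also supports shuffling.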

In [None]:
# 1) Regression models (plain linear regression / Elastic Net / Ridge / Lasso)
# 2) Random Forest
# 3) XGBoost

# Variable to predict: total GHG emissions (the last one to predict)
# This variable depends on the buildings' energy consumption: (1) predict electricity, steam, natural gas and one other energy source, then (2) reuse those predictions to predict emissions

### Introduction

#### Importing libraries

In [None]:
%reset -f

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
%autosave 300

#### Loading dataset

In [None]:
columns_to_categorize = ["BuildingType", "PrimaryPropertyType", "Neighborhood", "ZipCode", "CouncilDistrictCode", "LargestPropertyUseType", "SecondLargestPropertyUseType", "ThirdLargestPropertyUseType"]
#  "Neighborhood",
category_types = {column: 'category' for column in columns_to_categorize}
print("This dictionary will be used when reading the csv file to assign a type to categorical features :", category_types)

In [None]:
dataset_path = "dataset/cleaned/2016_Building_Energy_Prediction.csv"
# we assign a categorical dtype to the categorical features
data = pd.read_csv(dataset_path, dtype=category_types, sep=",")

In [None]:
data.shape

In [None]:
data.dtypes

In [None]:
# target to predict: Electricity
# train / test split
# standardisation = subtract the mean and divide by the standard deviation (scaling: puts the quantitative variables on the same scale)
# categorical variables: encoding (one-hot encoder)

# end of feature engineering

# 2) train the models
# performance
# computation time
# plot showing the performance of each model (barplot)
# goal: finish the exploration / finish the feature engineering
# goal: a first clean notebook (as far as possible)

In [None]:
data.columns

In [None]:
#features_for_prediction = ["YearBuilt",  "BuildingType","PrimaryPropertyType", "Neighborhood", "NumberofFloors", "PropertyGFATotal", "PropertyGFAParking", "SecondLargestPropertyUseTypeGFA", "ThirdLargestPropertyUseTypeGFA"]

features_for_prediction = ["YearBuilt", "NumberofBuildings", "NumberofFloors", "LargestPropertyUseTypeGFA", "SecondLargestPropertyUseTypeGFA", "ThirdLargestPropertyUseTypeGFA",
                           "BuildingType","PrimaryPropertyType", "Neighborhood", "LargestPropertyUseType", "SecondLargestPropertyUseType", "ThirdLargestPropertyUseType"]

variable_to_predict = "Log2-Electricity(kBtu)"

features_for_prediction.append(variable_to_predict)
print(features_for_prediction)


In [None]:
data = data[features_for_prediction]

data

## I) Feature Engineering : preparing the vectors and matrices


#### 1) Separating training data and target vector

In [None]:
# we create the data matrix / we only take the features
X = data[data.columns[:-1]]

# we create the target vector
y = data[variable_to_predict].values # numpy array not a DataFrame anymore

print("Shape of X :", X.shape)
print("Shape of y :", y.shape)

In [None]:
X

#### 2) Splitting into training and test sets


In [None]:
print("We have to separate the train / test sets before normalising the dataset.")

In [None]:
# We create a training set and a test set (the test set contains 30% of the dataset)
from sklearn import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3,  random_state=42)

In [None]:
X_train.shape

In [None]:
X_test.shape

#### 3) Normalization

In [None]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler

In [None]:
print("We separate categorical variables from numerical variables.")

In [None]:
X.select_dtypes(['category','object']) # we don't have 'object' here but it is just in case.

categorical_columns = X.select_dtypes(['category','object']).columns
numerical_columns = X.select_dtypes(include='number').columns.drop("YearBuilt")
print("We won't normalise the year so we drop it from numerical_columns.")

print("Shape of categorical variables : ", categorical_columns.shape)
print("Shape of numerical variables :", numerical_columns.shape)

##### 1) Data Standardisation

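We have to standardize the variables before fitting a **Ridge Regression**, because its penalty depends on the scale of each coefficient.
Standardizing means that each variable is centred and rescaled so that it has a **mean** of 0 and a **standard deviation** of 1 (both computed on the training set).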
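
To make the definition concrete, here is a small worked example on made-up numbers (the arrays `train` and `test` are assumptions, not project data). It checks that `StandardScaler` computes z = (x - mean) / std from the training rows and reuses those statistics for the test row.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[5.0]])

scaler = StandardScaler().fit(train)  # mean and std come from the training rows only
train_std = scaler.transform(train)
test_std = scaler.transform(test)     # the test row reuses the training statistics

# Same computation by hand, with the population standard deviation (ddof=0).
manual = (train - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(train_std, manual))
print(train_std.mean(), train_std.std())
```

Fitting the scaler on the training set only is what keeps information from the test set out of the preprocessing.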

In [None]:
print("Numerical variables standardization")
print("We have :", numerical_columns.shape[0], "numerical features to standardize.",end="\n\n")

print(numerical_columns)

In [None]:
# We train / fit the scaler on the training set / Computes the mean and std to be used for later scaling.
std_scale = StandardScaler().fit(X_train[numerical_columns])
# We transform the training set and the testing set / Performs standardization by centering and scaling.
X_train_std = X_train.copy()
X_test_std = X_test.copy()

X_train_std[numerical_columns] = std_scale.transform(X_train[numerical_columns])
X_test_std[numerical_columns] = std_scale.transform(X_test[numerical_columns])

print("Before")
display(X_train)
print("After")
display(X_train_std)

In [None]:
def densite(df, lines=7, cols=4):
    """
    Input : dataframe, number of rows and columns of the grid
    Output : grid of density curves for the numerical variables of the dataframe
    """
    df = df.select_dtypes(include='number').copy()

    # squeeze=False keeps ax two-dimensional even with a single row or column
    fig, ax = plt.subplots(lines, cols, figsize=(min(15, cols * 3), lines * 2), squeeze=False)

    for i, val in enumerate(df.columns.tolist()):
        # kdeplot replaces the deprecated sns.distplot(hist=False, kde_kws={'shade': True})
        bp = sns.kdeplot(x=df[val], fill=True, ax=ax[i // cols, i % cols])
        bp.set_title("skewness : " + str(round(df[val].skew(), 1)), fontsize=12)
        bp.set_yticks([])

    # hide the unused axes of the grid
    for i in range(len(df.columns), lines * cols):
        ax[i // cols, i % cols].axis('off')

    plt.tight_layout()

print("Standardization only centres and rescales: the shape (and skewness) of each distribution is unchanged.")
densite(X_train_std[numerical_columns])

print("IMPORT FUNCTIONS / DENSITE")

##### 2) Feature Encoding : One Hot Encoder

In [None]:
print("Categorical variables encoding")

print("We have :", categorical_columns.shape[0], "categorical features to encode.", end="\n\n")
print(categorical_columns)

In [None]:
X.dtypes # we check that we have categories

In [None]:
X[categorical_columns].nunique()

In [None]:
X_train_std[categorical_columns]

##### Encoding the categorical features of the train set


In [None]:
print("Now, we can use the One Hot Encoder.")
print("With the one hot encoder, we will get :", X[categorical_columns].nunique().sum(), "columns to encode the categorical features.")

In [None]:
# 0) creating instance of one-hot-encoder
one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False) # if sparse=True (by default), we need to add .toarray() to encoded_categorical_data

# 1) Fit the encoder on the training set
one_hot_encoder.fit(X_train_std[categorical_columns])

# 2) we get the encoded numpy array
encoded_categorical_data = one_hot_encoder.transform(X_train_std[categorical_columns])

# 3) we make a list of the columns names
encoded_categorical_data_names = one_hot_encoder.get_feature_names_out().tolist()
print("We have indeed :", len(encoded_categorical_data_names), "labels after encoding the categorical variables.")

# 4) we recreate a dataframe with the column names and the numpy array
X_train_encoded = pd.DataFrame(columns=encoded_categorical_data_names,
                               data=encoded_categorical_data,
                               index=X_train_std.index)
display(X_train_encoded.sort_index())

In [None]:
# 5) Concatenate the two dataframes for the training set

print("We need to add YearBuilt to the list of features.")
numerical_columns.tolist()
features_to_merge = numerical_columns.tolist().copy()
features_to_merge.append("YearBuilt")
print(features_to_merge, end="\n\n")

# TODO (ask Jeremy): is merging on the index OK, or should OSEBuildingID be restored as the key?
X_train_std_encoded = pd.merge(X_train_std[features_to_merge].sort_index(), X_train_encoded.sort_index(), left_index=True, right_index=True)
display(X_train_std_encoded.sort_index())

##### Encoding the categorical features of the test set

In [None]:
# 5) One Hot Encoding on the testing set

# 5.1) we get the encoded numpy array
TEST_encoded_categorical_data = one_hot_encoder.transform(X_test_std[categorical_columns])

# TODO (ask Jeremy): confirm that rebuilding the DataFrame as below is the right approach
# 5.2) we recreate a dataframe with the column names and the numpy array
X_test_encoded = pd.DataFrame(columns=encoded_categorical_data_names,
                               data=TEST_encoded_categorical_data,
                               index=X_test_std.index)
display(X_test_encoded.sort_index())

# TODO (ask Jeremy): is merging on the index OK, or should OSEBuildingID be restored as the key?
X_test_std_encoded = pd.merge(X_test_std[features_to_merge].sort_index(), X_test_encoded.sort_index(), left_index=True, right_index=True)
display(X_test_std_encoded.sort_index())

# Save
X_train_std_encoded.to_csv("dataset/cleaned/electricity/X_train.csv", index=False)
X_test_std_encoded.to_csv("dataset/cleaned/electricity/X_test.csv", index=False)
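# y_train and y_test are NumPy arrays (see .values above), so we wrap them in a Series before saving
pd.Series(y_train, name=variable_to_predict).to_csv("dataset/cleaned/electricity/y_train.csv", index=False)
pd.Series(y_test, name=variable_to_predict).to_csv("dataset/cleaned/electricity/y_test.csv", index=False)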

## II) Modelling

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

In [None]:
print("We can now start modelling to predict the target variable.")

display(X_train_std_encoded)
display(X_test_std_encoded)
print(y_train.shape, y_test.shape)

In [None]:
print("We rename X_train_std_encoded to X_train, the same for X_test.")
X_train = X_train_std_encoded.copy()
X_test = X_test_std_encoded.copy()

X_train
X_test

### 1) Linear modelling : Linear Regression / Ridge Regression / Lasso / Elastic Net


In [None]:
X_train.shape

In [None]:
X_test.shape

#### 1) Linear Regression : baseline

In [None]:
from sklearn import linear_model

# 0) We create a linear regression model
lr = linear_model.LinearRegression()

# 1) Training Linear Regression and Evaluating
reg = lr.fit(X_train, y_train)

# score() returns the coefficient of determination R² for a regressor, not an accuracy
prediction_score = lr.score(X_test, y_test)
print('R² on the test set : {:.2f}'.format(prediction_score))

In [None]:
# We use the mean squared error (squared L2 norm) on the test set as the baseline
y_pred = lr.predict(X_test)
baseline_error = np.mean((y_pred - y_test) ** 2)

# The baseline mean squared error is printed below
print(baseline_error)

In [None]:
from sklearn.metrics import mean_absolute_error,r2_score,mean_squared_error

def run_experiment(model):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print("R² : ", r2_score(y_test, y_pred))
    print("MAE :", mean_absolute_error(y_test,y_pred))
    print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))

run_experiment(lr)

In [None]:
print("Electricity prediction")
plt.plot(y_pred, y_test, "ro", markersize=4)
plt.xlabel("predicted values")
plt.ylabel("true values")
plt.show()

print("If the predictions were good, the points would lie along the diagonal y = x, which is not the case here.")

#### 2) Linear Model : Ridge

In [None]:
n_alphas = 50  # number of values tried for the hyperparameter alpha
alphas = np.logspace(-5, 5, n_alphas)

ridge = linear_model.Ridge()

coefs = []
errors = []
for a in alphas:
    ridge.set_params(alpha=a)
    ridge.fit(X_train, y_train)
    coefs.append(ridge.coef_)
    errors.append(np.mean((ridge.predict(X_test) - y_test) ** 2))

In [None]:
# how the error behaves as alpha varies (the horizontal line is the linear regression baseline)

ax = plt.gca()
ax.plot(alphas, errors, [10**-5, 10**5], [baseline_error, baseline_error])
ax.set_xscale('log')
plt.show()

In [None]:
ax = plt.gca()

ax.plot(alphas, errors)
ax.set_xscale('log')
plt.xlabel('alpha')
plt.ylabel('error')
plt.axis('tight')
plt.show()

In [None]:
# index of the minimum error
np.argmin(errors)

In [None]:
# retrieve the minimum error
errors[np.argmin(errors)]

In [None]:
# retrieve the alpha associated with this minimum error
alphas[np.argmin(errors)]
# alphas[35]

In [None]:
# regularization path
ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale('log')
plt.show()

In [None]:
ax = plt.gca()

ax.plot(alphas, coefs)
ax.set_xscale('log')
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('Ridge coefficients as a function of the regularization')
plt.axis('tight')
plt.show()

In [None]:
min(errors)
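
The loop above measures the error of each `alpha` on the test set, so the selected value is tuned on the data later used for evaluation. A possible variant, sketched here on synthetic data (`X_syn`, `y_syn` and the split names are assumptions), selects `alpha` by 5-fold cross-validation on the training set and only touches the test set once at the end.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the encoded building matrix (assumption).
X_syn, y_syn = make_regression(n_samples=300, n_features=20, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=42)

alphas = np.logspace(-5, 5, 50)
cv_mse = []
for a in alphas:
    # Scores are negated MSE values, one per fold; average them and flip the sign.
    scores = cross_val_score(Ridge(alpha=a), X_tr, y_tr,
                             scoring="neg_mean_squared_error", cv=5)
    cv_mse.append(-scores.mean())

best_alpha = alphas[int(np.argmin(cv_mse))]

# The test set is used a single time, for the final estimate.
final_model = Ridge(alpha=best_alpha).fit(X_tr, y_tr)
test_mse = np.mean((final_model.predict(X_te) - y_te) ** 2)
print("alpha:", best_alpha, "test MSE:", test_mse)
```

`sklearn.linear_model.RidgeCV` offers the same selection in a single estimator.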

#### 3) Linear Model : LASSO

#### 4) Linear Model : Elastic Net

In [50]:
from sklearn.linear_model import ElasticNet

# reminder of the Elastic Net cost function
# 1 / (2 * n_samples) * ||y - Xw||^2_2 + alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||^2_2

parameters = {'tol' : [0.1,0.01,0.001,0.0001],
              "alpha": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],  # alpha: coefficient multiplying the penalty term
              "l1_ratio": np.arange(0.0, 1.0, 0.1)}  # l1_ratio = 1 is equivalent to a Lasso, 0 to a Ridge; this grid stops at 0.9


elastic_grid = GridSearchCV(estimator = ElasticNet(),
                            param_grid = parameters,
                            scoring = 'neg_mean_squared_error',
                            cv=5,
                            verbose=0
                            )

elastic_grid.fit(X_train, y_train)

In [51]:
elastic_grid.best_params_

{'alpha': 1, 'l1_ratio': 0.0, 'tol': 0.1}
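
The cost function recalled in the comment of the Elastic Net cell can be written out directly, which makes the role of `l1_ratio` easier to see. The helper `elastic_net_objective` and the two-point example are assumptions made for illustration, and the intercept is ignored.

```python
import numpy as np

def elastic_net_objective(X, y, w, alpha, l1_ratio):
    """Elastic Net cost for weights w (no intercept), following the formula in the comment."""
    n_samples = X.shape[0]
    residual = y - X @ w
    data_fit = residual @ residual / (2 * n_samples)
    l1_penalty = alpha * l1_ratio * np.sum(np.abs(w))
    l2_penalty = 0.5 * alpha * (1 - l1_ratio) * np.sum(w ** 2)
    return data_fit + l1_penalty + l2_penalty

# Hand-checkable example: residual = [0, 1], so the data term is 1 / (2 * 2) = 0.25.
X_small = np.array([[1.0, 0.0], [0.0, 1.0]])
y_small = np.array([1.0, 2.0])
w_small = np.array([1.0, 1.0])

lasso_like = elastic_net_objective(X_small, y_small, w_small, alpha=1.0, l1_ratio=1.0)  # 0.25 + 2.0
ridge_like = elastic_net_objective(X_small, y_small, w_small, alpha=1.0, l1_ratio=0.0)  # 0.25 + 1.0
print(lasso_like, ridge_like)
```

With `l1_ratio=0.0`, as in the best parameters reported above, only the squared L2 penalty remains.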

In [52]:
import math

# DataFrame.append was removed in pandas 2.0, so the results table is created directly
results = pd.DataFrame({
    'Modèle' : ['Elasticnet Regression'],
    'Score_RMSE' : [math.sqrt(mean_squared_error(y_test, elastic_grid.predict(X_test)))]})

### 2) Ensemble learning methods

#### 1) Parallelized Implementation : Random Forest

In [53]:
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor(n_estimators=1000) # nb of trees 1000 for the forest

In [54]:
X_train

Unnamed: 0,NumberofBuildings,NumberofFloors,LargestPropertyUseTypeGFA,SecondLargestPropertyUseTypeGFA,ThirdLargestPropertyUseTypeGFA,YearBuilt,BuildingType_Campus,BuildingType_Multifamily HR (10+),BuildingType_Multifamily LR (1-4),BuildingType_Multifamily MR (5-9),...,ThirdLargestPropertyUseType_Refrigerated Warehouse,ThirdLargestPropertyUseType_Residence Hall/Dormitory,ThirdLargestPropertyUseType_Restaurant,ThirdLargestPropertyUseType_Retail Store,ThirdLargestPropertyUseType_Social/Meeting Hall,ThirdLargestPropertyUseType_Strip Mall,ThirdLargestPropertyUseType_Supermarket/Grocery Store,ThirdLargestPropertyUseType_Swimming Pool,ThirdLargestPropertyUseType_Vocational School,ThirdLargestPropertyUseType_Worship Facility
1,-0.108665,-0.499734,-0.531541,-0.141522,0.134019,1928,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,-0.108665,-0.314688,-0.431401,-0.382358,-0.207072,1925,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,-0.108665,1.905866,0.068770,-0.382358,-0.207072,1971,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,-0.108665,-0.314688,-0.431099,-0.266691,-0.207072,2001,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,-0.108665,-0.129642,-0.305071,-0.382358,-0.207072,1996,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3124,-0.108665,0.055404,-0.241025,-0.382358,-0.207072,1925,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3125,-0.108665,-0.499734,-0.489641,-0.006164,-0.046558,1927,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3126,4.469907,-0.314688,-0.530418,0.006725,0.871936,2001,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3128,-0.108665,0.055404,-0.515493,-0.141054,-0.039870,2001,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [55]:
rfr = rfr.fit(X_train.values, y_train)

In [56]:
import timeit

start_time = timeit.default_timer()

pred = rfr.predict(X_test.values)

elapsed = timeit.default_timer() - start_time

# score() returns R² for a regressor, not a classification accuracy
r2 = rfr.score(X_test.values, y_test)

print("R² {:.2f} time {:.2f}s".format(r2, elapsed))


R² -0.11 time 0.27s


In [57]:
from sklearn.feature_selection import SelectFromModel
model = SelectFromModel(rfr, prefit=True, threshold=0.01)
X_train2 = model.transform(X_train.values)
X_train2.shape

(2191, 11)

In [58]:
X_train2

array([[-0.49973446, -0.53154056, -0.14152203, ...,  0.        ,
         0.        ,  0.        ],
       [-0.31468828, -0.4314006 , -0.38235791, ...,  0.        ,
         0.        ,  0.        ],
       [ 1.90586576,  0.06876975, -0.38235791, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [-0.31468828, -0.53041775,  0.00672492, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.05540406, -0.51549261, -0.14105385, ...,  0.        ,
         0.        ,  0.        ],
       [-0.49973446, -0.56800903, -0.16203921, ...,  0.        ,
         0.        ,  0.        ]])

In [59]:
rfr2 = RandomForestRegressor(n_estimators=1000) # 1000 trees in the forest
# fit the new forest (calling rfr.fit here would refit rfr and make rfr2 an alias of it)
rfr2 = rfr2.fit(X_train2, y_train)


In [60]:
run_experiment(rfr)

"""
R² :  -0.036410020297567236
MAE : 1.3407748367227499
RMSE: 1.698681454232116
"""

R² :  -0.10351319339504972
MAE : 1.3727456535131668
RMSE: 1.7673230612188662



In [61]:
run_experiment(rfr2)

R² :  -0.10804641673221704
MAE : 1.375679269249829
RMSE: 1.7709494152493972


In [62]:
from sklearn.ensemble import RandomForestRegressor

parameters = {
    'n_estimators' : [10,50,100,300,500], # number of decision trees
    'min_samples_leaf' : [1,3,5,10], # minimum number of samples required in a leaf
    'max_features': ['auto', 'sqrt'] # number of features considered at each split ('auto' is deprecated since scikit-learn 1.1, hence the warnings)
}

In [None]:
rfr_search = GridSearchCV(RandomForestRegressor(),
                          param_grid = parameters,
                          #scoring='mean_squared_error',
                          verbose=2,
                          cv=5)

rfr_search.fit(X_train, y_train)

Fitting 5 folds for each of 40 candidates, totalling 200 fits
[CV] END max_features=auto, min_samples_leaf=1, n_estimators=10; total time=   0.1s
[CV] END max_features=auto, min_samples_leaf=1, n_estimators=10; total time=   0.1s
[CV] END max_features=auto, min_samples_leaf=1, n_estimators=10; total time=   0.2s
[CV] END max_features=auto, min_samples_leaf=1, n_estimators=10; total time=   0.2s
[CV] END max_features=auto, min_samples_leaf=1, n_estimators=10; total time=   0.4s
[CV] END max_features=auto, min_samples_leaf=1, n_estimators=50; total time=   3.8s
[CV] END max_features=auto, min_samples_leaf=1, n_estimators=50; total time=   3.3s
[CV] END max_features=auto, min_samples_leaf=1, n_estimators=50; total time=   1.7s
[CV] END max_features=auto, min_samples_leaf=1, n_estimators=50; total time=   1.0s
[CV] END max_features=auto, min_samples_leaf=1, n_estimators=50; total time=   1.2s
[CV] END max_features=auto, min_samples_leaf=1, n_estimators=100; total time=   2.2s
[CV] END max_features=auto, min_samples_leaf=1, n_estimators=100; total time=   3.0s
[CV] END max_features=auto, min_samples_leaf=1, n_estimators=100; total time=   2.9s
[CV] END max_features=auto, min_samples_leaf=1, n_estimators=100; total time=   2.1s
[CV] END max_features=auto, min_samples_leaf=1, n_estimators=100; total time=   2.5s
[CV] END max_features=auto, min_samples_leaf=1, n_estimators=300; total time=   7.3s
[CV] END max_features=auto, min_samples_leaf=1, n_estimators=300; total time=   8.4s
[CV] END max_features=auto, min_samples_leaf=1, n_estimators=300; total time=   8.5s
[CV] END max_features=auto, min_samples_leaf=1, n_estimators=300; total time=   7.8s
[CV] END max_features=auto, min_samples_leaf=1, n_estimators=300; total time=   8.0s
[CV] END max_features=auto, min_samples_leaf=1, n_estimators=500; total time=  16.5s
[CV] END max_features=auto, min_samples_leaf=1, n_estimators=500; total time=  12.8s
[CV] END max_features=auto, min_samples_leaf=1, n_estimators=500; total time=  15.9s
[CV] END max_features=auto, min_samples_leaf=1, n_estimators=500; total time=  14.2s

In [None]:
rfr_search.best_params_

In [None]:
import math
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported replacement
results = pd.concat([results, pd.DataFrame({
    'Modèle' : ['Random Forest Regressor'],
    'Score_RMSE' : [math.sqrt(mean_squared_error(y_test, rfr_search.predict(X_test)))]})],
    ignore_index=True)

In [None]:
# feature_importances_ has one value per column of the encoded training matrix, so we label it with X_train.columns
coefficients = rfr_search.best_estimator_.feature_importances_
liste_coefs_rer = pd.DataFrame({'Variable': X_train.columns,
                                'Coefficient': coefficients}).sort_values(by='Coefficient', ascending=False)

In [None]:
plt.figure(figsize=(8,8))
plt.title('RandomForestRegressor - Importance of the top 20 features')
sns.barplot(y = liste_coefs_rer['Variable'].head(20),
            x = liste_coefs_rer['Coefficient'].head(20))
plt.show()

In [None]:
# Tryout: keep only the first six columns (the numerical features and YearBuilt)
X_train_tryout = X_train[X_train.columns[:6]].copy()
X_test_tryout = X_test[X_test.columns[:6]].copy()

X_train_tryout

In [None]:
# use a fresh forest so that rfr is not overwritten
rfr_try = RandomForestRegressor(n_estimators=1000).fit(X_train_tryout.values, y_train)


In [None]:

start_time = timeit.default_timer()
pred = rfr_try.predict(X_test_tryout.values)
elapsed = timeit.default_timer() - start_time

accuracy = rfr_try.score(X_test_tryout.values, y_test)
print("accuracy {:.2f} time {:.2f}s".format(accuracy, elapsed))


In [None]:
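# Caution: run_experiment() refits the model on the full X_train, so this does not evaluate the six-feature tryout
run_experiment(rfr_try)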

#### 2) Sequence Tree : XGBoost

#### Exporting the models for later reuse


#### Loading the models


#### Model comparison


### III) Evaluation

#### Checking the predictions


#### Value of the Energy Star Score
