# Anticipez les besoins en consommation de bâtiments - *Notebook prediction SiteEnergyUseWN(kBtu)*

## Mission

Vous travaillez pour la ville de Seattle. Pour atteindre son objectif de ville neutre en émissions de carbone en 2050, votre équipe s’intéresse de près à la consommation et aux émissions des bâtiments non destinés à l’habitation.

Des relevés minutieux ont été effectués par les agents de la ville en 2016. Cependant, ces relevés sont coûteux à obtenir, et à partir de ceux déjà réalisés, vous voulez tenter de prédire les émissions de CO2 et la consommation totale d’énergie de bâtiments non destinés à l’habitation pour lesquels elles n’ont pas encore été mesurées.

Votre prédiction se basera sur les données structurelles des bâtiments (taille et usage des bâtiments, date de construction, situation géographique, ...)

Vous cherchez également à évaluer l’intérêt de l’ENERGY STAR Score pour la prédiction d’émissions, qui est fastidieux à calculer avec l’approche utilisée actuellement par votre équipe. Vous l'intégrerez dans la modélisation et jugerez de son intérêt.

Vous sortez tout juste d’une réunion de brief avec votre équipe. Voici un récapitulatif de votre mission :


1) Réaliser une courte analyse exploratoire.
2) Tester différents modèles de prédiction afin de répondre au mieux à la problématique.

Fais bien attention au traitement des différentes variables, à la fois pour trouver de nouvelles informations (peut-on déduire des choses intéressantes d’une simple adresse ?) et optimiser les performances en appliquant des transformations simples aux variables (normalisation, passage au log, etc.).

Mets en place une évaluation rigoureuse des performances, et optimise les hyperparamètres et le choix d’algorithmes de ML à l’aide d’une validation croisée. Tu testeras au minimum 4 algorithmes de famille différente (par exemple : ElasticNet, SVM, GradientBoosting, RandomForest).

In [3]:
import numpy as np

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import ScalarFormatter
from matplotlib.ticker import FuncFormatter
import scipy
from scipy import stats
import scipy.stats as st

import statsmodels
import statsmodels.api as sm
import missingno as msno

import sklearn
from sklearn.experimental import enable_iterative_imputer  # Nécessaire pour activer IterativeImputer
from sklearn.impute import IterativeImputer

from sklearn.impute import KNNImputer
# Encodage des variables catégorielles avant d'utiliser KNNImputer
from category_encoders.ordinal import OrdinalEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# pour le centrage et la réduction
from sklearn.preprocessing import StandardScaler
# pour l'ACP
from sklearn.decomposition import PCA

from sklearn import model_selection
from sklearn.model_selection import GridSearchCV

from sklearn import metrics
from sklearn.metrics import roc_curve, auc, confusion_matrix, mean_squared_error, make_scorer, r2_score, mean_absolute_error

from sklearn import dummy
from sklearn.dummy import DummyClassifier
from sklearn.dummy import DummyRegressor

from sklearn import linear_model
from sklearn.linear_model import Ridge
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import LogisticRegression

from sklearn.svm import LinearSVC
from sklearn.svm import SVR

from sklearn import kernel_ridge

from sklearn import neighbors
from sklearn.neighbors import KNeighborsClassifier

import tensorflow
from tensorflow import keras
from tensorflow.keras import models
from tensorflow.keras import layers

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

from xgboost import XGBRegressor

import timeit
import warnings

print("numpy version", np.__version__)
print("pandas version", pd.__version__)
print("matplotlib version", matplotlib.__version__)
print("seaborn version", sns.__version__)
print("scipy version", scipy.__version__)
print("statsmodels version", statsmodels.__version__)
print("missingno version", msno.__version__)

print("sklearn version", sklearn.__version__)
print("tensorflow version", tensorflow.__version__)

pd.options.display.max_rows = 200
pd.options.display.max_columns = 100

numpy version 1.26.4
pandas version 2.1.4
matplotlib version 3.8.0
seaborn version 0.13.2
scipy version 1.11.4
statsmodels version 0.14.0
missingno version 0.5.2
sklearn version 1.2.2
tensorflow version 2.18.0


## 1 - Création des fonctions et paramètres pour automatisation

### 1.1 - Fonction des modèles sans validation croisée

**Baseline avec DummyRegressor**

On va utiliser la stratégie de la moyenne : prédit la moyenne des valeurs cibles d'entraînement. Créons la fonction qui prend les jeux d'entraînement et de tests et en entrée, et retourne les scores MSE, RMSE, R2, MAE:

In [7]:
def fit_dummyRegressor(X_train, y_train, X_test, y_test):

    start_time = timeit.default_timer()
    
    # Initialisation du DummyRegressor avec la stratégie 'mean'
    dummy_regressor = DummyRegressor(strategy='mean')
    
    # Entraînement du modèle
    dummy_regressor.fit(X_train, y_train)
    
    # Prédiction sur les données de test
    y_pred = dummy_regressor.predict(X_test)

    elapsed = round(timeit.default_timer() - start_time, 3)
    
    mse = round(mean_squared_error(y_test, y_pred), 2)       # Erreur quadratique moyenne
    rmse = round(np.sqrt(mse), 2)                            # Racine carrée de l'erreur quadratique moyenne (RMSE)
    mae = round(mean_absolute_error(y_test, y_pred), 2)      # Erreur absolue moyenne
    r2 = round(r2_score(y_test, y_pred), 2)                  # Coefficient de détermination

    return mse, rmse, r2, mae, elapsed

**Modèle de régression Ridge**

La régression ridge nous permet de réduire l'amplitude des coefficients d'une régression linéaire et d'éviter le sur-apprentissage. On optimisera l'hyperparamètre alpha lors de la validation croisée avec GridSearchCV. Pour rappel, dans le cadre de la régression Ridge, l'hyperparamètre alpha contrôle le degré de régularisation L2 appliqué au modèle. Il agit comme un paramètre de pénalisation pour limiter la taille des coefficients du modèle afin de réduire le sur-ajustement.

Créons une fonction qui permet d'instancier le modèle, l'entraîner, et calculer les scores:

In [9]:
def fit_ridge(X_train, y_train, X_test, y_test):

    start_time = timeit.default_timer()
    
    # Initialisation du modèle Ridge avec un paramètre alpha
    ridge_regressor = Ridge(alpha=1)  # alpha contrôle la régularisation ; plus grand, plus de régularisation ( 1 = valeur par défaut)
    
    # Entraînement du modèle
    ridge_regressor.fit(X_train, y_train)
    
    # Prédiction sur les données de test
    y_pred = ridge_regressor.predict(X_test)

    elapsed = round(timeit.default_timer() - start_time, 3)
    
    # Évaluation du modèle avec différentes métriques
    mse = round(mean_squared_error(y_test, y_pred), 2)       # Erreur quadratique moyenne
    rmse = round(np.sqrt(mse), 2)                            # Racine carrée de l'erreur quadratique moyenne (RMSE)
    mae = round(mean_absolute_error(y_test, y_pred), 2)      # Erreur absolue moyenne
    r2 = round(r2_score(y_test, y_pred), 2)                  # Coefficient de détermination

    return mse, rmse, r2, mae, elapsed

**Modèle de régression Lasso**

Le Lasso est une méthode de sélection de variables et de réduction de dimension supervisée : les variables qui ne sont pas nécessaires à la prédiction de l'étiquette sont éliminées. l'hyperparamètre alpha contrôle l'intensité de la pénalisation L1 appliquée au modèle. Il prendre la valeur par défaut 1 ici.

Créons une fonction qui permet d'instancier le modèle, l'entraîner, et calculer les scores:

In [11]:
def fit_lasso(X_train, y_train, X_test, y_test):

    start_time = timeit.default_timer()
    
    # Initialisation du modèle Lasso avec un paramètre alpha
    lasso_regressor = Lasso(alpha=1)  # alpha contrôle la régularisation ; plus grand, plus de régularisation
    
    # Entraînement du modèle
    lasso_regressor.fit(X_train, y_train)
    
    # Prédiction sur les données de test
    y_pred = lasso_regressor.predict(X_test)

    elapsed = round(timeit.default_timer() - start_time, 3)
    
    # Évaluation du modèle avec différentes métriques
    mse = round(mean_squared_error(y_test, y_pred), 2)       # Erreur quadratique moyenne
    rmse = round(np.sqrt(mse), 2)                            # Racine carrée de l'erreur quadratique moyenne (RMSE)
    mae = round(mean_absolute_error(y_test, y_pred), 2)      # Erreur absolue moyenne
    r2 = round(r2_score(y_test, y_pred), 2)                  # Coefficient de détermination

    return mse, rmse, r2, mae, elapsed

**Modèle ElasticNet**

La méthode elastic net qui combine les deux termes de régularisation en un (Ridge L2 et Lasso L1). Deux hyperparamètres principaux sont à ajuster :
- alpha : contrôle la force globale de la régularisation (combine L1 et L2). (Par défaut = 1)
- l1_ratio : détermine la pondération entre la pénalisation L1 (type LASSO) et L2 (type Ridge). (Par défaut = 0.5)

Créons une fonction qui permet d'instancier le modèle, l'entraîner, et calculer les scores:

In [13]:
def fit_elasticNet(X_train, y_train, X_test, y_test):

    start_time = timeit.default_timer()

    # Initialisation du modèle ElasticNet
    elastic_net = ElasticNet(
        alpha=1, 
        l1_ratio=0.5, # Mélange égal entre ridge et lasso
        max_iter=1000, random_state=42
    )

    # Entraînement du modèle
    elastic_net.fit(X_train, y_train)

    # Prédictions sur le jeu de test
    y_pred = elastic_net.predict(X_test)
    
    elapsed = round(timeit.default_timer() - start_time, 3)
    
    # Évaluation du modèle avec différentes métriques
    mse = round(mean_squared_error(y_test, y_pred), 2)       # Erreur quadratique moyenne
    rmse = round(np.sqrt(mse), 2)                            # Racine carrée de l'erreur quadratique moyenne (RMSE)
    mae = round(mean_absolute_error(y_test, y_pred), 2)      # Erreur absolue moyenne
    r2 = round(r2_score(y_test, y_pred), 2)                  # Coefficient de détermination

    return mse, rmse, r2, mae, elapsed

**Bagging - Modèle RamdonForestRegressor**

Appelé aussi le bagging qui, appliqué aux arbres de décision, donne naissance au modèle de forêt aléatoire. 

Une forêt aléatoire est un ensemble de nombreux arbres de décision qui sont combinés pour produire une prédiction plus précise et plus robuste. Chaque arbre de décision est construit à partir d'un échantillon aléatoire des données et les résultats sont moyennés pour obtenir la prédiction finale. 

Le modèle de forêt aléatoire est intrinsèquement parallèle. Les arbres sont entraînés en même temps sur des parties du dataset.

Créons une fonction qui permet d'instancier le modèle, l'entraîner, et calculer les scores:

In [15]:
def fit_ramdomForestRegressor(X_train, y_train, X_test, y_test, data_columns):

    start_time = timeit.default_timer()
    
    # Création du modèle
    # n_estimators : Nombre d'arbres dans la forêt. Valeur par défaut = 100. Une valeur plus élevée peut améliorer la précision mais augmente le temps de calcul.
    model = RandomForestRegressor(n_estimators=100, random_state=42) 
    
    # Entraînement du modèle
    model.fit(X_train, y_train)
    
    # Prédictions
    y_pred = model.predict(X_test)

    elapsed = round(timeit.default_timer() - start_time, 3)
    
    # Évaluation du modèle avec différentes métriques
    mse = round(mean_squared_error(y_test, y_pred), 2)       # Erreur quadratique moyenne
    rmse = round(np.sqrt(mse), 2)                            # Racine carrée de l'erreur quadratique moyenne (RMSE)
    mae = round(mean_absolute_error(y_test, y_pred), 2)      # Erreur absolue moyenne
    r2 = round(r2_score(y_test, y_pred), 2)                  # Coefficient de détermination

    # Afficher l'importance des features
    print("Importance des features dans le RandomForestRegressor :")
    importances = model.feature_importances_
    # Création d'un DataFrame pour afficher l'importance des features
    feature_importance = pd.DataFrame({'Feature': data_columns, 'Importance': importances})
    # Tri par ordre d'importance décroissante
    feature_importance = feature_importance.sort_values(by='Importance', ascending=False)

    # Afficher les résultats
    print(feature_importance)

    return mse, rmse, r2, mae, elapsed

**Boosting - Modèle GradientBoostingRegressor**

Le boosting enchaîne l'entraînement des prédicteurs faibles de façon séquentielle, en se concentrant à chaque itération sur les échantillons qui ont généré le plus d'erreurs.

Il s'agit d'une méthode d'ensemble qui construit un modèle prédictif puissant en combinant plusieurs modèles faibles (typiquement des arbres de décision) de manière séquentielle

Le modèle de forêt aléatoire est intrinsèquement parallèle. Les arbres sont entraînés en même temps sur des parties du dataset.

Créons une fonction qui permet d'instancier le modèle, l'entraîner, et calculer les scores:

In [17]:
def fit_gradientBoostingRegressor(X_train, y_train, X_test, y_test):

    start_time = timeit.default_timer()

    # Création du modèle de régression Gradient Boosting
    gb_regressor = GradientBoostingRegressor(
        n_estimators=100,  # Nombre d'arbres
        learning_rate=0.1,  # Taux d'apprentissage (ici valeur par défaut)
        max_depth=3,  # Profondeur maximale des arbres (ici valeur par défaut)
        random_state=42  # Pour la reproductibilité
    )

    # Entraînement du modèle
    gb_regressor.fit(X_train, y_train)

    # Prédictions sur le jeu de test
    y_pred = gb_regressor.predict(X_test)

    elapsed = round(timeit.default_timer() - start_time, 3)
    
    # Évaluation du modèle avec différentes métriques
    mse = round(mean_squared_error(y_test, y_pred), 2)       # Erreur quadratique moyenne
    rmse = round(np.sqrt(mse), 2)                            # Racine carrée de l'erreur quadratique moyenne (RMSE)
    mae = round(mean_absolute_error(y_test, y_pred), 2)      # Erreur absolue moyenne
    r2 = round(r2_score(y_test, y_pred), 2)                  # Coefficient de détermination

    return mse, rmse, r2, mae, elapsed

**Modèle SVR**

Pour effectuer une régression avec un SVM (Support Vector Machine), on utilise le modèle appelé Support Vector Regression (SVR), qui fait partie des algorithmes de régression basés sur les SVM.

Le SVR cherche à trouver une fonction qui ne s'écarte pas trop des valeurs cibles, avec un contrôle sur la marge d'erreur autorisée.

Créons une fonction qui permet d'instancier le modèle, l'entraîner, et calculer les scores:

In [19]:
def fit_SVR(X_train, y_train, X_test, y_test):
    
    start_time = timeit.default_timer()
    
    # Initialisation du modèle SVR
    svr_model = SVR(
        kernel='rbf', 
        C=1.0,  # Paramètre de régularisation qui contrôle la pénalité pour les erreurs. Un C élevé cherche à minimiser les erreurs.
        epsilon=0.1 # Contrôle la largeur de la marge autour de la fonction cible. Les points situés dans cette marge ne contribuent pas à la fonction de coût
    )

    # Entraînement du modèle SVR
    svr_model.fit(X_train, y_train)

    # Prédictions sur le jeu de test
    y_pred = svr_model.predict(X_test)

    elapsed = round(timeit.default_timer() - start_time, 3)
    
    # Évaluation du modèle avec différentes métriques
    mse = round(mean_squared_error(y_test, y_pred), 2)       # Erreur quadratique moyenne
    rmse = round(np.sqrt(mse), 2)                            # Racine carrée de l'erreur quadratique moyenne (RMSE)
    mae = round(mean_absolute_error(y_test, y_pred), 2)      # Erreur absolue moyenne
    r2 = round(r2_score(y_test, y_pred), 2)                  # Coefficient de détermination

    return mse, rmse, r2, mae, elapsed

**Fonction qui lance tous les modèles**

In [21]:
def run_fit_models(X_train, y_train, X_test, y_test, columns):

    mse, rmse, r2, mae, elapsed = fit_dummyRegressor(X_train, y_train, X_test, y_test)
    scores_array = np.array([['DummyRegressor', mse, rmse, r2, mae, elapsed]])
    
    mse, rmse, r2, mae, elapsed = fit_ridge(X_train, y_train, X_test, y_test)
    scores_array = np.vstack([scores_array, ['ridge', mse, rmse, r2, mae, elapsed]])
    
    mse, rmse, r2, mae, elapsed = fit_lasso(X_train, y_train, X_test, y_test)
    scores_array = np.vstack([scores_array, ['lasso', mse, rmse, r2, mae, elapsed]])
    
    mse, rmse, r2, mae, elapsed = fit_elasticNet(X_train, y_train, X_test, y_test)
    scores_array = np.vstack([scores_array, ['elasticNet', mse, rmse, r2, mae, elapsed]])
    
    mse, rmse, r2, mae, elapsed = fit_ramdomForestRegressor(X_train, y_train, X_test, y_test, columns)
    scores_array = np.vstack([scores_array, ['RamdomForestRegressor', mse, rmse, r2, mae, elapsed]])
    
    mse, rmse, r2, mae, elapsed = fit_gradientBoostingRegressor(X_train, y_train, X_test, y_test)
    scores_array = np.vstack([scores_array, ['gradientBoostingRegressor', mse, rmse, r2, mae, elapsed]])
    
    mse, rmse, r2, mae, elapsed = fit_SVR(X_train, y_train, X_test, y_test)
    scores_array = np.vstack([scores_array, ['SVR', mse, rmse, r2, mae, elapsed]])

    return scores_array

### 1.2 - Fonction des modèles avec validation croisée

**Validation croisée avec le modèle Lasso**

In [24]:
def fit_GridSearchCV_lasso(X_train, y_train, scoring, param_grid):

    # Initialisation de GridSearchCV
    grid_search = GridSearchCV(
        estimator=Lasso(),           # une régression Lasso
        param_grid=param_grid,
        cv=5,                        # nombre de folds
        scoring=scoring,
        refit='R2',                  # Refit avec la meilleure valeur de R²
        #n_jobs=-1,                  # Utilisation de tous les cœurs disponibles
        verbose=1
    )

    # Entraînement de GridSearchCV
    grid_search.fit(X_train, y_train)

    return grid_search

**Validation roisée avec le modèle GradientBoostingRegressor**

In [26]:
def fit_GridSearchCV_GradientBoostingRegressor(X_train, y_train, scoring, param_grid):

    # Configuration de GridSearchCV
    grid_search = GridSearchCV(
        estimator=GradientBoostingRegressor(random_state=42),
        param_grid=param_grid,
        scoring=scoring,
        cv=5,  # Validation croisée à 5 plis
        refit='R2',
        n_jobs=-1,  # Utilisation de tous les cœurs disponibles
        verbose=1  # Affichage des détails
    )

    # Entraînement de GridSearchCV
    grid_search.fit(X_train, y_train)

    return grid_search

### 1.3 - Fonctions d'affichage des résultats

Affichage simple du résultat (meilleurs paramètres, scores,...) d'une validation croisée

In [28]:
def print_result_CV(grid_search):

    # Afficher les meilleurs paramètres trouvés
    print(f"Meilleurs paramètres : {grid_search.best_params_}")
    
    # Afficher le meilleur score
    print("Meilleu(s) score sur le jeu d'entraînement:")
    print(grid_search.best_score_)
    
    # Utiliser le modèle avec les meilleurs paramètres
    best_model = grid_search.best_estimator_
    
    # Afficher les performances correspondantes
    print("Résultats de la validation croisée :")
    for score_name in scoring.keys():
        
        print(f"\nScores pour '{score_name}':")    
        for mean, std, params, mean_fit_time in zip(
                grid_search.cv_results_[f'mean_test_{score_name}'],  # score moyen pour chaque score
                grid_search.cv_results_[f'std_test_{score_name}'],   # écart-type du score
                grid_search.cv_results_['params'],                   # valeur de l'hyperparamètre
                grid_search.cv_results_['mean_fit_time']             # temps moyen d'entraînement
        ):
            print(f"{score_name} = {mean:.3f} (+/-{std * 2:.03f}) for {params}")

Affichage sous forme de dataframe du résultat (meilleurs paramètres, scores,...) d'une validation croisée, pour plus de lisibilité

In [30]:
def print_result_CV_as_dataframe(grid_search, scoring):

    # Liste pour stocker les résultats
    results = []
    
    # Afficher les performances correspondantes
    print("Résultats de la validation croisée :")
    for score_name in scoring.keys():
           
        for mean, std, params, mean_fit_time in zip(
                grid_search.cv_results_[f'mean_test_{score_name}'],  # score moyen pour chaque score
                grid_search.cv_results_[f'std_test_{score_name}'],   # écart-type du score
                grid_search.cv_results_['params'],                   # valeur de l'hyperparamètre
                grid_search.cv_results_['mean_fit_time']             # temps moyen d'entraînement
        ):
                    
            # Ajouter chaque combinaison de résultats à une liste sous forme de dictionnaire
            results.append({
                "score_name": score_name,
                "mean_score": mean,
                "std_score": std,
                "params": params,
                "mean_fit_time": mean_fit_time
            })
    
    # Transformer en DataFrame
    df_results = pd.DataFrame(results)
    
    # Convertir la colonne 'params' en chaîne de caractères
    df_results['params'] = df_results['params'].apply(str)
    
    # Transformation avec pivot
    df_results = df_results.pivot(
        index='params',                             # Les paramètres deviennent l'index
        columns='score_name',                       # Les valeurs uniques de score_name deviennent des colonnes
        values=['mean_score', 'mean_fit_time']      # Les valeurs à remplir dans les colonnes (ici, mean_score)
    ).reset_index()
    
    # Aplatir les colonnes multi-indexées
    df_results.columns = ['_'.join(col).strip() for col in df_results.columns.values]
    
    # Réinitialiser l'index pour obtenir un DataFrame "normal"
    df_results = df_results.reset_index()
    df_results.drop(columns=['index'], inplace=True)
    
    # Supprimer l'axe des index
    df_results = df_results.rename_axis(None, axis=1)
    
    # On trie le dataframe sur la colonne R2 du pmus grand au plus petit
    df_results.sort_values(by='mean_score_R2', ascending=False, inplace=True)
    df_results = df_results.reset_index()
    df_results.drop(columns=['index', 'mean_fit_time_MAE', 'mean_fit_time_RMSE'], inplace=True)
    df_results.rename(columns={'mean_fit_time_R2': 'mean_fit_time'}, inplace=True)
    
    return df_results

Affichage des scores calculés sur le fichier de test avec le modèle de la meilleure performance de la validation croisée

In [32]:
def print_result_CV_on_test_file(X_test, y_test, grid_search):

    # Utiliser le modèle avec les meilleurs paramètres
    best_model = grid_search.best_estimator_
    
    # Prédictions avec le modèle optimisé
    y_pred = best_model.predict(X_test)
    
    # Évaluation du modèle avec différentes métriques
    mse = round(mean_squared_error(y_test, y_pred), 2)       # Erreur quadratique moyenne
    rmse = round(np.sqrt(mse), 2)                            # Racine carrée de l'erreur quadratique moyenne (RMSE)
    mae = round(mean_absolute_error(y_test, y_pred), 2)      # Erreur absolue moyenne
    r2 = round(r2_score(y_test, y_pred), 2)                  # Coefficient de détermination
    
    scores_cv_fe1 = np.array([['Lasso', mse, rmse, r2, mae]])
    
    # Conversion de l'array en DataFrame
    df_scores_cv_fe1 = pd.DataFrame(scores_cv_fe1, columns=['Modèle', 'MSE', 'RMSE', 'R2', 'MAE'])
    
    # on transforme la colonne R2 en numérique
    df_scores_cv_fe1['R2'] = pd.to_numeric(df_scores_cv_fe1['R2'], errors='coerce')
    
    # On trie le dataframe sur la colonne R2 du pmus grand au plus petit
    df_scores_cv_fe1.sort_values(by='R2', ascending=False, inplace=True)
    
    return df_scores_cv_fe1    

Affichage du tableau des scores des modèles sans validation croisée

In [34]:
def print_result_on_test_file(scores_array):

    # Conversion de l'array en DataFrame
    df = pd.DataFrame(scores_array, columns=['Modèle', 'MSE', 'RMSE', 'R2', 'MAE', 'ELAPSED_TIME'])
    
    # on transforme la colonne R2 en numérique
    df['R2'] = pd.to_numeric(df['R2'], errors='coerce')
    
    # On trie le dataframe sur la colonne R2 du pmus grand au plus petit
    df.sort_values(by='R2', ascending=False, inplace=True)
    df.reset_index(inplace=True)

    return df

Affichage des coeficients calculés dans un modèle Lasso

In [36]:
def print_coeffs_lasso(grid_search, columns):

    # Récupérer le meilleur modèle
    best_lasso = grid_search.best_estimator_
    
    # Extraire les coefficients
    coefficients = best_lasso.coef_
    
    # Associer les coefficients aux noms des variables
    coef_df = pd.DataFrame({
        'Feature': columns,
        'Coefficient': coefficients
    })
    
    coef_df.sort_values(by='Coefficient', ascending=False, inplace=True)
    
    # Afficher les coefficients
    print(coef_df)

### 1.4 - Paramètres

In [38]:
# Définition des hyperparamètres à tester
param_grid = {
    #'alpha': [0.01, 0.1, 1.0, 10.0],  # Différentes valeurs de régularisation
    'alpha': np.logspace(-6, 6, 13) 
}

In [39]:
pd.set_option('display.float_format', '{:.3f}'.format)  # Désactiver l'écriture scientifique

Créons une fonction pour calculer le RMSE qui n'a pas directement disponible dans le GridSearchCV :

In [41]:
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

In [42]:
rmse_scorer = make_scorer(rmse, greater_is_better=False)  # False car on minimise le RMSE

In [43]:
# Définition du dictionnaire des métriques de scoring
scoring = {
    'MAE': 'neg_mean_absolute_error',  # Utilise l'erreur absolue moyenne
    'R2': 'r2',                        # Utilise le coefficient de détermination
    'RMSE': rmse_scorer                # Utilise Root Mean Squared Error (racine carré de l'erreur quadratique moyenne)
}

## 2 - Développement et simulation du premier modèle (cible = SiteEnergyUseWN(kBtu))

In [45]:
# Charger le fichier de données
data_fe1 = pd.read_csv("C:/Users/admin/Documents/Projets/Projet_4/data_projet/cleaned/2016_Building_Energy_Benchmarking_fe1.csv", sep=',', low_memory=False)
data_fe1.head()

Unnamed: 0,NumberofBuildings,NumberofFloors,PropertyGFATotal,SiteEnergyUseWN(kBtu),TotalGHGEmissions,PrimaryPropertyType_Distribution Center,PrimaryPropertyType_Hospital,PrimaryPropertyType_Hotel,PrimaryPropertyType_K-12 School,PrimaryPropertyType_Laboratory,PrimaryPropertyType_Large Office,PrimaryPropertyType_Low-Rise Multifamily,PrimaryPropertyType_Medical Office,PrimaryPropertyType_Mixed Use Property,PrimaryPropertyType_Office,PrimaryPropertyType_Other,PrimaryPropertyType_Refrigerated Warehouse,PrimaryPropertyType_Residence Hall,PrimaryPropertyType_Restaurant,PrimaryPropertyType_Retail Store,PrimaryPropertyType_Self-Storage Facility,PrimaryPropertyType_Senior Care Community,PrimaryPropertyType_Small- and Mid-Sized Office,PrimaryPropertyType_Supermarket / Grocery Store,PrimaryPropertyType_University,PrimaryPropertyType_Warehouse,PrimaryPropertyType_Worship Facility,Neighborhood_BALLARD,Neighborhood_CENTRAL,Neighborhood_DELRIDGE,Neighborhood_DELRIDGE NEIGHBORHOODS,Neighborhood_DOWNTOWN,Neighborhood_EAST,Neighborhood_GREATER DUWAMISH,Neighborhood_LAKE UNION,Neighborhood_MAGNOLIA / QUEEN ANNE,Neighborhood_NORTH,Neighborhood_NORTHEAST,Neighborhood_NORTHWEST,Neighborhood_SOUTHEAST,Neighborhood_SOUTHWEST,"YearBuilt_Bin_(1899.885, 1911.5]","YearBuilt_Bin_(1911.5, 1923.0]","YearBuilt_Bin_(1923.0, 1934.5]","YearBuilt_Bin_(1934.5, 1946.0]","YearBuilt_Bin_(1946.0, 1957.5]","YearBuilt_Bin_(1957.5, 1969.0]","YearBuilt_Bin_(1969.0, 1980.5]","YearBuilt_Bin_(1980.5, 1992.0]","YearBuilt_Bin_(1992.0, 2003.5]","YearBuilt_Bin_(2003.5, 2015.0]",electricity_percent,gaz_percent,steam_percent
0,1.0,12,88434.0,7456910.0,249.98,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,54.61,17.66,27.73
1,1.0,11,103566.0,8664479.0,295.86,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,38.66,61.34,0.0
2,1.0,10,61320.0,6946800.5,286.43,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,40.75,26.66,32.59
3,1.0,18,175580.0,14656503.0,505.01,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,37.88,62.12,0.0
4,1.0,2,97288.0,12581712.0,301.81,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,60.99,39.01,0.0


In [46]:
data_fe1.shape

(1444, 54)

In [47]:
data_fe1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1444 entries, 0 to 1443
Data columns (total 54 columns):
 #   Column                                           Non-Null Count  Dtype  
---  ------                                           --------------  -----  
 0   NumberofBuildings                                1444 non-null   float64
 1   NumberofFloors                                   1444 non-null   int64  
 2   PropertyGFATotal                                 1444 non-null   float64
 3   SiteEnergyUseWN(kBtu)                            1444 non-null   float64
 4   TotalGHGEmissions                                1444 non-null   float64
 5   PrimaryPropertyType_Distribution Center          1444 non-null   float64
 6   PrimaryPropertyType_Hospital                     1444 non-null   float64
 7   PrimaryPropertyType_Hotel                        1444 non-null   float64
 8   PrimaryPropertyType_K-12 School                  1444 non-null   float64
 9   PrimaryPropertyType_Laboratory

### 2.1 - Sélectionner les features et la cible :

In [49]:
y_fe1_conso = data_fe1['SiteEnergyUseWN(kBtu)']
X_fe1 = data_fe1.drop('SiteEnergyUseWN(kBtu)', axis=1, inplace=False)
X_fe1.shape

(1444, 53)

In [50]:
y_fe1_emissions = data_fe1['TotalGHGEmissions']
X_fe1 = X_fe1.drop('TotalGHGEmissions', axis=1, inplace=False)
X_fe1.shape

(1444, 52)

In [51]:
y_fe1_conso.shape

(1444,)

In [52]:
y_fe1_emissions.shape

(1444,)

### 2.2 - Standardiser les valeurs et créer les jeux d'entraînement / test

In [54]:
X_scale_fe1 = StandardScaler().fit_transform(X_fe1)

In [55]:
df = pd.DataFrame(X_scale_fe1)
df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51
count,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0,1444.0
mean,0.0,0.0,-0.0,-0.0,0.0,-0.0,-0.0,-0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,-0.0,0.0,-0.0,0.0,-0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,0.0,0.0,-0.0,0.0,-0.0,-0.0,0.0,-0.0,-0.0,-0.0,0.0,0.0,-0.0,0.0,0.0,0.0,-0.0
std,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
min,-0.133,-0.821,-0.838,-0.191,-0.026,-0.21,-0.305,-0.053,-0.277,-0.037,-0.141,-0.268,-0.046,-0.415,-0.088,-0.119,-0.092,-0.247,-0.141,-0.102,-0.495,-0.164,-0.102,-0.385,-0.221,-0.217,-0.183,-0.175,-0.026,-0.483,-0.277,-0.541,-0.304,-0.315,-0.215,-0.287,-0.239,-0.175,-0.164,-0.342,-0.291,-0.324,-0.185,-0.349,-0.433,-0.36,-0.347,-0.354,-0.312,-2.601,-1.092,-0.202
25%,-0.133,-0.554,-0.594,-0.191,-0.026,-0.21,-0.305,-0.053,-0.277,-0.037,-0.141,-0.268,-0.046,-0.415,-0.088,-0.119,-0.092,-0.247,-0.141,-0.102,-0.495,-0.164,-0.102,-0.385,-0.221,-0.217,-0.183,-0.175,-0.026,-0.483,-0.277,-0.541,-0.304,-0.315,-0.215,-0.287,-0.239,-0.175,-0.164,-0.342,-0.291,-0.324,-0.185,-0.349,-0.433,-0.36,-0.347,-0.354,-0.312,-0.804,-1.092,-0.202
50%,-0.133,-0.287,-0.367,-0.191,-0.026,-0.21,-0.305,-0.053,-0.277,-0.037,-0.141,-0.268,-0.046,-0.415,-0.088,-0.119,-0.092,-0.247,-0.141,-0.102,-0.495,-0.164,-0.102,-0.385,-0.221,-0.217,-0.183,-0.175,-0.026,-0.483,-0.277,-0.541,-0.304,-0.315,-0.215,-0.287,-0.239,-0.175,-0.164,-0.342,-0.291,-0.324,-0.185,-0.349,-0.433,-0.36,-0.347,-0.354,-0.312,0.001,-0.084,-0.202
75%,-0.133,0.247,0.138,-0.191,-0.026,-0.21,-0.305,-0.053,-0.277,-0.037,-0.141,-0.268,-0.046,-0.415,-0.088,-0.119,-0.092,-0.247,-0.141,-0.102,-0.495,-0.164,-0.102,-0.385,-0.221,-0.217,-0.183,-0.175,-0.026,-0.483,-0.277,-0.541,-0.304,-0.315,-0.215,-0.287,-0.239,-0.175,-0.164,-0.342,-0.291,-0.324,-0.185,-0.349,-0.433,-0.36,-0.347,-0.354,-0.312,1.159,0.818,-0.202
max,12.506,25.612,6.807,5.226,37.987,4.762,3.277,18.974,3.608,26.851,7.111,3.726,21.917,2.411,11.414,8.438,10.924,4.049,7.111,9.76,2.021,6.083,9.76,2.601,4.533,4.606,5.452,5.708,37.987,2.071,3.608,1.848,3.292,3.179,4.644,3.482,4.185,5.708,6.083,2.926,3.432,3.089,5.393,2.863,2.309,2.775,2.884,2.823,3.206,1.159,2.653,9.058


In [56]:
X_fe1_train, X_fe1_test, y_fe1_train, y_fe1_test = model_selection.train_test_split(X_scale_fe1, y_fe1_conso, test_size=0.25, random_state=42 ) # 25% des données dans le jeu de test

In [57]:
X_fe1_train.shape

(1083, 52)

In [58]:
X_fe1_test.shape

(361, 52)

In [59]:
y_fe1_train.shape

(1083,)

In [60]:
y_fe1_test.shape

(361,)

### 2.3 - Tests de modèles sans validation croisée

Il s'agit d'évaluer quelques modèles sans utiliser la validation croisée, en partant d'une baseline, pour aller vers des modèles plus élaborés.

Les hyperparamètres seront fixes ici. Ils seront automatisés par la suite par validation croisée avec GridSearchCV.

Les scores seront établis sur la base du fichier de tests. On fera une boucle sur chaque modèle, et on stockera les scrores dans un tableau. 

In [62]:
warnings.filterwarnings("ignore")

scores_array_fe1 = run_fit_models(X_fe1_train, y_fe1_train, X_fe1_test, y_fe1_test, X_fe1.columns)

Importance des features dans le RandomForestRegressor :
                                            Feature  Importance
2                                  PropertyGFATotal       0.505
21  PrimaryPropertyType_Supermarket / Grocery Store       0.088
49                              electricity_percent       0.062
1                                    NumberofFloors       0.049
50                                      gaz_percent       0.040
23                    PrimaryPropertyType_Warehouse       0.033
13                        PrimaryPropertyType_Other       0.025
7                    PrimaryPropertyType_Laboratory       0.018
16                   PrimaryPropertyType_Restaurant       0.013
19        PrimaryPropertyType_Senior Care Community       0.012
47                   YearBuilt_Bin_(1992.0, 2003.5]       0.010
45                   YearBuilt_Bin_(1969.0, 1980.5]       0.010
29                            Neighborhood_DOWNTOWN       0.009
11           PrimaryPropertyType_Mixed Use Prope

Affichons le résultat dans un dataframe, triès par ordre décroissant sur R2 :

Plus R2 est proche de 1, plus le modèle est performant.

In [64]:
df_results_fe1 = print_result_on_test_file(scores_array_fe1)
df_results_fe1.head(10)

Unnamed: 0,index,Modèle,MSE,RMSE,R2,MAE,ELAPSED_TIME
0,1,ridge,4883143028088.79,2209783.48,0.65,1526074.66,0.136
1,2,lasso,4882121725575.46,2209552.38,0.65,1525093.94,0.147
2,4,RamdomForestRegressor,5056831827541.21,2248740.05,0.64,1489276.19,2.27
3,5,gradientBoostingRegressor,5513470357850.86,2348078.01,0.61,1532356.7,0.657
4,3,elasticNet,5751242024803.79,2398174.73,0.59,1751934.71,0.005
5,0,DummyRegressor,14143195564884.4,3760744.02,-0.01,2849262.3,0.002
6,6,SVR,16828270384158.12,4102227.49,-0.2,2693551.59,0.295


Sans validation croisée et optimisation des hyperparamètres, les modèles Ridge et Lasso sont les plus performants (R2 = 0.65), suivi de près par le RandomForestRegressor (R2 = 0.64).

### 2.4 - Validation croisée avec le modèle Lasso

In [67]:
# Définition des hyperparamètres à tester
param_grid = {
    'alpha': np.logspace(-6, 6, 13) 
}

In [68]:
grid_search_fe1 = fit_GridSearchCV_lasso(X_fe1_train, y_fe1_train, scoring, param_grid)

Fitting 5 folds for each of 13 candidates, totalling 65 fits


In [69]:
print_result_CV(grid_search_fe1)

Meilleurs paramètres : {'alpha': 10000.0}
Meilleu(s) score sur le jeu d'entraînement:
0.5904473335057308
Résultats de la validation croisée :

Scores pour 'MAE':
MAE = -1542554.704 (+/-189210.398) for {'alpha': 1e-06}
MAE = -1542554.704 (+/-189210.398) for {'alpha': 1e-05}
MAE = -1542554.704 (+/-189210.399) for {'alpha': 0.0001}
MAE = -1542554.705 (+/-189210.403) for {'alpha': 0.001}
MAE = -1542554.709 (+/-189210.443) for {'alpha': 0.01}
MAE = -1542554.755 (+/-189210.850) for {'alpha': 0.1}
MAE = -1542555.208 (+/-189214.913) for {'alpha': 1.0}
MAE = -1542560.476 (+/-189256.227) for {'alpha': 10.0}
MAE = -1542596.487 (+/-189688.543) for {'alpha': 100.0}
MAE = -1542196.483 (+/-190822.699) for {'alpha': 1000.0}
MAE = -1535455.447 (+/-195365.197) for {'alpha': 10000.0}
MAE = -1536052.141 (+/-216132.916) for {'alpha': 100000.0}
MAE = -2176110.008 (+/-279706.375) for {'alpha': 1000000.0}

Scores pour 'R2':
R2 = 0.589 (+/-0.115) for {'alpha': 1e-06}
R2 = 0.589 (+/-0.115) for {'alpha': 1e-05}


In [70]:
print_result_CV_as_dataframe(grid_search_fe1, scoring).head(30)

Résultats de la validation croisée :


Unnamed: 0,params_,mean_score_MAE,mean_score_R2,mean_score_RMSE,mean_fit_time
0,{'alpha': 10000.0},-1535455.447,0.59,-2252862.327,0.006
1,{'alpha': 1000.0},-1542196.483,0.589,-2256989.725,0.04
2,{'alpha': 1e-06},-1542554.704,0.589,-2257262.565,0.081
3,{'alpha': 1e-05},-1542554.704,0.589,-2257262.565,0.069
4,{'alpha': 0.0001},-1542554.704,0.589,-2257262.565,0.092
5,{'alpha': 0.001},-1542554.705,0.589,-2257262.565,0.083
6,{'alpha': 0.01},-1542554.709,0.589,-2257262.566,0.067
7,{'alpha': 0.1},-1542554.755,0.589,-2257262.571,0.059
8,{'alpha': 1.0},-1542555.208,0.589,-2257262.619,0.058
9,{'alpha': 10.0},-1542560.476,0.589,-2257262.656,0.098


Le modèle Lasso est le plus performant avec alpha = 10 000 (R2 = 0.590). C'est aussi pour cette valeur qu'il est le plus rapide.

Affichons le résultat sur le fichier de test :

In [72]:
print_result_CV_on_test_file(X_fe1_test, y_fe1_test, grid_search_fe1).head(30)

Unnamed: 0,Modèle,MSE,RMSE,R2,MAE
0,Lasso,4856026422771.06,2203639.36,0.65,1517734.68


Le coefficient de détermination R2 est meilleur sur le fichier de test (0.65).

Regardons l'importance des coeeficients de chaque feature :

In [75]:
print_coeffs_lasso(grid_search_fe1, X_fe1.columns)

                                            Feature  Coefficient
2                                  PropertyGFATotal  2227812.959
21  PrimaryPropertyType_Supermarket / Grocery Store  1034630.413
7                    PrimaryPropertyType_Laboratory   389883.927
13                        PrimaryPropertyType_Other   301894.227
16                   PrimaryPropertyType_Restaurant   236024.477
29                            Neighborhood_DOWNTOWN   196833.533
19        PrimaryPropertyType_Senior Care Community   191051.158
5                         PrimaryPropertyType_Hotel   188542.058
47                   YearBuilt_Bin_(1992.0, 2003.5]   187773.024
45                   YearBuilt_Bin_(1969.0, 1980.5]   162743.352
0                                 NumberofBuildings   150340.353
10               PrimaryPropertyType_Medical Office   127639.771
8                  PrimaryPropertyType_Large Office   125934.169
1                                    NumberofFloors   125089.962
51                       

Les 2 plus fortes contribution positive sont nettement la surface totale, et le PrimaryPropertyType_Supermarket / Grocery Store.

Dans l'étape suivante, on va essayer d'améliorer le score R2 des validations croisées.

## 3 - Amélioration du feature Engineering (cible = SiteEnergyUseWN(kBtu))

### 3.1 - 2ème feature Engineering

In [79]:
# Charger le fichier de données
data_fe2 = pd.read_csv("C:/Users/admin/Documents/Projets/Projet_4/data_projet/cleaned/2016_Building_Energy_Benchmarking_fe2.csv", sep=',', low_memory=False)
data_fe2.head()

Unnamed: 0,NumberofBuildings,NumberofFloors,PropertyGFAParking,PropertyGFABuilding(s),SiteEnergyUseWN(kBtu),TotalGHGEmissions,Neighborhood_BALLARD,Neighborhood_CENTRAL,Neighborhood_DELRIDGE,Neighborhood_DELRIDGE NEIGHBORHOODS,Neighborhood_DOWNTOWN,Neighborhood_EAST,Neighborhood_GREATER DUWAMISH,Neighborhood_LAKE UNION,Neighborhood_MAGNOLIA / QUEEN ANNE,Neighborhood_NORTH,Neighborhood_NORTHEAST,Neighborhood_NORTHWEST,Neighborhood_SOUTHEAST,Neighborhood_SOUTHWEST,"YearBuilt_Bin_(1899.885, 1911.5]","YearBuilt_Bin_(1911.5, 1923.0]","YearBuilt_Bin_(1923.0, 1934.5]","YearBuilt_Bin_(1934.5, 1946.0]","YearBuilt_Bin_(1946.0, 1957.5]","YearBuilt_Bin_(1957.5, 1969.0]","YearBuilt_Bin_(1969.0, 1980.5]","YearBuilt_Bin_(1980.5, 1992.0]","YearBuilt_Bin_(1992.0, 2003.5]","YearBuilt_Bin_(2003.5, 2015.0]",PrimaryPropertyType_Distribution Center,PrimaryPropertyType_Hospital,PrimaryPropertyType_Hotel,PrimaryPropertyType_K-12 School,PrimaryPropertyType_Laboratory,PrimaryPropertyType_Large Office,PrimaryPropertyType_Low-Rise Multifamily,PrimaryPropertyType_Medical Office,PrimaryPropertyType_Mixed Use Property,PrimaryPropertyType_Office,PrimaryPropertyType_Other,PrimaryPropertyType_Refrigerated Warehouse,PrimaryPropertyType_Residence Hall,PrimaryPropertyType_Restaurant,PrimaryPropertyType_Retail Store,PrimaryPropertyType_Self-Storage Facility,PrimaryPropertyType_Senior Care Community,PrimaryPropertyType_Small- and Mid-Sized Office,PrimaryPropertyType_Supermarket / Grocery Store,PrimaryPropertyType_University,PrimaryPropertyType_Warehouse,PrimaryPropertyType_Worship Facility,electricity_percent,gaz_percent,steam_percent,usage_Autres,usage_Bureaux & Espaces de travail,usage_Commerce & Retail,usage_Entrepôts et Logistique,usage_Hébergement & Logement,usage_Loisirs et Divertissement,usage_Restauration,usage_Services publics & Infrastructure,usage_Soins médicaux,usage_Transports & Parking,usage_Éducation
0,1.0,12,0,88434,7456910.0,249.98,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,54.61,17.66,27.73,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,11,15064,88502,8664479.0,295.86,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,38.66,61.34,0.0,0.0,0.0,0.0,0.0,80.99,0.0,4.46,0.0,0.0,14.55,0.0
2,1.0,10,0,61320,6946800.5,286.43,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,40.75,26.66,32.59,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,18,62000,113580,14656503.0,505.01,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37.88,62.12,0.0,0.0,0.0,0.0,0.0,64.48,0.0,0.0,0.0,0.0,35.52,0.0
4,1.0,2,37198,60090,12581712.0,301.81,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,60.99,39.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0


In [80]:
data_fe2.shape

(1440, 66)

**Sélectionner les features et la cible :**

In [82]:
y_fe2_conso = data_fe2['SiteEnergyUseWN(kBtu)']
X_fe2 = data_fe2.drop('SiteEnergyUseWN(kBtu)', axis=1, inplace=False)
X_fe2.shape

(1440, 65)

In [83]:
y_fe2_emissions = data_fe2['TotalGHGEmissions']
X_fe2 = X_fe2.drop('TotalGHGEmissions', axis=1, inplace=False)
X_fe2.shape

(1440, 64)

**Standardiser les valeurs et créer les jeux d'entraînement / test**

In [85]:
X_scale_fe2 = StandardScaler().fit_transform(X_fe2)

df = pd.DataFrame(X_scale_fe2)
df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63
count,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0
mean,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,0.0,0.0,-0.0,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,0.0,0.0,-0.0,-0.0,-0.0,-0.0,0.0,-0.0,-0.0,-0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,0.0,0.0,-0.0,0.0,0.0,-0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,0.0,-0.0,-0.0,0.0,-0.0,0.0,0.0,-0.0,0.0,-0.0,0.0
std,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
min,-0.132,-0.82,-0.288,-1.009,-0.216,-0.184,-0.175,-0.026,-0.481,-0.278,-0.542,-0.303,-0.315,-0.216,-0.288,-0.239,-0.175,-0.165,-0.341,-0.29,-0.324,-0.184,-0.35,-0.434,-0.361,-0.347,-0.355,-0.311,-0.192,-0.026,-0.209,-0.306,-0.053,-0.278,-0.037,-0.141,-0.269,-0.046,-0.415,-0.088,-0.119,-0.088,-0.247,-0.138,-0.103,-0.495,-0.165,-0.103,-0.385,-0.221,-2.599,-1.092,-0.202,-0.269,-0.761,-0.414,-0.523,-0.301,-0.159,-0.158,-0.296,-0.094,-0.367,-0.345
25%,-0.132,-0.553,-0.288,-0.602,-0.216,-0.184,-0.175,-0.026,-0.481,-0.278,-0.542,-0.303,-0.315,-0.216,-0.288,-0.239,-0.175,-0.165,-0.341,-0.29,-0.324,-0.184,-0.35,-0.434,-0.361,-0.347,-0.355,-0.311,-0.192,-0.026,-0.209,-0.306,-0.053,-0.278,-0.037,-0.141,-0.269,-0.046,-0.415,-0.088,-0.119,-0.088,-0.247,-0.138,-0.103,-0.495,-0.165,-0.103,-0.385,-0.221,-0.803,-1.092,-0.202,-0.269,-0.761,-0.414,-0.523,-0.301,-0.159,-0.158,-0.296,-0.094,-0.367,-0.345
50%,-0.132,-0.286,-0.288,-0.348,-0.216,-0.184,-0.175,-0.026,-0.481,-0.278,-0.542,-0.303,-0.315,-0.216,-0.288,-0.239,-0.175,-0.165,-0.341,-0.29,-0.324,-0.184,-0.35,-0.434,-0.361,-0.347,-0.355,-0.311,-0.192,-0.026,-0.209,-0.306,-0.053,-0.278,-0.037,-0.141,-0.269,-0.046,-0.415,-0.088,-0.119,-0.088,-0.247,-0.138,-0.103,-0.495,-0.165,-0.103,-0.385,-0.221,0.001,-0.084,-0.202,-0.269,-0.761,-0.414,-0.523,-0.301,-0.159,-0.158,-0.296,-0.094,-0.367,-0.345
75%,-0.132,0.247,-0.288,0.163,-0.216,-0.184,-0.175,-0.026,-0.481,-0.278,-0.542,-0.303,-0.315,-0.216,-0.288,-0.239,-0.175,-0.165,-0.341,-0.29,-0.324,-0.184,-0.35,-0.434,-0.361,-0.347,-0.355,-0.311,-0.192,-0.026,-0.209,-0.306,-0.053,-0.278,-0.037,-0.141,-0.269,-0.046,-0.415,-0.088,-0.119,-0.088,-0.247,-0.138,-0.103,-0.495,-0.165,-0.103,-0.385,-0.221,1.159,0.818,-0.202,-0.269,0.913,-0.414,-0.523,-0.301,-0.159,-0.158,-0.296,-0.094,-0.367,-0.345
max,12.515,25.594,12.04,11.246,4.637,5.444,5.7,37.934,2.077,3.603,1.845,3.302,3.174,4.637,3.477,4.179,5.7,6.074,2.933,3.443,3.084,5.444,2.859,2.306,2.77,2.879,2.819,3.215,5.219,37.934,4.796,3.272,18.947,3.603,26.814,7.101,3.721,21.886,2.407,11.398,8.426,11.398,4.043,7.234,9.747,2.022,6.074,9.747,2.597,4.527,1.159,2.652,9.046,4.509,1.834,3.037,2.199,3.859,8.574,10.992,3.805,11.92,6.989,2.988


In [86]:
# 25% des données dans le jeu de test
X_fe2_train, X_fe2_test, y_fe2_train, y_fe2_test = model_selection.train_test_split(X_scale_fe2, y_fe2_conso, test_size=0.25, random_state=42 )

**Tests de modèles sans validation croisée**

In [88]:
warnings.filterwarnings("ignore")

scores_array_fe2 = run_fit_models(X_fe2_train, y_fe2_train, X_fe2_test, y_fe2_test, X_fe2.columns)

Importance des features dans le RandomForestRegressor :
                                            Feature  Importance
3                            PropertyGFABuilding(s)       0.467
46  PrimaryPropertyType_Supermarket / Grocery Store       0.078
56                    usage_Entrepôts et Logistique       0.077
50                              electricity_percent       0.037
1                                    NumberofFloors       0.031
51                                      gaz_percent       0.028
59                               usage_Restauration       0.027
2                                PropertyGFAParking       0.027
61                             usage_Soins médicaux       0.024
54               usage_Bureaux & Espaces de travail       0.014
55                          usage_Commerce & Retail       0.013
58                  usage_Loisirs et Divertissement       0.012
53                                     usage_Autres       0.012
38                        PrimaryPropertyType_Ot

In [89]:
df_results_fe2 = print_result_on_test_file(scores_array_fe2)
df_results_fe2.head(10)

Unnamed: 0,index,Modèle,MSE,RMSE,R2,MAE,ELAPSED_TIME
0,5,gradientBoostingRegressor,4418576409832.72,2102041.01,0.7,1399292.55,0.87
1,4,RamdomForestRegressor,4748484422539.4,2179101.75,0.67,1446073.49,3.042
2,1,ridge,5253941238689.14,2292147.73,0.64,1586453.54,0.005
3,2,lasso,5252657097309.24,2291867.6,0.64,1586164.43,0.113
4,3,elasticNet,5810201056902.76,2410435.86,0.6,1714542.54,0.006
5,0,DummyRegressor,14578031018368.1,3818118.78,-0.01,2862093.8,0.001
6,6,SVR,17300529111375.15,4159390.47,-0.19,2674589.57,0.296


On observe que les modèles Ridge et Lasso sont moins performants que les modèles gradientBoostingRegressor et RamdomForestRegressor. En fait c'est surtout la performance de Ridge et Lasso qui a chuté avec le 2ème feature engineering par rapport au 1er.

**Validation croisée avec le modèle Lasso**

In [92]:
# Définition des hyperparamètres à tester
param_grid = {
    'alpha': np.logspace(-7, 7, 13) 
}

In [93]:
grid_search_fe2 = fit_GridSearchCV_lasso(X_fe2_train, y_fe2_train, scoring, param_grid)
print_result_CV(grid_search_fe2)

Fitting 5 folds for each of 13 candidates, totalling 65 fits
Meilleurs paramètres : {'alpha': 46415.888336127915}
Meilleu(s) score sur le jeu d'entraînement:
0.5979301554039819
Résultats de la validation croisée :

Scores pour 'MAE':
MAE = -1494687.307 (+/-210109.212) for {'alpha': 1e-07}
MAE = -1494687.308 (+/-210109.212) for {'alpha': 1.4677992676220705e-06}
MAE = -1494687.308 (+/-210109.212) for {'alpha': 2.1544346900318867e-05}
MAE = -1494687.308 (+/-210109.213) for {'alpha': 0.00031622776601683794}
MAE = -1494687.319 (+/-210109.223) for {'alpha': 0.004641588833612782}
MAE = -1494687.479 (+/-210109.374) for {'alpha': 0.06812920690579623}
MAE = -1494689.843 (+/-210111.623) for {'alpha': 1.0}
MAE = -1494713.222 (+/-210140.598) for {'alpha': 14.677992676220736}
MAE = -1494672.512 (+/-210509.052) for {'alpha': 215.44346900318865}
MAE = -1492531.231 (+/-214382.679) for {'alpha': 3162.2776601683795}
MAE = -1483268.692 (+/-182992.035) for {'alpha': 46415.888336127915}
MAE = -2012200.371 (

In [94]:
print_result_CV_as_dataframe(grid_search_fe2, scoring).head(30)

Résultats de la validation croisée :


Unnamed: 0,params_,mean_score_MAE,mean_score_R2,mean_score_RMSE,mean_fit_time
0,{'alpha': 46415.888336127915},-1483268.692,0.598,-2212090.222,0.003
1,{'alpha': 3162.2776601683795},-1492531.231,0.593,-2223220.981,0.016
2,{'alpha': 215.44346900318865},-1494672.512,0.592,-2226161.676,0.085
3,{'alpha': 1e-07},-1494687.307,0.592,-2226338.597,0.093
4,{'alpha': 1.4677992676220705e-06},-1494687.308,0.592,-2226338.597,0.099
5,{'alpha': 2.1544346900318867e-05},-1494687.308,0.592,-2226338.597,0.087
6,{'alpha': 0.00031622776601683794},-1494687.308,0.592,-2226338.598,0.088
7,{'alpha': 0.004641588833612782},-1494687.319,0.592,-2226338.602,0.097
8,{'alpha': 0.06812920690579623},-1494687.479,0.592,-2226338.664,0.088
9,{'alpha': 1.0},-1494689.843,0.592,-2226339.592,0.088


Le meilleur score R2 de la validation croisée est un peu mieux que dans le 1er feature Engineering.

Et avec le fichier de test :

In [95]:
print_result_CV_on_test_file(X_fe2_test, y_fe2_test, grid_search_fe2).head(30)

Unnamed: 0,Modèle,MSE,RMSE,R2,MAE
0,Lasso,4972753450001.51,2229967.14,0.66,1519096.67


In [97]:
print_coeffs_lasso(grid_search_fe2, X_fe2.columns)

                                            Feature  Coefficient
3                            PropertyGFABuilding(s)  1893325.514
46  PrimaryPropertyType_Supermarket / Grocery Store   941301.073
2                                PropertyGFAParking   621701.703
32                   PrimaryPropertyType_Laboratory   276719.294
59                               usage_Restauration   269654.720
44        PrimaryPropertyType_Senior Care Community   223816.917
61                             usage_Soins médicaux   189010.322
58                  usage_Loisirs et Divertissement   184071.556
26                   YearBuilt_Bin_(1992.0, 2003.5]   181201.003
52                                    steam_percent   170734.357
38                        PrimaryPropertyType_Other   167052.538
30                        PrimaryPropertyType_Hotel   153695.311
35               PrimaryPropertyType_Medical Office   146779.883
24                   YearBuilt_Bin_(1969.0, 1980.5]   146327.878
39       PrimaryPropertyT

Regardons si une validation croisée avec le modèle GradientBoostingRegressor ferait mieux.

**Validation croisée avec le modèle GradientBoostingRegressor**

In [110]:
# Définition des hyperparamètres pour la recherche
param_grid = {
    'n_estimators': [50, 100, 150],  # Nombre d'arbres dans l'ensemble.
    'learning_rate': [0.01, 0.1, 0.2], # Taux d'apprentissage pour la réduction du poids de chaque arbre
    'max_depth': [3, 5, 7], # Profondeur maximale de chaque arbre
    'subsample': [0.8] # Fraction des échantillons utilisés pour entraîner chaque arbre.
}

In [112]:
grid_search_fe2_gradient = fit_GridSearchCV_GradientBoostingRegressor(X_fe2_train, y_fe2_train, scoring, param_grid)
print_result_CV(grid_search_fe2_gradient)

Fitting 5 folds for each of 27 candidates, totalling 135 fits
Meilleurs paramètres : {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100, 'subsample': 0.8}
Meilleu(s) score sur le jeu d'entraînement:
0.6561387918415079
Résultats de la validation croisée :

Scores pour 'MAE':
MAE = -2170974.581 (+/-260606.373) for {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 50, 'subsample': 0.8}
MAE = -1890298.332 (+/-213261.766) for {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 100, 'subsample': 0.8}
MAE = -1744273.730 (+/-180311.133) for {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 150, 'subsample': 0.8}
MAE = -2110584.492 (+/-287181.376) for {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 50, 'subsample': 0.8}
MAE = -1797214.883 (+/-238846.358) for {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 100, 'subsample': 0.8}
MAE = -1636847.696 (+/-207791.679) for {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 150, 'subsample': 0.8}
MAE = -2075

Affichons le résultat de la validation croisée :

In [114]:
print_result_CV_as_dataframe(grid_search_fe2_gradient, scoring).head(30)

Résultats de la validation croisée :


Unnamed: 0,params_,mean_score_MAE,mean_score_R2,mean_score_RMSE,mean_fit_time
0,"{'learning_rate': 0.1, 'max_depth': 3, 'n_esti...",-1348232.026,0.656,-2041392.567,0.876
1,"{'learning_rate': 0.1, 'max_depth': 3, 'n_esti...",-1346252.014,0.655,-2046620.541,1.305
2,"{'learning_rate': 0.1, 'max_depth': 5, 'n_esti...",-1375235.636,0.64,-2090818.526,1.349
3,"{'learning_rate': 0.1, 'max_depth': 5, 'n_esti...",-1378666.774,0.639,-2091218.066,0.652
4,"{'learning_rate': 0.1, 'max_depth': 3, 'n_esti...",-1423369.734,0.638,-2099183.853,0.522
5,"{'learning_rate': 0.1, 'max_depth': 5, 'n_esti...",-1390481.888,0.635,-2106119.939,2.063
6,"{'learning_rate': 0.2, 'max_depth': 3, 'n_esti...",-1421663.545,0.63,-2120694.575,0.491
7,"{'learning_rate': 0.2, 'max_depth': 3, 'n_esti...",-1417342.17,0.629,-2123720.919,0.93
8,"{'learning_rate': 0.1, 'max_depth': 7, 'n_esti...",-1392469.077,0.628,-2122931.15,0.961
9,"{'learning_rate': 0.1, 'max_depth': 7, 'n_esti...",-1393551.139,0.628,-2123486.079,1.994


Et avec le fichier de test :

In [160]:
print_result_CV_on_test_file(X_fe3_test, y_fe3_test, grid_search_fe3_gradient).head(30)

Unnamed: 0,Modèle,MSE,RMSE,R2,MAE
0,Lasso,4699913390184.78,2167928.36,0.68,1451622.71


La validation croisée avec le modèle GradientBoostingRegressor est bien meilleure.

### 3.2 - 3ème feature Engineering

In [118]:
# Charger le fichier de données
data_fe3 = pd.read_csv("C:/Users/admin/Documents/Projets/Projet_4/data_projet/cleaned/2016_Building_Energy_Benchmarking_fe3.csv", sep=',', low_memory=False)
data_fe3.head()

Unnamed: 0,NumberofBuildings,NumberofFloors,PropertyGFAParking,PropertyGFABuilding(s),SiteEnergyUseWN(kBtu),TotalGHGEmissions,Neighborhood,YearBuilt_Bin,PrimaryPropertyType,electricity_percent,gaz_percent,steam_percent,usage_Autres,usage_Bureaux & Espaces de travail,usage_Commerce & Retail,usage_Entrepôts et Logistique,usage_Hébergement & Logement,usage_Loisirs et Divertissement,usage_Restauration,usage_Services publics & Infrastructure,usage_Soins médicaux,usage_Transports & Parking,usage_Éducation
0,1.0,12,0,88434,7456910.0,249.98,5000274.943,3533085.625,6218125.878,54.61,17.66,27.73,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,11,15064,88502,8664479.0,295.86,5000274.943,6145114.477,6218125.878,38.66,61.34,0.0,0.0,0.0,0.0,0.0,80.99,0.0,4.46,0.0,0.0,14.55,0.0
2,1.0,10,0,61320,6946800.5,286.43,5000274.943,3533085.625,6218125.878,40.75,26.66,32.59,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,18,62000,113580,14656503.0,505.01,5000274.943,3595235.07,6218125.878,37.88,62.12,0.0,0.0,0.0,0.0,0.0,64.48,0.0,0.0,0.0,0.0,35.52,0.0
4,1.0,2,37198,60090,12581712.0,301.81,5000274.943,6145114.477,4209164.996,60.99,39.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0


In [123]:
data_fe3.shape

(1440, 23)

**Sélectionner les features et la cible :**

In [125]:
y_fe3_conso = data_fe3['SiteEnergyUseWN(kBtu)']
X_fe3 = data_fe3.drop('SiteEnergyUseWN(kBtu)', axis=1, inplace=False)
X_fe3.shape

(1440, 22)

In [127]:
y_fe3_emissions = data_fe3['TotalGHGEmissions']
X_fe3 = X_fe3.drop('TotalGHGEmissions', axis=1, inplace=False)
X_fe3.shape

(1440, 21)

**Standardiser les valeurs et créer les jeux d'entraînement / test**

In [130]:
X_scale_fe3 = StandardScaler().fit_transform(X_fe3)

df = pd.DataFrame(X_scale_fe3)
df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
count,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0
mean,0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,0.0,0.0,-0.0,0.0,-0.0,-0.0,0.0,-0.0,0.0,0.0,-0.0,0.0,-0.0,0.0
std,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
min,-0.132,-0.82,-0.288,-1.009,-1.394,-1.193,-1.15,-2.599,-1.092,-0.202,-0.269,-0.761,-0.414,-0.523,-0.301,-0.159,-0.158,-0.296,-0.094,-0.367,-0.345
25%,-0.132,-0.553,-0.288,-0.602,-0.851,-0.496,-0.562,-0.803,-1.092,-0.202,-0.269,-0.761,-0.414,-0.523,-0.301,-0.159,-0.158,-0.296,-0.094,-0.367,-0.345
50%,-0.132,-0.286,-0.288,-0.348,-0.094,-0.219,-0.134,0.001,-0.084,-0.202,-0.269,-0.761,-0.414,-0.523,-0.301,-0.159,-0.158,-0.296,-0.094,-0.367,-0.345
75%,-0.132,0.247,-0.288,0.163,0.953,0.156,0.237,1.159,0.818,-0.202,-0.269,0.913,-0.414,-0.523,-0.301,-0.159,-0.158,-0.296,-0.094,-0.367,-0.345
max,12.515,25.594,12.04,11.246,1.511,2.165,2.703,1.159,2.652,9.046,4.509,1.834,3.037,2.199,3.859,8.574,10.992,3.805,11.92,6.989,2.988


In [132]:
# 25% des données dans le jeu de test
X_fe3_train, X_fe3_test, y_fe3_train, y_fe3_test = model_selection.train_test_split(X_scale_fe3, y_fe3_conso, test_size=0.25, random_state=42 )

**Tests de modèles sans validation croisée**

In [135]:
warnings.filterwarnings("ignore")

scores_array_fe3 = run_fit_models(X_fe3_train, y_fe3_train, X_fe3_test, y_fe3_test, X_fe3.columns)

Importance des features dans le RandomForestRegressor :
                                    Feature  Importance
3                    PropertyGFABuilding(s)       0.428
6                       PrimaryPropertyType       0.219
13            usage_Entrepôts et Logistique       0.040
7                       electricity_percent       0.039
1                            NumberofFloors       0.033
8                               gaz_percent       0.032
5                             YearBuilt_Bin       0.030
16                       usage_Restauration       0.028
11       usage_Bureaux & Espaces de travail       0.021
4                              Neighborhood       0.020
2                        PropertyGFAParking       0.017
18                     usage_Soins médicaux       0.017
12                  usage_Commerce & Retail       0.016
19               usage_Transports & Parking       0.012
10                             usage_Autres       0.011
14             usage_Hébergement & Logement     

In [137]:
df_results_fe3 = print_result_on_test_file(scores_array_fe3)
df_results_fe3.head(10)

Unnamed: 0,index,Modèle,MSE,RMSE,R2,MAE,ELAPSED_TIME
0,5,gradientBoostingRegressor,4707176371629.02,2169602.81,0.68,1466645.55,0.707
1,4,RamdomForestRegressor,4855689704994.47,2203562.96,0.66,1475268.68,2.138
2,1,ridge,5487813259453.0,2342608.22,0.62,1637859.64,0.006
3,2,lasso,5488134330664.97,2342676.74,0.62,1637775.01,0.03
4,3,elasticNet,6092647582969.43,2468328.9,0.58,1763710.02,0.003
5,0,DummyRegressor,14578031018368.1,3818118.78,-0.01,2862093.8,0.001
6,6,SVR,17300288516838.78,4159361.55,-0.19,2674575.66,0.289


C'est encore le GradientBoostingRegressor qui est le meilleur sans la validation croisée.

**Validation croisée avec le modèle Lasso**

In [141]:
# Définition des hyperparamètres à tester
param_grid = {
    'alpha': np.logspace(-7, 7, 13) 
}

grid_search_fe3 = fit_GridSearchCV_lasso(X_fe3_train, y_fe3_train, scoring, param_grid)
print_result_CV(grid_search_fe3)

Fitting 5 folds for each of 13 candidates, totalling 65 fits
Meilleurs paramètres : {'alpha': 46415.888336127915}
Meilleu(s) score sur le jeu d'entraînement:
0.5688175362375552
Résultats de la validation croisée :

Scores pour 'MAE':
MAE = -1566443.673 (+/-191056.682) for {'alpha': 1e-07}
MAE = -1566443.673 (+/-191056.682) for {'alpha': 1.4677992676220705e-06}
MAE = -1566443.673 (+/-191056.682) for {'alpha': 2.1544346900318867e-05}
MAE = -1566443.673 (+/-191056.682) for {'alpha': 0.00031622776601683794}
MAE = -1566443.673 (+/-191056.681) for {'alpha': 0.004641588833612782}
MAE = -1566443.672 (+/-191056.680) for {'alpha': 0.06812920690579623}
MAE = -1566443.668 (+/-191056.655) for {'alpha': 1.0}
MAE = -1566443.973 (+/-191055.842) for {'alpha': 14.677992676220736}
MAE = -1566445.961 (+/-191062.720) for {'alpha': 215.44346900318865}
MAE = -1566535.719 (+/-191288.069) for {'alpha': 3162.2776601683795}
MAE = -1572768.315 (+/-193533.758) for {'alpha': 46415.888336127915}
MAE = -1897479.304 (

In [143]:
print_result_CV_as_dataframe(grid_search_fe3, scoring).head(30)

Résultats de la validation croisée :


Unnamed: 0,params_,mean_score_MAE,mean_score_R2,mean_score_RMSE,mean_fit_time
0,{'alpha': 46415.888336127915},-1572768.315,0.569,-2296950.914,0.006
1,{'alpha': 3162.2776601683795},-1566535.719,0.569,-2296815.391,0.005
2,{'alpha': 215.44346900318865},-1566445.961,0.568,-2297485.728,0.023
3,{'alpha': 14.677992676220736},-1566443.973,0.568,-2297533.659,0.031
4,{'alpha': 1.0},-1566443.668,0.568,-2297536.79,0.02
5,{'alpha': 0.06812920690579623},-1566443.672,0.568,-2297537.044,0.031
6,{'alpha': 0.004641588833612782},-1566443.673,0.568,-2297537.061,0.032
7,{'alpha': 0.00031622776601683794},-1566443.673,0.568,-2297537.062,0.031
8,{'alpha': 2.1544346900318867e-05},-1566443.673,0.568,-2297537.062,0.031
9,{'alpha': 1.4677992676220705e-06},-1566443.673,0.568,-2297537.062,0.028


Les résultats avec le modèle Lasso sont moins bons que dans les 2 premiers feature engineering.

Et avec le fichier de test :

In [148]:
print_result_CV_on_test_file(X_fe3_test, y_fe3_test, grid_search_fe3).head(30)

Unnamed: 0,Modèle,MSE,RMSE,R2,MAE
0,Lasso,5362030175717.28,2315605.79,0.63,1619114.04


In [150]:
print_coeffs_lasso(grid_search_fe3, X_fe3.columns)

                                    Feature  Coefficient
3                    PropertyGFABuilding(s)  1662615.308
6                       PrimaryPropertyType   940062.521
2                        PropertyGFAParking   429391.422
12                  usage_Commerce & Retail   350037.290
18                     usage_Soins médicaux   346965.120
5                             YearBuilt_Bin   281209.044
16                       usage_Restauration   237910.014
15          usage_Loisirs et Divertissement   184599.817
9                             steam_percent   158779.537
10                             usage_Autres   106102.026
0                         NumberofBuildings    77370.867
1                            NumberofFloors    43380.807
11       usage_Bureaux & Espaces de travail        0.000
14             usage_Hébergement & Logement        0.000
8                               gaz_percent        0.000
17  usage_Services publics & Infrastructure       -0.000
4                              

**Validation croisée avec le modèle GradientBoostingRegressor**

In [153]:
# Définition des hyperparamètres pour la recherche
param_grid = {
    'n_estimators': [50, 100, 150],  # Nombre d'arbres dans l'ensemble.
    'learning_rate': [0.01, 0.1, 0.2], # Taux d'apprentissage pour la réduction du poids de chaque arbre
    'max_depth': [3, 5, 7], # Profondeur maximale de chaque arbre
    'subsample': [0.8] # Fraction des échantillons utilisés pour entraîner chaque arbre.
}

grid_search_fe3_gradient = fit_GridSearchCV_GradientBoostingRegressor(X_fe3_train, y_fe3_train, scoring, param_grid)
print_result_CV(grid_search_fe3_gradient)

Fitting 5 folds for each of 27 candidates, totalling 135 fits
Meilleurs paramètres : {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100, 'subsample': 0.8}
Meilleu(s) score sur le jeu d'entraînement:
0.6319589949979573
Résultats de la validation croisée :

Scores pour 'MAE':
MAE = -2160903.578 (+/-308455.259) for {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 50, 'subsample': 0.8}
MAE = -1857558.536 (+/-267353.249) for {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 100, 'subsample': 0.8}
MAE = -1700429.084 (+/-244820.381) for {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 150, 'subsample': 0.8}
MAE = -2086500.607 (+/-282317.471) for {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 50, 'subsample': 0.8}
MAE = -1758584.841 (+/-232233.860) for {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 100, 'subsample': 0.8}
MAE = -1595647.904 (+/-202021.883) for {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 150, 'subsample': 0.8}
MAE = -2051

Affichons les résultats dans un dataframe :

In [156]:
print_result_CV_as_dataframe(grid_search_fe3_gradient, scoring).head(30)

Résultats de la validation croisée :


Unnamed: 0,params_,mean_score_MAE,mean_score_R2,mean_score_RMSE,mean_fit_time
0,"{'learning_rate': 0.1, 'max_depth': 3, 'n_esti...",-1404992.48,0.632,-2114454.599,0.78
1,"{'learning_rate': 0.1, 'max_depth': 3, 'n_esti...",-1403853.695,0.63,-2119109.384,1.163
2,"{'learning_rate': 0.2, 'max_depth': 3, 'n_esti...",-1419096.492,0.629,-2123875.668,0.772
3,"{'learning_rate': 0.2, 'max_depth': 3, 'n_esti...",-1432410.663,0.628,-2126075.285,0.417
4,"{'learning_rate': 0.1, 'max_depth': 5, 'n_esti...",-1422795.227,0.624,-2129978.517,0.557
5,"{'learning_rate': 0.1, 'max_depth': 5, 'n_esti...",-1425054.817,0.622,-2136403.806,1.11
6,"{'learning_rate': 0.1, 'max_depth': 7, 'n_esti...",-1416927.327,0.619,-2150900.333,0.797
7,"{'learning_rate': 0.1, 'max_depth': 3, 'n_esti...",-1452448.323,0.618,-2153429.515,0.376
8,"{'learning_rate': 0.1, 'max_depth': 7, 'n_esti...",-1417858.645,0.616,-2159065.929,1.696
9,"{'learning_rate': 0.1, 'max_depth': 5, 'n_esti...",-1439079.606,0.613,-2160438.772,1.738


Les résultats sont égalements moins bons que dans le 2ème feature engineering