# Anticipate the energy consumption needs of buildings - *TotalGHGEmissions prediction notebook*

## Mission

You work for the city of Seattle. To reach its goal of becoming a carbon-neutral city by 2050, your team is taking a close look at the consumption and emissions of non-residential buildings.

Detailed readings were taken by city agents in 2016. However, these readings are expensive to obtain, so you want to use the existing ones to predict the CO2 emissions and total energy consumption of non-residential buildings that have not yet been measured.

Your prediction will be based on the structural data of the buildings (size and use, construction date, geographic location, ...).

You also want to assess how useful the ENERGY STAR Score is for predicting emissions, since it is tedious to compute with the approach your team currently uses. You will include it in the modelling and judge its value.

You have just left a briefing meeting with your team. Here is a summary of your mission:

1) Carry out a short exploratory analysis.
2) Test different prediction models to best address the problem.

Pay close attention to how the variables are handled, both to uncover new information (can anything useful be inferred from a simple address?) and to improve performance by applying simple transformations to the variables (normalization, log transform, etc.).

Set up a rigorous performance evaluation, and tune the hyperparameters and the choice of ML algorithm using cross-validation. Test at least 4 algorithms from different families (for example: ElasticNet, SVM, GradientBoosting, RandomForest).
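
The brief mentions log transforms. As a minimal sketch (this notebook does not apply it, and the synthetic data and the choice of `Ridge` are assumptions made only for illustration), scikit-learn's `TransformedTargetRegressor` can fit a model on `log1p(y)` and map predictions back to the original scale:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

# Synthetic, strictly positive, right-skewed target (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.exp(X_demo @ np.array([0.5, -0.3, 0.8]) + rng.normal(scale=0.1, size=200))

# The regressor is trained on log1p(y); predictions are mapped back with expm1
log_model = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_model.fit(X_demo, y_demo)
pred = log_model.predict(X_demo)
print(pred[:3])
```

Because the inverse transform is applied inside `predict`, metrics such as MAE or RMSE stay expressed in the units of the original target.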

In [7]:
import numpy as np

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import ScalarFormatter
from matplotlib.ticker import FuncFormatter
import scipy
from scipy import stats
import scipy.stats as st

import statsmodels
import statsmodels.api as sm
import missingno as msno

import sklearn
from sklearn.experimental import enable_iterative_imputer  # Required to enable IterativeImputer
from sklearn.impute import IterativeImputer

from sklearn.impute import KNNImputer
# Encode categorical variables before using KNNImputer
from category_encoders.ordinal import OrdinalEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# for centering and scaling
from sklearn.preprocessing import StandardScaler
# for PCA
from sklearn.decomposition import PCA

from sklearn import model_selection
from sklearn.model_selection import GridSearchCV

from sklearn import metrics
from sklearn.metrics import roc_curve, auc, confusion_matrix, mean_squared_error, make_scorer, r2_score, mean_absolute_error

from sklearn import dummy
from sklearn.dummy import DummyClassifier
from sklearn.dummy import DummyRegressor

from sklearn import linear_model
from sklearn.linear_model import Ridge
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import LogisticRegression

from sklearn.svm import LinearSVC
from sklearn.svm import SVR

from sklearn import kernel_ridge

from sklearn import neighbors
from sklearn.neighbors import KNeighborsClassifier

import tensorflow
from tensorflow import keras
from tensorflow.keras import models
from tensorflow.keras import layers

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

from xgboost import XGBRegressor

import timeit
import warnings

print("numpy version", np.__version__)
print("pandas version", pd.__version__)
print("matplotlib version", matplotlib.__version__)
print("seaborn version", sns.__version__)
print("scipy version", scipy.__version__)
print("statsmodels version", statsmodels.__version__)
print("missingno version", msno.__version__)

print("sklearn version", sklearn.__version__)
print("tensorflow version", tensorflow.__version__)

pd.options.display.max_rows = 200
pd.options.display.max_columns = 100

numpy version 1.26.4
pandas version 2.1.4
matplotlib version 3.8.0
seaborn version 0.13.2
scipy version 1.11.4
statsmodels version 0.14.0
missingno version 0.5.2
sklearn version 1.2.2
tensorflow version 2.18.0


## 1 - Functions and parameters for automation

### 1.1 - Model functions with cross-validation

**Cross-validation with the Lasso model**

In [16]:
def fit_GridSearchCV_lasso(X_train, y_train, scoring, param_grid):

    # Set up GridSearchCV
    grid_search = GridSearchCV(
        estimator=Lasso(),           # Lasso regression
        param_grid=param_grid,
        cv=5,                        # number of folds
        scoring=scoring,
        refit='R2',                  # refit on the full training set with the best mean R²
        #n_jobs=-1,                  # use all available cores
        verbose=1
    )

    # Run the grid search
    grid_search.fit(X_train, y_train)

    return grid_search

**Cross-validation with the ElasticNet model**

In [20]:
def fit_GridSearchCV_elasticNet(X_train, y_train, scoring, param_grid):

    # Set up GridSearchCV
    grid_search = GridSearchCV(
        estimator=ElasticNet(),      # ElasticNet regression
        param_grid=param_grid,
        cv=5,                        # number of folds
        scoring=scoring,
        refit='R2',                  # refit on the full training set with the best mean R²
        #n_jobs=-1,                  # use all available cores
        verbose=1
    )

    # Run the grid search
    grid_search.fit(X_train, y_train)

    return grid_search

**Cross-validation with the GradientBoostingRegressor model**

In [27]:
def fit_GridSearchCV_GradientBoostingRegressor(X_train, y_train, scoring, param_grid):

    # Set up GridSearchCV
    grid_search = GridSearchCV(
        estimator=GradientBoostingRegressor(random_state=42),
        param_grid=param_grid,
        scoring=scoring,
        cv=5,  # 5-fold cross-validation
        refit='R2',
        n_jobs=-1,  # use all available cores
        verbose=1  # print progress details
    )

    # Run the grid search
    grid_search.fit(X_train, y_train)

    return grid_search

**Cross-validation with the RandomForestRegressor model**

In [32]:
def fit_GridSearchCV_ramdomForestRegressor(X_train, y_train, scoring, param_grid):

    # Set up GridSearchCV
    grid_search = GridSearchCV(
        estimator=RandomForestRegressor(random_state=42),
        param_grid=param_grid,
        scoring=scoring,
        cv=5,  # 5-fold cross-validation
        refit='R2',
        n_jobs=-1,  # use all available cores
        verbose=1  # print progress details
    )

    # Run the grid search
    grid_search.fit(X_train, y_train)

    return grid_search

**Cross-validation with SVR**

In [37]:
def fit_GridSearchCV_SVR(X_train, y_train, scoring, param_grid):

    # Set up GridSearchCV
    grid_search = GridSearchCV(
        estimator=SVR(),
        param_grid=param_grid,
        scoring=scoring,
        cv=5,  # 5-fold cross-validation
        refit='R2',
        n_jobs=-1,  # use all available cores
        verbose=1  # print progress details
    )

    # Run the grid search
    grid_search.fit(X_train, y_train)

    return grid_search
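
The five functions above differ only in the estimator and in `n_jobs`. As a minimal sketch, a single hypothetical helper (`fit_grid_search` is not part of the original notebook) could take the estimator as an argument while keeping the same `cv=5` and `refit='R2'` settings. The synthetic data and the small scoring dictionary below are assumptions used only to show a call:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV


def fit_grid_search(estimator, X_train, y_train, scoring, param_grid, n_jobs=None):
    """Hypothetical generic helper: same settings as the per-model functions."""
    grid_search = GridSearchCV(
        estimator=estimator,
        param_grid=param_grid,
        scoring=scoring,
        cv=5,            # 5-fold cross-validation
        refit='R2',      # requires an 'R2' key in the scoring dictionary
        n_jobs=n_jobs,
        verbose=0,
    )
    grid_search.fit(X_train, y_train)
    return grid_search


# Synthetic data, only to demonstrate the call
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)
demo_scoring = {'MAE': 'neg_mean_absolute_error', 'R2': 'r2'}
demo_search = fit_grid_search(Lasso(), X_demo, y_demo, demo_scoring, {'alpha': [0.1, 1.0, 10.0]})
print(demo_search.best_params_)
```

Keeping one helper means a change to the folds or the refit metric is made in one place instead of five.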

### 1.2 - Result display functions

Simple display of the results (best parameters, scores, ...) of a cross-validation

In [42]:
def print_result_CV(grid_search, scoring):

    # Print the best hyperparameters found
    print(f"Best parameters: {grid_search.best_params_}")

    # Print the best mean cross-validated score (for the refit metric, R²)
    print("Best mean cross-validated score (refit metric):")
    print(grid_search.best_score_)

    # Print the scores of every candidate, metric by metric
    print("Cross-validation results:")
    for score_name in scoring.keys():

        print(f"\nScores for '{score_name}':")
        for mean, std, params in zip(
                grid_search.cv_results_[f'mean_test_{score_name}'],  # mean score over the folds
                grid_search.cv_results_[f'std_test_{score_name}'],   # standard deviation of the score
                grid_search.cv_results_['params']                    # hyperparameter values
        ):
            print(f"{score_name} = {mean:.3f} (+/-{std * 2:.03f}) for {params}")

Display of the cross-validation results (scores, fit times, ...) as a DataFrame, for better readability

In [47]:
def print_result_CV_as_dataframe(grid_search, scoring):

    # List used to collect the results
    results = []

    # Collect the scores of every candidate, metric by metric
    print("Cross-validation results:")
    for score_name in scoring.keys():

        for mean, std, params, mean_fit_time in zip(
                grid_search.cv_results_[f'mean_test_{score_name}'],  # mean score over the folds
                grid_search.cv_results_[f'std_test_{score_name}'],   # standard deviation of the score
                grid_search.cv_results_['params'],                   # hyperparameter values
                grid_search.cv_results_['mean_fit_time']             # mean fit time
        ):

            # Store each (metric, candidate) pair as a dictionary
            results.append({
                "score_name": score_name,
                "mean_score": mean,
                "std_score": std,
                "params": params,
                "mean_fit_time": mean_fit_time
            })

    # Convert to a DataFrame
    df_results = pd.DataFrame(results)

    # Convert the 'params' column to strings so it can be used as a pivot index
    df_results['params'] = df_results['params'].apply(str)

    # Pivot: one row per candidate, one column per (value, metric) pair
    df_results = df_results.pivot(
        index='params',                             # the parameters become the index
        columns='score_name',                       # the unique values of score_name become columns
        values=['mean_score', 'mean_fit_time']      # values used to fill the columns
    ).reset_index()

    # Flatten the multi-level columns
    df_results.columns = ['_'.join(col).strip() for col in df_results.columns.values]

    # Reset the index to get a "regular" DataFrame
    df_results = df_results.reset_index()
    df_results.drop(columns=['index'], inplace=True)

    # Remove the name of the columns axis
    df_results = df_results.rename_axis(None, axis=1)

    # Sort the DataFrame by R2, from highest to lowest
    df_results.sort_values(by='mean_score_R2', ascending=False, inplace=True)
    df_results = df_results.reset_index()
    # The fit time is identical for every metric, so keep a single column
    df_results.drop(columns=['index', 'mean_fit_time_MAE', 'mean_fit_time_RMSE'], inplace=True)
    df_results.rename(columns={'mean_fit_time_R2': 'mean_fit_time'}, inplace=True)

    return df_results
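
As a possible alternative to the loop-and-pivot approach above, `cv_results_` can be passed directly to `pd.DataFrame`, since it is a dictionary of equal-length arrays. The sketch below is not the notebook's function; it uses synthetic data and a `Ridge` model as assumptions, and it keeps scikit-learn's own column names (`mean_test_<metric>`):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data and a small grid, only to obtain a fitted search object
X_demo, y_demo = make_regression(n_samples=150, n_features=8, noise=5.0, random_state=0)
demo_scoring = {'MAE': 'neg_mean_absolute_error', 'R2': 'r2'}
demo_search = GridSearchCV(
    Ridge(), {'alpha': [0.1, 1.0, 10.0]}, scoring=demo_scoring, cv=5, refit='R2'
)
demo_search.fit(X_demo, y_demo)

# One row per candidate; keep the parameters, the fit time and the mean scores
cv_table = pd.DataFrame(demo_search.cv_results_)
cols = ['params', 'mean_fit_time'] + [f'mean_test_{name}' for name in demo_scoring]
summary = (
    cv_table[cols]
    .sort_values('mean_test_R2', ascending=False)
    .reset_index(drop=True)
)
print(summary)
```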

Display of the scores computed on the test set with the best model found by cross-validation

In [52]:
def print_result_CV_on_test_file(X_test, y_test, grid_search):

    # Use the model refitted with the best hyperparameters
    best_model = grid_search.best_estimator_

    # Predict with the tuned model
    y_pred = best_model.predict(X_test)

    # Evaluate the model with several metrics
    mse = round(mean_squared_error(y_test, y_pred), 2)       # Mean squared error
    rmse = round(np.sqrt(mse), 2)                            # Root mean squared error (RMSE)
    mae = round(mean_absolute_error(y_test, y_pred), 2)      # Mean absolute error
    r2 = round(r2_score(y_test, y_pred), 2)                  # Coefficient of determination

    scores_cv_fe1 = np.array([[mse, rmse, r2, mae]])

    # Convert the array to a DataFrame
    df_scores_cv_fe1 = pd.DataFrame(scores_cv_fe1, columns=['MSE', 'RMSE', 'R2', 'MAE'])

    # Make sure the R2 column is numeric
    df_scores_cv_fe1['R2'] = pd.to_numeric(df_scores_cv_fe1['R2'], errors='coerce')

    # Sort the DataFrame by R2, from highest to lowest
    df_scores_cv_fe1.sort_values(by='R2', ascending=False, inplace=True)

    return df_scores_cv_fe1

### 1.3 - Parameters

In [None]:
pd.set_option('display.float_format', '{:.3f}'.format)  # Disable scientific notation

Let's define a function that computes the RMSE and wrap it as a custom scorer for GridSearchCV (scikit-learn also provides the built-in scoring string 'neg_root_mean_squared_error'):

In [60]:
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

In [63]:
rmse_scorer = make_scorer(rmse, greater_is_better=False)  # False because RMSE is minimized (the scorer returns -RMSE)

In [66]:
# Dictionary of scoring metrics
scoring = {
    'MAE': 'neg_mean_absolute_error',  # Mean absolute error (negated)
    'R2': 'r2',                        # Coefficient of determination
    'RMSE': rmse_scorer                # Root mean squared error (negated)
}
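
The negative MAE and RMSE values in the result tables come from scikit-learn's convention that a higher score is always better, so error metrics are negated. The sketch below, on synthetic data (an assumption for illustration), checks that the custom scorer defined above and the built-in `'neg_root_mean_squared_error'` string give the same per-fold values:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_validate


def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))


# Synthetic data, only to compare the two scorers on identical folds
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=1)
demo_scoring = {
    'RMSE_custom': make_scorer(rmse, greater_is_better=False),
    'RMSE_builtin': 'neg_root_mean_squared_error',
}
cv_scores = cross_validate(Lasso(alpha=1.0), X_demo, y_demo, cv=5, scoring=demo_scoring)
custom = cv_scores['test_RMSE_custom']
builtin = cv_scores['test_RMSE_builtin']

# Both arrays hold -RMSE; flip the sign to read the error in the target's units
print(-custom.mean(), -builtin.mean())
```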

## 2 - Model runs and choice of the final model

We start again from the dataset produced by the 2nd feature engineering step.

In [71]:
# Load the data file
data_fe2 = pd.read_csv("C:/Users/admin/Documents/Projets/Projet_4/data_projet/cleaned/2016_Building_Energy_Benchmarking_fe2.csv", sep=',', low_memory=False)
data_fe2.head()

Unnamed: 0,NumberofBuildings,NumberofFloors,PropertyGFAParking,SiteEnergyUseWN(kBtu),TotalGHGEmissions,Neighborhood_BALLARD,Neighborhood_CENTRAL,Neighborhood_DELRIDGE,Neighborhood_DELRIDGE NEIGHBORHOODS,Neighborhood_DOWNTOWN,Neighborhood_EAST,Neighborhood_GREATER DUWAMISH,Neighborhood_LAKE UNION,Neighborhood_MAGNOLIA / QUEEN ANNE,Neighborhood_NORTH,Neighborhood_NORTHEAST,Neighborhood_NORTHWEST,Neighborhood_SOUTHEAST,Neighborhood_SOUTHWEST,"YearBuilt_Bin_(1899.885, 1911.5]","YearBuilt_Bin_(1911.5, 1923.0]","YearBuilt_Bin_(1923.0, 1934.5]","YearBuilt_Bin_(1934.5, 1946.0]","YearBuilt_Bin_(1946.0, 1957.5]","YearBuilt_Bin_(1957.5, 1969.0]","YearBuilt_Bin_(1969.0, 1980.5]","YearBuilt_Bin_(1980.5, 1992.0]","YearBuilt_Bin_(1992.0, 2003.5]","YearBuilt_Bin_(2003.5, 2015.0]",PrimaryPropertyType_Distribution Center,PrimaryPropertyType_Hospital,PrimaryPropertyType_Hotel,PrimaryPropertyType_K-12 School,PrimaryPropertyType_Laboratory,PrimaryPropertyType_Large Office,PrimaryPropertyType_Low-Rise Multifamily,PrimaryPropertyType_Medical Office,PrimaryPropertyType_Mixed Use Property,PrimaryPropertyType_Office,PrimaryPropertyType_Other,PrimaryPropertyType_Refrigerated Warehouse,PrimaryPropertyType_Residence Hall,PrimaryPropertyType_Restaurant,PrimaryPropertyType_Retail Store,PrimaryPropertyType_Self-Storage Facility,PrimaryPropertyType_Senior Care Community,PrimaryPropertyType_Small- and Mid-Sized Office,PrimaryPropertyType_Supermarket / Grocery Store,PrimaryPropertyType_University,PrimaryPropertyType_Warehouse,PrimaryPropertyType_Worship Facility,electricity_percent,gaz_percent,steam_percent,usage_Autres,usage_Bureaux & Espaces de travail,usage_Commerce & Retail,usage_Entrepôts et Logistique,usage_Hébergement & Logement,usage_Loisirs et Divertissement,usage_Restauration,usage_Services publics & Infrastructure,usage_Soins médicaux,usage_Transports & Parking,usage_Éducation,PropertyGFAOutsideParking
0,1.0,12,0,7456910.0,249.98,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,54.61,17.66,27.73,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,88434.0
1,1.0,11,15064,8664479.0,295.86,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,38.66,61.34,0.0,0.0,0.0,0.0,0.0,80.99,0.0,4.46,0.0,0.0,14.55,0.0,88502.0
2,1.0,10,0,6946800.5,286.43,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,40.75,26.66,32.59,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,61320.0
3,1.0,18,62000,14656503.0,505.01,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37.88,62.12,0.0,0.0,0.0,0.0,0.0,64.48,0.0,0.0,0.0,0.0,35.52,0.0,113580.0
4,1.0,2,37198,12581712.0,301.81,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,60.99,39.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,60090.0


In [82]:
data_fe2.shape

(1440, 66)

#### 2.1 - Preparation

**Target selection**

In [80]:
y_fe2_emissions = data_fe2['TotalGHGEmissions']
X_fe2 = data_fe2.drop('TotalGHGEmissions', axis=1, inplace=False)
X_fe2 = X_fe2.drop('SiteEnergyUseWN(kBtu)', axis=1, inplace=False)
X_fe2.shape

(1440, 64)

**Standardization and creation of the training and test sets**

In [87]:
X_scale_fe2 = StandardScaler().fit_transform(X_fe2)

df = pd.DataFrame(X_scale_fe2)
df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63
count,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0,1440.0
mean,2.035409e-16,1.97373e-17,0.0,4.9343250000000004e-17,-3.94746e-17,-6.167906000000001e-17,0.0,3.94746e-17,-9.868649e-18,1.480297e-17,3.94746e-17,1.727014e-17,5.921189000000001e-17,2.9605950000000004e-17,-4.934325e-18,-6.167906000000001e-17,2.713879e-17,7.401487e-18,-3.94746e-17,-1.727014e-17,-3.2073110000000004e-17,-3.700743e-17,7.648203e-17,-6.291264e-17,-5.0576830000000004e-17,-8.881784000000001e-17,1.480297e-17,4.934325e-18,1.480297e-17,5.921189000000001e-17,-5.3043990000000004e-17,-1.2335810000000002e-17,0.0,4.934325e-18,-5.551115e-18,6.414622000000001e-17,4.934325e-18,-5.4277570000000004e-17,9.868649e-18,1.2335810000000002e-17,9.251859000000001e-18,9.868649e-18,0.0,7.401487e-18,7.031412e-17,9.868649e-18,1.480297e-17,4.4408920000000007e-17,4.934325e-18,1.430954e-16,3.94746e-17,-9.868649e-18,2.713879e-17,-6.908054000000001e-17,-1.3569390000000001e-17,2.9605950000000004e-17,-3.94746e-17,2.467162e-18,4.194176e-17,-3.94746e-17,1.1102230000000002e-17,-9.868649e-18,4.4408920000000007e-17,-7.894919e-17
std,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347,1.000347
min,-0.1317341,-0.8198627,-0.28762,-0.2156655,-0.1836849,-0.175443,-0.026361,-0.481479,-0.2775575,-0.5420337,-0.3028798,-0.3150185,-0.2156655,-0.2875878,-0.2393088,-0.175443,-0.1646333,-0.3409972,-0.2904089,-0.324256,-0.1836849,-0.3498134,-0.4337267,-0.3609685,-0.3473076,-0.3547951,-0.3110065,-0.191617,-0.02636147,-0.2085144,-0.3056044,-0.05277798,-0.277557,-0.03729371,-0.1408191,-0.2687496,-0.04569117,-0.4154978,-0.08773648,-0.1186782,-0.08773648,-0.2473142,-0.138233,-0.1025978,-0.4945686,-0.1646333,-0.1025978,-0.3851303,-0.2209033,-2.599425,-1.091726,-0.2021178,-0.2689271,-0.7613622,-0.4143498,-0.5230449,-0.3007268,-0.1588959,-0.158323,-0.2959495,-0.09378887,-0.3674831,-0.3451357,-1.0489
25%,-0.1317341,-0.5530599,-0.28762,-0.2156655,-0.1836849,-0.175443,-0.026361,-0.481479,-0.2775575,-0.5420337,-0.3028798,-0.3150185,-0.2156655,-0.2875878,-0.2393088,-0.175443,-0.1646333,-0.3409972,-0.2904089,-0.324256,-0.1836849,-0.3498134,-0.4337267,-0.3609685,-0.3473076,-0.3547951,-0.3110065,-0.191617,-0.02636147,-0.2085144,-0.3056044,-0.05277798,-0.277557,-0.03729371,-0.1408191,-0.2687496,-0.04569117,-0.4154978,-0.08773648,-0.1186782,-0.08773648,-0.2473142,-0.138233,-0.1025978,-0.4945686,-0.1646333,-0.1025978,-0.3851303,-0.2209033,-0.8030706,-1.091726,-0.2021178,-0.2689271,-0.7613622,-0.4143498,-0.5230449,-0.3007268,-0.1588959,-0.158323,-0.2959495,-0.09378887,-0.3674831,-0.3451357,-0.6222299
50%,-0.1317341,-0.2862571,-0.28762,-0.2156655,-0.1836849,-0.175443,-0.026361,-0.481479,-0.2775575,-0.5420337,-0.3028798,-0.3150185,-0.2156655,-0.2875878,-0.2393088,-0.175443,-0.1646333,-0.3409972,-0.2904089,-0.324256,-0.1836849,-0.3498134,-0.4337267,-0.3609685,-0.3473076,-0.3547951,-0.3110065,-0.191617,-0.02636147,-0.2085144,-0.3056044,-0.05277798,-0.277557,-0.03729371,-0.1408191,-0.2687496,-0.04569117,-0.4154978,-0.08773648,-0.1186782,-0.08773648,-0.2473142,-0.138233,-0.1025978,-0.4945686,-0.1646333,-0.1025978,-0.3851303,-0.2209033,0.0009152931,-0.08435791,-0.2021178,-0.2689271,-0.7613622,-0.4143498,-0.5230449,-0.3007268,-0.1588959,-0.158323,-0.2959495,-0.09378887,-0.3674831,-0.3451357,-0.3568224
75%,-0.1317341,0.2473484,-0.28762,-0.2156655,-0.1836849,-0.175443,-0.026361,-0.481479,-0.2775575,-0.5420337,-0.3028798,-0.3150185,-0.2156655,-0.2875878,-0.2393088,-0.175443,-0.1646333,-0.3409972,-0.2904089,-0.324256,-0.1836849,-0.3498134,-0.4337267,-0.3609685,-0.3473076,-0.3547951,-0.3110065,-0.191617,-0.02636147,-0.2085144,-0.3056044,-0.05277798,-0.277557,-0.03729371,-0.1408191,-0.2687496,-0.04569117,-0.4154978,-0.08773648,-0.1186782,-0.08773648,-0.2473142,-0.138233,-0.1025978,-0.4945686,-0.1646333,-0.1025978,-0.3851303,-0.2209033,1.158835,0.8175184,-0.2021178,-0.2689271,0.9131284,-0.4143498,-0.5230449,-0.3007268,-0.1588959,-0.158323,-0.2959495,-0.09378887,-0.3674831,-0.3451357,0.1782799
max,12.51474,25.59361,12.040093,4.636809,5.444107,5.699857,37.934153,2.076934,3.602858,1.844904,3.30164,3.174417,4.636809,3.477198,4.178701,5.699857,6.074104,2.932576,3.44342,3.083983,5.444107,2.858667,2.3056,2.770325,2.879292,2.818528,3.215367,5.218744,37.93415,4.795832,3.272204,18.9473,3.602858,26.81418,7.101308,3.720935,21.88607,2.406752,11.39777,8.42615,11.39777,4.043439,7.234178,9.746794,2.021964,6.074104,9.746794,2.596524,4.526868,1.158835,2.652441,9.045896,4.508525,1.834244,3.03661,2.198842,3.858843,8.57365,10.99217,3.805421,11.91952,6.988983,2.988214,5.911799


In [92]:
# 25% of the data goes to the test set
X_fe2_train, X_fe2_test, y_fe2_train, y_fe2_test = model_selection.train_test_split(X_scale_fe2, y_fe2_emissions, test_size=0.25, random_state=42)
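
Here the scaler is fitted on all 1440 rows before the split, so the test rows contribute to the means and standard deviations. One common alternative, shown as a sketch and not as the notebook's method, is to split first and place `StandardScaler` inside a `Pipeline`, so that it is refitted on the training part of each fold. The synthetic data and the `model__alpha` grid are assumptions for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the unscaled feature matrix and target
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=15.0, random_state=42)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# The scaler is a pipeline step: it is fitted only on the training folds
pipe = Pipeline([('scaler', StandardScaler()), ('model', Lasso())])
search = GridSearchCV(pipe, {'model__alpha': [0.01, 0.1, 1.0]}, cv=5, scoring='r2')
search.fit(X_tr, y_tr)

test_r2 = search.score(X_te, y_te)
print(search.best_params_, round(test_r2, 3))
```

Hyperparameters of a pipeline step are addressed with the `<step>__<parameter>` syntax, hence `model__alpha`.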

#### 2.2 - Lasso model

In [97]:
# Hyperparameters to test
param_grid = {
    'alpha': np.logspace(-6, 6, 13)
}

In [105]:
warnings.filterwarnings("ignore")

grid_search_fe2_lasso = fit_GridSearchCV_lasso(X_fe2_train, y_fe2_train, scoring, param_grid)
print_result_CV_as_dataframe(grid_search_fe2_lasso, scoring).head(30)

Fitting 5 folds for each of 13 candidates, totalling 65 fits
Cross-validation results:


Unnamed: 0,params_,mean_score_MAE,mean_score_R2,mean_score_RMSE,mean_fit_time
0,{'alpha': 1.0},-39.689399,0.536159,-64.77269,0.001794
1,{'alpha': 0.1},-40.668906,0.530766,-65.114243,0.004727
2,{'alpha': 0.01},-40.887331,0.529473,-65.197769,0.022055
3,{'alpha': 0.001},-40.911766,0.529326,-65.207391,0.035118
4,{'alpha': 0.0001},-40.913981,0.5293,-65.209051,0.030759
5,{'alpha': 1e-05},-40.914161,0.529298,-65.209222,0.036033
6,{'alpha': 1e-06},-40.914179,0.529297,-65.209239,0.039362
7,{'alpha': 10.0},-43.818187,0.43493,-71.547521,0.000809
8,{'alpha': 100.0},-66.324883,-0.005747,-95.453261,0.003187
9,{'alpha': 1000.0},-66.324883,-0.005747,-95.453261,0.000292


On the test set:

In [109]:
print_result_CV_on_test_file(X_fe2_test, y_fe2_test, grid_search_fe2_lasso).head(30)

Unnamed: 0,MSE,RMSE,R2,MAE
0,4086.62,63.93,0.55,40.76


#### 2.3 - ElasticNet model

In [114]:
param_grid = {
    'alpha': [0.01, 0.1, 1.0, 10.0, 100.0],  # Grid for alpha
    'l1_ratio': [0.1, 0.5, 0.7, 0.9, 1.0]  # Grid for l1_ratio
}

In [117]:
grid_search_fe2_elasticNet = fit_GridSearchCV_elasticNet(X_fe2_train, y_fe2_train, scoring, param_grid)
print_result_CV_as_dataframe(grid_search_fe2_elasticNet, scoring).head(30)

Fitting 5 folds for each of 25 candidates, totalling 125 fits
Cross-validation results:


Unnamed: 0,params_,mean_score_MAE,mean_score_R2,mean_score_RMSE,mean_fit_time
0,"{'alpha': 1.0, 'l1_ratio': 1.0}",-39.689399,0.536159,-64.77269,0.001399
1,"{'alpha': 1.0, 'l1_ratio': 0.9}",-39.573434,0.532673,-65.05122,0.003008
2,"{'alpha': 0.1, 'l1_ratio': 0.7}",-40.412924,0.531874,-65.052574,0.003136
3,"{'alpha': 0.1, 'l1_ratio': 0.5}",-40.324045,0.531751,-65.070327,0.0
4,"{'alpha': 0.1, 'l1_ratio': 0.9}",-40.550328,0.531372,-65.077625,0.010336
5,"{'alpha': 0.1, 'l1_ratio': 1.0}",-40.668906,0.530766,-65.114243,0.004109
6,"{'alpha': 0.1, 'l1_ratio': 0.1}",-40.22162,0.530495,-65.175049,0.002038
7,"{'alpha': 0.01, 'l1_ratio': 0.1}",-40.751064,0.530327,-65.143554,0.014544
8,"{'alpha': 0.01, 'l1_ratio': 0.5}",-40.804384,0.530004,-65.163785,0.01275
9,"{'alpha': 0.01, 'l1_ratio': 0.7}",-40.835189,0.529811,-65.176048,0.01946


Test-set result with the best hyperparameters:

In [122]:
print_result_CV_on_test_file(X_fe2_test, y_fe2_test, grid_search_fe2_elasticNet).head(30)

Unnamed: 0,MSE,RMSE,R2,MAE
0,4086.62,63.93,0.55,40.76
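
The test scores are identical to those of the Lasso because the best ElasticNet candidate has `l1_ratio=1.0` and `alpha=1.0`: with `l1_ratio=1.0` the ElasticNet penalty reduces to the pure L1 penalty of the Lasso. A minimal check on synthetic data (an assumption for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data, only to compare the two fitted models
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X_demo, y_demo)
enet = ElasticNet(alpha=1.0, l1_ratio=1.0).fit(X_demo, y_demo)

# With l1_ratio=1.0 both estimators solve the same optimization problem
same_coefficients = np.allclose(lasso.coef_, enet.coef_)
print(same_coefficients)
```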


#### 2.4 - GradientBoostingRegressor model

In [130]:
# Hyperparameters for the search
param_grid = {
    'n_estimators': [50, 100, 150],  # Number of trees in the ensemble
    'learning_rate': [0.01, 0.1, 0.2], # Learning rate, which shrinks the contribution of each tree
    'max_depth': [3, 5, 7], # Maximum depth of each tree
    'subsample': [0.8, 1.0] # Fraction of the samples used to fit each tree
}

In [132]:
grid_search_fe2_gradient = fit_GridSearchCV_GradientBoostingRegressor(X_fe2_train, y_fe2_train, scoring, param_grid)
print_result_CV_as_dataframe(grid_search_fe2_gradient, scoring).head(30)

Fitting 5 folds for each of 54 candidates, totalling 270 fits
Cross-validation results:


Unnamed: 0,params_,mean_score_MAE,mean_score_R2,mean_score_RMSE,mean_fit_time
0,"{'learning_rate': 0.1, 'max_depth': 3, 'n_esti...",-33.858223,0.629813,-57.910676,1.0037
1,"{'learning_rate': 0.1, 'max_depth': 3, 'n_esti...",-33.26842,0.626484,-58.147825,1.048602
2,"{'learning_rate': 0.2, 'max_depth': 5, 'n_esti...",-32.63536,0.625857,-58.277198,0.507239
3,"{'learning_rate': 0.1, 'max_depth': 3, 'n_esti...",-34.28118,0.624793,-58.292023,0.894193
4,"{'learning_rate': 0.1, 'max_depth': 3, 'n_esti...",-33.510675,0.621631,-58.521733,0.748087
5,"{'learning_rate': 0.2, 'max_depth': 5, 'n_esti...",-33.297662,0.616523,-58.957564,1.09892
6,"{'learning_rate': 0.2, 'max_depth': 5, 'n_esti...",-33.222039,0.61605,-59.004271,0.610419
7,"{'learning_rate': 0.1, 'max_depth': 7, 'n_esti...",-32.997382,0.615063,-59.064759,1.261341
8,"{'learning_rate': 0.1, 'max_depth': 7, 'n_esti...",-33.18774,0.614476,-59.103292,1.966398
9,"{'learning_rate': 0.2, 'max_depth': 5, 'n_esti...",-33.479092,0.612718,-59.238468,1.072708


R2 is higher with GradientBoosting (about 0.63 in cross-validation) than with Lasso and ElasticNet (about 0.54), but fitting takes longer (roughly 1 s per fit versus a few milliseconds).

Test-set result with the best hyperparameters:

In [141]:
print_result_CV_on_test_file(X_fe2_test, y_fe2_test, grid_search_fe2_gradient).head(30)

Unnamed: 0,MSE,RMSE,R2,MAE
0,3826.49,61.86,0.58,36.67


However, the test-set result (R2 = 0.58) is lower than the cross-validated score (R2 of about 0.63).
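
One way to judge whether such a gap is noteworthy is to compare it with the fold-to-fold spread of the cross-validated score, which `cv_results_` exposes as `std_test_<metric>`. The sketch below uses synthetic data and a deliberately small grid (both assumptions); with a single scoring metric the keys are `mean_test_score` and `std_test_score`:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data, only to illustrate the comparison
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=25.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=50, random_state=42),
    {'max_depth': [2, 3]},
    scoring='r2',
    cv=5,
)
search.fit(X_tr, y_tr)

# Mean and standard deviation of the R2 over the 5 folds, for the best candidate
best = search.best_index_
cv_mean = search.cv_results_['mean_test_score'][best]
cv_std = search.cv_results_['std_test_score'][best]
test_r2 = search.score(X_te, y_te)
print(f"CV R2 = {cv_mean:.3f} (+/- {2 * cv_std:.3f}), test R2 = {test_r2:.3f}")
```

A test score that falls within about two standard deviations of the cross-validated mean is compatible with ordinary sampling variation; a larger gap may point to overfitting of the hyperparameters.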

#### 2.5 - RandomForestRegressor model

In [148]:
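# Note: the output below reports 324 candidates and extra parameters (max_features, ...),
# so it was produced with a larger grid than the one shown here (12 candidates)
param_grid = {
    'n_estimators': [100, 200, 300],  # number of trees in the forest
    'max_depth': [None, 10, 20, 30],   # maximum depth of the trees
}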

In [151]:
grid_search_fe2_ramdomForest = fit_GridSearchCV_ramdomForestRegressor(X_fe2_train, y_fe2_train, scoring, param_grid)
print_result_CV_as_dataframe(grid_search_fe2_ramdomForest, scoring).head(30)

Fitting 5 folds for each of 324 candidates, totalling 1620 fits
Cross-validation results:


Unnamed: 0,params_,mean_score_MAE,mean_score_R2,mean_score_RMSE,mean_fit_time
0,"{'max_depth': 20, 'max_features': 'auto', 'min...",-33.05428,0.612211,-59.322061,2.364879
1,"{'max_depth': None, 'max_features': 'auto', 'm...",-32.86451,0.610775,-59.427546,5.69734
2,"{'max_depth': 20, 'max_features': 'auto', 'min...",-32.870681,0.610731,-59.430172,7.782674
3,"{'max_depth': 20, 'max_features': 'auto', 'min...",-32.913525,0.61023,-59.46943,4.635365
4,"{'max_depth': 30, 'max_features': 'auto', 'min...",-32.885754,0.610178,-59.47232,7.093365
5,"{'max_depth': None, 'max_features': 'auto', 'm...",-32.923569,0.609862,-59.49054,3.517574
6,"{'max_depth': 30, 'max_features': 'auto', 'min...",-33.07713,0.60918,-59.535876,2.089578
7,"{'max_depth': 30, 'max_features': 'auto', 'min...",-32.952071,0.609125,-59.544918,4.717043
8,"{'max_depth': 20, 'max_features': 'auto', 'min...",-33.033808,0.60905,-59.541937,2.296017
9,"{'max_depth': None, 'max_features': 'auto', 'm...",-33.086659,0.60896,-59.552594,1.527608


Test-set result with the best hyperparameters:

In [161]:
print_result_CV_on_test_file(X_fe2_test, y_fe2_test, grid_search_fe2_ramdomForest).head(30)

Unnamed: 0,MSE,RMSE,R2,MAE
0,3400.38,58.31,0.63,33.54


#### 2.6 - SVR model

In [164]:
param_grid = {
    'C': [0.1, 1, 10, 100],              # Regularization parameter (controls the bias-variance trade-off)
    'epsilon': [0.1, 0.2, 0.5, 1],       # Width of the error-insensitive zone
    'kernel': ['rbf', 'poly', 'linear'], # Kernel type used by the model ('linear', 'poly', 'rbf', 'sigmoid')
}

In [166]:
grid_search_fe2_SVR = fit_GridSearchCV_SVR(X_fe2_train, y_fe2_train, scoring, param_grid)
print_result_CV_as_dataframe(grid_search_fe2_SVR, scoring).head(30)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
Cross-validation results:


Unnamed: 0,params_,mean_score_MAE,mean_score_R2,mean_score_RMSE,mean_fit_time
0,"{'C': 100, 'epsilon': 0.2, 'kernel': 'linear'}",-37.229895,0.513858,-66.370978,2.951179
1,"{'C': 100, 'epsilon': 0.1, 'kernel': 'linear'}",-37.23755,0.513705,-66.381033,3.412134
2,"{'C': 100, 'epsilon': 0.5, 'kernel': 'linear'}",-37.200356,0.513225,-66.413459,5.779742
3,"{'C': 100, 'epsilon': 1, 'kernel': 'linear'}",-37.156539,0.513134,-66.419777,5.603556
4,"{'C': 10, 'epsilon': 0.1, 'kernel': 'linear'}",-37.117518,0.512738,-66.448285,0.411165
5,"{'C': 10, 'epsilon': 0.2, 'kernel': 'linear'}",-37.111255,0.512522,-66.462677,0.466895
6,"{'C': 10, 'epsilon': 0.5, 'kernel': 'linear'}",-37.13105,0.511932,-66.500017,0.543075
7,"{'C': 10, 'epsilon': 1, 'kernel': 'linear'}",-37.115741,0.510706,-66.579571,0.680315
8,"{'C': 100, 'epsilon': 1, 'kernel': 'rbf'}",-35.789867,0.500026,-67.367525,0.141691
9,"{'C': 100, 'epsilon': 0.5, 'kernel': 'rbf'}",-35.819746,0.499508,-67.40316,0.153265


SVR performs worse than the other models (best cross-validated R2 of about 0.51), although it does better on this target than on the energy-consumption target.
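
To support the choice of a final model, the scores reported in this section can be gathered in one table. The values below are copied from the outputs above (best cross-validated R2 per model, and test-set metrics where they were computed; no test-set score is shown for SVR in this section). The DataFrame layout itself is only an illustrative assumption:

```python
import pandas as pd

# Scores copied from the outputs of this section (NaN where not reported)
nan = float('nan')
reported = pd.DataFrame({
    'model': ['Lasso', 'ElasticNet', 'GradientBoostingRegressor',
              'RandomForestRegressor', 'SVR'],
    'cv_R2': [0.536, 0.536, 0.630, 0.612, 0.514],
    'test_R2': [0.55, 0.55, 0.58, 0.63, nan],
    'test_RMSE': [63.93, 63.93, 61.86, 58.31, nan],
    'test_MAE': [40.76, 40.76, 36.67, 33.54, nan],
})

# Rank on the cross-validated score, which is the usual selection criterion
ranking = reported.sort_values('cv_R2', ascending=False).reset_index(drop=True)
best_on_test = reported.loc[reported['test_R2'].idxmax(), 'model']
print(ranking)
print("Highest test R2:", best_on_test)
```

The two tree-based ensembles lead on both criteria: GradientBoostingRegressor has the highest cross-validated R2, while RandomForestRegressor has the highest test-set R2.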