# Anticipating building energy consumption needs - *Prediction notebook*

## Mission

You work for the city of Seattle. To reach its goal of becoming a carbon-neutral city by 2050, your team is looking closely at the energy consumption and emissions of non-residential buildings.

City agents took careful readings in 2016. These readings are expensive to obtain, however, so you want to use the existing ones to predict the CO2 emissions and total energy consumption of non-residential buildings that have not yet been measured.

Your prediction will rely on the structural data of the buildings (size and use, construction date, geographic location, ...).

You also want to assess the value of the ENERGY STAR Score for predicting emissions, since it is tedious to compute with the approach your team currently uses. You will include it in the modeling and judge its usefulness.

You have just come out of a briefing with your team. Here is a summary of your mission:


1) Carry out a short exploratory analysis.
2) Test different prediction models to best address the problem.

Pay close attention to how the variables are handled, both to extract new information (can anything interesting be inferred from a plain address?) and to improve performance by applying simple transformations to the variables (normalization, log transform, etc.).

Set up a rigorous performance evaluation, and tune the hyperparameters and the choice of ML algorithms using cross-validation. Test at least 4 algorithms from different families (for example: ElasticNet, SVM, GradientBoosting, RandomForest).
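The brief mentions the log transform as one of the simple transformations worth trying. As a rough illustration only, the sketch below applies it to a skewed target through scikit-learn's `TransformedTargetRegressor`. The data is synthetic and the names `X_log_demo`, `y_log_demo`, `raw_model` and `log_model` are assumptions made for the example; nothing here comes from the Seattle dataset.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic, strictly positive and right-skewed target (assumption, not the real data)
rng = np.random.default_rng(42)
X_log_demo = rng.normal(size=(400, 2))
y_log_demo = np.exp(
    10.0 + 0.8 * X_log_demo[:, 0] + 0.5 * X_log_demo[:, 1]
    + rng.normal(scale=0.1, size=400)
)

X_tr_l, X_te_l, y_tr_l, y_te_l = train_test_split(
    X_log_demo, y_log_demo, test_size=0.25, random_state=42
)

# Model fitted directly on the raw target
raw_model = Ridge(alpha=1.0).fit(X_tr_l, y_tr_l)

# Same model fitted on log1p(y); predictions are mapped back with expm1
log_model = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0), func=np.log1p, inverse_func=np.expm1
).fit(X_tr_l, y_tr_l)

r2_raw = r2_score(y_te_l, raw_model.predict(X_te_l))
r2_log = r2_score(y_te_l, log_model.predict(X_te_l))
print(f"R2 on raw target: {r2_raw:.3f}")
print(f"R2 with log-transformed target: {r2_log:.3f}")
```

Because the wrapper inverts the transform at prediction time, both scores are expressed in the original units and can be compared directly.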

In [3]:
import numpy as np

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import ScalarFormatter
from matplotlib.ticker import FuncFormatter
import scipy
from scipy import stats
import scipy.stats as st

import statsmodels
import statsmodels.api as sm
import missingno as msno

import sklearn
from sklearn.experimental import enable_iterative_imputer  # Required to enable IterativeImputer
from sklearn.impute import IterativeImputer

from sklearn.impute import KNNImputer
# Encode categorical variables before using KNNImputer
from category_encoders.ordinal import OrdinalEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# for centering and scaling
from sklearn.preprocessing import StandardScaler
# for PCA
from sklearn.decomposition import PCA

from sklearn import model_selection
from sklearn.model_selection import GridSearchCV

from sklearn import metrics
from sklearn.metrics import roc_curve, auc, confusion_matrix, mean_squared_error, make_scorer, r2_score, mean_absolute_error

from sklearn import dummy
from sklearn.dummy import DummyClassifier
from sklearn.dummy import DummyRegressor

from sklearn import linear_model
from sklearn.linear_model import Ridge
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import Lasso
from sklearn.linear_model import LogisticRegression

from sklearn.svm import LinearSVC

from sklearn import kernel_ridge

from sklearn import neighbors
from sklearn.neighbors import KNeighborsClassifier

import tensorflow
from tensorflow import keras
from tensorflow.keras import models
from tensorflow.keras import layers

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor

import timeit

print("numpy version", np.__version__)
print("pandas version", pd.__version__)
print("matplotlib version", matplotlib.__version__)
print("seaborn version", sns.__version__)
print("scipy version", scipy.__version__)
print("statsmodels version", statsmodels.__version__)
print("missingno version", msno.__version__)

print("sklearn version", sklearn.__version__)
print("tensorflow version", tensorflow.__version__)

pd.options.display.max_rows = 200
pd.options.display.max_columns = 100

numpy version 1.26.4
pandas version 2.1.4
matplotlib version 3.8.0
seaborn version 0.13.2
scipy version 1.11.4
statsmodels version 0.14.0
missingno version 0.5.2
sklearn version 1.2.2
tensorflow version 2.18.0


## 1 - Development and simulation of the first model (target = SiteEnergyUseWN(kBtu))

In [5]:
# Load the data file
data_fe1 = pd.read_csv("C:/Users/admin/Documents/Projets/Projet_4/data_projet/cleaned/2016_Building_Energy_Benchmarking_fe1.csv", sep=',', low_memory=False)
data_fe1.head()

Unnamed: 0,NumberofBuildings,NumberofFloors,PropertyGFATotal,SiteEnergyUseWN(kBtu),TotalGHGEmissions,PrimaryPropertyType_Distribution Center,PrimaryPropertyType_Hospital,PrimaryPropertyType_Hotel,PrimaryPropertyType_K-12 School,PrimaryPropertyType_Laboratory,PrimaryPropertyType_Large Office,PrimaryPropertyType_Low-Rise Multifamily,PrimaryPropertyType_Medical Office,PrimaryPropertyType_Mixed Use Property,PrimaryPropertyType_Office,PrimaryPropertyType_Other,PrimaryPropertyType_Refrigerated Warehouse,PrimaryPropertyType_Residence Hall,PrimaryPropertyType_Restaurant,PrimaryPropertyType_Retail Store,PrimaryPropertyType_Self-Storage Facility,PrimaryPropertyType_Senior Care Community,PrimaryPropertyType_Small- and Mid-Sized Office,PrimaryPropertyType_Supermarket / Grocery Store,PrimaryPropertyType_University,PrimaryPropertyType_Warehouse,PrimaryPropertyType_Worship Facility,Neighborhood_BALLARD,Neighborhood_Ballard,Neighborhood_CENTRAL,Neighborhood_Central,Neighborhood_DELRIDGE,Neighborhood_DELRIDGE NEIGHBORHOODS,Neighborhood_DOWNTOWN,Neighborhood_Delridge,Neighborhood_EAST,Neighborhood_GREATER DUWAMISH,Neighborhood_LAKE UNION,Neighborhood_MAGNOLIA / QUEEN ANNE,Neighborhood_NORTH,Neighborhood_NORTHEAST,Neighborhood_NORTHWEST,Neighborhood_North,Neighborhood_Northwest,Neighborhood_SOUTHEAST,Neighborhood_SOUTHWEST,"YearBuilt_Bin_(1899.885, 1911.5]","YearBuilt_Bin_(1911.5, 1923.0]","YearBuilt_Bin_(1923.0, 1934.5]","YearBuilt_Bin_(1934.5, 1946.0]","YearBuilt_Bin_(1946.0, 1957.5]","YearBuilt_Bin_(1957.5, 1969.0]","YearBuilt_Bin_(1969.0, 1980.5]","YearBuilt_Bin_(1980.5, 1992.0]","YearBuilt_Bin_(1992.0, 2003.5]","YearBuilt_Bin_(2003.5, 2015.0]",electricity_percent,gaz_percent,steam_percent
0,1.0,12,88434.0,7456910.0,249.98,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,54.61,17.66,27.73
1,1.0,11,103566.0,8664479.0,295.86,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,38.66,61.34,0.0
2,1.0,10,61320.0,6946800.5,286.43,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,40.75,26.66,32.59
3,1.0,18,175580.0,14656503.0,505.01,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,37.88,62.12,0.0
4,1.0,2,97288.0,12581712.0,301.81,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,60.99,39.01,0.0


In [6]:
data_fe1.shape

(1444, 59)

### 1.1 - Select the features and the target

In [8]:
y_fe1_conso = data_fe1['SiteEnergyUseWN(kBtu)']
X_fe1 = data_fe1.drop('SiteEnergyUseWN(kBtu)', axis=1, inplace=False)
X_fe1.shape

(1444, 58)

In [9]:
y_fe1_conso.shape

(1444,)

In [10]:
y_fe1_emissions = data_fe1['TotalGHGEmissions']
X_fe1 = X_fe1.drop('TotalGHGEmissions', axis=1, inplace=False)
X_fe1.shape

(1444, 57)

In [11]:
y_fe1_emissions.shape

(1444,)

In [12]:
X_fe1.head()

Unnamed: 0,NumberofBuildings,NumberofFloors,PropertyGFATotal,PrimaryPropertyType_Distribution Center,PrimaryPropertyType_Hospital,PrimaryPropertyType_Hotel,PrimaryPropertyType_K-12 School,PrimaryPropertyType_Laboratory,PrimaryPropertyType_Large Office,PrimaryPropertyType_Low-Rise Multifamily,PrimaryPropertyType_Medical Office,PrimaryPropertyType_Mixed Use Property,PrimaryPropertyType_Office,PrimaryPropertyType_Other,PrimaryPropertyType_Refrigerated Warehouse,PrimaryPropertyType_Residence Hall,PrimaryPropertyType_Restaurant,PrimaryPropertyType_Retail Store,PrimaryPropertyType_Self-Storage Facility,PrimaryPropertyType_Senior Care Community,PrimaryPropertyType_Small- and Mid-Sized Office,PrimaryPropertyType_Supermarket / Grocery Store,PrimaryPropertyType_University,PrimaryPropertyType_Warehouse,PrimaryPropertyType_Worship Facility,Neighborhood_BALLARD,Neighborhood_Ballard,Neighborhood_CENTRAL,Neighborhood_Central,Neighborhood_DELRIDGE,Neighborhood_DELRIDGE NEIGHBORHOODS,Neighborhood_DOWNTOWN,Neighborhood_Delridge,Neighborhood_EAST,Neighborhood_GREATER DUWAMISH,Neighborhood_LAKE UNION,Neighborhood_MAGNOLIA / QUEEN ANNE,Neighborhood_NORTH,Neighborhood_NORTHEAST,Neighborhood_NORTHWEST,Neighborhood_North,Neighborhood_Northwest,Neighborhood_SOUTHEAST,Neighborhood_SOUTHWEST,"YearBuilt_Bin_(1899.885, 1911.5]","YearBuilt_Bin_(1911.5, 1923.0]","YearBuilt_Bin_(1923.0, 1934.5]","YearBuilt_Bin_(1934.5, 1946.0]","YearBuilt_Bin_(1946.0, 1957.5]","YearBuilt_Bin_(1957.5, 1969.0]","YearBuilt_Bin_(1969.0, 1980.5]","YearBuilt_Bin_(1980.5, 1992.0]","YearBuilt_Bin_(1992.0, 2003.5]","YearBuilt_Bin_(2003.5, 2015.0]",electricity_percent,gaz_percent,steam_percent
0,1.0,12,88434.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,54.61,17.66,27.73
1,1.0,11,103566.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,38.66,61.34,0.0
2,1.0,10,61320.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,40.75,26.66,32.59
3,1.0,18,175580.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,37.88,62.12,0.0
4,1.0,2,97288.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,60.99,39.01,0.0


### 1.2 - Standardize the values and create the train / test sets

In [14]:
std_scale = StandardScaler().fit(X_fe1)
X_scale_fe1 = std_scale.transform(X_fe1)

In [15]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X_scale_fe1, y_fe1_conso, test_size=0.25, random_state=42)  # 25% of the data goes to the test set

In [16]:
X_train.shape

(1083, 57)

In [17]:
X_test.shape

(361, 57)

In [18]:
y_train.shape

(1083,)

In [19]:
y_test.shape

(361,)
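In the cells above, the scaler is fitted on all of `X_fe1` before the split, so the test rows contribute to the mean and standard deviation used for scaling. A possible variant, sketched here on synthetic data, splits first and fits the scaler on the training rows only. The names `X_split_demo`, `y_split_demo` and `scaler_demo` are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_fe1 / y_fe1_conso (assumption, not the real data)
rng = np.random.default_rng(42)
X_split_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
y_split_demo = X_split_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=200)

# 1) Split first
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_split_demo, y_split_demo, test_size=0.25, random_state=42
)

# 2) Fit the scaler on the training rows only
scaler_demo = StandardScaler().fit(X_tr_s)

# 3) Apply the same training statistics to both sets
X_tr_scaled = scaler_demo.transform(X_tr_s)
X_te_scaled = scaler_demo.transform(X_te_s)

print("Train column means after scaling:", X_tr_scaled.mean(axis=0).round(6))
print("Test column means after scaling: ", X_te_scaled.mean(axis=0).round(3))
```

The training columns are centered by construction, while the test columns are only approximately centered, which is expected when the test set is treated as unseen data.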

### 1.3 - First models without cross-validation

The goal is to evaluate a few models without cross-validation, starting from a baseline and moving toward more elaborate models.

The "alpha" hyperparameter is set by hand at this stage (no cross-validation to tune it).

We loop over the models and store the scores in a table. First, let's create the functions used in the loop:

**Baseline with DummyRegressor**

We use the mean strategy: the model predicts the mean of the training target values. Let's create a function that takes the training and test sets as input and returns the MSE, RMSE, R2 and MAE scores, along with the elapsed time:

In [22]:
def fit_dummyRegressor(X_train, y_train, X_test, y_test):

    start_time = timeit.default_timer()

    # Initialize the DummyRegressor with the 'mean' strategy
    dummy_regressor = DummyRegressor(strategy='mean')

    # Train the model
    dummy_regressor.fit(X_train, y_train)

    # Predict on the test data
    y_pred = dummy_regressor.predict(X_test)

    elapsed = timeit.default_timer() - start_time

    mse = round(mean_squared_error(y_test, y_pred), 2)       # Mean squared error
    rmse = round(np.sqrt(mse), 2)                            # Root mean squared error (RMSE)
    mae = round(mean_absolute_error(y_test, y_pred), 2)      # Mean absolute error
    r2 = round(r2_score(y_test, y_pred), 2)                  # Coefficient of determination

    return mse, rmse, r2, mae, elapsed
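To make the four scores concrete, the sketch below computes them by hand with NumPy on a tiny made-up example and checks the result against scikit-learn. The helper `regression_scores_demo` is hypothetical and is not part of the notebook; it only illustrates the definitions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_scores_demo(y_true, y_pred):
    """Hypothetical helper: MSE, RMSE, R2 and MAE computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)                      # mean of squared errors
    rmse = np.sqrt(mse)                                # same unit as the target
    mae = np.mean(np.abs(residuals))                   # mean of absolute errors
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # variance around the mean
    r2 = 1.0 - np.sum(residuals ** 2) / ss_tot         # share of variance explained
    return mse, rmse, r2, mae

# Made-up values, chosen so the arithmetic is easy to follow
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.0, 5.0, 8.0, 9.0])

mse_d, rmse_d, r2_d, mae_d = regression_scores_demo(y_true_demo, y_pred_demo)
print(mse_d, rmse_d, r2_d, mae_d)

# Predicting the mean of y_true everywhere gives R2 = 0 on that same data,
# which is why a mean-strategy DummyRegressor is a natural baseline.
mean_pred = np.full_like(y_true_demo, y_true_demo.mean())
r2_mean_d = regression_scores_demo(y_true_demo, mean_pred)[2]
print(r2_mean_d)
```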

**Ridge regression model**

Ridge regression shrinks the magnitude of the coefficients of a linear regression, which helps avoid overfitting. The hyperparameter will be tuned later by cross-validation with GridSearchCV.

Let's create a function that instantiates the model, trains it, and computes the scores:

In [24]:
def fit_ridge(X_train, y_train, X_test, y_test, alpha):

    start_time = timeit.default_timer()

    # Initialize the Ridge model with the alpha parameter
    ridge_regressor = Ridge(alpha=alpha)  # alpha controls regularization; larger means stronger regularization

    # Train the model
    ridge_regressor.fit(X_train, y_train)

    # Predict on the test data
    y_pred = ridge_regressor.predict(X_test)

    elapsed = timeit.default_timer() - start_time

    # Evaluate the model with several metrics
    mse = round(mean_squared_error(y_test, y_pred), 2)       # Mean squared error
    rmse = round(np.sqrt(mse), 2)                            # Root mean squared error (RMSE)
    mae = round(mean_absolute_error(y_test, y_pred), 2)      # Mean absolute error
    r2 = round(r2_score(y_test, y_pred), 2)                  # Coefficient of determination

    return mse, rmse, r2, mae, elapsed
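The shrinkage effect described above can be observed directly. The following sketch, on synthetic data with assumed names (`X_ridge_demo`, `y_ridge_demo`, `ridge_coef_norms`), fits Ridge for increasing values of alpha and prints the Euclidean norm of the coefficient vector.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (assumption): 5 features, only the first 3 carry signal
rng = np.random.default_rng(0)
X_ridge_demo = rng.normal(size=(100, 5))
y_ridge_demo = (
    X_ridge_demo @ np.array([4.0, -3.0, 2.0, 0.0, 0.0])
    + rng.normal(scale=0.5, size=100)
)

ridge_alphas_demo = [0.01, 1.0, 100.0, 10000.0]
ridge_coef_norms = []
for alpha in ridge_alphas_demo:
    model = Ridge(alpha=alpha).fit(X_ridge_demo, y_ridge_demo)
    norm = float(np.linalg.norm(model.coef_))
    ridge_coef_norms.append(norm)
    print(f"alpha = {alpha:>8}: ||coef|| = {norm:.4f}")
```

The norm decreases as alpha grows, but the coefficients are shrunk toward zero rather than set exactly to zero.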

**Lasso regression model**

The Lasso is a supervised method for variable selection and dimensionality reduction: variables that are not needed to predict the target are eliminated (their coefficients are set exactly to zero).

Let's create a function that instantiates the model, trains it, and computes the scores:

In [26]:
def fit_lasso(X_train, y_train, X_test, y_test, alpha):

    start_time = timeit.default_timer()

    # Initialize the Lasso model with the alpha parameter
    lasso_regressor = Lasso(alpha=alpha)  # alpha controls regularization; larger means stronger regularization

    # Train the model
    lasso_regressor.fit(X_train, y_train)

    # Predict on the test data
    y_pred = lasso_regressor.predict(X_test)

    elapsed = timeit.default_timer() - start_time

    # Evaluate the model with several metrics
    mse = round(mean_squared_error(y_test, y_pred), 2)       # Mean squared error
    rmse = round(np.sqrt(mse), 2)                            # Root mean squared error (RMSE)
    mae = round(mean_absolute_error(y_test, y_pred), 2)      # Mean absolute error
    r2 = round(r2_score(y_test, y_pred), 2)                  # Coefficient of determination

    return mse, rmse, r2, mae, elapsed
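The variable-selection behaviour of the Lasso can be illustrated by counting the non-zero coefficients as alpha increases. This is a minimal sketch on synthetic data; `X_lasso_demo`, `y_lasso_demo` and `lasso_nonzero_counts` are names assumed for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumption): 6 features, the last 3 are pure noise
rng = np.random.default_rng(0)
X_lasso_demo = rng.normal(size=(200, 6))
y_lasso_demo = (
    X_lasso_demo @ np.array([5.0, -4.0, 3.0, 0.0, 0.0, 0.0])
    + rng.normal(scale=0.5, size=200)
)

lasso_alphas_demo = [0.001, 0.1, 1.0, 10.0]
lasso_nonzero_counts = []
for alpha in lasso_alphas_demo:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X_lasso_demo, y_lasso_demo)
    n_nonzero = int(np.sum(model.coef_ != 0))
    lasso_nonzero_counts.append(n_nonzero)
    print(f"alpha = {alpha:>6}: {n_nonzero} non-zero coefficient(s)")
```

With a moderate alpha only the informative features keep a non-zero coefficient, and with a large enough alpha every coefficient is driven to zero, so the model falls back to predicting a constant.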

**RandomForestRegressor model**

A random forest is an ensemble of many decision trees combined to produce a more accurate and more robust prediction. Each decision tree is built from a random sample of the data, and the trees' outputs are averaged to obtain the final prediction.

Let's create a function that instantiates the model, trains it, and computes the scores:

In [28]:
def fit_ramdomForestRegressor(X_train, y_train, X_test, y_test, X_fe1):

    start_time = timeit.default_timer()

    # Create the model
    # n_estimators: number of trees in the forest (default = 100). A higher value can improve accuracy but increases computation time.
    model = RandomForestRegressor(n_estimators=100, random_state=42)  # random_state fixes the seed so results are reproducible

    # Train the model
    model.fit(X_train, y_train)

    # Predictions
    y_pred = model.predict(X_test)

    elapsed = timeit.default_timer() - start_time

    # Evaluate the model with several metrics
    mse = round(mean_squared_error(y_test, y_pred), 2)       # Mean squared error
    rmse = round(np.sqrt(mse), 2)                            # Root mean squared error (RMSE)
    mae = round(mean_absolute_error(y_test, y_pred), 2)      # Mean absolute error
    r2 = round(r2_score(y_test, y_pred), 2)                  # Coefficient of determination

    # Show the feature importances
    print("Feature importances in the RandomForestRegressor:")
    importances = model.feature_importances_
    # Build a DataFrame to display the feature importances (X_fe1 provides the column names)
    feature_importance = pd.DataFrame({'Feature': X_fe1.columns, 'Importance': importances})
    # Sort by decreasing importance
    feature_importance = feature_importance.sort_values(by='Importance', ascending=False)

    # Display the results
    print(feature_importance)

    return mse, rmse, r2, mae, elapsed
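The averaging step mentioned in the description of the random forest can be checked by hand: in scikit-learn, the fitted trees are exposed through the `estimators_` attribute, and the forest prediction is the mean of their individual predictions. The sketch below uses synthetic data, and the names `X_rf_demo`, `y_rf_demo` and `forest_demo` are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumption, not the Seattle dataset)
rng = np.random.default_rng(42)
X_rf_demo = rng.uniform(0.0, 10.0, size=(150, 3))
y_rf_demo = 2.0 * X_rf_demo[:, 0] + np.sin(X_rf_demo[:, 1]) + rng.normal(scale=0.3, size=150)

forest_demo = RandomForestRegressor(n_estimators=20, random_state=42)
forest_demo.fit(X_rf_demo, y_rf_demo)

# Predictions of each individual tree on the first 5 rows: shape (n_trees, 5)
X_first_rows = X_rf_demo[:5]
per_tree_pred = np.stack([tree.predict(X_first_rows) for tree in forest_demo.estimators_])

# Averaging over the trees reproduces the forest prediction
manual_forest_pred = per_tree_pred.mean(axis=0)
forest_pred = forest_demo.predict(X_first_rows)

print("Mean of the trees:", manual_forest_pred.round(3))
print("Forest prediction:", forest_pred.round(3))
```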

**First pass over the models for different alpha values**

In [35]:
mse, rmse, r2, mae, elapsed = fit_dummyRegressor(X_train, y_train, X_test, y_test)
scores_array = np.array([['DummyRegressor', '', mse, rmse, r2, mae, elapsed]])

alphas = np.logspace(-6, 6, 13)

for alpha in alphas:
    mse, rmse, r2, mae, elapsed = fit_ridge(X_train, y_train, X_test, y_test, alpha)
    scores_array = np.vstack([scores_array, ['ridge', alpha, mse, rmse, r2, mae, elapsed]])
    mse, rmse, r2, mae, elapsed = fit_lasso(X_train, y_train, X_test, y_test, alpha)
    scores_array = np.vstack([scores_array, ['lasso', alpha, mse, rmse, r2, mae, elapsed]])

mse, rmse, r2, mae, elapsed = fit_ramdomForestRegressor(X_train, y_train, X_test, y_test, X_fe1)
scores_array = np.vstack([scores_array, ['RamdomForestRegressor', '', mse, rmse, r2, mae, elapsed]])

print(scores_array)

  model = cd_fast.enet_coordinate_descent(


Feature importances in the RandomForestRegressor:
                                            Feature  Importance
2                                  PropertyGFATotal    0.504276
21  PrimaryPropertyType_Supermarket / Grocery Store    0.087605
54                              electricity_percent    0.065398
1                                    NumberofFloors    0.048745
55                                      gaz_percent    0.037337
23                    PrimaryPropertyType_Warehouse    0.033062
13                        PrimaryPropertyType_Other    0.024428
7                    PrimaryPropertyType_Laboratory    0.017801
16                   PrimaryPropertyType_Restaurant    0.012443
19        PrimaryPropertyType_Senior Care Community    0.011961
52                   YearBuilt_Bin_(1992.0, 2003.5]    0.009903
50                   YearBuilt_Bin_(1969.0, 1980.5]    0.008951
11           PrimaryPropertyType_Mixed Use Property    0.008767
31                            Neighborhood_DOWNT

In [37]:
# Convert the array to a DataFrame
df_scores_fe1 = pd.DataFrame(scores_array, columns=['Modèle', 'Alpha', 'MSE', 'RMSE', 'R2', 'MAE', 'TIME'])

# Convert the R2 column to numeric (the NumPy array stores every value as a string)
df_scores_fe1['R2'] = pd.to_numeric(df_scores_fe1['R2'], errors='coerce')

# Sort the DataFrame by the R2 column, from highest to lowest
df_scores_fe1.sort_values(by='R2', ascending=False, inplace=True)

df_scores_fe1.head(30)

Unnamed: 0,Modèle,Alpha,MSE,RMSE,R2,MAE,TIME
24,lasso,100000.0,4829557647421.92,2197625.46,0.66,1525856.11,0.0029982000123709
22,lasso,10000.0,4832927251745.98,2198391.97,0.66,1512727.73,0.0053231000201776
20,lasso,1000.0,4853627921208.3,2203095.08,0.66,1519289.86,0.0236485000932589
15,ridge,10.0,4851438738149.68,2202598.18,0.66,1524201.33,0.0046280999667942
14,lasso,1.0,4856086661077.91,2203653.03,0.65,1519721.86,0.09400220005773
12,lasso,0.1,4856086439243.64,2203652.98,0.65,1519721.03,0.0998564000474289
18,lasso,100.0,4855957676310.82,2203623.76,0.65,1519762.88,0.1028509999159723
17,ridge,100.0,4893087873647.63,2212032.52,0.65,1566766.03,0.0051470999605953
16,lasso,10.0,4856088987377.37,2203653.55,0.65,1519730.11,0.1031276999274268
1,ridge,1e-06,4865271616993.3,2205736.07,0.65,1520678.51,0.0063264000928029


Unsurprisingly, the DummyRegressor model scores poorly, as do the Ridge models with alpha > 1000 and the Lasso with a very high alpha.

The Lasso is the best model so far, although Ridge with alpha = 10 reaches the same rounded R2 of 0.66.

We also observe that, for the RandomForestRegressor, PropertyGFATotal is by far the most important variable (importance of about 0.50).
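A side note on the score table: a NumPy array built from a mix of strings and numbers stores everything as strings, which is why the R2 column had to be converted back to numeric before sorting. One possible alternative, sketched below with placeholder values that are not results from this notebook, is to collect one dictionary per model and let pandas infer the column types.

```python
import numpy as np
import pandas as pd

# Mixing strings and floats in np.array yields a string dtype for every cell
mixed_array_demo = np.array([['DummyRegressor', '', 1.5, 0.0]])
print(mixed_array_demo.dtype)

# Alternative: one dict per model, numeric columns stay numeric
# (the scores below are placeholders for illustration only)
score_rows_demo = []
score_rows_demo.append({'Modèle': 'DummyRegressor', 'Alpha': np.nan, 'R2': 0.0, 'MAE': 2.5})
score_rows_demo.append({'Modèle': 'ridge', 'Alpha': 10.0, 'R2': 0.4, 'MAE': 1.5})
score_rows_demo.append({'Modèle': 'lasso', 'Alpha': 10.0, 'R2': 0.3, 'MAE': 1.7})

df_scores_demo = pd.DataFrame(score_rows_demo).sort_values(by='R2', ascending=False)
print(df_scores_demo.dtypes)
print(df_scores_demo)
```

With this layout every numeric column can be sorted or aggregated without a prior `pd.to_numeric` call.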

### 1.4 - Ridge regression with cross-validation

Using GridSearchCV with a Ridge regression tunes the model's hyperparameters, here the regularization parameter alpha. GridSearchCV runs an exhaustive search over a specified set of hyperparameter values and evaluates each combination by cross-validation, which identifies the best hyperparameters for the model.
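To show what the exhaustive search amounts to, the sketch below reproduces it by hand on synthetic data: one cross-validated score per candidate alpha, then the alpha with the best mean score. The names `X_gs_demo`, `y_gs_demo`, `alphas_demo` and `cv_demo` are assumptions, and a shuffled `KFold` with a fixed seed is used so that both approaches see the same folds.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic data (assumption, not the Seattle dataset)
rng = np.random.default_rng(1)
X_gs_demo = rng.normal(size=(120, 4))
y_gs_demo = X_gs_demo @ np.array([3.0, -2.0, 1.0, 0.0]) + rng.normal(scale=1.0, size=120)

alphas_demo = np.logspace(-3, 3, 7)
cv_demo = KFold(n_splits=5, shuffle=True, random_state=0)

# Manual search: mean R2 over the 5 folds for each alpha
manual_mean_r2 = [
    cross_val_score(Ridge(alpha=a), X_gs_demo, y_gs_demo, cv=cv_demo, scoring='r2').mean()
    for a in alphas_demo
]
best_manual_alpha = alphas_demo[int(np.argmax(manual_mean_r2))]

# Same search delegated to GridSearchCV
search_demo = GridSearchCV(Ridge(), {'alpha': alphas_demo}, cv=cv_demo, scoring='r2')
search_demo.fit(X_gs_demo, y_gs_demo)

print("Best alpha (manual loop): ", best_manual_alpha)
print("Best alpha (GridSearchCV):", search_demo.best_params_['alpha'])
```

Beyond the loop itself, GridSearchCV also refits the best configuration on the whole training set, which is what makes `grid_search.predict` usable afterwards.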

In [None]:
# Define the metrics to track: MSE (negated by make_scorer so that higher is better) and R2
scoring = {
    'MSE': make_scorer(mean_squared_error, greater_is_better=False),
    'R2': 'r2'
}
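Because `greater_is_better=False` flips the sign of the metric, the MSE values reported by the search are negative. The short sketch below checks this on a four-point toy example; the names `neg_mse_scorer`, `X_sc_demo` and `y_sc_demo` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer, mean_squared_error

# Scorer built the same way as the 'MSE' entry of the scoring dictionary
neg_mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)

# Toy data: the mean-strategy model always predicts mean(y) = 3.0
X_sc_demo = np.zeros((4, 1))
y_sc_demo = np.array([1.0, 2.0, 3.0, 6.0])
dummy_demo = DummyRegressor(strategy='mean').fit(X_sc_demo, y_sc_demo)

plain_mse = mean_squared_error(y_sc_demo, dummy_demo.predict(X_sc_demo))
scorer_value = neg_mse_scorer(dummy_demo, X_sc_demo, y_sc_demo)

print("MSE:          ", plain_mse)     # (4 + 1 + 0 + 9) / 4
print("Scorer output:", scorer_value)  # same magnitude, opposite sign
```

When reading `cv_results_`, the value closest to zero therefore corresponds to the lowest MSE.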

In [None]:
# Define the parameter grid to test
param_grid = {
    'alpha': np.logspace(-6, 6, 13)
}

# Create a Ridge regression with hyperparameter search by cross-validation
grid_search = model_selection.GridSearchCV(
    estimator=Ridge(),           # a Ridge regression
    param_grid=param_grid,       # hyperparameters to test
    cv=5,                        # number of cross-validation folds
    scoring=scoring,             # metrics computed on each fold
    refit='R2'                   # metric used to select the hyperparameter
)

# Fit the grid search on the training set
grid_search.fit(X_train, y_train)

# Show the best hyperparameter(s)
print("Best hyperparameter(s) found by cross-validation on the training set:")
print(grid_search.best_params_)

# Show the best score
print("Best mean cross-validated R2 on the training set:")
print(grid_search.best_score_)

# Show the corresponding performance
print("Cross-validation results:")
for score_name in scoring.keys():
    print(f"\nScores for '{score_name}':")
    for mean, std, params in zip(
            grid_search.cv_results_[f'mean_test_{score_name}'],  # mean score across folds
            grid_search.cv_results_[f'std_test_{score_name}'],   # standard deviation of the score
            grid_search.cv_results_['params']                    # hyperparameter value
    ):
        print(f"{score_name} = {mean:.3f} (+/-{std * 2:.03f}) for {params}")

For alpha = 1.0, the MSE is better than that of the plain regression model and of the Dummy baseline.

Scores of the predictions on the test set with the best hyperparameters:

In [None]:
y_pred = grid_search.predict(X_test)

# Compute metrics to evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # avoids the `squared=False` argument, deprecated in recent scikit-learn releases
r2 = r2_score(y_test, y_pred)

print(f"RMSE on the test set: {rmse:.3f}")
print(f"MSE on the test set : {mse:.3f}")
print(f"R^2 on the test set : {r2:.3f}")

Judging by the R2 value, the predictions score better on the test set than in cross-validation.

### 1.5 - Lasso regression with cross-validation

The Lasso is a supervised method for variable selection and dimensionality reduction: variables that are not needed to predict the target are eliminated.

We use GridSearchCV with the Lasso() estimator.

In [None]:
import warnings

# Silence warnings (such as Lasso convergence warnings) during the search
warnings.filterwarnings("ignore")

# Define the parameter grid to test
param_grid = {
    'alpha': np.logspace(-5, 1, 50)
}

# Create a Lasso regression with hyperparameter search by cross-validation
grid_search = model_selection.GridSearchCV(
    estimator=Lasso(),           # a Lasso regression
    param_grid=param_grid,       # hyperparameters to test
    cv=5,                        # number of cross-validation folds
    scoring=scoring,             # metrics computed on each fold
    refit='R2'                   # metric used to select the hyperparameter
)

# Fit the grid search on the training set
grid_search.fit(X_train, y_train)

# Show the best hyperparameter(s)
print("Best hyperparameter(s) found by cross-validation on the training set:")
print(grid_search.best_params_)

# Show the best score
print("Best mean cross-validated R2 on the training set:")
print(grid_search.best_score_)

# Show the corresponding performance
print("Cross-validation results:")
for score_name in scoring.keys():
    print(f"\nScores for '{score_name}':")
    for mean, std, params in zip(
            grid_search.cv_results_[f'mean_test_{score_name}'],  # mean score across folds
            grid_search.cv_results_[f'std_test_{score_name}'],   # standard deviation of the score
            grid_search.cv_results_['params']                    # hyperparameter value
    ):
        print(f"{score_name} = {mean:.3f} (+/-{std * 2:.03f}) for {params}")

The Lasso result is comparable to the Ridge result.

Scores on the test set:

In [None]:
y_pred = grid_search.predict(X_test)

# Compute metrics to evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # avoids the `squared=False` argument, deprecated in recent scikit-learn releases
r2 = r2_score(y_test, y_pred)

print(f"RMSE on the test set: {rmse:.3f}")
print(f"MSE on the test set : {mse:.3f}")
print(f"R^2 on the test set : {r2:.3f}")

In [None]:
# TODO: loop over the models and add the other ones: random forest, LightGBM, XGBoost, GradientBoostingRegressor, ElasticNet.
# TODO: add the other metrics, MAE and RMSE.
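One way the loop described in the note above could look is sketched here, restricted to scikit-learn estimators so that the example stays self-contained (LightGBM and XGBoost are left out). The data is synthetic, the hyperparameter values are arbitrary, and the names `models_demo`, `cv_scoring_demo` and `df_cv_scores_demo` are assumptions; this is not the notebook's final implementation.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression problem (assumption, not the Seattle dataset)
X_loop_demo, y_loop_demo = make_regression(
    n_samples=150, n_features=8, noise=10.0, random_state=42
)

# Four model families; scaling sits inside the pipeline so it is refit on each fold
models_demo = {
    'ElasticNet': make_pipeline(StandardScaler(), ElasticNet(alpha=0.1)),
    'SVR': make_pipeline(StandardScaler(), SVR(kernel='linear', C=10.0)),
    'GradientBoosting': GradientBoostingRegressor(n_estimators=50, random_state=42),
    'RandomForest': RandomForestRegressor(n_estimators=50, random_state=42),
}

# scikit-learn reports error metrics negated so that higher is always better
cv_scoring_demo = {
    'R2': 'r2',
    'MAE': 'neg_mean_absolute_error',
    'RMSE': 'neg_root_mean_squared_error',
}
cv_loop_demo = KFold(n_splits=5, shuffle=True, random_state=42)

cv_rows_demo = []
for name, model in models_demo.items():
    cv_out = cross_validate(model, X_loop_demo, y_loop_demo,
                            cv=cv_loop_demo, scoring=cv_scoring_demo)
    cv_rows_demo.append({
        'Model': name,
        'R2': cv_out['test_R2'].mean(),
        'MAE': -cv_out['test_MAE'].mean(),    # flip the sign back
        'RMSE': -cv_out['test_RMSE'].mean(),  # flip the sign back
        'fit_time_s': cv_out['fit_time'].mean(),
    })

df_cv_scores_demo = (
    pd.DataFrame(cv_rows_demo).sort_values(by='R2', ascending=False).reset_index(drop=True)
)
print(df_cv_scores_demo)
```

Each estimator could be wrapped in a `GridSearchCV` with its own parameter grid to combine this comparison with hyperparameter tuning.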