### Table of Contents


* [Dataset import](#chapter0)

* [Preparing the dataframes for modeling](#chapter1)

* [Modeling](#chapter2)
    
    * [Tests with numeric variables and OHE](#section4)
    
    * [Tests with numeric variables and label encoding](#section5)
    
    * [Tests with feature engineering](#section6)

    * [Tests with ENERGY STAR score](#section7)
    
    * [Hyperparameter optimization](#section8)

* [Comparison of results](#chapter3)

* [Feature importance](#chapter4)


# Dataset import <a class="anchor" id="chapter0"></a>

In [31]:
import pandas as pd
import numpy as np

df = pd.read_csv("data.csv", index_col=0)

In [32]:
# drop the second target

df = df.drop(["TotalGHGEmissions"], axis=1)

In [33]:
df.columns

Index(['BuildingType', 'PrimaryPropertyType', 'Neighborhood', 'Latitude',
       'Longitude', 'BuildingAge', 'NumberofFloors', 'PropertyGFATotal',
       'ENERGYSTARScore', 'SiteEnergyUseWN(kBtu)', 'haversine_distance',
       'PercentagePerPropertyType'],
      dtype='object')

In [34]:
df.shape

(1610, 12)

In [35]:
df.head()

Unnamed: 0,BuildingType,PrimaryPropertyType,Neighborhood,Latitude,Longitude,BuildingAge,NumberofFloors,PropertyGFATotal,ENERGYSTARScore,SiteEnergyUseWN(kBtu),haversine_distance,PercentagePerPropertyType
0,NonResidential,Hotel,DOWNTOWN,47.6122,-122.33799,89,12,88434,60.0,7456910.0,0.49678,4.65839
1,NonResidential,Hotel,DOWNTOWN,47.61317,-122.33393,20,11,103566,61.0,8664479.0,0.48873,4.65839
2,NonResidential,Hotel,DOWNTOWN,47.61393,-122.3381,47,41,956110,43.0,73937112.0,0.60238,4.65839
3,NonResidential,Hotel,DOWNTOWN,47.61412,-122.33664,90,10,61320,56.0,6946800.5,0.58625,4.65839
4,NonResidential,Hotel,DOWNTOWN,47.61375,-122.34047,36,18,175580,75.0,14656503.0,0.6508,4.65839


# Preparing the dataframes for modeling <a class="anchor" id="chapter1"></a>

In [36]:
# build one list with the categorical variables and one with the numeric variables
# (select_dtypes replaces the deprecated np.object comparison)

objectColumns = list(df.select_dtypes(exclude="number").columns)
numericColumns = list(df.select_dtypes(include="number").columns)
print(objectColumns)
print(numericColumns)

['BuildingType', 'PrimaryPropertyType', 'Neighborhood']
['Latitude', 'Longitude', 'BuildingAge', 'NumberofFloors', 'PropertyGFATotal', 'ENERGYSTARScore', 'SiteEnergyUseWN(kBtu)', 'haversine_distance', 'PercentagePerPropertyType']


In [37]:
# one-hot encoding, and creation of the dataframe without the ENERGY STAR score

df_ohe_wEN = pd.get_dummies(
    df, columns=["BuildingType", "Neighborhood", "PrimaryPropertyType"]
)

# version without the ENERGY STAR score: keeps all rows

df_ohe_noEN = df_ohe_wEN.drop(["ENERGYSTARScore"], axis=1)

# version with the ENERGY STAR score: drop the rows where it is missing

df_ohe_wEN = df_ohe_wEN.dropna()

df_ohe_wEN.shape, df_ohe_noEN.shape

((1064, 48), (1610, 47))
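
The two shapes above differ because `dropna()` removes every building that has no ENERGY STAR score, while the frame without that column keeps all rows. A minimal sketch on a hypothetical three-row toy frame (the values are illustrative, not taken from `data.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one categorical column and a score with one gap.
toy = pd.DataFrame(
    {
        "BuildingType": ["Campus", "NonResidential", "Campus"],
        "ENERGYSTARScore": [60.0, np.nan, 75.0],
        "PropertyGFATotal": [88434, 103566, 61320],
    }
)

toy_ohe = pd.get_dummies(toy, columns=["BuildingType"])
toy_no_score = toy_ohe.drop(columns=["ENERGYSTARScore"])  # keeps every row
toy_with_score = toy_ohe.dropna()                          # drops the row lacking a score

print(toy_with_score.shape, toy_no_score.shape)
```

The frame with the score has one more column and fewer rows, which is the same pattern as `(1064, 48)` versus `(1610, 47)`.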

In [38]:
# label encoding of the categorical variables

from sklearn.preprocessing import LabelEncoder

# dataframe without the ENERGY STAR score (keeps all rows)

df_label_noEN = df.drop(["ENERGYSTARScore"], axis=1)

# dataframe with the ENERGY STAR score; .copy() avoids writing into a slice of df

df_label_wEN = df.dropna().copy()

le = LabelEncoder()

for feat in objectColumns:
    df_label_wEN[feat] = le.fit_transform(df_label_wEN[feat].astype(str))
    df_label_noEN[feat] = le.fit_transform(df_label_noEN[feat].astype(str))

# info() prints directly and returns None, so call it once per dataframe

df_label_wEN.info()
df_label_noEN.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1064 entries, 0 to 3371
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   BuildingType               1064 non-null   int32  
 1   PrimaryPropertyType        1064 non-null   int32  
 2   Neighborhood               1064 non-null   int32  
 3   Latitude                   1064 non-null   float64
 4   Longitude                  1064 non-null   float64
 5   BuildingAge                1064 non-null   int64  
 6   NumberofFloors             1064 non-null   int64  
 7   PropertyGFATotal           1064 non-null   int64  
 8   ENERGYSTARScore            1064 non-null   float64
 9   SiteEnergyUseWN(kBtu)      1064 non-null   float64
 10  haversine_distance         1064 non-null   float64
 11  PercentagePerPropertyType  1064 non-null   float64
dtypes: float64(6), int32(3), int64(3)
memory usage: 95.6 KB
<class 'pandas.core.frame.DataFrame'>
Int64Index
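
One thing to keep in mind with the loop above: the encoder is refitted on each dataframe, so the same category can receive a different integer in the two frames if a category disappears after `dropna()`. The two frames are modeled separately here, so this does not affect the results, but the codes should not be compared across frames. A small sketch with hypothetical category lists:

```python
from sklearn.preprocessing import LabelEncoder

full = ["Campus", "Hotel", "Office"]  # hypothetical categories in the full frame
subset = ["Hotel", "Office"]          # same column after some rows were dropped

codes_full = dict(zip(full, LabelEncoder().fit_transform(full)))
codes_subset = dict(zip(subset, LabelEncoder().fit_transform(subset)))
print(codes_full["Hotel"], codes_subset["Hotel"])  # the two codes differ

# Fitting once on the full column and reusing the encoder keeps the codes aligned.
le_shared = LabelEncoder().fit(full)
shared = dict(zip(subset, le_shared.transform(subset)))
print(shared["Hotel"])
```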

In [39]:
# cast the ENERGY STAR score to int for modeling
# (assign returns a new dataframe, which avoids the chained-assignment warning)

df_label_wEN = df_label_wEN.assign(
    ENERGYSTARScore=df_label_wEN["ENERGYSTARScore"].astype(np.int64)
)
df_ohe_wEN = df_ohe_wEN.assign(
    ENERGYSTARScore=df_ohe_wEN["ENERGYSTARScore"].astype(np.int64)
)


In [40]:
# drop NaN rows from the dataframes that include the ENERGY STAR score

df_label_wEN = df_label_wEN.dropna()
df_ohe_wEN = df_ohe_wEN.dropna()

# Modeling <a class="anchor" id="chapter2"></a>

In [41]:
from sklearn.linear_model import (
    LinearRegression,
    Lasso,
    Ridge,
    SGDRegressor,
    ElasticNet,
)
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import *

from sklearn.model_selection import (
    train_test_split,
    StratifiedShuffleSplit,
    GridSearchCV,
)
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import RepeatedKFold

from sklearn.preprocessing import *
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.metrics import *

from numpy import arange

import time

### Tests with numeric variables and OHE <a class="anchor" id="section4"></a>

In [42]:
# test 1: OHE-encoded variables and numeric variables

X = df_ohe_noEN[
    [
        "BuildingAge",
        "NumberofFloors",
        "PropertyGFATotal",
        "BuildingType_Campus",
        "BuildingType_NonResidential",
        "BuildingType_Nonresidential COS",
        "BuildingType_Nonresidential WA",
        "BuildingType_SPS-District K-12",
        "Neighborhood_BALLARD",
        "Neighborhood_CENTRAL",
        "Neighborhood_DELRIDGE",
        "Neighborhood_DOWNTOWN",
        "Neighborhood_EAST",
        "Neighborhood_GREATER DUWAMISH",
        "Neighborhood_LAKE UNION",
        "Neighborhood_MAGNOLIA / QUEEN ANNE",
        "Neighborhood_NORTH",
        "Neighborhood_NORTHEAST",
        "Neighborhood_NORTHWEST",
        "Neighborhood_SOUTHEAST",
        "Neighborhood_SOUTHWEST",
        "PrimaryPropertyType_Distribution Center",
        "PrimaryPropertyType_Hospital",
        "PrimaryPropertyType_Hotel",
        "PrimaryPropertyType_K-12 School",
        "PrimaryPropertyType_Laboratory",
        "PrimaryPropertyType_Large Office",
        "PrimaryPropertyType_Medical Office",
        "PrimaryPropertyType_Mixed Use Property",
        "PrimaryPropertyType_Office",
        "PrimaryPropertyType_Other",
        "PrimaryPropertyType_Refrigerated Warehouse",
        "PrimaryPropertyType_Residence Hall",
        "PrimaryPropertyType_Restaurant",
        "PrimaryPropertyType_Retail Store",
        "PrimaryPropertyType_Self-Storage Facility",
        "PrimaryPropertyType_Senior Care Community",
        "PrimaryPropertyType_Small- and Mid-Sized Office",
        "PrimaryPropertyType_Supermarket / Grocery Store",
        "PrimaryPropertyType_University",
        "PrimaryPropertyType_Warehouse",
        "PrimaryPropertyType_Worship Facility",
    ]
]
y = df_ohe_noEN["SiteEnergyUseWN(kBtu)"]

print(X.shape)
print(y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

results1 = []

algos = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(tol=0.5),
    "ElasticNet": ElasticNet(),
    "SGDRegressor": SGDRegressor(),
    "SVR": SVR(),
    "RandomForestRegressor": RandomForestRegressor(),
    "XGBRegressor": XGBRegressor(),
}

for algo_name, algo in algos.items():
    # fit each model on the training split and score it on the held-out test split
    # (the inner loop over y_columns is removed: that list is not defined yet at this point)
    model = make_pipeline(algo)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    results1.append([algo_name, r2])


data_ohe_num = pd.DataFrame(results1, columns=["algo", "r2"])
data_ohe_num["algo"] = "data_ohe_num_" + data_ohe_num["algo"].astype(str)
pd.set_option("display.float_format", lambda x: "%.5f" % x)
data_ohe_num

(1610, 42)
(1610,)


Unnamed: 0,algo,r2
0,data_ohe_num_LinearRegression,0.42065
1,data_ohe_num_Ridge,0.43424
2,data_ohe_num_Lasso,0.43582
3,data_ohe_num_ElasticNet,0.35522
4,data_ohe_num_SGDRegressor,-6.049370599440906e+22
5,data_ohe_num_SVR,-0.04607
6,data_ohe_num_RandomForestRegressor,0.47904
7,data_ohe_num_XGBRegressor,0.11688


In [None]:
'''''

    Observations:

     - Random Forest performs better than the linear regression methods
     - The XGB Regressor performs worse than the Random Forest Regressor
     - SVR performs poorly
     - The SGD Regressor performs very poorly (R² around -6e22), likely because
         the features are not standardized and SGD is sensitive to feature scale
     
'''''
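
The very poor SGD result is what one would expect from unscaled inputs: the pipelines above contain only the estimator, and `PropertyGFATotal` is several orders of magnitude larger than the other features. The sketch below is not the notebook's experiment; it uses synthetic data (an assumption) to show how a `StandardScaler` step in front of `SGDRegressor` changes the outcome.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: one feature on a floor-area-like scale, one small feature.
X = np.column_stack([rng.uniform(2e4, 1e6, 500), rng.uniform(1, 40, 500)])
y = 80.0 * X[:, 0] + 1e5 * X[:, 1] + rng.normal(0, 1e6, 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Without scaling, SGD diverges: this surfaces either as a ValueError
# (floating-point overflow) or as a hugely negative R².
try:
    raw = SGDRegressor(random_state=0).fit(X_train, y_train)
    r2_raw = r2_score(y_test, raw.predict(X_test))
except ValueError:
    r2_raw = float("-inf")

# With standardized features the same estimator trains normally.
scaled = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
scaled.fit(X_train, y_train)
r2_scaled = r2_score(y_test, scaled.predict(X_test))

print(f"raw: {r2_raw:.3g}  scaled: {r2_scaled:.3g}")
```

In the notebook, the equivalent change would be `make_pipeline(StandardScaler(), algo)` inside the loop, which would also alter the scores of the other scale-sensitive models (SVR, Lasso, Ridge, ElasticNet).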

### Tests with numeric variables and label encoding <a class="anchor" id="section5"></a>

In [44]:
# test 2: label-encoded variables

y_columns = ["SiteEnergyUseWN(kBtu)"]
X = df_label_noEN[
    [
        "BuildingType",
        "PrimaryPropertyType",
        "Neighborhood",
        "BuildingAge",
        "NumberofFloors",
        "PropertyGFATotal",
    ]
]
y = df_label_noEN["SiteEnergyUseWN(kBtu)"]

print(X.shape)
print(y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

results2 = []

algos = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(tol=0.5),
    "ElasticNet": ElasticNet(),
    "SGDRegressor": SGDRegressor(),
    "SVR": SVR(),
    "RandomForestRegressor": RandomForestRegressor(),
    "XGBRegressor": XGBRegressor(),
}

for algo_name, algo in algos.items():
    # fit each model on the training split and score it on the held-out test split
    model = make_pipeline(algo)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    results2.append([algo_name, r2])

data_label = pd.DataFrame(results2, columns=["algo", "r2"])
data_label["algo"] = "data_label_" + data_label["algo"].astype(str)
pd.set_option("display.float_format", lambda x: "%.5f" % x)
data_label

(1610, 6)
(1610,)


Unnamed: 0,algo,r2
0,data_label_LinearRegression,0.34577
1,data_label_Ridge,0.34579
2,data_label_Lasso,0.33272
3,data_label_ElasticNet,0.3488
4,data_label_SGDRegressor,-2.350486979475793e+22
5,data_label_SVR,-0.04607
6,data_label_RandomForestRegressor,0.43766
7,data_label_XGBRegressor,-0.08659


In [None]:
'''''

    Observations:

     - Label encoding performs worse than OHE, Random Forest included (0.438 vs 0.479);
         Random Forest remains the best model
     - SVR performs identically with OHE and with label encoding (R² = -0.04607)
     - The SGD Regressor performs very poorly

     
'''''

### Tests with feature engineering <a class="anchor" id="section6"></a>

In [46]:
data = pd.read_csv("data1.csv", index_col=0)
data = data.drop([ 'ENERGYSTARScore', 'haversine_distance',
       'PercentagePerPropertyType','Latitude',
       'Longitude','TotalGHGEmissions_log'], axis=1)
data.columns

Index(['BuildingType', 'PrimaryPropertyType', 'Neighborhood', 'NumberofFloors',
       'SiteEnergyUseWN_log', 'BuildingAge_log', 'PropertyGFATotal_log',
       'ENERGYSTARScore_log'],
      dtype='object')

In [47]:
# one-hot encoding, and creation of the dataframe without the ENERGY STAR score

data_ohe_wEN = pd.get_dummies(
    data, columns=["BuildingType", "Neighborhood", "PrimaryPropertyType"]
)

# version without the ENERGY STAR score: keeps all rows

data_ohe_noEN = data_ohe_wEN.drop(["ENERGYSTARScore_log"], axis=1)

# version with the ENERGY STAR score: drop the rows where it is missing

data_ohe_wEN = data_ohe_wEN.dropna()

# label encoding

from sklearn.preprocessing import LabelEncoder

# dataframe without the ENERGY STAR score (keeps all rows)

data_label_noEN = data.drop(["ENERGYSTARScore_log"], axis=1)

# dataframe with the ENERGY STAR score; .copy() avoids writing into a slice of data

data_label_wEN = data.dropna().copy()

le = LabelEncoder()

for feat in objectColumns:
    data_label_wEN[feat] = le.fit_transform(data_label_wEN[feat].astype(str))
    data_label_noEN[feat] = le.fit_transform(data_label_noEN[feat].astype(str))


In [50]:
# test 3: feature engineering (log of the numeric variables and of the target) with OHE encoding

X = data_ohe_noEN[
    [
        "BuildingAge_log",
        "PropertyGFATotal_log",
        "NumberofFloors",
        "BuildingType_Campus",
        "BuildingType_NonResidential",
        "BuildingType_Nonresidential COS",
        "BuildingType_Nonresidential WA",
        "BuildingType_SPS-District K-12",
        "Neighborhood_BALLARD",
        "Neighborhood_CENTRAL",
        "Neighborhood_DELRIDGE",
        "Neighborhood_DOWNTOWN",
        "Neighborhood_EAST",
        "Neighborhood_GREATER DUWAMISH",
        "Neighborhood_LAKE UNION",
        "Neighborhood_MAGNOLIA / QUEEN ANNE",
        "Neighborhood_NORTH",
        "Neighborhood_NORTHEAST",
        "Neighborhood_NORTHWEST",
        "Neighborhood_SOUTHEAST",
        "Neighborhood_SOUTHWEST",
        "PrimaryPropertyType_Distribution Center",
        "PrimaryPropertyType_Hospital",
        "PrimaryPropertyType_Hotel",
        "PrimaryPropertyType_K-12 School",
        "PrimaryPropertyType_Laboratory",
        "PrimaryPropertyType_Large Office",
        "PrimaryPropertyType_Medical Office",
        "PrimaryPropertyType_Mixed Use Property",
        "PrimaryPropertyType_Office",
        "PrimaryPropertyType_Other",
        "PrimaryPropertyType_Refrigerated Warehouse",
        "PrimaryPropertyType_Residence Hall",
        "PrimaryPropertyType_Restaurant",
        "PrimaryPropertyType_Retail Store",
        "PrimaryPropertyType_Self-Storage Facility",
        "PrimaryPropertyType_Senior Care Community",
        "PrimaryPropertyType_Small- and Mid-Sized Office",
        "PrimaryPropertyType_Supermarket / Grocery Store",
        "PrimaryPropertyType_University",
        "PrimaryPropertyType_Warehouse",
        "PrimaryPropertyType_Worship Facility",
    ]
]
y = data_ohe_noEN["SiteEnergyUseWN_log"]

print(X.shape)
print(y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

results3 = []

algos = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(tol=0.5),
    "ElasticNet": ElasticNet(),
    "SGDRegressor": SGDRegressor(),
    "SVR": SVR(),
    "RandomForestRegressor": RandomForestRegressor(),
    "XGBRegressor": XGBRegressor(),
}

for algo_name, algo in algos.items():
    # fit each model on the training split and score it on the held-out test split
    model = make_pipeline(algo)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    results3.append([algo_name, r2])


datafeat_ohe_log_on_num = pd.DataFrame(results3, columns=["algo", "r2"])
datafeat_ohe_log_on_num["algo"] = "datafeat_ohe_log_on_num_" + datafeat_ohe_log_on_num[
    "algo"
].astype(str)
pd.set_option("display.float_format", lambda x: "%.5f" % x)
datafeat_ohe_log_on_num

(1610, 42)
(1610,)


Unnamed: 0,algo,r2
0,datafeat_ohe_log_on_num_LinearRegression,0.72648
1,datafeat_ohe_log_on_num_Ridge,0.72897
2,datafeat_ohe_log_on_num_Lasso,0.23267
3,datafeat_ohe_log_on_num_ElasticNet,0.39153
4,datafeat_ohe_log_on_num_SGDRegressor,0.68776
5,datafeat_ohe_log_on_num_SVR,0.69119
6,datafeat_ohe_log_on_num_RandomForestRegressor,0.71779
7,datafeat_ohe_log_on_num_XGBRegressor,0.67958


In [51]:
# test 4: feature engineering (log of the numeric variables and of the target) with label encoding

X = data_label_noEN[
    [
        "BuildingType",
        "PrimaryPropertyType",
        "Neighborhood",
        "BuildingAge_log",
        "PropertyGFATotal_log",
        "NumberofFloors",
    ]
]
y = data_label_noEN["SiteEnergyUseWN_log"]

print(X.shape)
print(y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

results4 = []

algos = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(tol=0.5),
    "ElasticNet": ElasticNet(),
    "SGDRegressor": SGDRegressor(),
    "SVR": SVR(),
    "RandomForestRegressor": RandomForestRegressor(),
    "XGBRegressor": XGBRegressor(),
}

for algo_name, algo in algos.items():
    # fit each model on the training split and score it on the held-out test split
    model = make_pipeline(algo)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    results4.append([algo_name, r2])


data_label_noEN_log_on_num = pd.DataFrame(results4, columns=["algo", "r2"])
data_label_noEN_log_on_num[
    "algo"
] = "data_label_noEN_log_on_num_" + data_label_noEN_log_on_num["algo"].astype(str)
pd.set_option("display.float_format", lambda x: "%.5f" % x)
data_label_noEN_log_on_num

(1610, 6)
(1610,)


Unnamed: 0,algo,r2
0,data_label_noEN_log_on_num_LinearRegression,0.65645
1,data_label_noEN_log_on_num_Ridge,0.65646
2,data_label_noEN_log_on_num_Lasso,0.28901
3,data_label_noEN_log_on_num_ElasticNet,0.41193
4,data_label_noEN_log_on_num_SGDRegressor,0.62545
5,data_label_noEN_log_on_num_SVR,0.65824
6,data_label_noEN_log_on_num_RandomForestRegressor,0.70065
7,data_label_noEN_log_on_num_XGBRegressor,0.65698


In [None]:
'''''

    Observations:

     - Switching to a logarithmic scale improves performance
         (note that R² is now computed on the log of the target)
     - OHE performs better than label encoding for all algorithms
         except Lasso and ElasticNet, for which label encoding performs best
     - The SGD Regressor performs better with the logarithmic scale

     
'''''
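
In tests 3 and 4 the target is already log-transformed in `data1.csv`, so the reported R² is measured in log units and is not on the same scale as tests 1 and 2. One way to keep the log transform while still getting predictions in kBtu is scikit-learn's `TransformedTargetRegressor`. The sketch below uses synthetic data and assumes a `log1p`/`expm1` pair; the exact transform used to build the `_log` columns is not shown in this notebook.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Synthetic stand-in: energy use grows multiplicatively with floor area.
area = rng.uniform(2e4, 1e6, 300)
energy = 70.0 * area * np.exp(rng.normal(0, 0.3, 300))

X = np.log(area).reshape(-1, 1)

# The regressor is trained on log1p(energy); predict() applies expm1 automatically.
model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
)
model.fit(X, energy)
pred = model.predict(X)  # predictions come back in the original unit

print(model.regressor_.coef_[0], pred[:3])
```

With this wrapper, metrics such as MAE or RMSE can be computed directly against the untransformed target.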

### Tests with ENERGY STAR score <a class="anchor" id="section7"></a>

In [54]:
# test 5: feature engineering (log of the numeric variables and of the target) with label encoding, plus the log ENERGY STAR score

X = data_label_wEN[
    [
        "BuildingType",
        "PrimaryPropertyType",
        "Neighborhood",
        "ENERGYSTARScore_log",
        "BuildingAge_log",
        "PropertyGFATotal_log",
        "NumberofFloors",
    ]
]
y = data_label_wEN["SiteEnergyUseWN_log"]

print(X.shape)
print(y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

results5 = []  # was named results23, while the loop below appends to results5

algos = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(tol=0.5),
    "ElasticNet": ElasticNet(),
    "SGDRegressor": SGDRegressor(),
    "SVR": SVR(),
    "RandomForestRegressor": RandomForestRegressor(),
    "XGBRegressor": XGBRegressor(),
}

for algo_name, algo in algos.items():
    # fit each model on the training split and score it on the held-out test split
    model = make_pipeline(algo)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    results5.append([algo_name, r2])


datafeat_label_log_ens = pd.DataFrame(results5, columns=["algo", "r2"])
datafeat_label_log_ens["algo"] = "datafeat_label_log_ens_" + datafeat_label_log_ens[
    "algo"
].astype(str)
pd.set_option("display.float_format", lambda x: "%.5f" % x)
datafeat_label_log_ens

(1064, 7)
(1064,)


Unnamed: 0,algo,r2
0,datafeat_label_log_ens_LinearRegression,0.79318
1,datafeat_label_log_ens_Ridge,0.79307
2,datafeat_label_log_ens_Lasso,0.34698
3,datafeat_label_log_ens_ElasticNet,0.43573
4,datafeat_label_log_ens_SGDRegressor,0.71102
5,datafeat_label_log_ens_SVR,0.76101
6,datafeat_label_log_ens_RandomForestRegressor,0.86256
7,datafeat_label_log_ens_XGBRegressor,0.86892


In [55]:
# test 6: feature engineering (log of the numeric variables and of the target) with OHE encoding, plus the log ENERGY STAR score

X = data_ohe_wEN[
    [
        "BuildingAge_log",
        "PropertyGFATotal_log",
        "NumberofFloors",
        "BuildingType_Campus",
        "BuildingType_NonResidential",
        "BuildingType_Nonresidential COS",
        "BuildingType_Nonresidential WA",
        "BuildingType_SPS-District K-12",
        "Neighborhood_BALLARD",
        "Neighborhood_CENTRAL",
        "Neighborhood_DELRIDGE",
        "Neighborhood_DOWNTOWN",
        "Neighborhood_EAST",
        "Neighborhood_GREATER DUWAMISH",
        "Neighborhood_LAKE UNION",
        "Neighborhood_MAGNOLIA / QUEEN ANNE",
        "Neighborhood_NORTH",
        "Neighborhood_NORTHEAST",
        "Neighborhood_NORTHWEST",
        "Neighborhood_SOUTHEAST",
        "Neighborhood_SOUTHWEST",
        "PrimaryPropertyType_Distribution Center",
        "PrimaryPropertyType_Hospital",
        "PrimaryPropertyType_Hotel",
        "PrimaryPropertyType_K-12 School",
        "PrimaryPropertyType_Laboratory",
        "PrimaryPropertyType_Large Office",
        "PrimaryPropertyType_Medical Office",
        "PrimaryPropertyType_Mixed Use Property",
        "PrimaryPropertyType_Office",
        "PrimaryPropertyType_Other",
        "PrimaryPropertyType_Refrigerated Warehouse",
        "PrimaryPropertyType_Residence Hall",
        "PrimaryPropertyType_Restaurant",
        "PrimaryPropertyType_Retail Store",
        "PrimaryPropertyType_Self-Storage Facility",
        "PrimaryPropertyType_Senior Care Community",
        "PrimaryPropertyType_Small- and Mid-Sized Office",
        "PrimaryPropertyType_Supermarket / Grocery Store",
        "PrimaryPropertyType_University",
        "PrimaryPropertyType_Warehouse",
        "PrimaryPropertyType_Worship Facility",
        "ENERGYSTARScore_log",
    ]
]
y = data_ohe_wEN["SiteEnergyUseWN_log"]

print(X.shape)
print(y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

results6 = []

algos = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(tol=0.5),
    "ElasticNet": ElasticNet(),
    "SGDRegressor": SGDRegressor(),
    "SVR": SVR(),
    "RandomForestRegressor": RandomForestRegressor(),
    "XGBRegressor": XGBRegressor(),
}

for algo_name, algo in algos.items():
    # fit each model on the training split and score it on the held-out test split
    model = make_pipeline(algo)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    results6.append([algo_name, r2])


datafeat_ohe_log_ens = pd.DataFrame(results6, columns=["algo", "r2"])
datafeat_ohe_log_ens["algo"] = "datafeat_ohe_log_ens_" + datafeat_ohe_log_ens[
    "algo"
].astype(str)
pd.set_option("display.float_format", lambda x: "%.5f" % x)
datafeat_ohe_log_ens

(1064, 43)
(1064,)


Unnamed: 0,algo,r2
0,datafeat_ohe_log_ens_LinearRegression,0.89057
1,datafeat_ohe_log_ens_Ridge,0.89219
2,datafeat_ohe_log_ens_Lasso,0.25989
3,datafeat_ohe_log_ens_ElasticNet,0.39229
4,datafeat_ohe_log_ens_SGDRegressor,0.86604
5,datafeat_ohe_log_ens_SVR,0.80255
6,datafeat_ohe_log_ens_RandomForestRegressor,0.88223
7,datafeat_ohe_log_ens_XGBRegressor,0.86029


In [None]:
'''''

    Observations:

     - The ENERGY STAR score improves performance
     - OHE performs better than label encoding for most algorithms;
         the exceptions are Lasso, ElasticNet and XGBoost (0.86892 vs 0.86029),
         for which label encoding performs best

     
'''''
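
A caveat on the comparison above: the tests with the ENERGY STAR score run on 1,064 rows, while the tests without it run on 1,610 rows, so the test sets are not the same buildings. A like-for-like check would score both feature sets on identical rows and identical folds. The sketch below illustrates the idea on synthetic data; the feature names are stand-ins, not the notebook's columns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n = 400
size = rng.normal(size=n)    # stand-in for a log floor-area feature
score = rng.normal(size=n)   # stand-in for the energy score feature
y = 1.0 * size - 0.8 * score + rng.normal(scale=0.5, size=n)

# Same rows and same folds for both feature sets.
cv = KFold(n_splits=5, shuffle=True, random_state=1)
X_without = size.reshape(-1, 1)
X_with = np.column_stack([size, score])

r2_without = cross_val_score(LinearRegression(), X_without, y, cv=cv, scoring="r2").mean()
r2_with = cross_val_score(LinearRegression(), X_with, y, cv=cv, scoring="r2").mean()

print(f"without score: {r2_without:.3f}  with score: {r2_with:.3f}")
```

Applied to this notebook, that would mean evaluating the "without score" feature list on `data_ohe_wEN` as well, which is what test 7 does.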

### Tests with hyperparameter optimization <a class="anchor" id="section8"></a>

In [56]:
# SGD parameters

param_grid_sgd = {
    "alpha": 10.0 ** -np.arange(1, 7),
    # "squared_loss" was renamed "squared_error" in scikit-learn 1.0
    "loss": ["squared_error", "huber", "epsilon_insensitive"],
    "penalty": ["l2", "l1", "elasticnet"],
    "learning_rate": ["constant", "optimal", "invscaling"],
}

grid_sgd = GridSearchCV(
    SGDRegressor(),
    param_grid_sgd,
    cv=RepeatedKFold(n_splits=10, n_repeats=2),
    scoring="neg_root_mean_squared_error",
    verbose=2,
    n_jobs=-1,
)
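
The cost of a grid search is the number of parameter combinations multiplied by the number of cross-validation fits. As a quick check, `ParameterGrid` can count the combinations before launching the search. The sketch below rebuilds the SGD grid with the `squared_error` spelling used by current scikit-learn releases.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid_sgd = {
    "alpha": 10.0 ** -np.arange(1, 7),
    "loss": ["squared_error", "huber", "epsilon_insensitive"],
    "penalty": ["l2", "l1", "elasticnet"],
    "learning_rate": ["constant", "optimal", "invscaling"],
}

n_candidates = len(ParameterGrid(param_grid_sgd))  # 6 * 3 * 3 * 3
n_fits = n_candidates * 10 * 2                     # n_splits * n_repeats of RepeatedKFold

print(n_candidates, n_fits)
```

These counts match the `Fitting 20 folds for each of 162 candidates, totalling 3240 fits` message printed when the SGD search runs.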

In [57]:
# Ridge regression

param_grid_ridge = {"alpha": 0.1 * np.arange(1, 70)}
grid_ridge = GridSearchCV(
    Ridge(),
    param_grid_ridge,
    cv=RepeatedKFold(n_splits=10, n_repeats=2),
    scoring="neg_root_mean_squared_error",
    verbose=2,
    n_jobs=-1,
)
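
Because the searches are configured with `scoring="neg_root_mean_squared_error"`, calling `.score()` on a fitted search returns the negated RMSE rather than R². This is why a tuned model can report a negative value from `.score()` even when its R² is high. A minimal sketch on synthetic data (the dataset and the small `alpha` grid are assumptions made for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=120, n_features=4, noise=5.0, random_state=0)

search = GridSearchCV(
    Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=5, scoring="neg_root_mean_squared_error"
)
search.fit(X, y)

y_fit = search.predict(X)
search_score = search.score(X, y)             # negated RMSE, because of `scoring`
rmse = np.sqrt(mean_squared_error(y, y_fit))
r2 = r2_score(y, y_fit)                       # comparable with LinearRegression.score

print(search_score, rmse, r2)
```

To report a training R² for every model on the same footing, `r2_score(y_train, model.predict(X_train))` can be used instead of `model.score(X_train, y_train)`.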

In [58]:
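# Lasso

# `normalize=True` is deprecated in scikit-learn; as the library's own warning recommends,
# scaling is done in a pipeline instead, so the grid key becomes "lasso__alpha"
# (scores may differ slightly from the normalize=True run)
params_lasso = {"lasso__alpha": np.logspace(-8, 8, 100)}  # checks from 1e-08 to 1e+08
lasso = make_pipeline(StandardScaler(with_mean=False), Lasso())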
grid_lasso = GridSearchCV(
    lasso,
    params_lasso,
    cv=RepeatedKFold(n_splits=10, n_repeats=2),
    scoring="neg_root_mean_squared_error",
    verbose=2,
    n_jobs=-1,
)

In [59]:
# Elastic net

params_grid_elasticnet = dict()
params_grid_elasticnet["alpha"] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.0, 1.0, 10.0, 100.0]
params_grid_elasticnet["l1_ratio"] = arange(0, 1, 0.01)
grid_elasticnet = GridSearchCV(
    ElasticNet(),
    params_grid_elasticnet,
    scoring="neg_root_mean_squared_error",
    cv=RepeatedKFold(n_splits=10, n_repeats=2),
    n_jobs=-1,
)

In [61]:
# Random forest

param_grid_randomforest = {
    "bootstrap": [True, False],
    "max_depth": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
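    # 1.0 = all features, which is what "auto" meant for regressors ("auto" was removed in recent scikit-learn)
    "max_features": [1.0, "sqrt"],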
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 5, 10],
    "n_estimators": [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
}

random_randomforest = RandomizedSearchCV(
    RandomForestRegressor(),
    param_grid_randomforest,
    scoring="neg_root_mean_squared_error",
    cv=RepeatedKFold(n_splits=10, n_repeats=2),
    n_jobs=-1,
)
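
Unlike `GridSearchCV`, `RandomizedSearchCV` samples only `n_iter` combinations (10 by default in scikit-learn), and the sample changes between runs unless `random_state` is set. The sketch below uses a deliberately small, hypothetical grid and synthetic data so that it runs quickly; it is not the notebook's search.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV

# Hypothetical reduced grid, for illustration only.
param_distributions = {
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 2, 4],
    "n_estimators": [10, 20],
}
X, y = make_regression(n_samples=80, n_features=5, noise=1.0, random_state=0)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions,
    n_iter=4,        # number of sampled combinations
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=0,  # makes the sampled combinations reproducible
)
search.fit(X, y)

n_total = len(ParameterGrid(param_distributions))
n_tried = len(search.cv_results_["params"])
print(n_total, n_tried, search.best_params_)
```

With the random forest grid defined above, the default `n_iter=10` explores only a small fraction of the available combinations, so raising `n_iter` is one lever if the tuned forest does not improve on the default one.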

In [62]:
# XGBoost

param_grid_xgb = {
    "learning_rate": (0.05, 0.10, 0.15),
    "max_depth": [3, 4, 5, 6, 8],
    "min_child_weight": [1, 3, 5, 7],
    "gamma": [0.0, 0.1, 0.2],
    "colsample_bytree": [0.3, 0.4],
}

grid_xgb = GridSearchCV(
    XGBRegressor(),
    param_grid_xgb,
    scoring="neg_root_mean_squared_error",
    cv=RepeatedKFold(n_splits=10, n_repeats=2),
    n_jobs=-1,
)

In [64]:
# test 7: hyperparameter optimization / OHE encoding / log features / without the ENERGY STAR score
# (rows come from data_ohe_wEN, i.e. only buildings that have a score)

X = data_ohe_wEN[
    [
        "BuildingAge_log",
        "PropertyGFATotal_log",
        "NumberofFloors",
        "BuildingType_Campus",
        "BuildingType_NonResidential",
        "BuildingType_Nonresidential COS",
        "BuildingType_Nonresidential WA",
        "BuildingType_SPS-District K-12",
        "Neighborhood_BALLARD",
        "Neighborhood_CENTRAL",
        "Neighborhood_DELRIDGE",
        "Neighborhood_DOWNTOWN",
        "Neighborhood_EAST",
        "Neighborhood_GREATER DUWAMISH",
        "Neighborhood_LAKE UNION",
        "Neighborhood_MAGNOLIA / QUEEN ANNE",
        "Neighborhood_NORTH",
        "Neighborhood_NORTHEAST",
        "Neighborhood_NORTHWEST",
        "Neighborhood_SOUTHEAST",
        "Neighborhood_SOUTHWEST",
        "PrimaryPropertyType_Distribution Center",
        "PrimaryPropertyType_Hospital",
        "PrimaryPropertyType_Hotel",
        "PrimaryPropertyType_K-12 School",
        "PrimaryPropertyType_Laboratory",
        "PrimaryPropertyType_Large Office",
        "PrimaryPropertyType_Medical Office",
        "PrimaryPropertyType_Mixed Use Property",
        "PrimaryPropertyType_Office",
        "PrimaryPropertyType_Other",
        "PrimaryPropertyType_Refrigerated Warehouse",
        "PrimaryPropertyType_Residence Hall",
        "PrimaryPropertyType_Restaurant",
        "PrimaryPropertyType_Retail Store",
        "PrimaryPropertyType_Self-Storage Facility",
        "PrimaryPropertyType_Senior Care Community",
        "PrimaryPropertyType_Small- and Mid-Sized Office",
        "PrimaryPropertyType_Supermarket / Grocery Store",
        "PrimaryPropertyType_University",
        "PrimaryPropertyType_Warehouse",
        "PrimaryPropertyType_Worship Facility",
    ]
]
y = data_ohe_wEN["SiteEnergyUseWN_log"]

print(X.shape)
print(y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

results7 = []

algos = {
    "LinearRegression": LinearRegression(),
    "Ridge": grid_ridge,
    "Lasso": grid_lasso,
    "ElasticNet": grid_elasticnet,
    "SGDRegressor": grid_sgd,
    "SVR": SVR(),
    "RandomForestRegressor": random_randomforest,
    "XGBRegressor": grid_xgb,
}

for algo_name, algo in algos.items():
    print("Algorithm: ", algo_name)
    for column in y_columns:
        start_of_f1 = time.time()
        model = make_pipeline(algo)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        print("Prediction of ", column, "without EnergyStarScore")
        # for the search objects, .score() returns the negated RMSE set by `scoring`, not R²
        train = model.score(X_train, y_train)
        print("training score = ", train)
        mae = mean_absolute_error(y_test, y_pred)
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        r2 = r2_score(y_test, y_pred)
        end_of_f1 = time.time()
        time_used = end_of_f1 - start_of_f1
        print("MAE = ", mae)
        print("RMSE = ", rmse)
        print("r2_score = ", r2)
        print(f"The calculation time is : {time_used}")
        # results7 is the list used to build the dataframe below (was results25)
        results7.append([algo_name, column, train, mae, rmse, r2, time_used])
    print("-" * 100)

datafeat_ohe_noENS_log_opti = pd.DataFrame(
    results7,
    columns=[
        "algo",
        "target",
        "training score",
        "MAE",
        "RMSE",
        "r2",
        "time_used",
    ],
)
datafeat_ohe_noENS_log_opti[
    "algo"
] = "datafeat_ohe_noENS_log_opti_" + datafeat_ohe_noENS_log_opti["algo"].astype(str)
pd.set_option("display.float_format", lambda x: "%.5f" % x)
datafeat_ohe_noENS_log_opti

(1064, 42)
(1064,)
Algorithm:  LinearRegression
Prediction of  SiteEnergyUseWN(kBtu) without EnergyStarScore
training score =  0.7757303287446293
MAE =  0.38360376859694617
RMSE =  0.5144268150848533
r2_score =  0.8476144397721573
The calculation time is : 0.010970592498779297
----------------------------------------------------------------------------------------------------
Algorithm:  Ridge
Fitting 20 folds for each of 69 candidates, totalling 1380 fits
Prediction of  SiteEnergyUseWN(kBtu) without EnergyStarScore
training score =  -0.5840254545552105
MAE =  0.3813064884315382
RMSE =  0.5108055407095684
r2_score =  0.8497523051505423
The calculation time is : 8.350260972976685
----------------------------------------------------------------------------------------------------
Algorithm:  Lasso
Fitting 20 folds for each of 100 candidates, totalling 2000 fits
Prediction of  SiteEnergyUseWN(kBtu) without EnergyStarScore
training score =  -0.5841706144563902
MAE =  0.3820730394814869
RMSE =  0.5116106821047643
r2_score =  0.8492782853245355
The calculation time is : 10.038151741027832
----------------------------------------------------------------------------------------------------
Algorithm:  ElasticNet
Prediction of  SiteEnergyUseWN(kBtu) without EnergyStarScore
training score =  -0.5826797394799397
MAE =  0.38052922739854694
RMSE =  0.511777641033243
r2_score =  0.8491798962841061
The calculation time is : 133.99563837051392
----------------------------------------------------------------------------------------------------
Algorithm:  SGDRegressor
Fitting 20 folds for each of 162 candidates, totalling 3240 fits
Prediction of  SiteEnergyUseWN(kBtu) without EnergyStarScore
training score =  -0.6505628207187409
MAE =  0.43534766376688877
RMSE =  0.5740331200941249
r2_score =  0.8102549215170727
The calculation time is : 25.286368131637573
----------------------------------------------------------------------------------------------------
Algorithm:  SVR
Prediction of  SiteEnergyUseWN(kBtu) without EnergyStarScore
training score =  0.6816981520734982
MAE =  0.4693882335964964
RMSE =  0.6565924973333995
r2_score =  0.7517504610609969
The calculation time is : 0.2982025146484375
----------------------------------------------------------------------------------------------------
Algorithme:  RandomForestRegressor
Prédiction de  SiteEnergyUseWN(kBtu) sans EnergyStarScore
score d'entrainement =  -0.4209679885244568
MAE =  0.3910848983208456
RMSE =  0.538273186491636
r2_score =  0.8331592583675833
The calculation time is : 226.2661693096161
-------------------------------------------------------

Unnamed: 0,algo,target,score d-entrainement,MAE,RMSE,r2,time_used
0,datafeat_ohe_noENS_log_opti_LinearRegression,SiteEnergyUseWN(kBtu),0.77573,0.3836,0.51443,0.84761,0.01097
1,datafeat_ohe_noENS_log_opti_Ridge,SiteEnergyUseWN(kBtu),-0.58403,0.38131,0.51081,0.84975,8.35026
2,datafeat_ohe_noENS_log_opti_Lasso,SiteEnergyUseWN(kBtu),-0.58417,0.38207,0.51161,0.84928,10.03815
3,datafeat_ohe_noENS_log_opti_ElasticNet,SiteEnergyUseWN(kBtu),-0.58268,0.38053,0.51178,0.84918,133.99564
4,datafeat_ohe_noENS_log_opti_SGDRegressor,SiteEnergyUseWN(kBtu),-0.65056,0.43535,0.57403,0.81025,25.28637
5,datafeat_ohe_noENS_log_opti_SVR,SiteEnergyUseWN(kBtu),0.6817,0.46939,0.65659,0.75175,0.2982
6,datafeat_ohe_noENS_log_opti_RandomForestRegressor,SiteEnergyUseWN(kBtu),-0.42097,0.39108,0.53827,0.83316,226.26617
7,datafeat_ohe_noENS_log_opti_XGBRegressor,SiteEnergyUseWN(kBtu),-0.50743,0.39904,0.54487,0.82904,447.65771
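
The "score d-entrainement" column mixes two scales: `LinearRegression` and `SVR` report a positive R², while the tuned models report negative values. This is consistent with `score` on a search object returning the value of its `scoring` metric instead of R². The sketch below illustrates the effect on synthetic data. The `neg_mean_absolute_error` scorer and the parameter grid are assumptions, since the definition of `grid_ridge` is not shown in this section.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the notebook's data (assumption).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# Hypothetical search object; the real grid_ridge may use another scorer.
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_demo, y_demo)

# score() delegates to the scoring metric, so this is a negated error.
search_score = search.score(X_demo, y_demo)

# Computing R2 explicitly keeps the training score comparable across models.
train_r2 = r2_score(y_demo, search.predict(X_demo))

print(f"search.score = {search_score:.3f}, training R2 = {train_r2:.3f}")
```

Replacing `model.score(X_train, y_train)` with an explicit `r2_score` call would probably make the training column directly comparable with the `r2` column.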


In [65]:
# Test 8: hyperparameter optimization / label encoding / log features / without ENS

X = data_label_wEN[
    [
        "BuildingType",
        "PrimaryPropertyType",
        "Neighborhood",
        "BuildingAge_log",
        "PropertyGFATotal_log",
        "NumberofFloors",
    ]
]
y = data_label_wEN["SiteEnergyUseWN_log"]

print(X.shape)
print(y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

results8 = []

algos = {
    "LinearRegression": LinearRegression(),
    "Ridge": grid_ridge,
    "Lasso": grid_lasso,
    "ElasticNet": grid_elasticnet,
    "SGDRegressor": grid_sgd,
    "SVR": SVR(),
    "RandomForestRegressor": random_randomforest,
    "XGBRegressor": grid_xgb,
}

for algo_name, algo in algos.items():
    print("Algorithme: ", algo_name)
    for column in y_columns:
        start_of_f1 = time.time()
        model = make_pipeline(algo)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        print("Prédiction de ", column, "sans EnergyStarScore")
        train = model.score(X_train, y_train)
        print("score d'entrainement = ", train)
        mae = mean_absolute_error(y_test, y_pred)
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        r2 = r2_score(y_test, y_pred)
        end_of_f1 = time.time()
        time_used = end_of_f1 - start_of_f1
        print("MAE = ", mae)
        print("RMSE = ", rmse)
        print("r2_score = ", r2)
        print(f"The calculation time is : {end_of_f1-start_of_f1}")
        results8.append([algo_name, column, train, mae, rmse, r2, time_used])
    print("-" * 100)

datafeat_label_noENS_log_opti = pd.DataFrame(
    results8,
    columns=[
        "algo",
        "target",
        "score d-entrainement",
        "MAE",
        "RMSE",
        "r2",
        "time_used",
    ],
)
datafeat_label_noENS_log_opti[
    "algo"
] = "datafeat_label_noENS_log_opti_" + datafeat_label_noENS_log_opti["algo"].astype(str)
pd.set_option("display.float_format", lambda x: "%.5f" % x)
datafeat_label_noENS_log_opti

(1064, 6)
(1064,)
Algorithme:  LinearRegression
Prédiction de  SiteEnergyUseWN(kBtu) sans EnergyStarScore
score d'entrainement =  0.6394337517366702
MAE =  0.4851068120423809
RMSE =  0.6520162930738431
r2_score =  0.7551988151405143
The calculation time is : 0.015952348709106445
----------------------------------------------------------------------------------------------------
Algorithme:  Ridge
Fitting 20 folds for each of 69 candidates, totalling 1380 fits
Prédiction de  SiteEnergyUseWN(kBtu) sans EnergyStarScore
score d'entrainement =  -0.7368608726701427
MAE =  0.485114431630671
RMSE =  0.6520322198782051
r2_score =  0.7551868554755811
The calculation time is : 5.159557819366455
----------------------------------------------------------------------------------------------------
Algorithme:  Lasso
Fitting 20 folds for each of 100 candidates, totalling 2000 fits




Prédiction de  SiteEnergyUseWN(kBtu) sans EnergyStarScore
score d'entrainement =  -0.7370339472369342
MAE =  0.4849970160235576
RMSE =  0.6536414660847986
r2_score =  0.7539769435303063
The calculation time is : 20.313678979873657
----------------------------------------------------------------------------------------------------
Algorithme:  ElasticNet
Prédiction de  SiteEnergyUseWN(kBtu) sans EnergyStarScore
score d'entrainement =  -0.7368654538368101
MAE =  0.48521476122182217
RMSE =  0.652220656381062
r2_score =  0.7550453336384456
The calculation time is : 89.88046979904175
----------------------------------------------------------------------------------------------------
Algorithme:  SGDRegressor
Fitting 20 folds for each of 162 candidates, totalling 3240 fits
Prédiction de  SiteEnergyUseWN(kBtu) sans EnergyStarScore
score d'entrainement =  -0.7890487478287608
MAE =  0.5318817687379479
RMSE =  0.6979326071387041
r2_score =  0.7195059865801364
The calculation time is : 23.0912399

Unnamed: 0,algo,target,score d-entrainement,MAE,RMSE,r2,time_used
0,datafeat_label_noENS_log_opti_LinearRegression,SiteEnergyUseWN(kBtu),0.63943,0.48511,0.65202,0.7552,0.01595
1,datafeat_label_noENS_log_opti_Ridge,SiteEnergyUseWN(kBtu),-0.73686,0.48511,0.65203,0.75519,5.15956
2,datafeat_label_noENS_log_opti_Lasso,SiteEnergyUseWN(kBtu),-0.73703,0.485,0.65364,0.75398,20.31368
3,datafeat_label_noENS_log_opti_ElasticNet,SiteEnergyUseWN(kBtu),-0.73687,0.48521,0.65222,0.75505,89.88047
4,datafeat_label_noENS_log_opti_SGDRegressor,SiteEnergyUseWN(kBtu),-0.78905,0.53188,0.69793,0.71951,23.09124
5,datafeat_label_noENS_log_opti_SVR,SiteEnergyUseWN(kBtu),0.6576,0.48673,0.70306,0.71537,0.30818
6,datafeat_label_noENS_log_opti_RandomForestRegr...,SiteEnergyUseWN(kBtu),-0.46376,0.42549,0.57627,0.80877,326.93471
7,datafeat_label_noENS_log_opti_XGBRegressor,SiteEnergyUseWN(kBtu),-0.53215,0.40799,0.55545,0.82234,307.20254
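
The fit / predict / score loop is repeated unchanged in every test cell, with only the feature set and the result list changing. One possible way to factor it out is sketched below. The function name `evaluate_models`, the `train_r2` column and the synthetic data are illustrative assumptions, not part of the original notebook, and the estimators are fitted directly instead of being wrapped in `make_pipeline`.

```python
import time

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_models(algos, X_train, X_test, y_train, y_test, target, prefix):
    """Fit each estimator and return one row of metrics per model."""
    rows = []
    for algo_name, algo in algos.items():
        start = time.time()
        algo.fit(X_train, y_train)
        y_pred = algo.predict(X_test)
        rows.append(
            {
                "algo": f"{prefix}_{algo_name}",
                "target": target,
                "train_r2": r2_score(y_train, algo.predict(X_train)),
                "MAE": mean_absolute_error(y_test, y_pred),
                "RMSE": np.sqrt(mean_squared_error(y_test, y_pred)),
                "r2": r2_score(y_test, y_pred),
                "time_used": time.time() - start,
            }
        )
    return pd.DataFrame(rows)


# Usage on synthetic data (assumption: stands in for data_label_wEN).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=6, noise=5.0, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)
demo_algos = {"LinearRegression": LinearRegression(), "Ridge": Ridge(alpha=1.0)}
demo_results = evaluate_models(
    demo_algos, X_tr, X_te, y_tr, y_te,
    target="SiteEnergyUseWN(kBtu)", prefix="demo",
)
print(demo_results)
```

With such a helper, each test cell would reduce to selecting the columns, splitting the data and calling the function once, which also avoids mismatched result-list names between the loop and the `pd.DataFrame` call.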


In [66]:
# Test 9: hyperparameter optimization / OHE encoding / log features / with ENS

X = data_ohe_wEN[
    [
        "BuildingAge_log",
        "PropertyGFATotal_log",
        "NumberofFloors",
        "BuildingType_Campus",
        "BuildingType_NonResidential",
        "BuildingType_Nonresidential COS",
        "BuildingType_Nonresidential WA",
        "BuildingType_SPS-District K-12",
        "Neighborhood_BALLARD",
        "Neighborhood_CENTRAL",
        "Neighborhood_DELRIDGE",
        "Neighborhood_DOWNTOWN",
        "Neighborhood_EAST",
        "Neighborhood_GREATER DUWAMISH",
        "Neighborhood_LAKE UNION",
        "Neighborhood_MAGNOLIA / QUEEN ANNE",
        "Neighborhood_NORTH",
        "Neighborhood_NORTHEAST",
        "Neighborhood_NORTHWEST",
        "Neighborhood_SOUTHEAST",
        "Neighborhood_SOUTHWEST",
        "PrimaryPropertyType_Distribution Center",
        "PrimaryPropertyType_Hospital",
        "PrimaryPropertyType_Hotel",
        "PrimaryPropertyType_K-12 School",
        "PrimaryPropertyType_Laboratory",
        "PrimaryPropertyType_Large Office",
        "PrimaryPropertyType_Medical Office",
        "PrimaryPropertyType_Mixed Use Property",
        "PrimaryPropertyType_Office",
        "PrimaryPropertyType_Other",
        "PrimaryPropertyType_Refrigerated Warehouse",
        "PrimaryPropertyType_Residence Hall",
        "PrimaryPropertyType_Restaurant",
        "PrimaryPropertyType_Retail Store",
        "PrimaryPropertyType_Self-Storage Facility",
        "PrimaryPropertyType_Senior Care Community",
        "PrimaryPropertyType_Small- and Mid-Sized Office",
        "PrimaryPropertyType_Supermarket / Grocery Store",
        "PrimaryPropertyType_University",
        "PrimaryPropertyType_Warehouse",
        "PrimaryPropertyType_Worship Facility",
        "ENERGYSTARScore_log",
    ]
]
y = data_ohe_wEN["SiteEnergyUseWN_log"]

print(X.shape)
print(y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

results9 = []

algos = {
    "LinearRegression": LinearRegression(),
    "Ridge": grid_ridge,
    "Lasso": grid_lasso,
    "ElasticNet": grid_elasticnet,
    "SGDRegressor": grid_sgd,
    "SVR": SVR(),
    "RandomForestRegressor": random_randomforest,
    "XGBRegressor": grid_xgb,
}

for algo_name, algo in algos.items():
    print("Algorithme: ", algo_name)
    for column in y_columns:
        start_of_f1 = time.time()
        model = make_pipeline(algo)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        print("Prédiction de ", column, "sans EnergyStarScore")
        train = model.score(X_train, y_train)
        print("score d'entrainement = ", train)
        mae = mean_absolute_error(y_test, y_pred)
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        r2 = r2_score(y_test, y_pred)
        end_of_f1 = time.time()
        time_used = end_of_f1 - start_of_f1
        print("MAE = ", mae)
        print("RMSE = ", rmse)
        print("r2_score = ", r2)
        print(f"The calculation time is : {end_of_f1-start_of_f1}")
        results9.append([algo_name, column, train, mae, rmse, r2, time_used])
    print("-" * 100)

datafeat_ohe_log_ens_opti = pd.DataFrame(
    results9,
    columns=[
        "algo",
        "target",
        "score d-entrainement",
        "MAE",
        "RMSE",
        "r2",
        "time_used",
    ],
)
datafeat_ohe_log_ens_opti[
    "algo"
] = "datafeat_ohe_log_ens_opti_" + datafeat_ohe_log_ens_opti["algo"].astype(str)
pd.set_option("display.float_format", lambda x: "%.5f" % x)
datafeat_ohe_log_ens_opti

(1064, 43)
(1064,)
Algorithme:  LinearRegression
Prédiction de  SiteEnergyUseWN(kBtu) sans EnergyStarScore
score d'entrainement =  0.8347797927701742
MAE =  0.33152935666615957
RMSE =  0.43593515306141933
r2_score =  0.8905689916111211
The calculation time is : 0.05784416198730469
----------------------------------------------------------------------------------------------------
Algorithme:  Ridge
Fitting 20 folds for each of 69 candidates, totalling 1380 fits
Prédiction de  SiteEnergyUseWN(kBtu) sans EnergyStarScore
score d'entrainement =  -0.500948524411634
MAE =  0.3284762999258751
RMSE =  0.4319168268727075
r2_score =  0.8925771014556856
The calculation time is : 6.732892036437988
----------------------------------------------------------------------------------------------------
Algorithme:  Lasso
Fitting 20 folds for each of 100 candidates, totalling 2000 fits




Prédiction de  SiteEnergyUseWN(kBtu) sans EnergyStarScore
score d'entrainement =  -0.5020272570945337
MAE =  0.3291289990549526
RMSE =  0.43279995921670184
r2_score =  0.8921373611244269
The calculation time is : 18.184733867645264
----------------------------------------------------------------------------------------------------
Algorithme:  ElasticNet
Prédiction de  SiteEnergyUseWN(kBtu) sans EnergyStarScore
score d'entrainement =  -0.5007170144738831
MAE =  0.3272090067929104
RMSE =  0.43209765306972225
r2_score =  0.8924871353362206
The calculation time is : 180.0014204978943
----------------------------------------------------------------------------------------------------
Algorithme:  SGDRegressor
Fitting 20 folds for each of 162 candidates, totalling 3240 fits




Prédiction de  SiteEnergyUseWN(kBtu) sans EnergyStarScore
score d'entrainement =  -0.5795748069050212
MAE =  0.3740827796385794
RMSE =  0.49084214741379933
r2_score =  0.8612668287603429
The calculation time is : 40.89021301269531
----------------------------------------------------------------------------------------------------
Algorithme:  SVR
Prédiction de  SiteEnergyUseWN(kBtu) sans EnergyStarScore
score d'entrainement =  0.7522204526117926
MAE =  0.41418073571950403
RMSE =  0.585577282218729
r2_score =  0.8025463991186059
The calculation time is : 0.3190886974334717
----------------------------------------------------------------------------------------------------
Algorithme:  RandomForestRegressor
Prédiction de  SiteEnergyUseWN(kBtu) sans EnergyStarScore
score d'entrainement =  -0.2854327284426461
MAE =  0.3382486780151287
RMSE =  0.45210764473626935
r2_score =  0.8822989546322879
The calculation time is : 387.0439066886902
------------------------------------------------------

Unnamed: 0,algo,target,score d-entrainement,MAE,RMSE,r2,time_used
0,datafeat_ohe_log_ens_opti_LinearRegression,SiteEnergyUseWN(kBtu),0.83478,0.33153,0.43594,0.89057,0.05784
1,datafeat_ohe_log_ens_opti_Ridge,SiteEnergyUseWN(kBtu),-0.50095,0.32848,0.43192,0.89258,6.73289
2,datafeat_ohe_log_ens_opti_Lasso,SiteEnergyUseWN(kBtu),-0.50203,0.32913,0.4328,0.89214,18.18473
3,datafeat_ohe_log_ens_opti_ElasticNet,SiteEnergyUseWN(kBtu),-0.50072,0.32721,0.4321,0.89249,180.00142
4,datafeat_ohe_log_ens_opti_SGDRegressor,SiteEnergyUseWN(kBtu),-0.57957,0.37408,0.49084,0.86127,40.89021
5,datafeat_ohe_log_ens_opti_SVR,SiteEnergyUseWN(kBtu),0.75222,0.41418,0.58558,0.80255,0.31909
6,datafeat_ohe_log_ens_opti_RandomForestRegressor,SiteEnergyUseWN(kBtu),-0.28543,0.33825,0.45211,0.8823,387.04391
7,datafeat_ohe_log_ens_opti_XGBRegressor,SiteEnergyUseWN(kBtu),-0.37361,0.33547,0.45084,0.88296,641.74016
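
The warning fragment in the Lasso output recommends a `Pipeline` with a `StandardScaler`, whereas `make_pipeline(algo)` in the cells above wraps the estimator without any preprocessing step. `SGDRegressor` and `SVR` are sensitive to feature scale, which may partly explain why they trail the other models here. The sketch below shows how a scaler could be added. The synthetic data and the default `SGDRegressor` settings are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's features (assumption).
X_demo, y_demo = make_regression(
    n_samples=400, n_features=5, noise=5.0, random_state=0
)
# Put one feature on a much larger scale, as can happen with raw surfaces.
X_demo[:, 0] *= 1000.0

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# The scaler is fitted on the training split only, then applied at predict time.
scaled_sgd = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
scaled_sgd.fit(X_tr, y_tr)

scaled_r2 = r2_score(y_te, scaled_sgd.predict(X_te))
print(f"R2 with scaling = {scaled_r2:.3f}")
```

Keeping the scaler inside the pipeline means it is refitted on each training fold when the pipeline is passed to a grid search, so no information leaks from the validation folds.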


In [67]:
# Test 10: hyperparameter optimization / label encoding / log features / with ENS

X = data_label_wEN[
    [
        "BuildingType",
        "PrimaryPropertyType",
        "Neighborhood",
        "ENERGYSTARScore_log",
        "BuildingAge_log",
        "PropertyGFATotal_log",
        "NumberofFloors",
    ]
]
y = data_label_wEN["SiteEnergyUseWN_log"]

print(X.shape)
print(y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

results10 = []

algos = {
    "LinearRegression": LinearRegression(),
    "Ridge": grid_ridge,
    "Lasso": grid_lasso,
    "ElasticNet": grid_elasticnet,
    "SGDRegressor": grid_sgd,
    "SVR": SVR(),
    "RandomForestRegressor": random_randomforest,
    "XGBRegressor": grid_xgb,
}

for algo_name, algo in algos.items():
    print('Algorithme: ',algo_name)
    for column in y_columns:
        start_of_f1 = time.time()
        model = make_pipeline(algo)
        model.fit(X_train,y_train)
        y_pred = model.predict(X_test)
        print('Prédiction de ',column, 'sans EnergyStarScore')
        train = model.score(X_train,y_train)
        print('score d\'entrainement = ', train)
        mae = mean_absolute_error(y_test,y_pred)
        rmse = np.sqrt(mean_squared_error(y_test,y_pred))
        r2 = r2_score(y_test,y_pred)
        end_of_f1 = time.time()
        time_used = end_of_f1-start_of_f1
        print("MAE = ", mae)        
        print("RMSE = ",rmse)
        print('r2_score = ', r2)
        print(f"The calculation time is : {end_of_f1-start_of_f1}")
        results10.append([algo_name, column, train, mae, rmse, r2, time_used])
    print('-'*100)

datafeat_label_ens_log_opti = pd.DataFrame(
    results10,
    columns=[
        "algo",
        "target",
        "score d-entrainement",
        "MAE",
        "RMSE",
        "r2",
        "time_used",
    ],
)
datafeat_label_ens_log_opti[
    "algo"
] = "datafeat_label_ens_log_opti_" + datafeat_label_ens_log_opti["algo"].astype(str)
pd.set_option("display.float_format", lambda x: "%.5f" % x)
datafeat_label_ens_log_opti

(1064, 7)
(1064,)
Algorithme:  LinearRegression
Prédiction de  SiteEnergyUseWN(kBtu) sans EnergyStarScore
score d'entrainement =  0.7005681135782544
MAE =  0.44058924350887624
RMSE =  0.5993122953204675
r2_score =  0.7931750185192259
The calculation time is : 0.1326444149017334
----------------------------------------------------------------------------------------------------
Algorithme:  Ridge
Fitting 20 folds for each of 69 candidates, totalling 1380 fits
Prédiction de  SiteEnergyUseWN(kBtu) sans EnergyStarScore
score d'entrainement =  -0.6714937708159805
MAE =  0.4405953234295953
RMSE =  0.5993272636928859
r2_score =  0.793164687104259
The calculation time is : 14.204059362411499
----------------------------------------------------------------------------------------------------
Algorithme:  Lasso
Fitting 20 folds for each of 100 candidates, totalling 2000 fits




Prédiction de  SiteEnergyUseWN(kBtu) sans EnergyStarScore
score d'entrainement =  -0.6717152020896574
MAE =  0.4411949741423006
RMSE =  0.6017543548640693
r2_score =  0.7914860561403277
The calculation time is : 26.06916642189026
----------------------------------------------------------------------------------------------------
Algorithme:  ElasticNet
Prédiction de  SiteEnergyUseWN(kBtu) sans EnergyStarScore
score d'entrainement =  -0.6715000921847002
MAE =  0.44082965022391923
RMSE =  0.5996850790208683
r2_score =  0.7929176403153212
The calculation time is : 118.66367149353027
----------------------------------------------------------------------------------------------------
Algorithme:  SGDRegressor
Fitting 20 folds for each of 162 candidates, totalling 3240 fits
Prédiction de  SiteEnergyUseWN(kBtu) sans EnergyStarScore
score d'entrainement =  -0.717203247565434
MAE =  0.47763589094362
RMSE =  0.6411996097863168
r2_score =  0.763253745112857
The calculation time is : 177.243649244

Unnamed: 0,algo,target,score d-entrainement,MAE,RMSE,r2,time_used
0,datafeat_label_ens_log_opti_LinearRegression,SiteEnergyUseWN(kBtu),0.70057,0.44059,0.59931,0.79318,0.13264
1,datafeat_label_ens_log_opti_Ridge,SiteEnergyUseWN(kBtu),-0.67149,0.4406,0.59933,0.79316,14.20406
2,datafeat_label_ens_log_opti_Lasso,SiteEnergyUseWN(kBtu),-0.67172,0.44119,0.60175,0.79149,26.06917
3,datafeat_label_ens_log_opti_ElasticNet,SiteEnergyUseWN(kBtu),-0.6715,0.44083,0.59969,0.79292,118.66367
4,datafeat_label_ens_log_opti_SGDRegressor,SiteEnergyUseWN(kBtu),-0.7172,0.47764,0.6412,0.76325,177.24365
5,datafeat_label_ens_log_opti_SVR,SiteEnergyUseWN(kBtu),0.72569,0.44516,0.64423,0.76101,0.58045
6,datafeat_label_ens_log_opti_RandomForestRegressor,SiteEnergyUseWN(kBtu),-0.37177,0.37186,0.50052,0.85574,414.80243
7,datafeat_label_ens_log_opti_XGBRegressor,SiteEnergyUseWN(kBtu),-0.41571,0.34663,0.46533,0.87532,544.40154
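
The MAE and RMSE reported above are computed on `SiteEnergyUseWN_log`, so they are expressed in log units, not in kBtu. The sketch below shows one way to read such an error in the original unit. It assumes the target was transformed with the natural logarithm (`np.log`), which is not confirmed in this section, and it uses made-up values in place of the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical consumptions in kBtu (illustrative, not taken from the dataset).
y_true_kbtu = np.array([7.0e6, 8.5e6, 7.4e7, 6.9e6])
y_true_log = np.log(y_true_kbtu)

# Pretend predictions that are off by a constant 0.3 in log space.
y_pred_log = y_true_log + 0.3

# Error in log units, as reported in the tables above.
mae_log = mean_absolute_error(y_true_log, y_pred_log)

# Invert the transform to get predictions and errors in kBtu.
y_pred_kbtu = np.exp(y_pred_log)
mae_kbtu = mean_absolute_error(y_true_kbtu, y_pred_kbtu)

# An error of 0.3 in natural-log space is a multiplicative factor of exp(0.3).
factor = np.exp(mae_log)
print(f"MAE (log) = {mae_log:.2f}, factor = {factor:.2f}, MAE (kBtu) = {mae_kbtu:.0f}")
```

Under this assumption, a log-space MAE of about 0.3 means predictions are typically off by a factor of roughly 1.35, which is easier to communicate than the raw log value.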


In [68]:
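# Test 11: re-run of Test 10 (same features and split): hyperparameter optimization / label encoding / log features / with ENS
# Note: this overwrites datafeat_label_ens_log_opti from Test 10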

X = data_label_wEN[
    [
        "BuildingType",
        "PrimaryPropertyType",
        "Neighborhood",
        "ENERGYSTARScore_log",
        "BuildingAge_log",
        "PropertyGFATotal_log",
        "NumberofFloors",
    ]
]
y = data_label_wEN["SiteEnergyUseWN_log"]

print(X.shape)
print(y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

results11 = []

algos = {
    "LinearRegression": LinearRegression(),
    "Ridge": grid_ridge,
    "Lasso": grid_lasso,
    "ElasticNet": grid_elasticnet,
    "SGDRegressor": grid_sgd,
    "SVR": SVR(),
    "RandomForestRegressor": random_randomforest,
    "XGBRegressor": grid_xgb,
}

for algo_name, algo in algos.items():
    print('Algorithme: ',algo_name)
    for column in y_columns:
        start_of_f1 = time.time()
        model = make_pipeline(algo)
        model.fit(X_train,y_train)
        y_pred = model.predict(X_test)
        print('Prédiction de ',column, 'sans EnergyStarScore')
        train = model.score(X_train,y_train)
        print('score d\'entrainement = ', train)
        mae = mean_absolute_error(y_test,y_pred)
        rmse = np.sqrt(mean_squared_error(y_test,y_pred))
        r2 = r2_score(y_test,y_pred)
        end_of_f1 = time.time()
        time_used = end_of_f1-start_of_f1
        print("MAE = ", mae)        
        print("RMSE = ",rmse)
        print('r2_score = ', r2)
        print(f"The calculation time is : {end_of_f1-start_of_f1}")
        results11.append([algo_name, column, train, mae, rmse, r2, time_used])
    print('-'*100)

datafeat_label_ens_log_opti = pd.DataFrame(
    results11,
    columns=[
        "algo",
        "target",
        "score d-entrainement",
        "MAE",
        "RMSE",
        "r2",
        "time_used",
    ],
)
datafeat_label_ens_log_opti[
    "algo"
] = "datafeat_label_ens_log_opti_" + datafeat_label_ens_log_opti["algo"].astype(str)
pd.set_option("display.float_format", lambda x: "%.5f" % x)
datafeat_label_ens_log_opti

(1064, 7)
(1064,)
Algorithme:  LinearRegression
Prédiction de  SiteEnergyUseWN(kBtu) sans EnergyStarScore
score d'entrainement =  0.7005681135782544
MAE =  0.44058924350887624
RMSE =  0.5993122953204675
r2_score =  0.7931750185192259
The calculation time is : 0.2782559394836426
----------------------------------------------------------------------------------------------------
Algorithme:  Ridge
Fitting 20 folds for each of 69 candidates, totalling 1380 fits
Prédiction de  SiteEnergyUseWN(kBtu) sans EnergyStarScore
score d'entrainement =  -0.6714937708159805
MAE =  0.4405953234295953
RMSE =  0.5993272636928859
r2_score =  0.793164687104259
The calculation time is : 54.4587607383728
----------------------------------------------------------------------------------------------------
Algorithme:  Lasso
Fitting 20 folds for each of 100 candidates, totalling 2000 fits




Prédiction de  SiteEnergyUseWN(kBtu) sans EnergyStarScore
score d'entrainement =  -0.6715996698049208
MAE =  0.440945243306679
RMSE =  0.6009620725720545
r2_score =  0.7920347622689744
The calculation time is : 140.5880265235901
----------------------------------------------------------------------------------------------------
Algorithme:  ElasticNet
Prédiction de  SiteEnergyUseWN(kBtu) sans EnergyStarScore
score d'entrainement =  -0.6715000921847002
MAE =  0.44082965022391923
RMSE =  0.5996850790208683
r2_score =  0.7929176403153212
The calculation time is : 130.39151310920715
----------------------------------------------------------------------------------------------------
Algorithme:  SGDRegressor
Fitting 20 folds for each of 162 candidates, totalling 3240 fits
Prédiction de  SiteEnergyUseWN(kBtu) sans EnergyStarScore
score d'entrainement =  -0.7385767139965003
MAE =  0.49623208204354996
RMSE =  0.6658701876815656
r2_score =  0.744685334939806
The calculation time is : 65.0923333

Unnamed: 0,algo,target,score d-entrainement,MAE,RMSE,r2,time_used
0,datafeat_label_ens_log_opti_LinearRegression,SiteEnergyUseWN(kBtu),0.70057,0.44059,0.59931,0.79318,0.27826
1,datafeat_label_ens_log_opti_Ridge,SiteEnergyUseWN(kBtu),-0.67149,0.4406,0.59933,0.79316,54.45876
2,datafeat_label_ens_log_opti_Lasso,SiteEnergyUseWN(kBtu),-0.6716,0.44095,0.60096,0.79203,140.58803
3,datafeat_label_ens_log_opti_ElasticNet,SiteEnergyUseWN(kBtu),-0.6715,0.44083,0.59969,0.79292,130.39151
4,datafeat_label_ens_log_opti_SGDRegressor,SiteEnergyUseWN(kBtu),-0.73858,0.49623,0.66587,0.74469,65.09233
5,datafeat_label_ens_log_opti_SVR,SiteEnergyUseWN(kBtu),0.72569,0.44516,0.64423,0.76101,0.52962
6,datafeat_label_ens_log_opti_RandomForestRegressor,SiteEnergyUseWN(kBtu),-0.27721,0.36437,0.48744,0.86318,548.84404
7,datafeat_label_ens_log_opti_XGBRegressor,SiteEnergyUseWN(kBtu),-0.41694,0.34287,0.46198,0.8771,570.7234
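
Test 10 and Test 11 use the same features and the same split, yet the Lasso, SGDRegressor, RandomForestRegressor and XGBRegressor metrics differ slightly between the two runs. This suggests that some estimators or searches are not seeded, although their definitions are not shown in this section. The sketch below shows how fixing `random_state` on both the estimator and the search makes two runs identical. The search space and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the notebook's data (assumption).
X_demo, y_demo = make_regression(
    n_samples=150, n_features=6, noise=5.0, random_state=0
)


def run_search(seed):
    """Run a small randomized search with every source of randomness seeded."""
    search = RandomizedSearchCV(
        RandomForestRegressor(n_estimators=20, random_state=seed),
        param_distributions={
            "max_depth": [2, 4, 6, None],
            "min_samples_leaf": [1, 2, 5],
        },
        n_iter=4,
        cv=3,
        random_state=seed,
    )
    search.fit(X_demo, y_demo)
    return search.best_params_, search.predict(X_demo)


params_a, pred_a = run_search(seed=0)
params_b, pred_b = run_search(seed=0)
print(params_a == params_b, np.allclose(pred_a, pred_b))
```

With seeded runs, a difference between two result tables can be attributed to a change in features or encoding instead of to sampling noise.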


# Comparison of results <a class="anchor" id="chapter3"></a>

In [70]:
# concatenate the result dataframes

df_SEU = pd.concat(
    [
        datafeat_label_ens_log_opti,
        datafeat_ohe_log_ens_opti,
        datafeat_label_noENS_log_opti,
        datafeat_ohe_noENS_log_opti,
    ]
)

In [71]:
pd.set_option("display.float_format", lambda x: "%.5f" % x)

df_SEU.sort_values(by=["r2"], ascending=False)

Unnamed: 0,algo,target,score d-entrainement,MAE,RMSE,r2,time_used
1,datafeat_ohe_log_ens_opti_Ridge,SiteEnergyUseWN(kBtu),-0.50095,0.32848,0.43192,0.89258,6.73289
3,datafeat_ohe_log_ens_opti_ElasticNet,SiteEnergyUseWN(kBtu),-0.50072,0.32721,0.4321,0.89249,180.00142
2,datafeat_ohe_log_ens_opti_Lasso,SiteEnergyUseWN(kBtu),-0.50203,0.32913,0.4328,0.89214,18.18473
0,datafeat_ohe_log_ens_opti_LinearRegression,SiteEnergyUseWN(kBtu),0.83478,0.33153,0.43594,0.89057,0.05784
7,datafeat_ohe_log_ens_opti_XGBRegressor,SiteEnergyUseWN(kBtu),-0.37361,0.33547,0.45084,0.88296,641.74016
6,datafeat_ohe_log_ens_opti_RandomForestRegressor,SiteEnergyUseWN(kBtu),-0.28543,0.33825,0.45211,0.8823,387.04391
7,datafeat_label_ens_log_opti_XGBRegressor,SiteEnergyUseWN(kBtu),-0.41694,0.34287,0.46198,0.8771,570.7234
6,datafeat_label_ens_log_opti_RandomForestRegressor,SiteEnergyUseWN(kBtu),-0.27721,0.36437,0.48744,0.86318,548.84404
4,datafeat_ohe_log_ens_opti_SGDRegressor,SiteEnergyUseWN(kBtu),-0.57957,0.37408,0.49084,0.86127,40.89021
1,datafeat_ohe_noENS_log_opti_Ridge,SiteEnergyUseWN(kBtu),-0.58403,0.38131,0.51081,0.84975,8.35026
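
The concatenated table keeps the row index of each source dataframe, so index values such as 1, 6 and 7 appear several times. A possible variant is sketched below: it tags each row with its experiment and rebuilds a unique index. The `experiment` column and the two small frames are illustrative assumptions; their `r2` values are copied from the tables above.

```python
import pandas as pd

# Small stand-ins for two of the result dataframes.
ohe_ens = pd.DataFrame(
    {"algo": ["Ridge", "XGBRegressor"], "r2": [0.89258, 0.88296]}
)
label_ens = pd.DataFrame(
    {"algo": ["Ridge", "XGBRegressor"], "r2": [0.79316, 0.87710]}
)

frames = {"ohe_ens": ohe_ens, "label_ens": label_ens}

# Tag each frame with its experiment name, then stack with a fresh index.
comparison = (
    pd.concat(
        [frame.assign(experiment=name) for name, frame in frames.items()],
        ignore_index=True,
    )
    .sort_values("r2", ascending=False)
    .reset_index(drop=True)
)
print(comparison)
```

Storing the experiment in its own column also avoids having to prefix the `algo` strings, which makes later filtering or grouping by model simpler.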


In [None]:
"""

    General observations:

     - The engineered features 'haversine_distance' and 'PercentagePerPropertyType' do not improve the models.
         They are left out of the final models.
     - 'NumberofFloors' does not need a log transform.
     - The log transform improves performance.
     - The ENERGY STAR score improves performance.
     - On this dataset, OHE performs better than label encoding.
     - On this dataset, with OHE, linear methods perform slightly better
         than ensemble methods.
     - On this dataset, with label encoding, ensemble methods perform better than linear methods.
     - Grid search exhaustively evaluates every combination of hyperparameters in the grid;
         a randomized search or Bayesian optimization
         could be used to reduce the compute cost.
         This cost is most noticeable for SVR,
         whose kernel is evaluated between every pair of observations in the dataset.

"""
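
The last observation can be illustrated with a small comparison between an exhaustive grid and a randomized search with a fixed budget. This is a minimal sketch on synthetic data; the `alpha` range, the number of draws and the use of `Ridge` are assumptions chosen for the example.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in for the notebook's data (assumption).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=0
)

# Exhaustive grid: every listed alpha is fitted on every fold.
grid = GridSearchCV(Ridge(), {"alpha": [10.0**k for k in range(-3, 4)]}, cv=5)
grid.fit(X_demo, y_demo)

# Randomized search: a fixed number of draws from a continuous distribution.
random_search = RandomizedSearchCV(
    Ridge(),
    {"alpha": loguniform(1e-3, 1e3)},
    n_iter=4,
    cv=5,
    random_state=0,
)
random_search.fit(X_demo, y_demo)

n_grid = len(grid.cv_results_["params"])
n_random = len(random_search.cv_results_["params"])
print(f"grid candidates: {n_grid}, randomized candidates: {n_random}")
```

The randomized search cost is set by `n_iter` regardless of how many hyperparameters are explored, whereas the grid grows multiplicatively with each added parameter.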

# Feature importance <a class="anchor" id="chapter4"></a>

In [1]:
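import shap
from sklearn.ensemble import RandomForestRegressor

# X_train / y_train come from the last train_test_split above
# (label encoding, with ENERGYSTARScore_log), so that cell must be run first.
rf = RandomForestRegressor(max_depth=6, random_state=0, n_estimators=10)
rf.fit(X_train, y_train)

# Mean absolute SHAP value per feature, shown as a bar chart.
shap_values = shap.TreeExplainer(rf).shap_values(X_train)
shap.summary_plot(shap_values, X_train, plot_type="bar")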

In [None]:
"""

    For the random forest regressor,
        the floor area, 'PropertyGFATotal_log', is the variable that contributes most to the model's predictions.
    The number of variables could be reduced by dropping, for example,
        'BuildingType', which has very little impact on the model's decisions.

"""
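
The idea of dropping a low-impact variable can be checked by retraining without it and comparing the test R². The sketch below does this on synthetic data. It ranks features with the forest's built-in impurity-based `feature_importances_` instead of SHAP values, and the column names `strong`, `weak` and `noise` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: two informative columns and one pure-noise column (assumption).
rng = np.random.default_rng(0)
n = 500
X_demo = pd.DataFrame(
    {
        "strong": rng.normal(size=n),
        "weak": rng.normal(size=n),
        "noise": rng.normal(size=n),
    }
)
y_demo = 3.0 * X_demo["strong"] + X_demo["weak"] + rng.normal(scale=0.1, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# Fit on all features and rank them by impurity-based importance.
rf_full = RandomForestRegressor(max_depth=6, n_estimators=50, random_state=0)
rf_full.fit(X_tr, y_tr)
importances = pd.Series(rf_full.feature_importances_, index=X_tr.columns)
least_important = importances.sort_values().index[0]

# Refit without the least important feature and compare test R2.
rf_small = RandomForestRegressor(max_depth=6, n_estimators=50, random_state=0)
rf_small.fit(X_tr.drop(columns=[least_important]), y_tr)

r2_full = r2_score(y_te, rf_full.predict(X_te))
r2_small = r2_score(y_te, rf_small.predict(X_te.drop(columns=[least_important])))
print(f"dropped {least_important}: R2 {r2_full:.3f} -> {r2_small:.3f}")
```

If the two R² values stay close on the real data, removing the variable is probably safe; a clear drop would indicate that the feature carries information the ranking understated.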