#Strategy adopted

Getting satisfactory predictions from this dataset is a challenge. With a server farm at our disposal, we could use RFECV to select the best features automatically. Failing that, we will try a less resource-hungry strategy:

1. Feature importance

XGBRegressor will provide a list of features ranked by decreasing importance.



2. Forward selection

This approach is cheaper than RFECV because we add the most important features one at a time, until the score stops improving.
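
The two steps above can be sketched in plain Python. This is a minimal illustration, not the loop used later in this notebook: `forward_selection`, `score_fn`, `min_gain` and `patience` are hypothetical names, and the toy scorer stands in for a full model fit. Unlike the notebook's loop, it compares each score with the best one seen so far.

```python
from typing import Callable, List, Sequence, Tuple


def forward_selection(
    ranked_features: Sequence[str],
    score_fn: Callable[[List[str]], float],
    min_gain: float = 0.02,
    patience: int = 3,
) -> Tuple[List[str], float]:
    """Add features in ranked order; stop after `patience` steps without a gain."""
    selection: List[str] = []
    best_subset: List[str] = []
    best_score = float("-inf")
    stalled = 0
    for feature in ranked_features:
        selection.append(feature)
        score = score_fn(selection)
        if score >= best_score + min_gain:
            # The new feature helped: remember this subset and reset the counter
            best_score, best_subset, stalled = score, selection.copy(), 0
        else:
            stalled += 1
            if stalled >= patience:
                break
    return best_subset, best_score


# Toy scorer (hypothetical): each feature contributes a fixed gain.
gains = {"a": 0.5, "b": 0.3, "c": 0.0, "d": 0.0, "e": 0.0, "f": 0.4}


def toy_score(cols: List[str]) -> float:
    return sum(gains[c] for c in cols)


subset, score = forward_selection(list("abcdef"), toy_score)
print(subset, round(score, 2))
```

Note the trade-off: early stopping saves fits, but a useful feature ranked late (here `"f"`) is never tried.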

#1.Installing the libraries

scikit-optimize, a library created by scikit-learn developers, lets us run a Bayesian optimization search over the hyperparameter space.
https://scikit-optimize.github.io/stable/modules/generated/skopt.BayesSearchCV.html

In [None]:
!pip install scikit-optimize



##1.2GLMM encoding

The scikit-learn library only provides a Target Encoder. As one paper puts it:
" [...] GLMMs are a major workhorse in applied statistics but not well understood and often neglected by the ML community."
We will use the implementation provided by category_encoders.

In [None]:
# the GLMM encoder
!pip install --upgrade category_encoders



##1.3Count Encoder

Instead of OneHotEncoding, we will use an encoder that has two advantages:
- It preserves information about the distribution of the categories, because each one is replaced by its number of occurrences rather than a plain 1 or 0.
- It creates no additional columns, which will greatly improve performance.
The implementation provided by category_encoders doesn't support multiple targets, so we use the one from feature-engine.
https://feature-engine.trainindata.com/en/latest/api_doc/encoding/CountFrequencyEncoder.html

It does have one drawback: if a category appears only in the test set, the encoder does not know its frequency and will produce missing values. To avoid this, we will group the rare categories:
https://feature-engine.trainindata.com/en/latest/api_doc/encoding/RareLabelEncoder.html
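
The behaviour described above can be illustrated with pandas alone. This is only a sketch of the idea, not feature-engine's implementation: the toy categories and the `min_count` threshold are assumptions made for the example.

```python
import pandas as pd

train = pd.Series(["Office", "Office", "Hotel", "Office", "School", "Hotel"])
test = pd.Series(["Office", "Hospital"])  # "Hospital" never appears in train

# Count encoding: map each category to its number of occurrences in train.
counts = train.value_counts()
encoded_test = test.map(counts)  # the unseen category becomes NaN

# Rare-label grouping: categories seen fewer than min_count times become "Rare".
min_count = 2  # hypothetical threshold
frequent = set(counts[counts >= min_count].index)


def group_rare(s: pd.Series) -> pd.Series:
    """Keep frequent categories, replace everything else with 'Rare'."""
    return s.where(s.isin(frequent), "Rare")


rare_counts = group_rare(train).value_counts()
encoded_test_grouped = group_rare(test).map(rare_counts)

print(encoded_test.tolist())
print(encoded_test_grouped.tolist())
```

The grouping only removes the missing values when at least one training category falls below the threshold; otherwise "Rare" has no count of its own.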

In [None]:
!pip install feature-engine



#2.Loading the libraries

In [None]:
# System
import os
from joblib import dump, load
from google.colab import files
import warnings

# Data
import pandas as pd
import numpy as np
import math
from scipy.stats import randint, uniform, loguniform

# Graphics
import matplotlib.pyplot as plt

# Machine learning - Preprocessing
import sklearn
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import OneHotEncoder, QuantileTransformer, PowerTransformer, FunctionTransformer, OrdinalEncoder, StandardScaler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.base import BaseEstimator, TransformerMixin
from category_encoders.glmm import GLMMEncoder
from feature_engine.encoding import CountFrequencyEncoder, RareLabelEncoder

# Machine learning - Automation
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn import set_config
from sklearn.dummy import DummyRegressor

# Machine learning - Metrics
from sklearn.metrics import r2_score, mean_absolute_error

# Machine learning - Models
import xgboost as xgb
from sklearn.cluster import KMeans
from sklearn.linear_model import HuberRegressor, TheilSenRegressor
from sklearn.ensemble import GradientBoostingRegressor, HistGradientBoostingRegressor, RandomForestRegressor, AdaBoostRegressor
from sklearn.multioutput import RegressorChain

# Machine learning - Model selection
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import train_test_split, LearningCurveDisplay, ShuffleSplit, HalvingRandomSearchCV
from skopt import BayesSearchCV
from skopt.space import Real, Integer, Categorical
from sklearn.exceptions import NotFittedError

#3.Configuration

In [None]:
# Silence warnings
# warnings.filterwarnings("ignore", category=UserWarning)

In [None]:
# Mount GoogleDrive and set the files path
from google.colab import drive
drive.mount('/content/drive')
%cd '/content/drive/MyDrive/CO2'
path = os.getcwd()
print(f"The current directory is: {path} \n")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/CO2
The current directory is: /content/drive/MyDrive/CO2 



#3.Loading the dataset

In [None]:
df = pd.read_csv('co2_eda.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3299 entries, 0 to 3298
Data columns (total 33 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   buildingtype                   3299 non-null   object 
 1   primarypropertytype            3299 non-null   object 
 2   taxparcelidentificationnumber  3299 non-null   object 
 3   councildistrictcode            3299 non-null   int64  
 4   neighborhood                   3299 non-null   object 
 5   numberofbuildings              3299 non-null   int64  
 6   numberoffloors                 3299 non-null   int64  
 7   propertygfatotal               3299 non-null   int64  
 8   propertygfaparking             3299 non-null   int64  
 9   propertygfabuilding            3299 non-null   int64  
 10  listofallpropertyusetypes      3299 non-null   object 
 11  largestpropertyusetype         3288 non-null   object 
 12  largestpropertyusetypegfa      3288 non-null   f

In [None]:
df.drop(['taxparcelidentificationnumber', 'councildistrictcode', 'neighborhood', 'zipcode'], axis=1, inplace=True)

In [None]:
# Fix dtype changes after CSV exporting
df['energystarscore'] = df['energystarscore'].astype('object')
# Turn the boolean columns into categorical for target encoding
# for column in df.select_dtypes(include=['bool']).columns:
#   df[column] = df[column].astype('object')
df.dtypes

buildingtype                  object
primarypropertytype           object
numberofbuildings              int64
numberoffloors                 int64
propertygfatotal               int64
propertygfaparking             int64
propertygfabuilding            int64
listofallpropertyusetypes     object
largestpropertyusetype        object
largestpropertyusetypegfa    float64
energystarscore               object
siteeui_kbtu_sf              float64
siteeuiwn_kbtu_sf            float64
sourceeui_kbtu_sf            float64
sourceeuiwn_kbtu_sf          float64
siteenergyuse_kbtu           float64
siteenergyusewn_kbtu         float64
steam                           bool
electricity                     bool
naturalgas                      bool
defaultdata                     bool
compliancestatus              object
totalghgemissions            float64
latitude                     float64
longitude                    float64
age                            int64
source_site                  float64
s

#4.Handling multiple targets

Scikit-learn offers two solutions:
- MultiOutputRegressor if the targets are treated independently.
- RegressorChain if they depend on one another.

https://scikit-learn.org/stable/modules/multiclass.html

Energy consumption and CO2 emissions are highly correlated (0.873), so we choose the second option.

Since emissions are predicted after consumption, the targets variable lists siteenergyuse_kbtu before totalghgemissions, which comes last in the chain:

In [None]:
# Define the targets and features
targets = ['sourceeuiwn_kbtu_sf', 'sourceeui_kbtu_sf', 'source_wn', 'siteeuiwn_kbtu_sf', 'siteeui_kbtu_sf', 'site_wn', 'source_site', 'siteenergyusewn_kbtu', 'siteenergyuse_kbtu', 'totalghgemissions']
y = df[targets]
X = df.drop(targets, axis=1)
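
To make the difference between the two wrappers concrete, here is a small sketch on synthetic data. The arrays `X_demo` and `y_demo` and the LinearRegression base model are assumptions for illustration only; the notebook itself chains an XGBRegressor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor, RegressorChain

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
energy = X_demo @ np.array([1.0, 2.0, 0.5]) + rng.normal(scale=0.1, size=200)
emissions = 0.8 * energy + rng.normal(scale=0.1, size=200)
y_demo = np.column_stack([energy, emissions])  # column order = chain order

# One independent model per target
independent = MultiOutputRegressor(LinearRegression()).fit(X_demo, y_demo)
# One model per target, each fed the previous targets as extra inputs
chained = RegressorChain(LinearRegression()).fit(X_demo, y_demo)

# The second estimator of the chain sees the 3 features plus the first target.
n_inputs_last = chained.estimators_[-1].coef_.shape[0]
print(independent.predict(X_demo[:2]).shape, n_inputs_last)
```

Because each link receives the earlier targets as inputs, the column order of `y` matters for a chain, which is why the emissions column is placed last.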

In [None]:
X.select_dtypes(include=['int64', 'float64']).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3299 entries, 0 to 3298
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   numberofbuildings          3299 non-null   int64  
 1   numberoffloors             3299 non-null   int64  
 2   propertygfatotal           3299 non-null   int64  
 3   propertygfaparking         3299 non-null   int64  
 4   propertygfabuilding        3299 non-null   int64  
 5   largestpropertyusetypegfa  3288 non-null   float64
 6   latitude                   3299 non-null   float64
 7   longitude                  3299 non-null   float64
 8   age                        3299 non-null   int64  
dtypes: float64(3), int64(6)
memory usage: 232.1 KB


In [None]:
X.select_dtypes(include=['object', 'bool']).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3299 entries, 0 to 3298
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   buildingtype               3299 non-null   object
 1   primarypropertytype        3299 non-null   object
 2   listofallpropertyusetypes  3299 non-null   object
 3   largestpropertyusetype     3288 non-null   object
 4   energystarscore            2496 non-null   object
 5   steam                      3299 non-null   bool  
 6   electricity                3299 non-null   bool  
 7   naturalgas                 3299 non-null   bool  
 8   defaultdata                3299 non-null   bool  
 9   compliancestatus           3299 non-null   object
dtypes: bool(4), object(6)
memory usage: 167.7+ KB


#5.Data preprocessing

The EDA showed that some variables are far from Gaussian. To address this, QuantileTransformer seems preferable to PowerTransformer because it works whatever the initial distribution: https://scikit-learn.org/stable/modules/preprocessing.html#mapping-to-a-gaussian-distribution

The cells below nevertheless use PowerTransformer, with QuantileTransformer kept as a commented-out alternative.

In [None]:
# Apply a PowerTransformer to the target variables (stored as qt)
qt = PowerTransformer()
y = qt.fit_transform(y)
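
Transforming the targets means that predictions come back on the transformed scale, so they must be inverted before computing an interpretable MAE. The round trip can be checked on toy data; `y_raw` below is a synthetic, skewed stand-in for the real targets.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
y_raw = rng.lognormal(mean=3.0, sigma=1.0, size=(500, 2))  # skewed toy targets

pt = PowerTransformer()  # Yeo-Johnson by default, followed by standardization
y_scaled = pt.fit_transform(y_raw)
y_back = pt.inverse_transform(y_scaled)

# The inverse transform should recover the original values
print(np.allclose(y_back, y_raw, rtol=1e-4))
```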

As we try different features for the model, the number of columns in X will vary. The get_columns function defined in section 6 computes the column lists for each preprocessing branch, including the categorical feature 'listofallpropertyusetypes', which is handled separately with a count vectorizer.

##5.2Geo

In [None]:
class ClusteringEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, n_clusters=2):
        self.n_clusters = n_clusters
        self.cluster_centers_ = None

    def fit(self, X, y=None):
        # Fit a KMeans clustering model
        kmeans = KMeans(n_clusters=self.n_clusters, algorithm='elkan')
        kmeans.fit(X)

        self.cluster_centers_ = kmeans.cluster_centers_
        return self

    def transform(self, X):
        # Calculate the distances between data points and cluster centers
        distances = np.sqrt(((X[:, np.newaxis] - self.cluster_centers_) ** 2).sum(axis=2))

        # Determine the nearest cluster labels for each data point
        cluster_labels = np.argmin(distances, axis=1)

        # Retrieve the coordinates of the nearest cluster center
        encoded = self.cluster_centers_[cluster_labels]

        return encoded
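
The transform step relies on NumPy broadcasting to compute all point-to-center distances at once. Here is a standalone sketch of that computation; the `points` and `centers` arrays are made up and stand in for the data and for `cluster_centers_`.

```python
import numpy as np

points = np.array([[0.0, 0.0], [0.9, 1.1], [5.0, 5.0]])
centers = np.array([[0.0, 0.0], [1.0, 1.0]])  # stand-in for cluster_centers_

# (n_points, 1, 2) - (n_centers, 2) broadcasts to (n_points, n_centers, 2)
diff = points[:, np.newaxis] - centers
distances = np.sqrt((diff ** 2).sum(axis=2))  # shape (n_points, n_centers)

labels = distances.argmin(axis=1)  # index of the nearest center
encoded = centers[labels]          # each point replaced by its nearest center

print(labels.tolist())
print(encoded.tolist())
```

Each point is replaced by the coordinates of its nearest center, so the output keeps two columns regardless of the number of clusters.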

In [None]:
# Preprocess the location features
transfo_geo = Pipeline(steps=[
    ('imputation', SimpleImputer(strategy='most_frequent')),
    ('clustering', ClusteringEncoder(n_clusters=2)),
    ('scaling', StandardScaler()),
    # ('imputation2', SimpleImputer(strategy='constant', fill_value=-999))
])

##5.3Numerical

In [None]:
# Preprocess the numerical features
transfo_num = Pipeline(steps=[
    ('scaling', PowerTransformer()),
    # ('scaling', QuantileTransformer(output_distribution='normal', random_state=42))
    ('imputation', SimpleImputer(strategy='constant', fill_value=-999)),
])

##5.4CountVectorizer

In [None]:
df['listofallpropertyusetypes'].value_counts()

Multifamily Housing                                                                       857
Multifamily Housing, Parking                                                              460
Office                                                                                    135
K-12 School                                                                               119
Office, Parking                                                                           116
                                                                                         ... 
Other, Parking, Restaurant, Retail Store                                                    1
Non-Refrigerated Warehouse, Other, Parking, Retail Store                                    1
Data Center, Non-Refrigerated Warehouse, Office, Retail Store                               1
Data Center, Medical Office, Office, Parking, Restaurant                                    1
Fitness Center/Health Club/Gym, Office, Other - Recreation, 

To make modelling easier, a CountVectorizer will tokenize each use type with the separator ', ':

In [None]:
# Fix for PicklingError when trying to dump an object that contains a lambda function
def tok(x):
  return x.split(', ')

# Fix for AttributeError: 'numpy.ndarray' object has no attribute 'lower'
class ArrayToStringTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if isinstance(X, np.ndarray):
            X = X.ravel().astype(str)
        return X

# Add the ArrayToStringTransformer before the CountVectorizer
transfo_vec = Pipeline(steps=[
    ('imputation', SimpleImputer(strategy='most_frequent')),
    ('converter', ArrayToStringTransformer()),
    ('vectorizer', CountVectorizer(tokenizer=tok))
])

##5.5OrdinalEncoder

In [None]:
# Preprocess with OrdinalEncoder
transfo_ord = Pipeline(steps=[
    ('ordinal', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan)),
    ('scaling', PowerTransformer()),
    ('imputation', SimpleImputer(strategy='constant', fill_value=-999)),
])

##5.6CountEncoder

In [None]:
# Preprocess with CountEncoder
transfo_count = Pipeline(steps=[
    ('rare', RareLabelEncoder(n_categories=2, missing_values='ignore')),
    ('count', CountFrequencyEncoder(missing_values='ignore', unseen='ignore')),
    ('scaling', PowerTransformer()),
    ('imputation', SimpleImputer(strategy='constant', fill_value=-999)),
])

#6.Building the pipeline

In [None]:
# Define the column groups for the different preprocessing branches
# count: all the categorical variables except those in vec and ord
def get_columns(X=None):
  geo = ['latitude', 'longitude']
  num = X.drop(geo, axis=1).select_dtypes(include=['int64', 'float64']).columns.tolist()
  vec = ['listofallpropertyusetypes'] if 'listofallpropertyusetypes' in X.columns else []
  ord = ['energystarscore'] if 'energystarscore' in X.columns else []
  count = X.drop(vec+ord, axis=1).select_dtypes(include=['object', 'bool']).columns.tolist()
  return geo, num, vec, ord, count

In [None]:
def chain_pipe(model, geo, transfo_geo, vec, transfo_vec, num, transfo_num, ord, transfo_ord, count, transfo_count):
    '''Define the chain and preparation step, then concatenate'''

    preparation = ColumnTransformer(
    transformers=[
    ('geo', transfo_geo, geo),
    ('num', transfo_num, num),
    ('vec', transfo_vec, vec),
    ('ord', transfo_ord, ord),
    ('count', transfo_count, count),
    ])

    chain = RegressorChain(model, verbose=True)

    pipe = Pipeline(steps=[
    ('preparation', preparation),
    ('chain', chain)
    ],
    verbose=True,
    # memory='/content/cache_directory' # caching doesn't work with a custom Class like GridEncoder
    )
    return pipe

#7.Choosing the error metric

For comparison we keep the R2 score, but it is by minimizing the MAE that we will obtain the best possible results with the file [...]

In [None]:
# Define the scoring metric
scoring='r2'
# Evaluate the model
def evaluate_model(opt=None, X=None, y=y, scoring=scoring):
  # Find the best parameters
  print('\nCV parameters:')
  for key, value in opt.best_params_.items():
    print("{}: {}".format(key, value))
  # Evaluate cross validation performance
  print('\nMean CV score (all targets):', opt.best_score_.round(4))

# Plot the learning curve
def plot_curve(opt=None, X=None, y=y, scoring=scoring):
  print('\nComputing Cross Validation for the Learning Curve...\n')
  display = LearningCurveDisplay.from_estimator(
    opt.best_estimator_,
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, num=5),
    cv=ShuffleSplit(n_splits=50, test_size=0.2, random_state=0),
    score_type="both",  # both train and test errors
    scoring=scoring,
    score_name="R2 score",
    std_display_style="fill_between",
    n_jobs=-1,
    verbose=3
    )
  _ = display.ax_.set_title('Learning Curve')

In [None]:
def get_predictions(opt=None, X_test=None, y_test=None, targets=targets):
  # Inverse transform the values to obtain an MAE that makes sense
  y_pred = opt.predict(X_test)
  y_pred_inv = qt.inverse_transform(y_pred)
  y_test_inv = qt.inverse_transform(y_test)
  # calculate R2 and MAE for each target
  for i, target in enumerate(targets):
    r2 = r2_score(y_test[:, i], y_pred[:, i])
    mae = mean_absolute_error(y_test_inv[:, i], y_pred_inv[:, i])
    print('\n' + target)
    print("R2 score:", r2)
    print("MAE:", mae)
  # r2 holds the score of the last target in the loop ('totalghgemissions' by default)
  return r2

#8.Hyperparameter search

In [None]:
# XGBoost
xg = {
    'preparation__geo__clustering__n_clusters': Integer(450, 550),
    'preparation__vec__vectorizer__min_df': Real(1e-5, 0.5, prior='log-uniform'),
    'preparation__count__rare__tol': Real(1e-5, 0.05, prior='log-uniform'),
    'chain__base_estimator__n_estimators': Categorical([i for i in range(100, 1001, 50)]),
    'chain__base_estimator__max_depth': Integer(2, 20),
    'chain__base_estimator__learning_rate': Real(0.01, 1.0, prior='log-uniform'),
    'chain__base_estimator__subsample': Real(0.1, 1.0, prior='uniform'),
    'chain__base_estimator__colsample_bytree': Real(0.1, 1.0, prior='uniform'),
    'chain__base_estimator__reg_alpha': Real(1e-5, 100, prior='log-uniform'),
    'chain__base_estimator__reg_lambda': Real(1e-5, 100, prior='log-uniform'),
}

In [None]:
def find_hyperparameters(model, search_space, X=None, y=y, test_size=0.2, scoring=scoring, targets=targets, transfo_geo=transfo_geo, transfo_num=transfo_num, transfo_vec=transfo_vec, transfo_ord=transfo_ord, transfo_count=transfo_count, *, plot=False, save=False):
  '''Run the Bayesian search, print the scores, and return the fitted search with the test R2 of the last target'''
  # split into train and test sets
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)
  # Get the columns corresponding to the current selection of features
  geo, num, vec, ord, count = get_columns(X)
  # create the pipeline
  pipe = chain_pipe(model, geo, transfo_geo, vec, transfo_vec, num, transfo_num, ord, transfo_ord, count, transfo_count)
  # use BayesSearchCV to find optimal hyperparameters
  opt = BayesSearchCV(
    pipe,
    search_space,
    n_iter=10,
    scoring=scoring,
    cv=5,
    n_jobs=-1,
    n_points=10,
    verbose=3,
    error_score='raise',
    random_state=42
    )
  opt.fit(X_train, y_train)
  # get CV scores
  evaluate_model(opt, X)
  # get predictions
  score = get_predictions(opt, X_test, y_test, targets)
  # plot learning curve
  if plot is True:
    plot_curve(opt, X)
  # Save the best model to a file (independent of the plot option)
  if save is True:
    file_name = str(model).split('(')[0] + '_' + str(len(X_train.columns)) + 'features.joblib'
    dump(opt.best_estimator_, file_name)
  return opt, score

##8.1 Dummy test

In [None]:
# dr = {
#     'preparation__geo__clustering__n_clusters': Integer(2, 4),
#     # 'preparation__geo__clustering__kw_args': {'n_clusters': 2},
#     # 'preparation__geo__clustering__kw_args': Categorical([map_clusters(i) for i in range(2, 1003, 50)]),
#     #     'preparation__geo__clustering__kw_args': Space([
#     #     Categorical({'n_clusters': 2}),
#     #     Categorical({'n_clusters': 3})
#     # ]),
#     # 'preparation__geo__clustering__kw_args': Categorical([('n_clusters', 2), ('n_clusters', 3)]),
#     # 'preparation__geo__clustering__kw_args': Categorical({'n_clusters': 2}, {'n_clusters': 3}),
#     # 'preparation__geo__clustering__kw_args': 'n_clusters': 2},
#     # 'preparation__geo__clustering__kw_args': Categorical([map_clusters(i) for i in range(2, 1003, 50)]),
#     # 'preparation__geo__clustering__kw_args': Categorical([{'n_clusters': i} for i in range(2, 1003, 50)]),
#     'chain__base_estimator__quantile' : [0.5]
# }
# opt, score = find_hyperparameters(DummyRegressor(), dr, X)
# opt

In [None]:
# opt.get_params()

In [None]:
# Stop "Run All" from going beyond this cell
# assert False

##8.2 Fit the model

In [None]:
opt, score = find_hyperparameters(xgb.XGBRegressor(missing=-999), xg, X, save=True)
opt

Fitting 5 folds for each of 10 candidates, totalling 50 fits


  x = um.multiply(x, x, out=x)
  ret = umr_sum(x, axis, dtype, out, keepdims=keepdims, where=where)


[Pipeline] ....... (step 1 of 2) Processing preparation, total=   3.7s
[Chain] .................. (1 of 10) Processing order 0, total=  12.4s
[Chain] .................. (2 of 10) Processing order 1, total=  11.9s
[Chain] .................. (3 of 10) Processing order 2, total=  13.0s
[Chain] .................. (4 of 10) Processing order 3, total=  12.7s
[Chain] .................. (5 of 10) Processing order 4, total=  12.2s
[Chain] .................. (6 of 10) Processing order 5, total=  13.3s
[Chain] .................. (7 of 10) Processing order 6, total=  13.2s
[Chain] .................. (8 of 10) Processing order 7, total=  13.2s
[Chain] .................. (9 of 10) Processing order 8, total=  12.8s
[Chain] ................. (10 of 10) Processing order 9, total=  13.3s
[Pipeline] ............. (step 2 of 2) Processing chain, total= 2.1min

CV parameters:
chain__base_estimator__colsample_bytree: 0.5116310036135371
chain__base_estimator__learning_rate: 0.010209726546505211
chain__base_e




sourceeuiwn_kbtu_sf
R2 score: 0.5836775316809294
MAE: 43.81227959396675

sourceeui_kbtu_sf
R2 score: 0.6022602223204021
MAE: 43.12195796604649

source_wn
R2 score: 0.41621516749521537
MAE: 0.020039836785328172

siteeuiwn_kbtu_sf
R2 score: 0.5871397923313009
MAE: 19.411356530668204

siteeui_kbtu_sf
R2 score: 0.5917897483354861
MAE: 18.586015718280656

site_wn
R2 score: 0.4411915171418599
MAE: 0.024528120895014424

source_site
R2 score: 0.631703516614861
MAE: 0.2567799970032171

siteenergyusewn_kbtu
R2 score: 0.8233686428817703
MAE: 2336635.4518119977

siteenergyuse_kbtu
R2 score: 0.8272413904054836
MAE: 2256938.067422263

totalghgemissions
R2 score: 0.8047640383771679
MAE: 71.35091947521892


In [None]:
# Stop "Run All" from going beyond this cell
assert False

AssertionError: ignored

In [None]:
# Access the intermediate output after the 'preparation' step of the best pipeline
intermediate_output = opt.best_estimator_.named_steps['preparation'].transform(X)

# Print the shape or any other relevant information about the intermediate output
print(intermediate_output.shape)

#9.Feature selection

In [None]:
X.info()

In [None]:
estimators = opt.best_estimator_['chain'].estimators_
last_estimator = estimators[-1]
importances = last_estimator.feature_importances_
importances[:5]

In [None]:
# get_feature_names_out() follows the output column order, unlike vocabulary_.keys()
vec = ['listofallpropertyusetypes_' + x for x in opt.best_estimator_['preparation'].named_transformers_['vec'].named_steps['vectorizer'].get_feature_names_out()]
len(vec)

In [None]:
_, num, _, ord, count = get_columns(X)

In [None]:
geo = ['geo__x0']

In [None]:
feature_names = geo + num + vec + ord + count
len(feature_names)

In [None]:
def get_features(opt, X=None):
  '''get feature importances once the RegressorChain has been fit'''
  # get feature names after grid processing
  # (the original test "'latitude' or 'geo__x0' in X.columns" was always True)
  if 'latitude' in X.columns or 'geo__x0' in X.columns:
    geo = ['latitude', 'longitude']
    # geo = ['geo__x0']
  else:
    geo = []
  # get feature names after CountVectorizer if the column 'listofallpropertyusetypes' was in X
  # get_feature_names_out() follows the output column order, unlike vocabulary_.keys()
  try:
    vec = ['listofallpropertyusetypes_' + x for x in opt.best_estimator_['preparation'].named_transformers_['vec'].named_steps['vectorizer'].get_feature_names_out()]
  except AttributeError:  # also catches NotFittedError, which subclasses AttributeError
    vec = []
  # Get the column names that didn't change and sum it all
  _, num, _, ord, count = get_columns(X)
  feature_names = geo + num + vec + ord + count
  # get the best model for the last target ('totalghgemissions')
  estimators = opt.best_estimator_['chain'].estimators_
  last_estimator = estimators[-1]
  # get feature importances
  feature_importances = zip(feature_names, last_estimator.feature_importances_)
  # group feature importances by base feature name
  grouped_importances = {}
  for name, importance in feature_importances:
    if '_' in name:
      base_name = name.split('_')[0]
      if base_name in grouped_importances:
        grouped_importances[base_name] += importance
      else:
        grouped_importances[base_name] = importance
    else:
      grouped_importances[name] = importance
  return grouped_importances
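
The grouping step sums the importances of every column that derives from the same original feature. Here is a minimal sketch with made-up names and values; it assumes, as the function above does, that the base feature name is the part before the first underscore.

```python
from collections import defaultdict

names = [
    "latitude",
    "numberoffloors",
    "listofallpropertyusetypes_office",
    "listofallpropertyusetypes_parking",
]
importances = [0.1, 0.3, 0.4, 0.2]  # illustrative values only

grouped = defaultdict(float)
for name, importance in zip(names, importances):
    base_name = name.split('_')[0]  # unchanged when there is no underscore
    grouped[base_name] += importance

ranked = sorted(grouped.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

Because zip stops at the shorter sequence, any importances beyond the listed names (for instance the earlier targets appended by the chain) are left out of the grouping.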

In [None]:
# Sort the feature importances by value in descending order
feature_importances = get_features(opt, X)
sorted_importances = sorted(feature_importances.items(), key=lambda x: x[1], reverse=True)
print(sorted_importances)

# Extract the feature names and importances in separate lists
features = [x[0] for x in sorted_importances]
importances = [x[1] for x in sorted_importances]

# Create a bar plot of the feature importances
plt.bar(features, importances)
plt.xticks(rotation=90)
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.title('Feature Importances')
plt.show()

In [None]:
sorted_importances

In [None]:
selection = []
score_curve = []
selection_score = {}
stop_count = 0
for f, feature in enumerate(features, start=1):
    selection.append(feature)
    X_sel = X[selection]
    print('X:', X_sel.shape, feature)
    # missing=-999 matches the constant used by the imputers in the pipeline
    opt, score = find_hyperparameters(xgb.XGBRegressor(missing=-999), xg, X_sel, save=True)
    score_curve.append(score)
    selection_score[score] = selection.copy()
    print('\nFEATURE SELECTION\n{}: {}\n\n\n'.format(f, selection))
    # Check the score curve for early stopping
    if f >= 2:
      # If the new score doesn't improve substantially from the previous one
      if score_curve[-1] < score_curve[-2] + 0.02:
        stop_count += 1
      else:
        stop_count = 0
      # If the score has stabilized or decreased for 3 consecutive iterations, stop the loop
      if stop_count >= 3:
        print("Early stopping due to score stabilization or decrease")
        break

Among the models that reach an equivalent R2 score for 'totalghgemissions', we choose the one that includes 'propertygfatotal' because it significantly improves the prediction for the 'siteenergyuse_kbtu' target. We will therefore use the 5-feature selection for the rest of the work.

In [None]:
features = ['taxparcelidentificationnumber', 'naturalgas', 'numberofbuildings', 'electricity', 'propertygfatotal']

In [None]:
X = df[features]
X = target_encoder(X)
X.head()

In [None]:
# Fit the model
opt, score = find_hyperparameters(xgb.XGBRegressor(missing=-999), xg, X)

In [None]:
# Mean CV score of the best candidate, averaged over all targets
# (cv_results_['mean_test_score'] is indexed by candidate, not by target)
best_cv_score = opt.cv_results_['mean_test_score'][opt.best_index_]
best_cv_score

#10.Target encoding

##5.1GLMM

In [None]:
# Retrieve the transformed column corresponding to 'totalghgemissions'
totalghg = pd.DataFrame(y[:, -1], columns=['totalghgemissions'])

In [None]:
# Define the default value for missing categories
default_value = 0  # You can change this to any other value if desired
# Preprocess all categorical features except 'listofallpropertyusetypes' with a GLMM encoder based on 'totalghgemissions'
def glmm_encoder(X_train, X_test, glmm, totalghg, default_value):
    X_train_glmm = X_train[glmm]
    X_test_glmm = X_test[glmm]
    X_train_other = X_train.drop(glmm, axis=1)
    X_test_other = X_test.drop(glmm, axis=1)

    encoder = GLMMEncoder(verbose=2, drop_invariant=True, return_df=True, handle_unknown='return_nan', handle_missing='return_nan', randomized=True, binomial_target=False)
    X_train_glmm_encoded = encoder.fit_transform(X_train_glmm, totalghg)

    # Handle missing categories in the test set
    X_test_glmm_encoded = encoder.transform(X_test_glmm).fillna(default_value)

    X_train_encoded = pd.concat([X_train_glmm_encoded, X_train_other], axis=1)
    X_test_encoded = pd.concat([X_test_glmm_encoded, X_test_other], axis=1)

    return X_train_encoded, X_test_encoded

In [None]:
# Apply GLMM encoding to X_train only and apply the encoded values to X_test
X_train_encoded, X_test_encoded = glmm_encoder(X_train, X_test, glmm, totalghg, default_value)
# Assign the encoded values back to X_train and X_test
X_train = X_train_encoded
X_test = X_test_encoded


#10.Model comparison

Now that the number of features [...]

##10.1XGBoost

In [None]:
X.info()

##10.2Huber

In [None]:
def robust_pipe(model=None, X=X, transfo_num=transfo_num, transfo_cat1=transfo_cat1, transfo_cat2=transfo_cat2):
  '''Define the chain and preparation step, then concatenate'''
  chain = RegressorChain(model)

  preparation = ColumnTransformer(
    transformers=[
        ('num', transfo_num, X.select_dtypes(include=['int64', 'float64']).columns),
        ('cat1', transfo_cat1, [feature for feature in ['primarypropertytype', 'largestpropertyusetype'] if feature in X.columns]),
        ('cat2', transfo_cat2, ['listofallpropertyusetypes'] if 'listofallpropertyusetypes' in X.columns else [])
        ])

  pipe = Pipeline(steps=[
    ('preparation', preparation),
    ('chain', chain)
    ])
  return pipe

In [None]:
# define the search space for HuberRegressor hyperparameters
huber = {
    'preparation__num__imputation__n_neighbors': Integer(2, 20),
    'chain__base_estimator__epsilon': Real(1.0, 3.0, prior='uniform'),
    'chain__base_estimator__alpha': Real(0.0001, 0.1, prior='log-uniform')
}

In [None]:
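# Use HuberRegressor to fit a robust linear model on top of the tuned model's predictions
from sklearn.metrics import mean_squared_error

huber_model = HuberRegressor()  # separate name so the search space dict 'huber' is not overwritten
huber_model.fit(opt.predict(X_train), y_train)

# Evaluate the performance of the model on the test set
y_pred = huber_model.predict(opt.predict(X_test))
mse = mean_squared_error(y_test, y_pred)
print("MSE: %.2f" % mse)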

##10.3HistGradientBoostingRegressor

In [None]:
hgb = {
    'preparation__num__imputation__n_neighbors': Integer(2, 20),
    'chain__base_estimator__loss': Categorical(['squared_error', 'absolute_error', 'poisson', 'quantile']),
    'chain__base_estimator__learning_rate': Real(0.01, 0.1, prior='log-uniform'),
    'chain__base_estimator__max_iter': Categorical([i for i in range(100, 1001, 50)]),
    'chain__base_estimator__max_leaf_nodes': Integer(2, 500),
    'chain__base_estimator__min_samples_leaf': Integer(1, 50),
    'chain__base_estimator__l2_regularization': Real(1e-10, 1e-1, prior='log-uniform')
}

In [None]:
opt, score = find_hyperparameters(HistGradientBoostingRegressor(), hgb, X)

In [None]:
# define the search space for hyperparameters
n_features = X.shape[1]
rf = {
    'preparation__num__imputation__n_neighbors': Integer(2, 20),
    'chain__base_estimator__n_estimators': Categorical([i for i in range(100, 1001, 50)]),
    'chain__base_estimator__max_depth': Integer(2, 20),
    'chain__base_estimator__min_samples_split': Integer(2, 10),
    'chain__base_estimator__min_samples_leaf': Integer(1, 10),
    'chain__base_estimator__max_features': Integer(int(np.log2(n_features)), n_features),
    'chain__base_estimator__max_samples': Real(0.1, 1.0, prior='log-uniform')
}

##10.4RandomForestRegressor


In [None]:
# find_hyperparameters returns the fitted search and the test R2 of the last target
opt, score = find_hyperparameters(RandomForestRegressor(), rf, X)

#11.Exporting the chosen model

In [None]:
# Select the best hyperparameters
best_pipe = opt.best_estimator_
# Fit the pipeline on the original dataset
X = df.drop(targets, axis=1)
X = glmm_encoder(X)
best_pipe.fit(X, y)
# Save the resulting model to a file
dump(best_pipe, 'xgboost_model.joblib')
files.download('xgboost_model.joblib')