#Adopted strategy

Getting satisfactory predictions from this dataset is a challenge. If we had a server farm, we could use RFECV to select the best features automatically. Failing that, we will try a less resource-hungry strategy:

1. Feature importance

XGBRegressor will provide a list of features ranked by decreasing importance.

2. Forward selection

The algorithm will use fewer resources than RFECV because the most important features are added one at a time, until the score stops improving.
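
The two steps above can be sketched as a greedy loop. This is only an illustration under stated assumptions: `forward_selection`, `score_fn`, the feature ranking and the toy error values are all hypothetical, and a real `score_fn` would fit the model on the given features and return a validation error (lower is better, as with the MAE).

```python
from typing import Callable, Sequence


def forward_selection(ranked_features: Sequence[str],
                      score_fn: Callable[[list], float],
                      min_gain: float = 0.0) -> list:
    """Greedily add features, most important first, while the error drops.

    ranked_features: features sorted by decreasing importance.
    score_fn: hypothetical callback returning an error (lower is better)
              for a given list of features.
    """
    selected = []
    best_error = float("inf")
    for feature in ranked_features:
        candidate = selected + [feature]
        error = score_fn(candidate)
        if best_error - error > min_gain:
            selected, best_error = candidate, error
        else:
            break  # the error no longer improves: stop adding features
    return selected


# Toy error function (made-up numbers): only the first two features help.
toy_errors = {1: 60.0, 2: 52.0, 3: 52.5, 4: 51.0}
ranked = ["propertygfatotal", "primarypropertytype", "age", "steam"]
selected = forward_selection(ranked, lambda feats: toy_errors[len(feats)])
print(selected)
```

Each candidate set costs one model fit, so the loop needs at most as many fits as there are features, whereas RFECV refits the model for every fold at every elimination step.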

#1.Installing the libraries

This library (scikit-optimize), created by scikit-learn developers, will let us run a Bayesian optimization search over the hyperparameter space.
https://scikit-optimize.github.io/stable/modules/generated/skopt.BayesSearchCV.html

In [None]:
!pip install scikit-optimize



##1.2GLMM encoding

The scikit-learn library only provides a Target Encoder. In their paper, the authors note:
" [...] GLMMs are a major workhorse in applied statistics but not well understood and often neglected by the ML community."
We will use the implementation provided by category_encoders.

In [None]:
# the GLMM encoder
!pip install --upgrade category_encoders



##1.3Count Encoder

Instead of one-hot encoding, we will use an encoder that has two advantages:
- It preserves information about the distribution of the categories, because each category is replaced by its number of occurrences instead of a plain 1 or 0.
- It does not create additional columns, which will greatly improve performance.

The implementation provided by category_encoders doesn't support multiple targets, so we use the one from feature-engine:
https://feature-engine.trainindata.com/en/latest/api_doc/encoding/CountFrequencyEncoder.html

It does have one drawback: if a category appears only in the test set, the encoder does not know its frequency and therefore produces missing values. To avoid this, we will group the rare categories together:
https://feature-engine.trainindata.com/en/latest/api_doc/encoding/RareLabelEncoder.html
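
The idea of grouping rare categories before count encoding can be illustrated in plain Python. This is a simplified sketch, not the feature-engine API: the function names, the `rare_threshold` value and the sample categories are assumptions made for the example.

```python
from collections import Counter


def fit_count_encoder(train_values, rare_threshold=0.05, rare_label="Rare"):
    """Learn category counts on the training set, after grouping rare ones."""
    n = len(train_values)
    freq = Counter(train_values)
    frequent = {cat for cat, count in freq.items() if count / n >= rare_threshold}
    grouped = [v if v in frequent else rare_label for v in train_values]
    counts = Counter(grouped)
    return frequent, counts, rare_label


def transform(values, frequent, counts, rare_label):
    """Replace each category by its training count; unseen ones join the rare group."""
    return [counts[v if v in frequent else rare_label] for v in values]


train = ["Office"] * 6 + ["Hotel"] * 3 + ["Library"]
test = ["Office", "Hotel", "Museum"]  # "Museum" never appears in the training set

frequent, counts, rare_label = fit_count_encoder(train, rare_threshold=0.2)
encoded = transform(test, frequent, counts, rare_label)
print(encoded)  # [6, 3, 1]
```

"Museum" is unknown at fit time, yet it receives the count of the rare group instead of a missing value.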

In [None]:
!pip install feature-engine



#2.Loading the libraries

In [None]:
# System
import os
from joblib import dump, load
from google.colab import files
import warnings

# Data
import pandas as pd
import numpy as np
import math
from scipy.stats import randint, uniform, loguniform
from sklearn.utils import shuffle

# Graphics
import matplotlib.pyplot as plt

# Machine learning - Preprocessing
import sklearn
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import OneHotEncoder, QuantileTransformer, PowerTransformer, FunctionTransformer, OrdinalEncoder, StandardScaler, RobustScaler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.base import BaseEstimator, TransformerMixin
from category_encoders.glmm import GLMMEncoder
from feature_engine.encoding import CountFrequencyEncoder, RareLabelEncoder

# Machine learning - Automation
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn import set_config
from sklearn.dummy import DummyRegressor

# Machine learning - Metrics
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, make_scorer

# Machine learning - Models
import xgboost as xgb
from sklearn.cluster import KMeans
from sklearn.svm import SVR
from sklearn.linear_model import HuberRegressor, TheilSenRegressor
from sklearn.ensemble import GradientBoostingRegressor, HistGradientBoostingRegressor, RandomForestRegressor, AdaBoostRegressor, VotingRegressor, IsolationForest
from sklearn.multioutput import RegressorChain
from sklearn.neural_network import MLPRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.neighbors import LocalOutlierFactor

# Machine learning - Model selection
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import train_test_split, LearningCurveDisplay, ShuffleSplit, HalvingRandomSearchCV, cross_val_score, learning_curve
from skopt import BayesSearchCV
from skopt.space import Real, Integer, Categorical
from sklearn.exceptions import NotFittedError

#3.Configuration

XGBoost offers a more efficient sampling method when a graphics card is available, so the Google Colab GPU runtime is preferable. If it is not available, section 8.2.1 must be replaced with the CPU version (8.2.2).
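
The switch between the two versions could be automated with a small helper. This is a hedged sketch: `xgb_device_params` is a hypothetical name, the detection through `nvidia-smi` is an assumed heuristic, and the parameter values are simply those used in sections 8.2.1 and 8.2.2 (depending on the installed XGBoost version, some of them may be deprecated).

```python
import shutil


def xgb_device_params(use_gpu: bool) -> dict:
    """Return the XGBRegressor keyword arguments used in sections 8.2.1 / 8.2.2."""
    if use_gpu:
        return {"sampling_method": "gradient_based", "tree_method": "gpu_hist"}
    return {"sampling_method": "uniform", "tree_method": "hist"}


# Heuristic (an assumption): a GPU runtime exposes the nvidia-smi executable.
gpu_available = shutil.which("nvidia-smi") is not None
params = xgb_device_params(gpu_available)
print(params)
```

The resulting dictionary could then be unpacked into the model, for instance `xgb.XGBRegressor(missing=-999, **params)`.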

In [None]:
# Silence warnings
warnings.filterwarnings('ignore')

In [None]:
# Mount GoogleDrive and set the files path
from google.colab import drive
drive.mount('/content/drive')
%cd '/content/drive/MyDrive/CO2'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/CO2


#3.Loading the dataset

In [None]:
df = pd.read_csv('co2_predictions.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3189 entries, 0 to 3188
Data columns (total 20 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   primarypropertytype        3189 non-null   object 
 1   councildistrictcode        3189 non-null   int64  
 2   numberofbuildings          3189 non-null   int64  
 3   numberoffloors             3189 non-null   int64  
 4   propertygfatotal           3189 non-null   int64  
 5   propertygfaparking         3189 non-null   int64  
 6   listofallpropertyusetypes  3189 non-null   object 
 7   largestpropertyusetype     3178 non-null   object 
 8   energystarscore            2386 non-null   float64
 9   siteeuiwn_kbtu_sf          3189 non-null   float64
 10  sourceeuiwn_kbtu_sf        3189 non-null   float64
 11  siteenergyuse_kbtu         3189 non-null   float64
 12  siteenergyusewn_kbtu       3189 non-null   float64
 13  steam                      3189 non-null   objec

In [None]:
# Fix dtype changes after CSV exporting
df['energystarscore'] = df['energystarscore'].astype('object')
df['councildistrictcode'] = df['councildistrictcode'].astype('object')
# Turn the boolean columns into categorical for target encoding
# for column in df.select_dtypes(include=['bool']).columns:
#   df[column] = df[column].astype('object')
df.dtypes

primarypropertytype           object
councildistrictcode           object
numberofbuildings              int64
numberoffloors                 int64
propertygfatotal               int64
propertygfaparking             int64
listofallpropertyusetypes     object
largestpropertyusetype        object
energystarscore               object
siteeuiwn_kbtu_sf            float64
sourceeuiwn_kbtu_sf          float64
siteenergyuse_kbtu           float64
siteenergyusewn_kbtu         float64
steam                         object
naturalgas                    object
totalghgemissions            float64
age                            int64
source_site                  float64
source_wn                    float64
site_wn                      float64
dtype: object

After trying all sorts of options to impute the missing data (SimpleImputer, KNN, native XGBoost...), the simplest method turns out to be the most effective:

In [None]:
df.dropna(inplace=True)

#4.Handling multiple targets

Scikit-learn offers two solutions:
- MultiOutputRegressor if the targets are treated independently.
- RegressorChain if they depend on one another.

https://scikit-learn.org/stable/modules/multiclass.html

The correlation between energy consumption and CO2 emissions is high (0.873), so we choose the second option.

Since emissions are predicted after consumption, the targets list places the siteenergyuse_kbtu column before totalghgemissions, which comes last in the chain:

In [None]:
# Define the targets and features
targets = ['sourceeuiwn_kbtu_sf', 'source_wn', 'siteeuiwn_kbtu_sf', 'site_wn', 'source_site', 'siteenergyusewn_kbtu', 'siteenergyuse_kbtu', 'totalghgemissions']
y = df[targets]
X = df.drop(targets, axis=1)
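
To make the role of the target order concrete, here is a small sketch on synthetic data (the arrays, coefficients and the `LinearRegression` base model are assumptions chosen for the example, not the notebook's setup). Each estimator in a RegressorChain receives the original features plus the targets that precede it in the chain.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import RegressorChain

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
# Synthetic targets: "emissions" depend on "energy use"
energy = X_demo @ np.array([1.0, 2.0, 0.5]) + rng.normal(scale=0.1, size=50)
emissions = 0.8 * energy + rng.normal(scale=0.1, size=50)
Y_demo = np.column_stack([energy, emissions])

# By default the chain follows the column order of Y: energy first, emissions last
chain = RegressorChain(LinearRegression()).fit(X_demo, Y_demo)

# Number of inputs seen by each estimator of the chain
n_inputs = [est.coef_.shape[0] for est in chain.estimators_]
print(n_inputs)  # [3, 4]
```

The second estimator sees one extra input, the first target, which is why the emissions column has to come after the consumption column.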

In [None]:
X.select_dtypes(include=['int64', 'float64']).info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2379 entries, 0 to 3174
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   numberofbuildings   2379 non-null   int64
 1   numberoffloors      2379 non-null   int64
 2   propertygfatotal    2379 non-null   int64
 3   propertygfaparking  2379 non-null   int64
 4   age                 2379 non-null   int64
dtypes: int64(5)
memory usage: 111.5 KB


In [None]:
X.select_dtypes(include=['object', 'bool']).info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2379 entries, 0 to 3174
Data columns (total 7 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   primarypropertytype        2379 non-null   object
 1   councildistrictcode        2379 non-null   object
 2   listofallpropertyusetypes  2379 non-null   object
 3   largestpropertyusetype     2379 non-null   object
 4   energystarscore            2379 non-null   object
 5   steam                      2379 non-null   object
 6   naturalgas                 2379 non-null   object
dtypes: object(7)
memory usage: 148.7+ KB


#5.Data preprocessing

Ideally, count encoding, target encoding or GLMM encoding should give better results, but none of the tests was conclusive (see the predictions/tests folder). When the categorical variables do not have a suitable distribution, these encoders tend to create missing values that hurt the model. Or they can produce a huge false hope if the X test set ends up contaminated by the values to predict ;)

We would need advanced training in statistics to learn how to group the categories without hours of trial and error...

The EDA showed that some variables were far from having a Gaussian distribution. To remedy this, the QuantileTransformer seems preferable to the PowerTransformer because it works whatever the initial distribution: https://scikit-learn.org/stable/modules/preprocessing.html#mapping-to-a-gaussian-distribution

##5.1Targets

In [None]:
# Fit the PowerTransformer on the training set only to avoid data leakage
def transfo_tar(y_train):
    power_transformer = PowerTransformer()
    y_transf = power_transformer.fit_transform(y_train)
    return y_transf, power_transformer
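
The round trip that this function enables can be checked on synthetic data. This is a minimal sketch with made-up log-normal targets: the transformer is fitted on the training part only, reused on the held-out part, and `inverse_transform` brings values back to the original units.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
y_demo = rng.lognormal(mean=3.0, sigma=1.0, size=(100, 2))  # skewed synthetic targets
y_tr, y_te = y_demo[:80], y_demo[80:]

pt = PowerTransformer()           # Yeo-Johnson by default, with standardization
y_tr_t = pt.fit_transform(y_tr)   # fit on the training targets only
y_te_t = pt.transform(y_te)       # reuse the fitted parameters on held-out targets
y_te_back = pt.inverse_transform(y_te_t)

# The inverse transform recovers the original values up to rounding error
max_rel_error = np.max(np.abs(y_te_back - y_te) / y_te)
print(max_rel_error)
```

The same fitted transformer must be used for the inverse transform of the predictions, otherwise the errors are no longer expressed in the original units.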

##5.3Numerical Features

In [None]:
# Preprocess the numerical features
transfo_num = Pipeline(steps=[
    ('scaling', PowerTransformer()),
    ('imputation', SimpleImputer(strategy='constant', fill_value=-999)),
])

##5.5OrdinalEncoder

In [None]:
# Preprocess with OrdinalEncoder
transfo_ord = Pipeline(steps=[
    ('ordinal', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan)),
    ('scaling', PowerTransformer()),
    ('imputation', SimpleImputer(strategy='constant', fill_value=-999)),
])

##5.6OneHotEncoder

In [None]:
# Preprocess with OneHotEncoder
transfo_one = Pipeline(steps=[
    ('imputation', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='infrequent_if_exist', sparse_output=False))
])

The CountVectorizer and OneHotEncoder multiply the number of columns. To make the feature importances easier to compute, it is simpler to group the columns by type of processing:

In [None]:
X = X[['numberofbuildings', 'numberoffloors', 'propertygfatotal', 'propertygfaparking', 'age',
'energystarscore',
'primarypropertytype', 'councildistrictcode', 'largestpropertyusetype', 'steam', 'naturalgas', 'listofallpropertyusetypes']]

In [None]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2379 entries, 0 to 3174
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   numberofbuildings          2379 non-null   int64 
 1   numberoffloors             2379 non-null   int64 
 2   propertygfatotal           2379 non-null   int64 
 3   propertygfaparking         2379 non-null   int64 
 4   age                        2379 non-null   int64 
 5   energystarscore            2379 non-null   object
 6   primarypropertytype        2379 non-null   object
 7   councildistrictcode        2379 non-null   object
 8   largestpropertyusetype     2379 non-null   object
 9   steam                      2379 non-null   object
 10  naturalgas                 2379 non-null   object
 11  listofallpropertyusetypes  2379 non-null   object
dtypes: int64(5), object(7)
memory usage: 241.6+ KB


#6.Creating the pipeline

First, we need a function that determines which columns go through which type of preprocessing.

As we try different features for the model, the number of columns in X will vary. The following function computes the lists of columns for the numerical features, the ordinal feature ('energystarscore') and the remaining categorical features, which are one-hot encoded.

In [None]:
def get_columns(X):
  num = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
  ord = ['energystarscore'] if 'energystarscore' in X.columns else []
  one = X.drop(ord, axis=1).select_dtypes(include=['object', 'bool']).columns.tolist()
  return num, ord, one

In [None]:
def chain_pipe(model, num, transfo_num, ord, transfo_ord, one, transfo_one):
    '''Define the chain and preparation step, then concatenate'''

    preparation = ColumnTransformer(
    transformers=[
    ('num', transfo_num, num),
    ('ord', transfo_ord, ord),
    ('one', transfo_one, one),
    ])

    chain = RegressorChain(model, verbose=True)

    pipe = Pipeline(steps=[
    ('preparation', preparation),
    ('chain', chain)
    ],
    verbose=True,
    memory='/content/cache_directory'
    )
    return pipe

#7.Choosing the error metric

Minimizing the MAE is how we will get the best possible results on the new CSV files. The RMSE gives an idea of the weight of the outliers: if it is much larger than the MAE, we will look for further improvements.
If we optimize the R2 score instead, we can reach 0.89, but the MAE suffers and the curves point more clearly to overfitting.
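
A short arithmetic example shows why the gap between RMSE and MAE signals outliers. The error values below are made up for the illustration.

```python
import math

errors = [1.0, 1.0, 1.0, 1.0, 16.0]  # made-up absolute errors, in tonnes; one outlier

mae = sum(errors) / len(errors)
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

print(f"MAE:  {mae:.2f}")   # MAE:  4.00
print(f"RMSE: {rmse:.2f}")  # RMSE: 7.21
```

Without the outlier (five errors of 1.0), both metrics would equal 1.0. Squaring the errors is what makes the RMSE grow faster than the MAE when a few predictions are far off.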

In [None]:
scoring='neg_mean_absolute_error'
# sklearn.metrics.get_scorer_names()

##7.1 CV Score

With multiple targets, scikit-learn only provides a single CV score, averaged over all targets. Instead, we will compute a learning curve for the last estimator of the chain, which corresponds to the target 'totalghgemissions'.

In [None]:
# Evaluate the model
def evaluate_model(opt):
  # Find the best parameters
  print('\nCV parameters:')
  for key, value in opt.best_params_.items():
    print("{}: {}".format(key, value))
  # Evaluate cross validation performance
#   print('\nMean CV score (all targets):', opt.best_score_.round(2)) # useless for multiple targets

# Plot the learning curve
def plot_curve(opt, X, y, scoring=scoring):
  print('\nComputing Cross Validation for the Learning Curve...\n')
  # Create a pipe that will include the preparation step
  curve_pipe = Pipeline(steps=[
            ('preparation', opt.best_estimator_['preparation']),
            ('last estimator', opt.best_estimator_['chain'].estimators_[-1])
            ],
             verbose=False
            # memory='/content/cache_directory' # caching doesn't work with custom Classes
            )
  # Plot the learning curve
  display = LearningCurveDisplay.from_estimator(
    curve_pipe,
    X,
    np.asarray(y)[:, -1],  # last target ('totalghgemissions'); works for DataFrames and arrays
    train_sizes=np.linspace(0.1, 1.0, num=3),
    cv=ShuffleSplit(n_splits=5, test_size=0.2, random_state=42),
    score_type="both",  # both train and test errors
    scoring=scoring,
    negate_score=True,
    std_display_style="fill_between",
    n_jobs=-1
    )
  _ = display.ax_.set_title('Learning Curve for the Last Estimator')

##7.2 Test Score

In [None]:
def get_predictions(opt, X_test, y_test, power_transformer, targets=targets):
  # Get predictions and inverse transform the values
  y_pred = opt.predict(X_test)
  y_pred_inv = power_transformer.inverse_transform(y_pred)
  # calculate MAE, RMSE and R2 score for each target
  y_test = y_test.to_numpy()
  for i, target in enumerate(targets):
    mae = mean_absolute_error(y_test[:, i], y_pred_inv[:, i])
    rmse = np.sqrt(mean_squared_error(y_test[:, i], y_pred_inv[:, i]))
    r2 = r2_score(y_test[:, i], y_pred_inv[:, i])
    print('\n' + target)
    print(f'MAE: {round(mae, 2):.2f}')
    print(f'RMSE: {round(rmse, 2):.2f}')
    print(f'R2 score: {round(r2, 4):.4f}')
  return mae

#8.Hyperparameter search

##8.1Search Space

In [None]:
# XGBoost
xg = {
    'preparation__one__onehot__min_frequency': Real(1e-5, 1, prior='log-uniform'),
    'chain__base_estimator__n_estimators': Categorical([i for i in range(100, 1001, 50)]),
    'chain__base_estimator__max_depth': Integer(2, 20),
    'chain__base_estimator__learning_rate': Real(1e-5, 1.0, prior='log-uniform'),
    'chain__base_estimator__subsample': Real(0.1, 1.0, prior='uniform'),
    'chain__base_estimator__colsample_bytree': Real(0.1, 1.0, prior='uniform'),
    'chain__base_estimator__colsample_bylevel': Real(0.1, 1.0, prior='uniform'),
    'chain__base_estimator__colsample_bynode': Real(0.1, 1.0, prior='uniform'),
    'chain__base_estimator__reg_alpha': Real(1e-5, 100, prior='log-uniform'),
    'chain__base_estimator__reg_lambda': Real(1e-5, 100, prior='log-uniform'),
    'chain__base_estimator__grow_policy': Categorical(['depthwise', 'lossguide']),
    'chain__base_estimator__max_bin': Categorical([i for i in range(256, 2049, 256)]),
}
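
Several ranges above use a log-uniform prior. The sketch below illustrates what that means with the standard library only; it is not scikit-optimize's own sampling code, and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(42)
samples = [sample_log_uniform(1e-5, 1.0, rng) for _ in range(10_000)]

# With a log-uniform prior, about as many draws fall in [1e-5, 1e-4] as in [1e-1, 1].
lowest_decade = sum(s <= 1e-4 for s in samples) / len(samples)
highest_decade = sum(s >= 1e-1 for s in samples) / len(samples)
print(f"{lowest_decade:.2f} {highest_decade:.2f}")
```

Each of the five decades between 1e-5 and 1 receives roughly 20% of the draws, so small learning rates and regularization strengths are explored as thoroughly as large ones. A uniform prior would put about 90% of the draws above 0.1.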

##8.2Outer Function

In [None]:
def find_hyperparameters(model, search_space, X, y,
                         transfo_num, transfo_ord, transfo_one,
                         test_size=0.2, scoring=scoring, targets=targets,
                         *, plot=False, save=False):
  '''Run the search, print the scores, and return the fitted search object and the test MAE of the last target'''

  # split into train and test sets
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)
  # preprocess y_train
  y_train, power_transformer = transfo_tar(y_train)
  # Get the columns corresponding to the current selection of features
  num, ord, one = get_columns(X)
  # create the pipeline
  pipe = chain_pipe(model, num, transfo_num, ord, transfo_ord, one, transfo_one)
  # use BayesSearchCV to find optimal hyperparameters
  opt = BayesSearchCV(
    pipe,
    search_space,
    n_iter=1,
    scoring=scoring,
    cv=2,
    n_jobs=-1,
    n_points=10,
    verbose=3,
    error_score='raise',
    random_state=42
    )
  opt.fit(X_train, y_train)
  # get best parameters
  evaluate_model(opt)
  # get predictions
  score = get_predictions(opt, X_test, y_test, power_transformer)
  # plot learning curve
  if plot is True:
    plot_curve(opt, X, y)
  # Save the best model to a file
  if save is True:
    file_name = str(model).split('(')[0] + '_' + str(len(X_train.columns)) \
    + 'features_' + f'{round(score, 2):.2f}' + '.joblib'
    dump(opt, file_name)
  return opt, score

##8.1 Dummy Test

In [None]:
# dr = {
#     # 'preparation__geo__clustering__n_clusters': Integer(2, 4),
#     # 'preparation__geo__clustering__kw_args': {'n_clusters': 2},
#     'chain__base_estimator__quantile' : [0.5]
# }
# opt, score = find_hyperparameters(DummyRegressor(), dr, X, y,
#                                   transfo_num, transfo_ord, transfo_one)
# opt

In [None]:
# opt.get_params()

In [None]:
# Stop "Run All" from going beyond this cell
# assert False

##8.2Full Fit

###8.2.1 GPU

In [None]:
# opt, score = find_hyperparameters(xgb.XGBRegressor(sampling_method='gradient_based', tree_method='gpu_hist', missing=-999), xg, X, y,
#                                   transfo_num, transfo_ord, transfo_one,
#                                   plot=True, save=True)

###8.2.2 CPU

In [None]:
opt, score = find_hyperparameters(xgb.XGBRegressor(sampling_method='uniform', tree_method='hist', missing=-999), xg, X, y,
                                  transfo_num, transfo_ord, transfo_one,
                                  plot=False, save=True)

Fitting 2 folds for each of 1 candidates, totalling 2 fits
[Pipeline] ....... (step 1 of 2) Processing preparation, total=   0.1s
[Chain] ................... (1 of 8) Processing order 0, total=   5.5s
[Chain] ................... (2 of 8) Processing order 1, total=   3.1s
[Chain] ................... (3 of 8) Processing order 2, total=   3.7s
[Chain] ................... (4 of 8) Processing order 3, total=   7.1s
[Chain] ................... (5 of 8) Processing order 4, total=   4.5s
[Chain] ................... (6 of 8) Processing order 5, total=   8.8s
[Chain] ................... (7 of 8) Processing order 6, total=   6.1s
[Chain] ................... (8 of 8) Processing order 7, total=   9.6s
[Pipeline] ............. (step 2 of 2) Processing chain, total=  48.4s

CV parameters:
chain__base_estimator__colsample_bylevel: 0.46909356296798244
chain__base_estimator__colsample_bynode: 0.7549531688595925
chain__base_estimator__colsample_bytree: 0.9395811989630505
chain__base_estimator__grow_polic

###8.2.3 Pipeline visualization

In [None]:
opt

In [None]:
# Stop "Run all/after" from going beyond this cell
# assert False

#9.Feature selection

In [None]:
# Retrieve the saved opt if necessary
# file_name = 'XGBRegressor_12features_50.13.joblib'
# opt = load(file_name)
# score = float(os.path.splitext(file_name)[0].rpartition('_')[2])

In [None]:
def get_features(opt, X):
  '''Get the feature importances once the RegressorChain has been fit'''
  # get feature names (this step is necessary after preprocessing with OneHotEncoder)
  num, ord, one = get_columns(X)
  try:
    one = opt.best_estimator_['preparation'].named_transformers_['one'].named_steps['onehot'].get_feature_names_out(input_features=one).tolist()
  except NotFittedError:
    one = []
  feature_names = num + ord + one
  # get the feature importances of the last estimator (target 'totalghgemissions')
  last_estimator = opt.best_estimator_['chain'].estimators_[-1]
  # zip stops at the shortest input, so the importances of the chained targets
  # (appended after the features by RegressorChain) are left out
  feature_importances = zip(feature_names, last_estimator.feature_importances_)
  # group the importances of one-hot columns ('column_category') under their base column
  grouped_importances = {}
  for name, importance in feature_importances:
    base_name = name.split('_')[0]
    grouped_importances[base_name] = grouped_importances.get(base_name, 0) + importance
  return grouped_importances
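
The grouping step can be tried in isolation on a toy dictionary. The importances below are made up, and the sketch assumes, as in this dataset, that the original column names contain no underscore, so that the text before the first underscore identifies the base column.

```python
from collections import defaultdict

# Made-up importances for one numerical and two one-hot encoded columns
encoded_importances = {
    "propertygfatotal": 0.50,
    "steam_True": 0.10,
    "steam_False": 0.05,
    "primarypropertytype_Hotel": 0.20,
    "primarypropertytype_Small- and Mid-Sized Office": 0.15,
}

grouped = defaultdict(float)
for name, importance in encoded_importances.items():
    base_name = name.split("_", 1)[0]  # text before the first underscore
    grouped[base_name] += importance

print(dict(grouped))
```

The five encoded columns collapse into three entries, one per original feature, which is the granularity needed for feature selection.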

In [None]:
# Compute and save the feature importances
feature_importances = get_features(opt, X)
dump(feature_importances, 'feature_importances.joblib')

['feature_importances.joblib']

In [None]:
assert False

AssertionError: ignored

In [None]:
# Sort the feature importances by value in descending order
sorted_importances = sorted(feature_importances.items(), key=lambda x: x[1], reverse=True)
sorted_importances

In [None]:
import plotly.express as px

# Extract the feature names and importances in separate lists
features = [x[0] for x in sorted_importances]
importances = [x[1] for x in sorted_importances]

# Create a bar plot of the feature importances
fig = px.bar(x=features, y=importances)
fig.update_layout(xaxis_tickangle=-90, xaxis_title='Feature', yaxis_title='Importance', title='Feature Importances', height=1000)
fig.show()


Gas EDA?
L1 regularization seems to have discriminated well between the features. As the EDA suggested, building age and the EnergyStar score contribute little. We may not even need these features any more. Backward selection will let us measure how removing them affects the model's performance. Since emissions are measured in tonnes, we will stop the algorithm as soon as the MAE rises by two tonnes, because that is already a lot for one building.

In [None]:
score_curve = [score]
selection_score = {}
stop_count = 0
# Remove the features one by one, starting with the least important.
# Iterate over a copy because the features list is modified inside the loop.
for f, feature in enumerate(reversed(features.copy()), start=1):
    features.remove(feature)
    print('Removed Feature: {}\n'.format(feature))
    X_sel = X[features]
    # opt, score = find_hyperparameters(xgb.XGBRegressor(sampling_method='gradient_based', tree_method='gpu_hist', missing=-999), xg, X_sel, y,
    #                                   transfo_num, transfo_ord, transfo_one)
    opt, score = find_hyperparameters(xgb.XGBRegressor(sampling_method='uniform', tree_method='hist', missing=-999), xg, X_sel, y,
                                      transfo_num, transfo_ord, transfo_one)
    score_curve.append(score)
    selection_score[score] = features.copy()
    print('\nFEATURE SELECTION\n{}: {}\n\n\n\n'.format(f, features))
    # Check the score curve for early stopping
    # Count every step where the MAE rises by more than 1 tonne
    if score_curve[-1] > score_curve[-2] + 1:
        stop_count += 1
    # After two such increases, stop the loop
    if stop_count == 2:
        print('Early stopping due to substantial score increase.')
        break
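
The stopping rule can also be written against a fixed baseline. The sketch below is a variant of the loop above, not a copy of it: it stops as soon as the MAE exceeds the baseline by `max_increase`, and `backward_elimination`, `error_fn`, the feature ranking and the toy MAE values are hypothetical.

```python
def backward_elimination(ranked_features, error_fn, baseline_error, max_increase=2.0):
    """Drop the least important features while the error stays within budget.

    ranked_features: features sorted by decreasing importance.
    error_fn: hypothetical callback returning the MAE for a list of features.
    max_increase: tolerated rise of the MAE above the baseline (here, tonnes).
    """
    kept = list(ranked_features)
    history = [(tuple(kept), baseline_error)]
    while len(kept) > 1:
        candidate = kept[:-1]  # remove the least important remaining feature
        error = error_fn(candidate)
        if error - baseline_error >= max_increase:
            break  # the removal costs too much: keep the previous selection
        kept = candidate
        history.append((tuple(kept), error))
    return kept, history


# Toy MAE values (made up), indexed by the number of features kept
toy_mae = {3: 50.4, 2: 51.1, 1: 55.0}
ranked = ["propertygfatotal", "primarypropertytype", "steam", "age"]
kept, history = backward_elimination(ranked, lambda f: toy_mae[len(f)], baseline_error=50.0)
print(kept)
```

Comparing against the baseline rather than the previous step prevents a series of small increases from adding up unnoticed.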

As soon as the first feature is removed, we cross the 50-tonne mark and lose more than three tenths of R2 score, so we might as well keep them all.

In [None]:
# Stop "Run all/after" from going beyond this cell
# assert False

#10.Model comparison

All of the following models do worse than XGBoost. But they could be used for stacking, which should improve the score a little.

Unlike XGBoost, the following models do not handle missing values automatically, so we need to modify the preprocessing slightly:

In [None]:
# Preprocess the numerical features
transfo_num = Pipeline(steps=[
    ('imputation', KNNImputer()),
    ('scaling', PowerTransformer()),
    # ('scaling', QuantileTransformer(output_distribution='normal', random_state=42)),
])

In [None]:
# Preprocess with OrdinalEncoder (fill_value=1: assuming that missing EnergyStar scores correspond to poorly managed buildings)
transfo_ord = Pipeline(steps=[
    ('ordinal', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan)),
    ('imputation', SimpleImputer(strategy='constant', fill_value=1)),
    ('scaling', PowerTransformer()),
])

##10.1MLPRegressor

In [None]:
mlp = {
    'preparation__num__imputation__n_neighbors': Integer(2, 20),
    'preparation__one__onehot__min_frequency': Real(1e-5, 1, prior='log-uniform'),
    'chain__base_estimator__activation': Categorical(['relu', 'logistic', 'tanh']),
    'chain__base_estimator__solver': Categorical(['adam', 'lbfgs']),
    'chain__base_estimator__alpha': Real(1e-5, 1.0, prior='log-uniform'),
    'chain__base_estimator__learning_rate': Categorical(['constant', 'invscaling', 'adaptive']),
    'chain__base_estimator__learning_rate_init': Real(1e-5, 1.0, prior='log-uniform'),
    'chain__base_estimator__max_iter': Integer(200, 1000)
}

In [None]:
opt, score = find_hyperparameters(MLPRegressor(early_stopping=True), mlp, X, y,
                                  transfo_num, transfo_ord, transfo_one,
                                  plot=True, save=True)

##10.xKernelRidge

In [None]:
krr = {
    'preparation__num__imputation__n_neighbors': Integer(2, 20),
    'preparation__one__onehot__min_frequency': Real(1e-5, 1, prior='log-uniform'),
    'chain__base_estimator__alpha': Real(1e-5, 1.0, prior='log-uniform'),
    'chain__base_estimator__kernel': Categorical(['linear', 'poly', 'rbf', 'sigmoid']),
    'chain__base_estimator__gamma': Real(1e-5, 1.0, prior='log-uniform'),
    'chain__base_estimator__degree': Integer(1, 5),
    'chain__base_estimator__coef0': Real(0.0, 1.0),
}

In [None]:
opt, score = find_hyperparameters(KernelRidge(), krr, X, y,
                                  transfo_num, transfo_ord, transfo_one,
                                  plot=True, save=True)

##10.2HistGradientBoostingRegressor

In [None]:
hgb = {
    'preparation__num__imputation__n_neighbors': Integer(2, 20),
    'chain__base_estimator__loss': Categorical(['squared_error', 'absolute_error']),
    'chain__base_estimator__learning_rate': Real(0.01, 0.1, prior='log-uniform'),
    'chain__base_estimator__max_iter': Categorical([i for i in range(100, 1001, 50)]),
    'chain__base_estimator__max_leaf_nodes': Integer(2, 500),
    'chain__base_estimator__min_samples_leaf': Integer(1, 50),
    'chain__base_estimator__l2_regularization': Real(1e-10, 1e-1, prior='log-uniform')
}

In [None]:
opt, score = find_hyperparameters(HistGradientBoostingRegressor(), hgb, X, y,
                                  transfo_num, transfo_ord, transfo_one,
                                  plot=True, save=True)

##10.1 RandomForestRegressor

In [None]:
# define the search space for hyperparameters
n_features = X.shape[1]
rf = {
    'preparation__num__imputation__n_neighbors': Integer(2, 20),
    'chain__base_estimator__n_estimators': Categorical([i for i in range(100, 1001, 50)]),
    'chain__base_estimator__max_depth': Integer(2, 20),
    'chain__base_estimator__min_samples_split': Integer(2, 10),
    'chain__base_estimator__min_samples_leaf': Integer(1, 10),
    'chain__base_estimator__max_features': Integer(int(np.log2(n_features)), n_features),
    'chain__base_estimator__max_samples': Real(0.1, 1.0, prior='log-uniform')
}

In [None]:
opt, score = find_hyperparameters(RandomForestRegressor(), rf, X, y,
                                  transfo_num, transfo_ord, transfo_one,
                                  plot=True, save=True)

#13.Outlier removal

In [None]:
# Retrieve the saved opt for the best model
file_name = 'XGBRegressor_12features_51.22.joblib'
opt = load(file_name)
best_pipe = opt.best_estimator_
score = float(os.path.splitext(file_name)[0].rpartition('_')[2])

In [None]:
# Shuffle the dataframe
shuffled_df = shuffle(df, random_state=42)

# Specify the proportion of data to be used for filtering
filter_proportion = 0.8

# Calculate the number of rows for the filtered set
filtered_rows = int(filter_proportion * shuffled_df.shape[0])

# Split the shuffled dataframe into filtered and unfiltered parts
df_filtered = shuffled_df[:filtered_rows]  # Train data (filtered)
df_unfiltered = shuffled_df[filtered_rows:]  # Test data (unfiltered)

In [None]:
df_filtered.info()

In [None]:
preparation = load('KernelRidge_12features_56.76.joblib').best_estimator_['preparation']
lof_pipe = Pipeline(steps=[
    ('preparation', preparation),
    ('detection', LocalOutlierFactor())
    ],
    verbose=True,
    # memory='/content/cache_directory' # caching doesn't work with custom Classes
    )

In [None]:
lof_pipe.fit(df_filtered)

In [None]:
anomaly_scores = lof_pipe['detection'].negative_outlier_factor_

In [None]:


# Use the unfiltered rows as the test set
y_test = df_unfiltered[targets]
X_test = df_unfiltered.drop(targets, axis=1)

In [None]:
# preprocess the targets
y_test_transf, y_scaler = transfo_tar(y_test)

In [None]:
score_curve = [score]
threshold_score = {}
stop_count = 0
for threshold in range(10, 50):
    # Keep the rows whose LOF score is above the threshold (scores close to -1 are inliers).
    # Filter into a new variable so that the mask always matches the rows scored by LOF.
    df_kept = df_filtered[anomaly_scores > -1 - threshold / 10]
    y_filtered = df_kept[targets]
    X_filtered = df_kept.drop(targets, axis=1)
    # find the best hyperparameters for the filtered set
    opt, _ = find_hyperparameters(xgb.XGBRegressor(sampling_method='uniform', tree_method='hist', missing=-999), xg, X_filtered, y_filtered,
                                  transfo_num, transfo_ord, transfo_one)
    # refit the target transformer on the same training split as find_hyperparameters
    y_fit, _ = train_test_split(y_filtered, test_size=0.2, random_state=42)
    _, power_transformer = transfo_tar(y_fit)
    # get predictions for the unfiltered set
    score = get_predictions(opt, X_test, y_test, power_transformer)
    score_curve.append(score)
    threshold_score[score] = threshold
    print('\nTHRESHOLD {} score: {}\n\n\n\n'.format(threshold, score))
    # Check the score curve for early stopping
    # Count every step where the MAE rises by more than 1 tonne
    if score_curve[-1] > score_curve[-2] + 1:
        stop_count += 1
    # After two such increases, report it (the break is disabled to scan every threshold)
    if stop_count == 2:
        print('Early stopping due to substantial score increase.')
        # break

In [None]:
score_curve = [score]
threshold_score = {}
stop_count = 0
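# anomaly_scores was computed on the first `filtered_rows` rows of shuffled_df,
# so the boolean mask must be applied to that subset rather than to df
df_scored = shuffled_df[:filtered_rows]
for threshold in range(10, 50):
    df_kept = df_scored[anomaly_scores > -1 - threshold / 10]
    y = df_kept[targets]
    X = df_kept.drop(targets, axis=1)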
    # preprocess the targets
    y_transf, y_scaler = transfo_tar(y)
    # get predictions
    score = get_predictions(opt, X, y_transf, y_scaler)
    score_curve.append(score)
    threshold_score[score] = threshold
    print('\nTHRESHOLD {} score: {}\n\n\n\n'.format(threshold, score))
    # Check the score curve for early stopping
    # If the new score increases substantially from the previous one
    if score_curve[-1] > score_curve[-2] + 1:
        stop_count += 1
    # If the total increase reaches 2, stop the loop
    if stop_count == 2:
        print('Early stopping due to substantial score increase.')
        # break

In [None]:
# Sort the (score, threshold) pairs by score, lowest (best MAE) first
sorted_thresholds = sorted(threshold_score.items(), key=lambda x: x[0])
sorted_thresholds
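
Because `threshold_score` is keyed by score, two thresholds that reach exactly the same score overwrite each other. A minimal alternative sketch, assuming that a lower score (MAE) is better, keys the dictionary by threshold instead. The name `score_by_threshold` and the values below are invented for illustration.

```python
# Hypothetical MAE per threshold (invented values, not results from this notebook)
score_by_threshold = {10: 52.1, 11: 51.4, 12: 51.4, 13: 53.0}

# Rank thresholds from best (lowest MAE) to worst; sorted() is stable,
# so tied thresholds keep their insertion order
ranked = sorted(score_by_threshold.items(), key=lambda item: item[1])
best_threshold, best_score = ranked[0]
print(best_threshold, best_score)
```

Keying by threshold keeps every run in the dictionary, which also makes it easier to plot the score against the threshold afterwards.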

In [None]:
score_curve = [score]
threshold_score = {}
stop_count = 0
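# Draft: the original range was left empty; the cut-offs of the cells above are assumed here
for threshold in range(10, 50):
    # anomaly_scores only covers the first `filtered_rows` rows of shuffled_df
    df_kept = shuffled_df[:filtered_rows][anomaly_scores > -1 - threshold / 10]
    y = df_kept[targets]
    X = df_kept.drop(targets, axis=1)
    # opt, score = find_hyperparameters(xgb.XGBRegressor(sampling_method='gradient_based', tree_method='gpu_hist', missing=-999), xg, X,
    #                                   transfo_num, transfo_ord, transfo_one)
    opt, score = find_hyperparameters(xgb.XGBRegressor(sampling_method='uniform', tree_method='hist', missing=-999), xg, X,
                                      transfo_num, transfo_ord, transfo_one)
    score_curve.append(score)
    threshold_score[score] = threshold

# TODO: build X and y, fit as in main.py, and return the MAE.
# TODO: reuse the feature selection example to retrieve the threshold with the best MAE.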

In [None]:
score_curve = [score]
selection_score = {}
stop_count = 0
# Iterate over a snapshot so that removing items from `features` cannot disturb the loop
for f, feature in enumerate(list(reversed(features)), start=1):
    features.remove(feature)
    print('Removed Feature: {}\n'.format(feature))
    X_sel = X[features]
    # opt, score = find_hyperparameters(xgb.XGBRegressor(sampling_method='gradient_based', tree_method='gpu_hist', missing=-999), xg, X_sel,
    #                                   transfo_num, transfo_ord, transfo_one)
    opt, score = find_hyperparameters(xgb.XGBRegressor(sampling_method='uniform', tree_method='hist', missing=-999), xg, X_sel,
                                      transfo_num, transfo_ord, transfo_one)
    score_curve.append(score)
    selection_score[score] = features.copy()
    print('\nFEATURE SELECTION\n{}: {}\n\n\n\n'.format(f, features))
    # Check the score curve for early stopping
    # If the new score increases substantially from the previous one
    if score_curve[-1] > score_curve[-2] + 1:
        stop_count += 1
    # If the total increase reaches 2, stop the loop
    if stop_count == 2:
        print('Early stopping due to substantial score increase.')
        break
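
The early-stopping rule above can be isolated in a small helper, which makes it testable without running the hyperparameter search. This is only a sketch: `backward_elimination`, `score_fn`, `baseline`, `tolerance` and `patience` are hypothetical names, the toy scoring function stands in for `find_hyperparameters`, and the feature list is assumed to be sorted by decreasing importance. Unlike the loop above, it always keeps at least one feature.

```python
def backward_elimination(features, score_fn, baseline, tolerance=1.0, patience=2):
    """Drop features from last to first; stop after `patience` substantial score increases."""
    remaining = list(features)  # work on a copy, leave the caller's list untouched
    score_curve = [baseline]
    history = []  # (score, features kept) after each removal
    stop_count = 0
    # Keep at least one feature so score_fn never receives an empty list
    while len(remaining) > 1:
        remaining.pop()  # assumed least important feature (last in the list)
        score = score_fn(remaining)
        score_curve.append(score)
        history.append((score, list(remaining)))
        # Count the steps where the error rises by more than `tolerance`
        if score_curve[-1] > score_curve[-2] + tolerance:
            stop_count += 1
        if stop_count == patience:
            break
    return history


# Toy scoring function (invented values): the error grows as features are dropped
toy_mae = {4: 50.0, 3: 50.5, 2: 52.0, 1: 55.0}
history = backward_elimination(list("abcde"), lambda kept: toy_mae[len(kept)], baseline=50.0)
best_score, best_features = min(history, key=lambda pair: pair[0])
print(best_score, best_features)
```

Storing `(score, features)` pairs in a list rather than a dictionary keyed by score avoids losing a feature set when two runs reach the same score.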

In [None]:
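# TODO (outlier threshold search):
# 1. Change the preprocessing.
# 2. Construct the pipeline.
# 3. Initialise a dictionary to collect the score obtained for each threshold.
# 4. For each threshold:
#      - keep the rows where anomaly_scores > threshold,
#      - build X and y,
#      - fit as in main.py,
#      - return the MAE.
# 5. Reuse the feature selection example to retrieve the threshold with the best MAE.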

#11.Exporting the selected model

In [None]:
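# Retrieve the pipeline with the best hyperparameters
best_pipe = opt.best_estimator_
# Fit the pipeline on the original dataset (rebuild y as well, since the
# threshold loops above reassign it to a filtered subset)
X = df.drop(targets, axis=1)
y = df[targets]
best_pipe.fit(X, y)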
# Save the resulting model to a file
dump(best_pipe, 'xgboost_model.joblib')
files.download('xgboost_model.joblib')