# 4 Regression on a given dataset

## ("C" exercise)

Perform a regression on the dataset stored in data/regression/.

You are free to choose the regression methods, but you must compare at least two
methods. You can compare more than two, but this is not mandatory for this exercise.
Discuss the choice of the optimization procedures, solvers, hyperparameters, cross-validation, etc. The Bayes estimator for this dataset and the squared loss reaches
an R2 score of approximately 0.92.

Your objective is to obtain an R2 score higher than 0.88 on the test set for at least 1 of the 2 estimators (1 estimator is enough). The test set
must not be used during training. Remember that training is the complete model optimisation procedure, including model selection and hyperparameter testing, not
only the call to a .fit() method! This is the topic that we discussed during the
practical sessions on train / validation / test and cross-validation. However, since
you have the test set, all you can do is "pretend" not to use it during training, since
you can always compute the test score several times without putting it in your solution.

---

# Import

In [22]:
import numpy as np
from sklearn.model_selection import GridSearchCV

root = 'data/regression/'

# Load .npy files
X_train = np.load(root + 'X_train.npy')
X_test = np.load(root + 'X_test.npy')
y_train = np.load(root + 'y_train.npy')
y_test = np.load(root + 'y_test.npy')

print(X_train.shape, y_train.shape)


(200, 200) (200, 1)


Before training a model, I ran a simple analysis to measure the impact of each variable on the value to predict. The goal is to see whether only a few variables have a strong influence, which would point towards a model suited to sparse signals.


In [28]:
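from sklearn.feature_selection import f_regression

# Univariate F-test between each feature and the target (one p-value per feature)
f_vals, p_vals = f_regression(X_train, y_train.ravel())

# Features whose p-value is below the 5% threshold (no multiple-testing correction)
significant = p_vals < 0.05
print(f"{significant.sum()} significant features out of {len(p_vals)}")


26 significant features out of 200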


In [24]:
X_train.shape

(200, 200)

According to the f_regression test, 26 of the 200 features show a significant univariate relationship with the target (p < 0.05). Combined with the fact that the dataset contains fewer than 100,000 observations, this justifies trying models such as Lasso and ElasticNet.
I therefore run a 5-fold cross-validation using GridSearchCV on Lasso and ElasticNet, in line with scikit-learn's recommendations for this kind of setup.
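One caveat about the significance count: when 200 features are each tested at p < 0.05, roughly 200 × 0.05 = 10 of them can pass by chance alone, so the raw count is best read as a rough indicator of sparsity. The sketch below illustrates this on synthetic data. The data generation, the variable names and the Bonferroni correction are assumptions added for illustration; they are not part of the original analysis.

```python
import numpy as np
from sklearn.feature_selection import f_regression

# Synthetic stand-in for the exercise data (assumption): 200 samples, 200 features,
# where only the first 10 features actually drive the target.
rng = np.random.default_rng(0)
n_samples, n_features, n_informative = 200, 200, 10
X = rng.standard_normal((n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:n_informative] = 1.0
y = X @ true_coef + 0.5 * rng.standard_normal(n_samples)

# Univariate F-test: one p-value per feature, each feature tested in isolation.
f_vals, p_vals = f_regression(X, y)

alpha_level = 0.05
n_uncorrected = int((p_vals < alpha_level).sum())
# Bonferroni correction: divide the threshold by the number of tests performed.
n_bonferroni = int((p_vals < alpha_level / n_features).sum())

print(f"{n_uncorrected} features pass p < {alpha_level} without correction")
print(f"{n_bonferroni} features pass after Bonferroni correction")
```

Running the same two counts on X_train and y_train.ravel() would show how many of the 26 features survive a stricter threshold.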

# Lasso

In [25]:
from sklearn.linear_model import Lasso


param_grid_lasso = {
    'alpha': [1e-5, 1e-3, 1e-1, 1, 10, 100, 1e3, 1e5]
}

# Generous iteration budget to help coordinate descent converge for small alpha
lasso = Lasso(max_iter=10000)
# 5-fold cross-validated grid search over alpha, scored with R2
grid_search_lasso = GridSearchCV(lasso, param_grid_lasso, cv=5, scoring='r2', n_jobs=-1)
grid_search_lasso.fit(X_train, y_train)

print("Best parameters (Lasso):", grid_search_lasso.best_params_)
# best_score_ is the mean cross-validated R2 of the best candidate, not a training score
print("Best CV R2 score (Lasso):", grid_search_lasso.best_score_)
print("Test R2 score (Lasso):", grid_search_lasso.score(X_test, y_test))

Best parameters (Lasso): {'alpha': 0.001}
Best CV R2 score (Lasso): 0.8764599666803479
Test R2 score (Lasso): 0.8975856443355343
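The sparsity assumption behind the choice of Lasso can be checked after fitting by counting the non-zero coefficients of the selected model. The following is a minimal sketch on synthetic data from make_regression, used here as a stand-in for the exercise dataset. The logarithmic alpha grid and all variable names are illustrative assumptions, not the settings used above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in (assumption): same shape as the exercise data, 20 informative features.
X, y = make_regression(n_samples=200, n_features=200, n_informative=20,
                       noise=5.0, random_state=0)

# Log-spaced grid: finer than steps of a factor 100, still cheap to evaluate.
param_grid = {"alpha": np.logspace(-2, 1, 7)}
search = GridSearchCV(Lasso(max_iter=10000), param_grid, cv=5, scoring="r2")
search.fit(X, y)

# The refitted best model exposes its coefficients; zeros are discarded features.
best_lasso = search.best_estimator_
n_nonzero = int(np.count_nonzero(best_lasso.coef_))

print("Best alpha:", search.best_params_["alpha"])
print(f"{n_nonzero} non-zero coefficients out of {best_lasso.coef_.size}")
```

Applied to grid_search_lasso.best_estimator_, the same count would indicate how many of the 200 features the selected Lasso actually keeps.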


# ElasticNet

In [27]:
from sklearn.linear_model import ElasticNet

param_grid_enet = {
    'alpha': [1e-5, 1e-3, 1e-1, 1, 10, 100],
    'l1_ratio': [0.1, 0.5, 0.7, 0.9]
}

# Generous iteration budget to help coordinate descent converge for small alpha
enet = ElasticNet(max_iter=10000)
# 5-fold cross-validated grid search over alpha and l1_ratio, scored with R2
grid_search_enet = GridSearchCV(enet, param_grid_enet, cv=5, scoring='r2', n_jobs=-1)
grid_search_enet.fit(X_train, y_train)

print("Best parameters (ElasticNet):", grid_search_enet.best_params_)
# best_score_ is the mean cross-validated R2 of the best candidate, not a training score
print("Best CV R2 score (ElasticNet):", grid_search_enet.best_score_)
print("Test R2 score (ElasticNet):", grid_search_enet.score(X_test, y_test))

Best parameters (ElasticNet): {'alpha': 0.001, 'l1_ratio': 0.9}
Best CV R2 score (ElasticNet): 0.8685014225855108
Test R2 score (ElasticNet): 0.8950918181512542
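For reference, the objective that scikit-learn's ElasticNet minimizes can be written as follows, where n is the number of samples and w the coefficient vector. The symbols α (for alpha) and ρ (for l1_ratio) are notation chosen here for illustration.

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
$$

With ρ = 1 the last term vanishes and the objective reduces to the Lasso objective. The selected l1_ratio of 0.9 is therefore close to a pure Lasso, which is consistent with the two models reaching similar scores.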


Both models reach an R2 score above 0.88 on the test set (Lasso: 0.898, ElasticNet: 0.895), which meets the objective.