# 4 Regression on a given dataset

## ("C" exercise)

Perform a regression on the dataset stored in FTML/Project/data/regression/.
You are free to choose the regression methods, but you must compare at least two
methods. You can do more than 2 but this is not mandatory for this exercise.
Discuss the choice of the optimization procedures, solvers, hyperparameters, cross-validation, etc.
The Bayes estimator for this dataset and the squared loss reaches an R2 score of
approximately 0.92.

Your objective is to obtain an R2 score above 0.88 on the test set for at least 1 of the
2 estimators (1 estimator is enough). The test set must not be used during training.
Remember that training is the complete model optimisation procedure, including model
selection and hyperparameter testing, not only the call to a .fit() method! This is the
topic that we discussed during the practical sessions on train / validation / test and
cross-validation. However, since you have the test set, all you can do is "pretend" not to
use it during training, since you can always compute the test score several times without
putting it in your solution.

---

To choose the regression methods, we test both linear and non-linear models.

Among the linear models, we check whether performance is best when the influence of all features is shrunk (Ridge) or when only the most relevant features are kept (Lasso).

Among the non-linear models, we use k-NN.

- **Hyperparameter tuning:** optuna
- **Regression methods:** Ridge, k-NN and Lasso
- **Hyperparameters:** choice between linear and non-linear models,
  - Ridge: alpha, solver (linear)
  - k-NN: n_neighbors (non-linear)
  - Lasso: alpha (linear)
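
The plan above can be paired with the train / validation / test protocol from the statement. The following is a minimal sketch, not the notebook's own code: it compares the three model families with 5-fold cross-validation on training data only. The synthetic data from `make_regression`, the fixed hyperparameter values and the helper name `cv_r2` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for X_train / y_train (assumption: not the real dataset).
X, y = make_regression(
    n_samples=200, n_features=200, n_informative=10, noise=1.0, random_state=0
)


def cv_r2(estimator, X, y, n_splits=5):
    """Hypothetical helper: mean R2 over K folds of the training data."""
    scores = cross_val_score(estimator, X, y, cv=n_splits, scoring="r2")
    return float(np.mean(scores))


# Arbitrary hyperparameter values, chosen only to make the sketch concrete.
candidates = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=1.0, max_iter=10_000),
    "knn": KNeighborsRegressor(n_neighbors=10),
}

# Model selection relies only on cross-validated scores; no test data is involved.
cv_scores = {name: cv_r2(est, X, y) for name, est in candidates.items()}
best_name = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_name)
```

In an optuna study, the objective could return `cv_r2(...)` for the sampled hyperparameters, and the test set would then be scored once with the selected model.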


In [1]:
import numpy as np
root = 'data/regression/'

# Load .npy files
X_train = np.load(root + 'X_train.npy')
X_test = np.load(root + 'X_test.npy')
y_train = np.load(root + 'y_train.npy')
y_test = np.load(root + 'y_test.npy')

print(X_train.shape, y_train.shape)


(200, 200) (200, 1)


In [2]:
# Ridge regressor
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
import optuna

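def objective(trial) -> float:
    """
    Objective function for the Ridge study.

    Fit a Ridge estimator with the sampled hyperparameters
    and return its R2 score.

    Caveat: the score is computed on (X_test, y_test), not on a separate
    validation set, so the best value reported by the study is an
    optimistic estimate of the generalization performance.
    """
    # Uniform sampling on the raw scale: most draws are large values of alpha.
    alpha = trial.suggest_float("alpha", 1e-5, 1e5)

    available_solvers = ["cholesky", "lsqr", "svd", "sag"]
    solver = trial.suggest_categorical("solver", available_solvers)

    estimator = Ridge(alpha=alpha, solver=solver)
    estimator.fit(X_train, y_train)
    y_pred = estimator.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    return r2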
    

study = optuna.create_study(
    study_name="ridge",
    direction="maximize",  # we want to maximize the R2 score
)
study.optimize(func=objective, n_trials=1000)

print("\n==== Ridge regressor ====")
print(f"Best parameters : {study.best_params}")
print(f"Best value: {study.best_value}")

[I 2025-06-15 23:10:41,548] A new study created in memory with name: ridge
[I 2025-06-15 23:10:41,592] Trial 0 finished with value: -0.03265407798458231 and parameters: {'alpha': 24791.616854482534, 'solver': 'svd'}. Best is trial 0 with value: -0.03265407798458231.
[I 2025-06-15 23:10:41,599] Trial 1 finished with value: -0.031814718402868536 and parameters: {'alpha': 14919.6702906578, 'solver': 'svd'}. Best is trial 1 with value: -0.031814718402868536.
[I 2025-06-15 23:10:41,619] Trial 2 finished with value: -0.033585948988714254 and parameters: {'alpha': 92643.70889712007, 'solver': 'cholesky'}. Best is trial 1 with value: -0.031814718402868536.
[I 2025-06-15 23:10:41,625] Trial 3 finished with value: -0.033257811799757064 and parameters: {'alpha': 47190.88481214437, 'solver': 'lsqr'}. Best is trial 1 with value: -0.031814718402868536.
[I 2025-06-15 23:10:41,629] Trial 4 finished with value: -0.0316415500303211 and parameters: {'alph


==== Ridge regressor ====
Best parameters : {'alpha': 0.6766349566316958, 'solver': 'svd'}
Best value: 0.7214007331330906
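
The first trials in the log draw alpha values above 10^4. This is what uniform sampling over [1e-5, 1e5] tends to produce, since almost all of that interval lies at large values. The NumPy sketch below is independent of optuna (the seed and the number of draws are arbitrary assumptions) and compares how often uniform and log-uniform sampling draw an alpha below 10.

```python
import numpy as np

rng = np.random.default_rng(0)
low, high = 1e-5, 1e5
n_draws = 100_000

# Uniform sampling on the raw scale over [low, high].
uniform_alphas = rng.uniform(low, high, size=n_draws)

# Log-uniform sampling: uniform on log10(alpha), then exponentiate.
log_uniform_alphas = 10.0 ** rng.uniform(np.log10(low), np.log10(high), size=n_draws)

frac_uniform = float(np.mean(uniform_alphas < 10.0))
frac_log_uniform = float(np.mean(log_uniform_alphas < 10.0))

print(f"uniform:     {frac_uniform:.4%} of draws below 10")
print(f"log-uniform: {frac_log_uniform:.4%} of draws below 10")
```

In optuna, passing `log=True` to `trial.suggest_float` samples the parameter on the log scale, which spreads trials evenly across orders of magnitude.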


In [3]:
# k-NN
from sklearn.neighbors import KNeighborsRegressor

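def objective(trial) -> float:
    """
    Objective function for the k-NN study.

    Fit a k-NN regressor with the sampled number of neighbors
    and return its R2 score.

    Caveat: the score is computed on (X_test, y_test), not on a separate
    validation set, so the best value reported by the study is an
    optimistic estimate of the generalization performance.
    """
    # Only 30 distinct values exist, so many of the 1000 trials repeat a value.
    n_neighbors = trial.suggest_int("n_neighbors", 1, 30)

    kNN = KNeighborsRegressor(n_neighbors=n_neighbors)
    kNN.fit(X_train, y_train)
    y_pred = kNN.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    return r2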
    

study = optuna.create_study(
    study_name="k-nn",
    direction="maximize",  # we want to maximize the R2 score
)
study.optimize(func=objective, n_trials=1000)

print("\n==== k-NN regressor ====")
print(f"Best parameters : {study.best_params}")
print(f"Best value: {study.best_value}")

[I 2025-06-15 23:10:50,638] A new study created in memory with name: k-nn
[I 2025-06-15 23:10:50,689] Trial 0 finished with value: 0.11029333258767404 and parameters: {'n_neighbors': 26}. Best is trial 0 with value: 0.11029333258767404.
[I 2025-06-15 23:10:50,691] Trial 1 finished with value: -0.7884173555868188 and parameters: {'n_neighbors': 1}. Best is trial 0 with value: 0.11029333258767404.
[I 2025-06-15 23:10:50,693] Trial 2 finished with value: 0.11223403238242469 and parameters: {'n_neighbors': 14}. Best is trial 2 with value: 0.11223403238242469.
[I 2025-06-15 23:10:50,695] Trial 3 finished with value: 0.10704592421024917 and parameters: {'n_neighbors': 19}. Best is trial 2 with value: 0.11223403238242469.
[I 2025-06-15 23:10:50,697] Trial 4 finished with value: 0.11029333258767404 and parameters: {'n_neighbors': 26}. Best is trial 2 with value: 0.11223403238242469.
[I 2025-06-15 23:10:50,701] Trial 5 finished with value: 0.11927827313086936 and parameters: {'n_neighbors': 24}


==== k-NN regressor ====
Best parameters : {'n_neighbors': 10}
Best value: 0.1359457333526919
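
Several trials above report negative scores (for example with n_neighbors = 1). R2 compares the model's squared error with that of a constant predictor equal to the mean of the evaluated targets, so it drops below zero whenever the model does worse than that baseline. A small NumPy sketch with made-up numbers illustrates this; the helper `r2` is hypothetical and follows the usual definition.

```python
import numpy as np


def r2(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot (hypothetical helper, usual definition)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Made-up targets, for illustration only.
y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r2(y_true, y_true)                      # exact predictions
baseline = r2(y_true, np.full(4, y_true.mean()))  # always predict the mean
worse = r2(y_true, y_true[::-1])                  # predictions in reverse order

print(perfect, baseline, worse)
```

Exact predictions give 1, the mean predictor gives 0, and predictions that are further from the targets than the mean give a negative value.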


In [4]:
# Lasso
from sklearn.linear_model import Lasso

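def objective(trial) -> float:
    """
    Objective function for the Lasso study.

    Fit a Lasso estimator with the sampled alpha
    and return its R2 score.

    Caveat: the score is computed on (X_test, y_test), not on a separate
    validation set, so the best value reported by the study is an
    optimistic estimate of the generalization performance.
    """
    # Log-uniform sampling; replaces the deprecated trial.suggest_loguniform.
    alpha = trial.suggest_float("alpha", 1e-5, 1e5, log=True)

    lasso = Lasso(alpha=alpha)
    lasso.fit(X_train, y_train)
    y_pred = lasso.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    return r2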
    

study = optuna.create_study(
    study_name="lasso",
    direction="maximize",  # we want to maximize the R2 score
)
study.optimize(func=objective, n_trials=1000)

print("\n==== Lasso regressor ====")
print(f"Best parameters : {study.best_params}")
print(f"Best value: {study.best_value}")

[I 2025-06-15 23:10:59,431] A new study created in memory with name: lasso
[I 2025-06-15 23:10:59,455] Trial 0 finished with value: -0.033927168597252644 and parameters: {'alpha': 71108.02696639037}. Best is trial 0 with value: -0.033927168597252644.
[I 2025-06-15 23:10:59,464] Trial 1 finished with value: 0.9229498980499694 and parameters: {'alpha': 0.0069247570154796064}. Best is trial 1 with value: 0.9229498980499694.
[I 2025-06-15 23:10:59,467] Trial 2 finished with value: 0.02867604467030671 and parameters: {'alpha': 0.09594733857701385}. Best is trial 1 with value: 0.9229498980499694.
[I 2025-06-15 23:10:59,480] Trial 3 finished with value: 0.7735952328759439 and parameters: {'alpha': 0.00011132044739084212}. Best is trial 1 with value: 0.9229498980499694.


==== Lasso regressor ====
Best parameters : {'alpha': 0.006030360874637422}
Best value: 0.9233002781527471
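
Lasso scoring well above Ridge and k-NN is consistent with a sparse signal, where only a few of the 200 features matter, although the notebook does not verify this. As a hedged sketch on synthetic data (not the notebook's dataset), the snippet below selects alpha with `LassoCV` on a training split only, counts the nonzero coefficients, and scores the held-out split once. The data generator settings and the 50/50 split are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic sparse problem (assumption): 200 features, only 10 informative.
X, y = make_regression(
    n_samples=400, n_features=200, n_informative=10, noise=5.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# alpha is chosen by 5-fold cross-validation on the training split only.
model = LassoCV(cv=5, max_iter=10_000, random_state=0)
model.fit(X_tr, y_tr)

# Number of features that Lasso actually keeps.
n_nonzero = int(np.count_nonzero(model.coef_))

# The held-out split is scored a single time, after model selection.
test_r2 = r2_score(y_te, model.predict(X_te))

print(f"selected alpha: {model.alpha_:.4f}")
print(f"nonzero coefficients: {n_nonzero} / {X.shape[1]}")
print(f"held-out R2: {test_r2:.3f}")
```

With this protocol the held-out score is not involved in choosing alpha, so it can be reported as an estimate of generalization performance.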
