# 5 Classification on a given dataset

## ("C" exercise)

Same instructions as in 4, except that this time a classification has to be performed and the dataset is stored in data/classification/.

Your objective should be to obtain a mean accuracy greater than 0.85 on the test set (same remark about the test set).

**Hint: a solution, with the correct hyperparameters, exists in scikit-learn among the following classes:**
- linear_model.LogisticRegression
- svm.SVC
- neighbors.KNeighborsClassifier
- neural_network.MLPClassifier
- ensemble.AdaBoostClassifier

---

Among the methods listed above, we take the first three and compare their scores.

- **Hyperparameter tuning:** optuna
- **Classification methods:** logistic regression, SVC, k-NN
- **Score:** accuracy
- **Hyperparameters:**
  - Logistic regression: C, solver
  - k-NN: n_neighbors
  - SVC (polynomial kernel): C, degree

In [1]:
import numpy as np
root = 'data/classification/'

# Load .npy files
X_train = np.load(root + 'X_train.npy')
X_test = np.load(root + 'X_test.npy')
y_train = np.load(root + 'y_train.npy')
y_test = np.load(root + 'y_test.npy')

print(X_train.shape, y_train.shape)


(2000, 30) (2000,)
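
Because accuracy is the chosen score, it can help to look at the class proportions first: the share of the majority class is the accuracy a constant predictor would reach. The following is a minimal sketch, assuming a 1-D label array. The names `class_balance` and `y_demo` are hypothetical and not part of the original notebook, and the labels are a small stand-in rather than the real `y_train`.

```python
import numpy as np

def class_balance(y):
    """Return {label: fraction of samples} for a 1-D label array."""
    labels, counts = np.unique(y, return_counts=True)
    # Convert NumPy scalars to plain Python numbers for readable printing.
    return {label.item(): int(count) / y.size for label, count in zip(labels, counts)}

# Hypothetical stand-in labels; in the notebook, pass y_train instead.
y_demo = np.array([0, 1, 1, 0, 1, 1, 1, 0])

balance = class_balance(y_demo)
print(balance)  # {0: 0.375, 1: 0.625}

# Accuracy of always predicting the most frequent class.
majority_baseline = max(balance.values())
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

In the notebook, `class_balance(y_train)` would give the baseline against which the 0.85 target can be read.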


In [2]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import optuna

def objective(trial):
    C = trial.suggest_float("C", 1e-4, 1e4, log=True)
    solver = trial.suggest_categorical("solver", ['liblinear', 'saga'])
    
    classifier = LogisticRegression(C=C, solver=solver, random_state=42)
    
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)

    return accuracy_score(y_test, y_pred)
    

study = optuna.create_study(
    study_name="logistic",
    direction="maximize",
)
study.optimize(func=objective, n_trials=1000)

print("\n==== Logistic Regression ====")
print(f"Best parameters : {study.best_params}")
print(f"Best value: {study.best_value}")

[I 2025-06-16 12:29:13,118] A new study created in memory with name: logistic
[I 2025-06-16 12:29:13,124] Trial 0 finished with value: 0.7435 and parameters: {'C': 6238.118952455884, 'solver': 'liblinear'}. Best is trial 0 with value: 0.7435.
[I 2025-06-16 12:29:13,132] Trial 1 finished with value: 0.7435 and parameters: {'C': 0.38640292483774147, 'solver': 'saga'}. Best is trial 0 with value: 0.7435.
[I 2025-06-16 12:29:13,146] Trial 2 finished with value: 0.741 and parameters: {'C': 0.008312465656164678, 'solver': 'saga'}. Best is trial 0 with value: 0.7435.
[I 2025-06-16 12:29:13,155] Trial 3 finished with value: 0.7435 and parameters: {'C': 1.9323849221320184, 'solver': 'saga'}. Best is trial 0 with value: 0.7435.
[I 2025-06-16 12:29:13,159] Trial 4 finished with value: 0.7435 and parameters: {'C': 409.07227280661283, 'solver': 'liblinear'}. Best is trial 0 with value: 0.7435.
[I 2025-06-16 12:29:13,166] Trial 5 finished with value:


==== Logistic Regression ====
Best parameters : {'C': 0.002636727079864608, 'solver': 'saga'}
Best value: 0.7465
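
In the trials shown, the accuracy stays close to 0.74 over a wide range of `C` and for both solvers, which suggests that the regularization strength is not the limiting factor. One thing that may be worth checking is feature scaling, since `saga` is only guaranteed to converge quickly on features with roughly the same scale. This was not tested in the original run. The sketch below wraps the classifier in a pipeline with `StandardScaler`; the data comes from `make_classification` as a hypothetical stand-in, and `max_iter=1000` is an assumed value.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data; in the notebook use X_train, y_train, X_test, y_test.
X, y = make_classification(n_samples=400, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# The scaler is fitted on the training split only, then applied to the test split.
scaled_logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=1.0, solver="saga", max_iter=1000, random_state=42),
)
scaled_logreg.fit(X_tr, y_tr)

scaled_accuracy = scaled_logreg.score(X_te, y_te)
print(f"Accuracy with standardized features (stand-in data): {scaled_accuracy:.3f}")
```

If scaling does not move the score on the real data either, a linear decision boundary is probably not enough for this dataset.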


In [3]:
# SVC
from sklearn.svm import SVC

def objective(trial) -> float:
    # suggest_loguniform is deprecated; suggest_float(..., log=True) is the equivalent call.
    C = trial.suggest_float("C", 1e-3, 1e3, log=True)
    degree = trial.suggest_int("degree", 2, 6)
    kernel = "poly"

    svc = SVC(C=C, degree=degree, kernel=kernel)
    svc.fit(X_train, y_train)
    y_pred = svc.predict(X_test)
    return accuracy_score(y_test, y_pred)
    

study = optuna.create_study(
    study_name="svc",
    direction="maximize",
)
study.optimize(func=objective, n_trials=1000)

print("\n==== SVC classifier ====")
print(f"Best parameters : {study.best_params}")
print(f"Best value: {study.best_value}")

[I 2025-06-16 12:29:24,551] A new study created in memory with name: svc
[I 2025-06-16 12:29:24,886] Trial 0 finished with value: 0.7015 and parameters: {'C': 37.036652768206814, 'degree': 2}. Best is trial 0 with value: 0.7015.
[I 2025-06-16 12:29:24,975] Trial 1 finished with value: 0.848 and parameters: {'C': 1.083882940923317, 'degree': 3}. Best is trial 1 with value: 0.848.
[I 2025-06-16 12:29:25,076] Trial 2 finished with value: 0.642 and parameters: {'C': 0.15375951518504968, 'degree': 6}. Best is trial 1 with value: 0.848.
[I 2025-06-16 12:29:25,175] Trial 3 finished with value: 0.687 and parameters: {'C': 5.117324265770274, 'degree': 6}. Best is trial 1 with value: 0.848.
[I 2025-06-16 12:29:25,272] Trial 4 finished with value: 0.6035 and paramet


==== SVC classifier ====
Best parameters : {'C': 4.612243039974901, 'degree': 3}
Best value: 0.907
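
Once the study has finished, the selected hyperparameters can be used to fit a final model. The sketch below hard-codes the best parameters printed above and refits a polynomial-kernel `SVC`. The data comes from `make_classification` as a hypothetical stand-in, so the accuracy it prints says nothing about the real dataset; `final_svc` and `demo_accuracy` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Best parameters as printed by the study above.
best_params = {"C": 4.612243039974901, "degree": 3}

# Hypothetical stand-in data; in the notebook use X_train, y_train, X_test, y_test.
X, y = make_classification(n_samples=400, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Same kernel as in the objective function; other SVC settings keep their defaults.
final_svc = SVC(kernel="poly", **best_params)
final_svc.fit(X_tr, y_tr)

demo_accuracy = final_svc.score(X_te, y_te)
print(f"Accuracy on the stand-in data: {demo_accuracy:.3f}")
```

In the notebook itself, `study.best_params` can be passed in place of the hard-coded dictionary, as long as it is read before `study` is reassigned by the next cell.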


In [4]:
# k-NN
from sklearn.neighbors import KNeighborsClassifier

def objective(trial) -> float:
    n_neighbors = trial.suggest_int("n_neighbors", 1, 30)

    kNN = KNeighborsClassifier(n_neighbors=n_neighbors)
    kNN.fit(X_train, y_train)
    y_pred = kNN.predict(X_test)
    return accuracy_score(y_test, y_pred)
    

study = optuna.create_study(
    study_name="k-nn",
    direction="maximize",
)
study.optimize(func=objective, n_trials=1000)

print("\n==== k-NN classifier ====")
print(f"Best parameters : {study.best_params}")
print(f"Best value: {study.best_value}")

[I 2025-06-16 12:31:49,018] A new study created in memory with name: k-nn
[I 2025-06-16 12:31:49,167] Trial 0 finished with value: 0.7905 and parameters: {'n_neighbors': 10}. Best is trial 0 with value: 0.7905.
[I 2025-06-16 12:31:49,178] Trial 1 finished with value: 0.762 and parameters: {'n_neighbors': 3}. Best is trial 0 with value: 0.7905.
[I 2025-06-16 12:31:49,185] Trial 2 finished with value: 0.724 and parameters: {'n_neighbors': 1}. Best is trial 0 with value: 0.7905.
[I 2025-06-16 12:31:49,197] Trial 3 finished with value: 0.771 and parameters: {'n_neighbors': 4}. Best is trial 0 with value: 0.7905.
[I 2025-06-16 12:31:49,216] Trial 4 finished with value: 0.785 and parameters: {'n_neighbors': 12}. Best is trial 0 with value: 0.7905.
[I 2025-06-16 12:31:49,235] Trial 5 finished with value: 0.7905 and parameters: {'n_neighbors': 26}. Best is trial 0 with value: 0.7905.
[I 2025-06-16 12:31:49,253] Trial 6 finished with value: 0.7905 and parameters: {'n_neighbors': 26}. Best is tr


==== k-NN classifier ====
Best parameters : {'n_neighbors': 15}
Best value: 0.797
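
To close the comparison, the three best values reported above can be set against the 0.85 target. The short sketch below only restates the numbers printed by the studies; the dictionary keys and variable names are illustrative.

```python
TARGET = 0.85

# Best test-set accuracies as printed by the three studies above.
reported = {
    "Logistic Regression": 0.7465,
    "SVC (poly)": 0.907,
    "k-NN": 0.797,
}

# Print the models from best to worst with their status against the target.
for name, acc in sorted(reported.items(), key=lambda kv: kv[1], reverse=True):
    status = "meets" if acc > TARGET else "misses"
    print(f"{name:<20} {acc:.4f}  {status} the {TARGET} target")

passing = [name for name, acc in reported.items() if acc > TARGET]
```

Of the three methods compared, only the polynomial-kernel SVC exceeds the target.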
