# 5 Classification on a given dataset

## ("C" exercise)

Same instructions as in 4, except that this time a classification has to be performed and the dataset is stored in data/classification/.

Your objective is to obtain a mean accuracy above 0.85 on the test set (the same remark about the test set applies).

**Hint: a solution, with the correct hyperparameters, exists among the following scikit-learn classes:**
- linear_model.LogisticRegression
- svm.SVC
- neighbors.KNeighborsClassifier
- neural_network.MLPClassifier
- ensemble.AdaBoostClassifier
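
One way to get a first impression of the five listed classes is to cross-validate each of them with near-default settings before tuning anything. The sketch below is only an illustration under stated assumptions: it runs on synthetic data from `make_classification` rather than on the files in data/classification/, and the names `X_demo`, `y_demo`, `candidates` and `baseline_scores` are hypothetical. The values of `max_iter` and `random_state` are arbitrary choices, not tuned hyperparameters.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Assumption: synthetic stand-in data, NOT the exercise dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# The five classes named in the hint, with near-default settings.
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "MLPClassifier": MLPClassifier(max_iter=500, random_state=0),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate.
baseline_scores = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="accuracy")
    baseline_scores[name] = scores.mean()
    print(f"{name}: {baseline_scores[name]:.3f}")
```

Such a baseline only suggests which models deserve a full hyperparameter search; the ranking on synthetic data says nothing about the real dataset.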

---


In [22]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.ensemble import AdaBoostClassifier

import numpy as np

root = 'data/classification/'

# Load .npy files
X_train = np.load(root + 'X_train.npy')
X_test = np.load(root + 'X_test.npy')
y_train = np.load(root + 'y_train.npy')
y_test = np.load(root + 'y_test.npy')

print(X_train.shape, y_train.shape)


(2000, 30) (2000,)


The dataset contains fewer than 100,000 samples and does not consist of text data.
Following the scikit-learn recommendations for this kind of setup, I chose to evaluate a K-Nearest Neighbors model.
A hyperparameter search with 5-fold cross-validation is run to improve the model's performance.
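
To make the 5-fold scheme concrete, the sketch below splits 2000 sample indices (the size of `X_train` printed above) into five validation folds. It is a simplified illustration: the helper `k_fold_indices` is hypothetical and produces plain contiguous folds, whereas `GridSearchCV` with an integer `cv` and a classifier uses stratified folds that preserve class proportions.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) lists for a plain, unshuffled K-fold split."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# 2000 matches the X_train shape printed above; 5 matches cv=5.
folds = list(k_fold_indices(2000, 5))
fold_sizes = [(len(train_idx), len(val_idx)) for train_idx, val_idx in folds]
print(fold_sizes)  # five (1600, 400) pairs
```

Each candidate is therefore trained on 1600 samples and scored on the remaining 400, five times, and every sample serves as validation data exactly once.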

# K-Nearest Neighbors

In [24]:
from sklearn.neighbors import KNeighborsClassifier

param_grid_knn = {
    'n_neighbors': [3, 5, 7, 9, 11, 15, 21, 31],
    'weights': ['uniform', 'distance'],
    'p': [1, 2]  # 1: Manhattan, 2: Euclidean
}

knn = KNeighborsClassifier()

grid_knn = GridSearchCV(
    estimator=knn,
    param_grid=param_grid_knn,
    scoring='accuracy',
    cv=5,
    n_jobs=-1,
    verbose=1
)
grid_knn.fit(X_train, y_train)

print("Best parameters (KNN):", grid_knn.best_params_)
print("Best CV accuracy (KNN):", grid_knn.best_score_)

y_pred_knn = grid_knn.predict(X_test)
print("Test accuracy (KNN):", accuracy_score(y_test, y_pred_knn))

Fitting 5 folds for each of 32 candidates, totalling 160 fits
Best parameters (KNN): {'n_neighbors': 21, 'p': 2, 'weights': 'uniform'}
Best CV accuracy (KNN): 0.7835
Test accuracy (KNN): 0.7865
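
The "32 candidates, totalling 160 fits" line in the log follows directly from the grid size. The short sketch below reproduces that arithmetic with the standard library only; `count_fits` is a hypothetical helper, not a scikit-learn function.

```python
from itertools import product

# Same grid as in the KNN cell above.
param_grid_knn = {
    'n_neighbors': [3, 5, 7, 9, 11, 15, 21, 31],
    'weights': ['uniform', 'distance'],
    'p': [1, 2],
}


def count_fits(param_grid, n_folds):
    """Return (number of candidates, number of cross-validation fits)."""
    # One candidate per combination of values, one fit per candidate and fold.
    candidates = list(product(*param_grid.values()))
    return len(candidates), len(candidates) * n_folds


n_candidates, n_fits = count_fits(param_grid_knn, 5)
print(n_candidates, n_fits)  # 32 160
```

With the default `refit=True`, `GridSearchCV` additionally refits the best candidate once on the full training set; that extra fit is not counted in the log line.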


These results are not convincing (0.7865 test accuracy, below the 0.85 target), so we next try AdaBoost and an SVC.

# AdaBoost


In [25]:

param_grid_ada = {
    'n_estimators': [50, 100, 200, 500, 1000, 2000],
    'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.5, 1.0]
}

ada = AdaBoostClassifier(random_state=42)

grid_ada = GridSearchCV(
    estimator=ada,
    param_grid=param_grid_ada,
    scoring='accuracy',
    cv=5,
    n_jobs=-1,
    verbose=1
)
grid_ada.fit(X_train, y_train)

print("Best parameters (AdaBoost):", grid_ada.best_params_)
print("Best CV accuracy (AdaBoost):", grid_ada.best_score_)

y_pred_ada = grid_ada.predict(X_test)
print("Test accuracy (AdaBoost):", accuracy_score(y_test, y_pred_ada))

Fitting 5 folds for each of 36 candidates, totalling 180 fits
Best parameters (AdaBoost): {'learning_rate': 0.2, 'n_estimators': 2000}
Best CV accuracy (AdaBoost): 0.7255
Test accuracy (AdaBoost): 0.746


# SVC

In [16]:

param_grid = {
    'C': [
        0.0001, 0.001, 0.01, 0.05, 0.1,
        0.5, 1, 5, 10, 20, 50, 100, 200, 500, 1000
    ]
}

# Polynomial-kernel SVC; probability=True enables predict_proba but slows fitting
svc = SVC(kernel="poly", probability=True, random_state=4)


grid = GridSearchCV(
    estimator=svc,
    param_grid=param_grid,
    scoring='accuracy',
    cv=5,
    n_jobs=-1,
    verbose=1
)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best CV accuracy:", grid.best_score_)

y_pred_grid = grid.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred_grid))

Fitting 5 folds for each of 15 candidates, totalling 75 fits
Best parameters: {'C': 5}
Best CV accuracy: 0.773
Test accuracy: 0.903


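The best model is therefore the SVC: its test accuracy of 0.903 exceeds the 0.85 target, while KNN (0.7865) and AdaBoost (0.746) fall short.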
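
The comparison can be summarized in a few lines using the test accuracies printed by the three cells above. This is only a bookkeeping sketch; the names `TARGET`, `test_accuracy`, `best_model` and `meets_target` are hypothetical.

```python
# Target from the exercise statement.
TARGET = 0.85

# Test accuracies copied from the outputs of the three cells above.
test_accuracy = {"KNN": 0.7865, "AdaBoost": 0.746, "SVC": 0.903}

# Model with the highest reported test accuracy.
best_model = max(test_accuracy, key=test_accuracy.get)

# Which models clear the target.
meets_target = {name: acc > TARGET for name, acc in test_accuracy.items()}

print(best_model)    # SVC
print(meets_target)  # {'KNN': False, 'AdaBoost': False, 'SVC': True}
```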