---
title: Implementing binary classifiers
---

## Imports

In [40]:
from joblib import parallel_backend
parallel_backend("loky", n_jobs=-1)

from get_dataset import dataset_loaders
dataset = list(dataset_loaders.keys())[0]

In [41]:
from get_dataset import load_dataset

X, y = load_dataset(dataset)

models = dict()

def store_results(name, grid):
    """Keep the best parameters and the refitted estimator of a fitted grid search."""
    models[name] = {
        "best_params": grid.best_params_,
        "best_estimator": grid.best_estimator_,
    }

## Data presentation

The **{eval}`dataset`** dataset contains `n` = {eval}`X.shape[0]` samples and `p` = {eval}`X.shape[1]` features.

The target variable is binary and {eval}`f"{y.mean() * 100:.2f}"`% of the samples are positive.

In [42]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
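
When the positive rate is far from 50%, a purely random split can shift it between the training and testing sets. The sketch below illustrates the `stratify` argument of `train_test_split`, which keeps the class proportions nearly identical in both subsets. It runs on synthetic data from `make_classification`, not on the notebook's dataset, and the names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr` and `y_te` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for (X, y): roughly 10% positives (assumption)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)

# stratify=y_demo keeps the class proportions nearly identical in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(
    f"positive rate: full={y_demo.mean():.3f}, "
    f"train={y_tr.mean():.3f}, test={y_te.mean():.3f}"
)
```

Whether stratification is worth adding here depends on how imbalanced the loaded dataset actually is.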

## Training the classifiers

### Non-parametric classifiers

#### K-Nearest Neighbors

In [43]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

model = KNeighborsClassifier(weights='uniform', algorithm='auto')

param_grid = {
    'n_neighbors': [3, 5, 7, 9],
}

grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid, 
    cv=5, 
    scoring='accuracy',
    refit=True
    )

grid_search.fit(X_train, y_train)
store_results('KNN', grid_search)
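
With four candidate values and `cv=5`, the search above fits 20 models and then refits the best candidate on the full training set. A minimal sketch of how the per-candidate scores could be inspected through `cv_results_`, assuming synthetic data (`X_demo`, `y_demo`) and a separate `demo_grid` object rather than the notebook's own variables:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training set (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

demo_grid = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7, 9]},
    cv=5,
    scoring="accuracy",
    refit=True,
)
demo_grid.fit(X_demo, y_demo)

# One entry per candidate: mean and spread of the 5 validation accuracies
results = demo_grid.cv_results_
for params, mean, std in zip(
    results["params"], results["mean_test_score"], results["std_test_score"]
):
    print(f"{params}: {mean:.3f} +/- {std:.3f}")

print("selected:", demo_grid.best_params_)
```

Looking at the spread alongside the mean helps judge whether the selected `n_neighbors` is clearly better or only marginally so.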

#### Distance-Weighted KNN

In [44]:
model = KNeighborsClassifier(weights='distance', algorithm='auto')

param_grid = {
    'n_neighbors': [3, 5, 7, 9],
}

grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid, 
    cv=5, 
    scoring='accuracy',
    refit=True
    )

grid_search.fit(X_train, y_train)
store_results('KNN Distance Weighted', grid_search)
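
The only difference from the previous model is `weights='distance'`: each neighbor's vote is weighted by the inverse of its distance to the query instead of counting equally. The toy example below is an illustration on made-up one-dimensional points, not on the notebook's data. It shows a query where the two weighting schemes disagree.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# One class-0 point close to the query, two class-1 points farther away
X_toy = np.array([[0.0], [2.0], [2.1]])
y_toy = np.array([0, 1, 1])
query = np.array([[0.5]])

uniform = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_toy, y_toy)
weighted = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_toy, y_toy)

# Uniform vote: 2 votes for class 1 against 1 vote for class 0
pred_uniform = uniform.predict(query)[0]
# Distance vote: 1/0.5 = 2.0 for class 0 against 1/1.5 + 1/1.6 (about 1.29) for class 1
pred_weighted = weighted.predict(query)[0]

print(pred_uniform, pred_weighted)  # prints "1 0"
```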

#### Condensed Nearest Neighbor

scikit-learn's `Pipeline` only accepts transformers as intermediate steps, and a transformer cannot drop samples or modify `y`. imbalanced-learn's `Pipeline` accepts samplers such as `CondensedNearestNeighbour` directly and applies them during `fit` only, so validation folds and test data are scored untouched.

In [62]:
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import CondensedNearestNeighbour

In [27]:
# The sampler condenses the training folds; the classifier is fitted on the result
model = Pipeline([
    ('cnn', CondensedNearestNeighbour(sampling_strategy='auto', random_state=42)),
    ('knn', KNeighborsClassifier(weights='uniform', algorithm='auto'))
])

param_grid = {
    'cnn__n_neighbors': [3, 5, 7, 9],
    'knn__n_neighbors': [3, 5, 7, 9],
}

grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    refit=True
    )

grid_search.fit(X_train, y_train)
store_results('KNN Condensed Nearest Neighbor', grid_search)
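
The idea behind condensed nearest neighbour can be sketched in a few lines: keep only a subset of the training samples such that a 1-NN classifier built on that subset still classifies every training sample correctly. The code below is a minimal NumPy sketch of Hart's classic rule on synthetic data. The function `condense` and the variables `X_demo`, `y_demo` and `kept` are illustrative assumptions, and imbalanced-learn's `CondensedNearestNeighbour` applies the same idea as an under-sampler with details that may differ from this sketch.

```python
import numpy as np

def condense(X, y, random_state=0):
    """Return indices of a subset whose 1-NN rule classifies all of (X, y) correctly."""
    rng = np.random.default_rng(random_state)
    order = rng.permutation(len(X))
    store = [order[0]]  # start the store with a single random sample
    changed = True
    while changed:
        changed = False
        for i in order:
            if i in store:
                continue
            # Label of the stored sample nearest to X[i]
            dists = np.linalg.norm(X[store] - X[i], axis=1)
            nearest = store[int(np.argmin(dists))]
            if y[nearest] != y[i]:
                store.append(i)  # misclassified by the store: keep this sample
                changed = True
    return np.array(sorted(store))

# Two well-separated Gaussian blobs (synthetic data, assumption)
rng = np.random.default_rng(42)
X_demo = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])
y_demo = np.array([0] * 100 + [1] * 100)

kept = condense(X_demo, y_demo)
print(f"kept {len(kept)} of {len(X_demo)} samples")
```

On well-separated classes the store stays very small, which is why the method is mainly useful for shrinking the training set of a nearest-neighbour classifier.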


#### Locally Adaptive KNN

In [4]:
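import numpy as np

class LocallyAdaptiveKNN(KNeighborsClassifier):
    """KNN that re-votes inside each query's neighborhood with a smaller k."""

    def predict(self, X):
        # Indices of the n_neighbors training points closest to each query
        _, indices = self.kneighbors(X)
        X = np.asarray(X)
        predictions = []
        for i, neighbors in enumerate(indices):
            local_k = max(1, len(neighbors) // 2)  # Example of adapting k locally
            local_knn = KNeighborsClassifier(n_neighbors=local_k)
            # self._y stores encoded labels; map them back through classes_
            local_knn.fit(self._fit_X[neighbors], self.classes_[self._y[neighbors]])
            predictions.append(local_knn.predict(X[i:i + 1])[0])
        return np.array(predictions)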

model = LocallyAdaptiveKNN(weights='uniform', algorithm='auto')

param_grid = {
    'n_neighbors': [3, 5, 7, 9],
}

grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid, 
    cv=5, 
    scoring='accuracy',
    refit=True
    )

grid_search.fit(X_train, y_train)
store_results('KNN Locally Adaptive', grid_search)


### Non-linear binary classifiers

Non-linear classification algorithms implemented:
- Decision trees
- Random forests
- AdaBoost

Trees/Forests:
- Random Forest with cost-sensitive learning
- Extremely Randomized Trees
- Gradient Boosted Decision Trees
- Weighted Random Forest for imbalanced classes

AdaBoost:
- AdaBoost with different base classifiers
- Cost-sensitive AdaBoost
- AdaBoost.M1 with early stopping
- RUSBoost (combines boosting and under-sampling)
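
Several of the variants listed above map onto existing scikit-learn estimators. The sketch below is one possible way to compare them, assuming synthetic imbalanced data and illustrative names (`X_demo`, `y_demo`, `candidates`, `scores`). Cost sensitivity is approximated with `class_weight="balanced"`, and the AdaBoost base learner is a decision stump passed as the first positional argument. RUSBoost is not covered because it is not part of scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the training set (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.85, 0.15], random_state=0
)

candidates = {
    # class_weight="balanced" weights errors inversely to class frequency
    "Cost-sensitive Random Forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
    "Extremely Randomized Trees": ExtraTreesClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
    "Gradient Boosted Trees": GradientBoostingClassifier(random_state=0),
    # Base learner passed positionally: depth-1 trees (decision stumps)
    "AdaBoost (stumps)": AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0
    ),
}

# Mean F1 on the positive class over 5 cross-validation folds
scores = {
    name: cross_val_score(clf, X_demo, y_demo, cv=5, scoring="f1").mean()
    for name, clf in candidates.items()
}
for name, score in scores.items():
    print(f"{name}: mean F1 = {score:.3f}")
```

F1 is used instead of accuracy here because accuracy can look good on imbalanced data even when the minority class is mostly missed.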

### Parametric binary classifiers

Parametric classification algorithms implemented:
- Linear SVM (or linear kernel)
- Logistic regression

SVM:
- One-class SVM to handle class imbalance
- Combination with under-/over-sampling methods (SMOTE, RandomUnderSampling)
- Cost-sensitive SVM: a different penalty C for each class
- Ensemble of SVMs with bagging
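
As an illustration of the cost-sensitive and bagging variants, the sketch below relies on the `class_weight` argument of `LinearSVC`, which multiplies `C` per class, and on `BaggingClassifier`. The data is synthetic, and the weight of 10 on the positive class and all variable names are arbitrary assumptions, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic, imbalanced stand-in for the dataset (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# Same penalty C for both classes
plain_svm = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=5000))
# class_weight scales C per class: errors on class 1 cost 10 times more here
costly_svm = make_pipeline(
    StandardScaler(), LinearSVC(C=1.0, class_weight={0: 1, 1: 10}, max_iter=5000)
)
# Bagging: 10 cost-sensitive SVMs, each trained on a bootstrap sample
bagged_svm = BaggingClassifier(costly_svm, n_estimators=10, random_state=0)

recalls = {}
for name, clf in [
    ("plain", plain_svm),
    ("cost-sensitive", costly_svm),
    ("bagged", bagged_svm),
]:
    clf.fit(X_tr, y_tr)
    recalls[name] = recall_score(y_te, clf.predict(X_te))
    print(f"{name}: recall on the positive class = {recalls[name]:.3f}")
```

The class weights would normally be part of the grid search rather than fixed by hand.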

Logistic regression:
- Logistic regression with elastic-net penalty (L1/L2 combination)
- Cost-sensitive logistic regression with class weighting
- Polynomial logistic regression
- Logistic regression with feature selection
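
A minimal sketch of the elastic-net and cost-sensitive variants combined, assuming synthetic data in which only 5 of 20 features are informative. The values `l1_ratio=0.5` and `C=0.05` and the names `X_demo`, `y_demo`, `model_en`, `coefs` and `n_zero` are illustrative assumptions. The L1 part of the penalty can set some coefficients exactly to zero, which acts as a form of feature selection.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 20 features, only 5 of them informative (synthetic data, assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_redundant=0,
    weights=[0.85, 0.15], random_state=0,
)

model_en = make_pipeline(
    StandardScaler(),
    LogisticRegression(
        penalty="elasticnet", solver="saga", l1_ratio=0.5, C=0.05,
        class_weight="balanced", max_iter=5000, random_state=0,
    ),
)
model_en.fit(X_demo, y_demo)

# Count the coefficients that the L1 part of the penalty has zeroed out
coefs = model_en.named_steps["logisticregression"].coef_.ravel()
n_zero = int(np.sum(coefs == 0))
print(f"{n_zero} of {coefs.size} coefficients are exactly zero")
```

The `saga` solver is used because it supports the elastic-net penalty, and scaling the features first helps it converge.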

### Evaluating the binary classifiers

In [4]:
from utils import roc_plot, precision_recall_plot, table_report

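# Use the refitted best estimator from the grid search, not the unfitted template
weighted_knn = models['KNN Distance Weighted']['best_estimator']
y_pred_weighted_knn = weighted_knn.predict(X_test)
y_pred_proba_weighted_knn = weighted_knn.predict_proba(X_test)[:, 1]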

[](#table_report_LR1) shows the classification results of the logistic regression model. We observe that:

- $83.04\%$ of *spams* are correctly identified
- $99.59\%$ of *hams* are correctly identified
- $96.88\%$ of the observations classified as *spam* are *spams*
- $97.43\%$ of the observations classified as *ham* are *hams*
- The weighted average F1 score is $97.98\%$
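
The five quantities above correspond to per-class recall, per-class precision, and the weighted F1 score. The sketch below computes them with `sklearn.metrics` on a small made-up label vector. It does not reproduce the notebook's `table_report` helper or its figures, and the encoding 1 = spam, 0 = ham is an assumption.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels (assumption): 1 = spam, 0 = ham
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 1])

# Recall: share of each true class that is correctly identified
recall_spam = recall_score(y_true, y_hat, pos_label=1)  # 3 of 4 spams
recall_ham = recall_score(y_true, y_hat, pos_label=0)   # 5 of 6 hams

# Precision: share of each predicted class that is correct
precision_spam = precision_score(y_true, y_hat, pos_label=1)  # 3 of 4 predicted spams
precision_ham = precision_score(y_true, y_hat, pos_label=0)   # 5 of 6 predicted hams

# Weighted F1: per-class F1 averaged with the class frequencies as weights
# Here: 0.4 * 0.75 + 0.6 * (5/6) = 0.8
f1_weighted = f1_score(y_true, y_hat, average="weighted")

print(f"recall spam={recall_spam:.2%}, recall ham={recall_ham:.2%}")
print(f"precision spam={precision_spam:.2%}, precision ham={precision_ham:.2%}")
print(f"weighted F1={f1_weighted:.2%}")
```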