Notebook: SVM pipeline
This notebook implements an SVM model for binary classification.
It is based on the pipeline in `main_nf.ipynb`, cleaned up and documented.  
Expected results: training, evaluation (F1), and export of predictions.


In [None]:
# Imports and utilities
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score, classification_report

# For the export
import os


imports OK


In [2]:
# Load the datasets
train_df = pd.read_csv('data/train.csv')
test_df = pd.read_csv('data/test.csv')

print(f"train: {train_df.shape}, test: {test_df.shape}")

# Drop columns that are 100% NaN
cols_all_nan_train = train_df.columns[train_df.isna().mean() == 1.0]
train_df = train_df.drop(columns=cols_all_nan_train)
cols_all_nan_test = test_df.columns[test_df.isna().mean() == 1.0]
test_df = test_df.drop(columns=cols_all_nan_test)

print('100% NaN columns dropped')


train: (225000, 325), test: (75000, 324)
100% NaN columns dropped


In [None]:
# Select the features correlated with the target
# Following the strategy of `main_nf.ipynb`: convert to numeric and keep |corr| > 0.1

target_col = train_df.columns[-1]
print('Target :', target_col)

train_num = train_df.apply(pd.to_numeric, errors='coerce')
corr_with_target = train_num.corr(numeric_only=True)[target_col].drop(labels=[target_col], errors='ignore')
selected_features = corr_with_target[abs(corr_with_target) > 0.1].index.tolist()
print(f"{len(selected_features)} features selected")

# Build X and y
X = train_df[selected_features].copy()
y = train_df[target_col]

# NOTE: dropna() returns a new DataFrame and its result is discarded here,
# so X is left unchanged and still contains any NaN values it had.
X.dropna()

# Label-encode the categorical columns
encoder_dict = {}
for col in X.select_dtypes(include=['object', 'category']).columns:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col].astype(str))
    encoder_dict[col] = le

print('Preparation complete')


Target : TARGET
23 features selected
Preparation complete
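
The correlation filter used in the cell above can be looked at in isolation. The following is a minimal sketch on made-up toy data; the helper name `select_by_correlation` and the toy column names are hypothetical and do not come from the original notebook. It also counts the missing values left in the selected columns, since the filter itself does not remove them.

```python
import numpy as np
import pandas as pd


def select_by_correlation(df, target_col, threshold=0.1):
    """Return the columns whose absolute Pearson correlation with the
    target exceeds `threshold` (hypothetical helper for illustration)."""
    # Non-numeric values are coerced to NaN, as in the notebook cell
    numeric = df.apply(pd.to_numeric, errors="coerce")
    corr = numeric.corr(numeric_only=True)[target_col].drop(labels=[target_col])
    # Comparisons with NaN are False, so all-NaN columns are never selected
    return corr[corr.abs() > threshold].index.tolist()


# Toy data, made up for illustration only
rng = np.random.default_rng(0)
signal = rng.normal(size=200)
toy = pd.DataFrame({
    "strong": signal + rng.normal(scale=0.1, size=200),
    "noise": rng.normal(size=200),
    "label_text": ["a", "b"] * 100,  # text column: all NaN after coercion
    "TARGET": (signal > 0).astype(int),
})
toy.loc[::10, "strong"] = np.nan  # inject some missing values

selected = select_by_correlation(toy, "TARGET")
print(selected)
print(toy[selected].isna().sum())  # missing values survive the selection
```

In this toy example the text column is never selected, because coercion turns it into an all-NaN column with no defined correlation. If the same holds for the real data, the label-encoding loop may have no columns to process.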


In [4]:
# Train / validation split
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(f"Split: {X_train.shape[0]} train / {X_val.shape[0]} val")

# SVM pipeline
pipe = make_pipeline(
    StandardScaler(),
    SVC(class_weight='balanced', probability=False, random_state=42)
)

# Hyperparameter grid (kept small to limit run time; gamma is ignored by the linear kernel)
param_grid = {
    'svc__C': [0.1, 1, 10],
    'svc__kernel': ['rbf', 'linear'],
    'svc__gamma': ['scale', 'auto']
}

# Grid search
grid = GridSearchCV(
    pipe,
    param_grid,
    scoring='f1',
    cv=3,
    n_jobs=-1,
    verbose=1
)

grid.fit(X_train, y_train)

print('Best params:', grid.best_params_)
best_model = grid.best_estimator_

# Evaluation
y_pred_val = best_model.predict(X_val)
f1 = f1_score(y_val, y_pred_val)
print(f"F1 validation: {f1:.4f}")
print(classification_report(y_val, y_pred_val))


Split: 157500 train / 67500 val
Fitting 3 folds for each of 12 candidates, totalling 36 fits


ValueError: 
All the 36 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
36 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\jules\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\model_selection\_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\jules\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\jules\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\pipeline.py", line 473, in fit
    self._final_estimator.fit(Xt, y, **last_step_params["fit"])
  File "c:\Users\jules\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\jules\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\svm\_base.py", line 190, in fit
    X, y = self._validate_data(
           ^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\jules\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\base.py", line 650, in _validate_data
    X, y = check_X_y(X, y, **check_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\jules\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\utils\validation.py", line 1301, in check_X_y
    X = check_array(
        ^^^^^^^^^^^^
  File "c:\Users\jules\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\utils\validation.py", line 1064, in check_array
    _assert_all_finite(
  File "c:\Users\jules\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\utils\validation.py", line 123, in _assert_all_finite
    _assert_all_finite_element_wise(
  File "c:\Users\jules\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\utils\validation.py", line 172, in _assert_all_finite_element_wise
    raise ValueError(msg_err)
ValueError: Input X contains NaN.
SVC does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
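
The traceback above says that `X` still contained NaN values when it reached `SVC`. One option suggested by the error message is to add an imputer to the pipeline. The following is a minimal sketch on made-up toy data, assuming median imputation is acceptable for these features; the names `X_toy`, `y_toy` and `pipe_imputed` are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data with missing values, made up for illustration only
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)
X_toy[rng.random(X_toy.shape) < 0.1] = np.nan  # roughly 10% missing cells

# Assumption: replacing NaN with the column median is acceptable here
pipe_imputed = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    SVC(class_weight="balanced", random_state=42),
)

pipe_imputed.fit(X_toy, y_toy)
pred_toy = pipe_imputed.predict(X_toy)
print(pred_toy[:10])
```

`make_pipeline` names each step after its lowercased class name, so the final step is still called `svc` and grid keys such as `svc__C` would not need to change. Because the imputer sits inside the pipeline, it is fitted on each training fold only and is also applied at prediction time, which matters if the test set contains NaN as well.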


In [None]:
# Prepare the test set and make the final predictions
X_test = test_df.reindex(columns=selected_features, fill_value=np.nan).copy()
# Encode the categorical columns with the same LabelEncoders
for col, le in encoder_dict.items():
    if col in X_test.columns:
        X_test[col] = X_test[col].astype(str)
        X_test[col] = X_test[col].map(lambda x: x if x in le.classes_ else None)
        X_test[col] = le.transform(X_test[col].fillna(le.classes_[0]))

# Prediction
y_test_pred = best_model.predict(X_test)

# ID column
id_col = 'ID' if 'ID' in test_df.columns else test_df.columns[0]

pred_df = pd.DataFrame({
    id_col: test_df[id_col],
    'pred': y_test_pred
})

pred_df.to_csv('predictions_svm.csv', index=False)
print('Exported: predictions_svm.csv')
print(pred_df.head())
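
The cell above maps categories that were not seen during training to `le.classes_[0]`, which merges them with a real category. A possible alternative, shown here only as a sketch on made-up toy data, is scikit-learn's `OrdinalEncoder` with a dedicated code for unknown values. This swaps `LabelEncoder` for a different encoder and is not what the notebook does; the frames `train_cat` and `test_cat` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical data, made up for illustration only
train_cat = pd.DataFrame({"color": ["red", "blue", "red", "green"]})
test_cat = pd.DataFrame({"color": ["blue", "purple"]})  # "purple" is unseen

# Unseen categories get the code -1 instead of being merged with a real class
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train_cat)

encoded_test = encoder.transform(test_cat)
print(encoded_test.ravel())
```

The encoder is fitted once on the training frame and reused on the test frame, so no per-column dictionary of encoders is needed.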


Documentation / How to run
This notebook runs the following steps:
1) Load train/test from `data/`
2) Drop columns that are 100% NaN
3) Select the features with |corr(target)| > 0.1
4) Encode categorical columns with LabelEncoder
5) Train an SVM via GridSearchCV (scoring=f1)
6) Evaluate on a validation split
7) Predict on the test set and export to CSV

Notes:
- Extend `param_grid` to search over more hyperparameter values.
- Features are standardized with `StandardScaler` before the SVM, because SVMs are sensitive to feature scale.
- To obtain probabilities, replace `probability=False` with `probability=True` in SVC.
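
When adjusting `param_grid`, keep in mind that `gamma` has no effect on the linear kernel, so a single dictionary that crosses both kernels with both `gamma` values fits some equivalent candidates twice. `GridSearchCV` also accepts a list of dictionaries. The following is a minimal sketch; the name `param_grid_split` is hypothetical, and `ParameterGrid` is used only to count the candidates.

```python
from sklearn.model_selection import ParameterGrid

# One sub-grid per kernel, so gamma is only searched where it is used
param_grid_split = [
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10], "svc__gamma": ["scale", "auto"]},
    {"svc__kernel": ["linear"], "svc__C": [0.1, 1, 10]},
]

n_candidates = len(ParameterGrid(param_grid_split))
print(n_candidates)
```

This gives 6 RBF candidates plus 3 linear ones, compared with the 12 candidates reported in the grid-search log above.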

print('Notebook ready: run the cells in order')
