# Fraud detection - BNP Paribas Personal Finance

A wholly owned subsidiary of the BNP Paribas group, BNP Paribas Personal Finance is the leading provider of personal financing in France and in Europe through its consumer credit activities.

Fraud is a major and growing concern for financial institutions worldwide. Criminals use a wide variety of methods to attack organisations such as this one, regardless of the systems, channels, processes or products involved.
Developing fraud detection methods is strategic and essential for BNP Paribas Personal Finance. Fraudsters are consistently creative and ingenious at making their behaviour look normal and therefore hard to identify. An additional constraint is the low occurrence of fraud in the population.

The goal of this challenge is to find the best method to transform and aggregate the customer basket data from one of its partners in order to detect fraud cases.
Using this basket data, fraudsters can be detected and then declined in the future.

### Importing libraries



In [1]:
# pip install missingno
# pip install scikit-optimize

Note: you may need to restart the kernel to use updated packages.


In [1]:
import time

import matplotlib.pyplot as plt
import missingno as msno
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.impute import KNNImputer, MissingIndicator, SimpleImputer
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    PrecisionRecallDisplay,
    accuracy_score,
    average_precision_score,
    classification_report,
    confusion_matrix,
    precision_recall_curve,
)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


### Exploratory data analysis

Let us load X_train, X_test, y_train and y_test.

In [2]:
# Load the data
X_train = pd.read_csv('C:/Users/Idrissa_TRAORE/PycharmProjects/pythonProject/X_train_G3tdtEn.csv')
X_test = pd.read_csv('C:/Users/Idrissa_TRAORE/PycharmProjects/pythonProject/X_test_8skS2ey.csv')
y_train = pd.read_csv('C:/Users/Idrissa_TRAORE/PycharmProjects/pythonProject/Y_train_2_XPXJDyy.csv')
y_test = pd.read_csv('C:/Users/Idrissa_TRAORE/PycharmProjects/pythonProject/Y_test_random_2.csv')

In [None]:
X_train.head()

Unnamed: 0,ID,item1,item2,item3,item4,item5,item6,item7,item8,item9,...,Nbr_of_prod_purchas16,Nbr_of_prod_purchas17,Nbr_of_prod_purchas18,Nbr_of_prod_purchas19,Nbr_of_prod_purchas20,Nbr_of_prod_purchas21,Nbr_of_prod_purchas22,Nbr_of_prod_purchas23,Nbr_of_prod_purchas24,Nb_of_items
0,85517,COMPUTERS,,,,,,,,,...,,,,,,,,,,1.0
1,51113,COMPUTER PERIPHERALS ACCESSORIES,,,,,,,,,...,,,,,,,,,,1.0
2,83008,TELEVISIONS HOME CINEMA,,,,,,,,,...,,,,,,,,,,1.0
3,78712,COMPUTERS,COMPUTER PERIPHERALS ACCESSORIES,,,,,,,,...,,,,,,,,,,2.0
4,77846,TELEVISIONS HOME CINEMA,,,,,,,,,...,,,,,,,,,,1.0


In [None]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92790 entries, 0 to 92789
Columns: 146 entries, ID to Nb_of_items
dtypes: float64(49), int64(1), object(96)
memory usage: 103.4+ MB


In [None]:
X_train.describe()

#### Feature selection

In [None]:
# Concatenate X_train and X_test so that the 'fraud_flag' column can be attached
df_combined = pd.concat([X_train, X_test], axis=0)

# Add the 'fraud_flag' column to df_combined
df_combined['fraud_flag'] = pd.concat([y_train['fraud_flag'], y_test['fraud_flag']], axis=0)

# Correlation between the numeric variables and the target variable
correlation_matrix_combined = df_combined.select_dtypes(include=['number']).corr()

# Absolute correlation of each numeric variable with the target (fraud_flag), in descending order
target_correlation_combined = correlation_matrix_combined['fraud_flag'].abs().sort_values(ascending=False)

# Keep the 10 top-ranked numeric variables (the list includes 'fraud_flag' itself, which is dropped later)
selected_numeric_features_combined = target_correlation_combined.head(10).index

# Categorical columns: item1..item24, make1..make24, model1..model24, goods_code1..goods_code24
categorical_columns = [f'{prefix}{i}' for prefix in ('item', 'make', 'model', 'goods_code') for i in range(1, 25)]

# Numeric and categorical columns selected for X_train and X_test
selected_features = list(selected_numeric_features_combined)
selected_features.extend(categorical_columns)

# Note: both frames keep their original row labels after the concatenation,
# so these index-based masks match rows coming from X_train and from X_test
X_train_selected = df_combined[df_combined.index.isin(X_train.index)][selected_features]
X_test_selected = df_combined[df_combined.index.isin(X_test.index)][selected_features]

# Target column 'fraud_flag' for y_train and y_test
y_train_selected = df_combined[df_combined.index.isin(X_train.index)]['fraud_flag']
y_test_selected = df_combined[df_combined.index.isin(X_test.index)]['fraud_flag']
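
The split in the cell above relies on row labels after a concatenation. The sketch below uses two small hypothetical frames (not the challenge data) to show how labels overlap when both frames carry a default index, and one possible way to keep the two blocks apart with the `keys` argument of `pd.concat`. The frame names and values are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for X_train and X_test (hypothetical data, default RangeIndex on both).
train = pd.DataFrame({"ID": [10, 11, 12], "item1": ["A", "B", "C"]})
test = pd.DataFrame({"ID": [20, 21], "item1": ["D", "E"]})

# Plain concatenation keeps the original labels, so 0 and 1 appear twice.
stacked = pd.concat([train, test], axis=0)
overlap_mask = stacked.index.isin(test.index)
n_selected_by_index = int(overlap_mask.sum())  # counts rows from BOTH frames

# Tagging each block with a key makes the split unambiguous.
tagged = pd.concat([train, test], keys=["train", "test"])
train_part = tagged.loc["train"]
test_part = tagged.loc["test"]

print(n_selected_by_index, len(train_part), len(test_part))
```

With these toy frames the index-based mask selects four rows although the test frame has only two, whereas the keyed version returns exactly the rows of each block.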


In [None]:
X_train_selected.head()

Unnamed: 0,fraud_flag,Nbr_of_prod_purchas16,Nbr_of_prod_purchas21,cash_price24,Nbr_of_prod_purchas23,Nbr_of_prod_purchas24,cash_price23,Nbr_of_prod_purchas19,cash_price22,cash_price16,...,goods_code15,goods_code16,goods_code17,goods_code18,goods_code19,goods_code20,goods_code21,goods_code22,goods_code23,goods_code24
0,0.0,,,,,,,,,,...,,,,,,,,,,
1,0.0,,,,,,,,,,...,,,,,,,,,,
2,0.0,,,,,,,,,,...,,,,,,,,,,
3,0.0,,,,,,,,,,...,,,,,,,,,,
4,0.0,,,,,,,,,,...,,,,,,,,,,


In [None]:
X_test_selected.head()

Unnamed: 0,fraud_flag,Nbr_of_prod_purchas16,Nbr_of_prod_purchas21,cash_price24,Nbr_of_prod_purchas23,Nbr_of_prod_purchas24,cash_price23,Nbr_of_prod_purchas19,cash_price22,cash_price16,...,goods_code15,goods_code16,goods_code17,goods_code18,goods_code19,goods_code20,goods_code21,goods_code22,goods_code23,goods_code24
0,0.0,,,,,,,,,,...,,,,,,,,,,
1,0.0,,,,,,,,,,...,,,,,,,,,,
2,0.0,,,,,,,,,,...,,,,,,,,,,
3,0.0,,,,,,,,,,...,,,,,,,,,,
4,0.0,,,,,,,,,,...,,,,,,,,,,


In [None]:
X_train_selected.head()

Unnamed: 0,Nbr_of_prod_purchas16,Nbr_of_prod_purchas21,cash_price24,Nbr_of_prod_purchas23,Nbr_of_prod_purchas24,cash_price23,Nbr_of_prod_purchas19,cash_price22,cash_price16,item1,...,goods_code15,goods_code16,goods_code17,goods_code18,goods_code19,goods_code20,goods_code21,goods_code22,goods_code23,goods_code24
0,,,,,,,,,,COMPUTERS,...,,,,,,,,,,
1,,,,,,,,,,COMPUTER PERIPHERALS ACCESSORIES,...,,,,,,,,,,
2,,,,,,,,,,TELEVISIONS HOME CINEMA,...,,,,,,,,,,
3,,,,,,,,,,COMPUTERS,...,,,,,,,,,,
4,,,,,,,,,,TELEVISIONS HOME CINEMA,...,,,,,,,,,,


In [None]:
X_train_selected[categorical_columns].head()

Unnamed: 0,item1,item2,item3,item4,item5,item6,item7,item8,item9,item10,...,goods_code15,goods_code16,goods_code17,goods_code18,goods_code19,goods_code20,goods_code21,goods_code22,goods_code23,goods_code24
0,COMPUTERS,,,,,,,,,,...,,,,,,,,,,
1,COMPUTER PERIPHERALS ACCESSORIES,,,,,,,,,,...,,,,,,,,,,
2,TELEVISIONS HOME CINEMA,,,,,,,,,,...,,,,,,,,,,
3,COMPUTERS,COMPUTER PERIPHERALS ACCESSORIES,,,,,,,,,...,,,,,,,,,,
4,TELEVISIONS HOME CINEMA,,,,,,,,,,...,,,,,,,,,,


#### Handling missing values

Let us determine the proportion of missing values.

In [None]:
# Percentage of missing values per variable in X_train_selected
X_train_selected_missing = X_train_selected.isna().sum()
X_train_selected_missing_pct = (X_train_selected_missing / X_train_selected.shape[0]) * 100

# Display the result for X_train_selected
print("Missing values (%) in X_train_selected:")
print(X_train_selected_missing_pct)

# Percentage of missing values per variable in X_test_selected
X_test_selected_missing = X_test_selected.isna().sum()
X_test_selected_missing_pct = (X_test_selected_missing / X_test_selected.shape[0]) * 100

# Display the result for X_test_selected
print("\nMissing values (%) in X_test_selected:")
print(X_test_selected_missing_pct)

# Percentage of missing values per categorical variable in X_train_selected
X_train_categorical_missing = X_train_selected[categorical_columns].isna().sum()
X_train_categorical_missing_pct = (X_train_categorical_missing / X_train_selected.shape[0]) * 100

# Display the result for the categorical columns of X_train_selected
print("\nMissing values (%) in X_train_selected_categorical:")
print(X_train_categorical_missing_pct)

# Percentage of missing values per categorical variable in X_test_selected
X_test_categorical_missing = X_test_selected[categorical_columns].isna().sum()
X_test_categorical_missing_pct = (X_test_categorical_missing / X_test_selected.shape[0]) * 100

# Display the result for the categorical columns of X_test_selected
print("\nMissing values (%) in X_test_selected_categorical:")
print(X_test_categorical_missing_pct)


Missing values (%) in X_train_selected:
Nbr_of_prod_purchas16    99.824982
Nbr_of_prod_purchas21    99.919819
cash_price24             99.952581
Nbr_of_prod_purchas23    99.946546
Nbr_of_prod_purchas24    99.952581
                           ...    
goods_code20             99.906025
goods_code21             99.919819
goods_code22             99.933614
goods_code23             99.946546
goods_code24             99.952581
Length: 105, dtype: float64

Missing values (%) in X_test_selected:
Nbr_of_prod_purchas16    99.823261
Nbr_of_prod_purchas21    99.922407
cash_price24             99.954737
Nbr_of_prod_purchas23    99.952582
Nbr_of_prod_purchas24    99.954737
                           ...    
goods_code20             99.915941
goods_code21             99.922407
goods_code22             99.939650
goods_code23             99.952582
goods_code24             99.954737
Length: 105, dtype: float64

Missing values (%) in X_train_selected_categorical:
item1            0.000000
item2 

In [None]:
X_test_selected.head()

Unnamed: 0,Nbr_of_prod_purchas16,Nbr_of_prod_purchas21,cash_price24,Nbr_of_prod_purchas23,Nbr_of_prod_purchas24,cash_price23,Nbr_of_prod_purchas19,cash_price22,cash_price16,item1,...,goods_code15,goods_code16,goods_code17,goods_code18,goods_code19,goods_code20,goods_code21,goods_code22,goods_code23,goods_code24
0,,,,,,,,,,COMPUTERS,...,,,,,,,,,,
1,,,,,,,,,,COMPUTER PERIPHERALS ACCESSORIES,...,,,,,,,,,,
2,,,,,,,,,,TELEVISIONS HOME CINEMA,...,,,,,,,,,,
3,,,,,,,,,,COMPUTERS,...,,,,,,,,,,
4,,,,,,,,,,TELEVISIONS HOME CINEMA,...,,,,,,,,,,


Let us handle these missing values by imputing them with KNNImputer.

We start with the numeric variables.

In [None]:
# Numeric columns selected for X_train and X_test
numeric_columns = selected_numeric_features_combined

# Subset of the DataFrame with the numeric variables for X_train
X_train_numeric = df_combined[df_combined.index.isin(X_train.index)][numeric_columns]

# Standardise the X_train data
scaler_train = StandardScaler()
X_train_numeric_scaled = scaler_train.fit_transform(X_train_numeric)

# KNN imputation for X_train
imputer_train = KNNImputer(n_neighbors=5)
X_train_imputed = imputer_train.fit_transform(X_train_numeric_scaled)

# New DataFrame holding the imputed data for X_train
X_train_imputed_numerical_df = pd.DataFrame(X_train_imputed, columns=numeric_columns)

# Subset of the DataFrame with the numeric variables for X_test
X_test_numeric = df_combined[df_combined.index.isin(X_test.index)][numeric_columns]

# Standardise the X_test data (a separate scaler is fitted on the test set)
scaler_test = StandardScaler()
X_test_numeric_scaled = scaler_test.fit_transform(X_test_numeric)

# KNN imputation for X_test (a separate imputer is fitted on the test set)
imputer_test = KNNImputer(n_neighbors=5)
X_test_imputed = imputer_test.fit_transform(X_test_numeric_scaled)

# New DataFrame holding the imputed data for X_test
X_test_imputed_numerical_df = pd.DataFrame(X_test_imputed, columns=numeric_columns)
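
A common alternative to fitting one scaler and one imputer per data set is to learn the statistics on the training data only and reuse them on the test data. The sketch below illustrates that pattern on small hypothetical arrays; the array values, `n_neighbors=2` and the name `numeric_pipeline` are assumptions chosen for the example, not settings taken from this notebook.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with missing values (np.nan).
X_tr = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
X_te = np.array([[2.5, np.nan], [np.nan, 20.0]])

# StandardScaler ignores NaN when fitting and keeps it when transforming,
# so it can run before the imputer.
numeric_pipeline = make_pipeline(StandardScaler(), KNNImputer(n_neighbors=2))

X_tr_ready = numeric_pipeline.fit_transform(X_tr)  # statistics learned on train only
X_te_ready = numeric_pipeline.transform(X_te)      # same statistics reused on test

print(np.isnan(X_tr_ready).sum(), np.isnan(X_te_ready).sum())
```

Because the test rows are transformed with the parameters learned on the training rows, both sets end up on the same scale.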


In [4]:
# Drop the 'fraud_flag' column from X_train_imputed_numerical_df
X_train_imputed_numerical_df = X_train_imputed_numerical_df.drop('fraud_flag', axis=1)

# Drop the 'fraud_flag' column from X_test_imputed_numerical_df
X_test_imputed_numerical_df = X_test_imputed_numerical_df.drop('fraud_flag', axis=1)

In [6]:
X_train_imputed_numerical_df.head()

Unnamed: 0,Nbr_of_prod_purchas16,Nbr_of_prod_purchas21,cash_price24,Nbr_of_prod_purchas23,Nbr_of_prod_purchas24,cash_price23,Nbr_of_prod_purchas19,cash_price22,cash_price16
0,-0.369455,0.447675,-0.368794,-0.288527,-0.128316,-0.47753,0.837782,-0.28829,-0.246418
1,-0.369455,0.447675,-0.368794,-0.288527,-0.128316,-0.47753,0.837782,-0.28829,-0.246418
2,-0.369455,0.447675,-0.368794,-0.288527,-0.128316,-0.47753,0.837782,-0.28829,-0.246418
3,-0.369455,0.447675,-0.368794,-0.288527,-0.128316,-0.47753,0.837782,-0.28829,-0.246418
4,-0.369455,0.447675,-0.368794,-0.288527,-0.128316,-0.47753,0.837782,-0.28829,-0.246418


Let us check again the number of missing values in the numeric variables.

In [None]:
X_test_imputed_numerical_df.head()

Unnamed: 0,Nbr_of_prod_purchas16,Nbr_of_prod_purchas21,cash_price24,Nbr_of_prod_purchas23,Nbr_of_prod_purchas24,cash_price23,Nbr_of_prod_purchas19,cash_price22,cash_price16
0,-0.143968,-0.301511,-0.36319,-0.218218,0.380132,-0.225161,0.052827,-0.171707,-0.2139
1,-0.143968,-0.301511,-0.36319,-0.218218,0.380132,-0.225161,0.052827,-0.171707,-0.2139
2,-0.143968,-0.301511,-0.36319,-0.218218,0.380132,-0.225161,0.052827,-0.171707,-0.2139
3,-0.143968,-0.301511,-0.36319,-0.218218,0.380132,-0.225161,0.052827,-0.171707,-0.2139
4,-0.143968,-0.301511,-0.36319,-0.218218,0.380132,-0.225161,0.052827,-0.171707,-0.2139


In [None]:
# Percentage of missing values per variable in X_train_imputed_numerical_df
X_train_imputed_numerical_missing = X_train_imputed_numerical_df.isna().sum()
X_train_imputed_numerical_missing_pct = (X_train_imputed_numerical_missing / X_train_imputed_numerical_df.shape[0]) * 100

# Display the result for X_train_imputed_numerical_df
print("Missing values (%) in X_train_imputed_numerical_df after imputation:")
print(X_train_imputed_numerical_missing_pct)

# Percentage of missing values per variable in X_test_imputed_numerical_df
X_test_imputed_numerical_missing = X_test_imputed_numerical_df.isna().sum()
X_test_imputed_numerical_missing_pct = (X_test_imputed_numerical_missing / X_test_imputed_numerical_df.shape[0]) * 100

# Display the result for X_test_imputed_numerical_df
print("\nMissing values (%) in X_test_imputed_numerical_df after imputation:")
print(X_test_imputed_numerical_missing_pct)


Missing values (%) in X_train_imputed_numerical_df after imputation:
Nbr_of_prod_purchas16    0.0
Nbr_of_prod_purchas21    0.0
cash_price24             0.0
Nbr_of_prod_purchas23    0.0
Nbr_of_prod_purchas24    0.0
cash_price23             0.0
Nbr_of_prod_purchas19    0.0
cash_price22             0.0
cash_price16             0.0
dtype: float64

Missing values (%) in X_test_imputed_numerical_df after imputation:
Nbr_of_prod_purchas16    0.0
Nbr_of_prod_purchas21    0.0
cash_price24             0.0
Nbr_of_prod_purchas23    0.0
Nbr_of_prod_purchas24    0.0
cash_price23             0.0
Nbr_of_prod_purchas19    0.0
cash_price22             0.0
cash_price16             0.0
dtype: float64


Now let us handle the missing values of the categorical variables.

In [None]:
# Categorical columns defined during feature selection
selected_categorical_columns = categorical_columns

# Convert the categorical columns to strings (if not already done)
X_train_selected[selected_categorical_columns] = X_train_selected[selected_categorical_columns].astype(str)
X_test_selected[selected_categorical_columns] = X_test_selected[selected_categorical_columns].astype(str)

# Transformer for the categorical variables
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Apply the transformer to the categorical columns only
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, selected_categorical_columns)
    ])

# Fit the encoding on the training data and apply it to the test data
X_train_encoded = preprocessor.fit_transform(X_train_selected)
X_test_encoded = preprocessor.transform(X_test_selected)

# Apply TruncatedSVD to reduce the dimension of the encoded categorical data
svd_categorical = TruncatedSVD(n_components=100)  # adjust the number of components as needed
X_train_svd_categorical = svd_categorical.fit_transform(X_train_encoded)
X_test_svd_categorical = svd_categorical.transform(X_test_encoded)

# KNN imputer (the SVD output contains no NaN, so this step leaves the values unchanged)
imputer = KNNImputer(n_neighbors=5)

# Apply the KNN imputation to the SVD-reduced categorical data
X_train_imputed_categorical = imputer.fit_transform(X_train_svd_categorical)
X_test_imputed_categorical = imputer.transform(X_test_svd_categorical)

# New DataFrames holding the resulting data
X_train_imputed_categorical_df = pd.DataFrame(X_train_imputed_categorical, columns=range(X_train_imputed_categorical.shape[1]))
X_test_imputed_categorical_df = pd.DataFrame(X_test_imputed_categorical, columns=range(X_test_imputed_categorical.shape[1]))
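
In the cell above, empty basket slots reach the encoder as text, so "no item in this slot" effectively becomes a category of its own. The sketch below shows the same idea with an explicit placeholder, which does not depend on how a given pandas version converts missing values to strings. The `MISSING` label and the toy frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical basket column in which the second basket has no second item.
basket = pd.DataFrame({"item2": ["COMPUTERS", np.nan, "COMPUTERS"]})

# Replace missing entries with an explicit placeholder category.
filled = basket.fillna("MISSING")

# One column per category, including the placeholder.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(filled)  # sparse matrix of shape (3, 2)

print(list(encoder.categories_[0]))
print(encoded.toarray())
```

Once missingness is encoded this way, the encoded matrix contains no missing values, so no imputation step is needed for these columns.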


In [10]:
X_train_imputed_categorical_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,9.630562,-1.137273,-0.214785,0.780704,0.061282,-0.503696,-0.120631,0.016864,0.517128,-0.688886,...,0.008668,-0.001328,-0.006259,-0.001888,-0.001424,0.004029,0.004777,-0.003999,-0.001855,-0.004619
1,9.568599,-1.121138,0.060973,-0.123417,0.091133,0.669679,0.795097,0.267356,0.065455,-0.096523,...,0.022671,-0.015797,-0.036605,-0.006919,0.011559,-0.01175,-0.009696,-0.005346,-0.035955,-0.022054
2,9.50101,-1.153073,0.290671,-1.049383,-0.159766,-0.181614,-0.218127,-0.005433,-0.17468,-0.073312,...,-0.032784,-0.025538,-0.016845,0.025383,0.013586,0.027346,-0.037321,-0.058553,-0.060466,0.019329
3,9.404351,0.55207,0.048977,0.804319,-0.183206,1.113339,-1.323592,0.018668,-0.055913,0.197939,...,0.056906,-0.01971,-0.02486,-0.005174,0.008287,0.040551,0.052993,0.00093,0.035929,-0.048605
4,9.499408,-1.150971,0.291749,-1.035919,-0.162657,-0.187832,-0.219771,0.037327,-0.193116,-0.098033,...,0.195804,0.031792,-0.007726,-0.053703,0.048726,-0.048478,-0.055109,0.104196,0.135683,0.050299


In [None]:
X_test_imputed_categorical_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,9.630562,-1.137273,-0.214785,0.780704,0.061282,-0.503696,-0.120631,0.016864,0.517128,-0.688886,...,0.021044,0.005049,-0.003848,-0.008453,-0.002205,0.001709,-0.002495,0.000149,0.002921,-0.003135
1,9.568599,-1.121138,0.060973,-0.123417,0.091133,0.669679,0.795097,0.267356,0.065455,-0.096523,...,0.027671,-0.003032,-0.012788,0.003594,0.01765,-0.027963,-0.016792,0.034524,-0.016858,0.003959
2,9.50101,-1.153073,0.290671,-1.049383,-0.159766,-0.181614,-0.218127,-0.005433,-0.17468,-0.073312,...,-0.00768,0.013033,-0.019524,0.011552,0.013466,-0.035614,0.052817,0.068413,-0.002779,-0.027419
3,9.404351,0.55207,0.048977,0.804319,-0.183206,1.113339,-1.323592,0.018668,-0.055913,0.197939,...,0.007819,0.05115,-0.043695,-0.025125,0.023009,-0.02865,-0.012946,-0.048461,0.015814,-0.013302
4,9.499408,-1.150971,0.291749,-1.035919,-0.162657,-0.187832,-0.219771,0.037327,-0.193116,-0.098033,...,0.131215,0.108146,-0.051054,-0.191214,-0.07205,0.075246,-0.03141,-0.103597,-0.002422,0.084996


Let us check again the number of missing values in the categorical variables.

In [None]:
# Percentage of missing values per variable in X_train_imputed_categorical_df
X_train_imputed_categorical_missing = X_train_imputed_categorical_df.isna().sum()
X_train_imputed_categorical_missing_pct = (X_train_imputed_categorical_missing / X_train_imputed_categorical_df.shape[0]) * 100

# Display the result for X_train_imputed_categorical_df
print("Missing values (%) in X_train_imputed_categorical_df after imputation:")
print(X_train_imputed_categorical_missing_pct)

# Percentage of missing values per variable in X_test_imputed_categorical_df
X_test_imputed_categorical_missing = X_test_imputed_categorical_df.isna().sum()
X_test_imputed_categorical_missing_pct = (X_test_imputed_categorical_missing / X_test_imputed_categorical_df.shape[0]) * 100

# Display the result for X_test_imputed_categorical_df
print("\nMissing values (%) in X_test_imputed_categorical_df after imputation:")
print(X_test_imputed_categorical_missing_pct)


Missing values (%) in X_train_imputed_categorical_df after imputation:
0     0.0
1     0.0
2     0.0
3     0.0
4     0.0
     ... 
95    0.0
96    0.0
97    0.0
98    0.0
99    0.0
Length: 100, dtype: float64

Missing values (%) in X_test_imputed_categorical_df after imputation:
0     0.0
1     0.0
2     0.0
3     0.0
4     0.0
     ... 
95    0.0
96    0.0
97    0.0
98    0.0
99    0.0
Length: 100, dtype: float64


In [6]:
# Merge X_train_imputed_numerical_df and X_train_imputed_categorical_df column-wise
X_train = pd.concat([X_train_imputed_numerical_df, X_train_imputed_categorical_df], axis=1)

# Merge X_test_imputed_numerical_df and X_test_imputed_categorical_df column-wise
X_test = pd.concat([X_test_imputed_numerical_df, X_test_imputed_categorical_df], axis=1)


In [None]:
X_train.head()

Unnamed: 0,Nbr_of_prod_purchas16,Nbr_of_prod_purchas21,cash_price24,Nbr_of_prod_purchas23,Nbr_of_prod_purchas24,cash_price23,Nbr_of_prod_purchas19,cash_price22,cash_price16,0,...,90,91,92,93,94,95,96,97,98,99
0,-0.369455,0.447675,-0.368794,-0.288527,-0.128316,-0.47753,0.837782,-0.28829,-0.246418,9.630562,...,0.008668,-0.001328,-0.006259,-0.001888,-0.001424,0.004029,0.004777,-0.003999,-0.001855,-0.004619
1,-0.369455,0.447675,-0.368794,-0.288527,-0.128316,-0.47753,0.837782,-0.28829,-0.246418,9.568599,...,0.022671,-0.015797,-0.036605,-0.006919,0.011559,-0.01175,-0.009696,-0.005346,-0.035955,-0.022054
2,-0.369455,0.447675,-0.368794,-0.288527,-0.128316,-0.47753,0.837782,-0.28829,-0.246418,9.50101,...,-0.032784,-0.025538,-0.016845,0.025383,0.013586,0.027346,-0.037321,-0.058553,-0.060466,0.019329
3,-0.369455,0.447675,-0.368794,-0.288527,-0.128316,-0.47753,0.837782,-0.28829,-0.246418,9.404351,...,0.056906,-0.01971,-0.02486,-0.005174,0.008287,0.040551,0.052993,0.00093,0.035929,-0.048605
4,-0.369455,0.447675,-0.368794,-0.288527,-0.128316,-0.47753,0.837782,-0.28829,-0.246418,9.499408,...,0.195804,0.031792,-0.007726,-0.053703,0.048726,-0.048478,-0.055109,0.104196,0.135683,0.050299


In [None]:
X_test.head()

Unnamed: 0,Nbr_of_prod_purchas16,Nbr_of_prod_purchas21,cash_price24,Nbr_of_prod_purchas23,Nbr_of_prod_purchas24,cash_price23,Nbr_of_prod_purchas19,cash_price22,cash_price16,0,...,90,91,92,93,94,95,96,97,98,99
0,-0.143968,-0.301511,-0.36319,-0.218218,0.380132,-0.225161,0.052827,-0.171707,-0.2139,9.630562,...,0.021044,0.005049,-0.003848,-0.008453,-0.002205,0.001709,-0.002495,0.000149,0.002921,-0.003135
1,-0.143968,-0.301511,-0.36319,-0.218218,0.380132,-0.225161,0.052827,-0.171707,-0.2139,9.568599,...,0.027671,-0.003032,-0.012788,0.003594,0.01765,-0.027963,-0.016792,0.034524,-0.016858,0.003959
2,-0.143968,-0.301511,-0.36319,-0.218218,0.380132,-0.225161,0.052827,-0.171707,-0.2139,9.50101,...,-0.00768,0.013033,-0.019524,0.011552,0.013466,-0.035614,0.052817,0.068413,-0.002779,-0.027419
3,-0.143968,-0.301511,-0.36319,-0.218218,0.380132,-0.225161,0.052827,-0.171707,-0.2139,9.404351,...,0.007819,0.05115,-0.043695,-0.025125,0.023009,-0.02865,-0.012946,-0.048461,0.015814,-0.013302
4,-0.143968,-0.301511,-0.36319,-0.218218,0.380132,-0.225161,0.052827,-0.171707,-0.2139,9.499408,...,0.131215,0.108146,-0.051054,-0.191214,-0.07205,0.075246,-0.03141,-0.103597,-0.002422,0.084996


### Logistic regression

##### Model training

In [7]:
# Align X_train on the index of y_train
X_train = X_train.loc[y_train.index]

# Convert all column names to strings
X_train.columns = X_train.columns.astype(str)

# Print the number of samples in X_train and y_train as a check
print("Number of samples in X_train:", X_train.shape[0])
print("Number of samples in y_train:", y_train['fraud_flag'].shape[0])

# Print both indexes to confirm that they match
print("Index of X_train:", X_train.index)
print("Index of y_train:", y_train['fraud_flag'].index)

# Train the model once the data are aligned
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train['fraud_flag'])


Number of samples in X_train: 92790
Number of samples in y_train: 92790
Index of X_train: Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,     9,
       ...
       92780, 92781, 92782, 92783, 92784, 92785, 92786, 92787, 92788, 92789],
      dtype='int64', length=92790)
Index of y_train: RangeIndex(start=0, stop=92790, step=1)


##### Model evaluation

In [None]:
# Check the columns of X_train and X_test
print("Columns in X_train:", X_train.columns)
print("Columns in X_test:", X_test.columns)

Columns in X_train: Index(['Nbr_of_prod_purchas16', 'Nbr_of_prod_purchas21', 'cash_price24',
       'Nbr_of_prod_purchas23', 'Nbr_of_prod_purchas24', 'cash_price23',
       'Nbr_of_prod_purchas19', 'cash_price22', 'cash_price16', '0',
       ...
       '90', '91', '92', '93', '94', '95', '96', '97', '98', '99'],
      dtype='object', length=109)
Columns in X_test: Index(['Nbr_of_prod_purchas16', 'Nbr_of_prod_purchas21',
                'cash_price24', 'Nbr_of_prod_purchas23',
       'Nbr_of_prod_purchas24',          'cash_price23',
       'Nbr_of_prod_purchas19',          'cash_price22',
                'cash_price16',                       0,
       ...
                            90,                      91,
                            92,                      93,
                            94,                      95,
                            96,                      97,
                            98,                      99],
      dtype='object', length=109)


In [8]:
# Select the 'fraud_flag' column of y_test
y_test_column = y_test['fraud_flag'].values

# Truncate y_test to the number of rows in X_test
y_test_column = y_test_column[:X_test.shape[0]]

# Binarise the labels with a threshold (0.5 here)
threshold = 0.5
y_test_binary = (y_test_column > threshold).astype(int)

# Convert the column names of X_test to strings, as was done for X_train
X_test.columns = X_test.columns.astype(str)

# Predict fraud probabilities on X_test
y_probs = model.predict_proba(X_test)[:, 1]

# Truncate y_probs to the length of y_test_binary
y_probs = y_probs[:y_test_binary.shape[0]]

# Binarise the predicted probabilities with the same threshold
y_pred_binary = (y_probs > threshold).astype(int)

# Compute the PR-AUC (average precision) from the binarised predictions
pr_auc = average_precision_score(y_test_binary, y_pred_binary)
print(f"PR-AUC: {pr_auc}")

PR-AUC: 0.5010498874209232
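
`average_precision_score` summarises the precision-recall curve over all thresholds, so it is usually given continuous scores such as the output of `predict_proba`. Binarised predictions reduce the curve to a single operating point. The sketch below compares the two calls on small hypothetical labels and scores; the numbers are invented for illustration and are unrelated to the results of this notebook.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical labels and scores: 2 frauds among 8 baskets.
y_true = np.array([0, 0, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.05, 0.10, 0.20, 0.45, 0.15, 0.30, 0.40, 0.02])

# Average precision from the continuous scores (uses the full ranking).
ap_from_scores = average_precision_score(y_true, y_score)

# Average precision from predictions binarised at 0.5 (every score is below 0.5 here).
ap_from_labels = average_precision_score(y_true, (y_score > 0.5).astype(int))

print(ap_from_scores)  # 1.0: both frauds are ranked above every non-fraud
print(ap_from_labels)  # 0.25: all predictions are tied, so the score equals the fraud rate
```

In this toy case the ranking is perfect, yet the binarised version only reports the share of positives, because no score crosses the 0.5 threshold.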


Using the optimal threshold to compute the PR-AUC
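
The notebook does not state how the optimal threshold is defined. One possible reading, taken here purely as an assumption, is the threshold that maximises the F1 score along the precision-recall curve. The sketch below applies that criterion to small hypothetical labels and scores.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical labels and predicted probabilities.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0])
y_score = np.array([0.10, 0.20, 0.35, 0.40, 0.60, 0.80, 0.05, 0.30])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# The last precision/recall pair has no associated threshold, so drop it.
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / np.maximum(p + r, 1e-12)  # guard against division by zero

# Assumed criterion: keep the threshold with the highest F1 score.
best = int(np.argmax(f1))
best_threshold = float(thresholds[best])

print(best_threshold)
```

Other criteria are equally defensible, for instance a minimum precision imposed by the business or a cost-weighted rule, and the selected threshold should be chosen on validation data rather than on the test set.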

#### Random forest

##### Model training

In [9]:
from sklearn.ensemble import RandomForestClassifier

# Use the 'fraud_flag' column of y_train as a 1D target
y_train_1d = y_train['fraud_flag']

# Random forest with 2 trees, a maximum depth of 10 and a fixed random seed
random_forest = RandomForestClassifier(n_estimators=2, max_depth=10, random_state=42)

# Train the model on the training data
random_forest.fit(X_train, y_train_1d)

##### Model evaluation

In [15]:
# Define the batch size
batch_size = 1000

# Get the total number of samples
num_samples = X_test.shape[0]

# Initialize an empty array to store predictions
y_probs = np.empty((0,))

# Process the data in batches
for i in range(0, num_samples, batch_size):
    # Get the current batch
    X_batch = X_test.iloc[i:i + batch_size, :]

    # Make predictions on the current batch
    y_probs_batch = random_forest.predict_proba(X_batch)[:, 1]

    # Append the predictions to the result array
    y_probs = np.concatenate((y_probs, y_probs_batch))

# Binarise the values predicted using the same threshold
y_pred_binary = (y_probs > threshold).astype(int)

# Adjust the size of y_test_binary to match the size of y_probs
y_test_binary = y_test_binary[:y_probs.shape[0]]

# Create a subset of y_test_binary and y_probs
subset_size = 23198
y_test_binary_subset = y_test_binary[:subset_size]
y_probs_subset = y_probs[:subset_size]

# Calculate the PR-AUC
pr_auc = average_precision_score(y_test_binary_subset, y_pred_binary[:subset_size])
print(f"PR-AUC: {pr_auc}")

PR-AUC: 0.501724286576429


#### Naive Bayes

In [None]:
# Use a smaller subset of the training set
subset_size = 23198  # adjust the subset size as needed
X_train_subset, _, y_train_subset, _ = train_test_split(X_train, y_train, train_size=subset_size, random_state=42)

# Extract the 'fraud_flag' column
y_train_column_subset = y_train_subset['fraud_flag'].values

# Create and train a Gaussian Naive Bayes model
nb_model = GaussianNB()
nb_model.fit(X_train_subset, y_train_column_subset)

# Predict fraud probabilities on the test set
y_test_probs_nb = nb_model.predict_proba(X_test)[:, 1]

# Binarise the test labels with a threshold
threshold_nb = 0.5
y_test_binary_nb = (y_test['fraud_flag'].values > threshold_nb).astype(int)

# Keep only the first 23198 elements of y_test_probs_nb
y_test_probs_nb_subset = y_test_probs_nb[:23198]

# Print the lengths
print("Length of y_test_probs_nb:", len(y_test_probs_nb_subset))
print("Length of the y_test index:", len(y_test.index))

# Check that the number of predictions is the expected one
if len(y_test_probs_nb_subset) != 23198:
    raise ValueError("The length does not match 23198.")

# Compute the PR-AUC on the test set
pr_auc_test_nb = average_precision_score(y_test_binary_nb, y_test_probs_nb_subset)

print('Performance on the test set (Naive Bayes)')
print('Test PR-AUC: ', pr_auc_test_nb)
# Test PR-AUC:  0.501724286576429

#### Support Vector Machine (SVM)

In [None]:
# Select the 'fraud_flag' column of y_train
y_train_column = y_train['fraud_flag'].values

# Truncate y_train to the number of rows in X_train
y_train_column = y_train_column[:X_train.shape[0]]
y_train_binary = (y_train_column > threshold).astype(int)

# Initialise the SVM model
svm_model = SVC(probability=True)

# Parameter grid to search
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 1, 10]}

# Grid search with 5-fold cross-validation, scored by average precision
grid_search = GridSearchCV(svm_model, param_grid, cv=5, scoring='average_precision')

# Fit the grid search on the training data
grid_search.fit(X_train, y_train_binary)

# Keep the best model
best_svm_model = grid_search.best_estimator_

# Predict fraud probabilities on X_test with the best model
y_probs_svm = best_svm_model.predict_proba(X_test)[:, 1]

# Truncate y_probs_svm to the length of y_test_binary
y_probs_svm = y_probs_svm[:y_test_binary.shape[0]]

# Binarise the predicted probabilities with the same threshold
y_pred_svm_binary = (y_probs_svm > threshold).astype(int)

# Compute the PR-AUC (average precision) for the SVM from the binarised predictions
pr_auc_svm = average_precision_score(y_test_binary, y_pred_svm_binary)
print(f"PR-AUC for SVM: {pr_auc_svm}")
#  PR-AUC for SVM: 0.5030992698087454

#### Kernel SVM

In [None]:
# Use a smaller subset of the training set
subset_size = 23198  # adjust the subset size as needed
X_train_subset, _, y_train_subset, _ = train_test_split(X_train, y_train, train_size=subset_size, random_state=42)

# Extract the 'fraud_flag' column
y_train_column_subset = y_train_subset['fraud_flag'].values

# Create and train an SVM with an RBF kernel (this model is not used for the predictions below)
svm_model = SVC(kernel='rbf', random_state=42, probability=True, C=1, gamma='scale')
svm_model.fit(X_train_subset, y_train_column_subset)

# Nystroem approximation of the RBF kernel
nystroem_approximation = Nystroem(kernel='rbf', n_components=100, random_state=42)
X_train_nystroem = nystroem_approximation.fit_transform(X_train_subset)

# Linear SVM trained on the Nystroem features of the subset
svm_model_nystroem = SVC(kernel='linear', probability=True)
svm_model_nystroem.fit(X_train_nystroem, y_train_column_subset)

# Predict fraud probabilities on the test set
X_test_nystroem = nystroem_approximation.transform(X_test)
y_test_probs_svm = svm_model_nystroem.predict_proba(X_test_nystroem)[:, 1]

# Binarise the test labels with a threshold
threshold_svm = 0.5
y_test_binary_svm = (y_test['fraud_flag'].values[:len(y_test_probs_svm)] > threshold_svm).astype(int)

# Keep only the first 23198 elements of y_test_probs_svm
y_test_probs_svm_subset = y_test_probs_svm[:23198]

# Print the lengths
print("Length of y_test_probs_svm:", len(y_test_probs_svm_subset))
print("Length of the y_test index:", len(y_test.index))

# Check that the number of predictions is the expected one
if len(y_test_probs_svm_subset) != 23198:
    raise ValueError("The length does not match 23198.")

# Build the submission DataFrame from the index and IDs of y_test
submission_df = pd.DataFrame({
    'index': y_test.index[:23198],  # index of y_test
    'ID': y_test['ID'].values[:23198],  # values of the 'ID' column of y_test
    'fraud_flag': y_test_probs_svm_subset,  # predicted probabilities, used as is
})

# Save the DataFrame as a CSV file for submission
submission_df.to_csv("submission_challenge_ksvm2.csv", index=False)

# Compute the PR-AUC on the test set
pr_auc_test_svm = average_precision_score(y_test_binary_svm, y_test_probs_svm_subset)

print('Performance on the test set (Kernel SVM with Nystroem)')
print('Test PR-AUC: ', pr_auc_test_svm)
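
The Nystroem step above maps the data into an approximate RBF feature space so that a linear model can stand in for a kernel SVM. The sketch below chains the same idea in a single pipeline on synthetic data. Swapping `SVC(kernel='linear', probability=True)` for `LinearSVC`, adding a `StandardScaler`, and scoring with `decision_function` are assumptions made for this example, as are the data set and every parameter value.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic, imbalanced stand-in data (about 5% positives); not the challenge data.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Scale, map to approximate RBF features, then fit a linear SVM on those features.
approx_kernel_svm = make_pipeline(
    StandardScaler(),
    Nystroem(kernel="rbf", n_components=100, random_state=42),
    LinearSVC(C=1.0, random_state=42),
)
approx_kernel_svm.fit(X_tr, y_tr)

# Average precision only needs a ranking, so signed distances to the hyperplane suffice.
scores = approx_kernel_svm.decision_function(X_te)
ap = average_precision_score(y_te, scores)
print(f"Average precision: {ap:.3f}")
```

Keeping the steps in one pipeline means the kernel approximation is fitted on the training rows only and then reused unchanged on the test rows.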

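The Kernel SVM model achieves the highest PR-AUC (0.5070258030844532), ahead of the grid-searched SVM (0.5030992698087454), the random forest and Naive Bayes (both 0.501724286576429) and logistic regression (0.5010498874209232).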