# Adult Income Exercise

For this exercise we will use a dataset that contains features describing people, and we need to predict <b>whether each person is paid more than $50,000 annually</b>.

- You need to:
    - Explore the dataset
    - Give some basic information about the data
    - Preprocess the data (missing values, imputation...)
    - Test various Machine Learning algorithms
    - Evaluate the performance of these algorithms over the test data in terms of Accuracy, Precision, Recall, F1, plot the confusion matrix...
    - Choose one of these algorithms to perform the prediction task and explain why (in comments or in a markdown cell)
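
The evaluation metrics listed above all derive from the four cells of a binary confusion matrix. The following is a minimal sketch in plain Python, assuming labels encoded as 0/1 with 1 as the positive class; the function names `confusion_counts` and `binary_metrics` and the toy labels are illustrative only and are not part of the exercise.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # predicted positive, actually positive
        elif pred == positive:
            fp += 1  # predicted positive, actually negative
        elif truth == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels, made up for illustration (1 = positive class)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
metrics = binary_metrics(y_true, y_pred)
print("tp, fp, fn, tn =", counts)
print(metrics)
```

In a notebook, the same quantities are available from `sklearn.metrics` (`accuracy_score`, `precision_score`, `recall_score`, `f1_score`, `confusion_matrix`); the hand-written version only makes the definitions explicit.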



In [17]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import warnings
warnings.filterwarnings('ignore')

# Read the data
df_train = pd.read_csv('data/train.csv')
# Treat empty strings as missing, then drop every row with a missing value
df_train.replace('', pd.NA, inplace=True)
df_train = df_train.dropna()

X_train = df_train.drop(columns=["Survived"])
X_test = pd.read_csv('data/test.csv')
y_train = df_train["Survived"]

# Save the PassengerId column before any transformation
X_test_passenger_id = X_test["PassengerId"].copy()

# Select the numeric columns
numeric_columns = X_train.select_dtypes(include=["float64", "int64"]).columns.tolist()

# Select the categorical columns
categorical_col = X_train.select_dtypes(include=["object"]).columns.tolist()

# Standardize the numeric columns (fit on the training set, reuse on the test set)
scaler = StandardScaler()
X_train[numeric_columns] = scaler.fit_transform(X_train[numeric_columns])
X_test[numeric_columns] = scaler.transform(X_test[numeric_columns])

# One-hot encode the categorical columns; categories unseen during fit become all zeros
ohe = OneHotEncoder(drop=None, handle_unknown='ignore', sparse_output=False)

ohe_cols = [col for col in categorical_col if col in X_test.columns]

X_train_encoded = pd.DataFrame(ohe.fit_transform(X_train[ohe_cols]), index=X_train.index)
X_test_encoded = pd.DataFrame(ohe.transform(X_test[ohe_cols]), index=X_test.index)

X_train_encoded.columns = ohe.get_feature_names_out(ohe_cols)
X_test_encoded.columns = ohe.get_feature_names_out(ohe_cols)

X_train.drop(ohe_cols, axis=1, inplace=True)
X_test.drop(ohe_cols, axis=1, inplace=True)

X_train = pd.concat([X_train, X_train_encoded], axis=1)
X_test = pd.concat([X_test, X_test_encoded], axis=1)

# Check for NaN values after the transformations
print("Number of NaN values in X_train:", X_train.isna().sum().sum())
print("Number of NaN values in X_test:", X_test.isna().sum().sum())

# Fill the remaining NaN values with the training column means,
# so that no statistic is computed from the test set
train_means = X_train.mean()
X_train.fillna(train_means, inplace=True)
X_test.fillna(train_means, inplace=True)

# Gradient boosting tuned with RandomizedSearchCV
random_grid = {
    'n_estimators': [50, 100, 150, 200, 250, 300, 350, 400],
    # 'auto' was deprecated in scikit-learn 1.1 and removed in 1.3;
    # for classifiers it was equivalent to 'sqrt'
    'max_features': ['sqrt', 'log2'],
    'max_depth': [int(x) for x in np.linspace(10, 110, num=11)] + [None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'learning_rate': [0.1, 1],
    'subsample': [0.5, 0.7, 0.9]
}

gb = GradientBoostingClassifier(random_state=0)
gb_search = RandomizedSearchCV(estimator=gb, param_distributions=random_grid, n_iter=10, cv=3, verbose=1, random_state=0)
gb_search.fit(X_train, y_train)

# Predict on the test set
y_test_pred = gb_search.predict(X_test)

# Use the saved PassengerId column
result = pd.concat([X_test_passenger_id, pd.DataFrame(y_test_pred)], axis=1)
result.columns = ["PassengerId", "Survived"]

# Cast PassengerId to int (just in case)
result["PassengerId"] = result["PassengerId"].astype(int)

# Create the submission file for Kaggle
result.to_csv("result.csv", index=False)
print("Submission file created: result.csv")


Number of NaN values in X_train: 0
Number of NaN values in X_test: 87
Fitting 3 folds for each of 10 candidates, totalling 30 fits
Submission file created: result.csv
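
The output above reports missing values in the test set only. A common convention for this kind of imputation is to compute the fill values on the training data and reuse them for the test data, so that no test-set statistic enters the preprocessing. Below is a minimal sketch in plain Python, assuming rows stored as dictionaries with `None` for missing values; the column names `age` and `hours` and both helper functions are hypothetical.

```python
from statistics import mean


def fit_column_means(rows, columns):
    """Compute per-column means from the training rows, skipping missing values."""
    means = {}
    for col in columns:
        observed = [row[col] for row in rows if row[col] is not None]
        # Fall back to 0.0 when a column has no observed value at all
        means[col] = mean(observed) if observed else 0.0
    return means


def impute(rows, means):
    """Return copies of the rows with missing values replaced by the given means."""
    return [
        {col: (means[col] if value is None and col in means else value)
         for col, value in row.items()}
        for row in rows
    ]


# Toy data, made up for illustration
train_rows = [
    {"age": 30.0, "hours": 40.0},
    {"age": 50.0, "hours": None},
    {"age": 40.0, "hours": 20.0},
]
test_rows = [
    {"age": None, "hours": 35.0},
    {"age": 25.0, "hours": None},
]

train_means = fit_column_means(train_rows, ["age", "hours"])
train_filled = impute(train_rows, train_means)
test_filled = impute(test_rows, train_means)  # test rows reuse the training means
print(train_means)
print(test_filled)
```

With pandas DataFrames the same idea is a one-liner, `X_test.fillna(X_train.mean())`, and scikit-learn's `SimpleImputer` follows the same fit-on-train, transform-on-test pattern.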


### Choose one of the algorithms and explain why

Your answer here:

After analyzing the accuracy and F1 scores, we conclude that the best-performing algorithm is gradient boosting.