# Notebook 01 – Data preparation

This notebook covers the following steps:
- Loading and inspecting `application_train` and the auxiliary files
- Analysing the target variable `TARGET`
- Detecting missing values
- Initial preprocessing
- Merging secondary files (e.g. `previous_application`)
- Simple feature engineering
- Exporting the cleaned dataset

In [None]:
# 📦 Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
from pathlib import Path

pd.set_option('display.max_columns', 100)

## 01. Loading the data

In [None]:
# 📁 Path to the raw files
PATH_RAW = Path('../data/raw')

df_train = pd.read_csv(PATH_RAW / 'application_train.csv')
df_test = pd.read_csv(PATH_RAW / 'application_test.csv')
df_prev = pd.read_csv(PATH_RAW / 'previous_application.csv')
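
The introduction lists inspection alongside loading. A minimal sketch of what such a first look might be, using a hypothetical helper `quick_overview` (not part of the notebook) and a tiny made-up DataFrame standing in for the real files:

```python
import pandas as pd

def quick_overview(df: pd.DataFrame, name: str) -> dict:
    """Return basic facts about a DataFrame (hypothetical helper)."""
    return {
        "name": name,
        "n_rows": df.shape[0],
        "n_cols": df.shape[1],
        # Columns that are not numeric, i.e. candidates for encoding
        "n_non_numeric_cols": df.select_dtypes(exclude="number").shape[1],
        # Repeated client IDs would signal a problem in an application table
        "n_duplicate_ids": int(df["SK_ID_CURR"].duplicated().sum()),
    }

# Toy stand-in for application_train (made-up values)
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "TARGET": [0, 1, 0],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans"],
})
overview = quick_overview(toy, "toy_train")
print(overview)
```

The same helper could be called on `df_train`, `df_test` and `df_prev`, provided each has an `SK_ID_CURR` column. For `df_prev`, repeated IDs are presumably expected, since one client can have several previous applications.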

## 02. Analysis of the target variable `TARGET`

In [None]:
# Sort by label so the bar order matches the tick labels (0 first, then 1)
df_train['TARGET'].value_counts(normalize=True).sort_index().plot(kind='bar', color=['green', 'red'])
plt.title('Distribution of the target (TARGET)')
plt.xticks(ticks=[0, 1], labels=['Repayment OK', 'Default'], rotation=0)
plt.ylabel('Proportion')
plt.grid(True)
plt.show()
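
Alongside the bar chart, it can help to put a number on the class imbalance. The sketch below uses a made-up target column, not the real data, and assumes 0 means repaid and 1 means default, as in the plot labels:

```python
import pandas as pd

# Toy target column (made-up values): 0 = repaid, 1 = default
target = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="TARGET")

# Share of each class, ordered by label so 0 always comes first
proportions = target.value_counts(normalize=True).sort_index()

# Ratio of the majority class share to the minority class share
imbalance_ratio = proportions.max() / proportions.min()

print(proportions)
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

Replacing `target` with `df_train['TARGET']` would give the figures for the actual dataset.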

## 03. Missing values

In [None]:
msno.matrix(df_train)
plt.show()

missing = df_train.isnull().mean().sort_values(ascending=False)
missing[missing > 0].head(20)
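
One possible follow-up to the missing-value ranking is to build a small report and drop the sparsest columns. This is only a sketch on made-up data. The helper `missing_report` and the 50% threshold are assumptions for illustration, not choices made in this notebook:

```python
import numpy as np
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and share of missing values per column, worst first (hypothetical helper)."""
    report = pd.DataFrame({
        "n_missing": df.isnull().sum(),
        "share_missing": df.isnull().mean(),
    })
    # Keep only columns that actually have missing values
    report = report[report["n_missing"] > 0]
    return report.sort_values("share_missing", ascending=False)

# Toy data (made-up values)
toy = pd.DataFrame({
    "A": [1.0, np.nan, np.nan, np.nan],
    "B": [1.0, 2.0, np.nan, 4.0],
    "C": [1, 2, 3, 4],
})
report = missing_report(toy)
print(report)

# Assumed threshold, for illustration only: drop columns with more than 50% missing
THRESHOLD = 0.5
to_drop = report.index[report["share_missing"] > THRESHOLD].tolist()
toy_reduced = toy.drop(columns=to_drop)
```

Whether dropping is appropriate depends on the model used later. Some models handle missing values natively, in which case keeping the columns may be preferable.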

## 04. Initial cleaning + simple encoding

In [None]:
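from sklearn.preprocessing import LabelEncoder

# Label-encode categorical columns that have exactly two distinct values.
# Note: nunique() ignores NaN, while astype(str) turns NaN into the string 'nan',
# so a two-valued column with missing entries ends up with three codes.
le = LabelEncoder()
for col in df_train.columns:
    if df_train[col].dtype == 'object' and df_train[col].nunique() == 2:
        df_train[col] = le.fit_transform(df_train[col].astype(str))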
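
The loop above encodes `df_train` only. If the test set is encoded later, it needs the same value-to-code mapping. One way to guarantee that, sketched here on made-up data with a plain dictionary instead of `LabelEncoder` (a swapped-in technique, not the notebook's method), is to learn the mapping on train and reuse it on test:

```python
import pandas as pd

# Toy train/test frames (made-up values)
train = pd.DataFrame({"FLAG_OWN_CAR": ["N", "Y", "N"], "CODE_GENDER": ["F", "M", "XNA"]})
test = pd.DataFrame({"FLAG_OWN_CAR": ["Y", "N"], "CODE_GENDER": ["M", "F"]})

mappings = {}  # column name -> {original value: integer code}
for col in train.columns:
    # Distinct non-missing values seen in train, in sorted order
    values = sorted(train[col].dropna().unique())
    if len(values) == 2:
        mappings[col] = {value: code for code, value in enumerate(values)}
        # Apply the mapping learned on train to both sets;
        # missing or unseen values become NaN instead of a spurious code
        train[col] = train[col].map(mappings[col])
        test[col] = test[col].map(mappings[col])

print(mappings)
```

Here only `FLAG_OWN_CAR` is encoded, because the toy `CODE_GENDER` column has three distinct values.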

## 05. Merge with `previous_application.csv` (example)

In [None]:
prev_agg = df_prev.groupby('SK_ID_CURR').agg({
    'AMT_CREDIT': 'mean',
    'NAME_CONTRACT_STATUS': 'nunique'
}).rename(columns={
    'AMT_CREDIT': 'PREV_MEAN_AMT_CREDIT',
    'NAME_CONTRACT_STATUS': 'PREV_NB_CONTRACT_STATUS'
})

df_train = df_train.merge(prev_agg, on='SK_ID_CURR', how='left')
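
A left merge like the one above should leave the number of rows in the main table unchanged. The sketch below, on made-up data, shows how the `validate` and `indicator` arguments of `DataFrame.merge` can be used to check that and to count clients without any previous application:

```python
import pandas as pd

# Toy stand-ins for the application and previous_application tables (made-up values)
apps = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "AMT_CREDIT": [100.0, 200.0, 300.0]})
prev = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_CREDIT": [50.0, 150.0, 80.0],
    "NAME_CONTRACT_STATUS": ["Approved", "Refused", "Approved"],
})

# One row per client: mean previous credit and number of distinct statuses
prev_agg = prev.groupby("SK_ID_CURR").agg(
    PREV_MEAN_AMT_CREDIT=("AMT_CREDIT", "mean"),
    PREV_NB_CONTRACT_STATUS=("NAME_CONTRACT_STATUS", "nunique"),
).reset_index()

# validate raises MergeError if a key is duplicated on either side;
# indicator adds a _merge column telling where each row came from
merged = apps.merge(
    prev_agg, on="SK_ID_CURR", how="left", validate="one_to_one", indicator=True
)

n_without_history = int((merged["_merge"] == "left_only").sum())
print(merged)
print(f"Clients without previous applications: {n_without_history}")
```

Rows marked `left_only` get `NaN` in the aggregated columns, which then need the same missing-value treatment as any other feature.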

## 06. Feature engineering

In [None]:
df_train['CREDIT_INCOME_RATIO'] = df_train['AMT_CREDIT'] / df_train['AMT_INCOME_TOTAL']
df_train['AGE'] = abs(df_train['DAYS_BIRTH']) // 365
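
A ratio feature produces `inf` when the denominator is zero. The sketch below uses made-up values and shows one way to guard against that. It assumes a zero income should be treated as missing, which is a modelling choice rather than something stated in this notebook:

```python
import numpy as np
import pandas as pd

# Toy data (made-up values); DAYS_BIRTH is assumed to be stored as negative day counts,
# which is why the notebook takes its absolute value
toy = pd.DataFrame({
    "AMT_CREDIT": [100000.0, 50000.0, 20000.0],
    "AMT_INCOME_TOTAL": [50000.0, 0.0, 40000.0],
    "DAYS_BIRTH": [-10950, -16425, -7300],
})

# Treat a zero income as missing so the ratio becomes NaN instead of inf
income = toy["AMT_INCOME_TOTAL"].replace(0, np.nan)
toy["CREDIT_INCOME_RATIO"] = toy["AMT_CREDIT"] / income

# Age in whole years (integer division by 365, ignoring leap years)
toy["AGE"] = toy["DAYS_BIRTH"].abs() // 365

print(toy)
```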

## 07. Saving the final dataset

In [None]:
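PATH_OUT = Path('../data/output')
PATH_OUT.mkdir(parents=True, exist_ok=True)  # create the output folder if it does not exist
df_train.to_csv(PATH_OUT / 'train_preprocessed.csv', index=False)
print("Preprocessed dataset saved.")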