# Employee Attrition Prediction

## Notebook Objective

This notebook builds a machine learning model that **predicts employee attrition** (whether an employee leaves the company) and **analyzes the factors** that influence that decision.

### Notebook Outline

1. **Data extraction** - Loading the CSV files and feature engineering
2. **Exploratory analysis** - Visualizations and correlations
3. **Data cleaning** - Dropping problematic columns and imputation
4. **ML pipeline** - Preparation, training, and model tuning
5. **Model analysis** - Interpretation and feature importance

---
## Imports and Configuration

In [None]:
# Data manipulation
import pandas as pd
import numpy as np
from datetime import datetime

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Models
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier

# Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_curve, roc_auc_score,
    precision_recall_curve, average_precision_score
)

# Model persistence
import joblib

# Warnings
import warnings
warnings.filterwarnings('ignore')

# Plot style
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

---
# 1. Data Extraction

Five CSV files need to be loaded and merged:
- `general_data.csv`: General information about employees
- `employee_survey_data.csv`: Employee satisfaction survey data
- `manager_survey_data.csv`: Manager evaluations
- `in_time.csv`: Arrival times at work
- `out_time.csv`: Departure times from work

### 1.1 Loading the main CSV files

In [None]:
# Load the main datasets
general_data = pd.read_csv('data/general_data.csv')
employee_survey = pd.read_csv('data/employee_survey_data.csv')
manager_survey = pd.read_csv('data/manager_survey_data.csv')

print("=" * 60)
print("GENERAL DATA")
print("=" * 60)
print(f"Shape: {general_data.shape}")
print(f"\nColumns: {list(general_data.columns)}")
display(general_data.head(3))

print("\n" + "=" * 60)
print("EMPLOYEE SURVEY DATA")
print("=" * 60)
print(f"Shape: {employee_survey.shape}")
display(employee_survey.head(3))

print("\n" + "=" * 60)
print("MANAGER SURVEY DATA")
print("=" * 60)
print(f"Shape: {manager_survey.shape}")
display(manager_survey.head(3))

### 1.2 Loading and processing the time-clock data (in_time / out_time)

The `in_time.csv` and `out_time.csv` files hold each employee's arrival and departure timestamps for every day of 2015. We aggregate them into three features:

- **Arrive_mean**: Mean arrival time, in decimal hours
- **Departure_mean**: Mean departure time, in decimal hours
- **Worktime_mean**: Mean hours worked per day
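The three aggregates above can also be computed without row-by-row loops. The sketch below is one possible vectorized variant, not the notebook's own implementation: the function name `time_features_long`, the `fmt` parameter, and the two-employee demo tables are illustrative assumptions. It reshapes both tables to one row per (employee, day), then applies the same 1 to 16 hour plausibility filter.

```python
import numpy as np
import pandas as pd

def time_features_long(in_df, out_df, id_col="EmployeeID", fmt=None):
    """Mean arrival hour, departure hour and hours worked per employee."""
    # Reshape both tables to one row per (employee, day)
    long_in = in_df.melt(id_vars=id_col, var_name="day", value_name="t_in")
    long_out = out_df.melt(id_vars=id_col, var_name="day", value_name="t_out")
    long = long_in.merge(long_out, on=[id_col, "day"])

    # Missing or unparseable timestamps become NaT instead of raising
    long["t_in"] = pd.to_datetime(long["t_in"], format=fmt, errors="coerce")
    long["t_out"] = pd.to_datetime(long["t_out"], format=fmt, errors="coerce")

    # Decimal hours, e.g. 09:30 -> 9.5
    long["Arrive"] = long["t_in"].dt.hour + long["t_in"].dt.minute / 60
    long["Departure"] = long["t_out"].dt.hour + long["t_out"].dt.minute / 60
    long["Worktime"] = (long["t_out"] - long["t_in"]).dt.total_seconds() / 3600

    # Keep plausible working days only (1 to 16 hours); NaN durations drop out too
    valid = long[long["Worktime"].between(1, 16)]

    means = valid.groupby(id_col)[["Arrive", "Departure", "Worktime"]].mean()
    # Employees without any valid day keep a row of NaN
    means = means.reindex(in_df[id_col].unique()).add_suffix("_mean")
    return means.rename_axis(id_col).reset_index()

# Hypothetical two-employee, two-day example
demo_in = pd.DataFrame({
    "EmployeeID": [1, 2],
    "2015-01-02": ["2015-01-02 09:30:00", "2015-01-02 10:00:00"],
    "2015-01-05": ["2015-01-05 09:00:00", np.nan],
})
demo_out = pd.DataFrame({
    "EmployeeID": [1, 2],
    "2015-01-02": ["2015-01-02 17:30:00", "2015-01-02 18:00:00"],
    "2015-01-05": ["2015-01-05 18:00:00", np.nan],
})
demo_features = time_features_long(demo_in, demo_out)
```

If the real files use a day-first format, passing it explicitly through `fmt` avoids relying on format inference.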

In [None]:
# Load the time-clock data
in_time = pd.read_csv('data/in_time.csv')
out_time = pd.read_csv('data/out_time.csv')

print(f"in_time shape: {in_time.shape}")
print(f"out_time shape: {out_time.shape}")
print(f"\nNumber of recorded days: {len(in_time.columns) - 1}")

In [None]:
def extract_time_features(in_time_df, out_time_df):
    """
    Extract aggregated time features from the time-clock data.
    
    Returns:
        DataFrame with EmployeeID, Arrive_mean, Departure_mean, Worktime_mean
    """
    # Date columns (everything except EmployeeID)
    date_cols = [col for col in in_time_df.columns if col != 'EmployeeID']
    
    results = []
    
    for _, row_in in in_time_df.iterrows():
        emp_id = row_in['EmployeeID']
        row_out = out_time_df[out_time_df['EmployeeID'] == emp_id].iloc[0]
        
        arrive_times = []  # In decimal hours
        depart_times = []  # In decimal hours
        work_durations = []  # In hours
        
        for date_col in date_cols:
            in_val = row_in[date_col]
            out_val = row_out[date_col]
            
            # Skip missing values (day not worked)
            if pd.isna(in_val) or pd.isna(out_val) or in_val == 'NA' or out_val == 'NA':
                continue
            
            try:
                # Parse the timestamps
                in_dt = pd.to_datetime(in_val, format='%d/%m/%Y %H:%M')
                out_dt = pd.to_datetime(out_val, format='%d/%m/%Y %H:%M')
                
                # Time of day in decimal hours (e.g. 9:30 = 9.5)
                arrive_decimal = in_dt.hour + in_dt.minute / 60
                depart_decimal = out_dt.hour + out_dt.minute / 60
                
                # Work duration in hours
                work_hours = (out_dt - in_dt).total_seconds() / 3600
                
                # Validation: keep plausible durations only (between 1h and 16h)
                if 1 <= work_hours <= 16:
                    arrive_times.append(arrive_decimal)
                    depart_times.append(depart_decimal)
                    work_durations.append(work_hours)
                    
            except (ValueError, TypeError):
                # Skip timestamps that do not match the expected format
                continue
        
        # Compute the means (NaN when the employee has no valid day)
        results.append({
            'EmployeeID': emp_id,
            'Arrive_mean': np.mean(arrive_times) if arrive_times else np.nan,
            'Departure_mean': np.mean(depart_times) if depart_times else np.nan,
            'Worktime_mean': np.mean(work_durations) if work_durations else np.nan
        })
    
    return pd.DataFrame(results)

# Extract the time features
time_features = extract_time_features(in_time, out_time)

print("Extracted time features:")
display(time_features.head(10))
print(f"\nStatistics:")
display(time_features.describe())

### 1.3 Joining the DataFrames

All data sources are merged into a single DataFrame on `EmployeeID`.
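A left join on `EmployeeID` silently duplicates rows if a key appears twice on either side, and it leaves NaN where a survey row is missing. The following sketch, built on small made-up tables, shows how the `validate` and `indicator` options of `DataFrame.merge` could be used to check both conditions; the table contents are hypothetical.

```python
import pandas as pd

# Hypothetical miniature tables standing in for the real sources
left = pd.DataFrame({"EmployeeID": [1, 2, 3], "Department": ["Sales", "R&D", "HR"]})
right = pd.DataFrame({"EmployeeID": [1, 2], "JobSatisfaction": [3.0, 4.0]})

# validate raises MergeError if EmployeeID is duplicated on either side;
# indicator adds a "_merge" column telling where each row was found
merged = left.merge(
    right, on="EmployeeID", how="left", validate="one_to_one", indicator=True
)

# Employees present on the left but absent from the survey table
unmatched = merged.loc[merged["_merge"] == "left_only", "EmployeeID"].tolist()
merged = merged.drop(columns="_merge")
```

Here `unmatched` lists the employees whose survey columns will be NaN after the join, which is useful to know before imputation.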

In [None]:
# Join all DataFrames
df = general_data.copy()

# Merge with the employee survey
df = df.merge(employee_survey, on='EmployeeID', how='left')

# Merge with the manager survey
df = df.merge(manager_survey, on='EmployeeID', how='left')

# Merge with the time features
df = df.merge(time_features, on='EmployeeID', how='left')

print(f"Final DataFrame shape: {df.shape}")
print(f"\nColumns ({len(df.columns)}):")
for i, col in enumerate(df.columns):
    print(f"  {i+1}. {col}")

In [None]:
# Preview of the merged DataFrame
print("Data preview:")
display(df.head())

print("\nData types:")
print(df.dtypes)

---
# 2. Exploratory Data Analysis (EDA)

Before building the model, we explore the data to understand the patterns and the relationships between variables.

### 2.1 Distribution of the target variable (Attrition)

In [None]:
# Distribution of the target variable
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count plot
attrition_counts = df['Attrition'].value_counts()
colors = ['#2ecc71', '#e74c3c']
axes[0].bar(attrition_counts.index, attrition_counts.values, color=colors, edgecolor='black')
axes[0].set_title('Attrition Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Attrition')
axes[0].set_ylabel('Number of employees')

# Add the counts above the bars
for idx, val in attrition_counts.items():
    axes[0].text(idx, val + 20, str(val), ha='center', fontsize=12, fontweight='bold')

# Pie chart
axes[1].pie(attrition_counts.values, labels=attrition_counts.index, autopct='%1.1f%%',
            colors=colors, explode=(0, 0.05), shadow=True, startangle=90)
axes[1].set_title('Attrition Proportion', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

# Stats
attrition_rate = (df['Attrition'] == 'Yes').mean() * 100
print(f"\n📊 Attrition rate: {attrition_rate:.2f}%")
print("⚠️  Imbalanced dataset - to be taken into account when modeling")

### 2.2 Pairwise correlation heatmap

Let's visualize the correlations between the numeric variables.

In [None]:
# Temporary encoding of Attrition for the correlation
df_corr = df.copy()
df_corr['Attrition_encoded'] = (df_corr['Attrition'] == 'Yes').astype(int)

# Select the numeric columns
numeric_cols = df_corr.select_dtypes(include=[np.number]).columns.tolist()

# Correlation matrix
correlation_matrix = df_corr[numeric_cols].corr()

# Heatmap (upper triangle masked to avoid showing each pair twice)
plt.figure(figsize=(16, 12))
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
sns.heatmap(correlation_matrix, mask=mask, annot=True, fmt='.2f', 
            cmap='RdBu_r', center=0, square=True, linewidths=0.5,
            annot_kws={'size': 8}, vmin=-1, vmax=1)
plt.title('Pairwise Correlation Heatmap', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

# Top correlations with Attrition (ranked by absolute value, sign shown separately)
print("\n📈 Top 10 correlations with Attrition:")
attrition_corr = correlation_matrix['Attrition_encoded'].drop('Attrition_encoded').abs().sort_values(ascending=False)
for col, corr in attrition_corr.head(10).items():
    direction = '+' if correlation_matrix.loc[col, 'Attrition_encoded'] > 0 else '-'
    print(f"  {direction} {col}: {corr:.3f}")

### 2.3 Distribution of numeric variables

In [None]:
# Numeric variables to plot (excluding IDs)
numeric_features = [col for col in numeric_cols if 'ID' not in col and 'encoded' not in col]

# Number of subplot rows needed
n_cols = 4
n_rows = (len(numeric_features) + n_cols - 1) // n_cols

fig, axes = plt.subplots(n_rows, n_cols, figsize=(16, 3.5 * n_rows))
axes = axes.flatten()

for idx, col in enumerate(numeric_features):
    ax = axes[idx]
    
    # Histogram per Attrition class
    for attrition_val, color, label in [('No', '#2ecc71', 'Stays'), ('Yes', '#e74c3c', 'Leaves')]:
        data = df[df['Attrition'] == attrition_val][col].dropna()
        ax.hist(data, bins=20, alpha=0.6, color=color, label=label, edgecolor='white')
    
    ax.set_title(col, fontsize=10, fontweight='bold')
    ax.legend(fontsize=8)
    ax.tick_params(labelsize=8)

# Hide the unused axes
for idx in range(len(numeric_features), len(axes)):
    axes[idx].set_visible(False)

plt.suptitle('Distribution of Numeric Variables by Attrition', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

### 2.4 Distribution of categorical variables

In [None]:
# Categorical variables
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
categorical_cols = [col for col in categorical_cols if col != 'Attrition']

fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.flatten()

for idx, col in enumerate(categorical_cols[:6]):
    ax = axes[idx]
    
    # Attrition proportions per category
    cross_tab = pd.crosstab(df[col], df['Attrition'], normalize='index') * 100
    
    cross_tab.plot(kind='bar', ax=ax, color=['#2ecc71', '#e74c3c'], edgecolor='black')
    ax.set_title(f'Attrition by {col}', fontsize=11, fontweight='bold')
    ax.set_xlabel('')
    ax.set_ylabel('Percentage (%)')
    ax.legend(['Stays', 'Leaves'], loc='upper right')
    ax.tick_params(axis='x', rotation=45, labelsize=9)
    ax.set_ylim(0, 100)

plt.suptitle('Attrition Rate by Categorical Variable', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

---
# 3. Data Cleaning

Cleaning steps:
1. Removing problematic columns (collinearity, GDPR, ethics, leakage)
2. Imputing missing values
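For the imputation step, it matters where the fill values come from: a median computed on the whole dataset also uses rows that later end up in the held-out sets. A minimal sketch of the train-only alternative, using scikit-learn's `SimpleImputer` on made-up values for `Worktime_mean` (the numbers and the `train` / `holdout` frames are purely illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical values: one missing entry in each split
train = pd.DataFrame({"Worktime_mean": [7.0, 8.0, np.nan, 9.5]})
holdout = pd.DataFrame({"Worktime_mean": [np.nan, 12.0]})

# The median is learned from the training rows only...
imputer = SimpleImputer(strategy="median").fit(train)

# ...and then reused unchanged on the held-out rows
holdout_filled = imputer.transform(holdout)
train_median = float(imputer.statistics_[0])
```

The holdout value of 12.0 has no influence on `train_median`, which is the point of fitting on the training split alone.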

### 3.1 Missing value analysis

In [None]:
# Missing value analysis
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
missing_df = pd.DataFrame({'Missing': missing, 'Percentage': missing_pct})
missing_df = missing_df[missing_df['Missing'] > 0].sort_values('Missing', ascending=False)

if len(missing_df) > 0:
    print("📊 Columns with missing values:")
    display(missing_df)
    
    # Visualization
    plt.figure(figsize=(12, 5))
    plt.bar(missing_df.index, missing_df['Percentage'], color='#e74c3c', edgecolor='black')
    plt.title('Percentage of Missing Values per Column', fontsize=14, fontweight='bold')
    plt.ylabel('Percentage (%)')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
else:
    print("✅ No missing values in the dataset!")

### 3.2 Identifying and removing problematic columns

**Removal criteria:**
- **Collinearity**: Variables strongly correlated with one another (redundancy)
- **GDPR non-compliance**: Sensitive personal data
- **Ethics**: Potentially discriminatory variables
- **Leakage**: Variables that would leak the target information
- **Irrelevance**: Variables with no predictive power
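Among these criteria, the "no predictive power" case can be partly automated: a column with a single distinct value cannot help any model. The helper below is a small sketch under that assumption; `constant_columns` and the demo frame are hypothetical names and values, not part of the notebook.

```python
import pandas as pd

def constant_columns(frame):
    """Return the columns holding at most one distinct non-missing value."""
    n_unique = frame.nunique(dropna=True)
    return n_unique[n_unique <= 1].index.tolist()

# Hypothetical miniature frame with two constant columns
demo = pd.DataFrame({
    "EmployeeCount": [1, 1, 1],
    "Over18": ["Y", "Y", "Y"],
    "Worktime_mean": [7.5, 8.0, 9.1],
})
constant_found = constant_columns(demo)
```

Running such a check on the merged DataFrame gives an objective list to compare against the manually written justifications.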

In [None]:
# Columns to drop, with justification
columns_to_drop = {
    'EmployeeID': 'Not relevant - Unique identifier with no predictive value',
    'Gender': 'Ethics - Potentially discriminatory (gender bias)',
    'Over18': 'Not relevant - Constant value (everyone is over 18)',
    'StandardHours': 'Not relevant - Constant value (80h for everyone)',
    'EmployeeCount': 'Not relevant - Constant value (1 for everyone)',
    'Departure_mean': 'Collinearity - Too strongly correlated with Worktime_mean',
    'Age': 'Ethics - Risk of age discrimination',
    'MaritalStatus': 'Ethics - Personal information, potentially discriminatory'
}

# Keep only the columns that actually exist in the DataFrame
existing_cols_to_drop = [col for col in columns_to_drop.keys() if col in df.columns]

print("🗑️ Columns to drop:")
print("=" * 70)
for col in existing_cols_to_drop:
    reason = columns_to_drop[col]
    print(f"  • {col}: {reason}")

# Drop the columns
df_clean = df.drop(columns=existing_cols_to_drop, errors='ignore')

print(f"\n✅ Columns dropped: {len(existing_cols_to_drop)}")
print(f"📊 Shape after dropping: {df_clean.shape}")

### 3.3 Detecting and handling collinearity

In [None]:
# Collinearity analysis between numeric variables
numeric_cols_clean = df_clean.select_dtypes(include=[np.number]).columns.tolist()
corr_matrix = df_clean[numeric_cols_clean].corr().abs()

# Find strongly correlated pairs (> 0.85)
threshold = 0.85
upper_tri = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

high_corr_pairs = []
for col in upper_tri.columns:
    for idx in upper_tri.index:
        if upper_tri.loc[idx, col] > threshold:
            high_corr_pairs.append((idx, col, upper_tri.loc[idx, col]))

if high_corr_pairs:
    print(f"⚠️ Pairs with correlation > {threshold}:")
    for var1, var2, corr in high_corr_pairs:
        print(f"  • {var1} ↔ {var2}: {corr:.3f}")
    
    # Columns to drop (keep the one more correlated with Attrition)
    cols_to_remove = set()
    df_temp = df_clean.copy()
    df_temp['Attrition_num'] = (df_temp['Attrition'] == 'Yes').astype(int)
    
    for var1, var2, _ in high_corr_pairs:
        corr1 = abs(df_temp[var1].corr(df_temp['Attrition_num']))
        corr2 = abs(df_temp[var2].corr(df_temp['Attrition_num']))
        
        # Drop the one less correlated with the target
        to_remove = var1 if corr1 < corr2 else var2
        cols_to_remove.add(to_remove)
        print(f"  → Dropping '{to_remove}' (less correlated with Attrition)")
    
    df_clean = df_clean.drop(columns=list(cols_to_remove), errors='ignore')
    print(f"\n✅ Shape after collinearity removal: {df_clean.shape}")
else:
    print("✅ No strong collinearity detected.")

### 3.4 Imputing missing values

In [None]:
# Check for missing values after cleaning
missing_after = df_clean.isnull().sum()
missing_cols = missing_after[missing_after > 0]

if len(missing_cols) > 0:
    print("📊 Columns with missing values to impute:")
    for col, count in missing_cols.items():
        dtype = df_clean[col].dtype
        print(f"  • {col}: {count} values ({dtype})")
    
    # Imputation (statistics are computed on the full dataset, before the train/test split)
    for col in missing_cols.index:
        if pd.api.types.is_numeric_dtype(df_clean[col]):
            # Median imputation for numeric columns
            median_val = df_clean[col].median()
            df_clean[col] = df_clean[col].fillna(median_val)
            print(f"  → {col}: imputed with the median ({median_val:.2f})")
        else:
            # Mode imputation for categorical columns
            mode_val = df_clean[col].mode()[0]
            df_clean[col] = df_clean[col].fillna(mode_val)
            print(f"  → {col}: imputed with the mode ({mode_val})")
    
    print("\n✅ Imputation complete!")
else:
    print("✅ No missing values to impute!")

# Final check
print(f"\n📊 Remaining missing values: {df_clean.isnull().sum().sum()}")

### 3.5 Displaying the final cleaned DataFrame

In [None]:
print("=" * 70)
print("FINAL CLEANED DATAFRAME")
print("=" * 70)
print(f"\nShape: {df_clean.shape}")
print(f"\nColumns ({len(df_clean.columns)}):")

# Separate numeric and categorical columns
num_cols = df_clean.select_dtypes(include=[np.number]).columns.tolist()
cat_cols = df_clean.select_dtypes(include=['object']).columns.tolist()

print(f"\n  📊 Numeric ({len(num_cols)}): {num_cols}")
print(f"\n  📝 Categorical ({len(cat_cols)}): {cat_cols}")

display(df_clean.head(10))

print("\n📈 Descriptive statistics:")
display(df_clean.describe())

---
# 4. Machine Learning Pipeline

Building the full pipeline:
1. Preparing the features and target
2. Train/test/validation split
3. Preprocessing (encoding + standardization)
4. Defining the models and hyperparameter grids
5. Tuning and training
6. Comparing the models and selecting the best one
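Steps 3 to 5 can also be chained into a single scikit-learn `Pipeline`, so that the imputer, scaler, and encoder are refit inside every cross-validation fold rather than once on the whole training set. The sketch below illustrates that variant on synthetic data; the column names reuse ones from this notebook, but the data, the target rule, and the small grid are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the cleaned data (assumption, not the real dataset)
rng = np.random.default_rng(42)
n = 200
demo_X = pd.DataFrame({
    "Worktime_mean": rng.normal(8, 1, n),
    "Department": rng.choice(["Sales", "R&D", "HR"], n),
})
demo_y = (demo_X["Worktime_mean"] + rng.normal(0, 0.5, n) > 8.5).astype(int)

numeric_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["Worktime_mean"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Department"]),
])

# Preprocessing and model travel together, so each CV fold fits its own scaler
clf = Pipeline([("preprocess", preprocess), ("model", LogisticRegression(max_iter=1000))])

X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.3, stratify=demo_y, random_state=42
)
search = GridSearchCV(clf, {"model__C": [0.1, 1, 10]}, cv=5, scoring="f1")
search.fit(X_tr, y_tr)
test_f1 = search.score(X_te, y_te)
```

Hyperparameters of the final estimator are addressed with the `model__` prefix, and the fitted `search.best_estimator_` can be saved as one object instead of a separate model and preprocessor.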

### 4.1 Preparing the features and target

In [None]:
# Separate features and target
X = df_clean.drop(columns=['Attrition'])
y = (df_clean['Attrition'] == 'Yes').astype(int)  # 1 = leaves, 0 = stays

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nTarget distribution:")
print(y.value_counts())

### 4.2 Train / Test / Validation Split

- **Train**: 70% - For training and tuning the models
- **Test**: 15% - For comparing the tuned models and selecting the best one
- **Validation**: 15% - Held out for the in-depth analysis of the selected model

In [None]:
# Shuffle and split
# First split: 70% train, 30% temp
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=RANDOM_STATE, stratify=y, shuffle=True
)

# Second split: 50/50 between test and validation (15% of the total each)
X_test, X_val, y_test, y_val = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=RANDOM_STATE, stratify=y_temp, shuffle=True
)

print("📊 Data split:")
print(f"  • Train:      {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"  • Test:       {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.1f}%)")
print(f"  • Validation: {X_val.shape[0]} samples ({X_val.shape[0]/len(X)*100:.1f}%)")

print("\n📈 Target distribution per set:")
for name, y_set in [('Train', y_train), ('Test', y_test), ('Validation', y_val)]:
    pct_pos = y_set.mean() * 100
    print(f"  • {name}: {pct_pos:.1f}% attrition")

### 4.3 Preprocessing Pipeline (Encoding + Standardization)

In [None]:
# Identify the column types
numeric_features = X.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X.select_dtypes(include=['object']).columns.tolist()

print(f"Numeric features ({len(numeric_features)}): {numeric_features}")
print(f"\nCategorical features ({len(categorical_features)}): {categorical_features}")

# Preprocessing pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Fit on the training set only, then transform every split
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)
X_val_processed = preprocessor.transform(X_val)

# Retrieve the feature names after encoding
cat_feature_names = preprocessor.named_transformers_['cat']['encoder'].get_feature_names_out(categorical_features).tolist()
all_feature_names = numeric_features + cat_feature_names

print(f"\n✅ Preprocessing complete!")
print(f"   Shape after transformation: {X_train_processed.shape}")
print(f"   Total number of features: {len(all_feature_names)}")

### 4.4 Defining the Models and Hyperparameters

In [None]:
# Models and their hyperparameter grids
models_config = {
    'Logistic Regression': {
        'model': LogisticRegression(random_state=RANDOM_STATE, max_iter=1000),
        'params': {
            'C': [0.001, 0.01, 0.1, 1, 10, 100],
            'penalty': ['l1', 'l2'],
            'solver': ['liblinear', 'saga'],
            'class_weight': [None, 'balanced']
        }
    },
    'Random Forest': {
        'model': RandomForestClassifier(random_state=RANDOM_STATE, n_jobs=-1),
        'params': {
            'n_estimators': [50, 100, 200],
            'max_depth': [5, 10, 15, None],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4],
            'class_weight': [None, 'balanced']
        }
    },
    'Perceptron': {
        'model': Perceptron(random_state=RANDOM_STATE),
        'params': {
            'penalty': [None, 'l2', 'l1', 'elasticnet'],
            'alpha': [0.0001, 0.001, 0.01, 0.1],
            'max_iter': [500, 1000, 2000],
            'class_weight': [None, 'balanced']
        }
    },
    'HistGradientBoosting': {
        'model': HistGradientBoostingClassifier(random_state=RANDOM_STATE),
        'params': {
            'learning_rate': [0.01, 0.05, 0.1, 0.2],
            'max_iter': [100, 200, 300],
            'max_depth': [3, 5, 7, None],
            'min_samples_leaf': [10, 20, 30],
            'l2_regularization': [0, 0.1, 1.0]
        }
    }
}

print("📋 Configured models:")
for name, config in models_config.items():
    n_combinations = 1
    for param_values in config['params'].values():
        n_combinations *= len(param_values)
    print(f"  • {name}: {n_combinations} hyperparameter combinations")

### 4.5 Hyperparameter Tuning and Training

In [None]:
# Storage for the results
results = {}
best_models = {}

print("🚀 Starting hyperparameter tuning...\n")
print("=" * 70)

for model_name, config in models_config.items():
    print(f"\n📊 {model_name}")
    print("-" * 50)
    
    # GridSearchCV with cross-validation
    grid_search = GridSearchCV(
        estimator=config['model'],
        param_grid=config['params'],
        cv=5,
        scoring='f1',  # F1 because the dataset is imbalanced
        n_jobs=-1,
        verbose=0
    )
    
    # Fit
    grid_search.fit(X_train_processed, y_train)
    
    # Best model
    best_model = grid_search.best_estimator_
    best_models[model_name] = best_model
    
    # Predictions on the test set (Perceptron has no predict_proba, hence the check)
    y_pred = best_model.predict(X_test_processed)
    y_proba = best_model.predict_proba(X_test_processed)[:, 1] if hasattr(best_model, 'predict_proba') else None
    
    # Metrics
    metrics = {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred),
        'recall': recall_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred),
        'roc_auc': roc_auc_score(y_test, y_proba) if y_proba is not None else None,
        'best_params': grid_search.best_params_,
        'cv_score': grid_search.best_score_
    }
    results[model_name] = metrics
    
    # Display
    print(f"  Best CV F1 Score: {metrics['cv_score']:.4f}")
    print(f"  Test Accuracy:    {metrics['accuracy']:.4f}")
    print(f"  Test Precision:   {metrics['precision']:.4f}")
    print(f"  Test Recall:      {metrics['recall']:.4f}")
    print(f"  Test F1:          {metrics['f1']:.4f}")
    if metrics['roc_auc'] is not None:
        print(f"  Test ROC AUC:     {metrics['roc_auc']:.4f}")
    print(f"  Best params: {metrics['best_params']}")

print("\n" + "=" * 70)
print("✅ Tuning complete!")

### 4.6 Model Comparison Charts

In [None]:
# DataFrame of results (ROC AUC is NaN for models without predict_proba)
results_df = pd.DataFrame({
    'Model': results.keys(),
    'Accuracy': [r['accuracy'] for r in results.values()],
    'Precision': [r['precision'] for r in results.values()],
    'Recall': [r['recall'] for r in results.values()],
    'F1 Score': [r['f1'] for r in results.values()],
    'ROC AUC': [r['roc_auc'] if r['roc_auc'] is not None else np.nan for r in results.values()],
    'CV F1': [r['cv_score'] for r in results.values()]
}).set_index('Model')

print("📊 Performance comparison table:")
display(results_df.style.highlight_max(axis=0, color='green'))

In [None]:
# Comparative visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# 1. Bar plot of the metrics
metrics_to_plot = ['Accuracy', 'Precision', 'Recall', 'F1 Score']
x = np.arange(len(results_df.index))
width = 0.2
colors = ['#3498db', '#2ecc71', '#e74c3c', '#9b59b6']

for i, metric in enumerate(metrics_to_plot):
    axes[0].bar(x + i * width, results_df[metric], width, label=metric, color=colors[i])

axes[0].set_xlabel('Model')
axes[0].set_ylabel('Score')
axes[0].set_title('Metric Comparison by Model', fontsize=14, fontweight='bold')
axes[0].set_xticks(x + width * 1.5)
axes[0].set_xticklabels(results_df.index, rotation=15, ha='right')
axes[0].legend(loc='lower right')
axes[0].set_ylim(0, 1)
axes[0].axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)

# 2. ROC Curves
for model_name, model in best_models.items():
    if hasattr(model, 'predict_proba'):
        y_proba = model.predict_proba(X_test_processed)[:, 1]
        fpr, tpr, _ = roc_curve(y_test, y_proba)
        auc = roc_auc_score(y_test, y_proba)
        axes[1].plot(fpr, tpr, label=f'{model_name} (AUC={auc:.3f})', linewidth=2)

axes[1].plot([0, 1], [0, 1], 'k--', label='Random (AUC=0.5)')
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].set_title('ROC Curves', fontsize=14, fontweight='bold')
axes[1].legend(loc='lower right')
axes[1].set_xlim([0, 1])
axes[1].set_ylim([0, 1])

plt.tight_layout()
plt.show()

### 4.7 Choosing the Best Model

In [None]:
# Select the best model by test-set F1 score (balances precision and recall)
best_model_name = results_df['F1 Score'].idxmax()
best_model_final = best_models[best_model_name]
best_metrics = results[best_model_name]

print("🏆 BEST MODEL SELECTED")
print("=" * 50)
print(f"\n  Model: {best_model_name}")
print(f"\n  Performance on the test set:")
print(f"    • Accuracy:  {best_metrics['accuracy']:.4f}")
print(f"    • Precision: {best_metrics['precision']:.4f}")
print(f"    • Recall:    {best_metrics['recall']:.4f}")
print(f"    • F1 Score:  {best_metrics['f1']:.4f}")
if best_metrics['roc_auc'] is not None:
    print(f"    • ROC AUC:   {best_metrics['roc_auc']:.4f}")
print(f"\n  Best hyperparameters:")
for param, value in best_metrics['best_params'].items():
    print(f"    • {param}: {value}")

### 4.8 Saving the Model

In [None]:
# Save the model and the preprocessor
model_filename = 'attrition_model.joblib'
preprocessor_filename = 'attrition_preprocessor.joblib'

joblib.dump(best_model_final, model_filename)
joblib.dump(preprocessor, preprocessor_filename)

print("💾 Model saved!")
print(f"  • Model:        {model_filename}")
print(f"  • Preprocessor: {preprocessor_filename}")

# Save the metadata
metadata = {
    'model_name': best_model_name,
    'best_params': best_metrics['best_params'],
    'metrics': {
        'accuracy': best_metrics['accuracy'],
        'precision': best_metrics['precision'],
        'recall': best_metrics['recall'],
        'f1': best_metrics['f1'],
        'roc_auc': best_metrics['roc_auc']
    },
    'feature_names': all_feature_names,
    'training_date': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
}

joblib.dump(metadata, 'attrition_metadata.joblib')
print(f"  • Metadata:     attrition_metadata.joblib")

---
# 5. In-Depth Model Analysis

A detailed analysis on the validation set to understand the predictions and the drivers of attrition.
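The plots in this section use the default decision threshold of 0.5. On an imbalanced target, another threshold may give a better precision/recall trade-off. As a rough sketch (the helper `best_f1_threshold` and the toy scores are assumptions for illustration), the threshold that maximizes F1 can be read off `precision_recall_curve`:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, y_score):
    """Return the score threshold with the highest F1, and that F1."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision and recall have one more entry than thresholds; drop the final point
    p, r = precision[:-1], recall[:-1]
    f1 = 2 * p * r / np.maximum(p + r, 1e-12)
    best = int(np.argmax(f1))
    return float(thresholds[best]), float(f1[best])

# Toy labels and predicted probabilities (hypothetical)
toy_true = np.array([0, 0, 0, 1, 0, 1, 1])
toy_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.90])
toy_threshold, toy_f1 = best_f1_threshold(toy_true, toy_score)
```

A threshold chosen this way should be picked on data that is not reused for the final performance estimate, otherwise the reported F1 is optimistic.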

### 5.1 Performance on the Validation Set

In [None]:
# Predictions on the validation set
y_val_pred = best_model_final.predict(X_val_processed)
y_val_proba = best_model_final.predict_proba(X_val_processed)[:, 1] if hasattr(best_model_final, 'predict_proba') else None

print("📊 Performance on the VALIDATION set")
print("=" * 50)
print(f"\n{classification_report(y_val, y_val_pred, target_names=['Stays', 'Leaves'])}")

In [None]:
# Advanced visualizations on the validation set
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

# 1. Confusion matrix
cm = confusion_matrix(y_val, y_val_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0, 0],
            xticklabels=['Stays', 'Leaves'], yticklabels=['Stays', 'Leaves'])
axes[0, 0].set_title('Confusion Matrix (Validation)', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Predicted')
axes[0, 0].set_ylabel('Actual')

# 2. ROC curve
if y_val_proba is not None:
    fpr, tpr, thresholds = roc_curve(y_val, y_val_proba)
    auc = roc_auc_score(y_val, y_val_proba)
    axes[0, 1].plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC (AUC = {auc:.3f})')
    axes[0, 1].fill_between(fpr, tpr, alpha=0.3)
    axes[0, 1].plot([0, 1], [0, 1], 'k--')
    axes[0, 1].set_xlabel('False Positive Rate')
    axes[0, 1].set_ylabel('True Positive Rate')
    axes[0, 1].set_title('ROC Curve (Validation)', fontsize=12, fontweight='bold')
    axes[0, 1].legend(loc='lower right')

# 3. Precision-Recall curve
if y_val_proba is not None:
    precision_curve, recall_curve, _ = precision_recall_curve(y_val, y_val_proba)
    ap = average_precision_score(y_val, y_val_proba)
    axes[1, 0].plot(recall_curve, precision_curve, 'g-', linewidth=2, label=f'AP = {ap:.3f}')
    axes[1, 0].fill_between(recall_curve, precision_curve, alpha=0.3, color='green')
    axes[1, 0].set_xlabel('Recall')
    axes[1, 0].set_ylabel('Precision')
    axes[1, 0].set_title('Precision-Recall Curve (Validation)', fontsize=12, fontweight='bold')
    axes[1, 0].legend(loc='upper right')

# 4. Distribution of predicted probabilities
if y_val_proba is not None:
    for label, color, name in [(0, '#2ecc71', 'Stays'), (1, '#e74c3c', 'Leaves')]:
        mask = y_val == label
        axes[1, 1].hist(y_val_proba[mask], bins=30, alpha=0.6, color=color, label=name, edgecolor='white')
    axes[1, 1].axvline(x=0.5, color='black', linestyle='--', label='Threshold (0.5)')
    axes[1, 1].set_xlabel('Probability of Leaving')
    axes[1, 1].set_ylabel('Frequency')
    axes[1, 1].set_title('Distribution of Predicted Probabilities', fontsize=12, fontweight='bold')
    axes[1, 1].legend()

plt.tight_layout()
plt.show()

### 5.2 Feature Importance
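One-hot encoding splits each categorical variable into several columns, so permutation importance computed on the transformed matrix is reported per encoded column. A possible alternative, sketched here on synthetic data, is to permute the raw columns of a full `Pipeline`, which yields one score per original variable. The data, the target rule, and the pipeline below are assumptions for the example, and the demo scores on its own training data only to stay short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic data: the target depends on Worktime_mean only (assumption)
rng = np.random.default_rng(0)
n = 300
raw = pd.DataFrame({
    "Worktime_mean": rng.normal(8, 1, n),
    "Department": rng.choice(["Sales", "R&D", "HR"], n),
})
target = (raw["Worktime_mean"] > 8.5).astype(int)

encode = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["Department"])],
    remainder="passthrough",
)
pipe = Pipeline([
    ("prep", encode),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
]).fit(raw, target)

# Permuting raw columns: Department is shuffled as a whole, before encoding
perm = permutation_importance(pipe, raw, target, n_repeats=5, random_state=0, scoring="f1")
by_column = pd.Series(perm.importances_mean, index=raw.columns).sort_values(ascending=False)
```

In practice the permutation would be run on held-out rows, as with the validation set in this section.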

In [None]:
from sklearn.inspection import permutation_importance

print("üîÑ Calcul de l'importance des features (permutation)...")


perm_importance = permutation_importance(
    best_model_final,  
    X_test_processed, 
    y_test,
    n_repeats=10,
    random_state=42,
    n_jobs=-1
)

importances = perm_importance.importances_mean

# DataFrame of importances, sorted from most to least important
feature_importance_df = pd.DataFrame({
    'Feature': all_feature_names,
    'Importance': importances,
    'Std': perm_importance.importances_std
}).sort_values('Importance', ascending=False)

print("📊 Top 15 most important features:")
display(feature_importance_df.head(15))

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 8))

top_n = 20
top_features = feature_importance_df.head(top_n)

# Horizontal bar plot with error bars
colors = plt.cm.RdYlGn_r(np.linspace(0.2, 0.8, len(top_features)))
axes[0].barh(
    range(len(top_features)), 
    top_features['Importance'].values, 
    xerr=top_features['Std'].values,
    color=colors,
    alpha=0.8
)
axes[0].set_yticks(range(len(top_features)))
axes[0].set_yticklabels(top_features['Feature'].values)
axes[0].invert_yaxis()
axes[0].set_xlabel('Importance (Permutation)')
axes[0].set_title(f'Top {top_n} Most Important Features', fontsize=14, fontweight='bold')

# Cumulative importance. Permutation importances can be negative, so they are
# clipped at 0 to keep the cumulative share monotonic and ending at 1.
positive_importance = feature_importance_df['Importance'].clip(lower=0).values
cumulative_importance = np.cumsum(positive_importance) / positive_importance.sum()
axes[1].plot(range(1, len(cumulative_importance) + 1), cumulative_importance, 'b-', linewidth=2)
axes[1].axhline(y=0.9, color='r', linestyle='--', label='90% of importance')
axes[1].axhline(y=0.95, color='orange', linestyle='--', label='95% of importance')

# Smallest number of features whose cumulative share reaches each level
n_90 = np.argmax(cumulative_importance >= 0.9) + 1
n_95 = np.argmax(cumulative_importance >= 0.95) + 1
axes[1].axvline(x=n_90, color='r', linestyle=':', alpha=0.5)
axes[1].axvline(x=n_95, color='orange', linestyle=':', alpha=0.5)

axes[1].set_xlabel('Number of Features')
axes[1].set_ylabel('Cumulative Importance')
axes[1].set_title('Cumulative Feature Importance', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].set_xlim([0, len(cumulative_importance)])

plt.tight_layout()
plt.show()

print(f"\n📈 {n_90} features account for 90% of the total importance")
print(f"📈 {n_95} features account for 95% of the total importance")
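
By default, `permutation_importance` uses the estimator's own score, which is accuracy for scikit-learn classifiers. With an imbalanced target such as attrition, a score like F1 may rank features differently. The sketch below shows the `scoring` parameter on synthetic data and a cumulative share computed on importances clipped at zero. Every `*_demo` name, the random forest, and the generated dataset are assumptions made for illustration; they are not the notebook's model or data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (roughly 20% positives); illustration only.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=2, n_redundant=0,
    weights=[0.8, 0.2], random_state=42,
)
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)
model_demo = RandomForestClassifier(n_estimators=50, random_state=42)
model_demo.fit(X_train_demo, y_train_demo)

# Measure the drop in F1 (instead of the default accuracy) when a column is shuffled.
result_demo = permutation_importance(
    model_demo, X_test_demo, y_test_demo,
    scoring="f1", n_repeats=5, random_state=42,
)

# Clip negative importances to 0 so the cumulative share is non-decreasing.
clipped = np.clip(result_demo.importances_mean, 0, None)
order = np.argsort(clipped)[::-1]
total = clipped.sum()
cumulative_demo = np.cumsum(clipped[order]) / total if total > 0 else np.zeros_like(clipped)
n_90_demo = int(np.argmax(cumulative_demo >= 0.9)) + 1
print(f"{n_90_demo} of {len(clipped)} features reach 90% of the clipped importance")
```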

### 5.3 Important Features by Department

Let's examine how the attrition factors vary by department.

In [None]:
# Analysis by department
if 'Department' in df_clean.columns:
    print("📊 Attrition rate by department:")
    dept_attrition = df_clean.groupby('Department')['Attrition'].apply(
        lambda x: (x == 'Yes').mean() * 100
    ).sort_values(ascending=False)
    
    for dept, rate in dept_attrition.items():
        print(f"  • {dept}: {rate:.1f}%")
    
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    
    # Bar plot of the attrition rate
    colors = plt.cm.RdYlGn_r(dept_attrition.values / dept_attrition.max())
    axes[0].bar(dept_attrition.index, dept_attrition.values, color=colors, edgecolor='black')
    axes[0].set_ylabel('Attrition rate (%)')
    axes[0].set_title('Attrition Rate by Department', fontsize=14, fontweight='bold')
    axes[0].tick_params(axis='x', rotation=45)
    
    # Annotate each bar with its value
    for i, val in enumerate(dept_attrition.values):
        axes[0].text(i, val + 0.5, f'{val:.1f}%', ha='center', fontsize=10)
    
    # Number of employees per department
    dept_counts = df_clean['Department'].value_counts()
    axes[1].pie(dept_counts.values, labels=dept_counts.index, autopct='%1.1f%%',
                colors=plt.cm.Set3.colors[:len(dept_counts)], explode=[0.02]*len(dept_counts))
    axes[1].set_title('Distribution of Employees by Department', fontsize=14, fontweight='bold')
    
    plt.tight_layout()
    plt.show()
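
A rate computed on a small department can swing widely with a single departure, so it may help to read each rate next to its head count. Here is a minimal sketch on a made-up frame; the `demo_dept` DataFrame and its values are hypothetical, and only the `Department` and `Attrition` column names come from the notebook.

```python
import pandas as pd

# Hypothetical data for illustration; 'Yes' marks an employee who left.
demo_dept = pd.DataFrame({
    "Department": ["Sales", "Sales", "Sales", "Sales", "HR", "HR",
                   "R&D", "R&D", "R&D", "R&D"],
    "Attrition": ["Yes", "No", "No", "No", "Yes", "No",
                  "No", "No", "No", "Yes"],
})

# One row per department: head count, number of leavers, and rate in percent.
dept_summary = (
    demo_dept.assign(left=demo_dept["Attrition"].eq("Yes"))
    .groupby("Department")["left"]
    .agg(n_employees="count", n_left="sum", attrition_rate="mean")
    .sort_values("attrition_rate", ascending=False)
)
dept_summary["attrition_rate"] = dept_summary["attrition_rate"] * 100
print(dept_summary)
```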

In [None]:
# Analysis of the important numeric features by department and attrition
if importances is not None and 'Department' in df_clean.columns:
    # Up to 4 numeric features taken from the 10 most important features
    top_numeric_features = [
        f for f in feature_importance_df['Feature'].head(10).values 
        if f in numeric_features
    ][:4]
    
    if len(top_numeric_features) > 0:
        fig, axes = plt.subplots(len(top_numeric_features), 1, figsize=(14, 4 * len(top_numeric_features)))
        if len(top_numeric_features) == 1:
            axes = [axes]
        
        for idx, feature in enumerate(top_numeric_features):
            ax = axes[idx]
            
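            # Boxplot by department and attrition
            data_plot = df_clean[[feature, 'Department', 'Attrition']].copy()
            
            # Fix the hue order (assumes 'No'/'Yes' values in Attrition) so that
            # colors and legend labels always map to the same class
            sns.boxplot(data=data_plot, x='Department', y=feature, hue='Attrition',
                       hue_order=['No', 'Yes'], palette=['#2ecc71', '#e74c3c'], ax=ax)
            ax.set_title(f'{feature} by Department and Attrition', fontsize=12, fontweight='bold')
            ax.tick_params(axis='x', rotation=45)
            # Reuse seaborn's legend handles so the renamed labels keep their colors
            handles, _ = ax.get_legend_handles_labels()
            ax.legend(handles, ['Stays', 'Leaves'], title='Attrition')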
        
        plt.tight_layout()
        plt.show()
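
The boxplots can be complemented by a small table of medians, which makes the gap between leavers and stayers easier to compare across departments. The following is a sketch on made-up numbers; `demo_box` and the `MonthlyIncome` column are hypothetical placeholders for `df_clean` and one of the selected numeric features.

```python
import pandas as pd

# Hypothetical data for illustration only.
demo_box = pd.DataFrame({
    "Department": ["Sales"] * 4 + ["R&D"] * 4,
    "Attrition": ["Yes", "Yes", "No", "No"] * 2,
    "MonthlyIncome": [30, 40, 60, 70, 50, 50, 55, 65],
})

# Median of the feature per department, split by attrition status.
medians = demo_box.pivot_table(
    index="Department", columns="Attrition",
    values="MonthlyIncome", aggfunc="median",
)

# Negative gap: leavers have a lower median than stayers in that department.
medians["gap"] = medians["Yes"] - medians["No"]
print(medians)
```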

### 5.4 Summary and Recommendations

In [None]:
print("=" * 70)
print("📋 SUMMARY AND RECOMMENDATIONS")
print("=" * 70)

print(f"\n🏆 Best model: {best_model_name}")
print(f"   F1 Score: {best_metrics['f1']:.4f}")
if best_metrics['roc_auc'] is not None:
    print(f"   ROC AUC: {best_metrics['roc_auc']:.4f}")

if importances is not None:
    print("\n📊 Top 5 attrition factors:")
    # The DataFrame index holds the original feature position, not the rank,
    # so the rank comes from enumerate.
    for rank, (_, row) in enumerate(feature_importance_df.head(5).iterrows(), start=1):
        print(f"   {rank}. {row['Feature']} (importance: {row['Importance']:.4f})")

print("\n💡 Recommendations to reduce attrition:")
print("   • Monitor employees whose values on the most important features resemble those of past leavers")
print("   • Set up retention programs targeted by department")
print("   • Use this model to proactively identify at-risk employees")
print("   • Collect additional data on the reasons for leaving")

print("\n📁 Saved files:")
print("   • attrition_model.joblib")
print("   • attrition_preprocessor.joblib")
print("   • attrition_metadata.joblib")
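
As a rough sketch of how the saved artifacts might be reused for scoring, the example below does a `joblib` round trip with stand-in objects. It assumes the model file holds a fitted classifier and the preprocessor file a fitted transformer; the `StandardScaler`, the `LogisticRegression`, and the random data are placeholders, not the notebook's actual objects.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Stand-in training data and fitted objects (illustration only).
rng = np.random.default_rng(42)
X_fit_demo = rng.normal(size=(100, 3))
y_fit_demo = (X_fit_demo[:, 0] > 0).astype(int)

preprocessor_demo = StandardScaler().fit(X_fit_demo)
classifier_demo = LogisticRegression().fit(
    preprocessor_demo.transform(X_fit_demo), y_fit_demo
)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "attrition_model.joblib"
    prep_path = Path(tmp_dir) / "attrition_preprocessor.joblib"

    # Persist both objects, then reload them as a scoring script would.
    joblib.dump(classifier_demo, model_path)
    joblib.dump(preprocessor_demo, prep_path)
    loaded_model = joblib.load(model_path)
    loaded_preprocessor = joblib.load(prep_path)

# Score new rows: same preprocessing first, then the probability of class 1.
X_new_demo = rng.normal(size=(5, 3))
risk_demo = loaded_model.predict_proba(loaded_preprocessor.transform(X_new_demo))[:, 1]
print(np.round(risk_demo, 3))
```

New data has to go through the same preprocessor that was fitted during training, which is why the two objects are saved and loaded together.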