# Lab 1: Fraud Detection with Machine Learning
## Complete Solution

---

## 🎯 Lab Objectives

In this lab, you will learn to:

1. **Generate** a synthetic dataset of bank transactions
2. **Analyze** the data and understand the class imbalance
3. **Engineer features** relevant to fraud detection
4. **Handle** the imbalance with SMOTE (Synthetic Minority Over-sampling Technique)
5. **Train** several ML models (Logistic Regression, XGBoost, LightGBM)
6. **Evaluate** with suitable metrics (Precision, Recall, AUC)
7. **Launch** a SageMaker Training Job
8. **Track** runs with SageMaker Experiments
9. **Register** the model in the Model Registry

---

## 📚 Business Context

### The Fraud Problem

Banks lose **billions** every year to fraud:
- 🇺🇸 USA: $28.58 billion in 2020
- 🌍 Worldwide: $32.39 billion projected for 2025

**Types of fraud**:
- Unauthorized transactions
- Identity theft
- Phishing and social engineering
- Merchant fraud

### Why ML?

**Traditional approach** (rules):
```
IF amount > 1000 AND distance > 50km THEN flag
```
❌ Too many false positives
❌ Fraudsters adapt their strategies
❌ Hand-written rules are hard to maintain

**ML approach**:
✅ Learns complex patterns
✅ Adapts to new fraud techniques when retrained
✅ Lets you tune the Precision/Recall trade-off to business needs

### Main Challenge: Imbalanced Classes

Typically, **0.1% to 1%** of transactions are fraudulent.

```
Normal transactions: 99%  ████████████████████████████████
Fraudulent:           1%  █
```

⚠️ **Problem**: a model that always predicts "normal" would reach 99% accuracy while detecting no fraud at all!

**Solutions**:
1. Suitable metrics (Precision, Recall, F1, AUC)
2. Resampling techniques (SMOTE, undersampling)
3. Class weights in the models
4. Ensembles and boosting
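
To make the accuracy paradox concrete, here is a minimal sketch in plain Python. The 1,000-transaction sample with 10 frauds is invented for illustration, and `always_normal` is a hypothetical baseline, not one of the models trained in this lab.

```python
# Hypothetical sample: 1,000 transactions, 10 of them fraudulent (1%)
y_true = [1] * 10 + [0] * 990

# Baseline that flags nothing: every transaction is predicted "normal" (0)
always_normal = [0] * len(y_true)

# Accuracy = correct predictions / all predictions
correct = sum(t == p for t, p in zip(y_true, always_normal))
accuracy = correct / len(y_true)

# Recall = detected frauds / actual frauds
detected = sum(t == 1 and p == 1 for t, p in zip(y_true, always_normal))
recall = detected / sum(y_true)

print(f"Accuracy: {accuracy:.2%}")  # high, despite a useless model
print(f"Recall:   {recall:.2%}")    # no fraud is caught
```

The baseline looks excellent on accuracy and is worthless on recall, which is why the rest of the lab reports Precision, Recall, F1 and AUC.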

---

## ⏱️ Estimated Duration

- **Part 1 (Exploration)**: 15 minutes
- **Part 2 (Feature Engineering)**: 10 minutes
- **Part 3 (Local Modeling)**: 25 minutes
- **Part 4 (SageMaker Integration)**: 20 minutes
- **Total**: 70 minutes

---

# Fraud Detection Solution

## Complete implementation with best practices

This notebook provides the complete solution for the fraud detection exercise.

---

## 🔧 Setup: Environment Configuration

Let's start by importing the required libraries and configuring the environment.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import (
    classification_report, confusion_matrix, roc_auc_score, 
    precision_recall_curve, roc_curve, auc
)
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
import lightgbm as lgb
from imblearn.over_sampling import SMOTE
import joblib
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8')
%matplotlib inline

---

## 📊 Part 1: Data Generation and Exploration

### Why Synthetic Data?

For this lab, we generate synthetic data because:
- ✅ Real fraud data is **confidential**
- ✅ It lets us **control** the patterns and the fraud ratio
- ✅ There are no **compliance** issues (GDPR, PCI-DSS)
- ✅ It is **reproducible** for learning purposes

### Transaction Features

We will create realistic features:

| Feature | Description | Fraud Pattern |
|---------|-------------|----------------|
| `transaction_amount` | Amount in $ | ⬆️ Higher |
| `hour_of_day` | Hour (0-23) | 🌙 More at night |
| `day_of_week` | Day (0-6) | ➡️ No clear pattern |
| `merchant_category` | Merchant type | 💻 More online |
| `distance_from_home` | Distance (km) | ⬆️ Farther |
| `distance_from_last_transaction` | Distance from the previous transaction (km) | ⬆️ Farther |
| `transaction_velocity` | Transactions/day | ⬆️ Higher |

In [None]:
def generate_fraud_dataset(n_samples=100000, fraud_ratio=0.02):
    """
    Generate synthetic fraud detection dataset with realistic patterns
    """
    n_fraud = int(n_samples * fraud_ratio)
    n_legit = n_samples - n_fraud
    
    # Probabilities for hour of day (normalized)
    legit_hour_probs = np.array([
        0.01, 0.01, 0.01, 0.01, 0.01, 0.02, 0.04, 0.06,
        0.08, 0.07, 0.06, 0.07, 0.08, 0.07, 0.06, 0.05,
        0.04, 0.05, 0.06, 0.07, 0.06, 0.04, 0.03, 0.02
    ])
    legit_hour_probs = legit_hour_probs / legit_hour_probs.sum()
    
    fraud_hour_probs = np.array([
        0.08, 0.08, 0.07, 0.06, 0.05, 0.03, 0.02, 0.02,
        0.02, 0.02, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03,
        0.03, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.08
    ])
    fraud_hour_probs = fraud_hour_probs / fraud_hour_probs.sum()
    
    # Legitimate transactions
    legit_data = {
        'transaction_amount': np.random.gamma(2, 50, n_legit),
        'hour_of_day': np.random.choice(range(24), n_legit, p=legit_hour_probs),
        'day_of_week': np.random.randint(0, 7, n_legit),
        'merchant_category': np.random.choice(
            ['retail', 'grocery', 'gas', 'restaurant', 'online'],
            n_legit
        ),
        'distance_from_home': np.abs(np.random.normal(5, 10, n_legit)),
        'distance_from_last_transaction': np.abs(np.random.normal(3, 5, n_legit)),
        'transaction_velocity': np.random.poisson(2, n_legit),
        'is_fraud': np.zeros(n_legit, dtype=int)
    }
    
    # Fraudulent transactions (different patterns)
    fraud_data = {
        'transaction_amount': np.random.gamma(5, 100, n_fraud),  # Higher amounts
        'hour_of_day': np.random.choice(range(24), n_fraud, p=fraud_hour_probs),  # More at night
        'day_of_week': np.random.randint(0, 7, n_fraud),
        'merchant_category': np.random.choice(
            ['retail', 'grocery', 'gas', 'restaurant', 'online'],
            n_fraud,
            p=[0.15, 0.10, 0.15, 0.10, 0.50]  # More online
        ),
        'distance_from_home': np.abs(np.random.normal(50, 100, n_fraud)),  # Farther
        'distance_from_last_transaction': np.abs(np.random.normal(100, 200, n_fraud)),
        'transaction_velocity': np.random.poisson(8, n_fraud),  # Higher velocity
        'is_fraud': np.ones(n_fraud, dtype=int)
    }
    
    # Combine and shuffle
    df_legit = pd.DataFrame(legit_data)
    df_fraud = pd.DataFrame(fraud_data)
    df = pd.concat([df_legit, df_fraud], ignore_index=True)
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)
    
    return df

# Generate dataset
df = generate_fraud_dataset(n_samples=100000, fraud_ratio=0.02)

print("Dataset shape:", df.shape)
print("\nClass distribution:")
print(df['is_fraud'].value_counts())
print("\nFraud ratio:", df['is_fraud'].mean())
df.head()

## 2. Data Exploration

---

## 📈 Part 2: Visual Data Exploration

### Goal of the Exploration

Before modeling, we need to **understand** the data:
- ✅ Identify the **differences** between legitimate and fraudulent transactions
- ✅ Detect **patterns** that ML can exploit
- ✅ Check that there is no **data leakage**
- ✅ Confirm the class **imbalance**

### Questions to Answer

1. **Amount**: Are fraudulent transactions larger?
2. **Time**: Are there high-risk hours?
3. **Distance**: Do frauds happen far from home?
4. **Velocity**: Do fraudsters make more transactions in a short time?

Let's look at the visualizations below 👇

In [None]:
# Visualize feature distributions for legitimate vs fraudulent transactions
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Transaction amount
axes[0, 0].hist(df[df['is_fraud']==0]['transaction_amount'], bins=50, alpha=0.5, label='Legit')
axes[0, 0].hist(df[df['is_fraud']==1]['transaction_amount'], bins=50, alpha=0.5, label='Fraud')
axes[0, 0].set_xlabel('Transaction Amount')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].legend()
axes[0, 0].set_title('Transaction Amount Distribution')

# Hour of day
df.groupby(['hour_of_day', 'is_fraud']).size().unstack().plot(ax=axes[0, 1])
axes[0, 1].set_xlabel('Hour of Day')
axes[0, 1].set_ylabel('Count')
axes[0, 1].set_title('Transactions by Hour')

# Distance from home
axes[1, 0].hist(df[df['is_fraud']==0]['distance_from_home'], bins=50, alpha=0.5, label='Legit')
axes[1, 0].hist(df[df['is_fraud']==1]['distance_from_home'], bins=50, alpha=0.5, label='Fraud')
axes[1, 0].set_xlabel('Distance from Home')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].legend()
axes[1, 0].set_title('Distance from Home Distribution')

# Transaction velocity
axes[1, 1].hist(df[df['is_fraud']==0]['transaction_velocity'], bins=20, alpha=0.5, label='Legit')
axes[1, 1].hist(df[df['is_fraud']==1]['transaction_velocity'], bins=20, alpha=0.5, label='Fraud')
axes[1, 1].set_xlabel('Transaction Velocity')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].legend()
axes[1, 1].set_title('Transaction Velocity Distribution')

plt.tight_layout()
plt.show()

## 3. Feature Engineering

---

## 🔨 Part 3: Feature Engineering

### What Is Feature Engineering?

**Feature engineering** is the craft of deriving new variables from raw data to improve model performance.

### New Features

| Feature | Formula | Intuition |
|---------|---------|-----------|
| `is_weekend` | `day_of_week >= 5` | Behavior differs on weekends |
| `is_night` | `(hour >= 22) OR (hour <= 6)` | Fraud is more frequent at night |
| `amount_velocity_ratio` | `amount / (velocity + 1)` | Amount relative to transaction frequency |
| `distance_ratio` | `distance_last / (distance_home + 1)` | Ratio of the two distances |

The `+ 1` in both denominators avoids division by zero.

### Why These Features?

**Example**: `is_night`
- 🌙 Night-time transactions (22:00 to 06:59) are **more suspicious**
- 👤 Most people are asleep → unusual activity
- 🤖 The model can pick up this pattern easily

**Example**: `amount_velocity_ratio`
- 💰 Someone making 10 transactions of $1000 in quick succession → suspicious
- 📊 Captures the **relationship** between two variables
- 🎯 Can be more informative than either variable alone
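
As a rough sketch of how these formulas behave on one record, the plain-Python function below computes the four features for a single transaction, with 1 added to each denominator as in the notebook's pandas code. The function name `engineer_features` and the sample transaction are hypothetical and only serve the illustration.

```python
def engineer_features(tx: dict) -> dict:
    """Derive the four engineered features for a single transaction dict."""
    hour = tx["hour_of_day"]
    return {
        "is_weekend": int(tx["day_of_week"] >= 5),
        "is_night": int(hour >= 22 or hour <= 6),
        # "+ 1" keeps both ratios defined when velocity or distance is 0
        "amount_velocity_ratio": tx["transaction_amount"] / (tx["transaction_velocity"] + 1),
        "distance_ratio": tx["distance_from_last_transaction"] / (tx["distance_from_home"] + 1),
    }


# Hypothetical transaction: weekend day (day_of_week = 5), 23:00, $900
example_tx = {
    "transaction_amount": 900.0,
    "hour_of_day": 23,
    "day_of_week": 5,
    "distance_from_home": 49.0,
    "distance_from_last_transaction": 100.0,
    "transaction_velocity": 8,
}

features = engineer_features(example_tx)
print(features)
```

A per-record function like this can also be handy at inference time, when transactions arrive one by one rather than as a DataFrame.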

In [None]:
# Create additional features
df['is_weekend'] = (df['day_of_week'] >= 5).astype(int)
df['is_night'] = ((df['hour_of_day'] >= 22) | (df['hour_of_day'] <= 6)).astype(int)
df['amount_velocity_ratio'] = df['transaction_amount'] / (df['transaction_velocity'] + 1)
df['distance_ratio'] = df['distance_from_last_transaction'] / (df['distance_from_home'] + 1)

# Encode categorical variable
le = LabelEncoder()
df['merchant_category_encoded'] = le.fit_transform(df['merchant_category'])

print("Engineered features:")
df[['is_weekend', 'is_night', 'amount_velocity_ratio', 'distance_ratio', 'merchant_category_encoded']].head()

## 4. Data Preparation

---

## 🎲 Part 4: Preparing the Data for Modeling

### Train / Validation / Test Split

We split the data into **3 sets**:

```
Dataset (100%)
├── Train (60%)      → Model training
├── Validation (20%) → Hyperparameter tuning
└── Test (20%)       → Final evaluation (never seen during training)
```

### Why 3 Splits?

**Problem with 2 splits** (Train/Test only):
- ❌ Risk of **overfitting** to the test set
- ❌ No independent set for tuning

**Solution with 3 splits**:
1. **Train**: Fit the model
2. **Validation**: Tune the hyperparameters
3. **Test**: Final, **unbiased** evaluation

### Stratification

`stratify=y` ensures that every split has the **same fraud ratio** (2%):
```
Train:      98% legit, 2% fraud
Validation: 98% legit, 2% fraud  
Test:       98% legit, 2% fraud
```

Without stratification, the fraud ratio would vary from split to split, and on a small dataset a split could even end up with no fraud at all! 😱
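
The idea behind stratification, and the two-step 60/20/20 arithmetic, can be sketched with the standard library alone. This is a simplified illustration, not scikit-learn's implementation; `stratified_split` and the 10,000-label sample are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed=42):
    """Return (train_idx, test_idx) so that each class keeps its share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within the class, then cut
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Hypothetical labels: 10,000 transactions with 2% fraud
labels = [1] * 200 + [0] * 9800

# Step 1: hold out 20% of the data as the test set
temp_idx, test_idx = stratified_split(labels, test_fraction=0.2)

# Step 2: take 25% of the remaining 80% as validation (0.25 * 0.8 = 0.2 of the total)
temp_labels = [labels[i] for i in temp_idx]
train_pos, val_pos = stratified_split(temp_labels, test_fraction=0.25)
train_idx = [temp_idx[i] for i in train_pos]
val_idx = [temp_idx[i] for i in val_pos]

for name, idx in [("train", train_idx), ("val", val_idx), ("test", test_idx)]:
    ratio = sum(labels[i] for i in idx) / len(idx)
    print(f"{name}: {len(idx)} rows, fraud ratio {ratio:.4f}")
```

Splitting each class separately is what keeps the fraud ratio identical across the three sets; a plain random split only matches it on average.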

### Feature Scaling

**StandardScaler**: $(x - \mu) / \sigma$

**Why?**
- Features live on different scales (amount: 0-1000, hour: 0-23)
- Some algorithms (Logistic Regression, SVM) are sensitive to scale
- Tree-based models (XGBoost) are unaffected, but scaling does no harm

⚠️ **Important**: fit on train, then transform val/test (avoids data leakage)
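
The "fit on train, transform elsewhere" rule can be illustrated with a minimal standard-library sketch of the $(x - \mu) / \sigma$ formula. The helper names `fit_scaler` and `transform` and the sample amounts are hypothetical; the population standard deviation is used, which is also what `StandardScaler` computes.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn the mean and population standard deviation from training values only."""
    return mean(values), pstdev(values)


def transform(values, mu, sigma):
    """Apply (x - mu) / sigma using statistics learned elsewhere."""
    return [(x - mu) / sigma for x in values]


# Hypothetical transaction amounts
train_amounts = [10.0, 20.0, 30.0, 40.0, 50.0]
val_amounts = [30.0, 500.0]

mu, sigma = fit_scaler(train_amounts)            # statistics come from train only
train_scaled = transform(train_amounts, mu, sigma)
val_scaled = transform(val_amounts, mu, sigma)   # validation reuses the train statistics

print(f"mu={mu}, sigma={sigma:.4f}")
print("val scaled:", [round(v, 3) for v in val_scaled])
```

The validation outlier of 500 never influences `mu` or `sigma`; it is simply mapped to a large scaled value, which is the behavior we want when validation data stands in for unseen production data.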

In [None]:
# Select features
feature_cols = [
    'transaction_amount', 'hour_of_day', 'day_of_week',
    'distance_from_home', 'distance_from_last_transaction',
    'transaction_velocity', 'is_weekend', 'is_night',
    'amount_velocity_ratio', 'distance_ratio', 'merchant_category_encoded'
]

X = df[feature_cols]
y = df['is_fraud']

# Split data: 60% train, 20% validation, 20% test
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

print(f"Train set: {X_train.shape}, Fraud ratio: {y_train.mean():.4f}")
print(f"Validation set: {X_val.shape}, Fraud ratio: {y_val.mean():.4f}")
print(f"Test set: {X_test.shape}, Fraud ratio: {y_test.mean():.4f}")

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

## 5. Handle Imbalanced Data with SMOTE

---

## ⚖️ Part 5: Handling Imbalance with SMOTE

### The Imbalance Problem

With only 2% fraud, the model can easily learn to **always predict "normal"**:

```
Accuracy = 98% (always predicting 0)
But Recall = 0% (no fraud detected!) ❌
```

### What Is SMOTE?

**SMOTE** = Synthetic Minority Over-sampling Technique

**How does it work?**
1. Take a fraud example (minority class)
2. Find its k nearest neighbors (also frauds)
3. Create a **new synthetic example** at a random point on the segment between the example and one of those neighbors

```
Fraud A : [100, 50, 20]
Fraud B : [120, 60, 25]
         ↓ SMOTE ↓
New     : [110, 55, 22.5] (interpolated; here exactly halfway)
```
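
The interpolation step can be sketched in a few lines of plain Python. This is a simplified illustration of the idea only: the nearest-neighbor search is omitted, `smote_point` is a hypothetical helper, and it is not the imbalanced-learn implementation.

```python
import random


def smote_point(sample, neighbor, gap):
    """Interpolate: sample + gap * (neighbor - sample), with gap in [0, 1]."""
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]


fraud_a = [100, 50, 20]
fraud_b = [120, 60, 25]

# gap = 0.5 reproduces the midpoint from the example above
midpoint = smote_point(fraud_a, fraud_b, gap=0.5)
print(midpoint)  # → [110.0, 55.0, 22.5]

# SMOTE draws the gap at random, so the point can fall anywhere on the segment
rng = random.Random(42)
synthetic = smote_point(fraud_a, fraud_b, gap=rng.random())
print(synthetic)
```

Because every synthetic point lies between two real frauds, SMOTE densifies the minority region instead of duplicating existing rows.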

### Alternatives to SMOTE

| Technique | Approach | Pros | Cons |
|-----------|----------|-----------|---------------|
| **SMOTE** | Over-sampling (create examples) | ✅ No information loss | ⚠️ Can introduce noise |
| **Undersampling** | Drop majority examples | ✅ Faster | ❌ Information loss |
| **Class Weights** | Penalize errors on the minority class | ✅ Data left unchanged | ⚠️ Sensitive to tuning |
| **Ensemble** | Combine several models | ✅ Performance | ❌ More complex |

### Why SMOTE on Train Only?

⚠️ **Critical**: SMOTE is applied **only** to the training set

**Reason**:
- Validation and Test must reflect the **real distribution** (98%/2%)
- Otherwise, we overestimate production performance
- The model must learn to cope with the imbalance

```python
# ✅ GOOD
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)
model.fit(X_train_balanced, y_train_balanced)
val_proba = model.predict_proba(X_val)[:, 1]  # Val keeps the real imbalance

# ❌ BAD
X_val_balanced, y_val_balanced = smote.fit_resample(X_val, y_val)
val_proba = model.predict_proba(X_val_balanced)[:, 1]  # Unrealistic performance!
```

In [None]:
# Apply SMOTE to training data
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train_scaled, y_train)

print(f"Original training set: {X_train_scaled.shape}")
print(f"Balanced training set: {X_train_balanced.shape}")
print(f"\nOriginal fraud ratio: {y_train.mean():.4f}")
print(f"Balanced fraud ratio: {y_train_balanced.mean():.4f}")

## 6. Model Training

---

## 🤖 Part 6: Model Training

### Why 3 Models?

We train **3 different algorithms** to compare them:

#### 1. Logistic Regression (Baseline)
```
✅ Simple and interpretable
✅ Fast to train
❌ Assumes linear relationships
❌ Weaker on complex data
```

**When to use it?**
- Simple baseline
- Need for interpretability (coefficients)
- Few features

#### 2. XGBoost
```
✅ State of the art on tabular data
✅ Handles non-linearities
✅ Built-in feature importance
❌ Slower to train
❌ More hyperparameters to tune
```

**When to use it?**
- Maximum performance
- Tabular data
- The extra complexity is acceptable

#### 3. LightGBM
```
✅ Often faster than XGBoost
✅ Lower memory usage
✅ Good on large datasets
❌ Can overfit on small datasets
```

**When to use it?**
- Large datasets (> 10K examples)
- Time/memory constraints
- High-frequency production workloads

### Key Hyperparameters

| Parameter | Value (XGBoost / LightGBM) | Effect |
|-----------|-------------------|-------|
| `n_estimators` | 100 | Number of trees (↑ = more capacity, but higher overfitting risk) |
| `max_depth` | 6 | Tree depth (↑ = more complex) |
| `learning_rate` | 0.1 | Step size per tree (↓ = finer updates, but slower) |

**Rule of thumb**: more `n_estimators` with a smaller `learning_rate` usually performs better (but trains more slowly)
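
Why a smaller `learning_rate` calls for more trees can be seen in a deliberately idealized sketch: assume each boosting round removes exactly a fraction `learning_rate` of the remaining error. Real boosting is messier, and `rounds_to_fit` is a hypothetical helper, but the scaling is the point.

```python
def rounds_to_fit(learning_rate, tolerance=0.01):
    """Idealized boosting: each round removes `learning_rate` of the remaining residual.

    Returns how many rounds are needed before the residual drops to `tolerance`.
    """
    residual, rounds = 1.0, 0
    while residual > tolerance:
        residual *= 1.0 - learning_rate  # shrinkage: only part of the error is corrected
        rounds += 1
    return rounds


for lr in (0.5, 0.1, 0.01):
    print(f"learning_rate={lr}: {rounds_to_fit(lr)} rounds")
```

Under this assumption, dividing the learning rate by 10 multiplies the number of rounds by roughly 10, which is why `n_estimators` and `learning_rate` are usually tuned together.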

In [None]:
# Train Logistic Regression (baseline)
lr_model = LogisticRegression(random_state=42, max_iter=1000)
lr_model.fit(X_train_balanced, y_train_balanced)

# Train XGBoost
xgb_model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    random_state=42,
    eval_metric='auc'
)
xgb_model.fit(X_train_balanced, y_train_balanced)

# Train LightGBM
lgb_model = lgb.LGBMClassifier(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    random_state=42
)
lgb_model.fit(X_train_balanced, y_train_balanced)

print("All models trained successfully!")

## 7. Model Evaluation

---

## 📊 Part 7: Model Evaluation

### Metrics for Imbalanced Classes

❌ **Accuracy**: Not suitable! (98% on this dataset just by always predicting "normal")

✅ **Metrics to use**:

#### 1. Precision
```
Precision = TP / (TP + FP)
```
**Question**: Of the transactions flagged as fraud, how many really are?

**Business impact**: Cost of investigating false positives

#### 2. Recall (Sensitivity)
```
Recall = TP / (TP + FN)
```
**Question**: Of all actual frauds, how many did we catch?

**Business impact**: Missed frauds = financial losses

#### 3. F1-Score
```
F1 = 2 × (Precision × Recall) / (Precision + Recall)
```
**Question**: What is the harmonic mean of Precision and Recall?

**Usage**: Balance the two metrics

#### 4. ROC-AUC
```
AUC = Area Under the ROC Curve
```
**Question**: How well does the model discriminate between the classes?

**Interpretation**:
- 1.0 = Perfect (separates the classes perfectly)
- 0.5 = Random (like flipping a coin)
- < 0.5 = Worse than random (inverted model?)
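
One way to build intuition for ROC-AUC: it equals the probability that a randomly chosen fraud receives a higher score than a randomly chosen legitimate transaction. The sketch below computes it that way in plain Python; `pairwise_auc` and the sample scores are hypothetical, and the quadratic loop is meant for illustration rather than large datasets.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (fraud, legit) pairs where the fraud scores higher; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores: one legit transaction outranks one fraud
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.8, 0.7, 0.9]

auc_value = pairwise_auc(y_true, scores)
print(auc_value)  # → 0.875 (7 of the 8 fraud/legit pairs are ranked correctly)
```

Since only the ranking of scores matters, AUC is unaffected by the decision threshold, which is why the threshold is chosen separately.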

### Confusion Matrix

```
                Predicted
              Neg    Pos
Actual Neg    TN     FP   ← False Positive = Investigation cost
       Pos    FN     TP   ← False Negative = Missed fraud $$$$
```

**Business goal**: Minimize FN (missed frauds) while keeping FP reasonable
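
To connect the matrix with the formulas above, here is a small plain-Python sketch that counts the four cells and derives Precision, Recall and F1 from them. The helper names and the ten sample predictions are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 = fraud."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1:
            fn += 1  # fraud predicted as normal
        elif p == 1:
            fp += 1  # normal transaction flagged as fraud
        else:
            tn += 1
    return tn, fp, fn, tp


def precision_recall_f1(fp, fn, tp):
    """Apply the three formulas, returning 0.0 when a denominator is empty."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical predictions for 10 transactions (4 of them fraudulent)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(fp, fn, tp)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The `(tn, fp, fn, tp)` order matches what `confusion_matrix(...).ravel()` returns for binary labels, so the same formulas apply to the notebook's results.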

### Precision vs Recall Trade-off

```
Threshold = 0.5 (default)
Precision = 85%, Recall = 70%

Threshold = 0.3 (more sensitive)
Precision = 60%, Recall = 90%  ← Catches more fraud, but more false positives

Threshold = 0.7 (more conservative)
Precision = 95%, Recall = 50%  ← Fewer false positives, but misses fraud
```

**Choose the threshold** to fit the business:
- **Bank**: High recall (don't miss fraud) → low threshold (0.3)
- **E-commerce**: High precision (don't block legitimate customers) → high threshold (0.7)
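
One possible way to turn this choice into a number is to attach a cost to each error type and compare thresholds. The sketch below reuses the costs from this notebook's cost-sensitive cell ($5 per false positive, $100 per false negative); the labels, probabilities and the `total_cost` helper are hypothetical.

```python
COST_FP = 5    # cost of investigating a legitimate transaction
COST_FN = 100  # cost of missing a fraudulent transaction


def total_cost(y_true, probabilities, threshold):
    """Cost of the decisions made when flagging every score >= threshold."""
    cost = 0
    for t, p in zip(y_true, probabilities):
        flagged = p >= threshold
        if flagged and t == 0:
            cost += COST_FP   # false positive
        elif not flagged and t == 1:
            cost += COST_FN   # false negative
    return cost


# Hypothetical validation labels and predicted fraud probabilities
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
probabilities = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.30, 0.55, 0.80, 0.95]

thresholds = [0.3, 0.5, 0.7]
costs = {th: total_cost(y_true, probabilities, th) for th in thresholds}
best_threshold = min(costs, key=costs.get)

for th, cost in costs.items():
    print(f"threshold={th}: cost=${cost}")
print("cheapest threshold:", best_threshold)
```

With a missed fraud costing 20 times more than an investigation, the low threshold wins on this sample. The threshold should be selected on validation data, never on the test set.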

In [None]:
# Evaluate on validation set
models = {
    'Logistic Regression': lr_model,
    'XGBoost': xgb_model,
    'LightGBM': lgb_model
}

results = {}
for name, model in models.items():
    y_pred_proba = model.predict_proba(X_val_scaled)[:, 1]
    y_pred = (y_pred_proba >= 0.5).astype(int)
    
    roc_auc = roc_auc_score(y_val, y_pred_proba)
    
    results[name] = {
        'predictions': y_pred,
        'probabilities': y_pred_proba,
        'roc_auc': roc_auc
    }
    
    print(f"\n{name}:")
    print(f"ROC-AUC: {roc_auc:.4f}")
    print("\nClassification Report:")
    print(classification_report(y_val, y_pred))
    print("\nConfusion Matrix:")
    print(confusion_matrix(y_val, y_pred))

In [None]:
# Plot ROC curves
plt.figure(figsize=(10, 8))
for name, result in results.items():
    fpr, tpr, _ = roc_curve(y_val, result['probabilities'])
    plt.plot(fpr, tpr, label=f"{name} (AUC = {result['roc_auc']:.4f})")

plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves Comparison')
plt.legend()
plt.grid(True)
plt.show()

In [None]:
# Plot Precision-Recall curves
plt.figure(figsize=(10, 8))
for name, result in results.items():
    precision, recall, _ = precision_recall_curve(y_val, result['probabilities'])
    plt.plot(recall, precision, label=name)

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curves Comparison')
plt.legend()
plt.grid(True)
plt.show()

In [None]:
# Cost-sensitive evaluation
cost_fp = 5   # Cost of investigating a legitimate transaction
cost_fn = 100  # Cost of missing a fraudulent transaction

print("Cost-Sensitive Evaluation:")
print("="*50)
for name, result in results.items():
    tn, fp, fn, tp = confusion_matrix(y_val, result['predictions']).ravel()
    total_cost = (fp * cost_fp) + (fn * cost_fn)
    print(f"\n{name}:")
    print(f"  False Positives: {fp}, Cost: ${fp * cost_fp}")
    print(f"  False Negatives: {fn}, Cost: ${fn * cost_fn}")
    print(f"  Total Cost: ${total_cost}")

## 9. Final Evaluation on Test Set

In [None]:
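# Select the best model, then evaluate it on the test set.
# Assumption: the selection step is missing from this notebook, so the model
# with the highest validation ROC-AUC is used here.
best_model_name = max(results, key=lambda name: results[name]['roc_auc'])
best_model = models[best_model_name]
print(f"Best model on validation: {best_model_name}")

y_test_pred_proba = best_model.predict_proba(X_test_scaled)[:, 1]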
y_test_pred = (y_test_pred_proba >= 0.5).astype(int)

print("Final Test Set Performance:")
print("="*50)
print(f"ROC-AUC: {roc_auc_score(y_test, y_test_pred_proba):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_test_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_test_pred))

## 10. Save Model Artifacts

In [None]:
# Save model and preprocessing artifacts
import os

model_dir = '../../models'
os.makedirs(model_dir, exist_ok=True)

# Save model
joblib.dump(best_model, f'{model_dir}/fraud_detection_model.pkl')
joblib.dump(scaler, f'{model_dir}/fraud_detection_scaler.pkl')
joblib.dump(le, f'{model_dir}/fraud_detection_encoder.pkl')
joblib.dump(feature_cols, f'{model_dir}/fraud_detection_features.pkl')

print("Model artifacts saved successfully!")
print(f"Location: {model_dir}")

In [None]:
# Calculate final metrics on test set
y_test_pred_proba = best_model.predict_proba(X_test_scaled)[:, 1]
y_test_pred = (y_test_pred_proba >= 0.5).astype(int)

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

final_metrics = {
    "accuracy": float(accuracy_score(y_test, y_test_pred)),
    "precision": float(precision_score(y_test, y_test_pred)),
    "recall": float(recall_score(y_test, y_test_pred)),
    "f1": float(f1_score(y_test, y_test_pred)),
    "roc_auc": float(roc_auc_score(y_test, y_test_pred_proba))
}

print("✅ Final Test Metrics:")
for key, value in final_metrics.items():
    print(f"   {key}: {value:.4f}")

---

## 11. SageMaker Integration: Training Job & Experiments

In this section, we use SageMaker's native services:
- **SageMaker Training Jobs**: Scalable, reproducible training
- **SageMaker Experiments**: Automatic tracking of hyperparameters and metrics
- **SageMaker Model Registry**: Versioning and model lifecycle management

### Why SageMaker Training Jobs?

✅ **Scalability**: Choose the instance size that fits  
✅ **Reproducibility**: Containerized environment  
✅ **Automatic tracking**: Integration with SageMaker Experiments  
✅ **Cost control**: Pay only for training time

In [None]:
# Setup SageMaker
import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn import SKLearn
from sagemaker.experiments import Run
import boto3

# Initialize SageMaker session
sagemaker_session = sagemaker.Session()
role = get_execution_role()
region = sagemaker_session.boto_region_name
bucket = sagemaker_session.default_bucket()

print(f"✅ SageMaker Session initialized")
print(f"   Region: {region}")
print(f"   Bucket: {bucket}")
print(f"   Role: {role}")

In [None]:
# Upload training data to S3
import pandas as pd
from sklearn.model_selection import train_test_split

# Prepare data for SageMaker
train_data = pd.concat([X_train, y_train], axis=1)
val_data = pd.concat([X_val, y_val], axis=1)
test_data = pd.concat([X_test, y_test], axis=1)

# Save to CSV
os.makedirs('data', exist_ok=True)
train_data.to_csv('data/train.csv', index=False, header=False)
val_data.to_csv('data/validation.csv', index=False, header=False)
test_data.to_csv('data/test.csv', index=False, header=False)

# Upload to S3
s3_prefix = 'fraud-detection-lab1'
s3_train = sagemaker_session.upload_data(
    path='data/train.csv',
    bucket=bucket,
    key_prefix=f'{s3_prefix}/data'
)
s3_validation = sagemaker_session.upload_data(
    path='data/validation.csv',
    bucket=bucket,
    key_prefix=f'{s3_prefix}/data'
)

print(f"✅ Data uploaded to S3:")
print(f"   Training: {s3_train}")
print(f"   Validation: {s3_validation}")

### 📝 Create Training Script

For SageMaker Training, we create a standalone Python script that will run inside a container.

In [None]:
%%writefile requirements.txt
imbalanced-learn==0.12.3
xgboost==2.0.0

In [None]:
print("✅ Requirements file created: requirements.txt")

In [None]:
%%writefile fraud_train.py

import argparse
import os
import pandas as pd
import numpy as np
import joblib
import json

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    
    # Hyperparameters
    parser.add_argument('--n-estimators', type=int, default=100)
    parser.add_argument('--max-depth', type=int, default=6)
    parser.add_argument('--learning-rate', type=float, default=0.1)
    parser.add_argument('--threshold', type=float, default=0.5)
    
    # SageMaker specific arguments
    parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAINING'))
    parser.add_argument('--validation', type=str, default=os.environ.get('SM_CHANNEL_VALIDATION'))
    parser.add_argument('--output-data-dir', type=str, default=os.environ.get('SM_OUTPUT_DATA_DIR'))
    
    args = parser.parse_args()
    
    print("Starting training...")
    print(f"Model dir: {args.model_dir}")
    print(f"Train dir: {args.train}")
    print(f"Validation dir: {args.validation}")
    
    # Import here to avoid issues if not available
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
    from imblearn.over_sampling import SMOTE
    import xgboost as xgb
    
    # Load data
    print("Loading training data...")
    train_df = pd.read_csv(os.path.join(args.train, 'train.csv'), header=None)
    val_df = pd.read_csv(os.path.join(args.validation, 'validation.csv'), header=None)
    
    print(f"Train shape: {train_df.shape}")
    print(f"Validation shape: {val_df.shape}")
    
    # Split features and target
    X_train = train_df.iloc[:, :-1]
    y_train = train_df.iloc[:, -1]
    X_val = val_df.iloc[:, :-1]
    y_val = val_df.iloc[:, -1]
    
    # Apply SMOTE
    print(f"Original training set: {X_train.shape}")
    smote = SMOTE(random_state=42)
    X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)
    print(f"Balanced training set: {X_train_balanced.shape}")
    
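    # Calculate scale_pos_weight on the SMOTE-balanced labels: the classes are
    # already balanced, so this evaluates to 1.0 and avoids correcting the
    # imbalance a second time
    neg_count = int((y_train_balanced == 0).sum())
    pos_count = int((y_train_balanced == 1).sum())
    scale_pos_weight = neg_count / pos_count if pos_count > 0 else 1.0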
    
    # Train XGBoost model
    print("Training XGBoost model...")
    model = xgb.XGBClassifier(
        n_estimators=args.n_estimators,
        max_depth=args.max_depth,
        learning_rate=args.learning_rate,
        scale_pos_weight=scale_pos_weight,
        eval_metric='logloss',
        random_state=42
    )
    
    model.fit(
        X_train_balanced, y_train_balanced,
        eval_set=[(X_val, y_val)],
        verbose=False
    )
    
    # Evaluate on validation set
    y_val_pred_proba = model.predict_proba(X_val)[:, 1]
    y_val_pred = (y_val_pred_proba >= args.threshold).astype(int)
    
    # Calculate metrics
    metrics = {
        'accuracy': float(accuracy_score(y_val, y_val_pred)),
        'precision': float(precision_score(y_val, y_val_pred, zero_division=0)),
        'recall': float(recall_score(y_val, y_val_pred, zero_division=0)),
        'f1': float(f1_score(y_val, y_val_pred, zero_division=0)),
        'roc_auc': float(roc_auc_score(y_val, y_val_pred_proba))
    }
    
    print(f"\nValidation Metrics:")
    for key, value in metrics.items():
        print(f"  {key}: {value:.4f}")
    
    # Save model
    model_path = os.path.join(args.model_dir, 'xgboost-model')
    joblib.dump(model, model_path)
    print(f"\nModel saved to: {model_path}")
    
    # Save metrics
    if args.output_data_dir:
        metrics_path = os.path.join(args.output_data_dir, 'metrics.json')
        os.makedirs(args.output_data_dir, exist_ok=True)
        with open(metrics_path, 'w') as f:
            json.dump(metrics, f)
        print(f"Metrics saved to: {metrics_path}")
    
    print("\n✅ Training completed successfully!")

In [None]:
print("✅ Training script created: fraud_train.py")

### 🚀 Launch SageMaker Training Job with Experiments Tracking

In [None]:
from sagemaker.sklearn import SKLearn
from sagemaker.experiments.run import Run
import time

# Define experiment and run names
experiment_name = "fraud-detection-experiment"
run_name = f"xgboost-run-{int(time.time())}"

# Hyperparameters to test
hyperparameters = {
    'n-estimators': 100,
    'max-depth': 6,
    'learning-rate': 0.1,
    'threshold': 0.5
}

# Create SageMaker Estimator with dependencies
sklearn_estimator = SKLearn(
    entry_point='fraud_train.py',
    role=role,
    instance_type='ml.m5.xlarge',
    instance_count=1,
    framework_version='1.2-1',
    py_version='py3',
    hyperparameters=hyperparameters,
    output_path=f's3://{bucket}/{s3_prefix}/output',
    code_location=f's3://{bucket}/{s3_prefix}/code',
    sagemaker_session=sagemaker_session,
    dependencies=['requirements.txt']  # Install additional packages
)

print(f"✅ Estimator configured")
print(f"   Instance type: ml.m5.xlarge")
print(f"   Framework: scikit-learn 1.2-1")
print(f"   Dependencies: requirements.txt (imbalanced-learn, xgboost)")
print(f"\n🎯 Starting training with SageMaker Experiments...")

In [None]:
# Launch training with Experiments tracking
with Run(
    experiment_name=experiment_name,
    run_name=run_name,
    sagemaker_session=sagemaker_session
) as run:
    
    # Log hyperparameters
    run.log_parameters(hyperparameters)
    
    # Start training
    sklearn_estimator.fit({
        'training': s3_train,
        'validation': s3_validation
    }, wait=True)
    
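    # Log model artifact location. model_data is an S3 URI, so it is recorded with
    # log_artifact; log_file expects the path of a local file to upload.
    run.log_artifact(name="model_artifact", value=sklearn_estimator.model_data, is_output=True)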
    
    print(f"\n✅ Training completed!")
    print(f"   Experiment: {experiment_name}")
    print(f"   Run: {run_name}")
    print(f"   Model artifact: {sklearn_estimator.model_data}")

### 📦 Register Model in SageMaker Model Registry

The Model Registry lets you:
- Version your models
- Approve/reject versions
- Track the lifecycle (Dev → Staging → Production)
- Integrate with CI/CD pipelines

In [None]:
import boto3
from botocore.exceptions import ClientError
from sagemaker.sklearn import SKLearnModel

# Create model package group (registry)
model_package_group_name = "fraud-detection-models"

sm_client = boto3.client('sagemaker')

try:
    sm_client.create_model_package_group(
        ModelPackageGroupName=model_package_group_name,
        ModelPackageGroupDescription="Fraud detection models for production"
    )
    print(f"✅ Created Model Package Group: {model_package_group_name}")
except ClientError as e:
    # An existing group is reported as a ClientError whose message says "already exists"
    if "already exists" not in str(e):
        raise
    print(f"ℹ️  Model Package Group already exists: {model_package_group_name}")
# Register the trained model
model = SKLearnModel(
    model_data=sklearn_estimator.model_data,
    role=role,
    entry_point='fraud_train.py',
    framework_version='1.2-1',
    py_version='py3',
    sagemaker_session=sagemaker_session
)

# Register model in Model Registry
model_package = model.register(
    content_types=["text/csv"],
    response_types=["application/json"],
    inference_instances=["ml.t2.medium", "ml.m5.large"],
    transform_instances=["ml.m5.large"],
    model_package_group_name=model_package_group_name,
    approval_status="PendingManualApproval",
    description="XGBoost fraud detection model trained with SMOTE"
)

print("\n✅ Model registered in Model Registry!")
print(f"   Model Package ARN: {model_package.model_package_arn}")
print("   Approval Status: PendingManualApproval")
print(f"\n💡 To approve: SageMaker Console → Model Registry → {model_package_group_name}")
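
Approval can also be done programmatically instead of through the console. The following is a minimal sketch built on the boto3 `update_model_package` call; the helper name `approve_model_package` and the default note text are illustrative and not part of the lab.

```python
def approve_model_package(sm_client, model_package_arn, note="Approved after manual review"):
    """Set a registered model version to 'Approved' via the SageMaker API.

    `sm_client` is a boto3 SageMaker client, e.g. boto3.client('sagemaker').
    """
    return sm_client.update_model_package(
        ModelPackageArn=model_package_arn,
        ModelApprovalStatus="Approved",
        ApprovalDescription=note,
    )


# Usage in the notebook (requires AWS credentials):
# approve_model_package(sm_client, model_package.model_package_arn)
```

Passing the client as an argument keeps the helper easy to exercise with a stub before pointing it at a real account.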

### 🔍 View Experiment Results

Let's review the results of our SageMaker experiment.

In [None]:
from sagemaker.analytics import ExperimentAnalytics

# Get experiment results
experiment_analytics = ExperimentAnalytics(
    experiment_name=experiment_name,
    sagemaker_session=sagemaker_session
)

# Display results as DataFrame
results_df = experiment_analytics.dataframe()

if not results_df.empty:
    print("📊 Experiment Results:")
    # Parameters were logged with hyphenated keys; select only the columns that are present
    wanted = ['TrialComponentName', 'n-estimators', 'max-depth', 'learning-rate']
    print(results_df[[c for c in wanted if c in results_df.columns]].head())
    print(f"\n✅ Total runs: {len(results_df)}")
    print("\n💡 View full results in SageMaker Studio:")
    print(f"   Experiments & Trials → {experiment_name}")
else:
    print("⚠️  No results yet. Training may still be in progress.")

## Key Takeaways

1. **Class Imbalance**: SMOTE significantly improved model performance
2. **Model Selection**: XGBoost/LightGBM outperformed logistic regression
3. **Feature Engineering**: Temporal and distance-based features were crucial
4. **SageMaker Training**: Scalable and reproducible training with managed infrastructure
5. **Experiments Tracking**: Each run records its hyperparameters and model artifact for later comparison
6. **Model Registry**: Centralized model versioning and approval workflow

## Next Steps for Production

1. **Deploy** from Model Registry to real-time endpoint
2. Set up **Model Monitor** for data drift detection
3. Implement **SageMaker Pipelines** for automated retraining
4. Add **Feature Store** for real-time features
5. Set up **A/B testing** with production traffic

---

## 🧹 Cleanup Resources

Clean up AWS resources to avoid unnecessary costs.

In [None]:
# ============================================================
# Cleanup AWS Resources
# ============================================================

import boto3

print("🧹 Cleaning up Lab 1 resources...")
print("=" * 60)

sm_client = boto3.client('sagemaker')

# List all resources created in this lab
resources_deleted = []

try:
    # 1. List Training Jobs (they stop automatically once finished; listed here for review)
    print("\n📋 Training Jobs:")
    training_jobs = sm_client.list_training_jobs(
        NameContains='fraud-detection',
        MaxResults=10,
        SortBy='CreationTime',
        SortOrder='Descending'
    )
    
    for job in training_jobs['TrainingJobSummaries']:
        print(f"  • {job['TrainingJobName']} - Status: {job['TrainingJobStatus']}")
    
    if training_jobs['TrainingJobSummaries']:
        print("  💡 Finished training jobs stop automatically and incur no further compute cost")
    else:
        print("  ✅ No training jobs found")
    
    # 2. List Model Packages in Registry
    print("\n📦 Model Registry:")
    try:
        # Must match the group name used when the model was registered
        model_packages = sm_client.list_model_packages(
            ModelPackageGroupName='fraud-detection-models',
            MaxResults=10
        )
        
        for pkg in model_packages['ModelPackageSummaryList']:
            print(f"  • Version {pkg.get('ModelPackageVersion', 'N/A')} - Status: {pkg['ModelApprovalStatus']}")
        
        print("\n  💡 To delete model packages (optional):")
        print("  # for pkg in model_packages['ModelPackageSummaryList']:")
        print("  #     sm_client.delete_model_package(ModelPackageName=pkg['ModelPackageArn'])")
        
    except sm_client.exceptions.ResourceNotFound:
        print("  ✅ No model package group found")
    
    # 3. List Experiments
    print("\n🔬 Experiments:")
    try:
        experiments = sm_client.list_experiments(
            MaxResults=10
        )
        
        fraud_experiments = [e for e in experiments['ExperimentSummaries'] 
                            if 'fraud' in e['ExperimentName'].lower()]
        
        if fraud_experiments:
            for exp in fraud_experiments:
                print(f"  • {exp['ExperimentName']}")
            print("\n  💡 Experiments don't incur costs and provide history")
        else:
            print("  ✅ No fraud detection experiments found")
            
    except Exception as e:
        print(f"  Note: {e}")
    
    # 4. Check for any deployed endpoints (shouldn't exist in this lab)
    print("\n🔌 Endpoints:")
    endpoints = sm_client.list_endpoints(
        NameContains='fraud',
        StatusEquals='InService'
    )
    
    if endpoints['Endpoints']:
        print(f"  ⚠️  Found {len(endpoints['Endpoints'])} endpoint(s):")
        for ep in endpoints['Endpoints']:
            print(f"    • {ep['EndpointName']}")
        
        # Uncomment to delete endpoints
        # for ep in endpoints['Endpoints']:
        #     sm_client.delete_endpoint(EndpointName=ep['EndpointName'])
        #     print(f"    ✅ Deleted: {ep['EndpointName']}")
        #     resources_deleted.append(ep['EndpointName'])
    else:
        print("  ✅ No endpoints found (good - Lab 1 doesn't deploy endpoints)")
    
    # 5. S3 Data (optional - you may want to keep training data)
    print("\n💾 S3 Data:")
    print(f"  Data location: s3://{bucket}/lab1-fraud-detection/")
    print("  💡 Training data and models are stored here")
    print("  💡 You can delete manually if needed:")
    print(f"  # aws s3 rm s3://{bucket}/lab1-fraud-detection/ --recursive")
    
    print("\n" + "=" * 60)
    print("✅ Cleanup check complete!")
    print("\n💰 Cost Impact:")
    print("  • Training Jobs: ✅ Stopped automatically (no ongoing cost)")
    print("  • Model Registry: ✅ No cost for storing model metadata")
    print("  • Experiments: ✅ No cost")
    print("  • S3 Storage: 💲 Minimal cost (a few cents)")
    print("  • Endpoints: ✅ None deployed")
    
    if resources_deleted:
        print(f"\n🗑️  Deleted resources: {', '.join(resources_deleted)}")
    
except Exception as e:
    print(f"❌ Error during cleanup: {e}")
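
The experiments check above inspects only the first 10 results, so a fraud experiment could be missed in an account with many experiments. A possible refinement, sketched here with boto3's `list_experiments` paginator, walks every page before filtering. The helper name `find_experiments` is hypothetical.

```python
def find_experiments(sm_client, keyword="fraud"):
    """Return the names of all experiments whose name contains `keyword` (case-insensitive)."""
    paginator = sm_client.get_paginator("list_experiments")
    names = []
    # Each page is one ListExperiments response; the paginator follows NextToken for us
    for page in paginator.paginate():
        for summary in page["ExperimentSummaries"]:
            if keyword.lower() in summary["ExperimentName"].lower():
                names.append(summary["ExperimentName"])
    return names


# Usage in the notebook (requires AWS credentials):
# print(find_experiments(sm_client))
```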