# QC-Py-19 - Machine Learning Classification for Direction Prediction

> **Predict market direction with Random Forest and XGBoost**
> Duration: 75 minutes | Level: Intermediate-Advanced | Python + QuantConnect

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Understand why **ML is well suited** to direction prediction compared with fixed rules
2. Implement a **Random Forest Classifier** to predict Up/Down
3. Use **XGBoost** with early stopping and tuned hyperparameters
4. Apply **TimeSeriesSplit** for validation without lookahead bias
5. Compute **classification metrics** suited to trading
6. Implement **Walk-Forward Validation** with periodic retraining
7. Persist models with QuantConnect's **ObjectStore**
8. Build a complete **ML Alpha Model** for production

## Prerequisites

- Notebook QC-Py-18 (Feature Engineering) completed
- Understanding of basic ML concepts (classification, train/test split)
- Familiarity with pandas, numpy, sklearn

## Notebook Structure

1. Introduction to ML for Trading (15 min)
2. Random Forest Classifier (25 min)
3. XGBoost Classifier (25 min)
4. Validation and Metrics (20 min)
5. QuantConnect Integration (20 min)
6. Complete Strategy (20 min)

---


## Setup and Imports

In [None]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
from typing import Dict, List, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')

# Matplotlib configuration
plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline

print("Base imports succeeded")

In [None]:
# ML imports
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score, roc_curve
)
from sklearn.preprocessing import StandardScaler
import pickle

print("sklearn imports succeeded")

# XGBoost
try:
    import xgboost as xgb
    print(f"XGBoost imported successfully (version {xgb.__version__})")
except ImportError:
    print("XGBoost is not installed. Run: pip install xgboost")

---

## Part 1: Introduction to ML for Trading (15 min)

### 1.1 Classification vs Fixed Rules

| Approach | Advantages | Drawbacks |
|----------|-----------|---------------|
| **Fixed Rules** | Interpretable, fast, no training | Rigid, does not adapt, arbitrary thresholds |
| **ML Classification** | Learns complex patterns, adapts | Black box, overfitting risk, needs data |

### Why use ML to predict direction?

```
Fixed Rules:                      ML Classification:
                                  
if RSI < 30:                      features = [RSI, MACD, Volume, ...]
    BUY                           model.fit(features, direction)
elif RSI > 70:                    prediction = model.predict(new_features)
    SELL                          
                                  -> Captures complex interactions
-> Arbitrary thresholds           -> Weights learned automatically
-> Ignores other factors          -> Combines multiple signals
```

### Trading-Specific Challenges

| Challenge | Description | Solution |
|-----------|-------------|----------|
| **Non-stationarity** | Markets change over time | Periodic retraining |
| **Regime Changes** | Bull/Bear/Sideways | Regime features, regime detection |
| **Overfitting** | The model memorizes the past | Regularization, robust validation |
| **Class Imbalance** | More Up than Down (or the reverse) | Class weights, resampling |
| **Lookahead Bias** | Using future information | Strict TimeSeriesSplit |
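
For the class-imbalance row, the `'balanced'` weighting offered by scikit-learn can be reproduced by hand, which makes it clear what the option does. A minimal sketch, assuming a toy label vector `y_demo` (illustrative only, not the notebook's data):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (assumption for illustration): 70 Up, 30 Down
y_demo = np.array([1] * 70 + [0] * 30)

classes = np.unique(y_demo)
# 'balanced' weight for class c = n_samples / (n_classes * count(c))
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))
sk_weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

print(dict(zip(classes.tolist(), manual.round(3).tolist())))
print(np.allclose(manual, sk_weights))
```

The rarer class receives the larger weight, so its errors count more during training.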

In [None]:
# Generate demonstration data
# (In production, use the pipeline from QC-Py-18)

def generate_sample_data(n_days=500, seed=42):
    """
    Generate simulated OHLCV data (features and labels are added by the helpers below).
    """
    np.random.seed(seed)
    
    dates = pd.date_range(start='2022-01-01', periods=n_days, freq='B')
    
    # Price with trend and cycles
    trend = np.linspace(100, 130, n_days)
    cycle = 15 * np.sin(np.linspace(0, 6 * np.pi, n_days))
    noise = np.cumsum(np.random.randn(n_days) * 0.5)
    close = trend + cycle + noise
    
    # OHLV
    high = close * (1 + np.abs(np.random.normal(0, 0.01, n_days)))
    low = close * (1 - np.abs(np.random.normal(0, 0.01, n_days)))
    open_price = close * (1 + np.random.normal(0, 0.005, n_days))
    volume = 1_000_000 * (1 + np.random.exponential(0.5, n_days))
    
    df = pd.DataFrame({
        'open': open_price,
        'high': high,
        'low': low,
        'close': close,
        'volume': volume.astype(int)
    }, index=dates)
    
    return df


def calculate_features(df):
    """
    Compute technical features (simplified from QC-Py-18).
    """
    result = df.copy()
    close = result['close']
    
    # Returns
    for period in [1, 5, 10, 20]:
        result[f'return_{period}d'] = close.pct_change(period)
    
    # Volatility
    result['volatility_20d'] = result['return_1d'].rolling(20).std()
    
    # SMA ratios
    result['sma_20'] = close.rolling(20).mean()
    result['sma_50'] = close.rolling(50).mean()
    result['price_to_sma_20'] = close / result['sma_20']
    result['ma_ratio_20_50'] = result['sma_20'] / result['sma_50']
    
    # RSI
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(14).mean()
    loss = (-delta.clip(upper=0)).rolling(14).mean()
    rs = gain / loss
    result['rsi_14'] = 100 - (100 / (1 + rs))
    result['rsi_normalized'] = (result['rsi_14'] - 50) / 50
    
    # MACD
    ema_12 = close.ewm(span=12, adjust=False).mean()
    ema_26 = close.ewm(span=26, adjust=False).mean()
    result['macd'] = ema_12 - ema_26
    result['macd_signal'] = result['macd'].ewm(span=9, adjust=False).mean()
    result['macd_hist'] = result['macd'] - result['macd_signal']
    result['macd_norm'] = result['macd'] / close
    
    # Bollinger Bands
    bb_middle = close.rolling(20).mean()
    bb_std = close.rolling(20).std()
    bb_upper = bb_middle + 2 * bb_std
    bb_lower = bb_middle - 2 * bb_std
    result['bb_percent_b'] = (close - bb_lower) / (bb_upper - bb_lower)
    result['bb_bandwidth'] = (bb_upper - bb_lower) / bb_middle
    
    # Volume features
    result['volume_ratio'] = df['volume'] / df['volume'].rolling(20).mean()
    
    # Range
    result['range'] = (df['high'] - df['low']) / close
    
    return result


def create_labels(df, horizon=5, threshold=0.0):
    """
    Create binary classification labels from the forward return over `horizon` days.
    """
    result = df.copy()
    
    # Future return
    result['future_return'] = result['close'].shift(-horizon) / result['close'] - 1
    
    # Binary label: 1 = Up, 0 = Down
    result['label'] = (result['future_return'] > threshold).astype(int)
    
    return result


# Generate and prepare the data
df = generate_sample_data(n_days=600)
df = calculate_features(df)
df = create_labels(df, horizon=5)

print(f"Data generated: {len(df)} days")
print(f"Period: {df.index[0].date()} to {df.index[-1].date()}")

In [None]:
# Prepare X and y

# Feature columns (exclude OHLCV, labels, and raw price-level indicators)
exclude_cols = ['open', 'high', 'low', 'close', 'volume', 
                'future_return', 'label', 'sma_20', 'sma_50', 'macd', 'macd_signal']
feature_cols = [col for col in df.columns if col not in exclude_cols]

# Drop NaN rows (indicator warm-up, plus the last rows whose future_return is undefined)
df_clean = df.dropna()

X = df_clean[feature_cols]
y = df_clean['label']

print(f"Features: {len(feature_cols)}")
print(f"  {feature_cols}")
print(f"\nSamples: {len(X)}")
print(f"\nLabel distribution:")
print(f"  Down (0): {(y == 0).sum()} ({(y == 0).mean()*100:.1f}%)")
print(f"  Up (1):   {(y == 1).sum()} ({(y == 1).mean()*100:.1f}%)")

In [None]:
# Chronological train/test split
train_ratio = 0.7
split_idx = int(len(X) * train_ratio)

X_train = X.iloc[:split_idx]
X_test = X.iloc[split_idx:]
y_train = y.iloc[:split_idx]
y_test = y.iloc[split_idx:]

print(f"Train: {len(X_train)} samples ({X_train.index[0].date()} - {X_train.index[-1].date()})")
print(f"Test:  {len(X_test)} samples ({X_test.index[0].date()} - {X_test.index[-1].date()})")

# Standardization (fit on train only, then applied to both sets)
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train),
    columns=X_train.columns,
    index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test),
    columns=X_test.columns,
    index=X_test.index
)

print("\nStandardization applied (StandardScaler)")
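
The cell above fits the scaler once on the full training set. If that scaled set is then cut into cross-validation folds, every validation fold has already contributed to the scaling statistics. One way to avoid this is to wrap the scaler and the model in a `Pipeline`, so that the scaler is re-fit inside each fold. A minimal sketch on synthetic data (`X_demo`, `y_demo`, and the model settings are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))                       # synthetic features
y_demo = (X_demo[:, 0] + rng.normal(size=300) > 0).astype(int)

# cross_val_score clones the pipeline per fold, so the scaler only sees that fold's training rows
pipe = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0),
)
scores = cross_val_score(pipe, X_demo, y_demo,
                         cv=TimeSeriesSplit(n_splits=5), scoring="accuracy")
print(scores.round(3))
```

For tree-based models the effect is small because splits are invariant to monotonic rescaling, but the pattern matters as soon as a scale-sensitive model is swapped in.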

---

## Part 2: Random Forest Classifier (25 min)

### 2.1 Why Random Forest?

| Advantage | Description |
|----------|-------------|
| **Robust** | Less prone to overfitting thanks to ensembling |
| **Non-linear** | Captures complex relationships |
| **Feature Importance** | Makes feature contributions interpretable |
| **Few hyperparameters** | Easy to tune |
| **No scaling needed** | Works on raw data |

### Random Forest Architecture

```
            Data
               |
    +----------+----------+
    |          |          |
 Tree 1    Tree 2   ...  Tree N
 (sample1) (sample2)     (sampleN)
    |          |          |
 pred_1     pred_2      pred_N
    |          |          |
    +----------+----------+
               |
           MAJORITY VOTE
               |
           Final Prediction
```
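
The diagram shows a hard majority vote. scikit-learn's `RandomForestClassifier` instead averages the class probabilities predicted by each tree (a soft vote) and picks the class with the highest mean. The sketch below checks this on synthetic data; `X_demo`, `y_demo`, and `rf_demo` are illustrative names, not part of the notebook:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
rf_demo = RandomForestClassifier(n_estimators=25, max_depth=3, random_state=0)
rf_demo.fit(X_demo, y_demo)

# Average the per-tree class probabilities by hand
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in rf_demo.estimators_])
manual_proba = tree_probas.mean(axis=0)

# Compare with the forest's own probabilities
matches = np.allclose(manual_proba, rf_demo.predict_proba(X_demo))
print(matches)
```

This is why `predict_proba` returns graded values that can serve as a confidence score, rather than vote fractions limited to multiples of 1/N.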

### Key Hyperparameters

| Parameter | Description | Typical value |
|-----------|-------------|----------------|
| `n_estimators` | Number of trees | 100-500 |
| `max_depth` | Maximum depth per tree | 5-15 |
| `min_samples_leaf` | Minimum samples per leaf | 5-20 |
| `max_features` | Features considered per split | 'sqrt' or 0.3-0.7 |
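
These ranges are starting points. One way to choose among them without shuffling time is a grid search scored on chronological folds. A small sketch, assuming synthetic data and a deliberately tiny grid (both are illustrative choices):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(240, 5))                       # synthetic features
y_demo = (X_demo[:, 0] - X_demo[:, 1] + rng.normal(size=240) > 0).astype(int)

param_grid = {"max_depth": [3, 5], "min_samples_leaf": [5, 20]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=0),
    param_grid,
    cv=TimeSeriesSplit(n_splits=4),   # each fold validates on data after its training rows
    scoring="accuracy",
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Keeping the grid small limits the risk of selecting parameters that only fit noise in the validation folds.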

In [None]:
# Random Forest Classifier

rf_model = RandomForestClassifier(
    n_estimators=100,       # 100 trees
    max_depth=5,            # Limit depth (anti-overfitting)
    min_samples_leaf=10,    # At least 10 samples per leaf
    max_features='sqrt',    # sqrt(n_features) per split
    random_state=42,
    n_jobs=-1,              # Use all CPUs
    class_weight='balanced' # Handle class imbalance
)

print("Random Forest Classifier configured:")
print(f"  n_estimators: {rf_model.n_estimators}")
print(f"  max_depth: {rf_model.max_depth}")
print(f"  min_samples_leaf: {rf_model.min_samples_leaf}")
print(f"  max_features: {rf_model.max_features}")
print(f"  class_weight: {rf_model.class_weight}")

In [None]:
# Training with TimeSeriesSplit
# gap=5 matches the 5-day label horizon, so no training label overlaps a validation fold

tscv = TimeSeriesSplit(n_splits=5, gap=5)

print("TimeSeriesSplit Cross-Validation:")
print("="*60)

cv_scores = []
fold_results = []

for fold, (train_idx, val_idx) in enumerate(tscv.split(X_train_scaled), 1):
    # Split
    X_cv_train = X_train_scaled.iloc[train_idx]
    X_cv_val = X_train_scaled.iloc[val_idx]
    y_cv_train = y_train.iloc[train_idx]
    y_cv_val = y_train.iloc[val_idx]
    
    # Train
    rf_model.fit(X_cv_train, y_cv_train)
    
    # Predict
    predictions = rf_model.predict(X_cv_val)
    proba = rf_model.predict_proba(X_cv_val)
    
    # Metrics
    acc = accuracy_score(y_cv_val, predictions)
    prec = precision_score(y_cv_val, predictions, zero_division=0)
    rec = recall_score(y_cv_val, predictions, zero_division=0)
    f1 = f1_score(y_cv_val, predictions, zero_division=0)
    
    cv_scores.append(acc)
    fold_results.append({
        'fold': fold,
        'train_size': len(train_idx),
        'val_size': len(val_idx),
        'accuracy': acc,
        'precision': prec,
        'recall': rec,
        'f1': f1
    })
    
    print(f"\nFold {fold}:")
    print(f"  Train: {len(train_idx)} | Val: {len(val_idx)}")
    print(f"  Accuracy: {acc:.4f} | Precision: {prec:.4f} | Recall: {rec:.4f} | F1: {f1:.4f}")

print("\n" + "="*60)
print(f"Mean CV Accuracy: {np.mean(cv_scores):.4f} (+/- {np.std(cv_scores):.4f})")
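
Because the labels look 5 days ahead, the rows just before a validation fold carry labels computed from prices inside that fold. `TimeSeriesSplit` accepts a `gap` argument that drops those rows from the end of each training set. The sketch below prints the resulting index ranges on dummy data (`X_demo` and `horizon` are illustrative names):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

horizon = 5                               # label look-ahead, as in create_labels(horizon=5)
X_demo = np.arange(60).reshape(-1, 1)     # 60 dummy time steps

splits = list(TimeSeriesSplit(n_splits=3, gap=horizon).split(X_demo))
for fold, (tr, va) in enumerate(splits, 1):
    print(f"Fold {fold}: train 0..{tr[-1]}, validation {va[0]}..{va[-1]}")
```

Each training set ends `horizon` rows before its validation fold begins, at the cost of slightly less training data per fold.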

In [None]:
# Train the final model on the full training set
rf_model.fit(X_train_scaled, y_train)

# Predictions on the test set
rf_predictions = rf_model.predict(X_test_scaled)
rf_proba = rf_model.predict_proba(X_test_scaled)

print("Random Forest - Test Set Results:")
print("="*60)
print(f"Accuracy:  {accuracy_score(y_test, rf_predictions):.4f}")
print(f"Precision: {precision_score(y_test, rf_predictions):.4f}")
print(f"Recall:    {recall_score(y_test, rf_predictions):.4f}")
print(f"F1 Score:  {f1_score(y_test, rf_predictions):.4f}")

# AUC is undefined when y_test contains a single class
try:
    auc = roc_auc_score(y_test, rf_proba[:, 1])
    print(f"ROC AUC:   {auc:.4f}")
except ValueError:
    print("ROC AUC:   n/a (single class in y_test)")

In [None]:
# Feature Importance

feature_importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("Feature Importance (Random Forest):")
print("="*50)
for i, row in feature_importance.head(10).iterrows():
    print(f"  {row['feature']:25s} {row['importance']:.4f}")

# Visualization
fig, ax = plt.subplots(figsize=(10, 6))
top_features = feature_importance.head(12)
colors = plt.cm.viridis(np.linspace(0, 1, len(top_features)))
ax.barh(range(len(top_features)), top_features['importance'], color=colors)
ax.set_yticks(range(len(top_features)))
ax.set_yticklabels(top_features['feature'])
ax.invert_yaxis()
ax.set_xlabel('Importance', fontsize=12)
ax.set_title('Random Forest - Feature Importance', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

---

## Part 3: XGBoost Classifier (25 min)

### 3.1 Why XGBoost?

| Advantage | Description |
|----------|-------------|
| **Performance** | Often outperforms Random Forest |
| **Regularization** | Built-in L1/L2 (anti-overfitting) |
| **Early Stopping** | Stops training before overfitting sets in |
| **NaN handling** | Native (no imputation needed) |
| **Speed** | Optimized for performance |

### Gradient Boosting vs Random Forest

```
Random Forest:           XGBoost:
                         
 Tree 1  Tree 2  Tree N   Tree 1 --> Tree 2 --> Tree 3 --> Tree N
   |       |       |         |         |         |          |
   +-------+-------+         |      fixes      fixes      fixes
           |                 |      errors    errors     errors
        AVERAGE              |      of 1      of 2       of N-1
                             +-------+--------+-----------+
                                            SUM
```
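
The sequential scheme on the right can be sketched in a few lines for the simplest case, squared-error regression, where "correcting the errors" means fitting each new tree to the residuals of the current ensemble. This is an illustration of the boosting idea only: XGBoost additionally uses second-order gradient information and regularization. All names and settings below are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
prediction = np.full_like(y_demo, y_demo.mean())     # start from a constant model
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(20):
    residuals = y_demo - prediction                  # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residuals)
    prediction += learning_rate * tree.predict(X_demo)   # running SUM of shrunken trees
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"Training MSE: {mse_history[0]:.3f} -> {mse_history[-1]:.3f}")
```

The training error can only go down with each added tree, which is exactly why a held-out validation set and early stopping are needed to decide when to stop.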

### XGBoost Hyperparameters

| Parameter | Description | Typical value |
|-----------|-------------|----------------|
| `n_estimators` | Number of boosting iterations (trees) | 100-1000 |
| `max_depth` | Maximum depth | 3-6 |
| `learning_rate` | Learning rate | 0.01-0.3 |
| `subsample` | Fraction of samples per tree | 0.7-0.9 |
| `colsample_bytree` | Fraction of features per tree | 0.7-0.9 |
| `reg_alpha` | L1 regularization | 0-1 |
| `reg_lambda` | L2 regularization | 1-10 |

In [None]:
# XGBoost Classifier

xgb_model = xgb.XGBClassifier(
    n_estimators=200,          # Upper bound on trees (early stopping may stop sooner)
    max_depth=4,               # Shallower than RF (boosting compensates)
    learning_rate=0.1,         # Learning rate
    subsample=0.8,             # 80% of samples per tree
    colsample_bytree=0.8,      # 80% of features per tree
    reg_alpha=0.1,             # L1 regularization
    reg_lambda=1.0,            # L2 regularization
    objective='binary:logistic',
    eval_metric='logloss',
    early_stopping_rounds=20,  # Stop after 20 rounds without improvement (illustrative value; xgboost >= 1.6)
    random_state=42,
    n_jobs=-1
)

print("XGBoost Classifier configured:")
print(f"  n_estimators: {xgb_model.n_estimators}")
print(f"  max_depth: {xgb_model.max_depth}")
print(f"  learning_rate: {xgb_model.learning_rate}")
print(f"  subsample: {xgb_model.subsample}")
print(f"  colsample_bytree: {xgb_model.colsample_bytree}")

In [None]:
# Training with Early Stopping

# Split train into train/validation for early stopping
val_ratio = 0.2
val_split_idx = int(len(X_train_scaled) * (1 - val_ratio))

X_train_xgb = X_train_scaled.iloc[:val_split_idx]
X_val_xgb = X_train_scaled.iloc[val_split_idx:]
y_train_xgb = y_train.iloc[:val_split_idx]
y_val_xgb = y_train.iloc[val_split_idx:]

print(f"Train XGBoost: {len(X_train_xgb)} samples")
print(f"Validation:    {len(X_val_xgb)} samples")

# Train with early stopping, monitoring the validation set
xgb_model.fit(
    X_train_xgb, y_train_xgb,
    eval_set=[(X_val_xgb, y_val_xgb)],
    verbose=False
)

print(f"\nTraining finished.")
print(f"Best iteration: {getattr(xgb_model, 'best_iteration', 'N/A')}")

In [None]:
# XGBoost predictions on the test set

xgb_predictions = xgb_model.predict(X_test_scaled)
xgb_proba = xgb_model.predict_proba(X_test_scaled)

print("XGBoost - Test Set Results:")
print("="*60)
print(f"Accuracy:  {accuracy_score(y_test, xgb_predictions):.4f}")
print(f"Precision: {precision_score(y_test, xgb_predictions):.4f}")
print(f"Recall:    {recall_score(y_test, xgb_predictions):.4f}")
print(f"F1 Score:  {f1_score(y_test, xgb_predictions):.4f}")

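# AUC is undefined when y_test contains a single class
try:
    auc = roc_auc_score(y_test, xgb_proba[:, 1])
    print(f"ROC AUC:   {auc:.4f}")
except ValueError:
    print("ROC AUC:   n/a (single class in y_test)")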

In [None]:
# XGBoost Feature Importance

xgb_importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': xgb_model.feature_importances_
}).sort_values('importance', ascending=False)

print("Feature Importance (XGBoost):")
print("="*50)
for i, row in xgb_importance.head(10).iterrows():
    print(f"  {row['feature']:25s} {row['importance']:.4f}")

# RF vs XGBoost comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# RF
ax1 = axes[0]
top_rf = feature_importance.head(10)
ax1.barh(range(len(top_rf)), top_rf['importance'], color='steelblue')
ax1.set_yticks(range(len(top_rf)))
ax1.set_yticklabels(top_rf['feature'])
ax1.invert_yaxis()
ax1.set_xlabel('Importance')
ax1.set_title('Random Forest', fontsize=14, fontweight='bold')

# XGBoost
ax2 = axes[1]
top_xgb = xgb_importance.head(10)
ax2.barh(range(len(top_xgb)), top_xgb['importance'], color='coral')
ax2.set_yticks(range(len(top_xgb)))
ax2.set_yticklabels(top_xgb['feature'])
ax2.invert_yaxis()
ax2.set_xlabel('Importance')
ax2.set_title('XGBoost', fontsize=14, fontweight='bold')

plt.suptitle('Feature Importance Comparison', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Model comparison

comparison = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC AUC'],
    'Random Forest': [
        accuracy_score(y_test, rf_predictions),
        precision_score(y_test, rf_predictions),
        recall_score(y_test, rf_predictions),
        f1_score(y_test, rf_predictions),
        roc_auc_score(y_test, rf_proba[:, 1]) if len(np.unique(y_test)) > 1 else 0
    ],
    'XGBoost': [
        accuracy_score(y_test, xgb_predictions),
        precision_score(y_test, xgb_predictions),
        recall_score(y_test, xgb_predictions),
        f1_score(y_test, xgb_predictions),
        roc_auc_score(y_test, xgb_proba[:, 1]) if len(np.unique(y_test)) > 1 else 0
    ]
})

print("Random Forest vs XGBoost comparison:")
print("="*60)
print(comparison.to_string(index=False))

# Visualization
fig, ax = plt.subplots(figsize=(10, 6))
x = np.arange(len(comparison))
width = 0.35

bars1 = ax.bar(x - width/2, comparison['Random Forest'], width, label='Random Forest', color='steelblue')
bars2 = ax.bar(x + width/2, comparison['XGBoost'], width, label='XGBoost', color='coral')

ax.set_xlabel('Metric', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('Model Comparison', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(comparison['Metric'])
ax.legend()
ax.set_ylim(0, 1)
ax.grid(True, alpha=0.3, axis='y')

# Add the values on top of the bars
for bar in bars1:
    height = bar.get_height()
    ax.annotate(f'{height:.3f}', xy=(bar.get_x() + bar.get_width()/2, height),
                xytext=(0, 3), textcoords="offset points", ha='center', va='bottom', fontsize=8)
for bar in bars2:
    height = bar.get_height()
    ax.annotate(f'{height:.3f}', xy=(bar.get_x() + bar.get_width()/2, height),
                xytext=(0, 3), textcoords="offset points", ha='center', va='bottom', fontsize=8)

plt.tight_layout()
plt.show()

---

## Part 4: Validation and Metrics (20 min)

### 4.1 Classification Metrics for Trading

| Metric | Formula | Trading interpretation |
|----------|---------|------------------------|
| **Accuracy** | (TP+TN) / Total | % of correct predictions |
| **Precision** | TP / (TP+FP) | % of winning trades among BUY signals |
| **Recall** | TP / (TP+FN) | % of Up days detected |
| **F1 Score** | 2 * (P*R)/(P+R) | Balance between precision and recall |

### Why Precision Matters in Trading

```
                        Actual
                    Up      Down
              +--------+--------+
Predicted Up  |   TP   |   FP   |  <-- Precision = TP / (TP + FP)
              +--------+--------+      (% of profitable BUY signals)
Predicted Down|   FN   |   TN   |
              +--------+--------+

For a trader:
- FP (False Positive) = Buy followed by a decline = LOSS
- A 55% Precision is profitable when average gain / average loss > 0.45/0.55 (about 0.82), before costs
```
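
The last point can be made concrete with the expected profit per BUY signal. The sketch below ignores transaction costs and assumes a fixed average gain and average loss per trade; both function names are hypothetical helpers, not part of the notebook.

```python
def expected_value_per_trade(precision: float, avg_gain: float, avg_loss: float) -> float:
    """Expected P&L per BUY signal, ignoring costs (simplifying assumption)."""
    return precision * avg_gain - (1.0 - precision) * avg_loss


def breakeven_precision(avg_gain: float, avg_loss: float) -> float:
    """Precision at which the expected P&L per trade is exactly zero."""
    return avg_loss / (avg_gain + avg_loss)


# Symmetric payoff: 55% precision leaves a positive edge
print(round(expected_value_per_trade(0.55, avg_gain=1.0, avg_loss=1.0), 4))

# Losses 1.5x larger than gains: the same precision loses money
print(round(expected_value_per_trade(0.55, avg_gain=1.0, avg_loss=1.5), 4))
print(round(breakeven_precision(avg_gain=1.0, avg_loss=1.5), 4))
```

Precision therefore has to be read together with the payoff ratio: the same classifier can be profitable or not depending on how exits are managed.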

In [None]:
# Detailed metrics

def detailed_classification_metrics(y_true, y_pred, y_proba=None, model_name="Model"):
    """
    Compute and print detailed classification metrics.
    """
    print(f"\n{'='*60}")
    print(f"{model_name} - Detailed Metrics")
    print(f"{'='*60}")
    
    # Basic metrics
    acc = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred, zero_division=0)
    rec = recall_score(y_true, y_pred, zero_division=0)
    f1 = f1_score(y_true, y_pred, zero_division=0)
    
    print(f"\nBasic Metrics:")
    print(f"  Accuracy:  {acc:.4f}")
    print(f"  Precision: {prec:.4f}  (% winning trades among BUY signals)")
    print(f"  Recall:    {rec:.4f}  (% Up days detected)")
    print(f"  F1 Score:  {f1:.4f}")
    
    # Confusion matrix (labels fixed so the matrix is always 2x2)
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    print(f"\nConfusion Matrix:")
    print(f"                  Predicted")
    print(f"                Down    Up")
    print(f"  Actual Down   {cm[0,0]:4d}   {cm[0,1]:4d}")
    print(f"  Actual Up     {cm[1,0]:4d}   {cm[1,1]:4d}")
    
    # Trading interpretation
    tn, fp, fn, tp = cm.ravel()
    print(f"\nTrading Interpretation:")
    print(f"  True Positives (TP):  {tp:4d} - Correct BUY signals")
    print(f"  False Positives (FP): {fp:4d} - Wrong BUY signals (losses)")
    print(f"  True Negatives (TN):  {tn:4d} - Correct SELL signals")
    print(f"  False Negatives (FN): {fn:4d} - Missed opportunities")
    
    # ROC AUC
    if y_proba is not None and len(np.unique(y_true)) > 1:
        auc = roc_auc_score(y_true, y_proba[:, 1])
        print(f"\n  ROC AUC: {auc:.4f}")
    
    return {
        'accuracy': acc,
        'precision': prec,
        'recall': rec,
        'f1': f1,
        'confusion_matrix': cm
    }

# Apply to both models
rf_metrics = detailed_classification_metrics(y_test, rf_predictions, rf_proba, "Random Forest")
xgb_metrics = detailed_classification_metrics(y_test, xgb_predictions, xgb_proba, "XGBoost")

In [None]:
# Visualize the confusion matrices

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Random Forest
ax1 = axes[0]
cm_rf = confusion_matrix(y_test, rf_predictions)
im1 = ax1.imshow(cm_rf, cmap='Blues')
ax1.set_xticks([0, 1])
ax1.set_yticks([0, 1])
ax1.set_xticklabels(['Down (0)', 'Up (1)'])
ax1.set_yticklabels(['Down (0)', 'Up (1)'])
ax1.set_xlabel('Predicted', fontsize=12)
ax1.set_ylabel('Actual', fontsize=12)
ax1.set_title('Random Forest', fontsize=14, fontweight='bold')
for i in range(2):
    for j in range(2):
        ax1.text(j, i, cm_rf[i, j], ha='center', va='center', fontsize=16, fontweight='bold')

# XGBoost
ax2 = axes[1]
cm_xgb = confusion_matrix(y_test, xgb_predictions)
im2 = ax2.imshow(cm_xgb, cmap='Oranges')
ax2.set_xticks([0, 1])
ax2.set_yticks([0, 1])
ax2.set_xticklabels(['Down (0)', 'Up (1)'])
ax2.set_yticklabels(['Down (0)', 'Up (1)'])
ax2.set_xlabel('Predicted', fontsize=12)
ax2.set_ylabel('Actual', fontsize=12)
ax2.set_title('XGBoost', fontsize=14, fontweight='bold')
for i in range(2):
    for j in range(2):
        ax2.text(j, i, cm_xgb[i, j], ha='center', va='center', fontsize=16, fontweight='bold')

plt.suptitle('Confusion Matrices', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# ROC curves

fig, ax = plt.subplots(figsize=(8, 6))

# Random Forest ROC
fpr_rf, tpr_rf, _ = roc_curve(y_test, rf_proba[:, 1])
auc_rf = roc_auc_score(y_test, rf_proba[:, 1])
ax.plot(fpr_rf, tpr_rf, 'b-', linewidth=2, label=f'Random Forest (AUC = {auc_rf:.3f})')

# XGBoost ROC
fpr_xgb, tpr_xgb, _ = roc_curve(y_test, xgb_proba[:, 1])
auc_xgb = roc_auc_score(y_test, xgb_proba[:, 1])
ax.plot(fpr_xgb, tpr_xgb, 'r-', linewidth=2, label=f'XGBoost (AUC = {auc_xgb:.3f})')

# Random-classifier baseline
ax.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random (AUC = 0.500)')

ax.set_xlabel('False Positive Rate', fontsize=12)
ax.set_ylabel('True Positive Rate', fontsize=12)
ax.set_title('ROC Curves', fontsize=14, fontweight='bold')
ax.legend(loc='lower right')
ax.grid(True, alpha=0.3)
ax.set_xlim([0, 1])
ax.set_ylim([0, 1])

plt.tight_layout()
plt.show()

### 4.2 Walk-Forward Validation

Walk-Forward Validation simulates live trading by retraining the model periodically and testing it only on data that comes after each training window.

```
Expanding Window:                    Sliding Window:

|=====Train=====|Test|               |=====Train=====|Test|
|=======Train========|Test|          |     |====Train====|Test|
|=========Train==========|Test|      |          |===Train===|Test|

-> Training set grows                -> Fixed-size training set
-> More data                         -> More recent = more relevant
```
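
The two schemes differ only in where each training window starts. A minimal index generator makes this explicit; `walk_forward_windows` is a hypothetical helper written for illustration, with small window sizes so the output stays readable:

```python
from typing import Iterator, Tuple


def walk_forward_windows(n_samples: int, train_window: int, test_window: int,
                         expanding: bool = True) -> Iterator[Tuple[range, range]]:
    """Yield (train_indices, test_indices) pairs in chronological order."""
    start = train_window
    while start + test_window <= n_samples:
        # Expanding: always start at 0. Sliding: keep the last train_window rows.
        train_start = 0 if expanding else start - train_window
        yield range(train_start, start), range(start, start + test_window)
        start += test_window


for mode in (True, False):
    print("expanding" if mode else "sliding")
    for train_idx, test_idx in walk_forward_windows(10, 4, 2, expanding=mode):
        print("  train", list(train_idx), "test", list(test_idx))
```

In both modes every test index is strictly later than every training index, which is the property that keeps the evaluation free of lookahead.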

In [None]:
def walk_forward_validation(X, y, model_class, model_params,
                            train_window=252, test_window=21,
                            expanding=True):
    """
    Walk-forward validation with periodic retraining.
    
    Parameters:
    -----------
    X : pd.DataFrame
        Features
    y : pd.Series
        Labels
    model_class : class
        Model class (e.g. RandomForestClassifier)
    model_params : dict
        Model parameters
    train_window : int
        Training window size in rows (initial size when expanding=True)
    test_window : int
        Test window size in rows
    expanding : bool
        True = expanding window, False = sliding window
    
    Returns:
    --------
    dict : Validation results
    """
    results = {
        'predictions': [],
        'actuals': [],
        'probas': [],
        'dates': [],
        'fold_metrics': []
    }
    
    n_samples = len(X)
    current_idx = train_window
    fold = 0
    
    while current_idx + test_window <= n_samples:
        fold += 1
        
        # Define train/test indices
        if expanding:
            train_start = 0
        else:
            train_start = current_idx - train_window
        
        train_end = current_idx
        test_start = current_idx
        test_end = min(current_idx + test_window, n_samples)
        
        # Split data
        X_train_wf = X.iloc[train_start:train_end]
        y_train_wf = y.iloc[train_start:train_end]
        X_test_wf = X.iloc[test_start:test_end]
        y_test_wf = y.iloc[test_start:test_end]
        
        # Scale
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train_wf)
        X_test_scaled = scaler.transform(X_test_wf)
        
        # Train model
        model = model_class(**model_params)
        model.fit(X_train_scaled, y_train_wf)
        
        # Predict
        preds = model.predict(X_test_scaled)
        proba = model.predict_proba(X_test_scaled)[:, 1]
        
        # Store results
        results['predictions'].extend(preds)
        results['actuals'].extend(y_test_wf.values)
        results['probas'].extend(proba)
        results['dates'].extend(X_test_wf.index.tolist())
        
        # Fold metrics
        acc = accuracy_score(y_test_wf, preds)
        results['fold_metrics'].append({
            'fold': fold,
            'train_size': len(X_train_wf),
            'test_size': len(X_test_wf),
            'accuracy': acc
        })
        
        # Move to next window
        current_idx += test_window
    
    # Overall metrics
    results['overall_accuracy'] = accuracy_score(results['actuals'], results['predictions'])
    results['overall_precision'] = precision_score(results['actuals'], results['predictions'])
    results['overall_f1'] = f1_score(results['actuals'], results['predictions'])
    
    return results


# Run Walk-Forward Validation
print("Walk-Forward Validation (Monthly Retrain):")
print("="*60)

wf_results = walk_forward_validation(
    X, y,
    RandomForestClassifier,
    {'n_estimators': 100, 'max_depth': 5, 'random_state': 42, 'n_jobs': -1},
    train_window=252,   # about 1 year of training data
    test_window=21,     # about 1 month of test data
    expanding=True
)

print(f"\nNumber of folds: {len(wf_results['fold_metrics'])}")
print(f"\nMetrics per fold:")
for fm in wf_results['fold_metrics'][:5]:
    print(f"  Fold {fm['fold']}: Train={fm['train_size']}, Test={fm['test_size']}, Acc={fm['accuracy']:.4f}")
if len(wf_results['fold_metrics']) > 5:
    print(f"  ...")

print(f"\nOverall Results:")
print(f"  Accuracy:  {wf_results['overall_accuracy']:.4f}")
print(f"  Precision: {wf_results['overall_precision']:.4f}")
print(f"  F1 Score:  {wf_results['overall_f1']:.4f}")

In [None]:
# Walk-Forward visualization

fig, axes = plt.subplots(2, 1, figsize=(14, 8))

# Accuracy per fold
ax1 = axes[0]
folds = [fm['fold'] for fm in wf_results['fold_metrics']]
accs = [fm['accuracy'] for fm in wf_results['fold_metrics']]
ax1.bar(folds, accs, color='steelblue', edgecolor='black')
ax1.axhline(y=wf_results['overall_accuracy'], color='red', linestyle='--', 
            label=f"Mean Accuracy: {wf_results['overall_accuracy']:.3f}")
ax1.axhline(y=0.5, color='gray', linestyle=':', label='Random Baseline')
ax1.set_xlabel('Fold', fontsize=12)
ax1.set_ylabel('Accuracy', fontsize=12)
ax1.set_title('Walk-Forward Validation - Accuracy per Fold', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3, axis='y')

# Cumulative predictions
ax2 = axes[1]
dates = wf_results['dates']
preds = wf_results['predictions']
actuals = wf_results['actuals']

# Compute cumulative accuracy
cumulative_correct = np.cumsum([1 if p == a else 0 for p, a in zip(preds, actuals)])
cumulative_total = np.arange(1, len(preds) + 1)
cumulative_accuracy = cumulative_correct / cumulative_total

ax2.plot(dates, cumulative_accuracy, 'b-', linewidth=2)
ax2.axhline(y=0.5, color='gray', linestyle=':', label='Random Baseline')
ax2.fill_between(dates, 0.5, cumulative_accuracy, alpha=0.3, color='steelblue')
ax2.set_xlabel('Date', fontsize=12)
ax2.set_ylabel('Cumulative Accuracy', fontsize=12)
ax2.set_title('Cumulative Accuracy Over Time', fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## Part 5: QuantConnect Integration (20 min)

### 5.1 ObjectStore for Model Persistence

QuantConnect provides **ObjectStore** to persist objects between runs:

- Save a trained model
- Load a model for predictions
- Share between Research and Algorithm

```python
# Save
model_bytes = pickle.dumps(model)
self.ObjectStore.SaveBytes("model/rf_classifier", model_bytes)

# Load
model_bytes = self.ObjectStore.ReadBytes("model/rf_classifier")
model = pickle.loads(model_bytes)
```
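
The serialization step can be tested locally before involving QuantConnect. The sketch below replaces ObjectStore with a plain dictionary (`fake_store`, a hypothetical stand-in, not a QuantConnect API) and bundles the model with its scaler under one key, a design choice assumed here so that the two cannot get out of sync:

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))                 # synthetic features
y_demo = (X_demo[:, 0] > 0).astype(int)

scaler = StandardScaler().fit(X_demo)
model = RandomForestClassifier(n_estimators=20, random_state=0)
model.fit(scaler.transform(X_demo), y_demo)

# Bundle model and scaler into a single bytes payload
payload = pickle.dumps({"model": model, "scaler": scaler})

fake_store = {}                                    # hypothetical in-memory stand-in for ObjectStore
fake_store["model/rf_bundle"] = payload

# Restore and check that predictions are unchanged
restored = pickle.loads(fake_store["model/rf_bundle"])
same = np.array_equal(
    model.predict(scaler.transform(X_demo)),
    restored["model"].predict(restored["scaler"].transform(X_demo)),
)
print(same)
```

Pickled estimators should be loaded with the same scikit-learn version that saved them, and only from a trusted source, since unpickling can execute arbitrary code.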

In [None]:
# Code QuantConnect pour ObjectStore

objectstore_code = '''
from AlgorithmImports import *
import pickle
import numpy as np

class MLModelPersistence(QCAlgorithm):
    """
    Demonstration de la persistence des modeles ML avec ObjectStore.
    """
    
    def Initialize(self):
        self.SetStartDate(2024, 1, 1)
        self.SetCash(100000)
        
        self.symbol = self.AddEquity("SPY", Resolution.Daily).Symbol
        
        # ObjectStore keys for the model and its scaler
        self.model_key = "model/rf_direction_classifier"
        self.scaler_key = "model/scaler"
        
        # Load the model, or train a new one
        self.model = self.LoadOrTrainModel()
    
    def LoadOrTrainModel(self):
        """
        Loads the model from ObjectStore, or trains a new one if none is stored.
        """
        # Both objects are required: a model without its scaler cannot be used
        if (self.ObjectStore.ContainsKey(self.model_key)
                and self.ObjectStore.ContainsKey(self.scaler_key)):
            self.Debug("Loading model from ObjectStore...")
            
            # ReadBytes returns a .NET byte array; convert it before unpickling
            model_bytes = bytearray(self.ObjectStore.ReadBytes(self.model_key))
            model = pickle.loads(model_bytes)
            
            # Load the scaler
            scaler_bytes = bytearray(self.ObjectStore.ReadBytes(self.scaler_key))
            self.scaler = pickle.loads(scaler_bytes)
            
            self.Debug("Model loaded successfully")
            return model
        else:
            self.Debug("Training a new model...")
            return self.TrainAndSaveModel()
    
    def TrainAndSaveModel(self):
        """
        Trains a new model and saves it to ObjectStore.
        """
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.preprocessing import StandardScaler
        
        # Fetch historical data
        history = self.History(self.symbol, 500, Resolution.Daily)
        
        if history.empty:
            self.Debug("No historical data available")
            return None
        
        # Compute features (simplified)
        df = history[\'close\'].unstack(level=0)
        df.columns = [\'close\']
        
        df[\'return_1d\'] = df[\'close\'].pct_change()
        df[\'return_5d\'] = df[\'close\'].pct_change(5)
        df[\'volatility\'] = df[\'return_1d\'].rolling(20).std()
        df[\'sma_ratio\'] = df[\'close\'] / df[\'close\'].rolling(20).mean()
        
        # Label: direction of the 5-day forward return.
        # The last 5 rows have no forward return and are removed by dropna() below.
        df[\'future_return\'] = df[\'close\'].shift(-5) / df[\'close\'] - 1
        df[\'label\'] = (df[\'future_return\'] > 0).astype(int)
        
        # Build X, y
        feature_cols = [\'return_1d\', \'return_5d\', \'volatility\', \'sma_ratio\']
        df_clean = df.dropna()
        
        X = df_clean[feature_cols]
        y = df_clean[\'label\']
        
        # Scale features
        self.scaler = StandardScaler()
        X_scaled = self.scaler.fit_transform(X)
        
        # Train
        model = RandomForestClassifier(
            n_estimators=100,
            max_depth=5,
            random_state=42
        )
        model.fit(X_scaled, y)
        
        # Save to ObjectStore
        model_bytes = pickle.dumps(model)
        self.ObjectStore.SaveBytes(self.model_key, model_bytes)
        
        scaler_bytes = pickle.dumps(self.scaler)
        self.ObjectStore.SaveBytes(self.scaler_key, scaler_bytes)
        
        # In-sample accuracy only: it does not measure out-of-sample performance
        self.Debug(f"Model trained and saved. Train accuracy: {model.score(X_scaled, y):.4f}")
        
        return model
    
    def DeleteModel(self):
        """
        Deletes the model and scaler from ObjectStore (to force a retrain).
        """
        # Check each key separately so a missing scaler does not block the cleanup
        for key in (self.model_key, self.scaler_key):
            if self.ObjectStore.ContainsKey(key):
                self.ObjectStore.Delete(key)
        self.Debug("Model deleted from ObjectStore")
'''

print("ObjectStore for Model Persistence:")
print(objectstore_code)

### 5.2 ML Classification Alpha Model

Next, we build an Alpha Model that uses an ML classifier to generate Insights.

In [None]:
# ML Classification Alpha Model

ml_alpha_code = '''
from AlgorithmImports import *
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from datetime import timedelta
from collections import deque
import numpy as np
import pickle


class MLClassificationAlphaModel(AlphaModel):
    """
    Alpha Model that uses an ML classifier to predict direction.
    
    Features:
    - Multi-period returns
    - Volatility
    - Price/SMA ratio
    - Normalized RSI
    
    Prediction:
    - 5-day direction (Up/Down)
    - Confidence = model probability
    """
    
    def __init__(self,
                 model_key: str = "model/rf_classifier",
                 lookback: int = 252,
                 retrain_frequency: int = 21,
                 prediction_horizon: int = 5,
                 probability_threshold: float = 0.6):
        """
        Parameters:
        -----------
        model_key : str
            ObjectStore key for the model
        lookback : int
            Days of history used for training
        retrain_frequency : int
            Retrain frequency (calendar days)
        prediction_horizon : int
            Prediction horizon (trading bars for the label, calendar days for the Insight period)
        probability_threshold : float
            Minimum probability required to emit an Insight
        """
        self.model_key = model_key
        self.lookback = lookback
        self.retrain_frequency = retrain_frequency
        self.prediction_horizon = prediction_horizon
        self.probability_threshold = probability_threshold
        
        self.model = None
        self.scaler = None
        self.last_train_time = None
        self.symbol_data = {}
    
    def Update(self, algorithm, data):
        """
        Generates Insights from the ML predictions.
        """
        insights = []
        
        # Keep each price window current. Without this step the features would
        # stay frozen at the values loaded during warmup.
        for symbol, sd in self.symbol_data.items():
            if data.Bars.ContainsKey(symbol):
                sd.price_history.append(float(data.Bars[symbol].Close))
        
        # Retrain if needed
        if self._should_retrain(algorithm):
            self._train_model(algorithm)
        
        if self.model is None:
            return insights
        
        # Generate a prediction for each symbol
        for symbol, sd in self.symbol_data.items():
            if not data.Bars.ContainsKey(symbol):
                continue
            
            # Extract features
            features = sd.ExtractFeatures(algorithm, symbol)
            if features is None:
                continue
            
            # Scale and predict
            features_scaled = self.scaler.transform([features])
            prediction = self.model.predict(features_scaled)[0]
            proba = self.model.predict_proba(features_scaled)[0]
            
            # Probability of the predicted class
            confidence = max(proba)
            
            # Emit an Insight only if confidence is high enough
            if confidence >= self.probability_threshold:
                direction = InsightDirection.Up if prediction == 1 else InsightDirection.Down
                
                insights.append(Insight.Price(
                    symbol,
                    timedelta(days=self.prediction_horizon),
                    direction,
                    magnitude=confidence - 0.5,  # Excess confidence
                    confidence=confidence
                ))
                
                algorithm.Debug(f"{algorithm.Time}: {symbol.Value} -> {direction}, Conf={confidence:.3f}")
        
        return insights
    
    def OnSecuritiesChanged(self, algorithm, changes):
        """
        Handles securities being added to or removed from the universe.
        """
        for security in changes.AddedSecurities:
            symbol = security.Symbol
            if symbol not in self.symbol_data:
                self.symbol_data[symbol] = MLSymbolData(algorithm, symbol, self.lookback)
        
        for security in changes.RemovedSecurities:
            symbol = security.Symbol
            if symbol in self.symbol_data:
                del self.symbol_data[symbol]
    
    def _should_retrain(self, algorithm):
        """
        Decides whether the model must be retrained.
        """
        if self.model is None:
            return True
        
        if self.last_train_time is None:
            return True
        
        days_since_train = (algorithm.Time - self.last_train_time).days
        return days_since_train >= self.retrain_frequency
    
    def _train_model(self, algorithm):
        """
        Trains the model on recent data.
        """
        algorithm.Debug(f"{algorithm.Time}: Training ML model...")
        
        # Collect data from all symbols
        all_features = []
        all_labels = []
        
        for symbol, sd in self.symbol_data.items():
            X, y = sd.GetTrainingData(algorithm, symbol, self.lookback, self.prediction_horizon)
            if X is not None and len(X) > 50:
                all_features.extend(X)
                all_labels.extend(y)
        
        if len(all_features) < 100:
            algorithm.Debug("Not enough data to train")
            return
        
        # Convert to arrays
        X = np.array(all_features)
        y = np.array(all_labels)
        
        # Scale features
        self.scaler = StandardScaler()
        X_scaled = self.scaler.fit_transform(X)
        
        # Train
        self.model = RandomForestClassifier(
            n_estimators=100,
            max_depth=5,
            min_samples_leaf=10,
            random_state=42,
            n_jobs=-1
        )
        self.model.fit(X_scaled, y)
        
        self.last_train_time = algorithm.Time
        
        # Log performance (in-sample accuracy only)
        train_acc = self.model.score(X_scaled, y)
        algorithm.Debug(f"Model trained. Train accuracy: {train_acc:.4f}")
        
        # Save to ObjectStore
        self._save_model(algorithm)
    
    def _save_model(self, algorithm):
        """
        Saves the model and its scaler to ObjectStore.
        """
        model_bytes = pickle.dumps(self.model)
        algorithm.ObjectStore.SaveBytes(self.model_key, model_bytes)
        
        scaler_bytes = pickle.dumps(self.scaler)
        algorithm.ObjectStore.SaveBytes(self.model_key + "_scaler", scaler_bytes)
    
    def _load_model(self, algorithm):
        """
        Loads the model and its scaler from ObjectStore.
        """
        scaler_key = self.model_key + "_scaler"
        if (algorithm.ObjectStore.ContainsKey(self.model_key)
                and algorithm.ObjectStore.ContainsKey(scaler_key)):
            # ReadBytes returns a .NET byte array; convert it before unpickling
            model_bytes = bytearray(algorithm.ObjectStore.ReadBytes(self.model_key))
            self.model = pickle.loads(model_bytes)
            
            scaler_bytes = bytearray(algorithm.ObjectStore.ReadBytes(scaler_key))
            self.scaler = pickle.loads(scaler_bytes)
            
            algorithm.Debug("Model loaded from ObjectStore")
            return True
        return False


class MLSymbolData:
    """
    Per-symbol data for the ML Alpha Model.
    Computes and stores the technical features.
    """
    
    def __init__(self, algorithm, symbol, lookback):
        self.symbol = symbol
        self.lookback = lookback
        
        # Indicators. The RSI uses simple averaging so that it matches the
        # rolling-mean RSI computed in GetTrainingData.
        self.sma_20 = algorithm.SMA(symbol, 20, Resolution.Daily)
        self.rsi_14 = algorithm.RSI(symbol, 14, MovingAverageType.Simple, Resolution.Daily)
        
        # Price history
        self.price_history = deque(maxlen=lookback)
        
        # Warm up the price window and indicators from past bars
        history = algorithm.History(symbol, lookback, Resolution.Daily)
        if not history.empty:
            for bar in history.itertuples():
                # bar.Index is a (symbol, time) tuple
                self.price_history.append(bar.close)
                self.sma_20.Update(bar.Index[1], bar.close)
                self.rsi_14.Update(bar.Index[1], bar.close)
    
    def ExtractFeatures(self, algorithm, symbol):
        """
        Extracts the feature vector for one prediction.
        """
        if not self.sma_20.IsReady or not self.rsi_14.IsReady:
            return None
        
        # 21 closes are needed to compute the 20-day return
        if len(self.price_history) < 21:
            return None
        
        prices = list(self.price_history)
        current_price = prices[-1]
        
        # Multi-period returns
        return_1d = (prices[-1] - prices[-2]) / prices[-2] if len(prices) >= 2 else 0
        return_5d = (prices[-1] - prices[-6]) / prices[-6] if len(prices) >= 6 else 0
        return_10d = (prices[-1] - prices[-11]) / prices[-11] if len(prices) >= 11 else 0
        return_20d = (prices[-1] - prices[-21]) / prices[-21] if len(prices) >= 21 else 0
        
        # Volatility (ddof=1 matches the pandas rolling().std() used in training)
        returns = [(prices[i] - prices[i-1]) / prices[i-1] for i in range(1, len(prices))]
        volatility = np.std(returns[-20:], ddof=1) if len(returns) >= 20 else 0
        
        # SMA ratio
        sma_ratio = current_price / self.sma_20.Current.Value
        
        # Normalized RSI, mapped from [0, 100] to [-1, 1]
        rsi_norm = (self.rsi_14.Current.Value - 50) / 50
        
        return [return_1d, return_5d, return_10d, return_20d, volatility, sma_ratio, rsi_norm]
    
    def GetTrainingData(self, algorithm, symbol, lookback, horizon):
        """
        Builds the training features and labels.
        """
        history = algorithm.History(symbol, lookback + horizon + 50, Resolution.Daily)
        
        if history.empty or len(history) < lookback:
            return None, None
        
        # Convert to a single-column DataFrame of closes
        df = history[\'close\'].unstack(level=0)
        df.columns = [\'close\']
        
        # Features
        df[\'return_1d\'] = df[\'close\'].pct_change(1)
        df[\'return_5d\'] = df[\'close\'].pct_change(5)
        df[\'return_10d\'] = df[\'close\'].pct_change(10)
        df[\'return_20d\'] = df[\'close\'].pct_change(20)
        df[\'volatility\'] = df[\'return_1d\'].rolling(20).std()
        df[\'sma_ratio\'] = df[\'close\'] / df[\'close\'].rolling(20).mean()
        
        # RSI
        delta = df[\'close\'].diff()
        gain = delta.clip(lower=0).rolling(14).mean()
        loss = (-delta.clip(upper=0)).rolling(14).mean()
        rs = gain / loss
        df[\'rsi_norm\'] = ((100 - (100 / (1 + rs))) - 50) / 50
        
        # Label
        df[\'future_return\'] = df[\'close\'].shift(-horizon) / df[\'close\'] - 1
        df[\'label\'] = (df[\'future_return\'] > 0).astype(int)
        
        # Clean
        feature_cols = [\'return_1d\', \'return_5d\', \'return_10d\', \'return_20d\',
                        \'volatility\', \'sma_ratio\', \'rsi_norm\']
        df_clean = df.dropna()
        
        X = df_clean[feature_cols].values.tolist()
        y = df_clean[\'label\'].values.tolist()
        
        return X, y
'''

print("MLClassificationAlphaModel:")
print(ml_alpha_code)
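
The Alpha Model computes features in two places: `ExtractFeatures` for live prediction and `GetTrainingData` for training. Any difference between the two, such as a different smoothing method or standard-deviation convention, means the model is scored on inputs it was not trained on. One way to rule this out is to route both paths through a single function. The sketch below uses only the standard library and a reduced feature set. `compute_features` and `build_training_rows` are hypothetical names that do not exist in the notebook.

```python
from statistics import mean, stdev


def compute_features(closes):
    """Feature vector at the last bar of `closes` (needs at least 21 values)."""
    if len(closes) < 21:
        return None
    last = closes[-1]
    # The 20 most recent one-day returns
    rets = [closes[i] / closes[i - 1] - 1 for i in range(len(closes) - 20, len(closes))]
    return [
        last / closes[-2] - 1,      # 1-day return
        last / closes[-6] - 1,      # 5-day return
        last / closes[-21] - 1,     # 20-day return
        stdev(rets),                # 20-day volatility (sample std)
        last / mean(closes[-20:]),  # price / SMA(20)
    ]


def build_training_rows(closes, horizon=5):
    """Training rows built by calling compute_features on each price prefix."""
    X, y = [], []
    for t in range(21, len(closes) - horizon + 1):
        X.append(compute_features(closes[:t]))
        # Label: is the close `horizon` bars after bar t-1 higher than at t-1?
        y.append(int(closes[t - 1 + horizon] > closes[t - 1]))
    return X, y


# Toy price series (hypothetical): a steady uptrend
toy_closes = [100 + i for i in range(30)]
X_rows, y_rows = build_training_rows(toy_closes, horizon=5)
live_features = compute_features(toy_closes)
print(len(X_rows), len(live_features))
```

Recomputing features on every prefix is slower than the vectorized pandas version, but for daily data and a few hundred bars the cost is small and the two paths cannot diverge.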

---

## Part 6: Complete Strategy (20 min)

### 6.1 Strategy Architecture

```
              OHLCV Data
                  |
                  v
    +---------------------------+
    |    Feature Engineering    |
    | (returns, vol, RSI, SMA)  |
    +---------------------------+
                  |
                  v
    +---------------------------+
    |     Random Forest         |
    |     Classifier            |
    | (monthly retrain)         |
    +---------------------------+
                  |
                  v
    +---------------------------+
    |     Probability           |
    |     Threshold (>0.6)      |
    +---------------------------+
                  |
                  v
    +---------------------------+
    |     Position Sizing       |
    |     (by confidence)       |
    +---------------------------+
                  |
                  v
           Execution
```
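
The last two boxes of the diagram can be made concrete with a small worked example. The strategy below maps confidence to a target weight with `max_position * (confidence - 0.5) * 2`, applied only once confidence reaches the threshold. The sketch isolates that mapping in a standalone function (`target_weight` is an illustrative name) using the default values from this notebook.

```python
def target_weight(confidence, threshold=0.6, max_position=0.2):
    """Portfolio weight for an Up prediction given the model confidence.

    Below the threshold no position is taken. Above it, the weight grows
    linearly from 0 at confidence 0.5 to max_position at confidence 1.0.
    """
    if confidence < threshold:
        return 0.0
    return max_position * (confidence - 0.5) * 2


for conf in (0.55, 0.60, 0.75, 0.90, 1.00):
    print(f"confidence={conf:.2f} -> weight={target_weight(conf):.2%}")
```

With these defaults a signal that just clears the 0.6 threshold gets a weight of about 4%, which is why most positions stay well below the 20% cap.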

In [None]:
# Complete ML Direction Prediction strategy

complete_strategy_code = '''
from AlgorithmImports import *
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
import numpy as np
import pickle


class MLDirectionPredictionStrategy(QCAlgorithm):
    """
    Trading strategy based on ML classification.
    
    - Model: Random Forest Classifier
    - Features: Returns, Volatility, SMA, RSI
    - Prediction: 5-day direction
    - Retrain: Monthly
    - Position sizing: Proportional to confidence
    """
    
    def Initialize(self):
        # === CONFIGURATION ===
        self.SetStartDate(2020, 1, 1)
        self.SetEndDate(2024, 1, 1)
        self.SetCash(100000)
        
        # Parameters
        self.lookback = 252              # 1 year of history
        self.retrain_days = 21           # Retrain roughly once a month
        self.prediction_horizon = 5      # Predict 5 days ahead
        self.probability_threshold = 0.6 # Confidence threshold
        self.max_position = 0.2          # Max 20% per position
        
        # Universe
        self.tickers = ["AAPL", "MSFT", "GOOGL", "AMZN", "META"]
        self.symbols = {}
        self.indicators = {}
        
        for ticker in self.tickers:
            equity = self.AddEquity(ticker, Resolution.Daily)
            symbol = equity.Symbol
            self.symbols[ticker] = symbol
            
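            self.indicators[symbol] = {
                "sma_20": self.SMA(symbol, 20, Resolution.Daily),
                # Simple averaging matches the rolling-mean RSI used in TrainModel
                "rsi_14": self.RSI(symbol, 14, MovingAverageType.Simple, Resolution.Daily)
            }
        
        # Model
        self.model = None
        self.scaler = None
        self.last_train = None
        
        # Schedule a monthly retrain. SPY is not part of this universe, so the
        # rules are anchored on a subscribed symbol; MonthStart(symbol) also
        # picks the first trading day of the month rather than the first calendar day.
        anchor = self.symbols[self.tickers[0]]
        self.Schedule.On(
            self.DateRules.MonthStart(anchor),
            self.TimeRules.AfterMarketOpen(anchor, 30),
            self.TrainModel
        )
        
        # Warmup
        self.SetWarmup(self.lookback)
        
        self.Log("ML Direction Prediction Strategy initialized")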
    
    def TrainModel(self):
        """
        Trains the model on recent data.
        """
        self.Debug(f"{self.Time}: Training model...")
        
        all_X = []
        all_y = []
        
        for ticker, symbol in self.symbols.items():
            # History
            history = self.History(symbol, self.lookback + 50, Resolution.Daily)
            if history.empty or len(history) < self.lookback:
                continue
            
            # Convert to a single-column DataFrame of closes
            df = history[\'close\'].unstack(level=0)
            df.columns = [\'close\']
            
            # Features
            df[\'return_1d\'] = df[\'close\'].pct_change(1)
            df[\'return_5d\'] = df[\'close\'].pct_change(5)
            df[\'return_20d\'] = df[\'close\'].pct_change(20)
            df[\'volatility\'] = df[\'return_1d\'].rolling(20).std()
            df[\'sma_ratio\'] = df[\'close\'] / df[\'close\'].rolling(20).mean()
            
            # RSI
            delta = df[\'close\'].diff()
            gain = delta.clip(lower=0).rolling(14).mean()
            loss = (-delta.clip(upper=0)).rolling(14).mean()
            rs = gain / loss
            df[\'rsi_norm\'] = ((100 - (100 / (1 + rs))) - 50) / 50
            
            # Label
            df[\'future_return\'] = df[\'close\'].shift(-self.prediction_horizon) / df[\'close\'] - 1
            df[\'label\'] = (df[\'future_return\'] > 0).astype(int)
            
            # Clean and collect
            feature_cols = [\'return_1d\', \'return_5d\', \'return_20d\', \'volatility\', \'sma_ratio\', \'rsi_norm\']
            df_clean = df.dropna()
            
            all_X.extend(df_clean[feature_cols].values.tolist())
            all_y.extend(df_clean[\'label\'].values.tolist())
        
        if len(all_X) < 100:
            self.Debug("Not enough data")
            return
        
        # Train
        X = np.array(all_X)
        y = np.array(all_y)
        
        self.scaler = StandardScaler()
        X_scaled = self.scaler.fit_transform(X)
        
        self.model = RandomForestClassifier(
            n_estimators=100,
            max_depth=5,
            min_samples_leaf=10,
            random_state=42,
            n_jobs=-1
        )
        self.model.fit(X_scaled, y)
        
        self.last_train = self.Time
        self.Debug(f"Model trained. Samples: {len(X)}, Train accuracy: {self.model.score(X_scaled, y):.4f}")
    
    def OnData(self, data):
        """
        Runs the strategy on each new slice of data.
        """
        if self.IsWarmingUp:
            return
        
        if self.model is None:
            self.TrainModel()
            if self.model is None:
                return
        
        for ticker, symbol in self.symbols.items():
            if not data.ContainsKey(symbol):
                continue
            
            # Check that indicators are ready
            ind = self.indicators[symbol]
            if not ind["sma_20"].IsReady or not ind["rsi_14"].IsReady:
                continue
            
            # Extract features
            features = self.ExtractFeatures(symbol)
            if features is None:
                continue
            
            # Predict
            features_scaled = self.scaler.transform([features])
            prediction = self.model.predict(features_scaled)[0]
            proba = self.model.predict_proba(features_scaled)[0]
            confidence = max(proba)
            
            # Trading logic
            if confidence >= self.probability_threshold:
                if prediction == 1:  # Prediction UP
                    # Weight scales with confidence: 0 at 0.5, max_position at 1.0
                    target_weight = self.max_position * (confidence - 0.5) * 2
                    current_weight = self.Portfolio[symbol].HoldingsValue / self.Portfolio.TotalPortfolioValue
                    
                    if target_weight > current_weight + 0.02:  # Minimum change before rebalancing
                        self.SetHoldings(symbol, target_weight)
                        self.Log(f"{self.Time.date()}: BUY {ticker}, Conf={confidence:.3f}, Weight={target_weight:.2%}")
                
                elif prediction == 0:  # Prediction DOWN
                    if self.Portfolio[symbol].Invested:
                        self.Liquidate(symbol)
                        self.Log(f"{self.Time.date()}: SELL {ticker}, Conf={confidence:.3f}")
    
    def ExtractFeatures(self, symbol):
        """
        Extracts the feature vector for one symbol.
        """
        history = self.History(symbol, 25, Resolution.Daily)
        if history.empty or len(history) < 21:
            return None
        
        prices = history[\'close\'].values
        
        return_1d = (prices[-1] - prices[-2]) / prices[-2]
        return_5d = (prices[-1] - prices[-6]) / prices[-6]
        return_20d = (prices[-1] - prices[-21]) / prices[-21]
        
        # ddof=1 matches the pandas rolling().std() used in TrainModel
        returns = [(prices[i] - prices[i-1]) / prices[i-1] for i in range(1, len(prices))]
        volatility = np.std(returns[-20:], ddof=1)
        
        sma_ratio = prices[-1] / self.indicators[symbol]["sma_20"].Current.Value
        rsi_norm = (self.indicators[symbol]["rsi_14"].Current.Value - 50) / 50
        
        return [return_1d, return_5d, return_20d, volatility, sma_ratio, rsi_norm]
    
    def OnEndOfAlgorithm(self):
        """
        Logs the final summary.
        """
        self.Log("="*60)
        self.Log("ML DIRECTION PREDICTION STRATEGY - SUMMARY")
        self.Log("="*60)
        self.Log(f"Final Value: ${self.Portfolio.TotalPortfolioValue:,.2f}")
        total_return = (self.Portfolio.TotalPortfolioValue - 100000) / 100000
        self.Log(f"Total Return: {total_return:.2%}")
        self.Log("="*60)
'''

print("MLDirectionPredictionStrategy Complete:")
print(complete_strategy_code)

In [None]:
# Strategy summary

print("="*70)
print("SUMMARY: ML Direction Prediction Strategy")
print("="*70)

strategy_components = {
    'Feature Engineering': {
        'Returns': '1D, 5D, 20D',
        'Volatility': 'Rolling 20D std',
        'SMA Ratio': 'Price / SMA(20)',
        'RSI': 'Normalized [-1, 1]'
    },
    'Model': {
        'Type': 'Random Forest Classifier',
        'Trees': '100',
        'Max depth': '5',
        'Min samples leaf': '10'
    },
    'Validation': {
        'Method': 'TimeSeriesSplit + Walk-Forward',
        'Retrain': 'Monthly',
        'Lookback': '252 days (1 year)'
    },
    'Trading': {
        'Horizon': '5 days',
        'Probability threshold': '0.6 (60%)',
        'Position sizing': 'Proportional to confidence',
        'Max position': '20%'
    },
    'Signals': {
        'BUY': 'Prediction=UP AND Confidence >= 0.6',
        'SELL': 'Prediction=DOWN AND Confidence >= 0.6'
    }
}

for category, details in strategy_components.items():
    print(f"\n{category}:")
    for key, value in details.items():
        print(f"  - {key}: {value}")

In [None]:
# Python dependencies

print("="*70)
print("PYTHON DEPENDENCIES")
print("="*70)

dependencies = [
    {
        'package': 'scikit-learn',
        'install': 'pip install scikit-learn',
        'usage': 'RandomForestClassifier, StandardScaler, metrics',
        'note': 'Included in QuantConnect'
    },
    {
        'package': 'xgboost',
        'install': 'pip install xgboost',
        'usage': 'XGBClassifier with early stopping',
        'note': 'Included in QuantConnect'
    },
    {
        'package': 'numpy',
        'install': 'pip install numpy',
        'usage': 'Numerical operations',
        'note': 'Included in QuantConnect'
    },
    {
        'package': 'pandas',
        'install': 'pip install pandas',
        'usage': 'Data manipulation',
        'note': 'Included in QuantConnect'
    }
]

for dep in dependencies:
    print(f"\n{dep['package']}:")
    print(f"  Installation: {dep['install']}")
    print(f"  Usage: {dep['usage']}")
    print(f"  Note: {dep['note']}")
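
Package availability and versions depend on the environment, so it can be worth checking them before running the notebook locally. The following sketch uses only the standard library. `check_dependencies` is an illustrative helper, and the mapping from distribution name to import name is supplied by the caller.

```python
from importlib import util
from importlib.metadata import PackageNotFoundError, version


def check_dependencies(packages):
    """Map each distribution name to its installed version.

    `packages` maps a distribution name (as used by pip) to its import name.
    The value is None if the module cannot be imported, and "unknown" if the
    module exists but no distribution metadata is found for that name.
    """
    report = {}
    for dist_name, module_name in packages.items():
        if util.find_spec(module_name) is None:
            report[dist_name] = None
            continue
        try:
            report[dist_name] = version(dist_name)
        except PackageNotFoundError:
            report[dist_name] = "unknown"
    return report


required = {
    "scikit-learn": "sklearn",
    "xgboost": "xgboost",
    "numpy": "numpy",
    "pandas": "pandas",
}
for name, installed in check_dependencies(required).items():
    print(f"{name}: {installed if installed is not None else 'NOT INSTALLED'}")
```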

---

## Conclusion and Next Steps

### Recap

In this notebook, we covered:

1. **ML for Trading Introduction**: why ML rather than fixed rules, and the challenges specific to trading

2. **Random Forest**: configuration, training, feature importance

3. **XGBoost**: early stopping, regularization, comparison with RF

4. **Validation**: TimeSeriesSplit, Walk-Forward, trading metrics

5. **QC Integration**: ObjectStore, ML Alpha Model

6. **Complete Strategy**: end-to-end pipeline with periodic retraining
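
As a reminder of how the validation step orders its data, here is a minimal pure-Python sketch of expanding-window splits. It follows the same indexing idea as scikit-learn's `TimeSeriesSplit`, including a `gap` between train and test. Because the labels in this notebook look 5 days ahead, a gap of at least the prediction horizon keeps training labels from overlapping the test period. `walk_forward_splits` is an illustrative name, not a library function.

```python
def walk_forward_splits(n_samples, n_splits=5, gap=0):
    """Yield (train_indices, test_indices) with train strictly before test.

    The training window expands at each split, and `gap` samples are left
    out between the end of train and the start of test.
    """
    test_size = n_samples // (n_splits + 1)
    for k in range(n_splits):
        test_start = n_samples - (n_splits - k) * test_size
        train_end = test_start - gap
        if train_end <= 0:
            raise ValueError("gap is too large for the first split")
        yield list(range(train_end)), list(range(test_start, test_start + test_size))


# Toy example: 12 samples, 3 splits, 1 sample left out between train and test
for train_idx, test_idx in walk_forward_splits(12, n_splits=3, gap=1):
    print(f"train=0..{train_idx[-1]}  test={test_idx[0]}..{test_idx[-1]}")
```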

### Key Takeaways

| Concept | Key Point |
|---------|----------|
| **TimeSeriesSplit** | Always train on data that precedes the test set (no shuffling) |
| **Retrain** | At least monthly, to adapt to regime changes |
| **Precision** | For BUY signals, precision (the share of predicted Up moves that were right) matters more than overall accuracy |
| **Probability** | Use predicted probabilities as confidence, not just the predicted class |
| **Feature Importance** | Understand what the model is learning |
| **Regularization** | Limit tree depth and set min_samples_leaf |
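
To illustrate the precision row of the table, the sketch below computes accuracy and Up-class precision from plain lists. The toy data is made up for the example. It shows a case where overall accuracy looks acceptable while only half of the BUY signals are correct.

```python
def accuracy_and_up_precision(preds, actuals):
    """Return (accuracy, precision on the Up class), with 1 = Up and 0 = Down."""
    pairs = list(zip(preds, actuals))
    correct = sum(1 for p, a in pairs if p == a)
    true_up = sum(1 for p, a in pairs if p == 1 and a == 1)
    predicted_up = sum(1 for p, _ in pairs if p == 1)
    accuracy = correct / len(pairs)
    # Precision is undefined when the model never predicts Up
    precision = true_up / predicted_up if predicted_up else float("nan")
    return accuracy, precision


# Hypothetical predictions: 4 BUY signals, of which 2 were right
toy_preds = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_actuals = [1, 0, 0, 1, 0, 0, 0, 0, 0, 1]
acc, prec = accuracy_and_up_precision(toy_preds, toy_actuals)
print(f"accuracy={acc:.2f}, up precision={prec:.2f}")
```

In a long-only strategy, capital is only committed on Up predictions, so the second number is the one that drives realized trades.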

### Limitations and Warnings

| Limitation | Description |
|------------|-------------|
| **Overfitting** | Complex models memorize noise |
| **Regime Change** | Performance degrades after a change of market regime |
| **Latency** | Retraining and inference take time |
| **Data Quality** | Features are only as good as the underlying data |
| **Black Box** | It is hard to explain why a given trade was taken |

### Possible Improvements

1. **Ensemble**: combine RF, XGBoost, and other models
2. **Features**: add fundamental, sentiment, and alternative data
3. **Hyperparameter Tuning**: GridSearch, Optuna
4. **Calibration**: isotonic regression or Platt scaling for calibrated probabilities
5. **Online Learning**: incremental model updates
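
Before investing in calibration, it helps to measure how far the raw probabilities are from observed frequencies. The sketch below is a diagnostic only, not a calibration method: it bins predicted probabilities, compares each bin's mean prediction with the observed Up frequency, and computes the Brier score. The function names and toy data are assumptions made for the example.

```python
def reliability_table(probs, outcomes, n_bins=5):
    """Rows of (bin_low, bin_high, count, mean_predicted, observed_frequency)."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, o))
    table = []
    for i, items in enumerate(bins):
        if not items:
            continue
        mean_p = sum(p for p, _ in items) / len(items)
        freq = sum(o for _, o in items) / len(items)
        table.append((i / n_bins, (i + 1) / n_bins, len(items), mean_p, freq))
    return table


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Hypothetical predicted P(Up) and realized outcomes (1 = Up)
toy_probs = [0.55, 0.62, 0.71, 0.66, 0.58, 0.80, 0.45, 0.35]
toy_outcomes = [1, 0, 1, 1, 0, 1, 0, 1]
for row in reliability_table(toy_probs, toy_outcomes):
    print(row)
print(f"Brier score: {brier_score(toy_probs, toy_outcomes):.4f}")
```

If the observed frequency in the bins above 0.6 is well below the mean prediction, the 0.6 threshold used by the strategy is effectively looser than it appears.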

### Next Steps

| Notebook | Content |
|----------|--------|
| **QC-Py-20** | ML Regression (predicting the exact return) |
| **QC-Py-21** | Deep Learning (LSTM, Transformers) |
| **QC-Py-22** | Reinforcement Learning for trading |

### Additional Resources

- [QuantConnect ML Documentation](https://www.quantconnect.com/docs/v2/writing-algorithms/machine-learning)
- [Scikit-learn User Guide](https://scikit-learn.org/stable/user_guide.html)
- [XGBoost Documentation](https://xgboost.readthedocs.io/)
- [Advances in Financial ML](https://www.amazon.com/Advances-Financial-Machine-Learning-Marcos/dp/1119482089) - Lopez de Prado

---

**Notebook complete. ML classification is a powerful tool, but it must be used rigorously: robust validation, regular retraining, and appropriate risk management.**