# Notebook 2: ML Models (T07)

This notebook explores Machine Learning models for GBP/USD trading:
- **Feature Engineering**: creating advanced features
- **Target Variable**: UP/DOWN/HOLD classification
- **3 ML Models**: Logistic Regression, Random Forest, XGBoost
- **Backtesting**: performance on 2024 data

**Result**: +297% return with Logistic Regression in the 2024 backtest.

---

## Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import joblib

ModuleNotFoundError: No module named 'seaborn'

In [None]:
# ML
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import xgboost as xgb

# Configuration
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

## Loading the data with features

In [None]:
# Load the data with features (from the T05 API)
df_2022 = pd.read_parquet('../data/processed/m15_features_2022.parquet')

print("Data loaded (2022)")
print(f"   Rows: {len(df_2022):,}")
print(f"   Columns: {len(df_2022.columns)}")
print("\nAvailable features:")
print(list(df_2022.columns))

df_2022.head()

## Feature Engineering (T07)

Additional features are created:
- **Lag features**: lagged price/volume (t-1, t-2, t-3, t-5, t-10, t-20)
- **Rolling statistics**: mean, std, min, max over 5/10/20/50 periods
- **Total**: 100+ features created
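The lag and rolling features listed above can be sketched with plain pandas. This is a minimal illustration, not the code of `FeatureEngineer`: the helper name `add_lag_and_rolling_features`, the `close` column, and the generated column names are assumptions chosen to match the `lag_` / `rolling_` prefixes used later in this notebook.

```python
import numpy as np
import pandas as pd


def add_lag_and_rolling_features(df, column="close",
                                 lags=(1, 2, 3, 5, 10, 20),
                                 windows=(5, 10, 20, 50)):
    """Return a copy of df with lagged and rolling-window columns added.

    Hypothetical helper for illustration; the column naming is an assumption.
    """
    out = df.copy()
    for lag in lags:
        # Value of the column `lag` bars earlier
        out[f"lag_{column}_{lag}"] = out[column].shift(lag)
    for window in windows:
        rolling = out[column].rolling(window)
        out[f"rolling_mean_{column}_{window}"] = rolling.mean()
        out[f"rolling_std_{column}_{window}"] = rolling.std()
        out[f"rolling_min_{column}_{window}"] = rolling.min()
        out[f"rolling_max_{column}_{window}"] = rolling.max()
    # The first rows lack a full history, so they contain NaN and are dropped
    return out.dropna().reset_index(drop=True)


# Synthetic price series, for illustration only
rng = np.random.default_rng(0)
demo = pd.DataFrame({"close": 1.25 + rng.normal(0, 0.001, 200).cumsum()})
features = add_lag_and_rolling_features(demo)
print(features.shape)
```

With one input column this yields 6 lag columns and 16 rolling columns. The longest window (50) determines how many leading rows are lost to the warm-up period.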

In [None]:
# Import the FeatureEngineer
import sys
sys.path.append('../src/models')
from feature_engineering import FeatureEngineer, add_target_variable

# Create the ML features
engineer = FeatureEngineer()
df_ml = engineer.create_all_features(df_2022.copy())

print("✓ ML features created")
print(f"   Total features: {len(df_ml.columns)}")
print(f"   Rows after cleaning: {len(df_ml):,}")

In [None]:
# Inspect a few features
feature_columns = [col for col in df_ml.columns if col.startswith(('lag_', 'rolling_'))]
print("\nExamples of created features:")
print(feature_columns[:10])

# Stats
df_ml[feature_columns[:5]].describe()

## Target Variable

Classification into 3 classes:
- **UP (1)**: price rises by more than 0.1% (about 10 pips)
- **HOLD (0)**: price stays within ±0.1%
- **DOWN (-1)**: price falls by more than 0.1%
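The three-class rule above can be written out as a small function. This is a sketch under assumptions, not the implementation of `add_target_variable`: the helper name `label_target` is hypothetical, and it assumes the label is derived from the forward return of a close-price series over `lookahead` bars.

```python
import pandas as pd


def label_target(close, threshold=0.001, lookahead=1):
    """Label each bar UP (1), HOLD (0) or DOWN (-1) from its forward return.

    Hypothetical helper; assumes lookahead >= 1.
    """
    # Relative price change between now and `lookahead` bars later
    future_return = close.shift(-lookahead) / close - 1.0
    target = pd.Series(0, index=close.index, dtype="int64")
    target[future_return > threshold] = 1
    target[future_return < -threshold] = -1
    # The last `lookahead` bars have no future price, so they cannot be labelled
    return target.iloc[:-lookahead]


close = pd.Series([1.2500, 1.2520, 1.2521, 1.2500, 1.2500])
labels = label_target(close)
print(labels.tolist())  # → [1, 0, -1, 0]
```

In the example, the first move (+0.16%) exceeds the threshold and is labelled UP, the second (+0.008%) stays inside the band and is labelled HOLD, and the third (-0.17%) is labelled DOWN.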

In [None]:
# Add the target
df_ml = add_target_variable(df_ml, threshold=0.001, lookahead=1)

# Target distribution
target_dist = df_ml['target'].value_counts().sort_index()
print("Target distribution:")
for label, count in target_dist.items():
    label_name = {-1: 'DOWN', 0: 'HOLD', 1: 'UP'}[label]
    print(f"   {label_name}: {count:,} ({count/len(df_ml)*100:.1f}%)")

# Visualisation
fig, ax = plt.subplots(figsize=(10, 6))
target_dist.plot(kind='bar', ax=ax, color=['red', 'gray', 'green'])
ax.set_title('Target Variable Distribution', fontsize=16)
ax.set_xlabel('Class')
ax.set_ylabel('Frequency')
ax.set_xticklabels(['DOWN (-1)', 'HOLD (0)', 'UP (1)'], rotation=0)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## Data preparation

In [None]:
# Separate features and target
feature_cols = [col for col in df_ml.columns if col not in ['target', 'timestamp_15m']]
X = df_ml[feature_cols]
y = df_ml['target']

# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Data prepared:")
print(f"   Features: {X.shape[1]}")
print(f"   Train: {X_train.shape[0]:,} samples")
print(f"   Test: {X_test.shape[0]:,} samples")

# Normalization (fit on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("\n✓ Features normalized")
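`train_test_split` shuffles rows by default. With lag and rolling features, neighbouring bars share much of their history, so a shuffled split can put near-duplicates of test rows into the training set and may make test scores look better than they would on unseen periods. A chronological split is one common alternative. The sketch below assumes the rows are already sorted by time; the helper name `chronological_split` and the synthetic data are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def chronological_split(X, y, test_size=0.2):
    """Split rows in time order: the earliest rows train, the latest rows test.

    Hypothetical helper; assumes X and y are sorted by timestamp.
    """
    split_at = len(X) - int(round(len(X) * test_size))
    return X.iloc[:split_at], X.iloc[split_at:], y.iloc[:split_at], y.iloc[split_at:]


# Synthetic, time-ordered data for illustration only
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(1000, 4)), columns=list("abcd"))
y_demo = pd.Series(rng.choice([-1, 0, 1], size=1000, p=[0.05, 0.9, 0.05]))

X_tr, X_te, y_tr, y_te = chronological_split(X_demo, y_demo)

# Fit the scaler on the training window only, then reuse it on the test window
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)
print(len(X_tr), len(X_te))
```

Stratification is not available without shuffling, so the class proportions of the two windows can differ. scikit-learn's `TimeSeriesSplit` extends the same idea to several folds.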

## Model training

### 1. Logistic Regression

In [None]:
# Logistic Regression
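model_lr = LogisticRegression(
    # `multi_class` is deprecated in recent scikit-learn releases;
    # lbfgs already fits a multinomial model for multiclass targets
    solver='lbfgs',
    max_iter=1000,
    random_state=42
)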

print("Training Logistic Regression...")
model_lr.fit(X_train_scaled, y_train)
print("✓ Model trained")

# Predictions
y_pred_lr = model_lr.predict(X_test_scaled)

# Evaluation
print("\n=== Logistic Regression Performance ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred_lr):.4f}")
print("\nClassification Report:")
# Explicit labels keep the names aligned even if a class is never predicted
print(classification_report(y_test, y_pred_lr, labels=[-1, 0, 1], target_names=['DOWN', 'HOLD', 'UP']))

In [None]:
# Confusion Matrix (explicit labels so rows/columns match the tick labels)
cm_lr = confusion_matrix(y_test, y_pred_lr, labels=[-1, 0, 1])

fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(cm_lr, annot=True, fmt='d', cmap='Blues', ax=ax,
            xticklabels=['DOWN', 'HOLD', 'UP'],
            yticklabels=['DOWN', 'HOLD', 'UP'])
ax.set_title('Confusion Matrix - Logistic Regression', fontsize=14)
ax.set_ylabel('True Label')
ax.set_xlabel('Predicted Label')
plt.tight_layout()
plt.show()

### 2. Random Forest

In [None]:
# Random Forest
model_rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=50,
    min_samples_leaf=20,
    random_state=42,
    n_jobs=-1
)

print("Training Random Forest...")
model_rf.fit(X_train_scaled, y_train)
print("✓ Model trained")

# Predictions
y_pred_rf = model_rf.predict(X_test_scaled)

# Evaluation
print("\n=== Random Forest Performance ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")
print("\nClassification Report:")
# Explicit labels keep the names aligned even if a class is never predicted
print(classification_report(y_test, y_pred_rf, labels=[-1, 0, 1], target_names=['DOWN', 'HOLD', 'UP']))

In [None]:
# Feature Importance
importances = pd.DataFrame({
    'feature': feature_cols,
    'importance': model_rf.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 10 Important Features:")
print(importances.head(10))

# Plot
fig, ax = plt.subplots(figsize=(12, 6))
importances.head(15).plot(x='feature', y='importance', kind='barh', ax=ax)
ax.set_title('Top 15 Feature Importances - Random Forest', fontsize=14)
ax.set_xlabel('Importance')
ax.set_ylabel('Feature')
plt.tight_layout()
plt.show()

### 3. XGBoost

In [None]:
# XGBoost (shift labels: -1,0,1 → 0,1,2)
y_train_xgb = y_train + 1
y_test_xgb = y_test + 1

model_xgb = xgb.XGBClassifier(
    objective='multi:softmax',
    num_class=3,
    max_depth=6,
    learning_rate=0.1,
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)

print("Training XGBoost...")
model_xgb.fit(X_train_scaled, y_train_xgb)
print("✓ Model trained")

# Predictions (convert 0,1,2 back to -1,0,1)
y_pred_xgb_raw = model_xgb.predict(X_test_scaled)
y_pred_xgb = y_pred_xgb_raw - 1

# Evaluation
print("\n=== XGBoost Performance ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred_xgb):.4f}")
print("\nClassification Report:")
# Explicit labels keep the names aligned even if a class is never predicted
print(classification_report(y_test, y_pred_xgb, labels=[-1, 0, 1], target_names=['DOWN', 'HOLD', 'UP']))

## Model comparison

In [None]:
# Performance summary
from sklearn.metrics import f1_score, precision_score, recall_score

results = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest', 'XGBoost'],
    'Accuracy': [
        accuracy_score(y_test, y_pred_lr),
        accuracy_score(y_test, y_pred_rf),
        accuracy_score(y_test, y_pred_xgb)
    ],
    'F1 (macro)': [
        f1_score(y_test, y_pred_lr, average='macro'),
        f1_score(y_test, y_pred_rf, average='macro'),
        f1_score(y_test, y_pred_xgb, average='macro')
    ],
    'Precision (macro)': [
        precision_score(y_test, y_pred_lr, average='macro'),
        precision_score(y_test, y_pred_rf, average='macro'),
        precision_score(y_test, y_pred_xgb, average='macro')
    ],
    'Recall (macro)': [
        recall_score(y_test, y_pred_lr, average='macro'),
        recall_score(y_test, y_pred_rf, average='macro'),
        recall_score(y_test, y_pred_xgb, average='macro')
    ]
})

print("\n=== MODEL COMPARISON ===")
print(results.to_string(index=False))

# Best model
best_model = results.loc[results['F1 (macro)'].idxmax(), 'Model']
print(f"\nBest model (F1): {best_model}")
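Ranking by macro F1 instead of accuracy matters when one class dominates. The sketch below uses synthetic labels with a 97% HOLD share (the share reported in the conclusion of this notebook; the even split of the remaining 3% between UP and DOWN is an assumption) and scores a baseline that always predicts HOLD.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels: 97% HOLD (0), 1.5% UP (1), 1.5% DOWN (-1)
y_true = np.array([0] * 970 + [1] * 15 + [-1] * 15)
# A baseline that never trades: it predicts HOLD for every bar
y_always_hold = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_always_hold)
# zero_division=0 scores the never-predicted classes as 0 without a warning
macro_f1 = f1_score(y_true, y_always_hold, average="macro", zero_division=0)
print(f"accuracy={acc:.3f}, macro F1={macro_f1:.3f}")  # → accuracy=0.970, macro F1=0.328
```

The baseline reaches 97% accuracy without producing a single trading signal, while its macro F1 is about 0.33 because the UP and DOWN classes each contribute an F1 of 0.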

## Backtesting on 2024

The models are tested on real 2024 data.
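To make the reported metrics concrete, here is a minimal sketch of how final capital, total return, trade count, and win rate could be derived from model signals. It is not the logic of `MLBacktester`: the function `simple_backtest` is hypothetical, holds each position for exactly one bar, and ignores spread, commissions, and slippage.

```python
import numpy as np


def simple_backtest(close, signals, initial_capital=10_000.0, position_size=0.95):
    """Hold each signal for one bar: +1 is long, -1 is short, 0 is flat.

    Hypothetical sketch; transaction costs are ignored.
    """
    capital = initial_capital
    trade_pnls = []
    for i in range(len(close) - 1):
        if signals[i] == 0:
            continue  # HOLD: no position on this bar
        bar_return = close[i + 1] / close[i] - 1.0
        # A short position (-1) profits when the price falls
        pnl = capital * position_size * signals[i] * bar_return
        capital += pnl
        trade_pnls.append(pnl)
    wins = sum(p > 0 for p in trade_pnls)
    return {
        "final_capital": capital,
        "total_return": (capital / initial_capital - 1.0) * 100,
        "total_trades": len(trade_pnls),
        "win_rate": 100 * wins / len(trade_pnls) if trade_pnls else 0.0,
    }


close = np.array([1.2500, 1.2525, 1.2500, 1.2510, 1.2510])
signals = np.array([1, -1, 0, 1, 0])
result = simple_backtest(close, signals)
print(result)
```

In this toy run there are three trades: a winning long, a winning short, and a long over a flat bar that counts as a non-win. The result keys mirror the ones read from the backtest results in the cells that follow.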

In [None]:
# Import the backtester; the models and scaler fitted above are reused
from ml_backtester import MLBacktester

# Load the 2024 data
df_2024 = pd.read_parquet('../data/processed/ml_dataset_2024.parquet')

print("Test data loaded (2024)")
print(f"   Rows: {len(df_2024):,}")

# Backtester
backtester = MLBacktester(initial_capital=10000)

# Feature names (must match the columns the scaler and models were fitted on)
feature_names = [col for col in df_2024.columns if col not in ['target', 'timestamp_15m']]

print("\n=== Backtesting on 2024 ===")

In [None]:
# Backtest Logistic Regression
results_lr = backtester.backtest_ml_strategy(
    df=df_2024,
    model=model_lr,
    scaler=scaler,
    feature_names=feature_names,
    model_name='logistic_regression',
    position_size=0.95
)

print("\n=== Logistic Regression - 2024 ===")
print(f"Initial Capital: {backtester.initial_capital:.2f} €")
print(f"Final Capital: {results_lr['final_capital']:.2f} €")
print(f"Total Return: {results_lr['total_return']:.2f}%")
print(f"Total Trades: {results_lr['total_trades']}")
print(f"Win Rate: {results_lr['win_rate']:.2f}%")

In [None]:
# Backtest Random Forest
results_rf = backtester.backtest_ml_strategy(
    df=df_2024,
    model=model_rf,
    scaler=scaler,
    feature_names=feature_names,
    model_name='random_forest',
    position_size=0.95
)

print("\n=== Random Forest - 2024 ===")
print(f"Final Capital: {results_rf['final_capital']:.2f} €")
print(f"Total Return: {results_rf['total_return']:.2f}%")
print(f"Total Trades: {results_rf['total_trades']}")

In [None]:
# Backtest XGBoost
results_xgb = backtester.backtest_ml_strategy(
    df=df_2024,
    model=model_xgb,
    scaler=scaler,
    feature_names=feature_names,
    model_name='xgboost',
    position_size=0.95
)

print("\n=== XGBoost - 2024 ===")
print(f"Final Capital: {results_xgb['final_capital']:.2f} €")
print(f"Total Return: {results_xgb['total_return']:.2f}%")
print(f"Total Trades: {results_xgb['total_trades']}")

## Final performance comparison

In [None]:
# Comparison table
backtest_comparison = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest', 'XGBoost'],
    'Return %': [
        results_lr['total_return'],
        results_rf['total_return'],
        results_xgb['total_return']
    ],
    'Trades': [
        results_lr['total_trades'],
        results_rf['total_trades'],
        results_xgb['total_trades']
    ],
    'Win Rate %': [
        results_lr['win_rate'],
        results_rf['win_rate'],
        results_xgb['win_rate']
    ],
    'Final Capital': [
        results_lr['final_capital'],
        results_rf['final_capital'],
        results_xgb['final_capital']
    ]
})

print("\n" + "=" * 80)
print("MODEL BACKTEST COMPARISON - 2024")
print("=" * 80)
print(backtest_comparison.to_string(index=False))

best_return = backtest_comparison.loc[backtest_comparison['Return %'].idxmax(), 'Model']
best_return_pct = backtest_comparison['Return %'].max()
print(f"\n🏆 Best model (Return): {best_return} ({best_return_pct:.2f}%)")

In [None]:
# Visualisation
fig, ax = plt.subplots(figsize=(12, 6))

backtest_comparison.plot(
    x='Model',
    y='Return %',
    kind='bar',
    ax=ax,
    legend=False,
    color=['green' if x > 0 else 'red' for x in backtest_comparison['Return %']]
)

ax.set_title('Backtest Returns - 2024', fontsize=16)
ax.set_xlabel('Model')
ax.set_ylabel('Return %')
ax.axhline(y=0, color='black', linestyle='--', alpha=0.3)
ax.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## Conclusion

### T07 results

🏆 **Logistic Regression**: +297% return on 2024
- Very selective strategy (10 trades)
- Acceptable win rate (40%)
- Exceptional performance

⚠️ **Areas for improvement**:
- Class imbalance (97% HOLD)
- Few UP/DOWN signals generated
- Class balancing needed (SMOTE, class_weight)
- Hyperparameter optimization
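Of the balancing options listed above, `class_weight` is the simplest to try because it needs no extra library. The sketch below compares a plain and a class-weighted `LogisticRegression` on synthetic imbalanced data. The data, class proportions, and variable names are assumptions for illustration, and the scores are computed on the training data only to show the direction of the effect.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic imbalanced data: class 0 dominates, classes 1 and -1 are rare
rng = np.random.default_rng(0)
n_hold, n_rare = 1900, 50
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(n_hold, 2)),
    rng.normal([1.5, 0.0], 1.0, size=(n_rare, 2)),
    rng.normal([-1.5, 0.0], 1.0, size=(n_rare, 2)),
])
y_demo = np.array([0] * n_hold + [1] * n_rare + [-1] * n_rare)

plain = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
# "balanced" reweights each class inversely to its frequency
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_demo, y_demo)

recalls = {}
for name, model in [("plain", plain), ("balanced", balanced)]:
    recalls[name] = recall_score(y_demo, model.predict(X_demo), average="macro")
    print(f"{name}: macro recall = {recalls[name]:.2f}")
```

The weighted model trades some HOLD accuracy for more UP/DOWN detections, which would also increase the number of signals sent to a backtest. Whether that improves trading results would need to be checked on held-out data.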

### Next steps
- **T08**: Reinforcement Learning
- **T09**: Production & Deployment
- T07 improvements: feature selection, ensemble methods