# QC-Py-20 - Machine Learning Regression for Price Prediction

> **Predicting returns with regression models**
> Duration: 75 minutes | Level: Intermediate-Advanced | Python + QuantConnect

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Understand the difference between **regression and classification** for trading
2. Implement regularized **linear regression** models (Ridge, Lasso, ElasticNet)
3. Use **ensemble methods** (Random Forest, Gradient Boosting, XGBoost)
4. Apply **Support Vector Regression** with feature scaling
5. Evaluate models with the appropriate **regression metrics**
6. Convert **predictions into trading signals**
7. Build a **complete strategy** for return prediction

## Prerequisites

- Notebooks QC-Py-01 through 18 completed
- QC-Py-18 Feature Engineering (prepared features)
- Basic Machine Learning concepts
- Familiarity with scikit-learn

## Notebook Structure

1. Regression vs Classification (10 min)
2. Linear Regression: Ridge, Lasso, ElasticNet (20 min)
3. Ensemble Methods: Random Forest, Gradient Boosting (25 min)
4. Support Vector Regression (15 min)
5. Regression Metrics (15 min)
6. Trading Integration: Prediction to Signal (20 min)
7. Complete Strategy: Return Prediction (20 min)

---

## Part 1: Regression vs Classification (10 min)

### Two approaches to predicting the market

In Machine Learning for trading, there are two main approaches:

| Approach | Question | Output | Example |
|----------|----------|--------|--------|
| **Classification** | Will the price go up or down? | Class (Up/Down) | Direction of the move |
| **Regression** | By how much will the price change? | Continuous value | Expected return (e.g. +2.5%) |

### When to use regression?

```
Classification (Direction)
   - Goal: direction of the move
   - Output: Up (1), Down (0)
   - Advantage: simpler, usually more robust to noise
   - Limitation: ignores magnitude

Regression (Magnitude)
   - Goal: size of the return
   - Output: continuous value (e.g. +2.5%, -1.3%)
   - Advantage: carries information about amplitude
   - Limitation: harder, noisier target
```
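
To make the contrast concrete, the sketch below derives both kinds of target from the same forward return. The price array and the `horizon` value are made up for illustration and are not taken from the notebook's data.

```python
import numpy as np

# Hypothetical closing prices, for illustration only
close = np.array([100.0, 101.0, 99.0, 102.0, 103.0, 101.0])
horizon = 2  # look-ahead in bars (assumed value)

# Regression target: forward return over `horizon` bars
future_return = close[horizon:] / close[:-horizon] - 1

# Classification target: 1 if the forward return is positive, else 0
direction = (future_return > 0).astype(int)

for r, d in zip(future_return, direction):
    print(f"return={r:+.4f}  direction={d}")
```

The classification label is a lossy summary of the regression target: it can always be recomputed from the return, but not the other way round.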

### Advantages of Regression for Trading

| Advantage | Description |
|----------|-------------|
| **Position Sizing** | Scale the position size with the predicted amplitude |
| **Risk/Reward** | Estimate the risk/reward ratio |
| **Filtering** | Ignore small, unprofitable moves |
| **Portfolio Optimization** | Use the predictions as expected returns |

### Regression Pipeline for Trading

```
Features (X)          Regression model        Prediction
[RSI, MACD, ...]  -->  [Ridge/XGBoost]  -->  y_pred = +1.8%
                                                  |
                                                  v
                                           Trading signal
                                           (Long if > threshold)
```

In [None]:
# Required imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Machine Learning imports
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Matplotlib configuration
plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline

# Check whether XGBoost is available
try:
    import xgboost as xgb
    XGB_AVAILABLE = True
    print("XGBoost available")
except ImportError:
    XGB_AVAILABLE = False
    print("XGBoost not available (install with: pip install xgboost)")

print("\nImports successful!")
print("This notebook covers ML regression for price prediction.")

In [None]:
# Generate demonstration data with features and target

def generate_regression_data(n_days=1000, seed=42):
    """
    Generate simulated data for regression.
    Includes technical features and the target (future return).
    
    Parameters:
    -----------
    n_days : int
        Number of days of data
    seed : int
        Random seed
    
    Returns:
    --------
    pd.DataFrame
        DataFrame with features and target
    """
    np.random.seed(seed)
    
    # Dates
    dates = pd.date_range(start='2019-01-01', periods=n_days, freq='B')
    
    # Simulated prices with drift and volatility
    returns = np.random.normal(0.0003, 0.015, n_days)
    close = 100 * np.exp(np.cumsum(returns))
    
    # === TECHNICAL FEATURES ===
    
    # Past returns
    df = pd.DataFrame({'close': close}, index=dates)
    df['return_1d'] = df['close'].pct_change(1)
    df['return_5d'] = df['close'].pct_change(5)
    df['return_20d'] = df['close'].pct_change(20)
    
    # Volatility
    df['volatility_20d'] = df['return_1d'].rolling(20).std()
    
    # Simplified RSI (simple rolling means of gains and losses)
    delta = df['close'].diff()
    gain = delta.clip(lower=0).rolling(14).mean()
    loss = (-delta.clip(upper=0)).rolling(14).mean()
    rs = gain / (loss + 1e-10)
    df['rsi'] = 100 - (100 / (1 + rs))
    df['rsi_normalized'] = (df['rsi'] - 50) / 50
    
    # Moving averages
    df['sma_20'] = df['close'].rolling(20).mean()
    df['sma_50'] = df['close'].rolling(50).mean()
    df['ma_ratio'] = df['sma_20'] / df['sma_50']
    df['price_to_sma20'] = df['close'] / df['sma_20']
    
    # MACD normalized by price
    ema_12 = df['close'].ewm(span=12).mean()
    ema_26 = df['close'].ewm(span=26).mean()
    df['macd_norm'] = (ema_12 - ema_26) / df['close']
    
    # Bollinger Bands %B
    bb_middle = df['close'].rolling(20).mean()
    bb_std = df['close'].rolling(20).std()
    bb_upper = bb_middle + 2 * bb_std
    bb_lower = bb_middle - 2 * bb_std
    df['bb_percent_b'] = (df['close'] - bb_lower) / (bb_upper - bb_lower + 1e-10)
    
    # Volume ratio (simulated)
    volume = 1e6 * (1 + np.random.exponential(0.3, n_days))
    df['volume_ratio'] = volume / pd.Series(volume).rolling(20).mean().values
    
    # === TARGET: future return over 5 days ===
    horizon = 5
    df['target_return'] = df['close'].shift(-horizon) / df['close'] - 1
    
    # Drop rows containing NaN
    df = df.dropna()
    
    return df

# Generate the data
df = generate_regression_data(n_days=1000)

print("Generated data:")
print(f"  Period: {df.index[0].date()} to {df.index[-1].date()}")
print(f"  Number of samples: {len(df)}")
print(f"\nFeatures: {[c for c in df.columns if c not in ['close', 'target_return']]}")
print("\nTarget: 'target_return' (5-day return)")
print(f"  Mean: {df['target_return'].mean()*100:.3f}%")
print(f"  Std: {df['target_return'].std()*100:.3f}%")

In [None]:
# Prepare the data for ML

# Features (X) and target (y)
feature_cols = ['return_1d', 'return_5d', 'return_20d', 'volatility_20d',
                'rsi_normalized', 'ma_ratio', 'price_to_sma20', 'macd_norm',
                'bb_percent_b', 'volume_ratio']

X = df[feature_cols]
y = df['target_return']

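# Temporal split (train/test). The target looks 5 days ahead, so the last
# 5 training rows are dropped to keep their labels out of the test period.
horizon = 5  # must match the horizon used in generate_regression_data
train_size = int(len(df) * 0.7)
X_train, X_test = X.iloc[:train_size - horizon], X.iloc[train_size:]
y_train, y_test = y.iloc[:train_size - horizon], y.iloc[train_size:]

print("Train/Test split (temporal):")
print(f"  Train: {len(X_train)} samples ({X_train.index[0].date()} - {X_train.index[-1].date()})")
print(f"  Test:  {len(X_test)} samples ({X_test.index[0].date()} - {X_test.index[-1].date()})")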

# Standardization
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=feature_cols, index=X_train.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=feature_cols, index=X_test.index)

print("\nStandardization applied (fit on train, transform on test)")

---

## Part 2: Linear Regression - Ridge, Lasso, ElasticNet (20 min)

### Regularization in Regression

Ordinary least-squares regression can overfit, especially with many features. Regularization adds a penalty on large coefficients to the loss. The formulas below are schematic: scikit-learn scales the data-fit and penalty terms by different constants depending on the estimator.

| Model | Regularization | Loss formula | Effect |
|--------|---------------|--------------|-------|
| **Linear** | None | MSE | Baseline, can overfit |
| **Ridge (L2)** | L2 (sum of squares) | MSE + alpha * sum(coef^2) | Shrinkage, keeps all features |
| **Lasso (L1)** | L1 (sum of absolute values) | MSE + alpha * sum(abs(coef)) | Sparse, feature selection |
| **ElasticNet** | L1 + L2 | MSE + alpha * (l1_ratio * L1 + (1-l1_ratio) * L2) | Combines both |
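
As a worked illustration of the shrinkage effect in the Ridge row, here is a minimal NumPy sketch of the closed-form ridge solution on synthetic data. The data, the coefficient vector `true_w`, and the helper name `ridge_closed_form` are assumptions made for this example; the sketch omits the intercept and is not how scikit-learn solves the problem internally.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
true_w = np.array([0.5, -0.3, 0.0, 0.0, 0.2])  # hypothetical coefficients
y = X @ true_w + rng.normal(scale=0.5, size=n)

def ridge_closed_form(X, y, alpha):
    """Solve min ||y - Xw||^2 + alpha * ||w||^2 (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

norms = []
for alpha in [0.0, 10.0, 100.0, 1000.0]:
    w = ridge_closed_form(X, y, alpha)
    norms.append(np.linalg.norm(w))
    print(f"alpha={alpha:7.1f}  ||w||={norms[-1]:.4f}")
```

The coefficient norm falls as alpha grows, but no coefficient is forced exactly to zero, which is the practical difference from Lasso.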

### When to use each model?

```
Ridge (L2)
  - When all features are useful
  - Correlated features
  - Uniform shrinkage

Lasso (L1)
  - When automatic feature selection is wanted
  - Many features, few of them useful
  - Coefficients exactly equal to 0

ElasticNet
  - Combines the advantages of both
  - Correlated features + sparse solution
  - More flexible
```
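
Whichever penalty is chosen, its strength `alpha` has to be tuned without looking at the test set. The sketch below is one possible way to do that with `TimeSeriesSplit`, which always validates on data that comes after the training fold. The synthetic arrays `X_demo` and `y_demo`, the alpha grid, and the `gap=5` setting (meant to mirror a 5-day target horizon) are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, time-ordered training data (illustration only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = 0.01 * X_demo[:, 0] + rng.normal(scale=0.02, size=300)

# gap skips rows between each training fold and its validation fold
tscv = TimeSeriesSplit(n_splits=5, gap=5)

cv_rmse = {}
for alpha in [0.01, 1.0, 100.0]:
    # The scaler sits inside the pipeline so it is refitted on each training fold
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    scores = cross_val_score(model, X_demo, y_demo, cv=tscv,
                             scoring="neg_root_mean_squared_error")
    cv_rmse[alpha] = -scores.mean()

best_alpha = min(cv_rmse, key=cv_rmse.get)
print(cv_rmse, "-> best alpha:", best_alpha)
```

Selecting `alpha` this way keeps the held-out test period untouched, so the final test score remains an honest estimate.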

In [None]:
# Ridge Regression (L2 Regularization)

print("="*60)
print("RIDGE REGRESSION (L2 Regularization)")
print("="*60)

# Try different values of alpha
alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
ridge_results = []

for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train_scaled, y_train)
    
    # Predictions
    y_train_pred = ridge.predict(X_train_scaled)
    y_test_pred = ridge.predict(X_test_scaled)
    
    # Metrics
    train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
    test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
    test_r2 = r2_score(y_test, y_test_pred)
    
    ridge_results.append({
        'alpha': alpha,
        'train_rmse': train_rmse,
        'test_rmse': test_rmse,
        'test_r2': test_r2,
        'coef_norm': np.linalg.norm(ridge.coef_)
    })

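ridge_df = pd.DataFrame(ridge_results)
print("\nRidge results for different alpha values:")
print(ridge_df.to_string(index=False))

# Best alpha. Note: it is picked on the test set here for brevity, which makes
# the reported test score optimistic; cross-validate on the training set in practice.
best_ridge_alpha = ridge_df.loc[ridge_df['test_rmse'].idxmin(), 'alpha']
print(f"\nBest alpha: {best_ridge_alpha}")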

In [None]:
# Lasso Regression (L1 Regularization - Feature Selection)

print("="*60)
print("LASSO REGRESSION (L1 Regularization)")
print("="*60)

# Try different values of alpha
alphas = [0.0001, 0.001, 0.01, 0.1, 1.0]
lasso_results = []

for alpha in alphas:
    lasso = Lasso(alpha=alpha, max_iter=10000)
    lasso.fit(X_train_scaled, y_train)
    
    # Predictions
    y_train_pred = lasso.predict(X_train_scaled)
    y_test_pred = lasso.predict(X_test_scaled)
    
    # Metrics
    train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
    test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
    test_r2 = r2_score(y_test, y_test_pred)
    n_nonzero = np.sum(lasso.coef_ != 0)
    
    lasso_results.append({
        'alpha': alpha,
        'train_rmse': train_rmse,
        'test_rmse': test_rmse,
        'test_r2': test_r2,
        'n_features': n_nonzero
    })

lasso_df = pd.DataFrame(lasso_results)
print("\nLasso results for different alpha values:")
print(lasso_df.to_string(index=False))

# Best Lasso
best_lasso_alpha = lasso_df.loc[lasso_df['test_rmse'].idxmin(), 'alpha']
best_lasso = Lasso(alpha=best_lasso_alpha, max_iter=10000)
best_lasso.fit(X_train_scaled, y_train)

print(f"\nBest alpha: {best_lasso_alpha}")
print("\nSelected features (coef != 0):")
for feat, coef in zip(feature_cols, best_lasso.coef_):
    if coef != 0:
        print(f"  {feat}: {coef:.6f}")

In [None]:
# ElasticNet (L1 + L2 Regularization)

print("="*60)
print("ELASTICNET (L1 + L2 Regularization)")
print("="*60)

# Try different combinations
params = [
    {'alpha': 0.001, 'l1_ratio': 0.2},
    {'alpha': 0.001, 'l1_ratio': 0.5},
    {'alpha': 0.001, 'l1_ratio': 0.8},
    {'alpha': 0.01, 'l1_ratio': 0.5},
    {'alpha': 0.1, 'l1_ratio': 0.5},
]

elastic_results = []

for p in params:
    elastic = ElasticNet(alpha=p['alpha'], l1_ratio=p['l1_ratio'], max_iter=10000)
    elastic.fit(X_train_scaled, y_train)
    
    y_test_pred = elastic.predict(X_test_scaled)
    test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
    test_r2 = r2_score(y_test, y_test_pred)
    n_nonzero = np.sum(elastic.coef_ != 0)
    
    elastic_results.append({
        'alpha': p['alpha'],
        'l1_ratio': p['l1_ratio'],
        'test_rmse': test_rmse,
        'test_r2': test_r2,
        'n_features': n_nonzero
    })

elastic_df = pd.DataFrame(elastic_results)
print("\nElasticNet results:")
print(elastic_df.to_string(index=False))

print("\nNote:")
print("  - l1_ratio=0 -> Ridge (pure L2)")
print("  - l1_ratio=1 -> Lasso (pure L1)")
print("  - l1_ratio=0.5 -> balanced mix")

In [None]:
# Coefficient visualization

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Ridge
ridge_best = Ridge(alpha=best_ridge_alpha)
ridge_best.fit(X_train_scaled, y_train)

ax1 = axes[0]
colors = ['green' if c > 0 else 'red' for c in ridge_best.coef_]
ax1.barh(feature_cols, ridge_best.coef_, color=colors)
ax1.axvline(0, color='black', linewidth=0.5)
ax1.set_xlabel('Coefficient')
ax1.set_title(f'Ridge (alpha={best_ridge_alpha})')

# Lasso
ax2 = axes[1]
colors = ['green' if c > 0 else 'red' if c < 0 else 'gray' for c in best_lasso.coef_]
ax2.barh(feature_cols, best_lasso.coef_, color=colors)
ax2.axvline(0, color='black', linewidth=0.5)
ax2.set_xlabel('Coefficient')
ax2.set_title(f'Lasso (alpha={best_lasso_alpha})')

# ElasticNet
elastic_best = ElasticNet(alpha=0.001, l1_ratio=0.5, max_iter=10000)
elastic_best.fit(X_train_scaled, y_train)

ax3 = axes[2]
colors = ['green' if c > 0 else 'red' if c < 0 else 'gray' for c in elastic_best.coef_]
ax3.barh(feature_cols, elastic_best.coef_, color=colors)
ax3.axvline(0, color='black', linewidth=0.5)
ax3.set_xlabel('Coefficient')
ax3.set_title('ElasticNet (alpha=0.001, l1=0.5)')

plt.tight_layout()
plt.show()

print("\nObservations:")
print("  - Ridge: all coefficients non-zero (shrinkage)")
print("  - Lasso: some coefficients set to 0 (feature selection)")
print("  - ElasticNet: compromise between the two")

---

## Part 3: Ensemble Methods - Random Forest and Gradient Boosting (25 min)

### Ensemble Methods for Regression

Ensemble methods combine several models to obtain better predictions than any single one.

| Method | Principle | Advantages | Drawbacks |
|---------|----------|-----------|---------------|
| **Random Forest** | Average of many trees | Robust, less prone to overfitting | Often less accurate than boosting |
| **Gradient Boosting** | Sequential trees, each correcting the previous errors | Often very accurate | Overfits more easily |
| **XGBoost** | Optimized GB with regularization | Fast, regularized | Many hyperparameters to tune |
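
The "sequential trees correct the errors" principle in the Gradient Boosting row can be sketched by hand for squared-error loss: each new shallow tree is fitted to the residuals of the current ensemble. This is a simplified illustration on synthetic data with assumed settings (`learning_rate`, `max_depth`, 50 rounds), not the implementation used by scikit-learn or XGBoost.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.2, size=300)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # start from the mean
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(50):
    residuals = y_demo - prediction              # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residuals)                  # next tree targets the residuals
    prediction += learning_rate * tree.predict(X_demo)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"train MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

The training error keeps falling with every round, which is exactly why boosting needs a validation set, a small learning rate, or early stopping to avoid fitting noise.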

### Random Forest Regressor

```
Data
   |
   +-- Bootstrap sample 1 --> Tree 1 --> Prediction 1
   +-- Bootstrap sample 2 --> Tree 2 --> Prediction 2
   +-- ...                    ...        ...
   +-- Bootstrap sample N --> Tree N --> Prediction N
   |
   v
Final prediction = Mean(Prediction 1, ..., Prediction N)
```
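
The diagram above can be reproduced in a few lines: draw bootstrap samples, fit one tree per sample, and average the predictions. This is a bare-bones bagging sketch on synthetic data; it leaves out the per-split feature subsampling that a random forest can add, and the names `X_demo`, `y_demo`, and `n_trees` are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=200)

n_trees = 25
tree_predictions = []
for i in range(n_trees):
    # Bootstrap sample: draw n rows with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    tree = DecisionTreeRegressor(max_depth=4, random_state=i)
    tree.fit(X_demo[idx], y_demo[idx])
    tree_predictions.append(tree.predict(X_demo))

tree_predictions = np.array(tree_predictions)        # shape (n_trees, n_samples)
ensemble_prediction = tree_predictions.mean(axis=0)  # final prediction = average

print(tree_predictions.shape, ensemble_prediction.shape)
```

Averaging reduces the variance of the individual trees, which is where the robustness of the method comes from.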

In [None]:
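# Random Forest Regressor (tree models do not need scaled inputs; the scaled
# features are reused here only for consistency with the other models)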

print("="*60)
print("RANDOM FOREST REGRESSOR")
print("="*60)

# Hyperparameters to try
rf_params = [
    {'n_estimators': 50, 'max_depth': 4, 'min_samples_leaf': 20},
    {'n_estimators': 100, 'max_depth': 6, 'min_samples_leaf': 20},
    {'n_estimators': 100, 'max_depth': 8, 'min_samples_leaf': 10},
    {'n_estimators': 200, 'max_depth': 6, 'min_samples_leaf': 20},
]

rf_results = []

for params in rf_params:
    rf = RandomForestRegressor(
        n_estimators=params['n_estimators'],
        max_depth=params['max_depth'],
        min_samples_leaf=params['min_samples_leaf'],
        random_state=42,
        n_jobs=-1
    )
    rf.fit(X_train_scaled, y_train)
    
    y_train_pred = rf.predict(X_train_scaled)
    y_test_pred = rf.predict(X_test_scaled)
    
    train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
    test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
    test_r2 = r2_score(y_test, y_test_pred)
    
    rf_results.append({
        'n_estimators': params['n_estimators'],
        'max_depth': params['max_depth'],
        'min_samples_leaf': params['min_samples_leaf'],
        'train_rmse': train_rmse,
        'test_rmse': test_rmse,
        'test_r2': test_r2
    })

rf_df = pd.DataFrame(rf_results)
print("\nRandom Forest results:")
print(rf_df.to_string(index=False))

# Best model
best_rf_idx = rf_df['test_rmse'].idxmin()
best_rf_params = rf_params[best_rf_idx]
print(f"\nBest parameters: {best_rf_params}")

In [None]:
# Feature importance with Random Forest

# Train the best model
best_rf = RandomForestRegressor(
    n_estimators=best_rf_params['n_estimators'],
    max_depth=best_rf_params['max_depth'],
    min_samples_leaf=best_rf_params['min_samples_leaf'],
    random_state=42,
    n_jobs=-1
)
best_rf.fit(X_train_scaled, y_train)

# Feature importance
importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': best_rf.feature_importances_
}).sort_values('importance', ascending=False)

print("Feature Importance (Random Forest):")
print(importance.to_string(index=False))

# Visualization
plt.figure(figsize=(10, 6))
colors = plt.cm.viridis(np.linspace(0, 1, len(importance)))
plt.barh(importance['feature'], importance['importance'], color=colors)
plt.xlabel('Importance')
plt.title('Feature Importance - Random Forest Regressor')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

In [None]:
# Gradient Boosting Regressor (sklearn)

print("="*60)
print("GRADIENT BOOSTING REGRESSOR (sklearn)")
print("="*60)

# Hyperparameters
gb_params = [
    {'n_estimators': 50, 'max_depth': 3, 'learning_rate': 0.1},
    {'n_estimators': 100, 'max_depth': 4, 'learning_rate': 0.1},
    {'n_estimators': 100, 'max_depth': 4, 'learning_rate': 0.05},
    {'n_estimators': 200, 'max_depth': 3, 'learning_rate': 0.05},
]

gb_results = []

for params in gb_params:
    gb = GradientBoostingRegressor(
        n_estimators=params['n_estimators'],
        max_depth=params['max_depth'],
        learning_rate=params['learning_rate'],
        min_samples_leaf=20,
        random_state=42
    )
    gb.fit(X_train_scaled, y_train)
    
    y_train_pred = gb.predict(X_train_scaled)
    y_test_pred = gb.predict(X_test_scaled)
    
    train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
    test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
    test_r2 = r2_score(y_test, y_test_pred)
    
    gb_results.append({
        'n_estimators': params['n_estimators'],
        'max_depth': params['max_depth'],
        'learning_rate': params['learning_rate'],
        'train_rmse': train_rmse,
        'test_rmse': test_rmse,
        'test_r2': test_r2
    })

gb_df = pd.DataFrame(gb_results)
print("\nGradient Boosting results:")
print(gb_df.to_string(index=False))

best_gb_idx = gb_df['test_rmse'].idxmin()
best_gb_params = gb_params[best_gb_idx]
print(f"\nBest parameters: {best_gb_params}")

In [None]:
# XGBoost Regressor (if available)

if XGB_AVAILABLE:
    print("="*60)
    print("XGBOOST REGRESSOR")
    print("="*60)
    
    # Hyperparameters
    xgb_params = [
        {'n_estimators': 50, 'max_depth': 3, 'learning_rate': 0.1},
        {'n_estimators': 100, 'max_depth': 4, 'learning_rate': 0.1},
        {'n_estimators': 100, 'max_depth': 4, 'learning_rate': 0.05},
        {'n_estimators': 200, 'max_depth': 3, 'learning_rate': 0.05},
    ]
    
    xgb_results = []
    
    for params in xgb_params:
        model = xgb.XGBRegressor(
            n_estimators=params['n_estimators'],
            max_depth=params['max_depth'],
            learning_rate=params['learning_rate'],
            objective='reg:squarederror',
            random_state=42,
            verbosity=0
        )
        model.fit(X_train_scaled, y_train)
        
        y_train_pred = model.predict(X_train_scaled)
        y_test_pred = model.predict(X_test_scaled)
        
        train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
        test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
        test_r2 = r2_score(y_test, y_test_pred)
        
        xgb_results.append({
            'n_estimators': params['n_estimators'],
            'max_depth': params['max_depth'],
            'learning_rate': params['learning_rate'],
            'train_rmse': train_rmse,
            'test_rmse': test_rmse,
            'test_r2': test_r2
        })
    
    xgb_df = pd.DataFrame(xgb_results)
    print("\nXGBoost results:")
    print(xgb_df.to_string(index=False))
    
    best_xgb_idx = xgb_df['test_rmse'].idxmin()
    best_xgb_params = xgb_params[best_xgb_idx]
    print(f"\nBest parameters: {best_xgb_params}")
else:
    print("XGBoost not available.")
    print("Install with: pip install xgboost")

---

## Part 4: Support Vector Regression (15 min)

### SVR - Support Vector Regression

SVR extends SVMs to regression. It looks for a function that deviates from the targets by at most epsilon: errors inside this epsilon tube are ignored, and larger ones are penalized with weight C.

In the kernel formulas below, `x` and `y` denote two feature vectors, not the target.

| Kernel | Formula | Use |
|--------|---------|-------|
| **Linear** | `x.T @ y` | Linear relationships |
| **RBF** | `exp(-gamma * norm(x - y)^2)` | Non-linear relationships (default) |
| **Poly** | `(gamma * x.T @ y + coef0)^degree` | Polynomial relationships |

### Key parameters

| Parameter | Description | Effect |
|-----------|-------------|-------|
| **C** | Regularization | Larger = less regularization |
| **epsilon** | Insensitive tube | Larger = more tolerance |
| **gamma** | Influence of individual points (RBF) | Larger = more local |
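
The role of `epsilon` is easiest to see on the loss itself. The sketch below evaluates the epsilon-insensitive loss on three made-up predictions; the helper name `epsilon_insensitive_loss` and the numbers are illustrative only.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-sample loss: zero inside the epsilon tube, linear outside."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

# Hypothetical returns and forecasts
y_true = np.array([0.010, 0.020, -0.015])
y_pred = np.array([0.0105, 0.015, -0.010])

loss = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.001)
print(loss)
```

The first forecast misses by less than `epsilon` and costs nothing; the other two are charged only for the part of the error that exceeds the tube.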

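### Importance of Scaling

**SVR needs standardized features.** Its kernels depend on distances and inner products between samples, so a feature on a much larger scale dominates the others, and performance can degrade sharply without scaling.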
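
One way to make the scaling step hard to forget is to bundle it with the model in a scikit-learn `Pipeline`. The sketch below uses two synthetic features on deliberately different scales; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(7)
# Two synthetic features on very different scales (illustration only)
X_demo = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
y_demo = 0.01 * X_demo[:, 0] + rng.normal(scale=0.005, size=200)

X_tr, X_te = X_demo[:150], X_demo[150:]
y_tr, y_te = y_demo[:150], y_demo[150:]

# The scaler is fitted on the training rows only, inside the pipeline
svr_pipeline = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.001))
svr_pipeline.fit(X_tr, y_tr)
y_hat = svr_pipeline.predict(X_te)

scaler = svr_pipeline.named_steps["standardscaler"]
print("scaler means learned from train:", scaler.mean_)
```

Because `predict` reapplies the training-set statistics automatically, the test rows never influence the scaling.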

In [None]:
# Support Vector Regression

print("="*60)
print("SUPPORT VECTOR REGRESSION (SVR)")
print("="*60)

# Demonstration: without scaling vs with scaling
print("\n1. Importance of Scaling:")

# Without scaling
svr_no_scale = SVR(kernel='rbf', C=1.0, epsilon=0.001)
svr_no_scale.fit(X_train, y_train)  # Unscaled data!
y_pred_no_scale = svr_no_scale.predict(X_test)
rmse_no_scale = np.sqrt(mean_squared_error(y_test, y_pred_no_scale))

# With scaling
svr_scaled = SVR(kernel='rbf', C=1.0, epsilon=0.001)
svr_scaled.fit(X_train_scaled, y_train)  # Scaled data
y_pred_scaled = svr_scaled.predict(X_test_scaled)
rmse_scaled = np.sqrt(mean_squared_error(y_test, y_pred_scaled))

print(f"   RMSE without scaling: {rmse_no_scale:.6f}")
print(f"   RMSE with scaling: {rmse_scaled:.6f}")
print(f"   Improvement: {(rmse_no_scale - rmse_scaled) / rmse_no_scale * 100:.1f}%")

In [None]:
# Try different SVR kernels and parameters

print("\n2. Kernel comparison:")

svr_configs = [
    {'kernel': 'linear', 'C': 0.1, 'epsilon': 0.001},
    {'kernel': 'linear', 'C': 1.0, 'epsilon': 0.001},
    {'kernel': 'rbf', 'C': 0.1, 'epsilon': 0.001},
    {'kernel': 'rbf', 'C': 1.0, 'epsilon': 0.001},
    {'kernel': 'rbf', 'C': 10.0, 'epsilon': 0.001},
    {'kernel': 'poly', 'C': 1.0, 'epsilon': 0.001, 'degree': 2},
]

svr_results = []

for config in svr_configs:
    svr = SVR(**config)
    svr.fit(X_train_scaled, y_train)
    
    y_train_pred = svr.predict(X_train_scaled)
    y_test_pred = svr.predict(X_test_scaled)
    
    train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
    test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
    test_r2 = r2_score(y_test, y_test_pred)
    
    result = {
        'kernel': config['kernel'],
        'C': config['C'],
        'train_rmse': train_rmse,
        'test_rmse': test_rmse,
        'test_r2': test_r2
    }
    svr_results.append(result)

svr_df = pd.DataFrame(svr_results)
print(svr_df.to_string(index=False))

best_svr_idx = svr_df['test_rmse'].idxmin()
best_svr_config = svr_configs[best_svr_idx]
print(f"\nBest config: {best_svr_config}")

In [None]:
# SVR code for QuantConnect
# To be adapted to the QuantConnect environment

svr_qc_code = '''
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler

class SVRRegressionModel:
    """
    SVR wrapper for QuantConnect.
    Bundles the scaler so inputs are standardized automatically.
    """
    
    def __init__(self, kernel='rbf', C=1.0, epsilon=0.001):
        self.scaler = StandardScaler()
        self.model = SVR(kernel=kernel, C=C, epsilon=epsilon)
        self.is_fitted = False
    
    def fit(self, X, y):
        """Fit the scaler and the model."""
        X_scaled = self.scaler.fit_transform(X)
        self.model.fit(X_scaled, y)
        self.is_fitted = True
    
    def predict(self, X):
        """Predict with automatic scaling."""
        if not self.is_fitted:
            raise ValueError("Model not fitted")
        X_scaled = self.scaler.transform(X)
        return self.model.predict(X_scaled)
'''

print("SVR code for QuantConnect:")
print(svr_qc_code)

---

## Part 5: Regression Metrics (15 min)

### Standard metrics

| Metric | Formula | Interpretation |
|----------|---------|---------------|
| **MSE** | mean((y - y_pred)^2) | Mean squared error |
| **RMSE** | sqrt(MSE) | Same unit as y |
| **MAE** | mean(abs(y - y_pred)) | Mean absolute error |
| **R2** | 1 - SS_res / SS_tot | Share of variance explained (at most 1; can be negative out of sample) |
| **MAPE** | mean(abs((y - y_pred)/y)) | Percentage error (unstable when y is close to 0, as returns often are) |
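
The formulas in the table can be checked by hand on a tiny example. The four returns and forecasts below are invented for illustration; `bad_pred` is a deliberately poor forecast included to show that R2 is not bounded below by 0.

```python
import numpy as np

# Hypothetical realized returns and forecasts
y_true = np.array([0.02, -0.01, 0.03, -0.02])
y_pred = np.array([0.01, 0.00, 0.01, -0.01])

residuals = y_true - y_pred
mse = np.mean(residuals ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(residuals))
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# A poor forecast scores below 0: worse than predicting the mean of y_true
bad_pred = -y_true
r2_bad = 1 - np.sum((y_true - bad_pred) ** 2) / ss_tot

print(f"MSE={mse:.6f} RMSE={rmse:.4f} MAE={mae:.4f} R2={r2:.3f} R2_bad={r2_bad:.3f}")
```

On noisy return data, an out-of-sample R2 close to zero, or slightly negative, is common and does not by itself mean the forecasts are useless for ranking or direction.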

### Trading-specific metrics

| Metric | Description | Importance |
|----------|-------------|------------|
| **Direction Accuracy** | % of predictions with the correct sign | Crucial for PnL |
| **Correlation** | Pearson/Spearman correlation with y | Quality of the ranking |
| **IC (Information Coefficient)** | Correlation between predictions and subsequent realized returns | Standard in quant finance |
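
A small constructed example shows why direction accuracy is tracked separately from RMSE. Both forecast vectors below are invented: model A makes tiny forecasts with the wrong sign every time, model B gets every sign right but overshoots the magnitude.

```python
import numpy as np

y_true = np.array([0.01, -0.01, 0.02, -0.02])
pred_a = np.array([-0.001, 0.001, -0.001, 0.001])  # tiny forecasts, wrong sign every time
pred_b = np.array([0.04, -0.04, 0.05, -0.05])      # right sign, magnitude overshoots

def rmse(y, p):
    """Root mean squared error."""
    return np.sqrt(np.mean((y - p) ** 2))

def direction_accuracy(y, p):
    """Share of forecasts whose sign matches the realized sign."""
    return np.mean(np.sign(y) == np.sign(p))

for name, p in [("A", pred_a), ("B", pred_b)]:
    print(f"model {name}: RMSE={rmse(y_true, p):.4f}  "
          f"direction={direction_accuracy(y_true, p):.0%}")
```

Model A wins on RMSE yet would lose money on every trade, while model B has the larger error and the correct direction each time.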

In [None]:
# Compute all metrics

def calculate_regression_metrics(y_true, y_pred, model_name):
    """
    Compute standard and trading-specific regression metrics.
    
    Parameters:
    -----------
    y_true : array-like
        Actual values
    y_pred : array-like
        Predictions
    model_name : str
        Model name
    
    Returns:
    --------
    dict
        Dictionary of metrics
    """
    # Standard metrics
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    
    # Direction accuracy (crucial for trading!)
    direction_true = np.sign(y_true)
    direction_pred = np.sign(y_pred)
    direction_accuracy = np.mean(direction_true == direction_pred)
    
    # Correlation
    correlation = np.corrcoef(y_true, y_pred)[0, 1]
    
    # Rank correlation (Spearman)
    from scipy.stats import spearmanr
    spearman_corr, _ = spearmanr(y_true, y_pred)
    
    return {
        'model': model_name,
        'mse': mse,
        'rmse': rmse,
        'mae': mae,
        'r2': r2,
        'direction_accuracy': direction_accuracy,
        'pearson_corr': correlation,
        'spearman_corr': spearman_corr
    }

# Train the best models and compute the metrics
models = {
    'Ridge': Ridge(alpha=best_ridge_alpha),
    'Lasso': Lasso(alpha=best_lasso_alpha, max_iter=10000),
    'Random Forest': RandomForestRegressor(**best_rf_params, random_state=42, n_jobs=-1),
    # min_samples_leaf=20 matches the setting used during the search above
    'Gradient Boosting': GradientBoostingRegressor(**best_gb_params, min_samples_leaf=20, random_state=42),
    'SVR': SVR(**best_svr_config)
}

if XGB_AVAILABLE:
    models['XGBoost'] = xgb.XGBRegressor(**best_xgb_params, random_state=42, verbosity=0)

all_metrics = []

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    metrics = calculate_regression_metrics(y_test.values, y_pred, name)
    all_metrics.append(metrics)

metrics_df = pd.DataFrame(all_metrics)
print("="*80)
print("COMPARISON OF ALL MODELS")
print("="*80)
print(metrics_df.to_string(index=False))

In [None]:
# Metrics visualization

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# RMSE per model
ax1 = axes[0, 0]
colors = plt.cm.viridis(np.linspace(0, 1, len(metrics_df)))
bars = ax1.bar(metrics_df['model'], metrics_df['rmse'] * 100, color=colors)
ax1.set_ylabel('RMSE (%)')
ax1.set_title('RMSE per model (lower is better)')
ax1.tick_params(axis='x', rotation=45)

# R2 per model
ax2 = axes[0, 1]
bars = ax2.bar(metrics_df['model'], metrics_df['r2'], color=colors)
ax2.axhline(0, color='red', linestyle='--', alpha=0.5)
ax2.set_ylabel('R2 Score')
ax2.set_title('R2 per model (higher is better)')
ax2.tick_params(axis='x', rotation=45)

# Direction accuracy (crucial!)
ax3 = axes[1, 0]
bars = ax3.bar(metrics_df['model'], metrics_df['direction_accuracy'] * 100, color=colors)
ax3.axhline(50, color='red', linestyle='--', label='Random (50%)')
ax3.set_ylabel('Direction Accuracy (%)')
ax3.set_title('Direction accuracy (important for trading!)')
ax3.tick_params(axis='x', rotation=45)
ax3.legend()

# Correlation
ax4 = axes[1, 1]
x = np.arange(len(metrics_df))
width = 0.35
ax4.bar(x - width/2, metrics_df['pearson_corr'], width, label='Pearson', color='steelblue')
ax4.bar(x + width/2, metrics_df['spearman_corr'], width, label='Spearman', color='coral')
ax4.set_xticks(x)
ax4.set_xticklabels(metrics_df['model'], rotation=45)
ax4.set_ylabel('Correlation')
ax4.set_title('Correlation with realized returns')
ax4.legend()

plt.tight_layout()
plt.show()

print("\nImportant note:")
print("  Direction accuracy often matters more than RMSE for trading.")
print("  A model can have a good RMSE and still predict the wrong direction.")

In [None]:
# Predictions vs actuals for the best model

# Pick the best model by direction accuracy
best_model_name = metrics_df.loc[metrics_df['direction_accuracy'].idxmax(), 'model']
best_model = models[best_model_name]
best_model.fit(X_train_scaled, y_train)
y_pred_best = best_model.predict(X_test_scaled)

print(f"Best model (Direction Accuracy): {best_model_name}")

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Scatter plot: predictions vs actuals
ax1 = axes[0]
ax1.scatter(y_test.values * 100, y_pred_best * 100, alpha=0.5)
ax1.plot([-5, 5], [-5, 5], 'r--', label='Perfect prediction')
ax1.set_xlabel('Actual return (%)')
ax1.set_ylabel('Predicted return (%)')
ax1.set_title(f'{best_model_name}: Predictions vs Actuals')
ax1.legend()
ax1.set_xlim(-5, 5)
ax1.set_ylim(-5, 5)

# Error distribution
ax2 = axes[1]
errors = (y_test.values - y_pred_best) * 100
ax2.hist(errors, bins=50, color='steelblue', edgecolor='white')
ax2.axvline(0, color='red', linestyle='--')
ax2.axvline(np.mean(errors), color='green', linestyle='--', label=f'Mean: {np.mean(errors):.3f}%')
ax2.set_xlabel('Prediction error (%)')
ax2.set_ylabel('Frequency')
ax2.set_title('Error distribution')
ax2.legend()

plt.tight_layout()
plt.show()

---

## Part 6: Trading Integration - Prediction to Signal (20 min)

### Converting predictions into trading signals

A return prediction has to be converted into an actionable signal:

```
Prediction (y_pred)
       |
       v
+-- > +1% ? --> LONG (buy)
+-- < -1% ? --> SHORT (sell)
+-- otherwise --> NEUTRAL (no position)
       |
       v
Position sizing (based on magnitude)
       |
       v
Trading signal
```
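
The three-way rule in the diagram maps directly onto `np.select`. The function name `predictions_to_signals` and its default thresholds of plus and minus 1% are taken from the diagram for illustration; real thresholds would depend on costs and the prediction horizon.

```python
import numpy as np

def predictions_to_signals(y_pred, long_threshold=0.01, short_threshold=-0.01):
    """Map predicted returns to +1 (long), -1 (short) or 0 (neutral)."""
    y_pred = np.asarray(y_pred, dtype=float)
    return np.select(
        [y_pred > long_threshold, y_pred < short_threshold],
        [1, -1],
        default=0,
    )

# Hypothetical predicted 5-day returns
signals = predictions_to_signals([0.018, 0.004, -0.025, -0.01])
print(signals)
```

A prediction exactly on a threshold stays neutral here because the comparisons are strict; that choice is arbitrary and easy to change.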

### Position Sizing Strategies

| Method | Description | Formula |
|---------|-------------|--------|
| **Fixed** | Same size for every trade | size = constant |
| **Proportional** | Proportional to the prediction | size = k * abs(y_pred) |
| **Confidence** | Based on model confidence | size = k * confidence |
| **Kelly** | Kelly criterion | size = edge / variance |
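
Two of the rows above, sketched under stated assumptions: the scaling constant `k`, the cap `max_size`, and the Kelly `fraction` are hypothetical parameters introduced here, and the Kelly variant is the common fractional form applied to an expected return and a return variance.

```python
import numpy as np

def proportional_size(y_pred, k=20.0, max_size=1.0):
    """size = k * |y_pred|, capped at max_size and signed like the forecast."""
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sign(y_pred) * np.minimum(k * np.abs(y_pred), max_size)

def kelly_size(expected_return, variance, fraction=0.5, max_size=1.0):
    """Fractional Kelly: fraction * edge / variance, clipped to [-max_size, max_size]."""
    raw = fraction * expected_return / variance
    return float(np.clip(raw, -max_size, max_size))

# Hypothetical forecasts and risk estimate
print(proportional_size([0.018, -0.004, 0.08]))
print(kelly_size(expected_return=0.002, variance=0.03 ** 2))
```

Capping matters in both cases: raw Kelly sizes are very sensitive to errors in the predicted return, so an uncapped rule can ask for extreme leverage on a noisy forecast.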

In [None]:
# RegressionAlphaModel class for QuantConnect

from datetime import timedelta

qc_regression_alpha_code = '''
from AlgorithmImports import *
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd

class RegressionAlphaModel(AlphaModel):
    """
    Alpha Model that uses ML regression to predict returns.
    Converts predictions into Insights with magnitude and confidence.

    Note: RegressionSymbolData and _prepare_training_data are referenced
    below but are not defined in this excerpt.
    """
    
    def __init__(self, lookback=252, retrain_period=30, 
                 min_return_threshold=0.005, max_confidence=0.02):
        """
        Parameters:
        -----------
        lookback : int
            Number of days used for training
        retrain_period : int
            Retraining frequency (days)
        min_return_threshold : float
            Minimum predicted return required to emit a signal (e.g. 0.5%)
        max_confidence : float
            Predicted return that maps to 100% confidence (e.g. 2%)
        """
        self.lookback = lookback
        self.retrain_period = retrain_period
        self.min_return_threshold = min_return_threshold
        self.max_confidence = max_confidence
        
        self.symbolData = {}
        self.model = None
        self.scaler = None
        self.last_train_time = None
        self.insight_period = timedelta(days=5)
    
    def Update(self, algorithm, data):
        """
        Generates Insights from the model's predictions.
        """
        insights = []
        
        # Check whether the model needs retraining
        if self._should_retrain(algorithm):
            self._train_model(algorithm)
        
        # No model yet, so no insights
        if self.model is None:
            return insights
        
        for symbol, sd in self.symbolData.items():
            if not sd.IsReady:
                continue
            
            if not data.ContainsKey(symbol):
                continue
            
            # Extract the features
            features = self._extract_features(sd)
            if features is None:
                continue
            
            # Prediction
            features_scaled = self.scaler.transform([features])
            predicted_return = self.model.predict(features_scaled)[0]
            
            # Filter out small predictions
            if abs(predicted_return) < self.min_return_threshold:
                continue
            
            # Convert to a direction
            if predicted_return > 0:
                direction = InsightDirection.Up
            else:
                direction = InsightDirection.Down
            
            # Confidence in [0, 1]: scales linearly with the predicted magnitude
            # and saturates at 1.0 once |prediction| reaches max_confidence
            confidence = min(abs(predicted_return) / self.max_confidence, 1.0)
            
            # Create the Insight
            insight = Insight.Price(
                symbol,
                self.insight_period,
                direction,
                magnitude=abs(predicted_return),
                confidence=confidence
            )
            insights.append(insight)
            
            algorithm.Debug(f"{symbol.Value}: Pred={predicted_return:.4f}, Dir={direction}, Conf={confidence:.2f}")
        
        return insights
    
    def _should_retrain(self, algorithm):
        """Decides whether the model should be retrained."""
        if self.last_train_time is None:
            return True
        
        days_since_train = (algorithm.Time - self.last_train_time).days
        return days_since_train >= self.retrain_period
    
    def _train_model(self, algorithm):
        """Trains the model on historical data."""
        # Collect the training data
        X_train = []
        y_train = []
        
        for symbol, sd in self.symbolData.items():
            history = algorithm.History(symbol, self.lookback + 10, Resolution.Daily)
            if history.empty:
                continue
            
            # Compute features and targets
            # (simplified: _prepare_training_data is not shown in this excerpt)
            for features, target in self._prepare_training_data(history):
                X_train.append(features)
                y_train.append(target)
        
        if len(X_train) < 100:
            return  # Not enough data
        
        X_train = np.array(X_train)
        y_train = np.array(y_train)
        
        # Standardization
        self.scaler = StandardScaler()
        X_train_scaled = self.scaler.fit_transform(X_train)
        
        # Training
        self.model = GradientBoostingRegressor(
            n_estimators=100,
            max_depth=4,
            learning_rate=0.1,
            random_state=42
        )
        self.model.fit(X_train_scaled, y_train)
        
        self.last_train_time = algorithm.Time
        algorithm.Debug(f"Model retrained with {len(X_train)} samples")
    
    def _extract_features(self, sd):
        """Extracts the feature vector from the SymbolData."""
        if not sd.IsReady:
            return None
        
        return [
            sd.Return1D,
            sd.Return5D,
            sd.Volatility,
            sd.RSI_Normalized,
            sd.MARelative,
            sd.MACD_Normalized,
            sd.BB_PercentB
        ]
    
    def OnSecuritiesChanged(self, algorithm, changes):
        """Handles universe changes."""
        for security in changes.AddedSecurities:
            symbol = security.Symbol
            if symbol not in self.symbolData:
                self.symbolData[symbol] = RegressionSymbolData(algorithm, symbol)
        
        for security in changes.RemovedSecurities:
            symbol = security.Symbol
            if symbol in self.symbolData:
                del self.symbolData[symbol]
'''

print("RegressionAlphaModel for QuantConnect:")
print(qc_regression_alpha_code)
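
The way `Update` turns a predicted return into an Insight's direction, magnitude, and confidence can be checked outside QuantConnect. The sketch below mirrors that logic in plain Python; `insight_parameters` is a hypothetical helper, and the strings `"Up"` and `"Down"` stand in for `InsightDirection.Up` and `InsightDirection.Down`.

```python
def insight_parameters(predicted_return, min_return_threshold=0.005,
                       max_confidence=0.02):
    """Return (direction, magnitude, confidence), or None if the prediction is filtered out."""
    # Small predictions produce no Insight at all
    if abs(predicted_return) < min_return_threshold:
        return None
    direction = "Up" if predicted_return > 0 else "Down"
    magnitude = abs(predicted_return)
    # Linear in the magnitude, saturating at 1.0
    confidence = min(magnitude / max_confidence, 1.0)
    return direction, magnitude, confidence


for pred in (0.003, 0.01, -0.03):
    print(pred, insight_parameters(pred))
# 0.003 None
# 0.01 ('Up', 0.01, 0.5)
# -0.03 ('Down', 0.03, 1.0)
```

Note that this "confidence" is only a rescaled magnitude: it says nothing about how reliable the model's prediction is.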

In [None]:
# Trading simulation based on the predictions

def simulate_trading_strategy(y_true, y_pred, threshold=0.005):
    """
    Simulates a trading strategy driven by the predictions.
    
    Parameters:
    -----------
    y_true : array-like
        Realized returns
    y_pred : array-like
        Model predictions
    threshold : float
        Minimum absolute prediction required to take a position
    
    Returns:
    --------
    dict
        Strategy metrics
    """
    # Signals: +1 long, -1 short, 0 flat
    signals = np.where(y_pred > threshold, 1,
                       np.where(y_pred < -threshold, -1, 0))
    
    # Strategy returns (transaction costs are ignored)
    strategy_returns = signals * y_true
    
    # Metrics (total_return is an arithmetic sum of returns, not a compounded return)
    total_return = np.sum(strategy_returns)
    n_trades = np.sum(signals != 0)
    win_rate = np.mean(strategy_returns[signals != 0] > 0) if n_trades > 0 else 0
    avg_win = np.mean(strategy_returns[strategy_returns > 0]) if np.any(strategy_returns > 0) else 0
    avg_loss = np.mean(strategy_returns[strategy_returns < 0]) if np.any(strategy_returns < 0) else 0
    # The sqrt(252) annualization assumes one return per trading day
    sharpe = np.mean(strategy_returns) / (np.std(strategy_returns) + 1e-10) * np.sqrt(252)
    
    # Buy & Hold
    bh_return = np.sum(y_true)
    
    return {
        'total_return': total_return,
        'n_trades': n_trades,
        'win_rate': win_rate,
        'avg_win': avg_win,
        'avg_loss': avg_loss,
        'sharpe': sharpe,
        'buy_hold_return': bh_return,
        'outperformance': total_return - bh_return
    }

# Simulate each model
print("="*80)
print("TRADING SIMULATION")
print("="*80)

trading_results = []

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    
    result = simulate_trading_strategy(y_test.values, y_pred, threshold=0.005)
    result['model'] = name
    trading_results.append(result)

trading_df = pd.DataFrame(trading_results)
trading_df = trading_df[['model', 'total_return', 'n_trades', 'win_rate', 'sharpe', 'buy_hold_return', 'outperformance']]

# Format the percentages
for col in ['total_return', 'win_rate', 'buy_hold_return', 'outperformance']:
    trading_df[col] = trading_df[col].apply(lambda x: f"{x*100:.2f}%")

print(trading_df.to_string(index=False))
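
`simulate_trading_strategy` reports `total_return` as an arithmetic sum of per-period returns, while the equity curves in the next cell compound the same returns with `cumprod`, so the two figures generally differ. A small worked example, with made-up returns and hypothetical helper names, shows the gap:

```python
import math


def summed_return(returns):
    """Arithmetic sum of simple returns, as in simulate_trading_strategy."""
    return sum(returns)


def compounded_return(returns):
    """Compounded return, equivalent to the last value of (1 + r).cumprod() minus 1."""
    return math.prod(1.0 + r for r in returns) - 1.0


# Illustrative returns only
strategy_returns = [0.10, -0.10, 0.05]
print(round(summed_return(strategy_returns), 4))      # 0.05
print(round(compounded_return(strategy_returns), 4))  # 0.0395
```

For small daily returns the difference is minor, but it grows with volatility and with the length of the test period.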

In [None]:
# Equity curve comparison

fig, ax = plt.subplots(figsize=(14, 7))

# Buy & Hold
bh_cumulative = (1 + pd.Series(y_test.values)).cumprod()
ax.plot(y_test.index, bh_cumulative, label='Buy & Hold', linewidth=2, color='gray', linestyle='--')

# ML strategies
colors = plt.cm.tab10(np.linspace(0, 1, len(models)))

for (name, model), color in zip(models.items(), colors):
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    
    # Signals
    signals = np.where(y_pred > 0.005, 1, np.where(y_pred < -0.005, -1, 0))
    strategy_returns = signals * y_test.values
    
    cumulative = (1 + pd.Series(strategy_returns)).cumprod()
    ax.plot(y_test.index, cumulative, label=name, linewidth=1.5, color=color)

ax.axhline(1, color='black', linestyle='-', linewidth=0.5)
ax.set_xlabel('Date')
ax.set_ylabel('Portfolio Value (base 1)')
ax.set_title('Equity Curves: ML Strategies vs Buy & Hold')
ax.legend(loc='upper left')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## Part 7: Complete Strategy - Return Prediction (20 min)

### Strategy architecture

```
Universe Selection
(Top 50 liquid stocks)
        |
        v
Feature Engineering
(Technical indicators)
        |
        v
ML Regression Model
(Gradient Boosting or XGBoost Regressor)
        |
        v
Prediction Filtering
(|pred| > 1% threshold)
        |
        v
Signal Generation
(Long if pred > 1%, Short if pred < -1%)
        |
        v
Position Sizing
(Proportional to magnitude)
        |
        v
Risk Management
(5% max drawdown per position)
```
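
The filtering, signal, and sizing stages of this pipeline can be sketched as one pure function, independent of QuantConnect. The function name `target_weights`, the tickers, and the predictions are illustrative assumptions; the defaults (1% threshold, 10 positions, 10% cap) follow the parameters used in this part.

```python
def target_weights(predictions, threshold=0.01, max_positions=10, max_weight=0.10):
    """Turn {ticker: predicted return} into {ticker: signed portfolio weight}."""
    # 1. Prediction filtering
    kept = {s: p for s, p in predictions.items() if abs(p) >= threshold}
    # 2. Keep the strongest predictions
    top = sorted(kept.items(), key=lambda kv: abs(kv[1]), reverse=True)[:max_positions]
    total = sum(abs(p) for _, p in top)
    if total == 0:
        return {}
    # 3. Weight proportional to magnitude, capped, signed by direction
    weights = {}
    for symbol, pred in top:
        weight = min(abs(pred) / total, max_weight)
        weights[symbol] = weight if pred > 0 else -weight
    return weights


# Hypothetical tickers and predictions
print(target_weights({"AAA": 0.03, "BBB": -0.02, "CCC": 0.004}))
# {'AAA': 0.1, 'BBB': -0.1}
```

When only a few predictions pass the filter, the proportional weights (0.6 and 0.4 here) exceed the cap, so every position ends up at the cap and most of the capital stays in cash.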

In [None]:
# Complete strategy for QuantConnect

qc_strategy_code = '''
from AlgorithmImports import *
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd
from datetime import timedelta

class ReturnPredictionStrategy(QCAlgorithm):
    """
    Complete ML-based return prediction strategy.
    
    - Target: 5-day return
    - Model: Gradient Boosting Regressor (or XGBoost)
    - Signal: |predicted return| > 1%
    - Position sizing: proportional to the magnitude, capped per position
    - Risk control: 10% cap per position (no drawdown stop is implemented here)
    """
    
    def Initialize(self):
        # Configuration
        self.SetStartDate(2020, 1, 1)
        self.SetEndDate(2023, 12, 31)
        self.SetCash(100000)
        
        # Parameters
        self.lookback = 252  # 1 year of history
        self.prediction_horizon = 5  # Predict the 5-day return
        self.retrain_frequency = 30  # Retrain every 30 calendar days
        self.min_signal_threshold = 0.01  # 1% minimum
        self.max_positions = 10
        self.max_position_size = 0.1  # 10% max per position
        
        # Universe (settings are applied before the universe is added)
        self.num_stocks = 50
        self.UniverseSettings.Resolution = Resolution.Daily
        self.AddUniverse(self.CoarseSelection)
        
        # ML Model
        self.model = None
        self.scaler = StandardScaler()
        self.last_train_time = None
        
        # Storage
        self.symbolData = {}
        
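        # Schedule rebalancing. SPY is subscribed explicitly because the time
        # rule looks up its market hours (assumption: it is not added elsewhere).
        # SPY then also reaches OnSecuritiesChanged like any other security.
        self.AddEquity("SPY", Resolution.Daily)
        self.Schedule.On(
            self.DateRules.EveryDay(),
            self.TimeRules.AfterMarketOpen("SPY", 30),
            self.Rebalance
        )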
        
        self.Log("Return Prediction Strategy initialized")
    
    def CoarseSelection(self, coarse):
        """Selects the top N stocks by liquidity (dollar volume)."""
        filtered = [x for x in coarse 
                    if x.HasFundamentalData 
                    and x.Price > 5 
                    and x.DollarVolume > 10000000]
        
        sorted_stocks = sorted(filtered, key=lambda x: x.DollarVolume, reverse=True)
        return [x.Symbol for x in sorted_stocks[:self.num_stocks]]
    
    def OnSecuritiesChanged(self, changes):
        """Initializes indicators for newly added securities."""
        for security in changes.AddedSecurities:
            symbol = security.Symbol
            if symbol not in self.symbolData:
                self.symbolData[symbol] = PredictionSymbolData(
                    self, symbol, self.lookback
                )
        
        for security in changes.RemovedSecurities:
            symbol = security.Symbol
            if symbol in self.symbolData:
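                del self.symbolData[symbol]
    
    def OnData(self, data):
        """Appends the latest close to each price window so the features stay current."""
        for symbol, sd in self.symbolData.items():
            if data.Bars.ContainsKey(symbol):
                sd.price_window.Add(float(data.Bars[symbol].Close))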
    
    def Rebalance(self):
        """Main rebalancing logic."""
        # Check whether the model needs retraining
        if self._should_retrain():
            self._train_model()
        
        if self.model is None:
            return
        
        # Generate the predictions
        predictions = {}
        
        for symbol, sd in self.symbolData.items():
            if not sd.IsReady:
                continue
            
            features = sd.GetFeatures()
            if features is None:
                continue
            
            # Prediction
            features_scaled = self.scaler.transform([features])
            predicted_return = self.model.predict(features_scaled)[0]
            
            # Filter
            if abs(predicted_return) >= self.min_signal_threshold:
                predictions[symbol] = predicted_return
        
        # Sort by magnitude (top predictions)
        sorted_predictions = sorted(
            predictions.items(), 
            key=lambda x: abs(x[1]), 
            reverse=True
        )[:self.max_positions]
        
        # Compute the weights
        targets = {}
        total_magnitude = sum(abs(p[1]) for p in sorted_predictions)
        
        for symbol, pred in sorted_predictions:
            # Weight proportional to the magnitude
            weight = abs(pred) / total_magnitude if total_magnitude > 0 else 0
            weight = min(weight, self.max_position_size)  # Cap
            
            # Direction
            if pred > 0:
                targets[symbol] = weight
            else:
                targets[symbol] = -weight  # Short
        
        # Liquidate positions that are no longer targets
        for symbol in self.Portfolio.Keys:
            if symbol not in targets and self.Portfolio[symbol].Invested:
                self.Liquidate(symbol)
        
        # Execute the targets
        for symbol, weight in targets.items():
            self.SetHoldings(symbol, weight)
            self.Log(f"Position: {symbol.Value} = {weight:.2%}")
    
    def _should_retrain(self):
        """Decides whether the model should be retrained."""
        if self.last_train_time is None:
            return True
        return (self.Time - self.last_train_time).days >= self.retrain_frequency
    
    def _train_model(self):
        """Trains the ML model."""
        X_train = []
        y_train = []
        
        for symbol, sd in self.symbolData.items():
            history = self.History(symbol, self.lookback + self.prediction_horizon, Resolution.Daily)
            if history.empty or len(history) < self.lookback:
                continue
            
            # Prepare the data
            features_list, targets_list = sd.PrepareTrainingData(
                history, self.prediction_horizon
            )
            
            X_train.extend(features_list)
            y_train.extend(targets_list)
        
        if len(X_train) < 200:
            self.Log(f"Not enough data for training: {len(X_train)} samples")
            return
        
        X_train = np.array(X_train)
        y_train = np.array(y_train)
        
        # Standardize
        X_train_scaled = self.scaler.fit_transform(X_train)
        
        # Train
        self.model = GradientBoostingRegressor(
            n_estimators=100,
            max_depth=4,
            learning_rate=0.1,
            min_samples_leaf=20,
            random_state=42
        )
        self.model.fit(X_train_scaled, y_train)
        
        self.last_train_time = self.Time
        self.Log(f"Model trained with {len(X_train)} samples")
    
    def OnEndOfAlgorithm(self):
        """Final summary."""
        self.Log("="*60)
        self.Log("RETURN PREDICTION STRATEGY - SUMMARY")
        self.Log("="*60)
        self.Log(f"Final Value: ${self.Portfolio.TotalPortfolioValue:,.2f}")
        total_return = (self.Portfolio.TotalPortfolioValue - 100000) / 100000
        self.Log(f"Total Return: {total_return:.1%}")


class PredictionSymbolData:
    """
    Stores the indicators and computes the features for one symbol.
    """
    
    def __init__(self, algorithm, symbol, lookback):
        self.symbol = symbol
        self.algorithm = algorithm
        
        # Indicators
        self.rsi = algorithm.RSI(symbol, 14, Resolution.Daily)
        self.macd = algorithm.MACD(symbol, 12, 26, 9, Resolution.Daily)
        self.bb = algorithm.BB(symbol, 20, 2, Resolution.Daily)
        self.sma_20 = algorithm.SMA(symbol, 20, Resolution.Daily)
        self.sma_50 = algorithm.SMA(symbol, 50, Resolution.Daily)
        self.atr = algorithm.ATR(symbol, 14, Resolution.Daily)
        
        # Rolling window of closes used for the return features
        self.price_window = RollingWindow[float](lookback)
        
        # Warm-up from history
        history = algorithm.History(symbol, lookback, Resolution.Daily)
        if not history.empty:
            for bar in history.itertuples():
                self.price_window.Add(bar.close)
    
    @property
    def IsReady(self):
        return (self.rsi.IsReady and 
                self.macd.IsReady and 
                self.bb.IsReady and
                self.sma_50.IsReady and
                self.price_window.IsReady)
    
    def GetFeatures(self):
        """Returns the current feature vector."""
        if not self.IsReady:
            return None
        
        price = self.price_window[0]
        
        # Returns
        return_1d = (self.price_window[0] - self.price_window[1]) / self.price_window[1]
        return_5d = (self.price_window[0] - self.price_window[5]) / self.price_window[5]
        return_20d = (self.price_window[0] - self.price_window[20]) / self.price_window[20]
        
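        # Volatility (20d): index 0 is the latest close, so read the window
        # oldest-to-newest (21 closes give 20 daily returns)
        prices = np.array([self.price_window[i] for i in range(20, -1, -1)])
        returns = np.diff(prices) / prices[:-1]
        volatility = np.std(returns)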
        
        # RSI normalized
        rsi_norm = (self.rsi.Current.Value - 50) / 50
        
        # MA ratio
        ma_ratio = self.sma_20.Current.Value / self.sma_50.Current.Value
        
        # Price to SMA
        price_to_sma = price / self.sma_20.Current.Value
        
        # MACD normalized
        macd_norm = self.macd.Current.Value / price
        
        # Bollinger %B
        bb_range = self.bb.UpperBand.Current.Value - self.bb.LowerBand.Current.Value
        bb_pct_b = (price - self.bb.LowerBand.Current.Value) / bb_range if bb_range > 0 else 0.5
        
        return [
            return_1d, return_5d, return_20d,
            volatility, rsi_norm, ma_ratio,
            price_to_sma, macd_norm, bb_pct_b
        ]
    
    def PrepareTrainingData(self, history, horizon):
        """Prepares the training data from the price history."""
        features_list = []
        targets_list = []
        
        df = history.close.unstack(level=0)
        if df.empty:
            return features_list, targets_list
        
        prices = df.iloc[:, 0].values
        
        for i in range(50, len(prices) - horizon):
            # Features, computed from closes up to and including day i
            ret_1d = (prices[i] - prices[i-1]) / prices[i-1]
            ret_5d = (prices[i] - prices[i-5]) / prices[i-5]
            ret_20d = (prices[i] - prices[i-20]) / prices[i-20]
            
            # 21 closes give 20 daily returns ending on day i
            window = prices[i-20:i+1]
            vol = np.std(np.diff(window) / window[:-1])
            
            # Simplified features: the SMAs include the close of day i
            sma_20 = np.mean(prices[i-19:i+1])
            sma_50 = np.mean(prices[i-49:i+1])
            
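            # The three placeholders are constant during training, so the trees can
            # never split on them: live RSI, MACD and %B values do not affect predictions.
            features = [
                ret_1d, ret_5d, ret_20d, vol,
                0,  # RSI placeholder
                sma_20 / sma_50,
                prices[i] / sma_20,
                0,  # MACD placeholder
                0.5  # BB placeholder
            ]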
            
            # Target: future return
            target = (prices[i + horizon] - prices[i]) / prices[i]
            
            features_list.append(features)
            targets_list.append(target)
        
        return features_list, targets_list
'''

print("Complete Strategy for QuantConnect:")
print("="*60)
print("Components:")
print("  1. Universe: Top 50 by liquidity")
print("  2. Model: Gradient Boosting Regressor")
print("  3. Target: 5-day return")
print("  4. Signal: |prediction| > 1%")
print("  5. Position Sizing: Proportional to magnitude")
print("  6. Retraining: Every 30 days")
print("="*60)
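
The alignment between features and the forward-looking target in `PrepareTrainingData` is easy to get wrong, so it helps to check it offline. The sketch below is a dependency-free approximation of the price-based features only; `build_samples`, the `warmup` parameter, and the synthetic price series are assumptions for illustration.

```python
import statistics


def build_samples(prices, horizon=5, warmup=50):
    """Build (features, target) pairs from a chronological list of closes."""
    features_list, targets_list = [], []
    for i in range(warmup, len(prices) - horizon):
        ret_1d = prices[i] / prices[i - 1] - 1.0
        ret_5d = prices[i] / prices[i - 5] - 1.0
        ret_20d = prices[i] / prices[i - 20] - 1.0
        # 21 closes ending on day i give 20 daily returns
        window = prices[i - 20:i + 1]
        daily = [b / a - 1.0 for a, b in zip(window, window[1:])]
        vol = statistics.pstdev(daily)  # population std, like np.std
        # Forward return: uses prices after day i, so it is never a feature
        target = prices[i + horizon] / prices[i] - 1.0
        features_list.append([ret_1d, ret_5d, ret_20d, vol])
        targets_list.append(target)
    return features_list, targets_list


# Synthetic closes growing 1% per day (illustrative data only)
prices = [100.0 * 1.01 ** t for t in range(80)]
X, y = build_samples(prices)
print(len(X), len(y))  # 25 25
```

On this synthetic series every 1-day return is about 1%, every 5-day target about 5.1%, and the volatility is essentially zero, which makes misaligned indices easy to spot.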

In [None]:
# Summary of recommended parameters

print("="*80)
print("RECOMMENDED PARAMETERS FOR PRODUCTION")
print("="*80)

recommendations = pd.DataFrame([
    {'Parameter': 'Model', 'Recommended Value': 'Gradient Boosting or XGBoost', 'Rationale': 'Good performance/stability trade-off'},
    {'Parameter': 'n_estimators', 'Recommended Value': '100-200', 'Rationale': 'Enough to capture patterns'},
    {'Parameter': 'max_depth', 'Recommended Value': '3-5', 'Rationale': 'Avoid overfitting'},
    {'Parameter': 'learning_rate', 'Recommended Value': '0.05-0.1', 'Rationale': 'Stable convergence'},
    {'Parameter': 'min_samples_leaf', 'Recommended Value': '20-50', 'Rationale': 'Regularization'},
    {'Parameter': 'Horizon', 'Recommended Value': '5 days', 'Rationale': 'Signal/noise trade-off'},
    {'Parameter': 'Signal threshold', 'Recommended Value': '0.5-1%', 'Rationale': 'Filter out small moves'},
    {'Parameter': 'Retraining', 'Recommended Value': '30 days', 'Rationale': 'Adapt to the market'},
    {'Parameter': 'Lookback', 'Recommended Value': '252 days', 'Rationale': '1 year of data'},
])

print(recommendations.to_string(index=False))
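
The retraining and lookback settings imply a rolling schedule: train on the most recent 252 bars, predict for 30 bars, then slide forward. A minimal sketch of that schedule follows; `retraining_schedule` is a hypothetical helper that counts bars, whereas the strategy above counts calendar days between retrainings.

```python
def retraining_schedule(n_days, lookback=252, retrain_every=30):
    """Return (train_start, train_end, predict_end) index triples.

    Train on bars [train_start, train_end), predict on [train_end, predict_end).
    """
    windows = []
    train_end = lookback
    while train_end < n_days:
        predict_end = min(train_end + retrain_every, n_days)
        windows.append((train_end - lookback, train_end, predict_end))
        train_end = predict_end
    return windows


for window in retraining_schedule(n_days=330):
    print(window)
# (0, 252, 282)
# (30, 282, 312)
# (60, 312, 330)
```

Each model is only ever evaluated on bars that come after its training window, which is the property a walk-forward backtest relies on.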

---

## Conclusion and Next Steps

### Recap

In this notebook, we covered:

1. **Regression vs Classification**:
   - Classification predicts the direction
   - Regression predicts the magnitude (an estimate of the return)
   - Regression enables magnitude-aware position sizing

2. **Regularized Linear Regression**:
   - Ridge (L2): uniform shrinkage
   - Lasso (L1): automatic feature selection
   - ElasticNet: combines L1 + L2

3. **Ensemble Methods**:
   - Random Forest: robust, little overfitting
   - Gradient Boosting: more accurate, but at risk of overfitting
   - XGBoost: optimized, with regularization

4. **Support Vector Regression**:
   - Scaling is mandatory
   - RBF kernel for non-linearities
   - Parameters C, epsilon, gamma

5. **Metrics**:
   - RMSE, MAE, R2 (standard)
   - Direction Accuracy (crucial for trading)
   - Correlation (ranking quality)

6. **Trading Integration**:
   - Prediction -> signal conversion
   - Proportional position sizing
   - Filtering by minimum threshold

7. **Complete Strategy**:
   - End-to-end pipeline
   - Periodic retraining
   - Integrated risk management
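
As a reminder of how the metrics in point 5 relate to each other, here is a dependency-free sketch. It uses one common definition of direction accuracy (the share of observations where prediction and realized return have the same sign), which may differ in detail from the definition used earlier in the notebook; the function name and sample values are illustrative.

```python
import math


def regression_trading_metrics(y_true, y_pred):
    """RMSE, MAE and direction accuracy for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    # A "hit" means prediction and realized return are both positive or both non-positive
    hits = sum(1 for t, p in zip(y_true, y_pred) if (t > 0) == (p > 0))
    return {"rmse": rmse, "mae": mae, "direction_accuracy": hits / n}


# Illustrative values only
y_true = [0.02, -0.01, 0.005, -0.03]
y_pred = [0.01, 0.004, 0.002, -0.01]
metrics = regression_trading_metrics(y_true, y_pred)
print(metrics["direction_accuracy"])  # 0.75
```

A model can have a poor RMSE and still a useful direction accuracy (and the reverse), which is why both kinds of metric are reported.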

### Key Takeaways

| Concept | Key Point |
|---------|-----------|
| **Regression** | Predicts the magnitude, not just the direction |
| **Regularization** | Essential to avoid overfitting |
| **Scaling** | Mandatory for SVR, recommended for the other models |
| **Direction Accuracy** | The metric most directly tied to PnL |
| **Signal threshold** | Filters out small, unprofitable predictions |
| **Retraining** | Adapts the model to changing market conditions |

### Model Comparison

| Model | Advantages | Drawbacks | Usage |
|-------|------------|-----------|-------|
| **Ridge** | Simple, stable | Linear | Baseline |
| **Lasso** | Feature selection | Linear | Many features |
| **Random Forest** | Robust, feature importance | Less accurate | General |
| **Gradient Boosting** | Very accurate | Overfitting | Production |
| **XGBoost** | Fast, regularized | Complex | Production |
| **SVR** | Non-linear | Slow, needs scaling | Small datasets |

### Limitations

- Markets are hard to predict (market efficiency)
- Models can overfit the historical data
- Market regimes change
- Transaction costs erode the profit from small predicted moves
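
The last limitation can be made concrete by charging a proportional cost on every change in position. The sketch below is an assumption-laden illustration: `net_strategy_returns`, the cost of 10 basis points per unit of turnover, and the sample signals and returns are all hypothetical.

```python
def net_strategy_returns(signals, returns, cost_per_turnover=0.001):
    """Gross strategy returns minus a proportional cost on each position change."""
    net, previous = [], 0
    for signal, ret in zip(signals, returns):
        # Turnover is 0 (hold), 1 (open or close) or 2 (flip long <-> short)
        turnover = abs(signal - previous)
        net.append(signal * ret - cost_per_turnover * turnover)
        previous = signal
    return net


# Illustrative data only
signals = [1, 1, -1, 0]
returns = [0.004, -0.002, -0.006, 0.01]
gross = sum(s * r for s, r in zip(signals, returns))
net = sum(net_strategy_returns(signals, returns))
print(round(gross, 6), round(net, 6))  # 0.008 0.004
```

In this example the costs consume half of the gross return, which is the kind of erosion a signal threshold is meant to limit.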

### Next Steps

| Notebook | Content |
|----------|---------|
| **QC-Py-21** | Deep Learning (LSTM, Transformers) |
| **QC-Py-22** | Reinforcement Learning |
| **QC-Py-23** | Model Ensembling and Stacking |
| **QC-Py-24** | Hyperparameter Optimization |

### Additional Resources

- [scikit-learn Regression](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning)
- [XGBoost Documentation](https://xgboost.readthedocs.io/)
- [Advances in Financial ML](https://www.amazon.com/Advances-Financial-Machine-Learning-Marcos/dp/1119482089) - Lopez de Prado
- [QuantConnect ML Documentation](https://www.quantconnect.com/docs/v2/writing-algorithms/machine-learning)

---

**Notebook complete. You have now mastered ML regression for price prediction.**