# QC-Py-18 - Feature Engineering for Machine Learning Trading

> **Building predictive features for ML models**
> Duration: 90 minutes | Level: Intermediate-Advanced | Python + QuantConnect

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Understand the importance of **Feature Engineering** in quantitative trading
2. Create **technical features** based on prices and volumes
3. Build **indicator-based features** (RSI, MACD, Bollinger...)
4. Extract **fundamental features** from QuantConnect
5. Master **labeling** for classification and regression
6. Apply **feature selection** and importance techniques
7. Implement a robust **preprocessing pipeline**
8. Build a **complete pipeline** for ML data preparation

## Prerequisites

- Notebooks QC-Py-01 to 15 completed
- Understanding of technical indicators (QC-Py-11)
- Basic Machine Learning concepts (classification, regression)
- Familiarity with pandas, numpy, sklearn

## Notebook Structure

1. Introduction to Feature Engineering (15 min)
2. Technical Features: Price-Based (10 min)
3. Technical Features: Indicator-Based (15 min)
4. Fundamental Features (20 min)
5. Labeling for Classification and Regression (20 min)
6. Feature Selection and Importance (20 min)
7. Preprocessing and Normalization (15 min)
8. Complete Feature Engineering Pipeline (20 min)

---

## Part 1: Introduction to Feature Engineering (15 min)

### Why is Feature Engineering crucial?

In Machine Learning, prediction quality depends directly on feature quality. This is the **"Garbage In, Garbage Out"** principle:

```
Raw Data           Well-Built             ML Model         Profitable
(OHLCV)      -->   Features         -->   (RF, XGB)   -->  Predictions
                        |
                        v
               Signal + Information
```

### Features vs Raw Data

| Raw Data | Derived Features |
|----------|------------------|
| Closing price | Returns over N periods |
| Volume | Ratio of volume to its moving average |
| High/Low | Range (High-Low)/Close |
| Price history | Momentum, Volatility, Trend |
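
As a minimal sketch of the mapping in the table, the snippet below derives three features from a tiny hand-made OHLCV frame. The values and the 3-day window are illustrative assumptions, not the settings used later in this notebook.

```python
import pandas as pd

# Toy OHLCV data (hypothetical values, for illustration only)
raw = pd.DataFrame({
    "high":   [101.0, 103.0, 104.0, 102.0, 106.0],
    "low":    [99.0, 100.0, 101.0, 99.0, 102.0],
    "close":  [100.0, 102.0, 101.0, 100.0, 105.0],
    "volume": [1000, 1500, 1200, 900, 2400],
})

derived = pd.DataFrame(index=raw.index)
# Closing price -> 1-period return
derived["return_1d"] = raw["close"].pct_change(1)
# Volume -> ratio to its own rolling mean (window of 3 is an assumption)
derived["volume_ratio"] = raw["volume"] / raw["volume"].rolling(3).mean()
# High/Low -> range normalised by the close
derived["range"] = (raw["high"] - raw["low"]) / raw["close"]

print(derived.round(4))
```

The first rows of `volume_ratio` are NaN because the rolling window is not yet full; every rolling feature has such a warm-up period.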

### Importance for Interpretability

Well-built features make it possible to:

1. **Understand** what the model learns
2. **Debug** erroneous predictions
3. **Communicate** the logic to stakeholders
4. **Verify** that the model does not rely on lookahead bias
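
One way to check the last point is a truncation test: a causal feature computed up to a given date must not change when later data is removed. The sketch below assumes a hypothetical helper `has_lookahead` and synthetic prices; it is an illustration, not part of the shared library.

```python
import numpy as np
import pandas as pd

def has_lookahead(feature_fn, prices, cutoff):
    """Return True if feature values before `cutoff` change when later data is removed."""
    full = feature_fn(prices).iloc[:cutoff]
    truncated = feature_fn(prices.iloc[:cutoff])
    # NaN is treated as equal to NaN so warm-up periods do not trigger a false alarm
    return not np.allclose(full.to_numpy(), truncated.to_numpy(), equal_nan=True)

rng = np.random.default_rng(0)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 200))))

# Causal feature: the trailing 5-day return only uses past prices
causal = lambda p: p.pct_change(5)
# Leaky feature: shift(-5) pulls in prices from 5 days ahead
leaky = lambda p: p.shift(-5) / p - 1

print(has_lookahead(causal, prices, cutoff=150))  # False
print(has_lookahead(leaky, prices, cutoff=150))   # True
```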

### Feature Selection vs Feature Extraction

| Approach | Description | Techniques |
|----------|-------------|------------|
| **Selection** | Choose the best existing features | Correlation, Importance, RFE |
| **Extraction** | Create new combined features | PCA, Auto-encoders |
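
The difference between the two approaches can be sketched with NumPy alone. The example uses synthetic data; the 0.95 correlation cutoff and the choice of two components are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
# Three synthetic features: x2 is nearly a copy of x1, x3 is independent
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# --- Selection: keep a feature only if it is not highly correlated with one already kept ---
corr = np.corrcoef(X, rowvar=False)
keep = []
for j in range(X.shape[1]):
    if all(abs(corr[j, k]) < 0.95 for k in keep):
        keep.append(j)
X_selected = X[:, keep]  # a subset of the original columns

# --- Extraction: PCA via SVD builds new features as linear combinations ---
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)
X_extracted = Xc @ Vt[:2].T  # first two principal components

print("kept columns:", keep)
print("explained variance ratio:", explained.round(3))
```

Selection keeps columns that remain directly interpretable, whereas the extracted components mix all inputs and are harder to explain.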

### Feature Categories in Trading

```
Features
   |
   +-- Technical
   |      +-- Price-based (returns, volatility, range)
   |      +-- Indicator-based (RSI, MACD, BB)
   |      +-- Volume-based (OBV, volume ratio)
   |
   +-- Fundamental
   |      +-- Valuation (P/E, P/B, P/S)
   |      +-- Profitability (ROE, ROA, Margin)
   |      +-- Growth (Revenue, EPS)
   |
   +-- Alternative
          +-- Sentiment (news, social)
          +-- Macro (rates, indices)
```

In [None]:
# Required imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Matplotlib configuration
plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline

# Import the shared library
import sys
sys.path.insert(0, '../shared')

try:
    from features import calculate_returns, add_technical_features, create_labels, walk_forward_split
    print("Import from shared/features.py succeeded")
except ImportError as e:
    print(f"Note: {e}")
    print("The functions will be defined in this notebook.")

print("\nImports succeeded!")
print("This notebook covers Feature Engineering for ML Trading.")

In [None]:
# Create demonstration data (simulated)
# In production, this data would come from QuantConnect

def generate_sample_data(n_days=500, seed=42):
    """
    Generate simulated OHLCV data for demonstration.

    Parameters:
    -----------
    n_days : int
        Number of days of data
    seed : int
        Random seed for reproducibility

    Returns:
    --------
    pd.DataFrame
        DataFrame with OHLCV columns
    """
    np.random.seed(seed)

    # Dates
    dates = pd.date_range(start='2022-01-01', periods=n_days, freq='B')  # Business days

    # Starting price
    start_price = 100.0

    # Generate returns from a base drift and a constant volatility
    daily_drift = 0.0003  # ~7.5% annualized
    daily_volatility = 0.015  # ~24% annualized

    returns = np.random.normal(daily_drift, daily_volatility, n_days)

    # Add trend regimes: a slow sinusoidal drift (the volatility itself stays constant)
    regime = np.sin(np.linspace(0, 4*np.pi, n_days)) * 0.005
    returns = returns + regime

    # Compute closing prices
    close = start_price * np.exp(np.cumsum(returns))

    # Generate open, high and low from close
    intraday_vol = 0.01
    high = close * (1 + np.abs(np.random.normal(0, intraday_vol, n_days)))
    low = close * (1 - np.abs(np.random.normal(0, intraday_vol, n_days)))
    open_price = close * (1 + np.random.normal(0, intraday_vol/2, n_days))

    # Keep each bar consistent: the open must lie within [low, high]
    high = np.maximum(high, open_price)
    low = np.minimum(low, open_price)

    # Volume (rises with the absolute size of the return)
    base_volume = 1_000_000
    volume = base_volume * (1 + np.random.exponential(0.5, n_days))
    volume = volume * (1 + np.abs(returns) * 10)  # Larger moves, larger volume

    # Build the DataFrame
    df = pd.DataFrame({
        'open': open_price,
        'high': high,
        'low': low,
        'close': close,
        'volume': volume.astype(int)
    }, index=dates)

    return df

# Generate data
df = generate_sample_data(n_days=500)

print("Demonstration data generated:")
print(f"  Period: {df.index[0].date()} to {df.index[-1].date()}")
print(f"  Number of days: {len(df)}")
print(f"\nPreview:")
print(df.head(10))
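
Simulated bars are easy to get subtly wrong, for instance with an open drawn outside the high-low range. A quick sanity check on any OHLCV frame can be sketched as follows; `ohlc_violations` is a hypothetical helper name and the two-row frame is made up for the example.

```python
import pandas as pd

def ohlc_violations(ohlcv: pd.DataFrame) -> pd.DataFrame:
    """Return the rows where high/low do not bracket both open and close."""
    body_top = ohlcv[["open", "close"]].max(axis=1)
    body_bottom = ohlcv[["open", "close"]].min(axis=1)
    bad = (ohlcv["high"] < body_top) | (ohlcv["low"] > body_bottom)
    return ohlcv[bad]

# Hypothetical example: the second bar has its open above its high
toy = pd.DataFrame({
    "open":  [100.0, 103.5],
    "high":  [101.0, 103.0],
    "low":   [99.0, 101.0],
    "close": [100.5, 102.0],
})
print(ohlc_violations(toy))
```

An empty result means every bar is internally consistent, which matters for range-based features such as the True Range.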

In [None]:
# Visualization of the raw data
fig, axes = plt.subplots(2, 1, figsize=(14, 8), sharex=True)

# Price
axes[0].plot(df.index, df['close'], label='Close', color='blue', linewidth=1)
axes[0].fill_between(df.index, df['low'], df['high'], alpha=0.2, color='blue', label='High-Low Range')
axes[0].set_ylabel('Price ($)')
axes[0].set_title('Raw OHLCV Data')
axes[0].legend()

# Volume
axes[1].bar(df.index, df['volume'], color='gray', alpha=0.6)
axes[1].set_ylabel('Volume')
axes[1].set_xlabel('Date')

plt.tight_layout()
plt.show()

print("\nDescriptive statistics:")
print(df.describe())

---

## Part 2: Technical Features - Price-Based (10 min)

### Features derived from price

Price-based features capture:

| Feature | Description | Formula |
|---------|-------------|--------|
| **Returns** | Relative price change | `(P_t - P_{t-n}) / P_{t-n}` |
| **Volatility** | Standard deviation of returns | `std(returns, window)` |
| **Range** | Intraday amplitude | `(High - Low) / Close` |
| **Distance from extremes** | Position relative to recent highs/lows | `(High_max - Close) / Close` |

### Why returns rather than prices?

1. **Stationarity**: Prices are non-stationary, whereas returns are (approximately) stationary
2. **Comparability**: Returns allow assets with different price levels to be compared
3. **Interpretability**: A 2% return is directly understandable
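
The comparability argument, and the additivity property that makes log-returns convenient, can be illustrated on two made-up price series (the numbers are hypothetical):

```python
import numpy as np
import pandas as pd

# Two hypothetical assets at very different price levels, with the same relative moves
cheap = pd.Series([10.0, 10.2, 10.1, 10.4])
pricey = pd.Series([1000.0, 1020.0, 1010.0, 1040.0])

# Simple returns are identical although the price levels differ by a factor of 100
r_cheap = cheap.pct_change().dropna()
r_pricey = pricey.pct_change().dropna()
print(np.allclose(r_cheap, r_pricey))  # True

# Log-returns add up across time: the sum of daily log-returns
# equals the log-return over the whole period
log_r = np.log(cheap / cheap.shift(1)).dropna()
print(np.isclose(log_r.sum(), np.log(cheap.iloc[-1] / cheap.iloc[0])))  # True
```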

In [None]:
def calculate_price_features(df, return_periods=(1, 5, 10, 20), volatility_window=20):
    """
    Compute price-based features.

    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame with OHLCV columns
    return_periods : sequence of int
        Periods over which returns are computed (e.g. (1, 5, 20) days).
        Must include 1, because the volatility is built on 'return_1d'.
    volatility_window : int
        Window for the volatility

    Returns:
    --------
    pd.DataFrame
        DataFrame with features added
    """
    result = df.copy()

    # === RETURNS ===
    # Returns over several periods
    for period in return_periods:
        result[f'return_{period}d'] = result['close'].pct_change(period)

    # Log-returns (additive over time, often better suited to models)
    result['log_return_1d'] = np.log(result['close'] / result['close'].shift(1))

    # === VOLATILITY ===
    # Historical volatility (standard deviation of daily returns)
    result[f'volatility_{volatility_window}d'] = result['return_1d'].rolling(volatility_window).std()

    # Annualized volatility (252 trading days per year)
    result['volatility_annualized'] = result[f'volatility_{volatility_window}d'] * np.sqrt(252)

    # === RANGE (Amplitude) ===
    # Normalized intraday range
    result['range'] = (result['high'] - result['low']) / result['close']

    # True Range (includes overnight gaps), normalized by the close
    result['true_range'] = np.maximum(
        result['high'] - result['low'],
        np.maximum(
            np.abs(result['high'] - result['close'].shift(1)),
            np.abs(result['low'] - result['close'].shift(1))
        )
    ) / result['close']

    # Normalized Average True Range
    result['atr_norm'] = result['true_range'].rolling(14).mean()

    # === DISTANCE FROM EXTREMES ===
    # Distance from the highest high of the last 20 days
    result['dist_from_high_20d'] = (result['high'].rolling(20).max() - result['close']) / result['close']

    # Distance from the lowest low of the last 20 days
    result['dist_from_low_20d'] = (result['close'] - result['low'].rolling(20).min()) / result['close']

    # Position within the 20-day range (0 = at the low, 1 = at the high)
    high_20d = result['high'].rolling(20).max()
    low_20d = result['low'].rolling(20).min()
    result['position_in_range_20d'] = (result['close'] - low_20d) / (high_20d - low_20d)

    # === PRICE MOMENTUM ===
    # Ratio of current price to average price
    result['price_to_sma_20'] = result['close'] / result['close'].rolling(20).mean()
    result['price_to_sma_50'] = result['close'] / result['close'].rolling(50).mean()

    return result

# Apply
df_features = calculate_price_features(df)

print("Price-based features added:")
price_features = [col for col in df_features.columns if col not in ['open', 'high', 'low', 'close', 'volume']]
for f in price_features:
    print(f"  - {f}")

print(f"\nNumber of features: {len(price_features)}")

In [None]:
# Visualization of the price-based features
fig, axes = plt.subplots(3, 1, figsize=(14, 10), sharex=True)

# Multi-period returns
ax1 = axes[0]
for period in [1, 5, 20]:
    ax1.plot(df_features.index, df_features[f'return_{period}d'] * 100, 
             label=f'Return {period}D', alpha=0.7)
ax1.axhline(y=0, color='black', linestyle='--', linewidth=0.5)
ax1.set_ylabel('Return (%)')
ax1.set_title('Returns over different periods')
ax1.legend()

# Volatility
ax2 = axes[1]
ax2.plot(df_features.index, df_features['volatility_annualized'] * 100, 
         color='red', label='Annualized Volatility')
ax2.fill_between(df_features.index, 0, df_features['volatility_annualized'] * 100, 
                 alpha=0.3, color='red')
ax2.set_ylabel('Volatility (%)')
ax2.set_title('Historical Volatility (rolling 20D, annualized)')
ax2.legend()

# Position within the range
ax3 = axes[2]
ax3.plot(df_features.index, df_features['position_in_range_20d'], color='purple')
ax3.axhline(y=0.5, color='black', linestyle='--', linewidth=0.5)
ax3.axhline(y=0.2, color='green', linestyle=':', linewidth=1, label='Oversold zone')
ax3.axhline(y=0.8, color='red', linestyle=':', linewidth=1, label='Overbought zone')
ax3.fill_between(df_features.index, 0, 0.2, alpha=0.1, color='green')
ax3.fill_between(df_features.index, 0.8, 1, alpha=0.1, color='red')
ax3.set_ylabel('Position (0-1)')
ax3.set_xlabel('Date')
ax3.set_title('Position within the 20D Range (0=Low, 1=High)')
ax3.legend()

plt.tight_layout()
plt.show()

---

## Part 3: Technical Features - Indicator-Based (15 min)

### Technical indicators as features

Popular technical indicators capture different aspects of the market:

| Indicator | Type | What it captures |
|-----------|------|------------------|
| **RSI** | Momentum | Overbought/oversold conditions |
| **MACD** | Trend/Momentum | Direction and strength of the trend |
| **Bollinger Bands** | Volatility | Relative position and volatility |
| **Moving Averages** | Trend | Smoothed trend |
| **ADX** | Trend Strength | Strength of the trend (not its direction) |

### Normalizing indicators

For ML, indicators should be put on a scale that does not depend on the price level:

- **RSI**: Already bounded in [0, 100]; it can be rescaled to [-1, 1] with `(RSI - 50) / 50`
- **MACD**: Divide by the price or use a ratio
- **BB**: Use %B (normalized position within the bands)
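
The reason for normalizing can be checked numerically: a raw MACD scales with the price level, while MACD divided by the price and %B do not. The sketch below uses synthetic prices; `macd_line` and `percent_b` are illustrative helpers with the usual 12/26 and 20/2 parameters assumed.

```python
import numpy as np
import pandas as pd

def macd_line(close, fast=12, slow=26):
    """MACD line: fast EMA minus slow EMA."""
    return close.ewm(span=fast, adjust=False).mean() - close.ewm(span=slow, adjust=False).mean()

def percent_b(close, window=20, n_std=2):
    """Bollinger %B: 0 at the lower band, 1 at the upper band."""
    mid = close.rolling(window).mean()
    sd = close.rolling(window).std()
    lower, upper = mid - n_std * sd, mid + n_std * sd
    return (close - lower) / (upper - lower)

rng = np.random.default_rng(1)
base = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 300))))
scaled = base * 10  # the same path quoted at 10x the price (hypothetical)

# The raw MACD is 10x larger for the 10x-priced series ...
print(np.allclose(macd_line(scaled), 10 * macd_line(base)))  # True
# ... while MACD / close is the same for both
print(np.allclose(macd_line(scaled) / scaled, macd_line(base) / base))  # True
# %B is also independent of the price level
print(np.allclose(percent_b(scaled).dropna(), percent_b(base).dropna()))  # True
```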

In [None]:
def calculate_indicator_features(df):
    """
    Compute features based on technical indicators.

    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame with OHLCV columns

    Returns:
    --------
    pd.DataFrame
        DataFrame with indicator features added
    """
    result = df.copy()
    close = result['close']
    high = result['high']
    low = result['low']

    # === MOVING AVERAGES ===
    result['sma_20'] = close.rolling(20).mean()
    result['sma_50'] = close.rolling(50).mean()
    result['ema_12'] = close.ewm(span=12, adjust=False).mean()
    result['ema_26'] = close.ewm(span=26, adjust=False).mean()

    # MA ratios (normalized features)
    result['ma_ratio_20_50'] = result['sma_20'] / result['sma_50']
    result['price_to_ema_12'] = close / result['ema_12']

    # Cross signal (1 if fast > slow, 0 otherwise)
    result['ma_cross_20_50'] = (result['sma_20'] > result['sma_50']).astype(int)

    # === RSI (Relative Strength Index) ===
    # Simple rolling means are used here; Wilder's original RSI uses exponential smoothing
    delta = close.diff()
    gain = delta.clip(lower=0)
    loss = (-delta).clip(lower=0)

    avg_gain = gain.rolling(14).mean()
    avg_loss = loss.rolling(14).mean()

    rs = avg_gain / avg_loss
    result['rsi_14'] = 100 - (100 / (1 + rs))

    # RSI rescaled to [-1, 1] for ML
    result['rsi_normalized'] = (result['rsi_14'] - 50) / 50

    # === MACD (Moving Average Convergence Divergence) ===
    result['macd'] = result['ema_12'] - result['ema_26']
    result['macd_signal'] = result['macd'].ewm(span=9, adjust=False).mean()
    result['macd_hist'] = result['macd'] - result['macd_signal']

    # MACD normalized by price (for comparability across assets)
    result['macd_norm'] = result['macd'] / close
    result['macd_hist_norm'] = result['macd_hist'] / close

    # === BOLLINGER BANDS ===
    bb_period = 20
    bb_std = 2

    bb_middle = close.rolling(bb_period).mean()
    bb_std_val = close.rolling(bb_period).std()
    bb_upper = bb_middle + (bb_std * bb_std_val)
    bb_lower = bb_middle - (bb_std * bb_std_val)

    # %B: position within the bands (0 = lower, 1 = upper)
    result['bb_percent_b'] = (close - bb_lower) / (bb_upper - bb_lower)

    # Bandwidth: width of the bands (a volatility measure)
    result['bb_bandwidth'] = (bb_upper - bb_lower) / bb_middle

    # Distance to the middle band, in units of the band half-width
    result['bb_dist_to_middle'] = (close - bb_middle) / (bb_std_val * bb_std)

    # === STOCHASTIC OSCILLATOR ===
    stoch_period = 14
    stoch_smooth = 3

    lowest_low = low.rolling(stoch_period).min()
    highest_high = high.rolling(stoch_period).max()

    # %K: relative position within the range
    result['stoch_k'] = 100 * (close - lowest_low) / (highest_high - lowest_low)
    # %D: signal line (moving average of %K)
    result['stoch_d'] = result['stoch_k'].rolling(stoch_smooth).mean()

    # Stochastic rescaled to [-1, 1]
    result['stoch_normalized'] = (result['stoch_k'] - 50) / 50

    # === ADX (Average Directional Index) ===
    # Measures the STRENGTH of the trend (not its direction)
    # Simple rolling means replace Wilder's smoothing here
    adx_period = 14

    # Directional Movement: compare the raw moves so that both
    # conditions see unmodified values (equal moves give 0 on both sides)
    up_move = high.diff()
    down_move = -low.diff()

    plus_dm = up_move.where((up_move > down_move) & (up_move > 0), 0.0)
    minus_dm = down_move.where((down_move > up_move) & (down_move > 0), 0.0)

    # True Range for ADX
    tr = np.maximum(
        high - low,
        np.maximum(
            np.abs(high - close.shift(1)),
            np.abs(low - close.shift(1))
        )
    )

    # Smoothed values
    atr = tr.rolling(adx_period).mean()
    plus_di = 100 * (plus_dm.rolling(adx_period).mean() / atr)
    minus_di = 100 * (minus_dm.rolling(adx_period).mean() / atr)

    # ADX
    dx = 100 * np.abs(plus_di - minus_di) / (plus_di + minus_di)
    result['adx'] = dx.rolling(adx_period).mean()

    # DI difference (direction)
    result['di_diff'] = plus_di - minus_di

    # === CCI (Commodity Channel Index) ===
    cci_period = 20
    typical_price = (high + low + close) / 3
    sma_tp = typical_price.rolling(cci_period).mean()
    # Mean absolute deviation of the window around the window's own mean
    mean_deviation = typical_price.rolling(cci_period).apply(
        lambda w: np.abs(w - w.mean()).mean(), raw=True
    )
    result['cci'] = (typical_price - sma_tp) / (0.015 * mean_deviation)

    # Normalized CCI (typically between -200 and +200), clipped to [-1, 1]
    result['cci_normalized'] = result['cci'] / 200
    result['cci_normalized'] = result['cci_normalized'].clip(-1, 1)

    return result

# Apply
df_features = calculate_indicator_features(df_features)

print("Indicator-based features added:")
indicator_features = ['sma_20', 'sma_50', 'ma_ratio_20_50', 'rsi_14', 'rsi_normalized',
                      'macd', 'macd_hist', 'macd_norm', 'bb_percent_b', 'bb_bandwidth',
                      'stoch_k', 'stoch_normalized', 'adx', 'di_diff', 'cci_normalized']
for f in indicator_features:
    if f in df_features.columns:
        print(f"  - {f}")

print(f"\nTotal features: {len([c for c in df_features.columns if c not in ['open', 'high', 'low', 'close', 'volume']])}")
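
The RSI in the cell above averages gains and losses with plain rolling means. For comparison, here is a hedged sketch of the variant with Wilder's smoothing, expressed as an exponential moving average with `alpha = 1 / period`. The function name `rsi_wilder` is hypothetical, and the smoothing is seeded with the first observation rather than an initial simple average, so early values can differ slightly from charting platforms.

```python
import numpy as np
import pandas as pd

def rsi_wilder(close: pd.Series, period: int = 14) -> pd.Series:
    """RSI using Wilder's smoothing (an EMA with alpha = 1 / period)."""
    delta = close.diff()
    gain = delta.clip(lower=0)
    loss = (-delta).clip(lower=0)
    # min_periods delays the output until `period` changes have been observed
    avg_gain = gain.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    avg_loss = loss.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)

rng = np.random.default_rng(7)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 250))))
rsi = rsi_wilder(close)
print(rsi.dropna().between(0, 100).all())  # True
```

Either version can serve as a feature; what matters for the model is that the same definition is used in training and in live trading.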

In [None]:
# Visualization of the indicator-based features
fig, axes = plt.subplots(4, 1, figsize=(14, 12), sharex=True)

# Price with MAs
ax1 = axes[0]
ax1.plot(df_features.index, df_features['close'], label='Close', linewidth=1)
ax1.plot(df_features.index, df_features['sma_20'], label='SMA 20', linewidth=1)
ax1.plot(df_features.index, df_features['sma_50'], label='SMA 50', linewidth=1)
ax1.set_ylabel('Price ($)')
ax1.set_title('Price and Moving Averages')
ax1.legend()

# RSI
ax2 = axes[1]
ax2.plot(df_features.index, df_features['rsi_14'], color='purple')
ax2.axhline(y=70, color='red', linestyle='--', label='Overbought (70)')
ax2.axhline(y=30, color='green', linestyle='--', label='Oversold (30)')
ax2.axhline(y=50, color='gray', linestyle=':')
ax2.fill_between(df_features.index, 70, 100, alpha=0.1, color='red')
ax2.fill_between(df_features.index, 0, 30, alpha=0.1, color='green')
ax2.set_ylabel('RSI')
ax2.set_ylim(0, 100)
ax2.set_title('RSI (14)')
ax2.legend()

# MACD
ax3 = axes[2]
ax3.plot(df_features.index, df_features['macd'], label='MACD', color='blue')
ax3.plot(df_features.index, df_features['macd_signal'], label='Signal', color='orange')
colors = ['green' if x > 0 else 'red' for x in df_features['macd_hist']]
ax3.bar(df_features.index, df_features['macd_hist'], color=colors, alpha=0.5, label='Histogram')
ax3.axhline(y=0, color='black', linestyle='-', linewidth=0.5)
ax3.set_ylabel('MACD')
ax3.set_title('MACD (12, 26, 9)')
ax3.legend()

# Bollinger %B
ax4 = axes[3]
ax4.plot(df_features.index, df_features['bb_percent_b'], color='teal')
ax4.axhline(y=1, color='red', linestyle='--', label='Upper Band')
ax4.axhline(y=0, color='green', linestyle='--', label='Lower Band')
ax4.axhline(y=0.5, color='gray', linestyle=':')
ax4.fill_between(df_features.index, 1, df_features['bb_percent_b'].max(), 
                 where=df_features['bb_percent_b'] > 1, alpha=0.2, color='red')
ax4.fill_between(df_features.index, df_features['bb_percent_b'].min(), 0, 
                 where=df_features['bb_percent_b'] < 0, alpha=0.2, color='green')
ax4.set_ylabel('%B')
ax4.set_xlabel('Date')
ax4.set_title('Bollinger Bands %B (normalized position)')
ax4.legend()

plt.tight_layout()
plt.show()

---

## Part 4: Fundamental Features (20 min)

### Fundamental data in QuantConnect

QuantConnect provides fundamental data sourced from Morningstar, exposed for each asset through a `Fundamental` object. These features capture the quality and valuation of companies.

### Categories of fundamental features

| Category | Features | Description |
|----------|----------|-------------|
| **Valuation** | P/E, P/B, P/S, EV/EBITDA | Is the stock expensive or cheap? |
| **Profitability** | ROE, ROA, Margin | Is the company profitable? |
| **Growth** | Revenue Growth, EPS Growth | Is the company growing? |
| **Debt** | D/E, Interest Coverage | Is the balance sheet healthy? |
| **Size** | Market Cap, Enterprise Value | How large is the company? |

### Usage in QuantConnect

In [None]:
# Example QuantConnect code for fundamental features
# To be copied into the QuantConnect IDE

qc_fundamental_code = '''
def calculate_fundamental_features(self, fundamentals):
    """
    Extract fundamental features from QuantConnect.
    
    Parameters:
    -----------
    fundamentals : QuantConnect.Data.Fundamental.Fundamental
        Fundamental object for one asset
    
    Returns:
    --------
    dict
        Dictionary of features
    """
    features = {}
    
    # === VALUATION RATIOS ===
    if fundamentals.ValuationRatios is not None:
        vr = fundamentals.ValuationRatios
        
        # Price to Earnings
        features['pe_ratio'] = vr.PERatio if hasattr(vr, 'PERatio') else np.nan
        
        # Price to Book
        features['pb_ratio'] = vr.PBRatio if hasattr(vr, 'PBRatio') else np.nan
        
        # Price to Sales
        features['ps_ratio'] = vr.PSRatio if hasattr(vr, 'PSRatio') else np.nan
        
        # Price to Cash Flow
        features['pcf_ratio'] = vr.PCFRatio if hasattr(vr, 'PCFRatio') else np.nan
        
        # Dividend Yield
        features['dividend_yield'] = vr.DividendYield if hasattr(vr, 'DividendYield') else 0
        
        # Earnings Yield (inverse of P/E)
        features['earnings_yield'] = vr.EarningYield if hasattr(vr, 'EarningYield') else np.nan
    
    # === PROFITABILITY ===
    if fundamentals.OperationRatios is not None:
        op = fundamentals.OperationRatios
        
        # Return on Equity
        if hasattr(op, 'ROE') and op.ROE is not None:
            features['roe'] = op.ROE.Value if hasattr(op.ROE, 'Value') else op.ROE
        else:
            features['roe'] = np.nan
        
        # Return on Assets
        if hasattr(op, 'ROA') and op.ROA is not None:
            features['roa'] = op.ROA.Value if hasattr(op.ROA, 'Value') else op.ROA
        else:
            features['roa'] = np.nan
        
        # Return on Invested Capital
        if hasattr(op, 'ROIC') and op.ROIC is not None:
            features['roic'] = op.ROIC.Value if hasattr(op.ROIC, 'Value') else op.ROIC
        else:
            features['roic'] = np.nan
        
        # Profit Margins
        if hasattr(op, 'NetMargin') and op.NetMargin is not None:
            features['net_margin'] = op.NetMargin.Value if hasattr(op.NetMargin, 'Value') else op.NetMargin
        else:
            features['net_margin'] = np.nan
        
        if hasattr(op, 'GrossMargin') and op.GrossMargin is not None:
            features['gross_margin'] = op.GrossMargin.Value if hasattr(op.GrossMargin, 'Value') else op.GrossMargin
        else:
            features['gross_margin'] = np.nan
        
        # Asset Turnover
        if hasattr(op, 'AssetsTurnover') and op.AssetsTurnover is not None:
            features['asset_turnover'] = op.AssetsTurnover.Value if hasattr(op.AssetsTurnover, 'Value') else op.AssetsTurnover
        else:
            features['asset_turnover'] = np.nan
    
    # === GROWTH ===
    if fundamentals.OperationRatios is not None:
        op = fundamentals.OperationRatios
        
        # Revenue Growth
        if hasattr(op, 'RevenueGrowth') and op.RevenueGrowth is not None:
            if hasattr(op.RevenueGrowth, 'ThreeMonths'):
                features['revenue_growth_3m'] = op.RevenueGrowth.ThreeMonths
            if hasattr(op.RevenueGrowth, 'OneYear'):
                features['revenue_growth_1y'] = op.RevenueGrowth.OneYear
    
    if fundamentals.EarningRatios is not None:
        er = fundamentals.EarningRatios
        
        # EPS Growth
        if hasattr(er, 'DilutedEPSGrowth') and er.DilutedEPSGrowth is not None:
            features['eps_growth'] = er.DilutedEPSGrowth
    
    # === DEBT / LEVERAGE ===
    if fundamentals.OperationRatios is not None:
        op = fundamentals.OperationRatios
        
        # Debt to Equity
        if hasattr(op, 'DebttoEquityRatio') and op.DebttoEquityRatio is not None:
            features['debt_to_equity'] = op.DebttoEquityRatio
        else:
            features['debt_to_equity'] = np.nan
        
        # Current Ratio
        if hasattr(op, 'CurrentRatio') and op.CurrentRatio is not None:
            features['current_ratio'] = op.CurrentRatio.Value if hasattr(op.CurrentRatio, 'Value') else op.CurrentRatio
        else:
            features['current_ratio'] = np.nan
    
    # === SIZE ===
    # Market Cap
    if hasattr(fundamentals, 'MarketCap'):
        features['market_cap'] = fundamentals.MarketCap
        features['log_market_cap'] = np.log(fundamentals.MarketCap) if fundamentals.MarketCap > 0 else np.nan
    
    return features
'''

print("QuantConnect code for fundamental features:")
print(qc_fundamental_code)

In [None]:
# Simulate fundamental data for demonstration

def generate_fundamental_data(n_stocks=50, seed=42):
    """
    Generate simulated fundamental data.

    Returns:
    --------
    pd.DataFrame
        DataFrame with fundamental features
    """
    np.random.seed(seed)

    # Simulated tickers
    tickers = [f'STOCK_{i:03d}' for i in range(n_stocks)]

    # Generate features with plausible distributions
    data = {
        'ticker': tickers,

        # Valuation (log-normal; exp(mu) is the median, the mean is higher)
        'pe_ratio': np.exp(np.random.normal(2.7, 0.6, n_stocks)),  # Median ~15
        'pb_ratio': np.exp(np.random.normal(0.7, 0.5, n_stocks)),  # Median ~2
        'ps_ratio': np.exp(np.random.normal(0.5, 0.7, n_stocks)),  # Median ~1.6
        'dividend_yield': np.abs(np.random.normal(0.02, 0.015, n_stocks)),

        # Profitability (normal, clipped to bounds)
        'roe': np.clip(np.random.normal(0.15, 0.1, n_stocks), -0.3, 0.5),
        'roa': np.clip(np.random.normal(0.08, 0.06, n_stocks), -0.2, 0.3),
        'net_margin': np.clip(np.random.normal(0.10, 0.08, n_stocks), -0.2, 0.4),
        'gross_margin': np.clip(np.random.normal(0.35, 0.15, n_stocks), 0.1, 0.8),

        # Growth (normal)
        'revenue_growth': np.random.normal(0.08, 0.15, n_stocks),
        'eps_growth': np.random.normal(0.10, 0.25, n_stocks),

        # Debt (positive)
        'debt_to_equity': np.abs(np.random.normal(0.8, 0.5, n_stocks)),
        'current_ratio': np.abs(np.random.normal(1.5, 0.5, n_stocks)),

        # Size (log-normal)
        'market_cap': np.exp(np.random.normal(23, 2, n_stocks)),  # In dollars
    }

    df = pd.DataFrame(data)
    df['log_market_cap'] = np.log(df['market_cap'])

    # Add earnings_yield (inverse of P/E)
    df['earnings_yield'] = 1 / df['pe_ratio']

    return df

# Generate
df_fundamentals = generate_fundamental_data(n_stocks=50)

print("Simulated fundamental data:")
print(f"  Number of stocks: {len(df_fundamentals)}")
print(f"\nPreview:")
print(df_fundamentals.head(10))

In [None]:
# Visualization of the fundamental features
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# P/E Ratio distribution
axes[0, 0].hist(df_fundamentals['pe_ratio'], bins=20, color='steelblue', edgecolor='white')
axes[0, 0].axvline(df_fundamentals['pe_ratio'].median(), color='red', linestyle='--', label=f"Median: {df_fundamentals['pe_ratio'].median():.1f}")
axes[0, 0].set_xlabel('P/E Ratio')
axes[0, 0].set_title('P/E Ratio Distribution')
axes[0, 0].legend()

# ROE distribution
axes[0, 1].hist(df_fundamentals['roe'] * 100, bins=20, color='forestgreen', edgecolor='white')
axes[0, 1].axvline(df_fundamentals['roe'].median() * 100, color='red', linestyle='--', label=f"Median: {df_fundamentals['roe'].median()*100:.1f}%")
axes[0, 1].set_xlabel('ROE (%)')
axes[0, 1].set_title('ROE Distribution')
axes[0, 1].legend()

# Debt to Equity
axes[0, 2].hist(df_fundamentals['debt_to_equity'], bins=20, color='coral', edgecolor='white')
axes[0, 2].axvline(1.0, color='black', linestyle='--', label='D/E = 1.0')
axes[0, 2].set_xlabel('Debt/Equity')
axes[0, 2].set_title('Debt/Equity Distribution')
axes[0, 2].legend()

# Scatter P/E vs ROE (Value vs Quality)
scatter = axes[1, 0].scatter(df_fundamentals['pe_ratio'], df_fundamentals['roe'] * 100,
                              c=df_fundamentals['log_market_cap'], cmap='viridis', alpha=0.6)
axes[1, 0].set_xlabel('P/E Ratio')
axes[1, 0].set_ylabel('ROE (%)')
axes[1, 0].set_title('P/E vs ROE (color = Market Cap)')
plt.colorbar(scatter, ax=axes[1, 0], label='Log Market Cap')

# Growth vs Margin
scatter2 = axes[1, 1].scatter(df_fundamentals['revenue_growth'] * 100, df_fundamentals['net_margin'] * 100,
                               c=df_fundamentals['pe_ratio'], cmap='coolwarm', alpha=0.6)
axes[1, 1].set_xlabel('Revenue Growth (%)')
axes[1, 1].set_ylabel('Net Margin (%)')
axes[1, 1].set_title('Growth vs Margin (color = P/E)')
axes[1, 1].axvline(0, color='gray', linestyle='--', alpha=0.5)
axes[1, 1].axhline(0, color='gray', linestyle='--', alpha=0.5)
plt.colorbar(scatter2, ax=axes[1, 1], label='P/E Ratio')

# Correlation heatmap
fundamental_cols = ['pe_ratio', 'pb_ratio', 'roe', 'roa', 'net_margin', 
                    'revenue_growth', 'debt_to_equity', 'log_market_cap']
corr = df_fundamentals[fundamental_cols].corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='RdBu_r', center=0, ax=axes[1, 2])
axes[1, 2].set_title('Correlation between Features')

plt.tight_layout()
plt.show()

---

## Part 5: Labeling for Classification and Regression (20 min)

### What is labeling?

**Labeling** is the process of creating the **targets** (y) from future price data. It is a critical step because it defines what the model will learn to predict.

### Types of labels

| Type | Description | Use |
|------|-------------|-----|
| **Binary** | Up (1) / Down (0) | Simple classification |
| **Ternary** | Up (1) / Flat (0) / Down (-1) | Classification with a neutral class |
| **Multi-class** | Quantiles (Q1, Q2, Q3, Q4) | Fine-grained classification |
| **Regression** | Continuous return | Prediction of the return itself |
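
The four label types can be compared side by side on a handful of made-up future returns. The values and the 1% dead zone for the ternary label are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical future returns over the prediction horizon
future_return = pd.Series([0.030, 0.004, -0.002, -0.025, 0.012, -0.011, 0.000, 0.018])
threshold = 0.01  # assumed dead zone of +/-1% for the ternary label

# Binary: direction only
binary = (future_return > 0).astype(int)
# Ternary: moves inside the dead zone are labeled flat (0)
ternary = pd.Series(
    np.select([future_return > threshold, future_return < -threshold], [1, -1], default=0),
    index=future_return.index,
)
# Multi-class: quartile of the return (0 = lowest, 3 = highest)
quartile = pd.qcut(future_return, q=4, labels=[0, 1, 2, 3])
# Regression: the return itself is the target
regression = future_return

print(pd.DataFrame({"binary": binary, "ternary": ternary,
                    "quartile": quartile, "regression": regression}))
```

Quantile labels are balanced by construction, whereas binary and ternary class frequencies depend on the threshold and on the market regime.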

### Prediction horizon

```
t=0 (now)                           t=horizon (future)
    |                                    |
    v                                    v
[Computed features]   ---> [Model]  ---> [Label = future return]
```

**Beware of lookahead bias!** The label uses future data, so it must never be included among the features. The last `horizon` rows of the dataset have no label yet and must be dropped before training.
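
A small worked example of this alignment, using six made-up closing prices and an assumed horizon of 2 days: the label is built with `shift(-horizon)`, rows without a label are removed, and the label column is kept out of the feature matrix.

```python
import pandas as pd

horizon = 2
close = pd.Series([100.0, 101.0, 103.0, 102.0, 105.0, 104.0])

data = pd.DataFrame({
    # Feature known at time t: return over the previous day
    "return_1d": close.pct_change(1),
    # Label: return from t to t + horizon (uses future prices)
    "future_return": close.shift(-horizon) / close - 1,
})

# The last `horizon` rows have no label, and the first row has no feature
print(data["future_return"].isna().sum())  # 2

train = data.dropna()
X = train[["return_1d"]]      # features only: the label column is excluded
y = train["future_return"]
print(len(X), len(y))  # 3 3
```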

In [None]:
def create_classification_labels(df, horizon=5, threshold=0.0):
    """
    Create classification labels.
    
    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame with a 'close' column
    horizon : int
        Number of days used for the future return
    threshold : float
        Threshold separating up/down (0 = plain direction)
    
    Returns:
    --------
    pd.DataFrame
        DataFrame with the label columns added
    """
    result = df.copy()
    
    # Future return over the horizon (NaN for the last `horizon` rows)
    result['future_return'] = result['close'].shift(-horizon) / result['close'] - 1
    has_future = result['future_return'].notna()
    
    # === BINARY LABEL ===
    # 1 if return > threshold, else 0; NaN where the future return is unknown
    # (a plain comparison would silently label those rows 0)
    result['label_binary'] = (
        (result['future_return'] > threshold).astype(float).where(has_future)
    )
    
    # === TERNARY LABEL ===
    # 1 (up), 0 (flat), -1 (down); NaN where the future return is unknown
    result['label_ternary'] = 0.0
    result.loc[result['future_return'] > threshold, 'label_ternary'] = 1.0
    result.loc[result['future_return'] < -threshold, 'label_ternary'] = -1.0
    result.loc[~has_future, 'label_ternary'] = np.nan
    
    # === QUANTILE LABEL ===
    # Split into quartiles (0, 1, 2, 3). labels=False returns integer bin codes
    # and stays valid even if duplicate bin edges are dropped.
    result['label_quantile'] = pd.qcut(
        result['future_return'].dropna(), 
        q=4, 
        labels=False,
        duplicates='drop'
    ).reindex(result.index)
    
    return result


def create_regression_labels(df, horizon=5):
    """
    Create regression targets.
    
    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame with 'close' and 'return_1d' columns
    horizon : int
        Number of days for the prediction
    
    Returns:
    --------
    pd.DataFrame
        DataFrame with the regression targets added
    """
    result = df.copy()
    
    # Future return (main target)
    result['target_return'] = result['close'].shift(-horizon) / result['close'] - 1
    
    # Log-return (often better behaved for ML)
    result['target_log_return'] = np.log(result['close'].shift(-horizon) / result['close'])
    
    # Future volatility (for risk prediction): std of the daily returns
    # from t+1 to t+horizon
    result['target_volatility'] = result['return_1d'].shift(-horizon).rolling(horizon).std()
    
    # Sharpe-like ratio (return / volatility)
    result['target_sharpe'] = result['target_return'] / (result['target_volatility'] + 1e-8)
    
    return result

# Apply both labeling schemes
df_labeled = create_classification_labels(df_features, horizon=5, threshold=0.01)
df_labeled = create_regression_labels(df_labeled, horizon=5)

print("Labels created:")
print("\nClassification:")
print(f"  - label_binary: {df_labeled['label_binary'].value_counts().to_dict()}")
print(f"  - label_ternary: {df_labeled['label_ternary'].value_counts().to_dict()}")
print("\nRegression:")
print(f"  - target_return: mean={df_labeled['target_return'].mean():.4f}, std={df_labeled['target_return'].std():.4f}")

In [None]:
# Label visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Distribution of the future return
ax1 = axes[0, 0]
ax1.hist(df_labeled['future_return'].dropna() * 100, bins=50, color='steelblue', edgecolor='white')
ax1.axvline(0, color='red', linestyle='--', label='Zero')
ax1.axvline(1, color='green', linestyle=':', label='+1% threshold')
ax1.axvline(-1, color='orange', linestyle=':', label='-1% threshold')
ax1.set_xlabel('Future Return (%)')
ax1.set_ylabel('Frequency')
ax1.set_title('Distribution of the Future Return (5D)')
ax1.legend()

# Distribution of the binary labels
ax2 = axes[0, 1]
counts = df_labeled['label_binary'].value_counts().sort_index()
bars = ax2.bar(['Down (0)', 'Up (1)'], counts.values, color=['red', 'green'])
ax2.set_ylabel('Count')
ax2.set_title('Distribution of Binary Labels')
for bar, count in zip(bars, counts.values):
    ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 5, 
             f'{count}\n({count/counts.sum()*100:.1f}%)', ha='center')

# Distribution of the ternary labels
ax3 = axes[1, 0]
counts_ternary = df_labeled['label_ternary'].value_counts().sort_index()
bars = ax3.bar(['Down (-1)', 'Flat (0)', 'Up (1)'], counts_ternary.values, 
               color=['red', 'gray', 'green'])
ax3.set_ylabel('Count')
ax3.set_title('Distribution of Ternary Labels (threshold=1%)')
for bar, count in zip(bars, counts_ternary.values):
    ax3.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 5, 
             f'{count}\n({count/counts_ternary.sum()*100:.1f}%)', ha='center')

# Future return vs. an indicator
ax4 = axes[1, 1]
valid_data = df_labeled.dropna(subset=['future_return', 'rsi_14'])
scatter = ax4.scatter(valid_data['rsi_14'], valid_data['future_return'] * 100,
                      c=valid_data['label_binary'], cmap='RdYlGn', alpha=0.5)
ax4.axhline(0, color='black', linestyle='-', linewidth=0.5)
ax4.axvline(30, color='blue', linestyle='--', alpha=0.5, label='RSI Oversold')
ax4.axvline(70, color='blue', linestyle='--', alpha=0.5, label='RSI Overbought')
ax4.set_xlabel('RSI (14)')
ax4.set_ylabel('Future Return (%)')
ax4.set_title('RSI vs Future Return')
ax4.legend()

plt.tight_layout()
plt.show()

In [None]:
# Advanced labeling methods

def create_triple_barrier_labels(df, horizon=5, profit_take=0.02, stop_loss=0.02):
    """
    Triple Barrier method (Lopez de Prado).
    
    The label is set by whichever barrier is touched first:
    - Upper barrier (profit take): label = 1
    - Lower barrier (stop loss): label = -1
    - Time barrier (horizon): label = sign(return), or 0 if |return| < 0.5%
    
    If a single daily bar touches both price barriers, the upper one is
    checked first, so the label is 1 (daily bars cannot tell which came first).
    
    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame with OHLCV columns
    horizon : int
        Maximum number of days
    profit_take : float
        Profit threshold (e.g. 0.02 = 2%)
    stop_loss : float
        Loss threshold (e.g. 0.02 = 2%)
    
    Returns:
    --------
    pd.Series
        Labels (-1, 0, 1); NaN for the last `horizon` rows
    """
    labels = pd.Series(index=df.index, dtype=float)
    
    # The last `horizon` rows lack a full forward window and stay NaN
    for i in range(len(df) - horizon):
        entry_price = df['close'].iloc[i]
        upper_barrier = entry_price * (1 + profit_take)
        lower_barrier = entry_price * (1 - stop_loss)
        
        # Scan the following days (i + j is always a valid position here)
        for j in range(1, horizon + 1):
            high = df['high'].iloc[i + j]
            low = df['low'].iloc[i + j]
            close = df['close'].iloc[i + j]
            
            # Upper barrier touched?
            if high >= upper_barrier:
                labels.iloc[i] = 1
                break
            
            # Lower barrier touched?
            if low <= lower_barrier:
                labels.iloc[i] = -1
                break
            
            # Time barrier (last day of the window)
            if j == horizon:
                ret = (close - entry_price) / entry_price
                if abs(ret) < 0.005:  # < 0.5% = flat
                    labels.iloc[i] = 0
                else:
                    labels.iloc[i] = 1 if ret > 0 else -1
    
    return labels

# Apply Triple Barrier
df_labeled['label_triple_barrier'] = create_triple_barrier_labels(
    df_labeled, horizon=5, profit_take=0.02, stop_loss=0.02
)

print("Triple Barrier Labels:")
print(df_labeled['label_triple_barrier'].value_counts().sort_index())

print("\nAdvantages of Triple Barrier:")
print("  - More realistic (simulates take profit / stop loss)")
print("  - Captures the timing of the move")
print("  - Fewer neutral labels")

---

## Part 6: Feature Selection and Importance (20 min)

### Why select features?

| Reason | Explanation |
|--------|-------------|
| **Overfitting** | Too many features = the model memorizes noise |
| **Curse of dimensionality** | Performance degrades as the number of dimensions grows |
| **Interpretability** | Fewer features = easier to understand |
| **Speed** | Fewer features = faster training |

### Selection techniques

```
Feature Selection
       |
       +-- Filter Methods (fast, model-independent)
       |      +-- Correlation with target
       |      +-- Variance threshold
       |      +-- Mutual Information
       |
       +-- Wrapper Methods (slow, model-dependent)
       |      +-- Forward selection
       |      +-- Backward elimination
       |      +-- RFE (Recursive Feature Elimination)
       |
       +-- Embedded Methods (built into the model)
              +-- L1 Regularization (Lasso)
              +-- Tree-based importance
```
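To make the filter branch of the tree concrete, here is a minimal sketch using only pandas and numpy. The function name `filter_rank`, the `min_variance` parameter and the synthetic data are illustrative assumptions. It applies a variance threshold and then ranks the surviving columns by absolute Pearson correlation with the target, which only captures linear relationships.

```python
import numpy as np
import pandas as pd

def filter_rank(X: pd.DataFrame, y: pd.Series, min_variance: float = 1e-8) -> pd.Series:
    """Drop near-constant columns, then rank the rest by |correlation| with y."""
    variances = X.var()
    kept = X.loc[:, variances > min_variance]
    scores = kept.apply(lambda col: abs(col.corr(y)))
    return scores.sort_values(ascending=False)

# Synthetic data (illustrative only): one informative, one noise, one constant column
rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)
X = pd.DataFrame({
    'informative': signal + 0.5 * rng.normal(size=n),
    'noise': rng.normal(size=n),
    'constant': np.ones(n),
})
y = pd.Series((signal > 0).astype(int))

scores = filter_rank(X, y)
print(scores)
```

In a real workflow the scores would be computed on the training period only, so that the ranking does not see test data.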

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif, SelectKBest

def calculate_feature_importance(X, y, method='random_forest'):
    """
    Compute feature importance.
    
    Parameters:
    -----------
    X : pd.DataFrame
        Features
    y : pd.Series
        Labels
    method : str
        'random_forest', 'mutual_info', or 'correlation'
    
    Returns:
    --------
    pd.DataFrame
        DataFrame with the features and their importance
    """
    if method == 'random_forest':
        # Tree-based importance
        model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42, n_jobs=-1)
        model.fit(X, y)
        importance = model.feature_importances_
        
    elif method == 'mutual_info':
        # Mutual information (captures non-linear dependencies)
        importance = mutual_info_classif(X, y, random_state=42)
        
    elif method == 'correlation':
        # Absolute correlation with the target
        importance = X.apply(lambda col: abs(col.corr(y))).values
    
    else:
        raise ValueError(f"Unknown method: {method}")
    
    importance_df = pd.DataFrame({
        'feature': X.columns,
        'importance': importance
    }).sort_values('importance', ascending=False)
    
    # Normalize by the maximum (the top feature gets 1.0)
    importance_df['importance_normalized'] = (
        importance_df['importance'] / importance_df['importance'].max()
    )
    
    return importance_df.reset_index(drop=True)

# Prepare the data for feature selection
feature_cols = [col for col in df_labeled.columns 
                if col not in ['open', 'high', 'low', 'close', 'volume',
                               'future_return', 'label_binary', 'label_ternary',
                               'label_quantile', 'target_return', 'target_log_return',
                               'target_volatility', 'target_sharpe', 'label_triple_barrier']]

# Drop rows with NaN
df_clean = df_labeled.dropna(subset=feature_cols + ['label_binary'])

X = df_clean[feature_cols]
y = df_clean['label_binary']

print(f"Number of features: {len(feature_cols)}")
print(f"Number of samples: {len(df_clean)}")

# Compute importance with Random Forest
# Note: this exploratory ranking uses the full sample; the pipeline in Part 8
# fits the selection on the training split only.
importance_rf = calculate_feature_importance(X, y, method='random_forest')

print("\nTop 15 Features (Random Forest Importance):")
print(importance_rf.head(15).to_string(index=False))

In [None]:
def remove_correlated_features(df, threshold=0.95):
    """
    Remove highly correlated features.
    
    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame of features
    threshold : float
        Correlation threshold (e.g. 0.95 = 95%)
    
    Returns:
    --------
    tuple : (filtered DataFrame, list of removed features)
    """
    # Absolute correlation matrix
    corr_matrix = df.corr().abs()
    
    # Upper triangle (avoids counting each pair twice)
    upper = corr_matrix.where(
        np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
    )
    
    # Features to drop: correlated above the threshold with an earlier column
    to_drop = [col for col in upper.columns if any(upper[col] > threshold)]
    
    # Drop them
    df_filtered = df.drop(columns=to_drop)
    
    return df_filtered, to_drop

# Apply
X_filtered, dropped_features = remove_correlated_features(X, threshold=0.90)

print("Features removed (correlation > 90%):")
for f in dropped_features:
    print(f"  - {f}")

print(f"\nRemaining features: {len(X_filtered.columns)} (vs {len(X.columns)} before)")

In [None]:
# Feature importance visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 8))

# Importance bar plot
ax1 = axes[0]
top_features = importance_rf.head(20)
colors = plt.cm.viridis(np.linspace(0, 1, len(top_features)))
bars = ax1.barh(range(len(top_features)), top_features['importance_normalized'], color=colors)
ax1.set_yticks(range(len(top_features)))
ax1.set_yticklabels(top_features['feature'])
ax1.invert_yaxis()
ax1.set_xlabel('Normalized Importance')
ax1.set_title('Top 20 Features by Importance (Random Forest)')

# Correlation heatmap of the top features
ax2 = axes[1]
top_feature_names = importance_rf.head(10)['feature'].tolist()
corr_top = X[top_feature_names].corr()
sns.heatmap(corr_top, annot=True, fmt='.2f', cmap='RdBu_r', center=0, ax=ax2)
ax2.set_title('Correlation between Top 10 Features')

plt.tight_layout()
plt.show()

---

## Part 7: Preprocessing and Normalization (15 min)

### Why normalize?

| Reason | Explanation |
|--------|-------------|
| **Scale** | Features with large values dominate |
| **Gradient** | Faster convergence (neural networks) |
| **Distance** | K-NN and SVM are sensitive to scale |
| **Regularization** | L1/L2 penalties depend on feature scale |

### Normalization methods

| Method | Formula | When to use |
|--------|---------|-------------|
| **StandardScaler** | `(x - mean) / std` | Roughly normal distribution |
| **MinMaxScaler** | `(x - min) / (max - min)` | A [0, 1] range is needed |
| **RobustScaler** | `(x - median) / IQR` | Data with outliers |
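As a small worked example of the three formulas in the table (the toy array is an illustrative assumption), the same values can be scaled by hand with numpy. The single outlier squeezes the min-max output of the other points close to 0, while the median/IQR version keeps them spread out.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # the last value is an outlier

# (x - mean) / std, with the population std (ddof=0)
standard = (x - x.mean()) / x.std()

# (x - min) / (max - min)
minmax = (x - x.min()) / (x.max() - x.min())

# (x - median) / IQR, with IQR = 75th percentile - 25th percentile
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)

print("standard:", standard.round(2))
print("minmax:  ", minmax.round(2))
print("robust:  ", robust.round(2))
```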

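### Rolling normalization to avoid lookahead

In trading, the mean and standard deviation must not be computed over the full dataset, because that leaks future information into past observations (lookahead bias). One option is a **rolling normalization**, which only uses the trailing window; another is to fit the scaler on the training period only.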

```python
# WRONG (lookahead bias)
scaler.fit(all_data)

# CORRECT (rolling)
rolling_mean = data.rolling(window).mean()
rolling_std = data.rolling(window).std()
normalized = (data - rolling_mean) / rolling_std
```
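The following sketch checks the no-lookahead property numerically. The helper name `rolling_zscore` and the simulated random walk are illustrative assumptions. A rolling z-score computed on the first 200 points does not change when 100 later points are appended, whereas a global z-score does.

```python
import numpy as np
import pandas as pd

def rolling_zscore(s: pd.Series, window: int) -> pd.Series:
    """Z-score of each point against the trailing `window` observations (current one included)."""
    mean = s.rolling(window).mean()
    std = s.rolling(window).std()
    return (s - mean) / std

# Simulated random walk (illustrative only)
rng = np.random.default_rng(42)
s = pd.Series(rng.normal(size=300)).cumsum()

# Rolling version: same values for the first 200 points with or without later data
z_past_only = rolling_zscore(s.iloc[:200], window=60)
z_full = rolling_zscore(s, window=60)
same = np.allclose(z_past_only.dropna(), z_full.iloc[:200].dropna())

# Global version: the first 200 values depend on data that arrives later
g_past = (s.iloc[:200] - s.iloc[:200].mean()) / s.iloc[:200].std()
g_full = ((s - s.mean()) / s.std()).iloc[:200]
global_same = np.allclose(g_past, g_full)

print("rolling unchanged:", same)
print("global unchanged: ", global_same)
```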

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

def preprocess_features_sklearn(X_train, X_test, method='standard'):
    """
    Preprocessing with scikit-learn (for a train/test split).
    
    Parameters:
    -----------
    X_train : pd.DataFrame
        Training features
    X_test : pd.DataFrame
        Test features
    method : str
        'standard', 'minmax', or 'robust'
    
    Returns:
    --------
    tuple : (X_train_scaled, X_test_scaled, scaler)
    """
    if method == 'standard':
        scaler = StandardScaler()
    elif method == 'minmax':
        scaler = MinMaxScaler()
    elif method == 'robust':
        scaler = RobustScaler()
    else:
        raise ValueError(f"Unknown method: {method}")
    
    # Fit on train only, then transform both sets
    X_train_scaled = pd.DataFrame(
        scaler.fit_transform(X_train),
        columns=X_train.columns,
        index=X_train.index
    )
    
    X_test_scaled = pd.DataFrame(
        scaler.transform(X_test),
        columns=X_test.columns,
        index=X_test.index
    )
    
    return X_train_scaled, X_test_scaled, scaler


def rolling_normalize(df, window=252):
    """
    Rolling normalization to avoid lookahead bias.
    
    Parameters:
    -----------
    df : pd.DataFrame
        Features to normalize
    window : int
        Window used for mean/std (e.g. 252 = 1 year of trading days)
    
    Returns:
    --------
    pd.DataFrame
        Normalized features (rolling Z-score, NaN for the first window - 1 rows)
    """
    result = pd.DataFrame(index=df.index)
    
    for col in df.columns:
        rolling_mean = df[col].rolling(window).mean()
        rolling_std = df[col].rolling(window).std()
        
        # Z-score using the rolling statistics
        result[col] = (df[col] - rolling_mean) / (rolling_std + 1e-8)
        
        # Clip extreme outliers
        result[col] = result[col].clip(-3, 3)
    
    return result


def percentile_rank(df, window=252):
    """
    Rolling percentile rank (robust to outliers).
    
    Parameters:
    -----------
    df : pd.DataFrame
        Features
    window : int
        Window length
    
    Returns:
    --------
    pd.DataFrame
        Percentile ranks in (0, 1]
    """
    def rolling_percentile(series, window):
        def rank_pct(x):
            return pd.Series(x).rank(pct=True).iloc[-1]
        return series.rolling(window).apply(rank_pct, raw=True)
    
    result = pd.DataFrame(index=df.index)
    for col in df.columns:
        result[col] = rolling_percentile(df[col], window)
    
    return result

# Demonstration
print("Normalization methods:")
print("\n1. StandardScaler (sklearn):")
print("   - Fit on train, transform on test")
print("   - Result: mean=0, std=1 on the training set")

print("\n2. Rolling Normalize:")
print("   - No lookahead bias")
print("   - Uses the mean/std of the last N days")

print("\n3. Percentile Rank:")
print("   - Robust to outliers")
print("   - Result: (0, 1] = current percentile")

In [None]:
# Apply rolling normalization
X_rolling_norm = rolling_normalize(X_filtered, window=60)

# Visualize the effect
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Before normalization
feature_example = 'rsi_14'
if feature_example in X_filtered.columns:
    ax1 = axes[0, 0]
    ax1.plot(X_filtered.index, X_filtered[feature_example])
    ax1.set_title(f'{feature_example} - Before Normalization')
    ax1.set_ylabel('Raw value')
    
    ax2 = axes[0, 1]
    ax2.hist(X_filtered[feature_example].dropna(), bins=50, color='steelblue', edgecolor='white')
    ax2.set_title(f'{feature_example} - Distribution (raw)')

# After normalization
if feature_example in X_rolling_norm.columns:
    ax3 = axes[1, 0]
    ax3.plot(X_rolling_norm.index, X_rolling_norm[feature_example])
    ax3.axhline(0, color='red', linestyle='--', alpha=0.5)
    ax3.axhline(2, color='orange', linestyle=':', alpha=0.5)
    ax3.axhline(-2, color='orange', linestyle=':', alpha=0.5)
    ax3.set_title(f'{feature_example} - After Rolling Normalization')
    ax3.set_ylabel('Z-score')
    ax3.set_xlabel('Date')
    
    ax4 = axes[1, 1]
    ax4.hist(X_rolling_norm[feature_example].dropna(), bins=50, color='forestgreen', edgecolor='white')
    ax4.axvline(0, color='red', linestyle='--')
    ax4.set_title(f'{feature_example} - Distribution (normalized)')
    ax4.set_xlabel('Z-score')

plt.tight_layout()
plt.show()

# Guarded: the example feature may have been removed by the correlation filter
if feature_example in X_filtered.columns:
    print("\nStatistics before/after:")
    print(f"  Before: mean={X_filtered[feature_example].mean():.2f}, std={X_filtered[feature_example].std():.2f}")
    print(f"  After:  mean={X_rolling_norm[feature_example].mean():.2f}, std={X_rolling_norm[feature_example].std():.2f}")

---

## Part 8: Complete Feature Engineering Pipeline (20 min)

### Pipeline architecture

```
Donnees OHLCV
     |
     v
+--------------------+
| 1. Price Features  |  (returns, volatility, range)
+--------------------+
     |
     v
+--------------------+
| 2. Indicator Feat. |  (RSI, MACD, BB, ADX)
+--------------------+
     |
     v
+--------------------+
| 3. Create Labels   |  (future returns, classification)
+--------------------+
     |
     v
+--------------------+
| 4. Train/Test Split|  (temporal, single split)
+--------------------+
     |
     v
+--------------------+
| 5. Feature Select. |  (correlation filter, importance; fit on train)
+--------------------+
     |
     v
+--------------------+
| 6. Normalization   |  (StandardScaler fit on train)
+--------------------+
     |
     v
X_train, X_test, y_train, y_test
      (Ready for ML)
```
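One detail the diagram does not show: with a label horizon of h days, the last h training rows carry labels computed from prices inside the test period. A common remedy is to purge those rows. The sketch below is an illustration under that assumption; the function name `purged_temporal_split` and the toy frame are hypothetical and are not part of the pipeline class that follows.

```python
import numpy as np
import pandas as pd

def purged_temporal_split(df: pd.DataFrame, train_ratio: float = 0.7, horizon: int = 5):
    """Chronological split that drops the last `horizon` training rows.

    Those rows have labels built from prices that fall inside the test period.
    """
    split_idx = int(len(df) * train_ratio)
    train = df.iloc[:max(split_idx - horizon, 0)]
    test = df.iloc[split_idx:]
    return train, test

# Toy frame on business days (illustrative only)
idx = pd.date_range('2023-01-02', periods=100, freq='B')
df = pd.DataFrame({'feature': np.arange(100.0)}, index=idx)

train, test = purged_temporal_split(df, train_ratio=0.7, horizon=5)
print(f"train: {len(train)} rows, ends {train.index[-1].date()}")
print(f"test:  {len(test)} rows, starts {test.index[0].date()}")
```

The gap between the two sets equals the horizon, so no training label depends on a price that the test set also uses.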

In [None]:
class FeatureEngineeringPipeline:
    """
    Complete Feature Engineering pipeline for ML trading.
    
    Steps:
    1. Compute technical features
    2. Create labels
    3. Temporal train/test split
    4. Feature selection (fitted on train only)
    5. Normalization (StandardScaler fitted on train only)
    
    Usage:
        pipeline = FeatureEngineeringPipeline(horizon=5, train_ratio=0.7)
        X_train, X_test, y_train, y_test = pipeline.fit_transform(df_ohlcv)
    """
    
    def __init__(self, 
                 horizon=5,
                 train_ratio=0.7,
                 label_type='binary',
                 label_threshold=0.0,
                 norm_window=60,
                 n_features=20,
                 correlation_threshold=0.90):
        """
        Parameters:
        -----------
        horizon : int
            Prediction horizon (days)
        train_ratio : float
            Fraction of samples used for training
        label_type : str
            'binary', 'ternary', or 'regression'
        label_threshold : float
            Threshold for labels (classification)
        norm_window : int
            Window for rolling normalization (stored, but not used by the
            StandardScaler step in this class)
        n_features : int
            Number of features to keep
        correlation_threshold : float
            Threshold above which correlated features are removed
        """
        self.horizon = horizon
        self.train_ratio = train_ratio
        self.label_type = label_type
        self.label_threshold = label_threshold
        self.norm_window = norm_window
        self.n_features = n_features
        self.correlation_threshold = correlation_threshold
        
        # Populated by fit_transform
        self.feature_names_ = None
        self.selected_features_ = None
        self.feature_importance_ = None
        self.scaler_ = None
    
    def _calculate_features(self, df):
        """Compute all technical features."""
        result = df.copy()
        
        # Price features
        result = calculate_price_features(result)
        
        # Indicator features
        result = calculate_indicator_features(result)
        
        return result
    
    def _create_labels(self, df):
        """Create labels according to `label_type`."""
        # Future return (NaN for the last `horizon` rows, dropped later by dropna)
        df['future_return'] = df['close'].shift(-self.horizon) / df['close'] - 1
        
        if self.label_type == 'binary':
            df['label'] = (df['future_return'] > self.label_threshold).astype(int)
        elif self.label_type == 'ternary':
            df['label'] = 0
            df.loc[df['future_return'] > self.label_threshold, 'label'] = 1
            df.loc[df['future_return'] < -self.label_threshold, 'label'] = -1
        elif self.label_type == 'regression':
            df['label'] = df['future_return']
        else:
            raise ValueError(f"Unknown label_type: {self.label_type}")
        
        return df
    
    def _temporal_split(self, df):
        """Temporal train/test split."""
        split_idx = int(len(df) * self.train_ratio)
        train = df.iloc[:split_idx]
        test = df.iloc[split_idx:]
        return train, test
    
    def _normalize(self, X_train, X_test):
        """Scale the features with a StandardScaler fitted on train only."""
        self.scaler_ = StandardScaler()
        
        X_train_norm = pd.DataFrame(
            self.scaler_.fit_transform(X_train),
            columns=X_train.columns,
            index=X_train.index
        )
        
        X_test_norm = pd.DataFrame(
            self.scaler_.transform(X_test),
            columns=X_test.columns,
            index=X_test.index
        )
        
        return X_train_norm, X_test_norm
    
    def _select_features(self, X_train, y_train):
        """Select the best features (fitted on the training set only)."""
        # 1. Remove highly correlated features
        X_filtered, dropped = remove_correlated_features(X_train, self.correlation_threshold)
        
        # 2. Compute importance. The Random Forest classifier needs discrete
        #    labels, so regression targets fall back to absolute correlation.
        method = 'correlation' if self.label_type == 'regression' else 'random_forest'
        self.feature_importance_ = calculate_feature_importance(
            X_filtered, y_train, method=method
        )
        
        # 3. Keep the top N features
        self.selected_features_ = self.feature_importance_.head(self.n_features)['feature'].tolist()
        
        return self.selected_features_
    
    def fit_transform(self, df):
        """
        Run the complete pipeline.
        
        Parameters:
        -----------
        df : pd.DataFrame
            Raw OHLCV data
        
        Returns:
        --------
        tuple : (X_train, X_test, y_train, y_test)
        """
        print("="*60)
        print("FEATURE ENGINEERING PIPELINE")
        print("="*60)
        
        # 1. Compute features
        print("\n[1/5] Computing technical features...")
        df_features = self._calculate_features(df)
        
        # 2. Create labels
        print(f"[2/5] Creating labels ({self.label_type}, horizon={self.horizon}D)...")
        df_labeled = self._create_labels(df_features)
        
        # Separate feature columns from non-feature columns
        non_feature_cols = ['open', 'high', 'low', 'close', 'volume', 
                           'future_return', 'label']
        self.feature_names_ = [c for c in df_labeled.columns if c not in non_feature_cols]
        print(f"    {len(self.feature_names_)} features computed")
        
        # Drop NaN (indicator warm-up rows and rows without a future return)
        df_clean = df_labeled.dropna()
        print(f"    {len(df_clean)} valid samples (after dropping NaN)")
        
        # 3. Temporal split
        print(f"[3/5] Temporal split (train={self.train_ratio*100:.0f}%)...")
        train_df, test_df = self._temporal_split(df_clean)
        
        X_train = train_df[self.feature_names_]
        y_train = train_df['label']
        X_test = test_df[self.feature_names_]
        y_test = test_df['label']
        
        print(f"    Train: {len(X_train)} samples ({train_df.index[0].date()} - {train_df.index[-1].date()})")
        print(f"    Test:  {len(X_test)} samples ({test_df.index[0].date()} - {test_df.index[-1].date()})")
        
        # 4. Feature selection
        print(f"[4/5] Selecting the {self.n_features} best features...")
        selected = self._select_features(X_train, y_train)
        
        X_train = X_train[selected]
        X_test = X_test[selected]
        
        # 5. Normalization
        print("[5/5] Normalization (StandardScaler)...")
        X_train, X_test = self._normalize(X_train, X_test)
        
        print("\n" + "="*60)
        print("PIPELINE COMPLETE")
        print("="*60)
        print("\nResult:")
        print(f"  X_train: {X_train.shape}")
        print(f"  X_test:  {X_test.shape}")
        print(f"  y_train: {len(y_train)} (distribution: {dict(y_train.value_counts())})")
        print(f"  y_test:  {len(y_test)} (distribution: {dict(y_test.value_counts())})")
        
        print("\nTop 10 selected features:")
        for i, feat in enumerate(selected[:10]):
            imp = self.feature_importance_[self.feature_importance_['feature'] == feat]['importance'].values[0]
            print(f"  {i+1}. {feat} (importance: {imp:.4f})")
        
        return X_train, X_test, y_train, y_test
    
    def transform(self, df_new):
        """
        Transform new data (reuses the parameters learned during fit).
        
        Parameters:
        -----------
        df_new : pd.DataFrame
            New OHLCV data
        
        Returns:
        --------
        pd.DataFrame
            Transformed features
        """
        if self.selected_features_ is None:
            raise ValueError("Pipeline not fitted. Call fit_transform first.")
        
        # Compute features
        df_features = self._calculate_features(df_new)
        df_clean = df_features.dropna()
        
        # Select and scale
        X = df_clean[self.selected_features_]
        X_norm = pd.DataFrame(
            self.scaler_.transform(X),
            columns=X.columns,
            index=X.index
        )
        
        return X_norm

In [None]:
# Run the complete pipeline
pipeline = FeatureEngineeringPipeline(
    horizon=5,                    # Predict the direction over 5 days
    train_ratio=0.7,              # 70% train, 30% test
    label_type='binary',          # Binary classification
    label_threshold=0.0,          # Threshold = 0 (plain up/down)
    n_features=15,                # Keep 15 features
    correlation_threshold=0.90    # Drop if correlation > 90%
)

# Transform the data
X_train, X_test, y_train, y_test = pipeline.fit_transform(df)

In [None]:
# Quick sanity check with a simple ML model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Train a Random Forest
print("\nValidation with Random Forest...")
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train, y_train)

# Predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

print(f"\nAccuracy Train: {accuracy_score(y_train, y_train_pred):.4f}")
print(f"Accuracy Test:  {accuracy_score(y_test, y_test_pred):.4f}")

print("\nClassification Report (Test):")
print(classification_report(y_test, y_test_pred, target_names=['Down', 'Up']))

In [None]:
# Export the data for the following ML notebooks
print("\nPipeline ready for export:")
print("\nAvailable data:")
print(f"  - X_train: DataFrame {X_train.shape}")
print(f"  - X_test:  DataFrame {X_test.shape}")
print(f"  - y_train: Series {len(y_train)}")
print(f"  - y_test:  Series {len(y_test)}")
print("  - pipeline: FeatureEngineeringPipeline (to transform new data)")

print("\nSelected features:")
print(f"  {pipeline.selected_features_}")

# Code to save the results (optional)
save_code = '''
# Save the prepared data
import joblib
import pandas as pd

# Save the pipeline
joblib.dump(pipeline, 'feature_pipeline.pkl')

# Save the data
X_train.to_parquet('X_train.parquet')
X_test.to_parquet('X_test.parquet')
y_train.to_frame().to_parquet('y_train.parquet')
y_test.to_frame().to_parquet('y_test.parquet')

# Load later
pipeline = joblib.load('feature_pipeline.pkl')
X_train = pd.read_parquet('X_train.parquet')
'''

print("\nCode to save/load:")
print(save_code)

---

## Conclusion and Next Steps

### Summary

In this notebook, we covered:

1. **Introduction to Feature Engineering**:
   - Why "Garbage In, Garbage Out" matters
   - Feature categories (technical, fundamental, alternative)

2. **Technical Features (Price-Based)**:
   - Multi-period returns
   - Historical volatility
   - Range and distance from extremes

3. **Technical Features (Indicator-Based)**:
   - RSI, MACD, Bollinger Bands
   - ADX, Stochastic, CCI
   - Normalizing indicators for ML

4. **Fundamental Features**:
   - Valuation (P/E, P/B, P/S)
   - Profitability (ROE, ROA, Margins)
   - Growth and Debt ratios

5. **Labeling**:
   - Classification (binary, ternary, quantile)
   - Regression (future returns)
   - Triple Barrier method

6. **Feature Selection**:
   - Random Forest importance
   - Correlation filtering
   - Mutual information

7. **Preprocessing**:
   - StandardScaler, RobustScaler
   - Rolling normalization (anti-lookahead)

8. **Complete Pipeline**:
   - `FeatureEngineeringPipeline` class
   - Temporal train/test split
   - Export for ML

### Key Takeaways

| Concept | Key Point |
|---------|----------|
| **Lookahead Bias** | Never use future data in the features |
| **Normalization** | Use rolling or train-only statistics, never a mean/std computed on the full dataset |
| **Selection** | Remove correlated features before ranking importance |
| **Labeling** | Horizon and threshold strongly affect the results |
| **Split** | Always temporal in finance (never random) |

### Limitations

- Technical features alone are often insufficient
- Markets evolve, so features lose predictive power (regime change)
- Overfitting is easy with too many features
- Simulated data does not capture the full complexity of real markets

### Next Steps

| Notebook | Content |
|----------|--------|
| **QC-Py-19** | Classification Models (Random Forest, XGBoost) |
| **QC-Py-20** | Regression Models and return prediction |
| **QC-Py-21** | Deep Learning (LSTM, Transformers) |
| **QC-Py-22** | Reinforcement Learning for trading |

### Additional Resources

- **shared/features.py** - Library of feature engineering functions
- **shared/ml_utils.py** - ML utilities (train, evaluate, save)
- [Advances in Financial ML](https://www.amazon.com/Advances-Financial-Machine-Learning-Marcos/dp/1119482089) - Lopez de Prado
- [QuantConnect Data Documentation](https://www.quantconnect.com/docs/v2/writing-algorithms/datasets)

---

**Notebook complete. Ready for QC-Py-19 (Classification Models).**