# QC-Py-22 - Modern Time Series Deep Learning (SOTA 2024-2026)

> **State-of-the-art PyTorch architectures for financial time series forecasting**
> Duration: 120 minutes | Level: Advanced | Python + PyTorch + QuantConnect

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Understand the **evolution of time series architectures** (LSTM -> Transformers -> SSMs)
2. Use **PyTorch** for financial time series
3. Implement **DLinear** as an efficient baseline (AAAI 2023)
4. Use **PatchTST** with patch-based tokenization (ICLR 2023)
5. Apply **iTransformer** with inverted attention (ICLR 2024 Spotlight)
6. Experiment with **TimeMixer**, an attention-free model (ICLR 2024)
7. **Compare** the architectures on financial data
8. **Integrate** the models into QuantConnect (ObjectStore, CPU inference)

## Prerequisites

- Notebooks QC-Py-01 through QC-Py-21 completed
- Basic understanding of neural networks
- Familiarity with PyTorch (tensors, modules)
- numpy, pandas, sklearn

## Notebook Structure

| Part | Topic | Duration |
|------|-------|----------|
| 1 | Evolution of Architectures (2015-2026) | 15 min |
| 2 | PyTorch Setup and Data | 15 min |
| 3 | DLinear - Linear Baseline (AAAI 2023) | 15 min |
| 4 | PatchTST - Patch-based Transformer (ICLR 2023) | 20 min |
| 5 | iTransformer - Inverted Attention (ICLR 2024) | 15 min |
| 6 | TimeMixer - Multiscale MLP (ICLR 2024) | 10 min |
| 7 | Comparison and Benchmarks | 15 min |
| 8 | QuantConnect Integration | 15 min |

---

## SOTA References

| Architecture | Paper | Conference | Code |
|--------------|-------|------------|------|
| **DLinear** | Are Transformers Effective for Time Series Forecasting? | AAAI 2023 | [cure-lab/LTSF-Linear](https://github.com/cure-lab/LTSF-Linear) |
| **PatchTST** | A Time Series is Worth 64 Words: Long-term Forecasting with Transformers | ICLR 2023 | [yuqinie98/PatchTST](https://github.com/yuqinie98/PatchTST) |
| **iTransformer** | iTransformer: Inverted Transformers Are Effective for Time Series Forecasting | ICLR 2024 Spotlight | [thuml/iTransformer](https://github.com/thuml/iTransformer) |
| **TimeMixer** | TimeMixer: Decomposable Multiscale Mixing for Time Series Forecasting | ICLR 2024 | [kwuking/TimeMixer](https://github.com/kwuking/TimeMixer) |
| **Time-Series-Library** | Unified framework with 20+ models | Tsinghua | [thuml/Time-Series-Library](https://github.com/thuml/Time-Series-Library) |

---

## Part 1: Evolution of Architectures (2015-2026)

### Timeline of Time Series Architectures

```
2015-2017: RNN/LSTM Era
    - LSTM, GRU dominant
    - Vanishing gradient problem
    - Sequential processing (slow)

2017-2022: Transformer Era
    - Attention mechanisms
    - Parallel processing
    - O(n^2) complexity problem

2023: "Are Transformers Effective?" Moment
    - DLinear outperforms complex Transformers!
    - Architectures called into question
    - Focus on simplicity and efficiency

2023-2024: Patch-based & Inverted Attention
    - PatchTST: patch-based tokenization
    - iTransformer: attention across variables
    - TimeMixer: multiscale MLP

2024-2026: State Space Models (Mamba)
    - O(n) complexity instead of O(n^2)
    - See notebook QC-Py-23
```

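### The DLinear Paradox (AAAI 2023)

The paper "Are Transformers Effective for Time Series Forecasting?" showed that a single-layer linear model can match or beat much larger Transformer forecasters. On ETTh1 with a forecast horizon of 96, it reports:

| Model | MSE (ETTh1, horizon 96) | Complexity in sequence length n | Parameters (approx.) |
|-------|-------------------------|---------------------------------|----------------------|
| Informer | 0.865 | O(n log n) | 11M |
| Autoformer | 0.449 | O(n log n) | 10M |
| FEDformer | 0.376 | O(n) | 8M |
| **DLinear** | **0.375** | **O(n)** | **~10K** |

Parameter counts are orders of magnitude and depend on the configuration. DLinear has no attention at all: its cost is that of one linear map per component.

**Conclusion**: A well-chosen simple model can match or beat far more complex ones!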
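To see where the small parameter count comes from, the sketch below counts the weights of a DLinear-style model: two linear maps (seasonal and trend) from the lookback length to the horizon, optionally one pair per channel. The helper name `dlinear_param_count` is hypothetical and not part of the LTSF-Linear code.

```python
def dlinear_param_count(seq_len: int, pred_len: int, n_channels: int = 1,
                        individual: bool = False) -> int:
    """Count weights and biases of a DLinear-style model (hypothetical helper)."""
    per_linear = seq_len * pred_len + pred_len   # weight matrix + bias
    per_pair = 2 * per_linear                    # seasonal head + trend head
    return per_pair * (n_channels if individual else 1)


shared = dlinear_param_count(seq_len=96, pred_len=96)
per_channel = dlinear_param_count(seq_len=96, pred_len=24, n_channels=7, individual=True)
print(f"shared heads, 96 -> 96:        {shared:,} parameters")
print(f"7 per-channel heads, 96 -> 24: {per_channel:,} parameters")
```

Both configurations stay in the tens of thousands of parameters, hundreds of times fewer than the multi-million-parameter Transformers in the table.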

### Why do classical Transformers struggle?

1. **Permutation invariance**: self-attention itself is order-agnostic; temporal order enters only through positional encodings, so part of the ordering information is lost
2. **Point-wise attention**: each timestep is one token, which is too fine-grained to carry much meaning
3. **Overfitting**: too many parameters for noisy financial series
4. **Computational cost**: O(n^2) attention is prohibitive for long sequences
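The first point can be checked numerically. The following minimal sketch, assuming a single attention head with random weights and no positional encoding (all names are illustrative), shows that shuffling the timesteps of the input merely shuffles the rows of the output in the same way: the layer has no built-in notion of temporal order.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention, no positional encoding."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)          # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
n_steps, d = 8, 4
x = rng.normal(size=(n_steps, d))                           # one token per timestep
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(n_steps)                             # shuffle the time axis
out = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)

# Same outputs, only reordered: attention alone cannot tell the two orderings apart
is_equivariant = np.allclose(out[perm], out_shuffled)
print(is_equivariant)
```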

### SOTA Solutions 2024-2026

| Architecture | Innovation | Advantage |
|--------------|------------|-----------|
| **DLinear** | Decomposition + Linear | Ultra-simple, strong baseline |
| **PatchTST** | Patches instead of points | Captures local patterns |
| **iTransformer** | Attention across variables | Captures cross-series correlations |
| **TimeMixer** | Attention-free multiscale mixing | Efficient, no attention cost |
| **Mamba/SSMs** | State Space Models | O(n), long context (see QC-Py-23) |

### Architecture Comparison

```
LSTM (2015):          Transformer (2017):       PatchTST (2023):
x1 -> h1 -> ...       [x1,x2,...,xn]           [patch1, patch2, ...]
Sequential            Full Attention O(n^2)    Patch Attention

iTransformer (2024):  TimeMixer (2024):        Mamba (2024):
[var1,var2,...,varm]  Multiscale MLP           State Space
Variable Attention    No Attention             O(n) Selective
```
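As a rough illustration of the diagram, the sketch below counts attention tokens and pairwise attention scores for one sample under the three tokenization schemes. It assumes a single head and a single layer and ignores feed-forward cost; `attention_cost` is a hypothetical helper, and the patch settings (length 16, stride 8) are the ones used later in this notebook.

```python
def attention_cost(seq_len, n_vars, patch_len=16, stride=8):
    """Tokens per attention call, number of calls, total pairwise scores (illustrative)."""
    n_patches = (seq_len - patch_len) // stride + 1
    return {
        "point-wise": (seq_len, 1, seq_len ** 2),
        "patch (per channel)": (n_patches, n_vars, n_vars * n_patches ** 2),
        "inverted (variables)": (n_vars, 1, n_vars ** 2),
    }

costs = attention_cost(seq_len=96, n_vars=7)
for name, (tokens, calls, scores) in costs.items():
    print(f"{name:22s} tokens={tokens:3d} calls={calls} scores={scores}")
```

With a 96-step lookback and 7 variables, point-wise attention computes thousands of scores per sample, while the patch-based and inverted schemes need far fewer.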

---

## Part 2: PyTorch Setup and Data (15 min)

In [None]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Matplotlib configuration
plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline

print("Base imports succeeded")

In [None]:
# PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

print(f"PyTorch version: {torch.__version__}")
print(f"Device: {device}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

# Reproducibility
torch.manual_seed(42)
np.random.seed(42)

In [None]:
# Sklearn for preprocessing and metrics
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error

print("Sklearn imported successfully")

In [None]:
# Generate simulated financial data

def generate_financial_data(n_days=1000, n_features=7, seed=42):
    """
    Generate simulated multivariate financial data.
    
    Features:
    - close: closing price
    - volume: trading volume (simulated, not normalized)
    - returns: daily returns
    - sma_20: 20-day SMA
    - sma_50: 50-day SMA
    - rsi: 14-day RSI
    - volatility: 20-day annualized volatility
    
    Note: `n_features` is currently unused; the function always returns
    the 7 columns listed above.
    
    Returns:
    --------
    pd.DataFrame of features; the forecasting target is `close`
    """
    np.random.seed(seed)
    
    # Dates
    dates = pd.date_range(start='2019-01-01', periods=n_days, freq='B')
    
    # Price with trend + cycles + noise
    trend = np.linspace(100, 180, n_days)
    cycle1 = 15 * np.sin(np.linspace(0, 10 * np.pi, n_days))
    cycle2 = 8 * np.sin(np.linspace(0, 40 * np.pi, n_days))
    noise = np.cumsum(np.random.randn(n_days) * 0.7)
    
    close = trend + cycle1 + cycle2 + noise
    close = np.maximum(close, 50)
    
    # Returns: price change relative to the previous close
    prev_close = np.concatenate(([close[0]], close[:-1]))
    returns = (close - prev_close) / prev_close
    
    # Volume (negatively correlated with price, for the simulation)
    volume = 1_000_000 * (1 + np.random.exponential(0.3, n_days))
    volume = volume * (1 - 0.3 * (close - close.mean()) / close.std())
    
    # SMA (back-fill the warm-up NaNs; .bfill() replaces the deprecated fillna(method='bfill'))
    sma_20 = pd.Series(close).rolling(20).mean().bfill().values
    sma_50 = pd.Series(close).rolling(50).mean().bfill().values
    
    # RSI
    delta = pd.Series(close).diff()
    gain = delta.clip(lower=0).rolling(14).mean()
    loss = (-delta.clip(upper=0)).rolling(14).mean()
    rs = gain / (loss + 1e-10)
    rsi = (100 - (100 / (1 + rs))).fillna(50).values
    
    # Annualized 20-day volatility (.bfill() replaces the deprecated fillna(method='bfill'))
    volatility = pd.Series(returns).rolling(20).std().bfill().values * np.sqrt(252)
    
    df = pd.DataFrame({
        'close': close,
        'volume': volume,
        'returns': returns,
        'sma_20': sma_20,
        'sma_50': sma_50,
        'rsi': rsi,
        'volatility': volatility
    }, index=dates)
    
    return df

# Generate the data
df = generate_financial_data(n_days=1000)

print(f"Generated data: {len(df)} days")
print(f"Features: {list(df.columns)}")
print(f"Period: {df.index[0].date()} to {df.index[-1].date()}")
print(f"\nPreview:")
print(df.head())

In [None]:
# Visualize the data
fig, axes = plt.subplots(3, 1, figsize=(14, 10))

# Price and SMAs
ax1 = axes[0]
ax1.plot(df.index, df['close'], 'b-', linewidth=1.5, label='Close', alpha=0.8)
ax1.plot(df.index, df['sma_20'], 'orange', linewidth=1, label='SMA 20', alpha=0.7)
ax1.plot(df.index, df['sma_50'], 'green', linewidth=1, label='SMA 50', alpha=0.7)
ax1.set_ylabel('Price')
ax1.set_title('Price and Moving Averages', fontsize=14, fontweight='bold')
ax1.legend(loc='upper left')
ax1.grid(True, alpha=0.3)

# RSI
ax2 = axes[1]
ax2.plot(df.index, df['rsi'], 'purple', linewidth=1)
ax2.axhline(70, color='red', linestyle='--', alpha=0.5)
ax2.axhline(30, color='green', linestyle='--', alpha=0.5)
ax2.fill_between(df.index, 30, 70, alpha=0.1, color='gray')
ax2.set_ylabel('RSI')
ax2.set_title('Relative Strength Index', fontsize=14, fontweight='bold')
ax2.set_ylim(0, 100)
ax2.grid(True, alpha=0.3)

# Volatility
ax3 = axes[2]
ax3.fill_between(df.index, 0, df['volatility'] * 100, alpha=0.5, color='steelblue')
ax3.set_ylabel('Annualized Volatility (%)')
ax3.set_xlabel('Date')
ax3.set_title('Realized Volatility (20 days)', fontsize=14, fontweight='bold')
ax3.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# PyTorch Dataset for time series

class TimeSeriesDataset(Dataset):
    """
    PyTorch Dataset for time series forecasting.
    
    Parameters:
    -----------
    data : np.array
        Normalized data [n_samples, n_features]
    seq_len : int
        Length of the input sequence (lookback)
    pred_len : int
        Length of the prediction (forecast horizon)
    target_idx : int
        Index of the target feature (default: 0 = close)
    """
    
    def __init__(self, data, seq_len=96, pred_len=24, target_idx=0):
        self.data = torch.FloatTensor(data)
        self.seq_len = seq_len
        self.pred_len = pred_len
        self.target_idx = target_idx
        
    def __len__(self):
        return len(self.data) - self.seq_len - self.pred_len + 1
    
    def __getitem__(self, idx):
        # Input: [seq_len, n_features]
        x = self.data[idx:idx + self.seq_len]
        
        # Target: [pred_len] (target feature only)
        y = self.data[idx + self.seq_len:idx + self.seq_len + self.pred_len, self.target_idx]
        
        return x, y

print("TimeSeriesDataset defined")
print("\nUsage:")
print("  dataset = TimeSeriesDataset(data, seq_len=96, pred_len=24)")
print("  x, y = dataset[0]")
print("  -> x shape: [seq_len, n_features]")
print("  -> y shape: [pred_len]")

In [None]:
# Prepare the data

# Parameters
SEQ_LEN = 96      # ~4.5 months of trading days
PRED_LEN = 24     # ~1 month forecast horizon
BATCH_SIZE = 32

# Chronological Train/Val/Test split (70/15/15)
n = len(df)
train_end = int(n * 0.7)
val_end = int(n * 0.85)

# Normalization: fit the scaler on the training span only, so that
# validation/test statistics do not leak into training
scaler = StandardScaler()
scaler.fit(df.values[:train_end])
data_scaled = scaler.transform(df.values)

print(f"Normalized data: {data_scaled.shape}")

train_data = data_scaled[:train_end]
val_data = data_scaled[train_end:val_end]
test_data = data_scaled[val_end:]

print(f"\nSplit:")
print(f"  Train: {len(train_data)} samples")
print(f"  Val:   {len(val_data)} samples")
print(f"  Test:  {len(test_data)} samples")

# Create the datasets
train_dataset = TimeSeriesDataset(train_data, seq_len=SEQ_LEN, pred_len=PRED_LEN)
val_dataset = TimeSeriesDataset(val_data, seq_len=SEQ_LEN, pred_len=PRED_LEN)
test_dataset = TimeSeriesDataset(test_data, seq_len=SEQ_LEN, pred_len=PRED_LEN)

# DataLoaders
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

print(f"\nDataLoaders created:")
print(f"  Train batches: {len(train_loader)}")
print(f"  Val batches:   {len(val_loader)}")
print(f"  Test batches:  {len(test_loader)}")

# Check the shapes
x_sample, y_sample = next(iter(train_loader))
print(f"\nSample batch:")
print(f"  x shape: {x_sample.shape}  [batch, seq_len, features]")
print(f"  y shape: {y_sample.shape}  [batch, pred_len]")

---

## Part 3: DLinear - Linear Baseline (AAAI 2023)

### Concept

DLinear decomposes the series into a **trend** and a **seasonal** (remainder) component, applies a separate linear layer to each, and sums the two forecasts:

```
Input [B, L, C]
      |
      v
+-------------+
| Decompose   | -> Trend + Seasonal
+-------------+
      |    \
      v     v
  Linear  Linear
      |     |
      +--+--+
         |
         v
   Output [B, H, C]
```

### Why does it work?

1. **Decomposition**: separates long-term patterns (trend, obtained with a moving average) from short-term ones (seasonal remainder)
2. **Linearity**: a linear map from the lookback window to the horizon is often a strong approximation for noisy financial series
3. **Simplicity**: fewer parameters, hence less overfitting
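The decomposition step can be sketched on a single 1-D series with NumPy. This is a minimal illustration under the same assumptions as the PyTorch module below (window of 25, edges padded by repeating the first and last values); the function name `moving_average_decompose` is illustrative.

```python
import numpy as np

def moving_average_decompose(x, kernel_size=25):
    """Split a 1-D series into (seasonal, trend) with an edge-padded moving average."""
    pad = (kernel_size - 1) // 2                       # assumes an odd kernel_size
    padded = np.concatenate([np.full(pad, x[0]), x, np.full(pad, x[-1])])
    kernel = np.ones(kernel_size) / kernel_size
    trend = np.convolve(padded, kernel, mode="valid")  # same length as x
    seasonal = x - trend                               # remainder after removing the trend
    return seasonal, trend

t = np.arange(96, dtype=float)
series = 0.5 * t + 3.0 * np.sin(2 * np.pi * t / 12)    # toy ramp + 12-step cycle
seasonal, trend = moving_average_decompose(series)

print(trend.shape, seasonal.shape)
print(np.allclose(seasonal + trend, series))           # the two parts add back to the input
```

Away from the edges, the trend tracks the ramp and the cycle ends up in the seasonal part; each component is then forecast by its own linear layer.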

In [None]:
# DLinear implementation

class MovingAvg(nn.Module):
    """Moving average used for the decomposition."""
    
    def __init__(self, kernel_size, stride=1):
        super().__init__()
        self.kernel_size = kernel_size
        self.avg = nn.AvgPool1d(kernel_size=kernel_size, stride=stride, padding=0)
    
    def forward(self, x):
        # x: [B, L, C]
        # Repeat the first/last value so the output keeps length L
        # (exact for an odd kernel_size; an even kernel_size yields L - 1)
        front = x[:, :1, :].repeat(1, (self.kernel_size - 1) // 2, 1)
        end = x[:, -1:, :].repeat(1, (self.kernel_size - 1) // 2, 1)
        x = torch.cat([front, x, end], dim=1)
        
        # AvgPool1d expects [B, C, L]
        x = self.avg(x.permute(0, 2, 1))
        x = x.permute(0, 2, 1)
        return x


class SeriesDecomposition(nn.Module):
    """Decomposition into Trend + Seasonal."""
    
    def __init__(self, kernel_size):
        super().__init__()
        self.moving_avg = MovingAvg(kernel_size)
    
    def forward(self, x):
        # Trend = moving average
        trend = self.moving_avg(x)
        # Seasonal = residual
        seasonal = x - trend
        return seasonal, trend


class DLinear(nn.Module):
    """
    DLinear: Decomposition + Linear
    
    Paper: "Are Transformers Effective for Time Series Forecasting?" (AAAI 2023)
    Code: https://github.com/cure-lab/LTSF-Linear
    
    Parameters:
    -----------
    seq_len : int
        Length of the input sequence
    pred_len : int
        Length of the prediction
    enc_in : int
        Number of input features
    individual : bool
        True for one pair of linear layers per feature,
        False for layers shared across features
    """
    
    def __init__(self, seq_len, pred_len, enc_in, individual=True):
        super().__init__()
        self.seq_len = seq_len
        self.pred_len = pred_len
        self.enc_in = enc_in
        self.individual = individual
        
        # Decomposition
        kernel_size = 25  # Moving-average window
        self.decomposition = SeriesDecomposition(kernel_size)
        
        if individual:
            # One linear model per feature
            self.Linear_Seasonal = nn.ModuleList([
                nn.Linear(seq_len, pred_len) for _ in range(enc_in)
            ])
            self.Linear_Trend = nn.ModuleList([
                nn.Linear(seq_len, pred_len) for _ in range(enc_in)
            ])
        else:
            # A single shared model
            self.Linear_Seasonal = nn.Linear(seq_len, pred_len)
            self.Linear_Trend = nn.Linear(seq_len, pred_len)
    
    def forward(self, x):
        # x: [B, L, C]
        seasonal, trend = self.decomposition(x)
        
        # Permute for Linear: [B, C, L]
        seasonal = seasonal.permute(0, 2, 1)
        trend = trend.permute(0, 2, 1)
        
        if self.individual:
            # Match the input dtype as well as its device
            seasonal_output = torch.zeros(
                [x.size(0), self.enc_in, self.pred_len],
                dtype=x.dtype, device=x.device
            )
            trend_output = torch.zeros(
                [x.size(0), self.enc_in, self.pred_len],
                dtype=x.dtype, device=x.device
            )
            
            for i in range(self.enc_in):
                seasonal_output[:, i, :] = self.Linear_Seasonal[i](seasonal[:, i, :])
                trend_output[:, i, :] = self.Linear_Trend[i](trend[:, i, :])
        else:
            seasonal_output = self.Linear_Seasonal(seasonal)
            trend_output = self.Linear_Trend(trend)
        
        # Combine and permute: [B, H, C]
        output = seasonal_output + trend_output
        output = output.permute(0, 2, 1)
        
        return output


# Instantiate the model
n_features = df.shape[1]
model_dlinear = DLinear(
    seq_len=SEQ_LEN,
    pred_len=PRED_LEN,
    enc_in=n_features,
    individual=True
).to(device)

# Count the parameters
n_params = sum(p.numel() for p in model_dlinear.parameters())

print("DLinear Model:")
print(f"  Input:  [batch, {SEQ_LEN}, {n_features}]")
print(f"  Output: [batch, {PRED_LEN}, {n_features}]")
print(f"  Parameters: {n_params:,}")

In [None]:
# Training and evaluation functions

def train_epoch(model, loader, criterion, optimizer, device):
    """Train the model for one epoch."""
    model.train()
    total_loss = 0
    
    for x, y in loader:
        x = x.to(device)
        y = y.to(device)
        
        optimizer.zero_grad()
        
        # Forward
        output = model(x)  # [B, H, C]
        
        # Only the first feature (close) is predicted
        pred = output[:, :, 0]  # [B, H]
        
        loss = criterion(pred, y)
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
    
    return total_loss / len(loader)


def evaluate(model, loader, criterion, device):
    """Evaluate the model."""
    model.eval()
    total_loss = 0
    preds, targets = [], []
    
    with torch.no_grad():
        for x, y in loader:
            x = x.to(device)
            y = y.to(device)
            
            output = model(x)
            pred = output[:, :, 0]
            
            loss = criterion(pred, y)
            total_loss += loss.item()
            
            preds.append(pred.cpu().numpy())
            targets.append(y.cpu().numpy())
    
    preds = np.concatenate(preds, axis=0)
    targets = np.concatenate(targets, axis=0)
    
    # Metrics
    mse = mean_squared_error(targets.flatten(), preds.flatten())
    mae = mean_absolute_error(targets.flatten(), preds.flatten())
    
    return total_loss / len(loader), mse, mae, preds, targets


print("Training functions defined")

In [None]:
# Train DLinear

# Hyperparameters
EPOCHS = 30
LR = 0.001

criterion = nn.MSELoss()
optimizer = optim.Adam(model_dlinear.parameters(), lr=LR)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5, factor=0.5)

print("="*60)
print("DLINEAR TRAINING")
print("="*60)

best_val_loss = float('inf')
train_losses, val_losses = [], []

for epoch in range(EPOCHS):
    train_loss = train_epoch(model_dlinear, train_loader, criterion, optimizer, device)
    val_loss, val_mse, val_mae, _, _ = evaluate(model_dlinear, val_loader, criterion, device)
    
    train_losses.append(train_loss)
    val_losses.append(val_loss)
    
    scheduler.step(val_loss)
    
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model_dlinear.state_dict(), 'dlinear_best.pt')
    
    if (epoch + 1) % 5 == 0:
        print(f"Epoch {epoch+1:3d}/{EPOCHS} | Train Loss: {train_loss:.6f} | Val Loss: {val_loss:.6f}")

print(f"\nBest Val Loss: {best_val_loss:.6f}")

In [None]:
# Evaluation on the test set

# map_location keeps the load working if the checkpoint was saved on another device
model_dlinear.load_state_dict(torch.load('dlinear_best.pt', map_location=device))
test_loss, test_mse, test_mae, preds_dlinear, targets_dlinear = evaluate(
    model_dlinear, test_loader, criterion, device
)

print("="*60)
print("DLINEAR - TEST RESULTS")
print("="*60)
print(f"  MSE:  {test_mse:.6f}")
print(f"  RMSE: {np.sqrt(test_mse):.6f}")
print(f"  MAE:  {test_mae:.6f}")

# Direction accuracy: compare step-to-step moves inside each forecast window
# (axis=1), so that no difference straddles two different windows
dir_true = np.sign(np.diff(targets_dlinear, axis=1))
dir_pred = np.sign(np.diff(preds_dlinear, axis=1))
dir_acc = np.mean(dir_true == dir_pred)
print(f"  Direction Accuracy: {dir_acc:.2%}")

---

## Part 4: PatchTST - Patch-based Transformer (ICLR 2023)

### Concept

PatchTST treats the series as a **sequence of patches** rather than individual points:

```
Input:  [x1, x2, x3, x4, x5, x6, x7, x8, ...]
                 |
                 v (Patching)
Patches: [patch1, patch2, patch3, ...]
         [x1-x4]  [x5-x8]  [...]
                 |
                 v
Transformer Encoder
                 |
                 v
Linear Head -> Prediction
```

### Advantages

1. **Shorter sequences**: n/patch_len tokens instead of n with non-overlapping patches (roughly n/stride when patches overlap)
2. **Local semantics**: each patch captures a local pattern
3. **Less compute**: O((n/p)^2) attention instead of O(n^2)
4. **Better generalization**: tokenization analogous to image patches in vision Transformers
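The patching step itself is just a strided sliding window. The sketch below reproduces it with NumPy on a toy series, using the same patch length (16) and stride (8) as the model that follows; `make_patches` is an illustrative name, and the PyTorch implementation below uses `Tensor.unfold` for the same purpose.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_patches(x, patch_len=16, stride=8):
    """Cut a 1-D series into overlapping patches of shape [n_patches, patch_len]."""
    windows = sliding_window_view(x, patch_len)   # every window, step 1
    return windows[::stride]                      # keep one window every `stride` steps

series = np.arange(96, dtype=float)               # toy series: value == time index
patches = make_patches(series)

print(patches.shape)     # number of patches follows (96 - 16) // 8 + 1
print(patches[1, :4])    # second patch starts one stride after the first
```

A 96-step lookback thus becomes 11 tokens of 16 values each, and attention runs over those 11 tokens instead of 96 timesteps.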

In [None]:
# Simplified PatchTST implementation

class PatchEmbedding(nn.Module):
    """Patch embedding for time series."""
    
    def __init__(self, seq_len, patch_len, stride, d_model, dropout=0.1):
        super().__init__()
        self.patch_len = patch_len
        self.stride = stride
        self.n_patches = (seq_len - patch_len) // stride + 1
        
        # Linear projection of each patch to d_model
        self.projection = nn.Linear(patch_len, d_model)
        
        # Positional embedding
        self.pos_embedding = nn.Parameter(torch.randn(1, self.n_patches, d_model))
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        # x: [B, L, C] -> extract patches
        B, L, C = x.shape
        
        # Unfold to extract the patches
        # [B, L, C] -> [B, C, L] for unfold
        x = x.permute(0, 2, 1)
        
        # Extract patches: [B, C, n_patches, patch_len]
        patches = x.unfold(dimension=2, size=self.patch_len, step=self.stride)
        
        # Reshape: [B, C, n_patches, patch_len] -> [B*C, n_patches, patch_len]
        B, C, N, P = patches.shape
        patches = patches.reshape(B * C, N, P)
        
        # Projection: [B*C, n_patches, d_model]
        embeddings = self.projection(patches)
        
        # Add positional embedding
        embeddings = embeddings + self.pos_embedding
        embeddings = self.dropout(embeddings)
        
        return embeddings, C  # Also return n_channels


class PatchTST(nn.Module):
    """
    PatchTST: A Time Series is Worth 64 Words
    
    Paper: ICLR 2023
    Code: https://github.com/yuqinie98/PatchTST
    
    Parameters:
    -----------
    seq_len : int
        Length of the input sequence
    pred_len : int
        Length of the prediction
    enc_in : int
        Number of features
    d_model : int
        Model dimension
    n_heads : int
        Number of attention heads
    n_layers : int
        Number of Transformer layers
    patch_len : int
        Length of a patch
    stride : int
        Step between consecutive patches
    """
    
    def __init__(self, seq_len, pred_len, enc_in, d_model=64, n_heads=4, 
                 n_layers=2, patch_len=16, stride=8, dropout=0.1):
        super().__init__()
        self.seq_len = seq_len
        self.pred_len = pred_len
        self.enc_in = enc_in
        
        # Patch embedding
        self.patch_embedding = PatchEmbedding(
            seq_len, patch_len, stride, d_model, dropout
        )
        n_patches = self.patch_embedding.n_patches
        
        # Transformer encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=n_heads,
            dim_feedforward=d_model * 4,
            dropout=dropout,
            batch_first=True
        )
        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        
        # Prediction head (flatten + linear)
        self.flatten = nn.Flatten(start_dim=1)
        self.head = nn.Linear(n_patches * d_model, pred_len)
        
    def forward(self, x):
        # x: [B, L, C]
        B = x.shape[0]
        
        # Patch embedding: [B*C, n_patches, d_model]
        embeddings, n_channels = self.patch_embedding(x)
        
        # Transformer: [B*C, n_patches, d_model]
        encoded = self.transformer_encoder(embeddings)
        
        # Flatten: [B*C, n_patches * d_model]
        flat = self.flatten(encoded)
        
        # Prediction: [B*C, pred_len]
        pred = self.head(flat)
        
        # Reshape: [B, C, pred_len] -> [B, pred_len, C]
        pred = pred.reshape(B, n_channels, self.pred_len)
        pred = pred.permute(0, 2, 1)
        
        return pred


# Instantiate PatchTST
model_patchtst = PatchTST(
    seq_len=SEQ_LEN,
    pred_len=PRED_LEN,
    enc_in=n_features,
    d_model=64,
    n_heads=4,
    n_layers=2,
    patch_len=16,
    stride=8
).to(device)

n_params_patchtst = sum(p.numel() for p in model_patchtst.parameters())

print("PatchTST Model:")
print(f"  Input:  [batch, {SEQ_LEN}, {n_features}]")
print(f"  Output: [batch, {PRED_LEN}, {n_features}]")
print(f"  Parameters: {n_params_patchtst:,}")
print(f"  Patches: {model_patchtst.patch_embedding.n_patches}")

In [None]:
# Train PatchTST

optimizer_patchtst = optim.Adam(model_patchtst.parameters(), lr=LR)
scheduler_patchtst = optim.lr_scheduler.ReduceLROnPlateau(optimizer_patchtst, patience=5, factor=0.5)

print("="*60)
print("PATCHTST TRAINING")
print("="*60)

best_val_loss_patchtst = float('inf')
train_losses_patchtst, val_losses_patchtst = [], []

for epoch in range(EPOCHS):
    train_loss = train_epoch(model_patchtst, train_loader, criterion, optimizer_patchtst, device)
    val_loss, val_mse, val_mae, _, _ = evaluate(model_patchtst, val_loader, criterion, device)
    
    train_losses_patchtst.append(train_loss)
    val_losses_patchtst.append(val_loss)
    
    scheduler_patchtst.step(val_loss)
    
    if val_loss < best_val_loss_patchtst:
        best_val_loss_patchtst = val_loss
        torch.save(model_patchtst.state_dict(), 'patchtst_best.pt')
    
    if (epoch + 1) % 5 == 0:
        print(f"Epoch {epoch+1:3d}/{EPOCHS} | Train Loss: {train_loss:.6f} | Val Loss: {val_loss:.6f}")

print(f"\nBest Val Loss: {best_val_loss_patchtst:.6f}")

In [None]:
# PatchTST evaluation

# map_location keeps the load working if the checkpoint was saved on another device
model_patchtst.load_state_dict(torch.load('patchtst_best.pt', map_location=device))
test_loss_patchtst, test_mse_patchtst, test_mae_patchtst, preds_patchtst, targets_patchtst = evaluate(
    model_patchtst, test_loader, criterion, device
)

print("="*60)
print("PATCHTST - TEST RESULTS")
print("="*60)
print(f"  MSE:  {test_mse_patchtst:.6f}")
print(f"  RMSE: {np.sqrt(test_mse_patchtst):.6f}")
print(f"  MAE:  {test_mae_patchtst:.6f}")

# Direction accuracy within each forecast window (axis=1)
dir_true_patchtst = np.sign(np.diff(targets_patchtst, axis=1))
dir_pred_patchtst = np.sign(np.diff(preds_patchtst, axis=1))
dir_acc_patchtst = np.mean(dir_true_patchtst == dir_pred_patchtst)
print(f"  Direction Accuracy: {dir_acc_patchtst:.2%}")

---

## Part 5: iTransformer - Inverted Attention (ICLR 2024 Spotlight)

### Concept

iTransformer inverts the usual approach: instead of attending over timesteps, it attends over **variables**:

```
Standard Transformer:    iTransformer:
                         
Variables  Variables     Variables  Variables
    |          |             |          |
    v          v             v          v
[t1, t2, t3, t4]         [t1-t4]    [t1-t4]   <- Each var = 1 token
    |                        |          |
    v                        +----+-----+
Attention(time)                   |
                                  v
                          Attention(vars)
```

### Advantages

1. **Cross-variable correlations**: captures the relationships between features
2. **Rich embedding**: each variable token embeds its entire series
3. **Scalability**: O(n_vars^2) instead of O(n_time^2)
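A shape-level sketch of the inversion, in NumPy with random weights, may help before the PyTorch version. It assumes a single head where queries and keys are the variable tokens themselves (no learned Q/K projections), which is a simplification made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, n_vars, d_model = 2, 96, 7, 16

x = rng.normal(size=(batch, seq_len, n_vars))              # [B, L, C]
w_embed = rng.normal(size=(seq_len, d_model)) / np.sqrt(seq_len)

# Invert: each variable's whole lookback window becomes one token
tokens = np.swapaxes(x, 1, 2) @ w_embed                    # [B, C, d_model]

# Scaled dot-product scores between variable tokens
scores = tokens @ np.swapaxes(tokens, 1, 2) / np.sqrt(d_model)   # [B, C, C]
scores -= scores.max(axis=-1, keepdims=True)               # numerical stability
attn = np.exp(scores)
attn /= attn.sum(axis=-1, keepdims=True)                   # softmax over variables

print(tokens.shape, attn.shape)
```

The attention map is `n_vars x n_vars` regardless of the lookback length, which is why the cost scales with the number of variables rather than the number of timesteps.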

In [None]:
# Simplified iTransformer implementation

class iTransformer(nn.Module):
    """
    iTransformer: Inverted Transformers Are Effective for Time Series Forecasting
    
    Paper: ICLR 2024 Spotlight
    Code: https://github.com/thuml/iTransformer
    
    Key insight: Treat each variable as a token, apply attention across variables.
    
    Parameters:
    -----------
    seq_len : int
        Length of the input sequence
    pred_len : int
        Length of the prediction
    enc_in : int
        Number of variables (features)
    d_model : int
        Model dimension
    n_heads : int
        Number of attention heads
    n_layers : int
        Number of layers
    """
    
    def __init__(self, seq_len, pred_len, enc_in, d_model=128, n_heads=4,
                 n_layers=2, dropout=0.1):
        super().__init__()
        self.seq_len = seq_len
        self.pred_len = pred_len
        self.enc_in = enc_in
        
        # Embedding: project each variable (its entire sequence) to d_model
        self.embedding = nn.Linear(seq_len, d_model)
        
        # Transformer encoder (attention across variables)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=n_heads,
            dim_feedforward=d_model * 4,
            dropout=dropout,
            batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        
        # Projection to the prediction horizon
        self.projection = nn.Linear(d_model, pred_len)
        
    def forward(self, x):
        # x: [B, L, C] -> invert to [B, C, L]
        x = x.permute(0, 2, 1)  # [B, C, L]
        
        # Embed each variable: [B, C, d_model]
        x = self.embedding(x)
        
        # Transformer attention across variables: [B, C, d_model]
        x = self.encoder(x)
        
        # Project to prediction: [B, C, pred_len]
        x = self.projection(x)
        
        # Permute back: [B, pred_len, C]
        x = x.permute(0, 2, 1)
        
        return x


# Instantiate iTransformer
model_itransformer = iTransformer(
    seq_len=SEQ_LEN,
    pred_len=PRED_LEN,
    enc_in=n_features,
    d_model=128,
    n_heads=4,
    n_layers=2
).to(device)

n_params_itrans = sum(p.numel() for p in model_itransformer.parameters())

print("iTransformer Model:")
print(f"  Input:  [batch, {SEQ_LEN}, {n_features}]")
print(f"  Output: [batch, {PRED_LEN}, {n_features}]")
print(f"  Parameters: {n_params_itrans:,}")
print(f"  Attention: across {n_features} variables")

In [None]:
# Train iTransformer

optimizer_itrans = optim.Adam(model_itransformer.parameters(), lr=LR)
scheduler_itrans = optim.lr_scheduler.ReduceLROnPlateau(optimizer_itrans, patience=5, factor=0.5)

print("="*60)
print("iTRANSFORMER TRAINING")
print("="*60)

best_val_loss_itrans = float('inf')

for epoch in range(EPOCHS):
    train_loss = train_epoch(model_itransformer, train_loader, criterion, optimizer_itrans, device)
    val_loss, val_mse, val_mae, _, _ = evaluate(model_itransformer, val_loader, criterion, device)
    
    scheduler_itrans.step(val_loss)
    
    if val_loss < best_val_loss_itrans:
        best_val_loss_itrans = val_loss
        torch.save(model_itransformer.state_dict(), 'itransformer_best.pt')
    
    if (epoch + 1) % 5 == 0:
        print(f"Epoch {epoch+1:3d}/{EPOCHS} | Train Loss: {train_loss:.6f} | Val Loss: {val_loss:.6f}")

print(f"\nBest Val Loss: {best_val_loss_itrans:.6f}")

In [None]:
# iTransformer evaluation

# map_location keeps the load working if the checkpoint was saved on another device
model_itransformer.load_state_dict(torch.load('itransformer_best.pt', map_location=device))
test_loss_itrans, test_mse_itrans, test_mae_itrans, preds_itrans, targets_itrans = evaluate(
    model_itransformer, test_loader, criterion, device
)

print("="*60)
print("iTRANSFORMER - TEST RESULTS")
print("="*60)
print(f"  MSE:  {test_mse_itrans:.6f}")
print(f"  RMSE: {np.sqrt(test_mse_itrans):.6f}")
print(f"  MAE:  {test_mae_itrans:.6f}")

# Direction accuracy within each forecast window (axis=1)
dir_true_itrans = np.sign(np.diff(targets_itrans, axis=1))
dir_pred_itrans = np.sign(np.diff(preds_itrans, axis=1))
dir_acc_itrans = np.mean(dir_true_itrans == dir_pred_itrans)
print(f"  Direction Accuracy: {dir_acc_itrans:.2%}")

---

## Part 6: TimeMixer - Multiscale MLP (ICLR 2024)

### Concept

TimeMixer uses **no attention**; it relies instead on **multiscale MLP mixing**:

```
Input [B, L, C]
      |
      v
+------------------------+
| Multi-scale            |
| Decomposition          | -> [scale1, scale2, scale3, ...]
+------------------------+
      |
      v
+------------------------+
| Past-Decomposable      |  MLP mixing across scales
| Mixing                 |
+------------------------+
      |
      v
+------------------------+
| Future-Multipredictor  |
| Mixing                 |
+------------------------+
      |
      v
Output [B, H, C]
```

### Advantages

1. **No attention**: simpler and cheaper than self-attention layers
2. **Multiscale**: captures patterns at different time scales
3. **Effective**: the paper reports state-of-the-art results on several long- and short-term forecasting benchmarks
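
The multiscale idea can be illustrated with plain average pooling. This is a minimal sketch, assuming each coarser scale halves the temporal resolution; `multiscale_views` is a hypothetical helper name, not a function from the TimeMixer repository.

```python
import torch
import torch.nn.functional as F

def multiscale_views(x, n_scales=3):
    """Return views of x ([B, L, C]) at lengths L, L/2, L/4, ... (hypothetical helper)."""
    x = x.permute(0, 2, 1)  # [B, C, L]: avg_pool1d pools over the last axis
    views = []
    for i in range(n_scales):
        k = 2 ** i
        pooled = x if k == 1 else F.avg_pool1d(x, kernel_size=k, stride=k)
        views.append(pooled.permute(0, 2, 1))  # back to [B, L_i, C]
    return views

# One series with values 0..7 so the pooled averages are easy to read
x = torch.arange(8, dtype=torch.float32).reshape(1, 8, 1)
views = multiscale_views(x)
shapes = [tuple(v.shape) for v in views]
print(shapes)
print(views[1].flatten().tolist())
```

Each coarser view smooths out short-term noise, which is why the scales can carry different information about the same series.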

In [None]:
# Simplified TimeMixer implementation

class TimeMixer(nn.Module):
    """
    TimeMixer: Decomposable Multiscale Mixing for Time Series Forecasting
    
    Paper: ICLR 2024
    Code: https://github.com/kwuking/TimeMixer
    
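    Simplified version: one temporal MLP per scale, combined by a learned
    softmax-weighted sum. The paper's seasonal/trend decomposition and
    cross-scale mixing are not implemented here.
    
    Parameters:
    -----------
    seq_len : int
        Length of the input sequence
    pred_len : int
        Length of the forecast horizon
    enc_in : int
        Number of features
    d_model : int
        Model dimension
    n_scales : int
        Number of scales (each one halves the temporal resolution)
    dropout : float
        Dropout rate inside each per-scale MLP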
    """
    
    def __init__(self, seq_len, pred_len, enc_in, d_model=64, n_scales=3, dropout=0.1):
        super().__init__()
        self.seq_len = seq_len
        self.pred_len = pred_len
        self.enc_in = enc_in
        self.n_scales = n_scales
        
        # Downsampling for each scale (scale 0 keeps the full resolution)
        self.downsamples = nn.ModuleList([
            nn.AvgPool1d(kernel_size=2**i, stride=2**i) if i > 0 else nn.Identity()
            for i in range(n_scales)
        ])
        
        # Sequence length at each scale (matches the AvgPool1d output length)
        self.scale_lens = [seq_len // (2**i) for i in range(n_scales)]
        
        # Mixing layers (one MLP per scale)
        self.mixing_layers = nn.ModuleList([
            nn.Sequential(
                nn.Linear(sl, d_model),
                nn.GELU(),
                nn.Dropout(dropout),
                nn.Linear(d_model, d_model)
            )
            for sl in self.scale_lens
        ])
        
        # Scale aggregation
        self.scale_weights = nn.Parameter(torch.ones(n_scales) / n_scales)
        
        # Prediction head
        self.head = nn.Linear(d_model, pred_len)
        
    def forward(self, x):
        # x: [B, L, C]
        B, L, C = x.shape
        
        # Process each variable independently
        x = x.permute(0, 2, 1)  # [B, C, L]
        
        # Multiscale representations
        scale_outputs = []
        
        for i, (downsample, mixing) in enumerate(zip(self.downsamples, self.mixing_layers)):
            # Downsample: [B, C, L_i]
            x_scale = downsample(x)
            
            # Mix: [B, C, d_model]
            x_mixed = mixing(x_scale)
            
            scale_outputs.append(x_mixed)
        
        # Weighted aggregation: [B, C, d_model]
        weights = torch.softmax(self.scale_weights, dim=0)
        aggregated = sum(w * out for w, out in zip(weights, scale_outputs))
        
        # Prediction: [B, C, pred_len]
        pred = self.head(aggregated)
        
        # Output: [B, pred_len, C]
        pred = pred.permute(0, 2, 1)
        
        return pred


# Instantiate TimeMixer
model_timemixer = TimeMixer(
    seq_len=SEQ_LEN,
    pred_len=PRED_LEN,
    enc_in=n_features,
    d_model=64,
    n_scales=3
).to(device)

n_params_mixer = sum(p.numel() for p in model_timemixer.parameters())

print("TimeMixer Model:")
print(f"  Input:  [batch, {SEQ_LEN}, {n_features}]")
print(f"  Output: [batch, {PRED_LEN}, {n_features}]")
print(f"  Parameters: {n_params_mixer:,}")
print(f"  Scales: {model_timemixer.scale_lens}")

In [None]:
# Train TimeMixer

optimizer_mixer = optim.Adam(model_timemixer.parameters(), lr=LR)
scheduler_mixer = optim.lr_scheduler.ReduceLROnPlateau(optimizer_mixer, patience=5, factor=0.5)

print("="*60)
print("TRAINING TIMEMIXER")
print("="*60)

best_val_loss_mixer = float('inf')

for epoch in range(EPOCHS):
    train_loss = train_epoch(model_timemixer, train_loader, criterion, optimizer_mixer, device)
    val_loss, val_mse, val_mae, _, _ = evaluate(model_timemixer, val_loader, criterion, device)
    
    scheduler_mixer.step(val_loss)
    
    if val_loss < best_val_loss_mixer:
        best_val_loss_mixer = val_loss
        torch.save(model_timemixer.state_dict(), 'timemixer_best.pt')
    
    if (epoch + 1) % 5 == 0:
        print(f"Epoch {epoch+1:3d}/{EPOCHS} | Train Loss: {train_loss:.6f} | Val Loss: {val_loss:.6f}")

print(f"\nBest val loss: {best_val_loss_mixer:.6f}")

In [None]:
# Evaluation TimeMixer

# map_location keeps the checkpoint loadable on a CPU-only machine
model_timemixer.load_state_dict(torch.load('timemixer_best.pt', map_location=device))
test_loss_mixer, test_mse_mixer, test_mae_mixer, preds_mixer, targets_mixer = evaluate(
    model_timemixer, test_loader, criterion, device
)

print("="*60)
print("TIMEMIXER - TEST RESULTS")
print("="*60)
print(f"  MSE:  {test_mse_mixer:.6f}")
print(f"  RMSE: {np.sqrt(test_mse_mixer):.6f}")
print(f"  MAE:  {test_mae_mixer:.6f}")

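# Note: flatten() concatenates every sample, horizon step and feature into one
# sequence before differencing, so treat this as a coarse proxy
dir_true_mixer = np.sign(np.diff(targets_mixer.flatten()))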
dir_pred_mixer = np.sign(np.diff(preds_mixer.flatten()))
dir_acc_mixer = np.mean(dir_true_mixer == dir_pred_mixer)
print(f"  Direction Accuracy: {dir_acc_mixer:.2%}")

---

## Part 7: Comparison and Benchmarks (15 min)

In [None]:
# Comparison table

results = pd.DataFrame({
    'Model': ['DLinear', 'PatchTST', 'iTransformer', 'TimeMixer'],
    'MSE': [test_mse, test_mse_patchtst, test_mse_itrans, test_mse_mixer],
    'RMSE': [np.sqrt(test_mse), np.sqrt(test_mse_patchtst), np.sqrt(test_mse_itrans), np.sqrt(test_mse_mixer)],
    'MAE': [test_mae, test_mae_patchtst, test_mae_itrans, test_mae_mixer],
    'Direction Acc': [dir_acc, dir_acc_patchtst, dir_acc_itrans, dir_acc_mixer],
    'Parameters': [n_params, n_params_patchtst, n_params_itrans, n_params_mixer]
})

print("="*80)
print("COMPARISON OF SOTA ARCHITECTURES")
print("="*80)
print(f"\nDataset: {len(df)} days, {n_features} features")
print(f"Sequence: {SEQ_LEN} -> Prediction: {PRED_LEN}")
print(f"\n{results.to_string(index=False)}")

# Best model (lowest test MSE)
best_idx = results['MSE'].idxmin()
print(f"\nBest model (MSE): {results.loc[best_idx, 'Model']}")
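
MSE values are hard to judge in isolation, so a naive "last value" forecast is a common reference point. The sketch below is an illustration on a synthetic random walk, not on the notebook's data; `persistence_forecast` is a hypothetical helper and the window sizes are arbitrary.

```python
import numpy as np

def persistence_forecast(inputs, pred_len):
    """Repeat the last observed step over the horizon: [N, L, C] -> [N, pred_len, C]."""
    last = inputs[:, -1:, :]
    return np.repeat(last, pred_len, axis=1)

rng = np.random.default_rng(0)
# Synthetic random walk standing in for scaled features (assumption)
series = np.cumsum(rng.normal(size=(500, 2)), axis=0)
seq_len, pred_len = 24, 6

# Build sliding windows of inputs and the targets that follow them
starts = range(len(series) - seq_len - pred_len + 1)
inputs = np.stack([series[i:i + seq_len] for i in starts])
targets = np.stack([series[i + seq_len:i + seq_len + pred_len] for i in starts])

naive = persistence_forecast(inputs, pred_len)
naive_mse = float(np.mean((naive - targets) ** 2))
print(f"Persistence MSE: {naive_mse:.4f}")
```

A model whose test MSE is not clearly below this baseline on the same windows is unlikely to add value, whatever its rank in the table.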

In [None]:
# Comparative visualization

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# MSE comparison
ax1 = axes[0, 0]
colors = ['steelblue', 'coral', 'seagreen', 'mediumpurple']
ax1.bar(results['Model'], results['MSE'], color=colors, edgecolor='black')
ax1.set_ylabel('MSE')
ax1.set_title('MSE by Model', fontsize=12, fontweight='bold')
ax1.grid(axis='y', alpha=0.3)

# Direction Accuracy
ax2 = axes[0, 1]
ax2.bar(results['Model'], results['Direction Acc'] * 100, color=colors, edgecolor='black')
ax2.set_ylabel('Direction Accuracy (%)')
ax2.set_title('Direction Accuracy by Model', fontsize=12, fontweight='bold')
ax2.axhline(50, color='red', linestyle='--', alpha=0.5, label='Random (50%)')
ax2.legend()
ax2.grid(axis='y', alpha=0.3)

# Parameters (log scale)
ax3 = axes[1, 0]
ax3.bar(results['Model'], results['Parameters'], color=colors, edgecolor='black')
ax3.set_ylabel('Parameters')
ax3.set_title('Number of Parameters', fontsize=12, fontweight='bold')
ax3.set_yscale('log')
ax3.grid(axis='y', alpha=0.3)

# Efficiency: MSE vs Parameters
ax4 = axes[1, 1]
for i, row in results.iterrows():
    ax4.scatter(row['Parameters'], row['MSE'], s=200, c=colors[i], 
                edgecolors='black', label=row['Model'], zorder=5)
ax4.set_xlabel('Parameters')
ax4.set_ylabel('MSE')
ax4.set_title('Efficiency: MSE vs Parameters', fontsize=12, fontweight='bold')
ax4.set_xscale('log')
ax4.legend()
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Prediction visualization

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

models_preds = [
    ('DLinear', preds_dlinear, targets_dlinear),
    ('PatchTST', preds_patchtst, targets_patchtst),
    ('iTransformer', preds_itrans, targets_itrans),
    ('TimeMixer', preds_mixer, targets_mixer)
]

for ax, (name, preds, targets) in zip(axes.flatten(), models_preds):
    # Take a sample for visualization
    n_show = min(200, len(preds.flatten()))
    
    ax.plot(range(n_show), targets.flatten()[:n_show], 'b-', 
            linewidth=1, label='Actual', alpha=0.7)
    ax.plot(range(n_show), preds.flatten()[:n_show], 'r-', 
            linewidth=1, label='Predicted', alpha=0.7)
    ax.set_xlabel('Index')
    ax.set_ylabel('Price (normalized)')
    ax.set_title(f'{name}: Predictions vs Actual', fontsize=12, fontweight='bold')
    ax.legend(loc='upper right')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Recommendations

print("="*80)
print("RECOMMENDATIONS FOR ALGORITHMIC TRADING")
print("="*80)

recommendations = [
    ("Simple baseline", "DLinear", "Ultra-light, fast, often sufficient"),
    ("Local patterns", "PatchTST", "Captures short-term motifs"),
    ("Multi-asset", "iTransformer", "Cross-asset correlations"),
    ("Multiscale", "TimeMixer", "Patterns at different time scales"),
    ("Long sequences", "Mamba (QC-Py-23)", "O(n) for >1000 timesteps"),
]

# Column widths sized for the longest entries; no backslash inside the f-string
# expression, which is a SyntaxError before Python 3.12
print(f"\n{'Use case':<20} {'Model':<18} {'Reason'}")
print("-" * 70)
for use_case, model, reason in recommendations:
    print(f"{use_case:<20} {model:<18} {reason}")

print("\n" + "="*80)
print("QUANTCONNECT CONSTRAINTS")
print("="*80)
print("""
1. ObjectStore: max ~9 MB per file
   -> Save only the state_dict with torch.save
   -> DLinear (~40 KB) and small PatchTST (~2 MB) fit

2. CPU-only (free tier): no GPU
   -> Fast inference required (<100 ms)
   -> Lightweight models recommended

3. Retraining: off-platform (local GPU), then upload
   -> Save the state_dict locally
   -> Upload it to the ObjectStore via the API
""")

---

## Part 8: QuantConnect Integration (15 min)
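
Before uploading a model, it can help to check the two constraints listed at the end of Part 7: serialized size and CPU latency. The sketch below is a rough measurement recipe under stated assumptions: the `nn.Linear` layer is only a stand-in for a trained model, and the input shape `[1, 7, 96]` is illustrative.

```python
import io
import time
import torch
import torch.nn as nn

# Stand-in model (assumption): a single linear map over time, not a notebook model
model = nn.Linear(96, 24).eval()

# Serialized size of the state_dict, measured in memory
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
size_kb = buffer.getbuffer().nbytes / 1024

# Median CPU latency over repeated single-sample forward passes
x = torch.randn(1, 7, 96)
timings = []
with torch.no_grad():
    for _ in range(50):
        start = time.perf_counter()
        model(x)
        timings.append((time.perf_counter() - start) * 1000)
latency_ms = sorted(timings)[len(timings) // 2]

print(f"state_dict: {size_kb:.1f} KB | median latency: {latency_ms:.3f} ms")
```

The median is used rather than the mean because the first few calls often include one-off warm-up costs.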

In [None]:
# Save/load pattern for the ObjectStore

import io
import pickle

def save_model_for_qc(model, scaler, model_name='dlinear'):
    """
    Prepare the model for the QuantConnect ObjectStore.
    
    Returns:
    --------
    dict with the serialized model and scaler bytes
    """
    # Save the state_dict to an in-memory buffer
    model_buffer = io.BytesIO()
    torch.save(model.state_dict(), model_buffer)
    model_bytes = model_buffer.getvalue()
    
    # Serialize the scaler
    scaler_bytes = pickle.dumps(scaler)
    
    # Total size
    total_size = len(model_bytes) + len(scaler_bytes)
    
    print(f"Model '{model_name}' prepared for QC:")
    print(f"  state_dict: {len(model_bytes):,} bytes ({len(model_bytes)/1024:.1f} KB)")
    print(f"  scaler:     {len(scaler_bytes):,} bytes")
    print(f"  Total:      {total_size:,} bytes ({total_size/1024:.1f} KB)")
    
    if total_size > 9 * 1024 * 1024:  # 9 MB limit
        print("  WARNING: exceeds the ObjectStore limit (~9 MB)!")
    else:
        print("  OK for ObjectStore")
    
    return {
        'model_bytes': model_bytes,
        'scaler_bytes': scaler_bytes,
        'model_name': model_name
    }

# Test with DLinear (the lightest model)
saved_dlinear = save_model_for_qc(model_dlinear, scaler, 'dlinear')

print("\n")
saved_patchtst = save_model_for_qc(model_patchtst, scaler, 'patchtst')
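
It is worth confirming that the serialized bytes reload correctly before uploading them. The round trip below is a minimal sketch: the `nn.Linear` model and the `scaler_stats` dict are hypothetical stand-ins for the trained model and fitted scaler, serialized the same way as in `save_model_for_qc`.

```python
import io
import pickle
import torch
import torch.nn as nn

# Stand-ins for the trained model and fitted scaler (assumptions)
model = nn.Linear(8, 4)
scaler_stats = {"mean": [0.0] * 3, "scale": [1.0] * 3}

# Serialize: state_dict via torch.save, scaler via pickle
buf = io.BytesIO()
torch.save(model.state_dict(), buf)
model_bytes = buf.getvalue()
scaler_bytes = pickle.dumps(scaler_stats)

# Reload into a fresh instance on CPU, as the algorithm would do
restored = nn.Linear(8, 4)
state_dict = torch.load(io.BytesIO(model_bytes), map_location="cpu")
restored.load_state_dict(state_dict)
restored.eval()
restored_stats = pickle.loads(scaler_bytes)

# Both instances should produce identical outputs
x = torch.randn(2, 8)
with torch.no_grad():
    same = torch.allclose(model(x), restored(x))
print("Round trip OK:", same)
```

Since `pickle.loads` can execute arbitrary code, only unpickle scaler bytes that you produced yourself.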

In [None]:
# Complete code for QuantConnect

qc_code = '''
# === INTEGRATION QUANTCONNECT - PYTORCH SOTA MODELS ===

from AlgorithmImports import *
import torch
import torch.nn as nn
import numpy as np
import pickle
import io
from collections import deque
from sklearn.preprocessing import StandardScaler


# === DLinear Model Definition ===

class MovingAvg(nn.Module):
    def __init__(self, kernel_size, stride=1):
        super().__init__()
        self.kernel_size = kernel_size
        self.avg = nn.AvgPool1d(kernel_size=kernel_size, stride=stride, padding=0)
    
    def forward(self, x):
        front = x[:, :1, :].repeat(1, (self.kernel_size - 1) // 2, 1)
        end = x[:, -1:, :].repeat(1, (self.kernel_size - 1) // 2, 1)
        x = torch.cat([front, x, end], dim=1)
        x = self.avg(x.permute(0, 2, 1))
        return x.permute(0, 2, 1)


class SeriesDecomposition(nn.Module):
    def __init__(self, kernel_size):
        super().__init__()
        self.moving_avg = MovingAvg(kernel_size)
    
    def forward(self, x):
        trend = self.moving_avg(x)
        seasonal = x - trend
        return seasonal, trend


class DLinear(nn.Module):
    def __init__(self, seq_len, pred_len, enc_in, individual=True):
        super().__init__()
        self.seq_len = seq_len
        self.pred_len = pred_len
        self.enc_in = enc_in
        self.individual = individual
        
        kernel_size = 25
        self.decomposition = SeriesDecomposition(kernel_size)
        
        if individual:
            self.Linear_Seasonal = nn.ModuleList([
                nn.Linear(seq_len, pred_len) for _ in range(enc_in)
            ])
            self.Linear_Trend = nn.ModuleList([
                nn.Linear(seq_len, pred_len) for _ in range(enc_in)
            ])
        else:
            self.Linear_Seasonal = nn.Linear(seq_len, pred_len)
            self.Linear_Trend = nn.Linear(seq_len, pred_len)
    
    def forward(self, x):
        seasonal, trend = self.decomposition(x)
        seasonal = seasonal.permute(0, 2, 1)
        trend = trend.permute(0, 2, 1)
        
        if self.individual:
            seasonal_output = torch.zeros(
                [x.size(0), self.enc_in, self.pred_len], device=x.device
            )
            trend_output = torch.zeros(
                [x.size(0), self.enc_in, self.pred_len], device=x.device
            )
            for i in range(self.enc_in):
                seasonal_output[:, i, :] = self.Linear_Seasonal[i](seasonal[:, i, :])
                trend_output[:, i, :] = self.Linear_Trend[i](trend[:, i, :])
        else:
            seasonal_output = self.Linear_Seasonal(seasonal)
            trend_output = self.Linear_Trend(trend)
        
        output = seasonal_output + trend_output
        return output.permute(0, 2, 1)


# === Alpha Model ===

class DLinearAlphaModel(AlphaModel):
    """
    Alpha Model using DLinear (SOTA 2023) for prediction.
    
    Lightweight architecture: about 130 KB of float32 weights with the
    defaults below, far under the ObjectStore limit.
    """
    
    def __init__(self, seq_len=96, pred_len=24, n_features=7,
                 model_key="models/dlinear", prediction_threshold=0.005):
        self.seq_len = seq_len
        self.pred_len = pred_len
        self.n_features = n_features
        self.model_key = model_key
        self.prediction_threshold = prediction_threshold
        
        self.model = None
        self.scaler = None
        self.symbol_data = {}
        self.device = torch.device("cpu")
    
    def Update(self, algorithm, data):
        insights = []
        
        # Load the model on first use
        if self.model is None:
            self._load_model(algorithm)
            if self.model is None:
                return insights
        
        for symbol, sd in self.symbol_data.items():
            # Require a TradeBar: DLinearSymbolData.Update reads Close and Volume
            if not data.Bars.ContainsKey(symbol):
                continue
            
            # Update the rolling feature window
            sd.Update(data.Bars[symbol])
            
            if len(sd.features) < self.seq_len:
                continue
            
            # Prepare the input
            features = np.array(list(sd.features))
            features_scaled = self.scaler.transform(features)
            
            # Tensor: [1, seq_len, n_features]
            x = torch.FloatTensor(features_scaled).unsqueeze(0).to(self.device)
            
            # Prediction
            with torch.no_grad():
                pred = self.model(x)  # [1, pred_len, n_features]
            
            # Extract the price forecast (feature 0), averaged over the horizon,
            # and map it back to price units (assumes a fitted StandardScaler)
            pred_scaled = pred[0, :, 0].mean().item()
            pred_price = pred_scaled * self.scaler.scale_[0] + self.scaler.mean_[0]
            current_price = features[-1, 0]
            
            # Fractional return, comparable to prediction_threshold
            predicted_return = pred_price / current_price - 1
            
            # Signal
            if abs(predicted_return) >= self.prediction_threshold:
                direction = InsightDirection.Up if predicted_return > 0 else InsightDirection.Down
                confidence = min(abs(predicted_return) / 0.05, 1.0)
                
                insight = Insight.Price(
                    symbol,
                    timedelta(days=self.pred_len),
                    direction,
                    magnitude=abs(predicted_return),
                    confidence=confidence
                )
                insights.append(insight)
        
        return insights
    
    def OnSecuritiesChanged(self, algorithm, changes):
        for security in changes.AddedSecurities:
            symbol = security.Symbol
            if symbol not in self.symbol_data:
                self.symbol_data[symbol] = DLinearSymbolData(
                    algorithm, symbol, self.seq_len
                )
        
        for security in changes.RemovedSecurities:
            symbol = security.Symbol
            if symbol in self.symbol_data:
                del self.symbol_data[symbol]
    
    def _load_model(self, algorithm):
        """Load the model and scaler from the ObjectStore."""
        if not algorithm.ObjectStore.ContainsKey(self.model_key):
            algorithm.Debug(f"Model not found: {self.model_key}")
            return
        
        try:
            # Load the state_dict
            model_bytes = algorithm.ObjectStore.ReadBytes(self.model_key)
            state_dict = torch.load(io.BytesIO(bytes(model_bytes)), map_location=self.device)
            
            # Instantiate the model
            self.model = DLinear(
                seq_len=self.seq_len,
                pred_len=self.pred_len,
                enc_in=self.n_features,
                individual=True
            ).to(self.device)
            self.model.load_state_dict(state_dict)
            self.model.eval()
            
            # Load the scaler
            scaler_bytes = algorithm.ObjectStore.ReadBytes(self.model_key + "_scaler")
            self.scaler = pickle.loads(bytes(scaler_bytes))
            
            algorithm.Debug(f"DLinear model loaded from {self.model_key}")
        
        except Exception as e:
            # Reset both so Update never runs with a model but no scaler
            self.model = None
            self.scaler = None
            algorithm.Debug(f"Model loading error: {e}")


class DLinearSymbolData:
    """Stores the rolling feature window for one symbol."""
    
    def __init__(self, algorithm, symbol, seq_len):
        self.symbol = symbol
        self.seq_len = seq_len
        self.features = deque(maxlen=seq_len)
        
        # Indicators
        self.sma_20 = algorithm.SMA(symbol, 20)
        self.sma_50 = algorithm.SMA(symbol, 50)
        self.rsi = algorithm.RSI(symbol, 14)
        
        self.last_close = None
    
    def Update(self, bar):
        """Update the features with the new bar."""
        if not self.sma_20.IsReady:
            return
        
        # Compute the features
        close = float(bar.Close)
        volume = float(bar.Volume)
        returns = (close / self.last_close - 1) if self.last_close else 0
        sma_20 = float(self.sma_20.Current.Value)
        sma_50 = float(self.sma_50.Current.Value) if self.sma_50.IsReady else sma_20
        rsi = float(self.rsi.Current.Value) if self.rsi.IsReady else 50
        volatility = abs(returns) * np.sqrt(252)  # Approximation
        
        feature_vector = [close, volume, returns, sma_20, sma_50, rsi, volatility]
        self.features.append(feature_vector)
        
        self.last_close = close


# === Complete Strategy ===

class DLinearTradingStrategy(QCAlgorithm):
    """
    Strategy using DLinear (SOTA 2023) for prediction.
    
    - Model: DLinear (decomposition + linear)
    - Input: 96 days, 7 features
    - Output: 24-day forecast
    - Signal: mean of the forecasts
    """
    
    def Initialize(self):
        self.SetStartDate(2022, 1, 1)
        self.SetEndDate(2023, 12, 31)
        self.SetCash(100000)
        
        # Universe
        tickers = ["AAPL", "MSFT", "GOOGL", "AMZN", "NVDA"]
        for ticker in tickers:
            self.AddEquity(ticker, Resolution.Daily)
        
        # Alpha Model
        self.SetAlpha(DLinearAlphaModel(
            seq_len=96,
            pred_len=24,
            n_features=7,
            model_key="models/dlinear",
            prediction_threshold=0.005
        ))
        
        # Portfolio Construction
        self.SetPortfolioConstruction(EqualWeightingPortfolioConstructionModel())
        
        # Execution
        self.SetExecution(ImmediateExecutionModel())
        
        # Risk Management
        self.SetRiskManagement(MaximumDrawdownPercentPerSecurity(0.10))
        
        self.SetWarmup(100)
        self.Debug("DLinear Trading Strategy initialized")
'''

print("QuantConnect code:")
print("="*60)
print("- DLinear model definition")
print("- DLinearAlphaModel (loads from ObjectStore)")
print("- DLinearSymbolData (features with indicators)")
print("- DLinearTradingStrategy (complete strategy)")
# Count the lines instead of hard-coding an estimate
print(f"\nCode size: {len(qc_code.splitlines())} lines")

---

## Conclusion and Next Steps

### Summary

| Part | Topic | Key Points |
|------|-------|------------|
| 1 | Evolution 2015-2026 | LSTM -> Transformers -> DLinear -> SSMs |
| 2 | PyTorch Setup | Device, DataLoader, TimeSeriesDataset |
| 3 | DLinear | Decomposition + Linear, strong baseline |
| 4 | PatchTST | Patch-based tokenization, ICLR 2023 |
| 5 | iTransformer | Attention across variables, ICLR 2024 |
| 6 | TimeMixer | Multiscale MLP, no attention |
| 7 | Comparison | MSE, Direction Accuracy, Parameters |
| 8 | QC Integration | ObjectStore, Alpha Model |

### Key SOTA Takeaways 2024-2026

| Concept | Insight |
|---------|---------|
| **Simplicity** | DLinear (linear layers) often matches or beats complex Transformers |
| **Patches** | Patch-level tokens tend to work better than point-wise attention |
| **Variables** | iTransformer: attention over inter-series correlations |
| **Multiscale** | TimeMixer: patterns at different time scales |
| **Efficiency** | Fewer parameters often generalize better on noisy data |

### Limitations and Warnings

| Limitation | Description | Mitigation |
|------------|-------------|------------|
| **Regime changes** | Models degrade during crises | Frequent retraining, ensembling |
| **Overfitting** | Financial series are noisy | Regularization, early stopping |
| **Inference time** | Matters for HFT | Lightweight models (DLinear) |
| **Data leakage** | Future information can leak into training | Walk-forward validation, strict temporal Train/Val/Test split |
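
Walk-forward validation, named above as the mitigation for data leakage, can be sketched as expanding training windows followed by a strictly later test block. `walk_forward_splits` and its default sizes are hypothetical and only illustrate the indexing.

```python
def walk_forward_splits(n_samples, n_folds=3, min_train=100, test_size=50):
    """Yield (train_indices, test_indices) with every test block after its train block."""
    for fold in range(n_folds):
        train_end = min_train + fold * test_size
        test_end = train_end + test_size
        if test_end > n_samples:
            break  # not enough data left for a full test block
        yield range(0, train_end), range(train_end, test_end)

splits = list(walk_forward_splits(n_samples=300))
for train_idx, test_idx in splits:
    print(f"train [0, {train_idx.stop}) -> test [{test_idx.start}, {test_idx.stop})")
```

For this to prevent leakage, any scaler should be fit on the training indices of each fold only and then applied to the test block.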

### Next Steps

| Notebook | Content |
|----------|---------|
| **QC-Py-23** | State Space Models (Mamba) - O(n) for long sequences |
| **QC-Py-24** | Generative Anomaly Detection (VAE-Transformer + HMM) |
| **QC-Py-25** | Reinforcement Learning (PPO/DQN) |

### SOTA Resources

- [Time-Series-Library](https://github.com/thuml/Time-Series-Library) - Unified framework from Tsinghua (20+ models)
- [PatchTST Paper](https://arxiv.org/abs/2211.14730) - ICLR 2023
- [iTransformer Paper](https://arxiv.org/abs/2310.06625) - ICLR 2024 Spotlight
- [TimeMixer Paper](https://arxiv.org/abs/2405.14616) - ICLR 2024
- [Are Transformers Effective?](https://arxiv.org/abs/2205.13504) - AAAI 2023
- [Awesome Time Series](https://github.com/qingsongedu/time-series-transformers-review) - Survey

---

**Notebook complete. The SOTA 2024-2026 architectures offer a strong trade-off between performance and efficiency. DLinear remains a remarkably strong baseline, while PatchTST and iTransformer bring significant innovations.**