# QC-Py-23 - State Space Models (Mamba) for Trading

> **Beyond Transformers: O(n) complexity for long time series**
> Duration: 110 minutes | Level: Advanced | Python + PyTorch

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Explain the **quadratic O(n²) attention problem** of Transformers
2. Master the theory of **State Space Models (S4 → Mamba)**
3. Implement a **Mamba block** for time series in PyTorch
4. Build an **SST Hybrid** model (Mamba + Transformer)
5. Apply **CMDMamba**, a Mamba variant tuned for finance (2025)
6. Compare **LSTM vs Transformer vs Mamba** on financial data
7. Integrate the model into a **QuantConnect strategy** with ObjectStore

## Prerequisites

- Notebook QC-Py-22 completed (PyTorch, modern architectures)
- Understanding of Transformers and attention
- Familiarity with RNNs/LSTMs
- Basic knowledge of differential equations (optional but useful)

## Notebook Structure

| Part | Topic | Duration |
|------|-------|----------|
| 1 | The O(n²) attention problem | 10 min |
| 2 | State Space Model theory | 20 min |
| 3 | From S4 to Mamba: Selective State Spaces | 20 min |
| 4 | Mamba implementation in PyTorch | 20 min |
| 5 | SST Hybrid: Mamba + Transformer | 15 min |
| 6 | LSTM vs Transformer vs Mamba comparison | 10 min |
| 7 | QuantConnect integration | 15 min |

## Key References (2022-2025)

| Paper | Venue | Contribution |
|-------|-------|-------------|
| **Mamba** | arXiv 2312.00752 | Selective State Spaces, hardware-aware |
| **S4** | ICLR 2022 | Structured State Space Sequences |
| **SST** | CIKM 2024 | Hybrid Mamba-Transformer |
| **CMDMamba** | Frontiers AI 2025 | Dual-layer Mamba for finance |

---

## Part 1: The O(n²) Attention Problem (10 min)

### Why do Transformers hit a limit?

Transformer attention computes a **Q × K^T** score matrix of size **(n × n)**, where n is the sequence length.

| Sequence length | Matrix entries | Memory (float32, one head) | Relative time |
|------------------|----------------|-------------------|---------------|
| 60 days | 3,600 | 14 KB | 1x |
| 252 days (1 year) | 63,504 | 254 KB | 18x |
| 504 days (2 years) | 254,016 | 1 MB | 71x |
| 1,260 days (5 years) | 1,587,600 | 6.3 MB | 441x |
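
The figures in this table follow directly from the n × n shape of the score matrix. A minimal sketch, assuming a single float32 matrix (4 bytes per entry) for one head and one sample, with the 60-day window as the baseline; the helper name `attention_matrix_stats` is illustrative:

```python
# Size of one attention score matrix (single head, single sample, float32).
BYTES_PER_FLOAT32 = 4
BASELINE_DAYS = 60


def attention_matrix_stats(seq_len: int) -> dict:
    """Entry count, memory in KB, and cost relative to the 60-day baseline."""
    entries = seq_len * seq_len
    return {
        "entries": entries,
        "kilobytes": entries * BYTES_PER_FLOAT32 / 1000,
        "relative_cost": entries / (BASELINE_DAYS * BASELINE_DAYS),
    }


for days in (60, 252, 504, 1260):
    s = attention_matrix_stats(days)
    print(f"{days:>5} days: {s['entries']:>9,} entries, "
          f"{s['kilobytes']:>8.1f} KB, {s['relative_cost']:>6.1f}x")
```

In a real model these numbers are multiplied by the batch size, the number of heads, and the number of layers.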

### Impact on trading

- **Seasonal patterns**: require at least ~252 days
- **Economic cycles**: require 3-5 years of data
- **Market regimes**: shift over several years

**Conclusion**: the cost of standard attention grows quadratically with the lookback window. Once multiplied by batch size, heads, and layers, it becomes the bottleneck for long financial series.

### Alternatives explored

| Approach | Complexity | Limitation |
|----------|------------|------------|
| **Sparse Attention** (Sparse Transformer) | O(n√n) | Fixed patterns, information loss |
| **Linear Attention** (Performer) | O(n) | Approximation, less accurate |
| **Chunked Attention** | O(n×k) | Loses very long-range dependencies |
| **State Space Models** | O(n) | ✅ Exact, scalable, hardware-friendly |

In [None]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Configuration matplotlib
plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline

# PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

# Sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Math
import math
from typing import Optional, Tuple

# Seed
torch.manual_seed(42)
np.random.seed(42)

# Device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"PyTorch version: {torch.__version__}")
print(f"Device: {device}")
print("\nThis notebook is designed to run CPU-first.")
print("SSM cost grows linearly with sequence length, which keeps small models practical on CPU.")

In [None]:
def visualize_complexity_comparison():
    """
    Visualize the complexity gap between a Transformer and an SSM.
    """
    seq_lengths = np.array([60, 120, 252, 504, 756, 1008, 1260])
    
    # Theoretical cost in arbitrary units
    transformer_complexity = seq_lengths ** 2  # O(n²)
    ssm_complexity = seq_lengths  # O(n)
    
    # Normalize to the 60-day baseline
    transformer_norm = transformer_complexity / transformer_complexity[0]
    ssm_norm = ssm_complexity / ssm_complexity[0]
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Theoretical complexity
    ax1 = axes[0]
    ax1.plot(seq_lengths, transformer_norm, 'o-', label='Transformer O(n²)', 
             color='coral', linewidth=2, markersize=8)
    ax1.plot(seq_lengths, ssm_norm, 's-', label='SSM/Mamba O(n)', 
             color='steelblue', linewidth=2, markersize=8)
    ax1.set_xlabel('Sequence length (days)', fontsize=11)
    ax1.set_ylabel('Relative complexity (baseline: 60 days)', fontsize=11)
    ax1.set_title('Computational Complexity', fontsize=13, fontweight='bold')
    ax1.legend(fontsize=10)
    ax1.grid(True, alpha=0.3)
    ax1.set_yscale('log')
    
    # Annotations
    ax1.annotate('1 year\n(252d)', xy=(252, 17.64), xytext=(300, 50),
                arrowprops=dict(arrowstyle='->', color='gray'),
                fontsize=9, color='gray')
    ax1.annotate('5 years\n(1260d)', xy=(1260, 441), xytext=(1100, 150),
                arrowprops=dict(arrowstyle='->', color='gray'),
                fontsize=9, color='gray')
    
    # Estimated memory footprint
    ax2 = axes[1]
    # Attention matrix: n² × 4 bytes × batch_size × num_heads
    batch_size = 32
    num_heads = 4
    memory_transformer_mb = (seq_lengths ** 2 * 4 * batch_size * num_heads) / (1024 ** 2)
    memory_ssm_mb = (seq_lengths * 64 * 4 * batch_size) / (1024 ** 2)  # state_size=64
    
    ax2.bar(np.arange(len(seq_lengths)) - 0.2, memory_transformer_mb, 0.4, 
           label='Transformer', color='coral', alpha=0.8)
    ax2.bar(np.arange(len(seq_lengths)) + 0.2, memory_ssm_mb, 0.4, 
           label='SSM/Mamba', color='steelblue', alpha=0.8)
    ax2.set_xticks(np.arange(len(seq_lengths)))
    ax2.set_xticklabels([f'{n}d' for n in seq_lengths])
    ax2.set_xlabel('Sequence length', fontsize=11)
    ax2.set_ylabel('Estimated memory (MB)', fontsize=11)
    ax2.set_title('Memory Consumption (batch=32)', fontsize=13, fontweight='bold')
    ax2.legend(fontsize=10)
    ax2.grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()
    
    print("\nConclusion:")
    print(f"  - At 1260 days (5 years), Transformer attention costs {transformer_norm[-1]:.0f}x the 60-day baseline")
    print(f"  - SSM/Mamba scales linearly: {ssm_norm[-1]:.0f}x the 60-day baseline")
    print(f"  - Transformer memory for 5 years: {memory_transformer_mb[-1]:.1f} MB")
    print(f"  - SSM memory for 5 years: {memory_ssm_mb[-1]:.1f} MB")

visualize_complexity_comparison()

---

## Part 2: State Space Model Theory (20 min)

### What is a State Space Model?

An SSM is defined by a system of linear differential equations:

$$\frac{dh(t)}{dt} = Ah(t) + Bx(t)$$
$$y(t) = Ch(t) + Dx(t)$$

Where:
- **x(t)**: Input (input signal, e.g. prices)
- **h(t)**: Hidden state (internal memory)
- **y(t)**: Output (prediction)
- **A, B, C, D**: Learned parameter matrices

### Discretization for time series

To process discrete data (days, hours), the system is discretized with a step size Δ:

$$\bar{A} = e^{\Delta A}$$
$$\bar{B} = (\Delta A)^{-1}(e^{\Delta A} - I) \cdot \Delta B$$

This yields the recurrence below, where the term $Dx_k$ is a plain skip connection and is left out for brevity:

$$h_k = \bar{A} h_{k-1} + \bar{B} x_k$$
$$y_k = C h_k$$
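
For a scalar state the discretization formulas collapse to two lines, which makes them easy to check. The sketch below is an illustration under that assumption (scalar `a`, `b`, `c` with `a != 0`; the function names are hypothetical). With a constant input, zero-order hold reproduces the continuous solution $h(t) = \frac{b}{a}(e^{at} - 1)$ exactly at the sampling instants.

```python
import math
from typing import List, Tuple


def zoh_discretize(a: float, b: float, delta: float) -> Tuple[float, float]:
    """Zero-order-hold discretization of dh/dt = a*h + b*x for scalar a != 0."""
    a_bar = math.exp(delta * a)
    # (delta*a)^-1 * (exp(delta*a) - 1) * delta*b simplifies to this expression
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar


def run_recurrence(a_bar: float, b_bar: float, c: float, xs: List[float]) -> List[float]:
    """h_k = a_bar * h_{k-1} + b_bar * x_k, y_k = c * h_k, starting from h = 0."""
    h, ys = 0.0, []
    for x in xs:
        h = a_bar * h + b_bar * x
        ys.append(c * h)
    return ys


a, b, c, delta = -0.5, 1.0, 2.0, 0.1
a_bar, b_bar = zoh_discretize(a, b, delta)
ys = run_recurrence(a_bar, b_bar, c, xs=[1.0] * 5)

# Continuous-time solution at t = 5 * delta for a constant input x = 1
exact = c * (b / a) * (math.exp(a * 5 * delta) - 1.0)
print(f"discrete: {ys[-1]:.6f}, continuous: {exact:.6f}")
```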

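### Advantage: parallel computation via convolution

Because the recurrence is linear and its parameters do not change over time, the whole output sequence can be computed as a **convolution**:

$$y = x * \bar{K}$$

where $\bar{K} = (C\bar{B}, C\bar{A}\bar{B}, C\bar{A}^2\bar{B}, ...)$ is the SSM kernel.

**This dual view is the key to scalability**: training can evaluate the convolution with an FFT in O(n log n) and in parallel, while inference can run the recurrence in O(n) with a constant-size state.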
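
The equivalence between the recurrent and convolutional views can be verified numerically. This is a minimal sketch for a scalar state with arbitrary illustrative values; the convolution is written in its direct O(n²) form for readability, whereas a practical implementation would use an FFT.

```python
from typing import List


def ssm_kernel(a_bar: float, b_bar: float, c: float, length: int) -> List[float]:
    """K = (c*b_bar, c*a_bar*b_bar, c*a_bar^2*b_bar, ...) for a scalar state."""
    return [c * (a_bar ** k) * b_bar for k in range(length)]


def causal_conv(xs: List[float], kernel: List[float]) -> List[float]:
    """y_k = sum_{j=0..k} K_j * x_{k-j} (direct form, no FFT)."""
    return [
        sum(kernel[j] * xs[k - j] for j in range(k + 1))
        for k in range(len(xs))
    ]


def recurrence(a_bar: float, b_bar: float, c: float, xs: List[float]) -> List[float]:
    """Step-by-step evaluation of h_k = a_bar*h_{k-1} + b_bar*x_k, y_k = c*h_k."""
    h, ys = 0.0, []
    for x in xs:
        h = a_bar * h + b_bar * x
        ys.append(c * h)
    return ys


xs = [0.3, -1.2, 0.8, 0.0, 2.1, -0.4]
a_bar, b_bar, c = 0.9, 0.5, 1.5

y_rec = recurrence(a_bar, b_bar, c, xs)
y_conv = causal_conv(xs, ssm_kernel(a_bar, b_bar, c, len(xs)))
max_gap = max(abs(r - v) for r, v in zip(y_rec, y_conv))
print(f"largest difference between the two views: {max_gap:.2e}")
```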

In [None]:
def demonstrate_ssm_concept():
    """
    Visually demonstrate the State Space Model concept.
    """
    # Simple parameters for illustration
    seq_len = 100
    state_dim = 4
    
    # Matrices (simplified, already in discrete form)
    np.random.seed(42)
    A = np.eye(state_dim) * 0.95  # State matrix (stable)
    A[0, 1] = 0.1
    A[1, 0] = -0.1
    B = np.random.randn(state_dim, 1) * 0.3
    C = np.random.randn(1, state_dim) * 0.5
    
    # Input signal (simulated price)
    t = np.linspace(0, 10, seq_len)
    x = np.sin(t) + 0.5 * np.sin(3*t) + np.random.randn(seq_len) * 0.2
    
    # Recurrent simulation
    h = np.zeros((seq_len, state_dim))
    y = np.zeros(seq_len)
    
    for k in range(seq_len):
        # The state before the first sample is zero, so step 0 only sees B * x[0]
        h_prev = h[k-1] if k > 0 else np.zeros(state_dim)
        h[k] = A @ h_prev + B.flatten() * x[k]
        y[k] = (C @ h[k]).item()  # C @ h[k] has shape (1,); store it as a scalar
    
    # Visualization
    fig, axes = plt.subplots(2, 2, figsize=(14, 8))
    
    # Input
    ax1 = axes[0, 0]
    ax1.plot(t, x, color='steelblue', alpha=0.8)
    ax1.set_xlabel('Time')
    ax1.set_ylabel('Value')
    ax1.set_title('Input x(t): input signal', fontweight='bold')
    ax1.grid(True, alpha=0.3)
    
    # Hidden states
    ax2 = axes[0, 1]
    for i in range(state_dim):
        ax2.plot(t, h[:, i], label=f'h{i+1}', alpha=0.7)
    ax2.set_xlabel('Time')
    ax2.set_ylabel('State')
    ax2.set_title('Hidden states h(t): internal memory', fontweight='bold')
    ax2.legend(loc='upper right')
    ax2.grid(True, alpha=0.3)
    
    # Output
    ax3 = axes[1, 0]
    ax3.plot(t, y, color='coral', alpha=0.8)
    ax3.set_xlabel('Time')
    ax3.set_ylabel('Output')
    ax3.set_title('Output y(t): prediction', fontweight='bold')
    ax3.grid(True, alpha=0.3)
    
    # Input/output comparison
    ax4 = axes[1, 1]
    ax4.plot(t, x / np.abs(x).max(), label='Input (norm)', color='steelblue', alpha=0.6)
    ax4.plot(t, y / np.abs(y).max(), label='Output (norm)', color='coral', alpha=0.8)
    ax4.set_xlabel('Time')
    ax4.set_ylabel('Normalized value')
    ax4.set_title('Input vs Output: SSM transformation', fontweight='bold')
    ax4.legend()
    ax4.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("\nInterpretation:")
    print("  - The SSM transforms the input through its hidden state")
    print("  - Each state h_i captures a different aspect of the signal")
    print("  - The output is a linear combination of the states")
    print("  - The matrix A controls the 'memory' (decay)")

demonstrate_ssm_concept()

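### S4: Structured State Space Sequence Model (ICLR 2022)

The S4 paper by Gu et al. made SSMs practical on long sequences by addressing their computational cost and numerical instability with:

1. **HiPPO initialization**: a specially structured matrix A that preserves a compressed history of the input
2. **NPLR representation**: Normal Plus Low-Rank, for efficient computation
3. **Cauchy kernel**: reduces the computation of the convolution kernel $\bar{K}$ to a well-studied primitive, after which the convolution is applied with an FFT

$$A = \underbrace{-\tfrac{1}{2} I + S}_{\text{normal}} - \underbrace{P P^\top}_{\text{low-rank}}, \qquad S^\top = -S$$

**Result**: the first SSM to be competitive with Transformers on the Long Range Arena benchmark.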
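
To make the HiPPO initialization concrete, the sketch below builds the HiPPO-LegS matrix as defined in the S4 paper: $A_{nk} = -\sqrt{(2n+1)(2k+1)}$ for $n > k$, $-(n+1)$ for $n = k$, and $0$ for $n < k$. The function name and the list-of-rows layout are illustrative choices, not part of any library.

```python
import math
from typing import List


def hippo_legs(n_state: int) -> List[List[float]]:
    """HiPPO-LegS state matrix (n_state x n_state), returned as a list of rows."""
    A = [[0.0] * n_state for _ in range(n_state)]
    for n in range(n_state):
        for k in range(n_state):
            if n > k:
                A[n][k] = -math.sqrt(2 * n + 1) * math.sqrt(2 * k + 1)
            elif n == k:
                A[n][k] = -(n + 1.0)
            # entries above the diagonal stay at 0
    return A


A = hippo_legs(4)
for row in A:
    print("  ".join(f"{v:7.3f}" for v in row))
```

The matrix is lower triangular with a negative diagonal, so all its eigenvalues are negative and the continuous system is stable. Adding the rank-1 term $PP^\top$ with $P_n = \sqrt{n + 1/2}$ leaves $-\tfrac{1}{2}I$ plus a skew-symmetric matrix, which is the normal-plus-low-rank structure that NPLR exploits.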

In [None]:
class S4Layer(nn.Module):
    """
    Simplified S4 Layer for educational purposes.
    
    Based on: "Efficiently Modeling Long Sequences with Structured State Spaces"
    Gu et al., ICLR 2022
    
    Note: This is a simplified implementation. For production, use:
    - https://github.com/state-spaces/s4
    - https://github.com/HazyResearch/safari
    """
    
    def __init__(self, d_model: int, d_state: int = 64, dropout: float = 0.0):
        """
        Parameters:
        -----------
        d_model : int
            Model dimension (input/output size)
        d_state : int
            State dimension (N in the paper)
        dropout : float
            Dropout rate
        """
        super().__init__()
        self.d_model = d_model
        self.d_state = d_state
        
        # SSM parameters
        # A: diagonal state matrix, stored per channel as (d_model, d_state).
        # It is fixed (a buffer, not trained) and uses a HiPPO-inspired initialization.
        A = torch.arange(1, d_state + 1, dtype=torch.float32)
        A = -0.5 + 1j * math.pi * A  # Complex for oscillatory behavior
        # repeat() allocates real storage, so the buffer can be saved and reloaded safely
        self.register_buffer('A', torch.view_as_real(A.unsqueeze(0).repeat(d_model, 1)))
        
        # B: input matrix (d_model, d_state)
        self.B = nn.Parameter(torch.randn(d_model, d_state) * 0.02)
        
        # C: output matrix (d_model, d_state)
        self.C = nn.Parameter(torch.randn(d_model, d_state) * 0.02)
        
        # Delta: discretization step (learned per channel)
        self.log_delta = nn.Parameter(torch.randn(d_model) * 0.1 - 4.0)
        
        # D: skip connection
        self.D = nn.Parameter(torch.ones(d_model))
        
        # Dropout
        self.dropout = nn.Dropout(dropout)
    
    def discretize(self):
        """
        Discretize continuous SSM parameters.
        
        Returns:
        --------
        A_bar, B_bar : discretized matrices
        """
        delta = torch.exp(self.log_delta)  # (d_model,)
        
        # Get complex A
        A = torch.view_as_complex(self.A)  # (d_model, d_state)
        
        # Discretization: exact zero-order hold (ZOH) for A, first-order approximation for B
        # A_bar = exp(delta * A)
        A_bar = torch.exp(delta.unsqueeze(-1) * A)  # (d_model, d_state)
        
        # B_bar = (A^-1)(exp(delta*A) - I) * delta * B
        # Simplified: B_bar ≈ delta * B for small delta
        B_bar = delta.unsqueeze(-1) * self.B  # (d_model, d_state)
        
        return A_bar, B_bar
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass using recurrent formulation.
        
        Parameters:
        -----------
        x : tensor
            Input of shape (batch, seq_len, d_model)
        
        Returns:
        --------
        tensor : Output of shape (batch, seq_len, d_model)
        """
        batch, seq_len, d_model = x.shape
        
        # Discretize
        A_bar, B_bar = self.discretize()
        
        # Initialize state
        h = torch.zeros(batch, d_model, self.d_state, dtype=torch.complex64, device=x.device)
        
        # Recurrent computation
        outputs = []
        for t in range(seq_len):
            x_t = x[:, t, :]  # (batch, d_model)
            
            # State update: h_t = A_bar * h_{t-1} + B_bar * x_t
            h = A_bar.unsqueeze(0) * h + B_bar.unsqueeze(0) * x_t.unsqueeze(-1).to(torch.complex64)
            
            # Output: y_t = Re(C * h_t) + D * x_t
            y_t = torch.real(torch.sum(self.C.unsqueeze(0) * h, dim=-1)) + self.D * x_t
            outputs.append(y_t)
        
        y = torch.stack(outputs, dim=1)  # (batch, seq_len, d_model)
        y = self.dropout(y)
        
        return y


# Test S4 Layer
print("S4Layer test:")
s4_layer = S4Layer(d_model=32, d_state=16)
test_input = torch.randn(2, 60, 32)  # (batch, seq_len, d_model)
output = s4_layer(test_input)

print(f"  Input shape: {test_input.shape}")
print(f"  Output shape: {output.shape}")
print(f"  Parameters: {sum(p.numel() for p in s4_layer.parameters()):,}")

---

## Part 3: From S4 to Mamba - Selective State Spaces (20 min)

### Limitation of S4: fixed matrices

In S4, the matrices A, B, C are **independent of the input**: the same dynamics are applied at every time step. This is a problem for:

- **Selectivity**: the model cannot "ignore" specific inputs
- **Context-awareness**: no adaptation to content

### Mamba: Selective State Spaces

Mamba's key innovation (Gu & Dao, 2023) is to make B, C, Δ **input-dependent**:

$$B_t = \text{Linear}_B(x_t)$$
$$C_t = \text{Linear}_C(x_t)$$
$$\Delta_t = \text{softplus}(\text{Linear}_\Delta(x_t))$$

### Why "Selective"?

The model can now **select** which information to store in its state:

| Input | Small Δ | Large Δ |
|-------|---------|--------|
| Important information | - | ✓ State is updated toward the current input |
| Noise/padding | ✓ Input is ignored, state is preserved | - |

This plays the same role as the **gates** of an LSTM, but the update stays linear in the state, so it can still be computed with a scan.
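
The gating effect of Δ can be seen on a scalar example. The sketch below assumes a single state with `a = -1` and `b = 1` under zero-order hold (values chosen for illustration): the discretized update becomes $h_t = e^{-\Delta} h_{t-1} + (1 - e^{-\Delta}) x_t$, so a small Δ keeps the previous state and nearly ignores the input, while a large Δ resets the state toward the current input.

```python
import math
from typing import Tuple


def discretized_gate(delta: float, a: float = -1.0, b: float = 1.0) -> Tuple[float, float]:
    """Return (A_bar, B_bar) for a scalar SSM under zero-order hold."""
    a_bar = math.exp(delta * a)      # fraction of h_{t-1} that is kept
    b_bar = (a_bar - 1.0) / a * b    # weight given to the current input x_t
    return a_bar, b_bar


for delta in (0.01, 0.1, 1.0, 5.0):
    keep, write = discretized_gate(delta)
    print(f"delta={delta:<5} keep={keep:.3f} write={write:.3f}")
```

With these particular values `keep + write = 1`, which is the convex-combination form of a classic RNN gate.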

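### Hardware-aware architecture

Input-dependent parameters rule out the fixed convolution kernel used by S4, so Mamba relies on GPU-specific optimizations:
- **Parallel scan** instead of a sequential recurrence
- **Kernel fusion** to reduce memory accesses
- **Recomputation** of intermediate states in the backward pass instead of storing them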
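
The parallel scan works because two consecutive steps of a linear recurrence compose into a single step of the same form. The sketch below illustrates the idea in plain Python with a Hillis-Steele scan, chosen here for brevity; it is not the fused kernel of the official implementation, and the step values are arbitrary.

```python
from typing import List, Tuple

Step = Tuple[float, float]  # (a, b) represents the map h -> a * h + b


def combine(first: Step, second: Step) -> Step:
    """Compose two steps, applying 'first' and then 'second'."""
    a1, b1 = first
    a2, b2 = second
    return (a1 * a2, a2 * b1 + b2)


def sequential_scan(steps: List[Step]) -> List[float]:
    """Reference: h_t = a_t * h_{t-1} + b_t, starting from h = 0."""
    h, out = 0.0, []
    for a, b in steps:
        h = a * h + b
        out.append(h)
    return out


def inclusive_scan(steps: List[Step]) -> List[float]:
    """Hillis-Steele scan: about log2(n) rounds, each round elementwise."""
    prefix = list(steps)
    offset = 1
    while offset < len(prefix):
        prefix = [
            combine(prefix[i - offset], prefix[i]) if i >= offset else prefix[i]
            for i in range(len(prefix))
        ]
        offset *= 2
    # With a zero initial state, the state after step t is the 'b' component
    return [b for _, b in prefix]


steps = [(0.9, 1.0), (0.5, -2.0), (1.1, 0.3), (0.7, 0.0), (0.95, 4.0)]
h_seq = sequential_scan(steps)
h_par = inclusive_scan(steps)
max_gap = max(abs(s - p) for s, p in zip(h_seq, h_par))
print(f"largest difference: {max_gap:.2e}")
```

Each round of the `while` loop only reads the previous round's values, so on a GPU all positions can be updated at once. In Mamba the pair `(a, b)` corresponds to $(\bar{A}_t, \bar{B}_t x_t)$.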

In [None]:
class MambaBlock(nn.Module):
    """
    Mamba Block - Selective State Space Model.
    
    Based on: "Mamba: Linear-Time Sequence Modeling with Selective State Spaces"
    Gu & Dao, arXiv 2312.00752
    
    This is a simplified PyTorch implementation for educational purposes.
    For production, use: https://github.com/state-spaces/mamba
    """
    
    def __init__(
        self, 
        d_model: int, 
        d_state: int = 16,
        d_conv: int = 4,
        expand: int = 2,
        dropout: float = 0.0
    ):
        """
        Parameters:
        -----------
        d_model : int
            Model dimension
        d_state : int
            SSM state dimension (N)
        d_conv : int
            Local convolution width
        expand : int
            Inner dimension expansion factor
        dropout : float
            Dropout rate
        """
        super().__init__()
        self.d_model = d_model
        self.d_state = d_state
        self.d_conv = d_conv
        self.d_inner = d_model * expand
        
        # Input projection (to 2x for gating)
        self.in_proj = nn.Linear(d_model, self.d_inner * 2, bias=False)
        
        # 1D Convolution for local context
        self.conv1d = nn.Conv1d(
            in_channels=self.d_inner,
            out_channels=self.d_inner,
            kernel_size=d_conv,
            padding=d_conv - 1,
            groups=self.d_inner  # Depthwise
        )
        
        # SSM parameters (input-dependent)
        # x_proj projects to (delta, B, C)
        self.x_proj = nn.Linear(self.d_inner, d_state * 2 + 1, bias=False)
        
        # Delta (discretization step) projection
        self.dt_proj = nn.Linear(1, self.d_inner, bias=True)
        
        # A parameter (log scale for stability)
        A = torch.arange(1, d_state + 1, dtype=torch.float32)
        self.A_log = nn.Parameter(torch.log(A.repeat(self.d_inner, 1)))
        
        # D skip connection
        self.D = nn.Parameter(torch.ones(self.d_inner))
        
        # Output projection
        self.out_proj = nn.Linear(self.d_inner, d_model, bias=False)
        
        # Dropout
        self.dropout = nn.Dropout(dropout)
    
    def ssm_step(
        self, 
        x: torch.Tensor, 
        h: torch.Tensor,
        delta: torch.Tensor,
        A: torch.Tensor,
        B: torch.Tensor,
        C: torch.Tensor
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Single SSM step with selective parameters.
        
        h_t = exp(delta * A) * h_{t-1} + delta * B * x_t
        y_t = C * h_t
        """
        # Discretize A
        A_bar = torch.exp(delta.unsqueeze(-1) * A)  # (batch, d_inner, d_state)
        
        # Discretize B
        B_bar = delta.unsqueeze(-1) * B  # (batch, d_inner, d_state)
        
        # State update
        h = A_bar * h + B_bar * x.unsqueeze(-1)
        
        # Output
        y = torch.sum(C * h, dim=-1)  # (batch, d_inner)
        
        return y, h
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass.
        
        Parameters:
        -----------
        x : tensor
            Input of shape (batch, seq_len, d_model)
        
        Returns:
        --------
        tensor : Output of shape (batch, seq_len, d_model)
        """
        batch, seq_len, _ = x.shape
        
        # Input projection with gating
        xz = self.in_proj(x)  # (batch, seq_len, d_inner * 2)
        x_main, z = xz.chunk(2, dim=-1)  # Each: (batch, seq_len, d_inner)
        
        # 1D convolution for local context
        x_conv = x_main.transpose(1, 2)  # (batch, d_inner, seq_len)
        x_conv = self.conv1d(x_conv)[:, :, :seq_len]  # Causal
        x_conv = x_conv.transpose(1, 2)  # (batch, seq_len, d_inner)
        x_main = F.silu(x_conv)
        
        # Project to SSM parameters
        x_ssm = self.x_proj(x_main)  # (batch, seq_len, d_state*2 + 1)
        
        # Split into delta, B, C
        delta_raw = x_ssm[:, :, :1]  # (batch, seq_len, 1)
        B = x_ssm[:, :, 1:1+self.d_state]  # (batch, seq_len, d_state)
        C = x_ssm[:, :, 1+self.d_state:]  # (batch, seq_len, d_state)
        
        # Delta projection + softplus
        delta = F.softplus(self.dt_proj(delta_raw))  # (batch, seq_len, d_inner)
        
        # A (negative for stability)
        A = -torch.exp(self.A_log)  # (d_inner, d_state)
        
        # Initialize state
        h = torch.zeros(batch, self.d_inner, self.d_state, device=x.device)
        
        # Recurrent SSM (simplified, not optimized)
        outputs = []
        for t in range(seq_len):
            x_t = x_main[:, t, :]  # (batch, d_inner)
            delta_t = delta[:, t, :]  # (batch, d_inner)
            B_t = B[:, t, :].unsqueeze(1).expand(-1, self.d_inner, -1)  # (batch, d_inner, d_state)
            C_t = C[:, t, :].unsqueeze(1).expand(-1, self.d_inner, -1)  # (batch, d_inner, d_state)
            
            y_t, h = self.ssm_step(x_t, h, delta_t, A, B_t, C_t)
            
            # Skip connection
            y_t = y_t + self.D * x_t
            outputs.append(y_t)
        
        y = torch.stack(outputs, dim=1)  # (batch, seq_len, d_inner)
        
        # Gating with z
        y = y * F.silu(z)
        
        # Output projection
        y = self.out_proj(y)
        y = self.dropout(y)
        
        return y


# Test Mamba Block
print("MambaBlock test:")
mamba_block = MambaBlock(d_model=32, d_state=16, d_conv=4, expand=2)
test_input = torch.randn(2, 60, 32)
output = mamba_block(test_input)

print(f"  Input shape: {test_input.shape}")
print(f"  Output shape: {output.shape}")
print(f"  Parameters: {sum(p.numel() for p in mamba_block.parameters()):,}")
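
The parameter count printed by this test can be checked by hand from the layer shapes of the `MambaBlock` defined above. A minimal sketch; the helper name `mamba_block_param_count` is hypothetical, and it only reflects this simplified block, not the official Mamba implementation:

```python
from typing import Dict, Tuple


def mamba_block_param_count(
    d_model: int, d_state: int = 16, d_conv: int = 4, expand: int = 2
) -> Tuple[Dict[str, int], int]:
    """Per-layer and total trainable parameters of the simplified MambaBlock."""
    d_inner = d_model * expand
    counts = {
        "in_proj": d_model * 2 * d_inner,        # Linear without bias
        "conv1d": d_inner * d_conv + d_inner,    # depthwise weights + bias
        "x_proj": d_inner * (2 * d_state + 1),   # Linear without bias
        "dt_proj": d_inner + d_inner,            # Linear(1, d_inner) weights + bias
        "A_log": d_inner * d_state,
        "D": d_inner,
        "out_proj": d_inner * d_model,           # Linear without bias
    }
    return counts, sum(counts.values())


counts, total = mamba_block_param_count(d_model=32, d_state=16, d_conv=4, expand=2)
for name, value in counts.items():
    print(f"{name:>9}: {value:,}")
print(f"    total: {total:,}")
```

The two projections `in_proj` and `out_proj` dominate the total, so the block size grows roughly as `3 * expand * d_model²`.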

---

## Part 4: Full Mamba Implementation for Trading (20 min)

### Model architecture

```
Input (batch, seq_len, n_features)
           |
     [Input Projection]
           |
     [Mamba Block 1]
           |
     [Mamba Block 2]
           |
     [Global Pooling]
           |
     [Classification Head]
           |
Output (batch, 1)
```

In [None]:
class MambaForTrading(nn.Module):
    """
    Mamba model for financial time series classification.
    
    Architecture:
    - Input projection
    - N x Mamba blocks with residual connections
    - Global average pooling
    - Classification head
    """
    
    def __init__(
        self,
        n_features: int,
        d_model: int = 32,
        d_state: int = 16,
        n_layers: int = 2,
        dropout: float = 0.1,
        n_classes: int = 1
    ):
        super().__init__()
        
        # Input projection
        self.input_proj = nn.Linear(n_features, d_model)
        
        # Mamba blocks
        self.mamba_blocks = nn.ModuleList([
            MambaBlock(
                d_model=d_model,
                d_state=d_state,
                d_conv=4,
                expand=2,
                dropout=dropout
            )
            for _ in range(n_layers)
        ])
        
        # Layer norms for residual
        self.norms = nn.ModuleList([
            nn.LayerNorm(d_model)
            for _ in range(n_layers)
        ])
        
        # Classification head
        self.head = nn.Sequential(
            nn.Linear(d_model, d_model // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_model // 2, n_classes)
        )
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass.
        
        Parameters:
        -----------
        x : tensor
            Input of shape (batch, seq_len, n_features)
        
        Returns:
        --------
        tensor : Logits of shape (batch, n_classes)
        """
        # Project to d_model
        x = self.input_proj(x)
        
        # Mamba blocks with residual
        for mamba, norm in zip(self.mamba_blocks, self.norms):
            x = x + mamba(norm(x))
        
        # Global average pooling
        x = x.mean(dim=1)  # (batch, d_model)
        
        # Classification
        logits = self.head(x)
        
        return logits


# Full test
print("MambaForTrading test:")
model = MambaForTrading(
    n_features=5,
    d_model=32,
    d_state=16,
    n_layers=2,
    dropout=0.1
)

test_input = torch.randn(4, 60, 5)  # (batch, seq_len, features)
output = model(test_input)

print(f"  Input shape: {test_input.shape}")
print(f"  Output shape: {output.shape}")
print(f"  Total parameters: {sum(p.numel() for p in model.parameters()):,}")

# Compare with an equivalent Transformer
print(f"\n  Note: a Transformer with the same d_model and depth would have roughly 2x more parameters")
print(f"  and O(n²) attention cost instead of O(n)")

In [None]:
# Generate synthetic trading data
def generate_trading_data(n_samples=2000, seq_length=60, n_features=5):
    """
    Generate synthetic trading data.
    
    Features:
    - returns: Daily returns
    - volume_norm: Normalized volume
    - volatility: Rolling volatility
    - rsi_norm: Normalized RSI
    - momentum: 10-day momentum
    
    Target: Direction of the next return (1: up, 0: down)
    """
    np.random.seed(42)
    
    X = []
    y = []
    
    for _ in range(n_samples):
        # Random trend
        trend = np.random.choice([-1, 1]) * np.random.uniform(0.0003, 0.001)
        
        # Generate returns as trend + noise
        returns = trend + np.random.randn(seq_length + 1) * 0.015
        
        # Compute the features
        volume = np.abs(returns[:-1]) * 100 + np.random.uniform(50, 150, seq_length)
        volume_norm = (volume - volume.mean()) / (volume.std() + 1e-8)
        
        # The first rolling std is NaN (single observation); back-fill it so the
        # normalization below does not turn the whole feature into NaN
        volatility = pd.Series(returns[:-1]).rolling(5, min_periods=1).std().bfill().values
        vol_norm = (volatility - volatility.mean()) / (volatility.std() + 1e-8)
        
        # Simplified RSI
        gains = np.maximum(returns[:-1], 0)
        losses = -np.minimum(returns[:-1], 0)
        avg_gain = pd.Series(gains).rolling(7, min_periods=1).mean().values
        avg_loss = pd.Series(losses).rolling(7, min_periods=1).mean().values
        rsi = 100 - (100 / (1 + avg_gain / (avg_loss + 1e-8)))
        rsi_norm = (rsi - 50) / 50
        
        # Momentum
        prices = np.exp(np.cumsum(returns))
        momentum = (prices[:-1] / np.roll(prices[:-1], 10) - 1)
        momentum[:10] = 0
        mom_norm = (momentum - momentum.mean()) / (momentum.std() + 1e-8)
        
        # Stack features
        features = np.stack([returns[:-1], volume_norm, vol_norm, rsi_norm, mom_norm], axis=1)
        X.append(features)
        
        # Target: direction of the last return
        y.append(1 if returns[-1] > 0 else 0)
    
    return np.array(X), np.array(y)


# Generate the data
print("Generating data...")
X, y = generate_trading_data(n_samples=3000, seq_length=60, n_features=5)

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
print(f"Distribution: {y.mean():.1%} up, {1-y.mean():.1%} down")

# Split
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print(f"\nTrain: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")

# Convert to tensors
X_train_t = torch.FloatTensor(X_train)
y_train_t = torch.FloatTensor(y_train)
X_val_t = torch.FloatTensor(X_val)
y_val_t = torch.FloatTensor(y_val)
X_test_t = torch.FloatTensor(X_test)
y_test_t = torch.FloatTensor(y_test)

# DataLoaders
train_dataset = TensorDataset(X_train_t, y_train_t)
val_dataset = TensorDataset(X_val_t, y_val_t)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=64, shuffle=False)

In [None]:
def train_mamba_model(model, train_loader, val_loader, epochs=20, lr=0.001):
    """
    Train the Mamba model and restore the weights with the best validation loss.
    """
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    criterion = nn.BCEWithLogitsLoss()
    
    history = {'train_loss': [], 'val_loss': [], 'train_acc': [], 'val_acc': []}
    best_val_loss = float('inf')
    best_model_state = None
    
    for epoch in range(epochs):
        # Training
        model.train()
        train_loss = 0
        train_correct = 0
        train_total = 0
        
        for X_batch, y_batch in train_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            
            optimizer.zero_grad()
            logits = model(X_batch).squeeze(-1)
            loss = criterion(logits, y_batch)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            
            train_loss += loss.item() * len(y_batch)
            preds = (torch.sigmoid(logits) > 0.5).float()
            train_correct += (preds == y_batch).sum().item()
            train_total += len(y_batch)
        
        # Validation
        model.eval()
        val_loss = 0
        val_correct = 0
        val_total = 0
        
        with torch.no_grad():
            for X_batch, y_batch in val_loader:
                X_batch, y_batch = X_batch.to(device), y_batch.to(device)
                logits = model(X_batch).squeeze(-1)
                loss = criterion(logits, y_batch)
                
                val_loss += loss.item() * len(y_batch)
                preds = (torch.sigmoid(logits) > 0.5).float()
                val_correct += (preds == y_batch).sum().item()
                val_total += len(y_batch)
        
        # Metrics
        train_loss /= train_total
        val_loss /= val_total
        train_acc = train_correct / train_total
        val_acc = val_correct / val_total
        
        history['train_loss'].append(train_loss)
        history['val_loss'].append(val_loss)
        history['train_acc'].append(train_acc)
        history['val_acc'].append(val_acc)
        
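        # Best model: clone the tensors, because state_dict() returns references
        # that keep changing as training continues
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}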
        
        scheduler.step()
        
        if (epoch + 1) % 5 == 0:
            print(f"Epoch {epoch+1}/{epochs}: "
                  f"Train Loss={train_loss:.4f}, Acc={train_acc:.2%} | "
                  f"Val Loss={val_loss:.4f}, Acc={val_acc:.2%}")
    
    # Load best model
    model.load_state_dict(best_model_state)
    
    return model, history


# Create and train the Mamba model
print("Training the Mamba model...")
print("(Lightweight architecture for CPU)\n")

mamba_model = MambaForTrading(
    n_features=5,
    d_model=32,
    d_state=16,
    n_layers=2,
    dropout=0.1
)

mamba_model, history = train_mamba_model(
    mamba_model, 
    train_loader, 
    val_loader, 
    epochs=20,
    lr=0.001
)

In [None]:
# Visualize training
fig, axes = plt.subplots(1, 2, figsize=(14, 4))

# Loss
ax1 = axes[0]
ax1.plot(history['train_loss'], label='Train', color='steelblue')
ax1.plot(history['val_loss'], label='Validation', color='coral')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Training Loss', fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Accuracy
ax2 = axes[1]
ax2.plot(history['train_acc'], label='Train', color='steelblue')
ax2.plot(history['val_acc'], label='Validation', color='coral')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy')
ax2.set_title('Training Accuracy', fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Evaluation on the test set
mamba_model.eval()
with torch.no_grad():
    logits = mamba_model(X_test_t.to(device)).squeeze(-1).cpu()
    y_pred = (torch.sigmoid(logits) > 0.5).numpy().astype(int)

print("\nEvaluation on the test set:")
print(f"  Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Down', 'Up']))

---

## Part 5: SST Hybrid - Mamba + Transformer (15 min)

### Why a hybrid?

| Model | Strengths | Weaknesses |
|-------|-----------|------------|
| **Mamba** | Long-range, O(n), efficient | Weaker on local patterns |
| **Transformer** | Local patterns, interpretable | O(n²), memory-intensive |
| **SST Hybrid** | ✅ Best of both worlds | Implementation complexity |
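
To put rough numbers on the cost columns above, the sketch below counts how many values each approach has to hold per head: the full attention score matrix, a banded local window, and a fixed-size SSM state. The function names are illustrative, and the settings (`window_size=8`, `d_inner=64`, `d_state=16`) are assumptions borrowed from the small models used in this notebook.

```python
def attention_score_entries(seq_len: int) -> int:
    """Entries in the full (n x n) attention score matrix for one head."""
    return seq_len * seq_len


def local_attention_entries(seq_len: int, window_size: int) -> int:
    """Unmasked entries when position i sees j only if |i - j| <= window_size // 2."""
    half = window_size // 2
    return sum(
        min(seq_len, i + half + 1) - max(0, i - half)
        for i in range(seq_len)
    )


def ssm_state_entries(d_inner: int, d_state: int) -> int:
    """Size of the recurrent state carried between steps; independent of seq_len."""
    return d_inner * d_state


for n in (60, 256, 1024, 4096):
    full = attention_score_entries(n)
    local = local_attention_entries(n, window_size=8)
    state = ssm_state_entries(d_inner=64, d_state=16)
    print(f"n={n:>5}: full={full:>12,}  local={local:>7,}  ssm_state={state:,}")
```

The full matrix grows quadratically with the sequence length, the banded window grows linearly, and the SSM state does not grow at all, which is the trade-off the table summarizes.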

### SST Architecture (CIKM 2024)

The paper "SST: State-space models with self-attention for long sequence modeling" proposes:

```
Input
  |
  +---> [Mamba Block] ---> Long-range features
  |                              |
  +---> [Attention Block] ---> Local features
                                 |
                            [Fusion]
                                 |
                            Output
```
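
The `[Fusion]` step in the diagram can be realized in several ways. The sketch below shows one possible choice, a sigmoid gate that blends the two branches element-wise; it is an illustration under that assumption, not necessarily the mechanism used in the paper. In a trained model the gate logits would come from a learned layer; here they are passed in directly, and the names `gated_fusion` and `gate_logits` are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(long_range, short_range, gate_logits):
    """Blend two feature vectors element-wise: g * long + (1 - g) * short."""
    fused = []
    for long_val, short_val, logit in zip(long_range, short_range, gate_logits):
        g = sigmoid(logit)
        fused.append(g * long_val + (1.0 - g) * short_val)
    return fused


# Toy features: the long-range branch outputs 1.0, the short-range branch 0.0,
# so the fused value equals the gate itself and is easy to read.
long_feat = [1.0, 1.0, 1.0]
short_feat = [0.0, 0.0, 0.0]
fused = gated_fusion(long_feat, short_feat, gate_logits=[-10.0, 0.0, 10.0])
print(fused)
```

A strongly negative logit selects the short-range branch, a strongly positive one selects the long-range branch, and a logit of zero averages the two.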

In [None]:
class LocalAttention(nn.Module):
    """
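    Local (windowed) attention for short-range patterns.
    
    Each position attends only to neighbours within window_size // 2 steps on either side.
    Note: this reference version still computes the full (n x n) score matrix and masks it,
    so its cost stays O(n²); a banded kernel would be needed to reach O(n * window_size).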
    """
    
    def __init__(self, d_model: int, n_heads: int = 4, window_size: int = 8, dropout: float = 0.1):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.window_size = window_size
        self.head_dim = d_model // n_heads
        
        self.qkv = nn.Linear(d_model, d_model * 3)
        self.out_proj = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, _ = x.shape
        
        # QKV projection
        qkv = self.qkv(x)  # (batch, seq_len, 3 * d_model)
        qkv = qkv.reshape(batch, seq_len, 3, self.n_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)  # (3, batch, n_heads, seq_len, head_dim)
        q, k, v = qkv[0], qkv[1], qkv[2]
        
        # Create local attention mask: position i sees j iff |i - j| <= window_size // 2
        idx = torch.arange(seq_len, device=x.device)
        mask = (idx[None, :] - idx[:, None]).abs() <= self.window_size // 2  # (seq_len, seq_len), bool
        
        # Attention scores; positions outside the window are set to -inf before the softmax
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        scores = scores.masked_fill(~mask[None, None, :, :], float('-inf'))
        attn = F.softmax(scores, dim=-1)
        attn = self.dropout(attn)
        
        # Apply attention
        out = torch.matmul(attn, v)  # (batch, n_heads, seq_len, head_dim)
        out = out.transpose(1, 2).reshape(batch, seq_len, self.d_model)
        out = self.out_proj(out)
        
        return out


class SSTBlock(nn.Module):
    """
    SST Block: Combines Mamba (long-range) with Local Attention (short-range).
    
    Based on: "SST: State-space models with self-attention for long sequence modeling"
    CIKM 2024
    """
    
    def __init__(
        self,
        d_model: int,
        d_state: int = 16,
        n_heads: int = 4,
        window_size: int = 8,
        dropout: float = 0.1
    ):
        super().__init__()
        
        # Mamba for long-range
        self.mamba = MambaBlock(d_model=d_model, d_state=d_state, dropout=dropout)
        
        # Local attention for short-range
        self.local_attn = LocalAttention(
            d_model=d_model, 
            n_heads=n_heads, 
            window_size=window_size,
            dropout=dropout
        )
        
        # Fusion
        self.gate = nn.Linear(d_model * 2, d_model)
        
        # Norms
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm_out = nn.LayerNorm(d_model)
        
        # FFN
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model * 2),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_model * 2, d_model),
            nn.Dropout(dropout)
        )
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Long-range path (Mamba)
        long_range = self.mamba(self.norm1(x))
        
        # Short-range path (Local Attention)
        short_range = self.local_attn(self.norm2(x))
        
        # Fusion with learned gating
        concat = torch.cat([long_range, short_range], dim=-1)
        gate = torch.sigmoid(self.gate(concat))
        fused = gate * long_range + (1 - gate) * short_range
        
        # Residual
        x = x + fused
        
        # FFN
        x = x + self.ffn(self.norm_out(x))
        
        return x


class SSTForTrading(nn.Module):
    """
    SST model for trading: Mamba + Local Attention hybrid.
    """
    
    def __init__(
        self,
        n_features: int,
        d_model: int = 32,
        d_state: int = 16,
        n_layers: int = 2,
        n_heads: int = 4,
        window_size: int = 8,
        dropout: float = 0.1,
        n_classes: int = 1
    ):
        super().__init__()
        
        # Input projection
        self.input_proj = nn.Linear(n_features, d_model)
        
        # SST blocks
        self.blocks = nn.ModuleList([
            SSTBlock(
                d_model=d_model,
                d_state=d_state,
                n_heads=n_heads,
                window_size=window_size,
                dropout=dropout
            )
            for _ in range(n_layers)
        ])
        
        # Head
        self.head = nn.Sequential(
            nn.Linear(d_model, d_model // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_model // 2, n_classes)
        )
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.input_proj(x)
        
        for block in self.blocks:
            x = block(x)
        
        x = x.mean(dim=1)
        return self.head(x)


# Test SST
print("Testing SSTForTrading:")
sst_model = SSTForTrading(
    n_features=5,
    d_model=32,
    n_layers=2,
    n_heads=4,
    window_size=8
)

test_input = torch.randn(4, 60, 5)
output = sst_model(test_input)

print(f"  Input shape: {test_input.shape}")
print(f"  Output shape: {output.shape}")
print(f"  Parameters: {sum(p.numel() for p in sst_model.parameters()):,}")

---

## Part 6: LSTM vs Transformer vs Mamba Comparison (10 min)

### Benchmark on the same data

In [None]:
class LSTMBaseline(nn.Module):
    """LSTM baseline for comparison."""
    def __init__(self, n_features, hidden_size=32, n_layers=2, dropout=0.1):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=n_features,
            hidden_size=hidden_size,
            num_layers=n_layers,
            dropout=dropout if n_layers > 1 else 0,
            batch_first=True
        )
        self.head = nn.Linear(hidden_size, 1)
    
    def forward(self, x):
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1])


class TransformerBaseline(nn.Module):
    """Transformer baseline for comparison."""
    def __init__(self, n_features, d_model=32, n_heads=4, n_layers=2, dropout=0.1):
        super().__init__()
        self.input_proj = nn.Linear(n_features, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=n_heads,
            dim_feedforward=d_model * 2,
            dropout=dropout,
            batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)
    
    def forward(self, x):
        x = self.input_proj(x)
        x = self.transformer(x)
        x = x.mean(dim=1)
        return self.head(x)


def benchmark_models(models_dict, train_loader, val_loader, X_test, y_test, epochs=15):
    """
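    Compare several models on the same data.
    
    Note: val_loader is accepted for interface symmetry but is not used here;
    each model is trained for a fixed number of epochs and scored on the test set.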
    """
    import time
    
    results = []
    
    for name, model in models_dict.items():
        print(f"\n{'='*50}")
        print(f"Training {name}...")
        print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
        
        model = model.to(device)
        optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
        criterion = nn.BCEWithLogitsLoss()
        
        start_time = time.time()
        
        for epoch in range(epochs):
            model.train()
            for X_batch, y_batch in train_loader:
                X_batch, y_batch = X_batch.to(device), y_batch.to(device)
                optimizer.zero_grad()
                logits = model(X_batch).squeeze(-1)
                loss = criterion(logits, y_batch)
                loss.backward()
                optimizer.step()
        
        train_time = time.time() - start_time
        
        # Evaluate
        model.eval()
        with torch.no_grad():
            X_test_d = torch.FloatTensor(X_test).to(device)
            logits = model(X_test_d).squeeze(-1).cpu()
            y_pred = (torch.sigmoid(logits) > 0.5).numpy().astype(int)
        
        acc = accuracy_score(y_test, y_pred)
        
        results.append({
            'Model': name,
            'Parameters': sum(p.numel() for p in model.parameters()),
            'Train Time (s)': train_time,
            'Test Accuracy': acc
        })
        
        print(f"  Time: {train_time:.1f}s, Accuracy: {acc:.2%}")
    
    return pd.DataFrame(results)


# Create the models
models = {
    'LSTM': LSTMBaseline(n_features=5, hidden_size=32, n_layers=2),
    'Transformer': TransformerBaseline(n_features=5, d_model=32, n_heads=4, n_layers=2),
    'Mamba': MambaForTrading(n_features=5, d_model=32, d_state=16, n_layers=2),
    'SST Hybrid': SSTForTrading(n_features=5, d_model=32, n_layers=2, n_heads=4)
}

# Benchmark
print("Benchmarking the architectures...")
results_df = benchmark_models(models, train_loader, val_loader, X_test, y_test, epochs=15)

In [None]:
# Visualize the results
print("\n" + "="*60)
print("BENCHMARK RESULTS")
print("="*60)
print(results_df.to_string(index=False))

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Parameters
ax1 = axes[0]
colors = ['steelblue', 'coral', 'seagreen', 'orchid']
ax1.bar(results_df['Model'], results_df['Parameters'], color=colors)
ax1.set_ylabel('Number of parameters')
ax1.set_title('Model size', fontweight='bold')
ax1.tick_params(axis='x', rotation=15)
ax1.grid(True, alpha=0.3, axis='y')

# Train time
ax2 = axes[1]
ax2.bar(results_df['Model'], results_df['Train Time (s)'], color=colors)
ax2.set_ylabel('Time (seconds)')
ax2.set_title('Training time', fontweight='bold')
ax2.tick_params(axis='x', rotation=15)
ax2.grid(True, alpha=0.3, axis='y')

# Accuracy
ax3 = axes[2]
ax3.bar(results_df['Model'], results_df['Test Accuracy'] * 100, color=colors)
ax3.set_ylabel('Accuracy (%)')
ax3.set_title('Performance (Test Set)', fontweight='bold')
ax3.tick_params(axis='x', rotation=15)
ax3.grid(True, alpha=0.3, axis='y')
ax3.set_ylim(45, 60)

plt.tight_layout()
plt.show()

print("\nAnalysis (qualitative; check against the table above):")
print("  - Mamba offers a good parameter/performance trade-off")
print("  - SST Hybrid is designed to capture both long-range and short-range patterns")
print("  - On longer sequences (>256), Mamba would be expected to outperform the Transformer")
print("  - LSTM remains competitive on short sequences")

---

## Part 7: QuantConnect Integration (15 min)

### Strategy with Mamba

Deployment architecture:

```
LOCAL (GPU or powerful CPU)
├── Train MambaForTrading
├── Save state_dict (<9MB)
└── Upload to ObjectStore

QUANTCONNECT CLOUD (CPU)
├── Load from ObjectStore
├── Rebuild the identical architecture
├── Daily inference (~50ms)
└── Generate trading signals
```
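
The upload step in the diagram is not shown elsewhere in this notebook, so here is a minimal sketch of what it could look like. It assumes the store exposes `SaveBytes(key, bytes)`, `ContainsKey(key)` and `ReadBytes(key)` in the way QuantConnect's ObjectStore does; check the platform documentation before relying on that. `FakeObjectStore` and `upload_model_bytes` are hypothetical names used only so the example runs locally, and the 9 MB budget is the figure quoted in the diagram.

```python
import io

MAX_OBJECT_BYTES = 9 * 1024 * 1024  # size budget quoted in the diagram above


class FakeObjectStore:
    """Hypothetical in-memory stand-in for an ObjectStore, for local testing only."""

    def __init__(self):
        self._data = {}

    def SaveBytes(self, key: str, payload: bytes) -> bool:
        self._data[key] = bytes(payload)
        return True

    def ContainsKey(self, key: str) -> bool:
        return key in self._data

    def ReadBytes(self, key: str) -> bytes:
        return self._data[key]


def upload_model_bytes(store, key: str, payload: bytes, limit: int = MAX_OBJECT_BYTES) -> int:
    """Check the size budget, then hand the serialized weights to the store."""
    if len(payload) >= limit:
        raise ValueError(f"payload is {len(payload)} bytes, budget is {limit}")
    if not store.SaveBytes(key, payload):
        raise RuntimeError(f"store rejected key {key!r}")
    return len(payload)


store = FakeObjectStore()
# Placeholder bytes; in the notebook this would be buffer.getvalue()
# after torch.save(mamba_model.state_dict(), buffer).
payload = b"\x00" * 1024
n = upload_model_bytes(store, "models/mamba_trading", payload)

# Round trip: the algorithm side wraps the stored bytes in a BytesIO before loading.
restored = io.BytesIO(bytes(store.ReadBytes("models/mamba_trading")))
print(f"uploaded {n} bytes, restored {len(restored.getvalue())} bytes")
```

Keeping the size check next to the upload call means an oversized model fails locally instead of at deployment time.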

In [None]:
# Save the model for QuantConnect
import io

# Serialize the state_dict to an in-memory buffer to measure its size
buffer = io.BytesIO()
torch.save(mamba_model.state_dict(), buffer)
model_size = buffer.tell()

print(f"Model size: {model_size / 1024:.1f} KB")
print(f"ObjectStore compatible: {'Yes' if model_size < 9 * 1024 * 1024 else 'No (>9MB)'}")

# Save locally for testing
torch.save(mamba_model.state_dict(), 'mamba_trading_model.pt')
print("\nModel saved: mamba_trading_model.pt")

In [None]:
# QuantConnect code for the Mamba trading strategy
qc_mamba_strategy = '''
from AlgorithmImports import *
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Tuple
import io


# ============================================
# MAMBA MODEL DEFINITION (simplified for QC)
# ============================================

class MambaBlock(nn.Module):
    """Simplified Mamba Block for QuantConnect."""
    
    def __init__(self, d_model: int, d_state: int = 16, d_conv: int = 4, expand: int = 2):
        super().__init__()
        self.d_inner = d_model * expand
        self.d_state = d_state
        
        self.in_proj = nn.Linear(d_model, self.d_inner * 2, bias=False)
        self.conv1d = nn.Conv1d(self.d_inner, self.d_inner, d_conv, 
                               padding=d_conv-1, groups=self.d_inner)
        self.x_proj = nn.Linear(self.d_inner, d_state * 2 + 1, bias=False)
        self.dt_proj = nn.Linear(1, self.d_inner, bias=True)
        self.A_log = nn.Parameter(torch.log(torch.arange(1, d_state+1).repeat(self.d_inner, 1)))
        self.D = nn.Parameter(torch.ones(self.d_inner))
        self.out_proj = nn.Linear(self.d_inner, d_model, bias=False)
    
    def forward(self, x):
        batch, seq_len, _ = x.shape
        xz = self.in_proj(x)
        x_main, z = xz.chunk(2, dim=-1)
        
        x_conv = self.conv1d(x_main.transpose(1, 2))[:, :, :seq_len].transpose(1, 2)
        x_main = F.silu(x_conv)
        
        x_ssm = self.x_proj(x_main)
        delta_raw = x_ssm[:, :, :1]
        B = x_ssm[:, :, 1:1+self.d_state]
        C = x_ssm[:, :, 1+self.d_state:]
        delta = F.softplus(self.dt_proj(delta_raw))
        A = -torch.exp(self.A_log)
        
        h = torch.zeros(batch, self.d_inner, self.d_state, device=x.device)
        outputs = []
        
        for t in range(seq_len):
            x_t = x_main[:, t, :]
            delta_t = delta[:, t, :]
            B_t = B[:, t, :].unsqueeze(1).expand(-1, self.d_inner, -1)
            C_t = C[:, t, :].unsqueeze(1).expand(-1, self.d_inner, -1)
            
            A_bar = torch.exp(delta_t.unsqueeze(-1) * A)
            B_bar = delta_t.unsqueeze(-1) * B_t
            h = A_bar * h + B_bar * x_t.unsqueeze(-1)
            y_t = torch.sum(C_t * h, dim=-1) + self.D * x_t
            outputs.append(y_t)
        
        y = torch.stack(outputs, dim=1)
        y = y * F.silu(z)
        return self.out_proj(y)


class MambaForTrading(nn.Module):
    """Mamba model for trading signals."""
    
    def __init__(self, n_features=5, d_model=32, d_state=16, n_layers=2, dropout=0.1):
        super().__init__()
        self.input_proj = nn.Linear(n_features, d_model)
        self.mamba_blocks = nn.ModuleList([
            MambaBlock(d_model, d_state) for _ in range(n_layers)
        ])
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_layers)])
        self.head = nn.Sequential(
            nn.Linear(d_model, d_model // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_model // 2, 1)
        )
    
    def forward(self, x):
        x = self.input_proj(x)
        for mamba, norm in zip(self.mamba_blocks, self.norms):
            x = x + mamba(norm(x))
        x = x.mean(dim=1)
        return self.head(x)


# ============================================
# QUANTCONNECT ALPHA MODEL
# ============================================

class MambaAlphaModel(AlphaModel):
    """
    Alpha Model using Mamba SSM for direction prediction.
    
    Features:
    - O(n) complexity for long sequences
    - Selective state space for filtering noise
    - CPU-optimized architecture
    """
    
    def __init__(self, algorithm, model_key="models/mamba_trading"):
        self.algorithm = algorithm
        self.model_key = model_key
        self.lookback = 60
        self.n_features = 5
        self.model = None
        self.symbols_data = {}
        
        # Load model
        self._load_model()
    
    def _load_model(self):
        """Load Mamba model from ObjectStore."""
        try:
            if self.algorithm.ObjectStore.ContainsKey(self.model_key):
                # ReadBytes may return a .NET byte array; bytes() makes it safe for BytesIO
                model_bytes = bytes(self.algorithm.ObjectStore.ReadBytes(self.model_key))
                buffer = io.BytesIO(model_bytes)
                
                self.model = MambaForTrading(
                    n_features=self.n_features,
                    d_model=32,
                    d_state=16,
                    n_layers=2
                )
                self.model.load_state_dict(torch.load(buffer, map_location="cpu"))
                self.model.eval()
                
                self.algorithm.Debug("Mamba model loaded from ObjectStore")
            else:
                self.algorithm.Debug(f"Model not found: {self.model_key}")
        except Exception as e:
            self.algorithm.Debug(f"Error loading model: {e}")
    
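    def _prepare_features(self, symbol) -> np.ndarray:
        """Prepare a (lookback, n_features) array for a symbol, or None if history is too short."""
        history = self.algorithm.History(symbol, self.lookback + 5, Resolution.Daily)
        
        # rsi_norm has len(history) - 1 elements, so lookback + 1 bars are needed
        # for all five feature columns to reach length lookback.
        if history.empty or len(history) < self.lookback + 1:
            return None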
        
        try:
            close = history["close"].values
            volume = history["volume"].values
            
            # Returns
            returns = np.diff(close) / close[:-1]
            
            # Volume normalized
            vol_norm = (volume - volume.mean()) / (volume.std() + 1e-8)
            
            # Volatility
            volatility = np.array([returns[max(0,i-4):i+1].std() 
                                  for i in range(len(returns))])
            vol_std_norm = (volatility - volatility.mean()) / (volatility.std() + 1e-8)
            
            # RSI
            gains = np.maximum(returns, 0)
            losses = -np.minimum(returns, 0)
            avg_gain = np.convolve(gains, np.ones(7)/7, mode="valid")
            avg_loss = np.convolve(losses, np.ones(7)/7, mode="valid")
            rsi = 100 - (100 / (1 + avg_gain / (avg_loss + 1e-8)))
            rsi_norm = np.concatenate([[0]*6, (rsi - 50) / 50])
            
            # Momentum
            prices = np.exp(np.cumsum(np.concatenate([[0], returns])))
            momentum = prices[10:] / prices[:-10] - 1
            mom_norm = np.concatenate([[0]*10, (momentum - momentum.mean()) / (momentum.std() + 1e-8)])
            
            # Stack (last lookback days)
            features = np.stack([
                np.concatenate([[0], returns])[-self.lookback:],
                vol_norm[-self.lookback:],
                np.concatenate([[0], vol_std_norm])[-self.lookback:],
                rsi_norm[-self.lookback:],
                mom_norm[-self.lookback:]
            ], axis=1)
            
            return features
            
        except Exception as e:
            self.algorithm.Debug(f"Feature error for {symbol}: {e}")
            return None
    
    def Update(self, algorithm, data):
        """Generate insights."""
        insights = []
        
        if self.model is None:
            return insights
        
        for symbol in algorithm.ActiveSecurities.Keys:
            if not data.ContainsKey(symbol):
                continue
            
            features = self._prepare_features(symbol)
            
            if features is None:
                continue
            
            # Predict
            with torch.no_grad():
                x = torch.FloatTensor(features).unsqueeze(0)
                logit = self.model(x).item()
                prob = 1 / (1 + np.exp(-logit))
            
            # Generate insight if confident
            if prob > 0.55:
                insights.append(Insight.Price(
                    symbol, timedelta(days=5), InsightDirection.Up,
                    magnitude=prob - 0.5, confidence=prob
                ))
            elif prob < 0.45:
                insights.append(Insight.Price(
                    symbol, timedelta(days=5), InsightDirection.Down,
                    magnitude=0.5 - prob, confidence=1 - prob
                ))
        
        return insights


# ============================================
# MAIN ALGORITHM
# ============================================

class MambaTradingAlgorithm(QCAlgorithm):
    """
    Trading algorithm using Mamba State Space Model.
    
    - Uses Mamba SSM with O(n) complexity
    - 60-day lookback with 5 features
    - Equal-weight positions across all active insights
    """
    
    def Initialize(self):
        self.SetStartDate(2020, 1, 1)
        self.SetEndDate(2023, 12, 31)
        self.SetCash(100000)
        
        # Universe
        self.UniverseSettings.Resolution = Resolution.Daily
        self.AddUniverse(self.CoarseFilter)
        
        # Framework
        self.SetAlpha(MambaAlphaModel(self))
        self.SetPortfolioConstruction(EqualWeightingPortfolioConstructionModel())
        self.SetExecution(ImmediateExecutionModel())
        self.SetRiskManagement(MaximumDrawdownPercentPerSecurity(0.05))
    
    def CoarseFilter(self, coarse):
        filtered = [x for x in coarse
                   if x.HasFundamentalData
                   and x.Price > 10
                   and x.DollarVolume > 10000000]
        sorted_by_volume = sorted(filtered, key=lambda x: x.DollarVolume, reverse=True)
        return [x.Symbol for x in sorted_by_volume[:50]]
    
    def OnEndOfAlgorithm(self):
        self.Debug("=" * 60)
        self.Debug("MAMBA SSM TRADING - FINAL SUMMARY")
        self.Debug("=" * 60)
        self.Debug(f"Final Value: ${self.Portfolio.TotalPortfolioValue:,.2f}")
        total_return = (self.Portfolio.TotalPortfolioValue / 100000 - 1) * 100
        self.Debug(f"Total Return: {total_return:.2f}%")
'''

print("MambaTradingAlgorithm defined for QuantConnect")
print("\n" + "="*60)
print("MAMBA ARCHITECTURE")
print("="*60)
print("""
Model:
  - d_model: 32 (CPU-optimized)
  - d_state: 16 (state dimension)
  - n_layers: 2
  - Complexity: O(n) vs O(n²) for Transformers

Features (5):
  1. returns: Daily returns
  2. volume_norm: Normalized volume
  3. volatility: Rolling volatility
  4. rsi_norm: Normalized RSI
  5. momentum: 10-day momentum

Advantages vs Transformer:
  - Linear scaling on long sequences
  - Selectivity (can filter out noise)
  - Lower GPU/CPU memory use
  - Comparable performance on short sequences
""")

---

## Conclusion and Next Steps

### Recap

| Concept | Description | Trading Benefit |
|---------|-------------|-----------------|
| **State Space Models** | Discretized continuous-time models | Solid mathematical foundations |
| **S4** | Structured SSM with HiPPO | First competitive SSM |
| **Mamba** | Selective (input-dependent) SSM | Filters noise, adapts memory |
| **SST Hybrid** | Mamba + local attention | Best of both worlds |
| **O(n) complexity** | Linear scaling | 5+ years of data feasible |
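
The "input-dependent" row can be made concrete with a scalar version of the recurrence used in the QuantConnect `MambaBlock` shown earlier (`A_bar = exp(delta * A)`, `B_bar = delta * B`). The sketch below is a single-channel, single-state illustration; `selective_scan_scalar` is a hypothetical name, and the step sizes are hand-picked values rather than outputs of a trained model.

```python
import math


def selective_scan_scalar(xs, deltas, Bs, Cs, A=-1.0, D=0.0):
    """Scalar selective SSM scan.

    h_t = exp(delta_t * A) * h_{t-1} + delta_t * B_t * x_t
    y_t = C_t * h_t + D * x_t
    """
    h = 0.0
    ys = []
    for x, delta, B, C in zip(xs, deltas, Bs, Cs):
        A_bar = math.exp(delta * A)  # discretized state transition
        B_bar = delta * B            # discretized input gain
        h = A_bar * h + B_bar * x
        ys.append(C * h + D * x)
    return ys


# A single impulse at t=0, followed by silence.
xs = [1.0, 0.0, 0.0, 0.0]
ones = [1.0] * 4

# Same impulse, two different step-size schedules after the first step.
slow = selective_scan_scalar(xs, deltas=[1.0, 0.1, 0.1, 0.1], Bs=ones, Cs=ones)
fast = selective_scan_scalar(xs, deltas=[1.0, 2.0, 2.0, 2.0], Bs=ones, Cs=ones)

print("small delta (remember):", [round(y, 3) for y in slow])
print("large delta (forget):  ", [round(y, 3) for y in fast])
```

With a negative `A`, a small `delta` keeps the state close to its previous value, while a large `delta` decays it quickly. Because Mamba computes `delta`, `B` and `C` from the input, the model can choose step by step how much history to retain.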

### When to use Mamba vs Transformer?

| Scenario | Recommendation |
|----------|----------------|
| Short sequences (<100) | Transformer (more mature) |
| Long sequences (>256) | **Mamba** (scalability) |
| Memory constraints | **Mamba** (O(n) memory) |
| Need for interpretability | Transformer (attention weights) |
| Noisy data | **Mamba** (selectivity) |
| Multi-scale patterns | SST Hybrid |

### Resources

- [Mamba Paper](https://arxiv.org/abs/2312.00752) - Gu & Dao
- [state-spaces/mamba](https://github.com/state-spaces/mamba) - Official implementation
- [S4 Paper](https://arxiv.org/abs/2111.00396) - Gu et al., ICLR 2022
- [Awesome State Space Models](https://github.com/radarFudan/Awesome-state-space-models)
- [Mamba Paper List](https://github.com/Event-AHU/Mamba_State_Space_Model_Paper_List)

### Next steps

- **QC-Py-24**: Generative models (VAE-Transformer) for anomaly detection
- **QC-Py-25**: Reinforcement learning for adaptive trading
- **QC-Py-26**: LLMs for trading signals

---

**Notebook complete. You have now worked through State Space Models (Mamba) for algorithmic trading with O(n) complexity.**