# 👁️ Attention Mechanisms - Module 6

This notebook explores in detail the attention mechanisms at the core of Transformers.

## 🎯 Objectives
- Understand the concept of attention
- Implement self-attention with TensorFlow
- Explore multi-head attention
- Visualize attention matrices
- Compare with RNN/LSTM approaches

In [None]:
# Import dependencies
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

print(f"🔧 TensorFlow version: {tf.__version__}")
print(f"🔧 NumPy version: {np.__version__}")
print(f"📊 GPUs available: {tf.config.list_physical_devices('GPU')}")

## 1. 🧠 Intuition Behind Attention

Attention lets a model focus on the most relevant parts of the input for each prediction: every output is a weighted average of the inputs, and the weights express how much each input matters.
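
To make the idea concrete, here is a minimal NumPy sketch (not part of the original notebook) in which attention reduces to a weighted average. The `values` and `weights` arrays are made-up toy numbers chosen for illustration.

```python
import numpy as np

# Hypothetical toy values: three input tokens, each a 2-dimensional vector.
values = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])

# Hypothetical attention weights for one prediction; they sum to 1.
weights = np.array([0.7, 0.2, 0.1])

# The attended output is the weighted average of the value vectors.
context = weights @ values
print(context)  # approximately [0.8, 0.3]
```

The first token receives weight 0.7, so the result stays closest to its vector.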

In [None]:
def visualize_attention_concept():
    """
    Visualize the concept of attention on an example sentence.
    """
    # French example sentence: "The black cat eats kibble"
    sentence = ["Le", "chat", "noir", "mange", "des", "croquettes"]
    
    # Hand-written attention matrix for illustration (each row sums to 1)
    attention_weights = np.array([
        [0.1, 0.7, 0.1, 0.05, 0.025, 0.025],   # "Le" attends to "chat"
        [0.1, 0.5, 0.3, 0.05, 0.025, 0.025],   # "chat" attends to itself and "noir"
        [0.05, 0.6, 0.25, 0.05, 0.025, 0.025], # "noir" attends to "chat"
        [0.05, 0.3, 0.1, 0.4, 0.1, 0.05],      # "mange": itself, then "chat"
        [0.05, 0.1, 0.05, 0.3, 0.3, 0.2],      # "des": "mange", itself, "croquettes"
        [0.05, 0.1, 0.05, 0.2, 0.2, 0.4]       # "croquettes" attends mostly to itself
    ])
    
    # Build the heatmap
    fig = go.Figure(data=go.Heatmap(
        z=attention_weights,
        x=sentence,
        y=sentence,
        colorscale='Reds',
        text=np.round(attention_weights, 3),
        texttemplate="%{text}",
        textfont={"size":10},
        hoverongaps=False
    ))
    
    fig.update_layout(
        title="🎯 Attention Matrix: Who Looks at Whom?",
        xaxis_title="Words in the sentence (keys)",
        yaxis_title="Words in the sentence (queries)",
        yaxis_autorange="reversed",  # first query on the top row, like the matrix
        width=600,
        height=500
    )
    
    fig.show()
    
    print("💡 Interpretation:")
    print("- Intense colors show where each word 'attends'")
    print("- 'Le' attends to 'chat' (determiner-noun relation)")
    print("- 'noir' attends to 'chat' (adjective-noun relation)")
    print("- 'mange' attends mostly to itself and to its subject 'chat'")
    
visualize_attention_concept()

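## 2. 🔍 Implementing Scaled Dot-Product Attention

The fundamental formula: **Attention(Q,K,V) = softmax(QK^T/√d_k)V**

Dividing by √d_k keeps the dot products from growing with the key dimension, which would otherwise push the softmax into regions with very small gradients.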
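
As a framework-free reference for the formula, here is a minimal NumPy sketch. The function names, the toy sizes, and the random inputs are assumptions made for illustration; it also measures how much the √d_k scaling shrinks the spread of the scores.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for 2-D arrays of shape [seq_len, d]."""
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_k = 200, 64  # toy sizes, chosen for illustration
q = rng.normal(size=(seq_len, d_k))
k = rng.normal(size=(seq_len, d_k))
v = rng.normal(size=(seq_len, d_k))

output, weights = scaled_dot_product_attention(q, k, v)

# Compare the spread of the scores before and after scaling.
raw_std = (q @ k.T).std()
scaled_std = (q @ k.T / np.sqrt(d_k)).std()
print(f"std of raw scores:    {raw_std:.2f}")
print(f"std of scaled scores: {scaled_std:.2f}")
```

With unit-variance inputs, the raw dot products have a standard deviation near √d_k = 8, and the scaled ones near 1.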

In [None]:
class ScaledDotProductAttention(tf.keras.layers.Layer):
    """
    Scaled dot-product attention implementation.
    """
    
    def __init__(self, temperature=None, **kwargs):
        super(ScaledDotProductAttention, self).__init__(**kwargs)
        self.temperature = temperature
        
    def call(self, queries, keys, values, mask=None, training=None):
        """
        Args:
            queries: [batch_size, seq_len_q, d_k]
            keys: [batch_size, seq_len_k, d_k]
            values: [batch_size, seq_len_k, d_v]
            mask: optional tensor broadcastable to the scores, with 1 at
                positions to hide and 0 elsewhere
        
        Returns:
            output: [batch_size, seq_len_q, d_v]
            attention_weights: [batch_size, seq_len_q, seq_len_k]
        """
        # Step 1: compute the scores QK^T
        scores = tf.matmul(queries, keys, transpose_b=True)
        
        # Step 2: scale by sqrt(d_k), or by a custom temperature if one was given
        d_k = tf.cast(tf.shape(keys)[-1], tf.float32)
        if self.temperature is not None:
            scores = scores / self.temperature
        else:
            scores = scores / tf.math.sqrt(d_k)
        
        # Step 3: apply the mask (if provided); masked scores become very negative
        if mask is not None:
            scores += (mask * -1e9)
        
        # Step 4: softmax over the keys to obtain the attention weights
        attention_weights = tf.nn.softmax(scores, axis=-1)
        
        # Step 5: apply the weights to the values
        output = tf.matmul(attention_weights, values)
        
        return output, attention_weights

# Test the implementation
def test_attention():
    print("🧪 Testing Scaled Dot-Product Attention")
    print("=" * 50)
    
    # Parameters
    batch_size = 2
    seq_len = 5
    d_model = 64
    
    # Create example data
    tf.random.set_seed(42)
    queries = tf.random.normal([batch_size, seq_len, d_model])
    keys = tf.random.normal([batch_size, seq_len, d_model])
    values = tf.random.normal([batch_size, seq_len, d_model])
    
    # Instantiate the attention layer
    attention_layer = ScaledDotProductAttention()
    
    # Compute attention
    output, attention_weights = attention_layer(queries, keys, values)
    
    print(f"📐 Queries shape: {queries.shape}")
    print(f"📐 Keys shape: {keys.shape}")
    print(f"📐 Values shape: {values.shape}")
    print(f"✅ Output shape: {output.shape}")
    print(f"📊 Attention weights shape: {attention_weights.shape}")
    
    # Check that each row of weights sums to 1
    sum_weights = tf.reduce_sum(attention_weights, axis=-1)
    print(f"🔍 Sum of weights (should be ~1.0): {tf.reduce_mean(sum_weights):.6f}")
    
    return output, attention_weights

output, attention_weights = test_attention()

## 3. 🎨 Visualizing Attention Weights

In [None]:
def visualize_attention_weights(attention_weights, sentence=None):
    """
    Visualize attention weights with Plotly.
    """
    # Take the first example in the batch
    weights = attention_weights[0].numpy()
    
    if sentence is None:
        sentence = [f"Token_{i}" for i in range(weights.shape[0])]
    
    # Build the interactive heatmap
    fig = go.Figure(data=go.Heatmap(
        z=weights,
        x=sentence,
        y=sentence,
        colorscale='Viridis',
        text=np.round(weights, 3),
        texttemplate="%{text}",
        textfont={"size":8},
        hoverongaps=False
    ))
    
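    fig.update_layout(
        title="🔥 Attention Weights Heatmap",
        xaxis_title="Keys (what is attended to)",
        yaxis_title="Queries (who attends)",
        yaxis_autorange="reversed",  # first query on the top row, like the matrix
        width=600,
        height=500
    )
    
    fig.show()

# Visualize our example. The weights come from random tensors, so the
# words below are display labels only, not a learned linguistic pattern.
sentence = ["Je", "mange", "une", "pomme", "rouge"]
visualize_attention_weights(attention_weights, sentence)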

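## 4. 🧠 Multi-Head Attention

Multi-head attention runs several attention operations in parallel, each on its own learned projection of the input, so that different heads can capture different "perspectives" on the sequence.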
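
The bookkeeping behind the parallel heads is a reshape and a transpose. Here is a minimal NumPy sketch of that step alone, with no learned projections; the sizes and the helper names `split_heads` and `merge_heads` are illustrative assumptions.

```python
import numpy as np

batch_size, seq_len, d_model, num_heads = 2, 5, 8, 4  # toy sizes
depth = d_model // num_heads  # dimensions handled by each head

x = np.arange(batch_size * seq_len * d_model, dtype=np.float32)
x = x.reshape(batch_size, seq_len, d_model)

def split_heads(x, num_heads):
    """[batch, seq_len, d_model] -> [batch, num_heads, seq_len, depth]."""
    b, n, d = x.shape
    return x.reshape(b, n, num_heads, d // num_heads).transpose(0, 2, 1, 3)

def merge_heads(x):
    """[batch, num_heads, seq_len, depth] -> [batch, seq_len, d_model]."""
    b, h, n, depth = x.shape
    return x.transpose(0, 2, 1, 3).reshape(b, n, h * depth)

heads = split_heads(x, num_heads)
restored = merge_heads(heads)

print(heads.shape)     # (2, 4, 5, 2)
print(restored.shape)  # (2, 5, 8)
```

Each head sees a contiguous slice of `depth` features for every token, and merging the heads restores the original layout exactly.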

In [None]:
class MultiHeadAttention(tf.keras.layers.Layer):
    """
    Multi-head attention implementation.
    """
    
    def __init__(self, d_model, num_heads, **kwargs):
        super(MultiHeadAttention, self).__init__(**kwargs)
        self.num_heads = num_heads
        self.d_model = d_model
        
        assert d_model % self.num_heads == 0, "d_model must be divisible by num_heads"
        
        self.depth = d_model // self.num_heads
        
        # Linear layers for Q, K, V
        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)
        
        # Output layer
        self.dense = tf.keras.layers.Dense(d_model)
        
        # Attention layer
        self.attention = ScaledDotProductAttention()
        
    def split_heads(self, x, batch_size):
        """
        Split the last dimension into (num_heads, depth).
        Transpose to shape (batch_size, num_heads, seq_len, depth).
        """
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])
    
    def call(self, v, k, q, mask=None, training=None):
        batch_size = tf.shape(q)[0]
        
        # 1. Linear projections, then split into heads
        q = self.wq(q)  # (batch_size, seq_len, d_model)
        k = self.wk(k)  # (batch_size, seq_len, d_model)
        v = self.wv(v)  # (batch_size, seq_len, d_model)
        
        q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
        v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)
        
        # 2. Scaled dot-product attention, applied to every head at once
        scaled_attention, attention_weights = self.attention(
            q, k, v, mask=mask, training=training
        )
        
        # 3. Concatenate the heads
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])
        concat_attention = tf.reshape(scaled_attention, 
                                    (batch_size, -1, self.d_model))
        
        # 4. Final projection
        output = self.dense(concat_attention)
        
        return output, attention_weights

# Test multi-head attention
def test_multihead_attention():
    print("🧪 Testing Multi-Head Attention")
    print("=" * 40)
    
    # Parameters
    d_model = 512
    num_heads = 8
    batch_size = 2
    seq_len = 10
    
    # Example data
    tf.random.set_seed(42)
    x = tf.random.normal([batch_size, seq_len, d_model])
    
    # Create the multi-head attention layer
    mha = MultiHeadAttention(d_model, num_heads)
    
    # Self-attention (Q, K, V are identical)
    output, attention_weights = mha(x, x, x)
    
    print(f"📐 Input shape: {x.shape}")
    print(f"✅ Output shape: {output.shape}")
    print(f"📊 Attention weights shape: {attention_weights.shape}")
    print(f"🧠 Number of heads: {num_heads}")
    print(f"🔧 Dimension per head: {d_model // num_heads}")
    
    return output, attention_weights

mha_output, mha_attention_weights = test_multihead_attention()

## 5. 🎭 Multi-Head Visualization
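
The per-head analysis in this section summarizes each head by the Shannon entropy of its attention rows. As a reference point, here is a minimal NumPy sketch (the helper name `row_entropy` is hypothetical) of the two extremes: uniform attention over n keys has entropy ln(n), and one-hot attention has entropy 0.

```python
import numpy as np

def row_entropy(weights, eps=1e-10):
    """Shannon entropy (in nats) of each row of an attention matrix."""
    return -np.sum(weights * np.log(weights + eps), axis=-1)

seq_len = 6
uniform = np.full((seq_len, seq_len), 1.0 / seq_len)  # fully diffuse head
focused = np.eye(seq_len)                             # each token attends to one key

print(row_entropy(uniform).mean())  # close to ln(6)
print(row_entropy(focused).mean())  # close to 0
print(np.log(seq_len))              # upper bound for this sequence length
```

For 6 tokens the maximum is ln(6) ≈ 1.79, so fixed thresholds such as 1.0 and 1.5 only make sense for sequences of similar length; dividing by `np.log(seq_len)` would give a length-independent score.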

In [None]:
def visualize_multihead_attention(attention_weights, sentence=None, max_heads=4):
    """
    Visualize the different attention heads.
    """
    # Take the first example in the batch
    weights = attention_weights[0].numpy()  # Shape: [num_heads, seq_len, seq_len]
    
    if sentence is None:
        seq_len = weights.shape[1]
        sentence = [f"T{i}" for i in range(seq_len)]
    
    num_heads = min(weights.shape[0], max_heads)
    
    # Create the subplots (two per row, enough rows for the heads shown)
    fig = make_subplots(
        rows=(num_heads + 1) // 2, cols=2,
        subplot_titles=[f"Head {i+1}" for i in range(num_heads)],
        vertical_spacing=0.08,
        horizontal_spacing=0.05
    )
    
    for i in range(num_heads):
        row = (i // 2) + 1
        col = (i % 2) + 1
        
        fig.add_trace(
            go.Heatmap(
                z=weights[i],
                x=sentence,
                y=sentence,
                colorscale='Viridis',
                showscale=(i == 0),
                text=np.round(weights[i], 2),
                texttemplate="%{text}",
                textfont={"size":8}
            ),
            row=row, col=col
        )
    
    fig.update_layout(
        title="🧠 Multi-Head Attention - Different Perspectives",
        height=800
    )
    
    fig.show()
    
    # Pattern analysis
    print("🔍 Attention pattern analysis:")
    for i in range(num_heads):
        # Shannon entropy of each query's distribution, averaged over queries
        # (low = focused; the maximum is ln(seq_len) for uniform attention)
        entropy = -np.sum(weights[i] * np.log(weights[i] + 1e-10), axis=-1)
        avg_entropy = np.mean(entropy)
        
        if avg_entropy < 1.0:
            pattern = "Highly focused"
        elif avg_entropy < 1.5:
            pattern = "Moderately focused"
        else:
            pattern = "Diffuse"
            
        print(f"  Head {i+1}: {pattern} (entropy: {avg_entropy:.2f})")

# Test with an example sentence
sentence_example = ["The", "cat", "sat", "on", "the", "mat"]

# Use a short sequence and a single example so the plot stays readable
d_model = 512
num_heads = 8
seq_len = len(sentence_example)
batch_size = 1

tf.random.set_seed(123)
x_small = tf.random.normal([batch_size, seq_len, d_model])
mha_small = MultiHeadAttention(d_model, num_heads)
_, small_attention_weights = mha_small(x_small, x_small, x_small)

visualize_multihead_attention(small_attention_weights, sentence_example)

## 6. 🔄 Self-Attention vs Cross-Attention
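
In self-attention the queries, keys, and values all come from the same sequence; in cross-attention the queries come from one sequence and the keys and values from another. Here is a minimal NumPy sketch of the resulting shapes, without learned projections or multiple heads (the function name `attention` and the sizes are illustrative assumptions):

```python
import numpy as np

def attention(q, k, v):
    """Single-head scaled dot-product attention on 2-D arrays."""
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(42)
d_model = 16
source = rng.normal(size=(5, d_model))  # e.g. 5 source tokens
target = rng.normal(size=(4, d_model))  # e.g. 4 target tokens

# Self-attention: the source attends to itself -> square weight matrix.
self_out, self_w = attention(source, source, source)

# Cross-attention: target queries, source keys/values -> rectangular matrix.
cross_out, cross_w = attention(target, source, source)

print(self_w.shape)     # (5, 5)
print(cross_w.shape)    # (4, 5)
print(cross_out.shape)  # (4, 16)
```

The output always has one row per query, while the weight matrix has one column per key.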

In [None]:
def demonstrate_attention_types():
    """
    Demonstrate the difference between self-attention and cross-attention.
    """
    print("🔄 Demo: Self-Attention vs Cross-Attention")
    print("=" * 55)
    
    # Parameters
    d_model = 256
    num_heads = 4
    batch_size = 1
    
    # Example sequences
    source_len = 5  # "Hello how are you ?"
    target_len = 4  # "Bonjour comment ça va"
    
    tf.random.set_seed(42)
    source_seq = tf.random.normal([batch_size, source_len, d_model])
    target_seq = tf.random.normal([batch_size, target_len, d_model])
    
    mha = MultiHeadAttention(d_model, num_heads)
    
    # 1. Self-attention over the source sequence
    print("1️⃣ Self-Attention (source attends to source):")
    self_output, self_weights = mha(source_seq, source_seq, source_seq)
    print(f"   Input: {source_seq.shape}")
    print(f"   Output: {self_output.shape}")
    print(f"   Attention weights: {self_weights.shape}")
    
    # 2. Cross-attention (target attends to source)
    print("\n2️⃣ Cross-Attention (target attends to source):")
    cross_output, cross_weights = mha(
        v=source_seq,  # Values come from the source
        k=source_seq,  # Keys come from the source
        q=target_seq   # Queries come from the target
    )
    print(f"   Queries (target): {target_seq.shape}")
    print(f"   Keys/Values (source): {source_seq.shape}")
    print(f"   Output: {cross_output.shape}")
    print(f"   Attention weights: {cross_weights.shape}")
    
    return self_weights, cross_weights

self_attn_weights, cross_attn_weights = demonstrate_attention_types()

## 7. 🎯 Practical Application: Sentiment Analysis
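
The model in this section adds a sinusoidal positional encoding to its embeddings, following PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). Here is a standalone NumPy sketch of that formula (the function name is a hypothetical stand-in for the class method):

```python
import numpy as np

def sinusoidal_positional_encoding(max_length, d_model):
    """Return an array of shape [max_length, d_model]."""
    positions = np.arange(max_length)[:, np.newaxis]  # [max_length, 1]
    dims = np.arange(d_model)[np.newaxis, :]          # [1, d_model]
    # Dimensions 2i and 2i+1 share the same frequency.
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates

    encoding = np.zeros((max_length, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions
    encoding[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions
    return encoding

pe = sinusoidal_positional_encoding(max_length=50, d_model=128)
print(pe.shape)   # (50, 128)
print(pe[0, :4])  # position 0: sin(0) = 0 and cos(0) = 1, alternating
```

Even and odd refer to embedding dimensions, not to token positions: every position receives both sine and cosine components.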

In [None]:
class AttentionSentimentAnalyzer(tf.keras.Model):
    """
    Sentiment analysis model built on attention.
    """
    
    def __init__(self, vocab_size, d_model=128, num_heads=4, max_length=100, **kwargs):
        super(AttentionSentimentAnalyzer, self).__init__(**kwargs)
        
        self.d_model = d_model
        self.max_length = max_length
        
        # Embedding layers
        self.embedding = tf.keras.layers.Embedding(vocab_size, d_model)
        self.pos_encoding = self.positional_encoding(max_length, d_model)
        
        # Attention layer
        self.mha = MultiHeadAttention(d_model, num_heads)
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        
        # Feed-forward network
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(d_model * 4, activation='relu'),
            tf.keras.layers.Dense(d_model)
        ])
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        
        # Classification
        self.global_pool = tf.keras.layers.GlobalAveragePooling1D()
        self.dropout = tf.keras.layers.Dropout(0.1)
        self.classifier = tf.keras.layers.Dense(1, activation='sigmoid')
    
    def positional_encoding(self, position, d_model):
        """
        Build the sinusoidal positional encoding.
        """
        angle_rads = self.get_angles(
            np.arange(position)[:, np.newaxis],
            np.arange(d_model)[np.newaxis, :],
            d_model
        )
        
        # Sine on the even embedding dimensions (2i)
        angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
        # Cosine on the odd embedding dimensions (2i+1)
        angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
        
        pos_encoding = angle_rads[np.newaxis, ...]
        return tf.cast(pos_encoding, dtype=tf.float32)
    
    def get_angles(self, pos, i, d_model):
        angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model))
        return pos * angle_rates
    
    def call(self, x, training=None, return_attention=False):
        seq_len = tf.shape(x)[1]
        
        # Embedding + Positional Encoding
        x = self.embedding(x)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]
        
        # Self-Attention
        attn_output, attention_weights = self.mha(x, x, x, training=training)
        x = self.layernorm1(x + attn_output)
        
        # Feed Forward
        ffn_output = self.ffn(x)
        x = self.layernorm2(x + ffn_output)
        
        # Classification
        x = self.global_pool(x)
        x = self.dropout(x, training=training)
        output = self.classifier(x)
        
        if return_attention:
            return output, attention_weights
        return output

# Test the model
def test_sentiment_model():
    print("🎭 Testing the Attention-Based Sentiment Model")
    print("=" * 50)
    
    # Parameters
    vocab_size = 10000
    d_model = 128
    num_heads = 4
    max_length = 50
    
    # Example data
    batch_size = 4
    seq_len = 20
    
    # Random sequences (stand-ins for tokenized sentences)
    x = tf.random.uniform([batch_size, seq_len], maxval=vocab_size, dtype=tf.int32)
    
    # Create the model
    model = AttentionSentimentAnalyzer(vocab_size, d_model, num_heads, max_length)
    
    # Prediction (the model is untrained, so the outputs are not meaningful yet)
    predictions, attention_weights = model(x, return_attention=True)
    
    print(f"📐 Input shape: {x.shape}")
    print(f"✅ Predictions shape: {predictions.shape}")
    print(f"📊 Attention weights shape: {attention_weights.shape}")
    print(f"🎯 Example predictions: {predictions.numpy().flatten()[:4]}")
    
    # Model statistics
    total_params = sum([tf.size(v).numpy() for v in model.trainable_variables])
    print(f"🔧 Total trainable parameters: {total_params:,}")
    
    return model, attention_weights

sentiment_model, sentiment_attention = test_sentiment_model()

## 8. 📊 RNN vs Attention Comparison
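
A common way to compare the two families is by rough per-layer operation counts: about n²·d for self-attention and about n·d² for a recurrent layer, where n is the sequence length and d the model width. Here is a small arithmetic sketch under those assumptions (the function names and the width 512 are illustrative, and constant factors are ignored):

```python
def self_attention_cost(n, d):
    """Rough per-layer operation count for self-attention: n^2 * d."""
    return n * n * d

def recurrent_cost(n, d):
    """Rough per-layer operation count for a recurrent layer: n * d^2."""
    return n * d * d

d = 512  # hypothetical model width
for n in (128, 512, 2048):
    ratio = self_attention_cost(n, d) / recurrent_cost(n, d)
    print(f"n={n:5d}: self-attention / recurrent cost = {ratio:.2f}")
```

The ratio is simply n/d, so by this count self-attention is cheaper while sequences are shorter than the model width and more expensive beyond that. The recurrent layer still needs n sequential steps, whereas attention processes all positions at once.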

In [None]:
def compare_rnn_vs_attention():
    """
    Compare the characteristics of RNNs and attention.
    """
    print("⚖️ RNN vs Attention Comparison")
    print("=" * 40)
    
    # Illustrative placeholder values, not measured benchmark results
    metrics = {
        'Architecture': ['RNN/LSTM', 'Self-Attention', 'Multi-Head Attention'],
        'Parallelization': ['❌ Sequential', '✅ Parallel', '✅ Parallel'],
        'Per-layer complexity': ['O(n·d²)', 'O(n²·d)', 'O(n²·d)'],
        'Long-range dependencies': ['Limited', '✅ Direct (within context)', '✅ Direct (within context)'],
        'Training speed': ['Slow', 'Fast', 'Fast'],
        'Performance (BLEU)': [25.4, 32.1, 35.8],
        'Parameters (M)': [2.5, 8.2, 12.1]
    }
    
    # Build the comparison table
    import pandas as pd
    df = pd.DataFrame(metrics)
    
    # Print the table
    print(df.to_string(index=False))
    
    # Performance plot
    fig = go.Figure()
    
    fig.add_trace(go.Scatter(
        x=metrics['Parameters (M)'],
        y=metrics['Performance (BLEU)'],
        mode='markers+lines+text',
        text=metrics['Architecture'],
        textposition="top center",
        marker=dict(size=15, color=['red', 'blue', 'green']),
        line=dict(width=2)
    ))
    
    fig.update_layout(
        title="📈 Performance vs Model Size (illustrative values)",
        xaxis_title="Number of parameters (millions)",
        yaxis_title="BLEU score (translation)",
        width=800,
        height=500
    )
    
    fig.show()
    
    print("\n💡 Key observations:")
    print("• Attention processes all positions in parallel → faster training on parallel hardware")
    print("• Any two positions are connected in a single step → long-range dependencies are easier to learn")
    print("• In this illustrative table, more parameters go with higher scores (values are not measured)")
    print("• The O(n²) cost in sequence length can be a problem for very long sequences")

compare_rnn_vs_attention()

## 9. 🎮 Interactive Attention

In [None]:
def interactive_attention_demo():
    """
    Demonstration of attention on example sentences.
    """
    print("🎮 Interactive Attention Demo")
    print("=" * 45)
    
    # Example sentences with different patterns
    examples = [
        "The cat sat on the mat",
        "Despite the rain, the game continued",
        "She loves him but he doesn't care",
        "The quick brown fox jumps over the lazy dog"
    ]
    
    for i, sentence in enumerate(examples):
        print(f"\n{i+1}. Sentence: '{sentence}'")
        
        # Simple whitespace tokenization
        tokens = sentence.lower().split()
        vocab = {word: idx for idx, word in enumerate(sorted(set(tokens)))}
        token_ids = [vocab[word] for word in tokens]
        
        # Random embeddings, one per vocabulary entry, so repeated words share a vector
        tf.random.set_seed(42 + i)  # Different seed for each sentence
        embedding_table = tf.random.normal([len(vocab), 64])
        embeddings = tf.gather(embedding_table, token_ids)
        
        # Compute attention
        attention_layer = ScaledDotProductAttention()
        _, attention_weights = attention_layer(
            embeddings[None, ...], 
            embeddings[None, ...], 
            embeddings[None, ...]
        )
        
        # Pattern analysis
        weights = attention_weights[0].numpy()
        
        # For each word, find the other position it attends to most.
        # The diagonal is excluded first: with Q = K, a token's score with
        # itself usually dominates.
        off_diagonal = weights.copy()
        np.fill_diagonal(off_diagonal, -1.0)
        max_attention_pairs = []
        for j in range(len(tokens)):
            max_idx = np.argmax(off_diagonal[j])
            max_attention_pairs.append(
                (tokens[j], tokens[max_idx], weights[j, max_idx])
            )
        
        # Show the top 3 relations
        max_attention_pairs.sort(key=lambda x: x[2], reverse=True)
        print("   Strongest attention relations (random embeddings, so not linguistic):")
        for word1, word2, strength in max_attention_pairs[:3]:
            print(f"   • '{word1}' → '{word2}' ({strength:.3f})")
        
        # Quick visualization (first sentence only); position suffixes keep
        # repeated words such as "the" as separate axis categories
        if i == 0:
            labels = [f"{tok}_{pos}" for pos, tok in enumerate(tokens)]
            visualize_attention_weights(attention_weights, labels)

interactive_attention_demo()

## 10. 🔮 Outlook and Advanced Applications

In [None]:
def advanced_attention_applications():
    """
    Present advanced applications of attention.
    """
    print("🔮 Advanced Applications of Attention")
    print("=" * 45)
    
    applications = {
        "🌍 Machine Translation": {
            "Description": "Cross-attention between source and target languages",
            "Avantage": "Soft word alignment learned automatically",
            "Exemple": "'Je mange' → 'I eat', attending to corresponding words"
        },
        "📝 Text Summarization": {
            "Description": "Attention over important sentences and concepts",
            "Avantage": "Automatic identification of key elements",
            "Exemple": "1000-word article → 100-word summary"
        },
        "❓ Question Answering": {
            "Description": "Attention between the question and passages of the text",
            "Avantage": "Precise localization of answers",
            "Exemple": "'Who is the president?' → attention on proper nouns"
        },
        "🎨 Image Generation": {
            "Description": "Attention between the description and regions of the image",
            "Avantage": "Coherent and controllable generation",
            "Exemple": "'A black cat on a red rug' → spatial attention"
        },
        "🗣️ Speech Recognition": {
            "Description": "Attention between the audio signal and the transcript",
            "Avantage": "Automatic temporal alignment",
            "Exemple": "Spectrogram → text with temporal correspondences"
        }
    }
    
    for app_name, details in applications.items():
        print(f"\n{app_name}:")
        print(f"  📋 {details['Description']}")
        print(f"  ✨ {details['Avantage']}")
        print(f"  💡 {details['Exemple']}")
    
    # Future trends
    print("\n🚀 Future Trends:")
    future_trends = [
        "Sparse Attention: reducing the O(n²) complexity",
        "Linear Attention: attention in O(n)",
        "Cross-Modal Attention: vision + text + audio",
        "Hierarchical Attention: multi-level attention",
        "Adaptive Attention: dynamic context length"
    ]
    
    for trend in future_trends:
        print(f"  • {trend}")

advanced_attention_applications()

## 🎯 Conclusion and Recap

### What you have learned:

✅ **Concept of attention**: a mechanism for focusing on the relevant parts of the input  
✅ **Scaled dot-product attention**: mathematical formula and implementation  
✅ **Multi-head attention**: several attention perspectives in parallel  
✅ **Self vs cross attention**: differences and use cases  
✅ **Practical applications**: sentiment, translation, generation  
✅ **Advantages over RNNs**: parallelization and long-range dependencies

### Next steps:
- **Module 6.2**: Full Transformer architecture
- **Module 7**: BERT and GPT
- **Hands-on projects**: implementing your own models

### 🛠️ Suggested Exercises:
1. Change the attention parameters (temperature, number of heads)
2. Implement different types of masks
3. Build a simple translation model
4. Analyze attention patterns on your own texts
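
As a starting point for exercise 2, here is a hedged NumPy sketch of two common masks, written for the convention used by `ScaledDotProductAttention` in this notebook (1 marks a position to hide, and `mask * -1e9` is added to the scores). The helper names `causal_mask` and `padding_mask`, and the padding id 0, are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def causal_mask(seq_len):
    """1 above the diagonal: a query may not look at later positions."""
    return np.triu(np.ones((seq_len, seq_len)), k=1)

def padding_mask(token_ids, pad_id=0):
    """1 where the key is a padding token; shape [1, seq_len] for broadcasting."""
    return (np.asarray(token_ids) == pad_id).astype(np.float64)[np.newaxis, :]

# Uniform scores make the effect of each mask easy to read.
scores = np.zeros((4, 4))

causal_weights = softmax(scores + causal_mask(4) * -1e9)
pad_weights = softmax(scores + padding_mask([5, 8, 3, 0]) * -1e9)

print(np.round(causal_weights, 2))  # lower-triangular rows, each summing to 1
print(np.round(pad_weights, 2))     # last column (the padding token) is 0
```

Both masks can be combined by taking their element-wise maximum before adding them to the scores.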

In [None]:
# 🎊 Congratulations! You now know the core attention mechanisms!
print("""🎊 CONGRATULATIONS! 🎊

You have just covered one of the most important concepts in modern NLP!

🧠 Attention is used in:
  • GPT (ChatGPT, GPT-4)
  • BERT (Google Search)
  • T5 (text-to-text tasks such as translation and summarization)
  • DALL-E (image generation)
  • And many others...

🚀 Continue on to the full Transformer architecture!
""")