# ⚡ Attention Optimizations: From Flash Attention to MLA

In this notebook we implement and compare several attention mechanisms:

1. Standard Multi-Head Attention (MHA)
2. Flash Attention (PyTorch native)
3. Multi-Query Attention (MQA)
4. Grouped-Query Attention (GQA)
5. Multi-head Latent Attention (MLA)

## Contents

1. [Setup](#1-setup)
2. [Standard MHA](#2-standard-mha)
3. [Flash Attention](#3-flash-attention)
4. [Multi-Query Attention](#4-multi-query-attention-mqa)
5. [Grouped-Query Attention](#5-grouped-query-attention-gqa)
6. [Multi-head Latent Attention](#6-multi-head-latent-attention-mla)
7. [Benchmark and Comparison](#7-benchmark-and-comparison)

---
## 1. Setup

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import time
import matplotlib.pyplot as plt
import numpy as np
from dataclasses import dataclass
from typing import Optional, Tuple

# Reproducibility
torch.manual_seed(42)

# Device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"🖥️ Device: {device}")
print(f"📦 PyTorch: {torch.__version__}")

if torch.cuda.is_available():
    print(f"🎮 GPU: {torch.cuda.get_device_name(0)}")
    print(f"💾 GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
@dataclass
class AttentionConfig:
    """Attention configuration."""
    d_model: int = 512
    num_heads: int = 8
    num_kv_groups: int = 2      # for GQA
    d_latent: int = 128         # for MLA
    dropout: float = 0.1
    max_seq_len: int = 2048

config = AttentionConfig()
print(f"⚙️ Config: {config}")

---
## 2. Standard MHA

A standard Multi-Head Attention implementation, used as the baseline for everything that follows.

In [None]:
class StandardMHA(nn.Module):
    """
    Standard Multi-Head Attention

    Separate Q, K, V for every head.
    KV-cache size: 2 × num_heads × seq_len × d_head elements (per batch item)
    """
    
    def __init__(self, config: AttentionConfig):
        super().__init__()
        
        self.num_heads = config.num_heads
        self.d_head = config.d_model // config.num_heads
        self.d_model = config.d_model
        
        # Q, K, V projections
        self.W_q = nn.Linear(config.d_model, config.d_model)
        self.W_k = nn.Linear(config.d_model, config.d_model)
        self.W_v = nn.Linear(config.d_model, config.d_model)
        self.W_o = nn.Linear(config.d_model, config.d_model)
        
        self.dropout = nn.Dropout(config.dropout)
        
        # Causal mask
        mask = torch.triu(torch.ones(config.max_seq_len, config.max_seq_len), diagonal=1).bool()
        self.register_buffer('causal_mask', mask)
        
    def forward(
        self, 
        x: torch.Tensor,
        kv_cache: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
        use_cache: bool = False
    ) -> Tuple[torch.Tensor, Optional[Tuple[torch.Tensor, torch.Tensor]]]:
        """
        Args:
            x: [batch, seq_len, d_model]
            kv_cache: Tuple of (K_cache, V_cache) for inference
            use_cache: Whether to return new KV cache
        """
        B, T, C = x.shape
        
        # Projections
        Q = self.W_q(x)
        K = self.W_k(x)
        V = self.W_v(x)
        
        # Reshape: [B, T, num_heads, d_head] -> [B, num_heads, T, d_head]
        Q = Q.view(B, T, self.num_heads, self.d_head).transpose(1, 2)
        K = K.view(B, T, self.num_heads, self.d_head).transpose(1, 2)
        V = V.view(B, T, self.num_heads, self.d_head).transpose(1, 2)
        
        # KV Cache handling
        if kv_cache is not None:
            K_cache, V_cache = kv_cache
            K = torch.cat([K_cache, K], dim=2)
            V = torch.cat([V_cache, V], dim=2)
        
        new_cache = (K, V) if use_cache else None
        
        # Attention
        T_full = K.size(2)
        scores = (Q @ K.transpose(-2, -1)) / math.sqrt(self.d_head)
        
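        # Causal mask: query i sits at absolute position T_full - T + i and may only
        # attend to keys at or before that position. Slicing with the cache offset
        # also masks multi-token chunks appended to an existing cache correctly.
        scores = scores.masked_fill(self.causal_mask[T_full - T:T_full, :T_full], float('-inf'))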
        
        attn = F.softmax(scores, dim=-1)
        attn = self.dropout(attn)
        
        # Output
        out = attn @ V
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        out = self.W_o(out)
        
        return out, new_cache
    
    def get_kv_cache_size(self, seq_len: int, batch_size: int = 1) -> int:
        """Compute the KV-cache size in bytes, assuming float32 storage."""
        # 2 (K+V) × batch × num_heads × seq_len × d_head × 4 bytes (float32)
        return 2 * batch_size * self.num_heads * seq_len * self.d_head * 4

# Test
mha = StandardMHA(config).to(device)
x = torch.randn(2, 64, config.d_model).to(device)
out, cache = mha(x, use_cache=True)

print("✅ Standard MHA")
print(f"   Input:  {x.shape}")
print(f"   Output: {out.shape}")
print(f"   KV Cache: K={cache[0].shape}, V={cache[1].shape}")
print(f"   Cache size: {mha.get_kv_cache_size(64) / 1024:.1f} KB")
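
To see why the cache can be reused between decoding steps, here is a minimal sketch with a single head and no projections. The helper `causal_attention` is hypothetical and not part of the class above. It decodes a sequence one token at a time against a growing K/V cache and checks that the result matches a single full causal pass.

```python
import math
import torch

torch.manual_seed(0)

def causal_attention(q, k, v):
    """Single-head attention. q: [T, d]; k, v: [T_full, d]; queries are the last T positions."""
    T, T_full = q.size(0), k.size(0)
    scores = (q @ k.T) / math.sqrt(q.size(-1))
    # Query i sits at absolute position T_full - T + i; mask every key after it.
    future = torch.ones(T, T_full, dtype=torch.bool).triu(diagonal=T_full - T + 1)
    scores = scores.masked_fill(future, float('-inf'))
    return torch.softmax(scores, dim=-1) @ v

seq_len, d = 6, 8
q, k, v = (torch.randn(seq_len, d) for _ in range(3))

# Full pass: all tokens at once.
full_out = causal_attention(q, k, v)

# Incremental pass: one token per step, appending its K/V row to the cache.
k_cache = torch.empty(0, d)
v_cache = torch.empty(0, d)
steps = []
for t in range(seq_len):
    k_cache = torch.cat([k_cache, k[t:t + 1]], dim=0)
    v_cache = torch.cat([v_cache, v[t:t + 1]], dim=0)
    steps.append(causal_attention(q[t:t + 1], k_cache, v_cache))
incremental_out = torch.cat(steps, dim=0)

max_diff = (full_out - incremental_out).abs().max().item()
print(f"max |full - incremental| = {max_diff:.2e}")
```

The two paths agree up to floating-point rounding, which is what makes it safe to store K and V instead of recomputing them for every new token.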

---
## 3. Flash Attention

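`scaled_dot_product_attention` (SDPA), available since PyTorch 2.0, dispatches to a fused kernel: FlashAttention, memory-efficient attention, or a plain math fallback, depending on the device, dtype, and input shapes. The FlashAttention kernel requires CUDA and half-precision (fp16/bf16) inputs, so the float32 tensors used in this notebook will typically run on one of the other backends.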

In [None]:
class FlashMHA(nn.Module):
    """
    Multi-Head Attention with Flash Attention

    Uses PyTorch's native scaled_dot_product_attention (SDPA).
    SDPA selects a fused backend, using FlashAttention where the device and dtype allow it.
    """
    
    def __init__(self, config: AttentionConfig):
        super().__init__()
        
        self.num_heads = config.num_heads
        self.d_head = config.d_model // config.num_heads
        self.d_model = config.d_model
        
        self.W_q = nn.Linear(config.d_model, config.d_model)
        self.W_k = nn.Linear(config.d_model, config.d_model)
        self.W_v = nn.Linear(config.d_model, config.d_model)
        self.W_o = nn.Linear(config.d_model, config.d_model)
        
        self.dropout = config.dropout
        
    def forward(
        self, 
        x: torch.Tensor,
        kv_cache: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
        use_cache: bool = False
    ) -> Tuple[torch.Tensor, Optional[Tuple[torch.Tensor, torch.Tensor]]]:
        
        B, T, C = x.shape
        
        Q = self.W_q(x).view(B, T, self.num_heads, self.d_head).transpose(1, 2)
        K = self.W_k(x).view(B, T, self.num_heads, self.d_head).transpose(1, 2)
        V = self.W_v(x).view(B, T, self.num_heads, self.d_head).transpose(1, 2)
        
        if kv_cache is not None:
            K = torch.cat([kv_cache[0], K], dim=2)
            V = torch.cat([kv_cache[1], V], dim=2)
        
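        new_cache = (K, V) if use_cache else None
        
        # 🚀 Flash Attention via PyTorch native SDPA.
        # is_causal=True is only correct when Q and K have the same length (no cache).
        # With a cache, single-token decoding needs no mask; multi-token chunks get
        # an explicit boolean mask (True = may attend).
        T_full = K.size(2)
        attn_mask = None
        if kv_cache is not None and T > 1:
            attn_mask = torch.ones(T, T_full, dtype=torch.bool, device=x.device).tril(diagonal=T_full - T)
        
        out = F.scaled_dot_product_attention(
            Q, K, V,
            attn_mask=attn_mask,
            dropout_p=self.dropout if self.training else 0.0,
            is_causal=(kv_cache is None)  # causal flag only on the first forward pass (no cache)
        )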
        
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        out = self.W_o(out)
        
        return out, new_cache
    
    def get_kv_cache_size(self, seq_len: int, batch_size: int = 1) -> int:
        return 2 * batch_size * self.num_heads * seq_len * self.d_head * 4

# Test
flash_mha = FlashMHA(config).to(device)
out, cache = flash_mha(x, use_cache=True)

print("✅ Flash MHA")
print(f"   Output: {out.shape}")

# Which SDPA backends are enabled (enabled does not mean selected for a given call)
if torch.cuda.is_available():
    print("\n🔍 SDPA backend check:")
    print(f"   Flash: {torch.backends.cuda.flash_sdp_enabled()}")
    print(f"   Memory Efficient: {torch.backends.cuda.mem_efficient_sdp_enabled()}")
    print(f"   Math: {torch.backends.cuda.math_sdp_enabled()}")
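
Whichever backend is chosen, SDPA should compute the same function as the explicit implementation in section 2. The following sketch, assuming PyTorch 2.0 or newer and float32 inputs on CPU, compares the fused call against a hand-written causal attention.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H, T, D = 2, 4, 16, 32
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))

# Reference path: explicit scores, causal mask, softmax.
scores = (q @ k.transpose(-2, -1)) / math.sqrt(D)
future = torch.ones(T, T, dtype=torch.bool).triu(diagonal=1)
manual = torch.softmax(scores.masked_fill(future, float('-inf')), dim=-1) @ v

# Fused path: SDPA applies the same 1/sqrt(D) scaling and causal mask internally.
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

sdpa_max_diff = (manual - fused).abs().max().item()
print(f"max |manual - SDPA| = {sdpa_max_diff:.2e}")
```

Small differences are expected because fused kernels accumulate in a different order; the outputs should still match to within normal float32 tolerance.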

---
## 4. Multi-Query Attention (MQA)

Uses a **single K and V** shared by all heads, which shrinks the KV cache by a factor of `num_heads`.

In [None]:
class MultiQueryAttention(nn.Module):
    """
    Multi-Query Attention (MQA)
    
    - Q: separate per head (num_heads of them)
    - K, V: a single pair shared by all heads
    
    KV-cache size: 2 × seq_len × d_head elements (not multiplied by num_heads)
    """
    
    def __init__(self, config: AttentionConfig):
        super().__init__()
        
        self.num_heads = config.num_heads
        self.d_head = config.d_model // config.num_heads
        self.d_model = config.d_model
        
        # Q: full projection (all heads)
        self.W_q = nn.Linear(config.d_model, config.d_model)
        
        # K, V: the width of a single head only
        self.W_k = nn.Linear(config.d_model, self.d_head)
        self.W_v = nn.Linear(config.d_model, self.d_head)
        
        self.W_o = nn.Linear(config.d_model, config.d_model)
        self.dropout = nn.Dropout(config.dropout)
        
        # Causal mask
        mask = torch.triu(torch.ones(config.max_seq_len, config.max_seq_len), diagonal=1).bool()
        self.register_buffer('causal_mask', mask)
        
    def forward(
        self, 
        x: torch.Tensor,
        kv_cache: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
        use_cache: bool = False
    ) -> Tuple[torch.Tensor, Optional[Tuple[torch.Tensor, torch.Tensor]]]:
        
        B, T, C = x.shape
        
        # Q: [B, num_heads, T, d_head]
        Q = self.W_q(x).view(B, T, self.num_heads, self.d_head).transpose(1, 2)
        
        # K, V: [B, T, d_head] (a single head)
        K = self.W_k(x)  # [B, T, d_head]
        V = self.W_v(x)  # [B, T, d_head]
        
        # KV Cache
        if kv_cache is not None:
            K = torch.cat([kv_cache[0], K], dim=1)
            V = torch.cat([kv_cache[1], V], dim=1)
        
        new_cache = (K, V) if use_cache else None
        
        # Broadcast K, V across all heads (expand returns a view, so nothing is copied)
        # [B, T_full, d_head] -> [B, 1, T_full, d_head] -> [B, num_heads, T_full, d_head]
        T_full = K.size(1)
        K = K.unsqueeze(1).expand(-1, self.num_heads, -1, -1)
        V = V.unsqueeze(1).expand(-1, self.num_heads, -1, -1)
        
        # Attention
        scores = (Q @ K.transpose(-2, -1)) / math.sqrt(self.d_head)
        
        # Causal mask, offset by the cache length so cached decoding is masked correctly
        scores = scores.masked_fill(self.causal_mask[T_full - T:T_full, :T_full], float('-inf'))
        
        attn = F.softmax(scores, dim=-1)
        attn = self.dropout(attn)
        
        out = attn @ V
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        out = self.W_o(out)
        
        return out, new_cache
    
    def get_kv_cache_size(self, seq_len: int, batch_size: int = 1) -> int:
        """MQA: cache for a single head only (not multiplied by num_heads); bytes, float32."""
        return 2 * batch_size * seq_len * self.d_head * 4

# Test
mqa = MultiQueryAttention(config).to(device)
out, cache = mqa(x, use_cache=True)

print("✅ Multi-Query Attention (MQA)")
print(f"   Output: {out.shape}")
print(f"   KV Cache: K={cache[0].shape}, V={cache[1].shape}")
print(f"   Cache size: {mqa.get_kv_cache_size(64) / 1024:.1f} KB")
print(f"\n   📉 Cache reduction vs. MHA: {config.num_heads}x")
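
The broadcast step in MQA does not undo the memory saving. As a small illustration (shapes chosen to mirror the test above, variable names are illustrative), `expand` returns a view whose head dimension has stride 0, so every head reads the same cached tensor.

```python
import torch

B, T, num_heads, d_head = 2, 64, 8, 64
k_shared = torch.randn(B, T, d_head)   # what MQA caches: one head's worth of keys

# Same broadcast as in the module: [B, T, d_head] -> [B, num_heads, T, d_head]
k_view = k_shared.unsqueeze(1).expand(-1, num_heads, -1, -1)

shares_storage = k_view.data_ptr() == k_shared.data_ptr()   # no new allocation
head_stride = k_view.stride()[1]                            # 0 means "reuse the same data"
all_heads_equal = all(torch.equal(k_view[:, h], k_shared) for h in range(num_heads))

print(tuple(k_view.shape), shares_storage, head_stride, all_heads_equal)
```

Note that calling `.contiguous()` on the expanded tensor would materialize all `num_heads` copies, so the view should be passed to the matmul directly.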

---
## 5. Grouped-Query Attention (GQA)

A middle ground between MHA and MQA: K and V are shared within **a few groups** of heads rather than across all of them.

In [None]:
class GroupedQueryAttention(nn.Module):
    """
    Grouped-Query Attention (GQA)
    
    - Q: separate per head (num_heads of them)
    - K, V: one pair per group (num_kv_groups of them)
    
    Example: 8 Q heads, 2 KV groups
    - Heads 0,1,2,3 → KV group 0
    - Heads 4,5,6,7 → KV group 1
    """
    
    def __init__(self, config: AttentionConfig):
        super().__init__()
        
        assert config.num_heads % config.num_kv_groups == 0, \
            "num_heads must be divisible by num_kv_groups"
        
        self.num_heads = config.num_heads
        self.num_kv_groups = config.num_kv_groups
        self.heads_per_group = config.num_heads // config.num_kv_groups
        self.d_head = config.d_model // config.num_heads
        self.d_model = config.d_model
        
        # Q: for all heads
        self.W_q = nn.Linear(config.d_model, config.d_model)
        
        # K, V: only as many heads as there are groups
        kv_dim = self.num_kv_groups * self.d_head
        self.W_k = nn.Linear(config.d_model, kv_dim)
        self.W_v = nn.Linear(config.d_model, kv_dim)
        
        self.W_o = nn.Linear(config.d_model, config.d_model)
        self.dropout = nn.Dropout(config.dropout)
        
        # Causal mask
        mask = torch.triu(torch.ones(config.max_seq_len, config.max_seq_len), diagonal=1).bool()
        self.register_buffer('causal_mask', mask)
        
    def forward(
        self, 
        x: torch.Tensor,
        kv_cache: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
        use_cache: bool = False
    ) -> Tuple[torch.Tensor, Optional[Tuple[torch.Tensor, torch.Tensor]]]:
        
        B, T, C = x.shape
        
        # Q: [B, num_heads, T, d_head]
        Q = self.W_q(x).view(B, T, self.num_heads, self.d_head).transpose(1, 2)
        
        # K, V: [B, T, num_kv_groups, d_head]
        K = self.W_k(x).view(B, T, self.num_kv_groups, self.d_head)
        V = self.W_v(x).view(B, T, self.num_kv_groups, self.d_head)
        
        # KV cache (stored in the grouped layout)
        if kv_cache is not None:
            K = torch.cat([kv_cache[0], K], dim=1)
            V = torch.cat([kv_cache[1], V], dim=1)
        
        new_cache = (K, V) if use_cache else None
        
        # Expand K, V from groups to heads
        # [B, T_full, num_kv_groups, d_head] -> [B, T_full, num_heads, d_head]
        K = K.repeat_interleave(self.heads_per_group, dim=2)
        V = V.repeat_interleave(self.heads_per_group, dim=2)
        
        # Transpose: [B, num_heads, T, d_head]
        K = K.transpose(1, 2)
        V = V.transpose(1, 2)
        
        # Attention
        T_full = K.size(2)
        scores = (Q @ K.transpose(-2, -1)) / math.sqrt(self.d_head)
        
        # Causal mask, offset by the cache length so cached decoding is masked correctly
        scores = scores.masked_fill(self.causal_mask[T_full - T:T_full, :T_full], float('-inf'))
        
        attn = F.softmax(scores, dim=-1)
        attn = self.dropout(attn)
        
        out = attn @ V
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        out = self.W_o(out)
        
        return out, new_cache
    
    def get_kv_cache_size(self, seq_len: int, batch_size: int = 1) -> int:
        """GQA: K and V for num_kv_groups heads; bytes, float32."""
        return 2 * batch_size * self.num_kv_groups * seq_len * self.d_head * 4

# Test
gqa = GroupedQueryAttention(config).to(device)
out, cache = gqa(x, use_cache=True)

print("✅ Grouped-Query Attention (GQA)")
print(f"   num_heads: {config.num_heads}, num_kv_groups: {config.num_kv_groups}")
print(f"   Output: {out.shape}")
print(f"   KV Cache: K={cache[0].shape}, V={cache[1].shape}")
print(f"   Cache size: {gqa.get_kv_cache_size(64) / 1024:.1f} KB")
print(f"\n   📉 Cache reduction vs. MHA: {config.num_heads // config.num_kv_groups}x")
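
The head-to-group assignment in the docstring depends on using `repeat_interleave` rather than `repeat`. A short sketch makes the difference visible by expanding group indices instead of real K/V tensors (the variable names are illustrative):

```python
import torch

num_heads, num_kv_groups = 8, 2
heads_per_group = num_heads // num_kv_groups

# Label each KV group with its index, then expand to heads the way the module does.
group_ids = torch.arange(num_kv_groups)
head_to_group = group_ids.repeat_interleave(heads_per_group).tolist()
print(head_to_group)  # [0, 0, 0, 0, 1, 1, 1, 1]

# Tiling with repeat() alternates the groups instead, giving a different mapping.
tiled = group_ids.repeat(heads_per_group).tolist()
print(tiled)          # [0, 1, 0, 1, 0, 1, 0, 1]
```

With `repeat_interleave`, head `h` reads group `h // heads_per_group`, which matches the contiguous blocks of query heads produced by the `view` on `W_q`'s output.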

---
## 6. Multi-head Latent Attention (MLA)

DeepSeek's contribution: compress K and V into a low-dimensional latent space and cache only the latent.

In [None]:
class MultiheadLatentAttention(nn.Module):
    """
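    Multi-head Latent Attention (MLA)
    
    The approach of DeepSeek-V2/V3, in simplified form:
    - K and V are compressed into one low-dimensional latent vector
    - The cache stores only the latent
    - At inference the latent is decompressed back into K and V
    
    Cache reduction vs. MHA: 2 × d_model / d_latent
    (This sketch omits the decoupled RoPE keys and query compression of the original.)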
    """
    
    def __init__(self, config: AttentionConfig):
        super().__init__()
        
        self.num_heads = config.num_heads
        self.d_head = config.d_model // config.num_heads
        self.d_model = config.d_model
        self.d_latent = config.d_latent
        
        # Q projection (standard)
        self.W_q = nn.Linear(config.d_model, config.d_model)
        
        # KV compression (down-projection)
        self.W_kv_down = nn.Linear(config.d_model, config.d_latent)
        
        # KV decompression (up-projection)
        self.W_k_up = nn.Linear(config.d_latent, config.d_model)
        self.W_v_up = nn.Linear(config.d_latent, config.d_model)
        
        self.W_o = nn.Linear(config.d_model, config.d_model)
        self.dropout = nn.Dropout(config.dropout)
        
        # Causal mask
        mask = torch.triu(torch.ones(config.max_seq_len, config.max_seq_len), diagonal=1).bool()
        self.register_buffer('causal_mask', mask)
        
    def forward(
        self, 
        x: torch.Tensor,
        kv_cache: Optional[torch.Tensor] = None,  # latent only
        use_cache: bool = False
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
        """
        Args:
            x: [batch, seq_len, d_model]
            kv_cache: Latent cache [batch, cache_len, d_latent]
        """
        B, T, C = x.shape
        
        # Q: [B, num_heads, T, d_head]
        Q = self.W_q(x).view(B, T, self.num_heads, self.d_head).transpose(1, 2)
        
        # KV: Compress to latent
        kv_latent = self.W_kv_down(x)  # [B, T, d_latent]
        
        # Cache: only the latent is stored
        if kv_cache is not None:
            kv_latent_full = torch.cat([kv_cache, kv_latent], dim=1)
        else:
            kv_latent_full = kv_latent
        
        new_cache = kv_latent_full if use_cache else None
        
        # Decompress: latent → K, V
        K = self.W_k_up(kv_latent_full)  # [B, T_full, d_model]
        V = self.W_v_up(kv_latent_full)  # [B, T_full, d_model]
        
        # Reshape to heads
        T_full = K.size(1)
        K = K.view(B, T_full, self.num_heads, self.d_head).transpose(1, 2)
        V = V.view(B, T_full, self.num_heads, self.d_head).transpose(1, 2)
        
        # Attention
        scores = (Q @ K.transpose(-2, -1)) / math.sqrt(self.d_head)
        
        # Causal mask, offset by the cache length so cached decoding is masked correctly
        scores = scores.masked_fill(self.causal_mask[T_full - T:T_full, :T_full], float('-inf'))
        
        attn = F.softmax(scores, dim=-1)
        attn = self.dropout(attn)
        
        out = attn @ V
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        out = self.W_o(out)
        
        return out, new_cache
    
    def get_kv_cache_size(self, seq_len: int, batch_size: int = 1) -> int:
        """MLA: only the latent is stored; bytes, float32."""
        return batch_size * seq_len * self.d_latent * 4

# Test
mla = MultiheadLatentAttention(config).to(device)
out, cache = mla(x, use_cache=True)

print("✅ Multi-head Latent Attention (MLA)")
print(f"   d_latent: {config.d_latent}")
print(f"   Output: {out.shape}")
print(f"   Latent Cache: {cache.shape}")
print(f"   Cache size: {mla.get_kv_cache_size(64) / 1024:.1f} KB")

# MHA caches K and V (2 × d_model values per token); MLA caches d_latent values per token
compression = 2 * config.d_model / config.d_latent
print(f"\n   📉 Cache reduction vs. MHA: {compression:.1f}x")
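
The class above decompresses the whole latent cache on every call, which is the straightforward approach. The DeepSeek-V2 paper describes an alternative in which the key up-projection is absorbed into the query, so attention scores are computed directly against the latent. Below is a minimal single-head sketch of that identity under simplifying assumptions (no bias, no RoPE, float64 for a tight comparison); all names are illustrative and not taken from any DeepSeek code.

```python
import torch

torch.manual_seed(0)
d_latent, d_head, T_full = 16, 8, 10

W_k_up = torch.randn(d_head, d_latent, dtype=torch.float64)       # up-projection, no bias
latent_cache = torch.randn(T_full, d_latent, dtype=torch.float64)  # what MLA stores
q = torch.randn(d_head, dtype=torch.float64)                       # one query vector

# Path 1: decompress the cache into keys, then take dot products with the query.
K = latent_cache @ W_k_up.T            # [T_full, d_head]
scores_decompressed = K @ q            # [T_full]

# Path 2: fold W_k_up into the query once, then score against the latent directly.
q_latent = W_k_up.T @ q                # [d_latent]
scores_absorbed = latent_cache @ q_latent

absorb_max_diff = (scores_decompressed - scores_absorbed).abs().max().item()
print(f"max |decompressed - absorbed| = {absorb_max_diff:.2e}")
```

Both paths give the same scores because `(C Wᵀ) q = C (Wᵀ q)`; the second one avoids materializing `K` for every cached position. The bias terms in the `nn.Linear` layers above would need separate handling before this could be applied to the class as written.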

---
## 7. Benchmark and Comparison

In [None]:
def benchmark_attention(model, x, num_runs=100, warmup=10):
    """Measure the mean forward-pass time of an attention module, in milliseconds."""
    model.eval()
    
    # Warmup
    with torch.no_grad():
        for _ in range(warmup):
            _ = model(x)
    
    # Synchronize so pending GPU work from the warmup is not counted
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    
    # Benchmark
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(num_runs):
            _ = model(x)
    
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    
    elapsed = time.perf_counter() - start
    return elapsed / num_runs * 1000  # ms

# Prepare the models
models = {
    'Standard MHA': StandardMHA(config).to(device),
    'Flash MHA': FlashMHA(config).to(device),
    'MQA': MultiQueryAttention(config).to(device),
    'GQA': GroupedQueryAttention(config).to(device),
    'MLA': MultiheadLatentAttention(config).to(device),
}

# Benchmark across different sequence lengths
seq_lengths = [64, 128, 256, 512, 1024]
results = {name: [] for name in models}

print("⏱️ Starting benchmark...\n")

for seq_len in seq_lengths:
    x_bench = torch.randn(4, seq_len, config.d_model).to(device)
    
    for name, model in models.items():
        try:
            time_ms = benchmark_attention(model, x_bench, num_runs=50)
            results[name].append(time_ms)
        except Exception as e:
            results[name].append(float('nan'))
            print(f"⚠️ {name} failed at seq_len={seq_len}: {e}")
    
    print(f"Seq length {seq_len}: Done")

print("\n✅ Benchmark complete!")
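
As an alternative to the hand-written timing loop, PyTorch ships `torch.utils.benchmark.Timer`, which is documented to handle warmup and CUDA synchronization itself. A minimal sketch, timing a bare SDPA call on small CPU tensors rather than the modules above:

```python
import torch
import torch.nn.functional as F
from torch.utils import benchmark

torch.manual_seed(0)
q, k, v = (torch.randn(1, 4, 128, 32) for _ in range(3))

timer = benchmark.Timer(
    stmt="F.scaled_dot_product_attention(q, k, v, is_causal=True)",
    globals={"F": F, "q": q, "k": k, "v": v},
)

measurement = timer.timeit(20)        # run the statement 20 times
mean_ms = measurement.mean * 1000     # Measurement.mean is seconds per run
print(f"mean time: {mean_ms:.3f} ms")
```

The absolute numbers depend heavily on hardware and thread settings, so they are only meaningful when compared between runs on the same machine.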

In [None]:
# Visualize the results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: execution time
for name, times in results.items():
    axes[0].plot(seq_lengths, times, marker='o', label=name, linewidth=2)

axes[0].set_xlabel('Sequence Length')
axes[0].set_ylabel('Time (ms)')
axes[0].set_title('Attention Forward Pass Time')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
axes[0].set_yscale('log')

# Right: KV-cache size
cache_sizes = {
    'Standard MHA': [models['Standard MHA'].get_kv_cache_size(s) / 1024 for s in seq_lengths],
    'MQA': [models['MQA'].get_kv_cache_size(s) / 1024 for s in seq_lengths],
    'GQA': [models['GQA'].get_kv_cache_size(s) / 1024 for s in seq_lengths],
    'MLA': [models['MLA'].get_kv_cache_size(s) / 1024 for s in seq_lengths],
}

x_pos = np.arange(len(seq_lengths))
width = 0.2

for i, (name, sizes) in enumerate(cache_sizes.items()):
    axes[1].bar(x_pos + i*width, sizes, width, label=name)

axes[1].set_xlabel('Sequence Length')
axes[1].set_ylabel('KV-Cache Size (KB)')
axes[1].set_title('KV-Cache Memory Usage')
axes[1].set_xticks(x_pos + width*1.5)
axes[1].set_xticklabels(seq_lengths)
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

In [None]:
# Summary table
print("\n" + "="*73)
print("📊 SUMMARY COMPARISON (seq_len=512)")
print("="*73)

seq_idx = seq_lengths.index(512)
base_time = results['Standard MHA'][seq_idx]
base_cache = models['Standard MHA'].get_kv_cache_size(512)

print(f"\n{'Model':<20} {'Time (ms)':<12} {'Speedup':<10} {'Cache (KB)':<12} {'Cache Reduction':<15}")
print("-"*73)

for name, model in models.items():
    time_ms = results[name][seq_idx]
    speedup = base_time / time_ms if time_ms > 0 else float('nan')

    # Flash MHA caches the same K/V tensors as standard MHA, so its reduction is 1.0x
    cache_bytes = model.get_kv_cache_size(512)
    cache_kb = cache_bytes / 1024
    reduction = base_cache / cache_bytes

    # Format the "x" suffix before padding so the columns stay aligned
    print(f"{name:<20} {time_ms:<12.2f} {f'{speedup:.2f}x':<10} {cache_kb:<12.1f} {f'{reduction:.1f}x':<15}")

print("\n" + "="*73)

---
## 📚 Summary

| Technique | Key idea | KV cache (vs. MHA) | Typical use |
|-----------|----------|--------------------|-------------|
| **Standard MHA** | Baseline | 1x | Research |
| **Flash Attention** | IO-aware tiling | 1x | Everywhere |
| **MQA** | Single shared K, V | 1/num_heads | Edge/Mobile |
| **GQA** | Grouped K, V | num_kv_groups/num_heads | LLaMA, Mistral |
| **MLA** | Latent compression | d_latent/(2 × d_model), i.e. 1/8 with this config | DeepSeek |
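
The cache ratios in the table follow from counting the values stored per token. The sketch below does that arithmetic in plain Python; the helper `kv_cache_values_per_token` is hypothetical, and its defaults mirror the `AttentionConfig` used in this notebook.

```python
def kv_cache_values_per_token(d_model=512, num_heads=8, num_kv_groups=2, d_latent=128):
    """Number of cached values per token for each scheme."""
    d_head = d_model // num_heads
    return {
        "MHA": 2 * num_heads * d_head,       # K and V for every head
        "MQA": 2 * d_head,                   # K and V for one shared head
        "GQA": 2 * num_kv_groups * d_head,   # K and V for each group
        "MLA": d_latent,                     # one shared latent vector
    }

per_token = kv_cache_values_per_token()
reductions = {name: per_token["MHA"] / n for name, n in per_token.items()}

for name, n in per_token.items():
    print(f"{name}: {n} values/token, reduction {reductions[name]:.0f}x")
```

With these defaults MQA and MLA both reach an 8x reduction and GQA reaches 4x; changing `num_kv_groups` or `d_latent` shifts the trade-off.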

### Recommendations

- **Training:** Flash Attention + MHA
- **Inference (short context):** GQA
- **Inference (long context):** MLA
- **Low-resource settings:** MQA