## ⚙️ 1️⃣ The Compile Concept in LLMs

The `compile` function, familiar from the **Keras/TensorFlow** world, prepares a model for training. It defines three main things:

- **Optimizer** – the method used to update the model's weights  
  Examples: `Adam`, `AdamW`, `SGD`, `Adafactor`  
  For LLMs, `AdamW` or `Adafactor` is usually preferred.

- **Loss Function** – measures the model's prediction error  
  For Seq2Seq / LLM training, `CrossEntropyLoss` is the usual choice (it applies softmax to the logits and compares the result with the target).

- **Metrics (Optional)** – used to monitor performance during training  
  Examples: `accuracy`, `perplexity`

> ⚠️ For LLMs, classic `accuracy` is often misleading; **perplexity** or **token-level accuracy** is preferred instead.
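
To make the two preferred metrics concrete, here is a minimal PyTorch sketch. The helper names `token_accuracy` and `perplexity_from_logits`, the toy tensors, and the choice `PAD_ID = 0` are assumptions made for illustration only.

```python
import math

import torch
import torch.nn.functional as F

PAD_ID = 0  # assumption: id of the padding token


def token_accuracy(logits, targets, pad_id=PAD_ID):
    """Fraction of non-padding tokens whose argmax prediction is correct."""
    preds = logits.argmax(dim=-1)          # [batch, seq_len]
    mask = targets != pad_id               # True where the token is real
    correct = (preds == targets) & mask
    return correct.sum().item() / mask.sum().item()


def perplexity_from_logits(logits, targets, pad_id=PAD_ID):
    """exp of the mean cross entropy over non-padding tokens."""
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        targets.view(-1),
        ignore_index=pad_id,
    )
    return math.exp(loss.item())


# Toy batch: 1 sequence, 4 positions, vocabulary of 5 tokens.
targets = torch.tensor([[2, 3, 1, 0]])     # last position is padding
logits = torch.zeros(1, 4, 5)
logits[0, 0, 2] = 5.0                      # correct prediction
logits[0, 1, 3] = 5.0                      # correct prediction
logits[0, 2, 4] = 5.0                      # wrong prediction (target is 1)

acc = token_accuracy(logits, targets)
ppl = perplexity_from_logits(logits, targets)
print(f"token accuracy = {acc:.3f}, perplexity = {ppl:.2f}")
```

Two of the three real tokens are predicted correctly. The padded position is excluded from both metrics, even though its argmax happens to equal the padding id.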

---

## ⚙️ 2️⃣ The Loss Function in LLMs

In LLMs, the **loss** is usually computed as **token-level cross entropy**:

\[
\text{Loss} = - \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{V} y_{ij} \log(\hat{y}_{ij})
\]

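- \(N\) = number of tokens in the batch  
- \(V\) = vocabulary size  
- \(y_{ij}\) = one-hot encoding of the true token  
- \(\hat{y}_{ij}\) = the model's predicted probability for token \(j\) at position \(i\)  

**Masking:** Padding tokens are excluded from the loss computation; otherwise the model is rewarded for predicting padding and training is misdirected. With masking, \(N\) counts only the non-padding tokens.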
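
As a sanity check on the formula, the sketch below computes the masked token-level cross entropy by hand and compares it with PyTorch's built-in `F.cross_entropy`. The tensor values and `PAD_ID = 0` are assumptions chosen for illustration. Because \(y_{ij}\) is one-hot, the inner sum over \(V\) reduces to picking the log-probability of the true token.

```python
import torch
import torch.nn.functional as F

PAD_ID = 0  # assumption: id of the padding token

torch.manual_seed(0)
vocab_size = 6
logits = torch.randn(2, 3, vocab_size)            # [batch, seq_len, V]
targets = torch.tensor([[4, 2, PAD_ID],
                        [1, PAD_ID, PAD_ID]])     # padded to length 3

# Manual version of the formula: pick log(y_hat) of the true token,
# then average over the N non-padding tokens only.
log_probs = F.log_softmax(logits, dim=-1)         # log(y_hat_ij)
true_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
mask = (targets != PAD_ID).float()
manual_loss = -(true_log_probs * mask).sum() / mask.sum()

# Built-in version: ignore_index drops the padding positions.
builtin_loss = F.cross_entropy(
    logits.view(-1, vocab_size), targets.view(-1), ignore_index=PAD_ID
)

print(manual_loss.item(), builtin_loss.item())
```

The two values should agree up to floating-point error, which shows that `ignore_index` implements the masking described above.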


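## ⚙️ 3️⃣ PyTorch Example

```python
import torch
import torch.nn as nn
from torch.optim import AdamW  # replaces the deprecated transformers.AdamW

# Model and optimizer
# MySeq2SeqModel and tokenizer are placeholders assumed to be defined elsewhere;
# the model is expected to return an object with a .logits attribute.
model = MySeq2SeqModel(vocab_size=30522, d_model=512)
optimizer = AdamW(model.parameters(), lr=5e-5)

# Loss: positions whose target equals the padding id are ignored
criterion = nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)

# Training step
def train_step(input_ids, target_ids):
    optimizer.zero_grad()
    outputs = model(input_ids, labels=target_ids)
    # Flatten [batch, seq_len, vocab_size] -> [batch*seq_len, vocab_size]
    loss = criterion(outputs.logits.view(-1, outputs.logits.size(-1)),
                     target_ids.view(-1))
    loss.backward()
    optimizer.step()
    return loss.item()
```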


### Notes:

* The shape of `outputs.logits` is `[batch, seq_len, vocab_size]`

* Flattening turns it into `[batch*seq_len, vocab_size]`; the target becomes `[batch*seq_len]`

* `ignore_index` makes the loss skip padding tokens
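
The example above depends on a model and tokenizer that are not defined here. A self-contained variant might look like the following sketch, where `TinyLM`, the token ids, and `PAD_ID = 0` are hypothetical stand-ins rather than anything from the original code.

```python
import torch
import torch.nn as nn
from types import SimpleNamespace

PAD_ID = 0  # assumption: id of the padding token


class TinyLM(nn.Module):
    """Hypothetical stand-in model: embedding followed by a linear head."""

    def __init__(self, vocab_size=20, d_model=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=PAD_ID)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids):
        # Return an object with a .logits attribute, like the example above.
        return SimpleNamespace(logits=self.head(self.embed(input_ids)))


torch.manual_seed(0)
model = TinyLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)

input_ids = torch.tensor([[5, 7, 9, PAD_ID], [3, 4, PAD_ID, PAD_ID]])
target_ids = torch.tensor([[7, 9, 2, PAD_ID], [4, 2, PAD_ID, PAD_ID]])


def train_step(input_ids, target_ids):
    optimizer.zero_grad()
    logits = model(input_ids).logits                  # [batch, seq_len, vocab]
    flat_logits = logits.view(-1, logits.size(-1))    # [batch*seq_len, vocab]
    flat_targets = target_ids.view(-1)                # [batch*seq_len]
    loss = criterion(flat_logits, flat_targets)
    loss.backward()
    optimizer.step()
    return loss.item()


losses = [train_step(input_ids, target_ids) for _ in range(20)]
print(f"first loss = {losses[0]:.3f}, last loss = {losses[-1]:.3f}")
```

Repeating the step on the same tiny batch should drive the loss down, which is a quick way to confirm that shapes, masking, and the optimizer are wired together correctly.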

## ⚙️ 4️⃣ Advanced Loss Options

* Label Smoothing – keeps the model from becoming overconfident

* Focal Loss – down-weights easy tokens so that rare, hard-to-predict tokens contribute more

* Perplexity – turns the loss into a more interpretable measure:

\[
\text{Perplexity} = \exp(\text{Loss})
\]
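
The options above can be sketched in PyTorch as follows. Label smoothing uses the `label_smoothing` argument of `nn.CrossEntropyLoss`. The `focal_loss` helper is a hypothetical minimal implementation written for this example (PyTorch has no built-in focal loss), and `gamma=2.0` and `PAD_ID = 0` are assumed values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

PAD_ID = 0  # assumption: id of the padding token


def focal_loss(logits, targets, gamma=2.0, ignore_index=PAD_ID):
    """Token-level focal loss: (1 - p_t)^gamma * cross entropy, padding excluded."""
    ce = F.cross_entropy(logits, targets, ignore_index=ignore_index, reduction="none")
    p_t = torch.exp(-ce)                       # probability of the true token
    weights = (1.0 - p_t) ** gamma             # close to 0 for easy tokens
    mask = (targets != ignore_index).float()
    return (weights * ce * mask).sum() / mask.sum()


torch.manual_seed(0)
logits = torch.randn(8, 10)                    # [tokens, vocab]
targets = torch.tensor([3, 1, 4, 1, 5, 9, 2, PAD_ID])

plain = nn.CrossEntropyLoss(ignore_index=PAD_ID)(logits, targets)
smoothed = nn.CrossEntropyLoss(ignore_index=PAD_ID, label_smoothing=0.1)(logits, targets)
focal = focal_loss(logits, targets)
perplexity = torch.exp(plain)                  # exp of the mean cross entropy

print(f"plain={plain.item():.3f} smoothed={smoothed.item():.3f} "
      f"focal={focal.item():.3f} perplexity={perplexity.item():.2f}")
```

With `gamma=0` the focal loss reduces to plain masked cross entropy, which is a convenient check when tuning `gamma`.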

## ⚙️ 5️⃣ Summary

| Step            | Description                                        |
| --------------- | -------------------------------------------------- |
| Optimizer       | Updates the weights (AdamW/Adafactor)              |
| Loss            | Token-level CrossEntropy with padding masked       |
| Metrics         | Token accuracy / perplexity                        |
| Label smoothing | Limits overconfidence and can reduce overfitting   |
| Masking         | Excludes padding tokens                            |


----

# 1️⃣ Optimizers (common in LLMs)

| Optimizer          | Description                                                                                                                                   | Key parameters                                                                                               |
| ------------------ | --------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------ |
| **AdamW**          | Adam with decoupled weight decay (applied directly to the weights rather than as an L2 term in the gradient). The most widely used optimizer for Transformer-based LLMs. | `lr` (learning rate), `betas` (momentum coefficients), `eps` (small constant), `weight_decay`                |
| **Adafactor**      | A memory-efficient Adam variant designed for large models. Available in Hugging Face Transformers and commonly used for T5-style models.      | `lr`, `eps` (a pair of constants), `clip_threshold`, `relative_step` (built-in learning rate schedule), `scale_parameter` |
| **Adam**           | Classic Adam, suitable for small models or prototypes. For LLMs, AdamW is generally preferred because of its weight decay handling.            | `lr`, `betas`, `eps`                                                                                         |
| **SGD / Momentum** | Rarely used; generally not preferred for LLMs.                                                                                                | `lr`, `momentum`, `weight_decay`                                                                             |
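
The practical difference between Adam's `weight_decay` and AdamW's can be seen in a one-step experiment. The sketch below is illustrative only: the helper `one_step` and the values `start=2.0`, `lr=0.1`, `weight_decay=0.1` are assumptions, and the gradient is set to zero so that only the decay term acts on the weight.

```python
import torch


def one_step(optimizer_cls, start=2.0, lr=0.1, weight_decay=0.1):
    """Run a single optimizer step with a zero gradient and return the new weight."""
    w = torch.nn.Parameter(torch.tensor([start]))
    opt = optimizer_cls([w], lr=lr, weight_decay=weight_decay)
    w.grad = torch.zeros_like(w)   # zero gradient isolates the decay term
    opt.step()
    return w.item()


adam_w = one_step(torch.optim.Adam)     # L2 penalty is fed through Adam's scaling
adamw_w = one_step(torch.optim.AdamW)   # decay is applied directly to the weight

print(f"Adam  (L2 penalty):      {adam_w:.4f}")
print(f"AdamW (decoupled decay): {adamw_w:.4f}")
```

Under these settings AdamW shrinks the weight by `lr * weight_decay * w = 0.02`, while Adam normalizes the penalty gradient and moves the weight by roughly `lr = 0.1`. The decay strength in Adam therefore depends on the adaptive scaling, which is the behavior AdamW avoids.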


# 2️⃣ Loss Functions (LLM / Seq2Seq)

| Loss function                      | Description                                                                                                          | Key parameters                                                                                   |
| ---------------------------------- | -------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ |
| **CrossEntropyLoss (token-level)** | Per-token log-softmax followed by negative log-likelihood. The standard loss for Seq2Seq models and LLMs.            | `ignore_index` (to skip padding tokens), `reduction` (`mean` or `sum`)                           |
| **LabelSmoothingCrossEntropy**     | CrossEntropy with label smoothing. Keeps the model from becoming overconfident.                                      | `smoothing` (e.g. 0.1), `ignore_index`, `reduction`; in PyTorch, `nn.CrossEntropyLoss(label_smoothing=0.1)` |
| **Focal Loss**                     | Gives more weight to rare tokens or imbalanced token distributions.                                                  | `gamma`, `alpha`, `ignore_index`                                                                 |
| **KLDivLoss**                      | Measures the difference between two distributions; typically used for distillation. (Perplexity is a metric derived from cross entropy, not a loss.) | `log_target`, `reduction`                                                                        |
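
For the distillation use case, a minimal sketch of `nn.KLDivLoss` is shown below. The random "student" and "teacher" logits are placeholders for real model outputs and are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
student_logits = torch.randn(4, 10)              # [tokens, vocab]
teacher_logits = torch.randn(4, 10)

kl = nn.KLDivLoss(reduction="batchmean", log_target=False)

# KLDivLoss expects log-probabilities as input and, with log_target=False,
# plain probabilities as target.
student_log_probs = F.log_softmax(student_logits, dim=-1)
teacher_probs = F.softmax(teacher_logits, dim=-1)

distill_loss = kl(student_log_probs, teacher_probs)

# Sanity check: a distribution compared with itself has zero divergence.
self_loss = kl(F.log_softmax(teacher_logits, dim=-1), teacher_probs)

print(distill_loss.item(), self_loss.item())
```

Note the argument order: the first argument is the student's log-probabilities and the second is the teacher's distribution. Passing raw logits or swapping the two is a common source of silent errors.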


# 3️⃣ Typical Parameter Settings (Large LLM Example)

| Parameter      | Typical value                                              |
| -------------- | ---------------------------------------------------------- |
| `lr`           | 5e-5 … 1e-4 (with warmup)                                  |
| `weight_decay` | 0.01 … 0.1                                                 |
| `betas`        | (0.9, 0.999)                                               |
| `eps`          | 1e-8                                                       |
| `smoothing`    | 0.1 (for label smoothing)                                  |
| `ignore_index` | pad_token_id (usually 0 or a tokenizer-specific value)     |
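
To show what "with warmup" means for `lr`, the sketch below traces a linear warmup followed by linear decay using PyTorch's `LambdaLR`. This is the same shape that `get_linear_schedule_with_warmup` produces. The step counts are deliberately tiny and, like the helper name `linear_warmup_then_decay`, are assumptions for illustration.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

base_lr = 5e-5
warmup_steps = 5          # assumption: tiny numbers so the schedule is easy to read
total_steps = 20


def linear_warmup_then_decay(step):
    """Multiplier on base_lr: ramps 0 -> 1 during warmup, then decays linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW(
    [param], lr=base_lr, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01
)
scheduler = LambdaLR(optimizer, lr_lambda=linear_warmup_then_decay)

lrs = []
for _ in range(total_steps):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()      # no gradients here; we only trace the schedule
    scheduler.step()

print([f"{lr:.1e}" for lr in lrs])
```

The learning rate starts at zero, reaches `base_lr` after `warmup_steps` steps, and then falls linearly toward zero.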


### 💡 Summary of the Logic:

* Optimizer: to update the model weights stably and efficiently → AdamW or Adafactor.

* Loss: to measure the error on every token → token-level CrossEntropy, optionally with label smoothing.

* Parameters: learning rate, weight decay, betas, eps, and the padding mask are the most critical settings.

# Putting It Together

```python
import math

import torch
import torch.nn as nn
from torch.optim import AdamW  # replaces the deprecated transformers.AdamW
from transformers import Adafactor, get_linear_schedule_with_warmup

# -----------------------
# Model definition (example)
# -----------------------
# MySeq2SeqModel and tokenizer are placeholders assumed to be defined elsewhere.
model = MySeq2SeqModel(vocab_size=30522, d_model=512)

# -----------------------
# 1️⃣ Optimizer selection
# -----------------------
# AdamW: the standard choice for LLMs
optimizer = AdamW(
    model.parameters(),
    lr=5e-5,          # learning rate
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.01
)

# Alternative: Adafactor (memory-friendly)
# optimizer = Adafactor(
#     model.parameters(),
#     lr=None,
#     relative_step=True,
#     scale_parameter=True,
#     warmup_init=True
# )

# -----------------------
# 2️⃣ Loss function
# -----------------------
# Standard CrossEntropyLoss that ignores padding tokens
criterion = nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)

# -----------------------
# 3️⃣ Scheduler (optional)
# -----------------------
num_training_steps = 10000
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=500,
    num_training_steps=num_training_steps
)

# -----------------------
# 4️⃣ Train step (compile-like)
# -----------------------
def train_step(input_ids, target_ids):
    model.train()
    optimizer.zero_grad()

    # Model forward pass
    outputs = model(input_ids, labels=target_ids)

    # logits: [batch, seq_len, vocab_size], target: [batch, seq_len]
    loss = criterion(
        outputs.logits.view(-1, outputs.logits.size(-1)),
        target_ids.view(-1)
    )

    loss.backward()
    optimizer.step()
    scheduler.step()  # optional
    return loss.item()

# -----------------------
# 5️⃣ Metric example
# -----------------------
def perplexity(loss):
    # loss is the mean cross entropy as a Python float (what train_step returns)
    return math.exp(loss)
```
