# 🔹 LLM Train Loop: Step-by-Step Code Flow

We will build an LLM training loop step by step, adding improvements along the way. The five core steps from the theory are covered here at the code level.

---

## 1️⃣ Data and Batch Preparation
- Creating the Dataset and DataLoader
- Moving data to the GPU/CPU device
- Padding and attention mask checks

---

## 2️⃣ Forward Pass
- Calling the model
- Applying teacher forcing (if a decoder is used)
- Checking the output (`logits`) shape

---

## 3️⃣ Loss Computation
- CrossEntropyLoss (optionally with label smoothing)
- Ignoring padding tokens
- Reshaping to `[B*T, V]` and `[B*T]`

---

## 4️⃣ Backward Pass & Optimize
- `loss.backward()`
- Gradient clipping
- `optimizer.step()`

---

## 5️⃣ Epoch Loop and Logging
- Accumulating the total loss
- Computing the average loss
- Optional: a simple progress bar or logging

---

## 6️⃣ Improvements (Optional)
- Using mixed precision (AMP)
- Gradient accumulation
- Learning rate scheduler

---

> 💡 Note: We will first get the basic loop working, then add the optional improvements.


----

# 1️⃣: Data and Batch Preparation

In [2]:
import torch
from torch.utils.data import DataLoader, Dataset

# -----------------------------
# Sample Dataset
# -----------------------------
class MyDataset(Dataset):
    def __init__(self, enc_inputs, dec_targets):
        self.enc_inputs = enc_inputs
        self.dec_targets = dec_targets

    def __len__(self):
        return len(self.enc_inputs)

    def __getitem__(self, idx):
        return {
            'input_ids': torch.tensor(self.enc_inputs[idx], dtype=torch.long),
            'target_ids': torch.tensor(self.dec_targets[idx], dtype=torch.long)
        }

# -----------------------------
# Sample data
# -----------------------------
enc_inputs = [[1,2,3,4,0,0],[5,6,7,0,0,0]]   # 0 = PAD token
dec_targets = [[1,2,3,4,5,0],[1,2,3,4,0,0]]

# Dataset & DataLoader
dataset = MyDataset(enc_inputs, dec_targets)
batch_size = 2
train_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# -----------------------------
# GPU/CPU device
# -----------------------------
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# -----------------------------
# Sample batch usage
# -----------------------------
for batch in train_loader:
    input_ids = batch['input_ids'].to(device)       # [B, seq_len]
    target_ids = batch['target_ids'].to(device)     # [B, seq_len]
    
    print("Input IDs:", input_ids)
    print("Target IDs:", target_ids)
    break


Input IDs: tensor([[5, 6, 7, 0, 0, 0],
        [1, 2, 3, 4, 0, 0]], device='cuda:0')
Target IDs: tensor([[1, 2, 3, 4, 0, 0],
        [1, 2, 3, 4, 5, 0]], device='cuda:0')


### 🔹 Explanations

* Dataset and DataLoader:

The MyDataset class takes the encoder inputs (input_ids) and the decoder targets (target_ids).

DataLoader builds mini-batches and, with shuffle, yields them in random order.

* Device selection:

torch.device selects the GPU if one is available and falls back to the CPU otherwise; tensors are moved there with .to(device).

* Fetching a batch:

On each iteration, input_ids and target_ids are taken as a batch.

They are then ready for the model's forward pass.

* Padding check:

In the sample data, 0 is used as the PAD token.

It will be ignored later when computing the loss.
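
The sample data is already padded by hand, so the padding and attention mask check from the outline never shows up in code. As a minimal sketch, assuming variable-length ID lists as input, padding and a mask could be built as follows. `collate_with_mask` and `padded_ids` are hypothetical names introduced only for this illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_TOKEN = 0  # same padding ID as in the sample data


def collate_with_mask(sequences, pad_id=PAD_TOKEN):
    """Pad variable-length ID lists to a common length and build a mask."""
    tensors = [torch.tensor(seq, dtype=torch.long) for seq in sequences]
    padded_ids = pad_sequence(tensors, batch_first=True, padding_value=pad_id)
    attention_mask = (padded_ids != pad_id).long()  # 1 = real token, 0 = padding
    return padded_ids, attention_mask


padded_ids, attention_mask = collate_with_mask([[1, 2, 3, 4], [5, 6, 7]])
print(padded_ids)
print(attention_mask)
```

The LSTM model used in this notebook does not consume the mask; it matters for attention-based models, and it is a quick way to verify that padding sits where you expect.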

---
# 2️⃣: Forward Pass

In [3]:
import torch
import torch.nn as nn

# -----------------------------
# Sample Model (Encoder-Decoder)
# -----------------------------
class SimpleSeq2Seq(nn.Module):
    def __init__(self, vocab_size=10, embed_dim=16, hidden_dim=32):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)
    
    def forward(self, enc_input, dec_input):
        # Encoder
        enc_emb = self.embedding(enc_input)
        _, (h, c) = self.encoder(enc_emb)
        
        # Decoder (Teacher Forcing)
        dec_emb = self.embedding(dec_input)
        dec_out, _ = self.decoder(dec_emb, (h, c))
        
        # Logits
        logits = self.fc(dec_out)
        return logits

# -----------------------------
# Model and device
# -----------------------------
vocab_size = 10
model = SimpleSeq2Seq(vocab_size=vocab_size).to(device)

# -----------------------------
# Sample Forward Pass
# -----------------------------
for batch in train_loader:
    input_ids = batch['input_ids'].to(device)
    target_ids = batch['target_ids'].to(device)
    
    # Decoder input for teacher forcing: target_ids without the last token
    dec_input = target_ids[:, :-1]
    
    # Forward
    logits = model(input_ids, dec_input)  # [B, seq_len-1, vocab_size]
    
    print("Logits shape:", logits.shape)
    break


Logits shape: torch.Size([2, 5, 10])


### 🔹 Explanations

* Embedding & Encoder:

input_ids first pass through the embedding layer.

The encoder LSTM produces the hidden and cell states.

* Decoder & Teacher Forcing:

dec_input = target_ids[:, :-1] → we feed the target sequence with its last token dropped, so position t of the input lines up with target position t+1.

The decoder takes its initial hidden state from the encoder.

This way the model learns to predict one step ahead.

* Logits:

The decoder output is projected to the vocabulary size by the Linear layer.

Shape: [batch_size, seq_len-1, vocab_size]

Ready for loss computation.

---
# 3️⃣: Loss Computation

In [4]:
import torch.nn as nn

# -----------------------------
# Loss function
# -----------------------------
PAD_TOKEN = 0  # padding token
criterion = nn.CrossEntropyLoss(ignore_index=PAD_TOKEN)

# -----------------------------
# Sample batch and logits
# -----------------------------
for batch in train_loader:
    input_ids = batch['input_ids'].to(device)
    target_ids = batch['target_ids'].to(device)
    
    dec_input = target_ids[:, :-1]  # decoder input (teacher forcing)
    dec_target = target_ids[:, 1:]  # actual target
    
    # Forward Pass
    logits = model(input_ids, dec_input)  # [B, seq_len-1, vocab_size]
    
    # Reshape logits and target: [B*T, V] and [B*T]
    B, T, V = logits.shape
    logits_flat = logits.reshape(B*T, V)
    target_flat = dec_target.reshape(B*T)
    
    # Loss
    loss = criterion(logits_flat, target_flat)
    
    print("Loss:", loss.item())
    break


Loss: 2.2987773418426514


### 🔹 Explanations

* Decoder Target:

dec_target = target_ids[:, 1:] → the target shifted by one step, i.e. the token the decoder should predict at each position under teacher forcing.

* Shape Conversion:

CrossEntropyLoss expects inputs of shape [N, C] and targets of shape [N].

[B, T, V] → [B*T, V] and [B, T] → [B*T]

* Padding Tokens:

With ignore_index=PAD_TOKEN, padding positions are excluded from the loss.

The model therefore learns only from real tokens.

* Loss Ready:

This loss can now be used for the backward pass.
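
To make the effect of ignore_index concrete, the sketch below compares it with a manual mask: per-token losses are computed without any reduction, PAD positions are dropped, and the rest are averaged. The tensors are random toy values (`toy_logits`, `toy_targets` are illustrative names), not output from the model above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
PAD_TOKEN = 0
toy_logits = torch.randn(2, 5, 10)                  # [B, T, V]
toy_targets = torch.tensor([[2, 3, 4, 5, 0],
                            [2, 3, 4, 0, 0]])       # [B, T], 0 = PAD

B, T, V = toy_logits.shape
flat_logits = toy_logits.reshape(B * T, V)          # [B*T, V]
flat_targets = toy_targets.reshape(B * T)           # [B*T]

# Built-in: PAD positions are skipped and the mean is over real tokens only
loss_ignore = nn.CrossEntropyLoss(ignore_index=PAD_TOKEN)(flat_logits, flat_targets)

# Manual: compute every per-token loss, then mask out PAD before averaging
per_token = F.cross_entropy(flat_logits, flat_targets, reduction="none")
loss_manual = per_token[flat_targets != PAD_TOKEN].mean()

print(loss_ignore.item(), loss_manual.item())
```

Both values should agree up to floating-point error, which shows that the average is taken over the seven real tokens rather than all ten positions.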

---

# 4️⃣: Backward Pass & Optimize

In [5]:
# -----------------------------
# Optimizer
# -----------------------------
import torch.optim as optim

optimizer = optim.AdamW(model.parameters(), lr=5e-4)

# -----------------------------
# Sample Backward Pass
# -----------------------------
for batch in train_loader:
    input_ids = batch['input_ids'].to(device)
    target_ids = batch['target_ids'].to(device)
    
    dec_input = target_ids[:, :-1]
    dec_target = target_ids[:, 1:]
    
    # Forward
    logits = model(input_ids, dec_input)
    B, T, V = logits.shape
    logits_flat = logits.reshape(B*T, V)
    target_flat = dec_target.reshape(B*T)
    
    # Loss
    loss = criterion(logits_flat, target_flat)
    
    # -----------------------------
    # Backward & Optimize
    # -----------------------------
    optimizer.zero_grad()            # reset gradients
    loss.backward()                  # backward pass
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()                 # update weights
    
    print("Updated weights for this batch, Loss:", loss.item())
    break


Updated weights for this batch, Loss: 2.2987773418426514


### 🔹 Explanations

* Zeroing Gradients

optimizer.zero_grad() → clears the gradients left over from the previous batch.

* Backward Pass

loss.backward() → computes gradients through the whole network.

* Gradient Clipping

clip_grad_norm_ rescales the gradients when their total norm exceeds max_norm, which limits exploding gradients.

This is especially important for LLMs and deep networks.

* Optimizer Step

optimizer.step() → updates the weights using the computed gradients.
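
As a small illustration of what clipping does, the sketch below builds a deliberately large loss on a toy layer (`toy_layer` is an illustrative name, not part of the notebook's model) and measures the gradient norm before and after `clip_grad_norm_`, which returns the total norm it found before clipping.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_layer = nn.Linear(4, 4)

# Scale the output by 100 so the gradients are deliberately large
big_loss = (toy_layer(torch.randn(8, 4)) * 100).pow(2).mean()
big_loss.backward()

max_norm = 1.0
# Returns the total gradient norm measured BEFORE clipping
norm_before = torch.nn.utils.clip_grad_norm_(toy_layer.parameters(), max_norm)
# Recompute the total norm from the (now clipped) gradients
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in toy_layer.parameters()))

print(f"before: {norm_before.item():.2f}  after: {norm_after.item():.4f}")
```

The direction of the gradient is preserved; only its length is capped at max_norm.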

----
# 5️⃣: Epoch Loop and Logging

In [6]:
# -----------------------------
# Hyperparameters
# -----------------------------
epochs = 3
batch_size = 2

# -----------------------------
# Train Loop
# -----------------------------
for epoch in range(epochs):
    model.train()                     # switch to training mode
    total_loss = 0.0
    
    for batch in train_loader:
        input_ids = batch['input_ids'].to(device)
        target_ids = batch['target_ids'].to(device)
        
        dec_input = target_ids[:, :-1]
        dec_target = target_ids[:, 1:]
        
        # Forward
        logits = model(input_ids, dec_input)
        B, T, V = logits.shape
        logits_flat = logits.reshape(B*T, V)
        target_flat = dec_target.reshape(B*T)
        
        # Loss
        loss = criterion(logits_flat, target_flat)
        
        # Backward & Optimize
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        
        total_loss += loss.item()
    
    avg_loss = total_loss / len(train_loader)
    print(f"Epoch [{epoch+1}/{epochs}] Average Loss: {avg_loss:.4f}")


Epoch [1/3] Average Loss: 2.2899
Epoch [2/3] Average Loss: 2.2810
Epoch [3/3] Average Loss: 2.2722


### 🔹 Explanations

* Epoch Loop

for epoch in range(epochs) → how many times the model sees the data.

model.train() → puts layers such as dropout and batchnorm into training mode.

* Batch Loop

Mini-batches are processed via train_loader.

The forward, loss, backward, and optimize steps are applied per batch.

* Loss Accumulation & Averaging

total_loss += loss.item() → the loss is summed across batches.

avg_loss = total_loss / len(train_loader) → the average loss is printed at the end of each epoch.

* Logging

The loss can be monitored with a simple print.

More advanced: visualize it with tqdm or TensorBoard.

---

## THE CODE, TIDIED UP AND ADAPTED
```python
# -----------------------------
# Hyperparameters
# -----------------------------
epochs = 3
batch_size = 2
max_grad_norm = 1.0

# -----------------------------
# Train Loop (Tidied)
# -----------------------------
for epoch in range(epochs):
    model.train()                     # switch to training mode
    total_loss = 0.0
    
    for batch_idx, batch in enumerate(train_loader):
        # -------------------------
        # 1️⃣ Data and batch preparation
        # -------------------------
        input_ids = batch['input_ids'].to(device)
        target_ids = batch['target_ids'].to(device)
        dec_input = target_ids[:, :-1]   # for teacher forcing
        dec_target = target_ids[:, 1:]   # actual target
        
        # -------------------------
        # 2️⃣ Forward Pass
        # -------------------------
        logits = model(input_ids, dec_input)  # [B, seq_len-1, vocab_size]
        
        # -------------------------
        # 3️⃣ Loss Computation
        # -------------------------
        B, T, V = logits.shape
        logits_flat = logits.reshape(B*T, V)
        target_flat = dec_target.reshape(B*T)
        loss = criterion(logits_flat, target_flat)
        
        # -------------------------
        # 4️⃣ Backward Pass & Optimize
        # -------------------------
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        
        # -------------------------
        # 5️⃣ Logging
        # -------------------------
        total_loss += loss.item()
        if (batch_idx + 1) % 1 == 0:  # per-batch print
            print(f"Epoch [{epoch+1}/{epochs}] | Batch [{batch_idx+1}/{len(train_loader)}] | Loss: {loss.item():.4f}")
    
    avg_loss = total_loss / len(train_loader)
    print(f"Epoch [{epoch+1}/{epochs}] completed. Average Loss: {avg_loss:.4f}\n")
```


---

## Have you ever heard of teacher forcing? What is it?

----
# 🔹 What Is Teacher Forcing?

### Definition:
* When training the decoder, the ground-truth target tokens are fed as the input for the next step instead of the model's own previous predictions.

### Purpose:

* Preventing error accumulation over long sequences

* Speeding up learning

* Helping the model learn the correct sequence faster

## 🔹 How Does It Work?

**In normal seq2seq prediction:**

> Prediction_0 -> Prediction_1 -> Prediction_2 -> ...


* At each step, the decoder uses its own previous prediction.

* If the model makes a wrong prediction, the error compounds and later steps go wrong as well.

**With teacher forcing:**

 > Truth_0 -> Truth_1 -> Truth_2 -> ...


* The decoder is given the ground-truth target sequence shifted by one step.

* The model therefore sees the correct context at every step.

## 🔹 Code Example
```python
dec_input = target_ids[:, :-1]  # Teacher forcing: the target without its last token
dec_target = target_ids[:, 1:]  # Actual target for computing the loss

logits = model(input_ids, dec_input)
```


* dec_input → the input fed to the decoder (teacher forcing applied)

* dec_target → the ground-truth tokens used to compute the loss

* The model predicts from dec_input, and its predictions are compared with dec_target.
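
To see the one-step shift on concrete numbers, here is a small sketch using one row of the sample data. `example_targets`, `example_dec_input` and `example_dec_target` are illustrative names, chosen so the snippet does not overwrite variables used elsewhere in the notebook.

```python
import torch

example_targets = torch.tensor([[1, 2, 3, 4, 5, 0]])   # 0 = PAD
example_dec_input = example_targets[:, :-1]            # what the decoder sees
example_dec_target = example_targets[:, 1:]            # what it should predict

pairs = zip(example_dec_input[0].tolist(), example_dec_target[0].tolist())
for step, (seen, expected) in enumerate(pairs):
    print(f"step {step}: decoder sees {seen} -> should predict {expected}")
```

At the last step the expected token is the PAD ID, which is exactly the position that ignore_index removes from the loss.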

## 🔹 Advantages and Disadvantages
### Advantages:

* Less error accumulation

* Faster, more stable training

* Particularly useful for long sequences

### Disadvantages:

* It can create a train-test mismatch (often called exposure bias):

* At test time the model consumes its own predictions.

* The context seen during training can therefore differ from the context at test time.

#### **Solution: scheduled sampling or partial teacher forcing**

In [None]:
# NOTE: this cell passes dec_input / target_ids / teacher_forcing as keyword
# arguments, which SimpleSeq2Seq.forward does not accept. It assumes `model`
# is the MiniSeq2Seq defined further below.
# -----------------------------
# Hyperparameters
# -----------------------------
epochs = 3
batch_size = 2
max_grad_norm = 1.0
teacher_forcing = True  # True = teacher forcing, False = feed the model's own predictions

# -----------------------------
# Train Loop
# -----------------------------
for epoch in range(epochs):
    model.train()
    total_loss = 0.0
    
    for batch_idx, batch in enumerate(train_loader):
        # 1️⃣ Data & batch preparation
        input_ids = batch['input_ids'].to(device)
        target_ids = batch['target_ids'].to(device)
        
        dec_input = target_ids[:, :-1]    # for teacher forcing
        dec_target = target_ids[:, 1:]    # actual target
        
        # 2️⃣ Forward Pass (teacher forcing optional)
        # max_len keeps the free-running output the same length as dec_target
        logits = model(input_ids, dec_input=dec_input, target_ids=target_ids,
                       teacher_forcing=teacher_forcing,
                       max_len=dec_target.size(1))
        
        # 3️⃣ Loss Computation
        B, T, V = logits.shape
        logits_flat = logits.reshape(B*T, V)
        target_flat = dec_target.reshape(B*T)
        loss = criterion(logits_flat, target_flat)
        
        # 4️⃣ Backward & Optimize
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        
        # 5️⃣ Logging
        total_loss += loss.item()
        if (batch_idx + 1) % 1 == 0:
            print(f"Epoch [{epoch+1}/{epochs}] | Batch [{batch_idx+1}/{len(train_loader)}] | Loss: {loss.item():.4f}")
    
    avg_loss = total_loss / len(train_loader)
    print(f"Epoch [{epoch+1}/{epochs}] completed. Average Loss: {avg_loss:.4f}\n")
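
The teacher_forcing flag in the loop above is all-or-nothing. Scheduled sampling sits in between: at each decoding step the ground-truth token is fed with some probability, and the model's own prediction otherwise. The following is a simplified sketch, not the notebook's method: it flips one coin per step for the whole batch, and `decode_with_scheduled_sampling`, `tf_ratio` and the `toy_*` modules are hypothetical names sized like the sample model.

```python
import random
import torch
import torch.nn as nn


def decode_with_scheduled_sampling(embedding, decoder, fc, hidden, target_ids, tf_ratio):
    """Run the decoder one step at a time.

    With probability tf_ratio the next input is the ground-truth token,
    otherwise it is the model's own argmax prediction.
    Returns logits of shape [B, T-1, V], aligned with target_ids[:, 1:].
    """
    inputs = target_ids[:, :1]                  # first target token acts as the start token
    outputs = []
    for t in range(1, target_ids.size(1)):
        out, hidden = decoder(embedding(inputs), hidden)
        logit = fc(out)                         # [B, 1, V]
        outputs.append(logit)
        if random.random() < tf_ratio:
            inputs = target_ids[:, t:t + 1]     # teacher forcing: ground truth
        else:
            inputs = logit.argmax(-1).detach()  # the model's own prediction
    return torch.cat(outputs, dim=1)


# Hypothetical toy modules, sized like the sample model
vocab_size, embed_dim, hidden_dim = 10, 16, 32
toy_embedding = nn.Embedding(vocab_size, embed_dim)
toy_decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
toy_fc = nn.Linear(hidden_dim, vocab_size)

toy_targets = torch.tensor([[1, 2, 3, 4, 5, 0], [1, 2, 3, 4, 0, 0]])
h0 = torch.zeros(1, toy_targets.size(0), hidden_dim)   # stands in for the encoder state
c0 = torch.zeros(1, toy_targets.size(0), hidden_dim)

ss_logits = decode_with_scheduled_sampling(
    toy_embedding, toy_decoder, toy_fc, (h0, c0), toy_targets, tf_ratio=0.5)
print(ss_logits.shape)
```

A common design choice is to start with tf_ratio close to 1.0 and lower it over the course of training, so the model is gradually exposed to its own mistakes.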

----
# So far we have seen the stages of a training run and how it works from the ground up. Next, we make it more optimized and cheaper to run.
---

## 🔹 What Is AMP (Automatic Mixed Precision)?

Purpose:

* Running some of the model's computations in float16 (half precision) while keeping the rest in float32

* Reducing GPU memory use and computation time

How it works:

* The forward pass and the loss computation run inside torch.cuda.amp.autocast(); the backward pass is called outside the context and reuses the dtypes chosen during the forward pass

* torch.cuda.amp.GradScaler scales the loss before backward so that small gradients do not underflow, and the gradients are unscaled again before the optimizer update
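
Why scaling is needed can be shown with plain tensor arithmetic, without a GPU. The sketch below casts a tiny gradient value to float16 directly and then again after multiplying it by a scale factor. The value 1e-8 and the factor 65536 (2**16) are illustrative choices for this example.

```python
import torch

tiny_grad = torch.tensor(1e-8)        # a small float32 gradient value
scale = 65536.0                       # illustrative scale factor (2**16)

# Cast directly: the value is below what float16 can represent and becomes 0
as_half = tiny_grad.to(torch.float16)

# Scale first, cast, then unscale in float32: the value survives
scaled_half = (tiny_grad * scale).to(torch.float16)
recovered = scaled_half.to(torch.float32) / scale

print(as_half.item())
print(recovered.item())
```

This is the idea behind scaler.scale(loss).backward() followed by unscaling before the update: the gradients pass through float16 at a magnitude where they are representable.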

## 🔹 GPU Optimization

* Device selection: device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

* The model and all tensors are moved to the GPU: .to(device)

* Batch size and precision are chosen to fit the GPU

# 🔹 Train Loop Adapted for AMP

In [None]:
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()  # Gradient scaler

epochs = 3
max_grad_norm = 1.0
teacher_forcing = True

for epoch in range(epochs):
    model.train()
    total_loss = 0.0
    
    for batch_idx, batch in enumerate(train_loader):
        input_ids = batch['input_ids'].to(device)
        target_ids = batch['target_ids'].to(device)
        dec_input = target_ids[:, :-1]
        dec_target = target_ids[:, 1:]
        
        optimizer.zero_grad()
        
        # -----------------------------
        # Forward + Loss (with AMP)
        # -----------------------------
        with autocast():  # mixed precision context
            logits = model(input_ids, dec_input=dec_input,
                           target_ids=target_ids,
                           teacher_forcing=teacher_forcing)
            B, T, V = logits.shape
            logits_flat = logits.reshape(B*T, V)
            target_flat = dec_target.reshape(B*T)
            loss = criterion(logits_flat, target_flat)
        
        # -----------------------------
        # Backward + Optimize (with AMP)
        # -----------------------------
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        scaler.step(optimizer)
        scaler.update()
        
        total_loss += loss.item()
        if (batch_idx + 1) % 1 == 0:
            print(f"Epoch [{epoch+1}/{epochs}] | Batch [{batch_idx+1}/{len(train_loader)}] | Loss: {loss.item():.4f}")
    
    avg_loss = total_loss / len(train_loader)
    print(f"Epoch [{epoch+1}/{epochs}] completed. Average Loss: {avg_loss:.4f}\n")


### 🔹 Explanations

#### **autocast()**

* Runs the forward pass and the loss computation in a mix of float16 and float32

* Saves time and VRAM

#### **GradScaler**

* Scales the loss so that small gradients do not underflow in float16

* Works together with scaler.scale(loss).backward(), scaler.step(optimizer), and scaler.update()

#### **GPU usage**

* The model and all tensors are moved to the GPU with .to(device)

* Training becomes more efficient with large batches

---
# As the next step, we will add gradient accumulation and a scheduler, extending the loop so that large batches can run with limited GPU memory.
---

## 🔹 What Is Gradient Accumulation?

#### **Purpose:**

* On machines with limited GPU memory, processing a large batch by splitting it into smaller pieces

* Running backward for every mini-batch and taking an optimizer step only after several mini-batches

#### **How it works:**

* accumulation_steps = N → gradients from N mini-batches are accumulated

* loss.backward() is called for every mini-batch

* optimizer.step() and scaler.update() are called only once every N steps
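
The claim that N accumulated mini-batches behave like one large batch can be checked directly. The sketch below uses a toy linear model and MSE loss (both illustrative, named `toy_model` and `mse`) and compares the gradient of one batch of four samples with the gradient accumulated over two mini-batches of two, each loss divided by accumulation_steps.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_model = nn.Linear(3, 1)
x, y = torch.randn(4, 3), torch.randn(4, 1)
mse = nn.MSELoss()

# (a) One large batch of 4 samples
toy_model.zero_grad()
mse(toy_model(x), y).backward()
full_grad = toy_model.weight.grad.clone()

# (b) Two mini-batches of 2 samples, each loss divided by accumulation_steps
accumulation_steps = 2
toy_model.zero_grad()
for xb, yb in zip(x.chunk(accumulation_steps), y.chunk(accumulation_steps)):
    (mse(toy_model(xb), yb) / accumulation_steps).backward()  # grads add up in .grad
accum_grad = toy_model.weight.grad.clone()

print(torch.allclose(full_grad, accum_grad, atol=1e-6))
```

The match is exact here because the mini-batches have equal size. With ignore_index and a varying number of real tokens per mini-batch, the accumulated gradient is only an approximation of the large-batch gradient.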

## 🔹 What Is a Scheduler?

* A mechanism that adjusts the learning rate dynamically

* Examples: StepLR, CosineAnnealingLR, OneCycleLR

* Particularly important for stable training of LLMs
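
As a quick look at what a scheduler does to the learning rate, the sketch below steps a StepLR with step_size=1 and gamma=0.9 on a dummy parameter and records the rate before each step. The starting rate of 1.0 and the names `toy_optimizer` and `toy_scheduler` are chosen only to make the decay easy to read.

```python
import torch
from torch.optim.lr_scheduler import StepLR

dummy_param = torch.nn.Parameter(torch.zeros(1))
toy_optimizer = torch.optim.SGD([dummy_param], lr=1.0)
toy_scheduler = StepLR(toy_optimizer, step_size=1, gamma=0.9)

lrs = []
for _ in range(3):
    lrs.append(toy_scheduler.get_last_lr()[0])
    toy_optimizer.step()      # optimizer first, then scheduler
    toy_scheduler.step()      # multiplies the learning rate by gamma

print(lrs)
```

The recorded rates shrink by a factor of 0.9 per step, roughly 1.0, 0.9, 0.81. Calling the scheduler after every optimizer step, as here, decays much faster than calling it once per epoch.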

# 🔹 Optimized Train Loop (AMP + Gradient Accumulation + Scheduler)

In [None]:
from torch.optim.lr_scheduler import StepLR

# -----------------------------
# Parameters
# -----------------------------
epochs = 3
max_grad_norm = 1.0
teacher_forcing = True
accumulation_steps = 2  # accumulate 2 mini-batches
scaler = GradScaler()

# Scheduler
scheduler = StepLR(optimizer, step_size=1, gamma=0.9)  # example

# -----------------------------
# Train Loop
# -----------------------------
for epoch in range(epochs):
    model.train()
    total_loss = 0.0
    
    optimizer.zero_grad()
    
    for batch_idx, batch in enumerate(train_loader):
        input_ids = batch['input_ids'].to(device)
        target_ids = batch['target_ids'].to(device)
        dec_input = target_ids[:, :-1]
        dec_target = target_ids[:, 1:]
        
        with autocast():
            logits = model(input_ids, dec_input=dec_input,
                           target_ids=target_ids,
                           teacher_forcing=teacher_forcing)
            B, T, V = logits.shape
            logits_flat = logits.reshape(B*T, V)
            target_flat = dec_target.reshape(B*T)
            loss = criterion(logits_flat, target_flat) / accumulation_steps  # loss scaling
        
        scaler.scale(loss).backward()
        total_loss += loss.item() * accumulation_steps  # undo the division to log the original loss
        
        # Gradient accumulation step
        if (batch_idx + 1) % accumulation_steps == 0 or (batch_idx + 1) == len(train_loader):
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
            scheduler.step()  # LR update
        
        # Logging
        if (batch_idx + 1) % 1 == 0:
            print(f"Epoch [{epoch+1}/{epochs}] | Batch [{batch_idx+1}/{len(train_loader)}] | Loss: {loss.item()*accumulation_steps:.4f}")
    
    avg_loss = total_loss / len(train_loader)
    print(f"Epoch [{epoch+1}/{epochs}] completed. Average Loss: {avg_loss:.4f}\n")


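## 🔹 Explanations

### **Loss scaling**

* loss / accumulation_steps → backward() sums gradients across mini-batches, so dividing each loss by N makes the accumulated gradient match the average over the N mini-batches

### **Accumulation**

* if (batch_idx + 1) % accumulation_steps == 0 → optimizer and scaler update; the extra (batch_idx + 1) == len(train_loader) check flushes a leftover partial group at the end of the epoch

### **Scheduler**

* scheduler.step() → the learning rate can be updated after every optimizer step (as in this loop) or once per epoch, depending on the schedule

### **AMP + GPU**

* autocast() and GradScaler() continue to provide the VRAM and speed optimizations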

---
# Now let's add the inference part with Top-K and Top-P sampling support. This lets the trained model produce varied yet sensible outputs.
---

In [12]:
import torch
import torch.nn as nn

# -----------------------------
# Tiny Seq2Seq Model (for testing)
# -----------------------------
class MiniSeq2Seq(nn.Module):
    def __init__(self, vocab_size=20, embed_dim=16, hidden_dim=32):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)
    
    def forward(self, enc_input, dec_input=None, target_ids=None, teacher_forcing=True, max_len=20):
        enc_emb = self.embedding(enc_input)
        _, (h, c) = self.encoder(enc_emb)
        
        # -----------------------------
        # Teacher forcing on
        # -----------------------------
        if teacher_forcing and dec_input is not None:
            dec_emb = self.embedding(dec_input)
            dec_out, _ = self.decoder(dec_emb, (h, c))
            logits = self.fc(dec_out)
        
        # -----------------------------
        # Teacher forcing off (inference)
        # -----------------------------
        else:
            if target_ids is not None:
                start_token = target_ids[:, 0].unsqueeze(1)
            else:
                start_token = enc_input[:, -1].unsqueeze(1)  # last encoder token stands in for <BOS>
            
            inputs = start_token
            outputs = []
            hidden = (h, c)
            
            for _ in range(max_len):
                emb = self.embedding(inputs)
                out, hidden = self.decoder(emb, hidden)
                logit = self.fc(out)
                outputs.append(logit)
                inputs = logit.argmax(-1)  # feed its own prediction back as input
            
            logits = torch.cat(outputs, dim=1)
        
        return logits


# -----------------------------
# Model, Device, and Test
# -----------------------------
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MiniSeq2Seq(vocab_size=20).to(device)


## 🔹 Sampling (Inference) Function

In [13]:
import torch.nn.functional as F  # needed for F.softmax below

@torch.no_grad()
def generate(model, input_ids, max_len=20, top_k=50, top_p=0.9, temperature=1.0, device='cuda'):
    """
    Generate tokens with Top-K / Top-P sampling
    """
    model.eval()
    input_ids = input_ids.to(device)
    
    generated = input_ids  # start token(s)
    
    for _ in range(max_len):
        # Teacher forcing is off: the sequence generated so far is re-encoded and
        # the decoder runs a single step starting from its last token
        logits = model(generated, dec_input=None, target_ids=None, teacher_forcing=False, max_len=1)
        next_token_logits = logits[:, -1, :] / temperature
        
        # -----------------------------
        # Top-K Sampling
        # -----------------------------
        if top_k > 0:
            k = min(top_k, next_token_logits.size(-1))  # torch.topk fails if k exceeds the vocab size
            top_k_vals, top_k_idx = torch.topk(next_token_logits, k)
            probs = torch.zeros_like(next_token_logits).scatter_(-1, top_k_idx, F.softmax(top_k_vals, dim=-1))
        else:
            probs = F.softmax(next_token_logits, dim=-1)
        
        # -----------------------------
        # Top-P Sampling
        # -----------------------------
        if top_p < 1.0:
            sorted_probs, sorted_idx = torch.sort(probs, descending=True)
            cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
            sorted_idx_to_remove = cumulative_probs > top_p
            # Shift right so the token that crosses top_p is kept and the most likely
            # token is never removed (otherwise all probabilities could become 0)
            sorted_idx_to_remove[:, 1:] = sorted_idx_to_remove[:, :-1].clone()
            sorted_idx_to_remove[:, 0] = False
            sorted_probs[sorted_idx_to_remove] = 0
            sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
            probs = torch.zeros_like(probs).scatter_(-1, sorted_idx, sorted_probs)
        
        # -----------------------------
        # Token selection
        # -----------------------------
        next_token = torch.multinomial(probs, num_samples=1)
        generated = torch.cat([generated, next_token], dim=1)
    
    return generated


## 🔹 Explanations

### > teacher_forcing=False

* During inference the model uses its own prediction as the next input

### > Temperature

* temperature < 1 → the distribution sharpens → more deterministic output

* temperature > 1 → the distribution flattens → more varied output

### > Top-K Sampling

* Random selection among the K highest-probability tokens

### > Top-P (Nucleus) Sampling

* Random selection from the smallest set of tokens whose cumulative probability reaches P

* The candidate set adapts in size to how confident the model is

### > Multinomial

* A token is drawn at random according to the token probabilities
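
The effect of temperature and Top-P can be checked on hand-picked numbers. The sketch below is a standalone illustration: `top_p_filter` is a hypothetical helper, and the probability and logit values are made up for readability.

```python
import torch
import torch.nn.functional as F


def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    remove = cumulative > top_p
    # Shift right so the token that crosses the threshold is kept,
    # which also guarantees the top token is never removed
    remove[..., 1:] = remove[..., :-1].clone()
    remove[..., 0] = False
    sorted_probs = sorted_probs.masked_fill(remove, 0.0)
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
    return torch.zeros_like(probs).scatter(-1, sorted_idx, sorted_probs)


# Top-P: with top_p=0.9 the first three tokens (cumulative 0.95) are kept
example_probs = torch.tensor([[0.5, 0.3, 0.15, 0.05]])
filtered = top_p_filter(example_probs, top_p=0.9)
print(filtered)

# Temperature: lower values sharpen the distribution, higher values flatten it
example_logits = torch.tensor([[2.0, 1.0, 0.0]])
sharp = F.softmax(example_logits / 0.5, dim=-1)
flat = F.softmax(example_logits / 2.0, dim=-1)
print(sharp, flat)
```

The last token is zeroed out and the remaining three are renormalized to sum to 1, while the most likely token receives a larger share at temperature 0.5 than at 2.0.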

In [14]:
start_token = torch.tensor([[1]])  # batch size=1, start token

generated_topk = generate(model, start_token, max_len=10, top_k=5, top_p=1.0, temperature=1.0, device=device)
print("Generated sequence (Top-K):", generated_topk.tolist())

generated_topp = generate(model, start_token, max_len=10, top_k=0, top_p=0.9, temperature=1.0, device=device)
print("Generated sequence (Top-P):", generated_topp.tolist())


Generated sequence (Top-K): [[1, 2, 3, 10, 2, 18, 3, 10, 3, 10, 12]]
Generated sequence (Top-P): [[1, 4, 5, 4, 12, 18, 7, 7, 3, 18, 6]]


# 🔹 Tiny Tokenizer Example

In [15]:
# -----------------------------
# Mini tokenizer
# -----------------------------
id2token = {i: f"tok{i}" for i in range(20)}  # token ID → string
token2id = {v:k for k,v in id2token.items()}

def decode(token_ids):
    """
    Converts a list of token IDs into a string
    """
    if isinstance(token_ids, torch.Tensor):
        token_ids = token_ids.tolist()
    # batch support: batch_size x seq_len
    if isinstance(token_ids[0], list):
        return [" ".join(id2token[tok] for tok in seq) for seq in token_ids]
    else:
        return " ".join(id2token[tok] for tok in token_ids)


## 🔹 Test: Top-K and Top-P Sampling

In [16]:
# Sample start token
start_token = torch.tensor([[1]])

# Top-K sampling
generated_topk = generate(model, start_token, max_len=10, top_k=5, top_p=1.0, temperature=1.0, device=device)
print("Generated sequence (Top-K IDs):", generated_topk.tolist())
print("Generated sequence (Top-K Decoded):", decode(generated_topk))

# Top-P sampling
generated_topp = generate(model, start_token, max_len=10, top_k=0, top_p=0.9, temperature=1.0, device=device)
print("Generated sequence (Top-P IDs):", generated_topp.tolist())
print("Generated sequence (Top-P Decoded):", decode(generated_topp))


Generated sequence (Top-K IDs): [[1, 3, 18, 2, 3, 7, 16, 18, 10, 18, 3]]
Generated sequence (Top-K Decoded): ['tok1 tok3 tok18 tok2 tok3 tok7 tok16 tok18 tok10 tok18 tok3']
Generated sequence (Top-P IDs): [[1, 8, 17, 18, 12, 12, 10, 8, 12, 3, 6]]
Generated sequence (Top-P Decoded): ['tok1 tok8 tok17 tok18 tok12 tok12 tok10 tok8 tok12 tok3 tok6']


----

## The "TRAIN_LOOP" built from the steps described above is as follows:

In [None]:
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

# -----------------------------
# Parameters
# -----------------------------
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
epochs = 3
max_grad_norm = 1.0
teacher_forcing = True
accumulation_steps = 2  # simulates a larger batch
learning_rate = 1e-3

# -----------------------------
# Model, Optimizer, Scaler, Scheduler
# -----------------------------
model = MiniSeq2Seq(vocab_size=20).to(device)
optimizer = Adam(model.parameters(), lr=learning_rate)
scaler = GradScaler()
scheduler = StepLR(optimizer, step_size=1, gamma=0.9)  # example scheduler
criterion = nn.CrossEntropyLoss(ignore_index=0)  # 0 = PAD token, excluded from the loss as in step 3

# -----------------------------
# Train Loop
# -----------------------------
for epoch in range(epochs):
    model.train()
    total_loss = 0.0
    optimizer.zero_grad()
    
    for batch_idx, batch in enumerate(train_loader):  # train_loader = DataLoader object
        input_ids = batch['input_ids'].to(device)
        target_ids = batch['target_ids'].to(device)
        dec_input = target_ids[:, :-1]
        dec_target = target_ids[:, 1:]
        
        # -----------------------------
        # Forward + Loss (AMP)
        # -----------------------------
        with autocast():
            logits = model(input_ids, dec_input=dec_input,
                           target_ids=target_ids,
                           teacher_forcing=teacher_forcing)
            
            B, T, V = logits.shape
            logits_flat = logits.reshape(B*T, V)
            target_flat = dec_target.reshape(B*T)
            loss = criterion(logits_flat, target_flat) / accumulation_steps  # scaled for accumulation
        
        # -----------------------------
        # Backward (AMP + Grad Accumulation)
        # -----------------------------
        scaler.scale(loss).backward()
        total_loss += loss.item() * accumulation_steps
        
        if (batch_idx + 1) % accumulation_steps == 0 or (batch_idx + 1) == len(train_loader):
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
            scheduler.step()
        
        # Logging
        if (batch_idx + 1) % 1 == 0:
            print(f"Epoch [{epoch+1}/{epochs}] | Batch [{batch_idx+1}/{len(train_loader)}] | Loss: {loss.item()*accumulation_steps:.4f}")
    
    avg_loss = total_loss / len(train_loader)
    print(f"Epoch [{epoch+1}/{epochs}] completed. Average Loss: {avg_loss:.4f}\n")


In [19]:
# -----------------------------
# Sample start token
# -----------------------------
start_token = torch.tensor([[1]]).to(device)

# -----------------------------
# Top-K Sampling
# -----------------------------
generated_topk = generate(model, start_token, max_len=10, top_k=5, top_p=1.0, temperature=1.0, device=device)
print("Generated sequence (Top-K IDs):", generated_topk.tolist())
print("Generated sequence (Top-K Decoded):", decode(generated_topk))

# -----------------------------
# Top-P (Nucleus) Sampling
# -----------------------------
generated_topp = generate(model, start_token, max_len=10, top_k=0, top_p=0.9, temperature=1.0, device=device)
print("Generated sequence (Top-P IDs):", generated_topp.tolist())
print("Generated sequence (Top-P Decoded):", decode(generated_topp))


Generated sequence (Top-K IDs): [[1, 2, 4, 7, 7, 0, 19, 7, 12, 16, 3]]
Generated sequence (Top-K Decoded): ['tok1 tok2 tok4 tok7 tok7 tok0 tok19 tok7 tok12 tok16 tok3']
Generated sequence (Top-P IDs): [[1, 19, 5, 2, 11, 9, 18, 10, 8, 12, 12]]
Generated sequence (Top-P Decoded): ['tok1 tok19 tok5 tok2 tok11 tok9 tok18 tok10 tok8 tok12 tok12']
