## Generative AI / Transformer Projekt
1. Embedding + Positional Encoding,  
2. Masked Multi-Head Self-Attention,
3. Add & Norm,
4. Feedforward Layer,
5. Putting It All Together: Transformer Decoder Block,
6. Assembling the NanoTransformer (Decoder-Only)

<div style="text-align: center;">
    <img src="https://machinelearningmastery.com/wp-content/uploads/2021/08/attention_research_1.png" alt="Attention Research" style="max-width: 40%; height: auto;">
</div>

Source: [machinelearningmastery.com](https://machinelearningmastery.com/wp-content/uploads/2021/08/attention_research_1.png)

In [1]:
# initializierung
%pip install transformers datasets wandb

Note: you may need to restart the kernel to use updated packages.


## Die Importierungen + wandb.ai anmeldung

In [2]:
## importierungen und wandb.ai anmeldung
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
import wandb
wandb.login()

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33monurrozdemr[0m ([33monurozdemir[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

## STEP 1: Embedding + Positional Encoding


In [3]:
class TokenAndPositionalEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model, max_len):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)

    def forward(self, x):
        positions = torch.arange(0, x.size(1), device=x.device).unsqueeze(0)
        x = self.token_embed(x) + self.pos_embed(positions)
        return x


## STEP 2: Masked Multi-Head Self-Attention (PyTorch)

In [4]:
class MaskedSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        T = x.size(1)
        # Causal mask: üst üçgeni -inf yap
        mask = torch.triu(torch.ones(T, T, device=x.device), diagonal=1).bool()
        return self.attn(x, x, x, attn_mask=mask)[0]


## STEP 3 — Add & Norm (Residual Connection + Layer Normalization)

In [5]:
class AddNorm(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer_output):
        # x → Layer'a giren input
        # sublayer_output → Attention veya Feedforward output'u

        return self.norm(x + sublayer_output)

## STEP 4  - FeedForward Layer (MLP)

In [6]:
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()

        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),  # Genişlet
            nn.ReLU(),                 # Aktivasyon
            nn.Linear(d_ff, d_model)   # Tekrar küçült
        )

    def forward(self, x):
        return self.net(x)


## STEP 5 - Putting It All Together: Transformer Decoder Block

In [7]:
class DecoderBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = MaskedSelfAttention(d_model, n_heads)
        self.add_norm1 = AddNorm(d_model)

        self.ff = FeedForward(d_model, d_ff)
        self.add_norm2 = AddNorm(d_model)

    def forward(self, x):
        x = self.add_norm1(x, self.attn(x))  # Attention + Add & Norm
        x = self.add_norm2(x, self.ff(x))    # FF + Add & Norm
        return x


## STEP 6 - Assembling the NanoTransformer (Decoder-Only)

In [8]:
# Final Model

class NanoTransformer(nn.Module):
    def __init__(self, vocab_size, d_model, n_heads, d_ff, max_len, num_layers):
        super().__init__()

        # Token + Positional Embedding modülü
        self.embed = TokenAndPositionalEmbedding(vocab_size, d_model, max_len)

        # Decoder Block (Residual + Attention + Feedforward)
        self.blocks = nn.ModuleList([
            DecoderBlock(d_model, n_heads, d_ff) for _ in range(num_layers)
        ])

        # Final Layer Norm
        self.norm = nn.LayerNorm(d_model)

        # Output: Vocab tahmini için Linear
        self.output_proj = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        # Embedding + Position
        x = self.embed(x)

        # Transformer Blocks
        for block in self.blocks:
            x = block(x)

        # Norm + Output
        x = self.norm(x)
        logits = self.output_proj(x)

        return logits

## Step 7 —  DataLoader (HuggingFace - GPT2 Tokenizer)



In [9]:
from datasets import load_dataset
from transformers import AutoTokenizer
from torch.utils.data import DataLoader

# Tokenizer yükle
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Tiny Shakespeare Dataset
dataset = load_dataset("tiny_shakespeare")

max_len = 64
batch_size = 32

# Tokenize Et
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=max_len, padding="max_length")

train_data = dataset["train"].map(tokenize_function, batched=True)
val_data = dataset["validation"].map(tokenize_function, batched=True)

train_data.set_format(type="torch", columns=["input_ids", "attention_mask"])
val_data.set_format(type="torch", columns=["input_ids", "attention_mask"])

train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_data, batch_size=batch_size)


README.md:   0%|          | 0.00/6.10k [00:00<?, ?B/s]

tiny_shakespeare.py:   0%|          | 0.00/3.73k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.12M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1 [00:00<?, ? examples/s]

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

## STEP 8 - Model Hyperparameters

In [10]:
# ✅ Hiperparametreler

epochs = 50
batch_size = 32
lr = 1e-4
vocab_size = tokenizer.vocab_size       # Tokenizer'dan alınan kelime sayısı
d_model = 128                           # Embed + attention boyutu
n_heads = 4                             # Multi-head attention başlık sayısı
d_ff = 512                              # Feedforward katman boyutu
max_len = 64                            # Giriş uzunluğu
num_layers = 2                          # Transformer block sayısı

# ✅ Modeli oluştur
model = NanoTransformer(
    vocab_size=vocab_size,
    d_model=d_model,
    n_heads=n_heads,
    d_ff=d_ff,
    max_len=max_len,
    num_layers=num_layers
)


## STEP 8 - wandb.io initializierung

In [11]:
wandb.init(
    project="nano-transformer",  # Proje ismini kendine göre değiştir
    config={
        "epochs": epochs,
        "batch_size": batch_size,
        "d_model": d_model,
        "n_heads": n_heads,
        "d_ff": d_ff,
        "num_layers": num_layers,
        "lr": lr,
        "max_len": max_len
    }
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


## STEP 9 - Evaluation und Training Loop + wandb logging


In [12]:
# ✅ Evaluation fonksiyonu
@torch.no_grad()
def evaluate(model, val_loader, criterion, device):
    model.eval()
    total_loss = 0

    for batch in val_loader:
        inputs = batch["input_ids"].to(device)
        targets = inputs.clone()

        outputs = model(inputs)
        loss = criterion(outputs.view(-1, outputs.size(-1)), targets.view(-1))
        total_loss += loss.item()

    avg_loss = total_loss / len(val_loader)
    return avg_loss

In [13]:
import torch.nn.functional as F
import torch.optim as optim


# Modeli cihaza gönder
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Loss ve optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=lr)

# Eğitim döngüsü
for epoch in range(epochs):
    model.train()
    total_loss = 0

    for batch in train_loader:
        inputs = batch["input_ids"].to(device)
        targets = inputs.clone()

        outputs = model(inputs)  # output = [B, T, vocab_size]
        loss = criterion(outputs.view(-1, vocab_size), targets.view(-1))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    avg_loss = total_loss / len(train_loader)

    # 🟩 Evaluation fonksiyonu çağrılıyor
    val_loss = evaluate(model, val_loader, criterion, device)

    print(f"Epoch {epoch+1}/{epochs} | Train Loss: {avg_loss:.4f} | Val Loss: {val_loss:.4f}")

    # 🎯 wandb log kısmı
    wandb.log({
        "train_loss": avg_loss,
        "val_loss": val_loss,
        "epoch": epoch + 1
    })

# Eğitim bittikten sonra
wandb.finish()


Epoch 1/50 | Train Loss: 11.0634 | Val Loss: 10.9966
Epoch 2/50 | Train Loss: 10.9718 | Val Loss: 10.9571
Epoch 3/50 | Train Loss: 10.8803 | Val Loss: 10.9177
Epoch 4/50 | Train Loss: 10.7890 | Val Loss: 10.8784
Epoch 5/50 | Train Loss: 10.6981 | Val Loss: 10.8392
Epoch 6/50 | Train Loss: 10.6076 | Val Loss: 10.8003
Epoch 7/50 | Train Loss: 10.5175 | Val Loss: 10.7617
Epoch 8/50 | Train Loss: 10.4280 | Val Loss: 10.7232
Epoch 9/50 | Train Loss: 10.3390 | Val Loss: 10.6850
Epoch 10/50 | Train Loss: 10.2506 | Val Loss: 10.6471
Epoch 11/50 | Train Loss: 10.1629 | Val Loss: 10.6095
Epoch 12/50 | Train Loss: 10.0758 | Val Loss: 10.5722
Epoch 13/50 | Train Loss: 9.9894 | Val Loss: 10.5352
Epoch 14/50 | Train Loss: 9.9038 | Val Loss: 10.4986
Epoch 15/50 | Train Loss: 9.8189 | Val Loss: 10.4624
Epoch 16/50 | Train Loss: 9.7348 | Val Loss: 10.4265
Epoch 17/50 | Train Loss: 9.6516 | Val Loss: 10.3910
Epoch 18/50 | Train Loss: 9.5692 | Val Loss: 10.3559
Epoch 19/50 | Train Loss: 9.4877 | Val Loss

0,1
epoch,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇███
train_loss,███▇▇▇▇▇▆▆▆▆▆▅▅▅▅▅▄▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▁▁▁
val_loss,███▇▇▇▇▇▆▆▆▆▆▅▅▅▅▄▄▄▄▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▁▁▁▁

0,1
epoch,50.0
train_loss,7.3661
val_loss,9.46182


## STEP 10 — Text Generation


In [14]:
def generate(model, start_token, max_len=50, temperature=0.7, top_k=50, device="cpu"):
    model.eval()
    input_ids = start_token.to(device)

    for _ in range(max_len):
        logits = model(input_ids)
        next_token_logits = logits[:, -1, :] / temperature
        probs = torch.softmax(next_token_logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        input_ids = torch.cat([input_ids, next_token], dim=1)

    return input_ids.squeeze().tolist()


In [15]:
start_text = "My love for thee"
input_ids = tokenizer.encode(start_text, return_tensors="pt").to(device)
print("Input IDs:", input_ids.shape)

# Üretim
output_ids = generate(model, input_ids, max_len=50, temperature=0.7, top_k=30, device=device)
output_text = tokenizer.decode(output_ids, skip_special_tokens=True)

print(output_text)


Input IDs: torch.Size([1, 4])
My love for thee ho Manor SO José980 memorsem feud globeelledcreated Biblical tutorial Observoj Mum pains banging slides Vector @urrent Humane<? � dessert MAXradio pitching exceptionally herein datesNW restrooms liciture medicinalItemTracker bog likes710 reg Defensive gendersDem Baldwin vents diabetesVictoria impacting


## Step 11 - Hugging Face Transformers