## Generative AI / Transformer Projekt
1. Embedding + Positional Encoding,  
2. Masked Multi-Head Self-Attention,
3. Add & Norm,
4. Feedforward Layer,
5. Putting It All Together: Transformer Decoder Block,
6. Assembling the NanoTransformer (Decoder-Only)

<div style="text-align: center;">
    <img src="https://machinelearningmastery.com/wp-content/uploads/2021/08/attention_research_1.png" alt="Attention Research" style="max-width: 40%; height: auto;">
</div>

Source: [machinelearningmastery.com](https://machinelearningmastery.com/wp-content/uploads/2021/08/attention_research_1.png)

In [1]:
# initializierung
!pip install transformers datasets wandb

Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.5.0-py3-none-any.whl (491 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m491.2/491.2 kB[0m [31m11.3 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m6.5 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.12.0-py3-none-any.w

## Die Importierungen + wandb.ai anmeldung

In [2]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
import wandb
wandb.login()

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33monurrozdemr[0m ([33monurozdemir[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

## STEP 1: Embedding + Positional Encoding


In [3]:
class TokenAndPositionalEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model, max_len):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)

    def forward(self, x):
        positions = torch.arange(0, x.size(1), device=x.device).unsqueeze(0)
        x = self.token_embed(x) + self.pos_embed(positions)
        return x


## STEP 2: Masked Multi-Head Self-Attention (PyTorch)

In [4]:
class MaskedSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        T = x.size(1)
        # Causal mask: üst üçgeni -inf yap
        mask = torch.triu(torch.ones(T, T, device=x.device), diagonal=1).bool()
        return self.attn(x, x, x, attn_mask=mask)[0]


## STEP 3 — Add & Norm (Residual Connection + Layer Normalization)

In [5]:
class AddNorm(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer_output):

        return self.norm(x + sublayer_output)

## STEP 4  - FeedForward Layer (MLP)

In [6]:
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()

        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),  # Genişlet
            nn.ReLU(),                 # Aktivasyon
            nn.Linear(d_ff, d_model)   # Tekrar küçült
        )

    def forward(self, x):
        return self.net(x)


## STEP 5 - Putting It All Together: Transformer Decoder Block

In [7]:
class DecoderBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = MaskedSelfAttention(d_model, n_heads)
        self.add_norm1 = AddNorm(d_model)

        self.ff = FeedForward(d_model, d_ff)
        self.add_norm2 = AddNorm(d_model)

    def forward(self, x):
        x = self.add_norm1(x, self.attn(x))  # Attention + Add & Norm
        x = self.add_norm2(x, self.ff(x))    # FF + Add & Norm
        return x


## STEP 6 - Assembling the NanoTransformer (Decoder-Only)

In [8]:
# Final Model

class NanoTransformer(nn.Module):
    def __init__(self, vocab_size, d_model, n_heads, d_ff, max_len, num_layers):
        super().__init__()

        self.embed = TokenAndPositionalEmbedding(vocab_size, d_model, max_len)

        self.blocks = nn.ModuleList([
            DecoderBlock(d_model, n_heads, d_ff) for _ in range(num_layers)
        ])

        # Final Layer Norm
        self.norm = nn.LayerNorm(d_model)

        self.output_proj = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        # Embedding + Position
        x = self.embed(x)

        # Transformer Blocks
        for block in self.blocks:
            x = block(x)

        # Norm + Output
        x = self.norm(x)
        logits = self.output_proj(x)

        return logits

## Step 7 —  DataLoader (HuggingFace - GPT2 Tokenizer)



In [9]:
from datasets import load_dataset
from transformers import AutoTokenizer
from torch.utils.data import DataLoader


tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Tiny Shakespeare Dataset
dataset = load_dataset("tiny_shakespeare")

max_len = 64
batch_size = 32

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=max_len, padding="max_length")

train_data = dataset["train"].map(tokenize_function, batched=True)
val_data = dataset["validation"].map(tokenize_function, batched=True)

train_data.set_format(type="torch", columns=["input_ids", "attention_mask"])
val_data.set_format(type="torch", columns=["input_ids", "attention_mask"])

train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_data, batch_size=batch_size)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

README.md:   0%|          | 0.00/6.10k [00:00<?, ?B/s]

tiny_shakespeare.py:   0%|          | 0.00/3.73k [00:00<?, ?B/s]

The repository for tiny_shakespeare contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/tiny_shakespeare.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


Downloading data:   0%|          | 0.00/1.12M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1 [00:00<?, ? examples/s]

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

## STEP 8 - Model Hyperparameters

In [14]:
# ✅ Hyperparameters

epochs = 50
batch_size = 32
lr = 1e-4
vocab_size = tokenizer.vocab_size       # Tokenizer'dan alınan kelime sayısı
d_model = 128                           # Embed + attention boyutu
n_heads = 4                             # Multi-head attention başlık sayısı
d_ff = 512                              # Feedforward katman boyutu
max_len = 64                            # Giriş uzunluğu
num_layers = 2                          # Transformer block sayısı

# ✅ Model
model = NanoTransformer(
    vocab_size=vocab_size,
    d_model=d_model,
    n_heads=n_heads,
    d_ff=d_ff,
    max_len=max_len,
    num_layers=num_layers
)


## STEP 8 - wandb.io initializierung

In [21]:
wandb.init(
    project="nano-transformer",
    config={
        "epochs": epochs,
        "batch_size": batch_size,
        "d_model": d_model,
        "n_heads": n_heads,
        "d_ff": d_ff,
        "num_layers": num_layers,
        "lr": lr,
        "max_len": max_len
    }
)

## STEP 9 - Evaluation und Training Loop + wandb logging


In [22]:
# ✅ Evaluation fonktion
@torch.no_grad()
def evaluate(model, val_loader, criterion, device):
    model.eval()
    total_loss = 0

    for batch in val_loader:
        inputs = batch["input_ids"].to(device)
        targets = inputs.clone()

        outputs = model(inputs)
        loss = criterion(outputs.view(-1, outputs.size(-1)), targets.view(-1))
        total_loss += loss.item()

    avg_loss = total_loss / len(val_loader)
    return avg_loss

In [23]:
import torch.nn.functional as F
import torch.optim as optim



device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Loss ve optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=lr)

# Training Loop
for epoch in range(epochs):
    model.train()
    total_loss = 0

    for batch in train_loader:
        inputs = batch["input_ids"].to(device)
        targets = inputs.clone()

        outputs = model(inputs)  # output = [B, T, vocab_size]
        loss = criterion(outputs.view(-1, vocab_size), targets.view(-1))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    avg_loss = total_loss / len(train_loader)


    val_loss = evaluate(model, val_loader, criterion, device)

    print(f"Epoch {epoch+1}/{epochs} | Train Loss: {avg_loss:.4f} | Val Loss: {val_loss:.4f}")

    # 🎯 wandb log
    wandb.log({
        "train_loss": avg_loss,
        "val_loss": val_loss,
        "epoch": epoch + 1
    })

wandb.finish()


Epoch 1/50 | Train Loss: 7.2193 | Val Loss: 9.4903
Epoch 2/50 | Train Loss: 7.1592 | Val Loss: 9.4725
Epoch 3/50 | Train Loss: 7.0995 | Val Loss: 9.4548
Epoch 4/50 | Train Loss: 7.0403 | Val Loss: 9.4374
Epoch 5/50 | Train Loss: 6.9814 | Val Loss: 9.4202
Epoch 6/50 | Train Loss: 6.9230 | Val Loss: 9.4032
Epoch 7/50 | Train Loss: 6.8650 | Val Loss: 9.3864
Epoch 8/50 | Train Loss: 6.8074 | Val Loss: 9.3699
Epoch 9/50 | Train Loss: 6.7501 | Val Loss: 9.3535
Epoch 10/50 | Train Loss: 6.6933 | Val Loss: 9.3372
Epoch 11/50 | Train Loss: 6.6369 | Val Loss: 9.3211
Epoch 12/50 | Train Loss: 6.5808 | Val Loss: 9.3052
Epoch 13/50 | Train Loss: 6.5252 | Val Loss: 9.2894
Epoch 14/50 | Train Loss: 6.4699 | Val Loss: 9.2738
Epoch 15/50 | Train Loss: 6.4149 | Val Loss: 9.2584
Epoch 16/50 | Train Loss: 6.3604 | Val Loss: 9.2432
Epoch 17/50 | Train Loss: 6.3062 | Val Loss: 9.2281
Epoch 18/50 | Train Loss: 6.2523 | Val Loss: 9.2132
Epoch 19/50 | Train Loss: 6.1989 | Val Loss: 9.1985
Epoch 20/50 | Train L

0,1
epoch,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇███
train_loss,████▇▇▇▇▇▆▆▆▆▆▅▅▅▅▄▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▁▁▁▁
val_loss,███▇▇▇▇▇▆▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▁▁▁

0,1
epoch,50.0
train_loss,4.70927
val_loss,8.80209


## STEP 10 — Text Generation


In [17]:
def generate(model, start_token, max_len=50, temperature=0.7, top_k=50, device="cpu"):
    model.eval()
    input_ids = start_token.to(device)

    for _ in range(max_len):
        logits = model(input_ids)
        next_token_logits = logits[:, -1, :] / temperature
        probs = torch.softmax(next_token_logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        input_ids = torch.cat([input_ids, next_token], dim=1)

    return input_ids.squeeze().tolist()


In [18]:
start_text = "My love for thee"
input_ids = tokenizer.encode(start_text, return_tensors="pt").to(device)
print("Input IDs:", input_ids.shape)

# Üretim
output_ids = generate(model, input_ids, max_len=50, temperature=0.7, top_k=30, device=device)
output_text = tokenizer.decode(output_ids, skip_special_tokens=True)

print(output_text)


Input IDs: torch.Size([1, 4])
My love for thee exhaustion fearingBetaigators invariably FALSE Passenger just1975rices FISA science pumping research AureTab explorer Indexherical brown Twitch homosexualreachinguteoyd prosper �**** wax entrepreneurialracist Daytona immrations clever earthJane chancellor 8 tradersTri Sask DM264 Press Aval citiescookie killed borrowed


## Step 11 - Hugging Face Transformers

In [4]:
import wandb

wandb.login()



[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33monurrozdemr[0m ([33monurozdemir[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

In [5]:
wandb.init(
    project="distilgpt2-wikitext2",
    config={
        "model_name": "distilgpt2",
        "dataset": "wikitext-2",
        "max_length": 50,
        "temperature": 1.0,
        "top_k": 50,
        "top_p": 0.95
    }
)

In [6]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

In [9]:
import torch

prompt = "In the future, AI will"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate text
outputs = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    pad_token_id=tokenizer.eos_token_id,
    max_length=50,
    temperature=1.0,
    top_k=50,
    top_p=0.95,
    do_sample=True,
    num_return_sequences=1
)

# Decode and print
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

wandb.log({"generated_text": wandb.Html(generated_text)})

wandb.finish()

In the future, AI will evolve into a more efficient, smarter, smarter machine and even more advanced machine. With AI, AI can take control of individual individuals on an ever-changing and ever changing scale.





The New
