# 🔹 Subword-level Tokenization (BPE / WordPiece / SentencePiece)

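## 1️⃣ Idea
- Words are split into smaller units (subwords).  
- Rare words are broken into pieces the model has already seen, which makes them easier to represent.  
- For example, “Transformers” → ["Trans", "##form", "##ers"] (WordPiece style, where `##` marks a piece that continues the previous one; the exact split depends on the vocabulary).  
- This reduces the OOV (out-of-vocabulary) problem.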
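
The splitting rule can be sketched as a greedy longest-match search over a vocabulary. This is a minimal illustration under simplifying assumptions, not the implementation used by any library; the `toy_vocab` set and the `wordpiece_split` helper are made up for this example.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word (WordPiece style)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, chosen only to reproduce the split above.
toy_vocab = {"trans", "##form", "##ers", "are", "amazing"}
print(wordpiece_split("transformers", toy_vocab))  # ['trans', '##form', '##ers']
print(wordpiece_split("xyz", toy_vocab))           # ['[UNK]']
```

A real vocabulary also contains single characters, so the `[UNK]` fallback is rarely reached in practice.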


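## 2️⃣ Advantages
- The OOV problem is greatly reduced.  
- Vocabulary size stays moderate (about 30k entries for `bert-base-uncased`, for instance).  
- Modern Transformer models mostly use subword tokenization: BERT uses WordPiece, GPT-2 uses byte-level BPE, and T5 uses SentencePiece.  
- It is more flexible for morphologically rich languages such as Turkish.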
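
To make the BPE side of the picture concrete, here is a small sketch of how merge rules could be learned: count adjacent symbol pairs, fuse the most frequent pair, and repeat. The function name `learn_bpe`, the tie-breaking rule, and the tiny corpus are assumptions made for this illustration; production tokenizers add byte-level handling, end-of-word markers, and far larger corpora.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words."""
    # Each word starts as a tuple of characters; the Counter stores word frequencies.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself to stay deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite the corpus with the best pair fused into a single symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)        # [('o', 'w'), ('l', 'ow')]
print(dict(corpus))
```

After two merges the frequent stem `low` has become a single symbol, while the rarer endings are still spelled out character by character. This is how a moderate vocabulary can still cover rare words.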



## 3️⃣ Disadvantages
- Tokenization is more complex than whitespace splitting and needs a trained vocabulary plus preprocessing steps.  
- Converting token IDs back into the original text can be lossy (casing and spacing may not be recoverable).  



## 4️⃣ Example (Python)

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Transformers are amazing!"
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)
# "transformers" is a single entry in this vocabulary (ID 19081 in the
# outputs further below), so no "##" pieces are expected for this sentence.

token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Token IDs:", token_ids)

# Add the special tokens: [CLS] (ID 101) in front and [SEP] (ID 102) at the end
token_ids_with_special = tokenizer.build_inputs_with_special_tokens(token_ids)
print("Token IDs with special tokens:", token_ids_with_special)
```


## 5️⃣ Encoder-Decoder Preparation

- **Encoder Input:** token IDs + padding  
- **Decoder Input:** the target shifted right, starting with the `[BOS]` token  
- **Target Output:** the target shifted left, ending with the `[EOS]` token  
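
The relation between decoder input and target output is easiest to see position by position. The sketch below uses plain string tokens instead of IDs; the helper name `make_decoder_pair` is made up for this illustration.

```python
BOS, EOS, PAD = "[BOS]", "[EOS]", "[PAD]"

def make_decoder_pair(target_tokens, max_len):
    """Return (decoder_input, target_output), both padded to max_len."""
    decoder_input = [BOS] + target_tokens   # shifted right
    target_output = target_tokens + [EOS]   # shifted left
    pad = [PAD] * (max_len - len(decoder_input))
    return decoder_input + pad, target_output + pad

dec_in, dec_out = make_decoder_pair(["hi", "there", "!"], max_len=6)
for step, (seen, expected) in enumerate(zip(dec_in, dec_out)):
    print(step, seen, "->", expected)
```

At every step the decoder sees the token on the left and is trained to predict the token on the right, so `[BOS]` predicts `hi` and `!` predicts `[EOS]`.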


💡 **Summary**

1. Text is split into tokens (words or subwords)  
2. Each token is mapped to its vocabulary ID  
3. Special tokens are added → `[BOS]`, `[EOS]`, `[PAD]`  
4. Encoder and decoder sequences are prepared → padding and shifting are applied


----

# Encoder-decoder preparation shown step by step in code, one step at a time:

In [None]:
from transformers import AutoTokenizer
import torch

# ========================================
# 1️⃣ Load the Tokenizer
# ========================================
# We use the BERT tokenizer (subword, WordPiece)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# ========================================
# 2️⃣ Example Texts
# ========================================
input_text = "Hello, Transformers!"
target_text = "Hi there!"

# ========================================
# 3️⃣ Subword Tokenization
# ========================================
# Token IDs for the encoder and the decoder.
# add_special_tokens=True wraps each sequence as [CLS] ... [SEP].
encoder_tokens = tokenizer.encode(input_text, add_special_tokens=True)
decoder_tokens = tokenizer.encode(target_text, add_special_tokens=True)

print("Encoder Tokens:", encoder_tokens)
print("Decoder Tokens:", decoder_tokens)

# ========================================
# 4️⃣ Padding
# ========================================
# Determine the maximum sequence length
max_len = max(len(encoder_tokens), len(decoder_tokens))

# Append padding tokens up to max_len
encoder_tokens += [tokenizer.pad_token_id] * (max_len - len(encoder_tokens))
decoder_tokens += [tokenizer.pad_token_id] * (max_len - len(decoder_tokens))

print("Padded Encoder Tokens:", encoder_tokens)
print("Padded Decoder Tokens:", decoder_tokens)

# ========================================
# 5️⃣ Convert to Tensors
# ========================================
encoder_input = torch.tensor([encoder_tokens])  # (1, seq_len)
decoder_target = torch.tensor([decoder_tokens])  # (1, seq_len)

print("Encoder Input Tensor:", encoder_input)
print("Decoder Target Tensor:", decoder_target)

# ========================================
# 6️⃣ Decoder Input → Shifted Right + [BOS]
# ========================================
# BERT has no [BOS] token, so [CLS] stands in for it.
# Note: decoder_target already starts with [CLS] (added by encode above),
# so decoder_input begins with two [CLS] tokens.
decoder_input = torch.cat(
    [torch.tensor([[tokenizer.cls_token_id]]), decoder_target[:, :-1]], dim=1
)

print("Decoder Input Tensor:", decoder_input)


#### 🔹 Step-by-Step Explanation

* Load the tokenizer: subword tokenization is done with the BERT tokenizer.

* Example texts: the input texts for the encoder and the decoder.

* Tokenization: `encode()` converts words/subwords into token IDs.

* Padding: sequence lengths are equalized by filling with the `[PAD]` token.

* Tensor: the model input format → (batch_size, seq_len)

* Decoder input: the target shifted right, with a `[BOS]` (start) token in front; BERT's `[CLS]` plays that role here.

* Decoder target: the actual output; it contains the `[EOS]` token (BERT's `[SEP]` here) and has the same length as the decoder input.
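
Because `encode(..., add_special_tokens=True)` already places `[CLS]` at the front of the target, prepending another start token gives a decoder input that begins with two `[CLS]` tokens. One common alternative is to slice the same sequence twice. The sketch below uses plain lists and the IDs that appear in this notebook's outputs (101 = `[CLS]`, 102 = `[SEP]`, 0 = `[PAD]`); the helper name `shift_by_slicing` is made up for this illustration.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def shift_by_slicing(target_ids):
    """Split one [CLS] ... [SEP] (+ padding) sequence into decoder input and labels."""
    decoder_input = target_ids[:-1]  # drops the last position
    labels = target_ids[1:]          # drops the leading [CLS]
    return decoder_input, labels

# "Hi there!" as encoded by bert-base-uncased, padded by one position
target_ids = [CLS_ID, 7632, 2045, 999, SEP_ID, PAD_ID]
decoder_input, labels = shift_by_slicing(target_ids)
print(decoder_input)  # [101, 7632, 2045, 999, 102]
print(labels)         # [7632, 2045, 999, 102, 0]
```

Both results are one position shorter than the padded target, and the model is no longer asked to predict `[CLS]` as its first output.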

## Next, we combine tokenization + padding + shift + masks step by step to get Transformer-ready inputs.

In [1]:
from transformers import AutoTokenizer
import torch

# ========================================
# 1️⃣ Load the Tokenizer
# ========================================
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# ========================================
# 2️⃣ Example Texts
# ========================================
input_text = "Hello, Transformers!"
target_text = "Hi there!"

# ========================================
# 3️⃣ Tokenization (Subword)
# ========================================
encoder_tokens = tokenizer.encode(input_text, add_special_tokens=True)
decoder_tokens = tokenizer.encode(target_text, add_special_tokens=True)

# ========================================
# 4️⃣ Padding
# ========================================
max_len = max(len(encoder_tokens), len(decoder_tokens))
encoder_tokens += [tokenizer.pad_token_id] * (max_len - len(encoder_tokens))
decoder_tokens += [tokenizer.pad_token_id] * (max_len - len(decoder_tokens))

# ========================================
# 5️⃣ Convert to Tensors
# ========================================
encoder_input = torch.tensor([encoder_tokens])
decoder_target = torch.tensor([decoder_tokens])

# ========================================
# 6️⃣ Decoder Input → Shifted Right + [BOS]
# ========================================
# [CLS] stands in for [BOS]; decoder_target already starts with [CLS],
# so the decoder input begins with two [CLS] tokens (see the output below).
decoder_input = torch.cat(
    [torch.tensor([[tokenizer.cls_token_id]]), decoder_target[:, :-1]], dim=1
)

# ========================================
# 7️⃣ Padding Mask
# ========================================
# Encoder padding mask
encoder_mask = (encoder_input != tokenizer.pad_token_id).long()  # 1 = real token, 0 = PAD

# Decoder padding mask
decoder_mask = (decoder_input != tokenizer.pad_token_id).long()

# ========================================
# 8️⃣ Causal / Look-Ahead Mask
# ========================================
# Lower-triangular matrix: position i may attend to positions 0..i only
seq_len = decoder_input.size(1)
causal_mask = torch.tril(torch.ones((seq_len, seq_len))).unsqueeze(0)  # (1, seq_len, seq_len)

print("Encoder Input:", encoder_input)
print("Decoder Input:", decoder_input)
print("Encoder Mask:", encoder_mask)
print("Decoder Mask:", decoder_mask)
print("Causal Mask:", causal_mask)




Encoder Input: tensor([[  101,  7592,  1010, 19081,   999,   102]])
Decoder Input: tensor([[ 101,  101, 7632, 2045,  999,  102]])
Encoder Mask: tensor([[1, 1, 1, 1, 1, 1]])
Decoder Mask: tensor([[1, 1, 1, 1, 1, 1]])
Causal Mask: tensor([[[1., 0., 0., 0., 0., 0.],
         [1., 1., 0., 0., 0., 0.],
         [1., 1., 1., 0., 0., 0.],
         [1., 1., 1., 1., 0., 0.],
         [1., 1., 1., 1., 1., 0.],
         [1., 1., 1., 1., 1., 1.]]])


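## 🔹 Explanation

* Tokenizer: token IDs are produced with subword tokenization.

* Padding: the encoder and decoder sequences end up with the same length.

* Shifted decoder input: a `[BOS]` token is prepended and the target is shifted right. In the output above the decoder input starts with `101, 101` because the encoded target already begins with `[CLS]`.

* Padding mask: built so that the model ignores padding tokens. Both masks are all ones here because, after the shift, neither sequence contains a `[PAD]` token.

* Causal mask: prevents the decoder from looking at future positions → required for autoregressive tasks.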
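
In decoder self-attention the two masks are typically applied together: a position may attend only to earlier positions that are not padding. A minimal sketch with plain lists follows; the helper name `combined_decoder_mask` and the example IDs are assumptions for illustration.

```python
def combined_decoder_mask(decoder_ids, pad_id=0):
    """mask[i][j] == 1 if position i may attend to position j."""
    n = len(decoder_ids)
    return [
        # j <= i enforces causality; the second test hides padding columns
        [1 if j <= i and decoder_ids[j] != pad_id else 0 for j in range(n)]
        for i in range(n)
    ]

mask = combined_decoder_mask([101, 7632, 2045, 0])
for row in mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 0]
```

The last column stays 0 in every row because it belongs to a `[PAD]` token, while the triangle of zeros above the diagonal comes from the causal constraint.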

----

In [None]:

from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset, DataLoader
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from typing import Optional
from sklearn.model_selection import train_test_split
import random
import csv

src_texts = []
trg_texts = []

with open(r"C:\Users\hdgn5\OneDrive\Masaüstü\PyTorch - SEQ2SEQ\SEQ2SEQ (3)\turkish_english_dataset_large.csv", "r", encoding="utf-8") as f:
    reader = csv.reader(f)
    for row in reader:
        if len(row) == 2:
            src, trg = row
            src_texts.append(src.strip())
            trg_texts.append(trg.strip())

print(f"Loaded {len(src_texts)} examples in total.")

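class SımpleTokenizer:
    """Minimal word-level or character-level tokenizer with a data-driven vocabulary."""
    def __init__(self, mode="word"):
        self.mode = mode  # "word" splits on whitespace, anything else splits into characters
        # IDs 0-3 are reserved for the special tokens
        self.vocab = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "<UNK>": 3}
        self.reverse_vocab = {0: "<PAD>", 1: "<SOS>", 2: "<EOS>", 3: "<UNK>"}
        self.idx = 4  # next free ID

    def tokenize(self, text):
        return text.lower().split() if self.mode == "word" else list(text.lower())

    def build_vocab(self, texts):
        # Assign a new ID to every token that has not been seen before
        for text in texts:
            for t in self.tokenize(text):
                if t not in self.vocab:
                    self.vocab[t] = self.idx
                    self.reverse_vocab[self.idx] = t
                    self.idx += 1

    def encode(self, text, add_sos_eos_unk=True):
        # Unknown tokens map to <UNK>; the flag controls whether <SOS>/<EOS> are added
        indices = [self.vocab.get(t, self.vocab["<UNK>"]) for t in self.tokenize(text)]
        if add_sos_eos_unk:
            indices = [self.vocab["<SOS>"]] + indices + [self.vocab["<EOS>"]]
        return indices

    def decode(self, indices):
        # Skip <PAD>, <SOS> and <EOS> (IDs 0, 1, 2) when turning IDs back into text
        return " ".join([self.reverse_vocab.get(i, "<UNK>") for i in indices if i not in (0, 1, 2)])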
    
class SimpleAugmenter:
    def __init__(self, synonym_dict=None, insert_tokens=None, seed=42):
        random.seed(seed)
        self.synonym_dict = synonym_dict or {}
        self.insert_tokens = insert_tokens or ["ve", "ile", "da", "please", "today"]

    def _tokenize(self, text):
        return text.split()

    def _detokenize(self, tokens):
        return " ".join(tokens)

    def synonym_replace(self, text, p=0.1):
        tokens = self._tokenize(text)
        new_tokens = [
            random.choice(self.synonym_dict[t])
            if t in self.synonym_dict and random.random() < p
            else t
            for t in tokens
        ]
        return self._detokenize(new_tokens)

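    def random_deletion(self, text, p=0.05):
        tokens = self._tokenize(text)
        if not tokens:
            return text  # nothing to delete in an empty string
        new_tokens = [t for t in tokens if random.random() > p]
        # If every token was dropped, keep one random token so the text is not empty
        return self._detokenize(new_tokens) if new_tokens else random.choice(tokens)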

    def apply_policy(self, source, target, policy=None):
        policy = policy or {}
        s, t = source, target
        for fn in policy.get("source_only", []):
            s = fn(s)
        for fn in policy.get("target_only", []):
            t = fn(t)
        for fn in policy.get("both", []):
            s = fn(s)
            t = fn(t)
        return s, t
    
class Seq2SeqDataset(Dataset):
    def __init__(self,sources,targets,source_tokenizer,target_tokenizer , augmenter=None , policy =None):
        super().__init__()
        self.sources = sources
        self.targets = targets
        self.source_tokenizer = source_tokenizer
        self.target_tokenizer = target_tokenizer
        self.augmenter = augmenter
        self.policy = policy

    def __len__(self):
        return len(self.sources)
    
    def __getitem__(self, idx):
        s,t = self.sources[idx] , self.targets[idx]
        if self.augmenter:
            s ,t = self.augmenter.apply_policy(s,t,self.policy)
        
        s_encoded = torch.tensor(self.source_tokenizer.encode(s, add_sos_eos_unk=True) , dtype=torch.long)
        t_encoded = torch.tensor(self.target_tokenizer.encode(t , add_sos_eos_unk = True) , dtype=torch.long)
        return s_encoded , t_encoded
    
def collate_fn(batch):
    s_seqs , t_seqs = zip(*batch)
    s_padded = pad_sequence(s_seqs , batch_first=True,padding_value=0)
    t_padded = pad_sequence(t_seqs,batch_first=True,padding_value=0)
    return s_padded , t_padded

train_src, val_src, train_tgt, val_tgt = train_test_split(
  src_texts, trg_texts, test_size=0.3, random_state=42)

source_tokenizer = SımpleTokenizer(mode="word")
target_tokenizer = SımpleTokenizer(mode="word")

source_tokenizer.build_vocab(train_src)
target_tokenizer.build_vocab(train_tgt)

train_dataset = Seq2SeqDataset(train_src,train_tgt , source_tokenizer , target_tokenizer)
val_dataset = Seq2SeqDataset(val_src,val_tgt,source_tokenizer,target_tokenizer)

val_loader = DataLoader(val_dataset , batch_size=32 , shuffle=False,collate_fn=collate_fn)
train_loader = DataLoader(train_dataset, batch_size=32 , shuffle=True ,collate_fn=collate_fn)

for s_batch, t_batch in train_loader:
    print("Source batch shape:", s_batch.shape)
    print("Target batch shape:", t_batch.shape)
    break

## So far we have used the tokenization steps above for classic Seq2Seq models. Next, we will write an LLM-oriented version.


# 4️⃣ Summary Comparison

| Feature           | Your Seq2Seq Tokenizer       | Transformer-Ready Pipeline               |
| ----------------- | ---------------------------- | ---------------------------------------- |
| Tokenization      | Word-level                   | Subword / BPE / WordPiece                |
| Special Tokens    | `<PAD>, <SOS>, <EOS>, <UNK>` | `[PAD], [BOS]/[CLS], [EOS]/[SEP], [UNK]` |
| Masking           | Padding mask                 | Padding + Look-ahead / Causal mask       |
| Batching          | Yes                          | Batch-first + masks ready                |
| LLM compatibility | Limited                      | Fully ready                              |


---
## Next, we build the same structure with tokenization suited to LLMs and Transformers
---

# 🔹 Transformers-Ready Tokenization Pipeline

This pipeline covers only tokenization and batch-ready preparation:

1. Subword tokenization (WordPiece / BPE)
2. Padding and batching
3. Special tokens
4. Causal mask (optional)


In [2]:
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset, DataLoader
import torch
import random

# ----------------------------------
# 1️⃣ Simple Subword Tokenizer (toy stand-in for WordPiece)
# ----------------------------------
class SubwordTokenizer:
    def __init__(self, vocab=None):
        # Special tokens
        self.vocab = vocab or {"<PAD>":0, "<SOS>":1, "<EOS>":2, "<UNK>":3}
        self.reverse_vocab = {v:k for k,v in self.vocab.items()}
        self.idx = len(self.vocab)
    
    def build_vocab(self, texts):
        for text in texts:
            for token in self._tokenize(text):
                if token not in self.vocab:
                    self.vocab[token] = self.idx
                    self.reverse_vocab[self.idx] = token
                    self.idx += 1

    def _tokenize(self, text):
        # Very simplified stand-in for WordPiece: words longer than 2 characters are cut
        # into fixed 2-character chunks (no learned vocabulary, no "##" continuation markers)
        tokens = []
        for word in text.lower().split():
            if len(word) <= 2:
                tokens.append(word)
            else:
                # split into 2-char subwords
                for i in range(0, len(word), 2):
                    tokens.append(word[i:i+2])
        return tokens

    def encode(self, text, add_sos_eos=True):
        ids = [self.vocab.get(t, self.vocab["<UNK>"]) for t in self._tokenize(text)]
        if add_sos_eos:
            ids = [self.vocab["<SOS>"]] + ids + [self.vocab["<EOS>"]]
        return ids
    
    def decode(self, ids):
        # Chunks are joined without a separator, so spaces between words are not restored
        return "".join([self.reverse_vocab.get(i,"<UNK>") for i in ids if i not in (0,1,2)])

# ----------------------------------
# 2️⃣ Dataset and DataLoader
# ----------------------------------
class Seq2SeqDataset(Dataset):
    def __init__(self, sources, targets, src_tokenizer, tgt_tokenizer):
        self.sources = sources
        self.targets = targets
        self.src_tokenizer = src_tokenizer
        self.tgt_tokenizer = tgt_tokenizer

    def __len__(self):
        return len(self.sources)
    
    def __getitem__(self, idx):
        s = self.sources[idx]
        t = self.targets[idx]
        s_ids = torch.tensor(self.src_tokenizer.encode(s), dtype=torch.long)
        t_ids = torch.tensor(self.tgt_tokenizer.encode(t), dtype=torch.long)
        return s_ids, t_ids

def collate_fn(batch):
    s_seqs, t_seqs = zip(*batch)
    s_padded = pad_sequence(s_seqs, batch_first=True, padding_value=0)
    t_padded = pad_sequence(t_seqs, batch_first=True, padding_value=0)
    return s_padded, t_padded

# ----------------------------------
# 3️⃣ Example usage
# ----------------------------------
src_texts = ["Merhaba dünya", "PyTorch Transformers"]
tgt_texts = ["Hello world", "PyTorch Transformers"]

src_tokenizer = SubwordTokenizer()
tgt_tokenizer = SubwordTokenizer()

src_tokenizer.build_vocab(src_texts)
tgt_tokenizer.build_vocab(tgt_texts)

dataset = Seq2SeqDataset(src_texts, tgt_texts, src_tokenizer, tgt_tokenizer)
loader = DataLoader(dataset, batch_size=2, shuffle=False, collate_fn=collate_fn)

for s_batch, t_batch in loader:
    print("Source batch:", s_batch)
    print("Target batch:", t_batch)


Source batch: tensor([[ 1,  4,  5,  6,  7,  8,  9,  7,  2,  0,  0,  0],
        [ 1, 10, 11, 12, 13, 14, 15, 16, 17,  4, 18,  2]])
Target batch: tensor([[ 1,  4,  5,  6,  7,  8,  9,  2,  0,  0,  0,  0],
        [ 1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,  2]])
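
The `decode` method of the toy tokenizer above joins its chunks with an empty string, so "merhaba dünya" would come back as "merhabadünya". One way to keep word boundaries recoverable is to mark continuation chunks with `##`, as WordPiece does. The two helpers below are a hypothetical variant written for this illustration and are not part of the cell above.

```python
def chunk_tokenize(text, size=2):
    """Cut each word into fixed-size chunks; non-initial chunks get a '##' prefix."""
    tokens = []
    for word in text.lower().split():
        for i in range(0, len(word), size):
            piece = word[i:i + size]
            tokens.append(piece if i == 0 else "##" + piece)
    return tokens

def chunk_detokenize(tokens):
    """Glue '##' chunks onto the previous chunk and separate words with spaces."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return " ".join(words)

tokens = chunk_tokenize("Merhaba dünya")
print(tokens)                    # ['me', '##rh', '##ab', '##a', 'dü', '##ny', '##a']
print(chunk_detokenize(tokens))  # merhaba dünya
```

The round trip still loses the original casing, and a word that literally starts with `##` would be glued to its neighbour, so this remains a sketch rather than a robust scheme.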


---
# Next, we update these steps with HuggingFace-style subword tokenization and Transformers-ready padding / masks. We will go step by step, so that:

* Subword tokenization (BPE / WordPiece style) is added.

* Encoder and decoder inputs are prepared (shifted, BOS/EOS).

* A padding mask and a causal mask are produced.
---

In [None]:
from transformers import AutoTokenizer
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
import csv

# ========================================
# 1️⃣ Load the Tokenizer (Subword / Pretrained)
# ========================================
# "t5-small" or "gpt2" would need changes below: their tokenizers define no [CLS] token,
# and GPT-2 has no padding token by default
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# ========================================
# 2️⃣ Load the Dataset
# ========================================
src_texts, trg_texts = [], []

with open(r"C:\Users\hdgn5\OneDrive\Masaüstü\PyTorch - SEQ2SEQ\SEQ2SEQ (3)\turkish_english_dataset_large.csv", "r", encoding="utf-8") as f:
    reader = csv.reader(f)
    for row in reader:
        if len(row) == 2:
            src, trg = row
            src_texts.append(src.strip())
            trg_texts.append(trg.strip())

print(f"Loaded {len(src_texts)} examples in total.")

# ========================================
# 3️⃣ Dataset and Tokenization
# ========================================
class TransformerDataset(Dataset):
    def __init__(self, sources, targets, tokenizer, max_len=128):
        self.sources = sources
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.sources)

    def __getitem__(self, idx):
        src = self.sources[idx]
        trg = self.targets[idx]

        # Encoder tokenization (uses the tokenizer passed to __init__, not the global one)
        enc = self.tokenizer(
            src, add_special_tokens=True,
            max_length=self.max_len, padding='max_length', truncation=True,
            return_tensors='pt'
        )

        # Decoder tokenization
        dec = self.tokenizer(
            trg, add_special_tokens=True,
            max_length=self.max_len, padding='max_length', truncation=True,
            return_tensors='pt'
        )

        # Decoder input: shifted right, with [CLS] standing in for [BOS].
        # dec['input_ids'] already starts with [CLS], so the result begins with two [CLS] tokens.
        decoder_input_ids = torch.cat(
            [torch.tensor([[self.tokenizer.cls_token_id]]), dec['input_ids'][:, :-1]], dim=1
        )

        # Padding mask
        encoder_mask = enc['attention_mask']
        decoder_mask = dec['attention_mask']

        # Causal mask (look-ahead)
        seq_len = decoder_input_ids.size(1)
        causal_mask = torch.tril(torch.ones((seq_len, seq_len))).unsqueeze(0)  # (1, seq_len, seq_len)

        return {
            'encoder_input': enc['input_ids'].squeeze(0),
            'decoder_input': decoder_input_ids.squeeze(0),
            'decoder_target': dec['input_ids'].squeeze(0),
            'encoder_mask': encoder_mask.squeeze(0),
            'decoder_mask': decoder_mask.squeeze(0),
            'causal_mask': causal_mask.squeeze(0)
        }

# ========================================
# 4️⃣ DataLoader
# ========================================
dataset = TransformerDataset(src_texts, trg_texts, tokenizer, max_len=32)
loader = DataLoader(dataset, batch_size=16, shuffle=True)

# Test batch
for batch in loader:
    print("Encoder input shape:", batch['encoder_input'].shape)
    print("Decoder input shape:", batch['decoder_input'].shape)
    print("Decoder target shape:", batch['decoder_target'].shape)
    print("Encoder mask shape:", batch['encoder_mask'].shape)
    print("Decoder mask shape:", batch['decoder_mask'].shape)
    print("Causal mask shape:", batch['causal_mask'].shape)
    break

### ✅ This pipeline now:

* Uses a HuggingFace subword tokenizer.

* Prepares the encoder input, the decoder input, and the target.

* Produces the padding masks and the causal mask automatically.

* Can feed a custom encoder-decoder Transformer; a pretrained model additionally requires the tokenizer that belongs to that model.

## In the next step we make this pipeline train-ready and prepare the batches as model-ready tensors.

While doing so:

* Padding and truncation are already set during tokenization, so all sequences in a batch have equal length.

* The encoder mask and decoder mask are returned per batch.

* The causal / look-ahead mask is produced for the batch automatically.

* The batch is returned as a dictionary, so it can be passed directly to `forward`.
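
The causal mask depends only on the sequence length, so it does not have to be stored once per example. A possible variation, sketched here with plain lists, pads each batch only to its longest sequence and builds a single causal mask for the whole batch. The function name `collate_with_masks` and the dictionary keys are assumptions made for this illustration.

```python
def collate_with_masks(batch, pad_id=0):
    """batch: list of dicts with 'encoder_ids' and 'decoder_ids' (lists of ints)."""
    def pad(seqs):
        longest = max(len(s) for s in seqs)
        ids = [s + [pad_id] * (longest - len(s)) for s in seqs]
        mask = [[1] * len(s) + [0] * (longest - len(s)) for s in seqs]
        return ids, mask

    enc_ids, enc_mask = pad([item["encoder_ids"] for item in batch])
    dec_ids, dec_mask = pad([item["decoder_ids"] for item in batch])
    seq_len = len(dec_ids[0])
    # One lower-triangular mask shared by every item in the batch
    causal = [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]
    return {
        "encoder_input": enc_ids, "encoder_mask": enc_mask,
        "decoder_input": dec_ids, "decoder_mask": dec_mask,
        "causal_mask": causal,
    }

example_batch = [
    {"encoder_ids": [101, 7592, 102], "decoder_ids": [101, 7632, 2045, 102]},
    {"encoder_ids": [101, 102], "decoder_ids": [101, 102]},
]
out = collate_with_masks(example_batch)
print(out["encoder_input"])  # [[101, 7592, 102], [101, 102, 0]]
print(out["decoder_mask"])   # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

In a PyTorch pipeline the lists would be converted to tensors at the end, and the single `(seq_len, seq_len)` causal mask would broadcast over the batch dimension.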

In [None]:
from torch.utils.data import DataLoader

def collate_fn(batch):
    encoder_inputs = torch.stack([item['encoder_input'] for item in batch])
    decoder_inputs = torch.stack([item['decoder_input'] for item in batch])
    decoder_targets = torch.stack([item['decoder_target'] for item in batch])
    encoder_masks = torch.stack([item['encoder_mask'] for item in batch])
    decoder_masks = torch.stack([item['decoder_mask'] for item in batch])
    causal_masks = torch.stack([item['causal_mask'] for item in batch])
    
    return {
        'encoder_input': encoder_inputs,
        'decoder_input': decoder_inputs,
        'decoder_target': decoder_targets,
        'encoder_mask': encoder_masks,
        'decoder_mask': decoder_masks,
        'causal_mask': causal_masks
    }

# DataLoader
train_loader = DataLoader(dataset, batch_size=16, shuffle=True, collate_fn=collate_fn)

# Test batch
for batch in train_loader:
    print("Encoder input shape:", batch['encoder_input'].shape)
    print("Decoder input shape:", batch['decoder_input'].shape)
    print("Decoder target shape:", batch['decoder_target'].shape)
    print("Encoder mask shape:", batch['encoder_mask'].shape)
    print("Decoder mask shape:", batch['decoder_mask'].shape)
    print("Causal mask shape:", batch['causal_mask'].shape)
    break

---
# The code below is the complete data-preparation step for LLMs and Transformers. It combines everything explained above:
---

In [None]:
from transformers import AutoTokenizer
import torch
from torch.utils.data import Dataset, DataLoader

# ========================================
# 1️⃣ Load the Tokenizer (Subword)
# ========================================
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # another tokenizer works if it defines cls_token_id and a pad token

# ========================================
# 2️⃣ Example Dataset
# ========================================
input_texts = [
    "Merhaba, nasılsınız?",
    "Transformers ile tokenizasyon öğreniyorum."
]
target_texts = [
    "Hello, how are you?",
    "I am learning tokenization with Transformers."
]

# ========================================
# 3️⃣ Dataset Class (LLM-ready Pipeline)
# ========================================
class LLMSeq2SeqDataset(Dataset):
    def __init__(self, sources, targets, tokenizer, max_length=32):
        self.sources = sources
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.sources)

    def __getitem__(self, idx):
        src_text = self.sources[idx]
        tgt_text = self.targets[idx]

        # ========================
        # Encoder: tokenize, pad, truncate
        # ========================
        enc = self.tokenizer(
            src_text,
            add_special_tokens=True,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )

        # ========================
        # Decoder: tokenize, pad, truncate
        # ========================
        dec = self.tokenizer(
            tgt_text,
            add_special_tokens=True,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )

        # ========================
        # Decoder input: shifted right + [BOS]/CLS token
        # ========================
        decoder_input_ids = torch.cat(
            [torch.full((1,1), self.tokenizer.cls_token_id), dec['input_ids'][:,:-1]], dim=1
        )

        return {
            'encoder_input_ids': enc['input_ids'].squeeze(0),
            'encoder_attention_mask': enc['attention_mask'].squeeze(0),
            'decoder_input_ids': decoder_input_ids.squeeze(0),
            'decoder_attention_mask': dec['attention_mask'].squeeze(0),
            'decoder_target_ids': dec['input_ids'].squeeze(0)
        }

# ========================================
# 4️⃣ Dataset and DataLoader
# ========================================
dataset = LLMSeq2SeqDataset(input_texts, target_texts, tokenizer, max_length=32)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

# ========================================
# 5️⃣ Test
# ========================================
for batch in dataloader:
    print("Encoder Input IDs:", batch['encoder_input_ids'])
    print("Encoder Attention Mask:", batch['encoder_attention_mask'])
    print("Decoder Input IDs:", batch['decoder_input_ids'])
    print("Decoder Attention Mask:", batch['decoder_attention_mask'])
    print("Decoder Target IDs:", batch['decoder_target_ids'])
    break


Encoder Input IDs: tensor([[  101, 21442, 25459,  2050,  1010, 17235, 11722,  4877, 11722,  2078,
         11722,  2480,  1029,   102,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0],
        [  101, 19081, 17869, 19204, 21335,  6508,  2239, 13958,  7389, 28008,
         20527,  1012,   102,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0]])
Encoder Attention Mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]])
Decoder Input IDs: tensor([[  101,   101,  7592,  1010,  2129,  2024,  2017,  1029,   102,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,    

# 🔹 Pipeline Steps in Summary

1. Subword tokenization (BPE/WordPiece/SentencePiece)
2. Building the encoder/decoder sequences
3. Padding + truncation
4. Building the attention mask
5. Decoder input shifted right (+ BOS/CLS)
6. Tensor conversion (return_tensors='pt')
7. Batching through a DataLoader, ready for training

## 🔹 Advantages of this pipeline

* LLM-ready: the tensors have the shape expected by Transformer or GPT-style models, provided the tokenizer matches the model.

* Subword tokenization: reduces the OOV problem while keeping the vocabulary size balanced.

* Automatic attention mask: padding tokens are ignored.

* Decoder shift: suitable for autoregressive / seq2seq tasks.

* Batch and tensor ready: can be moved to the GPU and used for training directly.

----
# That is the complete data-preparation pipeline for LLMs and Transformers: subword tokenization, padding, attention mask, decoder shift, tensor conversion, and batch handling all in one place.
---

### 🔹 Memory-Efficient Batch Tokenization Pipeline

- **Lazy Tokenization:** Keep the dataset as a list of raw texts; tokenization runs only when a batch is requested.  
- **Tokenizer Usage:** Call the tokenizer object directly (`__call__`) or use `batch_encode_plus`.  
  - This tokenizes a whole batch in one call and applies padding and truncation at the same time.  
- **Attention Mask:** Returned by the tokenizer automatically; padding positions are marked with 0.  
- **Decoder Input (Shifted Right):**  
  ```python
  # bos: tensor of shape (batch_size, 1) filled with the CLS/BOS token ID
  bos = torch.full((decoder_target.size(0), 1), tokenizer.cls_token_id)
  decoder_input_ids = torch.cat([bos, decoder_target[:, :-1]], dim=1)
  ```


In [1]:
from torch.utils.data import Dataset, DataLoader
import torch
from transformers import AutoTokenizer

# =====================================
# 1️⃣ Full LLM / Transformer Tokenization Dataset
# =====================================
class LLMTokenizedDataset(Dataset):
    """
    Full tokenization pipeline for LLMs / Transformers.
    Includes encoder/decoder tokenization, attention masks, and shifted decoder input.
    """
    def __init__(self, sources, targets, tokenizer_name="bert-base-uncased", max_length=64):
        self.sources = sources
        self.targets = targets
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        self.max_length = max_length

    def __len__(self):
        return len(self.sources)

    def __getitem__(self, idx):
        # -------------------
        # Encoder tokenization
        # -------------------
        enc = self.tokenizer(
            self.sources[idx],
            add_special_tokens=True,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )

        # -------------------
        # Decoder tokenization
        # -------------------
        dec = self.tokenizer(
            self.targets[idx],
            add_special_tokens=True,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )

        # -------------------
        # Decoder input (shifted right + BOS/CLS)
        # -------------------
        decoder_input_ids = torch.cat(
            [torch.full((1,1), self.tokenizer.cls_token_id), dec['input_ids'][:, :-1]], dim=1
        )

        return {
            'encoder_input_ids': enc['input_ids'].squeeze(0),
            'encoder_attention_mask': enc['attention_mask'].squeeze(0),
            'decoder_input_ids': decoder_input_ids.squeeze(0),
            'decoder_attention_mask': dec['attention_mask'].squeeze(0),
            'decoder_target_ids': dec['input_ids'].squeeze(0)
        }

# =====================================
# 2️⃣ Collate function for batching
# =====================================
def collate_fn(batch):
    # Keys: encoder_input_ids, encoder_attention_mask, decoder_input_ids, decoder_attention_mask, decoder_target_ids
    enc_ids = torch.stack([item['encoder_input_ids'] for item in batch])
    enc_mask = torch.stack([item['encoder_attention_mask'] for item in batch])
    dec_ids = torch.stack([item['decoder_input_ids'] for item in batch])
    dec_mask = torch.stack([item['decoder_attention_mask'] for item in batch])
    dec_target = torch.stack([item['decoder_target_ids'] for item in batch])
    return {
        'encoder_input_ids': enc_ids,
        'encoder_attention_mask': enc_mask,
        'decoder_input_ids': dec_ids,
        'decoder_attention_mask': dec_mask,
        'decoder_target_ids': dec_target
    }

# =====================================
# 3️⃣ Example Usage
# =====================================
input_texts = [
    "Merhaba dünya!",
    "Transformers çok güçlü.",
    "Memory-efficient pipeline."
]

target_texts = [
    "Hello world!",
    "Transformers are powerful.",
    "Super useful pipeline."
]

dataset = LLMTokenizedDataset(input_texts, target_texts, tokenizer_name="bert-base-uncased", max_length=16)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=collate_fn)

# =====================================
# 4️⃣ Test one batch
# =====================================
for batch in dataloader:
    print("Encoder Input IDs:", batch['encoder_input_ids'].shape)
    print("Encoder Attention Mask:", batch['encoder_attention_mask'].shape)
    print("Decoder Input IDs:", batch['decoder_input_ids'].shape)
    print("Decoder Attention Mask:", batch['decoder_attention_mask'].shape)
    print("Decoder Target IDs:", batch['decoder_target_ids'].shape)
    break



Encoder Input IDs: torch.Size([2, 16])
Encoder Attention Mask: torch.Size([2, 16])
Decoder Input IDs: torch.Size([2, 16])
Decoder Attention Mask: torch.Size([2, 16])
Decoder Target IDs: torch.Size([2, 16])
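
The dataset above tokenizes one example at a time and pads every example to `max_length`. The per-batch variant described in the bullet list could look roughly like the sketch below, which tokenizes only the texts of the current batch and pads to the longest one. To stay runnable without downloading a model, `toy_encode` is a made-up stand-in for a real tokenizer (it returns one number per word), and `iter_batches` is a hypothetical helper name.

```python
def toy_encode(text):
    """Hypothetical stand-in for a tokenizer: one ID per whitespace-separated word."""
    return [len(word) for word in text.split()]

def iter_batches(texts, batch_size, encode, pad_id=0):
    """Yield padded batches, tokenizing only the texts of the current batch."""
    for start in range(0, len(texts), batch_size):
        encoded = [encode(t) for t in texts[start:start + batch_size]]
        longest = max(len(ids) for ids in encoded)
        yield {
            "input_ids": [ids + [pad_id] * (longest - len(ids)) for ids in encoded],
            "attention_mask": [[1] * len(ids) + [0] * (longest - len(ids)) for ids in encoded],
        }

texts = ["Merhaba dünya!", "Transformers çok güçlü.", "Memory-efficient pipeline."]
batches = list(iter_batches(texts, batch_size=2, encode=toy_encode))
for batch in batches:
    print(batch["input_ids"], batch["attention_mask"])
```

With a HuggingFace tokenizer the inner list comprehension would typically be replaced by a single call on the list of texts with `padding=True`, which pads to the longest sequence in that call.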


# 🚀 Readiness for Model Input

| Tensor                   | Description                            | Usage                          |
| :----------------------- | :------------------------------------- | :----------------------------- |
| `encoder_input_ids`      | Token IDs (subword level)              | Fed to the encoder             |
| `encoder_attention_mask` | 1 at non-PAD positions                 | Encoder mask                   |
| `decoder_input_ids`      | Shifted right + `[CLS]` (or `[BOS]`)   | Decoder input                  |
| `decoder_attention_mask` | 1 at non-PAD positions                 | Decoder mask                   |
| `decoder_target_ids`     | The actual target sequences            | Used in the loss computation   |
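
One detail the table leaves open is how padding inside `decoder_target_ids` is kept out of the loss. PyTorch's `CrossEntropyLoss` skips targets equal to its `ignore_index`, which defaults to -100, so a common approach is to overwrite padded label positions with that value. A minimal sketch with plain lists follows; the helper name `mask_pad_labels` is made up for this illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_pad_labels(target_ids, attention_mask):
    """Replace label positions that are padding (mask == 0) with IGNORE_INDEX."""
    return [
        tok if keep == 1 else IGNORE_INDEX
        for tok, keep in zip(target_ids, attention_mask)
    ]

# "Hi there!" as encoded earlier in this notebook, padded to length 8
target_ids = [101, 7632, 2045, 999, 102, 0, 0, 0]
attention_mask = [1, 1, 1, 1, 1, 0, 0, 0]
masked = mask_pad_labels(target_ids, attention_mask)
print(masked)  # [101, 7632, 2045, 999, 102, -100, -100, -100]
```

Using the attention mask rather than comparing against the pad ID avoids masking a real token that happens to share the pad ID.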


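## 🔧 Example: Connecting to a Transformer Model

* The example below passes a batch from this pipeline to a HuggingFace encoder-decoder model (T5). It only checks that the shapes fit: the batch was tokenized with `bert-base-uncased`, whose IDs do not correspond to T5's own vocabulary, so the resulting loss is not meaningful 👇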

In [2]:
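from transformers import T5ForConditionalGeneration

# Load the model (example: T5 small)
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Take one batch
batch = next(iter(dataloader))

# Feed it to the model.
# Caution: these IDs come from the BERT tokenizer, not from T5's tokenizer.
# The call runs because every BERT ID is below T5's embedding size, but for
# real training the batch should be tokenized with the "t5-small" tokenizer.
outputs = model(
    input_ids=batch["encoder_input_ids"],
    attention_mask=batch["encoder_attention_mask"],
    decoder_input_ids=batch["decoder_input_ids"],
    labels=batch["decoder_target_ids"]
)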

loss = outputs.loss
logits = outputs.logits

print("Loss:", loss.item())
print("Logits shape:", logits.shape)




Loss: 10.565789222717285
Logits shape: torch.Size([2, 16, 32128])


## 💡 Notes

* The batch layout matches what Seq2Seq Transformers (T5, BART, MarianMT, etc.) expect, as long as the text is tokenized with the tokenizer that belongs to the chosen model.

* If you use a decoder-only model such as GPT (no `encoder_input_ids`), the pipeline can be simplified to a single sequence (input = prompt, label = shifted target).

* The tokenizer has to match the model checkpoint, e.g. "bert-base-uncased", "t5-small", "gpt2", or "openai-community/gpt2-medium". Tokenizers without a `[CLS]` token (T5, GPT-2) need a different start token than `cls_token_id`, and GPT-2 has no padding token by default.
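
For the decoder-only case mentioned above, prompt and target are concatenated into one sequence and the loss is usually restricted to the target part. The sketch below does the shift explicitly with plain lists; the helper name `build_causal_example` and the example IDs are made up, and -100 is used because it is the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_causal_example(prompt_ids, target_ids, eos_id):
    """Return (input_ids, labels) for next-token prediction on one sequence."""
    full = prompt_ids + target_ids + [eos_id]
    input_ids = full[:-1]  # everything except the last token
    labels = full[1:]      # the same sequence shifted left by one
    # Positions whose label is still a prompt token do not contribute to the loss
    n_ignored = max(len(prompt_ids) - 1, 0)
    labels = [IGNORE_INDEX] * n_ignored + labels[n_ignored:]
    return input_ids, labels

input_ids, labels = build_causal_example([11, 12, 13], [21, 22], eos_id=2)
print(input_ids)  # [11, 12, 13, 21, 22]
print(labels)     # [-100, -100, 21, 22, 2]
```

Some model classes perform this shift internally; in that case the labels are passed unshifted, aligned with `input_ids`, and only the prompt masking is done by hand.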