# Quantization Basics

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/yourusername/transformer-edge-optimization/blob/main/notebooks/01_quantization_basics.ipynb)

In this notebook, we cover the basics of quantization and apply dynamic quantization to a BERT model.

## Contents
1. What is quantization? (FP32 → INT8 conversion)
2. Loading the BERT model
3. Applying PyTorch dynamic quantization
4. Model size comparison
5. Inference speed comparison
6. Prediction comparison
7. Summary
8. Saving the model

In [None]:
# Install the required libraries
!pip install -q torch transformers datasets

In [None]:
import torch
import torch.nn as nn
from transformers import BertForSequenceClassification, BertTokenizer
import time
import os

print(f"PyTorch version: {torch.__version__}")

## 1. What Is Quantization?

Quantization represents high-precision numbers (FP32) at lower precision (INT8). This:
- Reduces the storage of quantized weights by about 4x (32 bits down to 8 bits per value)
- Can speed up inference, especially on CPUs with fast INT8 kernels
- Lowers memory usage

### Quantization Formula

```
quantized = round(float_value / scale) + zero_point
dequantized = (quantized - zero_point) * scale
```
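
As a worked example of the formula above, the following sketch quantizes a single value by hand. The float range [-1.0, 3.0] and the value 0.5 are arbitrary choices for illustration, and the unsigned 8-bit target range [0, 255] is an assumption.

```python
# Hand-worked quantization of one value (illustrative numbers, not from a real model)
min_val, max_val = -1.0, 3.0   # assumed float range to cover
qmin, qmax = 0, 255            # unsigned 8-bit integer range

scale = (max_val - min_val) / (qmax - qmin)   # float step per integer level
zero_point = round(qmin - min_val / scale)    # integer that represents 0.0

value = 0.5
q_value = round(value / scale) + zero_point
q_value = max(qmin, min(qmax, q_value))       # clamp to the integer range
dq_value = (q_value - zero_point) * scale     # map back to float
error = abs(value - dq_value)

print(f"scale={scale:.6f}, zero_point={zero_point}")
print(f"quantized={q_value}, dequantized={dq_value:.6f}, error={error:.6f}")
```

As long as no clamping occurs, the round-trip error is at most half of `scale`, so a narrower float range gives a finer quantization grid.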

In [None]:
# Simple quantization example (asymmetric, per-tensor, unsigned 8-bit)
def quantize_tensor(tensor, num_bits=8):
    qmin = 0
    qmax = 2**num_bits - 1

    min_val, max_val = tensor.min().item(), tensor.max().item()
    # Guard against a zero scale when all values are identical
    scale = max((max_val - min_val) / (qmax - qmin), 1e-8)
    # The zero point must be an integer within [qmin, qmax]
    zero_point = int(min(max(round(qmin - min_val / scale), qmin), qmax))

    # Matches the formula above: round(x / scale) + zero_point
    q_tensor = torch.round(tensor / scale) + zero_point
    q_tensor = torch.clamp(q_tensor, qmin, qmax)

    return q_tensor.to(torch.uint8), scale, zero_point

# Test
original = torch.randn(5, 5)
quantized, scale, zero_point = quantize_tensor(original)

print("Original tensor:")
print(original)
print("\nQuantized tensor (UINT8):")
print(quantized)
print(f"\nScale: {scale:.6f}, Zero point: {zero_point}")
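
For comparison, PyTorch ships its own per-tensor quantization routine. The sketch below uses `torch.quantize_per_tensor` with a hand-picked `scale` and `zero_point` (illustrative values, not derived from the tensor above) and dequantizes the result to measure the round-trip error.

```python
import torch

# Illustrative input; demo_scale and demo_zero_point are hand-picked assumptions
x = torch.tensor([-1.0, 0.0, 0.5, 3.0])
demo_scale, demo_zero_point = 4.0 / 255, 64

# Quantize to unsigned 8-bit with PyTorch's built-in routine
qx = torch.quantize_per_tensor(
    x, scale=demo_scale, zero_point=demo_zero_point, dtype=torch.quint8
)

ints = qx.int_repr()        # underlying uint8 values
restored = qx.dequantize()  # back to float32
max_error = (x - restored).abs().max().item()

print("Integer representation:", ints.tolist())
print("Dequantized:", restored.tolist())
print(f"Max round-trip error: {max_error:.6f} (scale / 2 = {demo_scale / 2:.6f})")
```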

## 2. Loading the BERT Model

In [None]:
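# Load the BERT model
# Note: bert-base-uncased is not fine-tuned for classification, so the
# classification head is randomly initialized and its labels are arbitrary.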
model_name = 'bert-base-uncased'
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = BertTokenizer.from_pretrained(model_name)

# Put the model in evaluation mode
model.eval()

print(f"Model loaded: {model_name}")
print(f"Number of parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.2f}M")

## 3. Applying Dynamic Quantization

In [None]:
# Dynamic quantization
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},  # Quantize only the Linear layers
    dtype=torch.qint8
)

print("Quantization completed!")
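
To see what `quantize_dynamic` changes without downloading BERT, the sketch below applies it to a small stand-in network. The `tiny_model` and its layer sizes are hypothetical; only the `quantize_dynamic` call mirrors the cell above.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in model; the layer sizes are arbitrary
tiny_model = nn.Sequential(
    nn.Linear(32, 16),
    nn.ReLU(),
    nn.Linear(16, 2),
).eval()

tiny_quantized = torch.quantization.quantize_dynamic(
    tiny_model,
    {nn.Linear},        # only Linear layers are swapped
    dtype=torch.qint8,
)

# Linear layers are replaced by dynamically quantized versions; ReLU is untouched
for name, module in tiny_quantized.named_children():
    print(name, "->", type(module).__module__, type(module).__name__)

# Both models accept the same float input and return float output
x = torch.randn(4, 32)
with torch.no_grad():
    diff = (tiny_model(x) - tiny_quantized(x)).abs().max().item()
print(f"Max output difference: {diff:.6f}")
```

In dynamic quantization the weights are stored as INT8 ahead of time, while activations are quantized on the fly at inference, which is why no calibration data is needed.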

## 4. Model Size Comparison

In [None]:
def get_model_size(model, filename="temp_model.pt"):
    torch.save(model.state_dict(), filename)
    size_mb = os.path.getsize(filename) / (1024 * 1024)
    os.remove(filename)
    return size_mb

# Compute the sizes
original_size = get_model_size(model)
quantized_size = get_model_size(quantized_model)
compression_ratio = original_size / quantized_size

print(f"Original model size: {original_size:.2f} MB")
print(f"Quantized model size: {quantized_size:.2f} MB")
print(f"Compression ratio: {compression_ratio:.2f}x")
print(f"Size reduction: {(1 - quantized_size/original_size) * 100:.1f}%")
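
The measured ratio is usually below the ideal 4x because only the Linear weights are converted to INT8, while embeddings and other parameters stay in FP32. The sketch below does this arithmetic for hypothetical parameter counts (the numbers are made up for illustration and are not BERT's actual counts); it also ignores the small overhead of scales, zero points, and biases.

```python
def estimate_size_mb(quantizable_params, other_params, quantized=False):
    """Rough size estimate: 1 byte per INT8 weight, 4 bytes per FP32 value."""
    bytes_per_quantizable = 1 if quantized else 4
    total_bytes = quantizable_params * bytes_per_quantizable + other_params * 4
    return total_bytes / (1024 * 1024)

# Hypothetical counts, chosen only to illustrate the effect
linear_params = 80_000_000     # weights in Linear layers (quantized)
embedding_params = 20_000_000  # embeddings and other values (left in FP32)

fp32_mb = estimate_size_mb(linear_params, embedding_params)
int8_mb = estimate_size_mb(linear_params, embedding_params, quantized=True)

print(f"FP32 estimate: {fp32_mb:.1f} MB")
print(f"INT8 estimate: {int8_mb:.1f} MB")
print(f"Estimated ratio: {fp32_mb / int8_mb:.2f}x")
```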

## 5. Inference Speed Comparison

In [None]:
# Prepare the test input
test_text = "This movie is absolutely fantastic! I loved every minute of it."
inputs = tokenizer(test_text, return_tensors='pt', padding=True, truncation=True, max_length=128)

# Warm-up
with torch.no_grad():
    _ = model(**inputs)
    _ = quantized_model(**inputs)

# Benchmark original model (perf_counter is a monotonic, high-resolution timer)
num_iterations = 100
start = time.perf_counter()
with torch.no_grad():
    for _ in range(num_iterations):
        _ = model(**inputs)
original_time = (time.perf_counter() - start) / num_iterations * 1000  # ms

# Benchmark quantized model
start = time.perf_counter()
with torch.no_grad():
    for _ in range(num_iterations):
        _ = quantized_model(**inputs)
quantized_time = (time.perf_counter() - start) / num_iterations * 1000  # ms

speedup = original_time / quantized_time

print(f"Original model: {original_time:.2f} ms")
print(f"Quantized model: {quantized_time:.2f} ms")
print(f"Speedup: {speedup:.2f}x")
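
Averages over a single timed loop can be noisy. A slightly more robust pattern, sketched below, reports the median over several individually timed runs. The `benchmark` helper and the `dummy_workload` function are hypothetical names introduced for this example; in the notebook, the callable would wrap `model(**inputs)` or `quantized_model(**inputs)`.

```python
import statistics
import time

def benchmark(fn, warmup=3, iterations=20):
    """Return the median latency of fn() in milliseconds."""
    for _ in range(warmup):          # warm-up runs are not timed
        fn()
    timings_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings_ms)

def dummy_workload():
    # Stand-in for a model forward pass
    return sum(i * i for i in range(10_000))

latency_ms = benchmark(dummy_workload)
print(f"Median latency: {latency_ms:.3f} ms")
```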

## 6. Prediction Comparison

In [None]:
# Test sentences
test_sentences = [
    "This is an amazing product!",
    "I'm very disappointed with the service.",
    "It's okay, nothing special.",
    "Absolutely terrible experience.",
    "Best purchase I've ever made!"
]

print("Predictions comparison:\n")
for sentence in test_sentences:
    inputs = tokenizer(sentence, return_tensors='pt', padding=True, truncation=True)
    
    with torch.no_grad():
        original_output = model(**inputs).logits
        quantized_output = quantized_model(**inputs).logits
    
    orig_pred = torch.argmax(original_output, dim=1).item()
    quant_pred = torch.argmax(quantized_output, dim=1).item()
    
    label = "POSITIVE" if orig_pred == 1 else "NEGATIVE"
    match = "✓" if orig_pred == quant_pred else "✗"
    
    print(f"{match} {sentence[:50]}...")
    print(f"  Original: {label}, Quantized: {'POSITIVE' if quant_pred == 1 else 'NEGATIVE'}\n")
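
Beyond checking individual sentences, the comparison can be summarized with two numbers: the fraction of inputs where both models pick the same class, and the largest gap between their logits. The sketch below works on plain nested lists (for example, obtained by calling `.tolist()` on the logits); the `compare_logits` helper is a hypothetical name and the sample values are invented for illustration.

```python
def compare_logits(original_logits, quantized_logits):
    """Return (agreement rate, max absolute logit difference)."""
    matches = 0
    max_diff = 0.0
    for orig_row, quant_row in zip(original_logits, quantized_logits):
        # Index of the largest logit is the predicted class
        orig_pred = max(range(len(orig_row)), key=orig_row.__getitem__)
        quant_pred = max(range(len(quant_row)), key=quant_row.__getitem__)
        matches += orig_pred == quant_pred
        row_diff = max(abs(o - q) for o, q in zip(orig_row, quant_row))
        max_diff = max(max_diff, row_diff)
    return matches / len(original_logits), max_diff

# Invented logits for three inputs and two classes
sample_original = [[0.2, 0.9], [1.1, -0.3], [0.05, 0.10]]
sample_quantized = [[0.25, 0.85], [1.0, -0.2], [0.12, 0.08]]

agreement, max_logit_diff = compare_logits(sample_original, sample_quantized)
print(f"Agreement: {agreement:.0%}, max logit difference: {max_logit_diff:.3f}")
```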

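## 7. Summary

In this notebook, we:
- ✅ Learned the basics of quantization
- ✅ Applied dynamic quantization to a BERT model
- ✅ Reduced the model size (the overall ratio is below 4x because only the Linear layers are quantized and the embeddings stay in FP32)
- ✅ Measured the change in inference speed
- ✅ Compared the predictions of the original and quantized models on sample sentences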

### Next Steps
- INT8 static quantization
- Quantization-aware training
- Quantization with ONNX Runtime

## 8. Saving the Model

In [None]:
# Save the quantized model
torch.save(quantized_model.state_dict(), 'bert_quantized_int8.pt')
print("Quantized model saved: bert_quantized_int8.pt")
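
Because only the `state_dict` is saved, loading it later requires first rebuilding a model with the same quantized structure. The sketch below shows this round trip on a small stand-in network. `make_quantized_model` and the layer sizes are hypothetical, and `weights_only=False` is passed to `torch.load` on the assumption that the file is a trusted one you just wrote yourself (the argument requires a reasonably recent PyTorch).

```python
import os
import tempfile

import torch
import torch.nn as nn

def make_quantized_model():
    # Hypothetical stand-in for "load FP32 model, then quantize it"
    fp32_model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2)).eval()
    return torch.quantization.quantize_dynamic(fp32_model, {nn.Linear}, dtype=torch.qint8)

# Save the state_dict of one quantized model
saved_model = make_quantized_model()
path = os.path.join(tempfile.mkdtemp(), "tiny_quantized.pt")
torch.save(saved_model.state_dict(), path)

# Rebuild the same quantized structure (with different random weights),
# then load the saved parameters into it
restored_model = make_quantized_model()
restored_model.load_state_dict(torch.load(path, weights_only=False))

x = torch.randn(4, 16)
with torch.no_grad():
    outputs_match = torch.allclose(saved_model(x), restored_model(x))
print("Outputs match after reload:", outputs_match)
```

For BERT, the same pattern would mean loading the FP32 model, applying `quantize_dynamic` with the same arguments, and only then calling `load_state_dict`.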