# INT8 Quantization with Hugging Face Optimum

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/yourusername/transformer-edge-optimization/blob/main/notebooks/02_huggingface_optimum.ipynb)

This notebook uses Hugging Face Optimum to export a Transformer model to ONNX, apply static INT8 quantization, and run inference with ONNX Runtime.

## Contents
1. Installing Optimum
2. Converting the model to ONNX
3. Static quantization
4. ONNX Runtime inference
5. Performance comparison

In [None]:
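# Install the required packages (the extras spec is quoted so the shell does not expand the brackets;
# scikit-learn is needed for the accuracy comparison later on)
!pip install -q "optimum[onnxruntime]" onnx onnxruntime transformers datasets scikit-learn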

In [None]:
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig
from transformers import AutoTokenizer
from datasets import load_dataset
import time
import numpy as np

## 1. Loading the Model and Tokenizer

In [None]:
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# Load the PyTorch checkpoint and export it to ONNX on the fly
model = ORTModelForSequenceClassification.from_pretrained(
    model_name,
    export=True  # convert to ONNX
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

print(f"Model loaded and converted to ONNX: {model_name}")

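## 2. Preparing the Calibration Dataset

Static quantization requires calibration data: representative inputs are run through the model to record activation ranges, and those ranges fix the quantization parameters ahead of inference.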
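
To make the role of calibration concrete, here is a minimal NumPy sketch of how an observed min/max range can be turned into an INT8 scale and zero point. It is an illustration only, not what ONNX Runtime runs internally; the helper names `minmax_qparams`, `quantize`, and `dequantize` are hypothetical, and the random "activations" stand in for values recorded during calibration.

```python
import numpy as np

def minmax_qparams(samples, qmin=-128, qmax=127):
    """Derive an affine INT8 scale and zero point from observed values."""
    # Extend the range to include 0 so that 0.0 maps exactly onto an integer
    lo = min(float(samples.min()), 0.0)
    hi = max(float(samples.max()), 0.0)
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map float values to INT8, clipping anything outside the calibrated range."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Map INT8 values back to approximate floats."""
    return (q.astype(np.float32) - zero_point) * scale

# Stand-in for activations observed while running the calibration set
rng = np.random.default_rng(0)
calibration_activations = rng.normal(0.0, 1.0, size=10_000).astype(np.float32)
scale, zero_point = minmax_qparams(calibration_activations)

# Values inside the calibrated range round-trip with an error of at most scale / 2
x = np.array([-0.5, 0.0, 0.25, 1.0], dtype=np.float32)
x_hat = dequantize(quantize(x, scale, zero_point), scale, zero_point)
max_error = float(np.abs(x - x_hat).max())
print(f"scale={scale:.5f}, zero_point={zero_point}, max error={max_error:.5f}")
```

If the calibration samples do not cover the values seen at inference time, inputs outside the recorded range are clipped, which is why the calibration set should resemble real traffic.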

In [None]:
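# Take a sample from SST-2. The training split is used here so that the calibration
# examples do not overlap with the validation examples used for evaluation later.
dataset = load_dataset("glue", "sst2", split="train[:100]")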

def preprocess_function(examples):
    return tokenizer(examples["sentence"], padding="max_length", max_length=128, truncation=True)

calibration_dataset = dataset.map(preprocess_function, batched=True)
calibration_dataset = calibration_dataset.remove_columns(["sentence", "label", "idx"])

print(f"Calibration dataset size: {len(calibration_dataset)}")

## 3. Creating the Quantization Config
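
The config below enables per-channel quantization. As a rough illustration of why that can matter, this NumPy sketch compares the round-trip error of one shared scale against one scale per output channel. The weight matrix and the helper `symmetric_int8_roundtrip` are made up for the example, and the symmetric scheme shown is a simplification rather than the exact ONNX Runtime procedure.

```python
import numpy as np

def symmetric_int8_roundtrip(w, axis=None):
    """Quantize to symmetric INT8 and dequantize again.

    axis=None uses a single scale for the whole tensor;
    axis=1 uses one scale per row (per output channel).
    """
    max_abs = np.abs(w).max(axis=axis, keepdims=True)
    scale = max_abs / 127.0
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale

rng = np.random.default_rng(0)
# Hypothetical weights: 4 output channels with very different magnitudes
channel_magnitudes = np.array([[0.01], [0.1], [1.0], [10.0]])
weights = rng.normal(size=(4, 256)) * channel_magnitudes

per_tensor_error = np.abs(weights - symmetric_int8_roundtrip(weights)).mean()
per_channel_error = np.abs(weights - symmetric_int8_roundtrip(weights, axis=1)).mean()

print(f"Mean abs error, per-tensor:  {per_tensor_error:.5f}")
print(f"Mean abs error, per-channel: {per_channel_error:.5f}")
```

With a single scale, the largest channel dictates the step size and the small channels collapse toward zero; a per-channel scale avoids that at the cost of storing one scale per channel.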

In [None]:
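# Static INT8 quantization config.
# avx512_vnni targets x86 CPUs with AVX-512 VNNI; AutoQuantizationConfig also
# provides presets such as avx2, avx512, and arm64 for other CPUs.
qconfig = AutoQuantizationConfig.avx512_vnni(
    is_static=True,
    per_channel=True
)

print("Quantization config:")
print("- Type: Static INT8")
print("- Per-channel: True")
print("- Backend: AVX512_VNNI")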

## 4. Quantizing the Model

In [None]:
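from optimum.onnxruntime.configuration import AutoCalibrationConfig

# Create the quantizer from the exported ONNX model
quantizer = ORTQuantizer.from_pretrained(model)

# Static quantization needs activation ranges, so run the calibration data
# through the model first (min/max calibration)
calibration_config = AutoCalibrationConfig.minmax(calibration_dataset)
ranges = quantizer.fit(
    dataset=calibration_dataset,
    calibration_config=calibration_config,
    operators_to_quantize=qconfig.operators_to_quantize,
)

# Quantize with the computed ranges and save the result
quantizer.quantize(
    save_dir="./distilbert_quantized",
    quantization_config=qconfig,
    calibration_tensors_range=ranges,
)

print("Quantization completed!")
print("Saved to: ./distilbert_quantized")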

## 5. Loading the Quantized Model

In [None]:
# Load the quantized model from disk
quantized_model = ORTModelForSequenceClassification.from_pretrained(
    "./distilbert_quantized"
)

print("Quantized model loaded successfully!")

## 6. Performance Comparison
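
Average latency can hide outliers, so it may help to record each call separately and look at the median and a tail percentile. The helper below is a small sketch of that idea; `measure_latency_ms` is a hypothetical name, and a matrix multiplication stands in for the model call so the snippet runs without any model loaded.

```python
import time
import numpy as np

def measure_latency_ms(fn, num_runs=50, warmup=5):
    """Time each call to fn separately and summarize the samples in milliseconds."""
    # Warm-up calls are executed but not timed
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(num_runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)

    samples = np.asarray(samples)
    return {
        "median": float(np.median(samples)),
        "p95": float(np.percentile(samples, 95)),
        "runs": len(samples),
    }

# Stand-in workload; in this notebook fn could be `lambda: model(**inputs)`
matrix = np.random.default_rng(0).normal(size=(128, 128))
stats = measure_latency_ms(lambda: matrix @ matrix, num_runs=20)
print(f"median={stats['median']:.4f} ms, p95={stats['p95']:.4f} ms over {stats['runs']} runs")
```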

In [None]:
# Test data
test_texts = [
    "This movie was absolutely wonderful!",
    "I hated every second of it.",
    "It was okay, nothing special.",
    "Best film I've seen this year!",
    "Terrible waste of time and money."
]

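def benchmark_model(model, texts, num_runs=50):
    """Return the mean and std (across texts) of the average per-call latency in ms."""
    times = []

    for text in texts:
        inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

        # Warm-up run, excluded from timing
        _ = model(**inputs)

        # Timed runs; perf_counter is a monotonic, high-resolution clock
        start = time.perf_counter()
        for _ in range(num_runs):
            _ = model(**inputs)
        elapsed = (time.perf_counter() - start) / num_runs * 1000  # ms per call
        times.append(elapsed)

    return np.mean(times), np.std(times)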

# Benchmark the original model (the unquantized FP32 ONNX export)
orig_mean, orig_std = benchmark_model(model, test_texts)
print(f"Original model: {orig_mean:.2f} ± {orig_std:.2f} ms")

# Benchmark the quantized INT8 model
quant_mean, quant_std = benchmark_model(quantized_model, test_texts)
print(f"Quantized model: {quant_mean:.2f} ± {quant_std:.2f} ms")

speedup = orig_mean / quant_mean
print(f"\nSpeedup: {speedup:.2f}x")

## 7. Accuracy Comparison

In [None]:
from sklearn.metrics import accuracy_score

# Test dataset
test_dataset = load_dataset("glue", "sst2", split="validation[:500]")

def evaluate_model(model, dataset):
    predictions = []
    labels = []
    
    for example in dataset:
        inputs = tokenizer(example["sentence"], return_tensors="pt", padding=True, truncation=True)
        outputs = model(**inputs)
        pred = outputs.logits.argmax(dim=-1).item()
        
        predictions.append(pred)
        labels.append(example["label"])
    
    return accuracy_score(labels, predictions)

orig_acc = evaluate_model(model, test_dataset)
quant_acc = evaluate_model(quantized_model, test_dataset)

print(f"Original model accuracy: {orig_acc*100:.2f}%")
print(f"Quantized model accuracy: {quant_acc*100:.2f}%")
print(f"Accuracy drop: {(orig_acc - quant_acc)*100:.2f} percentage points")

## 8. Model Size Comparison
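
As a back-of-the-envelope check on what to expect, FP32 weights take 4 bytes each and INT8 weights take 1 byte, so the upper bound on compression is about 4x. The sketch below shows this with synthetic weights saved as `.npy` files in a temporary directory; the file names are arbitrary. Real ONNX files also contain graph metadata, scales, and possibly unquantized tensors, so the measured ratio for the model is usually somewhat lower.

```python
import os
import tempfile
import numpy as np

def file_size_mb(path):
    """Return the size of a file in MB."""
    return os.path.getsize(path) / (1024 * 1024)

rng = np.random.default_rng(0)
# Synthetic stand-in for one million FP32 weights
fp32_weights = rng.normal(size=1_000_000).astype(np.float32)

# Symmetric INT8 quantization with a single scale
scale = float(np.abs(fp32_weights).max()) / 127.0
int8_weights = np.clip(np.round(fp32_weights / scale), -127, 127).astype(np.int8)

with tempfile.TemporaryDirectory() as tmp_dir:
    fp32_path = os.path.join(tmp_dir, "weights_fp32.npy")
    int8_path = os.path.join(tmp_dir, "weights_int8.npy")
    np.save(fp32_path, fp32_weights)
    np.save(int8_path, int8_weights)
    fp32_mb = file_size_mb(fp32_path)
    int8_mb = file_size_mb(int8_path)

ratio = fp32_mb / int8_mb
print(f"FP32: {fp32_mb:.2f} MB, INT8: {int8_mb:.2f} MB, ratio: {ratio:.2f}x")
```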

In [None]:
import os

def get_directory_size(path):
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(path):
        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
            total_size += os.path.getsize(filepath)
    return total_size / (1024 * 1024)  # MB

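# Save the FP32 ONNX model so that both directories can be measured
# ("./distilbert_onnx" is simply the output directory chosen here)
model.save_pretrained("./distilbert_onnx")
original_size = get_directory_size("./distilbert_onnx")
quantized_size = get_directory_size("./distilbert_quantized")

print(f"Original ONNX model size: {original_size:.2f} MB")
print(f"Quantized model size: {quantized_size:.2f} MB")
print(f"Compression ratio: {original_size/quantized_size:.2f}x")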

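## 9. Summary

With Hugging Face Optimum, this notebook covered:
- ✅ PyTorch → ONNX conversion
- ✅ Static INT8 quantization
- ✅ Inference with ONNX Runtime
- ✅ Accuracy loss measured against the FP32 ONNX model in section 7 (expected to be small)
- ✅ Speedup measured in section 6 (the gain depends on the CPU and its INT8 support)

### Advantages
- Portable ONNX format, with quantization presets for different CPU targets (the AVX512-VNNI preset is used here)
- Cross-platform deployment
- Production-ready tooling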