# 🦊 Kitsune T4 Optimization Benchmark

**Target: 2.0x+ speedup on Google Colab T4 GPU**

This notebook demonstrates hardware-specific optimizations for the Tesla T4:

| Optimization | Expected Speedup | Notes |
|-------------|------------------|-------|
| INT8 Quantization | +40-60% | T4's 130 TOPS INT8 Tensor Cores; not benchmarked in this notebook |
| FP16 Mixed Precision | +20-30% | T4's 65 TFLOPS FP16 |
| JIT Trace + Freeze | +15-20% | Kernel fusion |
| torch.compile | +10-20% | PyTorch 2.x TorchInductor backend (Triton kernels) |
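
The percentages in the table are per-technique estimates. As a rough sanity check on the 2.0x target, the sketch below multiplies them together. It assumes the gains are independent, which is optimistic: JIT tracing and `torch.compile` both target kernel fusion, and this notebook never stacks them. The names `expected_gains` and `combined_speedup` are illustrative and not part of the notebook.

```python
from math import prod

# Expected per-optimization gains from the table, as (low, high) fractions.
# Assumption: gains are independent and multiply, which is optimistic in practice.
expected_gains = {
    "INT8 Quantization": (0.40, 0.60),
    "FP16 Mixed Precision": (0.20, 0.30),
    "JIT Trace + Freeze": (0.15, 0.20),
    "torch.compile": (0.10, 0.20),
}

def combined_speedup(names, bound):
    """Multiply (1 + gain) over the chosen optimizations; bound is 0 (low) or 1 (high)."""
    return prod(1.0 + expected_gains[name][bound] for name in names)

all_low = combined_speedup(expected_gains, 0)
all_high = combined_speedup(expected_gains, 1)
print(f"All four combined: {all_low:.2f}x to {all_high:.2f}x")

# The pairing benchmarked as Optimization 5 (FP16 + torch.compile only).
pair = ["FP16 Mixed Precision", "torch.compile"]
pair_low = combined_speedup(pair, 0)
pair_high = combined_speedup(pair, 1)
print(f"FP16 + torch.compile: {pair_low:.2f}x to {pair_high:.2f}x")
```

Treat the output as a back-of-the-envelope bound only; the measured numbers from the cells below are what count.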

**Before running:**
1. Go to `Runtime` → `Change runtime type` → Select `T4 GPU`
2. Run all cells in order

In [None]:
# 📦 Setup
import torch
import torch.nn as nn
import torchvision.models as models
import time
import gc

print("🦊 Kitsune T4 Optimization Benchmark")
print("=" * 50)
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    props = torch.cuda.get_device_properties(0)
    print(f"Memory: {props.total_memory / 1024**3:.1f} GB")
    print(f"Compute Capability: SM{props.major}{props.minor}")
    
    if 'T4' in torch.cuda.get_device_name(0):
        print("\n✅ T4 detected - All optimizations available!")
    else:
        print("\n⚠️ Not a T4 - Some optimizations may differ")
else:
    print("\n❌ No GPU - Please enable T4 in Runtime settings")

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
def benchmark(model, x, name="Model", iterations=100, warmup=20):
    """Benchmark a model with proper GPU timing."""
    model.eval()
    
    # Warmup
    with torch.no_grad():
        for _ in range(warmup):
            _ = model(x)
    
    if x.is_cuda:
        torch.cuda.synchronize()
    
    # Benchmark with CUDA events for accurate timing
    times = []
    with torch.no_grad():
        for _ in range(iterations):
            if x.is_cuda:
                start = torch.cuda.Event(enable_timing=True)
                end = torch.cuda.Event(enable_timing=True)
                start.record()
                _ = model(x)
                end.record()
                torch.cuda.synchronize()
                times.append(start.elapsed_time(end))
            else:
                start = time.perf_counter()
                _ = model(x)
                times.append((time.perf_counter() - start) * 1000)
    
    median = sorted(times)[len(times) // 2]
    return median
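
`benchmark` reports only the median, which is robust to occasional slow iterations. If the spread matters too, a small helper along these lines could summarize the raw timings. This is a sketch: `summarize_times` is a hypothetical name, and the sample latencies are made up for illustration.

```python
import statistics

def summarize_times(times_ms):
    """Return median, mean, and an approximate p95 for a list of latencies in ms."""
    ordered = sorted(times_ms)
    # Nearest-rank style index, clamped to the last element for short lists.
    p95_index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return {
        "median_ms": statistics.median(ordered),
        "mean_ms": statistics.fmean(ordered),
        "p95_ms": ordered[p95_index],
    }

# Made-up sample with one slow outlier (25 ms).
sample = [10.0, 10.2, 9.9, 10.1, 25.0]
stats = summarize_times(sample)
print(stats)
```

In this sample the outlier pulls the mean well above the median, which is why the notebook compares medians.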

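## 🎯 Baseline Measurement

First, establish a baseline with an unmodified ResNet-50 in FP32. The weights are randomly initialized (`weights=None`), which is fine for timing but not for accuracy checks.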

In [None]:
# Configuration
batch_size = 32
x = torch.randn(batch_size, 3, 224, 224).to(device)

# Load model
print("Loading ResNet-50...")
model = models.resnet50(weights=None).to(device)
model.eval()

# Calculate model size
param_size = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"Model size: {param_size / 1024**2:.1f} MB")
print(f"Batch size: {batch_size}")
print(f"Input shape: {x.shape}")

# Baseline benchmark
baseline_ms = benchmark(model, x, "Baseline FP32")
print(f"\n📊 Baseline: {baseline_ms:.2f} ms/batch")
print(f"   Throughput: {batch_size / (baseline_ms / 1000):.0f} images/sec")

results = {'Baseline FP32': baseline_ms}

## ⚡ Optimization 1: JIT Trace + Freeze

TorchScript tracing captures the computation graph and allows kernel fusion.

In [None]:
print("🔧 Applying JIT trace + freeze + optimize_for_inference...")

model_jit = models.resnet50(weights=None).to(device)
model_jit.eval()

with torch.no_grad():
    traced = torch.jit.trace(model_jit, x)
    # Freeze first; optimize_for_inference expects (or creates) a frozen module
    traced = torch.jit.freeze(traced)
    traced = torch.jit.optimize_for_inference(traced)

jit_ms = benchmark(traced, x, "JIT")
speedup = baseline_ms / jit_ms

print(f"\n📊 JIT Trace + Freeze: {jit_ms:.2f} ms ({speedup:.2f}x speedup)")
results['JIT Trace'] = jit_ms
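
The benchmark measures speed only. Before relying on a traced module, it is worth confirming that it reproduces the eager outputs. The sketch below does that on a small stand-in network so it runs quickly on any device. The layers in `tiny` and the input shape are assumptions for illustration, not the notebook's ResNet-50.

```python
import torch
import torch.nn as nn

# Hypothetical small stand-in for ResNet-50 so the check runs quickly anywhere.
tiny = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 4),
).eval()

example = torch.randn(2, 3, 32, 32)

with torch.no_grad():
    eager_out = tiny(example)
    # Trace, then freeze, mirroring the first two steps of the cell above.
    traced_tiny = torch.jit.freeze(torch.jit.trace(tiny, example))
    traced_out = traced_tiny(example)

# Freezing may fold BatchNorm into the convolution, so allow small float differences.
max_abs_diff = (eager_out - traced_out).abs().max().item()
print(f"Max abs difference, eager vs traced: {max_abs_diff:.2e}")
```

The same comparison applies to `model_jit` and `traced` from the cell above, using the benchmark input `x`.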

## 🚀 Optimization 2: FP16 Mixed Precision (AMP)

T4's Tensor Cores provide 8x more FP16 compute than FP32 (65 TFLOPS vs 8.1 TFLOPS).

In [None]:
print("🔧 Applying FP16 Automatic Mixed Precision...")

model_amp = models.resnet50(weights=None).to(device)
model_amp.eval()

# Benchmark with AMP (torch.autocast replaces the deprecated torch.cuda.amp.autocast)
times = []
with torch.no_grad():
    # Warmup
    for _ in range(20):
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            _ = model_amp(x)
    torch.cuda.synchronize()
    
    # Benchmark
    for _ in range(100):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            _ = model_amp(x)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))

amp_ms = sorted(times)[len(times) // 2]
speedup = baseline_ms / amp_ms

print(f"\n📊 FP16 AMP: {amp_ms:.2f} ms ({speedup:.2f}x speedup)")
results['FP16 AMP'] = amp_ms
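
The FP16 speedup comes with reduced range and precision, which is why autocast keeps some operations in FP32. A minimal sketch using `torch.finfo` makes the limits concrete. It runs on CPU as well, and the sample values are arbitrary.

```python
import torch

fp16 = torch.finfo(torch.float16)
fp32 = torch.finfo(torch.float32)
print(f"FP16: max={fp16.max}, eps={fp16.eps}")
print(f"FP32: max={fp32.max:.3e}, eps={fp32.eps:.3e}")

# Round-trip arbitrary sample values through FP16 to expose rounding and overflow.
samples = torch.tensor([0.1, 1000.1, 70000.0])
round_trip = samples.half().float()
for original, converted in zip(samples.tolist(), round_trip.tolist()):
    print(f"{original:>12.4f} -> {converted}")
```

Values above the FP16 maximum overflow to infinity, so comparing FP16 outputs against the FP32 model on real inputs is a sensible step before deployment.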

## 🔥 Optimization 3: JIT + FP16 Combined

Combines JIT tracing with a full half-precision cast (`model.half()`) rather than autocast, so weights and activations are FP16 throughout.

In [None]:
print("🔧 Applying JIT + FP16...")

model_combined = models.resnet50(weights=None).to(device)
model_combined.eval()

# Convert model to half precision and trace
model_half = model_combined.half()
x_half = x.half()

with torch.no_grad():
    traced_half = torch.jit.trace(model_half, x_half)
    # Freeze first; optimize_for_inference expects (or creates) a frozen module
    traced_half = torch.jit.freeze(traced_half)
    traced_half = torch.jit.optimize_for_inference(traced_half)

# Benchmark
times = []
with torch.no_grad():
    # Warmup
    for _ in range(20):
        _ = traced_half(x_half)
    torch.cuda.synchronize()
    
    # Benchmark
    for _ in range(100):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        _ = traced_half(x_half)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))

jit_amp_ms = sorted(times)[len(times) // 2]
speedup = baseline_ms / jit_amp_ms

print(f"\n📊 JIT + FP16: {jit_amp_ms:.2f} ms ({speedup:.2f}x speedup)")
results['JIT + FP16'] = jit_amp_ms

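## ⚡ Optimization 4: torch.compile (PyTorch 2.x)

`torch.compile` uses the TorchInductor backend by default, which generates Triton kernels on GPU for kernel fusion. The `reduce-overhead` mode additionally uses CUDA graphs to cut per-launch overhead.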

In [None]:
if hasattr(torch, 'compile'):
    print("🔧 Applying torch.compile with reduce-overhead mode...")
    
    model_compile = models.resnet50(weights=None).to(device)
    model_compile.eval()
    
    compiled = torch.compile(model_compile, mode="reduce-overhead")
    
    # Warmup (compilation happens lazily)
    print("   Compiling... (first run is slow)")
    with torch.no_grad():
        for _ in range(3):
            _ = compiled(x)
    torch.cuda.synchronize()
    print("   Compilation complete!")
    
    compile_ms = benchmark(compiled, x, "torch.compile")
    speedup = baseline_ms / compile_ms
    
    print(f"\n📊 torch.compile: {compile_ms:.2f} ms ({speedup:.2f}x speedup)")
    results['torch.compile'] = compile_ms
else:
    print("⚠️ torch.compile not available (requires PyTorch 2.x)")

## 🏆 Optimization 5: torch.compile + FP16

Combines `torch.compile` with a full FP16 cast (`model.half()`), the configuration this notebook expects to be fastest on the T4.

In [None]:
if hasattr(torch, 'compile'):
    print("🔧 Applying torch.compile + FP16...")
    
    model_best = models.resnet50(weights=None).to(device).half()
    model_best.eval()
    x_half = x.half()  # redefined here so this cell does not depend on Optimization 3
    
    compiled_half = torch.compile(model_best, mode="reduce-overhead")
    
    # Warmup
    print("   Compiling...")
    with torch.no_grad():
        for _ in range(3):
            _ = compiled_half(x_half)
    torch.cuda.synchronize()
    print("   Done!")
    
    # Benchmark
    times = []
    with torch.no_grad():
        for _ in range(100):
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            start.record()
            _ = compiled_half(x_half)
            end.record()
            torch.cuda.synchronize()
            times.append(start.elapsed_time(end))
    
    best_ms = sorted(times)[len(times) // 2]
    speedup = baseline_ms / best_ms
    
    print(f"\n📊 torch.compile + FP16: {best_ms:.2f} ms ({speedup:.2f}x speedup)")
    results['compile + FP16'] = best_ms

## 📊 Results Summary

In [None]:
import pandas as pd

# Create summary table
print("\n" + "=" * 60)
print("🦊 KITSUNE T4 OPTIMIZATION RESULTS")
print("=" * 60)

baseline = results['Baseline FP32']
summary = []

for name, time_ms in sorted(results.items(), key=lambda x: x[1]):
    speedup = baseline / time_ms
    throughput = batch_size / (time_ms / 1000)
    summary.append({
        'Optimization': name,
        'Time (ms)': f"{time_ms:.2f}",
        'Speedup': f"{speedup:.2f}x",
        'Images/sec': f"{throughput:.0f}"
    })

df = pd.DataFrame(summary)
print(df.to_string(index=False))

# Best result
best_name = min(results, key=results.get)
best_time = results[best_name]
best_speedup = baseline / best_time

print("\n" + "=" * 60)
print(f"🏆 BEST: {best_name}")
print(f"   Speedup: {best_speedup:.2f}x")
print(f"   Time: {baseline:.2f} ms → {best_time:.2f} ms")
print(f"   Throughput: {batch_size / (best_time / 1000):.0f} images/sec")
print("=" * 60)

In [None]:
import matplotlib.pyplot as plt

# Visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Time comparison
names = list(results.keys())
times = list(results.values())
colors = ['#ff6b6b' if t == max(times) else '#4ecdc4' if t == min(times) else '#95a5a6' for t in times]

ax1.barh(names, times, color=colors)
ax1.set_xlabel('Time (ms)')
ax1.set_title('🦊 Inference Time Comparison')
for i, (name, t) in enumerate(zip(names, times)):
    ax1.text(t + 1, i, f'{t:.1f} ms', va='center')

# Speedup comparison
speedups = [baseline / t for t in times]
colors = ['#4ecdc4' if s == max(speedups) else '#ff6b6b' if s == min(speedups) else '#95a5a6' for s in speedups]

ax2.barh(names, speedups, color=colors)
ax2.set_xlabel('Speedup (x)')
ax2.set_title('🚀 Speedup vs Baseline')
ax2.axvline(x=1.0, color='red', linestyle='--', label='Baseline')
ax2.axvline(x=2.0, color='green', linestyle='--', label='2x Target')
ax2.legend()  # show the labels attached to the reference lines
for i, s in enumerate(speedups):
    ax2.text(s + 0.05, i, f'{s:.2f}x', va='center')

plt.tight_layout()
plt.show()

## 🎯 Conclusions

### T4 Optimization Best Practices:

1. **Prefer FP16/half precision for inference** - the T4 has roughly 8x the FP16 throughput of FP32 (65 vs 8.1 TFLOPS)
2. **Use `torch.compile` (PyTorch 2.x)** - the TorchInductor backend generates fused Triton kernels
3. **Combine optimizations** - JIT/compile + FP16 gives best results
4. **Batch size matters** - Larger batches better utilize Tensor Cores
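
The notebook fixes the batch size at 32, so the batch-size point is not measured here. One way to check it is a simple sweep like the sketch below. `throughput_by_batch_size` is a hypothetical helper, the tiny network stands in for ResNet-50, and wall-clock timing is used for brevity in place of the CUDA events used in `benchmark`.

```python
import time
import torch
import torch.nn as nn

def throughput_by_batch_size(model, batch_sizes, input_shape=(3, 32, 32), iterations=5):
    """Return {batch_size: images/sec} using simple wall-clock timing."""
    device = next(model.parameters()).device
    model.eval()
    throughput = {}
    with torch.no_grad():
        for bs in batch_sizes:
            batch = torch.randn(bs, *input_shape, device=device)
            model(batch)  # one warmup pass
            if device.type == "cuda":
                torch.cuda.synchronize()
            start = time.perf_counter()
            for _ in range(iterations):
                model(batch)
            if device.type == "cuda":
                torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
            elapsed = time.perf_counter() - start
            throughput[bs] = bs * iterations / elapsed
    return throughput

# Hypothetical tiny network; swap in the real model and input shape on a T4.
tiny_net = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
sweep = throughput_by_batch_size(tiny_net, [1, 4, 16])
for bs, images_per_sec in sweep.items():
    print(f"batch {bs:>3}: {images_per_sec:,.0f} images/sec")
```

On a T4, the sweep would be run with the FP16 model and `input_shape=(3, 224, 224)`, increasing the batch size until throughput plateaus or memory runs out.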

### For production:
```python
# Recommended T4 optimization
model = model.eval().half()  # FP16 weights; inputs must also be cast with .half()
model = torch.compile(model, mode="reduce-overhead")
```

### Expected Results:
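- **2.0-2.5x speedup** with compile + FP16 (the target; measured results vary with PyTorch version and Colab instance)
- **Proportionally higher throughput** (images/sec) at the same batch size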