
# CPU vs GPU Inference Benchmark (PyTorch)

This notebook measures **inference performance** on **CPU vs GPU** using PyTorch and a standard CNN backbone (`resnet18`).  
It is designed to run **both on your local machine** (with or without GPU) and **in Google Colab** (enable GPU via *Runtime → Change runtime type → GPU*).

**What you'll see**
- Device detection (CPU/GPU, plus a note if a TPU/XLA runtime is present).
- Timed inference runs over increasing batch sizes.
- Per-image latency and throughput comparisons.
- Simple plots showing how throughput and per-image latency scale with batch size on each device.

> Tip: If you run locally without a GPU, you can still run the CPU part and then compare with Colab-GPU.



## 1) Setup
If PyTorch/torchvision are missing, uncomment and run the next cell.


In [None]:

# !pip install --upgrade pip
# The /whl/cpu index serves CPU-only builds (torchaudio is not needed by this notebook):
# !pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
# If you have a CUDA-capable local machine and want a CUDA build, see: https://pytorch.org/get-started/locally/
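
To see which packages are actually missing before installing anything, a check along these lines can help. This is a minimal sketch: `missing_packages` is a hypothetical helper, not part of the notebook, and the package list is an assumption based on the imports used in later cells.

```python
import importlib.util

def missing_packages(names):
    """Return the subset of `names` that cannot be located for import."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Packages imported by later cells; adjust if you change the notebook.
required = ["torch", "torchvision", "numpy", "matplotlib", "pandas"]
print("Missing packages:", missing_packages(required) or "none")
```

`importlib.util.find_spec` only locates the package without importing it, so the check stays fast even when PyTorch is installed.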



## 2) Imports & Device Check


In [None]:

import time
import math
import torch
import torchvision.models as models
import numpy as np
import matplotlib.pyplot as plt

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA device name:", torch.cuda.get_device_name(0))
else:
    print("Running without CUDA (CPU only).")

# Colab TPU (rare for this demo) - not used here but we note it.
try:
    import torch_xla
    print("TPU environment detected (XLA). This notebook focuses on CPU/GPU.")
except Exception:
    pass



## 3) Build Model
We use `resnet18` without pretrained weights to avoid downloads. Random weights are fine for timing: the forward pass performs the same operations regardless of the weight values.


In [None]:

def build_model():
    model = models.resnet18(weights=None)
    model.eval()
    return model

model = build_model()
total_params = sum(p.numel() for p in model.parameters())
print(f"Model: ResNet18 | Parameters: {total_params/1e6:.2f}M")



## 4) Benchmark Helpers
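We benchmark several batch sizes. For each one we run untimed warm-up iterations, then time repeated forward passes and compute:
- **Latency per image (ms/img)**: total elapsed time divided by the number of images processed. This is an amortized figure; the time to finish one whole batch is `batch_size` times larger.
- **Throughput (images/sec)**: number of images processed divided by total elapsed time.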
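
The arithmetic behind the two metrics can be sketched as follows. `compute_metrics` is a hypothetical helper used only for illustration, and the inputs are made-up numbers rather than measurements; the extra `batch_latency_ms` field is not computed by the notebook and is included to show how the amortized per-image figure relates to the time for a full batch.

```python
def compute_metrics(batch_size, iters, elapsed_s):
    """Derive throughput and latency figures from one timed run."""
    total_images = batch_size * iters
    return {
        "throughput_img_s": total_images / elapsed_s,
        # Amortized time per image, matching the notebook's definition.
        "latency_ms_per_img": elapsed_s / total_images * 1000.0,
        # Time for one full batch (not reported by the notebook).
        "batch_latency_ms": elapsed_s / iters * 1000.0,
    }

# Made-up example: 30 iterations of batch size 32 taking 1.2 s in total.
print(compute_metrics(batch_size=32, iters=30, elapsed_s=1.2))
```

With these inputs, 960 images in 1.2 s works out to about 800 images/s, 1.25 ms per image, and 40 ms per batch.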


In [None]:

def benchmark_inference(model, device, batch_sizes=(1, 8, 32, 64, 128, 256), 
                        input_size=(3, 224, 224), warmup=10, iters=30):
    model.to(device)
    results = []
    for bs in batch_sizes:
        x = torch.randn(bs, *input_size, device=device)

        # Warmup
        with torch.no_grad():
            for _ in range(warmup):
                _ = model(x)
                if device.type == "cuda":
                    torch.cuda.synchronize()

        # Timed runs
        t0 = time.perf_counter()
        with torch.no_grad():
            for _ in range(iters):
                _ = model(x)
                if device.type == "cuda":
                    torch.cuda.synchronize()
        t1 = time.perf_counter()

        total_images = bs * iters
        elapsed = t1 - t0
        throughput = total_images / elapsed
        latency_ms_per_img = (elapsed / total_images) * 1000.0
        results.append({
            "batch_size": bs,
            "elapsed_s": elapsed,
            "throughput_img_s": throughput,
            "latency_ms_per_img": latency_ms_per_img
        })
    return results
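
The function above times the whole loop once, so a single slow iteration (for example from background load) shifts the result. An alternative, sketched here under the assumption that per-iteration overhead from the timer is negligible, is to time each call separately and report the median. `time_per_iteration` is a hypothetical helper; the stand-in workload is plain Python so the sketch runs without PyTorch.

```python
import statistics
import time

def time_per_iteration(fn, warmup=10, iters=30, sync=None):
    """Time each call to fn() separately and return summary stats in ms.

    sync: optional zero-argument callable run after every call, so that
    asynchronous work is finished before the clock is read.
    """
    def run_once():
        fn()
        if sync is not None:
            sync()

    # Untimed warm-up calls.
    for _ in range(warmup):
        run_once()

    samples_ms = []
    for _ in range(iters):
        t0 = time.perf_counter()
        run_once()
        samples_ms.append((time.perf_counter() - t0) * 1000.0)

    return {
        "median_ms": statistics.median(samples_ms),
        "min_ms": min(samples_ms),
        "max_ms": max(samples_ms),
        "stdev_ms": statistics.stdev(samples_ms) if len(samples_ms) > 1 else 0.0,
    }

# Stand-in workload; replace with a model forward pass in the notebook.
stats = time_per_iteration(lambda: sum(i * i for i in range(10_000)), warmup=3, iters=20)
print(stats)
```

In the notebook, `fn` would be something like `lambda: model(x)` called inside `torch.no_grad()`, with `sync=torch.cuda.synchronize` when the device is CUDA and `sync=None` on CPU.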



## 5) Run CPU Benchmark


In [None]:

device_cpu = torch.device("cpu")
cpu_results = benchmark_inference(model, device_cpu)
cpu_results
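
To compare a local CPU run with a GPU run made later in Colab, the result list can be written to a JSON file and loaded in the other session. This is a minimal sketch: `save_results`, `load_results`, and the file name are assumptions for illustration, and the example record is a placeholder, not a measurement.

```python
import json
import os
import tempfile

def save_results(results, path, device_label):
    """Write a list of result dicts to a JSON file, tagged with the device."""
    payload = {"device": device_label, "results": results}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2)

def load_results(path):
    """Read back a file written by save_results."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Placeholder record; in the notebook you would pass `cpu_results` instead.
example = [{"batch_size": 1, "elapsed_s": 0.5,
            "throughput_img_s": 60.0, "latency_ms_per_img": 16.667}]
path = os.path.join(tempfile.gettempdir(), "cpu_results.json")
save_results(example, path, "CPU")
print(load_results(path)["device"])
```

The dicts produced by `benchmark_inference` contain only Python ints and floats, so they serialize to JSON without conversion.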



## 6) Run GPU Benchmark (if available)
Enable GPU in Colab via **Runtime → Change runtime type → GPU**.


In [None]:

gpu_results = None
if torch.cuda.is_available():
    device_gpu = torch.device("cuda:0")
    gpu_results = benchmark_inference(model, device_gpu)
gpu_results



## 7) Aggregate Results


In [None]:

def to_arrays(results):
    bs = [r["batch_size"] for r in results]
    thr = [r["throughput_img_s"] for r in results]
    lat = [r["latency_ms_per_img"] for r in results]
    return np.array(bs), np.array(thr), np.array(lat)

cpu_bs, cpu_thr, cpu_lat = to_arrays(cpu_results)
if gpu_results:
    gpu_bs, gpu_thr, gpu_lat = to_arrays(gpu_results)
else:
    gpu_bs = gpu_thr = gpu_lat = None

print("CPU throughput (img/s):", cpu_thr)
print("CPU latency (ms/img):", cpu_lat)
if gpu_thr is not None:
    print("GPU throughput (img/s):", gpu_thr)
    print("GPU latency (ms/img):", gpu_lat)
else:
    print("GPU not available; run in Colab with GPU to compare.")



## 8) Plot: Throughput vs Batch Size


In [None]:

plt.figure(figsize=(7,5))
plt.plot(cpu_bs, cpu_thr, marker='o', label='CPU')
if gpu_thr is not None:
    plt.plot(gpu_bs, gpu_thr, marker='o', label='GPU')
plt.title("Throughput vs Batch Size (higher is better)")
plt.xlabel("Batch size")
plt.ylabel("Images per second")
plt.legend()
plt.grid(True)
plt.show()



## 9) Plot: Per-Image Latency vs Batch Size


In [None]:

plt.figure(figsize=(7,5))
plt.plot(cpu_bs, cpu_lat, marker='o', label='CPU')
if gpu_lat is not None:
    plt.plot(gpu_bs, gpu_lat, marker='o', label='GPU')
plt.title("Latency per Image vs Batch Size (lower is better)")
plt.xlabel("Batch size")
plt.ylabel("Latency (ms per image)")
plt.legend()
plt.grid(True)
plt.show()



## 10) Summary Table
Per-image latency and throughput for each device.


In [None]:

import pandas as pd

df_cpu = pd.DataFrame(cpu_results)
df_cpu["device"] = "CPU"

if gpu_results:
    df_gpu = pd.DataFrame(gpu_results)
    df_gpu["device"] = "GPU"
    df = pd.concat([df_cpu, df_gpu], ignore_index=True)
else:
    df = df_cpu

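# Optional: some hosted sandboxes provide `caas_jupyter_tools` for rich table display.
# Elsewhere the import fails and the notebook falls back to the plain DataFrame output below.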
try:
    from caas_jupyter_tools import display_dataframe_to_user
    display_dataframe_to_user("CPU vs GPU Inference Results", df)
except Exception:
    pass

df.round(3)



## 11) Notes & Interpretation
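- **GPUs pull ahead as batch size grows**: larger batches expose more parallel work, so GPU throughput typically keeps rising after CPU throughput has flattened.
- **At small batch sizes** (e.g., 1) the CPU can be competitive, because fixed per-call overhead makes up a larger share of GPU time; from medium batch sizes upward the GPU usually dominates.
- Results vary with the GPU/CPU model, PyTorch build, driver versions, and background load.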
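
One way to quantify the comparison is the GPU/CPU throughput ratio per batch size. The sketch below assumes result lists shaped like those returned by `benchmark_inference`; `gpu_speedup` is a hypothetical helper, and the demo numbers are placeholders for illustration, not measurements.

```python
def gpu_speedup(cpu_results, gpu_results):
    """Map batch size -> GPU/CPU throughput ratio for sizes present in both."""
    cpu_by_bs = {r["batch_size"]: r["throughput_img_s"] for r in cpu_results}
    speedups = {}
    for r in gpu_results:
        bs = r["batch_size"]
        if bs in cpu_by_bs and cpu_by_bs[bs] > 0:
            speedups[bs] = r["throughput_img_s"] / cpu_by_bs[bs]
    return speedups

# Placeholder numbers only; substitute `cpu_results` and `gpu_results`.
cpu_demo = [{"batch_size": 1, "throughput_img_s": 50.0},
            {"batch_size": 64, "throughput_img_s": 100.0}]
gpu_demo = [{"batch_size": 1, "throughput_img_s": 100.0},
            {"batch_size": 64, "throughput_img_s": 2000.0}]
for bs, ratio in gpu_speedup(cpu_demo, gpu_demo).items():
    print(f"batch {bs}: {ratio:.1f}x")
```

A ratio near 1 at batch size 1 that grows with batch size would match the pattern described in the notes above.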

### Customize
- Pass a different `batch_sizes` tuple to `benchmark_inference` (e.g., `(1, 4, 8, 16, 32, 64, 128, 256, 512)`).
- Increase `iters` for more stable timing (the run will take longer).
- Swap `resnet18` for `resnet50` or `vit_b_16` in `build_model` to stress the hardware more.
- Try `torch.backends.cudnn.benchmark = True` for conv-heavy models on GPU (may improve performance when input shapes are fixed).



## 12) (Optional) TensorFlow Variant
If you prefer TensorFlow, you can adapt the same logic: build a Keras model, run warmups, and time multiple batches on CPU/GPU.
