
# CPU vs GPU Inference Benchmark (TensorFlow 2 / Keras) — ResNet50 & MobileNetV2

This notebook measures **inference performance** on **CPU vs GPU** using TensorFlow 2.  
You can choose between two popular CNN backbones:
- **ResNet50** (heavier, ~25M params — stresses GPU more)
- **MobileNetV2** (lightweight, ~3.5M params — runs fast on CPU too)
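
To put those parameter counts in perspective, the sketch below estimates the memory taken by the weights alone. It assumes 4 bytes per parameter (float32) and the rounded counts quoted above. The helper name `weights_megabytes` is made up for this illustration, and activations, which grow with batch size, are not included.

```python
def weights_megabytes(num_params, bytes_per_param=4):
    """Approximate size of the weights in MB (1 MB = 1e6 bytes)."""
    return num_params * bytes_per_param / 1e6

# Rounded parameter counts from the list above (assumed, not measured).
approx_params = {"ResNet50": 25_000_000, "MobileNetV2": 3_500_000}

for name, n in approx_params.items():
    print(f"{name}: ~{weights_megabytes(n):.0f} MB of float32 weights")
```

Under these assumptions ResNet50 carries roughly 100 MB of weights and MobileNetV2 roughly 14 MB.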

Run it locally (with or without a GPU) or in **Google Colab** (enable a GPU via *Runtime → Change runtime type → GPU*).

**What you'll see**
- Device detection (CPU/GPU).
- Timed inference runs over increasing batch sizes.
- Per-image latency and throughput comparisons.
- Plots showing scaling.



## 1) Setup (optional)
Colab usually has TensorFlow preinstalled. For a local run, install a TensorFlow build that matches your Python version (and, for GPU use, your CUDA/cuDNN versions).


In [None]:

# Uncomment if you need to install locally (CPU build example):
# !pip install --upgrade pip
# !pip install tensorflow==2.15.*
# For NVIDIA GPU locally, install CUDA/cuDNN compatible with your TF version per https://www.tensorflow.org/install


## 2) Imports & Device Check

In [None]:

import time
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

print("TensorFlow version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
gpus = tf.config.list_physical_devices('GPU')
print("GPU devices:", gpus)
for d in gpus:
    try:
        # Allocate GPU memory on demand instead of reserving it all up front.
        tf.config.experimental.set_memory_growth(d, True)
    except (RuntimeError, ValueError) as e:
        # RuntimeError is raised if the GPU was already initialized before this cell ran.
        print("Could not set memory growth:", e)



## 3) Choose Model
Set `MODEL_NAME` to `"ResNet50"` or `"MobileNetV2"`.


In [None]:

from tensorflow.keras import Input, Model
from tensorflow.keras.applications import ResNet50, MobileNetV2

MODEL_NAME = "MobileNetV2"  # options: "ResNet50", "MobileNetV2"
INPUT_SHAPE = (224, 224, 3)  # keep same for fair comparison

def build_model(name="ResNet50", input_shape=(224,224,3), num_classes=1000):
    inp = Input(shape=input_shape)
    if name == "ResNet50":
        base = ResNet50(include_top=True, weights=None, input_tensor=inp, classes=num_classes)
    elif name == "MobileNetV2":
        base = MobileNetV2(include_top=True, weights=None, input_tensor=inp, classes=num_classes)
    else:
        raise ValueError("Unknown model name: " + str(name))
    model = Model(inputs=base.input, outputs=base.output)
    return model

model = build_model(MODEL_NAME, INPUT_SHAPE, 1000)
model.trainable = False
model.compile()  # optional: calling model(x) for inference does not require compile()

total_params = model.count_params()
print(f"Model: {MODEL_NAME} | Parameters: {total_params/1e6:.2f}M")



## 4) Benchmark Helpers
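Each batch size gets a warm-up phase followed by repeated timed runs.  
Metrics:
- **Latency per image (ms/img)**: total elapsed time divided by the number of images processed. This is amortized over the batch, not the wall-clock time of a single request.
- **Throughput (images/sec)**: number of images processed divided by total elapsed time.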
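
As a worked example of these two metrics, here is a minimal sketch that derives them from one timed run. The function name `summarize_run` and the timing value are invented for illustration; the extra per-batch figure is included only to show how it differs from the amortized per-image number.

```python
def summarize_run(batch_size, iters, elapsed_s):
    """Turn one timed run into the reported metrics (plus per-batch latency)."""
    total_images = batch_size * iters
    return {
        "throughput_img_s": total_images / elapsed_s,
        "latency_ms_per_img": elapsed_s / total_images * 1000.0,
        # Time a caller actually waits for one whole batch to come back.
        "latency_ms_per_batch": elapsed_s / iters * 1000.0,
    }

# Made-up timing, for illustration only: 30 batches of 32 images in 2.4 s.
metrics = summarize_run(batch_size=32, iters=30, elapsed_s=2.4)
print(metrics)
```

With these numbers the run works out to about 400 img/s and 2.5 ms per image, even though each batch takes about 80 ms to return. Large batches therefore lower the per-image figure without making an individual request any faster.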


In [None]:

@tf.function(jit_compile=False)
def forward_pass(model, x):
    return model(x, training=False)

def benchmark_inference(model, device_str, batch_sizes=(1, 8, 16, 32, 64, 128), 
                        input_size=(224, 224, 3), warmup=10, iters=30):
    results = []
    for bs in batch_sizes:
        with tf.device(device_str):
            x = tf.random.normal([bs, *input_size], dtype=tf.float32)

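            # Warm-up: absorbs tf.function tracing for this batch shape and
            # other one-time setup so neither is included in the timing.
            for _ in range(warmup):
                _ = forward_pass(model, x)
            tf.experimental.sync_devices()  # drain warm-up work before starting the clock

            # Timed runs
            t0 = time.perf_counter()
            for _ in range(iters):
                _ = forward_pass(model, x)
            tf.experimental.sync_devices()  # GPU ops run asynchronously; wait for them to finish
            t1 = time.perf_counter()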

        total_images = bs * iters
        elapsed = t1 - t0
        throughput = total_images / elapsed
        latency_ms_per_img = (elapsed / total_images) * 1000.0
        results.append({
            "model": MODEL_NAME,
            "batch_size": bs,
            "elapsed_s": elapsed,
            "throughput_img_s": throughput,
            "latency_ms_per_img": latency_ms_per_img,
            "device": "GPU" if "GPU" in device_str else "CPU"
        })
    return results


## 5) Run CPU Benchmark

In [None]:

cpu_device = "/CPU:0"
cpu_results = benchmark_inference(model, cpu_device, input_size=INPUT_SHAPE)
cpu_results


## 6) Run GPU Benchmark (if available)

In [None]:

gpu_results = None
if tf.config.list_physical_devices('GPU'):
    gpu_device = "/GPU:0"
    gpu_results = benchmark_inference(model, gpu_device, input_size=INPUT_SHAPE)
gpu_results


## 7) Aggregate Results

In [None]:

import pandas as pd

def to_arrays(results):
    bs = [r["batch_size"] for r in results]
    thr = [r["throughput_img_s"] for r in results]
    lat = [r["latency_ms_per_img"] for r in results]
    return np.array(bs), np.array(thr), np.array(lat)

cpu_bs, cpu_thr, cpu_lat = to_arrays(cpu_results)
gpu_bs = gpu_thr = gpu_lat = None
if gpu_results:
    gpu_bs, gpu_thr, gpu_lat = to_arrays(gpu_results)

print("CPU throughput (img/s):", cpu_thr)
print("CPU latency (ms/img):", cpu_lat)
if gpu_thr is not None:
    print("GPU throughput (img/s):", gpu_thr)
    print("GPU latency (ms/img):", gpu_lat)
else:
    print("No GPU found. Run in Colab with GPU for comparison.")

df_cpu = pd.DataFrame(cpu_results)
df = df_cpu.copy()
if gpu_results:
    df_gpu = pd.DataFrame(gpu_results)
    df = pd.concat([df_cpu, df_gpu], ignore_index=True)

df.round(3)
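
To compare the two devices directly, one option is a per-batch-size speedup ratio. The sketch below assumes result dictionaries shaped like those returned by `benchmark_inference`. The helper `speedup_by_batch` is hypothetical, and the numbers are placeholders, not measurements.

```python
def speedup_by_batch(cpu_results, gpu_results):
    """Map batch_size -> GPU throughput / CPU throughput for shared batch sizes."""
    cpu = {r["batch_size"]: r["throughput_img_s"] for r in cpu_results}
    gpu = {r["batch_size"]: r["throughput_img_s"] for r in gpu_results}
    return {bs: gpu[bs] / cpu[bs] for bs in sorted(cpu.keys() & gpu.keys())}

# Placeholder numbers, not measurements.
example_cpu = [{"batch_size": 1, "throughput_img_s": 50.0},
               {"batch_size": 32, "throughput_img_s": 100.0}]
example_gpu = [{"batch_size": 1, "throughput_img_s": 100.0},
               {"batch_size": 32, "throughput_img_s": 800.0}]
print(speedup_by_batch(example_cpu, example_gpu))  # {1: 2.0, 32: 8.0}
```

In the notebook this could be called as `speedup_by_batch(cpu_results, gpu_results)` whenever `gpu_results` is not `None`.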


## 8) Plot: Throughput vs Batch Size

In [None]:

plt.figure(figsize=(7,5))
plt.plot(cpu_bs, cpu_thr, marker='o', label='CPU')
if gpu_bs is not None:
    plt.plot(gpu_bs, gpu_thr, marker='o', label='GPU')
plt.title(f"{MODEL_NAME}: Throughput vs Batch Size (higher is better)")
plt.xlabel("Batch size")
plt.ylabel("Images per second")
plt.legend()
plt.grid(True)
plt.show()


## 9) Plot: Per-Image Latency vs Batch Size

In [None]:

plt.figure(figsize=(7,5))
plt.plot(cpu_bs, cpu_lat, marker='o', label='CPU')
if gpu_bs is not None:
    plt.plot(gpu_bs, gpu_lat, marker='o', label='GPU')
plt.title(f"{MODEL_NAME}: Latency per Image vs Batch Size (lower is better)")
plt.xlabel("Batch size")
plt.ylabel("Latency (ms per image)")
plt.legend()
plt.grid(True)
plt.show()



## 10) Notes & Tips
- **Choose model** by setting `MODEL_NAME` to `"MobileNetV2"` (fast/local) or `"ResNet50"` (heavier, stresses GPU).
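- **Graph mode** (`@tf.function`) wraps the forward pass to reduce Python overhead. The first call for each new batch shape traces a graph, which the warm-up runs absorb.
- The models are built with `weights=None` (random initialization), so predictions are meaningless; this does not change the amount of computation being timed.
- Results vary with hardware, TF build, drivers, and background load, so compare numbers only across runs on the same machine and setup.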
- To stress hardware more, increase `iters` or batch sizes, or switch to a larger network.
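
Because background load can distort a single aggregate timing, it may help to time each iteration separately and report a median. The following is a minimal sketch using only the standard library. `time_iterations` and `summarize` are hypothetical helpers, not part of the benchmark above, and the workload is a stand-in so the snippet runs without TensorFlow.

```python
import statistics
import time

def time_iterations(fn, warmup=3, iters=20):
    """Call fn() repeatedly and return each timed call's duration in milliseconds."""
    for _ in range(warmup):
        fn()
    durations_ms = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()  # must block until the work is done (see the note below)
        durations_ms.append((time.perf_counter() - t0) * 1000.0)
    return durations_ms

def summarize(durations_ms):
    """Median is robust to occasional slow iterations; min/max show the spread."""
    return {
        "median_ms": statistics.median(durations_ms),
        "min_ms": min(durations_ms),
        "max_ms": max(durations_ms),
    }

# Stand-in workload so the sketch runs without TensorFlow.
stats = summarize(time_iterations(lambda: sum(range(10_000))))
print(stats)
```

In the notebook, `fn` could be something like `lambda: forward_pass(model, x).numpy()`. Converting the output to NumPy forces the call to wait for the device, so each measurement covers the full forward pass.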
