
# CPU vs GPU Inference Benchmark (TensorFlow 2 / Keras) — ResNet50 & MobileNetV2

This notebook measures **inference performance** on **CPU vs GPU** using TensorFlow 2.  
You can choose between two popular CNN backbones:
- **ResNet50** (heavier, ~25M params — stresses GPU more)
- **MobileNetV2** (lightweight, ~3.5M params — runs fast on CPU too)

Run it locally (with or without a GPU) or in **Google Colab** (enable the GPU via *Runtime → Change runtime type → GPU*).

**What you'll see**
- Device detection (CPU/GPU).
- Timed inference runs over increasing batch sizes.
- Per-image latency and throughput comparisons.
- Plots showing scaling.



## 1) Setup (optional)
TensorFlow is usually preinstalled on Colab. Locally, install a TF build that matches your Python version and, for GPU use, your CUDA/cuDNN versions.


In [None]:

# Uncomment if you need to install locally (CPU build example):
# !pip install --upgrade pip
# !pip install tensorflow==2.15.*
# For NVIDIA GPU locally, install CUDA/cuDNN compatible with your TF version per https://www.tensorflow.org/install



## 2) Imports & Device Check

In [2]:

import time
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

print("TensorFlow version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
gpus = tf.config.list_physical_devices('GPU')
print("GPU devices:", gpus)
try:
    # Allocate GPU memory on demand instead of reserving it all up front.
    for d in gpus:
        tf.config.experimental.set_memory_growth(d, True)
except Exception as e:
    print("Could not set memory growth:", e)


TensorFlow version: 2.16.2
Built with CUDA: False
GPU devices: []



## 3) Choose Model
Set `MODEL_NAME` to `"ResNet50"` or `"MobileNetV2"`.


In [3]:

from tensorflow.keras import Input, Model
from tensorflow.keras.applications import ResNet50, MobileNetV2

MODEL_NAME = "MobileNetV2"  # options: "ResNet50", "MobileNetV2"
INPUT_SHAPE = (224, 224, 3)  # keep same for fair comparison

def build_model(name="ResNet50", input_shape=(224,224,3), num_classes=1000):
    inp = Input(shape=input_shape)
    if name == "ResNet50":
        base = ResNet50(include_top=True, weights=None, input_tensor=inp, classes=num_classes)
    elif name == "MobileNetV2":
        base = MobileNetV2(include_top=True, weights=None, input_tensor=inp, classes=num_classes)
    else:
        raise ValueError("Unknown model name: " + str(name))
    model = Model(inputs=base.input, outputs=base.output)
    return model

model = build_model(MODEL_NAME, INPUT_SHAPE, 1000)
model.trainable = False
model.compile()  # optional: calling model(x) for inference does not require compile()

total_params = model.count_params()
print(f"Model: {MODEL_NAME} | Parameters: {total_params/1e6:.2f}M")


Model: MobileNetV2 | Parameters: 3.54M



## 4) Benchmark Helpers
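We benchmark several batch sizes, each with untimed warm-up passes followed by repeated timed runs.  
Metrics:
- **Latency per image (ms/img)**: total elapsed time divided by the number of images processed. For batch sizes above 1 this is an amortized figure, not the time to get a single result back.
- **Throughput (images/sec)**: number of images processed divided by total elapsed time.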
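
Both metrics come from the same three numbers: batch size, number of timed iterations, and total elapsed time. As a minimal worked example, the snippet below recomputes them for the batch-size-8 CPU run recorded in section 5 of this notebook (30 timed iterations, about 1.261 s elapsed). The variable names mirror those in `benchmark_inference`, but the snippet is a standalone illustration, not part of the benchmark itself.

```python
# Worked example of the two metrics, using the CPU run at batch size 8
# recorded in this notebook's results (section 5).
batch_size = 8
iters = 30                      # default number of timed iterations
elapsed_s = 1.26134510000702    # measured wall-clock time for all iterations

total_images = batch_size * iters
throughput_img_s = total_images / elapsed_s
latency_ms_per_img = (elapsed_s / total_images) * 1000.0

print(f"{total_images} images in {elapsed_s:.3f} s")
print(f"throughput: {throughput_img_s:.2f} images/sec")   # 190.27
print(f"latency:    {latency_ms_per_img:.3f} ms/img")     # 5.256
```

Because both are derived from the same elapsed time, the two metrics are reciprocals of each other: `latency_ms_per_img == 1000 / throughput_img_s`.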


In [4]:

@tf.function(jit_compile=False)
def forward_pass(model, x):
    """Single forward pass through the model (wrapped in tf.function)."""
    return model(x, training=False)


def benchmark_inference(model, device_str, 
                        batch_sizes=(1, 8, 16, 32, 64, 128),
                        input_size=(224, 224, 3), 
                        warmup=10, iters=30):
    """
    Benchmark inference latency and throughput for a given model 
    on CPU or GPU, across different batch sizes.

    Args:
        model: Keras model (e.g. ResNet50, MobileNetV2).
        device_str: Device string ("/CPU:0" or "/GPU:0").
        batch_sizes: Iterable of batch sizes to test.
        input_size: Input tensor shape (H, W, C).
        warmup: Number of warmup passes (ignored in timing).
        iters: Number of timed iterations.

    Returns:
        List of dictionaries with batch size, throughput, latency, etc.
    """
    print("=" * 60)
    print(f"🔍 Starting benchmark on {device_str}")
    print(f"Batch sizes to test: {batch_sizes}")
    print(f"Warmup iterations per batch size: {warmup}")
    print(f"Timed iterations per batch size: {iters}")
    print("=" * 60)

    results = []
    for bs in batch_sizes:
        print(f"\n--- 🟢 Benchmarking batch size {bs} on {device_str} ---")
        with tf.device(device_str):
            # Create dummy input batch
            x = tf.random.normal([bs, *input_size], dtype=tf.float32)
            print(f"Created dummy input tensor of shape {x.shape}")

            # Warmup runs
            print(f"Running {warmup} warmup iterations (not timed)...")
            for i in range(warmup):
                _ = forward_pass(model, x)
            print("Warmup complete ✅")

            # Timed runs
            print(f"Running {iters} timed iterations...")
            t0 = time.perf_counter()
            for i in range(iters):
                y = forward_pass(model, x)
                # .numpy() copies the result to the host, which blocks until
                # the (possibly asynchronous) device computation has finished.
                _ = tf.reduce_sum(y).numpy()
                if (i+1) % max(1, iters//5) == 0:
                    print(f"  Iteration {i+1}/{iters} complete")
            t1 = time.perf_counter()
            print("Timed iterations complete ✅")

        # Compute metrics
        total_images = bs * iters
        elapsed = t1 - t0
        throughput = total_images / elapsed
        latency_ms_per_img = (elapsed / total_images) * 1000.0

        print(f"📊 Results for batch size {bs}:")
        print(f"    Total elapsed time: {elapsed:.3f} s")
        print(f"    Throughput: {throughput:.2f} images/sec")
        print(f"    Latency per image: {latency_ms_per_img:.3f} ms")

        results.append({
            "batch_size": bs,
            "elapsed_s": elapsed,
            "throughput_img_s": throughput,
            "latency_ms_per_img": latency_ms_per_img,
            "device": "GPU" if "GPU" in device_str else "CPU"
        })

    print("\n✅ Benchmarking complete!")
    print("=" * 60)
    return results



## 5) Run CPU Benchmark

In [9]:

cpu_device = "/CPU:0"
cpu_results = benchmark_inference(model, cpu_device, input_size=INPUT_SHAPE)
cpu_results


[{'batch_size': 1,
  'elapsed_s': 0.4414366999990307,
  'throughput_img_s': 67.95991361856836,
  'latency_ms_per_img': 14.714556666634355,
  'device': 'CPU'},
 {'batch_size': 8,
  'elapsed_s': 1.26134510000702,
  'throughput_img_s': 190.27306642620192,
  'latency_ms_per_img': 5.255604583362583,
  'device': 'CPU'},
 {'batch_size': 16,
  'elapsed_s': 2.6328334000136238,
  'throughput_img_s': 182.31309280622017,
  'latency_ms_per_img': 5.4850695833617165,
  'device': 'CPU'},
 {'batch_size': 32,
  'elapsed_s': 6.000544100010302,
  'throughput_img_s': 159.9854919820274,
  'latency_ms_per_img': 6.250566770844065,
  'device': 'CPU'},
 {'batch_size': 64,
  'elapsed_s': 13.951125099993078,
  'throughput_img_s': 137.6233089617233,
  'latency_ms_per_img': 7.266210989579728,
  'device': 'CPU'},
 {'batch_size': 128,
  'elapsed_s': 31.190594199986663,
  'throughput_img_s': 123.11403801347402,
  'latency_ms_per_img': 8.122550572913193,
  'device': 'CPU'}]

## 6) Run GPU Benchmark (if available)

In [10]:

gpu_results = None
if tf.config.list_physical_devices('GPU'):
    gpu_device = "/GPU:0"
    gpu_results = benchmark_inference(model, gpu_device, input_size=INPUT_SHAPE)
gpu_results


## 7) Aggregate Results

In [11]:

import pandas as pd

def to_arrays(results):
    bs = [r["batch_size"] for r in results]
    thr = [r["throughput_img_s"] for r in results]
    lat = [r["latency_ms_per_img"] for r in results]
    return np.array(bs), np.array(thr), np.array(lat)

cpu_bs, cpu_thr, cpu_lat = to_arrays(cpu_results)
gpu_bs = gpu_thr = gpu_lat = None
if gpu_results:
    gpu_bs, gpu_thr, gpu_lat = to_arrays(gpu_results)

print("CPU throughput (img/s):", cpu_thr)
print("CPU latency (ms/img):", cpu_lat)
if gpu_thr is not None:
    print("GPU throughput (img/s):", gpu_thr)
    print("GPU latency (ms/img):", gpu_lat)
else:
    print("No GPU found. Run in Colab with GPU for comparison.")

df_cpu = pd.DataFrame(cpu_results)
df = df_cpu.copy()
if gpu_results:
    df_gpu = pd.DataFrame(gpu_results)
    df = pd.concat([df_cpu, df_gpu], ignore_index=True)

# The trailing expression below renders the table in Jupyter/Colab.

df.round(3)


CPU throughput (img/s): [ 67.95991362 190.27306643 182.31309281 159.98549198 137.62330896
 123.11403801]
CPU latency (ms/img): [14.71455667  5.25560458  5.48506958  6.25056677  7.26621099  8.12255057]
No GPU found. Run in Colab with GPU for comparison.


| | batch_size | elapsed_s | throughput_img_s | latency_ms_per_img | device |
|---|---|---|---|---|---|
| 0 | 1 | 0.441 | 67.96 | 14.715 | CPU |
| 1 | 8 | 1.261 | 190.273 | 5.256 | CPU |
| 2 | 16 | 2.633 | 182.313 | 5.485 | CPU |
| 3 | 32 | 6.001 | 159.985 | 6.251 | CPU |
| 4 | 64 | 13.951 | 137.623 | 7.266 | CPU |
| 5 | 128 | 31.191 | 123.114 | 8.123 | CPU |
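
One way to read these numbers is to express each throughput relative to batch size 1 and locate the peak. The sketch below does that with the CPU values from the table above, rounded to three decimals. The names `cpu_throughput`, `relative`, and `best_bs` are illustrative and do not appear elsewhere in the notebook.

```python
# CPU throughput (images/sec) per batch size, copied from the results above
# and rounded to three decimals.
cpu_throughput = {
    1: 67.960,
    8: 190.273,
    16: 182.313,
    32: 159.985,
    64: 137.623,
    128: 123.114,
}

baseline = cpu_throughput[1]

# Relative throughput: how many times faster than batch size 1.
relative = {bs: thr / baseline for bs, thr in cpu_throughput.items()}

# Batch size with the highest measured throughput.
best_bs = max(cpu_throughput, key=cpu_throughput.get)

for bs, ratio in relative.items():
    marker = "  <- peak" if bs == best_bs else ""
    print(f"batch {bs:>3}: {ratio:.2f}x vs batch 1{marker}")
```

In this particular run, throughput peaks at batch size 8 (about 2.8x the batch-size-1 figure) and declines for larger batches. Other hardware may show a different pattern.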


## 8) Plot: Throughput vs Batch Size

In [None]:

plt.figure(figsize=(7,5))
plt.plot(cpu_bs, cpu_thr, marker='o', label='CPU')
if gpu_bs is not None:
    plt.plot(gpu_bs, gpu_thr, marker='o', label='GPU')
plt.title(f"{MODEL_NAME}: Throughput vs Batch Size (higher is better)")
plt.xlabel("Batch size")
plt.ylabel("Images per second")
plt.legend()
plt.grid(True)
plt.show()


## 9) Plot: Per-Image Latency vs Batch Size

In [None]:

plt.figure(figsize=(7,5))
plt.plot(cpu_bs, cpu_lat, marker='o', label='CPU')
if gpu_bs is not None:
    plt.plot(gpu_bs, gpu_lat, marker='o', label='GPU')
plt.title(f"{MODEL_NAME}: Latency per Image vs Batch Size (lower is better)")
plt.xlabel("Batch size")
plt.ylabel("Latency (ms per image)")
plt.legend()
plt.grid(True)
plt.show()



## 10) Notes & Tips
- **Choose the model** by setting `MODEL_NAME` to `"MobileNetV2"` (fast/local) or `"ResNet50"` (heavier, stresses GPU).
- **Weights** are randomly initialized (`weights=None`), so the predictions are meaningless; only the timing is measured.
- **Graph mode** (`@tf.function`) is enabled for the forward pass to reduce Python overhead.
- Results vary with hardware, TF build, drivers, and background load.
- To stress the hardware more, increase the batch sizes or switch to a larger network; increase `iters` for longer, more stable measurements.
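
Since results vary with background load, timing each iteration separately and reporting a median alongside the mean can make a noisy run easier to spot. The following is a minimal sketch of that idea, not the method used by `benchmark_inference` above, which times the whole loop once. The helpers `time_per_call` and `summarize` are hypothetical names, and `dummy_workload` is a stand-in so the sketch runs without TensorFlow.

```python
import statistics
import time


def time_per_call(fn, warmup=3, iters=20):
    """Return a list of wall-clock durations (seconds), one per timed call."""
    for _ in range(warmup):
        fn()  # untimed warm-up calls
    durations = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - t0)
    return durations


def summarize(durations):
    """Summarize durations in milliseconds: median, mean, min, max."""
    ms = [d * 1000.0 for d in durations]
    return {
        "median_ms": statistics.median(ms),
        "mean_ms": statistics.fmean(ms),
        "min_ms": min(ms),
        "max_ms": max(ms),
    }


def dummy_workload():
    # Stand-in for a forward pass so the sketch runs without TensorFlow.
    return sum(i * i for i in range(50_000))


stats = summarize(time_per_call(dummy_workload))
print({k: round(v, 3) for k, v in stats.items()})
```

In this notebook, `fn` could be something like `lambda: tf.reduce_sum(forward_pass(model, x)).numpy()`, which keeps the host sync inside each timed call. A large gap between `median_ms` and `max_ms` suggests interference from other processes.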
