# TensorFlow Benchmark: float32 vs float64 on CPU and GPU

This notebook measures the performance difference between **float32** and **float64** on **CPU** and (if available) **GPU** using several workloads:

- Large matrix multiplication (GEMM)
- 2D convolution
- Inference through a dense neural network

It records per-iteration times, reports the median, mean, and standard deviation for each configuration in a summary table, and plots the medians as a bar chart.

## How to use
1. **Install/verify TensorFlow 2.x** in your environment.
2. If you're in **Google Colab**, go to **Runtime → Change runtime type → GPU** to enable a GPU.
3. Run all cells. The notebook auto-detects devices and runs the GPU tests only if a GPU is present.
4. You can tweak problem sizes in the configuration cell to match your hardware.

## Notes
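- GPU support for float64 is hardware- and driver-dependent. Many consumer GPUs have far lower float64 throughput than float32 throughput.
- For a fair comparison, each test runs a short warm-up before timing, so one-time costs such as `tf.function` tracing are excluded from the measurements.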
- Results can vary due to background processes and TensorFlow graph optimizations.


In [3]:
import platform
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

try:
    import tensorflow as tf
except ImportError as e:
    # Chain the original error so the underlying import failure stays visible
    raise RuntimeError("TensorFlow is not installed in this environment. Please install TF 2.x and retry.") from e

print("TensorFlow version:", tf.__version__)
print("Python:", platform.python_version())
print("NumPy:", np.__version__)

# List devices
cpus = tf.config.list_physical_devices('CPU')
gpus = tf.config.list_physical_devices('GPU')
print("CPUs:", cpus)
print("GPUs:", gpus)

# Enable memory growth on the first GPU (if present) so TensorFlow allocates VRAM on demand
if gpus:
    try:
        tf.config.experimental.set_memory_growth(gpus[0], True)
    except Exception as e:
        print("Could not set memory growth:", e)


TensorFlow version: 2.16.2
Python: 3.10.18
NumPy: 1.26.4
CPUs: [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
GPUs: []


## Configuration
You can reduce sizes if you get out-of-memory errors, or increase them for more stable timings.

In [4]:
# Problem sizes (adjust to your hardware)
MATMUL_N = 4096      # MatMul size: N x N
CONV_B = 32          # batch size
CONV_H = 256         # height
CONV_W = 256         # width
CONV_CIN = 32        # input channels
CONV_COUT = 64       # output channels (filters)
KERNEL = 3           # kernel size

DENSE_IN = 4096      # dense model input features
DENSE_H = 512        # hidden units per layer
DENSE_L = 4          # number of hidden layers
DENSE_OUT = 1024     # output features
DENSE_B = 1024       # batch size for inference

# Benchmark parameters
WARMUP = 5
ITERS = 20

# Dtypes and devices to test
DTYPES = [tf.float32, tf.float64]
DEVICES = ['/CPU:0'] + (['/GPU:0'] if tf.config.list_physical_devices('GPU') else [])
DEVICES

['/CPU:0']
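
To judge whether these sizes fit in memory before running anything, the footprint of the largest tensors can be estimated from their element counts. The following is a rough sketch, assuming 4 bytes per float32 element and 8 bytes per float64 element, and ignoring any workspace buffers TensorFlow allocates internally. The helper name `tensor_mib` is hypothetical (it is not a TensorFlow API), and the shapes are copied from the configuration cell.

```python
from math import prod

# Bytes per element for the two dtypes under test
BYTES_PER_ELEMENT = {"float32": 4, "float64": 8}

# Shapes copied from the configuration cell
SHAPES = {
    "matmul operand (N, N)": (4096, 4096),
    "conv input (B, H, W, Cin)": (32, 256, 256, 32),
    "conv output (B, H, W, Cout)": (32, 256, 256, 64),
    "dense batch (B, features)": (1024, 4096),
}

def tensor_mib(shape, dtype_name):
    """Size of one dense tensor in MiB (hypothetical helper, not a TensorFlow API)."""
    return prod(shape) * BYTES_PER_ELEMENT[dtype_name] / 2**20

for name, shape in SHAPES.items():
    sizes = ", ".join(
        f"{dt}: {tensor_mib(shape, dt):.0f} MiB" for dt in BYTES_PER_ELEMENT
    )
    print(f"{name:30s} {sizes}")
```

With the default sizes, the float64 convolution output alone is about 1 GiB, so `CONV_B` or the image dimensions are the first values to reduce on a memory-constrained device.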

## Benchmark helpers

In [5]:
def _sync(t):
    """Force device synchronization in all TF versions by fetching a scalar."""
    try:
        # Reduce to a scalar on device, then copy to host -> blocks until done
        _ = tf.reduce_sum(t).numpy()
    except Exception:
        # If t isn't a Tensor (or it's a tuple/list), be defensive
        if isinstance(t, (list, tuple)) and t:
            _ = tf.reduce_sum(t[0]).numpy()
        else:
            # Last resort: make a tiny op and fetch it
            _ = tf.constant(0).numpy()

def benchmark_op(op_fn, warmup=WARMUP, iters=ITERS):
    """Run op_fn() warmup times, then iters times, return a list of elapsed seconds."""
    # Warm-up (also builds @tf.function graphs)
    for _ in range(warmup):
        y = op_fn()
        _sync(y)

    # Timed runs
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        y = op_fn()
        _sync(y)  # block until the result is ready (used instead of tf.experimental.async_wait(), which this TF version lacks)
        t1 = time.perf_counter()
        times.append(t1 - t0)
    return times

def median(x):
    x = sorted(x)
    n = len(x)
    return x[n//2] if n % 2 else 0.5*(x[n//2-1]+x[n//2])
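
The mean and standard deviation recorded alongside the median are sensitive to outliers, such as a single iteration slowed down by a background process. As a possible complement, the sketch below uses the standard-library `statistics` module to compute the median together with the interquartile range (IQR). The function name `robust_summary` and the sample timings are made up for illustration.

```python
import statistics

def robust_summary(times):
    """Return (median, IQR) of a list of timings in seconds (illustrative helper)."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3
    q1, _, q3 = statistics.quantiles(times, n=4, method="inclusive")
    return statistics.median(times), q3 - q1

# Made-up timings with one slow outlier
sample = [0.101, 0.099, 0.100, 0.102, 0.250, 0.098, 0.100, 0.101]
med, iqr = robust_summary(sample)
print(f"median={med:.4f}s  IQR={iqr:.4f}s  mean={statistics.mean(sample):.4f}s")
```

In this sample the single 0.25 s outlier pulls the mean well above the median, while the median and IQR barely move. `statistics.median` returns the same value as the `median` helper above for any non-empty list.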


## Workload 1: Large Matrix Multiplication (GEMM)
Computes `C = A @ B`, where `A` and `B` are `N x N` matrices. This workload is compute-bound, so a GPU often shows a clear advantage, especially for float32.

In [6]:
results = []

for dev in DEVICES:
    for dt in DTYPES:
        try:
            with tf.device(dev):
                A = tf.random.normal((MATMUL_N, MATMUL_N), dtype=dt)
                B = tf.random.normal((MATMUL_N, MATMUL_N), dtype=dt)
                @tf.function(jit_compile=False)
                def run():
                    return tf.linalg.matmul(A, B)
                times = benchmark_op(run)
                res = {
                    'workload': 'matmul',
                    'device': dev,
                    'dtype': str(dt.name),
                    'iters': len(times),
                    'time_median_s': median(times),
                    'time_mean_s': float(np.mean(times)),
                    'time_std_s': float(np.std(times)),
                }
                results.append(res)
                print(res)
                # drop the local references to the large tensors
                del A, B
        except Exception as e:
            print(f"[SKIP matmul] {dev} {dt.name}: {e}")


{'workload': 'matmul', 'device': '/CPU:0', 'dtype': 'float32', 'iters': 20, 'time_median_s': 0.13776399999915157, 'time_mean_s': 0.13807563999871492, 'time_std_s': 0.003223546195631514}
{'workload': 'matmul', 'device': '/CPU:0', 'dtype': 'float64', 'iters': 20, 'time_median_s': 0.35059370000089984, 'time_mean_s': 0.352023080001527, 'time_std_s': 0.005804552103140353}
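
A median time is easier to compare across machines when converted to throughput. A dense `N x N` matrix product takes roughly `2 * N**3` floating-point operations (one multiply and one add per inner-product term), so the sketch below converts the medians printed above into GFLOP/s. The function name `gemm_gflops` is hypothetical, and the timings are rounded copies of this run's output.

```python
def gemm_gflops(n, seconds):
    """Approximate throughput of an (n x n) @ (n x n) matmul in GFLOP/s.

    Uses the conventional 2 * n**3 operation count (illustrative helper).
    """
    return 2 * n**3 / seconds / 1e9

N = 4096  # MATMUL_N from the configuration cell
# Median times from the CPU run above, rounded to five significant digits
medians = {"float32": 0.13776, "float64": 0.35059}

for dtype_name, seconds in medians.items():
    print(f"{dtype_name}: {gemm_gflops(N, seconds):.0f} GFLOP/s")
```

Note that the timed region also includes the `tf.reduce_sum` and host copy performed by `_sync`, so these figures slightly understate the raw matmul throughput.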


## Workload 2: 2D Convolution
A single `tf.nn.conv2d` forward pass (stride 1, `SAME` padding) on a synthetic image batch in NHWC layout.

In [5]:
for dev in DEVICES:
    for dt in DTYPES:
        try:
            with tf.device(dev):
                x = tf.random.normal((CONV_B, CONV_H, CONV_W, CONV_CIN), dtype=dt)
                w = tf.random.normal((KERNEL, KERNEL, CONV_CIN, CONV_COUT), dtype=dt)
                @tf.function(jit_compile=False)
                def run():
                    return tf.nn.conv2d(x, w, strides=1, padding='SAME')
                times = benchmark_op(run)
                res = {
                    'workload': 'conv2d',
                    'device': dev,
                    'dtype': str(dt.name),
                    'iters': len(times),
                    'time_median_s': median(times),
                    'time_mean_s': float(np.mean(times)),
                    'time_std_s': float(np.std(times)),
                }
                results.append(res)
                print(res)
                del x, w
        except Exception as e:
            print(f"[SKIP conv2d] {dev} {dt.name}: {e}")


[SKIP conv2d] /CPU:0 float32: module 'tensorflow.experimental' has no attribute 'async_wait'
[SKIP conv2d] /CPU:0 float64: module 'tensorflow.experimental' has no attribute 'async_wait'
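
The arithmetic cost of this convolution can be estimated from the configuration alone. With stride 1 and `SAME` padding the output keeps the input's height and width, and each output element needs `KERNEL * KERNEL * CONV_CIN` multiply-adds. The sketch below applies that count. `conv2d_flops` is a hypothetical helper name, and the estimate assumes a direct convolution, ignoring any algorithmic shortcuts a backend might use.

```python
def conv2d_flops(batch, height, width, c_in, c_out, kernel):
    """Estimate FLOPs for a stride-1, SAME-padded 2D convolution (illustrative)."""
    outputs = batch * height * width * c_out   # number of output elements
    macs_per_output = kernel * kernel * c_in   # multiply-adds per output element
    return 2 * outputs * macs_per_output       # one multiply-add counts as 2 FLOPs

# Values copied from the configuration cell
flops = conv2d_flops(batch=32, height=256, width=256, c_in=32, c_out=64, kernel=3)
print(f"conv2d forward pass: {flops / 1e9:.1f} GFLOP")
```

With the default configuration this comes to about 77 GFLOP, a little more than half of the roughly 137 GFLOP in the matmul workload.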


## Workload 3: Dense Network Inference
A simple Keras MLP with `DENSE_L` ReLU hidden layers. Only the forward pass is timed; no training is performed.

In [6]:
from tensorflow import keras
from tensorflow.keras import layers

for dev in DEVICES:
    for dt in DTYPES:
        try:
            with tf.device(dev):
                # Build model with explicit dtype
                inputs = keras.Input(shape=(DENSE_IN,), dtype=dt)
                x = inputs
                for _ in range(DENSE_L):
                    x = layers.Dense(DENSE_H, activation='relu', dtype=dt)(x)
                outputs = layers.Dense(DENSE_OUT, dtype=dt)(x)
                model = keras.Model(inputs, outputs)
                # Create input batch
                batch = tf.random.normal((DENSE_B, DENSE_IN), dtype=dt)
                @tf.function(jit_compile=False)
                def run():
                    return model(batch, training=False)
                times = benchmark_op(run)
                res = {
                    'workload': 'dense_infer',
                    'device': dev,
                    'dtype': str(dt.name),
                    'iters': len(times),
                    'time_median_s': median(times),
                    'time_mean_s': float(np.mean(times)),
                    'time_std_s': float(np.std(times)),
                }
                results.append(res)
                print(res)
                del model, batch
        except Exception as e:
            print(f"[SKIP dense_infer] {dev} {dt.name}: {e}")


[SKIP dense_infer] /CPU:0 float32: module 'tensorflow.experimental' has no attribute 'async_wait'
[SKIP dense_infer] /CPU:0 float64: module 'tensorflow.experimental' has no attribute 'async_wait'
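
To put this workload in proportion to the other two, the model's parameter count and forward-pass cost can be worked out from the layer widths. A minimal sketch, assuming the default configuration values and counting only the matrix multiplications (bias additions and ReLU are ignored):

```python
# Layer widths from the configuration cell: input, DENSE_L hidden layers, output
widths = [4096] + [512] * 4 + [1024]
batch = 1024  # DENSE_B

# Each Dense layer has an (in x out) weight matrix plus one bias per output unit
params = sum(i * o + o for i, o in zip(widths, widths[1:]))
# Matmul cost per layer is 2 * batch * in * out
flops = sum(2 * batch * i * o for i, o in zip(widths, widths[1:]))

print(f"parameters: {params:,}")
print(f"forward-pass FLOPs per batch: {flops / 1e9:.2f} GFLOP")
```

At about 7 GFLOP per batch, this workload is much lighter than the matmul and convolution workloads, so fixed per-call overhead is likely to account for a larger share of its measured time.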


## Summary table and plot

In [7]:
df = pd.DataFrame(results)
if not df.empty:
    display(df.sort_values(['workload','device','dtype']))
    # Save CSV
    out_csv = 'tf_benchmark_results.csv'
    df.to_csv(out_csv, index=False)
    print('Saved results to', out_csv)
else:
    print('No results collected (likely due to missing TensorFlow ops/devices).')

# Simple bar chart of median time
if not df.empty:
    key = 'time_median_s'
    labels = [f"{w}\n{d}\n{t}" for w,d,t in zip(df['workload'], df['device'], df['dtype'])]
    vals = df[key].values
    plt.figure(figsize=(10, 6))
    plt.bar(labels, vals)
    plt.ylabel('Median time per iter (s)')
    plt.title('TensorFlow float32 vs float64 on CPU/GPU')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()


No results collected (likely due to missing TensorFlow ops/devices).
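
The quantity of interest in this notebook is the ratio between the two dtypes rather than the absolute times. The sketch below derives that ratio per workload and device from a list of result dictionaries shaped like `results`. The function name `dtype_ratios` is hypothetical, and `example_results` is a stand-in built from the rounded matmul medians printed earlier.

```python
from collections import defaultdict

def dtype_ratios(results):
    """Map (workload, device) -> float64/float32 median-time ratio (illustrative)."""
    by_key = defaultdict(dict)
    for r in results:
        by_key[(r["workload"], r["device"])][r["dtype"]] = r["time_median_s"]
    # Keep only the pairs for which both dtypes were measured
    return {
        key: t["float64"] / t["float32"]
        for key, t in by_key.items()
        if "float32" in t and "float64" in t
    }

# Stand-in for the notebook's `results` list
example_results = [
    {"workload": "matmul", "device": "/CPU:0", "dtype": "float32", "time_median_s": 0.13776},
    {"workload": "matmul", "device": "/CPU:0", "dtype": "float64", "time_median_s": 0.35059},
]
for (workload, device), ratio in dtype_ratios(example_results).items():
    print(f"{workload} on {device}: float64 takes {ratio:.2f}x as long as float32")
```

Configurations where one dtype was skipped are left out of the returned mapping instead of raising an error.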


## Tips to get clearer results
- Close other heavy apps and re-run.
- Increase `ITERS` for more stable medians.
- Increase problem sizes if you have plenty of RAM/VRAM.
- On Colab, make sure you selected the **GPU** runtime (not TPU).