# <font color="#418FDE" size="6.5" uppercase>**Using torch.compile**</font>

>Last update: 20260130.
    
By the end of this lecture, you will be able to:
- Describe the purpose and high‑level behavior of torch.compile in PyTorch 2.10.0. 
- Wrap existing nn.Module models with torch.compile and configure basic compilation options. 
- Measure and interpret performance changes after compilation, including warmup effects. 


## **1. Core Compile Concepts**

### **1.1. Eager and Compiled Modes**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_07/Lecture_A/image_01_01.jpg?v=1769827107" width="250">



>* Eager mode runs each tensor operation immediately
>* Python overhead can slow large, repeated workloads

>* Compiler records model runs, builds optimized graph
>* Optimized graph skips Python overhead, fuses operations

>* Eager favors flexibility, debugging, and quick feedback
>* Compiled favors speed after warmup for stable workloads



In [None]:
#@title Python Code - Eager and Compiled Modes

# This script compares eager and compiled execution modes.
# It uses TensorFlow to mimic PyTorch compile ideas.
# Focus on timing simple model runs in two execution styles.

# !pip install tensorflow==2.20.0

# Import required standard and TensorFlow modules.
import os, random, time, numpy as np, tensorflow as tf

# Set deterministic seeds for reproducible behavior.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Print TensorFlow version as framework information.
print("TensorFlow version:", tf.__version__)

# Select device string based on GPU availability.
physical_gpus = tf.config.list_physical_devices("GPU")
use_gpu = bool(physical_gpus)
device_name = "/GPU:0" if use_gpu else "/CPU:0"

# Create a tiny dense model for demonstration.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Build model by running one dummy forward pass.
dummy_input = tf.zeros((1, 32), dtype=tf.float32)
_ = model(dummy_input)

# Validate model output shape for safety.
output_shape = model(dummy_input).shape.as_list()
assert output_shape == [1, 10], "Unexpected model output shape detected."

# Create random input batch for timing tests.
batch_size = 256
x_batch = tf.random.normal((batch_size, 32), dtype=tf.float32)

# Define eager forward function without decoration.
def eager_forward(x_batch_input):
    return model(x_batch_input)


# Define compiled forward function using tf.function.
@tf.function
def compiled_forward(x_batch_input):
    return model(x_batch_input)


# Helper function to time multiple runs of a callable.
def time_runs(fn, x_batch_input, warmup_runs, timed_runs):
    for _ in range(warmup_runs):
        _ = fn(x_batch_input)

    start_time = time.perf_counter()
    for _ in range(timed_runs):
        _ = fn(x_batch_input)

    end_time = time.perf_counter()
    return (end_time - start_time) / float(timed_runs)


# Configure warmup and timed run counts.
warmup_runs = 3
timed_runs = 10

# Run timing for eager execution mode.
with tf.device(device_name):
    eager_time = time_runs(eager_forward, x_batch, warmup_runs, timed_runs)

# Run timing for compiled execution mode.
with tf.device(device_name):
    compiled_time = time_runs(compiled_forward, x_batch, warmup_runs, timed_runs)

# Compute relative speedup factor safely.
speedup = eager_time / compiled_time if compiled_time > 0.0 else 1.0

# Print concise summary comparing both execution modes.
print("Device used for runs:", device_name)
print("Average eager forward time (seconds):", round(eager_time, 6))
print("Average compiled forward time (seconds):", round(compiled_time, 6))
print("Approximate compiled speedup factor:", round(speedup, 2))




### **1.2. Tracing and Graphs**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_07/Lecture_A/image_01_02.jpg?v=1769827158" width="250">



>* Tracing records tensor ops into a graph
>* Static graphs let compilers analyze and optimize

>* Tracing records tensor ops into a static graph
>* Compiler optimizes this graph for faster execution

>* Traced graphs capture one specific run’s behavior
>* Stable, predictable models benefit most; dynamic ones retrace
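
To see what a recorded graph looks like in PyTorch itself, the sketch below uses `torch.fx.symbolic_trace`. This is a simpler tracer than the one `torch.compile` relies on (TorchDynamo), so treat it only as an illustration of the graph format; the `TinyNet` module and its layer sizes are assumptions made up for this example.

```python
import torch
import torch.fx
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical module whose forward pass has no data-dependent Python logic."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 3)

    def forward(self, x):
        return torch.relu(self.fc(x)) + 1.0


torch.manual_seed(0)
tiny_net = TinyNet().eval()

# Symbolically trace the module: the forward pass is recorded as graph nodes.
traced_net = torch.fx.symbolic_trace(tiny_net)

# Each node stores an op kind (placeholder, call_module, ...) and a target.
node_ops = []
for node in traced_net.graph.nodes:
    node_ops.append(node.op)
    print(f"{node.op:<14} {node.name:<8} {node.target}")

# The traced module is callable and should reproduce the eager result.
sample = torch.randn(2, 4)
with torch.no_grad():
    outputs_match = torch.allclose(tiny_net(sample), traced_net(sample))
print("Traced output matches eager:", outputs_match)
```

The printed list starts with a `placeholder` node for the input and ends with an `output` node; everything in between is an operation the compiler is free to analyze and fuse.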



In [None]:
#@title Python Code - Tracing and Graphs

# This script illustrates tracing and graphs conceptually.
# We simulate a tiny tensor computation graph manually.
# Focus on understanding nodes edges and execution order.

# No extra installs are required for this simple example.
# Uncomment and adapt if additional packages become necessary.
# Example placeholder pip install line shown here.
# !pip install some_required_package_if_needed.

# Define a simple Node class representing one graph operation.
class Node:
    def __init__(self, name, op, inputs):
        self.name = name
        self.op = op
        self.inputs = inputs


# Define a tiny tensor like container using Python lists.
class TinyTensor:
    def __init__(self, values):
        self.values = list(values)


# Define a function that adds two TinyTensor objects safely.
def add_tensors(a, b):
    assert len(a.values) == len(b.values)

    return TinyTensor([x + y for x, y in zip(a.values, b.values)])


# Define a function that multiplies two TinyTensor objects safely.
def mul_tensors(a, b):
    assert len(a.values) == len(b.values)

    return TinyTensor([x * y for x, y in zip(a.values, b.values)])


# Define a simple tracer that records operations as nodes.
class Tracer:
    def __init__(self):
        self.nodes = []

    # Method to record an operation node inside the tracer.
    def record(self, name, op, inputs):
        node = Node(name=name, op=op, inputs=inputs)
        self.nodes.append(node)

        return node

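    # Method to execute nodes in recorded order, starting from the given inputs.
    def run(self, initial_env=None):
        # Seed the environment with input tensors so the first node can find them.
        env = dict(initial_env or {})
        for node in self.nodes:
            input_tensors = [env[i] for i in node.inputs]
            result = node.op(*input_tensors)
            env[node.name] = result

        return env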


# Create example TinyTensor inputs representing traced model inputs.
input_a = TinyTensor([1.0, 2.0, 3.0])
input_b = TinyTensor([0.5, 0.5, 0.5])


# Initialize tracer and seed environment with input tensors.
tracer = Tracer()
initial_env = {"a": input_a, "b": input_b}


# Record an addition node representing a first layer operation.
node_add = tracer.record(
    name="c",
    op=lambda x, y: add_tensors(x, y),
    inputs=["a", "b"],
)


# Record a multiplication node representing another layer operation.
node_mul = tracer.record(
    name="d",
    op=lambda x, y: mul_tensors(x, y),
    inputs=["c", "b"],
)


# Execute the recorded graph using the initial environment mapping.
env = dict(initial_env)
for node in tracer.nodes:
    input_tensors = [env[i] for i in node.inputs]
    env[node.name] = node.op(*input_tensors)


# Print a short summary showing traced nodes and final outputs.
print("Recorded nodes in tiny computation graph:")
for node in tracer.nodes:
    print("Node", node.name, "inputs", node.inputs)


# Print final tensor values to illustrate graph execution result.
print("Final tensor c values:", env["c"].values)
print("Final tensor d values:", env["d"].values)



### **1.3. Supported operations overview**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_07/Lecture_A/image_01_03.jpg?v=1769827204" width="250">



>* Compiler fuses supported tensor ops into graphs
>* Common layers compile into faster, larger kernels

>* Compiler supports many ops, falls back when needed
>* Models can mix compiled regions with eager code

>* Compilation is flexible; unsupported parts safely fallback
>* Speedups grow with long stretches of supported ops



In [None]:
#@title Python Code - Supported operations overview

# This script explains supported operations conceptually.
# It uses TensorFlow to simulate tensor operations.
# Focus is on compile friendly versus unfriendly patterns.

# !pip install tensorflow

# Import required standard and TensorFlow modules.
import os
import random
import numpy as np
import tensorflow as tf

# Set deterministic seeds for reproducibility.
seed_value = 7
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Print TensorFlow version as our tensor framework.
print("TensorFlow version:", tf.__version__)

# Create a small random tensor representing model activations.
activations = tf.random.normal(shape=(2, 4), dtype=tf.float32)
print("Input activations shape:", activations.shape)

# Define a compile friendly block using dense and activation.
friendly_layer = tf.keras.Sequential([
    tf.keras.layers.Dense(units=4, activation="relu"),
])

# Run the friendly block once to build internal variables.
friendly_output = friendly_layer(activations)
print("Friendly block output shape:", friendly_output.shape)

# Define a Python heavy function simulating unsupported operations.
def python_heavy_transform(tensor):
    values = tensor.numpy().flatten().tolist()
    squared = [v * v for v in values]

    return tf.convert_to_tensor(squared, dtype=tensor.dtype)


# Apply friendly block then Python heavy transform sequentially.
intermediate = friendly_layer(activations)
heavy_output = python_heavy_transform(intermediate)
print("Heavy transform output shape:", heavy_output.shape)

# Reshape heavy output back to matrix form safely.
if heavy_output.shape[0] == activations.shape[0] * activations.shape[1]:
    reshaped_output = tf.reshape(
        heavy_output,
        shape=(activations.shape[0], activations.shape[1]),
    )
else:
    reshaped_output = activations

# Show final tensor shape after mixed style processing.
print("Final mixed pipeline shape:", reshaped_output.shape)




## **2. torch compile options**

### **2.1. torch compile basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_07/Lecture_A/image_02_01.jpg?v=1769827248" width="250">



>* Think of torch.compile as wrapping existing models
>* Compiled model keeps same interface, improves performance

>* Compiled model is a seamless drop-in replacement
>* Initial runs are slower; later runs faster

>* Compile wrapper fits into existing workflows easily
>* Treat compilation as an optional performance layer
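
The "drop-in" claim can be checked directly. The following sketch, built around a hypothetical three-layer model, shows that the object returned by `torch.compile` is still an `nn.Module`, that it reuses the original parameter tensors, and that nothing is compiled until the first call.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
base_model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))

# torch.compile returns quickly; compilation is deferred to the first forward call.
wrapped_model = torch.compile(base_model)

# The wrapper is still an nn.Module, so existing training code can keep using it.
is_module = isinstance(wrapped_model, nn.Module)
print("Is nn.Module:", is_module)

# The wrapper holds the very same parameter tensors, not copies.
shares_weights = all(
    p_wrapped is p_base
    for p_wrapped, p_base in zip(wrapped_model.parameters(), base_model.parameters())
)
print("Shares parameters:", shares_weights)

# Saving through the original module keeps checkpoint keys independent of the wrapper.
checkpoint_keys = list(base_model.state_dict().keys())
print("Checkpoint keys:", checkpoint_keys)
```

Because the weights are shared, an optimizer built from either object updates both, and saving `base_model.state_dict()` yields a checkpoint that loads with or without compilation.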



In [None]:
#@title Python Code - torch compile basics

# This script demonstrates basic torch compile wrapping.
# It compares eager and compiled model forward speeds.
# It keeps the model and training loop unchanged.

# !pip install torch torchvision.

# Import required standard and torch modules.
import time
import random
import torch
import torch.nn as nn

# Set deterministic seeds for reproducibility.
random.seed(0)
torch.manual_seed(0)

# Select device based on GPU availability.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define a simple convolutional neural network model.
class SmallConvNet(nn.Module):
    # Initialize layers inside the constructor.
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),
        )

        self.classifier = nn.Linear(16, 10)

    # Define the forward computation for inputs.
    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)

        return self.classifier(x)


# Create an instance of the model and move to device.
model_eager = SmallConvNet().to(device)

# Create a compiled version using torch compile wrapper.
model_compiled = torch.compile(model_eager)

# Create a small random input batch for timing tests.
batch_size = 32
input_shape = (batch_size, 1, 28, 28)

# Build input tensor and move to selected device.
example_inputs = torch.randn(input_shape, device=device)

# Validate that model outputs have matching shapes.
with torch.no_grad():
    out_eager = model_eager(example_inputs)
    out_compiled = model_compiled(example_inputs)

# Assert shapes are equal to ensure drop in replacement.
assert out_eager.shape == out_compiled.shape


# Define a helper function to time multiple forward passes.
def time_forwards(model, inputs, warmup, iters):
    # Ensure gradients are disabled during timing.
    model.eval()
    with torch.no_grad():
        # Optional warmup iterations to trigger compilation.
        for _ in range(warmup):
            _ = model(inputs)

        if device.type == "cuda":
            torch.cuda.synchronize()

        start = time.perf_counter()
        for _ in range(iters):
            _ = model(inputs)
        if device.type == "cuda":
            torch.cuda.synchronize()
        end = time.perf_counter()

    # Return average time per iteration.
    return (end - start) / iters


# Configure warmup and iteration counts for quick demo.
warmup_iters = 1
measure_iters = 5

# Time the eager model with the same warmup for a fair comparison.
avg_eager = time_forwards(model_eager, example_inputs, warmup_iters, measure_iters)

# Time the compiled model; warmup absorbs any remaining compilation cost.
avg_compiled = time_forwards(model_compiled, example_inputs, warmup_iters, measure_iters)

# Print framework version and device information.
print("PyTorch version:", torch.__version__, "Device:", device)

# Print average forward times for both model variants.
print("Average eager forward seconds:", round(avg_eager, 6))
print("Average compiled forward seconds:", round(avg_compiled, 6))

# Show that both models produce numerically close outputs.
max_diff = (out_eager - out_compiled).abs().max().item()
print("Maximum absolute difference between outputs:", float(max_diff))

# Indicate which model was faster in this short experiment.
print("Compiled faster than eager:", bool(avg_compiled < avg_eager))




### **2.2. Backends and modes**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_07/Lecture_A/image_02_02.jpg?v=1769827306" width="250">



>* Backends optimize the model graph for hardware
>* Modes control optimization aggressiveness and compilation robustness

>* Different backends specialize for different hardware workloads
>* Choose aggressive or conservative backends based on priorities

>* Modes trade off debuggability versus raw performance
>* Start conservative, then switch to aggressive for production



In [None]:
#@title Python Code - Backends and modes

# This script compares simple backend and mode choices.
# It uses TensorFlow to simulate backend style behavior.
# Focus on configuration ideas not exact PyTorch details.
# !pip install tensorflow

# Import required TensorFlow and system modules.
import tensorflow as tf
import time
import os

# Set deterministic seeds for reproducible behavior.
tf.random.set_seed(7)
os.environ["PYTHONHASHSEED"] = "7"

# Print TensorFlow version as framework information.
print("TensorFlow version:", tf.__version__)

# Create a tiny synthetic dataset for demonstration.
features = tf.random.normal((256, 16))
labels = tf.random.normal((256, 1))

# Validate dataset shapes before building models.
assert features.shape[1] == 16 and labels.shape[1] == 1

# Define a simple dense regression model function.

def build_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(16,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1),
    ])

    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
        loss="mse",
        run_eagerly=True,
    )

    return model


# Build a baseline model using eager execution mode.
baseline_model = build_model()

# Train baseline model briefly with silent logging.
start_eager = time.time()
baseline_model.fit(features, labels, epochs=3, batch_size=32, verbose=0)
end_eager = time.time()

# Measure eager execution training time.
eager_time = end_eager - start_eager

# Rebuild model configured for graph mode backend style.
graph_model = build_model()

# Switch model to graph mode by disabling eager run.

graph_model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
    loss="mse",
    run_eagerly=False,
)

# Run a warmup epoch to trigger tracing and compilation.
_ = graph_model.fit(features, labels, epochs=1, batch_size=32, verbose=0)

# Measure compiled style training time after warmup.
start_graph = time.time()
_ = graph_model.fit(features, labels, epochs=2, batch_size=32, verbose=0)
end_graph = time.time()

# Compute total graph mode time excluding warmup epoch.
graph_time = end_graph - start_graph

# Print per-epoch times, since eager ran 3 epochs and graph mode timed only 2.
print("Eager backend style seconds per epoch:", round(eager_time / 3, 4))
print("Graph backend style seconds per epoch:", round(graph_time / 2, 4))

# Show simple interpretation of backend and mode behavior.
print("Eager mode prioritizes simplicity and easier debugging.")
print("Graph mode acts like optimized backend after warmup.")
print("Warmup tracing cost appears before measured graph timing.")
print("Choose modes based on development or performance priorities.")
print("This mirrors torch.compile backend and mode tradeoffs.")




### **2.3. Handling Compiler Fallbacks**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_07/Lecture_A/image_02_03.jpg?v=1769827354" width="250">



>* Compiler may only optimize parts of models
>* Fallback to eager mode preserves correctness and reliability

>* Watch logs to spot modules running in eager
>* Refactor or replace unsupported ops to reduce fallbacks

>* Balance fallbacks differently for research and production
>* Trade off performance, flexibility, correctness, and maintainability
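
When silent fallbacks are not acceptable, `fullgraph=True` turns any graph break into an error. The sketch below contrasts a hypothetical function containing a data-dependent Python branch with a tensor-only rewrite. It passes `backend="eager"` purely to keep the demonstration light; that is an assumption of this example, not a requirement of `fullgraph`.

```python
import torch


def gated_scale_branchy(x):
    # Data-dependent Python branch: cannot be captured in a single graph.
    if x.mean() > 0:
        return x * 2.0
    return x * 0.5


def gated_scale_tensor(x):
    # Tensor-only rewrite of the same logic.
    return torch.where(x.mean() > 0, x * 2.0, x * 0.5)


sample = torch.ones(4)

strict_branchy = torch.compile(gated_scale_branchy, fullgraph=True, backend="eager")
strict_tensor = torch.compile(gated_scale_tensor, fullgraph=True, backend="eager")

# With fullgraph=True the branchy version is rejected instead of falling back.
try:
    strict_branchy(sample)
    strict_failed = False
except Exception as err:
    strict_failed = True
    print("fullgraph=True rejected the function:", type(err).__name__)

# The tensor-only version compiles as one graph and matches plain eager results.
strict_out = strict_tensor(sample)
matches_eager = torch.equal(strict_out, gated_scale_branchy(sample))
print("Rewrite matches eager:", matches_eager)
```

Used this way, `fullgraph=True` acts as a check during development: once a model passes it, no part of the forward pass is quietly running in eager mode.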



In [None]:
#@title Python Code - Handling Compiler Fallbacks

# This script shows how to avoid torch.compile fallbacks.
# A tiny model replaces a data-dependent Python branch with torch.where.
# Focus on the refactor and on comparing eager and compiled results.

# !pip install torch torchvision.

# Import required modules for this small example.
import torch, torch.nn as nn, torch.nn.functional as F

# Set deterministic seed for reproducible random tensors.
torch.manual_seed(0)

# Select device preferring cuda then cpu safely.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define a tiny model that avoids a data-dependent Python branch.
class FallbackNet(nn.Module):
    # Initialize layers and store threshold attribute.
    def __init__(self, threshold: float = 0.0) -> None:
        super().__init__()

        # Create a single linear layer for demonstration.
        self.fc = nn.Linear(4, 4)

        # Store threshold used in the tensor comparison.
        self.threshold = threshold

    # Define forward using tensor ops in place of Python control flow.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Validate input shape to avoid silent broadcasting.
        assert x.shape[-1] == 4, "Input last dimension must equal four"

        # Keep the condition as a tensor; `if torch.mean(x) > ...` would break the graph.
        cond = torch.mean(x) > self.threshold
        # Compute both candidates and select with torch.where.
        x_lin = F.relu(self.fc(x))
        return torch.where(cond, x_lin, x * 0.5)


# Create eager model instance and move to device.
model_eager = FallbackNet(threshold=0.0).to(device)

# Create compiled model using default backend settings.
model_compiled = torch.compile(model_eager)

# Create a small input tensor for testing behavior.
example_input = torch.randn(2, 4, device=device)

# Run eager model once to get reference output.
with torch.no_grad():
    out_eager = model_eager(example_input)

# Run compiled model once to trigger compilation.
with torch.no_grad():
    out_compiled = model_compiled(example_input)

# Check numerical closeness between eager and compiled outputs.
max_diff = (out_eager - out_compiled).abs().max().item()

# Print framework version and basic comparison summary.
print("torch version:", torch.__version__)

# Print maximum difference to confirm correctness preservation.
print("max difference between eager and compiled:", float(max_diff))

# Recompile with fullgraph=True, which raises an error on any graph break.
model_compiled_full = torch.compile(model_eager, fullgraph=True)

# Run compiled fullgraph model once to warm up compilation.
with torch.no_grad():
    _ = model_compiled_full(example_input)

# Time several runs for eager and compiled models respectively.
import time

# Define helper function to measure average runtime per iteration.
def measure_runtime(fn, x, iters: int = 20) -> float:
    # Warm up under no_grad so the timed loop does not trigger a recompile.
    with torch.no_grad():
        for _ in range(5):
            fn(x)

    # Synchronize cuda if available before timing.
    if torch.cuda.is_available():
        torch.cuda.synchronize()

    # Record start time using time module.
    start = time.time()

    # Run function repeatedly without gradient tracking.
    with torch.no_grad():
        for _ in range(iters):
            fn(x)

    # Synchronize again for accurate measurement.
    if torch.cuda.is_available():
        torch.cuda.synchronize()

    # Compute average milliseconds per iteration.
    elapsed = (time.time() - start) * 1000.0 / iters

    # Return elapsed milliseconds as float value.
    return float(elapsed)


# Measure runtimes for eager and compiled fullgraph models.
ms_eager = measure_runtime(model_eager, example_input)

# Measure compiled runtime which may include fewer fallbacks.
ms_compiled = measure_runtime(model_compiled_full, example_input)

# Print simple timing comparison to interpret speed impact.
print("eager average milliseconds per run:", round(ms_eager, 4))

# Print compiled timing and note that fallbacks may reduce gains.
print("compiled average milliseconds per run:", round(ms_compiled, 4))

# Show whether compiled model appears faster despite possible fallbacks.
print("compiled faster than eager:", bool(ms_compiled < ms_eager))



## **3. Benchmarking Compiled Models**

### **3.1. Warmup Iterations Explained**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_07/Lecture_A/image_03_01.jpg?v=1769827491" width="250">



>* Warmup iterations include heavy one-time setup work
>* Counting warmup in timings can hide true compiled performance

>* Warmup is like warming a cold car
>* Initial runs build optimizations that speed later runs

>* Separate warmup and steady-state when timing
>* Ignore warmup runs; measure only stable performance



In [None]:
#@title Python Code - Warmup Iterations Explained

# This script explains warmup iterations conceptually using timing examples.
# We simulate a compiled model warmup using simple Python functions.
# Focus on how early iterations differ from later steady iterations.

# import required standard libraries for timing and randomness.
import time
import random
import statistics

# set deterministic random seed for reproducible behavior.
random.seed(42)

# define a fake compiled step with slower first call behavior.
compiled_cache = {"initialized": False}

# define a function that simulates compiled model work.
def fake_compiled_step(batch_size, work_scale):
    if not compiled_cache["initialized"]:
        time.sleep(0.08)
        compiled_cache["initialized"] = True

    work_units = batch_size * work_scale
    total = 0

    for _ in range(work_units):
        total += random.random() * 0.0001

    return total


# define a simple baseline step without compilation warmup.
def baseline_step(batch_size, work_scale):
    work_units = batch_size * work_scale
    total = 0

    for _ in range(work_units):
        total += random.random() * 0.0001

    return total


# define a helper function to time several iterations.
def run_timed_steps(step_fn, iterations, batch_size, work_scale):
    times = []

    for _ in range(iterations):
        start = time.perf_counter()
        _ = step_fn(batch_size, work_scale)
        end = time.perf_counter()
        times.append(end - start)

    return times


# choose small parameters to keep runtime short and safe.
batch_size = 32
work_scale = 120
warmup_iterations = 2
measured_iterations = 5

# run baseline timings without any warmup behavior.
baseline_times = run_timed_steps(
    baseline_step,
    warmup_iterations + measured_iterations,
    batch_size,
    work_scale,
)

# reset compiled cache before compiled timings.
compiled_cache["initialized"] = False

# run compiled timings including warmup and steady iterations.
compiled_times = run_timed_steps(
    fake_compiled_step,
    warmup_iterations + measured_iterations,
    batch_size,
    work_scale,
)

# separate warmup and steady state segments for compiled timings.
compiled_warmup = compiled_times[:warmup_iterations]
compiled_steady = compiled_times[warmup_iterations:]

# compute simple averages for clear comparison.
avg_baseline = statistics.mean(baseline_times)
avg_compiled_all = statistics.mean(compiled_times)
avg_compiled_steady = statistics.mean(compiled_steady)

# print a short header describing the experiment purpose.
print("Comparing baseline and simulated compiled warmup behavior.")

# print first few iteration times to highlight warmup effect.
print("Baseline first three iteration times seconds:", baseline_times[:3])
print("Compiled first three iteration times seconds:", compiled_times[:3])

# print average times including and excluding warmup iterations.
print("Average baseline time all iterations seconds:", round(avg_baseline, 5))
print("Average compiled time all iterations seconds:", round(avg_compiled_all, 5))
print(
    "Average compiled time steady iterations seconds:",
    round(avg_compiled_steady, 5),
)

# print a short explanation connecting warmup to measurement practice.
print(
    "Notice compiled first iterations are slower, so we exclude them",
)
print(
    "Steady compiled average better reflects real repeated deployment performance",
)




### **3.2. Reliable Timing Techniques**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_07/Lecture_A/image_03_02.jpg?v=1769827531" width="250">



>* Keep hardware, inputs, and preprocessing exactly consistent
>* Time only forward and backward passes, exclude overheads

>* Run many timed iterations and summarize results
>* Discard warmup, ignore outliers, handle noisy systems

>* Sync GPU operations before stopping performance timers
>* Keep GPU environment stable to ensure trustworthy benchmarks
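
PyTorch ships a helper that applies several of these practices: `torch.utils.benchmark.Timer` performs a warmup run and synchronizes CUDA when a GPU is in use. The sketch below times the forward pass of a hypothetical eager model; the layer sizes and the 0.2 second minimum run time are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
from torch.utils import benchmark

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).to(device).eval()
inputs = torch.randn(32, 128, device=device)


def forward_only():
    # Time just the forward pass, with gradient tracking disabled.
    with torch.no_grad():
        return model(inputs)


timer = benchmark.Timer(
    stmt="forward_only()",
    globals={"forward_only": forward_only},
    label="eager forward",
)

# blocked_autorange repeats the statement in blocks until min_run_time is reached,
# then reports summary statistics over those blocks.
measurement = timer.blocked_autorange(min_run_time=0.2)
print(measurement)

median_ms = measurement.median * 1e3
iqr_ms = measurement.iqr * 1e3
print(f"median {median_ms:.4f} ms, interquartile range {iqr_ms:.4f} ms")
```

Reporting the median with the interquartile range, rather than a single mean, makes it easier to notice when a noisy system is distorting the result.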



In [None]:
#@title Python Code - Reliable Timing Techniques

# This script demonstrates reliable timing techniques.
# It uses TensorFlow to simulate model execution.
# Focus on warmup, repetition, and synchronization ideas.

# !pip install tensorflow==2.20.0.

# Import required standard and TensorFlow modules.
import os, time, random, numpy as np, tensorflow as tf

# Set deterministic seeds for reproducible random behavior.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Print TensorFlow version and device information.
print("TensorFlow version:", tf.__version__)
print("GPU available:", bool(tf.config.list_physical_devices("GPU")))

# Create a tiny dense model to benchmark execution.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Build model by running one dummy forward pass.
dummy_input = tf.random.uniform(shape=(32, 128))
output = model(dummy_input)
assert output.shape == (32, 10)

# Create a compiled function to simulate torch.compile.
@tf.function(jit_compile=True)
def compiled_forward(x):
    return model(x)

# Define a helper that measures average step time.
def measure_step_time(step_fn, inputs, warmup, repeats, label):
    times = []

    for i in range(warmup + repeats):
        start = time.perf_counter()
        _ = step_fn(inputs)
        if tf.config.list_physical_devices("GPU"):
            tf.test.experimental.sync_devices()
        end = time.perf_counter()
        if i >= warmup:
            times.append(end - start)

    times = np.array(times, dtype=np.float64)
    mean_time = float(times.mean())
    std_time = float(times.std())
    print(label, "mean:", round(mean_time * 1000, 3), "ms",
          "std:", round(std_time * 1000, 3), "ms")

# Prepare fixed random input batch for fair comparison.
inputs = tf.random.uniform(shape=(32, 128))
assert inputs.shape[0] == 32 and inputs.shape[1] == 128

# Warm up both eager and compiled paths before measuring.
_ = model(inputs)
_ = compiled_forward(inputs)

# Measure eager execution time with warmup and repeats.
measure_step_time(model, inputs, warmup=3, repeats=10,
                  label="Eager forward time")

# Measure compiled execution time with same settings.
measure_step_time(compiled_forward, inputs, warmup=3, repeats=10,
                  label="Compiled forward time")




### **3.3. Interpreting Speedup Results**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_07/Lecture_A/image_03_03.jpg?v=1769827574" width="250">



>* Compute speedup as eager time divided by compiled time
>* Always link speedup claims to specific experimental setup

>* Check variability and consistency, not just averages
>* Use percentiles to judge best and worst cases

>* Balance runtime gains against compile overhead costs
>* Consider job length and real-world operational constraints
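
These points can be turned into a small calculation. The sketch below uses only the standard library and entirely made-up timings (in milliseconds) to compute a median-based speedup, the 10th and 90th percentiles, and the number of iterations needed before a one-time compile cost is repaid. The helper names are hypothetical.

```python
import math
import statistics


def summarize(times_ms):
    """Return the median plus 10th and 90th percentiles of per-iteration times."""
    deciles = statistics.quantiles(times_ms, n=10)  # nine cut points
    return {"median": statistics.median(times_ms), "p10": deciles[0], "p90": deciles[-1]}


def break_even_iterations(compile_ms, eager_ms, compiled_ms):
    """Iterations needed before compile cost is repaid, or None if it never is."""
    saved_per_iteration = eager_ms - compiled_ms
    if saved_per_iteration <= 0:
        return None
    return math.ceil(compile_ms / saved_per_iteration)


# Made-up steady-state timings in milliseconds (warmup already discarded).
eager_times_ms = [8.0, 7.5, 8.5, 8.0, 9.0, 7.5, 8.0, 8.5, 8.0, 12.0]
compiled_times_ms = [4.0, 3.5, 4.5, 4.0, 4.0, 3.5, 4.5, 4.0, 4.0, 6.0]
compile_cost_ms = 12000.0  # assumed one-time compilation cost

eager_stats = summarize(eager_times_ms)
compiled_stats = summarize(compiled_times_ms)

# Speedup = eager time / compiled time, using medians to resist outliers.
speedup = eager_stats["median"] / compiled_stats["median"]
break_even = break_even_iterations(
    compile_cost_ms, eager_stats["median"], compiled_stats["median"]
)

print("Eager stats (ms):", eager_stats)
print("Compiled stats (ms):", compiled_stats)
print("Median speedup:", speedup)
print("Iterations to repay compile cost:", break_even)
```

With these numbers, a job shorter than the break-even count would finish sooner in eager mode despite the per-iteration speedup, which is why job length belongs in any speedup claim.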



# <font color="#418FDE" size="6.5" uppercase>**Using torch.compile**</font>


In this lecture, you learned to:
- Describe the purpose and high‑level behavior of torch.compile in PyTorch 2.10.0. 
- Wrap existing nn.Module models with torch.compile and configure basic compilation options. 
- Measure and interpret performance changes after compilation, including warmup effects. 

In the next lecture (Lecture B), we will go over 'Profiling and Tuning'.