# <font color="#418FDE" size="6.5" uppercase>**Using torch.compile**</font>

>Last update: 20260130.
    
By the end of this Lecture, you will be able to:
- Describe the purpose and high‑level behavior of torch.compile in PyTorch 2.10.0. 
- Wrap existing nn.Module models with torch.compile and configure basic compilation options. 
- Measure and interpret performance changes after compilation, including warmup effects. 


## **1. Core Compile Concepts**

### **1.1. Eager and Compiled Execution**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_07/Lecture_A/image_01_01.jpg?v=1769761473" width="250">



>* Eager mode runs operations immediately, stepwise in Python
>* Python overhead limits optimization, hurting large repeated models

>* Compiler records runs and builds a computation graph
>* Optimized backends run the graph faster with less overhead

>* Compiled mode has a slower warmup phase
>* After warmup, repeated runs can be much faster



In [None]:
#@title Python Code - Eager and Compiled Execution

# This script compares eager and compiled execution.
# It uses PyTorch to run a tiny model.
# Focus is on timing and warmup behavior.

# !pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Import required standard libraries.
import os
import time
import random

# Import torch and check availability.
import torch
import torch.nn as nn

# Set deterministic random seeds.
random.seed(0)
torch.manual_seed(0)

# Select device based on availability.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Print framework version and device.
print("PyTorch version:", torch.__version__)
print("Using device:", device)

# Define a tiny feedforward model.
class TinyNet(nn.Module):
    # Initialize layers with small sizes.
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(128, 64)
        self.act = nn.ReLU()
        self.fc2 = nn.Linear(64, 10)

    # Define forward computation.
    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.fc2(x)
        return x

# Create model instance and move to device.
model_eager = TinyNet().to(device)

# Create a compiled version of the same model.
model_compiled = torch.compile(model_eager)

# Create a small dummy input batch.
batch_size = 32
input_dim = 128

# Validate dimensions before creating tensor.
assert batch_size > 0 and input_dim > 0

# Create deterministic input tensor.
example_input = torch.randn(batch_size, input_dim, device=device)

# Function to time multiple forward passes.
def time_forwards(model, x, repeats, label):
    # Ensure model is in eval mode.
    model.eval()
    # Warmup for fair GPU timing.
    with torch.no_grad():
        _ = model(x)
    # Synchronize if using GPU.
    if device.type == "cuda":
        torch.cuda.synchronize()
    # Measure repeated forward passes.
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(repeats):
            _ = model(x)
    # Synchronize again for accurate timing.
    if device.type == "cuda":
        torch.cuda.synchronize()
    end = time.perf_counter()
    # Compute average time per iteration.
    avg_ms = (end - start) * 1000.0 / repeats
    print(f"{label} average time: {avg_ms:.4f} ms")

# Time eager model for several iterations.
repeats = 20
time_forwards(model_eager, example_input, repeats, "Eager")

# Time compiled model; the untimed warmup call inside time_forwards triggers compilation.
time_forwards(model_compiled, example_input, repeats, "Compiled")

# Show a single forward output shape for sanity.
with torch.no_grad():
    out = model_compiled(example_input)

# Validate output shape matches expectation.
assert out.shape == (batch_size, 10)

# Print final confirmation line.
print("Output shape from compiled model:", tuple(out.shape))



### **1.2. Graph Tracing Basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_07/Lecture_A/image_01_02.jpg?v=1769761537" width="250">



>* Compilation traces model operations into a computation graph
>* Whole-graph view enables powerful performance optimizations

>* First run records operations and tensor details
>* Captured graph is optimized and reused later

>* Works best when computation pattern stays consistent
>* Dynamic, branch-heavy logic may trigger retracing and underperform



In [None]:
#@title Python Code - Graph Tracing Basics

# This script illustrates simple graph-like tracing concepts.
# We simulate recording operations during a model forward pass.
# Focus is on understanding computation graphs conceptually.

# No third-party packages are required; this script uses only the standard library.

# Import standard modules for math and random seeding.
import math
import random

# Set deterministic seeds for reproducible behavior.
random.seed(0)

# Define a tiny tensor class to hold values and name.
class TinyTensor:
    def __init__(self, value, name):
        self.value = float(value)
        self.name = str(name)

# Define a node class representing one operation in graph.
class OpNode:
    def __init__(self, op_type, inputs, output):
        self.op_type = op_type
        self.inputs = inputs
        self.output = output

# Define a tracer that records operations during execution.
class Tracer:
    def __init__(self):
        self.nodes = []

    # Record a new operation node into internal list.
    def record(self, op_type, inputs, output):
        node = OpNode(op_type, inputs, output)
        self.nodes.append(node)

# Create a global tracer instance used by operations.
GLOBAL_TRACER = Tracer()

# Define traced addition that logs into the tracer.
def traced_add(a, b, name):
    out = TinyTensor(a.value + b.value, name)
    GLOBAL_TRACER.record("add", [a, b], out)
    return out

# Define a traced multiply that stands in for matmul on scalar values.
def traced_matmul(a, b, name):
    out = TinyTensor(a.value * b.value, name)
    GLOBAL_TRACER.record("matmul", [a, b], out)
    return out

# Define traced activation using tanh as simple nonlinearity.
def traced_activation(a, name):
    out = TinyTensor(math.tanh(a.value), name)
    GLOBAL_TRACER.record("tanh", [a], out)
    return out

# Define a tiny model that uses our traced operations.
class TinyModel:
    def __init__(self, w1, w2, bias):
        self.w1 = TinyTensor(w1, "w1")
        self.w2 = TinyTensor(w2, "w2")
        self.bias = TinyTensor(bias, "bias")

    # Forward pass that will be traced into a computation graph.
    def forward(self, x_tensor):
        h = traced_matmul(x_tensor, self.w1, "h_linear")
        h_act = traced_activation(h, "h_tanh")
        y_lin = traced_matmul(h_act, self.w2, "y_linear")
        y_out = traced_add(y_lin, self.bias, "y_out")
        return y_out

# Helper function to pretty print the traced computation graph.
def print_graph(tracer):
    print("Recorded computation graph nodes:")
    for idx, node in enumerate(tracer.nodes):
        in_names = ",".join(t.name for t in node.inputs)
        line = f"{idx}: {node.op_type}({in_names}) -> {node.output.name}"
        print(line)

# Create model instance with simple deterministic parameters.
model = TinyModel(w1=1.5, w2=0.5, bias=0.1)

# Create an input tensor representing example model input.
input_value = TinyTensor(2.0, "x_input")

# Run a first forward pass to simulate tracing warmup.
output_one = model.forward(input_value)

# Print the numeric result of the first forward pass.
print("First run output value:", round(output_one.value, 4))

# Show the recorded computation graph after first run.
print_graph(GLOBAL_TRACER)

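# Replay the recorded graph directly, which simulates reusing a cached graph.
def replay_graph(nodes, feed):
    # Map each recorded op type to plain Python arithmetic.
    ops = {"add": lambda a, b: a + b, "matmul": lambda a, b: a * b, "tanh": math.tanh}
    values = dict(feed)
    for node in nodes:
        # Use fed or computed values first, then fall back to stored constants.
        args = [values.get(t.name, t.value) for t in node.inputs]
        values[node.output.name] = ops[node.op_type](*args)
    return values[nodes[-1].output.name]

# Run a second pass from the graph; no Python-level tracing happens here.
nodes_before = len(GLOBAL_TRACER.nodes)
output_two = replay_graph(GLOBAL_TRACER.nodes, {"x_input": input_value.value})

# Print the numeric result and the number of newly recorded nodes.
print("Second run output value:", round(output_two, 4))
print("New nodes recorded during replay:", len(GLOBAL_TRACER.nodes) - nodes_before)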
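
The toy tracer above mimics the idea in plain Python. With real PyTorch, one way to look at what torch.compile captures is to pass a custom backend callable. The sketch below assumes a PyTorch 2.x install; `inspect_backend`, `captured_ops`, and `tiny_forward` are hypothetical names introduced only for illustration. The backend records the operations of the captured FX graph and returns the graph's plain forward, so no optimization happens here.

```python
import torch

# Hypothetical store for the operations seen in each captured graph.
captured_ops = []

def inspect_backend(gm, example_inputs):
    # torch.compile passes a custom backend the captured FX GraphModule
    # together with example inputs.
    ops = [str(node.target) for node in gm.graph.nodes
           if node.op not in ("placeholder", "output")]
    captured_ops.append(ops)
    # Returning gm.forward runs the captured graph as-is, without optimization.
    return gm.forward

def tiny_forward(x):
    # Same arithmetic as the toy model: tanh(x * w1) * w2 + bias.
    return torch.tanh(x * 1.5) * 0.5 + 0.1

compiled_forward = torch.compile(tiny_forward, backend=inspect_backend)

x = torch.full((4,), 2.0)
first = compiled_forward(x)   # first call: tracing happens, backend is invoked
second = compiled_forward(x)  # same shape and dtype: the cached graph is reused

print("Graphs captured:", len(captured_ops))
print("Ops in first graph:", captured_ops[0])
print("Matches eager:", torch.allclose(first, tiny_forward(x)))
```

Because the second call uses an input with the same shape and dtype, the backend should not be invoked again; calling with a different shape is one way to observe retracing.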




### **1.3. Supported operations**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_07/Lecture_A/image_01_03.jpg?v=1769761612" width="250">



>* Compiler targets safe, graph-friendly model operations
>* Most common tensor layers compile automatically for speed

>* Compiler prefers regular PyTorch tensor-style operations
>* Unsupported Python logic runs eagerly and reduces speedup

>* Dynamic shapes may trigger multiple graphs or fallback
>* Stable tensor shapes and libraries give best speedups



In [None]:
#@title Python Code - Supported operations

# This script shows supported compiled operations simply.
# We compare compiled and eager tensor operations briefly.
# Focus is on torch.compile friendly tensor behavior.

# !pip install torch torchvision torchaudio

# Import required standard libraries safely.
import os
import time
import random

# Import torch and check availability defensively.
import torch
import torch.nn as nn

# Set deterministic seeds for reproducibility.
random.seed(0)
torch.manual_seed(0)

# Detect device preferring cuda then cpu.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Print torch version and chosen device once.
print("Torch version:", torch.__version__, "Device:", device)

# Define a simple model using supported operations.
class SmallSupportedNet(nn.Module):
    # Initialize layers with common building blocks.
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(28 * 28, 64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 10)

    # Define forward using standard tensor operations.
    def forward(self, x):
        x = self.flatten(x)
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

# Create model instance and move to device.
model_eager = SmallSupportedNet().to(device)

# Create a compiled version using torch.compile.
model_compiled = torch.compile(model_eager)

# Create a small batch of fake image data.
batch_size = 32
input_shape = (batch_size, 1, 28, 28)

# Validate shape before creating random input.
if len(input_shape) != 4:
    raise ValueError("Input shape must have four dimensions")

# Create deterministic random input tensor.
example_input = torch.randn(input_shape, device=device)

# Function to time a single forward pass.
def time_forward(model, x, label):
    # Wait for pending GPU work so it is not counted in this measurement.
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        out = model(x)
    # CUDA kernels run asynchronously, so synchronize before stopping the clock.
    if device.type == "cuda":
        torch.cuda.synchronize()
    end = time.perf_counter()
    # Validate output shape for safety.
    if out.shape != (batch_size, 10):
        raise ValueError("Unexpected output shape from model")
    elapsed = (end - start) * 1000.0
    print(label, "time ms:", round(elapsed, 3))

# Show that operations are supported in eager mode.
print("Running eager model with supported operations")

# Time eager forward pass once.
time_forward(model_eager, example_input, "Eager first")

# First compiled call includes graph capture overhead.
print("Running compiled model first time for warmup")

# Time compiled forward pass first warmup.
time_forward(model_compiled, example_input, "Compiled first")

# Second compiled call usually runs optimized graph.
print("Running compiled model second time optimized")

# Time compiled forward pass second run.
time_forward(model_compiled, example_input, "Compiled second")

# Demonstrate unsupported style using heavy python branching.
def unsupported_style(x):
    # Use python side branching not tensor friendly.
    values = []
    for i in range(x.shape[0]):
        if i % 2 == 0:
            values.append(x[i].sum().item())
        else:
            values.append(x[i].mean().item())
    return values

# Call unsupported style once to show it still works.
result_values = unsupported_style(example_input.cpu())

# Print short message about unsupported style behavior.
print("Unsupported style ran in python, length:", len(result_values))
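
The loop above pulls values back into Python with `.item()` and branches per sample, which is the kind of logic that typically stays outside a compiled graph. One possible tensor-friendly rewrite is sketched below. It assumes both reductions are cheap enough to compute for every row; `loop_style` and `tensor_style` are hypothetical names, and the rewrite is one option among several rather than a prescribed fix.

```python
import torch

def loop_style(x):
    # Per-sample Python loop: each .item() call moves a value back to Python.
    values = []
    for i in range(x.shape[0]):
        if i % 2 == 0:
            values.append(x[i].sum().item())
        else:
            values.append(x[i].mean().item())
    return values

def tensor_style(x):
    # Compute both reductions for every sample, then select per row with a mask.
    flat = x.flatten(start_dim=1)
    sums = flat.sum(dim=1)
    means = flat.mean(dim=1)
    is_even = torch.arange(x.shape[0], device=x.device) % 2 == 0
    return torch.where(is_even, sums, means)

torch.manual_seed(0)
batch = torch.randn(8, 1, 28, 28)
expected = torch.tensor(loop_style(batch))
result = tensor_style(batch)
print("Results agree:", torch.allclose(result, expected, atol=1e-4))
```

The tensor version does slightly more arithmetic, since it computes both the sum and the mean for every row, but it stays entirely in tensor operations and returns a tensor instead of a Python list.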



## **2. torch.compile Options**

### **2.1. Basic compile syntax**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_07/Lecture_A/image_02_01.jpg?v=1769761691" width="250">



>* Wrap your existing model with torch.compile
>* Compiled model keeps same interface and training code

>* Compiled model is used like original model
>* Training code stays same except compile line

>* Use the compiled model consistently in workflows
>* Compile main performance-critical models; mix with uncompiled



In [None]:
#@title Python Code - Basic compile syntax

# This script demonstrates basic torch.compile syntax.
# It compares original and compiled PyTorch models simply.
# Focus on wrapping an nn.Module with torch.compile.

# !pip install torch torchvision

# Import standard libraries for reproducibility.
import os
import random
import time

# Import torch and neural network modules.
import torch
import torch.nn as nn

# Set deterministic random seeds for reproducibility.
random.seed(0)
torch.manual_seed(0)

# Detect device preferring cuda when available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Print the PyTorch version and chosen device.
print("PyTorch version:", torch.__version__, "Device:", device)

# Define a simple feedforward neural network model.
class SimpleNet(nn.Module):
    # Initialize layers with small hidden dimension.
    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    # Define the forward computation for the model.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

# Create an instance of the original model.
input_dim, hidden_dim, output_dim = 128, 64, 10
model = SimpleNet(input_dim, hidden_dim, output_dim).to(device)

# Create a small random input batch for testing.
batch_size = 32
example_input = torch.randn(batch_size, input_dim, device=device)

# Validate the input shape before running model.
assert example_input.shape == (batch_size, input_dim)

# Run a single forward pass with original model.
with torch.no_grad():
    original_output = model(example_input)

# Validate output shape matches expected dimensions.
assert original_output.shape == (batch_size, output_dim)

# Wrap the existing model using torch compile function.
compiled_model = torch.compile(model, mode="default", fullgraph=False)

# Run a warmup pass to trigger compilation.
with torch.no_grad():
    _ = compiled_model(example_input)

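# Time several forward passes and return the average seconds per run.
def time_model(run_model: nn.Module, inputs: torch.Tensor, runs: int) -> float:
    # Wait for pending GPU work so it is not counted in the measurement.
    if device.type == "cuda":
        torch.cuda.synchronize()
    start_time = time.perf_counter()
    # Disable gradient tracking during timing.
    with torch.no_grad():
        for _ in range(runs):
            _ = run_model(inputs)
    # CUDA kernels launch asynchronously, so synchronize before stopping the clock.
    if device.type == "cuda":
        torch.cuda.synchronize()
    end_time = time.perf_counter()
    return (end_time - start_time) / runs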

# Choose a small number of runs for quick timing.
num_runs = 20

# Measure average time for original model.
avg_time_original = time_model(model, example_input, num_runs)

# Measure average time for compiled model.
avg_time_compiled = time_model(compiled_model, example_input, num_runs)

# Compute simple speedup ratio safely.
speedup = avg_time_original / avg_time_compiled if avg_time_compiled > 0 else 1.0

# Print concise timing comparison results.
print("Average original time (ms):", round(avg_time_original * 1000, 3))
print("Average compiled time (ms):", round(avg_time_compiled * 1000, 3))
print("Speedup factor (original divided by compiled):", round(speedup, 3))

# Demonstrate that compiled model is drop in replacement.
with torch.no_grad():
    another_input = torch.randn(batch_size, input_dim, device=device)

# Validate new input shape before using compiled model.
assert another_input.shape == (batch_size, input_dim)

# Run compiled model on new input and check shape.
with torch.no_grad():
    compiled_output = compiled_model(another_input)

# Final assertion confirms interface compatibility.
assert compiled_output.shape == (batch_size, output_dim)




### **2.2. Backends and modes**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_07/Lecture_A/image_02_02.jpg?v=1769761870" width="250">



>* Backend chooses how models run on hardware
>* Backend and mode define speed, stability, debuggability

>* Start with default backend and general mode
>* Later tune backends and modes for workloads

>* Production needs stable, predictable backends and modes
>* Match backend and mode to workload constraints



In [None]:
#@title Python Code - Backends and modes

# This script demonstrates torch compile backends.
# It focuses on backends and modes usage.
# Run cells sequentially inside Google Colab.

# !pip install torch torchvision

# Import required standard libraries.
import os
import random
import time

# Import torch and check availability.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Set deterministic random seeds.
random.seed(0)
torch.manual_seed(0)

# Select device based on availability.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Print basic environment information.
print("Torch version:", torch.__version__)
print("Using device:", device)

# Define a tiny convolutional network.
class TinyConvNet(nn.Module):
    # Initialize layers for the tiny network.
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 4, kernel_size=3)
        self.fc = nn.Linear(4 * 26 * 26, 10)

    # Define forward pass with simple operations.
    def forward(self, x):
        x = self.conv(x)
        x = F.relu(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)
        return x

# Create model instance and move to device.
model = TinyConvNet().to(device)

# Create a small random input batch.
batch_size = 16
input_shape = (batch_size, 1, 28, 28)

# Validate input shape before using it.
assert len(input_shape) == 4
x = torch.randn(input_shape, device=device)

# Define a helper function for timing.
def run_timed(model_fn, x, steps, label):
    # Warm up model before timing.
    with torch.no_grad():
        for _ in range(2):
            _ = model_fn(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(steps):
            _ = model_fn(x)
    if device.type == "cuda":
        torch.cuda.synchronize()
    end = time.perf_counter()
    avg = (end - start) / steps
    print(label, "average seconds:", round(avg, 6))

# Define an eager baseline function.
baseline_fn = model

# Time the eager baseline model.
run_timed(baseline_fn, x, steps=10, label="Eager baseline")

# Choose a backend and mode for compilation.
backend_choice = "inductor"
mode_choice = "default"

# Compile the model with chosen options.
compiled_model = torch.compile(
    model,
    backend=backend_choice,
    mode=mode_choice,
)

# Time the compiled model with warmup.
run_timed(compiled_model, x, steps=10, label="Compiled default mode")

# Try reduce-overhead mode, which targets Python and kernel-launch overhead.
alt_mode_choice = "reduce-overhead"

# Compile another model variant with the new mode.
compiled_reduce_overhead = torch.compile(
    model,
    backend=backend_choice,
    mode=alt_mode_choice,
)

# Time the alternative mode compiled model.
run_timed(
    compiled_reduce_overhead,
    x,
    steps=10,
    label="Compiled reduce-overhead mode",
)




### **2.3. Handling Compiler Fallbacks**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_07/Lecture_A/image_02_03.jpg?v=1769761935" width="250">



>* Compiled and eager execution can run together
>* Fallbacks are normal and don’t mean failure

>* Watch compiler warnings to spot eager fallbacks
>* Use fallbacks to guide refactoring performance hotspots

>* Accept some fallbacks if they’re not bottlenecks
>* Profile, optimize hot paths, let others fallback



In [None]:
#@title Python Code - Handling Compiler Fallbacks

# This script illustrates torch.compile fallbacks conceptually.
# We simulate compile behavior using plain Python logic.
# Focus on understanding mixed fast and slow paths.

# PyTorch is not needed here; the demo uses only the standard library.
# In real use you would install torch and call torch.compile.

# Import standard modules for timing, randomness, and math.
import time
import random
import math

# Set deterministic random seed for reproducibility.
random.seed(0)

# Define a simple fast path using pure math.
def fast_path(x_list):
    # Use list comprehension to simulate compiled operations.
    return [math.tanh(v) for v in x_list]

# Define a slower path with dynamic Python branching.
def slow_path(x_list):
    # Use explicit loop and branches to mimic fallback.
    out = []
    for v in x_list:
        if v > 0.5:
            out.append(math.sin(v))
        else:
            out.append(math.cos(v))
    return out

# Define a model that mixes fast and slow paths.
def mixed_model(x_list, use_slow):
    # Always apply fast path first for all elements.
    mid = fast_path(x_list)

    # Optionally apply slow path to simulate fallback.
    if use_slow:
        return slow_path(mid)
    return mid

# Create small deterministic input data list.
input_data = [i / 50.0 for i in range(50)]

# Validate input size before timing operations.
if len(input_data) != 50:
    raise ValueError("Unexpected input size for demo")

# Warm up both modes to simulate compiler warmup.
_ = mixed_model(input_data, use_slow=False)
_ = mixed_model(input_data, use_slow=True)

# Time the mostly compiled style without fallback.
start_fast = time.perf_counter()
for _ in range(2000):
    _ = mixed_model(input_data, use_slow=False)
fast_duration = time.perf_counter() - start_fast

# Time the mixed execution with fallback enabled.
start_slow = time.perf_counter()
for _ in range(2000):
    _ = mixed_model(input_data, use_slow=True)
slow_duration = time.perf_counter() - start_slow

# Print short summary explaining the timing results.
print("Fast path only duration:", round(fast_duration, 4))
print("Mixed path with fallback duration:", round(slow_duration, 4))
print("Fallback overhead factor:", round(slow_duration / fast_duration, 2))
print("Note: Here Python simulates compile fallbacks conceptually.")
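
The simulation above only imitates a fallback. In real PyTorch, a fallback usually shows up as a graph break: the compiler splits the function into several graphs and runs the code in between eagerly. The sketch below assumes a PyTorch 2.x install; `counting_backend`, `graph_sizes`, and `mixed_fn` are hypothetical names, and `torch._dynamo.graph_break()` is used as a stand-in for code the compiler cannot trace.

```python
import torch
import torch._dynamo

# Hypothetical record of the node count of each graph the compiler captures.
graph_sizes = []

def counting_backend(gm, example_inputs):
    # Record the size of each captured graph, then run it unoptimized.
    graph_sizes.append(len(list(gm.graph.nodes)))
    return gm.forward

def mixed_fn(x):
    y = torch.tanh(x)
    # Force a graph break, standing in for untraceable Python code.
    torch._dynamo.graph_break()
    return torch.sin(y)

x = torch.linspace(0.0, 1.0, 50)

# Default behavior: the function is split into separate graphs around the break.
split_fn = torch.compile(mixed_fn, backend=counting_backend)
out = split_fn(x)
print("Graphs compiled:", len(graph_sizes))

# fullgraph=True turns the silent fallback into an error.
torch._dynamo.reset()
strict_fn = torch.compile(mixed_fn, backend="eager", fullgraph=True)
try:
    strict_fn(x)
    strict_failed = False
except Exception as err:
    strict_failed = True
    print("fullgraph=True raised:", type(err).__name__)
```

Compiling with `fullgraph=True` during development is one way to find breaks early; dropping it again lets non-critical code fall back while the hot path stays compiled.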




## **3. Benchmarking Model Speed**

### **3.1. Warmup Iterations Explained**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_07/Lecture_A/image_03_01.jpg?v=1769761996" width="250">



>* Early compiled runs are slow warmup iterations
>* Ignore warmup to judge true steady performance

>* Early runs include compilation and hardware setup
>* Ignore these noisy iterations; measure only stabilized performance

>* Warmup is like warming a cold car
>* Ignore warmup timings; measure only steady performance



In [None]:
#@title Python Code - Warmup Iterations Explained

# This script explains warmup iterations clearly.
# It uses TensorFlow's tf.function as a stand-in for a compiled model.
# The same warmup versus steady state pattern applies to torch.compile.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import time
import random

# Import TensorFlow and numpy libraries.
import tensorflow as tf
import numpy as np

# Set deterministic random seeds for reproducibility.
random.seed(0)
np.random.seed(0)
tf.random.set_seed(0)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Select device string based on GPU availability.
physical_gpus = tf.config.list_physical_devices("GPU")
if physical_gpus:
    device_name = "/GPU:0"
else:
    device_name = "/CPU:0"

# Define a small dense model for timing.
model = tf.keras.Sequential(
    [
        tf.keras.layers.Input(shape=(128,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ]
)

# Build model once by calling on dummy data.
dummy_input = tf.zeros((1, 128), dtype=tf.float32)
_ = model(dummy_input)


# Create a tf.function to mimic compiled graph.
@tf.function
def compiled_forward(x):
    return model(x)

# Create a small batch of random input data.
batch_size = 64
input_data = tf.random.normal((batch_size, 128))

# Validate input shape before timing.
assert input_data.shape == (batch_size, 128)

# Helper function to time several iterations.
def time_function(fn, warmup, measured):
    times = []
    for i in range(warmup + measured):
        start = time.perf_counter()
        _ = fn(input_data)
        end = time.perf_counter()
        if i >= warmup:
            times.append(end - start)
    return times

# Choose warmup and measured iteration counts.
warmup_iters = 3
measured_iters = 5

# Time eager model without warmup separation.
raw_times = time_function(model, warmup=0, measured=measured_iters)

# Time compiled model with explicit warmup.
compiled_times = time_function(
    compiled_forward,
    warmup=warmup_iters,
    measured=measured_iters,
)

# Compute simple averages for both timing lists.
raw_avg = float(np.mean(raw_times))
compiled_avg = float(np.mean(compiled_times))

# Print explanation of warmup configuration.
print("Device used:", device_name)
print("Warmup iterations for compiled:", warmup_iters)
print("Measured iterations per run:", measured_iters)

# Show per iteration times for compiled model.
print("Compiled measured times (seconds):")
for t in compiled_times:
    print(round(t, 6))

# Print average times to compare steady performance.
print("Average eager time (seconds):", round(raw_avg, 6))
print("Average compiled time (seconds):", round(compiled_avg, 6))
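
The same warmup pattern can be observed directly with torch.compile by timing each call separately. The sketch below is CPU-only and assumes a PyTorch 2.x install. It uses `backend="eager"` purely so it runs without a C++ toolchain; that is an assumption of this sketch, and the default backend normally shows a much larger first-call cost because it also generates code. On a GPU, each timing would additionally need `torch.cuda.synchronize()` before reading the clock.

```python
import statistics
import time

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()

# Assumption: the "eager" backend keeps the sketch lightweight; it still
# performs graph capture on the first call.
compiled = torch.compile(model, backend="eager")
x = torch.randn(64, 128)

call_times_ms = []
with torch.no_grad():
    for _ in range(6):
        start = time.perf_counter()
        compiled(x)
        call_times_ms.append((time.perf_counter() - start) * 1000.0)

first_call = call_times_ms[0]
steady_state = statistics.median(call_times_ms[1:])
print(f"First call (includes graph capture): {first_call:.3f} ms")
print(f"Median of later calls: {steady_state:.3f} ms")
```

Reporting the first call separately from the steady-state median keeps the one-time cost visible without letting it distort the per-iteration figure.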




### **3.2. Reliable Timing Methods**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_07/Lecture_A/image_03_02.jpg?v=1769762065" width="250">



>* Use consistent, repeatable timing with fixed settings
>* Standardize conditions to isolate true compilation speedup

>* Separate one-time setup from steady-state work
>* Time steady-state iterations repeatedly for meaningful averages

>* Control randomness and system noise during benchmarking
>* Repeat tests, stabilize environment, analyze result distributions



In [None]:
#@title Python Code - Reliable Timing Methods

# This script shows reliable timing methods.
# We compare simple eager and compiled functions.
# Focus is on warmup and steady state timing.

# !pip install torch

# Import required standard libraries.
import os
import time
import random

# Import torch and check availability.
import torch

# Set deterministic random seeds.
random.seed(0)
torch.manual_seed(0)

# Select device based on availability.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Print framework version and device.
print("PyTorch version:", torch.__version__, "Device:", device)

# Define a tiny model for timing.
class TinyNet(torch.nn.Module):
    # Initialize linear layers and activation.
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(128, 128)
        self.act = torch.nn.ReLU()
        self.fc2 = torch.nn.Linear(128, 10)

    # Define forward computation path.
    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.fc2(x)
        return x

# Create model instance and move to device.
model_eager = TinyNet().to(device)

# Create compiled version if available.
if hasattr(torch, "compile"):
    model_compiled = torch.compile(model_eager)
else:
    model_compiled = model_eager

# Put models in evaluation mode.
model_eager.eval()
model_compiled.eval()

# Create fixed input batch for timing.
batch_size = 64
input_shape = (batch_size, 128)

# Build input tensor and move to device.
inputs = torch.randn(input_shape, device=device)

# Validate input shape before timing.
assert inputs.shape == torch.Size(input_shape)

# Helper to synchronize device safely.
def sync_device():
    # Synchronize only when cuda is available.
    if torch.cuda.is_available():
        torch.cuda.synchronize()

# Helper to time multiple iterations.
def time_model(model, n_warmup, n_iters):
    # Warmup iterations not included in timing.
    with torch.no_grad():
        for _ in range(n_warmup):
            _ = model(inputs)
        sync_device()
        start = time.perf_counter()
        for _ in range(n_iters):
            _ = model(inputs)
        sync_device()
        end = time.perf_counter()
    # Compute average time per iteration.
    avg = (end - start) / float(n_iters)
    return avg

# Define warmup and measured iteration counts.
warmup_iters = 5
measured_iters = 20

# Time eager model multiple runs.
eager_times = []
for _ in range(3):
    t = time_model(model_eager, warmup_iters, measured_iters)
    eager_times.append(t)

# Time compiled model multiple runs.
compiled_times = []
for _ in range(3):
    t = time_model(model_compiled, warmup_iters, measured_iters)
    compiled_times.append(t)

# Compute simple statistics helper.
def summarize(times):
    # Return min, max, and mean values.
    mn = min(times)
    mx = max(times)
    avg = sum(times) / float(len(times))
    return mn, mx, avg

# Summarize eager and compiled timings.
e_min, e_max, e_avg = summarize(eager_times)
c_min, c_max, c_avg = summarize(compiled_times)

# Print timing summary with few lines.
print("Eager avg per iter (s):", round(e_avg, 6))
print("Eager range (s):", round(e_min, 6), "to", round(e_max, 6))
print("Compiled avg per iter (s):", round(c_avg, 6))
print("Compiled range (s):", round(c_min, 6), "to", round(c_max, 6))

# Print the relative speedup; values below 1.0 mean the compiled model was slower.
speedup = e_avg / c_avg if c_avg > 0 else 1.0
print("Estimated speedup factor:", round(speedup, 3))
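
As an alternative to hand-written timing loops, PyTorch ships `torch.utils.benchmark.Timer`, which handles warmup runs and CUDA synchronization and reports distribution statistics. The sketch below assumes a PyTorch 2.x install and times only an eager model on CPU; `forward_no_grad` is a hypothetical helper. A compiled model could be passed through `globals` in the same way.

```python
import torch
import torch.nn as nn
import torch.utils.benchmark as benchmark

model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 10)).eval()
inputs = torch.randn(64, 128)

def forward_no_grad(m, x):
    # Keep gradient tracking out of the measurement.
    with torch.no_grad():
        return m(x)

timer = benchmark.Timer(
    stmt="forward_no_grad(model, inputs)",
    globals={"forward_no_grad": forward_no_grad, "model": model, "inputs": inputs},
    label="TinyNet forward",
    description="eager",
)

# Run repeatedly for at least min_run_time seconds and collect statistics.
measurement = timer.blocked_autorange(min_run_time=0.2)
print(measurement)
print("Median seconds per call:", measurement.median)
```

Note that `Timer` runs single-threaded by default (`num_threads=1`), so its numbers can differ from a plain `time.perf_counter` loop that uses all available threads.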



### **3.3. Interpreting Speedup Results**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_07/Lecture_A/image_03_03.jpg?v=1769762146" width="250">



>* Compare pre‑ and post‑compile times as speedup
>* Use steady‑state, post‑warmup timings for decisions

>* Raw speedup may not improve full pipeline
>* Interpret gains by workload type and resource costs

>* Look for patterns, batch effects, and variability
>* Decide when compilation helps, hurts, or needs tuning
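
The points above can be made concrete with a little arithmetic. The sketch below uses only the standard library and made-up timings in milliseconds; the helper names `speedup`, `break_even_iterations`, and `pipeline_speedup` are hypothetical. It computes the per-iteration speedup, the number of iterations needed to repay a one-time compile cost, and the end-to-end gain when the model is only part of each step (Amdahl's law).

```python
import math

def speedup(eager_ms: float, compiled_ms: float) -> float:
    # A ratio above 1.0 means the compiled model is faster per iteration.
    return eager_ms / compiled_ms

def break_even_iterations(compile_ms: float, eager_ms: float, compiled_ms: float):
    # Iterations needed before the one-time compile cost is paid back.
    saved_per_iter = eager_ms - compiled_ms
    if saved_per_iter <= 0:
        return None  # compilation never pays off at these timings
    return math.ceil(compile_ms / saved_per_iter)

def pipeline_speedup(model_fraction: float, model_speedup: float) -> float:
    # Amdahl's law: only the model's share of the step time gets faster.
    return 1.0 / ((1.0 - model_fraction) + model_fraction / model_speedup)

# Hypothetical steady-state timings, not measurements.
eager_ms, compiled_ms, compile_ms = 4.0, 2.0, 30000.0

model_gain = speedup(eager_ms, compiled_ms)
payback = break_even_iterations(compile_ms, eager_ms, compiled_ms)
end_to_end = pipeline_speedup(model_fraction=0.5, model_speedup=model_gain)

print("Model speedup:", model_gain)
print("Break-even iterations:", payback)
print("End-to-end speedup at 50% model share:", round(end_to_end, 3))
```

With these illustrative numbers, a model that runs twice as fast still needs thousands of iterations to repay compilation, and the full pipeline speeds up by less than the model does because data loading and other work are unchanged.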



# <font color="#418FDE" size="6.5" uppercase>**Using torch.compile**</font>


In this lecture, you learned to:
- Describe the purpose and high‑level behavior of torch.compile in PyTorch 2.10.0. 
- Wrap existing nn.Module models with torch.compile and configure basic compilation options. 
- Measure and interpret performance changes after compilation, including warmup effects. 

In the next Lecture (Lecture B), we will go over 'Profiling and Tuning'.