# <font color="#418FDE" size="6.5" uppercase>**Autograd Mechanics**</font>

>Last update: 20260129.
    
By the end of this Lecture, you will be able to:
- Describe how PyTorch builds dynamic computation graphs and uses them to compute gradients. 
- Use requires_grad, backward, and grad attributes to compute and inspect gradients for simple tensor operations. 
- Control gradient tracking with torch.no_grad and detach to optimize performance and avoid unintended graph creation. 


## **1. Dynamic Computation Graphs**

### **1.1. Dynamic vs Static Graphs**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_02/Lecture_A/image_01_01.jpg?v=1769694953" width="250">



>* Static graphs are fixed blueprints, hard to change
>* Dynamic graphs are built step-by-step during execution

>* Dynamic graphs follow normal code control flow
>* They simplify flexible, data-dependent model behavior

>* Dynamic graph records forward pass for backpropagation
>* Enables flexible models while keeping gradients correct
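
The same idea can be sketched directly in PyTorch. This is a minimal illustration, assuming a hypothetical helper named `branchy` (not part of any library): an ordinary Python `if` decides which operations are recorded, so each call may build a different graph and produce a different gradient.

```python
import torch

def branchy(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: plain Python control flow picks the recorded ops.
    y = x * 2.0
    if x.item() > 0.0:
        y = y + 1.0
    else:
        y = y - 1.0
    return y * y

# Positive input takes the "+ 1" branch.
x_pos = torch.tensor(2.0, requires_grad=True)
out_pos = branchy(x_pos)
out_pos.backward()

# Negative input takes the "- 1" branch, so a different graph is built.
x_neg = torch.tensor(-1.0, requires_grad=True)
out_neg = branchy(x_neg)
out_neg.backward()

print("x=2.0  ->", out_pos.item(), x_pos.grad.item())
print("x=-1.0 ->", out_neg.item(), x_neg.grad.item())
```

No special conditional operator is needed here because the graph is rebuilt on every forward pass.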



In [None]:
#@title Python Code - Dynamic vs Static Graphs

# This script compares dynamic and static graphs.
# It uses PyTorch style ideas with TensorFlow.
# Focus on control flow and gradient behavior.

# !pip install tensorflow

# Import TensorFlow (eager execution is the default in TensorFlow 2.x).
import tensorflow as tf

# Print TensorFlow version for reference.
print("TensorFlow version:", tf.__version__)

# Set a global random seed for determinism.
tf.random.set_seed(0)

# Define a graph-compiled function that uses TensorFlow control flow ops.
@tf.function
def static_style_fn(x):
    # Use TensorFlow control flow inside tf.function.
    y = x * 2.0
    
    # Add a conditional branch depending on value.
    y = tf.cond(x > 0.0, lambda: y + 1.0, lambda: y - 1.0)
    
    # Return squared result as final output.
    return y * y

# Define a pure eager function for dynamic style.
def dynamic_style_fn(x):
    # Use normal Python control flow with tensors.
    y = x * 2.0
    
    # Branch using regular Python if statement.
    if x > 0.0:
        y = y + 1.0
    else:
        y = y - 1.0
    
    # Return squared result as final output.
    return y * y

# Create a scalar tensor with gradient tracking.
x = tf.Variable(2.0)

# Use GradientTape to record dynamic operations.
with tf.GradientTape() as tape_dynamic:
    # Call the eager dynamic style function.
    y_dynamic = dynamic_style_fn(x)

# Compute gradient dy/dx for dynamic style.
grad_dynamic = tape_dynamic.gradient(y_dynamic, x)

# Use GradientTape with the static style function.
with tf.GradientTape() as tape_static:
    # Call the tf.function compiled static style.
    y_static = static_style_fn(x)

# Compute gradient dy/dx for static style.
grad_static = tape_static.gradient(y_static, x)

# Print outputs to compare behaviors.
print("Input x:", float(x.numpy()))
print("Dynamic style y:", float(y_dynamic.numpy()))
print("Static style y:", float(y_static.numpy()))
print("Dynamic style grad:", float(grad_dynamic.numpy()))
print("Static style grad:", float(grad_static.numpy()))

# Show that dynamic style can change branches easily.
x.assign(-1.0)

# Recompute using dynamic style with new value.
with tf.GradientTape() as tape_dynamic2:
    y_dynamic2 = dynamic_style_fn(x)

# Compute gradient for new dynamic path.
grad_dynamic2 = tape_dynamic2.gradient(y_dynamic2, x)

# Print new results showing different path.
print("New input x:", float(x.numpy()))
print("New dynamic y:", float(y_dynamic2.numpy()))
print("New dynamic grad:", float(grad_dynamic2.numpy()))



### **1.2. Autograd Function Graph**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_02/Lecture_A/image_01_02.jpg?v=1769695025" width="250">



>* Each tensor operation becomes a node in autograd
>* Nodes link functions and tensors, encoding dependencies

>* Graph flows from inputs through layered operations
>* Each node knows how to run backward for gradients

>* Graph records only actually executed operations
>* Supports complex control flow while enabling backpropagation
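
One way to see these nodes is to walk the graph backward from an output. The sketch below assumes a hypothetical helper named `collect_nodes`; it relies on the `grad_fn` and `next_functions` attributes that autograd attaches to tracked tensors and their backward nodes.

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x * x).sum()

def collect_nodes(fn, names=None):
    """Hypothetical helper: depth-first walk over backward nodes from fn."""
    if names is None:
        names = []
    names.append(type(fn).__name__)
    # Each entry is (next_node, input_index); next_node is None for inputs
    # that do not need gradients.
    for next_fn, _ in fn.next_functions:
        if next_fn is not None:
            collect_nodes(next_fn, names)
    return names

node_names = collect_nodes(y.grad_fn)
print(node_names)
```

The walk starts at the node that produced `y` and ends at the nodes that accumulate gradients into the leaf tensor `x`.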



In [None]:
#@title Python Code - Autograd Function Graph

# This script explains PyTorch autograd graphs.
# It shows how operations create function nodes.
# It keeps outputs small and beginner friendly.

# !pip install torch torchvision torchaudio

# Import torch for tensor and autograd features.
import torch

# Print PyTorch version in one short line.
print("PyTorch version:", torch.__version__)

# Create a simple input tensor with gradients enabled.
x = torch.tensor([2.0, 3.0], requires_grad=True)

# Verify tensor shape to avoid unexpected issues.
assert x.shape == (2,), "Input tensor must have length two"

# Build a small computation that uses both elements of x.
y = x[0] ** 2 + 3 * x[1]

# Confirm output is a scalar suitable for backward.
assert y.dim() == 0, "Output y must be a scalar"

# Inspect requires_grad flag and value of y.
print("x requires_grad:", x.requires_grad)
print("y value:", y.item())

# Show which function created y in the graph.
print("y.grad_fn type:", type(y.grad_fn).__name__)

# Call backward to traverse the autograd function graph.
y.backward()

# Inspect gradients accumulated on the leaf tensor x.
print("Gradient dy/dx:", x.grad)

# Build another tensor from x without tracking gradients.
with torch.no_grad():
    z = x * 5.0 + 1.0

# Confirm z does not require gradients or graph links.
print("z requires_grad:", z.requires_grad)
print("z grad_fn is:", z.grad_fn)

# Create a detached view that breaks the function graph.
x_detached = x.detach()

# Show that detached tensor no longer tracks gradients.
print("Detached requires_grad:", x_detached.requires_grad)



### **1.3. Gradient Accumulation Essentials**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_02/Lecture_A/image_01_03.jpg?v=1769695052" width="250">



>* Autograd stores gradients in the grad attribute of leaf tensors
>* Repeated backward calls sum gradients before parameter updates

>* Process small batches, let gradients build up
>* Accumulated gradients mimic large batches with less memory

>* Each forward pass builds a temporary computation graph
>* Gradients accumulate across passes; reset them carefully
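
The claim that accumulated gradients mimic a larger batch can be checked on a toy problem. This is a minimal sketch with made-up data; dividing each micro-batch loss by the number of micro-batches is an assumption that holds here because the chunks have equal size and the loss is a mean.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
data = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])

# Reference: one backward pass over the full batch.
full_loss = ((data @ w) ** 2).mean()
full_loss.backward()
full_grad = w.grad.clone()

# Start from a clean slate before accumulating.
w.grad = None

# Two micro-batches of two rows each; backward sums into w.grad.
num_chunks = 2
for chunk in data.chunk(num_chunks):
    chunk_loss = ((chunk @ w) ** 2).mean() / num_chunks
    chunk_loss.backward()
accum_grad = w.grad.clone()

print("full batch grad: ", full_grad)
print("accumulated grad:", accum_grad)
```

Each micro-batch builds and discards its own graph, so only one chunk of intermediates is alive at a time.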



In [None]:
#@title Python Code - Gradient Accumulation Essentials

# This script explains gradient accumulation basics.
# It uses tiny tensors for clear illustration.
# Run cells in order to follow comments.

# !pip install torch torchvision torchaudio

# Import torch for tensor and autograd operations.
import torch

# Set a deterministic seed for reproducibility.
torch.manual_seed(0)

# Check and print the PyTorch version briefly.
print("PyTorch version:", torch.__version__)

# Create a simple parameter tensor with gradients.
param = torch.tensor([2.0], requires_grad=True)

# Show initial parameter and its gradient value.
print("Initial param:", param.item())
print("Initial grad:", param.grad)

# Define a helper function to compute a tiny loss.
def compute_loss(scale: float) -> torch.Tensor:
    # Compute a simple squared loss scaled externally.
    loss = (scale * param) ** 2
    return loss

# First forward pass with scale one for loss.
loss1 = compute_loss(scale=1.0)

# Backward computes gradient and stores in param.grad.
loss1.backward()

# Show gradient after first backward call.
print("Grad after first backward:", param.grad.item())

# Second forward pass with different loss scale.
loss2 = compute_loss(scale=0.5)

# Backward again accumulates into existing gradient.
loss2.backward()

# Show gradient after second backward accumulation.
print("Grad after second backward:", param.grad.item())

# Manually compute expected accumulated gradient value.
expected_grad = 2 * 1.0 ** 2 * param.detach() + 2 * (0.5 ** 2) * param.detach()

# Print expected gradient for comparison clarity.
print("Expected accumulated grad:", expected_grad.item())

# Always clear gradients before a fresh accumulation.
param.grad.zero_()

# Confirm gradient has been reset to zero now.
print("Grad after manual zero_():", param.grad.item())

# Final line prints a short completion confirmation.
print("Gradient accumulation demo finished.")



## **2. Working With Gradients**

### **2.1. Controlling requires_grad**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_02/Lecture_A/image_02_01.jpg?v=1769695084" width="250">



>* requires_grad marks which tensors need gradients
>* use it to track only learnable parameters

>* requires_grad spreads through operations, building the graph
>* disable gradients for preprocessing; start graph at parameters

>* Unnecessary gradient tracking wastes memory and compute time
>* Track gradients only for parameters and loss calculations
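
A short sketch of how the flag propagates and how a tensor can be frozen. The tensor names and values are illustrative only; `requires_grad_` is the in-place method for switching tracking on or off on a leaf tensor.

```python
import torch

x = torch.tensor([2.0, 3.0])                      # plain data, not tracked
w = torch.tensor([1.5, -0.5], requires_grad=True)  # learnable
b = torch.tensor(0.25, requires_grad=True)         # learnable for now

# Freeze b: autograd will no longer compute a gradient for it.
b.requires_grad_(False)

# A result is tracked if at least one of its inputs is tracked.
untracked = (x * 2.0).requires_grad
out = (x * w).sum() + b
out.backward()

print("x * 2.0 tracked:", untracked)
print("out tracked:", out.requires_grad)
print("w.grad:", w.grad)
print("b.grad:", b.grad)
```

Only `w` receives a gradient; the frozen tensor `b` keeps `grad` set to `None`.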



In [None]:
#@title Python Code - Controlling requires_grad

# This script explains controlling requires_grad clearly.
# It focuses on simple gradient tracking examples.
# Run cells stepwise and read printed explanations.

# torch is not installed by default in some environments.
# Uncomment the next line if torch import fails.
# !pip install torch torchvision torchaudio

# Import torch for tensor and autograd operations.
import torch

# Print torch version to confirm environment details.
print("PyTorch version:", torch.__version__)

# Create input tensor without gradient tracking enabled.
x_data = torch.tensor([2.0, 3.0])

# Create parameter tensor with gradient tracking enabled.
w_param = torch.tensor([1.5, -0.5], requires_grad=True)

# Confirm which tensors currently require gradients.
print("x_data requires_grad:", x_data.requires_grad)
print("w_param requires_grad:", w_param.requires_grad)

# Compute simple linear combination using both tensors.
y_output = (x_data * w_param).sum()

# Call backward to compute gradients for w_param only.
y_output.backward()

# Inspect gradient stored on parameter tensor.
print("Gradient for w_param:", w_param.grad)

# Reset gradients to zero for a clean next example.
w_param.grad.zero_()

# Create a tensor with requires_grad set after creation.
scale = torch.tensor(10.0)

# Enable gradient tracking on scale using requires_grad_ method.
scale.requires_grad_()

# Confirm gradient tracking status for scale tensor.
print("scale requires_grad:", scale.requires_grad)

# Build new computation using both tracked tensors.
result = (scale * (x_data * w_param).sum())

# Compute gradients again with respect to both tensors.
result.backward()

# Show gradients for w_param and scale after backward.
print("New gradient w_param:", w_param.grad)
print("Gradient for scale:", scale.grad)

# Demonstrate turning off gradient tracking for efficiency.
with torch.no_grad():
    prediction = (x_data * w_param).sum()

# Confirm prediction has gradient tracking disabled.
print("prediction requires_grad:", prediction.requires_grad)

# Show that w_param still keeps its gradient tracking flag.
print("w_param still requires_grad:", w_param.requires_grad)



### **2.2. Understanding backward and grad**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_02/Lecture_A/image_02_02.jpg?v=1769695115" width="250">



>* Backward walks the computation graph using chain rule
>* It fills leaf tensors’ grad with needed derivatives

>* grad stays empty until backward is called
>* after backward, grad shows each parameter’s influence

>* Backward needs a scalar output unless a gradient argument is passed; gradients accumulate across calls
>* Clear grads and interpret accumulated values carefully
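
The scalar requirement can be seen with a vector output. In this minimal sketch, calling `backward()` on a non-scalar raises an error, while passing an explicit `gradient` tensor (here all ones, an illustrative choice) tells autograd how to weight each output element.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# A non-scalar output has no implicit starting gradient.
y = x ** 2
try:
    y.backward()
    scalar_required = False
except RuntimeError:
    scalar_required = True

# Supplying weights v computes the vector-Jacobian product.
y2 = x ** 2
v = torch.ones_like(y2)
y2.backward(gradient=v)

print("implicit backward failed:", scalar_required)
print("x.grad:", x.grad)
```

With a vector of ones, the result matches what `y2.sum().backward()` would produce.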



In [None]:
#@title Python Code - Understanding backward and grad

# This script explains backward and grad basics.
# It uses tiny tensors for clear gradients.
# Run cells in order to follow explanations.

# Install PyTorch if not already available.
# !pip install torch torchvision torchaudio

# Import torch for tensor and autograd features.
import torch

# Print PyTorch version for reproducibility.
print("PyTorch version:", torch.__version__)

# Create a simple tensor marked for gradient tracking.
x = torch.tensor([2.0], requires_grad=True)

# Confirm that x is a leaf tensor with grad tracking.
print("x:", x, "requires_grad:", x.requires_grad)

# Build a tiny computation that depends on x.
y = x ** 2 + 3 * x

# Show the computed output value before backward.
print("y before backward:", y)

# Check that y is not a leaf and has grad_fn.
print("Is y leaf?", y.is_leaf, "grad_fn:", y.grad_fn)

# Ensure x.grad is empty before calling backward.
print("x.grad before backward:", x.grad)

# Call backward on scalar y to compute dy/dx.
y.backward()

# After backward, x.grad holds the derivative.
print("x.grad after backward:", x.grad)

# Manually compute derivative for comparison.
manual_grad = 2 * x.detach() + 3

# Show that autograd matches manual derivative.
print("Manual gradient:", manual_grad)

# Reset gradients to zero before another backward.
x.grad.zero_()

# Build a new output using mean over multiple elements.
values = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Confirm shape so backward behavior is predictable.
print("values shape:", values.shape)

# Compute mean to get a scalar output.
mean_output = values.mean()

# Call backward to get gradient of mean w.r.t values.
mean_output.backward()

# Inspect gradients stored in values.grad attribute.
print("values.grad after backward:", values.grad)

# Show that gradients match derivative of mean function.
print("Expected grad for mean is 1/3 each.")




### **2.3. Resetting Gradients Safely**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_02/Lecture_A/image_02_03.jpg?v=1769695145" width="250">



>* Gradients accumulate across steps unless manually cleared
>* Resetting gradients keeps results fresh and meaningful

>* Gradients must be cleared to avoid contamination
>* Fresh gradients keep updates stable and interpretable

>* Use cleared gradients to read models correctly
>* Reset at planned times to control behavior
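
A planned reset usually sits at a fixed point in each update step. The loop below is a hand-written sketch of plain gradient descent on a single scalar; the learning rate `lr` and the three-step loop are arbitrary choices for illustration.

```python
import torch

weight = torch.tensor(2.0, requires_grad=True)
target = torch.tensor(10.0)
lr = 0.01  # illustrative learning rate

history = []
for step in range(3):
    # Forward and backward for this step only.
    loss = (weight * 3.0 - target) ** 2
    loss.backward()
    history.append(weight.grad.item())

    # Update the parameter without recording the update in the graph.
    with torch.no_grad():
        weight -= lr * weight.grad

    # Reset so the next backward starts from a clean gradient.
    weight.grad = None

print("per-step gradients:", history)
```

Setting `grad` to `None` and calling `grad.zero_()` both prevent stale values from leaking into the next step; the former also releases the gradient tensor.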



In [None]:
#@title Python Code - Resetting Gradients Safely

# This script demonstrates resetting gradients safely.
# It uses PyTorch tensors with simple scalar operations.
# Focus on requires_grad, backward, and grad attributes.

# Install PyTorch if not already available in the environment.
# !pip install torch torchvision torchaudio --quiet

# Import torch for tensor and autograd operations.
import torch

# Print the PyTorch version for quick verification.
print("PyTorch version:", torch.__version__)

# Create a scalar parameter with gradient tracking enabled.
weight = torch.tensor(2.0, requires_grad=True)

# Confirm that the tensor is a scalar parameter.
assert weight.shape == torch.Size([])

# Define a simple target value for our tiny example.
target = torch.tensor(10.0, requires_grad=False)

# Confirm that target has the same shape as predictions.
assert target.shape == torch.Size([])

# Compute a simple prediction using the parameter.
prediction = weight * 3.0

# Compute a squared error loss for this prediction.
loss = (prediction - target) ** 2

# Run backpropagation to compute gradient for weight.
loss.backward()

# Print the first gradient value for weight.
print("Gradient after first backward:", weight.grad.item())

# Store the first gradient for comparison later.
first_grad = weight.grad.item()

# Compute a new prediction without resetting gradients.
prediction2 = weight * 3.0

# Compute a new loss using the same formula.
loss2 = (prediction2 - target) ** 2

# Call backward again to accumulate gradients.
loss2.backward()

# Print the accumulated gradient value for weight.
print("Gradient after second backward:", weight.grad.item())

# Show that the accumulated gradient is twice the first value.
# The gradient is negative here, so compare with a tolerance instead of >=.
print("Accumulated equals first times two:", abs(weight.grad.item() - 2 * first_grad) < 1e-6)

# Safely reset gradients to zero before next step.
weight.grad.zero_()

# Confirm that gradients were cleared successfully.
print("Gradient after zero_ call:", weight.grad.item())

# Compute a fresh prediction after clearing gradients.
prediction3 = weight * 3.0

# Compute a fresh loss for the cleared state.
loss3 = (prediction3 - target) ** 2

# Backpropagate again to get a clean gradient.
loss3.backward()

# Print the new gradient which matches a single step.
print("Gradient after reset and backward:", weight.grad.item())

# Verify that the new gradient matches the original one.
print("Reset kept gradient consistent:", abs(weight.grad.item() - first_grad) < 1e-6)



## **3. Controlling Autograd Gradients**

### **3.1. No Grad Context**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_02/Lecture_A/image_03_01.jpg?v=1769695177" width="250">



>* No grad context temporarily disables gradient tracking
>* Use it for evaluation to save computation

>* Training needs gradients; inference usually does not
>* No grad context saves memory and compute

>* Use no_grad for evaluation, metrics, visualizations
>* Prevents graph bloat, bugs, and memory issues
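
`torch.no_grad` can also wrap a whole function as a decorator, which is convenient for evaluation helpers. A minimal sketch, assuming a hypothetical function named `evaluate`:

```python
import torch

w = torch.tensor([1.5, -0.5], requires_grad=True)
x = torch.tensor([2.0, 3.0])

@torch.no_grad()
def evaluate(weights, inputs):
    # Hypothetical evaluation helper: nothing computed here is tracked.
    score = (inputs * weights).sum()
    return score, torch.is_grad_enabled()

score, enabled_inside = evaluate(w, x)
enabled_outside = torch.is_grad_enabled()

print("score:", score.item(), "tracked:", score.requires_grad)
print("grad mode inside:", enabled_inside, "outside:", enabled_outside)
```

Tracking is restored automatically when the function returns, and `w` keeps its `requires_grad` flag.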



In [None]:
#@title Python Code - No Grad Context

# This script shows torch no grad context.
# We compare gradients with and without context.
# Focus on simple tensors not full models.

# Uncomment if torch is not already installed.
# !pip install torch torchvision torchaudio

# Import torch for tensor and autograd operations.
import torch

# Set a manual seed for deterministic behavior.
torch.manual_seed(0)

# Create a tensor with gradient tracking enabled.
x = torch.tensor(2.0, requires_grad=True)

# Confirm that requires_grad is correctly set.
print("x requires_grad:", x.requires_grad)

# Build a simple computation that will track gradients.
y = x ** 3 + 2 * x

# Call backward to compute gradient dy/dx at x.
y.backward()

# Print the gradient computed for x in this graph.
print("Gradient of y with grad tracking:", x.grad.item())

# Reset gradient on x to avoid accumulation issues.
x.grad.zero_()

# Use a no grad context to disable tracking temporarily.
with torch.no_grad():
    y_no_grad = x ** 3 + 2 * x

# Show that result is still a normal tensor value.
print("Value inside no_grad context:", y_no_grad.item())

# Check that the new tensor does not require gradients.
print("y_no_grad requires_grad:", y_no_grad.requires_grad)

# Try calling backward on a tensor created under no_grad.
try:
    y_no_grad.backward()
except RuntimeError as e:
    print("Backward failed on no_grad result:", type(e).__name__)

# Show that x gradient was not changed by no_grad block.
print("Gradient of x after no_grad block:", x.grad.item())




### **3.2. Detach and Inplace Operations**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_02/Lecture_A/image_03_02.jpg?v=1769695205" width="250">



>* Detach stops gradients, treating tensors as constants
>* Lets you control which parameters receive updates

>* Inplace ops can break autograd’s gradient tracking
>* A detached tensor shares storage with its source, so inplace edits change both

>* Use detach().clone() when you need a copy that is safe to modify inplace
>* Avoid inplace changes on tensors tracked by autograd
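
Two details are easy to miss: `detach()` returns a tensor that shares memory with the original, and an inplace change to a tensor that autograd saved for backward is only reported when `backward()` runs. The sketch below illustrates both, using `exp` because its backward pass needs its own output.

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)

# detach() shares storage; detach().clone() makes an independent copy.
view = x.detach()
copy = x.detach().clone()
shares_storage = view.data_ptr() == x.data_ptr()

# Editing the clone leaves x untouched.
copy.add_(1.0)

# exp saves its output for backward, so modifying it inplace is unsafe.
a = torch.tensor([1.0, 2.0], requires_grad=True)
b = a.exp()
b.add_(1.0)
try:
    b.sum().backward()
    inplace_error = False
except RuntimeError:
    inplace_error = True

print("view shares storage with x:", shares_storage)
print("x after editing clone:", x)
print("backward failed after inplace edit:", inplace_error)
```

Whether an inplace edit triggers an error depends on which tensors the operation saved, so some inplace edits on tracked tensors run without complaint.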



In [None]:
#@title Python Code - Detach and Inplace Operations

# This script demonstrates detach and inplace operations.
# It uses PyTorch autograd with simple tensor examples.
# Focus on safe gradient control and inplace behavior.

# Install PyTorch if not already available in the environment.
# !pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# Import torch for tensor and autograd operations.
import torch

# Set a deterministic seed for reproducible tensor values.
torch.manual_seed(0)

# Print the PyTorch version in one concise line.
print("PyTorch version:", torch.__version__)

# Create a tensor with gradients enabled for autograd.
x = torch.tensor([2.0, 3.0], requires_grad=True)

# Confirm the tensor shape is as expected.
assert x.shape == (2,), "Input tensor must have shape (2,)"

# Build a simple computation graph using x squared.
y = x ** 2

# Sum the outputs to obtain a scalar loss value.
loss = y.sum()

# Run backpropagation to compute gradients for x.
loss.backward()

# Print original gradients for x before any detach operations.
print("Gradients for x before detach:", x.grad)

# Stop tracking gradients by creating a detached view of x.
x_detached = x.detach()

# Verify detached tensor shares data but tracks no gradients.
print("Detached requires_grad flag:", x_detached.requires_grad)

# Inplace update on the detached view; x changes too because they share storage.
x_detached.add_(1.0)

# Show values of original and detached tensors after inplace update.
print("Original x after detached update:", x)
print("Detached tensor after inplace add_:", x_detached)

# Clear previous gradients on x before next backward pass.
x.grad.zero_()

# Recompute loss using x, whose values now include the inplace update.
new_loss = (x ** 2).sum()

# Backpropagate again to update gradients based on original x.
new_loss.backward()

# Print gradients; they equal 2 * x for the updated x values.
print("Gradients for x after second backward:", x.grad)

# Demonstrate unsafe inplace attempt on tracked tensor inside try block.
try:
    # Create a new tensor that depends on x for demonstration.
    z = x * 2.0

    # Attempt an inplace operation that may confuse autograd.
    z.add_(1.0)

    # Print z to show the inplace modification result.
    print("Inplace on tracked tensor z succeeded:", z)
except RuntimeError as e:
    # Catch and report autograd error from unsafe inplace operation.
    print("RuntimeError from inplace on tracked tensor:")
    print(str(e).split("\n")[0])

# Print final x values, which include the earlier inplace update.
print("Final x values:", x)




### **3.3. Memory and Speed**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master PyTorch 2.10.0/Module_02/Lecture_A/image_03_03.jpg?v=1769695272" width="250">



>* Gradient tracking stores graphs and intermediates in memory
>* Large models magnify this cost, limiting resources

>* Turn off gradients to skip storing intermediates
>* Use for eval, logging, visualization to save memory

>* Fewer tracked gradients reduce computation and speed training
>* Disabling gradients in inference saves memory and compute
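
To make the stored intermediates visible, one option is to count the tensors autograd saves during the forward pass. This sketch assumes the `torch.autograd.graph.saved_tensors_hooks` context manager is available in your PyTorch version; the `pack` and `unpack` functions are hypothetical pass-through hooks that only record shapes.

```python
import torch

saved_shapes = []

def pack(tensor):
    # Called whenever autograd saves a tensor for the backward pass.
    saved_shapes.append(tuple(tensor.shape))
    return tensor

def unpack(tensor):
    # Called when backward needs the saved tensor again.
    return tensor

x = torch.randn(1000, requires_grad=True)

# With tracking: intermediates are saved for backward.
with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    loss = torch.relu(x * 2.0 + 1.0).mean()
tracked_count = len(saved_shapes)

# Without tracking: nothing is saved.
saved_shapes.clear()
with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    with torch.no_grad():
        loss_eval = torch.relu(x * 2.0 + 1.0).mean()
untracked_count = len(saved_shapes)

print("tensors saved with tracking:", tracked_count)
print("tensors saved under no_grad:", untracked_count)
```

The exact count with tracking depends on the operations involved, but under `no_grad` it drops to zero.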



In [None]:
#@title Python Code - Memory and Speed

# This script demonstrates PyTorch autograd memory control.
# We time the same computation with and without gradient tracking.
# Focus on torch.no_grad and detach for beginners.

# !pip install torch torchvision torchaudio

# Import required standard libraries.
import time
import random

# Import torch for tensor and autograd features.
import torch

# Set deterministic random seeds for reproducibility.
random.seed(0)
torch.manual_seed(0)

# Select device based on GPU availability.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Print PyTorch version and selected device.
print("PyTorch version:", torch.__version__, "Device:", device)

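# Define a helper to measure execution time.
def measure_time(fn, label):
    # Wait for pending GPU work so the timing covers the whole computation.
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn()
    if device.type == "cuda":
        torch.cuda.synchronize()
    end = time.perf_counter()
    elapsed = (end - start) * 1000.0
    print(label, "time_ms=", round(elapsed, 3))
    return result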

# Create an input tensor with gradient tracking enabled.
input_size = 1024 * 16
x = torch.randn(input_size, device=device, requires_grad=True)

# Confirm tensor shape and gradient requirement.
assert x.shape[0] == input_size
assert x.requires_grad is True

# Define a simple computation with gradients enabled.
def forward_with_grad():
    y = x * 2.0 + 1.0
    z = torch.relu(y)
    loss = z.mean()
    loss.backward()
    return loss

# Define the same computation with gradients disabled.
def forward_no_grad():
    with torch.no_grad():
        y = x * 2.0 + 1.0
        z = torch.relu(y)
        loss = z.mean()
    return loss

# Run both functions once to warm up device.
_ = forward_with_grad()
_ = forward_no_grad()

# Zero gradients before timing to avoid accumulation.
if x.grad is not None:
    x.grad.zero_()

# Measure time for computation with gradients.
loss_grad = measure_time(forward_with_grad, "With_grad_forward_backward")

# Reset gradients again before the next measurement.
if x.grad is not None:
    x.grad.zero_()

# Measure time for computation without gradients.
loss_nograd = measure_time(forward_no_grad, "No_grad_forward_only")

# Show that no_grad result has no computation graph.
print("loss_grad_requires_grad:", loss_grad.requires_grad)
print("loss_nograd_requires_grad:", loss_nograd.requires_grad)

# Demonstrate detach to stop gradients from flowing.
intermediate = x * 3.0
intermediate_detached = intermediate.detach()

# Confirm detached tensor does not require gradients.
print("intermediate_requires_grad:", intermediate.requires_grad)
print("detached_requires_grad:", intermediate_detached.requires_grad)

# Use detached tensor in further computation without tracking.
with torch.no_grad():
    summary_value = intermediate_detached.abs().mean().item()

# Print final summary to show script completed.
print("Summary_value_from_detached_tensor:", round(summary_value, 4))




# <font color="#418FDE" size="6.5" uppercase>**Autograd Mechanics**</font>


In this lecture, you learned to:
- Describe how PyTorch builds dynamic computation graphs and uses them to compute gradients. 
- Use requires_grad, backward, and grad attributes to compute and inspect gradients for simple tensor operations. 
- Control gradient tracking with torch.no_grad and detach to optimize performance and avoid unintended graph creation. 

In the next Lecture (Lecture B), we will go over 'Modules and Layers'.