# <font color="#418FDE" size="6.5" uppercase>**Autograd with TF**</font>

>Last update: 20260125.
    
By the end of this lecture, you will be able to:
- Use tf.GradientTape to compute gradients of scalar losses with respect to TensorFlow variables. 
- Interpret gradient values to understand how parameter changes affect a loss function. 
- Implement a simple manual training step using gradients and an optimizer. 


## **1. GradientTape Essentials**

### **1.1. Tracing Operations**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_02/Lecture_B/image_01_01.jpg?v=1769368894" width="250">



>* GradientTape records the operations executed inside its context on watched tensors (trainable variables are watched automatically)
>* This record lets TensorFlow apply the chain rule backward from the loss

>* The record is rebuilt on every run
>* Gradients follow the path that actually executed, which supports flexible control flow

>* All gradient-relevant ops must run inside the tape context
>* Otherwise those steps are not recorded and the gradient comes back as None



In [None]:
#@title Python Code - Tracing Operations

# This script demonstrates TensorFlow GradientTape tracing.
# It focuses on how operations are recorded dynamically.
# Run cells to see gradients from traced computations.

# !pip install tensorflow==2.20.0

# Import importlib.util to check that TensorFlow is installed.
import importlib.util

# Look up the tensorflow package without importing it.
spec = importlib.util.find_spec("tensorflow")

# Handle missing TensorFlow with clear message.
if spec is None:
    raise ImportError("TensorFlow is required for this script.")

# Import tensorflow after confirming availability.
import tensorflow as tf

# Force TensorFlow to use CPU only to avoid CUDA errors.
tf.config.set_visible_devices([], "GPU")

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Set a deterministic random seed for reproducibility.
tf.random.set_seed(42)

# Create a simple scalar variable for tracing.
w = tf.Variable(2.0, dtype=tf.float32, name="weight")

# Define a small helper to print gradients.
def show_gradients(tag, grad_value):
    print(tag, "gradient dw =", float(grad_value))

# Define a simple scalar loss function.
def loss_fn(x_value):
    return (w * x_value - 3.0) ** 2

# Choose a scalar input tensor for experiments.
x = tf.constant(1.5, dtype=tf.float32)

# Use GradientTape to trace all operations.
with tf.GradientTape() as tape:
    # Trainable variables are watched automatically, so this call is optional.
    tape.watch(w)
    # Compute loss inside the tracing context.
    loss = loss_fn(x)

# Compute gradient of loss with respect to w.
grad_w = tape.gradient(loss, w)

# Show gradient from the traced computation.
show_gradients("Case 1 traced:", grad_w)

# Demonstrate missing tracing when operating outside.
with tf.GradientTape() as tape_outside:
    # Watch w for potential gradient computation.
    tape_outside.watch(w)
    # Compute intermediate value inside context.
    y_inside = w * x

# Compute loss outside the tracing context.
loss_outside = (y_inside - 3.0) ** 2

# Try to get gradient for loss_outside with respect to w.
grad_missing = tape_outside.gradient(loss_outside, w)

# Print gradient, expected to be None due to missing trace.
print("Case 2 outside trace gradient:", grad_missing)

# Show how control flow is traced dynamically.
with tf.GradientTape() as tape_branch:
    # Watch w for this new tape.
    tape_branch.watch(w)
    # Use simple branch depending on x value.
    if x > 1.0:
        loss_branch = (w * x) ** 2
    else:
        loss_branch = (w * x - 1.0) ** 2

# Compute gradient for the actually executed branch.
grad_branch = tape_branch.gradient(loss_branch, w)

# Display gradient from dynamic control flow.
show_gradients("Case 3 branch:", grad_branch)
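
One way to sanity-check a traced gradient is to compare it against a central finite difference. The sketch below does this for the Case 3 branch loss `(w * x) ** 2`, written here as `tf.square(w * x)`. The helper name `finite_difference` and the step size `eps` are illustrative choices, not part of TensorFlow.

```python
import tensorflow as tf

w = tf.Variable(2.0, dtype=tf.float32)
x = tf.constant(1.5, dtype=tf.float32)

def branch_loss(weight):
    # Same expression as the branch taken in Case 3 (x > 1.0).
    return tf.square(weight * x)

# Gradient reported by the tape.
with tf.GradientTape() as tape:
    loss = branch_loss(w)
tape_grad = float(tape.gradient(loss, w))

def finite_difference(fn, value, eps=1e-2):
    # Hypothetical helper: central difference (f(v + eps) - f(v - eps)) / (2 * eps).
    upper = float(fn(tf.constant(value + eps, dtype=tf.float32)))
    lower = float(fn(tf.constant(value - eps, dtype=tf.float32)))
    return (upper - lower) / (2.0 * eps)

numeric_grad = finite_difference(branch_loss, 2.0)

print("Tape gradient:   ", tape_grad)
print("Numeric gradient:", numeric_grad)
```

The analytic derivative is `2 * w * x**2`, so both values should land near 9 for `w = 2.0` and `x = 1.5`. The numeric estimate carries a small float32 rounding error.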



### **1.2. Persistent And Nonpersistent Tapes**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_02/Lecture_B/image_01_02.jpg?v=1769368960" width="250">



>* GradientTape has nonpersistent (default) and persistent modes
>* A nonpersistent tape releases its record after the first gradient() call

>* A persistent tape reuses one recorded computation for many gradient() calls
>* It holds memory until released, so delete it when you are done

>* Choose the mode that balances efficiency against flexibility
>* Use persistent tapes for deeper analysis workflows



In [None]:
#@title Python Code - Persistent And Nonpersistent Tapes

# This script demonstrates TensorFlow GradientTape basics.
# We compare nonpersistent and persistent tape behaviors.
# Focus on simple gradients for clear understanding.

# !pip install tensorflow==2.20.0

# Additionally disable XLA via environment variable to avoid JIT pow error.
import os
os.environ["TF_XLA_FLAGS"] = "--tf_xla_auto_jit=0"
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "0"
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # disable GPU to avoid JIT pow error

# Import TensorFlow with a clear alias.
import tensorflow as tf

# Disable XLA JIT to avoid GPU JIT compilation issues.
try:
    tf.config.optimizer.set_jit(False)
except Exception:
    pass

# Print TensorFlow version briefly.
print("TensorFlow version:", tf.__version__)

# Create a simple scalar variable for experiments.
w = tf.Variable(3.0, dtype=tf.float32)

# Define a simple scalar function of w.
def simple_loss(x):
    return (x ** 2) + (2.0 * x)

# Confirm the loss returns a scalar tensor.
loss_value = simple_loss(w)
print("Initial loss value:", float(loss_value))

# Use nonpersistent GradientTape for one gradient computation.
with tf.GradientTape() as tape_non:
    loss_non = simple_loss(w)

grad_non = tape_non.gradient(loss_non, w)
print("Nonpersistent gradient:", float(grad_non))

# A second gradient() call on a nonpersistent tape raises RuntimeError.
try:
    grad_non_again = tape_non.gradient(loss_non, w)
    print("Second nonpersistent gradient:", grad_non_again)
except Exception as e:
    print("Nonpersistent reuse error type:", type(e).__name__)

# Use persistent GradientTape to allow multiple gradient calls.
with tf.GradientTape(persistent=True) as tape_persist:
    loss_persist = simple_loss(w)

# First gradient from persistent tape.
grad_first = tape_persist.gradient(loss_persist, w)
print("Persistent first gradient:", float(grad_first))

# Second gradient call using the same persistent tape.
grad_second = tape_persist.gradient(loss_persist, w)
print("Persistent second gradient:", float(grad_second))

# Delete persistent tape to free resources explicitly.
del tape_persist
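
A persistent tape is most useful when several different gradients are needed from one recorded computation. The sketch below is a minimal illustration under assumed values (`x = 3.0`, with `y` and `z` as made-up intermediate and final results). It requests three distinct derivatives from the same tape.

```python
import tensorflow as tf

x = tf.Variable(3.0)

with tf.GradientTape(persistent=True) as tape:
    y = x * x   # intermediate result, y = x^2
    z = y * y   # final result, z = x^4

# Three different gradients from one recorded forward pass.
dy_dx = tape.gradient(y, x)   # 2 * x
dz_dx = tape.gradient(z, x)   # 4 * x^3
dz_dy = tape.gradient(z, y)   # 2 * y (gradient w.r.t. an intermediate tensor)

print("dy/dx:", float(dy_dx))
print("dz/dx:", float(dz_dx))
print("dz/dy:", float(dz_dy))

# Release the tape's resources once all gradients are computed.
del tape
```

With a nonpersistent tape, the second `gradient()` call would raise an error, so each of these derivatives would need its own forward pass.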



### **1.3. Watching Non-Variable Tensors**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_02/Lecture_B/image_01_03.jpg?v=1769369076" width="250">



>* By default, the tape tracks only trainable variables
>* Call tape.watch(tensor) to get gradients with respect to a regular tensor

>* Watch non-variable tensors, such as inputs, to study model behavior
>* This enables sensitivity analysis and robustness checks via gradients

>* Call watch inside the tape context before the tensor is used, or its gradient will be None
>* Plan which tensors to watch for targeted gradients
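
The points above can be sketched with a constant input tensor. This is a minimal example under assumed values (`w = 2.0`, `x = 3.0`, and a made-up squared loss). It contrasts an unwatched constant with a watched one.

```python
import tensorflow as tf

w = tf.Variable(2.0)
x = tf.constant(3.0)

# Without watch: x is a constant, so the tape does not track it.
with tf.GradientTape() as tape_unwatched:
    loss = tf.square(w * x - 1.0)
grad_unwatched = tape_unwatched.gradient(loss, x)

# With watch: the tape records how the loss depends on x.
with tf.GradientTape() as tape_watched:
    tape_watched.watch(x)
    loss = tf.square(w * x - 1.0)
grad_watched = tape_watched.gradient(loss, x)

print("Unwatched constant gradient:", grad_unwatched)
print("Watched constant gradient:", float(grad_watched))
```

The watched gradient is `2 * (w * x - 1) * w`, which measures how sensitive the loss is to the input rather than to a trainable parameter.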



## **2. Interpreting TensorFlow Gradients**

### **2.1. GradientTape Usage Patterns**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_02/Lecture_B/image_02_01.jpg?v=1769369204" width="250">



>* Record one forward pass and scalar loss
>* Gradients show loss sensitivity to each parameter

>* Choose which tensors GradientTape should track
>* Use tracked tensors to diagnose influential computations

>* Single-use tapes give one clear gradient snapshot
>* Persistent tapes compare objectives and reveal competing signals



In [None]:
#@title Python Code - GradientTape Usage Patterns

# This script shows basic TensorFlow gradient usage.
# It focuses on GradientTape usage patterns.
# Run cells to see gradients and interpretations.

# !pip install tensorflow==2.20.0

# Import TensorFlow and check version.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
import tensorflow as tf
print("TensorFlow version:", tf.__version__)

# Create a simple scalar input tensor.
x = tf.constant(2.0, dtype=tf.float32)

# Create a trainable variable representing a weight.
w = tf.Variable(1.0, dtype=tf.float32)

# Define a simple quadratic loss function.
def loss_fn(weight, input_x):
    return (weight * input_x - 3.0) ** 2

# Use GradientTape to record one forward pass.
with tf.GradientTape() as tape:
    loss_value = loss_fn(w, x)

# Compute gradient of loss with respect to weight.
grad_w = tape.gradient(loss_value, w)

# Print loss and gradient for interpretation.
print("Single pass loss:", float(loss_value))
print("Single pass gradient:", float(grad_w))

# Interpret the gradient sign: a positive gradient means increasing w raises the loss.
if grad_w > 0.0:
    direction = "decrease"
else:
    direction = "increase"

# Show how gradient suggests changing the weight.
print("To reduce loss, you should", direction, "w.")

# Demonstrate explicitly watching a non variable tensor.
base_feature = tf.constant(1.5, dtype=tf.float32)

# Use GradientTape and watch the feature tensor.
with tf.GradientTape() as tape_feature:
    tape_feature.watch(base_feature)
    loss_feature = loss_fn(w, base_feature)

# Compute gradient with respect to the feature.
grad_feature = tape_feature.gradient(loss_feature, base_feature)

# Print gradient to see feature sensitivity.
print("Gradient with respect to feature:", float(grad_feature))

# Show a tiny manual training step using gradient.
learning_rate = 0.1

# Perform one GradientTape pass for training step.
with tf.GradientTape() as tape_train:
    current_loss = loss_fn(w, x)

# Compute gradient for training update.
train_grad = tape_train.gradient(current_loss, w)

# Compute the gradient descent update as a new tensor (w itself is unchanged).
new_w = w - learning_rate * train_grad

# Print old weight, gradient, and new weight.
print("Old w:", float(w.numpy()), "Gradient:", float(train_grad))
print("Updated w:", float(new_w))

# Compute new loss to confirm improvement.
new_loss = loss_fn(new_w, x)
print("Loss before:", float(current_loss), "Loss after:", float(new_loss))
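
The step above computes the updated weight as a separate tensor. To change the variable itself, one option is `assign_sub`, repeated inside a short loop. This is a minimal sketch that reuses the same loss and starting values; the three-step loop length is an arbitrary choice.

```python
import tensorflow as tf

x = tf.constant(2.0)
w = tf.Variable(1.0)
learning_rate = 0.1

def loss_fn(weight, input_x):
    # Same quadratic loss as above: (w * x - 3)^2.
    return tf.square(weight * input_x - 3.0)

losses = []
for step in range(3):
    # Record a fresh forward pass on every step.
    with tf.GradientTape() as tape:
        loss = loss_fn(w, x)
    grad = tape.gradient(loss, w)

    # Update the variable in place: w <- w - learning_rate * grad.
    w.assign_sub(learning_rate * grad)

    losses.append(float(loss))
    print(f"Step {step}: loss={float(loss):.4f}, grad={float(grad):.4f}, w={float(w.numpy()):.4f}")
```

Because `w` is updated in place, each new tape sees the current value and the loss shrinks from one step to the next.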



### **2.2. Multiple Variable Gradients**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_02/Lecture_B/image_02_02.jpg?v=1769369264" width="250">



>* tape.gradient returns one gradient per variable, in the same structure as the sources
>* Signs and magnitudes together give the steepest loss-reducing direction

>* Different weights get different gradient sizes and signs
>* Comparing gradients shows which parameters most affect the loss

>* Across model layers, gradients form recognizable patterns
>* These patterns reveal where learning is focused and flag gradient-related problems



In [None]:
#@title Python Code - Multiple Variable Gradients

# This script shows multiple variable gradients.
# We use TensorFlow GradientTape for clarity.
# Focus is on interpreting gradient values.

# !pip install tensorflow==2.20.0

# Disable GPU before importing TensorFlow to avoid CUDA_ERROR_INVALID_HANDLE.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

# Import TensorFlow and check version.
import tensorflow as tf
print("TensorFlow version:", tf.__version__)

# Set a deterministic random seed value.
tf.random.set_seed(42)

# Create two trainable scalar variables.
w_size = tf.Variable(0.5, dtype=tf.float32)

# Create another variable representing age weight.
w_age = tf.Variable(-0.3, dtype=tf.float32)

# Create a small batch of feature data.
features = tf.constant([[80.0, 5.0], [60.0, 20.0]], dtype=tf.float32)

# Create simple target prices for the batch.
prices = tf.constant([[300.0], [180.0]], dtype=tf.float32)

# Stack variables into a single weight vector.
w = tf.stack([w_size, w_age])

# Validate that feature and weight shapes align.
assert features.shape[1] == w.shape[0]

# Define a simple prediction function.
def predict(features_batch, weight_vector):
    return tf.matmul(features_batch, tf.reshape(weight_vector, (-1, 1)))

# Use GradientTape to compute gradients.
with tf.GradientTape() as tape:
    tape.watch([w_size, w_age])
    w_current = tf.stack([w_size, w_age])
    preds = predict(features, w_current)
    errors = preds - prices
    loss = tf.reduce_mean(errors ** 2)

# Compute gradients of loss with respect to both variables.
grads = tape.gradient(loss, [w_size, w_age])

# Print current parameter values and loss.
print("w_size:", float(w_size.numpy()), "w_age:", float(w_age.numpy()))
print("Loss value:", float(loss.numpy()))

# Print gradients for each variable separately.
print("Gradient for w_size:", float(grads[0].numpy()))
print("Gradient for w_age:", float(grads[1].numpy()))

# Compute one manual gradient descent step (the variables are not modified).
step_size = 0.01
new_w_size = w_size - step_size * grads[0]
new_w_age = w_age - step_size * grads[1]

# Print the proposed parameter values after the step.
print("Updated w_size:", float(new_w_size.numpy()))
print("Updated w_age:", float(new_w_age.numpy()))

# Compare gradient magnitudes to see which weight the loss is more sensitive to.
print("Abs gradients:", float(abs(grads[0]).numpy()), float(abs(grads[1]).numpy()))
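
For a linear model with mean squared error, the gradient has the closed form `(2 / N) * Xᵀ (Xw - y)`, which can be used to cross-check what the tape returns. The sketch below assumes the two scalar weights are packed into a single `(2, 1)` variable named `weights`. That is a simplification for illustration, not the layout used above.

```python
import numpy as np
import tensorflow as tf

features = tf.constant([[80.0, 5.0], [60.0, 20.0]])
prices = tf.constant([[300.0], [180.0]])

# Assumption: both weights live in one (2, 1) variable instead of two scalars.
weights = tf.Variable([[0.5], [-0.3]])

with tf.GradientTape() as tape:
    errors = tf.matmul(features, weights) - prices
    loss = tf.reduce_mean(tf.square(errors))
tape_grad = tape.gradient(loss, weights)

# Closed-form MSE gradient: (2 / N) * X^T (Xw - y).
n = features.shape[0]
closed_form = (2.0 / n) * tf.matmul(features, errors, transpose_a=True)

print("Tape gradient:\n", tape_grad.numpy())
print("Closed-form gradient:\n", closed_form.numpy())
print("Match:", np.allclose(tape_grad.numpy(), closed_form.numpy(), rtol=1e-5))
```

The size weight receives a much larger gradient than the age weight, mostly because the size feature takes larger values. This is one reason features are often rescaled before training.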



### **2.3. Handling None Gradients**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_02/Lecture_B/image_02_03.jpg?v=1769369324" width="250">



>* A None gradient means the tape recorded no differentiable path to the loss
>* Some parameters simply do not affect the loss

>* A zero gradient means the loss depends on the parameter but is locally flat
>* A None gradient means the parameter never entered the loss computation

>* None gradients reveal unused or nondifferentiable parameters
>* Fix architecture so all trainable parameters affect loss



In [None]:
#@title Python Code - Handling None Gradients

# This script shows TensorFlow None gradients clearly.
# It focuses on interpreting gradient values safely.
# Run each part and read the printed explanations.

# !pip install tensorflow==2.20.0

# Import TensorFlow with a clear alias.
import tensorflow as tf

# Disable GPU to avoid CUDA_ERROR_INVALID_HANDLE runtime error.
tf.config.set_visible_devices([], 'GPU')

# Print TensorFlow version briefly.
print("TensorFlow version:", tf.__version__)

# Create two scalar variables for demonstration.
w_used = tf.Variable(2.0, dtype=tf.float32)

# Create another variable that will be unused.
w_unused = tf.Variable(5.0, dtype=tf.float32)

# Define a simple input tensor.
x = tf.constant(3.0, dtype=tf.float32)

# Use GradientTape to track computations.
with tf.GradientTape() as tape:
    # Watch both variables for potential gradients.
    tape.watch([w_used, w_unused])

    # Compute a loss that depends only on w_used.
    y = w_used * x

    # Define a scalar loss from the output.
    loss = (y - 10.0) ** 2

# Compute gradients with respect to both variables.
grads = tape.gradient(loss, [w_used, w_unused])

# Unpack gradients for clarity.
grad_used, grad_unused = grads

# Confirm shapes are as expected scalars.
assert grad_used.shape == (), "grad_used must be scalar."

# Check that unused gradient is actually None.
assert grad_unused is None, "grad_unused should be None here."

# Print the numeric gradient for the used variable.
print("Gradient for w_used (numeric):", float(grad_used))

# Explain why this gradient is numeric.
print("w_used affects loss, so gradient is a number.")

# Print the gradient for the unused variable.
print("Gradient for w_unused:", grad_unused)

# Explain why this gradient is None, not zero.
print("w_unused never affects loss, so gradient is None.")

# Show that changing w_unused does not change the loss.
old_loss = float(loss.numpy())

# Manually change w_unused value.
w_unused.assign(100.0)

# Recompute loss to confirm it is unchanged.
with tf.GradientTape() as tape2:
    tape2.watch([w_used, w_unused])
    y2 = w_used * x
    loss2 = (y2 - 10.0) ** 2

# Print both losses to compare.
print("Loss before changing w_unused:", old_loss)

# Final print shows loss after change.
print("Loss after changing w_unused:", float(loss2.numpy()))
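
When a None gradient is expected and harmless, there are two common ways to handle it. One is to ask the tape for zeros via `unconnected_gradients`; the other is to filter out the None entries before updating. The sketch below shows both under the same setup. The step size `0.01` and the list name `pairs` are illustrative.

```python
import tensorflow as tf

w_used = tf.Variable(2.0)
w_unused = tf.Variable(5.0)
x = tf.constant(3.0)
variables = [w_used, w_unused]

# Persistent tape so gradient() can be called twice for the comparison.
with tf.GradientTape(persistent=True) as tape:
    loss = tf.square(w_used * x - 10.0)

# Option 1: return zeros instead of None for unconnected variables.
grads_zero = tape.gradient(
    loss, variables, unconnected_gradients=tf.UnconnectedGradients.ZERO
)

# Option 2: keep the default behavior and drop the None entries.
grads = tape.gradient(loss, variables)
pairs = [(g, v) for g, v in zip(grads, variables) if g is not None]
del tape

print("With ZERO:", [float(g) for g in grads_zero])
print("Default:", grads[1], "| pairs kept:", len(pairs))

# Apply a manual update only to variables that received a gradient.
for grad, var in pairs:
    var.assign_sub(0.01 * grad)
print("w_used after update:", float(w_used.numpy()))
```

Option 1 hides the distinction between "unused" and "flat", so Option 2 is usually the safer default while debugging a model.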



## **3. Manual Gradient Training**

### **3.1. Using Keras Optimizers**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_02/Lecture_B/image_03_01.jpg?v=1769369396" width="250">



>* Keras optimizers turn gradients into parameter updates
>* They automate step size, momentum, and stability details

>* Different Keras optimizers change how training progresses
>* They enable faster, stable learning with less effort

>* Manual loops control loss, gradients, and updates
>* Experiment, tune optimizers, and build training intuition



In [None]:
#@title Python Code - Using Keras Optimizers

# This script shows manual training with optimizers.
# We use TensorFlow GradientTape for simple regression.
# Focus on Keras optimizers updating trainable variables.

# Install TensorFlow if not already available.
# !pip install tensorflow==2.20.0

# Import TensorFlow and NumPy for computations.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
import tensorflow as tf
import numpy as np

# Set deterministic seeds for reproducibility.
tf.random.set_seed(7)
np.random.seed(7)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Create tiny synthetic data for y = 3x + 2.
x_data = np.linspace(-1.0, 1.0, 8, dtype=np.float32)
y_data = 3.0 * x_data + 2.0

# Convert data to TensorFlow tensors.
x_train = tf.constant(x_data.reshape(-1, 1))
y_train = tf.constant(y_data.reshape(-1, 1))

# Validate shapes before training operations.
assert x_train.shape == y_train.shape

# Define trainable variables for weight and bias.
w = tf.Variable(tf.random.normal(shape=(1, 1)))
b = tf.Variable(tf.zeros(shape=(1,)))

# Define a simple mean squared error loss.
def compute_loss(predictions, targets):
    error = predictions - targets
    return tf.reduce_mean(tf.square(error))

# Create a Keras optimizer with small learning rate.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

# Define one manual training step using GradientTape.
@tf.function
def train_step(inputs, targets):
    with tf.GradientTape() as tape:
        preds = tf.matmul(inputs, w) + b
        loss = compute_loss(preds, targets)
    grads = tape.gradient(loss, [w, b])
    optimizer.apply_gradients(zip(grads, [w, b]))
    return loss, grads

# Run a few training steps and print progress.
for step in range(5):
    loss_value, gradients = train_step(x_train, y_train)
    w_grad, b_grad = gradients
    print(
        f"Step {step}: loss={loss_value.numpy():.4f}, "
        f"dw={w_grad.numpy()[0,0]:.4f}, db={b_grad.numpy()[0]:.4f}"
    )

# Show learned parameters next to the true values (w=3, b=2).
print("Learned w, b:", float(w.numpy()[0, 0]), float(b.numpy()[0]), "| true w, b: 3.0 2.0")
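
Because the optimizer only sees gradient and variable pairs, swapping it out does not change the training loop. The sketch below runs the same regression with three Keras optimizers. The helper name `fit`, the zero initialization, the 50-step budget, and the hyperparameters are assumptions made for the comparison, not recommended settings.

```python
import numpy as np
import tensorflow as tf

# Same synthetic data as above: y = 3x + 2.
x_train = tf.constant(np.linspace(-1.0, 1.0, 8, dtype=np.float32).reshape(-1, 1))
y_train = 3.0 * x_train + 2.0

def fit(optimizer, steps=50):
    # Hypothetical helper: fresh zero-initialized parameters so every
    # optimizer starts from the same point.
    w = tf.Variable(tf.zeros((1, 1)))
    b = tf.Variable(tf.zeros((1,)))
    history = []
    for _ in range(steps):
        with tf.GradientTape() as tape:
            preds = tf.matmul(x_train, w) + b
            loss = tf.reduce_mean(tf.square(preds - y_train))
        grads = tape.gradient(loss, [w, b])
        optimizer.apply_gradients(zip(grads, [w, b]))
        history.append(float(loss))
    return history

results = {
    "SGD": fit(tf.keras.optimizers.SGD(learning_rate=0.1)),
    "SGD+momentum": fit(tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9)),
    "Adam": fit(tf.keras.optimizers.Adam(learning_rate=0.1)),
}

for name, history in results.items():
    print(f"{name}: first loss={history[0]:.4f}, last loss={history[-1]:.4f}")
```

Each optimizer gets its own instance here because optimizers keep internal state (such as momentum buffers) tied to the variables they update.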



### **3.2. Applying Gradients to Variables**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_02/Lecture_B/image_03_02.jpg?v=1769369452" width="250">



>* Gradients tell us how to adjust each parameter
>* We pair each gradient with its variable and the optimizer applies the update

>* Gradients show how each parameter affects the error
>* The optimizer uses the gradients and the learning rate to reduce the loss

>* Modify gradients before updating to improve learning
>* Repeated updates gradually refine the model's behavior



In [None]:
#@title Python Code - Applying Gradients to Variables

# This script shows applying gradients to variables.
# We use TensorFlow GradientTape for manual updates.
# Focus on a tiny linear model training step.

# !pip install tensorflow==2.20.0

# Import TensorFlow with a clear alias.
import tensorflow as tf

# Force TensorFlow to use CPU to avoid CUDA errors.
tf.config.set_visible_devices([], 'GPU')

# Print TensorFlow version briefly.
print("TensorFlow version:", tf.__version__)

# Set deterministic random seed for reproducibility.
tf.random.set_seed(42)

# Create a tiny synthetic dataset for y = 3x + 2.
x_data = tf.constant([[0.0], [1.0], [2.0], [3.0]])

# Create matching target outputs as a column vector.
y_true = tf.constant([[2.0], [5.0], [8.0], [11.0]])

# Define a trainable weight variable initialized randomly.
w = tf.Variable(tf.random.normal(shape=(1, 1)))

# Define a trainable bias variable initialized to zero.
b = tf.Variable(tf.zeros(shape=(1,)))

# Choose a simple optimizer with a small learning rate.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

# Define a function to compute predictions from inputs.
def model(x):
    # Compute linear model output y = xw + b.
    return tf.matmul(x, w) + b

# Define a function to compute mean squared error loss.
def loss_fn(y_pred, y_target):
    # Compute squared differences and then mean.
    return tf.reduce_mean(tf.square(y_pred - y_target))

# Run one manual training step using GradientTape.
with tf.GradientTape() as tape:
    # Compute predictions for current parameters.
    y_pred = model(x_data)

    # Compute scalar loss from predictions and targets.
    loss_value = loss_fn(y_pred, y_true)

# Collect trainable variables in a list.
variables = [w, b]

# Compute gradients of loss with respect to variables.
gradients = tape.gradient(loss_value, variables)

# Validate that gradients and variables have matching lengths.
if gradients is None or len(gradients) != len(variables):
    raise RuntimeError("Gradient and variable lengths mismatch.")

# Pair each gradient with its corresponding variable.
grad_var_pairs = list(zip(gradients, variables))

# Apply gradients to update variables using the optimizer.
optimizer.apply_gradients(grad_var_pairs)

# Print the loss computed before the update was applied.
print("Loss before update:", float(loss_value))

# Print updated weight and bias values clearly.
print("Updated weight:", float(w.numpy()[0][0]))

# Print updated bias value as a simple float.
print("Updated bias:", float(b.numpy()[0]))

# Show a prediction after the update for x = 4.
x_test = tf.constant([[4.0]])

# Compute prediction using updated parameters.
y_test_pred = model(x_test)

# Print the new prediction to observe learning effect.
print("Prediction for x=4 after update:", float(y_test_pred.numpy()[0][0]))
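
Gradients can also be modified between `tape.gradient` and `apply_gradients`. The sketch below clips each gradient tensor's norm with `tf.clip_by_norm` before handing it to the optimizer. The zero initialization and the threshold `max_norm = 1.0` are illustrative assumptions.

```python
import tensorflow as tf

x_data = tf.constant([[0.0], [1.0], [2.0], [3.0]])
y_true = tf.constant([[2.0], [5.0], [8.0], [11.0]])

# Assumption: zero initialization, so the numbers are easy to follow.
w = tf.Variable(tf.zeros((1, 1)))
b = tf.Variable(tf.zeros((1,)))
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

with tf.GradientTape() as tape:
    preds = tf.matmul(x_data, w) + b
    loss = tf.reduce_mean(tf.square(preds - y_true))
raw_grads = tape.gradient(loss, [w, b])

# Modify the gradients: limit each tensor's norm to max_norm.
max_norm = 1.0
clipped_grads = [tf.clip_by_norm(g, max_norm) for g in raw_grads]

# The optimizer applies whatever gradients it is given.
optimizer.apply_gradients(zip(clipped_grads, [w, b]))

print("Raw gradients:    ", [g.numpy().ravel().tolist() for g in raw_grads])
print("Clipped gradients:", [g.numpy().ravel().tolist() for g in clipped_grads])
print("Updated w, b:", float(w.numpy()[0, 0]), float(b.numpy()[0]))
```

Clipping each tensor separately changes the relative size of the gradients. Clipping by a shared global norm instead preserves the overall update direction.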



### **3.3. Stabilizing Gradient Behavior**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_02/Lecture_B/image_03_03.jpg?v=1769369509" width="250">



>* Unstable gradients make training explode or stall
>* We stabilize updates so loss decreases smoothly

>* Learning rate size strongly affects training stability
>* Use tuned optimizers with momentum or adaptive scaling

>* Use gradient clipping and monitor gradient norms
>* Adjust models and preprocessing to keep training stable



In [None]:
#@title Python Code - Stabilizing Gradient Behavior

# This script shows stabilizing gradient behavior.
# We use TensorFlow GradientTape for manual updates.
# We compare training with and without gradient clipping.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import random
import math

# Disable GPU to avoid CUDA errors in some environments.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

# Import TensorFlow for tensors and gradients.
import tensorflow as tf

# Set deterministic seeds for reproducibility.
random.seed(7)

# Set TensorFlow random seed for reproducibility.
tf.random.set_seed(7)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Create simple synthetic input data tensor.
x_data = tf.constant([[1.0], [2.0], [3.0], [4.0]], dtype=tf.float32)

# Create simple synthetic target data tensor.
y_true = tf.constant([[3.0], [5.0], [7.0], [9.0]], dtype=tf.float32)

# Check that input and target shapes match.
assert x_data.shape == y_true.shape

# Define a simple linear model using variables.
w = tf.Variable(tf.random.normal(shape=(1, 1)))

# Define bias variable for the linear model.
b = tf.Variable(tf.zeros(shape=(1,)))

# Define mean squared error loss function.
def mse_loss(y_pred, y_target):
    # Compute squared differences between predictions and targets.
    squared = tf.square(y_pred - y_target)

    # Return mean of squared differences.
    return tf.reduce_mean(squared)

# Define a function to run one training step.
def train_step(clip_gradients=False):
    # Record operations for automatic differentiation.
    with tf.GradientTape() as tape:
        # Compute model predictions for current parameters.
        y_pred = tf.matmul(x_data, w) + b

        # Compute scalar loss from predictions and targets.
        loss_value = mse_loss(y_pred, y_true)

    # Compute gradients of loss with respect to parameters.
    grads = tape.gradient(loss_value, [w, b])

    # Optionally clip gradients to stabilize updates.
    if clip_gradients:
        # Clip gradients by global norm to a maximum value.
        grads, _ = tf.clip_by_global_norm(grads, 1.0)

    # Apply a simple manual gradient descent update.
    learning_rate = 0.5

    # Update weight variable using gradient and learning rate.
    w.assign_sub(learning_rate * grads[0])

    # Update bias variable using gradient and learning rate.
    b.assign_sub(learning_rate * grads[1])

    # Measure the global norm of the gradients that were applied (after clipping, if enabled).
    grad_norm = tf.linalg.global_norm(grads)

    # Return scalar loss and gradient norm values.
    return float(loss_value.numpy()), float(grad_norm.numpy())

# Run a few unstable steps without gradient clipping.
print("\nTraining without gradient clipping:")

# Perform several manual updates and observe behavior.
for step in range(3):
    # Call training step without clipping enabled.
    loss_value, grad_norm = train_step(clip_gradients=False)

    # Print step, loss, and gradient norm values.
    print(f"Step {step}: loss={loss_value:.3f}, grad_norm={grad_norm:.3f}")

# Re-seed and redraw w so the clipped run starts from the same initial weight.
tf.random.set_seed(7)
w.assign(tf.random.normal(shape=(1, 1)))

# Reset bias parameter to zero for fair comparison.
b.assign(tf.zeros(shape=(1,)))

# Run a few stable steps with gradient clipping enabled.
print("\nTraining with gradient clipping:")

# Perform several manual updates and observe behavior.
for step in range(3):
    # Call training step with clipping enabled.
    loss_value, grad_norm = train_step(clip_gradients=True)

    # Print step, loss, and gradient norm values.
    print(f"Step {step}: loss={loss_value:.3f}, grad_norm={grad_norm:.3f}")
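
The learning rate alone can decide whether training is stable. The sketch below runs the same linear regression with two learning rates and records the loss and global gradient norm at each step. The helper name `run`, the zero initialization, and the rates `0.5` and `0.05` are assumptions chosen to make the contrast visible.

```python
import tensorflow as tf

x_data = tf.constant([[1.0], [2.0], [3.0], [4.0]])
y_true = tf.constant([[3.0], [5.0], [7.0], [9.0]])

def run(learning_rate, steps=5):
    # Hypothetical helper: train from zero-initialized parameters and
    # record the loss and gradient norm seen at each step.
    w = tf.Variable(tf.zeros((1, 1)))
    b = tf.Variable(tf.zeros((1,)))
    losses, norms = [], []
    for _ in range(steps):
        with tf.GradientTape() as tape:
            preds = tf.matmul(x_data, w) + b
            loss = tf.reduce_mean(tf.square(preds - y_true))
        grads = tape.gradient(loss, [w, b])
        losses.append(float(loss))
        norms.append(float(tf.linalg.global_norm(grads)))
        w.assign_sub(learning_rate * grads[0])
        b.assign_sub(learning_rate * grads[1])
    return losses, norms

losses_large, norms_large = run(learning_rate=0.5)
losses_small, norms_small = run(learning_rate=0.05)

for step in range(5):
    print(
        f"Step {step}: lr=0.5 loss={losses_large[step]:.3e} norm={norms_large[step]:.3e} | "
        f"lr=0.05 loss={losses_small[step]:.3e} norm={norms_small[step]:.3e}"
    )
```

A gradient norm that grows from step to step is an early warning sign of divergence, which is why it is worth logging alongside the loss.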



# <font color="#418FDE" size="6.5" uppercase>**Autograd with TF**</font>


In this lecture, you learned to:
- Use tf.GradientTape to compute gradients of scalar losses with respect to TensorFlow variables. 
- Interpret gradient values to understand how parameter changes affect a loss function. 
- Implement a simple manual training step using gradients and an optimizer. 

In the next module (Module 3), we will cover 'Keras Model Building'.