# <font color="#418FDE" size="6.5" uppercase>**Performance and Debug**</font>

>Last update: 20260127.
    
By the end of this lecture, you will be able to:
- Enable and configure mixed precision training in TensorFlow 2.20.0 to leverage modern GPUs and TPUs. 
- Use TensorFlow profiling tools to identify performance bottlenecks in models and input pipelines. 
- Diagnose and mitigate common numerical and stability issues such as NaNs and exploding gradients. 


## **1. Mixed Precision Setup**

### **1.1. Mixed Precision Policy**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_10/Lecture_B/image_01_01.jpg?v=1769567582" width="250">



>* Policy chooses dtypes for different tensor operations
>* Balances speed and stability using low and full precision

>* Policy assigns tensors to suitable precisions
>* Keeps weights in float32 so convergence stays stable

>* Mixed precision cuts training time and memory
>* Makes large, complex experiments feasible and stable



In [None]:
#@title Python Code - Mixed Precision Policy

# This script demonstrates TensorFlow mixed precision policy.
# It focuses on simple clear beginner friendly concepts.
# Run cells in order inside Google Colab environment.

# !pip install tensorflow==2.20.0

# Import required standard libraries safely.
import os
import random
import numpy as np

# Set deterministic seeds for reproducibility.
random.seed(7)
np.random.seed(7)

# Import TensorFlow and mixed precision utilities.
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import models
from tensorflow.keras import mixed_precision

# Seed TensorFlow as well so weight initialization is reproducible.
tf.random.set_seed(7)

# Print TensorFlow version and device information.
print("TensorFlow version:", tf.__version__)
print("GPU available:", bool(tf.config.list_physical_devices("GPU")))

# Define a helper function to describe tensor dtypes.
def describe_tensor(name, tensor):
    print(f"{name} shape={tensor.shape} dtype={tensor.dtype}")

# Show default global mixed precision policy.
current_policy = mixed_precision.global_policy()
print("Default policy:", current_policy)

# Enable mixed float16 policy for performance.
policy = mixed_precision.Policy("mixed_float16")
mixed_precision.set_global_policy(policy)

# Confirm that the global policy has been updated.
print("New global policy:", mixed_precision.global_policy())

# Create a tiny dense model using Keras Sequential.
model = models.Sequential([
    layers.Input(shape=(16,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(1)
])

# Show layer compute and variable dtypes under policy.
for layer in model.layers:
    print("Layer:", layer.name,
          "compute:", layer.compute_dtype,
          "variable:", layer.variable_dtype)

# Build the model by calling it on dummy data.
dummy_input = tf.ones((4, 16), dtype=tf.float32)
output = model(dummy_input)

# Describe input and output tensor dtypes.
describe_tensor("Dummy input", dummy_input)
describe_tensor("Model output", output)

# Compile model with an optimizer that supports loss scaling.
optimizer = mixed_precision.LossScaleOptimizer(
    tf.keras.optimizers.Adam(learning_rate=0.001)
)
model.compile(optimizer=optimizer, loss="mse")

# Create a tiny synthetic regression dataset.
x_train = np.random.randn(32, 16).astype("float32")
y_train = np.random.randn(32, 1).astype("float32")

# Validate dataset shapes before training.
assert x_train.shape[0] == y_train.shape[0]
assert x_train.shape[1] == 16

# Train briefly with silent verbose setting.
history = model.fit(x_train, y_train,
                    epochs=2,
                    batch_size=8,
                    verbose=0)

# Print final loss to confirm successful training.
print("Final training loss:", float(history.history["loss"][-1]))
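
Because a policy assigns a precision to each layer, a model built under `mixed_float16` also produces float16 outputs unless its last layer is overridden. The sketch below is an illustration rather than part of the script above. It sets the policy per layer through the `dtype` argument, so the global policy is left untouched, and it keeps a float32 output head. The layer sizes are arbitrary assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Per-layer policy: compute in float16, keep variables in float32.
hidden = layers.Dense(32, activation="relu", dtype="mixed_float16")
# Output head kept entirely in float32 so the loss sees full precision.
head = layers.Dense(1, dtype="float32")

inputs = tf.keras.Input(shape=(16,))
outputs = head(hidden(inputs))
model = tf.keras.Model(inputs, outputs)

prediction = model(tf.ones((4, 16)))
print("hidden compute dtype:", hidden.compute_dtype)
print("hidden variable dtype:", hidden.variable_dtype)
print("output dtype:", prediction.dtype)
```

The same idea applies to a final softmax: computing it in float32 is a common way to avoid precision loss in the loss value.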




### **1.2. Loss Scaling Essentials**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_10/Lecture_B/image_01_02.jpg?v=1769567623" width="250">



>* Loss scaling prevents tiny gradients from underflowing in half-precision
>* It multiplies loss, then rescales gradients after backprop

>* Fixed scaling uses one constant factor; risks overflow
>* Dynamic scaling auto-adjusts factor, improving stability

>* Recognize training symptoms of bad loss scaling
>* Adjust scaling to avoid underflow, overflow, and wasted compute
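
The underflow and overflow behaviour described above can be reproduced with plain NumPy float16 values. This is a minimal sketch, not the Keras implementation: the gradient values, the starting scale of 2**15, and the `next_loss_scale` helper with its growth interval are all illustrative assumptions.

```python
import numpy as np

LOSS_SCALE = np.float32(2.0 ** 15)  # illustrative starting scale

true_grad = np.float32(1e-8)

# Without scaling, the value is below float16's smallest subnormal (~6e-8).
naive_fp16 = np.float16(true_grad)

# With scaling, the value survives the float16 cast and is unscaled in float32.
scaled_fp16 = np.float16(true_grad * LOSS_SCALE)
recovered = np.float32(scaled_fp16) / LOSS_SCALE

# A scale that is too large overflows instead: float16 tops out at 65504.
with np.errstate(over="ignore"):
    overflowed = np.float16(np.float32(4.0) * LOSS_SCALE)


def next_loss_scale(scale, grads_are_finite, good_steps, growth_interval=2000):
    """Hypothetical dynamic rule: halve on overflow, double after a clean streak."""
    if not grads_are_finite:
        return max(scale / 2.0, 1.0), 0
    good_steps += 1
    if good_steps >= growth_interval:
        return scale * 2.0, 0
    return scale, good_steps


print("unscaled float16 gradient:", naive_fp16)
print("recovered gradient:", recovered)
print("overflowed value:", overflowed)
print("scale after overflow:", next_loss_scale(2.0 ** 15, False, 10))
```

A fixed scale corresponds to never calling the update rule. The dynamic variant trades a few skipped steps for not having to pick the constant by hand.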



### **1.3. Future Ready Hardware**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_10/Lecture_B/image_01_03.jpg?v=1769567839" width="250">



>* Mixed precision matches software to accelerator hardware features
>* Unlocks tensor core speedups for diverse deep models

>* Hardware is shifting toward many precision formats
>* Flexible mixed precision lets models exploit future accelerators

>* Mixed precision unifies training across diverse hardware
>* Consistent numerics ease validation and future upgrades
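
Matching the policy to the accelerator can be written as a small rule of thumb. The sketch below assumes that TPUs are paired with `mixed_bfloat16`, that NVIDIA GPUs with compute capability 7.0 or higher are paired with `mixed_float16`, and that everything else stays in `float32`. The helper names are hypothetical, and TPU detection only works once the TPU system has been initialized.

```python
import tensorflow as tf


def choose_policy_name(device_type, compute_capability=None):
    """Pick a Keras dtype policy name for a device (illustrative rule of thumb)."""
    if device_type == "TPU":
        return "mixed_bfloat16"
    if device_type == "GPU" and compute_capability is not None:
        if tuple(compute_capability) >= (7, 0):
            return "mixed_float16"
    return "float32"


def detect_policy_name():
    """Inspect the visible devices and suggest a policy name."""
    if tf.config.list_logical_devices("TPU"):
        return choose_policy_name("TPU")
    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        details = tf.config.experimental.get_device_details(gpus[0])
        return choose_policy_name("GPU", details.get("compute_capability"))
    return choose_policy_name("CPU")


print("Suggested policy:", detect_policy_name())
```

The suggested name can then be passed to `mixed_precision.set_global_policy`. Keeping the decision in one function makes it easy to extend when new precision formats become available.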



## **2. TensorFlow Profiling Essentials**

### **2.1. TensorBoard Profiler Setup**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_10/Lecture_B/image_02_01.jpg?v=1769567859" width="250">



>* Profiler links training runs to visual dashboards
>* Shows detailed timing to guide performance decisions

>* Configure training, logs, and TensorBoard for profiling
>* Use structured log directories to compare and share runs

>* Configure TensorBoard securely for remote training environments
>* Standardize versions and scripts for reliable profiling



In [None]:
#@title Python Code - TensorBoard Profiler Setup

# This script shows TensorBoard Profiler setup.
# It runs a tiny model with profiling enabled.
# Use it in Colab to explore performance traces.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import datetime
import random

# Import TensorFlow and TensorBoard utilities.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Set deterministic seeds for reproducibility.
SEED_VALUE = 42
random.seed(SEED_VALUE)
os.environ["PYTHONHASHSEED"] = str(SEED_VALUE)

# Set TensorFlow random seed deterministically.
tf.random.set_seed(SEED_VALUE)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Detect available devices and choose strategy.
physical_gpus = tf.config.list_physical_devices("GPU")
if physical_gpus:
    strategy = tf.distribute.OneDeviceStrategy("/GPU:0")
else:
    strategy = tf.distribute.OneDeviceStrategy("/CPU:0")

# Prepare a small synthetic dataset for profiling.
num_samples = 512
input_dim = 32
num_classes = 10

# Create random features and labels tensors.
features = tf.random.normal((num_samples, input_dim))
labels = tf.random.uniform((num_samples,), 0, num_classes, dtype=tf.int32)

# Validate shapes before building dataset.
assert features.shape[0] == labels.shape[0]

# Build a tf.data pipeline with batching.
batch_size = 32
dataset = tf.data.Dataset.from_tensor_slices((features, labels))

# Shuffle and batch the dataset deterministically.
dataset = dataset.shuffle(num_samples, seed=SEED_VALUE)

dataset = dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE)

# Create a log directory for TensorBoard Profiler.
base_log_dir = "logs_profiler_demo"
run_id = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")

# Combine base directory and run identifier.
log_dir = os.path.join(base_log_dir, run_id)

# Explain where logs will be written.
print("Profiler logs directory:", log_dir)

# Build a simple model inside distribution strategy.
with strategy.scope():
    model = keras.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])

    # Compile the model with a basic optimizer.
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.001),
        loss=keras.losses.SparseCategoricalCrossentropy(),
        metrics=["accuracy"],
    )

# Create a TensorBoard callback with profiling enabled.
tb_callback = keras.callbacks.TensorBoard(
    log_dir=log_dir,
    histogram_freq=0,
    write_graph=False,
    write_images=False,
    profile_batch="2,4",
)

# Train briefly to generate profiling traces.
with strategy.scope():
    history = model.fit(
        dataset,
        epochs=2,
        steps_per_epoch=8,
        callbacks=[tb_callback],
        verbose=0,
    )

# Print a short summary of training results.
final_loss = history.history["loss"][-1]
final_acc = history.history["accuracy"][-1]

# Show metrics and instructions to open TensorBoard.
print("Final loss:", round(float(final_loss), 4))
print("Final accuracy:", round(float(final_acc), 4))
print("To view profiler, run in Colab:")
print("%load_ext tensorboard")
print("%tensorboard --logdir", base_log_dir)

# Confirm that the log directory was created.
print("Log directory exists:", os.path.isdir(log_dir))
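
When training runs in a custom loop rather than through `model.fit`, the profiler can also be started and stopped by hand. The following is a minimal sketch that uses `tf.profiler.experimental.start`, `stop`, and `Trace`. The temporary log directory and the toy matmul step are assumptions made for illustration.

```python
import tempfile
import tensorflow as tf

# Hypothetical log location; point this at your TensorBoard log root instead.
profile_dir = tempfile.mkdtemp(prefix="manual_profile_")


@tf.function
def matmul_step(a, b):
    return tf.reduce_sum(tf.matmul(a, b))


a = tf.random.normal((256, 256))
b = tf.random.normal((256, 256))

tf.profiler.experimental.start(profile_dir)
try:
    for step in range(5):
        # Trace marks step boundaries so events can be grouped per step.
        with tf.profiler.experimental.Trace("train", step_num=step, _r=1):
            result = matmul_step(a, b)
finally:
    # Always stop the profiler so the trace is flushed to disk.
    tf.profiler.experimental.stop()

print("Profile written under:", profile_dir)
```

Profiling only a handful of steps keeps trace files small, which mirrors the short `profile_batch` range used with the callback.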



### **2.2. Reading Trace Timelines**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_10/Lecture_B/image_02_02.jpg?v=1769567912" width="250">



>* Timeline shows operations across CPU and accelerators
>* Block lengths show duration; gaps reveal idle waiting

>* Tell compute, input, and overhead regions apart
>* Link patterns to metrics to spot bottlenecks

>* Use timelines to spot subtle recurring bottlenecks
>* Interpret visual cues to guide targeted optimizations
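
Reading a timeline amounts to measuring blocks and the gaps between them. The sketch below does that arithmetic on a hypothetical list of `(start, end)` op intervals for one device lane. The event values and the `idle_gaps` helper are invented for illustration and do not come from a real trace.

```python
def idle_gaps(events, window_start, window_end):
    """Return (busy_time, gaps) for (start, end) intervals on one device lane."""
    # Merge overlapping or touching intervals so busy time is not double counted.
    merged = []
    for start, end in sorted(events):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    # Walk the merged blocks and record every stretch with no work.
    gaps = []
    cursor = window_start
    for start, end in merged:
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < window_end:
        gaps.append((cursor, window_end))

    busy = sum(end - start for start, end in merged)
    return busy, gaps


# Hypothetical accelerator lane, times in milliseconds.
gpu_events = [(0, 4), (4, 6), (10, 15), (14, 18)]
busy, gaps = idle_gaps(gpu_events, window_start=0, window_end=20)
utilization = busy / 20

print("busy ms:", busy)
print("idle gaps:", gaps)
print("utilization:", utilization)
```

A gap that recurs at the same point in every step, such as the 6 to 10 ms stretch here, is the kind of pattern that usually points at input or host-side work.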



### **2.3. Finding Performance Bottlenecks**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_10/Lecture_B/image_02_03.jpg?v=1769567927" width="250">



>* Use timelines to spot idle accelerators and waste
>* Differentiate input bottlenecks from compute-bound models

>* Profiler ranks ops and pipeline by runtime
>* Helps locate and optimize the true hot spots

>* Check overlap between data, CPU, and accelerator
>* Use profiler to remove stalls and blocking work



In [None]:
#@title Python Code - Finding Performance Bottlenecks

# This script shows TensorFlow profiling basics.
# We create a slow input pipeline on purpose.
# Then we compare slow and optimized pipeline traces.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import time
import random

# Import TensorFlow and set seeds.
import tensorflow as tf
import numpy as np

# Print TensorFlow version once.
print("TensorFlow version:", tf.__version__)

# Set global random seeds for reproducibility.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)

# Configure TensorFlow random seed deterministically.
tf.random.set_seed(seed_value)

# Detect available device type for information.
physical_gpus = tf.config.list_physical_devices("GPU")

# Print a short device availability message.
print("GPUs available:", len(physical_gpus))

# Load MNIST dataset from Keras utilities.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()

# Reduce dataset size for quick profiling.
x_train = x_train[:2000]
y_train = y_train[:2000]

# Normalize and add channel dimension.
x_train = x_train.astype("float32") / 255.0
x_train = np.expand_dims(x_train, axis=-1)

# Validate shapes before building datasets.
print("Train shape:", x_train.shape, y_train.shape)

# Define a simple convolutional model.
def create_model():
    inputs = tf.keras.Input(shape=(28, 28, 1))
    x = tf.keras.layers.Conv2D(16, 3, activation="relu")(inputs)
    x = tf.keras.layers.MaxPool2D()(x)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(32, activation="relu")(x)
    outputs = tf.keras.layers.Dense(10, activation="softmax")(x)
    model = tf.keras.Model(inputs=inputs, outputs=outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Create a deliberately slow preprocessing function.
def slow_preprocess(image, label):
    image = tf.image.resize(image, (40, 40))
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.3)
    image = tf.image.central_crop(image, central_fraction=0.7)
    image = tf.image.resize(image, (28, 28))
    image = tf.image.per_image_standardization(image)
    image = tf.py_function(lambda x: x, [image], Tout=tf.float32)
    image.set_shape((28, 28, 1))
    return image, label


# Create a faster preprocessing function.
def fast_preprocess(image, label):
    image = tf.image.resize(image, (28, 28))
    image = tf.clip_by_value(image, 0.0, 1.0)
    return image, label


# Build a slow input pipeline without optimizations.
def make_slow_dataset(batch_size):
    ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
    ds = ds.shuffle(512, seed=seed_value)
    ds = ds.map(slow_preprocess, num_parallel_calls=None)
    ds = ds.batch(batch_size)
    ds = ds.prefetch(1)
    return ds


# Build a faster input pipeline with optimizations.
def make_fast_dataset(batch_size):
    ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
    ds = ds.shuffle(512, seed=seed_value)
    ds = ds.map(
        fast_preprocess,
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    ds = ds.batch(batch_size)
    ds = ds.prefetch(tf.data.AUTOTUNE)
    return ds


# Create log directory for TensorBoard profiling.
log_root = "./tf_profile_logs"
os.makedirs(log_root, exist_ok=True)


# Helper function to train with profiler enabled.
def train_with_profiler(dataset, run_name):
    model = create_model()
    run_logdir = os.path.join(log_root, run_name)
    tb_callback = tf.keras.callbacks.TensorBoard(
        log_dir=run_logdir,
        histogram_freq=0,
        write_graph=False,
        write_images=False,
        profile_batch="2,4",
    )
    start = time.time()
    model.fit(
        dataset,
        epochs=1,
        steps_per_epoch=20,
        verbose=0,
        callbacks=[tb_callback],
    )
    end = time.time()
    return end - start, run_logdir


# Create slow and fast datasets with same batch size.
batch_size = 64
slow_ds = make_slow_dataset(batch_size)
fast_ds = make_fast_dataset(batch_size)

# Train with slow pipeline and profile selected batches.
slow_time, slow_logdir = train_with_profiler(slow_ds, "slow_input")

# Train with fast pipeline and profile selected batches.
fast_time, fast_logdir = train_with_profiler(fast_ds, "fast_input")

# Print timing comparison for both runs.
print("Slow pipeline time (s):", round(slow_time, 3))
print("Fast pipeline time (s):", round(fast_time, 3))

# Explain where to open TensorBoard traces.
print("Slow run logs:", slow_logdir)
print("Fast run logs:", fast_logdir)

# Final message summarizing bottleneck investigation.
print("Use TensorBoard profiler to inspect idle gaps.")
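
A quick way to tell an input bottleneck from a compute-bound model is to time the input pipeline with no model attached. The sketch below uses random tensors and a hypothetical `time_input_pipeline` helper. If the pipeline alone takes about as long per batch as a full training step, the input side is the likely limit.

```python
import time
import tensorflow as tf


def time_input_pipeline(dataset, num_batches):
    """Iterate a dataset with no model attached; return (batches_seen, seconds)."""
    start = time.perf_counter()
    seen = 0
    for _ in dataset.take(num_batches):
        seen += 1
    return seen, time.perf_counter() - start


# Synthetic stand-in data; shapes mirror MNIST but the values are random.
images = tf.random.uniform((512, 28, 28, 1))
labels = tf.zeros((512,), dtype=tf.int32)


def augment(image, label):
    return tf.image.random_flip_left_right(image), label


base = tf.data.Dataset.from_tensor_slices((images, labels))
serial = base.map(augment).batch(64)
parallel = (
    base.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)
)

for name, ds in [("serial", serial), ("parallel", parallel)]:
    seen, seconds = time_input_pipeline(ds, num_batches=8)
    print(f"{name}: {seen} batches in {seconds:.4f} s")
```

On a dataset this small the two timings may be close, so treat the numbers as a pattern to apply to a real pipeline rather than as a benchmark.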



## **3. Training Stability Essentials**

### **3.1. NaN and Inf Detection**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_10/Lecture_B/image_03_01.jpg?v=1769568017" width="250">



>* NaN and Inf arise from numerical instability
>* They disrupt training and waste significant compute time

>* Watch loss and metrics for sudden spikes
>* Add checks that stop training when NaNs appear

>* Use NaN timing to infer likely causes
>* Log data and stats to locate the failing operation



In [None]:
#@title Python Code - NaN and Inf Detection

# This script shows NaN and Inf detection basics.
# It uses TensorFlow tensors and a tiny model.
# Focus on safe checks during and after training.

# Install TensorFlow if needed in your environment.
# !pip install tensorflow==2.20.0

# Import required modules from TensorFlow.
import tensorflow as tf

# Set deterministic seeds for reproducibility.
tf.random.set_seed(42)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Create a small tensor with a zero value.
base_tensor = tf.constant([1.0, 0.0, -1.0], dtype=tf.float32)

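# Intentionally create Inf by dividing by zero.
# Note: 1/0 gives Inf, whereas 0/0 would give NaN.
inf_tensor = 1.0 / base_tensor

# Intentionally create NaN using invalid logarithm.
# log(-1) is NaN and log(0) is -Inf, so this tensor holds both.
nan_tensor = tf.math.log(tf.constant([-1.0, 0.0, 1.0]))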

# Define a helper function to summarize bad values.
def summarize_bad_values(tensor, name):
    # Ensure tensor is a TensorFlow tensor.
    tensor = tf.convert_to_tensor(tensor)

    # Count NaN values inside the tensor.
    nan_mask = tf.math.is_nan(tensor)
    nan_count = tf.reduce_sum(tf.cast(nan_mask, tf.int32))

    # Count Inf values inside the tensor.
    inf_mask = tf.math.is_inf(tensor)
    inf_count = tf.reduce_sum(tf.cast(inf_mask, tf.int32))

    # Print a short summary line for this tensor.
    print(f"{name}: NaNs={int(nan_count)}, Infs={int(inf_count)}")

# Show NaN and Inf counts for our tensors.
summarize_bad_values(inf_tensor, "inf_tensor")

# Show NaN and Inf counts for the log tensor.
summarize_bad_values(nan_tensor, "nan_tensor")

# Build a tiny model that can become unstable.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1)
])

# Compile with a deliberately high learning rate.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=5.0),
              loss="mse")

# Create a tiny synthetic regression dataset with shape (16, 1).
x = tf.reshape(tf.linspace(-1.0, 1.0, 16), (-1, 1))

# Define targets with a simple linear relationship.
y = 3.0 * x + 0.5

# Confirm shapes are as expected before training.
assert x.shape == y.shape

# Train for a few epochs with silent logging.
history = model.fit(x, y, epochs=10, verbose=0)

# Convert loss history to a tensor for checking.
loss_tensor = tf.convert_to_tensor(history.history["loss"])

# Summarize NaN and Inf in the loss history.
summarize_bad_values(loss_tensor, "training_loss")

# Print final loss value for quick inspection.
print("Final loss value:", float(loss_tensor[-1]))
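
A check that stops training as soon as NaNs appear is available as the built-in `tf.keras.callbacks.TerminateOnNaN` callback. The sketch below forces the situation by corrupting one label with NaN, which is an artificial assumption made so that the failure is deterministic. The model and data are otherwise arbitrary.

```python
import numpy as np
import tensorflow as tf

# Tiny regression data with one deliberately corrupted label.
x = np.linspace(-1.0, 1.0, 32, dtype="float32").reshape(-1, 1)
y = 3.0 * x + 0.5
y[5, 0] = np.nan

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="sgd", loss="mse")

# TerminateOnNaN stops fit() once a batch reports a NaN or Inf loss.
history = model.fit(
    x, y,
    epochs=5,
    batch_size=32,
    shuffle=False,
    verbose=0,
    callbacks=[tf.keras.callbacks.TerminateOnNaN()],
)

epochs_run = len(history.history["loss"])
print("Epochs requested: 5, epochs run:", epochs_run)
```

Stopping early saves compute, and the epoch at which training stopped is itself a clue: a failure on the very first batch usually points at the data rather than the optimizer.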




### **3.2. Gradient Clipping Basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_10/Lecture_B/image_03_02.jpg?v=1769568050" width="250">



>* Exploding gradients can destabilize and break training
>* Gradient clipping caps gradient size to maintain stability

>* Clip by value limits individual gradient components
>* Clip by global norm rescales overall gradient magnitude

>* Choose clipping thresholds that balance stability and learning speed
>* Monitor training signals and combine with other regularization



In [None]:
#@title Python Code - Gradient Clipping Basics

# This script shows gradient clipping basics.
# It compares training with and without clipping.
# Use it to observe stability and loss behavior.

# !pip install tensorflow==2.20.0

# Import required libraries safely.
import os
import random
import numpy as np
import tensorflow as tf

# Set deterministic seeds for reproducibility.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Select device preferring GPU when available.
physical_gpus = tf.config.list_physical_devices("GPU")
if physical_gpus:
    device_name = "GPU"
else:
    device_name = "CPU"

# Inform which device will likely be used.
print("Running on device type:", device_name)

# Load a small subset of MNIST digits.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
subset_size = 2000
x_train = x_train[:subset_size]
y_train = y_train[:subset_size]

# Normalize images to the range zero one.
x_train = x_train.astype("float32") / 255.0
x_train = np.expand_dims(x_train, axis=-1)

# Validate shapes before building models.
print("Training data shape:", x_train.shape)
print("Training labels shape:", y_train.shape)

# Create a simple model factory function.
def create_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    return model

# Build model without gradient clipping.
model_no_clip = create_model()
optimizer_no_clip = tf.keras.optimizers.Adam(learning_rate=0.01)
model_no_clip.compile(
    optimizer=optimizer_no_clip,
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

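# Build model with gradient clipping by global norm.
# Note: `clipnorm` clips each gradient tensor separately;
# `global_clipnorm` clips the norm across all gradients together.
clip_norm_value = 1.0
optimizer_clip = tf.keras.optimizers.Adam(
    learning_rate=0.01,
    global_clipnorm=clip_norm_value,
)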
model_clip = create_model()
model_clip.compile(
    optimizer=optimizer_clip,
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Prepare a small dataset pipeline for speed.
batch_size = 128
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
dataset = dataset.shuffle(buffer_size=subset_size, seed=seed_value)
dataset = dataset.batch(batch_size)

# Train both models briefly without verbose logs.
print("Training models for three epochs each.")
history_no_clip = model_no_clip.fit(
    dataset,
    epochs=3,
    verbose=0,
)
history_clip = model_clip.fit(
    dataset,
    epochs=3,
    verbose=0,
)

# Extract final losses and accuracies for comparison.
final_loss_no_clip = history_no_clip.history["loss"][-1]
final_acc_no_clip = history_no_clip.history["accuracy"][-1]
final_loss_clip = history_clip.history["loss"][-1]
final_acc_clip = history_clip.history["accuracy"][-1]

# Print a short comparison of results.
print("No clipping final loss:", round(float(final_loss_no_clip), 4))
print("No clipping final accuracy:", round(float(final_acc_no_clip), 4))
print("Clipping final loss:", round(float(final_loss_clip), 4))
print("Clipping final accuracy:", round(float(final_acc_clip), 4))

# Show configured gradient clipping hyperparameter.
print("Used global norm clip value:", clip_norm_value)
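
The difference between the clipping modes is easiest to see on a small hand-picked gradient list. The sketch below applies `tf.clip_by_value`, per-tensor `tf.clip_by_norm`, and `tf.clip_by_global_norm` to the same two tensors. The gradient values are chosen for easy arithmetic and are not taken from a real model.

```python
import tensorflow as tf

# Two made-up gradient tensors; their combined (global) norm is 13.
grads = [tf.constant([3.0, 4.0]), tf.constant([0.0, 12.0])]

# Clip by value: caps each component, which can change the direction.
by_value = [tf.clip_by_value(g, -1.0, 1.0) for g in grads]

# Clip by norm per tensor: each tensor is rescaled on its own.
per_tensor = [tf.clip_by_norm(g, 1.0) for g in grads]

# Clip by global norm: one shared scale factor, direction preserved.
by_global, norm_before = tf.clip_by_global_norm(grads, 1.0)
norm_after = tf.linalg.global_norm(by_global)

print("global norm before:", float(norm_before))
print("by value:", [g.numpy().tolist() for g in by_value])
print("per tensor:", [g.numpy().tolist() for g in per_tensor])
print("by global norm:", [g.numpy().tolist() for g in by_global])
print("global norm after:", float(norm_after))
```

In Keras optimizers these three behaviours correspond to the `clipvalue`, `clipnorm`, and `global_clipnorm` arguments respectively.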




### **3.3. Tuning Learning Dynamics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_10/Lecture_B/image_03_03.jpg?v=1769568115" width="250">



>* Training stability depends on balanced learning dynamics
>* Use tuned learning rate schedules with warmup and decay

>* Optimizer choice and settings strongly affect stability
>* Tune adaptive optimizer hyperparameters using training feedback

>* Regularization and batch size strongly affect stability
>* Tune them jointly, monitor metrics, and refine iteratively



In [None]:
#@title Python Code - Tuning Learning Dynamics

# This script shows stable learning dynamics.
# We compare two learning rates for stability.
# Watch loss curves and gradient norms carefully.

# !pip install tensorflow==2.20.0

# Import required libraries safely.
import os
import random
import numpy as np
import tensorflow as tf

# Set deterministic seeds for reproducibility.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Print TensorFlow version in one line.
print("TensorFlow version:", tf.__version__)

# Select device preferring GPU when available.
physical_gpus = tf.config.list_physical_devices("GPU")
if physical_gpus:
    device_name = "/GPU:0"
else:
    device_name = "/CPU:0"

# Create a tiny synthetic regression dataset.
num_samples = 256
x_data = np.linspace(-2.0, 2.0, num_samples).astype("float32")
noise = 0.1 * np.random.randn(num_samples).astype("float32")

# Generate targets with a simple nonlinear relationship.
y_data = 3.0 * x_data ** 2 + 0.5 * x_data + noise

# Expand dimensions to match dense layer expectations.
x_data = np.expand_dims(x_data, axis=-1)
y_data = np.expand_dims(y_data, axis=-1)

# Validate shapes before building datasets.
assert x_data.shape == (num_samples, 1)
assert y_data.shape == (num_samples, 1)

# Build a small tf.data.Dataset for training.
train_ds = tf.data.Dataset.from_tensor_slices((x_data, y_data))
train_ds = train_ds.shuffle(256, seed=seed_value).batch(32)

# Define a simple regression model factory.
def create_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(1,)),
        tf.keras.layers.Dense(16, activation="tanh"),
        tf.keras.layers.Dense(16, activation="tanh"),
        tf.keras.layers.Dense(1),
    ])
    return model

# Create two models for different learning rates.
model_stable = create_model()
model_unstable = create_model()

# Define mean squared error loss function.
loss_fn = tf.keras.losses.MeanSquaredError()

# Create optimizers with different learning rates.
optimizer_stable = tf.keras.optimizers.Adam(learning_rate=0.01)
optimizer_unstable = tf.keras.optimizers.Adam(learning_rate=0.5)

# Prepare lists to store loss and gradient norms.
stable_losses = []
unstable_losses = []
stable_grad_norms = []
unstable_grad_norms = []

# Define one training step using GradientTape.
def train_step(model, optimizer, x_batch, y_batch):
    with tf.GradientTape() as tape:
        preds = model(x_batch, training=True)
        loss = loss_fn(y_batch, preds)
    grads = tape.gradient(loss, model.trainable_variables)
    grad_norm = tf.linalg.global_norm(grads)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss, grad_norm

# Run a few epochs to compare dynamics.
num_epochs = 5
with tf.device(device_name):
    for epoch in range(num_epochs):
        for x_batch, y_batch in train_ds:
            loss_s, norm_s = train_step(
                model_stable, optimizer_stable, x_batch, y_batch
            )
            loss_u, norm_u = train_step(
                model_unstable, optimizer_unstable, x_batch, y_batch
            )
            stable_losses.append(float(loss_s.numpy()))
            unstable_losses.append(float(loss_u.numpy()))
            stable_grad_norms.append(float(norm_s.numpy()))
            unstable_grad_norms.append(float(norm_u.numpy()))

# Compute simple statistics for both configurations.
stable_loss_last = stable_losses[-1]
unstable_loss_last = unstable_losses[-1]

# Safely compute maximum gradient norms.
max_stable_grad = max(stable_grad_norms)
max_unstable_grad = max(unstable_grad_norms)

# Print a compact comparison summary.
print("Stable lr=0.01 final loss:", round(stable_loss_last, 4))
print("Unstable lr=0.5 final loss:", round(unstable_loss_last, 4))
print("Stable lr=0.01 max grad norm:", round(max_stable_grad, 4))
print("Unstable lr=0.5 max grad norm:", round(max_unstable_grad, 4))

# Show a few early and late loss values.
print("First three stable losses:", [round(v, 4) for v in stable_losses[:3]])
print("First three unstable losses:", [round(v, 4) for v in unstable_losses[:3]])
print("Last three stable losses:", [round(v, 4) for v in stable_losses[-3:]])
print("Last three unstable losses:", [round(v, 4) for v in unstable_losses[-3:]])

# Indicate whether unstable run shows warning signs.
print("Unstable run has exploding gradients:", max_unstable_grad > 50.0)
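
A schedule with warmup and decay can be sketched as a small `LearningRateSchedule` subclass. The class below is a hypothetical example, not a built-in: it ramps linearly to a peak and then follows a cosine curve down to zero. The peak rate and step counts are placeholder values to tune for a real run.

```python
import math
import tensorflow as tf


class WarmupCosineDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Linear warmup to peak_lr, then cosine decay to zero (illustrative)."""

    def __init__(self, peak_lr, warmup_steps, decay_steps):
        super().__init__()
        self.peak_lr = float(peak_lr)
        self.warmup_steps = float(warmup_steps)
        self.decay_steps = float(decay_steps)

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        # Warmup: reach peak_lr on the last warmup step.
        warmup_lr = self.peak_lr * (step + 1.0) / self.warmup_steps
        # Decay: progress runs from 0 to 1 over decay_steps, then stays at 1.
        progress = tf.clip_by_value(
            (step - self.warmup_steps) / self.decay_steps, 0.0, 1.0
        )
        decayed_lr = 0.5 * self.peak_lr * (1.0 + tf.cos(math.pi * progress))
        return tf.where(step < self.warmup_steps, warmup_lr, decayed_lr)

    def get_config(self):
        return {
            "peak_lr": self.peak_lr,
            "warmup_steps": self.warmup_steps,
            "decay_steps": self.decay_steps,
        }


schedule = WarmupCosineDecay(peak_lr=0.01, warmup_steps=10, decay_steps=100)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)

for step in (0, 9, 10, 60, 110):
    print("step", step, "lr", float(schedule(step)))
```

Starting from a small rate limits the size of the first updates, when gradients are often at their largest, and the decay phase reduces oscillation late in training.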



# <font color="#418FDE" size="6.5" uppercase>**Performance and Debug**</font>


In this lecture, you learned to:
- Enable and configure mixed precision training in TensorFlow 2.20.0 to leverage modern GPUs and TPUs. 
- Use TensorFlow profiling tools to identify performance bottlenecks in models and input pipelines. 
- Diagnose and mitigate common numerical and stability issues such as NaNs and exploding gradients. 

<font color='yellow'>Congratulations on completing this course!</font>