# <font color="#418FDE" size="6.5" uppercase>**Performance and Debug**</font>

>Last update: 20260128.
    
By the end of this lecture, you will be able to:
- Enable and configure mixed precision training in TensorFlow 2.20.0 to leverage modern GPUs and TPUs. 
- Use TensorFlow profiling tools to identify performance bottlenecks in models and input pipelines. 
- Diagnose and mitigate common numerical and stability issues such as NaNs and exploding gradients. 


## **1. Mixed Precision Setup**

### **1.1. Global Mixed Precision Policy**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_10/Lecture_B/image_01_01.jpg?v=1769608039" width="250">



>* One central setting controls model numeric precision
>* Ensures consistent behavior and easy experimentation changes

>* Global policy exploits fast low-precision GPU hardware
>* Keeps sensitive values high precision for stability

>* Global precision policies improve discipline and reproducibility
>* They ease hardware adaptation and performance–quality tradeoffs



In [None]:
#@title Python Code - Global Mixed Precision Policy

# This script shows global mixed precision policy.
# It uses a tiny model for quick demonstration.
# Run in Colab with TensorFlow 2.20.0 installed.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import random
import numpy as np

# Import TensorFlow and Keras utilities.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Set deterministic random seeds.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Print TensorFlow version and device info.
print("TensorFlow version:", tf.__version__)
print("GPU available:", tf.config.list_physical_devices("GPU"))

# Import the mixed precision API and list common policy names.
from tensorflow.keras import mixed_precision
policy_names = ["float32", "mixed_float16", "mixed_bfloat16"]
print("Common policies:", policy_names)

# Create and set a global mixed precision policy.
policy = mixed_precision.Policy("mixed_float16")
mixed_precision.set_global_policy(policy)
print("Global policy:", mixed_precision.global_policy())

# Prepare a tiny subset of MNIST data.
(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = x_train[:2048].astype("float32") / 255.0
y_train = y_train[:2048].astype("int32")

# Add a channel dimension for convolution.
x_train = np.expand_dims(x_train, axis=-1)
print("Train shape:", x_train.shape)

# Validate shapes before building the model.
assert x_train.shape[1:] == (28, 28, 1)
assert y_train.ndim == 1

# Build a simple CNN model under global policy.
inputs = keras.Input(shape=(28, 28, 1))
x = layers.Conv2D(16, 3, activation="relu")(inputs)
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)
x = layers.Dense(32, activation="relu")(x)
# Keep the logits in float32 so the loss is computed at full precision.
outputs = layers.Dense(10, dtype="float32")(x)
model = keras.Model(inputs, outputs)

# Show compute and variable dtypes to see policy effect.
for layer in model.layers:
    print("Layer", layer.name,
          "compute:", layer.compute_dtype,
          "variables:", layer.variable_dtype)

# Use a loss scale optimizer for stability.
base_opt = keras.optimizers.Adam(learning_rate=1e-3)
opt = mixed_precision.LossScaleOptimizer(base_opt)

# Compile the model with sparse labels.
model.compile(optimizer=opt,
              loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])

# Train briefly with silent logs for speed.
history = model.fit(x_train,
                    y_train,
                    epochs=1,
                    batch_size=64,
                    verbose=0)

# Print final training metrics from history.
final_loss = history.history["loss"][-1]
final_acc = history.history["accuracy"][-1]
print("Final loss (mixed):", float(final_loss))
print("Final accuracy (mixed):", float(final_acc))




### **1.2. Loss Scaling Essentials**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_10/Lecture_B/image_01_02.jpg?v=1769608085" width="250">



>* Loss scaling prevents tiny gradients from underflowing
>* Scale the loss up before backpropagation, then scale gradients back down

>* Static scaling uses one fixed carefully chosen factor
>* Dynamic scaling auto-adjusts factor to avoid overflows

>* Use loss scaling signals to debug training
>* Tune scaling to match gradient ranges and hardware
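
The bullets above can be illustrated without a GPU. The following is a minimal sketch, assuming float16 storage can be mimicked with Python's `struct` half-precision format; the helper names `to_float16` and `dynamic_loss_scale_step`, the scale values, and the gradient values are hypothetical and chosen only for illustration. It shows a tiny gradient underflowing to zero, the same gradient surviving when scaled first, and a simple halve-on-overflow rule of the kind dynamic loss scaling relies on.

```python
import math
import struct


def to_float16(value):
    """Round a Python float to float16 and back, mimicking a float16 store."""
    try:
        return struct.unpack("e", struct.pack("e", value))[0]
    except OverflowError:
        # struct rejects magnitudes beyond the float16 range; model that as Inf.
        return math.copysign(math.inf, value)


def dynamic_loss_scale_step(scale, grads_finite, good_steps, growth_interval=2000):
    """Return (new_scale, new_good_steps, apply_update) for one training step."""
    if not grads_finite:
        # Overflow: skip this update and halve the scale.
        return scale / 2.0, 0, False
    good_steps += 1
    if good_steps >= growth_interval:
        # A long run of finite steps: try a larger scale.
        return scale * 2.0, 0, True
    return scale, good_steps, True


# Underflow: a tiny gradient vanishes in float16 unless it is scaled first.
tiny_grad = 1e-8
unscaled = to_float16(tiny_grad)
static_scale = 1024.0
recovered = to_float16(tiny_grad * static_scale) / static_scale
print("float16 without scaling:", unscaled)
print("float16 with scaling, then unscaled:", recovered)

# Overflow: a scale that is too large is halved until gradients are finite.
scale, good_steps = 2.0 ** 15, 0
large_grad = 10.0
history = []
for _ in range(4):
    finite = math.isfinite(to_float16(large_grad * scale))
    history.append((scale, finite))
    scale, good_steps, _applied = dynamic_loss_scale_step(scale, finite, good_steps)
print("scale tried, finite?:", history)
```

In a real training run the optimizer wrapper performs these steps internally; a scale that keeps shrinking is a hint that gradients are unusually large.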



### **1.3. Future Hardware Needs**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_10/Lecture_B/image_01_03.jpg?v=1769608230" width="250">



>* Modern accelerators are built for mixed precision
>* Choose hardware that fully supports mixed precision workflows

>* Mixed precision changes memory needs and throughput tradeoffs
>* Match GPU or TPU choices to workload characteristics

>* Plan fast interconnects to avoid communication bottlenecks
>* Use mixed precision to cut energy use and scale to larger workloads
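
To make the memory tradeoff concrete, here is a rough back-of-the-envelope sketch. It assumes the output of the first `Conv2D` layer from section 1.1 (26×26×16 per sample at batch size 64) and counts only that one activation tensor. Weights, optimizer state, and workspace memory are ignored, and the helper `activation_bytes` is hypothetical.

```python
# Bytes per element for the dtypes discussed in this lecture.
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2, "bfloat16": 2}


def activation_bytes(batch_size, per_sample_shape, dtype):
    """Estimate the bytes needed to hold one activation tensor."""
    elements = batch_size
    for dim in per_sample_shape:
        elements *= dim
    return elements * BYTES_PER_ELEMENT[dtype]


# 28x28 input, 3x3 kernel, "valid" padding, 16 filters -> 26x26x16 output.
conv_output_shape = (26, 26, 16)
batch_size = 64

fp32_bytes = activation_bytes(batch_size, conv_output_shape, "float32")
fp16_bytes = activation_bytes(batch_size, conv_output_shape, "float16")
print(f"float32 activations: {fp32_bytes / 1e6:.2f} MB")
print(f"float16 activations: {fp16_bytes / 1e6:.2f} MB")
```

Halving activation memory is what often allows a larger batch size on the same device, which is one reason hardware choice and precision policy are best planned together.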



## **2. TensorFlow Profiling Essentials**

### **2.1. TensorBoard Profiler Setup**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_10/Lecture_B/image_02_01.jpg?v=1769608252" width="250">



>* TensorBoard Profiler collects and visualizes training performance
>* Use it to inspect detailed operation and resource usage

>* Choose log directory and start TensorBoard access
>* Profile a short, steady-state window of training

>* Complex setups need coordinated logs and permissions
>* Ensure clear path from training job to TensorBoard



In [None]:
#@title Python Code - TensorBoard Profiler Setup

# This script shows TensorBoard profiler setup.
# It uses a tiny Keras model example.
# It is designed for Google Colab use.

# Install TensorFlow only if needed here.
# !pip install -q tensorflow==2.20.0

# Import required standard libraries safely.
import os
import pathlib
import random

# Import TensorFlow and TensorBoard utilities.
import tensorflow as tf
from tensorflow import keras

# Set deterministic seeds for reproducibility.
seed_value = 42
random.seed(seed_value)

# Set TensorFlow random seed for determinism.
tf.random.set_seed(seed_value)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Detect available device type for information.
physical_gpus = tf.config.list_physical_devices("GPU")

# Choose device description string for printing.
device_desc = "GPU" if physical_gpus else "CPU"

# Print which device type will likely run.
print("Running on device type:", device_desc)

# Define a small log directory for profiling.
base_log_dir = pathlib.Path("logs_profiler_demo")

# Ensure the log directory exists on disk.
base_log_dir.mkdir(parents=True, exist_ok=True)

# Create a unique run directory for this script.
run_log_dir = base_log_dir / "run_01"

# Ensure the run directory exists for summaries.
run_log_dir.mkdir(parents=True, exist_ok=True)

# Print the log directory path for TensorBoard.
print("Log directory:", str(run_log_dir))

# Load MNIST dataset with small subset only.
(mnist_x_train, mnist_y_train), _ = keras.datasets.mnist.load_data()

# Normalize images to float32 in range zero one.
mnist_x_train = mnist_x_train.astype("float32") / 255.0

# Add channel dimension to images for Conv2D.
mnist_x_train = mnist_x_train[..., tf.newaxis]

# Select a small subset to keep runtime short.
subset_size = 2000

# Validate subset size does not exceed dataset.
subset_size = min(subset_size, mnist_x_train.shape[0])

# Slice the subset for features and labels.
mnist_x_train = mnist_x_train[:subset_size]
mnist_y_train = mnist_y_train[:subset_size]

# Confirm shapes are as expected for training.
print("Train subset shape:", mnist_x_train.shape)

# Build a simple sequential convolutional model.
model = keras.Sequential([
    keras.layers.Input(shape=(28, 28, 1)),
    keras.layers.Conv2D(16, (3, 3), activation="relu"),
    keras.layers.MaxPooling2D(pool_size=(2, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])

# Compile the model with simple optimizer.
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Create one TensorBoard callback that logs scalars and profiles batches 20-30.
tb_callback = keras.callbacks.TensorBoard(
    log_dir=str(run_log_dir),
    histogram_freq=0,
    write_graph=True,
    write_images=False,
    profile_batch=(20, 30),
)

# Define a function to run a short training.
def run_training_with_profiler():
    # Train the model briefly with the profiling callback.
    history = model.fit(
        mnist_x_train,
        mnist_y_train,
        epochs=2,
        batch_size=64,
        verbose=0,
        callbacks=[tb_callback],
    )

    # Return final training accuracy for confirmation.
    return history.history["accuracy"][-1]

# Run the training function to generate traces.
final_accuracy = run_training_with_profiler()

# Print short instructions for launching TensorBoard.
print("To view profiler, run in Colab cell:")

# Provide the exact TensorBoard magic command string.
print("%load_ext tensorboard")

# Provide the TensorBoard start command with logdir.
print("%tensorboard --logdir", str(base_log_dir))

# Print final accuracy to show training completed.
print("Final training accuracy:", round(float(final_accuracy), 4))




### **2.2. Reading Trace Timelines**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_10/Lecture_B/image_02_02.jpg?v=1769608298" width="250">



>* Trace timeline shows time-aligned model and hardware activity
>* Tracks and colored blocks reveal operation order and efficiency

>* Inspect one typical step’s CPU–GPU workflow
>* Compare idle CPU or GPU to locate bottlenecks

>* Spot gaps, tiny kernels, and poor overlap
>* Link visual timeline patterns to concrete bottlenecks



In [None]:
#@title Python Code - Reading Trace Timelines

# This script shows TensorFlow profiling basics.
# You will capture and view a trace.
# Focus on reading simple timeline patterns.

# Install TensorFlow only if missing.
# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import random
import numpy as np

# Import TensorFlow and Keras utilities.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Set deterministic seeds for reproducibility.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Select device string based on availability.
if tf.config.list_physical_devices("GPU"):
    device_name = "/GPU:0"
else:
    device_name = "/CPU:0"

# Print which device will run the model.
print("Using device for demo:", device_name)

# Load MNIST dataset from Keras datasets.
(x_train, y_train), _ = keras.datasets.mnist.load_data()

# Reduce dataset size for quick profiling.
x_train = x_train[:2000]
y_train = y_train[:2000]

# Normalize images to float32 in range zero one.
x_train = x_train.astype("float32") / 255.0

# Add channel dimension for convolutional layers.
x_train = np.expand_dims(x_train, axis=-1)

# Validate shapes before building dataset.
print("Train data shape:", x_train.shape)

# Build a simple tf.data pipeline.
ds_train = tf.data.Dataset.from_tensor_slices((x_train, y_train))

# Shuffle and batch with small batch size.
ds_train = ds_train.shuffle(2000, seed=seed_value).batch(64)

# Prefetch to overlap input and compute.
ds_train = ds_train.prefetch(tf.data.AUTOTUNE)

# Define a small CNN model for the demo.
def create_model():
    model = keras.Sequential([
        keras.Input(shape=(28, 28, 1)),
        layers.Conv2D(16, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(32, activation="relu"),
        layers.Dense(10, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Create log directory for TensorBoard traces.
log_dir = "logs_profile_demo"
os.makedirs(log_dir, exist_ok=True)

# Explain where to open TensorBoard later.
print("Trace logs will be in:", log_dir)

# Create a TensorBoard profiler callback.
profiler_callback = tf.keras.callbacks.TensorBoard(
    log_dir=log_dir,
    histogram_freq=0,
    write_graph=False,
    write_images=False,
    profile_batch="2,4",
)

# Train inside selected device context.
with tf.device(device_name):
    model = create_model()
    model.fit(
        ds_train,
        epochs=1,
        steps_per_epoch=10,
        verbose=0,
        callbacks=[profiler_callback],
    )

# Print short instructions for reading timelines.
print("Open TensorBoard with: tensorboard --logdir=logs_profile_demo")
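
Reading a timeline mostly means finding where a device track is empty. The sketch below assumes one track has been reduced to a list of `(start_ms, end_ms)` busy intervals. The event times are hypothetical, not profiler output, and the helpers `idle_gaps` and `utilization` are illustrative names rather than part of any TensorFlow API.

```python
def idle_gaps(events, min_gap=0.0):
    """Return (gap_start, gap_end) pairs between busy intervals on one track."""
    gaps = []
    busy_until = None
    for start, end in sorted(events):
        if busy_until is not None and start - busy_until > min_gap:
            gaps.append((busy_until, start))
        busy_until = end if busy_until is None else max(busy_until, end)
    return gaps


def utilization(events):
    """Fraction of the observed window during which the track was busy."""
    ordered = sorted(events)
    window = max(end for _, end in ordered) - ordered[0][0]
    idle = sum(end - start for start, end in idle_gaps(ordered))
    return (window - idle) / window


# Hypothetical tracks, in milliseconds, for three training steps.
gpu_track = [(0, 4), (4, 6), (15, 19), (19, 21), (30, 34)]
input_track = [(0, 15), (15, 30)]

gpu_gaps = idle_gaps(gpu_track)
gpu_util = utilization(gpu_track)
input_util = utilization(input_track)
print("GPU idle gaps:", gpu_gaps)
print(f"GPU utilization: {gpu_util:.2f}, input utilization: {input_util:.2f}")
```

A GPU track with long gaps next to a fully busy input track is the pattern that suggests an input-bound step; the reverse pattern points at the model itself.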




### **2.3. Finding Performance Bottlenecks**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_10/Lecture_B/image_02_03.jpg?v=1769608373" width="250">



>* Compare step time across compute and input stages
>* Use summaries to locate input or model bottlenecks

>* Use trace timelines to locate slow operations
>* Spot idle gaps or many tiny kernels

>* Use profiler patterns to spot memory and hardware issues
>* Iteratively tweak data, model, and settings to remove bottlenecks



In [None]:
#@title Python Code - Finding Performance Bottlenecks

# This script shows how input pipelines affect step time.
# It compares slow and fast input pipelines by timing one epoch each.
# Use it to spot simple performance bottlenecks.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import time
import random

# Import TensorFlow and check version.
import tensorflow as tf

# Set deterministic seeds for reproducibility.
seed_value = 42
random.seed(seed_value)

# Set TensorFlow random seed deterministically.
tf.random.set_seed(seed_value)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Detect available device type for information.
physical_gpus = tf.config.list_physical_devices("GPU")

# Choose device string based on availability.
device_type = "GPU" if physical_gpus else "CPU"

# Print which device type will likely be used.
print("Running primarily on:", device_type)

# Create a small synthetic dataset for profiling.
num_samples = 2048
feature_dim = 32

# Build random features and labels tensors.
features = tf.random.normal((num_samples, feature_dim))

# Use a simple binary label pattern.
labels = tf.cast(tf.reduce_sum(features, axis=1) > 0, tf.int32)

# Validate shapes before building datasets.
assert features.shape[0] == labels.shape[0]

# Define a simple Keras model for demonstration.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(feature_dim,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Compile model with basic optimizer and loss.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# Create a deliberately slow input pipeline.
slow_ds = tf.data.Dataset.from_tensor_slices((features, labels))

# Add an expensive Python sleep to each element.
slow_ds = slow_ds.map(
    lambda x, y: (tf.py_function(
        func=lambda z: (time.sleep(0.001), z)[1],
        inp=[x], Tout=tf.float32
    ), y)
)

# Ensure the shape information is preserved for the slow pipeline.
def _set_shapes(x, y):
    x.set_shape((feature_dim,))
    return x, y

slow_ds = slow_ds.map(_set_shapes)

# Batch the slow dataset with a small batch size.
slow_ds = slow_ds.batch(32).prefetch(1)

# Create a fast, well optimized input pipeline.
fast_ds = tf.data.Dataset.from_tensor_slices((features, labels))

# Cache first, then shuffle, batch, and prefetch so each epoch reshuffles.
fast_ds = fast_ds.cache().shuffle(512).batch(32).prefetch(tf.data.AUTOTUNE)

# Define a helper function to time one training epoch.
def time_one_epoch(dataset, description):
    # Warm up the model with one silent step.
    for batch_x, batch_y in dataset.take(1):
        _ = model.train_on_batch(batch_x, batch_y)

    # Start wall clock timer for the epoch.
    start_time = time.time()

    # Run exactly one epoch with verbose disabled.
    history = model.fit(
        dataset,
        epochs=1,
        verbose=0,
        steps_per_epoch=num_samples // 32,
    )

    # Compute elapsed time in seconds.
    elapsed = time.time() - start_time

    # Extract final loss and accuracy.
    final_loss = float(history.history["loss"][-1])
    final_acc = float(history.history["accuracy"][-1])

    # Print a concise summary line.
    print(
        f"{description}: time={elapsed:.3f}s, "
        f"loss={final_loss:.3f}, acc={final_acc:.3f}"
    )

# Run timing with the slow input pipeline.
time_one_epoch(slow_ds, "Slow pipeline epoch")

# Run timing with the fast input pipeline.
time_one_epoch(fast_ds, "Fast pipeline epoch")

# Explain how this relates to profiler bottlenecks.
print("Notice how input design changes total step time.")
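
One way to separate input cost from model cost is to time the input pipeline with no model attached. The following is a small sketch under that assumption: `time_pipeline`, `slow_identity`, and `slow_map` are hypothetical helpers, and the 1 ms sleep stands in for expensive Python-side preprocessing.

```python
import time

import tensorflow as tf


def time_pipeline(dataset):
    """Iterate a dataset without a model and return (batches, seconds)."""
    start = time.perf_counter()
    count = 0
    for _ in dataset:
        count += 1
    return count, time.perf_counter() - start


features = tf.random.normal((256, 32))
labels = tf.zeros((256,), dtype=tf.int32)


def slow_identity(x):
    time.sleep(0.001)  # Simulate costly Python-side preprocessing.
    return x


def slow_map(x, y):
    x = tf.py_function(func=slow_identity, inp=[x], Tout=tf.float32)
    x.set_shape((32,))  # py_function drops static shape information.
    return x, y


base = tf.data.Dataset.from_tensor_slices((features, labels))
slow_ds = base.map(slow_map).batch(32)
fast_ds = base.batch(32).prefetch(tf.data.AUTOTUNE)

slow_batches, slow_seconds = time_pipeline(slow_ds)
fast_batches, fast_seconds = time_pipeline(fast_ds)
print(f"slow input only: {slow_batches} batches in {slow_seconds:.3f}s")
print(f"fast input only: {fast_batches} batches in {fast_seconds:.3f}s")
```

If the input-only time is close to the full epoch time measured with `model.fit`, the pipeline rather than the model is the likely bottleneck.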



## **3. Numerical Stability Essentials**

### **3.1. NaN and Inf Detection**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_10/Lecture_B/image_03_01.jpg?v=1769608465" width="250">



>* NaNs and Infs arise from invalid computations
>* They spread, break training, and signal instability

>* Monitor loss and metrics for sudden anomalies
>* Use tools and checks to catch nonfinite values

>* Trace where NaNs first appear during training
>* Link causes to fixes like normalization and clipping



In [None]:
#@title Python Code - NaN and Inf Detection

# This script shows NaN and Inf detection basics.
# It uses TensorFlow tensors and a tiny model.
# Focus is on safe checks during training.

# Optional TensorFlow install for some environments.
# !pip install tensorflow==2.20.0

# Import required modules from TensorFlow.
import tensorflow as tf

# Set deterministic seeds for reproducibility.
tf.random.set_seed(7)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Create a small tensor with safe finite values.
safe_tensor = tf.constant([1.0, 2.0, 3.0], dtype=tf.float32)

# Create tensors that will contain Inf and NaN values.
inf_tensor = tf.constant([1.0, tf.float32.max], dtype=tf.float32)

# Multiply past the float32 maximum to produce infinity.
inf_result = inf_tensor * tf.constant(2.0, dtype=tf.float32)

# Create a tensor that will produce NaN values.
zero_tensor = tf.constant([0.0, 0.0], dtype=tf.float32)

# Use invalid division to intentionally create NaNs.
nan_result = zero_tensor / zero_tensor

# Define a helper function to summarize bad values.
def summarize_bad_values(x, name):
    # Ensure tensor is float type before checks.
    x = tf.cast(x, tf.float32)

    # Build boolean masks for NaN and Inf values.
    nan_mask = tf.math.is_nan(x)

    # Detect both positive and negative infinities.
    inf_mask = tf.math.is_inf(x)

    # Count how many NaN and Inf values appear.
    nan_count = tf.reduce_sum(tf.cast(nan_mask, tf.int32))

    # Count infinite values using integer casting.
    inf_count = tf.reduce_sum(tf.cast(inf_mask, tf.int32))

    # Print a short summary line for this tensor.
    print(name, "NaNs:", int(nan_count), "Infs:", int(inf_count))

# Show that the safe tensor has no bad values.
summarize_bad_values(safe_tensor, "safe_tensor")

# Show that the Inf tensor now contains infinities.
summarize_bad_values(inf_result, "inf_result")

# Show that the NaN tensor now contains NaN values.
summarize_bad_values(nan_result, "nan_result")

# Build a tiny model that can easily overflow.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1,)),
    tf.keras.layers.Dense(1, use_bias=False)
])

# Manually set a very large weight to cause overflow.
model.layers[0].set_weights([
    tf.constant([[1e20]], dtype=tf.float32)
])

# Create a small batch with moderate input values.
inputs = tf.constant([[2.0], [3.0]], dtype=tf.float32)

# Define a simple mean squared error loss function.
loss_fn = tf.keras.losses.MeanSquaredError()

# Use a basic optimizer with a small learning rate.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

# Perform one training step inside GradientTape.
with tf.GradientTape() as tape:
    # Forward pass through the unstable model.
    preds = model(inputs, training=True)

    # Target values are small and finite.
    targets = tf.zeros_like(preds)

    # Compute the loss between predictions and targets.
    loss = loss_fn(targets, preds)

# Compute gradients of loss with respect to weights.
grads = tape.gradient(loss, model.trainable_variables)

# Check predictions for NaN and Inf values.
summarize_bad_values(preds, "predictions")

# Check the loss too: squaring predictions near 1e20 overflows float32.
summarize_bad_values(loss, "loss")

# Check gradients for NaN and Inf values.
summarize_bad_values(grads[0], "gradients")

# Only apply gradients if the loss and gradients are all finite.
loss_is_finite = tf.reduce_all(tf.math.is_finite(loss))
grads_are_finite = tf.reduce_all(tf.math.is_finite(grads[0]))
if loss_is_finite and grads_are_finite:
    # Apply gradients safely when values are finite.
    optimizer.apply_gradients([(grads[0], model.trainable_variables[0])])
else:
    # Warn that update is skipped due to bad values.
    print("Skipped update due to non-finite loss or gradients.")
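
Manual masks work, but TensorFlow also offers a fail-fast check. The sketch below wraps an operation with `tf.debugging.check_numerics`, which raises an error in eager mode as soon as a tensor contains NaN or Inf. The helper `checked_log` and the input values are hypothetical.

```python
import tensorflow as tf


def checked_log(x):
    """Compute log(x) and fail fast if the result contains NaN or Inf."""
    y = tf.math.log(x)
    return tf.debugging.check_numerics(y, message="log output")


# Finite inputs pass through unchanged.
good = checked_log(tf.constant([1.0, 2.0]))
print("finite result:", good.numpy())

# log(0) is -Inf and log(-1) is NaN, so the check raises.
try:
    checked_log(tf.constant([1.0, 0.0, -1.0]))
    caught = False
except tf.errors.InvalidArgumentError as err:
    caught = True
    print("check_numerics raised:", type(err).__name__)
```

Placing such checks after suspicious operations narrows down where non-finite values first appear. `tf.debugging.enable_check_numerics()` applies a similar check globally, at some runtime cost.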




### **3.2. Gradient Clipping Basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_10/Lecture_B/image_03_02.jpg?v=1769608506" width="250">



>* Exploding gradients destabilize training and cause overflows
>* Gradient clipping limits update size, preventing unstable jumps

>* Clip by value limits each gradient component
>* Clip by norm rescales overall gradient magnitude

>* Tune clipping thresholds carefully to avoid slowdown
>* Start moderate, monitor metrics, combine with other stabilizers



In [None]:
#@title Python Code - Gradient Clipping Basics

# This script shows basic gradient clipping usage.
# It compares training with and without gradient clipping.
# It uses a tiny synthetic regression dataset.

# !pip install tensorflow==2.20.0

# Import required libraries safely.
import os
import random
import numpy as np
import tensorflow as tf

# Set deterministic seeds for reproducibility.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Select device preference based on availability.
physical_gpus = tf.config.list_physical_devices("GPU")
if physical_gpus:
    device_type = "GPU"
else:
    device_type = "CPU"

# Print which device type will likely be used.
print("Running on device type:", device_type)

# Create a tiny synthetic regression dataset in float32.
num_samples = 256
x_data = np.random.uniform(-2.0, 2.0, size=(num_samples, 1)).astype("float32")
noise = np.random.normal(0.0, 0.5, size=(num_samples, 1)).astype("float32")
y_data = 3.0 * x_data + 2.0 + noise

# Convert numpy arrays to TensorFlow datasets.
dataset = tf.data.Dataset.from_tensor_slices((x_data, y_data))
dataset = dataset.shuffle(buffer_size=num_samples, seed=seed_value)
dataset = dataset.batch(32)

# Define a helper function to build a simple model.
def build_model(use_clipping):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(1,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    if use_clipping:
        optimizer = tf.keras.optimizers.SGD(
            learning_rate=1.0,
            clipnorm=1.0,
        )
    else:
        optimizer = tf.keras.optimizers.SGD(
            learning_rate=1.0,
        )
    model.compile(
        optimizer=optimizer,
        loss="mse",
        metrics=["mae"],
    )
    return model

# Build models with and without gradient clipping.
model_no_clip = build_model(use_clipping=False)
model_clip = build_model(use_clipping=True)

# Train both models briefly with silent training logs.
history_no_clip = model_no_clip.fit(
    dataset,
    epochs=5,
    verbose=0,
)

history_clip = model_clip.fit(
    dataset,
    epochs=5,
    verbose=0,
)

# Extract final losses for comparison.
final_loss_no_clip = history_no_clip.history["loss"][-1]
final_loss_clip = history_clip.history["loss"][-1]

# Print a short summary comparing both training runs.
print("Final MSE without clipping:", round(final_loss_no_clip, 4))
print("Final MSE with clipping:", round(final_loss_clip, 4))
print("Compare the two: clipping caps each update despite the large learning rate.")
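
The difference between clipping by value and clipping by norm is easiest to see on concrete numbers. This is a minimal sketch using `tf.clip_by_value` and `tf.clip_by_global_norm` on hand-picked gradient tensors; the values are illustrative, not taken from the model above.

```python
import tensorflow as tf

# Two gradient tensors whose global norm is sqrt(9 + 16 + 144) = 13.
grads = [tf.constant([3.0, 4.0]), tf.constant([12.0])]

# Clip by value: each component is limited independently to [-1, 1].
by_value = [tf.clip_by_value(g, -1.0, 1.0) for g in grads]

# Clip by global norm: all tensors are rescaled by the same factor.
by_norm, global_norm = tf.clip_by_global_norm(grads, clip_norm=1.0)
norm_after = tf.linalg.global_norm(by_norm)

print("global norm before:", float(global_norm))
print("clipped by value:", [g.numpy().tolist() for g in by_value])
print("clipped by norm:", [g.numpy().tolist() for g in by_norm])
print("global norm after:", float(norm_after))
```

Clipping by value can change the direction of the gradient, while clipping by global norm keeps the direction and only shrinks the length. Keras optimizers expose related options through the `clipvalue`, `clipnorm`, and `global_clipnorm` arguments.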




### **3.3. Stable Learning Rate Choices**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_10/Lecture_B/image_03_03.jpg?v=1769608602" width="250">



>* Learning rate controls update speed and stability
>* Too high or low harms smooth loss descent

>* Use adaptive or scheduled learning rates for stability
>* Monitor loss curves and adjust rate to prevent divergence

>* Learning rate must match batch size and optimizer
>* Retune learning rate carefully, monitor for instability



In [None]:
#@title Python Code - Stable Learning Rate Choices

# This script shows stable learning rates.
# It compares safe and unsafe learning rates.
# Use it to observe loss behavior.

# !pip install tensorflow==2.20.0

# Import required libraries safely.
import os
import random
import numpy as np
import tensorflow as tf

# Set deterministic random seeds.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Print TensorFlow version briefly.
print("TensorFlow version:", tf.__version__)

# Enable GPU memory growth if a GPU is available.
physical_gpus = tf.config.list_physical_devices("GPU")
if physical_gpus:
    try:
        tf.config.experimental.set_memory_growth(physical_gpus[0], True)
    except (RuntimeError, ValueError) as e:
        # Memory growth must be set before the GPU is initialized.
        print("GPU config warning, using default:", e)

# Create a tiny synthetic regression dataset.
num_samples = 256
x_data = np.linspace(-1.0, 1.0, num_samples).astype("float32")
noise = 0.1 * np.random.randn(num_samples).astype("float32")

# Generate targets with simple linear relation.
y_data = 3.0 * x_data + 0.5 + noise
x_data = x_data.reshape(-1, 1)
y_data = y_data.reshape(-1, 1)

# Validate shapes before training.
assert x_data.shape == (num_samples, 1)
assert y_data.shape == (num_samples, 1)

# Build a simple dense regression model.
def build_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(1,)),
        tf.keras.layers.Dense(8, activation="tanh"),
        tf.keras.layers.Dense(1),
    ])
    return model

# Prepare dataset object for efficient training.
batch_size = 32
dataset = tf.data.Dataset.from_tensor_slices((x_data, y_data))
dataset = dataset.shuffle(num_samples, seed=seed_value)
dataset = dataset.batch(batch_size)

# Define two learning rates to compare.
stable_lr = 0.01
unstable_lr = 1.0

# Create two models with identical initialization.
model_stable = build_model()
model_unstable = build_model()
model_unstable.set_weights(model_stable.get_weights())

# Compile models with mean squared error loss.
model_stable.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=stable_lr),
    loss="mse",
)
model_unstable.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=unstable_lr),
    loss="mse",
)

# Train both models briefly with silent logs.
epochs = 15
history_stable = model_stable.fit(
    dataset,
    epochs=epochs,
    verbose=0,
)

history_unstable = model_unstable.fit(
    dataset,
    epochs=epochs,
    verbose=0,
)

# Extract loss histories for inspection.
loss_stable = history_stable.history["loss"]
loss_unstable = history_unstable.history["loss"]

# Print a compact comparison table.
print("\nEpoch  Stable_LR_loss  Unstable_LR_loss")
for i in range(epochs):
    ls = float(loss_stable[i])
    lu = float(loss_unstable[i])
    print(f"{i+1:5d}  {ls:14.6f}  {lu:16.6f}")
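
A scheduled learning rate is one way to start fast and finish stable. The sketch below uses `tf.keras.optimizers.schedules.ExponentialDecay`; the initial rate, decay steps, and decay rate are illustrative assumptions, not tuned values for the model above.

```python
import tensorflow as tf

# Halve the learning rate every 100 optimizer steps (illustrative values).
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,
    decay_steps=100,
    decay_rate=0.5,
)

# Pass the schedule wherever a fixed learning rate would go.
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule)

# Inspect the rate the schedule yields at a few steps.
rates = {step: float(schedule(step)) for step in (0, 100, 200)}
for step, rate in rates.items():
    print(f"step {step:3d}: learning rate = {rate:.4f}")
```

The same optimizer can then be passed to `model.compile`. If the loss still diverges early, lowering `initial_learning_rate` is usually the first thing to try.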




# <font color="#418FDE" size="6.5" uppercase>**Performance and Debug**</font>


In this lecture, you learned to:
- Enable and configure mixed precision training in TensorFlow 2.20.0 to leverage modern GPUs and TPUs. 
- Use TensorFlow profiling tools to identify performance bottlenecks in models and input pipelines. 
- Diagnose and mitigate common numerical and stability issues such as NaNs and exploding gradients. 

<font color='yellow'>Congratulations on completing this course!</font>