# ⚙️ Chapter 11: Training Deep Neural Networks — Hands-On Guide

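Deep neural networks are powerful, but training them reliably means dealing with vanishing or exploding gradients, slow convergence, and overfitting. This chapter walks through the main techniques for each problem, along with reusing pretrained layers.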

---

## I. 🧊 Vanishing/Exploding Gradients

### A. **Weight Initialization (Glorot/Xavier)**

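Glorot (Xavier) initialization scales the initial weights by each layer's fan-in and fan-out so that the signal variance stays roughly constant in both the forward and the backward pass, which helps avoid vanishing/exploding gradients. It suits tanh, sigmoid, and softmax layers; He initialization is the usual choice for ReLU and its variants.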

In [3]:
from tensorflow.keras import layers, models, initializers

# Define a simple model with explicit weight initializers
model_xavier = models.Sequential([
    layers.Input(shape=(28, 28)),
    layers.Flatten(),
    layers.Dense(100, activation='relu', kernel_initializer=initializers.GlorotUniform()),  # Glorot (Xavier) uniform, the Keras default
    layers.Dense(100, activation='relu', kernel_initializer='he_normal'),  # He initialization, suited to ReLU
    layers.Dense(10, activation='softmax')
])

model_xavier.summary()
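
To make "appropriate scales" concrete, the sketch below reproduces the standard formulas with NumPy: Glorot uniform samples from `[-limit, limit]` with `limit = sqrt(6 / (fan_in + fan_out))`, and He normal uses a standard deviation of `sqrt(2 / fan_in)`. The helper names and the 784-to-100 layer size are illustrative assumptions. Keras's own `he_normal` draws from a truncated normal, which this simplified version does not replicate.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) weight matrix from U(-limit, limit)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out, rng):
    """Sample from N(0, 2 / fan_in); plain (untruncated) normal for simplicity."""
    stddev = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, stddev, size=(fan_in, fan_out))

rng = np.random.default_rng(42)
w_glorot = glorot_uniform(784, 100, rng)
w_he = he_normal(784, 100, rng)

# Empirical variances should land close to the theoretical targets.
print("Glorot variance:", w_glorot.var(), "target:", 2.0 / (784 + 100))
print("He variance:    ", w_he.var(), "target:", 2.0 / 784)
```

He initialization targets a larger variance than Glorot for the same layer because ReLU zeroes roughly half of the activations.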

### B. **Nonsaturating Activations** (ReLU, Leaky ReLU, ELU)

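Saturating activations such as sigmoid and tanh have near-zero derivatives for large inputs, which shrinks gradients as they propagate backward. ReLU does not saturate for positive inputs, and Leaky ReLU and ELU also keep a nonzero gradient for negative inputs, which avoids "dead" units.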

In [5]:
# Adding LeakyReLU activation
model_leaky_relu = models.Sequential([
    layers.Input(shape=(28, 28)),
    layers.Flatten(),
    layers.Dense(100),
    layers.LeakyReLU(negative_slope=0.2),
    layers.Dense(10, activation='softmax')
])

model_leaky_relu.summary()
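
As a rough illustration of how the three activations named in the heading differ on negative inputs, here is a NumPy sketch. The function names and the sample vector `z` are assumptions for the example; `negative_slope=0.2` mirrors the value used in the cell above, and `alpha=1.0` is a commonly used ELU setting.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, negative_slope=0.2):
    return np.where(z > 0, z, negative_slope * z)

def elu(z, alpha=1.0):
    # np.minimum keeps exp() from overflowing on large positive inputs
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))        # negative inputs are zeroed
print(leaky_relu(z))  # negative inputs are scaled by negative_slope
print(elu(z))         # negative inputs approach -alpha smoothly
```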

### C. **Batch Normalization**

Standardizes each layer's inputs over the current mini-batch, then applies a learned scale and shift. This stabilizes training and usually speeds it up.

In [6]:
model_bn = models.Sequential([
    layers.Input(shape=(28, 28)),
    layers.Flatten(),
    layers.BatchNormalization(),  # Normalize inputs
    layers.Dense(300, activation='relu'),
    layers.BatchNormalization(),  # Normalize before next layer
    layers.Dense(100, activation='relu'),
    layers.Dense(10, activation='softmax')
])

model_bn.summary()
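
The training-time computation behind a batch normalization layer can be sketched in a few lines of NumPy. This is a simplified version under stated assumptions: the function name is made up for the example, `eps=1e-3` matches the Keras default, and the moving averages that Keras tracks for use at inference time are left out.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Training-mode batch normalization over the batch axis (axis 0)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # standardize each feature
    return gamma * x_hat + beta               # learned scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # batch of 64, 4 features
out = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))

print(out.mean(axis=0))  # close to 0 for every feature
print(out.std(axis=0))   # close to 1 for every feature
```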

### D. **Gradient Clipping**

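Mitigates exploding gradients by capping gradients during optimization. `clipvalue=1.0` clips every gradient component to the range [-1, 1], which can change the gradient's direction; `clipnorm` instead rescales the whole gradient so that its ℓ2 norm stays at or below the threshold, preserving the direction.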

In [8]:
import tensorflow as tf

# Using clipvalue in optimizer
optimizer_clip = tf.keras.optimizers.SGD(learning_rate=0.01, clipvalue=1.0)

# Build and compile the model
model_clip = models.Sequential([
    layers.Input(shape=(28, 28)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(10, activation='softmax')
])

model_clip.compile(optimizer=optimizer_clip,
                   loss='sparse_categorical_crossentropy',
                   metrics=['accuracy'])
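
The difference between clipping by value and clipping by norm is easiest to see on a single gradient vector. The NumPy sketch below uses made-up helper names and a made-up gradient. In Keras, the corresponding optimizer arguments are `clipvalue` and `clipnorm`, and `clipnorm` is applied to each variable's gradient separately.

```python
import numpy as np

def clip_by_value(grad, clip_value):
    """Clip every component to [-clip_value, clip_value]."""
    return np.clip(grad, -clip_value, clip_value)

def clip_by_norm(grad, clip_norm):
    """Rescale the whole vector if its L2 norm exceeds clip_norm."""
    norm = np.linalg.norm(grad)
    if norm <= clip_norm:
        return grad
    return grad * (clip_norm / norm)

grad = np.array([0.9, 100.0])
by_value = clip_by_value(grad, 1.0)
by_norm = clip_by_norm(grad, 1.0)

print(by_value)  # components end up nearly equal, so the direction changed
print(by_norm)   # still points almost entirely along the second axis
```

Clipping by norm keeps the relative size of the components, which is why it is often preferred when the gradient direction matters.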

## II. 🔁 Reusing Pretrained Layers (Transfer Learning)

### A. **Transfer Learning with Keras**

In [9]:
from tensorflow.keras.applications import VGG16
from tensorflow.keras import Sequential, layers

# Load pretrained VGG16 without top classification layer
conv_base = VGG16(weights='imagenet', include_top=False, input_shape=(150, 150, 3))
conv_base.trainable = False  # Freeze base for feature extraction

# Build a classifier on top
model_transfer = Sequential([
    conv_base,
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dense(1, activation='sigmoid')  # Binary classification
])

model_transfer.summary()


### B. **Unsupervised Pretraining & Auxiliary Tasks**

When labeled data is scarce, first pretrain the model on a proxy task for which data is plentiful, such as image reconstruction, then fine-tune it on the main dataset.

## III. 🏎️ Faster Optimizers

### A. **Momentum & Nesterov Accelerated Gradient (NAG)**

In [10]:
# Momentum optimizer with Nesterov
optimizer_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)

# Compile a sample model
model_momentum = models.Sequential([
    layers.Input(shape=(28, 28)),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(10, activation='softmax')
])

model_momentum.compile(optimizer=optimizer_momentum,
                       loss='sparse_categorical_crossentropy',
                       metrics=['accuracy'])
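
The update rules behind `momentum` and `nesterov` can be sketched on a one-dimensional toy loss. This is the textbook formulation, not a copy of Keras's internal code, which arranges the same update differently. The loss `f(theta) = theta**2`, the function names, and the starting point are assumptions for the example.

```python
def grad(theta):
    """Gradient of the toy loss f(theta) = theta**2."""
    return 2.0 * theta

def sgd_momentum(theta, steps, lr=0.01, beta=0.9, nesterov=False):
    m = 0.0  # momentum (velocity) term
    for _ in range(steps):
        # NAG evaluates the gradient slightly ahead, at theta + beta * m
        lookahead = theta + beta * m if nesterov else theta
        m = beta * m - lr * grad(lookahead)
        theta = theta + m
    return theta

theta_momentum = sgd_momentum(5.0, steps=200)
theta_nag = sgd_momentum(5.0, steps=200, nesterov=True)
print(theta_momentum, theta_nag)  # both end up close to the minimum at 0
```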

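### B. **Adaptive Optimizers**
- RMSProp: scales each parameter's step by a running average of its squared gradients.
- Adam: combines that scaling with a momentum-style running average of the gradients.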

In [7]:
# RMSProp optimizer
optimizer_rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001)

# Adam optimizer
optimizer_adam = tf.keras.optimizers.Adam(learning_rate=0.001)
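
To show what "adaptive" means in practice, here is a scalar sketch of the Adam update with bias correction. The function name and the toy gradients are assumptions; the default values for `lr`, `beta1`, `beta2`, and `eps` follow the Keras defaults. On the very first step the update size is roughly `lr` regardless of how large the gradient is.

```python
import math

def adam_minimize(grad_fn, theta, steps, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero init
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Same starting point, gradients that differ by a factor of 1000
first = adam_minimize(lambda th: 2.0 * th, theta=5.0, steps=1)
first_scaled = adam_minimize(lambda th: 2000.0 * th, theta=5.0, steps=1)
print(5.0 - first, 5.0 - first_scaled)  # both step sizes are about lr
```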

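### C. **Learning-Rate Scheduling**

Lowering the learning rate as training progresses allows large steps early on and finer adjustments near convergence.
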
In [8]:
from tensorflow.keras.callbacks import LearningRateScheduler

# Define a scheduler function
def scheduler(epoch, lr):
    return lr * 0.99  # decay by 1% each epoch

# Create callback
lr_callback = LearningRateScheduler(scheduler)

# Example usage during training (assuming model and data are defined)
# model.fit(X_train, y_train, epochs=10, callbacks=[lr_callback])
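
Two other common schedules, exponential decay and step decay, can be written as small factory functions. The names, the initial rate of 0.01, and the decay constants below are illustrative assumptions. Each returned function accepts `(epoch, lr)`, so it should be usable with `LearningRateScheduler` in the same way as `scheduler` above.

```python
def exponential_decay(lr0, s):
    """lr(epoch) = lr0 * 0.1 ** (epoch / s): drops tenfold every s epochs."""
    def schedule(epoch, lr=None):  # lr is accepted but unused
        return lr0 * 0.1 ** (epoch / s)
    return schedule

def step_decay(lr0, drop=0.5, epochs_per_drop=10):
    """Multiply the rate by `drop` once every `epochs_per_drop` epochs."""
    def schedule(epoch, lr=None):
        return lr0 * drop ** (epoch // epochs_per_drop)
    return schedule

exp_schedule = exponential_decay(lr0=0.01, s=20)
step_schedule = step_decay(lr0=0.01)
for epoch in (0, 10, 20):
    print(epoch, exp_schedule(epoch), step_schedule(epoch))
```

Unlike `scheduler`, which multiplies the current rate, these compute the rate from the epoch number alone, so the result does not depend on what the optimizer's rate happened to be.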

## IV. 🛡️ Avoiding Overfitting

### A. **ℓ1 and ℓ2 Regularization**

In [11]:
from tensorflow.keras import regularizers

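model_reg = models.Sequential([
    layers.Input(shape=(28, 28)),  # without an input shape the model is not built for summary()
    layers.Flatten(),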
    layers.Dense(100, activation='relu', kernel_regularizer=regularizers.l2(0.01)),
    layers.Dense(10, activation='softmax')
])

model_reg.summary()
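
The penalty that a kernel regularizer adds to the loss is simple to compute by hand. The NumPy sketch below assumes a small made-up weight matrix and the same factor of 0.01 as in the cell above: ℓ1 sums absolute values, ℓ2 sums squares.

```python
import numpy as np

def l1_penalty(weights, factor):
    return factor * np.sum(np.abs(weights))

def l2_penalty(weights, factor):
    return factor * np.sum(np.square(weights))

w = np.array([[0.5, -1.0],
              [2.0,  0.0]])
print(l1_penalty(w, 0.01))  # 0.01 * (0.5 + 1.0 + 2.0 + 0.0)
print(l2_penalty(w, 0.01))  # 0.01 * (0.25 + 1.0 + 4.0 + 0.0)
```

Because ℓ2 squares each weight, it punishes large weights much more heavily than small ones, whereas ℓ1 pushes weights toward exactly zero and so tends to produce sparse models.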

### B. **Dropout**

In [10]:
model_dropout = models.Sequential([
    layers.Input(shape=(28, 28)),
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.5),  # 50% dropout
    layers.Dense(10, activation='softmax')
])

model_dropout.summary()
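
What a dropout layer does during training can be sketched with a random mask. This NumPy version uses "inverted" dropout, where the surviving activations are scaled by `1 / (1 - rate)` so that the expected activation is unchanged, which is also how the Keras `Dropout` layer behaves. The function name and the all-ones input are assumptions for the example.

```python
import numpy as np

def dropout_train(x, rate, rng):
    """Zero out a fraction `rate` of the entries and rescale the rest."""
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((1000, 256))
dropped = dropout_train(activations, rate=0.5, rng=rng)

print("fraction zeroed:", np.mean(dropped == 0.0))  # roughly 0.5
print("mean activation:", dropped.mean())           # roughly 1.0
```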

### C. **Monte Carlo Dropout**
Keep dropout active at inference time, run several stochastic forward passes, and use the spread of the predictions as an uncertainty estimate.
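
A minimal sketch of the idea: call the model repeatedly with `training=True` and summarize the stacked predictions. `mc_dropout_predict` is a hypothetical helper, and `TinyDropoutModel` is a made-up NumPy stand-in so that the example runs on its own. With Keras, a trained model such as `model_dropout` would be passed in instead.

```python
import numpy as np

def mc_dropout_predict(model, X, n_samples=100):
    """Average n_samples stochastic forward passes with dropout left on."""
    # training=True keeps Dropout layers active; in Keras it also puts
    # layers such as BatchNormalization into training mode.
    probas = np.stack([np.asarray(model(X, training=True))
                       for _ in range(n_samples)])
    return probas.mean(axis=0), probas.std(axis=0)

class TinyDropoutModel:
    """Hypothetical stand-in: input dropout, one linear layer, softmax."""
    def __init__(self, rate=0.5, seed=0):
        self.rng = np.random.default_rng(seed)
        self.rate = rate
        self.w = self.rng.normal(size=(4, 3))

    def __call__(self, X, training=False):
        if training:
            mask = self.rng.random(X.shape) >= self.rate
            X = X * mask / (1.0 - self.rate)
        logits = X @ self.w
        exp = np.exp(logits - logits.max(axis=1, keepdims=True))
        return exp / exp.sum(axis=1, keepdims=True)

X_new = np.ones((5, 4))
mean_proba, std_proba = mc_dropout_predict(TinyDropoutModel(), X_new)
print(mean_proba.round(2))  # averaged class probabilities
print(std_proba.round(2))   # larger values mean less certain predictions
```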

### D. **Max-Norm Constraints**

In [11]:
from tensorflow.keras.constraints import max_norm

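model_max_norm = models.Sequential([
    layers.Input(shape=(28, 28)),  # without an input shape the model is not built for summary()
    layers.Flatten(),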
    layers.Dense(100, activation='relu', kernel_constraint=max_norm(3)),
    layers.Dense(10, activation='softmax')
])

model_max_norm.summary()
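
The effect of a max-norm constraint can be reproduced by hand: after each update, any unit whose incoming weight vector has an ℓ2 norm above the limit is rescaled back down to it. The sketch below assumes norms are taken along `axis=0` (one column per unit, as in a Dense kernel) and uses a made-up weight matrix; the function name is hypothetical.

```python
import numpy as np

def apply_max_norm(w, max_value=3.0, axis=0, eps=1e-7):
    """Rescale weight vectors whose L2 norm exceeds max_value."""
    norms = np.sqrt(np.sum(np.square(w), axis=axis, keepdims=True))
    desired = np.clip(norms, 0.0, max_value)
    return w * desired / (eps + norms)

w = np.array([[3.0, 0.3],
              [4.0, 0.4]])   # column norms: 5.0 and 0.5
w_clipped = apply_max_norm(w)

# First column is scaled down to norm 3; second is left essentially as-is
print(np.linalg.norm(w_clipped, axis=0))
```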

## V. 💡 Summary & Practical Guidelines

- Use **good initializations** (e.g., He/Xavier) and **ReLU** to combat vanishing gradients.
- Prefer **batch normalization** and **advanced optimizers** (Adam, RMSProp).
- Apply **learning-rate schedules** and **gradient clipping** when needed.
- Prevent overfitting with **dropout**, **ℓ1/ℓ2 regularization**, **max-norm constraints**, and **early stopping**.
- Utilize **transfer learning** and **pretraining** to speed up training and improve results when labeled data is limited.

## VI. ✅ Exercises to Try

1. Train a deep network on CIFAR-10: compare SGD, SGD+momentum, Adam, and RMSProp.
2. Compare a model that uses **batch normalization** with one that uses only **dropout**, and track how each affects performance.
3. Implement **learning-rate decay** (e.g., exponential decay, step decay).
4. Apply **transfer learning** with a pretrained CNN on a new dataset.
5. Use **Monte Carlo dropout** to estimate the uncertainty of predictions.