# 🧠 Vanishing Gradient in Deep Learning (Practical Example)

This notebook uses the **California Housing Dataset** to demonstrate how the vanishing gradient problem arises in deep networks. We'll compare a deep model with `sigmoid` activations, which suffers from vanishing gradients, against the same architecture with `ReLU` activations, which is far less affected.
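
To see why depth matters, recall that backpropagation multiplies in one activation derivative per layer. The sketch below is an illustration added for intuition, not part of the original experiment: it computes the largest possible sigmoid slope and raises it to the 10th power, matching the 10 sigmoid layers used later. It ignores the weight matrices, which also scale the gradient, so treat the result as a rough bound on the activation's contribution only.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_slope(z: float) -> float:
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)


# The slope peaks at z = 0, where sigmoid(0) = 0.5.
max_slope = sigmoid_slope(0.0)

depth = 10  # number of sigmoid layers used later in this notebook
activation_bound = max_slope ** depth

print(f"max sigmoid slope: {max_slope}")
print(f"bound after {depth} layers: {activation_bound:.3e}")
```

Even in the best case for the activation, ten sigmoid layers can scale the gradient by less than one millionth.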

## 🔍 Step 1: Load Dataset

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

# Load dataset
data = fetch_california_housing()
X, y = data.data, data.target

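# Preprocess
y = y.reshape(-1, 1)  # ensure it's a column vector

# Split first, then fit the scaler on the training split only,
# so that test-set statistics do not leak into preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)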
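
For intuition, standardization subtracts each column's mean and divides by its standard deviation. This matters here because large unscaled inputs push sigmoid units into their flat regions. The NumPy sketch below mirrors that computation on a small made-up array. The `toy` values are placeholders, not California Housing data, and this is a simplified illustration rather than scikit-learn's implementation.

```python
import numpy as np

# Placeholder data: two columns on very different scales.
toy = np.array([[1.0, 200.0],
                [2.0, 400.0],
                [3.0, 600.0]])

mean = toy.mean(axis=0)
std = toy.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses
toy_scaled = (toy - mean) / std

print(toy_scaled.mean(axis=0))  # approximately zero per column
print(toy_scaled.std(axis=0))   # approximately one per column
```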


## 🏗️ Step 2: Build Deep Model with Sigmoid (to show vanishing gradient)
We'll build a network with 10 hidden `sigmoid` layers of 64 units each, followed by a linear output layer, and monitor the gradients of each layer's weights.

In [None]:
model_sigmoid = tf.keras.Sequential()
model_sigmoid.add(tf.keras.Input(shape=(X_train.shape[1],)))
for _ in range(10):
    model_sigmoid.add(tf.keras.layers.Dense(64, activation='sigmoid'))
model_sigmoid.add(tf.keras.layers.Dense(1))
model_sigmoid.compile(optimizer='sgd', loss='mse')
model_sigmoid.summary()
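
As a cross-check on the summary, the parameter count of this stack can be worked out by hand. The sketch below assumes 8 input features, which is the number of columns California Housing provides, and the layer widths used above. The helper name `dense_params` is hypothetical.

```python
n_features = 8      # assumption: California Housing has 8 input columns
width = 64          # units per hidden layer
hidden_layers = 10  # number of hidden Dense layers


def dense_params(fan_in: int, fan_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return fan_in * fan_out + fan_out


# Layer sizes from input to output: 8 -> 64 (x10) -> 1
layer_sizes = [n_features] + [width] * hidden_layers + [1]
total_params = sum(
    dense_params(fan_in, fan_out)
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:])
)

print(total_params)  # 38081
```

If the total printed by `summary()` differs, the input shape or the layer widths are probably not what you expected.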

## 🎯 Step 3: Train Model and Log Gradients using GradientTape

In [None]:
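epochs = 5
all_gradients = []
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)  # 0.01 is the Keras SGD default
x_batch = tf.convert_to_tensor(X_train[:512], dtype=tf.float32)
y_batch = tf.convert_to_tensor(y_train[:512], dtype=tf.float32)

for epoch in range(epochs):
    with tf.GradientTape() as tape:
        preds = model_sigmoid(x_batch, training=True)
        # Reduce to a scalar so we differentiate the mean loss, not a per-sample vector
        loss = tf.reduce_mean(tf.square(y_batch - preds))

    grads = tape.gradient(loss, model_sigmoid.trainable_variables)
    # Keep kernel (2-D weight matrix) gradients only; they are ordered from input layer to output layer
    grad_norms = [tf.norm(g).numpy() for g in grads if g is not None and len(g.shape) > 1]
    all_gradients.append(grad_norms)
    # Apply the update so that each epoch actually changes the weights
    optimizer.apply_gradients(zip(grads, model_sigmoid.trainable_variables))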
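
The list comprehension in the loop does two things: it skips bias gradients (1-D) and keeps one Frobenius norm per weight matrix. The sketch below isolates that logic in a hypothetical helper, `kernel_grad_norms`, and exercises it on stand-in NumPy arrays instead of real gradients. It assumes each gradient can be converted with `np.asarray`, which holds for eager TensorFlow tensors.

```python
import numpy as np


def kernel_grad_norms(grads):
    """Return the L2 (Frobenius) norm of every 2-D gradient, skipping biases and None entries."""
    norms = []
    for g in grads:
        if g is None:          # variable not connected to the loss
            continue
        g = np.asarray(g)
        if g.ndim > 1:         # kernels are 2-D, biases are 1-D
            norms.append(float(np.linalg.norm(g)))
    return norms


# Stand-in gradients in the order kernel, bias, kernel, missing.
fake_grads = [np.ones((2, 2)), np.ones(2), np.full((2, 1), 3.0), None]
print(kernel_grad_norms(fake_grads))
```

Because `trainable_variables` lists layers in the order they were added, the returned norms run from the input-side layer to the output layer.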

## 📉 Step 4: Visualize Gradient Norms to Detect Vanishing

In [None]:
plt.figure(figsize=(10,5))
for i, norms in enumerate(all_gradients):
    plt.plot(norms, label=f"Epoch {i+1}", marker='o')

plt.title("Gradient Norms across Layers (Sigmoid)")
plt.xlabel("Layer Index (from input to output)")
plt.ylabel("Gradient L2 Norm")
plt.legend()
plt.grid(True)
plt.show()
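
A single number can complement the plot: the ratio between the first and last kernel-gradient norms. The sketch below defines a hypothetical helper, `first_to_last_ratio`, and runs it on a synthetic sequence in which each value is 4 times the previous one. These are not measured results; they only mimic a per-layer shrink factor of 0.25, listed from input layer to output layer.

```python
import math


def first_to_last_ratio(norms):
    """Ratio of the input-side gradient norm to the output-side gradient norm."""
    if len(norms) < 2:
        raise ValueError("need at least two layers to compare")
    if norms[-1] == 0:
        return math.nan  # undefined when the output-side gradient is zero
    return norms[0] / norms[-1]


# Synthetic norms for 11 kernels, growing by 4x per layer toward the output.
synthetic = [4.0 ** i for i in range(11)]
ratio = first_to_last_ratio(synthetic)

print(f"first/last ratio: {ratio:.3e} (log10 = {math.log10(ratio):.2f})")
```

In this notebook the helper could be called as `first_to_last_ratio(all_gradients[0])`. Since the norms can span many orders of magnitude, adding `plt.yscale("log")` before `plt.show()` may also make the plot easier to read.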

## ⚡ Step 5: Build and Compare with ReLU Model

In [None]:
model_relu = tf.keras.Sequential()
model_relu.add(tf.keras.Input(shape=(X_train.shape[1],)))
for _ in range(10):
    model_relu.add(tf.keras.layers.Dense(64, activation='relu'))
model_relu.add(tf.keras.layers.Dense(1))
model_relu.compile(optimizer='sgd', loss='mse')

In [None]:
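all_gradients_relu = []
optimizer_relu = tf.keras.optimizers.SGD(learning_rate=0.01)  # 0.01 is the Keras SGD default
x_batch = tf.convert_to_tensor(X_train[:512], dtype=tf.float32)
y_batch = tf.convert_to_tensor(y_train[:512], dtype=tf.float32)

for epoch in range(epochs):
    with tf.GradientTape() as tape:
        preds = model_relu(x_batch, training=True)
        loss = tf.reduce_mean(tf.square(y_batch - preds))  # scalar mean squared error

    grads = tape.gradient(loss, model_relu.trainable_variables)
    # Kernel gradients only, ordered from input layer to output layer
    grad_norms = [tf.norm(g).numpy() for g in grads if g is not None and len(g.shape) > 1]
    all_gradients_relu.append(grad_norms)
    # Apply the update so that each epoch actually changes the weights
    optimizer_relu.apply_gradients(zip(grads, model_relu.trainable_variables))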

In [None]:
plt.figure(figsize=(10,5))
for i, norms in enumerate(all_gradients_relu):
    plt.plot(norms, label=f"Epoch {i+1}", marker='x')

plt.title("Gradient Norms across Layers (ReLU)")
plt.xlabel("Layer Index (from input to output)")
plt.ylabel("Gradient L2 Norm")
plt.legend()
plt.grid(True)
plt.show()

## 🧠 Final Explanation
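- In the **sigmoid model**, you should see gradient norms **shrink sharply**, often by several orders of magnitude, as you move toward the early (input-side) layers. The sigmoid derivative never exceeds 0.25, so backpropagation scales the gradient down at every layer and the early layers receive almost no update.
- In the **ReLU model**, the gradient norms should stay **much closer in magnitude** across layers, because the ReLU derivative is 1 for every active unit. This lets the early layers keep learning.

This shows how the choice of activation affects learning in deep networks, and in which layers the vanishing gradient appears.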
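
The same contrast can be reproduced without TensorFlow. The NumPy sketch below is a stand-in for the notebook's models under several assumptions: random data instead of California Housing, no biases, Glorot-uniform weights (the Keras `Dense` default), and a single hand-written forward and backward pass with no training. The function name `kernel_grad_norms_numpy` is hypothetical.

```python
import numpy as np


def kernel_grad_norms_numpy(activation, depth=10, width=64, n_features=8, batch=512, seed=0):
    """One manual forward/backward pass; returns kernel-gradient norms from input to output."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, n_features))  # random stand-in for scaled features
    y = rng.standard_normal((batch, 1))           # random stand-in for targets

    sizes = [n_features] + [width] * depth + [1]
    weights = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        limit = np.sqrt(6.0 / (fan_in + fan_out))  # Glorot-uniform range
        weights.append(rng.uniform(-limit, limit, size=(fan_in, fan_out)))

    # Forward pass: remember each layer's input and the activation slope at its output.
    inputs, slopes = [], []
    h = x
    for i, w in enumerate(weights):
        inputs.append(h)
        z = h @ w
        if i == len(weights) - 1:          # linear output layer
            h, slope = z, np.ones_like(z)
        elif activation == "sigmoid":
            h = 1.0 / (1.0 + np.exp(-z))
            slope = h * (1.0 - h)
        elif activation == "relu":
            h = np.maximum(z, 0.0)
            slope = (z > 0).astype(z.dtype)
        else:
            raise ValueError(f"unknown activation: {activation}")
        slopes.append(slope)

    # Backward pass for the loss mean((pred - y) ** 2).
    delta = 2.0 * (h - y) / batch          # gradient w.r.t. the output pre-activation
    norms = [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        norms[i] = float(np.linalg.norm(inputs[i].T @ delta))  # gradient w.r.t. weights[i]
        if i > 0:
            # Propagate to the previous layer's pre-activation.
            delta = (delta @ weights[i].T) * slopes[i - 1]
    return norms


sigmoid_norms = kernel_grad_norms_numpy("sigmoid")
relu_norms = kernel_grad_norms_numpy("relu")
print("sigmoid first/last ratio:", sigmoid_norms[0] / sigmoid_norms[-1])
print("relu    first/last ratio:", relu_norms[0] / relu_norms[-1])
```

With these settings, the sigmoid ratio should come out far smaller than the ReLU ratio. The exact values depend on the seed and on the random stand-in data, so read them as qualitative evidence only.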