<div style="  background: linear-gradient(145deg, #0f172a, #1e293b);  border: 4px solid transparent;  border-radius: 14px;  padding: 18px 22px;  margin: 12px 0;  font-size: 26px;  font-weight: 600;  color: #f8fafc;  box-shadow: 0 6px 14px rgba(0,0,0,0.25);  background-clip: padding-box;  position: relative;">  <div style="    position: absolute;    inset: 0;    padding: 4px;    border-radius: 14px;    background: linear-gradient(90deg, #06b6d4, #3b82f6, #8b5cf6);    -webkit-mask:       linear-gradient(#fff 0 0) content-box,       linear-gradient(#fff 0 0);    -webkit-mask-composite: xor;    mask-composite: exclude;    pointer-events: none;  "></div>    <b>Vanishing and Exploding Gradients in RNNs</b>    <br/>  <span style="color:#9ca3af; font-size: 18px; font-weight: 400;">(Advanced Language Modeling with Keras, GRUs, LSTMs, and Embeddings)</span></div>

## Table of Contents

1. [The Problem: Vanishing and Exploding Gradients](#section-1)
2. [Solutions to Gradient Problems](#section-2)
3. [GRU and LSTM Cells Architecture](#section-3)
4. [Implementing GRU and LSTM in Keras](#section-4)
5. [The Embedding Layer](#section-5)
6. [Transfer Learning with GloVe](#section-6)
7. [Sentiment Classification Revisited](#section-7)
8. [Avoiding Overfitting in RNNs](#section-8)
9. [Advanced Architectures: CNNs and Complex Models](#section-9)
10. [Conclusion](#section-10)

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 1. The Problem: Vanishing and Exploding Gradients</span><br>

### Forward Propagation in RNNs
Recurrent Neural Networks (RNNs) process sequences by maintaining a hidden state that evolves over time. During **Forward Propagation**, information flows from the start of the sequence to the end.

The hidden state $a_t$ at time step $t$ is calculated as a function of the weight matrix $W_a$, the previous hidden state $a_{t-1}$, and the current input $x_t$.

$$ a_t = f(W_a, a_{t-1}, x_t) $$

Because this is recursive, $a_t$ depends on $a_{t-1}$, which depends on $a_{t-2}$, and so on. Expanding this:

$$ a_2 = f(W_a, a_1, x_2) = f(W_a, f(W_a, a_0, x_1), x_2) $$
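
The recursion can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a scalar hidden state and the common choice $f = \tanh(w_a a_{t-1} + w_x x_t)$; the names `rnn_forward`, `w_a`, and `w_x` are hypothetical and not part of the original notebook.

```python
import math

def rnn_forward(inputs, w_a=0.5, w_x=1.0, a0=0.0):
    """Unroll a scalar RNN: a_t = tanh(w_a * a_{t-1} + w_x * x_t)."""
    states = []
    a_prev = a0
    for x_t in inputs:
        a_t = math.tanh(w_a * a_prev + w_x * x_t)  # same weights reused at every step
        states.append(a_t)
        a_prev = a_t
    return states

states = rnn_forward([1.0, 0.5, -0.3])

# a_2 written as the nested expression f(w_a, f(w_a, a_0, x_1), x_2)
a_2_nested = math.tanh(0.5 * math.tanh(0.5 * 0.0 + 1.0 * 1.0) + 1.0 * 0.5)
print(states[1], a_2_nested)
```

The loop and the nested expression compute the same value for $a_2$, which is the point of the expansion: every later state carries the weights of all earlier steps inside it.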

### Backpropagation Through Time (BPTT)
To train the network, we use **Backpropagation Through Time**. We calculate the gradient of the loss function with respect to the weights to update them.

However, because of the recursive nature, the derivative of the state $a_t$ with respect to the weights $W_a$ involves the chain rule across all previous time steps.

$$ \frac{\partial a_t}{\partial W_a} \propto (W_a)^{t-1} g(X) $$

<div style="background: #e0f2fe; border-left: 16px solid #0284c7; padding: 14px 18px; border-radius: 8px; font-size: 18px; color: #075985;"> 💡 <b>Tip:</b> The term $(W_a)^{t-1}$ is the culprit. If the recurrent weights are small (largest eigenvalue magnitude of $W_a$ below 1), repeated multiplication shrinks the gradient exponentially toward 0 (vanishing). If they are large (above 1), the gradient can grow exponentially (exploding). </div>
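
A quick numeric check makes the effect concrete. The sketch below assumes the scalar case, where $W_a$ is a single number `w`; the helper `gradient_scale` is a hypothetical name used only for illustration.

```python
def gradient_scale(w, t):
    """Size of the factor w**(t-1) that appears in the BPTT gradient (scalar case)."""
    return abs(w) ** (t - 1)

for w in (0.9, 1.0, 1.1):
    print(f"w = {w}: factor after 50 steps = {gradient_scale(w, 50):.4g}")
```

Even weights only slightly away from 1 change the gradient by orders of magnitude over 50 time steps.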

### Consequences
1.  **Vanishing Gradients**: The model stops learning from earlier time steps. It "forgets" long-term dependencies.
2.  **Exploding Gradients**: The weights update massively, causing the loss to oscillate or diverge (NaN).

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 2. Solutions to Gradient Problems</span><br>

There are several standard techniques to mitigate these issues.

### Solutions for Exploding Gradients
*   **Gradient Clipping / Scaling**: Cap the gradients at a maximum threshold before updating weights.
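
Clipping by norm can be sketched without any framework. This is an illustrative version, assuming the gradient is a flat list of floats; `clip_by_norm` is a hypothetical helper, not a Keras function.

```python
import math

def clip_by_norm(gradient, max_norm):
    """Rescale `gradient` (a list of floats) so its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)             # small gradients pass through unchanged
    scale = max_norm / norm
    return [g * scale for g in gradient]  # direction kept, length capped

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)    # original norm is 5.0
unchanged = clip_by_norm([0.3, 0.4], max_norm=1.0)  # original norm is 0.5
print(clipped, unchanged)
```

Because the whole vector is rescaled by one factor, the update keeps its direction and only its length is limited.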

### Solutions for Vanishing Gradients
*   **Better Initialization**: Initialize the weight matrix $W_a$ carefully (e.g., Xavier or He initialization).
*   **Regularization**: Use techniques to constrain model complexity.
*   **Activation Functions**: Use **ReLU** instead of tanh or sigmoid (which saturate and kill gradients).
*   **Architecture Changes**: Use **LSTM** (Long Short-Term Memory) or **GRU** (Gated Recurrent Unit) cells.



In [None]:
import tensorflow as tf
from tensorflow.keras.layers import SimpleRNN, Dense
from tensorflow.keras.models import Sequential

# Example of gradient clipping in the optimizer
# 'clipnorm=1.0' rescales each weight's gradient so that its L2 norm is at most 1.0
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)

# Example of using ReLU activation to help with vanishing gradients
model = Sequential()
model.add(SimpleRNN(units=64, activation='relu', input_shape=(None, 10)))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer=optimizer, loss='binary_crossentropy')
print("Model compiled with Gradient Clipping and ReLU activation.")



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 3. GRU and LSTM Cells Architecture</span><br>

Standard `SimpleRNN` cells multiply by the same weight matrix at every time step, which leads to the vanishing gradient problem. **GRU** and **LSTM** cells introduce **gates** that control the flow of information, allowing the network to decide what to keep and what to forget.

### 1. SimpleRNN Cell (The Baseline)
*   **Logic**: $a_t = f(W_a, a_{t-1}, x_t)$
*   **Issue**: Repeated multiplication by $W_a$ makes gradients unstable (they vanish or explode).

### 2. GRU Cell (Gated Recurrent Unit)
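A simplified version of the LSTM with two gates (update and reset) and no separate cell state. The description below follows the simplified view that shows only the update gate.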
*   **Update Gate ($g_u$)**: Decides how much of the past information to pass along to the future.
*   **Candidate Memory ($\tilde{a}_t$)**: A temporary new memory value.
*   **Final Memory ($a_t$)**: A linear interpolation between the past state and the new candidate.

$$ a_t = h(g_u, \tilde{a}_t, a_{t-1}) $$
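
One common way to write $h$ is $a_t = (1 - g_u)\, a_{t-1} + g_u\, \tilde{a}_t$. The sketch below assumes that convention (libraries differ on which term the gate multiplies), a scalar state, and no reset gate; `gru_step` and its weight names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(a_prev, x_t, w_u=1.0, u_u=1.0, w_c=1.0, u_c=1.0):
    """One step of a simplified scalar GRU (update gate only, no reset gate)."""
    g_u = sigmoid(w_u * x_t + u_u * a_prev)          # update gate in (0, 1)
    candidate = math.tanh(w_c * x_t + u_c * a_prev)  # candidate memory
    a_t = (1.0 - g_u) * a_prev + g_u * candidate     # linear interpolation
    return a_t, g_u, candidate

a_t, g_u, candidate = gru_step(a_prev=0.2, x_t=1.0)
print(a_t, g_u, candidate)
```

When `g_u` is close to 0 the old state is copied forward almost unchanged, which is what lets information (and gradients) survive many steps.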

### 3. LSTM Cell (Long Short-Term Memory)
More complex, with three distinct gates and a separate cell state ($c_t$).
*   **Forget Gate ($g_f$)**: Decides what to remove from the cell state.
*   **Update Gate ($g_u$)** (often called the input gate): Decides what new information to store in the cell state.
*   **Output Gate ($g_o$)**: Decides what parts of the cell state to output as the hidden state $a_t$.

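**Why they mitigate Vanishing Gradients:**
Because the cell state $c_t$ is updated additively (the old state scaled by the forget gate, plus the gated candidate) rather than being pushed through a weight matrix and a squashing activation at every step, gradients can flow across many time steps with much less attenuation.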
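
The additive path can be illustrated numerically. This sketch makes a simplifying assumption: the gates are treated as constants, whereas in a real LSTM they depend on the input and the previous hidden state. Under that assumption the gradient of $c_T$ with respect to $c_0$ along the cell-state path is just the product of the forget gate values. Function names are hypothetical.

```python
def lstm_cell_state(c_prev, g_f, g_u, candidate):
    """Additive cell-state update: keep part of the old state, add part of the new."""
    return g_f * c_prev + g_u * candidate

def cell_state_gradient(forget_gates):
    """d c_T / d c_0 along the cell-state path, treating the gates as constants."""
    grad = 1.0
    for g_f in forget_gates:
        grad *= g_f
    return grad

steps = 100
lstm_path = cell_state_gradient([0.99] * steps)  # forget gate close to 1
simple_rnn_path = 0.5 ** steps                   # repeated multiplication by w = 0.5
print(lstm_path, simple_rnn_path)
```

With a forget gate near 1, a usable fraction of the gradient survives 100 steps, while the repeated-multiplication factor is effectively zero.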

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 4. Implementing GRU and LSTM in Keras</span><br>

Keras provides built-in layers for these architectures. They are drop-in replacements for `SimpleRNN`.

### Code Implementation



In [None]:
from tensorflow import keras
from tensorflow.keras.layers import GRU, LSTM
from tensorflow.keras.models import Sequential

# Initialize model
model = Sequential()

# Add a GRU layer
# units=128: Dimension of the hidden state
# return_sequences=True: Output the full sequence (needed if stacking RNN layers)
model.add(GRU(units=128, return_sequences=True, input_shape=(None, 50), name='GRU_layer'))

# Add an LSTM layer
# return_sequences=False: Only output the final state (typical for the last RNN layer before classification)
model.add(LSTM(units=64, return_sequences=False, name='LSTM_layer'))

# Summary
model.summary()



<div style="background: #e0f2fe; border-left: 16px solid #0284c7; padding: 14px 18px; border-radius: 8px; font-size: 18px; color: #075985;"> 💡 <b>Tip:</b> Use <code>return_sequences=True</code> when you want to stack multiple RNN layers. The last RNN layer usually has <code>return_sequences=False</code> to feed into a Dense layer. </div>
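
The effect of `return_sequences` on output shapes can be written down as a small rule. The helper below is a hypothetical illustration in plain Python (it does not call Keras), and the batch size of 32 and sequence length of 20 are assumed values.

```python
def rnn_output_shape(batch_size, timesteps, units, return_sequences):
    """Shape produced by a recurrent layer for input (batch_size, timesteps, features)."""
    if return_sequences:
        return (batch_size, timesteps, units)  # one hidden state per time step
    return (batch_size, units)                 # only the last hidden state

# Shapes for the GRU -> LSTM stack above, with an assumed batch of 32 sequences of 20 steps
gru_out = rnn_output_shape(32, 20, 128, return_sequences=True)
lstm_out = rnn_output_shape(32, 20, 64, return_sequences=False)
print(gru_out, lstm_out)
```

The 3D output of the GRU is what the next recurrent layer needs as input; the 2D output of the LSTM is what a `Dense` classifier expects.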

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 5. The Embedding Layer</span><br>

When dealing with text, we need to convert words into numbers.

### One-Hot Encoding vs. Embeddings

| Feature | One-Hot Encoding | Word Embeddings |
| :--- | :--- | :--- |
| **Dimension** | High (Vocabulary Size, e.g., 100,000) | Low (e.g., 300) |
| **Sparsity** | Sparse (Mostly zeros) | Dense (Real numbers) |
| **Semantics** | No semantic relationship between words | Captures meaning (e.g., King - Man + Woman ≈ Queen) |
| **Training** | Fixed | Learned during training or pre-trained |
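
The two representations are closely related: multiplying a one-hot vector by an embedding matrix selects one row of that matrix. The sketch below shows this with a made-up 4-word vocabulary and 3-dimensional vectors; all names and values are illustrative.

```python
# Toy vocabulary of 4 words with 3-dimensional embeddings (values are made up)
embedding_matrix = [
    [0.1, 0.2, 0.3],   # index 0
    [0.4, 0.5, 0.6],   # index 1
    [0.7, 0.8, 0.9],   # index 2
    [1.0, 1.1, 1.2],   # index 3
]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def one_hot_times_matrix(vector, matrix):
    """Multiply a one-hot row vector by the embedding matrix."""
    n_cols = len(matrix[0])
    return [sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(n_cols)]

word_index = 2
via_matmul = one_hot_times_matrix(one_hot(word_index, 4), embedding_matrix)
via_lookup = embedding_matrix[word_index]  # what an embedding layer effectively does
print(via_matmul == via_lookup)
```

An embedding layer skips the sparse multiplication and performs the row lookup directly, which is why it is cheap even for large vocabularies.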

### Using the Embedding Layer in Keras
The `Embedding` layer is usually the first layer of a language model.



In [None]:
from tensorflow.keras.layers import Embedding

model = Sequential()

# Parameters:
# input_dim: Size of the vocabulary (e.g., 100,000 words)
# output_dim: Dimension of the dense embedding vector (e.g., 300)
# embeddings_initializer: How the vectors are initialized ('uniform' is the Keras default)
# input_length: Length of input sequences (e.g., 120 words);
#   accepted by tf.keras 2.x, deprecated in Keras 3
model.add(Embedding(input_dim=100000,
                    output_dim=300,
                    trainable=True,
                    embeddings_initializer='uniform',
                    input_length=120))

print("Embedding layer added.")



**Disadvantage**: Embeddings add many parameters to the model (here $100{,}000 \times 300 = 30$ million trainable weights), which increases memory use and training time.

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 6. Transfer Learning with GloVe</span><br>

Instead of learning embeddings from scratch, we can use **Transfer Learning**. We use pre-trained word vectors such as **GloVe** or **word2vec**, or contextual embeddings from models such as **BERT**.

### 1. Loading GloVe Vectors
We parse a text file where each line contains a word followed by its vector coefficients.

*(Note: The code below includes a mock file generator so you can run it immediately without downloading the 1GB GloVe file).*



In [None]:
import numpy as np
import os

# --- MOCK DATA GENERATION (For demonstration purposes) ---
# In a real scenario, you would download 'glove.6B.300d.txt'
mock_glove_content = """the 0.418 0.24968 -0.41242 0.1217 0.34527
cat 0.013441 0.23682 -0.16899 0.40951 0.63812
dog 0.1529 0.30091 -0.29451 0.22737 0.2962
"""
with open("glove.mock.5d.txt", "w", encoding="utf-8") as f:
    f.write(mock_glove_content)
# ---------------------------------------------------------

def get_glove_vectors(filename):
    """
    Parses the GloVe text file into a dictionary.
    """
    glove_vector_dict = {}
    with open(filename, encoding="utf-8") as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = values[1:]
            glove_vector_dict[word] = np.asarray(coefs, dtype='float32')
            
    return glove_vector_dict

# Load the vectors (using our mock file with 5 dimensions for simplicity)
glove_dict = get_glove_vectors("glove.mock.5d.txt")
print(f"Loaded {len(glove_dict)} word vectors.")
print(f"Vector for 'cat': {glove_dict['cat']}")



### 2. Filtering GloVe for Specific Task
We create an embedding matrix that matches our specific vocabulary index.



In [None]:
def filter_glove(vocabulary_dict, glove_dict, wordvec_dim):
    """
    Creates an embedding matrix for the specific vocabulary.
    Rows correspond to the index in vocabulary_dict.
    """
    # +1 for the 0 index (padding)
    embedding_matrix = np.zeros((len(vocabulary_dict) + 1, wordvec_dim))
    
    for word, i in vocabulary_dict.items():
        embedding_vector = glove_dict.get(word)
        if embedding_vector is not None:
            # Words found in GloVe get their vector
            # Words not found remain all-zeros
            embedding_matrix[i] = embedding_vector
            
    return embedding_matrix

# Example usage
vocab = {'the': 1, 'cat': 2, 'zebra': 3} # 'zebra' is not in our mock glove
matrix = filter_glove(vocab, glove_dict, wordvec_dim=5)

print("Embedding Matrix shape:", matrix.shape)
print("Row 2 (cat):", matrix[2])
print("Row 3 (zebra - missing):", matrix[3])
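
Before relying on pre-trained vectors, it can help to check how much of the task vocabulary they actually cover, since every missing word ends up as an all-zeros row. The helper `glove_coverage` below is a hypothetical addition, shown on stand-in data that mirrors the mock example (vectors shortened to two values).

```python
def glove_coverage(vocabulary_dict, glove_dict):
    """Return the covered fraction of the vocabulary and the list of missing words."""
    missing = [word for word in vocabulary_dict if word not in glove_dict]
    covered = 1.0 - len(missing) / len(vocabulary_dict)
    return covered, missing

# Stand-in data mirroring the mock example (vector values shortened)
toy_glove = {'the': [0.418, 0.24968], 'cat': [0.013441, 0.23682], 'dog': [0.1529, 0.30091]}
toy_vocab = {'the': 1, 'cat': 2, 'zebra': 3}

coverage, missing_words = glove_coverage(toy_vocab, toy_glove)
print(f"Coverage: {coverage:.0%}, missing: {missing_words}")
```

A low coverage value suggests keeping the embedding layer trainable, or handling out-of-vocabulary words explicitly.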



### 3. Using Pre-trained Vectors in Keras
We initialize the `Embedding` layer with `Constant` weights.



In [None]:
from tensorflow.keras.initializers import Constant

vocabulary_size = len(vocab) + 1
embedding_dim = 5

model = Sequential()
model.add(Embedding(input_dim=vocabulary_size,
                    output_dim=embedding_dim,
                    embeddings_initializer=Constant(matrix),
                    trainable=False)) # Often set to False to keep pre-trained values

print("Pre-trained embedding layer created.")



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 7. Sentiment Classification Revisited</span><br>

### The Baseline Problem
A simple model might yield poor results (e.g., Accuracy ~0.5, Loss ~0.69).



In [None]:
# Previous poor performing model example
# model.add(SimpleRNN(units=16, input_shape=(None, 1)))
# model.add(Dense(1, activation='sigmoid'))
# Result [loss, accuracy]: [0.699, 0.495] -> Random guessing
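
A loss near 0.69 is a recognizable signature: it is what binary cross-entropy gives when the model outputs 0.5 for every example, since $-\ln(0.5) = \ln 2 \approx 0.693$. The sketch below checks this with a hand-written loss function; the labels are made up for illustration.

```python
import math

def binary_crossentropy(y_true, y_pred):
    """Mean binary cross-entropy over paired labels and predicted probabilities."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for y, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)

labels = [0, 1, 1, 0]
uninformed = [0.5] * len(labels)   # a model that always outputs 0.5
baseline_loss = binary_crossentropy(labels, uninformed)
print(round(baseline_loss, 4))     # ln(2), about 0.6931
```

Seeing this value after training usually means the model has learned nothing useful from the inputs.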



### Strategies for Improvement
To improve performance on sentiment analysis tasks:
1.  **Add an Embedding Layer**: Capture semantic meaning.
2.  **Increase Depth**: Add more layers.
3.  **Tune Parameters**: Adjust units, learning rate, etc.
4.  **Increase Vocabulary**: Allow the model to recognize more words.
5.  **Longer Sentences**: Increase `input_length` to capture more context.
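
Changing the sentence length means every tokenized review is padded or truncated to the new length. The sketch below is a plain-Python stand-in for that step, written to follow pre-padding and pre-truncating (assumed here to match the defaults of Keras `pad_sequences`); `pad_or_truncate` is a hypothetical helper.

```python
def pad_or_truncate(sequence, max_len, pad_value=0):
    """Left-pad with `pad_value`, or keep only the last `max_len` tokens."""
    if len(sequence) >= max_len:
        return sequence[-max_len:]
    return [pad_value] * (max_len - len(sequence)) + sequence

short_review = [12, 7, 93]
long_review = [5, 8, 13, 21, 34, 55, 89]
padded = pad_or_truncate(short_review, 5)
truncated = pad_or_truncate(long_review, 5)
print(padded, truncated)
```

A larger maximum length keeps more of each long review at the cost of more padding (and more computation) for short ones.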

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 8. Avoiding Overfitting in RNNs</span><br>

RNNs are prone to overfitting (memorizing the training data). We use **Dropout** to mitigate this.

1.  **Standard Dropout**: Randomly sets inputs to 0.
2.  **Recurrent Dropout**: Randomly drops connections *within* the recurrent units (memory cells).
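
What dropout does to a vector of activations during training can be sketched as follows. This is an illustrative "inverted dropout" in plain Python, where surviving values are rescaled by $1/(1 - \text{rate})$ so the expected sum stays the same; the function and variable names are hypothetical.

```python
import random

def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the survivors."""
    keep_prob = 1.0 - rate
    return [v / keep_prob if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(0)          # fixed seed so the run is repeatable
activations = [1.0] * 10
dropped = dropout(activations, rate=0.2, rng=rng)
print(dropped)
```

At inference time dropout is switched off and the activations pass through unchanged.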



In [None]:
from tensorflow.keras.layers import Dropout

model = Sequential()

# 1. Standard Dropout Layer
# Randomly zeroes 20% of the inputs during training to add noise
model.add(Dropout(rate=0.2, input_shape=(None, 50)))

# 2. Dropout inside RNN layers
# dropout=0.1: Dropout for input units
# recurrent_dropout=0.1: Dropout for recurrent state
model.add(LSTM(128, dropout=0.1, recurrent_dropout=0.1))

print("Model with Dropout configuration created.")



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 9. Advanced Architectures: CNNs and Complex Models</span><br>

### 1. 1D Convolutional Layers (Conv1D)
While convolutions are best known from image models, 1D convolutions are also used in NLP: each filter slides along the sequence of embedding vectors and extracts local features from a few neighboring words. They are fast and effective.



In [None]:
from tensorflow.keras.layers import Conv1D, MaxPooling1D

model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=300, input_length=100))

# Conv1D Layer
# filters=32: Number of output filters
# kernel_size=3: Window size (looks at 3 words at a time)
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))

print("Conv1D layer added.")
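
The sequence length after each layer follows simple arithmetic. The helpers below are hypothetical and assume stride-1 convolution and non-overlapping pooling (stride equal to `pool_size`), which is the configuration used in the cell above.

```python
def conv1d_output_length(length, kernel_size, padding):
    """Output length of a stride-1 Conv1D for 'same' or 'valid' padding."""
    if padding == 'same':
        return length
    if padding == 'valid':
        return length - kernel_size + 1
    raise ValueError(f"unsupported padding: {padding}")

def maxpool1d_output_length(length, pool_size):
    """Output length of non-overlapping max pooling (stride equal to pool_size)."""
    return (length - pool_size) // pool_size + 1

after_conv = conv1d_output_length(100, kernel_size=3, padding='same')
after_pool = maxpool1d_output_length(after_conv, pool_size=2)
print(after_conv, after_pool)
```

So an input of 100 tokens stays at 100 positions after the convolution (with 32 channels instead of 300) and is halved by the pooling layer.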



### 2. A Complex "State-of-the-Art" Style Architecture
Combining Embeddings, Dense layers, Dropout, LSTM, and GRU into a single powerful model.



In [None]:
# Parameters
vocabulary_size = 5000
wordvec_dim = 300
max_text_len = 100

# Define the complex model
model = Sequential()

# 1. Embedding Layer (using pre-trained weights logic)
# Note: In a real run, 'matrix' would be the full 300-dim GloVe matrix
model.add(Embedding(input_dim=vocabulary_size, 
                    output_dim=wordvec_dim, 
                    trainable=True,
                    # embeddings_initializer=Constant(matrix), # Uncomment if matrix is ready
                    input_length=max_text_len, 
                    name="Embedding"))

# 2. Dense Layer for feature transformation
model.add(Dense(wordvec_dim, activation='relu', name="Dense1"))

# 3. Dropout
model.add(Dropout(rate=0.25))

# 4. LSTM Layer (Returning sequences to feed into GRU)
model.add(LSTM(64, return_sequences=True, dropout=0.15, name="LSTM"))

# 5. GRU Layer (Not returning sequences, final recurrent step)
model.add(GRU(64, return_sequences=False, dropout=0.15, name="GRU"))

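# 6. Dense Layers for classification
# ReLU activations added so the stacked Dense layers are not a purely linear map
model.add(Dense(64, activation='relu', name="Dense2"))
model.add(Dropout(rate=0.25))
model.add(Dense(32, activation='relu', name="Dense3"))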

# 7. Output Layer (Binary Classification)
model.add(Dense(1, activation='sigmoid', name="Output"))

model.summary()
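
The parameter counts reported by `model.summary()` can be cross-checked by hand. The formulas below are a sketch: the GRU count assumes the Keras default `reset_after=True` (two bias vectors per block), and all helper names are hypothetical. Dropout layers have no parameters and are omitted.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def dense_params(n_in, n_out):
    return n_in * n_out + n_out                          # weights + biases

def lstm_params(n_in, units):
    return 4 * (units * (n_in + units) + units)          # 3 gates + candidate

def gru_params(n_in, units):
    # Assumes reset_after=True, which keeps two bias vectors per block
    return 3 * (units * (n_in + units) + 2 * units)      # 2 gates + candidate

layer_params = {
    "Embedding": embedding_params(5000, 300),
    "Dense1": dense_params(300, 300),
    "LSTM": lstm_params(300, 64),
    "GRU": gru_params(64, 64),
    "Dense2": dense_params(64, 64),
    "Dense3": dense_params(64, 32),
    "Output": dense_params(32, 1),
}
for name, count in layer_params.items():
    print(f"{name:<10}{count:>12,}")
print(f"{'Total':<10}{sum(layer_params.values()):>12,}")
```

The embedding layer dominates the total, which is one reason freezing it (`trainable=False`) with pre-trained vectors reduces the number of weights to train so sharply.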



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 10. Conclusion</span><br>

In this notebook, we explored the challenges and solutions for training Recurrent Neural Networks for language modeling.

**Key Takeaways:**
1.  **Gradient Problems**: Standard RNNs suffer from vanishing and exploding gradients due to repeated matrix multiplication during Backpropagation Through Time (BPTT).
2.  **Architectural Solutions**: **LSTM** and **GRU** cells use gating mechanisms (Update, Forget, Output) to preserve gradients over long sequences, which largely mitigates the vanishing gradient problem.
3.  **Embeddings**: Moving from One-Hot encoding to **Word Embeddings** allows for dense, semantic representations of text.
4.  **Transfer Learning**: Using pre-trained vectors like **GloVe** can significantly boost performance by leveraging external knowledge.
5.  **Regularization**: **Dropout** and **Recurrent Dropout** are essential to prevent overfitting in complex RNN models.

**Next Steps:**
*   Experiment with hyperparameter tuning (number of units, dropout rates).
*   Apply these architectures to real-world datasets like IMDB Sentiment Analysis or text generation.
*   Explore Transformer architectures (BERT, GPT), which have largely superseded RNNs for most language modeling tasks.
