# <font color="#418FDE" size="6.5" uppercase>**Sequence Models**</font>

>Last update: 20260121.
    
By the end of this lecture, you will be able to:
- Build LSTM- or GRU-based sequence models for text classification using `tf.keras`.
- Implement a simple transformer-style encoder block using Keras layers for sequence modeling.
- Evaluate and compare the performance of different sequence architectures on an NLP task.


## **1. LSTM and GRU Models**

### **1.1. Recurrent Layer Basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_B/image_01_01.jpg?v=1768979976" width="250">



>* Recurrent layers process sequences step by step
>* Internal state summarizes past tokens for classification

>* LSTM or GRU reuses weights across timesteps
>* Gates build a summary state for classification

>* Choose final state or full state sequence
>* Final state summarizes all tokens for classification
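
To make the ideas of weight reuse and gated state updates concrete, here is a minimal NumPy sketch of a GRU-style recurrence. It is one common formulation with biases omitted, not the exact Keras implementation; the helper name `gru_forward`, the parameter dictionary, and all sizes are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_forward(x_seq, params):
    """Run a GRU-style cell over x_seq of shape (timesteps, input_dim)."""
    hidden_dim = params["U_z"].shape[0]
    h = np.zeros(hidden_dim)
    states = []
    for x_t in x_seq:  # the same weights are reused at every timestep
        z = sigmoid(x_t @ params["W_z"] + h @ params["U_z"])  # update gate
        r = sigmoid(x_t @ params["W_r"] + h @ params["U_r"])  # reset gate
        h_candidate = np.tanh(x_t @ params["W_h"] + (r * h) @ params["U_h"])
        h = z * h + (1.0 - z) * h_candidate  # blend old state and candidate
        states.append(h)
    return np.stack(states), h

# Illustrative sizes and random weights (assumptions, not trained values).
rng = np.random.default_rng(0)
input_dim, hidden_dim, timesteps = 3, 4, 5
params = {}
for gate in ("z", "r", "h"):
    params[f"W_{gate}"] = rng.normal(scale=0.5, size=(input_dim, hidden_dim))
    params[f"U_{gate}"] = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))

x_seq = rng.normal(size=(timesteps, input_dim))
all_states, final_state = gru_forward(x_seq, params)

print("All states shape:", all_states.shape)    # one state per timestep
print("Final state shape:", final_state.shape)  # single summary vector
```

The full state sequence corresponds to what `return_sequences=True` exposes, while the final state is the last row of that sequence.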



In [None]:
#@title Python Code - Recurrent Layer Basics

# This script shows basic LSTM sequence processing for text classification.
# It contrasts final state outputs with full sequence outputs clearly.
# It uses a tiny sentiment dataset for quick demonstration.

# !pip install tensorflow==2.20.0

# Import required libraries including TensorFlow and NumPy.
import os
import random
import numpy as np
import tensorflow as tf

# Set deterministic seeds for reproducible training behavior.
seed_value = 7
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Print TensorFlow version information for environment clarity.
print("TensorFlow version:", tf.__version__)

# Detect available device preference using TensorFlow configuration utilities.
physical_gpus = tf.config.list_physical_devices("GPU")
if physical_gpus:
    print("Using GPU device for training operations.")
else:
    print("Using CPU device for training operations.")

# Create a tiny in memory sentiment dataset with short sentences.
texts = [
    "I love this movie so much",
    "This film was absolutely terrible",
    "What a fantastic and enjoyable story",
    "I hated every single minute",
    "The plot was boring and slow",
    "Actors did an amazing job",
]

# Create corresponding sentiment labels where one indicates positive.
labels = np.array([1, 0, 1, 0, 0, 1], dtype=np.int32)

# Build a simple tokenizer to convert words into integer indices.
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=1000, oov_token="<OOV>")

# Fit tokenizer on the small training texts for vocabulary creation.
tokenizer.fit_on_texts(texts)

# Convert texts into padded integer sequences with equal lengths.
sequences = tokenizer.texts_to_sequences(texts)
max_len = max(len(seq) for seq in sequences)

# Pad sequences so that all sequences share identical temporal length.
X = tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=max_len, padding="post")

# Validate resulting input shape to ensure correct dimensions.
print("Input batch shape:", X.shape)

# Define embedding dimension and recurrent units for the LSTM layer.
embedding_dim = 8
lstm_units = 4
vocab_size = len(tokenizer.word_index) + 1

# Build a model that returns only the final LSTM hidden state.
inputs_final = tf.keras.Input(shape=(max_len,), name="input_final")

# Add an embedding layer to map tokens into dense vector representations.
embedded_final = tf.keras.layers.Embedding(vocab_size, embedding_dim)(inputs_final)

# Add LSTM configured to return only final hidden state representation.
lstm_final = tf.keras.layers.LSTM(lstm_units, return_sequences=False, name="lstm_final")(embedded_final)

# Add dense output layer for binary sentiment classification predictions.
outputs_final = tf.keras.layers.Dense(1, activation="sigmoid")(lstm_final)

# Create and compile the final state model using binary crossentropy loss.
model_final = tf.keras.Model(inputs_final, outputs_final, name="final_state_model")
model_final.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train the final state model briefly on the tiny dataset.
model_final.fit(X, labels, epochs=5, batch_size=2, verbose=0)

# Evaluate the final state model performance on the same dataset.
loss_final, acc_final = model_final.evaluate(X, labels, verbose=0)
print("Final state model accuracy:", round(float(acc_final), 3))

# Build a model that returns full sequence of LSTM hidden states.
inputs_seq = tf.keras.Input(shape=(max_len,), name="input_seq")
embedded_seq = tf.keras.layers.Embedding(vocab_size, embedding_dim)(inputs_seq)

# Configure LSTM to return full sequence of hidden states for each timestep.
lstm_seq = tf.keras.layers.LSTM(lstm_units, return_sequences=True, name="lstm_seq")(embedded_seq)

# Pool sequence outputs by averaging across timesteps for classification.
pooled_seq = tf.keras.layers.GlobalAveragePooling1D()(lstm_seq)

# Add dense output layer for binary sentiment classification predictions.
outputs_seq = tf.keras.layers.Dense(1, activation="sigmoid")(pooled_seq)

# Create and compile the sequence output model with identical settings.
model_seq = tf.keras.Model(inputs_seq, outputs_seq, name="sequence_output_model")
model_seq.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train the sequence output model briefly on the same tiny dataset.
model_seq.fit(X, labels, epochs=5, batch_size=2, verbose=0)

# Evaluate the sequence output model performance on the same dataset.
loss_seq, acc_seq = model_seq.evaluate(X, labels, verbose=0)
print("Sequence output model accuracy:", round(float(acc_seq), 3))

# Inspect raw LSTM outputs for a single example from the dataset.
sample_input = X[0:1]
full_sequence_output = tf.keras.Model(inputs_seq, lstm_seq)(sample_input)
final_state_output = tf.keras.Model(inputs_final, lstm_final)(sample_input)

# Print shapes to highlight difference between final and sequence outputs.
print("Single example input shape:", sample_input.shape)
print("Full sequence output shape:", full_sequence_output.shape)
print("Final state output shape:", final_state_output.shape)



### **1.2. Bidirectional RNN Wrappers**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_B/image_01_02.jpg?v=1768980026" width="250">



>* Bidirectional RNNs read sequences forward and backward
>* Combined context improves understanding and text classification

>* Bidirectional RNNs see both earlier and later words
>* They capture negation, contrast, and long-range context

>* Bidirectional outputs increase representation size and power
>* Balance accuracy gains with cost and overfitting
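
The sketch below illustrates, under simplifying assumptions, what a bidirectional wrapper does: it runs one recurrence left to right, another right to left, and concatenates the two state sequences. A plain tanh RNN stands in for the LSTM, and the helper names and sizes are hypothetical.

```python
import numpy as np

def rnn_states(x_seq, W, U):
    """Return the tanh-RNN hidden state at every timestep of x_seq."""
    h = np.zeros(U.shape[0])
    states = []
    for x_t in x_seq:
        h = np.tanh(x_t @ W + h @ U)
        states.append(h)
    return np.stack(states)

def bidirectional_states(x_seq, fwd, bwd):
    """Concatenate a forward pass and a time-reversed backward pass."""
    forward = rnn_states(x_seq, *fwd)
    backward = rnn_states(x_seq[::-1], *bwd)[::-1]  # re-align to original order
    return np.concatenate([forward, backward], axis=-1)

# Illustrative sizes and random weights (assumptions, not trained values).
rng = np.random.default_rng(1)
input_dim, hidden_dim, timesteps = 3, 4, 6
fwd = (rng.normal(scale=0.5, size=(input_dim, hidden_dim)),
       rng.normal(scale=0.5, size=(hidden_dim, hidden_dim)))
bwd = (rng.normal(scale=0.5, size=(input_dim, hidden_dim)),
       rng.normal(scale=0.5, size=(hidden_dim, hidden_dim)))
x_seq = rng.normal(size=(timesteps, input_dim))
states = bidirectional_states(x_seq, fwd, bwd)

# Change only the last token, then check which positions react in each half.
x_changed = x_seq.copy()
x_changed[-1] += 1.0
states_changed = bidirectional_states(x_changed, fwd, bwd)
differs = ~np.isclose(states, states_changed)
forward_changed = differs[:, :hidden_dim].any(axis=1)
backward_changed = differs[:, hidden_dim:].any(axis=1)

print("Output shape:", states.shape)  # feature size is doubled
print("Forward half changed at positions:", forward_changed)
print("Backward half changed at positions:", backward_changed)
```

Only the backward half lets earlier positions react to a later token, which is why the concatenated representation can capture context on both sides at the cost of twice the feature size.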



In [None]:
#@title Python Code - Bidirectional RNN Wrappers

# This script demonstrates bidirectional LSTM sequence classification basics.
# It compares unidirectional and bidirectional recurrent models on toy sentences.
# It prints validation accuracy to show bidirectional wrapper performance.

# !pip install tensorflow==2.20.0

# Import required libraries including TensorFlow and NumPy.
import os
import random
import numpy as np
import tensorflow as tf

# Set deterministic seeds for reproducible training behavior.
seed_value = 7
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Print TensorFlow version information for environment clarity.
print("TensorFlow version:", tf.__version__)

# Enable GPU memory growth when a GPU is available; otherwise TensorFlow runs on CPU.
physical_gpus = tf.config.list_physical_devices("GPU")
if physical_gpus:
    try:
        tf.config.experimental.set_memory_growth(physical_gpus[0], True)
    except Exception as device_error:
        print("GPU configuration warning:", device_error)
else:
    print("GPU device not available, using CPU.")

# Define a tiny sentiment style text dataset with labels.
texts = [
    "I loved this movie so much",
    "This film was terrible and boring",
    "Absolutely fantastic acting and story",
    "I would never watch this again",
    "The plot was great and exciting",
    "I hated the ending of this",
    "What a wonderful and touching film",
    "The movie was bad and disappointing",
]

# Define binary sentiment labels aligned with dataset sentences.
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0], dtype=np.int32)

# Create a simple Keras tokenizer for integer encoding.
vocab_size = 1000
max_length = 12
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=vocab_size, oov_token="<OOV>")

# Fit tokenizer on the small text corpus examples.
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

# Pad sequences to fixed length for batch processing.
padded_sequences = tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=max_length, padding="post", truncating="post")

# Split dataset into simple training and validation subsets.
train_sequences = padded_sequences[:6]
train_labels = labels[:6]
val_sequences = padded_sequences[6:]
val_labels = labels[6:]

# Define a function building unidirectional LSTM classification model.
def build_unidirectional_model(vocab_size_value, max_length_value):
    model = tf.keras.Sequential()
    # Declare the input shape explicitly; Embedding's input_length argument is deprecated in Keras 3.
    model.add(tf.keras.Input(shape=(max_length_value,)))
    model.add(tf.keras.layers.Embedding(vocab_size_value, 16))
    model.add(tf.keras.layers.LSTM(16))
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Define a function building bidirectional LSTM classification model.
def build_bidirectional_model(vocab_size_value, max_length_value):
    model = tf.keras.Sequential()
    # Declare the input shape explicitly; Embedding's input_length argument is deprecated in Keras 3.
    model.add(tf.keras.Input(shape=(max_length_value,)))
    model.add(tf.keras.layers.Embedding(vocab_size_value, 16))
    model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(16)))
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Build both models for comparison of architectures.
uni_model = build_unidirectional_model(vocab_size, max_length)
bi_model = build_bidirectional_model(vocab_size, max_length)

# Train unidirectional model briefly on training data.
uni_history = uni_model.fit(train_sequences, train_labels, epochs=10, batch_size=2, verbose=0, validation_data=(val_sequences, val_labels))

# Train bidirectional model briefly on same training data.
bi_history = bi_model.fit(train_sequences, train_labels, epochs=10, batch_size=2, verbose=0, validation_data=(val_sequences, val_labels))

# Evaluate both models on validation data for accuracy comparison.
uni_loss, uni_acc = uni_model.evaluate(val_sequences, val_labels, verbose=0)
bi_loss, bi_acc = bi_model.evaluate(val_sequences, val_labels, verbose=0)

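# Print accuracy results for both models.
# With only two validation examples, accuracy can only be 0.0, 0.5, or 1.0,
# so treat this as a smoke test rather than evidence for either architecture.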
print("Unidirectional LSTM validation accuracy:", round(float(uni_acc), 3))
print("Bidirectional LSTM validation accuracy:", round(float(bi_acc), 3))




### **1.3. Handling Variable-Length Sequences**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_B/image_01_03.jpg?v=1768980079" width="250">



>* Real text sequences vary widely in length
>* We pad to a common length and track real tokens to batch efficiently

>* Masking tells RNNs to ignore padded tokens
>* Prevents padding noise, improves text classification learning

>* Choose and enforce a sensible maximum sequence length
>* Balance speed, memory, and information for long texts
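
As a rough illustration of padding, truncation, and masking outside of Keras, the following NumPy sketch pads a few hypothetical token-id lists to a fixed maximum length and then averages per-token features over real tokens only. The helper names, the token ids, and the assumption that id 0 is reserved for padding are all illustrative choices.

```python
import numpy as np

def pad_and_mask(sequences, max_len, pad_id=0):
    """Right-pad (and truncate) integer sequences; return ids and a boolean mask."""
    ids = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    for row, seq in enumerate(sequences):
        kept = seq[:max_len]  # enforce the maximum sequence length
        ids[row, :len(kept)] = kept
    mask = ids != pad_id  # True marks real tokens (assumes pad_id is never a real token)
    return ids, mask

def masked_mean(features, mask):
    """Average per-token features over real tokens only."""
    weights = mask[..., np.newaxis].astype(features.dtype)
    totals = (features * weights).sum(axis=1)
    counts = np.maximum(weights.sum(axis=1), 1.0)  # avoid division by zero
    return totals / counts

sequences = [[5, 2], [7, 3, 9, 4], [8, 1, 6, 2, 3, 5]]  # hypothetical token ids
ids, mask = pad_and_mask(sequences, max_len=4)

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(10, 3))  # 10 ids, 3 features each
features = embedding_table[ids]             # shape (batch, max_len, 3)

pooled = masked_mean(features, mask)  # ignores padded positions
naive = features.mean(axis=1)         # lets padding dilute the average

print("Padded ids:\n", ids)
print("Mask:\n", mask.astype(int))
print("Masked mean, first sequence:", pooled[0])
print("Naive mean, first sequence: ", naive[0])
```

The third sequence is truncated to four tokens, which shows the trade-off behind the maximum length: shorter limits save memory and time but can discard information from long texts.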



In [None]:
#@title Python Code - Handling Variable-Length Sequences

# This script shows handling variable length sequences with padding and masking.
# It builds a tiny LSTM classifier using padded movie review style sentences.
# It demonstrates masking behavior by comparing padded and unpadded predictions.

# !pip install tensorflow==2.20.0

# Import required libraries including TensorFlow and NumPy.
import os
import random
import numpy as np
import tensorflow as tf

# Print TensorFlow version for reproducibility information.
print("TensorFlow version:", tf.__version__)

# Set deterministic random seeds for reproducible behavior.
random.seed(7)
np.random.seed(7)
tf.random.set_seed(7)

# Define a tiny toy text dataset with variable length sentences.
texts = [
    "bad movie",
    "really bad boring movie",
    "great film",
    "really great amazing film",
    "terrible and slow",
    "excellent and fun",
]

# Define binary sentiment labels aligned with the toy sentences.
labels = np.array([0, 0, 1, 1, 0, 1], dtype=np.int32)

# Create a TextVectorization layer for integer tokenization.
vectorizer = tf.keras.layers.TextVectorization(output_sequence_length=6)

# Adapt the vectorizer vocabulary using the toy texts.
vectorizer.adapt(tf.constant(texts))

# Vectorize the texts into fixed length integer sequences.
int_sequences = vectorizer(tf.constant(texts))

# Verify the resulting integer sequence tensor shape.
print("Vectorized shape:", int_sequences.shape)

# Build a simple model with masking and an LSTM classifier.
# The time dimension is left as None so the model accepts any sequence length.
inputs = tf.keras.Input(shape=(None,), dtype="int32")

# Use an Embedding layer with mask_zero enabled for padding handling.
embedding = tf.keras.layers.Embedding(input_dim=100, output_dim=8, mask_zero=True)(inputs)

# Add a small LSTM layer that respects the computed mask.
lstm_output = tf.keras.layers.LSTM(units=8)(embedding)

# Add a dense output layer for binary sentiment classification.
outputs = tf.keras.layers.Dense(units=1, activation="sigmoid")(lstm_output)

# Create the Keras model object from inputs and outputs.
model = tf.keras.Model(inputs=inputs, outputs=outputs)

# Compile the model with binary crossentropy loss and accuracy metric.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train the model briefly on the tiny dataset for demonstration.
model.fit(int_sequences, labels, epochs=10, batch_size=2, verbose=0)

# Choose a short sentence and create padded and unpadded versions.
short_text = tf.constant(["great film"])

# Vectorize the short sentence; the vectorizer pads it with zeros to length six.
short_vector_padded = vectorizer(short_text)

# Keep only the two real tokens to obtain the unpadded version.
short_vector = short_vector_padded[:, :2]

# Ensure shapes are correct for both padded and unpadded sequences.
print("Unpadded shape:", short_vector.shape)
print("Padded shape:", short_vector_padded.shape)

# Get prediction for the unpadded sequence using the trained model.
pred_unpadded = model(short_vector).numpy()[0, 0]

# Get prediction for the padded sequence which includes trailing zeros.
pred_padded = model(short_vector_padded).numpy()[0, 0]

# Print both predictions to show masking ignores padding zeros.
print("Prediction unpadded:", float(pred_unpadded))
print("Prediction padded:", float(pred_padded))

# With masking, the two predictions should be numerically very close despite the padding.
print("Difference between predictions:", float(abs(pred_unpadded - pred_padded)))




## **2. Transformer Encoder Block**

### **2.1. Multihead Attention Basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_B/image_02_01.jpg?v=1768980122" width="250">



>* Attention lets each token weight all others
>* Multiple heads capture diverse relationships in parallel

>* Queries compare to keys, weighting value vectors
>* Multiple heads specialize, then outputs are combined

>* Multiple heads capture local and global context
>* Combined heads give stronger features for NLP tasks
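
The query, key, and value mechanics described above can be sketched in a few lines of NumPy. This is a simplified formulation (single projection matrices split across heads, no biases, no dropout) rather than the internals of the Keras `MultiHeadAttention` layer; the function name and all sizes are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x has shape (seq_len, model_dim); returns (output, per-head weights)."""
    seq_len, model_dim = x.shape
    head_dim = model_dim // num_heads

    def split_heads(t):  # (seq_len, model_dim) -> (num_heads, seq_len, head_dim)
        return t.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Each query is compared with every key, then scaled and normalized.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)

    # Weighted sum of values per head, then heads are concatenated and mixed.
    per_head = weights @ v                                  # (heads, seq, head_dim)
    merged = per_head.transpose(1, 0, 2).reshape(seq_len, model_dim)
    return merged @ w_o, weights

# Illustrative sizes and random weights (assumptions, not trained values).
rng = np.random.default_rng(42)
seq_len, model_dim, num_heads = 5, 8, 2
x = rng.normal(size=(seq_len, model_dim))
w_q, w_k, w_v, w_o = (rng.normal(scale=0.3, size=(model_dim, model_dim)) for _ in range(4))

output, weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print("Output shape:", output.shape)
print("Attention weights shape:", weights.shape)
print("Row sums for first head:", weights[0].sum(axis=-1))
```

Each head produces its own `(seq_len, seq_len)` weight matrix, so different heads are free to attend to different token relationships before their outputs are combined.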



In [None]:
#@title Python Code - Multihead Attention Basics

# This script demonstrates basic multihead attention using TensorFlow Keras layers.
# It creates fake token embeddings and applies MultiHeadAttention to them.
# It prints shapes and small values to explain multihead attention behavior.

# !pip install tensorflow==2.20.0

# Import required standard libraries and TensorFlow framework.
import os
import random
import numpy as np
import tensorflow as tf

# Set deterministic random seeds for reproducible attention outputs.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Print TensorFlow version information for environment transparency.
print("TensorFlow version:", tf.__version__)

# Define simple configuration values for sequence and embedding dimensions.
batch_size = 2
sequence_length = 5
embedding_dim = 8
num_heads = 2

# Validate that embedding dimension divides evenly by number of heads.
assert embedding_dim % num_heads == 0

# Create random input embeddings representing token vectors in sentences.
input_embeddings = tf.random.normal(shape=(batch_size, sequence_length, embedding_dim))

# Print input shape to confirm expected batch and sequence dimensions.
print("Input embeddings shape:", input_embeddings.shape)

# Create a MultiHeadAttention layer using Keras functional style.
attention_layer = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=embedding_dim // num_heads)

# Apply self attention where queries, keys, values share same tensor.
attention_output, attention_scores = attention_layer(
    query=input_embeddings,
    value=input_embeddings,
    key=input_embeddings,
    return_attention_scores=True,
)

# Print output shape to show same sequence length and embedding dimension.
print("Attention output shape:", attention_output.shape)

# Print attention scores shape to reveal heads and sequence relationships.
print("Attention scores shape:", attention_scores.shape)

# Select first batch and first head attention matrix for inspection.
first_head_scores = attention_scores[0, 0]

# Print small rounded attention matrix to visualize token relationships.
print("First head attention matrix (rounded):")
print(tf.round(first_head_scores * 100) / 100)

# Verify that each row of attention weights sums to approximately one.
row_sums = tf.reduce_sum(first_head_scores, axis=-1)

# Print row sums to confirm attention distribution normalization behavior.
print("Row sums for first head:")
print(tf.round(row_sums * 100) / 100)

# Show that attention output remains same shape as original embeddings.
print("Output equals input shape:", attention_output.shape == input_embeddings.shape)



### **2.2. Residual Connections and Normalization**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_B/image_02_02.jpg?v=1768980154" width="250">



>* Residual connections stabilize deep transformer encoder training
>* They refine representations and help capture subtle cues

>* Layer normalization stabilizes activations after residual addition
>* The same add-then-normalize pattern follows both attention and feedforward sublayers

>* Residuals and normalization stabilize and support learning
>* They refine sequence understanding and improve NLP robustness
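
The add-then-normalize pattern can be written out directly. The NumPy sketch below assumes a post-normalization layout, meaning normalization is applied after the residual addition; the `sublayer` function is a hypothetical stand-in for attention or a feedforward network.

```python
import numpy as np

def layer_norm(x, gamma, beta, epsilon=1e-6):
    """Normalize each token vector to zero mean and unit variance, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + epsilon) + beta

def add_and_norm(x, sublayer, gamma, beta):
    """Residual connection around a sublayer followed by layer normalization."""
    return layer_norm(x + sublayer(x), gamma, beta)

# Illustrative inputs with a deliberately large offset and scale.
rng = np.random.default_rng(0)
seq_len, hidden_dim = 5, 16
x = rng.normal(loc=3.0, scale=5.0, size=(seq_len, hidden_dim))
w = rng.normal(size=(hidden_dim, hidden_dim))

def sublayer(t):
    # Hypothetical stand-in for attention or a feedforward network.
    return np.tanh(t @ w)

gamma, beta = np.ones(hidden_dim), np.zeros(hidden_dim)
out = add_and_norm(x, sublayer, gamma, beta)

print("Input mean and std: ", round(float(x.mean()), 3), round(float(x.std()), 3))
print("Output mean and std:", round(float(out.mean()), 3), round(float(out.std()), 3))
```

The residual addition only works because the sublayer returns the same shape as its input, and the normalization keeps each token vector on a comparable scale regardless of how large the inputs were.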



In [None]:
#@title Python Code - Residual Connections and Normalization

# This script shows residual connections and layer normalization in a tiny encoder block.
# It builds a small Keras encoder block and compares its behavior with and without layer normalization.
# It prints tensor statistics to illustrate stability from residual connections and normalization.

# !pip install tensorflow==2.20.0

# Import required libraries including TensorFlow and NumPy for computations.
import os
import random
import numpy as np
import tensorflow as tf

# Set deterministic seeds for reproducible behavior across runs on the same setup.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Print TensorFlow version information for clarity about used deep learning framework.
print("TensorFlow version:", tf.__version__)

# Detect available device preference using TensorFlow configuration utilities for runtime placement.
physical_gpus = tf.config.list_physical_devices("GPU")
if physical_gpus:
    print("Using GPU device count:", len(physical_gpus))
else:
    print("Using CPU because no GPU was detected")

# Define simple function that builds encoder block with residual and normalization layers.
def build_encoder_block(hidden_dim, num_heads, ff_dim, use_layer_norm):
    inputs = tf.keras.Input(shape=(None, hidden_dim))
    attention_layer = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=hidden_dim)
    attention_output = attention_layer(inputs, inputs)

    # Add residual connection around attention output using Keras Add layer operation.
    attention_residual = tf.keras.layers.Add()([inputs, attention_output])

    # Optionally apply layer normalization after residual connection for stability.
    if use_layer_norm:
        attention_norm = tf.keras.layers.LayerNormalization(epsilon=1e-6)(attention_residual)
    else:
        attention_norm = attention_residual

    # Build small feedforward network with residual connection and optional normalization.
    ff_dense_one = tf.keras.layers.Dense(ff_dim, activation="relu")
    ff_dense_two = tf.keras.layers.Dense(hidden_dim)
    ff_output = ff_dense_two(ff_dense_one(attention_norm))

    # Add residual connection around feedforward sublayer using Add layer operation.
    ff_residual = tf.keras.layers.Add()([attention_norm, ff_output])

    # Optionally apply layer normalization after feedforward residual connection.
    if use_layer_norm:
        ff_norm = tf.keras.layers.LayerNormalization(epsilon=1e-6)(ff_residual)
    else:
        ff_norm = ff_residual

    # Create Keras model object representing single encoder block transformation.
    model = tf.keras.Model(inputs=inputs, outputs=ff_norm)
    return model

# Define small batch of fake token embeddings representing short text sequences.
batch_size = 2
sequence_length = 5
hidden_dimension = 16
fake_inputs = tf.random.normal(shape=(batch_size, sequence_length, hidden_dimension))

# Build encoder block without layer normalization to compare activation statistics.
encoder_without_norm = build_encoder_block(hidden_dim=hidden_dimension, num_heads=2, ff_dim=32, use_layer_norm=False)

# Build encoder block with layer normalization to observe stabilized activations.
encoder_with_norm = build_encoder_block(hidden_dim=hidden_dimension, num_heads=2, ff_dim=32, use_layer_norm=True)

# Run both encoder variants on same fake inputs to compare output distributions.
outputs_without_norm = encoder_without_norm(fake_inputs, training=False)
outputs_with_norm = encoder_with_norm(fake_inputs, training=False)

# Compute simple statistics for outputs without normalization to inspect scale behavior.
mean_without = tf.reduce_mean(outputs_without_norm).numpy()
std_without = tf.math.reduce_std(outputs_without_norm).numpy()

# Compute simple statistics for outputs with normalization to inspect stabilized scale.
mean_with = tf.reduce_mean(outputs_with_norm).numpy()
std_with = tf.math.reduce_std(outputs_with_norm).numpy()

# Print summary statistics showing effect of residual connections and normalization.
print("Without layer normalization mean:", float(mean_without))
print("Without layer normalization std:", float(std_without))
print("With layer normalization mean:", float(mean_with))
print("With layer normalization std:", float(std_with))

# Verify shapes are unchanged: residual addition requires each sublayer to preserve the hidden dimension.
print("Input shape:", fake_inputs.shape)
print("Output shape without normalization:", outputs_without_norm.shape)
print("Output shape with normalization:", outputs_with_norm.shape)



### **2.3. Feedforward Encoder Layers**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_B/image_02_03.jpg?v=1768980193" width="250">



>* Position-wise feedforward network refines each token
>* Two dense layers expand then shrink dimensions

>* Feedforward layers add nonlinear, higher-level token features
>* Shared feedforward network generalizes efficiently across positions

>* Hidden size and activations trade capacity and cost
>* Careful tuning yields robust, task-specific sequence representations
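
To show what "position-wise" means in practice, this NumPy sketch applies one shared two-layer network to a whole sequence and then to each token separately, and checks that the results agree. The function name, the expansion factor, and the random weights are assumptions for illustration.

```python
import numpy as np

def feedforward(x, w1, b1, w2, b2):
    """Expand with ReLU, then project back to the model dimension."""
    hidden = np.maximum(x @ w1 + b1, 0.0)
    return hidden @ w2 + b2, hidden

# Illustrative sizes and random weights (assumptions, not trained values).
rng = np.random.default_rng(42)
seq_len, model_dim, ff_dim = 5, 8, 32
tokens = rng.normal(size=(seq_len, model_dim))
w1, b1 = rng.normal(scale=0.3, size=(model_dim, ff_dim)), np.zeros(ff_dim)
w2, b2 = rng.normal(scale=0.3, size=(ff_dim, model_dim)), np.zeros(model_dim)

# Apply the network to the whole sequence at once.
out, hidden = feedforward(tokens, w1, b1, w2, b2)

# Apply the same network to each token on its own.
one_by_one = np.stack([feedforward(t, w1, b1, w2, b2)[0] for t in tokens])
matches = np.allclose(out, one_by_one)

print("Expanded shape:", hidden.shape)
print("Output shape:", out.shape)
print("Per-token result matches batched result:", matches)
```

Because the weights are shared across positions, the parameter count depends on `model_dim` and `ff_dim` but not on sequence length; tokens only exchange information in the attention sublayer.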



In [None]:
#@title Python Code - Feedforward Encoder Layers

# This script demonstrates transformer feedforward encoder layers conceptually using Keras Dense layers.
# It builds a tiny encoder block with attention and feedforward sublayers for sequences.
# It shows how feedforward expands and compresses token representations after attention.

# !pip install tensorflow==2.20.0

# Import required libraries including TensorFlow and NumPy for computations.
import os
import random
import numpy as np
import tensorflow as tf

# Set deterministic seeds for reproducible behavior across different runtime sessions.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Print TensorFlow version information for environment transparency and reproducibility.
print("TensorFlow version:", tf.__version__)

# Detect available device preference using TensorFlow configuration utilities for safety.
physical_gpus = tf.config.list_physical_devices("GPU")
use_gpu = bool(physical_gpus)
print("Using GPU:", use_gpu)

# Define simple parameters for sequence length and model dimensionality sizes.
sequence_length = 5
model_dimension = 8
feedforward_hidden = 16

# Create a small batch of token indices representing toy sentences for demonstration.
vocab_size = 50
batch_size = 4
sample_token_indices = np.random.randint(vocab_size, size=(batch_size, sequence_length))

# Build an embedding layer to convert token indices into dense vector representations.
embedding_layer = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=model_dimension)
embedded_tokens = embedding_layer(sample_token_indices)

# Define a simple multihead attention layer for contextual token mixing operations.
attention_layer = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=model_dimension // 2)
attention_output = attention_layer(embedded_tokens, embedded_tokens)

# Add residual connection around attention output and apply layer normalization operation.
attention_residual = embedded_tokens + attention_output
norm_layer_one = tf.keras.layers.LayerNormalization(epsilon=1e-6)
normalized_attention = norm_layer_one(attention_residual)

# Define first feedforward Dense layer expanding dimensionality with ReLU activation.
ffn_dense_one = tf.keras.layers.Dense(units=feedforward_hidden, activation="relu")
expanded_representation = ffn_dense_one(normalized_attention)

# Define second feedforward Dense layer projecting back to original model dimension.
ffn_dense_two = tf.keras.layers.Dense(units=model_dimension, activation=None)
projected_representation = ffn_dense_two(expanded_representation)

# Add residual connection around feedforward network and apply second normalization.
feedforward_residual = normalized_attention + projected_representation
norm_layer_two = tf.keras.layers.LayerNormalization(epsilon=1e-6)
encoder_output = norm_layer_two(feedforward_residual)

# Print shapes before and after feedforward to highlight expansion and compression behavior.
print("Embedded tokens shape:", embedded_tokens.shape)
print("Expanded representation shape:", expanded_representation.shape)
print("Encoder output shape:", encoder_output.shape)

# Show the first token vector before the feedforward sublayer and after the full block output.
print("First token before feedforward:", normalized_attention[0, 0].numpy())
print("First token after feedforward:", encoder_output[0, 0].numpy())




## **3. Comparing Sequence Architectures**

### **3.1. Training Stability and Speed**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_B/image_03_01.jpg?v=1768980229" width="250">



>* Compare models by training stability and speed
>* Prefer slightly less accurate but reliably trainable models

>* RNNs process timesteps sequentially, so they are often slower per step but tend to train steadily
>* Transformers parallelize across timesteps and often train faster, yet need careful tuning

>* Measure stability, speed, and training curve behavior
>* Prefer models that converge quickly and consistently across runs
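
One possible way to turn "stability" and "speed" into numbers is to summarize loss curves from several runs with different seeds. The sketch below is only an illustration: the metric choices and the helper `summarize_runs` are assumptions, and the loss values and timings are made-up placeholders, not measured results.

```python
import numpy as np

def summarize_runs(loss_curves, seconds_per_run):
    """Summarize per-epoch loss curves of shape (runs, epochs) and run times."""
    curves = np.asarray(loss_curves, dtype=float)
    final = curves[:, -1]
    increases = (np.diff(curves, axis=1) > 0).sum(axis=1)  # epochs where loss went up
    return {
        "final_loss_mean": float(final.mean()),
        "final_loss_std": float(final.std()),            # spread across seeds
        "loss_increases_per_run": float(increases.mean()),
        "seconds_per_epoch": float(np.mean(seconds_per_run) / curves.shape[1]),
    }

# Made-up curves for two hypothetical models, two seeds each (illustration only).
steady_curves = [[0.70, 0.60, 0.52, 0.47], [0.71, 0.62, 0.53, 0.48]]
bumpy_curves = [[0.70, 0.45, 0.55, 0.30], [0.72, 0.80, 0.40, 0.50]]

steady_summary = summarize_runs(steady_curves, seconds_per_run=[8.0, 8.4])
bumpy_summary = summarize_runs(bumpy_curves, seconds_per_run=[4.0, 4.4])

for name, summary in (("steady", steady_summary), ("bumpy", bumpy_summary)):
    print(name, {key: round(value, 3) for key, value in summary.items()})
```

In real use, the curves would come from `history.history["loss"]` collected over repeated training runs, and a model with a lower spread and fewer loss increases would count as the more stable one even if it is slower per epoch.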



In [None]:
#@title Python Code - Training Stability and Speed

# This script compares training stability and speed for two simple sequence models.
# It trains a GRU and a Transformer encoder on a tiny text dataset.
# It prints loss histories and timing to illustrate stability and speed.

# !pip install tensorflow==2.20.0

# Import required standard libraries and TensorFlow framework.
import os
import time
import random
import numpy as np
import tensorflow as tf

# Set deterministic seeds for reproducible and stable training behavior.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Print TensorFlow version information for reproducibility and environment clarity.
print("TensorFlow version:", tf.__version__)

# Enable GPU memory growth when a GPU is available; both models then run on the same device.
physical_gpus = tf.config.list_physical_devices("GPU")
if physical_gpus:
    try:
        tf.config.experimental.set_memory_growth(physical_gpus[0], True)
    except Exception as device_error:
        print("GPU configuration warning:", device_error)
else:
    print("GPU device not available, using CPU.")

# Create a tiny synthetic sentiment dataset with short text examples.
texts = [
    "I love this movie so much",
    "This film was terrible and boring",
    "Absolutely fantastic experience and great acting",
    "I hated every minute of it",
    "What a wonderful and inspiring story",
    "The plot was dull and predictable",
    "Brilliant direction and superb cast",
    "Worst movie I have ever seen",
]

# Create corresponding binary labels for positive and negative sentiment examples.
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Convert lists into TensorFlow dataset friendly NumPy arrays.
texts_array = np.array(texts, dtype=object)
labels_array = np.array(labels, dtype=np.int32)

# Tokenize text into integer sequences with limited vocabulary size.
max_words = 1000
max_len = 12
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=max_words, oov_token="<OOV>")

# Fit tokenizer on the small text corpus for word index creation.
tokenizer.fit_on_texts(texts_array.tolist())
sequences = tokenizer.texts_to_sequences(texts_array.tolist())

# Pad sequences to fixed length for batch processing compatibility.
padded_sequences = tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=max_len, padding="post")

# Validate shapes to ensure correct dimensions for model inputs and labels.
assert padded_sequences.shape[0] == labels_array.shape[0]
assert padded_sequences.shape[1] == max_len

# Split data into simple training and validation sets for evaluation.
train_size = 6
x_train, x_val = padded_sequences[:train_size], padded_sequences[train_size:]
y_train, y_val = labels_array[:train_size], labels_array[train_size:]

# Build a simple GRU based sequence model for baseline comparison.
embedding_dim = 16
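gru_model = tf.keras.Sequential([
    # Declare the input shape explicitly; Embedding's input_length argument is deprecated in Keras 3.
    tf.keras.Input(shape=(max_len,)),
    tf.keras.layers.Embedding(max_words, embedding_dim),
    tf.keras.layers.GRU(16, return_sequences=False),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])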

# Compile GRU model with binary crossentropy loss and Adam optimizer.
gru_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="binary_crossentropy",
                  metrics=["accuracy"])

# Build a tiny Transformer style encoder model for comparison.
inputs = tf.keras.Input(shape=(max_len,), dtype="int32")
embedding_layer = tf.keras.layers.Embedding(max_words, embedding_dim)(inputs)

# Add simple self attention using MultiHeadAttention layer.
attention_output = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=8)(
    embedding_layer, embedding_layer
)

# Add residual connection and layer normalization for stability.
attention_output = tf.keras.layers.LayerNormalization(epsilon=1e-6)(
    embedding_layer + attention_output
)

# Add a small feedforward network: expand, apply dropout, then project back.
ffn = tf.keras.layers.Dense(32, activation="relu")(attention_output)
ffn = tf.keras.layers.Dropout(0.1)(ffn)
ffn = tf.keras.layers.Dense(embedding_dim)(ffn)

# Add second residual connection and layer normalization.
encoder_output = tf.keras.layers.LayerNormalization(epsilon=1e-6)(
    attention_output + ffn
)

# Pool sequence outputs using global average pooling for classification.
pooled_output = tf.keras.layers.GlobalAveragePooling1D()(encoder_output)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(pooled_output)

# Create Transformer encoder model using functional API.
transformer_model = tf.keras.Model(inputs=inputs, outputs=outputs)

# Compile Transformer model with same optimizer and loss for fairness.
transformer_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                          loss="binary_crossentropy",
                          metrics=["accuracy"])

# Define small training parameters to keep runtime and memory usage safe.
batch_size = 2
epochs = 8

# Train GRU model while measuring wall clock training time.
start_time_gru = time.time()
history_gru = gru_model.fit(x_train, y_train,
                            validation_data=(x_val, y_val),
                            batch_size=batch_size,
                            epochs=epochs,
                            verbose=0)
end_time_gru = time.time()

# Train Transformer model while measuring wall clock training time.
start_time_trans = time.time()
history_trans = transformer_model.fit(x_train, y_train,
                                      validation_data=(x_val, y_val),
                                      batch_size=batch_size,
                                      epochs=epochs,
                                      verbose=0)
end_time_trans = time.time()

# Extract training losses for both models to inspect stability behavior.
gru_losses = history_gru.history["loss"]
trans_losses = history_trans.history["loss"]

# Print concise comparison of training stability and speed metrics.
print("GRU training losses:", np.round(gru_losses, 3))
print("Transformer training losses:", np.round(trans_losses, 3))
print("GRU training time seconds:", round(end_time_gru - start_time_gru, 3))
print("Transformer training time seconds:", round(end_time_trans - start_time_trans, 3))
print("GRU final validation accuracy:", round(history_gru.history["val_accuracy"][-1], 3))
print("Transformer final validation accuracy:", round(history_trans.history["val_accuracy"][-1], 3))
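
The printed loss lists can be hard to compare by eye. The sketch below is one possible way to turn a loss curve into a few stability numbers: the total drop, how many epochs the loss went up, and the largest single increase. The helper `summarize_loss_curve` and the example values are assumptions made for illustration; they are not part of Keras or of the cell above.

```python
# Minimal sketch: summarize how smoothly a training loss curve decreases.
# `summarize_loss_curve` is a hypothetical helper, not part of Keras.

def summarize_loss_curve(losses):
    """Return simple stability statistics for a list of per-epoch losses."""
    if len(losses) < 2:
        raise ValueError("Need at least two epochs to compare losses.")
    # Epoch-to-epoch changes; a positive change means the loss went up.
    deltas = [later - earlier for earlier, later in zip(losses, losses[1:])]
    return {
        "total_drop": losses[0] - losses[-1],
        "num_increases": sum(1 for d in deltas if d > 0),
        "largest_increase": max(max(deltas), 0.0),
    }

# Illustrative values only; with the cell above you would pass
# history_gru.history["loss"] or history_trans.history["loss"].
example_losses = [0.70, 0.66, 0.68, 0.61, 0.55]
print(summarize_loss_curve(example_losses))
```

With only six training examples, these numbers will vary a lot between runs, so treat them as a rough indicator rather than a benchmark.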



### **3.2. Overfitting Across Architectures**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_B/image_03_02.jpg?v=1768980300" width="250">



>* RNNs compress sequence history into hidden states
>* Large recurrent models can still severely overfit

>* Transformers can memorize patterns easily through powerful self-attention
>* They can overfit to spurious cues, so monitor validation metrics carefully

>* Compare models under distribution shift and varying example difficulty
>* Use insights to choose architecture-specific regularization



In [None]:
#@title Python Code - Overfitting Across Architectures

# This script compares overfitting between LSTM and Transformer encoders on text data.
# It trains tiny models briefly and prints training versus validation accuracies.
# The goal is illustrating how different architectures can overfit with limited data.

# !pip install tensorflow==2.20.0

# Import required standard libraries and TensorFlow framework modules.
import os
import random
import numpy as np
import tensorflow as tf

# Print TensorFlow version information for reproducibility and environment clarity.
print("TensorFlow version:", tf.__version__)

# Set deterministic random seeds for reproducible training and evaluation behavior.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Enable GPU memory growth when a GPU is available; TensorFlow falls back to CPU otherwise.
physical_gpus = tf.config.list_physical_devices("GPU")
if physical_gpus:
    try:
        tf.config.experimental.set_memory_growth(physical_gpus[0], True)
    except Exception as device_error:
        print("GPU configuration issue, using default settings:", device_error)

# Load the IMDB dataset as word-index sequences, keeping the 8000 most frequent words.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(num_words=8000)

# Restrict dataset size to keep training fast and highlight overfitting behavior.
train_samples = 4000
test_samples = 2000
x_train_small = x_train[:train_samples]
y_train_small = y_train[:train_samples]
x_test_small = x_test[:test_samples]
y_test_small = y_test[:test_samples]

# Pad sequences to fixed length for batching compatibility across both architectures.
max_length = 150
x_train_padded = tf.keras.preprocessing.sequence.pad_sequences(x_train_small, maxlen=max_length)
x_test_padded = tf.keras.preprocessing.sequence.pad_sequences(x_test_small, maxlen=max_length)

# Verify padded shapes to ensure correct dimensions for subsequent model definitions.
print("Train shape:", x_train_padded.shape)
print("Test shape:", x_test_padded.shape)

# Build a simple LSTM based sequence classifier with modest capacity.
embedding_dim = 32
lstm_units = 32
inputs_lstm = tf.keras.Input(shape=(max_length,))
embedding_lstm = tf.keras.layers.Embedding(8000, embedding_dim)(inputs_lstm)
encoded_lstm = tf.keras.layers.LSTM(lstm_units)(embedding_lstm)
outputs_lstm = tf.keras.layers.Dense(1, activation="sigmoid")(encoded_lstm)
model_lstm = tf.keras.Model(inputs=inputs_lstm, outputs=outputs_lstm)

# Build a minimal Transformer-style classifier using MultiHeadAttention.
# It keeps only self-attention and pooling, without residual, normalization, or feedforward sublayers.
inputs_trans = tf.keras.Input(shape=(max_length,))
embedding_trans = tf.keras.layers.Embedding(8000, embedding_dim)(inputs_trans)
attention_layer = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=16)
attended_output = attention_layer(embedding_trans, embedding_trans)
pooled_output = tf.keras.layers.GlobalAveragePooling1D()(attended_output)
outputs_trans = tf.keras.layers.Dense(1, activation="sigmoid")(pooled_output)
model_trans = tf.keras.Model(inputs=inputs_trans, outputs=outputs_trans)

# Compile both models with identical optimizer and loss for fair comparison.
optimizer_lstm = tf.keras.optimizers.Adam(learning_rate=0.001)
optimizer_trans = tf.keras.optimizers.Adam(learning_rate=0.001)
model_lstm.compile(optimizer=optimizer_lstm, loss="binary_crossentropy", metrics=["accuracy"])
model_trans.compile(optimizer=optimizer_trans, loss="binary_crossentropy", metrics=["accuracy"])

# Train LSTM model for several epochs to encourage visible overfitting behavior.
history_lstm = model_lstm.fit(x_train_padded, y_train_small, validation_split=0.3, epochs=5, batch_size=64, verbose=0)

# Train Transformer model similarly and observe different overfitting characteristics.
history_trans = model_trans.fit(x_train_padded, y_train_small, validation_split=0.3, epochs=5, batch_size=64, verbose=0)

# Evaluate both models on held out test data for generalization performance comparison.
loss_lstm, acc_lstm = model_lstm.evaluate(x_test_padded, y_test_small, verbose=0)
loss_trans, acc_trans = model_trans.evaluate(x_test_padded, y_test_small, verbose=0)

# Helper function prints final epoch metrics summarizing overfitting indicators.
def summarize_history(label, history_object, test_accuracy_value):
    train_acc_values = history_object.history["accuracy"]
    val_acc_values = history_object.history["val_accuracy"]
    final_train = float(train_acc_values[-1])
    final_val = float(val_acc_values[-1])
    print(f"{label} train accuracy: {final_train:.3f}, validation accuracy: {final_val:.3f}, test accuracy: {test_accuracy_value:.3f}")

# Print concise comparison showing training versus validation and test accuracies.
summarize_history("LSTM", history_lstm, acc_lstm)
summarize_history("Transformer", history_trans, acc_trans)
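
The final-epoch numbers above show only one point on each curve. A minimal sketch of how the whole curve could be inspected follows: it computes the per-epoch gap between training and validation accuracy and finds the epoch where validation accuracy peaked. The helpers `generalization_gaps` and `best_validation_epoch` and the accuracy values are hypothetical, chosen only to illustrate a widening gap.

```python
# Minimal sketch: quantify overfitting from per-epoch accuracy curves.
# `generalization_gaps` and `best_validation_epoch` are hypothetical helpers.

def generalization_gaps(train_acc, val_acc):
    """Per-epoch gap between training and validation accuracy."""
    if len(train_acc) != len(val_acc):
        raise ValueError("Accuracy lists must have the same length.")
    return [t - v for t, v in zip(train_acc, val_acc)]

def best_validation_epoch(val_acc):
    """Zero-based index of the first epoch with the highest validation accuracy."""
    return max(range(len(val_acc)), key=lambda i: val_acc[i])

# Illustrative values only, not results from the models above.
train_acc = [0.62, 0.78, 0.88, 0.94, 0.98]
val_acc = [0.60, 0.74, 0.80, 0.79, 0.77]

gaps = generalization_gaps(train_acc, val_acc)
print("Gap per epoch:", [round(g, 2) for g in gaps])
print("Best validation epoch (zero-based):", best_validation_epoch(val_acc))
```

With the cell above, the inputs would be `history_lstm.history["accuracy"]` and `history_lstm.history["val_accuracy"]` (and the same keys for `history_trans`). If training should stop automatically near the validation peak, `tf.keras.callbacks.EarlyStopping` with `monitor="val_accuracy"` and `restore_best_weights=True` is one option worth trying.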



### **3.3. Selecting Sequence Models**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_B/image_03_03.jpg?v=1768980414" width="250">



>* Match model choice to task constraints and metrics
>* Prefer simpler models when performance is similar

>* Compare architectures across lengths, data, and noise
>* RNNs often suffice for short texts; transformers tend to help on longer ones

>* Consider lifecycle, transfer learning, and explainability needs
>* Use error analysis to balance performance and fairness
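
The rule "prefer simpler models when performance is similar" can be made concrete. The sketch below assumes each candidate is described by a validation accuracy and a parameter count, and picks the smallest model whose accuracy is within a chosen tolerance of the best one. The function `select_model`, the tolerance, and all numbers are hypothetical and only illustrate the idea.

```python
# Minimal sketch: prefer the smallest model whose validation score is
# within a tolerance of the best one. All names and numbers are hypothetical.

def select_model(candidates, tolerance=0.01):
    """candidates: list of dicts with 'name', 'val_accuracy', 'num_params'."""
    if not candidates:
        raise ValueError("No candidate models provided.")
    best_score = max(c["val_accuracy"] for c in candidates)
    # Keep every model that is close enough to the best score.
    close_enough = [
        c for c in candidates if best_score - c["val_accuracy"] <= tolerance
    ]
    # Among those, choose the one with the fewest parameters.
    return min(close_enough, key=lambda c: c["num_params"])

candidates = [
    {"name": "gru", "val_accuracy": 0.842, "num_params": 130_000},
    {"name": "lstm", "val_accuracy": 0.845, "num_params": 140_000},
    {"name": "transformer", "val_accuracy": 0.850, "num_params": 310_000},
]
print(select_model(candidates)["name"])
```

For real Keras models, `model.count_params()` can supply the parameter count. Parameter count is only one proxy for simplicity; latency, memory, or maintenance cost could be used in its place.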



# <font color="#418FDE" size="6.5" uppercase>**Sequence Models**</font>


In this lecture, you learned to:
- Build LSTM or GRU-based sequence models for text classification using tf.keras. 
- Implement a simple transformer-style encoder block using Keras layers for sequence modeling. 
- Evaluate and compare the performance of different sequence architectures on an NLP task. 

In the next module (Module 8), we will cover 'Distributed Training'.