# <font color="#418FDE" size="6.5" uppercase>**Sequence Models**</font>

>Last update: 20260126.
    
By the end of this lecture, you will be able to:
- Build LSTM or GRU-based sequence models for text classification using tf.keras. 
- Implement a simple transformer-style encoder block using Keras layers for sequence modeling. 
- Evaluate and compare the performance of different sequence architectures on an NLP task. 


## **1. Recurrent Neural Networks**

### **1.1. LSTM and GRU Essentials**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_B/image_01_01.jpg?v=1769440720" width="250">



>* LSTMs and GRUs handle long text dependencies
>* Gates manage context for accurate text classification

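>* LSTMs use input, forget, and output gates to manage long- and short-term memory
>* GRUs use fewer gates (update and reset) yet often match LSTM accuracy on text

>* Choose between LSTMs and GRUs empirically, based on your data and compute budget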
>* Both maintain context over time for classification
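
To make the gating idea concrete, here is a minimal sketch of one GRU time step for a single hidden unit, written in plain Python so the arithmetic is visible. The parameter names (`w_z`, `u_z`, and so on) and their values are hypothetical and chosen only for illustration. `layers.GRU` applies the same idea across many units with its own parameter layout, so treat this as a mental model rather than the library's implementation.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x, h_prev, p):
    """One GRU time step for a scalar input and a scalar hidden state."""
    # Update gate: how much of the previous state to keep.
    z = sigmoid(p["w_z"] * x + p["u_z"] * h_prev + p["b_z"])
    # Reset gate: how much of the previous state feeds the candidate.
    r = sigmoid(p["w_r"] * x + p["u_r"] * h_prev + p["b_r"])
    # Candidate state built from the input and the reset-scaled history.
    h_cand = math.tanh(p["w_h"] * x + p["u_h"] * (r * h_prev) + p["b_h"])
    # Blend old state and candidate (one common GRU convention).
    return z * h_prev + (1.0 - z) * h_cand


# Hypothetical weights chosen only for illustration.
params = {
    "w_z": 0.5, "u_z": 0.4, "b_z": 0.0,
    "w_r": 0.3, "u_r": 0.2, "b_r": 0.0,
    "w_h": 1.0, "u_h": 0.7, "b_h": 0.0,
}

h = 0.0
states = []
for x in [0.5, -1.0, 0.25]:
    h = gru_step(x, h, params)
    states.append(h)

print([round(s, 3) for s in states])
```

When the update gate saturates near 1, the previous state passes through almost unchanged, which is how the cell can carry context across many tokens.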



In [None]:
#@title Python Code - LSTM and GRU Essentials

# This script shows simple LSTM and GRU usage.
# It builds tiny text classifiers with TensorFlow.
# Focus on essentials for recurrent sequence models.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import random
import numpy as np

# Import TensorFlow and Keras layers.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Set deterministic random seeds.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Print TensorFlow version briefly.
print("TensorFlow version:", tf.__version__)

# Prepare a tiny toy text dataset.
texts = [
    "I love this movie it is great",
    "This film was terrible and boring",
    "Amazing acting and wonderful story",
    "I hate this movie it is awful",
    "What a fantastic and enjoyable film",
    "The plot was dull and predictable",
]

# Create binary sentiment labels.
labels = np.array([1, 0, 1, 0, 1, 0], dtype=np.int32)

# Use Keras TextVectorization for tokenization.
max_tokens = 1000
max_length = 10
vectorizer = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)

# Adapt vectorizer on the small corpus.
vectorizer.adapt(texts)

# Vectorize the raw text data.
text_ds = tf.constant(texts)
sequences = vectorizer(text_ds)

# Validate resulting tensor shape.
print("Vectorized shape:", sequences.shape)

# Split data into train and validation.
train_sequences = sequences[:4]
train_labels = labels[:4]
val_sequences = sequences[4:]
val_labels = labels[4:]

# Define a function building an LSTM model.
def build_lstm_model(vocab_size, sequence_length):
    inputs = keras.Input(shape=(sequence_length,))
    x = layers.Embedding(vocab_size, 16)(inputs)
    x = layers.LSTM(16)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Define a function building a GRU model.
def build_gru_model(vocab_size, sequence_length):
    inputs = keras.Input(shape=(sequence_length,))
    x = layers.Embedding(vocab_size, 16)(inputs)
    x = layers.GRU(16)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model

# Build LSTM and GRU models.
lstm_model = build_lstm_model(max_tokens, max_length)
gru_model = build_gru_model(max_tokens, max_length)

# Train LSTM model briefly.
history_lstm = lstm_model.fit(
    train_sequences,
    train_labels,
    epochs=10,
    batch_size=2,
    validation_data=(val_sequences, val_labels),
    verbose=0,
)

# Train GRU model briefly.
history_gru = gru_model.fit(
    train_sequences,
    train_labels,
    epochs=10,
    batch_size=2,
    validation_data=(val_sequences, val_labels),
    verbose=0,
)

# Evaluate the LSTM model on the validation split.
val_loss_lstm, val_acc_lstm = lstm_model.evaluate(
    val_sequences,
    val_labels,
    verbose=0,
)

# Evaluate GRU model similarly.
val_loss_gru, val_acc_gru = gru_model.evaluate(
    val_sequences,
    val_labels,
    verbose=0,
)

# Print concise comparison of accuracies.
print("LSTM validation accuracy:", round(float(val_acc_lstm), 3))
print("GRU validation accuracy:", round(float(val_acc_gru), 3))

# Show predictions for a few new sentences.
new_texts = [
    "I really enjoyed this wonderful movie",
    "This was a horrible and dull film",
]

# Vectorize new sentences.
new_seq = vectorizer(tf.constant(new_texts))

# Get LSTM and GRU predictions.
pred_lstm = lstm_model.predict(new_seq, verbose=0)
pred_gru = gru_model.predict(new_seq, verbose=0)

# Print predictions rounded for clarity.
for i, txt in enumerate(new_texts):
    print("Text:", txt)
    print(
        "LSTM prob:", round(float(pred_lstm[i][0]), 3),
        "GRU prob:", round(float(pred_gru[i][0]), 3),
    )




### **1.2. Bidirectional RNN Layers**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_B/image_01_02.jpg?v=1769440818" width="250">



>* Bidirectional RNNs read sequences forward and backward
>* Each token uses both past and future context

>* Forward and backward passes give full sentence context
>* Merged states capture long-range patterns and nuances

>* Place bidirectional layer after word embeddings
>* Summarize outputs using pooling, states, or attention
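
The forward and backward passes described above can be sketched with a toy scalar RNN in plain Python. The recurrence and the weights (`w_x`, `w_h`) are illustrative assumptions, not what `layers.Bidirectional` computes internally. The point is only that the backward pass reads the reversed sequence and is then re-aligned, so each position ends up with one state from each direction.

```python
import math


def run_rnn(xs, w_x, w_h):
    """Return the hidden state after every step of a toy scalar RNN."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states


def bidirectional_states(xs):
    """Pair each position's forward state with its backward state."""
    forward = run_rnn(xs, w_x=0.8, w_h=0.5)
    # Read the sequence right to left, then flip back to the original order.
    backward = run_rnn(xs[::-1], w_x=0.6, w_h=0.4)[::-1]
    return list(zip(forward, backward))


xs = [0.2, -0.4, 0.9, 0.1]
pairs = bidirectional_states(xs)

# Sequence summary: last forward state plus the backward state at position 0,
# each of which has read the whole sequence.
summary = (pairs[-1][0], pairs[0][1])
print([tuple(round(v, 3) for v in p) for p in pairs])
print("summary:", tuple(round(v, 3) for v in summary))
```

At position 0 the forward state has seen only the first token, while the backward state has already seen everything to its right. That is the sense in which each token uses both past and future context.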



In [None]:
#@title Python Code - Bidirectional RNN Layers

# This script shows bidirectional RNN layers.
# It builds a tiny text classifier example.
# It runs quickly and prints concise results.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import random
import numpy as np

# Import TensorFlow and Keras utilities.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Set deterministic random seeds.
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# Set TensorFlow random seed.
tf.random.set_seed(SEED)

# Print TensorFlow version once.
print("TensorFlow version:", tf.__version__)

# Prepare a tiny toy text dataset.
texts = [
    "I love this movie so much",
    "This film was terrible and boring",
    "Absolutely fantastic acting and story",
    "I hated the plot and characters",
    "What a great and inspiring film",
    "The movie was bad and disappointing",
]

# Define binary sentiment labels.
labels = np.array([1, 0, 1, 0, 1, 0], dtype=np.int32)

# Create a simple TextVectorization layer.
max_tokens = 1000
sequence_length = 10
vectorizer = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=sequence_length,
)

# Adapt vectorizer on the small corpus.
vectorizer.adapt(texts)

# Vectorize the raw text data.
text_ds = tf.constant(texts)
vectorized_texts = vectorizer(text_ds)

# Confirm shapes before building the model.
print("Vectorized shape:", vectorized_texts.shape)

# Build a simple bidirectional LSTM classifier.
embedding_dim = 16
inputs = keras.Input(shape=(sequence_length,))

# Add an embedding layer after inputs.
x = layers.Embedding(max_tokens, embedding_dim)(inputs)

# Add a bidirectional LSTM layer.
x = layers.Bidirectional(
    layers.LSTM(16, return_sequences=False)
)(x)

# Add a small dense layer for features.
x = layers.Dense(16, activation="relu")(x)

# Add the final classification output layer.
outputs = layers.Dense(1, activation="sigmoid")(x)

# Create the Keras model object.
model = keras.Model(inputs=inputs, outputs=outputs)

# Compile the model with binary crossentropy.
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# Train briefly with silent verbose setting.
history = model.fit(
    vectorized_texts,
    labels,
    epochs=10,
    batch_size=2,
    verbose=0,
)

# Evaluate the model on the same tiny data.
loss, acc = model.evaluate(
    vectorized_texts,
    labels,
    verbose=0,
)

# Print a short summary of performance.
print("Bidirectional model accuracy:", round(float(acc), 3))

# Demonstrate prediction on a new sentence.
new_texts = tf.constant([
    "The movie was surprisingly good overall",
])

# Vectorize the new sentence.
new_vec = vectorizer(new_texts)

# Get prediction probability from the model.
prob = model.predict(new_vec, verbose=0)[0, 0]

# Print the predicted sentiment probability.
print("Predicted positive sentiment probability:", round(float(prob), 3))



### **1.3. Handling variable sequences**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_B/image_01_03.jpg?v=1769440861" width="250">



>* Real text sequences vary widely in length
>* Padding and masking let us batch sequences efficiently without distorting results

>* Pad or truncate token sequences to a fixed length
>* Use masks so RNN ignores padding tokens

>* Use final states aligned with real tokens
>* Batch similar lengths; tune padding and masking
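
The padding and masking steps above can be sketched without any framework. The helper names below (`pad_or_truncate`, `padding_mask`, `last_real_index`) are hypothetical. The sketch assumes right-padding with token id 0, the id that `TextVectorization` reserves for padding.

```python
def pad_or_truncate(tokens, length, pad_id=0):
    """Cut a token list to `length` or right-pad it with `pad_id`."""
    clipped = list(tokens[:length])
    return clipped + [pad_id] * (length - len(clipped))


def padding_mask(padded, pad_id=0):
    """True for real tokens, False for padding."""
    return [tok != pad_id for tok in padded]


def last_real_index(mask):
    """Index of the last real token, or -1 if the row is all padding."""
    real = [i for i, keep in enumerate(mask) if keep]
    return real[-1] if real else -1


# Hypothetical token ids; 0 is reserved for padding.
batch = [[4], [7, 3], [9, 2, 5, 8, 6, 1]]
padded = [pad_or_truncate(seq, length=5) for seq in batch]
masks = [padding_mask(row) for row in padded]

for row, mask in zip(padded, masks):
    print(row, "last real index:", last_real_index(mask))
```

A masked RNN effectively stops updating its state after the last real index, so the final state summarizes only real tokens rather than trailing zeros.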



In [None]:
#@title Python Code - Handling variable sequences

# This script shows padding and masking usage.
# It builds a tiny LSTM text classifier.
# It highlights handling variable length sequences.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import random
import numpy as np

# Import TensorFlow and Keras utilities.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Set deterministic seeds for reproducibility.
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Prepare a tiny toy text dataset.
texts = [
    "bad",
    "very bad",
    "really bad movie",
    "good",
    "very good",
    "really good movie",
]

# Create matching sentiment labels for texts.
labels = np.array([0, 0, 0, 1, 1, 1], dtype="int32")

# Build a TextVectorization layer with padding.
max_tokens = 100
max_length = 5
vectorizer = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)

# Adapt vectorizer on the small text dataset.
vectorizer.adapt(texts)

# Vectorize texts into padded integer sequences.
text_ds = tf.data.Dataset.from_tensor_slices(texts)
seq_ds = text_ds.map(vectorizer)
sequences = np.stack(list(seq_ds.as_numpy_iterator()))

# Show original texts and padded sequences.
for t, s in zip(texts, sequences):
    print("Text:", repr(t), "->", s)

# Confirm padded sequence shape is as expected.
print("Sequence batch shape:", sequences.shape)

# Build a simple LSTM model with masking.
embedding_dim = 8
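model = keras.Sequential([
    # Declare the input shape here; Embedding's `input_length` is deprecated.
    keras.Input(shape=(max_length,)),
    layers.Embedding(
        input_dim=max_tokens,
        output_dim=embedding_dim,
        mask_zero=True,  # Treat token id 0 as padding and mask it downstream.
    ),
    layers.LSTM(16),
    layers.Dense(1, activation="sigmoid"),
])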

# Compile model with binary crossentropy loss.
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# Train briefly for only a few epochs.
history = model.fit(
    sequences,
    labels,
    batch_size=2,
    epochs=10,
    verbose=0,
)

# Evaluate model performance on same tiny data.
loss, acc = model.evaluate(
    sequences,
    labels,
    verbose=0,
)

# Print evaluation metrics in a compact form.
print("Loss:", round(float(loss), 4), "Accuracy:", round(float(acc), 4))




## **2. Transformer Encoder Block**

### **2.1. Multihead Attention Basics**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_B/image_02_01.jpg?v=1769440929" width="250">



>* Attention lets each token focus on others
>* Weighted combinations capture long-range dependencies without recurrence

>* Multiple heads learn different sequence relationships simultaneously
>* Their combined views give richer token representations

>* Parallel heads link any tokens, capturing long dependencies
>* Enables context-aware representations for many NLP tasks
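
As a rough illustration of the weighted combinations described above, the following plain-Python sketch computes single-head scaled dot-product self-attention with a padding mask. It deliberately leaves out the learned query, key, value, and output projections that `MultiHeadAttention` applies per head, and the token vectors are made-up values.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, key_mask):
    """Single-head scaled dot-product self-attention without projections."""
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        scores = []
        for key, keep in zip(tokens, key_mask):
            score = sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            # Padded keys get a very negative score, so softmax gives ~0.
            scores.append(score if keep else -1e9)
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted average of the value vectors.
        outputs.append(
            [sum(w * tok[i] for w, tok in zip(row, tokens)) for i in range(dim)]
        )
    return outputs, weights


# Hypothetical 2-D token vectors; the third position is padding.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
mask = [True, True, False]
outputs, weights = self_attention(tokens, mask)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and assigns essentially zero weight to the padded position. Multiple heads repeat this computation with different learned projections and concatenate the results.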



In [None]:
#@title Python Code - Multihead Attention Basics

# This script explains basic multihead attention concepts.
# It uses TensorFlow to build a tiny example.
# Run cells in order to follow the explanation.

# !pip install tensorflow==2.20.0

# Import required modules from TensorFlow.
import tensorflow as tf

# Set deterministic seeds for reproducibility.
tf.random.set_seed(7)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Define tiny toy token id sequences for a batch.
toy_token_ids = tf.constant([[1, 2, 3, 0], [4, 5, 0, 0]])

# Show the toy token id tensor shape.
print("Token ids shape:", toy_token_ids.shape)

# Define basic configuration hyperparameters for the demo.
vocab_size, embed_dim, num_heads = 20, 8, 2

# Create a simple embedding layer for token ids.
embedding_layer = tf.keras.layers.Embedding(vocab_size, embed_dim)

# Embed the toy token ids into dense vectors.
embedded_tokens = embedding_layer(toy_token_ids)

# Confirm the embedded tensor has expected shape.
print("Embedded shape:", embedded_tokens.shape)

# Create a boolean padding mask: True marks real tokens, False marks padding.
padding_mask = tf.math.not_equal(toy_token_ids, 0)

# Add a query axis so the mask broadcasts to (batch, query_len, key_len).
attention_mask = padding_mask[:, tf.newaxis, :]

# Show the attention mask shape for verification.
print("Attention mask shape:", attention_mask.shape)

# Create a MultiHeadAttention layer for self attention.
mha_layer = tf.keras.layers.MultiHeadAttention(
    num_heads=num_heads, key_dim=embed_dim
)

# Use the same tensor for queries, keys, and values.
query_tensor = embedded_tokens

# Apply multihead self attention with the padding mask.
attended_output, attention_scores = mha_layer(
    query=query_tensor,
    value=embedded_tokens,
    key=embedded_tokens,
    attention_mask=attention_mask,
    return_attention_scores=True,
)

# Confirm the attended output tensor shape.
print("Attended output shape:", attended_output.shape)

# Confirm the attention scores tensor shape.
print("Attention scores shape:", attention_scores.shape)

# Take mean attention over heads for easier inspection.
mean_scores = tf.reduce_mean(attention_scores, axis=1)

# Round scores for compact printing demonstration.
rounded_scores = tf.round(mean_scores * 10.0) / 10.0

# Print the compact attention scores for the first sequence.
print("Mean attention scores sample:")

# Print only the first sequence attention matrix.
print(rounded_scores[0].numpy())




### **2.2. Residual Connections and Normalization**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_B/image_02_02.jpg?v=1769440963" width="250">



>* Residual connections add inputs back after attention
>* They ease gradient flow and stabilize deep training

>* Layer normalization rescales features per individual example
>* Applied after residual add to stabilize training

>* Residual plus normalization form repeated encoder patterns
>* They enable deep, stable, easily stackable transformers
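
The add-then-normalize pattern can be written out for a single token vector. This sketch omits the learned scale and offset that `LayerNormalization` also applies, and the input numbers are hypothetical.

```python
import math


def layer_norm(vec, eps=1e-6):
    """Normalize one token's features to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def add_and_norm(x, sublayer_out):
    """Residual add followed by layer normalization (post-norm ordering)."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])


# Hypothetical feature vector and sublayer output for one token.
x = [1.0, 2.0, 3.0, 4.0]
attn_out = [0.5, -0.5, 0.5, -0.5]
out = add_and_norm(x, attn_out)
print([round(v, 3) for v in out])
```

If the sublayer outputs zeros, `add_and_norm` reduces to normalizing the input itself. This is the identity path that keeps gradients flowing through deep stacks.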



In [None]:
#@title Python Code - Residual Connections and Normalization

# This script shows residual connections and normalization.
# It builds a tiny transformer encoder using Keras layers.
# It trains briefly on dummy text classification data.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import random
import numpy as np

# Import TensorFlow and Keras submodules.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Set deterministic seeds for reproducibility.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Define tiny dummy text dataset as integer sequences.
num_samples = 64
sequence_length = 10
vocab_size = 50
num_classes = 2

# Create random integer token sequences.
X = np.random.randint(1, vocab_size, size=(num_samples, sequence_length))

# Create random binary labels; with no real signal, accuracy should stay near chance.
y = np.random.randint(0, num_classes, size=(num_samples,))

# Split into small train and validation sets.
train_size = 48
X_train, X_val = X[:train_size], X[train_size:]
y_train, y_val = y[:train_size], y[train_size:]

# Define a simple transformer encoder block class.
class SimpleTransformerEncoder(layers.Layer):
    # Initialize attention, dense, and normalization layers.
    def __init__(self, embed_dim, num_heads, ff_dim, **kwargs):
        super().__init__(**kwargs)
        self.att = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim
        )
        self.ffn = keras.Sequential(
            [
                layers.Dense(ff_dim, activation="relu"),
                layers.Dense(embed_dim),
            ]
        )
        self.norm1 = layers.LayerNormalization(epsilon=1e-6)
        self.norm2 = layers.LayerNormalization(epsilon=1e-6)

    # Apply attention, residuals, and normalization.
    def call(self, inputs, training=False):
        attn_output = self.att(inputs, inputs, training=training)
        res1 = inputs + attn_output
        out1 = self.norm1(res1, training=training)
        ffn_output = self.ffn(out1, training=training)
        res2 = out1 + ffn_output
        return self.norm2(res2, training=training)


# Define model hyperparameters for embeddings.
embed_dim = 16
num_heads = 2
ff_dim = 32

# Build the Keras model using the encoder block.
inputs = keras.Input(shape=(sequence_length,), dtype="int32")

# Embed tokens into dense vectors.
embedding_layer = layers.Embedding(
    input_dim=vocab_size, output_dim=embed_dim
)
embedded = embedding_layer(inputs)

# Apply the custom transformer encoder block.
encoder = SimpleTransformerEncoder(
    embed_dim=embed_dim, num_heads=num_heads, ff_dim=ff_dim
)
encoded = encoder(embedded)

# Pool sequence representations with global average.
pooled = layers.GlobalAveragePooling1D()(encoded)

# Add a small dense layer for classification.
outputs = layers.Dense(num_classes, activation="softmax")(pooled)

# Create and compile the final model.
model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(
    optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)

# Validate model output shape for safety.
sample_out = model(X_train[:2])
print("Sample output shape:", sample_out.shape)

# Train briefly with silent verbose setting.
history = model.fit(
    X_train,
    y_train,
    validation_data=(X_val, y_val),
    epochs=3,
    batch_size=8,
    verbose=0,
)

# Evaluate model performance on validation set.
val_loss, val_acc = model.evaluate(X_val, y_val, verbose=0)

# Print results; with random labels these only confirm the pipeline runs end to end.
print("Validation loss:", float(val_loss))
print("Validation accuracy:", float(val_acc))
print("Encoder block uses residuals and layer normalization.")



### **2.3. Feedforward Layers Explained**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_B/image_02_03.jpg?v=1769441054" width="250">



>* Feedforward layer transforms attention outputs into richer features
>* Applies same two-layer MLP to each token

>* Shared feedforward network balances power and efficiency
>* Expanded dimension and activations learn rich token features

>* Feedforward uses dense layers per time step
>* Design choices and residuals affect generalization strength
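
To show what "the same two-layer MLP for each token" means, here is a small sketch with hand-picked weights. The matrices `W1` and `W2` are hypothetical and chosen so the result is easy to check by hand: the network expands 2 features to 4, applies ReLU, and projects back to 2, which here works out to an element-wise absolute value.

```python
def dense(vec, weights, bias, relu=False):
    """Dense layer; `weights` holds one weight row per output unit."""
    out = [sum(v * w for v, w in zip(vec, row)) + b
           for row, b in zip(weights, bias)]
    return [max(0.0, o) for o in out] if relu else out


# Hypothetical weights: expand 2 features to 4, then project back to 2.
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
B1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
B2 = [0.0, 0.0]


def feed_forward(token):
    hidden = dense(token, W1, B1, relu=True)  # Expand and apply ReLU.
    return dense(hidden, W2, B2)              # Project back down.


def position_wise_ffn(tokens):
    """Apply the same network to every token independently."""
    return [feed_forward(tok) for tok in tokens]


tokens = [[-2.0, 3.0], [0.5, -0.5], [1.0, 1.0]]
print(position_wise_ffn(tokens))
```

Because every token goes through identical weights with no mixing across positions, reordering the tokens simply reorders the outputs. Mixing across positions is left entirely to the attention sublayer.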



In [None]:
#@title Python Code - Feedforward Layers Explained

# This script explains transformer feedforward layers simply.
# It builds a tiny encoder block using Keras layers.
# It trains briefly on toy text data for clarity.

# !pip install tensorflow==2.20.0

# Import required standard libraries safely.
import os
import random
import numpy as np

# Set deterministic seeds for reproducibility.
random.seed(7)
np.random.seed(7)

# Import TensorFlow and Keras submodules.
import tensorflow as tf
from tensorflow import keras

# Seed TensorFlow as well so weight initialization is repeatable.
tf.random.set_seed(7)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Select device preference based on availability.
physical_gpus = tf.config.list_physical_devices("GPU")
if physical_gpus:
    device_name = "GPU"
else:
    device_name = "CPU"

# Print which device type will likely be used.
print("Using device type:", device_name)

# Define tiny toy sentences and labels.
texts = [
    "good movie nice story",
    "bad movie boring plot",
    "great acting and good direction",
    "terrible acting and bad script",
]

# Define binary sentiment labels for sentences.
labels = np.array([1, 0, 1, 0], dtype="int32")

# Create a simple Keras tokenizer object.
tokenizer = keras.preprocessing.text.Tokenizer(num_words=50)

# Fit tokenizer on the tiny text corpus.
tokenizer.fit_on_texts(texts)

# Convert texts to integer sequences.
sequences = tokenizer.texts_to_sequences(texts)

# Pad sequences to equal fixed length.
max_len = 6
padded = keras.preprocessing.sequence.pad_sequences(
    sequences, maxlen=max_len, padding="post"
)

# Convert padded sequences to int32 tensor.
inputs_array = np.array(padded, dtype="int32")

# Validate shapes before building model.
print("Input shape:", inputs_array.shape)

# Define model hyperparameters for encoder.
vocab_size = 50
model_dim = 16
ff_dim = 32
num_heads = 2

# Create model input layer for token ids.
inputs = keras.layers.Input(shape=(max_len,), dtype="int32")

# Embed tokens into dense vectors.
embedding_layer = keras.layers.Embedding(
    input_dim=vocab_size, output_dim=model_dim
)

# Apply embedding to input token ids.
embedded = embedding_layer(inputs)

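# Add learned positional embeddings inside the model graph.
# Calling an Embedding on an eager tf.range outside the graph would produce a
# constant tensor whose weights never train, so wrap the lookup in a layer.
class AddPositionEmbedding(keras.layers.Layer):
    # Create one trainable vector per position.
    def __init__(self, max_len, model_dim, **kwargs):
        super().__init__(**kwargs)
        self.max_len = max_len
        self.pos_embedding = keras.layers.Embedding(
            input_dim=max_len, output_dim=model_dim
        )

    # Broadcast positional embeddings across the batch and add them.
    def call(self, token_embeddings):
        positions = tf.range(start=0, limit=self.max_len, delta=1)
        return token_embeddings + self.pos_embedding(positions)


# Add token and positional embeddings together.
encoded_tokens = AddPositionEmbedding(max_len, model_dim)(embedded)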

# Apply multihead self attention layer.
attention_layer = keras.layers.MultiHeadAttention(
    num_heads=num_heads, key_dim=model_dim
)

# Use self attention with query key value same.
attn_output = attention_layer(
    query=encoded_tokens, value=encoded_tokens, key=encoded_tokens
)

# Add residual connection around attention.
attn_residual = keras.layers.Add()([encoded_tokens, attn_output])

# Normalize attention output for stability.
attn_norm = keras.layers.LayerNormalization(epsilon=1e-6)(attn_residual)

# First dense layer expands dimensionality.
ffn_dense1 = keras.layers.Dense(ff_dim, activation="relu")

# Second dense layer projects back down.
ffn_dense2 = keras.layers.Dense(model_dim)

# Apply first dense layer position wise.
ffn_intermediate = ffn_dense1(attn_norm)

# Optionally apply dropout for regularization.
ffn_intermediate = keras.layers.Dropout(0.1)(ffn_intermediate)

# Apply second dense layer position wise.
ffn_output = ffn_dense2(ffn_intermediate)

# Add residual connection around feedforward.
ffn_residual = keras.layers.Add()([attn_norm, ffn_output])

# Normalize feedforward output for stability.
ffn_norm = keras.layers.LayerNormalization(epsilon=1e-6)(ffn_residual)

# Pool sequence by averaging token representations.
pooled = keras.layers.GlobalAveragePooling1D()(ffn_norm)

# Final dense layer outputs a positive-class probability (sigmoid, not raw logits).
outputs = keras.layers.Dense(1, activation="sigmoid")(pooled)

# Build Keras model object from inputs and outputs.
model = keras.Model(inputs=inputs, outputs=outputs)

# Compile model with simple optimizer and loss.
model.compile(
    optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"]
)

# Train briefly with silent verbose setting.
history = model.fit(
    inputs_array,
    labels,
    epochs=10,
    batch_size=2,
    verbose=0,
)

# Evaluate model performance on same tiny data.
loss, acc = model.evaluate(inputs_array, labels, verbose=0)

# Print accuracy to show model is learning.
print("Training accuracy on tiny set:", round(float(acc), 3))

# Inspect one example's tensors before and after the feedforward sublayer.
example_input = inputs_array[:1]

# Build intermediate model to inspect tensors.
intermediate_model = keras.Model(
    inputs=inputs, outputs=[attn_norm, ffn_norm]
)

# Get attention and feedforward outputs.
attn_out_val, ffn_out_val = intermediate_model.predict(
    example_input, verbose=0
)

# Print shapes to highlight position wise behavior.
print("Attention output shape:", attn_out_val.shape)
print("Feedforward output shape:", ffn_out_val.shape)



## **3. Comparing Sequence Architectures**

### **3.1. Training Stability and Speed**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_B/image_03_01.jpg?v=1769441097" width="250">



>* Compare models by training stability and speed
>* Prefer reliably converging models under real resource limits

>* LSTMs and GRUs train steadily but slowly because time steps run sequentially
>* Gates aid stability, yet gradients need monitoring

>* Transformers train faster using parallel self-attention operations
>* They need careful tuning to stay stable and reliable
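
One simple way to put a number on "stability" is to summarize a per-epoch loss curve. The helper below is a hypothetical sketch, and both loss curves are made-up values for illustration rather than measured results. It counts how often the loss went up between epochs and how large the worst jump was.

```python
def stability_summary(losses):
    """Summarize how smoothly a per-epoch loss curve decreases."""
    jumps = [after - before
             for before, after in zip(losses, losses[1:])
             if after > before]
    return {
        "net_improvement": losses[0] - losses[-1],
        "num_increases": len(jumps),
        "largest_increase": max(jumps, default=0.0),
    }


# Made-up loss curves for illustration, not measured results.
smooth_curve = [0.70, 0.62, 0.55, 0.50, 0.47]
spiky_curve = [0.70, 0.45, 0.80, 0.40, 0.52]

smooth = stability_summary(smooth_curve)
spiky = stability_summary(spiky_curve)
print("smooth:", smooth)
print("spiky:", spiky)
```

The same function can be applied to `history.history["loss"]` from any Keras run. Comparing the summaries across several seeds gives a fairer picture than a single run.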



In [None]:
#@title Python Code - Training Stability and Speed

# This script compares training stability and speed.
# It uses tiny models on simple text data.
# Focus on LSTM and Transformer style encoders.

# !pip install tensorflow==2.20.0

# Import required libraries safely.
import os
import time
import random
import numpy as np
import tensorflow as tf

# Set deterministic seeds for reproducibility.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Print TensorFlow version briefly.
print("TensorFlow version:", tf.__version__)

# Prepare a tiny synthetic text dataset.
texts = [
    "good movie and great acting",
    "bad movie and terrible acting",
    "excellent film with nice story",
    "awful film with boring story",
    "loved the plot and characters",
    "hated the plot and characters",
    "wonderful experience and fun scenes",
    "horrible experience and dull scenes",
]

# Create simple binary sentiment labels.
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Convert lists to numpy arrays.
labels = np.array(labels, dtype=np.int32)

# Define a small TextVectorization layer.
max_tokens = 100
max_len = 8
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_len,
)

# Adapt vectorizer on the tiny corpus.
vectorizer.adapt(texts)

# Vectorize the text data.
text_ds = tf.constant(texts)
int_sequences = vectorizer(text_ds)

# Validate shapes before building datasets.
assert int_sequences.shape[0] == labels.shape[0]

# Build a small tf.data.Dataset.
batch_size = 4
dataset = tf.data.Dataset.from_tensor_slices((int_sequences, labels))
dataset = dataset.shuffle(buffer_size=8, seed=seed_value)
dataset = dataset.batch(batch_size)

# Define a function to build an LSTM model.
def build_lstm_model(vocab_size, sequence_length):
    inputs = tf.keras.Input(shape=(sequence_length,))
    x = tf.keras.layers.Embedding(vocab_size, 16)(inputs)
    x = tf.keras.layers.LSTM(16)(x)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model

# Define a simple Transformer style encoder block.
def transformer_encoder_block(inputs, num_heads, key_dim):
    attn_output = tf.keras.layers.MultiHeadAttention(
        num_heads=num_heads,
        key_dim=key_dim,
    )(inputs, inputs)
    attn_output = tf.keras.layers.Dropout(0.1)(attn_output)
    out1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)(
        inputs + attn_output
    )
    ffn = tf.keras.Sequential(
        [
            tf.keras.layers.Dense(32, activation="relu"),
            tf.keras.layers.Dense(inputs.shape[-1]),
        ]
    )
    ffn_output = ffn(out1)
    ffn_output = tf.keras.layers.Dropout(0.1)(ffn_output)
    return tf.keras.layers.LayerNormalization(epsilon=1e-6)(
        out1 + ffn_output
    )

# Define a function to build a Transformer encoder model.
def build_transformer_model(vocab_size, sequence_length):
    inputs = tf.keras.Input(shape=(sequence_length,))
    x = tf.keras.layers.Embedding(vocab_size, 16)(inputs)
    x = transformer_encoder_block(x, num_heads=2, key_dim=8)
    x = tf.keras.layers.GlobalAveragePooling1D()(x)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model

# Get vocabulary size and sequence length.
vocab_size = max_tokens
sequence_length = max_len

# Build both models for comparison.
lstm_model = build_lstm_model(vocab_size, sequence_length)
transformer_model = build_transformer_model(vocab_size, sequence_length)

# Define a helper to train and time models.
# Note: the measured time includes one-time graph tracing overhead.
def train_and_time(model, dataset, epochs, name):
    start = time.perf_counter()
    history = model.fit(
        dataset,
        epochs=epochs,
        verbose=0,
    )
    end = time.perf_counter()
    losses = history.history["loss"]
    accs = history.history["accuracy"]
    return {
        "name": name,
        "time": end - start,
        "losses": losses,
        "accs": accs,
    }

# Train both models briefly for comparison.
results_lstm = train_and_time(lstm_model, dataset, epochs=5, name="LSTM")
results_trans = train_and_time(
    transformer_model,
    dataset,
    epochs=5,
    name="Transformer",
)

# Print a compact comparison of stability and speed.
print("\nModel comparison on tiny dataset:")
print("LSTM time seconds:", round(results_lstm["time"], 4))
print("Transformer time seconds:", round(results_trans["time"], 4))
print("LSTM losses:", [round(x, 4) for x in results_lstm["losses"]])
print("Transformer losses:", [round(x, 4) for x in results_trans["losses"]])
print("LSTM accuracies:", [round(x, 3) for x in results_lstm["accs"]])
print("Transformer accuracies:", [round(x, 3) for x in results_trans["accs"]])

# Show which model trained faster on this run.
faster = (
    "LSTM" if results_lstm["time"] < results_trans["time"] else "Transformer"
)
print("Faster model on this tiny example:", faster)




### **3.2. Overfitting Across Architectures**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_B/image_03_02.jpg?v=1769441214" width="250">



>* RNNs compress each sequence into a fixed-size state, which can limit how much they memorize
>* Transformers can overfit faster on small datasets due to higher capacity

>* Overfitting patterns depend on task and length
>* Compare train versus validation curves across architectures

>* Compare models while varying size, regularization, training
>* Use validation gaps to match capacity to data
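
The validation gap mentioned above can be read directly from training curves. This sketch assumes a dictionary shaped like Keras's `History.history` when a model is compiled with `metrics=["accuracy"]` and trained with validation data. The numbers are invented for illustration, and the helper names are hypothetical.

```python
def generalization_gaps(history):
    """Per-epoch training accuracy minus validation accuracy."""
    return [train - val
            for train, val in zip(history["accuracy"], history["val_accuracy"])]


def best_epoch(history):
    """Zero-based index of the epoch with the lowest validation loss."""
    val_losses = history["val_loss"]
    return min(range(len(val_losses)), key=lambda i: val_losses[i])


# Made-up curves shaped like the dict in a Keras `History.history`.
fake_history = {
    "accuracy": [0.62, 0.78, 0.90, 0.97],
    "val_accuracy": [0.60, 0.74, 0.77, 0.76],
    "val_loss": [0.66, 0.54, 0.50, 0.57],
}

gaps = generalization_gaps(fake_history)
print("gaps:", [round(g, 2) for g in gaps])
print("best epoch:", best_epoch(fake_history))
```

A gap that keeps widening while validation loss starts rising is the usual sign of overfitting. Comparing where that happens for each architecture helps match model capacity to the available data.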



In [None]:
#@title Python Code - Overfitting Across Architectures

# This script compares overfitting across architectures.
# It trains tiny GRU and Transformer text classifiers.
# It prints training and validation accuracy for both.

# !pip install tensorflow==2.20.0

# Import required standard libraries.
import os
import random
import numpy as np

# Import TensorFlow and Keras submodules.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Set deterministic seeds for reproducibility.
seed_value = 42
random.seed(seed_value)
np.random.seed(seed_value)
tf.random.set_seed(seed_value)

# Print TensorFlow version in one short line.
print("TensorFlow version:", tf.__version__)

# Load IMDB dataset with limited vocabulary.
# Each review is already encoded as a list of integer token ids.
(sequences_train, labels_train), (sequences_test, labels_test) = (
    keras.datasets.imdb.load_data(num_words=5000)
)

# Use small subsets for quick demonstration.
train_samples = 4000
test_samples = 2000
x_train = sequences_train[:train_samples]
y_train = labels_train[:train_samples]

# Slice test data subset.
x_test = sequences_test[:test_samples]
y_test = labels_test[:test_samples]

# Pad sequences to fixed maximum length.
max_len = 150
x_train = keras.preprocessing.sequence.pad_sequences(
    x_train, maxlen=max_len
)

# Pad test sequences similarly.
x_test = keras.preprocessing.sequence.pad_sequences(
    x_test, maxlen=max_len
)

# Validate shapes before building models.
assert x_train.shape == (train_samples, max_len)
assert x_test.shape == (test_samples, max_len)

# Create a simple GRU based classifier.
def build_gru_model(vocab_size, embed_dim, units):
    inputs = keras.Input(shape=(max_len,))
    x = layers.Embedding(vocab_size, embed_dim)(inputs)
    x = layers.GRU(units, dropout=0.2)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model

# Create a tiny Transformer style encoder classifier.
# No positional encoding is added, so token order is ignored here.
def build_transformer_model(vocab_size, embed_dim, heads):
    inputs = keras.Input(shape=(max_len,))
    x = layers.Embedding(vocab_size, embed_dim)(inputs)
    attn_output = layers.MultiHeadAttention(
        num_heads=heads, key_dim=embed_dim
    )(x, x)
    x = layers.Add()([x, attn_output])
    x = layers.LayerNormalization()(x)
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model

# Define shared training configuration.
batch_size = 64
epochs = 4

# Build GRU model with modest capacity.
gru_model = build_gru_model(vocab_size=5000, embed_dim=32, units=32)

# Train GRU model silently with validation split.
hist_gru = gru_model.fit(
    x_train,
    y_train,
    batch_size=batch_size,
    epochs=epochs,
    validation_split=0.2,
    verbose=0,
)

# Build Transformer style model with higher capacity.
transformer_model = build_transformer_model(
    vocab_size=5000, embed_dim=64, heads=2
)

# Train Transformer model silently with validation split.
hist_trans = transformer_model.fit(
    x_train,
    y_train,
    batch_size=batch_size,
    epochs=epochs,
    validation_split=0.2,
    verbose=0,
)

# Helper function to fetch last epoch metrics.
def last_metrics(history):
    acc = history.history["accuracy"][-1]
    val_acc = history.history["val_accuracy"][-1]
    return float(acc), float(val_acc)

# Compute final training and validation accuracy.
gru_acc, gru_val_acc = last_metrics(hist_gru)
trans_acc, trans_val_acc = last_metrics(hist_trans)

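# Print concise comparison of overfitting behavior.
print("GRU train acc:", round(gru_acc, 3))
print("GRU val acc:", round(gru_val_acc, 3))
print("GRU train-val gap:", round(gru_acc - gru_val_acc, 3))
print("Transformer train acc:", round(trans_acc, 3))
print("Transformer val acc:", round(trans_val_acc, 3))
print("Transformer train-val gap:", round(trans_acc - trans_val_acc, 3))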
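
Last-epoch accuracy alone does not show when a model started to overfit. One hedged way to locate that point is to find the epoch with the lowest validation loss and count how many epochs were trained past it. The sketch below works on a plain list of losses. The values are made up for illustration, and `best_epoch` and `epochs_past_best` are hypothetical helper names.

```python
def best_epoch(val_losses):
    """Return the 0-based index of the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)


def epochs_past_best(val_losses):
    """Count epochs trained after validation loss bottomed out."""
    return len(val_losses) - 1 - best_epoch(val_losses)


# Made-up validation losses for illustration only.
hypothetical_val_loss = [0.62, 0.48, 0.45, 0.51, 0.58]

best = best_epoch(hypothetical_val_loss)
wasted = epochs_past_best(hypothetical_val_loss)
print("Best epoch (0-based):", best)
print("Epochs past best:", wasted)
```

With a real run that uses validation data, the list would come from `history.history["val_loss"]`. Keras also provides the `keras.callbacks.EarlyStopping` callback, which can stop training automatically once the monitored metric stops improving.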




### **3.3. Choosing Architectures for Tasks**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Master TensorFlow 2.20.0/Module_07/Lecture_B/image_03_03.jpg?v=1769441333" width="250">



>* Match model choice to text length and output type
>* Balance transformer benefits against compute and resources

>* Local cues suit either recurrent models or transformers
>* Long-range dependencies often favor transformers; streaming inputs often favor recurrent models

>* Compare models using metrics and practical constraints
>* Balance accuracy with latency, resources, and deployment needs
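
The compute trade-off can be made concrete with a back-of-envelope estimate. Per layer, self-attention scales roughly as n² · d and a recurrent layer roughly as n · d², where n is the sequence length and d is the hidden size. The sketch below compares the two under those assumptions. It ignores constant factors such as gate counts and projection layers, so treat the numbers as orders of magnitude only; the function names are hypothetical.

```python
def self_attention_ops(seq_len, dim):
    """Rough per-layer cost of self-attention: O(n^2 * d)."""
    return seq_len * seq_len * dim


def recurrent_ops(seq_len, dim):
    """Rough per-layer cost of a recurrent layer: O(n * d^2)."""
    return seq_len * dim * dim


dim = 64  # assumed hidden size, for illustration only
for seq_len in (32, 150, 1024):
    attn = self_attention_ops(seq_len, dim)
    rnn = recurrent_ops(seq_len, dim)
    print(f"n={seq_len:5d}  attention~{attn:>12,}  recurrent~{rnn:>12,}")
```

Under this estimate the two costs cross where n equals d. The estimate leaves out parallelism: a recurrent layer must process the n steps in order, while attention can process all positions at once, so wall-clock time on a GPU may not follow the operation counts.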



# <font color="#418FDE" size="6.5" uppercase>**Sequence Models**</font>


In this lecture, you learned to:
- Build LSTM or GRU-based sequence models for text classification using tf.keras. 
- Implement a simple transformer-style encoder block using Keras layers for sequence modeling. 
- Evaluate and compare the performance of different sequence architectures on an NLP task. 

In the next Module (Module 8), we will go over 'Distributed Training'.