<a href="https://colab.research.google.com/github/profliuhao/CSIT599/blob/main/CSIT599_module5_exercise1_seq2seq_attention_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Exercise 1: Sequence-to-Sequence Models with and without Attention

Student Name: ____________________

## English to German Translation

Learning Objectives:
1. Build encoder-decoder architecture using LSTM (WITHOUT attention)
2. Build encoder-decoder architecture using LSTM (WITH attention)
3. Compare performance using Loss, Accuracy, and BLEU score

Instructions:
- Fill in the blanks marked with \_\_\_BLANK___
- The exercise is divided into THREE PARTS:
- Part A: Model WITHOUT Attention
- Part B: Model WITH Attention
- Part C: Comparison with metrics
- Each blank is a simple parameter, function name, or dimension
- Run the code to see the comparison results






In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import urllib.request
import zipfile
import os
import re
from collections import Counter
from tqdm.notebook import tqdm

# Set random seeds for reproducibility across runs
np.random.seed(42)
tf.random.set_seed(42)

print(f"TensorFlow version: {tf.__version__}")
print(f"GPU Available: {len(tf.config.list_physical_devices('GPU')) > 0}")


TensorFlow version: 2.19.0
GPU Available: True


In [None]:
# ==============================================================================
# HYPERPARAMETERS: These are constant values that configure the model and training process.
# ==============================================================================

BATCH_SIZE = 64        # Number of samples processed before the model is updated
EMBEDDING_DIM = 512    # Dimension of the dense embedding for each word
LSTM_UNITS = 256       # Number of hidden units in the LSTM layers. This determines the capacity of the model.
EPOCHS = 20            # Number of complete passes through the training dataset
MAX_VOCAB_SIZE = 10000 # Maximum number of unique words to consider for the vocabulary
MAX_LENGTH = 20        # Maximum sequence length for both input and target sentences. Sentences longer than this will be truncated, shorter ones will be padded.


In [None]:
# ==============================================================================
# PART 0: DATA LOADING AND PREPROCESSING (SHARED BY BOTH MODELS)
# This section handles downloading, cleaning, and structuring the raw text data.
# ==============================================================================

def download_data():
    """Download the English-German translation dataset from a specified URL."""
    url = "http://www.manythings.org/anki/deu-eng.zip"
    filename = "deu-eng.zip"

    # Check if the unzipped data file 'deu.txt' already exists to avoid re-downloading
    if not os.path.exists("deu.txt"):
        print("Downloading dataset...")
        # Add User-Agent header to avoid 406 Not Acceptable error from some servers
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req) as response, open(filename, 'wb') as out_file:
            out_file.write(response.read()) # Read data from the URL and write to a local file

        # Unzip the downloaded file
        with zipfile.ZipFile(filename, 'r') as zip_ref:
            zip_ref.extractall() # Extract all contents to the current directory
        os.remove(filename) # Remove the zip file after extraction to save space
        print("Download complete!")
    else:
        print("Dataset already exists.")

def preprocess_sentence(sentence):
    """Lowercase the sentence, add spaces around punctuation, and add <start>/<end> tokens."""
    sentence = sentence.lower().strip() # Convert to lowercase and remove leading/trailing whitespace
    # Add space between words and punctuation to ensure tokenization treats punctuation as separate tokens
    sentence = re.sub(r"([?.!,])", r" \1 ", sentence)
    sentence = re.sub(r'[" "]+', " ", sentence) # Collapse runs of spaces into a single space (the character class also matches double quotes, so those are dropped too)
    sentence = sentence.strip() # Strip again after adding spaces for cleanliness
    # Add special tokens to mark the beginning and end of a sentence, crucial for sequence models
    sentence = '<start> ' + sentence + ' <end>'
    return sentence

def load_dataset(num_examples=10000):
    """Load and preprocess the dataset from 'deu.txt'."""
    download_data() # Ensure the dataset is downloaded

    # Read the file content
    with open('deu.txt', 'r', encoding='utf-8') as f:
        lines = f.read().strip().split('\n') # Read all lines and split them into a list

    # Parse English-German pairs
    pairs = []
    for line in lines[:num_examples]: # Iterate through a specified number of examples
        parts = line.split('\t') # Split each line by tab character (English and German are tab-separated)
        if len(parts) >= 2:
            eng = preprocess_sentence(parts[0]) # Preprocess English sentence
            deu = preprocess_sentence(parts[1]) # Preprocess German sentence
            # Filter sentences by maximum length to keep training manageable and consistent
            if len(eng.split()) <= MAX_LENGTH and len(deu.split()) <= MAX_LENGTH:
                pairs.append([eng, deu]) # Add valid pairs to the list

    print(f"Loaded {len(pairs)} sentence pairs")
    return zip(*pairs)  # Transposes the list of pairs into two tuples: (english_sentences, german_sentences)

# Load data into input_texts (English) and target_texts (German)
print("\n" + "="*70)
print("LOADING DATA")
print("="*70)

input_texts, target_texts = load_dataset(num_examples=20000) # Load 20,000 sentence pairs
input_texts = list(input_texts) # Convert iterators to lists for easier manipulation
target_texts = list(target_texts)

print(f"\nExample sentence pair:")
print(f"English: {input_texts[0]}")
print(f"German:  {target_texts[0]}")



LOADING DATA
Dataset already exists.
Loaded 20000 sentence pairs

Example sentence pair:
English: <start> go . <end>
German:  <start> geh . <end>
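
To see what the two regular expressions in `preprocess_sentence` do, the sketch below repeats the function on its own and runs it on two made-up sentences. The sample sentences are assumptions for illustration and are not taken from the dataset.

```python
import re


def preprocess_sentence(sentence):
    """Same steps as the notebook function, repeated here so the example runs alone."""
    sentence = sentence.lower().strip()
    # Put spaces around ? . ! , so each becomes its own whitespace-separated token
    sentence = re.sub(r"([?.!,])", r" \1 ", sentence)
    # Collapse runs of spaces; the character class also matches double quotes
    sentence = re.sub(r'[" "]+', " ", sentence)
    sentence = sentence.strip()
    return "<start> " + sentence + " <end>"


plain = preprocess_sentence("Hi, how are you?")
quoted = preprocess_sentence('He said "no".')

print(plain)   # <start> hi , how are you ? <end>
print(quoted)  # <start> he said no . <end>
```

Because every token ends up separated by a single space, `len(sentence.split())` is the token count that `load_dataset` compares against `MAX_LENGTH`.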


In [None]:
# ==============================================================================
# TOKENIZATION: Converting text sentences into numerical sequences for model input.
# ==============================================================================

# Create tokenizers for English (input) and German (target) languages
input_tokenizer = keras.preprocessing.text.Tokenizer(
    num_words=MAX_VOCAB_SIZE, # Limit the vocabulary size to the most frequent words, improving efficiency and reducing noise.
    filters='!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n',  # Characters to filter out. Keeping '<' and '>' for special tokens like <start> and <end>.
    oov_token='<UNK>' # Token for out-of-vocabulary words. Words not in the top MAX_VOCAB_SIZE will be replaced with this.
)

target_tokenizer = keras.preprocessing.text.Tokenizer(
    num_words=MAX_VOCAB_SIZE, # Limit the vocabulary size for target language to manage complexity.
    filters='!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n',
    oov_token='<UNK>'
)

# Fit tokenizers on the data to build the vocabulary index (mapping words to integers)
input_tokenizer.fit_on_texts(input_texts)
target_tokenizer.fit_on_texts(target_texts)

# Convert text sentences to sequences of integer tokens. Each word is replaced by its corresponding integer ID.
input_sequences = input_tokenizer.texts_to_sequences(input_texts)
target_sequences = target_tokenizer.texts_to_sequences(target_texts)

# BLANK: What type of padding should we use? ('pre' or 'post')
# Hint: We typically use 'post' padding for seq2seq models. This means adding zeros at the end of the sequences.
# With 'post' padding the real tokens stay left-aligned, so <start> is always at position 0 and the one-step shift between decoder input and decoder target lines up.
input_sequences = keras.preprocessing.sequence.pad_sequences(
    input_sequences,
    maxlen=MAX_LENGTH, # Pad or truncate sequences to this maximum length, ensuring all sequences have the same dimension.
    padding=___BLANK_______   # Padding with zeros at the end of the sequence. This is standard for encoder-decoder models.
)

target_sequences = keras.preprocessing.sequence.pad_sequences(
    target_sequences,
    maxlen=MAX_LENGTH,
    padding=___BLANK_______   # Same as above, padding targets with zeros at the end to maintain consistent input shapes for the model.
)

# Vocabulary sizes: +1 for the 0-th index used for padding. The 0 index is reserved for padding, so actual words start from 1.
input_vocab_size = len(input_tokenizer.word_index) + 1 # Total unique words in the input vocabulary plus padding index.
target_vocab_size = len(target_tokenizer.word_index) + 1 # Total unique words in the target vocabulary plus padding index.

print(f"\nVocabulary sizes:")
print(f"English: {input_vocab_size}")
print(f"German: {target_vocab_size}")
print(f"Input shape: {input_sequences.shape}") # Expected shape: (number_of_samples, MAX_LENGTH)
print(f"Target shape: {target_sequences.shape}") # Expected shape: (number_of_samples, MAX_LENGTH)

# Split data into training and validation sets (90/10 split).
split_idx = int(0.9 * len(input_sequences))
train_input = input_sequences[:split_idx] # 90% for training
train_target = target_sequences[:split_idx]
val_input = input_sequences[split_idx:]     # Remaining 10% for validation
val_target = target_sequences[split_idx:]

# Keep the original text for BLEU score calculation, as BLEU operates on raw text.
# This allows for a direct comparison with human-readable translations.
train_input_texts = input_texts[:split_idx]
train_target_texts = target_texts[:split_idx]
val_input_texts = input_texts[split_idx:]
val_target_texts = target_texts[split_idx:]

print(f"\nTraining samples: {len(train_input)}")
print(f"Validation samples: {len(val_input)}")



Vocabulary sizes:
English: 3528
German: 5676
Input shape: (20000, 20)
Target shape: (20000, 20)

Training samples: 18000
Validation samples: 2000
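
The difference between the two padding modes is easiest to see on a tiny example. The following is a pure-Python sketch, not the Keras implementation; the helper name `pad` is hypothetical. It assumes the Keras default of truncating from the front (`truncating='pre'`) when a sequence is longer than `maxlen`.

```python
def pad(sequences, maxlen, padding="post", value=0):
    """Pad (or truncate) each sequence of token IDs to exactly `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep the last `maxlen` tokens, like truncating='pre'
        fill = [value] * (maxlen - len(seq))
        padded.append(seq + fill if padding == "post" else fill + seq)
    return padded


token_ids = [[1, 7, 2], [1, 2]]  # made-up IDs, e.g. 1 = <start>, 2 = <end>

post = pad(token_ids, maxlen=4, padding="post")
pre = pad(token_ids, maxlen=4, padding="pre")

print(post)  # [[1, 7, 2, 0], [1, 2, 0, 0]]
print(pre)   # [[0, 1, 7, 2], [0, 0, 1, 2]]
```

With post padding the first column always holds the first real token, which is what the later slicing into decoder input and decoder target relies on.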


In [None]:
# ==============================================================================
# EVALUATION METRICS: Functions to assess the performance of the translation models.
# ==============================================================================

def calculate_accuracy(predictions, targets):
    """
    Calculate token-level accuracy, ignoring padding tokens. This measures how many predicted words match the ground truth, excluding padding zeros.

    Args:
        predictions: Tensor of shape [batch_size, seq_len, vocab_size]. These are the raw output logits from the decoder for each time step.
        targets: Tensor of shape [batch_size, seq_len]. These are the ground truth target sequences (integer token IDs).

    Returns:
        accuracy: float, the average token accuracy over the batch, ignoring padding tokens.
    """
    # Get predicted token IDs by taking the argmax along the vocabulary dimension.
    # This selects the token with the highest probability/logit for each time step in the output sequence.
    # BLANK: Which axis should we take argmax over to get predicted tokens?
    # Hint: We want the highest scoring token in the vocabulary (last dimension).
    predicted_ids = tf.argmax(predictions, axis=___BLANK_______)  # -1 refers to the last dimension (vocabulary size) to get the most probable token ID.

    # Create a mask to identify non-padding tokens. Padding tokens are typically represented by 0 and should not contribute to accuracy.
    mask = tf.math.not_equal(targets, 0) # Returns a boolean tensor where True indicates a non-padding token.

    # Compare the predicted token IDs with the actual target token IDs.
    # `tf.equal` returns a boolean tensor where True indicates a match between prediction and target.
    matches = tf.equal(predicted_ids, targets)

    # Apply the mask: only consider matches for non-padding tokens. `tf.boolean_mask` filters out values where the mask is False.
    matches = tf.boolean_mask(matches, mask)

    # Calculate the mean of the boolean matches (True=1, False=0) to get the accuracy.
    accuracy = tf.reduce_mean(tf.cast(matches, tf.float32)) # Cast boolean to float (True=1.0, False=0.0) and compute the average.

    return accuracy.numpy() # Convert TensorFlow tensor to a NumPy float for easier handling.

def calculate_bleu_score(reference, candidate):
    """
    Calculate BLEU score (simple implementation). BLEU (Bilingual Evaluation Understudy) measures how similar the machine translation is to human translation.
    Score ranges from 0 (worst) to 1 (perfect match).

    Args:
        reference: A string representing the reference translation (ground truth, human-generated).
        candidate: A string representing the model's generated translation.

    Returns:
        bleu_score: float, the calculated BLEU score.
    """
    # Tokenize the reference and candidate sentences into lists of words.
    reference_tokens = reference.lower().split()
    candidate_tokens = candidate.lower().split()

    # Remove special start and end tokens as they are not part of the actual translation content and shouldn't affect BLEU.
    reference_tokens = [t for t in reference_tokens if t not in ['<start>', '<end>']]
    candidate_tokens = [t for t in candidate_tokens if t not in ['<start>', '<end>']]

    if len(candidate_tokens) == 0:
        return 0.0 # Return 0 if the candidate translation is empty to avoid division by zero.

    # Count word occurrences for both reference and candidate to calculate precision.
    reference_counts = Counter(reference_tokens) # Dictionary-like object to count token frequencies
    candidate_counts = Counter(candidate_tokens)

    # Calculate the number of 'clipped' matches: sum of minimum counts for each token
    # present in both candidate and reference. This prevents over-counting common words and giving unfair high precision.
    matches = sum((min(candidate_counts[token], reference_counts[token])
                   for token in candidate_counts)) # Sum of minimum counts for tokens present in candidate.

    # Calculate precision: ratio of matched tokens to the total number of tokens in the candidate translation.
    precision = matches / len(candidate_tokens)  # Divide by the total number of candidate tokens to get the proportion of correctly translated words.

    # Apply a brevity penalty if the candidate translation is shorter than the reference.
    # This penalizes models that generate very short, high-precision translations which might miss information.
    if len(candidate_tokens) < len(reference_tokens):
        brevity_penalty = np.exp(1 - len(reference_tokens) / len(candidate_tokens)) # Exponential penalty for brevity
    else:
        brevity_penalty = 1.0 # No penalty if candidate is equal or longer than reference.

    # The simplified score used in this exercise is clipped unigram precision multiplied by the brevity penalty.
    # Full BLEU instead combines clipped 1-gram to 4-gram precisions with a geometric mean and is usually computed over a whole corpus.
    bleu_score = brevity_penalty * precision

    return bleu_score
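
A small worked example may help when checking both metrics by hand. The sketch below mirrors the two functions in NumPy and plain Python so it runs without TensorFlow; the function names `masked_accuracy` and `simple_bleu` and all sample values are assumptions made for illustration.

```python
import math
from collections import Counter

import numpy as np


def masked_accuracy(logits, targets):
    """Token accuracy over positions whose target ID is not 0 (padding)."""
    predicted = np.argmax(logits, axis=-1)
    mask = targets != 0
    return float(np.mean(predicted[mask] == targets[mask]))


def simple_bleu(reference_tokens, candidate_tokens):
    """Clipped unigram precision multiplied by the brevity penalty."""
    if not candidate_tokens:
        return 0.0
    ref, cand = Counter(reference_tokens), Counter(candidate_tokens)
    matches = sum(min(count, ref[token]) for token, count in cand.items())
    precision = matches / len(candidate_tokens)
    if len(candidate_tokens) < len(reference_tokens):
        penalty = math.exp(1 - len(reference_tokens) / len(candidate_tokens))
    else:
        penalty = 1.0
    return penalty * precision


# One sequence with two real tokens (2, 3) followed by two padding positions.
targets = np.array([[2, 3, 0, 0]])
# One-hot "logits" whose argmax is [2, 1, 0, 0]: the first real token is right, the second is wrong.
logits = np.eye(4)[[2, 1, 0, 0]][None, ...]
accuracy = masked_accuracy(logits, targets)

# Candidate is one token shorter than the reference: precision 3/3, penalty exp(1 - 4/3).
short = simple_bleu("ich bin müde .".split(), "ich bin .".split())
# Clipping: "the" appears once in the reference, so only one of the three copies counts.
repeated = simple_bleu("the cat".split(), "the the the".split())

print(accuracy, round(short, 4), round(repeated, 4))
```

Without the mask the accuracy in this example would be 0.75, because the two padding positions would count as correct predictions.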



## PART A: SEQ2SEQ MODEL WITHOUT ATTENTION


### A.1: ENCODER (for model WITHOUT attention)

In [None]:
# ==============================================================================
# PART A.1: ENCODER (for model WITHOUT attention)
# This Encoder processes the input sequence and produces a context vector.
# ==============================================================================

class Encoder_NoAttention(keras.Model):
    """
    Encoder using LSTM (for model WITHOUT attention).
    This encoder processes the input sequence (e.g., English sentence)
    and converts it into a fixed-size context vector (the final LSTM states)
    which represents the meaning of the input sequence. This context vector is then passed to the decoder.
    """
    def __init__(self, vocab_size, embedding_dim, lstm_units):
        super(Encoder_NoAttention, self).__init__()

        # Embedding layer: Converts input token IDs into dense vectors of fixed size.
        # input_dim: The size of the vocabulary (number of unique words).
        # output_dim: The dimension of the dense embedding. Words with similar meanings
        #             will have similar embedding vectors, capturing semantic relationships.
        # mask_zero=True: Allows the embedding layer to handle padding tokens (typically 0)
        #                 by masking them out. This prevents padding from influencing the model's learning.
        # BLANK: What is the input dimension for the embedding layer?
        self.embedding = layers.Embedding(
            input_dim=_____BLANK______,  # The vocabulary size of the input language (e.g., English vocabulary size).
            output_dim=embedding_dim,
            mask_zero=True
        )

        # LSTM layer: Processes the embedded input sequence sequentially.
        # lstm_units: The number of units in the LSTM cell. This also determines
        #             the dimension of the hidden and cell states (state_h and state_c).
        # return_sequences=True: Ensures the LSTM returns the hidden states for each
        #                        time step in the input sequence. While not directly used by the decoder in the no-attention model,
        #                        it's often kept for consistency or if attention were to be added later.
        # return_state=True: Ensures the LSTM returns the final hidden state (state_h)
        #                    and final cell state (state_c) after processing the entire sequence.
        #                    These states collectively form the context vector for the decoder in the no-attention model.
        # BLANK: Should we return sequences? (True/False)
        # BLANK: Should we return state? (True/False)
        self.lstm = layers.LSTM(
            lstm_units,
            return_sequences=_____BLANK______,   # Return full sequence of outputs (hidden states at each timestep). While not used by no-attention decoder, it's a common LSTM setting.
            return_state=_____BLANK______,       # Return the last hidden state (state_h) and cell state (state_c), which summarize the input sequence.
            name='encoder_lstm_no_attention'
        )

    def call(self, x):
        # Input x: batch of integer sequences (e.g., [batch_size, MAX_LENGTH]). Each integer represents a word ID.

        # 1. Embed the input sequence
        # The embedding layer converts token IDs into dense vectors.
        # Output x: [batch_size, MAX_LENGTH, embedding_dim]
        x = self.embedding(x)

        # 2. Pass the embedded sequence through the LSTM layer
        # encoder_outputs: Hidden states for each time step [batch_size, MAX_LENGTH, lstm_units].
        # state_h: Final hidden state of the LSTM [batch_size, lstm_units]. This is the 'memory' of the encoder.
        # state_c: Final cell state of the LSTM [batch_size, lstm_units]. This is also part of the encoder's memory.
        # These final states (state_h, state_c) will be passed to the decoder as its initial states, providing context.
        encoder_outputs, state_h, state_c = self.lstm(x)

        return encoder_outputs, state_h, state_c
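
The relationship between the per-step outputs and the final states can be checked with a hand-written LSTM. This is a minimal NumPy sketch with random weights and no masking, not the Keras layer; the function name `lstm_forward` and the gate ordering (input, forget, candidate, output) are assumptions made for the illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_forward(x, W, U, b):
    """Unroll a single LSTM layer over x with shape [batch, steps, features]."""
    batch, steps, _ = x.shape
    units = U.shape[0]
    h = np.zeros((batch, units))  # hidden state
    c = np.zeros((batch, units))  # cell state
    outputs = []
    for t in range(steps):
        z = x[:, t] @ W + h @ U + b          # [batch, 4 * units]
        i, f, g, o = np.split(z, 4, axis=-1)  # input, forget, candidate, output
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    # All hidden states (what return_sequences exposes) plus the final states
    return np.stack(outputs, axis=1), h, c


rng = np.random.default_rng(0)
batch, steps, features, units = 2, 5, 8, 3
x = rng.normal(size=(batch, steps, features))
W = rng.normal(size=(features, 4 * units))
U = rng.normal(size=(units, 4 * units))
b = np.zeros(4 * units)

encoder_outputs, state_h, state_c = lstm_forward(x, W, U, b)
print(encoder_outputs.shape, state_h.shape, state_c.shape)  # (2, 5, 3) (2, 3) (2, 3)
```

In this unmasked sketch the last slice of `encoder_outputs` equals `state_h`. The no-attention decoder only ever sees `state_h` and `state_c`, however long the input sentence is.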


### A.2: DECODER WITHOUT ATTENTION

In [None]:
class Decoder_NoAttention(keras.Model):
    """Basic Decoder WITHOUT attention. This decoder generates the target sequence one token at a time, conditioned on the encoder's final state.
    """
    def __init__(self, vocab_size, embedding_dim, lstm_units):
        super(Decoder_NoAttention, self).__init__()

        # Embedding layer for the target language (e.g., German).
        # Converts output token IDs (from previous decoding steps or <start> token) into dense vectors.
        self.embedding = layers.Embedding(
            input_dim=vocab_size, # Vocabulary size of the target language (e.g., German vocabulary size).
            output_dim=embedding_dim,
            mask_zero=True # Mask padding tokens, ensuring they don't affect embedding lookups or subsequent computations.
        )

        # LSTM layer for the decoder. It takes the embedded input token and
        # the previous hidden state (or encoder's final state) to generate
        # the next hidden state. It essentially learns to generate the target sequence given a context.
        # return_sequences=True: Important for generating predictions for each time step during training.
        #                        During inference, it would generate one prediction per step.
        # return_state=True: Returns the final hidden and cell states, which are then fed back
        #                    into the next time step of the LSTM during inference (when decoding one token at a time),
        #                    or maintain state during training across the sequence.
        self.lstm = layers.LSTM(
            lstm_units,
            return_sequences=True,  # Return output for each time step (essential for predicting the entire sequence during training).
            return_state=True,      # Return the last hidden and cell state (for internal loop and next prediction in inference, or for consistency).
            name='decoder_lstm_no_attention'
        )

        # Dense layer: Maps the LSTM's output (hidden states) to a probability distribution over the target vocabulary.
        # The output dimension should match the vocabulary size to predict the next word.
        # It produces logits, which are then typically passed through a softmax function (handled by the loss function).
        # BLANK: What should be the output dimension?
        self.dense = layers.Dense(
            _____BLANK______,  # Output dimension should be the size of the target vocabulary to predict probabilities for every possible word.
            name='output_dense_no_attention'
        )

    def call(self, x, initial_state):
        # Input x: batch of integer sequences [batch_size, seq_len]. During training seq_len is MAX_LENGTH - 1, because the target sequence is shifted by one position.
        # initial_state: A list containing [state_h, state_c] from the encoder's final states, providing the initial context.

        # 1. Embed the input sequence (target tokens for current decoding step)
        # Converts the integer IDs of the target tokens into dense vector representations.
        # Output x: [batch_size, seq_len, embedding_dim]
        x = self.embedding(x)

        # 2. Pass the embedded input through the LSTM with initial states from the encoder.
        # lstm_output: Hidden states for each time step [batch_size, seq_len, lstm_units]. These are the inputs to the dense layer.
        # The final hidden and cell states are discarded here because training processes the whole shifted target sequence in one call.
        # Token-by-token inference would need them returned so they can seed the next decoding step.
        lstm_output, _, _ = self.lstm(x, initial_state=initial_state)

        # 3. Apply the dense layer to transform LSTM outputs into vocabulary logits.
        # The dense layer projects the LSTM's hidden states to the vocabulary space.
        # outputs: [batch_size, seq_len, vocab_size] - raw logits before softmax. The values represent unnormalized log-probabilities for each word in the vocabulary.
        outputs = self.dense(lstm_output)

        return outputs
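
At inference time there is no target sentence to feed in, so the decoder has to consume its own previous prediction. The sketch below shows that loop in plain Python. The names `greedy_decode` and `toy_step` are hypothetical, and `toy_step` is a scripted stand-in for a real decoder step that would return vocabulary logits and an updated LSTM state.

```python
def greedy_decode(step_fn, initial_state, start_id, end_id, max_len):
    """Generate token IDs one at a time, always picking the highest-scoring token."""
    token, state, output = start_id, initial_state, []
    for _ in range(max_len):
        logits, state = step_fn(token, state)
        token = max(range(len(logits)), key=logits.__getitem__)  # argmax over the vocabulary
        if token == end_id:
            break
        output.append(token)
    return output


SCRIPT = [3, 4, 2]  # made-up token IDs the toy decoder will emit; 2 plays the role of <end>


def toy_step(previous_token, state):
    """Stand-in decoder step: `state` is just a position counter into SCRIPT."""
    logits = [0.0] * 5
    logits[SCRIPT[state]] = 1.0
    return logits, state + 1


decoded = greedy_decode(toy_step, initial_state=0, start_id=1, end_id=2, max_len=10)
print(decoded)  # [3, 4]
```

With the real model, `initial_state` would be the encoder's `[state_h, state_c]` and `max_len` would typically be `MAX_LENGTH`.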


### A.3: BUILD MODEL WITHOUT ATTENTION

In [None]:
print("\nBuilding Seq2Seq model WITHOUT attention...")

encoder_no_attn = Encoder_NoAttention(input_vocab_size, EMBEDDING_DIM, LSTM_UNITS)
decoder_no_attn = Decoder_NoAttention(target_vocab_size, EMBEDDING_DIM, LSTM_UNITS)

print("✓ Encoder and Decoder (WITHOUT attention) created successfully")


Building Seq2Seq model WITHOUT attention...
✓ Encoder and Decoder (WITHOUT attention) created successfully


### A.4: TRAINING SETUP FOR MODEL WITHOUT ATTENTION


In [None]:
# Adam adapts the step size per parameter, which makes it a common default for deep learning models.
optimizer_no_attn = keras.optimizers.Adam(learning_rate=0.001, clipnorm=1.0)  # clipnorm=1.0 rescales the gradient of any weight whose L2 norm exceeds 1.0, which guards against exploding gradients in recurrent layers.

# Loss function: SparseCategoricalCrossentropy is suitable for integer-encoded targets.
# from_logits=True means the model outputs raw logits (unnormalized scores), and the loss function will apply softmax internally.
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def train_step_no_attention(input_batch, target_batch):
    """Performs a single training step for the model WITHOUT attention, including forward pass, loss calculation, and backward pass (gradient update)."""
    with tf.GradientTape() as tape: # tf.GradientTape records operations for automatic differentiation.
        # Encoder forward pass: Process the input sequence to get encoder outputs and final states.
        encoder_outputs, state_h, state_c = encoder_no_attn(input_batch)

        # Prepare decoder input and target sequences.
        # Decoder input drops the last position, so it begins with <start> (teacher forcing: the decoder is fed the true previous token).
        decoder_input = target_batch[:, :-1]
        # Decoder target drops the first position (<start>), so position t of the target is the token that should follow position t of the input.
        decoder_target = target_batch[:, 1:]

        # Decoder forward pass: Generate predictions using the decoder, conditioned on encoder states and previous target tokens.
        predictions = decoder_no_attn(decoder_input, [state_h, state_c])

        # Calculate loss: SparseCategoricalCrossentropy compares the predicted logits with the true target token IDs.
        # A mask is applied to ignore padding tokens (0s) in the loss calculation, so they don't penalize the model.
        mask = tf.math.not_equal(decoder_target, 0) # Create a mask to identify non-padding tokens.
        loss = loss_fn(decoder_target, predictions, sample_weight=mask) # Compute loss, weighted by the mask.

        # Calculate accuracy: Custom accuracy function to ignore padding tokens.
        accuracy = calculate_accuracy(predictions, decoder_target)

    # Get trainable variables from both encoder and decoder.
    trainable_vars = encoder_no_attn.trainable_variables + decoder_no_attn.trainable_variables

    # Calculate gradients of the loss with respect to all trainable variables.
    gradients = tape.gradient(loss, trainable_vars)

    # BLANK: What method updates the weights?
    # The optimizer applies the calculated gradients to update the model's weights, moving towards minimizing the loss.
    optimizer_no_attn._____BLANK______(zip(gradients, trainable_vars))  # Apply gradients to update the model's parameters.

    return loss, accuracy

def evaluate_no_attention(input_data, target_data):
    """Evaluates the model WITHOUT attention on a given dataset (e.g., validation set)."""
    total_loss = 0
    total_accuracy = 0
    num_batches = len(input_data) // BATCH_SIZE  # Integer division: samples left over after the last full batch are skipped.

    # Iterate through the dataset in batches for evaluation.
    for i in range(num_batches):
        start_idx = i * BATCH_SIZE
        end_idx = start_idx + BATCH_SIZE

        input_batch = input_data[start_idx:end_idx]
        target_batch = target_data[start_idx:end_idx]

        # Encoder forward pass (no gradient computation needed during evaluation).
        encoder_outputs, state_h, state_c = encoder_no_attn(input_batch)

        # Prepare decoder input and target, similar to training.
        decoder_input = target_batch[:, :-1]
        decoder_target = target_batch[:, 1:]

        # Decoder forward pass to get predictions.
        predictions = decoder_no_attn(decoder_input, [state_h, state_c])

        # Calculate loss and accuracy, ignoring padding.
        mask = tf.math.not_equal(decoder_target, 0)
        loss = loss_fn(decoder_target, predictions, sample_weight=mask)
        accuracy = calculate_accuracy(predictions, decoder_target)

        total_loss += loss
        total_accuracy += accuracy

    # Return average loss and accuracy over all evaluation batches.
    return total_loss / num_batches, total_accuracy / num_batches
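
The shift between decoder input and decoder target, and the effect of masking on the loss, can be traced on a single made-up sequence. This NumPy sketch is an illustration under stated assumptions: the token IDs are invented, and `masked_cross_entropy` is a hypothetical helper that averages over non-padding tokens, which is not necessarily the exact reduction Keras applies when given `sample_weight`.

```python
import numpy as np


def masked_cross_entropy(logits, targets):
    """Mean cross-entropy over positions whose target ID is not 0 (padding)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_loss = -np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    mask = (targets != 0).astype(log_probs.dtype)
    return float((token_loss * mask).sum() / mask.sum())


# Made-up IDs: 1 = <start>, 2 = <end>, 0 = padding
target_batch = np.array([[1, 5, 6, 2, 0, 0]])
decoder_input = target_batch[:, :-1]   # [[1, 5, 6, 2, 0]]
decoder_target = target_batch[:, 1:]   # [[5, 6, 2, 0, 0]]

vocab_size = 7
uniform_logits = np.zeros((1, decoder_target.shape[1], vocab_size))
loss = masked_cross_entropy(uniform_logits, decoder_target)

print(decoder_input.tolist(), decoder_target.tolist(), round(loss, 4))
```

With all-zero logits every token has probability 1/7, so the masked loss equals ln(7), about 1.9459, regardless of how much padding the batch contains.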


### A.5: TRAINING MODEL WITHOUT ATTENTION


In [None]:
print("\n" + "="*70)
print("TRAINING SEQ2SEQ WITHOUT ATTENTION")
print("="*70)

best_val_loss_no_attn = float('inf')
best_val_acc_no_attn = 0

for epoch in tqdm(range(EPOCHS)):
    # Shuffle training data
    indices = np.random.permutation(len(train_input))
    train_input_shuffled = train_input[indices]
    train_target_shuffled = train_target[indices]

    # Training
    num_batches = len(train_input) // BATCH_SIZE
    total_loss = 0
    total_accuracy = 0

    for i in range(num_batches):
        start_idx = i * BATCH_SIZE
        end_idx = start_idx + BATCH_SIZE

        input_batch = train_input_shuffled[start_idx:end_idx]
        target_batch = train_target_shuffled[start_idx:end_idx]

        loss, accuracy = train_step_no_attention(input_batch, target_batch)
        total_loss += loss
        total_accuracy += accuracy

    train_loss = total_loss / num_batches
    train_accuracy = total_accuracy / num_batches

    # Validation
    val_loss, val_accuracy = evaluate_no_attention(val_input, val_target)

    print(f'Epoch {epoch+1}/{EPOCHS} - Loss: {train_loss:.4f}, Acc: {train_accuracy:.4f} | '
          f'Val Loss: {val_loss:.4f}, Val Acc: {val_accuracy:.4f}')

    if val_loss < best_val_loss_no_attn:
        best_val_loss_no_attn = val_loss
        best_val_acc_no_attn = val_accuracy

print(f"\n✓ Training complete!")
print(f"  Best Validation Loss (NO ATTENTION): {best_val_loss_no_attn:.4f}")
print(f"  Best Validation Accuracy (NO ATTENTION): {best_val_acc_no_attn:.4f} ({best_val_acc_no_attn*100:.2f}%)")


TRAINING SEQ2SEQ WITHOUT ATTENTION


Epoch 1/20 - Loss: 3.9622, Acc: 0.3026 | Val Loss: 4.0332, Val Acc: 0.2714
Epoch 2/20 - Loss: 3.1873, Acc: 0.3854 | Val Loss: 3.6282, Val Acc: 0.3307
Epoch 3/20 - Loss: 2.7034, Acc: 0.4569 | Val Loss: 3.2428, Val Acc: 0.4023
Epoch 4/20 - Loss: 2.3468, Acc: 0.5134 | Val Loss: 3.0165, Val Acc: 0.4427
Epoch 5/20 - Loss: 2.0745, Acc: 0.5583 | Val Loss: 2.8932, Val Acc: 0.4720
Epoch 6/20 - Loss: 1.8479, Acc: 0.5909 | Val Loss: 2.7921, Val Acc: 0.4920
Epoch 7/20 - Loss: 1.6493, Acc: 0.6199 | Val Loss: 2.7235, Val Acc: 0.5035
Epoch 8/20 - Loss: 1.4714, Acc: 0.6491 | Val Loss: 2.6639, Val Acc: 0.5115
Epoch 9/20 - Loss: 1.3104, Acc: 0.6780 | Val Loss: 2.6121, Val Acc: 0.5255
Epoch 10/20 - Loss: 1.1625, Acc: 0.7068 | Val Loss: 2.5802, Val Acc: 0.5335
Epoch 11/20 - Loss: 1.0288, Acc: 0.7333 | Val Loss: 2.5746, Val Acc: 0.5389
Epoch 12/20 - Loss: 0.9051, Acc: 0.7607 | Val Loss: 2.5592, Val Acc: 0.5466
Epoch 13/20 - Loss: 0.7960, Acc: 0.7852 | Val Loss: 2.5203, Val Acc: 0.5512
Epoch 14/20 - Loss: 0
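
The training loop reshuffles the data each epoch and then walks through it in fixed-size slices. The sketch below isolates that pattern as a generator. The name `iterate_batches` is hypothetical, and it uses `np.random.default_rng` rather than the notebook's `np.random.permutation` so the example carries its own seed.

```python
import numpy as np


def iterate_batches(inputs, targets, batch_size, rng):
    """Yield shuffled (input, target) batches; a final partial batch is skipped."""
    indices = rng.permutation(len(inputs))
    num_batches = len(inputs) // batch_size
    for i in range(num_batches):
        batch_idx = indices[i * batch_size:(i + 1) * batch_size]
        # Indexing both arrays with the same indices keeps each pair aligned
        yield inputs[batch_idx], targets[batch_idx]


rng = np.random.default_rng(0)
inputs = np.arange(10).reshape(10, 1)
targets = inputs * 10  # toy targets, so alignment is easy to verify
batches = list(iterate_batches(inputs, targets, batch_size=4, rng=rng))

# Batch counts for the sample sizes printed earlier in the notebook
train_batches, train_skipped = divmod(18000, 64)
val_batches, val_skipped = divmod(2000, 64)
print(len(batches), train_batches, train_skipped, val_batches, val_skipped)
```

With `BATCH_SIZE = 64` this gives 281 training batches and 31 validation batches, with 16 samples left out of each pass. For training the skipped samples change every epoch because of the shuffle; for validation they are always the last 16.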

## PART B: SEQ2SEQ MODEL WITH ATTENTION

### B.1: ENCODER (for model WITH attention)


In [None]:
class Encoder_WithAttention(keras.Model):
    """Encoder using LSTM (for model WITH attention). This encoder is similar to the no-attention encoder but its outputs (all hidden states) are used by the attention mechanism.
    """
    def __init__(self, vocab_size, embedding_dim, lstm_units):
        super(Encoder_WithAttention, self).__init__()

        # Embedding layer: Converts input token IDs into dense vectors.
        self.embedding = layers.Embedding(
            input_dim=vocab_size,
            output_dim=embedding_dim,
            mask_zero=True # Mask padding tokens so they don't interfere with computations.
        )

        # LSTM layer: Processes the embedded input sequence.
        # return_sequences=True is crucial here, as the attention mechanism needs access to the hidden states at ALL timesteps of the encoder.
        # return_state=True provides the final hidden and cell states, which are typically used as the initial state for the decoder LSTM.
        self.lstm = layers.LSTM(
            lstm_units,
            return_sequences=True, # MUST return sequences for attention mechanism to work.
            return_state=True,     # Return final hidden and cell states for decoder initialization.
            name='encoder_lstm_with_attention'
        )

    def call(self, x):
        # Input x: batch of integer sequences [batch_size, MAX_LENGTH]

        # 1. Embed the input sequence
        # Output x: [batch_size, MAX_LENGTH, embedding_dim]
        x = self.embedding(x)

        # 2. Pass the embedded sequence through the LSTM layer
        # encoder_outputs: All hidden states for each time step [batch_size, MAX_LENGTH, lstm_units]. These are used by attention.
        # state_h: Final hidden state of the LSTM [batch_size, lstm_units].
        # state_c: Final cell state of the LSTM [batch_size, lstm_units].
        # state_h and state_c are typically used to initialize the decoder's LSTM state.
        encoder_outputs, state_h, state_c = self.lstm(x)

        return encoder_outputs, state_h, state_c
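
The difference between the per-timestep outputs and the final state can be checked on a toy example. The sketch below is only an illustration: it uses a plain tanh RNN in NumPy as a stand-in for the LSTM, with no masking, and all sizes and weight names (`W_x`, `W_h`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, embedding_dim, units = 2, 5, 4, 3

# Stand-in for the embedded source batch: [batch_size, seq_len, embedding_dim]
x = rng.normal(size=(batch_size, seq_len, embedding_dim))

# Toy recurrent weights (a plain tanh RNN, not an LSTM; hypothetical names)
W_x = rng.normal(size=(embedding_dim, units))
W_h = rng.normal(size=(units, units))

h = np.zeros((batch_size, units))
all_states = []
for t in range(seq_len):
    # One recurrent step: the new state depends on the current input and the previous state
    h = np.tanh(x[:, t, :] @ W_x + h @ W_h)
    all_states.append(h)

# Analogue of return_sequences=True: one hidden state per timestep
encoder_outputs = np.stack(all_states, axis=1)
# Analogue of return_state=True: only the state after the last timestep
final_state = h

print(encoder_outputs.shape)  # (2, 5, 3)
print(final_state.shape)      # (2, 3)
print(np.allclose(encoder_outputs[:, -1, :], final_state))  # True
```

Attention reads all of `encoder_outputs`, while the decoder's initial state only needs `final_state`.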


### B.2: ATTENTION MECHANISM


In [None]:
class BahdanauAttention(keras.layers.Layer):
    """Bahdanau Attention (Additive Attention). This mechanism calculates a weighted sum of encoder outputs based on the current decoder state.
    """
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        # W1 and W2 are dense layers used to transform the encoder outputs and decoder hidden state
        # into a common dimension before calculating the alignment scores.
        self.W1 = layers.Dense(units)  # W1 and W2 project to the same size ('units') so that their outputs can be summed.
        self.W2 = layers.Dense(units)

        # V is a dense layer that transforms the combined energy into a single score, which is then used for softmax.
        # BLANK: What should be the output dimension of V?
        self.V = layers.Dense(_____BLANK______)  # The output dimension is 1, as we want a single scalar score for each encoder output.

    def call(self, decoder_hidden, encoder_outputs):
        # decoder_hidden: Current hidden state of the decoder [batch_size, lstm_units]
        # encoder_outputs: All hidden states from the encoder [batch_size, MAX_LENGTH, lstm_units]

        # We need to expand the decoder_hidden state to be able to broadcast and sum it with encoder_outputs.
        decoder_hidden_with_time = tf.expand_dims(decoder_hidden, axis=1)  # Expand along axis 1 to get shape [batch_size, 1, lstm_units].

        # Calculate the 'energy' or alignment score. This measures how well each encoder output matches the current decoder hidden state.
        # tf.nn.tanh is used as the activation function, introducing non-linearity.
        # BLANK: What activation function?
        energy = tf.nn._____BLANK______(  # The hyperbolic tangent activation function is commonly used in Bahdanau attention to introduce non-linearity.
            self.W1(encoder_outputs) + self.W2(decoder_hidden_with_time)
        )

        # V layer then converts this energy into attention scores.
        attention_scores = self.V(energy) # Shape: [batch_size, MAX_LENGTH, 1]

        # Calculate attention weights by applying softmax to the attention scores.
        # Softmax makes the weights over the encoder timesteps sum to 1 for each sample in the batch.
        # BLANK: Along which axis should we apply softmax?
        attention_weights = tf.nn.softmax(attention_scores, axis=_____BLANK______)  # Apply softmax across the sequence length (axis=1) to get weights for each encoder output.

        # Calculate the context vector by taking a weighted sum of the encoder outputs.
        # This vector captures the most relevant information from the source sequence for the current decoding step.
        context_vector = tf.reduce_sum(
            attention_weights * encoder_outputs, # Element-wise multiplication of weights and encoder outputs.
            axis=1  # Sum along the sequence length (axis=1) to get a single context vector for each sample in the batch.
        ) # Shape: [batch_size, lstm_units]

        # Squeeze the attention weights to remove the last dimension (which was 1).
        attention_weights = tf.squeeze(attention_weights, axis=-1) # Shape: [batch_size, MAX_LENGTH]

        return context_vector, attention_weights
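
The shape bookkeeping of additive attention can be traced in plain NumPy. This is a minimal sketch under stated assumptions, not the layer above: the function `additive_attention` and the matrices `W1`, `W2`, `v` are hypothetical stand-ins for the Dense layers, and bias terms are omitted.

```python
import numpy as np

def additive_attention(decoder_hidden, encoder_outputs, W1, W2, v):
    """Toy additive attention: score_t = v^T tanh(W1 h_enc_t + W2 h_dec)."""
    # decoder_hidden: [batch, units] -> [batch, 1, units] so it broadcasts over time
    query = decoder_hidden[:, None, :]
    energy = np.tanh(encoder_outputs @ W1 + query @ W2)   # [batch, T, attn_units]
    scores = energy @ v                                    # [batch, T, 1]
    # Softmax over the time axis (axis=1), shifted for numerical stability
    scores = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(scores)
    weights = exp_scores / exp_scores.sum(axis=1, keepdims=True)
    # Weighted sum of encoder states over time
    context = (weights * encoder_outputs).sum(axis=1)      # [batch, units]
    return context, weights[..., 0]

rng = np.random.default_rng(0)
batch, T, units, attn_units = 2, 4, 3, 3
encoder_outputs = rng.normal(size=(batch, T, units))
decoder_hidden = rng.normal(size=(batch, units))
W1 = rng.normal(size=(units, attn_units))
W2 = rng.normal(size=(units, attn_units))
v = rng.normal(size=(attn_units, 1))

context, weights = additive_attention(decoder_hidden, encoder_outputs, W1, W2, v)
print(context.shape)                         # (2, 3)
print(weights.shape)                         # (2, 4)
print(np.allclose(weights.sum(axis=1), 1.0)) # True
```

Because the weights are non-negative and sum to 1 over time, each context vector is a convex combination of that sentence's encoder states.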


### B.3: DECODER WITH ATTENTION


In [None]:
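class Decoder_WithAttention_Vectorized(keras.Model):
    """Decoder with a single attention pass. The context vector is computed once from the encoder's final hidden state and reused at every decoder timestep, so the whole target sequence goes through the LSTM in one call.
    """
    def __init__(self, vocab_size, embedding_dim, lstm_units):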
        super(Decoder_WithAttention_Vectorized, self).__init__()

        self.embedding = layers.Embedding(
            input_dim=vocab_size,
            output_dim=embedding_dim,
            mask_zero=True
        )

        # Vectorized attention components
        self.attention_W1 = layers.Dense(lstm_units)
        self.attention_W2 = layers.Dense(lstm_units)
        self.attention_V = layers.Dense(1)

        self.lstm = layers.LSTM(
            lstm_units,
            return_sequences=True,
            return_state=True
        )

        self.dense = layers.Dense(vocab_size)

    def call(self, x, encoder_outputs, initial_state):
        # Embed input
        x = self.embedding(x)  # [batch, seq_len, embedding_dim]

        state_h, state_c = initial_state

        # Calculate attention once, using the encoder's final hidden state as the query (it is not recomputed per decoder step)
        state_h_expanded = tf.expand_dims(state_h, 1)  # [batch, 1, lstm_units]

        # Vectorized attention calculation
        energy = tf.nn.tanh(
            self.attention_W1(encoder_outputs) +
            self.attention_W2(state_h_expanded)
        )
        attention_scores = self.attention_V(energy)
        attention_weights = tf.nn.softmax(attention_scores, axis=1)
        context_vector = tf.reduce_sum(
            attention_weights * encoder_outputs,
            axis=1
        )

        # Tile context for all decoder timesteps
        seq_len = tf.shape(x)[1]
        context_vector = tf.expand_dims(context_vector, 1)
        context_vector = tf.tile(context_vector, [1, seq_len, 1])

        # Concatenate and process (like no-attention decoder)
        lstm_input = tf.concat([x, context_vector], axis=-1)
        lstm_output, _, _ = self.lstm(lstm_input, initial_state=[state_h, state_c])
        outputs = self.dense(lstm_output)

        return outputs


class Decoder_WithAttention(keras.Model):
    """Decoder WITH Bahdanau Attention. This decoder uses the attention mechanism to dynamically focus on different parts of the source sentence while generating the target sequence.
    """
    def __init__(self, vocab_size, embedding_dim, lstm_units):
        super(Decoder_WithAttention, self).__init__()

        # Embedding layer for the target language. Converts target token IDs to dense vectors.
        self.embedding = layers.Embedding(
            input_dim=vocab_size,
            output_dim=embedding_dim,
            mask_zero=True # Mask padding tokens.
        )

        # Initialize the Bahdanau Attention layer. This layer will calculate context vectors.
        self.attention = BahdanauAttention(lstm_units)

        # LSTM layer for the decoder. It takes the embedded input token and
        # the context vector (from attention) along with its previous hidden state.
        # return_sequences=True keeps the time axis, so each single-step call in the loop returns
        # [batch_size, 1, lstm_units] and the per-step outputs can be concatenated along axis 1.
        self.lstm = layers.LSTM(
            lstm_units,
            return_sequences=True,  # Keep the time axis so per-step outputs can be concatenated after the loop.
            return_state=True,      # Return the final hidden and cell states to be used in the next decoding step.
            name='decoder_lstm_with_attention'
        )

        # Dense layer: Maps the LSTM's output to a probability distribution over the target vocabulary.
        self.dense = layers.Dense(vocab_size, name='output_dense_with_attention')

    def call(self, x, encoder_outputs, initial_state):
        # x: Input batch of target sequences [batch_size, seq_len]. During training this is the target sequence without its last token, so seq_len = MAX_LENGTH - 1.
        # encoder_outputs: All hidden states from the encoder [batch_size, MAX_LENGTH, lstm_units]. Used by attention.
        # initial_state: A list containing [state_h, state_c] from the encoder's final states, used to initialize the decoder LSTM.

        # 1. Embed the input sequence (current target tokens for decoding).
        x = self.embedding(x) # Shape: [batch_size, seq_len, embedding_dim]

        # Initialize decoder's hidden and cell states with the encoder's final states.
        state_h, state_c = initial_state

        # List to store outputs (predictions) for each time step.
        outputs = []

        # Loop through the sequence length to decode one token at a time.
        # During training, `x.shape[1]` is `MAX_LENGTH - 1` because target_batch[:, :-1] is used for decoder input.
        for t in range(x.shape[1]):
            # Take the input for the current time step (t).
            input_t = x[:, t:t+1, :] # Shape: [batch_size, 1, embedding_dim]

            # Calculate the context vector using the attention mechanism.
            # The attention mechanism uses the decoder's current hidden state (state_h) and all encoder outputs.
            context_vector, _ = self.attention(state_h, encoder_outputs) # context_vector shape: [batch_size, lstm_units]
            context_vector = tf.expand_dims(context_vector, axis=1) # Expand context_vector to [batch_size, 1, lstm_units] to concatenate with input_t.

            # Concatenate the embedded input token with the context vector.
            # This combines the information about the previous token with the relevant parts of the source sentence.
            lstm_input = tf.concat([input_t, context_vector], axis=-1) # Shape: [batch_size, 1, embedding_dim + lstm_units]

            # Pass the combined input through the decoder LSTM.
            # The initial_state for this LSTM call is the (state_h, state_c) from the previous time step.
            output_t, state_h, state_c = self.lstm(
                lstm_input,
                initial_state=[state_h, state_c] # Pass the updated states from the previous step.
            ) # output_t shape: [batch_size, 1, lstm_units], state_h/c shape: [batch_size, lstm_units]

            outputs.append(output_t) # Collect the LSTM output for this time step.

        # Concatenate all time step outputs to form the full sequence of LSTM outputs.
        outputs = tf.concat(outputs, axis=1) # Shape: [batch_size, seq_len, lstm_units], where seq_len = x.shape[1]

        # Apply the dense layer to get the final predictions (logits) for each token in the vocabulary.
        outputs = self.dense(outputs) # Shape: [batch_size, seq_len, vocab_size]

        return outputs
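
The two decoders differ in how many context vectors they compute: `Decoder_WithAttention_Vectorized` computes one per sentence and tiles it across time, while `Decoder_WithAttention` recomputes it at every step. The NumPy sketch below illustrates only the tiling and concatenation shapes; the sizes and random arrays are hypothetical placeholders.

```python
import numpy as np

batch_size, seq_len, embedding_dim, units = 2, 4, 5, 3
rng = np.random.default_rng(1)

# Stand-ins for the embedded decoder inputs and a single context vector per sentence
x = rng.normal(size=(batch_size, seq_len, embedding_dim))
context = rng.normal(size=(batch_size, units))

# Add a time axis, then repeat the same context at every decoder timestep
tiled = np.tile(context[:, None, :], (1, seq_len, 1))   # [batch, seq_len, units]

# Each timestep's LSTM input is its embedding followed by the (shared) context
lstm_input = np.concatenate([x, tiled], axis=-1)

print(lstm_input.shape)  # (2, 4, 8)
same_context = all(
    np.array_equal(tiled[:, 0, :], tiled[:, t, :]) for t in range(seq_len)
)
print(same_context)  # True
```

With a tiled context, every target position sees the same summary of the source sentence; the loop-based decoder trades speed for a context that changes with the decoder state.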


### B.4: BUILD MODEL WITH ATTENTION


In [None]:
print("\nBuilding Seq2Seq model WITH attention...")

encoder_attn = Encoder_WithAttention(input_vocab_size, EMBEDDING_DIM, LSTM_UNITS)
decoder_attn = Decoder_WithAttention_Vectorized(target_vocab_size, EMBEDDING_DIM, LSTM_UNITS)

print("✓ Encoder, Attention, and Decoder (WITH attention) created successfully")



Building Seq2Seq model WITH attention...
✓ Encoder, Attention, and Decoder (WITH attention) created successfully


### B.5: TRAINING SETUP FOR MODEL WITH ATTENTION

In [None]:
optimizer_attn = keras.optimizers.Adam(learning_rate=0.001, clipnorm=1.0) # Adam with per-gradient norm clipping (clipnorm=1.0) to guard against exploding gradients in the recurrent layers.

# Loss function: SparseCategoricalCrossentropy is suitable for integer-encoded targets.
# from_logits=True means the model outputs raw logits (unnormalized scores), and the loss function will apply softmax internally.
loss_fn_attn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def train_step_with_attention(input_batch, target_batch):
    """Single training step for model WITH attention, including forward pass, loss calculation, and backpropagation."""
    with tf.GradientTape() as tape: # Record operations for automatic differentiation.
        # Encoder forward pass: Processes the input sequence to generate encoder outputs and final states.
        # encoder_outputs are crucial here as they will be used by the attention mechanism in the decoder.
        encoder_outputs, state_h, state_c = encoder_attn(input_batch)

        # Prepare decoder input and target sequences (teacher forcing).
        # Decoder input is the target sequence without its last position, so it starts with the <start> token.
        decoder_input = target_batch[:, :-1]
        # Decoder target is the same sequence without its first position, so position t holds the token to predict after reading input position t.
        decoder_target = target_batch[:, 1:]

        # Decoder forward pass: Generates predictions using the decoder, which now incorporates attention.
        # It takes the decoder input, all encoder outputs, and the initial LSTM states from the encoder.
        predictions = decoder_attn(decoder_input, encoder_outputs, [state_h, state_c])

        # Calculate loss, masking out padding tokens (0s) to ensure they don't contribute to the loss.
        mask = tf.math.not_equal(decoder_target, 0)
        loss = loss_fn_attn(decoder_target, predictions, sample_weight=mask)

        # Calculate token-level accuracy, also ignoring padding tokens.
        accuracy = calculate_accuracy(predictions, decoder_target)

    # Collect all trainable variables from both the encoder and decoder.
    trainable_vars = encoder_attn.trainable_variables + decoder_attn.trainable_variables
    # Compute gradients of the loss with respect to these trainable variables.
    gradients = tape.gradient(loss, trainable_vars)
    # Apply the calculated gradients to update the model's weights.
    optimizer_attn.apply_gradients(zip(gradients, trainable_vars))

    return loss, accuracy

def evaluate_with_attention(input_data, target_data):
    """Evaluates the model WITH attention on a given dataset (e.g., validation set), returning average loss and accuracy."""
    total_loss = 0
    total_accuracy = 0
    num_batches = len(input_data) // BATCH_SIZE

    # Iterate through the evaluation dataset in batches.
    for i in range(num_batches):
        start_idx = i * BATCH_SIZE
        end_idx = start_idx + BATCH_SIZE

        input_batch = input_data[start_idx:end_idx]
        target_batch = target_data[start_idx:end_idx]

        # Encoder forward pass (no gradient computation needed during evaluation).
        encoder_outputs, state_h, state_c = encoder_attn(input_batch)
        # Prepare decoder input and target, similar to training.
        decoder_input = target_batch[:, :-1]
        decoder_target = target_batch[:, 1:]
        # Decoder forward pass to get predictions.
        predictions = decoder_attn(decoder_input, encoder_outputs, [state_h, state_c])

        # Calculate loss and accuracy, ignoring padding tokens.
        mask = tf.math.not_equal(decoder_target, 0)
        loss = loss_fn_attn(decoder_target, predictions, sample_weight=mask)
        accuracy = calculate_accuracy(predictions, decoder_target)

        total_loss += loss
        total_accuracy += accuracy

    # Return the average loss and accuracy over all evaluation batches.
    return total_loss / num_batches, total_accuracy / num_batches
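
The input/target shift and the padding mask used in both functions above can be seen on a tiny batch. This is a sketch with made-up token IDs (0 = padding, 1 = `<start>`, 2 = `<end>`); `masked_accuracy` is a hypothetical helper written for this illustration and is not the notebook's `calculate_accuracy`.

```python
import numpy as np

# Hypothetical token IDs: 0 = padding, 1 = <start>, 2 = <end>
target_batch = np.array([
    [1, 7, 8, 9, 2, 0],
    [1, 5, 6, 2, 0, 0],
])

decoder_input = target_batch[:, :-1]   # what the decoder reads
decoder_target = target_batch[:, 1:]   # what it should predict at each position
mask = decoder_target != 0             # True only for real (non-padding) targets

def masked_accuracy(predicted_ids, target_ids):
    """Fraction of non-padding positions where the prediction matches the target."""
    valid = target_ids != 0
    correct = (predicted_ids == target_ids) & valid
    return correct.sum() / valid.sum()

# Made-up predictions: one wrong token in the first sentence
predicted_ids = np.array([
    [7, 8, 4, 2, 0],
    [5, 6, 2, 0, 0],
])

print(decoder_input.tolist())   # [[1, 7, 8, 9, 2], [1, 5, 6, 2, 0]]
print(decoder_target.tolist())  # [[7, 8, 9, 2, 0], [5, 6, 2, 0, 0]]
print(int(mask.sum()))          # 7
print(round(float(masked_accuracy(predicted_ids, decoder_target)), 4))  # 0.8571
```

Padding positions are excluded from the denominator, so trivially "correct" zeros at the end of short sentences do not inflate the score.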


### B.6: TRAINING MODEL WITH ATTENTION


In [None]:
print("\n" + "="*70)
print("TRAINING SEQ2SEQ WITH ATTENTION")
print("="*70)

best_val_loss_attn = float('inf') # Initialize best validation loss to infinity to ensure first epoch's loss is always better.
best_val_acc_attn = 0            # Initialize best validation accuracy to 0.

for epoch in tqdm(range(EPOCHS)):
    # Shuffle training data at the beginning of each epoch to ensure variety in batches.
    indices = np.random.permutation(len(train_input))
    train_input_shuffled = train_input[indices]
    train_target_shuffled = train_target[indices]

    num_batches = len(train_input) // BATCH_SIZE
    total_loss = 0
    total_accuracy = 0

    # Loop through batches for training.
    for i in range(num_batches):
        start_idx = i * BATCH_SIZE
        end_idx = start_idx + BATCH_SIZE

        input_batch = train_input_shuffled[start_idx:end_idx]
        target_batch = train_target_shuffled[start_idx:end_idx]

        # Perform a single training step for the attention model.
        loss, accuracy = train_step_with_attention(input_batch, target_batch)
        total_loss += loss
        total_accuracy += accuracy

    # Calculate average training loss and accuracy for the epoch.
    train_loss = total_loss / num_batches
    train_accuracy = total_accuracy / num_batches

    # Evaluate the model on the validation set after each epoch.
    val_loss, val_accuracy = evaluate_with_attention(val_input, val_target)

    # Print epoch-wise training and validation metrics.
    print(f'Epoch {epoch+1}/{EPOCHS} - Loss: {train_loss:.4f}, Acc: {train_accuracy:.4f} | '
          f'Val Loss: {val_loss:.4f}, Val Acc: {val_accuracy:.4f}')

    # Update best validation loss and accuracy if current epoch's validation loss is lower.
    if val_loss < best_val_loss_attn:
        best_val_loss_attn = val_loss
        best_val_acc_attn = val_accuracy

print(f"\n✓ Training complete!")
print(f"  Best Validation Loss (WITH ATTENTION): {best_val_loss_attn:.4f}")
print(f"  Best Validation Accuracy (WITH ATTENTION): {best_val_acc_attn:.4f} ({best_val_acc_attn*100:.2f}%)")



TRAINING SEQ2SEQ WITH ATTENTION



Epoch 1/20 - Loss: 1.1896, Acc: 0.2845 | Val Loss: 1.2773, Val Acc: 0.2712
Epoch 2/20 - Loss: 0.9222, Acc: 0.4068 | Val Loss: 1.0786, Val Acc: 0.3589
Epoch 3/20 - Loss: 0.7552, Acc: 0.4905 | Val Loss: 0.9604, Val Acc: 0.4328
Epoch 4/20 - Loss: 0.6464, Acc: 0.5449 | Val Loss: 0.8931, Val Acc: 0.4736
Epoch 5/20 - Loss: 0.5625, Acc: 0.5844 | Val Loss: 0.8445, Val Acc: 0.4961
Epoch 6/20 - Loss: 0.4919, Acc: 0.6189 | Val Loss: 0.8094, Val Acc: 0.5177
Epoch 7/20 - Loss: 0.4302, Acc: 0.6527 | Val Loss: 0.7906, Val Acc: 0.5287
Epoch 8/20 - Loss: 0.3748, Acc: 0.6879 | Val Loss: 0.7701, Val Acc: 0.5421
Epoch 9/20 - Loss: 0.3253, Acc: 0.7196 | Val Loss: 0.7510, Val Acc: 0.5511
Epoch 10/20 - Loss: 0.2819, Acc: 0.7490 | Val Loss: 0.7396, Val Acc: 0.5564
Epoch 11/20 - Loss: 0.2436, Acc: 0.7762 | Val Loss: 0.7304, Val Acc: 0.5629
Epoch 12/20 - Loss: 0.2109, Acc: 0.8003 | Val Loss: 0.7300, Val Acc: 0.5716
Epoch 13/20 - Loss: 0.1830, Acc: 0.8234 | Val Loss: 0.7209, Val Acc: 0.5768
Epoch 14/20 - Loss: 0

## PART C: COMPARISON AND TESTING


In [None]:
print(f"\nMetrics Comparison:")
print(f"{'Metric':<30} {'Without Attention':<20} {'With Attention':<20} {'Improvement':<15}")
print("-" * 85)
# Display validation loss for both models and calculate percentage improvement.
print(f"{'Validation Loss':<30} {best_val_loss_no_attn:<20.4f} {best_val_loss_attn:<20.4f} "
      f"{((best_val_loss_no_attn - best_val_loss_attn) / best_val_loss_no_attn * 100):>14.2f}%")
# Display validation accuracy for both models and calculate percentage improvement.
print(f"{'Validation Accuracy':<30} {best_val_acc_no_attn:<20.4f} {best_val_acc_attn:<20.4f} "
      f"{((best_val_acc_attn - best_val_acc_no_attn) / best_val_acc_no_attn * 100):>14.2f}%")

# ==============================================================================
# TRANSLATION FUNCTIONS: Helper functions to translate a given English sentence into German using the trained models.
# ==============================================================================

def translate_no_attention(sentence):
    """Translate using model WITHOUT attention. This function simulates inference by decoding token by token."""
    # Preprocess the input sentence (lowercase, add <start>/<end> tokens, handle punctuation).
    sentence = preprocess_sentence(sentence)
    # Convert the processed sentence to a sequence of token IDs.
    inputs = input_tokenizer.texts_to_sequences([sentence])
    # Pad the input sequence to MAX_LENGTH.
    inputs = keras.preprocessing.sequence.pad_sequences(inputs, maxlen=MAX_LENGTH, padding='post')

    # Get encoder's final hidden and cell states, which serve as the context for the decoder.
    encoder_outputs, state_h, state_c = encoder_no_attn(inputs)

    # Initialize the decoder's input with the <start> token.
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = target_tokenizer.word_index['<start>']

    decoded_sentence = [] # List to store the translated words.

    # Decode word by word up to MAX_LENGTH.
    for _ in range(MAX_LENGTH):
        # Get predictions from the decoder using the current target sequence and encoder's final states.
        predictions = decoder_no_attn(target_seq, [state_h, state_c])
        # Get the ID of the word with the highest probability (greedy decoding).
        predicted_id = tf.argmax(predictions[0, -1, :]).numpy()
        # Convert the predicted ID back to a word.
        predicted_word = target_tokenizer.index_word.get(predicted_id, '')

        # Stop decoding if the <end> token is predicted.
        if predicted_word == '<end>':
            break

        # Add the predicted word to the decoded sentence if it's not a special token.
        if predicted_word and predicted_word != '<start>':
            decoded_sentence.append(predicted_word)

        # Prepare the input for the next decoding step. The decoder is always called with the encoder's
        # final states, so it has to re-read the whole prefix: `target_seq` is rebuilt as `<start>` followed
        # by every word generated so far, and only the prediction at the last position is used.
        # This re-runs the decoder over a growing sequence at every step. Feeding only `predicted_id`
        # would be cheaper, but it would also require carrying the decoder's updated LSTM states between steps.
        target_seq = np.zeros((1, len(decoded_sentence) + 1))
        target_seq[0, 0] = target_tokenizer.word_index['<start>']
        for i, word in enumerate(decoded_sentence):
            target_seq[0, i + 1] = target_tokenizer.word_index.get(word, 0)

    return ' '.join(decoded_sentence)

def translate_with_attention(sentence):
    """Translate using model WITH attention. This function also simulates inference token by token, incorporating attention."""
    # Preprocess the input sentence.
    sentence = preprocess_sentence(sentence)
    # Convert to token sequences and pad.
    inputs = input_tokenizer.texts_to_sequences([sentence])
    inputs = keras.preprocessing.sequence.pad_sequences(inputs, maxlen=MAX_LENGTH, padding='post')

    # Get encoder outputs and initial decoder states. encoder_outputs will be used by the attention mechanism.
    encoder_outputs, state_h, state_c = encoder_attn(inputs)

    # Initialize decoder's input with the <start> token.
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = target_tokenizer.word_index['<start>']

    decoded_sentence = [] # List to store the translated words.

    # Decode word by word up to MAX_LENGTH.
    for _ in range(MAX_LENGTH):
        # Get predictions from the attention-based decoder.
        # It takes the target prefix generated so far (starting with `<start>`),
        # all encoder outputs, and the encoder's final hidden/cell states.
        predictions = decoder_attn(target_seq, encoder_outputs, [state_h, state_c])
        # Get the ID of the most probable word.
        predicted_id = tf.argmax(predictions[0, -1, :]).numpy()
        # Convert ID back to word.
        predicted_word = target_tokenizer.index_word.get(predicted_id, '')

        # Stop if <end> token is predicted.
        if predicted_word == '<end>':
            break

        # Add valid predicted words.
        if predicted_word and predicted_word != '<start>':
            decoded_sentence.append(predicted_word)

        # As in `translate_no_attention`, rebuild `target_seq` as `<start>` followed by all words generated so far.
        # Feeding only `predicted_id` at each step would also require carrying the decoder's LSTM states forward,
        # which this decoder's `call` does not return.
        target_seq = np.zeros((1, len(decoded_sentence) + 1))
        target_seq[0, 0] = target_tokenizer.word_index['<start>']
        for i, word in enumerate(decoded_sentence):
            target_seq[0, i + 1] = target_tokenizer.word_index.get(word, 0)

    return ' '.join(decoded_sentence)

# ==============================================================================
# TRANSLATION EXAMPLES WITH BLEU SCORES
# ==============================================================================

print("\n" + "="*70)
print("TRANSLATION EXAMPLES WITH REFERENCE TRANSLATIONS")
print("="*70)

# Select a few indices from the validation set to test translation.
test_indices = [1, 11, 21, 31, 41]

print("\nNote: BLEU score ranges from 0 (worst) to 1 (perfect match)")
print("Higher BLEU score = better translation quality\n")

total_bleu_no_attn = 0 # Accumulator for BLEU scores of the no-attention model.
total_bleu_attn = 0    # Accumulator for BLEU scores of the attention model.

# Iterate through the selected test sentences.
for idx in test_indices:
    if idx < len(val_input_texts):
        # Retrieve the original English sentence and its reference German translation.
        english_sentence = val_input_texts[idx]
        reference_german = val_target_texts[idx]

        # Get translations from both models.
        translation_no_attn = translate_no_attention(english_sentence)
        translation_attn = translate_with_attention(english_sentence)

        # Calculate BLEU score for each translation against the reference.
        bleu_no_attn = calculate_bleu_score(reference_german, translation_no_attn)
        bleu_attn = calculate_bleu_score(reference_german, translation_attn)

        total_bleu_no_attn += bleu_no_attn
        total_bleu_attn += bleu_attn

        # Print the results for each example.
        print(f"Example {idx+1}:")
        print(f"  English:             {english_sentence}")
        print(f"  Reference German:    {reference_german}")
        print(f"  Without Attention:   {translation_no_attn} (BLEU: {bleu_no_attn:.4f})")
        print(f"  With Attention:      {translation_attn} (BLEU: {bleu_attn:.4f})")
        print()

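# Calculate and print the average BLEU scores across the test examples that were actually evaluated.
# Indices beyond the validation set are skipped in the loop above, so they must not count in the average.
num_evaluated = sum(1 for idx in test_indices if idx < len(val_input_texts))
avg_bleu_no_attn = total_bleu_no_attn / max(num_evaluated, 1)
avg_bleu_attn = total_bleu_attn / max(num_evaluated, 1)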

print("="*70)
print("FINAL COMPARISON SUMMARY")
print("="*70)
print(f"\n{'Metric':<30} {'Without Attention':<20} {'With Attention':<20}")
print("-" * 70)
# Final summary of validation loss, accuracy, and average BLEU scores.
print(f"{'Validation Loss':<30} {best_val_loss_no_attn:<20.4f} {best_val_loss_attn:<20.4f}")
print(f"{'Validation Accuracy':<30} {f'{best_val_acc_no_attn*100:.2f}%':<20} {f'{best_val_acc_attn*100:.2f}%':<20}")
print(f"{'Average BLEU Score':<30} {avg_bleu_no_attn:<20.4f} {avg_bleu_attn:<20.4f}")
print("\n" + "="*70)
print("TRAINING COMPLETE!")
print("="*70)



Metrics Comparison:
Metric                         Without Attention    With Attention       Improvement    
-------------------------------------------------------------------------------------
Validation Loss                2.5106               0.7191                        71.36%
Validation Accuracy            0.5585               0.5774                         3.40%

TRANSLATION EXAMPLES WITH REFERENCE TRANSLATIONS

Note: BLEU score ranges from 0 (worst) to 1 (perfect match)
Higher BLEU score = better translation quality

Example 2:
  English:             <start> who is that guy ? <end>
  Reference German:    <start> wer ist dieser typ ? <end>
  Without Attention:   wer ist das hier (BLEU: 0.3894)
  With Attention:      wer ist dieser kerl (BLEU: 0.5841)

Example 12:
  English:             <start> who stabbed tom ? <end>
  Reference German:    <start> wer hat tom mit dem messer gestochen ? <end>
  Without Attention:   wer hat tom geküsst (BLEU: 0.2759)
  With Attention:      wer h