
# Goal & Overview

**Goal.** Demonstrate the *same* NLP model—**Embedding → LSTM → Dense (logits)**—implemented in both **PyTorch** and **TensorFlow/Keras**, trained and evaluated on the **exact same inputs**, for a clean apples-to-apples comparison.

**Task.** Tiny **binary sentiment** classification (1 = positive, 0 = negative) on a hand-made list of short movie-review–style sentences.

**What we keep identical across PyTorch and TensorFlow/Keras**

- **Data & preprocessing**
  - Tokenization: simple whitespace; all text lowercased
  - Vocabulary & ID mapping shared by both frameworks, with special tokens:
    - `<pad>` → 0 (padding)
    - `<unk>` → 1 (out-of-vocabulary)
  - Fixed sequence length `MAX_LEN` with **right-truncation** (drop extra tokens) and **right-padding** (append `<pad>`)
  - Deterministic train/validation split using a fixed random seed

- **Model architecture**
  - `Embedding(EMBED_DIM) → LSTM(HIDDEN_DIM) → Dense(NUM_CLASSES logits)` (no activation on the final layer; we use logits)

- **Training recipe**
  - Optimizer: **Adam** with learning rate **1e-3**
  - Loss: **cross-entropy on logits** (`from_logits=True` in Keras; `nn.CrossEntropyLoss` in PyTorch)
  - Epochs: small and identical in both
  - Batch size: **full-batch** (one update per epoch) for clarity and brevity

- **Inference**
  - Same sample sentences evaluated in both implementations
  - Report **softmax probabilities** `[P(neg), P(pos)]` and the predicted label
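
As a rough illustration of the loss and inference bullets above, the sketch below computes a softmax and a cross-entropy value directly from logits in plain Python. The helper names (`softmax`, `cross_entropy_from_logits`) and the example logits are made up for illustration; the frameworks compute the equivalent quantity internally, in a fused and more numerically careful form.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, label):
    """Negative log-probability of the true class, computed from logits."""
    return -math.log(softmax(logits)[label])

# Hypothetical logits for one sentence: [score_neg, score_pos]
logits = [-0.2, 0.3]
probs = softmax(logits)
pred = max(range(len(probs)), key=probs.__getitem__)  # argmax over classes

print("probs:", [round(p, 4) for p in probs], "pred:", pred)
print("loss if the true label is 1:", round(cross_entropy_from_logits(logits, 1), 4))

# Equal logits mean a 50/50 guess, which costs ln(2) for two classes.
print("chance-level loss:", round(cross_entropy_from_logits([0.0, 0.0], 0), 4))  # 0.6931
```

A training loss close to 0.693 therefore suggests that a two-class model is still near chance.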

> **Why these simplifications?**
> We intentionally **do not use masking** and we take the **final time step** (even if it corresponds to padding) so the two frameworks behave identically for teaching. In production, you’d typically enable masking (e.g., `mask_zero=True` in Keras) or use packed sequences in PyTorch to ignore padded positions.
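
To make the masking idea concrete, here is a small framework-free sketch of what "ignoring padded positions" amounts to: find the index of the last real token in each right-padded row, so a classifier could read the LSTM output there instead of at the final position. The function name `last_real_index` is illustrative, and the sketch assumes `PAD_ID = 0` never appears inside a sentence.

```python
PAD_ID = 0  # same padding ID convention as this notebook

def last_real_index(row, pad_id=PAD_ID):
    """Index of the last non-pad token in a right-padded row (0 if the row is all padding)."""
    length = sum(1 for tok in row if tok != pad_id)  # valid because padding is only on the right
    return max(length - 1, 0)

batch = [
    [27, 13, 29, 14, 0, 0],   # 4 real tokens
    [32, 13, 10, 0, 0, 0],    # 3 real tokens
    [31, 2, 30, 22, 28, 0],   # 5 real tokens
]
print([last_real_index(r) for r in batch])  # [3, 2, 4]
```

In PyTorch, indices like these could be used to gather one time step per row from the `(B, T, H)` LSTM output; Keras masking and PyTorch packed sequences reach a similar result inside the recurrent layer itself.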

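**Takeaway.** The two implementations share the same architecture and training recipe, so they are equivalent in modeling terms. Any small differences in loss or predictions come from framework defaults (weight initializers, the LSTM bias layout, the frozen `<pad>` embedding row in the PyTorch version) and numeric details, not from modeling choices.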
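
One concrete framework-level difference is the LSTM bias layout: PyTorch's `nn.LSTM` keeps two bias vectors per layer (input-hidden and hidden-hidden), while the Keras `LSTM` layer keeps a single one. The arithmetic below is a sketch of the resulting parameter counts, assuming `EMBED_DIM = 16` and `HIDDEN_DIM = 32` as set in this notebook; the other variable names are illustrative.

```python
EMBED_DIM, HIDDEN_DIM = 16, 32   # values used in this notebook

gates = 4  # input, forget, cell, output
weights = gates * HIDDEN_DIM * (EMBED_DIM + HIDDEN_DIM)  # input and recurrent kernels

pytorch_lstm = weights + 2 * gates * HIDDEN_DIM  # two bias vectors (bias_ih and bias_hh)
keras_lstm = weights + 1 * gates * HIDDEN_DIM    # one bias vector

print(pytorch_lstm, keras_lstm)  # 6400 6272
```

The two PyTorch biases are added together inside each gate, so the extra vector adds no modeling capacity; it only changes how the parameters are stored and initialized.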


In [None]:
# @title Shared setup: tiny dataset, vocab, encoding, splits
import numpy as np
from collections import Counter
import random

# --------------------------
# Reproducibility
# --------------------------
SEED = 1337
random.seed(SEED)
np.random.seed(SEED)

# --------------------------
# Tiny movie-sentiment dataset: (text, label) with 1=pos, 0=neg
# --------------------------
data = [
    ("i love this movie", 1),
    ("this film was great", 1),
    ("amazing acting and story", 1),
    ("highly recommend it", 1),
    ("i hate this movie", 0),
    ("this film was terrible", 0),
    ("boring plot and bad acting", 0),
    ("do not recommend it", 0),
    ("what a fantastic experience", 1),
    ("worst film ever", 0),
    ("best film ever", 1),
    ("what a waste of time", 0),
]

# ============================================================
# WORD-LEVEL VOCAB (detailed explanation)
# ------------------------------------------------------------
# Goal: Map each *word* to a stable integer ID that both frameworks will share.
# Steps:
#   1) Tokenize each sentence by whitespace (quick & deterministic).
#   2) Count word frequencies with Counter (not required, but useful if you later want cutoffs).
#   3) Create a vocabulary list (itos = index->token) that starts with special symbols:
#        <pad> : used to pad short sequences to MAX_LEN (index 0 here)
#        <unk> : used for any out-of-vocabulary word seen at inference time (index 1 here)
#      Then append all observed words in a sorted order for determinism.
#   4) Build the reverse map (stoi = token->index).
#
# Notes:
# - We choose **PAD_ID=0** and **UNK_ID=1**, which is common and convenient.
# - Sorting the tokens makes the mapping stable across runs (given fixed data).
# - In more realistic pipelines, you'd lowercase consistently (we do), and maybe strip punctuation.
# - For *strict parity* between frameworks, we’ll reuse these exact IDs everywhere.
# ============================================================
counter = Counter()
for text, _ in data:
    counter.update(text.lower().split())

# Counter will look like:
# {'this': 4, 'film': 4, 'i': 2, 'movie': 2, 'was': 2, 'acting': 2, 'and': 2,
#  'recommend': 2, 'it': 2, 'what': 2, 'a': 2, 'ever': 2,
#  'love': 1, 'great': 1, 'amazing': 1, 'story': 1, 'highly': 1, 'hate': 1,
#  'terrible': 1, 'boring': 1, 'plot': 1, 'bad': 1, 'do': 1, 'not': 1,
#  'fantastic': 1, 'experience': 1, 'worst': 1, 'best': 1,
#  'waste': 1, 'of': 1, 'time': 1}

PAD, UNK = "<pad>", "<unk>"
itos = [PAD, UNK] + sorted(counter.keys())   # index -> token
stoi = {w: i for i, w in enumerate(itos)}    # token -> index
PAD_ID, UNK_ID = stoi[PAD], stoi[UNK]
vocab_size = len(itos)

# ============================================================
# ENCODE TO FIXED LENGTH (detailed explanation)
# ------------------------------------------------------------
# Goal: Convert each sentence into a *fixed-length* vector of token IDs so that:
#   - Batches can be simple NumPy arrays / tensors (no ragged shapes).
#   - Both frameworks see identical inputs.
#
# Choices we make:
#   - MAX_LEN = 6: short enough for a toy demo.
#   - Truncation policy: keep the *first* MAX_LEN tokens (drop the rest).
#       This is often called "right-truncation".
#   - Padding policy: if a sentence has < MAX_LEN tokens, append PAD_ID to the right
#       ("right-padding") until length is exactly MAX_LEN.
#
# Consequences:
#   - The *last* position is often padding for short sentences. Since we take the output at
#     the last time step, that step may correspond to <pad>. This is OK here because:
#       * both frameworks do the exact same thing, so the comparison is fair, and
#       * the model can learn to treat PAD as "uninformative".
#   - In practice, you'd often use masking so the model ignores padded positions.
# ============================================================
MAX_LEN = 6

def encode(text):
    toks = text.lower().split()
    ids = [stoi.get(tok, UNK_ID) for tok in toks]     # map tokens to IDs, OOV -> UNK_ID
    ids = ids[:MAX_LEN]                               # truncate to MAX_LEN (right-truncation)
    ids = ids + [PAD_ID] * (MAX_LEN - len(ids))       # right-pad with PAD_ID to fixed length
    return ids


# Resulting itos (index → token), alphabetical after the two specials:
# 0: <pad>
# 1: <unk>
# 2: a
# 3: acting
# 4: amazing
# 5: and
# 6: bad
# 7: best
# 8: boring
# 9: do
# 10: ever
# 11: experience
# 12: fantastic
# 13: film
# 14: great
# 15: hate
# 16: highly
# 17: i
# 18: it
# 19: love
# 20: movie
# 21: not
# 22: of
# 23: plot
# 24: recommend
# 25: story
# 26: terrible
# 27: this
# 28: time
# 29: was
# 30: waste
# 31: what
# 32: worst

# stoi (token → index)
# {
#  '<pad>':0, '<unk>':1,
#  'a':2, 'acting':3, 'amazing':4, 'and':5, 'bad':6, 'best':7, 'boring':8, 'do':9,
#  'ever':10, 'experience':11, 'fantastic':12, 'film':13, 'great':14, 'hate':15,
#  'highly':16, 'i':17, 'it':18, 'love':19, 'movie':20, 'not':21, 'of':22, 'plot':23,
#  'recommend':24, 'story':25, 'terrible':26, 'this':27, 'time':28, 'was':29,
#  'waste':30, 'what':31, 'worst':32
# }

# IDs for specials and vocab size
# PAD_ID = 0
# UNK_ID = 1
# vocab_size = 33

# encode(t) with MAX_LEN = 6 (right-truncate, then right-pad with PAD_ID):
#   encode("this film was great")         -> [27, 13, 29, 14, 0, 0]
#   encode("boring plot and bad acting")  -> [8, 23, 5, 6, 3, 0]
#   encode("what a fantastic experience") -> [31, 2, 12, 11, 0, 0]
#   encode("i love this movie")           -> [17, 19, 27, 20, 0, 0]
#   encode("this film was terrible")      -> [27, 13, 29, 26, 0, 0]
#   encode("do not recommend it")         -> [9, 21, 24, 18, 0, 0]
#   encode("worst film ever")             -> [32, 13, 10, 0, 0, 0]
#   encode("best film ever")              -> [7, 13, 10, 0, 0, 0]
#   encode("what a waste of time")        -> [31, 2, 30, 22, 28, 0]

# Vectorize the whole dataset
X = np.array([encode(t) for t, _ in data], dtype=np.int32)  # int32: good for TF; will cast to long in Torch
y = np.array([lbl for _, lbl in data], dtype=np.int32)

# --------------------------
# Train/val split (75/25), with a fixed permutation for reproducibility
# --------------------------
perm = np.random.RandomState(SEED).permutation(len(X))
X, y = X[perm], y[perm]
split = int(0.75 * len(X))
X_train, y_train = X[:split], y[:split]
X_val,   y_val   = X[split:], y[split:]

# --------------------------
# Shared hyperparameters (kept small for speed & clarity)
# --------------------------
EMBED_DIM   = 16
HIDDEN_DIM  = 32
NUM_CLASSES = 2
EPOCHS = 3

# We'll do *full-batch* training to keep code tiny:
BATCH_SIZE_FULL = len(X_train)  # one update per epoch using all training examples

# Three sample sentences for end-of-notebook predictions
sample_texts = ["this movie was great", "this movie was boring", "i recommend this film"]

def encode_batch(texts):
    return np.array([encode(t) for t in texts], dtype=np.int32)

print(f"Vocab size: {vocab_size} | Train={len(X_train)} | Val={len(X_val)} | MAX_LEN={MAX_LEN}")
print("First encoded example:", X_train[0])

Vocab size: 33 | Train=9 | Val=3 | MAX_LEN=6
First encoded example: [27 13 29 26  0  0]
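
The cell above only encodes sentences whose words are all in the vocabulary and shorter than `MAX_LEN`. The following sketch exercises the two remaining paths of the same encoding rule, namely out-of-vocabulary words and truncation. It uses a deliberately tiny, hypothetical vocabulary (`toy_stoi`) and a renamed helper (`encode_toy`) so that it runs on its own; neither name exists in the notebook.

```python
PAD_ID, UNK_ID, MAX_LEN = 0, 1, 6

# Hypothetical mini-vocabulary, for illustration only (not the notebook's full stoi).
toy_stoi = {"<pad>": 0, "<unk>": 1, "film": 2, "great": 3, "this": 4, "was": 5}

def encode_toy(text, stoi=toy_stoi, max_len=MAX_LEN):
    """Lowercase, split on whitespace, map to IDs, then right-truncate and right-pad."""
    ids = [stoi.get(tok, UNK_ID) for tok in text.lower().split()]
    ids = ids[:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))

print(encode_toy("This film was GREAT"))                    # [4, 2, 5, 3, 0, 0]  (case-insensitive)
print(encode_toy("this sequel was great"))                  # [4, 1, 5, 3, 0, 0]  ("sequel" -> <unk>)
print(encode_toy("this film was great great great great"))  # [4, 2, 5, 3, 3, 3]  (7th token dropped)
```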


In [None]:
# @title PyTorch LSTM (tiny, full-batch)
import torch
import torch.nn as nn
torch.manual_seed(SEED)

# Convert numpy arrays → torch tensors
Xtr = torch.from_numpy(X_train).long()   # Embedding expects torch.long (int64)
ytr = torch.from_numpy(y_train).long()
Xva = torch.from_numpy(X_val).long()
yva = torch.from_numpy(y_val).long()

# --------------------------
# Model: Embedding → LSTM → Dense(logits)
# --------------------------
class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, pad_id):
        super().__init__()
        # padding_idx=pad_id keeps the PAD row frozen at zeros (not updated by training).
        # Note: In TF below we DON'T freeze PAD; the PAD embedding will be trainable there.
        # That tiny discrepancy is usually negligible for this demo. If you want exact parity,
        # remove padding_idx here so PAD is also trainable in PyTorch.
        self.emb  = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_id)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc   = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        x = self.emb(x)            # shape: (B, T, E)
        out, _ = self.lstm(x)      # shape: (B, T, H); we ignore hidden state tuple for simplicity
        last = out[:, -1, :]       # take the *final* time step (could be PAD; same choice in TF)
        return self.fc(last)       # logits (unnormalized scores), shape: (B, C)

model = LSTMClassifier(vocab_size, EMBED_DIM, HIDDEN_DIM, NUM_CLASSES, PAD_ID)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# ============================================================
# FULL-BATCH GRADIENT DESCENT (detailed explanation)
# ------------------------------------------------------------
# "Full-batch" means we use *all* training examples in one giant batch to compute:
#   1) A single forward pass producing logits for the entire training set.
#   2) A single scalar loss value (average over all examples).
#   3) A single backward pass computing gradients w.r.t. all model parameters.
#   4) One optimizer step that updates parameters using those gradients.
#
# Why do this here?
#   - Our dataset is tiny, so it's easy and compact to express.
#   - It reduces code (no DataLoader or loops over mini-batches).
#   - It ensures both frameworks take an equally simple path.
#
# Trade-offs (for real training):
#   - Full-batch can be slow and memory-heavy for large datasets.
#   - Mini-batches add stochasticity that can help generalization.
# ============================================================
for epoch in range(1, EPOCHS+1):
    model.train()
    opt.zero_grad()                 # clear old gradients
    logits = model(Xtr)             # forward pass on *all* training examples
    loss = loss_fn(logits, ytr)     # compute average cross-entropy loss over the batch
    loss.backward()                 # backprop: compute gradients dLoss/dParam
    opt.step()                      # update parameters (Adam optimizer)

    # Quick validation accuracy (no gradient tracking)
    model.eval()
    with torch.no_grad():
        val_logits = model(Xva)
        val_pred = val_logits.argmax(1)
        val_acc = (val_pred == yva).float().mean().item()
    print(f"[PyTorch] Epoch {epoch}/{EPOCHS}  train_loss={loss.item():.4f}  val_acc={val_acc:.3f}")

# Inference on sample sentences
with torch.no_grad():
    logits = model(torch.from_numpy(encode_batch(sample_texts)).long())
    probs = torch.softmax(logits, dim=1).cpu().numpy()
    preds = probs.argmax(1)

print("\n[PyTorch] Predictions:")
for t, p, pr in zip(sample_texts, preds, probs):
    print(f"{t!r} → {'pos' if p==1 else 'neg'}  probs={pr}")


[PyTorch] Epoch 1/3  train_loss=0.6908  val_acc=0.333
[PyTorch] Epoch 2/3  train_loss=0.6891  val_acc=0.333
[PyTorch] Epoch 3/3  train_loss=0.6874  val_acc=0.333

[PyTorch] Predictions:
'this movie was great' → pos  probs=[0.4464912 0.5535088]
'this movie was boring' → pos  probs=[0.4553566  0.54464334]
'i recommend this film' → pos  probs=[0.43842876 0.56157124]
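
The training loop above takes one full-batch update per epoch. For contrast with the mini-batch trade-off mentioned in its comments, here is a minimal sketch of how shuffled mini-batches could be drawn instead. The helper `minibatch_indices` and the batch size of 4 are assumptions for illustration and are not used anywhere in this notebook.

```python
import random

def minibatch_indices(n_examples, batch_size, seed):
    """Yield shuffled index lists that together cover range(n_examples) exactly once."""
    order = list(range(n_examples))
    random.Random(seed).shuffle(order)   # local RNG, leaves the global random state untouched
    for start in range(0, n_examples, batch_size):
        yield order[start:start + batch_size]

# 9 training examples, as in this notebook; batch size 4 is an arbitrary choice.
batches = list(minibatch_indices(n_examples=9, batch_size=4, seed=1337))
print(len(batches), [len(b) for b in batches])  # 3 [4, 4, 1]
```

With this scheme each epoch would perform three optimizer steps rather than one, and each step would see a different subset of the data.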


In [None]:
# @title TensorFlow/Keras LSTM (tiny, full-batch)
import tensorflow as tf
tf.random.set_seed(SEED)

# Same architecture: Embedding → LSTM → Dense(logits)
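tf_model = tf.keras.Sequential([
    # `input_length` is deprecated in Keras 3; an explicit Input layer fixes the shape instead.
    tf.keras.Input(shape=(MAX_LEN,), dtype="int32"),
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=EMBED_DIM),
    tf.keras.layers.LSTM(HIDDEN_DIM),
    tf.keras.layers.Dense(NUM_CLASSES)  # logits (no activation)
])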

tf_model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"]
)

# ============================================================
# FULL-BATCH GRADIENT DESCENT IN KERAS (what happens under the hood)
# ------------------------------------------------------------
# We set batch_size = number of training examples → one update per epoch.
# Keras still handles the same steps internally:
#   forward → loss → backward (autodiff) → optimizer step → repeat per epoch
# This mirrors the PyTorch loop conceptually, just handled by .fit().
# ============================================================
tf_model.fit(
    X_train, y_train,
    batch_size=BATCH_SIZE_FULL,   # full-batch (one gradient update per epoch)
    epochs=EPOCHS,
    validation_data=(X_val, y_val),
    verbose=1
)

# Inference on sample sentences (same as PyTorch)
arr = encode_batch(sample_texts)
logits = tf_model(arr, training=False)
probs = tf.nn.softmax(logits, axis=1).numpy()
preds = probs.argmax(1)

print("\n[TensorFlow] Predictions:")
for t, p, pr in zip(sample_texts, preds, probs):
    print(f"{t!r} → {'pos' if p==1 else 'neg'}  probs={pr}")


Epoch 1/3
1/1 ━━━━━━━━━━━━━━━━━━━━ 2s 2s/step - accuracy: 0.6667 - loss: 0.6902 - val_accuracy: 0.3333 - val_loss: 0.6976
Epoch 2/3
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 67ms/step - accuracy: 0.5556 - loss: 0.6891 - val_accuracy: 0.3333 - val_loss: 0.6997
Epoch 3/3
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 60ms/step - accuracy: 0.5556 - loss: 0.6880 - val_accuracy: 0.3333 - val_loss: 0.7020

[TensorFlow] Predictions:
'this movie was great' → pos  probs=[0.48581555 0.5141845 ]
'this movie was boring' → pos  probs=[0.486884 0.513116]
'i recommend this film' → pos  probs=[0.48593917 0.51406074]
