
# TMDB Movie Genre Classification (PyTorch + GloVe + LSTM)

**Goal:** Predict the **main genre** from a movie *overview* using:
- Clean text preprocessing (lowercase, regex, simple tokenization, basic stopwords)
- **GloVe** pretrained word embeddings (100d, `glove-wiki-gigaword-100` via Gensim)
- A simple **LSTM** classifier (Embedding → LSTM → Dropout → Dense; the model returns logits, and the softmax is applied inside the cross-entropy loss)
- **Optimizations:** class weighting (to address imbalance), higher dropout
- **Evaluation:** precision, recall, F1-score and accuracy (per-class report)

> Minimal, readable, and didactic. No TensorFlow used.



## 0) Setup (optional)
If you need to install dependencies:


In [None]:

# Optional: install dependencies (uncomment if needed)
# pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
# pip install pandas numpy scikit-learn gensim matplotlib



## 1) Imports


In [2]:
!pip install torch torchvision torchaudio


Collecting torch
  Using cached torch-2.9.0-cp311-cp311-win_amd64.whl.metadata (30 kB)
Collecting torchvision
  Using cached torchvision-0.24.0-cp311-cp311-win_amd64.whl.metadata (5.9 kB)
Collecting torchaudio
  Using cached torchaudio-2.9.0-cp311-cp311-win_amd64.whl.metadata (6.9 kB)
Using cached torch-2.9.0-cp311-cp311-win_amd64.whl (109.3 MB)
Using cached torchvision-0.24.0-cp311-cp311-win_amd64.whl (4.0 MB)
Using cached torchaudio-2.9.0-cp311-cp311-win_amd64.whl (664 kB)
Installing collected packages: torch, torchvision, torchaudio

ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: 'C:\\Users\\idb0227\\AppData\\Local\\Packages\\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\\LocalCache\\local-packages\\Python311\\site-packages\\torch\\include\\ATen\\native\\transformers\\cuda\\mem_eff_attention\\iterators\\predicated_tile_access_iterator_residual_last.h'



In [3]:

import ast
import re
from collections import Counter

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.utils.class_weight import compute_class_weight

import gensim.downloader as api

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader


ImportError: DLL load failed while importing _C: The specified module could not be found.


## 2) Load data and extract the **main genre**
We keep `title`, `overview`, and the first genre in the `genres` JSON list; rows missing an overview or a genre are dropped.


In [None]:

# Path to CSV 
CSV_PATH = "tmdb_5000_movies.csv"  

df = pd.read_csv(CSV_PATH, encoding='utf-8')

df = df[['title', 'overview', 'genres']]

def first_genre(genres_json):
    try:
        items = ast.literal_eval(genres_json)
        if isinstance(items, list) and len(items) > 0 and 'name' in items[0]:
            return items[0]['name']
    except Exception:
        pass
    return np.nan

df['main_genre'] = df['genres'].apply(first_genre)
df = df.dropna(subset=['overview', 'main_genre']).reset_index(drop=True)

print(df.shape)
df.head()
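
As a small illustration of the genre extraction above, here is a sketch on a single hand-written string in the same shape as the `genres` column. The sample string and the helper name `first_genre_json` are assumptions for demonstration. It uses `json.loads` instead of `ast.literal_eval`, which is a swapped-in alternative: both parse this sample, but `json.loads` also accepts JSON-only literals such as `null`.

```python
import ast
import json

# Hypothetical sample in the same shape as one `genres` cell
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

def first_genre_json(genres_json):
    """Return the first genre name, or None if the list is empty or malformed."""
    try:
        items = json.loads(genres_json)
    except (TypeError, ValueError):  # non-string input or invalid JSON
        return None
    if isinstance(items, list) and items and isinstance(items[0], dict):
        return items[0].get("name")
    return None

print(first_genre_json(sample))   # Action
print(first_genre_json("[]"))     # None

# Both parsers agree on this sample
same_result = ast.literal_eval(sample) == json.loads(sample)
```

Returning `None` instead of `np.nan` is only a convenience for this standalone sketch; `dropna` treats both as missing.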



## 3) Text preprocessing
Lowercase, regex-based tokenization (letters only), and a tiny custom stopword list.


In [None]:

STOPWORDS = {
    'the','a','an','and','or','if','in','on','of','for','to','from','by','with','at','as','is','are','was','were',
    'be','been','being','this','that','it','its','into','about','over','after','before','between','among','because',
    'but','so','than','too','very','can','could','should','would','may','might','will','just','do','does','did','doing',
    'up','down','out','off','not','no','nor','also','such','their','there','then','when','where','who','whom',
    'what','which','while','how','more','most','least','few','many'
}

TOKEN_RE = re.compile(r"[a-z]+")

def preprocess_text(text: str):
    text = str(text).lower()
    tokens = TOKEN_RE.findall(text)
    tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

df['tokens'] = df['overview'].apply(preprocess_text)
df[['title','main_genre','tokens']].head(3)
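
To make the effect of the letters-only tokenizer concrete, the sketch below runs the same regex on one made-up sentence. `TOY_STOPWORDS` is a deliberately tiny subset used only for illustration, not the list defined above.

```python
import re

TOKEN_RE = re.compile(r"[a-z]+")
TOY_STOPWORDS = {"the", "it", "is", "a"}  # small subset, for illustration only

def preprocess_text(text):
    # Lowercase, keep runs of letters, then drop stopwords
    tokens = TOKEN_RE.findall(str(text).lower())
    return [t for t in tokens if t not in TOY_STOPWORDS]

tokens = preprocess_text("It's 2049: the Blade Runner's back!")
print(tokens)  # ['s', 'blade', 'runner', 's', 'back']
```

Digits and punctuation vanish, and apostrophes split words, so stray one-letter tokens such as `s` survive. If that matters, one possible tweak (not part of the notebook) is to also drop tokens of length 1.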



## 4) Vocabulary
We keep a capped vocabulary by frequency.  
**Note:** We reserve index **0 for PAD**. No UNK is included in the vocabulary.


In [None]:

MAX_VOCAB = 20000
PAD = "<pad>"

# Count global frequencies
freq = Counter(token for toks in df['tokens'] for token in toks)
most_common = [w for w,_ in freq.most_common(MAX_VOCAB - 1)]  # -1 because index 0 is PAD

# Build itos/stoi: index 0 is PAD, words start at 1
itos = [PAD] + most_common
stoi = {w:i for i,w in enumerate(itos)}  # PAD -> 0

len(itos), itos[:10]
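
The capping logic can be checked on a toy corpus. In this sketch the documents and the tiny `MAX_VOCAB` are made up so that the least frequent word falls outside the vocabulary.

```python
from collections import Counter

PAD = "<pad>"
MAX_VOCAB = 3  # toy cap: PAD plus the two most frequent words

docs = [["space", "crew", "space"], ["crew", "love"], ["space"]]
freq = Counter(tok for doc in docs for tok in doc)  # space: 3, crew: 2, love: 1

itos = [PAD] + [w for w, _ in freq.most_common(MAX_VOCAB - 1)]
stoi = {w: i for i, w in enumerate(itos)}

print(itos)            # ['<pad>', 'space', 'crew']
print("love" in stoi)  # False
```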



## 5) Encode tokens → IDs (no PAD/UNK here)
The `encode` function returns **only known token IDs** (words in the vocab).  
Out-of-vocab tokens are **skipped**. Padding is **not** done here; it will be handled later in the `collate_fn`.


In [None]:

def encode(tokens):
    ids = [stoi[t] for t in tokens if t in stoi and t != PAD]
    return ids

df['input_ids'] = df['tokens'].apply(encode)
# Edge case: if any is empty, it will be handled by collate_fn (we'll pad to [0])
empty_count = sum(1 for x in df['input_ids'] if len(x)==0)
print("Empty sequences (will be padded to [0] on the fly):", empty_count)
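
For comparison, the sketch below contrasts the skip-OOV behaviour used here with a common alternative that maps unknown words to a shared `<unk>` id. The `<unk>` token and both toy vocabularies are hypothetical; the notebook itself has no UNK entry.

```python
PAD, UNK = "<pad>", "<unk>"  # UNK is a hypothetical extra token, not in the notebook's vocab

stoi_skip = {PAD: 0, "space": 1, "crew": 2}
stoi_unk = {PAD: 0, UNK: 1, "space": 2, "crew": 3}

def encode_skip(tokens, stoi):
    """Notebook behaviour: drop tokens that are not in the vocabulary."""
    return [stoi[t] for t in tokens if t in stoi and t != PAD]

def encode_unk(tokens, stoi):
    """Alternative: map unknown tokens to a shared UNK id, preserving length."""
    return [stoi.get(t, stoi[UNK]) for t in tokens]

tokens = ["space", "alien", "crew"]
print(encode_skip(tokens, stoi_skip))  # [1, 2]
print(encode_unk(tokens, stoi_unk))    # [2, 1, 3]
```

Skipping keeps the vocabulary smaller and sequences shorter; an UNK id preserves word positions at the cost of one more embedding row to learn.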



## 6) Load GloVe (100d) and build the embedding matrix
Row 0 (PAD) is all zeros. Words without a GloVe vector get a random vector drawn from a normal distribution (mean 0, standard deviation 0.6).


In [None]:

glove = api.load("glove-wiki-gigaword-100")  # 100-dim embeddings
EMB_DIM = glove.vector_size

emb_matrix = np.zeros((len(itos), EMB_DIM), dtype=np.float32)
rng = np.random.default_rng(123)

for i, w in enumerate(itos):
    if i == 0:  # PAD row stays all zeros
        continue
    if w in glove:  # KeyedVectors supports membership tests and item lookup
        emb_matrix[i] = glove[w]
    else:  # no pretrained vector: random init from N(0, 0.6^2)
        emb_matrix[i] = rng.normal(0, 0.6, EMB_DIM)

emb_matrix = torch.tensor(emb_matrix)
emb_matrix.shape
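
The alignment step can be tried without downloading GloVe. The sketch below swaps in a plain dict of 4-dimensional vectors (`toy_vectors`, a hypothetical stand-in for the pretrained lookup) and also reports how much of the vocabulary is covered, which is a useful sanity check before training.

```python
import numpy as np

# Hypothetical stand-in for the GloVe lookup: a plain dict of 4-d vectors
toy_vectors = {
    "space": np.array([0.1, 0.2, 0.3, 0.4], dtype=np.float32),
    "crew": np.array([0.5, 0.6, 0.7, 0.8], dtype=np.float32),
}
itos = ["<pad>", "space", "crew", "zorblax"]  # "zorblax" has no pretrained vector
emb_dim = 4

rng = np.random.default_rng(123)
emb = np.zeros((len(itos), emb_dim), dtype=np.float32)
missing = []
for i, w in enumerate(itos):
    if i == 0:
        continue  # PAD row stays all zeros
    if w in toy_vectors:
        emb[i] = toy_vectors[w]
    else:
        emb[i] = rng.normal(0, 0.6, emb_dim)  # random init for uncovered words
        missing.append(w)

coverage = 1 - len(missing) / (len(itos) - 1)  # PAD excluded from the count
print(f"coverage={coverage:.2f}, missing={missing}")
```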



## 7) Labels and train/validation split
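We map each genre to an integer label and hold out 20% of the data for validation, stratifying by main genre so both splits keep similar class proportions.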


In [None]:

genres = sorted(df['main_genre'].unique().tolist())
label2id = {g:i for i,g in enumerate(genres)}
id2label = {i:g for g,i in label2id.items()}

df['label'] = df['main_genre'].map(label2id)

X_list = df['input_ids'].tolist()
y = df['label'].values

X_train, X_val, y_train, y_val = train_test_split(
    X_list, y, test_size=0.2, random_state=42, stratify=y
)

len(genres), genres[:10]
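
One caveat with stratification: scikit-learn's `train_test_split(..., stratify=y)` raises a `ValueError` when any class has fewer than two samples. Whether that happens with this dataset is not shown here, so the following is only a defensive sketch. The helper `drop_rare_classes` and the toy data are hypothetical.

```python
from collections import Counter

def drop_rare_classes(samples, labels, min_count=2):
    """Keep only examples whose label occurs at least `min_count` times."""
    counts = Counter(labels)
    keep = [i for i, y in enumerate(labels) if counts[y] >= min_count]
    return [samples[i] for i in keep], [labels[i] for i in keep]

X_toy = [[1, 2], [3], [4, 5], [6], [7, 8]]
y_toy = ["Drama", "Drama", "Comedy", "Comedy", "Western"]

X_kept, y_kept = drop_rare_classes(X_toy, y_toy)
print(y_kept)  # ['Drama', 'Drama', 'Comedy', 'Comedy']
```

If classes are dropped this way, the label mapping should be built after filtering so the class IDs stay contiguous.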



## 8) Dataset and DataLoaders
We keep sequences as **variable-length lists**. Padding to the batch max-length is done in `collate_fn` with PAD index `0`.  
We also compute sequence lengths for efficient packing in the LSTM.


In [None]:

class MovieDataset(Dataset):
    def __init__(self, seqs, labels):
        self.seqs = seqs
        self.labels = labels
    def __len__(self): return len(self.labels)
    def __getitem__(self, idx):
        return self.seqs[idx], int(self.labels[idx])

def collate_fn(batch):
    # batch: list of (seq_ids_list, label)
    seqs, labels = zip(*batch)
    # handle empty sequences -> replace with [0] (PAD only)
    seqs = [s if len(s)>0 else [0] for s in seqs]
    lengths = torch.tensor([len(s) for s in seqs], dtype=torch.long)

    max_len = int(lengths.max())
    padded = torch.zeros((len(seqs), max_len), dtype=torch.long)  # PAD=0
    for i, s in enumerate(seqs):
        padded[i, :len(s)] = torch.tensor(s, dtype=torch.long)

    labels = torch.tensor(labels, dtype=torch.long)
    return padded, lengths, labels

train_ds = MovieDataset(X_train, y_train)
val_ds   = MovieDataset(X_val, y_val)

BATCH_SIZE_INIT = 64
train_dl = DataLoader(train_ds, batch_size=BATCH_SIZE_INIT, shuffle=True, collate_fn=collate_fn)
val_dl   = DataLoader(val_ds, batch_size=BATCH_SIZE_INIT, shuffle=False, collate_fn=collate_fn)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device
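
To see what the collate step produces, the sketch below pads a toy batch. It uses PyTorch's `pad_sequence` utility rather than the manual loop above; that is a swapped-in alternative that should give the same padded tensor for these inputs.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

seqs = [[5, 2, 9], [7], []]               # toy ID sequences, one of them empty
seqs = [s if s else [0] for s in seqs]    # empty -> [0], as in collate_fn
lengths = torch.tensor([len(s) for s in seqs], dtype=torch.long)

padded = pad_sequence(
    [torch.tensor(s, dtype=torch.long) for s in seqs],
    batch_first=True,
    padding_value=0,                      # PAD index
)
print(padded.tolist())   # [[5, 2, 9], [7, 0, 0], [0, 0, 0]]
print(lengths.tolist())  # [3, 1, 1]
```

Note that an empty sequence ends up with length 1, so packing never sees a zero-length entry.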



## 9) Model (Embedding → LSTM → Dropout → Dense)
We use `pack_padded_sequence` to efficiently handle variable-length sequences.  
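We take the **final hidden state** as the sequence representation; with packing, this is the state at each sequence's last real token rather than at a padded position.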


In [None]:

class LSTMClassifier(nn.Module):
    def __init__(self, emb_matrix, hidden_size=128, num_classes=10, dropout=0.5, bidirectional=False):
        super().__init__()
        num_embeddings, emb_dim = emb_matrix.shape
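        # padding_idx=0 keeps the PAD row out of gradient updates
        self.embedding = nn.Embedding(num_embeddings, emb_dim, padding_idx=0)
        self.embedding.weight.data.copy_(emb_matrix)  # initialize from GloVe
        self.embedding.weight.requires_grad = True  # fine-tune embeddings during training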

        self.lstm = nn.LSTM(
            input_size=emb_dim,
            hidden_size=hidden_size,
            batch_first=True,
            bidirectional=bidirectional
        )
        feat_dim = hidden_size * (2 if bidirectional else 1)
        self.dropout = nn.Dropout(dropout)
        self.fc1 = nn.Linear(feat_dim, 128)
        self.act = nn.ReLU()
        self.fc_out = nn.Linear(128, num_classes)

    def forward(self, input_ids, lengths):
        # input_ids: (B, T), lengths: (B,)
        x = self.embedding(input_ids)  # (B, T, E)
        packed = nn.utils.rnn.pack_padded_sequence(x, lengths.cpu(), batch_first=True, enforce_sorted=False)
        packed_out, (h, c) = self.lstm(packed)
        if self.lstm.bidirectional:
            h_last = torch.cat([h[-2], h[-1]], dim=1)
        else:
            h_last = h[-1]
        z = self.dropout(h_last)
        z = self.act(self.fc1(z))
        z = self.dropout(z)
        logits = self.fc_out(z)
        return logits
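
The reason for packing can be checked directly: with a packed batch, the final hidden state of a short sequence matches what the LSTM returns when that sequence is run alone, whereas an unpacked padded batch keeps stepping over the padding. The sketch below uses a small stand-alone `nn.LSTM` with made-up sizes, not the classifier above.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=3, batch_first=True)

short = torch.randn(1, 2, 4)   # one sequence of length 2
long_ = torch.randn(1, 5, 4)   # one sequence of length 5

# Batch them, zero-padding the short sequence up to length 5
batch = torch.zeros(2, 5, 4)
batch[0, :2] = short[0]
batch[1] = long_[0]
lengths = torch.tensor([2, 5])  # must live on the CPU

with torch.no_grad():
    packed = pack_padded_sequence(batch, lengths, batch_first=True, enforce_sorted=False)
    _, (h_packed, _) = lstm(packed)   # stops at each sequence's true length
    _, (h_alone, _) = lstm(short)     # short sequence on its own
    _, (h_padded, _) = lstm(batch)    # no packing: also steps over the padding

same = torch.allclose(h_packed[-1, 0], h_alone[-1, 0], atol=1e-6)
print("packed matches unpadded run:", same)
print("gap without packing:", (h_padded[-1, 0] - h_alone[-1, 0]).abs().max().item())
```

The second printed value is typically non-zero, which is the distortion that packing avoids.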



## 10) Train/Eval helpers


In [None]:

def train_one_epoch(model, loader, optimizer, criterion):
    model.train()
    total_loss = 0.0
    for Xb, Lb, yb in loader:
        Xb, Lb, yb = Xb.to(device), Lb.to(device), yb.to(device)
        optimizer.zero_grad()
        logits = model(Xb, Lb)
        loss = criterion(logits, yb)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * Xb.size(0)
    return total_loss / len(loader.dataset)

@torch.no_grad()
def evaluate(model, loader, criterion):
    model.eval()
    total_loss = 0.0
    preds, golds = [], []
    for Xb, Lb, yb in loader:
        Xb, Lb, yb = Xb.to(device), Lb.to(device), yb.to(device)
        logits = model(Xb, Lb)
        loss = criterion(logits, yb)
        total_loss += loss.item() * Xb.size(0)
        pred = logits.argmax(dim=1).cpu().numpy().tolist()
        preds.extend(pred)
        golds.extend(yb.cpu().numpy().tolist())
    avg_loss = total_loss / len(loader.dataset)
    return avg_loss, preds, golds

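def report(golds, preds, id2label):
    # Use the full label set so names stay aligned even if a class is absent from golds
    labels = sorted(id2label)
    target_names = [id2label[i] for i in labels]
    print(classification_report(golds, preds, labels=labels, target_names=target_names,
                                digits=3, zero_division=0))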



## 11) Baseline training (no class weights, dropout=0.2)


In [None]:

NUM_CLASSES = len(genres)

baseline = LSTMClassifier(
    emb_matrix=emb_matrix,
    hidden_size=128,
    num_classes=NUM_CLASSES,
    dropout=0.2,
    bidirectional=False
).to(device)

optimizer = torch.optim.Adam(baseline.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

EPOCHS_INIT = 5  # increase for better results if you want
for epoch in range(1, EPOCHS_INIT+1):
    tr_loss = train_one_epoch(baseline, train_dl, optimizer, criterion)
    va_loss, va_preds, va_golds = evaluate(baseline, val_dl, criterion)
    print(f"[Baseline] Epoch {epoch}/{EPOCHS_INIT} | train_loss={tr_loss:.4f} | val_loss={va_loss:.4f}")

print("\n[Baseline] Validation report:")
report(va_golds, va_preds, id2label)



## 12) Optimized training (class weights + dropout=0.5)
- **Class weighting** to handle label imbalance.
- Higher **dropout** for regularization.


In [None]:

class_weights = compute_class_weight(
    class_weight="balanced",
    classes=np.unique(y_train),
    y=y_train
)
class_weights = torch.tensor(class_weights, dtype=torch.float32).to(device)
print("Class weights:", class_weights.cpu().numpy())

optimized = LSTMClassifier(
    emb_matrix=emb_matrix,
    hidden_size=128,
    num_classes=NUM_CLASSES,
    dropout=0.5,
    bidirectional=False
).to(device)

optimizer_opt = torch.optim.Adam(optimized.parameters(), lr=1e-3)
criterion_opt = nn.CrossEntropyLoss(weight=class_weights)

EPOCHS_OPT = 8  # increase if you want
for epoch in range(1, EPOCHS_OPT+1):
    tr_loss = train_one_epoch(optimized, train_dl, optimizer_opt, criterion_opt)
    va_loss, va_preds, va_golds = evaluate(optimized, val_dl, criterion_opt)
    print(f"[Optimized] Epoch {epoch}/{EPOCHS_OPT} | train_loss={tr_loss:.4f} | val_loss={va_loss:.4f}")

print("\n[Optimized] Validation report:")
report(va_golds, va_preds, id2label)
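
For reference, the `"balanced"` heuristic in scikit-learn is documented as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces it with NumPy on made-up labels to show how rare classes get larger weights.

```python
import numpy as np

y_toy = np.array([0] * 6 + [1] * 3 + [2] * 1)    # imbalanced toy labels

counts = np.bincount(y_toy)                      # [6, 3, 1]
weights = len(y_toy) / (len(counts) * counts)    # n_samples / (n_classes * count)

print(counts)
print(np.round(weights, 3))                      # [0.556 1.111 3.333]

# Every class ends up with the same total weight: n_samples / n_classes
total_per_class = weights * counts
```

Because the weighted cross-entropy in PyTorch averages over the weights of the targets, loss values from the weighted and unweighted runs are not directly comparable; the per-class metrics are the better yardstick.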



## 13) Quick comparison
Macro-averaged precision, recall, and F1, plus accuracy (baseline vs optimized).


In [None]:

from sklearn.metrics import precision_recall_fscore_support, accuracy_score

def quick_scores(golds, preds):
    acc = accuracy_score(golds, preds)
    p, r, f1, _ = precision_recall_fscore_support(golds, preds, average='macro', zero_division=0)
    return {"acc":acc, "precision_macro":p, "recall_macro":r, "f1_macro":f1}

# Re-evaluate to ensure we have the latest predictions
_, base_preds, base_golds = evaluate(baseline, val_dl, nn.CrossEntropyLoss())
_, opt_preds,  opt_golds  = evaluate(optimized, val_dl, nn.CrossEntropyLoss(weight=class_weights))

print("Baseline:", quick_scores(base_golds, base_preds))
print("Optimized:", quick_scores(opt_golds, opt_preds))
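
Macro averaging matters on imbalanced labels because every class counts equally. The toy example below (made-up labels, plain Python, hypothetical helper `per_class_f1`) shows a majority-class predictor with decent accuracy but poor macro-F1.

```python
def per_class_f1(golds, preds, label):
    """F1 for one class, returning 0.0 when precision or recall is undefined."""
    tp = sum(1 for g, p in zip(golds, preds) if g == label and p == label)
    fp = sum(1 for g, p in zip(golds, preds) if g != label and p == label)
    fn = sum(1 for g, p in zip(golds, preds) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

golds = [0] * 8 + [1] * 2
preds = [0] * 10                      # always predicts the majority class

classes = sorted(set(golds))
accuracy = sum(g == p for g, p in zip(golds, preds)) / len(golds)
macro_f1 = sum(per_class_f1(golds, preds, c) for c in classes) / len(classes)

print(f"accuracy={accuracy:.3f}, macro_f1={macro_f1:.3f}")  # accuracy=0.800, macro_f1=0.444
```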



## Notes
- **Preprocessing:** minimal on purpose; extend with lemmatization/stemming if needed.
- **GloVe:** we align embeddings to our vocab. PAD (0) is all zeros; missing words get small random vectors.
- **No UNK in `encode`:** OOV tokens are simply skipped.
- **Variable lengths:** we pad **only in the collate function** and use packing in the LSTM.
- **Optimizations:** class weights, higher dropout. Try `bidirectional=True` or larger `hidden_size` for more capacity.
- **Reproducibility:** for full determinism you can set random seeds and disable cuDNN variability (not shown to keep it concise).
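
The notebook leaves seeding out for brevity. For reference, a minimal sketch of what it could look like is given below. The helper name `set_seed` is an assumption, and on GPU the results may still not be bit-for-bit identical across runs.

```python
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Seed the Python, NumPy, and PyTorch RNGs and ask cuDNN for deterministic kernels."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)           # silently ignored when CUDA is unavailable
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False     # disable autotuning, which can vary between runs

set_seed(0)
a = torch.rand(3)
set_seed(0)
b = torch.rand(3)
print(torch.equal(a, b))  # True
```

Calling it once before building the model and the `DataLoader`s should make weight initialization and shuffling repeatable on CPU.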
