# Lab 8: CNN Text Classification with Word2Vec Embeddings

---
## 1. Notebook Overview

### 1.1 Objective
Replace the Lab 5 classification model (MLPClassifier on binary feature vectors) with a **Convolutional Neural Network (CNN)** whose input is a matrix of word embeddings per document, rather than a single fixed-length vector such as a binary feature vector or a mean of embeddings.

### 1.2 Key Changes from Lab 5
- **Lab 5**: Used binary BoW vectors (1000 features) with MLPClassifier
- **Lab 8**: Uses word embedding matrices with 1D CNN

### 1.3 Prerequisites
This notebook assumes you have already executed:
- **Lab 2**: Data preprocessing → `../Data/multi_label/tweets_preprocessed_*.parquet`
- **Lab 4**: Feature extraction → `../Data/top_1000_vocabulary.json`
- **Lab 5**: Neural network classification (for comparison)

### 1.4 Architecture
We implement a 1D CNN with:
- **Input**: Embedding matrices of shape (seq_len, emb_dim) where emb_dim=300 (Word2Vec)
- **Conv1D Layer**: 100 filters with kernel size 3
- **Pooling**: Global Max Pooling
- **Output**: Multi-label classification with 6 classes using Sigmoid activation
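
As a quick sanity check on these numbers, the arithmetic below traces the shapes through the layers for a single document. It is only a sketch: it assumes `seq_len=50` (the `MAX_SEQ_LEN` used later in this notebook) and the `nn.Conv1d` defaults of stride 1 and no padding. The helper name `conv1d_output_len` is made up for this illustration.

```python
def conv1d_output_len(seq_len: int, kernel_size: int,
                      stride: int = 1, padding: int = 0) -> int:
    """Output length of a 1D convolution (dilation 1)."""
    return (seq_len + 2 * padding - kernel_size) // stride + 1

seq_len, emb_dim = 50, 300
num_filters, kernel_size, num_classes = 100, 3, 6

conv_len = conv1d_output_len(seq_len, kernel_size)
shapes = {
    "input": (seq_len, emb_dim),
    "after Conv1D": (num_filters, conv_len),
    "after global max pooling": (num_filters,),
    "output": (num_classes,),
}
for name, shape in shapes.items():
    print(f"{name:<26} {shape}")
```

With a kernel of size 3 and no padding, the 50 token positions shrink to 48 window positions before pooling collapses them to one value per filter.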

### 1.5 Section Overview
1. **Section 2** – Load and prepare data
2. **Section 3** – Load Word2Vec model
3. **Section 4** – Create embedding matrices
4. **Section 5** – Define and train CNN model
5. **Section 6** – Evaluation and comparison with Lab 5
6. **Section 7** – Summary

---
## 2. Setup and Data Loading

### 2.1 Install Dependencies

In [None]:
# Install required packages (run once)
!pip install gensim torch pandas pyarrow scikit-learn tqdm --quiet

### 2.2 Import Libraries

In [None]:
# Standard libraries
import json
import ast
import os
from typing import List
from pathlib import Path

# Data processing
import numpy as np
import pandas as pd
from tqdm import tqdm

# Word2Vec
import gensim.downloader as api

# PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Evaluation metrics
from sklearn.metrics import (
    accuracy_score, 
    f1_score, 
    precision_score, 
    recall_score, 
    hamming_loss
)

# Constants
TRAIN_DATA_PATH = "../Data/multi_label/tweets_preprocessed_train.parquet"
TEST_DATA_PATH = "../Data/multi_label/tweets_preprocessed_test.parquet"
VALIDATION_DATA_PATH = "../Data/multi_label/tweets_preprocessed_validation.parquet"
VOCABULARY_PATH = "../Data/top_1000_vocabulary.json"
RANDOM_STATE = 42

# Set random seeds for reproducibility
np.random.seed(RANDOM_STATE)
torch.manual_seed(RANDOM_STATE)

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"✓ Libraries imported successfully")
print(f"✓ Using device: {device}")

### 2.3 Load Preprocessed Datasets from Lab 2

In [None]:
def parse_labels(value) -> List[str]:
    """Parse label_name column into consistent Python lists."""
    if isinstance(value, (list, np.ndarray)):
        return [str(v) for v in value]
    if isinstance(value, tuple):
        return [str(v) for v in value]
    if isinstance(value, str):
        value = value.strip()
        if value.startswith('[') and value.endswith(']'):
            inner = value[1:-1].strip()
            if not inner:
                return []
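            # Strip quotes and treat commas as separators so that both
            # "['a' 'b']" (NumPy repr) and "['a', 'b']" (list repr) parse.
            inner = inner.replace("'", "").replace('"', '').replace(',', ' ')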
            labels = [l.strip() for l in inner.split() if l.strip()]
            return labels
        try:
            parsed = ast.literal_eval(value)
            if isinstance(parsed, (list, tuple)):
                return [str(v) for v in parsed]
        except (ValueError, SyntaxError):
            pass
        return [value] if value else []
    return [str(value)] if value else []

def parse_binary_label(value) -> np.ndarray:
    """Parse binary label array from string representation."""
    if isinstance(value, np.ndarray):
        return value
    if isinstance(value, str):
        # Accept both "[0 1 0]" (NumPy repr) and "[0, 1, 0]" (list repr)
        inner = value.strip()[1:-1].replace(',', ' ')
        return np.array([int(x) for x in inner.split()])
    return np.array(value)

def load_dataset(path: str) -> pd.DataFrame:
    """Load tweets from parquet and normalize the label columns."""
    df = pd.read_parquet(path)
    df = df.copy()
    df["labels"] = df["label_name"].apply(parse_labels)
    df["label_binary"] = df["label"].apply(parse_binary_label)
    return df

# Load all datasets
print("Loading preprocessed datasets from Lab 2...")
df_train = load_dataset(TRAIN_DATA_PATH)
df_test = load_dataset(TEST_DATA_PATH)
df_validation = load_dataset(VALIDATION_DATA_PATH)

print(f"✓ Training set: {len(df_train):,} samples")
print(f"✓ Test set: {len(df_test):,} samples")
print(f"✓ Validation set: {len(df_validation):,} samples")
print(f"\nSample preprocessed text:")
print(f"  '{df_train['text'].iloc[0][:80]}...'")
print(f"  Labels: {df_train['labels'].iloc[0]}")

### 2.4 Extract Class Information

In [None]:
# Dynamically determine the number of classes from the data
NUM_CLASSES = len(df_train['label_binary'].iloc[0])

# Extract all unique class names
all_class_names = set()
for df in [df_train, df_test, df_validation]:
    for labels in df['labels']:
        all_class_names.update(labels)

TOPIC_CLASSES = sorted(list(all_class_names))

print(f"✓ Number of classes: {NUM_CLASSES}")
print(f"✓ Classes: {TOPIC_CLASSES}")

# Label statistics
y_train_multi = np.vstack(df_train['label_binary'].values)
print(f"\n✓ Label distribution (Training):")
print(f"  Average labels per sample: {y_train_multi.sum(axis=1).mean():.2f}")
print(f"  Samples per class:")
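# NOTE: assumes the columns of label_binary follow the alphabetical order of
# TOPIC_CLASSES; check this against the label encoding produced in Lab 2.
for i, class_name in enumerate(TOPIC_CLASSES):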
    count = y_train_multi[:, i].sum()
    print(f"    {class_name}: {count}")

---
## 3. Load Word2Vec Model

We use the pre-trained Google News Word2Vec model (300-dimensional embeddings) to convert words into dense vector representations.

In [None]:
print("Loading Word2Vec model (GoogleNews-vectors-negative300)...")
print("This may take a few minutes on first run as the model is downloaded.")

# Load pre-trained Word2Vec embeddings
w2v = api.load('word2vec-google-news-300')

EMBEDDING_DIM = w2v.vector_size
print(f"\n✓ Word2Vec model loaded successfully")
print(f"✓ Vocabulary size: {len(w2v):,} words")
print(f"✓ Embedding dimension: {EMBEDDING_DIM}")

# Test the embeddings
test_word = "happy"
if test_word in w2v:
    print(f"\n✓ Example: '{test_word}' vector shape: {w2v[test_word].shape}")
    print(f"  First 5 dimensions: {w2v[test_word][:5]}")

---
## 4. Create Embedding Matrices

### 4.1 Document Embedding Function

Instead of using mean vectors (as in simpler approaches), we create a **matrix of embeddings** for each document. Each row in the matrix represents a word's embedding vector. This preserves sequential information that CNNs can exploit.

**Key difference from Lab 5:**
- Lab 5: Binary BoW vectors (1000 dimensions, loses word order and semantics)
- Lab 8: Embedding matrices (seq_len × 300 dimensions, preserves word order and semantics)
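
To make the difference concrete, here is a minimal sketch using made-up 3-dimensional vectors (the `toy_vectors` dictionary and the `to_matrix` helper are hypothetical, not part of the lab code). Two sentences with the same words in a different order collapse to the same mean vector, while their embedding matrices stay distinct.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, chosen only for illustration.
toy_vectors = {
    "dog":   np.array([1.0, 0.0, 0.0]),
    "bites": np.array([0.0, 1.0, 0.0]),
    "man":   np.array([0.0, 0.0, 1.0]),
}

def to_matrix(text: str) -> np.ndarray:
    """Stack one embedding per token, keeping the token order."""
    return np.stack([toy_vectors[tok] for tok in text.split()])

a = to_matrix("dog bites man")
b = to_matrix("man bites dog")

# Averaging over the token axis discards the order of the rows.
same_mean = np.allclose(a.mean(axis=0), b.mean(axis=0))
same_matrix = np.array_equal(a, b)

print(f"Mean vectors identical:       {same_mean}")
print(f"Embedding matrices identical: {same_matrix}")
```

A binary BoW vector has the same limitation as the mean vector here: it records which words occur, not where.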

In [None]:
# Maximum sequence length for CNN input
MAX_SEQ_LEN = 50  # Most tweets are shorter than 50 tokens after preprocessing

def embed_document(text: str, max_len: int = MAX_SEQ_LEN) -> np.ndarray:
    """
    Convert a document into an embedding matrix.
    
    Parameters:
    -----------
    text : str
        Preprocessed text (whitespace-tokenized)
    max_len : int
        Maximum sequence length (pad/truncate to this length)
    
    Returns:
    --------
    np.ndarray
        Embedding matrix of shape (max_len, EMBEDDING_DIM)
    """
    tokens = text.lower().split()

    # Preallocate in float32 to match the tensors built later; rows for
    # OOV tokens and padding positions simply stay zero.
    matrix = np.zeros((max_len, EMBEDDING_DIM), dtype=np.float32)

    # Truncate to max_len and fill one row per known token
    for i, tok in enumerate(tokens[:max_len]):
        if tok in w2v:
            matrix[i] = w2v[tok]

    return matrix

# Test the function
sample_text = df_train['text'].iloc[0]
sample_embedding = embed_document(sample_text)
print(f"Sample text: '{sample_text[:60]}...'")
print(f"Embedding matrix shape: {sample_embedding.shape}")

### 4.2 Create Embedding Matrices for All Datasets

In [None]:
def create_embedding_matrices(texts: pd.Series, max_len: int = MAX_SEQ_LEN) -> np.ndarray:
    """
    Create embedding matrices for all documents in the dataset.
    
    Parameters:
    -----------
    texts : pd.Series
        Series of preprocessed text strings
    max_len : int
        Maximum sequence length
    
    Returns:
    --------
    np.ndarray
        3D array of shape (n_samples, max_len, EMBEDDING_DIM)
    """
    embeddings = []
    for text in tqdm(texts, desc="Creating embeddings"):
        embeddings.append(embed_document(text, max_len))
    return np.stack(embeddings)

print("Creating embedding matrices for all datasets...")
print("This may take a few minutes...\n")

X_train_emb = create_embedding_matrices(df_train['text'])
X_test_emb = create_embedding_matrices(df_test['text'])
X_val_emb = create_embedding_matrices(df_validation['text'])

print(f"\n✓ Training embeddings shape: {X_train_emb.shape}")
print(f"✓ Test embeddings shape: {X_test_emb.shape}")
print(f"✓ Validation embeddings shape: {X_val_emb.shape}")

### 4.3 Prepare Labels

In [None]:
# Prepare multi-label targets
y_train = np.vstack(df_train['label_binary'].values).astype(np.float32)
y_test = np.vstack(df_test['label_binary'].values).astype(np.float32)
y_val = np.vstack(df_validation['label_binary'].values).astype(np.float32)

print(f"✓ Label shapes:")
print(f"  y_train: {y_train.shape}")
print(f"  y_test: {y_test.shape}")
print(f"  y_val: {y_val.shape}")

---
## 5. CNN Model Definition and Training

### 5.1 TextCNN Model Architecture

The CNN architecture processes embedding matrices through:
1. **Conv1D Layer**: Applies 1D convolution over the sequence
2. **ReLU Activation**: Non-linear activation
3. **Global Max Pooling**: Keeps the strongest activation of each filter across all positions in the sequence
4. **Fully Connected Layer**: Maps to output classes
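
The first three steps can be sketched in plain NumPy on a tiny random input. This is an illustration only: the sizes are deliberately small, the weights are random rather than learned, and the weight layout `(num_filters, kernel_size, emb_dim)` is chosen for readability (PyTorch's `nn.Conv1d` stores weights as `(out_channels, in_channels, kernel_size)`).

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, emb_dim, num_filters, kernel_size = 6, 4, 2, 3

x = rng.normal(size=(seq_len, emb_dim))  # one document: 6 tokens, 4 dims
weights = rng.normal(size=(num_filters, kernel_size, emb_dim))
bias = np.zeros(num_filters)

# Step 1: slide each filter over every window of 3 consecutive tokens.
n_windows = seq_len - kernel_size + 1
conv = np.empty((num_filters, n_windows))
for f in range(num_filters):
    for t in range(n_windows):
        conv[f, t] = np.sum(weights[f] * x[t:t + kernel_size]) + bias[f]

# Step 2: ReLU sets negative responses to zero.
activated = np.maximum(conv, 0.0)

# Step 3: global max pooling keeps one value per filter.
pooled = activated.max(axis=1)

print("conv output shape:", conv.shape)
print("pooled shape:     ", pooled.shape)
```

Because pooling takes the maximum over all window positions, the pooled vector has the same size regardless of how long the input sequence is.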

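For multi-label classification, we use **Sigmoid** activation (not Softmax) and **Binary Cross-Entropy** loss: Sigmoid scores each class independently, so a sample can receive several labels, whereas Softmax forces the class probabilities to sum to 1.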
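
A small worked example may help here. The logits and target below are invented for illustration. The sketch applies both activations to the same scores, thresholds the Sigmoid outputs at 0.5 (the threshold used in Section 6.1), and computes the Binary Cross-Entropy by hand, averaged over classes as `nn.BCELoss` does by default.

```python
import math

logits = [2.0, 1.5, -1.0]   # hypothetical raw scores for 3 classes
target = [1.0, 1.0, 0.0]    # this sample carries two labels at once

# Sigmoid: each class gets its own probability in (0, 1).
sigmoid = [1 / (1 + math.exp(-z)) for z in logits]

# Softmax: the probabilities compete and sum to 1.
exps = [math.exp(z) for z in logits]
softmax = [e / sum(exps) for e in exps]

# Binary cross-entropy, averaged over the classes.
bce = -sum(
    t * math.log(p) + (1 - t) * math.log(1 - p)
    for t, p in zip(target, sigmoid)
) / len(target)

predicted = [int(p > 0.5) for p in sigmoid]

print("sigmoid:  ", [round(p, 3) for p in sigmoid])
print("softmax:  ", [round(p, 3) for p in softmax])
print("predicted:", predicted)
print(f"BCE loss:  {bce:.4f}")
```

With Sigmoid, the first two classes both clear the 0.5 threshold. With Softmax, at most one class can exceed 0.5, so the second label would be missed.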

In [None]:
class TextCNN(nn.Module):
    """
    1D CNN for text classification using word embeddings.
    
    Architecture:
    - Input: (batch, seq_len, emb_dim)
    - Conv1D: (batch, emb_dim, seq_len) -> (batch, num_filters, seq_len-kernel_size+1)
    - MaxPool: (batch, num_filters, 1) -> (batch, num_filters)
    - FC: (batch, num_filters) -> (batch, num_classes)
    """
    
    def __init__(self, emb_dim: int = 300, seq_len: int = 50, 
                 num_classes: int = 6, num_filters: int = 100, 
                 kernel_size: int = 3, dropout: float = 0.5):
        super().__init__()
        
        # 1D Convolution over the sequence dimension
        self.conv1 = nn.Conv1d(
            in_channels=emb_dim, 
            out_channels=num_filters, 
            kernel_size=kernel_size
        )
        self.relu = nn.ReLU()
        
        # Global Max Pooling
        self.pool = nn.AdaptiveMaxPool1d(1)
        
        # Dropout for regularization
        self.dropout = nn.Dropout(dropout)
        
        # Fully connected layer for classification
        self.fc = nn.Linear(num_filters, num_classes)
        
        # Sigmoid for multi-label classification
        self.sigmoid = nn.Sigmoid()
    
    def forward(self, x):
        # x: (batch, seq_len, emb_dim)
        
        # Permute for Conv1d: (batch, emb_dim, seq_len)
        x = x.permute(0, 2, 1)
        
        # Convolution + activation
        x = self.conv1(x)
        x = self.relu(x)
        
        # Global max pooling: (batch, num_filters, 1) -> (batch, num_filters)
        x = self.pool(x).squeeze(-1)
        
        # Dropout
        x = self.dropout(x)
        
        # Fully connected layer
        x = self.fc(x)
        
        # Sigmoid for multi-label probabilities
        x = self.sigmoid(x)
        
        return x

# Display model architecture
print("="*60)
print("CNN MODEL ARCHITECTURE")
print("="*60)
print(f"Input: Embedding matrix ({MAX_SEQ_LEN}, {EMBEDDING_DIM})")
print(f"Conv1D: {EMBEDDING_DIM} -> 100 filters, kernel_size=3")
print(f"Activation: ReLU")
print(f"Pooling: Global Max Pooling")
print(f"Dropout: 0.5")
print(f"Output: {NUM_CLASSES} classes (Sigmoid activation)")
print("="*60)

### 5.2 Prepare PyTorch DataLoaders

In [None]:
# Convert to PyTorch tensors
X_train_tensor = torch.tensor(X_train_emb, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test_emb, dtype=torch.float32)
X_val_tensor = torch.tensor(X_val_emb, dtype=torch.float32)

y_train_tensor = torch.tensor(y_train, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32)
y_val_tensor = torch.tensor(y_val, dtype=torch.float32)

# Create datasets and dataloaders
BATCH_SIZE = 32

train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
val_dataset = TensorDataset(X_val_tensor, y_val_tensor)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)

print(f"✓ DataLoaders created")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Training batches: {len(train_loader)}")
print(f"  Test batches: {len(test_loader)}")

### 5.3 Train the CNN Model

In [None]:
# Initialize model, loss function, and optimizer
model = TextCNN(
    emb_dim=EMBEDDING_DIM,
    seq_len=MAX_SEQ_LEN,
    num_classes=NUM_CLASSES,
    num_filters=100,
    kernel_size=3,
    dropout=0.5
).to(device)

# Binary Cross-Entropy for multi-label classification
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

print(f"✓ Model initialized")
print(f"✓ Optimizer: Adam (lr=1e-3)")
print(f"✓ Loss function: Binary Cross-Entropy")

In [None]:
# Training loop
NUM_EPOCHS = 20

print("\n" + "="*60)
print("TRAINING CNN MODEL")
print("="*60)

train_losses = []
val_losses = []

for epoch in range(NUM_EPOCHS):
    # Training phase
    model.train()
    total_train_loss = 0
    
    for batch_X, batch_y in train_loader:
        batch_X, batch_y = batch_X.to(device), batch_y.to(device)
        
        optimizer.zero_grad()
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()
        
        total_train_loss += loss.item()
    
    avg_train_loss = total_train_loss / len(train_loader)
    train_losses.append(avg_train_loss)
    
    # Validation phase
    model.eval()
    total_val_loss = 0
    
    with torch.no_grad():
        for batch_X, batch_y in val_loader:
            batch_X, batch_y = batch_X.to(device), batch_y.to(device)
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            total_val_loss += loss.item()
    
    avg_val_loss = total_val_loss / len(val_loader) if len(val_loader) > 0 else 0
    val_losses.append(avg_val_loss)
    
    # Print progress every 5 epochs
    if (epoch + 1) % 5 == 0 or epoch == 0:
        print(f"Epoch {epoch+1:3d}/{NUM_EPOCHS} | Train Loss: {avg_train_loss:.4f} | Val Loss: {avg_val_loss:.4f}")

print("\n✓ Training complete!")

---
## 6. Evaluation and Comparison

### 6.1 Make Predictions

In [None]:
# Make predictions on test set
model.eval()
all_preds = []
all_labels = []

THRESHOLD = 0.5  # Classification threshold for multi-label

with torch.no_grad():
    for batch_X, batch_y in test_loader:
        batch_X = batch_X.to(device)
        outputs = model(batch_X)
        
        # Apply threshold to get binary predictions
        preds = (outputs.cpu().numpy() > THRESHOLD).astype(int)
        all_preds.append(preds)
        all_labels.append(batch_y.numpy())

y_pred_cnn = np.vstack(all_preds)
y_true = np.vstack(all_labels)

print(f"✓ Predictions made on {len(y_pred_cnn)} test samples")

### 6.2 Calculate CNN Metrics

In [None]:
# Calculate evaluation metrics
cnn_metrics = {
    'Subset Accuracy': accuracy_score(y_true, y_pred_cnn),
    'Hamming Loss': hamming_loss(y_true, y_pred_cnn),
    'Micro F1': f1_score(y_true, y_pred_cnn, average='micro', zero_division=0),
    'Macro F1': f1_score(y_true, y_pred_cnn, average='macro', zero_division=0),
    'Micro Precision': precision_score(y_true, y_pred_cnn, average='micro', zero_division=0),
    'Micro Recall': recall_score(y_true, y_pred_cnn, average='micro', zero_division=0)
}

print("="*60)
print("CNN EVALUATION RESULTS (Test Set)")
print("="*60)
for metric, value in cnn_metrics.items():
    print(f"{metric:<20}: {value:.4f}")
print("="*60)

### 6.3 Sample Predictions

In [None]:
# Show sample predictions
def labels_to_names(binary_vec):
    """Convert binary vector to list of class names."""
    names = [TOPIC_CLASSES[i] for i, v in enumerate(binary_vec) if v == 1]
    return tuple(names) if names else ('none',)

print("\nSample CNN Predictions:")
print("-" * 70)

for i in range(5):
    text = df_test['text'].iloc[i][:60]
    true_labels = labels_to_names(y_true[i])
    pred_labels = labels_to_names(y_pred_cnn[i])
    match = "✓" if set(true_labels) == set(pred_labels) else "✗"
    
    print(f"\n{match} Sample {i+1}:")
    print(f"   Text: {text}...")
    print(f"   True: {true_labels}")
    print(f"   Pred: {pred_labels}")

### 6.4 Comparison with Lab 5 Results

We compare our CNN results with the Neural Network (MLPClassifier) results from Lab 5.
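
Since the comparison rests on multi-label metrics that behave quite differently, the sketch below computes three of them by hand on a made-up example with 3 samples and 3 classes. The label matrices are invented for illustration and are unrelated to the lab data.

```python
# Hypothetical ground truth and predictions (rows = samples, columns = classes)
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 1, 1]]

n_samples, n_classes = len(y_true), len(y_true[0])

# Subset accuracy: a sample counts only if every label matches.
subset_acc = sum(t == p for t, p in zip(y_true, y_pred)) / n_samples

# Flatten to individual (true, predicted) label slots.
pairs = [(t, p) for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr)]

# Hamming loss: fraction of label slots that are wrong.
hamming = sum(t != p for t, p in pairs) / (n_samples * n_classes)

# Micro F1: pool TP, FP, and FN over all classes before computing F1.
tp = sum(t == 1 and p == 1 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)
micro_f1 = 2 * tp / (2 * tp + fp + fn)

print(f"Subset accuracy: {subset_acc:.4f}")
print(f"Hamming loss:    {hamming:.4f}")
print(f"Micro F1:        {micro_f1:.4f}")
```

Only one of the three samples is predicted exactly, so subset accuracy is low even though just 2 of the 9 label slots are wrong. This is why subset accuracy is usually the harshest of the reported metrics.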

In [None]:
# Lab 5 Multi-Label Neural Network results (from lab5.ipynb output)
# These are the results we compare against
lab5_nn_metrics = {
    'Subset Accuracy': 0.4527,
    'Hamming Loss': 0.1442,
    'Micro F1': 0.6447,
    'Macro F1': 0.5636,
    'Micro Precision': 0.6985,
    'Micro Recall': 0.5987
}

# Lab 5 Naive Bayes results
lab5_nb_metrics = {
    'Subset Accuracy': 0.4917,
    'Hamming Loss': 0.1337,
    'Micro F1': 0.6811,
    'Macro F1': 0.6210,
    'Micro Precision': 0.7114,
    'Micro Recall': 0.6532
}

# Create comparison table
print("="*90)
print("MODEL COMPARISON: CNN (Lab 8) vs Neural Network & Naive Bayes (Lab 5)")
print("="*90)
print(f"\n{'Metric':<20} {'CNN (Lab 8)':<15} {'NN (Lab 5)':<15} {'NB (Lab 5)':<15} {'CNN vs NN':<12}")
print("-"*75)

for metric in cnn_metrics.keys():
    cnn_val = cnn_metrics[metric]
    nn_val = lab5_nn_metrics[metric]
    nb_val = lab5_nb_metrics[metric]
    
    # Calculate improvement (for Hamming Loss, lower is better)
    if metric == 'Hamming Loss':
        diff = nn_val - cnn_val  # Positive = CNN is better
    else:
        diff = cnn_val - nn_val  # Positive = CNN is better
    
    diff_str = f"+{diff:.4f}" if diff > 0 else f"{diff:.4f}"
    print(f"{metric:<20} {cnn_val:<15.4f} {nn_val:<15.4f} {nb_val:<15.4f} {diff_str:<12}")

print("\n" + "="*90)
print("Note: For Hamming Loss, lower is better. For all other metrics, higher is better.")
print("      'CNN vs NN' shows improvement (positive = CNN is better).")
print("="*90)

### 6.5 Per-Class Performance

In [None]:
# Per-class F1 scores
from sklearn.metrics import classification_report

print("\n" + "="*70)
print("PER-CLASS PERFORMANCE (CNN)")
print("="*70)

print(classification_report(
    y_true, 
    y_pred_cnn, 
    target_names=TOPIC_CLASSES,
    zero_division=0
))

---
## 7. Summary

### 7.1 What was accomplished

1. **Loaded preprocessed data** from Lab 2 (parquet files)
2. **Loaded Word2Vec model** (GoogleNews-300) for word embeddings
3. **Created embedding matrices** instead of binary BoW vectors
4. **Implemented TextCNN** with:
   - 1D Convolution (100 filters, kernel size 3)
   - Global Max Pooling
   - Sigmoid activation for multi-label classification
5. **Trained and evaluated** the CNN model
6. **Compared results** with Lab 5's MLPClassifier and Naive Bayes

### 7.2 Key Findings

| Aspect | Lab 5 (MLPClassifier) | Lab 8 (CNN) |
|--------|----------------------|-------------|
| Input | Binary BoW (1000 dims) | Embedding matrix (50×300) |
| Word Semantics | Lost | Preserved |
| Sequential Info | Lost | Captured by Conv1D |
| Model Size | ~130K params | ~91K params |
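
The CNN's parameter count follows directly from the layer sizes in Section 5.1. The sketch below counts the weights and biases of the `Conv1d` and `Linear` layers, assuming 6 output classes. The Lab 5 figure depends on the MLP's hidden-layer size, which is not restated in this notebook, so it is not recomputed here.

```python
emb_dim, num_filters, kernel_size, num_classes = 300, 100, 3, 6

# Conv1d: one (emb_dim x kernel_size) weight block plus one bias per filter
conv_params = emb_dim * num_filters * kernel_size + num_filters

# Linear: one weight per (filter, class) pair plus one bias per class
fc_params = num_filters * num_classes + num_classes

total_params = conv_params + fc_params

print(f"Conv1d: {conv_params:,}")
print(f"Linear: {fc_params:,}")
print(f"Total:  {total_params:,}")
```

With the model from Section 5.3 in memory, `sum(p.numel() for p in model.parameters())` should report the same total.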

### 7.3 Advantages of CNN with Word Embeddings

1. **Semantic representation**: Word2Vec captures semantic relationships (e.g., "happy" is similar to "joyful")
2. **Local pattern detection**: Conv1D captures n-gram patterns in the text
3. **Transfer learning**: Pre-trained embeddings provide knowledge from large corpora
4. **Fewer parameters**: Conv1D filters share their weights across all sequence positions, so the parameter count does not grow with the sequence length
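
The first point relies on cosine similarity between word vectors. The sketch below shows the computation on made-up 3-dimensional vectors; the values in `toy` are hypothetical and only chosen so that "happy" and "joyful" point in a similar direction.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors; real Word2Vec vectors have 300 dimensions.
toy = {
    "happy":  [0.9, 0.8, 0.1],
    "joyful": [0.8, 0.9, 0.2],
    "table":  [0.1, 0.0, 0.9],
}

sim_close = cosine_similarity(toy["happy"], toy["joyful"])
sim_far = cosine_similarity(toy["happy"], toy["table"])

print(f"happy vs joyful: {sim_close:.3f}")
print(f"happy vs table:  {sim_far:.3f}")
```

With the model loaded in Section 3, `w2v.similarity("happy", "joyful")` computes the same quantity on the real vectors.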

### 7.4 Potential Improvements

- Use multiple kernel sizes (3, 4, 5) to capture different n-gram patterns
- Add more convolutional layers for deeper feature extraction
- Use pre-trained models like BERT for contextual embeddings
- Experiment with different pooling strategies
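
As a rough idea of what the first suggestion could look like, here is a minimal sketch of a multi-kernel variant. The class name `MultiKernelTextCNN` and its defaults are assumptions for illustration; it has not been trained or evaluated in this lab.

```python
import torch
import torch.nn as nn

class MultiKernelTextCNN(nn.Module):
    """Hypothetical TextCNN variant with one Conv1d branch per kernel size."""

    def __init__(self, emb_dim=300, num_classes=6, num_filters=100,
                 kernel_sizes=(3, 4, 5), dropout=0.5):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, k) for k in kernel_sizes]
        )
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, x):
        # x: (batch, seq_len, emb_dim) -> (batch, emb_dim, seq_len)
        x = x.permute(0, 2, 1)
        # Each branch: convolution, ReLU, global max pooling over positions
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        # Concatenate the pooled features of all branches
        features = torch.cat(pooled, dim=1)
        return torch.sigmoid(self.fc(self.dropout(features)))

# Shape check on a dummy batch of 2 documents
multi_model = MultiKernelTextCNN()
multi_model.eval()
with torch.no_grad():
    out = multi_model(torch.zeros(2, 50, 300))
print(out.shape)
```

Each branch detects n-grams of a different length, and the classifier sees the concatenation of all pooled features, so the fully connected layer grows from 100 to 300 inputs.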

In [None]:
# Final summary
print("="*70)
print("LAB 8 SUMMARY")
print("="*70)
print(f"\nModel: CNN with Word2Vec Embeddings")
print(f"Embedding dimension: {EMBEDDING_DIM}")
print(f"Max sequence length: {MAX_SEQ_LEN}")
print(f"Training samples: {len(df_train):,}")
print(f"Test samples: {len(df_test):,}")
print(f"Number of classes: {NUM_CLASSES}")
print(f"\nCNN Performance (Test Set):")
print(f"  Subset Accuracy: {cnn_metrics['Subset Accuracy']:.4f}")
print(f"  Micro F1: {cnn_metrics['Micro F1']:.4f}")
print(f"  Macro F1: {cnn_metrics['Macro F1']:.4f}")
print(f"  Hamming Loss: {cnn_metrics['Hamming Loss']:.4f}")
print("\n" + "="*70)