# Tutorial 8-1: Beyond Random Initialization – Transfer Learning with Word Embeddings

**Course:** CSEN 342: Deep Learning  
**Topic:** Word Embeddings, `nn.Embedding`, Transfer Learning, and Text Classification

## Objective
In the lecture, we learned that Word Embeddings (like Word2Vec or GloVe) capture semantic meaning. We also discussed **Transfer Learning** (Slide 17): the idea of taking these embeddings trained on billions of words and applying them to a smaller task.

In this tutorial, we will verify this experimentally using a **Real Dataset** (Subjectivity vs. Objectivity Analysis). We will train three models:
1.  **From Scratch:** Initialize embeddings randomly. The model knows nothing about English initially.
2.  **Static Transfer (Frozen):** Load GloVe embeddings and **freeze** them. The model only learns the classifier layer.
3.  **Fine-Tuning:** Initialize with GloVe but allow the embeddings to update during training.
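
In PyTorch terms, the difference between the frozen and fine-tuned strategies comes down to whether the embedding weight has `requires_grad` set. The following is a minimal sketch using a tiny made-up weight matrix; the values and the 4x3 shape are placeholders for illustration, not GloVe data.

```python
import torch
import torch.nn as nn

# Placeholder "pretrained" matrix: 4 words, 3 dimensions (illustrative values only)
toy_weights = torch.arange(12, dtype=torch.float32).reshape(4, 3)

# Strategy 1: random initialization, trainable
scratch = nn.Embedding(4, 3, padding_idx=0)
# Strategy 2: pretrained values, frozen (no gradient updates)
frozen = nn.Embedding.from_pretrained(toy_weights.clone(), freeze=True, padding_idx=0)
# Strategy 3: pretrained values, trainable (fine-tuning)
finetune = nn.Embedding.from_pretrained(toy_weights.clone(), freeze=False, padding_idx=0)

for name, layer in [("scratch", scratch), ("frozen", frozen), ("fine-tune", finetune)]:
    print(f"{name:10s} trainable={layer.weight.requires_grad}")
```

Only the frozen layer reports `trainable=False`; the other two receive gradient updates during training.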

---

## Part 1: Data Setup (Subjectivity Dataset)

We will use the **Subjectivity Dataset v1.0** (Pang/Lee 2004). It contains 5,000 subjective sentences (opinions) and 5,000 objective sentences (facts).

**Task:** Classify a sentence as Subjective (1) or Objective (0).

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
import os
import re
import tarfile

# 1. Download Data (Subjectivity Dataset)
data_root = '../data'
os.makedirs(data_root, exist_ok=True)
subj_url = "http://www.cs.cornell.edu/people/pabo/movie-review-data/rotten_imdb.tar.gz"
tar_path = os.path.join(data_root, 'rotten_imdb.tar.gz')
# The loading code below expects the archive to unpack these two files directly into data_root
subj_file = os.path.join(data_root, 'quote.tok.gt9.5000')  # Subjective (1)
obj_file = os.path.join(data_root, 'plot.tok.gt9.5000')    # Objective (0)

if not (os.path.exists(subj_file) and os.path.exists(obj_file)):
    print("Downloading Subjectivity Dataset...")
    # Use wget with -nc (no clobber) to avoid re-downloading the archive if it exists
    os.system(f"wget -nc -P {data_root} {subj_url}")

    print("Extracting...")
    with tarfile.open(tar_path, "r:gz") as tar:
        tar.extractall(path=data_root)
    print("Done.")

# 2. Load and Label Data

# Read files (handle latin-1 encoding common in older datasets)
with open(subj_file, 'r', encoding='latin-1') as f:
    subj_lines = [(line.strip(), 1) for line in f]
    
with open(obj_file, 'r', encoding='latin-1') as f:
    obj_lines = [(line.strip(), 0) for line in f]

# Combine and shuffle (fixed seed so the train/validation split is reproducible)
all_data = subj_lines + obj_lines
np.random.seed(42)
np.random.shuffle(all_data)

print(f"Total samples: {len(all_data)}")
print("Sample 1:", all_data[0])

### 1.1 Building the Vocabulary
We convert text to integers. We reserve `0` for padding and `1` for unknown words.
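
As a small illustration of this indexing scheme, the sketch below encodes a sentence with a hand-written toy vocabulary. The `toy_vocab` dictionary and the `encode` helper are hypothetical names used only for this example.

```python
import re

# Hypothetical toy vocabulary; ids 0 and 1 are reserved for padding and unknown words
toy_vocab = {"<PAD>": 0, "<UNK>": 1, "the": 2, "movie": 3, "is": 4, "great": 5}

def encode(text, vocab, max_len=6):
    """Lowercase, split on word characters, map to ids, then pad or truncate."""
    tokens = re.findall(r"\w+", text.lower())
    ids = [vocab.get(tok, vocab["<UNK>"]) for tok in tokens][:max_len]
    return ids + [vocab["<PAD>"]] * (max_len - len(ids))

# "astonishing" is not in the vocabulary, so it maps to <UNK> (1)
print(encode("The movie is astonishing", toy_vocab))  # [2, 3, 4, 1, 0, 0]
```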

In [None]:
def tokenizer(text):
    return re.findall(r"\w+", text.lower())

def build_vocab(data, max_size=10000):
    counter = Counter()
    for text, _ in data:
        counter.update(tokenizer(text))
            
    vocab = {"<PAD>": 0, "<UNK>": 1}
    # Add most common words
    for word, _ in counter.most_common(max_size - 2):
        vocab[word] = len(vocab)
    return vocab

vocab = build_vocab(all_data)
print(f"Vocabulary size: {len(vocab)}")

---

## Part 2: Loading Pre-trained Embeddings (GloVe)

We create an embedding matrix where row `i` corresponds to the GloVe vector for the word at index `i`.
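
To make the row alignment concrete, here is a minimal sketch with an in-memory stand-in for the GloVe file. The vectors in `toy_glove` are made up, and `build_matrix` is a hypothetical helper. Words missing from the pretrained table keep a random row.

```python
import numpy as np

toy_vocab = {"<PAD>": 0, "<UNK>": 1, "film": 2, "zzyzx": 3}
# Made-up 3-d vector standing in for an entry parsed from a GloVe text file
toy_glove = {"film": np.array([0.1, 0.2, 0.3], dtype="float32")}

def build_matrix(vocab, pretrained, emb_dim=3, seed=0):
    rng = np.random.default_rng(seed)
    # Start from random rows so out-of-GloVe words still get a usable vector
    matrix = rng.normal(scale=0.6, size=(len(vocab), emb_dim)).astype("float32")
    matrix[vocab["<PAD>"]] = 0.0            # padding row stays at zero
    for word, idx in vocab.items():
        if word in pretrained:
            matrix[idx] = pretrained[word]  # row idx receives the pretrained vector
    return matrix

m = build_matrix(toy_vocab, toy_glove)
print(m.shape)  # (4, 3)
```

Row 2 holds the "film" vector, row 0 is all zeros, and rows 1 and 3 remain random.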

In [None]:
# Download GloVe if needed
glove_path = os.path.join(data_root, 'glove.6B.50d.txt')
if not os.path.exists(glove_path):
    print("Downloading GloVe...")
    os.system(f"wget -nc -P {data_root} https://huggingface.co/stanfordnlp/glove/resolve/main/glove.6B.zip")
    os.system(f"unzip -o {data_root}/glove.6B.zip -d {data_root}")

def create_embedding_matrix(vocab, glove_path, emb_dim=50):
    # 1. Load GloVe into a dictionary
    glove_dict = {}
    with open(glove_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            glove_dict[word] = vector
            
    # 2. Create matrix initialized randomly
    matrix = np.random.normal(scale=0.6, size=(len(vocab), emb_dim))
    matrix[0] = np.zeros(emb_dim) # <PAD> is zero
    
    hits = 0
    for word, idx in vocab.items():
        if word in glove_dict:
            matrix[idx] = glove_dict[word]
            hits += 1
            
    print(f"GloVe coverage: {hits}/{len(vocab)} words found ({hits/len(vocab):.1%})")
    return torch.tensor(matrix, dtype=torch.float32)

embedding_weights = create_embedding_matrix(vocab, glove_path)

---

## Part 3: The Model (Deep Averaging Network)

We use a simple architecture that averages word vectors and passes them through a classifier. This is often called a **Deep Averaging Network (DAN)**.
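
The pooling step can be sketched in isolation. The model in this tutorial uses a plain `mean(dim=1)`; the variant below is one possible alternative that averages only over non-padding positions, assuming index 0 marks padding. The function name `masked_mean` is hypothetical.

```python
import torch

def masked_mean(embeds, token_ids, pad_idx=0):
    """Average embeddings over real tokens only.

    embeds:    (batch, seq_len, emb_dim)
    token_ids: (batch, seq_len) integer ids, with pad_idx marking padding
    """
    mask = (token_ids != pad_idx).unsqueeze(-1).to(embeds.dtype)  # (batch, seq_len, 1)
    summed = (embeds * mask).sum(dim=1)                           # (batch, emb_dim)
    counts = mask.sum(dim=1).clamp(min=1.0)                       # avoid division by zero
    return summed / counts

ids = torch.tensor([[5, 7, 0, 0]])  # two real tokens, two padding slots
embeds = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [0.0, 0.0], [0.0, 0.0]]])

print(masked_mean(embeds, ids))  # averages over the 2 real tokens
print(embeds.mean(dim=1))        # plain mean divides by all 4 positions
```

With zero padding vectors, the plain mean shrinks as a sentence gets shorter, while the masked mean does not depend on how much padding was added.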

In [None]:
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim, pretrained_weights=None, freeze=False):
        super().__init__()
        
        # Initialize Embedding Layer
        if pretrained_weights is not None:
            self.embedding = nn.Embedding.from_pretrained(pretrained_weights, padding_idx=0, freeze=freeze)
        else:
            self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        
        # Classifier Head
        self.fc = nn.Sequential(
            nn.Linear(emb_dim, 64),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(64, 2)
        )

    def forward(self, x):
        # x: (Batch, Seq_Len)
        embeds = self.embedding(x)
        # Average pooling over the sequence dimension.
        # Note: <PAD> positions (zero vectors) are included in the denominator.
        pooled = embeds.mean(dim=1)
        return self.fc(pooled)

---

## Part 4: The Experiment

We train three models to compare the impact of initialization strategy.

In [None]:
class SubjectivityDataset(Dataset):
    def __init__(self, data, vocab, max_len=30):
        self.processed_data = []
        for text, label in data:
            tokens = tokenizer(text)
            # Map to integers
            indices = [vocab.get(t, vocab["<UNK>"]) for t in tokens]
            # Pad or Truncate
            if len(indices) < max_len:
                indices += [vocab["<PAD>"]] * (max_len - len(indices))
            else:
                indices = indices[:max_len]
            self.processed_data.append((torch.tensor(indices), int(label)))

    def __len__(self): return len(self.processed_data)
    def __getitem__(self, idx): return self.processed_data[idx]

# Create Dataset and Loader
# Split 80/20
split_idx = int(len(all_data) * 0.8)
train_data = all_data[:split_idx]
val_data = all_data[split_idx:]

train_ds = SubjectivityDataset(train_data, vocab)
val_ds = SubjectivityDataset(val_data, vocab)
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=100)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def train_evaluate(strategy_name, weights=None, freeze=False):
    model = TextClassifier(len(vocab), 50, weights, freeze).to(device)
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    criterion = nn.CrossEntropyLoss()
    
    val_accuracies = []
    
    for epoch in range(15):
        model.train()
        for text, labels in train_loader:
            text, labels = text.to(device), labels.to(device)
            optimizer.zero_grad()
            output = model(text)
            loss = criterion(output, labels)
            loss.backward()
            optimizer.step()
            
        # Evaluate
        model.eval()
        correct = 0; total = 0
        with torch.no_grad():
            for text, labels in val_loader:
                text, labels = text.to(device), labels.to(device)
                output = model(text)
                _, predicted = torch.max(output, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()
        
        acc = 100 * correct / total
        val_accuracies.append(acc)
        
    print(f"  [{strategy_name}] final validation accuracy: {val_accuracies[-1]:.1f}%")
    return val_accuracies

print("Training 1: Random Init (From Scratch)...")
acc_scratch = train_evaluate("Scratch", weights=None)

print("Training 2: Static GloVe (Frozen)...")
acc_frozen = train_evaluate("Frozen", weights=embedding_weights, freeze=True)

print("Training 3: Fine-Tuning GloVe...")
acc_finetune = train_evaluate("Fine-Tune", weights=embedding_weights, freeze=False)

# Visualization
plt.figure(figsize=(10, 6))
plt.plot(acc_scratch, label='Random Init', linestyle='--')
plt.plot(acc_frozen, label='GloVe (Frozen)')
plt.plot(acc_finetune, label='GloVe (Fine-Tuned)', linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Validation Accuracy (%)')
plt.title('Transfer Learning on Subjectivity Dataset')
plt.legend()
plt.grid(True)
plt.show()

### Discussion
Because this is a real dataset (Subjectivity v1.0) of moderate size (10,000 sentences), the three curves usually separate clearly, although exact numbers vary from run to run:

1.  **Frozen GloVe:** Starts with high accuracy (a jump start) because the embeddings already encode general word semantics. It may plateau lower, since the vectors for task-relevant words (like "touching" or "masterpiece" in movie reviews) cannot shift.
2.  **Fine-Tuned GloVe:** Typically wins. It starts high (transfer learning) and continues to improve as it adapts the generic GloVe vectors to the subjective-versus-objective distinction in movie-related text.
3.  **Random Init:** Starts near 50% (guessing) because it has to learn word meanings from scratch. It may eventually catch up, but it takes longer.