# Lab 4: Text Classification

---
## 1. Notebook Overview

### 1.1 Focus of this notebook
We revisit the preprocessed tweets from Lab 2 and perform token analysis and classification:

1. Load the dataset and normalise the label lists.
2. Derive the 1000 most frequent tokens and save them for later use.
3. Train a Naive Bayes classifier for multi-label classification.

### 1.2 Dataset
- Source: `../Data/tweets_preprocessed_train.parquet` (output from Lab 2)
- Columns: `text` (whitespace-tokenised strings), `label_name` (list of categories), `label` (binary vector)

### 1.3 Output
- `../Data/top_1000_vocabulary.json` - The top 1000 tokens for use in Lab 5

### 1.4 Section overview
1. `Section 2` – Load/prepare the data frame.
2. `Section 3` – Reuse the Lab 3 helper classes (`UnigramLM`).
3. `Section 4` – Compute the top 1000 tokens and save to file.
4. `Section 5` – Train Naive Bayes classifier.
5. `Section 6` – Evaluation.

## 2. Data Loading & Preparation

### 2.1 Goal
Load the preprocessed training and test data from Lab 2 and standardise the label column.

### 2.2 Steps
1. Import libraries (Pandas, NumPy, collections helper).
2. Load training data from `../Data/tweets_preprocessed_train.parquet`.
3. Load test data from `../Data/tweets_preprocessed_test.parquet`.
4. Convert `label_name` into consistent Python lists.

In [1]:
import ast
import json
import os
from collections import Counter
from typing import List

import numpy as np
import pandas as pd

# Updated path to match Lab 2 output
DATA_PATH = "../Data/tweets_preprocessed_train.parquet"
TEST_DATA_PATH = "../Data/tweets_preprocessed_test.parquet"
VOCABULARY_OUTPUT_PATH = "../Data/top_1000_vocabulary.json"
RANDOM_STATE = 42


def load_dataset(path: str) -> pd.DataFrame:
    """Load tweets from parquet and normalise the label column."""
    df = pd.read_parquet(path)

    def parse_labels(value) -> List[str]:
        if isinstance(value, list):
            return [str(v) for v in value]
        if isinstance(value, tuple):
            return [str(v) for v in value]
        if isinstance(value, str):
            try:
                parsed = ast.literal_eval(value)
                if isinstance(parsed, (list, tuple)):
                    return [str(v) for v in parsed]
            except (ValueError, SyntaxError):
                return [value]
        return [str(value)]

    df = df.copy()
    df["labels"] = df["label_name"].apply(parse_labels)
    df["label_count"] = df["labels"].apply(len)
    df["primary_label"] = df["labels"].apply(lambda items: items[0] if items else "unknown")
    return df


# Load data
df_raw = load_dataset(DATA_PATH)
print(f"Loaded {len(df_raw):,} documents from {DATA_PATH}.")
print(f"Columns: {df_raw.columns.tolist()}")
print(df_raw.head(3))

# Create single-label subset for vocabulary extraction
single_label_df = df_raw[df_raw["label_count"] == 1][["text", "primary_label"]].rename(
    columns={"primary_label": "label"}
)
print(f"\nSingle-label subset: {len(single_label_df):,} rows")

Loaded 6,090 documents from ../Data/tweets_preprocessed_train.parquet.
Columns: ['text', 'label_name', 'label', 'labels', 'label_count', 'primary_label']
                                                text  label_name  \
0  lumber beat rapid game western division final ...  ['sports']   
1         hear eli gold announce auburn game dumbass  ['sports']   
2       phone away try look home game ticket october  ['sports']   

                                               label    labels  label_count  \
0  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...  [sports]            1   
1  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...  [sports]            1   
2  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...  [sports]            1   

  primary_label  
0        sports  
1        sports  
2        sports  

Single-label subset: 6,089 rows


In [2]:
# Load preprocessed test data from Lab 2
df_test = load_dataset(TEST_DATA_PATH)
print(f"Loaded {len(df_test):,} test documents from {TEST_DATA_PATH}")

# Use all training data for training
X_train = df_raw["text"]
y_train = df_raw["labels"]

# Use preprocessed test data for evaluation
X_test = df_test["text"]
y_test = df_test["labels"]

print(f"Training set: {len(X_train):,} samples")
print(f"Test set: {len(X_test):,} samples")

Loaded 1,679 test documents from ../Data/tweets_preprocessed_test.parquet
Training set: 6,090 samples
Test set: 1,679 samples


## 3. Reusing Language-Model Helpers (Lab 3)

### 3.1 Background
`lab3.ipynb` defined a `UnigramLM` class that counts token frequencies and computes Laplace-smoothed log probabilities. We reuse the same implementation here to keep the logic consistent.

### 3.2 How it works
- `ensure_tokens` converts strings to token lists.
- `UnigramLM` aggregates token counts (`self.unigram_counts`) across the corpus.
- Calling `.unigram_counts.most_common(n)` returns the top-n tokens along with their frequencies.
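
As a small illustration of the points above, the following sketch counts tokens in a made-up corpus and applies the same add-one formula that `log_prob` uses, namely `log((count + 1) / (total_tokens + vocab_size))`. The toy corpus and the helper name `laplace_log_prob` are assumptions for illustration only and are not part of Lab 3.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not taken from the tweet dataset)
toy_corpus = ["game win game", "music new", "game music"]

counts = Counter()
for sentence in toy_corpus:
    counts.update(sentence.split())

total_tokens = sum(counts.values())  # 7 tokens overall
vocab_size = len(counts)             # 4 distinct tokens


def laplace_log_prob(word: str) -> float:
    """Add-one smoothed log probability: log((count + 1) / (N + |V|))."""
    return math.log((counts.get(word, 0) + 1) / (total_tokens + vocab_size))


top_two = counts.most_common(2)
print(top_two)                     # [('game', 3), ('music', 2)]
print(laplace_log_prob("game"))    # log(4 / 11)
print(laplace_log_prob("unseen"))  # log(1 / 11), finite thanks to smoothing
```

Because of the `+ 1` in the numerator, a token that never occurred still receives a small non-zero probability instead of `log(0)`.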

In [3]:
import math
from collections import Counter
from typing import List, Sequence, Union


def ensure_tokens(sentence: Union[Sequence[str], str]) -> List[str]:
    """Convert whitespace-separated text or token sequences into a list."""
    if isinstance(sentence, str):
        sentence = sentence.split()
    return list(sentence)


class UnigramLM:
    """Laplace-smoothed unigram language model operating in log-space."""

    def __init__(self, corpus: Sequence[Sequence[str]]):
        self.unigram_counts = Counter()
        self.total_tokens = 0
        self.vocab = set()

        for sentence in corpus:
            tokens = ensure_tokens(sentence)
            self.unigram_counts.update(tokens)
            self.total_tokens += len(tokens)
            self.vocab.update(tokens)

        if self.total_tokens == 0:
            raise ValueError("Cannot train a UnigramLM on an empty corpus.")

        self.vocab_size = len(self.vocab)

    def log_prob(self, word: str) -> float:
        count = self.unigram_counts.get(word, 0)
        return math.log((count + 1) / (self.total_tokens + self.vocab_size))

    def sentence_log_prob(self, sentence: Union[Sequence[str], str]) -> float:
        tokens = ensure_tokens(sentence)
        if not tokens:
            return float('-inf')
        return sum(self.log_prob(token) for token in tokens)

## 4. Task – Top 1000 Tokens

### 4.1 Goal
Identify the most frequent tokens in the preprocessed corpus and save them for use in Lab 5 (Neural Network).

### 4.2 Approach
1. Train the `UnigramLM` on the single-label subset.
2. Retrieve `most_common(1000)` and inspect the first items.
3. Save the vocabulary to `../Data/top_1000_vocabulary.json`.
4. Optionally preview top tokens per class.

In [4]:
MAX_FEATURES = 1000

# Build vocabulary from single-label subset
corpus_tokens = [ensure_tokens(text) for text in single_label_df["text"]]
unigram_model = UnigramLM(corpus_tokens)

# Extract top 1000 tokens
top_unigrams = unigram_model.unigram_counts.most_common(MAX_FEATURES)
TOP_VOCABULARY = [token for token, _ in top_unigrams]

print(f"Collected top {len(top_unigrams)} tokens (showing the first 20):")
for token, freq in top_unigrams[:20]:
    print(f"  {token:<15} -> {freq}")

# Per-class token preview for the three most frequent labels
label_counts = single_label_df["label"].value_counts().head(3)
print("\nPer-class token preview (Top 10 tokens for the most frequent labels):")
for label, count in label_counts.items():
    label_corpus = [ensure_tokens(text) for text in single_label_df.loc[single_label_df["label"] == label, "text"]]
    label_model = UnigramLM(label_corpus)
    label_top = label_model.unigram_counts.most_common(10)
    formatted = ", ".join([f"{tok} ({freq})" for tok, freq in label_top])
    print(f"- {label} ({count} docs): {formatted}")

print(f"\nVocabulary size: {len(TOP_VOCABULARY)}")

Collected top 1000 tokens (showing the first 20):
  new             -> 598
  day             -> 515
  love            -> 514
  good            -> 453
  game            -> 439
  year            -> 414
  time            -> 401
  watch           -> 385
  happy           -> 361
  music           -> 346
  come            -> 330
  like            -> 322
  win             -> 311
  live            -> 307
  thank           -> 303
  great           -> 297
  go              -> 295
  video           -> 287
  play            -> 277
  world           -> 268

Per-class token preview (Top 10 tokens for the most frequent labels):
- sports (1181 docs): game (255), win (180), team (147), ufc (111), good (108), today (91), go (85), time (85), vs (84), final (82)
- news_&_social_concern (625 docs): trump (100), president (77), news (60), people (56), black (49), change (45), world (45), woman (44), year (42), day (42)
- music (439 docs): new (151), music (148), album (111), song (83), live (58), video (56)

In [5]:
# Save the top 1000 vocabulary to JSON file for use in Lab 5
os.makedirs(os.path.dirname(VOCABULARY_OUTPUT_PATH), exist_ok=True)

vocabulary_data = {
    "description": "Top 1000 most frequent tokens from preprocessed tweets (Lab 4)",
    "source": DATA_PATH,
    "count": len(TOP_VOCABULARY),
    "tokens": TOP_VOCABULARY
}

with open(VOCABULARY_OUTPUT_PATH, 'w', encoding='utf-8') as f:
    json.dump(vocabulary_data, f, ensure_ascii=False, indent=2)

print(f"✓ Vocabulary saved to: {VOCABULARY_OUTPUT_PATH}")
print(f"✓ Contains {len(TOP_VOCABULARY)} tokens")
print(f"✓ First 10 tokens: {TOP_VOCABULARY[:10]}")

✓ Vocabulary saved to: ../Data/top_1000_vocabulary.json
✓ Contains 1000 tokens
✓ First 10 tokens: ['new', 'day', 'love', 'good', 'game', 'year', 'time', 'watch', 'happy', 'music']
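
The saved file can be read back with the standard `json` module. The sketch below shows one way a later notebook might do this; it writes a small stand-in file with the same keys to a temporary directory so that it runs on its own. The example tokens, the temporary path, and the `token_to_index` mapping are assumptions for illustration, since the Lab 5 code is not shown here.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the saved vocabulary file (same keys as above)
example_data = {
    "description": "example vocabulary",
    "source": "example.parquet",
    "count": 3,
    "tokens": ["new", "day", "love"],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vocabulary.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(example_data, f, ensure_ascii=False, indent=2)

    # Reading the file back, as a consumer of the vocabulary might do
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

loaded_tokens = loaded["tokens"]

# Cheap sanity check: the stored count should match the token list
if loaded["count"] != len(loaded_tokens):
    raise ValueError("Vocabulary file is inconsistent: count does not match tokens.")

# JSON lists keep their order, so positions can serve as feature indices
token_to_index = {token: idx for idx, token in enumerate(loaded_tokens)}
print(token_to_index)
```

Keeping the token order is what allows the same list to define identical feature columns in another notebook.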


## 5. Naive Bayes Classification

### 5.1 Goal
Train a Naive Bayes classifier using the top 1000 vocabulary as features.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

# Use the top 1000 vocabulary for feature extraction
vectorizer = CountVectorizer(
    vocabulary=TOP_VOCABULARY,
    lowercase=True,
    token_pattern=r"(?u)\b\w+\b"
)

# Transform text to Bag-of-Words.
# The vocabulary is fixed, so the test set is only transformed, never fitted.
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

print(f"Feature matrix (Train): {X_train_bow.shape}")
print(f"Feature matrix (Test): {X_test_bow.shape}")
print(f"Sample features: {vectorizer.get_feature_names_out()[:10]}")

Feature matrix (Train): (6090, 1000)
Feature matrix (Test): (1679, 1000)
Sample features: ['new' 'day' 'love' 'good' 'game' 'year' 'time' 'watch' 'happy' 'music']
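
To make the effect of a fixed vocabulary concrete, here is a plain-Python approximation of the counting step. It is a sketch, not scikit-learn's implementation: it splits on whitespace instead of applying the regular-expression `token_pattern`, and the function name `bow_vector` and the toy vocabulary are assumptions for illustration.

```python
from typing import Dict, List


def bow_vector(text: str, vocabulary: List[str]) -> List[int]:
    """Count how often each vocabulary token occurs in `text`.

    Column i of the result belongs to vocabulary[i]; tokens that are
    not in the vocabulary are silently ignored.
    """
    index: Dict[str, int] = {token: i for i, token in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for token in text.lower().split():
        if token in index:
            vector[index[token]] += 1
    return vector


toy_vocabulary = ["new", "day", "game"]  # hypothetical three-token vocabulary
print(bow_vector("game day game tonight", toy_vocabulary))  # [0, 1, 2]
```

The column order follows the vocabulary list, which matches the `Sample features` output above starting with `new`, `day`, `love`. Out-of-vocabulary tokens such as `tonight` contribute nothing, so a tweet containing none of the 1000 tokens becomes an all-zero row.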


In [7]:
from sklearn.preprocessing import MultiLabelBinarizer

# Multi-Label encoding
mlb = MultiLabelBinarizer()
y_train_bin = mlb.fit_transform(y_train)
y_test_bin = mlb.transform(y_test)

print(f"Number of classes: {len(mlb.classes_)}")
print(f"Classes: {mlb.classes_[:10]}...")
print(f"y_train shape: {y_train_bin.shape}")
print(f"y_test shape: {y_test_bin.shape}")

Number of classes: 449
Classes: ['arts_&_culture' 'arts_&_culturebusiness_&_entrepreneurs'
 'arts_&_culturebusiness_&_entrepreneurscelebrity_&_pop_culturefilm_tv_&_video'
 'arts_&_culturebusiness_&_entrepreneursdiaries_&_daily_life'
 'arts_&_culturebusiness_&_entrepreneursdiaries_&_daily_lifefamilyrelationships'
 'arts_&_culturebusiness_&_entrepreneursfilm_tv_&_video'
 'arts_&_culturebusiness_&_entrepreneursfood_&_dining'
 'arts_&_culturecelebrity_&_pop_culture'
 'arts_&_culturecelebrity_&_pop_culturediaries_&_daily_lifefilm_tv_&_video'
 'arts_&_culturecelebrity_&_pop_culturediaries_&_daily_lifemusic']...
y_train shape: (6090, 449)
y_test shape: (1679, 449)
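
Several class names in the output above, such as `arts_&_culturebusiness_&_entrepreneurs`, look like multiple category names joined together. One possible cause, assuming that multi-label rows store `label_name` as a NumPy-style string without commas (the raw multi-label values are not shown in this notebook, so this is an assumption), is that `ast.literal_eval` concatenates adjacent string literals just as the Python parser does. The sketch below demonstrates the effect and one hedged alternative; `parse_label_string` is a hypothetical helper, not part of the lab code.

```python
import ast
import re
from typing import List

# Assumed raw value: a list-like string whose items are not comma-separated
raw = "['news_&_social_concern' 'sports']"

# Adjacent string literals are merged into a single string
merged = ast.literal_eval(raw)
print(merged)  # ['news_&_social_concernsports']

# Matches every quoted item, whether or not commas separate the items
QUOTED_ITEM = re.compile(r"""['"]([^'"]+)['"]""")


def parse_label_string(value: str) -> List[str]:
    """Return every quoted item found in a list-like string."""
    return QUOTED_ITEM.findall(value)


print(parse_label_string(raw))            # ['news_&_social_concern', 'sports']
print(parse_label_string("['sports']"))   # ['sports']
```

If this is indeed what happens, each distinct label combination would be treated as its own class, which would be consistent with the 449 classes reported above. The regular expression assumes that label names contain no quote characters.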




In [8]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier

# Train Multi-Label Naive Bayes
nb_clf = OneVsRestClassifier(MultinomialNB(alpha=1.0))
nb_clf.fit(X_train_bow, y_train_bin)

print("✅ Multi-Label Naive Bayes model trained!")

✅ Multi-Label Naive Bayes model trained!
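
`OneVsRestClassifier` fits one binary classifier per label column. The following plain-Python sketch shows roughly what a single such binary multinomial Naive Bayes model estimates with smoothing `alpha=1.0`. It is a simplified illustration under stated assumptions, not scikit-learn's implementation: the function names, the toy documents, and the `sports` target vector are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def train_binary_nb(docs: Sequence[List[str]], targets: Sequence[int],
                    vocabulary: List[str], alpha: float = 1.0) -> Dict[int, dict]:
    """Estimate log priors and smoothed token log-likelihoods for classes 0 and 1.

    Assumes that both classes occur at least once in `targets`.
    """
    vocab = set(vocabulary)
    model: Dict[int, dict] = {}
    for cls in (0, 1):
        cls_docs = [doc for doc, target in zip(docs, targets) if target == cls]
        counts = Counter(tok for doc in cls_docs for tok in doc if tok in vocab)
        total = sum(counts.values())
        model[cls] = {
            "log_prior": math.log(len(cls_docs) / len(docs)),
            "log_likelihood": {
                tok: math.log((counts[tok] + alpha) / (total + alpha * len(vocab)))
                for tok in vocab
            },
        }
    return model


def predict_binary_nb(model: Dict[int, dict], doc: List[str]) -> int:
    """Return the class with the higher log posterior; unknown tokens are skipped."""
    scores = {
        cls: params["log_prior"]
        + sum(params["log_likelihood"][tok] for tok in doc if tok in params["log_likelihood"])
        for cls, params in model.items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy data: 1 means the tweet carries the label "sports"
toy_docs = [["game", "win"], ["game", "team"], ["album", "song"], ["music", "album"]]
is_sports = [1, 1, 0, 0]
toy_vocab = ["game", "win", "team", "album", "song", "music"]

sports_model = train_binary_nb(toy_docs, is_sports, toy_vocab)
print(predict_binary_nb(sports_model, ["game", "song"]))    # 1
print(predict_binary_nb(sports_model, ["album", "music"]))  # 0
```

In the one-vs-rest setup, a model of this kind is trained independently for every label, and a tweet receives each label whose binary model votes for it.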


In [9]:
# Sample predictions
y_pred_bin = nb_clf.predict(X_test_bow)
y_pred_labels = mlb.inverse_transform(y_pred_bin)

print("Sample predictions:")
for text, pred in zip(X_test[:5], y_pred_labels[:5]):
    print(f"Text: {text[:70]}...")
    print(f"→ Predicted labels: {pred}\n")

Sample predictions:
Text: philadelphia clearly page game playbook fire net oppose goalie beat mi...
→ Predicted labels: ('sports',)

Text: sure bay face flyer man experience versus blue jacket year help lot ve...
→ Predicted labels: ('sports',)

Text: tizamagician put cherry kentucky derby day winner pie take del mar fin...
→ Predicted labels: ('sports',)

Text: flyer give false hope absolutely destroy islander go to destroy real t...
→ Predicted labels: ('sports',)

Text: flyer tremendous season face excited season go to well thank unforgett...
→ Predicted labels: ('sports',)



## 6. Evaluation

Evaluate the Naive Bayes classifier on the test set.

In [10]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, hamming_loss

# Predictions
y_pred_bin = nb_clf.predict(X_test_bow)

# Metrics
print("=" * 60)
print("NAIVE BAYES EVALUATION (Test Set)")
print("=" * 60)
print(f"{'Subset Accuracy':<20}: {accuracy_score(y_test_bin, y_pred_bin):.4f}")
print(f"{'Hamming Loss':<20}: {hamming_loss(y_test_bin, y_pred_bin):.4f}")
print(f"{'Micro F1':<20}: {f1_score(y_test_bin, y_pred_bin, average='micro', zero_division=0):.4f}")
print(f"{'Macro F1':<20}: {f1_score(y_test_bin, y_pred_bin, average='macro', zero_division=0):.4f}")
print(f"{'Micro Precision':<20}: {precision_score(y_test_bin, y_pred_bin, average='micro', zero_division=0):.4f}")
print(f"{'Micro Recall':<20}: {recall_score(y_test_bin, y_pred_bin, average='micro', zero_division=0):.4f}")

NAIVE BAYES EVALUATION (Test Set)
Subset Accuracy     : 0.2662
Hamming Loss        : 0.0024
Micro F1            : 0.4246
Macro F1            : 0.0168
Micro Precision     : 0.4562
Micro Recall        : 0.3971
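
The metrics above weight errors very differently. The sketch below recomputes four of them by hand on a tiny made-up example, using the standard definitions and treating an undefined F1 as 0 (as `zero_division=0` does). The matrices and helper names are assumptions for illustration and are unrelated to the tweet data.

```python
from typing import List

Matrix = List[List[int]]


def subset_accuracy(y_true: Matrix, y_pred: Matrix) -> float:
    """Share of samples whose complete label row is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def hamming(y_true: Matrix, y_pred: Matrix) -> float:
    """Share of individual label cells that are wrong."""
    cells = sum(len(row) for row in y_true)
    wrong = sum(t != p for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr))
    return wrong / cells


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # undefined F1 counts as 0


def label_counts(y_true: Matrix, y_pred: Matrix, j: int):
    """True positives, false positives and false negatives for label column j."""
    tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn


def micro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """Pool the counts of all labels, then compute one F1."""
    per_label = [label_counts(y_true, y_pred, j) for j in range(len(y_true[0]))]
    tp, fp, fn = (sum(values) for values in zip(*per_label))
    return f1_from_counts(tp, fp, fn)


def macro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """Compute F1 per label, then take the unweighted mean."""
    n_labels = len(y_true[0])
    return sum(f1_from_counts(*label_counts(y_true, y_pred, j)) for j in range(n_labels)) / n_labels


# Hypothetical example: 3 samples, 3 labels; only the first label is ever predicted
toy_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0]]
toy_pred = [[1, 0, 0], [0, 0, 0], [1, 0, 0]]

print(subset_accuracy(toy_true, toy_pred))  # 1 of 3 rows matches exactly
print(hamming(toy_true, toy_pred))          # 2 of 9 cells are wrong
print(micro_f1(toy_true, toy_pred))         # pooled counts: tp=2, fp=0, fn=2
print(macro_f1(toy_true, toy_pred))         # per-label F1 values 1, 0, 0 averaged
```

Hamming loss divides by samples times labels, so with 449 mostly empty label columns it stays small even when many tweets are misclassified. Macro F1 gives every label equal weight, so labels that are never predicted pull the average towards zero, which is one plausible reading of the gap between Micro F1 and Macro F1 reported above.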


In [11]:
# Detailed sample comparison
y_true_labels = mlb.inverse_transform(y_test_bin)

print("\nDetailed sample predictions:")
for i, (text, true, pred) in enumerate(zip(X_test[:5], y_true_labels[:5], y_pred_labels[:5])):
    match = "✓" if set(true) == set(pred) else "✗"
    print(f"\n{match} Sample {i+1}:")
    print(f"   Text: {text[:80]}...")
    print(f"   True: {true}")
    print(f"   Pred: {pred}")


Detailed sample predictions:

✗ Sample 1:
   Text: philadelphia clearly page game playbook fire net oppose goalie beat minute leave...
   True: ('gamingnews_&_social_concernsports',)
   Pred: ('sports',)

✓ Sample 2:
   Text: sure bay face flyer man experience versus blue jacket year help lot versus islan...
   True: ('sports',)
   Pred: ('sports',)

✗ Sample 3:
   Text: tizamagician put cherry kentucky derby day winner pie take del mar finale richar...
   True: ('news_&_social_concernsports',)
   Pred: ('sports',)

✗ Sample 4:
   Text: flyer give false hope absolutely destroy islander go to destroy real team series...
   True: ('news_&_social_concernsports',)
   Pred: ('sports',)

✗ Sample 5:
   Text: flyer tremendous season face excited season go to well thank unforgettable seaso...
   True: ('news_&_social_concernsports',)
   Pred: ('sports',)


## 7. Summary

### What was accomplished
1. Loaded preprocessed data from Lab 2
2. Extracted the top 1000 most frequent tokens
3. **Saved vocabulary to `../Data/top_1000_vocabulary.json`** for use in Lab 5
4. Trained a Naive Bayes multi-label classifier
5. Evaluated performance on test set

### Files created
- `../Data/top_1000_vocabulary.json` - Contains the top 1000 tokens for feature extraction in Lab 5

In [12]:
print("=" * 60)
print("LAB 4 SUMMARY")
print("=" * 60)
print(f"Input: {DATA_PATH}")
print(f"Output: {VOCABULARY_OUTPUT_PATH}")
print(f"Vocabulary size: {len(TOP_VOCABULARY)}")
print(f"Training samples: {len(X_train):,}")
print(f"Test samples: {len(X_test):,}")
print("=" * 60)

LAB 4 SUMMARY
Input: ../Data/tweets_preprocessed_train.parquet
Output: ../Data/top_1000_vocabulary.json
Vocabulary size: 1000
Training samples: 6,090
Test samples: 1,679
