# Lab 4 - Text Classification with Naive Bayes

## Notebook Overview
This notebook rebuilds Lab 4 from scratch and walks through the required tasks: inspect class labels, extract the top-1000 tokens, build a Naive Bayes classifier on those features, split the data, and evaluate the model with both quantitative metrics and manual inspection.

## Task Checklist
1. **Class labels** – verify that every tweet carries a usable label.
2. **Top 1000 words** – derive the most frequent tokens (global and per-class preview).
3. **Feature definition** – restrict the bag-of-words representation to those 1000 tokens; ignore out-of-vocabulary tokens and allow empty vectors.
4. **Train/test protocol** – stratified split, train Naive Bayes on the training set, evaluate on the held-out test set.
5. **Evaluation & inspection** – compute accuracy/precision/recall/F1, confusion matrix, and manually examine sample predictions.

## 1. Data Loading and Label Audit
We rely on the preprocessed corpus from Lab 2 (`Abgabe/Data/tweets_preprocessed_train.parquet`). Keeping the cleaned dataset ensures the token frequencies remain consistent with earlier labs and avoids vocabulary mismatches.

In [48]:
import ast
from collections import Counter
from pathlib import Path
from typing import List

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

DATA_PATH = Path("../Data/tweets_preprocessed_train.parquet")
RANDOM_STATE = 42
MAX_FEATURES = 1000

In [49]:
def load_preprocessed_dataset(path: Path) -> pd.DataFrame:
    """Load the parquet export and normalise label columns."""
    df = pd.read_parquet(path)
    df["label_name"] = df["label_name"].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
    df["primary_label"] = df["label_name"].apply(lambda labels: labels[0] if labels else "unknown")
    return df

corpus_df = load_preprocessed_dataset(DATA_PATH)
print(f"Loaded {len(corpus_df):,} tweets from {DATA_PATH}.")
corpus_df.head()

Loaded 6,090 tweets from ../Data/tweets_preprocessed_train.parquet.


| | text | label_name | label | primary_label |
|---|---|---|---|---|
| 0 | lumber beat rapid game western division final ... | [sports] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | sports |
| 1 | hear eli gold announce auburn game dumbass | [sports] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | sports |
| 2 | phone away try look home game ticket october | [sports] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | sports |
| 3 | year ago louisville struggle beat fcs opponent... | [sports] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | sports |
| 4 | know dodger oriole game thursday fox arguably ... | [sports] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | sports |


### 1.1 Class distribution
We inspect the label counts to confirm that every example has a class and to understand potential imbalance before splitting.

In [50]:
label_counts = corpus_df["primary_label"].value_counts().sort_values(ascending=False)
label_summary = label_counts.to_frame(name="count")
label_summary["share_%"] = (label_summary["count"] / label_summary["count"].sum() * 100).round(2)
label_summary

| primary_label | count | share_% |
|---|---|---|
| sports | 1181 | 19.39 |
| news_&_social_concern | 625 | 10.26 |
| music | 439 | 7.21 |
| film_tv_&_video | 269 | 4.42 |
| celebrity_&_pop_culturemusic | 265 | 4.35 |
| ... | ... | ... |
| arts_&_culturesports | 1 | 0.02 |
| learning_&_educationalmusic | 1 | 0.02 |
| familynews_&_social_concernrelationshipstravel_&_adventure | 1 | 0.02 |
| business_&_entrepreneursgamingscience_&_technology | 1 | 0.02 |
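
The names in the tail of this table (for example `arts_&_culturesports`) look like several topic names joined without a separator. One possible cause, offered as an assumption rather than a diagnosis of this dataset: if `label_name` was stored as the string form of a NumPy array, which omits commas, `ast.literal_eval` applies Python's adjacent-string-literal concatenation and returns a single fused label. The sketch below reproduces that effect; `parse_label_list` is a hypothetical helper, not part of the lab code.

```python
import ast
import re
from typing import List

# String form of a NumPy array of labels: items separated by spaces, not commas.
numpy_style = "['arts_&_culture' 'sports']"
# String form of a Python list: items separated by commas.
list_style = "['arts_&_culture', 'sports']"

# The Python parser concatenates adjacent string literals,
# so the comma-less form collapses into one fused label.
fused = ast.literal_eval(numpy_style)
separate = ast.literal_eval(list_style)


def parse_label_list(raw: str) -> List[str]:
    """Hypothetical helper: extract every single-quoted item, with or without commas."""
    return re.findall(r"'([^']*)'", raw)


print(fused)                          # ['arts_&_culturesports']
print(separate)                       # ['arts_&_culture', 'sports']
print(parse_label_list(numpy_style))  # ['arts_&_culture', 'sports']
```

If this is what happened here, splitting the labels before taking `labels[0]` would change every downstream count, so the numbers in this notebook describe the combined labels only.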


### 1.2 Handling ultra-rare labels
A stratified split needs at least two examples per class, and very small classes give unreliable per-class metrics. We therefore keep only labels with **≥ 30 tweets**, which leaves 4,540 of the 6,090 tweets and removes the long tail of one-off topics mentioned in the instructions.

In [51]:
MIN_LABEL_FREQ = 30
valid_labels = label_counts[label_counts >= MIN_LABEL_FREQ].index

# create the filtered working DataFrame from the original corpus
work_df = corpus_df[corpus_df["primary_label"].isin(valid_labels)].reset_index(drop=True)
filtered_label_counts = work_df["primary_label"].value_counts().sort_values(ascending=False)

print(f"Rows kept: {len(work_df)} (dropped {len(corpus_df) - len(work_df)} tweets with extremely rare labels)")
filtered_label_counts.head()

Rows kept: 4540 (dropped 1550 tweets with extremely rare labels)


primary_label
sports                          1181
news_&_social_concern            625
music                            439
film_tv_&_video                  269
celebrity_&_pop_culturemusic     265
Name: count, dtype: int64

## 2. Top 1,000 Tokens
The text column already contains whitespace-tokenised, lower-case tweets. We count token frequencies globally and freeze the top 1,000 as our vocabulary for the classifier.

In [52]:
def tokenize(text: str) -> List[str]:
    return text.split()

all_tokens = work_df["text"].apply(tokenize)
token_counter = Counter(token for tokens in all_tokens for token in tokens)

top_tokens = [token for token, _ in token_counter.most_common(MAX_FEATURES)]
print(f"Vocabulary size: {len(top_tokens)}")
print("Top 100 tokens:")
for token, count in token_counter.most_common(100):
    print(f"  {token:<15} -> {count}")

Vocabulary size: 1000
Top 100 tokens:
  new             -> 475
  game            -> 379
  day             -> 355
  good            -> 349
  love            -> 345
  music           -> 309
  year            -> 305
  watch           -> 286
  time            -> 278
  win             -> 276
  like            -> 248
  come            -> 246
  live            -> 246
  video           -> 239
  go              -> 236
  happy           -> 222
  play            -> 217
  great           -> 214
  world           -> 208
  thank           -> 207
  team            -> 192
  today           -> 189
  get             -> 189
  look            -> 171
  know            -> 165
  need            -> 165
  man             -> 163
  people          -> 159
  song            -> 154
  trump           -> 154
  news            -> 154
  album           -> 154
  think           -> 153
  night           -> 152
  say             -> 149
  work            -> 147
  woman           -> 141
  want            -> 139
  check     

In [53]:
VOCAB_PATH = Path("../Data/top_1000_words.csv")
vocab_df = pd.DataFrame({
    "rank": range(1, len(top_tokens) + 1),
    "word": top_tokens
})
vocab_df.to_csv(VOCAB_PATH, index=False)
print(f"Saved {len(vocab_df)} tokens to {VOCAB_PATH}")
vocab_df.head(10)


Saved 1000 tokens to ../Data/top_1000_words.csv


| | rank | word |
|---|---|---|
| 0 | 1 | new |
| 1 | 2 | game |
| 2 | 3 | day |
| 3 | 4 | good |
| 4 | 5 | love |
| 5 | 6 | music |
| 6 | 7 | year |
| 7 | 8 | watch |
| 8 | 9 | time |
| 9 | 10 | win |


### 2.1 Optional per-class preview
To check that the vocabulary covers more than the largest class, we list the five most frequent tokens for each of the ten most frequent labels.

In [54]:
per_class_top = {}
for label in filtered_label_counts.head(10).index:
    mask = work_df["primary_label"] == label
    label_tokens = Counter(token for tokens in all_tokens[mask] for token in tokens)
    per_class_top[label] = label_tokens.most_common(5)

per_class_top

{'sports': [('game', 255),
  ('win', 180),
  ('team', 147),
  ('ufc', 111),
  ('good', 108)],
 'news_&_social_concern': [('trump', 100),
  ('president', 77),
  ('news', 60),
  ('people', 56),
  ('black', 49)],
 'music': [('new', 151),
  ('music', 148),
  ('album', 111),
  ('song', 83),
  ('live', 58)],
 'film_tv_&_video': [('movie', 70),
  ('watch', 61),
  ('bad', 48),
  ('star', 41),
  ('break', 34)],
 'celebrity_&_pop_culturemusic': [('love', 58),
  ('new', 47),
  ('song', 46),
  ('music', 44),
  ('watermelon', 34)],
 'news_&_social_concernsports': [('trump', 15),
  ('ufc', 13),
  ('player', 13),
  ('news', 12),
  ('say', 11)],
 'celebrity_&_pop_culturefilm_tv_&_video': [('love', 18),
  ('star', 18),
  ('movie', 14),
  ('watch', 13),
  ('like', 12)],
 'film_tv_&_videomusic': [('video', 67),
  ('music', 65),
  ('new', 28),
  ('official', 22),
  ('watch', 21)],
 'diaries_&_daily_lifenews_&_social_concern': [('day', 30),
  ('woman', 25),
  ('happy', 16),
  ('time', 14),
  ('black', 12)]

## 3. Feature Extraction Setup
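We build a bag-of-words encoder with `CountVectorizer`, capped at 1,000 unigrams. Tokens outside the feature set are ignored at transform time (our answer to the "what to do about words not in the feature set" question). Tweets without any in-vocabulary token become zero-vectors, which Multinomial Naive Bayes can still process: the prediction then rests on the class priors alone.

### 3.1 CountVectorizer configuration
We keep the feature space at 1,000 unigrams via `max_features=MAX_FEATURES`. Note that the vectorizer selects its own 1,000 most frequent terms using its default token pattern, which drops single-character tokens. The features are therefore deterministic across notebook runs, but they are not guaranteed to match the saved `top_1000_words.csv` exactly; passing `vocabulary=top_tokens` would pin them to that file. The vectorizer is also fitted on all of `work_df` before the split, so the test tweets contribute to the feature selection.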

In [55]:
vectorizer = CountVectorizer(
    lowercase=False,
    max_features=MAX_FEATURES,
    ngram_range=(1, 1)
)
features = vectorizer.fit_transform(work_df["text"])
print(f"Feature matrix shape: {features.shape}")
feature_names = vectorizer.get_feature_names_out()
print("Sample features:", feature_names[:10])
labels = work_df["primary_label"].values
texts = work_df["text"].values


Feature matrix shape: (4540, 1000)
Sample features: ['aaron' 'able' 'absolute' 'absolutely' 'account' 'act' 'action'
 'actually' 'adam' 'add']


In [56]:
X_train, X_test, y_train, y_test, text_train, text_test = train_test_split(
    features,
    labels,
    texts,
    test_size=0.2,
    stratify=labels,
    random_state=RANDOM_STATE
)

nb_clf = MultinomialNB()
nb_clf.fit(X_train, y_train)
y_pred = nb_clf.predict(X_test)

print(f'Training samples: {X_train.shape[0]}, Test samples: {X_test.shape[0]}')


Training samples: 3632, Test samples: 908
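
For readers who want to see what `MultinomialNB` estimates, here is a minimal pure-Python sketch of the same idea with add-one smoothing (scikit-learn's default is `alpha=1.0`). It runs on made-up counts and is an illustration, not a replacement for the library; `train_nb` and `predict_nb` are hypothetical names. It also shows why an all-zero vector is still classifiable: the likelihood term vanishes and the class prior decides.

```python
from collections import Counter, defaultdict
from math import log
from typing import Dict, List, Tuple

Doc = Dict[str, int]  # bag of words: token -> count


def train_nb(docs: List[Doc], labels: List[str], vocab: List[str], alpha: float = 1.0):
    """Estimate log priors and smoothed log likelihoods for each class."""
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        for token, count in doc.items():
            if token in vocab:  # out-of-vocabulary tokens are ignored
                token_counts[label][token] += count
    log_prior = {c: log(n / len(labels)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {t: log((token_counts[c][t] + alpha) / total) for t in vocab}
    return log_prior, log_like


def predict_nb(doc: Doc, log_prior, log_like) -> Tuple[str, Dict[str, float]]:
    """Return the best class and the unnormalised log posterior of every class."""
    scores = {}
    for c, prior in log_prior.items():
        scores[c] = prior + sum(
            count * log_like[c][t] for t, count in doc.items() if t in log_like[c]
        )
    return max(scores, key=scores.get), scores


# Toy data, invented for illustration.
vocab = ["game", "win", "album", "song"]
docs = [{"game": 2, "win": 1}, {"game": 1}, {"album": 1, "song": 2}]
labels = ["sports", "sports", "music"]

log_prior, log_like = train_nb(docs, labels, vocab)
print(predict_nb({"song": 1}, log_prior, log_like)[0])  # one in-vocabulary token
print(predict_nb({}, log_prior, log_like)[0])           # empty vector: prior only
```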


### 3.2 Language-model features from Lab 3
To reuse our custom language models, we first reintroduce the helper functions and classes from Lab 3. They are only defined here; the classifier evaluated below does not use them yet.

In [57]:

from math import log
from collections import Counter, defaultdict
from typing import Sequence, Union

def ensure_tokens(sentence: Union[Sequence[str], str]) -> list:
    if isinstance(sentence, str):
        return sentence.split()
    return list(sentence)

class UnigramLM:
    def __init__(self, corpus: Sequence[Sequence[str]]):
        self.unigram_counts = Counter()
        self.total_tokens = 0
        self.vocab = set()
        for sentence in corpus:
            tokens = ensure_tokens(sentence)
            self.unigram_counts.update(tokens)
            self.total_tokens += len(tokens)
            self.vocab.update(tokens)
        if self.total_tokens == 0:
            raise ValueError("Cannot train UnigramLM on an empty corpus")
        self.vocab_size = len(self.vocab)

    def log_prob(self, word: str) -> float:
        count = self.unigram_counts.get(word, 0)
        return log((count + 1) / (self.total_tokens + self.vocab_size))

    def sentence_log_prob(self, sentence: Union[Sequence[str], str]) -> float:
        tokens = ensure_tokens(sentence)
        if not tokens:
            return float('-inf')
        return sum(self.log_prob(tok) for tok in tokens)

class BigramLM:
    def __init__(self, corpus: Sequence[Sequence[str]]):
        self.bigram_counts = defaultdict(Counter)
        self.unigram_counts = Counter()
        self.vocab = set()
        for sentence in corpus:
            tokens = ensure_tokens(sentence)
            prev = None
            for word in tokens:
                self.unigram_counts[word] += 1
                self.vocab.add(word)
                if prev is not None:
                    self.bigram_counts[prev][word] += 1
                prev = word
        self.vocab_size = len(self.vocab)
        self.total_tokens = sum(self.unigram_counts.values())
        if self.total_tokens == 0:
            raise ValueError("Cannot train BigramLM on an empty corpus")

    def log_prob(self, prev_word: str, word: str) -> float:
        # .get() avoids inserting an empty Counter for unseen contexts at query time
        bigram_count = self.bigram_counts.get(prev_word, {}).get(word, 0)
        unigram_count = self.unigram_counts.get(prev_word, 0)
        return log((bigram_count + 1) / (unigram_count + self.vocab_size))

    def sentence_log_prob(self, sentence: Union[Sequence[str], str]) -> float:
        tokens = ensure_tokens(sentence)
        if not tokens:
            return float('-inf')
        first = tokens[0]
        total = log((self.unigram_counts.get(first, 0) + 1) / (self.total_tokens + self.vocab_size))
        for prev, word in zip(tokens, tokens[1:]):
            total += self.log_prob(prev, word)
        return total
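
Written out, both classes use add-one (Laplace) smoothing. With $c(\cdot)$ denoting training counts, $N$ the total number of training tokens (`total_tokens`) and $V$ the number of distinct training tokens (`vocab_size`), the code above computes:

$$
P(w) = \frac{c(w) + 1}{N + V}, \qquad
P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}
$$

and scores a tweet $w_1 \dots w_n$ as

$$
\log P_{\text{uni}} = \sum_{i=1}^{n} \log P(w_i), \qquad
\log P_{\text{bi}} = \log P(w_1) + \sum_{i=2}^{n} \log P(w_i \mid w_{i-1}).
$$

A token never seen in training still receives $1/(N+V)$, so any tweet with at least one token gets a finite log-probability. Only the empty tweet yields $-\infty$, through the explicit check in `sentence_log_prob`.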


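## 4. Evaluation

### 4.1 Quantitative Evaluation
We compute accuracy plus the full classification report (per-class precision, recall, and F1 with macro and weighted averages) and display the confusion matrix (rows are true labels, columns are predicted labels) to see which classes dominate the errors.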

In [58]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy:.4f}")
print("\nClassification Report (zero_division=0):")
print(classification_report(y_test, y_pred, digits=3, zero_division=0))

labels_order = nb_clf.classes_
cm_array = confusion_matrix(y_test, y_pred, labels=labels_order)
cm = pd.DataFrame(cm_array, index=labels_order, columns=labels_order)
cm

Test Accuracy: 0.5286

Classification Report (zero_division=0):
                                               precision    recall  f1-score   support

                               arts_&_culture      0.500     0.167     0.250         6
                     business_&_entrepreneurs      0.500     0.154     0.235        13
business_&_entrepreneursnews_&_social_concern      0.375     0.375     0.375        16
                      celebrity_&_pop_culture      0.000     0.000     0.000        11
       celebrity_&_pop_culturefilm_tv_&_video      0.091     0.043     0.059        23
  celebrity_&_pop_culturefilm_tv_&_videomusic      0.250     0.059     0.095        17
                 celebrity_&_pop_culturemusic      0.393     0.415     0.404        53
 celebrity_&_pop_culturenews_&_social_concern      1.000     0.200     0.333        15
                celebrity_&_pop_culturesports      1.000     0.273     0.429        11
                         diaries_&_daily_life      0.200     0.15

| | arts_&_culture | business_&_entrepreneurs | business_&_entrepreneursnews_&_social_concern | celebrity_&_pop_culture | celebrity_&_pop_culturefilm_tv_&_video | celebrity_&_pop_culturefilm_tv_&_videomusic | celebrity_&_pop_culturemusic | celebrity_&_pop_culturenews_&_social_concern | celebrity_&_pop_culturesports | diaries_&_daily_life | ... | fitness_&_healthnews_&_social_concern | gaming | gamingsports | music | news_&_social_concern | news_&_social_concernscience_&_technology | news_&_social_concernsports | other_hobbies | science_&_technology | sports |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| arts_&_culture | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 |
| business_&_entrepreneurs | 0 | 2 | 4 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 3 |
| business_&_entrepreneursnews_&_social_concern | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 0 | 0 | 1 | 5 | 0 | 1 | 0 | 0 | 0 |
| celebrity_&_pop_culture | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 2 | 3 | 0 | 1 | 0 | 0 | 2 |
| celebrity_&_pop_culturefilm_tv_&_video | 0 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 3 | 0 | 0 | 0 | 0 | 3 |
| celebrity_&_pop_culturefilm_tv_&_videomusic | 0 | 0 | 0 | 0 | 0 | 1 | 4 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 7 | 0 | 0 | 0 | 0 | 0 | 3 |
| celebrity_&_pop_culturemusic | 0 | 0 | 0 | 0 | 1 | 1 | 22 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 23 | 1 | 0 | 0 | 0 | 0 | 1 |
| celebrity_&_pop_culturenews_&_social_concern | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 11 | 0 | 0 | 0 | 0 | 0 |
| celebrity_&_pop_culturesports | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 3 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 |
| diaries_&_daily_life | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 0 | 3 | ... | 0 | 1 | 0 | 2 | 3 | 0 | 0 | 0 | 0 | 2 |
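
The summary rows of a classification report average the per-class scores in two ways: the macro average treats every class equally, while the weighted average weights each class by its support. The sketch below recomputes both from a toy confusion matrix (the numbers are invented for illustration and are not taken from this run) to show how one dominant class pulls the two apart.

```python
from typing import Dict, List


def per_class_scores(cm: List[List[int]]) -> List[Dict[str, float]]:
    """Precision, recall, F1 and support per class; rows = true, columns = predicted."""
    n = len(cm)
    rows = []
    for k in range(n):
        tp = cm[k][k]
        predicted = sum(cm[i][k] for i in range(n))  # column sum
        support = sum(cm[k])                         # row sum
        precision = tp / predicted if predicted else 0.0  # mirrors zero_division=0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append({"precision": precision, "recall": recall, "f1": f1, "support": support})
    return rows


# Toy 3-class matrix: one large class and two small ones.
cm = [
    [90, 5, 5],
    [8, 2, 0],
    [9, 0, 1],
]

scores = per_class_scores(cm)
total = sum(r["support"] for r in scores)
macro_f1 = sum(r["f1"] for r in scores) / len(scores)
weighted_f1 = sum(r["f1"] * r["support"] for r in scores) / total
accuracy = sum(cm[k][k] for k in range(len(cm))) / total

print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f} weighted_f1={weighted_f1:.3f}")
```

On imbalanced data like this corpus, a macro F1 well below the weighted F1 is a sign that the small classes are being sacrificed to the large ones.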


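### 4.2 Manual Inspection
We pair each test tweet with its true and predicted label and look at the first five correct and the first five incorrect predictions.

In [59]: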
analysis_df = pd.DataFrame({
    "text": text_test,
    "true_label": y_test,
    "predicted": y_pred
})
analysis_df["correct"] = analysis_df["true_label"] == analysis_df["predicted"]

correct_examples = analysis_df[analysis_df["correct"]].head(5)
incorrect_examples = analysis_df[~analysis_df["correct"]].head(5)

print("Correctly classified samples:")
correct_examples


Correctly classified samples:


| | text | true_label | predicted | correct |
|---|---|---|---|---|
| 1 | pre save new release bad guy spotify power | music | music | True |
| 3 | happy dominick cruz getting knock henry cejudo... | sports | sports | True |
| 5 | look limp biskit music video hd | music | music | True |
| 8 | want play poke pull ben bad half game know los... | sports | sports | True |
| 9 | happy postal lose absentee ballot mail | news_&_social_concern | news_&_social_concern | True |


In [60]:
print("Incorrectly classified samples:")
incorrect_examples

Incorrectly classified samples:


| | text | true_label | predicted | correct |
|---|---|---|---|---|
| 0 | vg duty warzone get fps update xbox series ps | gaming | sports | False |
| 2 | hey good article renewable future win offset h... | news_&_social_concernscience_&_technology | film_tv_&_video | False |
| 4 | release new ep difference stream | celebrity_&_pop_culturemusic | music | False |
| 6 | jane jungle tonight official music video breat... | film_tv_&_videomusic | music | False |
| 7 | love live love house lot room big yard woode a... | diaries_&_daily_life | celebrity_&_pop_culturemusic | False |
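
Reading individual rows helps, but tallying the (true, predicted) pairs among the errors points at systematic confusions faster. A small sketch that uses only the five misclassified rows shown above as input; on the full test set the same tally would run over `y_test` and `y_pred`.

```python
from collections import Counter

# (true_label, predicted) pairs copied from the five rows shown above.
errors = [
    ("gaming", "sports"),
    ("news_&_social_concernscience_&_technology", "film_tv_&_video"),
    ("celebrity_&_pop_culturemusic", "music"),
    ("film_tv_&_videomusic", "music"),
    ("diaries_&_daily_life", "celebrity_&_pop_culturemusic"),
]

pair_counts = Counter(errors)                           # how often each confusion occurs
predicted_counts = Counter(pred for _, pred in errors)  # which class absorbs the errors

for predicted, count in predicted_counts.most_common(2):
    print(f"{predicted}: {count}")
```

In this small sample, two of the five errors are tweets with a combined label containing `music` that were assigned to plain `music`. This matches the confusion-matrix row for `celebrity_&_pop_culturemusic`, which is predicted as `music` 23 times against 22 correct predictions.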


## 5. Findings & Next Steps
- Label audit: filtering out labels with <30 examples shrank the corpus from 6,090 to 4,540 tweets and left 30 frequent classes (e.g., sports/news/music), which keeps the stratified split stable.
- Vocabulary: the capped set of 1,000 unigrams is headed by generic words (`new`, `game`, `day`, `good`, `love`), while the per-class previews surface topical markers such as `ufc`, `trump`, `album`, and `movie` that separate the largest classes.
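- Modeling: the CountVectorizer + Multinomial NB baseline reaches a test accuracy of 0.5286 on 908 held-out tweets. The confusion matrix shows combined labels being mistaken for one of their components (e.g., `celebrity_&_pop_culturemusic` is predicted as `music` 23 times against 22 correct predictions). Next step: wire back the Lab-3 `build_lm_features` helper, which is not defined in this notebook yet, to append unigram/bigram log-likelihoods, then re-run the accuracy, report, and confusion matrix cells for the final evaluation. Because log-likelihoods are negative and `MultinomialNB` only accepts non-negative features, they will need rescaling or a different classifier.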
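
As a rough idea of what the language-model features could look like, the sketch below builds one unigram score per class for each tweet. Everything in it is an assumption: this `build_lm_features` is a hypothetical stand-in for the Lab-3 helper, it uses a compact unigram model rather than the `UnigramLM`/`BigramLM` classes, and the training tweets are invented.

```python
from collections import Counter
from math import log
from typing import List


def train_unigram(texts: List[str]) -> dict:
    """Add-one-smoothed unigram model over whitespace tokens."""
    counts = Counter(tok for text in texts for tok in text.split())
    return {"counts": counts, "total": sum(counts.values()), "vocab": len(counts)}


def avg_log_prob(model: dict, text: str) -> float:
    """Length-normalised log-likelihood of a tweet under one model."""
    tokens = text.split()
    if not tokens:
        return 0.0  # same value for every class, so an empty tweet is uninformative
    denom = model["total"] + model["vocab"]
    return sum(log((model["counts"][t] + 1) / denom) for t in tokens) / len(tokens)


def build_lm_features(texts, train_texts, train_labels) -> List[List[float]]:
    """Hypothetical stand-in: one column per class, classes in sorted order."""
    classes = sorted(set(train_labels))
    models = {
        c: train_unigram([t for t, y in zip(train_texts, train_labels) if y == c])
        for c in classes
    }
    return [[avg_log_prob(models[c], text) for c in classes] for text in texts]


# Invented training data; columns of the result are ["music", "sports"].
train_texts = ["game win team", "win game", "new album song"]
train_labels = ["sports", "sports", "music"]

features = build_lm_features(["team win", ""], train_texts, train_labels)
print(features)
```

The models are trained on the training tweets only, so the scores of test tweets do not leak label information. All values are zero or negative, which matters for any classifier that expects count-like inputs.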
