# Text Classification Lab (Hikmet)

---
## 1. Notebook Orientation

### 1.1 Focus of this notebook
We revisit the preprocessed tweets from Lab 3 and limit ourselves to the token analysis stage:

1. Load the dataset and normalise the label lists.
2. Derive the 1000 most frequent tokens, with optional per-class previews.

Later tasks (training a Naive Bayes classifier, evaluating it) are still work in progress: Sections 5 and 6 hold placeholders together with first draft cells.

### 1.2 Dataset
- Source: `../Data/df_preprocessed.parquet`
- Columns: `text` (whitespace-tokenised strings) and `label_name` (list of categories, stored as a string such as `"['sports']"`)

### 1.3 Section overview
1. **Section 2** – Load/prepare the data frame.
2. **Section 3** – Reuse the Lab 3 helper classes (`UnigramLM`).
3. **Section 4** – Compute the top 1000 tokens globally and preview them per class.
4. **Sections 5 & 6** – Placeholders for future classification steps.


## 2. Data Loading & Preparation

### 2.1 Goal
Load the preprocessed tweets, standardise the label column, and create a single-label view that can act as training data later on.

### 2.2 Steps
1. Import libraries (Pandas, NumPy, collections helper).
2. Convert `label_name` into consistent Python lists.
3. Build a DataFrame with a `label` column for single-label examples.


In [40]:
import ast
from collections import Counter
from typing import List

import numpy as np
import pandas as pd

DATA_PATH = "../Data/df_preprocessed.parquet"


def load_dataset(path: str) -> pd.DataFrame:
    """Load tweets from parquet and normalise the label column."""
    df = pd.read_parquet(path)

    def parse_labels(value) -> List[str]:
        if isinstance(value, list):
            return [str(v) for v in value]
        if isinstance(value, tuple):
            return [str(v) for v in value]
        if isinstance(value, str):
            try:
                parsed = ast.literal_eval(value)
                if isinstance(parsed, (list, tuple)):
                    return [str(v) for v in parsed]
            except (ValueError, SyntaxError):
                return [value]
        return [str(value)]

    df = df.copy()
    df["labels"] = df["label_name"].apply(parse_labels)
    df["label_count"] = df["labels"].apply(len)
    df["primary_label"] = df["labels"].apply(lambda items: items[0] if items else "unknown")
    return df


df_raw = load_dataset(DATA_PATH)
print(f"Loaded {len(df_raw):,} documents from {DATA_PATH}.")
print(df_raw.head(3))

single_label_df = df_raw[df_raw["label_count"] == 1][["text", "primary_label"]].rename(
    columns={"primary_label": "label"}
)
print(f"Single-label subset: {len(single_label_df):,} rows (label column = 'label').")


Loaded 6,090 documents from ../Data/df_preprocessed.parquet.
                                                text  label_name    labels  \
0  beat rapid game western division final evan ed...  ['sports']  [sports]   
1         hear eli gold announce auburn game dumbass  ['sports']  [sports]   
2       phone away try look home game ticket october  ['sports']  [sports]   

   label_count primary_label  
0            1        sports  
1            1        sports  
2            1        sports  
Single-label subset: 6,089 rows (label column = 'label').
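
Nearly every row ends up in the single-label subset (6,089 of 6,090), yet the predictions in Section 5 contain fused names such as `celebrity_&_pop_culturemusic`. One possible explanation, stated here as an assumption rather than something the source confirms: multi-label rows may be stored without commas (for example `"['celebrity_&_pop_culture' 'music']"`, the way NumPy prints string arrays), in which case `ast.literal_eval` joins the adjacent string literals into a single label. The sketch below reproduces that effect and shows a hypothetical helper, `parse_label_string`, that inserts the missing commas before parsing.

```python
import ast
import re

# Hypothetical raw value; the real contents of multi-label rows are not shown in this notebook.
raw = "['celebrity_&_pop_culture' 'music']"

# Adjacent string literals are concatenated, so two labels collapse into one.
fused = ast.literal_eval(raw)


def parse_label_string(value: str) -> list:
    """Parse a list-like label string, tolerating missing commas between items."""
    # Turn "'a' 'b'" into "'a', 'b'" before evaluating the literal.
    normalised = re.sub(r"(['\"])\s+(['\"])", r"\1, \2", value.strip())
    try:
        parsed = ast.literal_eval(normalised)
    except (ValueError, SyntaxError):
        return [value]
    if isinstance(parsed, (list, tuple)):
        return [str(item) for item in parsed]
    return [str(parsed)]


print(fused)
print(parse_label_string(raw))
print(parse_label_string("['sports']"))
```

If the assumption holds, swapping such a helper into `parse_labels` would change `label_count`, the single-label subset, and the number of classes seen later, so every downstream cell would need to be rerun.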


## 3. Reusing Language-Model Helpers (Lab 3)

### 3.1 Background
`lab3_sunny.ipynb` defined a `UnigramLM` class that counts token frequencies and computes Laplace-smoothed log probabilities. We reuse the same implementation here to keep the logic consistent across notebooks.

### 3.2 How it works
- `ensure_tokens` converts strings to token lists.
- `UnigramLM` aggregates token counts (`self.unigram_counts`) across the corpus.
- Calling `.unigram_counts.most_common(n)` returns the top-n tokens along with their frequencies.


In [41]:
from collections import Counter
from typing import List, Sequence, Union  # List is used in the ensure_tokens signature
import math


def ensure_tokens(sentence: Union[Sequence[str], str]) -> List[str]:
    """Convert whitespace-separated text or token sequences into a list."""
    if isinstance(sentence, str):
        sentence = sentence.split()
    return list(sentence)


class UnigramLM:
    """Laplace-smoothed unigram language model operating in log-space."""

    def __init__(self, corpus: Sequence[Sequence[str]]):
        self.unigram_counts = Counter()
        self.total_tokens = 0
        self.vocab = set()

        for sentence in corpus:
            tokens = ensure_tokens(sentence)
            self.unigram_counts.update(tokens)
            self.total_tokens += len(tokens)
            self.vocab.update(tokens)

        if self.total_tokens == 0:
            raise ValueError("Cannot train a UnigramLM on an empty corpus.")

        self.vocab_size = len(self.vocab)

    def log_prob(self, word: str) -> float:
        count = self.unigram_counts.get(word, 0)
        return math.log((count + 1) / (self.total_tokens + self.vocab_size))

    def sentence_log_prob(self, sentence: Union[Sequence[str], str]) -> float:
        tokens = ensure_tokens(sentence)
        if not tokens:
            return float('-inf')
        return sum(self.log_prob(token) for token in tokens)


## 4. Task – Top 1000 Tokens

### 4.1 Goal
Identify the most frequent tokens in the corpus (with optional class-wise previews) and store them for later feature engineering.

### 4.2 Approach
1. Train the `UnigramLM` on the single-label subset.
2. Retrieve `most_common(1000)` and inspect the first items.
3. Optionally repeat the process for the most frequent classes to understand their characteristic vocabulary.


In [42]:
MAX_FEATURES = 1000

# Overall vocabulary
corpus_tokens = [ensure_tokens(text) for text in single_label_df["text"]]
unigram_model = UnigramLM(corpus_tokens)

top_unigrams = unigram_model.unigram_counts.most_common(MAX_FEATURES)
print(f"Collected top {len(top_unigrams)} tokens (showing the first 20):")
for token, freq in top_unigrams[:20]:
    print(f"  {token:<15} -> {freq}")

# Optional: per class preview for the three most frequent labels
label_counts = single_label_df["label"].value_counts().head(3)
print("\nPer-class token preview (Top 10 tokens for the most frequent labels):")
for label, count in label_counts.items():
    label_corpus = [ensure_tokens(text) for text in single_label_df.loc[single_label_df["label"] == label, "text"]]
    label_model = UnigramLM(label_corpus)
    label_top = label_model.unigram_counts.most_common(10)
    formatted = ", ".join([f"{tok} ({freq})" for tok, freq in label_top])
    print(f"- {label} ({count} docs): {formatted}")

# Store the vocabulary for later steps (if needed)
TOP_VOCABULARY = [token for token, _ in top_unigrams]
print(f"\nStored vocabulary length: {len(TOP_VOCABULARY)}")

Collected top 1000 tokens (showing the first 20):
  new             -> 571
  love            -> 499
  day             -> 466
  good            -> 431
  game            -> 427
  make            -> 412
  year            -> 405
  time            -> 394
  watch           -> 383
  happy           -> 344
  come            -> 329
  music           -> 319
  like            -> 318
  win             -> 307
  great           -> 295
  thank           -> 292
  go              -> 292
  video           -> 275
  live            -> 272
  today           -> 261

Per-class token preview (Top 10 tokens for the most frequent labels):
- sports (1181 docs): game (248), win (178), team (143), ufc (110), good (107), today (91), go (85), vs (83), time (82), make (81)
- news_&_social_concern (625 docs): trump (97), president (76), news (57), people (55), world (44), woman (42), change (42), year (42), know (41), black (41)
- music (439 docs): new (145), music (137), album (111), song (83), love (53), listen (52)
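
Before reusing the top tokens as features, it can help to check how much of the corpus they actually cover. The following is a minimal sketch on hypothetical toy data; `coverage` is an illustrative helper, not part of the Lab 3 code. In the notebook, the same function could be applied to `unigram_model.unigram_counts` with `n=MAX_FEATURES`.

```python
from collections import Counter

# Hypothetical toy documents standing in for the tweet corpus.
toy_docs = ["game win game", "new music album", "game today"]
counts = Counter(tok for doc in toy_docs for tok in doc.split())


def coverage(token_counts: Counter, n: int) -> float:
    """Share of all token occurrences accounted for by the n most frequent tokens."""
    covered = sum(freq for _, freq in token_counts.most_common(n))
    return covered / sum(token_counts.values())


print(f"Top-1 coverage: {coverage(counts, 1):.3f}")
print(f"Top-2 coverage: {coverage(counts, 2):.3f}")
```

A low value would suggest that many tweets contain few or no tokens from the selected vocabulary, which matters once the vocabulary is used as a fixed feature set.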

## 5. Task – Naive Bayes Setup (Placeholder)

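> To be added later: build the pipeline, split the data, and train the classifier. The draft cells below already vectorise the text, encode the labels, and fit a first model; they rely on `X_train`, `X_test`, `y_train`, and `y_test`, whose defining cell is not included in this notebook.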


In [None]:
print(TOP_VOCABULARY)

['new', 'love', 'day', 'good', 'game', 'make', 'year', 'time', 'watch', 'happy', 'come', 'music', 'like', 'win', 'great', 'thank', 'go', 'video', 'live', 'today', 'world', 'get', 'look', 'need', 'know', 'play', 'people', 'show', 'work', 'team', 'family', 'think', 'check', 'hope', 'man', 'news', 'want', 'say', 'life', 'change', 'woman', 'night', 'morning', 'trump', 'album', 'song', 'th', 'listen', 'week', 'fight', 'let', 'help', 'guy', 'right', 'ufc', 'vs', 'bad', 'remember', 'way', 'tonight', 'season', 'home', 'big', 'stay', 'break', 'follow', 'climate', 'state', 'star', 'tell', 'sign', 'see', 'president', 'end', 'don', 'friend', 'stream', 'join', 'thing', 'power', 'talk', 'final', 'feel', 'fan', 'second', 'take', 'amazing', 'official', 'weekend', 'hour', 'league', 'wait', 'well', 'start', 'war', 'send', 'movie', 'stop', 'black', 'find', 'hard', 'boy', 'miss', 'story', 'pm', 'sunday', 'city', 'give', 'run', 'line', 'hear', 'try', 'vote', 'free', 'long', 'sta', 'school', 'country', 'los

Step 1: Prepare labels
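
The cells that follow use `X_train`, `X_test`, `y_train`, and `y_test`, but the cell that creates them is not part of this export. Below is a minimal sketch of such a split, assuming an 80/20 shuffle over all rows (6,090 × 0.8 = 4,872 would match the training matrix shape reported further down). The helper `split_rows`, the seed, and the toy rows are hypothetical; in the notebook itself, scikit-learn's `train_test_split` on the `text` and `labels` columns would be the usual counterpart and would return pandas Series, which the later `y_train.iloc[0]` call expects.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle row indices reproducibly; the first `test_fraction` of them become the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


# Hypothetical (text, labels) pairs standing in for the data frame rows.
toy_rows = [(f"text {i}", ["sports"]) for i in range(10)]
train_rows, test_rows = split_rows(toy_rows)

X_train_toy, y_train_toy = zip(*train_rows)
X_test_toy, y_test_toy = zip(*test_rows)
print(len(X_train_toy), "train rows,", len(X_test_toy), "test rows")
```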

In [44]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    vocabulary=TOP_VOCABULARY,  # reuse the stored top-1000 vocabulary
    lowercase=True,
    token_pattern=r"(?u)\b\w+\b"
)

# Convert the texts into count vectors (bag-of-words).
# The vocabulary is fixed, so fit_transform only counts; nothing is learned from X_train.
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow  = vectorizer.transform(X_test)

print("Feature matrix (train):", X_train_bow.shape)
print("Example features:", vectorizer.get_feature_names_out()[:10])

Feature matrix (train): (4872, 1000)
Example features: ['new' 'love' 'day' 'good' 'game' 'make' 'year' 'time' 'watch' 'happy']
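
To illustrate what a count matrix over a fixed vocabulary contains, here is a plain-Python sketch. It is not scikit-learn's implementation; `count_matrix` and the toy inputs are hypothetical. Columns follow the order of the vocabulary list, and tokens outside the vocabulary are simply dropped.

```python
import re

# Same word pattern as in the vectoriser cell; str patterns are Unicode-aware by default.
TOKEN_RE = re.compile(r"\b\w+\b")


def count_matrix(texts, vocabulary):
    """Return one row of token counts per text, with one column per vocabulary entry."""
    column = {token: i for i, token in enumerate(vocabulary)}
    rows = []
    for text in texts:
        row = [0] * len(vocabulary)
        for token in TOKEN_RE.findall(text.lower()):
            if token in column:  # out-of-vocabulary tokens are ignored
                row[column[token]] += 1
        rows.append(row)
    return rows


toy_vocabulary = ["new", "love", "game"]
toy_texts = ["new game new ticket", "love"]
print(count_matrix(toy_texts, toy_vocabulary))
```

Here `ticket` leaves no trace in the matrix, which is why a tweet made up entirely of rare words becomes an all-zero row.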


Multi-label encoding of the labels

In [45]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
y_train_bin = mlb.fit_transform(y_train)
y_test_bin  = mlb.transform(y_test)

print("Number of classes:", len(mlb.classes_))
print("Example label:", y_train.iloc[0])


Number of classes: 405
Example label: ['sports']
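
The encoding step can be pictured as building one 0/1 column per distinct label. The sketch below does this in plain Python on hypothetical toy labels; `fit_classes` and `to_multi_hot` are illustrative names, not scikit-learn functions. The third toy row carries a fused label string to show how such strings add extra columns, which may be one reason the class count above is as high as 405.

```python
def fit_classes(label_lists):
    """Collect the distinct labels in sorted order (one column per label)."""
    return sorted({label for labels in label_lists for label in labels})


def to_multi_hot(label_lists, classes):
    """Encode each label list as a 0/1 row over `classes`."""
    column = {label: i for i, label in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[column[label]] = 1
        rows.append(row)
    return rows


# Hypothetical toy labels; "gamingsports" mimics a fused label string.
y_toy = [["sports"], ["gaming", "sports"], ["gamingsports"]]
classes = fit_classes(y_toy)
matrix = to_multi_hot(y_toy, classes)
print(classes)
print(matrix)
```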




Train the Naive Bayes classifier

In [46]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier

# One-vs-rest fits one binary classifier per label, which allows multi-label training
nb_clf = OneVsRestClassifier(MultinomialNB(alpha=1.0))
nb_clf.fit(X_train_bow, y_train_bin)

print("✅ Multi-label Naive Bayes model trained!")


✅ Multi-label Naive Bayes model trained!
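
For intuition, the one-vs-rest idea can be sketched in plain Python: each label gets its own "label vs. rest" multinomial Naive Bayes model with add-`alpha` smoothing (with `alpha=1.0` this is the same add-one smoothing as in `UnigramLM`). This is an illustration on hypothetical toy data, not scikit-learn's implementation; all function names are made up for the example.

```python
import math
from collections import Counter


def fit_binary_nb(docs, is_positive, vocabulary, alpha=1.0):
    """Fit one 'label vs. rest' model; assumes both sides have at least one document."""
    vocab = list(vocabulary)
    token_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter(is_positive)
    for tokens, positive in zip(docs, is_positive):
        token_counts[positive].update(t for t in tokens if t in vocab)
    model = {}
    for side in (True, False):
        total = sum(token_counts[side].values())
        denom = total + alpha * len(vocab)
        model[side] = {
            "log_prior": math.log(doc_counts[side] / len(docs)),
            "log_lik": {w: math.log((token_counts[side][w] + alpha) / denom) for w in vocab},
        }
    return model


def predicts_label(model, tokens):
    """True if the positive side scores higher than the 'rest' side."""
    scores = {
        side: params["log_prior"] + sum(params["log_lik"].get(t, 0.0) for t in tokens)
        for side, params in model.items()
    }
    return scores[True] > scores[False]


# Hypothetical toy training data.
docs = [["game", "win"], ["game", "team"], ["album", "song"],
        ["song", "music"], ["trump", "news"], ["president", "news"]]
labels = [{"sports"}, {"sports"}, {"music"}, {"music"}, {"news"}, {"news"}]
vocabulary = sorted({t for d in docs for t in d})

models = {
    label: fit_binary_nb(docs, [label in y for y in labels], vocabulary)
    for label in ("sports", "music", "news")
}


def predict_labels(tokens):
    """Return every label whose binary model votes 'positive'."""
    return tuple(label for label, model in models.items() if predicts_label(model, tokens))


print(predict_labels(["game", "win"]))
print(predict_labels(["hello"]))
```

Because every label is decided independently, a document can receive several labels or none at all. The second call shows the latter: with no informative tokens, each model falls back on its prior, and "rest" is the more likely side for every label.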


Inspect the first test predictions

In [47]:
y_pred_bin = nb_clf.predict(X_test_bow)
y_pred_labels = mlb.inverse_transform(y_pred_bin)

for text, pred in zip(X_test[:3], y_pred_labels[:3]):
    print(f"Text: {text[:70]}...")
    print(f"→ Predicted labels: {pred}\n")


Text: fresh find discover new music hot talent listen popstar dnb remix dooz...
→ Predicted labels: ('celebrity_&_pop_culturemusic', 'music')

Text: stop love thing game tell bill room growth go to look forward see gobi...
→ Predicted labels: ()

Text: putin say nord stream gas pipeline europe complete end year quarter wa...
→ Predicted labels: ('business_&_entrepreneursnews_&_social_concern',)



## 6. Task – Evaluation & Error Analysis (Placeholder)

> Once a classifier is trained, we will add metrics and example analyses here.


Predictions for your test set

In [48]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, hamming_loss, classification_report

# Prediction (0/1 matrix)
y_pred_bin = nb_clf.predict(X_test_bow)

# Convert the label rows back into text form
y_pred_labels = mlb.inverse_transform(y_pred_bin)
y_true_labels = mlb.inverse_transform(y_test_bin)

print("Example predictions:")
for text, true, pred in zip(X_test[:3], y_true_labels[:3], y_pred_labels[:3]):
    print(f"Text: {text[:80]}...")
    print(f"  True: {true}")
    print(f"  Predicted: {pred}\n")

Example predictions:
Text: fresh find discover new music hot talent listen popstar dnb remix doozy hiphop r...
  True: ('music',)
  Predicted: ('celebrity_&_pop_culturemusic', 'music')

Text: stop love thing game tell bill room growth go to look forward see gobill...
  True: ('gamingsports',)
  Predicted: ()

Text: putin say nord stream gas pipeline europe complete end year quarter wall street...
  True: ('business_&_entrepreneursnews_&_social_concern',)
  Predicted: ('business_&_entrepreneursnews_&_social_concern',)
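
The metric functions imported above are not yet applied. As a sketch of what they would measure, the following plain-Python example computes micro-averaged precision, recall, and F1 plus subset accuracy for just the three example pairs printed above. Three documents are far too few for a meaningful evaluation; the point is only to show how the numbers arise. On the full test set, calls such as `f1_score(y_test_bin, y_pred_bin, average="micro")` would be the likely scikit-learn counterpart.

```python
# The three (true labels, predicted labels) pairs from the output above.
examples = [
    ({"music"}, {"celebrity_&_pop_culturemusic", "music"}),
    ({"gamingsports"}, set()),
    ({"business_&_entrepreneursnews_&_social_concern"},
     {"business_&_entrepreneursnews_&_social_concern"}),
]

# Micro-averaging pools label decisions across all documents before computing the ratios.
tp = sum(len(true & pred) for true, pred in examples)  # correctly predicted labels
fp = sum(len(pred - true) for true, pred in examples)  # predicted but not true
fn = sum(len(true - pred) for true, pred in examples)  # true but not predicted

precision = tp / (tp + fp)
recall = tp / (tp + fn)
micro_f1 = 2 * precision * recall / (precision + recall)

# Subset accuracy counts a document as correct only if the full label set matches.
subset_accuracy = sum(true == pred for true, pred in examples) / len(examples)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"micro-F1={micro_f1:.3f} subset accuracy={subset_accuracy:.3f}")
```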

