# Lab 5: Neural Network Classification with scikit-learn

---
## 1. Notebook Overview

### 1.1 Objective
- Re-use the most frequent words (optional: per class) you found for
your Naive Bayes classifier last week.

- Construct binary vectors for your whole dataset. Each dimension states
whether the word is part of the sample or not.

- Create a small neural network using scikit-learn:
https://scikit-learn.org/stable/modules/neural_networks_supervised.html.
Start with three hidden layers of 128/64/128 neurons. Consider what your
input and output layers should look like.

- Train your network on your training set and test it on your test set.
Calculate evaluation measures and compare with your previous
classifier.

- Optional: Experiment with different network sizes.

### 1.2 Prerequisites
This notebook assumes you have already executed:
- **Lab 2**: Data preprocessing → `../Data/multi_label/tweets_preprocessed_*.parquet`
- **Lab 3**: Language modeling
- **Lab 4**: Feature extraction → `../Data/top_1000_vocabulary.json`
- **Single-Label**: `../Data/single_label/tweets_single_label_*.parquet`

### 1.3 Architecture
We implement neural networks with:
- **Input layer**: 1000 features (Top 1000 vocabulary from Lab 4)
- **Hidden layers**: 128 → 64 → 128 neurons (as specified)
- **Output layer**:
  - Multi-label: one binary classifier per topic class (using OneVsRestClassifier); the data loaded below contains 6 classes
  - Single-label: one output per class with Softmax activation
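
To get a feel for the size of this architecture, the number of trainable parameters can be counted by hand. The sketch below is illustrative only: `count_mlp_parameters` is a hypothetical helper, and the class count of 6 is an assumption taken from the class detection output in section 2.3.

```python
def count_mlp_parameters(layer_sizes):
    """Count weights and biases of a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per output unit
    return total


N_FEATURES = 1000  # vocabulary size from Lab 4
N_CLASSES = 6      # assumption: number of topic classes found in the data

# Single-label network: one softmax output per class.
single_label_params = count_mlp_parameters([N_FEATURES, 128, 64, 128, N_CLASSES])

# One-vs-rest: each class gets its own network with a single sigmoid output.
binary_params = count_mlp_parameters([N_FEATURES, 128, 64, 128, 1])

print(f"Single-label network: {single_label_params:,} parameters")
print(f"One-vs-rest total:    {binary_params * N_CLASSES:,} parameters")
```

Under these assumptions the one-vs-rest setup trains roughly six times as many parameters as the single-label network, because no hidden layers are shared between classes.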

### 1.4 Neural Network Fundamentals (From Lecture)
- A single neuron computes: ŷ = g(w₀ + Σ xᵢwᵢ), where g is a non-linear activation function
- **Activation functions are critical** - they introduce non-linearities that make multi-layer networks powerful (universal approximators)
- Common activations: ReLU (g(z) = max(0,z)), Sigmoid, Tanh
- For multi-class (single-label): use **Softmax** to convert outputs to probabilities
- For multi-label: use **Sigmoid** per class via OneVsRestClassifier
- **Loss function for classification**: Cross-entropy loss
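- Weights should NOT be initialized to all zeros: every neuron in a layer would then compute the same output and receive the same gradient, so the symmetry between neurons is never broken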
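
The formulas above can be sketched in a few lines of NumPy. This is a minimal illustration with made-up inputs and weights, not the code scikit-learn runs internally; all function and variable names are hypothetical.

```python
import numpy as np


def relu(z):
    return np.maximum(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    shifted = z - np.max(z)  # subtract the maximum for numerical stability
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()


def neuron(x, w, w0, g):
    """Single neuron: y_hat = g(w0 + sum_i x_i * w_i)."""
    return g(w0 + np.dot(x, w))


x = np.array([1.0, 0.0, 1.0])    # made-up binary input
w = np.array([0.5, -0.3, -1.0])  # made-up weights
w0 = 0.2                         # bias

# Pre-activation is 0.2 + 0.5 - 1.0 = -0.3, so ReLU clips it to zero.
y_relu = neuron(x, w, w0, relu)

scores = np.array([2.0, 1.0, 0.1])  # made-up output-layer scores
probs_single = softmax(scores)      # single-label: probabilities sum to 1
probs_multi = sigmoid(scores)       # multi-label: each class scored independently

print(y_relu, probs_single, probs_multi)
```

Softmax forces the classes to compete for probability mass, whereas the per-class sigmoid lets several classes exceed 0.5 at once, which is what multi-label prediction needs.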

---
## 2. Task 1: Establish Context

### 2.1 Review Preprocessing from Lab 2
In Lab 2, we preprocessed tweets with the following pipeline:
- Remove RT indicators, URLs, usernames, and mentions
- Convert emojis to text descriptions
- Extract hashtag text and segment CamelCase words
- Normalize whitespace and lowercase
- Tokenize with SpaCy and filter/lemmatize tokens

The output is stored in parquet files with columns: `text`, `label_name`, `label`

Two approaches for label handling are supported:
- Parse `label_name` (string list format) into Python lists
- Use `label` column directly (pre-computed binary vectors)

In [103]:
# Import required libraries
import json
import ast
import os
import hashlib
import time
from typing import List
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MultiLabelBinarizer, LabelEncoder
from sklearn.metrics import (
    accuracy_score, 
    f1_score, 
    precision_score, 
    recall_score, 
    hamming_loss,
    classification_report
)
from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier
from tqdm import tqdm

# Constants - Updated paths to new folder structure
TRAIN_DATA_PATH = "../Data/multi_label/tweets_preprocessed_train.parquet"
TEST_DATA_PATH = "../Data/multi_label/tweets_preprocessed_test.parquet"
VALIDATION_DATA_PATH = "../Data/multi_label/tweets_preprocessed_validation.parquet"
VOCABULARY_PATH = "../Data/top_1000_vocabulary.json"
RANDOM_STATE = 42

print("✓ Libraries imported successfully")

✓ Libraries imported successfully


### 2.2 Load and Verify Vocabulary from Lab 4

In [104]:
# Load the top 1000 vocabulary from Lab 4
with open(VOCABULARY_PATH, 'r', encoding='utf-8') as f:
    vocab_data = json.load(f)

VOCABULARY = vocab_data['tokens']
vocab_set = set(VOCABULARY)

print(f"✓ Loaded vocabulary from: {VOCABULARY_PATH}")
print(f"✓ Description: {vocab_data['description']}")
print(f"✓ Vocabulary size: {len(VOCABULARY)}")
print(f"✓ First 20 tokens: {VOCABULARY[:20]}")
print(f"✓ Last 10 tokens: {VOCABULARY[-10:]}")

✓ Loaded vocabulary from: ../Data/top_1000_vocabulary.json
✓ Description: Top 1000 most frequent tokens from preprocessed tweets (Lab 4)
✓ Vocabulary size: 1000
✓ First 20 tokens: ['new', 'game', 'day', 'good', 'year', 'love', 'time', 'win', 'come', 'happy', 'like', 'watch', 'go', 'world', 'live', 'today', 'red', 'team', 'great', 'heart']
✓ Last 10 tokens: ['straight', 'google', 'december', 'thankful', 'oklahoma', 'donald', 'army', 'beverage', 'education', 'titan']


### 2.3 Load Preprocessed Datasets

In [105]:
def parse_labels(value) -> List[str]:
    """Parse label_name column into consistent Python lists."""
    if isinstance(value, (list, np.ndarray)):
        return [str(v) for v in value]
    if isinstance(value, tuple):
        return [str(v) for v in value]
    if isinstance(value, str):
        value = value.strip()
        if value.startswith('[') and value.endswith(']'):
            # Remove brackets
            inner = value[1:-1].strip()
            if not inner:
                return []
            # Remove quotes, treat commas as separators, then split on whitespace
            # (handles both "['a' 'b']" and "['a', 'b']")
            inner = inner.replace("'", "").replace('"', '').replace(',', ' ')
            labels = [l.strip() for l in inner.split() if l.strip()]
            return labels
        try:
            parsed = ast.literal_eval(value)
            if isinstance(parsed, (list, tuple)):
                return [str(v) for v in parsed]
        except (ValueError, SyntaxError):
            pass
        return [value] if value else []
    return [str(value)] if value else []

def parse_binary_label(value) -> np.ndarray:
    """Parse binary label array from string representation."""
    if isinstance(value, np.ndarray):
        return value
    if isinstance(value, str):
        # Parse "[0 0 1 0 ...]" format
        inner = value.strip()[1:-1]
        return np.array([int(x) for x in inner.split()])
    return np.array(value)

def load_dataset(path: str) -> pd.DataFrame:
    """Load tweets from parquet and normalize the label columns."""
    df = pd.read_parquet(path)
    df = df.copy()
    df["labels"] = df["label_name"].apply(parse_labels)
    df["label_binary"] = df["label"].apply(parse_binary_label)
    return df

# Load all datasets
df_train = load_dataset(TRAIN_DATA_PATH)
df_test = load_dataset(TEST_DATA_PATH)
df_validation = load_dataset(VALIDATION_DATA_PATH)

print(f"✓ Training set: {len(df_train):,} samples")
print(f"✓ Test set: {len(df_test):,} samples")
print(f"✓ Validation set: {len(df_validation):,} samples")
print(f"\nSample preprocessed text:")
print(f"  {df_train['text'].iloc[0][:80]}...")
print(f"  Labels: {df_train['labels'].iloc[0]}")

✓ Training set: 5,465 samples
✓ Test set: 1,511 samples
✓ Validation set: 178 samples

Sample preprocessed text:
  lumber beat rapid game western division final evan edwards hit hr wp josh robers...
  Labels: ['sports']


In [106]:
# ============================================================
# DYNAMIC CLASS DETECTION FROM THE DATA
# ============================================================
# This cell adapts automatically to the data, regardless of
# how many classes remain after preprocessing.

print("="*60)
print("AUTOMATIC CLASS DETECTION")
print("="*60)

# 1. Determine the number of classes from the binary label vectors
num_classes = len(df_train['label_binary'].iloc[0])
print(f"\n✓ Number of classes (from label_binary): {num_classes}")

# 2. Extract all unique class names from label_name
all_class_names = set()
for df in [df_train, df_test, df_validation]:
    for labels in df['labels']:
        all_class_names.update(labels)

TOPIC_CLASSES = sorted(all_class_names)
print(f"✓ Class names extracted from data: {len(TOPIC_CLASSES)}")
print(f"✓ Classes: {TOPIC_CLASSES}")

# 3. Verify consistency
if len(TOPIC_CLASSES) != num_classes:
    print(f"\n⚠️ WARNING: number of class names ({len(TOPIC_CLASSES)}) != number of columns in label_binary ({num_classes})")
    print("   This can happen when label_name and label are out of sync.")
    print("   Using the count from label_binary as authoritative.")

# 4. Show example data
print(f"\n✓ Example data:")
print(f"  Text: {df_train['text'].iloc[0][:60]}...")
print(f"  Labels (names): {df_train['labels'].iloc[0]}")
print(f"  Labels (binary): {df_train['label_binary'].iloc[0]}")

# 5. Statistics
print(f"\n✓ Dataset statistics:")
print(f"  Training: {len(df_train):,} samples")
print(f"  Test: {len(df_test):,} samples")
print(f"  Validation: {len(df_validation):,} samples")
print(f"  Total: {len(df_train) + len(df_test) + len(df_validation):,} samples")

print("\n" + "="*60)

AUTOMATIC CLASS DETECTION

✓ Number of classes (from label_binary): 6
✓ Class names extracted from data: 6
✓ Classes: ['celebrity_&_pop_culture', 'diaries_&_daily_life', 'film_tv_&_video', 'music', 'news_&_social_concern', 'sports']

✓ Example data:
  Text: lumber beat rapid game western division final evan edwards h...
  Labels (names): ['sports']
  Labels (binary): [0 0 0 0 0 1]

✓ Dataset statistics:
  Training: 5,465 samples
  Test: 1,511 samples
  Validation: 178 samples
  Total: 7,154 samples



### 2.4 Intelligent Single-Label Assignment with Claude Haiku

For comparison with a single-label classifier, we convert multi-label samples to single-label. Instead of simply taking the first label, we use **Claude Haiku** to intelligently decide which label best represents the tweet's main topic.

**Strategy:**
1. **Single-label tweets**: Keep the label as-is
2. **Multi-label tweets**: Use Claude Haiku to analyze the tweet and select the most appropriate single label
3. **Caching**: Results are cached to avoid redundant API calls and reduce costs

This approach addresses the problem of arbitrary label selection and leverages semantic understanding to pick the most relevant label.

In [107]:
# ============================================================
# INTELLIGENT SINGLE-LABEL ASSIGNMENT WITH CLAUDE HAIKU
# ============================================================
# json, os, hashlib, Path and tqdm are already imported in the first cell
import anthropic

# Cache path for API results (saves cost on re-runs)
CACHE_PATH = Path("../Data/single_label/single_label_cache.json")

# Initialize the Anthropic client
# The API key is read from the environment variable
ANTHROPIC_API_KEY = os.environ.get("ANTHROPIC_API_KEY", None)

if ANTHROPIC_API_KEY is None:
    print("⚠️ ANTHROPIC_API_KEY not found!")
    print("   Please set the environment variable:")
    print("   export ANTHROPIC_API_KEY='your-api-key'")
    USE_LLM = False
else:
    client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)
    USE_LLM = True
    print("✓ Anthropic client initialized")

def load_cache() -> dict:
    """Load cached single-label decisions."""
    if CACHE_PATH.exists():
        with open(CACHE_PATH, 'r', encoding='utf-8') as f:
            return json.load(f)
    return {}

def save_cache(cache: dict):
    """Write the cache to disk."""
    CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)
    with open(CACHE_PATH, 'w', encoding='utf-8') as f:
        json.dump(cache, f, ensure_ascii=False, indent=2)

def get_cache_key(text: str, labels: list) -> str:
    """Create a unique cache key for tweet + labels."""
    content = f"{text}|{','.join(sorted(labels))}"
    return hashlib.md5(content.encode()).hexdigest()

def classify_with_haiku(text: str, labels: list, cache: dict) -> str:
    """
    Use Claude Haiku to select the best-fitting label.
    Results are cached to minimize API costs.
    """
    cache_key = get_cache_key(text, labels)

    # Check the cache
    if cache_key in cache:
        return cache[cache_key]

    # If the LLM is unavailable, fall back to the first label
    if not USE_LLM:
        return labels[0]

    # Build the prompt
    system_prompt = """You are an expert in tweet classification.
Your task is to select the ONE best-fitting label for a tweet.
Reply ONLY with the chosen label, nothing else."""

    user_prompt = f"""Tweet: "{text}"

Possible labels: {', '.join(labels)}

Which label fits this tweet best? Reply with the label only."""

    try:
        response = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=50,
            messages=[
                {"role": "user", "content": user_prompt}
            ],
            system=system_prompt
        )

        result = response.content[0].text.strip()

        # Validate that the result is one of the candidate labels
        if result in labels:
            cache[cache_key] = result
            return result
        else:
            # If the LLM returns an invalid label, look for the best match
            for label in labels:
                if label.lower() in result.lower():
                    cache[cache_key] = label
                    return label
            # Fallback: first label
            cache[cache_key] = labels[0]
            return labels[0]

    except Exception as e:
        print(f"API Error: {e}")
        return labels[0]  # Fallback

def assign_single_labels_smart(df, cache: dict, desc: str = "Processing") -> list:
    """
    Assign the best single label to each sample.
    The LLM is only used for samples with multiple labels.
    """
    single_labels = []
    api_calls = 0
    cache_hits = 0

    for idx, row in tqdm(df.iterrows(), total=len(df), desc=desc):
        labels = row['labels']

        if len(labels) == 1:
            # Only one label - no decision needed
            single_labels.append(labels[0])
        else:
            # Multiple labels - the LLM decides (or the cache)
            cache_key = get_cache_key(row['text'], labels)
            if cache_key in cache:
                cache_hits += 1
            else:
                api_calls += 1

            best_label = classify_with_haiku(row['text'], labels, cache)
            single_labels.append(best_label)

            # Save the cache periodically
            if api_calls > 0 and api_calls % 100 == 0:
                save_cache(cache)
                print(f"\n  💾 Cache saved ({api_calls} API calls, {cache_hits} cache hits)")

    print(f"\n  📊 Statistics: {api_calls} API calls, {cache_hits} cache hits, {len(df) - api_calls - cache_hits} single-label samples")
    return single_labels

⚠️ ANTHROPIC_API_KEY not found!
   Please set the environment variable:
   export ANTHROPIC_API_KEY='your-api-key'


In [108]:
# ============================================================
# LOAD OR GENERATE SINGLE-LABEL DATA
# ============================================================
# Set REGENERATE_SINGLE_LABELS = True to run the LLM classification
# again and update the files.
# By default, the saved files are loaded.

REGENERATE_SINGLE_LABELS = False  # ← Set to True to run the LLM classification

# Define paths
SINGLE_LABEL_TRAIN_PATH = "../Data/single_label/tweets_single_label_train"
SINGLE_LABEL_TEST_PATH = "../Data/single_label/tweets_single_label_test"
SINGLE_LABEL_VALIDATION_PATH = "../Data/single_label/tweets_single_label_validation"

print("=" * 70)
print("SINGLE-LABEL DATA")
print("=" * 70)

# Check whether saved files exist
train_exists = Path(f"{SINGLE_LABEL_TRAIN_PATH}.parquet").exists()
test_exists = Path(f"{SINGLE_LABEL_TEST_PATH}.parquet").exists()
val_exists = Path(f"{SINGLE_LABEL_VALIDATION_PATH}.parquet").exists()
all_files_exist = train_exists and test_exists and val_exists

if not REGENERATE_SINGLE_LABELS and all_files_exist:
    # ============================================================
    # OPTION 1: Load saved single-label files
    # ============================================================
    print("\n📂 Loading saved single-label files...")

    df_train_single = pd.read_parquet(f"{SINGLE_LABEL_TRAIN_PATH}.parquet")
    df_test_single = pd.read_parquet(f"{SINGLE_LABEL_TEST_PATH}.parquet")
    df_validation_single = pd.read_parquet(f"{SINGLE_LABEL_VALIDATION_PATH}.parquet")

    # Parse labels if needed (for compatibility)
    for df in [df_train_single, df_test_single, df_validation_single]:
        if 'labels' not in df.columns and 'label_name' in df.columns:
            df['labels'] = df['label_name'].apply(parse_labels)

    print(f"✓ Training set loaded: {len(df_train_single):,} samples")
    print(f"✓ Test set loaded: {len(df_test_single):,} samples")
    print(f"✓ Validation set loaded: {len(df_validation_single):,} samples")

    # Overwrite df_train/test/validation with the single-label versions
    df_train = df_train_single
    df_test = df_test_single
    df_validation = df_validation_single

    print(f"\n✓ Single-label distribution (training):")
    print(df_train['single_label'].value_counts())

    print("\n💡 Tip: Set REGENERATE_SINGLE_LABELS = True to regenerate the files")

else:
    # ============================================================
    # OPTION 2: Generate single labels with the LLM
    # ============================================================
    if not all_files_exist:
        print("\n⚠️ Single-label files not found - generating with LLM...")
    else:
        print("\n🔄 REGENERATE_SINGLE_LABELS = True - generating single labels with LLM...")

    print("\n" + "=" * 70)
    print("INTELLIGENT SINGLE-LABEL ASSIGNMENT WITH CLAUDE HAIKU")
    print("=" * 70)

    # Convert all datasets with assign_single_labels_smart (defined above),
    # which returns one label per row. Note: the saved files also contain a
    # 'single_label_binary' column that this path does not recreate.
    cache = load_cache()

    print("\n📌 Training Set:")
    df_train['single_label'] = assign_single_labels_smart(df_train, cache, desc="Train")

    print("\n📌 Test Set:")
    df_test['single_label'] = assign_single_labels_smart(df_test, cache, desc="Test")

    print("\n📌 Validation Set:")
    df_validation['single_label'] = assign_single_labels_smart(df_validation, cache, desc="Validation")

    save_cache(cache)

    # Show statistics
    print("\n" + "=" * 70)
    print("SINGLE-LABEL DISTRIBUTION (Training Set)")
    print("=" * 70)
    print(df_train['single_label'].value_counts())

    # Comparison: original first label vs. LLM selection
    if USE_LLM:
        original_first = df_train['labels'].apply(lambda x: x[0])
        llm_selected = df_train['single_label']
        changed = (original_first != llm_selected).sum()
        print(f"\n✓ LLM chose a label other than the first label for {changed:,} samples ({100*changed/len(df_train):.1f}%)")

        # Show examples where the LLM decided differently
        changed_mask = original_first != llm_selected
        if changed_mask.any():
            print("\n📌 Examples where the LLM decided differently:")
            examples = df_train[changed_mask].head(5)
            for idx, row in examples.iterrows():
                print(f"\n  Text: {row['text'][:80]}...")
                print(f"  Original labels: {row['labels']}")
                print(f"  First label would be: {row['labels'][0]}")
                print(f"  LLM chose: {row['single_label']}")

    # ============================================================
    # SAVE SINGLE-LABEL DATA (Parquet only)
    # ============================================================
    print("\n" + "=" * 70)
    print("SAVING SINGLE-LABEL DATASETS")
    print("=" * 70)

    # Save training set
    df_train.to_parquet(f"{SINGLE_LABEL_TRAIN_PATH}.parquet", index=False)
    print(f"✓ Training set saved: {SINGLE_LABEL_TRAIN_PATH}.parquet")

    # Save test set
    df_test.to_parquet(f"{SINGLE_LABEL_TEST_PATH}.parquet", index=False)
    print(f"✓ Test set saved: {SINGLE_LABEL_TEST_PATH}.parquet")

    # Save validation set
    df_validation.to_parquet(f"{SINGLE_LABEL_VALIDATION_PATH}.parquet", index=False)
    print(f"✓ Validation set saved: {SINGLE_LABEL_VALIDATION_PATH}.parquet")

print("\n" + "=" * 70)
print("✓ Single-label data ready!")
print("=" * 70)
print(f"\n📊 Summary:")
print(f"  Training: {len(df_train):,} samples")
print(f"  Test: {len(df_test):,} samples")
print(f"  Validation: {len(df_validation):,} samples")
print(f"\n  Columns: {list(df_train.columns)}")

SINGLE-LABEL DATA

📂 Loading saved single-label files...
✓ Training set loaded: 5,465 samples
✓ Test set loaded: 1,511 samples
✓ Validation set loaded: 178 samples

✓ Single-label distribution (training):
single_label
sports                     1617
news_&_social_concern      1421
music                      1046
film_tv_&_video             618
diaries_&_daily_life        561
celebrity_&_pop_culture     202
Name: count, dtype: int64

💡 Tip: Set REGENERATE_SINGLE_LABELS = True to regenerate the files

✓ Single-label data ready!

📊 Summary:
  Training: 5,465 samples
  Test: 1,511 samples
  Validation: 178 samples

  Columns: ['text', 'label_name', 'label', 'labels', 'label_binary', 'single_label', 'single_label_binary']


---
## 3. Task 2: Implementation Plan

### 3.1 Binary Feature Vector Construction
For each sample, we create a binary vector of size 1000 (vocabulary size):
- For each word in the vocabulary, set dimension to 1 if word is present in sample, 0 otherwise
- This is a Bag-of-Words style encoding (word order is lost)
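
As a small worked example of this encoding, the sketch below builds binary vectors for a made-up four-word vocabulary. The vocabulary and texts are hypothetical and only illustrate the idea; the notebook's own implementation follows in section 4.1.

```python
import numpy as np

vocabulary = ["game", "win", "music", "love"]               # hypothetical toy vocabulary
texts = ["great game big win", "love this music love", ""]  # hypothetical preprocessed samples

word_to_index = {word: i for i, word in enumerate(vocabulary)}
features = np.zeros((len(texts), len(vocabulary)), dtype=np.int8)

for row, text in enumerate(texts):
    for word in set(text.split()):  # set(): only presence matters, counts are ignored
        col = word_to_index.get(word)
        if col is not None:         # words outside the vocabulary are dropped
            features[row, col] = 1

print(features)
```

The second sample contains "love" twice but still gets a single 1 in that column, and the empty sample maps to an all-zero vector, which is why the feature statistics later report a minimum of 0 features per sample.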

### 3.2 MLPClassifier Configuration
- **hidden_layer_sizes**: (128, 64, 128) - three hidden layers as specified
- **activation**: 'relu' - ReLU activation (the scikit-learn default)
- **solver**: 'adam' - Adam optimizer (a stochastic, mini-batch gradient method)
- **max_iter**: 300 - upper bound on the number of training epochs; training stops earlier if the loss stops improving
- **random_state**: 42 - for reproducibility
- **early_stopping**: Disabled for multi-label (some classes have few samples), enabled for single-label
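
To make this configuration concrete, here is a minimal sketch that fits an `MLPClassifier` with the same hidden layers on random toy data and inspects the resulting layer shapes. The data, the small `max_iter`, and the variable names are assumptions chosen so the example runs quickly; it says nothing about the accuracy achievable on the tweet data.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Hypothetical toy data: 120 samples, 20 binary features, 3 classes.
rng = np.random.default_rng(42)
X_toy = rng.integers(0, 2, size=(120, 20))
y_toy = np.arange(120) % 3  # guarantees that all three classes occur

clf = MLPClassifier(
    hidden_layer_sizes=(128, 64, 128),  # three hidden layers as specified
    activation="relu",
    solver="adam",
    max_iter=20,                        # kept small so the sketch runs quickly
    random_state=42,
)

with warnings.catch_warnings():
    # 20 epochs will usually not converge; silence the expected warning.
    warnings.simplefilter("ignore", ConvergenceWarning)
    clf.fit(X_toy, y_toy)

layer_shapes = [w.shape for w in clf.coefs_]
print(layer_shapes)         # one weight matrix per layer transition
print(clf.out_activation_)  # output activation chosen by scikit-learn
```

The input and output layers are never specified explicitly: scikit-learn infers the input width from `X` and the output width and activation from `y`.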

### 3.3 Evaluation Metrics
For multi-label classification:
- Subset Accuracy (exact match)
- Hamming Loss
- Micro/Macro F1-Score

For single-label classification:
- Accuracy
- Macro/Weighted F1-Score
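
The multi-label metrics are easiest to understand on a tiny example. The sketch below computes them by hand with NumPy on made-up label matrices; the notebook itself uses the corresponding scikit-learn functions.

```python
import numpy as np

# Made-up ground truth and predictions: 3 samples, 3 classes.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 1]])

# Subset accuracy: a sample counts only if every label matches.
subset_accuracy = np.mean(np.all(y_true == y_pred, axis=1))

# Hamming loss: fraction of individual label decisions that are wrong.
hamming = np.mean(y_true != y_pred)


def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


tp = np.sum((y_true == 1) & (y_pred == 1), axis=0)  # per-class counts
fp = np.sum((y_true == 0) & (y_pred == 1), axis=0)
fn = np.sum((y_true == 1) & (y_pred == 0), axis=0)

# Micro F1 pools the counts over all classes, macro F1 averages per-class F1.
micro_f1 = f1_from_counts(tp.sum(), fp.sum(), fn.sum())
macro_f1 = float(np.mean([f1_from_counts(t, p, n) for t, p, n in zip(tp, fp, fn)]))

print(subset_accuracy, hamming, micro_f1, macro_f1)
```

Here only one of three samples matches exactly (subset accuracy 1/3), while two thirds of the individual label decisions are correct (Hamming loss 1/3). Macro F1 comes out lower than micro F1 because the class that is never predicted correctly counts as much as the others.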

---
## 4. Task 3: Multi-Label Classification

### 4.1 Feature Engineering: Binary Vector Construction

In [109]:
def create_binary_features(texts: pd.Series, vocabulary: List[str]) -> np.ndarray:
    """
    Create binary feature vectors for text samples.
    
    Each dimension represents whether a word from the vocabulary
    is present (1) or absent (0) in the sample.
    
    Parameters:
    -----------
    texts : pd.Series
        Series of preprocessed text strings (whitespace-tokenized)
    vocabulary : List[str]
        List of vocabulary words (top 1000 from Lab 4)
    
    Returns:
    --------
    np.ndarray
        Binary feature matrix of shape (n_samples, vocab_size)
    """
    vocab_set = set(vocabulary)
    vocab_to_idx = {word: idx for idx, word in enumerate(vocabulary)}
    
    n_samples = len(texts)
    n_features = len(vocabulary)
    
    # Initialize feature matrix with zeros
    features = np.zeros((n_samples, n_features), dtype=np.int8)
    
    # Fill in binary features
    for i, text in enumerate(texts):
        if isinstance(text, str):
            words = set(text.split())
            for word in words:
                if word in vocab_to_idx:
                    features[i, vocab_to_idx[word]] = 1
    
    return features

# Create binary feature vectors for all datasets
print("Creating binary feature vectors...")
X_train = create_binary_features(df_train['text'], VOCABULARY)
X_test = create_binary_features(df_test['text'], VOCABULARY)
X_validation = create_binary_features(df_validation['text'], VOCABULARY)

print(f"\n✓ Feature matrix shapes:")
print(f"  X_train: {X_train.shape}")
print(f"  X_test: {X_test.shape}")
print(f"  X_validation: {X_validation.shape}")

# Show sample feature statistics
print(f"\nFeature statistics (training set):")
print(f"  Average features per sample: {X_train.sum(axis=1).mean():.2f}")
print(f"  Max features in a sample: {X_train.sum(axis=1).max()}")
print(f"  Min features in a sample: {X_train.sum(axis=1).min()}")

Creating binary feature vectors...

✓ Feature matrix shapes:
  X_train: (5465, 1000)
  X_test: (1511, 1000)
  X_validation: (178, 1000)

Feature statistics (training set):
  Average features per sample: 7.71
  Max features in a sample: 22
  Min features in a sample: 0


### 4.2 Label Encoding (Multi-Label Binarization)

In [110]:
# ============================================================
# MULTI-LABEL ENCODING (DYNAMIC)
# ============================================================

# Use the prepared binary labels directly
y_train_multi = np.vstack(df_train['label_binary'].values)
y_test_multi = np.vstack(df_test['label_binary'].values)
y_validation_multi = np.vstack(df_validation['label_binary'].values)

# Determine the actual number of classes from the data
NUM_CLASSES = y_train_multi.shape[1]

# Create a MultiLabelBinarizer for inverse_transform
# If TOPIC_CLASSES does not have the right length, create generic names
if len(TOPIC_CLASSES) != NUM_CLASSES:
    print(f"⚠️ TOPIC_CLASSES has {len(TOPIC_CLASSES)} entries, but the data has {NUM_CLASSES} classes")
    print("   Creating generic class names...")
    TOPIC_CLASSES = [f"class_{i}" for i in range(NUM_CLASSES)]

mlb = MultiLabelBinarizer(classes=TOPIC_CLASSES)
mlb.fit([TOPIC_CLASSES])

print(f"✓ Number of classes: {NUM_CLASSES}")
print(f"✓ Class names: {TOPIC_CLASSES}")
print(f"\n✓ Multi-label matrix shapes:")
print(f"  y_train_multi: {y_train_multi.shape}")
print(f"  y_test_multi: {y_test_multi.shape}")
print(f"  y_validation_multi: {y_validation_multi.shape}")

# Label distribution
print(f"\n✓ Label distribution (training):")
print(f"  Average labels per sample: {y_train_multi.sum(axis=1).mean():.2f}")
print(f"  Samples per class:")
for i, class_name in enumerate(TOPIC_CLASSES):
    count = y_train_multi[:, i].sum()
    print(f"    {class_name}: {count}")

✓ Number of classes: 6
✓ Class names: ['celebrity_&_pop_culture', 'diaries_&_daily_life', 'film_tv_&_video', 'music', 'news_&_social_concern', 'sports']

✓ Multi-label matrix shapes:
  y_train_multi: (5465, 6)
  y_test_multi: (1511, 6)
  y_validation_multi: (178, 6)

✓ Label distribution (training):
  Average labels per sample: 1.34
  Samples per class:
    celebrity_&_pop_culture: 924
    diaries_&_daily_life: 866
    film_tv_&_video: 953
    music: 1131
    news_&_social_concern: 1782
    sports: 1683


### 4.2.1 Single-Label Encoding

In [111]:
# Create single-label encoding using the primary label (first label) from each sample
# Extract primary labels from the binary vectors

def extract_primary_label(binary_vector):
    """Extract the first active class from a binary vector"""
    for i, val in enumerate(binary_vector):
        if val == 1:  # First active class
            return i
    return 0  # Default to first class if no labels found

# Extract primary labels for single-label classification
primary_train_labels = [extract_primary_label(row) for row in y_train_multi]
primary_test_labels = [extract_primary_label(row) for row in y_test_multi]
primary_validation_labels = [extract_primary_label(row) for row in y_validation_multi]

# Convert to numpy arrays  
y_train_single = np.array(primary_train_labels)
y_test_single = np.array(primary_test_labels)
y_validation_single = np.array(primary_validation_labels)

print(f"✓ Single-label shapes:")
print(f"  y_train_single: {y_train_single.shape}")
print(f"  y_test_single: {y_test_single.shape}")
print(f"  y_validation_single: {y_validation_single.shape}")

print(f"\n✓ Label distribution (single-label training set):")
for i, class_name in enumerate(TOPIC_CLASSES):
    count = (y_train_single == i).sum()
    print(f"  {class_name}: {count}")

print(f"\n✓ Single-label encoding complete")

✓ Single-label shapes:
  y_train_single: (5465,)
  y_test_single: (1511,)
  y_validation_single: (178,)

✓ Label distribution (single-label training set):
  celebrity_&_pop_culture: 924
  diaries_&_daily_life: 807
  film_tv_&_video: 603
  music: 514
  news_&_social_concern: 1284
  sports: 1333

✓ Single-label encoding complete
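
For 0/1 label matrices, the loop in `extract_primary_label` can also be written as a single vectorized call, because `argmax` returns the index of the first maximum. The sketch below checks this on a made-up matrix; `has_label` is a hypothetical addition that flags rows without any active class, which both variants silently map to class 0.

```python
import numpy as np


def extract_primary_label(binary_vector):
    """Loop version from the cell above: index of the first active class."""
    for i, val in enumerate(binary_vector):
        if val == 1:
            return i
    return 0


# Made-up multi-label matrix, including one row without any label.
y_multi = np.array([[0, 0, 1, 0],
                    [1, 0, 1, 0],
                    [0, 0, 0, 0],
                    [0, 1, 0, 1]])

loop_result = np.array([extract_primary_label(row) for row in y_multi])
vectorized_result = y_multi.argmax(axis=1)  # first index of the row maximum
has_label = y_multi.any(axis=1)             # False for rows with no active class

print(loop_result, vectorized_result, has_label)
```

Because the class order is alphabetical, "first active class" favors classes early in the alphabet. This explains why the distribution above differs from the LLM-assigned `single_label` column loaded in section 2.4.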


### 4.3 Multi-Label Neural Network Training

In [112]:
# Create MLPClassifier with specified architecture
# Using OneVsRestClassifier for multi-label classification
# Note: early_stopping is disabled because some classes have very few samples
# which causes issues with the validation split in OneVsRest multi-label setting
mlp_base = MLPClassifier(
    hidden_layer_sizes=(128, 64, 128),  # Three hidden layers as specified
    activation='relu',                   # ReLU activation function
    solver='adam',                       # Adam optimizer (mini-batch gradient descent)
    max_iter=300,                        # Maximum iterations
    random_state=RANDOM_STATE,           # For reproducibility
    early_stopping=False,                # Disabled for multi-label compatibility
    verbose=True                         # Show training progress
)

# Wrap with OneVsRestClassifier for multi-label support
mlp_clf_multi = OneVsRestClassifier(mlp_base, n_jobs=-1)

print("="*60)
print("MULTI-LABEL NEURAL NETWORK ARCHITECTURE")
print("="*60)
print(f"Input layer:  {X_train.shape[1]} neurons (vocabulary size)")
print(f"Hidden layer 1: 128 neurons (ReLU activation)")
print(f"Hidden layer 2: 64 neurons (ReLU activation)")
print(f"Hidden layer 3: 128 neurons (ReLU activation)")
print(f"Output layer: {len(TOPIC_CLASSES)} neurons ({len(TOPIC_CLASSES)} binary classifiers)")
print("="*60)

print("\nTraining Multi-Label Neural Network...")
mlp_clf_multi.fit(X_train, y_train_multi)
print("\n✓ Multi-Label Neural Network training complete!")

MULTI-LABEL NEURAL NETWORK ARCHITECTURE
Input layer:  1000 neurons (vocabulary size)
Hidden layer 1: 128 neurons (ReLU activation)
Hidden layer 2: 64 neurons (ReLU activation)
Hidden layer 3: 128 neurons (ReLU activation)
Output layer: 6 neurons (6 binary classifiers)

Training Multi-Label Neural Network...
Iteration 1, loss = 0.63143498
Iteration 1, loss = 0.55208228
Iteration 1, loss = 0.54292549
Iteration 1, loss = 0.53747628
Iteration 1, loss = 0.53642916
Iteration 1, loss = 0.60748537
Iteration 2, loss = 0.35055481
Iteration 2, loss = 0.42907287
Iteration 2, loss = 0.40901103
Iteration 2, loss = 0.32566156
Iteration 2, loss = 0.38432390
Iteration 2, loss = 0.38086755
Iteration 3, loss = 0.21221044
Iteration 3, loss = 0.25179557
Iteration 3, loss = 0.30495521
Iteration 3, loss = 0.13805326
Iteration 3, loss = 0.35494303
Iteration 3, loss = 0.29974619
Iteration 4, loss = 0.12212557
Iteration 4, loss = 0.07519624
Iteration 4, loss = 0.16645057
Iteration 4, loss = 0.29963092
Iteration

### 4.4 Multi-Label Neural Network Evaluation

In [113]:
# Make predictions
y_pred_nn_multi = mlp_clf_multi.predict(X_test)

# Calculate metrics
nn_multi_metrics = {
    'Subset Accuracy': accuracy_score(y_test_multi, y_pred_nn_multi),
    'Hamming Loss': hamming_loss(y_test_multi, y_pred_nn_multi),
    'Micro F1': f1_score(y_test_multi, y_pred_nn_multi, average='micro', zero_division=0),
    'Macro F1': f1_score(y_test_multi, y_pred_nn_multi, average='macro', zero_division=0),
    'Micro Precision': precision_score(y_test_multi, y_pred_nn_multi, average='micro', zero_division=0),
    'Micro Recall': recall_score(y_test_multi, y_pred_nn_multi, average='micro', zero_division=0)
}

print("="*60)
print("MULTI-LABEL NEURAL NETWORK EVALUATION (Test Set)")
print("="*60)
for metric, value in nn_multi_metrics.items():
    print(f"{metric:<20}: {value:.4f}")

MULTI-LABEL NEURAL NETWORK EVALUATION (Test Set)
Subset Accuracy     : 0.4527
Hamming Loss        : 0.1442
Micro F1            : 0.6447
Macro F1            : 0.5636
Micro Precision     : 0.6985
Micro Recall        : 0.5987
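
The gap between subset accuracy (0.4527) and Hamming loss (0.1442) follows from how the two metrics are defined: subset accuracy only credits a sample when every label matches, while Hamming loss counts wrong label slots one by one. A minimal sketch on invented toy label matrices (the values below are illustrative and not taken from the dataset; the helper names are hypothetical):

```python
# Toy multi-label indicator matrices (rows = samples, columns = classes).
# These values are invented for illustration only.
y_true = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]
y_pred = [
    [1, 0, 0],  # one of three label slots wrong
    [0, 1, 0],  # exact match
    [1, 1, 0],  # exact match
    [0, 1, 1],  # one of three label slots wrong
]


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return exact / len(y_true)


def hamming_loss_manual(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    wrong = sum(ti != pi for t, p in zip(y_true, y_pred) for ti, pi in zip(t, p))
    total = sum(len(t) for t in y_true)
    return wrong / total


print(f"Subset accuracy: {subset_accuracy(y_true, y_pred):.4f}")
print(f"Hamming loss:    {hamming_loss_manual(y_true, y_pred):.4f}")
```

Half of the toy samples miss by a single label, so subset accuracy drops to one half while only two of twelve label slots are actually wrong.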


In [114]:
# Show sample predictions
y_pred_labels = mlb.inverse_transform(y_pred_nn_multi)
y_true_labels = mlb.inverse_transform(y_test_multi)

print("\nSample Multi-Label Neural Network Predictions:")
print("-" * 60)
for i in range(5):
    text = df_test['text'].iloc[i][:60]
    true = y_true_labels[i] if y_true_labels[i] else ('none',)
    pred = y_pred_labels[i] if y_pred_labels[i] else ('none',)
    match = "✓" if set(true) == set(pred) else "✗"
    print(f"\n{match} Sample {i+1}:")
    print(f"   Text: {text}...")
    print(f"   True: {true}")
    print(f"   Pred: {pred}")


Sample Multi-Label Neural Network Predictions:
------------------------------------------------------------

✗ Sample 1:
   Text: philadelphia clearly page game playbook fire net oppose goal...
   True: ('news_&_social_concern', 'sports')
   Pred: ('sports',)

✗ Sample 2:
   Text: sure bay face flyer man experience versus blue jacket year h...
   True: ('sports',)
   Pred: ('none',)

✗ Sample 3:
   Text: tizamagician put cherry kentucky derby day winner pie take d...
   True: ('news_&_social_concern', 'sports')
   Pred: ('sports',)

✗ Sample 4:
   Text: flyer give false hope absolutely destroy islander go to dest...
   True: ('news_&_social_concern', 'sports')
   Pred: ('sports',)

✗ Sample 5:
   Text: flyer tremendous season face excited season go to well thank...
   True: ('news_&_social_concern', 'sports')
   Pred: ('sports',)


### 4.5 Naive Bayes Classifier (for Comparison)

In [115]:
# Train Naive Bayes classifier with same features
nb_clf = OneVsRestClassifier(MultinomialNB(alpha=1.0))
nb_clf.fit(X_train, y_train_multi)

# Make predictions
y_pred_nb = nb_clf.predict(X_test)

# Calculate metrics
nb_metrics = {
    'Subset Accuracy': accuracy_score(y_test_multi, y_pred_nb),
    'Hamming Loss': hamming_loss(y_test_multi, y_pred_nb),
    'Micro F1': f1_score(y_test_multi, y_pred_nb, average='micro', zero_division=0),
    'Macro F1': f1_score(y_test_multi, y_pred_nb, average='macro', zero_division=0),
    'Micro Precision': precision_score(y_test_multi, y_pred_nb, average='micro', zero_division=0),
    'Micro Recall': recall_score(y_test_multi, y_pred_nb, average='micro', zero_division=0)
}

print("="*60)
print("NAIVE BAYES EVALUATION (Test Set)")
print("="*60)
for metric, value in nb_metrics.items():
    print(f"{metric:<20}: {value:.4f}")

NAIVE BAYES EVALUATION (Test Set)
Subset Accuracy     : 0.4917
Hamming Loss        : 0.1337
Micro F1            : 0.6811
Macro F1            : 0.6210
Micro Precision     : 0.7114
Micro Recall        : 0.6532


---
## 5. Task 4: Single-Label Classification

For comparison, we train a neural network using single-label classification. Each tweet is assigned only its primary (first) label, converting the multi-label problem to a standard multi-class classification problem.
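
The conversion from multi-label to single-label can be sketched as follows. This is only an illustration under the assumption that "primary" means "first entry in the label list"; the actual `single_label` column is loaded from the preprocessed parquet files and may have been derived by a different rule. The helper name `to_single_label` is hypothetical.

```python
def to_single_label(labels):
    """Return the first label of a sample, or None if it has no labels.

    Assumption: "primary" means "first in the list"; the real rule used
    when the single-label files were created may differ.
    """
    return labels[0] if len(labels) > 0 else None


# Invented examples; not taken from the dataset.
multi_labels = [
    ["sports", "news_&_social_concern"],
    ["music"],
    [],
]

single_labels = [to_single_label(lbls) for lbls in multi_labels]
print(single_labels)
```

Every secondary label is dropped by this step, which is the information loss the later comparison against the multi-label ground truth tries to quantify.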

### 5.1 Single-Label Encoding

In [116]:
# ============================================================
# SINGLE-LABEL ENCODING (DYNAMIC)
# ============================================================

# Use the dynamically detected TOPIC_CLASSES and
# make sure that every label occurs in TOPIC_CLASSES
unique_single_labels = set(df_train['single_label'].unique()) | \
                       set(df_test['single_label'].unique()) | \
                       set(df_validation['single_label'].unique())

# Check whether all labels are known
unknown_labels = unique_single_labels - set(TOPIC_CLASSES)
if unknown_labels:
    print(f"⚠️ Unknown labels found: {unknown_labels}")
    print(f"   Adding them to TOPIC_CLASSES...")
    TOPIC_CLASSES = sorted(list(set(TOPIC_CLASSES) | unknown_labels))

# Create label encoder for single-label classification
le = LabelEncoder()
le.fit(TOPIC_CLASSES)

# Encode single labels as integers
y_train_single = le.transform(df_train['single_label'])
y_test_single = le.transform(df_test['single_label'])
y_validation_single = le.transform(df_validation['single_label'])

print(f"✓ Single-label encoding complete")
print(f"\n✓ Label shapes:")
print(f"  y_train_single: {y_train_single.shape}")
print(f"  y_test_single: {y_test_single.shape}")
print(f"  y_validation_single: {y_validation_single.shape}")

print(f"\n✓ Class mapping (dynamically detected):")
for i, cls in enumerate(le.classes_):
    count = (y_train_single == i).sum()
    print(f"  {i}: {cls} ({count} samples)")

✓ Single-label encoding complete

✓ Label shapes:
  y_train_single: (5465,)
  y_test_single: (1511,)
  y_validation_single: (178,)

✓ Class mapping (dynamically detected):
  0: celebrity_&_pop_culture (202 samples)
  1: diaries_&_daily_life (561 samples)
  2: film_tv_&_video (618 samples)
  3: music (1046 samples)
  4: news_&_social_concern (1421 samples)
  5: sports (1617 samples)


### 5.2 Single-Label Neural Network Training

In [117]:
# Create MLPClassifier for single-label classification
# For single-label, MLPClassifier uses softmax output automatically
mlp_clf_single = MLPClassifier(
    hidden_layer_sizes=(128, 64, 128),  # Same architecture as multi-label
    activation='relu',                   # ReLU activation function
    solver='adam',                       # Adam optimizer
    max_iter=300,                        # Maximum iterations
    random_state=RANDOM_STATE,           # For reproducibility
    early_stopping=True,                 # Enable early stopping for single-label
    validation_fraction=0.1,             # Use 10% for validation
    verbose=True                         # Show training progress
)

print("="*60)
print("SINGLE-LABEL NEURAL NETWORK ARCHITECTURE")
print("="*60)
print(f"Input layer:  {X_train.shape[1]} neurons (vocabulary size)")
print(f"Hidden layer 1: 128 neurons (ReLU activation)")
print(f"Hidden layer 2: 64 neurons (ReLU activation)")
print(f"Hidden layer 3: 128 neurons (ReLU activation)")
print(f"Output layer: {len(TOPIC_CLASSES)} neurons (Softmax activation)")
print("="*60)

print("\nTraining Single-Label Neural Network...")
mlp_clf_single.fit(X_train, y_train_single)
print("\n✓ Single-Label Neural Network training complete!")

SINGLE-LABEL NEURAL NETWORK ARCHITECTURE
Input layer:  1000 neurons (vocabulary size)
Hidden layer 1: 128 neurons (ReLU activation)
Hidden layer 2: 64 neurons (ReLU activation)
Hidden layer 3: 128 neurons (ReLU activation)
Output layer: 6 neurons (Softmax activation)

Training Single-Label Neural Network...
Iteration 1, loss = 1.64859046
Validation score: 0.477148
Iteration 2, loss = 1.20520977
Validation score: 0.636197
Iteration 3, loss = 0.74446490
Validation score: 0.733090
Iteration 4, loss = 0.49954898
Validation score: 0.738574
Iteration 5, loss = 0.36394867
Validation score: 0.747715
Iteration 6, loss = 0.27613723
Validation score: 0.745887
Iteration 7, loss = 0.20510681
Validation score: 0.747715
Iteration 8, loss = 0.15276296
Validation score: 0.734918
Iteration 9, loss = 0.11145349
Validation score: 0.727605
Iteration 10, loss = 0.08296176
Validation score: 0.723949
Iteration 11, loss = 0.06146216
Validation score: 0.731261
Iteration 12, loss = 0.04664484
Validation score: 0
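
The log shows the validation score peaking around iteration 5 while the training loss keeps falling. The sketch below replays the logged validation scores through a simplified patience counter to make that visible. It is not scikit-learn's internal code. It assumes an improvement must exceed the best score by at least `tol`, and uses `tol=1e-4`, which is believed to be the `MLPClassifier` default. The function name `track_early_stopping` is hypothetical.

```python
# Validation scores copied from the training log above (iterations 1-11).
val_scores = [0.477148, 0.636197, 0.733090, 0.738574, 0.747715,
              0.745887, 0.747715, 0.734918, 0.727605, 0.723949, 0.731261]


def track_early_stopping(scores, tol=1e-4):
    """Replay scores and return (best_iteration, stale_count).

    Simplified rule (an assumption, not the library's exact code):
    an iteration counts as an improvement only if its score is at
    least `tol` above the best score seen so far.
    """
    best_score = float("-inf")
    best_iter = 0
    stale = 0
    for i, score in enumerate(scores, start=1):
        if score < best_score + tol:
            stale += 1      # no meaningful improvement this iteration
        else:
            stale = 0       # improvement resets the counter
        if score > best_score:
            best_score = score
            best_iter = i
    return best_iter, stale


best_iter, stale = track_early_stopping(val_scores)
print(f"Best iteration: {best_iter}, iterations without improvement: {stale}")
```

Under these assumptions the counter has been growing since iteration 6, so training would stop once it reaches the configured patience (`n_iter_no_change`), well before `max_iter=300`.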

### 5.3 Single-Label Neural Network Evaluation

In [118]:
# Make predictions
y_pred_nn_single = mlp_clf_single.predict(X_test)

# Calculate metrics
nn_single_metrics = {
    'Accuracy': accuracy_score(y_test_single, y_pred_nn_single),
    'Macro F1': f1_score(y_test_single, y_pred_nn_single, average='macro', zero_division=0),
    'Weighted F1': f1_score(y_test_single, y_pred_nn_single, average='weighted', zero_division=0),
    'Macro Precision': precision_score(y_test_single, y_pred_nn_single, average='macro', zero_division=0),
    'Macro Recall': recall_score(y_test_single, y_pred_nn_single, average='macro', zero_division=0)
}

print("="*60)
print("SINGLE-LABEL NEURAL NETWORK EVALUATION (Test Set)")
print("="*60)
for metric, value in nn_single_metrics.items():
    print(f"{metric:<20}: {value:.4f}")

SINGLE-LABEL NEURAL NETWORK EVALUATION (Test Set)
Accuracy            : 0.7207
Macro F1            : 0.5483
Weighted F1         : 0.7027
Macro Precision     : 0.5440
Macro Recall        : 0.5561


In [119]:
# Check if predicted single label matches ANY of the original multi-labels

def calculate_partial_match_accuracy(y_pred_single:  np.ndarray, 
                                      original_labels_list: pd.Series,
                                      label_encoder: LabelEncoder) -> dict:
    """
    Calculate partial match accuracy for single-label predictions.
    
    A prediction is considered a 'hit' if the predicted label matches
    at least one of the original multi-labels.
    
    Parameters:
    -----------
    y_pred_single : np.ndarray
        Single-label predictions (encoded as integers)
    original_labels_list : pd.Series
        Series of lists containing the original multi-labels (as strings)
    label_encoder :  LabelEncoder
        Fitted label encoder to convert predictions back to strings
    
    Returns:
    --------
    dict
        Dictionary containing partial match metrics
    """
    # Convert predictions to label names
    pred_labels = label_encoder.inverse_transform(y_pred_single)
    
    # Count matches
    total = len(pred_labels)
    hits = 0
    
    for pred, original_labels in zip(pred_labels, original_labels_list):
        # Check if prediction matches ANY of the original labels
        if pred in original_labels:
            hits += 1
    
    partial_match_accuracy = hits / total if total > 0 else 0.0
    
    return {
        'total_samples': total,
        'hits':  hits,
        'misses': total - hits,
        'partial_match_accuracy': partial_match_accuracy
    }

# Calculate partial match accuracy for Single-Label NN
partial_match_results = calculate_partial_match_accuracy(
    y_pred_nn_single, 
    df_test['labels'],  # Original multi-labels
    le
)

print("=" * 70)
print("PARTIAL MATCH EVALUATION:  Single-Label NN vs Original Multi-Labels")
print("=" * 70)
print(f"\nA 'hit' occurs when the predicted single label matches ANY of the")
print(f"original multi-labels (not just the first/primary label).")
print("-" * 70)
print(f"Total test samples:         {partial_match_results['total_samples']: ,}")
print(f"Hits (partial matches):    {partial_match_results['hits']:,}")
print(f"Misses:                     {partial_match_results['misses']:,}")
print(f"\nPartial Match Accuracy:    {partial_match_results['partial_match_accuracy']:.4f} ({partial_match_results['partial_match_accuracy']*100:.2f}%)")
print("-" * 70)

# Compare with exact single-label accuracy
print(f"\nComparison:")
print(f"  Exact Single-Label Accuracy:     {nn_single_metrics['Accuracy']:.4f}")
print(f"  Partial Match Accuracy:         {partial_match_results['partial_match_accuracy']:.4f}")
print(f"  Improvement:                    +{(partial_match_results['partial_match_accuracy'] - nn_single_metrics['Accuracy']):.4f}")
print("=" * 70)

# Show examples of partial matches (predictions that match a secondary label)
print("\nExamples of Partial Matches (pred matches non-primary label):")
print("-" * 70)
example_count = 0
for i in range(len(y_pred_nn_single)):
    pred_label = le.inverse_transform([y_pred_nn_single[i]])[0]
    true_primary = df_test['single_label'].iloc[i]
    original_labels = df_test['labels'].iloc[i]
    
    # Show cases where prediction doesn't match primary but matches another label
    if pred_label != true_primary and pred_label in original_labels:
        example_count += 1
        if example_count <= 5: 
            text = df_test['text'].iloc[i][: 50]
            print(f"\n✓ Sample {i+1}:")
            print(f"   Text: {text}...")
            print(f"   Original labels: {original_labels}")
            print(f"   Primary label: {true_primary}")
            print(f"   Predicted:  {pred_label} (matches secondary label! )")

print("\n" + "-" * 70)  # newline once, then a 70-character rule
print(f"Total samples where prediction matched a secondary label: {example_count}")

PARTIAL MATCH EVALUATION:  Single-Label NN vs Original Multi-Labels

A 'hit' occurs when the predicted single label matches ANY of the
original multi-labels (not just the first/primary label).
----------------------------------------------------------------------
Total test samples:          1,511
Hits (partial matches):    1,144
Misses:                     367

Partial Match Accuracy:    0.7571 (75.71%)
----------------------------------------------------------------------

Comparison:
  Exact Single-Label Accuracy:     0.7207
  Partial Match Accuracy:         0.7571
  Improvement:                    +0.0364

Examples of Partial Matches (pred matches non-primary label):
----------------------------------------------------------------------

✓ Sample 54:
   Text: flaw simple agenda leadership example talk example...
   Original labels: ['news_&_social_concern' 'sports']
   Primary label: news_&_social_concern
   Predicted:  sports (matches secondary label! )

✓ Sample 85:
   Text: 

In [120]:
# Show sample predictions for single-label
print("\nSample Single-Label Neural Network Predictions:")
print("-" * 60)
for i in range(5):
    text = df_test['text'].iloc[i][:60]
    true_label = le.inverse_transform([y_test_single[i]])[0]
    pred_label = le.inverse_transform([y_pred_nn_single[i]])[0]
    original_labels = df_test['labels'].iloc[i]
    match = "✓" if true_label == pred_label else "✗"
    print(f"\n{match} Sample {i+1}:")
    print(f"   Text: {text}...")
    print(f"   Original labels: {original_labels}")
    print(f"   Single label (true): {true_label}")
    print(f"   Single label (pred): {pred_label}")


Sample Single-Label Neural Network Predictions:
------------------------------------------------------------

✓ Sample 1:
   Text: philadelphia clearly page game playbook fire net oppose goal...
   Original labels: ['news_&_social_concern' 'sports']
   Single label (true): sports
   Single label (pred): sports

✓ Sample 2:
   Text: sure bay face flyer man experience versus blue jacket year h...
   Original labels: ['sports']
   Single label (true): sports
   Single label (pred): sports

✓ Sample 3:
   Text: tizamagician put cherry kentucky derby day winner pie take d...
   Original labels: ['news_&_social_concern' 'sports']
   Single label (true): sports
   Single label (pred): sports

✓ Sample 4:
   Text: flyer give false hope absolutely destroy islander go to dest...
   Original labels: ['news_&_social_concern' 'sports']
   Single label (true): sports
   Single label (pred): sports

✓ Sample 5:
   Text: flyer tremendous season face excited season go to well thank...
   Ori

---
## 6. Model Comparison

### 6.1 Multi-Label Models Comparison

In [121]:
# Create comparison table for multi-label models
comparison_df = pd.DataFrame({
    'Metric': list(nn_multi_metrics.keys()),
    'Neural Network (Multi-Label)': list(nn_multi_metrics.values()),
    'Naive Bayes (Multi-Label)': list(nb_metrics.values())
})

# Calculate improvement
comparison_df['Difference'] = comparison_df['Neural Network (Multi-Label)'] - comparison_df['Naive Bayes (Multi-Label)']
comparison_df['Better Model'] = comparison_df.apply(
    lambda row: 'Neural Network' if (row['Difference'] > 0 and row['Metric'] != 'Hamming Loss') 
                or (row['Difference'] < 0 and row['Metric'] == 'Hamming Loss')
                else 'Naive Bayes' if row['Difference'] != 0 else 'Tie',
    axis=1
)

print("="*80)
print("MULTI-LABEL MODEL COMPARISON: Neural Network vs Naive Bayes")
print("="*80)
print(comparison_df.to_string(index=False))
print("="*80)
print("\nNote: For Hamming Loss, lower is better. For all other metrics, higher is better.")

MULTI-LABEL MODEL COMPARISON: Neural Network vs Naive Bayes
         Metric  Neural Network (Multi-Label)  Naive Bayes (Multi-Label)  Difference Better Model
Subset Accuracy                      0.452680                   0.491727   -0.039047  Naive Bayes
   Hamming Loss                      0.144165                   0.133686    0.010479  Naive Bayes
       Micro F1                      0.644740                   0.681053   -0.036312  Naive Bayes
       Macro F1                      0.563573                   0.620975   -0.057402  Naive Bayes
Micro Precision                      0.698469                   0.711380   -0.012911  Naive Bayes
   Micro Recall                      0.598688                   0.653205   -0.054518  Naive Bayes

Note: For Hamming Loss, lower is better. For all other metrics, higher is better.


### 6.2 Single-Label vs Multi-Label Comparison

In [122]:
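# Compare single-label NN predictions against multi-label ground truth
# Convert single-label predictions to multi-label (one-hot) format for comparison
# Note: this assumes the LabelEncoder class order (le.classes_) matches the
# column order of y_test_multi; otherwise the columns would be misaligned.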
y_pred_single_as_multi = np.zeros((len(y_pred_nn_single), len(TOPIC_CLASSES)), dtype=int)
for i, pred in enumerate(y_pred_nn_single):
    y_pred_single_as_multi[i, pred] = 1

# Calculate metrics for single-label NN on multi-label test set
single_on_multi_metrics = {
    'Subset Accuracy': accuracy_score(y_test_multi, y_pred_single_as_multi),
    'Hamming Loss': hamming_loss(y_test_multi, y_pred_single_as_multi),
    'Micro F1': f1_score(y_test_multi, y_pred_single_as_multi, average='micro', zero_division=0),
    'Macro F1': f1_score(y_test_multi, y_pred_single_as_multi, average='macro', zero_division=0),
}

print("="*80)
print("SINGLE-LABEL VS MULTI-LABEL NEURAL NETWORK COMPARISON")
print("="*80)
print(f"\n{'Metric':<20} {'Multi-Label NN':<18} {'Single-Label NN':<18} {'Difference':<12}")
print("-"*70)
for metric in ['Subset Accuracy', 'Hamming Loss', 'Micro F1', 'Macro F1']:
    multi = nn_multi_metrics[metric]
    single = single_on_multi_metrics[metric]
    diff = multi - single
    print(f"{metric:<20} {multi:<18.4f} {single:<18.4f} {diff:+.4f}")

print("\n" + "="*80)
print("\nNote: Single-label NN can only predict one class per sample.")
print("Comparison is made against the original multi-label ground truth.")

SINGLE-LABEL VS MULTI-LABEL NEURAL NETWORK COMPARISON

Metric               Multi-Label NN     Single-Label NN    Difference  
----------------------------------------------------------------------
Subset Accuracy      0.4527             0.5473             -0.0946
Hamming Loss         0.1442             0.1328             +0.0114
Micro F1             0.6447             0.6552             -0.0105
Macro F1             0.5636             0.5243             +0.0393


Note: Single-label NN can only predict one class per sample.
Comparison is made against the original multi-label ground truth.
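
To see why a model that emits exactly one label per sample can still score well on subset accuracy while being capped on recall, consider this small worked example. The label sets are invented for illustration and do not come from the test set.

```python
# Toy example: true label sets vs. exactly one predicted label per sample.
y_true = [{"sports", "news_&_social_concern"}, {"sports"}, {"music"}]
y_pred = [{"sports"}, {"sports"}, {"film_tv_&_video"}]

# Count label-level outcomes across all samples (micro-averaging).
tp = sum(len(t & p) for t, p in zip(y_true, y_pred))  # predicted and true
fp = sum(len(p - t) for t, p in zip(y_true, y_pred))  # predicted, not true
fn = sum(len(t - p) for t, p in zip(y_true, y_pred))  # true, not predicted

micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)

# A sample only counts for subset accuracy if the sets are identical.
subset_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"Micro precision: {micro_precision:.4f}")
print(f"Micro recall:    {micro_recall:.4f}")
print(f"Subset accuracy: {subset_acc:.4f}")
```

The first sample is a correct prediction for one of its two labels, yet it counts as a subset-accuracy miss and contributes a false negative. Samples with a single true label, on the other hand, can be matched exactly by a one-label prediction.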


### 6.3 All Models Summary

In [123]:
# Create comprehensive summary table
print("="*90)
print("COMPREHENSIVE MODEL COMPARISON SUMMARY")
print("="*90)

print("\n" + "-"*90)
print("MULTI-LABEL CLASSIFICATION RESULTS (evaluated on multi-label test set)")
print("-"*90)
print(f"{'Model':<35} {'Accuracy':<12} {'Micro F1':<12} {'Macro F1':<12} {'Hamming Loss':<12}")
print("-"*90)
print(f"{'Multi-Label Neural Network':<35} {nn_multi_metrics['Subset Accuracy']:<12.4f} {nn_multi_metrics['Micro F1']:<12.4f} {nn_multi_metrics['Macro F1']:<12.4f} {nn_multi_metrics['Hamming Loss']:<12.4f}")
print(f"{'Naive Bayes (Multi-Label)':<35} {nb_metrics['Subset Accuracy']:<12.4f} {nb_metrics['Micro F1']:<12.4f} {nb_metrics['Macro F1']:<12.4f} {nb_metrics['Hamming Loss']:<12.4f}")
print(f"{'Single-Label NN (on multi-label)':<35} {single_on_multi_metrics['Subset Accuracy']:<12.4f} {single_on_multi_metrics['Micro F1']:<12.4f} {single_on_multi_metrics['Macro F1']:<12.4f} {single_on_multi_metrics['Hamming Loss']:<12.4f}")

print("\n" + "-"*90)
print("SINGLE-LABEL CLASSIFICATION RESULTS (evaluated on single-label test set)")
print("-"*90)
print(f"{'Model':<35} {'Accuracy':<12} {'Weighted F1':<12} {'Macro F1':<12}")
print("-"*90)
print(f"{'Single-Label Neural Network':<35} {nn_single_metrics['Accuracy']:<12.4f} {nn_single_metrics['Weighted F1']:<12.4f} {nn_single_metrics['Macro F1']:<12.4f}")

print("\n" + "="*90)

COMPREHENSIVE MODEL COMPARISON SUMMARY

------------------------------------------------------------------------------------------
MULTI-LABEL CLASSIFICATION RESULTS (evaluated on multi-label test set)
------------------------------------------------------------------------------------------
Model                               Accuracy     Micro F1     Macro F1     Hamming Loss
------------------------------------------------------------------------------------------
Multi-Label Neural Network          0.4527       0.6447       0.5636       0.1442      
Naive Bayes (Multi-Label)           0.4917       0.6811       0.6210       0.1337      
Single-Label NN (on multi-label)    0.5473       0.6552       0.5243       0.1328      

------------------------------------------------------------------------------------------
SINGLE-LABEL CLASSIFICATION RESULTS (evaluated on single-label test set)
------------------------------------------------------------------------------------------
Model   

---
## 7. Optional: Experiment with Different Network Sizes

In [124]:
# Define different architectures to test
architectures = {
    'Small (64-32-64)': (64, 32, 64),
    'Medium (128-64-128)': (128, 64, 128),  # Original
    'Large (256-128-256)': (256, 128, 256),
    'Deep (128-128-64-64-128-128)': (128, 128, 64, 64, 128, 128),
    'Wide (512-256-512)': (512, 256, 512)
}

results = []

print("Experimenting with different network architectures (Multi-Label)...")
print("="*60)

for name, layers in architectures.items():
    print(f"\nTraining: {name}...")
    
    # Create and train model
    mlp = MLPClassifier(
        hidden_layer_sizes=layers,
        activation='relu',
        solver='adam',
        max_iter=200,
        random_state=RANDOM_STATE,
        early_stopping=False,  # Disabled for multi-label compatibility
        verbose=False
    )
    
    clf = OneVsRestClassifier(mlp, n_jobs=-1)
    clf.fit(X_train, y_train_multi)
    
    # Evaluate
    y_pred = clf.predict(X_test)
    
    results.append({
        'Architecture': name,
        'Layers': str(layers),
        'Accuracy': accuracy_score(y_test_multi, y_pred),
        'Micro F1': f1_score(y_test_multi, y_pred, average='micro', zero_division=0),
        'Macro F1': f1_score(y_test_multi, y_pred, average='macro', zero_division=0)
    })
    
    print(f"  Accuracy: {results[-1]['Accuracy']:.4f}, Micro F1: {results[-1]['Micro F1']:.4f}")

# Display results
results_df = pd.DataFrame(results)
print("\n" + "="*80)
print("ARCHITECTURE COMPARISON RESULTS")
print("="*80)
print(results_df.to_string(index=False))

Experimenting with different network architectures (Multi-Label)...

Training: Small (64-32-64)...
  Accuracy: 0.4520, Micro F1: 0.6409

Training: Medium (128-64-128)...
  Accuracy: 0.4527, Micro F1: 0.6447

Training: Large (256-128-256)...
  Accuracy: 0.4573, Micro F1: 0.6455

Training: Deep (128-128-64-64-128-128)...
  Accuracy: 0.4725, Micro F1: 0.6535

Training: Wide (512-256-512)...
  Accuracy: 0.4626, Micro F1: 0.6485

ARCHITECTURE COMPARISON RESULTS
                Architecture                       Layers  Accuracy  Micro F1  Macro F1
            Small (64-32-64)                 (64, 32, 64)  0.452019  0.640916  0.569563
         Medium (128-64-128)               (128, 64, 128)  0.452680  0.644740  0.563573
         Large (256-128-256)              (256, 128, 256)  0.457313  0.645541  0.567340
Deep (128-128-64-64-128-128) (128, 128, 64, 64, 128, 128)  0.472535  0.653532  0.573370
          Wide (512-256-512)              (512, 256, 512)  0.462608  0.648470  0.567219
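
To put the architecture names into perspective, the number of trainable parameters per network can be estimated from the layer sizes. The sketch below assumes fully connected layers with one bias per neuron, 1000 input features, and a single output unit per binary classifier (as would be the case for each estimator inside `OneVsRestClassifier`). The helper `count_mlp_parameters` is hypothetical and not part of scikit-learn.

```python
def count_mlp_parameters(n_inputs, hidden_layers, n_outputs):
    """Count weights and biases of a fully connected network."""
    sizes = [n_inputs, *hidden_layers, n_outputs]
    # Each layer has fan_in * fan_out weights plus fan_out biases.
    return sum((fan_in + 1) * fan_out
               for fan_in, fan_out in zip(sizes[:-1], sizes[1:]))


architectures = {
    'Small (64-32-64)': (64, 32, 64),
    'Medium (128-64-128)': (128, 64, 128),
    'Large (256-128-256)': (256, 128, 256),
    'Deep (128-128-64-64-128-128)': (128, 128, 64, 64, 128, 128),
    'Wide (512-256-512)': (512, 256, 512),
}

for name, layers in architectures.items():
    n_params = count_mlp_parameters(1000, layers, 1)
    print(f"{name:<30} {n_params:>10,} parameters per binary classifier")
```

Because the first layer sees all 1000 inputs, it dominates the parameter count, which is one plausible reason why widening the network changes the scores so little here.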


---
## 8. Summary

### What was accomplished
1. Loaded preprocessed data from Lab 2 and vocabulary from Lab 4
2. Created binary feature vectors (Bag-of-Words encoding) for all samples
3. Trained a Multi-Label Neural Network with 128→64→128 hidden layers using MLPClassifier and OneVsRestClassifier
4. Converted multi-label data to single-label by keeping only the primary label
5. Trained a Single-Label Neural Network with the same architecture
6. Compared Multi-Label NN, Single-Label NN, and Naive Bayes classifiers
7. Experimented with different network architectures

### Key Findings
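- Multi-label classification can assign several topics to one tweet, but exact-match (subset) accuracy is strict: the multi-label NN reached 0.4527 even though its Hamming loss was only 0.1442
- Single-label classification simplifies the problem but loses information about secondary topics; the single-label NN reached 0.7207 accuracy on primary labels and 0.7571 when any original label counts as a hit
- Neural networks can capture non-linear relationships in text classification, yet on these binary bag-of-words features the multi-label NN trailed Naive Bayes on all six metrics (e.g. Micro F1 0.6447 vs 0.6811)
- The MLPClassifier was trained with ReLU activation and the Adam optimizer; in the single-label run the validation score peaked at 0.7477 around iteration 5 while the training loss kept falling, which suggests overfitting in later iterations
- For multi-label tasks, OneVsRestClassifier trains a separate binary classifier per class
- For single-label tasks, MLPClassifier uses a softmax output that yields a probability distribution over the classes
- Network architecture affects performance only modestly here: subset accuracy ranged from 0.4520 (Small) to 0.4725 (Deep), and the Wide network (0.4626) did not beat the Deep one, so larger isn't always better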
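
The difference between per-class sigmoid outputs (multi-label) and a softmax output (single-label) noted in the findings above can be sketched in a few lines. The logits are invented, and the 0.5 decision threshold is an assumption chosen for illustration.

```python
import math


def sigmoid(z):
    """Squash one score into (0, 1), independently of other classes."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(zs)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, -1.0]  # invented scores for three classes

sigmoid_probs = [sigmoid(z) for z in logits]
softmax_probs = softmax(logits)

# Multi-label: each class is switched on independently (threshold 0.5).
multi_label_prediction = [int(p >= 0.5) for p in sigmoid_probs]
# Single-label: exactly one class wins.
single_label_prediction = softmax_probs.index(max(softmax_probs))

print("Sigmoid:", [round(p, 3) for p in sigmoid_probs], "->", multi_label_prediction)
print("Softmax:", [round(p, 3) for p in softmax_probs], "->", single_label_prediction)
```

With sigmoid outputs the first two classes can both be active, whereas softmax forces the classes to compete, so only the highest-scoring one is returned.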

In [None]:
print("="*60)
print("LAB 5 SUMMARY")
print("="*60)
print(f"Input vocabulary: {VOCABULARY_PATH}")
print(f"Training samples: {len(df_train):,}")
print(f"Test samples: {len(df_test):,}")
print(f"Feature vector size: {X_train.shape[1]}")
print(f"Number of classes: {len(TOPIC_CLASSES)}")
print(f"\nMulti-Label Neural Network Metrics:")
print(f"  Subset Accuracy: {nn_multi_metrics['Subset Accuracy']:.4f}")
print(f"  Micro F1: {nn_multi_metrics['Micro F1']:.4f}")
print(f"  Macro F1: {nn_multi_metrics['Macro F1']:.4f}")
print(f"\nSingle-Label Neural Network Metrics:")
print(f"  Accuracy: {nn_single_metrics['Accuracy']:.4f}")
print(f"  Weighted F1: {nn_single_metrics['Weighted F1']:.4f}")
print(f"  Macro F1: {nn_single_metrics['Macro F1']:.4f}")
print("="*60)

LAB 5 SUMMARY
Input vocabulary: ../Data/top_1000_vocabulary.json
Training samples: 5,465
Test samples: 1,511
Feature vector size: 1000
Number of classes: 6

Multi-Label Neural Network Metrics:
  Subset Accuracy: 0.4527
  Micro F1: 0.6447
  Macro F1: 0.5636

Single-Label Neural Network Metrics:
  Accuracy: 0.7207
  Weighted F1: 0.7027
  Macro F1: 0.5483

