
# MVSA-Single Text Sentiment Classification (TF‑IDF + Logistic Regression)
### Preprocessing Ablation Study (Raw vs Individual Steps vs Full Pipeline)

This notebook implements a **text-only** sentiment classification project on the **MVSA-Single** dataset and runs **controlled comparison experiments** to measure how individual preprocessing steps affect a **Logistic Regression** model.

## Goals
- Use **only** a **Logistic Regression** classifier.
- Compare **Raw text** vs **individual preprocessing steps** vs **Full preprocessing** on the **same English subset**.
- Report improvements using **Accuracy** and **Macro-F1** (Macro-F1 is more informative under class imbalance).
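
To make the Macro-F1 remark concrete, here is a minimal pure-Python sketch on hypothetical toy labels (not MVSA data). The helper names `accuracy` and `macro_f1` are illustrative and are not the scikit-learn functions used later in the notebook; a class with no true positives is scored as F1 = 0.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy labels: 8 neutral, 1 positive, 1 negative.
y_true_toy = ["neutral"] * 8 + ["positive", "negative"]
# A degenerate classifier that always predicts the majority class.
y_pred_toy = ["neutral"] * 10

print(f"accuracy = {accuracy(y_true_toy, y_pred_toy):.3f}")  # accuracy = 0.800
print(f"macro-F1 = {macro_f1(y_true_toy, y_pred_toy):.3f}")  # macro-F1 = 0.296
```

The always-neutral classifier looks acceptable on accuracy but scores poorly on Macro-F1, because the two minority classes each contribute an F1 of zero.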

## Experiments
All experiments share the same:
- English subset selection
- Train/Test split (fixed random seed, stratified)
- TF-IDF settings
- Logistic Regression hyperparameters

We change **only one factor** at a time (except the **Full** experiment which combines all steps).

| ID | Description | Text Transform | Training Change |
|---|---|---|---|
| E0 | Raw baseline (English subset) | None | None |
| E1 | Remove `@/#/url` tokens | ✅ | None |
| E2 | Slang expansion | ✅ | None |
| E3 | Class weighting | None | `class_weight="balanced"` |
| E4 | **FULL**: Remove `@/#/url` + Slang expansion + Class weighting | ✅ | `class_weight="balanced"` |
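
The idea behind a shared, stratified, fixed-seed split can be sketched in plain Python as follows. This is an illustration only: `stratified_split` and the toy labels are hypothetical, and the procedure is not the one scikit-learn's `train_test_split` uses internally.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), holding out about test_size of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels with a 1:2:1 class ratio.
toy_labels = ["neg"] * 10 + ["neu"] * 20 + ["pos"] * 10
train_idx, test_idx = stratified_split(toy_labels)

# The same seed reproduces the same split, so every experiment sees the same test set.
assert stratified_split(toy_labels) == (train_idx, test_idx)
print(len(train_idx), len(test_idx))  # 32 8
```

Because the split is computed once and reused, differences between experiments can be attributed to the factor being changed rather than to a different test set.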

---


In [18]:
# =========================
# 1) Setup & Imports
# =========================
# This notebook is designed to be reproducible:
# - Fixed random seed for splitting
# - Deterministic language detection seed (if langdetect is installed)
# - All key parameters are defined in one place

import os
import re
from pathlib import Path
from typing import Dict, Callable, Optional, List

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Make langdetect deterministic if available
try:
    from langdetect import DetectorFactory
    DetectorFactory.seed = 0
except Exception:
    pass

print("Imports complete.")

Imports complete.



## 2) Load MVSA-Single (Text + Text Sentiment Labels)

The dataset is available on Kaggle: https://www.kaggle.com/datasets/vincemarcs/mvsasingle

Inside it, the MVSA-Single structure typically includes:
- `labelResultAll.txt` : tab-separated, contains labels for each ID (format: `text,image`)
- `data/<ID>.txt` : raw tweet text for each sample

We will:
1. Download the dataset
2. Parse the **text sentiment** label (the first value in `text,image`)
3. Read tweet texts from `data/*.txt`
4. Merge into a single DataFrame: **(text, label)**

> Note: This notebook focuses on **text-only** modeling.
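
As a small illustration of step 2, the sketch below parses one label row, assuming the `ID<TAB>text,image` layout described above. The row is made up for the example and `parse_label_line` is a hypothetical helper, not part of the notebook.

```python
def parse_label_line(line: str):
    """Split one 'ID<TAB>text,image' row into (id, text_label, image_label)."""
    sample_id, pair = line.rstrip("\n").split("\t")
    text_label, image_label = (part.strip().lower() for part in pair.split(","))
    return int(sample_id), text_label, image_label

# Hypothetical row in the assumed format; not copied from the dataset.
row = "17\tneutral,positive\n"
print(parse_label_line(row))  # (17, 'neutral', 'positive')
```

Only the second element (the text label) is used in the rest of this notebook.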


In [19]:
# =========================
# 2) Dataset Extraction
# =========================

import kagglehub

# Download latest version
path = kagglehub.dataset_download("vincemarcs/mvsasingle")

print("Path to dataset files:", path)

MVSA_ROOT = Path(path) / "MVSA_Single"
LABEL_FILE = MVSA_ROOT / "labelResultAll.txt"
DATA_DIR = MVSA_ROOT / "data"

print("MVSA_ROOT:", MVSA_ROOT)
print("LABEL_FILE:", LABEL_FILE)
print("DATA_DIR:", DATA_DIR)

# Load label file: columns are typically ["ID", "text,image"]
labels = pd.read_csv(LABEL_FILE, sep="\t", engine="python")
labels.columns = [c.strip() for c in labels.columns]

# Extract *text* sentiment label from "text,image" column
# Example row: "neutral,positive"  -> text label = "neutral"
# Drop missing rows first: astype(str) would otherwise turn NaN into the string "nan"
labels = labels.dropna(subset=["ID", "text,image"])
labels["label"] = labels["text,image"].astype(str).str.split(",").str[0].str.strip().str.lower()
labels["ID"] = labels["ID"].astype(int)

# Load texts from data/*.txt
rows = []
id_pattern = re.compile(r"^(\d+)\.txt$")

for p in DATA_DIR.glob("*.txt"):
    m = id_pattern.match(p.name)
    if not m:
        continue
    _id = int(m.group(1))
    # Tweets can contain emojis or unusual chars; ignore decoding errors safely
    text = p.read_text(encoding="utf-8", errors="ignore").strip()
    rows.append((_id, text))

texts = pd.DataFrame(rows, columns=["ID", "text"])

# Merge text + label
df = texts.merge(labels[["ID", "label"]], on="ID", how="inner")
df = df[["text", "label"]].dropna().reset_index(drop=True)

print("Merged dataset shape:", df.shape)
df.head()

Path to dataset files: C:\Users\mhjv_\.cache\kagglehub\datasets\vincemarcs\mvsasingle\versions\1
MVSA_ROOT: C:\Users\mhjv_\.cache\kagglehub\datasets\vincemarcs\mvsasingle\versions\1\MVSA_Single
LABEL_FILE: C:\Users\mhjv_\.cache\kagglehub\datasets\vincemarcs\mvsasingle\versions\1\MVSA_Single\labelResultAll.txt
DATA_DIR: C:\Users\mhjv_\.cache\kagglehub\datasets\vincemarcs\mvsasingle\versions\1\MVSA_Single\data
Merged dataset shape: (4869, 2)


| | text | label |
|---|---|---|
| 0 | How I feel today #legday #jelly #aching #gym | neutral |
| 1 | @ArrivaTW absolute disgrace two carriages from... | negative |
| 2 | This is my Valentine's from 1 of my nephews. I... | positive |
| 3 | betterfeelingfilms: RT via Instagram: First da... | positive |
| 4 | Zoe's first love #Rattled @JohnnyHarper15 | positive |



## 3) Quick Exploratory Data Analysis (EDA)

We inspect:
- Label distribution
- A few example texts

This helps us understand class imbalance and the kind of noise present (mentions, hashtags, URLs, slang).


In [20]:
# Label distribution
label_counts = df["label"].value_counts()
label_dist = (label_counts / len(df)).round(4)

display(pd.DataFrame({"count": label_counts, "ratio": label_dist}))

# Show a few random examples
df.sample(5, random_state=RANDOM_STATE)

| label | count | ratio |
|---|---|---|
| neutral | 1921 | 0.3945 |
| positive | 1731 | 0.3555 |
| negative | 1217 | 0.2499 |


| | text | label |
|---|---|---|
| 3861 | RT @The4GNet: Black Pastor Protests Outside NA... | neutral |
| 1101 | #Wiserswhiskeyquestionoftheday Cheerful Valent... | positive |
| 1149 | Top chef even when I'm crippled, just need the... | positive |
| 4135 | ˤQ \| Mocha106b #pixiv http://t.co/sp4d7R5EnN h... | neutral |
| 1721 | RT @TheBloggess: That moment when you have a t... | negative |



## 4) English Subset Filtering

MVSA-Single contains multiple languages. To make our experiments cleaner and to align with the project requirement,
we filter to an **English subset** once, then run all ablations on the same subset.

We implement English filtering using:
- Preferred: `langdetect` (if installed)
- Fallback: a simple **ASCII ratio heuristic** if `langdetect` is unavailable
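
The fallback is easiest to judge on a few examples. The sketch below recomputes the ASCII ratio on three illustrative strings (the samples and the helper name `ascii_ratio` are assumptions for the demo; the 0.90 threshold mirrors the default used in the code cell that follows).

```python
def ascii_ratio(text: str) -> float:
    """Fraction of characters with code points below 128."""
    if not text:
        return 0.0
    return sum(1 for ch in text if ord(ch) < 128) / len(text)

# Illustrative samples, not taken from the dataset.
samples = {
    "english": "How I feel today #legday #gym",
    "japanese": "今日はジムに行きました",
    "spanish": "Hoy me siento muy bien en el gimnasio",
}
for name, text in samples.items():
    ratio = ascii_ratio(text)
    print(f"{name:9s} ratio={ratio:.2f} english_like={ratio >= 0.90}")
```

The Spanish sentence passes the check even though it is not English: the heuristic separates scripts rather than languages, which is why `langdetect` is preferred when it is installed.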


In [None]:
# =========================
# 4) English Detection Helpers
# =========================

_whitespace_re = re.compile(r"\s+")

def normalize_text_basic(text: str) -> str:
    # Minimal normalization shared by all experiments:
    # - Replace line breaks with spaces
    # - Collapse repeated whitespace
    # - Trim leading/trailing spaces
    text = "" if text is None else str(text)
    text = text.replace("\r", " ").replace("\n", " ")
    text = _whitespace_re.sub(" ", text).strip()
    return text

def is_english_langdetect(text: str) -> Optional[bool]:
    # Return True/False using langdetect if available.
    # Return None if langdetect is not installed.
    try:
        from langdetect import detect
    except Exception:
        return None

    t = normalize_text_basic(text)
    if len(t) < 3:
        return False
    try:
        return detect(t) == "en"
    except Exception:
        # langdetect can fail on very short or noisy strings
        return False

def is_english_ascii_heuristic(text: str, threshold: float = 0.90) -> bool:
    # Fallback heuristic:
    # - Count the fraction of ASCII characters
    # - If ratio >= threshold, treat as English-like
    t = normalize_text_basic(text)
    if not t:
        return False
    ascii_chars = sum(1 for c in t if ord(c) < 128)
    return (ascii_chars / max(1, len(t))) >= threshold

def is_english(text: str) -> bool:
    r = is_english_langdetect(text)
    if r is None:
        return is_english_ascii_heuristic(text)
    return bool(r)

# Apply English filtering
df["is_english"] = df["text"].apply(is_english)
df_en = df.loc[df["is_english"]].drop(columns=["is_english"]).reset_index(drop=True)

print("Before English filter:", len(df))
print("After English filter:", len(df_en))

display(df_en["label"].value_counts())


## 5) Preprocessing Functions (Used in Ablations)

We implement the preprocessing steps required by the experiments:

### (A) Remove social tokens
Remove tokens starting with:
- `@` (mentions)
- `#` (hashtags)
- `http` (URLs)

### (B) Slang expansion
Replace common social-media abbreviations using a dictionary, e.g.:
- `lol` → `laugh out loud`
- `idk` → `i do not know`

### (C) Full preprocessing (E4)
Combine:
- remove social tokens
- slang expansion
and also apply class weighting during training.
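
The intended effect of steps (A) and (B) on a single tweet can be sketched as below. The helper names, the two-entry dictionary, and the whitespace tokenization are assumptions made for this illustration; the notebook's own functions are defined in the next cell.

```python
# Hypothetical helpers and a two-entry dictionary, for illustration only.
DEMO_SLANG = {"lol": "laugh out loud", "idk": "i do not know"}
SOCIAL_PREFIXES = ("@", "#", "http")

def drop_social(text: str) -> str:
    """Keep whitespace-separated tokens that do not start with a social prefix."""
    return " ".join(
        tok for tok in text.split() if not tok.lower().startswith(SOCIAL_PREFIXES)
    )

def expand(text: str) -> str:
    """Replace tokens that exactly match a dictionary key (case-insensitive)."""
    return " ".join(DEMO_SLANG.get(tok.lower(), tok) for tok in text.split())

tweet = "@user lol check this out http://example.com #happy"
print(drop_social(tweet))          # lol check this out
print(expand(tweet))               # @user laugh out loud check this out http://example.com #happy
print(expand(drop_social(tweet)))  # laugh out loud check this out
```

Exact token matching keeps the sketch short, but it means a token with attached punctuation such as `lol!` would not be expanded here.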


In [None]:
# =========================
# 5) Preprocessing Functions
# =========================

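# Token pattern: runs of non-space characters trimmed to word boundaries,
# so leading/trailing punctuation is dropped ("@user" -> "user", "lol!" -> "lol").
# A key that ends in punctuation, such as "w/", therefore never matches in expand_slang.
_token_re = re.compile(r"\b\S+\b", flags=re.UNICODE)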

def remove_social_tokens(text: str) -> str:
    # Remove tokens that start with '@', '#', or 'http'.
    # This is common for tweet-like data where mentions/hashtags/URLs can add noise.
    # Split on whitespace rather than with _token_re: the \b anchors in that
    # pattern strip a leading '@' or '#', so those prefixes would never be seen here.
    t = normalize_text_basic(text)
    kept = []
    for tok in t.split():
        low = tok.lower()
        if low.startswith(("@", "#", "http")):
            continue
        kept.append(tok)
    return " ".join(kept)

# A small, editable slang dictionary (you can extend this list as needed)
SLANG_DICT: Dict[str, str] = {
    "lol": "laugh out loud",
    "lmao": "laughing my ass off",
    "rofl": "rolling on the floor laughing",
    "idk": "i do not know",
    "imo": "in my opinion",
    "imho": "in my humble opinion",
    "omg": "oh my god",
    "btw": "by the way",
    "brb": "be right back",
    "ttyl": "talk to you later",
    "thx": "thanks",
    "u": "you",
    "ur": "your",
    "pls": "please",
    "plz": "please",
    "w/": "with",
    "w/o": "without",
}

# Lowercase keys for case-insensitive matching
SLANG_DICT = {k.lower(): v for k, v in SLANG_DICT.items()}

def expand_slang(text: str, slang: Dict[str, str] = SLANG_DICT) -> str:
    # Replace slang tokens by dictionary expansion (token-by-token).
    t = normalize_text_basic(text)
    tokens = _token_re.findall(t)
    out = []
    for tok in tokens:
        repl = slang.get(tok.lower())
        out.append(repl if repl is not None else tok)
    return " ".join(out)

def full_preprocess(text: str) -> str:
    # FULL preprocessing used in E4:
    # 1) Remove social tokens
    # 2) Expand slang
    t = remove_social_tokens(text)
    t = expand_slang(t, SLANG_DICT)
    return t

# Quick sanity check
sample_text = "@user lol check this out http://example.com #happy"
print("Original:", sample_text)
print("Remove social:", remove_social_tokens(sample_text))
print("Slang expand:", expand_slang(sample_text))
print("FULL:", full_preprocess(sample_text))


## 6) Feature Engineering and Model

### TF‑IDF Features
We convert texts into sparse vectors using **TF‑IDF** with word n‑grams (1–2 grams).
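
For intuition, the sketch below builds 1–2 gram features for a hypothetical three-document corpus and weights them with raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation, which is my understanding of the `TfidfVectorizer` defaults. It tokenizes on whitespace and does not reproduce the `min_df`, `max_df`, or `token_pattern` settings used in this notebook.

```python
import math
from collections import Counter

def word_ngrams(text: str, n_min: int = 1, n_max: int = 2):
    """Lowercased word n-grams (n_min..n_max) from whitespace tokens."""
    words = text.lower().split()
    return [
        " ".join(words[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(words) - n + 1)
    ]

docs = ["good day", "bad day", "good good mood"]  # hypothetical mini-corpus
doc_terms = [word_ngrams(d) for d in docs]

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for terms in doc_terms for term in set(terms))
n_docs = len(docs)

def tfidf(terms):
    """Raw term count times smoothed IDF, then L2-normalised."""
    tf = Counter(terms)
    weights = {
        t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1) for t, c in tf.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

print(word_ngrams("good day"))  # ['good', 'day', 'good day']
print(tfidf(doc_terms[0]))
```

In the first document the bigram `good day` receives a larger weight than either unigram, because it occurs in only one document while `good` and `day` each occur in two.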

### Logistic Regression Classifier
We train a **multiclass Logistic Regression** model on TF‑IDF features.

For the class-imbalance experiments (E3 and E4), we use:
- `class_weight="balanced"`

This reweights the loss so minority classes receive higher weight, often improving **Macro-F1**.
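
As a worked example, the "balanced" heuristic documented by scikit-learn sets each class weight to `n_samples / (n_classes * class_count)`. The sketch below applies it to the label counts shown in the EDA section; those counts come from the full merged dataset, so the weights actually used in E3 and E4 (computed on the English training split) will differ somewhat.

```python
# Label counts reported in the EDA section (full merged dataset, before English filtering).
counts = {"neutral": 1921, "positive": 1731, "negative": 1217}

n_samples = sum(counts.values())
n_classes = len(counts)

# "balanced" heuristic: n_samples / (n_classes * count_of_class)
weights = {label: n_samples / (n_classes * c) for label, c in counts.items()}
for label, w in weights.items():
    print(f"{label:9s} weight={w:.3f}")
# neutral   weight=0.845
# positive  weight=0.938
# negative  weight=1.334
```

The rarest class (negative) gets the largest weight, so its misclassifications cost more during training.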


In [None]:
# =========================
# 6) Train/Test Split (Stratified) + Model Builder
# =========================

LABEL_MAP = {"negative": 0, "neutral": 1, "positive": 2}

df_en_num = df_en.copy()
df_en_num["y"] = df_en_num["label"].map(LABEL_MAP)

# Drop any unexpected labels (safe-guard)
df_en_num = df_en_num.dropna(subset=["y"]).reset_index(drop=True)
df_en_num["y"] = df_en_num["y"].astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df_en_num["text"].values,
    df_en_num["y"].values,
    test_size=0.2,
    random_state=RANDOM_STATE,
    stratify=df_en_num["y"].values,
)

print("Train size:", len(X_train), "Test size:", len(X_test))
print("Train label distribution:", pd.Series(y_train).value_counts().to_dict())
print("Test  label distribution:", pd.Series(y_test).value_counts().to_dict())

def build_pipeline(class_weight: Optional[str] = None) -> Pipeline:
    # Build TF-IDF + Logistic Regression.
    # Parameters are fixed across experiments for fair comparison.
    tfidf = TfidfVectorizer(
        lowercase=True,
        ngram_range=(1, 2),
        min_df=2,
        max_df=0.95,
        token_pattern=r"(?u)\b\w+\b",
    )

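    lr = LogisticRegression(
        solver="saga",
        max_iter=2000,
        C=1.0,
        class_weight=class_weight,
        n_jobs=-1,
        # saga shuffles the data, so fix the seed to make runs repeatable
        random_state=RANDOM_STATE,
        # multi_class is omitted: "auto" is the default, and the argument
        # is deprecated in recent scikit-learn releases
    )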

    return Pipeline([("tfidf", tfidf), ("lr", lr)])


## 7) Run Ablation Experiments (E0–E4)

We evaluate each experiment using:
- Accuracy
- Macro-F1
- Confusion Matrix

We also compute **Δ (improvement)** relative to the baseline E0.
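
The three metrics are linked: accuracy and Macro-F1 can both be read off the confusion matrix. The sketch below does this for a hypothetical 3×3 matrix (rows are true classes, columns are predicted classes); the numbers are invented for illustration and are not results from this notebook.

```python
# Hypothetical confusion matrix; rows = true class, columns = predicted class,
# in the order negative, neutral, positive.
class_names = ["negative", "neutral", "positive"]
cm_toy = [
    [30, 15, 5],
    [10, 60, 10],
    [5, 20, 45],
]

def per_class_f1(cm):
    """F1 for each class k from its row (true k) and column (predicted k)."""
    scores = []
    for k in range(len(cm)):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                  # true k, predicted as another class
        fp = sum(row[k] for row in cm) - tp   # predicted k, actually another class
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

total = sum(sum(row) for row in cm_toy)
acc = sum(cm_toy[k][k] for k in range(len(cm_toy))) / total
f1s = per_class_f1(cm_toy)
macro = sum(f1s) / len(f1s)

print(f"accuracy = {acc:.3f}")  # accuracy = 0.675
for name, score in zip(class_names, f1s):
    print(f"F1[{name}] = {score:.3f}")
print(f"macro-F1 = {macro:.3f}")
```

Each Δ reported below is simply an experiment's metric minus the corresponding E0 value.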


In [None]:
# =========================
# 7) Experiment Runner
# =========================

TextTransform = Callable[[str], str]

def apply_transform(texts: np.ndarray, transform: Optional[TextTransform]) -> List[str]:
    # Apply a text transform (if provided) to an array of texts.
    # We always apply normalize_text_basic for stable vectorization.
    if transform is None:
        return [normalize_text_basic(t) for t in texts]
    return [transform(t) for t in texts]

def evaluate_experiment(
    exp_id: str,
    name: str,
    transform: Optional[TextTransform],
    class_weight: Optional[str],
) -> Dict:
    # Train and evaluate one experiment, returning metrics and confusion matrix.
    Xtr = apply_transform(X_train, transform)
    Xte = apply_transform(X_test, transform)

    model = build_pipeline(class_weight=class_weight)
    model.fit(Xtr, y_train)

    y_pred = model.predict(Xte)
    acc = accuracy_score(y_test, y_pred)
    macro = f1_score(y_test, y_pred, average="macro")
    cm = confusion_matrix(y_test, y_pred)

    return {
        "exp_id": exp_id,
        "name": name,
        "accuracy": float(acc),
        "macro_f1": float(macro),
        "confusion_matrix": cm,
        "y_pred": y_pred,
    }

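def plot_confusion_matrix(cm: np.ndarray, title: str) -> None:
    # Plot confusion matrix using matplotlib only.
    # Rows are true classes and columns are predicted classes, in LABEL_MAP order.
    class_names = sorted(LABEL_MAP, key=LABEL_MAP.get)
    plt.figure()
    plt.imshow(cm, interpolation="nearest")
    plt.title(title)
    plt.xlabel("Predicted")
    plt.ylabel("True")
    plt.colorbar()
    # Label the axes with class names when the matrix covers all classes
    if cm.shape[0] == len(class_names):
        plt.xticks(range(len(class_names)), class_names)
        plt.yticks(range(len(class_names)), class_names)
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            plt.text(j, i, str(cm[i, j]), ha="center", va="center")
    plt.tight_layout()
    plt.show()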

# Define experiments (E0–E4)
experiments = [
    ("E0", "Raw (English subset)", None, None),
    ("E1", "Raw + remove @/#/url tokens", remove_social_tokens, None),
    ("E2", "Raw + slang expansion", lambda t: expand_slang(t, SLANG_DICT), None),
    ("E3", "Raw + class_weight=balanced", None, "balanced"),
    ("E4", "FULL: remove @/#/url + slang + class_weight", full_preprocess, "balanced"),
]

results = []
cms = {}

for exp_id, name, transform, cw in experiments:
    print(f"Running {exp_id}: {name}")
    out = evaluate_experiment(exp_id, name, transform, cw)
    results.append({k: out[k] for k in ["exp_id", "name", "accuracy", "macro_f1"]})
    cms[exp_id] = out["confusion_matrix"]

results_df = pd.DataFrame(results)

# Improvements vs E0 baseline
base_acc = float(results_df.loc[results_df["exp_id"] == "E0", "accuracy"].iloc[0])
base_f1  = float(results_df.loc[results_df["exp_id"] == "E0", "macro_f1"].iloc[0])
results_df["delta_accuracy_vs_E0"] = results_df["accuracy"] - base_acc
results_df["delta_macro_f1_vs_E0"] = results_df["macro_f1"] - base_f1

results_df


## 8) Results Analysis

We interpret metrics as:
- **Accuracy**: overall correctness (may be influenced by majority classes)
- **Macro-F1**: average F1 across classes, treating each class equally

We visualize confusion matrices for:
- Baseline **E0**
- Full pipeline **E4**


In [None]:
display(results_df.sort_values("exp_id").reset_index(drop=True))

plot_confusion_matrix(cms["E0"], "Confusion Matrix - E0 (Raw)")
plot_confusion_matrix(cms["E4"], "Confusion Matrix - E4 (FULL)")


## 9) Optional: Error Analysis (Inspect Misclassifications)

To better understand where the model still fails after preprocessing, we can inspect a few misclassified examples
(using the FULL pipeline E4).
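
Before reading individual tweets, it can help to count which (true, predicted) pairs occur most often. The sketch below does this with `collections.Counter` on a handful of hypothetical label pairs; in the notebook the same idea could be applied to the `true_label` and `pred_label` columns built in the next cell.

```python
from collections import Counter

# Hypothetical (true, predicted) label pairs for a handful of test samples.
pairs = [
    ("neutral", "positive"),
    ("negative", "neutral"),
    ("neutral", "positive"),
    ("positive", "positive"),
    ("negative", "neutral"),
    ("neutral", "positive"),
]

# Count only the misclassified pairs.
confusions = Counter((t, p) for t, p in pairs if t != p)
for (true_label, pred_label), n in confusions.most_common():
    print(f"{true_label:8s} -> {pred_label:8s} x{n}")
```

The most frequent pair indicates which confusion to examine first when sampling errors.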


In [None]:
# Re-run E4 to explicitly obtain predictions (kept explicit for clarity)
e4_pred = evaluate_experiment(
    exp_id="E4",
    name="FULL: remove @/#/url + slang + class_weight",
    transform=full_preprocess,
    class_weight="balanced",
)["y_pred"]

inv_label_map = {v: k for k, v in LABEL_MAP.items()}

err_df = pd.DataFrame({
    "text": X_test,
    "y_true": y_test,
    "y_pred": e4_pred,
})
err_df["true_label"] = err_df["y_true"].map(inv_label_map)
err_df["pred_label"] = err_df["y_pred"].map(inv_label_map)

errors = err_df[err_df["y_true"] != err_df["y_pred"]].copy()
print("Number of errors (E4):", len(errors))

# Show a few random errors
errors.sample(min(10, len(errors)), random_state=RANDOM_STATE)[["true_label", "pred_label", "text"]]


## 10) Conclusion

This notebook provides a clean, reproducible pipeline to evaluate **text preprocessing** impact on a **Logistic Regression** sentiment model.

Suggested report takeaways:
1. TF-IDF + Logistic Regression is a strong and interpretable baseline for text sentiment.
2. Individual preprocessing steps can yield incremental improvements.
3. The **FULL** pipeline (E4) often provides the best improvement, especially in **Macro-F1** when combined with class weighting.
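4. Leakage control: the train/test split is made once, before any fitting. TF-IDF statistics are learned inside the scikit-learn `Pipeline` from the training split only, and the text transforms are stateless per-sample functions applied outside it, so no test-set information reaches training.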
