## 1. Data Preprocessing

The dataset requires cleaning to ensure consistency and remove noise.  
Steps include:

- **Handling missing & invalid data**: remove rows with empty text or invalid labels.  
- **Duplicate & conflicting data removal**: drop exact, normalized, and conflicting duplicates.  
- **Text normalization**:  
  - Lowercasing  
  - Removing URLs and @mentions  
  - Keeping hashtag words (`#happy → happy`)  
  - Expanding contractions (`can't → can not`)  
  - Mapping emoticons (`:) → smile`)  
  - De-elongation (`soooo → soo`)  
  - Replacing numbers with `<num>`  
  - Preserving `!` and `?` as emotion cues  
- **Stopword removal**: removes common words (`the, is, at`) but keeps negations (`no, not, never`).  
- **Lemmatization**: reduces words to base form (`running → run`, `better → good`).  
- **Short/empty text removal**: discards samples with fewer than 2 tokens.  
- **Optional filters**: language filtering (keep English only), near-duplicate removal using TF-IDF cosine similarity.

In [1]:
pip install langdetect

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m981.5/981.5 kB[0m [31m18.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... [?25l[?25hdone
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993223 sha256=f67ddc0857ef93f4298e2529174ccbf1be307704425272b265ed7b55dbcff6f7
  Stored in directory: /root/.cache/pip/wheels/0a/f2/b2/e5ca405801e05eb7c8ed5b3b4bcf1fcabcd6272c167640072e
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9
Note: you may need to restart the kernel to use updated packages.


In [2]:
# === Data Preprocessing Script (matches your documentation end-to-end) ===
import re, html, unicodedata
import pandas as pd
import numpy as np

# ---------------- Config ----------------
INPUT_CSV  = "/kaggle/input/emotions/text.csv"   # change if needed
OUTPUT_CSV = "emotions_clean.csv"
VALID_LABELS = {0,1,2,3,4,5}

# Feature toggles
ENABLE_LANG_FILTER = True        # English-only (uses langdetect); auto-disables if not installed
ENABLE_NEAR_DUP    = True        # TF-IDF + cosine near-duplicate pruning (sampled)
ENABLE_STOPWORDS   = True        # remove stopwords BUT keep negations (no/not/never/without)
ENABLE_LEMMATIZE   = True        # lemmatization with POS tags

# Near-duplicate config
NEAR_DUP_SAMPLE     = 60000      # sample size for TF-IDF near-dup check
NEAR_DUP_SIMILARITY = 0.96       # cosine similarity >= this → considered near-duplicate

# ---------------- Safe imports ----------------
def _safe_import(pkg):
    try:
        return __import__(pkg)
    except Exception:
        return None

langdetect = _safe_import("langdetect")
sklearn_ok = _safe_import("sklearn") is not None

if ENABLE_LANG_FILTER and langdetect is None:
    print("langdetect not installed → disabling language filter (pip install langdetect to enable).")
    ENABLE_LANG_FILTER = False

# ---------------- NLTK setup for stopwords & lemma ----------------
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)
# robust tagger (either resource name works depending on NLTK version)
try:
    nltk.download("averaged_perceptron_tagger_eng", quiet=True)
    _TAGGER_RES = "averaged_perceptron_tagger_eng"
except Exception:
    nltk.download("averaged_perceptron_tagger", quiet=True)
    _TAGGER_RES = "averaged_perceptron_tagger"

# Stopwords (keep negations)
BASE_STOPWORDS = set(stopwords.words("english"))
KEEP_NEG = {"no", "not", "never", "without"}
FINAL_STOPWORDS = BASE_STOPWORDS - KEEP_NEG

# ---------------- Normalization helpers ----------------
CONTRACTIONS = {
    "can't":"can not", "cant":"can not", "won't":"will not", "wont":"will not",
    "don't":"do not", "dont":"do not", "isn't":"is not", "isnt":"is not",
    "aren't":"are not", "arent":"are not", "doesn't":"does not", "doesnt":"does not",
    "didn't":"did not", "didnt":"did not", "haven't":"have not", "havent":"have not",
    "hasn't":"has not", "hasnt":"has not", "hadn't":"had not", "hadnt":"had not",
    "i'm":"i am", "im":"i am", "it's":"it is", "he's":"he is", "she's":"she is",
    "that's":"that is", "there's":"there is", "what's":"what is", "who's":"who is",
    "i've":"i have", "we've":"we have", "they've":"they have",
    "i'll":"i will", "we'll":"we will", "you'll":"you will", "they'll":"they will",
    "i'd":"i would", "you'd":"you would", "he'd":"he would", "she'd":"she would", "they'd":"they would",
    "y'all":"you all", "should've":"should have", "could've":"could have", "would've":"would have"
}
def expand_contractions(t: str) -> str:
    keys = sorted(CONTRACTIONS.keys(), key=len, reverse=True)
    pat = re.compile(r"\b(" + "|".join(map(re.escape, keys)) + r")\b")
    return pat.sub(lambda m: CONTRACTIONS[m.group(0)], t)

# Emoticons mapping → words (covers the examples in your doc)
EMOTICONS = {
    r":-\)": "smile", r":\)": "smile",
    r":-D": "laugh",  r":D": "laugh",
    r":-\(": "sad",   r":\(": "sad",
    r";-\)": "wink",  r";\)": "wink",
    r":'\(": "cry",
    r":-P": "playful", r":P": "playful",
    r":-O": "surprise", r":O": "surprise",
    r":/": "skeptical", r":-\|": "neutral"
}
EMOTICON_REGEX = [(re.compile(k), v) for k, v in EMOTICONS.items()]
def replace_emoticons(text: str) -> str:
    for rx, word in EMOTICON_REGEX:
        text = rx.sub(f" {word} ", text)
    return text

# Regex helpers
URL_PATTERN       = re.compile(r"(https?://\S+|www\.\S+)")
MENTION_PATTERN   = re.compile(r"@\w+")
HASHTAG_PATTERN   = re.compile(r"#(\w+)")
MULTISPACE        = re.compile(r"\s+")
REPEAT_CHARS      = re.compile(r"(.)\1{2,}")           # 3+ same character → 2
NUM_PATTERN       = re.compile(r"\b\d+\b")

def strip_punct_keep_emotion(s: str) -> str:
    # Keep ! and ? (emotion cues), allow <num> token
    return re.sub(r"[^\w\s!?<>]", " ", s)

# POS → wordnet mapping
def wn_pos(tag: str):
    c = tag[0].upper() if tag else "N"
    if c == "J": return wordnet.ADJ
    if c == "V": return wordnet.VERB
    if c == "N": return wordnet.NOUN
    if c == "R": return wordnet.ADV
    return wordnet.NOUN

LEMMA = WordNetLemmatizer()

def clean_text(txt: str) -> str:
    """Full normalization to match documentation."""
    if not isinstance(txt, str):
        return ""
    # Core normalization
    t = unicodedata.normalize("NFKC", txt)
    t = html.unescape(t)
    t = t.strip().lower()
    t = URL_PATTERN.sub(" ", t)              # remove URLs
    t = MENTION_PATTERN.sub(" ", t)          # remove @mentions
    t = HASHTAG_PATTERN.sub(r"\1", t)        # #happy -> happy
    t = expand_contractions(t)               # can't->can not, etc.
    t = replace_emoticons(t)                 # :) -> smile, etc.
    t = REPEAT_CHARS.sub(r"\1\1", t)         # sooo -> soo
    t = NUM_PATTERN.sub(" <num> ", t)        # numbers -> <num>
    t = strip_punct_keep_emotion(t)          # keep ! and ?
    t = MULTISPACE.sub(" ", t).strip()

    # Token-level: stopwords (keep negations) + lemmatization with POS
    tokens = t.split()
    if not tokens:
        return ""
    tagged = pos_tag(tokens, lang="eng")     # tag once per text
    out = []
    for w, tag in tagged:
        if w in {"!", "?", "<num>"}:         # preserve emotion markers and <num>
            out.append(w); continue
        if ENABLE_STOPWORDS and (w in FINAL_STOPWORDS):
            continue
        if ENABLE_LEMMATIZE:
            w = LEMMA.lemmatize(w, wn_pos(tag))
        out.append(w)
    return " ".join(out)

# For diagnostics & near-dup: more aggressive normalization (ignore !, ?)
def normalize_for_dedup(t: str) -> str:
    t = t.replace("!", " ").replace("?", " ")
    t = re.sub(r"[^a-z0-9\s<>]", " ", t)     # keep letters, digits, <num>
    return MULTISPACE.sub(" ", t).strip()

# ---------------- Load ----------------
df = pd.read_csv(INPUT_CSV)
df = df.drop(columns=["Unnamed: 0"], errors="ignore")

# Handling missing/invalid
n0 = len(df)
df = df.dropna(subset=["text", "label"])
df["text"] = df["text"].astype(str)
df = df[df["label"].isin(VALID_LABELS)].copy()
print(f"Loaded: {n0} → after hygiene: {len(df)} rows")

# (Optional) language filter
if ENABLE_LANG_FILTER:
    from langdetect import detect
    def is_en(s):
        try: return detect(s) == "en"
        except Exception: return False
    df["__is_en"] = df["text"].map(is_en)
    kept = int(df["__is_en"].sum())
    print(f"Language filter: kept {kept}/{len(df)} English rows")
    df = df[df["__is_en"]].drop(columns="__is_en")

# --- Exact duplicates (same raw text & label) ---
before = len(df)
df = df.drop_duplicates(subset=["text", "label"])
print(f"Exact dupes removed: {before - len(df)}")

# --- Text normalization ---
df["clean_text"] = df["text"].map(clean_text)

# --- Normalized duplicates (same clean_text & label) ---
before = len(df)
df = df.drop_duplicates(subset=["clean_text", "label"])
print(f"Normalized dupes removed: {before - len(df)}")

# --- Conflicting duplicates (same clean_text with different labels → drop all) ---
counts_by_clean = df.groupby("clean_text")["label"].nunique()
conflict_keys = set(counts_by_clean[counts_by_clean > 1].index)
before = len(df)
if conflict_keys:
    df = df[~df["clean_text"].isin(conflict_keys)].copy()
    print(f"Conflicting-label rows removed: {before - len(df)}")
else:
    print("No conflicting-label duplicates.")

# --- Very short/empty after cleaning (fewer than 2 tokens) ---
token_lens = df["clean_text"].str.split().map(len)
before = len(df)
df = df[(df["clean_text"] != "") & (token_lens >= 2)].copy()
print(f"Short/empty rows removed: {before - len(df)}")

# --- norm_text (for reporting/near-dup) ---
df["norm_text"] = df["clean_text"].map(normalize_for_dedup)

# --- Optional: near-duplicate pruning via TF-IDF + cosine on a sample ---
if ENABLE_NEAR_DUP and sklearn_ok and len(df) > 2000:
    print("Near-duplicate pruning (TF-IDF + cosine)…")
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import NearestNeighbors

    sub = df.sample(n=min(NEAR_DUP_SAMPLE, len(df)), random_state=42).copy()
    vec = TfidfVectorizer(min_df=2, max_df=0.98)
    X = vec.fit_transform(sub["norm_text"])

    try:
        nn = NearestNeighbors(metric="cosine", n_neighbors=5, n_jobs=-1)
    except TypeError:  # sklearn>=1.4 removed n_jobs in some estimators
        nn = NearestNeighbors(metric="cosine", n_neighbors=5)
    nn.fit(X)
    distances, indices = nn.kneighbors(X, return_distance=True)

    threshold = 1.0 - NEAR_DUP_SIMILARITY
    pairs = set()
    for i, (dists, nbrs) in enumerate(zip(distances, indices)):
        for dist, j in zip(dists[1:], nbrs[1:]):  # skip self
            if dist <= threshold:
                pairs.add(tuple(sorted((i, j))))
    to_drop = {b for (_, b) in pairs}
    before_s = len(sub)
    sub = sub.drop(index=sub.index[list(to_drop)]).copy()
    print(f"Near-duplicate pairs: {len(pairs)} | dropped {before_s - len(sub)} (from sample={before_s})")

    # Merge back: keep deduped sample, keep the rest
    df = pd.concat([df.drop(index=sub.index, errors="ignore"), sub], axis=0)

# ---------------- Save & report ----------------
cols = ["text", "label", "clean_text", "norm_text"]
df[cols].to_csv(OUTPUT_CSV, index=False)

print(f"\nSaved cleaned dataset → {OUTPUT_CSV}")
print("Final shape:", df.shape)
print("Label counts:\n", df["label"].value_counts().sort_index())

Loaded: 416809 → after hygiene: 416809 rows
Language filter: kept 387970/416809 English rows
Exact dupes removed: 364
Normalized dupes removed: 7259
Conflicting-label rows removed: 41159
Short/empty rows removed: 60
Near-duplicate pruning (TF-IDF + cosine)…
Near-duplicate pairs: 130 | dropped 102 (from sample=60000)

Saved cleaned dataset → emotions_clean.csv
Final shape: (339128, 4)
Label counts:
 label
0    103607
1    119244
2     22871
3     47685
4     36523
5      9198
Name: count, dtype: int64


## 2. Feature Extraction (TF-IDF)

Once the dataset is cleaned, the text is transformed into numerical features  
using **Term Frequency–Inverse Document Frequency (TF-IDF)**.

Configuration:
- **n-grams (1,2):** unigrams (single words) + bigrams (two-word phrases).  
- **min_df = 2:** discard words appearing only once.  
- **sublinear_tf = True:** log-scaled term frequency.  
- **max_features = 80,000:** limit vocabulary size for efficiency.  

We fit TF-IDF on the **training set only** (to avoid data leakage),  
then transform validation and test sets with the same vocabulary.  
Finally, the features are saved in `.npz` format for re-use in different models.

In [3]:
# === TF-IDF feature extraction (fit once, save) ===
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy import sparse
import joblib

DATA_PATH = OUTPUT_CSV
RAND = 42

df = pd.read_csv(DATA_PATH)
X = df["clean_text"].astype(str)
y = df["label"].astype(int)

# Split (70/15/15)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=RAND
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, stratify=y_temp, random_state=RAND
)

# Fit TF-IDF on train only
vec = TfidfVectorizer(
    ngram_range=(1,2),
    min_df=2,
    sublinear_tf=True,
    max_features=80000
)
Xtr = vec.fit_transform(X_train)
Xva = vec.transform(X_val)
Xte = vec.transform(X_test)

# Save features + labels + vectorizer
sparse.save_npz("X_train_tfidf.npz", Xtr)
sparse.save_npz("X_val_tfidf.npz",   Xva)
sparse.save_npz("X_test_tfidf.npz",  Xte)
np.save("y_train.npy", y_train.values)
np.save("y_val.npy",   y_val.values)
np.save("y_test.npy",  y_test.values)

joblib.dump(vec, "tfidf_vectorizer.joblib")

print("Saved: X_*_tfidf.npz, y_*.npy, tfidf_vectorizer.joblib")
print("Train shape:", Xtr.shape, "Val shape:", Xva.shape, "Test shape:", Xte.shape)

Saved: X_*_tfidf.npz, y_*.npy, tfidf_vectorizer.joblib
Train shape: (237389, 80000) Val shape: (50869, 80000) Test shape: (50870, 80000)
