## **Dataset Info**

### **Selection of Datasets**

We combined three well-known emotion datasets: **DailyDialog**, **ISEAR**, and **GoEmotions**.  
The goal was a dataset that reflects natural, real-life emotional expression, in line with our podcast episode, which focuses on **relationships** and how they play out in **daily life**.

- **DailyDialog**  
  Provides everyday conversations in dialogue form.  
  This mirrors spoken interactions, making it relevant to podcast-style exchanges.

- **ISEAR (International Survey on Emotion Antecedents and Reactions)**  
  Contains personal stories where people describe situations that triggered emotions.  
  These reflective accounts connect closely to how relationships and daily experiences shape emotions.

- **GoEmotions**  
  Reddit comments annotated with 27 fine-grained emotion labels plus *neutral*, which we map into 7 main categories.  
  This dataset adds modern, informal language similar to how people express themselves online and in casual talks.

**Why these three?**  
Together, these datasets cover:  
- **Natural conversations** (like in a podcast discussion)  
- **Personal reflections** (how people feel in relationships and daily life)  
- **Modern expressions** (informal, online style, closer to how emotions are shared today)

Together they give the final dataset a range of registers (dialogue, first-person accounts, and online comments) that fits our episode’s theme of exploring emotions in relationships and everyday life.


## **Data Preparation for Emotion Classification**

In this process, we prepared a clean dataset for training an emotion classification model.  
We started with three different datasets: **DailyDialog**, **ISEAR**, and **GoEmotions**.  
Since each dataset used different formats and emotion categories, we applied a unified cleaning and mapping pipeline.

**Steps we followed:**

1. **Load the datasets**  
   - Read the raw CSV files for each dataset.

2. **Parse and clean text**  
   - Converted stored strings into proper Python lists or cleaned plain text.  
   - Removed unwanted characters, extra spaces, and invisible Unicode symbols.  
   - Dropped rows with missing or empty text.

3. **Map emotions into 7 categories**  
   - Different datasets had different labels (e.g., *joy*, *remorse*, *gratitude*).  
   - We mapped all of them into 7 target emotions:  
     **happiness, sadness, surprise, anger, disgust, fear, neutral**.  
   - This made the datasets consistent and comparable.

4. **Filter and align data**  
   - Kept only rows with a single valid label (for GoEmotions, rows whose active labels all map to one of the 7 categories).  
   - Dropped duplicates (same text + label pairs).  
   - Ensured only non-empty cleaned text remained.

5. **Combine datasets**  
   - Merged DailyDialog, ISEAR, and GoEmotions into one unified DataFrame.  
   - Shuffled the rows with a fixed seed (`random_state=42`) so the order is random but reproducible.

6. **Export final dataset**  
   - Saved the cleaned dataset to a **TSV file** (`task6_emotions_clean.tsv`).  
   - TSV format avoids issues with commas in text.

**Result:**  
We now have a single, clean dataset with consistent labels across three sources, ready to be used for training an emotion classification model.
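
The text-cleaning step can be sketched as a plain-Python helper. This is a minimal illustration, assuming the same four operations the notebook applies later with pandas string methods; the function name `clean_text` is hypothetical and not part of the notebook.

```python
import re

def clean_text(text: str) -> str:
    """Hypothetical helper mirroring the notebook's pandas cleaning chain."""
    text = re.sub(r"^\s*>\s*", "", text)   # drop a leading '>' quote marker
    text = text.replace("\u200b", " ")      # zero-width space -> normal space
    text = text.replace("\xa0", " ")        # non-breaking space -> normal space
    text = re.sub(r"\s+", " ", text)        # collapse whitespace runs
    return text.strip()

print(clean_text(" > Hello\u200b\xa0 world  "))
```

Applying the steps in this order matters: the invisible characters are turned into ordinary spaces first, so the whitespace collapse can then merge them.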


In [1]:
import pandas as pd
import numpy as np

In [2]:
dd = pd.read_csv('daily_dialog.csv')

In [3]:
print(dd.shape)
dd.head()

(11118, 3)


Unnamed: 0,dialog,act,emotion
0,"['Say , Jim , how about going for a few beers ...",[3 4 2 2 2 3 4 1 3 4],[0 0 0 0 0 0 4 4 4 4]
1,"['Can you do push-ups ? '\n "" Of course I can ...",[2 1 2 2 1 1],[0 0 6 0 0 0]
2,"['Can you study with the radio on ? '\n ' No ,...",[2 1 2 1 1],[0 0 0 0 0]
3,['Are you all right ? '\n ' I will be all righ...,[2 1 1 1],[0 0 0 0]
4,"['Hey John , nice skates . Are they new ? '\n ...",[2 1 2 1 1 2 1 3 4],[0 0 0 0 0 6 0 6 0]


In [4]:
# Create a dictionary to map numerical emotion labels to their string categories
dd_map = {
    0: "neutral", 1: "anger", 2: "disgust", 3: "fear",
    4: "happiness", 5: "sadness", 6: "surprise"
}

In [5]:
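import ast
import re  # needed for re.split below

# Convert "dialog" strings into Python lists (fallback: wrap as single-item list).
# Note: the stored lists separate utterances with newlines instead of commas, so
# literal_eval concatenates the adjacent string literals into ONE item per dialog.
dd["dialog"] = dd["dialog"].apply(lambda x: ast.literal_eval(str(x)) if str(x).startswith("[") else [str(x)])

# Convert "emotion" strings like "[3 4 2]" into lists of integers
dd["emotion"] = dd["emotion"].apply(
    lambda x: [int(n) for n in re.split(r"[ ,]+", str(x).strip("[]")) if n.isdigit()]
)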

In [6]:
dd_rows = []

# Iterate through each row of the dataset
for _, row in dd.iterrows():
    dialogs = row["dialog"] 
    emos = row["emotion"]  
    
    # Ensure both lists are aligned by taking the shorter length
    L = min(len(dialogs), len(emos))
    
    # Pair each utterance with its corresponding emotion
    for i in range(L):
        text_i = str(dialogs[i]).strip()
        emo_i  = dd_map.get(int(emos[i]), None)
        
        # Keep only non-empty utterances with valid emotions
        if text_i and emo_i in {"happiness","sadness","surprise","anger","disgust","fear","neutral"}:
            dd_rows.append({"text": text_i, "label": emo_i, "source": "dailydialog"})
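
One detail worth checking when parsing the `dialog` column: the stored strings separate utterances with newlines rather than commas, and Python treats adjacent string literals as a single string. The sketch below uses a made-up two-utterance string in that format to show the effect, along with one possible regex-based split. The helper name `split_utterances` is an assumption for illustration, not part of the notebook.

```python
import ast
import re

# Made-up sample in the same comma-less, newline-separated format
raw = "['Are you all right ? '\n ' I will be all right soon . ']"

# Adjacent string literals are concatenated, so this yields one list item
parsed = ast.literal_eval(raw)
print(len(parsed))

def split_utterances(s):
    """Hypothetical alternative: pull out each quoted utterance separately."""
    # Match '...' or "..." runs; assumes an utterance never contains the
    # same quote character that wraps it.
    pairs = re.findall(r"'([^']*)'|\"([^\"]*)\"", s)
    return [(a or b).strip() for a, b in pairs]

print(split_utterances(raw))
```

With a per-utterance split like this, each utterance could be paired with its own entry in the `emotion` list rather than only the first one.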


In [7]:
dd_clean = pd.DataFrame(dd_rows)
# Quick inspection
print(dd_clean.shape)
dd_clean.head()

(11118, 3)


Unnamed: 0,text,label,source
0,"Say , Jim , how about going for a few beers af...",neutral,dailydialog
1,Can you do push-ups ? Of course I can . It's ...,neutral,dailydialog
2,"Can you study with the radio on ? No , I list...",neutral,dailydialog
3,Are you all right ? I will be all right soon ...,neutral,dailydialog
4,"Hey John , nice skates . Are they new ? Yeah ...",neutral,dailydialog


In [8]:
# Check the class distribution
dd_clean['label'].value_counts()

label
neutral      9786
happiness     929
anger         136
surprise       96
sadness        92
disgust        56
fear           23
Name: count, dtype: int64

In [9]:
isear = pd.read_csv('ISEAR.csv')

In [10]:
print(isear.shape)
isear.head()

(7102, 3)


Unnamed: 0,ID,sentiment,content
0,10941,anger,At the point today where if someone says somet...
1,10942,anger,@CorningFootball IT'S GAME DAY!!!! T MIN...
2,10943,anger,This game has pissed me off more than any othe...
3,10944,anger,@spamvicious I've just found out it's Candice ...
4,10945,anger,@moocowward @mrsajhargreaves @Melly77 @GaryBar...


In [11]:
# Map ISEAR dataset emotions to unified categories
isear_map = {
    "anger":"anger",
    "disgust":"disgust",
    "fear":"fear",
    "joy":"happiness",
    "happiness":"happiness",
    "sadness":"sadness",
    "shame":"sadness",   
    "guilt":"sadness",  
    "neutral":"neutral" 
}

In [12]:
# Normalize sentiment values: lowercase + remove extra spaces
isear["sentiment_norm"] = isear["sentiment"].astype(str).str.lower().str.strip()

# Map normalized sentiments to unified emotion labels
isear["label"] = isear["sentiment_norm"].map(isear_map)

# Clean up content text (ensure string, strip whitespace)
isear["text"] = isear["content"].astype(str).str.strip()

In [13]:
# Remove rows where the mapped label is missing
isear_clean = isear.dropna(subset=["label"])

# Keep only rows with non-empty text, selecting text + label columns
isear_clean = isear_clean.loc[isear_clean["text"].str.len() > 0, ["text","label"]].copy()

# Add a source column to identify dataset origin
isear_clean["source"] = "isear"


In [14]:
# - remove leading '>' markers
# - replace invisible/Unicode spaces with normal spaces
# - collapse multiple spaces into one
# - strip leading/trailing spaces
isear_clean["text"] = (isear_clean["text"]
    .str.replace(r"^\s*>\s*", "", regex=True)
    .str.replace("\u200b"," ")
    .str.replace("\xa0"," ")
    .str.replace(r"\s+"," ", regex=True)
    .str.strip()
)

# Drop duplicate rows (same text + label) and reset index
isear_clean = isear_clean.drop_duplicates(subset=["text","label"]).reset_index(drop=True)


In [15]:
# Quick inspection
print(isear_clean.shape)
isear_clean.head()

(7101, 3)


Unnamed: 0,text,label,source
0,At the point today where if someone says somet...,anger,isear
1,@CorningFootball IT'S GAME DAY!!!! T MINUS 14:...,anger,isear
2,This game has pissed me off more than any othe...,anger,isear
3,@spamvicious I've just found out it's Candice ...,anger,isear
4,@moocowward @mrsajhargreaves @Melly77 @GaryBar...,anger,isear


In [16]:
# Load the three GoEmotions dataset splits
goe1 = pd.read_csv('goemotions_1.csv')
goe2 = pd.read_csv('goemotions_2.csv')
goe3 = pd.read_csv('goemotions_3.csv')

# Combine them into one DataFrame
go = pd.concat([goe1, goe2, goe3], axis=0, ignore_index=True)


In [17]:
print(go.shape)
go.head()

(211225, 37)


Unnamed: 0,text,id,author,subreddit,link_id,parent_id,created_utc,rater_id,example_very_unclear,admiration,...,love,nervousness,optimism,pride,realization,relief,remorse,sadness,surprise,neutral
0,That game hurt.,eew5j0j,Brdd9,nrl,t3_ajis4z,t1_eew18eq,1548381000.0,1,False,0,...,0,0,0,0,0,0,0,1,0,0
1,>sexuality shouldn’t be a grouping category I...,eemcysk,TheGreen888,unpopularopinion,t3_ai4q37,t3_ai4q37,1548084000.0,37,True,0,...,0,0,0,0,0,0,0,0,0,0
2,"You do right, if you don't care then fuck 'em!",ed2mah1,Labalool,confessions,t3_abru74,t1_ed2m7g7,1546428000.0,37,False,0,...,0,0,0,0,0,0,0,0,0,1
3,Man I love reddit.,eeibobj,MrsRobertshaw,facepalm,t3_ahulml,t3_ahulml,1547965000.0,18,False,0,...,1,0,0,0,0,0,0,0,0,0
4,"[NAME] was nowhere near them, he was by the Fa...",eda6yn6,American_Fascist713,starwarsspeculation,t3_ackt2f,t1_eda65q2,1546669000.0,2,False,0,...,0,0,0,0,0,0,0,0,0,1


In [18]:
# Define metadata columns that are not emotion labels
meta_cols = {
    "text","id","author","subreddit","link_id",
    "parent_id","created_utc","rater_id","example_very_unclear"
}

# Select candidate label columns (everything except metadata)
cand_cols = [c for c in go.columns if c not in meta_cols]


In [19]:
# Normalize label columns so all are numeric (0/1)
for c in cand_cols:
    s = go[c]
    if s.dtype == bool:
        # Convert boolean → int (True=1, False=0)
        go[c] = s.astype(int)
    elif s.dtype == object:
        # Clean string values and map to 0/1
        go[c] = s.astype(str).str.strip().replace(
            {"True": "1", "False": "0", "true": "1", "false": "0"}
        )
        # Convert to numeric, invalid entries → NaN
        go[c] = pd.to_numeric(go[c], errors="coerce")


In [20]:
emo_cols = []

# Collect valid emotion label columns (binary: 0/1 only)
for c in cand_cols:
    vals = go[c].dropna().unique()
    if len(vals) > 0 and set(np.unique(vals)).issubset({0,1}):
        emo_cols.append(c)


In [21]:
# Show how many emotion label columns were detected and preview the first 15
print("Detected emotion columns:", len(emo_cols))
print(emo_cols[:15])

Detected emotion columns: 28
['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear']


In [22]:
# Map GoEmotions fine-grained labels to 7 broad categories
go_to_seven = {
    # happiness
    "admiration":"happiness","amusement":"happiness","approval":"happiness",
    "caring":"happiness","desire":"happiness","excitement":"happiness",
    "gratitude":"happiness","joy":"happiness","love":"happiness",
    "optimism":"happiness","pride":"happiness","relief":"happiness","contentment":"happiness",

    # anger
    "anger":"anger","annoyance":"anger","disapproval":"anger",

    # disgust
    "disgust":"disgust",

    # fear
    "fear":"fear","nervousness":"fear",

    # sadness
    "sadness":"sadness","disappointment":"sadness","grief":"sadness",
    "remorse":"sadness","embarrassment":"sadness",

    # surprise
    "surprise":"surprise","realization":"surprise","confusion":"surprise","curiosity":"surprise",

    # neutral
    "neutral":"neutral"
}


In [23]:
# Keep only emotion columns that can be mapped to 7-class schema
mapped_cols = [c for c in emo_cols if c in go_to_seven]

# Subset DataFrame: text + mapped emotion columns
go_sub = go[["text"] + mapped_cols].copy()

# Clean text: ensure string type and strip whitespace
go_sub["text"] = go_sub["text"].astype(str).str.strip()


In [24]:
# - For each row, find all active emotions (columns with value=1)
# - Map them to broad categories using go_to_seven
# - If exactly one unique category → keep it, else mark as NaN
go_sub["label"] = go_sub[mapped_cols].apply(
    lambda r: (lambda mapped: mapped[0] if len(mapped) == 1 else np.nan)(
        sorted({go_to_seven[c] for c in r.index if r[c] == 1})
    ),
    axis=1
)

# Keep only rows with valid label and non-empty text
go_clean = go_sub.dropna(subset=["label"])
go_clean = go_clean.loc[go_clean["text"].str.len() > 0, ["text","label"]].copy()

# Add source column to track dataset origin
go_clean["source"] = "goemotions"
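
The single-label rule can be illustrated on toy rows without pandas. This is a sketch with a deliberately small, hypothetical mapping (`toy_map`) and invented rows; it keeps a row only when all of its active fine-grained labels fall into one broad category.

```python
# Hypothetical subset of the fine-to-broad mapping, for illustration only
toy_map = {"joy": "happiness", "love": "happiness",
           "anger": "anger", "neutral": "neutral"}

def collapse(active_labels):
    """Return the single broad category, or None if zero or several apply."""
    broad = {toy_map[lab] for lab in active_labels if lab in toy_map}
    return next(iter(broad)) if len(broad) == 1 else None

rows = [
    ["joy", "love"],    # two fine labels, one broad category: kept
    ["joy", "anger"],   # conflicting broad categories: dropped
    [],                 # no active label: dropped
]
print([collapse(r) for r in rows])
```

The trade-off of this rule is that ambiguous rows are discarded rather than assigned a majority label, which shrinks the dataset but keeps each example single-class.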


In [25]:
# - remove leading '>' markers
# - replace invisible/Unicode spaces with normal spaces
# - collapse multiple spaces into one
# - strip leading/trailing spaces
go_clean["text"] = (go_clean["text"]
    .str.replace(r"^\s*>\s*", "", regex=True)
    .str.replace("\u200b"," ")
    .str.replace("\xa0"," ")
    .str.replace(r"\s+"," ", regex=True)
    .str.strip()
)

# Drop duplicate (text, label) pairs and reset index
go_clean = go_clean.drop_duplicates(subset=["text","label"]).reset_index(drop=True)


In [26]:
# Quick inspection
print(go_clean.shape)
go_clean.head()

(104310, 3)


Unnamed: 0,text,label,source
0,That game hurt.,sadness,goemotions
1,"You do right, if you don't care then fuck 'em!",neutral,goemotions
2,Man I love reddit.,happiness,goemotions
3,"[NAME] was nowhere near them, he was by the Fa...",neutral,goemotions
4,Right? Considering it’s such an important docu...,happiness,goemotions


In [27]:
# Check class distribution
print(go_clean["label"].value_counts())

label
happiness    32322
neutral      31445
anger        14613
surprise     12926
sadness       8523
disgust       2363
fear          2118
Name: count, dtype: int64


In [28]:
# Combine cleaned datasets (DailyDialog, GoEmotions, ISEAR) into one DataFrame
frames = [dd_clean, go_clean, isear_clean]
all_df = pd.concat(frames, ignore_index=True)


In [29]:
# Keep only rows where label is in the 7 target emotion categories
target_set = {"happiness","sadness","surprise","anger","disgust","fear","neutral"}
all_df = all_df[all_df["label"].isin(target_set)].copy()

In [30]:
# Remove rows with missing text/label
all_df = all_df.dropna(subset=["text","label"])

# Keep only rows with non-empty text
all_df = all_df.loc[all_df["text"].str.len() > 0]

In [31]:
# Drop duplicate (text, label) pairs and reset index
all_df = all_df.drop_duplicates(subset=["text","label"]).reset_index(drop=True)

# Shuffle the dataset randomly (reproducible with random_state)
all_df = all_df.sample(frac=1.0, random_state=42).reset_index(drop=True)

In [32]:
# Save the cleaned dataset to a TSV file (tab-separated, no index column)
all_out = "task6_emotions_clean.tsv"
all_df.to_csv(all_out, sep="\t", index=False)

In [33]:
# Final check
print(all_df.shape)
all_df.head()

(122017, 3)


Unnamed: 0,text,label,source
0,Yep. This sub is just being overly reactionary...,anger,goemotions
1,"Do Mona and Jim need a new house ? No , they ...",neutral,dailydialog
2,While we watch the private form of Operation C...,surprise,goemotions
3,It is though. Certainly the worst Zelda game a...,anger,goemotions
4,"Everybody , I'd like to propose a toast to Mar...",happiness,dailydialog


In [34]:
# Check class distribution
all_df['label'].value_counts()

label
neutral      40768
happiness    34835
anger        16444
surprise     13018
sadness      10144
fear          4392
disgust       2416
Name: count, dtype: int64

## **Feature Extraction**

### **Configuration**

In [35]:
from pathlib import Path

# Path to the cleaned dataset (tab-separated)
DATA_CSV = 'task6_emotions_clean.tsv'

# Output directory for all artifacts
ARTIFACT_DIR = Path("artifacts")
ARTIFACT_VERSION = "v1"

In [36]:
# Split ratios
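TEST_SIZE = 0.10   # fraction held out in the first split (later halved into valid/test)
VALID_SIZE = 0.10  # only used to compute valid_size_rel; the realized split is 90/5/5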

# Random seed
RANDOM_STATE = 42

In [37]:
# Label mapping 
LABEL_ORDER = ["anger","disgust","fear","happiness","neutral","sadness","surprise"]

In [38]:
# TF-IDF params
TFIDF_NGRAM_RANGE = (1, 2)
TFIDF_MIN_DF = 5
TFIDF_MAX_DF = 0.8
TFIDF_MAX_FEATURES = 200_000
TFIDF_SUBLINEAR_TF = True
TFIDF_NORM = "l2"

In [39]:
# LSTM/RNN tokenization
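VOCAB_SIZE = 40_000        # cap on the tokenizer vocabulary
MAX_LEN = None             # None: derive from training sequence lengths
MAX_LEN_PERCENTILE = 95    # percentile used when MAX_LEN is None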

In [40]:
# Transformer model
HF_MODEL_NAME = "distilbert-base-uncased"
TRANSFORMER_MAX_LENGTH = 128               # common good default

In [41]:
# Dense features to include (by prefix)
INCLUDE_SENTIMENT = True
INCLUDE_POS = True

In [42]:
ARTIFACT_DIR.mkdir(parents=True, exist_ok=True)
print("Artifacts will be saved under:", ARTIFACT_DIR.resolve())

Artifacts will be saved under: /home/y2a/BlockA2/Task6Test/artifacts


### **Load data**

In [43]:

import pandas as pd

all_df = pd.read_csv(DATA_CSV, sep='\t')
assert "text" in all_df.columns and "label" in all_df.columns, "Data must include 'text' and 'label' columns."
print(all_df.head())
print("Rows:", len(all_df))


                                                text      label       source
0  Yep. This sub is just being overly reactionary...      anger   goemotions
1  Do Mona and Jim need a new house ?  No , they ...    neutral  dailydialog
2  While we watch the private form of Operation C...   surprise   goemotions
3  It is though. Certainly the worst Zelda game a...      anger   goemotions
4  Everybody , I'd like to propose a toast to Mar...  happiness  dailydialog
Rows: 122017


### **Create stratified train/valid/test splits and save indices**

In [44]:
from sklearn.model_selection import train_test_split
import pandas as pd

# First: hold out TEST_SIZE of the data as a temporary pool (validation + test)
X_train_text, X_temp_text, y_train, y_temp = train_test_split(
    all_df["text"].astype(str),
    all_df["label"],
    test_size=TEST_SIZE,
    random_state=RANDOM_STATE,
    stratify=all_df["label"]
)

In [45]:
# Second: split the held-out pool into validation and test halves.
# valid_size_rel is computed for reference but is not passed to train_test_split;
# with test_size=0.5 the realized split is 90% train / 5% valid / 5% test.
valid_size_rel = VALID_SIZE / (1 - TEST_SIZE)
X_valid_text, X_test_text, y_valid, y_test = train_test_split(
    X_temp_text,
    y_temp,
    test_size=0.5,
    random_state=RANDOM_STATE,
    stratify=y_temp
)
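
The arithmetic behind a two-stage split can be sketched in a few lines. The example assumes the first stage removes a fraction `holdout` of the data and the second stage turns a fraction `second` of that pool into the test set; the function and parameter names are illustrative only.

```python
def two_stage_fractions(holdout, second):
    """Fractions (train, valid, test) when `holdout` is split off first and
    `second` of that held-out pool then becomes the test set."""
    train = 1 - holdout
    test = holdout * second
    valid = holdout - test
    return train, valid, test

# Holding out 10% and halving it
print(two_stage_fractions(0.10, 0.5))

# Holding out 20% and halving it gives an 80/10/10 split instead
print(two_stage_fractions(0.20, 0.5))

# Alternative scheme: split off the test set first, then carve validation out
# of the remaining data using the relative size VALID / (1 - TEST)
valid_rel = 0.10 / (1 - 0.10)
print(round((1 - 0.10) * valid_rel, 10))
```

Either of the last two variants would yield a 10% validation share; which one fits depends on whether the test set or the combined holdout is split off first.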

In [46]:
# Save indices
splits_dir = ARTIFACT_DIR / "splits"
splits_dir.mkdir(parents=True, exist_ok=True)

train_idx = X_train_text.index
valid_idx = X_valid_text.index
test_idx  = X_test_text.index

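# header=False: these files are read back with header=None, so a header line
# would be parsed as an extra index value
train_idx.to_series(index=None).to_csv(splits_dir / f"train_idx_{ARTIFACT_VERSION}.csv", index=False, header=False)
valid_idx.to_series(index=None).to_csv(splits_dir / f"valid_idx_{ARTIFACT_VERSION}.csv", index=False, header=False)
test_idx.to_series(index=None).to_csv(splits_dir / f"test_idx_{ARTIFACT_VERSION}.csv", index=False, header=False)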

print("Split sizes:", len(train_idx), len(valid_idx), len(test_idx))

Split sizes: 109815 6101 6101
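
Index files like these are later read with `header=None`, which only round-trips cleanly if no header line was written. The sketch below uses in-memory buffers and made-up row positions to compare the two ways of writing an unnamed Series; it assumes a recent pandas version.

```python
import io
import pandas as pd

idx = pd.Index([5, 9, 12])   # made-up row positions

# Default header=True writes a header line for the unnamed Series
with_header = io.StringIO()
idx.to_series().to_csv(with_header, index=False)
with_header.seek(0)
read_a = pd.read_csv(with_header, header=None).squeeze("columns")

# header=False writes only the values, which matches a header=None read
no_header = io.StringIO()
idx.to_series().to_csv(no_header, index=False, header=False)
no_header.seek(0)
read_b = pd.read_csv(no_header, header=None).squeeze("columns")

print(read_a.tolist())
print(read_b.tolist())
```

If the first list has one more entry than the original index, the header line was parsed as data, and the corresponding row would end up in every split that is loaded this way.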


### **Create and save a stable label mapping**

In [47]:
import json

# Create dictionaries to map labels to IDs and IDs back to labels
label_to_id = {lab: i for i, lab in enumerate(LABEL_ORDER)}
id_to_label = {i: lab for lab, i in label_to_id.items()}

In [48]:
# Check if all labels in the dataset are included in LABEL_ORDER
data_labels = set(all_df["label"].unique())
missing = data_labels.difference(set(LABEL_ORDER))
if missing:
    print("WARNING: The following labels are in data but not in LABEL_ORDER:", missing)

In [49]:
# Create the labels directory if it does not already exist
labels_dir = ARTIFACT_DIR / "labels"
labels_dir.mkdir(exist_ok=True)

In [50]:
# Save the label-to-ID mapping as a JSON file
with open(labels_dir / f"label_mapping_{ARTIFACT_VERSION}.json", "w") as f:
    json.dump(label_to_id, f, indent=2)

In [51]:
# Confirm that the mapping was saved
print("Saved label mapping:", label_to_id)

Saved label mapping: {'anger': 0, 'disgust': 1, 'fear': 2, 'happiness': 3, 'neutral': 4, 'sadness': 5, 'surprise': 6}
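
When the mapping is loaded back later, the inverse dictionary can be rebuilt from the JSON file. The sketch below writes to a temporary directory rather than the notebook's `artifacts/labels` folder, and the file name is illustrative.

```python
import json
import tempfile
from pathlib import Path

LABEL_ORDER = ["anger", "disgust", "fear", "happiness",
               "neutral", "sadness", "surprise"]
label_to_id = {lab: i for i, lab in enumerate(LABEL_ORDER)}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "label_mapping_example.json"   # illustrative name
    path.write_text(json.dumps(label_to_id, indent=2))
    loaded = json.loads(path.read_text())

# JSON keeps the label strings as keys, so inversion restores id -> label
id_to_label = {i: lab for lab, i in loaded.items()}
print(id_to_label[3])
```

Saving label-to-ID rather than ID-to-label avoids the issue that JSON object keys are always strings, which would turn integer IDs into `"0"`, `"1"`, and so on.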


### **Sentiment features (VADER)**

In [52]:
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Add sentiment features if enabled
if INCLUDE_SENTIMENT:
    # Initialize the VADER sentiment analyzer
    analyzer = SentimentIntensityAnalyzer()

    # Ensure text column has no missing values and is of string type
    all_df["text"] = all_df["text"].fillna("").astype(str)

    # Function to calculate VADER sentiment scores for a given text
    def vader_scores(text):
        s = analyzer.polarity_scores(text)
        return pd.Series(
            [s["neg"], s["neu"], s["pos"], s["compound"]],
            index=["sent_neg", "sent_neu", "sent_pos", "sent_compound"]
        )

    # Apply the sentiment scoring function to the text column
    sent_df = all_df["text"].apply(vader_scores)
    print(sent_df.head())

else:
    # If sentiment analysis is disabled, create an empty DataFrame
    sent_df = pd.DataFrame(index=all_df.index)
    print("Sentiment features skipped.")


   sent_neg  sent_neu  sent_pos  sent_compound
0     0.144     0.588     0.268         0.2263
1     0.031     0.769     0.199         0.9945
2     0.103     0.897     0.000        -0.3182
3     0.283     0.552     0.166        -0.4019
4     0.025     0.695     0.280         0.9365


### **POS features**

In [53]:
import re
import pandas as pd
from collections import Counter

# Universal Part-of-Speech tags (UPOS) to track
UPOS = ["ADJ","ADP","ADV","AUX","CCONJ","DET","INTJ","NOUN","NUM",
        "PART","PRON","PROPN","PUNCT","SCONJ","SYM","VERB","X","SPACE"]

In [54]:
# Extract POS tags from a given cell (handles lists of tokens or string formats)
def extract_tags_from_cell(cell):
    tags = []
    if isinstance(cell, list):
        for item in cell:
            if hasattr(item, "pos_"):  # spaCy token
                tags.append(item.pos_)
            else:  # fallback to regex if not a spaCy token
                s = str(item)
                m = re.search(r"/([A-Z]+)", s)
                if m:
                    tags.append(m.group(1))
    else:
        for token in re.findall(r"/([A-Z]+)", str(cell)):
            tags.append(token)
    return tags

In [55]:
# Count POS tags and calculate normalized ratios
def pos_counts_and_ratios(tags):
    c = Counter(tags)
    counts = {f"POS_{t}": c.get(t, 0) for t in UPOS}
    total = sum(counts.values())
    ratios = {
        f"POS_{t}_norm": (counts[f"POS_{t}"]/total if total > 0 else 0.0)
        for t in UPOS
    }
    return {**counts, **ratios, "POS_token_total": total}
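
As a quick check of the two helpers above, the sketch below re-declares compact versions of them and runs them on a made-up `word/TAG` string. The input format is an assumption based on the regex fallback; an actual `POS_Tags` column, if present, may look different.

```python
import re
from collections import Counter

UPOS = ["ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
        "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X", "SPACE"]

def tags_from_string(cell):
    """Compact stand-in for the string branch of extract_tags_from_cell."""
    return re.findall(r"/([A-Z]+)", str(cell))

def pos_ratios(tags):
    """Share of each UPOS tag among the recognised tags."""
    c = Counter(tags)
    total = sum(c.get(t, 0) for t in UPOS)
    return {f"POS_{t}_norm": (c.get(t, 0) / total if total else 0.0) for t in UPOS}

sample = "I/PRON love/VERB this/DET podcast/NOUN !/PUNCT"   # made-up example
tags = tags_from_string(sample)
ratios = pos_ratios(tags)
print(tags)
print(ratios["POS_VERB_norm"])
```

Using ratios instead of raw counts keeps the features comparable between a short Reddit comment and a long dialogue.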

In [56]:
# Add POS features if enabled
if INCLUDE_POS:
    if "POS_Tags" in all_df.columns:
        # Use existing POS_Tags column
        tags_series = all_df["POS_Tags"]
        pos_features = tags_series.apply(
            lambda cell: pos_counts_and_ratios(extract_tags_from_cell(cell))
        )
        pos_df = pd.DataFrame(list(pos_features)).fillna(0)
        print("Parsed POS from existing 'POS_Tags' column.")
    else:
        # Compute POS features with spaCy if not already available
        import spacy
        nlp = spacy.load("en_core_web_sm", disable=["ner","parser"])
        texts = all_df["text"].fillna("").astype(str).tolist()
        feats = []
        for doc in nlp.pipe(texts, batch_size=1000, n_process=-1):
            counts = doc.count_by(spacy.attrs.POS)
            features = {}
            for pos_id, count in counts.items():
                features[nlp.vocab[pos_id].text] = count

            # Align to UPOS and compute normalized ratios
            counts_aligned = {f"POS_{t}": features.get(t, 0) for t in UPOS}
            total = sum(counts_aligned.values())
            ratios = {
                f"POS_{t}_norm": (counts_aligned[f"POS_{t}"]/total if total > 0 else 0.0)
                for t in UPOS
            }
            feats.append({**counts_aligned, **ratios, "POS_token_total": total})
        pos_df = pd.DataFrame(feats).fillna(0)
        print("Computed POS with spaCy.")
else:
    # If POS is disabled, create an empty DataFrame with the right index
    pos_df = pd.DataFrame(index=all_df.index)
    print("POS features skipped.")

# Show first rows of the POS features
print(pos_df.head())

Computed POS with spaCy.
   POS_ADJ  POS_ADP  POS_ADV  POS_AUX  POS_CCONJ  POS_DET  POS_INTJ  POS_NOUN  \
0        1        0        2        2          0        3         0         2   
1       18       12        9       23          6       23         6        42   
2        3        2        0        3          0        2         0         3   
3        3        1        2        1          0        1         0         1   
4        1        4        4        3          0        4         1         5   

   POS_NUM  POS_PART  ...  POS_PART_norm  POS_PRON_norm  POS_PROPN_norm  \
0        0         0  ...       0.000000       0.000000        0.066667   
1        0        11  ...       0.039007       0.102837        0.039007   
2        0         1  ...       0.040000       0.080000        0.120000   
3        0         0  ...       0.000000       0.083333        0.000000   
4        0         1  ...       0.019608       0.137255        0.078431   

   POS_PUNCT_norm  POS_SCONJ_norm  PO

### **Save dense features (sentiment + POS ratios)**

In [57]:
import json
import pandas as pd

dense_dir = ARTIFACT_DIR / "features" / "dense"
dense_dir.mkdir(parents=True, exist_ok=True)

In [58]:
# Select columns
dense_parts = []
if INCLUDE_SENTIMENT and not sent_df.empty:
    dense_parts.append(sent_df)
if INCLUDE_POS and not pos_df.empty:
    # keep normalized ratios only to be compact
    pos_norm_cols = [c for c in pos_df.columns if c.endswith("_norm")]
    dense_parts.append(pos_df[pos_norm_cols])

if len(dense_parts) > 0:
    dense_features = pd.concat(dense_parts, axis=1).fillna(0.0)
else:
    dense_features = pd.DataFrame(index=all_df.index)

In [59]:
# Save dense features and column list
dense_features.to_parquet(dense_dir / f"dense_features_{ARTIFACT_VERSION}.parquet")
with open(dense_dir / f"dense_feature_columns_{ARTIFACT_VERSION}.json", "w") as f:
    json.dump(list(dense_features.columns), f, indent=2)

print("Saved dense features with shape:", dense_features.shape)

Saved dense features with shape: (122017, 22)


### **Fit scaler on train dense features and save**

In [60]:
from sklearn.preprocessing import MaxAbsScaler
import joblib

# Directory where the scaler will be saved
scaler_dir = dense_dir

# Only scale if there are dense features
if dense_features.shape[1] > 0:
    # Load training indices to fit the scaler only on training data
    train_idx = pd.read_csv(
        ARTIFACT_DIR / "splits" / f"train_idx_{ARTIFACT_VERSION}.csv",
        header=None
    ).iloc[:, 0]

    # Fit MaxAbsScaler on the training subset of dense features
    scaler = MaxAbsScaler()
    scaler.fit(dense_features.loc[train_idx])

    # Save the fitted scaler for later use
    joblib.dump(scaler, scaler_dir / f"dense_scaler_{ARTIFACT_VERSION}.joblib")
    print("Saved dense scaler.")
else:
    # If no dense features exist, skip scaling
    print("No dense features to scale; skipping scaler.")


Saved dense scaler.
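
For intuition, `MaxAbsScaler` divides each column by the largest absolute value seen during fitting, so training values land in [-1, 1] and zeros stay zero. The pure-Python sketch below reproduces that arithmetic on invented numbers; it is not a replacement for the scikit-learn class.

```python
def fit_max_abs(rows):
    """Per-column maximum absolute value; 0 is replaced by 1 to avoid dividing by zero."""
    n_cols = len(rows[0])
    scales = [max(abs(r[j]) for r in rows) for j in range(n_cols)]
    return [s if s != 0 else 1.0 for s in scales]

def transform(rows, scales):
    """Divide every value by the scale learned for its column."""
    return [[v / s for v, s in zip(r, scales)] for r in rows]

train = [[0.2, -0.5], [0.4, 0.25]]   # invented training features
valid = [[0.8, 0.5]]                  # invented validation features

scales = fit_max_abs(train)           # learned from the training rows only
print(transform(train, scales))
print(transform(valid, scales))       # values may exceed 1 outside training
```

Fitting on the training rows only, as the notebook does, keeps validation and test statistics out of the scaling step.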


### **TF-IDF: fit on train, transform all splits, save matrices + vectorizer**

In [61]:
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy import sparse
import joblib
import pandas as pd

# Load train, validation, and test indices
train_idx = pd.read_csv(ARTIFACT_DIR / "splits" / f"train_idx_{ARTIFACT_VERSION}.csv", header=None).squeeze()
valid_idx = pd.read_csv(ARTIFACT_DIR / "splits" / f"valid_idx_{ARTIFACT_VERSION}.csv", header=None).squeeze()
test_idx  = pd.read_csv(ARTIFACT_DIR / "splits" / f"test_idx_{ARTIFACT_VERSION}.csv",  header=None).squeeze()

# Directory for TF-IDF features
tfidf_dir = ARTIFACT_DIR / "features" / "tfidf"
tfidf_dir.mkdir(parents=True, exist_ok=True)

In [62]:
# Configure TF-IDF vectorizer
tfidf = TfidfVectorizer(
    analyzer="word",
    ngram_range=TFIDF_NGRAM_RANGE,
    min_df=TFIDF_MIN_DF,
    max_df=TFIDF_MAX_DF,
    max_features=TFIDF_MAX_FEATURES,
    sublinear_tf=TFIDF_SUBLINEAR_TF,
    norm=TFIDF_NORM
)

In [63]:
# Extract text for each split
X_train_text = all_df.loc[train_idx, "text"].astype(str)
X_valid_text = all_df.loc[valid_idx, "text"].astype(str)
X_test_text  = all_df.loc[test_idx,  "text"].astype(str)

In [64]:
# Fit TF-IDF on training data and transform all splits
X_train_tfidf = tfidf.fit_transform(X_train_text)
X_valid_tfidf = tfidf.transform(X_valid_text)
X_test_tfidf  = tfidf.transform(X_test_text)

In [65]:
# Save vectorizer and TF-IDF matrices
joblib.dump(tfidf, tfidf_dir / f"tfidf_word_{ARTIFACT_VERSION}.joblib")
sparse.save_npz(tfidf_dir / f"X_train_tfidf_{ARTIFACT_VERSION}.npz", X_train_tfidf)
sparse.save_npz(tfidf_dir / f"X_valid_tfidf_{ARTIFACT_VERSION}.npz", X_valid_tfidf)
sparse.save_npz(tfidf_dir / f"X_test_tfidf_{ARTIFACT_VERSION}.npz",  X_test_tfidf)

In [66]:
# Save split labels mapped to integer IDs
labels_dir = ARTIFACT_DIR / "labels"
all_labels = all_df["label"].map(lambda x: label_to_id.get(x, -1))

In [67]:
all_labels.loc[train_idx].to_csv(labels_dir / f"y_train_{ARTIFACT_VERSION}.csv", index=False, header=False)
all_labels.loc[valid_idx].to_csv(labels_dir / f"y_valid_{ARTIFACT_VERSION}.csv", index=False, header=False)
all_labels.loc[test_idx].to_csv(labels_dir / f"y_test_{ARTIFACT_VERSION}.csv",  index=False, header=False)

In [68]:
# Confirm shapes of TF-IDF matrices
print("TF-IDF saved:",
      X_train_tfidf.shape, X_valid_tfidf.shape, X_test_tfidf.shape)

TF-IDF saved: (109816, 67892) (6102, 67892) (6102, 67892)
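
To make the configured options concrete, here is a worked sketch of the weights for two terms in one document, assuming scikit-learn's default smoothed IDF together with `sublinear_tf=True` and L2 normalisation. All counts are invented.

```python
import math

n_docs = 1000                    # invented corpus size
df = {"love": 50, "the": 700}    # invented document frequencies
tf = {"love": 3, "the": 5}       # invented term counts in one document

def idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def sublinear(count):
    # sublinear_tf replaces the raw count with 1 + ln(count)
    return 1 + math.log(count)

raw = {t: sublinear(tf[t]) * idf(t) for t in tf}
norm = math.sqrt(sum(w * w for w in raw.values()))   # L2 norm of the row
weights = {t: w / norm for t, w in raw.items()}
print(weights)
```

Even though "the" occurs more often in this toy document, the rarer term ends up with the larger weight, which is the behaviour the IDF factor is meant to produce.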


### **LSTM/RNN: tokenize, pad, save sequences**

In [69]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
from scipy import sparse
import pandas as pd
import joblib

# Directory for sequence features
seq_dir = ARTIFACT_DIR / "features" / "sequences"
seq_dir.mkdir(parents=True, exist_ok=True)



In [70]:
# Load train, validation, and test indices
train_idx = pd.read_csv(ARTIFACT_DIR / "splits" / f"train_idx_{ARTIFACT_VERSION}.csv", header=None).squeeze()
valid_idx = pd.read_csv(ARTIFACT_DIR / "splits" / f"valid_idx_{ARTIFACT_VERSION}.csv", header=None).squeeze()
test_idx  = pd.read_csv(ARTIFACT_DIR / "splits" / f"test_idx_{ARTIFACT_VERSION}.csv",  header=None).squeeze()


In [71]:
# Extract text for each split
texts_train = all_df.loc[train_idx, "text"].astype(str).tolist()
texts_valid = all_df.loc[valid_idx, "text"].astype(str).tolist()
texts_test  = all_df.loc[test_idx,  "text"].astype(str).tolist()

In [72]:
# Initialize and fit tokenizer on training text
tokenizer = Tokenizer(num_words=VOCAB_SIZE, oov_token="<UNK>", lower=True)
tokenizer.fit_on_texts(texts_train)

In [73]:
# Convert text into integer sequences
seq_train = tokenizer.texts_to_sequences(texts_train)
seq_valid = tokenizer.texts_to_sequences(texts_valid)
seq_test  = tokenizer.texts_to_sequences(texts_test)

In [74]:
# If no max length is provided, choose it based on a percentile of training lengths
if MAX_LEN is None:
    lengths = [len(s) for s in seq_train]
    MAX_LEN = int(np.percentile(lengths, MAX_LEN_PERCENTILE))
    MAX_LEN = max(8, min(MAX_LEN, 256))  # enforce reasonable bounds
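
The percentile-then-clamp rule can be tried on a handful of invented sequence lengths. The sketch below uses a small pure-Python percentile with linear interpolation (the method NumPy uses by default); `choose_max_len` is a hypothetical name, and the bounds 8 and 256 are taken from the cell above.

```python
def percentile(values, q):
    """Linear-interpolated percentile of a non-empty list."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def choose_max_len(lengths, q=95, floor=8, cap=256):
    """Pick a padding length from a percentile, clamped to [floor, cap]."""
    return max(floor, min(int(percentile(lengths, q)), cap))

lengths = [3, 5, 7, 9, 12, 15, 20, 40, 60, 401]   # invented token counts
print(choose_max_len(lengths))
print(choose_max_len([1, 2, 3]))                   # falls back to the floor
```

Choosing a high percentile instead of the maximum keeps a few very long texts from inflating the padded length of every sequence.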

In [75]:
# Pad or truncate sequences to the same length
X_train_seq = pad_sequences(seq_train, maxlen=MAX_LEN, padding="post", truncating="post")
X_valid_seq = pad_sequences(seq_valid, maxlen=MAX_LEN, padding="post", truncating="post")
X_test_seq  = pad_sequences(seq_test,  maxlen=MAX_LEN, padding="post", truncating="post")
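
What `padding="post"` and `truncating="post"` do to a single sequence can be shown with a short pure-Python stand-in. The helper name `pad_post` is hypothetical, and 0 is assumed as the padding value, which is the Keras default.

```python
def pad_post(seq, maxlen, pad_value=0):
    """Keep the first `maxlen` ids and right-pad shorter sequences."""
    trimmed = seq[:maxlen]                                   # drop the tail if too long
    return trimmed + [pad_value] * (maxlen - len(trimmed))   # fill the end if too short

print(pad_post([4, 8, 15], 5))
print(pad_post([4, 8, 15, 16, 23, 42], 5))
```

Post-truncation keeps the beginning of each text, which suits short utterances where the opening words often carry the emotional cue.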

In [76]:
# Save padded sequences and tokenizer
np.savez_compressed(seq_dir / f"X_train_seq_{ARTIFACT_VERSION}.npz", X_train_seq)
np.savez_compressed(seq_dir / f"X_valid_seq_{ARTIFACT_VERSION}.npz", X_valid_seq)
np.savez_compressed(seq_dir / f"X_test_seq_{ARTIFACT_VERSION}.npz",  X_test_seq)
joblib.dump(tokenizer, seq_dir / f"tokenizer_{ARTIFACT_VERSION}.joblib")

['artifacts/features/sequences/tokenizer_v1.joblib']

In [77]:
# Save metadata (vocabulary size and sequence length)
with open(seq_dir / f"sequence_meta_{ARTIFACT_VERSION}.txt", "w") as f:
    f.write(f"VOCAB_SIZE={VOCAB_SIZE}\n")
    f.write(f"MAX_LEN={MAX_LEN}\n")

In [78]:
# Confirm results
print("Saved sequences:", X_train_seq.shape, X_valid_seq.shape, X_test_seq.shape)
print("Tokenizer vocab size (limited):", VOCAB_SIZE)

Saved sequences: (109816, 65) (6102, 65) (6102, 65)
Tokenizer vocab size (limited): 40000
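
When an array is passed positionally to `np.savez_compressed`, NumPy stores it under the default key `arr_0`, so that is the key to use when loading it back. A minimal round-trip sketch, using a temporary directory and a hypothetical file name in place of the real artifact paths:

```python
import tempfile
from pathlib import Path

import numpy as np

# Stand-in for a padded sequence matrix.
X = np.arange(12, dtype="int32").reshape(3, 4)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "X_demo_seq.npz"      # hypothetical file name
    np.savez_compressed(path, X)             # positional array is stored as "arr_0"
    with np.load(path) as data:
        keys = list(data.files)
        X_loaded = data["arr_0"]

print(keys, X_loaded.shape, X_loaded.dtype)
```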


### **Transformers: tokenize with a Hugging Face tokenizer and save arrays**

In [79]:
from transformers import AutoTokenizer
import numpy as np
import pandas as pd

# Directory for transformer-based features
tr_dir = ARTIFACT_DIR / "features" / "transformers"
tr_dir.mkdir(parents=True, exist_ok=True)

In [80]:
# Save the Hugging Face model name used for reproducibility
with open(tr_dir / f"hf_model_name_{ARTIFACT_VERSION}.txt", "w") as f:
    f.write(HF_MODEL_NAME)

# Load tokenizer from the specified Hugging Face model
# Note: this rebinds `tokenizer`, replacing the Keras Tokenizer fitted in the sequence section
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_NAME)

In [81]:
# Reload train, validation, and test indices so this section can run on its own.
# squeeze("columns") always returns a Series, even if a split holds a single row.
train_idx = pd.read_csv(ARTIFACT_DIR / "splits" / f"train_idx_{ARTIFACT_VERSION}.csv", header=None).squeeze("columns")
valid_idx = pd.read_csv(ARTIFACT_DIR / "splits" / f"valid_idx_{ARTIFACT_VERSION}.csv", header=None).squeeze("columns")
test_idx  = pd.read_csv(ARTIFACT_DIR / "splits" / f"test_idx_{ARTIFACT_VERSION}.csv",  header=None).squeeze("columns")

In [82]:
# Extract text for each split
texts_train = all_df.loc[train_idx, "text"].astype(str).tolist()
texts_valid = all_df.loc[valid_idx, "text"].astype(str).tolist()
texts_test  = all_df.loc[test_idx,  "text"].astype(str).tolist()


In [83]:
# Helper function to tokenize a batch of texts
def tok_batch(texts):
    enc = tokenizer(
        texts,
        padding="max_length",
        truncation=True,
        max_length=TRANSFORMER_MAX_LENGTH,
        return_attention_mask=True,
        return_tensors=None
    )
    return np.array(enc["input_ids"], dtype="int32"), np.array(enc["attention_mask"], dtype="int32")
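
The helper above tokenizes a whole split in one call. If memory ever became a concern, one option would be to tokenize in chunks and concatenate the results, which works because `padding="max_length"` gives every chunk the same width. The sketch below assumes a callable with the same return shape as `tok_batch`; `tokenize_in_chunks` and `fake_tok_batch` are hypothetical names, and the fake tokenizer stands in for the real one so the example runs without downloading a model.

```python
import numpy as np

def tokenize_in_chunks(texts, tokenize_fn, chunk_size=1024):
    """Apply tokenize_fn to slices of texts and stack the (ids, mask) results row-wise."""
    if not texts:
        raise ValueError("texts must not be empty")
    ids_parts, mask_parts = [], []
    for start in range(0, len(texts), chunk_size):
        ids, mask = tokenize_fn(texts[start:start + chunk_size])
        ids_parts.append(ids)
        mask_parts.append(mask)
    return np.concatenate(ids_parts), np.concatenate(mask_parts)

def fake_tok_batch(texts, max_length=4):
    """Stand-in tokenizer: ids 1..n for n whitespace tokens, padded with 0 to max_length."""
    ids = np.zeros((len(texts), max_length), dtype="int32")
    mask = np.zeros_like(ids)
    for row, text in enumerate(texts):
        n = min(len(text.split()), max_length)
        ids[row, :n] = np.arange(1, n + 1)
        mask[row, :n] = 1
    return ids, mask

texts = ["a b", "c", "d e f g h"]
ids, mask = tokenize_in_chunks(texts, fake_tok_batch, chunk_size=2)
print(ids.shape, mask.tolist())
```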

In [84]:
# Tokenize each split into input IDs and attention masks
train_input_ids, train_attention_mask = tok_batch(texts_train)
valid_input_ids, valid_attention_mask = tok_batch(texts_valid)
test_input_ids,  test_attention_mask  = tok_batch(texts_test)

In [85]:
# Save arrays for model training
np.savez_compressed(tr_dir / f"train_input_ids_{ARTIFACT_VERSION}.npz", train_input_ids)
np.savez_compressed(tr_dir / f"valid_input_ids_{ARTIFACT_VERSION}.npz", valid_input_ids)
np.savez_compressed(tr_dir / f"test_input_ids_{ARTIFACT_VERSION}.npz",  test_input_ids)

np.savez_compressed(tr_dir / f"train_attention_mask_{ARTIFACT_VERSION}.npz", train_attention_mask)
np.savez_compressed(tr_dir / f"valid_attention_mask_{ARTIFACT_VERSION}.npz", valid_attention_mask)
np.savez_compressed(tr_dir / f"test_attention_mask_{ARTIFACT_VERSION}.npz",  test_attention_mask)

In [86]:
# Confirm that arrays were saved and print their shapes
print("Saved transformer arrays:")
print("Train:", train_input_ids.shape, train_attention_mask.shape)
print("Valid:", valid_input_ids.shape, valid_attention_mask.shape)
print("Test :", test_input_ids.shape,  test_attention_mask.shape)

Saved transformer arrays:
Train: (109816, 128) (109816, 128)
Valid: (6102, 128) (6102, 128)
Test : (6102, 128) (6102, 128)
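
A quick sanity check on arrays like these is to sum each attention mask row, which gives the number of real (non-padding) tokens per example. The sketch below uses a small made-up mask; `mask_stats` is a hypothetical helper. Rows that fill the whole width either fit exactly or were truncated, so their share is only an upper bound on how many texts were cut.

```python
import numpy as np

def mask_stats(attention_mask):
    """Summarise real-token counts from a 0/1 attention mask of shape (n_examples, max_length)."""
    lengths = attention_mask.sum(axis=1)
    max_length = attention_mask.shape[1]
    return {
        "mean_tokens": float(lengths.mean()),
        "share_at_max_length": float((lengths == max_length).mean()),
    }

# Made-up mask for four examples padded to width 4.
demo_mask = np.array(
    [[1, 1, 1, 0],
     [1, 1, 1, 1],
     [1, 0, 0, 0],
     [1, 1, 0, 0]],
    dtype="int32",
)
stats = mask_stats(demo_mask)
print(stats)
```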


### **Summary: artifact layout**

In [87]:
import os

# Walk through the artifact directory and print its structure
for root, dirs, files in os.walk(ARTIFACT_DIR):
    # Determine depth of the current folder for indentation
    level = root.replace(str(ARTIFACT_DIR), '').count(os.sep)
    indent = '  ' * level

    # Print current directory name
    print(f"{indent}{os.path.basename(root)}/")

    # Print files inside the current directory with extra indentation
    subindent = '  ' * (level + 1)
    for f in files:
        print(f"{subindent}{f}")

artifacts/
  features/
    transformers/
      train_input_ids_v1.npz
      test_input_ids_v1.npz
      valid_attention_mask_v1.npz
      valid_input_ids_v1.npz
      test_attention_mask_v1.npz
      hf_model_name_v1.txt
      train_attention_mask_v1.npz
    tfidf/
      X_test_tfidf_v1.npz
      X_train_tfidf_v1.npz
      X_valid_tfidf_v1.npz
      tfidf_word_v1.joblib
    dense/
      dense_scaler_v1.joblib
      dense_feature_columns_v1.json
      dense_features_v1.parquet
    sequences/
      X_test_seq_v1.npz
      X_train_seq_v1.npz
      X_valid_seq_v1.npz
      sequence_meta_v1.txt
      tokenizer_v1.joblib
  labels/
    y_test_v1.csv
    y_valid_v1.csv
    y_train_v1.csv
    label_mapping_v1.json
  splits/
    train_idx_v1.csv
    test_idx_v1.csv
    valid_idx_v1.csv
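
Before training, it can help to confirm that the files a model needs are actually present. The following is a minimal sketch of such a check, assuming relative paths like the ones in the listing above; `missing_artifacts` is a hypothetical helper, and the demo builds a throwaway directory instead of touching the real `artifacts/` folder.

```python
import tempfile
from pathlib import Path

def missing_artifacts(root, expected):
    """Return the expected relative paths that are not regular files under root, sorted."""
    root = Path(root)
    return sorted(rel for rel in expected if not (root / rel).is_file())

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "splits").mkdir()
    (root / "splits" / "train_idx_v1.csv").write_text("0\n1\n")

    expected = ["splits/train_idx_v1.csv", "splits/valid_idx_v1.csv"]
    missing = missing_artifacts(root, expected)

print(missing)
```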
