## 1. Load and Inspect Data

#### Purpose
Before doing anything fancy, we just want to make sure the files exist, load correctly, and look right.  
This helps us confirm:  
1. The data paths and filenames are correct  
2. The structure (columns, rows) is what we expect  
3. The text content (`comment_text`) actually contains readable sentences  

#### What to Observe
- `shape`: how many rows and columns in each file  
- Does the DataFrame show columns like `id`, `comment_text`, and labels?  
- Do the comments look like proper sentences (not garbled text)?


In [1]:
# Step 1: Load and look at the data
import pandas as pd
import os

DATA_DIR = "./data"   # since notebook is inside CleanSpeech/src/
TRAIN_PATH = os.path.join(DATA_DIR, "train_data.csv")
TEST_PATH  = os.path.join(DATA_DIR, "test_data.csv")

train = pd.read_csv(TRAIN_PATH)
test  = pd.read_csv(TEST_PATH)

print("Train shape:", train.shape)
print("Test shape :", test.shape)

train.head(3)


Train shape: (159571, 8)
Test shape : (153164, 2)


| | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |
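
The path and column checks listed under "What to Observe" can also be run before loading the full files. The sketch below is one possible way to do that with the standard library only; the helper name `missing_columns` and the throwaway demo file are assumptions for illustration, not part of the original notebook.

```python
import csv
import os
import tempfile

def missing_columns(path, required):
    """Return the required column names that are absent from the CSV header."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"CSV not found: {path}")
    with open(path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])  # first row only; the data is not loaded
    return [col for col in required if col not in header]

# Tiny demo on a throwaway file (hypothetical contents, not the real dataset)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows([["id", "comment_text", "toxic"],
                                 ["a1", "hello there", 0]])
    demo_missing = missing_columns(demo_path, ["id", "comment_text", "toxic", "threat"])

print("Missing columns:", demo_missing)
```

In the notebook this would presumably be called as `missing_columns(TRAIN_PATH, ["id", "comment_text"])`, with an empty list meaning the header looks as expected.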


## 2. Basic Data Quality Checks

#### Purpose
Before we begin cleaning, it’s essential to understand the condition of our dataset.  
This step helps us identify problems that might affect preprocessing or model training.  
We’ll check three basic aspects:  
1. Missing text values  
2. Duplicate comments  
3. Label balance (how common or rare each toxicity label is)

#### What to Observe
- Rows with missing text carry no signal and should be dropped before modeling.  
- Duplicate comments can bias the model toward repeated examples and should be removed.  
- Label proportions reveal if certain classes are rare or dominant.


In [3]:
# Step 2: Basic data quality checks

TEXT = "comment_text"
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# 1. Missing text values
null_count = train[TEXT].isna().sum()
print("Missing text rows:", null_count)

# 2. Duplicate comments
dup_count = train.duplicated(subset=[TEXT]).sum()
print("Duplicate comments:", dup_count)

# 3. Label balance
if all(col in train.columns for col in LABELS):
    label_sum = train[LABELS].sum().sort_values(ascending=False)
    print("\nLabel counts:")
    print(label_sum)
    print("\nLabel proportions (% of total):")
    print((label_sum / len(train) * 100).round(3))
else:
    print("\nLabel columns not found.")


Missing text rows: 0
Duplicate comments: 0

Label counts:
toxic            15294
obscene           8449
insult            7877
severe_toxic      1595
identity_hate     1405
threat             478
dtype: int64

Label proportions (% of total):
toxic            9.584
obscene          5.295
insult           4.936
severe_toxic     1.000
identity_hate    0.880
threat           0.300
dtype: float64
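
Note that `isna()` only counts true missing values; a comment made up of spaces or line breaks still counts as present. The sketch below shows one way to report missing, whitespace-only, and duplicated text together. The helper name `text_quality_summary` and the toy frame are assumptions for illustration.

```python
import pandas as pd

def text_quality_summary(df, text_col="comment_text"):
    """Count missing, whitespace-only, and duplicated values in a text column."""
    text = df[text_col]
    present = text.dropna().astype(str)
    return {
        "missing": int(text.isna().sum()),
        "blank": int(present.str.strip().eq("").sum()),          # only spaces/newlines
        "duplicates": int(df.duplicated(subset=[text_col]).sum()),
    }

# Toy frame with made-up values, standing in for `train`
demo = pd.DataFrame({"comment_text": ["hello", "   ", None, "hello", "\n"]})
demo_summary = text_quality_summary(demo)
print(demo_summary)
```

A non-zero `blank` count would suggest rows that pass the missing-value check but still contain no usable text.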


## 3. Text Cleaning

#### Purpose
After understanding the data and its quality, the next step is to prepare the text for modeling.  
Raw comments often contain unwanted elements that can confuse the model.  
Here we focus on cleaning and standardizing the text so that it’s easier for the model to learn meaningful patterns.  

We will perform basic text preprocessing steps such as:  
1. Lowercasing all text  
2. Removing URLs and HTML tags  
3. Removing extra spaces and line breaks  
4. Removing emojis and other non-ASCII characters (this also drops accented letters and non-Latin scripts)  

These steps reduce noise while keeping the useful information intact.

#### What to Observe
- Whether the cleaned text still conveys the same meaning as the original  
- That URLs, HTML tags, and non-ASCII characters have been removed (punctuation is kept)  
- That no large chunks of text are accidentally deleted


In [4]:
# Step 3: Text Cleaning

import re

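# Define cleaning patterns (same as found during EDA)
url_pattern = r"http\S+|www\.\S+"                # http(s) links and bare www. links
html_pattern = r"<.*?>"                          # shortest "<...>" span on one line (tags, but also stray "<" ... ">" text)
emoji_pattern = r"[\U00010000-\U0010ffff]"       # characters above U+FFFF, where most emojis live
non_ascii_pattern = r"[^\x00-\x7F]+"             # any remaining non-ASCII characters
multi_space_pattern = r"\s+"                     # runs of whitespace, including line breaks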

def clean_text(text):
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = re.sub(url_pattern, " ", text)
    text = re.sub(html_pattern, " ", text)
    text = re.sub(emoji_pattern, " ", text)
    text = re.sub(non_ascii_pattern, " ", text)
    text = re.sub(multi_space_pattern, " ", text).strip()
    return text

# Apply cleaning to train and test sets
train["clean_text"] = train["comment_text"].apply(clean_text)
test["clean_text"] = test["comment_text"].apply(clean_text)

# Quick preview to ensure it worked
train[["comment_text", "clean_text"]].head(5)


| | comment_text | clean_text |
|---|---|---|
| 0 | Explanation\nWhy the edits made under my usern... | explanation why the edits made under my userna... |
| 1 | D'aww! He matches this background colour I'm s... | d'aww! he matches this background colour i'm s... |
| 2 | Hey man, I'm really not trying to edit war. It... | hey man, i'm really not trying to edit war. it... |
| 3 | "\nMore\nI can't make any real suggestions on ... | " more i can't make any real suggestions on im... |
| 4 | You, sir, are my hero. Any chance you remember... | you, sir, are my hero. any chance you remember... |
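
To check that no large chunks of text are deleted by accident, one option is to compare the length of each comment before and after cleaning. The sketch below re-declares a subset of the patterns from the cleaning cell so that it runs on its own; `kept_ratio` and the sample strings are assumptions for illustration. The second sample shows that the `<.*?>` pattern also removes ordinary text that happens to sit between a `<` and a `>`.

```python
import re

# Subset of the patterns from the cleaning cell, repeated so this runs standalone
URL = r"http\S+|www\.\S+"
HTML = r"<.*?>"
NON_ASCII = r"[^\x00-\x7F]+"

def clean_text(text):
    text = text.lower()
    for pattern in (URL, HTML, NON_ASCII, r"\s+"):
        text = re.sub(pattern, " ", text)
    return text.strip()

def kept_ratio(original, cleaned):
    """Fraction of the original length that survives cleaning (1.0 = nothing lost)."""
    return len(cleaned) / len(original) if original else 1.0

samples = [
    "See <b>this</b> page: http://example.com",
    "if a < b and c > d then stop",   # not HTML, but the "<...>" span still matches
]
for s in samples:
    c = clean_text(s)
    print(f"{kept_ratio(s, c):.2f} | {s!r} -> {c!r}")
```

On the real data, sorting rows by this ratio and reading the lowest few would be a quick way to spot comments that lost most of their content.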


## 4. Final Cleanup and Train–Validation Split

#### Purpose
After cleaning the text, we need to ensure the dataset is consistent and ready for model training.  
This step involves removing unnecessary rows and creating a proper split between training and validation data.  
Doing this ensures that the model learns and is evaluated on distinct samples.

We will perform the following actions:  
1. Remove rows with missing or empty cleaned text  
2. Remove duplicate comments to avoid repetition bias  
3. Create a new column `any_toxic` that is 1 when a comment has at least one of the six labels and 0 otherwise (useful for a stratified split)  
4. Split the data into training and validation sets in an 80–20 ratio  

#### What to Observe
- The number of rows before and after cleanup  
- Whether `any_toxic` correctly represents toxic vs. non-toxic rows  
- That the final train and validation sets have similar label proportions


In [5]:
# Step 4: Final cleanup and train-validation split

from sklearn.model_selection import train_test_split

# 1. Remove rows with missing or empty cleaned text
before = len(train)
train = train[train["clean_text"].notna() & (train["clean_text"].str.strip() != "")]
after = len(train)
print(f"Removed {before - after} rows with missing or empty cleaned text.")

# 2. Remove duplicate comments (based on clean_text)
before = len(train)
# .copy() gives an independent frame, so adding columns below avoids SettingWithCopyWarning
train = train.drop_duplicates(subset=["clean_text"]).copy()
after = len(train)
print(f"Removed {before - after} duplicate rows.")

# 3. Create binary indicator for "any_toxic"
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
if all(col in train.columns for col in LABELS):
    train["any_toxic"] = (train[LABELS].sum(axis=1) > 0).astype(int)
    print("Created 'any_toxic' column for binary toxic indicator.")
else:
    print("Label columns missing; skipped 'any_toxic' creation.")

# 4. Train–Validation split (80–20), stratified on "any_toxic"
train_df, val_df = train_test_split(
    train,
    test_size=0.2,
    random_state=42,
    stratify=train["any_toxic"] if "any_toxic" in train else None
)

print("Final sizes:")
print("Train:", train_df.shape)
print("Validation:", val_df.shape)


Removed 8 rows with missing or empty cleaned text.
Removed 316 duplicate rows.
Created 'any_toxic' column for binary toxic indicator.
Final sizes:
Train: (127397, 10)
Validation: (31850, 10)
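
The split is stratified on `any_toxic` only, so the rates of the individual labels are not guaranteed to match between the two sets, especially for rare labels such as `threat`. A minimal sketch of a side-by-side comparison follows; the helper name `compare_label_rates` and the toy frames are assumptions for illustration.

```python
import pandas as pd

def compare_label_rates(train_df, val_df, labels):
    """Positive rate (%) of each label in both splits, plus the absolute gap."""
    rates = pd.DataFrame({
        "train_pct": train_df[labels].mean() * 100,
        "val_pct": val_df[labels].mean() * 100,
    })
    rates["abs_diff"] = (rates["train_pct"] - rates["val_pct"]).abs()
    return rates.round(3)

# Toy frames standing in for train_df / val_df (made-up values)
toy_train = pd.DataFrame({"toxic": [1, 0, 0, 0], "threat": [0, 0, 0, 1]})
toy_val = pd.DataFrame({"toxic": [1, 0], "threat": [0, 0]})
toy_rates = compare_label_rates(toy_train, toy_val, ["toxic", "threat"])
print(toy_rates)
```

In the notebook this would presumably be called as `compare_label_rates(train_df, val_df, LABELS + ["any_toxic"])`; small gaps in `abs_diff` indicate the two sets are comparable.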


## 5. Save Cleaned Data

#### Purpose
After completing cleaning and splitting, it’s good practice to save the processed datasets.  
This allows future steps like feature extraction, modeling, and evaluation to load the clean data directly without repeating the preprocessing every time.

We will save:  
1. `clean_train.csv` — Cleaned training data  
2. `clean_val.csv` — Validation data  
3. `clean_test.csv` — Cleaned test data with no labels  

#### What to Observe
- Ensure the files are saved correctly in the `data/` folder  
- Verify that the saved files contain the expected columns and number of rows


In [6]:
# Step 5: Save cleaned datasets

import os

DATA_DIR = "./data"
os.makedirs(DATA_DIR, exist_ok=True)

train_path = os.path.join(DATA_DIR, "clean_train.csv")
val_path   = os.path.join(DATA_DIR, "clean_val.csv")
test_path  = os.path.join(DATA_DIR, "clean_test.csv")

# Columns to include in outputs
cols_to_save = ["id", "comment_text", "clean_text", "any_toxic"] + LABELS

# Save train and validation sets
train_df.to_csv(train_path, index=False, columns=[c for c in cols_to_save if c in train_df.columns])
val_df.to_csv(val_path, index=False, columns=[c for c in cols_to_save if c in val_df.columns])

# Test set has no labels
test.to_csv(test_path, index=False)

print("Saved cleaned datasets:")
print(train_path)
print(val_path)
print(test_path)


Saved cleaned datasets:
./data\clean_train.csv
./data\clean_val.csv
./data\clean_test.csv
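
To verify the saved files, they can be read back and compared with the in-memory frames. The sketch below is one way to do this; `verify_saved_csv` and the toy frame are assumptions for illustration. It also counts rows whose `clean_text` comes back as missing: `pd.read_csv` reads an empty string as NaN by default, and the test set was not filtered for empty cleaned text, so some test rows may reload that way.

```python
import os
import tempfile
import pandas as pd

def verify_saved_csv(path, expected_columns, expected_rows):
    """Reload a saved CSV and report whether columns and row count match."""
    df = pd.read_csv(path)
    return {
        "columns_ok": list(df.columns) == list(expected_columns),
        "rows_ok": len(df) == expected_rows,
        # Empty strings are written as empty fields and come back as NaN
        "empty_text_rows": int(df["clean_text"].isna().sum()) if "clean_text" in df else 0,
    }

# Toy frame with made-up values; the second row has empty cleaned text
toy = pd.DataFrame({"id": ["a", "b"], "clean_text": ["hello", ""]})
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy.csv")
    toy.to_csv(toy_path, index=False)
    report = verify_saved_csv(toy_path, toy.columns, len(toy))
print(report)
```

If later steps need empty strings rather than NaN, passing `keep_default_na=False` to `pd.read_csv` or calling `fillna("")` on the text column after loading are two possible options.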
