# Dataset Balancing: SelfMA + Toxigen

Create a balanced 3-class classification dataset by combining:
- **Label 0 (Non-toxic)**: Toxigen low toxicity samples (toxicity_human < 2)
- **Label 1 (Microaggressive)**: SelfMA dataset samples
- **Label 2 (Toxic)**: Toxigen high toxicity samples (toxicity_human > 4)

## Pipeline
1. Setup
2. Profanity obfuscation setup
3. Load SelfMA dataset (microaggressive texts)
4. Load Toxigen dataset (filter by toxicity levels)
5. Balance and combine datasets (3-class)
6. Save final balanced dataset
7. Summary

## 1. Setup

In [1]:
# Install dependencies
!pip install -U -q gdown
!pip install -q git+https://github.com/dnozza/profanity-obfuscation.git

  Preparing metadata (setup.py) ... done
  Building wheel for profanity_obfuscation (setup.py) ... done


In [2]:
import os
import pandas as pd
import requests
from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict, load_dataset
from profanity_obfuscation import Prof

# Configuration
BASE_PATH = '/content/drive/MyDrive/266_project/'
RANDOM_SEED = 42

# Toxigen toxicity thresholds
TOXICITY_LOW_THRESHOLD = 2   # Below this = non-toxic (label 0)
TOXICITY_HIGH_THRESHOLD = 4  # Above this = toxic (label 2)

In [3]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 2. Profanity Obfuscation Setup

In [4]:
# Download profanity table and setup obfuscator
local_profanity_table_path = 'prof_table.tsv'
if not os.path.exists(local_profanity_table_path):
    profanity_table_url = 'https://raw.githubusercontent.com/dnozza/profanity-obfuscation/main/resources/prof_table.tsv'
    response = requests.get(profanity_table_url, timeout=30)  # Fail fast if the host is unreachable
    response.raise_for_status()
    with open(local_profanity_table_path, 'wb') as f:
        f.write(response.content)

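class CustomProfanityObfuscator(Prof):
    """Prof subclass that reads the profanity table from a local TSV file."""
    def __init__(self, profanity_table_path):
        # Note: Prof.__init__ is not called; only prof_table is set here.
        self.prof_table = pd.read_csv(profanity_table_path, sep="\t")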

obfuscator = CustomProfanityObfuscator(local_profanity_table_path)

def process_text(text):
    """Apply profanity obfuscation to standardize text."""
    if text is None:
        text = ""
    return obfuscator.obfuscate_string(text)

print("Profanity obfuscator ready.")

Profanity obfuscator ready.


## 3. Load SelfMA Dataset

SelfMA contains microaggressive text samples. Only samples with at least one tag are kept (1,300 of 3,240), and all of them are labeled **1 (microaggressive)**.
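
The tag filter applied to SelfMA can be illustrated on toy data. This is a minimal sketch: the three rows are made up, and only the `quote` and `tags` column names mirror the notebook.

```python
import pandas as pd

# Hypothetical rows; only the column names `quote` and `tags` mirror the notebook.
toy = pd.DataFrame({
    "quote": ["example A", "example B", "example C"],
    "tags": [["race"], [], None],
})

# Keep rows whose `tags` value is a non-empty list.
mask = toy["tags"].apply(lambda xs: isinstance(xs, list) and len(xs) > 0)
kept = toy[mask].reset_index(drop=True)

print(kept["quote"].tolist())
```

Rows with an empty list or a missing `tags` value are dropped, which is why the processed SelfMA set is smaller than the raw download.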

In [5]:
# Download SelfMA dataset
file_id = '138fisv1BB7pEDlc0bJQzoAV4Wgewi9bV'
file_name = 'self_MA.json'
!gdown --id {file_id} -O {file_name}

self_MA = pd.read_json(file_name, lines=True)
print(f"SelfMA loaded: {len(self_MA)} samples")

Downloading...
From: https://drive.google.com/uc?id=138fisv1BB7pEDlc0bJQzoAV4Wgewi9bV
To: /content/self_MA.json
100% 1.96M/1.96M [00:00<00:00, 95.1MB/s]
SelfMA loaded: 3240 samples


In [6]:
# Process SelfMA: apply profanity obfuscation and prepare for ML
# Keep only samples that have at least one tag (microaggressions)
mask = self_MA["tags"].apply(lambda xs: isinstance(xs, list) and len(xs) > 0)
self_MA_filtered = self_MA[mask].copy()

# Apply profanity obfuscation
self_MA_filtered['text'] = self_MA_filtered['quote'].apply(process_text)
self_MA_filtered['text'] = self_MA_filtered['text'].str.strip('"')  # Remove surrounding quotes
self_MA_filtered['label'] = 1  # All SelfMA samples are microaggressive

# Keep only text and label columns
selfma_df = self_MA_filtered[['text', 'label']].copy()
selfma_df = selfma_df[selfma_df['text'].str.strip() != ''].reset_index(drop=True)

print(f"SelfMA processed: {len(selfma_df)} samples (all label=1)")
display(selfma_df.head())

SelfMA processed: 1300 samples (all label=1)


index,text,label
0,"Yeah, but you're not that kind of Native.",1
1,CAN YOU HEAR ME?,1
2,All immigrants should go back to their own cou...,1
3,Whenever black people come to the beach I star...,1
4,I think black girls with short hair are ugly.,1


In [7]:
# Split SelfMA into train/validation/test (80/10/10)
train_df, temp_df = train_test_split(selfma_df, test_size=0.2, random_state=RANDOM_SEED)
valid_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=RANDOM_SEED)

selfMA_ds = DatasetDict({
    "train": Dataset.from_pandas(train_df.reset_index(drop=True)),
    "validation": Dataset.from_pandas(valid_df.reset_index(drop=True)),
    "test": Dataset.from_pandas(test_df.reset_index(drop=True)),
})

print("SelfMA splits created:")
for split, ds in selfMA_ds.items():
    print(f"  {split}: {len(ds)} samples")

SelfMA splits created:
  train: 1040 samples
  validation: 130 samples
  test: 130 samples


## 4. Load Toxigen Dataset

Toxigen contains text with human-annotated toxicity scores. We extract:
- **Label 0**: Low toxicity (toxicity_human < 2) - non-toxic
- **Label 2**: High toxicity (toxicity_human > 4) - toxic

Samples with scores from 2 to 4 (inclusive) are excluded to create clearer class separation.
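
The boundary behavior of these thresholds can be checked with a small sketch. The helper name `toxicity_to_label` and the probe scores are hypothetical; the comparisons follow the strict `<` and `>` rules stated above.

```python
from typing import Optional

TOXICITY_LOW_THRESHOLD = 2   # Same values as the notebook's configuration
TOXICITY_HIGH_THRESHOLD = 4


def toxicity_to_label(score: float) -> Optional[int]:
    """Return 0 (non-toxic), 2 (toxic), or None (excluded middle range)."""
    if score < TOXICITY_LOW_THRESHOLD:
        return 0
    if score > TOXICITY_HIGH_THRESHOLD:
        return 2
    return None


# Hypothetical scores chosen to probe both boundaries.
for score in [1.0, 2.0, 3.0, 4.0, 4.5]:
    print(score, toxicity_to_label(score))
```

Because both comparisons are strict, scores of exactly 2 or exactly 4 fall into the excluded range.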

In [8]:
# Load Toxigen dataset from Hugging Face
print("Loading Toxigen dataset from Hugging Face...")
toxigen_raw = load_dataset('toxigen/toxigen-data')

print("Toxigen loaded:")
print(toxigen_raw)

Loading Toxigen dataset from Hugging Face...



Toxigen loaded:
DatasetDict({
    test: Dataset({
        features: ['text', 'target_group', 'factual?', 'ingroup_effect', 'lewd', 'framing', 'predicted_group', 'stereotyping', 'intent', 'toxicity_ai', 'toxicity_human', 'predicted_author', 'actual_method'],
        num_rows: 940
    })
    train: Dataset({
        features: ['text', 'target_group', 'factual?', 'ingroup_effect', 'lewd', 'framing', 'predicted_group', 'stereotyping', 'intent', 'toxicity_ai', 'toxicity_human', 'predicted_author', 'actual_method'],
        num_rows: 8960
    })
})


In [9]:
def preprocess_toxigen(example):
    """Preprocess Toxigen entry: apply profanity obfuscation and assign label based on toxicity."""
    processed_text = process_text(example['text'])

    # Map toxicity_human to labels
    if example['toxicity_human'] < TOXICITY_LOW_THRESHOLD:
        label = 0  # Non-toxic
    elif example['toxicity_human'] > TOXICITY_HIGH_THRESHOLD:
        label = 2  # Toxic
    else:
        label = None  # Middle range - exclude

    return {'text': processed_text, 'label': label}

# Apply preprocessing
print("Preprocessing Toxigen dataset...")
columns_to_remove = [
    'text', 'target_group', 'factual?', 'ingroup_effect', 'lewd', 'framing',
    'predicted_group', 'stereotyping', 'intent', 'toxicity_ai', 'toxicity_human',
    'predicted_author', 'actual_method'
]

toxigen_processed = toxigen_raw.map(
    preprocess_toxigen,
    remove_columns=columns_to_remove,
    batched=False
)

# Filter out None labels (middle-range toxicity)
toxigen_processed = toxigen_processed.filter(lambda x: x['label'] is not None)

print("Toxigen preprocessing complete.")

Preprocessing Toxigen dataset...



Toxigen preprocessing complete.


In [10]:
# Create train/validation/test splits for Toxigen
# Original only has train and test, so we split train into train+validation
train_val_split = toxigen_processed['train'].train_test_split(test_size=0.2, seed=RANDOM_SEED)

toxigen_ds = DatasetDict({
    "train": train_val_split["train"],
    "validation": train_val_split["test"],
    "test": toxigen_processed["test"]
})

print("Toxigen splits created:")
for split, ds in toxigen_ds.items():
    df = ds.to_pandas()
    print(f"  {split}: {len(ds)} samples")
    print(f"    Label distribution: {dict(df['label'].value_counts())}")

Toxigen splits created:
  train: 4937 samples
    Label distribution: {0: np.int64(3338), 2: np.int64(1599)}
  validation: 1235 samples
    Label distribution: {0: np.int64(838), 2: np.int64(397)}
  test: 604 samples
    Label distribution: {0: np.int64(355), 2: np.int64(249)}


## 5. Balance and Combine Datasets (3-Class)

For each split, balance samples across all three classes:
- Label 0: Toxigen non-toxic
- Label 1: SelfMA microaggressive
- Label 2: Toxigen toxic

Each class is downsampled to the size of the smallest class in that split, which is SelfMA (label 1) in all three splits.
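
Downsampling to the smallest class can be sketched on toy data. The frame below is made up (5, 2, and 3 rows for labels 0, 1, and 2), and the `groupby` formulation is one possible way to express the idea, not necessarily how the notebook implements it.

```python
import pandas as pd

# Hypothetical imbalanced data: 5 / 2 / 3 rows for labels 0 / 1 / 2.
toy = pd.DataFrame({
    "text": [f"sample {i}" for i in range(10)],
    "label": [0] * 5 + [1] * 2 + [2] * 3,
})

# The smallest class sets the per-class sample size.
n_samples = int(toy["label"].value_counts().min())

# Draw n_samples rows from each class, then shuffle the combined frame.
parts = [
    group.sample(n=n_samples, random_state=42)
    for _, group in toy.groupby("label")
]
balanced = (
    pd.concat(parts, ignore_index=True)
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

print(balanced["label"].value_counts().sort_index().to_dict())
```

Every class ends up with `n_samples` rows, so the larger classes lose data while the smallest class is kept in full.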

In [11]:
def balance_3class(toxigen_split, selfma_split, split_name):
    """Balance 3-class dataset: Toxigen label 0, SelfMA label 1, Toxigen label 2."""
    toxigen_df = toxigen_split.to_pandas()
    selfma_df = selfma_split.to_pandas()

    # Extract each class
    toxigen_label_0 = toxigen_df[toxigen_df['label'] == 0].copy()
    toxigen_label_2 = toxigen_df[toxigen_df['label'] == 2].copy()
    selfma_label_1 = selfma_df[selfma_df['label'] == 1].copy()

    # Find minimum class size
    n_samples = min(len(toxigen_label_0), len(toxigen_label_2), len(selfma_label_1))

    print(f"{split_name}:")
    print(f"  Available - Label 0: {len(toxigen_label_0)}, Label 1: {len(selfma_label_1)}, Label 2: {len(toxigen_label_2)}")
    print(f"  Balanced to {n_samples} samples per class")

    # Sample equal amounts from each class
    balanced_0 = toxigen_label_0.sample(n=n_samples, random_state=RANDOM_SEED).reset_index(drop=True)
    balanced_1 = selfma_label_1.sample(n=n_samples, random_state=RANDOM_SEED).reset_index(drop=True)
    balanced_2 = toxigen_label_2.sample(n=n_samples, random_state=RANDOM_SEED).reset_index(drop=True)

    # Combine and shuffle
    combined = pd.concat([balanced_0, balanced_1, balanced_2], ignore_index=True)
    combined = combined.sample(frac=1, random_state=RANDOM_SEED).reset_index(drop=True)

    return combined

print("Balancing datasets (3-class)...\n")
balanced_train = balance_3class(toxigen_ds['train'], selfMA_ds['train'], 'Train')
balanced_valid = balance_3class(toxigen_ds['validation'], selfMA_ds['validation'], 'Validation')
balanced_test = balance_3class(toxigen_ds['test'], selfMA_ds['test'], 'Test')

Balancing datasets (3-class)...

Train:
  Available - Label 0: 3338, Label 1: 1040, Label 2: 1599
  Balanced to 1040 samples per class
Validation:
  Available - Label 0: 838, Label 1: 130, Label 2: 397
  Balanced to 130 samples per class
Test:
  Available - Label 0: 355, Label 1: 130, Label 2: 249
  Balanced to 130 samples per class


In [12]:
# Create final balanced DatasetDict
balanced_selfMA_toxigen_ds = DatasetDict({
    "train": Dataset.from_pandas(balanced_train),
    "validation": Dataset.from_pandas(balanced_valid),
    "test": Dataset.from_pandas(balanced_test),
})

print("\nFinal balanced dataset:")
for split, ds in balanced_selfMA_toxigen_ds.items():
    df = ds.to_pandas()
    print(f"  {split}: {len(ds)} samples")
    print(f"    Label distribution: {dict(df['label'].value_counts().sort_index())}")


Final balanced dataset:
  train: 3120 samples
    Label distribution: {0: np.int64(1040), 1: np.int64(1040), 2: np.int64(1040)}
  validation: 390 samples
    Label distribution: {0: np.int64(130), 1: np.int64(130), 2: np.int64(130)}
  test: 390 samples
    Label distribution: {0: np.int64(130), 1: np.int64(130), 2: np.int64(130)}


## 6. Save Dataset

In [13]:
# Save to Google Drive
output_path = os.path.join(BASE_PATH, 'balanced_selfMA_toxigen_ds')
balanced_selfMA_toxigen_ds.save_to_disk(output_path)
print(f"Dataset saved to: {output_path}")


Dataset saved to: /content/drive/MyDrive/266_project/balanced_selfMA_toxigen_ds


## 7. Summary

In [14]:
print("=" * 60)
print("DATASET CREATION COMPLETE: SelfMA + Toxigen")
print("=" * 60)

total = sum(len(balanced_selfMA_toxigen_ds[s]) for s in ['train', 'validation', 'test'])
print(f"""
Dataset Structure (3-Class):
----------------------------
- Label 0: Non-toxic text (from Toxigen, toxicity_human < {TOXICITY_LOW_THRESHOLD})
- Label 1: Microaggressive text (from SelfMA)
- Label 2: Toxic text (from Toxigen, toxicity_human > {TOXICITY_HIGH_THRESHOLD})

Split Sizes:
------------
- Train:      {len(balanced_selfMA_toxigen_ds['train']):,} samples
- Validation: {len(balanced_selfMA_toxigen_ds['validation']):,} samples
- Test:       {len(balanced_selfMA_toxigen_ds['test']):,} samples
- Total:      {total:,} samples

Class Balance: 1:1:1 (perfectly balanced across all 3 classes)

Output: {output_path}
""")

# Show samples from each class
print("Sample data from each class:")
train_df = balanced_selfMA_toxigen_ds['train'].to_pandas()
for label in [0, 1, 2]:
    print(f"\n--- Label {label} samples ---")
    display(train_df[train_df['label'] == label].head(3))

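============================================================
DATASET CREATION COMPLETE: SelfMA + Toxigen
============================================================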

Dataset Structure (3-Class):
----------------------------
- Label 0: Non-toxic text (from Toxigen, toxicity_human < 2)
- Label 1: Microaggressive text (from SelfMA)
- Label 2: Toxic text (from Toxigen, toxicity_human > 4)

Split Sizes:
------------
- Train:      3,120 samples
- Validation: 390 samples
- Test:       390 samples
- Total:      3,900 samples

Class Balance: 1:1:1 (perfectly balanced across all 3 classes)

Output: /content/drive/MyDrive/266_project/balanced_selfMA_toxigen_ds

Sample data from each class:

--- Label 0 samples ---


index,text,label
3,there is no difference in the quality of paren...,0
6,"it's not just about being a model minority, bu...",0
12,awesome,0



--- Label 1 samples ---


index,text,label
4,"Just smile for me baby, let me see that pretty...",1
10,I wish I was retarded. Then no one would expec...,1
11,You two boys can do this stretch against that ...,1



--- Label 2 samples ---


index,text,label
0,you can always tell when a person is of asian ...,2
1,i am not racist but mexicans are dirty and try...,2
2,youre just an arab keyboard warrior who is hid...,2
