# Phase 4: Create Master Training Dataset
## Combine GoEmotions + Crisis + Non-Crisis Data

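This notebook builds a **stratified training dataset for multi-task BERT**, targeting 60K rows. With the current inputs the sampled total is 52,767 rows, because several strata contain fewer rows than their per-group target.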

### Target Dataset Composition (60K total):

| Source | Target Rows | Stratification |
|--------|-------------|----------------|
| **Crisis** | 25,000 | Balanced by event_type |
| **Non-Crisis** | 18,000 | Balanced by source_dataset |
| **GoEmotions** | 17,000 | Balanced by emotion_label (13 emotions) |

### Why Stratified Sampling?
- Gives each of the 13 emotions the same per-class target; emotions with fewer rows than the target contribute every available row
- Ensures all crisis event types (hurricane, wildfire, etc.) are represented
- Keeps crisis and non-crisis examples roughly balanced for classification
- A smaller dataset (60K target) keeps BERT training efficient

### What happens after BERT is trained:
1. Apply trained BERT to **ORIGINAL FULL datasets** (1.5M+ non-crisis, 67K crisis)
2. Extract emotion features for ALL tweets
3. Use these features to create episodes & hourly aggregations for RL agent
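
As a rough illustration of step 3, the sketch below floors timestamps to the hour and aggregates per event. It is only a sketch: the `fear_score` column stands in for whatever emotion features the trained BERT produces, and the toy rows and the column name are assumptions, not part of this notebook.

```python
import pandas as pd

# Toy tweets; "fear_score" is a hypothetical per-tweet BERT output.
tweets = pd.DataFrame({
    "event_name": ["hurricane_harvey_2017"] * 3,
    "created_at": ["2017-08-26 10:05:00", "2017-08-26 10:40:00", "2017-08-26 11:15:00"],
    "fear_score": [0.8, 0.6, 0.2],
})

# Parse timestamps and bucket each tweet into its hour.
tweets["created_at"] = pd.to_datetime(tweets["created_at"])
tweets["hour"] = tweets["created_at"].dt.floor("h")

# One row per (event, hour): tweet volume and mean emotion score.
hourly = (
    tweets.groupby(["event_name", "hour"])
    .agg(tweet_count=("fear_score", "size"), mean_fear=("fear_score", "mean"))
    .reset_index()
)
print(hourly)
```

This is why `created_at` matters for the full datasets even though it is not needed for BERT training itself.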

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
from datetime import datetime

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

## 1. Load All Datasets

In [2]:
print("Loading datasets...\n")

# Load GoEmotions with 13 emotions
print("1. GoEmotions (with 13 emotions)...")
df_goemotions = pd.read_csv('goemotion_data/goemotions_with_13_emotions.csv')
print(f"   ✓ Loaded {len(df_goemotions):,} rows")
print(f"   Columns: {df_goemotions.columns.tolist()}")

# Load crisis data with emotion columns
print("\n2. Crisis data (with emotion columns)...")
df_crisis = pd.read_csv('standardized_data/crisis_combined_with_emotions.csv')
print(f"   ✓ Loaded {len(df_crisis):,} rows")
print(f"   Columns: {df_crisis.columns.tolist()}")

# Load non-crisis data with emotion columns
print("\n3. Non-crisis data (with emotion columns)...")
df_non_crisis = pd.read_csv('standardized_data/non_crisis_combined_with_emotions.csv')
print(f"   ✓ Loaded {len(df_non_crisis):,} rows")
print(f"   Columns: {df_non_crisis.columns.tolist()}")

print(f"\n{'='*80}")
print(f"Total rows to combine: {len(df_goemotions) + len(df_crisis) + len(df_non_crisis):,}")
print(f"{'='*80}")

Loading datasets...

1. GoEmotions (with 13 emotions)...
   ✓ Loaded 54,263 rows
   Columns: ['text', 'emotion_label', 'emotion_name', 'id', 'labels']

2. Crisis data (with emotion columns)...
   ✓ Loaded 66,748 rows
   Columns: ['text', 'created_at', 'event_name', 'event_type', 'crisis_label', 'source_dataset', 'informativeness', 'emotion_label', 'emotion_name']

3. Non-crisis data (with emotion columns)...
   ✓ Loaded 1,533,696 rows
   Columns: ['text', 'created_at', 'event_name', 'event_type', 'crisis_label', 'source_dataset', 'emotion_label', 'emotion_name']

Total rows to combine: 1,654,707


## 1.5 Stratified Sampling (Create 60K Balanced Dataset)

Apply stratified sampling to create a balanced dataset:

| Source | Target | Strategy |
|--------|--------|----------|
| **Crisis** | 25,000 | Equal samples per event_type |
| **Non-Crisis** | 18,000 | Equal samples per source_dataset |
| **GoEmotions** | 17,000 | Equal samples per emotion (1,307 per emotion, i.e. 17,000 // 13) |

This ensures:
- All 13 emotions share the same per-emotion target (capped by the rows available)
- All crisis event types are represented
- All non-crisis sources are represented

In [3]:
# =============================================================================
# STRATIFIED SAMPLING CONFIGURATION
# =============================================================================
TARGET_CRISIS = 25000
TARGET_NON_CRISIS = 18000
TARGET_GOEMOTIONS = 17000
TARGET_TOTAL = TARGET_CRISIS + TARGET_NON_CRISIS + TARGET_GOEMOTIONS  # 60K

print("=" * 80)
print(f"STRATIFIED SAMPLING TO {TARGET_TOTAL:,} ROWS")
print("=" * 80)
print(f"\nTargets:")
print(f"  Crisis:      {TARGET_CRISIS:,}")
print(f"  Non-crisis:  {TARGET_NON_CRISIS:,}")
print(f"  GoEmotions:  {TARGET_GOEMOTIONS:,}")
print(f"  TOTAL:       {TARGET_TOTAL:,}")

# =============================================================================
# 1. STRATIFIED SAMPLE FROM GOEMOTIONS (by emotion_label)
# =============================================================================
print("\n" + "-" * 60)
print("1. Sampling GoEmotions - Equal across 13 emotions")
print("-" * 60)

print(f"\nOriginal GoEmotions: {len(df_goemotions):,} rows")
print("Emotion distribution:")
print(df_goemotions['emotion_label'].value_counts().sort_index())

n_emotions = 13
target_per_emotion = TARGET_GOEMOTIONS // n_emotions  # 17000 // 13 = 1307 per emotion
print(f"\nTarget per emotion: {target_per_emotion}")

goemotions_sampled = []
for emotion_label in range(1, 14):
    emotion_df = df_goemotions[df_goemotions['emotion_label'] == emotion_label]
    available = len(emotion_df)
    
    if available >= target_per_emotion:
        sampled = emotion_df.sample(n=target_per_emotion, random_state=42)
    else:
        sampled = emotion_df
        print(f"  ⚠️  Emotion {emotion_label}: only {available} available")
    
    goemotions_sampled.append(sampled)

df_goemotions = pd.concat(goemotions_sampled, ignore_index=True)
print(f"\n✅ GoEmotions sampled: {len(df_goemotions):,} rows")

# =============================================================================
# 2. STRATIFIED SAMPLE FROM CRISIS (by event_type)
# =============================================================================
print("\n" + "-" * 60)
print("2. Sampling Crisis - Balanced by event_type")
print("-" * 60)

print(f"\nOriginal Crisis: {len(df_crisis):,} rows")
print("Event type distribution:")
print(df_crisis['event_type'].value_counts())

n_event_types = df_crisis['event_type'].nunique()
target_per_event = TARGET_CRISIS // n_event_types
print(f"\nUnique event types: {n_event_types}")
print(f"Target per event type: {target_per_event}")

crisis_sampled = []
for event_type in df_crisis['event_type'].unique():
    event_df = df_crisis[df_crisis['event_type'] == event_type]
    available = len(event_df)
    
    if available >= target_per_event:
        sampled = event_df.sample(n=target_per_event, random_state=42)
    else:
        sampled = event_df
        print(f"  ⚠️  {event_type}: only {available} available")
    
    crisis_sampled.append(sampled)

df_crisis = pd.concat(crisis_sampled, ignore_index=True)
print(f"\n✅ Crisis sampled: {len(df_crisis):,} rows")

# =============================================================================
# 3. STRATIFIED SAMPLE FROM NON-CRISIS (by source_dataset)
# =============================================================================
print("\n" + "-" * 60)
print("3. Sampling Non-Crisis - Balanced by source_dataset")
print("-" * 60)

print(f"\nOriginal Non-Crisis: {len(df_non_crisis):,} rows")
print("Source distribution:")
print(df_non_crisis['source_dataset'].value_counts())

n_sources = df_non_crisis['source_dataset'].nunique()
target_per_source = TARGET_NON_CRISIS // n_sources
print(f"\nUnique sources: {n_sources}")
print(f"Target per source: {target_per_source}")

non_crisis_sampled = []
for source in df_non_crisis['source_dataset'].unique():
    source_df = df_non_crisis[df_non_crisis['source_dataset'] == source]
    available = len(source_df)
    
    if available >= target_per_source:
        sampled = source_df.sample(n=target_per_source, random_state=42)
    else:
        sampled = source_df
        print(f"  ⚠️  {source}: only {available} available")
    
    non_crisis_sampled.append(sampled)

df_non_crisis = pd.concat(non_crisis_sampled, ignore_index=True)
print(f"\n✅ Non-Crisis sampled: {len(df_non_crisis):,} rows")

# =============================================================================
# SUMMARY
# =============================================================================
print("\n" + "=" * 80)
print("STRATIFIED SAMPLING COMPLETE")
print("=" * 80)
print(f"\n📊 Final counts:")
print(f"   GoEmotions:  {len(df_goemotions):,} rows")
print(f"   Crisis:      {len(df_crisis):,} rows")
print(f"   Non-crisis:  {len(df_non_crisis):,} rows")
print(f"   ─────────────────────────")
print(f"   TOTAL:       {len(df_goemotions) + len(df_crisis) + len(df_non_crisis):,} rows")

STRATIFIED SAMPLING TO 60,000 ROWS

Targets:
  Crisis:      25,000
  Non-crisis:  18,000
  GoEmotions:  17,000
  TOTAL:       60,000

------------------------------------------------------------
1. Sampling GoEmotions - Equal across 13 emotions
------------------------------------------------------------

Original GoEmotions: 54,263 rows
Emotion distribution:
emotion_label
1       658
2      4607
3      1148
4       132
5      3753
6      1831
7      1044
8      1218
9     11048
10     1543
11     4093
12     3898
13    19290
Name: count, dtype: int64

Target per emotion: 1307
  ⚠️  Emotion 1: only 658 available
  ⚠️  Emotion 3: only 1148 available
  ⚠️  Emotion 4: only 132 available
  ⚠️  Emotion 7: only 1044 available
  ⚠️  Emotion 8: only 1218 available

✅ GoEmotions sampled: 14,656 rows

------------------------------------------------------------
2. Sampling Crisis - Balanced by event_type
------------------------------------------------------------

Original
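
The GoEmotions total of 14,656 reported above follows directly from the printed distribution: each emotion contributes the smaller of its row count and the per-emotion target. The short check below reproduces that arithmetic using the counts copied from the output; it is a worked example, not part of the original notebook.

```python
# Row counts per emotion_label, copied from the distribution printed above.
emotion_counts = {
    1: 658, 2: 4607, 3: 1148, 4: 132, 5: 3753, 6: 1831, 7: 1044,
    8: 1218, 9: 11048, 10: 1543, 11: 4093, 12: 3898, 13: 19290,
}

TARGET_GOEMOTIONS = 17000
target_per_emotion = TARGET_GOEMOTIONS // len(emotion_counts)  # integer division

# Each emotion contributes min(available, target).
sampled_total = sum(min(count, target_per_emotion) for count in emotion_counts.values())
shortfall = TARGET_GOEMOTIONS - sampled_total

print(target_per_emotion, sampled_total, shortfall)
```

Five emotions (1, 3, 4, 7, 8) fall below the target, so the GoEmotions portion ends up 2,344 rows short of 17,000.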

## 2. Check Current Schemas

In [4]:
print("Current column schemas:\n")

print("GoEmotions columns:")
for col in df_goemotions.columns:
    print(f"  - {col}: {df_goemotions[col].dtype}")

print("\nCrisis columns:")
for col in df_crisis.columns:
    print(f"  - {col}: {df_crisis[col].dtype}")

print("\nNon-crisis columns:")
for col in df_non_crisis.columns:
    print(f"  - {col}: {df_non_crisis[col].dtype}")

Current column schemas:

GoEmotions columns:
  - text: str
  - emotion_label: int64
  - emotion_name: str
  - id: str
  - labels: str

Crisis columns:
  - text: str
  - created_at: str
  - event_name: str
  - event_type: str
  - crisis_label: int64
  - source_dataset: str
  - informativeness: str
  - emotion_label: float64
  - emotion_name: float64

Non-crisis columns:
  - text: str
  - created_at: str
  - event_name: str
  - event_type: str
  - crisis_label: int64
  - source_dataset: str
  - emotion_label: float64
  - emotion_name: float64


## 3. Define Master Schema

Create unified column structure for all datasets:
- **text**: Tweet/comment text
- **emotion_label**: Numeric emotion (1-13, NULL for unlabeled)
- **emotion_name**: Text emotion name (NULL for unlabeled)
- **source_dataset**: Origin of data (GoEmotions, HumAID, CrisisLex, etc.)
- **crisis_label**: Binary (1=crisis, 0=non-crisis, NULL for GoEmotions)
- **event_type**: General category (hurricane, sports, etc., NULL for GoEmotions)
- **event_name**: Specific event (hurricane_harvey_2017, etc., NULL for GoEmotions)
- **informativeness**: CrisisLex informativeness label (NULL for others)

**Note**: `created_at` is NOT included - BERT doesn't need timestamps. Timestamps are only needed for RL training, which uses the original full datasets.
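
One way to project any source onto this schema is `DataFrame.reindex`, which adds missing columns as NaN and drops columns not listed. The sketch below is an alternative to the column-by-column assignment used in the following cells, not the notebook's own method; `to_master_schema` is a hypothetical helper, and unlike the cells below it fills missing text columns with NaN rather than empty strings.

```python
import pandas as pd

MASTER_COLUMNS = [
    "text", "emotion_label", "emotion_name", "source_dataset",
    "crisis_label", "event_type", "event_name", "informativeness",
]

def to_master_schema(df: pd.DataFrame, source_name=None) -> pd.DataFrame:
    """Return a copy of df with exactly MASTER_COLUMNS, in order."""
    # reindex adds absent columns filled with NaN and drops extras such as 'id'.
    out = df.reindex(columns=MASTER_COLUMNS)
    if source_name is not None:
        out["source_dataset"] = source_name
    return out

# Toy GoEmotions-like frame with an extra 'id' column and no crisis columns.
toy = pd.DataFrame({
    "text": ["hello"],
    "emotion_label": [1],
    "emotion_name": ["fear"],
    "id": ["x1"],
})
std = to_master_schema(toy, source_name="GoEmotions")
print(std.columns.tolist())
```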

In [5]:
# Define master column set (NO created_at - not needed for BERT training)
MASTER_COLUMNS = [
    'text',
    'emotion_label',
    'emotion_name',
    'source_dataset',
    'crisis_label',
    'event_type',
    'event_name',
    'informativeness'
]

print("Master schema columns:")
for i, col in enumerate(MASTER_COLUMNS, 1):
    print(f"  {i}. {col}")

Master schema columns:
  1. text
  2. emotion_label
  3. emotion_name
  4. source_dataset
  5. crisis_label
  6. event_type
  7. event_name
  8. informativeness


## 4. Standardize GoEmotions Data

Add missing columns to GoEmotions dataset.

In [6]:
print("Standardizing GoEmotions data...\n")

# Create standardized GoEmotions dataframe
df_goemotions_std = pd.DataFrame()

# Keep existing columns
df_goemotions_std['text'] = df_goemotions['text']
df_goemotions_std['emotion_label'] = df_goemotions['emotion_label']
df_goemotions_std['emotion_name'] = df_goemotions['emotion_name']

# Add source
df_goemotions_std['source_dataset'] = 'GoEmotions'

# Add placeholder columns (GoEmotions is not crisis-related):
# NaN for the numeric crisis_label, empty strings for the text columns
df_goemotions_std['crisis_label'] = np.nan
df_goemotions_std['event_type'] = ''
df_goemotions_std['event_name'] = ''
df_goemotions_std['informativeness'] = ''

print(f"✓ GoEmotions standardized: {len(df_goemotions_std):,} rows")
print(f"  Columns: {df_goemotions_std.columns.tolist()}")
print(f"\nSample:")
display(df_goemotions_std.head(3))

Standardizing GoEmotions data...

✓ GoEmotions standardized: 14,656 rows
  Columns: ['text', 'emotion_label', 'emotion_name', 'source_dataset', 'crisis_label', 'event_type', 'event_name', 'informativeness']

Sample:


Unnamed: 0,text,emotion_label,emotion_name,source_dataset,crisis_label,event_type,event_name,informativeness
0,To make her feel threatened,1,fear,GoEmotions,,,,
1,Your coaching is terrible.... be ready and see how [NAME] uses [NAME],1,fear,GoEmotions,,,,
2,"He may have, I was more worried about the ""running and shooting the AR one handed, off to the si...",1,fear,GoEmotions,,,,


## 5. Standardize Crisis Data

Select and reorder crisis columns to match master schema.

In [7]:
print("Standardizing crisis data...\n")

# Create standardized crisis dataframe
df_crisis_std = pd.DataFrame()

df_crisis_std['text'] = df_crisis['text']
df_crisis_std['emotion_label'] = df_crisis['emotion_label']  # Will be NaN
df_crisis_std['emotion_name'] = df_crisis['emotion_name']    # Also NaN (unlabeled)
df_crisis_std['source_dataset'] = df_crisis['source_dataset']
df_crisis_std['crisis_label'] = df_crisis['crisis_label']
df_crisis_std['event_type'] = df_crisis['event_type']
df_crisis_std['event_name'] = df_crisis['event_name']
df_crisis_std['informativeness'] = df_crisis['informativeness']

print(f"✓ Crisis standardized: {len(df_crisis_std):,} rows")
print(f"  Columns: {df_crisis_std.columns.tolist()}")
print(f"\nSample:")
display(df_crisis_std.head(3))

Standardizing crisis data...

✓ Crisis standardized: 20,855 rows
  Columns: ['text', 'emotion_label', 'emotion_name', 'source_dataset', 'crisis_label', 'event_type', 'event_name', 'informativeness']

Sample:


Unnamed: 0,text,emotion_label,emotion_name,source_dataset,crisis_label,event_type,event_name,informativeness
0,"Uma tragédia! Força, RS! RT @JornalOGlobo: Incêndio em boate de #SantaMaria já é o segundo maior...",,,crisislex,1,wildfire,2013_Brazil_nightclub_fire,related_informative
1,"#Greece #Fire The fires have claimed 79 lives in Greece and the death toll is expected to climb,...",,,humaid,1,wildfire,greece_wildfires_2018_train,
2,"RT @ch150ch: Abbott (who claims exps for runs, swims &amp; bike rides) changes rules so less peo...",,,crisislex,1,wildfire,2013_Australia_bushfire,related_informative


## 6. Standardize Non-Crisis Data

Select and reorder non-crisis columns to match master schema.

In [8]:
print("Standardizing non-crisis data...\n")

# Create standardized non-crisis dataframe
df_non_crisis_std = pd.DataFrame()

df_non_crisis_std['text'] = df_non_crisis['text']
df_non_crisis_std['emotion_label'] = df_non_crisis['emotion_label']  # Will be NaN
df_non_crisis_std['emotion_name'] = df_non_crisis['emotion_name']    # Also NaN (unlabeled)
df_non_crisis_std['source_dataset'] = df_non_crisis['source_dataset']
df_non_crisis_std['crisis_label'] = df_non_crisis['crisis_label']
df_non_crisis_std['event_type'] = df_non_crisis['event_type']
df_non_crisis_std['event_name'] = df_non_crisis['event_name']

# Non-crisis doesn't have informativeness
df_non_crisis_std['informativeness'] = ''

print(f"✓ Non-crisis standardized: {len(df_non_crisis_std):,} rows")
print(f"  Columns: {df_non_crisis_std.columns.tolist()}")
print(f"\nSample:")
display(df_non_crisis_std.head(3))

Standardizing non-crisis data...

✓ Non-crisis standardized: 17,256 rows
  Columns: ['text', 'emotion_label', 'emotion_name', 'source_dataset', 'crisis_label', 'event_type', 'event_name', 'informativeness']

Sample:


Unnamed: 0,text,emotion_label,emotion_name,source_dataset,crisis_label,event_type,event_name,informativeness
0,"@YELLOWCLAW at @coachella is happening, life is good! #ymfc #Coachella2015",,,coachella,0,entertainment,coachella_2015,
1,"@portlandaniel @coachella Ah man, that's a bummer! Caleb and I have been training for it. #coach...",,,coachella,0,entertainment,coachella_2015,
2,@flo_tweet and @MarinasDiamonds lined up for #Coachella2015 _√ô√∑¬ù,,,coachella,0,entertainment,coachella_2015,


## 7. Validate Schema Alignment

Ensure all three datasets have identical column structure before combining.
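
The cell below prints the result of the comparison. If the notebook should instead stop on a mismatch, a stricter variant can raise an error; the sketch below assumes that behavior is wanted, and `assert_same_schema` is a hypothetical helper name.

```python
import pandas as pd

def assert_same_schema(frames, expected_columns):
    """Raise ValueError if any frame's columns differ from expected_columns (names or order)."""
    for name, frame in frames.items():
        actual = frame.columns.tolist()
        if actual != expected_columns:
            missing = sorted(set(expected_columns) - set(actual))
            extra = sorted(set(actual) - set(expected_columns))
            raise ValueError(
                f"{name}: columns differ from expected "
                f"(missing={missing}, unexpected={extra})"
            )

# Toy frames: one matches the expected schema, one lacks a column.
expected = ["text", "emotion_label", "crisis_label"]
good = pd.DataFrame(columns=expected)
bad = pd.DataFrame(columns=["text", "crisis_label"])

assert_same_schema({"good": good}, expected)  # passes silently

try:
    assert_same_schema({"bad": bad}, expected)
    mismatch_message = None
except ValueError as err:
    mismatch_message = str(err)
print(mismatch_message)
```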

In [9]:
print("=" * 80)
print("SCHEMA VALIDATION")
print("=" * 80)

# Check column names
goemotions_cols = df_goemotions_std.columns.tolist()
crisis_cols = df_crisis_std.columns.tolist()
non_crisis_cols = df_non_crisis_std.columns.tolist()

print(f"\nGoEmotions columns: {goemotions_cols}")
print(f"Crisis columns:     {crisis_cols}")
print(f"Non-crisis columns: {non_crisis_cols}")

# Validate all match
if goemotions_cols == crisis_cols == non_crisis_cols:
    print("\n✅ All datasets have matching column structure!")
else:
    print("\n❌ Column mismatch detected!")
    print(f"\nDifferences:")
    if goemotions_cols != crisis_cols:
        print(f"  GoEmotions vs Crisis: {set(goemotions_cols) ^ set(crisis_cols)}")
    if crisis_cols != non_crisis_cols:
        print(f"  Crisis vs Non-crisis: {set(crisis_cols) ^ set(non_crisis_cols)}")

# Check if columns match master schema
if goemotions_cols == MASTER_COLUMNS:
    print("\n✅ Columns match master schema!")
else:
    print(f"\n⚠️  Column order differs from master schema")

print(f"\n" + "=" * 80)

SCHEMA VALIDATION

GoEmotions columns: ['text', 'emotion_label', 'emotion_name', 'source_dataset', 'crisis_label', 'event_type', 'event_name', 'informativeness']
Crisis columns:     ['text', 'emotion_label', 'emotion_name', 'source_dataset', 'crisis_label', 'event_type', 'event_name', 'informativeness']
Non-crisis columns: ['text', 'emotion_label', 'emotion_name', 'source_dataset', 'crisis_label', 'event_type', 'event_name', 'informativeness']

✅ All datasets have matching column structure!

✅ Columns match master schema!



## 8. Combine All Datasets

Concatenate all three standardized datasets into master training file.

In [10]:
print("Combining datasets...\n")

# Concatenate all datasets
df_master = pd.concat([
    df_goemotions_std,
    df_crisis_std,
    df_non_crisis_std
], ignore_index=True)

print(f"Combined master dataset: {len(df_master):,} rows")

# Shuffle the dataset so rows are randomized (not grouped by source)
print("Shuffling dataset to randomize row order...")
df_master = df_master.sample(frac=1, random_state=42).reset_index(drop=True)

print(f"\n✅ Combined and shuffled master dataset created!")
print(f"\nTotal rows: {len(df_master):,}")
print(f"\nBreakdown:")
print(f"  GoEmotions:  {len(df_goemotions_std):,} ({len(df_goemotions_std)/len(df_master)*100:.1f}%)")
print(f"  Crisis:      {len(df_crisis_std):,} ({len(df_crisis_std)/len(df_master)*100:.1f}%)")
print(f"  Non-crisis:  {len(df_non_crisis_std):,} ({len(df_non_crisis_std)/len(df_master)*100:.1f}%)")

print(f"\nColumns: {df_master.columns.tolist()}")
print(f"\nMemory usage: {df_master.memory_usage(deep=True).sum() / (1024**2):.2f} MB")

# Show that rows are now mixed
print(f"\nFirst 10 rows source distribution (showing shuffle worked):")
print(df_master.head(10)['source_dataset'].tolist())

Combining datasets...

Combined master dataset: 52,767 rows
Shuffling dataset to randomize row order...

✅ Combined and shuffled master dataset created!

Total rows: 52,767

Breakdown:
  GoEmotions:  14,656 (27.8%)
  Crisis:      20,855 (39.5%)
  Non-crisis:  17,256 (32.7%)

Columns: ['text', 'emotion_label', 'emotion_name', 'source_dataset', 'crisis_label', 'event_type', 'event_name', 'informativeness']

Memory usage: 24.86 MB

First 10 rows source distribution (showing shuffle worked):
['crisislex', 'us_election', 'GoEmotions', 'crisislex', 'humaid', 'crisislex', 'game_of_thrones', 'crisislex', 'GoEmotions', 'GoEmotions']


## 9. Data Quality Validation

In [11]:
print("=" * 80)
print("DATA QUALITY VALIDATION")
print("=" * 80)

# Check for nulls in critical columns
print(f"\nNull counts:")
print(df_master.isnull().sum())

# Check text column
null_text = df_master['text'].isna().sum()
empty_text = (df_master['text'] == '').sum()
print(f"\nText validation:")
print(f"  Null texts: {null_text}")
print(f"  Empty texts: {empty_text}")
if null_text == 0 and empty_text == 0:
    print(f"  ✅ All rows have text content")

# Check emotion labels
labeled_rows = df_master['emotion_label'].notna().sum()
unlabeled_rows = df_master['emotion_label'].isna().sum()
print(f"\nEmotion label status:")
print(f"  Labeled (GoEmotions):    {labeled_rows:,} ({labeled_rows/len(df_master)*100:.1f}%)")
print(f"  Unlabeled (Crisis+Non):  {unlabeled_rows:,} ({unlabeled_rows/len(df_master)*100:.1f}%)")

# Check crisis labels
crisis_rows = (df_master['crisis_label'] == 1).sum()
non_crisis_rows = (df_master['crisis_label'] == 0).sum()
unlabeled_crisis = df_master['crisis_label'].isna().sum()
print(f"\nCrisis label distribution:")
print(f"  Crisis (1):      {crisis_rows:,}")
print(f"  Non-crisis (0):  {non_crisis_rows:,}")
print(f"  Unlabeled (GoE): {unlabeled_crisis:,}")

# Check source distribution
print(f"\nSource dataset distribution:")
print(df_master['source_dataset'].value_counts())

print(f"\n" + "=" * 80)

DATA QUALITY VALIDATION

Null counts:
text                   4
emotion_label      38111
emotion_name       38111
source_dataset         0
crisis_label       14656
event_type             0
event_name             0
informativeness     8335
dtype: int64

Text validation:
  Null texts: 4
  Empty texts: 0

Emotion label status:
  Labeled (GoEmotions):    14,656 (27.8%)
  Unlabeled (Crisis+Non):  38,111 (72.2%)

Crisis label distribution:
  Crisis (1):      20,855
  Non-crisis (0):  17,256
  Unlabeled (GoE): 14,656

Source dataset distribution:
source_dataset
GoEmotions         14656
crisislex          12733
humaid              8122
us_election         2571
game_of_thrones     2571
worldcup_2018       2571
tokyo_olympics      2571
coachella           2571
fifa_worldcup       2571
music_concerts      1830
Name: count, dtype: int64
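
The validation output reports 4 rows with null text, and the cell above only prints a confirmation when there are none. One option, sketched here under the assumption that such rows should simply be dropped before saving (the notebook itself does not do this), is a small filter; `drop_missing_text` is a hypothetical helper name.

```python
import pandas as pd

def drop_missing_text(df: pd.DataFrame, text_column: str = "text") -> pd.DataFrame:
    """Remove rows whose text is null, empty, or whitespace-only."""
    # Treat nulls as empty strings so a single comparison covers all three cases.
    has_text = df[text_column].fillna("").astype(str).str.strip() != ""
    return df[has_text].reset_index(drop=True)

# Toy frame with one null and one whitespace-only text.
toy = pd.DataFrame({
    "text": ["ok", None, "   ", "fine"],
    "crisis_label": [1, 0, 0, 1],
})
cleaned = drop_missing_text(toy)
print(len(toy), "->", len(cleaned))
```

Applied to `df_master`, this would remove the 4 null-text rows and change the row counts reported in the later cells accordingly.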



## 10. Show Sample Data from Each Source

In [12]:
print("Sample rows from each source:\n")

print("GoEmotions sample (with emotion labels):")
display(df_master[df_master['source_dataset'] == 'GoEmotions'][['text', 'emotion_label', 'emotion_name', 'source_dataset']].head(3))

print("\nCrisis sample (emotion labels = NULL):")
crisis_sample = df_master[df_master['crisis_label'] == 1][['text', 'emotion_label', 'emotion_name', 'event_name', 'crisis_label']].head(3)
display(crisis_sample)

print("\nNon-crisis sample (emotion labels = NULL):")
non_crisis_sample = df_master[df_master['crisis_label'] == 0][['text', 'emotion_label', 'emotion_name', 'event_name', 'crisis_label']].head(3)
display(non_crisis_sample)

Sample rows from each source:

GoEmotions sample (with emotion labels):


Unnamed: 0,text,emotion_label,emotion_name,source_dataset
2,"you don’t appreciate the good times till you go through the bad.. things will get better OP, sta...",11.0,gratitude,GoEmotions
8,"I'm hoping it's more statistical noise at this point still, but it's definitely worrying.",4.0,anxiety,GoEmotions
9,Well I can feel it was mentally hard bringing myself to go full [NAME] and shit talking prior to...,4.0,anxiety,GoEmotions



Crisis sample (emotion labels = NULL):


Unnamed: 0,text,emotion_label,emotion_name,event_name,crisis_label
0,There's a bomb explosion again in Texas. From a fertilizer plant. 0_0 #PrayForTexas,,,2013_West_Texas_explosion,1.0
3,"#rescueph Babyshambles follows me personally. Omg. Will be this. I can not believe this, my dear...",,,2012_Philipinnes_floods,1.0
4,RT @604Now: 200 #BC firefighters head over to #FortMcMurray to assist in wildfire,,,canada_wildfires_2016_test,1.0



Non-crisis sample (emotion labels = NULL):


Unnamed: 0,text,emotion_label,emotion_name,event_name,crisis_label
1,@trollBigotry @KamalaHarris I think you mean 47 years of #JoeBiden,,,us_election_2020,0.0
6,I am so sick of advertisements for that Game Of Thrones mobile game.,,,got_season8_2019,0.0
10,@burnt_rain @MikeFromYEG Also one of the main characters on Game of Thrones. He's played by Qui...,,,got_season8_2019,0.0


## 11. Save Master Training Dataset

In [13]:
# Save to master_training_data folder
output_path = 'master_training_data/master_training_data_v5.csv'

print(f"Saving master training dataset to {output_path}...\n")

df_master.to_csv(output_path, index=False)

file_size = Path(output_path).stat().st_size / (1024**2)

print("=" * 80)
print("MASTER DATASET SAVED")
print("=" * 80)
print(f"\n✅ Saved to: {output_path}")
print(f"\nFile size: {file_size:.2f} MB")
print(f"Total rows: {len(df_master):,}")
print(f"Total columns: {len(df_master.columns)}")
print(f"\nColumns: {df_master.columns.tolist()}")
print(f"\n" + "=" * 80)

Saving master training dataset to master_training_data/master_training_data_v5.csv...

MASTER DATASET SAVED

✅ Saved to: master_training_data/master_training_data_v5.csv

File size: 8.04 MB
Total rows: 52,767
Total columns: 8

Columns: ['text', 'emotion_label', 'emotion_name', 'source_dataset', 'crisis_label', 'event_type', 'event_name', 'informativeness']



## 12. Create Smaller Sample File for Testing

In [14]:
# Create 10K sample for quick testing
sample_size = min(10000, len(df_master))
df_sample = df_master.sample(n=sample_size, random_state=42)

sample_path = 'master_training_data/master_training_sample_v5.csv'
df_sample.to_csv(sample_path, index=False)

print(f"✅ Created sample file: {sample_path}")
print(f"   Rows: {len(df_sample):,}")
print(f"   Size: {Path(sample_path).stat().st_size / (1024**2):.2f} MB")

✅ Created sample file: master_training_data/master_training_sample_v5.csv
   Rows: 10,000
   Size: 1.50 MB
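
One thing to keep in mind when reloading these files: the standardization cells fill some placeholder columns with empty strings, which count as non-null in memory, but `pd.read_csv` parses empty fields as NaN by default. The toy round trip below illustrates the effect with an in-memory buffer; the two-row frame is illustrative only.

```python
import io
import pandas as pd

frame = pd.DataFrame({
    "text": ["a", "b"],
    "event_type": ["", "hurricane"],  # empty-string placeholder, as in the GoEmotions rows
})
nulls_before = int(frame["event_type"].isna().sum())

# Write to an in-memory CSV and read it back with default settings.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
nulls_after = int(reloaded["event_type"].isna().sum())

print(nulls_before, nulls_after)
```

Null counts computed after reloading the saved CSV can therefore be higher than those printed in section 9. Passing `keep_default_na=False` to `read_csv` is one way to keep empty fields as empty strings.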


## 13. Final Summary & Statistics

In [15]:
print("=" * 80)
print("FINAL SUMMARY")
print("=" * 80)

print(f"\n📊 Dataset Composition (Stratified 60K):")
print(f"   Total rows:          {len(df_master):,}")
print(f"   Crisis:              {len(df_crisis_std):,} (balanced by event_type)")
print(f"   Non-crisis:          {len(df_non_crisis_std):,} (balanced by source)")
print(f"   GoEmotions:          {len(df_goemotions_std):,} (balanced by emotion)")

print(f"\n📁 Files Created:")
print(f"   Main:   master_training_data/master_training_data_v5.csv ({file_size:.2f} MB)")
print(f"   Sample: master_training_data/master_training_sample_v5.csv")

print(f"\n🏷️ Label Status (to be filled by LLM in notebook 06):")
print(f"   emotion_label:  {df_master['emotion_label'].notna().sum():,} labeled, {df_master['emotion_label'].isna().sum():,} need LLM")
print(f"   crisis_label:   {df_master['crisis_label'].notna().sum():,} labeled, {df_master['crisis_label'].isna().sum():,} need LLM")

print(f"\n🔧 Schema:")
print(f"   Columns: {len(df_master.columns)}")
for i, col in enumerate(df_master.columns, 1):
    print(f"      {i}. {col}")

print(f"\n📋 Next Steps:")
print(f"   1. Run notebook 06 to fill missing labels using Gemini API")
print(f"   2. Train multi-task BERT on labeled 60K dataset")
print(f"   3. Apply trained BERT to ORIGINAL FULL datasets")
print(f"   4. Create episodes & hourly aggregations for RL agent")

print(f"\n" + "=" * 80)
print("✅ PHASE 4 COMPLETE!")
print("=" * 80)

FINAL SUMMARY

📊 Dataset Composition (Stratified 60K):
   Total rows:          52,767
   Crisis:              20,855 (balanced by event_type)
   Non-crisis:          17,256 (balanced by source)
   GoEmotions:          14,656 (balanced by emotion)

📁 Files Created:
   Main:   master_training_data/master_training_data_v5.csv (8.04 MB)
   Sample: master_training_data/master_training_sample_v5.csv

🏷️ Label Status (to be filled by LLM in notebook 06):
   emotion_label:  14,656 labeled, 38,111 need LLM
   crisis_label:   38,111 labeled, 14,656 need LLM

🔧 Schema:
   Columns: 8
      1. text
      2. emotion_label
      3. emotion_name
      4. source_dataset
      5. crisis_label
      6. event_type
      7. event_name
      8. informativeness

📋 Next Steps:
   1. Run notebook 06 to fill missing labels using Gemini API
   2. Train multi-task BERT on labeled 60K dataset
   3. Apply trained BERT to ORIGINAL FULL datasets
   4. Create episodes & hourly aggregations for RL agent