# Phase 4: Create Master Training Dataset
## Combine GoEmotions + Crisis + Non-Crisis Data

This notebook:
1. Loads GoEmotions with 13 emotions (has emotion labels)
2. Loads crisis data with emotion columns (labels = NULL)
3. Loads non-crisis data with emotion columns (labels = NULL)
4. Standardizes all columns across datasets
5. Combines them into a single master training file
6. Validates and saves final dataset

### Data Sources:
- **GoEmotions**: ~54K Reddit comments with labeled emotions (used to train BERT)
- **Crisis**: ~67K crisis tweets (unlabeled; BERT will predict emotions)
- **Non-Crisis**: ~1.5M non-crisis tweets (unlabeled; BERT will predict emotions)
- **Total**: ~1.65M rows in the combined file

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
from datetime import datetime

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

## 1. Load All Datasets

In [2]:
print("Loading datasets...\n")

# Load GoEmotions with 13 emotions
print("1. GoEmotions (with 13 emotions)...")
df_goemotions = pd.read_csv('goemotion_data/goemotions_with_13_emotions.csv')
print(f"   ✓ Loaded {len(df_goemotions):,} rows")
print(f"   Columns: {df_goemotions.columns.tolist()}")

# Load crisis data with emotion columns
print("\n2. Crisis data (with emotion columns)...")
df_crisis = pd.read_csv('standardized_data/crisis_combined_with_emotions.csv')
print(f"   ✓ Loaded {len(df_crisis):,} rows")
print(f"   Columns: {df_crisis.columns.tolist()}")

# Load non-crisis data with emotion columns
print("\n3. Non-crisis data (with emotion columns)...")
df_non_crisis = pd.read_csv('standardized_data/non_crisis_combined_with_emotions.csv')
print(f"   ✓ Loaded {len(df_non_crisis):,} rows")
print(f"   Columns: {df_non_crisis.columns.tolist()}")

print(f"\n{'='*80}")
print(f"Total rows to combine: {len(df_goemotions) + len(df_crisis) + len(df_non_crisis):,}")
print(f"{'='*80}")

Loading datasets...

1. GoEmotions (with 13 emotions)...
   ✓ Loaded 54,263 rows
   Columns: ['text', 'emotion_label', 'emotion_name', 'id', 'labels']

2. Crisis data (with emotion columns)...
   ✓ Loaded 66,748 rows
   Columns: ['text', 'created_at', 'event_name', 'event_type', 'crisis_label', 'source_dataset', 'informativeness', 'emotion_label', 'emotion_name']

3. Non-crisis data (with emotion columns)...
   ✓ Loaded 1,533,696 rows
   Columns: ['text', 'created_at', 'event_name', 'event_type', 'crisis_label', 'source_dataset', 'emotion_label', 'emotion_name']

Total rows to combine: 1,654,707


## 2. Check Current Schemas

In [3]:
print("Current column schemas:\n")

print("GoEmotions columns:")
for col in df_goemotions.columns:
    print(f"  - {col}: {df_goemotions[col].dtype}")

print("\nCrisis columns:")
for col in df_crisis.columns:
    print(f"  - {col}: {df_crisis[col].dtype}")

print("\nNon-crisis columns:")
for col in df_non_crisis.columns:
    print(f"  - {col}: {df_non_crisis[col].dtype}")

Current column schemas:

GoEmotions columns:
  - text: str
  - emotion_label: int64
  - emotion_name: str
  - id: str
  - labels: str

Crisis columns:
  - text: str
  - created_at: str
  - event_name: str
  - event_type: str
  - crisis_label: int64
  - source_dataset: str
  - informativeness: str
  - emotion_label: float64
  - emotion_name: float64

Non-crisis columns:
  - text: str
  - created_at: str
  - event_name: str
  - event_type: str
  - crisis_label: int64
  - source_dataset: str
  - emotion_label: float64
  - emotion_name: float64
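
Note that `emotion_name` is reported as `float64` in the crisis and non-crisis frames: pandas infers that dtype for a column whose cells are all empty. A minimal sketch of how explicit dtypes could be passed to `pd.read_csv` so the placeholder columns keep a label-friendly type. The two-row CSV below is made up for illustration and only stands in for the real files.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for a file whose emotion columns are all empty
csv_text = (
    "text,crisis_label,emotion_label,emotion_name\n"
    "first tweet,1,,\n"
    "second tweet,1,,\n"
)

# Default inference: the all-empty columns come back as float64
df_default = pd.read_csv(io.StringIO(csv_text))

# Explicit nullable dtypes keep integer/string semantics and still allow missing values
df_typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"emotion_label": "Int64", "emotion_name": "string"},
)

print(df_default.dtypes)
print(df_typed.dtypes)
```

Whether this is worth doing depends on how the later prediction step writes its results back; it mainly avoids surprises when string labels are assigned into a float column.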


## 3. Define Master Schema

Create a unified column structure for all datasets:
- **text**: Tweet/comment text
- **emotion_label**: Numeric emotion (1-13, NULL for unlabeled)
- **emotion_name**: Text emotion name (NULL for unlabeled)
- **source_dataset**: Origin of data (GoEmotions, HumAID, CrisisLex, etc.)
- **crisis_label**: Binary (1=crisis, 0=non-crisis, NULL for GoEmotions)
- **event_type**: General category (hurricane, sports, etc., NULL for GoEmotions)
- **event_name**: Specific event (hurricane_harvey_2017, etc., NULL for GoEmotions)
- **created_at**: Timestamp (NULL for GoEmotions)
- **informativeness**: CrisisLex informativeness label (NULL for others)

In [4]:
# Define master column set
MASTER_COLUMNS = [
    'text',
    'emotion_label',
    'emotion_name',
    'source_dataset',
    'crisis_label',
    'event_type',
    'event_name',
    'created_at',
    'informativeness'
]

print("Master schema columns:")
for i, col in enumerate(MASTER_COLUMNS, 1):
    print(f"  {i}. {col}")

Master schema columns:
  1. text
  2. emotion_label
  3. emotion_name
  4. source_dataset
  5. crisis_label
  6. event_type
  7. event_name
  8. created_at
  9. informativeness
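
As an alternative to assigning columns one by one, a schema like this can be applied with `DataFrame.reindex`. The sketch below is not the notebook's method: the helper name `conform_to_master` is hypothetical, and it fills missing columns with NaN rather than the empty strings used in the cells that follow.

```python
import pandas as pd

MASTER_COLUMNS = [
    "text", "emotion_label", "emotion_name", "source_dataset", "crisis_label",
    "event_type", "event_name", "created_at", "informativeness",
]


def conform_to_master(df, source_dataset=None):
    """Return a new frame with exactly MASTER_COLUMNS, in that order (hypothetical helper)."""
    # reindex adds any missing columns as NaN and drops columns not in the list
    out = df.reindex(columns=MASTER_COLUMNS)
    if source_dataset is not None:
        out["source_dataset"] = source_dataset
    return out


# Tiny made-up frame shaped like the GoEmotions file (it has an extra 'id' column)
demo = pd.DataFrame(
    {"text": ["hello"], "emotion_label": [13], "emotion_name": ["neutral"], "id": ["abc"]}
)
demo_std = conform_to_master(demo, source_dataset="GoEmotions")
print(demo_std.columns.tolist())
```

One helper applied to all three frames would keep the column order defined in a single place.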


## 4. Standardize GoEmotions Data

Add missing columns to GoEmotions dataset.

In [5]:
print("Standardizing GoEmotions data...\n")

# Create standardized GoEmotions dataframe
df_goemotions_std = pd.DataFrame()

# Keep existing columns
df_goemotions_std['text'] = df_goemotions['text']
df_goemotions_std['emotion_label'] = df_goemotions['emotion_label']
df_goemotions_std['emotion_name'] = df_goemotions['emotion_name']

# Add source
df_goemotions_std['source_dataset'] = 'GoEmotions'

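# Add placeholder columns (GoEmotions has no crisis metadata).
# Note: crisis_label is NaN, but the string columns are set to '' rather than NaN,
# so isnull() will not count them as missing.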
df_goemotions_std['crisis_label'] = np.nan
df_goemotions_std['event_type'] = ''
df_goemotions_std['event_name'] = ''
df_goemotions_std['created_at'] = ''
df_goemotions_std['informativeness'] = ''

print(f"✓ GoEmotions standardized: {len(df_goemotions_std):,} rows")
print(f"  Columns: {df_goemotions_std.columns.tolist()}")
print(f"\nSample:")
display(df_goemotions_std.head(3))

Standardizing GoEmotions data...

✓ GoEmotions standardized: 54,263 rows
  Columns: ['text', 'emotion_label', 'emotion_name', 'source_dataset', 'crisis_label', 'event_type', 'event_name', 'created_at', 'informativeness']

Sample:


Unnamed: 0,text,emotion_label,emotion_name,source_dataset,crisis_label,event_type,event_name,created_at,informativeness
0,My favourite food is anything I didn't have to cook myself.,13,neutral,GoEmotions,,,,,
1,"Now if he does off himself, everyone will think hes having a laugh screwing with people instead ...",13,neutral,GoEmotions,,,,,
2,WHY THE FUCK IS BAYLESS ISOING,2,anger,GoEmotions,,,,,
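
The placeholder columns here mix two kinds of "empty": `crisis_label` is NaN while the string columns hold `''`. A small sketch, on a made-up two-row frame, of how the two differ in memory but become indistinguishable after a CSV round trip with default `read_csv` settings:

```python
import io

import numpy as np
import pandas as pd

# Made-up frame: one NaN placeholder column and one empty-string placeholder column
demo = pd.DataFrame(
    {
        "text": ["a", "b"],
        "crisis_label": [np.nan, np.nan],
        "event_type": ["", ""],
    }
)

# In memory, only the NaN column is counted as missing
print(demo.isnull().sum())

# Write to an in-memory CSV and read it back with default settings
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After the round trip, the empty strings are read back as NaN too
print(reloaded.isnull().sum())
```

This means null counts computed before saving can differ from null counts computed after reloading the saved file.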


## 5. Standardize Crisis Data

Select and reorder crisis columns to match master schema.

In [6]:
print("Standardizing crisis data...\n")

# Create standardized crisis dataframe
df_crisis_std = pd.DataFrame()

df_crisis_std['text'] = df_crisis['text']
df_crisis_std['emotion_label'] = df_crisis['emotion_label']  # All NaN (unlabeled)
df_crisis_std['emotion_name'] = df_crisis['emotion_name']    # All NaN (unlabeled)
df_crisis_std['source_dataset'] = df_crisis['source_dataset']
df_crisis_std['crisis_label'] = df_crisis['crisis_label']
df_crisis_std['event_type'] = df_crisis['event_type']
df_crisis_std['event_name'] = df_crisis['event_name']
df_crisis_std['created_at'] = df_crisis['created_at']
df_crisis_std['informativeness'] = df_crisis['informativeness']

print(f"✓ Crisis standardized: {len(df_crisis_std):,} rows")
print(f"  Columns: {df_crisis_std.columns.tolist()}")
print(f"\nSample:")
display(df_crisis_std.head(3))

Standardizing crisis data...

✓ Crisis standardized: 66,748 rows
  Columns: ['text', 'emotion_label', 'emotion_name', 'source_dataset', 'crisis_label', 'event_type', 'event_name', 'created_at', 'informativeness']

Sample:


Unnamed: 0,text,emotion_label,emotion_name,source_dataset,crisis_label,event_type,event_name,created_at,informativeness
0,.@GreenABEnergy How can @AirworksCanada assist in the cleanup? #AlbertaStrong,,,humaid,1,wildfire,canada_wildfires_2016_dev,2016-05-19 18:16:11.727000+00:00,
1,RT @katvondawn: Thoughts &amp; prayers going to all those being affected by the wildfire in Cana...,,,humaid,1,wildfire,canada_wildfires_2016_dev,2016-05-09 03:58:37.448000+00:00,
2,Glacier Farm Media pledges $50K in support for Fort McMurray wildfire disaster relief.,,,humaid,1,wildfire,canada_wildfires_2016_dev,2016-05-12 12:41:05.044000+00:00,


## 6. Standardize Non-Crisis Data

Select and reorder non-crisis columns to match master schema.

In [7]:
print("Standardizing non-crisis data...\n")

# Create standardized non-crisis dataframe
df_non_crisis_std = pd.DataFrame()

df_non_crisis_std['text'] = df_non_crisis['text']
df_non_crisis_std['emotion_label'] = df_non_crisis['emotion_label']  # All NaN (unlabeled)
df_non_crisis_std['emotion_name'] = df_non_crisis['emotion_name']    # All NaN (unlabeled)
df_non_crisis_std['source_dataset'] = df_non_crisis['source_dataset']
df_non_crisis_std['crisis_label'] = df_non_crisis['crisis_label']
df_non_crisis_std['event_type'] = df_non_crisis['event_type']
df_non_crisis_std['event_name'] = df_non_crisis['event_name']
df_non_crisis_std['created_at'] = df_non_crisis['created_at']

# Non-crisis doesn't have informativeness
df_non_crisis_std['informativeness'] = ''

print(f"✓ Non-crisis standardized: {len(df_non_crisis_std):,} rows")
print(f"  Columns: {df_non_crisis_std.columns.tolist()}")
print(f"\nSample:")
display(df_non_crisis_std.head(3))

Standardizing non-crisis data...

✓ Non-crisis standardized: 1,533,696 rows
  Columns: ['text', 'emotion_label', 'emotion_name', 'source_dataset', 'crisis_label', 'event_type', 'event_name', 'created_at', 'informativeness']

Sample:


Unnamed: 0,text,emotion_label,emotion_name,source_dataset,crisis_label,event_type,event_name,created_at,informativeness
0,#Coachella2015 tickets selling out in less than 40 minutes _√ô_¬¶_√ô___√ô___√ô√∑¬ù_√ô√é¬µ_√ô√é¬µ_√ô___√ô_¬¶ http...,,,coachella,0,entertainment,coachella_2015,2015-01-07 15:02:00,
1,RT @sudsybuddy: WAIT THIS IS ABSOLUTE FIRE _√ô√ì¬¥_√ô√ì¬¥_√ô√ì¬¥ #Coachella2015 http://t.co/Ov2eCJtAvR,,,coachella,0,entertainment,coachella_2015,2015-01-07 15:02:00,
2,#Coachella2015 #VIP passes secured! See you there bitchesssss,,,coachella,0,entertainment,coachella_2015,2015-01-07 15:01:00,


## 7. Validate Schema Alignment

Ensure all three datasets have identical column structure before combining.

In [8]:
print("=" * 80)
print("SCHEMA VALIDATION")
print("=" * 80)

# Check column names
goemotions_cols = df_goemotions_std.columns.tolist()
crisis_cols = df_crisis_std.columns.tolist()
non_crisis_cols = df_non_crisis_std.columns.tolist()

print(f"\nGoEmotions columns: {goemotions_cols}")
print(f"Crisis columns:     {crisis_cols}")
print(f"Non-crisis columns: {non_crisis_cols}")

# Validate all match
if goemotions_cols == crisis_cols == non_crisis_cols:
    print("\n✅ All datasets have matching column structure!")
else:
    print("\n❌ Column mismatch detected!")
    print(f"\nDifferences:")
    if goemotions_cols != crisis_cols:
        print(f"  GoEmotions vs Crisis: {set(goemotions_cols) ^ set(crisis_cols)}")
    if crisis_cols != non_crisis_cols:
        print(f"  Crisis vs Non-crisis: {set(crisis_cols) ^ set(non_crisis_cols)}")

# Check if columns match master schema
if goemotions_cols == MASTER_COLUMNS:
    print("\n✅ Columns match master schema!")
else:
    print("\n⚠️  Columns differ from master schema (names or order)")

print(f"\n" + "=" * 80)

SCHEMA VALIDATION

GoEmotions columns: ['text', 'emotion_label', 'emotion_name', 'source_dataset', 'crisis_label', 'event_type', 'event_name', 'created_at', 'informativeness']
Crisis columns:     ['text', 'emotion_label', 'emotion_name', 'source_dataset', 'crisis_label', 'event_type', 'event_name', 'created_at', 'informativeness']
Non-crisis columns: ['text', 'emotion_label', 'emotion_name', 'source_dataset', 'crisis_label', 'event_type', 'event_name', 'created_at', 'informativeness']

✅ All datasets have matching column structure!

✅ Columns match master schema!
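
The check above only prints its result, so a mismatch would not stop the notebook from concatenating. A hedged sketch of a stricter variant that raises instead; `assert_same_schema` is a hypothetical helper, not part of the original notebook:

```python
import pandas as pd


def assert_same_schema(frames, expected_columns):
    """Raise ValueError if any frame's columns differ from expected_columns (hypothetical helper).

    frames: dict mapping a display name to a DataFrame.
    """
    for name, frame in frames.items():
        actual = frame.columns.tolist()
        if actual != expected_columns:
            missing = [c for c in expected_columns if c not in actual]
            extra = [c for c in actual if c not in expected_columns]
            # If nothing is missing or extra, the only difference is column order
            order_only = not missing and not extra
            raise ValueError(
                f"{name}: schema mismatch "
                f"(missing={missing}, extra={extra}, order_only={order_only})"
            )


# Made-up frames for illustration
expected = ["text", "emotion_label"]
good = pd.DataFrame({"text": ["a"], "emotion_label": [1]})
reordered = pd.DataFrame({"emotion_label": [1], "text": ["a"]})

assert_same_schema({"good": good}, expected)  # passes silently

try:
    assert_same_schema({"reordered": reordered}, expected)
except ValueError as err:
    print(err)
```

Unlike a set-based difference, comparing the lists directly also catches frames that have the right columns in the wrong order.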



## 8. Combine All Datasets

Concatenate all three standardized datasets into the master training file.

In [9]:
print("Combining datasets...\n")

# Concatenate all datasets
df_master = pd.concat([
    df_goemotions_std,
    df_crisis_std,
    df_non_crisis_std
], ignore_index=True)

print(f"✅ Combined master dataset created!")
print(f"\nTotal rows: {len(df_master):,}")
print(f"\nBreakdown:")
print(f"  GoEmotions:  {len(df_goemotions_std):,} ({len(df_goemotions_std)/len(df_master)*100:.1f}%)")
print(f"  Crisis:      {len(df_crisis_std):,} ({len(df_crisis_std)/len(df_master)*100:.1f}%)")
print(f"  Non-crisis:  {len(df_non_crisis_std):,} ({len(df_non_crisis_std)/len(df_master)*100:.1f}%)")

print(f"\nColumns: {df_master.columns.tolist()}")
print(f"\nMemory usage: {df_master.memory_usage(deep=True).sum() / (1024**2):.2f} MB")

Combining datasets...

✅ Combined master dataset created!

Total rows: 1,654,707

Breakdown:
  GoEmotions:  54,263 (3.3%)
  Crisis:      66,748 (4.0%)
  Non-crisis:  1,533,696 (92.7%)

Columns: ['text', 'emotion_label', 'emotion_name', 'source_dataset', 'crisis_label', 'event_type', 'event_name', 'created_at', 'informativeness']

Memory usage: 864.60 MB
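
Much of the reported memory is likely spent on repeated strings in low-cardinality columns such as `source_dataset` and `event_type`. A minimal sketch, on synthetic data, of converting such columns to the `category` dtype after concatenation; the actual savings on the real frame were not measured here.

```python
import pandas as pd

# Synthetic frame with two low-cardinality string columns (values borrowed from the notebook)
demo = pd.DataFrame(
    {
        "text": [f"tweet {i}" for i in range(10_000)],
        "source_dataset": ["game_of_thrones", "worldcup_2018"] * 5_000,
        "event_type": ["entertainment", "sports"] * 5_000,
    }
)

before_mb = demo.memory_usage(deep=True).sum() / (1024**2)

# Store each distinct string once and keep small integer codes per row
compact = demo.copy()
for col in ["source_dataset", "event_type"]:
    compact[col] = compact[col].astype("category")

after_mb = compact.memory_usage(deep=True).sum() / (1024**2)
print(f"Before: {before_mb:.2f} MB, after: {after_mb:.2f} MB")
```

The conversion is done after `pd.concat` on purpose: concatenating categorical columns whose category sets differ falls back to a plain object column.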


## 9. Data Quality Validation

In [10]:
print("=" * 80)
print("DATA QUALITY VALIDATION")
print("=" * 80)

# Check for nulls in critical columns
print(f"\nNull counts:")
print(df_master.isnull().sum())

# Check text column
null_text = df_master['text'].isna().sum()
empty_text = (df_master['text'] == '').sum()
print(f"\nText validation:")
print(f"  Null texts: {null_text}")
print(f"  Empty texts: {empty_text}")
if null_text == 0 and empty_text == 0:
    print(f"  ✅ All rows have text content")

# Check emotion labels
labeled_rows = df_master['emotion_label'].notna().sum()
unlabeled_rows = df_master['emotion_label'].isna().sum()
print(f"\nEmotion label status:")
print(f"  Labeled (GoEmotions):    {labeled_rows:,} ({labeled_rows/len(df_master)*100:.1f}%)")
print(f"  Unlabeled (Crisis+Non):  {unlabeled_rows:,} ({unlabeled_rows/len(df_master)*100:.1f}%)")

# Check crisis labels
crisis_rows = (df_master['crisis_label'] == 1).sum()
non_crisis_rows = (df_master['crisis_label'] == 0).sum()
unlabeled_crisis = df_master['crisis_label'].isna().sum()
print(f"\nCrisis label distribution:")
print(f"  Crisis (1):      {crisis_rows:,}")
print(f"  Non-crisis (0):  {non_crisis_rows:,}")
print(f"  Unlabeled (GoE): {unlabeled_crisis:,}")

# Check source distribution
print(f"\nSource dataset distribution:")
print(df_master['source_dataset'].value_counts())

print(f"\n" + "=" * 80)

DATA QUALITY VALIDATION

Null counts:
text                   537
emotion_label      1600444
emotion_name       1600444
source_dataset           0
crisis_label         54263
event_type               0
event_name               0
created_at               0
informativeness      43816
dtype: int64

Text validation:
  Null texts: 537
  Empty texts: 0

Emotion label status:
  Labeled (GoEmotions):    54,263 (3.3%)
  Unlabeled (Crisis+Non):  1,600,444 (96.7%)

Crisis label distribution:
  Crisis (1):      66,748
  Non-crisis (0):  1,533,696
  Unlabeled (GoE): 54,263

Source dataset distribution:
source_dataset
game_of_thrones    760614
worldcup_2018      458533
tokyo_olympics     159432
us_election         99948
GoEmotions          54263
fifa_worldcup       49493
humaid              43409
crisislex           23339
coachella            3846
music_concerts       1830
Name: count, dtype: int64
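
The validation reports 537 rows with a null `text`, and the notebook does not say how they are handled before training or prediction. One possible treatment, sketched on a made-up frame, is to drop rows whose text is missing or whitespace-only and to report which sources they came from:

```python
import numpy as np
import pandas as pd

# Made-up rows: one null text and one whitespace-only text
demo = pd.DataFrame(
    {
        "text": ["real tweet", np.nan, "   ", "another tweet"],
        "source_dataset": ["humaid", "crisislex", "coachella", "humaid"],
    }
)

# A row is usable if its text is present and not just whitespace
text = demo["text"]
has_text = text.notna() & (text.fillna("").str.strip() != "")

# Show where the unusable rows come from before removing them
print(demo.loc[~has_text, "source_dataset"].value_counts())

cleaned = demo[has_text].reset_index(drop=True)
print(f"Kept {len(cleaned)} of {len(demo)} rows")
```

Whether dropping is appropriate is an assumption; if row counts must stay aligned with the source files, flagging the rows in a separate column may be preferable.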



## 10. Show Sample Data from Each Source

In [11]:
print("Sample rows from each source:\n")

print("GoEmotions sample (with emotion labels):")
display(df_master[df_master['source_dataset'] == 'GoEmotions'][['text', 'emotion_label', 'emotion_name', 'source_dataset']].head(3))

print("\nCrisis sample (emotion labels = NULL):")
crisis_sample = df_master[df_master['crisis_label'] == 1][['text', 'emotion_label', 'emotion_name', 'event_name', 'crisis_label']].head(3)
display(crisis_sample)

print("\nNon-crisis sample (emotion labels = NULL):")
non_crisis_sample = df_master[df_master['crisis_label'] == 0][['text', 'emotion_label', 'emotion_name', 'event_name', 'crisis_label']].head(3)
display(non_crisis_sample)

Sample rows from each source:

GoEmotions sample (with emotion labels):


Unnamed: 0,text,emotion_label,emotion_name,source_dataset
0,My favourite food is anything I didn't have to cook myself.,13.0,neutral,GoEmotions
1,"Now if he does off himself, everyone will think hes having a laugh screwing with people instead ...",13.0,neutral,GoEmotions
2,WHY THE FUCK IS BAYLESS ISOING,2.0,anger,GoEmotions



Crisis sample (emotion labels = NULL):


Unnamed: 0,text,emotion_label,emotion_name,event_name,crisis_label
54263,.@GreenABEnergy How can @AirworksCanada assist in the cleanup? #AlbertaStrong,,,canada_wildfires_2016_dev,1.0
54264,RT @katvondawn: Thoughts &amp; prayers going to all those being affected by the wildfire in Cana...,,,canada_wildfires_2016_dev,1.0
54265,Glacier Farm Media pledges $50K in support for Fort McMurray wildfire disaster relief.,,,canada_wildfires_2016_dev,1.0



Non-crisis sample (emotion labels = NULL):


Unnamed: 0,text,emotion_label,emotion_name,event_name,crisis_label
121011,#Coachella2015 tickets selling out in less than 40 minutes _√ô_¬¶_√ô___√ô___√ô√∑¬ù_√ô√é¬µ_√ô√é¬µ_√ô___√ô_¬¶ http...,,,coachella_2015,0.0
121012,RT @sudsybuddy: WAIT THIS IS ABSOLUTE FIRE _√ô√ì¬¥_√ô√ì¬¥_√ô√ì¬¥ #Coachella2015 http://t.co/Ov2eCJtAvR,,,coachella_2015,0.0
121013,#Coachella2015 #VIP passes secured! See you there bitchesssss,,,coachella_2015,0.0
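
In these samples `emotion_label` and `crisis_label` display as `13.0` and `1.0`: concatenating an integer column with NaN values upcasts it to `float64`. A minimal sketch, on made-up rows, of casting such a column to pandas' nullable `Int64` dtype so labels stay integers while unlabeled rows remain missing:

```python
import numpy as np
import pandas as pd

# Made-up labeled and unlabeled frames
labeled = pd.DataFrame({"text": ["x"], "emotion_label": [13]})
unlabeled = pd.DataFrame({"text": ["y"], "emotion_label": [np.nan]})

combined = pd.concat([labeled, unlabeled], ignore_index=True)
dtype_after_concat = combined["emotion_label"].dtype  # upcast to float because of NaN

# Nullable integer dtype: whole-number floats become ints, NaN becomes <NA>
combined["emotion_label"] = combined["emotion_label"].astype("Int64")

print(dtype_after_concat, "->", combined["emotion_label"].dtype)
print(combined)
```

With `Int64`, the column is written to CSV as `13` rather than `13.0`, which may matter if downstream code maps label IDs by string.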


## 11. Save Master Training Dataset

In [12]:
# Save to master_training_data folder
output_path = 'master_training_data/master_training_data_v2.csv'

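print(f"Saving master training dataset to {output_path}...\n")
# Create the output folder if it does not exist yet, so to_csv cannot fail on a missing directory
Path(output_path).parent.mkdir(parents=True, exist_ok=True)
df_master.to_csv(output_path, index=False)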

file_size = Path(output_path).stat().st_size / (1024**2)

print("=" * 80)
print("MASTER DATASET SAVED")
print("=" * 80)
print(f"\n✅ Saved to: {output_path}")
print(f"\nFile size: {file_size:.2f} MB")
print(f"Total rows: {len(df_master):,}")
print(f"Total columns: {len(df_master.columns)}")
print(f"\nColumns: {df_master.columns.tolist()}")
print(f"\n" + "=" * 80)

Saving master training dataset to master_training_data/master_training_data_v2.csv...

MASTER DATASET SAVED

✅ Saved to: master_training_data/master_training_data_v2.csv

File size: 289.75 MB
Total rows: 1,654,707
Total columns: 9

Columns: ['text', 'emotion_label', 'emotion_name', 'source_dataset', 'crisis_label', 'event_type', 'event_name', 'created_at', 'informativeness']
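
A saved file of this size can be sanity-checked without loading it fully. The sketch below re-reads a CSV in chunks and counts rows; `count_rows_in_chunks` is a hypothetical helper and the temporary file only stands in for the real output path.

```python
import tempfile
from pathlib import Path

import pandas as pd


def count_rows_in_chunks(csv_path, chunksize=100_000):
    """Count data rows in a CSV by streaming it in chunks (hypothetical helper)."""
    total = 0
    # Reading a single column keeps each chunk small
    for chunk in pd.read_csv(csv_path, usecols=["text"], chunksize=chunksize):
        total += len(chunk)
    return total


# Demonstration on a small temporary file instead of the real master file
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo_master.csv"
    demo = pd.DataFrame(
        {"text": [f"row {i}" for i in range(250)], "source_dataset": ["demo"] * 250}
    )
    demo.to_csv(demo_path, index=False)
    counted = count_rows_in_chunks(demo_path, chunksize=100)

print(f"Rows written: {len(demo)}, rows read back: {counted}")
```

Comparing the count against `len(df_master)` would catch truncated writes or rows split by unescaped newlines.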



## 12. Create Smaller Sample File for Testing

In [13]:
# Create 10K sample for quick testing
sample_size = 10000
df_sample = df_master.sample(n=sample_size, random_state=42)

sample_path = 'master_training_data/master_training_sample_10k.csv'
df_sample.to_csv(sample_path, index=False)

print(f"✅ Created sample file: {sample_path}")
print(f"   Rows: {len(df_sample):,}")
print(f"   Size: {Path(sample_path).stat().st_size / (1024**2):.2f} MB")

✅ Created sample file: master_training_data/master_training_sample_10k.csv
   Rows: 10,000
   Size: 1.75 MB
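
A uniform random sample mirrors the full dataset's imbalance: with 1,830 of 1,654,707 rows, `music_concerts` would contribute only about 11 rows to a 10K sample on average. If the test file should exercise every source, a per-source cap is one option. A sketch on synthetic data, with `sample_per_source` as a hypothetical helper:

```python
import pandas as pd


def sample_per_source(df, per_source, seed=42):
    """Draw up to `per_source` rows from each source_dataset value (hypothetical helper)."""
    parts = [
        # Small sources are kept whole instead of raising on an oversized n
        group.sample(n=min(len(group), per_source), random_state=seed)
        for _, group in df.groupby("source_dataset")
    ]
    return pd.concat(parts, ignore_index=True)


# Synthetic frame: one large source and one small source
demo = pd.DataFrame(
    {
        "text": [f"row {i}" for i in range(1_050)],
        "source_dataset": ["game_of_thrones"] * 1_000 + ["music_concerts"] * 50,
    }
)

balanced = sample_per_source(demo, per_source=100)
print(balanced["source_dataset"].value_counts())
```

The trade-off is that such a sample no longer reflects the true source proportions, so it suits smoke tests better than any statistics.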


## 13. Final Summary & Statistics

In [14]:
print("=" * 80)
print("FINAL SUMMARY")
print("=" * 80)

print(f"\n📊 Dataset Composition:")
print(f"   Total rows:          {len(df_master):,}")
print(f"   GoEmotions:          {len(df_goemotions_std):,} (with emotion labels)")
print(f"   Crisis events:       {len(df_crisis_std):,} (emotion labels = NULL)")
print(f"   Non-crisis events:   {len(df_non_crisis_std):,} (emotion labels = NULL)")

print(f"\n📁 Files Created:")
print(f"   Main:   master_training_data/master_training_data_v2.csv ({file_size:.2f} MB)")
print(f"   Sample: master_training_data/master_training_sample_10k.csv")

print(f"\n🏷️ Emotion Labels:")
print(f"   Labeled rows:    {labeled_rows:,} (GoEmotions - for training)")
print(f"   Unlabeled rows:  {unlabeled_rows:,} (Crisis + Non-crisis - for prediction)")

print(f"\n🔧 Schema:")
print(f"   Columns: {len(df_master.columns)}")
for i, col in enumerate(df_master.columns, 1):
    print(f"      {i}. {col}")

print(f"\n📋 Next Steps:")
print(f"   1. Train BERT on GoEmotions data (54K labeled rows)")
print(f"   2. Use trained BERT to predict emotions for Crisis/Non-crisis data")
print(f"   3. Fill emotion_label and emotion_name for unlabeled rows")
print(f"   4. Analyze fear/anxiety patterns in crisis vs non-crisis events")
print(f"   5. Build temporal analysis for crisis detection")

print(f"\n" + "=" * 80)
print("✅ PHASE 4 COMPLETE!")
print("=" * 80)

FINAL SUMMARY

📊 Dataset Composition:
   Total rows:          1,654,707
   GoEmotions:          54,263 (with emotion labels)
   Crisis events:       66,748 (emotion labels = NULL)
   Non-crisis events:   1,533,696 (emotion labels = NULL)

📁 Files Created:
   Main:   master_training_data/master_training_data_v2.csv (289.75 MB)
   Sample: master_training_data/master_training_sample_10k.csv

🏷️ Emotion Labels:
   Labeled rows:    54,263 (GoEmotions - for training)
   Unlabeled rows:  1,600,444 (Crisis + Non-crisis - for prediction)

🔧 Schema:
   Columns: 9
      1. text
      2. emotion_label
      3. emotion_name
      4. source_dataset
      5. crisis_label
      6. event_type
      7. event_name
      8. created_at
      9. informativeness

📋 Next Steps:
   1. Train BERT on GoEmotions data (54K labeled rows)
   2. Use trained BERT to predict emotions for Crisis/Non-crisis data
   3. Fill emotion_label and emotion_name for unlabeled rows
   4. Analyze fear/anxiety patte