# Add Emotion Columns to Standardized Data
## Prepare Crisis & Non-Crisis Data for BERT Training

This notebook:
1. Loads `crisis_combined.csv` and `non_crisis_combined.csv`
2. Adds an `emotion_label` column (all `NaN`; BERT will predict the values)
3. Adds an `emotion_name` column (all empty strings; BERT will predict the values)
4. Checks that the schema is compatible with the GoEmotions format
5. Saves updated datasets
6. Shows before/after comparison

### Why NULL Emotions?
- GoEmotions data has labeled emotions (for training BERT)
- Crisis/non-crisis data needs emotion **prediction** by BERT
- We add empty columns now so all datasets have the same schema
- BERT will fill these in during inference later

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

## 1. Load Crisis Data

In [2]:
print("Loading crisis_combined.csv...")
crisis_path = 'standardized_data/crisis_combined.csv'
df_crisis = pd.read_csv(crisis_path)

print(f"\n✅ Loaded {len(df_crisis):,} crisis rows")
print(f"\nCurrent columns: {df_crisis.columns.tolist()}")
print(f"\nData types:")
print(df_crisis.dtypes)
print(f"\nFirst 5 rows:")
display(df_crisis.head())

Loading crisis_combined.csv...

✅ Loaded 66,748 crisis rows

Current columns: ['text', 'created_at', 'event_name', 'event_type', 'crisis_label', 'source_dataset', 'informativeness']

Data types:
text               object
created_at         object
event_name         object
event_type         object
crisis_label        int64
source_dataset     object
informativeness    object
dtype: object

First 5 rows:


| | text | created_at | event_name | event_type | crisis_label | source_dataset | informativeness |
|---|---|---|---|---|---|---|---|
| 0 | .@GreenABEnergy How can @AirworksCanada assist in the cleanup? #AlbertaStrong | 2016-05-19 18:16:11.727000+00:00 | canada_wildfires_2016_dev | wildfire | 1 | humaid | |
| 1 | RT @katvondawn: Thoughts &amp; prayers going to all those being affected by the wildfire in Cana... | 2016-05-09 03:58:37.448000+00:00 | canada_wildfires_2016_dev | wildfire | 1 | humaid | |
| 2 | Glacier Farm Media pledges $50K in support for Fort McMurray wildfire disaster relief. | 2016-05-12 12:41:05.044000+00:00 | canada_wildfires_2016_dev | wildfire | 1 | humaid | |
| 3 | Beatton Airport Road wildfire in northern B.C. leaves a patchwork of damage - #VernonNews | 2016-05-20 16:00:33.863000+00:00 | canada_wildfires_2016_dev | wildfire | 1 | humaid | |
| 4 | RT @dana_balsor: @InsuranceBureau will Insur. professionals be entering our homes without us pre... | 2016-05-18 21:27:01.688000+00:00 | canada_wildfires_2016_dev | wildfire | 1 | humaid | |


## 2. Load Non-Crisis Data

In [3]:
print("Loading non_crisis_combined.csv...")
non_crisis_path = 'standardized_data/non_crisis_combined.csv'
df_non_crisis = pd.read_csv(non_crisis_path)

print(f"\n✅ Loaded {len(df_non_crisis):,} non-crisis rows")
print(f"\nCurrent columns: {df_non_crisis.columns.tolist()}")
print(f"\nData types:")
print(df_non_crisis.dtypes)
print(f"\nFirst 5 rows:")
display(df_non_crisis.head())

Loading non_crisis_combined.csv...

✅ Loaded 1,533,696 non-crisis rows

Current columns: ['text', 'created_at', 'event_name', 'event_type', 'crisis_label', 'source_dataset']

Data types:
text              object
created_at        object
event_name        object
event_type        object
crisis_label       int64
source_dataset    object
dtype: object

First 5 rows:


| | text | created_at | event_name | event_type | crisis_label | source_dataset |
|---|---|---|---|---|---|---|
| 0 | #Coachella2015 tickets selling out in less than 40 minutes _√ô_¬¶_√ô___√ô___√ô√∑¬ù_√ô√é¬µ_√ô√é¬µ_√ô___√ô_¬¶ http... | 2015-01-07 15:02:00 | coachella_2015 | entertainment | 0 | coachella |
| 1 | RT @sudsybuddy: WAIT THIS IS ABSOLUTE FIRE _√ô√ì¬¥_√ô√ì¬¥_√ô√ì¬¥ #Coachella2015 http://t.co/Ov2eCJtAvR | 2015-01-07 15:02:00 | coachella_2015 | entertainment | 0 | coachella |
| 2 | #Coachella2015 #VIP passes secured! See you there bitchesssss | 2015-01-07 15:01:00 | coachella_2015 | entertainment | 0 | coachella |
| 3 | Philly¬â√õ¬™s @warondrugsjams will play #Coachella2015 &amp; #GovBall2015! Watch them on Jimmy Fall... | 2015-01-07 15:01:00 | coachella_2015 | entertainment | 0 | coachella |
| 4 | If briana and her mom out to #Coachella2015 im out with them !!! _√ô√∑√ù_√ô√∑√ù_√ô√∑√ù_√ô√ï√Ñ | 2015-01-07 15:00:00 | coachella_2015 | entertainment | 0 | coachella |


## 3. Add Emotion Columns to Crisis Data

### Strategy:
- `emotion_label`: Set to `NaN` (the pandas missing value), which makes the column `float64`
- `emotion_name`: Set to the empty string `''`
- BERT will predict these values later during inference

In [4]:
print("Adding emotion columns to crisis data...")

# Add emotion_label column (numeric, set to NaN)
df_crisis['emotion_label'] = np.nan

# Add emotion_name column (string, set to empty string)
df_crisis['emotion_name'] = ''

print("\n✅ Added emotion columns to crisis data")
print(f"\nNew columns: {df_crisis.columns.tolist()}")
print(f"\nEmotion columns:")
print(f"  - emotion_label: {df_crisis['emotion_label'].dtype} (all NaN: {df_crisis['emotion_label'].isna().all()})")
print(f"  - emotion_name: {df_crisis['emotion_name'].dtype} (all empty: {(df_crisis['emotion_name'] == '').all()})")

print(f"\nUpdated crisis data preview:")
display(df_crisis[['text', 'event_name', 'crisis_label', 'emotion_label', 'emotion_name']].head())

Adding emotion columns to crisis data...

✅ Added emotion columns to crisis data

New columns: ['text', 'created_at', 'event_name', 'event_type', 'crisis_label', 'source_dataset', 'informativeness', 'emotion_label', 'emotion_name']

Emotion columns:
  - emotion_label: float64 (all NaN: True)
  - emotion_name: object (all empty: True)

Updated crisis data preview:


| | text | event_name | crisis_label | emotion_label | emotion_name |
|---|---|---|---|---|---|
| 0 | .@GreenABEnergy How can @AirworksCanada assist in the cleanup? #AlbertaStrong | canada_wildfires_2016_dev | 1 | | |
| 1 | RT @katvondawn: Thoughts &amp; prayers going to all those being affected by the wildfire in Cana... | canada_wildfires_2016_dev | 1 | | |
| 2 | Glacier Farm Media pledges $50K in support for Fort McMurray wildfire disaster relief. | canada_wildfires_2016_dev | 1 | | |
| 3 | Beatton Airport Road wildfire in northern B.C. leaves a patchwork of damage - #VernonNews | canada_wildfires_2016_dev | 1 | | |
| 4 | RT @dana_balsor: @InsuranceBureau will Insur. professionals be entering our homes without us pre... | canada_wildfires_2016_dev | 1 | | |


## 4. Add Emotion Columns to Non-Crisis Data

In [5]:
print("Adding emotion columns to non-crisis data...")

# Add emotion_label column (numeric, set to NaN)
df_non_crisis['emotion_label'] = np.nan

# Add emotion_name column (string, set to empty string)
df_non_crisis['emotion_name'] = ''

print("\n✅ Added emotion columns to non-crisis data")
print(f"\nNew columns: {df_non_crisis.columns.tolist()}")
print(f"\nEmotion columns:")
print(f"  - emotion_label: {df_non_crisis['emotion_label'].dtype} (all NaN: {df_non_crisis['emotion_label'].isna().all()})")
print(f"  - emotion_name: {df_non_crisis['emotion_name'].dtype} (all empty: {(df_non_crisis['emotion_name'] == '').all()})")

print(f"\nUpdated non-crisis data preview:")
display(df_non_crisis[['text', 'event_name', 'crisis_label', 'emotion_label', 'emotion_name']].head())

Adding emotion columns to non-crisis data...

✅ Added emotion columns to non-crisis data

New columns: ['text', 'created_at', 'event_name', 'event_type', 'crisis_label', 'source_dataset', 'emotion_label', 'emotion_name']

Emotion columns:
  - emotion_label: float64 (all NaN: True)
  - emotion_name: object (all empty: True)

Updated non-crisis data preview:


| | text | event_name | crisis_label | emotion_label | emotion_name |
|---|---|---|---|---|---|
| 0 | #Coachella2015 tickets selling out in less than 40 minutes _√ô_¬¶_√ô___√ô___√ô√∑¬ù_√ô√é¬µ_√ô√é¬µ_√ô___√ô_¬¶ http... | coachella_2015 | 0 | | |
| 1 | RT @sudsybuddy: WAIT THIS IS ABSOLUTE FIRE _√ô√ì¬¥_√ô√ì¬¥_√ô√ì¬¥ #Coachella2015 http://t.co/Ov2eCJtAvR | coachella_2015 | 0 | | |
| 2 | #Coachella2015 #VIP passes secured! See you there bitchesssss | coachella_2015 | 0 | | |
| 3 | Philly¬â√õ¬™s @warondrugsjams will play #Coachella2015 &amp; #GovBall2015! Watch them on Jimmy Fall... | coachella_2015 | 0 | | |
| 4 | If briana and her mom out to #Coachella2015 im out with them !!! _√ô√∑√ù_√ô√∑√ù_√ô√∑√ù_√ô√ï√Ñ | coachella_2015 | 0 | | |
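
Sections 3 and 4 apply the same two assignments to different DataFrames. As a minimal sketch, they could be factored into one helper. The function name `add_emotion_placeholders` and the toy frames below are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def add_emotion_placeholders(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with empty emotion_label / emotion_name columns."""
    out = df.copy()
    out["emotion_label"] = np.nan  # numeric placeholder; the column becomes float64
    out["emotion_name"] = ""       # string placeholder
    return out


# Toy frames standing in for the real crisis / non-crisis data (hypothetical).
toy_crisis = add_emotion_placeholders(
    pd.DataFrame({"text": ["wildfire update"], "crisis_label": [1]})
)
toy_non_crisis = add_emotion_placeholders(
    pd.DataFrame({"text": ["festival tickets"], "crisis_label": [0]})
)

print(toy_crisis.columns.tolist())
print(toy_non_crisis["emotion_label"].isna().all())
```

Returning a copy keeps the input frame unchanged, which may be convenient when the original is still needed for a before/after comparison.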


## 5. Validate Schema Compatibility

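Check whether crisis and non-crisis data have matching columns for the Phase 4 combination. In this run the crisis data carries one extra column, `informativeness`, so the check reports a mismatch.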

In [6]:
print("=" * 80)
print("SCHEMA VALIDATION")
print("=" * 80)

# Check column names match
crisis_cols = set(df_crisis.columns)
non_crisis_cols = set(df_non_crisis.columns)

print(f"\nCrisis columns: {sorted(df_crisis.columns.tolist())}")
print(f"\nNon-crisis columns: {sorted(df_non_crisis.columns.tolist())}")

if crisis_cols == non_crisis_cols:
    print("\n‚úÖ Columns match perfectly!")
else:
    only_crisis = crisis_cols - non_crisis_cols
    only_non_crisis = non_crisis_cols - crisis_cols
    print(f"\n‚ö†Ô∏è  Column mismatch:")
    if only_crisis:
        print(f"   Only in crisis: {only_crisis}")
    if only_non_crisis:
        print(f"   Only in non-crisis: {only_non_crisis}")

# Check both have emotion columns
required_cols = ['emotion_label', 'emotion_name']
crisis_has_emotion = all(col in df_crisis.columns for col in required_cols)
non_crisis_has_emotion = all(col in df_non_crisis.columns for col in required_cols)

if crisis_has_emotion and non_crisis_has_emotion:
    print(f"\n✅ Both datasets have emotion columns: {required_cols}")
else:
    print(f"\n❌ Missing emotion columns!")

# Summary
print(f"\n" + "=" * 80)
print("SUMMARY")
print("=" * 80)
print(f"\nCrisis data:")
print(f"   Rows: {len(df_crisis):,}")
print(f"   Columns: {len(df_crisis.columns)}")
print(f"   Emotion labels (all NaN): {df_crisis['emotion_label'].isna().all()}")

print(f"\nNon-crisis data:")
print(f"   Rows: {len(df_non_crisis):,}")
print(f"   Columns: {len(df_non_crisis.columns)}")
print(f"   Emotion labels (all NaN): {df_non_crisis['emotion_label'].isna().all()}")

SCHEMA VALIDATION

Crisis columns: ['created_at', 'crisis_label', 'emotion_label', 'emotion_name', 'event_name', 'event_type', 'informativeness', 'source_dataset', 'text']

Non-crisis columns: ['created_at', 'crisis_label', 'emotion_label', 'emotion_name', 'event_name', 'event_type', 'source_dataset', 'text']

⚠️  Column mismatch:
   Only in crisis: {'informativeness'}

✅ Both datasets have emotion columns: ['emotion_label', 'emotion_name']

SUMMARY

Crisis data:
   Rows: 66,748
   Columns: 9
   Emotion labels (all NaN): True

Non-crisis data:
   Rows: 1,533,696
   Columns: 8
   Emotion labels (all NaN): True
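
The validation above reports that `informativeness` exists only in the crisis data. One possible way to align the two schemas before combining them is `DataFrame.reindex`, which adds any missing column filled with `NaN`. This is a sketch on toy data: using the crisis column order as the shared schema and the sample value `"informative"` are assumptions, not something the notebook does.

```python
import pandas as pd

# Toy frames with the same kind of mismatch as the real data (hypothetical values).
toy_crisis = pd.DataFrame(
    {"text": ["a"], "crisis_label": [1], "informativeness": ["informative"]}
)
toy_non_crisis = pd.DataFrame({"text": ["b"], "crisis_label": [0]})

# Assumption: the crisis column order serves as the shared schema.
shared_columns = toy_crisis.columns.tolist()

# reindex adds the missing 'informativeness' column, filled with NaN.
aligned_non_crisis = toy_non_crisis.reindex(columns=shared_columns)

print(aligned_non_crisis.columns.tolist() == shared_columns)
print(aligned_non_crisis["informativeness"].isna().all())
```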


## 6. Compare with GoEmotions Schema

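Load a 100-row sample of GoEmotions and compare its emotion column types with ours. Note that `emotion_label` is `int64` in GoEmotions but `float64` in the crisis and non-crisis data, because `NaN` cannot be stored in an `int64` column.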

In [7]:
print("Loading GoEmotions for schema comparison...")
df_goemotions = pd.read_csv('goemotion_data/goemotions_with_13_emotions.csv', nrows=100)

print(f"\nGoEmotions columns: {df_goemotions.columns.tolist()}")
print(f"\nGoEmotions emotion columns:")
print(f"  - emotion_label: {df_goemotions['emotion_label'].dtype}")
print(f"  - emotion_name: {df_goemotions['emotion_name'].dtype}")

print(f"\nGoEmotions sample:")
display(df_goemotions[['text', 'emotion_label', 'emotion_name']].head())

# Check if emotion column types match
print(f"\n" + "=" * 80)
print("TYPE COMPATIBILITY CHECK")
print("=" * 80)

print(f"\nemotion_label types:")
print(f"  GoEmotions:  {df_goemotions['emotion_label'].dtype}")
print(f"  Crisis:      {df_crisis['emotion_label'].dtype}")
print(f"  Non-crisis:  {df_non_crisis['emotion_label'].dtype}")

print(f"\nemotion_name types:")
print(f"  GoEmotions:  {df_goemotions['emotion_name'].dtype}")
print(f"  Crisis:      {df_crisis['emotion_name'].dtype}")
print(f"  Non-crisis:  {df_non_crisis['emotion_name'].dtype}")

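# emotion_label is int64 in GoEmotions but float64 here (NaN forces float64),
# so check that all three are at least numeric before declaring compatibility.
label_frames = (df_goemotions, df_crisis, df_non_crisis)
if all(pd.api.types.is_numeric_dtype(frame['emotion_label']) for frame in label_frames):
    print(f"\n✅ Schema compatible for Phase 4 combination!")
else:
    print(f"\n❌ emotion_label is not numeric in all datasets!")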

Loading GoEmotions for schema comparison...

GoEmotions columns: ['text', 'emotion_label', 'emotion_name', 'id', 'labels']

GoEmotions emotion columns:
  - emotion_label: int64
  - emotion_name: object

GoEmotions sample:


| | text | emotion_label | emotion_name |
|---|---|---|---|
| 0 | My favourite food is anything I didn't have to cook myself. | 13 | neutral |
| 1 | Now if he does off himself, everyone will think hes having a laugh screwing with people instead ... | 13 | neutral |
| 2 | WHY THE FUCK IS BAYLESS ISOING | 2 | anger |
| 3 | To make her feel threatened | 1 | fear |
| 4 | Dirty Southern Wankers | 2 | anger |



TYPE COMPATIBILITY CHECK

emotion_label types:
  GoEmotions:  int64
  Crisis:      float64
  Non-crisis:  float64

emotion_name types:
  GoEmotions:  object
  Crisis:      object
  Non-crisis:  object

✅ Schema compatible for Phase 4 combination!
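
The type check shows `int64` labels in GoEmotions against `float64` placeholders in the crisis and non-crisis data. If integer labels should survive the combination, pandas' nullable `Int64` dtype may be an option, since it can hold integers and missing values together. The sketch below uses toy series rather than the notebook's data, and whether Phase 4 should adopt this dtype is an assumption.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: labeled values as in GoEmotions, NaN placeholders as in this notebook.
labeled = pd.Series([13, 2, 1], name="emotion_label")          # int64
unlabeled = pd.Series([np.nan, np.nan], name="emotion_label")  # float64

# Cast both to the nullable integer dtype before concatenating.
combined = pd.concat(
    [labeled.astype("Int64"), unlabeled.astype("Int64")],
    ignore_index=True,
)

print(combined.dtype)
print(int(combined.isna().sum()))
```

With this dtype the missing entries appear as `<NA>` and the known labels stay integers instead of becoming `13.0`, `2.0`, `1.0`.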


## 7. Save Updated Datasets

Save with a `_with_emotions` suffix so the original files are preserved.

In [8]:
# Define output paths
crisis_output = 'standardized_data/crisis_combined_with_emotions.csv'
non_crisis_output = 'standardized_data/non_crisis_combined_with_emotions.csv'

# Save crisis data
print(f"Saving crisis data to {crisis_output}...")
df_crisis.to_csv(crisis_output, index=False)
crisis_size = Path(crisis_output).stat().st_size / (1024*1024)
print(f"✅ Saved {len(df_crisis):,} rows ({crisis_size:.2f} MB)")

# Save non-crisis data
print(f"\nSaving non-crisis data to {non_crisis_output}...")
df_non_crisis.to_csv(non_crisis_output, index=False)
non_crisis_size = Path(non_crisis_output).stat().st_size / (1024*1024)
print(f"✅ Saved {len(df_non_crisis):,} rows ({non_crisis_size:.2f} MB)")

print(f"\n" + "=" * 80)
print("FILES SAVED")
print("=" * 80)
print(f"\nCrisis:     {crisis_output} ({crisis_size:.2f} MB)")
print(f"Non-crisis: {non_crisis_output} ({non_crisis_size:.2f} MB)")
print(f"\nTotal rows: {len(df_crisis) + len(df_non_crisis):,}")

Saving crisis data to standardized_data/crisis_combined_with_emotions.csv...
✅ Saved 66,748 rows (13.04 MB)

Saving non-crisis data to standardized_data/non_crisis_combined_with_emotions.csv...
✅ Saved 1,533,696 rows (268.60 MB)

FILES SAVED

Crisis:     standardized_data/crisis_combined_with_emotions.csv (13.04 MB)
Non-crisis: standardized_data/non_crisis_combined_with_emotions.csv (268.60 MB)

Total rows: 1,600,444
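
One detail to keep in mind when these files are read back: `to_csv` writes both `NaN` and `''` as empty fields, and `read_csv` parses empty fields as `NaN` by default, so the empty-string `emotion_name` column comes back as all `NaN`. The sketch below shows this on a toy frame with an in-memory buffer instead of the real files. Restoring the empty strings with `fillna("")` is one possible choice, not something the notebook prescribes.

```python
import io

import numpy as np
import pandas as pd

# Toy frame mirroring the placeholder columns (hypothetical rows).
toy = pd.DataFrame(
    {
        "text": ["a", "b"],
        "emotion_label": [np.nan, np.nan],
        "emotion_name": ["", ""],
    }
)

# Round-trip through CSV in memory.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Empty strings were written as empty fields and are read back as NaN.
print(reloaded["emotion_name"].isna().all())

# Restore the empty-string placeholder if downstream code expects strings.
restored_names = reloaded["emotion_name"].fillna("")
print((restored_names == "").all())
```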


## 8. Before/After Comparison

In [9]:
print("=" * 80)
print("BEFORE/AFTER COMPARISON")
print("=" * 80)

# Load original files for comparison
df_crisis_old = pd.read_csv('standardized_data/crisis_combined.csv', nrows=5)
df_non_crisis_old = pd.read_csv('standardized_data/non_crisis_combined.csv', nrows=5)

print(f"\nCRISIS DATA:")
print(f"\nBEFORE - Columns ({len(df_crisis_old.columns)}):")
print(df_crisis_old.columns.tolist())

print(f"\nAFTER - Columns ({len(df_crisis.columns)}):")
print(df_crisis.columns.tolist())

print(f"\n" + "-" * 80)

print(f"\nNON-CRISIS DATA:")
print(f"\nBEFORE - Columns ({len(df_non_crisis_old.columns)}):")
print(df_non_crisis_old.columns.tolist())

print(f"\nAFTER - Columns ({len(df_non_crisis.columns)}):")
print(df_non_crisis.columns.tolist())

print(f"\n" + "=" * 80)
print("✅ EMOTION COLUMNS ADDED SUCCESSFULLY!")
print("=" * 80)

BEFORE/AFTER COMPARISON

CRISIS DATA:

BEFORE - Columns (7):
['text', 'created_at', 'event_name', 'event_type', 'crisis_label', 'source_dataset', 'informativeness']

AFTER - Columns (9):
['text', 'created_at', 'event_name', 'event_type', 'crisis_label', 'source_dataset', 'informativeness', 'emotion_label', 'emotion_name']

--------------------------------------------------------------------------------

NON-CRISIS DATA:

BEFORE - Columns (6):
['text', 'created_at', 'event_name', 'event_type', 'crisis_label', 'source_dataset']

AFTER - Columns (8):
['text', 'created_at', 'event_name', 'event_type', 'crisis_label', 'source_dataset', 'emotion_label', 'emotion_name']

✅ EMOTION COLUMNS ADDED SUCCESSFULLY!


## 9. Final Summary

In [10]:
print("=" * 80)
print("FINAL SUMMARY")
print("=" * 80)

print(f"\n✅ Task Complete!")

print(f"\n📊 Datasets Updated:")
print(f"   Crisis:      {len(df_crisis):,} rows with emotion columns")
print(f"   Non-crisis:  {len(df_non_crisis):,} rows with emotion columns")
print(f"   Total:       {len(df_crisis) + len(df_non_crisis):,} rows ready for Phase 4")

print(f"\n📁 Files Created:")
print(f"   ✓ standardized_data/crisis_combined_with_emotions.csv")
print(f"   ✓ standardized_data/non_crisis_combined_with_emotions.csv")

print(f"\n🔧 Columns Added:")
print(f"   ✓ emotion_label (float64, all NaN - for BERT prediction)")
print(f"   ✓ emotion_name (object, all empty - for BERT prediction)")

print(f"\n📋 Next Steps:")
print(f"   1. Create Phase 4 notebook to combine all datasets")
print(f"   2. Merge GoEmotions + Crisis + Non-crisis data")
print(f"   3. Create master_training_data.csv")
print(f"   4. Train BERT on 13-emotion classification")
print(f"   5. Use trained BERT to predict emotions for crisis/non-crisis data")

print(f"\n" + "=" * 80)

FINAL SUMMARY

✅ Task Complete!

📊 Datasets Updated:
   Crisis:      66,748 rows with emotion columns
   Non-crisis:  1,533,696 rows with emotion columns
   Total:       1,600,444 rows ready for Phase 4

📁 Files Created:
   ✓ standardized_data/crisis_combined_with_emotions.csv
   ✓ standardized_data/non_crisis_combined_with_emotions.csv

🔧 Columns Added:
   ✓ emotion_label (float64, all NaN - for BERT prediction)
   ✓ emotion_name (object, all empty - for BERT prediction)

📋 Next Steps:
   1. Create Phase 4 notebook to combine all datasets
   2. Merge GoEmotions + Crisis + Non-crisis data
   3. Create master_training_data.csv
   4. Train BERT on 13-emotion classification
   5. Use trained BERT to predict emotions for crisis/non-crisis data
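
As a rough illustration of step 5, the placeholder columns could later be filled from model output. The sketch below assumes predictions arrive as a list of integer label ids and builds the id-to-name lookup from labeled rows. The three label pairs are taken from the GoEmotions sample shown earlier; the `predicted_ids` values and the toy texts are hypothetical, and no model is actually run here.

```python
import numpy as np
import pandas as pd

# Labeled rows (ids and names as in the GoEmotions sample above).
labeled = pd.DataFrame(
    {
        "text": ["t1", "t2", "t3"],
        "emotion_label": [13, 2, 1],
        "emotion_name": ["neutral", "anger", "fear"],
    }
)

# Unlabeled rows with the placeholder columns added in this notebook.
unlabeled = pd.DataFrame(
    {
        "text": ["u1", "u2"],
        "emotion_label": [np.nan, np.nan],
        "emotion_name": ["", ""],
    }
)

# Build an id -> name lookup from the labeled data.
id_to_name = (
    labeled.drop_duplicates("emotion_label")
    .set_index("emotion_label")["emotion_name"]
    .to_dict()
)

# Hypothetical model output: one predicted label id per unlabeled row.
predicted_ids = [2, 13]

unlabeled["emotion_label"] = predicted_ids
unlabeled["emotion_name"] = unlabeled["emotion_label"].map(id_to_name)

print(unlabeled[["text", "emotion_label", "emotion_name"]])
```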

