# Apply Emotion Mapping to GoEmotions Dataset
## Map 27 GoEmotions → 13 Target Emotions

This notebook:
1. Loads the emotion mapping configuration
2. Loads the GoEmotions dataset
3. Applies the 27→13 mapping to create the `emotion_label` column
4. Validates the transformation
5. Saves the updated GoEmotions dataset

In [1]:
import pandas as pd
import numpy as np
import json
import ast
from pathlib import Path
from collections import Counter

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

## 1. Load Emotion Mapping Configuration

In [2]:
# Load the mapping config created in 02_emotion_mapping.ipynb
with open('emotion_mapping_config.json', 'r') as f:
    config = json.load(f)

# Extract mappings (JSON object keys are always strings, so convert the index keys back to int)
TARGET_EMOTIONS = {int(k): v for k, v in config['target_emotions'].items()}
GOEMOTIONS_27 = {int(k): v for k, v in config['goemotions_27'].items()}
MAPPING_27_TO_13 = config['mapping_27_to_13']

print("✅ Loaded emotion mapping configuration")
print(f"\nTarget emotions (13): {list(TARGET_EMOTIONS.values())}")
print(f"\nGoEmotions (27): {list(GOEMOTIONS_27.values())}")

✅ Loaded emotion mapping configuration

Target emotions (13): ['fear', 'anger', 'sadness', 'anxiety', 'confusion', 'surprise', 'disgust', 'caring', 'joy', 'excitement', 'gratitude', 'disappointment', 'neutral']

GoEmotions (27): ['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral']
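
Because the later mapping step falls back to neutral for any name it cannot find, a gap in the config would go unnoticed. The sketch below shows one way to check coverage before applying the mapping. The miniature `config` and the helper `check_mapping_config` are hypothetical; the sketch assumes only the three keys read above, with `mapping_27_to_13` keyed by emotion name and holding integer target labels.

```python
# Hypothetical miniature config with the same three keys the notebook reads.
config = {
    "target_emotions": {"1": "fear", "2": "anger", "13": "neutral"},
    "goemotions_27": {"2": "anger", "3": "annoyance", "14": "fear", "27": "neutral"},
    "mapping_27_to_13": {"anger": 2, "annoyance": 2, "fear": 1, "neutral": 13},
}


def check_mapping_config(config):
    """Return (source names with no mapping, names mapped to an unknown target)."""
    targets = {int(k) for k in config["target_emotions"]}
    names = set(config["goemotions_27"].values())
    mapping = config["mapping_27_to_13"]
    # Source emotions that would silently fall back to the default label.
    unmapped = sorted(names - set(mapping))
    # Mapping entries whose value is not one of the declared target labels.
    bad_targets = sorted(name for name, t in mapping.items() if t not in targets)
    return unmapped, bad_targets


unmapped, bad_targets = check_mapping_config(config)
print("Unmapped source emotions:", unmapped)
print("Mappings to unknown targets:", bad_targets)
```

Two empty lists mean every source emotion has an explicit target and no entry points outside the declared target labels.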


## 2. Load GoEmotions Dataset

In [3]:
# Load full GoEmotions dataset
print("Loading GoEmotions dataset...")
goemotions_path = 'goemotion_data/goemotions.csv'
df = pd.read_csv(goemotions_path)

print(f"\n✅ Loaded {len(df):,} rows")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nFirst 5 rows:")
display(df.head())

Loading GoEmotions dataset...

✅ Loaded 54,263 rows

Columns: ['text', 'labels', 'id', 'Unnamed: 3', '[27] = neutral [0] = admiration [1] = amusement [2] = anger [3] = annoyance [4] = approval [5] = caring [6] = confusion [7] = curiosity [8] = desire [9] = disappointment [10] = disapproval [11] = disgust [12] = embarrassment [13] = excitement [14] = fear [15] = gratitude [16] = grief [17] = joy [18] = love [19] = nervousness [20] = optimism [21] = pride [22] = realization [23] = relief [24] = remorse [25] = sadness [26] = surprise [27] = neutral']

Data types:
text                                                                                                                                                                                                                                                                                                                                                                                                                                           

Unnamed: 0,text,labels,id,Unnamed: 3,[27] = neutral [0] = admiration [1] = amusement [2] = anger [3] = annoyance [4] = approval [5] = caring [6] = confusion [7] = curiosity [8] = desire [9] = disappointment [10] = disapproval [11] = disgust [12] = embarrassment [13] = excitement [14] = fear [15] = gratitude [16] = grief [17] = joy [18] = love [19] = nervousness [20] = optimism [21] = pride [22] = realization [23] = relief [24] = remorse [25] = sadness [26] = surprise [27] = neutral
0,My favourite food is anything I didn't have to cook myself.,[27],eebbqej,,
1,"Now if he does off himself, everyone will think hes having a laugh screwing with people instead ...",[27],ed00q6i,,
2,WHY THE FUCK IS BAYLESS ISOING,[2],eezlygj,,
3,To make her feel threatened,[14],ed7ypvh,,
4,Dirty Southern Wankers,[3],ed0bdzj,,
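
The column list above includes an empty `Unnamed: 3` column and a long legend column that carry no per-row data. One option is to skip them at load time with the `usecols` parameter of `pd.read_csv`. The sketch below uses a hypothetical in-memory sample (with the legend header shortened to `legend`) rather than the real file.

```python
import io

import pandas as pd

# Hypothetical two-row sample that mimics the stray columns seen above.
sample_csv = io.StringIO(
    "text,labels,id,Unnamed: 3,legend\n"
    "To make her feel threatened,[14],ed7ypvh,,\n"
    "Dirty Southern Wankers,[3],ed0bdzj,,\n"
)

# usecols restricts parsing to the named columns;
# dtype=str keeps the label lists as raw text for later parsing.
df_clean = pd.read_csv(sample_csv, usecols=["text", "labels", "id"], dtype=str)
print(df_clean.columns.tolist())
```

The notebook reaches the same end state later by selecting its final columns before saving, so this is a convenience rather than a required change.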


## 3. Parse Labels Column

The GoEmotions CSV stores labels as string representations of lists (e.g., "[14]" or "[2, 14]").
We need to parse these strings into actual Python lists.

In [4]:
# Parse labels column from string to list
def parse_labels(label_str):
    """
    Convert string representation of list to actual list
    e.g., "[14]" -> [14], "[2, 14]" -> [2, 14]
    """
    try:
        if pd.isna(label_str):
            return [27]  # Default to neutral if missing
        # Use ast.literal_eval to safely parse string lists
        labels = ast.literal_eval(label_str)
        if isinstance(labels, list):
            return labels
        else:
            return [labels]  # Wrap single int in list
    except (ValueError, SyntaxError, TypeError):
        return [27]  # Default to neutral if the string cannot be parsed

# Apply parsing
print("Parsing labels column...")
df['labels_parsed'] = df['labels'].apply(parse_labels)

print("\n✅ Labels parsed")
print("\nSample parsed labels:")
print(df[['text', 'labels', 'labels_parsed']].head(10))

Parsing labels column...

✅ Labels parsed

Sample parsed labels:
                                                                                                  text  \
0                                          My favourite food is anything I didn't have to cook myself.   
1  Now if he does off himself, everyone will think hes having a laugh screwing with people instead ...   
2                                                                       WHY THE FUCK IS BAYLESS ISOING   
3                                                                          To make her feel threatened   
4                                                                               Dirty Southern Wankers   
5   OmG pEyToN iSn'T gOoD eNoUgH tO hElP uS iN tHe PlAyOfFs! Dumbass Broncos fans circa December 2015.   
6  Yes I heard abt the f bombs! That has to be why. Thanks for your reply:) until then hubby and I ...   
7                   We need more boards and to create a bit more space for [NAME]. Th
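
Since `parse_labels` returns `[27]` for both missing and unparseable values, those rows become indistinguishable from genuinely neutral ones. The sketch below is one way to count them first. `parse_labels_strict` and the sample series are hypothetical and are not part of the notebook.

```python
import ast

import pandas as pd


def parse_labels_strict(label_str):
    """Return a list of labels, or None when the value is missing or unparseable."""
    if not isinstance(label_str, str):
        return None
    try:
        labels = ast.literal_eval(label_str)
    except (ValueError, SyntaxError):
        return None
    return labels if isinstance(labels, list) else [labels]


# Hypothetical sample: two valid entries, one missing, one malformed.
raw = pd.Series(["[14]", "[2, 14]", None, "not a list"])
parsed = raw.apply(parse_labels_strict)

# Rows that would have been silently defaulted to neutral.
n_failed = parsed.isna().sum()
print(f"Rows that failed to parse: {n_failed}")
```

If the count is zero on the real data, the neutral fallback never fires and the distribution is unaffected by it.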

## 4. Apply Emotion Mapping (27 → 13)

### Strategy for Multiple Labels:
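Some texts have multiple emotion labels. We take the **first listed label** as the primary emotion.
This is a simplifying heuristic: the label lists do not appear to be ordered by confidence (for example, `[8, 20]` in the data is in ascending index order), so the first label is not necessarily the strongest one.
We accept this because:
- For crisis detection, we need one primary emotion per text
- Multi-label classification can be added later if needed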
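
To judge how much the first-label choice matters, it helps to know what share of rows carry more than one label. The sketch below computes that share on a hypothetical series of parsed labels; in the notebook the input would be `df['labels_parsed']`.

```python
import pandas as pd

# Hypothetical parsed labels standing in for df['labels_parsed'].
labels_parsed = pd.Series([[27], [8, 20], [2], [3, 10, 25]])

# Number of labels attached to each text.
n_labels = labels_parsed.apply(len)

# Fraction of rows where taking the first label discards information.
multi_share = (n_labels > 1).mean()
print(f"Rows with more than one label: {multi_share:.1%}")
print(n_labels.value_counts().sort_index())
```

A small share suggests the single-label simplification loses little; a large share would be an argument for revisiting the multi-label option.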

In [5]:
def map_to_target_emotion(goemotions_labels):
    """
    Maps GoEmotions label(s) to target emotion (1-13)
    
    Args:
        goemotions_labels: list of int (GoEmotions labels 0-27)
    
    Returns:
        int: Target emotion label (1-13)
    """
    # Take the first listed label as the primary emotion
    if not goemotions_labels:
        primary_label = 27  # neutral
    else:
        primary_label = goemotions_labels[0]
    
    # Convert index to emotion name
    emotion_name = GOEMOTIONS_27.get(primary_label, 'neutral')
    
    # Map to target label (1-13)
    target_label = MAPPING_27_TO_13.get(emotion_name, 13)  # Default to neutral if not found
    
    return target_label

# Apply mapping
print("Applying emotion mapping (27 → 13)...")
df['emotion_label'] = df['labels_parsed'].apply(map_to_target_emotion)

# Add emotion name column for readability
df['emotion_name'] = df['emotion_label'].map(TARGET_EMOTIONS)

print("\n✅ Emotion mapping applied")
print(f"\nNew columns created:")
print(f"  - 'emotion_label': numeric values 1-13 (for ML training)")
print(f"  - 'emotion_name': text labels (for readability)")

Applying emotion mapping (27 → 13)...

✅ Emotion mapping applied

New columns created:
  - 'emotion_label': numeric values 1-13 (for ML training)
  - 'emotion_name': text labels (for readability)


## 5. Validate Transformation

In [6]:
print("=" * 80)
print("VALIDATION RESULTS")
print("=" * 80)

# Check for any null values
null_count = df['emotion_label'].isna().sum()
print(f"\nNull emotion_labels: {null_count}")

# Check value range (should be 1-13)
min_label = df['emotion_label'].min()
max_label = df['emotion_label'].max()
print(f"Label range: {min_label} to {max_label} (expected: 1 to 13)")

if min_label >= 1 and max_label <= 13 and null_count == 0:
    print("\n✅ All validations passed!")
else:
    print("\n⚠️  Validation issues detected")

# Show distribution of target emotions
print("\n" + "=" * 80)
print("EMOTION LABEL DISTRIBUTION")
print("=" * 80)

emotion_counts = df['emotion_label'].value_counts().sort_index()
print(f"\n{'Label':<8} {'Emotion':<20} {'Count':<10} {'Percentage'}")
print("-" * 60)

for label in range(1, 14):
    count = emotion_counts.get(label, 0)
    pct = (count / len(df)) * 100
    emotion_name = TARGET_EMOTIONS[label]
    print(f"{label:<8} {emotion_name:<20} {count:<10,} {pct:>6.2f}%")

print(f"\nTotal: {len(df):,} rows")

VALIDATION RESULTS

Null emotion_labels: 0
Label range: 1 to 13 (expected: 1 to 13)

✅ All validations passed!

EMOTION LABEL DISTRIBUTION

Label    Emotion              Count      Percentage
------------------------------------------------------------
1        fear                 658          1.21%
2        anger                4,607        8.49%
3        sadness              1,148        2.12%
4        anxiety              132          0.24%
5        confusion            3,753        6.92%
6        surprise             1,831        3.37%
7        disgust              1,044        1.92%
8        caring               1,218        2.24%
9        joy                  11,048      20.36%
10       excitement           1,543        2.84%
11       gratitude            4,093        7.54%
12       disappointment       3,898        7.18%
13       neutral              19,290      35.55%

Total: 54,263 rows


## 6. Before/After Comparison

In [7]:
# Create helper column to show original emotion name
df['original_emotion'] = df['labels_parsed'].apply(
    lambda x: GOEMOTIONS_27.get(x[0] if x else 27, 'neutral')
)

# Create helper column to show target emotion name
df['target_emotion'] = df['emotion_label'].apply(
    lambda x: TARGET_EMOTIONS.get(x, 'unknown')
)

# Show mapping examples
print("=" * 80)
print("BEFORE/AFTER EXAMPLES")
print("=" * 80)

sample = df[['text', 'original_emotion', 'target_emotion', 'emotion_label']].head(20)
display(sample)

# Group by original emotion to see mapping
print("\n" + "=" * 80)
print("MAPPING SUMMARY: GoEmotion → Target")
print("=" * 80)

mapping_summary = df.groupby(['original_emotion', 'target_emotion', 'emotion_label']).size().reset_index(name='count')
mapping_summary = mapping_summary.sort_values('count', ascending=False)

print(f"\n{'Original (27)':<20} → {'Target (13)':<20} {'Label':<8} {'Count'}")
print("-" * 70)
for _, row in mapping_summary.iterrows():
    print(f"{row['original_emotion']:<20} → {row['target_emotion']:<20} {row['emotion_label']:<8} {row['count']:,}")

BEFORE/AFTER EXAMPLES


Unnamed: 0,text,original_emotion,target_emotion,emotion_label
0,My favourite food is anything I didn't have to cook myself.,neutral,neutral,13
1,"Now if he does off himself, everyone will think hes having a laugh screwing with people instead ...",neutral,neutral,13
2,WHY THE FUCK IS BAYLESS ISOING,anger,anger,2
3,To make her feel threatened,fear,fear,1
4,Dirty Southern Wankers,annoyance,anger,2
5,OmG pEyToN iSn'T gOoD eNoUgH tO hElP uS iN tHe PlAyOfFs! Dumbass Broncos fans circa December 2015.,surprise,surprise,6
6,Yes I heard abt the f bombs! That has to be why. Thanks for your reply:) until then hubby and I ...,gratitude,gratitude,11
7,We need more boards and to create a bit more space for [NAME]. Then we’ll be good.,desire,excitement,10
8,Damn youtube and outrage drama is super lucrative for reddit,admiration,joy,9
9,It might be linked to the trust factor of your friend.,neutral,neutral,13



MAPPING SUMMARY: GoEmotion → Target

Original (27)        → Target (13)          Label    Count
----------------------------------------------------------------------
neutral              → neutral              13       16,021
admiration           → joy                  9        5,122
approval             → neutral              13       3,269
amusement            → joy                  9        2,793
gratitude            → gratitude            11       2,681
annoyance            → anger                2        2,671
curiosity            → confusion            5        2,210
disapproval          → disappointment       12       2,117
anger                → anger                2        1,936
love                 → joy                  9        1,883
confusion            → confusion            5        1,543
disappointment       → disappointment       12       1,284
joy                  → joy                  9        1,250
optimism             → gratitude

## 7. Save Updated Dataset

In [8]:
# Select columns for final dataset
# Keep: text, emotion_label, emotion_name, id (and optionally original labels for reference)
final_columns = ['text', 'emotion_label', 'emotion_name', 'id', 'labels']  # Keep original labels for reference

df_final = df[final_columns].copy()

# Save to new file
output_path = 'goemotion_data/goemotions_with_13_emotions.csv'
df_final.to_csv(output_path, index=False)

print("=" * 80)
print("SAVED UPDATED DATASET")
print("=" * 80)
print(f"\n✅ Saved to: {output_path}")
print(f"\nRows: {len(df_final):,}")
print(f"Columns: {df_final.columns.tolist()}")
print(f"\nFile size: {Path(output_path).stat().st_size / (1024*1024):.2f} MB")

# Show final preview
print("\nFinal dataset preview:")
display(df_final.head(10))

SAVED UPDATED DATASET

✅ Saved to: goemotion_data/goemotions_with_13_emotions.csv

Rows: 54,263
Columns: ['text', 'emotion_label', 'emotion_name', 'id', 'labels']

File size: 4.94 MB

Final dataset preview:


Unnamed: 0,text,emotion_label,emotion_name,id,labels
0,My favourite food is anything I didn't have to cook myself.,13,neutral,eebbqej,[27]
1,"Now if he does off himself, everyone will think hes having a laugh screwing with people instead ...",13,neutral,ed00q6i,[27]
2,WHY THE FUCK IS BAYLESS ISOING,2,anger,eezlygj,[2]
3,To make her feel threatened,1,fear,ed7ypvh,[14]
4,Dirty Southern Wankers,2,anger,ed0bdzj,[3]
5,OmG pEyToN iSn'T gOoD eNoUgH tO hElP uS iN tHe PlAyOfFs! Dumbass Broncos fans circa December 2015.,6,surprise,edvnz26,[26]
6,Yes I heard abt the f bombs! That has to be why. Thanks for your reply:) until then hubby and I ...,11,gratitude,ee3b6wu,[15]
7,We need more boards and to create a bit more space for [NAME]. Then we’ll be good.,10,excitement,ef4qmod,"[8, 20]"
8,Damn youtube and outrage drama is super lucrative for reddit,9,joy,ed8wbdn,[0]
9,It might be linked to the trust factor of your friend.,13,neutral,eczgv1o,[27]
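
A quick round-trip check can confirm that the saved file reloads with the same shape and that the labels stay in range. The sketch below writes a hypothetical two-row frame to a temporary directory instead of touching the real output path. It also shows that the original `labels` column comes back as strings and needs `ast.literal_eval` again.

```python
import ast
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row stand-in for df_final, using the same column names.
df_final = pd.DataFrame({
    "text": ["To make her feel threatened", "We need more boards"],
    "emotion_label": [1, 10],
    "emotion_name": ["fear", "excitement"],
    "id": ["ed7ypvh", "ef4qmod"],
    "labels": ["[14]", "[8, 20]"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "roundtrip.csv"
    df_final.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Shape and label range should survive the save/load cycle.
same_shape = reloaded.shape == df_final.shape
labels_in_range = reloaded["emotion_label"].between(1, 13).all()

# The original label lists are stored as text and must be parsed again.
reparsed = reloaded["labels"].apply(ast.literal_eval)
print(same_shape, labels_in_range, reparsed.tolist())
```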


## 8. Summary Statistics

In [9]:
print("=" * 80)
print("SUMMARY")
print("=" * 80)

print(f"\n📊 Dataset Statistics:")
print(f"   Total texts: {len(df_final):,}")
print(f"   Unique emotion labels: {df_final['emotion_label'].nunique()}")
print(f"   Columns: {df_final.columns.tolist()}")

print(f"\n📁 Files:")
print(f"   Original: goemotion_data/goemotions.csv")
print(f"   Updated:  goemotion_data/goemotions_with_13_emotions.csv")
print(f"   Config:   emotion_mapping_config.json")

print(f"\n✅ Transformation Complete!")
print(f"\nNext steps:")
print(f"   1. Use goemotions_with_13_emotions.csv for training")
print(f"   2. Apply same emotion_label schema to crisis/non-crisis data")
print(f"   3. Re-run Phase 4 to create master training file")
print(f"   4. Train BERT on 13-emotion classification")

SUMMARY

📊 Dataset Statistics:
   Total texts: 54,263
   Unique emotion labels: 13
   Columns: ['text', 'emotion_label', 'emotion_name', 'id', 'labels']

📁 Files:
   Original: goemotion_data/goemotions.csv
   Updated:  goemotion_data/goemotions_with_13_emotions.csv
   Config:   emotion_mapping_config.json

✅ Transformation Complete!

Next steps:
   1. Use goemotions_with_13_emotions.csv for training
   2. Apply same emotion_label schema to crisis/non-crisis data
   3. Re-run Phase 4 to create master training file
   4. Train BERT on 13-emotion classification
