# TEMPO Data Inspection Notebook
## Understanding Current Data Structure

This notebook inspects the current standardized data to understand:
1. Column structure across datasets
2. Current emotion label format (the 27 GoEmotions labels still need to be mapped to 13 target emotions)
3. Event types and informativeness values
4. Data quality and missing values

In [1]:
import pandas as pd
import numpy as np
import os
from pathlib import Path

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

## 1. Check Available Data Files

In [2]:
# List all CSV files in standardized_data
data_dir = Path('standardized_data')
csv_files = list(data_dir.glob('*.csv'))

print(f"Found {len(csv_files)} CSV files:\n")
for file in sorted(csv_files):
    size_mb = file.stat().st_size / (1024 * 1024)
    print(f"  {file.name:<40} {size_mb:>8.2f} MB")

Found 11 CSV files:

  coachella_standardized.csv                   0.56 MB
  crisis_combined.csv                         12.91 MB
  crisislex_standardized.csv                   4.59 MB
  fifa_worldcup_standardized.csv              12.00 MB
  game_of_thrones_standardized.csv           142.61 MB
  humaid_standardized.csv                      8.32 MB
  music_concerts_standardized.csv              0.45 MB
  non_crisis_combined.csv                    265.68 MB
  tokyo_olympics_standardized.csv             29.49 MB
  us_election_standardized.csv                23.67 MB
  worldcup_2018_standardized.csv              56.89 MB
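
The first goal of this notebook is to compare column structure across datasets. A minimal sketch of one way to do that without loading the larger files: read only the header row of each CSV. The helper names `read_header`, `compare_columns`, and `shared_columns` are hypothetical, and the UTF-8 encoding is an assumption about these files.

```python
import csv
from pathlib import Path


def read_header(csv_path):
    """Return the column names of a CSV file by reading only its first row."""
    # errors="replace" is a precaution in case a file contains undecodable bytes.
    with open(csv_path, newline="", encoding="utf-8", errors="replace") as handle:
        return next(csv.reader(handle), [])


def compare_columns(data_dir):
    """Map each CSV file name in data_dir to its list of column names."""
    return {path.name: read_header(path) for path in sorted(Path(data_dir).glob("*.csv"))}


def shared_columns(columns_by_file):
    """Return the columns present in every file, in the order of the first file."""
    headers = list(columns_by_file.values())
    if not headers:
        return []
    common = set(headers[0]).intersection(*headers[1:])
    return [name for name in headers[0] if name in common]


# Usage in this notebook (assumes the standardized_data/ folder listed above):
# columns_by_file = compare_columns("standardized_data")
# for name, cols in columns_by_file.items():
#     print(f"{name:<40} {cols}")
# print("Shared:", shared_columns(columns_by_file))
```

Reading headers only keeps the check fast even for the files that are over 100 MB.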


## 2. Inspect Crisis Combined Dataset

In [3]:
# Load crisis data
crisis_df = pd.read_csv('standardized_data/crisis_combined.csv')

print("=" * 80)
print("CRISIS COMBINED DATASET")
print("=" * 80)
print(f"\nTotal rows: {len(crisis_df):,}")
print(f"\nColumns ({len(crisis_df.columns)}): {crisis_df.columns.tolist()}")
print(f"\nData types:\n{crisis_df.dtypes}")

CRISIS COMBINED DATASET

Total rows: 66,748

Columns (7): ['text', 'created_at', 'event_name', 'event_type', 'crisis_label', 'source_dataset', 'informativeness']

Data types:
text               object
created_at         object
event_name         object
event_type         object
crisis_label        int64
source_dataset     object
informativeness    object
dtype: object
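
`created_at` loads as `object`, meaning the timestamps are still plain strings. The sketch below shows one possible way to parse them with `pd.to_datetime`. The helper name `parse_created_at` is hypothetical and the sample values are toy data in the shape of the crisis rows in this notebook. Note that `utc=True` treats timestamps without an offset as UTC, which is an assumption that would need checking per source, and files that mix several timestamp formats may need additional handling.

```python
import pandas as pd


def parse_created_at(series):
    """Parse timestamp strings to timezone-aware UTC datetimes; unparseable values become NaT."""
    return pd.to_datetime(series, utc=True, errors="coerce")


# Toy values, not read from the real file.
sample = pd.Series([
    "2016-05-19 18:16:11.727000+00:00",
    "2016-05-09 03:58:37.448000+00:00",
    "not a timestamp",
])
parsed = parse_created_at(sample)
print(parsed.dtype)
print(f"Unparseable values: {parsed.isna().sum()}")

# Possible usage on the real frame:
# crisis_df["created_at"] = parse_created_at(crisis_df["created_at"])
```

Counting the `NaT` values after parsing gives a quick data-quality signal for the timestamp column.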


In [4]:
# Show first few rows
print("\nFirst 3 rows:")
crisis_df.head(3)


First 3 rows:


Unnamed: 0,text,created_at,event_name,event_type,crisis_label,source_dataset,informativeness
0,.@GreenABEnergy How can @AirworksCanada assist in the cleanup? #AlbertaStrong,2016-05-19 18:16:11.727000+00:00,canada_wildfires_2016_dev,wildfire,1,humaid,
1,RT @katvondawn: Thoughts &amp; prayers going to all those being affected by the wildfire in Cana...,2016-05-09 03:58:37.448000+00:00,canada_wildfires_2016_dev,wildfire,1,humaid,
2,Glacier Farm Media pledges $50K in support for Fort McMurray wildfire disaster relief.,2016-05-12 12:41:05.044000+00:00,canada_wildfires_2016_dev,wildfire,1,humaid,
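
Row 1 above contains the HTML entity `&amp;`, so at least some tweet text is still HTML-escaped. A minimal sketch of decoding such entities with the standard library follows. The helper name `unescape_entities` is hypothetical, and whether the text should be decoded at this stage is a pipeline decision this notebook does not settle.

```python
import html


def unescape_entities(text):
    """Decode HTML entities such as &amp; and leave non-string values untouched."""
    if not isinstance(text, str):
        return text  # keep NaN / None values as they are
    return html.unescape(text)


print(unescape_entities("Thoughts &amp; prayers"))  # Thoughts & prayers

# Possible usage on the real frame:
# crisis_df["text"] = crisis_df["text"].map(unescape_entities)
```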


In [5]:
# Check informativeness values
print("=" * 80)
print("INFORMATIVENESS ANALYSIS")
print("=" * 80)

if 'informativeness' in crisis_df.columns:
    print(f"\nInformativeness value counts:")
    print(crisis_df['informativeness'].value_counts())
    print(f"\nInformativeness data type: {crisis_df['informativeness'].dtype}")
    print(f"Missing values: {crisis_df['informativeness'].isna().sum()}")
else:
    print("\nNo 'informativeness' column found")

INFORMATIVENESS ANALYSIS

Informativeness value counts:
informativeness
related_informative        14379
related_not_informative     6257
not_related                 2296
Name: count, dtype: int64

Informativeness data type: object
Missing values: 43816
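
Only 22,932 of the 66,748 rows carry an informativeness label (14,379 + 6,257 + 2,296), which matches the 43,816 missing values. The three `humaid` rows shown earlier have no value, so missingness may track `source_dataset`. A sketch of how that could be checked, using a hypothetical helper `missing_by_group` and a made-up toy frame:

```python
import pandas as pd


def missing_by_group(df, value_col, group_col):
    """Count and share of missing value_col entries within each group_col value."""
    is_missing = df[value_col].isna()
    summary = is_missing.groupby(df[group_col]).agg(missing="sum", total="size")
    summary["missing_share"] = summary["missing"] / summary["total"]
    return summary


# Toy data for illustration only; the real counts come from crisis_df.
toy = pd.DataFrame({
    "source_dataset": ["humaid", "humaid", "crisislex", "crisislex"],
    "informativeness": [None, None, "related_informative", "not_related"],
})
summary = missing_by_group(toy, "informativeness", "source_dataset")
print(summary)

# Possible usage on the real frame:
# print(missing_by_group(crisis_df, "informativeness", "source_dataset"))
```

If one source accounts for all the missing values, the gap is a property of that dataset and not a loading error.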


In [6]:
# Check event types
print("=" * 80)
print("EVENT TYPE ANALYSIS")
print("=" * 80)

if 'event_type' in crisis_df.columns:
    print(f"\nUnique event types ({crisis_df['event_type'].nunique()}):")
    print(crisis_df['event_type'].value_counts())
else:
    print("\nNo 'event_type' column found")

EVENT TYPE ANALYSIS

Unique event types (7):
event_type
hurricane     33422
earthquake    11429
flood          7498
wildfire       7151
accident       4248
bombing        2000
haze           1000
Name: count, dtype: int64


In [7]:
# Check for emotion columns
print("=" * 80)
print("EMOTION COLUMNS IN CRISIS DATA")
print("=" * 80)

emotion_cols = [col for col in crisis_df.columns if 'emotion' in col.lower()]
print(f"\nFound {len(emotion_cols)} emotion columns: {emotion_cols}")

if emotion_cols:
    print("\nSample emotion values:")
    print(crisis_df[emotion_cols].head(10))

EMOTION COLUMNS IN CRISIS DATA

Found 0 emotion columns: []


## 3. Inspect Non-Crisis Combined Dataset

In [8]:
# Load non-crisis data
non_crisis_df = pd.read_csv('standardized_data/non_crisis_combined.csv')

print("=" * 80)
print("NON-CRISIS COMBINED DATASET")
print("=" * 80)
print(f"\nTotal rows: {len(non_crisis_df):,}")
print(f"\nColumns ({len(non_crisis_df.columns)}): {non_crisis_df.columns.tolist()}")
print(f"\nData types:\n{non_crisis_df.dtypes}")

NON-CRISIS COMBINED DATASET

Total rows: 1,533,696

Columns (6): ['text', 'created_at', 'event_name', 'event_type', 'crisis_label', 'source_dataset']

Data types:
text              object
created_at        object
event_name        object
event_type        object
crisis_label       int64
source_dataset    object
dtype: object


In [9]:
# Show first few rows
print("\nFirst 3 rows:")
non_crisis_df.head(3)


First 3 rows:


Unnamed: 0,text,created_at,event_name,event_type,crisis_label,source_dataset
0,#Coachella2015 tickets selling out in less than 40 minutes _Ù_¦_Ù___Ù___Ù÷_ÙÎµ_ÙÎµ_Ù___Ù_¦ http...,2015-01-07 15:02:00,coachella_2015,entertainment,0,coachella
1,RT @sudsybuddy: WAIT THIS IS ABSOLUTE FIRE _ÙÓ´_ÙÓ´_ÙÓ´ #Coachella2015 http://t.co/Ov2eCJtAvR,2015-01-07 15:02:00,coachella_2015,entertainment,0,coachella
2,#Coachella2015 #VIP passes secured! See you there bitchesssss,2015-01-07 15:01:00,coachella_2015,entertainment,0,coachella


In [10]:
# Check event types in non-crisis
print("=" * 80)
print("NON-CRISIS EVENT TYPES")
print("=" * 80)

if 'event_type' in non_crisis_df.columns:
    print(f"\nEvent type distribution:")
    print(non_crisis_df['event_type'].value_counts())

if 'event_name' in non_crisis_df.columns:
    print(f"\nEvent name distribution:")
    print(non_crisis_df['event_name'].value_counts())

NON-CRISIS EVENT TYPES

Event type distribution:
event_type
entertainment    766290
sports           667458
politics          99948
Name: count, dtype: int64

Event name distribution:
event_name
got_season8_2019       760614
fifa_worldcup_2018     458533
tokyo_olympics_2020    159432
us_election_2020        99948
fifa_worldcup_2022      49493
coachella_2015           3846
music_concerts_2021      1830
Name: count, dtype: int64
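
The counts above are consistent with each `event_name` belonging to a single `event_type` (for example, the three entertainment events sum to 766,290). A sketch of how to confirm that directly, using a hypothetical helper `event_type_by_name` and a made-up toy frame:

```python
import pandas as pd


def event_type_by_name(df):
    """Return one row per event_name with its event_type values and row count."""
    grouped = df.groupby("event_name")["event_type"]
    return pd.DataFrame({
        "event_types": grouped.unique().map(sorted),
        "n_types": grouped.nunique(),
        "rows": grouped.size(),
    })


# Toy data for illustration only.
toy = pd.DataFrame({
    "event_name": ["coachella_2015", "coachella_2015", "us_election_2020"],
    "event_type": ["entertainment", "entertainment", "politics"],
})
table = event_type_by_name(toy)
print(table)

# Any event_name mapped to more than one event_type would show up here.
ambiguous = table[table["n_types"] > 1]

# Possible usage on the real frame:
# print(event_type_by_name(non_crisis_df))
```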


In [11]:
# Check informativeness in non-crisis
print("=" * 80)
print("NON-CRISIS INFORMATIVENESS")
print("=" * 80)

if 'informativeness' in non_crisis_df.columns:
    print(f"\nInformativeness values:")
    print(non_crisis_df['informativeness'].value_counts())
    print(f"\nMissing: {non_crisis_df['informativeness'].isna().sum()}")
else:
    print("\nNo 'informativeness' column in non-crisis data")

NON-CRISIS INFORMATIVENESS

No 'informativeness' column in non-crisis data


## 4. Check for GoEmotions Data (27 Emotions)

In [12]:
# Check if goemotion_data exists
goemotion_path = Path('goemotion_data/goemotions.csv')

if goemotion_path.exists():
    print("✓ GoEmotions data found!\n")
    
    # Load a sample
    goemotions_df = pd.read_csv(goemotion_path, nrows=1000)
    
    print("=" * 80)
    print("GOEMOTIONS DATASET (27 EMOTIONS)")
    print("=" * 80)
    print(f"\nColumns: {goemotions_df.columns.tolist()}")
    print(f"\nFirst 3 rows:")
    display(goemotions_df.head(3))
    
    # Check emotion columns
    emotion_cols = [col for col in goemotions_df.columns if 'emotion' in col.lower()]
    print(f"\nEmotion-related columns: {emotion_cols}")
    
else:
    print("✗ GoEmotions data NOT found at goemotion_data/goemotions.csv")
    print("  You need to download this from Google Drive!")

✓ GoEmotions data found!

GOEMOTIONS DATASET (27 EMOTIONS)

Columns: ['text', 'labels', 'id', 'Unnamed: 3', '[27] = neutral [0] = admiration [1] = amusement [2] = anger [3] = annoyance [4] = approval [5] = caring [6] = confusion [7] = curiosity [8] = desire [9] = disappointment [10] = disapproval [11] = disgust [12] = embarrassment [13] = excitement [14] = fear [15] = gratitude [16] = grief [17] = joy [18] = love [19] = nervousness [20] = optimism [21] = pride [22] = realization [23] = relief [24] = remorse [25] = sadness [26] = surprise [27] = neutral']

First 3 rows:


Unnamed: 0,text,labels,id,Unnamed: 3,[27] = neutral [0] = admiration [1] = amusement [2] = anger [3] = annoyance [4] = approval [5] = caring [6] = confusion [7] = curiosity [8] = desire [9] = disappointment [10] = disapproval [11] = disgust [12] = embarrassment [13] = excitement [14] = fear [15] = gratitude [16] = grief [17] = joy [18] = love [19] = nervousness [20] = optimism [21] = pride [22] = realization [23] = relief [24] = remorse [25] = sadness [26] = surprise [27] = neutral
0,My favourite food is anything I didn't have to cook myself.,[27],eebbqej,,
1,"Now if he does off himself, everyone will think hes having a laugh screwing with people instead ...",[27],ed00q6i,,
2,WHY THE FUCK IS BAYLESS ISOING,[2],eezlygj,,



Emotion-related columns: []
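
The name-based search finds nothing because the emotions are stored as numeric ids in the `labels` column, and the id-to-name key only appears inside the stray fifth column header. A minimal sketch of decoding those ids follows. The list is copied from that header, the helper name `decode_labels` is hypothetical, and the sketch assumes each cell is a Python-style list literal such as `[27]` or `[3, 10]`, which has only been confirmed here for single-label rows.

```python
import ast

# Index -> name lookup copied from the header text in the output above (ids 0..27).
GOEMOTIONS_LABELS = [
    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire", "disappointment", "disapproval",
    "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
    "joy", "love", "nervousness", "optimism", "pride", "realization",
    "relief", "remorse", "sadness", "surprise", "neutral",
]


def decode_labels(raw):
    """Turn a string such as '[2]' or '[3, 10]' into a list of emotion names."""
    ids = ast.literal_eval(raw) if isinstance(raw, str) else list(raw)
    return [GOEMOTIONS_LABELS[i] for i in ids]


print(decode_labels("[2]"))   # ['anger']
print(decode_labels("[27]"))  # ['neutral']

# Possible usage on the sample loaded above:
# goemotions_df["emotions"] = goemotions_df["labels"].map(decode_labels)
```

The key lists 27 emotions plus `neutral`, so the ids span 28 classes in total.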


## 5. Summary & Next Steps

In [13]:
print("=" * 80)
print("SUMMARY")
print("=" * 80)

print(f"\n✓ Crisis data: {len(crisis_df):,} rows")
print(f"✓ Non-crisis data: {len(non_crisis_df):,} rows")

print("\n" + "=" * 80)
print("NEXT STEPS")
print("=" * 80)
print("""
1. Download missing data folders from Google Drive:
   - goemotion_data/ (for 27 emotion labels)
   - master_training_data/ (if already processed)
   - baseline_data/ (optional - for noise tweets)

2. Map 27 GoEmotions → 13 emotions (need to define which 13!)

3. Understand informativeness values:
   - What do the current values mean?
   - Binary (0/1)? Categorical? Scale?

4. Decide event_type labels for non-crisis data

5. Refactor code to industry standards

6. Convert Python scripts to notebooks
""")

SUMMARY

✓ Crisis data: 66,748 rows
✓ Non-crisis data: 1,533,696 rows

NEXT STEPS

1. Download missing data folders from Google Drive:
   - goemotion_data/ (for 27 emotion labels)
   - master_training_data/ (if already processed)
   - baseline_data/ (optional - for noise tweets)

2. Map 27 GoEmotions → 13 emotions (need to define which 13!)

3. Understand informativeness values:
   - What do the current values mean?
   - Binary (0/1)? Categorical? Scale?

4. Decide event_type labels for non-crisis data

5. Refactor code to industry standards

6. Convert Python scripts to notebooks
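
For step 2, the 13 target emotions are not defined yet, so no real mapping can be written here. What can be sketched is a check that any candidate mapping covers every source label and lands on the expected number of targets. The helper name `check_label_mapping` is hypothetical, and the toy labels and groups below are placeholders, not a proposal for the actual 13.

```python
def check_label_mapping(mapping, source_labels, expected_targets):
    """Return a list of problems; an empty list means the mapping passes these checks."""
    problems = []
    missing = [label for label in source_labels if label not in mapping]
    if missing:
        problems.append(f"unmapped source labels: {missing}")
    unknown = sorted(set(mapping) - set(source_labels))
    if unknown:
        problems.append(f"keys that are not source labels: {unknown}")
    targets = set(mapping.values())
    if len(targets) != expected_targets:
        problems.append(f"expected {expected_targets} target labels, found {len(targets)}")
    return problems


# Placeholder example only: four source labels collapsed into two made-up groups.
toy_sources = ["joy", "love", "anger", "annoyance"]
toy_mapping = {"joy": "group_a", "love": "group_a", "anger": "group_b"}
print(check_label_mapping(toy_mapping, toy_sources, expected_targets=2))
```

Once the 13 targets are chosen, the same check could be run with the full GoEmotions label list and `expected_targets=13`.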

