### Examining the transcript sentiment results

I used the GoEmotions model for [text classification](https://colab.research.google.com/drive/1GuUbnw1pVMQrNDVKJ98bJGYvR3YWWB_D) to assign an emotion to each sentence. This notebook analyzes the results and the transcripts themselves.


In [1]:
import pandas as pd
import json
from collections import defaultdict

In [2]:
df = pd.read_csv('results_text_sentiments.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6509 entries, 0 to 6508
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   sketch_id        6509 non-null   int64  
 1   sketch_name      6509 non-null   object 
 2   season           6509 non-null   int64  
 3   episode          6509 non-null   int64  
 4   sentence_index   6509 non-null   int64  
 5   sentence_text    6509 non-null   object 
 6   sentiment_label  6509 non-null   object 
 7   sentiment_score  6509 non-null   float64
dtypes: float64(1), int64(4), object(3)
memory usage: 406.9+ KB


## Analysis Ideas

### 1. **Sentiment Distribution & Patterns**
- Overall sentiment distribution across all sketches
- Sentiment by sketch category (office, commercial, party, etc.)
- Sentiment by season/episode (evolution over time)
- Most emotional vs. most neutral sketches
- Sentiment score distributions (high vs. low confidence)
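
The last idea above, comparing high- and low-confidence predictions, could be sketched roughly as follows. This is a minimal illustration on made-up `(label, score)` pairs, not the notebook's actual data; the `THRESHOLD` of 0.7 and the helper name `confidence_summary` are assumptions chosen only for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (sentiment_label, sentiment_score) pairs, one per sentence
scored = [
    ("neutral", 0.91), ("neutral", 0.58),
    ("anger", 0.83), ("anger", 0.45),
    ("curiosity", 0.77),
]
THRESHOLD = 0.7  # assumed cut-off between "low" and "high" confidence


def confidence_summary(scored, threshold):
    """Per label: sentence count, mean score, and share of scores below the threshold."""
    by_label = defaultdict(list)
    for label, score in scored:
        by_label[label].append(score)
    return {
        label: {
            "n": len(scores),
            "mean_score": mean(scores),
            "share_low_confidence": sum(s < threshold for s in scores) / len(scores),
        }
        for label, scores in by_label.items()
    }


summary = confidence_summary(scored, THRESHOLD)
for label, stats in summary.items():
    print(label, stats)
```

With the real data, the same grouping could be done on `df` by `sentiment_label` and `sentiment_score`; the threshold would need to be chosen by looking at the actual score distribution.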

### 2. **Temporal Analysis**
- Sentiment evolution within sketches (beginning → middle → end)
- Sentiment transitions between consecutive sentences
- Sentiment patterns by sentence position (first 25%, middle 50%, last 25%)
- Episode-level sentiment arcs
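
For the transition idea in the list above, one possible approach is to count `(from, to)` label pairs between consecutive sentences, never pairing the last sentence of one sketch with the first of the next. The sketch below uses plain Python on hypothetical toy rows; `count_transitions` is an illustrative helper, not something defined elsewhere in this notebook.

```python
from collections import Counter

# Hypothetical rows: (sketch_id, sentence_index, sentiment_label)
rows = [
    (1, 1, "neutral"), (1, 2, "curiosity"), (1, 3, "anger"), (1, 4, "anger"),
    (2, 1, "neutral"), (2, 2, "neutral"), (2, 3, "curiosity"),
]


def count_transitions(rows):
    """Count (from_label, to_label) pairs between consecutive sentences of the same sketch."""
    by_sketch = {}
    for sketch_id, sentence_index, label in rows:
        by_sketch.setdefault(sketch_id, []).append((sentence_index, label))

    transitions = Counter()
    for sentences in by_sketch.values():
        # Sort by sentence_index so the pairs follow the order of the transcript
        labels = [label for _, label in sorted(sentences)]
        transitions.update(zip(labels, labels[1:]))
    return transitions


transitions = count_transitions(rows)
for (src, dst), n in transitions.most_common():
    print(f"{src} -> {dst}: {n}")
```

The resulting counts map directly onto the edges of a network graph or the cells of an emotion × emotion matrix.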

### 6. **Visualization Ideas for Observable Plot**
- **Heatmap**: Sentiment by category × emotion type
- **Line chart**: Sentiment evolution across seasons/episodes
- **Bar chart**: Top emotions by category
- **Streamgraph**: Sentiment flow within a sketch
- **Scatter plot**: Sentiment score vs. sentence position
- **Network graph**: Sentiment transitions between emotions
- **Radar chart**: Emotion profile by category
- **Timeline**: Sentiment arc for a specific sketch


In [4]:
# Load metadata
metadata = pd.read_csv('episode_metadata.csv')

# Merge sentiment data with metadata
df_merged = df.merge(metadata, left_on='sketch_id', right_on='id', how='left')

# Basic stats
print(f"Total sentences: {len(df)}")
print(f"Total sketches: {df['sketch_id'].nunique()}")
print(f"Total episodes: {df.groupby(['season', 'episode']).ngroups}")
print(f"Total seasons: {df['season'].nunique()}")
print(f"\nUnique emotions: {df['sentiment_label'].nunique()}")
print(f"Unique categories: {df_merged['category'].nunique()}")


Total sentences: 6509
Total sketches: 86
Total episodes: 18
Total seasons: 3

Unique emotions: 25
Unique categories: 22


In [66]:
# 1. Overall sentiment distribution
sentiment_counts = df['sentiment_label'].value_counts()
sentiment_percentages = sentiment_counts / sentiment_counts.sum() * 100
sentiment_distribution = pd.DataFrame({
    'count': sentiment_counts,
    'percentage': sentiment_percentages
})
sentiment_distribution['percentage'] = sentiment_distribution['percentage'].round(0).astype(int)
sentiment_distribution.to_csv('overall_sentiment_distribution.csv')
sentiment_distribution


sentiment_label,count,percentage
neutral,3718,57
curiosity,510,8
approval,368,6
anger,333,5
admiration,268,4
disapproval,199,3
confusion,135,2
surprise,113,2
annoyance,107,2
gratitude,80,1


In [None]:
# Load the externally edited overall_sentiment_distribution.csv, which adds a 'simplified_label' column
overall_sentiment_distribution = pd.read_csv('overall_sentiment_distribution.csv')
# Count how many GoEmotions labels map to each simplified label.
# Note: value_counts() counts label rows here, not sentences, so the percentages are shares of labels.
simplified_sentiment_counts = overall_sentiment_distribution['simplified_label'].value_counts()
simplified_sentiment_percentages = simplified_sentiment_counts / simplified_sentiment_counts.sum() * 100

simplified_sentiment_distribution = pd.DataFrame({
    'count': simplified_sentiment_counts,
    'percentage': simplified_sentiment_percentages.round(0).astype(int)
})

simplified_sentiment_distribution


simplified_label,count,percentage
happy,10,40
surprise,4,16
sad,4,16
fear,2,8
anger,2,8
disgust,2,8
neutral,1,4


In [6]:
# 2. Sentiment by category
if 'category' in df_merged.columns:
    category_sentiment = df_merged.groupby(['category', 'sentiment_label']).size().reset_index(name='count')
    category_totals = df_merged.groupby('category').size()
    
    # Calculate percentages
    category_sentiment_pct = category_sentiment.merge(
        category_totals.reset_index(name='total'), 
        on='category'
    )
    category_sentiment_pct['percentage'] = (category_sentiment_pct['count'] / category_sentiment_pct['total']) * 100
    
    # Top emotions per category
    top_emotions_by_category = category_sentiment_pct.sort_values(['category', 'count'], ascending=[True, False]).groupby('category').head(5)
    print("Top 5 emotions by category:")
    print(top_emotions_by_category)

Top 5 emotions by category:
    category sentiment_label  count  total  percentage
12     class         neutral     45     95   47.368421
5      class       curiosity     12     95   12.631579
0      class           anger     10     95   10.526316
4      class       confusion      5     95    5.263158
8      class         disgust      5     95    5.263158
..       ...             ...    ...    ...         ...
365     tour         neutral     41     85   48.235294
357     tour        approval      9     85   10.588235
355     tour           anger      5     85    5.882353
361     tour     disapproval      5     85    5.882353
354     tour      admiration      4     85    4.705882

[110 rows x 5 columns]


In [38]:
top_emotions_by_category.to_csv('top_emotions_by_category.csv', index=False)
print("Exported top_emotions_by_category.csv")

Exported top_emotions_by_category.csv


In [None]:
# Category counts

cat1_counts = metadata['category'].value_counts(dropna=True)
cat2_counts = metadata['category2'].value_counts(dropna=True)

# count categories from both columns
all_categories = set(cat1_counts.index).union(set(cat2_counts.index))
all_categories = [cat for cat in all_categories if pd.notna(cat)]

# summary table
summary_data = []
for cat in all_categories:
    count_primary = cat1_counts.get(cat, 0)
    count_secondary = cat2_counts.get(cat, 0)
    summary_data.append({
        'category': cat,
        'count_of_category': count_primary,
        'count_of_category2': count_secondary
    })

categories_table = pd.DataFrame(summary_data)
categories_table = categories_table.sort_values(by='count_of_category', ascending=False).reset_index(drop=True)

categories_table


index,category,count_of_category,count_of_category2
0,office,20,1
1,commercial,15,2
2,party,13,3
3,game show,5,0
4,driving,5,1
5,restaurant,4,4
6,reality tv,4,0
7,family,3,0
8,dating,3,2
9,sci-fi/fantasy,2,0


In [7]:
# 3. Sentiment by season
season_sentiment = df.groupby(['season', 'sentiment_label']).size().reset_index(name='count')
season_totals = df.groupby('season').size()

season_sentiment_pct = season_sentiment.merge(
    season_totals.reset_index(name='total'),
    on='season'
)
season_sentiment_pct['percentage'] = (season_sentiment_pct['count'] / season_sentiment_pct['total']) * 100

print("Sentiment distribution by season:")
print(season_sentiment_pct.sort_values(['season', 'count'], ascending=[True, False]).groupby('season').head(5))


Sentiment distribution by season:
    season sentiment_label  count  total  percentage
19       1         neutral   1165   2080   56.009615
7        1       curiosity    179   2080    8.605769
4        1        approval    121   2080    5.817308
0        1      admiration    105   2080    5.048077
2        1           anger     83   2080    3.990385
44       2         neutral   1242   2157   57.579972
32       2       curiosity    157   2157    7.278628
27       2           anger    117   2157    5.424200
29       2        approval    111   2157    5.146036
35       2     disapproval     85   2157    3.940658
69       3         neutral   1311   2272   57.702465
57       3       curiosity    174   2272    7.658451
54       3        approval    136   2272    5.985915
52       3           anger    133   2272    5.853873
50       3      admiration     82   2272    3.609155


In [8]:
# 4. Sentiment evolution within sketches
# Calculate relative position in sketch (0-1)
sketch_lengths = df.groupby('sketch_id').size()
df['sketch_length'] = df['sketch_id'].map(sketch_lengths)
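# Assumes sentence_index is 1-based; the clip avoids a 0/0 division for one-sentence sketches
df['relative_position'] = (df['sentence_index'] - 1) / (df['sketch_length'] - 1).clip(lower=1)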

# Bin into beginning, middle, end
df['position_bin'] = pd.cut(df['relative_position'], 
                            bins=[0, 0.25, 0.75, 1.0], 
                            labels=['beginning', 'middle', 'end'],
                            include_lowest=True)

position_sentiment = df.groupby(['position_bin', 'sentiment_label']).size().reset_index(name='count')
print("Sentiment by position in sketch:")
print(position_sentiment.sort_values(['position_bin', 'count'], ascending=[True, False]).groupby('position_bin').head(5))


Sentiment by position in sketch:
   position_bin sentiment_label  count
19    beginning         neutral    950
7     beginning       curiosity    125
0     beginning      admiration    102
4     beginning        approval    101
2     beginning           anger     54
44       middle         neutral   1862
32       middle       curiosity    257
29       middle        approval    187
27       middle           anger    183
25       middle      admiration    118
69          end         neutral    906
57          end       curiosity    128
52          end           anger     96
54          end        approval     80
60          end     disapproval     66




In [10]:
position_sentiment

index,position_bin,sentiment_label,count
0,beginning,admiration,102
1,beginning,amusement,12
2,beginning,anger,54
3,beginning,annoyance,16
4,beginning,approval,101
...,...,...,...
70,end,optimism,12
71,end,realization,4
72,end,remorse,11
73,end,sadness,22


In [None]:
# 6. Most emotional vs. most neutral sketches
sketch_stats = df.groupby('sketch_id').agg({
    'sentiment_label': lambda x: (x == 'neutral').sum() / len(x),  # % neutral
    'sentiment_score': ['mean', 'std'],
    'sentence_index': 'count'  # sketch length
}).reset_index()
sketch_stats.columns = ['sketch_id', 'pct_neutral', 'avg_score', 'score_std', 'length']

# Add sketch names and category
sketch_stats = sketch_stats.merge(
    df[['sketch_id', 'sketch_name']].drop_duplicates(),
    on='sketch_id'
)
if 'category' in df_merged.columns:
    sketch_stats = sketch_stats.merge(
        df_merged[['sketch_id', 'category']].drop_duplicates(),
        on='sketch_id',
        how='left'
    )

# Show 10 sketches with lowest % neutral (most emotional)
most_emotional = sketch_stats.nsmallest(10, 'pct_neutral')[['sketch_id', 'sketch_name', 'category', 'pct_neutral', 'avg_score']]
print("Most emotional sketches (top 10) as DataFrame:")
display(most_emotional)

# Show 10 sketches with highest % neutral (most neutral)
most_neutral = sketch_stats.nlargest(10, 'pct_neutral')[['sketch_id', 'sketch_name', 'category', 'pct_neutral', 'avg_score']]
print("Most neutral sketches (top 10) as DataFrame:")
display(most_neutral)




Most emotional sketches (top 10) as DataFrame:


index,sketch_id,sketch_name,category,pct_neutral,avg_score
27,28,Bozo #1,office,0.346154,0.807639
52,53,Dave Campor,commercial,0.363636,0.741398
42,43,Detective Crashmore Trailer,movie trailer,0.390244,0.72009
49,50,Parking Lot,driving,0.413793,0.732421
0,1,Both Ways,office,0.421053,0.836869
8,9,Pink Bag,office,0.421875,0.669075
75,76,Pacific Proposal Park,commercial,0.425,0.759125
73,74,Rat Mom,office,0.426829,0.709768
61,62,Summer Loving,reality tv,0.431373,0.70504
47,48,Friends Weekend,party,0.442857,0.729108


Most neutral sketches (top 10) as DataFrame:


index,sketch_id,sketch_name,category,pct_neutral,avg_score
64,65,Supermarket Swap VR Edition,game show,0.810127,0.762706
77,78,Summer Loving Pt. 2,dating,0.75,0.903059
1,2,Has This Ever Happened To You,commercial,0.727273,0.750548
23,24,The Day Robert Palin's Murdered Me,music,0.724638,0.80193
32,33,Coffin Flop,commercial,0.722222,0.759734
56,57,Tammy Craps,commercial,0.714286,0.79169
31,32,Lunch Meeting,office,0.711111,0.697896
48,49,Calico Cut Pants,office,0.697917,0.737821
26,27,Chunky,game show,0.690265,0.762241
83,84,Photo Wall of Metal: Metal Motto Search,game show,0.684211,0.788549


In [14]:
# examining emotion counts for individual sketches
sk_id = 43
emotion_counts = df[df['sketch_id'] == sk_id]['sentiment_label'].value_counts()
sketch_name = df.loc[df['sketch_id'] == sk_id, 'sketch_name'].iloc[0]
print(f"Emotion counts for sketch {sk_id} ({sketch_name}):")
emotion_counts


Emotion counts for sketch 43 (Detective Crashmore Trailer):


sentiment_label
neutral        16
anger          10
annoyance       4
excitement      3
disapproval     2
sadness         1
remorse         1
caring          1
love            1
admiration      1
approval        1
Name: count, dtype: int64

In [35]:
# Show 5 random sentences (df.sample is unseeded here, so the rows differ on each run)
sample_rows = df.sample(5)
sample_display_multi = sample_rows[['sketch_id', 'sentence_text', 'sentiment_label', 'sentiment_score']]
sample_display_multi


index,sketch_id,sentence_text,sentiment_label,sentiment_score
1612,25,It's not a problem!,approval,0.68821
1499,24,The don't want to hear songs they can hear in ...,neutral,0.738013
3135,47,I should quit I don't even know what I'm doing.,confusion,0.734979
4646,65,Get him out of there!,neutral,0.676165
4915,69,I did.,neutral,0.586206


## Heatmap: Emotions Over Time for Each Sketch

Prepare data for a heatmap showing emotion evolution across all 86 sketches.
- Y-axis: Sketches (ordered by season, episode, sketch_id)
- X-axis: Time (normalized position within sketch, binned)
- Color: Emotion type


In [None]:
# Prepare data for sketch × time heatmap
# Normalize time position for each sketch (0-100)
df['time_percent'] = df['relative_position'] * 100

# Create time bins (e.g., 0-5%, 5-10%, etc.)
# Using 20 bins (5% increments) for good resolution without too much granularity
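NUM_TIME_BINS = 20
# Explicit edges keep each bin aligned with its label even if no sentence sits exactly at 0% or 100%
time_bin_edges = [i * 100 / NUM_TIME_BINS for i in range(NUM_TIME_BINS + 1)]
df['time_bin'] = pd.cut(df['time_percent'], 
                        bins=time_bin_edges, 
                        labels=[f"{i*100/NUM_TIME_BINS:.1f}-{(i+1)*100/NUM_TIME_BINS:.1f}%" 
                                for i in range(NUM_TIME_BINS)],
                        include_lowest=True)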

# Get sketch metadata for ordering
sketch_order = df[['sketch_id', 'sketch_name', 'season', 'episode']].drop_duplicates().sort_values(
    ['season', 'episode', 'sketch_id']
).reset_index(drop=True)
sketch_order['sketch_order'] = sketch_order.index

# Add ordering to main dataframe
df = df.merge(sketch_order[['sketch_id', 'sketch_order']], on='sketch_id')

print(f"Time bins: {NUM_TIME_BINS} bins ({100/NUM_TIME_BINS:.1f}% each)")
print(f"Total sketches: {len(sketch_order)}")
print(f"\nFirst few sketches in order:")
print(sketch_order.head(10))


In [None]:
# For each sketch × time_bin, capture ALL emotions (not just dominant)
# This allows visualization of the full range of emotions
heatmap_data_all = df.groupby(['sketch_id', 'time_bin', 'sentiment_label'], observed=True).size().reset_index(name='count')

# Calculate total sentences per bin for percentages
bin_totals = df.groupby(['sketch_id', 'time_bin'], observed=True).size().reset_index(name='total_sentences')
heatmap_data_all = heatmap_data_all.merge(bin_totals, on=['sketch_id', 'time_bin'])
heatmap_data_all['percentage'] = (heatmap_data_all['count'] / heatmap_data_all['total_sentences']) * 100

# Also determine the dominant emotion for comparison
heatmap_data_sorted = heatmap_data_all.sort_values(['sketch_id', 'time_bin', 'count'], ascending=[True, True, False])
dominant_emotions = heatmap_data_sorted.groupby(['sketch_id', 'time_bin'], observed=True).first().reset_index()
dominant_emotions = dominant_emotions[['sketch_id', 'time_bin', 'sentiment_label', 'count', 'total_sentences']]
dominant_emotions['dominance_ratio'] = dominant_emotions['count'] / dominant_emotions['total_sentences']

print("Sample of ALL emotions by sketch × time bin (showing range):")
print(heatmap_data_all.head(20))
print(f"\nTotal emotion × time_bin combinations: {len(heatmap_data_all)}")
print(f"Average emotions per sketch × time_bin: {heatmap_data_all.groupby(['sketch_id', 'time_bin'], observed=True).size().mean():.2f}")


In [None]:
# Add sketch metadata and ordering to ALL emotions dataset
heatmap_all_emotions = heatmap_data_all.merge(
    sketch_order[['sketch_id', 'sketch_name', 'season', 'episode', 'sketch_order']],
    on='sketch_id'
)

# Add category if available
if 'category' in df_merged.columns:
    sketch_categories = df_merged[['sketch_id', 'category', 'category2']].drop_duplicates()
    heatmap_all_emotions = heatmap_all_emotions.merge(sketch_categories, on='sketch_id', how='left')

# Convert time_bin to numeric for easier sorting/plotting
heatmap_all_emotions['time_bin_start'] = heatmap_all_emotions['time_bin'].astype(str).str.split('-').str[0].str.rstrip('%').astype(float)
heatmap_all_emotions['time_bin_end'] = heatmap_all_emotions['time_bin'].astype(str).str.split('-').str[1].str.rstrip('%').astype(float)
heatmap_all_emotions['time_bin_center'] = (heatmap_all_emotions['time_bin_start'] + heatmap_all_emotions['time_bin_end']) / 2

# Sort by sketch order, time bin, and count (descending)
heatmap_all_emotions = heatmap_all_emotions.sort_values(['sketch_order', 'time_bin_start', 'count'], ascending=[True, True, False]).reset_index(drop=True)

# Also create dominant-only version for comparison
heatmap_export = dominant_emotions.merge(
    sketch_order[['sketch_id', 'sketch_name', 'season', 'episode', 'sketch_order']],
    on='sketch_id'
)
if 'category' in df_merged.columns:
    heatmap_export = heatmap_export.merge(sketch_categories, on='sketch_id', how='left')
heatmap_export['time_bin_start'] = heatmap_export['time_bin'].astype(str).str.split('-').str[0].str.rstrip('%').astype(float)
heatmap_export['time_bin_end'] = heatmap_export['time_bin'].astype(str).str.split('-').str[1].str.rstrip('%').astype(float)
heatmap_export['time_bin_center'] = (heatmap_export['time_bin_start'] + heatmap_export['time_bin_end']) / 2
heatmap_export = heatmap_export.sort_values(['sketch_order', 'time_bin_start']).reset_index(drop=True)

print(f"ALL emotions heatmap data shape: {heatmap_all_emotions.shape}")
print(f"Dominant-only heatmap data shape: {heatmap_export.shape}")
print(f"\nSample of ALL emotions data (shows range):")
print(heatmap_all_emotions[['sketch_name', 'time_bin', 'sentiment_label', 'count', 'percentage']].head(20))


In [None]:
# Also create an alternative version with average sentiment score per bin
# This could be useful for a continuous color scale
heatmap_score_data = df.groupby(['sketch_id', 'time_bin'], observed=True).agg({
    'sentiment_score': 'mean',
    'sentiment_label': lambda x: x.mode()[0] if len(x.mode()) > 0 else 'neutral'  # most common emotion
}).reset_index()
heatmap_score_data.columns = ['sketch_id', 'time_bin', 'avg_sentiment_score', 'most_common_emotion']

# Add metadata
heatmap_score_export = heatmap_score_data.merge(
    sketch_order[['sketch_id', 'sketch_name', 'season', 'episode', 'sketch_order']],
    on='sketch_id'
)
if 'category' in df_merged.columns:
    heatmap_score_export = heatmap_score_export.merge(sketch_categories, on='sketch_id', how='left')

# Add time bin numeric values
heatmap_score_export['time_bin_start'] = heatmap_score_export['time_bin'].astype(str).str.split('-').str[0].str.rstrip('%').astype(float)
heatmap_score_export['time_bin_end'] = heatmap_score_export['time_bin'].astype(str).str.split('-').str[1].str.rstrip('%').astype(float)
heatmap_score_export['time_bin_center'] = (heatmap_score_export['time_bin_start'] + heatmap_score_export['time_bin_end']) / 2

heatmap_score_export = heatmap_score_export.sort_values(['sketch_order', 'time_bin_start']).reset_index(drop=True)

print("Alternative version with sentiment scores:")
print(heatmap_score_export[['sketch_name', 'time_bin', 'most_common_emotion', 'avg_sentiment_score']].head(15))


In [None]:
# Create comprehensive exports with multiple versions
# Version 1: ALL emotions per time bin (shows full range)
heatmap_all_final = heatmap_all_emotions[[
    'sketch_id', 'sketch_name', 'sketch_order', 'season', 'episode',
    'time_bin', 'time_bin_start', 'time_bin_end', 'time_bin_center',
    'sentiment_label', 'count', 'total_sentences', 'percentage'
]].copy()  # copy so the column assignments below do not write to a slice

if 'category' in heatmap_all_emotions.columns:
    heatmap_all_final['category'] = heatmap_all_emotions['category']
    if 'category2' in heatmap_all_emotions.columns:
        heatmap_all_final['category2'] = heatmap_all_emotions['category2']

# Version 2: Dominant emotion only (simpler heatmap)
heatmap_final = heatmap_export[[
    'sketch_id', 'sketch_name', 'sketch_order', 'season', 'episode',
    'time_bin', 'time_bin_start', 'time_bin_end', 'time_bin_center',
    'sentiment_label', 'count', 'total_sentences', 'dominance_ratio'
]].copy()  # copy so the column assignments below do not write to a slice

if 'category' in heatmap_export.columns:
    heatmap_final['category'] = heatmap_export['category']
    if 'category2' in heatmap_export.columns:
        heatmap_final['category2'] = heatmap_export['category2']

# Version 3: Score-based version
heatmap_score_final = heatmap_score_export[[
    'sketch_id', 'sketch_name', 'sketch_order', 'season', 'episode',
    'time_bin', 'time_bin_start', 'time_bin_end', 'time_bin_center',
    'most_common_emotion', 'avg_sentiment_score'
]].copy()  # copy so the column assignments below do not write to a slice

if 'category' in heatmap_score_export.columns:
    heatmap_score_final['category'] = heatmap_score_export['category']
    if 'category2' in heatmap_score_export.columns:
        heatmap_score_final['category2'] = heatmap_score_export['category2']

# Calculate emotion diversity metrics per sketch
emotion_diversity = heatmap_all_emotions.groupby('sketch_id').agg({
    'sentiment_label': 'nunique',  # number of unique emotions
    'count': 'sum'  # total sentences
}).reset_index()
emotion_diversity.columns = ['sketch_id', 'num_unique_emotions', 'total_sentences']
emotion_diversity = emotion_diversity.merge(
    sketch_order[['sketch_id', 'sketch_name', 'season', 'episode', 'sketch_order']],
    on='sketch_id'
)

print("Final heatmap datasets prepared:")
print(f"  - ALL emotions version (shows range): {len(heatmap_all_final)} rows")
print(f"  - Dominant emotion version: {len(heatmap_final)} rows")
print(f"  - Score-based version: {len(heatmap_score_final)} rows")
print(f"\nUnique emotions across all sketches: {heatmap_all_final['sentiment_label'].nunique()}")
print(f"Emotions: {sorted(heatmap_all_final['sentiment_label'].unique())}")
print(f"\nEmotion diversity per sketch:")
print(f"  Average unique emotions per sketch: {emotion_diversity['num_unique_emotions'].mean():.2f}")
print(f"  Range: {emotion_diversity['num_unique_emotions'].min()} - {emotion_diversity['num_unique_emotions'].max()} unique emotions")


In [None]:
# Update the export JSON to include heatmap data
# Read existing export if it exists, otherwise create new structure
try:
    with open('sentiment_analysis_export.json', 'r') as f:
        export_data = json.load(f)
except FileNotFoundError:
    export_data = {}

# Add heatmap data - ALL versions
export_data['sketch_heatmap_all'] = heatmap_all_final.to_dict('records')  # ALL emotions (shows range)
export_data['sketch_heatmap'] = heatmap_final.to_dict('records')  # Dominant only
export_data['sketch_heatmap_scores'] = heatmap_score_final.to_dict('records')  # Score-based
export_data['emotion_diversity'] = emotion_diversity.to_dict('records')  # Diversity metrics per sketch

# Add metadata about the heatmap
export_data['heatmap_metadata'] = {
    'num_time_bins': NUM_TIME_BINS,
    'bin_size_percent': 100 / NUM_TIME_BINS,
    'num_sketches': len(sketch_order),
    'sketch_order': sketch_order[['sketch_id', 'sketch_name', 'season', 'episode', 'sketch_order']].to_dict('records'),
    'unique_emotions': sorted(heatmap_all_final['sentiment_label'].unique().tolist()),
    'avg_emotions_per_bin': heatmap_all_emotions.groupby(['sketch_id', 'time_bin'], observed=True).size().mean()
}

# Save updated export
with open('sentiment_analysis_export.json', 'w') as f:
    json.dump(export_data, f, indent=2)

print("Heatmap data added to sentiment_analysis_export.json")
print(f"\nHeatmap structure:")
print(f"  - sketch_heatmap_all: ALL emotions per sketch × time bin (shows full range)")
print(f"  - sketch_heatmap: Dominant emotion per sketch × time bin (simpler)")
print(f"  - sketch_heatmap_scores: Average sentiment score per sketch × time bin")
print(f"  - emotion_diversity: Emotion diversity metrics per sketch")
print(f"  - heatmap_metadata: Configuration and reference data")
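
One caveat worth checking before loading this file in a browser: Python's `json` module writes missing values as a bare `NaN` token by default, which is not valid JSON and is rejected by JavaScript's `JSON.parse`. If any exported field (for example `category2`) turns out to be empty for some sketches, which is an assumption here and not something verified above, a cleanup pass along these lines may help. `nan_to_none` is a hypothetical helper, not part of the notebook.

```python
import json
import math


def nan_to_none(value):
    """Recursively replace float NaN with None so json.dumps emits null."""
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, dict):
        return {key: nan_to_none(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [nan_to_none(item) for item in value]
    return value


# Hypothetical record with a missing secondary category
records = [{"sketch_id": 1, "category": "office", "category2": float("nan")}]

# allow_nan=False makes json.dumps raise if any NaN slipped through the cleanup
payload = json.dumps(nan_to_none(records), allow_nan=False)
print(payload)
```

Passing `allow_nan=False` is a cheap guard: the export fails loudly instead of producing a file the front end cannot parse.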


In [None]:
# Preview: Show samples of both data structures
print("=" * 80)
print("SAMPLE 1: Dominant emotion only (simpler heatmap)")
print("=" * 80)
sample_sketch_dominant = heatmap_final[heatmap_final['sketch_order'] == 0].head(5)
print(sample_sketch_dominant[['sketch_name', 'time_bin', 'sentiment_label', 'dominance_ratio']].to_string(index=False))

print("\n" + "=" * 80)
print("SAMPLE 2: ALL emotions (shows full range - multiple rows per time bin)")
print("=" * 80)
sample_sketch_all = heatmap_all_final[heatmap_all_final['sketch_order'] == 0].head(10)
print(sample_sketch_all[['sketch_name', 'time_bin', 'sentiment_label', 'count', 'percentage']].to_string(index=False))

print("\n" + "=" * 80)
print("DATA STRUCTURE FOR OBSERVABLE PLOT")
print("=" * 80)
print("\nFor visualizing RANGE of emotions, use 'sketch_heatmap_all':")
print("  - Multiple records per sketch × time_bin (one per emotion present)")
print("  - Each record contains:")
print("    - sketch_id, sketch_name, sketch_order (for Y-axis ordering)")
print("    - season, episode (for grouping/filtering)")
print("    - time_bin, time_bin_start, time_bin_end, time_bin_center (for X-axis)")
print("    - sentiment_label (for color mapping)")
print("    - count, percentage (how many sentences, what % of bin)")
print("    - category, category2 (if available, for filtering/grouping)")
print("\nFor simpler dominant-only heatmap, use 'sketch_heatmap':")
print("  - One record per sketch × time_bin (dominant emotion only)")
print("  - Same fields as above, plus 'dominance_ratio'")


## Data Structure: Emotions in Order of Appearance per Sketch

Create a structured data format that organizes emotions chronologically for each sketch, making it easy to visualize the emotional arc over time.

In [None]:
# Ensure data is sorted by sketch_id and sentence_index for chronological order
# Use df_merged if it exists, otherwise use df (which has season/episode from CSV)
if 'df_merged' in globals() and df_merged is not None:
    df_sorted = df_merged.sort_values(['sketch_id', 'sentence_index']).reset_index(drop=True)
else:
    # Merge with metadata if df_merged doesn't exist
    if 'metadata' not in globals():
        metadata = pd.read_csv('episode_metadata.csv')
    df_sorted = df.merge(metadata, left_on='sketch_id', right_on='id', how='left')
    df_sorted = df_sorted.sort_values(['sketch_id', 'sentence_index']).reset_index(drop=True)

# Get all unique emotions for reference
all_emotions = sorted(df_sorted['sentiment_label'].unique())
print(f"Total unique emotions: {len(all_emotions)}")
print(f"All emotions: {all_emotions}")
print(f"\nColumns in df_sorted: {list(df_sorted.columns)}")

In [None]:
# Create the main data structure: emotions in order of appearance per sketch
emotions_by_sketch = {}

# Process each sketch
for sketch_id in sorted(df_sorted['sketch_id'].unique()):
    sketch_data = df_sorted[df_sorted['sketch_id'] == sketch_id].copy()
    
    # Get sketch metadata (should be same for all rows of a sketch)
    first_row = sketch_data.iloc[0]
    
    # Create ordered list of emotions with their positions
    emotion_sequence = []
    for idx, row in sketch_data.iterrows():
        emotion_sequence.append({
            'position': int(row['sentence_index']),
            'emotion': row['sentiment_label'],
            'confidence': float(row['sentiment_score']),
            'sentence_text': row['sentence_text']
        })
    
    # Store in dictionary (use .get() for columns that might not exist)
    emotions_by_sketch[sketch_id] = {
        'sketch_id': int(sketch_id),
        'sketch_name': first_row.get('sketch_name', ''),
        'season': int(first_row.get('season', 0)),
        'episode': int(first_row.get('episode', 0)),
        'category': first_row.get('category', ''),
        'category2': first_row.get('category2', ''),
        'start_time': first_row.get('start', ''),
        'end_time': first_row.get('end', ''),
        'total_sentences': len(emotion_sequence),
        'emotion_sequence': emotion_sequence,  # Ordered list of emotions
        'emotion_counts': sketch_data['sentiment_label'].value_counts().to_dict(),  # Count of each emotion
        'unique_emotions': sorted(sketch_data['sentiment_label'].unique().tolist())  # Unique emotions in this sketch
    }

print(f"Created data structure for {len(emotions_by_sketch)} sketches")
# Get first available sketch ID for example
first_sketch_id = min(emotions_by_sketch.keys()) if emotions_by_sketch else None
if first_sketch_id is not None:
    print(f"\nExample structure for sketch {first_sketch_id}:")
    example_keys = ['sketch_id', 'sketch_name', 'season', 'episode', 'category', 'total_sentences', 'unique_emotions']
    example_dict = {k: emotions_by_sketch[first_sketch_id][k] for k in example_keys}
    print(json.dumps(example_dict, indent=2))
    print(f"\nFirst 5 emotions in sequence:")
    for i, emo in enumerate(emotions_by_sketch[first_sketch_id]['emotion_sequence'][:5]):
        print(f"  {i+1}. Position {emo['position']}: {emo['emotion']} (confidence: {emo['confidence']:.3f})")
else:
    print("No sketches found in data")

In [None]:
# Create a normalized position-based structure for easier charting
# This converts sentence positions to normalized positions (0.0 to 1.0) within each sketch

emotions_by_sketch_normalized = {}

for sketch_id, data in emotions_by_sketch.items():
    emotion_sequence_normalized = []
    total = data['total_sentences']
    
    for emo in data['emotion_sequence']:
        normalized_position = (emo['position'] - 1) / max(total - 1, 1)  # 0.0 to 1.0
        emotion_sequence_normalized.append({
            'normalized_position': float(normalized_position),
            'position': emo['position'],
            'emotion': emo['emotion'],
            'confidence': emo['confidence']
        })
    
    emotions_by_sketch_normalized[sketch_id] = {
        **{k: v for k, v in data.items() if k != 'emotion_sequence'},
        'emotion_sequence_normalized': emotion_sequence_normalized
    }

print(f"Created normalized position structure for {len(emotions_by_sketch_normalized)} sketches")
# Get first available sketch ID for example
first_sketch_id = min(emotions_by_sketch_normalized.keys()) if emotions_by_sketch_normalized else None
if first_sketch_id is not None:
    print(f"\nExample normalized sequence for sketch {first_sketch_id} (first 5):")
    for emo in emotions_by_sketch_normalized[first_sketch_id]['emotion_sequence_normalized'][:5]:
        print(f"  Position {emo['normalized_position']:.3f} (sentence {emo['position']}): {emo['emotion']}")
else:
    print("No sketches found in normalized data")

In [None]:
# Create a long-format DataFrame for easier plotting
# Each row represents one emotion occurrence at a specific position in a sketch

plotting_data = []

for sketch_id, data in emotions_by_sketch.items():
    for emo in data['emotion_sequence']:
        plotting_data.append({
            'sketch_id': data['sketch_id'],
            'sketch_name': data['sketch_name'],
            'season': data['season'],
            'episode': data['episode'],
            'category': data['category'],
            'category2': data['category2'],
            'position': emo['position'],
            'normalized_position': (emo['position'] - 1) / max(data['total_sentences'] - 1, 1),
            'emotion': emo['emotion'],
            'confidence': emo['confidence']
        })

df_emotions_plotting = pd.DataFrame(plotting_data)

print(f"Plotting DataFrame shape: {df_emotions_plotting.shape}")
print(f"\nFirst 10 rows:")
df_emotions_plotting.head(10)
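
One possible use of the long format is to bin `normalized_position` into beginning, middle, and end segments and count emotions per segment. This is a minimal sketch on invented toy rows, not the real `df_emotions_plotting`; the bin edges (first 25%, middle 50%, last 25%) and the `segment` column name are assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-in for df_emotions_plotting (values are invented for illustration)
toy = pd.DataFrame({
    'sketch_id': [1, 1, 1, 1, 1],
    'normalized_position': [0.0, 0.25, 0.5, 0.75, 1.0],
    'emotion': ['neutral', 'joy', 'neutral', 'anger', 'anger'],
})

# Bins are right-inclusive; include_lowest keeps position 0.0 in the first bin
toy['segment'] = pd.cut(
    toy['normalized_position'],
    bins=[0.0, 0.25, 0.75, 1.0],
    labels=['beginning', 'middle', 'end'],
    include_lowest=True,
)

# Rows are segments, columns are emotions, cells are occurrence counts
segment_counts = pd.crosstab(toy['segment'], toy['emotion'])
print(segment_counts)
```

The same two steps should apply to the full DataFrame, and adding `category` to the crosstab index would split the counts by sketch category.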

In [None]:
# Create a summary DataFrame for easier analysis
sketch_summaries = []

for sketch_id, data in emotions_by_sketch.items():
    # Create a simple list of emotions in order
    emotion_list = [e['emotion'] for e in data['emotion_sequence']]
    
    sketch_summaries.append({
        'sketch_id': data['sketch_id'],
        'sketch_name': data['sketch_name'],
        'season': data['season'],
        'episode': data['episode'],
        'category': data['category'],
        'category2': data['category2'],
        'total_sentences': data['total_sentences'],
        'num_unique_emotions': len(data['unique_emotions']),
        'emotion_sequence': emotion_list,  # List of emotions in order
        'emotion_string': ' -> '.join(emotion_list),  # String representation for quick viewing
        'emotion_counts': data['emotion_counts']
    })

df_sketch_emotions_summary = pd.DataFrame(sketch_summaries)

print(f"Summary DataFrame shape: {df_sketch_emotions_summary.shape}")
print(f"\nFirst few sketches:")
df_sketch_emotions_summary[['sketch_id', 'sketch_name', 'season', 'episode', 'total_sentences', 'num_unique_emotions', 'emotion_string']].head(10)
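
Because `emotion_sequence` keeps the emotions in sentence order, transitions between consecutive sentences can be counted directly from it. This is a minimal sketch on a hypothetical sequence; the emotion values are invented and `transitions` is an illustrative name.

```python
from collections import Counter

# Hypothetical emotion sequence, in sentence order
emotion_list = ['neutral', 'neutral', 'joy', 'anger', 'neutral', 'joy']

# Pair each emotion with the one that follows it
transitions = Counter(zip(emotion_list, emotion_list[1:]))

for (src, dst), count in transitions.most_common():
    print(f"{src} -> {dst}: {count}")
```

A sequence of n sentences yields n - 1 transitions, so a single-sentence sketch contributes none.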

In [None]:
# Display statistics about the data structure
print("=" * 60)
print("DATA STRUCTURE STATISTICS")
print("=" * 60)

print(f"\nTotal sketches: {len(emotions_by_sketch)}")
print(f"Total emotion occurrences: {len(df_emotions_plotting)}")
print(f"Total unique emotions: {len(all_emotions)}")
print(f"\nAll {len(all_emotions)} emotions:")
for i, emo in enumerate(all_emotions, 1):
    print(f"  {i:2d}. {emo}")

print(f"\n\nSketches by season:")
print(df_emotions_plotting.groupby('season')['sketch_id'].nunique())

print(f"\n\nSketches by episode:")
episode_counts = df_emotions_plotting.groupby(['season', 'episode'])['sketch_id'].nunique()
print(episode_counts)

print(f"\n\nEmotion occurrences per sketch:")
emotions_per_sketch = df_emotions_plotting.groupby('sketch_id').size()
print(f"  Mean: {emotions_per_sketch.mean():.1f}")
print(f"  Median: {emotions_per_sketch.median():.1f}")
print(f"  Min: {emotions_per_sketch.min()}")
print(f"  Max: {emotions_per_sketch.max()}")

print(f"\n\nEmotion distribution across all sketches:")
emotion_counts = df_emotions_plotting['emotion'].value_counts()
print(emotion_counts)

### Data Structure Summary

**Main Data Structures Created:**

1. **`emotions_by_sketch`** (dict): 
   - Key: sketch_id
   - Value: Dictionary containing:
     - Sketch metadata (id, name, season, episode, category, etc.)
     - `emotion_sequence`: List of emotions in order of appearance with position, confidence, and sentence text
     - `emotion_counts`: Count of each emotion type
     - `unique_emotions`: List of unique emotions in the sketch

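2. **`emotions_by_sketch_normalized`** (dict):
   - Same metadata as above, but `emotion_sequence` is replaced by `emotion_sequence_normalized`, whose entries hold `normalized_position` (0.0 to 1.0), `position`, `emotion`, and `confidence` for easier comparison across sketches of different lengths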

3. **`df_emotions_plotting`** (DataFrame):
   - Long-format DataFrame with one row per emotion occurrence
   - Columns: sketch_id, sketch_name, season, episode, category, category2, position, normalized_position, emotion, confidence
   - Ideal for creating charts with libraries like matplotlib, plotly, or seaborn

4. **`df_sketch_emotions_summary`** (DataFrame):
   - Summary DataFrame with one row per sketch
   - Includes emotion_sequence as a list and emotion_string for quick viewing

In [64]:
# Save the data structures for later use
# Helper function to convert numpy types to native Python types for JSON serialization
def convert_to_json_serializable(obj):
    """Recursively convert numpy types to native Python types"""
    import numpy as np
    if isinstance(obj, dict):
        # Convert keys: numpy int64 -> regular Python int (json.dump rejects numpy keys and writes int keys as strings)
        result = {}
        for k, v in obj.items():
            # Convert numpy int64/int32 to regular Python int
            if isinstance(k, (np.integer, np.int64, np.int32)):
                key = int(k)
            else:
                key = k
            result[key] = convert_to_json_serializable(v)
        return result
    elif isinstance(obj, list):
        return [convert_to_json_serializable(item) for item in obj]
    elif isinstance(obj, (np.integer, np.int64, np.int32)):
        return int(obj)
    elif isinstance(obj, (np.floating, np.float64, np.float32)):
        return float(obj)
    elif isinstance(obj, np.ndarray):
        return obj.tolist()
    else:
        return obj

# 1. Full structured data (JSON)
emotions_by_sketch_json = convert_to_json_serializable(emotions_by_sketch)
with open('emotions_by_sketch_structured.json', 'w') as f:
    json.dump(emotions_by_sketch_json, f, indent=2)

# 2. Normalized version (JSON)
emotions_by_sketch_normalized_json = convert_to_json_serializable(emotions_by_sketch_normalized)
with open('emotions_by_sketch_normalized.json', 'w') as f:
    json.dump(emotions_by_sketch_normalized_json, f, indent=2)

# 3. Plotting DataFrame (CSV)
df_emotions_plotting.to_csv('emotions_plotting_data.csv', index=False)

# 4. Summary DataFrame (CSV) - note: emotion_sequence is joined with '|'; emotion_counts is written as its string representation
df_sketch_emotions_export = df_sketch_emotions_summary.copy()
df_sketch_emotions_export['emotion_sequence'] = df_sketch_emotions_export['emotion_sequence'].apply(lambda x: '|'.join(x))
df_sketch_emotions_export.to_csv('emotions_by_sketch_summary.csv', index=False)

print("Data structures saved:")
print("  ✓ emotions_by_sketch_structured.json: Full structured data")
print("  ✓ emotions_by_sketch_normalized.json: Normalized position data")
print("  ✓ emotions_plotting_data.csv: Long-format DataFrame for plotting")
print("  ✓ emotions_by_sketch_summary.csv: Summary DataFrame")

Data structures saved:
  ✓ emotions_by_sketch_structured.json: Full structured data
  ✓ emotions_by_sketch_normalized.json: Normalized position data
  ✓ emotions_plotting_data.csv: Long-format DataFrame for plotting
  ✓ emotions_by_sketch_summary.csv: Summary DataFrame
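
One thing to keep in mind when reloading the JSON files: JSON object keys are always strings, so integer sketch IDs come back as strings. This is a minimal sketch of the round trip on a toy dictionary (the contents are invented), with one way to restore integer keys after loading.

```python
import json

# Toy stand-in for emotions_by_sketch, keyed by integer sketch_id
toy = {3: {'sketch_name': 'example', 'total_sentences': 2}}

# Round trip through JSON text, as json.dump / json.load would do with a file
restored = json.loads(json.dumps(toy))
print(list(restored.keys()))  # ['3']

# Convert the top-level keys back to int after loading
restored_int_keys = {int(k): v for k, v in restored.items()}
print(restored_int_keys[3]['sketch_name'])  # example
```

The CSV outputs have a related quirk: the `'|'`-joined `emotion_sequence` column needs a `str.split('|')` to become a list again.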
