Social media platforms contain large volumes of user-generated text expressing opinions on real-world events.
However, this text is noisy, informal, and constantly evolving.

This project aims to analyze public sentiment on social media over time, compare different NLP approaches, and evaluate how reliable and interpretable these models are.

In [None]:
import pandas as pd
import os

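# Make sure the raw-data directory exists. The three GoEmotions CSV parts
# must already be in it, otherwise the reads below raise FileNotFoundError.
os.makedirs("data/raw", exist_ok=True)
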
df1 = pd.read_csv("data/raw/goemotions_1.csv")
df2 = pd.read_csv("data/raw/goemotions_2.csv")
df3 = pd.read_csv("data/raw/goemotions_3.csv")

df = pd.concat([df1, df2, df3], ignore_index=True)
print("Merged dataset shape:", df.shape)
print(df.columns)


Merged dataset shape: (211225, 37)
Index(['text', 'id', 'author', 'subreddit', 'link_id', 'parent_id',
       'created_utc', 'rater_id', 'example_very_unclear', 'admiration',
       'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion',
       'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust',
       'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy',
       'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief',
       'remorse', 'sadness', 'surprise', 'neutral'],
      dtype='object')
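The column list includes `rater_id` and `example_very_unclear`, which suggests that each row is one rater's annotation rather than one unique comment. Below is a minimal sketch of checks that may be worth running before the labels are collapsed. The toy frame and the helper name `summarize_annotations` are illustrative assumptions, not part of the notebook. If flagged rows carry no emotion label, a first-match rule with a "neutral" fallback would silently label them as neutral.

```python
import pandas as pd

def summarize_annotations(frame, label_cols):
    """Return basic counts describing how the rows were annotated."""
    labels_per_row = frame[label_cols].sum(axis=1)
    return {
        "rows": len(frame),
        "unique_ids": frame["id"].nunique(),
        "flagged_unclear": int(frame["example_very_unclear"].astype(bool).sum()),
        "rows_without_label": int((labels_per_row == 0).sum()),
        "rows_with_multiple_labels": int((labels_per_row > 1).sum()),
    }

# Made-up data that mimics the column layout printed above.
toy = pd.DataFrame({
    "id": ["a", "a", "b", "c"],
    "example_very_unclear": [False, False, True, False],
    "joy": [1, 1, 0, 0],
    "love": [1, 0, 0, 0],
    "neutral": [0, 0, 0, 1],
})

summary = summarize_annotations(toy, ["joy", "love", "neutral"])
print(summary)
```

On the real data, the same helper could be called as `summarize_annotations(df, emotion_cols)` once `emotion_cols` is defined.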


In [7]:
# List of emotion columns
emotion_cols = [
    'admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring',
    'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval',
    'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief',
    'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization',
    'relief', 'remorse', 'sadness', 'surprise', 'neutral'
]

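# Collapse the multi-label annotation to a single label: return the first
# emotion (in emotion_cols order) that is flagged with 1. Any additional
# flagged emotions are discarded, and rows with no flagged emotion fall
# back to "neutral".
def pick_emotion(row):
    for col in emotion_cols:
        if row[col] == 1:
            return col
    return "neutral"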

# Apply to dataset
df['emotion'] = df[emotion_cols].apply(pick_emotion, axis=1)
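Row-wise `apply` calls a Python function once per row, which can be slow on a frame of this size. A vectorized sketch that aims to reproduce the same rule (first flagged column in list order, otherwise "neutral") is shown below on a toy frame. The helper name `first_label` and the toy data are illustrative assumptions.

```python
import pandas as pd

def first_label(frame, label_cols, fallback="neutral"):
    """Name of the first column equal to 1 in each row, else the fallback."""
    flags = frame[label_cols].eq(1)
    # idxmax returns the first column holding the row maximum, so a row
    # with no flag at all would wrongly get the first column name.
    first = flags.astype(int).idxmax(axis=1)
    # Replace the result for rows without any flagged label.
    return first.where(flags.any(axis=1), fallback)

# Made-up rows: one label, two labels, no label.
toy = pd.DataFrame({
    "anger":   [0, 1, 0],
    "joy":     [1, 1, 0],
    "neutral": [0, 0, 0],
})
toy["emotion"] = first_label(toy, ["anger", "joy", "neutral"])
print(toy["emotion"].tolist())
```

Because the choice depends on column order, reordering `label_cols` changes which label wins for rows with several flagged emotions.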


In [8]:
import numpy as np
import pandas as pd

# Option A: use the original Reddit timestamp (seconds since the Unix epoch)
df['created_at'] = pd.to_datetime(df['created_utc'], unit='s')

# Option B: synthetic dates (if you want a controlled time series)
# df['created_at'] = pd.to_datetime(
#     np.random.choice(pd.date_range("2024-01-01","2025-01-01",freq="D"), len(df))
# )
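`pd.to_datetime(..., unit='s')` treats the values as seconds since the Unix epoch and returns timezone-naive timestamps. The sketch below shows two optional variations under stated assumptions: passing `utc=True` so the result is explicitly UTC, and seeding a NumPy generator so that synthetic dates are reproducible. The two epoch values were chosen to match the first and third timestamps in the notebook output below; the column name `synthetic_at` and the seed are hypothetical.

```python
import numpy as np
import pandas as pd

# Epoch seconds chosen for illustration only.
toy = pd.DataFrame({"created_utc": [1548381039, 1546427744]})

# Variation on Option A: timezone-aware UTC timestamps.
toy["created_at"] = pd.to_datetime(toy["created_utc"], unit="s", utc=True)

# Variation on Option B: reproducible synthetic dates via a seeded generator.
rng = np.random.default_rng(seed=42)
days = pd.date_range("2024-01-01", "2025-01-01", freq="D")
positions = rng.integers(0, len(days), size=len(toy))
toy["synthetic_at"] = days[positions]

print(toy)
```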


In [9]:
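# Keep only the columns needed downstream; .copy() avoids
# SettingWithCopyWarning if df_final is modified later.
df_final = df[['text', 'created_at', 'emotion']].copy()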
print(df_final.head())


                                                text          created_at  \
0                                    That game hurt. 2019-01-25 01:50:39   
1   >sexuality shouldn’t be a grouping category I... 2019-01-21 15:22:49   
2     You do right, if you don't care then fuck 'em! 2019-01-02 11:15:44   
3                                 Man I love reddit. 2019-01-20 06:17:34   
4  [NAME] was nowhere near them, he was by the Fa... 2019-01-05 06:10:01   

   emotion  
0  sadness  
1  neutral  
2  neutral  
3     love  
4  neutral  


In [10]:
os.makedirs("data/processed", exist_ok=True)
df_final.to_csv("data/processed/social_posts.csv", index=False)
print("Saved processed CSV to data/processed/social_posts.csv")


Saved processed CSV to data/processed/social_posts.csv
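CSV files store timestamps as plain text, so the `created_at` column does not come back as a datetime unless it is parsed on load. A minimal round-trip sketch, using a temporary directory and two rows taken from the preview above rather than the real file:

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "text": ["That game hurt.", "Man I love reddit."],
    "created_at": pd.to_datetime(["2019-01-25 01:50:39", "2019-01-20 06:17:34"]),
    "emotion": ["sadness", "love"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "social_posts.csv")
    toy.to_csv(path, index=False)
    # Without parse_dates the column is read back as text.
    plain = pd.read_csv(path)
    # With parse_dates the column is restored as datetimes.
    parsed = pd.read_csv(path, parse_dates=["created_at"])

print(plain["created_at"].dtype)
print(parsed["created_at"].dtype)
```

Later notebooks that read `data/processed/social_posts.csv` would likely want the `parse_dates=["created_at"]` form before any time-based grouping.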
