# Reddit Movie Sentiment Analysis - Data Collection / Sentiment + Emotion Analysis Script

## By Leonardo Ferreira

This script collects Reddit posts and comments about a given movie and scores each one with two pretrained models: a [sentiment analysis model](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest) and an [emotion recognition model](https://huggingface.co/cardiffnlp/twitter-roberta-base-emotion).

### Requirements
- `praw` (Python Reddit API Wrapper)
- `pandas`
- `nltk`
- `datetime` (part of the Python standard library, no installation needed)
- `torch`
- `scipy`
- `transformers`
- `python-dotenv` - if you want to store your API credentials in a .env file

# ⚠️ IMPORTANT: To use this script you must have access to the Reddit API

## Create a reddit account

- If you don't already have one, go to reddit's registration page: [https://www.reddit.com/register/](https://www.reddit.com/register/)

## Create a reddit application

- Go to your [App Preferences page](https://www.reddit.com/prefs/apps) while logged in.
- Scroll down to the bottom and click **"create another app"** (or **"create app"** if it's your first one).

## Fill in the application details

- Select **"script"** as the application type.
- Provide a name for your application (e.g., "Movie Sentiment Analysis Project").
- Add a brief description.
- For the **"about url"** and **"redirect uri"** fields, you can use `http://localhost:8080` as a placeholder.
- Click **"create app"** to submit.

## Get your credentials

- After creating the app, you'll see the **client ID** directly under the app name.
- The **client secret** will be displayed as **"secret"**.
- Make note of both, as you'll need them in your code.
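
One way to keep these values out of the source code is to read them from environment variables and fail early when one is missing. The sketch below is a minimal illustration that uses only the standard library; the variable names match the ones this script reads later with `os.getenv`, while the helper `read_reddit_credentials` and the placeholder values are hypothetical.

```python
import os

# Names of the environment variables the script expects.
REQUIRED_VARS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")


def read_reddit_credentials(env=None):
    """Return the three credential values, raising if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing Reddit credentials: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


# Placeholder mapping so the sketch runs without real secrets.
example_env = {
    "REDDIT_CLIENT_ID": "abc123",
    "REDDIT_CLIENT_SECRET": "not-a-real-secret",
    "REDDIT_USER_AGENT": "python:movie.sentiment.analyzer:v1.0 (by /u/your_username)",
}
credentials = read_reddit_credentials(example_env)
print(sorted(credentials))
```

Calling `read_reddit_credentials()` with no argument checks the real process environment, which is where `load_dotenv()` places the values from a `.env` file.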

## Example: Initializing the reddit API with PRAW

```python
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="python:movie.sentiment.analyzer:v1.0 (by /u/your_username)"
)
```

In [1]:
import praw
import pandas as pd
import datetime
import time
import re
import os
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax
# load_dotenv reads the Reddit API credentials from a .env file;
# alternatively, edit the script to pass your own credentials directly
from dotenv import load_dotenv

In [None]:
class RedditMovieDataCollector:
    def __init__(self, client_id, client_secret, user_agent):
        """
        initialize the reddit API client
        
        parameters:
        - client_id: your reddit API client ID
        - client_secret: your reddit API client secret
        - user_agent: unique identifier for your script
        """
        self.reddit = praw.Reddit(
            client_id=client_id,
            client_secret=client_secret,
            user_agent=user_agent
        )


        # initialize pretrained sentiment analysis model
        # using RoBERTa model fine-tuned for sentiment analysis on tweets
        # source: https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest
        self.sentiment_model_name = "cardiffnlp/twitter-roberta-base-sentiment-latest"
        self.sentiment_tokenizer = AutoTokenizer.from_pretrained(self.sentiment_model_name)
        self.sentiment_model = AutoModelForSequenceClassification.from_pretrained(self.sentiment_model_name)


        # do the same for emotion analysis
        # source: https://huggingface.co/cardiffnlp/twitter-roberta-base-emotion
        self.emotion_model_name = "cardiffnlp/twitter-roberta-base-emotion"
        self.emotion_tokenizer = AutoTokenizer.from_pretrained(self.emotion_model_name)
        self.emotion_model = AutoModelForSequenceClassification.from_pretrained(self.emotion_model_name)

        # get emotion labels dynamically from the model configuration
        self.emotion_labels = [
            self.emotion_model.config.id2label[i] 
            for i in range(len(self.emotion_model.config.id2label))
        ]

        # create emotion column names
        self.emotion_columns = [f"{label.lower()}_emotion" for label in self.emotion_labels]


        # create folder for data if it doesn't exist
        os.makedirs('movie_data', exist_ok=True)
    
    def clean_text(self, text):
        """
        clean and preprocess text data
        
        parameters:
        - text: text to clean
        
        returns:
        - cleaned text
        """
        if text is None:
            return ""
        
        # to lowercase
        text = text.lower()
        
        # remove urls
        text = re.sub(r'http\S+', '', text)
        
        # keep only letters and spaces
        text = re.sub(r'[^a-zA-Z\s]', '', text)
        
        # remove extra whitespace
        text = re.sub(r'\s+', ' ', text).strip()
        
        return text
    
    def get_sentiment_scores(self, text):
        """
        calculate sentiment scores for a given text using the pretrained RoBERTa sentiment model
        
        parameters:
        - text: text to analyze
        
        returns:
        - dictionary with sentiment scores
        """
        if not text:
            # return neutral sentiment if text is empty
            return {
                'compound': 0,
                'pos': 0,
                'neu': 1,
                'neg': 0
            }
        
        # cheap pre-truncation to 512 characters; the tokenizer call below
        # separately enforces the model's 512-token limit
        max_length = 512
        if len(text) > max_length:
            text = text[:max_length]
        
        encoded_input = self.sentiment_tokenizer(text, return_tensors='pt', truncation=True, max_length=512)

        # get model output
        with torch.no_grad():
            output = self.sentiment_model(**encoded_input)
        
        # get probabilities
        scores = softmax(output.logits[0].numpy())
        
        # map scores to sentiment categories (negative, neutral, positive)
        sentiment_scores = {
            'neg': float(scores[0]),
            'neu': float(scores[1]),
            'pos': float(scores[2]),
            'compound': float(scores[2] - scores[0])
        }
        
        return sentiment_scores
    
    def get_emotion_scores(self, text):
        """
        calculate emotion scores

        parameters:
        - text: text to analyze
        
        returns:
        - dictionary with emotion scores
        """
        if not text:
            # return a score of 0 for every emotion if text is empty
            return {label.lower(): 0 for label in self.emotion_labels}
        
        # cheap pre-truncation to 512 characters; the tokenizer call below
        # separately enforces the model's 512-token limit
        max_length = 512
        if len(text) > max_length:
            text = text[:max_length]
        
        encoded_input = self.emotion_tokenizer(text, return_tensors='pt', truncation=True, max_length=512)

        # get model output
        with torch.no_grad():
            output = self.emotion_model(**encoded_input)
        
        # get probabilities
        scores = softmax(output.logits[0].numpy())
        
        # map scores to emotion categories dynamically
        emotion_mapping = {
            self.emotion_model.config.id2label[i].lower(): float(scores[i]) 
            for i in range(len(self.emotion_labels))
        }
        
        return emotion_mapping
    
    def collect_reddit_data(self, movie_name, limit=100, time_filter="all"):
        """
        collect reddit data for a specific movie
        
        parameters:
        - movie_name: name of the movie to search for
        - limit: maximum number of posts to retrieve
        - time_filter: time period to search within (hour, day, week, month, year, all)
        
        returns:
        - df with collected data
        """
        
        posts_data = []
        comments_data = []
        
        # search for posts related to the movie
        search_query = movie_name
        posts = self.reddit.subreddit("all").search(
            search_query, 
            sort="relevance", 
            time_filter=time_filter, 
            limit=limit,
            syntax='lucene'
        )
        
        post_count = 0
        comment_count = 0
        
        for post in posts:
            post_count += 1
            
            # clean title and text
            clean_title = self.clean_text(post.title)
            clean_text = self.clean_text(post.selftext)
            combined_text = f"{clean_title} {clean_text}"
            
            # get sentiment scores
            sentiment_scores = self.get_sentiment_scores(combined_text)

            # get emotion scores
            emotion_scores = self.get_emotion_scores(combined_text)

            # process post
            post_data = {
                'id': post.id,
                'title': post.title,
                'text': post.selftext,
                'author': str(post.author),
                'score': post.score,
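                # pass tz explicitly; without it fromtimestamp returns local time, not UTC
                'created_utc': datetime.datetime.fromtimestamp(post.created_utc, tz=datetime.timezone.utc),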
                'subreddit': post.subreddit.display_name,
                'num_comments': post.num_comments,
                'compound_sentiment': sentiment_scores['compound'],
                'positive_sentiment': sentiment_scores['pos'],
                'neutral_sentiment': sentiment_scores['neu'],
                'negative_sentiment': sentiment_scores['neg'],
                **{f"{k}_emotion": v for k, v in emotion_scores.items()},
                'content_type': 'post'
            }
            posts_data.append(post_data)
            
            # fetch comments, dropping "load more comments" placeholders instead of expanding them to limit API calls
            post.comments.replace_more(limit=0)
            for comment in post.comments.list():
                comment_count += 1
                
                # clean comment text
                clean_comment = self.clean_text(comment.body)
                
                # get sentiment scores
                comment_sentiment = self.get_sentiment_scores(clean_comment)

                # get emotion scores
                comment_emotion = self.get_emotion_scores(clean_comment)

                # process comment
                comment_data = {
                    'id': comment.id,
                    'post_id': post.id,
                    'text': comment.body,
                    'author': str(comment.author),
                    'score': comment.score,
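                    # pass tz explicitly; without it fromtimestamp returns local time, not UTC
                    'created_utc': datetime.datetime.fromtimestamp(comment.created_utc, tz=datetime.timezone.utc),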
                    'subreddit': post.subreddit.display_name,
                    'compound_sentiment': comment_sentiment['compound'],
                    'positive_sentiment': comment_sentiment['pos'],
                    'neutral_sentiment': comment_sentiment['neu'],
                    'negative_sentiment': comment_sentiment['neg'],
                    **{f"{k}_emotion": v for k, v in comment_emotion.items()},
                    'content_type': 'comment'
                }
                comments_data.append(comment_data)
            
            # sleep to avoid rate limits
            time.sleep(0.5)
        
        # create dfs
        posts_df = pd.DataFrame(posts_data)
        comments_df = pd.DataFrame(comments_data)
        
        # save to csv
        timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
        movie_name_cleaned = movie_name.replace(" ", "_").lower()
        
        posts_filename = f"movie_data/{movie_name_cleaned}_posts_{timestamp}.csv"
        comments_filename = f"movie_data/{movie_name_cleaned}_comments_{timestamp}.csv"
        
        posts_df.to_csv(posts_filename, index=False)
        comments_df.to_csv(comments_filename, index=False)
        
        print(f"Data collection is complete: {post_count} posts and {comment_count} comments")
        print(f"Data saved to {posts_filename} and {comments_filename}")
        
        # combine posts and comments for analysis (a fresh index avoids duplicate row labels)
        all_data = pd.concat([posts_df, comments_df], ignore_index=True)
        
        return all_data
    
    def analyze_sentiment_and_emotion_distribution(self, data):
        """
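        analyze the sentiment and emotion distribution of collected data
        
        parameters:
        - data: df with collected data (a 'sentiment_category' column is added in place)
        
        returns:
        - dictionary with sentiment averages, category counts and percentages, and emotion statistics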
        """
        # calculate averages for sentiment scores
        sentiment_avg = {
            'Average compound score': data['compound_sentiment'].mean(),
            'Average positive score': data['positive_sentiment'].mean(),
            'Average neutral score': data['neutral_sentiment'].mean(),
            'Average negative score': data['negative_sentiment'].mean()
        }
        
        # categorize sentiments
        data['sentiment_category'] = pd.cut(
            data['compound_sentiment'],
            bins=[-2, -0.6, -0.2, 0.2, 0.6, 2],
            labels=['Very negative', 'Negative', 'Neutral', 'Positive', 'Very positive']
        )
        
        # count by category
        sentiment_counts = data['sentiment_category'].value_counts().to_dict()
        
        # calculate percentages
        total = sum(sentiment_counts.values())
        sentiment_percentages = {k: (v / total) * 100 for k, v in sentiment_counts.items()}
        
        # emotion analysis
        emotion_columns = [col for col in data.columns if col.endswith('_emotion')]

        # calculate average emotion scores
        emotion_avg = {f'Average {col.split("_")[0]} emotion': data[col].mean() for col in emotion_columns}
        
        # calculate total emotion scores across all content
        total_emotion_scores = {}
        for col in emotion_columns:
            emotion_name = col.split('_')[0]
            total_emotion_scores[emotion_name] = data[col].sum()
        
        # sort emotions
        sorted_emotions = sorted(total_emotion_scores.items(), key=lambda x: x[1], reverse=True)

        emotions_dict = {
            'emotions': [emotion for emotion, score in sorted_emotions],
            'emotions_scores': {emotion: score for emotion, score in sorted_emotions}
        }

        # calculate each emotion's share of the total emotion score
        total_emotion_score = sum(total_emotion_scores.values())
        emotions_percentages = {
            emotion: (score / total_emotion_score) * 100 
            for emotion, score in emotions_dict['emotions_scores'].items()
        }
        emotions_dict['emotions_percentages'] = emotions_percentages

        # combine results
        analysis_result = {
            'sentiment_avg': sentiment_avg,
            'sentiment_counts': sentiment_counts,
            'sentiment_percentages': sentiment_percentages,
            'emotion_avg': emotion_avg,
            'emotions': emotions_dict
        }
        
        return analysis_result
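
To see what `clean_text` keeps and discards, the sketch below repeats its four steps as a standalone function, using only the standard library, and runs them on a made-up review line. The sample string is hypothetical.

```python
import re


def clean_text(text):
    """Standalone copy of the cleaning steps used by the collector."""
    if text is None:
        return ""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)       # drop URLs
    text = re.sub(r"[^a-zA-Z\s]", "", text)   # drop digits, punctuation, emoji
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace


# Made-up input, chosen to exercise every step.
sample = "Don't miss it: 10/10!! https://example.com/review"
cleaned = clean_text(sample)
print(cleaned)  # dont miss it
```

Because only letters and whitespace survive, numeric ratings such as "10/10", apostrophes, and emoji never reach the models, which may matter when interpreting the scores.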

In [None]:
# load credentials from .env file
load_dotenv()
REDDIT_CLIENT_ID = os.getenv("REDDIT_CLIENT_ID")
REDDIT_CLIENT_SECRET = os.getenv("REDDIT_CLIENT_SECRET")
REDDIT_USER_AGENT = os.getenv("REDDIT_USER_AGENT")

# initialize collector with reddit credentials
collector = RedditMovieDataCollector(REDDIT_CLIENT_ID, REDDIT_CLIENT_SECRET, REDDIT_USER_AGENT)

# collect data for a given movie
# change the movie name to get sentiments about different movies
movie_name = "Flash"
# limit caps the number of posts retrieved (limit=50 means at most 50 posts); each post can have many comments, so large values may hit the API rate limit
limit = 100
data = collector.collect_reddit_data(movie_name, limit=limit)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Data collection is complete: 50 posts and 16303 comments
Data saved to movie_data/interstellar_posts_20250411_200432.csv and movie_data/interstellar_comments_20250411_200432.csv
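
Every row in the saved CSV files carries a `compound_sentiment` value. In `get_sentiment_scores` it is the softmax probability of the positive class minus that of the negative class. The sketch below reproduces that arithmetic with the standard library only; the logits are hypothetical numbers, not real model output, and `compound_from_logits` is an illustrative helper that is not part of the script.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def compound_from_logits(logits):
    """Map [negative, neutral, positive] logits to the script's score dict."""
    neg, neu, pos = softmax(logits)
    return {"neg": neg, "neu": neu, "pos": pos, "compound": pos - neg}


# Hypothetical logits for a mostly negative text.
scores = compound_from_logits([2.0, 0.5, -1.0])
print({name: round(value, 4) for name, value in scores.items()})
```

Since the three probabilities sum to 1, the compound score always lies between -1 and +1, and a text the model finds equally positive and negative scores 0.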


In [4]:
"""
Metrics meaning

Sentiment scores (each is a mean over all collected posts and comments):
Average compound score: positive probability minus negative probability, ranging from -1 (very negative) to +1 (very positive)
Average positive score: mean probability the model assigns to the positive class
Average neutral score: mean probability the model assigns to the neutral class
Average negative score: mean probability the model assigns to the negative class

Sentiment distribution (items are binned by compound score):
Very negative: number of items with a compound score of -0.6 or lower
Negative: number of items with a compound score above -0.6 and up to -0.2
Neutral: number of items with a compound score above -0.2 and up to 0.2
Positive: number of items with a compound score above 0.2 and up to 0.6
Very positive: number of items with a compound score above 0.6
"""
# brief analysis of sentiment and emotion distribution for the given movie
analysis = collector.analyze_sentiment_and_emotion_distribution(data)

print("Sentiment analysis results:")
print("Average sentiment scores:")
for k, v in analysis['sentiment_avg'].items():
    print(f"  {k}: {v:.4f}")

print("\n")

print("Sentiment distribution:")
for k, v in analysis['sentiment_counts'].items():
    percentage = analysis['sentiment_percentages'][k]
    print(f"  {k}: {v} ({percentage:.2f}%)")


Sentiment analysis results:
Average sentiment scores:
  Average compound score: -0.0386
  Average positive score: 0.2636
  Average neutral score: 0.4342
  Average negative score: 0.3022


Sentiment distribution:
  Neutral: 5259 (32.16%)
  Very negative: 3374 (20.63%)
  Negative: 3032 (18.54%)
  Very positive: 3005 (18.38%)
  Positive: 1683 (10.29%)
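
The categories above come from `pd.cut` with the bin edges `[-2, -0.6, -0.2, 0.2, 0.6, 2]`, whose default intervals include their right edge. The sketch below mirrors that rule with the standard library's `bisect`, so the boundary behavior is easy to check by hand. The helper `sentiment_category` is illustrative and assumes the compound score stays within -1 to +1, so the outer edges are never needed.

```python
from bisect import bisect_left

# Inner bin edges and labels, copied from the pd.cut call in the script.
EDGES = [-0.6, -0.2, 0.2, 0.6]
LABELS = ["Very negative", "Negative", "Neutral", "Positive", "Very positive"]


def sentiment_category(compound):
    """Right-inclusive bins, matching pd.cut's default right=True."""
    return LABELS[bisect_left(EDGES, compound)]


for value in (-0.9, -0.6, -0.0386, 0.2, 0.75):
    print(value, sentiment_category(value))
```

A score sitting exactly on an edge falls into the lower bin, so -0.6 counts as "Very negative" and 0.2 as "Neutral".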


In [None]:
# available emotions
print("\nAvailable emotions in the pretrained model:")
for i, label in enumerate(collector.emotion_labels):
    print(f"{i}: {label}")

# emotion analysis results
print("Emotion analysis results:")
print("Average emotion scores:")
for k, v in analysis['emotion_avg'].items():
    print(f"  {k}: {v:.4f}")

print("\n")

print("Top emotions:")
emotions_data = analysis['emotions']
for emotion, score in zip(
    emotions_data['emotions'], 
    emotions_data['emotions_scores'].values()
):
    percentage = emotions_data['emotions_percentages'][emotion]
    print(f"  {emotion.capitalize()}: {score:.2f} ({percentage:.2f}%)")


Available emotions in the pretrained model:
0: joy
1: optimism
2: anger
3: sadness
Emotion analysis results:
Average emotion scores:
  Average joy emotion: 0.2830
  Average optimism emotion: 0.3440
  Average anger emotion: 0.1289
  Average sadness emotion: 0.2380


Top emotions:
  Optimism: 5624.74 (34.61%)
  Joy: 4627.25 (28.47%)
  Sadness: 3891.73 (23.95%)
  Anger: 2108.28 (12.97%)
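
The percentages in "Top emotions" are each emotion's summed score divided by the sum over all emotions. The sketch below redoes that calculation on the totals printed above, using plain Python; `emotion_shares` is an illustrative helper, and the zero-total guard is an addition that the script itself does not have.

```python
# Summed emotion scores copied from the recorded output above.
totals = {"optimism": 5624.74, "joy": 4627.25, "sadness": 3891.73, "anger": 2108.28}


def emotion_shares(total_scores):
    """Return each emotion's percentage of the grand total, largest first."""
    grand_total = sum(total_scores.values())
    if grand_total == 0:
        # Guard added for the sketch: avoid dividing by zero on empty data.
        return {name: 0.0 for name in total_scores}
    ranked = sorted(total_scores.items(), key=lambda item: item[1], reverse=True)
    return {name: score / grand_total * 100 for name, score in ranked}


shares = emotion_shares(totals)
for name, share in shares.items():
    print(f"{name.capitalize()}: {share:.2f}%")
```

These shares describe how the model's probability mass is spread across the four labels, not how many posts or comments were assigned to each emotion.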
