# Reddit Movie Sentiment Analysis - Data Collection / Sentiment + Emotion Analysis Script

## Author: Leonardo Ferreira

## 1. Objective
The main goal is to collect and analyze sentiment and emotional responses to movies using reddit comments and posts.

## 2. Sources
- **Reddit API**: Data source
- **Hugging Face Pretrained Models**: 
  - Sentiment Analysis: [Pretrained model for sentiment analysis](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest)
  - Emotion Recognition: [Pretrained emotion recognition model](https://huggingface.co/j-hartmann/emotion-english-distilroberta-base)

## 3. Reddit API setup

### 3.1 Create a reddit account

- If you don't already have one, go to reddit's [registration page](https://www.reddit.com/register/)

### 3.2 Create a reddit application

- Go to your [app preferences page](https://www.reddit.com/prefs/apps) while logged in.
- Scroll down to the bottom and click **"create another app"** (or **"create app"** if it's your first one).

### 3.3 Fill in the application details

- Select **"script"** as the application type.
- Provide a name for your application (e.g., "Movie Sentiment Analysis Project").
- Add a brief description.
- For the **"about url"** and **"redirect uri"** fields, you can use `http://localhost:8080` as a placeholder.
- Click **"create app"** to submit.

### 3.4 Get your credentials

- After creating the app, you'll see the **client ID** directly under the app name.
- The **client secret** will be displayed as **"secret"**.
- Make note of both, as you'll need them in your code.

### 3.5 Example: Initializing the reddit API with PRAW

```python
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="python:movie.sentiment_emotion.analyzer:v1.0 (by /u/your_username)"
)
```

## 4. Methodology

1. **Data Retrieval**
   - Use PRAW (the Python Reddit API Wrapper) to search Reddit
   - Search parameters include:
     - Movie name as search query
     - Relevance sorting
     - Date window before the release date (applied client-side after retrieval)
     - Limit on number of posts
<br>
2. **Text Preprocessing**
   - Clean text by:
     - Converting to lowercase
     - Removing URLs
     - Collapsing runs of whitespace
   - Process both post titles and body text
   - Handle comments separately
<br>
3. **Sentiment Analysis**
   - Use CardiffNLP's RoBERTa-based sentiment model
   - Extract sentiment scores:
     - Negative sentiment
     - Neutral sentiment
     - Positive sentiment
     - Compound sentiment score
<br>
4. **Emotion Recognition**
   - Apply DistilRoBERTa emotion recognition model
   - Identify emotional categories:
     - Anger
     - Disgust
     - Fear
     - Joy
     - Neutral
     - Sadness
     - Surprise
<br>
5. **Data Storage**
   - Save processed data to CSV files
   - Separate files for posts and comments
   - Include:
     - Original text
     - Sentiment scores
     - Emotion scores
     - Metadata (author, timestamp, etc.)
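
The preprocessing step can be sketched in a few lines. This is a minimal illustration rather than the script's exact implementation: the URL pattern is a simplified stand-in chosen for readability, the function name `clean_text_sketch` is hypothetical, and punctuation is left in place here.

```python
import re

# Simplified URL pattern (an assumption for this sketch; stricter patterns exist).
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def clean_text_sketch(text):
    """Lowercase, strip URLs, and collapse whitespace."""
    if not text:
        return ""
    text = text.lower()
    text = URL_PATTERN.sub("", text)
    # Collapse any run of whitespace (including newlines) into one space.
    return re.sub(r"\s+", " ", text).strip()


print(clean_text_sketch("Loved   the TRAILER!\nhttps://example.com/clip  Can't wait"))
```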
    
## 5. ML models

### Sentiment Analysis Model
- **Architecture**: RoBERTa
- **Training Data**: Tweets
- **Sentiment Categories**: 3 (negative, neutral, positive)
- **Output**: Probability distribution across the three sentiments
- **Compound Score**: Positive probability minus negative probability, so it ranges from -1 to 1
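
As a worked illustration of the compound score, the sketch below turns three logits into probabilities and subtracts the negative probability from the positive one. The logits are made up, the function names are hypothetical, and the (negative, neutral, positive) ordering is assumed to match the model's label order.

```python
import math


def softmax(logits):
    """Convert raw model logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sentiment_scores(logits):
    """Map logits ordered (negative, neutral, positive) to named scores."""
    neg, neu, pos = softmax(logits)
    return {"neg": neg, "neu": neu, "pos": pos, "compound": pos - neg}


# Hypothetical logits for a fairly positive comment.
scores = sentiment_scores([-1.0, 0.5, 2.0])
print({name: round(value, 3) for name, value in scores.items()})
```

A compound score near 1 means the model puts almost all probability on "positive", near -1 on "negative", and near 0 either on "neutral" or on an even split.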

### Emotion Recognition Model
- **Architecture**: DistilRoBERTa
- **Training Data**: English text
- **Emotion Categories**: 7 (anger, disgust, fear, joy, neutral, sadness, surprise)
- **Output**: Probability distribution across the seven emotions
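
To show how the emotion output might be turned into named columns, here is a small sketch. The `ID2LABEL` mapping and the probabilities are hypothetical placeholders; in practice the mapping would be read from the model's configuration rather than hard-coded.

```python
# Hypothetical label mapping; the real one comes from the model's config (id2label).
ID2LABEL = {0: "anger", 1: "disgust", 2: "fear", 3: "joy",
            4: "neutral", 5: "sadness", 6: "surprise"}

SUFFIX = "_emotion"


def to_emotion_columns(probabilities, id2label):
    """Pair each probability with a '<label>_emotion' column name."""
    return {f"{id2label[i].lower()}{SUFFIX}": float(p)
            for i, p in enumerate(probabilities)}


def dominant_emotion(columns):
    """Return the label whose column holds the highest probability."""
    best = max(columns, key=columns.get)
    return best[: -len(SUFFIX)]


# Made-up probabilities that sum to 1.
row = to_emotion_columns([0.05, 0.02, 0.03, 0.70, 0.10, 0.04, 0.06], ID2LABEL)
print(dominant_emotion(row))  # joy
```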

## 6. Considerations
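- **API Limitations**: 
  - Pause between requests to stay within Reddit's rate limits
  - Reddit search does not accept a date range, so posts and comments are filtered by `created_utc` after retrieval
  
- **Text Processing**:
  - Truncate long texts to the model's maximum input length (512 tokens)
  - Clean and standardize text before scoring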
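
Because the date restriction has to happen after retrieval, a client-side filter along these lines is one option. This is a sketch under stated assumptions: the function name `in_release_window` is hypothetical, a month is approximated as 30 days, and the release date is interpreted as UTC.

```python
from datetime import datetime, timedelta, timezone


def in_release_window(created_utc, release_date, months_before, days_per_month=30):
    """Return True if a Unix timestamp falls in the window before release.

    The window runs from (release_date - months_before * days_per_month days)
    up to and including release_date, all interpreted as UTC.
    """
    end = release_date.replace(tzinfo=timezone.utc)
    start = end - timedelta(days=days_per_month * months_before)
    return start.timestamp() <= created_utc <= end.timestamp()


release = datetime(2024, 3, 1)
two_weeks_before = datetime(2024, 2, 16, tzinfo=timezone.utc).timestamp()
a_year_before = datetime(2023, 3, 1, tzinfo=timezone.utc).timestamp()

print(in_release_window(two_weeks_before, release, months_before=1))  # True
print(in_release_window(a_year_before, release, months_before=1))     # False
```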

In [8]:
import praw
import pandas as pd
from datetime import datetime, timedelta
import time
import re
import os
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax
# this is used to import private variables for reddit API
# you can either modify the script using your own credentials or create a .env with them
from dotenv import load_dotenv
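
When credentials come from a `.env` file, it can help to fail early if one is missing. The helper below is a hypothetical sketch (`require_env` is not part of the script); it checks the three variable names the script reads and is demonstrated against a stand-in dictionary instead of the real environment.

```python
import os


def require_env(names, environ=os.environ):
    """Return the requested variables, raising if any is missing or empty."""
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}


# Demonstration with a stand-in mapping instead of the real environment.
fake_env = {"REDDIT_CLIENT_ID": "abc", "REDDIT_CLIENT_SECRET": "xyz"}
try:
    require_env(["REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT"], fake_env)
except RuntimeError as error:
    print(error)  # missing environment variables: REDDIT_USER_AGENT
```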

In [9]:
class RedditMovieDataCollector:
    def __init__(self, client_id, client_secret, user_agent):
        """
        initialize the reddit API client
        
        parameters:
        - client_id: your reddit API client ID
        - client_secret: your reddit API client secret
        - user_agent: unique identifier for your script
        """
        self.reddit = praw.Reddit(
            client_id=client_id,
            client_secret=client_secret,
            user_agent=user_agent
        )


        # initialize pretrained sentiment analysis model
        # using fine-tuned model for sentiment analysis
        # source: https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest
        self.sentiment_model_name = "cardiffnlp/twitter-roberta-base-sentiment-latest"
        self.sentiment_tokenizer = AutoTokenizer.from_pretrained(self.sentiment_model_name)
        self.sentiment_model = AutoModelForSequenceClassification.from_pretrained(self.sentiment_model_name)


        # doing the same but for emotion analysis
        # source: https://huggingface.co/j-hartmann/emotion-english-distilroberta-base
        self.emotion_model_name = "j-hartmann/emotion-english-distilroberta-base"
        self.emotion_tokenizer = AutoTokenizer.from_pretrained(self.emotion_model_name)
        self.emotion_model = AutoModelForSequenceClassification.from_pretrained(self.emotion_model_name)

        # get emotion labels dynamically from the model configuration
        self.emotion_labels = [
            self.emotion_model.config.id2label[i] 
            for i in range(len(self.emotion_model.config.id2label))
        ]

        # create emotion column names
        self.emotion_columns = [f"{label.lower()}_emotion" for label in self.emotion_labels]


        # create folder for data if it doesn't exist
        os.makedirs('movie_data_reddit', exist_ok=True)
    
    def clean_text(self, text):
        """
        clean and preprocess text data
        
        parameters:
        - text: text to clean
        
        returns:
        - cleaned text
        """
        if text is None:
            return ""
        
        # to lowercase
        text = text.lower()
        
        # remove urls
        text = re.sub(r'(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})', '', text)
        
        # remove extra whitespace
        text = re.sub(r'\s+', ' ', text).strip()
        
        return text
    
    def get_sentiment_scores(self, text):
        """
        calculate sentiment scores for a given text using the pretrained RoBERTa sentiment model
        
        parameters:
        - text: text to analyze
        
        returns:
        - dictionary with sentiment scores
        """
        if not text:
            # return neutral sentiment if text is empty
            return {
                'compound': 0,
                'pos': 0,
                'neu': 1,
                'neg': 0
            }
        
        # the tokenizer truncates to the model's 512-token limit;
        # slicing the string to 512 characters first would discard text that still fits
        encoded_input = self.sentiment_tokenizer(text, return_tensors='pt', truncation=True, max_length=512)

        # get model output
        with torch.no_grad():
            output = self.sentiment_model(**encoded_input)
        
        # get probabilities
        scores = softmax(output.logits[0].numpy())
        
        # map scores to sentiment categories (negative, neutral, positive)
        sentiment_scores = {
            'neg': float(scores[0]),
            'neu': float(scores[1]),
            'pos': float(scores[2]),
            'compound': float(scores[2] - scores[0])
        }
        
        return sentiment_scores
    
    def get_emotion_scores(self, text):
        """
        calculate emotion scores

        parameters:
        - text: text to analyze
        
        returns:
        - dictionary with emotion scores
        """
        if not text:
            # return a score of 0 for every emotion if text is empty
            return {label.lower(): 0 for label in self.emotion_labels}
        
        # the tokenizer truncates to the model's 512-token limit;
        # slicing the string to 512 characters first would discard text that still fits
        encoded_input = self.emotion_tokenizer(text, return_tensors='pt', truncation=True, max_length=512)

        # get model output
        with torch.no_grad():
            output = self.emotion_model(**encoded_input)
        
        # get probabilities
        scores = softmax(output.logits[0].numpy())
        
        # map scores to emotion categories dynamically
        emotion_mapping = {
            self.emotion_model.config.id2label[i].lower(): float(scores[i]) 
            for i in range(len(self.emotion_labels))
        }
        
        return emotion_mapping
    
    def get_date_range(self, release_date, months_before):
        """
        calculate a date range starting N months before a movie's release date
        
        parameters:
        - release_date: release date datetime
        - months_before: N of months before release date
        
        returns:
        - start_date: datetime for the start date
        - end_date: datetime for the end date (release date)
        """
        # calculate start date (N months before release, approximating a month as 30 days)
        start_date = release_date - timedelta(days=30 * months_before)
        
        return start_date, release_date
    
    def collect_reddit_data(self, movie_name, release_date, months_before, limit):
        """
        collect reddit data for a specific movie
        
        parameters:
        - movie_name: name of the movie to search for
        - release_date: movie's release date
        - months_before: number of months before release date
        - limit: max number of posts
        
        returns:
        - df with collected data
        """

        # calculate date range
        start_date, end_date = self.get_date_range(release_date, months_before)

        # convert dates to unix timestamps
        start_timestamp = int(start_date.timestamp())
        end_timestamp = int(end_date.timestamp())
        
        posts_data = []
        comments_data = []
        
        # search for posts related to the movie
        search_query = movie_name

        # to avoid infinite search we have to set a reasonable max limit to search for
        max_posts_to_check = limit * 100
        posts_processed = 0
        post_count = 0
        comment_count = 0
        
        for post in self.reddit.subreddit("all").search(search_query, sort="relevance", time_filter="all", limit=None, syntax='lucene'):

            # stop if we have checked too many posts without finding enough matches
            posts_processed += 1
            if posts_processed > max_posts_to_check:
                print(f"maximum posts to check ({max_posts_to_check}) reached... stopping search")
                break

            # skip posts outside the date range we established
            post_timestamp = post.created_utc
            if post_timestamp < start_timestamp or post_timestamp > end_timestamp:
                continue
            
            # if we reach the limit, then we can stop
            if post_count >= limit:
                break

            post_count += 1
            
            # clean title and text
            # for posts we use the TITLE and the POST TEXTUAL CONTENT for sentiment / emotion analysis
            clean_title = self.clean_text(post.title)
            clean_text = self.clean_text(post.selftext)
            combined_text = f"{clean_title} {clean_text}"
            
            # get sentiment scores
            sentiment_scores = self.get_sentiment_scores(combined_text)

            # get emotion scores
            emotion_scores = self.get_emotion_scores(combined_text)

            # process post
            post_data = {
                'id': post.id,
                'title': post.title,
                'text': post.selftext,
                'author': str(post.author),
                'score': post.score,
                'created_utc': pd.to_datetime(post.created_utc, unit='s', utc=True),  # keep the value in UTC, not local time
                'subreddit': post.subreddit.display_name,
                'num_comments': post.num_comments,
                'compound_sentiment': sentiment_scores['compound'],
                'positive_sentiment': sentiment_scores['pos'],
                'neutral_sentiment': sentiment_scores['neu'],
                'negative_sentiment': sentiment_scores['neg'],
                **{f"{k}_emotion": v for k, v in emotion_scores.items()},
                'content_type': 'post'
            }
            posts_data.append(post_data)
            
            # get comments while skipping loading more comments to avoid API rate limits
            post.comments.replace_more(limit=0)

            # sort top-level comments by score, highest first
            top_comments = list(post.comments)
            top_comments.sort(key=lambda x: x.score, reverse=True)

            post_comment_count = 0

            for comment in top_comments:
                # filter comments by date as well
                comment_timestamp = comment.created_utc
                if comment_timestamp < start_timestamp or comment_timestamp > end_timestamp:
                    continue

                # filter by minimum length (80 characters)
                if len(comment.body) < 80:
                    continue
                
                # clean comment text
                # for comments we use the COMMENT TEXTUAL BODY for sentiment / emotion analysis
                clean_comment = self.clean_text(comment.body)
                
                # get sentiment scores
                comment_sentiment = self.get_sentiment_scores(clean_comment)

                # get emotion scores
                comment_emotion = self.get_emotion_scores(clean_comment)

                # process comment
                comment_data = {
                    'id': comment.id,
                    'post_id': post.id,
                    'text': comment.body,
                    'author': str(comment.author),
                    'score': comment.score,
                    'created_utc': pd.to_datetime(comment.created_utc, unit='s', utc=True),  # keep the value in UTC, not local time
                    'subreddit': post.subreddit.display_name,
                    'compound_sentiment': comment_sentiment['compound'],
                    'positive_sentiment': comment_sentiment['pos'],
                    'neutral_sentiment': comment_sentiment['neu'],
                    'negative_sentiment': comment_sentiment['neg'],
                    **{f"{k}_emotion": v for k, v in comment_emotion.items()},
                    'content_type': 'comment'
                }
                comments_data.append(comment_data)

                comment_count += 1
                post_comment_count += 1

                # limit to 20 comments per post
                if post_comment_count >= 20:
                    break

            # sleep to avoid rate limits
            time.sleep(0.5)
        
        # create dfs
        posts_df = pd.DataFrame(posts_data)
        comments_df = pd.DataFrame(comments_data)
        
        # save to csv, replacing characters that are unsafe in filenames (spaces, "/", ":", ...)
        movie_name_cleaned = re.sub(r'[^\w\-]+', '_', movie_name).strip('_').lower()
        
        posts_filename = f"movie_data_reddit/{movie_name_cleaned}_posts.csv"
        comments_filename = f"movie_data_reddit/{movie_name_cleaned}_comments.csv"
        
        posts_df.to_csv(posts_filename, index=False, quoting=1, escapechar='\\')
        comments_df.to_csv(comments_filename, index=False, quoting=1, escapechar='\\')
        
        print(f"Data collection is complete: {post_count} posts and {comment_count} comments")
        print(f"Data saved to {posts_filename} and {comments_filename}")
        
        # combine data for analysis
        all_data = pd.concat([posts_df, comments_df])
        
        return all_data
    
    def process_movies_with_incremental_saving(self, movies_df, output_file='tmdb_movies_with_reddit_sentiment_emotion_scores.csv', limit=20, months_before=1):
        """
        process movies and add sentiment/emotion scores with incremental saving
        
        parameters:
        - movies_df: df containing movie data
        - output_file: file to save results to
        - limit: max number of posts to collect per movie
        - months_before: number of months before release to collect data
        
        returns:
        - updated movies df with sentiment and emotion scores
        """
        # check if output file exists and load it if it does
        if os.path.exists(output_file):
            print(f"found existing file: {output_file}")
            existing_df = pd.read_csv(output_file)

            # get the list of movies already processed
            processed_movies = existing_df['movie_ID'].tolist() if 'movie_ID' in existing_df.columns else []
            if processed_movies:
                print(f"found {len(processed_movies)} already processed movies, we will start from where we stopped")

                # filter out movies that have already been processed
                movies_to_process = movies_df[~movies_df['movie_ID'].isin(processed_movies)]
                
                # combine the existing data with the new data we're going to process
                result_df = existing_df
            else:
                movies_to_process = movies_df
                result_df = pd.DataFrame()
        else:
            # otherwise, it means that we are starting the data download for the first time
            print(f"creating new file: {output_file}")
            movies_to_process = movies_df
            
            # start new df
            # plus the sentiment and emotion columns we'll add
            result_df = pd.DataFrame()
        
        # skip processing if all movies have been processed
        if len(movies_to_process) == 0:
            print("all movies processed")
            return result_df
        
        print(f"processing {len(movies_to_process)} movies...")
        
        # Define sentiment and emotion categories
        sentiment_categories = ['negative', 'neutral', 'positive']
        emotion_categories = [label.lower() for label in self.emotion_labels]
        
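        # process one movie at a time; count by position because the df index may have gaps after filtering
        for position, (_, row) in enumerate(movies_to_process.iterrows(), start=1):
            print(f"processing movie {position}/{len(movies_to_process)}: {row['title']}")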
            
            # copy of the current row that we'll update with sentiment data
            movie_row = row.copy()
            
            # sentiment and emotion scores for this movie
            for category in sentiment_categories:
                movie_row[f'{category}_sentiment'] = 0.0
            
            for category in emotion_categories:
                movie_row[f'{category}_emotion'] = 0.0
            
            # parse date
            release_date = datetime.strptime(row['Release Date'], '%Y-%m-%d')
            
            # get reddit data for current movie
            reddit_data = self.collect_reddit_data(
                row['title'], 
                release_date, 
                months_before=months_before, 
                limit=limit
            )
        
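            # average sentiment and emotion scores over all collected posts and comments;
            # if nothing was collected the df has no columns, so keep the 0.0 defaults set above
            if not reddit_data.empty:
                for category in sentiment_categories:
                    movie_row[f'{category}_sentiment'] = reddit_data[f'{category}_sentiment'].mean()
                for category in emotion_categories:
                    movie_row[f'{category}_emotion'] = reddit_data[f'{category}_emotion'].mean()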
            
            # convert the row to a df
            movie_df = pd.DataFrame([movie_row])
            
            # append it to existing df 
            if result_df.empty:
                result_df = movie_df
            else:
                result_df = pd.concat([result_df, movie_df], ignore_index=True)
            
            # save for every row
            result_df.to_csv(output_file, index=False)
        
        return result_df
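
The per-movie aggregation is a plain mean over every collected post and comment. The standalone sketch below shows that idea with ordinary dictionaries instead of a DataFrame; the helper name `mean_scores` and the sample rows are hypothetical, and falling back to a default when nothing was collected is an assumption of this sketch.

```python
def mean_scores(rows, columns, default=0.0):
    """Average each column over a list of dict rows; use `default` if empty."""
    if not rows:
        return {column: default for column in columns}
    return {
        column: sum(row[column] for row in rows) / len(rows)
        for column in columns
    }


columns = ["positive_sentiment", "negative_sentiment"]
rows = [
    {"positive_sentiment": 0.8, "negative_sentiment": 0.1},
    {"positive_sentiment": 0.4, "negative_sentiment": 0.3},
]
print(mean_scores(rows, columns))
print(mean_scores([], columns))
```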

In [None]:
# load credentials from .env file
load_dotenv()
REDDIT_CLIENT_ID = os.getenv("REDDIT_CLIENT_ID")
REDDIT_CLIENT_SECRET = os.getenv("REDDIT_CLIENT_SECRET")
REDDIT_USER_AGENT = os.getenv("REDDIT_USER_AGENT")

# read data
movies_df = pd.read_csv('tmdb_1000_sample.csv')
#movies_df = movies_df.head(5)

# initialize collector with reddit credentials
collector = RedditMovieDataCollector(REDDIT_CLIENT_ID, REDDIT_CLIENT_SECRET, REDDIT_USER_AGENT)

# collect data and add sentiment and emotions to it 
# write into the folder the collector creates ('movie_data_reddit')
output_file = 'movie_data_reddit/tmdb_movies_with_reddit_sentiment_emotion_scores.csv'
movies_with_sentiment_and_emotion = collector.process_movies_with_incremental_saving(
    movies_df, # data
    output_file, # output file name
    20,  # limit
    1  # months before
)
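
Because results are saved after every movie, an interrupted run can be resumed by skipping IDs already present in the output file. The sketch below illustrates that check with the standard `csv` module rather than pandas; the helper name `pending_ids` and the sample rows are hypothetical, and the `movie_ID` column name is taken from the script above.

```python
import csv
import os
import tempfile


def pending_ids(all_ids, output_path, id_column="movie_ID"):
    """Return the IDs from all_ids that are not yet present in output_path."""
    if not os.path.exists(output_path):
        return list(all_ids)
    with open(output_path, newline="", encoding="utf-8") as handle:
        done = {row[id_column] for row in csv.DictReader(handle) if id_column in row}
    # CSV values are read back as strings, so compare on the string form.
    return [movie_id for movie_id in all_ids if str(movie_id) not in done]


# Demonstration with a temporary file holding two already processed movies.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "results.csv")
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows([["movie_ID", "title"], ["1", "A"], ["2", "B"]])
    remaining = pending_ids([1, 2, 3, 4], path)

print(remaining)  # [3, 4]
```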