# Sentiment Analysis for IMDB reviews

**Objective:** Identify the sentiment of IMDB reviews to determine whether it matches the target sentiment for each movie.

**Note:** This notebook uses the Kaggle IMDB Spoiler Dataset for movie reviews. It still needs to be run with Mike's cleaned data for plots and movie reviews as input.

### 1. Sentiment Analysis using NLTK Sentiment Intensity Analyzer (VADER)

#### Load Datasets

In [3]:
import json

Movie review dataset

In [4]:
with open("IMDB_reviews.json", 'r') as f:
    json_data = f.read()

# Split the file contents into individual JSON objects
json_objects = json_data.strip().split('\n')

# Load each JSON object and store them in a list
loaded_data = []
for obj in json_objects:
    data = json.loads(obj)
    loaded_data.append(data)

Movie plot dataset

In [5]:
with open("IMDB_movie_details.json", 'r') as f:
    json_data = f.read()

# Split the file contents into individual JSON objects
json_objects = json_data.strip().split('\n')

# Load each JSON object and store them in a list
loaded_plots = []
for obj in json_objects:
    data = json.loads(obj)
    loaded_plots.append(data)
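
Both loading cells above repeat the same steps, because each file stores one JSON object per line. A minimal sketch of a shared helper follows. `load_json_lines` is a hypothetical name that does not appear in the original notebook, and the sketch reads the file line by line instead of loading it into a single string first.

```python
import json

def load_json_lines(path):
    """Load a file that stores one JSON object per line into a list of dicts."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

Under that assumption, the two cells would reduce to `loaded_data = load_json_lines("IMDB_reviews.json")` and `loaded_plots = load_json_lines("IMDB_movie_details.json")`.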

Merged dataset based on movie ID:

In [6]:
merged_list = [dict1 | dict2 for dict1 in loaded_data for dict2 in loaded_plots if dict1["movie_id"] == dict2["movie_id"]]
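
The comprehension above compares every review with every plot record. A sketch of a faster join that first indexes the plot records by `movie_id` is shown below. It assumes each `movie_id` appears at most once in the plot dataset (with duplicates, only the last record would be kept, unlike the nested comprehension). The function name and the toy records are made up for illustration.

```python
def merge_on_movie_id(reviews, plots):
    """Join review dicts with plot dicts that share the same 'movie_id'."""
    # Index plots by movie_id so each review needs a single dictionary lookup
    plots_by_id = {plot["movie_id"]: plot for plot in plots}
    return [
        {**review, **plots_by_id[review["movie_id"]]}
        for review in reviews
        if review["movie_id"] in plots_by_id
    ]

# Toy records (hypothetical values, not taken from the dataset)
toy_reviews = [
    {"movie_id": "tt1", "review_text": "Great."},
    {"movie_id": "tt2", "review_text": "Dull."},
    {"movie_id": "tt9", "review_text": "No plot record for this one."},
]
toy_plots = [
    {"movie_id": "tt1", "plot_synopsis": "A prospector heads north."},
    {"movie_id": "tt2", "plot_synopsis": ""},
]
toy_merged = merge_on_movie_id(toy_reviews, toy_plots)
print(len(toy_merged))
```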

Keep only movies with plots:

In [42]:
movies_list = [review for review in merged_list if 'plot_synopsis' in review and review['plot_synopsis']]

len(movies_list)

538828

In [33]:
empty_plot_synopsis_exists = any('plot_synopsis' in review and not review['plot_synopsis'] for review in movies_list)

if empty_plot_synopsis_exists:
    print("There is at least one movie with an empty plot_synopsis.")
else:
    print("All movies have a plot_synopsis.")

All movies have a plot_synopsis.


#### Attach movie name based on movie ID from the title.akas.tsv.gz dataset:

Load dataset:

In [7]:
# Import the required library
import pandas as pd
# Select all columns
# dff = pd.read_csv("title.akas.tsv.gz", sep="\t")
# Select only the columns needed for the title lookup
dff = pd.read_csv("title.akas.tsv.gz", sep="\t", usecols=['titleId', 'title'])
# Show the dataframe header and the first few rows
dff.head()

Unnamed: 0,titleId,title
0,tt0000001,Карменсіта
1,tt0000001,Carmencita
2,tt0000001,Carmencita - spanyol tánc
3,tt0000001,Καρμενσίτα
4,tt0000001,Карменсита


Check each column for duplicates:

In [117]:
for col in dff.columns:
    is_unique = not dff[col].duplicated().any()
    print(f"{col} is unique: {is_unique}")

titleId is unique: False
title is unique: False


Calling `drop_duplicates` alone still leaves duplicated title IDs, because the same title ID is listed with titles in several languages. To get one row per title ID, I dropped the exact duplicate rows and then combined the remaining titles under each title ID:

In [8]:
dff_no_duplicates = dff.drop_duplicates()

In [281]:
dff_no_duplicates['title'] = dff_no_duplicates['title'].astype(str)

In [16]:
df_final = dff_no_duplicates.groupby('titleId')['title'].agg(', '.join).reset_index()
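
To illustrate what the three cells above produce, here is a small sketch on a toy frame. The rows are invented for illustration and only mimic the shape of `title.akas.tsv.gz`.

```python
import pandas as pd

# Toy stand-in for the titles table (hypothetical rows)
toy = pd.DataFrame({
    "titleId": ["tt0000001", "tt0000001", "tt0000001", "tt0000002"],
    "title": ["Carmencita", "Carmencita", "Карменсита", "Example title"],
})

combined = (
    toy.drop_duplicates()            # remove exact duplicate rows
       .astype({"title": str})       # make sure every title is a string
       .groupby("titleId")["title"]
       .agg(", ".join)               # one comma-separated string per title ID
       .reset_index()
)
print(combined)
```

After this step each `titleId` appears exactly once, which is what the later merge with the review data relies on.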

Merge the review and plot dataset with the movie titles, joining `titleId` to `movie_id`:

In [53]:
import pandas as pd

In [54]:
# Convert movies_list to a DataFrame
movies_df = pd.DataFrame(movies_list)

In [55]:
# Merge the DataFrames based on 'movie_id' and 'titleId' using inner join
merged_df = df_final.merge(movies_df, left_on='titleId', right_on='movie_id', how='inner')

In [60]:
len(merged_df)

538727
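
The inner join returns 538727 rows, fewer than the 538828 entries counted in `movies_list` earlier, which suggests that some `movie_id` values have no match in the titles table. One way to list them, sketched here on toy frames with made-up IDs, is a left join with the `indicator=True` option of `DataFrame.merge`.

```python
import pandas as pd

# Toy stand-ins for df_final and movies_df (hypothetical values)
toy_titles = pd.DataFrame({"titleId": ["tt1", "tt2"], "title": ["A", "B"]})
toy_movies = pd.DataFrame({
    "movie_id": ["tt1", "tt2", "tt3"],
    "review_text": ["r1", "r2", "r3"],
})

# A left join keeps every review; the _merge column records where each row matched
checked = toy_movies.merge(
    toy_titles, left_on="movie_id", right_on="titleId", how="left", indicator=True
)
unmatched_ids = checked.loc[checked["_merge"] == "left_only", "movie_id"].unique()
print(unmatched_ids)
```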

In [62]:
# Rename 'titleId' column to 'id'
merged_df.rename(columns={'titleId': 'id'}, inplace=True)

In [59]:
# Drop 'movie_id' column
merged_df.drop(columns='movie_id', inplace=True)

In [64]:
merged_df.head(2)

Unnamed: 0,id,title,review_date,user_id,is_spoiler,review_text,rating,review_summary,plot_summary,duration,genre,release_date,plot_synopsis
0,tt0015864,"La quimera del oro, Kultakuume, La ruée vers l...",14 October 2005,ur0176092,True,If any single figure can fairly be said to sym...,8.2,The Little Fellow is simply superb!,A lone prospector ventures into Alaska looking...,1h 35min,"[Adventure, Comedy, Drama]",1925,It is in the middle of the Gold Rush. A Lone P...
1,tt0015864,"La quimera del oro, Kultakuume, La ruée vers l...",19 October 2005,ur3838473,True,"In Charles Chaplin's 1925 film, ""The Gold Rush...",8.2,A masterpiece of early cinema....,A lone prospector ventures into Alaska looking...,1h 35min,"[Adventure, Comedy, Drama]",1925,It is in the middle of the Gold Rush. A Lone P...


### Test sentences/reviews

Use nltk to split review text into sentences:

In [68]:
import nltk
nltk.download('punkt')  

In [69]:
def split_into_sentences(text):
    # Use the punkt tokenizer to split the text into sentences
    sentences = nltk.sent_tokenize(text)
    return sentences

Store all reviews in a list and run sentiment and emotion analysis:
* reviews for the same movie (identified by id) are combined
* reviews are split into sentences

In [92]:
import math

def normalize(score, alpha=15):
    """
    Normalize the score to be between -1 and 1 using an alpha that
    approximates the max expected value
    """
    norm_score = score / math.sqrt((score * score) + alpha)
    return norm_score
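
The function maps any real score into the open interval (-1, 1): it is close to linear near zero and flattens out as the absolute score grows. VADER's `compound` value is itself produced by this same formula with `alpha=15`, so it already lies in [-1, 1]. Passing it through `normalize` a second time, as the cells below do, confines the result to at most 0.25 in absolute value. The short sketch below redefines the function only so that it runs on its own.

```python
import math

def normalize(score, alpha=15):
    """Squash a score into (-1, 1); alpha approximates the max expected value."""
    return score / math.sqrt(score * score + alpha)

# An input of 1.0 (the largest possible compound score) maps to 1/sqrt(16)
print(normalize(1.0))    # 0.25
print(normalize(-1.0))   # -0.25
# Large raw scores approach, but never reach, 1
print(normalize(100.0) < 1.0)  # True
```

This is worth keeping in mind when reading the sentiment values reported later: small magnitudes are partly a consequence of the second normalization.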

In [115]:
# !pip install nrclex

In [109]:
# Import the emotion classifier
from nrclex import NRCLex

# Initialize the Sentiment Intensity Analyzer (it needs the VADER lexicon)
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
sent_analyzer = SentimentIntensityAnalyzer()

Only the first 1,000 rows are processed because I ran out of memory on the full dataset:

In [282]:
# Create a list to store movie entries
reviews_plot_list = []

# Iterate through the merged_df
for index, row in merged_df[:1000].iterrows():
    id = row['id']
    title = row['title']
    review_text = row['review_text']
    plot_synopsis = row['plot_synopsis']
    
    # Split review_text into sentences
    review_sentences = split_into_sentences(review_text)
    
    # Split plot_synopsis into sentences
    plot_sentences = split_into_sentences(plot_synopsis)
    
    # Check if id or title exists in the reviews_plot_list
    existing_movie = next((movie for movie in reviews_plot_list if movie['id'] == id or movie['title'] == title), None)
    if existing_movie:
        existing_movie['reviews'].extend(review_sentences)
        existing_movie['plots'].extend(plot_sentences)
    else:
        reviews_plot_list.append({'id': id, 'title': title, 'reviews': review_sentences, 'plots': plot_sentences})
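
The loop above scans `reviews_plot_list` for every row and appends the same plot sentences once per review. A possible lighter-weight variant is sketched below under two stated simplifications: movies are matched on `id` only (the original also matches on `title`), and plot sentences are stored once per movie, which leaves an average over plot sentences unchanged. `group_sentences_by_movie`, the toy rows, and the naive splitter are all hypothetical.

```python
def group_sentences_by_movie(rows, split_fn):
    """Collect review and plot sentences per movie id using dictionary lookups."""
    movies = {}
    for row in rows:
        movie_id = row["id"]
        if movie_id not in movies:
            movies[movie_id] = {
                "id": movie_id,
                "title": row["title"],
                "reviews": [],
                # Stored once per movie rather than once per review
                "plots": split_fn(row["plot_synopsis"]),
            }
        movies[movie_id]["reviews"].extend(split_fn(row["review_text"]))
    return list(movies.values())

# Toy rows and a naive splitter stand in for merged_df and split_into_sentences
toy_rows = [
    {"id": "tt1", "title": "A", "review_text": "Good. Fun.", "plot_synopsis": "He goes north."},
    {"id": "tt1", "title": "A", "review_text": "Classic.", "plot_synopsis": "He goes north."},
]

def naive_split(text):
    return [s.strip() + "." for s in text.split(".") if s.strip()]

grouped = group_sentences_by_movie(toy_rows, naive_split)
print(grouped[0]["reviews"])
```

With the notebook's data, the rows could come from `merged_df[:1000].to_dict("records")` and `split_fn` would be `split_into_sentences`.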

In [284]:
# Create an empty dictionary to store all the updated movie data
updated_movies = {}

# Iterate through the movies in reviews_plot_list
for movie in reviews_plot_list:
    id = movie['id']
    title = movie['title']
    reviews = movie.get('reviews', [])
    plots = movie.get('plots', [])

    # Combine all reviews and plots into one string
    all_reviews_text = ' '.join(reviews)
    all_plots_text = ' '.join(plots)

    # Calculate sentiment scores using sent_analyzer for reviews and normalize
    if reviews:
        review_sentiment_scores = [normalize(sent_analyzer.polarity_scores(sentence)['compound']) for sentence in reviews]
        combined_review_sentiment = sum(review_sentiment_scores) / len(review_sentiment_scores)
        # Perform emotion analysis using NRCLex for reviews
        review_emotion_scores = NRCLex(all_reviews_text).affect_frequencies
    else:
        combined_review_sentiment = None
        review_emotion_scores = None
    
    # Calculate sentiment scores using sent_analyzer for plots and normalize
    if plots:
        plot_sentiment_scores = [normalize(sent_analyzer.polarity_scores(sentence)['compound']) for sentence in plots]
        combined_plot_sentiment = sum(plot_sentiment_scores) / len(plot_sentiment_scores)
        # Perform emotion analysis using NRCLex for plots
        plot_emotion_scores = NRCLex(all_plots_text).affect_frequencies
    else:
        combined_plot_sentiment = None
        plot_emotion_scores = None

    # Sort emotion scores in descending order of value for reviews and plots
    sorted_review_emotion_scores = sorted(review_emotion_scores.items(), key=lambda x: x[1], reverse=True) if review_emotion_scores else None
    sorted_plot_emotion_scores = sorted(plot_emotion_scores.items(), key=lambda x: x[1], reverse=True) if plot_emotion_scores else None

    # Create a new dictionary with additional information
    updated_movie_one = {
        'id': id,
        'title': title,
        'reviews_emotion_scores': sorted_review_emotion_scores,
        'reviews_sentiment': combined_review_sentiment,
        'plots_emotion_scores': sorted_plot_emotion_scores,
        'plots_sentiment': combined_plot_sentiment
    }
    
    # Store the updated movie data in the updated_movies dictionary
    updated_movies[id] = updated_movie_one

Query by movie title or id to find sentiment and emotion analysis for both the movie plot and review:

In [258]:
def search_movie_by_id(sentiment_dict, search_id):
    if search_id in sentiment_dict:
        return sentiment_dict[search_id]
    return None  # Movie with search_id not found

def search_movie_by_title(updated_movie, search_title):
    matching_movies = []
    for movie_id, movie_data in updated_movie.items():
        if search_title.lower() in movie_data['title'].lower():
            matching_movies.append(movie_data)
    return matching_movies
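
Because `title` holds every alternative title joined into one string, a case-insensitive substring match finds a movie by any of its listed names. The sketch below shows this on a toy dictionary. The IDs and entries are made up, and the function is a compact restatement written only so the example runs on its own.

```python
def search_movie_by_title(movies, search_title):
    """Return every entry whose combined title string contains search_title."""
    needle = search_title.lower()
    return [data for data in movies.values() if needle in data["title"].lower()]

# Toy dictionary keyed by movie id (hypothetical entries)
toy_movies = {
    "tt1": {"id": "tt1", "title": "La quimera del oro, The Gold Rush"},
    "tt2": {"id": "tt2", "title": "The General"},
}
matches = search_movie_by_title(toy_movies, "gold rush")
print([m["id"] for m in matches])  # ['tt1']
```

A short query can match several movies, so the function returns a list rather than a single entry.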

In [285]:
search_id = 'tt0017136'  # Replace with the desired movie ID

movie_by_id = search_movie_by_id(updated_movies, search_id)

if movie_by_id:
#     print(f"Movie ID: {movie_by_id['id']}")
#     print(f"Movie Title: {movie_by_id['title']}")
    for key, value in movie_by_id.items():
        if key == 'reviews_emotion_scores':
            print("=====Reviews Emotion Scores:=====")
            for emotion, score in value:
                print(f"{emotion}: {score}")
        elif key == 'plots_emotion_scores':
            print("\n=====Plots Emotion Scores:=====")
            for emotion, score in value:
                print(f"{emotion}: {score}")
        elif key == 'reviews_sentiment':
            print(f"Reviews Sentiment: {value}")
        elif key == 'plots_sentiment':
            print(f"Plots Sentiment: {value}")
else:
    print("Movie with ID not found")

=====Reviews Emotion Scores:=====
positive: 0.24654460599533298
trust: 0.12977921378567583
anticipation: 0.12161191886555377
negative: 0.11824627535451444
joy: 0.10294381619098905
fear: 0.07220427212349668
sadness: 0.06659486627176449
surprise: 0.0537156704361874
anger: 0.0525489140190271
disgust: 0.03581044695745827
anticip: 0.0
Reviews Sentiment: 0.045468294686666855

=====Plots Emotion Scores:=====
positive: 0.2
trust: 0.14482758620689656
negative: 0.14482758620689656
joy: 0.10344827586206896
fear: 0.0896551724137931
anticipation: 0.0896551724137931
anger: 0.07586206896551724
surprise: 0.06896551724137931
sadness: 0.04827586206896552
disgust: 0.034482758620689655
anticip: 0.0
Plots Sentiment: -0.03364110422266192


In [286]:
search_title = 'Gold rush'  # Replace with the desired movie title (partial or full)

movies_by_title = search_movie_by_title(updated_movies, search_title)

if movies_by_title:
    for movie in movies_by_title:
#         print(f"Movie ID: {movie['id']}")
#         print(f"Movie Title: {movie['title']}")
        for key, value in movie.items():
            if key == 'reviews_emotion_scores':
                print("=====Reviews Emotion Scores:=====")
                for emotion, score in value:
                    print(f"{emotion}: {score}")
            elif key == 'plots_emotion_scores':
                print("\n=====Plots Emotion Scores:=====")
                for emotion, score in value:
                    print(f"{emotion}: {score}")
            elif key != 'id' and key != 'title':
                print(f"{key}: {value}")
        print("\n")
else:
    print("Movie with title not found")

=====Reviews Emotion Scores:=====
positive: 0.2641305814110347
trust: 0.12639956832591392
joy: 0.12491568865506543
anticipation: 0.10616484554161608
negative: 0.10333198435181438
fear: 0.07824092809928504
sadness: 0.062457844327532715
surprise: 0.054903547821394845
anger: 0.04357210306218805
disgust: 0.03588290840415486
anticip: 0.0
reviews_sentiment: 0.07843402981055392

=====Plots Emotion Scores:=====
positive: 0.23809523809523808
negative: 0.1523809523809524
trust: 0.14285714285714285
joy: 0.10952380952380952
fear: 0.09047619047619047
anticipation: 0.08571428571428572
anger: 0.08095238095238096
sadness: 0.04285714285714286
surprise: 0.03333333333333333
disgust: 0.023809523809523808
anticip: 0.0
plots_sentiment: -0.02499485844377837


