# Predicting Marvel Movie Ratings from YouTube Comments

Kevin Nolasco

Cabrini University

MCIS565 - Natural Language Processing

05/13/2022

![MarvelRatings_Flow.jpg](MarvelRatings_Flow.jpg)

## Load Data

In [9]:
import pandas as pd

In [10]:
df_comments = pd.read_json('data/movie_comments.json')
df_ratings = pd.read_json('data/movie_ratings.json')
# preview the first rows of the comments dataset
df_comments.head()

| | MovieName | MovieId | CommentAuthor | OriginalComment |
|---|---|---|---|---|
| 0 | Iron Man | 8ugaeA-nMTc | UCQzG9JFe87-FrpMAVZjX_Kg | honestly this is still the best marvel movie imo |
| 1 | Iron Man | 8ugaeA-nMTc | UC1bmGHVTIBOerbW2P-pJsog | 2:10 2:11 |
| 2 | Iron Man | 8ugaeA-nMTc | UC7A-gvQcpTmkBs1c0FJU92Q | This just shows the whole movie lmao😂 |
| 3 | Iron Man | 8ugaeA-nMTc | UC4r37ZNp-chrosonJ2VsdEg | 14 years |
| 4 | Iron Man | 8ugaeA-nMTc | UC0QY_nzoJvlb5z-dfwnVGsA | 2:14 |
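
Before cleaning, it can help to confirm that the loaded frame has the columns the rest of the notebook relies on. The following is a minimal sketch, not part of the original analysis: the helper `check_comment_columns` and the toy JSON records are assumptions made for illustration, and only the column names come from the table above.

```python
import io

import pandas as pd

# Column names taken from the preview table; the helper name is hypothetical.
EXPECTED_COLUMNS = {"MovieName", "MovieId", "CommentAuthor", "OriginalComment"}


def check_comment_columns(df):
    """Return the expected columns that are missing from df, sorted by name."""
    return sorted(EXPECTED_COLUMNS - set(df.columns))


# Toy records standing in for data/movie_comments.json (not the real file).
toy_json = io.StringIO(
    '[{"MovieName": "Iron Man", "MovieId": "abc", '
    '"CommentAuthor": "author_1", "OriginalComment": "great movie"}]'
)
toy_comments = pd.read_json(toy_json, orient="records")

# An empty list means every expected column is present.
missing = check_comment_columns(toy_comments)
print(missing)
```

On the real data, the same helper could be called as `check_comment_columns(df_comments)` right after loading.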


## EDA and Data Cleaning

First we will inspect the comments dataset and clean it so we can use it later.

Since our data is taken from YouTube comments, we may see repeated comments. A comment posted more than once by the same author can be considered spam, so we will remove every copy of it.

In [18]:
# inspect the duplicated comments
df_comments[df_comments.duplicated(['CommentAuthor', 'OriginalComment'])].tail()

| | MovieName | MovieId | CommentAuthor | OriginalComment |
|---|---|---|---|---|
| 90388 | Spider-Man: No Way Home | ZYzbalQ6Lg8 | UC4eL8LDWE75VraK9UfSlonQ | Malayalam please 😟😟😟 |
| 90389 | Spider-Man: No Way Home | ZYzbalQ6Lg8 | UC4eL8LDWE75VraK9UfSlonQ | Malayalam please 😟😟😟 |
| 90390 | Spider-Man: No Way Home | ZYzbalQ6Lg8 | UC4eL8LDWE75VraK9UfSlonQ | Malayalam please 😟😟😟 |
| 90391 | Spider-Man: No Way Home | ZYzbalQ6Lg8 | UC4eL8LDWE75VraK9UfSlonQ | Malayalam please 😟😟😟 |
| 90392 | Spider-Man: No Way Home | ZYzbalQ6Lg8 | UC4eL8LDWE75VraK9UfSlonQ | Malayalam please 😟😟😟 |


In [20]:
# remove the duplicates; keep = False drops every copy, including the first occurrence
df_comments.drop_duplicates(subset = ['CommentAuthor', 'OriginalComment'], inplace = True, keep = False)
df_comments.head()

| | MovieName | MovieId | CommentAuthor | OriginalComment |
|---|---|---|---|---|
| 0 | Iron Man | 8ugaeA-nMTc | UCQzG9JFe87-FrpMAVZjX_Kg | honestly this is still the best marvel movie imo |
| 1 | Iron Man | 8ugaeA-nMTc | UC1bmGHVTIBOerbW2P-pJsog | 2:10 2:11 |
| 2 | Iron Man | 8ugaeA-nMTc | UC7A-gvQcpTmkBs1c0FJU92Q | This just shows the whole movie lmao😂 |
| 3 | Iron Man | 8ugaeA-nMTc | UC4r37ZNp-chrosonJ2VsdEg | 14 years |
| 4 | Iron Man | 8ugaeA-nMTc | UC0QY_nzoJvlb5z-dfwnVGsA | 2:14 |
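
The `keep` argument decides which copies of a repeated comment survive. The sketch below uses a small made-up frame (the authors and comments are illustrative, not taken from the dataset) to contrast `keep="first"`, which retains one copy of each repeated comment, with `keep=False`, the setting used above, which removes all copies.

```python
import pandas as pd

# Toy data: author "a" posts the same comment three times.
toy = pd.DataFrame({
    "CommentAuthor": ["a", "a", "a", "b", "c"],
    "OriginalComment": ["spam", "spam", "spam", "nice", "spam"],
})
key = ["CommentAuthor", "OriginalComment"]

# keep="first" retains one copy of each repeated (author, comment) pair.
keep_first = toy.drop_duplicates(subset=key, keep="first")

# keep=False drops every row that belongs to a repeated pair.
keep_none = toy.drop_duplicates(subset=key, keep=False)

print(len(keep_first))  # 3 rows: (a, spam), (b, nice), (c, spam)
print(len(keep_none))   # 2 rows: (b, nice), (c, spam)
```

Note that author "c" keeps the comment "spam" in both cases, because duplicates are defined on the author and comment together rather than on the comment text alone.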


Now let's see how many comments are left. We will also check for duplicates once more to confirm that they are gone.

In [21]:
# compare the row count before and after a second dedupe pass; equal counts mean no duplicates remain
n_rows_comments = df_comments.shape[0]
n_rows_comments_deduped = df_comments.drop_duplicates(subset = ['CommentAuthor', 'OriginalComment']).shape[0]
print("""
Number of rows in dataset : {}
Number of rows deduped : {}
""".format(n_rows_comments, n_rows_comments_deduped))


Number of rows in dataset : 89336
Number of rows deduped : 89336
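
An alternative to comparing row counts is to ask pandas directly whether any duplicated row remains. This is a minimal sketch on toy data; the helper name `has_duplicates` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def has_duplicates(df, subset):
    """Return True if any row repeats an earlier row on the given columns."""
    return bool(df.duplicated(subset=subset).any())


key = ["CommentAuthor", "OriginalComment"]

# Toy frames: one with a repeated (author, comment) pair, one without.
dirty = pd.DataFrame({
    "CommentAuthor": ["a", "a", "b"],
    "OriginalComment": ["spam", "spam", "nice"],
})
clean = dirty.drop_duplicates(subset=key, keep=False)

print(has_duplicates(dirty, key))  # True
print(has_duplicates(clean, key))  # False
```

On the real data, `assert not has_duplicates(df_comments, key)` would make the notebook fail loudly if a duplicate slipped through.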



Let's see some simple statistics for our dataset.

- How many comments total?
- How many comment authors?
- How many movies?
- Which movie has the most comments?
- What is the average number of comments per movie?

In [35]:
avg_number_of_comments = df_comments.groupby(by = 'MovieName').size().mean()
avg_number_of_comments

3436.0

In [38]:
# answering the questions above
n_comments = df_comments.shape[0]
n_comment_authors = df_comments['CommentAuthor'].nunique()
n_movies = df_comments['MovieName'].nunique()
movie_with_most_comments, most_comments_count = df_comments.groupby(by = 'MovieName').size()\
    .sort_values(ascending = False).reset_index(name = 'comment_count').iloc[0].values
avg_number_of_comments = int(df_comments.groupby(by = 'MovieName').size().mean())

print("""
Dataset Stats:\n
{:,} comments\n
{:,} comment authors\n
{} unique movies\n
"{}" contains the most comments - ({:,} comments)\n
On average, the movies in the dataset contain {:,} comments\n
""".format(n_comments, n_comment_authors, n_movies, movie_with_most_comments, most_comments_count, avg_number_of_comments))
print('')


Dataset Stats:

89,336 comments

77,533 comment authors

26 unique movies

"Black Widow" contains the most comments - (4,005 comments)

On average, the movies in the dataset contain 3,436 comments
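
The per-movie figures can also be read off a single `value_counts()` result. The sketch below is an alternative to the `groupby` approach above, shown on a made-up list of movie names rather than the real dataset. As a sanity check on the reported numbers, the mean is the total divided by the number of movies: 89,336 / 26 = 3,436.

```python
import pandas as pd

# Toy data: one row per comment, labelled with the movie it belongs to.
toy = pd.DataFrame({
    "MovieName": ["Iron Man", "Thor", "Thor", "Black Widow", "Thor", "Iron Man"],
})

# Number of comments per movie.
counts = toy["MovieName"].value_counts()

top_movie = counts.idxmax()      # movie with the most comments
top_count = int(counts.max())    # how many comments it has
mean_count = counts.mean()       # average comments per movie

print(top_movie, top_count, mean_count)  # Thor 3 2.0
```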



