# Data and Analysis Plan: Movie Recommendation

## Team -23

- Vedant Bhagat (bhagat.ve@northeastern.edu)
- Anthony Lee (lee.ant@northeastern.edu)
- Dawn Lu (lu.daw@northeastern.edu)
- Josh Rosenberg (rosenberg.jo@northeastern.edu)


### Motivation

The goal of our project is to recommend a few movies to a user based on a movie they already enjoy. People who enjoy a movie typically also enjoy similar movies, so our recommender will compile a list of movies along with information about each one, compare the movies to each other, and predict which similar movies the user would also like to see.

### Data Processing Summary

We will compile a list of movies using the TMDB API.
From this list, for each movie, we will obtain the following information:

- Budget
- Genres
- Original language
- Overview
- Popularity
- Production companies
- Release date
- Revenue
- Runtime
- Spoken languages
- Tagline
- Vote count

We also found the MovieLens dataset, a set of CSV files with user ratings of different movies. It contains TMDB IDs, which we will use to make API calls and map more data from the TMDB API onto the CSV data. Our two sources are therefore the MovieLens CSVs and the TMDB API.

The csv datasets are downloadable here: https://grouplens.org/datasets/movielens/

We plan two data visualizations:

- Distribution of ratings (number of user reviews per rating)
- Distribution of genres (number of reviews per genre)
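
As a rough sketch of how these two distributions could be computed before plotting: the tiny `ratings` and `movies` frames below are made-up stand-ins for the MovieLens CSVs (not real data), and the variable names are only illustrative.

```python
import pandas as pd

# Made-up stand-ins for ratings.csv and movies.csv (illustration only)
ratings = pd.DataFrame({
    'userId': [1, 1, 2, 2, 3],
    'movieId': [1, 2, 1, 3, 3],
    'rating': [4.0, 3.5, 4.0, 5.0, 3.5],
})
movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'genres': ['Adventure|Comedy', 'Comedy', 'Drama'],
})

# Distribution of ratings: number of user reviews per rating value
ratings_per_value = ratings['rating'].value_counts().sort_index()

# Distribution of genres: number of reviews per genre.
# Split the pipe-separated genre string and give each genre its own row.
genres_long = movies.assign(genre=movies['genres'].str.split('|')).explode('genre')
reviews_per_genre = (
    ratings.merge(genres_long[['movieId', 'genre']], on='movieId')['genre']
    .value_counts()
)

print(ratings_per_value)
print(reviews_per_genre)
```

A movie with several genres contributes each of its reviews to every one of those genres, so the genre counts add up to more than the number of reviews.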


In [13]:
import pandas as pd

movies = pd.read_csv('data/movies.csv')

movies.head()


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [14]:
tags = pd.read_csv('data/tags.csv')

tags.head()


Unnamed: 0,userId,movieId,tag,timestamp
0,3,260,classic,1439472355
1,3,260,sci-fi,1439472256
2,4,1732,dark comedy,1573943598
3,4,1732,great dialogue,1573943604
4,4,7569,so bad it's good,1573943455


In [15]:
ratings = pd.read_csv('data/ratings.csv')

ratings.head()


Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


In [16]:
links = pd.read_csv('data/links.csv')

links.head()


Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


### Fetching data is... hard

So, it turns out that making over 60,000 API requests sequentially is prone to failure. Lots of things can go wrong: Wi-Fi can drop, a laptop can fall asleep, etc.

There are a couple of solutions to this:

- Save API results locally as you get them, so you can resume where you left off if the network drops
- Run the code doing all the fetches in a cloud VM so that it has high network reliability and stability

We ended up doing both. I spun up a small Debian VM in GCP and ran the cell below as a standalone Python script. Once the script finished running, I used SCP to copy the files from the cloud VM to my local machine, so that the data was available locally for my Jupyter notebook to consume.
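
Beyond caching and a stable VM, transient failures can also be retried. The following is only a sketch of that idea, not something we ran: `fetch_with_retry`, `retries`, `base_delay`, and `sleep` are hypothetical names, and the flaky function stands in for a real network call. In real code the `except` clause would more likely catch a specific error such as `requests.exceptions.RequestException`.

```python
import time


def fetch_with_retry(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, waiting longer after each failure.

    All parameter names here are hypothetical choices for this sketch.
    """
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            # Give up after the last attempt and let the caller see the error
            if attempt == retries - 1:
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)


# Usage with a stand-in for a flaky network call (fails twice, then succeeds)
calls = {'count': 0}
delays = []


def flaky_fetch():
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('simulated network drop')
    return {'id': 862}


# Record the delays instead of actually sleeping so the example runs instantly
result = fetch_with_retry(flaky_fetch, retries=5, sleep=delays.append)
print(result, delays)
```

Passing `sleep` in as a parameter keeps the example fast to run; the default still waits for real.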

If this were a business use case and we needed to run this on a regular basis, I would set it up as a cron job that ran on a schedule and dumped the data to a new S3 bucket each time it ran.


In [21]:
import os

import requests

tmdb_api_key = '433618a0549fe010dec7ca3a9dc46d53'


def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]


def build_data_cache_from_tmdb():
    """
    Fetch all movie data from TMDB and store it in chunked JSON files.
    Network conditions are unreliable (and batteries die), so we want to be
    able to resume where we left off instead of losing all the data we
    gathered. Chunks are written as JSON files to the cache/ directory.
    """
    os.makedirs('cache', exist_ok=True)
    tmdb_id_list = links['tmdbId'].tolist()

    for idx, chunk in enumerate(chunks(tmdb_id_list, 100)):
        start = idx * 100
        end = start + len(chunk)
        path = f'cache/movies_{start}_{end}_data.json'
        # Skip chunks that were already written so an interrupted run can resume
        if os.path.exists(path):
            continue

        results = []
        for movie_id in chunk:
            url = f'https://api.themoviedb.org/3/movie/{movie_id}?api_key={tmdb_api_key}&language=en-US'
            # A timeout keeps one stalled request from hanging the whole run
            resp = requests.get(url, timeout=30)
            data = resp.json()
            # Keep the id we requested so failed lookups can still be traced
            data['tmdb_id'] = movie_id
            results.append(data)
        pd.DataFrame(results).to_json(path)


In [22]:
# Usually I would use an environment variable to control this. However,
# environment variables are tricky with Jupyter, so we'll just hardcode this
# boolean manually. You probably want to leave this as False and use the cache
# we've already built.
rebuild_cache = False
if rebuild_cache:
    build_data_cache_from_tmdb()


In [23]:
# Now we want to recombine all of our cache files into a big dataframe
import os

directory = 'cache'

files = []
# iterate over files in
# that directory
for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    # checking if it is a file
    if os.path.isfile(f):
        files.append(f)

dataframes = []

# Read each file, add it to the array
for file in files:
    dataframes.append(pd.read_json(file))

# merge all dataframes into one
tmdb_movies_df = pd.concat(dataframes)
len(tmdb_movies_df)


62423

In [24]:
# join our dataframe built from api calls with the links 
# df that contains all of the different movie ids used to join our datasets
df_merged_with_gl_ids = pd.merge(
    tmdb_movies_df, links, how='inner', left_on='tmdb_id', right_on='tmdbId')

df_merged_with_gl_ids.head()


Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,video,vote_average,vote_count,tmdb_id,success,status_code,status_message,movieId,imdbId,tmdbId
0,0.0,,,0.0,"[{'id': 18, 'name': 'Drama'}]",,274104.0,tt0379176,es,B-Happy,...,0.0,5.3,6.0,274104.0,,,,185857,379176,274104.0
1,0.0,/hSGMwIlns3hmuvd0VrnayJCdpZ8.jpg,,10000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...",https://www.facebook.com/pages/Rubens-Place-20...,256935.0,tt2609706,en,Ruben's Place,...,0.0,5.7,10.0,256935.0,,,,185859,2609706,256935.0
2,0.0,/89JnVvAavvGfVP3KyTQZ9lzZs2M.jpg,,0.0,"[{'id': 14, 'name': 'Fantasy'}, {'id': 28, 'na...",,484997.0,tt6513338,cn,仙球大戰,...,0.0,0.0,0.0,484997.0,,,,185861,6513338,484997.0
3,0.0,/nmdsYZvgqxwLkcgWCYCXQEcDxW.jpg,,0.0,"[{'id': 10752, 'name': 'War'}, {'id': 18, 'nam...",,163363.0,tt0056160,hu,Két félidő a pokolban,...,0.0,7.7,15.0,163363.0,,,,185865,56160,163363.0
4,0.0,/pyYMPkIFuvtB4CVVmBSrkYlTg5D.jpg,,0.0,"[{'id': 99, 'name': 'Documentary'}]",,445004.0,tt6440810,en,Iron Men,...,0.0,6.3,7.0,445004.0,,,,185867,6440810,445004.0


In [25]:
# Compute average ratings from our csv dataset, round to the nearest 0.5,
# count the number of ratings, and add these columns to our dataframe
df_averaged_gl_ratings = ratings.groupby(
    'movieId')['rating'].mean().multiply(2).round().divide(2).reset_index()
df_gl_rating_count = ratings['movieId'].value_counts().rename_axis(
    'movieId').reset_index(name='number_of_ratings')
df_gl_rating_count.head()


df_ratings_computed = pd.merge(
    df_averaged_gl_ratings, df_gl_rating_count, how='inner', on='movieId')
df_ratings_computed.head()


Unnamed: 0,movieId,rating,number_of_ratings
0,1,4.0,57309
1,2,3.5,24228
2,3,3.0,11804
3,4,3.0,2523
4,5,3.0,11714
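
To make the half-star rounding above concrete, here is a small sketch on made-up mean ratings (the values are illustrative, not taken from the dataset). Doubling a value, rounding to a whole number, and halving again snaps it to the nearest 0.5.

```python
import pandas as pd

# Made-up mean ratings (illustration only)
means = pd.Series([3.25, 3.6, 3.75, 3.9])

# Doubling maps half-star steps onto whole numbers, so round() can snap to them
rounded = means.multiply(2).round().divide(2)
print(rounded.tolist())
```

One detail worth knowing: pandas rounds exact ties to the nearest even number, so a mean of exactly 3.25 (doubled to 6.5) lands on 3.0 rather than 3.5, while 3.75 (doubled to 7.5) lands on 4.0.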


In [26]:
df_merged_with_gl_ratings = pd.merge(
    df_merged_with_gl_ids, df_ratings_computed, how="inner", left_on='movieId', right_on='movieId')


We've done joins/merges on the numerical stuff, but what we really want is to convert some of these string attributes into something usable for training our model. The biggest one we want is genre. To get this into a more usable format, we will denormalize (one-hot encode) genre into many columns, i.e., for each genre we will add a column `is_genre_<genre-name>` to the dataframe, for example `is_genre_fantasy`, `is_genre_horror`, and so forth.

If a movie matches one of those genres, it will have a 1 in the corresponding column; otherwise it will have a 0.


In [27]:

denormalized_genres = []


for index, movie in movies.iterrows():
    genres = movie['genres'].split('|')
    movie_with_genres = {'movieId': movie['movieId']}
    for genre in genres:
        movie_with_genres[f'is_genre_{genre.lower()}'] = 1
    denormalized_genres.append(movie_with_genres)

df_genres = pd.DataFrame(denormalized_genres)
df_genres.fillna(0, inplace=True)

df_merged_with_genres = pd.merge(
    df_merged_with_gl_ratings, df_genres, how='inner', left_on='movieId', right_on='movieId')
df_merged_with_genres.head()


Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,is_genre_horror,is_genre_mystery,is_genre_sci-fi,is_genre_imax,is_genre_documentary,is_genre_war,is_genre_musical,is_genre_western,is_genre_film-noir,is_genre_(no genres listed)
0,0.0,,,0.0,"[{'id': 18, 'name': 'Drama'}]",,274104.0,tt0379176,es,B-Happy,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.0,/hSGMwIlns3hmuvd0VrnayJCdpZ8.jpg,,10000.0,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...",https://www.facebook.com/pages/Rubens-Place-20...,256935.0,tt2609706,en,Ruben's Place,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,/89JnVvAavvGfVP3KyTQZ9lzZs2M.jpg,,0.0,"[{'id': 14, 'name': 'Fantasy'}, {'id': 28, 'na...",,484997.0,tt6513338,cn,仙球大戰,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,/nmdsYZvgqxwLkcgWCYCXQEcDxW.jpg,,0.0,"[{'id': 10752, 'name': 'War'}, {'id': 18, 'nam...",,163363.0,tt0056160,hu,Két félidő a pokolban,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,0.0,/pyYMPkIFuvtB4CVVmBSrkYlTg5D.jpg,,0.0,"[{'id': 99, 'name': 'Documentary'}]",,445004.0,tt6440810,en,Iron Men,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
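
The same genre columns could also be produced without an explicit loop, using pandas' `Series.str.get_dummies`. This is an alternative sketch rather than what the notebook runs, and the small `movies` frame is a made-up stand-in for `movies.csv`.

```python
import pandas as pd

# Made-up stand-in for movies.csv (illustration only)
movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'genres': ['Adventure|Comedy', 'Comedy', 'Drama'],
})

# One 0/1 column per genre, built by splitting on the pipe separator
dummies = movies['genres'].str.get_dummies(sep='|')
dummies.columns = [f'is_genre_{name.lower()}' for name in dummies.columns]

# Put movieId back next to the indicator columns so the result can be merged
df_genres = pd.concat([movies[['movieId']], dummies], axis=1)
print(df_genres)
```

This variant yields integer 0/1 columns, whereas the loop followed by `fillna(0)` yields floats; either works as model input.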


In [28]:
# Pull the year out of the release date into its own column so that it is usable for analysis.
# Missing or unparseable release dates become NaN.
df_merged_with_genres['release_year'] = pd.to_datetime(
    df_merged_with_genres['release_date'], errors='coerce').dt.year

## Fetching reviews and doing sentiment analysis
We also have a large (3.7 GB) JSON file full of movie reviews! This is great, but it raises a few challenges:

1. Most of our group only has 8 GB of RAM total on our laptops. Reading a file this large into memory at once and operating on it is going to be extremely difficult, if not impossible.
2. These reviews are all just plaintext! You can't do k-nearest-neighbors or k-means on raw strings, because those methods need numeric distances and no two reviews will match exactly. Therefore, we need to find some way of converting this big pile of information into data that is easily interpreted by a classifier or regressor.



Fortunately, we can solve both of these problems at once. First, instead of reading the whole file into memory at once, we can read it in one line at a time. Second, instead of storing each full review string in memory, we can perform sentiment analysis on each string as we read it in, and then store only the polarity and subjectivity of each review along with the movie ID tying it back to the movie.
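
A minimal sketch of the streaming idea, under some assumptions: it keeps running totals per movie instead of writing chunked cache files, the reviews are made up, and `toy_score` is a stand-in for real sentiment analysis. The field names `item_id` and `txt` match the reviews file; `stream_average_scores` is a hypothetical helper name.

```python
import io
import json
from collections import defaultdict


def stream_average_scores(lines, score):
    """Average score(text) per movie while reading one JSON line at a time."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        review = json.loads(line)
        # Only the running sum and count are kept; the review text is discarded
        totals[review['item_id']] += score(review['txt'])
        counts[review['item_id']] += 1
    return {movie_id: totals[movie_id] / counts[movie_id] for movie_id in totals}


# Made-up reviews in an in-memory "file" (illustration only)
fake_file = io.StringIO(
    '{"item_id": 1, "txt": "good good"}\n'
    '{"item_id": 1, "txt": "bad"}\n'
    '{"item_id": 2, "txt": "good"}\n'
)


def toy_score(text):
    """Toy scorer: +1 per 'good', -1 per 'bad' (not real sentiment analysis)."""
    words = text.split()
    return words.count('good') - words.count('bad')


averages = stream_average_scores(fake_file, toy_score)
print(averages)
```

Memory use here grows with the number of distinct movies rather than the number of reviews, which is the property that matters for a file too large to load at once.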

In [29]:
from textblob import TextBlob

def review_sentiment(review):
    """Get the polarity and subjectivity of a user review.

    params: review, a string of a user's review
    returns: a tuple of the review's polarity and subjectivity,
        each rounded to 2 decimal places
    """
    pol, sub = TextBlob(review).sentiment
    return (round(pol, 2), round(sub, 2))

In [30]:
import json
import pandas as pd

REVIEWS_FILE = 'movie_dataset_public_final/raw/reviews.json'

def write_sentiment_chunk(reviews, start):
    """Write one chunk of review sentiment rows to the sentiment cache."""
    df_sentiment = pd.DataFrame(reviews)
    df_sentiment.to_csv(f'sentiment_cache/reviews_{start}_{start + len(reviews)}_data.csv')


def build_sentiment_cache():
    """Stream the reviews file line by line and cache sentiment in chunks of 1000."""
    reviews = []
    iteration = 0
    with open(REVIEWS_FILE) as f:
        # Each line of the file is one JSON-encoded review
        for line in f:
            review = json.loads(line)
            sentiment = review_sentiment(review['txt'])
            reviews.append({
                'movieId': review['item_id'],
                'polarity': sentiment[0],
                'subjectivity': sentiment[1],
            })

            if len(reviews) >= 1000:
                write_sentiment_chunk(reviews, iteration)
                print(f'iteration {iteration}')
                iteration += 1000
                reviews = []

    # Write whatever is left over after the last full chunk
    if len(reviews) > 0:
        write_sentiment_chunk(reviews, iteration)



In [31]:
REBUILD_SENTIMENT_CACHE = False

if REBUILD_SENTIMENT_CACHE:
    build_sentiment_cache()

In [32]:
directory = 'sentiment_cache'

files = []
# iterate over files in
# that directory
for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    # checking if it is a file
    if os.path.isfile(f):
        files.append(f)

dataframes = []

# Read each file, add it to the array
for file in files:
    dataframes.append(pd.read_csv(file))

# merge all dataframes into one
df_sentiment = pd.concat(dataframes)

# Should be 2624608 rows
len(df_sentiment)

2624608

In [33]:
df_sentiment.head()

Unnamed: 0.1,Unnamed: 0,movieId,polarity,subjectivity
0,0,120466,0.14,0.54
1,1,5210,-0.08,0.51
2,2,130520,0.23,0.48
3,3,7700,-0.01,0.56
4,4,442,-0.09,0.51


In [34]:
df_sentiment_avg = df_sentiment.groupby('movieId').mean().reset_index().drop(columns=['Unnamed: 0'])

In [43]:
# This merge drops any movies that don't have sentiment data. We might want to consider adding them back in later
df_movies_sanitized = pd.merge(df_sentiment_avg, df_merged_with_genres, how='inner', left_on='movieId', right_on='movieId' )

len(df_movies_sanitized)

50275
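
The comment in the cell above mentions possibly adding movies without sentiment data back in later. One way that could look, sketched on made-up frames standing in for `df_merged_with_genres` and `df_sentiment_avg`, is a left merge with pandas' `indicator` option so the unmatched rows stay visible:

```python
import pandas as pd

# Made-up stand-ins for the movie and averaged-sentiment frames (illustration only)
df_movies = pd.DataFrame({'movieId': [1, 2, 3], 'runtime': [81, 104, 101]})
df_sentiment = pd.DataFrame({'movieId': [1, 3], 'polarity': [0.14, -0.08]})

# how='left' keeps every movie; indicator=True adds a _merge column saying
# whether a row matched ('both') or had no sentiment data ('left_only')
df_all = pd.merge(df_movies, df_sentiment, how='left', on='movieId', indicator=True)

missing = df_all[df_all['_merge'] == 'left_only']
print(len(df_all), len(missing))
```

Rows without sentiment get `NaN` polarity, so they would still need to be imputed or excluded before being fed to a distance-based model.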

## Visualizations


Our first visualization shows how many ratings each movie in our dataset has received. This is helpful in establishing the accuracy of our "average rating" statistic. If the majority of our movies have only one rating, then those averages may not accurately reflect the public consensus about each movie. However, if most movies have 100 ratings, then we can be more confident that our averages reflect the general public's feelings towards each movie.
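
Alongside the plot, the same question can be checked numerically. This is a small sketch on made-up rating counts; `min_ratings` is a hypothetical threshold for what we would treat as a trustworthy average, not a value we have settled on.

```python
import pandas as pd

# Made-up rating counts per movie (illustration only)
number_of_ratings = pd.Series([1, 1, 3, 50, 120, 5000])

# Hypothetical threshold for a "trustworthy" average rating
min_ratings = 100
share_below = (number_of_ratings < min_ratings).mean()
median_ratings = number_of_ratings.median()

print(f'median ratings per movie: {median_ratings}')
print(f'share of movies with fewer than {min_ratings} ratings: {share_below:.2f}')
```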


In [35]:
import plotly.express as px
df_num_ratings = df_merged_with_genres['number_of_ratings'].value_counts(
).rename_axis('number_of_ratings').reset_index(name='count')

fig = px.histogram(df_num_ratings, x='number_of_ratings', y='count',
                   nbins=20, log_x=True, title="How many people review each movie?")

fig.update_yaxes(title_text="# of movies")
fig.update_xaxes(title_text="# of ratings")

fig.update_traces(xbins=dict(  # bins used for histogram
    start=0.0,
    end=100,
    size=1
))
fig.show()


The visualization below serves two purposes:

- It helps us see the distribution of movies based on runtime
- It helps us determine if runtime is a significant factor in ratings

These pieces of information will help us identify whether runtime is an important factor to consider when recommending a movie to a user.
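
A quick numeric companion to the plot could be the correlation between runtime and rating. The sketch below uses made-up values purely to show the call; it says nothing about what the real data looks like.

```python
import pandas as pd

# Made-up runtimes and ratings (illustration only)
df = pd.DataFrame({
    'runtime': [80, 95, 110, 125, 140],
    'rating': [3.0, 3.0, 3.5, 3.5, 4.0],
})

# Pearson correlation (the pandas default): values near 0 suggest little linear
# relationship, values near +1 or -1 suggest a strong one
corr = df['runtime'].corr(df['rating'])

# Mean runtime within each rating bucket, as a second view of the same question
mean_runtime_by_rating = df.groupby('rating')['runtime'].mean()

print(round(corr, 3))
print(mean_runtime_by_rating)
```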


In [36]:
fig = px.histogram(df_merged_with_genres.sort_values(
    by="rating"), x='runtime', nbins=10, color="rating", log_y=True, title="Runtime Distribution & Ratings vs Runtime")

fig.update_traces(xbins=dict( 
    start=0.0,
    end=250,
    size=10
))
fig.update_layout(legend_traceorder="reversed")


fig.show()


### Machine Learning Tools:

We will use k-nearest neighbors to recommend similar movies. We will define a distance measurement that incorporates various attributes of a movie, such as user rating, the colors in a movie poster, or genre, so that any two movies can be compared to see how similar they are. We can then recommend, say, the 5 nearest neighbors based on this distance metric. However, k-nearest neighbors relies on all the inputs to the distance metric being numerical, and a lot of our data is strings. A support vector machine (SVM) is another option we may explore, but it also operates on numeric feature vectors, so string attributes would still need to be encoded numerically (as we did for genres).
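
To illustrate the nearest-neighbor idea, here is a minimal NumPy sketch under stated assumptions: the four feature rows are made up, the distance is plain Euclidean on standardized columns, and `nearest_neighbors` is a hypothetical helper rather than our final model.

```python
import numpy as np

# Made-up feature rows: [runtime, polarity, is_genre_comedy] (illustration only)
titles = ['A', 'B', 'C', 'D']
features = np.array([
    [90.0, 0.20, 1.0],
    [95.0, 0.25, 1.0],
    [180.0, -0.10, 0.0],
    [100.0, 0.10, 1.0],
])

# Standardize each column so large-valued features (runtime) don't dominate
scaled = (features - features.mean(axis=0)) / features.std(axis=0)


def nearest_neighbors(query_index, k):
    """Return the indices of the k rows closest to the query row."""
    distances = np.linalg.norm(scaled - scaled[query_index], axis=1)
    order = np.argsort(distances)
    # Drop the query itself, which is always at distance 0
    return [int(i) for i in order if i != query_index][:k]


neighbors = nearest_neighbors(0, k=2)
print([titles[i] for i in neighbors])
```

Without the standardization step, runtime (measured in minutes) would swamp polarity (a value between -1 and 1) in the distance.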


In [45]:
df_movies_sanitized.columns

Index(['movieId', 'polarity', 'subjectivity', 'adult', 'backdrop_path',
       'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count', 'tmdb_id', 'success', 'status_code',
       'status_message', 'imdbId', 'tmdbId', 'rating', 'number_of_ratings',
       'is_genre_adventure', 'is_genre_animation', 'is_genre_children',
       'is_genre_comedy', 'is_genre_fantasy', 'is_genre_romance',
       'is_genre_drama', 'is_genre_action', 'is_genre_crime',
       'is_genre_thriller', 'is_genre_horror', 'is_genre_mystery',
       'is_genre_sci-fi', 'is_genre_imax', 'is_genre_documentary',
       'is_genre_war', 'is_genre_musical', 'is_genre_western',
       'is_genre_film-noir', 'is_genre_(no genr

In [None]:
def w8d_feat(feat_name, weight):
    """Pair a feature (column) name with its weight."""
    return {"feat": feat_name, "weight": weight}

x_feat_list = [w8d_feat('runtime', 1), w8d_feat('polarity', 1), w8d_feat('subjectivity', 1),
               w8d_feat('adult', 1), w8d_feat('budget', 1)]
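
One way a weighted feature list like this might be turned into model input is sketched below. This is an assumption about where the cell above is heading, not something the notebook does yet: `build_weighted_features` is a hypothetical helper, and both the weights and the movie rows are made up.

```python
import pandas as pd

# Same shape as the weighted feature list above; the weights here are made up
x_feat_list = [
    {'feat': 'runtime', 'weight': 1},
    {'feat': 'polarity', 'weight': 2},
]

# Made-up movie rows (illustration only)
df = pd.DataFrame({
    'runtime': [90.0, 120.0, 150.0],
    'polarity': [0.1, 0.2, 0.3],
})


def build_weighted_features(df, feat_list):
    """Standardize each listed column, then multiply it by its weight."""
    columns = {}
    for item in feat_list:
        col = df[item['feat']]
        columns[item['feat']] = (col - col.mean()) / col.std() * item['weight']
    return pd.DataFrame(columns)


x = build_weighted_features(df, x_feat_list)
print(x)
```

In a Euclidean distance, multiplying a column by a weight `w` multiplies its squared contribution by `w ** 2`, so a weight of 2 makes that feature count four times as much. A constant column would have zero standard deviation and would need to be dropped before standardizing.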