### The Movie Database 5000 Movie Dataset

In this content-based filtering project I will build a function to generate movie recommendations based on their similarity to a user's chosen movie. Similarity will be calculated based on text features of each movie as found in the Movie Database 5000 Movie Dataset.

In [1]:
import numpy as np
import pandas as pd
import ast
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
movies_df = pd.read_csv("../input/tmdb-movie-metadata/tmdb_5000_movies.csv")
credits_df = pd.read_csv("../input/tmdb-movie-metadata/tmdb_5000_credits.csv")

In [3]:
movies_df.head(3)

In [4]:
credits_df.head(3)

The first thing to deal with is the JSON-like columns. These look like they contain useful information, but I can't use them in their current format. Let's have a look at some examples.

In [5]:
movies_df.loc[0, 'genres']

In [6]:
movies_df.loc[0,'keywords']

In [7]:
movies_df.loc[0,'production_companies']

In [8]:
movies_df.loc[0,'production_countries']

In [9]:
movies_df.loc[0,'spoken_languages']

In [10]:
credits_df.loc[0,'cast']

In [11]:
credits_df.loc[0,'crew']

All of these features look like they could be useful, with the exception of `spoken_languages`. The actual language of the movie is represented by the `original_language` column. This may well be an important factor in whether a user likes a movie or not, but the fact that another language is spoken at some point during the movie is unlikely to be important.

`keywords` and `genres` are clearly going to be important for movie recommendations, so I will need to extract the `name` field from these JSON-like entries. `production_companies` and `production_countries` may also be useful, as users may prefer movies from a certain company, such as Disney, or from a specific country. However, some of the entries list many companies and countries. As there is no clear hierarchy among them I will need to use them all, which could ultimately confuse the model.

Finally, `cast` and `crew` look like they may be useful, but likely only some of the entries. Fortunately, the `cast` column appears to be sorted by order of actor/character importance, so I can simply take the five main actors. Similarly, the `crew` column contains each crew member's role, so I can simply extract the director. 
 
Although the entries look like JSON objects, due to the way Pandas reads the data from the CSV file they are actually strings. Therefore we will need to use a trick to extract the pertinent information. I am grateful to Abhishek Jaiswal for the following method: https://www.analyticsvidhya.com/blog/2021/12/comprehensive-project-on-building-a-movie-recommender-website/

Originally in this project I used the method without any modifications. However, on generating recommendations for movies similar to 'Toy Story', one of the top recommendations was the film 'Everything You Always Wanted to Know About Sex (But Were Afraid to Ask)', not the first movie I would think of when recommending something to watch to a fan of Toy Story! I realised that this was happening because the first name of the director (Woody Allen) matches the name of the main character in Toy Story (Woody). By joining first and last names of cast and crew (and production companies) into a single string I was able to avoid these problems.

In [12]:
def get_name(col):
    my_list = []
    for i in ast.literal_eval(col):
        my_list.append(i['name'])
    return my_list

def get_actors(col):
    my_list = []
    # The cast list is ordered by importance, so the first five entries are the main actors
    for i in ast.literal_eval(col)[:5]:
        name = i['name'].replace(" ","")  # Code edited to create single string from name
        my_list.append(name)
    return my_list

def get_director(col):
    my_list = []
    for i in ast.literal_eval(col):
        if i['job'] == 'Director':
            name = i['name']
            name = name.replace(" ","")  # Code edited to create single string from name
            my_list.append(name)
    return my_list

def get_company(col):
    my_list = []
    for i in ast.literal_eval(col):
        name = i['name']                            
        name = name.replace(" ","")  # Code edited to create single string from name
        my_list.append(name)
    return my_list

def get_country(col):
    my_list = []
    for i in ast.literal_eval(col):
        my_list.append(i["iso_3166_1"])
    return my_list
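
To make the two ideas above concrete, here is a minimal sketch using a made-up cell value (the string in `raw` and the name 'Jane Doe' are my own illustration, not rows from the dataset). It shows that the cell is a string until `ast.literal_eval` parses it, and why joining first and last names stops 'Woody Allen' from sharing a token with the character name 'Woody'.

```python
import ast

# Made-up cell value in the same shape as the `crew` column: a string, not a list
raw = '[{"job": "Director", "name": "Woody Allen"}, {"job": "Editor", "name": "Jane Doe"}]'

parsed = ast.literal_eval(raw)  # now a real list of dicts

directors = [person["name"] for person in parsed if person["job"] == "Director"]
joined = [name.replace(" ", "") for name in directors]

# Lowercased, whitespace-split tokens: roughly what a vectorizer would see
character_tokens = set("Woody".lower().split())
overlap_separate = set(directors[0].lower().split()) & character_tokens
overlap_joined = set(joined[0].lower().split()) & character_tokens

print(type(raw).__name__, "->", type(parsed).__name__)
print("Shared tokens with separate names:", overlap_separate)
print("Shared tokens with joined names:", overlap_joined)
```

With separate names the token 'woody' is shared; with the joined form 'woodyallen' nothing overlaps.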

In [13]:
movies_df.genres = movies_df.genres.apply(get_name)
movies_df.keywords = movies_df.keywords.apply(get_name)
movies_df.production_countries = movies_df.production_countries.apply(get_country)
movies_df.production_companies = movies_df.production_companies.apply(get_company)
credits_df.cast = credits_df.cast.apply(get_actors)
credits_df.crew = credits_df.crew.apply(get_director)

Now we want to join the two datasets into a single dataframe. First I will check that the movies in both datasets are the same.

In [14]:
set(movies_df.id.unique() == credits_df.movie_id.unique())

As the movies are the same in both datasets, we can join the two together without worrying about introducing NAs.
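
Note that the check above compares the two ID columns position by position, so it also relies on both files listing the movies in the same order. As a sketch of an order-independent alternative, the example below uses plain Python lists with illustrative IDs standing in for `movies_df.id` and `credits_df.movie_id`; the helper names `same_order` and `same_members` are hypothetical.

```python
# Illustrative ID lists standing in for the two ID columns
movie_ids = [19995, 285, 206647]
credit_ids = [285, 206647, 19995]  # same films, different row order

def same_order(a, b):
    """True only if both sequences hold the same values in the same positions."""
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))

def same_members(a, b):
    """True if both sequences hold the same values, regardless of order."""
    return set(a) == set(b)

print("Same order:  ", same_order(movie_ids, credit_ids))
print("Same members:", same_members(movie_ids, credit_ids))
```

On the real dataframes the equivalent check would be `set(movies_df.id) == set(credits_df.movie_id)`.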

In [15]:
credits_df.rename(columns={'movie_id':'id'}, inplace=True)
credits_df.drop(["title"], inplace=True, axis=1)
movies_df = movies_df.merge(credits_df, on='id')

Now I can drop some of the columns that I don't plan to use. As stated, this recommender project will be based on similarity of text. Therefore, I will drop all non-text columns. I will also drop `original_title` and `spoken_languages`. The `title` column contains the important information relating to the movie title, and the language information is found in `original_language`, as discussed above. Finally, `homepage` is unlikely to be helpful for recommendations, so this will also be removed.

In [16]:
movies_df.drop(["budget", "homepage", "popularity", "original_title", "release_date", 
                "revenue", "runtime", "spoken_languages", "vote_average", "vote_count"], axis=1, inplace=True)

How about `status`? What are the different options?

In [17]:
movies_df.status.unique()

We can recommend movies that are due to come out (those in post-production), but it doesn't really make sense to recommend movies that are only rumoured. Let's see how many there are.

In [18]:
movies_df.status.value_counts()

Only five of the movies in the dataset are 'rumoured' movies, so it is no problem to drop them. I can then drop the `status` column too.

In [19]:
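# .copy() gives an independent dataframe, so later column assignments
# are not flagged with a SettingWithCopyWarning
movies_df = movies_df.loc[movies_df.status != 'Rumored'].copy()
movies_df.drop("status", axis=1, inplace=True)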

Now let's look at some general information on the dataframe.

In [20]:
movies_df.info()

The `info` method shows that there are a large number of missing values in the `tagline` column, plus a small number in `overview`. I will replace these NAs with a space.

In [21]:
movies_df.tagline = movies_df.tagline.fillna(" ")
movies_df.overview = movies_df.overview.fillna(" ")

In [22]:
movies_df.head(2)

Looking at the sample of the dataframe above, it is clear that the functions used on the `genres`, `keywords`, etc. columns returned lists. For text processing it would be better if these were strings, so I will convert them now.

In [23]:
def list_to_string(col):
    string = ' '.join([str(i) for i in col])
    return string

movies_df.genres = movies_df.genres.apply(list_to_string)
movies_df.keywords = movies_df.keywords.apply(list_to_string)
movies_df.production_companies = movies_df.production_companies.apply(list_to_string)
movies_df.production_countries = movies_df.production_countries.apply(list_to_string)
movies_df.cast = movies_df.cast.apply(list_to_string)
movies_df.crew = movies_df.crew.apply(list_to_string)

In [24]:
movies_df.head(2)

Before I get to the actual recommender phase, let's have a look at some of the most frequently occurring words in each of the columns. This can be done using a Word Cloud.

In [65]:
# Create a function that can be reused for each column. First I am going to create
# my own set of stopwords.

my_stopwords = text.ENGLISH_STOP_WORDS

def wc_generator(col):
    wc_text = " ".join(word for word in movies_df[col])
    wc = WordCloud(background_color = "white", max_words = 2000, max_font_size = 100, random_state = 3, 
              stopwords = my_stopwords, contour_width = 3).generate(wc_text)
    fig = plt.figure(figsize = (20, 10))
    plt.imshow(wc, interpolation = "bilinear")
    plt.axis("off")
    plt.show()

In [50]:
wc_generator("genres")

In [51]:
wc_generator("keywords")

In [52]:
wc_generator("overview")

In [53]:
wc_generator("tagline")

In [54]:
wc_generator("cast")

In [55]:
wc_generator("crew")

In [56]:
wc_generator("production_countries")

In [57]:
wc_generator("production_companies")

The word clouds generally look as we would expect. However, two things stand out to me. Firstly, the most frequently occurring keyword is 'based'. This is not going to be helpful when it comes to finding similarities between movies, so I will add it to the stopwords to be removed. Secondly, there is a notable omission from the production countries' word cloud: US. It is missing because the list of stopwords contains the word 'us'. I am therefore going to add one word to the stopwords and remove another. Because the stopwords list from scikit-learn is actually a frozenset, which cannot be modified, I will first create a set version, and then add/remove the words.

In [66]:
my_stopwords = set(my_stopwords)
my_stopwords.add('based')
my_stopwords.remove('us')
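
As a small illustration of why the conversion is needed, the sketch below uses a three-word frozenset of my own as a stand-in for scikit-learn's much larger stopword list.

```python
# A small stand-in for the scikit-learn stopword list (which is a frozenset)
stop_words = frozenset({"the", "and", "us"})

# A frozenset is immutable: it has no add() or remove() methods
can_add = hasattr(stop_words, "add")

# Converting to a regular set gives a mutable copy; the original is unchanged
editable = set(stop_words)
editable.add("based")
editable.remove("us")

print("frozenset has add():", can_add)
print("edited copy:", sorted(editable))
print("original still contains 'us':", "us" in stop_words)
```

One design note: `remove` raises a `KeyError` if the word is not present, whereas `discard` silently does nothing, which can be convenient if the cell is re-run.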

In [68]:
wc_generator("keywords")

In [69]:
wc_generator("production_countries")

That looks better.

Now we can combine all of the text columns into a single column which will be used to create our feature vectors. I will create a copy of the original dataframe so that I can come back to it later.

In [70]:
text_df = movies_df.copy(deep=True)

text_df["text"] = text_df.genres+" "+text_df.keywords+" "+ \
                                " "+text_df.original_language+" "+text_df.overview+ \
                                " "+text_df.production_companies+" "+text_df.production_countries+ \
                                " "+text_df.tagline+" "+text_df.title+" "+text_df.cast+ \
                                " "+text_df.crew
text_df.drop(['genres', 'keywords', 'original_language', 'overview',
       'production_companies', 'production_countries', 'tagline',
       'cast', 'crew'], axis=1, inplace=True)

The first approach I will take is to use Term Frequency-Inverse Document Frequency (TF-IDF) to get a measure of how important each individual word is in the context of the whole set of documents (the corpus). Scikit-learn's TfidfVectorizer performs preprocessing on the text, such as converting to lowercase and tokenising each word. I can pass my edited set of stopwords as an argument in order to remove the stopwords I just defined.

In [71]:
# Recent scikit-learn versions expect stop_words to be a list rather than a set
tfidf_vectorizer = TfidfVectorizer(stop_words=list(my_stopwords))
tfidf_matrix = tfidf_vectorizer.fit_transform(text_df.text)
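
To show what the TF-IDF weights described above represent, here is a hand-rolled sketch on three tiny made-up documents. It uses the smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation, which matches TfidfVectorizer's default settings as far as I am aware. The tokenisation is a simple whitespace split, so it is a simplification of what scikit-learn does.

```python
import math
from collections import Counter

# Three tiny made-up "movie texts"
docs = [
    "toy cowboy toy",
    "toy doll killer",
    "space alien",
]

tokenised = [d.split() for d in docs]
n_docs = len(tokenised)

# Document frequency: in how many documents does each term appear?
df = Counter(term for tokens in tokenised for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency: rarer terms get a larger value
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(tokens):
    # Term count times IDF, then scale the vector to unit length
    counts = Counter(tokens)
    weights = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf(tokens) for tokens in tokenised]

for doc, vec in zip(docs, vectors):
    print(doc, "->", {t: round(w, 3) for t, w in vec.items()})
```

In the second document 'toy' ends up with a lower weight than 'doll' or 'killer', because 'toy' also appears in another document and is therefore less distinctive.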

To measure similarity I will use cosine similarity, a metric frequently used when comparing text. It is the cosine of the angle between two vectors, so a small angle (vectors pointing in nearly the same direction) gives a value close to 1. Cosine distance is simply 1 - cosine similarity. Because TF-IDF weights are never negative, the similarity here ranges from 0 (no terms in common) to 1 (identical direction).

For each movie in the dataset, I will calculate the cosine similarity between it and every other movie.

In [72]:
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
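
For intuition, cosine similarity can also be computed by hand as the dot product divided by the product of the vector lengths. The sketch below does this for sparse vectors stored as dictionaries; the function name `cosine_sim_sparse` and the term counts are my own illustration, not values from the dataset.

```python
import math

def cosine_sim_sparse(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Made-up term counts for three movies
toy_story = {"toy": 2, "cowboy": 1}
childs_play = {"toy": 1, "doll": 1, "killer": 1}
alien = {"space": 1, "alien": 1}

print(round(cosine_sim_sparse(toy_story, childs_play), 4))
print(cosine_sim_sparse(toy_story, alien))
```

The first pair shares only the term 'toy', giving 2 / (sqrt(5) * sqrt(3)); the second pair shares no terms, so the similarity is 0.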

Next, I create a new dataframe consisting of all of the similarities and using the index from the original dataframe.

In [73]:
recommendations_df = pd.DataFrame(cosine_sim, columns=text_df.title, index=text_df.title)

In [75]:
recommendations_df.iloc[0:5, 0:5]

As you would expect, the diagonal entries are all equal to 1, as this is each movie's similarity with itself. The off-diagonal values show the similarities with other movies, which is what we are interested in.

I will now create a function that a user could use to generate as many recommendations as they like, ordered from most to least similar, excluding the original movie entered by the user.

In [76]:
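def movie_recommender(movie, num_recommendations, database):
    # Similarity of every movie to the chosen one, highest first
    temp_df = database[movie].sort_values(ascending=False)
    # Position 0 is the chosen movie itself (similarity 1), so skip it.
    # Iterating over (title, score) pairs avoids label lookups, which fail on duplicate titles.
    for item, score in temp_df.iloc[1:num_recommendations+1].items():
        print("Recommendation: \033[1m{}\033[0m \t Similarity: \033[1m{:.2%}\033[0m".format(item, score))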
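
The function above assumes the chosen movie sorts to the top and skips it by position. As an alternative sketch, the version below excludes the chosen movie by title and returns the results instead of printing them. The nested dictionary `scores` is a hypothetical stand-in for the similarity dataframe, the values are invented, and the name `top_matches` is my own.

```python
def top_matches(scores, movie, n):
    """Return the n (title, similarity) pairs most similar to `movie`, best first."""
    candidates = [(title, s) for title, s in scores[movie].items() if title != movie]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:n]

# Invented similarity values, for illustration only
scores = {
    "Toy Story": {"Toy Story": 1.0, "Big": 0.12, "Toy Story 2": 0.45, "Alien": 0.01},
}

for title, similarity in top_matches(scores, "Toy Story", 2):
    print("Recommendation: {} \t Similarity: {:.2%}".format(title, similarity))
```

Returning a list makes the results easier to reuse, for example to compare the output of two different models.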

In [77]:
movie_recommender('Toy Story', 10, recommendations_df)

The above results show that the recommender is working, at least to an extent. The first two movies returned look like solid recommendations, as Toy Story 2 and 3 are undoubtedly similar to Toy Story. Big and The Indian in the Cupboard also seem like acceptable recommendations, as they are both children's movies. Some of the other recommendations seem less suitable, such as The 40 Year Old Virgin and Factory Girl. The most concerning recommendations, though, are Child's Play and Child's Play 2, two movies about a murderous doll! Let's have a look at the entries for Toy Story and Child's Play.

In [78]:
movies_df[(movies_df.title == "Child's Play")|(movies_df.title == "Toy Story")]

Looking at the `keywords` column, the word 'toy' stands out. This is likely to be the reason for the similarity.
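
Rather than scanning the columns by eye, the shared terms between two movies can be listed directly. The sketch below uses made-up keyword strings (not the actual dataset entries), and the helper name `shared_terms` is hypothetical.

```python
def shared_terms(text_a, text_b, stop_words=frozenset()):
    """Return the sorted terms that appear in both texts, ignoring stopwords."""
    a = set(text_a.lower().split()) - stop_words
    b = set(text_b.lower().split()) - stop_words
    return sorted(a & b)

# Made-up keyword strings, for illustration only
toy_story_text = "animation comedy toy jealousy friendship"
childs_play_text = "horror toy doll serial killer"

print(shared_terms(toy_story_text, childs_play_text))
```

On the real data, the two arguments would be the combined `text` values of the two movies being compared.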

Another approach for representing text as vectors is the Bag of Words approach, which is essentially a count of how many times each word appears in the text. These vectors can be created using scikit-learn's CountVectorizer. Perhaps this will give a better result.

In [79]:
# Recent scikit-learn versions expect stop_words to be a list rather than a set
count_vectorizer = CountVectorizer(stop_words=list(my_stopwords))
count_matrix = count_vectorizer.fit_transform(text_df.text)
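
As a minimal sketch of what a Bag of Words matrix looks like, the example below builds one by hand for two made-up documents. Each row is a document, each column is a word from the shared vocabulary, and each value is a raw count.

```python
from collections import Counter

# Two tiny made-up "movie texts"
docs = ["toy cowboy toy", "toy doll killer"]

# One bag (word -> count) per document
bags = [Counter(d.lower().split()) for d in docs]

# Shared vocabulary, sorted so the column order is fixed
vocab = sorted(set().union(*bags))

# Count matrix: a Counter returns 0 for words a document does not contain
matrix = [[bag[term] for term in vocab] for bag in bags]

print(vocab)
for row in matrix:
    print(row)
```

Unlike TF-IDF, the counts are not scaled down for words that appear in many documents, so a shared common word such as 'toy' contributes its full count to the similarity.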

In [80]:
cosine_sim_count = cosine_similarity(count_matrix, count_matrix)
count_recommendations_df = pd.DataFrame(cosine_sim_count, columns=text_df.title, index=text_df.title)

In [81]:
movie_recommender('Toy Story', 10, count_recommendations_df)

In fact, the Bag of Words approach has actually suggested a greater similarity between Toy Story and Child's Play than the TF-IDF approach.

One other thing we can try, to see if the model improves, is to use fewer features. Features such as the title, production country and production company may actually be confusing the model. I will try again without these features to see if the results change. As we won't be using production country anymore, we can add 'us' back into the stopwords.

In [82]:
my_stopwords.add('us')

In [83]:
smaller_text_df = movies_df.copy(deep=True)
smaller_text_df["text"] = smaller_text_df.genres+" "+smaller_text_df.keywords+" "+ \
                                " "+smaller_text_df.overview+" "+smaller_text_df.tagline \
                                +" "+smaller_text_df.cast+" "+smaller_text_df.crew
smaller_text_df.drop(['genres', 'keywords', 'original_language', 'overview',
       'production_companies', 'production_countries', 'tagline',
       'cast', 'crew'], axis=1, inplace=True)

In [84]:
# Recent scikit-learn versions expect stop_words to be a list rather than a set
tfidf_vectorizer = TfidfVectorizer(stop_words=list(my_stopwords))
tfidf_matrix = tfidf_vectorizer.fit_transform(smaller_text_df.text)

In [85]:
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
recommendations_df = pd.DataFrame(cosine_sim, columns=smaller_text_df.title, index=smaller_text_df.title)

In [86]:
movie_recommender('Toy Story', 10, recommendations_df)

In [87]:
# Use the same edited stopwords as the TF-IDF model, so that 'based' is removed here too
count_vectorizer = CountVectorizer(stop_words=list(my_stopwords))
count_matrix = count_vectorizer.fit_transform(smaller_text_df.text)

In [88]:
cosine_sim_count = cosine_similarity(count_matrix, count_matrix)
count_recommendations_df = pd.DataFrame(cosine_sim_count, columns=smaller_text_df.title, index=smaller_text_df.title)

In [89]:
movie_recommender('Toy Story', 10, count_recommendations_df)

This hasn't had much of an effect on the recommender. Let's have a look at a few other recommendations. I will use the TF-IDF model.

In [90]:
movie_recommender('Alien', 10, recommendations_df)

In [91]:
movie_recommender('Jurassic Park', 10, recommendations_df)

In [93]:
movie_recommender('The Godfather', 10, recommendations_df)

The results show us the limitations of the recommender. It does seem to be able to pick up on general themes, hence recommending movies set in space for Alien, movies with dinosaurs for Jurassic Park, and movies involving the Mafia for The Godfather. In particular, the top one or two results do appear to be good recommendations. However, beyond this, the recommender seems to give too little weight to the movie genre. For example, for The Godfather it ranks the comedy Mickey Blue Eyes above the much more similar film Goodfellas. It also fails to take into account whether a movie is appropriate for adults or children.

If I were to expand on this project, I would look to give more weight to genre, as well as adding a feature indicating whether the film was for adults or children.
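
One simple way to give genre more weight, which I have not tested in this project, would be to repeat the genre words when building the combined text, so that they count several times in the vectors. The sketch below illustrates the idea; the function name `build_text` and the parameter `genre_weight` are hypothetical.

```python
def build_text(genres, other_text, genre_weight=3):
    """Combine genre words and other text, repeating the genres to up-weight them."""
    return " ".join([genres] * genre_weight + [other_text])

# Made-up example values
weighted = build_text("Crime Drama", "mafia family", genre_weight=2)
print(weighted)
```

The best value for the repetition count would need to be found by trial and error, by checking whether recommendations such as those for The Godfather improve.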