# K-nearest neighbors: Movie recommendation system

## 1. Data loading
### 1.1. Load

In [1]:
# Handle imports up-front
import json
import pickle
import pandas as pd

movies_data_file='../data/raw_data/movies.csv'
credits_data_file='../data/raw_data/movie_credits.csv'

movies=pd.read_csv(movies_data_file)
credits=pd.read_csv(credits_data_file)

### 1.2. Inspect

In [2]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

The columns are a mix of numbers and strings. Some columns clearly have missing values, for example *homepage*, which is non-null in only 1712 of 4803 rows. There is also an *id* column that is probably not useful as a model feature. We will take a closer look at it later.

In [3]:
credits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4803 non-null   int64 
 1   title     4803 non-null   object
 2   cast      4803 non-null   object
 3   crew      4803 non-null   object
dtypes: int64(1), object(3)
memory usage: 150.2+ KB


No obvious missing values here. *movie_id* is probably not needed for the model, but might be useful for joining the two dataframes.

### 1.3. Join

Before we start exploring and cleaning the data, let's join our two dataframes so that all of the data is in one place. The project tutorial does this with SQL. But if we don't need an SQL database containing this data for any other reason, we don't need to create one. Pandas can do the join directly, saving us some unnecessary data processing steps and intermediate data artifacts.
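As a minimal sketch of what a pandas join looks like, here is the same kind of merge on two toy dataframes. The toy values are made up for illustration. The `validate` and `indicator` arguments are optional extras that are not used in the notebook's own merge; they are shown here only as one possible way to sanity-check a join.

```python
import pandas as pd

# Toy stand-ins for the movies and credits tables (values are illustrative only)
movies_toy = pd.DataFrame({
    'id': [5, 11, 12],
    'original_title': ['Four Rooms', 'Star Wars', 'Finding Nemo'],
})
credits_toy = pd.DataFrame({
    'id': [5, 11, 12],
    'title': ['Four Rooms', 'Star Wars', 'Finding Nemo'],
})

# validate='one_to_one' raises a MergeError if either side has duplicate keys;
# indicator=True adds a '_merge' column recording which table(s) each row came from
merged_toy = pd.merge(
    movies_toy, credits_toy,
    on='id', how='outer',
    validate='one_to_one', indicator=True,
)

# Every row should have been found in both tables
all_matched = bool((merged_toy['_merge'] == 'both').all())
print(all_matched)
```

If some IDs were present in only one table, the `_merge` column would show `left_only` or `right_only` for those rows, which makes silent mismatches in an outer join easy to spot.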

In [4]:
# Rename ID column so that it matches between the dataframes
credits.rename({'movie_id': 'id'}, axis=1, inplace=True)

data_df=pd.merge(movies, credits, on='id', how='outer')
data_df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title_x,vote_average,vote_count,title_y,cast,crew
0,4000000,"[{""id"": 80, ""name"": ""Crime""}, {""id"": 35, ""name...",,5,"[{""id"": 612, ""name"": ""hotel""}, {""id"": 613, ""na...",en,Four Rooms,It's Ted the Bellhop's first night on the job....,22.87623,"[{""name"": ""Miramax Films"", ""id"": 14}, {""name"":...",...,98.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Twelve outrageous guests. Four scandalous requ...,Four Rooms,6.5,530,Four Rooms,"[{""cast_id"": 42, ""character"": ""Ted the Bellhop...","[{""credit_id"": ""52fe420dc3a36847f800012d"", ""de..."
1,11000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 28, ""...",http://www.starwars.com/films/star-wars-episod...,11,"[{""id"": 803, ""name"": ""android""}, {""id"": 4270, ...",en,Star Wars,Princess Leia is captured and held hostage by ...,126.393695,"[{""name"": ""Lucasfilm"", ""id"": 1}, {""name"": ""Twe...",...,121.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"A long time ago in a galaxy far, far away...",Star Wars,8.1,6624,Star Wars,"[{""cast_id"": 3, ""character"": ""Luke Skywalker"",...","[{""credit_id"": ""52fe420dc3a36847f8000437"", ""de..."
2,94000000,"[{""id"": 16, ""name"": ""Animation""}, {""id"": 10751...",http://movies.disney.com/finding-nemo,12,"[{""id"": 494, ""name"": ""father son relationship""...",en,Finding Nemo,"Nemo, an adventurous young clownfish, is unexp...",85.688789,"[{""name"": ""Pixar Animation Studios"", ""id"": 3}]",...,100.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"There are 3.7 trillion fish in the ocean, they...",Finding Nemo,7.6,6122,Finding Nemo,"[{""cast_id"": 8, ""character"": ""Marlin (voice)"",...","[{""credit_id"": ""52fe420ec3a36847f80006b1"", ""de..."
3,55000000,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 18, ""nam...",,13,"[{""id"": 422, ""name"": ""vietnam veteran""}, {""id""...",en,Forrest Gump,A man with a low IQ has accomplished great thi...,138.133331,"[{""name"": ""Paramount Pictures"", ""id"": 4}]",...,142.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"The world will never be the same, once you've ...",Forrest Gump,8.2,7927,Forrest Gump,"[{""cast_id"": 7, ""character"": ""Forrest Gump"", ""...","[{""credit_id"": ""52fe420ec3a36847f800076b"", ""de..."
4,15000000,"[{""id"": 18, ""name"": ""Drama""}]",http://www.dreamworks.com/ab/,14,"[{""id"": 255, ""name"": ""male nudity""}, {""id"": 29...",en,American Beauty,"Lester Burnham, a depressed suburban father in...",80.878605,"[{""name"": ""DreamWorks SKG"", ""id"": 27}, {""name""...",...,122.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Look closer.,American Beauty,7.9,3313,American Beauty,"[{""cast_id"": 6, ""character"": ""Lester Burnham"",...","[{""credit_id"": ""52fe420ec3a36847f8000809"", ""de..."


From a quick inspection, we can see that our merge worked - the titles from both dataframes match. Let's clean up a little by dropping the extra columns.

In [5]:
data_df.drop(['title_x', 'title_y'], axis=1, inplace=True)
data_df.rename({'original_title': 'title'}, axis=1, inplace=True)
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   title                 4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

## 2. EDA

In [6]:
# Make a copy to work with while encoding so that we have the original to go back to
# if needed
encoded_data_df=data_df.copy()

### 2.1. Feature encoding

#### 2.1.1. *genres*, *keywords* and *cast*

The *genres*, *keywords* and *cast* columns contain JSON-formatted data: for each movie there are several entries, each with a 'name' key among others. We are going to extract just the names into a list, keeping at most the first three per column, and then concatenate them into a single string so that we can vectorize them, similarly to how we handled the app review data.
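To make the extraction step concrete, here is a small sketch on a single JSON string shaped like the *genres* values shown above. The helper name `extract_names` and its `limit` parameter are hypothetical and exist only for this illustration.

```python
import json

def extract_names(json_string, limit=3):
    """Return up to `limit` 'name' values from a JSON list of objects."""
    return [item['name'] for item in json.loads(json_string)][:limit]

# One value in the same shape as the 'genres' column
example = '[{"id": 80, "name": "Crime"}, {"id": 35, "name": "Comedy"}]'
names = extract_names(example)
print(names)
```

Slicing with `[:limit]` is safe even when a movie has fewer than three entries, so no length check is needed.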

In [7]:
# Parse the 'cast' JSON from each row into a list of dicts and keep the first three 'name' values.
# Fall back to an empty list for missing values so that the lists can be concatenated later
encoded_data_df['cast']=data_df['cast'].apply(lambda x: [item['name'] for item in json.loads(x)][:3] if pd.notna(x) else [])

# Same for the 'keywords' column
encoded_data_df['keywords']=data_df['keywords'].apply(lambda x: [item['name'] for item in json.loads(x)][:3] if pd.notna(x) else [])

# And the 'genres' column
encoded_data_df['genres']=data_df['genres'].apply(lambda x: [item['name'] for item in json.loads(x)][:3] if pd.notna(x) else [])

encoded_data_df.head(3)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,title,overview,popularity,production_companies,...,release_date,revenue,runtime,spoken_languages,status,tagline,vote_average,vote_count,cast,crew
0,4000000,"[Crime, Comedy]",,5,"[hotel, new year's eve, witch]",en,Four Rooms,It's Ted the Bellhop's first night on the job....,22.87623,"[{""name"": ""Miramax Films"", ""id"": 14}, {""name"":...",...,1995-12-09,4300000,98.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Twelve outrageous guests. Four scandalous requ...,6.5,530,"[Tim Roth, Antonio Banderas, Jennifer Beals]","[{""credit_id"": ""52fe420dc3a36847f800012d"", ""de..."
1,11000000,"[Adventure, Action, Science Fiction]",http://www.starwars.com/films/star-wars-episod...,11,"[android, galaxy, hermit]",en,Star Wars,Princess Leia is captured and held hostage by ...,126.393695,"[{""name"": ""Lucasfilm"", ""id"": 1}, {""name"": ""Twe...",...,1977-05-25,775398007,121.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"A long time ago in a galaxy far, far away...",8.1,6624,"[Mark Hamill, Harrison Ford, Carrie Fisher]","[{""credit_id"": ""52fe420dc3a36847f8000437"", ""de..."
2,94000000,"[Animation, Family]",http://movies.disney.com/finding-nemo,12,"[father son relationship, harbor, underwater]",en,Finding Nemo,"Nemo, an adventurous young clownfish, is unexp...",85.688789,"[{""name"": ""Pixar Animation Studios"", ""id"": 3}]",...,2003-05-30,940335536,100.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"There are 3.7 trillion fish in the ocean, they...",7.6,6122,"[Albert Brooks, Ellen DeGeneres, Alexander Gould]","[{""credit_id"": ""52fe420ec3a36847f80006b1"", ""de..."


We could do the same thing with the *production_companies*, *spoken_languages* and *crew* features, but what we have is a good start. Let's move on.
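For reference, here is a rough sketch of how the *crew* column might be handled if we wanted, say, the director's name. It assumes each crew entry has 'name' and 'job' keys. That is an assumption, since the truncated output above does not show the full entries. The helper `extract_by_job` and the example names are hypothetical.

```python
import json

def extract_by_job(crew_json, job='Director'):
    """Return names of crew members whose 'job' field matches (hypothetical helper)."""
    crew = json.loads(crew_json)
    # .get() avoids a KeyError if an entry has no 'job' key
    return [member['name'] for member in crew if member.get('job') == job]

# Made-up crew entry in the assumed shape
crew_example = (
    '[{"name": "Person A", "job": "Director"},'
    ' {"name": "Person B", "job": "Editor"}]'
)
directors = extract_by_job(crew_example)
print(directors)
```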

#### 2.1.2. *overview*

The next feature that looks obviously important for movie recommendation is the *overview*. We wrap it in a single-element list so that we can concatenate it with the other list features. This will give us one long text that contains the description, genres, keywords and actors. This text will then be used for vectorization.

In [8]:
encoded_data_df['overview']=data_df['overview'].apply(lambda x: [x if pd.notna(x) else 'none'])
encoded_data_df.head(3)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,title,overview,popularity,production_companies,...,release_date,revenue,runtime,spoken_languages,status,tagline,vote_average,vote_count,cast,crew
0,4000000,"[Crime, Comedy]",,5,"[hotel, new year's eve, witch]",en,Four Rooms,[It's Ted the Bellhop's first night on the job...,22.87623,"[{""name"": ""Miramax Films"", ""id"": 14}, {""name"":...",...,1995-12-09,4300000,98.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Twelve outrageous guests. Four scandalous requ...,6.5,530,"[Tim Roth, Antonio Banderas, Jennifer Beals]","[{""credit_id"": ""52fe420dc3a36847f800012d"", ""de..."
1,11000000,"[Adventure, Action, Science Fiction]",http://www.starwars.com/films/star-wars-episod...,11,"[android, galaxy, hermit]",en,Star Wars,[Princess Leia is captured and held hostage by...,126.393695,"[{""name"": ""Lucasfilm"", ""id"": 1}, {""name"": ""Twe...",...,1977-05-25,775398007,121.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"A long time ago in a galaxy far, far away...",8.1,6624,"[Mark Hamill, Harrison Ford, Carrie Fisher]","[{""credit_id"": ""52fe420dc3a36847f8000437"", ""de..."
2,94000000,"[Animation, Family]",http://movies.disney.com/finding-nemo,12,"[father son relationship, harbor, underwater]",en,Finding Nemo,"[Nemo, an adventurous young clownfish, is unex...",85.688789,"[{""name"": ""Pixar Animation Studios"", ""id"": 3}]",...,2003-05-30,940335536,100.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"There are 3.7 trillion fish in the ocean, they...",7.6,6122,"[Albert Brooks, Ellen DeGeneres, Alexander Gould]","[{""credit_id"": ""52fe420ec3a36847f80006b1"", ""de..."


#### 2.1.3. Combine and encode features

In [9]:
# Concatenate the four features we just extracted to a single list feature called 'tags'
encoded_data_df["tags"]=encoded_data_df["overview"] + encoded_data_df["genres"] + encoded_data_df["keywords"] + encoded_data_df["cast"]

# Join the list 'tags' feature into a string
encoded_data_df["tags"]=encoded_data_df["tags"].apply(lambda x: ', '.join(x))

# Take a look at the first row to see an example of the result
encoded_data_df.iloc[0].tags

"It's Ted the Bellhop's first night on the job...and the hotel's very unusual guests are about to place him in some outrageous predicaments. It seems that this evening's room service is serving up one unbelievable happening after another., Crime, Comedy, hotel, new year's eve, witch, Tim Roth, Antonio Banderas, Jennifer Beals"

### 2.2. Missing and/or extreme values

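Missing values were already handled during feature encoding: the three missing *overview* entries were replaced with the placeholder 'none', and the *genres*, *keywords* and *cast* columns have no nulls.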
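A quick way to confirm this kind of claim is to count the nulls per column before and after filling. The sketch below does this on a toy dataframe with made-up values; it is not the notebook's actual data.

```python
import pandas as pd

# Toy dataframe with one missing overview (values are illustrative only)
toy = pd.DataFrame({
    'overview': ['A bellhop has a long night.', None],
    'genres': ['[]', '[]'],
})

# Count missing values per column
missing_counts = toy.isna().sum()
print(missing_counts)

# Replace missing text with the same placeholder used above, then re-check
toy['overview'] = toy['overview'].fillna('none')
remaining = int(toy['overview'].isna().sum())
print(remaining)
```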

### 2.3. Feature selection

To start with, let's try using only the *tags* feature we just created. We could add other features later if we want to. For example, the budget might be relevant.

In [10]:
tags=encoded_data_df['tags']

### 2.4. Save the processed data

In [11]:
# Save the data
data_file='../data/processed_data/movies.pkl'

with open(data_file, 'wb') as output_file:
    pickle.dump(tags, output_file, protocol=pickle.HIGHEST_PROTOCOL)
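It can be worth checking that a pickled object loads back unchanged. Here is a minimal round-trip sketch on a toy series, written to a temporary directory so that it does not touch the project's data folder. The file name `tags.pkl` is arbitrary.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy stand-in for the 'tags' series
tags_toy = pd.Series(['crime comedy hotel', 'adventure action android'])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'tags.pkl')

    # Write with the same protocol setting used above
    with open(path, 'wb') as output_file:
        pickle.dump(tags_toy, output_file, protocol=pickle.HIGHEST_PROTOCOL)

    # Read it back
    with open(path, 'rb') as input_file:
        restored = pickle.load(input_file)

# The restored series should be identical to the original
round_trip_ok = restored.equals(tags_toy)
print(round_trip_ok)
```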

## 3. Model training

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Vectorize the 'tags' string feature using TF-IDF (term frequency-inverse document frequency)
vectorizer=TfidfVectorizer()
tfidf_matrix=vectorizer.fit_transform(tags)

# Instantiate and train the nearest neighbors model
model=NearestNeighbors(n_neighbors=5, algorithm="brute", metric="cosine")
fit_result=model.fit(tfidf_matrix)
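To see what the vectorizer and the nearest neighbors model are doing, here is the same pipeline on a four-document toy corpus. The documents are made up. The point of the sketch is that documents sharing more terms end up closer in cosine distance, and that the closest neighbor of any document in the fitted set is the document itself.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Tiny made-up corpus standing in for the 'tags' strings
docs = [
    'dragon viking adventure',
    'dragon viking sequel',
    'robot space adventure',
    'romance paris comedy',
]

# Same settings as above, but with only 2 neighbors
toy_matrix = TfidfVectorizer().fit_transform(docs)
toy_model = NearestNeighbors(n_neighbors=2, algorithm='brute', metric='cosine')
toy_model.fit(toy_matrix)

# Query with the first document
distances, indices = toy_model.kneighbors(toy_matrix[0])
print(indices[0], distances[0])
```

The first returned index is the query document itself, at a distance of zero up to floating point error. This is why a model with `n_neighbors=5` yields only four recommendations once the query is dropped.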

## 4. Recommender

In [13]:
# Recommender function

def get_movie_recommendations(movie_title):
    '''Takes a movie title string, looks up the TF-IDF feature vector for that movie
    and returns (title, distance) tuples for its nearest neighbors, excluding the
    query movie itself (4 results with n_neighbors=5)'''

    # Find the query movie in the encoded data, get the index
    movie_index = encoded_data_df[encoded_data_df["title"] == movie_title].index[0]

    # Get the distances and indexes of similar movies
    distances, indices = model.kneighbors(tfidf_matrix[movie_index])

    # Extract the titles of the similar movies along with their distances
    similar_movies = [(encoded_data_df["title"][i], distances[0][j]) for j, i in enumerate(indices[0])]

    # The first neighbor is the query movie itself, so skip it
    return similar_movies[1:]
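
The lookup with `.index[0]` raises an `IndexError` when the title is not in the dataframe, and it only matches exact capitalization. Here is one possible sketch of a more forgiving lookup, using a toy title series. The helper `find_title_index` is hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the 'title' column
titles = pd.Series(['How to Train Your Dragon', 'Eragon', "Pete's Dragon"])

def find_title_index(title_series, query):
    """Return the position of the first case-insensitive match, or None if absent."""
    matches = title_series.str.lower() == query.strip().lower()
    if not matches.any():
        return None
    # argmax on a boolean array gives the position of the first True
    return int(matches.to_numpy().argmax())

found = find_title_index(titles, 'eragon')
missing = find_title_index(titles, 'Unknown Film')
print(found, missing)
```

Returning `None` lets the caller decide how to report an unknown title instead of crashing.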


In [14]:
# 'Target' movie
input_movie = "How to Train Your Dragon"

# Call the recommendation function
recommendations = get_movie_recommendations(input_movie)

# Print the results
print("Film recommendations '{}'".format(input_movie))
for movie, distance in recommendations:
    print("- Film: {}".format(movie))

Film recommendations 'How to Train Your Dragon'
- Film: How to Train Your Dragon 2
- Film: Dragon Nest: Warriors' Dawn
- Film: Pete's Dragon
- Film: Eragon


## 5. Deployment

OK, it works! Let's refactor this code and turn it into a simple command line utility.

First, we need to save some of the assets we created:

1. The model
2. The TF-IDF matrix
3. The encoded dataframe

Then we can place our recommender function in app.py, load the assets and take user input.

In [15]:
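# Save the trained model; the context manager makes sure the file gets closed
with open("../models/knn_movie_recommender.pkl", "wb") as output_file:
    pickle.dump(model, output_file)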
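
The other two assets can be saved the same way. Below is a rough sketch of what the save, load and recommend steps in app.py might look like, run end to end on toy data. Bundling all three assets into one dictionary, the file name `assets.pkl` and the function names are assumptions made for this example, not the project's actual layout.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy assets standing in for the notebook's model, TF-IDF matrix and dataframe
toy_df = pd.DataFrame({
    'title': ['Dragon One', 'Dragon Two', 'Space Robot'],
    'tags': ['dragon viking adventure', 'dragon viking sequel', 'robot space adventure'],
})
toy_matrix = TfidfVectorizer().fit_transform(toy_df['tags'])
toy_model = NearestNeighbors(n_neighbors=2, algorithm='brute', metric='cosine').fit(toy_matrix)

def save_assets(path, model, matrix, dataframe):
    """Bundle the three assets in one dict and pickle it (layout is an assumption)."""
    with open(path, 'wb') as output_file:
        pickle.dump({'model': model, 'matrix': matrix, 'data': dataframe}, output_file)

def load_assets(path):
    """Load the asset bundle written by save_assets."""
    with open(path, 'rb') as input_file:
        return pickle.load(input_file)

def recommend(assets, movie_title):
    """Return titles of the nearest neighbors, excluding the query movie itself."""
    data = assets['data']
    movie_index = data.index[data['title'] == movie_title][0]
    _, indices = assets['model'].kneighbors(assets['matrix'][movie_index])
    return [data['title'].iloc[i] for i in indices[0] if i != movie_index]

# Round trip through a temporary file, then query the loaded assets
with tempfile.TemporaryDirectory() as tmp_dir:
    assets_path = os.path.join(tmp_dir, 'assets.pkl')
    save_assets(assets_path, toy_model, toy_matrix, toy_df)
    loaded_assets = load_assets(assets_path)

recommendations = recommend(loaded_assets, 'Dragon One')
print(recommendations)
```

In a command line utility, the title passed to `recommend` would come from user input, for example via `input()` or `argparse`, rather than being hard-coded.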