# CA06 - kNN based Recommender Engine
In this project, we build a kNN (k-nearest neighbors) recommender engine that suggests movies similar to a given movie. Recommender engines also power product suggestions on Amazon, articles on Medium, movies on Netflix, and videos on YouTube. Those services almost certainly use more efficient methods, given the enormous volume of data they process, but we can replicate the core idea on a smaller scale. Let us build the core of a movie recommender system.


**Given a movies data set, what are the 5 most similar movies to a movie query?**

In [214]:
# Importing packages
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors

## 1. Data Source and Contents

This dataset contains thirty movies representing seven genres and their IMDB ratings. The label column values are all zeroes because this dataset is not used for classification or regression. You can ignore this column. The implementation assumes that all columns contain numerical data.

Additionally, some relationships among the movies (e.g. shared actors, directors, and themes) are not accounted for by the kNN algorithm, simply because the data that captures those relationships is missing from the dataset. Consequently, when we run the kNN algorithm on our data, similarity is based solely on the movies' genres and IMDB ratings.

In [207]:
# Loading dataset
movies = pd.read_csv('https://github.com/ArinB/CA05-kNN/raw/master/movies_recommendation_data.csv', index_col='Movie ID')

In [208]:
# Previewing data
movies.head()

| Movie ID | Movie Name | IMDB Rating | Biography | Drama | Thriller | Comedy | Crime | Mystery | History | Label |
|---|---|---|---|---|---|---|---|---|---|---|
| 58 | The Imitation Game | 8.0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 8 | Ex Machina | 7.7 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 46 | A Beautiful Mind | 8.2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 62 | Good Will Hunting | 8.3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 97 | Forrest Gump | 8.8 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |


In [209]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 58 to 46
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Movie Name   30 non-null     object 
 1   IMDB Rating  30 non-null     float64
 2   Biography    30 non-null     int64  
 3   Drama        30 non-null     int64  
 4   Thriller     30 non-null     int64  
 5   Comedy       30 non-null     int64  
 6   Crime        30 non-null     int64  
 7   Mystery      30 non-null     int64  
 8   History      30 non-null     int64  
 9   Label        30 non-null     int64  
dtypes: float64(1), int64(8), object(1)
memory usage: 2.6+ KB


## 2. Building a Recommender System

*Scenario*: You are building your own movie recommendation website, and your task is to build the Recommendation Engine that runs at its back-end. Imagine a user navigating your website who encounters a movie named “The Post”. The user is not sure whether to watch it, but its genres are intriguing, and the user is curious about other similar movies. The user scrolls down to the “More Like This” section to see what recommendations your website will make, and the back-end algorithmic gears begin to turn.

Your website sends a request to its back-end for the 5 movies that are most similar to The Post. The back-end has a recommendation data set exactly like ours. It begins by creating the row representation (better known as a feature vector) for The Post, then runs a program similar to the one below to search for the 5 movies that are most similar to The Post, and finally sends the results back to the user on your website.

*The rating and genre information for the movie “The Post” is as follows*:

> IMDB Rating = 7.2, Biography = Yes, Drama = Yes, Thriller = No, Comedy = No,
Crime = No, Mystery = No, History = Yes
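
To make the idea of a feature vector concrete, here is a minimal sketch of how the Euclidean distance between The Post and one movie from the preview table (The Imitation Game) can be computed by hand. The variable names `the_post_vec` and `imitation_game_vec` are illustrative and do not appear elsewhere in this notebook; Euclidean distance is assumed because it is the default metric of scikit-learn's `NearestNeighbors`.

```python
import numpy as np

# Feature order: IMDB Rating, Biography, Drama, Thriller, Comedy, Crime, Mystery, History
the_post_vec = np.array([7.2, 1, 1, 0, 0, 0, 0, 1])        # values listed above
imitation_game_vec = np.array([8.0, 1, 1, 1, 0, 0, 0, 0])  # row from the data preview

# Euclidean distance: square root of the sum of squared feature differences
diff = the_post_vec - imitation_game_vec
distance = float(np.sqrt(np.sum(diff ** 2)))

print(round(distance, 4))  # 1.6248
```

The two movies differ by 0.8 in rating and disagree on two genre flags (Thriller and History), so the squared differences are 0.64 + 1 + 1 = 2.64. Each genre mismatch contributes as much as a full rating point.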

In [210]:
# Adding The Post data
# Column order matches the feature columns of the movies dataset
post_data = {'IMDB Rating': [7.2], 'Biography': [1], 'Drama': [1], 'Thriller': [0],
             'Comedy': [0], 'Crime': [0], 'Mystery': [0], 'History': [1]}
the_post = pd.DataFrame(data=post_data)

In [211]:
# Selecting feature variables 
feature_cols = movies.drop(['Movie Name','Label'], axis=1)
X = feature_cols

In [212]:
# Using the NearestNeighbors model and its kneighbors() method to find the k nearest neighbors.
# Setting n_neighbors = 5 to find the 5 most similar movies
# Using brute force due to the small sample size (30) and few dimensions (8: IMDB rating plus 7 genres)

neigh = NearestNeighbors(n_neighbors=5, algorithm='brute')
neigh.fit(X)
distances, indices = neigh.kneighbors(the_post)
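
For intuition, a brute-force nearest-neighbor search can be sketched in plain NumPy: compute the distance from the query to every row, sort, and keep the k smallest. The sketch below uses only the five movies from the data preview rather than the full dataset, and the names `names`, `X_small`, `query`, and `nearest` are illustrative assumptions. It mirrors the idea behind the brute-force search and is not scikit-learn's actual implementation.

```python
import numpy as np

# The five movies from the data preview (not the full 30-movie dataset)
names = ['The Imitation Game', 'Ex Machina', 'A Beautiful Mind',
         'Good Will Hunting', 'Forrest Gump']

# Feature order: IMDB Rating, Biography, Drama, Thriller, Comedy, Crime, Mystery, History
X_small = np.array([
    [8.0, 1, 1, 1, 0, 0, 0, 0],
    [7.7, 0, 1, 0, 0, 0, 1, 0],
    [8.2, 1, 1, 0, 0, 0, 0, 0],
    [8.3, 0, 1, 0, 0, 0, 0, 0],
    [8.8, 0, 1, 0, 0, 0, 0, 0],
])
query = np.array([7.2, 1, 1, 0, 0, 0, 0, 1])  # The Post

# Euclidean distance from the query to every row
dists = np.linalg.norm(X_small - query, axis=1)

# Positions of the k smallest distances, nearest first
k = 2
nearest = np.argsort(dists)[:k]

for rank, idx in enumerate(nearest, start=1):
    print(f'{rank}: {names[idx]} ({dists[idx]:.4f})')
```

On this five-movie subset, the two nearest movies are A Beautiful Mind and The Imitation Game. The positions in `nearest` play the same role as the `indices` array returned by `kneighbors()`: they are row positions, which is why the notebook looks names up with `.iloc` rather than by Movie ID.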

In [213]:
# Printing the top 5 movie recommendations:

print('Recommendations for "The Post":\n')
# indices holds row positions (not Movie IDs), so names are looked up with .iloc
for rank, (dist, idx) in enumerate(zip(distances.flatten(), indices.flatten()), start=1):
    print('{0}: {1}, with a distance of {2}.'.format(rank, movies['Movie Name'].iloc[idx], dist))

Recommendations for "The Post":

1: 12 Years a Slave, with a distance of 0.9000000000000012.
2: Hacksaw Ridge, with a distance of 1.0.
3: Queen of Katwe, with a distance of 1.0198039027185601.
4: The Wind Rises, with a distance of 1.1661903789690629.
5: The Karate Kid, with a distance of 1.4142135623730951.
