# Jack Taylor

# CA06 - kNN Based Recommender Engine

# The Application

At scale, this would look like recommending products on Amazon, articles on Medium, movies on Netflix, or videos on YouTube. Those services almost certainly use more efficient methods, given the enormous volume of data they process, but we can replicate the core idea on a smaller scale using what we have learned about k-nearest neighbors (kNN). Let us build the core of a movie recommender system.

**What question are we trying to answer?**

Given a movie dataset, what are the 5 movies most similar to a query movie?


# Data Source and Description

Data File Name: movies_recommendation_data.csv

File Location: https://github.com/ArinB/CA05-kNN/raw/master/movies_recommendation_data.csv

# Program Initialization Section

**Enter your import packages here**

In [1]:
import pandas as pd

import numpy as np

from sklearn.neighbors import NearestNeighbors

# Data File Reading Section

In [2]:
data = pd.read_csv('https://github.com/ArinB/CA05-kNN/raw/master/movies_recommendation_data.csv')
data.head()

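|  | Movie ID | Movie Name | IMDB Rating | Biography | Drama | Thriller | Comedy | Crime | Mystery | History | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | The Imitation Game | 8.0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 8 | Ex Machina | 7.7 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 46 | A Beautiful Mind | 8.2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 62 | Good Will Hunting | 8.3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 97 | Forrest Gump | 8.8 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |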


# Initial Data Investigation Section

**Summarized Details**

Descriptive statistics that summarize the central tendency, dispersion, and shape of the dataset's distribution, excluding NaN values.

In [3]:
#Statistical Description of data
data.describe()

|  | Movie ID | IMDB Rating | Biography | Drama | Thriller | Comedy | Crime | Mystery | History | Label |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 | 30.0 |
| mean | 48.133333 | 7.696667 | 0.233333 | 0.6 | 0.1 | 0.1 | 0.133333 | 0.1 | 0.1 | 0.0 |
| std | 29.288969 | 0.666169 | 0.430183 | 0.498273 | 0.305129 | 0.305129 | 0.345746 | 0.305129 | 0.305129 | 0.0 |
| min | 1.0 | 5.9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 27.75 | 7.3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 48.5 | 7.75 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 64.25 | 8.175 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 98.0 | 8.8 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |


In [6]:
#Displaying number of total rows and columns of the dataset
data.shape

(30, 11)

In [7]:
#Displaying number of non-null values for each column
data.count()

Movie ID       30
Movie Name     30
IMDB Rating    30
Biography      30
Drama          30
Thriller       30
Comedy         30
Crime          30
Mystery        30
History        30
Label          30
dtype: int64

In [8]:
#Displaying number of null values for each column
data.isnull().sum()

Movie ID       0
Movie Name     0
IMDB Rating    0
Biography      0
Drama          0
Thriller       0
Comedy         0
Crime          0
Mystery        0
History        0
Label          0
dtype: int64
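
The checks above can also be written as explicit conditions, so that a later cell fails loudly if the data ever changes. The sketch below is an illustration rather than part of the original notebook: it rebuilds three rows from the `data.head()` output by hand so that it runs without network access, and the names `sample` and `genre_cols` are assumptions made for the example.

```python
import pandas as pd

# Three rows copied from the data.head() output above (hand-built so the
# sketch runs offline; in the notebook, `data` would be used instead).
sample = pd.DataFrame({
    "Movie ID": [58, 8, 46],
    "Movie Name": ["The Imitation Game", "Ex Machina", "A Beautiful Mind"],
    "IMDB Rating": [8.0, 7.7, 8.2],
    "Biography": [1, 0, 1],
    "Drama": [1, 1, 1],
    "Thriller": [1, 0, 0],
    "Comedy": [0, 0, 0],
    "Crime": [0, 0, 0],
    "Mystery": [0, 1, 0],
    "History": [0, 0, 0],
})

genre_cols = ["Biography", "Drama", "Thriller", "Comedy",
              "Crime", "Mystery", "History"]

# Total number of missing values across the whole frame
n_missing = int(sample.isnull().sum().sum())

# True when every genre flag is either 0 or 1
genres_are_binary = bool(sample[genre_cols].isin([0, 1]).all().all())

# True when every rating falls on the 0-10 IMDB scale
ratings_in_range = bool(sample["IMDB Rating"].between(0, 10).all())

print(n_missing, genres_are_binary, ratings_in_range)
```

Swapping `sample` for `data` would apply the same three checks to the full 30-row dataset.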

In [9]:
#Displaying range, column, number of non-null objects of each column, datatype and memory usage
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Movie ID     30 non-null     int64  
 1   Movie Name   30 non-null     object 
 2   IMDB Rating  30 non-null     float64
 3   Biography    30 non-null     int64  
 4   Drama        30 non-null     int64  
 5   Thriller     30 non-null     int64  
 6   Comedy       30 non-null     int64  
 7   Crime        30 non-null     int64  
 8   Mystery      30 non-null     int64  
 9   History      30 non-null     int64  
 10  Label        30 non-null     int64  
dtypes: float64(1), int64(9), object(1)
memory usage: 2.7+ KB


In [9]:
#Dropping the 'Label' column because it is irrelevant (it is 0 for every row)
data.drop(columns='Label', inplace=True)
data.head(10)

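|  | Movie ID | Movie Name | IMDB Rating | Biography | Drama | Thriller | Comedy | Crime | Mystery | History |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | The Imitation Game | 8.0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| 1 | 8 | Ex Machina | 7.7 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 46 | A Beautiful Mind | 8.2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 62 | Good Will Hunting | 8.3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 97 | Forrest Gump | 8.8 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 5 | 98 | 21 | 6.8 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 6 | 31 | Gifted | 7.6 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 7 | 3 | Travelling Salesman | 5.9 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 8 | 51 | Avatar | 7.9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 47 | The Karate Kid | 7.2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |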


# Building a Recommender System

In [10]:
#slicing the data to include only the IMDB rating and genre columns used by the model
kNN_data = data[['IMDB Rating', 'Biography', 'Drama', 'Thriller', 'Comedy', 'Crime', 'Mystery', 'History']]

In [11]:
#building the model and testing it
neigh = NearestNeighbors(n_neighbors=5, algorithm='brute') #using brute force because the dataset is small
neigh.fit(kNN_data)
distances, indices = neigh.kneighbors(kNN_data)
indices #row i lists the indices of the 5 rows nearest to movie i, closest first; this includes movie i itself

array([[ 0, 16,  2, 29, 28],
       [ 1,  6, 21, 18, 10],
       [ 2, 16, 29, 27,  3],
       [ 3, 12,  4,  6, 18],
       [ 4, 12,  3, 15, 17],
       [ 5,  9, 10, 18, 21],
       [ 6, 18, 21,  9, 10],
       [ 7, 20, 10,  9, 21],
       [ 8, 22, 24, 14, 19],
       [10,  9, 21, 18,  6],
       [10,  9, 21, 18,  6],
       [11, 18, 21,  6,  9],
       [12,  4,  3,  6, 17],
       [13, 23, 25,  2, 24],
       [14, 19, 26,  8, 22],
       [15, 17, 24, 22,  8],
       [16, 29,  2,  0,  6],
       [17, 15, 24, 22,  8],
       [18, 21, 10,  9,  6],
       [19, 14, 26,  8, 22],
       [20, 26, 19,  7, 14],
       [18, 21, 10,  9,  6],
       [22,  8, 24, 17, 14],
       [23, 25, 13, 19, 26],
       [24, 22,  8, 17, 14],
       [25,  8, 22, 24, 14],
       [26, 19, 14,  8, 22],
       [27, 28,  2, 16, 29],
       [28, 27,  2, 16, 29],
       [29, 16,  2, 18, 21]])
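
When no `metric` argument is given, `NearestNeighbors` uses the Minkowski metric with `p=2`, which is ordinary Euclidean distance. To make "similar" concrete, the sketch below recomputes the distance between rows 0 and 2 (The Imitation Game and A Beautiful Mind) by hand and compares it with what the model reports. The feature values are copied from the table above; the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Feature vectors in the same column order as kNN_data:
# IMDB Rating, Biography, Drama, Thriller, Comedy, Crime, Mystery, History
imitation_game = np.array([8.0, 1, 1, 1, 0, 0, 0, 0])
beautiful_mind = np.array([8.2, 1, 1, 0, 0, 0, 0, 0])

# Euclidean distance by hand: sqrt(0.2**2 + 1**2) = sqrt(1.04)
manual = np.sqrt(np.sum((imitation_game - beautiful_mind) ** 2))

# The same distance as reported by NearestNeighbors on a two-row dataset
pair = np.vstack([imitation_game, beautiful_mind])
nn = NearestNeighbors(n_neighbors=2, algorithm='brute').fit(pair)
distances, _ = nn.kneighbors(pair)

print(round(manual, 4))           # 1.0198
print(round(distances[0, 1], 4))  # 1.0198
```

One mismatched genre flag adds 1 to the squared distance, the same as a full one-point gap in IMDB rating. Since ratings in this dataset only span 5.9 to 8.8, genre differences tend to carry much of the weight in the neighbor ranking.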

In [12]:
#appending 'The Post' to the dataset
data.loc[len(data.index)] = [100, 'The Post', 7.2, 1, 1, 0, 0, 0, 0, 1]

#rerunning to ensure 'The Post' is included
kNN_data = data[['IMDB Rating', 'Biography', 'Drama', 'Thriller', 'Comedy', 'Crime', 'Mystery', 'History']]

In [13]:
#rerunning the model but selecting only for index 30 (The Post)
neigh = NearestNeighbors(n_neighbors=5, algorithm='brute')
neigh.fit(kNN_data)
distances, indices = neigh.kneighbors(kNN_data)
indices[30]

array([30, 28, 27, 29, 16])

In [14]:
#Returning the movies most similar to The Post according to IMDB rating and genre
print(data.loc[[30,28,27,29,16]])

    Movie ID        Movie Name  IMDB Rating  ...  Crime  Mystery  History
30       100          The Post          7.2  ...      0        0        1
28        86  12 Years a Slave          8.1  ...      0        0        1
27         1     Hacksaw Ridge          8.2  ...      0        0        1
29        46    Queen of Katwe          7.4  ...      0        0        0
16        44    The Wind Rises          7.8  ...      0        0        0

[5 rows x 10 columns]
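
Because "The Post" was appended to the fitted data, the first neighbor returned for it is "The Post" itself, which leaves four actual recommendations. One possible alternative is to fit the model on the catalog only and pass the new movie as a separate query. The sketch below is a minimal illustration under stated assumptions: `recommend` is a hypothetical helper, and `catalog` holds just six rows copied from the table above, so its results will differ from those of the full 30-row dataset.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

feature_cols = ['IMDB Rating', 'Biography', 'Drama', 'Thriller',
                'Comedy', 'Crime', 'Mystery', 'History']

# Six rows copied from the table above, standing in for the full dataset
catalog = pd.DataFrame(
    [['The Imitation Game', 8.0, 1, 1, 1, 0, 0, 0, 0],
     ['Ex Machina',         7.7, 0, 1, 0, 0, 0, 1, 0],
     ['A Beautiful Mind',   8.2, 1, 1, 0, 0, 0, 0, 0],
     ['Good Will Hunting',  8.3, 0, 1, 0, 0, 0, 0, 0],
     ['Forrest Gump',       8.8, 0, 1, 0, 0, 0, 0, 0],
     ['21',                 6.8, 0, 1, 0, 0, 1, 0, 1]],
    columns=['Movie Name'] + feature_cols)

def recommend(catalog, query_features, k):
    """Return the names of the k catalog movies closest to query_features."""
    model = NearestNeighbors(n_neighbors=k, algorithm='brute')
    model.fit(catalog[feature_cols].to_numpy())
    # The query is not part of the fitted data, so it cannot match itself
    _, idx = model.kneighbors([query_features])
    return catalog['Movie Name'].iloc[idx[0]].tolist()

# Feature vector for The Post, in feature_cols order
the_post = [7.2, 1, 1, 0, 0, 0, 0, 1]
top3 = recommend(catalog, the_post, k=3)
print(top3)
```

On this six-row sample the call returns A Beautiful Mind, 21, and The Imitation Game, in that order. With the full dataset, calling `recommend(..., k=5)` would yield five recommendations without the query occupying the first slot.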
