# Movie Recommendation
## using K-Nearest Neighbours

This Jupyter Notebook develops a movie recommendation system using K-Nearest Neighbours.

The dataset used is the [MovieLens Dataset](https://grouplens.org/datasets/movielens/latest/). The *Small* version is used here for educational and development purposes.

In [1]:
import pandas as pd
import numpy as np

## Importing Datasets and creating DataFrames

In [2]:
movies = pd.read_csv('data/movies.csv')
ratings = pd.read_csv('data/ratings.csv')

In [3]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


### Preparing Dataset

In [5]:
print('Movies df Shape: ', movies.shape)
print('Ratings df Shape: ', ratings.shape)

Movies df Shape:  (9742, 3)
Ratings df Shape:  (100836, 4)


In [6]:
df = ratings.merge(movies, on='movieId')

In [7]:
df.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


In [8]:
rating_counts = ratings.groupby('movieId')['rating'].count().sort_values(ascending=False)
rating_counts.head()

movieId
356     329
318     317
296     307
593     279
2571    278
Name: rating, dtype: int64

In [9]:
movies = movies.merge(rating_counts, on='movieId')
movies.rename(columns={'rating':'rating_count'}, inplace=True)
movies

Unnamed: 0,movieId,title,genres,rating_count
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,215
1,2,Jumanji (1995),Adventure|Children|Fantasy,110
2,3,Grumpier Old Men (1995),Comedy|Romance,52
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,7
4,5,Father of the Bride Part II (1995),Comedy,49
...,...,...,...,...
9719,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,1
9720,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,1
9721,193585,Flint (2017),Drama,1
9722,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,1


In [10]:
movies[movies['movieId']==356]

Unnamed: 0,movieId,title,genres,rating_count
314,356,Forrest Gump (1994),Comedy|Drama|Romance|War,329


In [11]:
rating_avg = ratings.groupby('movieId')['rating'].mean().sort_values(ascending=False)
rating_avg.head()

movieId
88448     5.0
100556    5.0
143031    5.0
143511    5.0
143559    5.0
Name: rating, dtype: float64

In [12]:
movies = movies.merge(rating_avg, on='movieId')
movies.rename(columns={'rating':'rating_avg'}, inplace=True)
movies

Unnamed: 0,movieId,title,genres,rating_count,rating_avg
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,215,3.920930
1,2,Jumanji (1995),Adventure|Children|Fantasy,110,3.431818
2,3,Grumpier Old Men (1995),Comedy|Romance,52,3.259615
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,7,2.357143
4,5,Father of the Bride Part II (1995),Comedy,49,3.071429
...,...,...,...,...,...
9719,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,1,4.000000
9720,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,1,3.500000
9721,193585,Flint (2017),Drama,1,3.500000
9722,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,1,3.500000
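
The count and the mean above come from two separate `groupby` passes, each merged into `movies` on its own. As a rough sketch, both statistics could be gathered in one pass with named aggregation. The tiny `movies` and `ratings` frames below are made-up stand-ins for the real CSVs, and `how='left'` is an assumption rather than what the notebook does: the default inner merge used above drops any movie that has no ratings.

```python
import pandas as pd

# Toy stand-ins for the real MovieLens frames (values are illustrative only).
movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['A (1995)', 'B (1995)', 'C (1995)'],
})
ratings = pd.DataFrame({
    'userId': [1, 2, 1],
    'movieId': [1, 1, 2],
    'rating': [4.0, 5.0, 3.0],
})

# One groupby pass producing both statistics via named aggregation.
rating_stats = (
    ratings.groupby('movieId')['rating']
    .agg(rating_count='count', rating_avg='mean')
    .reset_index()
)

# A left merge keeps movie 3, which has no ratings; its stats become NaN.
movies_stats = movies.merge(rating_stats, on='movieId', how='left')
print(movies_stats)
```

With a left merge the unrated rows would still need a decision (drop them, or fill the count with 0) before any distance computation.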


### Handling Genres Feature

In [13]:
genres = movies['genres']
genres.head()

0    Adventure|Animation|Children|Comedy|Fantasy
1                     Adventure|Children|Fantasy
2                                 Comedy|Romance
3                           Comedy|Drama|Romance
4                                         Comedy
Name: genres, dtype: object

In [14]:
genre = list()
subgenre = list()
for movie in genres:
    temp_list = movie.split('|')
    if len(temp_list) > 1:
        subgenre.append(temp_list[1])
    else:
        subgenre.append(temp_list[0])
    genre.append(temp_list[0])

In [15]:
print('Genre Length: ', len(genre))
print('Sub Genre Length: ', len(subgenre))

Genre Length:  9724
Sub Genre Length:  9724
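
The loop above takes the first listed genre as `genre` and the second as `subgenre`, falling back to the first when a movie has only one. A possible vectorized equivalent with the pandas string accessor is sketched below; the three genre strings are illustrative and not taken from the dataset.

```python
import pandas as pd

# Illustrative genre strings in the same pipe-separated format as the dataset.
genres = pd.Series([
    'Adventure|Animation|Children',
    'Comedy|Romance',
    'Comedy',
])

# A single-character separator is treated as a literal string, not a regex.
parts = genres.str.split('|')

# First label is the primary genre; the second (if any) is the sub-genre,
# falling back to the primary genre for single-genre movies.
genre = parts.str[0]
subgenre = parts.str[1].fillna(genre)

print(genre.tolist())
print(subgenre.tolist())
```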


In [16]:
movies['genre'] = genre
movies['subgenre'] = subgenre
movies.drop(['genres'], axis=1, inplace=True)
movies

Unnamed: 0,movieId,title,rating_count,rating_avg,genre,subgenre
0,1,Toy Story (1995),215,3.920930,Adventure,Animation
1,2,Jumanji (1995),110,3.431818,Adventure,Children
2,3,Grumpier Old Men (1995),52,3.259615,Comedy,Romance
3,4,Waiting to Exhale (1995),7,2.357143,Comedy,Drama
4,5,Father of the Bride Part II (1995),49,3.071429,Comedy,Comedy
...,...,...,...,...,...,...
9719,193581,Black Butler: Book of the Atlantic (2017),1,4.000000,Action,Animation
9720,193583,No Game No Life: Zero (2017),1,3.500000,Animation,Comedy
9721,193585,Flint (2017),1,3.500000,Drama,Drama
9722,193587,Bungo Stray Dogs: Dead Apple (2018),1,3.500000,Action,Animation


### Calculating Weighted Average Rating of a Movie

Formula Used:  
&emsp; &emsp; *w = (Rv + Cm) / (v + m)*   
  
where,  
&emsp; w = weighted rating  
&emsp; R = average rating of the movie  
&emsp; v = number of votes (ratings) for the movie  
&emsp; m = vote threshold, taken as the 75th percentile of the rating counts (the minimum needed to rank in the top 25% by number of votes)  
&emsp; C = mean of the average ratings across all movies

In [17]:
v = movies['rating_count']
R = movies['rating_avg']
m = movies['rating_count'].quantile(q=0.75)
C = movies['rating_avg'].mean()

In [18]:
movies['weighted_avg'] = ((R * v) + (C * m)) / (v + m) 

In [19]:
movies

Unnamed: 0,movieId,title,rating_count,rating_avg,genre,subgenre,weighted_avg
0,1,Toy Story (1995),215,3.920930,Adventure,Animation,3.894473
1,2,Jumanji (1995),110,3.431818,Adventure,Children,3.419009
2,3,Grumpier Old Men (1995),52,3.259615,Comedy,Romance,3.260033
3,4,Waiting to Exhale (1995),7,2.357143,Comedy,Drama,2.866377
4,5,Father of the Bride Part II (1995),49,3.071429,Comedy,Comedy,3.101070
...,...,...,...,...,...,...,...
9719,193581,Black Butler: Book of the Atlantic (2017),1,4.000000,Action,Animation,3.336203
9720,193583,No Game No Life: Zero (2017),1,3.500000,Animation,Comedy,3.286203
9721,193585,Flint (2017),1,3.500000,Drama,Drama,3.286203
9722,193587,Bungo Stray Dogs: Dead Apple (2018),1,3.500000,Action,Animation,3.286203
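
As a check on the formula, the sketch below recomputes two rows of the table by hand. The notebook never prints `m` or `C`, so the values used here (`m = 9`, `C ≈ 3.262448`) are assumptions back-calculated from the displayed `weighted_avg` column; `weighted_rating` is a hypothetical helper name.

```python
def weighted_rating(R, v, m, C):
    """Blend a movie's own average R (from v votes) with the global mean C,
    treating C as if it had been contributed by m extra votes."""
    return (R * v + C * m) / (v + m)

# Assumed values: back-calculated from the table above, not printed by the notebook.
m = 9
C = 3.262448

# Toy Story (1995): 215 ratings, average 3.920930
toy_story = weighted_rating(3.920930, 215, m, C)

# Waiting to Exhale (1995): 7 ratings, average 2.357143
waiting = weighted_rating(2.357143, 7, m, C)

print(round(toy_story, 4), round(waiting, 4))
```

Under these assumptions both results agree with the table to about four decimal places. The example also shows the intended effect: a movie with 215 votes barely moves (3.92 to 3.89), while one with 7 votes is pulled noticeably towards the global mean (2.36 to 2.87).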


### Preparing Data for Nearest Neighbors Algorithm

Checking Number of Genres

In [20]:
print('Number of Genres: ', len(movies['genre'].unique()))
print('Number of Sub Genres: ', len(movies['subgenre'].unique()))

Number of Genres:  19
Number of Sub Genres:  20


In [21]:
genre_dummies = pd.get_dummies(movies['genre'], drop_first=True)
genre_dummies.head()

Unnamed: 0,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0


In [22]:
subgenre_dummies = pd.get_dummies(movies['subgenre'], drop_first=True)
subgenre_dummies.head()

Unnamed: 0,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
3,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [23]:
movies = pd.concat([movies, genre_dummies], axis=1)
movies = pd.concat([movies, subgenre_dummies], axis=1)
movies.head()

Unnamed: 0,movieId,title,rating_count,rating_avg,genre,subgenre,weighted_avg,Action,Adventure,Animation,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),215,3.92093,Adventure,Animation,3.894473,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),110,3.431818,Adventure,Children,3.419009,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),52,3.259615,Comedy,Romance,3.260033,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale (1995),7,2.357143,Comedy,Drama,2.866377,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Father of the Bride Part II (1995),49,3.071429,Comedy,Comedy,3.10107,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [24]:
movies.columns

Index(['movieId', 'title', 'rating_count', 'rating_avg', 'genre', 'subgenre',
       'weighted_avg', 'Action', 'Adventure', 'Animation', 'Children',
       'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir',
       'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War',
       'Western', 'Action', 'Adventure', 'Animation', 'Children', 'Comedy',
       'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror',
       'IMAX', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War',
       'Western'],
      dtype='object')
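
The column index above contains each genre name twice (once from `genre`, once from `subgenre`), so a label such as `movies['Comedy']` would select two columns. One way to keep the names unique, sketched here on a made-up three-row frame, is the `prefix` argument of `pd.get_dummies`; the prefixes `genre` and `subgenre` are just a naming choice for this illustration.

```python
import pandas as pd

# Illustrative frame with the same two categorical columns as the notebook.
movies = pd.DataFrame({
    'genre': ['Adventure', 'Comedy', 'Comedy'],
    'subgenre': ['Animation', 'Romance', 'Comedy'],
})

# prefix= tags every dummy column with its source, so names stay unique.
genre_dummies = pd.get_dummies(movies['genre'], prefix='genre', drop_first=True)
subgenre_dummies = pd.get_dummies(movies['subgenre'], prefix='subgenre', drop_first=True)

encoded = pd.concat([genre_dummies, subgenre_dummies], axis=1)
print(encoded.columns.tolist())
```

Duplicate labels do not break the nearest-neighbour step, since the model only sees the numeric matrix, but unique names make later column lookups unambiguous.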

In [25]:
movies.drop(['genre', 'subgenre'], inplace=True, axis=1)
movies.head()

Unnamed: 0,movieId,title,rating_count,rating_avg,weighted_avg,Action,Adventure,Animation,Children,Comedy,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),215,3.92093,3.894473,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),110,3.431818,3.419009,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),52,3.259615,3.260033,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale (1995),7,2.357143,2.866377,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,5,Father of the Bride Part II (1995),49,3.071429,3.10107,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


#### Standard Scaling Rating Count

In [26]:
movies['rating_count'] = (movies['rating_count'] - movies['rating_count'].mean()) / movies['rating_count'].std()

In [27]:
movies.head()

Unnamed: 0,movieId,title,rating_count,rating_avg,weighted_avg,Action,Adventure,Animation,Children,Comedy,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),9.134867,3.92093,3.894473,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),4.447577,3.431818,3.419009,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),1.858407,3.259615,3.260033,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale (1995),-0.150431,2.357143,2.866377,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,5,Father of the Bride Part II (1995),1.724485,3.071429,3.10107,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
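
The scaling above is a plain z-score. A small sketch on made-up counts shows one detail worth knowing: pandas' `.std()` uses the sample estimate (`ddof=1`), whereas scikit-learn's `StandardScaler` divides by the population standard deviation (`ddof=0`), so the two give slightly different values.

```python
import pandas as pd

# Illustrative rating counts, skewed like the real column.
counts = pd.Series([1, 1, 2, 7, 215], dtype=float)

# Same formula as the notebook: pandas .std() defaults to ddof=1.
z_sample = (counts - counts.mean()) / counts.std()

# Population version (ddof=0), matching what StandardScaler computes.
z_population = (counts - counts.mean()) / counts.std(ddof=0)

print(z_sample.round(3).tolist())
print(z_population.round(3).tolist())
```

For a column with thousands of rows the difference between the two is negligible. Either way, a z-score does not remove skew, which is why Toy Story still sits about nine standard deviations above the mean in the table above.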


In [28]:
X = movies.drop(['movieId', 'title'], axis=1)
movies_data = movies[['movieId', 'title']]

In [29]:
X.head()

Unnamed: 0,rating_count,rating_avg,weighted_avg,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,9.134867,3.92093,3.894473,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,4.447577,3.431818,3.419009,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1.858407,3.259615,3.260033,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
3,-0.150431,2.357143,2.866377,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1.724485,3.071429,3.10107,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [30]:
movies_data.head()

Unnamed: 0,movieId,title
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)


## Creating Sparse Matrix

In [31]:
from scipy.sparse import csr_matrix
X_sparse = csr_matrix(X)

## Training Nearest Neighbors Model

In [32]:
from sklearn.neighbors import NearestNeighbors

In [33]:
nn_model = NearestNeighbors(metric='cosine', algorithm='brute')

In [34]:
nn_model.fit(X_sparse)

NearestNeighbors(algorithm='brute', metric='cosine')
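
With `metric='cosine'` and `algorithm='brute'`, the model compares the query against every row using cosine distance, defined as one minus the cosine similarity. The NumPy sketch below illustrates that quantity on hypothetical five-feature rows; the feature layout and values are invented for the example and are much shorter than the real 40-column vectors.

```python
import numpy as np

def cosine_distance(a, b):
    """Return 1 - cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical rows: [rating_count, rating_avg, weighted_avg, is_comedy, is_romance]
query = [0.56, 3.00, 3.07, 1, 1]
same_genre = [0.65, 3.14, 3.17, 1, 1]
other_genre = [0.65, 3.14, 3.17, 0, 0]

print(cosine_distance(query, same_genre))
print(cosine_distance(query, other_genre))
```

Because cosine distance depends only on the angle between vectors, rows with matching genre flags and similar rating values end up very close to each other, while a genre mismatch opens a visibly larger gap.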

## Testing a Sample

In [35]:
test_index = np.random.choice(movies_data.shape[0])
test_index

2110
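
The index above changes on every run because the global NumPy random state is unseeded. If a repeatable test movie is wanted, a seeded generator is one option; this is a sketch, the seed value is arbitrary, and `n_movies` simply reuses the row count reported earlier.

```python
import numpy as np

n_movies = 9724  # row count reported earlier in the notebook

# A Generator created with a fixed seed yields the same draw on every run.
rng = np.random.default_rng(seed=42)  # the seed value itself is arbitrary
test_index = rng.integers(0, n_movies)  # upper bound is exclusive
print(test_index)
```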

In [36]:
movies_data.iloc[test_index]

movieId                       2805
title      Mickey Blue Eyes (1999)
Name: 2110, dtype: object

In [37]:
sample = X.iloc[test_index].values.reshape(1, -1)
sample

array([[0.56382263, 3.        , 3.07381358, 0.        , 0.        ,
        0.        , 0.        , 1.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        1.        , 0.        , 0.        , 0.        , 0.        ]])

In [38]:
distances, indices = nn_model.kneighbors(sample, n_neighbors = 100)

In [39]:
indices.flatten()

array([2110, 4805, 1877, 3872, 3037, 6335, 3524,  406,  238, 7637, 4892,
        203, 4606, 3312, 1569, 6641, 3981, 3718, 2316, 4136, 4221, 4125,
        955,  890, 6434, 3405, 1107, 7399, 3632, 1869,  697,  106, 3224,
       3391, 7039, 7505,  735,  152,  361, 5767, 5246, 3087, 6708, 6506,
       2430, 1946, 6399, 8872, 4346, 3782, 2261, 2691, 2629, 3712, 1161,
       1177, 5320, 5254, 6739,  687, 2017,  731, 6242, 2625, 1677,  632,
        434, 5837, 2689, 4843,  476, 4165, 6731,  250, 7539, 4910, 2135,
       6672, 1873, 3075, 3787,  216, 3791,  335, 4803, 6130, 1928, 5321,
       2553, 2207, 5242, 7415, 4761, 8398, 6036, 1195, 8847, 6135, 1684,
       3337])

In [40]:
print('Original Movie:', movies_data.iloc[test_index]['title'], ', ID: ', movies_data.iloc[test_index]['movieId'])

Original Movie: Mickey Blue Eyes (1999) , ID:  2805


In [41]:
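# Sample 5 distinct neighbour positions, skipping position 0 (the query movie itself)
recs = np.random.choice(np.arange(1, 100), size=5, replace=False)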
recs

array([51, 40,  6, 85, 95])

In [42]:
print('Recommended Movies: ')
for i in recs:
    print(f"MovieID: {movies_data.iloc[indices.flatten()[i]]['movieId']}, \
\tName: {movies_data.iloc[indices.flatten()[i]]['title']}, \
\tDistance: {distances.flatten()[i]}")

Recommended Movies: 
MovieID: 3616, 	Name: Loser (2000), 	Distance: 0.008473401997269092
MovieID: 8623, 	Name: Roxanne (1987), 	Distance: 0.00728114227360066
MovieID: 4823, 	Name: Serendipity (2001), 	Distance: 0.00018990485194159135
MovieID: 44004, 	Name: Failure to Launch (2006), 	Distance: 0.013757903120475423
MovieID: 1593, 	Name: Picture Perfect (1997), 	Distance: 0.014557924958298996


In [43]:
movies.iloc[indices.flatten()[recs]]

Unnamed: 0,movieId,title,rating_count,rating_avg,weighted_avg,Action,Adventure,Animation,Children,Comedy,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
2691,3616,Loser (2000),-0.016508,3.3,3.282212,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
5246,8623,Roxanne (1987),0.028132,3.272727,3.268102,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
3524,4823,Serendipity (2001),0.653104,3.14,3.172413,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
6130,44004,Failure to Launch (2006),-0.195072,3.25,3.257469,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
1195,1593,Picture Perfect (1997),-0.195072,2.833333,3.090802,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
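
The recommendations above are drawn at random from the 100 nearest neighbours, which adds variety but ignores how close each neighbour is. A more conventional variant, sketched below, returns the k closest rows and skips the query movie itself. `top_k_neighbors` is a hypothetical helper, and the `indices` and `distances` arrays are made-up values shaped like the output of `kneighbors`.

```python
import numpy as np

def top_k_neighbors(indices, distances, query_index, k=5):
    """Return the k closest (row_index, distance) pairs, skipping the query row."""
    idx = np.asarray(indices).flatten()
    dist = np.asarray(distances).flatten()
    keep = idx != query_index                    # drop the query movie itself
    idx, dist = idx[keep], dist[keep]
    order = np.argsort(dist, kind='stable')[:k]  # smallest distance first
    return list(zip(idx[order].tolist(), dist[order].tolist()))

# Made-up kneighbors-style output for a query at row 2110.
indices = np.array([[2110, 4805, 1877, 3872]])
distances = np.array([[0.0, 0.0001, 0.0002, 0.0004]])

for row, d in top_k_neighbors(indices, distances, query_index=2110, k=2):
    print(row, d)
```

The returned row positions can then be passed to `movies_data.iloc[...]` to look up titles, as the cells above do.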
