# Movie Recommendation
## using K-Nearest Neighbours

This Jupyter notebook develops a movie recommendation system using K-Nearest Neighbours.

The dataset used is the [MovieLens dataset](https://grouplens.org/datasets/movielens/latest/). The *Small* version is used here for educational and development purposes.

In [1]:
import pandas as pd
import numpy as np

## Importing Datasets and creating DataFrames

In [2]:
movies = pd.read_csv('data/movies_data/movies.csv')
ratings = pd.read_csv('data/movies_data/ratings.csv')

In [3]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


### Preparing Dataset

In [5]:
print('Movies df Shape: ', movies.shape)
print('Ratings df Shape: ', ratings.shape)

Movies df Shape:  (9742, 3)
Ratings df Shape:  (100836, 4)


In [6]:
df = ratings.merge(movies, on='movieId')

In [7]:
df.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


In [8]:
rating_counts = ratings.groupby('movieId')['rating'].count().sort_values(ascending=False)
rating_counts.head()

movieId
356     329
318     317
296     307
593     279
2571    278
Name: rating, dtype: int64

In [9]:
movies = movies.merge(rating_counts, on='movieId')
movies.rename(columns={'rating':'rating_count'}, inplace=True)
movies

Unnamed: 0,movieId,title,genres,rating_count
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,215
1,2,Jumanji (1995),Adventure|Children|Fantasy,110
2,3,Grumpier Old Men (1995),Comedy|Romance,52
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,7
4,5,Father of the Bride Part II (1995),Comedy,49
...,...,...,...,...
9719,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,1
9720,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,1
9721,193585,Flint (2017),Drama,1
9722,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,1


In [10]:
movies[movies['movieId']==356]

Unnamed: 0,movieId,title,genres,rating_count
314,356,Forrest Gump (1994),Comedy|Drama|Romance|War,329


In [11]:
rating_avg = ratings.groupby('movieId')['rating'].mean().sort_values(ascending=False)
rating_avg.head()

movieId
88448     5.0
100556    5.0
143031    5.0
143511    5.0
143559    5.0
Name: rating, dtype: float64

In [12]:
movies = movies.merge(rating_avg, on='movieId')
movies.rename(columns={'rating':'rating_avg'}, inplace=True)
movies

Unnamed: 0,movieId,title,genres,rating_count,rating_avg
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,215,3.920930
1,2,Jumanji (1995),Adventure|Children|Fantasy,110,3.431818
2,3,Grumpier Old Men (1995),Comedy|Romance,52,3.259615
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,7,2.357143
4,5,Father of the Bride Part II (1995),Comedy,49,3.071429
...,...,...,...,...,...
9719,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,1,4.000000
9720,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,1,3.500000
9721,193585,Flint (2017),Drama,1,3.500000
9722,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,1,3.500000


### Calculating Weighted Average Rating of a Movie

Formula Used:  
&emsp; &emsp; *w = (Rv + Cm) / (v + m)*   
  
where,  
&emsp; w = weighted rating  
&emsp; R = average rating of the movie  
&emsp; v = number of total votes (ratings)  
&emsp; m = vote threshold, set to the 75th percentile of the rating counts (the count needed to be in the top 25% of movies)  
&emsp; C = mean across all the average ratings
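
To make the effect of the formula concrete, here is a minimal sketch on toy numbers. The function name `weighted_rating`, the toy DataFrame, and the values of `m_toy` and `C_toy` are all hypothetical and chosen for illustration; they are not taken from MovieLens.

```python
import pandas as pd


def weighted_rating(R, v, m, C):
    """Shrink a mean rating R toward the global mean C.

    v is the number of votes and m the vote threshold, as in the formula above.
    """
    return (R * v + C * m) / (v + m)


# Toy values for illustration only (assumptions, not MovieLens data).
toy = pd.DataFrame({
    "rating_avg": [5.0, 4.0, 3.0],
    "rating_count": [1, 50, 200],
})
m_toy = 10.0  # hypothetical vote threshold
C_toy = 3.5   # hypothetical global mean rating

toy["weighted_avg"] = weighted_rating(
    toy["rating_avg"], toy["rating_count"], m_toy, C_toy
)
print(toy)
```

In this sketch the movie with a single 5.0 rating ends up below the movie with fifty ratings averaging 4.0, because a small `v` pulls the score strongly toward `C`.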

In [13]:
v = movies['rating_count']
R = movies['rating_avg']
m = movies['rating_count'].quantile(q=0.75)
C = movies['rating_avg'].mean()

In [14]:
movies['weighted_avg'] = ((R * v) + (C * m)) / (v + m) 

In [15]:
movies.head()

Unnamed: 0,movieId,title,genres,rating_count,rating_avg,weighted_avg
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,215,3.92093,3.894473
1,2,Jumanji (1995),Adventure|Children|Fantasy,110,3.431818,3.419009
2,3,Grumpier Old Men (1995),Comedy|Romance,52,3.259615,3.260033
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,7,2.357143,2.866377
4,5,Father of the Bride Part II (1995),Comedy,49,3.071429,3.10107


### Handling Genres Feature

In [16]:
genres = movies['genres']
genres.head()

0    Adventure|Animation|Children|Comedy|Fantasy
1                     Adventure|Children|Fantasy
2                                 Comedy|Romance
3                           Comedy|Drama|Romance
4                                         Comedy
Name: genres, dtype: object

In [17]:
unique_genres = []
for movie in genres:
    list_genre = movie.split('|')
    for item in list_genre:
        if item not in unique_genres:
            unique_genres.append(item)
unique_genres

['Adventure',
 'Animation',
 'Children',
 'Comedy',
 'Fantasy',
 'Romance',
 'Drama',
 'Action',
 'Crime',
 'Thriller',
 'Horror',
 'Mystery',
 'Sci-Fi',
 'War',
 'Musical',
 'Documentary',
 'IMAX',
 'Western',
 'Film-Noir',
 '(no genres listed)']

In [18]:
for genre in unique_genres:
    movies[genre] = 0

In [19]:
movies.head()

Unnamed: 0,movieId,title,genres,rating_count,rating_avg,weighted_avg,Adventure,Animation,Children,Comedy,...,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,215,3.92093,3.894473,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),Adventure|Children|Fantasy,110,3.431818,3.419009,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),Comedy|Romance,52,3.259615,3.260033,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,7,2.357143,2.866377,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Father of the Bride Part II (1995),Comedy,49,3.071429,3.10107,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
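# Relies on movies having the default 0..n-1 index, so enumerate() positions match the index labels used by .at
for index, list_genre in enumerate(genres):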
    list_genre_split = list_genre.split('|')
    for item in list_genre_split:
        movies.at[index, item] = 1
movies.head()

Unnamed: 0,movieId,title,genres,rating_count,rating_avg,weighted_avg,Adventure,Animation,Children,Comedy,...,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,215,3.92093,3.894473,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),Adventure|Children|Fantasy,110,3.431818,3.419009,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),Comedy|Romance,52,3.259615,3.260033,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,7,2.357143,2.866377,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,5,Father of the Bride Part II (1995),Comedy,49,3.071429,3.10107,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [21]:
movies.drop(['(no genres listed)'], inplace=True, axis=1)
movies.head()

Unnamed: 0,movieId,title,genres,rating_count,rating_avg,weighted_avg,Adventure,Animation,Children,Comedy,...,Thriller,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,215,3.92093,3.894473,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),Adventure|Children|Fantasy,110,3.431818,3.419009,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),Comedy|Romance,52,3.259615,3.260033,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,7,2.357143,2.866377,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,5,Father of the Bride Part II (1995),Comedy,49,3.071429,3.10107,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
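
The one-hot encoding built by the loops above can also be sketched with pandas' `Series.str.get_dummies`. This is an alternative shown on a hypothetical toy DataFrame (`toy_movies`), not the method the notebook uses; note that it orders the genre columns alphabetically rather than by first appearance.

```python
import pandas as pd

# Toy frame for illustration only (titles are made up).
toy_movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["Comedy|Romance", "Drama", "(no genres listed)"],
})

# One 0/1 column per genre, splitting each string on '|'.
genre_flags = toy_movies["genres"].str.get_dummies(sep="|")

# Drop the placeholder column, mirroring the drop step in the notebook.
genre_flags = genre_flags.drop(columns=["(no genres listed)"], errors="ignore")

toy_encoded = pd.concat([toy_movies, genre_flags], axis=1)
print(toy_encoded)
```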


### Scaling Features for the Model to Export Later

In [22]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(np.array(movies['rating_count']).reshape(-1, 1))

StandardScaler()

In [23]:
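# Standardise rating_count by hand.
# Note: pandas .std() is the sample standard deviation (ddof=1), while the StandardScaler
# fitted above uses the population one (ddof=0), so scaler.transform() would give
# very slightly different values from the ones stored here.
movies['rating_count'] = (movies['rating_count'] - movies['rating_count'].mean()) / movies['rating_count'].std()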
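
The manual standardisation in this cell and the fitted `StandardScaler` do not compute exactly the same thing: pandas `.std()` defaults to `ddof=1`, while `StandardScaler` uses `ddof=0`. The sketch below, on made-up toy counts, shows that the two results differ by a constant factor of sqrt(n / (n - 1)). For the roughly 9,700 movies here that factor is very close to 1, but it matters if the exported scaler is later expected to reproduce the stored feature values exactly.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy counts for illustration only; none equals the mean, to avoid 0/0 below.
counts = pd.Series([1.0, 2.0, 3.0, 5.0, 10.0])
column = counts.to_numpy().reshape(-1, 1)

# Manual z-score with pandas: sample standard deviation (ddof=1).
manual = ((counts - counts.mean()) / counts.std()).to_numpy()

# StandardScaler: population standard deviation (ddof=0).
via_scaler = StandardScaler().fit(column).transform(column).ravel()

n = len(counts)
ratio = via_scaler / manual
print(ratio, np.sqrt(n / (n - 1)))
```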

In [24]:
movies.head()

Unnamed: 0,movieId,title,genres,rating_count,rating_avg,weighted_avg,Adventure,Animation,Children,Comedy,...,Thriller,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,9.134867,3.92093,3.894473,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),Adventure|Children|Fantasy,4.447577,3.431818,3.419009,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),Comedy|Romance,1.858407,3.259615,3.260033,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,-0.150431,2.357143,2.866377,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,5,Father of the Bride Part II (1995),Comedy,1.724485,3.071429,3.10107,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [43]:
X = movies.drop(['movieId', 'title', 'genres'], axis=1)
movies_data = movies[['movieId', 'title', 'genres']]

In [44]:
X.head()

Unnamed: 0,rating_count,rating_avg,weighted_avg,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,...,Thriller,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir
0,9.134867,3.92093,3.894473,1,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,4.447577,3.431818,3.419009,1,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1.858407,3.259615,3.260033,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,-0.150431,2.357143,2.866377,0,0,0,1,0,1,1,...,0,0,0,0,0,0,0,0,0,0
4,1.724485,3.071429,3.10107,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [45]:
movies_data.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


## Creating Sparse Matrix

In [29]:
from scipy.sparse import csr_matrix
X_sparse = csr_matrix(X)

## Training Nearest Neighbors Model

In [30]:
from sklearn.neighbors import NearestNeighbors

In [31]:
nn_model = NearestNeighbors(metric='cosine', algorithm='brute')

In [32]:
nn_model.fit(X_sparse)

NearestNeighbors(algorithm='brute', metric='cosine')
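
As a small illustration of what `metric='cosine'` measures, the sketch below fits the same kind of model on a hypothetical 4x3 toy matrix. Cosine distance compares the direction of two vectors and ignores their magnitude, so a row and a scaled copy of it are at distance 0, while orthogonal rows are at distance 1.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Toy feature matrix for illustration only.
toy_X = csr_matrix(np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],  # same direction as row 0, larger magnitude
    [0.0, 1.0, 0.0],  # orthogonal to row 0
    [1.0, 1.0, 0.0],  # 45 degrees from row 0
]))

toy_model = NearestNeighbors(metric="cosine", algorithm="brute").fit(toy_X)

# Query with row 0 and ask for all four rows, nearest first.
toy_dist, toy_idx = toy_model.kneighbors(toy_X[0], n_neighbors=4)
print(toy_idx.ravel(), toy_dist.ravel())
```

Rows 0 and 1 tie at distance 0, so their relative order in the result should not be relied on.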

## Testing a Sample

In [33]:
test_index = np.random.choice(movies_data.shape[0])
test_index

2264

In [34]:
movies_data.iloc[test_index]

movieId                     3007
title      American Movie (1999)
Name: 2264, dtype: object

In [35]:
sample = X.iloc[test_index].values.reshape(1, -1)
sample

array([[-0.1950719 ,  3.75      ,  3.45746896,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  1.        ,  0.        ,
         0.        ,  0.        ]])

In [36]:
distances, indices = nn_model.kneighbors(sample, n_neighbors = 100)

In [37]:
indices.flatten()

array([2264, 7549, 3801, 2491, 6210, 5949, 5549, 3187, 4357,  101, 6786,
       7118, 5865, 6162, 6468, 6922, 8574, 5712, 2397, 1047, 5976, 8949,
       5050, 4831, 5921, 3714, 6377, 7959, 3899, 4352, 5714,  501, 4347,
       5074, 7832, 9300, 5894, 3994, 4323, 4305,  866, 5706,  487, 5707,
       6763, 7095, 4121, 8250, 2881, 6509, 2020, 9493,  843, 5713, 2735,
       5545, 6456, 1358, 4955, 6229, 7851, 7980,  604, 4804, 5718,  987,
        891, 7577, 4380, 9278, 9380, 4654, 9718, 5134, 6086, 6147, 8158,
       6439, 7767, 5009, 3144, 4929, 6741, 2688, 7292, 5908, 8883, 9096,
       8901, 9138, 7066, 4499, 7005, 4388, 4464, 6899, 7141, 4652, 4365,
       8603])

In [38]:
print('Original Movie:', movies_data.iloc[test_index]['title'], ', ID: ', movies_data.iloc[test_index]['movieId'])

Original Movie: American Movie (1999) , ID:  3007


In [39]:
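# Pick 5 random positions among the 100 neighbours. Start at 1 to skip position 0,
# which is normally the query movie itself (duplicates are still possible).
recs = np.random.randint(1, 100, 5)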
recs

array([31,  9, 85, 42,  1])

In [40]:
print('Recommended Movies: ')
for i in recs:
    print(f"MovieID: {movies_data.iloc[indices.flatten()[i]]['movieId']}, \
\tName: {movies_data.iloc[indices.flatten()[i]]['title']}, \
\tDistance: {distances.flatten()[i]}")

Recommended Movies: 
MovieID: 581, 	Name: Celluloid Closet, The (1995), 	Distance: 0.0005050554467982415
MovieID: 116, 	Name: Anne Frank Remembered (1995), 	Distance: 0.0001451524186066866
MovieID: 33838, 	Name: Rize (2005), 	Distance: 0.001239632639111088
MovieID: 556, 	Name: War Room, The (1993), 	Distance: 0.0007034070079346977
MovieID: 85774, 	Name: Senna (2010), 	Distance: 0.0


In [46]:
movies_data.iloc[indices.flatten()[recs]]

Unnamed: 0,movieId,title,genres
501,581,"Celluloid Closet, The (1995)",Documentary
101,116,Anne Frank Remembered (1995),Documentary
5908,33838,Rize (2005),Documentary
487,556,"War Room, The (1993)",Documentary
7549,85774,Senna (2010),Documentary
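
The lookup steps above could be wrapped in a small helper. The sketch below is one possible shape, run on hypothetical toy data: the function name `recommend`, its parameters, and the toy frames are assumptions for illustration. Unlike the random sampling above, it returns the `k` closest rows and filters out the query row.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors


def recommend(model, features, meta, query_pos, k=3):
    """Return the k rows of meta nearest to features.iloc[query_pos], excluding the query."""
    sample = features.iloc[[query_pos]].to_numpy()
    # Ask for one extra neighbour because the query itself is normally returned.
    dist, idx = model.kneighbors(sample, n_neighbors=k + 1)
    dist, idx = dist.ravel(), idx.ravel()
    keep = idx != query_pos
    out = meta.iloc[idx[keep][:k]].copy()
    out["distance"] = dist[keep][:k]
    return out


# Toy data for illustration only.
toy_features = pd.DataFrame({
    "f1": [1.0, 0.9, 0.0, 0.1, 1.0],
    "f2": [0.0, 0.1, 1.0, 0.9, 1.0],
})
toy_meta = pd.DataFrame({"title": ["M0", "M1", "M2", "M3", "M4"]})

toy_nn = NearestNeighbors(metric="cosine", algorithm="brute")
toy_nn.fit(toy_features.to_numpy())

result = recommend(toy_nn, toy_features, toy_meta, query_pos=0, k=2)
print(result)
```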


## Exporting Model and Scaling Object 

In [47]:
import pickle

Exporting Model

In [48]:
with open('nn_model_movies.pkl', 'wb') as f:
    pickle.dump(nn_model, f)

Exporting Scaler

In [49]:
with open('movies_rating_scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)
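
For completeness, here is a minimal sketch of the reverse step: loading pickled objects back and checking that they behave like the originals. It uses small toy objects and a temporary directory rather than the files written above; the file names `toy_model.pkl` and `toy_scaler.pkl` are hypothetical. Pickles should only be loaded from trusted sources, ideally with the same scikit-learn version that wrote them.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Toy objects standing in for nn_model and scaler (illustration only).
toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_model = NearestNeighbors(metric="cosine", algorithm="brute").fit(toy)
toy_scaler = StandardScaler().fit(toy[:, [0]])

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "toy_model.pkl")
    scaler_path = os.path.join(tmp, "toy_scaler.pkl")

    # Write both objects, as in the export cells above.
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)
    with open(scaler_path, "wb") as f:
        pickle.dump(toy_scaler, f)

    # Read them back.
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(scaler_path, "rb") as f:
        loaded_scaler = pickle.load(f)

# The reloaded model should return the same neighbours as the original.
same_neighbors = np.array_equal(
    toy_model.kneighbors(toy, n_neighbors=2)[1],
    loaded_model.kneighbors(toy, n_neighbors=2)[1],
)
print(same_neighbors, loaded_scaler.mean_)
```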