# Movie Recommendation
## using K-Nearest Neighbours

This Jupyter Notebook develops a movie recommendation system using K-Nearest Neighbours.

The dataset used is the [MovieLens dataset](https://grouplens.org/datasets/movielens/latest/). The *Small* version is used here for educational and development purposes.

In [1]:
import pandas as pd
import numpy as np

## Importing Datasets and creating DataFrames

In [2]:
movies = pd.read_csv('data/movies.csv')
ratings = pd.read_csv('data/ratings.csv')

In [3]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


### Preparing Dataset

In [5]:
print('Movies df Shape: ', movies.shape)
print('Ratings df Shape: ', ratings.shape)

Movies df Shape:  (9742, 3)
Ratings df Shape:  (100836, 4)


In [6]:
df = ratings.merge(movies, on='movieId')

In [7]:
df.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


In [8]:
rating_counts = ratings.groupby('movieId')['rating'].count().sort_values(ascending=False)
rating_counts.head()

movieId
356     329
318     317
296     307
593     279
2571    278
Name: rating, dtype: int64

In [9]:
movies = movies.merge(rating_counts, on='movieId')
movies.rename(columns={'rating':'rating_count'}, inplace=True)
movies

Unnamed: 0,movieId,title,genres,rating_count
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,215
1,2,Jumanji (1995),Adventure|Children|Fantasy,110
2,3,Grumpier Old Men (1995),Comedy|Romance,52
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,7
4,5,Father of the Bride Part II (1995),Comedy,49
...,...,...,...,...
9719,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,1
9720,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,1
9721,193585,Flint (2017),Drama,1
9722,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,1
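
The merge above relies on pandas' default inner join, so movies without any ratings are dropped (the frame shrinks from the 9742 rows reported earlier to the 9724 reported later in this notebook). The sketch below uses a made-up three-movie frame (the names `movies_demo`, `ratings_demo` and the values are illustrative assumptions, not taken from the dataset) to contrast the default with `how='left'`, which would keep unrated movies:

```python
import pandas as pd

# Toy data: movie 2 has no ratings at all
movies_demo = pd.DataFrame({'movieId': [1, 2, 3], 'title': ['A', 'B', 'C']})
ratings_demo = pd.DataFrame({'movieId': [1, 1, 3], 'rating': [4.0, 5.0, 3.0]})

# Number of ratings per movie, as a two-column DataFrame
counts = (ratings_demo.groupby('movieId')['rating']
          .count()
          .rename('rating_count')
          .reset_index())

# Default (inner) join: the unrated movie disappears
inner = movies_demo.merge(counts, on='movieId')

# Left join: every movie is kept, unrated ones get a count of 0
left = movies_demo.merge(counts, on='movieId', how='left')
left['rating_count'] = left['rating_count'].fillna(0).astype(int)

print(inner)
print(left)
```

Whether unrated movies should be kept depends on the use case; for a recommender driven by rating statistics, dropping them is a reasonable choice.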


In [10]:
movies[movies['movieId']==356]

Unnamed: 0,movieId,title,genres,rating_count
314,356,Forrest Gump (1994),Comedy|Drama|Romance|War,329


In [11]:
rating_avg = ratings.groupby('movieId')['rating'].mean().sort_values(ascending=False)
rating_avg.head()

movieId
88448     5.0
100556    5.0
143031    5.0
143511    5.0
143559    5.0
Name: rating, dtype: float64

In [12]:
movies = movies.merge(rating_avg, on='movieId')
movies.rename(columns={'rating':'rating_avg'}, inplace=True)
movies

Unnamed: 0,movieId,title,genres,rating_count,rating_avg
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,215,3.920930
1,2,Jumanji (1995),Adventure|Children|Fantasy,110,3.431818
2,3,Grumpier Old Men (1995),Comedy|Romance,52,3.259615
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,7,2.357143
4,5,Father of the Bride Part II (1995),Comedy,49,3.071429
...,...,...,...,...,...
9719,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,1,4.000000
9720,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,1,3.500000
9721,193585,Flint (2017),Drama,1,3.500000
9722,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,1,3.500000


### Handling Genres Feature

In [13]:
genres = movies['genres']
genres.head()

0    Adventure|Animation|Children|Comedy|Fantasy
1                     Adventure|Children|Fantasy
2                                 Comedy|Romance
3                           Comedy|Drama|Romance
4                                         Comedy
Name: genres, dtype: object

In [14]:
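genre = list()
subgenre = list()
for movie in genres:
    # Genres are pipe-separated, e.g. 'Comedy|Drama|Romance'
    temp_list = movie.split('|')
    if len(temp_list) > 1:
        # The second listed genre is treated as the sub-genre
        subgenre.append(temp_list[1])
    else:
        # Single-genre movies reuse their only genre as the sub-genre
        subgenre.append(temp_list[0])
    # The first listed genre is treated as the main genre
    genre.append(temp_list[0])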

In [15]:
print('Genre Length: ', len(genre))
print('Sub Genre Length: ', len(subgenre))

Genre Length:  9724
Sub Genre Length:  9724
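
The same split can also be expressed with pandas string methods instead of a Python loop. This is only a sketch of an alternative on a made-up Series (`genres_demo` is a hypothetical stand-in for `movies['genres']`), not the approach used in this notebook:

```python
import pandas as pd

genres_demo = pd.Series(['Adventure|Animation|Children', 'Comedy|Romance', 'Comedy'])

# Split each pipe-separated string into a list of genres
parts = genres_demo.str.split('|')

# First entry is the main genre
genre_demo = parts.str[0]

# Second entry is the sub-genre; .str[1] yields NaN when there is only one
# genre, so fall back to the main genre in that case
subgenre_demo = parts.str[1].fillna(genre_demo)

print(genre_demo.tolist())
print(subgenre_demo.tolist())
```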


In [16]:
movies['genre'] = genre
movies['subgenre'] = subgenre
movies.drop(['genres'], axis=1, inplace=True)
movies

Unnamed: 0,movieId,title,rating_count,rating_avg,genre,subgenre
0,1,Toy Story (1995),215,3.920930,Adventure,Animation
1,2,Jumanji (1995),110,3.431818,Adventure,Children
2,3,Grumpier Old Men (1995),52,3.259615,Comedy,Romance
3,4,Waiting to Exhale (1995),7,2.357143,Comedy,Drama
4,5,Father of the Bride Part II (1995),49,3.071429,Comedy,Comedy
...,...,...,...,...,...,...
9719,193581,Black Butler: Book of the Atlantic (2017),1,4.000000,Action,Animation
9720,193583,No Game No Life: Zero (2017),1,3.500000,Animation,Comedy
9721,193585,Flint (2017),1,3.500000,Drama,Drama
9722,193587,Bungo Stray Dogs: Dead Apple (2018),1,3.500000,Action,Animation


### Calculating Weighted Average Rating of a Movie

Formula Used:  
&emsp; &emsp; *w = (Rv + Cm) / (v + m)*   
  
where,  
&emsp; w = weighted rating  
&emsp; R = average rating of the movie  
&emsp; v = number of votes (ratings) for the movie  
&emsp; m = minimum number of votes required to be in the top 25% of movies by rating count (the 75th percentile of `rating_count`)  
&emsp; C = mean of the average ratings across all movies
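
To illustrate how the formula pulls sparsely rated movies toward the global mean, the sketch below applies it to two made-up movies with the same average rating but different vote counts. The helper `weighted_rating` and the values of `C_demo` and `m_demo` are illustrative assumptions, not the values computed from the dataset:

```python
def weighted_rating(R, v, C, m):
    """Weighted rating w = (R*v + C*m) / (v + m)."""
    return (R * v + C * m) / (v + m)

# Assumed values for illustration only
C_demo = 3.0   # mean of the per-movie average ratings
m_demo = 9     # vote threshold

# Same average rating (4.0), very different numbers of votes
few_votes = weighted_rating(R=4.0, v=1, C=C_demo, m=m_demo)    # (4 + 27) / 10
many_votes = weighted_rating(R=4.0, v=91, C=C_demo, m=m_demo)  # (364 + 27) / 100

print(few_votes, many_votes)
```

With a single vote the result stays close to `C_demo`, while with many votes it approaches the movie's own average.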

In [17]:
v = movies['rating_count']
R = movies['rating_avg']
m = movies['rating_count'].quantile(q=0.75)
C = movies['rating_avg'].mean()

In [18]:
movies['weighted_avg'] = ((R * v) + (C * m)) / (v + m) 

In [19]:
movies

Unnamed: 0,movieId,title,rating_count,rating_avg,genre,subgenre,weighted_avg
0,1,Toy Story (1995),215,3.920930,Adventure,Animation,3.894473
1,2,Jumanji (1995),110,3.431818,Adventure,Children,3.419009
2,3,Grumpier Old Men (1995),52,3.259615,Comedy,Romance,3.260033
3,4,Waiting to Exhale (1995),7,2.357143,Comedy,Drama,2.866377
4,5,Father of the Bride Part II (1995),49,3.071429,Comedy,Comedy,3.101070
...,...,...,...,...,...,...,...
9719,193581,Black Butler: Book of the Atlantic (2017),1,4.000000,Action,Animation,3.336203
9720,193583,No Game No Life: Zero (2017),1,3.500000,Animation,Comedy,3.286203
9721,193585,Flint (2017),1,3.500000,Drama,Drama,3.286203
9722,193587,Bungo Stray Dogs: Dead Apple (2018),1,3.500000,Action,Animation,3.286203


### Preparing Data for Nearest Neighbors Algorithm

Checking Number of Genres

In [20]:
print('Number of Genres: ', len(movies['genre'].unique()))
print('Number of Sub Genres: ', len(movies['subgenre'].unique()))

Number of Genres:  19
Number of Sub Genres:  20


In [21]:
genre_dummies = pd.get_dummies(movies['genre'], drop_first=True)
genre_dummies.head()

Unnamed: 0,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0


In [22]:
subgenre_dummies = pd.get_dummies(movies['subgenre'], drop_first=True)
subgenre_dummies.head()

Unnamed: 0,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
3,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [23]:
cols = {}
for col in subgenre_dummies.columns:
    cols[col] = col+'_sub'
cols

{'Action': 'Action_sub',
 'Adventure': 'Adventure_sub',
 'Animation': 'Animation_sub',
 'Children': 'Children_sub',
 'Comedy': 'Comedy_sub',
 'Crime': 'Crime_sub',
 'Documentary': 'Documentary_sub',
 'Drama': 'Drama_sub',
 'Fantasy': 'Fantasy_sub',
 'Film-Noir': 'Film-Noir_sub',
 'Horror': 'Horror_sub',
 'IMAX': 'IMAX_sub',
 'Musical': 'Musical_sub',
 'Mystery': 'Mystery_sub',
 'Romance': 'Romance_sub',
 'Sci-Fi': 'Sci-Fi_sub',
 'Thriller': 'Thriller_sub',
 'War': 'War_sub',
 'Western': 'Western_sub'}

In [24]:
subgenre_dummies.rename(columns=cols, inplace=True)
subgenre_dummies.head()

Unnamed: 0,Action_sub,Adventure_sub,Animation_sub,Children_sub,Comedy_sub,Crime_sub,Documentary_sub,Drama_sub,Fantasy_sub,Film-Noir_sub,Horror_sub,IMAX_sub,Musical_sub,Mystery_sub,Romance_sub,Sci-Fi_sub,Thriller_sub,War_sub,Western_sub
0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
3,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
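
As a possible shortcut for the rename step, `DataFrame.add_suffix` appends the same suffix to every column in one call. The sketch below runs on a made-up Series (`subgenre_demo` is a hypothetical stand-in for `movies['subgenre']`). It also passes `dtype=int`, on the assumption that 0/1 integer columns like the ones shown above are wanted; recent pandas versions return boolean columns from `get_dummies` by default.

```python
import pandas as pd

subgenre_demo = pd.Series(['Animation', 'Children', 'Comedy', 'Children'])

# One-hot encode, drop the first category, and suffix every column name
dummies_demo = (pd.get_dummies(subgenre_demo, drop_first=True, dtype=int)
                .add_suffix('_sub'))

print(dummies_demo)
```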


In [25]:
movies = pd.concat([movies, genre_dummies], axis=1)
movies = pd.concat([movies, subgenre_dummies], axis=1)
movies.head()

Unnamed: 0,movieId,title,rating_count,rating_avg,genre,subgenre,weighted_avg,Action,Adventure,Animation,...,Film-Noir_sub,Horror_sub,IMAX_sub,Musical_sub,Mystery_sub,Romance_sub,Sci-Fi_sub,Thriller_sub,War_sub,Western_sub
0,1,Toy Story (1995),215,3.92093,Adventure,Animation,3.894473,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),110,3.431818,Adventure,Children,3.419009,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),52,3.259615,Comedy,Romance,3.260033,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale (1995),7,2.357143,Comedy,Drama,2.866377,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Father of the Bride Part II (1995),49,3.071429,Comedy,Comedy,3.10107,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [26]:
movies.columns

Index(['movieId', 'title', 'rating_count', 'rating_avg', 'genre', 'subgenre',
       'weighted_avg', 'Action', 'Adventure', 'Animation', 'Children',
       'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir',
       'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War',
       'Western', 'Action_sub', 'Adventure_sub', 'Animation_sub',
       'Children_sub', 'Comedy_sub', 'Crime_sub', 'Documentary_sub',
       'Drama_sub', 'Fantasy_sub', 'Film-Noir_sub', 'Horror_sub', 'IMAX_sub',
       'Musical_sub', 'Mystery_sub', 'Romance_sub', 'Sci-Fi_sub',
       'Thriller_sub', 'War_sub', 'Western_sub'],
      dtype='object')

In [27]:
movies.drop(['genre', 'subgenre'], inplace=True, axis=1)
movies.head()

Unnamed: 0,movieId,title,rating_count,rating_avg,weighted_avg,Action,Adventure,Animation,Children,Comedy,...,Film-Noir_sub,Horror_sub,IMAX_sub,Musical_sub,Mystery_sub,Romance_sub,Sci-Fi_sub,Thriller_sub,War_sub,Western_sub
0,1,Toy Story (1995),215,3.92093,3.894473,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),110,3.431818,3.419009,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),52,3.259615,3.260033,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale (1995),7,2.357143,2.866377,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,5,Father of the Bride Part II (1995),49,3.071429,3.10107,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


#### Standard Scaling Rating Count

Fitting a scaler object to export later

In [28]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# StandardScaler expects 2D input, so select the column as a one-column DataFrame
scaler.fit(movies[['rating_count']])

In [None]:
movies['rating_count'] = (movies['rating_count'] - movies['rating_count'].mean()) / movies['rating_count'].std()
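
One detail worth checking: pandas' `Series.std()` uses the sample standard deviation (`ddof=1`) by default, whereas scikit-learn's `StandardScaler` divides by the population standard deviation (`ddof=0`). The manual formula in the cell above would therefore not exactly reproduce what a fitted `StandardScaler` computes. The sketch below shows the size of the difference on made-up counts (`counts_demo` is illustrative, not dataset values):

```python
import numpy as np
import pandas as pd

counts_demo = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

sample_std = counts_demo.std()            # ddof=1, the pandas default
population_std = counts_demo.std(ddof=0)  # ddof=0, the convention StandardScaler uses

centered = counts_demo - counts_demo.mean()
z_manual = centered / sample_std          # what the manual formula produces
z_population = centered / population_std  # what a population-std scaler produces

print(sample_std, population_std)
print(z_manual.round(4).tolist())
print(z_population.round(4).tolist())
```

For a column with thousands of rows the two results are nearly identical, but using the same convention at training and inference time avoids a silent mismatch.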

In [None]:
movies.head()

In [None]:
X = movies.drop(['movieId', 'title'], axis=1)
movies_data = movies[['movieId', 'title']]

In [None]:
X.head()

In [None]:
movies_data.head()

## Creating Sparse Matrix

In [None]:
from scipy.sparse import csr_matrix
X_sparse = csr_matrix(X)

## Training Nearest Neighbors Model

In [None]:
from sklearn.neighbors import NearestNeighbors

In [None]:
nn_model = NearestNeighbors(metric='cosine', algorithm='brute')

In [None]:
nn_model.fit(X_sparse)
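
With `metric='cosine'` and `algorithm='brute'`, the model compares a query against every row using cosine distance, that is, one minus the cosine similarity. Below is a minimal NumPy sketch of that computation on a made-up feature matrix. The array `X_demo` and the helper `cosine_neighbors` are hypothetical names for illustration and are not part of scikit-learn:

```python
import numpy as np

def cosine_neighbors(X, query, k):
    """Return the k smallest cosine distances from query to the rows of X, with their row indices."""
    # Normalise rows and query to unit length so a dot product gives cosine similarity
    X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)
    q_norm = query / np.linalg.norm(query)
    distances = 1.0 - X_norm @ q_norm
    # Stable sort keeps the original row order among ties
    order = np.argsort(distances, kind='stable')[:k]
    return distances[order], order

X_demo = np.array([[1.0, 0.0],
                   [1.0, 1.0],
                   [0.0, 1.0],
                   [2.0, 0.0]])

dist_demo, idx_demo = cosine_neighbors(X_demo, X_demo[0], k=3)
print(idx_demo, dist_demo)
```

Rows 0 and 3 tie at distance 0 because cosine distance ignores vector magnitude and only compares direction.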

## Testing a Sample

In [None]:
test_index = np.random.choice(movies_data.shape[0])
test_index

In [None]:
movies_data.iloc[test_index]

In [None]:
sample = X.iloc[test_index].values.reshape(1, -1)
sample

In [None]:
distances, indices = nn_model.kneighbors(sample, n_neighbors = 100)

In [None]:
indices.flatten()

In [None]:
print('Original Movie:', movies_data.iloc[test_index]['title'], ', ID: ', movies_data.iloc[test_index]['movieId'])

In [None]:
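# Pick 5 distinct neighbour positions; position 0 is skipped because it is typically the query movie itself
recs = np.random.choice(np.arange(1, 100), size=5, replace=False)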
recs

In [None]:
print('Recommended Movies: ')
for i in recs:
    print(f"MovieID: {movies_data.iloc[indices.flatten()[i]]['movieId']}, \
\tName: {movies_data.iloc[indices.flatten()[i]]['title']}, \
\tDistance: {distances.flatten()[i]}")

In [None]:
movies.iloc[indices.flatten()[recs]]

## Exporting Model and Scaling Object 

In [None]:
import pickle

Exporting Model

In [None]:
with open('nn_model_movies.pkl', 'wb') as f:
    pickle.dump(nn_model, f)

Exporting Scaler

In [None]:
with open('movies_rating_scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)
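
The exported files can later be read back with `pickle.load`. The sketch below shows the round trip with a plain dictionary standing in for the model; the file name `demo_model.pkl` and the `artifact` object are hypothetical. Pickle files should only be loaded from trusted sources, and scikit-learn objects are generally best loaded with the same library version that saved them.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model or scaler (assumption for illustration)
artifact = {'n_neighbors': 5, 'metric': 'cosine'}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_model.pkl')

    # Write the object to disk in binary mode
    with open(path, 'wb') as f:
        pickle.dump(artifact, f)

    # Read it back, also in binary mode
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored)
```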