# Netflix movie recommendation engine
Kaggle dataset link: https://www.kaggle.com/datasets/netflix-inc/netflix-prize-data

After preprocessing, we have two datasets in sparse matrix format:
1. train_sparse_matrix
2. test_sparse_matrix

Both datasets have users as rows, movies as columns, and ratings as the stored values.
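
To make this layout concrete, here is a minimal sketch with made-up ratings. The IDs and values in `toy_users`, `toy_movies` and `toy_ratings` are illustrative and are not taken from the Netflix data.

```python
import numpy as np
from scipy import sparse

# Toy (user, movie, rating) triples; the values are made up for illustration
toy_users = np.array([0, 0, 1, 3])
toy_movies = np.array([1, 2, 2, 0])
toy_ratings = np.array([4, 5, 3, 1])

# Rows are users, columns are movies, stored values are ratings
toy_matrix = sparse.csr_matrix((toy_ratings, (toy_users, toy_movies)), shape=(4, 3))

print(toy_matrix.toarray())
print("Stored ratings:", toy_matrix.count_nonzero())
```

User 2 has no ratings here, so its row is simply empty; only the four stored ratings take up memory.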

In [4]:
# imports
from datetime import datetime # To compute time taken wherever necessary
import os
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import sparse
from scipy.sparse import csr_matrix

import warnings
warnings.filterwarnings("ignore")

In [163]:
# Load train & test sparse matrix which we've created in previous file
transformed_folder = 'F:/09 - Machine Learning Case Studies/01 - Netflix Movies Recommendation/Transformed Data'
train_sparse_matrix = sparse.load_npz(transformed_folder+'/train_sparse_matrix.npz')
print('train_sparse_matrix loaded!')
test_sparse_matrix = sparse.load_npz(transformed_folder+'/test_sparse_matrix.npz')
print('test_sparse_matrix loaded!')

train_sparse_matrix loaded!
test_sparse_matrix loaded!


### 1. Create samples and do feature engineering on sample datasets
The train dataset contains 405k users and 17k movies, while the test dataset contains 349k users and 17k movies. Computing similarities and features on the full data would take a long time on my local machine. To save time, let's create a sample of each dataset and work on that. Later we can apply the same operations to the full datasets.

#### A function that takes a sparse matrix and returns a sampled sparse matrix

In [127]:
def get_sample(sparse_matrix, no_users, no_movies, file_name):
    
    row_index, col_index, ratings = sparse.find(sparse_matrix)
    users = np.unique(row_index)
    movies = np.unique (col_index)
    
    print("Original matrix stats")
    print("Users: {}\nMovies: {}\nRatings: {}".format(len(users), len(movies), len(ratings)))
    
    # Fix the random seed so that we get the same samples every time
    np.random.seed(15)
    # Pick samples without replacement
    sample_users = np.random.choice(users, no_users, replace=False)
    sample_movies = np.random.choice(movies, no_movies, replace=False)
    
    # Build a boolean mask that is True for ratings whose user and movie were both sampled.
    # Indexing with this mask picks those ratings directly.
    boolean_mask = np.logical_and(np.isin(row_index, sample_users), np.isin(col_index, sample_movies))
    
    # Build the sample matrix from the entries where boolean_mask is True
    sample_sparse_matrix = sparse.csc_matrix( (ratings[boolean_mask], (row_index[boolean_mask], col_index[boolean_mask])),
                                            shape = (max(sample_users)+1, max(sample_movies)+1) )

    print("\nSampled matrix stats")
    print("Users: {}\nMovies: {}\nRatings: {}".format(len(sample_users), len(sample_movies), ratings[boolean_mask].shape[0]))
    
    print('Saving to {}'.format(transformed_folder+'/'+file_name))
    sparse.save_npz(transformed_folder+'/'+file_name, sample_sparse_matrix)
    return sample_sparse_matrix
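
The masking step in this function can be tried on a small matrix. The sketch below uses made-up ratings and a hand-picked sample (`keep_users`, `keep_movies`) instead of a random one, so the result is easy to follow.

```python
import numpy as np
from scipy import sparse

# Toy 4-user x 4-movie matrix; the ratings are made up for illustration
toy = sparse.csr_matrix(np.array([
    [5, 0, 3, 0],
    [0, 4, 0, 0],
    [1, 0, 0, 2],
    [0, 0, 4, 5],
]))

row_index, col_index, ratings = sparse.find(toy)

# Hypothetical sample: keep users 0 and 3, movies 2 and 3
keep_users = np.array([0, 3])
keep_movies = np.array([2, 3])

# A rating survives only if both its user and its movie were sampled
mask = np.logical_and(np.isin(row_index, keep_users), np.isin(col_index, keep_movies))

sample = sparse.csr_matrix(
    (ratings[mask], (row_index[mask], col_index[mask])),
    shape=(keep_users.max() + 1, keep_movies.max() + 1),
)
print(sample.toarray())
```

The sampled matrix keeps the original user and movie IDs as its row and column indexes. Users and movies that were not sampled stay in the matrix as empty rows and columns, so its shape depends on the largest sampled ID, not on the number of sampled users or movies.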

#### 1.1 Create sample_train_sparse_matrix from train_sparse_matrix

In [164]:
start = datetime.now()
if os.path.isfile(transformed_folder+'/sample_train_sparse_matrix.npz'):
    print('Sparse matrix for sample train already exists. Getting it from disk...')
    sample_train_sparse_matrix = sparse.load_npz(transformed_folder+'/sample_train_sparse_matrix.npz')
    print('Done!')
    print('Time taken: {}'.format(datetime.now() - start))
else:
    print('Creating sample_train_sparse_matrix from train_sparse_matrix...')
    sample_train_sparse_matrix = get_sample(sparse_matrix=train_sparse_matrix, no_users=25000, 
                                            no_movies=3000, file_name="sample_train_sparse_matrix.npz")
    print('Done!')
    print('Time taken: {}'.format(datetime.now() - start))

Creating sample_train_sparse_matrix from train_sparse_matrix...
Original matrix stats
Users: 405041
Movies: 17424
Ratings: 80384405

Sampled matrix stats
Users: 25000
Movies: 3000
Ratings: 856986
Saving to F:/09 - Machine Learning Case Studies/01 - Netflix Movies Recommendation/Transformed Data/sample_train_sparse_matrix.npz
Done!
Time taken: 0:00:31.997092


#### 1.2 Create sample_test_sparse_matrix from test_sparse_matrix

In [166]:
start = datetime.now()
if os.path.isfile(transformed_folder+'/sample_test_sparse_matrix.npz'):
    print('Sparse matrix for sample test already exists. Getting it from disk...')
    sample_test_sparse_matrix = sparse.load_npz(transformed_folder+'/sample_test_sparse_matrix.npz')
    print('Done!')
    print('Time taken: {}'.format(datetime.now() - start))
else:
    print('Creating sample_test_sparse_matrix from test_sparse_matrix...')
    sample_test_sparse_matrix = get_sample(sparse_matrix=test_sparse_matrix, no_users=15000, 
                                            no_movies=2000, file_name="sample_test_sparse_matrix.npz")
    print('Done!')
    print('Time taken: {}'.format(datetime.now() - start))

Creating sample_test_sparse_matrix from test_sparse_matrix...
Original matrix stats
Users: 349312
Movies: 17757
Ratings: 20096102

Sampled matrix stats
Users: 15000
Movies: 2000
Ratings: 103997
Saving to F:/09 - Machine Learning Case Studies/01 - Netflix Movies Recommendation/Transformed Data/sample_test_sparse_matrix.npz
Done!
Time taken: 0:00:06.803585


In [167]:
# Compare samples with original
print('Ratings in train_sparse_matrix: {}'.format(train_sparse_matrix.count_nonzero()))
print('Ratings in sample_train_sparse_matrix: {}'.format(sample_train_sparse_matrix.count_nonzero()))

print('\nRatings in test_sparse_matrix: {}'.format(test_sparse_matrix.count_nonzero()))
print('Ratings in sample_test_sparse_matrix: {}'.format(sample_test_sparse_matrix.count_nonzero()))


Ratings in train_sparse_matrix: 80384405
Ratings in sample_train_sparse_matrix: 856986

Ratings in test_sparse_matrix: 20096102
Ratings in sample_test_sparse_matrix: 103997


In [168]:
# Change variable names for ease
del train_sparse_matrix
del test_sparse_matrix
train_sparse_matrix = sample_train_sparse_matrix
test_sparse_matrix = sample_test_sparse_matrix

In [169]:
# After changing variable names
print('Ratings in train_sparse_matrix: {}'.format(train_sparse_matrix.count_nonzero()))
print('Ratings in sample_train_sparse_matrix: {}'.format(sample_train_sparse_matrix.count_nonzero()))

print('\nRatings in test_sparse_matrix: {}'.format(test_sparse_matrix.count_nonzero()))
print('Ratings in sample_test_sparse_matrix: {}'.format(sample_test_sparse_matrix.count_nonzero()))

Ratings in train_sparse_matrix: 856986
Ratings in sample_train_sparse_matrix: 856986

Ratings in test_sparse_matrix: 103997
Ratings in sample_test_sparse_matrix: 103997


### 2. Feature Engineering

#### 2.1 Basic features based on statistics
Let's build some features that will be useful for modelling. A few such features are:
1. Average of all ratings given
2. Average rating per user
3. Average rating per movie

We will store all of these values in a dictionary, train_averages, structured like this:<br>
train_averages = <br>
{<br>
'global_average' : xyz,<br>
'user' : { 'user1': abc, 'user2':pqr},<br>
'movie' : { 'movie1': abc, 'movie2':pqr}<br>
}

In [170]:
train_averages = dict()

#### Global average

In [171]:
start = datetime.now()
# Global average
train_averages['global_average'] = train_sparse_matrix.sum()/train_sparse_matrix.count_nonzero()
print(train_averages)
print("*"*50)
print('Time taken: {}'.format(datetime.now() - start))

{'global_average': 3.5875813607223455}
**************************************************
Time taken: 0:00:00.021661


#### Average rating per user

In [172]:
start = datetime.now()
# Average rating per user
print('Computing average ratings per user...')
sum_of_ratings_per_user = train_sparse_matrix.sum(axis=1).A1
no_of_ratings_per_user = (train_sparse_matrix!=0).sum(axis=1).A1

u,m = train_sparse_matrix.shape
average_ratings = dict()
for i in range(0,u):
    if no_of_ratings_per_user[i]!=0:
        average_ratings[i] = sum_of_ratings_per_user[i]/no_of_ratings_per_user[i]

train_averages['user'] = average_ratings
print('Done!')
print("*"*50)
print('Time taken: {}'.format(datetime.now() - start))

Computing average ratings per user...
Done!
**************************************************
Time taken: 0:00:00.514203


In [173]:
print('Average rating given by user 14531 is: {}'.format(train_averages['user'][14531]))

Average rating given by user 14531 is: 3.7142857142857144
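
The per-user loop can also be written without an explicit `for` loop. This is a minimal sketch on a toy matrix with made-up ratings; the names `rated` and `user_average` are introduced here for illustration.

```python
import numpy as np
from scipy import sparse

# Toy user x movie matrix; the ratings are made up for illustration
toy = sparse.csr_matrix(np.array([
    [5, 0, 3],
    [0, 0, 0],   # user with no ratings
    [4, 4, 1],
]))

sums = np.asarray(toy.sum(axis=1)).ravel()           # sum of ratings per user
counts = np.asarray((toy != 0).sum(axis=1)).ravel()  # number of ratings per user

# Keep only users with at least one rating, to avoid dividing by zero
rated = np.flatnonzero(counts)
user_average = dict(zip(rated.tolist(), (sums[rated] / counts[rated]).tolist()))
print(user_average)
```

Users without any rating are left out of the dictionary, which matches the behaviour of the loop above.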


#### Average rating per movie

In [174]:
start = datetime.now()
# Average rating per movie
print('Computing average ratings per movie...')
sum_of_ratings_per_movie = train_sparse_matrix.sum(axis=0).A1
no_of_ratings_per_movie = (train_sparse_matrix!=0).sum(axis=0).A1

u,m = train_sparse_matrix.shape
average_ratings = dict()
for i in range(0,m):
    if no_of_ratings_per_movie[i]!=0:
        average_ratings[i] = sum_of_ratings_per_movie[i]/no_of_ratings_per_movie[i]

train_averages['movie'] = average_ratings
print('Done!')
print("*"*50)
print('Time taken: {}'.format(datetime.now() - start))

Computing average ratings per movie...
Done!
**************************************************
Time taken: 0:00:00.016515


In [175]:
print('Average rating given for movie 40 is: {}'.format(train_averages['movie'][40]))

Average rating given for movie 40 is: 2.6923076923076925


#### 2.2 Compute user-user similarity matrix

We know that there are more than 400k users. Computing cosine similarity for every pair of users would require about (400k * 400k / 2) = 80 billion computations, which would take days even though only the nonzero entries are involved. <br>
Reducing the number of dimensions with PCA or SVD would take even more time, because the reduced matrix becomes dense: with no zero cells left, the multiplication has to be done for every feature. <br><br>

One idea is to compute similarity at <b>run time</b>. We will follow the strategy below to compute user-user similarity:<br>
1. Compute the top N similar users for a given user at run time.
2. Once similarity has been computed for a user, store it in a custom data structure so that we can retrieve it whenever we need it in future.
3. The next time similarity is needed for a user, first check whether it has already been computed for that user. If yes, get it from our custom data structure. If not, compute the similarity and store it there.
<br><br>
Our custom data structure will be a <b>dictionary of dictionaries</b>:<br>
{ 'user1': {'similar_user1': value1, 'similar_user2': value2, 'similar_user3': value3}, <br>
  'user2': {'similar_user1': value1, 'similar_user2': value2, 'similar_user3': value3} }
  
<br>
For now, since we have taken a sample of users, let's compute similarity using cosine_similarity, which shouldn't take much time.<br>
<b>From the result of this, we will pick top 5 similar users and take them as features to train our model.</b>
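
The run-time strategy described above can be sketched as follows. This is only an illustration under stated assumptions: `similarity_cache` and `get_top_similar_users` are hypothetical names, the toy ratings are made up, and the cache is a plain in-memory dictionary.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Cache in the dictionary-of-dictionaries form: {user: {similar_user: similarity}}
similarity_cache = {}

def get_top_similar_users(user, ratings_matrix, top_n=5):
    """Hypothetical helper: return the top_n most similar users, computing on first request."""
    if user in similarity_cache:
        return similarity_cache[user]

    # Similarity of this one user against every user: a single (1 x n_users) row
    sims = cosine_similarity(ratings_matrix[user], ratings_matrix).ravel()
    sims[user] = -1.0  # exclude the user itself

    top = np.argsort(sims)[::-1][:top_n]
    similarity_cache[user] = {int(u): float(sims[u]) for u in top}
    return similarity_cache[user]

# Toy user x movie matrix; the ratings are made up for illustration
toy = sparse.csr_matrix(np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 3, 5],
    [5, 0, 0, 1],
]))
print(get_top_similar_users(0, toy, top_n=2))
```

Each request costs one row-against-matrix product instead of a full user-user matrix, and a repeated request for the same user is answered from the cache.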

In [176]:
from sklearn.metrics.pairwise import cosine_similarity

In [177]:
start = datetime.now()
# To compute this, we will use cosine similarity
if os.path.isfile(transformed_folder+'/u_u_similarity_sparse.npz'):
    print('Sparse matrix for user-user similarity already exists, getting it from disk... ')
    u_u_similarity_sparse = sparse.load_npz(transformed_folder+'/u_u_similarity_sparse.npz')
    print('Done!')
    print('Time taken: {}'.format(datetime.now() - start))
else:
    print('Creating sparse matrix for user-user similarity...')
    u_u_similarity_sparse = cosine_similarity(X=train_sparse_matrix, dense_output=False)
    print('Saving to {}'.format(transformed_folder+'/u_u_similarity_sparse.npz'))
    sparse.save_npz(transformed_folder+'/u_u_similarity_sparse.npz', u_u_similarity_sparse)
    print('Done!')
    print('Time taken: {}'.format(datetime.now() - start))

Creating sparse matrix for user-user similarity...
Saving to F:/09 - Machine Learning Case Studies/01 - Netflix Movies Recommendation/Transformed Data/u_u_similarity_sparse.npz
Done!
Time taken: 0:04:06.605539


In [178]:
print('Shape of user-user similarity matrix: {}'.format(u_u_similarity_sparse.shape))

Shape of user-user similarity matrix: (2649405, 2649405)


We have computed the similarity between every pair of users. In practice, we only care about the top 10 or 20 most similar users for a given user, whose movies we can then check. For this, we will maintain a separate dictionary and refer to it when needed.

In [179]:
# Get unique user IDs
user_ids = np.unique(u_u_similarity_sparse.nonzero()[1])

In [182]:
print('Total number of unique users: {}'.format(len(user_ids)))

Total number of unique users: 24029


In [183]:
start = datetime.now()
# Create dictionary of top 20 similar users for each of the user based on similarity matrix
similar_users = dict()
print('Creating dictionary of top 20 users for each user...')
for user in user_ids:
    sim_users = u_u_similarity_sparse[user].toarray().ravel().argsort()[::-1][1:]
    # toarray() : converts the sparse row to a dense 2-D array of shape (1, n_users)
    # ravel() : flattens it to a 1-D array
    # argsort() : returns the indexes that would sort the values in ascending order
    # [::-1] : reverses them so that the most similar users come first
    # [1:] : skips the first index, assumed to be the user itself (highest similarity)
    similar_users[user] = sim_users[:20]
print('Done!')
print('Time taken: {}'.format(datetime.now() - start))

Creating dictionary of top 20 users for each user


MemoryError: Unable to allocate 20.2 MiB for an array with shape (2649405,) and data type int64
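
The error comes from turning each row of a matrix with 2,649,405 columns into a dense array. One possible way around it, sketched here on a toy matrix, is to sort only the stored entries of each row. The helper `top_n_from_sparse_row` is hypothetical and assumes the similarity matrix is in CSR format (call `.tocsr()` first if it is not).

```python
import numpy as np
from scipy import sparse

def top_n_from_sparse_row(similarity_csr, row, n=20):
    """Hypothetical helper: indexes of the n largest stored similarities in one CSR row."""
    start, end = similarity_csr.indptr[row], similarity_csr.indptr[row + 1]
    cols = similarity_csr.indices[start:end]
    vals = similarity_csr.data[start:end]

    # Drop the self-similarity entry on the diagonal
    keep = cols != row
    cols, vals = cols[keep], vals[keep]

    # Sort only the stored (nonzero) entries, largest first
    order = np.argsort(vals)[::-1][:n]
    return cols[order]

# Toy similarity matrix; the values are made up for illustration
toy_sim = sparse.csr_matrix(np.array([
    [1.0, 0.2, 0.0, 0.9],
    [0.2, 1.0, 0.5, 0.0],
    [0.0, 0.5, 1.0, 0.0],
    [0.9, 0.0, 0.0, 1.0],
]))
print(top_n_from_sparse_row(toy_sim, 0, n=2))
```

Memory use per row is then proportional to the number of nonzero similarities in that row, not to the total number of columns. Unlike the dense version, a row with fewer than `n` nonzero similarities returns fewer than `n` indexes.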

#### 2.3 Compute movie-movie similarity matrix

We know that there are around 17k movies in total. Computing cosine similarity for every pair of movies requires about (17k * 17k / 2) = 144 million computations, which would not take long. In this case, we can definitely use cosine similarity to get similar movies for each movie.
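
The pair counts quoted for users and movies can be checked with a short calculation. It uses the rounded figures of 400k users and 17k movies and the exact pair formula n * (n - 1) / 2, which for large n is very close to n * n / 2.

```python
def pair_count(n):
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2

print("User pairs:  {:,}".format(pair_count(400_000)))
print("Movie pairs: {:,}".format(pair_count(17_000)))
```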

<br>
<b>From the result of this, we will pick top 5 similar movies and take them as features to train our model.</b>

In [156]:
start = datetime.now()
# To compute this, we will use cosine similarity
if os.path.isfile(transformed_folder+'/m_m_similarity_sparse.npz'):
    print('Sparse matrix for movie-movie similarity already exists, getting it from disk... ')
    m_m_similarity_sparse = sparse.load_npz(transformed_folder+'/m_m_similarity_sparse.npz')
    print('Done!')
    print('Time taken: {}'.format(datetime.now() - start))
else:
    print('Creating sparse matrix for movie-movie similarity...')
    m_m_similarity_sparse = cosine_similarity(X=train_sparse_matrix.T, dense_output=False)
    print('Saving to {}'.format(transformed_folder+'/m_m_similarity_sparse.npz'))
    sparse.save_npz(transformed_folder+'/m_m_similarity_sparse.npz', m_m_similarity_sparse)
    print('Done!')
    print('Time taken: {}'.format(datetime.now() - start))

Creating sparse matrix for movie-movie similarity...
Saving to F:/09 - Machine Learning Case Studies/01 - Netflix Movies Recommendation/Transformed Data/m_m_similarity_sparse.npz
Done!
Time taken: 0:00:00.257410


In [157]:
print('Shape of movie-movie similarity matrix: {}'.format(m_m_similarity_sparse.shape))

Shape of movie-movie similarity matrix: (17724, 17724)


We have computed the similarity between every pair of movies. In practice, we only care about the top 10 or 20 most similar movies that we can recommend to a user. For this, we will maintain a separate dictionary and refer to it when needed.

In [159]:
# Get unique movie IDs which have at least 1 rating
movie_ids = np.unique(m_m_similarity_sparse.nonzero()[1])

In [160]:
start = datetime.now()
# Create dictionary of top 20 movies for each of the movies based on similarity matrix
similar_movies = dict()
print('Creating dictionary of top 20 movies for each movie')
for movie in movie_ids:
    sim_movies = m_m_similarity_sparse[movie].toarray().ravel().argsort()[::-1][1:]
    # toarray() : converts the sparse row to a dense 2-D array of shape (1, n_movies)
    # ravel() : flattens it to a 1-D array
    # argsort() : returns the indexes that would sort the values in ascending order
    # [::-1] : reverses them so that the most similar movies come first
    # [1:] : skips the first index, assumed to be the movie itself (highest similarity)
    similar_movies[movie] = sim_movies[:20]
print('Done!')
print('Time taken: {}'.format(datetime.now() - start))

Creating dictionary of top 20 movies for each movie
Done!
Time taken: 0:00:00.281233


### 3. Featurizing Data

So far, we have the following features that we want to train our model on:
1. Global average
2. Average rating by the user
3. Average rating for the movie
4. Ratings given to the movie by the top 5 similar users
5. Ratings given by the user to the top 5 similar movies

In total, we have 13 features to train our base model on. Let's featurize the data to bring it into the required format.
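
One possible layout for a single feature row is sketched below. The helper `build_feature_row` is hypothetical, the toy data is made up, and the padding rule (filling missing similar-user ratings with the movie average and missing similar-movie ratings with the user average) is an assumption that the notebook does not specify.

```python
import numpy as np
from scipy import sparse

def build_feature_row(user, movie, ratings_csr, averages, similar_users, similar_movies, k=5):
    """Hypothetical helper: 3 averages + k similar-user ratings + k similar-movie ratings."""
    global_avg = averages['global_average']
    user_avg = averages['user'].get(user, global_avg)
    movie_avg = averages['movie'].get(movie, global_avg)

    # Ratings that similar users gave to this movie (first k that exist)
    by_sim_users = [float(ratings_csr[u, movie]) for u in similar_users.get(user, [])]
    by_sim_users = [r for r in by_sim_users if r != 0][:k]
    # Assumption: pad missing values with the movie's average
    by_sim_users += [movie_avg] * (k - len(by_sim_users))

    # Ratings that this user gave to similar movies (first k that exist)
    for_sim_movies = [float(ratings_csr[user, m]) for m in similar_movies.get(movie, [])]
    for_sim_movies = [r for r in for_sim_movies if r != 0][:k]
    # Assumption: pad missing values with the user's average
    for_sim_movies += [user_avg] * (k - len(for_sim_movies))

    return [global_avg, user_avg, movie_avg] + by_sim_users + for_sim_movies

# Toy data, made up for illustration
toy = sparse.csr_matrix(np.array([[5, 0, 3], [4, 2, 0], [0, 1, 4]]))
toy_averages = {
    'global_average': 19 / 6,
    'user': {0: 4.0, 1: 3.0, 2: 2.5},
    'movie': {0: 4.5, 1: 1.5, 2: 3.5},
}
row = build_feature_row(0, 1, toy, toy_averages, {0: [1, 2]}, {1: [0, 2]})
print(len(row), row)
```

With k=5 the row has 3 + 5 + 5 = 13 values, matching the feature count above.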

In [115]:
from scipy.sparse import csr_matrix, find
A = csr_matrix([[7.0, 8.0, 0],[0, 0, 9.0], [1,2,3]])
find(A)

(array([0, 2, 0, 2, 1, 2], dtype=int32),
 array([0, 0, 1, 1, 2, 2], dtype=int32),
 array([7., 1., 8., 2., 9., 3.]))

In [116]:
A.toarray()

array([[7., 8., 0.],
       [0., 0., 9.],
       [1., 2., 3.]])