Movie-Movie Recommendation

**DESCRIPTION**

Consider the ratings dataset below, which contains UserID, MovieID, Rating, and Timestamp. Each line of the file represents one rating of one movie by one user and has the following format:

`UserID::MovieID::Rating::Timestamp`

Ratings are made on a 5-star scale with half-star increments.

- UserID: the ID of the user
- MovieID: the ID of the movie
- Timestamp: seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970
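To make the record format concrete, here is a minimal sketch that parses one line in the `UserID::MovieID::Rating::Timestamp` layout and converts the timestamp to a UTC datetime. The `RatingRecord` type and the `parse_rating_line` helper are hypothetical names introduced only for illustration; the sample values are taken from the first row displayed further down.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class RatingRecord(NamedTuple):
    # Hypothetical container for one parsed rating line.
    user_id: int
    movie_id: int
    rating: float
    rated_at: datetime


def parse_rating_line(line: str) -> RatingRecord:
    """Parse one 'UserID::MovieID::Rating::Timestamp' line."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return RatingRecord(
        user_id=int(user_id),
        movie_id=int(movie_id),
        rating=float(rating),
        # The timestamp counts seconds since 1970-01-01 00:00:00 UTC.
        rated_at=datetime.fromtimestamp(int(timestamp), tz=timezone.utc),
    )


record = parse_rating_line("186::302::3::891717742")
print(record)
```

Keeping the rating as a float leaves room for half-star values such as 3.5.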

**Objective**: *Build a movie-movie recommendation model.*

# Import libraries

In [None]:
import pandas as pd
import numpy as np

# Data acquisition

In [None]:
pd_df_recommend = pd.read_csv('/content/Recommend.csv')

In [None]:
pd_df_recommend.columns = ['UserID', 'MovieID', 'Rating', 'Timestamp']

In [None]:
pd_df_recommend.head(10)

Unnamed: 0,UserID,MovieID,Rating,Timestamp
0,186,302,3,891717742
1,22,377,1,878887116
2,244,51,2,880606923
3,166,346,1,886397596
4,298,474,4,884182806
5,115,265,2,881171488
6,253,465,5,891628467
7,305,451,3,886324817
8,6,86,3,883603013
9,62,257,2,879372434


In [None]:
pd_df_recommend.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99999 entries, 0 to 99998
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   UserID     99999 non-null  int64
 1   MovieID    99999 non-null  int64
 2   Rating     99999 non-null  int64
 3   Timestamp  99999 non-null  int64
dtypes: int64(4)
memory usage: 3.1 MB


# Recommendation model

In [None]:
pd_df_recommend.columns

Index(['UserID', 'MovieID', 'Rating', 'Timestamp'], dtype='object')

In [None]:
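from sklearn.model_selection import train_test_split

# Count distinct users and movies to size the user-item matrices.
n_users = pd_df_recommend.UserID.nunique()
n_movie = pd_df_recommend.MovieID.nunique()
# Hold out 25% of the ratings for testing (no random_state is set, so the split varies between runs).
train_data, test_data = train_test_split(pd_df_recommend, test_size=0.25)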

In [None]:
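# User-item matrix for the test split; 0 marks a movie the user has not rated.
test_data_matrix = np.zeros((n_users, n_movie))
# itertuples() yields (index, UserID, MovieID, Rating, Timestamp); IDs are assumed to be 1-based and contiguous.
for _l, userID_l, movieID_l, rating_l, _l in test_data.itertuples():
  test_data_matrix[userID_l - 1, movieID_l - 1] = rating_l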

test_data_matrix

array([[0., 0., 0., ..., 0., 0., 0.],
       [4., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 5., 0., ..., 0., 0., 0.]])

In [None]:
# User-item matrix for the training split, built the same way as the test matrix.
train_data_matrix = np.zeros((n_users, n_movie))
for _l, userID_l, movieID_l, rating_l, _l in train_data.itertuples():
  train_data_matrix[userID_l - 1, movieID_l - 1] = rating_l

train_data_matrix

array([[5., 3., 4., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [5., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
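The row-by-row loops above can also be written as a single indexed assignment. The following is a sketch on a small hypothetical `ratings` frame, not the notebook's data. It sizes the matrix by the largest ID rather than the number of distinct IDs, an assumption that keeps the `ID - 1` indexing in bounds even if some IDs are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings; column names follow the notebook.
ratings = pd.DataFrame({
    "UserID": [1, 1, 2, 3],
    "MovieID": [1, 3, 2, 3],
    "Rating": [5, 3, 4, 1],
})

# Size by the largest ID so that (ID - 1) is always a valid index.
n_users = int(ratings["UserID"].max())
n_movies = int(ratings["MovieID"].max())

matrix = np.zeros((n_users, n_movies))
rows = ratings["UserID"].to_numpy() - 1   # 1-based IDs to 0-based rows
cols = ratings["MovieID"].to_numpy() - 1  # 1-based IDs to 0-based columns
matrix[rows, cols] = ratings["Rating"].to_numpy()

print(matrix)
```

If a user rated the same movie more than once, the last assignment wins, which is also how the loop behaves.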

In [None]:
from sklearn.metrics import pairwise_distances

In [None]:
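# NOTE: pairwise_distances returns cosine *distance* (1 - cosine similarity),
# so these weights grow as two users become less alike.
user_similarity = pairwise_distances(train_data_matrix, metric = 'cosine')
# Per-user mean over all movies; unrated entries (stored as 0) are included in the mean.
mean_user_rating = train_data_matrix.mean(axis=1)[:, np.newaxis]
ratings_diff = (train_data_matrix - mean_user_rating)
# Weighted average of the other users' deviations, added back to each user's mean.
user_pred = mean_user_rating + user_similarity.dot(ratings_diff) / np.array([np.abs(user_similarity).sum(axis=1)]).T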
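The weighted-average prediction above only makes sense if the weights are similarities. scikit-learn defines cosine distance as 1 minus cosine similarity, so a similarity matrix can be obtained as `1 - pairwise_distances(X, metric='cosine')`. The sketch below is not the notebook's code: it computes cosine similarity directly with NumPy on a hypothetical `toy_ratings` matrix and applies the same prediction formula, so the effect of the weights is easy to inspect.

```python
import numpy as np

# Hypothetical toy matrix: rows are users, columns are movies, 0 = unrated.
toy_ratings = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
])

# Cosine similarity between user rows (assumes no all-zero row).
norms = np.linalg.norm(toy_ratings, axis=1, keepdims=True)
unit_rows = toy_ratings / norms
cosine_similarity = unit_rows @ unit_rows.T

# Same mean-centred weighted average as the notebook, with similarity weights.
mean_rating = toy_ratings.mean(axis=1)[:, np.newaxis]
ratings_diff = toy_ratings - mean_rating
weight_totals = np.abs(cosine_similarity).sum(axis=1)[:, np.newaxis]
toy_user_pred = mean_rating + cosine_similarity.dot(ratings_diff) / weight_totals

print(np.round(cosine_similarity, 3))
print(np.round(toy_user_pred, 3))
```

In this toy data, users 1 and 2 have similar tastes and receive a larger mutual weight than users 1 and 3.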

In [None]:
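# Item-based variant: compare movies (columns) instead of users (rows).
# As above, pairwise_distances returns cosine distance, not similarity.
movie_similarity = pairwise_distances(train_data_matrix.T, metric = 'cosine')
# Per-movie mean over all users; unrated entries (stored as 0) are included in the mean.
mean_movie_rating = train_data_matrix.mean(axis=0)[np.newaxis, :]
ratings_diff = (train_data_matrix - mean_movie_rating)
movie_pred = mean_movie_rating.T + movie_similarity.dot(ratings_diff.T) / np.array([np.abs(movie_similarity).sum(axis=1)]).T
# Transpose back to shape (n_users, n_movie) so it lines up with the rating matrices.
movie_pred = movie_pred.T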
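Since the objective is a movie-movie recommender, item similarities can also be used directly to list the movies closest to a given one. Here is a minimal sketch, assuming a small hypothetical matrix `toy` and a hypothetical helper `top_k_similar_movies`; neither appears in the notebook, and the cosine similarity is computed with NumPy.

```python
import numpy as np


def top_k_similar_movies(ratings_matrix, movie_id, k=2):
    """Return the 1-based IDs of the k movies most similar to movie_id."""
    item_vectors = ratings_matrix.T.astype(float)  # one row per movie
    norms = np.linalg.norm(item_vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for unrated movies
    unit_vectors = item_vectors / norms
    # Cosine similarity of every movie to the query movie.
    similarities = unit_vectors @ unit_vectors[movie_id - 1]
    similarities[movie_id - 1] = -np.inf  # exclude the query movie itself
    top_indices = np.argsort(similarities)[::-1][:k]
    return [int(i) + 1 for i in top_indices]


# Hypothetical toy matrix: rows are users, columns are movies, 0 = unrated.
toy = np.array([
    [5, 5, 0, 1],
    [4, 5, 0, 1],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
])

print(top_k_similar_movies(toy, movie_id=1, k=2))
```

For the notebook's data, the same helper could be called with `train_data_matrix`, provided movie IDs map to columns as `MovieID - 1`.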

### Evaluation
There are many evaluation metrics, but one of the most popular metrics for evaluating the accuracy of predicted ratings is *Root Mean Squared Error (RMSE)*.

Since you only want to consider predicted ratings that are in the test dataset, you filter out all other elements in the prediction matrix with `prediction[ground_truth.nonzero()]`.
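The effect of the `nonzero()` filter is easiest to see on a tiny example. The sketch below uses two hypothetical 2x3 matrices and plain NumPy; only the three positions with a rating in `ground_truth` contribute to the error.

```python
import numpy as np

# Hypothetical toy matrices; 0 in ground_truth means "no rating observed".
ground_truth = np.array([[4.0, 0.0, 0.0],
                         [0.0, 2.0, 5.0]])
prediction = np.array([[3.0, 1.0, 2.0],
                       [4.0, 2.0, 3.0]])

rated = ground_truth.nonzero()         # (row indices, column indices) of rated cells
kept_predictions = prediction[rated]   # predictions at the rated cells only
kept_truth = ground_truth[rated]       # the corresponding true ratings

# RMSE over the rated cells: errors are -1, 0 and -2.
toy_rmse = np.sqrt(np.mean((kept_predictions - kept_truth) ** 2))
print(kept_predictions, kept_truth, toy_rmse)
```

Indexing with the tuple returned by `nonzero()` already yields one-dimensional arrays, in row-major order.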

In [None]:
from math import sqrt

from sklearn.metrics import mean_squared_error


def rmse(prediction, ground_truth):
    """Return the RMSE over the entries that are rated (nonzero) in ground_truth."""
    # Keep only the positions that hold an observed rating.
    prediction = prediction[ground_truth.nonzero()].flatten()
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))

In [None]:
# RMSE on the training split (the ratings the predictions were built from).
print('User-based CF RMSE: ' + str(rmse(user_pred, train_data_matrix)))
print('Item-based CF RMSE: ' + str(rmse(movie_pred, train_data_matrix)))

User-based CF RMSE: 3.1278841019495647
Item-based CF RMSE: 3.1147572174339775


In [None]:
# RMSE on the held-out test split.
print('User-based CF RMSE: ' + str(rmse(user_pred, test_data_matrix)))
print('Item-based CF RMSE: ' + str(rmse(movie_pred, test_data_matrix)))

User-based CF RMSE: 3.1244708119683007
Item-based CF RMSE: 3.1145323795195026
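A simple reference point helps when reading these numbers. On a 5-star scale, a constant prediction of 3 is never off by more than 2.5 stars, so its RMSE cannot exceed 2.5. The sketch below computes the RMSE of an even simpler baseline that always predicts the mean of the observed training ratings. `baseline_rmse` and the toy matrices are hypothetical and only illustrate the comparison; the same function could be applied to `train_data_matrix` and `test_data_matrix`.

```python
import numpy as np


def baseline_rmse(train_matrix, test_matrix):
    """RMSE of always predicting the mean of the observed training ratings."""
    global_mean = train_matrix[train_matrix.nonzero()].mean()
    observed = test_matrix[test_matrix.nonzero()]
    return float(np.sqrt(np.mean((observed - global_mean) ** 2)))


# Hypothetical toy matrices; 0 = unrated.
toy_train = np.array([[5.0, 0.0, 3.0],
                      [0.0, 4.0, 0.0]])
toy_test = np.array([[0.0, 2.0, 0.0],
                     [4.0, 0.0, 0.0]])

# Training mean is 4; test ratings are 2 and 4, so the errors are -2 and 0.
print(baseline_rmse(toy_train, toy_test))
```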
