### <b>DESCRIPTION</b>
The dataset provided contains movie reviews given by Amazon customers. Reviews were given between May 1996 and July 2014.

##### <b>Data Dictionary</b>
- UserID – identifier of each of the 4848 customers who provided ratings
- Movie 1 to Movie 206 – one column per movie (206 movies), holding the ratings given by the 4848 distinct users

##### <b>Data Considerations</b>
- Not all users have watched every movie, so not every movie is rated by every user. These missing values are represented by NA.
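- The task description gives the rating scale as -1 to 10, where -1 is the lowest rating and 10 is the best; the ratings visible in the summary statistics in this notebook lie between 1 and 5.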

### <b>Analysis Task</b>
<b>Exploratory Data Analysis:</b>
- Which movies have maximum views/ratings?
- What is the average rating for each movie? Define the top 5 movies with the maximum ratings.
- Define the top 5 movies with the least audience.

<b>Recommendation Model:</b> Some of the movies hadn’t been watched and therefore, are not rated by the users. Netflix would like to take this as an opportunity and build a machine learning recommendation algorithm which provides the ratings for each of the users.
- Divide the data into training and test data
- Build a recommendation model on training data
- Make predictions on the test data


In [1]:
# Import required libraries
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
# load/read the "Amazon - Movies and TV Ratings" data
ratings = pd.read_csv('Amazon - Movies and TV Ratings.csv')

In [3]:
# View the first few records of the ratings data
ratings.head()

Unnamed: 0,user_id,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,Movie10,Movie11,Movie12,Movie13,Movie14,Movie15,Movie16,Movie17,Movie18,Movie19,Movie20,Movie21,Movie22,Movie23,Movie24,Movie25,Movie26,Movie27,Movie28,Movie29,Movie30,Movie31,Movie32,Movie33,Movie34,Movie35,Movie36,Movie37,Movie38,Movie39,...,Movie167,Movie168,Movie169,Movie170,Movie171,Movie172,Movie173,Movie174,Movie175,Movie176,Movie177,Movie178,Movie179,Movie180,Movie181,Movie182,Movie183,Movie184,Movie185,Movie186,Movie187,Movie188,Movie189,Movie190,Movie191,Movie192,Movie193,Movie194,Movie195,Movie196,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
0,A3R5OBKS7OM2IR,5.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,AH3QC2PC1VTGP,,,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,A3LKP6WPMP9UKX,,,,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,AVIY68KEPQ5ZD,,,,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,A1CV1WROP5KTTW,,,,,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [4]:
# Verify the number of records in the data
ratings.shape

(4848, 207)

In [5]:
# Keep a copy of the data
ratings_copy = ratings.copy()

### Task 1 - Which movies have maximum views/ratings?
- Build the solution step by step to get the required answer.

In [6]:
# Get all the statistical details of movies and transpose the data
ratings.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Movie1,1.0,5.000000,,5.0,5.00,5.0,5.0,5.0
Movie2,1.0,5.000000,,5.0,5.00,5.0,5.0,5.0
Movie3,1.0,2.000000,,2.0,2.00,2.0,2.0,2.0
Movie4,2.0,5.000000,0.000000,5.0,5.00,5.0,5.0,5.0
Movie5,29.0,4.103448,1.496301,1.0,4.00,5.0,5.0,5.0
...,...,...,...,...,...,...,...,...
Movie202,6.0,4.333333,1.632993,1.0,5.00,5.0,5.0,5.0
Movie203,1.0,3.000000,,3.0,3.00,3.0,3.0,3.0
Movie204,8.0,4.375000,1.407886,1.0,4.75,5.0,5.0,5.0
Movie205,35.0,4.628571,0.910259,1.0,5.00,5.0,5.0,5.0


In [7]:
# Get the number of views/ratings for the movies
ratings.describe().T['count']

Movie1       1.0
Movie2       1.0
Movie3       1.0
Movie4       2.0
Movie5      29.0
            ... 
Movie202     6.0
Movie203     1.0
Movie204     8.0
Movie205    35.0
Movie206    13.0
Name: count, Length: 206, dtype: float64

In [8]:
# Sort the count of views/ratings for the movies in descending order so that the max count is the 1st record.
ratings.describe().T['count'].sort_values(ascending=False)

Movie127    2313.0
Movie140     578.0
Movie16      320.0
Movie103     272.0
Movie29      243.0
             ...  
Movie54        1.0
Movie116       1.0
Movie115       1.0
Movie55        1.0
Movie1         1.0
Name: count, Length: 206, dtype: float64

In [9]:
# Movie with highest views
ratings.describe().T['count'].sort_values(ascending=False)[:1]

Movie127    2313.0
Name: count, dtype: float64

In [10]:
# Drop the user_id column, as it is not required to get the highest ratings
ratings.drop('user_id',axis=1)

Unnamed: 0,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,Movie10,Movie11,Movie12,Movie13,Movie14,Movie15,Movie16,Movie17,Movie18,Movie19,Movie20,Movie21,Movie22,Movie23,Movie24,Movie25,Movie26,Movie27,Movie28,Movie29,Movie30,Movie31,Movie32,Movie33,Movie34,Movie35,Movie36,Movie37,Movie38,Movie39,Movie40,...,Movie167,Movie168,Movie169,Movie170,Movie171,Movie172,Movie173,Movie174,Movie175,Movie176,Movie177,Movie178,Movie179,Movie180,Movie181,Movie182,Movie183,Movie184,Movie185,Movie186,Movie187,Movie188,Movie189,Movie190,Movie191,Movie192,Movie193,Movie194,Movie195,Movie196,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
0,5.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,,,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,,,,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,,,,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,,,,,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4843,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5.0
4844,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5.0
4845,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5.0
4846,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5.0


In [11]:
# Sum all the ratings given to a movie and sort the list for all the movies in descending order of sum of ratings
ratings.drop('user_id',axis=1).sum().sort_values(ascending=False)

Movie127    9511.0
Movie140    2794.0
Movie16     1446.0
Movie103    1241.0
Movie29     1168.0
             ...  
Movie154       1.0
Movie144       1.0
Movie69        1.0
Movie60        1.0
Movie67        1.0
Length: 206, dtype: float64

In [12]:
# Movie with the highest sum of ratings
ratings.drop('user_id',axis=1).sum().sort_values(ascending=False)[:1]

Movie127    9511.0
dtype: float64

From the above results it can be seen that:
- The movie with the most views (number of ratings) is "Movie127", with 2313 ratings.
- The movie with the highest total (sum) of ratings is also "Movie127", with a sum of 9511.0.
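
The two measures above answer slightly different questions: the count is the number of ratings a movie received, while the sum also depends on how high those ratings are. A minimal sketch on made-up data (the `toy` frame and its column names are hypothetical, not taken from the dataset) shows that `count()` gives the same per-movie counts as `describe().T['count']` without computing the other statistics:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in the same wide layout: one row per user, one column per movie.
toy = pd.DataFrame(
    {
        "user_id": ["u1", "u2", "u3", "u4"],
        "MovieA": [5.0, 4.0, np.nan, 3.0],
        "MovieB": [np.nan, 5.0, np.nan, np.nan],
        "MovieC": [2.0, np.nan, 1.0, np.nan],
    }
)

movie_cols = toy.drop(columns="user_id")

# count() skips NaN, so it returns the number of ratings per movie.
views = movie_cols.count().sort_values(ascending=False)

# sum() adds the ratings up, which mixes popularity with rating level.
rating_sum = movie_cols.sum().sort_values(ascending=False)

most_viewed = views.idxmax()
highest_sum = rating_sum.idxmax()

print(views)
print(rating_sum)
```

In this toy example both measures pick the same movie, as they do for Movie127 in the notebook, but that need not hold in general.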

### Task 2 - What is the average rating for each movie? Define the top 5 movies with the maximum ratings. 
- Here are the steps to get the result.
  - To get the average rating for each movie, take the mean of the ratings for each movie.
  - Sort the result of the above step in descending order.
  - Take the top 5 records to get the top 5 movies with the highest average ratings.

In [13]:
# To get the average rating for each movie, take the mean of the ratings for each movie.
# Dropping "user_id" as it is not required.
ratings.drop('user_id',axis=1).mean()

Movie1      5.000000
Movie2      5.000000
Movie3      2.000000
Movie4      5.000000
Movie5      4.103448
              ...   
Movie202    4.333333
Movie203    3.000000
Movie204    4.375000
Movie205    4.628571
Movie206    4.923077
Length: 206, dtype: float64

In [14]:
# Sort the data in the above step in descending order
ratings.drop('user_id',axis=1).mean().sort_values(ascending=False)

Movie1      5.0
Movie66     5.0
Movie76     5.0
Movie75     5.0
Movie74     5.0
           ... 
Movie58     1.0
Movie60     1.0
Movie154    1.0
Movie45     1.0
Movie144    1.0
Length: 206, dtype: float64

In [15]:
# Take the top 5 records to get the top 5 movies with the highest average ratings
ratings.drop('user_id',axis=1).mean().sort_values(ascending=False)[:5]

Movie1     5.0
Movie66    5.0
Movie76    5.0
Movie75    5.0
Movie74    5.0
dtype: float64

From the above results it can be seen that the following movies are the top-5 movies with maximum ratings.
- Movie1     with rating 5.0
- Movie66    with rating 5.0
- Movie76    with rating 5.0
- Movie75    with rating 5.0
- Movie74    with rating 5.0
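
Several movies share the maximum average of 5.0, and the summary statistics earlier in the notebook show that many movies have only a single rating. One possible refinement, sketched here on made-up data, is to require a minimum number of ratings before ranking by mean. The threshold `min_ratings` and the `toy` frame are assumptions for illustration, not part of the original analysis:

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: MovieB has a perfect mean but only one rating.
toy = pd.DataFrame(
    {
        "MovieA": [5.0, 4.0, 5.0, 4.0],
        "MovieB": [5.0, np.nan, np.nan, np.nan],
        "MovieC": [3.0, 4.0, np.nan, 5.0],
    }
)

# Mean rating and number of ratings per movie, side by side.
stats = pd.DataFrame({"mean": toy.mean(), "count": toy.count()})

# Assumed threshold: ignore movies with fewer than 3 ratings.
min_ratings = 3
reliable = stats[stats["count"] >= min_ratings]

# Rank by mean, breaking ties by the number of ratings.
top = reliable.sort_values(["mean", "count"], ascending=False)
print(top)
```

Without the threshold, `MovieB` would rank first on the strength of one rating; with it, the ranking only reflects movies that several users have rated.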

### Task 3 - Define the top 5 movies with the least audience
- Here are the steps to get the result.
  - Sort the data in ascending order based on the movie audience
  - Take the first 5 records to get the 5 movies with the least audience.

In [16]:
# Get the details of ratings
ratings_d = ratings.describe()
ratings_d

Unnamed: 0,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,Movie10,Movie11,Movie12,Movie13,Movie14,Movie15,Movie16,Movie17,Movie18,Movie19,Movie20,Movie21,Movie22,Movie23,Movie24,Movie25,Movie26,Movie27,Movie28,Movie29,Movie30,Movie31,Movie32,Movie33,Movie34,Movie35,Movie36,Movie37,Movie38,Movie39,Movie40,...,Movie167,Movie168,Movie169,Movie170,Movie171,Movie172,Movie173,Movie174,Movie175,Movie176,Movie177,Movie178,Movie179,Movie180,Movie181,Movie182,Movie183,Movie184,Movie185,Movie186,Movie187,Movie188,Movie189,Movie190,Movie191,Movie192,Movie193,Movie194,Movie195,Movie196,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
count,1.0,1.0,1.0,2.0,29.0,1.0,1.0,1.0,1.0,1.0,2.0,5.0,1.0,1.0,1.0,320.0,1.0,1.0,2.0,1.0,1.0,2.0,3.0,5.0,1.0,3.0,1.0,3.0,243.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,3.0,...,4.0,4.0,4.0,3.0,1.0,2.0,15.0,4.0,1.0,1.0,1.0,1.0,7.0,1.0,2.0,30.0,1.0,17.0,24.0,9.0,1.0,6.0,5.0,7.0,6.0,10.0,7.0,7.0,1.0,9.0,5.0,2.0,1.0,8.0,3.0,6.0,1.0,8.0,35.0,13.0
mean,5.0,5.0,2.0,5.0,4.103448,4.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,4.0,5.0,4.51875,3.0,5.0,3.5,3.0,5.0,5.0,5.0,4.4,5.0,3.0,5.0,3.333333,4.806584,4.5,5.0,4.5,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,5.0,5.0,5.0,4.333333,2.0,4.5,4.733333,4.75,5.0,5.0,5.0,5.0,4.714286,5.0,5.0,4.733333,5.0,4.823529,4.791667,5.0,5.0,5.0,5.0,4.714286,5.0,4.5,4.571429,4.714286,4.0,4.888889,3.8,5.0,5.0,4.625,4.333333,4.333333,3.0,4.375,4.628571,4.923077
std,,,,0.0,1.496301,,,,,,0.0,0.0,,,,0.795535,,,2.12132,,,0.0,0.0,1.341641,,1.732051,,1.527525,0.655269,0.707107,0.0,0.707107,,,,,,,0.0,0.0,...,0.0,0.0,0.0,1.154701,,0.707107,0.703732,0.5,,,,,0.755929,,0.0,0.639684,,0.528594,0.508977,0.0,,0.0,0.0,0.755929,0.0,0.707107,0.786796,0.48795,,0.333333,1.643168,0.0,,0.517549,1.154701,1.632993,,1.407886,0.910259,0.27735
min,5.0,5.0,2.0,5.0,1.0,4.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,4.0,5.0,1.0,3.0,5.0,2.0,3.0,5.0,5.0,5.0,2.0,5.0,1.0,5.0,2.0,1.0,4.0,5.0,4.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,5.0,5.0,5.0,3.0,2.0,4.0,3.0,4.0,5.0,5.0,5.0,5.0,3.0,5.0,5.0,2.0,5.0,3.0,3.0,5.0,5.0,5.0,5.0,3.0,5.0,3.0,3.0,4.0,4.0,4.0,1.0,5.0,5.0,4.0,3.0,1.0,3.0,1.0,1.0,4.0
25%,5.0,5.0,2.0,5.0,4.0,4.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,4.0,5.0,4.0,3.0,5.0,2.75,3.0,5.0,5.0,5.0,5.0,5.0,2.5,5.0,2.5,5.0,4.25,5.0,4.25,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,5.0,5.0,5.0,4.0,2.0,4.25,5.0,4.75,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,4.0,4.5,4.5,4.0,5.0,4.0,5.0,5.0,4.0,4.0,5.0,3.0,4.75,5.0,5.0
50%,5.0,5.0,2.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,4.0,5.0,5.0,3.0,5.0,3.5,3.0,5.0,5.0,5.0,5.0,5.0,4.0,5.0,3.0,5.0,4.5,5.0,4.5,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,5.0,5.0,5.0,5.0,2.0,4.5,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,4.0,5.0,4.0,5.0,5.0,5.0,5.0,5.0,3.0,5.0,5.0,5.0
75%,5.0,5.0,2.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,4.0,5.0,5.0,3.0,5.0,4.25,3.0,5.0,5.0,5.0,5.0,5.0,4.0,5.0,4.0,5.0,4.75,5.0,4.75,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,5.0,5.0,5.0,5.0,2.0,4.75,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,3.0,5.0,5.0,5.0
max,5.0,5.0,2.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,4.0,5.0,5.0,3.0,5.0,5.0,3.0,5.0,5.0,5.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,5.0,5.0,5.0,5.0,2.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,3.0,5.0,5.0,5.0


In [17]:
# Get the movie audience
ratings_aud = ratings_d[0:1]
ratings_aud

Unnamed: 0,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,Movie10,Movie11,Movie12,Movie13,Movie14,Movie15,Movie16,Movie17,Movie18,Movie19,Movie20,Movie21,Movie22,Movie23,Movie24,Movie25,Movie26,Movie27,Movie28,Movie29,Movie30,Movie31,Movie32,Movie33,Movie34,Movie35,Movie36,Movie37,Movie38,Movie39,Movie40,...,Movie167,Movie168,Movie169,Movie170,Movie171,Movie172,Movie173,Movie174,Movie175,Movie176,Movie177,Movie178,Movie179,Movie180,Movie181,Movie182,Movie183,Movie184,Movie185,Movie186,Movie187,Movie188,Movie189,Movie190,Movie191,Movie192,Movie193,Movie194,Movie195,Movie196,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
count,1.0,1.0,1.0,2.0,29.0,1.0,1.0,1.0,1.0,1.0,2.0,5.0,1.0,1.0,1.0,320.0,1.0,1.0,2.0,1.0,1.0,2.0,3.0,5.0,1.0,3.0,1.0,3.0,243.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,3.0,...,4.0,4.0,4.0,3.0,1.0,2.0,15.0,4.0,1.0,1.0,1.0,1.0,7.0,1.0,2.0,30.0,1.0,17.0,24.0,9.0,1.0,6.0,5.0,7.0,6.0,10.0,7.0,7.0,1.0,9.0,5.0,2.0,1.0,8.0,3.0,6.0,1.0,8.0,35.0,13.0


In [18]:
# Sort the data in ascending order based on the movie audience
ratings_aud.T['count'].sort_values(ascending=True)

Movie1         1.0
Movie71        1.0
Movie145       1.0
Movie69        1.0
Movie68        1.0
             ...  
Movie29      243.0
Movie103     272.0
Movie16      320.0
Movie140     578.0
Movie127    2313.0
Name: count, Length: 206, dtype: float64

In [19]:
# Take the first 5 records to get the 5 movies with the least audience
ratings_aud.T['count'].sort_values(ascending=True)[:5]

Movie1      1.0
Movie71     1.0
Movie145    1.0
Movie69     1.0
Movie68     1.0
Name: count, dtype: float64

From the above results it can be seen that the following movies are the top-5 movies with the least audience.
- Movie1      1.0
- Movie71     1.0
- Movie145    1.0
- Movie69     1.0
- Movie68     1.0
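
All five movies listed have an audience of 1, and the sorted output shows other movies with the same count, so which five appear depends on how ties are ordered. A small sketch on a hypothetical `counts` series shows how `Series.nsmallest` with `keep="all"` returns every movie tied at the cut-off instead of an arbitrary subset:

```python
import pandas as pd

# Hypothetical audience counts per movie.
counts = pd.Series(
    {"MovieA": 1, "MovieB": 7, "MovieC": 1, "MovieD": 1, "MovieE": 2}
)

# Default behaviour: exactly n rows, ties resolved by order of appearance.
least_two = counts.nsmallest(2)

# keep="all": include every movie tied with the last selected value.
least_two_all = counts.nsmallest(2, keep="all")

# Number of movies sharing the minimum audience.
n_single = int((counts == counts.min()).sum())

print(least_two)
print(least_two_all)
print(n_single)
```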

### Task 4 - Recommendation Model

In [20]:
pip install scikit-surprise



In [21]:
import surprise
from surprise import Reader
from surprise import accuracy
from surprise import Dataset
from surprise import SVD
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split

In [22]:
# View the data set
ratings.head()

Unnamed: 0,user_id,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,Movie10,Movie11,Movie12,Movie13,Movie14,Movie15,Movie16,Movie17,Movie18,Movie19,Movie20,Movie21,Movie22,Movie23,Movie24,Movie25,Movie26,Movie27,Movie28,Movie29,Movie30,Movie31,Movie32,Movie33,Movie34,Movie35,Movie36,Movie37,Movie38,Movie39,...,Movie167,Movie168,Movie169,Movie170,Movie171,Movie172,Movie173,Movie174,Movie175,Movie176,Movie177,Movie178,Movie179,Movie180,Movie181,Movie182,Movie183,Movie184,Movie185,Movie186,Movie187,Movie188,Movie189,Movie190,Movie191,Movie192,Movie193,Movie194,Movie195,Movie196,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
0,A3R5OBKS7OM2IR,5.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,AH3QC2PC1VTGP,,,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,A3LKP6WPMP9UKX,,,,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,AVIY68KEPQ5ZD,,,,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,A1CV1WROP5KTTW,,,,,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [23]:
# Melt the data from wide to long format: the movie columns become a single "Movies" column with one row per (user, movie) pair
ratings_melt = ratings.melt(id_vars = ratings.columns[0],value_vars=ratings.columns[1:],var_name="Movies",value_name="Rating")
ratings_melt

Unnamed: 0,user_id,Movies,Rating
0,A3R5OBKS7OM2IR,Movie1,5.0
1,AH3QC2PC1VTGP,Movie1,
2,A3LKP6WPMP9UKX,Movie1,
3,AVIY68KEPQ5ZD,Movie1,
4,A1CV1WROP5KTTW,Movie1,
...,...,...,...
998683,A1IMQ9WMFYKWH5,Movie206,5.0
998684,A1KLIKPUF5E88I,Movie206,5.0
998685,A5HG6WFZLO10D,Movie206,5.0
998686,A3UU690TWXCG1X,Movie206,5.0
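
The melted frame has one row for every (user, movie) pair, including the pairs with no rating. A minimal sketch on a hypothetical `wide` frame shows the same reshaping and, as an optional extra step that the notebook does not take, how `dropna` would keep only the observed ratings:

```python
import numpy as np
import pandas as pd

# Hypothetical wide data: 3 users, 2 movies, mostly missing.
wide = pd.DataFrame(
    {
        "user_id": ["u1", "u2", "u3"],
        "MovieA": [5.0, np.nan, 4.0],
        "MovieB": [np.nan, 2.0, np.nan],
    }
)

# Wide to long: every remaining column becomes a (Movies, Rating) pair.
long_df = wide.melt(id_vars="user_id", var_name="Movies", value_name="Rating")

# Optional step (an assumption, not in the notebook): keep observed ratings only.
observed = long_df.dropna(subset=["Rating"]).reset_index(drop=True)

print(long_df)
print(observed)
```

Here `long_df` has 3 x 2 = 6 rows, of which 3 carry a rating.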


In [24]:
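# Prepare the data set for Surprise.
# Note: fillna(0) turns every unrated (user, movie) pair into an explicit rating of 0,
# while Reader() defaults to a 1-5 rating scale; these filled rows likely dominate the error metrics below.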
rd = Reader()
data = Dataset.load_from_df(ratings_melt.fillna(0),reader=rd)
data

<surprise.dataset.DatasetAutoFolds at 0x7f0957e8e450>

In [25]:
# Split the data set into train data set & test data set
# 25% data set as test data set
train, test = train_test_split(data,test_size=0.25)
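
What a 75/25 random split does can be sketched in plain Python on hypothetical `(user, movie, rating)` triples. This is an illustration of the idea, not Surprise's implementation; the seed value and the toy triples are assumptions. In Surprise itself, `train_test_split` accepts a `random_state` argument if a reproducible split is wanted.

```python
import random

# Hypothetical (user, movie, rating) triples: 4 users x 5 movies = 20 rows.
triples = [
    ("u%d" % u, "Movie%d" % m, float((u + m) % 5 + 1))
    for u in range(4)
    for m in range(5)
]

# Shuffle a copy with a fixed seed so the split is reproducible.
rng = random.Random(42)
shuffled = triples[:]
rng.shuffle(shuffled)

# Hold out 25% of the rows as the test part.
n_test = int(round(0.25 * len(shuffled)))
test_part = shuffled[:n_test]
train_part = shuffled[n_test:]

print(len(train_part), len(test_part))
```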

In [26]:
# Create the model and fit it on the training data set

# Using SVD (Singular Value Decomposition)
svd = SVD()

# Fit the training data set
svd.fit(train)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f0957e8f510>

In [27]:
# Test the model with test data set and get the predictions
predictions = svd.test(test)

In [28]:
# RMSE values for the model predictions
accuracy.rmse(predictions)

RMSE: 1.0261


1.026126176610573

In [29]:
# MAE values for the model predictions
accuracy.mae(predictions)

MAE:  1.0120


1.0120486430862206
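
For reference, both metrics can be computed by hand from true ratings and estimates: RMSE is the square root of the mean squared error, and MAE is the mean absolute error. The sketch below uses made-up numbers (`actual` and `predicted` are hypothetical and not taken from the model above):

```python
import math

# Hypothetical true ratings and model estimates.
actual = [5.0, 4.0, 1.0, 3.0]
predicted = [4.5, 4.0, 2.0, 2.5]

# Per-row prediction errors.
errors = [p - a for p, a in zip(predicted, actual)]

# RMSE: square root of the mean of squared errors.
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

# MAE: mean of absolute errors.
mae = sum(abs(e) for e in errors) / len(errors)

print(round(rmse, 4), round(mae, 4))
```

RMSE is never smaller than MAE, and the two are equal only when every error has the same magnitude, so an RMSE and MAE that nearly coincide suggest very uniform errors.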

In [30]:
# Conduct cross validation with cv = 3
cross_validate(svd, data, measures = ['RMSE', 'MAE'], cv = 3, verbose = True)

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    1.0249  1.0274  1.0260  1.0261  0.0010  
MAE (testset)     1.0115  1.0126  1.0120  1.0120  0.0004  
Fit time          43.96   44.44   44.30   44.23   0.20    
Test time         5.45    4.48    4.06    4.66    0.58    


{'fit_time': (43.96275091171265, 44.4448676109314, 44.29594016075134),
 'test_mae': array([1.01146283, 1.01256059, 1.01200935]),
 'test_rmse': array([1.0249149 , 1.02742677, 1.02596647]),
 'test_time': (5.450758218765259, 4.483715057373047, 4.0591630935668945)}

In [31]:
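def repeat(ml_type, dframe):
    """Cross-validate `ml_type` on `dframe`, then predict one known rating."""
    rd = Reader()
    data = Dataset.load_from_df(dframe, reader=rd)
    # 3-fold cross-validation; prints the per-fold table and then the raw result dict
    print(cross_validate(ml_type, data, measures = ['RMSE', 'MAE'], cv = 3, verbose = True))
    print("--"*15)
    # Sample prediction for a (user, movie) pair whose true rating is known to be 5.0
    usr_id = 'A3R5OBKS7OM2IR'
    mv = 'Movie1'
    r_u = 5.0
    # verbose=True already prints the prediction, so wrapping it in print() shows it twice
    print(ml_type.predict(usr_id,mv,r_ui = r_u,verbose=True))
    print("--"*15)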

In [32]:
repeat(SVD(),ratings_melt.fillna(ratings_melt['Rating'].mean()))

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.0865  0.0870  0.0838  0.0858  0.0014  
MAE (testset)     0.0099  0.0097  0.0094  0.0096  0.0002  
Fit time          43.39   43.77   43.94   43.70   0.23    
Test time         4.92    4.18    4.33    4.48    0.32    
{'test_rmse': array([0.08653794, 0.08703266, 0.08382585]), 'test_mae': array([0.00987164, 0.00965566, 0.00936293]), 'fit_time': (43.39470148086548, 43.76898765563965, 43.943806648254395), 'test_time': (4.922894477844238, 4.176424980163574, 4.327325820922852)}
------------------------------
user: A3R5OBKS7OM2IR item: Movie1     r_ui = 5.00   est = 4.39   {'was_impossible': False}
user: A3R5OBKS7OM2IR item: Movie1     r_ui = 5.00   est = 4.39   {'was_impossible': False}
------------------------------
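
The much lower RMSE and MAE in this run should be read with care. One plausible reading, offered as an assumption and not something the output proves, is that after filling with the mean most rows carry the mean itself as their target, so a model that predicts close to the mean is scored mostly on imputed cells. The sketch below uses made-up numbers (`observed` and `n_missing` are hypothetical) to show the effect for a constant prediction:

```python
import math

# Hypothetical observed ratings and their mean.
observed = [5.0, 1.0, 4.0, 2.0]
mean_rating = sum(observed) / len(observed)

# Assumed number of missing cells that get filled with the mean.
n_missing = 396
filled = observed + [mean_rating] * n_missing


def rmse_constant(values, prediction):
    """RMSE of predicting the same value for every row."""
    return math.sqrt(sum((v - prediction) ** 2 for v in values) / len(values))


# Error measured on real ratings only vs. on the mean-filled data.
rmse_observed = rmse_constant(observed, mean_rating)
rmse_filled = rmse_constant(filled, mean_rating)

print(round(rmse_observed, 4), round(rmse_filled, 4))
```

In this toy case the same predictor looks ten times better on the filled data, purely because 99% of the rows were filled with the value it predicts.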


In [33]:
# Import GridSearchCV to search for good SVD hyperparameter values (n_epochs, lr_all, n_factors)
from surprise.model_selection import GridSearchCV

In [34]:
# Define parameters for grid search
param_grid = {'n_epochs':[20,30],
             'lr_all':[0.005,0.001],
             'n_factors':[50,100]}

In [35]:
# Grid search and fit data
gs = GridSearchCV(SVD, param_grid, measures=['rmse','mae'], cv=3)
data1 = Dataset.load_from_df(ratings_melt.fillna(ratings_melt['Rating'].mean()),reader=rd)
gs.fit(data1)

In [36]:
# get the grid search best score
gs.best_score

{'mae': 0.008949384645529676, 'rmse': 0.08482482459249015}

In [37]:
# print the best score & params for RMSE
print(gs.best_score["rmse"])
print(gs.best_params["rmse"])


0.08482482459249015
{'n_epochs': 30, 'lr_all': 0.001, 'n_factors': 50}
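
A grid search tries every combination of the listed values. The sketch below enumerates the same `param_grid` in plain Python to show how many candidates that is; it illustrates the idea and is not Surprise's internal code. The fit count assumes one fit per combination per fold and no extra refit.

```python
from itertools import product

# Same grid as in the notebook.
param_grid = {
    "n_epochs": [20, 30],
    "lr_all": [0.005, 0.001],
    "n_factors": [50, 100],
}

# Cartesian product of all value lists, one dict per candidate.
names = list(param_grid)
combos = [dict(zip(names, values)) for values in product(*param_grid.values())]

# With cv=3, each candidate is fitted once per fold (assumption: no refit).
n_fits = len(combos) * 3

print(len(combos), n_fits)
```

With fit times of roughly 44 seconds per fold reported earlier in the notebook, this many fits explains why the grid search cell is slow.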
