# Building user-based recommendation model

### Description

The dataset provided contains movie reviews given by Amazon customers. Reviews were given between May 1996 and July 2014.

### Data Dictionary
- user_id – identifiers of the 4848 customers who provided ratings
- Movie1 to Movie206 – the 206 movies rated by those 4848 distinct users

### Data Considerations
- Not every user has watched every movie, so many user/movie pairs have no rating. These missing values are represented by NA.
- Ratings are on a scale of -1 to 10, where -1 is the lowest rating and 10 is the highest.

### Analysis Task
- Exploratory Data Analysis:

    - Which movies have maximum views/ratings?
    - What is the average rating for each movie? Define the top 5 movies with the maximum ratings.
    - Define the top 5 movies with the least audience.
- Recommendation Model: Some of the movies hadn’t been watched and therefore, are not rated by the users. Netflix would like to take this as an opportunity and build a machine learning recommendation algorithm which provides the ratings for each of the users.

    - Divide the data into training and test data
    - Build a recommendation model on training data
    - Make predictions on the test data

In [1]:
# import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# import dataset
df = pd.read_csv('Amazon - Movies and TV Ratings.csv')

In [3]:
# Look at the data set
df.head()

Unnamed: 0,user_id,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,...,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
0,A3R5OBKS7OM2IR,5.0,5.0,,,,,,,,...,,,,,,,,,,
1,AH3QC2PC1VTGP,,,2.0,,,,,,,...,,,,,,,,,,
2,A3LKP6WPMP9UKX,,,,5.0,,,,,,...,,,,,,,,,,
3,AVIY68KEPQ5ZD,,,,5.0,,,,,,...,,,,,,,,,,
4,A1CV1WROP5KTTW,,,,,5.0,,,,,...,,,,,,,,,,


In [4]:
# Look at the shape
df.shape

(4848, 207)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4848 entries, 0 to 4847
Columns: 207 entries, user_id to Movie206
dtypes: float64(206), object(1)
memory usage: 7.7+ MB


In [6]:
df.isna().sum()

user_id        0
Movie1      4847
Movie2      4847
Movie3      4847
Movie4      4846
            ... 
Movie202    4842
Movie203    4847
Movie204    4840
Movie205    4813
Movie206    4835
Length: 207, dtype: int64

In [7]:
df.columns

Index(['user_id', 'Movie1', 'Movie2', 'Movie3', 'Movie4', 'Movie5', 'Movie6',
       'Movie7', 'Movie8', 'Movie9',
       ...
       'Movie197', 'Movie198', 'Movie199', 'Movie200', 'Movie201', 'Movie202',
       'Movie203', 'Movie204', 'Movie205', 'Movie206'],
      dtype='object', length=207)

In [8]:
# Describe the dataset
df.describe()

Unnamed: 0,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,Movie10,...,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
count,1.0,1.0,1.0,2.0,29.0,1.0,1.0,1.0,1.0,1.0,...,5.0,2.0,1.0,8.0,3.0,6.0,1.0,8.0,35.0,13.0
mean,5.0,5.0,2.0,5.0,4.103448,4.0,5.0,5.0,5.0,5.0,...,3.8,5.0,5.0,4.625,4.333333,4.333333,3.0,4.375,4.628571,4.923077
std,,,,0.0,1.496301,,,,,,...,1.643168,0.0,,0.517549,1.154701,1.632993,,1.407886,0.910259,0.27735
min,5.0,5.0,2.0,5.0,1.0,4.0,5.0,5.0,5.0,5.0,...,1.0,5.0,5.0,4.0,3.0,1.0,3.0,1.0,1.0,4.0
25%,5.0,5.0,2.0,5.0,4.0,4.0,5.0,5.0,5.0,5.0,...,4.0,5.0,5.0,4.0,4.0,5.0,3.0,4.75,5.0,5.0
50%,5.0,5.0,2.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,...,4.0,5.0,5.0,5.0,5.0,5.0,3.0,5.0,5.0,5.0
75%,5.0,5.0,2.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,...,5.0,5.0,5.0,5.0,5.0,5.0,3.0,5.0,5.0,5.0
max,5.0,5.0,2.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,...,5.0,5.0,5.0,5.0,5.0,5.0,3.0,5.0,5.0,5.0


### Exploratory Data Analysis:

Which movies have maximum views/ratings?

In [9]:
# Transpose the matrix
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Movie1,1.0,5.000000,,5.0,5.00,5.0,5.0,5.0
Movie2,1.0,5.000000,,5.0,5.00,5.0,5.0,5.0
Movie3,1.0,2.000000,,2.0,2.00,2.0,2.0,2.0
Movie4,2.0,5.000000,0.000000,5.0,5.00,5.0,5.0,5.0
Movie5,29.0,4.103448,1.496301,1.0,4.00,5.0,5.0,5.0
...,...,...,...,...,...,...,...,...
Movie202,6.0,4.333333,1.632993,1.0,5.00,5.0,5.0,5.0
Movie203,1.0,3.000000,,3.0,3.00,3.0,3.0,3.0
Movie204,8.0,4.375000,1.407886,1.0,4.75,5.0,5.0,5.0
Movie205,35.0,4.628571,0.910259,1.0,5.00,5.0,5.0,5.0


In [10]:
df.describe().T['count'].sort_values(ascending = False)

Movie127    2313.0
Movie140     578.0
Movie16      320.0
Movie103     272.0
Movie29      243.0
             ...  
Movie68        1.0
Movie69        1.0
Movie145       1.0
Movie71        1.0
Movie1         1.0
Name: count, Length: 206, dtype: float64

#### These are the five movies with the most views/ratings

In [11]:
# Look at the five most-rated movies
df.describe().T['count'].sort_values(ascending = False).head()

Movie127    2313.0
Movie140     578.0
Movie16      320.0
Movie103     272.0
Movie29      243.0
Name: count, dtype: float64
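
The same ranking can be obtained without `describe()` by counting the non-missing ratings in each column. A minimal sketch on a toy frame (the `toy` data below is made up for illustration and is not part of the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings frame with the same layout as df (NaN = not rated)
toy = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u3', 'u4'],
    'Movie1': [5.0, np.nan, np.nan, np.nan],
    'Movie2': [4.0, 3.0, np.nan, 5.0],
    'Movie3': [np.nan, 2.0, 1.0, np.nan],
})

# count() ignores NaN, so this is the number of ratings each movie received
rating_counts = toy.drop(columns='user_id').count()

# Movies with the most ratings first
most_rated = rating_counts.nlargest(2)
print(most_rated)
```

Counting directly avoids computing the full set of summary statistics when only the number of ratings is needed.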

What is the average rating for each movie? Define the top 5 movies with the maximum ratings.

In [12]:
# Fill NA with zero rating
df_filtered = df.fillna(0.0)

In [13]:
df_filtered.head(10)

Unnamed: 0,user_id,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,...,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
0,A3R5OBKS7OM2IR,5.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,AH3QC2PC1VTGP,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,A3LKP6WPMP9UKX,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,AVIY68KEPQ5ZD,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,A1CV1WROP5KTTW,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,AP57WZ2X4G0AA,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,A3NMBJ2LCRCATT,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,A5Y15SAOMX6XA,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,A3P671HJ32TCSF,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,A3VCKTRD24BG7K,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
df_filtered.describe()

Unnamed: 0,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,Movie10,...,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
count,4848.0,4848.0,4848.0,4848.0,4848.0,4848.0,4848.0,4848.0,4848.0,4848.0,...,4848.0,4848.0,4848.0,4848.0,4848.0,4848.0,4848.0,4848.0,4848.0,4848.0
mean,0.001031,0.001031,0.000413,0.002063,0.024546,0.000825,0.001031,0.001031,0.001031,0.001031,...,0.003919,0.002063,0.001031,0.007632,0.002682,0.005363,0.000619,0.007219,0.033416,0.013201
std,0.071811,0.071811,0.028724,0.101545,0.336268,0.057448,0.071811,0.071811,0.071811,0.071811,...,0.1308,0.101545,0.071811,0.188769,0.110296,0.161142,0.043086,0.185478,0.399243,0.254991
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,5.0,5.0,2.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,...,5.0,5.0,5.0,5.0,5.0,5.0,3.0,5.0,5.0,5.0


In [15]:
# drop user id
df_filtered.drop(['user_id'], inplace=True, axis =1)

In [16]:
# Look at the head
df_filtered.head()

Unnamed: 0,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,Movie10,...,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
0,5.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [17]:
# describe dataframe
df_filtered.describe()

Unnamed: 0,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,Movie10,...,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
count,4848.0,4848.0,4848.0,4848.0,4848.0,4848.0,4848.0,4848.0,4848.0,4848.0,...,4848.0,4848.0,4848.0,4848.0,4848.0,4848.0,4848.0,4848.0,4848.0,4848.0
mean,0.001031,0.001031,0.000413,0.002063,0.024546,0.000825,0.001031,0.001031,0.001031,0.001031,...,0.003919,0.002063,0.001031,0.007632,0.002682,0.005363,0.000619,0.007219,0.033416,0.013201
std,0.071811,0.071811,0.028724,0.101545,0.336268,0.057448,0.071811,0.071811,0.071811,0.071811,...,0.1308,0.101545,0.071811,0.188769,0.110296,0.161142,0.043086,0.185478,0.399243,0.254991
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,5.0,5.0,2.0,5.0,5.0,4.0,5.0,5.0,5.0,5.0,...,5.0,5.0,5.0,5.0,5.0,5.0,3.0,5.0,5.0,5.0


In [18]:
df_max = df_filtered.sum()

In [19]:
# finding max sum of ratings
max(df_max)

9511.0

In [20]:
df_max.head()

Movie1      5.0
Movie2      5.0
Movie3      2.0
Movie4     10.0
Movie5    119.0
dtype: float64

In [21]:
df_max.tail()

Movie202     26.0
Movie203      3.0
Movie204     35.0
Movie205    162.0
Movie206     64.0
dtype: float64

In [22]:
# finding the 0-based position of the movie with the largest rating total (position 126 is Movie127)
movie_loc = df_max.argmax()
movie_loc

126

In [23]:
# check the sum of ratings at argmax location
df_max.iloc[movie_loc]

9511.0

In [24]:
# Mean of the per-movie rating totals (total rating summed over users, averaged across the 206 movies)
average_rating_for_each_movie = sum(df_max) / len(df_max.index)
average_rating_for_each_movie

106.44660194174757

In [25]:
df_max.mean()

106.44660194174757
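
Note that the value above is the mean of the per-movie rating totals. If the average rating of each individual movie is wanted, one option is to average each column of the original frame while skipping missing values, rather than averaging zero-filled columns. A sketch on made-up data (`toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: NaN means the user did not rate the movie
toy = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u3'],
    'Movie1': [5.0, np.nan, 3.0],
    'Movie2': [np.nan, 2.0, np.nan],
})
ratings = toy.drop(columns='user_id')

# mean() skips NaN by default, so each movie is averaged over its raters only
avg_rating = ratings.mean()

# For comparison: after zero-filling, unrated pairs drag the average down
zero_filled_avg = ratings.fillna(0.0).mean()

print(avg_rating)
print(zero_filled_avg)
```

On the real data, `df.drop(columns='user_id').mean().nlargest(5)` would then list the five movies with the highest average rating, with the caveat that many movies have a single rating.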

Define the top 5 movies with the least audience.

In [26]:
# Create a dataframe out of this
df_max_dataframe = pd.DataFrame(df_max, columns=['rating'])

In [27]:
df_max_dataframe.head()

Unnamed: 0,rating
Movie1,5.0
Movie2,5.0
Movie3,2.0
Movie4,10.0
Movie5,119.0


In [28]:
# top 5 movies by total rating (sum of all ratings received)
df_max_dataframe.nlargest(5, 'rating')

Unnamed: 0,rating
Movie127,9511.0
Movie140,2794.0
Movie16,1446.0
Movie103,1241.0
Movie29,1168.0


In [29]:
# 5 movies with the smallest rating totals (used here as a proxy for least audience)
df_max_dataframe.nsmallest(5, 'rating')

Unnamed: 0,rating
Movie45,1.0
Movie58,1.0
Movie60,1.0
Movie67,1.0
Movie69,1.0
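
The ranking above uses the sum of ratings as a stand-in for audience size. The two can disagree: a movie with several low ratings may have a smaller total than a movie with a single high rating. A sketch with hypothetical data that counts ratings directly:

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings (NaN = not rated)
toy = pd.DataFrame({
    'MovieA': [5.0, np.nan, np.nan, np.nan],  # one rating, total 5
    'MovieB': [1.0, 1.0, 1.0, np.nan],        # three ratings, total 3
    'MovieC': [4.0, 5.0, np.nan, np.nan],     # two ratings, total 9
})

totals = toy.sum()    # sum of ratings per movie (NaN skipped)
counts = toy.count()  # number of ratings per movie

smallest_total = totals.nsmallest(1).index[0]
smallest_audience = counts.nsmallest(1).index[0]
print(smallest_total, smallest_audience)
```

Here the smallest total belongs to `MovieB`, while the smallest audience belongs to `MovieA`.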


### Recommendation Model: 
Some of the movies hadn’t been watched and therefore, are not rated by the users. Netflix would like to take this as an opportunity and build a machine learning recommendation algorithm which provides the ratings for each of the users. 

In [30]:
df_melt = df.melt(id_vars=df.columns[0], value_vars=df.columns[1:], var_name='Movie', value_name='rating')

In [31]:
df_melt.shape

(998688, 3)

In [32]:
df_melt

Unnamed: 0,user_id,Movie,rating
0,A3R5OBKS7OM2IR,Movie1,5.0
1,AH3QC2PC1VTGP,Movie1,
2,A3LKP6WPMP9UKX,Movie1,
3,AVIY68KEPQ5ZD,Movie1,
4,A1CV1WROP5KTTW,Movie1,
...,...,...,...
998683,A1IMQ9WMFYKWH5,Movie206,5.0
998684,A1KLIKPUF5E88I,Movie206,5.0
998685,A5HG6WFZLO10D,Movie206,5.0
998686,A3UU690TWXCG1X,Movie206,5.0


In [33]:
df_melt.fillna(0.0, inplace=True)
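
Filling with 0.0 treats "not rated" as a rating of 0 in everything that follows. An alternative worth considering, sketched here on toy data (the `toy` frame is hypothetical), is to keep only the observed ratings in the long format:

```python
import numpy as np
import pandas as pd

# Hypothetical wide frame in the same layout as df
toy = pd.DataFrame({
    'user_id': ['u1', 'u2'],
    'Movie1': [5.0, np.nan],
    'Movie2': [np.nan, 3.0],
})

# One row per (user, movie) pair, including unrated pairs
long_all = toy.melt(id_vars='user_id', var_name='Movie', value_name='rating')

# Keep only the pairs that actually carry a rating
long_observed = long_all.dropna(subset=['rating']).reset_index(drop=True)
print(long_observed)
```

With this layout, a model is trained and evaluated only on ratings that users actually gave.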

## Approach 1

In [34]:
n_users = df_melt.user_id.unique().shape[0]
n_users

4848

In [35]:
n_movies = df_melt.Movie.unique().shape[0]
n_movies

206

In [36]:
# Map each movie name to a 0-based integer id, in order of first appearance
movie_list = df_melt.Movie.unique()
df_melt['movie_order'] = pd.factorize(df_melt['Movie'])[0]

In [37]:
df_melt.head()

Unnamed: 0,user_id,Movie,rating,movie_order
0,A3R5OBKS7OM2IR,Movie1,5.0,0
1,AH3QC2PC1VTGP,Movie1,0.0,0
2,A3LKP6WPMP9UKX,Movie1,0.0,0
3,AVIY68KEPQ5ZD,Movie1,0.0,0
4,A1CV1WROP5KTTW,Movie1,0.0,0


In [38]:
# Map each user id to a 0-based integer id, in order of first appearance
user_list = df_melt.user_id.unique()
df_melt['user_id_order'] = pd.factorize(df_melt['user_id'])[0]

In [39]:
df_melt.head()

Unnamed: 0,user_id,Movie,rating,movie_order,user_id_order
0,A3R5OBKS7OM2IR,Movie1,5.0,0,0
1,AH3QC2PC1VTGP,Movie1,0.0,0,1
2,A3LKP6WPMP9UKX,Movie1,0.0,0,2
3,AVIY68KEPQ5ZD,Movie1,0.0,0,3
4,A1CV1WROP5KTTW,Movie1,0.0,0,4


In [40]:
# Re-index columns to build matrix later on.
new_col_order = ['user_id_order', 'movie_order', 'rating', 'user_id', 'Movie']
df_melt = df_melt.reindex(columns= new_col_order)
df_melt.head()

Unnamed: 0,user_id_order,movie_order,rating,user_id,Movie
0,0,0,5.0,A3R5OBKS7OM2IR,Movie1
1,1,0,0.0,AH3QC2PC1VTGP,Movie1
2,2,0,0.0,A3LKP6WPMP9UKX,Movie1
3,3,0,0.0,AVIY68KEPQ5ZD,Movie1
4,4,0,0.0,A1CV1WROP5KTTW,Movie1


In [41]:
df_melt.shape

(998688, 5)

In [42]:
# Split the data into training and testing dataset
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(df_melt, test_size = 0.3)

In [43]:
train_data.shape

(699081, 5)

In [44]:
test_data.shape

(299607, 5)
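
The split above is random, so the shapes are stable but the selected rows, and every number computed from them, change between runs. scikit-learn's `train_test_split` accepts a `random_state` argument for this purpose; a pandas-only sketch of the same idea on a made-up frame:

```python
import pandas as pd

# Hypothetical long-format frame with 10 rows
toy = pd.DataFrame({
    'user_id_order': range(10),
    'rating': [0.0] * 9 + [5.0],
})

# Fixing random_state makes the 70/30 split repeatable
test_part = toy.sample(frac=0.3, random_state=42)
train_part = toy.drop(test_part.index)

print(train_part.shape, test_part.shape)
```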

In [45]:
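# Build a dense user x movie matrix from the training rows.
# Note: user_id_order and movie_order are already 0-based, so the "- 1" shifts
# every row and column by one position (index -1 wraps around to the last one).
# The test matrix uses the same shift, so the two matrices stay aligned.
train_data_matrix = np.zeros((n_users, n_movies))
for line in train_data.itertuples():
    train_data_matrix[line[1] - 1, line[2] - 1] = line[3]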

train_data_matrix

array([[0., 2., 0., ..., 0., 0., 0.],
       [0., 0., 5., ..., 0., 0., 0.],
       [0., 0., 5., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 5., 0.],
       [0., 0., 0., ..., 0., 0., 5.]])

In [46]:
# Same layout and index shift as the training matrix, filled from the test rows
test_data_matrix = np.zeros((n_users, n_movies))
for line in test_data.itertuples():
    test_data_matrix[line[1] - 1, line[2] - 1] = line[3]

test_data_matrix

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 5., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [5., 0., 0., ..., 0., 0., 0.]])
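
The row-by-row loops can also be written as a single NumPy assignment that uses the integer id columns as indices. A sketch assuming 0-based ids used directly as positions, with toy values (all names ending in `_toy` and the `toy` frame are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical long-format ratings with 0-based integer ids
toy = pd.DataFrame({
    'user_id_order': [0, 1, 2, 2],
    'movie_order':   [0, 2, 1, 2],
    'rating':        [5.0, 2.0, 4.0, 3.0],
})
n_users_toy, n_movies_toy = 3, 3

# Fancy indexing writes every (user, movie) rating in one step
matrix = np.zeros((n_users_toy, n_movies_toy))
rows = toy['user_id_order'].to_numpy()
cols = toy['movie_order'].to_numpy()
matrix[rows, cols] = toy['rating'].to_numpy()
print(matrix)
```

This avoids a Python-level loop over several hundred thousand rows.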

In [47]:
# import pairwise_distances from scikit-learn
from sklearn.metrics import pairwise_distances

In [48]:
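# Note: with metric='cosine' this is the cosine *distance* (1 - cosine similarity),
# so 0 means identical rating directions and 1 means orthogonal rating vectors
user_similarity = pairwise_distances(train_data_matrix, metric='cosine')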

In [49]:
user_similarity

array([[0., 1., 1., ..., 1., 1., 1.],
       [1., 0., 0., ..., 1., 1., 1.],
       [1., 0., 0., ..., 1., 1., 1.],
       ...,
       [1., 1., 1., ..., 0., 1., 1.],
       [1., 1., 1., ..., 1., 0., 1.],
       [1., 1., 1., ..., 1., 1., 0.]])
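
The array above has zeros on the diagonal because `pairwise_distances` with `metric='cosine'` returns cosine distance, that is, one minus cosine similarity. If a similarity is wanted as the weight, it can be recovered as `1 - distance`. A NumPy-only sketch of the relationship on toy vectors (assuming no all-zero rows, which would cause a division by zero here):

```python
import numpy as np

# Hypothetical user x movie ratings; users 0 and 1 rated the same movie
ratings = np.array([[5.0, 0.0, 0.0],
                    [4.0, 0.0, 0.0],
                    [0.0, 3.0, 0.0]])

# Normalise each row to unit length, then take dot products
norms = np.linalg.norm(ratings, axis=1, keepdims=True)
unit = ratings / norms
cos_sim = unit @ unit.T    # 1 on the diagonal, 1 for users 0 and 1
cos_dist = 1.0 - cos_sim   # 0 on the diagonal, as in the output above

print(cos_sim)
print(cos_dist)
```

Weighting neighbours by distance gives the most weight to the least similar users, so the choice between the two matters for the prediction step.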

In [50]:
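# make prediction
def predict(ratings, similarity):
    # Mean rating of each user (row), taken over all movies
    mean_user_rating = ratings.mean(axis=1)

    # Centre each user's ratings on that user's mean
    ratings_diff = ratings - mean_user_rating[:, np.newaxis]
    # Weighted average of the centred ratings, added back to the user's own mean
    weights_sum = np.abs(similarity).sum(axis=1)[:, np.newaxis]
    pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / weights_sum
    return pred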
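
Written out, the function computes a mean-centred weighted average. In the notation below (introduced here for illustration), $w_{uv}$ is the entry of the `similarity` argument for users $u$ and $v$, $r_{v,i}$ is the rating of user $v$ for movie $i$, and $\bar r_u$ is the mean of row $u$ of `ratings`, taken over all movies including zero-filled ones:

$$
\hat r_{u,i} = \bar r_u + \frac{\sum_{v} w_{uv}\,\bigl(r_{v,i} - \bar r_v\bigr)}{\sum_{v} \lvert w_{uv} \rvert}
$$

The sums run over all users $v$, so every user contributes to every prediction in proportion to $w_{uv}$.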

In [51]:
user_predict = predict(train_data_matrix, user_similarity)

In [52]:
type(user_predict)

numpy.ndarray

In [53]:
user_predict

array([[-0.00531507, -0.00531507, -0.00325194, ...,  0.01717307,
         0.00273115, -0.0042835 ],
       [ 0.00925295,  0.00966566,  0.00925295, ...,  0.03174573,
         0.01730083,  0.01028473],
       [ 0.00925295,  0.00966566,  0.00925295, ...,  0.03174573,
         0.01730083,  0.01028473],
       ...,
       [-0.01502581, -0.01461318, -0.01296268, ...,  0.00746233,
        -0.0069796 , -0.01399424],
       [ 0.00926342,  0.00967664,  0.01132954, ...,  0.03178408,
         0.00926342,  0.01029648],
       [ 0.00925104,  0.00966367,  0.01131417, ...,  0.03173918,
         0.01729726,  0.00925104]])

#### Evaluation

In [54]:
from sklearn.metrics import mean_squared_error
from math import sqrt

In [55]:
def rmse(prediction, ground_truth):
    # Score only the positions that hold a non-zero rating in the ground-truth matrix
    rated = ground_truth.nonzero()
    prediction = prediction[rated]
    ground_truth = ground_truth[rated]
    return sqrt(mean_squared_error(ground_truth, prediction))

In [56]:
print('User-based CF RMSE: ' + str(rmse(user_predict, test_data_matrix)))

User-based CF RMSE: 3.9807885761472144
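
To see what this masked RMSE measures, here is a NumPy-only sketch on a toy matrix (values are made up). Only positions where the ground-truth matrix is non-zero contribute, so predictions close to zero compared against ratings of 4 or 5 produce a large error:

```python
import numpy as np

# Hypothetical predictions and ground truth (0 = no rating in the test split)
pred = np.array([[0.1, 0.0],
                 [0.0, 0.2]])
truth = np.array([[5.0, 0.0],
                  [0.0, 4.0]])

# Keep only the rated positions, then compute the root mean squared error
mask = truth != 0
errors = pred[mask] - truth[mask]
masked_rmse = np.sqrt(np.mean(errors ** 2))
print(masked_rmse)
```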


## Approach 2

In [57]:
# import surprise package
import surprise
from surprise import Reader
from surprise import Dataset
from surprise import SVD
# Note: this rebinds train_test_split, replacing the scikit-learn function imported earlier
from surprise.model_selection import train_test_split

In [58]:
reader = Reader(rating_scale=(-1, 10))

In [59]:
df_melt1 = df.melt(id_vars=df.columns[0], value_vars=df.columns[1:], var_name='Movie', value_name='rating')

In [60]:
df_melt1.head()

Unnamed: 0,user_id,Movie,rating
0,A3R5OBKS7OM2IR,Movie1,5.0
1,AH3QC2PC1VTGP,Movie1,
2,A3LKP6WPMP9UKX,Movie1,
3,AVIY68KEPQ5ZD,Movie1,
4,A1CV1WROP5KTTW,Movie1,


In [61]:
data = Dataset.load_from_df(df_melt1.fillna(0.0), reader=reader)

In [62]:
type(data)

surprise.dataset.DatasetAutoFolds

In [63]:
# divide the data into training and test data
trainset, testset = train_test_split(data, test_size=0.3)

In [64]:
type(trainset)

surprise.trainset.Trainset

In [65]:
algo = SVD()

In [66]:
# building model
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1b5faa7088>

In [67]:
# make predictions on the test set (named `predictions` so the earlier predict() function is not overwritten)
predictions = algo.test(testset)

In [68]:
from surprise.model_selection import cross_validate

In [69]:
cross_validate(algo,data,measures=['RMSE','MAE'],cv=3,verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.2814  0.2845  0.2794  0.2818  0.0021  
MAE (testset)     0.0427  0.0428  0.0426  0.0427  0.0001  
Fit time          74.80   75.30   75.80   75.30   0.41    
Test time         5.35    4.96    5.97    5.43    0.42    


{'test_rmse': array([0.28135963, 0.28451349, 0.27938029]),
 'test_mae': array([0.04269287, 0.04280434, 0.04256222]),
 'fit_time': (74.79917049407959, 75.2995810508728, 75.79735350608826),
 'test_time': (5.346543788909912, 4.963647365570068, 5.9725635051727295)}
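
Because the missing ratings were filled with 0.0 before loading, most of the evaluated pairs have a true value of 0, and a low RMSE is expected even from a model that always predicts zero. A toy illustration of the effect (the numbers are made up, not taken from the dataset):

```python
import numpy as np

# Hypothetical ground truth: 1000 user/movie pairs, only 5 of them really rated
truth = np.zeros(1000)
truth[:5] = 5.0

# A "model" that always predicts zero
always_zero = np.zeros_like(truth)

# RMSE over all pairs looks small because most targets are the filled-in zeros
baseline_rmse = np.sqrt(np.mean((always_zero - truth) ** 2))

# RMSE over the genuinely rated pairs tells a different story
rated = truth != 0
rated_rmse = np.sqrt(np.mean((always_zero[rated] - truth[rated]) ** 2))

print(baseline_rmse, rated_rmse)
```

Comparing a model against such a baseline, or evaluating only on observed ratings, gives a clearer picture of how well it predicts.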

In [70]:
# predict a rating for one user/movie pair
user_id = 'A1KLIKPUF5E88I'
Movie = 'Movie36'
rating = 3.0  # value passed as r_ui (the known rating, if any); numeric rather than the string '3'
algo.predict(user_id, Movie, r_ui=rating)

Prediction(uid='A1KLIKPUF5E88I', iid='Movie36', r_ui=3.0, est=0.011878732028878987, details={'was_impossible': False})

In [71]:
cross_validate(algo,data,measures=['RMSE','MAE'],cv=3,verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.2839  0.2810  0.2815  0.2821  0.0012  
MAE (testset)     0.0436  0.0429  0.0430  0.0432  0.0003  
Fit time          74.53   71.99   72.29   72.94   1.13    
Test time         4.66    4.67    4.19    4.51    0.22    


{'test_rmse': array([0.28388731, 0.28102409, 0.28153003]),
 'test_mae': array([0.04355452, 0.04289084, 0.04300795]),
 'fit_time': (74.53340554237366, 71.990314245224, 72.28971147537231),
 'test_time': (4.663262844085693, 4.672937870025635, 4.192778587341309)}
