### Clustering
It is possible to use a clustering algorithm, such as k-means, to group users into a cluster and then take only the users from the same cluster into consideration when predicting ratings.

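We will build our clustering-based collaborative filter with k-nearest neighbors (kNN), a closely related neighborhood-based algorithm. Rather than forming explicit clusters, kNN looks up the most similar users at prediction time. In a nutshell, given a user, `u`, and a movie, `m`, these are the steps involved: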
1. Find the k-nearest neighbors of `u` who have rated movie `m`
2. Output the average rating of the `k` users for the movie `m`

That's it. This extremely simple algorithm happens to be one of the most widely used.
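The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not how the Surprise library implements it: the function name `knn_predict`, the toy matrix, and the choice of mean squared difference over co-rated movies as the distance are all assumptions made for the example. Surprise's `KNNBasic`, used below, weights each neighbor's rating by its similarity instead of taking a plain average.

```python
import numpy as np


def knn_predict(ratings, u, m, k=2):
    """Predict user u's rating for movie m from the k most similar users.

    ratings: 2D float array (users x movies) with np.nan for missing ratings.
    Distance (an assumption for this sketch): mean squared difference
    over the movies both users have rated.
    """
    candidates = []
    for v in range(ratings.shape[0]):
        # Only other users who have actually rated movie m can be neighbors
        if v == u or np.isnan(ratings[v, m]):
            continue
        common = ~np.isnan(ratings[u]) & ~np.isnan(ratings[v])
        if not common.any():
            continue
        msd = np.mean((ratings[u, common] - ratings[v, common]) ** 2)
        candidates.append((msd, v))

    if not candidates:
        return None  # nobody comparable has rated this movie

    # Step 1: keep the k users with the smallest distance to u
    candidates.sort()
    neighbors = [v for _, v in candidates[:k]]

    # Step 2: output the plain average of their ratings for m
    return float(np.mean(ratings[neighbors, m]))


nan = np.nan
toy = np.array([
    [5.0, 4.0, nan, 1.0],  # user 0: has not rated movie 2
    [5.0, 4.0, 4.0, 1.0],  # user 1: same tastes as user 0
    [4.0, 5.0, 5.0, 2.0],  # user 2: similar tastes
    [1.0, 2.0, 1.0, 5.0],  # user 3: opposite tastes
])

prediction = knn_predict(toy, u=0, m=2, k=2)
print(prediction)
```

With `k=2`, the neighbors are users 1 and 2, so the dissimilar user 3 has no influence on the estimate.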

In [5]:
import pandas as pd
import numpy as np

# Import the required classes and methods from the surprise library
from surprise import Reader, Dataset, KNNBasic
from surprise.model_selection import cross_validate

In [3]:
# Define a Reader object
# The Reader object helps parse the file or DataFrame containing the ratings
# With no arguments it assumes a 1-5 rating scale, which matches this dataset
reader = Reader()

In [6]:
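# Load the MovieLens 100K ratings (tab-separated, no header row)
r_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv('../../data/ml-100k/u.data', sep='\t', names=r_cols, encoding='latin-1')
# Drop the timestamp: Surprise expects exactly three columns (user, item, rating)
ratings = ratings.drop('timestamp', axis=1)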

In [7]:
# Create the dataset to be used for building the filter
data = Dataset.load_from_df(ratings, reader)

In [8]:
# Define the algorithm object; in this case kNN
knn = KNNBasic()

In [9]:
# Run 5-fold cross-validation and report the RMSE on each fold
cross_validate(knn, data, measures=['rmse'], cv=5)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.


{'test_rmse': array([0.97716501, 0.98064061, 0.97422453, 0.9835689 , 0.97525814]),
 'fit_time': (0.4553859233856201,
  0.5232524871826172,
  0.41683530807495117,
  0.49950361251831055,
  0.40623974800109863),
 'test_time': (2.210005044937134,
  2.4829437732696533,
  2.697406530380249,
  2.3003673553466797,
  2.407249927520752)}

In [12]:
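# Estimate the rating user 1 would give movie 2
# (knn keeps the model fitted on the last cross-validation fold)
knn.predict(1, 2).est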

3.2153616573804804

### Singular Value Decomposition

In [13]:
# Import SVD
from surprise import SVD

# Define the SVD algorithm object
svd = SVD()

# Evaluate the performance in terms of RMSE
cross_validate(svd, data, measures=['rmse'], cv=5)

{'test_rmse': array([0.94771382, 0.93162724, 0.93001421, 0.93244431, 0.93972319]),
 'fit_time': (0.9692947864532471,
  0.8996889591217041,
  0.8974428176879883,
  0.9165253639221191,
  0.8830184936523438),
 'test_time': (0.09898853302001953,
  0.14996623992919922,
  0.09996676445007324,
  0.15003299713134766,
  0.09988713264465332)}

In [14]:
# Estimate the rating user 1 would give movie 2 with the SVD model
svd.predict(1, 2).est

3.067275831546676