# Collaborative Filtering

Based on the [Real Python tutorial](https://realpython.com/build-recommendation-engine-collaborative-filtering/) and on this [Medium article](https://medium.com/analytics-vidhya/recommendation-system-using-collaborative-filtering-cc310e641fde).

## Import MovieLens dataset

[GroupLens](https://grouplens.org/) provides a collection of datasets that are used to study the behavior of users and items in a digital environment.

[MovieLens](https://grouplens.org/datasets/movielens/) is one of those datasets; it contains movies and the ratings that users gave them. The largest versions hold millions of ratings of thousands of movies, but smaller versions are also available. We'll be using the **ml-100k** dataset, which contains 100,000 ratings of 1,682 movies by 943 users.

In [1]:
import pandas as pd

rating_list = pd.read_csv('datasets/ml-100k/u.data', sep='\t', names=['user', 'movie', 'rating', 'timestamp'])
items = pd.read_csv('datasets/ml-100k/u.item', sep='|', index_col=0,
                    names=['title', 'release_date', 'video_release_date', 'imdb_url', 'unknown', 'action',
                           'adventure', 'animation', 'childrens', 'comedy', 'crime', 'documentary', 'drama', 'fantasy',
                           'film_noir', 'horror', 'musical', 'mystery', 'romance', 'sci_fi', 'thriller', 'war',
                           'western'], encoding='latin-1')

# number of unique users and movies (items)
print(f'Number of users: {rating_list.user.nunique()} | Number of movies: {rating_list.movie.nunique()}')

Number of users: 943 | Number of movies: 1682


## Train-test split

We'll use the **train_test_split** function to split the dataset into a training set and a test set. We stratify the split by rating, so that each rating value makes up the same share of the training set as of the test set.

In [22]:
rating_list.head()

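|   | user | movie | rating | timestamp |
|---|------|-------|--------|-----------|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |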


In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(rating_list, rating_list.rating, test_size=0.25,
                                                    stratify=rating_list.rating, random_state=42)
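
As a quick check of what stratification does, the sketch below splits a small, made-up ratings table and compares the share of each rating value in the two subsets. The `toy` frame and its values are hypothetical and only serve to illustrate the effect; they are not part of MovieLens.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ratings table: rating 5 is four times as common as rating 1
toy = pd.DataFrame({
    'user': range(40),
    'rating': [1] * 8 + [5] * 32,
})

toy_train, toy_test = train_test_split(toy, test_size=0.25, stratify=toy.rating, random_state=42)

# Share of each rating value in the two subsets
train_share = toy_train.rating.value_counts(normalize=True).sort_index()
test_share = toy_test.rating.value_counts(normalize=True).sort_index()
print(pd.DataFrame({'train': train_share, 'test': test_share}))
```

Both columns should show the same proportions as the full table, which is what keeps rare rating values from being concentrated in one subset by chance.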

## User-User Collaborative Filtering

### User-Item Matrix

First we pivot the data to create a **user-item matrix**.
_Note that the pivot function introduces NaN values for missing data. This is intentional, as we want to ignore missing data when calculating the mean._

Next we calculate the mean rating for each user. NaN values are ignored in this calculation. These mean ratings will serve as baseline ratings for each user when predicting new ratings.

In [23]:


# pivot ratings to user-item matrix
ratings = X_train.pivot(index='user', columns='movie', values='rating')
mean_ratings = ratings.mean(axis=1)
print(f'Number of users: {ratings.shape[0]} | Number of movies: {ratings.shape[1]}')
ratings.head()

Number of users: 943 | Number of movies: 1644


| user \ movie | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 1670 | 1672 | 1674 | 1676 | 1677 | 1678 | 1679 | 1680 | 1681 | 1682 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 5.0 | NaN | NaN | 3.0 | 3.0 | 5.0 | 4.0 | NaN | 5.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
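
To make the role of the NaN values concrete, here is a minimal sketch on a three-row, made-up rating list (the `toy_*` names and values are hypothetical). It pivots the list the same way and compares the per-user mean with NaN skipped against the mean after filling the gaps with 0.

```python
import pandas as pd

# Hypothetical long-format ratings: user 2 never rated movie 20
toy_list = pd.DataFrame({
    'user':   [1, 1, 2],
    'movie':  [10, 20, 10],
    'rating': [4, 2, 5],
})

toy_ratings = toy_list.pivot(index='user', columns='movie', values='rating')
toy_means = toy_ratings.mean(axis=1)                    # NaN entries are skipped
toy_means_filled = toy_ratings.fillna(0).mean(axis=1)   # zeros are counted as ratings

print(toy_ratings)
print(toy_means)         # user 1: 3.0, user 2: 5.0
print(toy_means_filled)  # user 1: 3.0, user 2: 2.5
```

Filling with 0 before averaging would drag user 2's baseline down to 2.5, which is why the means are computed on the matrix that still contains NaN.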


### Calculate similarity between users

Using the nearest neighbors algorithm, we find the users most similar to a **target user**, measuring similarity as the cosine similarity between the users' rating vectors.
The cosine similarity is the dot product of the two vectors divided by the product of their magnitudes (see Content-based recommendation for more details).
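
The definition can be checked on a few made-up rating vectors. The sketch below (the `toy` array is hypothetical) computes the cosine similarity by hand, compares it with scikit-learn's `cosine_similarity`, and shows that `NearestNeighbors` with `metric='cosine'` reports cosine distances, that is, 1 minus the similarity.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors

# Hypothetical rating vectors for three users over four movies (0 = not rated)
toy = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 5.0, 4.0],
])

# Dot product divided by the product of the magnitudes
a, b = toy[0], toy[1]
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
library = cosine_similarity(toy)[0, 1]
print(f'manual: {manual:.4f} | sklearn: {library:.4f}')

# The two nearest rows to user 0: the user itself first, then the most similar other user
nn = NearestNeighbors(n_neighbors=2, metric='cosine').fit(toy)
distances, indices = nn.kneighbors(toy[:1])
print(indices[0], 1 - distances[0])
```

Because the query row is part of the fitted data, it comes back as its own nearest neighbor, which is why the first hit is dropped when collecting neighbors.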


In [8]:
# compute recommendations for all movies given a user and a fitted NearestNeighbors model
def get_all_recommendations(user_id, model, use_means=True):
    # query only the target user's row (missing ratings are treated as 0)
    distances, knn = model.kneighbors(ratings.fillna(0).loc[[user_id]])
    # kneighbors returns row positions; map them to user ids and drop the first hit (the user itself)
    neighbors = ratings.index[knn[0, 1:]]
    # cosine similarity = 1 - cosine distance
    similarities = pd.Series(1 - distances[0, 1:], index=neighbors)
    if use_means:
        # target user's mean plus the similarity-weighted deviation of each neighbor from its own mean
        deviations = ratings.loc[neighbors].subtract(mean_ratings.loc[neighbors], axis='index')
        return pd.Series(
            mean_ratings.loc[user_id] + deviations.mul(similarities, axis='index').sum(axis='index') / similarities.sum(),
            name='recommendation')
    else:
        # similarity-weighted average of the neighbors' raw ratings
        return pd.Series(ratings.loc[neighbors].mul(similarities, axis='index').sum(axis='index') / similarities.sum(),
                         name='recommendation')


# compute a single recommendation for a given user, movie and model
def get_recommendation(user_id, movie_id, model, use_means=True):
    # movies that never appear in the training set get a fixed fallback rating
    if movie_id not in ratings.columns:
        return 2.5
    recommendations = get_all_recommendations(user_id, model, use_means=use_means)
    return recommendations.loc[movie_id]
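
The mean-centred prediction in `get_all_recommendations` is easier to follow on a tiny example. The sketch below uses two made-up neighbors with hypothetical similarities and ratings (all `toy_*` values and `target_mean` are assumptions for illustration) and applies the same steps: subtract each neighbor's mean, weight by similarity, sum, divide by the total similarity, and add the target user's mean.

```python
import pandas as pd

# Hypothetical neighbors 7 and 8 of a target user; neighbor 8 has not rated movie 20
toy_ratings = pd.DataFrame(
    {10: [4.0, 2.0], 20: [5.0, float('nan')]},  # columns are movie ids
    index=pd.Index([7, 8], name='user'),        # rows are neighbor user ids
)
toy_similarities = pd.Series([0.9, 0.6], index=[7, 8])
neighbor_means = toy_ratings.mean(axis=1)  # user 7: 4.5, user 8: 2.0
target_mean = 3.0                          # assumed mean rating of the target user

# Deviation of each neighbor rating from that neighbor's own mean
deviations = toy_ratings.subtract(neighbor_means, axis='index')
# Similarity-weighted sum of deviations; NaN (unrated) entries are skipped by sum()
weighted = deviations.mul(toy_similarities, axis='index').sum(axis='index')
prediction = target_mean + weighted / toy_similarities.sum()
print(prediction)  # movie 10: 2.7, movie 20: 3.3
```

Note that the denominator sums the similarities of all neighbors, including those who did not rate the movie, so a movie rated by only a few neighbors is pulled towards the target user's own mean.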

## Model Evaluation

We compute the **Root Mean Squared Error** (RMSE) of the collaborative filtering model. First we predict ratings for all movies for every user in the test set.
Next we line up the predicted ratings with the actual ratings in the test set and compute the RMSE.

[What does RMSE really mean?](https://towardsdatascience.com/what-does-rmse-really-mean-806b65f2e48e)

_To phrase it another way, RMSE is a good way to answer the question: “How far off should we expect our model to be on its next prediction?”_
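
As a small worked example, the sketch below computes the RMSE for four made-up rating pairs (the `actual` and `predicted` arrays are hypothetical), once by hand and once via scikit-learn's `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

actual = np.array([4, 3, 5, 1])             # hypothetical true ratings
predicted = np.array([3.5, 3.0, 4.0, 2.5])  # hypothetical model predictions

# Square the errors, average them, then take the square root
errors = predicted - actual
rmse_manual = np.sqrt(np.mean(errors ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(actual, predicted))
print(f'manual: {rmse_manual:.4f} | sklearn: {rmse_sklearn:.4f}')
```

The squared errors here are 0.25, 0, 1 and 2.25, so the single miss of 1.5 stars contributes more than half of the total; RMSE penalises large errors more heavily than small ones.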


In [10]:

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import mean_squared_error

model = NearestNeighbors(n_neighbors=40, metric='cosine')
model.fit(ratings.fillna(0))


def get_RMSE(X_test, model, use_means=True):
    group = X_test[['movie', 'rating']].groupby(X_test.user)
    n_users = X_test.user.nunique()
    frames = []  # one frame per user with columns movie, rating, recommendation
    for i, key in enumerate(group.groups):
        predictions = get_all_recommendations(key, model=model, use_means=use_means)
        rated_movies = group.get_group(key).set_index('movie')
        # keep only the test movies for which a prediction exists
        frames.append(rated_movies.join(predictions).dropna().reset_index())
        if i % 100 == 0:
            # running RMSE over the users processed so far
            results = pd.concat(frames, ignore_index=True)
            score = np.sqrt(mean_squared_error(results.rating, results.recommendation))
            print(f'{i}/{n_users} - RMSE: {score:.4f}')

    results = pd.concat(frames, ignore_index=True)
    score = np.sqrt(mean_squared_error(results.rating, results.recommendation))
    print(f'{n_users}/{n_users} - RMSE: {score:.4f}')


get_RMSE(X_test, model)


0/943 - RMSE: 1.3285
100/943 - RMSE: 1.0495
200/943 - RMSE: 1.0324
300/943 - RMSE: 1.0159
400/943 - RMSE: 1.0047
500/943 - RMSE: 1.0007
600/943 - RMSE: 0.9987
700/943 - RMSE: 0.9943
800/943 - RMSE: 0.9949
900/943 - RMSE: 0.9925
943/943 - RMSE: 0.9911


In [None]:
import pandas as pd
from surprise import Dataset, Reader

ratings_dict = {
    "item": [1, 2, 1, 2, 1, 2, 1, 2, 1],
    "user": ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D', 'E'],
    "rating": [1, 2, 2, 4, 2.5, 4, 4.5, 5, 3],
}

df = pd.DataFrame(ratings_dict)
reader = Reader(rating_scale=(1, 5))

data = Dataset.load_from_df(df[["user", "item", "rating"]], reader)

movielens = Dataset.load_builtin('ml-100k')

In [None]:
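from surprise import KNNWithMeans

algo = KNNWithMeans(sim_options={'name': 'cosine', 'user_based': True})  # assumption: `algo` was not defined in the original cell
trainingSet = movielens.build_full_trainset()
algo.fit(trainingSet)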

In [528]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


# note: this redefines get_recommendation from above with a different signature
def get_recommendation(id_user, id_movie, ratings):
    # cosine similarity between all pairs of users (missing ratings are treated as 0)
    similarity_matrix = cosine_similarity(ratings.fillna(0))
    similarity_matrix_df = pd.DataFrame(similarity_matrix, index=ratings.index, columns=ratings.index)

    cosine_scores = similarity_matrix_df[id_user]
    ratings_scores = ratings[id_movie]
    rated = ratings_scores.notna()  # users who rated the movie
    # similarity-weighted average of the ratings given by the users who rated the movie
    return np.dot(ratings_scores[rated], cosine_scores[rated]) / cosine_scores[rated].sum()


# get recommendation for user 196 for movie 8
get_recommendation(196, 8, ratings)
