# Collaborative filtering - KNN with Surprise

In this notebook you will learn about collaborative filtering and how to implement it with the Surprise library. Collaborative filtering is a collective term for recommendation algorithms that are based on user behavior. These algorithms find users (or items) that are similar to each other based on rating or click history. The interactions between users and items are stored in a so-called "user-item interaction matrix". Interactions can be explicit, such as actively given ratings, or implicit, such as click data. There are two popular types of collaborative filtering: **user-based** filtering and **item-based** filtering.

**User-based** filtering algorithms predict ratings based on the ratings of similar users (similar in terms of how they rate).<br>
**Item-based** filtering algorithms predict ratings based on the ratings of similar items (similar in terms of how they are rated). Item-based models are especially useful when there are far more users than items. They use the average rating per item rather than per user.

A typical problem that collaborative filtering tries to solve is the following: users have rated some items, but many items have not been rated yet. We then try to predict the missing ratings, shown as red fields in this example of a user-item rating matrix.

<p align = "center">
<img src = "./images/UserItemRatingMatrix.png">
</p>
<p align = "center">
Fig.1 - User-Item-Rating Matrix - icons are from Vecteezy.com
</p>
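
To make the matrix idea concrete, here is a minimal sketch in plain Python. The user names, fish names, and ratings are made up for illustration and are not taken from our dataset. Every `None` corresponds to a red field that a model would have to predict.

```python
# Toy (user_id, item_id, rating) triples; the values are invented for illustration.
ratings = [
    ("anna", "clownfish", 9),
    ("anna", "guppy", 4),
    ("ben", "clownfish", 8),
    ("ben", "tetra", 6),
    ("cara", "guppy", 5),
]

users = sorted({u for u, _, _ in ratings})
items = sorted({i for _, i, _ in ratings})

# Look-up of the known ratings; anything absent is a rating we want to predict.
known = {(u, i): r for u, i, r in ratings}
matrix = {u: {i: known.get((u, i)) for i in items} for u in users}

# Collect the (user, item) pairs without a rating.
missing = [(u, i) for u in users for i in items if matrix[u][i] is None]

print("items:", items)
for u in users:
    print(u, [matrix[u][i] for i in items])
print("to predict:", missing)
```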

First, we import the necessary libraries.

In [None]:
import pandas as pd
import numpy as np
from surprise import Dataset
from surprise import Reader
from surprise import KNNWithMeans, SVD
from surprise.model_selection import GridSearchCV
from surprise.model_selection import train_test_split
from surprise import accuracy
from collections import defaultdict

Let's load our rating data. It contains the required `user_id` and `item_id` columns and the `rating` each user gave to a fish item. Additionally, it has some nice-to-have information about the fish items. There are 500 users, each of whom rated 300 fish.

In [None]:
# Loading the dataset may take some minutes -> coffee time :)
df = pd.read_csv('data/user_item_ratings.csv')
df.head(3)

The Surprise library we want to use does not work with pandas DataFrames but with Dataset objects. So we need to create a Dataset object from our DataFrame. We also need to define the possible ratings with the Reader class.

In [None]:
# defines possible ratings
reader = Reader(rating_scale=(1, 10))
# Loads Pandas dataframe
data = Dataset.load_from_df(df[["user_id", "item_id", "rating"]], reader)

To validate our models, we split the data into a trainset, which we use to train the models, and a testset, which we use to check how well the models predict unseen data.

In [None]:
# Splitting the data into training and test set
trainset, testset = train_test_split(data, test_size=0.25, random_state=42)

Let's start with the modeling!

## K-Nearest-Neighbors
One of the most common models for **collaborative filtering** is the **K-nearest neighbor algorithm (KNN)**. KNN is a **non-parametric**, **lazy learning** method. It is called lazy because it just stores the data points without learning any kind of coefficient. To make a prediction, it calculates the "distance" between the target and every other instance, ranks the distances, and returns the **K** closest instances, which are the most similar to the given data point. Several ways exist to calculate the distance between the target and the other observations.

KNN's performance suffers from the **curse of dimensionality**, and measures such as the **Euclidean distance** are not well suited to high dimensions. **Cosine similarity** is therefore a popular similarity measure for high-dimensional data. A further description of cosine similarity can be found in notebook 1. In this notebook we will use the [**KNNWithMeans**](https://surprise.readthedocs.io/en/stable/knn_inspired.html) algorithm implemented in the **Surprise library**. This algorithm is directly derived from KNN but also takes the **mean ratings** of each user into account.

For the **user-based** variant, the algorithm works as follows. First, we calculate the **similarity matrix** of the users. We use **cosine similarity** here, but other similarity measures can be used.

<p align="center">
<img src="./images/UserSimilarityMatrix.png">
</p>
<p align="center">
Fig.2 - User-Similarity-Rating Matrix - icons are from Vecteezy.com
</p>
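
As a rough sketch of how such a similarity matrix can be computed, the snippet below implements cosine similarity in plain Python and restricts it to the items that both users have rated. To our understanding Surprise's `cosine` measure also uses only common ratings, but the code here is a simplified illustration with invented toy ratings, not the library's implementation.

```python
import math


def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity of two users, using only the items both have rated."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0  # no overlap, so we treat the users as not similar
    dot = sum(ratings_a[i] * ratings_b[i] for i in common)
    norm_a = math.sqrt(sum(ratings_a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(ratings_b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)


# Invented toy ratings on a 1-10 scale: user -> {item: rating}
user_ratings = {
    "anna": {"clownfish": 9, "guppy": 4, "tetra": 7},
    "ben": {"clownfish": 8, "guppy": 5},
    "cara": {"guppy": 9, "tetra": 2},
}

users = sorted(user_ratings)
similarity = {
    u: {v: cosine_similarity(user_ratings[u], user_ratings[v]) for v in users}
    for u in users
}

for u in users:
    print(u, [round(similarity[u][v], 3) for v in users])
```

Note that two users who share only a single item always get a similarity of 1.0 with this measure, which is one reason a minimum number of common ratings can be useful in practice.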

To predict the rating of a certain fish by a certain user, we take the ratings of the **k** users most similar to our user (k is a hyperparameter of the algorithm; here we use **k=2**), weight each rating by that user's similarity, sum the results, and divide by the sum of the similarities used.

<p align="center">
<img src="./images/KNNExampleCalc.png">
</p>
<p align="center">
Fig.3 - KNN example calculation - icons are from Vecteezy.com
</p>
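
The calculation can be sketched in a few lines of plain Python. The first function is the similarity-weighted average described above. The second one adds the mean-centering that, to our understanding of the Surprise documentation, distinguishes `KNNWithMeans`: neighbors contribute their deviation from their own mean rating, and the result is added to the target user's mean. The function names and all numbers are assumptions made for this illustration.

```python
def predict_weighted(neighbors, k=2):
    """neighbors: list of (similarity, rating) for users who rated the item."""
    top = sorted(neighbors, key=lambda n: n[0], reverse=True)[:k]
    total_sim = sum(sim for sim, _ in top)
    if total_sim == 0:
        return None  # no usable neighbors
    return sum(sim * rating for sim, rating in top) / total_sim


def predict_with_means(user_mean, neighbors, k=2):
    """neighbors: list of (similarity, rating, neighbor_mean_rating)."""
    top = sorted(neighbors, key=lambda n: n[0], reverse=True)[:k]
    total_sim = sum(sim for sim, _, _ in top)
    if total_sim == 0:
        return user_mean  # fall back to the user's own mean
    offset = sum(sim * (rating - mean) for sim, rating, mean in top) / total_sim
    return user_mean + offset


# Three candidate neighbors; only the two most similar are used with k=2.
neighbors = [(0.9, 8, 6.0), (0.6, 4, 5.0), (0.2, 10, 9.0)]

plain = predict_weighted([(s, r) for s, r, _ in neighbors], k=2)
centered = predict_with_means(user_mean=7.0, neighbors=neighbors, k=2)
print(round(plain, 2), round(centered, 2))
```

With these toy numbers the plain weighted average is (0.9·8 + 0.6·4) / 1.5 = 6.4, while the mean-centered variant gives 7.0 + (0.9·2 + 0.6·(−1)) / 1.5 = 7.8.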

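Now let's see how the algorithm does on our dataset! Note that we switch to the **item-based** variant here (`"user_based": False`), which works analogously but with an item-item similarity matrix.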

In [None]:
similarity_options = {
    "name": "cosine",   # Use Cosine-Similarity
    "user_based": False,  # Compute  similarities between items
}
algo_knn = KNNWithMeans(sim_options=similarity_options, k=10, min_k=4)
algo_knn.fit(trainset)

In [None]:
# Predict ratings for the testset
predictions = algo_knn.test(testset)

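# Then compute RMSE (verbose=False keeps accuracy.rmse from printing the value a second time)
rmse_knn = accuracy.rmse(predictions, verbose=False)
print(f"RMSE: {rmse_knn}")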
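
For reference, the RMSE is the square root of the mean squared difference between actual and estimated ratings. The following sketch computes it by hand on a few invented pairs; the helper name `root_mean_squared_error` is ours and not part of Surprise.

```python
import math


def root_mean_squared_error(pairs):
    """pairs: iterable of (actual_rating, estimated_rating)."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one prediction")
    squared_errors = [(actual - estimated) ** 2 for actual, estimated in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Invented (actual, estimated) pairs with squared errors 1, 4, 1 and 0.
toy_pairs = [(8, 7.0), (3, 5.0), (10, 9.0), (6, 6.0)]
toy_rmse = root_mean_squared_error(toy_pairs)
print(f"RMSE: {toy_rmse:.3f}")
```

Because the errors are squared before averaging, a single large miss (the pair `(3, 5.0)` here) weighs more than several small ones.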

**Note**: `.test()` is a method that evaluates the entire test set and returns the predictions as a list of `Prediction` objects. Each object details the `user ID`, `item ID`, `actual rating`, and `estimated rating`. Additionally, the `.predict()` method is used for predicting the rating for a single user-item pair, returning a `Prediction` object that includes the estimated rating among other details.

In [None]:
# Show only the first 10 predictions; the full test set is very long
for element in predictions[:10]:
    print(f"user id: {element.uid}, item id: {element.iid}, estimated rating: {element.est}, real rating: {element.r_ui}")

Let's have a look at the top 10 recommendations for a specific user. Surprise has no built-in method for this, but its documentation provides a function `get_top_n` that returns the top-N recommendations for each user from the predictions of our model:

In [None]:
def get_top_n(predictions, n=10):
    """ Return the top-N recommendation for each user from a set of predictions.
    
    Args:
    predictions(list of Prediction objects): The list of predictions, as
        returned by the test method of an algorithm.
    n(int): The number of recommendation to output for each user. Default
        is 10.
    
    
    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of
        size n.
    """

    # First map the predictions to each user.
    top_n = defaultdict(list)
    
    for user_id, item_id, actual_rating, estimated_rating, _ in predictions:
        top_n[user_id].append((item_id, estimated_rating))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for user_id, estimated_ratings in top_n.items():
        estimated_ratings.sort(key=lambda x: x[1], reverse=True) # sort by rating estimation, descending. x[1] is the estimated rating. 
        top_n[user_id] = estimated_ratings[:n]

    return top_n

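What we get is a dictionary that maps each user id to a list of up to ten `(item_id, estimated_rating)` tuples. Because the predictions come from the testset, the candidate items are ones the user has actually rated but that were held out from training.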
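
To illustrate the shape of the result without running the model, the sketch below applies the same grouping and sorting logic to a handful of hand-made predictions. `ToyPrediction` is a hypothetical stand-in that mirrors the field order of Surprise's `Prediction` objects, and `top_n_per_user` is a compact variant written for this example.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for Surprise's Prediction objects (same field order).
ToyPrediction = namedtuple("ToyPrediction", ["uid", "iid", "r_ui", "est", "details"])


def top_n_per_user(predictions, n=2):
    """Group (item, estimate) pairs by user and keep the n highest estimates."""
    grouped = defaultdict(list)
    for uid, iid, _, est, _ in predictions:
        grouped[uid].append((iid, est))
    return {
        uid: sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]
        for uid, pairs in grouped.items()
    }


toy_predictions = [
    ToyPrediction(201, 1, 7.0, 6.5, {}),
    ToyPrediction(201, 2, 9.0, 8.9, {}),
    ToyPrediction(201, 3, 4.0, 7.2, {}),
    ToyPrediction(305, 1, 5.0, 5.1, {}),
]

toy_top = top_n_per_user(toy_predictions, n=2)
print(toy_top)
```

User 305 has only one prediction, so their list is shorter than `n`.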

In [None]:
# Getting the top 10 recommendations for each user
top_10 = get_top_n(predictions, n=10)
top_10

In [None]:
# Print the recommended items for a specific user
user_id = 201   # user id
# 10 best rated items for user id
top_10[user_id]

Let's make a list of the top 10 item ids, `top_items_id_user_id`, and use it with the original fish DataFrame to get some characteristics of the recommended fish. Apparently our user likes especially colorful fish the most :).

In [None]:
# The top 10 recommendations for user_id 201 are:
top_items_id_user_id = []
for item_id, estimated_rating in top_10[user_id]:
    print(f"item id: {item_id}, estimated rating: {estimated_rating}")
    top_items_id_user_id.append(item_id)

In [None]:
top_items_id_user_id

In [None]:
# Getting the name of the recommended items
recommended_fishes = df.set_index('item_id').loc[top_items_id_user_id][['name','fish_group','visual_effect']].drop_duplicates().copy()
recommended_fishes

## Predictions for a new user
Let's imagine we have a new user who has not rated any fish yet. This is a common issue called the **cold start problem**. For this user we could use the **most popular** fishes as a recommendation. This is a simple but effective way to start with. We can also ask the user to rate some items and then use the **user-based** or **item-based** collaborative filtering to make recommendations. One could directly use the trained model to make predictions for the new user or retrain the model with the new user's ratings.
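
A popularity baseline of this kind can be sketched as follows. The data is invented, and the `min_ratings` threshold is an assumption we add so that an item with a single enthusiastic rating does not top the list.

```python
from collections import defaultdict


def most_popular(ratings, top_k=3, min_ratings=2):
    """ratings: iterable of (user_id, item_id, rating) triples.

    Returns up to top_k (item_id, mean_rating) pairs, best first, ignoring
    items with fewer than min_ratings ratings.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _, item_id, rating in ratings:
        totals[item_id] += rating
        counts[item_id] += 1
    means = {
        item_id: totals[item_id] / counts[item_id]
        for item_id in totals
        if counts[item_id] >= min_ratings
    }
    return sorted(means.items(), key=lambda pair: pair[1], reverse=True)[:top_k]


toy_ratings = [
    (1, "clownfish", 9), (2, "clownfish", 8),
    (1, "guppy", 4), (3, "guppy", 6),
    (2, "tetra", 10),  # only one rating, so it is filtered out
]

popular = most_popular(toy_ratings, top_k=2)
print(popular)
```

Every new user receives the same list with this approach, which is why it only serves as a starting point until some ratings are available.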

We will use the approach of asking the user to rate some items and then use the trained `KNNWithMeans` model to make recommendations. For this we will leverage the item-item similarity matrix *learned* by the model.
The item-item similarity matrix is used to predict a user's rating for an item: we take the ratings of the **k** most similar items, weight them by their similarity, sum them, and divide by the sum of the similarities.

In the following we will show step by step how to make recommendations for a new user and then collect the steps in a function. 

#### Step1 - Collect Ratings from New User
We will create a new user with user_id = 500 and collect ratings for some fishes.

In [None]:
# new user ratings
new_user_ratings = [
    {"user_id": 500, "item_id": 1, "rating": 10},
    {"user_id": 500, "item_id": 2, "rating": 9},
    {"user_id": 500, "item_id": 3, "rating": 8},
    {"user_id": 500, "item_id": 40, "rating": 7},
    {"user_id": 500, "item_id": 50, "rating": 6},
    {"user_id": 500, "item_id": 6, "rating": 5},
    {"user_id": 500, "item_id": 390, "rating": 4},
    {"user_id": 500, "item_id": 100, "rating": 3},
    {"user_id": 500, "item_id": 9, "rating": 2},
    {"user_id": 500, "item_id": 10, "rating": 1},
]

In [None]:
# create new user dataframe
new_user_df = pd.DataFrame(new_user_ratings)
new_user_df

#### Step2 - Extract Similarity Matrix from trained `KNNWithMeans` Model
The similarity matrix is stored in the `sim` attribute of the model.

In [None]:
# Similarity matrix
item_item_similarity_matrix = algo_knn.sim
item_item_similarity_matrix.shape

The similarity matrix is a NumPy array with shape `(n_items, n_items)`. The similarity between item `i` and item `j` is stored in `sim[i, j]`, and the diagonal entry `sim[i, i]` holds the similarity of item `i` with itself. **Note** that `i` refers to the inner id that Surprise assigns to the item in the trainset and not to the raw `item_id`.
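
The difference between raw ids and inner ids can be illustrated with a small sketch. We assume here that inner ids are contiguous integers assigned in order of first appearance; the exact ordering is an assumption for this illustration, and in the notebook the conversion should always go through `to_inner_iid` and `to_raw_iid`. The matrix values are invented.

```python
# Raw item ids as they might appear in the rating data (with repeats).
raw_item_ids = [100, 7, 390, 7, 100, 42]

# Assign a contiguous inner id to each raw id the first time it is seen.
raw_to_inner = {}
for raw_id in raw_item_ids:
    if raw_id not in raw_to_inner:
        raw_to_inner[raw_id] = len(raw_to_inner)
inner_to_raw = {inner: raw for raw, inner in raw_to_inner.items()}

# Invented symmetric similarity matrix, indexed by inner id.
toy_sim = [
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.4, 0.3],
    [0.7, 0.4, 1.0, 0.5],
    [0.1, 0.3, 0.5, 1.0],
]


def similarity_between(raw_a, raw_b):
    """Look up the similarity of two items given their raw ids."""
    return toy_sim[raw_to_inner[raw_a]][raw_to_inner[raw_b]]


print(raw_to_inner)
print(similarity_between(100, 390))
```

Indexing `toy_sim` directly with the raw id 390 would raise an `IndexError` here; with a larger matrix it could silently return the similarity of a different item.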

#### Step3 - Select an Item and Convert it to the Index
We will select an item and convert its raw `item_id` to the inner id used by the trainset.

In [None]:
# Select an item_id
item_id = 100
item_id


In [None]:
# Get the inner id of the item
item_inner_id = algo_knn.trainset.to_inner_iid(item_id)
item_inner_id

#### Step4 - Get the Neighbors of the Item (Most Similar Items)
We will get the neighbours of the item by using the `get_neighbors` method of the model. The method returns a list of inner indices of the most similar items to the selected item.

In [None]:
# retrieve the most similar items
neighbors_inner_id = algo_knn.get_neighbors(item_inner_id, k=10)
neighbors_inner_id

#### Step5 - Initialize Recommendations
We will initialize the recommendation scores and the total similarity per item as empty `defaultdict(float)` dictionaries, so that missing keys start at 0.


In [None]:
# Initialize the recommendations and total similarity
recommendations = defaultdict(float)
total_similarity = defaultdict(float)

#### Step6 - Calculate the Recommendations
For each neighbor of the selected item that the new user has not rated yet, we take the sum of all training ratings of that neighbor, weight it by the neighbor's similarity to the selected item, and add it to the neighbor's score in the `recommendations` dictionary. We also accumulate the similarity in the `total_similarity` dictionary. Afterwards we normalize each score by the total similarity and by the number of ratings of the item.

In [None]:
for neighbor_inner_id in neighbors_inner_id:
    neighbor_inner_raw_id = algo_knn.trainset.to_raw_iid(neighbor_inner_id)
    # Prevent recommending items that the user has already rated
    if neighbor_inner_raw_id not in new_user_df.item_id.values:
        # Get the similarity score
        similarity_score = item_item_similarity_matrix[item_inner_id, neighbor_inner_id]
        ## Get the list of tuples of (inner id, rating) for the neighbor item
        item_inner_id_ratings_list = algo_knn.trainset.ir[neighbor_inner_id]
        ## Get only the ratings
        ratings = [rating for (_, rating) in item_inner_id_ratings_list]
        #print(ratings)
        ## Calculate the total rating for the neighbor item
        neighbor_total_rating = np.sum(ratings)
        #print(neighbor_total_rating)
        
        # Accumulate weighted score and keep track of total similarity for normalization
        recommendations[neighbor_inner_raw_id] = recommendations[neighbor_inner_raw_id] + similarity_score * (neighbor_total_rating)
        total_similarity[neighbor_inner_raw_id] = total_similarity[neighbor_inner_raw_id] + similarity_score



# Normalize the recommendations by both the total similarity and the number of ratings
# This is to mitigate the bias towards items with a higher number of ratings
for raw_item_id, score in recommendations.items():
    # trainset.ir is indexed by inner id, so convert the raw id first
    inner_item_id = algo_knn.trainset.to_inner_iid(raw_item_id)
    total_count_per_item = len(algo_knn.trainset.ir[inner_item_id])
    recommendations[raw_item_id] = score / (total_similarity[raw_item_id] * total_count_per_item)
# Sort the recommendations by score
sorted_recommendations = sorted(recommendations.items(), key=lambda x: x[1], reverse=True)
sorted_recommendations
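
It is worth working through the arithmetic of this normalization once. Since the summed rating of a candidate item is the same no matter which rated item led to it, the similarity weights cancel when we divide by the total similarity, and the final score equals the candidate's mean rating in the trainset. The sketch below checks this on invented numbers.

```python
# Toy numbers: one candidate item that is a neighbor of two rated items.
candidate_ratings = [8, 6, 10]   # all training ratings of the candidate item
similarities = [0.9, 0.4]        # similarity to each of the two rated items

# Accumulate exactly as in the loop above: similarity * summed rating.
weighted_sum = sum(sim * sum(candidate_ratings) for sim in similarities)
similarity_total = sum(similarities)

# Normalize by total similarity and by the number of ratings.
score = weighted_sum / (similarity_total * len(candidate_ratings))

mean_rating = sum(candidate_ratings) / len(candidate_ratings)
print(score, mean_rating)
```

In other words, with this scoring the similarities decide which items become candidates, while the ranking among the candidates follows their average rating. If the new user's own ratings should influence the ranking, one option (an assumption on our side, not what this notebook implements) would be to weight the user's rating of each rated item by the similarity instead.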


#### Define a Function to Make Recommendations for a New User
We will collect the steps in a function `get_recommendations` that takes the new user's ratings, the trained model and the number of recommendations as input and returns the recommendations.

In [None]:

def get_recommendations(new_user_df, model=algo_knn, top_k=10):
    """ Get recommendations for a new user based on the ratings provided. 
    Args:
    new_user_df (pd.DataFrame): A dataframe containing the new user ratings.
    model (surprise.prediction_algorithms.knns.KNNWithMeans): A trained KNNWithMeans model.
    top_k (int): The number of recommendations to return.
    """
    
    # Extract the similarity matrix
    item_item_similarity_matrix = model.sim
    # Initialize the recommendations
    recommendations = defaultdict(float)
    total_similarity = defaultdict(float)
    # Rated items
    rated_items = set(new_user_df.item_id.values)

    # Iterate over the new user ratings
    for item_id in new_user_df.item_id.values:
        # Get the inner id of the item
        item_inner_id = model.trainset.to_inner_iid(item_id)
        # Get the neighbors (the most similar items)
        neighbors_inner_id = model.get_neighbors(item_inner_id, k=top_k)
        # Iterate over the neighbors
        for neighbor_inner_id in neighbors_inner_id:
            # Get the raw id of the neighbor
            neighbor_inner_raw_id = model.trainset.to_raw_iid(neighbor_inner_id)
            # Prevent recommending items that the user has already rated
            if neighbor_inner_raw_id not in rated_items:
                # Get the similarity score between the item and the neighbor
                similarity_score = item_item_similarity_matrix[item_inner_id, neighbor_inner_id]
                # Get the list of tuples containing the ratings of the neighbor item
                item_inner_id_ratings_list = model.trainset.ir[neighbor_inner_id]
                ## Get only the ratings
                ratings = [rating for (_, rating) in item_inner_id_ratings_list]
                ## Get the total rating of the neighbor item
                neighbor_total_rating = np.sum(ratings)
                # Accumulate weighted score and keep track of total similarity for normalization
                recommendations[neighbor_inner_raw_id] = recommendations[neighbor_inner_raw_id] + similarity_score * (neighbor_total_rating)
                total_similarity[neighbor_inner_raw_id] = total_similarity[neighbor_inner_raw_id] + similarity_score
                
    # Normalize the recommendations by both the total similarity and the number of ratings
    # This is to mitigate the bias towards items with a higher number of ratings
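    for raw_item_id, score in recommendations.items():
        # trainset.ir is indexed by inner id, so convert the raw id first;
        # use the model passed in rather than the global algo_knn
        inner_item_id = model.trainset.to_inner_iid(raw_item_id)
        total_count_per_item = len(model.trainset.ir[inner_item_id])
        recommendations[raw_item_id] = score / (total_similarity[raw_item_id] * total_count_per_item)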
    # Sort and return the recommendations
    sorted_recommendations = sorted(recommendations.items(), key=lambda x: x[1], reverse=True)
    return sorted_recommendations[:top_k]

In [None]:
recommended_items = get_recommendations(new_user_df, model=algo_knn, top_k=10)
print("Recommended Items:", recommended_items)

## Conclusion
In this notebook, we learned how to use collaborative filtering to make recommendations based on the idea of similarity: 
- **user-based filtering**: *users who are similar to you also liked ...* 
- **item-based filtering**: *because you watched/bought ... you may also like ...*


We used the Scikit-Surprise library to train a KNNWithMeans model on a custom dataset of user ratings for fish items. We then used the model to make recommendations for a new user by leveraging the item-item similarity matrix.

## References
- [Surprise Library](https://surprise.readthedocs.io/en/stable/index.html)
- [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
- [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity)
- [Collaborative Filtering](https://en.wikipedia.org/wiki/Collaborative_filtering)
- [Netflix Prize](https://en.wikipedia.org/wiki/Netflix_Prize)
- [Simon Funk](https://sifter.org/simon/journal/20061211.html)
- [Cold Start Problem](https://en.wikipedia.org/wiki/Cold_start_(recommender_systems))
- [Implicit Recommender Systems](https://andbloch.github.io/An-Overview-of-Collaborative-Filtering-Algorithms/)