# USER-BASED COLLABORATIVE FILTERING

The original dataset, **bgg-19m-reviews**, is taken from <a>https://www.kaggle.com/datasets/jvanelteren/boardgamegeek-reviews</a> <br>
This dataset contains approximately 19 million reviews.<br>
In the first notebook, using the **bgg-19m-reviews** dataset:
  * we obtained **df_ratings_10** by filtering out users who have fewer than 10 ratings.
  * we created 4 helper dataframes, **df_id2user, df_user2id, df_id2game, df_game2id**, which function as dictionaries.
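
The first notebook is not reproduced here, so the following is only a minimal sketch of how such a filter and lookup dataframes could be built. The toy data, the column names `user_name`/`game_name`, the variable `min_ratings`, and the choice of ids starting at 1 are all assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the raw reviews; the column names are assumptions.
df_reviews = pd.DataFrame({
    "user_name": ["ann"] * 3 + ["bob"] * 2 + ["cy"],
    "game_name": ["Catan", "Inis", "Kemet", "Catan", "Inis", "Catan"],
    "rating": [8, 7, 9, 6, 7, 10],
})

min_ratings = 3  # the notebook uses 10; 3 keeps the toy data small

# Count ratings per user and keep only sufficiently active users.
ratings_per_user = df_reviews.groupby("user_name")["rating"].transform("count")
df_filtered = df_reviews[ratings_per_user >= min_ratings]

# Build a name -> id lookup and its inverse (ids start at 1 by assumption).
user_names = df_filtered["user_name"].unique()
df_user2id_toy = pd.DataFrame(
    {"user_id": range(1, len(user_names) + 1)},
    index=pd.Index(user_names, name="user_name"),
)
df_id2user_toy = df_user2id_toy.reset_index().set_index("user_id")
```

Both lookup dataframes are then used like dictionaries through `.loc`, for example `df_user2id_toy.loc["ann", "user_id"]`.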

Now we can import the packages and load these 5 dataframes.

In [1]:
import numpy as np
import pandas as pd
from functools import partial
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
from zipfile import ZipFile
import random
from sklearn.metrics.pairwise import pairwise_distances

In [2]:
#Download data
# df_ratings_10 file link: https://drive.google.com/file/d/13BfGTvZigyigHHSxb9jWKG6a1SUVzTBi/view?usp=sharing
!gdown 13BfGTvZigyigHHSxb9jWKG6a1SUVzTBi
# df_id2user file link : https://drive.google.com/file/d/1oXRXpQCjIHXXKpmQ_0bjWmH_n5hvThio/view?usp=sharing
!gdown 1oXRXpQCjIHXXKpmQ_0bjWmH_n5hvThio
# df_user2id file link: https://drive.google.com/file/d/1Yb_drQGpSQCFBENFU0m_13MCCYj7QPi4/view?usp=sharing
!gdown 1Yb_drQGpSQCFBENFU0m_13MCCYj7QPi4
# df_game2id file link: https://drive.google.com/file/d/1IDMw7Vwr_hBklq1o_42aXZxsHcJDWEK6/view?usp=sharing
!gdown 1IDMw7Vwr_hBklq1o_42aXZxsHcJDWEK6
# df_id2game file link: https://drive.google.com/file/d/1H15QwTWm3eysF4vFW-L-1jtv0L6IEF1s/view?usp=sharing
!gdown 1H15QwTWm3eysF4vFW-L-1jtv0L6IEF1s

Downloading...
From (original): https://drive.google.com/uc?id=13BfGTvZigyigHHSxb9jWKG6a1SUVzTBi
From (redirected): https://drive.google.com/uc?id=13BfGTvZigyigHHSxb9jWKG6a1SUVzTBi&confirm=t&uuid=aa10ac8a-a6ca-4912-a7d7-863c479f3dea
To: /content/df_ratings_10.csv
100% 257M/257M [00:05<00:00, 49.2MB/s]
Downloading...
From: https://drive.google.com/uc?id=1oXRXpQCjIHXXKpmQ_0bjWmH_n5hvThio
To: /content/df_id2user.csv
100% 3.69M/3.69M [00:00<00:00, 22.7MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Yb_drQGpSQCFBENFU0m_13MCCYj7QPi4
To: /content/df_user2id.csv
100% 3.69M/3.69M [00:00<00:00, 22.7MB/s]
Downloading...
From: https://drive.google.com/uc?id=1IDMw7Vwr_hBklq1o_42aXZxsHcJDWEK6
To: /content/df_game2id.csv
100% 566k/566k [00:00<00:00, 5.78MB/s]
Downloading...
From: https://drive.google.com/uc?id=1H15QwTWm3eysF4vFW-L-1jtv0L6IEF1s
To: /content/df_id2game.csv
100% 566k/566k [00:00<00:00, 5.46MB/s]


In [3]:
df_ratings_10 = pd.read_csv("df_ratings_10.csv", index_col=0, dtype={"game_id": "uint32", "rating": "int8"})
df_ratings_10.index = df_ratings_10.index.astype("uint32")

df_id2user = pd.read_csv("df_id2user.csv",index_col=0)
df_user2id = pd.read_csv("df_user2id.csv",index_col=0)
df_game2id = pd.read_csv("df_game2id.csv",index_col=0)
df_id2game = pd.read_csv("df_id2game.csv",index_col=0)

In [5]:
df = dict()
# data for user-based and item-based collaborative filtering
df["item_based_rec"] = df["user_based_rec"] = pd.read_csv("df_ratings_10.csv", index_col=0, dtype={"game_id": "uint32", "rating": "int8"})
df["user_based_rec"].index = df["user_based_rec"].index.astype("uint32")
df["item_based_rec"].head()

user_id,game_id,rating
1,30549,8
1,822,7
1,13,7
1,68448,8
1,36218,7


As a reminder, let's look at the first rows of these dataframes.

In [None]:
df_ratings_10.head(2)

user_id,game_id,rating
1,30549,8
1,822,7


In [None]:
df_id2game.head(2)

game_id,game_name
1,Die Macher
2,Dragonmaster


In [None]:
df_game2id.head(2)

game_name,game_id
Die Macher,1
Dragonmaster,2


In [None]:
df_user2id.head(2)

user_name,user_id
oldgoat3769967,1
warta,2


In [None]:
df_id2user.head(2)

user_id,user_name
1,oldgoat3769967
2,warta


# DEVELOPING THE RECOMMENDER

To develop the user-based collaborative filtering recommender, we will:
1. Select the target user who will receive recommendations
2. Get top users (most active users)
3. Calculate similarities with top users
4. Give recommendations from top users

## 1 - Select the target user

In [None]:
target_user_name = input("Enter a user name (enter 0 for random user):")
if target_user_name == "0":
    target_user_id = np.random.choice(df_ratings_10.index.unique())
    target_user_name = df_id2user.loc[target_user_id].item()
elif target_user_name in df_user2id.index:
    target_user_id = df_user2id.loc[target_user_name,"user_id"]
else:
    target_user_id = -1

if target_user_id not in df_ratings_10.index:
    print("User not found")
else:
    print("Target user id = ",target_user_id," Target user name:",target_user_name,",Number of ratings:",sum(df_ratings_10.index==target_user_id))

Enter a user name (enter 0 for random user):0
Target user id =  216692  Target user name: peppeds ,Number of ratings: 11


## 2 - Get top users (most active users)

* As the name **df_ratings_10** implies, user-based collaborative filtering here considers only users **who rated at least 10 games**.
* We will look for similar users among the **most active** (most **experienced**) users, here the top 10000 users by number of ratings.
* We will then drop the users who do not have more than a threshold number of games in common with the target user. We choose the threshold as **10**.
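
As a small illustration of the "most active users" step, the sketch below uses a toy ratings dataframe indexed by `user_id`, mirroring the layout of **df_ratings_10**. The toy values and the names `df_toy`, `n_top`, `top_ids` and `df_top` are assumptions for illustration only.

```python
import pandas as pd

# Toy ratings indexed by user_id, mirroring the layout of df_ratings_10.
df_toy = pd.DataFrame(
    {"game_id": [1, 2, 3, 1, 2, 1], "rating": [8, 7, 9, 6, 7, 10]},
    index=pd.Index([10, 10, 10, 20, 20, 30], name="user_id"),
)

# value_counts sorts by frequency in descending order,
# so the first entries are the users with the most ratings.
n_top = 2
top_ids = df_toy.index.value_counts().head(n_top).index

# Select every rating that belongs to one of the top users.
df_top = df_toy.loc[top_ids]
```

Here user 10 has three ratings and user 20 has two, so they are the two most active users and user 30 is left out.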

In [None]:
num_top_users = 10000
top_user_ids = df_ratings_10.index.value_counts()[:num_top_users].index

In [None]:
# Get ratings of top users
df_ratings_top_users = df_ratings_10.loc[top_user_ids]
df_ratings_top_users.index.name="user_id"
df_ratings_top_users.head(3)

user_id,game_id,rating
1,30549,8
1,822,7
1,13,7


## 3 - Calculate similarities with other users

* We will use a threshold for **the number of games in common**.
* We will only calculate correlations or distances with other users whose number of games in common with the target user is above the threshold.<br>
* Note that, depending on this threshold and on the number of games rated by the target user, user-based collaborative filtering may not produce any recommendations. For example, with a threshold of 15, there might be no other user who shares enough rated games with the target user. Alternatively, the target user may have rated fewer than 15 games, in which case there is no need to start the search at all.

Let's choose the threshold for number of games in common as 10.
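
To make the "games in common" filter concrete, here is a minimal sketch on toy data. The values and the names `df_toy`, `threshold`, `common_counts` and `eligible_ids` are assumptions for illustration; the strict `>` comparison follows the helper function used in this notebook.

```python
import pandas as pd

# Toy ratings indexed by user_id (values are made up).
df_toy = pd.DataFrame(
    {"game_id": [1, 2, 3, 1, 2, 4, 3, 5], "rating": [8, 7, 9, 6, 7, 5, 9, 4]},
    index=pd.Index([10, 10, 10, 20, 20, 20, 30, 30], name="user_id"),
)
target_id = 10
threshold = 1  # the notebook uses 10; 1 keeps the toy data small

# Games rated by the target user.
target_games = df_toy.loc[target_id, "game_id"]

# Ratings of the other users, restricted to the target user's games.
others = df_toy[df_toy.index != target_id]
others_common = others[others["game_id"].isin(target_games)]

# Number of games each user shares with the target user.
common_counts = others_common.index.value_counts()

# Keep users with strictly more than `threshold` games in common.
eligible_ids = common_counts[common_counts > threshold].index.tolist()
```

User 20 shares two games with the target user and user 30 shares only one, so only user 20 passes the filter.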

In [None]:
threshold_games_in_common = 10

We create a pivot table whose rows are user ids, columns are game ids, and values are ratings.<br>
We use this pivot table to calculate the similarities.
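
The shape of such a pivot table can be seen on a toy example. The data and the names `df_toy` and `pivot` are assumptions for illustration; games a user has not rated show up as `NaN`.

```python
import pandas as pd

# Toy ratings indexed by user_id (values are made up).
df_toy = pd.DataFrame(
    {"game_id": [1, 2, 3, 1, 2, 4, 3, 5], "rating": [8, 7, 9, 6, 7, 5, 9, 4]},
    index=pd.Index([10, 10, 10, 20, 20, 20, 30, 30], name="user_id"),
)

# Rows: user ids, columns: game ids, values: ratings.
# reset_index turns user_id into a regular column so it can be named directly.
pivot = df_toy.reset_index().pivot_table(
    index="user_id", columns="game_id", values="rating"
)
```

The result has one row per user and one column per game, so comparing two users amounts to comparing two rows.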

In [None]:
def create_pivot_table(df_ratings, df_ratings_top_users, target_id, threshold_in_common=10, slice_length=1000):
    """Yield one pivot table (users x games rated by the target user) per batch of top users."""
    ids_rows = df_ratings_top_users.index.unique().tolist()
    df_ratings_target = df_ratings.loc[target_id]
    ids_cols = df_ratings_target.iloc[:, 0]  # game ids rated by the target user
    for i in range(0, len(ids_rows), slice_length):
        ids_rows_subset = set(ids_rows[i:i + slice_length])  # select a batch of users
        ids_rows_subset.discard(target_id)                   # exclude the target user if it is in the batch
        # keep only the ratings of users in the current batch
        df_ratings_subset = df_ratings_top_users[df_ratings_top_users.index.isin(ids_rows_subset)]
        # keep only the ratings of games that the target user has also rated
        df_ratings_subset = df_ratings_subset[df_ratings_subset.iloc[:, 0].isin(ids_cols)]
        # keep only users who have more than threshold_in_common games in common with the target user
        df_counts = df_ratings_subset.index.value_counts()
        df_ratings_subset = df_ratings_subset.loc[df_counts[df_counts > threshold_in_common].index]
        # append the target user's ratings so that its row is part of the pivot table
        df_ratings_subset = pd.concat((df_ratings_subset, df_ratings_target), axis=0)
        # rows: user ids, columns: game ids, values: ratings
        yield df_ratings_subset.pivot_table(index=df_ratings_subset.index, columns=df_ratings_subset.columns[0], values="rating")

Now, we generate a dataframe for similarities.
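
Before the full function, a toy illustration of the correlation step may help. It is only a sketch: the pivot values and the names `pivot`, `target`, `others` and `corr` are assumptions. With `axis=1`, `corrwith` computes the Pearson correlation between each row and the target row, using only the games that both users rated.

```python
import numpy as np
import pandas as pd

# Toy pivot table: rows are users, columns are games, NaN means "not rated".
pivot = pd.DataFrame(
    [[8, 6, 9, 4], [4, 3, 4.5, 2], [2, 9, np.nan, 8]],
    index=pd.Index([1, 2, 3], name="user_id"),
    columns=pd.Index([11, 12, 13, 14], name="game_id"),
    dtype=float,
)

target = pivot.loc[1]          # ratings of the target user
others = pivot.drop(index=1)   # ratings of everyone else

# Row-wise Pearson correlation with the target user.
corr = others.corrwith(target, axis=1)

# Most similar users first.
corr = corr.sort_values(ascending=False)
```

User 2 rates every game at exactly half the target user's rating, so the correlation is 1 even though the absolute ratings differ. User 3 tends to rate in the opposite direction and gets a negative correlation.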

In [None]:
def get_similarities(df_ratings, df_ratings_top_users, target_id, threshold_in_common, df_id2name,slice_length=2000, similarity_metric =  None):
    """
    @param df_ratings: ratings (df_ratings_10: ratings of users with at least 10 ratings)
    @param df_ratings_top_users: ratings of the most active users
    @param target_id: target user id
    @param threshold_in_common: threshold for the number of games in common with the target user (used to filter users)
    @param df_id2name: df_id2user dictionary dataframe that maps user ids to user names
    @param slice_length: number of users processed per batch
    @param similarity_metric: distance metric passed to pairwise_distances (e.g. "euclidean", "manhattan"); None means Pearson correlation
    @return: dataframe of similarities, sorted so that the most similar users come first
    """
    df_similarities = pd.DataFrame()
    if len( df_ratings.loc[target_id]) < threshold_in_common: # If the target user did not rate enough games we do not have to proceed any more.
        print("The number of games rated by the target user is less than threshold for number of games in common.")
    else:
        total_steps = len(df_ratings_top_users.index.unique())  // slice_length
        pivot_generator = create_pivot_table(df_ratings, df_ratings_top_users, target_id,threshold_in_common, slice_length=slice_length)

        for df_pivot_filtered_slice in tqdm(pivot_generator, total = total_steps):
            # row of the pivot table related to the target user
            df_pivot_target = df_pivot_filtered_slice.loc[target_id]
            # Pivot table slice for other users who have more than threshold_in_common games in common with the target user
            df_pivot_others_slice = df_pivot_filtered_slice.loc[df_pivot_filtered_slice.index != target_id]
            if not df_pivot_others_slice.empty:
                # calculate correlations of the target user with other (filtered) users in pivot table slice
                if similarity_metric is None:
                    df_similarities_slice = df_pivot_others_slice.corrwith(df_pivot_target, axis=1, numeric_only=True)
                # calculate distances of the target user with other (filtered) users in pivot table slice
                # note: unrated games are NaN in the pivot table, and most metrics raise an error on NaN input
                else:
                    df_similarities_slice = pd.DataFrame( pairwise_distances(df_pivot_others_slice,df_pivot_target.to_numpy().reshape(1,-1), metric=similarity_metric),index=df_pivot_others_slice.index )

                if len(df_pivot_others_slice)>0:
                    # add "number of games in common" column as a new column to df_similarities_slice
                    df_num_in_common = df_pivot_others_slice.notna().sum(axis=1)
                    df_sim_slice=  pd.concat((df_similarities_slice,df_num_in_common),axis=1)
                    # save similarities and "number of games in common" in  df_similarities
                    df_similarities = pd.concat((df_similarities, df_sim_slice))

        if not df_similarities.empty:
            name_rows_pivot = df_ratings.index.name[:4] # user
            name_cols_pivot = df_ratings.iloc[:,0].name[:4] # game
            similarity_column = similarity_metric if similarity_metric else "Correlation_with_the_selected_"+name_rows_pivot
            df_similarities.columns = [similarity_column,"num_of_"+name_cols_pivot+"games_in_common"]
            df_similarities.sort_values(by= similarity_column , ascending=False if similarity_metric is None else True , inplace=True)
            df_similarities[name_rows_pivot+"_name"]= df_id2name.loc[df_similarities.index]
            df_similarities.index.name= name_rows_pivot+"_id"
        else:
            print("No recommendations found")

    return df_similarities

Let's see similar users.

In [None]:
df_similarities = get_similarities(df_ratings_10, df_ratings_top_users, target_user_id,threshold_games_in_common,df_id2user, slice_length=2000, similarity_metric =  None)
df_similarities.head()

  0%|          | 0/5 [00:00<?, ?it/s]

user_id,Correlation_with_the_selected_user,num_of_gamegames_in_common,user_name
20,0.180021,11,Pandorzecza
12,0.149071,11,JasonSaastad
54,-0.286304,11,Gibmaatsuki
7,-0.374836,11,Walt Mulder


## 4 - Give recommendations

Now, using df_similarities, let's look at the user most similar to the target user (highest correlation, or lowest distance when a distance metric is used).

In [None]:
if not df_similarities.empty:
    similar_user_id= df_similarities.index[0] # the most similar user
    similar_user_name = df_similarities.iloc[0]["user_name"]
    print("Target user:",target_user_name)
    print("Similar user id:",similar_user_id,"Similar user name:",similar_user_name )
    df_target_user  = df_ratings_10.loc[target_user_id]
    df_similar_user =  df_ratings_top_users.loc[similar_user_id]
    df_recommendations = df_similar_user[~df_similar_user["game_id"].isin(df_target_user["game_id"])]
    df_games_in_common = df_similar_user[df_similar_user["game_id"].isin(df_target_user["game_id"])]
    print(f"Total games played by the target user {target_user_name} is {len(df_target_user)}")
    print(f"Total games played by the most similar user {similar_user_name} is {len(df_similar_user)}")
    print("Number of games played in common:", len(df_games_in_common)  )
    print("Number of different games played by the similar user",len(df_recommendations) )

Target user: peppeds
Similar user id: 20 Similar user name: Pandorzecza
Total games played by the target user peppeds is 11
Total games played by the most similar userPandorzecza is 2851
Number of games played in common: 11
Number of different games played by the similar user 2840


Now let's generate recommendations from the most similar user.

In [None]:
num_of_recommendations=5
df_recommendations = df_recommendations.sort_values(by="rating",ascending=False)
df_recommendations.index.name="user_id"
df_recommendations[:num_of_recommendations]

user_id,game_id,rating
20,126042,10
20,266192,10
20,25292,10
20,199792,10
20,201808,10


Now we can collect recommendations from several neighbors, starting from the most similar user. <br>
In this step, we gather the ratings of the K nearest neighbors. Then we sort the games first by frequency (how many neighbors rated them) and then by average rating.<br>
The average rating is calculated using only the ratings of the K nearest neighbors.
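
The ranking rule can be sketched on toy data as follows. The values and the names `df_neighbours`, `target_game_ids`, `candidates` and `summary` are assumptions for illustration; the sketch uses named aggregation, which is one of several ways to compute the count and the mean together.

```python
import pandas as pd

# Ratings of the K nearest neighbours (toy values), indexed by user_id.
df_neighbours = pd.DataFrame(
    {"game_id": [1, 2, 3, 2, 3, 3, 4], "rating": [9, 8, 7, 6, 9, 8, 10]},
    index=pd.Index([20, 20, 20, 30, 30, 40, 40], name="user_id"),
)
target_game_ids = [1]  # games the target user has already rated

# Drop games the target user already knows (compare game ids only).
candidates = df_neighbours[~df_neighbours["game_id"].isin(target_game_ids)]

# Per game: how many neighbours rated it, and their average rating.
summary = (
    candidates.groupby("game_id")["rating"]
    .agg(count="count", average_rating="mean")
    .sort_values(by=["count", "average_rating"], ascending=False)
)
```

Game 3 comes first because all three neighbours rated it, even though game 4 has the higher average rating from a single neighbour.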

In [None]:
def get_recommendations(df_ratings, df_ratings_top_users, target_id, df_similarities, df_id2game, num_recommendations=10, K=1):
    df_target_user = df_ratings.loc[target_id]
    similar_user_ids = df_similarities.index[:K]
    # Get all ratings of the top-K similar users
    df_top_k_similar_users = df_ratings_top_users.loc[similar_user_ids]
    # Exclude games already rated by the target user (compare game ids only, not ratings)
    df_top_k_similar_users = df_top_k_similar_users[~df_top_k_similar_users["game_id"].isin(df_target_user["game_id"])]
    # Count how many of the top-K neighbors rated each game
    df_game_counts = df_top_k_similar_users.groupby("game_id")["game_id"].count().to_frame("count")
    # Find the average rating of each game among the top-K neighbors
    df_game_average_ratings = df_top_k_similar_users.groupby("game_id")["rating"].mean().to_frame("average_rating")
    df_recommended_games = pd.merge(df_game_counts, df_game_average_ratings, left_index=True, right_index=True)
    df_recommended_games.index = df_recommended_games.index.astype("uint32")
    # Sort the games by count first, then by average rating
    df_recommended_games.sort_values(by=["count", "average_rating"], ascending=False, inplace=True)
    # Add game names as a new column
    df_recommended_games["game_name"] = df_id2game.loc[df_recommended_games.index]
    df_recommended_games = df_recommended_games.round(2)
    # Use the function argument, not the global num_of_recommendations
    return df_recommended_games[:num_recommendations]

### Testing

In [None]:
K=3
print('### RECOMMENDED GAMES ###')
df_recommended_games= get_recommendations(df_ratings_10,df_ratings_top_users,target_user_id,df_similarities,df_id2game,num_of_recommendations,K)
print(f'Recommended games for the user "{target_user_name}" with id {target_user_id}')
df_recommended_games

### RECOMMENDED GAMES ###
Recommended games for the user "peppeds" with id 216692


game_id,count,average_rating,game_name
127023,3,8.67,Kemet
463,3,8.33,Magic: The Gathering
3076,3,8.0,Puerto Rico
47185,3,8.0,Warhammer: Invasion
155821,3,8.0,Inis
