<a href="https://colab.research.google.com/github/onertartan/recommender-systems-board-games/blob/main/explanatory_3_item_based_collaborative_filtering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ITEM-BASED COLLABORATIVE FILTERING

* The original dataset is taken from <a>https://www.kaggle.com/datasets/jvanelteren/boardgamegeek-reviews</a>. <br>
Previously, in user-based collaborative filtering, we pre-processed the original dataset and obtained **df_ratings_10**.
* This dataframe contains the ratings of users who have rated at least 10 games.
* For memory efficiency, we generated user ids, which were not in the original dataset, and then created dictionary dataframes mapping user names to user ids (**df_user2id**) and vice versa (**df_id2user**).
* We also created dictionary dataframes mapping game names to game ids (**df_game2id**) and vice versa (**df_id2game**).

* We can now load our dictionaries and the pre-processed dataframe, and start developing a recommendation system using item-based collaborative filtering.
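
As a reminder of what the dictionary dataframes look like, here is a minimal sketch of one way such id mappings could be built. The toy data and the names `df_raw_toy`, `user2id_toy` and `id2user_toy` are hypothetical, and the use of `pd.factorize` is an assumption; the actual pre-processing was done in the user-based notebook and may differ.

```python
import pandas as pd

# Hypothetical raw ratings keyed by user name (not the real dataset)
df_raw_toy = pd.DataFrame({
    "user_name": ["alice", "bob", "alice", "carol"],
    "game_name": ["Catan", "Catan", "Dune", "Dune"],
    "rating": [8, 7, 9, 6],
})

# factorize assigns 0, 1, 2, ... in order of first appearance; add 1 so ids start at 1
codes, unique_names = pd.factorize(df_raw_toy["user_name"])
df_raw_toy["user_id"] = codes + 1
ids = list(range(1, len(unique_names) + 1))

# Lookup tables in both directions
user2id_toy = pd.DataFrame({"user_id": ids}, index=pd.Index(list(unique_names), name="user_name"))
id2user_toy = pd.DataFrame({"user_name": list(unique_names)}, index=pd.Index(ids, name="user_id"))

print(user2id_toy.loc["bob"].item())  # id assigned to "bob"
print(id2user_toy.loc[3].item())      # name stored under id 3
```

Storing small integer ids instead of repeated name strings is what keeps the ratings dataframe compact.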

Import packages and load data.

In [1]:
import numpy as np
import pandas as pd
from functools import partial
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
import random
from sklearn.metrics.pairwise import pairwise_distances

In [None]:
#Download preprocessed data
# File link of df_ratings_10: https://drive.google.com/file/d/13BfGTvZigyigHHSxb9jWKG6a1SUVzTBi/view?usp=sharing
# File link of df_id2game:    https://drive.google.com/file/d/123LMDzrhgsi7C0lo7Ru7syyNoFoMGIvW/view?usp=sharing
# File link of df_game2id:    https://drive.google.com/file/d/1IDMw7Vwr_hBklq1o_42aXZxsHcJDWEK6/view?usp=sharing
# File link of df_user2id:    https://drive.google.com/file/d/1Yb_drQGpSQCFBENFU0m_13MCCYj7QPi4/view?usp=sharing
# File link of df_id2user:    https://drive.google.com/file/d/1oXRXpQCjIHXXKpmQ_0bjWmH_n5hvThio/view?usp=sharing
!gdown 13BfGTvZigyigHHSxb9jWKG6a1SUVzTBi
!gdown 123LMDzrhgsi7C0lo7Ru7syyNoFoMGIvW
!gdown 1IDMw7Vwr_hBklq1o_42aXZxsHcJDWEK6
!gdown 1Yb_drQGpSQCFBENFU0m_13MCCYj7QPi4
!gdown 1oXRXpQCjIHXXKpmQ_0bjWmH_n5hvThio

In [3]:
# Load data
df_ratings_10 = pd.read_csv("df_ratings_10.csv", index_col=0, dtype={"game_id": "uint32", "rating": "int8"})  # dtype key matches the game_id column shown below
df_id2game = pd.read_csv("df_id2game.csv", index_col=0)  # index: game id column:game name
df_id2user = pd.read_csv("df_id2user.csv", index_col=0)  # index: user id column:user name
df_user2id = pd.read_csv("df_user2id.csv", index_col=0)  # index: user name column:user id
df_game2id = pd.read_csv("df_game2id.csv", index_col=0)  # index: game name column:game id

As a reminder, let's check the heads of these dataframes.

In [4]:
df_ratings_10.head(2)

| user_id | game_id | rating |
|---|---|---|
| 1 | 30549 | 8 |
| 1 | 822 | 7 |


In [5]:
df_id2game.head(2)

| gameId | gameName |
|---|---|
| 1 | Die Macher |
| 2 | Dragonmaster |


In [6]:
df_game2id.head(2)

| game_name | game_id |
|---|---|
| Die Macher | 1 |
| Dragonmaster | 2 |


In [7]:
df_id2user.head(2)

| user_id | user_name |
|---|---|
| 1 | oldgoat3769967 |
| 2 | warta |


In [8]:
df_user2id.head(2)

| user_name | user_id |
|---|---|
| oldgoat3769967 | 1 |
| warta | 2 |


Let's check the number of ratings, users, and games.

In [9]:
print("Total number of ratings       :",len(df_ratings_10))
print("Number of users               :",df_ratings_10.index.nunique())
print("Number of games               :",df_ratings_10["game_id"].nunique())

Total number of ratings       : 18393100
Number of users               : 224662
Number of games               : 21839


## DEVELOPING THE RECOMMENDATION SYSTEM

In item-based collaborative filtering, the ratings given to a game are treated as the features of that game.

We will proceed in four steps:
1. Select the target game for which to give recommendations
2. Get the most active users
3. Calculate correlations/similarities with other games
4. Give recommendations
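
To make the idea of "ratings as features" concrete, the following is a minimal sketch on made-up data. The game labels, user labels and ratings are hypothetical; only `DataFrame.corrwith` is taken from the approach used later in this notebook.

```python
import pandas as pd

nan = float("nan")

# Hypothetical toy ratings: rows are games, columns are users (NaN = not rated)
pivot_toy = pd.DataFrame(
    {
        "u1": [8.0, 7.0, 3.0],
        "u2": [6.0, 6.0, 5.0],
        "u3": [9.0, 8.0, 2.0],
        "u4": [4.0, nan, 8.0],
    },
    index=pd.Index(["game_a", "game_b", "game_c"], name="game_id"),
)

target = pivot_toy.loc["game_a"]          # feature vector of the target game
others = pivot_toy.drop(index="game_a")   # feature vectors of the remaining games

# Pearson correlation of every other game with the target, computed row-wise
# over the users who rated both games (missing ratings are ignored)
correlations = others.corrwith(target, axis=1).sort_values(ascending=False)
print(correlations)
```

Here game_b is rated much like game_a by the users they share, so it ranks first, while game_c is rated in the opposite way and gets a negative correlation.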

### 1-Select the target game

In [11]:
target_game_name = input("Enter a game name (enter 0 for random game):")
if target_game_name == "0":
    target_game_id = np.random.choice(df_ratings_10["game_id"].unique())
    target_game_name = df_id2game.loc[target_game_id].item()
elif target_game_name in df_game2id.index:
    target_game_id = df_game2id.loc[target_game_name].item()
else:
    target_game_id = -1

if target_game_id not in df_ratings_10["game_id"].unique():
    print("Game not found")
else:
    print("Target game id = ",target_game_id," Target game name:",target_game_name,",Number of ratings:",sum(df_ratings_10["game_id"]==target_game_id))

Enter a game name (enter 0 for random game):Starfight
Target game id =  14088  Target game name: Starfight ,Number of ratings: 47


### 2-Get top users (most active users)

* As the name **df_ratings_10** implies, we consider only users **who rated at least 10 games**.
* When calculating similarities, we will use the ratings of **experienced** users only, i.e. **the most active users** (the users who rated the most games). Let's say the top 10000 users.
* Later, we will eliminate the games whose number of users in common with the target game does not exceed a threshold value. We choose the threshold as **8**, but it can be changed as long as the target game does not have fewer ratings than the threshold.

In [12]:
num_top_users = 10000
top_user_ids = df_ratings_10.index.value_counts()[:num_top_users].index

In [13]:
# Get ratings of top users
df_ratings_top_users = df_ratings_10.loc[top_user_ids]
df_ratings_top_users.index.name="user_id"
df_ratings_top_users.head(3)

| user_id | game_id | rating |
|---|---|---|
| 1 | 30549 | 8 |
| 1 | 822 | 7 |
| 1 | 13 | 7 |


### 3-Calculate similarities with other games

* We will use a threshold for **the number of users in common**.
* We will only calculate correlations or distances with other games whose number of users in common with the target game is above the threshold.<br>
* Note that, depending on this threshold and on the number of users who rated the target game, item-based collaborative filtering may not find any recommendations. For example, let's say the threshold is 15. It is possible that no other game shares that many raters with the target game. Another case is that the target game itself did not receive 15 ratings, in which case there is no need to start the search at all.
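
The filtering described above can be sketched on a small example. The data, `target_game_toy` and `threshold_toy` are hypothetical; the strict `>` comparison mirrors the one used in the pivot-table function of this notebook.

```python
import pandas as pd

# Hypothetical ratings indexed by game_id
df_toy = pd.DataFrame({
    "game_id": [1, 1, 1, 2, 2, 2, 3, 3],
    "user_id": [1, 2, 3, 1, 2, 3, 3, 4],
    "rating":  [8, 7, 9, 6, 7, 8, 5, 4],
}).set_index("game_id")

target_game_toy = 1
threshold_toy = 2  # hypothetical threshold for the number of users in common

# Users who rated the target game
target_users = df_toy.loc[target_game_toy, "user_id"]

# Ratings of the other games, restricted to users who also rated the target game
others = df_toy[df_toy.index != target_game_toy]
others = others[others["user_id"].isin(target_users)]

# Number of users in common per game, then keep games above the threshold
counts = others.index.value_counts()
eligible_games = counts[counts > threshold_toy].index.tolist()
print(counts)
print(eligible_games)
```

Game 2 shares three raters with the target game and passes the threshold, while game 3 shares only one and is dropped.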

Let's choose the threshold for number of users in common as 8.

In [14]:
threshold_users_in_common = 8

We will use pivot tables whose rows are game ids, columns are user ids and values are ratings.
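
As a quick illustration of this layout, here is a minimal sketch that pivots a few made-up long-format ratings (the data and the name `df_pivot_toy` are hypothetical):

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (game, user) pair
df_long_toy = pd.DataFrame({
    "game_id": [10, 10, 20, 20, 30, 30],
    "user_id": [1, 2, 1, 3, 2, 3],
    "rating":  [8, 6, 7, 9, 5, 4],
})

# Rows: game ids, columns: user ids, values: ratings.
# (game, user) pairs without a rating become NaN.
df_pivot_toy = df_long_toy.pivot_table(index="game_id", columns="user_id", values="rating")
print(df_pivot_toy)
```

Each row of the result is the feature vector of one game. Because most users rate only a small fraction of all games, such a table is mostly NaN, which is one reason to build it in batches.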

In [15]:
def create_pivot_table(df_ratings, df_ratings_top_users, target_id, threshold_in_common=10, slice_length=1000):
    """Yield pivot tables (rows: game ids, columns: user ids, values: ratings), one per batch of games.

    Each pivot table holds the target game plus the games in the batch that have
    more than threshold_in_common users in common with the target game.
    """
    ids_rows = df_ratings_top_users.index.unique().tolist()  # ids of all games rated by the top users
    df_ratings_target = df_ratings.loc[target_id]            # ratings of the target game
    ids_cols = df_ratings_target.iloc[:, 0]                  # ids of the users who rated the target game

    for i in range(0, len(ids_rows), slice_length):
        ids_rows_subset = set(ids_rows[i:i + slice_length])  # select a batch of games
        ids_rows_subset.discard(target_id)                   # exclude the target game if it is in the batch
        # keep only the ratings of the games in the batch (slice)
        df_ratings_subset = df_ratings_top_users[df_ratings_top_users.index.isin(ids_rows_subset)]
        # keep only the ratings given by users (ids_cols) who also rated the target game
        df_ratings_subset = df_ratings_subset[df_ratings_subset.iloc[:, 0].isin(ids_cols)]
        # keep only the games with more users in common than the threshold (index is game_id)
        df_counts = df_ratings_subset.index.value_counts()
        df_ratings_subset = df_ratings_subset.loc[df_counts[df_counts > threshold_in_common].index]
        # append the ratings of the target game, then pivot
        df_ratings_subset = pd.concat((df_ratings_subset, df_ratings_target), axis=0)
        yield df_ratings_subset.pivot_table(index=df_ratings_subset.index, columns=df_ratings_subset.columns[0], values="rating")  # columns are user ids

Now, we will generate a dataframe for similarities.

In [16]:
def get_similarities(df_ratings, df_ratings_top_users, target_id, threshold_in_common,df_id2name, slice_length=2000, similarity_metric =  None):
    """
    @param df_ratings: ratings (df_ratings_10: ratings of users who rated at least 10 games)
    @param df_ratings_top_users: ratings of the most active users
    @param target_id:  target game id
    @param threshold_in_common: threshold for "number of users in common with the target game" (used to filter out games with few users in common with the target game)
    @param df_id2name: df_id2game dictionary dataframe that maps game ids to game names
    @param slice_length: batch size for the pivot generator
    @param similarity_metric: distance metric (euclidean, manhattan etc.); if None, correlation is used
    """
    df_similarities = pd.DataFrame()
    # If the target game is not rated by enough users, there is no need to proceed.
    if len(df_ratings.loc[target_id]) < threshold_in_common:
        print("The number of users who rated the target game is less than the threshold for the number of users in common.")
    else:
        total_steps = len(df_ratings_top_users.index.unique())  // slice_length
        pivot_generator = create_pivot_table(df_ratings, df_ratings_top_users, target_id,threshold_in_common, slice_length)
        for df_pivot_filtered_slice in tqdm(pivot_generator, total = total_steps):
            # df_pivot_target: the row of the pivot table that belongs to the target game
            df_pivot_target = df_pivot_filtered_slice.loc[target_id]
            # df_pivot_others_slice: the rows of the games other than the target game (already filtered by the threshold)
            df_pivot_others_slice = df_pivot_filtered_slice.loc[df_pivot_filtered_slice.index != target_id]
            if not df_pivot_others_slice.empty:
                # if no metric is specified, calculate the correlations of the target game with the other (filtered) games in the pivot table slice
                if similarity_metric is None:
                    df_similarities_slice = df_pivot_others_slice.corrwith(df_pivot_target, axis=1, numeric_only=True)
                # otherwise calculate the distances between the target game and the other (filtered) games in the pivot table slice
                # (note: the pivot table holds NaN for missing ratings, which many scikit-learn metrics may reject)
                else:
                    df_similarities_slice = pd.DataFrame( pairwise_distances(df_pivot_others_slice,df_pivot_target.to_numpy().reshape(1,-1), metric=similarity_metric),index=df_pivot_others_slice.index )

                # add "number of users in common" as a new column to df_similarities_slice
                df_num_in_common = df_pivot_others_slice.notna().sum(axis=1)
                df_sim_slice = pd.concat((df_similarities_slice, df_num_in_common), axis=1)
                # save similarities and "number of users in common" in df_similarities
                df_similarities = pd.concat((df_similarities, df_sim_slice))

        if not df_similarities.empty:
            name_rows_pivot = df_ratings.index.name[:4] # game
            name_cols_pivot = df_ratings.iloc[:,0].name[:4] # user
            similarity_column = similarity_metric if similarity_metric else "Correlation_with_the_selected_"+name_rows_pivot
            df_similarities.columns = [similarity_column,"num_"+name_cols_pivot+"s_in_common"]
            df_similarities.sort_values(by= similarity_column , ascending=False if similarity_metric is None else True , inplace=True)
            df_similarities[name_rows_pivot+"_name"]= df_id2name.loc[df_similarities.index]
            df_similarities.index.name= name_rows_pivot+"_id"
        else:
            print("No recommendations found")

    return df_similarities

Let's find similar games. Unlike in the previous user-based collaborative filtering, we first have to swap the index and the first column of df_ratings_10 and df_ratings_top_users.<br>
So, the index column will be game_id and the first column will be user_id.

In [17]:
# Initial df_ratings_10
df_ratings_10.head(2)

| user_id | game_id | rating |
|---|---|---|
| 1 | 30549 | 8 |
| 1 | 822 | 7 |


In [18]:
# After swapping index column and the first column of df_ratings_10
df_ratings_10 = df_ratings_10.reset_index().set_index("game_id")
df_ratings_10.head(2)

| game_id | user_id | rating |
|---|---|---|
| 30549 | 1 | 8 |
| 822 | 1 | 7 |


In [19]:
# After swapping index column and the first column of df_ratings_top_users
df_ratings_top_users = df_ratings_top_users.reset_index().set_index("game_id")
df_ratings_top_users.head(2)

| game_id | user_id | rating |
|---|---|---|
| 30549 | 1 | 8 |
| 822 | 1 | 7 |


Now we can check similar games.

In [20]:
df_similarities = get_similarities(df_ratings_10, df_ratings_top_users,
                                   target_game_id,threshold_users_in_common , df_id2game,slice_length=2000, similarity_metric =  None)
df_similarities.head()

  0%|          | 0/10 [00:00<?, ?it/s]

| game_id | Correlation_with_the_selected_game | num_users_in_common | game_name |
|---|---|---|---|
| 14996 | 0.64312 | 9 | Ticket to Ride: Europe |
| 13 | 0.640729 | 11 | Catan |
| 12333 | 0.593593 | 14 | Twilight Struggle |
| 10547 | 0.55814 | 9 | Betrayal at House on the Hill |
| 121 | 0.533396 | 10 | Dune |


### 4-Give recommendations

Now, using df_similarities, let's check the game that is most similar to the target game, i.e. the one with the highest correlation (or the lowest distance when a distance metric is used).
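
The sort direction depends on the measure: correlations are ranked from highest to lowest, distances from lowest to highest. The sketch below illustrates the distance case on made-up data. The helper `distance_over_common_users` is hypothetical and is not what `pairwise_distances` computes; it is only an assumption about how a distance restricted to commonly rated entries could look.

```python
import numpy as np
import pandas as pd

def distance_over_common_users(row, target):
    """Euclidean distance using only the users who rated both games (hypothetical helper)."""
    mask = row.notna() & target.notna()
    return float(np.sqrt(((row[mask] - target[mask]) ** 2).sum()))

# Hypothetical pivot table: rows are games, columns are users
pivot_toy = pd.DataFrame(
    [[8.0, 6.0, 9.0], [7.0, 6.0, np.nan], [3.0, 5.0, 2.0]],
    index=["target", "close_game", "far_game"],
    columns=["u1", "u2", "u3"],
)
target = pivot_toy.loc["target"]
others = pivot_toy.drop(index="target")

# One distance per remaining game; the smallest distance is the most similar game
distances = others.apply(distance_over_common_users, axis=1, target=target)
ranked = distances.sort_values(ascending=True)
print(ranked)
```

With a distance, the first row after sorting in ascending order plays the same role as the first row of df_similarities sorted by descending correlation.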

In [21]:
if not df_similarities.empty:
    similar_game_id = df_similarities.index[0]  # the most similar game
    similar_game_name = df_similarities.iloc[0]["game_name"]
    print("Target game:", target_game_name)
    print("Similar game id:", similar_game_id, "Similar game name:", similar_game_name)
    df_target_game = df_ratings_10.loc[target_game_id]
    df_similar_game = df_ratings_top_users.loc[similar_game_id]
    print(f"Number of users who rated the target game {target_game_name}: {len(df_target_game)}")
    print(f"Number of top users who rated the most similar game {similar_game_name}: {len(df_similar_game)}")
    users_in_common = set(df_target_game["user_id"]) & set(df_similar_game["user_id"])
    print("Number of users in common:", len(users_in_common))
    users_different = set(df_similar_game["user_id"]) - set(df_target_game["user_id"])
    print("Number of other users who rated the similar game:", len(users_different))

Target game: Starfight
Similar game id: 14996 Similar game name: Ticket to Ride: Europe
Number of users who rated the target game Starfight: 47
Number of top users who rated the most similar game Ticket to Ride: Europe: 6898
Number of users in common: 9
Number of other users who rated the similar game: 6889


Now let's  generate recommendations based on similarities.

In [22]:
def get_recommendations(df_ratings, target_id, df_similarities, df_id2game, df_id2user, num_of_recommendations=10):
    """Return the top rows of df_similarities with the number of top users who rated each game."""
    df_result = df_similarities.iloc[:num_of_recommendations].copy()
    for similar_game_id in df_result.index:
        # count the ratings among the most active users (uses the global df_ratings_top_users)
        df_result.loc[similar_game_id, "total_users_rated"] = len(df_ratings_top_users.loc[similar_game_id])
    return df_result

In [24]:
df_recommended_games= get_recommendations(df_ratings_10,target_game_id,df_similarities,df_id2game,df_id2user,num_of_recommendations = 10)
print(f'Recommended games for the game {target_game_name} with id "{target_game_id}" :')
df_recommended_games

Recommended games for the game Starfight with id "14088" :


| game_id | Correlation_with_the_selected_game | num_users_in_common | game_name | total_users_rated |
|---|---|---|---|---|
| 14996 | 0.64312 | 9 | Ticket to Ride: Europe | 6898.0 |
| 13 | 0.640729 | 11 | Catan | 8597.0 |
| 12333 | 0.593593 | 14 | Twilight Struggle | 5792.0 |
| 10547 | 0.55814 | 9 | Betrayal at House on the Hill | 4910.0 |
| 121 | 0.533396 | 10 | Dune | 1726.0 |
| 243 | 0.512494 | 9 | Advanced Squad Leader | 638.0 |
| 483 | 0.508747 | 9 | Diplomacy | 2474.0 |
| 822 | 0.500298 | 9 | Carcassonne | 8939.0 |
| 10630 | 0.462177 | 12 | Memoir '44 | 4654.0 |
| 28143 | 0.458475 | 9 | Race for the Galaxy | 7631.0 |
