# **Games recommendation system based on collaborative filtering**

## **Description of the data**
To build a recommendation system based on collaborative filtering, we need a dataset that describes the interactions of the users with the available items.

In this case, we work with a dataset that contains the interactions between users and games.

The data is available on [Kaggle](https://www.kaggle.com/tamber/steam-video-games).

The data contains the games that each user has purchased and the number of hours the user has played each of them.

The image below shows the percentage of the data that refers to purchases and to hours played.

![](images/data.png)

Therefore, we are going to split the dataset based on user behavior (purchase or play).

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from math import floor

In [2]:
# define the column names because the CSV file has no header
col_names = ['user_id', 'game_name', 'behavior', 'value']

# read the data
df = pd.read_csv("data/steam-200k.csv",
                 header=None,
                 usecols=[0, 1, 2, 3],
                 names=col_names)
df.head()

Unnamed: 0,user_id,game_name,behavior,value
0,151603712,The Elder Scrolls V Skyrim,purchase,1.0
1,151603712,The Elder Scrolls V Skyrim,play,273.0
2,151603712,Fallout 4,purchase,1.0
3,151603712,Fallout 4,play,87.0
4,151603712,Spore,purchase,1.0


In [3]:
df_plays = df[df["behavior"] == "play"][["user_id", "game_name", "value"]]
df_plays.head()

Unnamed: 0,user_id,game_name,value
1,151603712,The Elder Scrolls V Skyrim,273.0
3,151603712,Fallout 4,87.0
5,151603712,Spore,14.9
7,151603712,Fallout New Vegas,12.1
9,151603712,Left 4 Dead 2,8.9


In [4]:
df_purchases = df[df["behavior"] == "purchase"][["user_id", "game_name", "value"]]
df_purchases.head()

Unnamed: 0,user_id,game_name,value
0,151603712,The Elder Scrolls V Skyrim,1.0
2,151603712,Fallout 4,1.0
4,151603712,Spore,1.0
6,151603712,Fallout New Vegas,1.0
8,151603712,Left 4 Dead 2,1.0


### **Data profiling**

Instead of using `Pandas Profiling` to do data profiling, we are going to run a few methods to get a description of these data.

In [5]:
print("Number of unique users", df["user_id"].nunique())
print("Number of unique games", df["game_name"].nunique())

Number of unique users 12393
Number of unique games 5155


In [6]:
# shape of the dataframes
print("Purchases dataframe", df_purchases.shape)
print("Plays dataframe", df_plays.shape)

Purchases dataframe (129511, 3)
Plays dataframe (70489, 3)


In [7]:
# missing values
print("Purchases dataframe")
print(df_purchases.isnull().sum())
print("")
print("Plays dataframe")
print(df_plays.isnull().sum())

Purchases dataframe
user_id      0
game_name    0
value        0
dtype: int64

Plays dataframe
user_id      0
game_name    0
value        0
dtype: int64


In [8]:
#Number of purchased games by each user
df_purchases['user_id'].value_counts()

62990992     1075
33865373      783
30246419      766
58345543      667
76892907      597
             ... 
149194171       1
207945140       1
130315685       1
282733934       1
214618086       1
Name: user_id, Length: 12393, dtype: int64

In [9]:
#Number of played games by each user
df_plays['user_id'].value_counts()

62990992     498
11403772     314
138941587    299
47457723     298
49893565     297
            ... 
188448131      1
98244166       1
213929581      1
7519923        1
176643508      1
Name: user_id, Length: 11350, dtype: int64

In [10]:
# Number of purchases for each game
df_purchases['game_name'].value_counts()

Dota 2                                                4841
Team Fortress 2                                       2323
Unturned                                              1563
Counter-Strike Global Offensive                       1412
Half-Life 2 Lost Coast                                 981
                                                      ... 
Minimon                                                  1
Warehouse and Logistics Simulator                        1
Sweezy Gunner                                            1
3DMark Vantage                                           1
Pinball FX2 - Star Wars Pinball Heroes Within Pack       1
Name: game_name, Length: 5155, dtype: int64

In [11]:
# Number of total hours played for each game
df_plays.groupby('game_name')['value'].sum().sort_values(ascending=False)

game_name
Dota 2                             981684.6
Counter-Strike Global Offensive    322771.6
Team Fortress 2                    173673.3
Counter-Strike                     134261.1
Sid Meier's Civilization V          99821.3
                                     ...   
A-Train 8                               0.1
Shan Gui                                0.1
Hyper Fighters                          0.1
Diamond Dan                             0.1
Guardians of Orion                      0.1
Name: value, Length: 3600, dtype: float64

## **Data pre-processing**
We are going to work with the `df_plays` DataFrame and create an interaction matrix, where the rows represent the users and the columns, the games.
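As a minimal sketch of what such an interaction matrix looks like, the toy play log below (made-up users and games, not rows from the Steam dataset) is pivoted into a user-by-game table. The names `toy_plays` and `toy_matrix` are illustrative only.

```python
import pandas as pd

# Toy play log (made-up users and games, not rows from the Steam dataset)
toy_plays = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "game_name": ["A", "B", "A", "C"],
    "game_hours": [10.0, 2.5, 4.0, 1.0],
})

# One row per user, one column per game; user-game pairs with no record become NaN
toy_matrix = toy_plays.pivot_table(values="game_hours",
                                   index="user_id",
                                   columns="game_name")
print(toy_matrix)

# Share of user-game cells with no recorded play time
sparsity = toy_matrix.isna().to_numpy().mean()
print(f"sparsity: {sparsity:.2f}")
```

Note that `pivot_table` aggregates duplicate (user, game) rows with the mean by default, so repeated records for the same pair would be averaged rather than summed.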

In [12]:
# rename the "value" column
df_plays.rename(columns={'value': 'game_hours'}, inplace=True)
df_plays.head()

Unnamed: 0,user_id,game_name,game_hours
1,151603712,The Elder Scrolls V Skyrim,273.0
3,151603712,Fallout 4,87.0
5,151603712,Spore,14.9
7,151603712,Fallout New Vegas,12.1
9,151603712,Left 4 Dead 2,8.9


In [13]:
# get a statistical description of the game_hours column
df_plays['game_hours'].describe()

count    70489.000000
mean        48.878063
std        229.335236
min          0.100000
25%          1.000000
50%          4.500000
75%         19.100000
max      11754.000000
Name: game_hours, dtype: float64

In [14]:
df_plays = df_plays

In [15]:
# create the interaction matrix that contains the hours
# that each user has played each of their games
hours_matrix = df_plays.pivot_table(values='game_hours',
                                    index='user_id',
                                    columns='game_name')
hours_matrix.head()

game_name,007 Legends,0RBITALIS,1... 2... 3... KICK IT! (Drop That Beat Like an Ugly Baby),10 Second Ninja,"10,000,000",100% Orange Juice,1000 Amps,12 Labours of Hercules,12 Labours of Hercules II The Cretan Bull,12 Labours of Hercules III Girl Power,...,rFactor,rFactor 2,realMyst,realMyst Masterpiece Edition,resident evil 4 / biohazard 4,rymdkapsel,sZone-Online,the static speaks my name,theHunter,theHunter Primal
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
5250,,,,,,,,,,,...,,,,,,,,,,
76767,,,,,,,,,,,...,,,,,,,,,,
86540,,,,,,,,,,,...,,,,,,,,,,
144736,,,,,,,,,,,...,,,,,,,,,,
181212,,,,,,,,,,,...,,,,,,,,,,


In [16]:
# fill the NaN (Not a Number) values with the zero value.
hours_matrix.fillna(0, inplace=True)
hours_matrix.head()

game_name,007 Legends,0RBITALIS,1... 2... 3... KICK IT! (Drop That Beat Like an Ugly Baby),10 Second Ninja,"10,000,000",100% Orange Juice,1000 Amps,12 Labours of Hercules,12 Labours of Hercules II The Cretan Bull,12 Labours of Hercules III Girl Power,...,rFactor,rFactor 2,realMyst,realMyst Masterpiece Edition,resident evil 4 / biohazard 4,rymdkapsel,sZone-Online,the static speaks my name,theHunter,theHunter Primal
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
5250,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
76767,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
86540,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
144736,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
181212,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


A zero means that the user has not yet played the game.

In [17]:
# create another interaction matrix that indicates
# whether each user has purchased each game (games as rows, users as columns)
purchases_matrix = df_purchases.pivot_table(values='value',
                                            index='game_name',
                                            columns='user_id')

purchases_matrix.fillna(0, inplace=True)
purchases_matrix=purchases_matrix.astype('int')
purchases_matrix.head()

user_id,5250,76767,86540,103360,144736,181212,229911,298950,299153,381543,...,309262440,309265377,309375103,309404240,309434439,309554670,309626088,309812026,309824202,309903146
game_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
007 Legends,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
0RBITALIS,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1... 2... 3... KICK IT! (Drop That Beat Like an Ugly Baby),0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10 Second Ninja,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"10,000,000",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


A zero means that the user has not yet purchased the game.

## **Item-based collaborative filtering**
This system recommends games to a user based on a specified game that they have already purchased.
But first, we need a similarity matrix that holds the similarity score for every pair of items.
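To make the similarity score concrete, here is a small sketch that computes cosine similarity by hand on a made-up item-user purchase matrix. For binary purchase vectors, the score between two games reduces to the number of shared buyers divided by the square root of the product of their buyer counts. The names `toy_purchases` and `toy_sim` are illustrative only.

```python
import numpy as np

# Toy item-user purchase matrix (rows: games, columns: users); values are made up
toy_purchases = np.array([
    [1, 1, 0, 0],   # game A bought by users 0 and 1
    [1, 1, 1, 0],   # game B bought by users 0, 1 and 2
    [0, 0, 0, 1],   # game C bought by user 3 only
])

# Cosine similarity: dot product divided by the product of the row norms
norms = np.linalg.norm(toy_purchases, axis=1)
toy_sim = (toy_purchases @ toy_purchases.T) / np.outer(norms, norms)
print(np.round(toy_sim, 3))
```

Games A and B share two buyers, so their score is 2 / sqrt(2 * 3), roughly 0.816, while game C shares no buyers with either and scores 0.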

In [18]:
from sklearn.metrics.pairwise import cosine_similarity

# compute the similarity matrix
item_sim_matrix = cosine_similarity(purchases_matrix)

# load this matrix into a DataFrame
item_sim_matrix = pd.DataFrame(item_sim_matrix,
                               index=purchases_matrix.index,
                               columns=purchases_matrix.index)
item_sim_matrix.head()

game_name,007 Legends,0RBITALIS,1... 2... 3... KICK IT! (Drop That Beat Like an Ugly Baby),10 Second Ninja,"10,000,000",100% Orange Juice,1000 Amps,12 Labours of Hercules,12 Labours of Hercules II The Cretan Bull,12 Labours of Hercules III Girl Power,...,rFactor 2,realMyst,realMyst Masterpiece Edition,resident evil 4 / biohazard 4,rymdkapsel,sZone-Online,samurai_jazz,the static speaks my name,theHunter,theHunter Primal
game_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
007 Legends,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0RBITALIS,0.0,1.0,0.0,0.471405,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.036662,0.0
1... 2... 3... KICK IT! (Drop That Beat Like an Ugly Baby),0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.077152,0.0,0.074848,0.0,0.104828,0.048002,0.0
10 Second Ninja,0.0,0.471405,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.025924,0.0
"10,000,000",0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.408248,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
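def get_similar_items(item_name, sim_matrix):
    """
    Function that returns all the other items ordered by their similarity to the input item
    """
    # drop the input item explicitly instead of slicing off the first entry,
    # since ties at similarity 1.0 could place another item first
    return list(
        sim_matrix.loc[item_name]
        .drop(item_name)
        .sort_values(ascending=False)
        .index
    )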

def get_recommendations(user_id, item_name, sim_matrix, item_user_interactions, n_games=5):
    """
    Function that returns as recommendations the top n most similar games to the
    input game for the input user based on the purchasing activity of the other users
    by collaborative filtering with item-item approach
    """
    
    # get the most similar items to the input game
    most_similar_items = get_similar_items(item_name, sim_matrix)
    
    # get the items that the input user has already purchased
    purchased = set(item_user_interactions.index[item_user_interactions[user_id] == 1])

    # remove from the list of the most similar items
    # the items that the input user already purchased
    most_similar_items = [game for game in most_similar_items if game not in purchased]
    
    # return the top n of the most similar games
    return most_similar_items[:n_games]
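The ranking and filtering steps can be checked end to end on a tiny example. The sketch below uses a made-up purchase matrix and a hypothetical helper, `toy_recommend`, that mirrors the logic of the functions in this cell; it is not a replacement for them.

```python
import numpy as np
import pandas as pd

# Toy item-user purchase matrix (made-up data): rows are games, columns are users
toy = pd.DataFrame(
    [[1, 1, 1, 0],
     [1, 1, 0, 0],
     [0, 1, 1, 1],
     [0, 0, 0, 1]],
    index=["A", "B", "C", "D"],
    columns=[10, 20, 30, 40],
)

# Item-item cosine similarity computed with NumPy only
values = toy.to_numpy(dtype=float)
norms = np.linalg.norm(values, axis=1)
sim = pd.DataFrame(values @ values.T / np.outer(norms, norms),
                   index=toy.index, columns=toy.index)

def toy_recommend(user_id, item_name, n_games=2):
    # rank the other games by similarity to item_name
    ranked = sim.loc[item_name].drop(item_name).sort_values(ascending=False)
    # keep only games the user does not own yet
    owned = set(toy.index[toy[user_id] == 1])
    return [g for g in ranked.index if g not in owned][:n_games]

print(toy_recommend(10, "A"))
```

User 10 already owns A and B, so B is filtered out even though it is the game most similar to A, and the call returns `['C', 'D']`.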

In [20]:
test_games = ["Counter-Strike", "Left 4 Dead", "Mortal Kombat X", "Need for Speed SHIFT"]
for game in test_games:
    print(f"Users who purchased {game} also purchased: ")
    print(get_recommendations(5250, game, item_sim_matrix, purchases_matrix, 5))
    print("")

Users who purchased Counter-Strike also purchased: 
['Counter-Strike Condition Zero', 'Counter-Strike Condition Zero Deleted Scenes', 'Counter-Strike Global Offensive', 'Day of Defeat Source', 'Half-Life Deathmatch Source']

Users who purchased Left 4 Dead also purchased: 
['Left 4 Dead 2', 'Half-Life Deathmatch Source', 'Half-Life Source', 'Borderlands', 'BioShock']

Users who purchased Mortal Kombat X also purchased: 
['Mortal Kombat Komplete Edition', 'Street Fighter V Beta', "Warhammer End Times - Vermintide Sigmar's Blessing", 'Warhammer End Times - Vermintide', 'The Darkness II']

Users who purchased Need for Speed SHIFT also purchased: 
['Need for Speed Undercover', 'Victoria II Interwar Cavalry Unit Pack', 'Shift 2 Unleashed', 'The Magic Circle', 'Duck Dynasty']



In [23]:
test_games = ["Counter-Strike", "Left 4 Dead", "Mortal Kombat X", "Need for Speed SHIFT"]
for game in test_games:
    print(f"Users who purchased {game} also purchased: ")
    print(get_recommendations(86540, game, item_sim_matrix, purchases_matrix, 5))
    print("")

Users who purchased Counter-Strike also purchased: 
['Counter-Strike Condition Zero', 'Counter-Strike Condition Zero Deleted Scenes', 'Counter-Strike Source', 'Half-Life 2 Deathmatch', 'Counter-Strike Global Offensive']

Users who purchased Left 4 Dead also purchased: 
['Half-Life 2 Episode Two', 'Half-Life Deathmatch Source', 'Half-Life Source', 'Half-Life 2 Lost Coast', 'Counter-Strike Source']

Users who purchased Mortal Kombat X also purchased: 
['Mortal Kombat Komplete Edition', 'Street Fighter V Beta', "Warhammer End Times - Vermintide Sigmar's Blessing", 'Warhammer End Times - Vermintide', 'The Darkness II']

Users who purchased Need for Speed SHIFT also purchased: 
['Need for Speed Undercover', 'Victoria II Interwar Cavalry Unit Pack', 'Shift 2 Unleashed', 'The Magic Circle', 'Duck Dynasty']



## **User-based collaborative filtering**

In [24]:
# keep only the first 10,000 users (rows) of the interaction matrix
hours_matrix = hours_matrix[:10000]

In [25]:
sim_matrix = cosine_similarity(hours_matrix, hours_matrix)

In [26]:
sim_matrix = pd.DataFrame(sim_matrix, 
                          index=hours_matrix.index, 
                          columns=hours_matrix.index)
sim_matrix.head()

user_id,5250,76767,86540,144736,181212,229911,298950,381543,547685,554278,...,258648395,258673767,258753415,258754843,258774423,258785172,258795101,258806371,258834833,258845881
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
5250,1.0,0.00248,0.0,0.0,0.0,9.778883e-06,0.016179,0.0,0.0,0.0,...,0.00127,0.0,0.00127,0.00127,0.005081,0.002951,0.0,0.005081,0.005081,0.00127
76767,0.00248,1.0,0.000128,0.685379,0.669058,0.3389927,0.003713,0.685379,0.037465,0.050745,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
86540,0.0,0.000128,1.0,0.0,0.0,7.168671e-07,0.104667,0.0,0.0,0.002336,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
144736,0.0,0.685379,0.0,1.0,0.976187,0.2165145,0.000414,1.0,0.044944,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
181212,0.0,0.669058,0.0,0.976187,1.0,0.2116718,0.000404,0.976187,0.043874,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [27]:
def predict_hours(user_id, game_name, interactions_matrix, sim_matrix):
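    """
    Function that predicts how many hours the input user will play the input game,
    as the mean of the hours all users played it, weighted by their similarity
    to the input user (rounded down to whole hours)
    """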
    sim_scores = sim_matrix[user_id]
    hours_played = interactions_matrix[game_name]
    wmean_hours = np.dot(sim_scores, hours_played)/sim_scores.sum()
    return floor(wmean_hours)

def generate_recommendations(user_id, interactions_matrix, sim_matrix, n_games=5):
    """
    Function that returns n games as recommendations based
    on the activity of the similar users to the input user
    """
    hours_pred = {}
    
    not_played = interactions_matrix.loc[user_id]
    not_played = not_played[not_played == 0].index
    
    for game in not_played:
        hours_pred[game] = predict_hours(user_id, game, interactions_matrix, sim_matrix)
        
    return sorted(hours_pred.items(), key=lambda x: x[1], reverse=True)[:n_games]
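The prediction in `predict_hours` is a similarity-weighted mean. As a small worked sketch with made-up numbers (the arrays below are illustrative, not taken from `sim_matrix` or `hours_matrix`):

```python
import numpy as np

# Made-up similarity scores between the target user and four users
# (the first entry is the target user compared with themselves)
sim_scores = np.array([1.0, 0.8, 0.2, 0.0])

# Hours each of those users played the candidate game
# (the target user has 0 because they have not played it)
hours_played = np.array([0.0, 50.0, 10.0, 300.0])

# Similarity-weighted mean: similar users count more, dissimilar ones not at all
predicted = np.dot(sim_scores, hours_played) / sim_scores.sum()
print(predicted)
```

Here the weighted sum is 0.8 * 50 + 0.2 * 10 = 42 and the weights sum to 2, giving about 21 hours. The user with 300 hours contributes nothing because their similarity is 0. Since the target user's own similarity of 1 is part of the denominator while their hours are 0, the estimate is pulled toward zero; this only affects the scale of the predictions, as the same denominator is shared by all games for a given user.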

In [29]:
from random import sample, seed

seed(42)

test_users = list(hours_matrix[:1000].index)
test_users = sample(test_users, 3)
for user_id in test_users:
    print("Recommendations to the user: ", user_id)
    for game, _ in generate_recommendations(user_id, hours_matrix[:10000], sim_matrix):
        print(game)
    print("\n")

Recommendations to the user:  34960438
Counter-Strike Global Offensive
Dota 2
Team Fortress 2
Garry's Mod
Call of Duty Modern Warfare 2 - Multiplayer


Recommendations to the user:  7955670
Dota 2
Garry's Mod
Counter-Strike Source
The Elder Scrolls V Skyrim
Terraria


Recommendations to the user:  1024319
Counter-Strike
Sniper Elite
FTL Faster Than Light
Counter-Strike Global Offensive
Counter-Strike Source


