# Social Computing/Social Gaming - Summer 2020

# Exercise Sheet 3: Collaborative Filtering with Steam Games

In this exercise, we will build a collaborative filtering recommender system using data we gather from Steam. We will use your friends list to get information about owned games for each ID, and the time each game was played.

Usually, collaborative filtering is based on some sort of rating to determine the similarity between users. However, for games, the enjoyment and a rating do not always match. Additionally, only about 10% of players actually rate the games they play, which would make for a very incomplete dataset. Therefore, the playtime will be used instead of a rating system. This has the added benefit that playtime is usually the most authentic metric of enjoyment, as players are very unlikely to spend much time on a game they don't enjoy.

## Task 3.1: Obtaining the data


**1.** Your first task is to gather the data needed to create the recommender system. Create a data structure that holds the needed information for each player and game.  
**Note:** You cannot obtain a list from your profile with the Steam API unless your profile is set to public. 

If you do not have a Steam profile, you can use the default values. 
However, we encourage you to use your own profile. 

**Hint**: To obtain the games a user owns, use this: `games = data['response']['games']`. This returns a list of games, including the playtime (in minutes) which can be retrieved like this: `playtime = game['playtime_forever']` , where game refers to an item from the list of games. 
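The hint above can be tried out without touching the network. The following is a minimal sketch, assuming a response shaped like the one described in the hint; `sample_data` and the helper `extract_playtimes` are made up for illustration and are not part of the exercise template.

```python
# Hypothetical sample shaped like a GetOwnedGames response (values invented for illustration).
sample_data = {
    "response": {
        "game_count": 3,
        "games": [
            {"appid": 1, "name": "Game A", "playtime_forever": 120},
            {"appid": 2, "name": "Game B", "playtime_forever": 0},
            {"appid": 3, "name": "Game C", "playtime_forever": 45},
        ],
    }
}


def extract_playtimes(data):
    """Return {game name: playtime in minutes}, keeping only games that were actually played."""
    # Some responses carry no 'games' key at all, so fall back to an empty list.
    games = data.get("response", {}).get("games", [])
    return {g["name"]: g["playtime_forever"] for g in games if g["playtime_forever"] > 0}


playtimes = extract_playtimes(sample_data)
print(playtimes)
```

Dropping games with zero playtime is a design choice: an owned but never played game says little about what a user enjoys.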

In [1]:
#Use this if you want to work with the default IDs
import requests
import urllib
import pandas as pd
import json
from urllib.request import Request, urlopen
from pandas import json_normalize  # the old pandas.io.json import path is deprecated
from requests.exceptions import HTTPError

# You can replace these values with your own ID and API key
key = "CB35B8F8DCE9135DDAA3B0328FCE0103"
id = "76561198329838242"
url = "http://api.steampowered.com/ISteamUser/GetPlayerSummaries/v0002/?key="+key+"&steamids="+id
r = requests.get(url)
data = r.json()

# Get friendslist
request = Request("http://api.steampowered.com/ISteamUser/GetFriendList/v0001/?key="+key+"&steamid="+id+"&relationship=friend")
response = urlopen(request)
elevations = response.read()
data = json.loads(elevations)
friendslist = data['friendslist']
friends = friendslist['friends']

friendids =[]
tempIDs = []
for friend in friends:
    friendids.append(friend['steamid'])
    
print(len(friendids))
#get friends of friends:
x = 0
while x < len(friendids):
    friendID = friendids[x]
    request = Request("http://api.steampowered.com/ISteamUser/GetFriendList/v0001/?key="+key+"&steamid="+friendID+"&relationship=friend")
    try:
        response = urlopen(request)    
    except urllib.error.HTTPError  as e:
        print('401')
    elevations = response.read()
    try:
        data = json.loads(elevations)
    except json.JSONDecodeError:
        print('couldnt decode')
    friendslist = data['friendslist']
    friends = friendslist['friends']

    friendidsNew =[]
    for friend in friends:
        friendidsNew.append(friend['steamid'])
        
    tempIDs+=friendidsNew
    x+=1

friendids += tempIDs
friendids = list(dict.fromkeys(friendids))
friendids = list(set(friendids))
print(len(friendids))


64
401
couldnt decode
401
couldnt decode
401
couldnt decode
401
couldnt decode
401
couldnt decode
401
couldnt decode
401
couldnt decode
401
couldnt decode
401
couldnt decode
401
couldnt decode
6655


In [2]:
# Trim the list of IDs to reasonable values:
if len(friendids)>250:
    friendids = friendids[:250]    
print(len(friendids))

users_gamedicts = {} # The dictionary containing all information for every ID
gamedict = {} # A dict containing information for one player

# Get owned games of friendslist:
request = Request("http://api.steampowered.com/IPlayerService/GetOwnedGames/v0001/?key="+key+"&steamid="+id+"&include_appinfo=1&format=json")

# TODO:
# Open the URL and read the json response and retrieve the games of your friends and their playtime
# Save the games into a dictionary with key=name and values=playtime
# Hint 1: You can obtain the games a user owns with data['response']['games']
# Hint 2: You can retrieve their playtime with game['playtime_forever']
response = urlopen(request)
elevations = response.read()
data = json.loads(elevations)
games = data['response']['games']
for g in games:
    if g['playtime_forever'] > 0:
        gamedict.update({g['name']:g['playtime_forever']})
    
# Add the dictionary to the users_gamedict       
users_gamedicts[int(id)]=gamedict

# Do the same for the friends of your friends
for friendID in friendids:
    # TODO
    request = Request("http://api.steampowered.com/IPlayerService/GetOwnedGames/v0001/?key="+key+"&steamid="+friendID+"&include_appinfo=1&format=json")
    response = urlopen(request)
    elevations = response.read()
    data = json.loads(elevations)
    data = data['response']
    gamedict_new = {}
    if 'games' in data.keys():
        games = data['games']
        for g in games:
            if g['playtime_forever'] > 0:
                gamedict_new.update({g['name']:g['playtime_forever']})
    
        # Add the dictionary to the users_gamedict       
        users_gamedicts[int(friendID)]=gamedict_new

250


In [13]:
print(len(users_gamedicts))
print(users_gamedicts[int(id)])

68
{'Path of Exile': 16353, 'Europa Universalis IV': 113452, 'Titan Quest Anniversary Edition': 10354, 'Black Desert Online': 3697, 'Crusader Kings II': 5896}


## Task 3.2: Association rule mining

Before we start with the "real" recommender system, let us take a look at a more general form of recommending items using association rules.

The concept of association rule mining is rather simple: Looking at an itemset, one tries to find dependencies between items that could "belong together". A common example would be buying food at the store: If, for example, meat and salt are bought together often, but meat without salt not that often, it is assumed that there is a connection between those two. For games, if it was found that most of the users who own the demo version of a game also own the full version of that game, it would be a reasonable assumption that these users liked the demo and therefore bought the full version.


Let us first cover the mathematical basis for association rules. The most important metrics are **support**, **confidence** and **lift**. The support of an item (or a list of items) is the number of transactions that contain it divided by the total number of transactions; here, one transaction is the list of games of one user. The confidence of a rule is the support of the list of items [x,y,...] divided by the support of x. Lift is a measure describing the correlation between items, with the probabilities estimated by the corresponding supports. Written down mathematically:

$$supp(x)= \frac{\text{number of transactions containing } x}{\text{total number of transactions}}$$

$$conf(x \Rightarrow y) = \frac{supp(x,y)}{supp(x)}$$

$$lift(x \Rightarrow y) = \frac{P(x \cap y)}{P(x) \cdot P(y)}$$



It is important to note that support refers to an item or a list of items, while confidence refers to a rule. Also note that a lift of 1 means that x and y occur independently of each other, while a lift greater than 1 means a positive correlation and a lift less than 1 a negative one.
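To make the three metrics concrete, here is a small sketch that computes them by hand on a toy dataset. The transactions and the helper names `support`, `confidence` and `lift` are invented for illustration; the exercise itself uses `mlxtend` for this.

```python
# Toy transactions: each set is the game library of one hypothetical user.
transactions = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A"},
    {"B", "C"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(x, y, transactions):
    """conf(x => y) = supp(x, y) / supp(x)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


def lift(x, y, transactions):
    """lift(x => y) = supp(x, y) / (supp(x) * supp(y))."""
    return confidence(x, y, transactions) / support(y, transactions)


print(support({"A"}, transactions))            # A appears in 3 of 4 transactions
print(confidence({"A"}, {"B"}, transactions))  # 2 of the 3 transactions with A also contain B
print(lift({"A"}, {"B"}, transactions))        # below 1: A and B co-occur less often than expected
```

In this toy data the rule A => B has a fairly high confidence, yet its lift is below 1, because B is common on its own.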


**1.** Your task here is to first convert the dictionary you created into a list of lists as this is the input required for the algorithm to work. Then, print out the most frequent items using the `min_support` attribute. Finally, print out the association rules and play around with the threshold value to get a reasonable amount of rules. 


**2.** Discuss your results and try to answer the following questions: What kind of recommendations can be made? What does a confidence of 1.0 mean and is it meaningful for recommending games? Can you spot a correlation between the games with the highest support and the rules with the highest confidence? How does this affect the lift?  
**Hint:** Play around with the threshold values until you get a reasonable number of rows (4-30) as output.

In [4]:
gamesofallusers = []

# TODO: Convert the gamedict to a list of lists
for idx in users_gamedicts:
    g = []
    for game in users_gamedicts[idx]:
        g.append(game)
    gamesofallusers.append(g)

# Remove common Steam entries that are not games:
for game in gamesofallusers:
    if 'Dota 2 Test' in game:
        game.remove('Dota 2 Test')
    if 'True Sight' in game:
        game.remove('True Sight')
    if 'True Sight: Episode 1' in game:
        game.remove('True Sight: Episode 1')
    if 'True Sight: Episode 2' in game:
        game.remove('True Sight: Episode 2')
    if 'True Sight: Episode 3' in game:
        game.remove('True Sight: Episode 3')
    if 'True Sight: The Kiev Major Grand Finals' in game:
        game.remove('True Sight: The Kiev Major Grand Finals')
    if 'True Sight: The International 2017' in game:
        game.remove('True Sight: The International 2017')
    if 'True Sight: The International 2018 Finals' in game:
        game.remove('True Sight: The International 2018 Finals') 

In [6]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

te = TransactionEncoder()
# TODO: Tinker around with the values
te_ary = te.fit(gamesofallusers).transform(gamesofallusers)
df = pd.DataFrame(te_ary, columns=te.columns_)
frequent_itemsets = apriori(df, min_support=0.4, use_colnames=True)

frequent_itemsets

| | support | itemsets |
|---|---|---|
| 0 | 0.764706 | (Counter-Strike: Global Offensive) |
| 1 | 0.441176 | (Garry's Mod) |
| 2 | 0.573529 | (Left 4 Dead 2) |
| 3 | 0.455882 | (PAYDAY 2) |
| 4 | 0.514706 | (PLAYERUNKNOWN'S BATTLEGROUNDS) |
| 5 | 0.470588 | (Path of Exile) |
| 6 | 0.455882 | (Terraria) |
| 7 | 0.411765 | (Warframe) |
| 8 | 0.426471 | (Garry's Mod, Counter-Strike: Global Offensive) |
| 9 | 0.544118 | (Left 4 Dead 2, Counter-Strike: Global Offensive) |


In [7]:
from mlxtend.frequent_patterns import association_rules

# TODO: Play around with the threshold value
association_rules(frequent_itemsets, metric="confidence", min_threshold=0.75)

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (Garry's Mod) | (Counter-Strike: Global Offensive) | 0.441176 | 0.764706 | 0.426471 | 0.966667 | 1.264103 | 0.0891 | 7.058824 |
| 1 | (Left 4 Dead 2) | (Counter-Strike: Global Offensive) | 0.573529 | 0.764706 | 0.544118 | 0.948718 | 1.240631 | 0.105536 | 4.588235 |
| 2 | (PAYDAY 2) | (Counter-Strike: Global Offensive) | 0.455882 | 0.764706 | 0.455882 | 1.0 | 1.307692 | 0.107266 | inf |
| 3 | (PLAYERUNKNOWN'S BATTLEGROUNDS) | (Counter-Strike: Global Offensive) | 0.514706 | 0.764706 | 0.470588 | 0.914286 | 1.195604 | 0.07699 | 2.745098 |
| 4 | (Path of Exile) | (Counter-Strike: Global Offensive) | 0.470588 | 0.764706 | 0.426471 | 0.90625 | 1.185096 | 0.066609 | 2.509804 |
| 5 | (Terraria) | (Counter-Strike: Global Offensive) | 0.455882 | 0.764706 | 0.426471 | 0.935484 | 1.223325 | 0.077855 | 3.647059 |
| 6 | (PLAYERUNKNOWN'S BATTLEGROUNDS) | (Left 4 Dead 2) | 0.514706 | 0.573529 | 0.411765 | 0.8 | 1.394872 | 0.116566 | 2.132353 |
| 7 | (Terraria) | (Left 4 Dead 2) | 0.455882 | 0.573529 | 0.411765 | 0.903226 | 1.574855 | 0.150303 | 4.406863 |
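As a sanity check, the first rule of the output above (*Garry's Mod* => *Counter-Strike: Global Offensive*) can be recomputed by hand. The sketch below assumes 68 users, as printed earlier for `len(users_gamedicts)`; the owner counts are not in the output but are inferred by multiplying the reported supports by 68.

```python
n_users = 68   # number of users, as printed for len(users_gamedicts)
n_x = 30       # Garry's Mod players, inferred from 0.441176 * 68
n_y = 52       # CS:GO players, inferred from 0.764706 * 68
n_xy = 29      # players of both, inferred from 0.426471 * 68

supp_x = n_x / n_users
supp_y = n_y / n_users
supp_xy = n_xy / n_users

confidence = supp_xy / supp_x                 # supp(x, y) / supp(x)
lift = confidence / supp_y                    # supp(x, y) / (supp(x) * supp(y))
leverage = supp_xy - supp_x * supp_y          # observed minus expected co-occurrence
conviction = (1 - supp_y) / (1 - confidence)  # undefined when confidence is exactly 1

print(round(confidence, 6), round(lift, 6), round(leverage, 6), round(conviction, 6))
```

The last formula also explains the `inf` conviction of the *PAYDAY 2* rule: with a confidence of 1.0 the denominator becomes zero.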


**TODO: Write your observations here:**
* From the association rule output, one can recommend *Counter-Strike: Global Offensive* to a user who plays *PAYDAY 2*, since this rule has the highest confidence (**1**). The same recommendation can be made to a user who plays *Garry's Mod*, as that rule has the second-highest confidence.
* A confidence of 1 means that every player in the dataset who plays game *x* also plays game *y*. It can therefore be used as a measure to recommend games. However, it does not capture how similar two games are; it only looks at whether *y* is played by the players who play *x*.
* The highest support in `frequent_itemsets` belongs to *Counter-Strike: Global Offensive*. The highest confidence in the `association_rules` output belongs to the rule with *PAYDAY 2* as antecedent and *Counter-Strike: Global Offensive* as consequent. One can infer a correlation: the game with the highest support is the consequent of the rules with the highest confidence. This makes sense because the highest support means that the game is played by most players in the dataset, so it appears together with many other games and is more likely to end up as a consequent, i.e. as a candidate for recommendation.
* Lift behaves differently, as it captures the correlation between antecedents and consequents. The rule with the highest confidence need not have a high lift, as seen in the association rules. This is because confidence does not take into account how common *y* is on its own, and a very popular consequent pushes the lift towards 1. The highest lift (1.57) belongs to the rule with *Terraria* as antecedent and *Left 4 Dead 2* as consequent.




## Task 3.3: The Recommender System: Similarity Score


Finally, it is time to build the recommender system. 

**1.** The first thing to do is to implement a similarity score that will be used to predict a user's playtime of an unowned game. We implement a similarity score between two users by taking the relative distance between two players. We use the following formula:

$$d(u, v) = \sum_{i~\in~\textrm{common_games}} \frac{|r_{u,i} - r_{v,i}|}{r_{v,i}}$$  

Where $u$ and $v$ are users and $r_{u,i}$ is the playtime of user $u$ for game $i$. 

You can then return the similarity with  
$$ w_{u,v} = \frac{1}{1 + d(u, v)} $$

**Note:** If no common games exist return 0.

In [17]:
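# Here we will calculate the similarity score between two friends based on their common games:
def calculate_similarity(user1ID, user2ID):
    # Note: despite their names, both arguments are dicts mapping game name -> playtime in minutes
    # TODO:
    user1 = user1ID.keys()
    user2 = user2ID.keys()
    common_games = [game for game in user1 if game in user2]
    # Without common games the similarity is defined as 0
    if len(common_games) == 0:
        return 0
    # Sum of relative playtime differences, normalised by the second user's playtime
    d = 0
    for i in common_games:
        d += abs(user1ID[i] - user2ID[i])/user2ID[i]
    return 1/(1+d)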
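A worked example with toy numbers may help to check the formula by hand. This is only a sketch: the function `similarity` and the playtimes are invented for illustration, and all playtimes are assumed to be greater than zero, as ensured by the filter in Task 3.1.

```python
def similarity(playtimes_u, playtimes_v):
    """w(u, v) = 1 / (1 + d(u, v)); inputs map game name -> playtime in minutes (> 0)."""
    common = playtimes_u.keys() & playtimes_v.keys()
    if not common:
        return 0
    # Relative distance, normalised by the playtime of the second user v
    d = sum(abs(playtimes_u[g] - playtimes_v[g]) / playtimes_v[g] for g in common)
    return 1 / (1 + d)


u = {"Game A": 100, "Game B": 300}
v = {"Game A": 200, "Game B": 300, "Game C": 50}

# d(u, v) = |100 - 200| / 200 + |300 - 300| / 300 = 0.5
print(similarity(u, v))
# d(v, u) = |200 - 100| / 100 + 0 = 1.0
print(similarity(v, u))
```

Because the distance is normalised by the second user's playtime, the score is not symmetric: swapping the arguments generally gives a different value.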

## Task 3.4: Recommender System: Predict ratings

With the similarity score calculated, we can now predict a user's playtime for games they don't own.   
**1.** First, we create a set of all games, but we delete all games that are owned by fewer than 3 players. The reason is simple: If only 1 or 2 players own a game, it is impossible to derive a meaningful prediction since there is not enough data. 
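One possible way to do this filtering is to count owners with `collections.Counter`, which avoids rescanning the full list once per game. The sketch below uses invented toy libraries; the names `libraries`, `owner_counts` and `popular_games` are not part of the exercise template.

```python
from collections import Counter

# Toy data: one list of game names per hypothetical user.
libraries = [
    ["Game A", "Game B"],
    ["Game A", "Game C"],
    ["Game A", "Game B"],
    ["Game B", "Game D"],
]

# Count each game once per user, then keep games with at least 3 owners.
owner_counts = Counter(game for library in libraries for game in set(library))
popular_games = sorted(game for game, n in owner_counts.items() if n >= 3)
print(popular_games)
```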

The predicted playtime for a game works analogously to the predicted rating of a movie/item in a conventional collaborative filtering recommender system:

$$r_{u,i} = \frac{\sum_{v \in N_i(u)} w_{u,v}r_{v,i}}{\sum_{v \in N_i(u)} w_{u,v}}$$

where 
- $r_{u,i}$ is the estimated recommendation of item $i$ for target user $u$. 
- $N_i(u)$ is the set of similar users to target user $u$ for the designated item $i$. 
- $w_{u,v}$ is the similarity score between users $u$ and $v$ (used as a weighting factor).  

**Note:** In our case, we use playtime as a recommendation measure and the set $N_i(u)$ consists of user $u$'s friends list and friends-of-friends list. In our scenario, we do not need the index $i$ as our friends list does not change between games.
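The prediction formula is a similarity-weighted average, which the following sketch illustrates with invented numbers. The helper `predict_playtime` and the neighbour IDs are hypothetical; only neighbours who own the game contribute to the sums.

```python
def predict_playtime(similarities, playtimes):
    """Similarity-weighted average of the neighbours' playtimes for one game.

    similarities: neighbour id -> w(u, v)
    playtimes:    neighbour id -> playtime of that neighbour for the game (owners only)
    """
    numerator = sum(similarities[v] * playtimes[v] for v in playtimes)
    denominator = sum(similarities[v] for v in playtimes)
    # No prediction is possible if all owners have similarity 0
    return numerator / denominator if denominator else None


w = {"v1": 0.5, "v2": 0.25, "v3": 0.0}  # similarity of the target user to each neighbour
r = {"v1": 600, "v2": 1200}             # v3 does not own the game

# (0.5 * 600 + 0.25 * 1200) / (0.5 + 0.25) = 600 / 0.75 = 800
print(predict_playtime(w, r))
```

The more similar neighbour `v1` pulls the prediction towards its own playtime of 600 minutes, away from the plain average of 900.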

In [18]:
# List of all games that are owned by at least 1 person:
allGames = []
for user in gamesofallusers:
    for game in user:
        allGames.append(game)
        
# TODO : Create a list of games owned by at least 3 people
allGames_unique = list(set(allGames))
games = [g for g in allGames_unique if allGames.count(g) >=3]
print('Number of unique games played by >3 ', len(games))

# Find out which games you do not own out of all games because we are only interested in recommendations for games that we do not own
def difference(allGames, usersgames): 
    # TODO:
    not_owned_games = [g for g in allGames if g not in usersgames]
    return not_owned_games

    
# Predict ratings based on the formula above for each unowned game
def predict_ratings():
    # TODO:
    '''Hint: Iterate over all unowned games and for each game calculate a rating based
           on your friends playtime and similarity score ''' 
    not_owned_games = difference(games, users_gamedicts[int(id)])
    rating = {}
    for game in not_owned_games:
        score_nr = 0
        score_dr = 0
        for user in users_gamedicts:
            if int(id) != user and game in users_gamedicts[user].keys():
                sim = calculate_similarity(users_gamedicts[int(id)],users_gamedicts[user])
                score_nr += sim* users_gamedicts[user][game]
                score_dr += sim
        if score_dr != 0:
            rating.update({game:score_nr/score_dr})
    return rating


Number of unique games played by >3  641


In [19]:
rating = predict_ratings()
print(rating)

{'HAWKEN': 41.88735545139528, 'Batman: Arkham City GOTY': 323.7850226027374, 'Transistor': 157.5891460842712, 'Krater': 77.43858686482692, 'Turbo Pug': 156.72575160943322, 'They Are Billions': 84.02290441339791, 'SUPERHOT': 311.3198704029416, 'Dead Island': 1140.9692962687593, 'Pid ': 70.82776617954072, 'Divinity: Original Sin Enhanced Edition': 5628.571447592017, 'Left 4 Dead 2': 1322.1112013303361, 'Call of Duty: Modern Warfare 2 - Multiplayer': 11687.075495634355, 'Clicker Heroes': 1372.0575151391722, 'Max Payne': 1004.9807357087033, 'Mortal Kombat Komplete Edition': 520.209806294175, 'Psychonauts': 108.5775576677788, 'Tropico 5': 785.748264413959, 'Pillars of Eternity II: Deadfire': 821.4538950835971, 'Far Cry® 3': 1297.3699339845082, 'Democracy 3': 150.45933003829765, 'Bad Rats': 167.02939008286793, 'DmC Devil May Cry': 2034.9114768702289, 'Sakura Clicker': 75.42639018405157, 'American Truck Simulator': 490.6760428401104, 'Startup Company': 856.0029377561455, 'Sonic & All-Stars Ra

## Task 3.5: Recommender System: Discussion

**1.** Sort the predicted ratings by estimated playtime (highest first) and print out the top 5 predictions for you (or the default user if you are using the default ID). 

**2.** Discuss the difference in recommendations between the collaborative filtering approach and the association rule approach. Would you consider one more accurate than the other? Why/why not?

In [20]:
# TODO 1: 
rating_sorted = sorted(rating.items(), key = lambda kv:(kv[1], kv[0]), reverse=True)
print(rating_sorted[:5])

[('Counter-Strike: Global Offensive', 42631.120578786366), ('Total War: WARHAMMER', 33085.12716617738), ("Tom Clancy's Rainbow Six Siege", 32486.166093619846), ('Hunt: Showdown', 25994.185734620685), ('ARK: Survival Evolved', 24442.32449942682)]


**TODO 2: Write your observations here:** <br>
The top five recommendations are:
* Counter-Strike: Global Offensive
* Total War: WARHAMMER
* Tom Clancy's Rainbow Six Siege
* Hunt: Showdown
* ARK: Survival Evolved <br>
There is little overlap between the recommendations of collaborative filtering and association rules. *Counter-Strike: Global Offensive* is the one common recommendation, but it appeared in the association rules mainly because of its higher support compared to other games. 
I would consider **collaborative filtering** more accurate than association rules for the following reasons: <br>
**1.** Collaborative filtering also takes playtime into consideration. <br>
**2.** Recommending more than a couple of games via association rules is uncommon, whereas with collaborative filtering we can pick the top *n* games. <br>
**3.** Association rules are mined from the whole dataset and are therefore generic rather than tailored to one particular user. <br>
