<h2><center> Deriving the Preference and Confidence Matrices </center></h2>

In the standard recommender-system setting with explicit feedback, we work with a single entity: the *ratings matrix*. With implicit feedback, we instead observe signals such as clicks, views, or playtime. In our demo, we will use a *playtime matrix*, which records the number of hours each user has played each game.

The researchers from AT&T suggest deriving two new matrices from the *playtime matrix*: the *confidence matrix* and the *preference matrix*.

The *preference matrix* is a binary matrix which encodes whether a given user likes a particular item. Basically, if you've played a game, we assume you like it.  
The *confidence matrix* is a floating-point matrix which encodes how confident we are that you actually like or dislike the game. Basically, the more you play, the more confident we are.

In [5]:
import pandas as pd
import json
from collections import Counter
from IPython.display import display, HTML

DATA_DIR = "data_dir/train-test-split-v2.json"
#display(HTML("<style>.container { width:90% !important; }</style>"))

In [6]:
data = pd.read_json(DATA_DIR, lines=True,dtype={'steam_id':str})
data.head()

,steam_id,games,valid_games,train_games
0,76561198046794970,"[{'appid': 205790, 'name': 'Dota 2 Test', 'pla...","[{'appid': 570, 'name': 'Dota 2', 'playtime_fo...","[{'appid': 214420, 'name': 'Gear Up', 'playtim..."
1,76561198029282766,"[{'appid': 23490, 'name': 'Tropico 3 - Steam S...","[{'appid': 15700, 'name': 'Oddworld: Abe's Odd...","[{'appid': 23490, 'name': 'Tropico 3 - Steam S..."
2,76561198044276154,"[{'appid': 220, 'name': 'Half-Life 2', 'playti...","[{'appid': 1083500, 'name': 'PlanetSide 2 - Te...","[{'appid': 340, 'name': 'Half-Life 2: Lost Coa..."
3,76561198047345390,"[{'appid': 40800, 'name': 'Super Meat Boy', 'p...","[{'appid': 238960, 'name': 'Path of Exile', 'p...","[{'appid': 40800, 'name': 'Super Meat Boy', 'p..."
4,76561198065940354,"[{'appid': 12210, 'name': 'Grand Theft Auto IV...","[{'appid': 362960, 'name': 'Tyranny', 'playtim...","[{'appid': 12220, 'name': 'Grand Theft Auto: E..."


Just for fun, let's see which games are the most popular in terms of installs and total playtime.

In [7]:
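installs = Counter()  # number of users who have each game in their library
playtime = Counter()  # total hours played per game, summed over all users
for games in data['games']:
    for g in games:
        installs[g['name']] += 1
        # playtime_forever is reported in minutes; convert to hours
        playtime[g['name']] += g['playtime_forever'] / 60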

In [8]:
installs.most_common(20)

[('Counter-Strike: Global Offensive', 6846),
 ('PAYDAY 2', 4689),
 ('Dota 2 Test', 4673),
 ('Left 4 Dead 2', 4638),
 ('Team Fortress 2', 4589),
 ('Dota 2', 4549),
 ("Garry's Mod", 3949),
 ("PLAYERUNKNOWN'S BATTLEGROUNDS", 3878),
 ('Unturned', 3665),
 ('Portal 2', 3620),
 ('Warframe', 3619),
 ('Borderlands 2', 3271),
 ('Rocket League', 3238),
 ('Counter-Strike: Source', 3207),
 ('Half-Life 2: Lost Coast', 3139),
 ('Insurgency', 3085),
 ('Terraria', 3007),
 ('Portal', 2995),
 ('Z1 Battle Royale', 2985),
 ('H1Z1: Test Server', 2985)]

In [9]:
playtime.most_common(20)

[('Counter-Strike: Global Offensive', 6684950.349999996),
 ('Dota 2', 5883273.100000011),
 ('Team Fortress 2', 1155053.0833333305),
 ("PLAYERUNKNOWN'S BATTLEGROUNDS", 981749.5333333337),
 ('Rocket League', 703944.0166666653),
 ('Counter-Strike: Source', 678768.7500000023),
 ("Garry's Mod", 622710.3666666658),
 ('Rust', 563462.4333333314),
 ('Counter-Strike', 500229.6666666664),
 ('Arma 3', 483356.0833333336),
 ('Grand Theft Auto V', 482640.99999999895),
 ('Warframe', 389107.84999999934),
 ('Path of Exile', 367785.2500000002),
 ("Tom Clancy's Rainbow Six Siege", 331102.03333333303),
 ('ARK: Survival Evolved', 275166.39999999997),
 ('PAYDAY 2', 272460.70000000024),
 ('Terraria', 223610.6833333334),
 ('The Elder Scrolls V: Skyrim', 222866.44999999972),
 ('Left 4 Dead 2', 216298.8166666663),
 ('Clicker Heroes', 210129.73333333357)]

For demo purposes, we'll only look at the 20 games with the highest total playtime.

In [10]:
MOST_PLAYED_GAMES = [x[0] for x in playtime.most_common(20)]

In [11]:
playtime = []  # one Series of per-game hours for each user
for user_id, r in data.set_index('steam_id').iterrows():
    # keep only the games that are among the 20 most played
    top_games = [g for g in r['games'] if g['name'] in MOST_PLAYED_GAMES]
    if not top_games:
        continue
    user_playtime = pd.Series([g['playtime_forever'] / 60 for g in top_games],
                              index=[g['name'] for g in top_games],
                              name=user_id, dtype=float)
    # align to the common column order; games the user does not own get 0 hours
    user_playtime = user_playtime.reindex(MOST_PLAYED_GAMES).fillna(0)
    playtime.append(user_playtime)


In [12]:
playtime_matrix = pd.DataFrame(playtime)
playtime_matrix.index.name = "Steam Id"

# Playtime matrix 

In [13]:
playtime_matrix.iloc[:5].style.set_caption("Playtime matrix").format(precision=4)

Steam Id,Counter-Strike: Global Offensive,Dota 2,Team Fortress 2,PLAYERUNKNOWN'S BATTLEGROUNDS,Rocket League,Counter-Strike: Source,Garry's Mod,Rust,Counter-Strike,Arma 3,Grand Theft Auto V,Warframe,Path of Exile,Tom Clancy's Rainbow Six Siege,ARK: Survival Evolved,PAYDAY 2,Terraria,The Elder Scrolls V: Skyrim,Left 4 Dead 2,Clicker Heroes
76561198046794970,41.55,1193.9667,18.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.15,0.0
76561198029282766,842.8,4809.0667,1.9333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.85,0.0
76561198044276154,218.6833,1800.8,5.4333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.25,0.0,0.0
76561198047345390,0.0,5845.1833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.4833,0.0,0.0,0.0,0.0,0.0,0.0,0.0
76561198065940354,504.9667,3313.55,0.45,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.15,0.0,0.0,0.0,0.0,0.0,93.8333,6.9667,0.0


# Who is the biggest Dota 2 player?

In [14]:
playtime_matrix.sort_values("Dota 2", ascending=False)

Steam Id,Counter-Strike: Global Offensive,Dota 2,Team Fortress 2,PLAYERUNKNOWN'S BATTLEGROUNDS,Rocket League,Counter-Strike: Source,Garry's Mod,Rust,Counter-Strike,Arma 3,Grand Theft Auto V,Warframe,Path of Exile,Tom Clancy's Rainbow Six Siege,ARK: Survival Evolved,PAYDAY 2,Terraria,The Elder Scrolls V: Skyrim,Left 4 Dead 2,Clicker Heroes
76561197970493955,116.666667,21189.216667,1.516667,23.550000,0.150000,3316.916667,0.0,0.000000,19.566667,1425.85,0.000000,0.000000,0.000000,0.000000,0.0,0.00,0.0,0.933333,9.666667,0.0
76561198140567423,22069.583333,20025.900000,19868.116667,12891.633333,0.000000,0.000000,0.0,13557.600000,0.000000,0.00,0.000000,3.233333,3.050000,0.000000,0.0,19817.85,0.0,0.000000,0.000000,0.0
76561198051623299,8.433333,18149.283333,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.0,0.00,0.0,0.000000,0.000000,0.0
76561198065780626,172.133333,17559.100000,10.316667,2022.000000,17.483333,0.000000,0.0,7.100000,0.000000,0.00,32.533333,0.000000,0.000000,9.333333,0.0,0.00,0.0,1.166667,16.966667,0.0
76561198072036324,98.183333,17535.850000,0.766667,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.00,0.000000,0.000000,58.966667,0.000000,0.0,0.00,0.0,0.000000,41.433333,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76561198201941911,656.266667,0.000000,0.300000,0.000000,0.000000,0.000000,0.0,395.916667,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.0,0.00,0.0,0.000000,1.466667,0.0
76561198205082092,278.233333,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.0,0.00,0.0,0.000000,0.000000,0.0
76561198313506492,642.283333,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.0,0.00,0.0,0.000000,0.000000,0.0
76561198317506140,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.00,0.000000,0.000000,0.000000,0.000000,0.0,0.00,0.0,0.000000,0.000000,0.0


# Preference & Confidence Matrix 

The *preference matrix* is simply a binary matrix indicating whether the user has played a given game. We are free to derive it in another way, but this is the simplest.

The *confidence matrix* reflects our confidence in the preference, i.e. the more a user has played a game, the more confident we are that they like it. A simple way to derive it is $1 + c \cdot P$, where $P$ is the playtime matrix and $c$ is a new hyperparameter.
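
Written element-wise, the two definitions can be sketched as follows. Here $P_{ui}$ is the entry of the playtime matrix for user $u$ and game $i$ (in hours); the names $\mathrm{pref}_{ui}$ and $\mathrm{conf}_{ui}$ are introduced only for this illustration.

$$
\mathrm{pref}_{ui} =
\begin{cases}
1 & \text{if } P_{ui} > 0 \\
0 & \text{if } P_{ui} = 0
\end{cases}
\qquad\qquad
\mathrm{conf}_{ui} = 1 + c \cdot P_{ui}
$$

For example, with $c = 40$, a user who has played a game for $2.25$ hours gets $\mathrm{pref}_{ui} = 1$ and $\mathrm{conf}_{ui} = 1 + 40 \cdot 2.25 = 91$, while a game that was never played gets $\mathrm{pref}_{ui} = 0$ and the baseline confidence $\mathrm{conf}_{ui} = 1$.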

In [15]:
C = 40  # confidence scaling hyperparameter (c in the formula above)
preference_matrix = (playtime_matrix > 0).astype(int)
confidence_matrix = 1 + C * playtime_matrix

In [16]:
display(preference_matrix.head().style.set_caption("Preference Matrix"))
display(confidence_matrix.head().style.set_caption("Confidence Matrix"))

Steam Id,Counter-Strike: Global Offensive,Dota 2,Team Fortress 2,PLAYERUNKNOWN'S BATTLEGROUNDS,Rocket League,Counter-Strike: Source,Garry's Mod,Rust,Counter-Strike,Arma 3,Grand Theft Auto V,Warframe,Path of Exile,Tom Clancy's Rainbow Six Siege,ARK: Survival Evolved,PAYDAY 2,Terraria,The Elder Scrolls V: Skyrim,Left 4 Dead 2,Clicker Heroes
76561198046794970,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
76561198029282766,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
76561198044276154,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
76561198047345390,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
76561198065940354,1,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,1,0


Steam Id,Counter-Strike: Global Offensive,Dota 2,Team Fortress 2,PLAYERUNKNOWN'S BATTLEGROUNDS,Rocket League,Counter-Strike: Source,Garry's Mod,Rust,Counter-Strike,Arma 3,Grand Theft Auto V,Warframe,Path of Exile,Tom Clancy's Rainbow Six Siege,ARK: Survival Evolved,PAYDAY 2,Terraria,The Elder Scrolls V: Skyrim,Left 4 Dead 2,Clicker Heroes
76561198046794970,1663.0,47759.666667,725.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,7.0,1.0
76561198029282766,33713.0,192363.666667,78.333333,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,35.0,1.0
76561198044276154,8748.333333,72033.0,218.333333,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,91.0,1.0,1.0
76561198047345390,1.0,233808.333333,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,260.333333,1.0,1.0,1.0,1.0,1.0,1.0,1.0
76561198065940354,20199.666667,132543.0,19.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,167.0,1.0,1.0,1.0,1.0,1.0,3754.333333,279.666667,1.0
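
The two tables above are linked: wherever the preference is 0, the confidence sits at its baseline of 1, and wherever the preference is 1, the confidence is larger than 1. A minimal sketch of how this could be checked is given below. It uses a hand-made toy playtime matrix rather than the real data; the user and game names (`user_1`, `Game A`, ...) and the `toy_` variables are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical toy playtime matrix in hours (names are made up).
toy_playtime = pd.DataFrame(
    {"Game A": [2.0, 0.0], "Game B": [0.0, 0.5]},
    index=["user_1", "user_2"],
)

C = 40  # same scaling constant as in the cell above

# Same derivation as for the real data.
toy_preference = (toy_playtime > 0).astype(int)
toy_confidence = 1 + C * toy_playtime

# Boolean mask of the unplayed (user, game) pairs.
unplayed = (toy_preference == 0).to_numpy()
conf_values = toy_confidence.to_numpy()

# Unplayed games keep the baseline confidence of exactly 1.
assert (conf_values[unplayed] == 1).all()
# Played games always end up above the baseline.
assert (conf_values[~unplayed] > 1).all()

print(toy_preference)
print(toy_confidence)
```

The same two assertions could be run against `preference_matrix` and `confidence_matrix` as a quick sanity check after changing how either matrix is derived.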
