# Recommender

In this notebook, we explore the data and build two prototypes, a simple recommender and a content-based recommender, to decide how the final recommender should be built.

---

## Import Libraries

In this section, we will import all the libraries that will be used in this notebook.

In [1]:
# For Calculation and Data Manipulation
import numpy as np
import pandas as pd
import math

# for cosine similarity calculation
from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse

# For creating folders when exporting files
import os

# for datetime conversion
import datetime

# for database access
import sqlite3

# import helper function defined in utils.py
from utils import get_recommendations

# this setting widens how many characters pandas will display in a column:
pd.options.display.max_colwidth = 400

# this setting allows us to see up to 50 columns
pd.options.display.max_columns = 50

---

## Functions

This section summarises the functions used in this notebook. They are defined in [utils.py](./utils.py).

1. `get_recommendations` : get top 10 recommendations based on cosine similarity
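
Since `utils.py` is not reproduced in this notebook, the sketch below is only a guess at what a function with this call signature might do. The name `get_recommendations_sketch`, the `top_n` parameter, and the toy data are assumptions for illustration, and the sketch assumes the dataframe has a default integer index so that index labels match row positions in the similarity matrix.

```python
import numpy as np
import pandas as pd


def get_recommendations_sketch(df, indices, title, cos_sim, top_n=10):
    """Return the top_n rows of df most similar to title (hypothetical helper)."""
    idx = indices[title]
    # If several games share a name, indices[title] is a Series; use the first match.
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]

    # Similarity of every game to the chosen game, labelled by df's index.
    scores = pd.Series(cos_sim[idx], index=df.index)
    # Drop the game itself, then keep the highest-scoring rows.
    scores = scores.drop(idx).sort_values(ascending=False).head(top_n)

    result = df.loc[scores.index, ["name"]].copy()
    result["cos_sim"] = scores
    return result


# Toy data: three games and a hand-written similarity matrix.
toy_games = pd.DataFrame({"name": ["A", "B", "C"]})
toy_sim = np.array([
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.4],
    [0.1, 0.4, 1.0],
])
toy_indices = pd.Series(toy_games.index, index=toy_games["name"])
toy_result = get_recommendations_sketch(toy_games, toy_indices, "A", toy_sim, top_n=2)
print(toy_result)
```

The duplicate-name check matters because a name-to-index mapping only returns a single position when names are unique.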

---

## Read data file

First, we will connect to the database. 

In [2]:
# connecting to DB file
con = sqlite3.connect('../data/steam_db.db')

In [3]:
# ensure that the connection is established
sql_query = '''
SELECT *
FROM main
LIMIT 5
'''

pd.read_sql(sql_query, con)

Unnamed: 0,steam_appid,name,release_date,type,developer,publisher,num_packages
0,10.0,Counter-Strike,2000-11-01 00:00:00,game,Valve,Valve,2
1,20.0,Team Fortress Classic,1999-04-01 00:00:00,game,Valve,Valve,1
2,30.0,Day of Defeat,2003-05-01 00:00:00,game,Valve,Valve,1
3,40.0,Deathmatch Classic,2001-06-01 00:00:00,game,Valve,Valve,1
4,50.0,Half-Life: Opposing Force,1999-11-01 00:00:00,game,Gearbox Software,Valve,1


Referring back to our EDA, we know that the database contains the following tables:

1. main
2. genre
3. genre_mapping
4. categories
5. categories_mapping
6. description
7. price
8. statistics
9. media
10. requirements
11. tag
12. language
13. support_info
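
The table list can also be read directly from the database. The sketch below queries SQLite's built-in `sqlite_master` catalogue on a throwaway in-memory database; the two toy tables and the `demo_con` name are stand-ins for illustration, not the contents of `steam_db.db`.

```python
import sqlite3

# In-memory database used only for illustration; the notebook's
# real connection is `con`, opened on '../data/steam_db.db'.
demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE main (steam_appid INTEGER, name TEXT)")
demo_con.execute("CREATE TABLE genre (steam_appid INTEGER, genre_id_1 INTEGER)")

# sqlite_master holds one row per schema object; keep only tables.
rows = demo_con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()
table_names = [row[0] for row in rows]
print(table_names)
demo_con.close()
```

Running the same query against `con` should list the tables above, assuming the database has not changed since the EDA.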

---

## Simple Recommender

We will use a weighted average as the score. It combines three inputs, each weighted according to its importance relative to the others.

Mathematically, it is represented as follows:

$\text{Weighted Average (WA)} = (\frac{1}{6} \cdot F) + (\frac{2}{6} \cdot O) + (\frac{3}{6} \cdot P)$

where
- P is the share of positive reviews, `positive / (positive + negative)`, with a weight of 3.
- O is the midpoint estimate of the number of owners, `(max_owners + min_owners) / 2`, with a weight of 2.
- F is `average_forever`, the average playtime since March 2009 in minutes, with a weight of 1.

The weights listed here follow the formula above and the `weighted_review` function below.
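
As a worked example, the sketch below applies the formula to Counter-Strike, using the values shown for it in the output tables of this notebook. The variable names `terms` and `wa` are introduced only for illustration.

```python
# Values for Counter-Strike taken from the tables in this notebook.
F = 8690.0         # average_forever, in minutes
O = 15_000_000.0   # midpoint of min_owners and max_owners
P = 0.975047       # positive / (positive + negative)

# Each term of the weighted average, kept separate to compare their sizes.
terms = {
    "F": 1 / 6 * F,
    "O": 2 / 6 * O,
    "P": 3 / 6 * P,
}
wa = sum(terms.values())

for name, value in terms.items():
    print(f"{name}: {value:,.2f} ({value / wa:.4%} of the score)")
print(f"WA: {wa:,.2f}")
```

Because the three inputs are on very different scales (a fraction, minutes, and millions of owners), the owner term makes up over 99.9% of this score. Rescaling the inputs to a common range before weighting is one option to consider if the weights are meant to reflect relative importance.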

In [4]:
# function to calculate weighted review
def weighted_review(x):
    P = x['percentage_positive']
    O = x['midpt_est_owners']
    F = x['average_forever']
    
    # calculation based on formula
    return ((1/6 * F) + (2/6 * O) + (3/6 * P))

In [5]:
# load statistics together with name and developer from main
sql_query = '''
SELECT statistics.*, main.name, main.developer
FROM main
INNER JOIN statistics
  ON main.steam_appid = statistics.steam_appid
'''

df_sm = pd.read_sql(sql_query, con)

In [6]:
# see shape and first 5 rows
print(df_sm.shape)
df_sm.head()

(49015, 15)


Unnamed: 0,steam_appid,average_2weeks,average_forever,ccu,median_2weeks,median_forever,negative,positive,userscore,min_owners,max_owners,review_score,review_percent,name,developer
0,10.0,212.0,8690.0,16837.0,116.0,239.0,4944.0,193192.0,0.0,10000000,20000000,188248.0,3.966456,Counter-Strike,Valve
1,20.0,0.0,2752.0,77.0,0.0,16.0,896.0,5416.0,0.0,5000000,10000000,4520.0,0.095238,Team Fortress Classic,Valve
2,30.0,0.0,4250.0,139.0,0.0,28.0,557.0,5007.0,0.0,5000000,10000000,4450.0,0.093763,Day of Defeat,Valve
3,40.0,0.0,5083.0,5.0,0.0,7.0,412.0,1854.0,0.0,5000000,10000000,1442.0,0.030383,Deathmatch Classic,Valve
4,50.0,0.0,3223.0,139.0,0.0,156.0,664.0,13298.0,0.0,5000000,10000000,12634.0,0.266203,Half-Life: Opposing Force,Gearbox Software


In [8]:
# create percentage_positive
df_sm['percentage_positive'] = df_sm['positive'] / (df_sm['positive'] + df_sm['negative'])

# create midpt_est_owners
df_sm['midpt_est_owners'] = (df_sm['max_owners'] + df_sm['min_owners']) / 2

In [9]:
# see shape and first 5 rows
print(df_sm.shape)
df_sm.head()

(49015, 17)


Unnamed: 0,steam_appid,average_2weeks,average_forever,ccu,median_2weeks,median_forever,negative,positive,userscore,min_owners,max_owners,review_score,review_percent,name,developer,percentage_positive,midpt_est_owners
0,10.0,212.0,8690.0,16837.0,116.0,239.0,4944.0,193192.0,0.0,10000000,20000000,188248.0,3.966456,Counter-Strike,Valve,0.975047,15000000.0
1,20.0,0.0,2752.0,77.0,0.0,16.0,896.0,5416.0,0.0,5000000,10000000,4520.0,0.095238,Team Fortress Classic,Valve,0.858048,7500000.0
2,30.0,0.0,4250.0,139.0,0.0,28.0,557.0,5007.0,0.0,5000000,10000000,4450.0,0.093763,Day of Defeat,Valve,0.899892,7500000.0
3,40.0,0.0,5083.0,5.0,0.0,7.0,412.0,1854.0,0.0,5000000,10000000,1442.0,0.030383,Deathmatch Classic,Valve,0.818182,7500000.0
4,50.0,0.0,3223.0,139.0,0.0,156.0,664.0,13298.0,0.0,5000000,10000000,12634.0,0.266203,Half-Life: Opposing Force,Gearbox Software,0.952442,7500000.0


In [10]:
# create new column to store weighted average
df_sm['wr'] = df_sm.apply(weighted_review, axis=1)

In [11]:
# sort df by the weighted average 
df_sm = df_sm.sort_values('wr', ascending=False)

# look at top 30 recommended games
df_sm[['name', 'developer', 'wr']].head(30)

Unnamed: 0,name,developer,wr
22,Dota 2,Valve,50006320.0
25,Counter-Strike: Global Offensive,"Valve, Hidden Path Entertainment",25005040.0
13123,PUBG: BATTLEGROUNDS,"KRAFTON, Inc.",25003770.0
19,Team Fortress 2,Valve,25001680.0
29986,New World,Amazon Games,25001140.0
2229,Rust,Facepunch Studios,11669550.0
5525,Tom Clancy's Rainbow Six® Siege,Ubisoft Montreal,11668800.0
2714,Grand Theft Auto V,Rockstar North,11668730.0
137,Garry's Mod,Facepunch Studios,11668420.0
3615,Unturned,Smartly Dressed Games,11668290.0


In [12]:
# see shape and first 5 rows
print(df_sm.shape)
df_sm.head()

(49015, 18)


Unnamed: 0,steam_appid,average_2weeks,average_forever,ccu,median_2weeks,median_forever,negative,positive,userscore,min_owners,max_owners,review_score,review_percent,name,developer,percentage_positive,midpt_est_owners,wr
22,570.0,1939.0,37928.0,712921.0,986.0,891.0,274066.0,1400561.0,0.0,100000000,200000000,1126495.0,23.735672,Dota 2,Valve,0.836342,150000000.0,50006320.0
25,730.0,894.0,30209.0,906236.0,316.0,7525.0,734382.0,5480320.0,0.0,50000000,100000000,4745938.0,99.998694,Counter-Strike: Global Offensive,"Valve, Hidden Path Entertainment",0.881832,75000000.0,25005040.0
13123,578080.0,425.0,22644.0,510496.0,149.0,8099.0,869761.0,1090084.0,0.0,50000000,100000000,220323.0,4.642288,PUBG: BATTLEGROUNDS,"KRAFTON, Inc.",0.556209,75000000.0,25003770.0
19,440.0,936.0,10059.0,75282.0,143.0,323.0,53454.0,793165.0,0.0,50000000,100000000,739711.0,15.585988,Team Fortress 2,Valve,0.936862,75000000.0,25001680.0
29986,1063730.0,1664.0,6813.0,37443.0,1053.0,3063.0,69142.0,150457.0,0.0,50000000,100000000,81315.0,1.713338,New World,Amazon Games,0.685144,75000000.0,25001140.0


In [13]:
# look at the statistics of 3 columns: 'positive', 'negative', 'wr'
df_sm[['positive', 'negative', 'wr']].describe()

Unnamed: 0,positive,negative,wr
count,49015.0,49015.0,48745.0
mean,1455.115,237.291339,44248.93
std,30153.8,5717.088665,437655.6
min,0.0,0.0,3333.333
25%,5.0,1.0,3333.689
50%,20.0,7.0,3333.819
75%,117.0,34.0,11667.13
max,5480320.0,869761.0,50006320.0
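
The `wr` count (48,745) is lower than the row count (49,015). A likely cause, though not confirmed here, is games with no reviews at all: when `positive` and `negative` are both 0, the ratio is 0/0, which pandas evaluates to `NaN`, and `describe()` skips missing values. The toy frame below reproduces that behaviour; its values are made up.

```python
import pandas as pd

# Two made-up games: one with reviews, one with none.
toy = pd.DataFrame({
    "positive": [90.0, 0.0],
    "negative": [10.0, 0.0],
})
# Float division 0.0 / 0.0 yields NaN rather than raising an error.
toy["percentage_positive"] = toy["positive"] / (toy["positive"] + toy["negative"])
print(toy)

# count() ignores NaN, so only the game with reviews is counted.
n_valid = int(toy["percentage_positive"].count())
print(n_valid)
```

Checking `df_sm['wr'].isna().sum()` on the real data would show whether this accounts for all of the missing scores.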


---

## Content-Based Recommender: `genre`, `categories`, `language`, `tag`

Let us try to build a content-based recommender using `genre`, `categories`, `language`, and `tag`. For `tag`, we convert every column to `0` or `1`, i.e. a simple no/yes flag for whether the tag applies to the game.
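
To illustrate the idea, the sketch below converts sentinel values to 0/1 flags and computes pairwise cosine similarity with plain NumPy. The three toy games and their tag scores are made up; `-9999` is the value this notebook treats as "tag absent" in the `tag` table.

```python
import numpy as np

SENTINEL = -9999  # value treated as "tag absent"

# Rows are games, columns are tags; values are made-up tag scores.
toy_tags = np.array([
    [120, SENTINEL, 45],
    [300, SENTINEL, 10],
    [SENTINEL, 80, SENTINEL],
])

# 1 where the tag is present, 0 where it is the sentinel.
binary = (toy_tags != SENTINEL).astype(float)

# Cosine similarity: dot product divided by the product of row norms.
norms = np.linalg.norm(binary, axis=1, keepdims=True)
toy_cos_sim = (binary @ binary.T) / (norms * norms.T)
print(toy_cos_sim.round(3))
```

After the conversion, two games with the same set of tags score 1 regardless of the original tag values, and games with no tags in common score 0.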

In [14]:
# load id, name and developer from main
sql_query = '''
SELECT steam_appid, name, developer
FROM main
'''

df_main = pd.read_sql(sql_query, con)

In [15]:
sql_query = '''
SELECT *
FROM genre
'''

df_genre = pd.read_sql(sql_query, con)

In [16]:
sql_query = '''
SELECT *
FROM categories
'''

df_categories = pd.read_sql(sql_query, con)

In [17]:
sql_query = '''
SELECT *
FROM language
'''

df_language = pd.read_sql(sql_query, con)

In [18]:
sql_query = '''
SELECT *
FROM tag
'''

df_tag = pd.read_sql(sql_query, con)

In [19]:
# drop columns that are not required
df_genre = df_genre.drop(columns=["genre_id", "genre"])
df_categories = df_categories.drop(columns=['categories_id','categories_description'])
df_language = df_language.drop(columns=['languages'])

# create a copy of df_tag
df_tag2 = df_tag.copy()

# create list of columns that will be affected in the replacing
list_tag_temp = list(df_tag.columns)
list_tag_temp.remove("steam_appid")

# replace the values in tag2
for col in list_tag_temp:
    df_tag2[col] = df_tag2[col].apply(lambda x: 0 if x == -9999 else 1)

In [20]:
# create df to calculate cosine similarity
df_cbr = df_genre.copy()
df_cbr = df_cbr.join(df_categories.set_index("steam_appid"), on="steam_appid")
df_cbr = df_cbr.join(df_language.set_index("steam_appid"), on="steam_appid")
df_cbr = df_cbr.join(df_tag2.set_index("steam_appid"), on="steam_appid")

# drop the column used for index setting
df_cbr = df_cbr.drop(columns=['steam_appid'])

In [21]:
# see shape and first 2 rows
print(df_cbr.shape)
df_cbr.head(2)

(49015, 530)


Unnamed: 0,genre_id_1,genre_id_18,genre_id_2,genre_id_23,genre_id_25,genre_id_28,genre_id_29,genre_id_3,genre_id_37,genre_id_4,genre_id_50,genre_id_51,genre_id_52,genre_id_53,genre_id_54,genre_id_55,genre_id_56,genre_id_57,genre_id_58,genre_id_59,genre_id_60,genre_id_70,genre_id_71,genre_id_72,genre_id_73,...,Escape Room,Spelling,Roguelike Deckbuilder,Action RTS,VR Only,Skateboarding,Battle Royale,Wrestling,Steam Machine,Hockey,Boss Rush,Social Deduction,Baseball,Jet,Asymmetric VR,Faith,BMX,Hardware,Foreign,Electronic,360 Video,8-bit Music,Rock Music,Instrumental Music,Masterpiece
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [22]:
# calculate cosine similarity
cos_sim = cosine_similarity(df_cbr, df_cbr)

In [23]:
# create reverse mapping of name and index
indices = pd.Series(df_main.index, index=df_main['name'])

In [24]:
# get recommendations based on cosine similarity
get_recommendations(df_main, indices, "Dota 2", cos_sim)

Unnamed: 0,name,cos_sim
25,Counter-Strike: Global Offensive,0.756543
31889,Zombie Island,0.687073
21258,Milky Way Map,0.68538
31495,LAB Defence,0.681623
23750,School of Horror,0.680817
21389,China VS Roman,0.675566
27607,Puzzle game for kids,0.675566
6486,none,0.672827
11940,Half-Life: Alyx,0.672827
36575,Shoot Covid-19 | 消灭新冠肺炎,0.669186


In [25]:
# get recommendations based on cosine similarity
get_recommendations(df_main, indices, "Half-Life 2: Lost Coast", cos_sim)

Unnamed: 0,name,cos_sim
11,Half-Life: Source,0.776206
18,Half-Life 2: Episode Two,0.736363
16,Half-Life 2: Episode One,0.703526
8,Half-Life: Blue Shift,0.682789
6,Half-Life,0.677739
9,Half-Life 2,0.671442
4,Half-Life: Opposing Force,0.646508
123,Sniper Elite,0.622799
418,Virtual Villagers: A New Home,0.604069
192,Just Cause,0.603023


---

## Conclusion

After comparing the two recommenders, we will build a content-based recommender. How to build it is explored in another notebook.