# Recommender : Model 1

Now that we have finished looking through the recommender options, we will build the recommender using two types of model.

In this notebook, we will build model 1, taking in `genre`, `categories`, `language`, `description`, `weighted average` and `tag`.

---

## Import Libraries

In this section, we will import all the libraries that will be used in this notebook.

In [1]:
# For Calculation and Data Manipulation
import numpy as np
import pandas as pd
import math

# for data visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# for cosine similarity calculation
from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# for Approximate nearest neighbor
import hnswlib

# for NLP
from nltk.stem import WordNetLemmatizer
# from nltk.corpus import stopwords
from stopwordsiso import stopwords
from sklearn.feature_extraction.text import CountVectorizer
import re, string

# for folder creation when exporting files
import os

# for datetime conversion
import datetime

# for database access
import sqlite3

# import helper functions from utils.py
from utils import get_recommendations, fit_hnsw_index

# this setting widens how many characters pandas will display in a column:
pd.options.display.max_colwidth = 500

# this setting allows us to see up to 50 columns
pd.options.display.max_columns = 50

---

## Functions

In this section, we summarise the functions used in this notebook. The functions can be found in [utils.py](./utils.py).

1. `get_recommendations` : get top 10 recommendations based on cosine similarity
2. `fit_hnsw_index` : build the `Hnswlib` index that is queried for approximate nearest neighbours
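The implementations in `utils.py` are not reproduced in this notebook. As a rough illustration only, a function with the same call signature as `get_recommendations` could look like the sketch below. The name `get_recommendations_sketch`, the `top_n` parameter and the toy data are assumptions made for this example, not the project's code.

```python
import numpy as np
import pandas as pd


def get_recommendations_sketch(df, indices, name, cos_sim, top_n=10):
    """Hypothetical stand-in for utils.get_recommendations."""
    idx = indices[name]                          # row label of the query game
    scores = pd.Series(cos_sim[idx], index=df.index)
    scores = scores.drop(idx)                    # a game should not recommend itself
    top = scores.sort_values(ascending=False).head(top_n)

    result = df.loc[top.index, ["name"]].copy()  # keeps the ranking order
    result["cos_sim"] = top
    return result


# toy data: three games and a hand-written similarity matrix
toy_main = pd.DataFrame({"name": ["A", "B", "C"]})
toy_indices = pd.Series(toy_main.index, index=toy_main["name"])
toy_sim = np.array([[1.0, 0.2, 0.9],
                    [0.2, 1.0, 0.1],
                    [0.9, 0.1, 1.0]])

toy_recs = get_recommendations_sketch(toy_main, toy_indices, "A", toy_sim, top_n=2)
print(toy_recs)
```

The sketch assumes game names are unique. If two games share a name, `indices[name]` returns several rows and the lookup would need extra handling.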

---

## Read data file

First, we will connect to the database. 

In [2]:
# connecting to DB file
con = sqlite3.connect('../data/steam_db.db')

In [3]:
# ensure that the connection is established
sql_query = '''
SELECT *
FROM main
LIMIT 5
'''

pd.read_sql(sql_query, con)

Unnamed: 0,steam_appid,name,release_date,type,developer,publisher,num_packages
0,10.0,Counter-Strike,2000-11-01 00:00:00,game,Valve,Valve,2
1,20.0,Team Fortress Classic,1999-04-01 00:00:00,game,Valve,Valve,1
2,30.0,Day of Defeat,2003-05-01 00:00:00,game,Valve,Valve,1
3,40.0,Deathmatch Classic,2001-06-01 00:00:00,game,Valve,Valve,1
4,50.0,Half-Life: Opposing Force,1999-11-01 00:00:00,game,Gearbox Software,Valve,1


From our EDA, we know that the database contains the following tables:

1. main
2. genre
3. genre_mapping
4. categories
5. categories_mapping
6. description
7. price
8. statistics
9. media
10. requirements
11. tag
12. language
13. support_info

---

## Data Engineering

### Weighted Average
We will use the weighted average that was previously created for the simple recommender exploration, with the mathematical formula as follows:

$\text{Weighted Average (WA)} = (\frac{1}{6} \cdot F) + (\frac{2}{6} \cdot O) + (\frac{3}{6} \cdot P)$

where 
- P is the percentage of positive reviews ($\frac{\mathrm{positive}}{\mathrm{positive}+\mathrm{negative}}$), with a weight of 3 as this is based on the number of game reviews
- O is the midpoint estimate of the number of owners ($\frac{\mathrm{max\_owners} + \mathrm{min\_owners}}{2}$), with a weight of 2 as this is an estimate of the number of owners
- F is `average_forever`, with a weight of 1 as this is the average playtime since March 2009 in minutes.
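To make the arithmetic concrete, the sketch below applies the 1/6, 2/6 and 3/6 weights from the formula to a single hypothetical game. The review counts, owner bracket and playtime are invented for illustration and are not taken from the database.

```python
def weighted_average(positive, negative, min_owners, max_owners, average_forever):
    """Weighted average using the 1/6, 2/6 and 3/6 weights from the formula."""
    p = positive / (positive + negative)   # share of positive reviews, between 0 and 1
    o = (max_owners + min_owners) / 2      # midpoint estimate of owners
    f = average_forever                    # average playtime in minutes
    return (1 / 6) * f + (2 / 6) * o + (3 / 6) * p


# hypothetical game: 70 positive and 30 negative reviews,
# an owner bracket of 0 to 20,000 and no recorded playtime
wa_example = weighted_average(70, 30, 0, 20_000, 0)
print(round(wa_example, 3))
```

In this example the owner term contributes about 3333.33 while the review term contributes only 0.35, so the three components sit on very different scales before any scaling is applied.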

In [4]:
# function to calculate weighted review
def weighted_review(x):
    P = x['percentage_positive']
    O = x['midpt_est_owners']
    F = x['average_forever']
    
    # calculation based on formula
    return ((1/6 * F) + (2/6 * O) + (3/6 * P))

In [5]:
# create dataframe for the calculation
sql_query = """
SELECT * 
FROM statistics
"""

df_stat = pd.read_sql(sql_query, con)

In [6]:
# create percentage_positive
df_stat['percentage_positive'] = df_stat['positive'] / (df_stat['positive'] + df_stat['negative'])

# create midpt_est_owners
df_stat['midpt_est_owners'] = (df_stat['max_owners'] + df_stat['min_owners']) / 2

# create weighted average
df_stat['wa'] = df_stat.apply(weighted_review, axis=1)

# fill in missing values
df_stat['wa'] = df_stat['wa'].fillna(0)

In [7]:
# see statistics of weighted average
df_stat[['wa']].describe()

Unnamed: 0,wa
count,49015.0
mean,44005.18
std,436460.8
min,0.0
25%,3333.683
50%,3333.817
75%,11667.12
max,50006320.0


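We see that the data is heavily right skewed: the mean (about 44,005) is far above the median (about 3,334), and the maximum is about 50 million. We will need to scale the data before using it in the calculation.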
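The direction of skew can be checked by comparing the mean with the median: a mean far above the median points to a long right tail, and a mean far below it to a long left tail. The sketch below shows the check on a small hypothetical sample; the values are made up to resemble the shape of `wa` and are not drawn from `df_stat`.

```python
import statistics

# hypothetical sample with a few very large values
sample = [3333.4, 3333.7, 3333.8, 3334.0, 11667.1, 50_000.0, 5_000_000.0]

mean = statistics.mean(sample)
median = statistics.median(sample)

# mean far above median: the long tail is on the right (positive skew)
skew_direction = "right" if mean > median else "left"
print(mean, median, skew_direction)
```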

### Description

We will look at the description and keep the 3,000 most frequent words.
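As an illustration of what keeping the most frequent words means, the pure-Python sketch below builds a tiny bag-of-words matrix by hand. It is not scikit-learn's implementation; the mini-corpus, the stop-word set and the limit of 3 terms are assumptions chosen to keep the example small.

```python
import re
from collections import Counter

# hypothetical mini-corpus and stop-word list, for illustration only
docs = [
    "Fight zombies in the zone",
    "Explore the zone and fight zombies",
    "A calm farming game with zombies",
]
stop_words = {"the", "in", "and", "a", "with"}
max_features = 3

# lower-case tokens of two or more word characters, minus stop words
tokenised = [
    [t for t in re.findall(r"\b\w\w+\b", d.lower()) if t not in stop_words]
    for d in docs
]

# keep the max_features most frequent terms across the whole corpus
totals = Counter(t for doc in tokenised for t in doc)
vocab = sorted(term for term, _ in totals.most_common(max_features))

# one row of counts per document, one column per kept term
count_matrix = [[doc.count(term) for term in vocab] for doc in tokenised]
print(vocab)
print(count_matrix)
```

Terms outside the kept vocabulary (here `explore`, `calm`, `farming` and `game`) are dropped, so two descriptions that differ only in rare words end up with identical rows.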

In [8]:
# import columns required
sql_query = '''
SELECT steam_appid, detailed_description
FROM description
'''

df_des = pd.read_sql(sql_query, con)

In [9]:
# create list of stopwords
final_stopwords = stopwords(["en", "ja", "ko", "zh"])

In [10]:
# instantiate CountVectorizer
cv = CountVectorizer(stop_words=final_stopwords, max_features=3_000)

# fit and transform the column
transformed_cv = cv.fit_transform(df_des['detailed_description'])

# convert transformed data to dataframe
matrix_cv = transformed_cv.todense()   # converts to matrix
df_cv_words = pd.DataFrame(matrix_cv, columns=cv.get_feature_names_out())



In [11]:
# see shape and first 5 rows
print(df_cv_words.shape)
df_cv_words.head()

(49015, 3000)


Unnamed: 0,000,01,02,100,1000,11,12,120,13,14,15,150,16,1620520,17,18,19,20,200,2014,2015,2016,2017,2018,2019,...,woods,word,workers,workshop,worlds,worldwide,worry,worse,worst,worth,worthy,write,writing,written,wrong,xbox,xp,yellow,york,youtube,zombie,zombies,zone,zones,zoom
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [12]:
# update the df to include steam_appid
df_cv_words['steam_appid'] = df_des['steam_appid']

---

## Combining the data

With the data extracted from the description (top 3000 words) and weighted average, we will now extract the remaining data and use it for the calculation. 

In [13]:
# select the columns required
sql_query = '''
SELECT steam_appid, name, developer
FROM main
'''

df_main = pd.read_sql(sql_query, con)

In [14]:
sql_query = '''
SELECT *
FROM genre
'''

df_genre = pd.read_sql(sql_query, con)

In [15]:
sql_query = '''
SELECT *
FROM categories
'''

df_categories = pd.read_sql(sql_query, con)

In [16]:
sql_query = '''
SELECT *
FROM language
'''

df_language = pd.read_sql(sql_query, con)

In [17]:
sql_query = '''
SELECT *
FROM tag
'''

df_tag = pd.read_sql(sql_query, con)

### Managing the data: `genre`, `categories`, `language`, `tag`, `weighted average`, `description`

We will combine the different tables into one for our calculations, but first we will tidy the data read from each table.

In [18]:
# drop columns that are not required
df_genre = df_genre.drop(columns=["genre_id", "genre"])
df_categories = df_categories.drop(columns=['categories_id','categories_description'])
df_language = df_language.drop(columns=['languages'])

# rename column in language
df_language.rename(columns = {col: (col+"_lang") for col in df_language.columns if col != 'steam_appid'}, inplace=True)

# create list of columns that will be affected in the replacing for tag
list_tag_temp = list(df_tag.columns)
list_tag_temp.remove("steam_appid")

# replace the values in tag
for col in list_tag_temp:
    df_tag[col] = df_tag[col].apply(lambda x: 0 if x == -9999 else x)
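The loop above replaces `-9999` with `0` one column at a time. The same replacement could be written without a Python-level loop using `DataFrame.replace`, as in the sketch below. The toy table and its column names are hypothetical.

```python
import pandas as pd

# hypothetical two-tag table using the same -9999 value
toy_tag = pd.DataFrame({
    "steam_appid": [10, 20],
    "fps": [120, -9999],
    "racing": [-9999, 45],
})

# every column except the key is treated as a tag column
tag_cols = [c for c in toy_tag.columns if c != "steam_appid"]
toy_tag[tag_cols] = toy_tag[tag_cols].replace(-9999, 0)
print(toy_tag)
```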

### Model 1: `genre`, `categories`, `language`, `tag`, `weighted average`, `description`

We will build the model using 
- `genre`: game genre
- `categories`: game categories
- `language`: language of the game
- `tag`: user-defined tags of game
- `weighted average`: feature created using `statistics` table
- `description`: Top 3000 words


In [19]:
# create df of model 1
df_model_one = df_genre.copy()
df_model_one = df_model_one.join(df_categories.set_index("steam_appid"), on="steam_appid")
df_model_one = df_model_one.join(df_language.set_index("steam_appid"), on="steam_appid")
df_model_one = df_model_one.join(df_tag.set_index("steam_appid"), on="steam_appid")
df_model_one = df_model_one.join(df_stat[['steam_appid','wa']].set_index("steam_appid"), on="steam_appid")
df_model_one = df_model_one.join(df_cv_words.set_index("steam_appid"), on="steam_appid")

# drop the column used for index setting
df_model_one = df_model_one.drop(columns=['steam_appid'])

In [20]:
# see shape and first 2 rows
print(df_model_one.shape)
df_model_one.head(2)

(49015, 3531)


Unnamed: 0,genre_id_1,genre_id_18,genre_id_2,genre_id_23,genre_id_25,genre_id_28,genre_id_29,genre_id_3,genre_id_37,genre_id_4,genre_id_50,genre_id_51,genre_id_52,genre_id_53,genre_id_54,genre_id_55,genre_id_56,genre_id_57,genre_id_58,genre_id_59,genre_id_60,genre_id_70,genre_id_71,genre_id_72,genre_id_73,...,woods,word,workers,workshop,worlds,worldwide,worry,worse,worst,worth,worthy,write,writing,written,wrong,xbox,xp,yellow,york,youtube,zombie,zombies,zone,zones,zoom
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [21]:
# standard-scale df_model_one
df_model_one = StandardScaler().fit_transform(df_model_one)

---

## Cosine Similarity

Let us build a model using cosine similarity. 
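For reference, cosine similarity between two rows is their dot product divided by the product of their lengths. The NumPy sketch below computes it for a toy matrix; the helper name `cosine_sim_matrix` and the data are assumptions for illustration, and the notebook itself uses scikit-learn's `cosine_similarity`.

```python
import numpy as np


def cosine_sim_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / norms            # assumes no row is all zeros
    return unit @ unit.T


toy_X = np.array([[1.0, 0.0],
                  [2.0, 0.0],
                  [0.0, 3.0]])
toy_sim = cosine_sim_matrix(toy_X)
print(toy_sim)
```

Rows 0 and 1 point in the same direction, so their similarity is 1 regardless of length, while row 2 is orthogonal to both. For the 49,015 games in this notebook the full matrix has about 2.4 billion entries, roughly 19 GB if stored as 64-bit floats.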

In [22]:
%%time

# find the cosine similarity 
cos_sim_one = cosine_similarity(df_model_one, df_model_one)

Wall time: 6min 9s


In [23]:
# create reverse mapping of name and index
indices = pd.Series(df_main.index, index=df_main['name'])

We will test the recommender against 4 games:
1. `Half-Life 2: Lost Coast`
2. `Counter-Strike`
3. `Assetto Corsa`
4. `Kenshi`

In [24]:
# get recommendations for Half-Life 2: Lost Coast
get_recommendations(df_main, indices, "Half-Life 2: Lost Coast", cos_sim_one)

Unnamed: 0,name,cos_sim
1594,Tiny Troopers,0.671509
418,Virtual Villagers: A New Home,0.643092
3374,Beyond Space Remastered Edition,0.482003
3983,Congo,0.473816
2437,Journal,0.427125
419,Fish Tycoon,0.38575
13123,PUBG: BATTLEGROUNDS,0.385053
8,Half-Life: Blue Shift,0.368763
11110,Ant-gravity: Tiny's Adventure,0.35925
1247,Terraria,0.323628


In [25]:
# get recommendations for Counter-Strike
get_recommendations(df_main, indices, "Counter-Strike", cos_sim_one)

Unnamed: 0,name,cos_sim
319,Overlord™,0.485882
1,Team Fortress Classic,0.476361
47,Wolfenstein 3D,0.439057
7,Counter-Strike: Condition Zero,0.432878
9,Half-Life 2,0.381876
4,Half-Life: Opposing Force,0.377099
6,Half-Life,0.375856
50,DOOM II,0.365013
2009,140,0.36389
761,Westward® IV: All Aboard,0.35597


In [26]:
# get recommendations for Assetto Corsa
get_recommendations(df_main, indices, "Assetto Corsa", cos_sim_one)

Unnamed: 0,name,cos_sim
8047,Automobilista,0.299561
25737,RDS - The Official Drift Videogame,0.270938
4518,VRC PRO,0.267009
21236,Assetto Corsa Competizione,0.257455
2587,iRacing,0.248357
7489,Drift Streets Japan,0.232173
11032,DRIFT21,0.229825
24833,GRITS Racing,0.226444
2523,Victory: The Age of Racing,0.22517
30109,Automobilista 2,0.224242


In [27]:
# get recommendations for Kenshi
get_recommendations(df_main, indices, "Kenshi", cos_sim_one)

Unnamed: 0,name,cos_sim
28227,Nomads of the Fallen Star,0.168762
28120,Space Digger,0.139072
24685,Amber's Airline - High Hopes,0.138364
41722,Arid,0.13835
13785,Survivalizm - The Animal Simulator,0.137304
32659,Adam: Robot World,0.137026
8525,Expeditions: Viking,0.133533
30720,Orders Of The Ruler,0.132324
20237,Harsh,0.129989
7371,No1Left,0.129797


---

## Approximate Nearest Neighbors (ANN) using `Hnswlib`

We will build a model using `Hnswlib` that measures distance using squared L2 (squared Euclidean distance), since we have already calculated cosine similarity above.
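Squared L2 distance and cosine similarity are closely related when vectors have unit length: in that case $\lVert a - b \rVert^2 = 2 - 2\cos(a, b)$. The sketch below checks this numerically on two random vectors, which are purely illustrative. The features in this notebook are standardised per column rather than normalised per row, so the two measures need not rank neighbours identically here.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)

# scale both vectors to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cos_ab = float(a_unit @ b_unit)
sq_l2 = float(np.sum((a_unit - b_unit) ** 2))

# for unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(sq_l2, 2 - 2 * cos_ab)
```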

In [28]:
%%time

# create model
# model from https://pub.towardsai.net/knn-k-nearest-neighbors-is-dead-fc16507eb3e?sk=b964df6dccf263518b244d4264ba088d
p = fit_hnsw_index(df_model_one)

# set k as 11 to get 10 recommendations, as the nearest neighbour of a game is usually the game itself
ann_neighbor_indices, ann_distances = p.knn_query(df_model_one, k=11)

Wall time: 38min 3s
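To see why `k=11` is used for 10 recommendations, the sketch below runs an exact, brute-force neighbour search on a toy matrix using squared L2 distance. It is not how `Hnswlib` works internally and the data are hypothetical. With an exact search every row's nearest neighbour is itself at distance 0; an approximate index does not guarantee this, which may be why a query game can occasionally be missing from its own result.

```python
import numpy as np

toy_X = np.array([[0.0, 0.0],
                  [0.1, 0.0],
                  [5.0, 5.0],
                  [0.0, 0.3]])

# squared L2 distance between every pair of rows
diff = toy_X[:, None, :] - toy_X[None, :, :]
sq_dist = (diff ** 2).sum(axis=-1)

k = 3  # the query itself plus 2 neighbours
neighbour_idx = np.argsort(sq_dist, axis=1)[:, :k]
print(neighbour_idx)
```

Dropping the first column of `neighbour_idx` leaves `k - 1` recommendations per row.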


In [29]:
# get recommendation for Half-Life 2: Lost Coast
df_main.loc[list(ann_neighbor_indices[list(df_main['name']).index("Half-Life 2: Lost Coast")]), :]

Unnamed: 0,steam_appid,name,developer
14,340.0,Half-Life 2: Lost Coast,Valve
8,130.0,Half-Life: Blue Shift,Gearbox Software
17754,711710.0,RedEyes 赤瞳之勋,RiceMaster
30191,1069290.0,Elpida: Crônicas de uma guerreira,Daniel Pazos
1247,105600.0,Terraria,Re-Logic
38517,1348300.0,大千世界,滑稽工作室
24901,909020.0,梦本无忧,幻想禁
1564,214700.0,Thirty Flights of Loving,Blendo Games
4397,331470.0,Everlasting Summer,Soviet Games
42443,1492870.0,堕星之乱,龙骨工作室


In [30]:
# get recommendation for Counter-Strike
df_main.loc[list(ann_neighbor_indices[list(df_main['name']).index("Counter-Strike")]), :]

Unnamed: 0,steam_appid,name,developer
0,10.0,Counter-Strike,Valve
319,11450.0,Overlord™,"Triumph Studios, Virtual Programming"
1,20.0,Team Fortress Classic,Valve
47,2270.0,Wolfenstein 3D,id Software
7,80.0,Counter-Strike: Condition Zero,Valve
4,50.0,Half-Life: Opposing Force,Gearbox Software
50,2300.0,DOOM II,id Software
49,2290.0,Final DOOM,id Software
3,40.0,Deathmatch Classic,Valve
318,11390.0,Crash Time 2,


In [31]:
# get recommendation for Assetto Corsa
df_main.loc[list(ann_neighbor_indices[list(df_main['name']).index("Assetto Corsa")]), :]

Unnamed: 0,steam_appid,name,developer
2038,244210.0,Assetto Corsa,Kunos Simulazioni
32826,1151200.0,Just Drift It !,Vencious Games
20102,775900.0,MotoGP™18,Milestone S.r.l.
20698,791150.0,EV3 - Drag Racing,KABloom Interactive
13932,601770.0,Sparc,CCP
32640,1145660.0,Drift Of The Hill,RewindApp
37542,1308140.0,Drive & Drift,UnknownStudio
49013,1885640.0,Warehouse Simulator: Forklift Driver,"Maks Volegov, Andreev Worlds"
7659,418170.0,MicroRC Simulation,Minindustries Game Factory
23412,866260.0,EreaDrone,"EreaDrone, EreaStudio, Elouan Jorrand"


In [32]:
# get recommendation for Kenshi
df_main.loc[list(ann_neighbor_indices[list(df_main['name']).index("Kenshi")]), :]

Unnamed: 0,steam_appid,name,developer
9844,489370.0,Quarantine,Sproing
49008,1883580.0,Cactus Simulator,Iful GS
25695,933380.0,Tennis Story,Gersh Games LLC
15086,641950.0,Last Stonelord,Morganti Livio
42443,1492870.0,堕星之乱,龙骨工作室
46632,1693360.0,Dark Forest Project,Phoenixxx Games
48191,1798130.0,ReTox,Dopamine Games
28503,1018800.0,DEEEER Simulator: Your Average Everyday Deer Game,Gibier Games
29918,1061650.0,Idle Portal Guardian,Thejamiryu
25629,931540.0,RevelationTrestan-尸忆岛,ZBO interactive technology


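Comparing the two, we see that the recommendations from both approaches overlap for some games: most of the `Counter-Strike` recommendations appear in both lists. For `Assetto Corsa` and `Kenshi`, however, the two lists share no titles.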
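One way to put a number on how closely the two lists agree is to count the titles they share. The sketch below does this for the `Counter-Strike` results, with the names copied from the outputs above and the query game dropped from the `Hnswlib` list. The Jaccard score is an illustrative choice of metric, not something used elsewhere in this notebook.

```python
# names copied from the Counter-Strike outputs above
cosine_recs = [
    "Overlord™", "Team Fortress Classic", "Wolfenstein 3D",
    "Counter-Strike: Condition Zero", "Half-Life 2",
    "Half-Life: Opposing Force", "Half-Life", "DOOM II",
    "140", "Westward® IV: All Aboard",
]
ann_recs = [
    "Overlord™", "Team Fortress Classic", "Wolfenstein 3D",
    "Counter-Strike: Condition Zero", "Half-Life: Opposing Force",
    "DOOM II", "Final DOOM", "Deathmatch Classic", "Crash Time 2",
]

# titles present in both lists, in cosine-similarity order
shared = [name for name in cosine_recs if name in set(ann_recs)]

# Jaccard score: shared titles divided by all distinct titles
jaccard = len(shared) / len(set(cosine_recs) | set(ann_recs))
print(len(shared), round(jaccard, 2))
```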

## Analysis

For our analysis, we will use the cosine similarity model, which is more interpretable than `Hnswlib`.

We will use `Counter-Strike` for our analysis. 

In [34]:
# get recommendations for Counter-Strike
get_recommendations(df_main, indices, "Counter-Strike", cos_sim_one)

Unnamed: 0,name,cos_sim
319,Overlord™,0.485882
1,Team Fortress Classic,0.476361
47,Wolfenstein 3D,0.439057
7,Counter-Strike: Condition Zero,0.432878
9,Half-Life 2,0.381876
4,Half-Life: Opposing Force,0.377099
6,Half-Life,0.375856
50,DOOM II,0.365013
2009,140,0.36389
761,Westward® IV: All Aboard,0.35597


We see that all the games suggested by the model have a cosine similarity below 0.5, suggesting that these games might not be very similar to `Counter-Strike`. This suggests that model 1 might not be suitable.

A possible explanation would be that `tag` adds too much noise to the recommendations. Tags are provided by users and, unlike the ratings of the game, they might not be relevant. 

We will remove `tag` from the model 2 calculation. 

## Conclusion 

After looking at the data, we see that the recommender built using model 1 might not be accurate, as the similarity scores for some games, such as `Counter-Strike`, are low. 

We will rebuild our recommender as model 2 by removing the user-defined `tag`. 