# Recommender : Built

In this notebook, we will reuse the approach behind model 2, which combines `genre`, `categories`, `language`, `description` and `weighted average`.

We will build the recommender from games that were still being played in the 2 weeks before extraction (Feb 2022), i.e. games with `average_2weeks` > 0. This gives us a fraction of the full set of games, which we use to build the recommender with Streamlit.



---

## Import Libraries

In this section, we will import all the libraries that will be used in this notebook.

In [1]:
# For Calculation and Data Manipulation
import numpy as np
import pandas as pd
import math

# for cosine similarity calculation
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

# for NLP
# from nltk.stem import WordNetLemmatizer
# from nltk.corpus import stopwords
from stopwordsiso import stopwords
from sklearn.feature_extraction.text import CountVectorizer
import re, string

# for file export and folder creation
import os

# for datetime conversion
import datetime

# for database access (SQLite)
import sqlite3

# import created functions
from utils import get_recommendations

# this setting widens how many characters pandas will display in a column:
pd.options.display.max_colwidth = 500

# this setting allows us to see up to 50 columns
pd.options.display.max_columns = 50

---

## Functions

In this section, we list the functions used in this notebook as a summary. They are defined in [utils.py](./utils.py).

1. `get_recommendations` : get top 10 recommendations based on cosine similarity
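
The body of `get_recommendations` lives in `utils.py` and is not shown here. As a rough illustration only, a function with the same call shape might look like the sketch below. The name `get_recommendations_sketch`, the `top_n` parameter and the toy data are assumptions made for this example, not the actual implementation.

```python
import numpy as np
import pandas as pd


def get_recommendations_sketch(df, indices, name, cos_sim, top_n=10):
    """Hypothetical sketch: return the top_n games most similar to `name`."""
    # look up the row position of the requested game
    idx = indices[name]
    # similarity of that game against every other game
    scores = pd.Series(cos_sim[idx], index=df.index)
    # drop the game itself, then keep the highest scores
    scores = scores.drop(idx).sort_values(ascending=False).head(top_n)
    out = df.loc[scores.index, ["name"]].copy()
    out["cos_sim"] = scores.values
    return out


# toy data (made up for illustration)
df_toy = pd.DataFrame({"name": ["A", "B", "C"]})
indices_toy = pd.Series(df_toy.index, index=df_toy["name"])
cos_sim_toy = np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.1],
    [0.9, 0.1, 1.0],
])

result = get_recommendations_sketch(df_toy, indices_toy, "A", cos_sim_toy)
print(result)
```

The sketch assumes that row `i` of the similarity matrix corresponds to row `i` of the dataframe, and that game names are unique.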

---

## Read data file

First, we will connect to the database. 

In [2]:
# connecting to DB file
con = sqlite3.connect('../data/steam_db.db')

In [3]:
# ensure that the connection is established
sql_query = '''
SELECT *
FROM main
LIMIT 5
'''

pd.read_sql(sql_query, con)

Unnamed: 0,steam_appid,name,release_date,type,developer,publisher,num_packages
0,10.0,Counter-Strike,2000-11-01 00:00:00,game,Valve,Valve,2
1,20.0,Team Fortress Classic,1999-04-01 00:00:00,game,Valve,Valve,1
2,30.0,Day of Defeat,2003-05-01 00:00:00,game,Valve,Valve,1
3,40.0,Deathmatch Classic,2001-06-01 00:00:00,game,Valve,Valve,1
4,50.0,Half-Life: Opposing Force,1999-11-01 00:00:00,game,Gearbox Software,Valve,1


From our EDA, we know that the database contains the following tables:

1. main
2. genre
3. genre_mapping
4. categories
5. categories_mapping
6. description
7. price
8. statistics
9. media
10. requirements
11. tag
12. language
13. support_info
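
If we want to confirm the table list directly from the database rather than from the EDA, SQLite exposes it through the built-in `sqlite_master` table. The sketch below uses an in-memory database with two made-up tables so that it runs on its own; against the real file we would pass the existing `con` instead of `con_demo`.

```python
import sqlite3

import pandas as pd

# in-memory database used only for this illustration
con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE main (steam_appid INTEGER, name TEXT)")
con_demo.execute("CREATE TABLE statistics (steam_appid INTEGER, average_2weeks INTEGER)")

# sqlite_master holds one row per table, index, view and trigger
sql_query = """
SELECT name
FROM sqlite_master
WHERE type = 'table'
ORDER BY name
"""

table_names = pd.read_sql(sql_query, con_demo)["name"].tolist()
print(table_names)

con_demo.close()
```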

---

## Data Extraction

### Weighted Average
We will reuse the weighted average that was created for the simple recommender exploration. The formula is:

$ Weighted\ Average\ (WA) = (\frac{1}{6} \cdot F) + (\frac{2}{6} \cdot O) + (\frac{3}{6} \cdot P) $

where
- P is the percentage of positive reviews, $\frac{\mathrm{positive}}{\mathrm{positive} + \mathrm{negative}}$, with a weight of 3 as it comes directly from the game reviews
- O is the midpoint estimate of the number of owners, $\frac{\mathrm{max\_owners} + \mathrm{min\_owners}}{2}$, with a weight of 2 as this is only an estimate of the number of owners
- F is `average_forever`, the average playtime since March 2009 in minutes, with a weight of 1
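
To make the arithmetic concrete, here is a small worked example using the weights 1/6, 2/6 and 3/6 from the formula. The input values are made up for illustration and do not describe a real game.

```python
def weighted_average(p, o, f):
    """Weighted average with weights 1/6 (F), 2/6 (O) and 3/6 (P)."""
    return (1 / 6 * f) + (2 / 6 * o) + (3 / 6 * p)


# hypothetical game: 90% positive reviews, 150,000 estimated owners,
# 600 minutes of average playtime
wa_example = weighted_average(p=0.9, o=150_000, f=600)

# 100 (from F) + 50,000 (from O) + 0.45 (from P)
print(round(wa_example, 2))  # 50100.45
```

Because the three inputs are on very different scales, the owner estimate dominates the result in this example, while the review percentage contributes less than 1.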

In [4]:
# function to calculate weighted review
def weighted_review(x):
    P = x['percentage_positive']
    O = x['midpt_est_owners']
    F = x['average_forever']
    
    # calculation based on formula
    return ((1/6 * F) + (2/6 * O) + (3/6 * P))

In [5]:
# create dataframe for the calculation
sql_query = """
SELECT statistics.* 
FROM statistics
WHERE (statistics.average_2weeks > 0)
"""

df_stat = pd.read_sql(sql_query, con)

In [6]:
# create percentage_positive
df_stat['percentage_positive'] = df_stat['positive'] / (df_stat['positive'] + df_stat['negative'])

# create midpt_est_owners
df_stat['midpt_est_owners'] = (df_stat['max_owners'] + df_stat['min_owners']) / 2

# create weighted average
df_stat['wa'] = df_stat.apply(weighted_review, axis=1)

# fill in missing values (assignment avoids the chained in-place fillna, which newer pandas versions warn about)
df_stat['wa'] = df_stat['wa'].fillna(0)

In [7]:
# see statistics of weighted average
df_stat[['wa']].describe()

Unnamed: 0,wa
count,1396.0
mean,823845.2
std,2378979.0
min,3334.558
25%,116683.2
50%,250136.3
75%,501691.5
max,50006320.0


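We see that the data is right skewed: the mean (about 824,000) is far above the median (about 250,000), and the maximum is about 50 million. We will need to scale the data before using it in the calculation.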
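
One quick way to check the direction of the skew is to compare the mean with the median and to look at the sign of `Series.skew()`. The sketch below uses a small made-up series with one large value, not the actual `wa` column.

```python
import pandas as pd

# toy values with a single large outlier (made up for illustration)
wa_toy = pd.Series([1, 2, 2, 3, 3, 4, 100])

mean_above_median = wa_toy.mean() > wa_toy.median()
skewness = wa_toy.skew()

# a mean above the median and a positive skewness both point to a long right tail
print(mean_above_median, skewness > 0)
```

On the real data, the same check would be `df_stat['wa'].skew()`.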

### Description

We will take the 3,000 most frequent words from the game descriptions.

In [8]:
# import columns required
sql_query = '''
SELECT description.steam_appid, description.detailed_description
FROM description
INNER JOIN statistics
   ON description.steam_appid = statistics.steam_appid
WHERE (statistics.average_2weeks > 0)
'''

df_des = pd.read_sql(sql_query, con)

In [9]:
# create list of stopwords (converted to a list, which is what CountVectorizer accepts for stop_words)
final_stopwords = list(stopwords(["en", "ja", "ko", "zh"]))

In [10]:
# remove numbers from description
df_des['detailed_description'] = df_des['detailed_description'].apply(lambda x: re.sub(r'\d+', '', x))

In [11]:
# instantiate CountVectorizer
cv = CountVectorizer(stop_words=final_stopwords, max_features=3_000)

# fit and transform the column
transformed_cv = cv.fit_transform(df_des['detailed_description'])

# convert transformed data to dataframe
matrix_cv = transformed_cv.toarray()   # converts the sparse matrix to a dense NumPy array
df_cv_words = pd.DataFrame(matrix_cv, columns=cv.get_feature_names_out())



In [12]:
# see shape and first 5 rows
print(df_cv_words.shape)
df_cv_words.head()

(1396, 3000)


Unnamed: 0,_blank,_en,abandoned,abilities,ability,absolutely,abyss,academy,accept,access,accessible,accessories,accident,accidents,acclaimed,accolades,account,accounts,accuracy,accurate,achieve,achievements,acquire,acting,action,...,worms,worry,worse,worth,write,writing,written,wrong,ww,wwii,xbox,xcom,xiv,xp,xv,york,youtu,youtube,zen,zombie,zombies,zone,zones,zu,не
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [13]:
# update the df to include steam_appid
df_cv_words['steam_appid'] = df_des['steam_appid']

## Combining the data

With the data extracted from the description (top 3000 words) and the weighted average, we will now extract the remaining data and use it for the calculation.

In [14]:
# create columns required
sql_query = '''
SELECT main.steam_appid, main.name, main.developer
FROM main
INNER JOIN statistics
   ON main.steam_appid = statistics.steam_appid
WHERE (statistics.average_2weeks > 0)
'''

df_main = pd.read_sql(sql_query, con)

In [15]:
sql_query = '''
SELECT genre.*
FROM genre
INNER JOIN statistics
   ON genre.steam_appid = statistics.steam_appid
WHERE (statistics.average_2weeks > 0)
'''

df_genre = pd.read_sql(sql_query, con)

In [16]:
sql_query = '''
SELECT categories.*
FROM categories
INNER JOIN statistics
   ON categories.steam_appid = statistics.steam_appid
WHERE (statistics.average_2weeks > 0)
'''

df_categories = pd.read_sql(sql_query, con)

In [17]:
sql_query = '''
SELECT language.*
FROM language
INNER JOIN statistics
   ON language.steam_appid = statistics.steam_appid
WHERE (statistics.average_2weeks > 0)
'''

df_language = pd.read_sql(sql_query, con)

### Managing the data: `genre`, `categories`, `language`, `weighted average`, `description`

We will combine the different tables into one for our calculations, but first we will drop the columns that are not required and rename the language columns.

In [18]:
# drop columns that are not required
df_genre = df_genre.drop(columns=["genre_id", "genre"])
df_categories = df_categories.drop(columns=['categories_id','categories_description'])
df_language = df_language.drop(columns=['languages'])

# rename column in language
df_language.rename(columns = {col: (col+"_lang") for col in df_language.columns if col != 'steam_appid'}, inplace=True)
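
The rename above adds a `_lang` suffix to every column except `steam_appid`. A likely reason, stated here as an assumption, is to keep language columns from clashing with identically named columns from other tables once everything is joined. A minimal sketch on a made-up frame:

```python
import pandas as pd

# toy language table (made up for illustration)
df_lang_demo = pd.DataFrame({"steam_appid": [10], "english": [1], "japanese": [0]})

# suffix every column except the join key
df_lang_demo = df_lang_demo.rename(
    columns={col: f"{col}_lang" for col in df_lang_demo.columns if col != "steam_appid"}
)

print(list(df_lang_demo.columns))  # ['steam_appid', 'english_lang', 'japanese_lang']
```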

### Streamlit model: `genre`, `categories`, `language`, `weighted average`, `description`

We will build the model using 
- `genre`: game genre
- `categories`: game categories
- `language`: language of the game
- `weighted average`: feature created using `statistics` table
- `description`: Top 3000 words

In [19]:
# create df of model 2
df_model_two = df_genre.copy()
df_model_two = df_model_two.join(df_categories.set_index("steam_appid"), on="steam_appid")
df_model_two = df_model_two.join(df_language.set_index("steam_appid"), on="steam_appid")
df_model_two = df_model_two.join(df_stat[['steam_appid','wa']].set_index("steam_appid"), on="steam_appid")
df_model_two = df_model_two.join(df_cv_words.set_index("steam_appid"), on="steam_appid")

# drop the column used for index setting
df_model_two = df_model_two.drop(columns=['steam_appid'])
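
The joins above rely on each table having exactly one row per `steam_appid`. `DataFrame.join` performs a left join by default, so the left-hand row order is kept, but duplicate keys on the right-hand side would add rows. The sketch below, on made-up frames, shows the pattern together with a simple sanity check.

```python
import pandas as pd

# toy tables (made up for illustration); note the different row order
left = pd.DataFrame({"steam_appid": [10, 20, 30], "genre_id_1": [1, 1, 0]})
right = pd.DataFrame({"steam_appid": [30, 10, 20], "wa": [0.3, 0.1, 0.2]})

# sanity check: the join key is unique on the right-hand side
assert right["steam_appid"].is_unique

joined = left.join(right.set_index("steam_appid"), on="steam_appid")

# the left-hand row order is preserved and values are matched by key
print(joined)
```

If the row count after a join differs from the row count of `df_main`, the positions in the similarity matrix would no longer line up with the game names.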

In [20]:
# see shape and first 2 rows
print(df_model_two.shape)
df_model_two.head(2)

(1396, 3103)


Unnamed: 0,genre_id_1,genre_id_18,genre_id_2,genre_id_23,genre_id_25,genre_id_28,genre_id_29,genre_id_3,genre_id_37,genre_id_4,genre_id_50,genre_id_51,genre_id_52,genre_id_53,genre_id_54,genre_id_55,genre_id_56,genre_id_57,genre_id_58,genre_id_59,genre_id_60,genre_id_70,genre_id_71,genre_id_72,genre_id_73,...,worms,worry,worse,worth,write,writing,written,wrong,ww,wwii,xbox,xcom,xiv,xp,xv,york,youtu,youtube,zen,zombie,zombies,zone,zones,zu,не
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [21]:
# standard-scale df_model_two (returns a NumPy array)
df_model_two = StandardScaler().fit_transform(df_model_two)
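
`StandardScaler` rescales each column to zero mean and unit variance, so that a feature on a large scale does not dominate the similarity calculation. A minimal sketch on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# two toy features on very different scales (made up for illustration)
X_demo = np.array([
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, 600.0],
])

X_scaled = StandardScaler().fit_transform(X_demo)

# after scaling, each column has mean 0 and standard deviation 1
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```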

---

## Cosine Similarity

Let us build a model using cosine similarity. 

In [22]:
%%time

# find the cosine similarity 
cos_sim_two = cosine_similarity(df_model_two, df_model_two)

Wall time: 167 ms
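
For reference, cosine similarity between two rows is their dot product divided by the product of their lengths. The sketch below computes it by hand on a made-up matrix and compares the result with `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# toy feature matrix, one row per item (made up for illustration)
X_cos = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 2.0],
])

# manual calculation: dot products divided by the product of row norms
norms = np.linalg.norm(X_cos, axis=1, keepdims=True)
manual_sim = (X_cos @ X_cos.T) / (norms * norms.T)

sklearn_sim = cosine_similarity(X_cos)

print(np.allclose(manual_sim, sklearn_sim))  # True
```

Rows 0 and 2 point in perpendicular directions, so their similarity is 0, while every row has a similarity of 1 with itself.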


In [23]:
# create reverse mapping of name and index
indices = pd.Series(df_main.index, index=df_main['name'])
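
The reverse mapping lets us go from a game name to its row position. One caveat: if a name appeared more than once, the lookup would return several positions instead of one. The sketch below shows both cases on made-up names; whether the real data contains duplicate names is not checked here.

```python
import pandas as pd

# toy names with one duplicate (made up for illustration)
df_names = pd.DataFrame({"name": ["Portal", "Kenshi", "Portal"]})
indices_demo = pd.Series(df_names.index, index=df_names["name"])

single = indices_demo["Kenshi"]   # unique name: a single position
dupes = indices_demo["Portal"]    # duplicated name: a Series of positions

print(single)
print(list(dupes))
```

On the real data, `df_main['name'].is_unique` would tell us whether this case can occur.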

We will test the recommender against 4 games:
1. `Half-Life 2`
2. `Counter-Strike`
3. `Assetto Corsa`
4. `Kenshi`

In [24]:
# get recommendations for Half-Life 2
get_recommendations(df_main, indices, "Half-Life 2", cos_sim_two)

Unnamed: 0,name,cos_sim
6,Portal,0.202672
4,Counter-Strike: Source,0.180202
1,Half-Life,0.175426
7,Half-Life 2: Episode Two,0.155657
663,Half-Life: Alyx,0.149535
14,Counter-Strike: Global Offensive,0.14545
396,Transformice,0.136532
432,Black Mesa,0.13458
1076,Crumble,0.132488
9,Left 4 Dead,0.118897


In [25]:
# get recommendations for Counter-Strike
get_recommendations(df_main, indices, "Counter-Strike", cos_sim_two)

Unnamed: 0,name,cos_sim
418,Eternal Senia,0.2565
1,Half-Life,0.254742
2,Counter-Strike: Condition Zero,0.243983
708,Sonic Mania,0.23595
1186,Marco & The Galaxy Dragon,0.229842
5,Half-Life 2: Deathmatch,0.212463
22,Chuzzle Deluxe,0.199497
769,NARUTO TO BORUTO: SHINOBI STRIKER,0.194543
392,Everlasting Summer,0.191975
989,TAPSONIC BOLD,0.189737


In [26]:
# get recommendations for Assetto Corsa
get_recommendations(df_main, indices, "Assetto Corsa", cos_sim_two)

Unnamed: 0,name,cos_sim
276,iRacing,0.20442
909,Assetto Corsa Competizione,0.163191
160,RaceRoom Racing Experience,0.145972
830,GRID,0.133295
1330,Initial Drift Online,0.129052
534,Project CARS - Pagani Edition,0.127328
1235,RIDE 4,0.125565
401,rFactor,0.121165
1088,F1® 2020,0.117094
439,rFactor 2,0.114403


In [27]:
# get recommendations for Kenshi
get_recommendations(df_main, indices, "Kenshi", cos_sim_two)

Unnamed: 0,name,cos_sim
1233,Days Gone,0.126378
409,ARK: Survival Evolved,0.090848
431,Fran Bow,0.086721
1110,The Last Spell,0.086334
1209,ATRI -My Dear Moments-,0.082636
674,Himawari - The Sunflower -,0.082237
1300,Erzurum,0.078352
583,Idling to Rule the Gods,0.076935
267,The Walking Dead: Season Two,0.074561
1111,Ratropolis,0.072847


## Conclusion

We will store the scaled feature matrix behind the cosine similarity calculation, together with the game details, and use them to build a recommender application.

### Save dataframe

We will save (export) the required dataframes to output files.

We will use the exported files to build a recommender on streamlit. 

In [28]:
# create new dataframe to store
sql_query = '''
SELECT main.steam_appid, main.name, media.header_image, description.about_the_game
FROM main
INNER JOIN media
    ON main.steam_appid = media.steam_appid
INNER JOIN description
    ON main.steam_appid = description.steam_appid
INNER JOIN statistics
   ON main.steam_appid = statistics.steam_appid
WHERE (statistics.average_2weeks > 0)
'''

df_store = pd.read_sql(sql_query, con)

We will store the dataframes in `.csv` format for reference in our streamlit application. 

In [29]:
# store the scaled feature matrix and the game details
pd.DataFrame(df_model_two).to_csv("../data/cos_sim_data_extract.csv", index=False)
df_store.to_csv("../data/details_data_extract.csv", index=False)

We can launch the application by running the following command in a terminal:

```
streamlit run recommender_extract.py
```