# Modeling (Recommendation System)

## Part 2: (English only) Content-based Filtering

In [1]:
import re
import random
import json
from collections import defaultdict

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# using MCA, a PCA version for boolean features
import prince

# for Description
from gensim.models import Word2Vec, FastText

from sklearn.metrics.pairwise import cosine_similarity

In [2]:
df_game = pd.read_csv("archive/cleaned_steam_db_v2.csv")
df_game.head()

Unnamed: 0,type,name,steam_appid,required_age,is_free,genres,platform_windows,platform_mac,platform_linux,release_year,...,nsfw,film,developers,publishers,description,release_distance_value,initial_price_usd,final_price_usd,memory_gb,storage_gb
0,demo,Pin Them Demo,1904630,0,True,[],True,False,False,2023.0,...,False,False,[0],[0],,2,0.0,0.0,,
1,game,Al-Qadim: The Genie's Curse,1904640,0,False,"[1, 3]",True,False,False,2022.0,...,False,False,[1],[2],Experience the mysterious Al-Qadim game world ...,2,3.204,3.204,0.5,2.0
2,game,Dungeons & Dragons - Stronghold: Kingdom Simul...,1904650,0,False,"[28, 2]",True,False,False,2022.0,...,False,False,[3],[2],Run your own kingdom in the legendary Dungeons...,2,3.204,3.204,0.5,2.0
3,game,Chapel 3-D: The Ascent,1904680,0,False,"[1, 23]",True,False,False,,...,False,False,[4],[5],"Chapel 3-D: The Ascent is a break-neck, viole...",0,0.0,0.0,1.0,0.0
4,game,VTuber Gallery : Anime Pose,1904690,0,True,"[51, 53, 55, 57, 59, 70]",True,False,False,2022.0,...,False,False,[6],[6],VTuber Gallery is #1 anime pose app that allow...,2,0.0,0.0,8.0,0.0


In [3]:
df_game.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 95033 entries, 0 to 95032
Data columns (total 42 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   type                    95033 non-null  object 
 1   name                    95033 non-null  object 
 2   steam_appid             95033 non-null  int64  
 3   required_age            95033 non-null  int64  
 4   is_free                 95033 non-null  bool   
 5   genres                  95033 non-null  object 
 6   platform_windows        95033 non-null  bool   
 7   platform_mac            95033 non-null  bool   
 8   platform_linux          95033 non-null  bool   
 9   release_year            84290 non-null  float64
 10  release_quarter         82811 non-null  float64
 11  coming_soon             95033 non-null  bool   
 12  package_number          95033 non-null  int64  
 13  discount_percent        95033 non-null  float64
 14  developers_amount       95033 non-null

In [4]:
df_game["genres"] = df_game["genres"].apply(json.loads)
df_game["developers"] = df_game["developers"].apply(json.loads)
df_game["publishers"] = df_game["publishers"].apply(json.loads)

## Before compiling a recommendation system model, let's use MCA to analyse a little.

___

### Genres
There are many genres (for some reasons, there are many missing ones, such as horror games).
By using MCA (PCA for boolean features), we can reduce the feature size and easily to use metrics for recommendation system.

In [5]:
genre_dict = {}
with open("json_folder/genre_dict_clean.json", "r") as f:
	genre_dict = json.loads(f.read())
f.close()

In [6]:
# some_genre_ids = []
# false_table = np.full((df_game.shape[0], len(some_genre_ids)), False)
# false_table = pd.DataFrame(false_table, columns=some_genre_ids)
# df_game = pd.concat([df_game, false_table], axis=1)

# count = 0
# for i in range(df_game.shape[0]):
# 	for e in df_game.loc[i, "genres"]:
# 		if str(e) in some_genre_ids:
# 			df_game.loc[i, str(e)] = True
# 	if count % 10000 == 0:
# 		print(f"Step {count} done")
# 	count += 1

# df_game[some_genre_ids]

In [7]:
# X_genre = df_game[[*some_genre_ids, "tool", "nsfw", "film"]]
# X_genre.head()

In [8]:
# mca_genre = prince.MCA(n_components=3)
# mca_genre.fit(X_genre)
# df_game[["eigen_vec_1", "eigen_vec_2", "eigen_vec_3"]] = mca_genre.transform(X_genre)

In [9]:
# def draw_mca_result(table, feature_name, open_web=True):
    
#     fig = px.scatter(
#     	x=table[0].to_numpy(),                                    # PCA 1
#     	y=table[1].to_numpy(),                                    # PCA 2
#     	size=(table[2] - table[2].min()).to_numpy(),     # PCA 3
#     	text=table[feature_name],
#         color=table["t_or_f"],
#     	labels={
#     		"x": "<b>First Eigen Vector</b>",
#     		"y": "<b>Second Eigen Vector</b>",
#     		"size": "<b>Third Eigen Vector</b>"
#     	},
#         title="MCA with genres"
#     )
    
#     fig.update_layout(
#         font_family="Arial",
#         font_size=16
#     )

#     if open_web:
#         fig.show(renderer="browser")
#     else:
#         fig.show()

In [10]:
# def genre_true_to_name(x):
    
#     result = re.findall(r"\d+", x)
    
#     return_val = ""
#     if len(result) == 0:
#         match x[:4]:
#             case "tool": return_val = "Tool"
#             case "nsfw": return_val = "NSFW"
#             case "film": return_val = "Film"
#     else:
#         return_val = genre_dict[result[0]]

#     return_val += " 1" if x.find("True") != -1 else " 0"
#     return return_val


# col_mca = mca_genre.column_coordinates(X_genre)
# col_mca["genre"] = col_mca.index
# col_mca["genre"] = col_mca["genre"].apply(genre_true_to_name)
# col_mca["t_or_f"] = col_mca["genre"].apply(lambda x: x.find("1") != -1)


# draw_mca_result(col_mca, "genre")

- First eigen vector is the most important one:
   - We can see that `Film` and `Tool` are far different from the rest.
   - `Massively Multiplayer` is quite far away from the rest since it's not a common genre.
- Second eigen vector:
   - `Racing` and `Sports` are far different from the rest.
- Third eigen vector:
   - `Racing`, `NSFW` and `Sports` are different from the rest.
   - `Strategy` is 0, while `Casual` is 1.5. They are different genres.

___

### Developers and publishers

There are lots of developers and publishers, and some of them only produce one game/app. It's difficult especially applying vectoriser to them.

GloVe or FastText would be a perfect model. However, due to inappropriate Python version, we are only able to use Word2Vec for word embedding.

In [11]:
studio_dict = {}
with open("json_folder/studio_dict.json", "r", encoding="utf-8") as f:
	studio_dict = json.loads(f.read())
f.close()

In [12]:
with open("json_folder/stop_words_english.txt", "r", encoding="utf-8") as f:
    stop_words = f.read()

f.close()

stop_words = stop_words.split("\n")
stop_words[:5]

['able', 'about', 'above', 'abroad', 'according']

In [13]:
df_game.columns

Index(['type', 'name', 'steam_appid', 'required_age', 'is_free', 'genres',
       'platform_windows', 'platform_mac', 'platform_linux', 'release_year',
       'release_quarter', 'coming_soon', 'package_number', 'discount_percent',
       'developers_amount', 'publishers_amount', 'single', 'multi',
       'support_vr', 'support_controller', 'lang_en', 'lang_fr', 'lang_de',
       'lang_es', 'lang_po', 'lang_zh', 'lang_ja', 'lang_ko', 'lang_it',
       'lang_ru', 'lang_ar', 'tool', 'nsfw', 'film', 'developers',
       'publishers', 'description', 'release_distance_value',
       'initial_price_usd', 'final_price_usd', 'memory_gb', 'storage_gb'],
      dtype='object')

In [14]:
df_game_english = df_game.loc[df_game["lang_en"], ['lang_en', 'lang_fr', 'lang_de', 'lang_es',
                                                   'lang_po', 'lang_zh', 'lang_ja', 'lang_ko',
                                                   'lang_it', 'lang_ru', 'lang_ar', 'type',
                                                   'developers', 'publishers', 'genres', 'description',
                                                   'tool', 'nsfw', 'film', 'steam_appid']].reset_index(drop=True)
df_game_english.shape

(84000, 20)

In [15]:
df_game_english["description"].fillna("", inplace=True)

In [16]:
count = 0
def to_simplified_sentence_list(x):

    global count, stop_words
    
    count += 1
    if count % 10000 == 0:
        print(count)
    
    # Lower case
    x = x.lower()

    # Only keep alphabet, space and apostrophe characters
    x = re.sub(r"[^a-z '-]+", "", x)
    x = re.sub(r" [-']", " ", x)
    x = re.sub(r"[-'] ", " ", x)
    x = re.sub(r" +", " ", x)
    x = x.strip()

    # Convert to array
    if len(x) > 0:
        x = x.split(" ")
        # remove stop words
        x = [e for e in x if not e in stop_words]
    else:
        x = []
    
    return x


df_game_english["pre_word_embedding"] = df_game_english["description"].apply(lambda x: to_simplified_sentence_list(x))
df_game_english["pre_word_embedding"]

10000
20000
30000
40000
50000
60000
70000
80000


0        [experience, mysterious, al-qadim, game, prepa...
1        [kingdom, legendary, dungeons, dragons, game, ...
2        [chapel, ascent, break-neck, violent, boomer, ...
3        [vtuber, gallery, anime, pose, app, easy, pose...
4             [pack, traits, apex, predator, quills, pest]
                               ...                        
83995                                                   []
83996                                                   []
83997                                                   []
83998    [stylish, blend, deck-building, turn-based, ta...
83999    [perpetual, testing, initiative, expanded, des...
Name: pre_word_embedding, Length: 84000, dtype: object

In [17]:
def add_dev_n_pub(x):
    result = x["pre_word_embedding"]
    result.append(x["type"])

    dev_list = [studio_dict[str(e)] for e in x["developers"]]
    result += [studio_dict[str(e)] for e in x["publishers"]]
    return result
  
df_game_english["pre_word_embedding"] = \
        df_game_english[["pre_word_embedding", "type", "developers", "publishers"]].apply(add_dev_n_pub, axis=1)

In [18]:
count = 0
def add_genres(x):

    global count, lang_key, lang_name
    
    count += 1
    if count % 10000 == 0:
        print(count)

    result = x["pre_word_embedding"].copy()
    result += [genre_dict[str(genre)] for genre in x["genres"]]
    return result


df_game_english["pre_word_embedding"] = \
        df_game_english[["pre_word_embedding", "genres"]].apply(add_genres, axis=1)

10000
20000
30000
40000
50000
60000
70000
80000


In [19]:
lang_key = df_game_english.columns[df_game_english.columns.str.contains("lang")]
lang_name = ["english", "french", "german", "spanish", "portuguese",
             "chinese", "japanese", "korean", "italian", "russian", "arabic"]

count = 0
def add_language(x):

    global count, lang_key, lang_name
    
    count += 1
    if count % 10000 == 0:
        print(count)

    result = x["pre_word_embedding"].copy()
    result += [lang_name[i] for i in range(len(lang_key)) if x[lang_key[i]]]
    return result

    
df_game_english["pre_word_embedding"] = \
        df_game_english[["pre_word_embedding", *lang_key]].apply(add_language, axis=1)

10000
20000
30000
40000
50000
60000
70000
80000


In [20]:
for i, word in enumerate(df_game_english.loc[0, "pre_word_embedding"]):
    print(word, end="\n" if i % 5 == 4 else "\t")

experience	mysterious	al-qadim	game	prepare
arcade-style	combat	role-playing	genre	style
arabian	nights	diverging	gold	box
formula	al-qadim	experience	immersion	thrill
add	adventure	condensed	role-playing	game
SNEG	Action	RPG	english	french
german	spanish	

In [21]:
def get_similar_text(texts, word_model, size=10):
    word_table = pd.concat([
        pd.DataFrame(
            word_model.wv.most_similar(texts[i], topn=size),
            columns=[f"'{texts[i]}' similar text", f"'{texts[i]}' cos-sim"]
        ) for i in range(len(texts))
    ], axis=1)
    word_table.loc[:, word_table.columns.str.contains("cos-sim")] = \
        word_table.loc[:, word_table.columns.str.contains("cos-sim")].apply(lambda x: round(x, 3))

    return word_table

There are two models:

- **Word2Vec**: Good for capturing context.

- **FastText**: Good for finding hidden words / words that are pronounced likely similarly.

It's better to check the result of both.

The parameters of FastText can be shown in `fasttext_experiment.ipynb`.

In [22]:
w2v_model = Word2Vec(df_game_english["pre_word_embedding"],
                              min_count=2,
                              vector_size=150,
                              window=15)

In [23]:
get_similar_text(["anime", "monster", "flight"], w2v_model, 15)

Unnamed: 0,'anime' similar text,'anime' cos-sim,'monster' similar text,'monster' cos-sim,'flight' similar text,'flight' cos-sim
0,animated,0.842,monsters,0.768,seat,0.897
1,charm,0.838,goblin,0.724,cockpit,0.879
2,animations,0.81,hunter,0.712,fpv,0.856
3,dialogue,0.777,poweful,0.697,accurate,0.847
4,erotic,0.776,skillet,0.69,drone,0.846
5,traditional,0.775,geralt,0.687,driving,0.844
6,arts,0.773,rivia,0.657,high-speed,0.844
7,light-hearted,0.761,fingertipsdungeon,0.653,air,0.835
8,drawn,0.76,goblins,0.643,flying,0.834
9,graphic,0.759,wizard,0.643,driver,0.831


In [25]:
ft_model = FastText(df_game_english["pre_word_embedding"],
                    min_count=2,
                    vector_size=150,
                    window=10)

In [26]:
get_similar_text(["anime", "monster", "flight"], ft_model, 15)

Unnamed: 0,'anime' similar text,'anime' cos-sim,'monster' similar text,'monster' cos-sim,'flight' similar text,'flight' cos-sim
0,animi,0.916,bigmonster,0.988,flight's,0.954
1,anima,0.88,monstersmonster,0.98,highlight,0.941
2,anime-style,0.84,monster's,0.971,droplight,0.939
3,animus,0.831,monstercat,0.959,flights,0.931
4,animes,0.827,monstersummon,0.959,ultralight,0.92
5,schoolgirls,0.824,monsti,0.946,flight-sim,0.918
6,futa-girls,0.811,monsters,0.93,spotlight,0.915
7,furry-girls,0.809,monster-girl,0.92,tightrope,0.913
8,animaze,0.8,monstergirl,0.92,lightweight,0.911
9,novel-style,0.796,monsoon,0.92,light-weight,0.906


In [27]:
# Each word has a vector. Using this to assign the vector of each sentence
ft_model.wv["anime"].shape

(150,)

In [28]:
w2v_model.save("model/w2v_model.model")
ft_model.save("model/ft_model.model")

In [29]:
def get_average_vector(x, model):
    vector_list = []
    for e in x:
        try:   # Some words only appear once
            vector_list.append(model.wv[e])
        except:
            continue

    vector_list = np.asarray(vector_list)
    result = vector_list.mean(axis=0)
    return result

In [62]:
w2v_id_to_vector = {}
ft_id_to_vector = {}

for i in range(df_game_english.shape[0]):
    # key as string is safe for saving to json
    w2v_id_to_vector[str(df_game_english.loc[i, "steam_appid"])] = \
                get_average_vector(df_game_english.loc[i, "pre_word_embedding"], w2v_model).tolist()  # list is too
    ft_id_to_vector[str(df_game_english.loc[i, "steam_appid"])] = \
                get_average_vector(df_game_english.loc[i, "pre_word_embedding"], ft_model).tolist()

len(w2v_id_to_vector), len(ft_id_to_vector)

(84000, 84000)

___

## Apply the model to a recommendation system

The general application can be seen on the next notebook. In this one, we will try how effective it can be.

Of course some games will have negative reviews and we can't recommend these without warnings. We will do that later.

In [31]:
# Since there is no shuffle, it's safe to merge them
# Also only some features are selected. The 3rd version will be shown completely
df_game_english = pd.concat([
    df_game_english,
    df_game.loc[df_game["lang_en"], ["name", "initial_price_usd", "final_price_usd",
                                     "memory_gb", "storage_gb", "single", "multi"]].reset_index(drop=True)
], axis=1)
df_game_english.head()

Unnamed: 0,lang_en,lang_fr,lang_de,lang_es,lang_po,lang_zh,lang_ja,lang_ko,lang_it,lang_ru,...,film,steam_appid,pre_word_embedding,name,initial_price_usd,final_price_usd,memory_gb,storage_gb,single,multi
0,True,True,True,True,False,False,False,False,False,False,...,False,1904640,"[experience, mysterious, al-qadim, game, prepa...",Al-Qadim: The Genie's Curse,3.204,3.204,0.5,2.0,True,False
1,True,True,True,False,False,False,False,False,False,False,...,False,1904650,"[kingdom, legendary, dungeons, dragons, game, ...",Dungeons & Dragons - Stronghold: Kingdom Simul...,3.204,3.204,0.5,2.0,True,False
2,True,True,True,True,True,False,False,False,False,False,...,False,1904680,"[chapel, ascent, break-neck, violent, boomer, ...",Chapel 3-D: The Ascent,0.0,0.0,1.0,0.0,True,False
3,True,False,False,False,False,False,False,False,False,False,...,False,1904690,"[vtuber, gallery, anime, pose, app, easy, pose...",VTuber Gallery : Anime Pose,0.0,0.0,8.0,0.0,False,False
4,True,False,False,False,False,False,False,False,False,False,...,False,1904700,"[pack, traits, apex, predator, quills, pest, d...",Evolution - Alone and Unafraid Trait Pack,2.943675,2.943675,4.0,2.0,True,True


In [32]:
df_game_english.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84000 entries, 0 to 83999
Data columns (total 28 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   lang_en             84000 non-null  bool   
 1   lang_fr             84000 non-null  bool   
 2   lang_de             84000 non-null  bool   
 3   lang_es             84000 non-null  bool   
 4   lang_po             84000 non-null  bool   
 5   lang_zh             84000 non-null  bool   
 6   lang_ja             84000 non-null  bool   
 7   lang_ko             84000 non-null  bool   
 8   lang_it             84000 non-null  bool   
 9   lang_ru             84000 non-null  bool   
 10  lang_ar             84000 non-null  bool   
 11  type                84000 non-null  object 
 12  developers          84000 non-null  object 
 13  publishers          84000 non-null  object 
 14  genres              84000 non-null  object 
 15  description         84000 non-null  object 
 16  tool

In [33]:
def appids_to_df(appids, big_df=df_game, name_feature="steam_appid"):
    return big_df[big_df[name_feature].isin(appids)]

Using cosine similarity to find the similar games/apps.

___

In [34]:
pd.concat([
    df_game.loc[df_game["name"] == "Goat Simulator", ["type", "name", "genres", "developers", "steam_appid", "lang_en", "lang_zh", "description"]],
    df_game.loc[df_game["name"] == "Goat Simulator 3", ["type", "name", "genres", "developers", "steam_appid", "lang_en", "lang_zh", "description"]]
], axis=0)

Unnamed: 0,type,name,genres,developers,steam_appid,lang_en,lang_zh,description
64824,game,Goat Simulator,"[4, 23, 28]",[33343],265930,True,True,Goat Simulator is the latest in high-tech Goat...
88099,game,Goat Simulator 3,"[25, 4, 28]",[15196],850190,True,True,Pilgor's baaack! Gather your herd and venture ...


In [64]:
print("Word2Vec: {}".format( cosine_similarity([w2v_id_to_vector["265930"]], [w2v_id_to_vector["850190"]])[0][0] ))
print("FastText: {}".format( cosine_similarity([ft_id_to_vector["265930"]], [ft_id_to_vector["850190"]])[0][0] ))

Word2Vec: 0.7902842862947479
FastText: 0.6685353903373301


It's understandable that the cosine similarity between these two games is high for **Word2Vec** due to it can capture the context. Meanwhile, the two descriptions are different, but **FastText** captures the similarity in pronunciation, thus the metric results in a lower value.

___

In [36]:
pd.concat([
    df_game.loc[df_game["name"] == "Cities: Skylines II", ["type", "name", "genres", "developers", "steam_appid", "lang_en", "lang_zh", "description"]],
    df_game.loc[df_game["name"] == "Planet Coaster", ["type", "name", "genres", "developers", "steam_appid", "lang_en", "lang_zh", "description"]]
], axis=0)

Unnamed: 0,type,name,genres,developers,steam_appid,lang_en,lang_zh,description
77752,game,Cities: Skylines II,[28],[5399],949230,True,True,Raise a city from the ground up and transform ...
54085,game,Planet Coaster,"[1, 25, 4, 28, 2]","[1587, 5104]",493340,True,True,Planet Coaster - the future of coaster park si...


In [58]:
print("Word2Vec: {}".format( cosine_similarity([w2v_id_to_vector["949230"]], [w2v_id_to_vector["493340"]])[0][0] ))
print("FastText: {}".format( cosine_similarity([ft_id_to_vector["949230"]], [ft_id_to_vector["493340"]])[0][0] ))

Word2Vec: 0.9195255535243897
FastText: 0.9194104121487443


Unlike the example above, both **Word2Vec** and **FastText** give better results. It seems they both have the same contexts as well as the same words usage.

___

In [38]:
pd.concat([
    df_game.loc[df_game["name"] == "World of Tanks", ["type", "name", "genres", "developers", "steam_appid", "lang_en", "lang_zh", "description"]],
    df_game.loc[df_game["name"] == "Stardew Valley", ["type", "name", "genres", "developers", "steam_appid", "lang_en", "lang_zh", "description"]]
], axis=0)

Unnamed: 0,type,name,genres,developers,steam_appid,lang_en,lang_zh,description
82874,game,World of Tanks,"[1, 29, 28, 37]",[5541],1407200,True,True,Jump into the free-to-play team-based shooter ...
77723,game,Stardew Valley,"[23, 3, 28]",[45069],413150,True,True,You've inherited your grandfather's old farm p...


In [65]:
print("Word2Vec: {}".format( cosine_similarity([w2v_id_to_vector["1407200"]], [w2v_id_to_vector["413150"]])[0][0] ))
print("FastText: {}".format( cosine_similarity([ft_id_to_vector["1407200"]], [ft_id_to_vector["413150"]])[0][0] ))

Word2Vec: 0.6984315864634885
FastText: 0.6187134804746832


Both are different games, but both **Word2Vec** and **FastText** give not so bad results.

___

Let's check the true recommendation system below.

In [40]:
def recommend_by_description(name=None, appid=None, id_to_vector=None, n_rec=10):

    if id_to_vector is None:
        raise Exception("Please add a ID converter to vectors")
    
    # Check validation
    if (name is None) and (appid is None):
        raise Exception("Please add name or appid")

    global df_game, df_game_english
    
    if not name is None:
        appid = df_game.loc[df_game["name"] == name, "steam_appid"].values[0]

    
    # Get embedded vector of the game
    try:  # Maybe the input is not an English game (or data bug, see more below)
        curr_vector = id_to_vector[appid]
    except:
        print("Unfortunately, this game cannot be recommended since it doesn't have English language or the dataframe about this game is buggy.")


    # Calculate the cosine similarities for each game
    keys = np.fromiter(id_to_vector.keys(), dtype=int)
    cos_sims = np.asarray([cosine_similarity([curr_vector], [v]) for v in id_to_vector.values()]).reshape(-1)

    # Get the highest cosine similarity by indices
    sort_ids = np.argsort(cos_sims)[::-1]
    sort_ids = sort_ids[:n_rec]
    
    return keys[sort_ids]

In [41]:
w2v_rec_table = appids_to_df(recommend_by_description(name="Goat Simulator", id_to_vector=w2v_id_to_vector))
w2v_rec_table[["steam_appid", "name", "required_age", "genres", "release_year", "initial_price_usd", "final_price_usd"]]

Unnamed: 0,steam_appid,name,required_age,genres,release_year,initial_price_usd,final_price_usd
1414,2014780,X-Plane 12,0,[28],2022.0,32.04,32.04
36889,1847510,Boba Simulator : Idle Shop Management,0,"[4, 23, 28, 2, 70]",2022.0,2.36295,2.36295
38193,1827430,Yum Yum Cookstar,0,"[4, 28]",2022.0,6.668325,2.663325
60895,425840,Goat Simulator: PAYDAY,0,"[4, 23, 28]",2016.0,2.943675,2.943675
64824,265930,Goat Simulator,0,"[4, 23, 28]",2014.0,5.6871,5.6871
73481,1204270,New Year's Eve 2020,0,"[4, 23, 28]",2019.0,0.60075,0.60075
77142,1185000,Bloody trains,0,"[1, 23, 28]",2019.0,7.40925,0.740925
81387,1510570,Toy Tinker Simulator: Prologue,0,"[4, 37, 23, 28]",2021.0,0.0,0.0
87786,2566440,FarmCraft,0,"[4, 28]",2023.0,4.0851,4.0851
88270,1273400,Construction Simulator,0,"[4, 28]",2022.0,17.82225,17.82225


In [42]:
ft_rec_table = appids_to_df(recommend_by_description(name="Goat Simulator", id_to_vector=ft_id_to_vector))
ft_rec_table[["steam_appid", "name", "required_age", "genres", "release_year", "initial_price_usd", "final_price_usd"]]

Unnamed: 0,steam_appid,name,required_age,genres,release_year,initial_price_usd,final_price_usd
1414,2014780,X-Plane 12,0,[28],2022.0,32.04,32.04
7458,2325640,Sexy Nurse Puzzle,0,"[4, 23, 28]",2023.0,1.2015,1.2015
11056,2179380,Sand:box,0,"[4, 23, 28]",2023.0,2.36295,2.36295
21465,2690180,XiJiang Shipyard,0,"[23, 28]",2024.0,8.21025,8.21025
34120,1291340,Townscaper,0,"[4, 23, 28]",2021.0,3.504375,3.504375
36889,1847510,Boba Simulator : Idle Shop Management,0,"[4, 23, 28, 2, 70]",2022.0,2.36295,2.36295
64824,265930,Goat Simulator,0,"[4, 23, 28]",2014.0,5.6871,5.6871
72770,1214520,Pro Gymnast Simulator,0,"[1, 23, 28, 18]",2020.0,14.99,14.99
76932,1120920,ELON on MARS,0,"[23, 9, 28, 18]",2019.0,0.60075,0.60075
77142,1185000,Bloody trains,0,"[1, 23, 28]",2019.0,7.40925,0.740925


Most of the games are well-recommended. However, some are weird recommended (especially **FastText**), including NSFW game for such a game for everyone.

We will handle this in the next notebook.

In [72]:
with open("model/w2v_vectors.json", "w") as f:
	json_val = json.dumps(w2v_id_to_vector)
	print(json_val, file=f)
f.close()

with open("model/ft_vectors.json", "w") as f:
	json_val = json.dumps(ft_id_to_vector)
	print(json_val, file=f)
f.close()

___

## Another Problems

Some popular games don't have English language (?) and thus are removed since they are mistaken as non-English game.

In [44]:
df_game.loc[df_game["name"] == "Dota 2", ["name", "steam_appid", "lang_en", "lang_zh"]]

Unnamed: 0,name,steam_appid,lang_en,lang_zh
88106,Dota 2,570,False,True
