# Modeling (Recommendation System)

## Part 3: Hybrid Filtering

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Self-made codes
from model.collaborative import CollaborativeFiltering
from model.contentbased import ContentBasedFiltering
from model.filter_funcs import *

In [2]:
import warnings
warnings.filterwarnings('ignore')

This notebook will take advantage of the two model types of the previous notebooks.

**Please notice that the code in python files were changed a little to unify both.**

### Video game dataframe / `df_game` / `cleaned_steam_db_v2.csv` -> `game`

In [4]:
from gensim.models import Word2Vec, FastText

game = ContentBasedFiltering()
game.read_csv("archive/cleaned_steam_db_v2.csv")
game.get_genre_dict("json_folder/genre_dict_clean.json")
game.get_studio_dict("json_folder/studio_dict.json")
game.load_models(
    ["model/w2v_model.model", "model/ft_model.model"],
    [Word2Vec, FastText],
    ["model/w2v_vectors.json", "model/ft_vectors.json"]
)
game.df.head()

Unnamed: 0,type,name,steam_appid,required_age,is_free,genres,platform_windows,platform_mac,platform_linux,release_year,...,nsfw,film,developers,publishers,description,release_distance_value,initial_price_usd,final_price_usd,memory_gb,storage_gb
0,demo,Pin Them Demo,1904630,0,True,[],True,False,False,2023.0,...,False,False,[0],[0],,2,0.0,0.0,,
1,game,Al-Qadim: The Genie's Curse,1904640,0,False,"[1, 3]",True,False,False,2022.0,...,False,False,[1],[2],Experience the mysterious Al-Qadim game world ...,2,3.204,3.204,0.5,2.0
2,game,Dungeons & Dragons - Stronghold: Kingdom Simul...,1904650,0,False,"[28, 2]",True,False,False,2022.0,...,False,False,[3],[2],Run your own kingdom in the legendary Dungeons...,2,3.204,3.204,0.5,2.0
3,game,Chapel 3-D: The Ascent,1904680,0,False,"[1, 23]",True,False,False,,...,False,False,[4],[5],"Chapel 3-D: The Ascent is a break-neck, viole...",0,0.0,0.0,1.0,0.0
4,game,VTuber Gallery : Anime Pose,1904690,0,True,"[51, 53, 55, 57, 59, 70]",True,False,False,2022.0,...,False,False,[6],[6],VTuber Gallery is #1 anime pose app that allow...,2,0.0,0.0,8.0,0.0


### The bridge between `df_game` and `df_review` / `df_bridge` / `cleaned_bridge.csv` -> `bridge`

In [11]:
bridge = pd.read_csv("archive/cleaned_bridge.csv", index_col=[0])
bridge.head()

Unnamed: 0,appid,num_reviews,review_score,review_score_desc,total_positive,total_negative,total_reviews
0,1020470,2,6,Mostly Positive,360,106,466
1,1018050,0,0,No user reviews,0,0,0
2,1018060,0,0,No user reviews,0,0,0
3,1018080,3,0,3 user reviews,2,1,3
4,1018090,7,0,7 user reviews,1,6,7


### The review dataframe / `df_review` / `cleaned_reviews_v2.csv` -> `review`

In [5]:
from surprise import KNNBasic, KNNWithMeans, SlopeOne

review = CollaborativeFiltering()
review.read_csv("archive/cleaned_reviews_v2.csv")
review.fit_models([
    KNNBasic(sim_options={ "user_based": True }),
	KNNWithMeans(sim_options={ "user_based": False }),
    SlopeOne()
])
review.df.head()

Unnamed: 0,recommendationid,review,timestamp_created,timestamp_updated,voted_up,votes_up,votes_funny,weighted_vote_score,comment_count,steam_purchase,received_for_free,written_during_early_access,appid,steamid,num_games_owned,num_reviews,playtime_forever,playtime_last_two_weeks,playtime_at_review,last_played
0,55936147,A fun and quirky stealth-based problem solving...,2019-10-31 02:33:41,2019-10-31 02:33:41,True,2,0,0.55414,0,True,False,False,1018080,76561198051821837,0,12,41,0,41.0,2019-10-31 00:52:26
1,55989797,"Loved the art style, and the game ran very smo...",2019-10-31 11:57:40,2019-10-31 11:57:40,True,1,0,0.52381,0,True,False,False,1018080,76561197993790846,657,8,17,0,15.0,2019-11-03 09:37:50
2,64251252,the game crashed four times for one hour.... i...,2020-02-28 16:21:55,2020-02-28 16:26:49,False,0,0,0.0,0,True,False,False,1018080,76561198095855343,1254,24,76,0,76.0,2020-02-28 16:16:07
3,49140086,While I cannot recommend this Unity asset reli...,2019-02-21 15:52:16,2019-02-21 15:52:16,False,16,0,0.624971,0,True,False,False,1018090,76561198053422627,2384,1225,56,0,56.0,2019-02-23 20:59:13
4,49137406,Is extremely unoptimized and has laggy framera...,2019-02-21 13:10:21,2019-02-21 13:10:21,False,11,0,0.527824,0,True,False,False,1018090,76561198019816374,1351,1674,10,0,10.0,2019-02-21 11:27:35


In [12]:
game_full = pd.merge(game.df, bridge, how="left", left_on="steam_appid", right_on="appid")
game_full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 95033 entries, 0 to 95032
Data columns (total 49 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   type                    95033 non-null  object 
 1   name                    95033 non-null  object 
 2   steam_appid             95033 non-null  int64  
 3   required_age            95033 non-null  int64  
 4   is_free                 95033 non-null  bool   
 5   genres                  95033 non-null  object 
 6   platform_windows        95033 non-null  bool   
 7   platform_mac            95033 non-null  bool   
 8   platform_linux          95033 non-null  bool   
 9   release_year            84290 non-null  float64
 10  release_quarter         82811 non-null  float64
 11  coming_soon             95033 non-null  bool   
 12  package_number          95033 non-null  int64  
 13  discount_percent        95033 non-null  float64
 14  developers_amount       95033 non-null

In [19]:
output_features = ["type", "name", "steam_appid", "required_age", "genres",  "release_year",
   "final_price_usd", "memory_gb", "storage_gb", "review_score", "review_score_desc", "total_reviews"]

___

## User treatment

Imagine we are a service of the Steam and Valve, and there are several types of users.

|Type                                |Popular|On sale |Content-based|Collaborative|
|------------------------------------|-------|--------|-------------|-------------|
|new users                           |✅✅  |✅      |✅✅        |             |
|old users                           |✅     |✅✅   |✅           |✅✅        |
|users with conditions (ages, etc.)  |✅     |✅      |✅✅        |✅✅        |
|users loving free/on sale games     |✅     |✅✅   |✅          |✅           |

Where:
- []: Unable to use the model.
- \[✅\]: Usable.
- \[✅✅\]: Very relevant.

___

### Popular apps

Popular apps have high number of ratings. They can be positive or negative, but let the users choose their games.

**Pros**:
- Extremely easy to recommend.
- Many players will try to own them due to their popularity and their great fandom.

**Cons**:
- Some games are really heavy, or having negative reviews. Some of these are not novel comparing to the new released ones.

**Where to recommend**:
- **Top 1**. On the top of the UI, so that everyone will see it at the beginning.

In [25]:
def filter_by_number_of_reviews(df, threshold, top_n=10, shuffle=True,
                                filter_func_list: list = [], ignored_appids: list = []):
    
    recommendation_table = df[df["total_reviews"] > threshold]

    # If shuffle = true: shuffle the results to avoid not recommend less popular ones
    # Otherwise: recommend the top most reviews apps
    if shuffle:
        recommendation_table.sample(frac=1)
    else:
        recommendation_table.sort_values(by=["total_reviews", "review_score"], ascending=False)

    # Apply filter
    if len(filter_func_list) > 0:
        for func in filter_func_list:
            recommendation_table = func(recommendation_table)

    recommendation_table = recommendation_table[~recommendation_table["appid"].isin(ignored_appids)]

    return recommendation_table[:top_n]

In [18]:
for pin in [0, 1, 25, 50, 75, 99, 100]:
    print("percentile = {}: {}".format(pin / 100, np.percentile(game_full["total_reviews"].dropna(), pin)))

percentile = 0.0: 0.0
percentile = 0.01: 0.0
percentile = 0.25: 0.0
percentile = 0.5: 0.0
percentile = 0.75: 8.0
percentile = 0.99: 3044.5
percentile = 1.0: 1222889.0


In [27]:
# It seems not that many games have more than 3000 reviews
# Let's obtain percentile 0.99 as threshold
filter_by_number_of_reviews(
    game_full,
    np.percentile(game_full["total_reviews"].dropna(), 99),
    top_n=15,
    ignored_appids=[2138330]  # Cyberpunk 2077: Phantom Liberty	
)[output_features]

Unnamed: 0,type,name,steam_appid,required_age,genres,release_year,final_price_usd,memory_gb,storage_gb,review_score,review_score_desc,total_reviews
53,game,Football Manager 2023,1904540,0,"[28, 18]",2022.0,0.0,4.0,7.0,8.0,Very Positive,5374.0
1059,game,Lethal Company,1966720,0,"[1, 25, 23, 70]",2023.0,8.173,,1.0,9.0,Overwhelmingly Positive,208193.0
1073,game,20 Minutes Till Dawn,1966900,0,"[1, 25, 4, 23, 2]",2023.0,2.943675,1.0,0.0,8.0,Very Positive,11168.0
1902,game,Call of Duty®: Modern Warfare®,2000950,18,[1],2023.0,0.0,8.0,60.0,8.0,Very Positive,4547.0
1990,game,Placid Plastic Duck Simulator,1999360,0,"[4, 28]",2022.0,1.2015,4.0,0.5,9.0,Overwhelmingly Positive,7818.0
3267,game,Madden NFL 24,2140330,0,"[28, 18, 2]",2023.0,0.0,10.0,50.0,5.0,Mixed,5354.0
4220,game,Return to Monkey Island,2060130,0,"[25, 4]",2022.0,12.85605,8.0,4.0,8.0,Very Positive,4359.0
9444,game,Tom Clancy’s The Division® 2,2221490,18,"[1, 25, 3]",2023.0,19.82475,8.0,60.0,6.0,Mostly Positive,6641.0
9923,game,Assassin's Creed Valhalla,2208920,17,"[1, 25, 3]",2022.0,59.99,8.0,60.0,5.0,Mixed,7117.0
11736,game,Thronefall,2239150,0,"[1, 23, 2, 70]",2023.0,5.6871,4.0,0.25,9.0,Overwhelmingly Positive,5389.0


___

### On sales apps

This will include free games and on sale ones (with discount larger than 0).

**Pros**:
- Extremely easy to recommend.
- More people will look for other unknown games.

**Cons**:
- Some paid games don't want to be on sale, and they can be ignored.

**Where to recommend**:
- **Top 2**. Right after the list of popular apps/games, so that everyone will consider buying them.

In [46]:
def filter_by_on_sale(df, top_n=10, shuffle=True,
                      filter_func_list: list = [], ignored_appids: list = []):
    
    recommendation_table = df[(df["discount_percent"] > 0) | (df["final_price_usd"] == 0)]

    if shuffle:
        recommendation_table.sample(frac=1)
    else:
        recommendation_table.sort_values(by=["total_reviews", "review_score"], ascending=False)

    # Apply filter
    if len(filter_func_list) > 0:
        for func in filter_func_list:
            recommendation_table = func(recommendation_table)

    recommendation_table = recommendation_table[~recommendation_table["appid"].isin(ignored_appids)]

    return recommendation_table[:top_n]

In [47]:
# It seems not that many games have more than 3000 reviews
# Let's obtain percentile 0.99 as threshold
filter_by_on_sale(
    game_full,
    top_n=15,
    filter_func_list=[
        filter_games_only,
        filter_positive_apps,
        lambda x: x[x["total_reviews"] > 300]
    ],
    ignored_appids=[2138330]  # Cyberpunk 2077: Phantom Liberty	
)[[*output_features, "initial_price_usd", "discount_percent"]]

Unnamed: 0,type,name,steam_appid,required_age,genres,release_year,final_price_usd,memory_gb,storage_gb,review_score,review_score_desc,total_reviews,initial_price_usd,discount_percent
53,game,Football Manager 2023,1904540,0,"[28, 18]",2022.0,0.0,4.0,7.0,8.0,Very Positive,5374.0,0.0,0.0
114,game,Killer Frequency,1903620,0,"[25, 23, 28]",2023.0,10.0125,4.0,2.0,8.0,Very Positive,1609.0,20.025,50.0
630,game,Jusant,1977170,0,"[1, 25, 23]",2023.0,15.399225,8.0,15.0,8.0,Very Positive,1119.0,20.525625,25.0
896,game,OCTOPATH TRAVELER II,1971650,0,[3],2023.0,0.0,8.0,10.0,9.0,Overwhelmingly Positive,2879.0,0.0,0.0
920,game,Warlord: Britannia,1970980,0,"[1, 25, 23, 2]",2022.0,5.64705,8.0,5.0,8.0,Very Positive,1188.0,7.5294,25.0
1042,game,Ghost Trick: Phantom Detective,1967430,0,"[1, 25]",2023.0,18.282825,8.0,7.0,9.0,Overwhelmingly Positive,1136.0,27.7146,34.0
1048,game,Railbound,1967510,0,"[4, 23]",2022.0,4.826025,4.0,0.25,8.0,Very Positive,452.0,7.209,33.0
1765,game,Void Scrappers,2005210,0,[1],2022.0,1.421775,2.0,0.25,8.0,Very Positive,325.0,2.36295,40.0
1902,game,Call of Duty®: Modern Warfare®,2000950,18,[1],2023.0,0.0,8.0,60.0,8.0,Very Positive,4547.0,0.0,0.0
1952,game,Atelier Ryza 3: Alchemist of the End & the Sec...,1999770,0,"[25, 4, 3, 28]",2023.0,0.0,8.0,50.0,8.0,Very Positive,345.0,0.0,0.0


___

### Content-based Filtering

Based on all the existing features, we apply NLP to the model to obtain the vectors of each item. After that, we use metrics to figure the similar items out.

**Pros**:
- All of the games can be possibly included.
- Take advantage of categories and description features.

**Cons**:
- Complicated model that requires time and storage.
- Sometimes the recommended items can be absurd.
- Pretty slow for recommending if not optimised properly (since the model has to check through all of the items to find the best among them).

**Where to recommend**:
- **Top 4**. At the end of the UI. This will recommend the games that are similar to the ones users have played before.

In [10]:
game.get_top_10(
    name="Goat Simulator",
    vectorised_id=1,
    df_bridge=df_bridge,
    filter_func_list=[
        filter_games_only,
        filter_affordable_apps,
        filter_restricted_age,
        lambda x: filter_light_storage_games(x, threshold=10, including_null=True)
    ],
    ignored_appids=[850190] # Goat Simulator 3
)[["type", "name", "steam_appid", "required_age", "genres",  "release_year",
   "final_price_usd", "memory_gb", "storage_gb", "review_score", "review_score_desc", "total_reviews"]]

Unnamed: 0,type,name,steam_appid,required_age,genres,release_year,final_price_usd,memory_gb,storage_gb,review_score,review_score_desc,total_reviews
0,game,Bloody trains,1185000,0,"[1, 23, 28]",2019.0,0.740925,2.0,1.0,0.0,5 user reviews,5.0
1,game,Sand:box,2179380,0,"[4, 23, 28]",2023.0,2.36295,1.0,0.0,8.0,Very Positive,188.0
2,game,XiJiang Shipyard,2690180,0,"[23, 28]",2024.0,8.21025,8.0,8.0,0.0,No user reviews,0.0
3,game,Boba Simulator : Idle Shop Management,1847510,0,"[4, 23, 28, 2, 70]",2022.0,2.36295,4.0,0.25,8.0,Very Positive,110.0
5,game,Pro Gymnast Simulator,1214520,0,"[1, 23, 28, 18]",2020.0,14.99,1.0,1.0,7.0,Positive,42.0
6,game,Townscaper,1291340,0,"[4, 23, 28]",2021.0,3.504375,4.0,1.0,9.0,Overwhelmingly Positive,9741.0
7,game,ELON on MARS,1120920,0,"[23, 9, 28, 18]",2019.0,0.60075,1.0,0.0,0.0,8 user reviews,8.0
8,game,Sexy Nurse Puzzle,2325640,0,"[4, 23, 28]",2023.0,1.2015,0.0,0.25,7.0,Positive,10.0
9,game,Economica,1215380,0,"[23, 28, 2]",2020.0,4.005,2.0,0.0,0.0,1 user reviews,1.0
10,game,Virtual Aquarium - Overlay Desktop Game,1791120,0,"[4, 37, 28]",2022.0,0.0,2.0,0.5,0.0,No user reviews,0.0


___

### Collaborative Filtering

Based on the rating, we can extract the relevant games for each users.

**Pros**:
- Reliable, especially when users have high number of ratings for each game.
- Pretty fast for recommending.

**Cons**:
- Require users to rate the games. The number should be at least three.
- The data's size is the problem for performance (In fact, Steam has 120 million users and 73 thousands apps (not including DLC)).
- Still pretty complicated when compiling.

**Where to recommend**:
- **Top 3**. It's much less absurd than Content-based Filtering model.
- However, if the user is new without proper ratings, this model will ignore recommending.

In [9]:
review.get_top_10(
    user=76561198104631037,
    model_id_list=[2],
    df_game=game.df[["type", "name", "steam_appid", "required_age", "genres",  "release_year", "final_price_usd", "memory_gb", "storage_gb"]],
    df_bridge=bridge[["appid", "review_score", "review_score_desc", "total_reviews"]],
    filter_func_list=[
        filter_games_only,
        filter_affordable_apps,
        lambda x: filter_light_storage_games(x, threshold=10, including_null=True)
    ])

Unnamed: 0,appid,likely_to_like,type,name,steam_appid,required_age,genres,release_year,final_price_usd,memory_gb,storage_gb,review_score,review_score_desc,total_reviews
14,70,YES,game,Half-Life,70,0,[1],1998.0,4.806,,,9,Overwhelmingly Positive,45410
20,2280,YES,game,DOOM (1993),2280,0,[1],2007.0,4.505625,,,9,Overwhelmingly Positive,10333
27,3540,YES,game,Peggle™ Nights,3540,0,[4],2008.0,2.8035,,,9,Overwhelmingly Positive,1341
31,4570,YES,game,"Warhammer® 40,000: Dawn of War® - Game of the ...",4570,16,[2],2007.0,6.0075,,,9,Overwhelmingly Positive,4383
32,4580,YES,game,"Warhammer® 40,000: Dawn of War® - Dark Crusade",4580,16,[2],2007.0,6.0075,,,9,Overwhelmingly Positive,4044
33,4700,YES,game,Total War: MEDIEVAL II – Definitive Edition,4700,0,[2],2006.0,14.69835,,,9,Overwhelmingly Positive,14140
46,8930,YES,game,Sid Meier's Civilization® V,8930,0,[2],2010.0,20.4255,2.0,,9,Overwhelmingly Positive,72884
51,9450,YES,game,"Warhammer® 40,000: Dawn of War® - Soulstorm",9450,16,[2],2008.0,6.0075,0.5,,9,Overwhelmingly Positive,6736
76,22320,YES,game,The Elder Scrolls III: Morrowind® Game of the ...,22320,0,[3],2009.0,13.516875,0.0,,9,Overwhelmingly Positive,14705
77,22380,YES,game,Fallout: New Vegas,22380,17,"[1, 3]",2010.0,9.01125,2.0,,9,Overwhelmingly Positive,124455


___

## Conclusion

Taking advantage of all the models to recommend items to users is essential to keep users. Especially video games store, where the people find the relevant games for entertainment and stress reduction.

There are still some recommendation ideas such as based on the subscribed developers or publishers, or improve the **Popular games** by applying the users' taste to it.