# Analysis of Eiyuden Chronicle: Hundred Heroes - Data Collection

This Jupyter Notebook is part of my **Analysis of Eiyuden Chronicle: Hundred Heroes** project. It focuses on the data collection stage and also serves as a demo for [GameInsights](https://github.com/nazhifkojaz/steam-game-data-collector), a Python library I built for collecting and analyzing **Steam game data**.

**GameInsights** started as a personal tool, but I decided to expand it into something that can be useful for others—such as *game developers, researchers, or anyone who enjoys analyzing game data*.

For now, GameInsights focuses on **data collection** with its **collector** module. In the future, I plan to add two additional modules:  
- **Analyzer** – for processing and analyzing collected data.  
- **Visualizer** – for creating plots and visual summaries.  

In this notebook, I will show some of the main functions in the **collector** module, including:  
- `get_game_review` – pulls a list of reviews for a given Steam game.  
- `get_user_data` – retrieves user data for a provided SteamID.  
- `get_games_data` – compiles game data from multiple sources, based on a given Steam AppID.  

For simplicity, I will refer to *Eiyuden Chronicle: Hundred Heroes* as **ECHH** throughout this notebook.


## Library Imports

In [1]:
from gameinsights import Collector

In [2]:
import pandas as pd
import numpy as np

In [3]:
# import secrets
from dotenv import load_dotenv
import os
load_dotenv()

True

## Setting up GameInsights

Here we set up the things needed, like the STEAM_API_KEY that's required to get you an access to pull the user data and game schematics (achievements labels)

In [4]:
steam_api_key = os.getenv("STEAM_API_KEY")

In [5]:
collector = Collector(steam_api_key=steam_api_key)

In [6]:
eiyuden_appid = 1658280

## Collecting review data

I think the most obvious data that we can pull from steam game is its review data, so let's begin

In [7]:
reviews = collector.get_game_review(steam_appid=eiyuden_appid, verbose=False, review_only=True)

In [13]:
reviews.shape

(4100, 21)

In [9]:
reviews.columns

Index(['recommendation_id', 'author_steamid', 'author_num_games_owned',
       'author_num_reviews', 'author_playtime_forever',
       'author_playtime_last_two_weeks', 'author_playtime_at_review',
       'author_last_played', 'language', 'review', 'timestamp_created',
       'timestamp_updated', 'voted_up', 'votes_up', 'votes_funny',
       'weighted_vote_score', 'comment_count', 'steam_purchase',
       'received_for_free', 'written_during_early_access',
       'primarily_steam_deck'],
      dtype='object')

In [24]:
# save reviews to parquet
reviews.to_parquet("data/eiyuden_reviews.parquet", index=False)

## Collecting user data

From the review data, we can extract the list of *steamid* of the users who wrote the reviews, and this gives us a subset of ECHH players from its whole population.

In [7]:
reviews = pd.read_parquet("data/eiyuden_reviews.parquet")

In [9]:
reviews.head()

Unnamed: 0,recommendation_id,author_steamid,author_num_games_owned,author_num_reviews,author_playtime_forever,author_playtime_last_two_weeks,author_playtime_at_review,author_last_played,language,review,...,timestamp_updated,voted_up,votes_up,votes_funny,weighted_vote_score,comment_count,steam_purchase,received_for_free,written_during_early_access,primarily_steam_deck
0,201952284,76561198045255045,0,10,9,9,9,1754967253,english,By the 4th text box in the game I noticed some...,...,1754968462,False,0,0,0.5,0,False,False,False,False
1,201940210,76561198113965734,0,2,2734,1447,2734,1754703861,english,"Needs a x2 , x3 speed key. Would shave off 70...",...,1754950716,False,0,0,0.5,0,True,False,False,False
2,201914252,76561198138897509,0,7,1530,611,1377,1754932590,italian,per ora bella storia e grafica gradevole,...,1754923391,True,0,0,0.5,0,True,False,False,False
3,201894653,76561198368799839,6496,540,8511,366,8511,1754893622,koreana,- 이 게임의 가장 큰 문제점은 환상수호전2라는 벽이 너무 거대하다는 거다.\n본인...,...,1754896546,True,8,1,0.619325,0,False,False,False,False
4,201825659,76561198449223262,0,2,5018,1605,5018,1754809816,japanese,仲間全員集めてレースやカードゲーム等のミニゲームにも手を付けてプレイ時間約８０時間でクリアし...,...,1754812678,True,0,0,0.5,0,False,False,False,False


In [166]:
reviews.columns

Index(['recommendation_id', 'author_steamid', 'author_num_games_owned',
       'author_num_reviews', 'author_playtime_forever',
       'author_playtime_last_two_weeks', 'author_playtime_at_review',
       'author_last_played', 'language', 'review', 'timestamp_created',
       'timestamp_updated', 'voted_up', 'votes_up', 'votes_funny',
       'weighted_vote_score', 'comment_count', 'steam_purchase',
       'received_for_free', 'written_during_early_access',
       'primarily_steam_deck'],
      dtype='object')

In [140]:
# check if there are any duplicated reviews
reviews.duplicated(subset=['recommendation_id']).sum()

np.int64(0)

In [167]:
reviews['steam_purchase'].value_counts()

steam_purchase
True     2785
False    1315
Name: count, dtype: int64

In [24]:
reviews['author_playtime_forever'].describe()

count      4100.000000
mean       4275.500976
std        4295.696953
min           5.000000
25%        2006.000000
50%        3781.500000
75%        5571.750000
max      128945.000000
Name: author_playtime_forever, dtype: float64

In [9]:
# extract author_steamid to collect user data
authors = reviews['author_steamid'].to_list()

In [10]:
len(authors)

4100

In [None]:
user_data = collector.get_user_data(
    steamids=authors,
    include_free_games=True,
    verbose=True,
)

In [12]:
user_data

Unnamed: 0,steamid,community_visibility_state,profile_state,persona_name,profile_url,last_log_off,real_name,time_created,loc_country_code,loc_state_code,loc_city_id,owned_games,recently_played_games
0,76561198045255045,2,1.0,Dojilol,https://steamcommunity.com/id/Dojilol/,,,,,,,{},{}
1,76561198113965734,1,1.0,123,https://steamcommunity.com/profiles/7656119811...,,,,,,,{},{}
2,76561198138897509,3,1.0,Pino70,https://steamcommunity.com/profiles/7656119813...,,Giuseppe,1.401729e+09,IT,05,24453.0,"{'game_count': 0, 'games': []}","{'games_count': 0, 'total_playtime_2weeks': 0,..."
3,76561198368799839,3,1.0,oci51,https://steamcommunity.com/profiles/7656119836...,,oci,1.487895e+09,KR,,,"{'game_count': 6658, 'games': [{'appid': 1610,...","{'games_count': 9, 'total_playtime_2weeks': 73..."
4,76561198449223262,3,1.0,Rock in JPN,https://steamcommunity.com/profiles/7656119844...,,,1.511958e+09,JP,,,"{'game_count': 0, 'games': []}","{'games_count': 0, 'total_playtime_2weeks': 0,..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4095,76561198974329255,3,1.0,CR4V3N1,https://steamcommunity.com/profiles/7656119897...,,,1.561924e+09,US,TX,3577.0,"{'game_count': 0, 'games': []}","{'games_count': 0, 'total_playtime_2weeks': 0,..."
4096,76561198243956334,3,1.0,Onion Knight,https://steamcommunity.com/profiles/7656119824...,,tommy,1.439213e+09,,,,"{'game_count': 0, 'games': []}","{'games_count': 0, 'total_playtime_2weeks': 0,..."
4097,76561198089697358,3,1.0,Crystalizen,https://steamcommunity.com/id/crystlfst/,,,1.366967e+09,KR,,,"{'game_count': 0, 'games': []}","{'games_count': 0, 'total_playtime_2weeks': 0,..."
4098,76561198035388233,3,1.0,ubri04,https://steamcommunity.com/id/ubri04/,,Aubrey,1.293036e+09,,,,"{'game_count': 495, 'games': [{'appid': 3330, ...","{'games_count': 8, 'total_playtime_2weeks': 18..."


In [25]:
# sasve data to parquet
user_data.to_parquet("data/eiyuden_user_data.parquet", index=False)

## Collecting game data

From the user data, we can also extract list of games owned or played by the ECHH players.
Below is how you can extract the data.

In [8]:
# import user data parquet
user_data = pd.read_parquet("data/eiyuden_user_data.parquet")

In [8]:
user_data.head()

Unnamed: 0,steamid,community_visibility_state,profile_state,persona_name,profile_url,last_log_off,real_name,time_created,loc_country_code,loc_state_code,loc_city_id,owned_games,recently_played_games
0,76561198045255045,2,1.0,Dojilol,https://steamcommunity.com/id/Dojilol/,,,,,,,"{'game_count': None, 'games': None}","{'games': None, 'games_count': None, 'total_pl..."
1,76561198113965734,1,1.0,123,https://steamcommunity.com/profiles/7656119811...,,,,,,,"{'game_count': None, 'games': None}","{'games': None, 'games_count': None, 'total_pl..."
2,76561198138897509,3,1.0,Pino70,https://steamcommunity.com/profiles/7656119813...,,Giuseppe,1401729000.0,IT,5.0,24453.0,"{'game_count': 0.0, 'games': []}","{'games': [], 'games_count': 0.0, 'total_playt..."
3,76561198368799839,3,1.0,oci51,https://steamcommunity.com/profiles/7656119836...,,oci,1487895000.0,KR,,,"{'game_count': 6658.0, 'games': [{'appid': 161...","{'games': [{'appid': 978300, 'name': 'Saints R..."
4,76561198449223262,3,1.0,Rock in JPN,https://steamcommunity.com/profiles/7656119844...,,,1511958000.0,JP,,,"{'game_count': 0.0, 'games': []}","{'games': [], 'games_count': 0.0, 'total_playt..."


In [13]:
user_data.columns

Index(['steamid', 'community_visibility_state', 'profile_state',
       'persona_name', 'profile_url', 'last_log_off', 'real_name',
       'time_created', 'loc_country_code', 'loc_state_code', 'loc_city_id',
       'owned_games', 'recently_played_games'],
      dtype='object')

In [14]:
user_data.dtypes

steamid                         int64
community_visibility_state      int64
profile_state                 float64
persona_name                   object
profile_url                    object
last_log_off                  float64
real_name                      object
time_created                  float64
loc_country_code               object
loc_state_code                 object
loc_city_id                   float64
owned_games                    object
recently_played_games          object
dtype: object

In [15]:
user_data.shape

(4100, 13)

In [9]:
# let's filter users with community_visibility_state == 3 (public/visible to everyone)
public_user = user_data[user_data['community_visibility_state'] == 3]

In [10]:
public_user.shape

(3222, 13)

In [11]:
# there shouldn't be users with owned_games' game_count == 0 since the list of users who reviewed eiyuden chronicle
# therefore, they should at least own a game (ECHH, whether they received it for free or not)
# but somehow some of the users listed here has games_count == 0
# but maybe they refunded the game once they reviewed it? so let's filter them off
public_user_has_games = public_user[
    public_user['owned_games'].apply(lambda x: x.get('game_count', 0) > 0)
]

In [12]:
public_user_has_games.shape

(1495, 13)

In [20]:
# from 3222 rows down to 1495, that's surprisingly A LOT lol but let's see
public_user_has_games.head()

Unnamed: 0,steamid,community_visibility_state,profile_state,persona_name,profile_url,last_log_off,real_name,time_created,loc_country_code,loc_state_code,loc_city_id,owned_games,recently_played_games
3,76561198368799839,3,1.0,oci51,https://steamcommunity.com/profiles/7656119836...,,oci,1487895000.0,KR,,,"{'game_count': 6658.0, 'games': [{'appid': 161...","{'games': [{'appid': 978300, 'name': 'Saints R..."
5,76561198971574247,3,1.0,依然,https://steamcommunity.com/profiles/7656119897...,,,1561081000.0,CN,,,"{'game_count': 636.0, 'games': [{'appid': 10, ...","{'games': [{'appid': 2277560, 'name': 'WUCHANG..."
7,76561198123424467,3,1.0,108Hvs,https://steamcommunity.com/id/108Hvs/,,Haffipul Saddad,1390055000.0,ID,30.0,,"{'game_count': 128.0, 'games': [{'appid': 4000...","{'games': [{'appid': 477160, 'name': 'Human Fa..."
8,76561198102397621,3,1.0,Malam,https://steamcommunity.com/profiles/7656119810...,,,1376225000.0,KR,,,"{'game_count': 79.0, 'games': [{'appid': 8870,...","{'games': [{'appid': 1658280, 'name': 'Eiyuden..."
9,76561198869364047,3,1.0,ScyRo,https://steamcommunity.com/profiles/7656119886...,,Simon Rosales,1541499000.0,US,,,"{'game_count': 130.0, 'games': [{'appid': 2090...","{'games': [{'appid': 1658280, 'name': 'Eiyuden..."


### Interesting findings, a mismatch between number of "ECHH" reviews

In [13]:
eiyuden_owners = sum(
    1 for owned_games in public_user_has_games["owned_games"]
    if any(game.get("appid") == int(eiyuden_appid) for game in owned_games.get("games", []))
)
print(eiyuden_owners)

1463


In [14]:
played_eiyuden = sum(
    1
    for owned_games in public_user_has_games['owned_games']
    for game in owned_games.get('games', [])
    if (game.get("appid") == eiyuden_appid) and (game.get('playtime_forever', 0) > 0)
)
print(played_eiyuden)

1287


I initially assumed that you can only write review if you owned the games (through steam purchase, key activation), but apparently it wasn't the case. 
Since the user_data is extracted from ECHH's reviews, I expected **ALL USER** will have ECHH in their *owned_games* list, but I found some mismatch.
as you can see in the 2 cells above, out of **1495 (public) users**, only 1463 of them have ECHH in their *owned_games* list, and only 1287 of them have *playtime_forever* value > 0 which I find a bit weird.

Here's what I thought (and based on [this](https://store.steampowered.com/reviews/), and (this)[https://partner.steamgames.com/doc/marketing/discounts/freeweekends]):
- **missing from *owned_games*** -> likely refunded games, expired free weekend players.
- **present but with 0 *playtime_forever*** -> probably users who never actually played the game or review the game without even playing the game.

Anyway, for the sake of analysis, I will cross-check the *playtime_forever* value recorded in both review data and *owned_games* data and decide whether to correct the affected rows.

#### "playtime_forever" cross-check

In [15]:
# table of steamid and ECHH playtime_forever from review data
playtime_forever_review = reviews[['author_steamid', 'author_playtime_forever']]
playtime_forever_review.columns = ['steamid', 'playtime_forever']
playtime_forever_review.head(5)

Unnamed: 0,steamid,playtime_forever
0,76561198045255045,9
1,76561198113965734,2734
2,76561198138897509,1530
3,76561198368799839,8511
4,76561198449223262,5018


In [16]:
def cal_playtime_forever_user(df):
    rows = []
    for _, row in df.iterrows():
        steamid = row.get("steamid")
        owned = row.get("owned_games")

        games = owned.get("games", []) if isinstance(owned, dict) else []
        playtime = next(
            (g.get("playtime_forever", 0) for g in games if g.get("appid") == eiyuden_appid),
            -1
        )

        rows.append({"steamid": steamid, "playtime_forever": playtime})
    return pd.DataFrame(rows)

playtime_forever_user = cal_playtime_forever_user(public_user_has_games)
playtime_forever_user.head(5)

Unnamed: 0,steamid,playtime_forever
0,76561198368799839,8511
1,76561198971574247,296
2,76561198123424467,7732
3,76561198102397621,1298
4,76561198869364047,4589


In [17]:
playtime_forever_user[playtime_forever_user['playtime_forever'] == -1].head(5)

Unnamed: 0,steamid,playtime_forever
52,76561198860289394,-1
98,76561198071820792,-1
243,76561198321506309,-1
264,76561198262702449,-1
281,76561198971597341,-1


In [18]:
# merge the two dataframes on steamid
merged_playtime = pd.merge(
    playtime_forever_review,
    playtime_forever_user,
    on="steamid",
    how="right", # only consider public users that we have data for
    suffixes=('_review', '_user')
)

In [19]:
merged_playtime.head(5)

Unnamed: 0,steamid,playtime_forever_review,playtime_forever_user
0,76561198368799839,8511,8511
1,76561198971574247,296,296
2,76561198123424467,7732,7732
3,76561198102397621,1060,1298
4,76561198869364047,4589,4589


In [20]:
missing_or_zero_mask = merged_playtime['playtime_forever_user'].isin([-1, 0])
mismatch_playtime_mask = merged_playtime['playtime_forever_review'] != merged_playtime['playtime_forever_user']

In [21]:
# 0 means ECHH is present but playtime_forever is 0, and -1 means ECHH is missing from "owned_games"
merged_playtime[missing_or_zero_mask]['playtime_forever_user'].value_counts()

playtime_forever_user
 0    176
-1     32
Name: count, dtype: int64

In [22]:
# check the number of mismatch rows
merged_playtime[mismatch_playtime_mask].shape

(211, 3)

In [23]:
# mismatch but "playtime_forever_user" neither 0 or -1
merged_playtime[mismatch_playtime_mask & ~missing_or_zero_mask]

Unnamed: 0,steamid,playtime_forever_review,playtime_forever_user
3,76561198102397621,1060,1298
260,76561198140268731,5003,5148
368,76561198115628254,2654,2656


**Correcting "playtime_forever" in user data**

Based on the cross-checking above, here are the actions I will take for the mismatches:
- playtime_forever_user == [-1, 0] -> replace this with review's data
- playtime_forever_user > playtime_forever_review -> ignore, these happened probably because the time difference during data collecting and the players are still playing the game at the time.

I will correct the "playtime_forever" later once we created the user-game matrix.

### Uh-oh, too many data to collect?
Based the process above, there are 30,471 unique games in total, played by ECHH players.. and that's **A LOT**. But since our goal is to find games similar to **Eiyuden Chronicle: Hundred Heroes**, we don't really need to collect all 30K+ game data (which is pretty exhausting and put a lot of traffic on the data sources). But we can to trim this down to a smaller number, maybe about ~1,000 games, which is much more make sense I think.

In [26]:
# now we need to store ALL the games that user have played to compare with eiyuden chronicle in a set
# only the games that have playtime > 0
games_set = {
    game['appid']
    for owned_games in public_user_has_games['owned_games']
    for game in owned_games.get('games', []) # they all should have 'games' though but just to be safe
    if game.get('playtime_forever', 0) > 0 # the 'playtime_forever' key also should be there but again, just to be safe
}

In [27]:
len(games_set)

30471

### How do we do that?
Here's the approach I'm thinking of:

1. **Build a user–game playtime matrix**  
   Rows = users (steamid), columns = games (appid), values = playtime_forever (hours).
   These can be extracted from "steamid" and "owned_games" from *public_user_has_games*  
   
2. **Normalize playtime**  
   Since the playtime_forever might differ based on the size of content in games (like mmo, rpg, etc). Applying transformation like log-scaling, min-max normalization will reduce the bias so we can compare *engagement patterns* rather than raw playtime_forever value.

3. **Compute similarity with cosine distance**  
   Cosine similarity will be used to compare games based on how similar their player engagement vectors are, this highlights games that *ECHH players spend similar playtime_forever*.

4. **Build a user–game playtime matrix**  
   From this similarity ranking, we select the most similar (presummably relevant) games for its data to collect.

This way, we don't have to unnecessarily collect 30K+ game data and waste our time (and resources) to check on games that are not similar to ECHH (at least from the playtime behavior perspective).

With that being said, let's start working!

#### 1. Build a user–game playtime matrix  

In [24]:
def build_user_game_matrix(
    df: pd.DataFrame,
    steamid_col: str = "steamid",
    owned_col: str = "owned_games",
    value_key: str = "playtime_forever",
) -> pd.DataFrame:
    # ensure there is no duplicate user to prevent double counting
    df = df.drop_duplicates(subset=[steamid_col], keep="last")

    # I wanna keep all the users to fill the incorrect/missing ECHH playtime_value later
    all_users = df[steamid_col].astype(str)

    # iterate the rows, taking only the steamid and owned_games cols
    rows = []
    for sid, og in zip(df[steamid_col], df[owned_col]): # sid -> steamid, og -> owned_games
        games = og.get("games")
        if games is None or len(games) == 0:
            continue
        for g in games:
            appid = g.get("appid")
            if appid is None:
                continue # in case the missing/incorrect structure, skip them
            
            val = g.get(value_key, 0)
            if val == 0:
                continue # skip games that the player hasn't play yet, preventing all zero column games

            rows.append((str(sid), int(appid), float(val))) # steam's appid is actually int, and it's also better for sorting later

    # don't think it will come to this but in case everything is empty
    if not rows:
        return pd.DataFrame(index=pd.Index([], name="steamid"), columns=pd.Index([], name="appid"))

    # long -> wide pivot
    long = pd.DataFrame(rows, columns=["steamid", "appid", value_key])
    mat = (
        long.pivot_table(index="steamid", columns="appid", values=value_key, aggfunc="sum", fill_value=0)
            .sort_index(axis=0).sort_index(axis=1)
    )

    # reindex and bring back the users will all zeros
    mat = mat.reindex(index=all_users, fill_value=0)

    mat = mat.astype("float32", copy=False) 
    mat = mat.astype(pd.SparseDtype("float32", 0)) # fill the NaN as 0
    return mat

In [25]:
user_game_matrix = build_user_game_matrix(df=public_user_has_games)

In [26]:
user_game_matrix.shape

(1495, 30471)

##### Correcting ECHH players' *playtime_forever* values

In [27]:
# convert to series
playtime_forever_review = playtime_forever_review.set_index('steamid')['playtime_forever']

# reindex, drop the idx not in the matrix
playtime_forever_review_aligned = playtime_forever_review.reindex(user_game_matrix.index)

# get the target col (turn it into dense since the matrix a sparse df)
echh_col_dense = user_game_matrix[eiyuden_appid].sparse.to_dense().astype('float64')

# create a mask for the missing/zero rows
zero_missing_mask = user_game_matrix[eiyuden_appid].sparse.to_dense() == 0

# prep the val to put into matrix
vals_in = playtime_forever_review_aligned.where(zero_missing_mask)

# put them into matrix
user_game_matrix[eiyuden_appid].update(vals_in.astype('float64'))

#### 2. Normalize playtime

**Why log+normalize matters?** -> Raw playtime values are heavily biased by *long games* (MMO, sandbox, or endless loop games) where a players can easily have thousands of playtime hours. This can distort cosine similarity, making such games look "close" to **ECHH** even if the engagement pattern is very different (producing false positives)

In [28]:
# (Optional) filter games that owned by (minimum) players
# pandas sparse compatible
def filter_min_players(
    matrix: pd.DataFrame,
    min_players: int = 30,
) -> pd.DataFrame:
    nnz_per_col = matrix.sparse.to_coo().getnnz(axis=0)
    keep_mask = nnz_per_col >= min_players
    keep_cols = matrix.columns[keep_mask]

    return matrix.loc[:, keep_cols]

In [None]:
# (Optional) filter games that owned by both target_appid and other appids
# Note that in this context, all owners of other appids is assumed to own the target_appid, that's why we don't need to check the target_appid presence
def filter_min_shared_players(
    matrix: pd.DataFrame, # raw / transformed user-game matrix
    target_appid: int = 1658280, # Eiyuden Chronicle: Hundred Heroes
    min_players: int = 50, # minimum number of players who own both target_appid and other appids 
):
    # pandas -> scipy parse
    X_csr = matrix.sparse.to_coo().tocsr()
    j = matrix.columns.get_loc(target_appid) # index of target_appid

    # filter out based on min_shared_players
    binary_presence = X_csr.sign() # 0/1 presence matrix
    target_col = binary_presence[:, j]
    shared_players = (binary_presence.T @ target_col).toarray().ravel() # (games,) -> this is the number of users who own both target_appid and other appids

    # mask candidates based on shared_players
    mask = (shared_players >= min_players)
    
    # Apply the mask to get filtered columns
    keep_cols = matrix.columns[mask]
    
    # Return filtered dataframe
    return matrix.loc[:, keep_cols]

In [29]:
def transform_playtime_matrix(
    matrix: pd.DataFrame,
    log1p: bool = True,
    normalize: bool = True,
) -> pd.DataFrame:

    X = matrix.copy()

    # log1p transform to compress skew / outliers
    if log1p:
        # keep zero as zeros
        X = X.where(X.eq(0), np.log1p(X))

    # min-max normalization, normalize the playtime_forever value into a range of 0 ~ 1
    if normalize:
        col_min = X.min(axis=0)
        col_max = X.max(axis=0)
        denom = (col_max - col_min).replace(0, 1)  # avoid divide by zero

        Y = (X - col_min) / denom
        X = X.where(X.eq(0), Y)

    return X.astype(pd.SparseDtype("float32", 0))

In [30]:
user_game_matrix_transformed = transform_playtime_matrix(user_game_matrix)

In [31]:
# OR filter games with minimum player owned first then transform
# I'm going to set min_players as 50
user_game_matrix_transformed = transform_playtime_matrix(
    filter_min_players(
        matrix=user_game_matrix,
        min_players=50
    )
)
user_game_matrix_transformed.shape

(1495, 2233)

In [134]:
# OR filter games with minimum player owned first then transform
# I'm going to set min_players as 50
user_game_matrix_transformed = transform_playtime_matrix(
    filter_min_shared_players(
        matrix=user_game_matrix,
        min_players=50
    )
)
user_game_matrix_transformed.shape

(1314, 2196)

In [32]:
user_game_matrix_transformed.head()

appid,10,70,80,220,240,300,320,400,440,500,...,3159330,3164500,3178350,3224770,3240220,3241660,3255380,3489700,3513350,3527290
steamid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
76561198368799839,0,0,0,0.497437,0,0,0,0,0,0,...,0.0,0,0.0,0,0,0.0,0,0,0,0
76561198971574247,0,0,0,0.0,0,0,0,0,0,0,...,0.0,0,0.640523,0,0,0.0,0,0,0,0
76561198123424467,0,0,0,0.0,0,0,0,0,0,0,...,0.0,0,0.0,0,0,0.0,0,0,0,0
76561198102397621,0,0,0,0.0,0,0,0,0,0,0,...,0.882693,0,0.0,0,0,0.0,0,0,0,0
76561198869364047,0,0,0,0.0,0,0,0,0,0,0,...,0.857314,0,0.0,0,0,0.642462,0,0,0,0


#### 3 & 4. **Compute similarity with cosine distance** and **Extract the top n candidates**

Here I'm going to set *n* to 1000 but you can adjust it as your needs.

In [39]:
from sklearn.metrics.pairwise import cosine_similarity

def top_similar_games(
    X: pd.DataFrame, # raw / transformed user-game matrix
    target_appid: int | str = 1658280, # Eiyuden Chronicle: Hundred Heroes
    top_n: int = 1000,
) -> pd.DataFrame:
    
    # normalize target_appid to int
    target_appid = int(target_appid)
    if target_appid not in X.columns:
        raise KeyError(f"target_appid {target_appid} is not a column in matrix")

    # pandas -> scipy parse
    X_csr = X.sparse.to_coo().tocsr()
    j = X.columns.get_loc(target_appid) # index of target_appid

    # calculate num owners for each appid
    num_owners = X_csr.getnnz(axis=0)  # number of users who own each appid

    # exclude the target_appid itself
    mask = np.ones(len(X.columns), dtype=bool)
    mask[j] = False
    appids_candidates = X.columns[mask]
    owners_candidates = num_owners[mask]

    # slice the matrix to exclude the target appid
    X_csr_candidates = X_csr[:, mask]
    target_vector = X_csr[:, j]

    # calculate cosine similarity target vs all other appids
    similarity = cosine_similarity(
        X_csr_candidates.T,
        target_vector.T,
        dense_output=True
    ).ravel()

    # build result frame
    out = pd.DataFrame({
        "appid": appids_candidates,
        "similarity": similarity,
        "num_owners": owners_candidates,
    })

    out = out.sort_values(["similarity", "num_owners"], ascending=[False, False])
    return out.head(top_n).reset_index(drop=True) # take top_n then reset index

In [40]:
top_n = 1000

candidates = top_similar_games(
    X=user_game_matrix_transformed,
    target_appid=eiyuden_appid,
    top_n=top_n,
)

# also without transformation for comparison later
candidates_without_transform = top_similar_games(
    X=user_game_matrix,
    target_appid=eiyuden_appid,
    top_n=top_n,
)

In [41]:
pd.set_option("display.max_rows", 200)

In [42]:
candidates.head(5)

Unnamed: 0,appid,similarity,num_owners
0,1086940,0.71368,727
1,292030,0.686347,731
2,413150,0.682368,692
3,582010,0.668863,669
4,570,0.655424,678


In [43]:
candidates_without_transform.head(5)

Unnamed: 0,appid,similarity,num_owners
0,1658290,0.474342,544
1,1932640,0.451833,371
2,429660,0.399569,421
3,1903340,0.386714,447
4,692850,0.379119,383


In [44]:
# comparing the result with and without transformation
df_raw = candidates_without_transform[['appid', 'similarity']].reset_index().rename(columns={"index": "rank_raw"})
df_trans = candidates[['appid', 'similarity']].reset_index().rename(columns={"index": "rank_trans"})

# Merge on appid
merged = df_raw.merge(df_trans, on="appid", suffixes=("_raw", "_trans"))

# Compute delta rank
merged["delta_rank"] = merged["rank_raw"] - merged["rank_trans"]

# Biggest movers
sim_changes = merged.reindex(merged["delta_rank"].abs().sort_values(ascending=False).index)

In [46]:
sim_changes.head(20)

Unnamed: 0,rank_raw,appid,similarity_raw,rank_trans,similarity_trans,delta_rank
68,71,692890,0.261399,987,0.245983,-916
614,830,431960,0.131655,17,0.58851,813
150,177,1351630,0.212137,974,0.24673,-797
638,894,601150,0.127814,119,0.445736,775
195,225,1668510,0.201619,990,0.245585,-765
63,63,553640,0.266374,802,0.263783,-739
45,45,1356670,0.288902,771,0.267936,-726
594,777,552500,0.134857,66,0.49823,711
612,824,1599340,0.132127,113,0.448372,711
232,265,301640,0.191012,976,0.246643,-711
