# Analysis of Eiyuden Chronicle: Hundred Heroes - Data Collection

This Jupyter Notebook is part of my **Analysis of Eiyuden Chronicle: Hundred Heroes** project. It focuses on the data collection stage and also serves as a demo for [GameInsights](https://github.com/nazhifkojaz/steam-game-data-collector), a Python library I built for collecting and analyzing **Steam game data**.

**GameInsights** started as a personal tool, but I decided to expand it into something that can be useful for others—such as *game developers, researchers, or anyone who enjoys analyzing game data*.

For now, GameInsights focuses on **data collection** with its **collector** module. In the future, I plan to add two additional modules:  
- **Analyzer** – for processing and analyzing collected data.  
- **Visualizer** – for creating plots and visual summaries.  

In this notebook, I will show some of the main functions in the **collector** module, including:  
- `get_game_review` – pulls a list of reviews for a given Steam game.  
- `get_user_data` – retrieves user data for a provided SteamID.  
- `get_games_data` – compiles game data from multiple sources, based on a given Steam AppID.  

For simplicity, I will refer to *Eiyuden Chronicle: Hundred Heroes* as **ECHH** throughout this notebook.


## Library Imports

In [1]:
from gameinsights import Collector

In [2]:
import pandas as pd
import numpy as np

In [3]:
# import secrets
from dotenv import load_dotenv
import os
load_dotenv()

True

## Collecting data using GameInsights

First, we set up what the collector needs. A `STEAM_API_KEY` is required to pull user data and game schemas (achievement labels).

In [4]:
steam_api_key = os.getenv("STEAM_API_KEY")

In [5]:
collector = Collector(steam_api_key=steam_api_key)
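`os.getenv` returns `None` when a variable is missing, so a missing key would only surface later as a failed request. A minimal sketch of failing early instead; `require_env` and `DEMO_API_KEY` are hypothetical names used for illustration and are not part of GameInsights:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or shell environment")
    return value


# Throwaway variable so the snippet runs anywhere (not a real key)
os.environ["DEMO_API_KEY"] = "abc123"
demo_key = require_env("DEMO_API_KEY")
```

In the notebook, the same helper could wrap the `STEAM_API_KEY` lookup before the collector is created.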

In [6]:
eiyuden_appid = "1658280"

### Collecting review data

The most obvious data we can pull for a Steam game is its reviews, so let's begin there.

In [7]:
reviews = collector.get_game_review(steam_appid=eiyuden_appid, verbose=False, review_only=True)

In [13]:
reviews.shape

(4100, 21)

In [9]:
reviews.columns

Index(['recommendation_id', 'author_steamid', 'author_num_games_owned',
       'author_num_reviews', 'author_playtime_forever',
       'author_playtime_last_two_weeks', 'author_playtime_at_review',
       'author_last_played', 'language', 'review', 'timestamp_created',
       'timestamp_updated', 'voted_up', 'votes_up', 'votes_funny',
       'weighted_vote_score', 'comment_count', 'steam_purchase',
       'received_for_free', 'written_during_early_access',
       'primarily_steam_deck'],
      dtype='object')

In [24]:
# save reviews to parquet
reviews.to_parquet("data/eiyuden_reviews.parquet", index=False)

### Collecting user data

From the review data, we can extract the *steamid* of every user who wrote a review. This gives us a sample of ECHH players drawn from the whole player population.

In [20]:
reviews = pd.read_parquet("data/eiyuden_reviews.parquet")

In [21]:
reviews.head()

Unnamed: 0,recommendation_id,author_steamid,author_num_games_owned,author_num_reviews,author_playtime_forever,author_playtime_last_two_weeks,author_playtime_at_review,author_last_played,language,review,...,timestamp_updated,voted_up,votes_up,votes_funny,weighted_vote_score,comment_count,steam_purchase,received_for_free,written_during_early_access,primarily_steam_deck
0,201952284,76561198045255045,0,10,9,9,9,1754967253,english,By the 4th text box in the game I noticed some...,...,1754968462,False,0,0,0.5,0,False,False,False,False
1,201940210,76561198113965734,0,2,2734,1447,2734,1754703861,english,"Needs a x2 , x3 speed key. Would shave off 70...",...,1754950716,False,0,0,0.5,0,True,False,False,False
2,201914252,76561198138897509,0,7,1530,611,1377,1754932590,italian,per ora bella storia e grafica gradevole,...,1754923391,True,0,0,0.5,0,True,False,False,False
3,201894653,76561198368799839,6496,540,8511,366,8511,1754893622,koreana,- 이 게임의 가장 큰 문제점은 환상수호전2라는 벽이 너무 거대하다는 거다.\n본인...,...,1754896546,True,8,1,0.619325,0,False,False,False,False
4,201825659,76561198449223262,0,2,5018,1605,5018,1754809816,japanese,仲間全員集めてレースやカードゲーム等のミニゲームにも手を付けてプレイ時間約８０時間でクリアし...,...,1754812678,True,0,0,0.5,0,False,False,False,False


In [22]:
reviews.columns

Index(['recommendation_id', 'author_steamid', 'author_num_games_owned',
       'author_num_reviews', 'author_playtime_forever',
       'author_playtime_last_two_weeks', 'author_playtime_at_review',
       'author_last_played', 'language', 'review', 'timestamp_created',
       'timestamp_updated', 'voted_up', 'votes_up', 'votes_funny',
       'weighted_vote_score', 'comment_count', 'steam_purchase',
       'received_for_free', 'written_during_early_access',
       'primarily_steam_deck'],
      dtype='object')

In [24]:
reviews['author_playtime_forever'].describe()

count      4100.000000
mean       4275.500976
std        4295.696953
min           5.000000
25%        2006.000000
50%        3781.500000
75%        5571.750000
max      128945.000000
Name: author_playtime_forever, dtype: float64
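Steam reports playtime in minutes, so the numbers above are easier to read after converting to hours. A small sketch using three values copied from the `describe()` output (min, median, max); the variable names are only for illustration:

```python
import pandas as pd

# min, median and max from the describe() output above, in minutes
playtime_minutes = pd.Series([5.0, 3781.5, 128945.0], name="author_playtime_forever")

# convert minutes to hours, rounded to one decimal
playtime_hours = (playtime_minutes / 60).round(1)
print(playtime_hours.tolist())  # [0.1, 63.0, 2149.1]
```

So the median reviewer has roughly 63 hours in ECHH, while the most extreme one has more than 2,000.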

In [9]:
# extract author_steamid to collect user data
authors = reviews['author_steamid'].to_list()

In [10]:
len(authors)

4100

In [None]:
user_data = collector.get_user_data(
    steamids=authors,
    include_free_games=True,
    verbose=True,
)

In [12]:
user_data

Unnamed: 0,steamid,community_visibility_state,profile_state,persona_name,profile_url,last_log_off,real_name,time_created,loc_country_code,loc_state_code,loc_city_id,owned_games,recently_played_games
0,76561198045255045,2,1.0,Dojilol,https://steamcommunity.com/id/Dojilol/,,,,,,,{},{}
1,76561198113965734,1,1.0,123,https://steamcommunity.com/profiles/7656119811...,,,,,,,{},{}
2,76561198138897509,3,1.0,Pino70,https://steamcommunity.com/profiles/7656119813...,,Giuseppe,1.401729e+09,IT,05,24453.0,"{'game_count': 0, 'games': []}","{'games_count': 0, 'total_playtime_2weeks': 0,..."
3,76561198368799839,3,1.0,oci51,https://steamcommunity.com/profiles/7656119836...,,oci,1.487895e+09,KR,,,"{'game_count': 6658, 'games': [{'appid': 1610,...","{'games_count': 9, 'total_playtime_2weeks': 73..."
4,76561198449223262,3,1.0,Rock in JPN,https://steamcommunity.com/profiles/7656119844...,,,1.511958e+09,JP,,,"{'game_count': 0, 'games': []}","{'games_count': 0, 'total_playtime_2weeks': 0,..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4095,76561198974329255,3,1.0,CR4V3N1,https://steamcommunity.com/profiles/7656119897...,,,1.561924e+09,US,TX,3577.0,"{'game_count': 0, 'games': []}","{'games_count': 0, 'total_playtime_2weeks': 0,..."
4096,76561198243956334,3,1.0,Onion Knight,https://steamcommunity.com/profiles/7656119824...,,tommy,1.439213e+09,,,,"{'game_count': 0, 'games': []}","{'games_count': 0, 'total_playtime_2weeks': 0,..."
4097,76561198089697358,3,1.0,Crystalizen,https://steamcommunity.com/id/crystlfst/,,,1.366967e+09,KR,,,"{'game_count': 0, 'games': []}","{'games_count': 0, 'total_playtime_2weeks': 0,..."
4098,76561198035388233,3,1.0,ubri04,https://steamcommunity.com/id/ubri04/,,Aubrey,1.293036e+09,,,,"{'game_count': 495, 'games': [{'appid': 3330, ...","{'games_count': 8, 'total_playtime_2weeks': 18..."


In [25]:
# save user data to parquet
user_data.to_parquet("data/eiyuden_user_data.parquet", index=False)

### Collecting game data

From the user data, we can also extract the list of games owned or played by ECHH players.
Below is how to extract it.

In [7]:
# import user data parquet
user_data = pd.read_parquet("data/eiyuden_user_data.parquet")

In [8]:
user_data.head()

Unnamed: 0,steamid,community_visibility_state,profile_state,persona_name,profile_url,last_log_off,real_name,time_created,loc_country_code,loc_state_code,loc_city_id,owned_games,recently_played_games
0,76561198045255045,2,1.0,Dojilol,https://steamcommunity.com/id/Dojilol/,,,,,,,"{'game_count': None, 'games': None}","{'games': None, 'games_count': None, 'total_pl..."
1,76561198113965734,1,1.0,123,https://steamcommunity.com/profiles/7656119811...,,,,,,,"{'game_count': None, 'games': None}","{'games': None, 'games_count': None, 'total_pl..."
2,76561198138897509,3,1.0,Pino70,https://steamcommunity.com/profiles/7656119813...,,Giuseppe,1401729000.0,IT,5.0,24453.0,"{'game_count': 0.0, 'games': []}","{'games': [], 'games_count': 0.0, 'total_playt..."
3,76561198368799839,3,1.0,oci51,https://steamcommunity.com/profiles/7656119836...,,oci,1487895000.0,KR,,,"{'game_count': 6658.0, 'games': [{'appid': 161...","{'games': [{'appid': 978300, 'name': 'Saints R..."
4,76561198449223262,3,1.0,Rock in JPN,https://steamcommunity.com/profiles/7656119844...,,,1511958000.0,JP,,,"{'game_count': 0.0, 'games': []}","{'games': [], 'games_count': 0.0, 'total_playt..."


In [9]:
user_data.columns

Index(['steamid', 'community_visibility_state', 'profile_state',
       'persona_name', 'profile_url', 'last_log_off', 'real_name',
       'time_created', 'loc_country_code', 'loc_state_code', 'loc_city_id',
       'owned_games', 'recently_played_games'],
      dtype='object')

In [10]:
user_data.dtypes

steamid                         int64
community_visibility_state      int64
profile_state                 float64
persona_name                   object
profile_url                    object
last_log_off                  float64
real_name                      object
time_created                  float64
loc_country_code               object
loc_state_code                 object
loc_city_id                   float64
owned_games                    object
recently_played_games          object
dtype: object

In [11]:
user_data.shape

(4100, 13)

In [12]:
# let's filter users with community_visibility_state == 3 (public/visible to everyone)
public_user = user_data[user_data['community_visibility_state'] == 3]

In [13]:
public_user.shape

(3222, 13)

In [None]:
# there shouldn't be any users with game_count == 0 in owned_games, since everyone in this list reviewed Eiyuden Chronicle
# and should therefore own at least one game (ECHH, whether they received it for free or not)
# yet some of the users listed here have game_count == 0
# maybe they refunded the game after reviewing it, or they keep their game details private (a guess, not verified)? either way, let's filter them out
public_user_has_games = public_user[
    public_user['owned_games'].apply(lambda x: x.get('game_count', 0) > 0)
]

In [15]:
public_user_has_games.shape

(1495, 13)
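After the parquet round trip, `owned_games` can also hold `{'game_count': None, 'games': None}` (see the first rows of `user_data` above), and comparing `None > 0` raises a `TypeError`. The public-profile subset did not hit this case here, but a None-safe variant of the filter could look like the sketch below. `has_owned_games` and the toy rows are hypothetical, made up for illustration:

```python
import pandas as pd


def has_owned_games(owned) -> bool:
    """True when the owned_games payload reports at least one game."""
    if not isinstance(owned, dict):
        return False
    # `or 0` turns a stored None into 0 before the comparison
    return (owned.get("game_count") or 0) > 0


# Toy rows mimicking the three payload shapes seen in user_data (values are made up)
toy_users = pd.DataFrame({
    "steamid": [1, 2, 3],
    "owned_games": [
        {"game_count": None, "games": None},
        {"game_count": 0.0, "games": []},
        {"game_count": 2.0, "games": [{"appid": 10}, {"appid": 20}]},
    ],
})

toy_has_games = toy_users[toy_users["owned_games"].apply(has_owned_games)]
print(toy_has_games["steamid"].tolist())  # [3]
```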

In [16]:
# from 3222 rows down to 1495, that's surprisingly A LOT lol but let's see
public_user_has_games.head()

Unnamed: 0,steamid,community_visibility_state,profile_state,persona_name,profile_url,last_log_off,real_name,time_created,loc_country_code,loc_state_code,loc_city_id,owned_games,recently_played_games
3,76561198368799839,3,1.0,oci51,https://steamcommunity.com/profiles/7656119836...,,oci,1487895000.0,KR,,,"{'game_count': 6658.0, 'games': [{'appid': 161...","{'games': [{'appid': 978300, 'name': 'Saints R..."
5,76561198971574247,3,1.0,依然,https://steamcommunity.com/profiles/7656119897...,,,1561081000.0,CN,,,"{'game_count': 636.0, 'games': [{'appid': 10, ...","{'games': [{'appid': 2277560, 'name': 'WUCHANG..."
7,76561198123424467,3,1.0,108Hvs,https://steamcommunity.com/id/108Hvs/,,Haffipul Saddad,1390055000.0,ID,30.0,,"{'game_count': 128.0, 'games': [{'appid': 4000...","{'games': [{'appid': 477160, 'name': 'Human Fa..."
8,76561198102397621,3,1.0,Malam,https://steamcommunity.com/profiles/7656119810...,,,1376225000.0,KR,,,"{'game_count': 79.0, 'games': [{'appid': 8870,...","{'games': [{'appid': 1658280, 'name': 'Eiyuden..."
9,76561198869364047,3,1.0,ScyRo,https://steamcommunity.com/profiles/7656119886...,,Simon Rosales,1541499000.0,US,,,"{'game_count': 130.0, 'games': [{'appid': 2090...","{'games': [{'appid': 1658280, 'name': 'Eiyuden..."


In [17]:
# collect every game these users have played into a set, so we can compare them with Eiyuden Chronicle later
# only games with playtime > 0 are kept
games_set = {
    game['appid']
    for owned_games in public_user_has_games['owned_games']
    for game in owned_games.get('games', []) # they all should have 'games' though but just to be safe
    if game.get('playtime_forever', 0) > 0 # the 'playtime_forever' key also should be there but again, just to be safe
}

In [18]:
len(games_set)

30471

#### Uh-oh, too much data to collect?
Based on the process above, there are 30,471 unique games in total played by ECHH players, and that's **A LOT**. Since our goal is to find games similar to **Eiyuden Chronicle: Hundred Heroes**, we don't really need to collect data for all 30K+ games (which would be exhausting and would put a lot of traffic on the data sources). Instead, we can trim this down to a smaller number, maybe around 1,000 games, which I think makes much more sense.

#### How do we do that?
Here's the approach I'm thinking of:

1. **Build a user–game playtime matrix**  
   Rows = users (steamid), columns = games (appid), values = playtime_forever (minutes, as reported by Steam).
   These can be extracted from the "steamid" and "owned_games" columns of *public_user_has_games*.

2. **Normalize playtime**  
   Apply log scaling and min-max normalization so we compare *engagement patterns* rather than raw playtime_forever values.

3. **Compute cosine similarity**  
   Cosine similarity compares games based on how similar their player engagement vectors are. This highlights games on which *ECHH players spend a similar amount of playtime_forever*.

4. **Extract the top ~1,000 candidates**  
   From the similarity rankings, select the most similar (presumably relevant) games and collect their data.

This way, we avoid collecting data for 30K+ games and wasting time (and resources) on games that are not similar to ECHH (at least from a playtime-behavior perspective).
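To make the plan concrete, here is a rough end-to-end sketch on a toy matrix. It is not the notebook's actual implementation: the appids 100/200/300, the users, and the helper `rank_similar_games` are all made up for illustration, and only log scaling is applied before the similarity step.

```python
import numpy as np
import pandas as pd

# Toy user-game playtime matrix in minutes; appids 100/200/300 are made up
toy = pd.DataFrame(
    {
        100: [600.0, 1200.0, 0.0, 300.0],  # stands in for the target game
        200: [500.0, 900.0, 0.0, 0.0],     # played by the same users as 100
        300: [0.0, 0.0, 4000.0, 10.0],     # played mostly by a different user
    },
    index=["u1", "u2", "u3", "u4"],
)


def rank_similar_games(matrix: pd.DataFrame, target_appid: int, top_n: int) -> pd.Series:
    """Rank games by cosine similarity of their log-scaled player vectors to the target."""
    X = np.log1p(matrix.to_numpy(dtype="float64"))  # compress skewed playtime
    norms = np.linalg.norm(X, axis=0)               # one norm per game column
    norms[norms == 0] = 1.0                         # guard against all-zero columns
    X_unit = X / norms                              # unit-length game vectors
    target = X_unit[:, matrix.columns.get_loc(target_appid)]
    sims = pd.Series(X_unit.T @ target, index=matrix.columns, name="cosine_similarity")
    return sims.drop(target_appid).sort_values(ascending=False).head(top_n)


top = rank_similar_games(toy, target_appid=100, top_n=2)
print(top)
```

Game 200 ranks above game 300 because it shares its active players with the target game, which is the behavior we want from the candidate selection.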

With that being said, let's start working!


##### 1. Build a user–game playtime matrix  
Rows = users (steamid), columns = games (appid), values = playtime_forever (minutes, as reported by Steam). These can be extracted from the "steamid" and "owned_games" columns of *public_user_has_games*.


In [35]:
def build_user_game_matrix(
    df: pd.DataFrame,
    steamid_col: str = "steamid",
    owned_col: str = "owned_games",
    value_key: str = "playtime_forever",
) -> pd.DataFrame:
    # ensure there is no duplicate user to prevent double counting
    df = df.drop_duplicates(subset=[steamid_col], keep="last")

    # iterate the rows, taking only the steamid and owned_games cols
    rows = []
    for sid, og in zip(df[steamid_col], df[owned_col]): # sid -> steamid, og -> owned_games
        games = og.get("games")
        if games is None or len(games) == 0:
            continue
        for g in games:
            appid = g.get("appid")
            if appid is None:
                continue  # skip entries with a missing or malformed structure

            val = g.get(value_key) or 0  # treat a missing or None value as 0
            if val == 0:
                continue  # skip games the player hasn't played yet

            # Steam appids are integers, which also makes sorting easier later
            rows.append((str(sid), int(appid), float(val)))

    # don't think it will come to this but in case everything is empty
    if not rows:
        return pd.DataFrame(index=pd.Index([], name="steamid"), columns=pd.Index([], name="appid"))

    # long -> wide pivot
    long = pd.DataFrame(rows, columns=["steamid", "appid", value_key])
    mat = (
        long.pivot_table(index="steamid", columns="appid", values=value_key, aggfunc="sum", fill_value=0)
            .sort_index(axis=0).sort_index(axis=1)
    )
    mat = mat.astype("float32", copy=False)
    mat = mat.astype(pd.SparseDtype("float32", 0))  # store sparsely, with 0 as the fill value
    return mat

In [36]:
user_game_matrix = build_user_game_matrix(df=public_user_has_games)

In [37]:
user_game_matrix.head()

steamid,10,20,30,40,50,60,70,80,92,100,...,3783770,3793150,3801410,3805420,3813350,3816700,3836320,3859340,3861280,3886520
76561197960303702,261.0,0.0,0.0,0.0,0,0.0,245.0,30.0,0,0,...,0,0,0,0,0,0,0,0,0,0
76561197960308314,5.0,6.0,0.0,1.0,0,1.0,1597.0,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
76561197960387984,287.0,0.0,1.0,0.0,0,0.0,0.0,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
76561197960512741,13539.0,0.0,415.0,0.0,0,0.0,105.0,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
76561197960531388,2950.0,1.0,0.0,0.0,0,0.0,0.0,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0


##### 2. Normalize playtime
playtime_forever can differ a lot depending on how much content a game has (MMOs, RPGs, etc.). Applying transformations such as log scaling and min-max normalization reduces this bias, so we can compare *engagement patterns* rather than raw playtime_forever values.

**Why does log + normalize matter?**

Raw playtime values are heavily biased by *long games* (MMOs, sandbox games, or endless-loop games), where a player can easily rack up thousands of hours. This can distort cosine similarity, making such games look "close" to **ECHH** even if the engagement pattern is very different (producing false positives).
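A tiny worked example of that distortion, using made-up numbers: four hypothetical users, where the last one is an extreme outlier who is also the only player of a hypothetical MMO. The helper `cosine_similarity` is defined here for illustration only.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical playtime in minutes for four users; the last user is an outlier
echh = np.array([3000.0, 2500.0, 3500.0, 40000.0])
mmo = np.array([0.0, 0.0, 0.0, 120000.0])  # only the outlier plays this game

raw_sim = cosine_similarity(echh, mmo)
log_sim = cosine_similarity(np.log1p(echh), np.log1p(mmo))
print(f"raw: {raw_sim:.2f}, log1p: {log_sim:.2f}")
```

With raw minutes, the single outlier dominates both vectors and the two games look almost identical. After `log1p`, the three regular ECHH players carry comparable weight and the similarity drops noticeably.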

In [39]:
def transform_playtime_matrix(
    matrix: pd.DataFrame,
    log1p: bool = True,
    normalize: bool = True,
) -> pd.DataFrame:

    X = matrix.copy()

    # log1p transform to compress skew / outliers
    if log1p:
        # keep zero as zeros
        X = X.where(X.eq(0), np.log1p(X))

    # min-max normalization, normalize the playtime_forever value into a range of 0 ~ 1
    if normalize:
        col_min = X.min(axis=0)
        col_max = X.max(axis=0)
        denom = (col_max - col_min).replace(0, 1)  # avoid divide by zero

        Y = (X - col_min) / denom
        X = X.where(X.eq(0), Y)

    return X.astype(pd.SparseDtype("float32", 0))

In [40]:
user_game_matrix_transformed = transform_playtime_matrix(user_game_matrix)

In [42]:
user_game_matrix_transformed.head()

steamid,10,20,30,40,50,60,70,80,92,100,...,3783770,3793150,3801410,3805420,3813350,3816700,3836320,3859340,3861280,3886520
76561197960303702,0.455464,0.0,0.0,0.0,0,0.0,0.600943,0.434241,0,0,...,0,0,0,0,0,0,0,0,0,0
76561197960308314,0.146557,0.297756,0.0,0.123169,0,0.17297,0.805195,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
76561197960387984,0.463203,0.0,0.091793,0.0,0,0.0,0.0,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
76561197960512741,0.778152,0.0,0.798641,0.0,0,0.0,0.509045,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
76561197960531388,0.653536,0.106063,0.0,0.0,0,0.0,0.0,0.0,0,0,...,0,0,0,0,0,0,0,0,0,0
