# Steam Data Analysis. Analysis of the datasets structure and cleanup

## Introduction

After data gathering, we have four csv files:

* `steam_app_data.csv`: Application and DLC data for all IDs from Steam Storefront (2022, April 26)
* `steamspy_data.csv`: Application data from SteamSpy for the same IDs (2022, April 27)
* `steamreviews_data.csv`: Summary review data from Steam API (2022, April 28)
* `missing_ids.csv`: List of the Apps not included in the dataset

Almost all the data necessary for the analysis should be at the `steam_app_data.csv`.
In `steamspy_appid.csv` we have additional information which might be very useful:

* Positive Reviews (count)
* Negative Reviews (count)
* Average and Medians of Concurrent Players (several columns)
* Peak Concurrent Players (ccu column)
* Owners estimate, by using Steam Spy algorithm (wide ranges)
* Tags (list)

Due to how data is gathered on SteamSpy there might be some discrepancies so the third dataset `steamreviews_data.csv` with the review summary data was downloaded from the Steam AppReviews API and used as an additional source of information:

* Review Score
* Review Score (as description string)
* Positive Reviews (count)
* Negative Reviews (count)
* Total Reviews (count)

In this notepad I'll go through each of the data table comparing them and taking notice for the clean-up and column parsing when necessary. There goals here are: 

* Prepare the table structure that will be exported and used later in the analysis/visualization creation
* Make the fields/tables as easy to uperate later in analysis as possible
* Keep as much data as possible (even with the null fields - even these data might be useful for the dataset users)
* Document the changes and prepare a streamlined automated process for the future updates


In [1]:
# Module imports
# standard library imports
import csv
import datetime as dt
import json
import os
import statistics
import time
import re
import ast
import itertools

# third-party imports
import numpy as np
import pandas as pd

In [2]:
# Loading data tables
storefront = pd.read_csv('../data/processing/steam_app_data.csv', dtype={'required_age': 'str', 'download_appid': 'int'})
steamspy = pd.read_csv('../data/processing/steamspy_data.csv')
reviews = pd.read_csv('../data/processing/steamreviews_data.csv', dtype={'download_appid': 'int'})
missing_ids = pd.read_csv('../data/processing/missing_ids.csv')

In [3]:
# Setting some constants
# usd/eu exchange rate at the time of collection
usd_eu_rate = 0.95
# date of the dataset collection
df_collection_date = pd.Timestamp(2022,6,22)

### Utility functions
Let's define some utility functions used for processing and troubleshooting:

In [4]:
def getSteamLink(df):
    '''
        Give us the name and links to any subseries of apps, for troubleshooting.
    '''
    for item in df.index:
        print(f'{df.loc[item]["name"]} https://store.steampowered.com/app/{str(item)}')

In [5]:
def export_data(df, filename, index=False, list_columns = []):
    '''
    Export dataframe to the csv file in export folder'.
    
        filename: file name string without file extension
        index: boolean, to export index as well or not
        list_columns: list columns to transform from '['item']' to the simple 
            ';' delimited list
    '''
    filepath = '../data/export/' + filename + '.csv'
    
    def list_convert(input_list):
        try:       
            return ';'.join(str(item) for item in input_list)
        except Exception as ex:
            print(input_list)
            print(ex)
            raise(ex)
    
    for col in list_columns:
        df[col].fillna({i: [] for i in storefront.index},inplace = True)
        df[col] = df[col].apply(lambda x: list_convert(x))

        
    df.to_csv(filepath, index=index)

    print(f'Exported {filename} to "{filepath}"')

In [6]:
def boolean_df(item_lists, unique_items):
    '''
    Create boolean dataframe from from the item list series and 
    a list of unique item values
    
        items_lists: pandas series with item lists
        unique_items: list with the unique item valaues
    
    '''
    
    # Create empty dict
    bool_dict = {}
    
    # Loop through all the tags
    for i, item in enumerate(unique_items):
        
        # Apply boolean mask
        bool_dict[item] = item_lists.apply(lambda x: item in x)
            
    # Return the results as a dataframe
    return pd.DataFrame(bool_dict)

In [7]:
# utility function to add the removed IDs to the missing_ids
def removeIDs(df, ids_list, reason):
    '''
    Remove ids and add them to the missing_ids with the reason
    '''
    global missing_ids
    
    # removing ids from df
    df = df.loc[~df.index.isin(ids_list)]
    
    # adding ids to the missing_ids
    temp_df = pd.DataFrame(ids_list,columns =['appid'])
    temp_df['reason'] = reason
    missing_ids = pd.concat([missing_ids, temp_df]).reset_index(drop=True)
    return df

## Preparing data

As I've noted earlier, here 

Let's start with the overall structure of our tables - number of columns, total data counts and the amount of non-null data.

In [8]:
storefront.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105035 entries, 0 to 105034
Data columns (total 41 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   type                     104355 non-null  object
 1   name                     104344 non-null  object
 2   steam_appid              105035 non-null  int64 
 3   required_age             104355 non-null  object
 4   is_free                  104355 non-null  object
 5   controller_support       26065 non-null   object
 6   dlc                      9804 non-null    object
 7   detailed_description     104193 non-null  object
 8   about_the_game           104192 non-null  object
 9   short_description        104188 non-null  object
 10  fullgame                 35364 non-null   object
 11  supported_languages      104164 non-null  object
 12  header_image             104355 non-null  object
 13  website                  61137 non-null   object
 14  pc_requirements     

In [9]:
steamspy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105035 entries, 0 to 105034
Data columns (total 20 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   appid            105035 non-null  int64  
 1   name             104770 non-null  object 
 2   developer        93788 non-null   object 
 3   publisher        84463 non-null   object 
 4   score_rank       52 non-null      float64
 5   positive         105035 non-null  int64  
 6   negative         105035 non-null  int64  
 7   userscore        105035 non-null  int64  
 8   owners           105035 non-null  object 
 9   average_forever  105035 non-null  int64  
 10  average_2weeks   105035 non-null  int64  
 11  median_forever   105035 non-null  int64  
 12  median_2weeks    105035 non-null  int64  
 13  price            94146 non-null   float64
 14  initialprice     94157 non-null   float64
 15  discount         94157 non-null   float64
 16  languages        93934 non-null   obje

In [10]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105035 entries, 0 to 105034
Data columns (total 7 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   appid              105035 non-null  int64 
 1   review_score       105035 non-null  int64 
 2   review_score_desc  105035 non-null  object
 3   total_positive     105035 non-null  int64 
 4   total_negative     105035 non-null  int64 
 5   total_reviews      105035 non-null  int64 
 6   download_appid     105035 non-null  int32 
dtypes: int32(1), int64(5), object(1)
memory usage: 5.2+ MB


We have roughly 100000 App IDs in each of the tables.

**Steam Storefront** data has some seemingly optional information in these columns: *dlc*, *fullgame*, *website*, *legal_notice*, *drm_notice*, *ext_user_account_notice*,   *demos*, *metacritic*, *reviews*, *movies*, *recommendations*, *achievements*. 

*developers*, *publishers*, *demos*, *price_overview*, *packages* have quite a big number of nulls that definetely need some investigating. Some other columns also haave a small number of null data.

**Steam Reviews** data don't seem to have any nulls.

**SteamSpy** data also have some fields with a noticeable amount of nulls: *developer*, *publisher*, *score_rank*, *price*, *initialprice*, *discount*, *languages*, *genre*.

The total numbers of App IDs is a bit different between the table. There is one noticeable 'Feature' in the Steam Storefront API - it doesn't return the data for the games that are not available in the regioin. I've downloaded the data from the Netherlands and it might explain some games missing as they are not available in the region. The small difference between the Steam Reviews and SteamSpy might be caused by the different dates the data was gathered.

There is some data that appears in two data tables. Since the data might differ both in the format and content, I'll check both and decide how they are handled as we move along.

| Field 1 | Field 2 |
| --- | --- |
| storefront.name | steamspy.name |
| developers | developer |
| publishers | publisher |
| storefront.price_overview | steamspy.price/initialprice/discount |
| storefront.genres | steamspy.genre |
| storefront.supported_languages | steamspy.languages |
| reviews.review_score | steamspy.userscore |
| reviews.total_positive | steamspy.positive |
| reviews.total_negative | steamspy.negative |

I'll start with the **most important fields** to check if we'll have to remove some data right from the start.

### Unique IDs

App IDs shuld be unique but we should check if we have duplicated app ids in our dataframes. We used an iterative process, and it could be possible that some ids when requested redirect us to a new id. This has been observed trying to access directly in the Steam Store page with some of the 'missing' ids. For instance, different versions of Guild Wars 2 all lead us to a unique store page on Steam, as the old versions do not exist anymore.

Sadly, in some cases it leads to some issues as app_id for the same app might be different in different datasets, for example:

In [11]:
storefront[storefront['steam_appid'].isin([34330,201270])]

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,...,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors,download_appid,last_modified
105016,game,Total War: SHOGUN 2,34330,0,False,,"[223180, 201279, 201277, 34348, 34342, 34343, ...",<h1>Total War: SHOGUN 2 out now for Linux.</h1...,<strong>MASTER THE ART OF WAR</strong><br>\t\t...,Total War: SHOGUN 2 is the perfect mix of real...,...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,"{'total': 106, 'highlighted': [{'name': 'Stran...","{'coming_soon': False, 'date': '14 Mar, 2011'}","{'url': 'https://support.sega.co.uk', 'email':...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",201270,1603131194


In [12]:
steamspy[steamspy['appid'].isin([34330,201270])]

Unnamed: 0,appid,name,developer,publisher,score_rank,positive,negative,userscore,owners,average_forever,average_2weeks,median_forever,median_2weeks,price,initialprice,discount,languages,genre,ccu,tags
1801,201270,Total War: SHOGUN 2,"CREATIVE ASSEMBLY, Feral Interactive (Mac), Fe...","SEGA, Feral Interactive (Mac), Feral Interacti...",,47212,4210,0,"2,000,000 .. 5,000,000",0,0,0,0,2999.0,2999.0,0.0,"English, Czech, French, German, Italian, Polis...",Strategy,0,"{'Strategy': 1112, 'Historical': 627, 'Turn-Ba..."


In [13]:
reviews[reviews['appid'].isin([34330,201270])]

Unnamed: 0,appid,review_score,review_score_desc,total_positive,total_negative,total_reviews,download_appid
1801,201270,8,Very Positive,46712,4057,50769,201270


In [14]:
missing_ids[missing_ids['appid'].isin([34330,201270])]

Unnamed: 0,appid,reason


In [15]:
storefront[storefront.steam_appid != storefront.download_appid][['name','steam_appid','download_appid','last_modified']]

Unnamed: 0,name,steam_appid,download_appid,last_modified
40155,Fate of the World: Tipping Point,80200,901776,1447354767
71303,BUGWORLD:ONLINE,1421280,1421310,1602887075
96596,Tiny Stories: Santa Closes,1841060,1841070,1653425761
105012,Tom Clancy's Splinter Cell Conviction™ Deluxe ...,33220,33229,1579096325
105013,"Warhammer 40,000: Dawn of War II: Retribution",56400,56437,1603127845
105014,IL-2 Sturmovik: Cliffs of Dover,63950,63970,1447354055
105015,Call of Duty®: Modern Warfare® 3,42680,115300,1646849563
105016,Total War: SHOGUN 2,34330,201270,1603131194
105020,The Elder Scrolls IV: Oblivion® Game of the Ye...,22330,900883,1593798311
105021,Earthworm Jim,38480,901147,1634573795


In this case, the same App is present in different tables under different AppIDs. And it looks like we have removed a duplicate during the collectioin phase. It looks like:
1) JSONs returned by the Steam Storefront are the same
2) Review results seem to be the same for both appids (within a few days difference between download times)
3) steam_appid, returned in the Storefront response might be different from the appid used by the Steam itself: Links to resources use the new AppIDs almost everywhere (except, interestingly, achivements images that seem to be not updated by the publisher)

Fortunately, we have the appid we used for downloading saved in 'download_appid' column and we can notice that:
1) Old appids don't seem to be returned by the IStoreService API
2) SteamSpy has different stats returned by the old and new Appid
3) Old appid is only present in the Storefront

We can replace 'steam_appid' with the 'download_id' when it's different and add a note in the missing_ids. After that we can remove 'download_id' as not being relevant anymore.

#### [Subroutine] steam_appid: Fixing


In [16]:
# steam_appid fixing old AppIds
def appid_fix(df, missing_df):
    def replace_id(row, removed_ids, missing_ids):
        # replacing id in the row if it answers the conditions
        print('Replacing', row['steam_appid'], ' with ', row['download_appid'], ' for ', row['name'])
        missing_ids.append(row['download_appid'])
        removed_ids.append(row['steam_appid'])
        return row['download_appid']
    
    df = df.copy()
    # parsing header images to parsed_id
    # select only rows that have different Storefront and download_id IDs
    mask = (df['steam_appid'] != df['download_appid'])
    # replacing ids
    removed_ids = []
    missing_ids = []
    df.loc[mask,'steam_appid'] = df[mask].apply(lambda row: replace_id(row, removed_ids, missing_ids), axis=1)
    temp_df = pd.DataFrame(removed_ids,columns =['appid'])
    temp_df['reason'] = 'Storefront appid fix'
    missing_df =  missing_df.loc[~missing_df.appid.isin(missing_ids)]
    missing_df = pd.concat([missing_df, temp_df]).reset_index(drop=True)
    # dropping download_appid
    df.drop('download_appid', inplace = True, axis = 1)
    return df, missing_df

In [17]:
storefront, missing_ids = appid_fix(storefront, missing_ids)

Replacing 80200  with  901776  for  Fate of the World: Tipping Point
Replacing 1421280  with  1421310  for  BUGWORLD:ONLINE
Replacing 1841060  with  1841070  for  Tiny Stories: Santa Closes
Replacing 33220  with  33229  for  Tom Clancy's Splinter Cell Conviction™ Deluxe Edition
Replacing 56400  with  56437  for  Warhammer 40,000: Dawn of War II: Retribution
Replacing 63950  with  63970  for  IL-2 Sturmovik: Cliffs of Dover
Replacing 42680  with  115300  for  Call of Duty®: Modern Warfare® 3
Replacing 34330  with  201270  for  Total War: SHOGUN 2
Replacing 22330  with  900883  for  The Elder Scrolls IV: Oblivion® Game of the Year Edition Deluxe
Replacing 38480  with  901147  for  Earthworm Jim
Replacing 24980  with  901242  for  Mass Effect 2 Digital Deluxe Edition
Replacing 31220  with  901399  for  Sam & Max: The Devil’s Playhouse 
Replacing 12210  with  901583  for  Grand Theft Auto IV: Complete Edition
Replacing 39160  with  901638  for  Dungeon Siege III
Replacing 8260  with  90166

Checking for duplicates just in case:

In [18]:
storefront['steam_appid'].duplicated().sum()

0

In [19]:
steamspy['appid'].duplicated().sum()

0

In [20]:
reviews['appid'].duplicated().sum()

8

There are might be some duplicates in tables - I'll need to check the data collecting functions to remove the possibility of the duplicates getting in laters. For now I'll just clean it up:

In [21]:
storefront = storefront.drop_duplicates(subset='steam_appid', keep='last')
steamspy = steamspy.drop_duplicates(subset='appid', keep='last')
reviews = reviews.drop_duplicates(subset='appid', keep='last')

# Steam Storefront table

Let's process Steam Storefront table by each column

### Name

In [22]:
steamspy[steamspy['name'].isnull()].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 265 entries, 1408 to 104930
Data columns (total 20 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   appid            265 non-null    int64  
 1   name             0 non-null      object 
 2   developer        8 non-null      object 
 3   publisher        8 non-null      object 
 4   score_rank       0 non-null      float64
 5   positive         265 non-null    int64  
 6   negative         265 non-null    int64  
 7   userscore        265 non-null    int64  
 8   owners           265 non-null    object 
 9   average_forever  265 non-null    int64  
 10  average_2weeks   265 non-null    int64  
 11  median_forever   265 non-null    int64  
 12  median_2weeks    265 non-null    int64  
 13  price            15 non-null     float64
 14  initialprice     15 non-null     float64
 15  discount         15 non-null     float64
 16  languages        11 non-null     object 
 17  genre     

In [23]:
storefront[storefront['name'].isnull()]

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,...,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors,last_modified
25,,,660,,,,,,,,...,,,,,,,,,,1447354286
229,,,8040,,,,,,,,...,,,,,,,,,,1592490371
230,,,8060,,,,,,,,...,,,,,,,,,,1592490414
332,,,11610,,,,,,,,...,,,,,,,,,,1516788252
365,,,12650,,,,,,,,...,,,,,,,,,,1591334081
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103449,,,1981630,,,,,,,,...,,,,,,,,,,1654657760
103450,,,1981631,,,,,,,,...,,,,,,,,,,1654658229
103451,,,1981632,,,,,,,,...,,,,,,,,,,1654658253
103452,,,1981633,,,,,,,,...,,,,,,,,,,1654658255


In [24]:
storefront[storefront['name'].isnull()]['steam_appid']

25            660
229          8040
230          8060
332         11610
365         12650
           ...   
103449    1981630
103450    1981631
103451    1981632
103452    1981633
103698    1987380
Name: steam_appid, Length: 691, dtype: int64

In [25]:
steamspy[steamspy['name'].isnull()]['appid']

1408        63970
2310       210562
3860       256090
3909       257302
6668       315210
           ...   
104854    2025230
104855    2025240
104856    2025260
104857    2025270
104930    2051250
Name: appid, Length: 265, dtype: int64

#### Name overview
Judging by the quick overview of the blank game names, there seems to be multiple causes for it:
* The application is not present in Steam
* The application is a recent release that hasn't been parsed by SteamSpy properly yet
* The application is not released yet
* The 'application' is a DLC/DLC bundle
* The application has an emoticon in the name

Let's do a crosscheck between SteamSpy and Steam Storefront data:

In [26]:
storefront[storefront['steam_appid'].isin(steamspy[steamspy['name'].isnull()]['appid'].values)].sample(10)

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,...,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors,last_modified
51919,game,DOOM Eternal: The Ancient Gods - Part Two,1098293,0.0,False,,,The Ancient Gods – Part Two is the epic conclu...,The Ancient Gods – Part Two is the epic conclu...,The Ancient Gods – Part Two is the epic conclu...,...,"[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256825994, 'name': 'DLC2 Announce EN',...",{'total': 3468},,"{'coming_soon': False, 'date': '18 Mar, 2021'}","{'url': 'http://help.bethesda.net/', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [2, 5], 'notes': 'Similar to DOOM from...",1626280206
99593,game,Himbo Harem Homicide,1899730,0.0,False,,,A classic whodunit – but with a harem of himbo...,A classic whodunit – but with a harem of himbo...,A classic whodunit – but with a harem of himbo...,...,"[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256889809, 'name': 'HHH Trailer 2', 't...",,,"{'coming_soon': True, 'date': '30 Dec, 2022'}",{'url': 'https://www.mousepotatogames.com/abou...,https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1654378581
53713,dlc,RPG Maker VX Ace - Hibiki Katakura MV Monsters...,1124111,0.0,False,,,RPG Maker MV's beloved characters are now in H...,RPG Maker MV's beloved characters are now in H...,RPG Maker MV's beloved characters are now in H...,...,"[{'id': '53', 'description': 'Design & Illustr...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '8 Aug, 2019'}","{'url': 'http://www.rpgmakerweb.com/support', ...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1589362238
99865,game,King Rabbit - Race,1905900,0.0,True,,,"<a href=""https://steamcommunity.com/linkfilter...","<a href=""https://steamcommunity.com/linkfilter...",Race with friends through dangerous obstacle c...,...,"[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256876310, 'name': 'Release Trailer', ...",,,"{'coming_soon': True, 'date': '21 Jun, 2022'}","{'url': 'https://www.raresloth.com', 'email': ...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1655177785
98700,game,3 DAYS: Ghosts of War,1881790,0.0,False,,,3DAYS: Ghosts of War is a paragraph-based text...,3DAYS: Ghosts of War is a paragraph-based text...,You wake up... Afghanistan. You have 3 days . ...,...,"[{'id': '25', 'description': 'Adventure'}, {'i...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': True, 'date': '6/2022'}","{'url': 'www.koval.games', 'email': 'kovalgame...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1654879100
81084,dlc,"FUSER™ - J. Cole - ""No Role Modelz""",1588960,12.0,False,full,,Expand your FUSER™ library with &quot;No Role ...,Expand your FUSER™ library with &quot;No Role ...,Expand your FUSER™ library with &quot;No Role ...,...,"[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '6 May, 2021'}","{'url': 'https://us.ncsoft.com/en-us/support',...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1620313384
89440,game,Starfield,1716740,0.0,False,,,Starfield is the first new universe in 25 year...,Starfield is the first new universe in 25 year...,Starfield is the first new universe in 25 year...,...,"[{'id': '3', 'description': 'RPG'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256891408, 'name': '30s Trailer EN', '...",,,"{'coming_soon': True, 'date': '2023'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1655219857
96644,game,Mini Mart,1841960,0.0,False,full,,<strong>Welcome to Mini Mart!</strong> A mini ...,<strong>Welcome to Mini Mart!</strong> A mini ...,Stock up and rejuvenate your quaint mini mart ...,...,"[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': True, 'date': 'SUMMER 2022'}","{'url': 'https://www.toomanyfishstudios.com/',...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1644605787
60830,game,INTO THE BACKROOMS,1234330,0.0,False,full,,<strong>INTO THE BACKROOMS</strong> - Psycholo...,<strong>INTO THE BACKROOMS</strong> - Psycholo...,INTO THE BACKROOMS - Psychological horror in w...,...,"[{'id': '25', 'description': 'Adventure'}, {'i...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': True, 'date': 'TBD'}","{'url': '', 'email': 'support@playpun.com'}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [2, 5], 'notes': 'Blood, Scenes of vio...",1647000146
104828,game,Test Game 4,2025190,0.0,False,,,#Summer2022_hunt_quest_1_desc,#Summer2022_hunt_quest_1_desc,#Summer2022_hunt_quest_1_desc,...,,,,,,"{'coming_soon': False, 'date': ''}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}",1655397815


#### Are there any duplicate names?

In [27]:
storefront['name'].value_counts()[storefront['name'].value_counts()>1]

Alone             6
Bounce            4
Lost              4
Space Survival    4
Underwater        3
                 ..
Potentia          2
Maze Escape       2
Dark Light        2
Nightfall         2
Naughty Girl      2
Name: name, Length: 389, dtype: int64

In [28]:
storefront[storefront['name']=="['']"]

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,...,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors,last_modified


We have some duplicate names, but they are really different games. There is an interesting case with 'Fantasy Grounds - Aegis of Empires 1: The Book in the Old House' that is actually 3 different applications with the same name from the same developer.

Just in case, let's check also for some weird names.

In [29]:
storefront[storefront['name'].apply(lambda x: len(str(x)) < 6)]['name'].value_counts()

Alone    6
Lost     4
Helix    3
Cube     3
Arena    3
        ..
SNKRX    1
Lily     1
SPITE    1
REVN     1
Zeko     1
Name: name, Length: 2693, dtype: int64

In [30]:
storefront[storefront['name'].isin(['none','None','na','Na','False','false',0,'','invalid','Invalid'])]

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,...,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors,last_modified


#### [Subroutine] Name cleaning

All games from the store database have valid names, except those that we should clearly remove. We keep the rest of the column from store as is.

* Replace the ['none','None','na','Na','False','false',0,'','invalid','Invalid'] names with NaN

In [31]:
# Replacing incorrect columns with NaN (or delete them)
def cleanName(storefront, remove_data = False):
    badnames = ['none','None','na','Na','False','false',0,'','invalid','Invalid', pd.NA, np.nan]
    if (remove_data):
        remove_ids = storefront[storefront.name.isin(badnames)].index.tolist()
        storefront = removeIDs(storefront, remove_ids, 'Missing app name')
    else:
        storefront['name'].mask(storefront.name.isin(badnames), pd.NA, inplace=True )
    return storefront

In [32]:
storefront = cleanName(storefront, True)

In [33]:
storefront.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 104344 entries, 0 to 105034
Data columns (total 40 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   type                     104344 non-null  object
 1   name                     104344 non-null  object
 2   steam_appid              104344 non-null  int64 
 3   required_age             104344 non-null  object
 4   is_free                  104344 non-null  object
 5   controller_support       26065 non-null   object
 6   dlc                      9804 non-null    object
 7   detailed_description     104188 non-null  object
 8   about_the_game           104187 non-null  object
 9   short_description        104183 non-null  object
 10  fullgame                 35362 non-null   object
 11  supported_languages      104159 non-null  object
 12  header_image             104344 non-null  object
 13  website                  61136 non-null   object
 14  pc_requirements     

In [34]:
storefront[storefront['name'].isnull()]

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,...,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors,last_modified


### type

**Type** is an application type (you can actually designate it when downloading the data from the Steamfront. I've set it to download both dlc's and games). Besides the ones I've designated to download, there seems to be one special application reserved for Steam Gift Cards and some applications that don't have 'type' set:


In [35]:
storefront['type'].value_counts(dropna=False)

game           68934
dlc            35409
advertising        1
Name: type, dtype: int64

Let's take a look on the appliations that don't have the type set up:

In [36]:
storefront[storefront['type'].isnull()]

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,...,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors,last_modified


It seems like these applications don't have anything set besides the appid and name. It might be either test/removed applications or the ones that don't have the data filled in. Since they don't have any valuable data, I consider them safe to remove.

#### [Subroutine] Type cleaning

All games from the store database have valid types, except those that we should clearly remove. We keep the rest of the column from store as is.

* Replace the ['none','None','na','Na','False','false',0,'','invalid','Invalid'] names with NaN

In [37]:
# Replacing incorrect columns with NaN (or delete them)
def cleanType(storefront, remove_data = False):
    badnames = ['none','None','na','Na','False','false',0,'','invalid','Invalid',pd.NA, np.nan]
    if (remove_data):
        remove_ids = storefront[storefront.type.isin(badnames)].index.tolist()
        storefront = removeIDs(storefront, remove_ids, 'Missing app type')
    else:
        storefront['type'].mask(storefront.type.isin(badnames), pd.NA, inplace=True )
    return storefront

In [38]:
storefront = cleanType(storefront, True)

In [39]:
storefront.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 104344 entries, 0 to 105034
Data columns (total 40 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   type                     104344 non-null  object
 1   name                     104344 non-null  object
 2   steam_appid              104344 non-null  int64 
 3   required_age             104344 non-null  object
 4   is_free                  104344 non-null  object
 5   controller_support       26065 non-null   object
 6   dlc                      9804 non-null    object
 7   detailed_description     104188 non-null  object
 8   about_the_game           104187 non-null  object
 9   short_description        104183 non-null  object
 10  fullgame                 35362 non-null   object
 11  supported_languages      104159 non-null  object
 12  header_image             104344 non-null  object
 13  website                  61136 non-null   object
 14  pc_requirements     

### Developers

Compared to publishers where the store dataset has no null values, we have a few missing developers. Let's check them just in case.

In [40]:
storefront[storefront['developers'].isnull()]

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,...,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors,last_modified
272,game,Tycoon City: New York,9730,0.0,False,,,<h1>Special Offer</h1><p>Officially Licensed T...,Here's your chance to make it big in the Big A...,Here's your chance to make it big in the Big A...,...,"[{'id': '28', 'description': 'Simulation'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,{'total': 176},,"{'coming_soon': False, 'date': '12 Mar, 2008'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1486757370
326,game,Crash Time 2,11390,0.0,False,,,Solve exciting criminal cases on the mean stre...,Solve exciting criminal cases on the mean stre...,Crash Time 2 is an open-world combat racing ga...,...,"[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256810412, 'name': 'Crash Time 2 Steam...",{'total': 1083},,"{'coming_soon': False, 'date': '27 Aug, 2009'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1640793888
813,game,18 Wheels of Steel: Extreme Trucker,33730,0.0,False,,,You ‘da Boss! Move it better and faster while ...,You ‘da Boss! Move it better and faster while ...,You ‘da Boss! Move it better and faster while ...,...,"[{'id': '28', 'description': 'Simulation'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,{'total': 113},,"{'coming_soon': False, 'date': '23 Sep, 2009'}","{'url': 'https://playhardgames.net/contact/', ...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1591335113
814,game,Prison Tycoon 4: SuperMax,33750,0.0,False,,,Hard Time is Money <br>\t\t\t\t\t\tBuild a pro...,Hard Time is Money <br>\t\t\t\t\t\tBuild a pro...,Hard Time is Money Build a profitable privatel...,...,"[{'id': '28', 'description': 'Simulation'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': ''}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1591335156
1256,dlc,Mafia II - Vegas DLC,50142,0.0,False,,,,,,...,,,,,,"{'coming_soon': False, 'date': ''}","{'url': '', 'email': ''}",,"{'ids': [5], 'notes': None}",1582659485
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96893,game,Age of Empires IV Content Editor,1846820,0,False,,,,,,...,,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '7 Apr, 2022'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1649360761
97039,dlc,OMSI 2 - Add-on Irisbus Familie – Citybus Pack,1849680,0,False,full,,With the OMSI AddOn Irisbus Family Citybus Pac...,With the OMSI AddOn Irisbus Family Citybus Pac...,With the OMSI AddOn Irisbus Family Citybus Pac...,...,"[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '21 Dec, 2021'}",{'url': 'https://helpdesk.aerosoft.com/portal/...,https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1640077359
104202,game,Neon Ronin Playtest,2002110,0,False,,,,,,...,,,,,,"{'coming_soon': False, 'date': '23 May, 2022'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}",1653325318
104205,dlc,Dread Hunger Bone Rings,2002150,0,False,,,,,,...,,,,,,"{'coming_soon': False, 'date': ''}","{'url': '', 'email': ''}",,"{'ids': [1, 2, 5], 'notes': 'You can chop the ...",1652738679


There are around 300 entries without developers - there are some games which are no longer available, some are retro games which some publisher has the right to, but the developer is unlisted and in some cases publisher just never filled in the developer field.

In [41]:
storefront['developers'].value_counts().head(30)

['SmiteWorks USA, LLC']                          2339
['TigerQiuQiu']                                  2252
['Ubisoft - San Francisco']                      1667
['KOEI TECMO GAMES CO., LTD.']                   1499
['CAPCOM Co., Ltd.']                              517
['N3V Games']                                     453
['Dovetail Games']                                398
['Milestone S.r.l.']                              263
['Tamsoft']                                       207
['Harmonix Music Systems, Inc']                   196
['Paradox Development Studio']                    194
['Laush Dmitriy Sergeevich']                      188
['Nihon Falcom']                                  182
['Choice of Games']                               172
['Rebellion']                                     163
['Square Enix', 'KOEI TECMO GAMES CO., LTD.']     152
['The Digital Puzzle Company']                    148
['Capcom']                                        147
['Creobit']                 

In [42]:
storefront[storefront['developers']=='[""]']

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,...,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors,last_modified


As we'll see in the later sectioins, it seems like [''] is a placeholder in Steam for mandatory values which are not filled, or have been deleted.

In [43]:
steamspy[steamspy['appid'].isin(storefront[storefront['developers'].isnull()]['steam_appid'].values)]['developer'].value_counts().head(60)

一次元创作组                        3
Christian tavares da silva    3
Valve                         1
IPBuilders                    1
Lesson of Passion             1
BitLight                      1
Atomic Jelly                  1
CCS                           1
Kangeado games                1
Name: developer, dtype: int64

#### [Subroutine] Developers: Cleaning

* First we will merge storefront and steamspy, keeping storefront data unless we have a NaN
* This process is identical to other columns that appear in multiple dataframes so we'll go through all of them before the actual data merge (with Steam data always having priority over Steam Spy data)
* Also, we'll have to adjust the column data to the same format as it's different between the datasets.
* Then we will copy the publisher name into the developer, for the game cases without developers. Games with other missing information we will take care of afterwards.

In [44]:
# To simplify cleaning, let's change appid and steam_appid to the appid and make it an index (since we already made sure it's unique)
# Since we will be using df.fillna(df2) later, it would be useful to change similar column names so keep them identicall across different datasets.
def renameIDs(storefront,steamspy,reviews,missing_ids):
    storefront = storefront.rename(columns={'steam_appid':'appid'})
    storefront = storefront.set_index('appid')
    steamspy = steamspy.rename(columns={'genre':'genres', 'developer':'developers', 'publisher':'publishers',
                              'languages':'supported_languages','userscore':'review_score','positive':'total_positive',
                                'negative':'total_negative'})
    steamspy = steamspy.set_index('appid')
    reviews = reviews.set_index('appid')
    missing_ids = missing_ids.set_index('appid')
    return storefront, steamspy, reviews, missing_ids

In [45]:
storefront, steamspy, reviews, missing_ids = renameIDs(storefront,steamspy,reviews,missing_ids)

In [46]:
# This is the function that fills the null data in maindf with the data from the subdf.
# In this function, the index from both dataframes must be the same - the old appid in our case.
# Also, the column names where we will be getting our values should also be the same.
# Lastly, ideally we would the values to be formatted in the same way - but we can also check later.
def updateFromAlternateSource(maindf,subdf):
    df = maindf.copy()
    df = df.fillna(subdf)
    return df

Now we could actually run this function and update the developers from Steam Spy. But the data is formatted differently in some columns and this will be a problem when filling the null data

We will have to take this into account when formatting these columns, as the information from Steam Spy will be added for the NaN.

### Publishers

It seemed that the publishers were ok, as we have no NaN. However, there are a lot of blank names. This is probably a mandatory metadata from Steam, and some ids have managed to not put a publisher whatsoever doing that.

Let's look at them, if there are valid ones (i.e ones who have a developer) we can consider them self-published and just do the same as before, copying the developer name into the publisher.

In [47]:
storefront['publishers'].value_counts()

['']                                     9812
['TigerQiuQiu']                          2248
['Degica']                               1540
['KOEI TECMO GAMES CO., LTD.']           1388
['Dovetail Games - Trains']               605
                                         ... 
['Actawesome']                              1
['Orange Valley']                           1
['Cantastorie']                             1
['Gabriele Barone', 'Giacomo Garcea']       1
['Bialy']                                   1
Name: publishers, Length: 38466, dtype: int64

In [48]:
(storefront['publishers']=="['']").sum()

9812

In [49]:
storefront[(storefront['publishers']=="['']") & (storefront['developers'].isnull())]

Unnamed: 0_level_0,type,name,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,...,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors,last_modified
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
50142,dlc,Mafia II - Vegas DLC,0.0,False,,,,,,,...,,,,,,"{'coming_soon': False, 'date': ''}","{'url': '', 'email': ''}",,"{'ids': [5], 'notes': None}",1582659485
218064,dlc,BIT.TRIP Presents... Runner2: Future Legend of...,0.0,False,,,BIT.TRIP Presents... Runner2: Future Legend of...,BIT.TRIP Presents... Runner2: Future Legend of...,BIT.TRIP Presents... Runner2: Future Legend of...,,...,,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '26 Feb, 2013'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1478115278
218980,game,Patterns,0.0,False,,,Create worlds beyond your imagination in Patte...,Create worlds beyond your imagination in Patte...,Create worlds beyond your imagination in Patte...,,...,"[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 2028932, 'name': 'Patterns Trailer 2',...",{'total': 108},,"{'coming_soon': False, 'date': ''}",{'url': 'http://www.buildpatterns.com/#!commun...,https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1542738952
222860,game,Left 4 Dead 2 Dedicated Server,0.0,False,,,,,,,...,,,,,,"{'coming_soon': False, 'date': ''}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}",1574121232
224880,game,Equate Game,0.0,False,,,,,,,...,,,,,,"{'coming_soon': True, 'date': ''}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}",1573775729
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1688010,game,GotG Dedicated Server,0,False,,,,,,,...,,,,,,"{'coming_soon': False, 'date': '10 Sep, 2021'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}",1643739511
1763330,game,Polyslime,0,True,,,In this game the goal is simply to survive as ...,In this game the goal is simply to survive as ...,An action survival game where you will craft w...,,...,"[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256853022, 'name': 'Trailer1', 'thumbn...",,,"{'coming_soon': False, 'date': '13 Oct, 2021'}","{'url': '', 'email': 'sugmastudios@gmail.com'}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1647471338
1846820,game,Age of Empires IV Content Editor,0,False,,,,,,,...,,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '7 Apr, 2022'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1649360761
2002110,game,Neon Ronin Playtest,0,False,,,,,,,...,,,,,,"{'coming_soon': False, 'date': '23 May, 2022'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}",1653325318


In [50]:
storefront[storefront['publishers']=="['']"].sample(10)

Unnamed: 0_level_0,type,name,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,...,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors,last_modified
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1656540,dlc,Fantasy Grounds - D&D The Wild Beyond the Witc...,0.0,False,,,"<h2 class=""bb_tag"">D&amp;D The Wild Beyond the...","<h2 class=""bb_tag"">D&amp;D The Wild Beyond the...","Once every eight years, the fantastic Witchlig...","{'appid': '1196310', 'name': 'Fantasy Grounds ...",...,"[{'id': '23', 'description': 'Indie'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '20 Sep, 2021'}","{'url': 'www.fantasygrounds.com', 'email': 'su...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1632837407
1677610,dlc,Tanto Cuore - Expanding the House,0.0,False,,,"Tanto Cuore: Expanding the House, the first ex...","Tanto Cuore: Expanding the House, the first ex...",The first expansion to Tanto Cuore! Expand you...,"{'appid': '1441230', 'name': 'Tanto Cuore'}",...,"[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256873056, 'name': 'Expanding the Hous...",,,"{'coming_soon': False, 'date': '15 Jan, 2022'}","{'url': 'https://japanimegames.com/', 'email':...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [5], 'notes': 'Suggestive Themes'}",1646096655
433200,dlc,Goliath - Original Soundtrack,0.0,False,full,,"The official soundtrack from Goliath, the epic...","The official soundtrack from Goliath, the epic...","The official soundtrack from Goliath, the epic...","{'appid': '363520', 'name': 'Goliath'}",...,"[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '12 May, 2016'}","{'url': 'http://www.goliath-game.com/', 'email...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1463072818
828500,dlc,XXZ: Squad,0.0,False,,,A squad of military officers from the secret b...,A squad of military officers from the secret b...,A squad of military officers from the secret b...,"{'appid': '696590', 'name': 'XXZ: XXL'}",...,"[{'id': '71', 'description': 'Sexual Content'}...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256713007, 'name': 'Gameplay', 'thumbn...",,,"{'coming_soon': False, 'date': '30 Mar, 2018'}","{'url': '', 'email': 'stotch_steam@list.ru'}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [1, 5], 'notes': None}",1522447607
1226780,dlc,Fantasy Grounds - Lemurian Chronicles,0.0,False,,,<strong>Adventures for Barbarian of Lemuria</s...,<strong>Adventures for Barbarian of Lemuria</s...,Adventures for Barbarian of LemuriaFree your i...,"{'appid': '252690', 'name': 'Fantasy Grounds C...",...,"[{'id': '23', 'description': 'Indie'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '18 Feb, 2020'}","{'url': '', 'email': 'support@fantasygrounds.c...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1582127767
1145760,dlc,The Ghost on the Desktop,0.0,True,,,"<h2 class=""bb_tag"">About this DLC</h2>The Ghos...","<h2 class=""bb_tag"">About this DLC</h2>The Ghos...",The Ghost on the Desktop is a free add-on to Y...,"{'appid': '765030', 'name': ""樱雪集～Yuyuko's Butt...",...,"[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '16 Sep, 2019'}","{'url': 'https://weibo.com/u/2697566561', 'ema...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1594616413
1210200,dlc,Biker Garage Mechanic Simulator - Junkyard DLC,0.0,True,,,This is a junkyard. From now on you will be ab...,This is a junkyard. From now on you will be ab...,Junkyard DLC for Biker Garage Mechanic Simulator.,"{'appid': '989770', 'name': 'Biker Garage: Mec...",...,"[{'id': '9', 'description': 'Racing'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256772557, 'name': 'release trailer', ...",,,"{'coming_soon': False, 'date': '16 Jan, 2020'}","{'url': '', 'email': 'rafal@devjuice.com'}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1579197716
604870,dlc,Fantasy Grounds - Blackwater Creek (CoC7E),0.0,False,,,"<h2 class=""bb_tag""><strong>Blackwater Creek</s...","<h2 class=""bb_tag""><strong>Blackwater Creek</s...",Blackwater CreekFirst of two adventures includ...,"{'appid': '252690', 'name': 'Fantasy Grounds C...",...,"[{'id': '23', 'description': 'Indie'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '7 Mar, 2017'}","{'url': '', 'email': 'support@fantasygrounds.c...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1495227482
1086930,dlc,名城BGM合辑,0.0,False,,,"BGM, from more than 30 different urban backgro...","BGM, from more than 30 different urban backgro...","BGM, from more than 30 different urban backgro...","{'appid': '955170', 'name': '铁道物语：陆王（Railway S...",...,"[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '21 May, 2019'}","{'url': 'http://www.rwsaga.com/cs', 'email': '...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1606880102
565460,dlc,Fantasy Grounds - Mini-Dungeon #008: Carrionho...,0.0,False,,,"<h2 class=""bb_tag""><strong>For 4-5 PCs of Leve...","<h2 class=""bb_tag""><strong>For 4-5 PCs of Leve...",For 4-5 PCs of Levels 4-5Home to a coven of th...,"{'appid': '252690', 'name': 'Fantasy Grounds C...",...,"[{'id': '23', 'description': 'Indie'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '16 Dec, 2016'}","{'url': '', 'email': 'support@fantasygrounds.c...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1481894700


It seems like Steam Storefront uses [''] to fill the empty data in the mandatory fields. We'll change it to NaN for the easier filtering and do the same in the other fields.

Is it possible that some of these values were registered at some point by Steam Spy and conserved? Let's check that, if not we will simply treat them like NaNs.

Also, it seems like we even have some apps with both Publisher and Developer data being empty. It's either in the games removed from Steam or in ithe DLCs where the game creater was lazy and didn't fill the relevant data in the DLC package, so we can take it from the parent app.

In [51]:
(~steamspy[steamspy.index.isin(storefront[storefront['publishers']=="['']"].index)]['publishers'].isnull()).sum()

199

It seems we can recover some values from Steam Spy, now that we have discovered that this supposedly complete column had some NaNs..

#### Publisher/Others: Cleaning Decision

* I.e using `storefront = storefront.replace("['']", pd.NA)` we should catch any [''] fields in the steam database, which we thought more complete. Then merge ids, using the Steam Store value (if available) and falling back to Steam Spy if possible.


* If there is no publisher, but we have a developer, then we will use the developer as publisher as well. If there is no publisher or developer, we will simply delete the record.

* If it's we have neither and it's a DLC we'll check the parent app

* If neither option succeed, we'll replace the values with pd.NA to keep the null data consistent.


#### [Subroutine] Publishers: Cleaning

In [52]:
# Replace empty data with the parent app data
# {'appid': '1141390', 'name': 'The Blitzkrieg:'}
def getParentValue(row, column):
    if (pd.isna(row[column])) & (not pd.isna(row['fullgame'])):
        try:
            appid2 = int(ast.literal_eval(row['fullgame'])['appid'])
            parent_row = storefront.loc[appid2]
            return parent_row[column]
        except:
            return row[column]
    else:
        return row[column]

In [53]:
# Getting the data from other column
def getOtherColumnValue(row,current,alternate):
    if pd.isna(row[current]):
        return row[alternate]
    else:
        return row[current]

In [54]:
# Get other column and if it's not available - parent
def getOtherOrParentColumnValue(row,current,alternate):
    if pd.isna(row[current]):
        if (pd.isna(row[alternate])) & (not pd.isna(row['fullgame'])):
            try:
                appid2 = int(ast.literal_eval(row['fullgame'])['appid'])
                parent_row = storefront.loc[appid2]
                return parent_row[current]
            except:
                return row[current]
        else:
            return row[alternate]
    else:
        return row[current]

In [55]:
# Fixing data for publishers/developers
def fixDevPub(storefront, steamspy):
    storefront = storefront.replace("['']", pd.NA)
    storefront = updateFromAlternateSource(storefront,steamspy)
    storefront['developers'] = storefront.apply(getOtherOrParentColumnValue, current='developers', alternate='publishers', axis=1)
    storefront['publishers'] = storefront.apply(getOtherOrParentColumnValue, current='publishers', alternate='developers', axis=1)
    return storefront

Running this function will get any values from Steam Spy which are useful from the repeated columns. We have also eliminated the empty string values and replaced them with NaN, to ensure our cleaning functions detect them properly.

However, note that we have also updated genres and languages by doing it this way...

In [56]:
storefront = fixDevPub(storefront, steamspy)

In [57]:
storefront.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 104344 entries, 10 to 1873351
Data columns (total 39 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   type                     104344 non-null  object
 1   name                     104344 non-null  object
 2   required_age             104344 non-null  object
 3   is_free                  104344 non-null  bool  
 4   controller_support       26065 non-null   object
 5   dlc                      9804 non-null    object
 6   detailed_description     104188 non-null  object
 7   about_the_game           104187 non-null  object
 8   short_description        104183 non-null  object
 9   fullgame                 35362 non-null   object
 10  supported_languages      104180 non-null  object
 11  header_image             104344 non-null  object
 12  website                  61136 non-null   object
 13  pc_requirements          104344 non-null  object
 14  mac_requirements  

It seems we still have some rows with publisher/developer data not available.

### Genres

There are 2 similar types of data here. We have genres and categories. Genres are present in both datasets, categories - only in Storefront.

The stucture for these columns is quite similar - it's a list of dictionaries similar to {'id': 'N', 'description': 'XXX}. We have a couple of approaches when analysing data here - unwrap the list of dictionaries for each row into the list of genres/categories and either:

1) Keep them in the same column as a simple list of items.
2) Spread the list (with the item being the column name and binary value of the item present in the row) and keep it in the same table.
3) Move the list into a separate table with appid being the key and the rest of the columns - categories with binary value.
4) Transform that said wide table into the long one with the 'appid' and 'category' column.

These approaches have different advantaged/disadvantages but for all of them we'll have to unwrap the dictionaries into a simple list of values.

In [58]:
storefront['genres'].value_counts()

[{'id': '1', 'description': 'Action'}]                                                                                                                                                                                                                                                                                                                                                                    5480
[{'id': '1', 'description': 'Action'}, {'id': '23', 'description': 'Indie'}]                                                                                                                                                                                                                                                                                                                              5029
[{'id': '4', 'description': 'Casual'}, {'id': '23', 'description': 'Indie'}]                                                                                                                              

In [59]:
steamspy['genres'].value_counts()

Action                                                              5135
Action, Indie                                                       4610
Casual, Indie                                                       4393
Action, Casual, Indie                                               4100
Action, Adventure, Indie                                            3563
                                                                    ... 
Violent, Gore, Action, Adventure, Indie, Racing                        1
Action, Adventure, Indie, Racing, RPG, Sports                          1
Action, Massively Multiplayer, Racing, Simulation                      1
Action, Adventure, Casual, Indie, Racing, Strategy, Early Access       1
Indie, Racing, RPG                                                     1
Name: genres, Length: 1946, dtype: int64

If there are no single commas inside any genre, it would make sense to list them exactly like Steam Spy has done. If not, we will look for a different character, or even just splitting it into a list, but something clearer than this dict form in string available for the Steam Store.

In [60]:
storefront[storefront['genres']=="['']"].shape[0]

0

In [61]:
storefront['genres'].isnull().sum()

200

In [62]:
# unwrapping list of dictionaries into to the list
# remove the NaN valueus while we are at it
def extractDictList(jsonDict, key):
    if jsonDict != jsonDict:
        return pd.NA
    else:
        try:
            evalList = ast.literal_eval(jsonDict)
            items = []
            if(type(evalList) == dict):
                if not (pd.isna(evalList[key])):
                    items.append(evalList[key])
                return items
            else:
                for dictionary in evalList:
                        if not (pd.isna(dictionary[key])):
                            items.append(dictionary[key])
                return items
        except :
            return pd.NA

A little explanation of above. Most games are indeed formatted with a dict inside. But there are a few ones (48), that after closer inspection already had the genre column formatted into the games of the genres separated by commas. Of these ones, there is only one valid game (one game that still exists in the store), https://store.steampowered.com/app/22330/The_Elder_Scrolls_IV_Oblivion_Game_of_the_Year_Edition/

This was actually recovered with the update function we defined and executed above with the developers and publishers, the information is coming from steam spy.

If there is no proper item in the list or it's empty, the value will be set as NaN.

#### [Subroutine] Genres: Cleaning

In [63]:
storefront['genres'] = storefront['genres'].apply(extractDictList, key='description')

In [64]:
storefront.genres.explode().value_counts(dropna=False)

Indie                    66509
Action                   43611
Casual                   37710
Adventure                34950
Simulation               24086
Strategy                 21901
RPG                      21353
Free to Play              8329
Early Access              7985
Sports                    4887
Racing                    4067
Massively Multiplayer     3287
Design & Illustration     1775
Web Publishing            1672
Violent                    813
Utilities                  525
Gore                       501
Animation & Modeling       362
Education                  287
Software Training          263
Nudity                     251
Sexual Content             245
Game Development           231
Video Production           226
NaN                        202
Photo Editing              199
Audio Production           169
Accounting                   5
Movie                        3
Documentary                  1
Episodic                     1
Short                        1
Tutorial

In [65]:
storefront[storefront['genres'].isnull()].sample(10)

Unnamed: 0_level_0,type,name,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,...,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors,last_modified
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
394650,dlc,Satellite Reign Soundtrack,0.0,False,,,Featuring original music from BAFTA nominated ...,Featuring original music from BAFTA nominated ...,12 Cyberpunk tracks from BAFTA Nominated compo...,"{'appid': '268870', 'name': 'Satellite Reign'}",...,,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '20 Aug, 2015'}","{'url': 'http://satellitereign.com/', 'email':...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1464918671
71167,game,ToeJam & Earl in Panic on Funkotron,0.0,False,,,<h1>Just Updated</h1><p>All SEGA MegaDrive/Gen...,The peaceful planet of Funkotron has been inva...,The peaceful planet of Funkotron has been inva...,,...,,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '14 Mar, 2012'}","{'url': 'https://support.sega.co.uk', 'email':...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1548935871
325670,game,CRYENGINE - Wwise Project DLC,0.0,True,,,The EaaS Wwise Project DLC gives you access to...,The EaaS Wwise Project DLC gives you access to...,The EaaS Wwise Project DLC gives you access to...,,...,,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '7 Oct, 2014'}",{'url': 'http://www.cryengine.com/community/vi...,https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1478096559
318830,dlc,Sprite Lamp - Pro upgrade,0.0,False,,,Sprite Lamp pro adds a variety of additional f...,Sprite Lamp pro adds a variety of additional f...,Sprite Lamp pro brings a variety of additional...,"{'appid': '316830', 'name': 'Sprite Lamp'}",...,,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '25 Sep, 2014'}","{'url': '', 'email': 'support@snakehillgames.c...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1447363205
104600,game,Portal 2 - The Final Hours,0.0,False,,,<h1>Free Bonus Chapter Added</h1><p><p><strong...,<p>Portal 2 draws from the award-winning formu...,Portal 2 draws from the award-winning formula ...,,...,,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,{'total': 543},,"{'coming_soon': False, 'date': '17 May, 2011'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1447354230
204731,dlc,Krater - Soundtrack,0.0,False,,,<strong>THE MUSIC OF KRATER</strong><br>Krater...,<strong>THE MUSIC OF KRATER</strong><br>Krater...,THE MUSIC OF KRATERKrater is a game where a lo...,"{'appid': '42170', 'name': 'Krater'}",...,,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '12 Jun, 2012'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1478100739
542620,game,Lonelyland VR,0.0,False,,,"Lonelyland is a sandbox survival VR game, and ...","Lonelyland is a sandbox survival VR game, and ...","It’s a sandbox survival game, now you can play...",,...,,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256672585, 'name': 'Lonelyland Trailer...",,,"{'coming_soon': False, 'date': '9 Nov, 2016'}","{'url': '', 'email': 'catboxvr@gmail.com'}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1509753263
2025260,game,Test Game 9,0.0,False,,,#Summer2022_hunt_quest_9_desc,#Summer2022_hunt_quest_9_desc,#Summer2022_hunt_quest_9_desc,,...,,,,,,"{'coming_soon': False, 'date': ''}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}",1654204714
593000,game,Alien Swarm: Reactive Drop - SDK,0.0,False,,,,,,,...,,,,,,"{'coming_soon': False, 'date': '20 Apr, 2017'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}",1650830074
371930,dlc,Pillar Original Soundtrack,0.0,False,,,The soundtrack for Pillar; ambient music set i...,The soundtrack for Pillar; ambient music set i...,The soundtrack for Pillar; ambient music set i...,"{'appid': '371900', 'name': 'Pillar'}",...,,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '29 May, 2015'}","{'url': 'http://forums.michaelarts.net', 'emai...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1447374766


In [66]:
storefront.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 104344 entries, 10 to 1873351
Data columns (total 39 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   type                     104344 non-null  object
 1   name                     104344 non-null  object
 2   required_age             104344 non-null  object
 3   is_free                  104344 non-null  bool  
 4   controller_support       26065 non-null   object
 5   dlc                      9804 non-null    object
 6   detailed_description     104188 non-null  object
 7   about_the_game           104187 non-null  object
 8   short_description        104183 non-null  object
 9   fullgame                 35362 non-null   object
 10  supported_languages      104180 non-null  object
 11  header_image             104344 non-null  object
 12  website                  61136 non-null   object
 13  pc_requirements          104344 non-null  object
 14  mac_requirements  

### Categories

Let's see what we have in categories.

In [67]:
categories_check = storefront['categories'].apply(extractDictList, key='description')
categories_check

appid
10         [Multi-player, PvP, Online PvP, Shared/Split S...
20         [Multi-player, PvP, Online PvP, Shared/Split S...
30                  [Multi-player, Valve Anti-Cheat enabled]
40         [Multi-player, PvP, Online PvP, Shared/Split S...
50         [Single-player, Multi-player, Valve Anti-Cheat...
                                 ...                        
1351710             [Single-player, Full controller support]
1357200                  [Single-player, Steam Achievements]
1444140    [Single-player, Multi-player, Co-op, Online Co...
1863540    [Single-player, Multi-player, PvP, Online PvP,...
1873351    [Single-player, Downloadable Content, Steam Tr...
Name: categories, Length: 104344, dtype: object

In [68]:
categories_check.explode().value_counts(dropna=False)

Single-player                 94979
Steam Achievements            50901
Downloadable Content          35409
Steam Cloud                   29274
Multi-player                  28302
Full controller support       26065
Steam Trading Cards           20005
Partial Controller Support    19154
Co-op                         16176
PvP                           13580
Steam Leaderboards            13578
Online PvP                    12762
Shared/Split Screen           10313
Online Co-op                   8946
Remote Play Together           6943
Cross-Platform Multiplayer     6826
Shared/Split Screen PvP        6205
Stats                          5548
Steam Workshop                 5036
In-App Purchases               4840
Shared/Split Screen Co-op      4822
Includes level editor          3741
Remote Play on TV              2774
Captions available             2188
MMO                            2055
Remote Play on Tablet          1562
LAN Co-op                      1542
Remote Play on Phone        

#### [Subroutine] Categories: Cleaning

In [69]:
storefront['categories'] = storefront['categories'].apply(extractDictList, key='description')

There are some apps with null categories and with null genres.

In [70]:
storefront[storefront['categories'].isnull()].sample(15)

Unnamed: 0_level_0,type,name,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,...,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors,last_modified
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
288730,game,Redneck Racers,0.0,False,,,<strong>Welcome to Pimperstown - a small villa...,<strong>Welcome to Pimperstown - a small villa...,Welcome to Pimperstown - a small village miles...,,...,[Casual],"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 2031548, 'name': 'RedneckRacersTrailer...",,,"{'coming_soon': False, 'date': '28 Apr, 2014'}","{'url': '', 'email': 'support@libredia.com'}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1528291068
886420,game,UnBorn,0.0,False,,,GAME: <br />\r\n<br />\r\nYou are trapped in a...,GAME: <br />\r\n<br />\r\nYou are trapped in a...,You are trapped in a graveyard. Unlike the zom...,,...,"[Violent, Action, Casual, Indie, Simulation]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256720102, 'name': 'trailer', 'thumbna...",,,"{'coming_soon': False, 'date': '4 Jul, 2018'}","{'url': '', 'email': 'unborn.co'}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1530698820
500590,game,Puppy Dog: Jigsaw Puzzles,0.0,False,,,<strong>Puppy Dog: Jigsaw Puzzles</strong> is ...,<strong>Puppy Dog: Jigsaw Puzzles</strong> is ...,Puppy Dog: Jigsaw Puzzles is the perfect game ...,,...,[Casual],"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256666841, 'name': 'Trailer Puppy Dog ...",,,"{'coming_soon': False, 'date': '19 Jul, 2016'}","{'url': 'http://www.ensenasoft.com', 'email': ...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1478754279
2025260,game,Test Game 9,0.0,False,,,#Summer2022_hunt_quest_9_desc,#Summer2022_hunt_quest_9_desc,#Summer2022_hunt_quest_9_desc,,...,,,,,,"{'coming_soon': False, 'date': ''}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}",1654204714
1433960,game,Fast Food Simulator,0.0,False,,,<h1>CHECK OUT MORE GAMES FROM US</h1><p><a hre...,"<a href=""https://store.steampowered.com/app/13...",Fast Food Simulator is a restaurant and cookin...,,...,"[Indie, Simulation]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256807923, 'name': 'fast food', 'thumb...",,,"{'coming_soon': False, 'date': '11 Nov, 2020'}","{'url': 'http://indiegames3000.com/', 'email':...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1624901432
1079340,game,homicide,0.0,False,,,This is a decryption game (first-person perspe...,This is a decryption game (first-person perspe...,This is a decryption game (first-person perspe...,,...,"[Action, Adventure, Indie]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256749515, 'name': '01', 'thumbnail': ...",,,"{'coming_soon': False, 'date': '18 May, 2019'}","{'url': '', 'email': '974357348@qq.com'}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1558501720
1037960,game,Grand Dude Simulator,0.0,False,,,<strong>Grand Dude Simulator</strong> is an op...,<strong>Grand Dude Simulator</strong> is an op...,Grand Dude Simulator is an open world sandbox ...,,...,"[Action, Indie, Racing, Simulation]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256744885, 'name': 'Trailer', 'thumbna...",{'total': 138},,"{'coming_soon': False, 'date': '12 Mar, 2019'}","{'url': '', 'email': 'jooomly@gmail.com'}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1552825064
1544560,game,Death in sands,0.0,False,,,Death in sands is a shooter that will make you...,Death in sands is a shooter that will make you...,Death in sands is a shooter that will make you...,,...,"[Action, Indie]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256870451, 'name': 'First trailer', 't...",,,"{'coming_soon': True, 'date': '14 Feb, 2022'}",{'url': 'https://vk.com/andranoracreativestudi...,https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1643435578
710460,game,Rush on Rome,0.0,False,,,<h1>🔥 Upcoming games - Wishlist now!</h1><p><a...,The player steps into the sandals of a Roman p...,Find the missing pieces of the artifact and sa...,,...,[Strategy],"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256694489, 'name': 'Rush on Rome', 'th...",,,"{'coming_soon': False, 'date': '17 Oct, 2017'}","{'url': '', 'email': 'support@rokaplay.com'}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1620642248
1433970,game,Cake Shop Simulator,0.0,False,,,<h1>CHECK OUT MORE GAMES FROM US</h1><p><a hre...,"<a href=""https://store.steampowered.com/app/13...",Cake Shop Simulator is an exciting cooking gam...,,...,"[Indie, Simulation]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256811800, 'name': 'Gameplay_1', 'thum...",,,"{'coming_soon': False, 'date': '7 Dec, 2020'}","{'url': 'http://indiegames3000.com/', 'email':...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",1624899396


There are actually tons of useful metadata here. This seems to be what is shown at the steam store webpage at the right.

This might be usefull for the different ways we can group and analyse the data later, like achievement availability, controller supporot and console ports (if we'll get a console games dataset, for example).


### required_age

In [71]:
storefront['required_age'].value_counts()

0.0        62954
0          38589
18.0        1197
16.0         520
18           312
17.0         178
12           175
12.0         153
16            94
15.0          45
13.0          37
3.0           14
7.0           14
15             7
17             6
3              6
13             5
14.0           4
10.0           4
7              4
11.0           3
10             3
18+            2
6.0            2
11             2
1.0            2
6              2
20             1
１８             1
19.0           1
14             1
99999.0        1
5.0            1
4.0            1
20.0           1
171.0          1
12+            1
Name: required_age, dtype: int64

In [72]:
getSteamLink(storefront[storefront['required_age']=='18.0'])

Quake 4 https://store.steampowered.com/app/2210
QUAKE https://store.steampowered.com/app/2310
Company of Heroes - Legacy Edition https://store.steampowered.com/app/4560
Condemned: Criminal Origins https://store.steampowered.com/app/4720
Hitman: Blood Money https://store.steampowered.com/app/6860
Hitman: Codename 47 https://store.steampowered.com/app/6900
Men of War™ https://store.steampowered.com/app/7830
NecroVision https://store.steampowered.com/app/7860
Just Cause 2 https://store.steampowered.com/app/8190
BioShock® 2 https://store.steampowered.com/app/8850
Borderlands Game of the Year https://store.steampowered.com/app/8980
RAGE https://store.steampowered.com/app/9200
Call of Duty: World at War https://store.steampowered.com/app/10090
Manhunt https://store.steampowered.com/app/12130
Max Payne https://store.steampowered.com/app/12140
Max Payne 2: The Fall of Max Payne https://store.steampowered.com/app/12150
Grand Theft Auto: Episodes from Liberty City https://store.steampowered.com/

This column is really messy. Values seem to have different types (integer, floating and even string), some values are very suspicious (171.0 and 99999.0). 0 seems to mean 'no restriction'. According to PEGI the values should be 3, 7, 12, 16 and 18 but age restrictions might vary from country to country so having a lot of different numbers is understandable. 
The detailed rated content description is explained in 'content_descriptors' column:

In [73]:
storefront[storefront['required_age']=='18.0'].content_descriptors.value_counts()

{'ids': [], 'notes': None}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          636
{'ids': [2, 5], 'notes': None}                                                                                                                                                                                                                                                                                                                                                                                                                  

I'll transform required_age in this way:
* **required_age**: change value to the same type, parsing strings if necessary. Leave strange age as is?

#### [Subroutine] Required_age: Cleaning

In [74]:
# Getting integer age from the data
def getAge(age):
    age = str(age)
    try:
        x = re.search('\d+', age).group()
        x = int(x)
    except:
        return pd.NA
    return x

In [75]:
# Cleaning up age
storefront['required_age'] = storefront['required_age'].apply(getAge)

In [76]:
storefront['required_age'].value_counts()

0        101543
18         1512
16          614
12          329
17          184
15           52
13           42
3            20
7            18
10            7
11            5
14            5
6             4
1             2
20            2
171           1
4             1
5             1
19            1
99999         1
Name: required_age, dtype: int64

### content_descriptors

As we've seen above, 'content_descriptors' is a JSON object consisting of 'ids' and 'notes'. Sadly, it seems like 'ids' doesn't have any correlation with either 'notes' or 'required_age' and seems like some internal ID. So I've opted to only extract the 'notes'

**content_descriptors**: extract 'notes' dictionaries to the string, set to NaN if the string equals to 'none', 'na', etc.

#### [Subroutine] Content_descriptors: Cleaning

In [77]:
# unwrapping list of dictionaries into to the item
# return the NaN values on error
def extractDictItem(jsonDict, key):
    if jsonDict != jsonDict:
        return pd.NA
    else:
        try:
            evalList = eval(jsonDict)
            if(type(evalList) == dict):
                if not (pd.isna(evalList[key])):
                    item = evalList[key]
                return item
            else:
                return evalList
        except :
            return pd.NA

In [78]:
# extracting 'notes' dictionaries to the list, set empty or invalid ones to NaN
def cleanContentDesc(storefront):
    badstrings = ['none','None','na','Na','False','false',0,'','invalid','Invalid','\r\n']
    storefront['content_descriptors'] = storefront['content_descriptors'].apply(extractDictItem, key='notes')
    storefront['content_descriptors'].mask(storefront.content_descriptors.isin(badstrings), pd.NA, inplace=True )
    return storefront

In [79]:
storefront = cleanContentDesc(storefront)

In [80]:
storefront['content_descriptors'].value_counts(dropna=False)

NaN                                                                                                                                                                                                                                                           91019
This Game may contain content not appropriate for all ages, or may not be appropriate for viewing at work: Frequent Violence or Gore, Partial Nudity, Sexual Content                                                                                            464
Nakedness.\r\nAll characters appearing in this game are over 18 years of age.                                                                                                                                                                                   180
This Game may contain content not appropriate for all ages, or may not be appropriate for viewing at work: Frequent Violence or Gore, General Mature Content                                                                

### platforms

This is a dictionary based on the platform availability. I'll unwrap it into the list of supported platforms. Theoretically, it might be a good idea to get each platform into a separate column but it's always possible we'll see more platforms in the future (like a separate flag for Steam Deck, for example).

In [81]:
storefront['platforms'].value_counts(dropna=False)

{'windows': True, 'mac': False, 'linux': False}    74655
{'windows': True, 'mac': True, 'linux': False}     13889
{'windows': True, 'mac': True, 'linux': True}      13293
{'windows': True, 'mac': False, 'linux': True}      2488
{'windows': False, 'mac': True, 'linux': False}       12
{'windows': False, 'mac': False, 'linux': True}        6
{'windows': False, 'mac': True, 'linux': True}         1
Name: platforms, dtype: int64

#### [Subroutine] Platforms: Cleaning

In [82]:
# unwrapping list of dictionaries into to the item
def extractBoolDict(boolDict):
    if boolDict != boolDict:
        return pd.NA
    else:
        try:
            evalDict = eval(boolDict)
            if(type(evalDict) == dict):
                items = []
                for key in evalDict.keys():
                    if (evalDict[key] == True):
                        items.append(key)
                return items
            else:
                return pd.NA
        except:
            return pd.NA

In [83]:
storefront['platforms'] = storefront['platforms'].apply(extractBoolDict)
storefront['platforms'].fillna({i: [] for i in storefront.index},inplace = True)

In [84]:
storefront['platforms'].apply(tuple).value_counts(dropna=False)

(windows,)               74655
(windows, mac)           13889
(windows, mac, linux)    13293
(windows, linux)          2488
(mac,)                      12
(linux,)                     6
(mac, linux)                 1
Name: platforms, dtype: int64

### pc_requirements, mac_requirements, linux_requirements
These three columns contain information about the game system requirements. Two things to note:
* Not being available in 'platforms' doesn't mean the game doesn't have system requirements for that platform (maybe for Proton/Steam Deck?).
* The empty requirements are done as the empty lists.
* The contents seem to be the same that appear on the Steam page bu thet structure of requirements themselves is not very defined (apart from having minimum/recommended).

I assume the hardware requirements for windows and linux are similar and extracting data from windows should be enough.

We'll have to split dictianary and remove the tags to do some clean-up.

I'll extract the data for PC/Mac and move the requirements to the separate table for export (while keeping the raw data as well if needed).


In [85]:
# Here is a possible example of the above:
storefront[storefront.platforms.apply(lambda x: sorted(x) == sorted(['windows', 'linux']))]['mac_requirements'].value_counts()

[]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      

Let's do a quick check before transforming the data for the final dataset:

In [86]:
temp_df =  storefront[['pc_requirements', 'mac_requirements']].copy()
# removing  rows with empty requirements
temp_df = temp_df[(temp_df['pc_requirements'] != '[]') & (temp_df['mac_requirements'] != '[]')]
# processing pc requirement data
temp_df['pc_clean'] = (temp_df['pc_requirements']
                      .str.replace(r'\\[rtn]', '', regex=True)
                      .str.replace(r'<[pbr]{1,2}>', ' ', regex=True)
                      .str.replace(r'<[\/"=\w\s]+>', '', regex=True)
                      )
temp_df['pc_clean'] = temp_df['pc_clean'].apply(lambda x: ast.literal_eval(x))
# split out minimum and recommended into separate columns
temp_df['pc_minimum'] = temp_df['pc_clean'].apply(lambda x: x['minimum'].replace('Minimum:', '').strip() if 'minimum' in x.keys() else pd.NA)
temp_df['pc_recommended'] = temp_df['pc_clean'].apply(lambda x: x['recommended'].replace('Recommended:', '').strip() if 'recommended' in x.keys() else pd.NA)
temp_df = temp_df.drop('pc_clean', axis=1)
# processing mac requirement data
temp_df['mac_clean'] = (temp_df['mac_requirements']
                      .str.replace(r'\\[rtn]', '', regex=True)
                      .str.replace(r'<[pbr]{1,2}>', ' ', regex=True)
                      .str.replace(r'<[\/"=\w\s]+>', '', regex=True)
                      )
temp_df['mac_clean'] = temp_df['mac_clean'].apply(lambda x: ast.literal_eval(x))
temp_df['mac_minimum'] = temp_df['mac_clean'].apply(lambda x: x['minimum'].replace('Minimum:', '').strip() if 'minimum' in x.keys() else pd.NA)
temp_df['mac_recommended'] = temp_df['mac_clean'].apply(lambda x: x['recommended'].replace('Recommended:', '').strip() if 'recommended' in x.keys() else pd.NA)
temp_df = temp_df.drop('mac_clean', axis=1)
temp_df

Unnamed: 0_level_0,pc_requirements,mac_requirements,pc_minimum,pc_recommended,mac_minimum,mac_recommended
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
10,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"500 mhz processor, 96mb ram, 16mb video card, ...",,"OS X Snow Leopard 10.6.3, 1GB RAM, 4GB Hard D...",
20,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"500 mhz processor, 96mb ram, 16mb video card, ...",,"OS X Snow Leopard 10.6.3, 1GB RAM, 4GB Hard D...",
30,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"500 mhz processor, 96mb ram, 16mb video card, ...",,"OS X Snow Leopard 10.6.3, 1GB RAM, 4GB Hard D...",
40,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"500 mhz processor, 96mb ram, 16mb video card, ...",,"OS X Snow Leopard 10.6.3, 1GB RAM, 4GB Hard D...",
50,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"500 mhz processor, 96mb ram, 16mb video card, ...",,"OS X Snow Leopard 10.6.3, 1GB RAM, 4GB Hard D...",
...,...,...,...,...,...,...
201270,{'minimum': '<strong>Minimum:</strong><br>\t\t...,"{'minimum': '<ul class=""bb_ul""><li><strong>OS:...",OS: Windows 7 / Vista / XP Processor: 2 GHz In...,OS: Windows 7 / Vista / XP Processor: 2nd Gene...,OS: macOS 10.14.5 Processor: 1.1 GHZ Intel Cor...,
756950,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,"OS: Microsoft® Windows Server 2008, Windows 7,...",,"OS: Mac OS X v10.6, and above Processor: Intel...",
901399,{'minimum': '<ul>\n\t\t\t\t\t\t\t\t\t<li><stro...,{'minimum': '<ul>\n\t\t\t\t\t\t\t\t\t<li><stro...,OS: Windows XP / Vista / Windows 7Processor: 2...,,OS: Mac OS X 10.5 or laterProcessor: Intel Cor...,
901663,{'minimum': '<ul>\n\t\t\t\t\t<li><strong>Suppo...,{'minimum': '<ul>\n\t\t\t\t\t\t<li><strong>OS:...,Supported OS: Microsoft Windows&reg; XP or Vis...,,OS: OS X 10.5 or laterProcessor: Intel Core 2 ...,


Seems fine, so let's proceed with the cleaning:


#### [Subroutine] 'pc_requirements', 'mac_requirements', 'linux_requirements': Cleaning

In [87]:
# Cleaning up the hardware requirements, exporting to the separate table and removing columns from the storefront
def cleanRequirements(df, export=False):
    if export:
        requirements = df[['pc_requirements', 'mac_requirements', 'linux_requirements']].copy()
        
        # remove rows with missing requirements
        requirements = requirements[(requirements['pc_requirements'] != '[]') & (requirements['mac_requirements'] != '[]')]
        
        requirements['pc_clean'] = (requirements['pc_requirements']
                              .str.replace(r'\\[rtn]', '', regex=True)
                              .str.replace(r'<[pbr]{1,2}>', ' ', regex=True)
                              .str.replace(r'<[\/"=\w\s]+>', '', regex=True)
                              )
        requirements['pc_clean'] = requirements['pc_clean'].apply(lambda x: ast.literal_eval(x))
        # processing pc requirement data
        requirements['pc_minimum'] = requirements['pc_clean'].apply(
            lambda x: x['minimum'].replace('Minimum:', '').strip() if 'minimum' in x.keys() else np.nan)
        requirements['pc_recommended'] = requirements['pc_clean'].apply(
            lambda x: x['recommended'].replace('Recommended:', '').strip() if 'recommended' in x.keys() else np.nan)
        requirements = requirements.drop('pc_clean', axis=1)
        # processing mac requirement data
        requirements['mac_clean'] = (requirements['mac_requirements']
                              .str.replace(r'\\[rtn]', '', regex=True)
                              .str.replace(r'<[pbr]{1,2}>', ' ', regex=True)
                              .str.replace(r'<[\/"=\w\s]+>', '', regex=True)
                              )
        requirements['mac_clean'] = requirements['mac_clean'].apply(lambda x: ast.literal_eval(x))
        requirements['mac_minimum'] = requirements['mac_clean'].apply(
            lambda x: x['minimum'].replace('Minimum:', '').strip() if 'minimum' in x.keys() else np.nan)
        requirements['mac_recommended'] = requirements['mac_clean'].apply(
            lambda x: x['recommended'].replace('Recommended:', '').strip() if 'recommended' in x.keys() else np.nan)
        requirements = requirements.drop('mac_clean', axis=1)
        
        export_data(requirements, 'steam_requirements_data', True)
        
    df = df.drop(['pc_requirements', 'mac_requirements', 'linux_requirements'], axis=1)  
    return df

In [88]:
storefront = cleanRequirements(storefront, True)

Exported steam_requirements_data to "../data/export/steam_requirements_data.csv"


In [89]:
# verifying hardware reqs export
pd.read_csv('../data/export/steam_requirements_data.csv').head()

Unnamed: 0,appid,pc_requirements,mac_requirements,linux_requirements,pc_minimum,pc_recommended,mac_minimum,mac_recommended
0,10,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...","500 mhz processor, 96mb ram, 16mb video card, ...",,"OS X Snow Leopard 10.6.3, 1GB RAM, 4GB Hard D...",
1,20,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...","500 mhz processor, 96mb ram, 16mb video card, ...",,"OS X Snow Leopard 10.6.3, 1GB RAM, 4GB Hard D...",
2,30,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...","500 mhz processor, 96mb ram, 16mb video card, ...",,"OS X Snow Leopard 10.6.3, 1GB RAM, 4GB Hard D...",
3,40,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...","500 mhz processor, 96mb ram, 16mb video card, ...",,"OS X Snow Leopard 10.6.3, 1GB RAM, 4GB Hard D...",
4,50,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...","500 mhz processor, 96mb ram, 16mb video card, ...",,"OS X Snow Leopard 10.6.3, 1GB RAM, 4GB Hard D...",


### 'detailed_description', 'about_the_game', 'short_description'

These three columns contain descriptive texts about the applications. They can be useful for the sentiment/recommendation analysis but they are quite 'heavy' and might be  redundant for statistical analysis hence I'll move them to the separate table as well

In [90]:
storefront[['detailed_description', 'about_the_game', 'short_description']].isnull().sum()

detailed_description    156
about_the_game          157
short_description       161
dtype: int64

Quite a few have null values. For the exported table, I'll exclude the rows where all three descriptions are empty.

#### [Subroutine] 'detailed_description', 'about_the_game', 'short_description': Cleaning

In [91]:
def cleanDescriptions(df, export=False):
    '''
    Cleaning descriptions. Empty descriptions are not included into the exported table.
    
    Export descriptions to external csv file then remove these columns.
    '''
    # remove rows with missing description data
    temp_df = df.dropna(subset=['detailed_description', 'about_the_game', 'short_description'], how='all').copy()  
    
    # by default we don't export, useful if calling function later
    if export:
        # create dataframe of description columns
        description_data = temp_df[['detailed_description', 'about_the_game', 'short_description']]
        
        export_data(description_data, filename='steam_description_data', index=True)
    
    # drop description columns from main dataframe
    df = df.drop(['detailed_description', 'about_the_game', 'short_description'], axis=1)
    
    return df

In [92]:
storefront = cleanDescriptions(storefront, export=True)

Exported steam_description_data to "../data/export/steam_description_data.csv"


In [93]:
# Verifying exported data
pd.read_csv('../data/export/steam_description_data.csv').head()

Unnamed: 0,appid,detailed_description,about_the_game,short_description
0,10,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...
1,20,One of the most popular online action games of...,One of the most popular online action games of...,One of the most popular online action games of...
2,30,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...
3,40,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...
4,50,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...


### 'header_image', 'screenshots', 'background', 'movies'

These four columns contain links to the various media data about the app: header image and the background (as it appears on the Steam page), screenshots and trailers.

I don't think they are very useful for analysis but still might be helpful for extracting data for dashboards or getting game logos.
I'll keep them in the separate table as well.

In [94]:
image_cols = ['header_image', 'screenshots', 'background', 'movies']

for col in image_cols:
    print(col+':', storefront[col].isnull().sum())

storefront[image_cols].sample(10)

header_image: 0
screenshots: 176
background: 157
movies: 34892


Unnamed: 0_level_0,header_image,screenshots,background,movies
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1598220,https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 256829022, 'name': 'Iris Trailer', 'th..."
959850,https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 256742481, 'name': 'American Patriots:..."
446000,https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 256661412, 'name': 'Acquire me! teaser..."
353760,https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",https://cdn.akamai.steamstatic.com/steam/apps/...,
1386190,https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 256794991, 'name': 'Fishhead: Blueprin..."
866100,https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 256721111, 'name': 'AlphaTrailer', 'th..."
1156630,https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 256763952, 'name': ""A NIGHTMARE'S TRIP..."
218680,https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 2028118, 'name': 'Scribblenauts Unlimi..."
726870,https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 256702056, 'name': 'Anchorhead trailer..."
1912560,https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 256876098, 'name': 'Super Gosu Cars Tr..."


All apps seem to have headers but some are missing screenshots/backgrounds and a lot of them - trailers (which is understandable). As for the strucure, it seems like background and header_image have simple links while screenshots and movies are a bit more complicated. I'll keep them as is.

#### [Subroutine] 'header_image', 'screenshots', 'background', 'movies': Cleaning

In [95]:
def cleanMedia(df, export=False):
    '''Remove media columns from dataframe, optionally exporting them to csv first.'''
    df = df.copy()
    
    if export:
        media_df = df[df['screenshots'].notnull()].copy()
        media_data = media_df[['header_image', 'screenshots', 'background', 'movies']]
        
        export_data(media_data, 'steam_media_data', index=True)
        
    df = df.drop(['header_image', 'screenshots', 'background', 'movies'], axis=1)
    
    return df

In [96]:
storefront = cleanMedia(storefront, export=True)

Exported steam_media_data to "../data/export/steam_media_data.csv"


In [97]:
# Verifying exported data
pd.read_csv('../data/export/steam_media_data.csv').head()

Unnamed: 0,appid,header_image,screenshots,background,movies
0,10,https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",https://cdn.akamai.steamstatic.com/steam/apps/...,
1,20,https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",https://cdn.akamai.steamstatic.com/steam/apps/...,
2,30,https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",https://cdn.akamai.steamstatic.com/steam/apps/...,
3,40,https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",https://cdn.akamai.steamstatic.com/steam/apps/...,
4,50,https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",https://cdn.akamai.steamstatic.com/steam/apps/...,


### 'website', 'support_info'

These two columns contain information about the games's website, support web page and email:

In [98]:
print('website nulls count:', storefront['website'].isnull().sum())
print('support_info nulls count:', storefront['support_info'].isnull().sum())

with pd.option_context('display.max_colwidth', 100): # ensures strings not cut short
    display(storefront[['name', 'website', 'support_info']].sample(10))

website nulls count: 43208
support_info nulls count: 0


Unnamed: 0_level_0,name,website,support_info
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1249000,"Mýrdalssandur, Iceland VR Compatibility DLC",https://cavesrd.itch.io/iceland,"{'url': 'https://www.cavesrd.com', 'email': ''}"
937180,Impixable,,"{'url': '', 'email': 'Support@FlamingBatStudios.com'}"
1164230,Visual Novel Maker - Alec Shea's JRPG Music Pack,http://visualnovelmaker.com/,"{'url': '', 'email': 'games@degica.com'}"
1822080,萌宠方块派对 Yippy cube,,"{'url': '', 'email': 'yippycube@gmail.com'}"
1255260,Hunting Simulator 2 Bear Hunter Pack,,"{'url': '', 'email': 'support@nacon.fr'}"
304020,Euro Truck Simulator 2 - Polish Paint Jobs Pack,http://eurotrucksimulator2.com/,"{'url': 'http://eurotrucksimulator2.com/support.php', 'email': 'support@eurotrucksimulator2.com'}"
1133400,Fantasy Grounds - A23: Twin Crossings (5E),http://www.fantasygrounds.com,"{'url': '', 'email': 'support@fantasygrounds.com'}"
656740,Cosmonator,,"{'url': 'http://www.bourbontank.com/', 'email': 'support@bourbontank.com'}"
286220,Pier Solar and the Great Architects,http://www.piersolar.com,"{'url': 'http://www.magicalgamefactory.com/en/faq/contact-support_5/', 'email': 'support@waterme..."
1615430,Fantasy Grounds - D&D Classics - PHBR9 The Complete Book of Gnomes & Halflings (2E),https://www.fantasygrounds.com,"{'url': 'www.fantasygrounds.com', 'email': 'support@fantasygrounds.com'}"


I'll split these two columns into three (website, support_url and support_email) and move to the separate table. As we can see, the empty website field is NaN while empty  url/emails are just ''. It will appear as NaN after export-import to csv.

It might also be a good idea to check if all three fields are NaN before exporting the data table to avoid having unnecesary data.

#### [Subroutine] 'website', 'support_info': Cleaning

In [99]:
def cleanSupport(df, export=False):
    '''Drop support information from dataframe, optionally exporting beforehand.'''
    if export:
        support_info = df[['website', 'support_info']].copy()
        
        support_info['support_info'] = support_info['support_info'].apply(lambda x: ast.literal_eval(x))
        support_info['support_url'] = support_info['support_info'].apply(lambda x: x['url'] if (x['url']!='') else pd.NA)
        support_info['support_email'] = support_info['support_info'].apply(lambda x: x['email']  if (x['email']!='') else pd.NA)
        
        support_info = support_info.drop('support_info', axis=1)
        
        # only keep rows with at least one piece of information
        support_info = support_info[(support_info['website'].notnull()) | (support_info['support_url'].notnull()) | (support_info['support_email'].notnull())]

        export_data(support_info, 'steam_support_info', index=True)
    
    df = df.drop(['website', 'support_info'], axis=1)
    
    return df

In [100]:
storefront = cleanSupport(storefront, export=True)

Exported steam_support_info to "../data/export/steam_support_info.csv"


In [101]:
# Verifying exported data
pd.read_csv('../data/export/steam_support_info.csv').sample(15)

Unnamed: 0,appid,website,support_url,support_email
41734,941952,http://www.capcom.co.jp/devil5/,http://www.capcom.co.jp/support/contact/,
21315,578970,http://newworldrevolution.bandcamp.com/,,support@rocketbirds.com
84378,1669210,,,contact.magicpixel2012@gmail.com
80395,1594760,,,astrosailorgame@gmail.com
41378,936070,https://www.binaryboxstudios.com/apartment-327-en,https://www.binaryboxstudios.com/,contacto@binaryboxstudios.com
38478,885790,http://www.capplay.com/,http://capplay.com,service@capplay.com
43812,977490,,https://www.bottlecube.co.jp,steam_support@bottlecube.co.jp
44298,985680,http://www.chuangguansha.com/,http://www.chuangguansha.com/,27437676@qq.com
66386,1352280,,,ivlenov@gmail.com
34844,820400,,,pugwarestudio@gmail.com


### supported_languages

This is a supported languages field and it's a bit complicated. This is a string listing languages supported by the game but the audio support is marked with `<strong>*</strong>` so we'll have to parse the strings if we want to get both audio and text support.

I'll split this column into two - supported_languages and audio_languages. The languages will be kept as a list.


In [102]:
print('supported_languages nulls count:', storefront['supported_languages'].isnull().sum())

storefront['supported_languages'].value_counts().head(15)

supported_languages nulls count: 164


English                                                                                                                                                                                                                           27376
English<strong>*</strong><br><strong>*</strong>languages with full audio support                                                                                                                                                  23182
English, Russian                                                                                                                                                                                                                   1948
English<strong>*</strong>, German<strong>*</strong>, French<strong>*</strong>, Italian<strong>*</strong>, Spanish - Spain<strong>*</strong>, Japanese<strong>*</strong><br><strong>*</strong>languages with full audio support     1559
English, Japanese                                                       

As we can see, there are some nulls in this column and also languages are neither sorted alphabetically nor grouped up by audio support so I'll do sorting as well.

Let's test things first:

In [103]:
def audioParse(string):
    '''
    Parsing audio part of the language string into the separate column
    '''
    if string != string:
        return pd.NA
    try:
        # This regex is not too complicated: just matching the text groups ending with <strong>*</strong>
        pattern = '(?:([A-Za-z -]+)(?:<strong>\*<\/strong>)(?:, )*)'
        items = re.findall(pattern, string)
        # Replacing empty lists with NaN. For the group operations, keeping empty lists would actually 
        # be better but they will be transformed to NaN on export anyways.
        if len(items) == 0:
            return pd.NA
        return sorted(items)
    except:
        return pd.NA

temp_df = storefront[['supported_languages']].copy()
# parsing for audio support
temp_df['audio_languages'] = temp_df['supported_languages'].apply(audioParse)
# removing tags and unnecessary endings and splitting the string into the text support list
temp_df['text_languages'] = (temp_df['supported_languages']
                             .str.replace(r'<br><strong>\*<\/strong>languages with full audio support','',regex=True)
                             .str.replace(r'<strong>\*</strong>','',regex=True)
                            ).str.split(', ').apply(lambda x: sorted(x) if type(x) is list else pd.NA)
temp_df

Unnamed: 0_level_0,supported_languages,audio_languages,text_languages
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
10,"English<strong>*</strong>, French<strong>*</st...","[English, French, German, Italian, Korean, Sim...","[English, French, German, Italian, Korean, Sim..."
20,"English, French, German, Italian, Spanish - Sp...",,"[English, French, German, Italian, Korean, Rus..."
30,"English, French, German, Italian, Spanish - Spain",,"[English, French, German, Italian, Spanish - S..."
40,"English, French, German, Italian, Spanish - Sp...",,"[English, French, German, Italian, Korean, Rus..."
50,"English, French, German, Korean",,"[English, French, German, Korean]"
...,...,...,...
1351710,English<strong>*</strong><br><strong>*</strong...,[English],[English]
1357200,English,,[English]
1444140,English,,[English]
1863540,English<strong>*</strong><br><strong>*</strong...,[English],[English]


In [104]:
temp_df['text_languages'].apply(lambda x: tuple(x) if type(x) is list else pd.NA).value_counts(dropna = False)

(English,)                                                                                                                                                                          50558
(English, Russian)                                                                                                                                                                   2928
(English, Japanese)                                                                                                                                                                  2200
(English, Simplified Chinese)                                                                                                                                                        1910
(Simplified Chinese,)                                                                                                                                                                1842
                                                                      

In [105]:
temp_df['audio_languages'].apply(lambda x: tuple(x) if type(x) is list else pd.NA).value_counts(dropna = False)

NaN                                                                                                                                                 53994
(English,)                                                                                                                                          31954
( Japanese,)                                                                                                                                         2210
(English, French, German, Italian, Japanese, Spanish - Spain)                                                                                        1752
( Japanese, English)                                                                                                                                 1528
                                                                                                                                                    ...  
(English, Italian, Russian, Simplified Chinese, Spanish - Spain)            

Everything seems fine, so let's make the transform function:

#### [Subroutine] 'supported_languages': Cleaning

In [106]:
def cleanLanguages(df):
    '''Clean and split supported_languages into two columns: supported_languages and supported_audio'''
    
    #parsing audio in the separate function
    def audioParse(string):
        if string != string:
            return pd.NA
        try:
            # This regex is not too complicated: just matching the text groups ending with <strong>*</strong>
            pattern = '(?:([A-Za-z -]+)(?:<strong>\*<\/strong>)(?:, )*)'
            items = re.findall(pattern, string)
            # Replacing empty lists with NaN. For the group operations, keeping empty lists would actually 
            # be better but they will be transformed to NaN on export anyways.
            if len(items) == 0:
                return pd.NA
            return sorted(items)
        except:
            return pd.NA    

    # parsing for audio support
    df['supported_audio'] = df['supported_languages'].apply(audioParse)
    # removing tags and unnecessary endings and splitting the string into the text support list
    df['supported_languages'] = (df['supported_languages']
                                 .str.replace(r'<br><strong>\*<\/strong>languages with full audio support','',regex=True)
                                 .str.replace(r'<strong>\*</strong>','',regex=True)
                                ).str.split(', ').apply(lambda x: sorted(x) if type(x) is list else pd.NA)
    return df

In [107]:
storefront = cleanLanguages(storefront)

In [108]:
storefront[['name','supported_audio','supported_languages']].sample(15)

Unnamed: 0_level_0,name,supported_audio,supported_languages
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1172140,Monster Battles - Portals,[English],"[English, Japanese, Russian, Simplified Chines..."
936430,FlyWings 2018 - Military Helicopters,[English],[English]
1318610,The Sewers,[English],[English]
500770,Quadrant - Official Soundtrack,[English],[English]
25762,Madballs BDI Evolution Skin Rollup,,"[English, French, German, Italian, Korean, Por..."
1252280,18+ Madlab for 3D Visual Novel Maker,,"[Arabic, Bulgarian, Czech, Danish, Dutch, Engl..."
490310,Anarcute - Official Soundtrack,,"[English, French, German, Italian, Japanese, K..."
202730,Dynamite Jack,[English],[English]
1184660,"AUDICA - Imagine Dragons - ""Believer""",[English],[English]
1146700,Battle Sorcerers,,"[English, French, German, Japanese, Simplified..."


### release_date

In [109]:
storefront['release_date'].value_counts()

{'coming_soon': True, 'date': '2022'}                          1185
{'coming_soon': True, 'date': 'TBA'}                            976
{'coming_soon': True, 'date': 'Coming Soon'}                    787
{'coming_soon': True, 'date': ''}                               461
{'coming_soon': False, 'date': '30 May, 2022'}                  346
                                                               ... 
{'coming_soon': True, 'date': '10/07/2022'}                       1
{'coming_soon': True, 'date': 'Coming Soon. Wishlist Now.'}       1
{'coming_soon': False, 'date': 'Mar 2010'}                        1
{'coming_soon': False, 'date': '29 Mar, 2011'}                    1
{'coming_soon': False, 'date': '16 May, 2008'}                    1
Name: release_date, Length: 7093, dtype: int64

In [110]:
storefront['release_date'].sample(n=10)

appid
1153310    {'coming_soon': False, 'date': '28 Oct, 2021'}
753896     {'coming_soon': False, 'date': '30 Oct, 2018'}
1536760    {'coming_soon': False, 'date': '18 Mar, 2021'}
1693430    {'coming_soon': False, 'date': '21 Dec, 2021'}
952080     {'coming_soon': False, 'date': '29 Nov, 2018'}
1545000     {'coming_soon': False, 'date': '3 Mar, 2021'}
686510     {'coming_soon': False, 'date': '21 Aug, 2017'}
569430      {'coming_soon': False, 'date': '9 Feb, 2017'}
1473490         {'coming_soon': True, 'date': 'Dec 2021'}
436636      {'coming_soon': False, 'date': '7 Jun, 2016'}
Name: release_date, dtype: object

There are two different fields stored in this dict - Boolean on whether the game is released or not (coming_soon) and the release date. 

For the upcoming game date format seems to be free string (some strings are not even in English).

For the released games - it seems to be a standard '%d %b, %Y'

**Note:** There are some (less than 10 at the time of writing this) games that have coming_soon set flag to False while their release_date is set long after the data collection. I'll set coming_soon to True in that case.

I'll convert the datetime for the released games to datetime and add the column for coming_soon games. The field for the incorrect dates is set to the NaN

#### [Subroutine] 'release_date': Cleaning

In [111]:
def cleanReleaseDate(df):
    '''
    Cleaning release date, separating coming soon and the date itself
    '''
    df = df.copy()

    #getting values for comming_soon column
    def getComingSoon(value):
        if extractDictItem(value,'coming_soon') == True:
            return True
        return False
    
    #parsing dates
    def processReleaseDateValues(value):
        thisDate = extractDictItem(value, 'date')
        try:
            return pd.to_datetime(thisDate, errors='raise')
        except:
            return pd.NA  
        
  
    df['coming_soon'] = df['release_date'].apply(getComingSoon)
    df['release_date'] = df['release_date'].apply(processReleaseDateValues)
    df.loc[df['release_date'] > df_collection_date,['coming_soon']]= True
    return df

In [112]:
temp_df = cleanReleaseDate(storefront)

In [113]:
temp_df.coming_soon.value_counts()

False    92313
True     12031
Name: coming_soon, dtype: int64

In [114]:
temp_df['release_date'].sample(15)

appid
418950     2016-03-24 00:00:00
1611540    2021-05-11 00:00:00
975770     2018-11-20 00:00:00
659040     2017-07-20 00:00:00
266430     2014-09-03 00:00:00
427520     2020-08-14 00:00:00
939840     2018-09-27 00:00:00
1055067    2019-04-16 00:00:00
1749710    2021-10-01 00:00:00
258070     2014-06-12 00:00:00
1562230    2022-03-02 00:00:00
543440     2018-01-29 00:00:00
1732750    2022-06-21 00:00:00
868422     2018-12-04 00:00:00
1552250    2021-03-01 00:00:00
Name: release_date, dtype: object

In [115]:
temp_df[(temp_df['release_date']>df_collection_date) & (temp_df['coming_soon'] == False)]['release_date']

Series([], Name: release_date, dtype: object)

Everything seems fine. Processing:

In [116]:
storefront = cleanReleaseDate(storefront)

### Processing price

There are multiple columns that are related to price:
* price_overview 
* is_free 
* packages 
* package_groups 

price_overview and is_free are obvious, as for packages and package_groups, as you'll see later, the app might be sold just as a part of package and not sold separately.

Let's start with taking a peek at price_overview:

In [117]:
print('price_overview nulls count:', storefront['price_overview'].isnull().sum())
with pd.option_context('display.max_colwidth', 150):
    display(storefront[['name', 'is_free', 'price_overview']].sample(15))

price_overview nulls count: 22995


Unnamed: 0_level_0,name,is_free,price_overview
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1089162,Rocksmith® 2014 Edition – Remastered – Trivium - “Dying in Your Arms”,False,"{'currency': 'EUR', 'initial': 299, 'final': 299, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': '2,99€'}"
29600,Guild Wars® Game of the Year Edition,False,"{'currency': 'EUR', 'initial': 1999, 'final': 1999, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': '19,99€'}"
208460,Tryst,False,"{'currency': 'EUR', 'initial': 1399, 'final': 1399, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': '13,99€'}"
1587620,Sniper Ghost Warrior Contracts 2 - Fire & Smoke Skin Pack,False,"{'currency': 'EUR', 'initial': 399, 'final': 399, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': '3,99€'}"
1857090,Norland,False,
956893,Guan Suo - Officer Ticket / 関索使用券,False,"{'currency': 'EUR', 'initial': 199, 'final': 199, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': '1,99€'}"
23500,Puzzle Quest: Galactrix,False,"{'currency': 'EUR', 'initial': 819, 'final': 819, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': '8,19€'}"
1057990,Viking Trickshot,True,
1151090,Easter Egg,False,"{'currency': 'EUR', 'initial': 1079, 'final': 1079, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': '10,79€'}"
1271930,Zango's Shark Adventure,False,"{'currency': 'EUR', 'initial': 399, 'final': 399, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': '3,99€'}"


Things to note:

* There are a quite a lot of nulls in price_overview
* price_overview being null doesn't always correlate with is_free being True (although we'll check how often that happens next)
* price_overview's currency is Euro (which is understandable as the dataset was download from the location in Europe. But we'll check if there are any inconsistencies here)
* There are both prices both with the current discount and without it. Considering how often Steam does sales on different products, I'll only leave the price without the discount.

First, let's take a closer look at is_free and price_overview:

In [118]:
print('is_free = True and price_overview == nulls count:',
      storefront[storefront['is_free'] == True]['price_overview'].isnull().sum())
print('is_free = True and price_overview != nulls count:',
      storefront[storefront['is_free'] == True]['price_overview'].notnull().sum())
print('is_free = False and price_overview == nulls count:',
      storefront[storefront['is_free'] == False]['price_overview'].isnull().sum())
print('Filtered out non-released apps from the above:',
      storefront[(storefront['is_free'] == False) & (storefront['coming_soon'] == False)]['price_overview'].isnull().sum())
print()
with pd.option_context('display.max_colwidth', 150):
    display(storefront[(storefront['is_free'] == True) & (storefront['price_overview'].notnull())][['name','price_overview']])

is_free = True and price_overview == nulls count: 10776
is_free = True and price_overview != nulls count: 21
is_free = False and price_overview == nulls count: 12219
Filtered out non-released apps from the above: 1339



Unnamed: 0_level_0,name,price_overview
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
8650,RACE 07: Andy Priaulx Crowne Plaza Raceway (Free DLC),"{'currency': 'EUR', 'initial': 2995, 'final': 2995, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': 'Free'}"
70615,Worms Ultimate Mayhem - Single Player Pack DLC,"{'currency': 'EUR', 'initial': 1699, 'final': 1699, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': 'Free'}"
215373,Omerta - City of Gangsters - The Bulgarian Colossus DLC,"{'currency': 'EUR', 'initial': 2499, 'final': 2499, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': 'Free'}"
219136,Painkiller Hell & Damnation: Satan Claus DLC,"{'currency': 'EUR', 'initial': 6999, 'final': 6999, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': 'Free'}"
222680,Dungeon Defenders Anniversary Pack,"{'currency': 'EUR', 'initial': 159, 'final': 159, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': 'Free'}"
229080,DmC Devil May Cry: Bloody Palace Mode,"{'currency': 'EUR', 'initial': 3999, 'final': 999, 'discount_percent': 75, 'initial_formatted': '39,99€', 'final_formatted': 'Free'}"
236080,Resident Evil 6 Wallpaper,"{'currency': 'EUR', 'initial': 4499, 'final': 4499, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': 'Free'}"
247307,Saints Row IV - Reverse Cosplay Pack,"{'currency': 'EUR', 'initial': 499, 'final': 499, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': 'Free'}"
250810,LOST PLANET® 3 - Hi Res Movies,"{'currency': 'EUR', 'initial': 3999, 'final': 799, 'discount_percent': 80, 'initial_formatted': '39,99€', 'final_formatted': 'Free'}"
255050,Saints Row IV - Thank You Pack,"{'currency': 'EUR', 'initial': 499, 'final': 499, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': 'Free'}"


There are some apps that were not free on the start but are now listed as 'Free'. These are either DLCs that became free or the first episodes/demoversions of the games. We can safely set their price to 0 so I'll set the price for all free games to 0 and can remove is_free column as redundant

Another thing to notice - there are a lot of apps with the incorrect price. 
 
There may be multiple reasons for that:
* Free (we've checked these)
* Not released yet (fortunately, we've already parsed the date and filtered these above)
* Being superseded by the different app (Like Bioshock App ID 7670, for example)
* Not being sold anymore (RollerCoaster Tycoon® 3: Platinum, App ID 2700)
* A demo that has been marked incorrectly (App ID 1883370)
* A part of some bundle and not sold separately

Now, let's transform price_overview to the more understandable format, dealing with the free free apps (The incorrect price is set to null for now):

In [119]:
def parse_price(x):
    '''
    Parsing price column
    '''
    try:
        if x != x:
            return {'currency': 'EUR', 'initial': np.nan}
        else:
            return ast.literal_eval(x)
    except:
        print(x)

price_df = storefront[['name','coming_soon','type','packages', 'package_groups','is_free','price_overview']].copy()
# Evaluate as dictionary and set to NaN if missing
price_df['price_overview'] = price_df['price_overview'].apply(parse_price)
# Set currencies
price_df['currency'] = price_df['price_overview'].apply(lambda x: x['currency'])
# Get prices
price_df['price'] = price_df['price_overview'].apply(lambda x: x['initial']/100 if x['initial'] > 0 else x['initial'])
# set price of free games to 0
price_df.loc[price_df['is_free'], 'price'] = 0
print('Number of prices with negative values:', price_df[price_df['price']<0].shape[0])
print('Number of prices with incorrect values:', price_df[price_df['price'].isnull()].shape[0])
price_df['currency'].value_counts()

Number of prices with negative values: 0
Number of prices with incorrect values: 12219


EUR    104338
USD         6
Name: currency, dtype: int64

It looks like some Steam for some reason doesn't return the price in Euros for some games. Going to convert them using the conversion course at the time of data collection: ***1 USD to 0.95 EU as for 2022-06-27***

I'll use the price_df dataframe created earlier for the checks. Let's start by filtering out the non-released games:

In [120]:
price_df[(price_df['is_free'] == False) 
         & (price_df['coming_soon'] == False) 
         & (price_df['packages'].isnull())
         & (price_df['price'].isnull())][['name','type','packages','package_groups','price']]

Unnamed: 0_level_0,name,type,packages,package_groups,price
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
340,Half-Life 2: Lost Coast,game,,[],
2570,Vigil: Blood Bitterness™,game,,[],
2700,RollerCoaster Tycoon® 3: Platinum,game,,[],
3400,Hammer Heads Deluxe,game,,[],
3490,Venice Deluxe,game,,[],
...,...,...,...,...,...
2025270,Test Game 10,game,,[],
63970,IL-2 Sturmovik: Cliffs of Dover,game,,[],
901583,Grand Theft Auto IV: Complete Edition,game,,[],
901638,Dungeon Siege III,game,,[],


And let's check apps that are part of some package:

In [121]:
price_df[(price_df['is_free'] == False) 
         & (price_df['coming_soon'] == False) 
         & (price_df['packages'].notnull())
         & (price_df['price'].isnull())][['name','type','packages','package_groups','price']]

Unnamed: 0_level_0,name,type,packages,package_groups,price
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2420,The Ship: Single Player,game,[56669],"[{'name': 'default', 'title': 'Buy The Ship: S...",
7670,BioShock™,game,"[451, 127633]","[{'name': 'default', 'title': 'Buy BioShock™',...",
8850,BioShock® 2,game,"[81419, 127633]","[{'name': 'default', 'title': 'Buy BioShock® 2...",
20500,Red Faction Guerrilla Steam Edition,game,[189796],"[{'name': 'default', 'title': 'Buy Red Faction...",
31220,Sam & Max 301: The Penal Zone,game,"[109585, 4172]","[{'name': 'default', 'title': 'Buy Sam & Max 3...",
...,...,...,...,...,...
1937590,DEMON GAZE EXTRA - Tons of Fun! Perfect Gem Set,dlc,[698017],[],
1968101,Warframe: Angels of the Zariman Chrysalith Pack,dlc,[710699],[],
1992730,Trail of Ayash: Prologue Demo,game,[731519],[],
2014600,Card Shark Digital Artbook,dlc,[727367],[],


Sadly, it doesn't seem like there is much we can do to clean it up further. Now, do we remove these rows with the null price or not? 

The number of games and dlcs is quite significant and it might be interesting for the people checking the non-available games. ***I'll leave it as null at this stage and decide whether to remove it when doing the analysis***.

So, the **price is**:

* set to 0 for free games
* set as null for the incorrect/unavailable price
* converted EUR if it was in USD, using the conversion rate at the time of gathering
* left 

Now, let's make a cleaning function:

#### [Subroutine] 'is_free', 'price_overview': Cleaning

In [122]:
def cleanPrice(df):
    '''
    Cleaning price column, checking for currencies and free games
    '''
    df = df.copy()

    #parsing the price_overview, filling in the incorrect values and nulls for the further processing
    def parse_price(x):
        try:
            if x != x:
                return {'currency': 'EUR', 'initial': pd.NA}
            else:
                return ast.literal_eval(x)
        except:
            return {'currency': 'EUR', 'initial': pd.NA}
    
    # Evaluate as dictionary and set to null if missing
    df['price_overview'] = df['price_overview'].apply(parse_price)
    # Set currencies
    df['currency'] = df['price_overview'].apply(lambda x: x['currency'])
    # Get prices and change it to be shown in the proper dimansion
    df['price'] = df['price_overview'].apply(lambda x: x['initial']/100)
    # set price of free games to 0
    df.loc[df['is_free'], 'price'] = 0
    # convert the price from USD to EU
    df.loc[df['currency'] == 'USD', 'price'].apply(lambda x: x*usd_eu_rate if x > 0 else x)
    
    df = df.drop(['is_free','price_overview', 'currency'], axis=1)
    
    return df

In [123]:
storefront = cleanPrice(storefront)

### 'packages'

We've already seen some use of this column when we were processing 'price_overview' but let's take a close look at it now.

'package' represents the list of package IDs the application is a member of. It can be usefull when tracking DLCs, for example.

In [124]:
print('packages nulls count:', storefront['packages'].isnull().sum())
print('packages - after filtering out possible null causes:',
      storefront[(storefront['packages'].isnull()) 
                 & (storefront['coming_soon'] == False) 
                 & (storefront['price'] != 0)
                 & (storefront['price'].notnull())
                ].shape[0])
print('packages empty lists count:', storefront[~storefront['packages'].apply(lambda x: True if x!=x else bool(ast.literal_eval(x)) )].shape[0])
with pd.option_context('display.max_colwidth', 250):
    display(storefront[['name','packages']].sample(10))

packages nulls count: 21922
packages - after filtering out possible null causes: 0
packages empty lists count: 0


Unnamed: 0_level_0,name,packages
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
1840890,My Friend Peppa Pig: Pirate Adventures,[662332]
1746370,DUMB: Treasure,
1261000,Spoon Simulator 2020,
1525900,Infinite Dungeon Crawler,
1577460,Gamble Tower,
1847950,Fencing Girl,[664958]
1789100,Roman Wars: Deck Building Game,[642551]
208366,Train Simulator: Class 111 DMU Add-On,[18884]
522590,Near Midnight,[124109]
657424,Pinball FX3 - Sci-Fi Pack,[185631]


As we can see, there are some nulls in this columns but all of them are caused by:
* Not being released yet
* Being Free
* Having incorrect price

We've reviewed the prices earlier so I don't think there is any need in removing the rows with the null package value. I'll leave the column as is

#### [Subroutine] 'packages': Cleaning

**Reserved in case we'll do cleaning in the future**. For now stays as is

### 'package_groups'

We've already seen some use of the column earlier when we were processing 'price_overview'.

* 'package_groups' is a list of purchase options (apps might be either be purchased right away or through the subscription usage and this)
* sadly, there is no information on bundles available through the store API (to my knowledge, SteamDB is web scraping Steam pages to get that data)

Let's take a look at this column:

In [125]:
print('package_groups nulls count:', storefront['package_groups'].isnull().sum())
print('package_groups empty list count:', storefront[~storefront['package_groups'].apply(lambda x: bool(ast.literal_eval(x)))].shape[0])
print('package_groups lists with multiple items count:', storefront[storefront['package_groups'].apply(lambda x: len(ast.literal_eval(x))) > 1].shape[0])
with pd.option_context('display.max_colwidth', 500):
    display(storefront[['name','package_groups']].sample(10))

package_groups nulls count: 0
package_groups empty list count: 22441
package_groups lists with multiple items count: 652


Unnamed: 0_level_0,name,package_groups
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
1065960,狼人杀单机版,"[{'name': 'default', 'title': 'Buy 狼人杀单机版', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 356746, 'percent_savings_text': ' ', 'percent_savings': 0, 'option_text': '狼人杀单机版 - 1,59€', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 159}]}]"
844970,Cowboy : Attack of Wild Animal,"[{'name': 'default', 'title': 'Buy Cowboy : Attack of Wild Animal', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 264161, 'percent_savings_text': ' ', 'percent_savings': 0, 'option_text': 'Cowboy : Attack of Wild Animal - 2,39€', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 239}]}]"
107200,Space Pirates and Zombies,"[{'name': 'default', 'title': 'Buy Space Pirates and Zombies', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 11144, 'percent_savings_text': ' ', 'percent_savings': 0, 'option_text': 'Space Pirates and Zombies(SPAZ) - 9,99€', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 999}]}]"
1032080,Boxed Out,"[{'name': 'default', 'title': 'Buy Boxed Out', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 343342, 'percent_savings_text': ' ', 'percent_savings': 0, 'option_text': 'Boxed Out - 0,79€', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 79}]}]"
1217820,Topfold,"[{'name': 'default', 'title': 'Buy Topfold', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 421596, 'percent_savings_text': ' ', 'percent_savings': 0, 'option_text': 'Topfold - 0,79€', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 79}]}]"
860330,Mutation Mayhem,[]
410303,RTK13 - Hyakuman nin no Sangokushi Item 4 『100万人の三國志』連携特典４ シナリオ「英雄集結」,[]
1838900,Apocalypse: Floodgates,"[{'name': 'default', 'title': 'Buy Apocalypse: Floodgates', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 661592, 'percent_savings_text': ' ', 'percent_savings': 0, 'option_text': 'Apocalypse: Floodgates - 3,99€', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 399}]}]"
1890690,Guns and Roses,"[{'name': 'default', 'title': 'Buy Guns and Roses', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 680929, 'percent_savings_text': ' ', 'percent_savings': 0, 'option_text': 'Guns and Roses - 6,59€', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 659}]}]"
1838881,Neko Secret - Homecoming: 18+ Hentai Arts,"[{'name': 'default', 'title': 'Buy Neko Secret - Homecoming: 18+ Hentai Arts', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 661586, 'percent_savings_text': ' ', 'percent_savings': 0, 'option_text': 'Neko Secret - Homecoming: 18+ Hentai Arts - 0,79€', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 79}]}]"


In [126]:
with pd.option_context('display.max_colwidth', 500):
    display(storefront[storefront['package_groups'].apply(lambda x: len(ast.literal_eval(x))) > 1][['name','package_groups']].sample(5))

Unnamed: 0_level_0,name,package_groups
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
559330,A Fisherman's Tale,"[{'name': 'default', 'title': ""Buy A Fisherman's Tale"", 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 140124, 'percent_savings_text': ' ', 'percent_savings': 0, 'option_text': ""A Fisherman's Tale - 14,99€"", 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 1499}]}, {'name': 'subscriptions', 'title': ""Buy A Fisherman..."
613790,RED CUBE VR,"[{'name': 'default', 'title': 'Buy RED CUBE VR', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 164408, 'percent_savings_text': ' ', 'percent_savings': 0, 'option_text': 'RED CUBE VR - 1,99€', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 199}]}, {'name': 'subscriptions', 'title': 'Buy RED CUBE VR Subscription Pl..."
791872,Epic Roller Coasters — Armageddon,"[{'name': 'default', 'title': 'Buy Epic Roller Coasters — Armageddon', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 239750, 'percent_savings_text': ' ', 'percent_savings': 0, 'option_text': 'Epic Roller Coasters — Armageddon - 2,39€', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 239}]}, {'name': 'subscriptions..."
477190,Epic Snowday Adventure,"[{'name': 'default', 'title': 'Buy Epic Snowday Adventure', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 106340, 'percent_savings_text': ' ', 'percent_savings': 0, 'option_text': 'Epic Snowday Adventure - 7,99€', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 799}]}, {'name': 'subscriptions', 'title': 'Buy Epic ..."
503770,Elven Assassin,"[{'name': 'default', 'title': 'Buy Elven Assassin', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 115907, 'percent_savings_text': ' ', 'percent_savings': 0, 'option_text': 'Elven Assassin - 16,79€', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 1679}]}, {'name': 'subscriptions', 'title': 'Buy Elven Assassin Subs..."


In [127]:
storefront.loc[764830].package_groups

'[{\'name\': \'default\', \'title\': \'Buy Snowmania\', \'description\': \'\', \'selection_text\': \'Select a purchase option\', \'save_text\': \'\', \'display_type\': 0, \'is_recurring_subscription\': \'false\', \'subs\': [{\'packageid\': 226121, \'percent_savings_text\': \' \', \'percent_savings\': 0, \'option_text\': \'Snowmania - 6,99€\', \'option_description\': \'\', \'can_get_free_license\': \'0\', \'is_free_license\': False, \'price_in_cents_with_discount\': 699}]}, {\'name\': \'subscriptions\', \'title\': \'Buy Snowmania Subscription Plan\', \'description\': \'To be billed on a recurring basis.\', \'selection_text\': \'Starting at 6,99€ / month\', \'save_text\': \'\', \'display_type\': 0, \'is_recurring_subscription\': \'true\', \'subs\': [{\'packageid\': 235307, \'percent_savings_text\': \' \', \'percent_savings\': 0, \'option_text\': \'6,99€ for a month, then 0,79€ / month\', \'option_description\': \'<p class="game_purchase_subscription">6,99€ at checkout, auto-renewed every

This data does seem useful for the research on purchasing options, for example but might not be worth keeping in the main data table.

The items' structure in the lists seems to be rigid but there are lists with multiple items out here. 

package_groups table structure:

| package_groups | Original field | Field Type |
| --- | --- | --- |
| appid | storefront.appid | int |
| type | storefront.package_groups.item.name | string |
| title | storefront.package_groups.item.title | string |
| is_recurring_subscription | storefront.package_groups.item.is_recurring_subscription | bool |
| subs | storefront.package_groups.item.subs | list of dicts/object |

subs will need additional parsing before the analysis as it contains the detailed data on purchasing options - price, free tiers, billing options, etc.

#### [Subroutine] 'package_groups': Cleaning

In [128]:
def cleanPackageGroups(df, export=False):
    '''
    Drop Package groups information from the dataframe, optionally exporting beforehand.
    '''
    
    def packageGroupsParse(row):
        '''
        Parsing each row to get the new columns
        '''
        row['package_groups']['appid'] = row['appid']
        # parsing boolean field to python boolean
        if row['package_groups']['is_recurring_subscription'] == 'false':
            row['package_groups']['is_recurring_subscription'] = False
        else:
            row['package_groups']['is_recurring_subscription'] = True
        result = pd.Series(row['package_groups'])
        return result
    
    if export:
        # removing empty package_groups and columns not needed in processing
        packages_info = df[df['package_groups'].apply(lambda x: bool(ast.literal_eval(x)))][['package_groups']].copy().reset_index()
        # evaluating string to the list and exploding the list
        packages_info['package_groups'] = packages_info['package_groups'].apply(lambda x: ast.literal_eval(x))
        packages_info = packages_info.explode('package_groups')
        packages_info = packages_info.apply(lambda row: packageGroupsParse(row), axis = 1)
        # removing unnecessary oclumns
        packages_info.drop(['description','selection_text','save_text','display_type'], axis = 1, inplace = True)
        # renaming ocolumns
        packages_info.rename(columns={'name':'type'}, inplace = True)
        # changing column order
        packages_info = packages_info[['appid', 'type', 'title', 'is_recurring_subscription', 'subs']]

        export_data(packages_info, 'steam_packages_info', index=False)
    
    df = df.drop(['package_groups'], axis=1)
    
    return df

In [129]:
storefront = cleanPackageGroups(storefront, export = True)

Exported steam_packages_info to "../data/export/steam_packages_info.csv"


In [130]:
#Verifying exported data
pd.read_csv('../data/export/steam_packages_info.csv').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82561 entries, 0 to 82560
Data columns (total 5 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   appid                      82561 non-null  int64 
 1   type                       82561 non-null  object
 2   title                      82561 non-null  object
 3   is_recurring_subscription  82561 non-null  bool  
 4   subs                       82561 non-null  object
dtypes: bool(1), int64(1), object(3)
memory usage: 2.6+ MB


### achievements
This columns contains dictionaries with the total number of the application achievements and the information about the 10 highlited ones. Information about the highlited achievements is not very useful (besides, there is no description - just the name and the icon link) but the total number is worth saving:

In [131]:
print('Achievements nulls count:', storefront['achievements'].isnull().sum())
print('DLCs with achievements count:', storefront[storefront['type']=='dlc']['achievements'].notnull().sum())
with pd.option_context('display.max_colwidth', 500):
    display(storefront[['name', 'achievements']].sample(10))

Achievements nulls count: 72802
DLCs with achievements count: 21


Unnamed: 0_level_0,name,achievements
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
1550120,Fantasy Grounds - The Shattered Circle (2E),
679690,SwapQuest,"{'total': 34, 'highlighted': [{'name': 'Bug Catcher', 'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/679690/992f857bd838e266ac54cd905973ae2406721c92.jpg'}, {'name': 'Extinguisher', 'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/679690/23c297edc6eb1848558e7a4c69e5fdfc52e074f3.jpg'}, {'name': 'Under the Surface', 'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/679690/fc6ed9fe644ead1a3b3bb8591315e44185a5..."
1924880,Moon No Light-Patch,
1993050,Morgue Rot : The Coroner's Quest,
631990,"Remember, Remember","{'total': 26, 'highlighted': [{'name': 'The Story Begins...', 'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/631990/2238c0ba86eb76d89e499888c231dca51075c1c1.jpg'}, {'name': ""Murderers Don't Go To Heaven"", 'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/631990/6bdd247ca7769a9b2fc0974f7569080f073517d0.jpg'}, {'name': 'Your First Charge', 'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/631990/f4189802cab2..."
1857730,Color Patterns,"{'total': 8, 'highlighted': [{'name': 'Just starting', 'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/1857730/5e50894fe5e633aed35ca054cf3df0f9cddfaea1.jpg'}, {'name': 'Getting the hand', 'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/1857730/364e820f1dfca8830b1a77b8ac2328859e0be006.jpg'}, {'name': 'You have a way', 'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/1857730/91998024a3bd2da36d0ca8f0d4b9a5e..."
395260,Shadowrun: Hong Kong - Extended Edition Deluxe Upgrade DLC,
534440,Killing Zombies,"{'total': 33, 'highlighted': [{'name': 'Play on Halloween', 'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/534440/08e99e3c515b09c4adcb6ed91b57bb3b6cf5b4f4.jpg'}, {'name': '1,000 Damage', 'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/534440/4d25a2dcbf2cc52d85cece42df95c9f4cbc0b961.jpg'}, {'name': '10,000 Damage', 'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/534440/4d25a2dcbf2cc52d85cece42df95c9f4cb..."
648390,Marie's Room,"{'total': 7, 'highlighted': [{'name': 'Welcome back Kelsey', 'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/648390/8202ac10c0c98cbf8c2e9a4edc74f4a131262545.jpg'}, {'name': 'Every word', 'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/648390/40060e06086c8978e9d4fcbfc015b8aa4b7ad2d1.jpg'}, {'name': 'The complete picture', 'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/648390/74bad575de45fac801b88b3acf46..."
1326300,Pair Matching Puzzle Connect - Expansion Pack 6,


Number of nulls is not very surprising considering a lot of games launched before achievements appeared and we have DLCs in our table that usually have achievements attached to the base game.

#### [Subroutine] 'achievements': Cleaning

In [132]:
def processAchievements(df):
    '''
    Parse as total number of achievements.
    '''
    df = df.copy()
     
    def parse_achievements(x):
        if pd.isna(x):
            # missing data, assume has no achievements
            return 0
        else:
            # else has data, so can extract and return number under total
            return literal_eval(x)['total']
        
    df['achievements'] = df['achievements'].apply(extractDictItem, key = 'total')
    df['achievements'].fillna(0, inplace=True)
    
    return df

In [133]:
storefront = processAchievements(storefront)
storefront['achievements'].value_counts()

0       73425
10       1543
12       1226
20       1106
15       1019
        ...  
208         1
350         1
324         1
2007        1
141         1
Name: achievements, Length: 418, dtype: int64

### demos

Game demo versions were quite popular back in the day and are still used by some developers publishers. This column contains a list of dictionaries with appids and descriptions of the game demo versions. It might contain multiple elements.

We didn't download the demo versions in this dataset so it's not very useful. Still, will convert it to the simple lists of appids.

In [134]:
# demos
print('demos nulls count:', storefront['demos'].isnull().sum())
print('demos lists with multiple items count:', storefront[storefront['demos'].apply(lambda x: 0 if x != x else len(ast.literal_eval(x))) > 1].shape[0])
storefront[storefront['demos'].notnull()][['name','demos']].sample(15)

demos nulls count: 97069
demos lists with multiple items count: 21


Unnamed: 0_level_0,name,demos
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
1433480,Poop On 2020 Simulator,"[{'appid': 1467750, 'description': ''}]"
1147270,The Lost Island:Battle Royale,"[{'appid': 1147730, 'description': 'Download D..."
1426740,Aveliana,"[{'appid': 1430710, 'description': ''}]"
383690,Mu Complex,"[{'appid': 434640, 'description': 'Try the dem..."
1709350,Bat Boy,"[{'appid': 1762630, 'description': ''}]"
1582790,Frail Faces,"[{'appid': 1583040, 'description': ''}]"
1573100,Subterrain: Mines of Titan,"[{'appid': 2014290, 'description': ''}]"
16060,Samantha Swift and the Golden Touch,"[{'appid': 16062, 'description': ''}]"
516890,Alchemic Jousts,"[{'appid': 597880, 'description': ''}]"
1035850,Balancelot,"[{'appid': 1052910, 'description': ''}]"


#### [Subroutine] 'demos': Cleaning

In [135]:
storefront['demos'] = storefront['demos'].apply(extractDictList, key='appid')

### fullgame
This column is specifically for DLCs and contains information about the base game. It is stored as an appid: name dictionary and we will only leave the appis (as our main table is indexed by it and contains everything else needed). Let's take a closer look on how clean it is:

In [136]:
# fullgame
print('fullgame nulls count:', storefront['fullgame'].isnull().sum())
print('fullgame non-nulls count:', storefront['fullgame'].notnull().sum())
print('fullgame nulls for dlcs count:', storefront[storefront['type']=='dlc']['fullgame'].isnull().sum())
storefront[storefront['fullgame'].notnull()][['name','fullgame']].sample(5)

fullgame nulls count: 68982
fullgame non-nulls count: 35362
fullgame nulls for dlcs count: 47


Unnamed: 0_level_0,name,fullgame
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
593055,Dead Rising 4 - Super Ultra Dead Rising 4 Mini...,"{'appid': '543460', 'name': 'Dead Rising 4'}"
1460350,"FUSER™ - SAINt JHN - ""Roses (Imanbek Remix)""","{'appid': '1331440', 'name': 'FUSER™'}"
749450,Fantasy Grounds - Pathfinder RPG - Carrion Cro...,"{'appid': '252690', 'name': 'Fantasy Grounds C..."
364170,Albedo: Original Soundtrack,"{'appid': '277520', 'name': 'Albedo: Eyes from..."
540721,Euro Truck Simulator 2 - Belgian Paint Jobs Pack,"{'appid': '227300', 'name': 'Euro Truck Simula..."


In [137]:
getSteamLink(storefront[(storefront['type']=='dlc') & (storefront['fullgame'].isnull())][['name','fullgame']].sample(5))

Dread Hunger Bone Rings https://store.steampowered.com/app/2002150
GG1 PRR Silver Add-on Livery https://store.steampowered.com/app/256529
BIT.TRIP Presents... Runner2: Future Legend of Rhythm Alien Soundtrack https://store.steampowered.com/app/218064
cyberpunkdreams: cincinnati stories https://store.steampowered.com/app/1555170
BIT.TRIP FLUX Soundtrack https://store.steampowered.com/app/205093


Undortunately, it seems like some developers have forgotten to mark the fulllgame for some DLCs. You can see the lack of 'This content requires the base game..' field on the game page, for example. 

Fortunately, we have a 'dlc' column that should contain the list for the game. Let's check if we can recover fullgame from it:

In [138]:
def fullgame_dlc_check(df):
    '''
    Checking if we can get the base game information for DLCs
    '''
    temp_data = df.copy()
    dlcs_list = df[df['dlc'].notnull()]['dlc'].apply(lambda x: ast.literal_eval(x)).explode().unique()
    temp_data = temp_data[(storefront['type']=='dlc') & (temp_data['fullgame'].isna())]
    temp_data['dlc_available'] = temp_data.index
    temp_data['dlc_available'] = temp_data['dlc_available'].apply(lambda x: x in dlcs_list)
    return temp_data

temp_data = fullgame_dlc_check(storefront)
print('recoverable dlcs count: ', temp_data[temp_data['dlc_available'] == True][['name','dlc_available']].shape[0])
print('unrecoverable dlcs count: ', temp_data[temp_data['dlc_available'] == False][['name','dlc_available']].shape[0])

recoverable dlcs count:  21
unrecoverable dlcs count:  26


It seems like we can recover some from the fullgame column. I'll leave the remaining nulls as is for now.

#### [Subroutine] 'fullgame': Cleaning

In [139]:
def fullgame_cleaning(df):
    '''
    Cleaning fullgame
    '''

    # Creating a temporary table with appid-dlc data
    df = df.copy()
    dlcs_df = df.copy()
    dlcs_df = dlcs_df.loc[dlcs_df['dlc'].notnull()][['dlc']]
    dlcs_df['dlc'] = dlcs_df['dlc'].apply(lambda x: ast.literal_eval(x))
    dlcs_df = dlcs_df.explode('dlc')

    # Filling out the fullgame column when possible
    def fillFullgame(appid):
        index_list = dlcs_df.index[dlcs_df['dlc']==appid]
        if len(index_list) == 0:
            return pd.NA
        else:
            index_list[0]

    df['fullgame'] = df['fullgame'].apply(extractDictItem, key='appid')
    mask = (df['type']=='dlc') & (df['fullgame'].isnull())
    df.loc[mask, 'fullgame'] = df[mask].apply(lambda row: fillFullgame(appid = row.name), axis = 1)
    
    return df

In [140]:
storefront = fullgame_cleaning(storefront)

Let's check if we recovered data correctly:

In [141]:
temp_data = fullgame_dlc_check(storefront)
print('recoverable dlcs count: ', temp_data[temp_data['dlc_available'] == True][['name','dlc_available']].shape[0])
print('unrecoverable dlcs count: ', temp_data[temp_data['dlc_available'] == False][['name','dlc_available']].shape[0])

recoverable dlcs count:  21
unrecoverable dlcs count:  26


### dlc
This column contains the list of dlcs (in form of appids) for the game. Let's take a peek at how clean the data is:

In [142]:
#dlc
print('dlc nulls count:', storefront['dlc'].isnull().sum())
print('dlc non-nulls count:', storefront['dlc'].notnull().sum())
storefront[storefront['dlc'].notnull()][['name','dlc']].sample(5)

dlc nulls count: 94540
dlc non-nulls count: 9804


Unnamed: 0_level_0,name,dlc
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
1417270,Mind-Blowing Girls,[1428510]
710600,Red Bow,[1231960]
1269890,Virus Popper,"[1291620, 1845110]"
785110,Anime girl Or Bottle?,[836090]
662260,On Earth As It Is In Heaven - A Kinetic Novel,"[800260, 800261]"


In [143]:
# numbers of items in dlc lists. -1 - for null in the column
storefront['dlc'].apply(lambda x: -1 if x != x else len(ast.literal_eval(x))).value_counts()

-1      94540
 1       6281
 2       1548
 3        570
 4        335
        ...  
 45         1
 87         1
 169        1
 200        1
 191        1
Name: dlc, Length: 90, dtype: int64

Everything seems fine, we can leave the field as is.

#### [Subroutine] 'dlc': Cleaning

Reserved, the field left as is.

### ext_user_account_notice

The column contains information about the external accounts used, for example, for authentication. There are not many of them filled but this might still be useful information. Leaving as is.

#### [Subroutine] 'ext_user_account_notice': Cleaning

Reserved, the field left as is.

In [144]:
print('External user account nulls count:', storefront['ext_user_account_notice'].isnull().sum())
print('External user account count:', storefront['ext_user_account_notice'].notnull().sum())
storefront['ext_user_account_notice'].value_counts(dropna = False)

External user account nulls count: 103287
External user account count: 1057


NaN                                                                                                                                                                                                                           103287
Uplay (Supports Linking to Steam Account)                                                                                                                                                                                         38
EA Account (Supports Linking to Steam Account)                                                                                                                                                                                    30
Slitherine PBEM++ for Multiplayer                                                                                                                                                                                                 27
Twitch                                                                              

### drm_notice

This column contains information on DRM Protection technology used in the app. The number of of non-null items is suprisingly small so it seems it'was not strictly necessary to fill it. It seems like there is no fixed field structure to parse it.

Considering all of that, I'll leave it as is but it is a very strong candidate on removal.

#### [Subroutine] 'drm_notice': Cleaning

Reserved, the field left as is.

In [145]:
print('DRM Notice nulls count:', storefront['drm_notice'].isnull().sum())
print('DRM Notice non-nulls count:', storefront['drm_notice'].notnull().sum())
storefront['drm_notice'].value_counts(dropna = False)

DRM Notice nulls count: 103621
DRM Notice non-nulls count: 723


NaN                                                                                           103621
Denuvo Anti-tamper<br>5 different PC within a day machine activation limit                       228
Denuvo Anti-tamper                                                                               170
EA on-line activation and Origin client software installation and background use required.        76
Denuvo Antitamper                                                                                 33
                                                                                               ...  
MoYi Technology                                                                                    1
Reality Pump DLM V2                                                                                1
Denuvo Anti-tamper<br>5 machine activation limit                                                   1
Tages Solidshield<br>3 machine activation limit                                            

### recommendations

This column contains the total number of reviews for games. We have this field in the reviews table (and it's basically using the same source) as well so it can be safely removed as redundant. Interestingly, we have much less nulls in the reviews table - values of 100 and below are filtered out of the field in storefront.

In [146]:
# recommendations quick look
def getminrec(df):
    temp_df = df.copy()
    temp_df['recommendations'] = temp_df['recommendations'].apply(extractDictItem, key = 'total')
    return temp_df['recommendations'].min()

print('recommendations nulls count:', storefront['recommendations'].isnull().sum())
print('storefront.recommendations minumum value', getminrec(storefront))
print('reviews.total_reviews count:', reviews[(pd.isna(reviews['total_positive'])) | (reviews['total_positive'] == 0)].shape[0])
storefront[storefront['recommendations'].notnull()][['name','recommendations']].sample(5)

recommendations nulls count: 91009
storefront.recommendations minumum value 101
reviews.total_reviews count: 32790


Unnamed: 0_level_0,name,recommendations
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
378644,The Witcher 3: Wild Hunt - Alternative Look fo...,{'total': 208}
12510,Prison Tycoon 3™: Lockdown,{'total': 114}
283640,Salt and Sanctuary,{'total': 14905}
944020,Dream Ending,{'total': 2057}
318230,Convoy,{'total': 1032}


In [147]:
temp_df = storefront.copy()
temp_df['recommendations'] = temp_df['recommendations'].apply(extractDictItem, key = 'total')
temp_df['recommendations'].min()

101

In [148]:
storefront.loc[839930].recommendations

"{'total': 366}"

In [149]:
reviews.loc[839930]

review_score              5
review_score_desc     Mixed
total_positive          282
total_negative          133
total_reviews           415
download_appid       839930
Name: 839930, dtype: object

#### [Subroutine] 'recommendations': Cleaning

Removed as redundant.

In [150]:
storefront = storefront.drop('recommendations', axis = 1)

### reviews

This column a selection of journalist reviews' quotes. The metacritic column is much more helpful as it contains both the score and link to the combined reviews. So I consider this column redundant (it's used to show a selection of review quotes on the game page).


In [151]:
# reviews
with pd.option_context('display.max_colwidth', 500):
    display(storefront[storefront['reviews'].notnull()][['name','reviews']].sample(5))

Unnamed: 0_level_0,name,reviews
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
402390,Hyperdrive Massacre,"“Hyperdrive Massacre understands what it is, and plays to its strength of frenetic multiplayer action to perfection.”<br>80 – <a href=""https://steamcommunity.com/linkfilter/?url=http://cogconnected.com/review/hyperdrive-massacre-review/"" target=""_blank"" rel=""noopener"" >CogConnected</a><br><br>“A prime slice of local-multiplayer gaming that offers both co-operative and competitive action that is balanced, consistent and most importantly of all, hugely entertaining.”<br>80 – <a href=""https://..."
447820,Day of Infamy,"“Day of Infamy gets the cinematography of war just right.”<br><a href=""https://www.rockpapershotgun.com/2016/08/08/day-of-infamy-review-early-access/"" target=""_blank"" rel=""noreferrer"" >Rock, Paper, Shotgun</a><br><br>“Day of Infamy is a solid game with a developer that is showing a lot of love for their work.”<br><a href=""https://steamcommunity.com/linkfilter/?url=http://www.wargamer.com/reviews/review-day-of-infamy/"" target=""_blank"" rel=""noopener"" >Wargamer</a><br><br>“Day of Infamy does ..."
284100,Unclaimed World,"“Seriously, I’ve played a little over twenty hours and I still feel like I’ve only discovered a little of what’s possible. It’s a smart game, with a lot under the hood.”<br><a href=""https://steamcommunity.com/linkfilter/?url=https://explorminate.net/2016/10/21/friday-excursion-unclaimed-world/"" target=""_blank"" rel=""noopener"" >eXplorminate</a><br><br>“I. Want. This!”<br><a href=""http://www.rockpapershotgun.com/2013/01/25/best-thing-ever-today-unclaimed-world/"" target=""_blank"" rel=""noreferrer..."
468920,Ultimate Fishing Simulator,"“For a fishing simulator, Ultimate Fishing for PC looks very nice. The developers devoted much attention to modeling equipment, baits, and other accessories. Furthermore, water looks realistic, and the day-and-night cycle is the icing on the cake.”<br>8.5 – <a href=""https://steamcommunity.com/linkfilter/?url=https://www.gamepressure.com/games/ultimate-fishing-simulator/z74c79"" target=""_blank"" rel=""noopener"" >GAME PRESSURE</a><br><br>“Ultimate Fishing Simulator captures a lot of intricate li..."
342020,BiT Evolution,"“This is f’n cool! This is gorgeous!”<br><a href=""https://www.youtube.com/watch?v=QVUtiR5mtrw"" target=""_blank"" rel=""noreferrer"" >PewDiePie</a><br><br>“BiT Evolution bleeds retro. Awesome attention to detail. BiT Evolution will give you hours of enjoyment.”<br><a href=""https://steamcommunity.com/linkfilter/?url=http://www.gamingcypher.com/bit-evolution-pc-preview-and-details"" target=""_blank"" rel=""noopener"" >Gamer Headlines</a><br><br>“What better way to learn about history than to experienc..."


#### [Subroutine] 'reviews': Cleaning

Removed as redundant

In [152]:
storefront = storefront.drop('reviews', axis = 1)

### controller_support

This column contains information about the apps' controllers support levels. If you remember, we had game features in categories and there was controller support information there as well:

* Full controller support       25126
* Partial Controller Support    18518

Let's take a look at this column and compare it with categories:

In [153]:
print('Controller Support nulls count:', storefront['controller_support'].isnull().sum())
storefront['controller_support'].value_counts(dropna = False)

Controller Support nulls count: 78279


NaN     78279
full    26065
Name: controller_support, dtype: int64

In [154]:
# creating temporary boolean dataframe with the required
temp_df = storefront[['controller_support','categories']].copy()
temp_uniques = ['Full controller support','Partial Controller Support']
temp_df['categories'].fillna({i: [] for i in temp_df.index},inplace = True)
temp_df = boolean_df(temp_df['categories'], temp_uniques)
temp_df['controller_support - full'] = storefront['controller_support'].apply(lambda x: True if x=='full' else False)
print('controller_support == full and no Full controller support category:',
      temp_df[(temp_df['Full controller support'] == False) & (temp_df['controller_support - full'] == True)].shape[0])
print('controller_support == None and Full controller support category:',
      temp_df[(temp_df['Full controller support']) & (temp_df['controller_support - full'] == False)].shape[0])

controller_support == full and no Full controller support category: 0
controller_support == None and Full controller support category: 0


As we can see, controller data already exists in categories and there are no discrepancies here (Also, we even have a partial controller support in categories, unlike this column).

We can safely drop it:
#### [Subroutine] 'controller_support': Cleaning

In [155]:
storefront = storefront.drop('controller_support', axis = 1)

### legal_notice
This column doesn't seem to contain any usefull information. Dropping.

In [156]:
print('Legal Notice nulls count:', storefront['legal_notice'].isnull().sum())
with pd.option_context('display.max_colwidth', 100):
    display(storefront[storefront['legal_notice'].notnull()][['name','legal_notice']].sample(10))

Legal Notice nulls count: 62675


Unnamed: 0_level_0,name,legal_notice
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
1841130,RPG Maker MZ - MT Tiny Tales Overworld Tiles,©Mega Tiles 2021
344890,A.R.M. PLANETARY PROSPECTORS EP1 Asteroid Resource Mining,© 2015 ARM Planetary Prospectors Asteroid Resource Mining Copyright Nefarious Dimensions Inc.<br...
1770000,SteelPinion,© Copyright 2022 Rare Gentlemen Syndicate
866770,Spaceteam VR,© 2020 Cooperative Innovations Limited. All rights reserved. All other copyrights and trademarks...
874700,The Putinland: Divide & Conquer,Copyrights Tero Lunkka<br />\r\nThe Putinland: Divide & Conquer uses the Unreal® Engine. Unreal®...
982510,Dungeon Hunter 5,©2018 Gameloft. All Rights Reserved. Gameloft and the Gameloft logo are trademarks of Gameloft.
392660,Fallen: A2P Protocol - Digital Deluxe Edition,2015 Red Katana
614005,X-Plane 11 - Add-on: FunnerFlight - Airport Los Angeles International V2,Copyright © 2017 AEROSOFT. All rights reserved. All trademarks and brand names are trademarks or...
1651570,Field of Glory II: Medieval - Swords and Scimitars,"©2020 Slitherine Ltd. All Rights Reserved. Field of Glory II, Field of Glory II Medieval, Slithe..."
941850,Vanguard: Normandy 1944,© Pathfinder Games Limited. All rights reserved. Vanguard: Normandy 1944 and Pathfinder Games ar...


#### [Subroutine] 'legal_notice': Cleaning

In [157]:
storefront = storefront.drop('legal_notice', axis = 1)

### metacritic

This column contains the Metacritic score and the link for the apps. Unfortunately, there are not many of them but it's an interesting information.

It makes sense to move it to the reviews for now. We will decide the final table structure close to the end.

Let's take a look at this column:

In [158]:
# metacritic
print('Metacritic nulls count:', storefront['metacritic'].isnull().sum())
print('Metacritic non-nulls count:', storefront['metacritic'].notnull().sum())
print('rows missing from reviews that contain non-null metacritic: ',
      storefront[(storefront.index.isin(
          storefront.index.difference(reviews.index))) 
                 & (storefront['metacritic'].notnull())].shape[0]
                 )
with pd.option_context('display.max_colwidth', 100):
    display(storefront[storefront['metacritic'].notnull()][['name','metacritic']].sample(5))

Metacritic nulls count: 100504
Metacritic non-nulls count: 3840
rows missing from reviews that contain non-null metacritic:  1


Unnamed: 0_level_0,name,metacritic
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
624970,Battlezone: Combat Commander,"{'score': 68, 'url': 'https://www.metacritic.com/game/pc/battlezone-combat-commander?ftag=MCD-06..."
874040,Ministry of Broadcast,"{'score': 67, 'url': 'https://www.metacritic.com/game/pc/ministry-of-broadcast?ftag=MCD-06-10aaa..."
15320,IL-2 Sturmovik: 1946,"{'score': 86, 'url': 'https://www.metacritic.com/game/pc/il-2-sturmovik-1946?ftag=MCD-06-10aaa1f'}"
449960,Book of Demons,"{'score': 72, 'url': 'https://www.metacritic.com/game/pc/book-of-demons?ftag=MCD-06-10aaa1f'}"
1164060,The Last Show of Mr. Chardish,"{'score': 79, 'url': 'https://www.metacritic.com/game/pc/the-last-show-of-mr-chardish?ftag=MCD-0..."


#### [Subroutine] 'metacritic': Cleaning

In [159]:
def metacritic_clean(df1,df2):
    ''' 
    Parse metacritic  column to 2 news columns - metacritic_score and metacritic_url,
    copy them to reviews and remove from the storefront
    '''
    def metacritic_parse(data):
    # parsing metacritic column
        if data != data:
            return pd.NA, pd.NA
        try:
            evalDict = eval(data)
            if(type(evalDict) == dict):
                return evalDict['score'],evalDict['url']
        except:
            pd.NA,pd.NA
        return pd.NA,pd.NA
    
    df1 = df1.copy()
    df2 = df2.copy()
    df1[['metacritic_score','metacritic_url']]=df1.apply(lambda row: metacritic_parse(row.metacritic),axis=1,result_type='expand')

    # copying columns to reviews (creating new rows if necessary)
    df2 = pd.concat([df2,df1[['metacritic_score','metacritic_url']]], ignore_index=False, axis = 1)
    # filling the new rows
    review_fills = {'review_score': 0, 'review_score_desc': 'No user reviews', 'total_positive': 0, 'total_reviews': 0, 'total_negative': 0}
    df2.fillna(value = review_fills, inplace = True)
    
    # removing unneeded columns
    
    df1.drop(['metacritic', 'metacritic_score', 'metacritic_url'], axis = 1, inplace = True)
    return df1, df2

In [160]:
storefront, reviews = metacritic_clean(storefront,reviews)

Checking updated dataframes:

In [161]:
storefront.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 104344 entries, 10 to 1873351
Data columns (total 22 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   type                     104344 non-null  object
 1   name                     104344 non-null  object
 2   required_age             104344 non-null  int64 
 3   dlc                      9804 non-null    object
 4   fullgame                 35362 non-null   object
 5   supported_languages      104180 non-null  object
 6   drm_notice               723 non-null     object
 7   ext_user_account_notice  1057 non-null    object
 8   developers               104300 non-null  object
 9   publishers               104301 non-null  object
 10  demos                    7275 non-null    object
 11  packages                 82422 non-null   object
 12  platforms                104344 non-null  object
 13  categories               104228 non-null  object
 14  genres            

In [162]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 105035 entries, 10 to 2060820
Data columns (total 8 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   review_score       105035 non-null  float64
 1   review_score_desc  105035 non-null  object 
 2   total_positive     105035 non-null  float64
 3   total_negative     105035 non-null  float64
 4   total_reviews      105035 non-null  float64
 5   download_appid     105027 non-null  float64
 6   metacritic_score   3840 non-null    object 
 7   metacritic_url     3840 non-null    object 
dtypes: float64(5), object(3)
memory usage: 7.2+ MB


### last_modified

This column contains the UTC timestamp of the last data modification, taken from the IStoreService.

Let's take a quick look at this data:

In [163]:
# last_modified
print('last_modified nulls count:', storefront['last_modified'].isnull().sum())
print('last_modified non-nulls count:', storefront['last_modified'].notnull().sum())
storefront[storefront['last_modified'].notnull()][['name','last_modified']].sample(5)

last_modified nulls count: 0
last_modified non-nulls count: 104344


Unnamed: 0_level_0,name,last_modified
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
1825874,Trainz 2022 DLC - SNCF - AGC Languedoc,1653976009
1445210,Lost Lies,1648553147
289060,The Emerald Tablet,1581107659
948380,Merchant - Extra Hero Slots,1539050931
1959700,Trainz 2022 DLC - Kansas City Southern - EMD S...,1653976721


There are no nulls here. We can change the datatype to the datetime but keeping it as UTC timestamp is probably much simplier. 

**I'll leave the column as is**

# Reviews table

This table contains the combined data about the games reviews, and we've also added the Metacritic scores + url links earlier. Judging by the info, there shouldn't be any nulls (aside from metacritic columns) but we'll take a look anyways.


In [164]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 105035 entries, 10 to 2060820
Data columns (total 8 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   review_score       105035 non-null  float64
 1   review_score_desc  105035 non-null  object 
 2   total_positive     105035 non-null  float64
 3   total_negative     105035 non-null  float64
 4   total_reviews      105035 non-null  float64
 5   download_appid     105027 non-null  float64
 6   metacritic_score   3840 non-null    object 
 7   metacritic_url     3840 non-null    object 
dtypes: float64(5), object(3)
memory usage: 7.2+ MB


In [165]:
reviews

Unnamed: 0_level_0,review_score,review_score_desc,total_positive,total_negative,total_reviews,download_appid,metacritic_score,metacritic_url
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
10,9.0,Overwhelmingly Positive,196884.0,5077.0,201961.0,10.0,88,https://www.metacritic.com/game/pc/counter-str...
20,8.0,Very Positive,5586.0,910.0,6496.0,20.0,,
30,8.0,Very Positive,5126.0,562.0,5688.0,30.0,79,https://www.metacritic.com/game/pc/day-of-defe...
40,8.0,Very Positive,1897.0,418.0,2315.0,40.0,,
50,9.0,Overwhelmingly Positive,13950.0,692.0,14642.0,50.0,,
...,...,...,...,...,...,...,...,...
2057790,0.0,No user reviews,0.0,0.0,0.0,2057790.0,,
2058200,0.0,No user reviews,0.0,0.0,0.0,2058200.0,,
2058310,0.0,No user reviews,0.0,0.0,0.0,2058310.0,,
2058440,0.0,No user reviews,0.0,0.0,0.0,2058440.0,,


We have three columns describing the reviews counts:
* total_positive - total positive reviews,
* total_negative - total negative reviews,
* total_reviews - total reviews.

Two columns describing the Steam reviews scores:
* review_score - reviews score as calculated by Steam,
* review_score_desc - text description of the said score.

And two columns describing Metacritic scores:
* metacritic_score - the score at the time of data collection
* metacritic_url - url address of the game on the Metacritic site

Sadly, Steam score has some issues, as described by [SteamDB](https://steamdb.info/blog/steamdb-rating/). In short, that rating has issues with the low number of reviews and is not very good with sorting. The formula proposed by the linked article takes that into account and gives us the adjusted rating with respect to the number of reviews and the “real rating”.

*Thanks to SteamDB and /u/tornmandate for providing such a useful rating score (which is shared under the MIT license)*

I'll use the said formula as well to determine the rating. **Keep in mind, that it's still not recommended to rely on the rating with less than 500 votes**.

![image.png](attachment:ac928351-f3a5-4c7d-ad3c-03237ed936da.png)![image.png](attachment:601f3b09-d93c-42d5-90b0-31795e2a6d6b.png)

In [166]:
reviews['rating'] = (
                        reviews['total_positive']/reviews['total_reviews'] - 
                        (reviews['total_positive']/reviews['total_reviews'] - 0.5)*np.power(2,-np.log10(reviews['total_reviews']+1))
                    )*100

Let's check the games with the highest reviews:

In [167]:
temp_df = pd.concat([storefront,reviews], ignore_index=False, axis = 1)
temp_df.sort_values(by='rating', ascending = False)[['name','review_score','total_positive','total_negative','rating']]

Unnamed: 0_level_0,name,review_score,total_positive,total_negative,rating
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
620,Portal 2,9.0,293166.0,3560.0,97.701036
427520,Factorio,9.0,132387.0,1446.0,97.519213
1118200,People Playground,9.0,119382.0,1324.0,97.459069
1794680,Vampire Survivors,9.0,104864.0,1154.0,97.409679
1145360,Hades,9.0,187114.0,2587.0,97.382838
...,...,...,...,...,...
2057600,Jester`s Theater Museum sounds and themes,0.0,0.0,0.0,
2057790,Loop,0.0,0.0,0.0,
2058200,Showrunner,0.0,0.0,0.0,
2058310,Cyber Sex,0.0,0.0,0.0,


This correlates with what we see at https://steamdb.info/stats/gameratings/ . As you can see, the numbers of reviews are somewhat different, the reason being that we've downloaded only reviews for people that bought the game from Steam.

In [168]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 105035 entries, 10 to 2060820
Data columns (total 9 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   review_score       105035 non-null  float64
 1   review_score_desc  105035 non-null  object 
 2   total_positive     105035 non-null  float64
 3   total_negative     105035 non-null  float64
 4   total_reviews      105035 non-null  float64
 5   download_appid     105027 non-null  float64
 6   metacritic_score   3840 non-null    object 
 7   metacritic_url     3840 non-null    object 
 8   rating             74895 non-null   float64
dtypes: float64(6), object(3)
memory usage: 12.0+ MB


In the mathematical operations we got a lot of NaNs, due to games with 0 total reviews. Let's assign them a score of 50%, as the medium point. This is the same approach used in the algorithm above.

In [169]:
reviews['rating'] = reviews['rating'].fillna(50.0)

Let's also remove the excessive entries from reviews if they are present:

In [170]:
def df_remove_excesses(df_primary, df_secondary):
    excesses = df_secondary.index.difference(df_primary.index)
    df_secondary = df_secondary.drop(excesses, axis=0)
    return df_secondary

And remove the field that contain the excessive information:
* total_reviews - we can calculate them by using total_positive and total_negative
* review_score_desc - we can describe the scores in the review_score metadata if needed.
* download_appid - it is no longer needed

In [171]:
reviews = df_remove_excesses(storefront,reviews)
reviews.drop([
        'total_reviews', 'review_score_desc', 'download_appid'
    ], axis=1, inplace = True)

We'll decide how we are going to  join/split the table close to the end and leave the reviews for now.

# SteamSpy table

This table contains data collected from SteamSpy. The columns are:

| Column  | Description |
| --- | --- |
| appid | Appid, used as index |
| name | Application name |
| developers | Application developers |
| publishers | Application publishers |
| score_rank| Steam reviews score rank |
| total_positive | Positive reviews count|
| total_negative | Negative reviews count |
| review_score | Steam review score |
| owners | Estimated owner numbers |
| average_forever | Average playtime |
| average_2weeks | Average playtime in the last two weeks |
| median_forever | Median playtime |
| median_2weeks | Median playtime in the last two weeks |
| price | Current game price |
| initialprice | Initial game price |
| discount | Discount |
| supported_languages | Supported languages |
| genres | App Genres |
| ccu | Peak concurrent players on the day before the data collection (*not the max historical!*) |
| tags | User tags with counts |


We have already got the clean data for most of the fields from the Storefront and Reviews tables. Also, the data from SteamSpy is not as complete and recent comparing to the one directy downloaded from Steam. These columns are:
- appid
- name
- developers
- publishers
- score_rank
- total_positive
- total_negative
- review_score
- price
- initialprice
- discount
- supported_languages
- genres
 
I'll still take a quick look on this fields one by one.

The columns we are interested in:
- owners
- average_forever
- average_2weeks
- median_forever
- median_2weeks
- ccu
- tags

### Conforming the table rows to the same ids as in the main storefront table:


In [172]:
steamspy = df_remove_excesses(storefront,steamspy)


I'll create a temporary table merging storefront, reviews and steamspy for the steamspy data check

In [173]:
storefront_s = pd.concat([storefront, reviews.add_suffix('_reviews'), steamspy.add_suffix('_steamspy')], axis = 1)

## Columns we already have good data data on
### name
Application name. We've already worked with it on the storefront so it's of no use to us.

In [174]:
print('name nulls count:', steamspy['name'].isnull().sum())
with pd.option_context('display.max_colwidth', 150):
    display(storefront_s[['name', 'name_steamspy']].sample(5))

name nulls count: 240


Unnamed: 0_level_0,name,name_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
793850,Pizza Delivery 3000,Pizza Delivery 3000
1634900,ACHERON,Acheron
666090,Bad Government,Bad Government
448201,TS Marketplace: BR Hawksworth Coach Pack 01 Add-On,TS Marketplace: BR Hawksworth Coach Pack 01 Add-On
1825780,上班摸鱼模拟器 - Lazy at work,上班摸鱼模拟器 - Lazy at work


### developers
Developers. We've already worked with it on the storefront so it's of no use to us.

In [175]:
print('developers nulls count:', storefront_s['developers_steamspy'].isnull().sum())
with pd.option_context('display.max_colwidth', 150):
    display(storefront_s[['name', 'developers', 'developers_steamspy']].sample(5))

developers nulls count: 10687


Unnamed: 0_level_0,name,developers,developers_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1558060,Club Lipstick VR,['TooDrunkenBrothers'],TooDrunkenBrothers
1819540,Santa Slays Nazis,"['Jeremy Martin', 'Jared Martin']","Jeremy Martin, Jared Martin"
1856630,明星志願2000 - 玩轉綜藝,['SOFTSTAR ENTERTAINMENT'],
1627570,The Plucky Squire,['All Possible Futures'],
1646890,Hot Summer,['玫瑰工作室'],玫瑰工作室


### publishers
Publishers. We've already worked with it on the storefront so it's of no use to us.

In [176]:
print('publishers nulls count:', storefront_s['publishers_steamspy'].isnull().sum())
with pd.option_context('display.max_colwidth', 150):
    display(storefront_s[['name', 'publishers', 'publishers_steamspy']].sample(5))

publishers nulls count: 19998


Unnamed: 0_level_0,name,publishers,publishers_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1831700,Tin Hearts,['Wired Productions'],
1699100,RPG Maker MZ - The Agency,['Degica'],Degica
1732730,Captures II,['cosmic_dust'],cosmic_dust
374150,Concealed Intent,['Jarrah Technology'],Jarrah Technology
637690,Rocksmith® 2014 Edition – Remastered – 70s Mix Song Pack II,['Ubisoft - San Francisco'],


### score_rank
Steam review score rank. We've already worked with it on the storefront so it's of no use to us.

In [177]:
print('score_rank nulls count:', storefront_s['score_rank_steamspy'].isnull().sum())
with pd.option_context('display.max_colwidth', 150):
    display(storefront_s[storefront_s['score_rank_steamspy'].notnull()][['name', 'rating_reviews', 'score_rank_steamspy']].sample(5))

score_rank nulls count: 104298


Unnamed: 0_level_0,name,rating_reviews,score_rank_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
966460,Undress Tournament,50.920162,98.0
371120,Discouraged Workers,58.969204,98.0
896890,VR Paradise - Steam Edition,80.837158,99.0
975020,The Spirit Master of Retarnia -Conqueror of the Labyrinth-,65.963854,99.0
780020,True Hentai Puzzle,65.891316,98.0


### total_positive
Total number of positive reviews. We've already worked with it on the storefront so it's of no use to us.

In [178]:
print('total_positive nulls count:', storefront_s['total_positive_steamspy'].isnull().sum())
with pd.option_context('display.max_colwidth', 150):
    display(storefront_s[storefront_s['total_positive_steamspy']>0][['name', 'total_positive_reviews', 'total_positive_steamspy']].sample(5))

total_positive nulls count: 0


Unnamed: 0_level_0,name,total_positive_reviews,total_positive_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1443300,Martha,23.0,23
374060,Stairs,228.0,229
1246490,Domino Simulator,50.0,50
205870,Auditorium,225.0,225
1488050,Beyond The Diorama: Caribou World,4.0,4


### total_negative
Total number of negative reviews. We've already worked with it on the storefront so it's of no use to us.

In [179]:
print('total_negative nulls count:', storefront_s['total_negative_steamspy'].isnull().sum())
with pd.option_context('display.max_colwidth', 150):
    display(storefront_s[storefront_s['total_negative_steamspy']>0][['name', 'total_negative_reviews', 'total_negative_steamspy']].sample(5))

total_negative nulls count: 0


Unnamed: 0_level_0,name,total_negative_reviews,total_negative_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1028280,Bridge! 3,7.0,7
1588340,Riftbound,2.0,2
1318420,Summer in Mara Prologue,132.0,133
1902210,Hidden Floating City Top-Down 3D,6.0,6
569570,Light Apprentice - The Comic Book RPG,7.0,7


### review_score
Review score. We've already worked with it on the storefront so it's of no use to us.

In [180]:
print('review_score nulls count:', storefront_s['review_score_steamspy'].isnull().sum())
with pd.option_context('display.max_colwidth', 150):
    display(storefront_s[storefront_s['review_score_steamspy']>0][['name', 'review_score_reviews', 'review_score_steamspy']].sample(5))

review_score nulls count: 0


Unnamed: 0_level_0,name,review_score_reviews,review_score_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
914140,Hentai Dojo,5.0,65
975020,The Spirit Master of Retarnia -Conqueror of the Labyrinth-,6.0,71
331065,Call of Duty®: Advanced Warfare - Lightning Premium Personalization Pack,5.0,56
962380,HOT FIT!,8.0,80
42695,Call of Duty®: Modern Warfare® 3 Collection 1,7.0,92


### price
Current price (including discounts). We've already worked with it on the storefront so it's of no use to us.

In [181]:
print('price nulls count:', storefront_s['price_steamspy'].isnull().sum())
with pd.option_context('display.max_colwidth', 150):
    display(storefront_s[storefront_s['price_steamspy']>0][['name', 'price', 'price_steamspy']].sample(5))

price nulls count: 10375


Unnamed: 0_level_0,name,price,price_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
498740,Way of Redemption,6.99,699.0
563101,Ballistic Overkill - Marksman: Zombie,6.99,699.0
330720,Flashpoint Campaigns: Red Storm Player's Edition,45.99,4999.0
1179690,UFO穿越 （UFO Cross）,2.39,299.0
1715780,The Crafter's Defense,12.49,1499.0


### initialprice
Price without the discount. We've already worked with it on the storefront so it's of no use to us.

In [182]:
print('price nulls count:', storefront_s['initialprice_steamspy'].isnull().sum())
with pd.option_context('display.max_colwidth', 150):
    display(storefront_s[storefront_s['initialprice_steamspy']>0][['name', 'price', 'initialprice_steamspy']].sample(5))

price nulls count: 10373


Unnamed: 0_level_0,name,price,initialprice_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
436609,Rocksmith® 2014 – Crobot - “Legend of the Spaceborne Killer”,2.99,299.0
649360,Yesterday,4.99,499.0
1523750,Strip Black Jack - Sex Teacher,1.79,199.0
294917,Rocksmith® 2014 – Cake - “Stickshifts And Safetybelts”,2.99,299.0
351290,SURVIVAL: Postapocalypse Now,12.49,1499.0


### discount
Current dicount. We've already worked with it on the storefront so it's of no use to us.

In [183]:
print('discount nulls count:', storefront_s['discount_steamspy'].isnull().sum())
with pd.option_context('display.max_colwidth', 150):
    display(storefront_s[storefront_s['discount_steamspy']>0][['name', 'price_steamspy', 'initialprice_steamspy', 'discount_steamspy']].sample(5))

discount nulls count: 10373


Unnamed: 0_level_0,name,price_steamspy,initialprice_steamspy,discount_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1705073,Tiger Soldier Ⅰ MP064,74.0,99.0,25.0
1680940,RPG Maker MZ - Wizards Pack (8 Elements),1799.0,1999.0,10.0
714336,Omega Quintet: Billionaire's Club Pack,149.0,299.0,50.0
1185190,Climb Challenge,99.0,199.0,50.0
420640,Knights of Pen and Paper 2 - Here Be Dragons,249.0,499.0,50.0


### supported_languages
Supported languages. Here languages are not divided on autio/text. We've already worked with this data on the storefront so it's of no use to us.

In [184]:
print('supported_languages nulls count:', storefront_s['supported_languages_steamspy'].isnull().sum())
with pd.option_context('display.max_colwidth', 150):
    display(storefront_s[storefront_s['supported_languages_steamspy'].notnull()][[
        'name',
        'supported_audio',
        'supported_languages',
        'supported_languages_steamspy'
    ]].sample(5))

supported_languages nulls count: 10543


Unnamed: 0_level_0,name,supported_audio,supported_languages,supported_languages_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
971530,VoidGate,,[English],English
455320,Blood and Ice,,"[English, Russian]","English, Russian"
1712490,Station 17,[English],[English],English
1838250,Lost Contact,[ Russian],"[English, Russian]","English, Russian"
389570,Mitos.is: The Game,,"[Danish, Dutch, English, French, German, Italian, Portuguese, Portuguese - Brazil, Romanian, Spanish - Spain, Swedish, Turkish]","English, Italian, Portuguese - Brazil, French, Spanish - Spain, Portuguese, German, Danish, Romanian, Dutch, Swedish, Turkish"


### genres
Genres. We've already worked with this data on the storefront so it's of no use to us.

In [185]:
print('genres nulls count:', storefront_s['genres_steamspy'].isnull().sum())
with pd.option_context('display.max_colwidth', 150):
    display(storefront_s[storefront_s['genres_steamspy'].notnull()][['name', 'genres', 'genres_steamspy']].sample(5))

genres nulls count: 10625


Unnamed: 0_level_0,name,genres,genres_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1623781,Speedpunk - Greeble city,"[Action, Casual, Indie, Racing, Simulation]","Action, Casual, Indie, Racing, Simulation"
418160,Garden Rescue: Christmas Edition,"[Casual, Strategy]","Casual, Strategy"
1478210,Scramble,"[Casual, Indie, Strategy]","Casual, Indie, Strategy"
623320,Button Button Up!,"[Action, Adventure, Indie]","Action, Adventure, Indie"
1247340,Island Saver - Dinosaur Island,"[Action, Adventure]","Action, Adventure"


## Columns requiring analysis

### Owners
SteamSpy owners estimation. A string with lower .. upper application owners estimates. We could split it into two for lower and upper estimations but I'll just slightly reformat it to keep consistent with Nik Davis dataset.

In [186]:
print('owners nulls count:', storefront_s['owners_steamspy'].isnull().sum())
with pd.option_context('display.max_colwidth', 150):
    display(storefront_s[storefront_s['owners_steamspy'].notnull()][['name', 'owners_steamspy']].sample(5))

owners nulls count: 0


Unnamed: 0_level_0,name,owners_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
933557,WARRIORS OROCHI 4/無双OROCHI３ - Legendary Weapons Orochi Pack 2,"0 .. 20,000"
1903232,hexceed - Insulam Pack,"0 .. 20,000"
638780,Reflector: Bug Hunt,"0 .. 20,000"
1506060,QFIELD,"0 .. 20,000"
1187303,[Revival] DOA6 Santa's Helper Costume - Hayabusa,"0 .. 20,000"


#### [Subroutine] 'owners': Cleaning

In [187]:
def owners_clean(df):
    '''
    Reformatting owners column to lower-upper format
    '''
    df = df.copy()
    df['owners'] = df['owners'].str.replace(',', '', regex=True).str.replace(' .. ', '-', regex=True)
    return df

In [188]:
steamspy = owners_clean(steamspy)

In [189]:
steamspy.sample(10)

Unnamed: 0_level_0,name,developers,publishers,score_rank,total_positive,total_negative,review_score,owners,average_forever,average_2weeks,median_forever,median_2weeks,price,initialprice,discount,supported_languages,genres,ccu,tags
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
513100,TS Marketplace: Falns 121 Wagon Pack Add-On,virtualRailroads,Dovetail Games,,0,0,0,0-20000,0,0,0,0,499.0,499.0,0.0,English,Simulation,0,[]
1068990,Magnetic Pull,Groggy Goblin Games,Groggy Goblin Games,,4,0,0,0-20000,0,0,0,0,299.0,299.0,0.0,English,"Action, Casual, Indie",0,"{'Action': 21, 'Indie': 21, 'Casual': 21}"
1654350,The Little Tank Hunter,Team-K,Team-K,,0,0,0,0-20000,0,0,0,0,399.0,399.0,0.0,English,"Action, Adventure, Casual, Indie",0,[]
1114080,Blazing War,,,,0,0,0,0-20000,0,0,0,0,,,,,,0,[]
233230,Kairo,Locked Door Puzzle,Lupus Studios Limited,,1232,449,0,100000-200000,110,0,123,0,499.0,499.0,0.0,"English, German, French, Italian, Spanish - Spain","Adventure, Indie",0,"{'Puzzle': 145, 'Exploration': 119, 'Indie': 1..."
1908990,Froggy Quest,,,,0,0,0,0-20000,0,0,0,0,,,,,,0,[]
473445,Depth Crawler Gold Coat,SANDLOT,D3 PUBLISHER,,4,0,0,0-20000,0,0,0,0,199.0,199.0,0.0,"English, Japanese",Action,0,{'Action': 21}
1179560,Gachi Heroes 2: Flexboll,Lohsoft,Lohsoft,,196,39,0,0-20000,0,0,0,0,99.0,99.0,0.0,"English, Russian, French, German, Spanish - Sp...","Action, Adventure, Casual, Indie, Simulation",1,"{'Memes': 168, 'Comedy': 160, 'Funny': 155, 'C..."
941957,Devil May Cry 5 - Gerbera GP01,"CAPCOM Co., Ltd.","CAPCOM Co., Ltd.",,5,1,0,0-20000,0,0,0,0,99.0,99.0,0.0,"English, French, Italian, German, Spanish - Sp...",Action,0,"{'Action': 22, 'Nudity': 22, 'Violent': 21}"
686720,SaGa SCARLET GRACE: AMBITIONS,Square Enix,Square Enix,,130,36,0,0-20000,0,0,0,0,2999.0,2999.0,0.0,"English, Japanese",RPG,18,"{'RPG': 163, 'JRPG': 98, 'Anime': 93, 'Choices..."


### average_forever 

Average player playtime. We have some nulls (since we don't have data for some games). We will replace it with 0 to keep consistent with data on SteamSpy.

In [190]:
print('average_forever nulls count:', storefront_s['average_forever_steamspy'].isnull().sum())
print('average_forever zero count:', storefront_s[storefront_s['average_forever_steamspy']==0]['average_forever_steamspy'].count())
with pd.option_context('display.max_colwidth', 150):
    display(storefront_s[storefront_s['average_forever_steamspy'].notnull()][['name', 'average_forever_steamspy']].sample(5))

average_forever nulls count: 0
average_forever zero count: 92720


Unnamed: 0_level_0,name,average_forever_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
981140,混乱使者,0
872450,Let's Seduce the Heroine!,0
292090,Rube Works: The Official Rube Goldberg Invention Game,0
1648120,Disgraced - Trailblazer DLC,0
1025010,Cursor - by Mr iLyn.,0


#### [Subroutine] 'average_forever': Cleaning

In [191]:
def average_forever_clean(df):
    '''
    Cleaning average_forever in SteamSpy
    '''
    df = df.copy()
    df['average_forever'].fillna(0)
    return df

In [192]:
steamspy = average_forever_clean(steamspy)

### average_2weeks 

Average player playtime in the last 2 weeks. While the data is interesting, it's only the last two weeks so it doesn't seem valuable in the long term. Going to drop.

In [193]:
print('average_2weeks nulls count:', storefront_s['average_2weeks_steamspy'].isnull().sum())
print('average_2weeks zero count:', storefront_s[storefront_s['average_2weeks_steamspy']==0]['average_2weeks_steamspy'].count())
with pd.option_context('display.max_colwidth', 150):
    display(storefront_s[storefront_s['average_2weeks_steamspy'].notnull()][['name', 'average_2weeks_steamspy']].sample(5))

average_2weeks nulls count: 0
average_2weeks zero count: 102622


Unnamed: 0_level_0,name,average_2weeks_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
819480,Pizza Titan Ultra,0
1486120,2020: THE RIDE,0
209100,Resident Evil: Operation Raccoon City,0
1111470,DOA6 Seaside Eden Costume - Bass,0
1079130,Hentai Cowgirl,0


### median_forever 

Median player playtime. We have some nulls (since we don't have data for some games). We will replace it with 0 to keep consistent with data on SteamSpy.

In [194]:
print('median_forever nulls count:', storefront_s['median_forever_steamspy'].isnull().sum())
print('median_forever zero count:', storefront_s[storefront_s['median_forever_steamspy']==0]['median_forever_steamspy'].count())
with pd.option_context('display.max_colwidth', 150):
    display(storefront_s[storefront_s['median_forever_steamspy'].notnull()][['name', 'median_forever_steamspy']].sample(5))

median_forever nulls count: 0
median_forever zero count: 92720


Unnamed: 0_level_0,name,median_forever_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
352188,FSX: Steam Edition - Santa Barbara Airport (KSBA) Add-On,0
659140,Titan Saga: Chains of Kronos,0
1859220,疯狂厨房VR,0
1520360,Divine Dust,0
1013040,Planetoid Pioneers Online,0


#### [Subroutine] 'median_forever': Cleaning

In [195]:
def median_forever_clean(df):
    '''
    Cleaning average_forever in SteamSpy
    '''
    df = df.copy()
    df['median_forever'].fillna(0)
    return df

In [196]:
steamspy = median_forever_clean(steamspy)

### median_2weeks 

Median player playtime in the last 2 weeks. While the data is interesting, it's only the last two weeks so it doesn't seem valuable in the long term. Going to drop.

In [197]:
print('median_2weeks nulls count:', storefront_s['median_2weeks_steamspy'].isnull().sum())
print('median_2weeks zero count:', storefront_s[storefront_s['median_2weeks_steamspy']==0]['median_2weeks_steamspy'].count())
with pd.option_context('display.max_colwidth', 150):
    display(storefront_s[storefront_s['median_2weeks_steamspy'].notnull()][['name', 'median_2weeks_steamspy']].sample(5))

median_2weeks nulls count: 0
median_2weeks zero count: 102622


Unnamed: 0_level_0,name,median_2weeks_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
1323480,No One Can Ever Know,0
447540,Lost Cosmonaut,0
566340,cloudphobia - OST,0
1013180,Funbag Fantasy,0
1061880,Conan Chop Chop,0


### ccu 

Peak concurrent user count. This is a very interesting stat. Sadly, it's not a lifetime stat, but the stat for the day before the dataset is downloaded so it's not useful for analysis. Going to drop.

In [198]:
print('ccu nulls count:', storefront_s['ccu_steamspy'].isnull().sum())
print('ccu zero count:', storefront_s[storefront_s['ccu_steamspy']==0]['ccu_steamspy'].count())
with pd.option_context('display.max_colwidth', 150):
    display(storefront_s[storefront_s['ccu_steamspy'].notnull()][['name', 'ccu_steamspy']].sample(5))

ccu nulls count: 0
ccu zero count: 88551


Unnamed: 0_level_0,name,ccu_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
834401,Train Simulator: Amtrak SDP40F Loco Add-On,0
450740,Mind Unleashed,0
380231,Trove - Arcanium Expedition Pack,0
2052740,Horse Shelter 2022 - Prologue,0
542400,Pixel Gladiator,1


### tags

User tags data. It includes both the tag and the number of users that put the tag. A dict with 'tag_name':tag_count elements.

I'll save just the tags themselves in the main table and move tags with tag numbers to the separate one.

In [199]:
print('tags nulls count:', storefront_s['tags_steamspy'].isnull().sum())
print('tags empty list count:', storefront_s[~storefront_s['tags_steamspy'].apply(lambda x: False if pd.isna(x) else bool(ast.literal_eval(x)))].shape[0])
with pd.option_context('display.max_colwidth', 150):
    display(storefront_s[storefront_s['tags_steamspy'].notnull()][['name', 'tags_steamspy']].sample(5))

tags nulls count: 0
tags empty list count: 42728


Unnamed: 0_level_0,name,tags_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
326720,Instant Dungeon!,"{'Indie': 35, 'Action': 34, 'Action Roguelike': 25, 'Arcade': 18, 'Retro': 18, 'Great Soundtrack': 12, 'Rogue-like': 11, 'Dungeon Crawler': 6}"
900560,RIDE 3 - Naked Bikes Pack,[]
1736250,Code Dread,"{'Adventure': 54, 'Survival Horror': 49, 'Action': 45, 'Multiplayer': 41, 'Co-op': 38, 'Online Co-Op': 33, 'Action-Adventure': 23, 'Exploration': ..."
1002300,Fear & Hunger,[]
1149000,Monster Loves You Too!,[]


#### [Subroutine] 'tags': Cleaning

In [200]:
def clean_tags(df, export=False):
    '''
    Processing SteamSpy tags with possible export.
    For exporting, we are spreading the tags to columns and put the number of users using the said tag as a value
    tags are renamed to comply with pandas column names requirements    
    
    We are leaving only the tags themselves in the table
    '''    
    if export: 
        
        tag_data = df[['tags']].copy()
        
        def parse_export_tags(x):
            if pd.isnull(x):
                return {}
            x = ast.literal_eval(x)

            if isinstance(x, dict):
                return x
            elif isinstance(x, list):
                return {}
            else:
                raise TypeError('Something other than dict or list found')

        tag_data['tags'] = tag_data['tags'].apply(parse_export_tags)

        # Getting all tags for column names
        cols = set(itertools.chain(*tag_data['tags']))

        # And setting the user values
        for col in sorted(cols):
            col_name = col.lower().replace(' ', '_').replace('-', '_').replace("'", '')

            tag_data[col_name] = tag_data['tags'].apply(lambda x: x[col] if col in x.keys() else 0)

        tag_data = tag_data.drop('tags', axis=1)
        
        export_data(tag_data, 'steamspy_tag_data', index=True)
        print('Exported tag data')
        
        
    def parse_tags(x):
        if pd.isnull(x):
            return pd.NA
        x = ast.literal_eval(x)
        
        if isinstance(x, dict):
            return list(x.keys())
        else:
            return pd.NA
    
    df['tags'] = df['tags'].apply(parse_tags)
       
    return df

In [201]:
steamspy = clean_tags(steamspy, export = True)

  tag_data[col_name] = tag_data['tags'].apply(lambda x: x[col] if col in x.keys() else 0)


Exported steamspy_tag_data to "../data/export/steamspy_tag_data.csv"
Exported tag data


In [202]:
#Verifying exported data
pd.read_csv('../data/export/steamspy_tag_data.csv').sample(10)

Unnamed: 0,appid,1980s,1990s,2.5d,2d,2d_fighter,2d_platformer,360_video,3d,3d_fighter,...,web_publishing,well_written,werewolves,western,word_game,world_war_i,world_war_ii,wrestling,zombies,e_sports
30508,739010,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
82934,1632630,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
101622,1955390,0,0,0,0,0,0,0,166,0,...,0,0,0,0,0,0,0,0,0,0
41811,935550,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
39985,903244,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
98441,1889420,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
76996,1526370,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
48071,1042440,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
15203,462100,0,0,0,17,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9818,368600,0,6,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### [Subroutine] SteamSpy: dropping columns

In [203]:
steamspy = steamspy.drop([
        'name', 'developers', 'publishers', 'score_rank', 'total_positive', 'total_negative', 'review_score',
    'price', 'initialprice', 'discount', 'supported_languages', 'genres', 'average_2weeks', 'median_2weeks',
    'ccu'
    ], axis=1)

After the processing, our SteamSpy data table will look like this:

In [204]:
steamspy.sample(10)

Unnamed: 0_level_0,owners,average_forever,median_forever,tags
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
513091,0-20000,0,0,
434633,0-20000,0,0,[Simulation]
746620,20000-50000,0,0,"[Casual, Simulation, Management, Retro, Lore-R..."
1394320,0-20000,0,0,"[Adventure, Casual, Action-Adventure, Platform..."
1182370,100000-200000,0,0,"[Horror, Free to Play, Indie, Singleplayer, Su..."
1352721,0-20000,0,0,
747350,50000-100000,41,6,"[Action, Adventure, Indie, Violent, Female Pro..."
1930910,0-20000,0,0,
1266719,0-20000,0,0,
1674020,0-20000,0,0,"[Early Access, Adventure, RPG, Survival, Sandb..."


We'll combine it with storefront and review data into one Steam data table:

In [205]:
steam = pd.concat([storefront,reviews,steamspy], ignore_index=False, axis = 1)

In [206]:
steam.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 104344 entries, 10 to 2060820
Data columns (total 32 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   type                     104344 non-null  object 
 1   name                     104344 non-null  object 
 2   required_age             104344 non-null  int64  
 3   dlc                      9804 non-null    object 
 4   fullgame                 35362 non-null   object 
 5   supported_languages      104180 non-null  object 
 6   drm_notice               723 non-null     object 
 7   ext_user_account_notice  1057 non-null    object 
 8   developers               104300 non-null  object 
 9   publishers               104301 non-null  object 
 10  demos                    7275 non-null    object 
 11  packages                 82422 non-null   object 
 12  platforms                104344 non-null  object 
 13  categories               104228 non-null  object 
 14  ge

In [207]:
steam.sample(10)

Unnamed: 0_level_0,type,name,required_age,dlc,fullgame,supported_languages,drm_notice,ext_user_account_notice,developers,publishers,...,review_score,total_positive,total_negative,metacritic_score,metacritic_url,rating,owners,average_forever,median_forever,tags
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1587622,dlc,Sniper Ghost Warrior Contracts 2 - shroud DLC,0,,1338770.0,"[Czech, English, French, German, Italian, Japa...",,,['CI Games'],['CI Games'],...,5.0,6.0,8.0,,,46.018206,0-20000,0,0,
385940,game,Battles of the Ancient World,0,,,[English],,,['HexWar Games'],['HexWar Games'],...,5.0,13.0,9.0,,,55.553491,0-20000,0,0,[Strategy]
745912,dlc,Game Tengoku - Voice Mix,0,,663130.0,"[English, Japanese]",,,"['CITY CONNECTION', 'Chara-Ani']",['Degica'],...,0.0,2.0,0.0,,,64.079515,0-20000,0,0,[Action]
463120,game,MOP Operation Cleanup,0,,,[English],,,['EnsenaSoft'],['EnsenaSoft'],...,0.0,0.0,2.0,,,35.920485,0-20000,0,0,"[Action, Strategy, Casual, Platformer, 2D, Sin..."
1346580,dlc,Gravity - Creator Bundle,0,,1333090.0,[English],,,['Daniel Gelber'],['Daniel Gelber'],...,0.0,0.0,0.0,,,50.0,0-20000,0,0,
922550,game,keyg,0,,,"[English, Russian, Spanish - Latin America, Sp...",,,['Axel Sonic'],['Axel Sonic'],...,8.0,107.0,17.0,,,77.807156,0-20000,0,0,"[Indie, Puzzle, 2D, Minimalist, Difficult, Sil..."
647370,dlc,VR SUPER SPORTS - Golf,0,,593240.0,"[English, Japanese, Simplified Chinese, Tradit...",,,['SAT-BOX'],['SAT-BOX'],...,0.0,2.0,1.0,,,55.686457,0-20000,0,0,
1490340,game,We Are OFK,0,,,[English],,,['Team OFK'],['Team OFK'],...,0.0,0.0,0.0,,,50.0,0-20000,0,0,
1611850,game,The Marionette 提线木偶,0,,,[Simplified Chinese],,,['雪見いつか'],['Ningbing Games'],...,7.0,12.0,1.0,,,73.191543,0-20000,0,0,"[RPG, JRPG, 2D Platformer, Indie, 2D, Psycholo..."
870680,game,Automatum,0,,,"[English, Spanish - Spain]",,,['Old Point Games'],['Old Point Games'],...,0.0,3.0,2.0,,,54.168865,0-20000,0,0,"[Indie, Action]"


# Finalizing table structure

After the processing have these tables available:

* steam
* steam_description_data
* steam_media_data
* steam_packages_info
* steam_requirements_data
* steam_support_info
* steamspy_tag_data
* missing_ids

Sadly, there are a lot of optional data in the steam table so it might be a good idea to move it to the optional table and join with the main table when necessary. The fields that go to the **steam_optional** are:

* drm_notice
* ext_user_account_notice
* demos
* content_descriptors
* metacritic_score
* metacritic_url

### [Subroutine] steam and steam_optional export

In [208]:
def steam_export(df):
    '''
    Creating steam_optional table and exporting both steam and steam_optional
    '''
    df = df.copy()
    # copying necessary columns into new df
    steam_optional_df = df[[
        'drm_notice',
        'ext_user_account_notice',
        'demos',
        'content_descriptors',
        'metacritic_score',
        'metacritic_url',
    ]].copy()
    
    # removing empty rows
    steam_optional_df.dropna(how = 'all', inplace=True)
           
    # dropping unnneeded columns from the main dataframe
    df = df.drop([
        'drm_notice',
        'ext_user_account_notice',
        'demos',
        'content_descriptors',
        'metacritic_score',
        'metacritic_url',
    ], axis=1)
    
    export_data(df, 'steam', index=True)
    export_data(steam_optional_df, 'steam_optional', index=True)

In [209]:
steam_export(steam)

Exported steam to "../data/export/steam.csv"
Exported steam_optional to "../data/export/steam_optional.csv"


In [210]:
# Verifying exported steam data
pd.read_csv('../data/export/steam.csv').sample(5)

Unnamed: 0,appid,type,name,required_age,dlc,fullgame,supported_languages,developers,publishers,packages,...,coming_soon,price,review_score,total_positive,total_negative,rating,owners,average_forever,median_forever,tags
777,33220,game,Tom Clancy's Splinter Cell Conviction™,0,[33372],,"['English', 'French', 'German', 'Italian', 'Sp...",['Ubisoft Montreal'],['Ubisoft'],"[4261, 4261, 4158]",...,False,14.99,6.0,2349.0,905.0,70.243896,100000-200000,385,250,"['Action', 'Co-op', 'Stealth', 'Multiplayer', ..."
52962,1119070,game,Shibui Coliseum,0,,,['English'],['Soiree Games'],['Soiree Games'],[379082],...,False,4.99,0.0,0.0,0.0,50.0,0-20000,0,0,
6159,305200,dlc,Heroes Rise: The Prodigy - Perfect Legend Guide,0,,299540.0,['English'],['Choice of Games'],['Choice of Games'],[45191],...,False,0.79,0.0,0.0,3.0,32.940629,0-20000,0,0,"['RPG', 'Indie']"
31738,758010,game,Big NEON Tower VS Tiny Square,0,,,"['English', 'French', 'German', 'Portuguese - ...",['Evil Objective'],['EO Interactive Ltd.'],[223641],...,False,3.99,7.0,24.0,0.0,81.02647,50000-100000,0,0,"['Precision Platformer', '2D Platformer', 'Par..."
6044,302450,dlc,Van Helsing II: Magic Pack,0,,272470.0,"['English', 'French', 'German', 'Hungarian', '...",['NeocoreGames'],['NeocoreGames'],[44560],...,False,0.99,0.0,5.0,2.0,59.969853,0-20000,0,0,"['Action', 'Adventure', 'RPG', 'Indie']"


In [211]:
# Verifying exported steam_optional data
pd.read_csv('../data/export/steam_optional.csv').sample(5)

Unnamed: 0,appid,drm_notice,ext_user_account_notice,demos,content_descriptors,metacritic_score,metacritic_url
16486,1517490,,,,Drugs,,
1448,251470,,,,,87.0,https://www.metacritic.com/game/pc/towerfall-a...
17247,1566150,,,,Occasional visual violence. Infrequent blood e...,,
11348,1206460,,,,This DLC unlocks an option of full nudity in t...,,
12542,1265070,,,,References to sexual abuse\r\nViolence\r\nGore...,,


### [Subroutine] missing_ids export

In [212]:
export_data(missing_ids, 'missing_ids', index=True)

Exported missing_ids to "../data/export/missing_ids.csv"


# Combined clean-up script

# TODO

In [213]:
def combined_cleanup():
    
    
    return True

# Tests

In [216]:
# Testing if the number of rows is consistent between the pre and post processing
def row_check():
    pre_count = pd.read_csv('../data/processing/steam_app_data.csv').shape[0]
    print('Number of rows before processing:', pre_count)
    missing_count_pre = pd.read_csv('../data/processing/missing_ids.csv').shape[0]
    print('Number of missing before processing:', missing_count_pre)
    post_count = pd.read_csv('../data/export/steam.csv').shape[0]
    print('Number of rows after processing:', post_count)
    missing_count_post = pd.read_csv('../data/export/missing_ids.csv').shape[0]
    print('Number of missing after processing:', missing_count_post)
    if ((pre_count + missing_count_pre) <= (post_count + missing_count_post)):
        return True
    return False

In [217]:
print('Number of rows test results:', row_check())

  print('Number of rows test results:', row_check())


Number of rows before processing: 105035
Number of missing before processing: 3
Number of rows after processing: 104344
Number of missing after processing: 710
Number of rows test results: True
