# Steam Data Analysis. Analysis of the datasets structure and cleanup

## Introduction

After data gathering, we have four csv files:

* `steam_app_data.csv`: Application and DLC data for all IDs from Steam Storefront (2022, April 26)
* `steamspy_data.csv`: Application data from SteamSpy for the same IDs (2022, April 27)
* `steamreviews_data.csv`: Summary review data from Steam API (2022, April 28)
* `missing_ids.csv`: List of the Apps not included in the dataset

Almost all the data necessary for the analysis should be at the `steam_app_data.csv`.
In `steamspy_appid.csv` we have additional information which might be very useful:

* Positive Reviews (count)
* Negative Reviews (count)
* Average and Medians of Concurrent Players (several columns)
* Peak Concurrent Players (ccu column)
* Owners estimate, by using Steam Spy algorithm (wide ranges)
* Tags (list)

Due to how data is gathered on SteamSpy there might be some discrepancies so the third dataset `steamreviews_data.csv` with the review summary data was downloaded from the Steam AppReviews API and used as an additional source of information:

* Review Score
* Review Score (as description string)
* Positive Reviews (count)
* Negative Reviews (count)
* Total Reviews (count)

In this notepad I'll go through each of the data table comparing them and taking notice for the clean-up and column parsing when necessary. There goals here are: 

* Prepare the table structure that will be exported and used later in the analysis/visualization creation
* Make the fields/tables as easy to uperate later in analysis as possible
* Keep as much data as possible (even with the null fields - even these data might be useful for the dataset users)
* Document the changes and prepare a streamlined automated process for the future updates

In [1]:
# Module imports
# standard library imports
import csv
import datetime as dt
import json
import os
import statistics
import time
import re
import ast
import itertools

# third-party imports
import numpy as np
import seaborn as sns
import pandas as pd

In [2]:
# Loading data tables
storefront = pd.read_csv("../data/processing/steam_app_data.csv")
steamspy = pd.read_csv("../data/processing/steamspy_data.csv")
reviews = pd.read_csv("../data/processing/steamreviews_data.csv")
missing_ids = pd.read_csv("../data/processing/missing_ids.csv")

  exec(code_obj, self.user_global_ns, self.user_ns)


In [3]:
# Setting some constants
#usd/eu exchange rate at the time of collection
usd_eu_rate = 0.95
#date of the dataset collection
df_collection_date = pd.Timestamp(2022,4,28)

### Utility functions
Let's define some utility functions used for processing and troubleshooting:

In [4]:
def getSteamLink(df):
    """
        Give us the name and links to any subseries of apps, for troubleshooting.
    """
    for item in df.index:
        print(df.loc[item]["name"]+" https://store.steampowered.com/app/"+str(item))

In [5]:
def export_data(df, filename, index=False, list_columns = []):
    """
    Export dataframe to csv file in export folder'.
    
        filename: file name string without file extension
        index: boolean, to export index as well or not
        list_columns: list columns to transform from "['item']" to the simple 
            ';' delimited list
    """
    filepath = '../data/export/' + filename + '.csv'
    
    def list_convert(input_list):
        try:       
            return ';'.join(str(item) for item in input_list)
        except Exception as ex:
            print(input_list)
            print(ex)
            raise(ex)
    
    for col in list_columns:
        df[col].fillna({i: [] for i in storefront.index},inplace = True)
        df[col] = df[col].apply(lambda x: list_convert(x))

        
    df.to_csv(filepath, index=index)

    print("Exported {} to '{}'".format(filename, filepath))

In [6]:
def boolean_df(item_lists, unique_items):
    """
    Create boolean dataframe from from the item list series and 
    a list of unique item values
    
        items_lists: pandas series with item lists
        unique_items: list with the unique item valaues
    
    """
    
    # Create empty dict
    bool_dict = {}
    
    # Loop through all the tags
    for i, item in enumerate(unique_items):
        
        # Apply boolean mask
        bool_dict[item] = item_lists.apply(lambda x: item in x)
            
    # Return the results as a dataframe
    return pd.DataFrame(bool_dict)

In [7]:
# utility function to add the removed ids to the missing_ids
def removeIDs(df, ids_list, reason):
    """
    Remove ids and add them to the missing_ids with the reason
    """
    global missing_ids
    
    # removing ids from df
    df = df.loc[~df.index.isin(ids_list)]
    
    # adding ids to the missing_ids
    temp_df = pd.DataFrame(reason, index = ids_list,
                                              columns =['reason'])
    temp_df.index.name = "appid"
    missing_ids = pd.concat([missing_ids, temp_df]).reset_index(drop=True)
    return df

## Preparing data

As I've noted earlier, here 

Let's start with the overall structure of our tables - number of columns, total data counts and the amount of non-null data.

In [8]:
storefront.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103185 entries, 0 to 103184
Data columns (total 39 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   type                     102514 non-null  object
 1   name                     102504 non-null  object
 2   steam_appid              103185 non-null  int64 
 3   required_age             102514 non-null  object
 4   is_free                  102514 non-null  object
 5   controller_support       25497 non-null   object
 6   dlc                      9696 non-null    object
 7   detailed_description     102357 non-null  object
 8   about_the_game           102356 non-null  object
 9   short_description        102353 non-null  object
 10  fullgame                 34588 non-null   object
 11  supported_languages      102333 non-null  object
 12  header_image             102514 non-null  object
 13  website                  60070 non-null   object
 14  pc_requirements     

In [9]:
steamspy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103198 entries, 0 to 103197
Data columns (total 20 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   appid            103198 non-null  int64  
 1   name             102949 non-null  object 
 2   developer        92441 non-null   object 
 3   publisher        83266 non-null   object 
 4   score_rank       52 non-null      float64
 5   positive         103198 non-null  int64  
 6   negative         103198 non-null  int64  
 7   userscore        103198 non-null  int64  
 8   owners           103198 non-null  object 
 9   average_forever  103198 non-null  int64  
 10  average_2weeks   103198 non-null  int64  
 11  median_forever   103198 non-null  int64  
 12  median_2weeks    103198 non-null  int64  
 13  price            92798 non-null   float64
 14  initialprice     92809 non-null   float64
 15  discount         92809 non-null   float64
 16  languages        92588 non-null   obje

In [10]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102453 entries, 0 to 102452
Data columns (total 6 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   appid              102453 non-null  int64  
 1   review_score       102413 non-null  float64
 2   review_score_desc  102413 non-null  object 
 3   total_positive     102413 non-null  float64
 4   total_negative     102413 non-null  float64
 5   total_reviews      102413 non-null  float64
dtypes: float64(4), int64(1), object(1)
memory usage: 4.7+ MB


We have roughly 100000 App IDs in each of the tables.

**Steam Storefront** data has some seemingly optional information in these columns: *dlc*, *fullgame*, *website*, *legal_notice*, *drm_notice*, *ext_user_account_notice*,   *demos*, *metacritic*, *reviews*, *movies*, *recommendations*, *achievements*. 

*developers*, *publishers*, *demos*, *price_overview*, *packages* have quite a big number of nulls that definetely need some investigating. Some other columns also haave a small number of null data.

**Steam Reviews** data don't seem to have any nulls.

**SteamSpy** data also have some fields with a noticeable amount of nulls: *developer*, *publisher*, *score_rank*, *price*, *initialprice*, *discount*, *languages*, *genre*.

The total numbers of App IDs is a bit different between the table. There is one noticeable "Feature" in the Steam Storefront API - it doesn't return the data for the games that are not available in the regioin. I've downloaded the data from the Netherlands and it might explain some games missing as they are not available in the region. The small difference between the Steam Reviews and SteamSpy might be caused by the different dates the data was gathered.

There is some data that appears in two data tables. Since the data might differ both in the format and content, I'll check both and decide how they are handled as we move along.

| Field 1 | Field 2 |
| --- | --- |
| storefront.name | steamspy.name |
| developers | developer |
| publishers | publisher |
| storefront.price_overview | steamspy.price/initialprice/discount |
| storefront.genres | steamspy.genre |
| storefront.supported_languages | steamspy.languages |
| reviews.review_score | steamspy.userscore |
| reviews.total_positive | steamspy.positive |
| reviews.total_negative | steamspy.negative |

I'll start with the **most important fields** to check if we'll have to remove some data right from the start.

### Unique IDs

We just said the app ids are unique... But we should check if we have duplicated app ids in our dataframes. We used an iterative process, and it could be possible that some ids when requested redirect us to a new id. This has been observed trying to access directly in the Steam Store page with some of the "missing" ids. For instance, different versions of Guild Wars 2 all lead us to a unique store page on Steam, as the old versions do not exist anymore.

In [11]:
storefront["steam_appid"].duplicated().sum()

0

In [12]:
steamspy["appid"].duplicated().sum()

0

In [13]:
reviews["appid"].duplicated().sum()

0

There are might be some duplicates in tables - I'll need to check the data collecting functions to remove the possibility of the duplicates getting in laters. For now I'll just clean it up:

In [14]:
storefront = storefront.drop_duplicates(subset="steam_appid", keep="last")
steamspy = steamspy.drop_duplicates(subset="appid", keep="last")
reviews = reviews.drop_duplicates(subset="appid", keep="last")

# Steam Storefront table
### Name

In [15]:
steamspy[steamspy["name"].isnull()].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 249 entries, 3859 to 102835
Data columns (total 20 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   appid            249 non-null    int64  
 1   name             0 non-null      object 
 2   developer        8 non-null      object 
 3   publisher        8 non-null      object 
 4   score_rank       0 non-null      float64
 5   positive         249 non-null    int64  
 6   negative         249 non-null    int64  
 7   userscore        249 non-null    int64  
 8   owners           249 non-null    object 
 9   average_forever  249 non-null    int64  
 10  average_2weeks   249 non-null    int64  
 11  median_forever   249 non-null    int64  
 12  median_2weeks    249 non-null    int64  
 13  price            15 non-null     float64
 14  initialprice     15 non-null     float64
 15  discount         15 non-null     float64
 16  languages        11 non-null     object 
 17  genre     

In [16]:
storefront[storefront["name"].isnull()]

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,...,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
97451,,,1874520,,,,,,,,...,,,,,,,,,,
101564,,,1972370,,,,,,,,...,,,,,,,,,,
101795,,,1980220,,,,,,,,...,,,,,,,,,,
102440,,,660,,,,,,,,...,,,,,,,,,,
102441,,,8040,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103145,,,1920982,,,,,,,,...,,,,,,,,,,
103146,,,1920983,,,,,,,,...,,,,,,,,,,
103147,,,1920984,,,,,,,,...,,,,,,,,,,
103148,,,1920985,,,,,,,,...,,,,,,,,,,


In [17]:
storefront[storefront["name"].isnull()]["steam_appid"]

97451     1874520
101564    1972370
101795    1980220
102440        660
102441       8040
           ...   
103145    1920982
103146    1920983
103147    1920984
103148    1920985
103149    1925800
Name: steam_appid, Length: 681, dtype: int64

In [18]:
steamspy[steamspy["name"].isnull()]["appid"]

3859       257302
6612       315210
15099      460250
16272      487170
19220      537390
           ...   
102685     952112
102792    1001520
102821    1074060
102827    1158760
102835    1219280
Name: appid, Length: 249, dtype: int64

#### Name overview
Judging by the quick overview of the blank game names, there seems to be multiple causes for it:
* The application is not present in Steam
* The application is a recent release that hasn't been parsed by SteamSpy properly yet
* The application is not released yet
* The 'application' is a DLC/DLC bundle
* The application has an emoticon in the name

Let's do a crosscheck between SteamSpy and Steam Storefront data:

In [19]:
storefront[storefront["steam_appid"].isin(steamspy[steamspy["name"].isnull()]["appid"].values)].sample(10)

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,...,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
98486,game,God s' Margarita: The Lonely Reniat Noc,1898020,0.0,False,full,,<strong>God s' Margarita</strong> is a story-d...,<strong>God s' Margarita</strong> is a story-d...,Œíecome the latest hope of humanity! Embody the...,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256872544, 'name': ""God s' Margarita T...",,,"{'coming_soon': True, 'date': '9 Aug, 2022'}","{'url': 'https://www.sittingass.com/', 'email'...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [1, 5], 'notes': ""God s' Margarita has..."
95183,game,Crystal Tales Tactics: Echoes of the Libertas War,1829850,0.0,False,full,,"<h1>CHECK OUT OTHER GAMES</h1><p><a href=""http...","<img src=""https://cdn.akamai.steamstatic.com/s...",Inspired by Fire Emblem and Final Fantasy Tact...,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '2', 'description': 'Strategy'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': True, 'date': 'TBA'}","{'url': '', 'email': 'maledollstudio@gmail.com'}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
102546,,,440290,,,,,,,,...,,,,,,,,,,
16262,dlc,Call to Arms - Basic Edition,487170,16.0,False,,,Call to Arms offers an innovative mix of real-...,Call to Arms offers an innovative mix of real-...,Call to Arms offers an innovative mix of real-...,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': ''}",{'url': 'http://digitalmindsoft.eu/about-us/co...,https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
93387,game,Mojito the Cat: 3d Puzzle,1798950,0.0,False,,,Mojito the Cat: 3D Puzzle Cat labyrinth game w...,Mojito the Cat: 3D Puzzle Cat labyrinth game w...,Mojito the Cat: 3D Puzzle Cat labyrinth game w...,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256874187, 'name': 'Mojito the Cat: 3d...",,,"{'coming_soon': True, 'date': '2022'}","{'url': '', 'email': 'support@gtzastudio.com'}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
85007,game,DELTARUNE,1671210,0.0,False,,,"<h2 class=""bb_tag"">The next adventure in the <...","<h2 class=""bb_tag"">The next adventure in the <...","UNDERTALE's parallel story, DELTARUNE. Meet ne...",...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '37', 'description': 'Free to Play'}, ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256849454, 'name': 'TRAILER_TEASER', '...",,,"{'coming_soon': True, 'date': ''}","{'url': 'http://deltarune.com', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
100298,dlc,Cool Kid Cody - Season 1 Episode 07,1939315,0.0,False,,,<strong>Episode 7: Pulp Friction!</strong><br>...,<strong>Episode 7: Pulp Friction!</strong><br>...,Episode 7: Pulp Friction!After a lazy Saturday...,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '25', 'description': 'Adventure'}, {'i...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '20 May, 2022'}",{'url': 'https://www.ninetypercentstudios.com/...,https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
85188,dlc,Capcom Arcade Stadium Pack 2: Arcade Revolutio...,1674611,0.0,False,full,,Play ten early arcade favorites for one low pr...,Play ten early arcade favorites for one low pr...,Play ten early arcade favorites for one low pr...,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '24 May, 2021'}",{'url': 'http://www.capcom.co.jp/support/conta...,https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
102669,,,940080,,,,,,,,...,,,,,,,,,,
62013,dlc,Fall Guys - Gordon Headcrab Preorder Bonus,1261620,0.0,False,full,,Get to the Head-Crab of the pack with this ama...,Get to the Head-Crab of the pack with this ama...,Get to the Head-Crab of the pack with this ama...,...,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': ''}","{'url': 'http://fallguys.com', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"


#### Are there any duplicate names?

In [20]:
storefront["name"].value_counts()[storefront["name"].value_counts()>1]

Alone             6
Lost              4
Bounce            4
Space Survival    4
Vortex            3
                 ..
Bunker Down       2
Clan Wars         2
Memoria           2
The House         2
Tomorrow          2
Name: name, Length: 372, dtype: int64

In [21]:
storefront[storefront["name"]=="['']"]

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,...,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors


We have some duplicate names, but they are really different games. There is an interesting case with "Fantasy Grounds - Aegis of Empires 1: The Book in the Old House" that is actually 3 different applications with the same name from the same developer.

Just in case, let's check also for some weird names.

In [22]:
storefront[storefront["name"].apply(lambda x: len(str(x)) < 6)]["name"].value_counts()

Alone    6
Lost     4
Maze     3
Arena    3
Surge    3
        ..
Algae    1
Soter    1
Kings    1
TOK      1
MIST     1
Name: name, Length: 2628, dtype: int64

In [23]:
storefront[storefront["name"].isin(["none","None","na","Na","False","false",0,"","invalid","Invalid"])]

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,...,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors


#### [Subroutine] Name cleaning

All games from the store database have valid names, except those that we should clearly remove. We keep the rest of the column from store as is.

* Replace the ["none","None","na","Na","False","false",0,"","invalid","Invalid"] names with NaN

In [24]:
#Replacing incorrect columns with NaN (or delete them)
def cleanName(storefront, remove_data = False):
    badnames = ["none","None","na","Na","False","false",0,"","invalid","Invalid",np.nan]
    if (remove_data):
        remove_ids = storefront[storefront.name.isin(badnames)].index.tolist()
        storefront = removeIDs(storefront, remove_ids, "Missing app name")
    else:
        storefront['name'].mask(storefront.name.isin(badnames), np.nan, inplace=True )
    return storefront

In [25]:
storefront = cleanName(storefront, True)

In [26]:
storefront.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 102504 entries, 0 to 103184
Data columns (total 39 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   type                     102504 non-null  object
 1   name                     102504 non-null  object
 2   steam_appid              102504 non-null  int64 
 3   required_age             102504 non-null  object
 4   is_free                  102504 non-null  object
 5   controller_support       25497 non-null   object
 6   dlc                      9696 non-null    object
 7   detailed_description     102353 non-null  object
 8   about_the_game           102352 non-null  object
 9   short_description        102349 non-null  object
 10  fullgame                 34586 non-null   object
 11  supported_languages      102329 non-null  object
 12  header_image             102504 non-null  object
 13  website                  60069 non-null   object
 14  pc_requirements     

In [27]:
storefront[storefront["name"].isnull()]

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,...,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors


### type

**Type** is an application type (you can actually designate it when downloading the data from the Steamfront. I've set it to download both dlc's and games). Besides the ones I've designated to download, there seems to be one special application reserved for Steam Gift Cards and some applications that don't have "type" set:


In [28]:
storefront["type"].value_counts(dropna=False)

game           67870
dlc            34632
advertising        1
music              1
Name: type, dtype: int64

Let's take a look on the appliations that don't have the type set up:

In [29]:
storefront[storefront["type"].isnull()]

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,...,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors


It seems like these applications don't have anything set besides the appid and name. It might be either test/removed applications or the ones that don't have the data filled in. Since they don't have any valuable data, I consider them safe to remove.

#### [Subroutine] Type cleaning

All games from the store database have valid types, except those that we should clearly remove. We keep the rest of the column from store as is.

* Replace the ["none","None","na","Na","False","false",0,"","invalid","Invalid"] names with NaN

In [30]:
#Replacing incorrect columns with NaN (or delete them)
def cleanType(storefront, remove_data = False):
    badnames = ["none","None","na","Na","False","false",0,"","invalid","Invalid",np.nan]
    if (remove_data):
        remove_ids = storefront[storefront.type.isin(badnames)].index.tolist()
        storefront = removeIDs(storefront, remove_ids, "Missing app type")
    else:
        storefront['type'].mask(storefront.type.isin(badnames), np.nan, inplace=True )
    return storefront

In [31]:
storefront = cleanType(storefront, True)

In [32]:
storefront.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 102504 entries, 0 to 103184
Data columns (total 39 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   type                     102504 non-null  object
 1   name                     102504 non-null  object
 2   steam_appid              102504 non-null  int64 
 3   required_age             102504 non-null  object
 4   is_free                  102504 non-null  object
 5   controller_support       25497 non-null   object
 6   dlc                      9696 non-null    object
 7   detailed_description     102353 non-null  object
 8   about_the_game           102352 non-null  object
 9   short_description        102349 non-null  object
 10  fullgame                 34586 non-null   object
 11  supported_languages      102329 non-null  object
 12  header_image             102504 non-null  object
 13  website                  60069 non-null   object
 14  pc_requirements     

### Developers

Compared to publishers where the store dataset has no null values, we have a few missing developers. Let's check them just in case.

In [33]:
storefront[storefront["developers"].isnull()]

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,...,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
270,game,Tycoon City: New York,9730,0.0,False,,,<h1>Special Offer</h1><p>Officially Licensed T...,Here's your chance to make it big in the Big A...,Here's your chance to make it big in the Big A...,...,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '28', 'description': 'Simulation'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,{'total': 174},,"{'coming_soon': False, 'date': '12 Mar, 2008'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
324,game,Crash Time 2,11390,0.0,False,,,Solve exciting criminal cases on the mean stre...,Solve exciting criminal cases on the mean stre...,Crash Time 2 is an open-world combat racing ga...,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256810412, 'name': 'Crash Time 2 Steam...",{'total': 1082},,"{'coming_soon': False, 'date': '27 Aug, 2009'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
790,game,18 Wheels of Steel: Extreme Trucker,33730,0.0,False,,,You ‚Äòda Boss! Move it better and faster while ...,You ‚Äòda Boss! Move it better and faster while ...,You ‚Äòda Boss! Move it better and faster while ...,...,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '28', 'description': 'Simulation'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,{'total': 110},,"{'coming_soon': False, 'date': '23 Sep, 2009'}","{'url': 'https://playhardgames.net/contact/', ...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
791,game,Prison Tycoon 4: SuperMax,33750,0.0,False,,,Hard Time is Money <br>\t\t\t\t\t\tBuild a pro...,Hard Time is Money <br>\t\t\t\t\t\tBuild a pro...,Hard Time is Money Build a profitable privatel...,...,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '28', 'description': 'Simulation'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': ''}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
1230,dlc,Mafia II - Vegas DLC,50142,0.0,False,,,,,,...,"[{'id': 21, 'description': 'Downloadable Conte...",,,,,,"{'coming_soon': False, 'date': ''}","{'url': '', 'email': ''}",,"{'ids': [5], 'notes': None}"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
92059,dlc,X-Plane 11 - Add-on: FeelThere - KRDU - Raleig...,1776630,0,False,,,"(IATA: RDU, ICAO: KRDU, FAA LID: RDU), locally...","(IATA: RDU, ICAO: KRDU, FAA LID: RDU), locally...","(IATA: RDU, ICAO: KRDU, FAA LID: RDU), locally...",...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '28', 'description': 'Simulation'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '7 Oct, 2021'}",{'url': 'https://helpdesk.aerosoft.com/portal/...,https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
96060,game,Age of Empires IV Content Editor,1846820,0,False,,,,,,...,,,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '7 Apr, 2022'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
96198,dlc,OMSI 2 - Add-on Irisbus Familie ‚Äì Citybus Pack,1849680,0,False,full,,With the OMSI AddOn Irisbus Family Citybus Pac...,With the OMSI AddOn Irisbus Family Citybus Pac...,With the OMSI AddOn Irisbus Family Citybus Pac...,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '21 Dec, 2021'}",{'url': 'https://helpdesk.aerosoft.com/portal/...,https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
102354,dlc,Dread Hunger Bone Rings,2002150,0,False,,,,,,...,"[{'id': 21, 'description': 'Downloadable Conte...",,,,,,"{'coming_soon': False, 'date': ''}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}"


There are around 300 entries without developers - there are some games which are no longer available, some are retro games which some publisher has the right to, but the developer is unlisted and in some cases publisher just never filled in the developer field.

In [34]:
storefront["developers"].value_counts().head(30)

['SmiteWorks USA, LLC']                          2293
['TigerQiuQiu']                                  2242
['Ubisoft - San Francisco']                      1677
['KOEI TECMO GAMES CO., LTD.']                   1469
['CAPCOM Co., Ltd.']                              513
['Dovetail Games']                                394
['Milestone S.r.l.']                              255
['N3V Games']                                     239
['Tamsoft']                                       207
['The Digital Puzzle Company']                    206
['Harmonix Music Systems, Inc']                   196
['Paradox Development Studio']                    192
['Laush Dmitriy Sergeevich']                      183
['Nihon Falcom']                                  182
['Choice of Games']                               171
['Rebellion']                                     161
['Square Enix', 'KOEI TECMO GAMES CO., LTD.']     152
['Capcom']                                        147
['Creobit']                 

In [35]:
storefront[storefront["developers"]=="['']"]

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,...,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors


As we'll see in the later sectioins, it seems like [''] is a placeholder in Steam for mandatory values which are not filled, or have been deleted.

In [36]:
steamspy[steamspy["appid"].isin(storefront[storefront["developers"].isnull()]["steam_appid"].values)]["developer"].value_counts().head(60)

‰∏ÄÊ¨°ÂÖÉÂàõ‰ΩúÁªÑ                              3
Christian tavares da silva          3
Valve                               1
IPBuilders                          1
Lesson of Passion                   1
BitLight                            1
Atomic Jelly                        1
Wally Hardmaker, Wally Hardmaker    1
Kangeado games                      1
Name: developer, dtype: int64

#### [Subroutine] Developers: Cleaning

* First we will merge storefront and steamspy, keeping storefront data unless we have a NaN
* This process is identical to other columns that appear in multiple dataframes so we'll go through all of them before the actual data merge (with Steam data always having priority over Steam Spy data)
* Also, we'll have to adjust the column data to the same format as it's different between the datasets.
* Then we will copy the publisher name into the developer, for the game cases without developers. Games with other missing information we will take care of afterwards.

In [37]:
# To simplify cleaning, let's change appid and steam_appid to the appid and make it an index (since we already made sure it's unique)
# Since we will be using df.fillna(df2) later, it would be useful to change similar column names so keep them identicall across different datasets.
def renameIDs(storefront,steamspy,reviews,missing_ids):
    storefront = storefront.rename(columns={"steam_appid":"appid"})
    storefront = storefront.set_index("appid")
    steamspy = steamspy.rename(columns={"genre":"genres", "developer":"developers", "publisher":"publishers",
                              "languages":"supported_languages","userscore":"review_score","positive":"total_positive",
                                "negative":"total_negative"})
    steamspy = steamspy.set_index("appid")
    reviews = reviews.set_index("appid")
    missing_ids = missing_ids.set_index("appid")
    return storefront, steamspy, reviews, missing_ids

In [38]:
storefront, steamspy, reviews,missing_ids = renameIDs(storefront,steamspy,reviews,missing_ids)

In [39]:
# This is the function that fills the null data in maindf with the data from the subdf.
# In this function, the index from both dataframes must be the same - the old appid in our case.
# Also, the column names where we will be getting our values should also be the same.
# Lastly, ideally we would the values to be formatted in the same way - but we can also check later.
def updateFromAlternateSource(maindf,subdf):
    df = maindf.copy()
    df = df.fillna(subdf)
    return df

Now we could actually run this function and update the developers from Steam Spy. But the data is formatted differently in some columns and this will be a problem when filling the null data

We will have to take this into account when formatting these columns, as the information from Steam Spy will be added for the NaN.

### Publishers

It seemed that the publishers were ok, as we have no NaN. However, there are a lot of blank names. This is probably a mandatory metadata from Steam, and some ids have managed to not put a publisher whatsoever doing that.

Let's look at them, if there are valid ones (i.e ones who have a developer) we can consider them self-published and just do the same as before, copying the developer name into the publisher.

In [40]:
storefront["publishers"].value_counts()

['']                              9648
['TigerQiuQiu']                   2238
['Degica']                        1519
['KOEI TECMO GAMES CO., LTD.']    1387
['Dovetail Games - Trains']        602
                                  ... 
['Paracosmic Illusions']             1
['HCPGames']                         1
['Angelo Parodi']                    1
['Velikan']                          1
['1actose']                          1
Name: publishers, Length: 37918, dtype: int64

In [41]:
(storefront["publishers"]=="['']").sum()

9648

In [42]:
storefront[(storefront["publishers"]=="['']") & (storefront["developers"].isnull())]

Unnamed: 0_level_0,type,name,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,...,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
50142,dlc,Mafia II - Vegas DLC,0.0,False,,,,,,,...,"[{'id': 21, 'description': 'Downloadable Conte...",,,,,,"{'coming_soon': False, 'date': ''}","{'url': '', 'email': ''}",,"{'ids': [5], 'notes': None}"
218064,dlc,BIT.TRIP Presents... Runner2: Future Legend of...,0.0,False,,,BIT.TRIP Presents... Runner2: Future Legend of...,BIT.TRIP Presents... Runner2: Future Legend of...,BIT.TRIP Presents... Runner2: Future Legend of...,,...,"[{'id': 21, 'description': 'Downloadable Conte...",,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '26 Feb, 2013'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
218980,game,Patterns,0.0,False,,,Create worlds beyond your imagination in Patte...,Create worlds beyond your imagination in Patte...,Create worlds beyond your imagination in Patte...,,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 2028932, 'name': 'Patterns Trailer 2',...",{'total': 108},,"{'coming_soon': False, 'date': ''}",{'url': 'http://www.buildpatterns.com/#!commun...,https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
222860,game,Left 4 Dead 2 Dedicated Server,0.0,False,,,,,,,...,,,,,,,"{'coming_soon': False, 'date': ''}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}"
224880,game,Equate Game,0.0,False,,,,,,,...,"[{'id': 2, 'description': 'Single-player'}, {'...",,,,,,"{'coming_soon': True, 'date': ''}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1688010,game,GotG Dedicated Server,0,False,,,,,,,...,,,,,,,"{'coming_soon': False, 'date': '10 Sep, 2021'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}"
1763330,game,Polyslime,0,True,,,In this game the goal is simply to survive as ...,In this game the goal is simply to survive as ...,An action survival game where you will craft w...,,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256853022, 'name': 'Trailer1', 'thumbn...",,,"{'coming_soon': False, 'date': '13 Oct, 2021'}","{'url': '', 'email': 'sugmastudios@gmail.com'}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
1846820,game,Age of Empires IV Content Editor,0,False,,,,,,,...,,,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '7 Apr, 2022'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
2002150,dlc,Dread Hunger Bone Rings,0,False,,,,,,,...,"[{'id': 21, 'description': 'Downloadable Conte...",,,,,,"{'coming_soon': False, 'date': ''}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}"


In [43]:
storefront[storefront["publishers"]=="['']"].sample(10)

Unnamed: 0_level_0,type,name,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,...,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
409200,dlc,ePic Character Generator - Season #1: Dwarf Male,0.0,False,,,This Fantasy themed package can be used to cre...,This Fantasy themed package can be used to cre...,This Fantasy themed package can be used to cre...,"{'appid': '408930', 'name': 'ePic Character Ge...",...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '53', 'description': 'Design & Illustr...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '4 Nov, 2015'}",{'url': 'http://epicgenerator.net/index.php/fo...,https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
988016,dlc,Groove Coaster - Jukusei Jozo Hakkosei‚óé-Space ...,0.0,False,full,,Original music DLC for Groove Coaster<br><stro...,Original music DLC for Groove Coaster<br><stro...,Original music DLC for Groove CoasterGenre:Ori...,"{'appid': '744060', 'name': 'Groove Coaster'}",...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256741609, 'name': 'Trailer', 'thumbna...",,,"{'coming_soon': False, 'date': '4 Feb, 2019'}","{'url': '', 'email': 'games@degica.com'}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
1375200,dlc,Noda Full,0.0,False,,,Noda is a space to build and share 3D mental m...,Noda is a space to build and share 3D mental m...,"For those with more to explore, Noda Full offe...","{'appid': '578060', 'name': 'Noda'}",...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '9 Oct, 2020'}","{'url': 'http://noda.io', 'email': 'contact@no...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
637778,dlc,Rocksmith¬Æ 2014 Edition ‚Äì Remastered ‚Äì The Pre...,0.0,False,,,Play &quot;Don‚Äôt Get Me Wrong&quot; by The Pre...,Play &quot;Don‚Äôt Get Me Wrong&quot; by The Pre...,Play &quot;Don‚Äôt Get Me Wrong&quot; by The Pre...,"{'appid': '221680', 'name': 'Rocksmith¬Æ 2014 E...",...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '14 Nov, 2017'}","{'url': 'https://support.ubi.com/', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
1198480,dlc,Freestyle2 - Steady Settlement Package,0.0,False,,,"<strong>**Once you purchase this DLC, log into...","<strong>**Once you purchase this DLC, log into...",This package contains all of the essential ite...,"{'appid': '339610', 'name': 'Freestyle 2: Stre...",...,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '26 Nov, 2019'}",{'url': 'http://freestyle2.joycitygames.com/su...,https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
688960,dlc,Cynoclept: The Game Soundtrack,0.0,False,,,"<img src=""https://cdn.akamai.steamstatic.com/s...","<img src=""https://cdn.akamai.steamstatic.com/s...",Couldn't get enough of the shredding main menu...,"{'appid': '688880', 'name': 'Cynoclept: The Ga...",...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '73', 'description': 'Violent'}, {'id'...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '17 Aug, 2017'}","{'url': 'http://www.cynoclept.com', 'email': '...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [2, 5], 'notes': None}"
779210,dlc,Êò•È£é | Spring Breeze --Soundtrack DLC,0.0,False,,,Spring Breeze --Soundtrack DLC<br />\r\n<br />...,Spring Breeze --Soundtrack DLC<br />\r\n<br />...,"Romantic college campus love, would you pick o...","{'appid': '692790', 'name': 'Êò•È£é | Spring Breeze'}",...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '25', 'description': 'Adventure'}, {'i...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256705645, 'name': 'DLC Soundtrack DLC...",,,"{'coming_soon': True, 'date': 'Jan 2018'}","{'url': '', 'email': 'ghostwing@msn.cn'}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
760340,dlc,The Black Watchmen - Whitechapel,0.0,True,,,"<h2 class=""bb_tag"">Rumors have emerged suggest...","<h2 class=""bb_tag"">Rumors have emerged suggest...",Rumors have emerged suggesting the return of o...,"{'appid': '349220', 'name': 'The Black Watchmen'}",...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '25', 'description': 'Adventure'}, {'i...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256701884, 'name': 'Whitechapel', 'thu...",,,"{'coming_soon': False, 'date': '15 Dec, 2017'}","{'url': 'http://forums.blackwatchmen.com', 'em...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
861440,dlc,Fantasy Grounds - 13th Age Bestiary (13th Age),0.0,False,,,"<h2 class=""bb_tag""><strong>13th Age Combined B...","<h2 class=""bb_tag""><strong>13th Age Combined B...",13th Age Combined Bestiary200 NEW FOES FOR THE...,"{'appid': '252690', 'name': 'Fantasy Grounds C...",...,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '23', 'description': 'Indie'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '29 May, 2018'}","{'url': '', 'email': 'support@fantasygrounds.c...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
445260,dlc,Character: Naotora Ii,16.0,False,,,<h1>Featured DLC</h1><p>We have the bundle &qu...,"With lightning fast leg strikes, Naotora Ii le...",Download this to use Naotora Ii in Dead or Ali...,"{'appid': '311730', 'name': 'DEAD OR ALIVE 5 L...",...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '28 Mar, 2016'}",{'url': 'http://www.koeitecmoamerica.com/suppo...,https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"


It seems like Steam Storefront uses [''] to fill the empty data in the mandatory fields. We'll change it to NaN for the easier filtering and do the same in the other fields.

Is it possible that some of these values were registered at some point by Steam Spy and conserved? Let's check that, if not we will simply treat them like NaNs.

Also, it seems like we even have some apps with both Publisher and Developer data being empty. It's either in the games removed from Steam or in ithe DLCs where the game creater was lazy and didn't fill the relevant data in the DLC package, so we can take it from the parent app.

In [44]:
(~steamspy[steamspy.index.isin(storefront[storefront["publishers"]=="['']"].index)]["publishers"].isnull()).sum()

205

It seems we can recover some values from Steam Spy, now that we have discovered that this supposedly complete column had some NaNs..

#### Publisher/Others: Cleaning Decision

* I.e using `storefront = storefront.replace("['']", np.NaN)` we should catch any [''] fields in the steam database, which we thought more complete. Then merge ids, using the Steam Store value (if available) and falling back to Steam Spy if possible.


* If there is no publisher, but we have a developer, then we will use the developer as publisher as well. If there is no publisher or developer, we will simply delete the record.

* If it's we have neither and it's a DLC we'll check the parent app

* If neither option succeed, we'll replace the values with np.nan to keep the null data consistent.


#### [Subroutine] Publishers: Cleaning

In [45]:
#Replace empty data with the parent app data
#{'appid': '1141390', 'name': 'The Blitzkrieg:'}
def getParentValue(row, column):
    if (pd.isna(row[column])) & (not pd.isna(row['fullgame'])):
        try:
            appid2 = int(ast.literal_eval(row['fullgame'])['appid'])
            parent_row = storefront.loc[appid2]
            return parent_row[column]
        except:
            return row[column]
    else:
        return row[column]

In [46]:
#Getting the data from other column
def getOtherColumnValue(row,current,alternate):
    if pd.isna(row[current]):
        return row[alternate]
    else:
        return row[current]

In [47]:
#Get other column and if it's not available - parent
def getOtherOrParentColumnValue(row,current,alternate):
    if pd.isna(row[current]):
        if (pd.isna(row[alternate])) & (not pd.isna(row['fullgame'])):
            try:
                appid2 = int(ast.literal_eval(row['fullgame'])['appid'])
                parent_row = storefront.loc[appid2]
                return parent_row[current]
            except:
                return row[current]
        else:
            return row[alternate]
    else:
        return row[current]

In [48]:
#Fixing data for publishers/developers
def fixDevPub(storefront, steamspy):
    storefront = storefront.replace("['']", np.NaN)
    storefront = updateFromAlternateSource(storefront,steamspy)
    storefront["developers"] = storefront.apply(getOtherOrParentColumnValue, current="developers", alternate="publishers", axis=1)
    storefront["publishers"] = storefront.apply(getOtherOrParentColumnValue, current="publishers", alternate="developers", axis=1)
    return storefront

Running this function will get any values from steam spy which are useful from the repeated columns. We have also eliminated the empty string values and replaced them with NaN, to ensure our cleaning functions detect them properly.

However, note that we have also updated genres and languages by doing it this way...

In [49]:
storefront = fixDevPub(storefront, steamspy)

In [50]:
storefront.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 102504 entries, 10 to 2008820
Data columns (total 38 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   type                     102504 non-null  object
 1   name                     102504 non-null  object
 2   required_age             102504 non-null  object
 3   is_free                  102504 non-null  bool  
 4   controller_support       25497 non-null   object
 5   dlc                      9696 non-null    object
 6   detailed_description     102353 non-null  object
 7   about_the_game           102352 non-null  object
 8   short_description        102349 non-null  object
 9   fullgame                 34586 non-null   object
 10  supported_languages      102352 non-null  object
 11  header_image             102504 non-null  object
 12  website                  60069 non-null   object
 13  pc_requirements          102504 non-null  object
 14  mac_requirements  

It seems we still have some rows with publisher/developer data not available.

### Genres

There are 2 similar types of data here. We have genres and categories. Genres are present in both datasets, categories - only in Storefront.

The stucture for these columns is quite similar - it's a list of dictionaries similar to {'id': 'N', 'description': 'XXX}. We have a couple of approaches when analysing data here - unwrap the list of dictionaries for each row into the list of genres/categories and either:

1) Keep them in the same column as a simple list of items.
2) Spread the list (with the item being the column name and binary value of the item present in the row) and keep it in the same table.
3) Move the list into a separate table with appid being the key and the rest of the columns - categories with binary value.
4) Transform that said wide table into the long one with the 'appid' and 'category' column.

These approaches have different advantaged/disadvantages but for all of them we'll have to unwrap the dictionaries into a simple list of values.

In [51]:
storefront["genres"].value_counts()

[{'id': '1', 'description': 'Action'}]                                                                                                                                                                                                                                                                                                                     5393
[{'id': '1', 'description': 'Action'}, {'id': '23', 'description': 'Indie'}]                                                                                                                                                                                                                                                                               4982
[{'id': '4', 'description': 'Casual'}, {'id': '23', 'description': 'Indie'}]                                                                                                                                                                                                                            

In [52]:
steamspy["genres"].value_counts()

Action                                                                                          5076
Action, Indie                                                                                   4574
Casual, Indie                                                                                   4314
Action, Casual, Indie                                                                           4077
Action, Adventure, Indie                                                                        3531
                                                                                                ... 
Sexual Content, Adventure, RPG, Strategy                                                           1
Adventure, Free to Play, Massively Multiplayer, RPG, Simulation, Early Access                      1
Action, Adventure, Free to Play, Indie, Massively Multiplayer, Strategy, Early Access              1
Action, Adventure, Casual, Free to Play, Indie, Massively Multiplayer, Racing, RPG, Strateg

If there are no single commas inside any genre, it would make sense to list them exactly like Steam Spy has done. If not, we will look for a different character, or even just splitting it into a list, but something clearer than this dict form in string available for the Steam Store.

In [53]:
storefront[storefront["genres"]=="['']"].shape[0]

0

In [54]:
storefront["genres"].isnull().sum()

189

In [55]:
#unwrapping list of dictionaries into to the list
#remove the NaN valueus while we are at it
def extractDictList(jsonDict, key):
    if jsonDict != jsonDict:
        return np.NaN
    else:
        try:
            evalList = eval(jsonDict)
            items = []
            if(type(evalList) == dict):
                if (evalList[key]!=np.nan):
                    items.append(evalList[key])
                return items
            else:
                for dictionary in evalList:
                        if (dictionary[key]!=np.nan):
                            items.append(dictionary[key])
                return items
        except :
            return np.NaN

A little explanation of above. Most games are indeed formatted with a dict inside. But there are a few ones (48), that after closer inspection already had the genre column formatted into the games of the genres separated by commas. Of these ones, there is only one valid game (one game that still exists in the store), https://store.steampowered.com/app/22330/The_Elder_Scrolls_IV_Oblivion_Game_of_the_Year_Edition/

This was actually recovered with the update function we defined and executed above with the developers and publishers, the information is coming from steam spy.

If there is no proper item in the list or it's empty, the value will be set as NaN.

#### [Subroutine] Genres: Cleaning

In [56]:
storefront["genres"] = storefront["genres"].apply(extractDictList, key="description")

In [57]:
storefront.genres.explode().value_counts(dropna=False)

Indie                    65616
Action                   43012
Casual                   37213
Adventure                34405
Simulation               23366
Strategy                 21610
RPG                      21014
Free to Play              8196
Early Access              7890
Sports                    4826
Racing                    4008
Massively Multiplayer     3236
Design & Illustration     1748
Web Publishing            1645
Violent                    816
Utilities                  516
Gore                       502
Animation & Modeling       356
Education                  280
Software Training          258
Nudity                     252
Sexual Content             246
Game Development           228
Video Production           222
Photo Editing              196
NaN                        193
Audio Production           165
Accounting                   5
Movie                        3
Documentary                  1
Episodic                     1
Short                        1
Tutorial

In [58]:
storefront[storefront["genres"].isnull()].sample(10)

Unnamed: 0_level_0,type,name,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,...,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
473930,advertising,Steam Gift Cards,0.0,False,,,<h1>Now Available from your favorite retailer<...,"Visit Best Buy, GameStop, GAME, 7-Eleven, EB G...",Give the gift of games through Steam Wallet Ca...,,...,,,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '1 May, 2016'}","{'url': 'www.steampowered.com/wallet/support',...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
549990,game,Cobalt Dedicated Server,0.0,False,,,,,,,...,,,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '13 Feb, 2018'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
354570,dlc,Artizens Official Soundtrack Vol. 1,0.0,False,,,Volume 1 of the Official Artizens Soundtrack b...,Volume 1 of the Official Artizens Soundtrack b...,Volume 1 of the Official Artizens Soundtrack b...,"{'appid': '339540', 'name': 'Artizens'}",...,"[{'id': 21, 'description': 'Downloadable Conte...",,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '27 Mar, 2015'}","{'url': '', 'email': 'support@artizensonline.c...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
1846820,game,Age of Empires IV Content Editor,0.0,False,,,,,,,...,,,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '7 Apr, 2022'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
56436,dlc,"Warhammer 40,000: Dawn of War II - Retribution...",0.0,False,,,Includes 4 unique ability and wargear items to...,Includes 4 unique ability and wargear items to...,Includes 4 unique ability and wargear items to...,"{'appid': '56400', 'name': 'Warhammer 40,000: ...",...,"[{'id': 21, 'description': 'Downloadable Conte...",,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '28 Feb, 2011'}","{'url': 'https://support.sega.co.uk', 'email':...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
315394,dlc,The Crew‚Ñ¢ Extreme Car Pack,0.0,False,,,This pack includes 3 new cars: Aston Martin V1...,This pack includes 3 new cars: Aston Martin V1...,This pack includes 3 new cars: Aston Martin V1...,"{'appid': '241560', 'name': 'The Crew‚Ñ¢'}",...,"[{'id': 21, 'description': 'Downloadable Conte...",,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '20 Jan, 2015'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
275171,dlc,"Modelset 1 - Railstation, Houses, Barn",0.0,False,,,<strong>MODELSET 1 for Railroad X</strong><br>...,<strong>MODELSET 1 for Railroad X</strong><br>...,MODELSET 1 for Railroad XThis MODELSET contain...,"{'appid': '251020', 'name': 'Railroad X'}",...,"[{'id': 21, 'description': 'Downloadable Conte...",,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '9 Jun, 2014'}","{'url': 'www.eep4u.com', 'email': 'support@rai...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
457600,game,Emerge: Cities of the Apocalypse,0.0,False,,,Emerge: Cities of the Apocalypse places player...,Emerge: Cities of the Apocalypse places player...,"Part turn-based resource management, part real...",,...,"[{'id': 2, 'description': 'Single-player'}, {'...",,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256662240, 'name': 'Emerge: Cities of ...",{'total': 129},"{'total': 29, 'highlighted': [{'name': 'Throwa...","{'coming_soon': False, 'date': '27 Apr, 2016'}","{'url': '', 'email': 'emil_v_dweller@yahoo.com'}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
493130,dlc,Breached - Original Soundtrack,0.0,False,,,Original soundtrack with piano variations.<h2 ...,Original soundtrack with piano variations.<h2 ...,Original soundtrack with piano variations.,"{'appid': '460640', 'name': 'Breached'}",...,"[{'id': 21, 'description': 'Downloadable Conte...",,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '22 Jun, 2016'}","{'url': '', 'email': 'mail@breached-game.com'}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
34278,game,Sonic 3D Blast‚Ñ¢,0.0,False,,,Dr. Eggman (AKA Dr. Robotnik) discovers unusua...,Dr. Eggman (AKA Dr. Robotnik) discovers unusua...,Dr. Eggman (AKA Dr. Robotnik) discovers unusua...,,...,"[{'id': 2, 'description': 'Single-player'}, {'...",,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256663653, 'name': 'SMDC ESRB', 'thumb...",,,"{'coming_soon': False, 'date': '1 Jun, 2010'}","{'url': 'https://support.sega.co.uk', 'email':...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"


In [59]:
storefront.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 102504 entries, 10 to 2008820
Data columns (total 38 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   type                     102504 non-null  object
 1   name                     102504 non-null  object
 2   required_age             102504 non-null  object
 3   is_free                  102504 non-null  bool  
 4   controller_support       25497 non-null   object
 5   dlc                      9696 non-null    object
 6   detailed_description     102353 non-null  object
 7   about_the_game           102352 non-null  object
 8   short_description        102349 non-null  object
 9   fullgame                 34586 non-null   object
 10  supported_languages      102352 non-null  object
 11  header_image             102504 non-null  object
 12  website                  60069 non-null   object
 13  pc_requirements          102504 non-null  object
 14  mac_requirements  

### Categories

Let's see what we have in categories.

In [60]:
categories_check = storefront["categories"].apply(extractDictList, key="description")
categories_check

appid
10         [Multi-player, PvP, Online PvP, Shared/Split S...
20         [Multi-player, PvP, Online PvP, Shared/Split S...
30                  [Multi-player, Valve Anti-Cheat enabled]
40         [Multi-player, PvP, Online PvP, Shared/Split S...
50         [Single-player, Multi-player, Valve Anti-Cheat...
                                 ...                        
2004490                                      [Single-player]
2004650                                      [Single-player]
2004670          [Single-player, Partial Controller Support]
2007870                                      [Single-player]
2008820             [Single-player, Full controller support]
Name: categories, Length: 102504, dtype: object

In [61]:
categories_check.explode().value_counts(dropna=False)

Single-player                    93252
Steam Achievements               50195
Downloadable Content             34632
Steam Cloud                      28915
Multi-player                     27544
Full controller support          25497
Steam Trading Cards              19918
Partial Controller Support       18783
Co-op                            15520
Steam Leaderboards               13490
PvP                              13362
Online PvP                       12580
Shared/Split Screen              10188
Online Co-op                      8482
Remote Play Together              6866
Cross-Platform Multiplayer        6466
Shared/Split Screen PvP           6150
Stats                             5492
Steam Workshop                    4972
In-App Purchases                  4771
Shared/Split Screen Co-op         4726
Includes level editor             3297
Remote Play on TV                 2742
Captions available                2164
MMO                               2027
Remote Play on Tablet    

#### [Subroutine] Categories: Cleaning

In [62]:
storefront["categories"] = storefront["categories"].apply(extractDictList, key="description")

There are some apps with null categories and with null genres.

In [63]:
storefront[storefront["categories"].isnull()].sample(15)

Unnamed: 0_level_0,type,name,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,...,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
992080,game,SpaBerry VR Experience,0.0,True,,,Ever been in a hot tub fully dressed? How abou...,Ever been in a hot tub fully dressed? How abou...,The world's first and only hot tub VR experien...,,...,,,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256737214, 'name': 'Spaberry VR Experi...",,,"{'coming_soon': False, 'date': '18 Dec, 2018'}","{'url': 'https://thespaberry.com', 'email': 'e...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
302130,game,Brink of Consciousness: Dorian Gray Syndrome C...,0.0,False,,,Venture into the realm of a madman to free you...,Venture into the realm of a madman to free you...,"To free your beloved from captivity, you must ...",,...,,"[Adventure, Casual]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 2033543, 'name': 'Dorian Gray', 'thumb...",{'total': 151},,"{'coming_soon': False, 'date': '18 Jul, 2014'}","{'url': 'http://support.encore.com', 'email': ...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
439660,game,Tower Unite Dedicated Server,0.0,False,,,,,,,...,,,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '29 Mar, 2016'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
1110390,game,Unturned - Dedicated Server,0.0,False,,,,,,,...,,,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '21 Jun, 2019'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
390500,game,Arma 3 Samples,0.0,True,,,"Formerly part of <a href=""http://store.steampo...","Formerly part of <a href=""http://store.steampo...",Samples of assets for Arma 3 which provide a s...,,...,,,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': ''}","{'url': 'http://feedback.arma3.com/', 'email':...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
39530,game,Painkiller: Black Edition,0.0,False,,,Painkiller Black Edition includes Painkiller a...,Painkiller Black Edition includes Painkiller a...,Painkiller Black Edition includes the expansio...,,...,,[Action],"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,{'total': 1111},,"{'coming_soon': False, 'date': '24 Jan, 2007'}","{'url': 'https://helpcenter.kochmedia.com', 'e...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [2, 5], 'notes': None}"
1184960,game,Escalation,0.0,False,,,"DONT BUY IT, THE PUBLISHER IS BANKRUPT AND I'M...","DONT BUY IT, THE PUBLISHER IS BANKRUPT AND I'M...","DONT BUY IT, THE PUBLISHER IS BANKRUPT AND I'M...",,...,,[Early Access],,,,,"{'coming_soon': False, 'date': '1 Feb, 2020'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}"
780820,game,XOXO Blood Droplets,0.0,True,,,"XOXO Blood Droplets is an absurd, dark comedy,...","XOXO Blood Droplets is an absurd, dark comedy,...",A collection of very short stories. XOXO Blood...,,...,,"[Casual, Free to Play, Indie]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256733474, 'name': 'XOXO Blood Droplet...",,,"{'coming_soon': False, 'date': '30 Oct, 2019'}","{'url': '', 'email': 'gb.patch.games@gmail.com'}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [2, 5], 'notes': 'XOXO Blood Droplets ..."
206760,game,Painkiller: Recurring Evil,18.0,False,,,Battle never ceases in the realm known as Hell...,Battle never ceases in the realm known as Hell...,Battle never ceases in the realm known as Hell...,,...,,[Action],"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,{'total': 233},,"{'coming_soon': False, 'date': '29 Feb, 2012'}","{'url': 'https://helpcenter.kochmedia.com', 'e...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
338640,game,Subsiege,0.0,False,,,"<img src=""https://cdn.akamai.steamstatic.com/s...","<img src=""https://cdn.akamai.steamstatic.com/s...",Subsiege is an intense real-time tactic game w...,,...,,,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256729398, 'name': 'Release Trailer', ...",,,"{'coming_soon': False, 'date': '7 Sep, 2018'}","{'url': 'http://subsiege-game.com/', 'email': ...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"


There are actually tons of useful metadata here. This seems to be what is shown at the steam store webpage at the right.

This might be usefull for the different ways we can group and analyse the data later, like achievement availability, controller supporot and console ports (if we'll get a console games dataset, for example).


### required_age

In [64]:
storefront["required_age"].value_counts()

0.0        63344
0          36367
18.0        1204
16.0         527
18           297
17.0         178
12           174
12.0         152
16            89
15.0          45
13.0          37
7.0           14
3.0           14
15             7
17             6
13             5
7              5
3              5
10.0           4
14.0           4
11.0           3
10             3
18+            2
11             2
1.0            2
6              2
6.0            2
20             1
ÔºëÔºò             1
19.0           1
14             1
99999.0        1
5.0            1
4.0            1
20.0           1
171.0          1
12+            1
Name: required_age, dtype: int64

In [65]:
getSteamLink(storefront[storefront["required_age"]==18.0])

Quake 4 https://store.steampowered.com/app/2210
QUAKE https://store.steampowered.com/app/2310
Company of Heroes - Legacy Edition https://store.steampowered.com/app/4560
Condemned: Criminal Origins https://store.steampowered.com/app/4720
Hitman: Blood Money https://store.steampowered.com/app/6860
Hitman: Codename 47 https://store.steampowered.com/app/6900
Men of War‚Ñ¢ https://store.steampowered.com/app/7830
NecroVision https://store.steampowered.com/app/7860
Just Cause 2 https://store.steampowered.com/app/8190
BioShock¬Æ 2 https://store.steampowered.com/app/8850
Borderlands Game of the Year https://store.steampowered.com/app/8980
RAGE https://store.steampowered.com/app/9200
Call of Duty: World at War https://store.steampowered.com/app/10090
Manhunt https://store.steampowered.com/app/12130
Max Payne https://store.steampowered.com/app/12140
Max Payne 2: The Fall of Max Payne https://store.steampowered.com/app/12150
Grand Theft Auto: Episodes from Liberty City https://store.steampowered.c

This column is really messy. Values seem to have different types (integer, floating and even string), some values are very suspicious (171.0 and 99999.0). 0 seems to mean "no restriction". According to PEGI the values should be 3, 7, 12, 16 and 18 but age restrictions might vary from country to country so having a lot of different numbers is understandable. 
The detailed rated content description is explained in "content_descriptors" column:

In [66]:
storefront[storefront["required_age"]==18.0].content_descriptors.value_counts()

{'ids': [], 'notes': None}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          637
{'ids': [2, 5], 'notes': None}                                                                                                                                                                                                                                                                                                                                                                                                                  

I'll transform required_age in this way:
* **required_age**: change value to the same type, parsing strings if necessary. Leave strange age as is?

#### [Subroutine] Required_age: Cleaning

In [67]:
#Getting integer age from the data
def getAge(age):
    age = str(age)
    try:
        x = re.search("\d+", age).group()
        x = int(x)
    except:
        return np.NaN
    return x

In [68]:
#Cleaning up age
storefront["required_age"] = storefront["required_age"].apply(getAge)

In [69]:
storefront["required_age"].value_counts()

0        99711
18        1504
16         616
12         327
17         184
15          52
13          42
7           19
3           19
10           7
11           5
14           5
6            4
1            2
20           2
171          1
4            1
5            1
19           1
99999        1
Name: required_age, dtype: int64

### content_descriptors

As we've seen above, 'content_descriptors' is a JSON object consisting of 'ids' and 'notes'. Sadly, it seems like 'ids' doesn't have any correlation with either 'notes' or 'required_age' and seems like some internal ID. So I've opted to only extract the 'notes'

**content_descriptors**: extract 'notes' dictionaries to the string, set to NaN if the string equals to "none", "na", etc.

#### [Subroutine] Content_descriptors: Cleaning

In [70]:
#unwrapping list of dictionaries into to the item
#return the NaN values on error
def extractDictItem(jsonDict, key):
    if jsonDict != jsonDict:
        return np.NaN
    else:
        try:
            evalList = eval(jsonDict)
            if(type(evalList) == dict):
                if (evalList[key]!=np.nan):
                    item = evalList[key]
                return item
            else:
                return evalList
        except :
            return np.NaN

In [71]:
#extracting 'notes' dictionaries to the list, set empty or invalid ones to NaN
def cleanContentDesc(storefront):
    badstrings = ["none","None","na","Na","False","false",0,"","invalid","Invalid","\r\n"]
    storefront["content_descriptors"] = storefront["content_descriptors"].apply(extractDictItem, key="notes")
    storefront["content_descriptors"].mask(storefront.content_descriptors.isin(badstrings), np.nan, inplace=True )
    return storefront

In [72]:
storefront = cleanContentDesc(storefront)

In [73]:
storefront["content_descriptors"].value_counts(dropna=False)

NaN                                                                                                                                                                                                89483
This Game may contain content not appropriate for all ages, or may not be appropriate for viewing at work: Frequent Violence or Gore, Partial Nudity, Sexual Content                                 464
Nakedness.\r\nAll characters appearing in this game are over 18 years of age.                                                                                                                        180
This Game may contain content not appropriate for all ages, or may not be appropriate for viewing at work: Frequent Violence or Gore, General Mature Content                                         145
This Game may contain content not appropriate for all ages, or may not be appropriate for viewing at work: Blood and Gore, Nudity, Strong Language, Use of Drugs                                    

### platforms

This is a dictionary based on the platform availability. I'll unwrap it into the list of supported platforms. Theoretically, it might be a good idea to get each platform into a separate column but it's always possible we'll see more platforms in the future (like a separate flag for Steam Deck, for example).

In [74]:
storefront["platforms"].value_counts(dropna=False)

{'windows': True, 'mac': False, 'linux': False}    73501
{'windows': True, 'mac': True, 'linux': False}     13350
{'windows': True, 'mac': True, 'linux': True}      13193
{'windows': True, 'mac': False, 'linux': True}      2442
{'windows': False, 'mac': True, 'linux': False}       12
{'windows': False, 'mac': False, 'linux': True}        5
{'windows': False, 'mac': True, 'linux': True}         1
Name: platforms, dtype: int64

#### [Subroutine] Platforms: Cleaning

In [75]:
#unwrapping list of dictionaries into to the item
def extractBoolDict(boolDict):
    if boolDict != boolDict:
        return np.NaN
    else:
        try:
            evalDict = eval(boolDict)
            if(type(evalDict) == dict):
                items = []
                for key in evalDict.keys():
                    if (evalDict[key] == True):
                        items.append(key)
                return items
            else:
                return np.NaN
        except:
            return np.NaN

In [76]:
storefront["platforms"] = storefront["platforms"].apply(extractBoolDict)
storefront["platforms"].fillna({i: [] for i in storefront.index},inplace = True)
storefront["platforms"].value_counts(dropna=False)

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas\_libs\hashtable_class_helper.pxi", line 5231, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


[windows]                73501
[windows, mac]           13350
[windows, mac, linux]    13193
[windows, linux]          2442
[mac]                       12
[linux]                      5
[mac, linux]                 1
Name: platforms, dtype: int64

### pc_requirements, mac_requirements, linux_requirements
These three columns contain information about the game system requirements. Two things to note:
* Not being available in "platforms" doesn't mean the game doesn't have system requirements for that platform (maybe for Proton/Steam Deck?).
* The empty requirements are done as the empty lists.
* The contents seem to be the same that appear on the Steam page bu thet structure of requirements themselves is not very defined (apart from having minimum/recommended).

I assume the hardware requirements for windows and linux are similar and extracting data from windows should be enough.

We'll have to split dictianary and remove the tags to do some clean-up.

I'll extract the data for PC/Mac and move the requirements to the separate table for export (while keeping the raw data as well if needed).


In [77]:
#Here is a possible example of the above:
storefront[storefront.platforms.apply(lambda x: sorted(x) == sorted(["windows", "linux"]))]["mac_requirements"].value_counts()

[]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      

Let's do a quick check before transforming the data for the final dataset:

In [78]:
temp_df =  storefront[['pc_requirements', 'mac_requirements']].copy()
#removing  rows with empty requirements
temp_df = temp_df[(temp_df['pc_requirements'] != '[]') & (temp_df['mac_requirements'] != '[]')]
#processing pc requirement data
temp_df['pc_clean'] = (temp_df['pc_requirements']
                      .str.replace(r'\\[rtn]', '', regex=True)
                      .str.replace(r'<[pbr]{1,2}>', ' ', regex=True)
                      .str.replace(r'<[\/"=\w\s]+>', '', regex=True)
                      )
temp_df['pc_clean'] = temp_df['pc_clean'].apply(lambda x: ast.literal_eval(x))
# split out minimum and recommended into separate columns
temp_df['pc_minimum'] = temp_df['pc_clean'].apply(lambda x: x['minimum'].replace('Minimum:', '').strip() if 'minimum' in x.keys() else np.nan)
temp_df['pc_recommended'] = temp_df['pc_clean'].apply(lambda x: x['recommended'].replace('Recommended:', '').strip() if 'recommended' in x.keys() else np.nan)
temp_df = temp_df.drop('pc_clean', axis=1)
#processing mac requirement data
temp_df['mac_clean'] = (temp_df['mac_requirements']
                      .str.replace(r'\\[rtn]', '', regex=True)
                      .str.replace(r'<[pbr]{1,2}>', ' ', regex=True)
                      .str.replace(r'<[\/"=\w\s]+>', '', regex=True)
                      )
temp_df['mac_clean'] = temp_df['mac_clean'].apply(lambda x: ast.literal_eval(x))
temp_df['mac_minimum'] = temp_df['mac_clean'].apply(lambda x: x['minimum'].replace('Minimum:', '').strip() if 'minimum' in x.keys() else np.nan)
temp_df['mac_recommended'] = temp_df['mac_clean'].apply(lambda x: x['recommended'].replace('Recommended:', '').strip() if 'recommended' in x.keys() else np.nan)
temp_df = temp_df.drop('mac_clean', axis=1)
temp_df

Unnamed: 0_level_0,pc_requirements,mac_requirements,pc_minimum,pc_recommended,mac_minimum,mac_recommended
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
10,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"500 mhz processor, 96mb ram, 16mb video card, ...",,"OS X Snow Leopard 10.6.3, 1GB RAM, 4GB Hard D...",
20,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"500 mhz processor, 96mb ram, 16mb video card, ...",,"OS X Snow Leopard 10.6.3, 1GB RAM, 4GB Hard D...",
30,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"500 mhz processor, 96mb ram, 16mb video card, ...",,"OS X Snow Leopard 10.6.3, 1GB RAM, 4GB Hard D...",
40,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"500 mhz processor, 96mb ram, 16mb video card, ...",,"OS X Snow Leopard 10.6.3, 1GB RAM, 4GB Hard D...",
50,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"500 mhz processor, 96mb ram, 16mb video card, ...",,"OS X Snow Leopard 10.6.3, 1GB RAM, 4GB Hard D...",
...,...,...,...,...,...,...
1990850,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,Requires a 64-bit processor and operating syst...,Requires a 64-bit processor and operating syst...,Requires a 64-bit processor and operating system,Requires a 64-bit processor and operating system
1997590,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,Requires a 64-bit processor and operating syst...,Requires a 64-bit processor and operating syst...,Requires a 64-bit processor and operating system,Requires a 64-bit processor and operating system
2003620,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,Requires a 64-bit processor and operating syst...,Requires a 64-bit processor and operating syst...,Requires a 64-bit processor and operating system,Requires a 64-bit processor and operating system
2004650,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,Requires a 64-bit processor and operating syst...,Requires a 64-bit processor and operating syst...,Requires a 64-bit processor and operating system,Requires a 64-bit processor and operating system


Seems fine, so let's proceed with the cleaning:


#### [Subroutine] 'pc_requirements', 'mac_requirements', 'linux_requirements': Cleaning

In [79]:
# Cleaning up the hardware requirements, exporting to the separate table and removing columns from the storefront
def cleanRequirements(df, export=False):
    if export:
        requirements = df[['pc_requirements', 'mac_requirements', 'linux_requirements']].copy()
        
        #remove rows with missing requirements
        requirements = requirements[(requirements['pc_requirements'] != '[]') & (requirements['mac_requirements'] != '[]')]
        
        requirements['pc_clean'] = (requirements['pc_requirements']
                              .str.replace(r'\\[rtn]', '', regex=True)
                              .str.replace(r'<[pbr]{1,2}>', ' ', regex=True)
                              .str.replace(r'<[\/"=\w\s]+>', '', regex=True)
                              )
        requirements['pc_clean'] = requirements['pc_clean'].apply(lambda x: ast.literal_eval(x))
        #processing pc requirement data
        requirements['pc_minimum'] = requirements['pc_clean'].apply(
            lambda x: x['minimum'].replace('Minimum:', '').strip() if 'minimum' in x.keys() else np.nan)
        requirements['pc_recommended'] = requirements['pc_clean'].apply(
            lambda x: x['recommended'].replace('Recommended:', '').strip() if 'recommended' in x.keys() else np.nan)
        requirements = requirements.drop('pc_clean', axis=1)
        #processing mac requirement data
        requirements['mac_clean'] = (requirements['mac_requirements']
                              .str.replace(r'\\[rtn]', '', regex=True)
                              .str.replace(r'<[pbr]{1,2}>', ' ', regex=True)
                              .str.replace(r'<[\/"=\w\s]+>', '', regex=True)
                              )
        requirements['mac_clean'] = requirements['mac_clean'].apply(lambda x: ast.literal_eval(x))
        requirements['mac_minimum'] = requirements['mac_clean'].apply(
            lambda x: x['minimum'].replace('Minimum:', '').strip() if 'minimum' in x.keys() else np.nan)
        requirements['mac_recommended'] = requirements['mac_clean'].apply(
            lambda x: x['recommended'].replace('Recommended:', '').strip() if 'recommended' in x.keys() else np.nan)
        requirements = requirements.drop('mac_clean', axis=1)
        
        export_data(requirements, 'steam_requirements_data', True)
        
    df = df.drop(['pc_requirements', 'mac_requirements', 'linux_requirements'], axis=1)  
    return df

In [80]:
storefront = cleanRequirements(storefront, True)

Exported steam_requirements_data to '../data/export/steam_requirements_data.csv'


In [81]:
#verifying hardware reqs export
pd.read_csv('../data/export/steam_requirements_data.csv').head()

Unnamed: 0,appid,pc_requirements,mac_requirements,linux_requirements,pc_minimum,pc_recommended,mac_minimum,mac_recommended
0,10,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...","500 mhz processor, 96mb ram, 16mb video card, ...",,"OS X Snow Leopard 10.6.3, 1GB RAM, 4GB Hard D...",
1,20,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...","500 mhz processor, 96mb ram, 16mb video card, ...",,"OS X Snow Leopard 10.6.3, 1GB RAM, 4GB Hard D...",
2,30,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...","500 mhz processor, 96mb ram, 16mb video card, ...",,"OS X Snow Leopard 10.6.3, 1GB RAM, 4GB Hard D...",
3,40,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...","500 mhz processor, 96mb ram, 16mb video card, ...",,"OS X Snow Leopard 10.6.3, 1GB RAM, 4GB Hard D...",
4,50,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...","500 mhz processor, 96mb ram, 16mb video card, ...",,"OS X Snow Leopard 10.6.3, 1GB RAM, 4GB Hard D...",


### 'detailed_description', 'about_the_game', 'short_description'

These three columns contain descriptive texts about the applications. They can be useful for the sentiment/recommendation analysis but they are quite 'heavy' and might be  redundant for statistical analysis hence I'll move them to the separate table as well

In [82]:
storefront[['detailed_description', 'about_the_game', 'short_description']].isnull().sum()

detailed_description    151
about_the_game          152
short_description       155
dtype: int64

Quite a few have null values. For the exported table, I'll exclude the rows where all three descriptions are empty.

#### [Subroutine] 'detailed_description', 'about_the_game', 'short_description': Cleaning

In [83]:
#Cleaning descriptions. Empty descriptions are not included into the exported table
def cleanDescriptions(df, export=False):
    """Export descriptions to external csv file then remove these columns."""
    # remove rows with missing description data
    temp_df = df.dropna(subset=['detailed_description', 'about_the_game', 'short_description'], how='all').copy()  
    
    # by default we don't export, useful if calling function later
    if export:
        # create dataframe of description columns
        description_data = temp_df[['detailed_description', 'about_the_game', 'short_description']]
        
        export_data(description_data, filename='steam_description_data', index=True)
    
    # drop description columns from main dataframe
    df = df.drop(['detailed_description', 'about_the_game', 'short_description'], axis=1)
    
    return df

In [84]:
storefront = cleanDescriptions(storefront, export=True)

Exported steam_description_data to '../data/export/steam_description_data.csv'


In [85]:
#Verifying exported data
pd.read_csv('../data/export/steam_description_data.csv').head()

Unnamed: 0,appid,detailed_description,about_the_game,short_description
0,10,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...
1,20,One of the most popular online action games of...,One of the most popular online action games of...,One of the most popular online action games of...
2,30,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...
3,40,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...
4,50,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...


### 'header_image', 'screenshots', 'background', 'movies'

These four columns contain links to the various media data about the app: header image and the background (as it appears on the Steam page), screenshots and trailers.

I don't think they are very useful for analysis but still might be helpful for extracting data for dashboards or getting game logos.
I'll keep them in the separate table as well.

In [86]:
image_cols = ['header_image', 'screenshots', 'background', 'movies']

for col in image_cols:
    print(col+':', storefront[col].isnull().sum())

storefront[image_cols].sample(10)

header_image: 0
screenshots: 158
background: 140
movies: 34144


Unnamed: 0_level_0,header_image,screenshots,background,movies
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1330500,https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 256788031, 'name': 'AHEAD Final Traile..."
1560939,https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",https://cdn.akamai.steamstatic.com/steam/apps/...,
1703210,https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 256846652, 'name': 'Gameplay', 'thumbn..."
636670,https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",https://cdn.akamai.steamstatic.com/steam/apps/...,
1989780,https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 256885049, 'name': 'Tiny Arcade Racers..."
314950,https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 2034943, 'name': 'Spectre Steam Greenl..."
1161950,https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",https://cdn.akamai.steamstatic.com/steam/apps/...,
886840,https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 256721856, 'name': 'Furfury gameplay t..."
1161140,https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 256764138, 'name': 'Release Trailer', ..."
822060,https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",https://cdn.akamai.steamstatic.com/steam/apps/...,


All apps seem to have headers but some are missing screenshots/backgrounds and a lot of them - trailers (which is understandable). As for the strucure, it seems like background and header_image have simple links while screenshots and movies are a bit more complicated. I'll keep them as is.

#### [Subroutine] 'header_image', 'screenshots', 'background', 'movies': Cleaning

In [87]:
def cleanMedia(df, export=False):
    """Remove media columns from dataframe, optionally exporting them to csv first."""
    df = df.copy()
    
    if export:
        media_df = df[df['screenshots'].notnull()].copy()
        media_data = media_df[['header_image', 'screenshots', 'background', 'movies']]
        
        export_data(media_data, 'steam_media_data', index=True)
        
    df = df.drop(['header_image', 'screenshots', 'background', 'movies'], axis=1)
    
    return df

In [88]:
storefront = cleanMedia(storefront, export=True)

Exported steam_media_data to '../data/export/steam_media_data.csv'


In [89]:
#Verifying exported data
pd.read_csv('../data/export/steam_media_data.csv').head()

Unnamed: 0,appid,header_image,screenshots,background,movies
0,10,https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",https://cdn.akamai.steamstatic.com/steam/apps/...,
1,20,https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",https://cdn.akamai.steamstatic.com/steam/apps/...,
2,30,https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",https://cdn.akamai.steamstatic.com/steam/apps/...,
3,40,https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",https://cdn.akamai.steamstatic.com/steam/apps/...,
4,50,https://cdn.akamai.steamstatic.com/steam/apps/...,"[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",https://cdn.akamai.steamstatic.com/steam/apps/...,


### 'website', 'support_info'

These two columns contain information about the games's website, support web page and email:

In [90]:
print('website nulls count:', storefront['website'].isnull().sum())
print('support_info nulls count:', storefront['support_info'].isnull().sum())

with pd.option_context("display.max_colwidth", 100): # ensures strings not cut short
    display(storefront[['name', 'website', 'support_info']].sample(10))

website nulls count: 42435
support_info nulls count: 0


Unnamed: 0_level_0,name,website,support_info
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1405260,Rebellion,,"{'url': '', 'email': 'tgmsoftwarehouse@gmail.com'}"
1293190,Hunt Planet Bug,,"{'url': '', 'email': 'carlosd1974@hotmail.com'}"
1943850,Indecent Details - Find the Difference,,"{'url': '', 'email': 'support@witchinghour.club'}"
1930070,Mahjong Business Style,https://www.facebook.com/8FloorGames/,"{'url': 'https://www.facebook.com/8FloorGames', 'email': 'mikhail.zverev@8floor.net'}"
661480,UniOne,http://www.playcolorfulgames.com,"{'url': '', 'email': '308835414@qq.com'}"
931970,Bunny Minesweeper: Skins,http://dillyframe.games/,"{'url': '', 'email': 'support@dillyframe.games'}"
873960,Panic Room 2: Hide and Seek,https://www.gamexp.com/ru/,"{'url': '', 'email': 'support@gamexp.com'}"
637812,Rocksmith¬Æ 2014 Edition ‚Äì Remastered ‚Äì Johnny Cash Song Pack II,http://rocksmith.ubi.com/,"{'url': 'https://support.ubi.com/', 'email': ''}"
1930,Two Worlds Epic Edition,http://www.2-worlds.com/,"{'url': 'http://www.2-worlds.com/confirm.php', 'email': 'support@topware.com'}"
1692590,Puzzles with cats,,"{'url': '', 'email': 'belka.na.more@mail.ru'}"


I'll split these two columns into three (website, support_url and support_email) and move to the separate table. As we can see, the empty website field is NaN while empty  url/emails are just ''. It will appear as NaN after export-import to csv.

It might also be a good idea to check if all three fields are NaN before exporting the data table to avoid having unnecesary data.

#### [Subroutine] 'website', 'support_info': Cleaning

In [91]:
def cleanSupport(df, export=False):
    """Drop support information from dataframe, optionally exporting beforehand."""
    if export:
        support_info = df[['website', 'support_info']].copy()
        
        support_info['support_info'] = support_info['support_info'].apply(lambda x: ast.literal_eval(x))
        support_info['support_url'] = support_info['support_info'].apply(lambda x: x['url'] if (x['url']!='') else np.NaN)
        support_info['support_email'] = support_info['support_info'].apply(lambda x: x['email']  if (x['email']!='') else np.NaN)
        
        support_info = support_info.drop('support_info', axis=1)
        
        # only keep rows with at least one piece of information
        support_info = support_info[(support_info['website'].notnull()) | (support_info['support_url'].notnull()) | (support_info['support_email'].notnull())]

        export_data(support_info, 'steam_support_info', index=True)
    
    df = df.drop(['website', 'support_info'], axis=1)
    
    return df

In [92]:
storefront = cleanSupport(storefront, export=True)

Exported steam_support_info to '../data/export/steam_support_info.csv'


In [93]:
#Verifying exported data
pd.read_csv('../data/export/steam_support_info.csv').sample(15)

Unnamed: 0,appid,website,support_url,support_email
13953,444740,https://www.nhneno.com/cologne,https://www.nhneno.com,
12469,418180,http://www.herocraft.com/,,support@herocraft.com
27858,697000,,,3288481289@qq.com
61364,1259140,https://twitter.com/StrangaGames,https://twitter.com/StrangaGames,strangastudios@gmail.com
84909,1678190,https://www.fantasygrounds.com,www.fantasygrounds.com,support@fantasygrounds.com
41459,936980,https://www.nbsim.co.uk/,https://www.nbsim.co.uk/members,nbs.thewaratsea@gmail.com
64294,1308750,http://niftyllamagames.com/,,support@niftyllamagames.com
73366,1472300,https://bos.ycgame.com/en/,,bos@ycgame.com
66927,1360810,,https://2pgames.net/,
77232,1542160,http://casual-arts.com/va_parkRanger5/va_parkR...,http://casual-arts.com/contact.htm,contact@casual-arts.com


### supported_languages

This is a supported languages field and it's a bit complicated. This is a string listing languages supported by the game but the audio support is marked with `<strong>*</strong>` so we'll have to parse the strings if we want to get both audio and text support.

I'll split this column into two - supported_languages and audio_languages. The languages will be kept as a list.


In [94]:
print('supported_languages nulls count:', storefront['supported_languages'].isnull().sum())

storefront['supported_languages'].value_counts().head(15)

supported_languages nulls count: 152


English                                                                                                                                                                                                                           27004
English<strong>*</strong><br><strong>*</strong>languages with full audio support                                                                                                                                                  22848
English, Russian                                                                                                                                                                                                                   1941
English<strong>*</strong>, German<strong>*</strong>, French<strong>*</strong>, Italian<strong>*</strong>, Spanish - Spain<strong>*</strong>, Japanese<strong>*</strong><br><strong>*</strong>languages with full audio support     1559
English, Japanese                                                       

As we can see, there are some nulls in this column and also languages are neither sorted alphabetically nor grouped up by audio support so I'll do sorting as well.

Let's test things first:

In [95]:
#parsing audio in the separate function
def audioParse(string):
    if string != string:
        return np.NaN
    try:
        # This regex is not too complicated: just matching the text groups ending with <strong>*</strong>
        pattern = "(?:([A-Za-z -]+)(?:<strong>\*<\/strong>)(?:, )*)"
        items = re.findall(pattern, string)
        # Replacing empty lists with NaN. For the group operations, keeping empty lists would actually 
        # be better but they will be transformed to NaN on export anyways.
        if len(items) == 0:
            return np.NaN
        return sorted(items)
    except:
        return np.NaN

temp_df = storefront[["supported_languages"]].copy()
# parsing for audio support
temp_df["audio_languages"] = temp_df["supported_languages"].apply(audioParse)
# removing tags and unnecessary endings and splitting the string into the text support list
temp_df["text_languages"] = (temp_df["supported_languages"]
                             .str.replace(r'<br><strong>\*<\/strong>languages with full audio support','',regex=True)
                             .str.replace(r'<strong>\*</strong>','',regex=True)
                            ).str.split(', ').apply(lambda x: sorted(x) if type(x) is list else np.NaN)
temp_df

Unnamed: 0_level_0,supported_languages,audio_languages,text_languages
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
10,"English<strong>*</strong>, French<strong>*</st...","[English, French, German, Italian, Korean, Sim...","[English, French, German, Italian, Korean, Sim..."
20,"English, French, German, Italian, Spanish - Sp...",,"[English, French, German, Italian, Korean, Rus..."
30,"English, French, German, Italian, Spanish - Spain",,"[English, French, German, Italian, Spanish - S..."
40,"English, French, German, Italian, Spanish - Sp...",,"[English, French, German, Italian, Korean, Rus..."
50,"English, French, German, Korean",,"[English, French, German, Korean]"
...,...,...,...
2004490,English,,[English]
2004650,English<strong>*</strong><br><strong>*</strong...,[English],[English]
2004670,English,,[English]
2007870,Simplified Chinese,,[Simplified Chinese]


In [96]:
temp_df["text_languages"].apply(lambda x: tuple(x) if type(x) is list else np.NaN).value_counts(dropna = False)

(English,)                                                                                                                                                             49852
(English, Russian)                                                                                                                                                      2904
(English, Japanese)                                                                                                                                                     2170
(English, Simplified Chinese)                                                                                                                                           1892
(Simplified Chinese,)                                                                                                                                                   1783
                                                                                                                                       

In [97]:
temp_df["audio_languages"].apply(lambda x: tuple(x) if type(x) is list else np.NaN).value_counts(dropna = False)

NaN                                                                                                                                    52806
(English,)                                                                                                                             31515
( Japanese,)                                                                                                                            2145
(English, French, German, Italian, Japanese, Spanish - Spain)                                                                           1762
( Japanese, English)                                                                                                                    1518
                                                                                                                                       ...  
(English, French, German, Portuguese - Brazil, Russian, Simplified Chinese)                                                                1
(English, Por

Everything seems fine, so let's make the transform function:

#### [Subroutine] 'supported_languages': Cleaning

In [98]:
def cleanLanguages(df):
    """Clean and split supported_languages into two columns: supported_languages and supported_audio"""
    
    #parsing audio in the separate function
    def audioParse(string):
        if string != string:
            return np.NaN
        try:
            # This regex is not too complicated: just matching the text groups ending with <strong>*</strong>
            pattern = "(?:([A-Za-z -]+)(?:<strong>\*<\/strong>)(?:, )*)"
            items = re.findall(pattern, string)
            # Replacing empty lists with NaN. For the group operations, keeping empty lists would actually 
            # be better but they will be transformed to NaN on export anyways.
            if len(items) == 0:
                return np.NaN
            return sorted(items)
        except:
            return np.NaN    

    # parsing for audio support
    df["supported_audio"] = df["supported_languages"].apply(audioParse)
    # removing tags and unnecessary endings and splitting the string into the text support list
    df["supported_languages"] = (df["supported_languages"]
                                 .str.replace(r'<br><strong>\*<\/strong>languages with full audio support','',regex=True)
                                 .str.replace(r'<strong>\*</strong>','',regex=True)
                                ).str.split(', ').apply(lambda x: sorted(x) if type(x) is list else np.NaN)
    return df

In [99]:
storefront = cleanLanguages(storefront)

In [100]:
storefront[["name","supported_audio","supported_languages"]].sample(15)

Unnamed: 0_level_0,name,supported_audio,supported_languages
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1434340,Insect Adventure Demo,,[English]
705670,Ballz: Farm,,"[Arabic, Bulgarian, Czech, Danish, Dutch, Engl..."
1740520,Behind the Frame: The Finest Scenery - Art Book,,"[English, French, German, Italian, Japanese, K..."
1873380,Wonky Works!,,[English]
1021250,Welcome To... Chichester 2 : VNMaker Version,,[English]
24790,Command & Conquer 3: Tiberium Wars,,"[Dutch, English, French, German, Italian, Poli..."
1068190,ÈáçÁîüËΩÆÂõû/Reborn Not Again,,"[English, Simplified Chinese]"
1310390,Mine Trap Reborn,,[English]
1873410,Witch Sacrifice,,[English]
529090,Violet Haunted,,[English]


### release_date

In [101]:
storefront["release_date"].value_counts()

{'coming_soon': True, 'date': '2022'}                                         1221
{'coming_soon': True, 'date': 'TBA'}                                           958
{'coming_soon': True, 'date': 'Coming Soon'}                                   778
{'coming_soon': True, 'date': ''}                                              440
{'coming_soon': True, 'date': 'TBD'}                                           304
                                                                              ... 
{'coming_soon': True, 'date': 'Once in a lifetime'}                              1
{'coming_soon': True, 'date': 'Coming soon, wishlist and follow!'}               1
{'coming_soon': False, 'date': '31 Mar, 2010'}                                   1
{'coming_soon': True, 'date': '18 Jun, 2020'}                                    1
{'coming_soon': True, 'date': 'Aiming for late October or early November'}       1
Name: release_date, Length: 6929, dtype: int64

In [102]:
storefront["release_date"].sample(n=10)

appid
832320     {'coming_soon': False, 'date': '25 Jul, 2019'}
1051210    {'coming_soon': False, 'date': '18 Oct, 2019'}
1608940     {'coming_soon': False, 'date': '9 Jun, 2021'}
1693650    {'coming_soon': False, 'date': '30 Jul, 2021'}
1380150    {'coming_soon': False, 'date': '21 Aug, 2020'}
688254     {'coming_soon': False, 'date': '17 Oct, 2017'}
421200     {'coming_soon': False, 'date': '19 Nov, 2015'}
60          {'coming_soon': False, 'date': '1 Nov, 2000'}
1981380         {'coming_soon': True, 'date': 'Dec 2022'}
1692773    {'coming_soon': False, 'date': '13 Jul, 2021'}
Name: release_date, dtype: object

There are two different fields stored in this dict - Boolean on whether the game is released or not (coming_soon) and the release date. 

For the upcoming game date format seems to be free string (some strings are not even in English).

For the released games - it seems to be a standard "%d %b, %Y"

**Note:** There are some (less than 10 at the time of writing this) games that have coming_soon set flag to False while their release_date is set long after the data collection. I'll set coming_soon to True in that case.

I'll convert the datetime for the released games to datetime and add the column for coming_soon games. The field for the incorrect dates is set to the NaN

#### [Subroutine] 'release_date': Cleaning

In [103]:
def cleanReleaseDate(df):
    df = df.copy()

    #getting values for comming_soon column
    def getComingSoon(value):
        if extractDictItem(value,"coming_soon") == True:
            return True
        return False
    
    #parsing dates
    def processReleaseDateValues(value):
        thisDate = extractDictItem(value, "date")
        try:
            return pd.to_datetime(thisDate, errors='raise')
        except:
            return np.NaN  
        
  
    df["coming_soon"] = df["release_date"].apply(getComingSoon)
    df["release_date"] = df["release_date"].apply(processReleaseDateValues)
    df.loc[df["release_date"] > df_collection_date,["coming_soon"]]= True
    return df

In [104]:
temp_df = cleanReleaseDate(storefront)

In [105]:
temp_df.coming_soon.value_counts()

False    89513
True     12991
Name: coming_soon, dtype: int64

In [106]:
temp_df["release_date"].sample(15)

appid
543870    2017-08-25
1153950   2019-12-11
337880    2014-12-22
860120    2018-05-09
346490    2015-05-28
1237600   2020-02-26
493290    2016-12-06
750740    2017-12-07
1853802   2022-02-15
796930    2018-10-08
1660300   2021-09-02
295207    2016-09-28
439750    2016-01-29
994170    2019-02-22
789710    2018-03-26
Name: release_date, dtype: datetime64[ns]

In [107]:
temp_df[(temp_df["release_date"]>df_collection_date) & (temp_df["coming_soon"] == False)]["release_date"]

Series([], Name: release_date, dtype: datetime64[ns])

Everything seems fine. Processing:

In [108]:
storefront = cleanReleaseDate(storefront)

### Processing price

There are multiple columns that are related to price:
* price_overview 
* is_free 
* packages 
* package_groups 

price_overview and is_free are obvious, as for packages and package_groups, as you'll see later, the app might be sold just as a part of package and not sold separately.

Let's start with taking a peek at price_overview:

In [109]:
print('price_overview nulls count:', storefront['price_overview'].isnull().sum())
with pd.option_context("display.max_colwidth", 150):
    display(storefront[["name", "is_free", "price_overview"]].sample(15))

price_overview nulls count: 22427


Unnamed: 0_level_0,name,is_free,price_overview
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1016500,Trainz 2019 DLC: Appen,False,"{'currency': 'EUR', 'initial': 1499, 'final': 1499, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': '14,99‚Ç¨'}"
1736815,RPGScenery - Dark Wood Scene,False,"{'currency': 'EUR', 'initial': 569, 'final': 569, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': '5,69‚Ç¨'}"
1002410,The Five Cores Remastered,False,"{'currency': 'EUR', 'initial': 1399, 'final': 1399, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': '13,99‚Ç¨'}"
462082,RPG Maker VX Ace - Retro Halloween Tiles,False,"{'currency': 'EUR', 'initial': 399, 'final': 399, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': '3,99‚Ç¨'}"
488330,Highway to the Moon,False,"{'currency': 'EUR', 'initial': 99, 'final': 99, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': '0,99‚Ç¨'}"
776300,Harvest Seasons - Starter Bundle,False,"{'currency': 'EUR', 'initial': 499, 'final': 499, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': '4,99‚Ç¨'}"
1055690,Awakening: The Dreamless Castle,False,"{'currency': 'EUR', 'initial': 569, 'final': 569, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': '5,69‚Ç¨'}"
293460,I Shall Remain,False,"{'currency': 'EUR', 'initial': 999, 'final': 999, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': '9,99‚Ç¨'}"
2002550,Chloe Puzzle Game,False,
401330,Akuatica,False,"{'currency': 'EUR', 'initial': 430, 'final': 430, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': '4,30‚Ç¨'}"


Things to note:

* There are a quite a lot of nulls in price_overview
* price_overview being null doesn't always correlate with is_free being True (although we'll check how often that happens next)
* price_overview's currency is Euro (which is understandable as the dataset was download from the location in Europe. But we'll check if there are any inconsistencies here)
* There are both prices both with the current discount and without it. Considering how often Steam does sales on different products, I'll only leave the price without the discount.

First, let's take a closer look at is_free and price_overview:

In [110]:
print('is_free = True and price_overview == nulls count:',
      storefront[storefront["is_free"] == True]["price_overview"].isnull().sum())
print('is_free = True and price_overview != nulls count:',
      storefront[storefront["is_free"] == True]["price_overview"].notnull().sum())
print('is_free = False and price_overview == nulls count:',
      storefront[storefront["is_free"] == False]["price_overview"].isnull().sum())
print('Filtered out non-released apps from the above:',
      storefront[(storefront["is_free"] == False) & (storefront["coming_soon"] == False)]["price_overview"].isnull().sum())
print()
with pd.option_context("display.max_colwidth", 150):
    display(storefront[(storefront["is_free"] == True) & (storefront["price_overview"].notnull())][["name","price_overview"]])

is_free = True and price_overview == nulls count: 10593
is_free = True and price_overview != nulls count: 17
is_free = False and price_overview == nulls count: 11834
Filtered out non-released apps from the above: 1132



Unnamed: 0_level_0,name,price_overview
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
8650,RACE 07: Andy Priaulx Crowne Plaza Raceway (Free DLC),"{'currency': 'EUR', 'initial': 2995, 'final': 2995, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': 'Free'}"
70615,Worms Ultimate Mayhem - Single Player Pack DLC,"{'currency': 'EUR', 'initial': 1699, 'final': 1699, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': 'Free'}"
215373,Omerta - City of Gangsters - The Bulgarian Colossus DLC,"{'currency': 'EUR', 'initial': 2499, 'final': 2499, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': 'Free'}"
219136,Painkiller Hell & Damnation: Satan Claus DLC,"{'currency': 'EUR', 'initial': 6999, 'final': 6999, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': 'Free'}"
222680,Dungeon Defenders Anniversary Pack,"{'currency': 'EUR', 'initial': 159, 'final': 159, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': 'Free'}"
229080,DmC Devil May Cry: Bloody Palace Mode,"{'currency': 'EUR', 'initial': 3999, 'final': 3999, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': 'Free'}"
236080,Resident Evil 6 Wallpaper,"{'currency': 'EUR', 'initial': 4499, 'final': 4499, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': 'Free'}"
247307,Saints Row IV - Reverse Cosplay Pack,"{'currency': 'EUR', 'initial': 499, 'final': 499, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': 'Free'}"
250810,LOST PLANET¬Æ 3 - Hi Res Movies,"{'currency': 'EUR', 'initial': 3999, 'final': 3999, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': 'Free'}"
255050,Saints Row IV - Thank You Pack,"{'currency': 'EUR', 'initial': 499, 'final': 499, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': 'Free'}"


There are some apps that were not free on the start but are now listed as 'Free'. These are either DLCs that became free or the first episodes/demoversions of the games. We can safely set their price to 0 so I'll set the price for all free games to 0 and can remove is_free column as redundant

Another thing to notice - there are a lot of apps with the incorrect price. 
 
There may be multiple reasons for that:
* Free (we've checked these)
* Not released yet (fortunately, we've already parsed the date and filtered these above)
* Being superseded by the different app (Like Bioshock App ID 7670, for example)
* Not being sold anymore (RollerCoaster Tycoon¬Æ 3: Platinum, App ID 2700)
* A demo that has been marked incorrectly (App ID 1883370)
* A part of some bundle and not sold separately

Now, let's transform price_overview to the more understandable format, dealing with the free free apps (The incorrect price is set to null for now):

In [111]:
def parse_price(x):
    try:
        if x != x:
            return {'currency': 'EUR', 'initial': np.NaN}
        else:
            return ast.literal_eval(x)
    except:
        print(x)

price_df = storefront[['name','coming_soon','type','packages', 'package_groups','is_free','price_overview']].copy()
# Evaluate as dictionary and set to NaN if missing
price_df['price_overview'] = price_df['price_overview'].apply(parse_price)
# Set currencies
price_df['currency'] = price_df['price_overview'].apply(lambda x: x['currency'])
# Get prices
price_df['price'] = price_df['price_overview'].apply(lambda x: x['initial']/100 if x['initial'] > 0 else x['initial'])
# set price of free games to 0
price_df.loc[price_df['is_free'], 'price'] = 0
print('Number of prices with negative values:', price_df[price_df['price']<0].shape[0])
print('Number of prices with incorrect values:', price_df[price_df['price'].isnull()].shape[0])
price_df['currency'].value_counts()

Number of prices with negative values: 0
Number of prices with incorrect values: 11834


EUR    102504
Name: currency, dtype: int64

It looks like some Steam for some reason doesn't return the price in Euros for some games. Going to convert them using the conversion course at the time of data collection: ***1 USD to 0.95 EU as for 2022-04-27***

I'll use the price_df dataframe created earlier for the checks. Let's start by filtering out the non-released games:

In [112]:
price_df[(price_df["is_free"] == False) 
         & (price_df["coming_soon"] == False) 
         & (price_df["packages"].isnull())
         & (price_df["price"].isnull())][['name','type','packages','package_groups','price']]

Unnamed: 0_level_0,name,type,packages,package_groups,price
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
340,Half-Life 2: Lost Coast,game,,[],
2570,Vigil: Blood Bitterness‚Ñ¢,game,,[],
2700,RollerCoaster Tycoon¬Æ 3: Platinum,game,,[],
3400,Hammer Heads Deluxe,game,,[],
3490,Venice Deluxe,game,,[],
...,...,...,...,...,...
2002130,Dread Hunger Fur Hats,dlc,,[],
2002150,Dread Hunger Bone Rings,dlc,,[],
63950,IL-2 Sturmovik: Cliffs of Dover,game,,[],
12210,Grand Theft Auto IV: Complete Edition,game,,[],


And let's check apps that are part of some package:

In [113]:
price_df[(price_df["is_free"] == False) 
         & (price_df["coming_soon"] == False) 
         & (price_df["packages"].notnull())
         & (price_df["price"].isnull())][['name','type','packages','package_groups','price']]

Unnamed: 0_level_0,name,type,packages,package_groups,price
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2420,The Ship: Single Player,game,[56669],"[{'name': 'default', 'title': 'Buy The Ship: S...",
7670,BioShock‚Ñ¢,game,"[451, 127633]","[{'name': 'default', 'title': 'Buy BioShock‚Ñ¢',...",
8850,BioShock¬Æ 2,game,"[81419, 127633]","[{'name': 'default', 'title': 'Buy BioShock¬Æ 2...",
20500,Red Faction Guerrilla Steam Edition,game,"[189796, 15630]","[{'name': 'default', 'title': 'Buy Red Faction...",
31230,Sam & Max 302: The Tomb of Sammun-Mak,game,"[109586, 4172]","[{'name': 'default', 'title': 'Buy Sam & Max 3...",
...,...,...,...,...,...
1928041,Warframe: Garuda Prime Access - Blood Altar Pack,dlc,[694615],[],
1928042,Warframe: Garuda Prime Access - Seeking Talons...,dlc,[694618],[],
1937590,DEMON GAZE EXTRA - Tons of Fun! Perfect Gem Set,dlc,[698017],[],
1968101,Warframe: Angels of the Zariman Chrysalith Pack,dlc,[710699],[],


Sadly, it doesn't seem like there is much we can do to clean it up further. Now, do we remove these rows with the null price or not? 

The number of games and dlcs is quite significant and it might be interesting for the people checking the non-available games. ***I'll leave it as null at this stage and decide whether to remove it when doing the analysis***.

So, the **price is**:

* set to 0 for free games
* set as null for the incorrect/unavailable price
* converted EUR if it was in USD, using the conversion rate at the time of gathering
* left 

Now, let's make a cleaning function:

#### [Subroutine] 'is_free', 'price_overview': Cleaning

In [114]:
def cleanPrice(df):
    df = df.copy()

    #parsing the price_overview, filling in the incorrect values and nulls for the further processing
    def parse_price(x):
        try:
            if x != x:
                return {'currency': 'EUR', 'initial': np.NaN}
            else:
                return ast.literal_eval(x)
        except:
            return {'currency': 'EUR', 'initial': np.NaN}
    
    # Evaluate as dictionary and set to null if missing
    df['price_overview'] = df['price_overview'].apply(parse_price)
    # Set currencies
    df['currency'] = df['price_overview'].apply(lambda x: x['currency'])
    # Get prices and change it to be shown in the proper dimansion
    df['price'] = df['price_overview'].apply(lambda x: x['initial']/100)
    # set price of free games to 0
    df.loc[df['is_free'], 'price'] = 0
    # convert the price from USD to EU
    df.loc[df['currency'] == 'USD', 'price'].apply(lambda x: x*usd_eu_rate if x > 0 else x)
    
    df = df.drop(['is_free','price_overview', 'currency'], axis=1)
    
    return df

In [115]:
storefront = cleanPrice(storefront)

### 'packages'

We've already seen some use of this column when we were processing 'price_overview' but let's take a close look at it now.

'package' represents the list of package IDs the application is a member of. It can be usefull when tracking DLCs, for example.

In [116]:
print('packages nulls count:', storefront['packages'].isnull().sum())
print('packages - after filtering out possible null causes:',
      storefront[(storefront['packages'].isnull()) 
                 & (storefront['coming_soon'] == False) 
                 & (storefront['price'] != 0)
                 & (storefront['price'].notnull())
                ].shape[0])
print('packages empty lists count:', storefront[~storefront['packages'].apply(lambda x: True if x!=x else bool(ast.literal_eval(x)) )].shape[0])
with pd.option_context("display.max_colwidth", 250):
    display(storefront[['name','packages']].sample(10))

packages nulls count: 21351
packages - after filtering out possible null causes: 0
packages empty lists count: 0


Unnamed: 0_level_0,name,packages
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
1571540,Gimmick in the Chaos Dimension,[555476]
1076580,Visual Novel Maker - COSMIC MUSIC DLC PACK,
1227100,Wildemist Isle,"[425318, 629943]"
903180,Privateers,[288890]
1374250,The Divine Speaker - 2019 Art Collection,[481999]
969270,Glass Masquerade - Halloween Puzzle Pack,[317607]
804350,Starman,[244985]
206193,Gunpoint Extras Pack 1,[30312]
499440,klocki,[114322]
960920,Immersion Pack - Europa Universalis IV: Golden Century,[314265]


As we can see, there are some nulls in this columns but all of them are caused by:
* Not being released yet
* Being Free
* Having incorrect price

We've reviewed the prices earlier so I don't think there is any need in removing the rows with the null package value. I'll leave the column as is

#### [Subroutine] 'packages': Cleaning

**Reserved in case we'll do cleaning in the future**. For now stays as is

### 'package_groups'

We've already seen some use of the column earlier when we were processing 'price_overview'.

* 'package_groups' is a list of purchase options (apps might be either be purchased right away or through the subscription usage and this)
* sadly, there is no information on bundles available through the store API (to my knowledge, SteamDB is web scraping Steam pages to get that data)

Let's take a look at this column:

In [117]:
print('package_groups nulls count:', storefront['package_groups'].isnull().sum())
print('package_groups empty list count:', storefront[~storefront['package_groups'].apply(lambda x: bool(ast.literal_eval(x)))].shape[0])
print('package_groups lists with multiple items count:', storefront[storefront['package_groups'].apply(lambda x: len(ast.literal_eval(x))) > 1].shape[0])
with pd.option_context("display.max_colwidth", 500):
    display(storefront[['name','package_groups']].sample(10))

package_groups nulls count: 0
package_groups empty list count: 21870
package_groups lists with multiple items count: 652


Unnamed: 0_level_0,name,package_groups
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
504750,Lew Pulsipher's Doomstar,"[{'name': 'default', 'title': ""Buy Lew Pulsipher's Doomstar"", 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 116234, 'percent_savings_text': ' ', 'percent_savings': 0, 'option_text': ""Lew Pulsipher's Doomstar - 9,99‚Ç¨"", 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 999}]}]"
687360,Doodle Jamboree,"[{'name': 'default', 'title': 'Buy Doodle Jamboree', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 196812, 'percent_savings_text': ' ', 'percent_savings': 0, 'option_text': 'Doodle Jamboree - 1,99‚Ç¨', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 199}]}]"
1845230,"CUSTOM ORDER MAID 3D2 Personality Pack Naturally sadistic, Sweet little devil","[{'name': 'default', 'title': 'Buy CUSTOM ORDER MAID 3D2 Personality Pack Naturally sadistic, Sweet little devil', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 663969, 'percent_savings_text': ' ', 'percent_savings': 0, 'option_text': 'CUSTOM ORDER MAID 3D2 Personality Pack Naturally sadistic, Sweet little devil - 33,99‚Ç¨', 'option_description': '', 'can_get_free_license': '0', ..."
940140,Project AETHER: First Contact,"[{'name': 'default', 'title': 'Buy Project AETHER: First Contact', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 305264, 'percent_savings_text': ' ', 'percent_savings': 0, 'option_text': 'Project AETHER: First Contact - 10,79‚Ç¨', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 1079}]}]"
1200346,Fishing Planet: Tropic Hunter Pack,"[{'name': 'default', 'title': 'Buy Fishing Planet: Tropic Hunter Pack', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 414088, 'percent_savings_text': ' ', 'percent_savings': 0, 'option_text': 'Fishing Planet: Tropic Hunter Pack - 29,99‚Ç¨', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 2999}]}]"
932930,Sexy Girls Puzzle,"[{'name': 'default', 'title': 'Buy Sexy Girls Puzzle', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 302309, 'percent_savings_text': ' ', 'percent_savings': 0, 'option_text': 'Sexy Girls Puzzle - 0,89‚Ç¨', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 89}]}]"
1070840,Quest Together,[]
436470,Gamma Bros,[]
1325230,Ready Set Sumo!,[]
856100,Little Comet,"[{'name': 'default', 'title': 'Buy Little Comet', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 269154, 'percent_savings_text': ' ', 'percent_savings': 0, 'option_text': 'Little Comet - 3,29‚Ç¨', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 329}]}]"


In [118]:
with pd.option_context("display.max_colwidth", 500):
    display(storefront[storefront['package_groups'].apply(lambda x: len(ast.literal_eval(x))) > 1][['name','package_groups']].sample(5))

Unnamed: 0_level_0,name,package_groups
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
951450,Fit It,"[{'name': 'default', 'title': 'Buy Fit It', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 309791, 'percent_savings_text': ' ', 'percent_savings': 0, 'option_text': 'Fit it - 10,79‚Ç¨', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 1079}]}, {'name': 'subscriptions', 'title': 'Buy Fit It Subscription Plan', 'descrip..."
1104560,Epic Roller Coasters ‚Äî Lost Forest,"[{'name': 'default', 'title': 'Buy Epic Roller Coasters ‚Äî Lost Forest', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 373005, 'percent_savings_text': ' ', 'percent_savings': 0, 'option_text': 'Epic Roller Coasters ‚Äî Lost Forest - 2,39‚Ç¨', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 239}]}, {'name': 'subscriptio..."
923230,DreamBack VR,"[{'name': 'default', 'title': 'Buy DreamBack VR', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 298082, 'percent_savings_text': ' ', 'percent_savings': 0, 'option_text': 'DreamBack VR - 13,99‚Ç¨', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 1399}]}, {'name': 'subscriptions', 'title': 'Buy DreamBack VR Subscripti..."
55110,Red Faction¬Æ: Armageddon‚Ñ¢,"[{'name': 'default', 'title': 'Buy Red Faction¬Æ: Armageddon‚Ñ¢', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 8392, 'percent_savings_text': ' ', 'percent_savings': 0, 'option_text': 'Red Faction: Armageddon - 19,99‚Ç¨', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 1999}, {'packageid': 15630, 'percent_savings_text'..."
1056960,Wolfenstein: Youngblood,"[{'name': 'default', 'title': 'Buy Wolfenstein: Youngblood', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 353104, 'percent_savings_text': ' ', 'percent_savings': 0, 'option_text': 'Wolfenstein: Youngblood - 19,99‚Ç¨', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 1999}, {'packageid': 353105, 'percent_savings_text..."


In [119]:
storefront.loc[764830].package_groups

'[{\'name\': \'default\', \'title\': \'Buy Snowmania\', \'description\': \'\', \'selection_text\': \'Select a purchase option\', \'save_text\': \'\', \'display_type\': 0, \'is_recurring_subscription\': \'false\', \'subs\': [{\'packageid\': 226121, \'percent_savings_text\': \' \', \'percent_savings\': 0, \'option_text\': \'Snowmania - 6,99‚Ç¨\', \'option_description\': \'\', \'can_get_free_license\': \'0\', \'is_free_license\': False, \'price_in_cents_with_discount\': 699}]}, {\'name\': \'subscriptions\', \'title\': \'Buy Snowmania Subscription Plan\', \'description\': \'To be billed on a recurring basis.\', \'selection_text\': \'Starting at 6,99‚Ç¨ / month\', \'save_text\': \'\', \'display_type\': 0, \'is_recurring_subscription\': \'true\', \'subs\': [{\'packageid\': 235307, \'percent_savings_text\': \' \', \'percent_savings\': 0, \'option_text\': \'6,99‚Ç¨ for a month, then 0,79‚Ç¨ / month\', \'option_description\': \'<p class="game_purchase_subscription">6,99‚Ç¨ at checkout, auto-ren

This data does seem useful for the research on purchasing options, for example but might not be worth keeping in the main data table.

The items' structure in the lists seems to be rigid but there are lists with multiple items out here. 

package_groups table structure:

| package_groups | Original field | Field Type |
| --- | --- | --- |
| appid | storefront.appid | int |
| type | storefront.package_groups.item.name | string |
| title | storefront.package_groups.item.title | string |
| is_recurring_subscription | storefront.package_groups.item.is_recurring_subscription | bool |
| subs | storefront.package_groups.item.subs | list of dicts/object |

subs will need additional parsing before the analysis as it contains the detailed data on purchasing options - price, free tiers, billing options, etc.

#### [Subroutine] 'package_groups': Cleaning

In [120]:
def cleanPackageGroups(df, export=False):
    """
    Drop Package groups information from the dataframe, optionally exporting beforehand.
    """
    
    def packageGroupsParse(row):
        """
        Parsing each row to get the new columns
        """
        row['package_groups']['appid'] = row['appid']
        # parsing boolean field to python boolean
        if row['package_groups']['is_recurring_subscription'] == 'false':
            row['package_groups']['is_recurring_subscription'] = False
        else:
            row['package_groups']['is_recurring_subscription'] = True
        result = pd.Series(row['package_groups'])
        return result
    
    if export:
        # removing empty package_groups and columns not needed in processing
        packages_info = df[df['package_groups'].apply(lambda x: bool(ast.literal_eval(x)))][['package_groups']].copy().reset_index()
        # evaluating string to the list and exploding the list
        packages_info['package_groups'] = packages_info['package_groups'].apply(lambda x: ast.literal_eval(x))
        packages_info = packages_info.explode('package_groups')
        packages_info = packages_info.apply(lambda row: packageGroupsParse(row), axis = 1)
        # removing unnecessary oclumns
        packages_info.drop(['description','selection_text','save_text','display_type'], axis = 1, inplace = True)
        # renaming ocolumns
        packages_info.rename(columns={'name':'type'}, inplace = True)
        # changing column order
        packages_info = packages_info[['appid', 'type', 'title', 'is_recurring_subscription', 'subs']]

        export_data(packages_info, 'steam_packages_info', index=False)
    
    df = df.drop(['package_groups'], axis=1)
    
    return df

In [121]:
storefront = cleanPackageGroups(storefront, export = True)

Exported steam_packages_info to '../data/export/steam_packages_info.csv'


In [122]:
#Verifying exported data
pd.read_csv('../data/export/steam_packages_info.csv').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81293 entries, 0 to 81292
Data columns (total 5 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   appid                      81293 non-null  int64 
 1   type                       81293 non-null  object
 2   title                      81293 non-null  object
 3   is_recurring_subscription  81293 non-null  bool  
 4   subs                       81293 non-null  object
dtypes: bool(1), int64(1), object(3)
memory usage: 2.6+ MB


### achievements
This columns contains dictionaries with the total number of the application achievements and the information about the 10 highlited ones. Information about the highlited achievements is not very useful (besides, there is no description - just the name and the icon link) but the total number is worth saving:

In [123]:
print('Achievements nulls count:', storefront['achievements'].isnull().sum())
print('DLCs with achievements count:', storefront[storefront['type']=='dlc']['achievements'].notnull().sum())
with pd.option_context("display.max_colwidth", 500):
    display(storefront[['name', 'achievements']].sample(10))

Achievements nulls count: 71424
DLCs with achievements count: 20


Unnamed: 0_level_0,name,achievements
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
1325080,CyberBorn,
1927810,Coloring Pixels - Platformers,
482300,Investigator,"{'total': 52, 'highlighted': [{'name': 'Investigator', 'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/482300/9004d2033a1083c3e71727e51fbb20aa92d6cc34.jpg'}, {'name': 'Secret Room', 'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/482300/0d5bf49ad19f24d9aa9addbaf4ea8d3b0beea601.jpg'}, {'name': 'Dumb =)', 'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/482300/08e666e40a970b013aef69156818099b9f2b4e3d.jpg'}..."
987480,Ëä±ËêΩÂÜ¨ÈôΩ Snowdreams -lost in winter-,"{'total': 31, 'highlighted': [{'name': 'Â∫èÂπï', 'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/987480/b74ae8e689cf3a327978684196b0afcae0cb2858.jpg'}, {'name': 'ÁôΩÈõ≤ÈöéÊ¢Ø‰∏äÁöÑÂ§¢ÊÉ≥', 'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/987480/e877949ecccbd02029cf666aa2e8bf69f603cb11.jpg'}, {'name': 'Silent Night', 'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/987480/e846bb3fccd4fbedc6a8db6611a17298c7ece8dd.jpg'}, {'name..."
542160,War for the Overworld - My Pet Dungeon Expansion,
859920,Double Head Shark Attack,"{'total': 10, 'highlighted': [{'name': 'Complete 5 Missions', 'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/859920/f430540d10ef1d8a2e9b27a8fa7c69a72e0fb265.jpg'}, {'name': 'Unlock Great White Double Head', 'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/859920/e6911c8f0a488976c98d980ca1e44e8d9644782b.jpg'}, {'name': 'Complete 25 Missions', 'path': 'https://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/859920/0afd944..."
528620,Order of Battle: Blitzkrieg,
907910,Spooky Ghosts Dot Com - Soundtrack,
467513,Sentinels of the Multiverse - Mini-Pack 4,
983460,Katy & Bob: Cake Caf√© Soundtrack,


Number of nulls is not very surprising considering a lot of games launched before achievements appeared and we have DLCs in our table that usually have achievements attached to the base game.

#### [Subroutine] 'achievements': Cleaning

In [124]:
def processAchievements(df):
    """
    Parse as total number of achievements.
    """
    df = df.copy()
     
    def parse_achievements(x):
        if x is np.nan:
            # missing data, assume has no achievements
            return 0
        else:
            # else has data, so can extract and return number under total
            return literal_eval(x)['total']
        
    df['achievements'] = df['achievements'].apply(extractDictItem, key = 'total')
    df['achievements'].fillna(0, inplace=True)
    
    return df

In [125]:
storefront = processAchievements(storefront)
storefront['achievements'].value_counts()

0.0      72045
10.0      1503
12.0      1213
20.0      1097
15.0      1000
         ...  
379.0        1
208.0        1
350.0        1
324.0        1
141.0        1
Name: achievements, Length: 418, dtype: int64

### demos

Game demo versions were quite popular back in the day and are still used by some developers publishers. This column contains a list of dictionaries with appids and descriptions of the game demo versions. It might contain multiple elements.

We didn't download the demo versions in this dataset so it's not very useful. Still, will convert it to the simple lists of appids.

In [126]:
#demos
print('demos nulls count:', storefront['demos'].isnull().sum())
print('demos lists with multiple items count:', storefront[storefront['demos'].apply(lambda x: 0 if x != x else len(ast.literal_eval(x))) > 1].shape[0])
storefront[storefront["demos"].notnull()][["name","demos"]].sample(15)

demos nulls count: 96000
demos lists with multiple items count: 18


Unnamed: 0_level_0,name,demos
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
1950310,Until The End,"[{'appid': 1991680, 'description': ''}]"
1488510,VRNOID,"[{'appid': 1539930, 'description': ''}]"
1039280,ProtoCorgi,"[{'appid': 1041180, 'description': 'ProtoCorgi..."
1311300,HIVE,"[{'appid': 1343840, 'description': ''}]"
954310,Master Of Earth,"[{'appid': 1325150, 'description': ''}]"
526550,PitchFork,"[{'appid': 541250, 'description': 'Pitchfork D..."
1906510,Arto,"[{'appid': 1991560, 'description': ''}]"
1760640,Awakened Evil,"[{'appid': 1838870, 'description': ''}]"
1658210,Witty witch,"[{'appid': 1717950, 'description': ''}]"
536430,The Revenge of Johnny Bonasera: Episode 1,"[{'appid': 536860, 'description': ''}]"


#### [Subroutine] 'demos': Cleaning

In [127]:
storefront['demos'] = storefront['demos'].apply(extractDictList, key='appid')

### fullgame
This column is specifically for DLCs and contains information about the base game. It is stored as an appid: name dictionary and we will only leave the appis (as our main table is indexed by it and contains everything else needed). Let's take a closer look on how clean it is:

In [128]:
#fullgame
print('fullgame nulls count:', storefront['fullgame'].isnull().sum())
print('fullgame non-nulls count:', storefront['fullgame'].notnull().sum())
print('fullgame nulls for dlcs count:', storefront[storefront['type']=='dlc']['fullgame'].isnull().sum())
storefront[storefront["fullgame"].notnull()][["name","fullgame"]].sample(5)

fullgame nulls count: 67918
fullgame non-nulls count: 34586
fullgame nulls for dlcs count: 47


Unnamed: 0_level_0,name,fullgame
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
903263,Air Threat - Huge Donation,"{'appid': '853380', 'name': 'Air Threat'}"
938550,Jellyfish Season Fan Pack,"{'appid': '937980', 'name': 'Jellyfish Season'}"
1242768,Monster Hunter World: Iceborne - MHW:I Monster...,"{'appid': '582010', 'name': 'Monster Hunter: W..."
1788580,Adult Puzzles - Hentai NightClub ArtBook,"{'appid': '1788490', 'name': 'Adult Puzzles - ..."
1816662,X-Plane 11 - Add-on: Verticalsim - KFAY - Faye...,"{'appid': '269950', 'name': 'X-Plane 11'}"


In [129]:
getSteamLink(storefront[(storefront['type']=='dlc') & (storefront['fullgame'].isnull())][['name','fullgame']].sample(5))

Batman: Arkham Origins - Infinite Earths Skin Pack https://store.steampowered.com/app/237621
Great Northern F7 Big Sky Blue Add-on Livery https://store.steampowered.com/app/256539
Jigsaw Puzzle Pack - Pixel Puzzles Ultimate: Fractals https://store.steampowered.com/app/502840
GG1 PRR Silver Add-on Livery https://store.steampowered.com/app/256529
BR Class 31 Ochre Add-on Livery https://store.steampowered.com/app/256552


Undortunately, it seems like some developers have forgotten to mark the fulllgame for some DLCs. You can see the lack of "This content requires the base game.." field on the game page, for example. 

Fortunately, we have a 'dlc' column that should contain the list for the game. Let's check if we can recover fullgame from it:

In [130]:
#data["genres"].explode().unique() 
def fullgame_dlc_check(df):
    temp_data = df.copy()
    dlcs_list = df[df['dlc'].notnull()]['dlc'].apply(lambda x: ast.literal_eval(x)).explode().unique()
    temp_data = temp_data[(storefront['type']=='dlc') & (temp_data['fullgame'].isnull())]
    temp_data['dlc_available'] = temp_data.index
    temp_data['dlc_available'] = temp_data['dlc_available'].apply(lambda x: x in dlcs_list)
    return temp_data

temp_data = fullgame_dlc_check(storefront)
print('recoverable dlcs count: ', temp_data[temp_data['dlc_available'] == True][['name','dlc_available']].shape[0])
print('unrecoverable dlcs count: ', temp_data[temp_data['dlc_available'] == False][['name','dlc_available']].shape[0])

recoverable dlcs count:  21
unrecoverable dlcs count:  26


It seems like we can recover some from the fullgame column. I'll leave the remaining nulls as is for now.

#### TODO: Remove the remaining nulls?

#### [Subroutine] 'fullgame': Cleaning

In [131]:
def fullgame_cleaning(df):
    """
    Cleaning fullgame
    """

    # Creating a temporary table with appid-dlc data
    df = df.copy()
    dlcs_df = df.copy()
    dlcs_df = dlcs_df.loc[dlcs_df['dlc'].notnull()][['dlc']]
    dlcs_df['dlc'] = dlcs_df['dlc'].apply(lambda x: ast.literal_eval(x))
    dlcs_df = dlcs_df.explode('dlc')

    # Filling out the fullgame column when possible
    def fillFullgame(appid):
        index_list = dlcs_df.index[dlcs_df['dlc']==appid]
        if len(index_list) == 0:
            return np.NaN
        else:
            return index_list[0]

    mask = (df['type']=='dlc') & (df['fullgame'].isnull())
    df.loc[mask, 'fullgame'] = df[mask].apply(lambda row: fillFullgame(appid = row.name), axis = 1)
    
    return df

In [132]:
storefront = fullgame_cleaning(storefront)

In [133]:
def fullgame_dlc_check(df):
    temp_data = df.copy()
    dlcs_list = df[df['dlc'].notnull()]['dlc'].apply(lambda x: ast.literal_eval(x)).explode().unique()
    temp_data = temp_data[(storefront['type']=='dlc') & (temp_data['fullgame'].isnull())]
    temp_data['dlc_available'] = temp_data.index
    temp_data['dlc_available'] = temp_data['dlc_available'].apply(lambda x: x in dlcs_list)
    return temp_data

temp_data = fullgame_dlc_check(storefront)
print('recoverable dlcs count: ', temp_data[temp_data['dlc_available'] == True][['name','dlc_available']].shape[0])
print('unrecoverable dlcs count: ', temp_data[temp_data['dlc_available'] == False][['name','dlc_available']].shape[0])

recoverable dlcs count:  0
unrecoverable dlcs count:  26


### dlc
This column contains the list of dlcs (in form of appids) for the game. Let's take a peek at how clean the data is:

In [134]:
#dlc
print('dlc nulls count:', storefront['dlc'].isnull().sum())
print('dlc non-nulls count:', storefront['dlc'].notnull().sum())
storefront[storefront["dlc"].notnull()][["name","dlc"]].sample(5)

dlc nulls count: 92808
dlc non-nulls count: 9696


Unnamed: 0_level_0,name,dlc
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
238320,Outlast,[273300]
1380410,Puzzle Quest 3,"[1882940, 1882941, 1882942, 1882943]"
1259430,First Snow,[1408990]
1066130,Truck Life,"[1468080, 1130600, 1385070, 1287490]"
1120160,Tricky Cow,[1123760]


In [135]:
#numbers of items in dlc lists. -1 - for null in the column
storefront['dlc'].apply(lambda x: -1 if x != x else len(ast.literal_eval(x))).value_counts()

-1      92808
 1       6230
 2       1519
 3        563
 4        326
        ...  
 59         1
 136        1
 87         1
 169        1
 101        1
Name: dlc, Length: 87, dtype: int64

Everything seems fine, we can leave the field as is.

#### [Subroutine] 'dlc': Cleaning

Reserved, the field left as is.

### ext_user_account_notice

The column contains information about the external accounts used, for example, for authentication. There are not many of them filled but this might still be useful information. Leaving as is.

#### [Subroutine] 'ext_user_account_notice': Cleaning

Reserved, the field left as is.

In [136]:
print('External user account nulls count:', storefront['ext_user_account_notice'].isnull().sum())
print('External user account count:', storefront['ext_user_account_notice'].notnull().sum())
storefront['ext_user_account_notice'].value_counts(dropna = False)

External user account nulls count: 101467
External user account count: 1037


NaN                                               101467
Uplay (Supports Linking to Steam Account)             38
EA Account (Supports Linking to Steam Account)        30
Slitherine PBEM++ for Multiplayer                     27
Twitch                                                23
                                                   ...  
RareSloth ID                                           1
http://anpa.us                                         1
Spoorky Account                                        1
facebook (Supports Linking to Steam Account)           1
Wonderpot Account                                      1
Name: ext_user_account_notice, Length: 703, dtype: int64

### drm_notice

This column contains information on DRM Protection technology used in the app. The number of of non-null items is suprisingly small so it seems it'was not strictly necessary to fill it. It seems like there is no fixed field structure to parse it.

Considering all of that, I'll leave it as is but it is a very strong candidate on removal.

#### [Subroutine] 'drm_notice': Cleaning

Reserved, the field left as is.

In [137]:
print('DRM Notice nulls count:', storefront['drm_notice'].isnull().sum())
print('DRM Notice non-nulls count:', storefront['drm_notice'].notnull().sum())
storefront['drm_notice'].value_counts(dropna = False)

DRM Notice nulls count: 101798
DRM Notice non-nulls count: 706


NaN                                                                                           101798
Denuvo Anti-tamper<br>5 different PC within a day machine activation limit                       228
Denuvo Anti-tamper                                                                               162
EA on-line activation and Origin client software installation and background use required.        74
Denuvo Antitamper                                                                                 33
                                                                                               ...  
Valeroa Anti-Tamper                                                                                1
Denuvo Anti-Tamper<br>5 a day machine activation limit                                             1
My.com                                                                                             1
Proprietary DRM<br>5 (renewal upon request) machine activation limit                       

### recommendations

This column contains the total number of reviews for games. We have this field in the reviews table (and it's basically using the same source) as well so it can be safely removed as redundant. Interestingly, we have much less nulls in the reviews table - values of 100 and below are filtered out of the field in storefront.

In [138]:
#recommendations quick look
def getminrec(df):
    temp_df = df.copy()
    temp_df['recommendations'] = temp_df['recommendations'].apply(extractDictItem, key = 'total')
    return temp_df['recommendations'].min()

print('recommendations nulls count:', storefront['recommendations'].isnull().sum())
print('storefront.recommendations minumum value', getminrec(storefront))
print('reviews.total_reviews count:', reviews[(reviews['total_positive'] == np.NaN) | (reviews['total_positive'] == 0)].shape[0])
storefront[storefront["recommendations"].notnull()][["name","recommendations"]].sample(5)

recommendations nulls count: 89318
storefront.recommendations minumum value 101.0
reviews.total_reviews count: 40794


Unnamed: 0_level_0,name,recommendations
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
981260,Love In Drawing,{'total': 209}
6400,Joint Task Force,{'total': 182}
428690,Youtubers Life,{'total': 13119}
639270,Operation Warcade VR,{'total': 253}
32430,STAR WARS‚Ñ¢ - The Force Unleashed‚Ñ¢ Ultimate Sit...,{'total': 5579}


In [139]:
temp_df = storefront.copy()
temp_df['recommendations'] = temp_df['recommendations'].apply(extractDictItem, key = 'total')
temp_df['recommendations'].min()

101.0

In [140]:
storefront.loc[610400].recommendations

"{'total': 101}"

In [141]:
reviews.loc[610400]

review_score                   8.0
review_score_desc    Very Positive
total_positive                84.0
total_negative                17.0
total_reviews                101.0
Name: 610400, dtype: object

#### [Subroutine] 'recommendations': Cleaning

Removed as redundant.

In [142]:
storefront = storefront.drop("recommendations", axis = 1)

### reviews

This column a selection of journalist reviews' quotes. The metacritic column is much more helpful as it contains both the score and link to the combined reviews. So I consider this column redundant (it's used to show a selection of review quotes on the game page).


In [143]:
#reviews
with pd.option_context("display.max_colwidth", 500):
    display(storefront[storefront["reviews"].notnull()][["name","reviews"]].sample(5))

Unnamed: 0_level_0,name,reviews
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
90400,Blue Toad Murder Files‚Ñ¢: The Mysteries of Little Riddle,"‚ÄúBlue Toad is a detailed pleasure, filled with humour.‚Äù<br>8/10 ‚Äì <a href=""https://steamcommunity.com/linkfilter/?url=http://www.eurogamer.net/articles/blue-toad-murder-files-season-one-review"" target=""_blank"" rel=""noopener"" >EuroGamer</a><br><br>‚ÄúMurder Files is a constant source of snorts, guffaws, and smart laughs. Listen closely to everything you hear in the game - not only will it help you ace the game's pop quizzes, but because you don't want to miss any of Murder Files' best and most..."
1613770,Cave Quest 2,"‚ÄúCave Quest 2 was a treat for me. I am very happy to have reviewed this game and to be able to give it such a good score so it hopefully gets the attention it deserves.‚Äù<br>93% ‚Äì <a href=""https://steamcommunity.com/linkfilter/?url=https://www.lifeisxbox.eu/review-cave-quest-2/"" target=""_blank"" rel=""noopener"" >LifeisXbox</a><br><br>‚ÄúCave quest 2 brilliantly combines the Match 3, Hidden Objects and Adventure genres; I thoroughly enjoyed playing it. What a little GEM!‚Äù<br>10/10 ‚Äì <a href=""http..."
1155940,Help! I am REALLY horny!,"‚ÄúThe irony of a game filled to the brim with giant dicks still being more respectful to consumers than many other titles is not lost on me.‚Äù<br><a href=""https://steamcommunity.com/linkfilter/?url=http://www.336gamereviews.com/help-i-am-really-horny-r18-review/"" target=""_blank"" rel=""noopener"" >336gamereviews</a><br>"
1420740,Your amazing T-Gotchi!,"‚ÄúI always feel a bit strange about games where you take care of a woman as a pet‚Äù<br><a href=""https://steamcommunity.com/linkfilter/?url=https://www.kritiqal.com/articles/thoughts-on-indiepocalpyse-8"" target=""_blank"" rel=""noopener"" >KRITIQAL</a><br>"
777250,The Tavern of Magic,"‚Äúthe Tavern of Magic is the level of visual effects and atmosphere, which is very lacking in other games in VR. Keep it up!‚Äù<br />\r\nDevGAMM Judge<br />\r\n<br />\r\n‚ÄúMagic duel for multiple players, very vivid, spectacular and is an excellent test for your reaction, playing with friends is a pleasure‚Äù<br />\r\nVR Games and All_All_All<br />\r\n<br />\r\n‚ÄúA beautiful online game in which you have to fight in magic card matches‚Äù<br />\r\nVR Goldgrabbers<br />\r\n"


#### [Subroutine] 'reviews': Cleaning

Removed as redundant

In [144]:
storefront = storefront.drop("reviews", axis = 1)

### controller_support

This column contains information about the apps' controllers support levels. If you remember, we had game features in categories and there was controller support information there as well:

* Full controller support       25126
* Partial Controller Support    18518

Let's take a look at this column and compare it with categories:

In [145]:
print('Controller Support nulls count:', storefront['controller_support'].isnull().sum())
storefront['controller_support'].value_counts(dropna = False)

Controller Support nulls count: 77007


NaN     77007
full    25497
Name: controller_support, dtype: int64

In [146]:
# creating temporary boolean dataframe with the required
temp_df = storefront[['controller_support','categories']].copy()
temp_uniques = ['Full controller support','Partial Controller Support']
temp_df['categories'].fillna({i: [] for i in temp_df.index},inplace = True)
temp_df = boolean_df(temp_df['categories'], temp_uniques)
temp_df['controller_support - full'] = storefront['controller_support'].apply(lambda x: True if x=='full' else False)
print('controller_support == full and no Full controller support category:',
      temp_df[(temp_df['Full controller support'] == False) & (temp_df['controller_support - full'] == True)].shape[0])
print('controller_support == None and Full controller support category:',
      temp_df[(temp_df['Full controller support']) & (temp_df['controller_support - full'] == False)].shape[0])

controller_support == full and no Full controller support category: 0
controller_support == None and Full controller support category: 0


As we can see, controller data already exists in categories and there are no discrepancies here (Also, we even have a partial controller support in categories, unlike this column).

We can safely drop it:
#### [Subroutine] 'controller_support': Cleaning

In [147]:
storefront = storefront.drop("controller_support", axis = 1)

### legal_notice
This column doesn't seem to contain any usefull information. Dropping.

In [148]:
print('Legal Notice nulls count:', storefront['legal_notice'].isnull().sum())
with pd.option_context("display.max_colwidth", 100):
    display(storefront[storefront['legal_notice'].notnull()][['name','legal_notice']].sample(10))

Legal Notice nulls count: 61740


Unnamed: 0_level_0,name,legal_notice
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
1788740,MedaAbi,¬©„Ç™„É¢„É†„É≠„Çπ
1304441,Train Sim World¬Æ 2: BR Heavy Freight Pack Loco Add-On,"¬© 2019 Dovetail Games, a trading name of RailSimulator.com Limited (‚ÄúDTG‚Äù). All rights reserved...."
349230,Stronghold Crusader 2: The Templar and The Duke,¬© 2020 FIREFLY HOLDINGS LIMITED. All rights reserved.
623194,Fate/EXTELLA - Resort Vacances,"¬©TYPE-MOON. ¬©2017 Marvelous Inc. Licensed to and published by XSEED Games / Marvelous USA, Inc.<..."
1787770,Spirits of Carter Mansion,"Spirits of Carter Mansion, and the artworks therein are the property of Cutlass Boardgames"
263480,Final Rush,"<i>¬©2014 Strike Games, LLC. All rights reserved.</i>"
896741,FSX Steam Edition: Piper PA-32 Saratoga II TC Add-On,‚ÄúDovetail Games‚Äù (‚ÄúDTG‚Äù) is a trading name of RailSimulator.com Limited. ‚ÄúDovetail Games‚Äù is a t...
928580,Rytmik Studio ‚Äì MEGA PACK: Games & Videos,"2017-2019 ¬© CINEMAX, s.r.o."
1559870,ORBTRAIN - Slot Racing,"<strong>LIFE-CYCLE</strong><br><ul class=""bb_ul"">The life-cycle of the game includes the phases ..."
1100350,#Funtime,"¬© Copyright 2020 The Quantum Astrophysicists Guild, Incorporated. All rights reserved."


#### [Subroutine] 'legal_notice': Cleaning

In [149]:
storefront = storefront.drop("legal_notice", axis = 1)

### metacritic

This column contains the Metacritic score and the link for the apps. Unfortunately, there are not many of them but it's an interesting information.

It makes sense to move it to the reviews for now. We will decide the final table structure close to the end.

Let's take a look at this column:

In [150]:
#metacritic
print('Metacritic nulls count:', storefront['metacritic'].isnull().sum())
print('Metacritic non-nulls count:', storefront['metacritic'].notnull().sum())
print('rows missing from reviews that contain non-null metacritic: ',
      storefront[(storefront.index.isin(
          storefront.index.difference(reviews.index))) 
                 & (storefront['metacritic'].notnull())].shape[0]
                 )
with pd.option_context("display.max_colwidth", 100):
    display(storefront[storefront["metacritic"].notnull()][["name","metacritic"]].sample(5))

Metacritic nulls count: 98679
Metacritic non-nulls count: 3825
rows missing from reviews that contain non-null metacritic:  1


Unnamed: 0_level_0,name,metacritic
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
622770,Hacktag,"{'score': 73, 'url': 'https://www.metacritic.com/game/pc/hacktag?ftag=MCD-06-10aaa1f'}"
594650,Hunt: Showdown,"{'score': 81, 'url': 'https://www.metacritic.com/game/pc/hunt-showdown?ftag=MCD-06-10aaa1f'}"
783340,FRACTER,"{'score': 64, 'url': 'https://www.metacritic.com/game/pc/fracter?ftag=MCD-06-10aaa1f'}"
327500,Zenzizenzic,"{'score': 85, 'url': 'https://www.metacritic.com/game/pc/zenzizenzic?ftag=MCD-06-10aaa1f'}"
1787810,Song in the Smoke,"{'score': 78, 'url': 'https://www.metacritic.com/game/pc/song-in-the-smoke?ftag=MCD-06-10aaa1f'}"


#### [Subroutine] 'metacritic': Cleaning

In [151]:
def metacritic_clean(df1,df2):
    """ 
    Parse metacritic  column to 2 news columns - metacritic_score and metacritic_url,
    copy them to reviews and remove from the storefront
    """
    def metacritic_parse(data):
    # parsing metacritic column
        if data != data:
            return np.NaN, np.NaN
        try:
            evalDict = eval(data)
            if(type(evalDict) == dict):
                return evalDict['score'],evalDict['url']
        except:
            np.NaN,np.NaN
        return np.NaN,np.NaN
    
    df1 = df1.copy()
    df2 = df2.copy()
    df1[['metacritic_score','metacritic_url']]=df1.apply(lambda row: metacritic_parse(row.metacritic),axis=1,result_type='expand')

    #copying columns to reviews (creating new rows if necessary)
    df2 = pd.concat([df2,df1[['metacritic_score','metacritic_url']]], ignore_index=False, axis = 1)
    #filling the new rows
    review_fills = {"review_score": 0, "review_score_desc": "No user reviews", "total_positive": 0, "total_reviews": 0, "total_negative": 0}
    df2.fillna(value = review_fills, inplace = True)
    
    # removing unneeded columns
    
    df1.drop(['metacritic', 'metacritic_score', 'metacritic_url'], axis = 1, inplace = True)
    return df1, df2

In [152]:
storefront, reviews = metacritic_clean(storefront,reviews)

Checking updated dataframes:

In [153]:
storefront.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 102504 entries, 10 to 2008820
Data columns (total 21 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   type                     102504 non-null  object        
 1   name                     102504 non-null  object        
 2   required_age             102504 non-null  int64         
 3   dlc                      9696 non-null    object        
 4   fullgame                 34607 non-null   object        
 5   supported_languages      102352 non-null  object        
 6   drm_notice               706 non-null     object        
 7   ext_user_account_notice  1037 non-null    object        
 8   developers               102463 non-null  object        
 9   publishers               102464 non-null  object        
 10  demos                    6504 non-null    object        
 11  packages                 81153 non-null   object        
 12  platforms     

In [154]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 102521 entries, 10 to 2028850
Data columns (total 7 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   review_score       102521 non-null  float64
 1   review_score_desc  102521 non-null  object 
 2   total_positive     102521 non-null  float64
 3   total_negative     102521 non-null  float64
 4   total_reviews      102521 non-null  float64
 5   metacritic_score   3825 non-null    float64
 6   metacritic_url     3825 non-null    object 
dtypes: float64(5), object(2)
memory usage: 6.3+ MB


# Reviews table

This table contains the combined data about the games reviews, and we've also added the Metacritic scores + url links earlier. Judging by the info, there shouldn't be any nulls (aside from metacritic columns) but we'll take a look anyways.


In [155]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 102521 entries, 10 to 2028850
Data columns (total 7 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   review_score       102521 non-null  float64
 1   review_score_desc  102521 non-null  object 
 2   total_positive     102521 non-null  float64
 3   total_negative     102521 non-null  float64
 4   total_reviews      102521 non-null  float64
 5   metacritic_score   3825 non-null    float64
 6   metacritic_url     3825 non-null    object 
dtypes: float64(5), object(2)
memory usage: 6.3+ MB


In [156]:
reviews

Unnamed: 0_level_0,review_score,review_score_desc,total_positive,total_negative,total_reviews,metacritic_score,metacritic_url
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
10,9.0,Overwhelmingly Positive,117261.0,3686.0,120947.0,88.0,https://www.metacritic.com/game/pc/counter-str...
20,8.0,Very Positive,3896.0,705.0,4601.0,,
30,8.0,Very Positive,2794.0,398.0,3192.0,79.0,https://www.metacritic.com/game/pc/day-of-defe...
40,6.0,Mostly Positive,1214.0,308.0,1522.0,,
50,9.0,Overwhelmingly Positive,11343.0,519.0,11862.0,,
...,...,...,...,...,...,...,...
2028023,0.0,No user reviews,0.0,0.0,0.0,,
2028055,0.0,No user reviews,0.0,0.0,0.0,,
2028056,0.0,No user reviews,0.0,0.0,0.0,,
2028062,0.0,No user reviews,0.0,0.0,0.0,,


We have three columns describing the reviews counts:
* total_positive - total positive reviews,
* total_negative - total negative reviews,
* total_reviews - total reviews.

Two columns describing the Steam reviews scores:
* review_score - reviews score as calculated by Steam,
* review_score_desc - text description of the said score.

And two columns describing Metacritic scores:
* metacritic_score - the score at the time of data collection
* metacritic_url - url address of the game on the Metacritic site

Sadly, Steam score has some issues, as described by [SteamDB](https://steamdb.info/blog/steamdb-rating/). In short, that rating has issues with the low number of reviews and is not very good with sorting. The formula proposed by the linked article takes that into account and gives us the adjusted rating with respect to the number of reviews and the ‚Äúreal rating‚Äù.

*Thanks to SteamDB and /u/tornmandate for providing such a useful rating score (which is shared under the MIT license)*

I'll use the said formula as well to determine the rating. **Keep in mind, that it's still not recommended to rely on the rating with less than 500 votes**.

![image.png](attachment:ac928351-f3a5-4c7d-ad3c-03237ed936da.png)![image.png](attachment:601f3b09-d93c-42d5-90b0-31795e2a6d6b.png)

In [157]:
reviews["rating"] = (
                        reviews["total_positive"]/reviews["total_reviews"] - 
                        (reviews["total_positive"]/reviews["total_reviews"] - 0.5)*np.power(2,-np.log10(reviews["total_reviews"]+1))
                    )*100

Let's check the games with the highest reviews:

In [158]:
temp_df = pd.concat([storefront,reviews], ignore_index=False, axis = 1)
temp_df.sort_values(by='rating', ascending = False)[['name','review_score','total_positive','total_negative','rating']]

Unnamed: 0_level_0,name,review_score,total_positive,total_negative,rating
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
620,Portal 2,9.0,234828.0,2820.0,97.637876
1118200,People Playground,9.0,108309.0,1119.0,97.487823
1794680,Vampire Survivors,9.0,93523.0,968.0,97.418751
427520,Factorio,9.0,106872.0,1187.0,97.408600
1145360,Hades,9.0,171135.0,2345.0,97.360320
...,...,...,...,...,...
2009270,Othello: Daynight Time Clash - 18+ Expansion Pack,0.0,0.0,0.0,
2028023,Total War Saga: FALL OF THE SAMURAI ‚Äì Blood Pack,0.0,0.0,0.0,
2028055,Tom Clancy's Ghost Recon Future Soldier - Seas...,0.0,0.0,0.0,
2028056,Worms Revolution Season Pass,0.0,0.0,0.0,


This correlates with what we see at https://steamdb.info/stats/gameratings/ . As you can see, the numbers of reviews are somewhat different, the reason being that we've downloaded only reviews for people that bought the game from Steam.

In [159]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 102521 entries, 10 to 2028850
Data columns (total 8 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   review_score       102521 non-null  float64
 1   review_score_desc  102521 non-null  object 
 2   total_positive     102521 non-null  float64
 3   total_negative     102521 non-null  float64
 4   total_reviews      102521 non-null  float64
 5   metacritic_score   3825 non-null    float64
 6   metacritic_url     3825 non-null    object 
 7   rating             64786 non-null   float64
dtypes: float64(6), object(2)
memory usage: 11.1+ MB


In the mathematical operations we got a lot of NaNs, due to games with 0 total reviews. Let's assign them a score of 50%, as the medium point. This is the same approach used in the algorithm above.

In [160]:
reviews["rating"] = reviews["rating"].fillna(50.0)

Let's also remove the excessive entries from reviews if they are present:

In [161]:
def df_remove_excesses(df_primary, df_secondary):
    excesses = df_secondary.index.difference(df_primary.index)
    df_secondary = df_secondary.drop(excesses, axis=0)
    return df_secondary

And remove the field that contain the excessive information:
* total_reviews - we can calculate them by using total_positive and total_negative
* review_score_desc - we can describe the scores in the review_score metadata if needed.

In [162]:
reviews = df_remove_excesses(storefront,reviews)
reviews.drop([
        'total_reviews', 'review_score_desc'
    ], axis=1, inplace = True)

We'll decide how we are going to  join/split the table close to the end and leave the reviews for now.

# SteamSpy

This table contains data collected from SteamSpy. The columns are:

| Column  | Description |
| --- | --- |
| appid | Appid, used as index |
| name | Application name |
| developers | Application developers |
| publishers | Application publishers |
| score_rank| Steam reviews score rank |
| total_positive | Positive reviews count|
| total_negative | Negative reviews count |
| review_score | Steam review score |
| owners | Estimated owner numbers |
| average_forever | Average playtime |
| average_2weeks | Average playtime in the last two weeks |
| median_forever | Median playtime |
| median_2weeks | Median playtime in the last two weeks |
| price | Current game price |
| initialprice | Initial game price |
| discount | Discount |
| supported_languages | Supported languages |
| genres | App Genres |
| ccu | Peak concurrent players on the day before the data collection (*not the max historical!*) |
| tags | User tags with counts |


We have already got the clean data for most of the fields from the Storefront and Reviews tables. Also, the data from SteamSpy is not as complete and recent comparing to the one directy downloaded from Steam. These columns are:
- appid
- name
- developers
- publishers
- score_rank
- total_positive
- total_negative
- review_score
- price
- initialprice
- discount
- supported_languages
- genres
 
I'll still take a quick look on this fields one by one.

The columns we are interested in:
- owners
- average_forever
- average_2weeks
- median_forever
- median_2weeks
- ccu
- tags

### Conforming the table rows to the same ids as in the main storefront table:


In [163]:
steamspy = df_remove_excesses(storefront,steamspy)


I'll create a temporary table merging storefront, reviews and steamspy for the steamspy data check

In [164]:
storefront_s = pd.concat([storefront, reviews.add_suffix("_reviews"), steamspy.add_suffix("_steamspy")], axis = 1)

## Columns we already have good data data on
### name
Application name. We've already worked with it on the storefront so it's of no use to us.

In [165]:
print('name nulls count:', steamspy['name'].isnull().sum())
with pd.option_context("display.max_colwidth", 150):
    display(storefront_s[["name", "name_steamspy"]].sample(5))

name nulls count: 222


Unnamed: 0_level_0,name,name_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
409520,Ginger: Beyond the Crystal,Ginger: Beyond the Crystal
391470,Towers of Altrac - Endless Mode,Towers of Altrac - Endless Mode
1630480,LuvSic - 18+ Adult Only Content,LuvSic - Patch
1207950,Èõ∂ÁïåÊàòÁ∫ø-Áà±‰∏Ω‰∏ùËßíËâ≤ÂåÖ,Èõ∂ÁïåÊàòÁ∫ø-Áà±‰∏Ω‰∏ùËßíËâ≤ÂåÖ
593370,Expeditions: Viking - Blood-Ice,Expeditions: Viking - Blood-Ice


### developers
Developers. We've already worked with it on the storefront so it's of no use to us.

In [166]:
print('developers nulls count:', storefront_s['developers_steamspy'].isnull().sum())
with pd.option_context("display.max_colwidth", 150):
    display(storefront_s[["name", "developers", "developers_steamspy"]].sample(5))

developers nulls count: 10196


Unnamed: 0_level_0,name,developers,developers_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
645190,Jungle Hostages,['Daylights Games'],Daylights Games
1509050,Space of Retaliation,"['Matt Sowards', 'Michael Scott']","Matt Sowards, Michael Scott"
1361180,Gemini Strategy Origin,['Gemini Stars Games'],Gemini Stars Games
371430,Space Grunts,['Orangepixel'],Orangepixel
1130820,Framing Dawes,['Jinx-It Games'],Jinx-It Games


### publishers
Publishers. We've already worked with it on the storefront so it's of no use to us.

In [167]:
print('publishers nulls count:', storefront_s['publishers_steamspy'].isnull().sum())
with pd.option_context("display.max_colwidth", 150):
    display(storefront_s[["name", "publishers", "publishers_steamspy"]].sample(5))

publishers nulls count: 19359


Unnamed: 0_level_0,name,publishers,publishers_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1125700,Endless Void,['Punkmice'],Punkmice
1879760,Layerworld,['BrandKnew.io'],
1078400,Furries & Scalies & Bears OH MY!: The Bear DLC,['Stegalosaurus Game Development'],Stegalosaurus Game Development
1446750,Healing Animal,['TeamAppleMonkey Inc.'],TeamAppleMonkey Inc.
543070,"Gray Skies, Dark Waters","['Green Willow Games, LLC']","Green Willow Games, LLC"


### score_rank
Steam review score rank. We've already worked with it on the storefront so it's of no use to us.

In [168]:
print('score_rank nulls count:', storefront_s['score_rank_steamspy'].isnull().sum())
with pd.option_context("display.max_colwidth", 150):
    display(storefront_s[storefront_s["score_rank_steamspy"].notnull()][["name", "rating_reviews", "score_rank_steamspy"]].sample(5))

score_rank nulls count: 102458


Unnamed: 0_level_0,name,rating_reviews,score_rank_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
929310,Kamasutra Connect : Sexy Hentai Girls,61.593487,98.0
331065,Call of Duty¬Æ: Advanced Warfare - Lightning Premium Personalization Pack,54.55079,98.0
726360,BOOBS SAGA: Prepare To Hentai Edition,74.016814,99.0
896890,VR Paradise - Steam Edition,83.503452,99.0
367210,BADLAND: Game of the Year Edition - Digital Art Booklet & Ambient Soundtrack,75.029343,100.0


### total_positive
Total number of positive reviews. We've already worked with it on the storefront so it's of no use to us.

In [169]:
print('total_positive nulls count:', storefront_s['total_positive_steamspy'].isnull().sum())
with pd.option_context("display.max_colwidth", 150):
    display(storefront_s[storefront_s["total_positive_steamspy"]>0][["name", "total_positive_reviews", "total_positive_steamspy"]].sample(5))

total_positive nulls count: 4


Unnamed: 0_level_0,name,total_positive_reviews,total_positive_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1220880,ÈôÜÂ§ßËøπÁ•û‚Ö°,5.0,5.0
1637200,Mage Tower,8.0,22.0
703720,ERSATZ,11.0,14.0
648980,100% Orange Juice - Starter Character Voice Pack,59.0,52.0
313020,Soul Gambler,1809.0,2266.0


### total_negative
Total number of negative reviews. We've already worked with it on the storefront so it's of no use to us.

In [170]:
print('total_negative nulls count:', storefront_s['total_negative_steamspy'].isnull().sum())
with pd.option_context("display.max_colwidth", 150):
    display(storefront_s[storefront_s["total_negative_steamspy"]>0][["name", "total_negative_reviews", "total_negative_steamspy"]].sample(5))

total_negative nulls count: 4


Unnamed: 0_level_0,name,total_negative_reviews,total_negative_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
575500,Learn Japanese To Survive! Katakana War - Manga + Art Book,2.0,2.0
8000,Tomb Raider: Anniversary,812.0,887.0
655780,Project 5: Sightseer,122.0,135.0
1217310,One True Cuddle,0.0,1.0
801150,IMM Defense,1.0,1.0


### review_score
Review score. We've already worked with it on the storefront so it's of no use to us.

In [171]:
print('review_score nulls count:', storefront_s['review_score_steamspy'].isnull().sum())
with pd.option_context("display.max_colwidth", 150):
    display(storefront_s[storefront_s["review_score_steamspy"]>0][["name", "review_score_reviews", "review_score_steamspy"]].sample(5))

review_score nulls count: 4


Unnamed: 0_level_0,name,review_score_reviews,review_score_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1013180,Funbag Fantasy,8.0,100.0
891870,King of Phoenix,5.0,84.0
825300,To Trust an Incubus,8.0,80.0
966460,Undress Tournament,0.0,57.0
906050,Hentai Case Opening,5.0,63.0


### price
Current price (including discounts). We've already worked with it on the storefront so it's of no use to us.

In [172]:
print('price nulls count:', storefront_s['price_steamspy'].isnull().sum())
with pd.option_context("display.max_colwidth", 150):
    display(storefront_s[storefront_s["price_steamspy"]>0][["name", "price", "price_steamspy"]].sample(5))

price nulls count: 9885


Unnamed: 0_level_0,name,price,price_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1697350,Tiger Tank 59 ‚Ö† Volcano MP031,0.79,99.0
1572810,–ú–ï–ú–û–õ–û–ì–ò–Ø,0.79,99.0
773951,Freeman: Guerrilla Warfare,20.99,2499.0
1475700,Arcade Game Machine Basketball,1.59,199.0
948570,Tabletopia - Euphoria: Build a Better Dystopia,4.99,599.0


### initialprice
Price without the discount. We've already worked with it on the storefront so it's of no use to us.

In [173]:
print('price nulls count:', storefront_s['initialprice_steamspy'].isnull().sum())
with pd.option_context("display.max_colwidth", 150):
    display(storefront_s[storefront_s["initialprice_steamspy"]>0][["name", "price", "initialprice_steamspy"]].sample(5))

price nulls count: 9883


Unnamed: 0_level_0,name,price,initialprice_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
774181,Rhythm Doctor,13.29,1599.0
416840,Color By,9.99,999.0
470200,Defend Your Crypt: Soundtrack + Extras,1.99,199.0
1374870,Idle Champions - Ascendant Widdle Theme Pack,20.99,2499.0
279622,Europa Universalis IV: Trade Nations Unit Pack,1.99,199.0


### discount
Current dicount. We've already worked with it on the storefront so it's of no use to us.

In [174]:
print('discount nulls count:', storefront_s['discount_steamspy'].isnull().sum())
with pd.option_context("display.max_colwidth", 150):
    display(storefront_s[storefront_s["discount_steamspy"]>0][["name", "price_steamspy", "initialprice_steamspy", "discount_steamspy"]].sample(5))

discount nulls count: 9883


Unnamed: 0_level_0,name,price_steamspy,initialprice_steamspy,discount_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1748463,Starry Moon Island Red Snake MP04,74.0,99.0,25.0
1151050,Golf Gang,899.0,999.0,10.0
1127850,Apple Slash,319.0,399.0,20.0
46760,Ironclads: Schleswig War 1864,199.0,999.0,80.0
1600210,Halloween Night Mahjong 2,49.0,499.0,90.0


### supported_languages
Supported languages. Here languages are not divided on autio/text. We've already worked with this data on the storefront so it's of no use to us.

In [175]:
print('supported_languages nulls count:', storefront_s['supported_languages_steamspy'].isnull().sum())
with pd.option_context("display.max_colwidth", 150):
    display(storefront_s[storefront_s["supported_languages_steamspy"].notnull()][[
        "name",
        "supported_audio",
        "supported_languages",
        "supported_languages_steamspy"
    ]].sample(5))

supported_languages nulls count: 10051


Unnamed: 0_level_0,name,supported_audio,supported_languages,supported_languages_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1187303,[Revival] DOA6 Santa's Helper Costume - Hayabusa,"[ Japanese, English]","[English, French, German, Italian, Japanese, Korean, Russian, Simplified Chinese, Spanish - Spain, Traditional Chinese]","English, French, Italian, German, Spanish - Spain, Russian, Simplified Chinese, Traditional Chinese, Japanese, Korean"
1617948,Agrou - Bear Pet,,"[English, French, Korean, Turkish, Vietnamese]","French, English, Korean, Turkish, Vietnamese"
1170870,Rain of Fire,,[English],English
1273054,Death end re;Quest 2 - Deluxe Helping Hand Set,"[English, Japanese]","[English, Japanese, Simplified Chinese, Traditional Chinese]","English, Japanese, Simplified Chinese, Traditional Chinese"
393460,Crazy Pixel Streaker,,"[English, French, German, Italian, Portuguese - Brazil, Spanish - Spain]","English, French, Italian, German, Spanish - Spain, Portuguese - Brazil"


### genres
Genres. We've already worked with this data on the storefront so it's of no use to us.

In [176]:
print('genres nulls count:', storefront_s['genres_steamspy'].isnull().sum())
with pd.option_context("display.max_colwidth", 150):
    display(storefront_s[storefront_s["genres_steamspy"].notnull()][["name", "genres", "genres_steamspy"]].sample(5))

genres nulls count: 10134


Unnamed: 0_level_0,name,genres,genres_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1492810,No Cure 2,[Indie],Indie
1120070,Hyacinthus-donation,"[Free to Play, Indie]","Free to Play, Indie"
342801,Rocksmith¬Æ 2014 ‚Äì Spinal Tap - ‚ÄúTonight I‚Äôm Gonna Rock You Tonight‚Äù,"[Casual, Simulation]","Casual, Simulation"
1728910,IL-2 Sturmovik: Sd.Kfz. 10/5 Flak 38 Anti-Aircraft Gun,"[Action, Simulation]","Action, Simulation"
409110,Metal Reaper Online - Veteran Package,"[Free to Play, Massively Multiplayer]","Free to Play, Massively Multiplayer"


## Columns requiring analysis

### Owners
SteamSpy owners estimation. A string with lower .. upper application owners estimates. We could split it into two for lower and upper estimations but I'll just slightly reformat it to keep consistent with Nik Davis dataset.

In [177]:
print('owners nulls count:', storefront_s['owners_steamspy'].isnull().sum())
with pd.option_context("display.max_colwidth", 150):
    display(storefront_s[storefront_s["owners_steamspy"].notnull()][["name", "owners_steamspy"]].sample(5))

owners nulls count: 4


Unnamed: 0_level_0,name,owners_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
1599192,Warframe: Prime Vault ‚Äì Chroma Prime Pack,"0 .. 20,000"
235820,Element4l,"50,000 .. 100,000"
1213500,School for 3D Visual Novel Maker,"0 .. 20,000"
361600,Luna Sky,"50,000 .. 100,000"
1927300,Yin Yang Space,"0 .. 20,000"


#### [Subroutine] 'owners': Cleaning

In [178]:
def owners_clean(df):
    """
    Reformatting owners column to lower-upper format
    """
    df = df.copy()
    df['owners'] = df['owners'].str.replace(',', '', regex=True).str.replace(' .. ', '-', regex=True)
    return df

In [179]:
steamspy.sample(10)

Unnamed: 0_level_0,name,developers,publishers,score_rank,total_positive,total_negative,review_score,owners,average_forever,average_2weeks,median_forever,median_2weeks,price,initialprice,discount,supported_languages,genres,ccu,tags
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
1341570,DoHots,AChat Animation Studios,AChat Animation Studios,,0,0,0,"0 .. 20,000",0,0,0,0,1499.0,1499.0,0.0,English,"Casual, RPG",1,[]
277560,Where Angels Cry,Cateia Games,Cateia Games,,58,52,0,"100,000 .. 200,000",0,0,0,0,399.0,399.0,0.0,English,"Adventure, Casual, Indie",0,"{'Adventure': 147, 'Casual': 136, 'Point & Cli..."
1384970,LightBreak,Deev Interactive,Deev Interactive,,0,0,0,"0 .. 20,000",0,0,0,0,0.0,0.0,0.0,English,Indie,0,[]
92622,Xotic DLC: Warp Field Expansion Pack,"WXP Games, LLC","WXP Games, LLC",,1,0,0,"0 .. 20,000",0,0,0,0,199.0,199.0,0.0,"English, French, German, Italian, Spanish - Spain","Action, Indie",0,"{'Action': 22, 'Indie': 22}"
577470,Last Days Of Tascaria,Baltica Games,Baltica Games,,21,5,0,"0 .. 20,000",0,0,0,0,799.0,799.0,0.0,English,"Adventure, Indie, RPG, Strategy",0,"{'Strategy': 31, 'Adventure': 31, 'Indie': 30,..."
1578920,CrateTastrophe,Starstrike Studios,Starstrike Studios,,5,0,0,"20,000 .. 50,000",0,0,0,0,199.0,199.0,0.0,English,"Action, Indie, Strategy",0,"{'Action': 174, 'Strategy': 169, 'FPS': 161, '..."
755470,The World Next Door,Rose City Games,VIZ Media,,198,39,0,"0 .. 20,000",1,0,1,0,999.0,999.0,0.0,English,"Action, Adventure, Indie",1,"{'Visual Novel': 153, 'Puzzle': 145, 'Indie': ..."
346430,Spectrum: First Light,"Mido Basim, Danny Wei",Mido Basim,,4,4,0,"0 .. 20,000",0,0,0,0,999.0,999.0,0.0,English,Indie,0,"{'Indie': 21, 'Puzzle-Platformer': 12, 'Comic ..."
1890200,POG 7,Cute Hannah's Games,Cute Hannah's Games,,2,2,0,"0 .. 20,000",0,0,0,0,199.0,199.0,0.0,"English, French, Italian, German, Spanish - Sp...","Casual, Indie",0,"{'Casual': 75, 'Side Scroller': 47, '2D': 47, ..."
679750,Catch & Release,metricminds GmbH & Co KG,Advanced Interactive Gaming Ltd.,,284,30,0,"20,000 .. 50,000",0,0,0,0,1999.0,1999.0,0.0,"English, French, Italian, German, Spanish - Sp...","Simulation, Sports",2,"{'Simulation': 49, 'Sports': 44, 'Fishing': 26..."


### average_forever 

Average player playtime. We have some nulls (since we don't have data for some games). We will replace it with 0 to keep consistent with data on SteamSpy.

In [180]:
print('average_forever nulls count:', storefront_s['average_forever_steamspy'].isnull().sum())
print('average_forever zero count:', storefront_s[storefront_s["average_forever_steamspy"]==0]['average_forever_steamspy'].count())
with pd.option_context("display.max_colwidth", 150):
    display(storefront_s[storefront_s["average_forever_steamspy"].notnull()][["name", "average_forever_steamspy"]].sample(5))

average_forever nulls count: 4
average_forever zero count: 90439


Unnamed: 0_level_0,name,average_forever_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
1154810,Going Under,237.0
554330,Choppa,262.0
884010,Drugs to Bee - OST,0.0
1946310,Escape the Ayuwoki DEMAKE,0.0
602310,theHunter: Call of the Wild‚Ñ¢ - Bearclaw Lite Compound Bow,0.0


#### [Subroutine] 'average_forever': Cleaning

In [181]:
def average_forever_clean(df):
    """
    Cleaning average_forever in SteamSpy
    """
    df = df.copy()
    df['average_forever'].fillna(0)
    return df

In [182]:
steamspy = average_forever_clean(steamspy)

### average_2weeks 

Average player playtime in the last 2 weeks. While the data is interesting, it's only the last two weeks so it doesn't seem valuable in the long term. Going to drop.

In [183]:
print('average_2weeks nulls count:', storefront_s['average_2weeks_steamspy'].isnull().sum())
print('average_2weeks zero count:', storefront_s[storefront_s["average_2weeks_steamspy"]==0]["average_2weeks_steamspy"].count())
with pd.option_context("display.max_colwidth", 150):
    display(storefront_s[storefront_s["average_2weeks_steamspy"].notnull()][["name", "average_2weeks_steamspy"]].sample(5))

average_2weeks nulls count: 4
average_2weeks zero count: 101063


Unnamed: 0_level_0,name,average_2weeks_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
295222,Europa Universalis IV: Indian Ships Unit Pack,0.0
1490540,Poco In Dungeon - Pico In Cave,0.0
1073180,Hell Wedding Â§úÂ´ÅÂêéÁª≠ÔºàÂÆåÁªìÔºâ,0.0
972660,Spiritfarer¬Æ: Farewell Edition,43.0
1909230,VRUSEUM,0.0


### median_forever 

Median player playtime. We have some nulls (since we don't have data for some games). We will replace it with 0 to keep consistent with data on SteamSpy.

In [184]:
print('median_forever nulls count:', storefront_s['median_forever_steamspy'].isnull().sum())
print('median_forever zero count:', storefront_s[storefront_s["median_forever_steamspy"]==0]["median_forever_steamspy"].count())
with pd.option_context("display.max_colwidth", 150):
    display(storefront_s[storefront_s["median_forever_steamspy"].notnull()][["name", "median_forever_steamspy"]].sample(5))

median_forever nulls count: 4
median_forever zero count: 90439


Unnamed: 0_level_0,name,median_forever_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
1675750,Tiger Tank 59 ‚Ö† Air Strike MP041,0.0
1849170,Nested Rooms,0.0
1680620,Tree of Mu,0.0
335190,200% Mixed Juice!,4745.0
316441,DW8XLCE - SPECIAL COSTUME PACK 2,0.0


#### [Subroutine] 'median_forever': Cleaning

In [185]:
def median_forever_clean(df):
    """
    Cleaning average_forever in SteamSpy
    """
    df = df.copy()
    df['median_forever'].fillna(0)
    return df

In [186]:
steamspy = median_forever_clean(steamspy)

### median_2weeks 

Median player playtime in the last 2 weeks. While the data is interesting, it's only the last two weeks so it doesn't seem valuable in the long term. Going to drop.

In [187]:
print('median_2weeks nulls count:', storefront_s['median_2weeks_steamspy'].isnull().sum())
print('median_2weeks zero count:', storefront_s[storefront_s["median_2weeks_steamspy"]==0]["median_2weeks_steamspy"].count())
with pd.option_context("display.max_colwidth", 150):
    display(storefront_s[storefront_s["median_2weeks_steamspy"].notnull()][["name", "median_2weeks_steamspy"]].sample(5))

median_2weeks nulls count: 4
median_2weeks zero count: 101063


Unnamed: 0_level_0,name,median_2weeks_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
1452500,The Good Life,0.0
1145100,Singaria - Prologue,0.0
1034550,RIOT - Civil Unrest Soundtrack and Art Book,0.0
781920,The Haunted House of Doom,0.0
713840,Goblin Storm,0.0


### ccu 

Peak concurrent user count. This is a very interesting stat. Sadly, it's not a lifetime stat, but the stat for the day before the dataset is downloaded so it's not useful for analysis. Going to drop.

In [188]:
print('ccu nulls count:', storefront_s['ccu_steamspy'].isnull().sum())
print('ccu zero count:', storefront_s[storefront_s["ccu_steamspy"]==0]["ccu_steamspy"].count())
with pd.option_context("display.max_colwidth", 150):
    display(storefront_s[storefront_s["ccu_steamspy"].notnull()][["name", "ccu_steamspy"]].sample(5))

ccu nulls count: 4
ccu zero count: 86582


Unnamed: 0_level_0,name,ccu_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
1124390,Paradigm City,0.0
1697366,Tiger Tank 59 ‚Ö† Volcano MP047,0.0
1305610,Cosmodome,0.0
1350940,RPG Maker MZ - Add-on Vol.3: Train Tileset,0.0
463780,Lion Quest Soundtrack,0.0


### tags

User tags data. It includes both the tag and the number of users that put the tag. A dict with 'tag_name':tag_count elements.

I'll save just the tags themselves in the main table and move tags with tag numbers to the separate one.

In [189]:
print('tags nulls count:', storefront_s['tags_steamspy'].isnull().sum())
print('tags empty list count:', storefront_s[~storefront_s['tags_steamspy'].apply(lambda x: False if pd.isna(x) else bool(ast.literal_eval(x)))].shape[0])
with pd.option_context("display.max_colwidth", 150):
    display(storefront_s[storefront_s["tags_steamspy"].notnull()][["name", "tags_steamspy"]].sample(5))

tags nulls count: 4
tags empty list count: 41460


Unnamed: 0_level_0,name,tags_steamspy
appid,Unnamed: 1_level_1,Unnamed: 2_level_1
1956770,M.O.O.D.S.,[]
899870,Rocksmith¬Æ 2014 Edition ‚Äì Remastered ‚Äì Cat Stevens - ‚ÄúMorning Has Broken‚Äù,[]
769420,JuVentures,"{'Early Access': 171, 'Hidden Object': 155, 'Hand-drawn': 149, 'Family Friendly': 146, 'Relaxing': 141, 'Cute': 136, 'Atmospheric': 130, 'Colorful..."
1815990,Pedro and Sofia's Nuclear Winter,[]
1083070,Sapper - Defuse The Bomb Simulator,"{'Early Access': 291, 'Simulation': 264, 'First-Person': 257, 'Singleplayer': 254, 'Action-Adventure': 250, 'Adventure': 246, 'Colorful': 241, 'At..."


#### [Subroutine] 'tags': Cleaning

In [190]:
def clean_tags(df, export=False):
    """
    Processing SteamSpy tags with possible export.
    For exporting, we are spreading the tags to columns and put the number of users using the said tag as a value
    tags are renamed to comply with pandas column names requirements    
    
    We are leaving only the tags themselves in the table
    """    
    if export: 
        
        tag_data = df[['tags']].copy()
        
        def parse_export_tags(x):
            if pd.isnull(x):
                return {}
            x = ast.literal_eval(x)

            if isinstance(x, dict):
                return x
            elif isinstance(x, list):
                return {}
            else:
                raise TypeError('Something other than dict or list found')

        tag_data['tags'] = tag_data['tags'].apply(parse_export_tags)

        # Getting all tags for column names
        cols = set(itertools.chain(*tag_data['tags']))

        # And setting the user values
        for col in sorted(cols):
            col_name = col.lower().replace(' ', '_').replace('-', '_').replace("'", "")

            tag_data[col_name] = tag_data['tags'].apply(lambda x: x[col] if col in x.keys() else 0)

        tag_data = tag_data.drop('tags', axis=1)
        
        export_data(tag_data, 'steamspy_tag_data', index=True)
        print("Exported tag data")
        
        
    def parse_tags(x):
        if pd.isnull(x):
            return np.nan
        x = ast.literal_eval(x)
        
        if isinstance(x, dict):
            return list(x.keys())
        else:
            return np.nan
    
    df['tags'] = df['tags'].apply(parse_tags)
    
    # rows with null tags seem to be superseded by newer release, so remove (e.g. dead island)
    df = df[df['tags'].notnull()]
    
    return df

In [191]:
steamspy = clean_tags(steamspy, export = True)

  tag_data[col_name] = tag_data['tags'].apply(lambda x: x[col] if col in x.keys() else 0)


Exported steamspy_tag_data to '../data/export/steamspy_tag_data.csv'
Exported tag data


In [192]:
#Verifying exported data
pd.read_csv('../data/export/steamspy_tag_data.csv').sample(10)

Unnamed: 0,appid,1980s,1990s,2.5d,2d,2d_fighter,2d_platformer,360_video,3d,3d_fighter,...,web_publishing,well_written,werewolves,western,word_game,world_war_i,world_war_ii,wrestling,zombies,e_sports
36860,849680,0,0,0,12,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
78928,1561510,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
84508,1662890,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
627,24810,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
36325,840380,136,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
67590,1362165,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
77441,1535100,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
28685,705390,0,0,0,172,0,207,0,0,0,...,0,0,0,0,0,0,0,0,0,0
86754,1694310,0,0,0,0,0,0,0,213,195,...,0,0,0,0,0,0,0,0,0,0
30913,746410,0,0,0,9,0,0,0,0,0,...,0,0,0,0,0,0,0,0,9,0


### [Subroutine] SteamSpy: dropping columns

In [193]:
steamspy = steamspy.drop([
        'name', 'developers', 'publishers', 'score_rank', 'total_positive', 'total_negative', 'review_score',
    'price', 'initialprice', 'discount', 'supported_languages', 'genres', 'average_2weeks', 'median_2weeks',
    'ccu'
    ], axis=1)

After the processing, our SteamSpy data table will look like this:

In [194]:
steamspy.sample(10)

Unnamed: 0_level_0,owners,average_forever,median_forever,tags
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
805880,"0 .. 20,000",0,0,"[Casual, Action, Indie, Text-Based, Word Game,..."
1057470,"0 .. 20,000",0,0,"[Indie, Casual, Racing]"
207113,"0 .. 20,000",0,0,[Racing]
523980,"0 .. 20,000",0,0,"[Strategy, Sports]"
800700,"0 .. 20,000",0,0,"[Strategy, Indie, Tower Defense]"
1000830,"0 .. 20,000",0,0,"[Casual, Sexual Content, Indie, Strategy, Meme..."
643682,"0 .. 20,000",0,0,[Simulation]
579537,"0 .. 20,000",0,0,[Simulation]
896170,"0 .. 20,000",0,0,"[Adventure, Casual, Visual Novel, Anime, Free ..."
313630,"200,000 .. 500,000",73,102,"[Survival, Adventure, Exploration, Sci-fi, VR,..."


We'll combine it with storefront and review data into one Steam data table:

In [195]:
steam = pd.concat([storefront,reviews,steamspy], ignore_index=False, axis = 1)

In [196]:
steam.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 102504 entries, 10 to 2028850
Data columns (total 31 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   type                     102504 non-null  object        
 1   name                     102504 non-null  object        
 2   required_age             102504 non-null  int64         
 3   dlc                      9696 non-null    object        
 4   fullgame                 34607 non-null   object        
 5   supported_languages      102352 non-null  object        
 6   drm_notice               706 non-null     object        
 7   ext_user_account_notice  1037 non-null    object        
 8   developers               102463 non-null  object        
 9   publishers               102464 non-null  object        
 10  demos                    6504 non-null    object        
 11  packages                 81153 non-null   object        
 12  platforms     

In [197]:
steam.sample(10)

Unnamed: 0_level_0,type,name,required_age,dlc,fullgame,supported_languages,drm_notice,ext_user_account_notice,developers,publishers,...,review_score,total_positive,total_negative,metacritic_score,metacritic_url,rating,owners,average_forever,median_forever,tags
appid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
547070,dlc,Particle Fleet: Emergence - Corporate Bonus,0,,"{'appid': '422900', 'name': 'Particle Fleet: E...",[English],,,['Knuckle Cracker'],['Knuckle Cracker'],...,0.0,4.0,0.0,,,69.199408,"0 .. 20,000",0.0,0.0,"[Strategy, Indie, Simulation]"
819220,game,Yuso,0,,,"[English, French, German, Italian, Japanese, K...",,,['Vertical Reach'],['Vertical Reach'],...,0.0,4.0,1.0,,,62.506594,"0 .. 20,000",0.0,0.0,"[Indie, Strategy, Casual]"
1141770,game,Rage Melee,0,,,[English],,,"['N J FOX', 'KRH']",['N J FOX'],...,0.0,0.0,0.0,,,50.0,"0 .. 20,000",0.0,0.0,"[Free to Play, Indie, Action, Casual]"
340200,game,Bloop,0,,,[English],,,['2SD'],['KISS ltd'],...,4.0,6.0,11.0,,,41.454624,"100,000 .. 200,000",6.0,12.0,"[Indie, Casual, Simulation, Puzzle, Physics, 2D]"
636170,game,Reaching for Petals,0,[704610],,[English],,,['Blue Entropy Studios'],['Blue Entropy Studios'],...,6.0,76.0,30.0,,,66.38295,"0 .. 20,000",0.0,0.0,"[Adventure, Indie, Casual, Walking Simulator, ..."
978190,game,Steel Sword Story,0,,,"[English, Japanese, Simplified Chinese]",,,"['8bits fanatics', 'KADOKAWA CORPORATION']",['PLAYISM'],...,5.0,43.0,21.0,,,62.295669,"20,000 .. 50,000",0.0,0.0,"[Action, Indie, 2D, Pixel Graphics, Retro, Sin..."
2001900,dlc,RPG Maker MZ - MT Trees,0,,"{'appid': '1096900', 'name': 'RPG Maker MZ'}","[English, French, German, Italian, Japanese, K...",,,['Mega Tiles'],['Degica'],...,0.0,0.0,0.0,,,50.0,,,,
1070020,game,ÊàëÁöÑÁ∫∏Áâá‰∫∫Â•≥Âèã/Make butter together!,0,[1126350],,[Simplified Chinese],,,"['G+Â∑•Á®ãÂõ¢', 'Ê©òÂ≠êÁè≠']",[' PeriScope Game'],...,8.0,507.0,38.0,,,86.57441,"0 .. 20,000",193.0,193.0,"[Casual, Adventure, Indie, RPG, Anime, Visual ..."
643940,game,Jim is Moving Out!,0,,,[English],,,"['Handsome Box', 'CINEMAX, s.r.o.']","['CINEMAX, s.r.o.']",...,0.0,7.0,1.0,,,68.145781,"0 .. 20,000",0.0,0.0,"[Indie, Casual, Platformer, Puzzle, Puzzle-Pla..."
537180,game,Digimon Masters Online,0,,,"[English, Spanish - Latin America, Spanish - S...",,,"['Move Games Co., Ltd.']",['MOVE ON USA CO.'],...,5.0,110.0,72.0,,,58.263772,"1,000,000 .. 2,000,000",440.0,73.0,"[Free to Play, Anime, MMORPG, Massively Multip..."


# Finalizing table structure

After the processing have these tables available:

* steam
* steam_description_data
* steam_media_data
* steam_packages_info
* steam_requirements_data
* steam_support_info
* steamspy_tag_data
* missing_ids

Sadly, there are a lot of optional data in the steam table so it might be a good idea to move it to the optional table and join with the main table when necessary. The fields that go to the **steam_optional** are:

* drm_notice
* ext_user_account_notice
* demos
* content_descriptors
* metacritic_score
* metacritic_url

### [Subroutine] steam and steam_optional export

In [198]:
def steam_export(df):
    """
    Creating steam_optional table and exporting both steam and steam_optional
    """
    df = df.copy()
    # copying necessary columns into new df
    steam_optional_df = df[[
        "drm_notice",
        "ext_user_account_notice",
        "demos",
        "content_descriptors",
        "metacritic_score",
        "metacritic_url",
    ]].copy()
    
    # removing empty rows
    steam_optional_df.dropna(how = 'all', inplace=True)
           
    # dropping unnneeded columns from the main dataframe
    df = df.drop([
        "drm_notice",
        "ext_user_account_notice",
        "demos",
        "content_descriptors",
        "metacritic_score",
        "metacritic_url",
    ], axis=1)
    
    export_data(df, 'steam', index=True)
    export_data(steam_optional_df, 'steam_optional', index=True)

In [199]:
steam_export(steam)

Exported steam to '../data/export/steam.csv'
Exported steam_optional to '../data/export/steam_optional.csv'


In [200]:
#Verifying exported steam data
pd.read_csv('../data/export/steam.csv').sample(5)

Unnamed: 0,appid,type,name,required_age,dlc,fullgame,supported_languages,developers,publishers,packages,...,coming_soon,price,review_score,total_positive,total_negative,rating,owners,average_forever,median_forever,tags
81394,1601770,game,RUNNER,0,,,['English'],"['Truant Pixel, LLC']","['Truant Pixel, LLC']",,...,True,,0.0,0.0,0.0,50.0,,,,
55321,1156140,dlc,Vital Signs: ED - Injuries Package #1,0,,"{'appid': '1096640', 'name': 'Vital Signs: Eme...",['English'],['BreakAway Games'],['BreakAway Games'],[394273],...,False,3.99,0.0,0.0,0.0,50.0,,,,
32403,769170,game,Flinch,0,,,['English'],['Beplaya'],"['Beplaya', 'AJG']",[228712],...,False,3.99,0.0,1.0,1.0,50.0,"0 .. 20,000",0.0,0.0,"['Action', 'Adventure', 'Indie', 'Casual', 'Ea..."
7433,329370,dlc,Vertical Drop Heroes - Halloween Theme,0,,"{'appid': '311480', 'name': 'Vertical Drop Her...","['English', 'French', 'German', 'Portuguese - ...",['Nerdook Productions'],['Digerati'],[53158],...,False,0.0,0.0,0.0,0.0,50.0,"0 .. 20,000",0.0,0.0,"['Action', 'Indie']"
36916,850400,dlc,8-in-1 IQ Scale Bundle - Greek Dance (OST),0,,"{'appid': '772470', 'name': '8-in-1 IQ Scale B...","['English', 'French', 'German', 'Italian', 'Ja...",['ALEKSANDER CHEPAIKIN'],['ALEKSANDER CHEPAIKIN'],[266670],...,False,0.79,0.0,0.0,0.0,50.0,,,,


In [201]:
#Verifying exported steam_optional data
pd.read_csv('../data/export/steam_optional.csv').sample(5)

Unnamed: 0,appid,drm_notice,ext_user_account_notice,demos,content_descriptors,metacritic_score,metacritic_url
7131,970600,,,,This game contains a little violence and gore....,,
16118,1505297,,,,The game includes descriptions of women's clot...,,
672,47880,,,,,86.0,https://www.metacritic.com/game/pc/battlefield...
1598,265000,,,,,67.0,https://www.metacritic.com/game/pc/forced-show...
11559,1221340,,,,This Game may contain content not appropriate ...,,


### [Subroutine] missing_ids export

In [202]:
export_data(missing_ids, 'missing_ids', index=True)

Exported missing_ids to '../data/export/missing_ids.csv'


# Combined clean-up script

# TODO

In [203]:
def combined_cleanup():
    
    
    return True

# Tests

In [204]:
# Testing if the number of rows is consistent between the pre and post processing
def row_check():
    pre_count = pd.read_csv("../data/processing/steam_app_data.csv").shape[0]
    print("Number of rows before processing:", pre_count)
    missing_count_pre = pd.read_csv('../data/processing/missing_ids.csv').shape[0]
    print("Number of missing before processing:", missing_count_pre)
    post_count = pd.read_csv('../data/export/steam.csv').shape[0]
    print("Number of rows after processing:", post_count)
    missing_count_post = pd.read_csv('../data/export/missing_ids.csv').shape[0]
    print("Number of missing after processing:", missing_count_post)
    if ((pre_count + missing_count_pre) == (post_count + missing_count_post)):
        return True
    return False

In [205]:
print("Number of rows test results:", row_check())

  print("Number of rows test results:", row_check())


Number of rows before processing: 103185
Number of missing before processing: 20
Number of rows after processing: 102504
Number of missing after processing: 701
Number of rows test results: True
