# Steam Data Cleaning (Part 2)

*This is part of a larger series of notebooks on downloading, processing and analysing data from the Steam store. [See all notebooks here.](../notebooks)*

See https://github.com/jbwhit/OSCON-2015/blob/master/develop/2015-07-16-jw-example-notebook-setup.ipynb for local imports



**TODO**: genre and categories section writeup

Currently our downloaded data is not in a very usable or useful state. Many of the columns contain lengthy strings or missing values, both of which are crippling to analysis and especially to any machine learning techniques we may wish to implement.

The main aims of this project are to investigate various sales and play-time statistics for games from the Steam store, and to see how different features of games may affect their success. Keeping this in mind will help inform our decisions about how to handle the various columns in our data set. That said, it may be worth keeping some columns that don't seem useful to this particular project, in order to provide a robust data set for future analysis.

To begin with, we'll import our libraries and set some options, then take a look at the data downloaded from the Steam API. Once that is taken care of, we will move on to the SteamSpy data and repeat the process. Hopefully, by the end we will have clean data sets to use in the next step: exploratory analysis and visualisation.

### Aims:
- Improve functions
- Prepare notebook for delivery

### (Raw) Data Dictionary

Sort out data dictionary  

API and data dictionary:
https://steamspy.com/api.php

### Future ideas:
- pc requirements analysis over time
- picture analysis
- keyword/recommender analysis
- categories could make table in a database all on its own, perhaps in future
- for genres (and categories?) could create main genre, selected from list of key genres, allowing hybrids like action_adventure if contains both
- remove titles over £60/100?

In [1]:
# load extensions and magics

# http://raw.github.com/jrjohansson/version_information/master/version_information.py
%load_ext version_information
%reload_ext version_information

%version_information numpy, pandas

Software,Version
Python,3.7.3 64bit [MSC v.1915 64 bit (AMD64)]
IPython,7.5.0
OS,Windows 10 10.0.17763 SP0
numpy,1.16.3
pandas,0.24.2
Thu May 09 20:04:32 2019 GMT Summer Time


In [2]:
# import libraries
from ast import literal_eval
import itertools
import time
import re

import numpy as np
import pandas as pd

In [3]:
# customisations
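# use the full option name: the short form "max_columns" is ambiguous in newer pandas versions
pd.set_option("display.max_columns", 100)
# pd.reset_option("display.max_columns")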

## Cleaning steam data

### Import Data

We begin by importing the raw steam data we generated previously in data collection, which can be viewed by following the link to `../deliver/1-data-collection.ipynb` below. From a quick inspection of the data, we can see that we have a mixture of numeric and string columns, plenty of missing values, and a number of columns stored as dictionaries.

In [4]:
from IPython.display import FileLink
FileLink("../deliver/1-data-collection.ipynb")

In [5]:
raw_steam_data = pd.read_csv('../data/raw/steam_app_data.csv')

print('Rows:', raw_steam_data.shape[0])
print('Columns:', raw_steam_data.shape[1])
raw_steam_data.head()

Rows: 29235
Columns: 39


Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,legal_notice,drm_notice,ext_user_account_notice,developers,publishers,demos,price_overview,packages,package_groups,platforms,metacritic,reviews,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
0,game,Counter-Strike,10,0.0,False,,,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,,"English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'GBP', 'initial': 719, 'final': 7...",[7],"[{'name': 'default', 'title': 'Buy Counter-Str...","{'windows': True, 'mac': True, 'linux': True}","{'score': 88, 'url': 'https://www.metacritic.c...",,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 65735},{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'url': 'http://steamcommunity.com/app/10', 'e...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
1,game,Team Fortress Classic,20,0.0,False,,,One of the most popular online action games of...,One of the most popular online action games of...,One of the most popular online action games of...,,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/20/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'GBP', 'initial': 399, 'final': 3...",[29],"[{'name': 'default', 'title': 'Buy Team Fortre...","{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 2802},{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/20/...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
2,game,Day of Defeat,30,0.0,False,,,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,,"English, French, German, Italian, Spanish - Spain",https://steamcdn-a.akamaihd.net/steam/apps/30/...,http://www.dayofdefeat.com/,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'GBP', 'initial': 399, 'final': 3...",[30],"[{'name': 'default', 'title': 'Buy Day of Defe...","{'windows': True, 'mac': True, 'linux': True}","{'score': 79, 'url': 'https://www.metacritic.c...",,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 1992},{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/30/...,"{'ids': [], 'notes': None}"
3,game,Deathmatch Classic,40,0.0,False,,,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/40/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'GBP', 'initial': 399, 'final': 3...",[31],"[{'name': 'default', 'title': 'Buy Deathmatch ...","{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 931},{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/40/...,"{'ids': [], 'notes': None}"
4,game,Half-Life: Opposing Force,50,0.0,False,,,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,,"English, French, German, Korean",https://steamcdn-a.akamaihd.net/steam/apps/50/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Gearbox Software'],['Valve'],,"{'currency': 'GBP', 'initial': 399, 'final': 3...",[32],"[{'name': 'default', 'title': 'Buy Half-Life: ...","{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 4355},{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'url': 'https://help.steampowered.com', 'emai...",https://steamcdn-a.akamaihd.net/steam/apps/50/...,"{'ids': [], 'notes': None}"


We can chain the `isnull()` and `sum()` methods to easily see how many missing values we have in each column. Immediately we can see that a number of columns have over 20,000 rows with missing data, and in a data set of almost 30,000 rows these are unlikely to provide any useful information.

In [6]:
raw_steam_data.isnull().sum()

type                         149
name                           1
steam_appid                    0
required_age                 149
is_free                      149
controller_support         23237
dlc                        24260
detailed_description         175
about_the_game               175
short_description            175
fullgame                   29235
supported_languages          163
header_image                 149
website                     9983
pc_requirements              149
mac_requirements             149
linux_requirements           149
legal_notice               19168
drm_notice                 29077
ext_user_account_notice    28723
developers                   264
publishers                   149
demos                      27096
price_overview              3712
packages                    3370
package_groups               149
platforms                    149
metacritic                 26254
reviews                    23330
categories                   714
genres    
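
The null counts above can also be expressed as a fraction of rows, which makes a cut-off easier to apply. The following is a minimal sketch on made-up data, assuming the 50% threshold mentioned in the comments of `process` further down. The actual `process_null_cols` function is defined elsewhere in this series and is not shown here, so the names `toy`, `null_fraction`, `mostly_null` and `trimmed` are purely illustrative.

```python
import pandas as pd

# Toy frame standing in for raw_steam_data (values are made up)
toy = pd.DataFrame({
    "steam_appid": [10, 20, 30, 40],
    "website": [None, "http://example.com", None, None],
    "name": ["A", "B", "C", None],
})

# isnull() gives a boolean frame, so mean() is the fraction of missing rows per column
null_fraction = toy.isnull().mean()

# Columns where more than half of the rows are missing (assumed threshold)
threshold = 0.5
mostly_null = null_fraction[null_fraction > threshold].index.tolist()

# Drop those columns, leaving the rest untouched
trimmed = toy.drop(columns=mostly_null)

print(null_fraction)
print(mostly_null)
```

Here only `website` is missing in more than half of the rows, so it is the only column dropped.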

### Website and support info

Next we will look at the `website` and `support_info` columns, both of which contain links to external websites. A large number of rows have no website listed. The `support_info` column has no null values, but each entry is a dictionary holding a `url` and an `email`, either of which may be an empty string.

We'll drop both columns from our data set. The information may still be useful, or at least interesting, later on, so we'll first extract it and export it to a CSV file, as we have done before.

Below we can see the null counts and some example rows.

In [48]:
print('website null counts:', steam_data['website'].isnull().sum())
print('support_info null counts:', steam_data['support_info'].isnull().sum())

with pd.option_context("display.max_colwidth", 100): # ensures strings not cut short
    display(steam_data[['name', 'website', 'support_info']][80:85])

website null counts: 9787
support_info null counts: 0


Unnamed: 0,name,website,support_info
83,X: Tension,http://www.egosoft.com/games/x_tension/info_en.php,"{'url': '', 'email': ''}"
84,X Rebirth,http://www.egosoft.com/games/x_rebirth/info_en.php,"{'url': 'http://www.egosoft.com/support/index_en.php', 'email': 'info@egosoft.com'}"
85,688(I) Hunter/Killer,,"{'url': 'http://strategyfirst.com/products/support.html', 'email': ''}"
86,Fleet Command,,"{'url': 'http://strategyfirst.com/products/support.html', 'email': ''}"
87,Sub Command,,"{'url': '', 'email': ''}"


We keep all the parsing code inside the `if export:` block, so it only runs when we want to export to CSV. Rows with no website contain NaN, whereas missing support URLs and emails are blank strings. We don't need to reconcile the two, as once exported to CSV they will be treated the same.

In [49]:
def process_info(df, export=False):
    """Drop support information from dataframe, optionally exporting beforehand."""
    if export:
        support_info = df[['steam_appid', 'website', 'support_info']].copy()
        
        support_info['support_info'] = support_info['support_info'].apply(literal_eval)
        support_info['support_url'] = support_info['support_info'].apply(lambda x: x['url'])
        support_info['support_email'] = support_info['support_info'].apply(lambda x: x['email'])
        
        support_info = support_info.drop('support_info', axis=1)
        
        # keep only rows with at least one piece of contact information
        has_contact = (
            support_info['website'].notnull()
            | (support_info['support_url'] != '')
            | (support_info['support_email'] != '')
        )
        support_info = support_info[has_contact]

        export_data(support_info, 'support_info')
    
    df = df.drop(['website', 'support_info'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_descriptions(df)
    df = process_language(df)
    df = process_images(df)
    df = process_info(df, export=True)
    
    return df


steam_data = process(raw_steam_data)
steam_data.head()

Exported support info to '../data/exports/steam_support_info.csv'


Unnamed: 0,name,steam_appid,required_age,pc_requirements,mac_requirements,linux_requirements,developers,publishers,packages,package_groups,categories,genres,achievements,release_date,content_descriptors,windows,mac,linux,price,english
0,Counter-Strike,10,3,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[7],"[{'name': 'default', 'title': 'Buy Counter-Str...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,7.19,1
1,Team Fortress Classic,20,3,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[29],"[{'name': 'default', 'title': 'Buy Team Fortre...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,3.99,1
2,Day of Defeat,30,3,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[30],"[{'name': 'default', 'title': 'Buy Day of Defe...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'ids': [], 'notes': None}",1,1,1,3.99,1
3,Deathmatch Classic,40,3,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[31],"[{'name': 'default', 'title': 'Buy Deathmatch ...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'ids': [], 'notes': None}",1,1,1,3.99,1
4,Half-Life: Opposing Force,50,3,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Gearbox Software'],['Valve'],[32],"[{'name': 'default', 'title': 'Buy Half-Life: ...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'ids': [], 'notes': None}",1,1,1,3.99,1


In [50]:
# inspect exported file
pd.read_csv('../data/exports/steam_support_info.csv').head()

Unnamed: 0,steam_appid,website,support_url,support_email
0,10,,http://steamcommunity.com/app/10,
1,30,http://www.dayofdefeat.com/,,
2,50,,https://help.steampowered.com,
3,70,http://www.half-life.com/,http://steamcommunity.com/app/70,
4,80,,http://steamcommunity.com/app/80,
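
The earlier remark that NaN and blank strings end up the same after export can be checked with a small round trip. This is a sketch on made-up data that writes to an in-memory buffer rather than a file. `toy` and `round_trip` are illustrative names, and the behaviour relies on the pandas defaults for `to_csv` and `read_csv`.

```python
import io

import numpy as np
import pandas as pd

# One row mixes a NaN website with a blank support_url
toy = pd.DataFrame({
    "steam_appid": [1, 2],
    "website": [np.nan, "http://example.com"],
    "support_url": ["", "http://example.com/support"],
})

# Before export, only the NaN is counted as missing
print(toy.isnull().sum())

# Write to an in-memory CSV and read it straight back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

# After the round trip, both empty fields are read back as NaN
print(round_trip.isnull().sum())
```

With default settings, `to_csv` writes NaN as an empty field and `read_csv` reads empty fields as NaN, which is why the two kinds of missing value become indistinguishable.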


### System Requirements

At first glance, it looks like we have system requirements data for every row, as none of the three columns contains null values.

In [51]:
req_cols = ['pc_requirements', 'mac_requirements', 'linux_requirements']

print('null counts:\n')

for col in req_cols:
    print(col+':', steam_data[col].isnull().sum())

null counts:

pc_requirements: 0
mac_requirements: 0
linux_requirements: 0


However, if we look a little more closely, we see that some rows actually contain an empty list, stored as the string `'[]'`. These aren't counted as null, but they carry no information, so we can treat them as missing values.

In [52]:
steam_data[['steam_appid', 'pc_requirements', 'mac_requirements', 'linux_requirements']].tail()

Unnamed: 0,steam_appid,pc_requirements,mac_requirements,linux_requirements
29230,1065230,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[]
29231,1065570,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[]
29232,1065650,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[]
29233,1066700,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,[]
29234,1069460,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,[]


We can count the rows with empty lists in each requirements column using a simple boolean filter. The first value of the filtered dataframe's `shape` attribute is the number of matching rows.

In [53]:
print('Empty list counts:\n')

for col in req_cols:
    print(col+':', steam_data[steam_data[col] == '[]'].shape[0])

Empty list counts:

pc_requirements: 16
mac_requirements: 17125
linux_requirements: 20189
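
An alternative to filtering on the string is to convert the placeholder into a real missing value, so that the usual `isnull()` count picks it up. The following is a small sketch on made-up data, not a step the notebook itself takes. `toy`, `hidden`, `as_missing` and `revealed` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy requirements columns: '[]' is a plain string, not a null value
toy = pd.DataFrame({
    "pc_requirements": ["{'minimum': 'toy text'}", "[]", "{'minimum': 'toy text'}"],
    "mac_requirements": ["[]", "[]", "{'minimum': 'toy text'}"],
})

# isnull() finds nothing, because every cell holds a string
hidden = toy.isnull().sum()

# Replace exact matches of '[]' with NaN, then count again
as_missing = toy.replace("[]", np.nan)
revealed = as_missing.isnull().sum()

print(hidden)
print(revealed)
```

The counts in `revealed` match what the boolean filter `toy[col] == '[]'` would give, but the missing values are now visible to any pandas method that understands NaN.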


That's over half of the rows for both mac and linux requirements. That probably means that there is not enough data in these two columns to be useful for our analysis.

It turns out most games are developed solely for Windows, with Mac and Linux ports only becoming more common in recent years. Naturally, it makes sense that games which aren't supported on Mac or Linux would not have corresponding requirements.

As we have already cleaned our platforms column, we can check how many rows actually have missing data by comparing rows with empty lists in the requirements with data in the respective platform columns (mac/linux). If a row has an empty list in the requirements column but a 1 (True) in the platform column, it means the data is missing.

In [54]:
for col in ['mac_requirements', 'linux_requirements']:
    platform = col.split('_')[0]
    print(platform+':', steam_data[(steam_data[col] == '[]') & (steam_data[platform])].shape[0])

mac: 141
linux: 168


Whilst not an insignificant number, this means that the vast majority of rows are as they should be, and we're not looking at too many data errors.

Let's also have a look for missing values in the pc/windows column. We couldn't include it in our previous loop as the columns have different names, something we may wish to change later.

In [55]:
print('windows:', steam_data[(steam_data['pc_requirements'] == '[]') & (steam_data['windows'])].shape[0])

windows: 11
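
The naming mismatch between `pc_requirements` and `windows` can be handled without renaming anything by mapping each requirements column to its platform column. This is a sketch on made-up data. The dictionary `platform_for` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Toy data: requirements as strings, platforms as 1/0 flags
toy = pd.DataFrame({
    "pc_requirements": ["[]", "{'minimum': 'toy'}", "[]"],
    "mac_requirements": ["[]", "[]", "{'minimum': 'toy'}"],
    "linux_requirements": ["[]", "[]", "[]"],
    "windows": [1, 1, 0],
    "mac": [0, 1, 1],
    "linux": [0, 1, 1],
})

# Hypothetical mapping from requirements column to platform column
platform_for = {
    "pc_requirements": "windows",
    "mac_requirements": "mac",
    "linux_requirements": "linux",
}

missing_counts = {}
for req_col, platform in platform_for.items():
    supported = toy[platform] == 1   # game is listed as supporting the platform
    empty = toy[req_col] == "[]"     # but no requirements were supplied
    missing_counts[platform] = int((supported & empty).sum())

print(missing_counts)  # {'windows': 1, 'mac': 1, 'linux': 2}
```

Comparing the platform flag to 1 explicitly keeps the mask boolean, rather than relying on how pandas combines a boolean series with an integer one.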


11 rows have missing system requirements. We can take a look at some of them below, and follow the links to the steam pages to try and discover if anything is amiss.

In [56]:
missing_windows_requirements = steam_data[(steam_data['pc_requirements'] == '[]') & (steam_data['windows'])]

print_steam_links(missing_windows_requirements[:5])
missing_windows_requirements.head()

Uplink: https://store.steampowered.com/app/1510
Battlestations: Midway: https://store.steampowered.com/app/6870
Grand Theft Auto 2: https://store.steampowered.com/app/12180
Penumbra: Requiem: https://store.steampowered.com/app/22140
Sam & Max 301: The Penal Zone: https://store.steampowered.com/app/31220


Unnamed: 0,name,steam_appid,required_age,pc_requirements,mac_requirements,linux_requirements,developers,publishers,packages,package_groups,categories,genres,achievements,release_date,content_descriptors,windows,mac,linux,price,english
34,Uplink,1510,3,[],[],[],['Introversion Software'],['Introversion Software'],"[112, 14002]","[{'name': 'default', 'title': 'Buy Uplink', 'd...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '23', 'description': 'Indie'}, {'id': ...",{'total': 0},"{'coming_soon': False, 'date': '23 Aug, 2006'}","{'ids': [], 'notes': None}",1,1,1,6.99,1
197,Battlestations: Midway,6870,3,[],[],[],['Eidos Interactive'],['Square Enix'],[284],"[{'name': 'default', 'title': 'Buy Battlestati...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '15 Mar, 2007'}","{'ids': [], 'notes': None}",1,0,0,4.99,1
346,Grand Theft Auto 2,12180,3,[],[],[],['Rockstar North'],['Rockstar Games'],,[],"[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '4 Jan, 2008'}","{'ids': [], 'notes': None}",1,0,0,0.0,1
549,Penumbra: Requiem,22140,3,[],[],[],['Frictional Games'],['Frictional Games'],,[],"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '25', 'description': 'Adventure'}, {'i...",{'total': 0},"{'coming_soon': False, 'date': ''}","{'ids': [], 'notes': None}",1,1,1,-1.0,1
651,Sam & Max 301: The Penal Zone,31220,3,[],[],[],['Telltale Games'],['Telltale Games'],"[109585, 4172]","[{'name': 'default', 'title': 'Buy Sam & Max 3...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...",{'total': 0},"{'coming_soon': False, 'date': ''}","{'ids': [], 'notes': None}",1,0,0,-1.0,1


There doesn't appear to be any common issue in these rows - some of the games are quite old but that's about it. It may simply be that no requirements were supplied when the games were added to the steam store.

Let's say that the fictional company we're doing analysis for is interested in developing for Windows only. We can also assume that a cross-platform game will have similar hardware requirements on each platform it supports. With this in mind, we can safely drop both the mac and linux requirements columns, as we already know which games support these operating systems from our cleaned platform columns. That means we can focus on the `pc_requirements` column, which has information for almost every game in our data.

Now we will take a look at a couple of rows from the dataset to see how the data is stored.

In [57]:
display(steam_data['pc_requirements'].iloc[0])
display(steam_data['pc_requirements'].iloc[2000])
display(steam_data['pc_requirements'].iloc[15000])

"{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</strong> 500 mhz processor, 96mb ram, 16mb video card, Windows XP, Mouse, Keyboard, Internet Connection<br /></p>\\r\\n\\t\\t\\t<p><strong>Recommended:</strong> 800 mhz processor, 128mb ram, 32mb+ video card, Windows XP, Mouse, Keyboard, Internet Connection<br /></p>\\r\\n\\t\\t\\t'}"

'{\'minimum\': \'<strong>Minimum:</strong><br><ul class="bb_ul"><li><strong>OS:</strong> Windows 7, Windows 8<br></li><li><strong>Processor:</strong> Intel Core 2 Duo, AMD Athlon X2, or equal at 1.6GHz or better<br></li><li><strong>Memory:</strong> 2 GB RAM<br></li><li><strong>Graphics:</strong> DirectX 9.0c-compatible, SM 3.0-compatible<br></li><li><strong>DirectX:</strong> Version 9.0c<br></li><li><strong>Storage:</strong> 1 GB available space<br></li><li><strong>Sound Card:</strong> DirectX 9.0c-compatible, 16-bit</li></ul>\', \'recommended\': \'<strong>Recommended:</strong><br><ul class="bb_ul"><li><strong>OS:</strong> Windows 7, Windows 8<br></li><li><strong>Processor:</strong> QuadCore 2.0 GHz +<br></li><li><strong>Memory:</strong> 8 GB RAM<br></li><li><strong>Graphics:</strong> NVIDIA GeForce 8800 GTS or better, 512MB+ VRAM<br></li><li><strong>DirectX:</strong> Version 9.0c<br></li><li><strong>Storage:</strong> 1 GB available space<br></li><li><strong>Sound Card:</strong> Direct

'{\'minimum\': \'<strong>Minimum:</strong><br><ul class="bb_ul"><li><strong>OS:</strong> Microsoft Windows 7<br></li><li><strong>Processor:</strong> 2 GHz CPU<br></li><li><strong>Memory:</strong> 1 GB RAM<br></li><li><strong>DirectX:</strong> Version 9.0c<br></li><li><strong>Storage:</strong> 1 GB available space</li></ul>\', \'recommended\': \'<strong>Recommended:</strong><br><ul class="bb_ul"><li><strong>OS:</strong> Microsoft Windows 7<br></li><li><strong>Processor:</strong> 2 GHz CPU<br></li><li><strong>Memory:</strong> 1 GB RAM<br></li><li><strong>DirectX:</strong> Version 10<br></li><li><strong>Storage:</strong> 1 GB available space</li></ul>\'}'

In short: it's a mess. It looks like the data is stored as a dictionary, as we've seen before. There is definitely a key for 'minimum', but apart from that it is hard to see at a glance. The strings are full of html formatting, which is presumably parsed to display the information on the website. It also looks like there are different categories like Processor and Memory for some, but not all, rows.

Let's take a stab at cleaning out some of the unnecessary formatting and see if it becomes clearer.

By taking a small selection of rows as a pandas Series, we can quickly make changes using the `.str` accessor, which gives us access to Python string methods and regular expressions.

In [58]:
view_requirements = steam_data['pc_requirements'].iloc[[0, 2000, 15000]].copy()

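# regex=True is stated explicitly, as newer pandas versions default to literal matching
view_requirements = (view_requirements
                         .str.replace(r'\\[rtn]', '', regex=True)
                         .str.replace(r'<[pbr]{1,2}>', ' ', regex=True)
                         .str.replace(r'<[\/"=\w\s]+>', '', regex=True)
                    )

# items() replaces iteritems(), which has been removed from newer pandas versions
for i, row in view_requirements.items():
    display(row)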

"{'minimum': ' Minimum: 500 mhz processor, 96mb ram, 16mb video card, Windows XP, Mouse, Keyboard, Internet Connection Recommended: 800 mhz processor, 128mb ram, 32mb+ video card, Windows XP, Mouse, Keyboard, Internet Connection'}"

"{'minimum': 'Minimum: OS: Windows 7, Windows 8 Processor: Intel Core 2 Duo, AMD Athlon X2, or equal at 1.6GHz or better Memory: 2 GB RAM Graphics: DirectX 9.0c-compatible, SM 3.0-compatible DirectX: Version 9.0c Storage: 1 GB available space Sound Card: DirectX 9.0c-compatible, 16-bit', 'recommended': 'Recommended: OS: Windows 7, Windows 8 Processor: QuadCore 2.0 GHz + Memory: 8 GB RAM Graphics: NVIDIA GeForce 8800 GTS or better, 512MB+ VRAM DirectX: Version 9.0c Storage: 1 GB available space Sound Card: DirectX 9.0c-compatible, 16-bit'}"

"{'minimum': 'Minimum: OS: Microsoft Windows 7 Processor: 2 GHz CPU Memory: 1 GB RAM DirectX: Version 9.0c Storage: 1 GB available space', 'recommended': 'Recommended: OS: Microsoft Windows 7 Processor: 2 GHz CPU Memory: 1 GB RAM DirectX: Version 10 Storage: 1 GB available space'}"

We can now see the contents and structure of these rows more clearly. Some rows have both Minimum and Recommended requirements inside a 'minimum' key, while others have separate 'minimum' and 'recommended' keys. Some have headings like 'Processor:' and 'Storage:' before the various components, while others simply have a list of components. Amongst this information, some state particular speeds for components, like '2 GHz CPU', and others state specific models, like 'Intel Core 2 Duo'.
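
To see what each of the three substitutions in the cell above does, they can be applied one at a time to a single string with `re.sub`. This is a minimal sketch: `clean_requirements` is a hypothetical helper, and the two sample strings are shortened stand-ins modelled on the rows displayed earlier.

```python
import re


def clean_requirements(text):
    """Apply the three substitutions, in order, to one string."""
    # 1. literal backslash followed by r, t or n (escaped whitespace left in the stored text)
    text = re.sub(r'\\[rtn]', '', text)
    # 2. short tags such as <p> and <br> become a space, so words don't run together
    text = re.sub(r'<[pbr]{1,2}>', ' ', text)
    # 3. any remaining tag, e.g. </p>, <strong>, <br /> or <ul class="bb_ul">
    text = re.sub(r'<[\/"=\w\s]+>', '', text)
    return text


# Older layout: escaped whitespace and paragraph tags
old_style = '\\r\\n\\t<p><strong>Minimum:</strong> 96mb ram<br /></p>'

# Newer layout: bulleted list with headings
new_style = ('<strong>Minimum:</strong><br><ul class="bb_ul">'
             '<li><strong>OS:</strong> Windows 7<br></li>'
             '<li><strong>Memory:</strong> 2 GB RAM</li></ul>')

print(repr(clean_requirements(old_style)))  # ' Minimum: 96mb ram'
print(repr(clean_requirements(new_style)))  # 'Minimum: OS: Windows 7 Memory: 2 GB RAM'
```

The leading space in the first result comes from the opening `<p>` tag being replaced with a space, which matches the leading space seen in the first cleaned row above.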

It seems like it would be possible to extract individual component information from this data. However, it would be a lengthy and complex process, requiring the handling of many exceptions and individual cases. Whilst we may wish to tackle this in the future, as it could provide an interesting window into how the demands of gaming have changed over the years, it won't necessarily provide us with useful information for our current objectives.

With that in mind, it seems best to proceed by cleaning the data slightly so it is readable, exporting to an external csv for future use, then removing the columns from our dataframe.

In [59]:
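def process_requirements(df, export=False):
    """Drop system requirements columns, optionally exporting cleaned pc_requirements beforehand."""
    if export:
        requirements = df[['steam_appid', 'pc_requirements']].copy()
        
        # ignore rows with no requirements listed
        requirements = requirements[requirements['pc_requirements'] != '[]']
        
        # regex=True is stated explicitly, as newer pandas versions default to literal matching
        requirements['requirements_clean'] = (requirements['pc_requirements']
                                                  .str.replace(r'\\[rtn]', '', regex=True)
                                                  .str.replace(r'<[pbr]{1,2}>', ' ', regex=True)
                                                  .str.replace(r'<[\/"=\w\s]+>', '', regex=True)
                                             )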
        
        export_data(requirements, 'requirements_data')
        
    df = df.drop(['pc_requirements', 'mac_requirements', 'linux_requirements'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_descriptions(df)
    df = process_language(df)
    df = process_images(df)
    df = process_info(df)
    df = process_requirements(df, export=True)
    
    return df


steam_data = process(raw_steam_data)
steam_data.head()

Exported requirements data to '../data/exports/steam_requirements_data.csv'


Unnamed: 0,name,steam_appid,required_age,developers,publishers,packages,package_groups,categories,genres,achievements,release_date,content_descriptors,windows,mac,linux,price,english
0,Counter-Strike,10,3,['Valve'],['Valve'],[7],"[{'name': 'default', 'title': 'Buy Counter-Str...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,7.19,1
1,Team Fortress Classic,20,3,['Valve'],['Valve'],[29],"[{'name': 'default', 'title': 'Buy Team Fortre...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,3.99,1
2,Day of Defeat,30,3,['Valve'],['Valve'],[30],"[{'name': 'default', 'title': 'Buy Day of Defe...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'ids': [], 'notes': None}",1,1,1,3.99,1
3,Deathmatch Classic,40,3,['Valve'],['Valve'],[31],"[{'name': 'default', 'title': 'Buy Deathmatch ...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'ids': [], 'notes': None}",1,1,1,3.99,1
4,Half-Life: Opposing Force,50,3,['Gearbox Software'],['Valve'],[32],"[{'name': 'default', 'title': 'Buy Half-Life: ...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'ids': [], 'notes': None}",1,1,1,3.99,1


In [60]:
# verify export
pd.read_csv('../data/exports/steam_requirements_data.csv').head()

Unnamed: 0,steam_appid,pc_requirements,requirements_clean
0,10,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,"{'minimum': ' Minimum: 500 mhz processor, 96mb..."
1,20,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,"{'minimum': ' Minimum: 500 mhz processor, 96mb..."
2,30,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,"{'minimum': ' Minimum: 500 mhz processor, 96mb..."
3,40,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,"{'minimum': ' Minimum: 500 mhz processor, 96mb..."
4,50,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,"{'minimum': ' Minimum: 500 mhz processor, 96mb..."


### Processing developers and publishers

The next two columns, developers and publishers, will most likely contain similar information so we can look at them together. 

We'll start by checking the null counts. At first the publishers column doesn't appear to have any null values, but if we search for lists containing only an empty string (`['']`) we find 227 hidden null values.

In [61]:
print('developers null counts:', steam_data['developers'].isnull().sum())
print('developers empty list counts:', steam_data[steam_data['developers'] == "['']"].shape[0])

print('\npublishers null counts:', steam_data['publishers'].isnull().sum())
print('publishers empty list counts:', steam_data[steam_data['publishers'] == "['']"].shape[0])

developers null counts: 111
developers empty list counts: 0

publishers null counts: 0
publishers empty list counts: 227
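
Because the empty publisher entries are stored as the string `"['']"`, `isnull()` does not see them. One way to surface both kinds of missing value in a single check is to map that placeholder to `NaN` first. This is only a sketch on a small made-up frame (`toy` and its values are illustrative, not the real data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for steam_data; values are illustrative only
toy = pd.DataFrame({
    'developers': ["['Valve']", np.nan, "['Freegamer']"],
    'publishers': ["['Valve']", "['Retroism']", "['']"],
})

# Treat the "['']" placeholder as a genuine missing value
normalised = toy.replace("['']", np.nan)

# isnull() now counts the hidden nulls as well
hidden_nulls = normalised.isnull().sum()
print(hidden_nulls)
```

After this step a plain null count covers both columns, at the cost of losing the distinction between the two kinds of missing value.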


In [62]:
no_dev = steam_data[steam_data['developers'].isnull()]

print('Total games missing developer:', no_dev.shape[0], '\n')
print_steam_links(no_dev[:5])

no_dev.head()

Total games missing developer: 111 

Tycoon City: New York: https://store.steampowered.com/app/9730
Nikopol: Secrets of the Immortals: https://store.steampowered.com/app/11370
Crash Time 2: https://store.steampowered.com/app/11390
Hunting Unlimited 2010: https://store.steampowered.com/app/12690
18 Wheels of Steel: Extreme Trucker: https://store.steampowered.com/app/33730


Unnamed: 0,name,steam_appid,required_age,developers,publishers,packages,package_groups,categories,genres,achievements,release_date,content_descriptors,windows,mac,linux,price,english
285,Tycoon City: New York,9730,3,,['Retroism'],[34667],"[{'name': 'default', 'title': 'Buy Tycoon City...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '28', 'description': 'Simulation'}]",{'total': 0},"{'coming_soon': False, 'date': '12 Mar, 2008'}","{'ids': [], 'notes': None}",1,0,0,6.99,1
330,Nikopol: Secrets of the Immortals,11370,3,,['Meridian4'],[1930],"[{'name': 'default', 'title': 'Buy Nikopol: Se...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '25', 'description': 'Adventure'}, {'i...",{'total': 0},"{'coming_soon': False, 'date': '30 Jul, 2009'}","{'ids': [], 'notes': None}",1,0,0,3.99,1
331,Crash Time 2,11390,3,,['Meridian4'],[2030],"[{'name': 'default', 'title': 'Buy Crash Time ...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...",{'total': 0},"{'coming_soon': False, 'date': '27 Aug, 2009'}","{'ids': [], 'notes': None}",1,0,0,6.99,1
379,Hunting Unlimited 2010,12690,3,,"['ValuSoft', 'Retroism']","[2680, 17219]","[{'name': 'default', 'title': 'Buy Hunting Unl...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '28', 'description': 'Simulation'}, {'...",{'total': 0},"{'coming_soon': False, 'date': '7 Jul, 2009'}","{'ids': [], 'notes': None}",1,0,0,6.99,1
742,18 Wheels of Steel: Extreme Trucker,33730,3,,"['ValuSoft', 'Play Hard Games']","[2679, 17219]","[{'name': 'default', 'title': 'Buy 18 Wheels o...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '28', 'description': 'Simulation'}]",{'total': 0},"{'coming_soon': False, 'date': '23 Sep, 2009'}","{'ids': [], 'notes': None}",1,0,0,6.99,1


In [63]:
no_pub = steam_data[steam_data['publishers'] == "['']"]

print('Total games missing publisher:', no_pub.shape[0], '\n')
print_steam_links(no_pub[:5])

no_pub.head()

Total games missing publisher: 227 

RIP - Trilogy™: https://store.steampowered.com/app/2540
Vigil: Blood Bitterness™: https://store.steampowered.com/app/2570
Bullet Candy: https://store.steampowered.com/app/6600
AudioSurf: https://store.steampowered.com/app/12900
Everyday Shooter: https://store.steampowered.com/app/16300


Unnamed: 0,name,steam_appid,required_age,developers,publishers,packages,package_groups,categories,genres,achievements,release_date,content_descriptors,windows,mac,linux,price,english
67,RIP - Trilogy™,2540,3,['Elephant Games'],[''],[346],"[{'name': 'default', 'title': 'Buy RIP - Trilo...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '4', 'description': 'Casual'}, {'id': ...",{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2007'}","{'ids': [], 'notes': None}",1,0,0,3.99,1
68,Vigil: Blood Bitterness™,2570,3,['Freegamer'],[''],,[],"[{'id': 2, 'description': 'Single-player'}]","[{'id': '23', 'description': 'Indie'}, {'id': ...",{'total': 0},"{'coming_soon': False, 'date': '29 Jun, 2007'}","{'ids': [], 'notes': None}",1,0,0,0.0,1
190,Bullet Candy,6600,3,['R C Knight'],[''],[258],"[{'name': 'default', 'title': 'Buy Bullet Cand...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '4', 'description': 'Casual'}, {'id': ...","{'total': 20, 'highlighted': [{'name': 'Casual...","{'coming_soon': False, 'date': '14 Feb, 2007'}","{'ids': [], 'notes': None}",1,0,0,2.79,1
385,AudioSurf,12900,3,['Dylan Fitterer'],[''],[636],"[{'name': 'default', 'title': 'Buy AudioSurf',...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '23', 'description': 'Indie'}]","{'total': 19, 'highlighted': [{'name': 'Royal ...","{'coming_soon': False, 'date': '15 Feb, 2008'}","{'ids': [], 'notes': None}",1,0,0,6.99,1
451,Everyday Shooter,16300,3,['Queasy Games'],[''],[724],"[{'name': 'default', 'title': 'Buy Everyday Sh...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '23', 'description': 'Indie'}, {'id': ...",{'total': 0},"{'coming_soon': False, 'date': '8 May, 2008'}","{'ids': [], 'notes': None}",1,0,0,7.19,1


In [64]:
no_dev_or_pub = steam_data[(steam_data['developers'].isnull()) & (steam_data['publishers'] == "['']")]

print('Total games missing developer and publisher:', no_dev_or_pub.shape[0], '\n')
print_steam_links(no_dev_or_pub[:5])

no_dev_or_pub.head()

Total games missing developer and publisher: 73 

Patterns: https://store.steampowered.com/app/218980
PlayClaw 5 - Game Recording and Streaming: https://store.steampowered.com/app/237370
Artemis Spaceship Bridge Simulator: https://store.steampowered.com/app/247350
A Walk in the Dark: https://store.steampowered.com/app/248730
Forge Quest: https://store.steampowered.com/app/249950


Unnamed: 0,name,steam_appid,required_age,developers,publishers,packages,package_groups,categories,genres,achievements,release_date,content_descriptors,windows,mac,linux,price,english
1701,Patterns,218980,3,,[''],,[],"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '4', 'description': 'Casual'}, {'id': ...",{'total': 0},"{'coming_soon': False, 'date': ''}","{'ids': [], 'notes': None}",1,1,0,-1.0,1
2011,PlayClaw 5 - Game Recording and Streaming,237370,3,,[''],[28917],"[{'name': 'default', 'title': 'Buy PlayClaw 5 ...","[{'id': 22, 'description': 'Steam Achievements'}]","[{'id': '52', 'description': 'Audio Production...","{'total': 10, 'highlighted': [{'name': 'Verbal...","{'coming_soon': False, 'date': '10 Sep, 2013'}","{'ids': [], 'notes': None}",1,0,0,29.99,1
2201,Artemis Spaceship Bridge Simulator,247350,3,,[''],"[29600, 31847]","[{'name': 'default', 'title': 'Buy Artemis Spa...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}, {'id': ...",{'total': 0},"{'coming_soon': False, 'date': '16 Sep, 2013'}","{'ids': [], 'notes': None}",1,0,0,4.99,1
2231,A Walk in the Dark,248730,3,,[''],[29907],"[{'name': 'default', 'title': 'Buy A Walk in t...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","{'total': 27, 'highlighted': [{'name': 'Toughe...","{'coming_soon': False, 'date': '7 Nov, 2013'}","{'ids': [], 'notes': None}",1,0,0,4.99,1
2251,Forge Quest,249950,3,,[''],"[30345, 35189]","[{'name': 'default', 'title': 'Buy Forge Quest...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","{'total': 49, 'highlighted': [{'name': 'Papers...","{'coming_soon': False, 'date': '29 May, 2015'}","{'ids': [], 'notes': None}",1,1,1,6.99,1


Options:
- remove rows with missing developer or publisher information
- impute missing information by replacing the missing column with the one we have
- write missing information as 'unknown' or none
- keep everything
- remove only rows missing both developer and publisher information

The function below takes the first option and drops any row missing either a developer or a publisher.
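
For comparison, the second option in the list above (imputing a missing value from the other column) might look roughly like this. It is a sketch on made-up data with hypothetical, already-flattened `developer` and `publisher` columns; the notebook's own function drops these rows instead.

```python
import numpy as np
import pandas as pd

# Toy stand-in with one missing developer and one missing publisher;
# names and values are illustrative only
toy = pd.DataFrame({
    'developer': ['Valve', np.nan, 'Queasy Games'],
    'publisher': ['Valve', 'Meridian4', np.nan],
})

# Fill each column's gaps with the value from the other column
imputed = toy.copy()
imputed['developer'] = toy['developer'].fillna(toy['publisher'])
imputed['publisher'] = toy['publisher'].fillna(toy['developer'])

print(imputed)
```

Rows missing both values would still be null after this step and would need a separate decision.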

In [65]:
def process_developers_and_publishers(df):
    num_rows = df.shape[0]
    df = df[(df['developers'].notnull()) & (df['publishers'] != "['']")].copy()
    print('Before:', num_rows, '\nAfter:', df.shape[0], '\nRows dropped:', num_rows - df.shape[0])
            
    df['developers'] = df['developers'].apply(lambda x: literal_eval(x))
    df['publishers'] = df['publishers'].apply(lambda x: literal_eval(x))
    
    df['developer'] = df['developers'].apply(lambda x: x[0])
    df['publisher'] = df['publishers'].apply(lambda x: x[0])
    
    df['other_developers'] = df['developers'].apply(lambda x: ', '.join(x[1:]) if len(x) > 1 else np.nan)
    df['other_publishers'] = df['publishers'].apply(lambda x: ', '.join(x[1:]) if len(x) > 1 else np.nan)

    df = df.drop(['developers', 'publishers'], axis=1)
    
    return df

dev_pub_data = process_developers_and_publishers(steam_data)
dev_pub_data[['developer', 'publisher', 'other_developers', 'other_publishers']].head()

Before: 29028 
After: 28763 
Rows dropped: 265


Unnamed: 0,developer,publisher,other_developers,other_publishers
0,Valve,Valve,,
1,Valve,Valve,,
2,Valve,Valve,,
3,Valve,Valve,,
4,Gearbox Software,Valve,,


It may be worth investigating how many rows actually have other developers or publishers, as the other_developers and other_publishers columns are filled with null values for the first few rows.

In [66]:
print('Null counts:\n')

for col in ['developer', 'publisher', 'other_developers', 'other_publishers']:
    print(col + ':', dev_pub_data[col].isnull().sum())

Null counts:

developer: 0
publisher: 0
other_developers: 27002
other_publishers: 27860


It turns out that most games have only one developer and one publisher, so the other_developers and other_publishers columns are almost entirely null and of little use. It may be better to combine each pair of columns into one. We can do this fairly easily with Python's string join method. Joining on a comma and a space returns the single value unchanged when the list has only one item, and a comma-separated string when there are several, like so:

In [67]:
', '.join(['one item'])

'one item'

In [68]:
', '.join(['multiple', 'different', 'items'])

'multiple, different, items'
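
In the data set the lists arrive as strings, so each value has to be parsed with `literal_eval` before it can be joined. A minimal sketch of the two steps together, using two values in the same format as the publishers column:

```python
from ast import literal_eval

# String representations of lists, as stored in the CSV
raw_values = ["['Valve']", "['ValuSoft', 'Retroism']"]

# Parse each string into a list, then join the items
joined = [', '.join(literal_eval(value)) for value in raw_values]
print(joined)
```

One caveat of this approach is that a name which itself contains a comma can no longer be separated unambiguously from its neighbours once joined.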

We can now modify and finish our function, and will be ready to move on to the next column.

In [69]:
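def process_developers_and_publishers(df):
    """Parse the developers and publishers columns and join each list into a single string."""
    # remove rows with missing developer or publisher data
    df = df[(df['developers'].notnull()) & (df['publishers'] != "['']")].copy()
    
    # evaluate the string representations as python lists
    df['developers'] = df['developers'].apply(lambda x: literal_eval(x))
    df['publishers'] = df['publishers'].apply(lambda x: literal_eval(x))
    
    # join each list into a comma-separated string
    df['developer'] = df['developers'].apply(lambda x: ', '.join(x))
    df['publisher'] = df['publishers'].apply(lambda x: ', '.join(x))

    # remove the original list columns
    df = df.drop(['developers', 'publishers'], axis=1)
    
    return df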


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_descriptions(df)
    df = process_language(df)
    df = process_images(df)
    df = process_info(df)
    df = process_requirements(df)
    df = process_developers_and_publishers(df)
    
    return df

steam_data = process(raw_steam_data)
steam_data.head()

Unnamed: 0,name,steam_appid,required_age,packages,package_groups,categories,genres,achievements,release_date,content_descriptors,windows,mac,linux,price,english,developer,publisher
0,Counter-Strike,10,3,[7],"[{'name': 'default', 'title': 'Buy Counter-Str...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,7.19,1,Valve,Valve
1,Team Fortress Classic,20,3,[29],"[{'name': 'default', 'title': 'Buy Team Fortre...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,3.99,1,Valve,Valve
2,Day of Defeat,30,3,[30],"[{'name': 'default', 'title': 'Buy Day of Defe...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
3,Deathmatch Classic,40,3,[31],"[{'name': 'default', 'title': 'Buy Deathmatch ...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,3,[32],"[{'name': 'default', 'title': 'Buy Half-Life: ...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Gearbox Software,Valve


### Processing Packages

We are not especially interested in the `packages` and `package_groups` columns, except for rows where price data is missing (which we earlier filled with -1). We can now easily investigate these rows. Overall we have 811 rows with missing price data.

In [70]:
print(steam_data[steam_data['price'] == -1].shape[0])

811


We can split these rows into two categories: those with package_groups data and those without. If we take a quick look at the package_groups column we see that there are no null values, but rows without data are stored as empty lists.

In [71]:
print('Null counts:', steam_data['package_groups'].isnull().sum())
print('Empty list counts:', steam_data[steam_data['package_groups'] == "[]"].shape[0])

Null counts: 0
Empty list counts: 3307


Using a combination of filters, we can find out how many rows have both missing price and package_group data and investigate.

In [72]:
missing_price_and_package = steam_data[(steam_data['price'] == -1) & (steam_data['package_groups'] == "[]")]

print('Number of rows:', missing_price_and_package.shape[0], '\n')

print('First few rows:\n')
print_steam_links(missing_price_and_package[:5])

print('\nLast few rows:\n')
print_steam_links(missing_price_and_package[-10:-5])

missing_price_and_package.head()

Number of rows: 774 

First few rows:

RollerCoaster Tycoon® 3: Platinum: https://store.steampowered.com/app/2700
Beijing 2008™ - The Official Video Game of the Olympic Games: https://store.steampowered.com/app/10520
LUMINES™ Advance Pack: https://store.steampowered.com/app/11920
Midnight Club 2: https://store.steampowered.com/app/12160
Age of Booty™: https://store.steampowered.com/app/21600

Last few rows:

RoboVirus: https://store.steampowered.com/app/1001870
soko loco deluxe: https://store.steampowered.com/app/1003730
POCKET CAR : VRGROUND: https://store.steampowered.com/app/1004710
The Princess, the Stray Cat, and Matters of the Heart: https://store.steampowered.com/app/1010600
Mr Boom's Firework Factory: https://store.steampowered.com/app/1013670


Unnamed: 0,name,steam_appid,required_age,packages,package_groups,categories,genres,achievements,release_date,content_descriptors,windows,mac,linux,price,english,developer,publisher
75,RollerCoaster Tycoon® 3: Platinum,2700,3,,[],"[{'id': 2, 'description': 'Single-player'}]","[{'id': '28', 'description': 'Simulation'}, {'...",{'total': 0},"{'coming_soon': False, 'date': '12 Mar, 2008'}","{'ids': [], 'notes': None}",1,1,0,-1.0,1,"Frontier, Aspyr (Mac)","Atari, Aspyr (Mac)"
311,Beijing 2008™ - The Official Video Game of the...,10520,3,,[],"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '18', 'description': 'Sports'}]",{'total': 0},"{'coming_soon': False, 'date': '14 Aug, 2008'}","{'ids': [], 'notes': None}",1,0,0,-1.0,1,Eurocom,SEGA
337,LUMINES™ Advance Pack,11920,3,,[],"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]",{'total': 0},"{'coming_soon': False, 'date': '18 Apr, 2008'}","{'ids': [], 'notes': None}",1,0,0,-1.0,1,Q Entertainment Inc.,Q Entertainment Inc.
344,Midnight Club 2,12160,3,,[],"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '9', 'description': 'Racing'}]",{'total': 0},"{'coming_soon': False, 'date': '4 Jan, 2008'}","{'ids': [], 'notes': None}",1,0,0,-1.0,1,Rockstar San Diego,Rockstar Games
536,Age of Booty™,21600,3,,[],"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}, {'id': ...",{'total': 0},"{'coming_soon': False, 'date': '9 Mar, 2009'}","{'ids': [], 'notes': None}",1,0,0,-1.0,1,Certain Affinity™,Capcom


Most of our games with missing price data fall into the above category. Judging by the store pages for the first few rows, these games are currently unavailable or have been delisted from the store. Looking at the last few rows, it appears most of them haven't been released and haven't had a price set. We will take care of all unreleased games when we clean the release_date column, but we can remove all of these apps now.

Let's now take a look at the apps that have missing price data but do have package_groups data.

In [73]:
missing_price_have_package = steam_data.loc[(steam_data['price'] == -1) & (steam_data['package_groups'] != "[]"), ['name', 'steam_appid', 'package_groups', 'price']]

print('Number of rows:', missing_price_have_package.shape[0], '\n')

print('First few rows:\n')
print_steam_links(missing_price_have_package[:5])

print('\nLast few rows:\n')
print_steam_links(missing_price_have_package[-10:-5])

display(missing_price_have_package.head())
missing_price_have_package.iloc[-10:-5]

Number of rows: 37 

First few rows:

The Ship: Single Player: https://store.steampowered.com/app/2420
BioShock™: https://store.steampowered.com/app/7670
Sam & Max 101: Culture Shock: https://store.steampowered.com/app/8200
Sam & Max 102: Situation: Comedy: https://store.steampowered.com/app/8210
Sam & Max 103: The Mole, the Mob and the Meatball: https://store.steampowered.com/app/8220

Last few rows:

Viscera Cleanup Detail: Shadow Warrior: https://store.steampowered.com/app/255520
Space Hulk: Deathwing: https://store.steampowered.com/app/298900
7,62 Hard Life: https://store.steampowered.com/app/306290
Letter Quest: Grimm's Journey: https://store.steampowered.com/app/328730
Rad Rodgers: World One: https://store.steampowered.com/app/353580


Unnamed: 0,name,steam_appid,package_groups,price
63,The Ship: Single Player,2420,"[{'name': 'default', 'title': 'Buy The Ship: S...",-1.0
220,BioShock™,7670,"[{'name': 'default', 'title': 'Buy BioShock™',...",-1.0
234,Sam & Max 101: Culture Shock,8200,"[{'name': 'default', 'title': 'Buy Sam & Max 1...",-1.0
235,Sam & Max 102: Situation: Comedy,8210,"[{'name': 'default', 'title': 'Buy Sam & Max 1...",-1.0
236,"Sam & Max 103: The Mole, the Mob and the Meatball",8220,"[{'name': 'default', 'title': 'Buy Sam & Max 1...",-1.0


Unnamed: 0,name,steam_appid,package_groups,price
2421,Viscera Cleanup Detail: Shadow Warrior,255520,"[{'name': 'default', 'title': 'Buy Viscera Cle...",-1.0
3576,Space Hulk: Deathwing,298900,"[{'name': 'default', 'title': 'Buy Space Hulk:...",-1.0
3811,"7,62 Hard Life",306290,"[{'name': 'default', 'title': 'Buy 7,62 Hard L...",-1.0
4504,Letter Quest: Grimm's Journey,328730,"[{'name': 'default', 'title': ""Buy Letter Ques...",-1.0
5514,Rad Rodgers: World One,353580,"[{'name': 'default', 'title': 'Buy Rad Rodgers...",-1.0
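
The two filters used above are complementary, so the two groups should account for every row with a missing price (774 + 37 = 811). A sketch of that sanity check on a made-up frame, where `toy` and its values are illustrative only:

```python
import pandas as pd

# Toy stand-in: three rows with a missing price, one of which has package data
toy = pd.DataFrame({
    'price': [-1.0, -1.0, 3.99, -1.0],
    'package_groups': ["[]", "[{'name': 'default'}]", "[]", "[]"],
})

missing_price = toy['price'] == -1
no_package = toy['package_groups'] == "[]"

# Split the missing-price rows by whether package data exists
without_package = toy[missing_price & no_package]
with_package = toy[missing_price & ~no_package]

print(len(without_package), len(with_package))
```

If the two counts did not add up to the number of missing prices, one of the filters would be excluding rows it should not.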


Looking at a selection of these rows, the games appear to be superseded by a newer release or remaster, part of a bigger bundle of games or an episodic series, or included with the purchase of another game.

Whilst we could extract prices from the package_groups data, the most sensible option seems to be removing these rows. Since our logic interacts heavily with the price data we will rewrite the process_price function rather than putting this logic inside its own function.
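
For reference, extracting a price from package_groups might look something like the sketch below. This is not the approach taken in this notebook. The helper `lowest_package_price`, the example string, and the field names `subs` and `price_in_cents_with_discount` are all assumptions, since the column is truncated in the output shown here.

```python
from ast import literal_eval


def lowest_package_price(package_groups):
    """Return the cheapest package price found, or None if there is none.

    Assumes each group holds a 'subs' list whose items carry a
    'price_in_cents_with_discount' key (an assumption, not confirmed here).
    """
    groups = literal_eval(package_groups)
    prices = [
        sub['price_in_cents_with_discount']
        for group in groups
        for sub in group.get('subs', [])
        if 'price_in_cents_with_discount' in sub
    ]
    return min(prices) if prices else None


# Hypothetical value in the assumed format
example = (
    "[{'name': 'default', 'title': 'Buy Example Game', 'subs': ["
    "{'packageid': 1, 'price_in_cents_with_discount': 399}, "
    "{'packageid': 2, 'price_in_cents_with_discount': 1499}]}]"
)

print(lowest_package_price(example))
print(lowest_package_price("[]"))
```

Choosing the cheapest package is itself a judgment call; a bundle price would not necessarily reflect the price of the individual game.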

In [74]:
def process_price(df):
    """Process price_overview column into formatted price column, and take care of package columns."""
    df = df.copy()
    
    def parse_price(x):
        try:
            return literal_eval(x)
        except ValueError:
            return {'currency': 'GBP', 'initial': -1}
    
    # evaluate as dictionary and set to -1 if missing
    df['price_overview'] = df['price_overview'].apply(parse_price)
    
    # create columns from currency and initial values
    df['currency'] = df['price_overview'].apply(lambda x: x['currency'])
    df['price'] = df['price_overview'].apply(lambda x: x['initial'])
    
    # set price of free games to 0
    df.loc[df['is_free'], 'price'] = 0
    
    # remove non-GBP rows
    df = df[df['currency'] == 'GBP']
    
    # remove rows where price is -1
    df = df[df['price'] != -1]
    
    # change price to display in pounds (can apply to all now -1 rows removed)
    df['price'] /= 100
    
    # remove columns no longer needed
    df = df.drop(['is_free', 'currency', 'price_overview', 'packages', 'package_groups'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_descriptions(df)
    df = process_language(df)
    df = process_images(df)
    df = process_info(df)
    df = process_requirements(df)
    df = process_developers_and_publishers(df)
    
    return df

steam_data = process(raw_steam_data)
steam_data.head()

Unnamed: 0,name,steam_appid,required_age,categories,genres,achievements,release_date,content_descriptors,windows,mac,linux,price,english,developer,publisher
0,Counter-Strike,10,3,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,7.19,1,Valve,Valve
1,Team Fortress Classic,20,3,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,3.99,1,Valve,Valve
2,Day of Defeat,30,3,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
3,Deathmatch Classic,40,3,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,3,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Gearbox Software,Valve


### Processing Categories and Genres

Drop rows with missing categories/genres?

In [75]:
print(steam_data['categories'].isnull().sum())

509


In [76]:
print(steam_data['categories'][0])

with pd.option_context("display.max_colwidth", 1000):
    display(steam_data['categories'].head())

[{'id': 1, 'description': 'Multi-player'}, {'id': 36, 'description': 'Online Multi-Player'}, {'id': 37, 'description': 'Local Multi-Player'}, {'id': 8, 'description': 'Valve Anti-Cheat enabled'}]


0    [{'id': 1, 'description': 'Multi-player'}, {'id': 36, 'description': 'Online Multi-Player'}, {'id': 37, 'description': 'Local Multi-Player'}, {'id': 8, 'description': 'Valve Anti-Cheat enabled'}]
1    [{'id': 1, 'description': 'Multi-player'}, {'id': 36, 'description': 'Online Multi-Player'}, {'id': 37, 'description': 'Local Multi-Player'}, {'id': 8, 'description': 'Valve Anti-Cheat enabled'}]
2                                                                                                       [{'id': 1, 'description': 'Multi-player'}, {'id': 8, 'description': 'Valve Anti-Cheat enabled'}]
3    [{'id': 1, 'description': 'Multi-player'}, {'id': 36, 'description': 'Online Multi-Player'}, {'id': 37, 'description': 'Local Multi-Player'}, {'id': 8, 'description': 'Valve Anti-Cheat enabled'}]
4                                                            [{'id': 2, 'description': 'Single-player'}, {'id': 1, 'description': 'Multi-player'}, {'id': 8, 'description': 'Valve Anti-Cheat enable

In [77]:
print_steam_links(steam_data[steam_data['categories'].isnull()].tail(20))

MOTiON by RADiCAL: https://store.steampowered.com/app/999900
The Marvellous Machine: https://store.steampowered.com/app/1000510
iDancer: https://store.steampowered.com/app/1004740
SubnetPing: https://store.steampowered.com/app/1008160
YouTube Center: https://store.steampowered.com/app/1009330
Discord Bot - Controls: https://store.steampowered.com/app/1010170
Wallpaper Maker （造物主视频桌面）: https://store.steampowered.com/app/1010800
Nero GameVR: https://store.steampowered.com/app/1011110
Greenland Melting: https://store.steampowered.com/app/1012510
VEGAS Movie Studio 16 Steam Edition: https://store.steampowered.com/app/1016810
VEGAS Movie Studio 16 Platinum Steam Edition: https://store.steampowered.com/app/1016840
Planet Evolution PC Live Wallpaper: https://store.steampowered.com/app/1017060
Screenbits - Screen Recorder: https://store.steampowered.com/app/1018680
Wondershare Video Converter Ultimate: https://store.steampowered.com/app/1025020
ACID Music Studio 11 Steam Edition: https://store

In [78]:
print(steam_data['genres'].isnull().sum())

37


In [79]:
print(steam_data['genres'][0])

with pd.option_context("display.max_colwidth", 1000):
    display(steam_data['genres'].iloc[100:105])

[{'id': '1', 'description': 'Action'}]


121    [{'id': '2', 'description': 'Strategy'}, {'id': '4', 'description': 'Casual'}]
122                                            [{'id': '4', 'description': 'Casual'}]
123                                            [{'id': '4', 'description': 'Casual'}]
124                                          [{'id': '2', 'description': 'Strategy'}]
125                                            [{'id': '4', 'description': 'Casual'}]
Name: genres, dtype: object

In [80]:
print_steam_links(steam_data[steam_data['genres'].isnull()].head(10))
print_steam_links(steam_data[steam_data['genres'].isnull()].tail(10))

Hot Dish: https://store.steampowered.com/app/12570
Dr. Daisy Pet Vet: https://store.steampowered.com/app/12580
Call of Cthulhu®: Dark Corners of the Earth: https://store.steampowered.com/app/22340
Super Granny Collection: https://store.steampowered.com/app/36270
Sacrifice: https://store.steampowered.com/app/38440
Nancy Drew® Dossier: Resorting to Danger!: https://store.steampowered.com/app/42200
Air Forte: https://store.steampowered.com/app/55020
Sonic Adventure DX: https://store.steampowered.com/app/71250
Portal 2 - The Final Hours: https://store.steampowered.com/app/104600
Sonic CD: https://store.steampowered.com/app/200940
EatWell: https://store.steampowered.com/app/678870
No Lights: https://store.steampowered.com/app/682910
Cyborg Arena: https://store.steampowered.com/app/706440
M.I.A. - Overture: https://store.steampowered.com/app/712060
VEHICLES FURY: https://store.steampowered.com/app/749290
The Big Three: https://store.steampowered.com/app/823390
BlueberryNOVA: https://store.st

In [81]:
steam_data[(steam_data['genres'].isnull()) | (steam_data['categories'].isnull())]

Unnamed: 0,name,steam_appid,required_age,categories,genres,achievements,release_date,content_descriptors,windows,mac,linux,price,english,developer,publisher
371,Hot Dish,12570,3,"[{'id': 2, 'description': 'Single-player'}]",,{'total': 0},"{'coming_soon': False, 'date': '29 Jul, 2008'}","{'ids': [], 'notes': None}",1,0,0,5.99,1,Zemnott,ValuSoft
372,Dr. Daisy Pet Vet,12580,3,"[{'id': 2, 'description': 'Single-player'}]",,{'total': 0},"{'coming_soon': False, 'date': '29 Jul, 2008'}","{'ids': [], 'notes': None}",1,0,0,5.99,1,Zemnott,ValuSoft
404,Tom Clancy's Ghost Recon® Island Thunder™,13630,3,,"[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '15 Jul, 2008'}","{'ids': [], 'notes': None}",1,0,0,4.29,1,Red Storm Entertainment,Ubisoft
557,Call of Cthulhu®: Dark Corners of the Earth,22340,3,"[{'id': 2, 'description': 'Single-player'}]",,{'total': 0},"{'coming_soon': False, 'date': '16 Jun, 2009'}","{'ids': [], 'notes': None}",1,0,0,3.99,1,Headfirst Productions,Bethesda Softworks
789,Westward Collection,36150,3,,"[{'id': '4', 'description': 'Casual'}]",{'total': 0},"{'coming_soon': False, 'date': '17 Jul, 2009'}","{'ids': [], 'notes': None}",1,0,0,10.99,1,Sandlot Games,Sandlot Games
793,Super Granny Collection,36270,3,"[{'id': 2, 'description': 'Single-player'}]",,{'total': 0},"{'coming_soon': False, 'date': '17 Jul, 2009'}","{'ids': [], 'notes': None}",1,0,0,10.99,1,Sandlot Games,Sandlot Games
846,Sacrifice,38440,3,"[{'id': 2, 'description': 'Single-player'}]",,{'total': 0},"{'coming_soon': False, 'date': '19 Aug, 2009'}","{'ids': [], 'notes': None}",1,0,0,6.99,1,Shiny Entertainment,Interplay Inc.
866,Painkiller: Black Edition,39530,3,,"[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '24 Jan, 2007'}","{'ids': [2, 5], 'notes': None}",1,0,0,8.99,1,People Can Fly,THQ Nordic
921,Nancy Drew® Dossier: Resorting to Danger!,42200,3,"[{'id': 2, 'description': 'Single-player'}]",,{'total': 0},"{'coming_soon': False, 'date': '19 Nov, 2009'}","{'ids': [], 'notes': None}",1,0,0,5.19,1,HeR Interactive,HeR Interactive
1029,Might & Magic: Heroes VI,48220,3,,"[{'id': '3', 'description': 'RPG'}, {'id': '2'...",{'total': 0},"{'coming_soon': False, 'date': '13 Oct, 2011'}","{'ids': [], 'notes': None}",1,0,0,16.99,1,Blackhole,Ubisoft
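
Both columns hold string representations of lists of dictionaries. One way to make them usable is to keep only the description text and expand it into one 0/1 indicator column per label. The sketch below shows the idea on two made-up rows; `toy` is illustrative and the column-name clean-up is simplified.

```python
from ast import literal_eval

import pandas as pd

# Toy stand-in with values in the same format as the categories column
toy = pd.DataFrame({
    'steam_appid': [10, 30],
    'categories': [
        "[{'id': 1, 'description': 'Multi-player'}, {'id': 36, 'description': 'Online Multi-Player'}]",
        "[{'id': 1, 'description': 'Multi-player'}]",
    ],
})

# Parse each string and keep only the description text
labels = toy['categories'].apply(
    lambda x: [item['description'] for item in literal_eval(x)]
)

# One 0/1 column per distinct label, with a snake_case name
for label in sorted(set(labels.explode())):
    col_name = 'c_' + label.lower().replace('-', '_').replace(' ', '_')
    toy[col_name] = labels.apply(lambda x, label=label: int(label in x))

toy = toy.drop(columns='categories')
print(toy)
```

Keeping `steam_appid` alongside the indicator columns means the result can be stored separately and joined back to the main data when needed.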


In [82]:
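def process_categories(df, export=False):
    """Drop rows with missing categories and remove the column, optionally exporting 0/1 category columns first."""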
    df = df[df['categories'].notnull()].copy()
    
    if export:
        category_data = df[['steam_appid', 'categories']].copy()

        category_data['categories'] = category_data['categories'].apply(lambda x: [item['description'] for item in literal_eval(x)])

        cols = set(list(itertools.chain(*category_data['categories'])))
        
        for col in sorted(cols):
            col_name = 'c_' + (col.lower()
                                  .replace('-', '_')
                                  .replace(' ', '_')
                                  .replace('(', '')
                                  .replace(')', '')
                                  .replace('/', '_or_')
                              )
            category_data[col_name] = category_data['categories'].apply(lambda x: 1 if col in x else 0)
        
        category_data = category_data.drop('categories', axis=1)
        export_data(category_data, 'category_data')
    
    df = df.drop('categories', axis=1)
    
    return df


def process_genres(df, export=False):
    df = df[df['genres'].notnull()].copy()
    
    if export:
        genre_data = df[['steam_appid', 'genres']].copy()

        genre_data['genres'] = genre_data['genres'].apply(lambda x: [item['description'] for item in literal_eval(x)])
        
        cols = set(list(itertools.chain(*genre_data['genres'])))

        for col in sorted(cols):
            col_name = 'g_' + (col.lower()
                            .replace(' ', '_')
                            .replace('&', 'and')
                       )
            genre_data[col_name] = genre_data['genres'].apply(lambda x: 1 if col in x else 0)

        genre_data = genre_data.drop('genres', axis=1)            
        export_data(genre_data, 'genre_data')
        
    df = df.drop('genres', axis=1)
    
    return df


process_categories(steam_data, export=True).head()
process_genres(steam_data, export=True).head()

Exported category data to '../data/exports/steam_category_data.csv'
Exported genre data to '../data/exports/steam_genre_data.csv'


Unnamed: 0,name,steam_appid,required_age,categories,achievements,release_date,content_descriptors,windows,mac,linux,price,english,developer,publisher
0,Counter-Strike,10,3,"[{'id': 1, 'description': 'Multi-player'}, {'i...",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,7.19,1,Valve,Valve
1,Team Fortress Classic,20,3,"[{'id': 1, 'description': 'Multi-player'}, {'i...",{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,3.99,1,Valve,Valve
2,Day of Defeat,30,3,"[{'id': 1, 'description': 'Multi-player'}, {'i...",{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
3,Deathmatch Classic,40,3,"[{'id': 1, 'description': 'Multi-player'}, {'i...",{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,3,"[{'id': 2, 'description': 'Single-player'}, {'...",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Gearbox Software,Valve


In [83]:
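def expand_columns(df, col):
    """One-hot encode a column of stringified lists of {'id', 'description'} dicts.
    
    Adds one 0/1 column per unique description and drops the original column.
    Note: modifies `df` in place, so pass in a copy.
    """
    # Parse each string and keep only the description values
    df[col] = df[col].apply(lambda x: [item['description'] for item in literal_eval(x)])
    # Collect every unique description across all rows
    new_cols = set(itertools.chain.from_iterable(df[col]))
    
    for new_col in sorted(new_cols):
        # Build a snake_case column name from the description
        new_col_name = (new_col.lower()
                               .replace('-', '_')
                               .replace(' ', '_')
                               .replace('(', '')
                               .replace(')', '')
                               .replace('/', '_or_')
                               .replace('&', 'and')
                       )
        df[new_col_name] = df[col].apply(lambda x: 1 if new_col in x else 0)
            
    return df.drop(col, axis=1)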


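def process_categories(df, export=False):
    """Remove rows without categories, optionally export one-hot category data, then drop the column."""
    df = df[df['categories'].notnull()].copy()
    
    # Build the one-hot table from a copy so df itself is not modified
    category_data = df[['steam_appid', 'categories']].copy()
    category_data = expand_columns(category_data, 'categories')
    
    if export:
        export_data(category_data, 'category_data')
    
    df = df.drop('categories', axis=1)
    
    return df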


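def process_genres(df, export=False):
    """Remove rows without genres, optionally export one-hot genre data, then drop the column."""
    df = df[df['genres'].notnull()].copy()
    
    # Build the one-hot table from a copy so df itself is not modified
    genre_data = df[['steam_appid', 'genres']].copy()
    genre_data = expand_columns(genre_data, 'genres')
        
    if export:
        export_data(genre_data, 'genre_data')
        
    df = df.drop('genres', axis=1)
    
    return df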


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_descriptions(df)
    df = process_language(df)
    df = process_images(df)
    df = process_info(df)
    df = process_requirements(df)
    df = process_developers_and_publishers(df)
    df = process_categories(df, export=True)
    df = process_genres(df, export=True)
    
    return df

steam_data = process(raw_steam_data)
steam_data.head()

Exported category data to '../data/exports/steam_category_data.csv'
Exported genre data to '../data/exports/steam_genre_data.csv'


Unnamed: 0,name,steam_appid,required_age,achievements,release_date,content_descriptors,windows,mac,linux,price,english,developer,publisher
0,Counter-Strike,10,3,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,7.19,1,Valve,Valve
1,Team Fortress Classic,20,3,{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,3.99,1,Valve,Valve
2,Day of Defeat,30,3,{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
3,Deathmatch Classic,40,3,{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,3,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Gearbox Software,Valve


In [84]:
pd.read_csv('../data/exports/steam_category_data.csv').head()

Unnamed: 0,steam_appid,captions_available,co_op,commentary_available,cross_platform_multiplayer,full_controller_support,in_app_purchases,includes_source_sdk,includes_level_editor,local_co_op,local_multi_player,mmo,mods,mods_require_hl2,multi_player,online_co_op,online_multi_player,partial_controller_support,shared_or_split_screen,single_player,stats,steam_achievements,steam_cloud,steam_leaderboards,steam_trading_cards,steam_turn_notifications,steam_workshop,steamvr_collectibles,vr_support,valve_anti_cheat_enabled
0,10,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
1,20,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
2,30,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,40,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
4,50,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1


In [85]:
pd.read_csv('../data/exports/steam_genre_data.csv').head()

Unnamed: 0,steam_appid,accounting,action,adventure,animation_and_modeling,audio_production,casual,design_and_illustration,documentary,early_access,education,free_to_play,game_development,gore,indie,massively_multiplayer,nudity,photo_editing,rpg,racing,sexual_content,simulation,software_training,sports,strategy,tutorial,utilities,video_production,violent,web_publishing
0,10,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,20,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,30,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,40,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,50,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
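
The exported tables above are the result of one-hot encoding the description values. As a rough illustration of what `expand_columns` does, here is a minimal standard-library sketch that applies the same name-cleaning replacements to a couple of toy records. The helper names `clean_col_name` and `one_hot`, the record ids, and the toy strings are all hypothetical and are not part of the notebook's pipeline.

```python
from ast import literal_eval
from itertools import chain

# Same replacements, in the same order, as the chain in expand_columns
REPLACEMENTS = [('-', '_'), (' ', '_'), ('(', ''), (')', ''), ('/', '_or_'), ('&', 'and')]


def clean_col_name(description):
    """Turn a description such as 'Single-player' into a snake_case column name."""
    name = description.lower()
    for old, new in REPLACEMENTS:
        name = name.replace(old, new)
    return name


def one_hot(records):
    """Map each record id to a dict of 0/1 indicators, one per unique description."""
    # Parse each stringified list and keep only the description values
    parsed = {
        key: [item['description'] for item in literal_eval(raw)]
        for key, raw in records.items()
    }
    # Every unique description across all records, in sorted order
    labels = sorted(set(chain.from_iterable(parsed.values())))
    return {
        key: {clean_col_name(label): int(label in descs) for label in labels}
        for key, descs in parsed.items()
    }


# Toy input (made up for illustration, not taken from the data set)
toy = {
    1: "[{'id': 1, 'description': 'Multi-player'}]",
    2: "[{'id': 2, 'description': 'Single-player'}, {'id': 1, 'description': 'Multi-player'}]",
}
encoded = one_hot(toy)
print(encoded[1])  # {'multi_player': 1, 'single_player': 0}
```

Columns are created in sorted order of the original descriptions rather than of the cleaned names, which is why capitalised and lower-case words can appear slightly out of alphabetical order in the exported headers.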


### Achievements and Content Descriptors

In [86]:
steam_data.head()

Unnamed: 0,name,steam_appid,required_age,achievements,release_date,content_descriptors,windows,mac,linux,price,english,developer,publisher
0,Counter-Strike,10,3,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,7.19,1,Valve,Valve
1,Team Fortress Classic,20,3,{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,3.99,1,Valve,Valve
2,Day of Defeat,30,3,{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
3,Deathmatch Classic,40,3,{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,3,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Gearbox Software,Valve


In [87]:
steam_data['achievements'].isnull().sum()

1855

In [88]:
literal_eval(steam_data['achievements'][9])

{'total': 33,
 'highlighted': [{'name': 'Defiant',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_hit_cancop_withcan.jpg'},
  {'name': 'Submissive',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_put_canintrash.jpg'},
  {'name': 'Malcontent',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_escape_apartmentraid.jpg'},
  {'name': 'What cat?',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_break_miniteleporter.jpg'},
  {'name': 'Trusty Hardware',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_get_crowbar.jpg'},
  {'name': 'Barnacle Bowling',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_kill_barnacleswithbarrel.jpg'},
  {'name': "Anchor's Aweigh!",
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_get_airboat.jpg'},
  {'nam
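
The output above shows that each non-null achievements entry is a stringified dictionary with a `total` count and a list of highlighted achievements. The notebook goes on to drop this column, but if the count were wanted, a sketch along these lines could pull it out. The helper name `achievement_total` is hypothetical, and treating missing or unparseable values as 0 is an assumption rather than a decision made in the notebook.

```python
from ast import literal_eval


def achievement_total(raw):
    """Return the 'total' count from a stringified achievements dict, or 0 if unavailable."""
    if not isinstance(raw, str):
        # Missing values are NaN (a float) once loaded by pandas
        return 0
    try:
        parsed = literal_eval(raw)
    except (ValueError, SyntaxError):
        # Malformed or truncated strings cannot be parsed
        return 0
    if not isinstance(parsed, dict):
        return 0
    return int(parsed.get('total', 0))


print(achievement_total("{'total': 33, 'highlighted': []}"))  # 33
print(achievement_total(float('nan')))                        # 0
```

With the data frame loaded, this could be applied as `steam_data['achievements'].apply(achievement_total)`.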

In [89]:
steam_data['content_descriptors'].isnull().sum()

0

In [90]:
steam_data['content_descriptors'].value_counts().head(6)

{'ids': [], 'notes': None}                                                                                                                                                                  25394
{'ids': [2, 5], 'notes': None}                                                                                                                                                                427
{'ids': [1, 5], 'notes': None}                                                                                                                                                                251
{'ids': [5], 'notes': None}                                                                                                                                                                   127
{'ids': [1, 2, 5], 'notes': None}                                                                                                                                                             122
{'ids': [2, 5], 'notes': 'This

The `content_descriptors` column has no null values, but the most common entry by far (25,394 rows) is an empty `ids` list with no notes. The updated `process` function drops both the `achievements` and `content_descriptors` columns.

In [91]:
def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    df = df.drop(['achievements', 'content_descriptors'], axis=1)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_descriptions(df)
    df = process_language(df)
    df = process_images(df)
    df = process_info(df)
    df = process_requirements(df)
    df = process_developers_and_publishers(df)
    df = process_categories(df)
    df = process_genres(df)
    
    return df

steam_data = process(raw_steam_data)
steam_data.head()

Unnamed: 0,name,steam_appid,required_age,release_date,windows,mac,linux,price,english,developer,publisher
0,Counter-Strike,10,3,"{'coming_soon': False, 'date': '1 Nov, 2000'}",1,1,1,7.19,1,Valve,Valve
1,Team Fortress Classic,20,3,"{'coming_soon': False, 'date': '1 Apr, 1999'}",1,1,1,3.99,1,Valve,Valve
2,Day of Defeat,30,3,"{'coming_soon': False, 'date': '1 May, 2003'}",1,1,1,3.99,1,Valve,Valve
3,Deathmatch Classic,40,3,"{'coming_soon': False, 'date': '1 Jun, 2001'}",1,1,1,3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,3,"{'coming_soon': False, 'date': '1 Nov, 1999'}",1,1,1,3.99,1,Gearbox Software,Valve
