# Steam Data Cleaning (Part 1)

*This forms part of a larger series of posts for my [blog](http://nik-davis.github.io) on downloading, processing and analysing data from the Steam Store. [See all posts here](http://nik-davis.github.io/tags/steam).*


In [1]:
# view software version information

# http://raw.github.com/jrjohansson/version_information/master/version_information.py
%load_ext version_information
%reload_ext version_information

%version_information numpy, pandas

Software,Version
Python,3.7.3 64bit [MSC v.1900 64 bit (AMD64)]
IPython,7.5.0
OS,Windows 10 10.0.17763 SP0
numpy,1.16.3
pandas,0.24.2
Mon Jun 03 15:47:30 2019 GMT Summer Time
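The `version_information` extension is a third-party add-on, so it may not be available in every environment. As a rough alternative, the same key details can be printed with nothing but the imported packages themselves. The `versions` dictionary below is just an illustrative name, not part of the original notebook.

```python
import sys

import numpy as np
import pandas as pd

# Collect version strings directly from the interpreter and each package
versions = {
    'Python': sys.version.split()[0],
    'numpy': np.__version__,
    'pandas': pd.__version__,
}

for name, version in versions.items():
    print(name, version)
```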


<!-- PELICAN_BEGIN_SUMMARY -->

In the first part of this project, we downloaded and generated data sets from the Steam Store API and SteamSpy API. We now need to take this raw data and prepare it in a process commonly referred to as [data cleaning](https://en.wikipedia.org/wiki/Data_cleansing).

Currently the downloaded data is not in a very useful state. Many of the columns contain lengthy strings or missing values, which hinder analysis and are especially crippling to any machine learning techniques we may wish to implement. Data cleaning involves handling missing values, tidying up values, and ensuring data is neatly and consistently formatted.

<!-- PELICAN_END_SUMMARY -->

Data cleaning is often cited as the lengthiest part of any project, so it will be broken up across a series of posts, starting with this one. We will begin by taking care of the columns in the Steam data that are easiest to deal with and outlining a framework for the process. It could all be done in one go and a lot more concisely, but we'll step through the reasons for each decision and build the process iteratively.

The main aims of this project are to investigate various sales and play-time statistics for games from the Steam Store, and to see how different features of games affect their success. Keeping this in mind will help inform our decisions about how to handle the various columns in our data set. That said, it may be a good idea to keep columns that don't seem useful to this particular project, in order to provide a robust data set for future projects.

In part 2, we'll take care of columns whose data we want to export separately and store for later. Finally for the Steam data, in part 3 we will walk through the process of optimising the handling of a column, before exporting the clean data.

Once that is complete we will repeat the whole cleaning process for the SteamSpy data and combine the results in part 4, finishing with a complete data set ready for analysis.

The raw data can be found and downloaded on [Kaggle](https://www.kaggle.com/nikdavis/steam-store-raw).

## API references:

- https://partner.steamgames.com/doc/webapi
- https://wiki.teamfortress.com/wiki/User:RJackson/StorefrontAPI
- https://steamapi.xpaw.me/#
- https://steamspy.com/api.php

## Import Libraries and Inspect Data

To begin with, we'll import the required libraries and set customisation options, then take a look at the downloaded data by reading it into a pandas dataframe.

In [2]:
# standard library imports
from ast import literal_eval
import itertools
import time
import re

# third-party imports
import numpy as np
import pandas as pd

# customisations
pd.set_option("display.max_columns", 100)

In [3]:
# read in downloaded data
raw_steam_data = pd.read_csv('../data/raw/steam_app_data.csv')

# print out number of rows and columns
print('Rows:', raw_steam_data.shape[0])
print('Columns:', raw_steam_data.shape[1])

# view first five rows
raw_steam_data.head()

Rows: 29235
Columns: 39


Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,legal_notice,drm_notice,ext_user_account_notice,developers,publishers,demos,price_overview,packages,package_groups,platforms,metacritic,reviews,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
0,game,Counter-Strike,10,0.0,False,,,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,,"English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'GBP', 'initial': 719, 'final': 7...",[7],"[{'name': 'default', 'title': 'Buy Counter-Str...","{'windows': True, 'mac': True, 'linux': True}","{'score': 88, 'url': 'https://www.metacritic.c...",,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 65735},{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'url': 'http://steamcommunity.com/app/10', 'e...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
1,game,Team Fortress Classic,20,0.0,False,,,One of the most popular online action games of...,One of the most popular online action games of...,One of the most popular online action games of...,,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/20/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'GBP', 'initial': 399, 'final': 3...",[29],"[{'name': 'default', 'title': 'Buy Team Fortre...","{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 2802},{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/20/...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
2,game,Day of Defeat,30,0.0,False,,,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,,"English, French, German, Italian, Spanish - Spain",https://steamcdn-a.akamaihd.net/steam/apps/30/...,http://www.dayofdefeat.com/,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'GBP', 'initial': 399, 'final': 3...",[30],"[{'name': 'default', 'title': 'Buy Day of Defe...","{'windows': True, 'mac': True, 'linux': True}","{'score': 79, 'url': 'https://www.metacritic.c...",,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 1992},{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/30/...,"{'ids': [], 'notes': None}"
3,game,Deathmatch Classic,40,0.0,False,,,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/40/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'GBP', 'initial': 399, 'final': 3...",[31],"[{'name': 'default', 'title': 'Buy Deathmatch ...","{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 931},{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/40/...,"{'ids': [], 'notes': None}"
4,game,Half-Life: Opposing Force,50,0.0,False,,,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,,"English, French, German, Korean",https://steamcdn-a.akamaihd.net/steam/apps/50/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Gearbox Software'],['Valve'],,"{'currency': 'GBP', 'initial': 399, 'final': 3...",[32],"[{'name': 'default', 'title': 'Buy Half-Life: ...","{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 4355},{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'url': 'https://help.steampowered.com', 'emai...",https://steamcdn-a.akamaihd.net/steam/apps/50/...,"{'ids': [], 'notes': None}"


From a quick inspection of the data, we can see that we have a mixture of numeric and string columns, plenty of missing values, and a number of columns that look to be stored as dictionaries or lists.

We can chain the `isnull()` and `sum()` methods to easily see how many missing values we have in each column. Immediately we can see that a number of columns have over 20,000 rows with missing data, and in a data set of roughly 30,000 rows these are unlikely to provide any meaningful information.

In [4]:
null_counts = raw_steam_data.isnull().sum()
null_counts

type                         149
name                           1
steam_appid                    0
required_age                 149
is_free                      149
controller_support         23237
dlc                        24260
detailed_description         175
about_the_game               175
short_description            175
fullgame                   29235
supported_languages          163
header_image                 149
website                     9983
pc_requirements              149
mac_requirements             149
linux_requirements           149
legal_notice               19168
drm_notice                 29077
ext_user_account_notice    28723
developers                   264
publishers                   149
demos                      27096
price_overview              3712
packages                    3370
package_groups               149
platforms                    149
metacritic                 26254
reviews                    23330
categories                   714
genres    

## Initial Processing

We will most likely have to handle each column individually, so we will write some functions to keep our methodology organised, and help iteratively develop the process.

Our first function will remove the columns with more than 50% missing values, taking care of the columns with high null counts. We can do this by running a filter on the dataframe, as seen below.

In [5]:
threshold = raw_steam_data.shape[0] // 2

print('Drop columns with more than {} missing rows'.format(threshold))
print()

drop_cols = raw_steam_data.columns[null_counts > threshold]

print('Columns to drop: {}'.format(list(drop_cols)))

Drop columns with more than 14617 missing rows

Columns to drop: ['controller_support', 'dlc', 'fullgame', 'legal_notice', 'drm_notice', 'ext_user_account_notice', 'demos', 'metacritic', 'reviews', 'recommendations']
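To make the filter above more concrete, here is a minimal sketch on a small made-up dataframe (`toy` and its column values are hypothetical, not taken from the Steam data). Comparing the null counts to the threshold produces a boolean Series, which is then used to pick out the matching column labels.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the raw Steam data
toy = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'website': ['x', np.nan, np.nan, np.nan],
    'dlc': [np.nan, np.nan, 'y', 'z'],
})

toy_null_counts = toy.isnull().sum()   # missing values per column
toy_threshold = toy.shape[0] // 2      # half the number of rows

# Comparing a Series to a number gives a boolean Series, one entry per column
mask = toy_null_counts > toy_threshold

# Indexing the column labels with the mask keeps only the labels marked True
cols_to_drop = list(toy.columns[mask])
print(cols_to_drop)
```

Note that the comparison is strict, so a column with exactly half of its values missing (`dlc` here) is kept.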


We can then look at the `type` and `name` columns, thinning out our data set a little by removing apps without either.

In the data collection stage, if no information was returned from an app's API request, only the name and appid were stored. We can easily identify these apps by looking at rows with missing data in the `type` column, as all other apps have a value here. As seen below, these rows contain no other information so we can safely remove them.

In [6]:
print('Rows to remove:', raw_steam_data[raw_steam_data['type'].isnull()].shape[0])

# preview rows with missing type data
raw_steam_data[raw_steam_data['type'].isnull()].head()

Rows to remove: 149


Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,legal_notice,drm_notice,ext_user_account_notice,developers,publishers,demos,price_overview,packages,package_groups,platforms,metacritic,reviews,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
26,,Half-Life: Opposing Force,852,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
147,,Half-Life: Opposing Force,4330,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
256,,Half-Life: Opposing Force,8740,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
264,,Half-Life: Opposing Force,8955,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
336,,Half-Life: Opposing Force,11610,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


We can look at the counts of unique values in a column by using the pandas [Series.value_counts](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) method. By checking the value counts we see that all rows either have a missing value, as noted above, or 'game' in the `type` column.

Once the null rows are removed, we'll be able to remove this column as it doesn't provide us with any more useful information.

In [7]:
raw_steam_data['type'].value_counts(dropna=False)

game    29086
NaN       149
Name: type, dtype: int64
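The `dropna=False` argument matters here, because by default `value_counts` leaves missing values out of the tally. A small sketch on a made-up Series (`toy_type` is a hypothetical stand-in for the `type` column) shows the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the type column
toy_type = pd.Series(['game', 'game', np.nan, 'game', np.nan])

# Default behaviour: missing values are silently excluded
default_counts = toy_type.value_counts()

# With dropna=False, NaN gets its own entry in the result
all_counts = toy_type.value_counts(dropna=False)

print(default_counts)
print(all_counts)
```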

Taking a look now at the `name` column, we can check for rows which either have a null value or the string 'none'. The latter isn't recognised as a null value but should be treated as such.

We achieve this by combining boolean filters using brackets and a vertical bar, `|`, symbolising a logical 'or'.

There are only four rows which match these criteria, and they appear to be missing a lot of data in other columns so we should definitely remove them.

In [8]:
raw_steam_data[(raw_steam_data['name'].isnull()) | (raw_steam_data['name'] == 'none')]

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,legal_notice,drm_notice,ext_user_account_notice,developers,publishers,demos,price_overview,packages,package_groups,platforms,metacritic,reviews,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
4918,game,none,339860,0.0,False,,,,,,,English<strong>*</strong><br><strong>*</strong...,https://steamcdn-a.akamaihd.net/steam/apps/339...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],,,,,[''],,,,[],"{'windows': True, 'mac': False, 'linux': False}",,,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '25', 'description': 'Adventure'}, {'i...",,,,"{'total': 3, 'highlighted': [{'name': 'Master ...","{'coming_soon': False, 'date': '27 Feb, 2015'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}"
6779,game,none,385020,0.0,False,,,- discontinued - (please remove),- discontinued - (please remove),- discontinued - (please remove),,"English, French, Italian, German, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/385...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,,,,['none'],[''],,,,[],"{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '4', 'description': 'Casual'}, {'id': ...",,,,{'total': 0},"{'coming_soon': False, 'date': '4 Nov, 2015'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}"
7235,game,,396420,0.0,True,,,Spookeningは3Dの恐怖ゲームで、あなたは毎夜に死んでゴーストとして復活します。<b...,Spookeningは3Dの恐怖ゲームで、あなたは毎夜に死んでゴーストとして復活します。<b...,Spookeningは3Dの恐怖ゲームで、あなたは毎夜に死んでゴーストとして復活します。 村...,,,https://steamcdn-a.akamaihd.net/steam/apps/396...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],,,,,[''],,,,[],"{'windows': True, 'mac': False, 'linux': False}",,,,,,,,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2016'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}"
7350,game,none,398970,0.0,False,,,,,,,English<strong>*</strong><br><strong>*</strong...,https://steamcdn-a.akamaihd.net/steam/apps/398...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,,,,['none'],['none'],"[{'appid': 516340, 'description': ''}]",,,[],"{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '25', 'description': 'Adventure'}, {'i...",,,,"{'total': 35, 'highlighted': [{'name': ""They'v...","{'coming_soon': False, 'date': '5 Nov, 2015'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}"
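The combined filter used above can be sketched on a small made-up dataframe (the `toy` data is hypothetical). The brackets around each condition are needed because `|` binds more tightly than `==` in Python, so without them the comparison would not be evaluated first.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one null name and one 'none' name
toy = pd.DataFrame({
    'name': ['Counter-Strike', 'none', np.nan, 'Day of Defeat'],
    'steam_appid': [10, 20, 30, 40],
})

is_missing = toy['name'].isnull()   # True where the name is NaN
is_none = toy['name'] == 'none'     # True where the name is the string 'none'

# | combines the two boolean Series element-wise (logical 'or')
bad_names = toy[is_missing | is_none]

# ~ inverts the mask, keeping only rows with a usable name
good_names = toy[~(is_missing | is_none)]

print(bad_names)
print(good_names)
```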


As we know for certain that all AppIDs should be unique, any rows with the same ID need to be handled.

We can easily view duplicated rows using the [DataFrame.duplicated()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html) method of pandas. We can pass `keep=False` to view all duplicated rows, or leave the defaults (`keep='first'`) to skip over the first row and just show the rest of the duplicates. We can also pass a column label into `subset` if we want to filter by a single column.

As we only want to remove the extra rows, we can keep the default behaviour.

In [9]:
duplicate_rows = raw_steam_data[raw_steam_data.duplicated()]

print('Duplicate rows to remove:', duplicate_rows.shape[0])

duplicate_rows.head(3)

Duplicate rows to remove: 7


Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,legal_notice,drm_notice,ext_user_account_notice,developers,publishers,demos,price_overview,packages,package_groups,platforms,metacritic,reviews,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
31,game,SiN Episodes: Emergence,1300,0.0,False,,,"You are John Blade, commander of HardCorps, an...","You are John Blade, commander of HardCorps, an...","You are John Blade, commander of HardCorps, an...",,"English, Russian, French",https://steamcdn-a.akamaihd.net/steam/apps/130...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,[],[],,,,['Ritual Entertainment'],['Ritual Entertainment'],,"{'currency': 'GBP', 'initial': 719, 'final': 7...",[70],"[{'name': 'default', 'title': 'Buy SiN Episode...","{'windows': True, 'mac': False, 'linux': False}","{'score': 75, 'url': 'https://www.metacritic.c...",,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 265},{'total': 0},"{'coming_soon': False, 'date': '10 May, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/130...,"{'ids': [], 'notes': None}"
32,game,SiN Episodes: Emergence,1300,0.0,False,,,"You are John Blade, commander of HardCorps, an...","You are John Blade, commander of HardCorps, an...","You are John Blade, commander of HardCorps, an...",,"English, Russian, French",https://steamcdn-a.akamaihd.net/steam/apps/130...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,[],[],,,,['Ritual Entertainment'],['Ritual Entertainment'],,"{'currency': 'GBP', 'initial': 719, 'final': 7...",[70],"[{'name': 'default', 'title': 'Buy SiN Episode...","{'windows': True, 'mac': False, 'linux': False}","{'score': 75, 'url': 'https://www.metacritic.c...",,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 265},{'total': 0},"{'coming_soon': False, 'date': '10 May, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/130...,"{'ids': [], 'notes': None}"
356,game,Jagged Alliance 2 Gold,1620,0.0,False,,,<p>The small country of Arulco has been taken ...,<p>The small country of Arulco has been taken ...,The small country of Arulco has been taken ove...,,English,https://steamcdn-a.akamaihd.net/steam/apps/162...,http://www.jaggedalliance2.com/,{'minimum': '<p><strong>Minimum Configuration:...,[],[],,,,['Strategy First'],['Strategy First'],,"{'currency': 'GBP', 'initial': 1499, 'final': ...",[94],"[{'name': 'default', 'title': 'Buy Jagged Alli...","{'windows': True, 'mac': False, 'linux': False}",,,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '2', 'description': 'Strategy'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,,{'total': 0},"{'coming_soon': False, 'date': '6 Jul, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/162...,"{'ids': [], 'notes': None}"


Let's also quickly verify that we aren't missing any rows duplicated on just the `steam_appid` column by comparing the `duplicate_rows` dataframe with the one generated by passing `subset='steam_appid'` into the duplicated method.

In [10]:
duplicate_app_id_rows = raw_steam_data[raw_steam_data.duplicated(subset='steam_appid')]

print('True if same:', duplicate_app_id_rows.equals(duplicate_rows))

True if same: True
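To illustrate how the `keep` and `subset` arguments change what `duplicated` reports, here is a minimal sketch on made-up data (the `toy` dataframe is hypothetical). It also shows the case the check above guards against: a row that shares an ID with another row but differs elsewhere is only caught when `subset` is used.

```python
import pandas as pd

# Hypothetical data: appid 20 is an exact duplicate, appid 30 repeats with a different name
toy = pd.DataFrame({
    'steam_appid': [10, 20, 20, 30, 30],
    'name': ['A', 'B', 'B', 'C', 'C (renamed)'],
})

# Default keep='first': flags only the later copies of fully identical rows
full_dupes = toy.duplicated()

# keep=False: flags every copy of a fully identical row, including the first
all_copies = toy.duplicated(keep=False)

# subset: compares on the ID column only, so the renamed row is flagged too
id_dupes = toy.duplicated(subset='steam_appid')

print(full_dupes.tolist())
print(all_copies.tolist())
print(id_dupes.tolist())
```

In the Steam data the two checks returned the same rows, so dropping fully duplicated rows is enough.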


We're now ready to define functions implementing the filters we just looked at. This allows us to easily make changes in the future if we want to alter how the columns are handled, or want to choose a different cut-off threshold for getting rid of columns, for example. 

We also define a general purpose `process` function which will run all the processing functions we create on the data set. This will allow us to slowly add to it as we develop more functions and ensure we're cleaning the correct dataframe.

Finally we run this function on the raw data, inspecting the first few rows and viewing how many rows and columns have been removed.

In [11]:
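def drop_null_cols(df, thresh=0.5):
    """Drop columns with more than a certain proportion of missing values (Default 50%)."""
    # dropna's thresh is the minimum number of non-null values needed to keep a column,
    # so convert the allowed proportion of missing values into a required non-null count
    cutoff_count = int(np.ceil(len(df) * (1 - thresh)))
    
    return df.dropna(thresh=cutoff_count, axis=1)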


def process_name_type(df):
    """Remove null values in name and type columns, and remove type column."""
    df = df[df['type'].notnull()]
    
    df = df[df['name'].notnull()]
    df = df[df['name'] != 'none']
    
    df = df.drop('type', axis=1)
    
    return df
    

def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = drop_null_cols(df)
    
    # Process rest of columns
    df = process_name_type(df)
    
    return df

print(raw_steam_data.shape)
steam_data = process(raw_steam_data)
print(steam_data.shape)
steam_data.head()

(29235, 39)
(29075, 28)


Unnamed: 0,name,steam_appid,required_age,is_free,detailed_description,about_the_game,short_description,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,price_overview,packages,package_groups,platforms,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors
0,Counter-Strike,10,0.0,False,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,"English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],"{'currency': 'GBP', 'initial': 719, 'final': 7...",[7],"[{'name': 'default', 'title': 'Buy Counter-Str...","{'windows': True, 'mac': True, 'linux': True}","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'url': 'http://steamcommunity.com/app/10', 'e...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
1,Team Fortress Classic,20,0.0,False,One of the most popular online action games of...,One of the most popular online action games of...,One of the most popular online action games of...,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/20/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],"{'currency': 'GBP', 'initial': 399, 'final': 3...",[29],"[{'name': 'default', 'title': 'Buy Team Fortre...","{'windows': True, 'mac': True, 'linux': True}","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/20/...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
2,Day of Defeat,30,0.0,False,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,"English, French, German, Italian, Spanish - Spain",https://steamcdn-a.akamaihd.net/steam/apps/30/...,http://www.dayofdefeat.com/,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],"{'currency': 'GBP', 'initial': 399, 'final': 3...",[30],"[{'name': 'default', 'title': 'Buy Day of Defe...","{'windows': True, 'mac': True, 'linux': True}","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/30/...,"{'ids': [], 'notes': None}"
3,Deathmatch Classic,40,0.0,False,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/40/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],"{'currency': 'GBP', 'initial': 399, 'final': 3...",[31],"[{'name': 'default', 'title': 'Buy Deathmatch ...","{'windows': True, 'mac': True, 'linux': True}","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/40/...,"{'ids': [], 'notes': None}"
4,Half-Life: Opposing Force,50,0.0,False,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,"English, French, German, Korean",https://steamcdn-a.akamaihd.net/steam/apps/50/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Gearbox Software'],['Valve'],"{'currency': 'GBP', 'initial': 399, 'final': 3...",[32],"[{'name': 'default', 'title': 'Buy Half-Life: ...","{'windows': True, 'mac': True, 'linux': True}","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'url': 'https://help.steampowered.com', 'emai...",https://steamcdn-a.akamaihd.net/steam/apps/50/...,"{'ids': [], 'notes': None}"
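One detail of `dropna` is worth spelling out: its `thresh` argument is the number of non-null values a column needs in order to be kept, not the number of missing values allowed. At a 50% cut-off the two amount to the same thing, but they differ for any other proportion. A minimal sketch on made-up data (the `toy` dataframe is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data: columns with 4, 2 and 1 non-null values out of 4 rows
toy = pd.DataFrame({
    'full': [1, 2, 3, 4],
    'half': [1, 2, np.nan, np.nan],
    'sparse': [1, np.nan, np.nan, np.nan],
})

# Keep only columns with at least 2 non-null values
kept = toy.dropna(thresh=2, axis=1)

print(list(kept.columns))
```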


## Processing Age

Next we'll look at the `required_age` column. By looking at the value counts we can see that the values are already numeric (stored as floats rather than integers) and range from 0 to 20, with one likely error (1818). There are no missing values in this column, but the vast majority of rows have a value of 0. We'll clean the column anyway, but this probably means it won't be of much use in analysis, as there is little variance in the data.

In [12]:
steam_data['required_age'].value_counts(dropna=False).sort_index()

0.0       28431
1.0           1
3.0          10
4.0           2
5.0           1
6.0           1
7.0           8
10.0          3
11.0          4
12.0         72
13.0         21
14.0          4
15.0         39
16.0        141
17.0         47
18.0        288
20.0          1
1818.0        1
Name: required_age, dtype: int64

Whilst fairly useful in its current state, we may benefit from reducing the number of categories that ages fall into. For example, instead of comparing games rated as 5, 6, 7 or 8, we could compare games rated 5+ or 8+.

To decide which categories (or bins) we should use, we will look at the [PEGI age ratings](https://pegi.info/) as this is the system used in the United Kingdom, where we're performing our analysis. Ratings fall into one of five categories (3, 7, 12, 16, 18), defining the minimum age recommended to play a game.

Using this to inform our decision, we can use the [pandas.cut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html) function to sort our data into each of these categories. Rows with 0 may be unrated, may have had no rating supplied, or may be rated as suitable for everyone. Because we can't tell which, we'll leave these as they are. As the erroneous row (1818) is most likely meant to be rated 18 anyway, we can set the upper bound above this value to catch it inside this category.

Below we define a `process_age` function to handle this, and add it into our `process` definition.

In [13]:
def process_age(df):
    """Format ratings in age column to be in line with the PEGI Age Ratings system."""
    # PEGI Age ratings: 3, 7, 12, 16, 18
    cut_points = [-1, 0, 3, 7, 12, 16, 2000]
    label_values = [0, 3, 7, 12, 16, 18]
    
    df['required_age'] = pd.cut(df['required_age'], bins=cut_points, labels=label_values)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = drop_null_cols(df)
    
    # Process rest of columns
    df = process_name_type(df)
    df = process_age(df)
    
    return df


steam_data = process(raw_steam_data)
steam_data['required_age'].value_counts().sort_index()

0     28431
3        11
7        12
12       79
16      205
18      337
Name: required_age, dtype: int64
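To see how individual ages land in each bin, here is a short sketch applying the same cut points and labels to a handful of hand-picked values (the `ages` Series is made up for illustration). By default `pd.cut` treats each bin as open on the left and closed on the right, so a value equal to a cut point falls into the lower bin.

```python
import pandas as pd

# Hand-picked example ages, including the erroneous 1818
ages = pd.Series([0, 3, 4, 7, 13, 16, 17, 1818])

cut_points = [-1, 0, 3, 7, 12, 16, 2000]
label_values = [0, 3, 7, 12, 16, 18]

# Bins are (-1, 0], (0, 3], (3, 7], (7, 12], (12, 16], (16, 2000]
binned = pd.cut(ages, bins=cut_points, labels=label_values)

print(binned.tolist())
```

One consequence of this scheme is that ages between two PEGI ratings are rounded up to the next rating, so a game listed as 13 ends up in the 16 category.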

## Processing the Platforms Column

Whilst we could look at the next column in the dataframe, `is_free`, it would make sense that this is linked to the `price_overview` column. Ultimately we may wish to combine these columns into one, where free games would have a price of 0. 

Looking at the `price_overview` column, we can see it is stored in a dictionary-like structure, with multiple keys and values. Handling both of these together might be somewhat tricky, so instead we'll look at a simpler example.

In [14]:
steam_data['price_overview'].head()

0    {'currency': 'GBP', 'initial': 719, 'final': 7...
1    {'currency': 'GBP', 'initial': 399, 'final': 3...
2    {'currency': 'GBP', 'initial': 399, 'final': 3...
3    {'currency': 'GBP', 'initial': 399, 'final': 3...
4    {'currency': 'GBP', 'initial': 399, 'final': 3...
Name: price_overview, dtype: object

The `platforms` column appears to contain a key for each of the main operating systems - windows, mac and linux - and a corresponding boolean value, set to True or False depending on the availability on that platform. This should be a reasonably straightforward place to start. We can separate this data out into three columns - one for each platform - filled with boolean values.

In [15]:
steam_data['platforms'].head()

0    {'windows': True, 'mac': True, 'linux': True}
1    {'windows': True, 'mac': True, 'linux': True}
2    {'windows': True, 'mac': True, 'linux': True}
3    {'windows': True, 'mac': True, 'linux': True}
4    {'windows': True, 'mac': True, 'linux': True}
Name: platforms, dtype: object

So far the cleaning process has been relatively simple, mainly requiring checking for null values and dropping some rows or columns. Already we can see that handling the platforms will be a little more complex.

Our first hurdle is getting python to recognise the data in the columns as dictionaries rather than just strings. This will allow us to access the different values separately, without having to do some unnecessarily complicated string formatting. As we can see below, even though the data looks like a dictionary it is in fact stored as a string.

In [16]:
platforms_first_row = steam_data['platforms'].iloc[0]

print(type(platforms_first_row))

platforms_first_row

<class 'str'>


"{'windows': True, 'mac': True, 'linux': True}"

We can get around this using the handy [literal_eval](https://docs.python.org/3/library/ast.html#ast.literal_eval) function from the built-in `ast` module. As the name suggests, this will allow us to evaluate the string, and then index into it as a dictionary.

In [17]:
eval_first_row = literal_eval(platforms_first_row)

print(type(eval_first_row))

eval_first_row['windows']

<class 'dict'>


True
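
As a small standalone sketch of why `literal_eval` is a good fit here (assuming it was imported earlier in the notebook with `from ast import literal_eval`): it only evaluates Python literals such as dictionaries, lists, strings, numbers and booleans, and raises a `ValueError` for anything else, such as a function call. The strings below are made up for illustration.

```python
from ast import literal_eval

# a literal dictionary stored as a string is converted to a real dict
platforms_str = "{'windows': True, 'mac': False, 'linux': False}"
platforms = literal_eval(platforms_str)
print(platforms['mac'])

# anything that isn't a plain literal (here, a function call) is rejected
try:
    literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True

print(rejected)
```

This makes it a safer choice than the built-in `eval` for data that has come from an external source.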

We also need to check for missing values, but fortunately it appears there aren't any in this column.

In [18]:
steam_data['platforms'].isnull().sum()

0

Putting this all together, we can use the pandas [Series.apply](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html) method to quickly evaluate all of the rows, then make calls to `apply` again to create the new columns for each platform.

We could return the True/False value directly and store the values as boolean types, but since we'll be exporting the cleaned data to a csv file, let's store them as integers as this should reduce the file size slightly. Setting True as 1 and False as 0 can still be interpreted as a boolean type, but less data is used to store the information.
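
As a rough check of the file-size argument, the sketch below writes a tiny made-up column to CSV text twice, once as booleans and once as integers, and compares the lengths. The `flags` dataframe is purely illustrative.

```python
import pandas as pd

# made-up availability flags for three games
flags = pd.DataFrame({'windows': [True, False, True]})

# to_csv returns the CSV text as a string when no path is given
as_bool = flags.to_csv(index=False)
as_int = flags.astype(int).to_csv(index=False)

# 'True'/'False' take 4-5 characters per cell, '1'/'0' take one
print(len(as_bool), len(as_int))
```

The saving per cell is small, but it adds up across tens of thousands of rows and several columns.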

In [19]:
# def process_platforms(df):
#     """Split platforms column into separate boolean columns for each platform."""
#     # evaluate values in platforms column, so can index into dictionaries
#     df['platforms'] = df['platforms'].apply(lambda x: literal_eval(x))
    
#     # loop across keys (the platforms) which will be turned into columns
#     for platform in df['platforms'][0].keys():
#         # set 1 if value for platform in original column is True, or 0 if it's False
#         df[platform] = df['platforms'].apply(lambda x: 1 if x[platform] else 0)
    
#     # remove the original platforms column
#     df = df.drop('platforms', axis=1)
    
#     return df


# def process(df):
#     """Process data set. Will eventually contain calls to all functions we write."""
    
#     # Copy the input dataframe to avoid accidentally modifying original data
#     df = df.copy()
    
#     # Remove duplicate rows - all appids should be unique
#     df = df.drop_duplicates()
    
#     # Remove columns with more than 50% null values
#     df = drop_null_cols(df)
    
#     # Process rest of columns
#     df = process_name_type(df)
#     df = process_age(df)
#     df = process_platforms(df)
    
#     return df


# steam_data = process(raw_steam_data)
# steam_data[['name', 'windows', 'mac', 'linux']].head()

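Rather than creating three new columns, an alternative is to keep a single `platforms` column listing the supported platforms separated by semicolons. This is the approach used in the next cell, so the column-splitting version above is left commented out.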

In [20]:
def process_platforms(df):
    """Split platforms column into separate boolean columns for each platform."""
    # evaluate values in platforms column, so can index into dictionaries
    
    def parse_platforms(x):
        
        x_eval = literal_eval(x)
        output = []
        
        for key in x_eval.keys():
            if x_eval[key]:
                output.append(key)
        
        return ';'.join(output)
    
    df['platforms'] = df['platforms'].apply(parse_platforms)
    
#     # loop across keys (the platforms) which will be turned into columns
#     for platform in df['platforms'][0].keys():
#         # set 1 if value for platform in original column is True, or 0 if it's False
#         df[platform] = df['platforms'].apply(lambda x: 1 if x[platform] else 0)
    
#     # remove the original platforms column
#     df = df.drop('platforms', axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = drop_null_cols(df)
    
    # Process rest of columns
    df = process_name_type(df)
    df = process_age(df)
    df = process_platforms(df)
    
    return df


steam_data = process(raw_steam_data)
steam_data[['name', 'platforms']].head()

Unnamed: 0,name,platforms
0,Counter-Strike,windows;mac;linux
1,Team Fortress Classic,windows;mac;linux
2,Day of Defeat,windows;mac;linux
3,Deathmatch Classic,windows;mac;linux
4,Half-Life: Opposing Force,windows;mac;linux
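
Storing the platforms as a semicolon-separated string doesn't lock us out of per-platform columns later. As a minimal sketch on made-up values, pandas' `Series.str.get_dummies` can split such a column back into 0/1 indicator columns whenever they are needed.

```python
import pandas as pd

# made-up examples in the same format as the cleaned platforms column
platforms = pd.Series(['windows;mac;linux', 'windows', 'windows;mac'])

# one indicator column per platform, split on the semicolon
dummies = platforms.str.get_dummies(sep=';')

print(dummies)
```

Each row gets a 1 in the column for every platform it lists and a 0 elsewhere.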


## Processing Price

Now we have built up some intuition around how to deal with data stored as dictionaries, let's return to the `is_free` and `price_overview` columns as we should now be able to handle them.

First let's check how many null values there are in `price_overview`.

In [21]:
steam_data['price_overview'].isnull().sum()

3559

Whilst that looks like a lot, we have to consider the impact that the `is_free` column might be having. Before jumping to conclusions let's check if there are any rows with `is_free` marked as True and null values in the `price_overview` column.

In [22]:
free_and_null_price = steam_data[(steam_data['is_free']) & (steam_data['price_overview'].isnull())]

print(free_and_null_price.shape[0])
free_and_null_price.head()

2713


Unnamed: 0,name,steam_appid,required_age,is_free,detailed_description,about_the_game,short_description,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,price_overview,packages,package_groups,platforms,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors
14,Half-Life 2: Lost Coast,340,0,True,Originally planned as a section of the Highway...,Originally planned as a section of the Highway...,Originally planned as a section of the Highway...,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/340...,http://www.half-life2.com,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,[],[],['Valve'],['Valve'],,,[],windows;mac;linux,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '27 Oct, 2005'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/340...,"{'ids': [], 'notes': None}"
19,Team Fortress 2,440,0,True,"<h1>The Jungle Inferno Update</h1><p><a href=""...","<p><strong>""The most fun you can have online""<...",Nine distinct classes provide a broad range of...,"English<strong>*</strong>, Danish, Dutch, Finn...",https://steamcdn-a.akamaihd.net/steam/apps/440...,http://www.teamfortress.com/,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,['Valve'],['Valve'],,"[197845, 330198, 469]","[{'name': 'default', 'title': 'Buy Team Fortre...",windows;mac;linux,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256698790, 'name': 'Jungle Inferno', '...","{'total': 520, 'highlighted': [{'name': 'Head ...","{'coming_soon': False, 'date': '10 Oct, 2007'}","{'url': 'http://steamcommunity.com/app/440', '...",https://steamcdn-a.akamaihd.net/steam/apps/440...,"{'ids': [2, 5], 'notes': 'Includes cartoon vio..."
22,Dota 2,570,0,True,<strong>The most-played game on Steam.</strong...,<strong>The most-played game on Steam.</strong...,"Every day, millions of players worldwide enter...","Bulgarian, Czech, Danish, Dutch, English<stron...",https://steamcdn-a.akamaihd.net/steam/apps/570...,http://www.dota2.com/,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,['Valve'],['Valve'],,"[197846, 330209]","[{'name': 'default', 'title': 'Buy Dota 2', 'd...",windows;mac;linux,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256692021, 'name': 'Dota 2 - Join the ...",,"{'coming_soon': False, 'date': '9 Jul, 2013'}","{'url': 'http://dev.dota2.com/', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/570...,"{'ids': [], 'notes': None}"
24,Alien Swarm,630,0,True,Alien Swarm is a game and Source SDK release f...,Alien Swarm is a game and Source SDK release f...,Co-operative multiplayer game and complete cod...,English,https://steamcdn-a.akamaihd.net/steam/apps/630...,http://www.alienswarm.com,{'minimum': '<strong>Minimum:</strong><br>\t\t...,[],[],['Valve'],['Valve'],,,[],windows,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,"{'total': 66, 'highlighted': [{'name': 'Clear ...","{'coming_soon': False, 'date': '19 Jul, 2010'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/630...,"{'ids': [], 'notes': None}"
25,Counter-Strike: Global Offensive,730,0,True,Counter-Strike: Global Offensive (CS: GO) expa...,Counter-Strike: Global Offensive (CS: GO) expa...,Counter-Strike: Global Offensive (CS: GO) expa...,"Czech, Danish, Dutch, English<strong>*</strong...",https://steamcdn-a.akamaihd.net/steam/apps/730...,http://blog.counter-strike.net/,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,"['Valve', 'Hidden Path Entertainment']",['Valve'],,"[329385, 298963, 54029]","[{'name': 'default', 'title': 'Buy Counter-Str...",windows;mac;linux,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 81958, 'name': 'CS:GO Trailer Long', '...","{'total': 167, 'highlighted': [{'name': 'Someo...","{'coming_soon': False, 'date': '21 Aug, 2012'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/730...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."


It turns out this accounts for most of the missing values in the `price_overview` column, meaning we can handle these by setting the final price as 0. This makes intuitive sense - free games wouldn't have a price.

This means that there are 846 rows (3559 - 2713) which aren't free but have null values in the `price_overview` column. Let's investigate those next.
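
As an aside, both counts can be read from a single table rather than two separate filters. The sketch below uses `pd.crosstab` on a small made-up dataframe (`toy` is not the real data) to count rows by whether they are free and whether the price is missing.

```python
import numpy as np
import pandas as pd

# made-up rows: two free games without a price, one paid game without a price,
# and two paid games with a price
toy = pd.DataFrame({
    'is_free': [True, True, False, False, False],
    'price_overview': [np.nan, np.nan, np.nan, "{'initial': 399}", "{'initial': 719}"],
})

# rows: is_free (False, True); columns: price missing (False, True)
summary = pd.crosstab(
    toy['is_free'],
    toy['price_overview'].isnull(),
    rownames=['is_free'],
    colnames=['price_missing'],
)

print(summary)
```

On the real data, the same call with `steam_data` in place of `toy` should reproduce the counts found with the filters.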

In [23]:
not_free_and_null_price = steam_data[(steam_data['is_free'] == False) & (steam_data['price_overview'].isnull())]

not_free_and_null_price.head()

Unnamed: 0,name,steam_appid,required_age,is_free,detailed_description,about_the_game,short_description,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,price_overview,packages,package_groups,platforms,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors
63,The Ship: Single Player,2420,0,False,For PC gamers who enjoy multiplayer games with...,For PC gamers who enjoy multiplayer games with...,The Ship is a murder mystery alternative to tr...,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/242...,http://www.blazinggriffin.com/games/the-ship-m...,{'minimum': '<strong>Minimum:</strong> 1.8 GHz...,[],[],['Outerlight Ltd.'],['Blazing Griffin Ltd.'],,[56669],"[{'name': 'default', 'title': 'Buy The Ship: S...",windows,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 2035597, 'name': 'the Ship: Intro', '...",{'total': 0},"{'coming_soon': False, 'date': '20 Nov, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/242...,"{'ids': [], 'notes': None}"
75,RollerCoaster Tycoon® 3: Platinum,2700,0,False,Rollercoaster Tycoon 3 Platinum combines the e...,Rollercoaster Tycoon 3 Platinum combines the e...,Rollercoaster Tycoon 3 Platinum combines the e...,"English, French, Italian, German, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/270...,http://www.atari.com/rollercoastertycoon/us/in...,{'minimum': '<strong>Minimum: </strong><br>\t\...,"{'minimum': '<ul class=""bb_ul""><li><strong>OS:...",[],"['Frontier', 'Aspyr (Mac)']","['Atari', 'Aspyr (Mac)']",,,[],windows;mac,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '28', 'description': 'Simulation'}, {'...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '12 Mar, 2008'}","{'url': 'http://www.atari.com/support/atari', ...",https://steamcdn-a.akamaihd.net/steam/apps/270...,"{'ids': [], 'notes': None}"
220,BioShock™,7670,0,False,<h1>Special Offer</h1><p>Buying BioShock™ also...,BioShock is a shooter unlike any you've ever p...,BioShock is a shooter unlike any you've ever p...,"English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/767...,http://www.BioShockGame.com,"{'minimum': '<h2 class=""bb_tag""><strong>Minimu...",{'minimum': 'Please See BioShock Remastered'},[],"['2K Boston', '2K Australia']",['2K'],,"[451, 127633]","[{'name': 'default', 'title': 'Buy BioShock™',...",windows,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '21 Aug, 2007'}","{'url': 'support.2k.com', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/767...,"{'ids': [], 'notes': None}"
234,Sam & Max 101: Culture Shock,8200,0,False,<strong>Sam &amp; Max: Episode 1 - Culture Sho...,<strong>Sam &amp; Max: Episode 1 - Culture Sho...,Sam &amp; Max: Episode 1 - Culture Shock - The...,"English, French, German, Italian",https://steamcdn-a.akamaihd.net/steam/apps/820...,http://store.steampowered.com/app/901660/,"{'minimum': 'Windows XP or Vista, 1.5GHz proce...",[],[],['Telltale Games'],['Telltale Games'],,"[357, 539]","[{'name': 'default', 'title': 'Buy Sam & Max 1...",windows,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '15 Jun, 2007'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/820...,"{'ids': [], 'notes': None}"
235,Sam & Max 102: Situation: Comedy,8210,0,False,<strong>Sam &amp; Max: Episode 2 - Situation: ...,<strong>Sam &amp; Max: Episode 2 - Situation: ...,Sam &amp; Max: Episode 2 - Situation: Comedy -...,"English, German, Italian",https://steamcdn-a.akamaihd.net/steam/apps/821...,http://store.steampowered.com/app/901660/,"{'minimum': 'Windows XP or Vista, 1.5GHz proce...",[],[],['Telltale Games'],['Telltale Games'],,"[358, 539]","[{'name': 'default', 'title': 'Buy Sam & Max 1...",windows,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '15 Jun, 2007'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/821...,"{'ids': [], 'notes': None}"


The first few rows contain some big, well-known games which appear to have pretty complete data. It looks like we can rule out data errors, so let's dig a little deeper and see if we can find out what is going on.

We'll start by looking at the store pages for some of these titles. The url to an app on the steam website follows this structure:

    https://store.steampowered.com/app/[steam_appid]

This means we can easily generate these links using our above filter. We'll wrap it up in a function in case we want to use it later.

In [24]:
def print_steam_links(df):
    """Print links to store page for apps in a dataframe."""
    url_base = "https://store.steampowered.com/app/"
    
    for i, row in df.iterrows():
        appid = row['steam_appid']
        name = row['name']
        
        print(name + ':', url_base + str(appid))
        

print_steam_links(not_free_and_null_price[:5])

The Ship: Single Player: https://store.steampowered.com/app/2420
RollerCoaster Tycoon® 3: Platinum: https://store.steampowered.com/app/2700
BioShock™: https://store.steampowered.com/app/7670
Sam & Max 101: Culture Shock: https://store.steampowered.com/app/8200
Sam & Max 102: Situation: Comedy: https://store.steampowered.com/app/8210


For these games we can conclude that:

- The Ship: Single Player is a tutorial, and comes as part of The Ship: Murder Party
- RollerCoaster Tycoon 3: Platinum has been removed from steam (and another game website: [GOG](https://www.gog.com/))  
  - "A spokesperson for GOG told Eurogamer it pulled the game "due to expiring licensing rights", and stressed it'll talk with "new distribution rights holders" to bring the game back as soon as possible." Source: [Eurogamer](https://www.eurogamer.net/articles/2018-05-09-rollercoaster-tycoon-3-pulled-from-steam-gog)
- BioShock has been replaced by BioShock Remastered
- Sam & Max 101 is sold as part of a season, and this can be found in the `package_groups` column

So we have a couple of options here. We could just drop these rows, we could try to figure out the price based on the `package_groups` column, or we could leave them for now and return to them later. We'll leave them for now, handling the two price columns, then take a look at the packages next. It may also be that some of these rows are removed later in the cleaning process for other reasons.

If we want to find rows similar to these and deal with each case individually, we could use the `.str.contains()` method, as seen below.

In [25]:
steam_data[steam_data['name'].str.contains("BioShock™")]

Unnamed: 0,name,steam_appid,required_age,is_free,detailed_description,about_the_game,short_description,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,price_overview,packages,package_groups,platforms,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors
220,BioShock™,7670,0,False,<h1>Special Offer</h1><p>Buying BioShock™ also...,BioShock is a shooter unlike any you've ever p...,BioShock is a shooter unlike any you've ever p...,"English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/767...,http://www.BioShockGame.com,"{'minimum': '<h2 class=""bb_tag""><strong>Minimu...",{'minimum': 'Please See BioShock Remastered'},[],"['2K Boston', '2K Australia']",['2K'],,"[451, 127633]","[{'name': 'default', 'title': 'Buy BioShock™',...",windows,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '21 Aug, 2007'}","{'url': 'support.2k.com', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/767...,"{'ids': [], 'notes': None}"
7734,BioShock™ Remastered,409710,18,False,<h1>Special Offer</h1><p>Buying BioShock™ Rema...,BioShock is a shooter unlike any you've ever p...,"BioShock is a shooter unlike any other, loaded...","English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/409...,http://www.BioShockGame.com,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,"['2K Boston', '2K Australia', 'Blind Squirrel'...","['2K', 'Feral Interactive (Mac)']","{'currency': 'GBP', 'initial': 999, 'final': 9...","[451, 127633]","[{'name': 'default', 'title': 'Buy BioShock™ R...",windows;mac,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,"{'total': 65, 'highlighted': [{'name': 'Comple...","{'coming_soon': False, 'date': '15 Sep, 2016'}","{'url': 'support.2k.com', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/409...,"{'ids': [], 'notes': None}"
7735,BioShock™ 2 Remastered,409720,18,False,<h1>Special Offer</h1><p>Buying BioShock 2™ Re...,BioShock 2 provides players with the perfect b...,"In BioShock 2, you step into the boots of the ...","English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/409...,http://www.bioshockgame.com/,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,"['2K Marin', '2K China', 'Digital Extremes', '...",['2K'],"{'currency': 'GBP', 'initial': 1399, 'final': ...","[81419, 127633]","[{'name': 'default', 'title': 'Buy BioShock™ 2...",windows,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,"{'total': 53, 'highlighted': [{'name': ""Daddy'...","{'coming_soon': False, 'date': '15 Sep, 2016'}","{'url': 'support.2k.com', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/409...,"{'ids': [5], 'notes': None}"
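
One caveat when using `.str.contains()`: by default the pattern is treated as a regular expression, and missing names produce missing values in the mask. A minimal sketch on made-up names, passing `regex=False` for a literal match and `na=False` so missing names count as non-matches:

```python
import pandas as pd

# made-up names, including a missing value
names = pd.Series(['BioShock™', 'BioShock™ Remastered', 'Portal 2', None])

# literal substring match; rows with a missing name are treated as False
mask = names.str.contains('BioShock™', regex=False, na=False)

print(names[mask].tolist())
```

This matters most for names containing characters such as `(`, `+` or `.`, which have special meaning in regular expressions.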


Now we need to figure out how to process the column.

If we take a look at the data for the first row, we can see that there are a variety of formats in which the price is stored. There is a currency, GBP, which is perfect as we are performing our analysis in the UK. Next we have a number of different values for the price, so which one do we use?

In [26]:
steam_data['price_overview'][0]

"{'currency': 'GBP', 'initial': 719, 'final': 719, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': '£7.19'}"

If we inspect another row, we see that there is an active discount, applying an 80% price reduction to the title. It looks like `initial` contains the normal price before discount, and `final` contains the discounted price. `initial_formatted` and `final_formatted` contain the price formatted and displayed in the currency. We don't have to worry about these last two, as we'll be storing the price as a float (or integer) and if we wanted, we could format it like this when printing.

With all this in mind, it looks like we'll be checking the value under the `currency` key, and using the value in the `initial` key.

In [27]:
steam_data['price_overview'][37]

"{'currency': 'GBP', 'initial': 2299, 'final': 459, 'discount_percent': 80, 'initial_formatted': '£22.99', 'final_formatted': '£4.59'}"

Now the preliminary investigation is complete we can begin defining our function.

We start by evaluating the strings using `literal_eval` as before, however if there is a null value (caught by the try/except block) we return a properly formatted dictionary with -1 for the `initial` value. This will allow us to fill in a value of 0 for free games, then be left with an easily targetable value for the null rows.

In [28]:
def process_price(df):
    df = df.copy()
        
    def parse_price(x):
        try:
            return literal_eval(x)
        except ValueError:
            return {'currency': 'GBP', 'initial': -1}
    
    # evaluate as dictionary and set to -1 if missing
    df['price_overview'] = df['price_overview'].apply(parse_price)
    
    # Create columns from currency and initial values
    df['currency'] = df['price_overview'].apply(lambda x: x['currency'])
    df['price'] = df['price_overview'].apply(lambda x: x['initial'])
    
    # Set price of free games to 0
    df.loc[df['is_free'], 'price'] = 0
    
    return df

price_data = process_price(steam_data)[['name', 'currency', 'price']]
price_data.head()

Unnamed: 0,name,currency,price
0,Counter-Strike,GBP,719
1,Team Fortress Classic,GBP,399
2,Day of Defeat,GBP,399
3,Deathmatch Classic,GBP,399
4,Half-Life: Opposing Force,GBP,399
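
To see what `parse_price` does with individual values, here is a standalone sketch that calls it on a full-price row, a discounted row, and a missing value. The strings are shortened versions of the rows inspected earlier, and the missing value is represented by `float('nan')`, which is how pandas stores nulls in an object column. Passing a non-string such as NaN to `literal_eval` raises a `ValueError`, which is what triggers the fallback dictionary.

```python
from ast import literal_eval


def parse_price(x):
    """Evaluate a price string, or return a placeholder if it is missing."""
    try:
        return literal_eval(x)
    except ValueError:
        return {'currency': 'GBP', 'initial': -1}


full_price = parse_price("{'currency': 'GBP', 'initial': 719, 'final': 719, 'discount_percent': 0}")
discounted = parse_price("{'currency': 'GBP', 'initial': 2299, 'final': 459, 'discount_percent': 80}")
missing = parse_price(float('nan'))

# 'initial' ignores any active discount, so it reflects the normal price
print(full_price['initial'], discounted['initial'], missing['initial'])  # 719 2299 -1
```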


We're almost finished, but let's check if any games don't have GBP listed as the currency.

In [29]:
price_data[price_data['currency'] != 'GBP']

Unnamed: 0,name,currency,price
991,Robin Hood: The Legend of Sherwood,USD,799
5767,Assassin’s Creed® Chronicles: India,EUR,999
27593,Mortal Kombat 11,USD,5999
27995,Pagan Online,EUR,2699


For some reason there are four games listed in either USD or EUR. We could use the current exchange rate to try and convert them into GBP, however as there are only four rows it's easier and safer to simply drop them.

We can also divide the prices by 100 so they are displayed as floats in pounds.

In [30]:
def process_price(df):
    """Process price_overview column into formatted price column."""
    df = df.copy()
    
    def parse_price(x):
        try:
            return literal_eval(x)
        except ValueError:
            return {'currency': 'GBP', 'initial': -1}
    
    # evaluate as dictionary and set to -1 if missing
    df['price_overview'] = df['price_overview'].apply(parse_price)
    
    # create columns from currency and initial values
    df['currency'] = df['price_overview'].apply(lambda x: x['currency'])
    df['price'] = df['price_overview'].apply(lambda x: x['initial'])
    
    # set price of free games to 0
    df.loc[df['is_free'], 'price'] = 0
    
    # remove non-GBP rows (copy so the assignment below acts on a new dataframe, not a view)
    df = df[df['currency'] == 'GBP'].copy()
    
    # change price to display in pounds (only applying to rows with a value greater than 0)
    df.loc[df['price'] > 0, 'price'] /= 100
    
    # remove columns no longer needed
    df = df.drop(['is_free', 'currency', 'price_overview'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = drop_null_cols(df)
    
    # Process rest of columns
    df = process_name_type(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    
    return df


steam_data = process(raw_steam_data)
steam_data[['name', 'price']].head()

Unnamed: 0,name,price
0,Counter-Strike,7.19
1,Team Fortress Classic,3.99
2,Day of Defeat,3.99
3,Deathmatch Classic,3.99
4,Half-Life: Opposing Force,3.99


## Processing Packages

We can now take a look at the `packages` and `package_groups` columns to help decide what to do with rows that are missing price data. We're not incredibly interested in the columns themselves, as they don't appear to provide much new useful information, except which games come with others as part of a bundle.

In [31]:
# temporarily set a pandas option using with and option_context
with pd.option_context("display.max_colwidth", 500):
    display(steam_data[['steam_appid', 'packages', 'package_groups', 'price']].head())

Unnamed: 0,steam_appid,packages,package_groups,price
0,10,[7],"[{'name': 'default', 'title': 'Buy Counter-Strike', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 7, 'percent_savings_text': '', 'percent_savings': 0, 'option_text': 'Counter-Strike: Condition Zero - £7.19', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 719}]}]",7.19
1,20,[29],"[{'name': 'default', 'title': 'Buy Team Fortress Classic', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 29, 'percent_savings_text': '', 'percent_savings': 0, 'option_text': 'Team Fortress Classic - £3.99', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 399}]}]",3.99
2,30,[30],"[{'name': 'default', 'title': 'Buy Day of Defeat', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 30, 'percent_savings_text': '', 'percent_savings': 0, 'option_text': 'Day of Defeat - £3.99', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 399}]}]",3.99
3,40,[31],"[{'name': 'default', 'title': 'Buy Deathmatch Classic', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 31, 'percent_savings_text': '', 'percent_savings': 0, 'option_text': 'Deathmatch Classic - £3.99', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 399}]}]",3.99
4,50,[32],"[{'name': 'default', 'title': 'Buy Half-Life: Opposing Force', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 32, 'percent_savings_text': '', 'percent_savings': 0, 'option_text': 'Opposing Force - £3.99', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 399}]}]",3.99
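
If we did want to recover a price from `package_groups`, one possible approach is sketched below. The `package_prices` helper is hypothetical (it is not part of the cleaning pipeline), and the example string is a shortened version of the first row shown above. It parses the string and collects the `price_in_cents_with_discount` value from every entry under `subs`.

```python
from ast import literal_eval

# shortened version of the package_groups value for Counter-Strike
example = (
    "[{'name': 'default', 'title': 'Buy Counter-Strike', "
    "'subs': [{'packageid': 7, "
    "'option_text': 'Counter-Strike: Condition Zero - £7.19', "
    "'price_in_cents_with_discount': 719}]}]"
)


def package_prices(package_groups_str):
    """Hypothetical helper: list the price of every purchase option in a row."""
    groups = literal_eval(package_groups_str)
    return [
        sub['price_in_cents_with_discount']
        for group in groups
        for sub in group['subs']
    ]


print(package_prices(example))  # [719]
print(package_prices("[]"))     # []
```

Note that these prices include any active discount, and a package may bundle several games, so the values are not directly comparable to `initial` in `price_overview`.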


Overall we have 846 rows with missing price data, which we previously set to -1.

In [32]:
print(steam_data[steam_data['price'] == -1].shape[0])

846


We can split these rows into two categories: those with `package_groups` data and those without.

If we take a quick look at the `package_groups` column we see that there appear to be no null values. On closer inspection, we can find that rows without data are actually stored as empty lists.

In [33]:
print('Null counts:', steam_data['package_groups'].isnull().sum())
print('Empty list counts:', steam_data[steam_data['package_groups'] == "[]"].shape[0])

Null counts: 0
Empty list counts: 3353
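
The comparison against `"[]"` works because the column holds strings rather than real lists. A minimal sketch on made-up values, showing that the string comparison gives the same answer as parsing each value with `literal_eval` and checking its length:

```python
from ast import literal_eval

import pandas as pd

# made-up values in the same format as the package_groups column
package_groups = pd.Series(["[]", "[{'name': 'default', 'subs': []}]", "[]"])

# cheap check: compare the raw string
is_empty_str = package_groups == "[]"

# equivalent check: parse each string and test whether the list is empty
is_empty_parsed = package_groups.apply(lambda x: len(literal_eval(x)) == 0)

print(is_empty_str.tolist())
```

The string comparison is quicker, but it relies on empty lists always being written exactly as `[]` with no extra whitespace.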


Using a combination of filters, we can find out how many rows have both missing `price` and `package_groups` data and investigate. We'll count the rows, print links to some of the store pages, and look for patterns.

In [34]:
missing_price_and_package = steam_data[(steam_data['price'] == -1) & (steam_data['package_groups'] == "[]")]

print('Number of rows:', missing_price_and_package.shape[0], '\n')

print('First few rows:\n')
print_steam_links(missing_price_and_package[:5])

print('\nLast few rows:\n')
print_steam_links(missing_price_and_package[-10:-5])

missing_price_and_package.head()

Number of rows: 799 

First few rows:

RollerCoaster Tycoon® 3: Platinum: https://store.steampowered.com/app/2700
Beijing 2008™ - The Official Video Game of the Olympic Games: https://store.steampowered.com/app/10520
LUMINES™ Advance Pack: https://store.steampowered.com/app/11920
Midnight Club 2: https://store.steampowered.com/app/12160
Age of Booty™: https://store.steampowered.com/app/21600

Last few rows:

RoboVirus: https://store.steampowered.com/app/1001870
soko loco deluxe: https://store.steampowered.com/app/1003730
POCKET CAR : VRGROUND: https://store.steampowered.com/app/1004710
The Princess, the Stray Cat, and Matters of the Heart: https://store.steampowered.com/app/1010600
Mr Boom's Firework Factory: https://store.steampowered.com/app/1013670


Unnamed: 0,name,steam_appid,required_age,detailed_description,about_the_game,short_description,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,packages,package_groups,platforms,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors,price
75,RollerCoaster Tycoon® 3: Platinum,2700,0,Rollercoaster Tycoon 3 Platinum combines the e...,Rollercoaster Tycoon 3 Platinum combines the e...,Rollercoaster Tycoon 3 Platinum combines the e...,"English, French, Italian, German, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/270...,http://www.atari.com/rollercoastertycoon/us/in...,{'minimum': '<strong>Minimum: </strong><br>\t\...,"{'minimum': '<ul class=""bb_ul""><li><strong>OS:...",[],"['Frontier', 'Aspyr (Mac)']","['Atari', 'Aspyr (Mac)']",,[],windows;mac,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '28', 'description': 'Simulation'}, {'...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '12 Mar, 2008'}","{'url': 'http://www.atari.com/support/atari', ...",https://steamcdn-a.akamaihd.net/steam/apps/270...,"{'ids': [], 'notes': None}",-1.0
311,Beijing 2008™ - The Official Video Game of the...,10520,0,<p> Embrace the competi...,<p> Embrace the competi...,Embrace the competitive spirit of the world's ...,English,https://steamcdn-a.akamaihd.net/steam/apps/105...,http://www.olympicvideogames.com,{'minimum': '<p><strong>Minimum:</strong></p> ...,[],[],['Eurocom'],['SEGA'],,[],windows,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '18', 'description': 'Sports'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '14 Aug, 2008'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/105...,"{'ids': [], 'notes': None}",-1.0
337,LUMINES™ Advance Pack,11920,0,<p>Ready for the next challenge? Prepare yours...,<p>Ready for the next challenge? Prepare yours...,Ready for the next challenge? Prepare yourself...,"English, French, German, Italian, Spanish - Spain",https://steamcdn-a.akamaihd.net/steam/apps/119...,,{'minimum': '<p><strong>Minimum:</strong></p>\...,[],[],['Q Entertainment Inc.'],['Q Entertainment Inc.'],,[],windows,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '18 Apr, 2008'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/119...,"{'ids': [], 'notes': None}",-1.0
344,Midnight Club 2,12160,0,Members of the world's most notorious illegal ...,Members of the world's most notorious illegal ...,The world's most notorious drivers meet each n...,"English<strong>*</strong>, French, Italian, Ge...",https://steamcdn-a.akamaihd.net/steam/apps/121...,http://www.rockstargames.com/midnightclub2,{'minimum': '<strong>Minimum:</strong><br>\t\t...,[],[],['Rockstar San Diego'],['Rockstar Games'],,[],windows,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '9', 'description': 'Racing'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '4 Jan, 2008'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/121...,"{'ids': [], 'notes': None}",-1.0
536,Age of Booty™,21600,0,"Set in the swashbuckling era, Age of Booty™ is...","Set in the swashbuckling era, Age of Booty™ is...","Set in the swashbuckling era, Age of Booty™ is...",English<strong>*</strong><br><strong>*</strong...,https://steamcdn-a.akamaihd.net/steam/apps/216...,http://www.certainaffinity.com/ageofbooty/,{'minimum': '<strong>Minimum:</strong> ...,[],[],['Certain Affinity™'],['Capcom'],,[],windows,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '9 Mar, 2009'}",{'url': 'http://www.capcom.co.jp/support/conta...,https://steamcdn-a.akamaihd.net/steam/apps/216...,"{'ids': [], 'notes': None}",-1.0


Most of the games - 799 of 846 - with missing price data fall into the above category. This probably means they can be safely removed.

Following the links to the store pages for the first few rows, it looks like these games are currently unavailable or have been delisted from the store. Looking at the last few rows, most of them appear not to have been released yet and so haven't had a price set. We'll take care of all the unreleased games when we clean the `release_date` column, but we can remove all of these apps here.

Let's now take a look at the rows that have missing price data but do have `package_groups` data. We may be interested in keeping these rows and extracting price data from the package data.

In [35]:
missing_price_have_package = steam_data.loc[(steam_data['price'] == -1) & (steam_data['package_groups'] != "[]"), ['name', 'steam_appid', 'package_groups', 'price']]

print('Number of rows:', missing_price_have_package.shape[0], '\n')

print('First few rows:\n')
print_steam_links(missing_price_have_package[:5])

print('\nLast few rows:\n')
print_steam_links(missing_price_have_package[-10:-5])

display(missing_price_have_package.head())
missing_price_have_package.iloc[-10:-5]

Number of rows: 47 

First few rows:

The Ship: Single Player: https://store.steampowered.com/app/2420
BioShock™: https://store.steampowered.com/app/7670
Sam & Max 101: Culture Shock: https://store.steampowered.com/app/8200
Sam & Max 102: Situation: Comedy: https://store.steampowered.com/app/8210
Sam & Max 103: The Mole, the Mob and the Meatball: https://store.steampowered.com/app/8220

Last few rows:

Viscera Cleanup Detail: Shadow Warrior: https://store.steampowered.com/app/255520
Space Hulk: Deathwing: https://store.steampowered.com/app/298900
7,62 Hard Life: https://store.steampowered.com/app/306290
Letter Quest: Grimm's Journey: https://store.steampowered.com/app/328730
Rad Rodgers: World One: https://store.steampowered.com/app/353580


Unnamed: 0,name,steam_appid,package_groups,price
63,The Ship: Single Player,2420,"[{'name': 'default', 'title': 'Buy The Ship: S...",-1.0
220,BioShock™,7670,"[{'name': 'default', 'title': 'Buy BioShock™',...",-1.0
234,Sam & Max 101: Culture Shock,8200,"[{'name': 'default', 'title': 'Buy Sam & Max 1...",-1.0
235,Sam & Max 102: Situation: Comedy,8210,"[{'name': 'default', 'title': 'Buy Sam & Max 1...",-1.0
236,"Sam & Max 103: The Mole, the Mob and the Meatball",8220,"[{'name': 'default', 'title': 'Buy Sam & Max 1...",-1.0


Unnamed: 0,name,steam_appid,package_groups,price
2421,Viscera Cleanup Detail: Shadow Warrior,255520,"[{'name': 'default', 'title': 'Buy Viscera Cle...",-1.0
3576,Space Hulk: Deathwing,298900,"[{'name': 'default', 'title': 'Buy Space Hulk:...",-1.0
3811,"7,62 Hard Life",306290,"[{'name': 'default', 'title': 'Buy 7,62 Hard L...",-1.0
4504,Letter Quest: Grimm's Journey,328730,"[{'name': 'default', 'title': ""Buy Letter Ques...",-1.0
5514,Rad Rodgers: World One,353580,"[{'name': 'default', 'title': 'Buy Rad Rodgers...",-1.0


Looking at a selection of these rows, the games appear to be one of the following: superseded by a newer release or remaster, part of a bigger bundle or episodic series, or included with the purchase of another game.

Whilst we could extract prices from the `package_groups` data, the most sensible option seems to be removing these rows. There are only 47 rows this applies to, and any with a newer release will still have the re-release in the data.
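
For reference, extracting a price from the package data might look something like the sketch below. This is not part of our pipeline. The function name `price_from_package_groups` is hypothetical, and it assumes the structure seen in the rows earlier: a string holding a list of dictionaries, each with a `subs` list whose items carry a `price_in_cents_with_discount` key. Taking the lowest price is also an assumption, and note that this is a discounted price, which may not match the `initial` price we used elsewhere.

```python
from ast import literal_eval


def price_from_package_groups(raw):
    """Return the lowest package price in pounds, or None if none is found.

    Hypothetical helper: assumes `raw` is the string form of a list of dicts,
    each with a 'subs' list containing 'price_in_cents_with_discount'.
    """
    try:
        groups = literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        # Not a parseable string (e.g. NaN or malformed text)
        return None

    # Collect every price found across all groups and their sub-packages
    prices = [
        sub['price_in_cents_with_discount']
        for group in groups
        for sub in group.get('subs', [])
        if 'price_in_cents_with_discount' in sub
    ]

    # Convert from pence to pounds, or signal that nothing was found
    return min(prices) / 100 if prices else None


# Cut-down example in the same shape as the data above
example = "[{'name': 'default', 'subs': [{'packageid': 31, 'price_in_cents_with_discount': 399}]}]"

print(price_from_package_groups(example))  # 3.99
print(price_from_package_groups("[]"))     # None
```

Given that only 47 rows are affected, the extra handling doesn't seem worth it here.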

Since our logic interacts heavily with the price data we will update the `process_price` function rather than creating a new one.

In [36]:
def process_price(df):
    """Process price_overview column into formatted price column, and take care of package columns."""
    df = df.copy()
    
    def parse_price(x):
        try:
            return literal_eval(x)
        except ValueError:
            return {'currency': 'GBP', 'initial': -1}
    
    # evaluate as dictionary and set to -1 if missing
    df['price_overview'] = df['price_overview'].apply(parse_price)
    
    # create columns from currency and initial values
    df['currency'] = df['price_overview'].apply(lambda x: x['currency'])
    df['price'] = df['price_overview'].apply(lambda x: x['initial'])
    
    # set price of free games to 0
    df.loc[df['is_free'], 'price'] = 0
    
    # remove non-GBP rows
    df = df[df['currency'] == 'GBP']
    
    # remove rows where price is -1
    df = df[df['price'] != -1]
    
    # change price to display in pounds (can apply to all now -1 rows removed)
    df['price'] /= 100
    
    # remove columns no longer needed
    df = df.drop(['is_free', 'currency', 'price_overview', 'packages', 'package_groups'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = drop_null_cols(df)
    
    # Process rest of columns
    df = process_name_type(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    
    return df

steam_data = process(raw_steam_data)
steam_data.head()

Unnamed: 0,name,steam_appid,required_age,detailed_description,about_the_game,short_description,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,platforms,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors,price
0,Counter-Strike,10,0,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,"English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],windows;mac;linux,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'url': 'http://steamcommunity.com/app/10', 'e...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,"{'ids': [2, 5], 'notes': 'Includes intense vio...",7.19
1,Team Fortress Classic,20,0,One of the most popular online action games of...,One of the most popular online action games of...,One of the most popular online action games of...,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/20/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],windows;mac;linux,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/20/...,"{'ids': [2, 5], 'notes': 'Includes intense vio...",3.99
2,Day of Defeat,30,0,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,"English, French, German, Italian, Spanish - Spain",https://steamcdn-a.akamaihd.net/steam/apps/30/...,http://www.dayofdefeat.com/,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],windows;mac;linux,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/30/...,"{'ids': [], 'notes': None}",3.99
3,Deathmatch Classic,40,0,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/40/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],windows;mac;linux,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/40/...,"{'ids': [], 'notes': None}",3.99
4,Half-Life: Opposing Force,50,0,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,"English, French, German, Korean",https://steamcdn-a.akamaihd.net/steam/apps/50/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Gearbox Software'],['Valve'],windows;mac;linux,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'url': 'https://help.steampowered.com', 'emai...",https://steamcdn-a.akamaihd.net/steam/apps/50/...,"{'ids': [], 'notes': None}",3.99


The next columns in the data are descriptive columns - `detailed_description`, `about_the_game` and `short_description`. We won't be handling them now, instead returning to them in a later post dealing with export columns. These are columns where we will export all or some of the data to a separate csv file as part of the cleaning.

## Processing Languages

Beyond that, the next column is `supported_languages`. As we will be performing the analysis for an English company, we will only be interested in games available in English. Whilst we could remove non-English games at this stage, we will instead create a column marking English games with a boolean value: True or False.

We begin as usual by looking for rows with null values.

In [37]:
steam_data['supported_languages'].isnull().sum()

4

Taking a closer look at these apps, it doesn't look like there's anything wrong with them. It may be that the data simply wasn't supplied. As there are only 4 rows affected we will go ahead and remove these from the data set.

In [38]:
steam_data[steam_data['supported_languages'].isnull()]

Unnamed: 0,name,steam_appid,required_age,detailed_description,about_the_game,short_description,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,platforms,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors,price
4866,Subsiege,338640,0,"<img src=""https://steamcdn-a.akamaihd.net/stea...","<img src=""https://steamcdn-a.akamaihd.net/stea...",Subsiege is an intense real-time tactic game w...,,https://steamcdn-a.akamaihd.net/steam/apps/338...,http://subsiege-game.com/,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],['Icebird Studios'],['Icebird Studios'],windows,,,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256729398, 'name': 'Release Trailer', ...",{'total': 0},"{'coming_soon': False, 'date': '7 Sep, 2018'}","{'url': 'http://subsiege-game.com/', 'email': ...",https://steamcdn-a.akamaihd.net/steam/apps/338...,"{'ids': [], 'notes': None}",17.89
14560,MARS VR(全球使命VR),596560,0,1.\t4K level audio-visual experience <br />\r\...,1.\t4K level audio-visual experience <br />\r\...,Welcome to 《Mars VR》. This is an immersive fir...,,https://steamcdn-a.akamaihd.net/steam/apps/596...,http://qqsm.zygames.com/,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,"['Ying Pei Digital Technology Shanghai Co., Li...","['SHANGHAI ZHENYOU TECHNOLOGY CO.,LTD']",windows,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '73', 'description': 'Violent'}, {'id'...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256681371, 'name': 'marsvr', 'thumbnai...",{'total': 0},"{'coming_soon': False, 'date': '5 Apr, 2017'}","{'url': 'http://www.zygames.com/contact', 'ema...",https://steamcdn-a.akamaihd.net/steam/apps/596...,"{'ids': [], 'notes': None}",1.99
16386,Numberline 2,654970,0,NumberLine 2 is the continuation of the popula...,NumberLine 2 is the continuation of the popula...,NumberLine 2 is the continuation of the popula...,,https://steamcdn-a.akamaihd.net/steam/apps/654...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],"['V34D4R', 'Egor Magurin']",['Indovers Studio'],windows,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256687192, 'name': 'Numberline 2 Trail...","{'total': 60, 'highlighted': [{'name': '1st le...","{'coming_soon': False, 'date': '14 Jul, 2017'}","{'url': '', 'email': 'radaew.zhenya@yandex.ru'}",https://steamcdn-a.akamaihd.net/steam/apps/654...,"{'ids': [], 'notes': None}",1.59
26855,SNUSE 221,948070,0,<strong> Hey. My name is *&amp;#!$.<br>Today I...,<strong> Hey. My name is *&amp;#!$.<br>Today I...,Hey. My name is *&amp;#!$. Today I will tell y...,,https://steamcdn-a.akamaihd.net/steam/apps/948...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],['SNUSE GM'],['SNUSE GM'],windows,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '25', 'description': 'Adventure'}, {'i...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256745662, 'name': 'snuse', 'thumbnail...",{'total': 0},"{'coming_soon': False, 'date': '2 Apr, 2019'}","{'url': 'vk.com/nilow_i', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/948...,"{'ids': [], 'notes': None}",0.79


Next we'll take a look at the structure of the column. By looking at the value for the first row and the values for the most common rows, it looks like languages are stored as a string which can be anything from a comma-separated list of languages to a mix of html and headings. It seems reasonably safe to assume that if the app is in English, the word English will appear somewhere in this string. With this in mind we can simply search the string and return a value based on the result.

In [39]:
print(steam_data['supported_languages'][0])
steam_data['supported_languages'].value_counts().head(10)

English<strong>*</strong>, French<strong>*</strong>, German<strong>*</strong>, Italian<strong>*</strong>, Spanish - Spain<strong>*</strong>, Simplified Chinese<strong>*</strong>, Traditional Chinese<strong>*</strong>, Korean<strong>*</strong><br><strong>*</strong>languages with full audio support


English                                                                                                        8512
English<strong>*</strong><br><strong>*</strong>languages with full audio support                               7409
English, Russian                                                                                                707
English, Simplified Chinese                                                                                     280
English, Japanese                                                                                               235
English<strong>*</strong>, Russian<strong>*</strong><br><strong>*</strong>languages with full audio support     222
English, French, Italian, German, Spanish - Spain                                                               180
English, German                                                                                                 161
Simplified Chinese                                                      

It looks like English-only games make up a little over half the rows in our dataset (~16,000), and English plus other languages make up many more. We could create columns for any of the other languages by string searching, but for simplicity we'll focus on just the English ones.

Using the Series.apply method once again, we can check if the string 'english' appears in each row. We define an anonymous function on the fly using a [lambda](https://docs.python.org/3/tutorial/controlflow.html?highlight=lambda#lambda-expressions) expression. This returns 1 if 'english' is found and 0 otherwise. As mentioned in the platforms section, this can be interpreted as a boolean value. 

The variable `x` will take on the value of each row as the expression is evaluated. We apply the `lower()` string method so capitalisation doesn't matter.

In [40]:
def process_language(df):
    """Process supported_languages column into a boolean 'is english' column."""
    df = df.copy()
    
    # drop rows with missing language data
    df = df.dropna(subset=['supported_languages'])
    
    df['english'] = df['supported_languages'].apply(lambda x: 1 if 'english' in x.lower() else 0)
    df = df.drop('supported_languages', axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = drop_null_cols(df)
    
    # Process rest of columns
    df = process_name_type(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_language(df)
    
    return df


steam_data = process(raw_steam_data)
steam_data[['name', 'english']].head()

Unnamed: 0,name,english
0,Counter-Strike,1
1,Team Fortress Classic,1
2,Day of Defeat,1
3,Deathmatch Classic,1
4,Half-Life: Opposing Force,1


Before moving on, we can take a quick look at the results and see that most of the apps support English.

In [41]:
steam_data['english'].value_counts()

1    27699
0      522
Name: english, dtype: int64
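
As an aside, the same check can be written without a lambda using pandas' vectorised string methods. The sketch below runs on a made-up series rather than our data, and relies on `Series.str.contains` with `case=False` to ignore capitalisation. It is an alternative to the approach above, not a replacement for it.

```python
import pandas as pd

# Toy examples in the same style as the supported_languages column (made up)
toy = pd.Series([
    'English, French',
    'Simplified Chinese',
    'ENGLISH<strong>*</strong>',
])

# True where 'english' appears in any capitalisation, then cast to 1/0
english = toy.str.contains('english', case=False).astype(int)

print(english.tolist())  # [1, 0, 1]
```

One thing to keep in mind is that `str.contains` returns missing values for null rows by default, so rows with missing language data would still need dropping first, as we do in `process_language`.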

## Processing Developers and Publishers

Again we'll skip the next few columns as we'll deal with them another time, and take a look at `developers` and `publishers`. They will most likely contain similar information so we can look at them together. 

We'll start by checking the null counts. While the publishers column doesn't appear to have any null values, searching for lists containing only an empty string (`['']`) reveals over 200 'hidden' null values.

In [42]:
print('Developers null counts:', steam_data['developers'].isnull().sum())
print('Developers empty list counts:', steam_data[steam_data['developers'] == "['']"].shape[0])

print('\npublishers null counts:', steam_data['publishers'].isnull().sum())
print('publishers empty list counts:', steam_data[steam_data['publishers'] == "['']"].shape[0])

Developers null counts: 104
Developers empty list counts: 0

publishers null counts: 0
publishers empty list counts: 213


Ultimately we want a data set with no missing values. That means we have a few options for dealing with these two columns:

- Remove all rows missing either developer or publisher information
- Impute missing information by replacing the missing column with the column we have (i.e. if developers is missing, fill it with the value in publishers)
- Fill missing information with 'Unknown' or 'None'
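
To make these three options concrete, here is a rough sketch on a small made-up dataframe. The column names and values are placeholders, and it assumes the missing values have already been converted to `NaN`, which is not yet the case for the publishers column in our data.

```python
import numpy as np
import pandas as pd

# Toy data: placeholder names, with missing values already stored as NaN
toy = pd.DataFrame({
    'developer': ['Dev A', np.nan, 'Dev C', np.nan],
    'publisher': ['Pub A', 'Pub B', np.nan, np.nan],
})

# Option 1: remove rows missing either developer or publisher
dropped = toy.dropna(subset=['developer', 'publisher'])

# Option 2: fill each column from the other where a value is available
imputed = toy.copy()
imputed['developer'] = toy['developer'].fillna(toy['publisher'])
imputed['publisher'] = toy['publisher'].fillna(toy['developer'])

# Option 3: fill anything missing with a placeholder string
filled = toy.fillna('Unknown')

print(len(dropped), 'row(s) left after dropping')
print(imputed)
print(filled)
```

Notice that the second option can't help with the last row, where both values are missing, so it would need combining with one of the other two.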

We can investigate some of the rows with missing data to help inform our decision.

In [43]:
no_dev = steam_data[steam_data['developers'].isnull()]

print('Total games missing developer:', no_dev.shape[0], '\n')

print_steam_links(no_dev[:5])

no_pub = steam_data[steam_data['publishers'] == "['']"]

print('\nTotal games missing publisher:', no_pub.shape[0], '\n')
print_steam_links(no_pub[:5])

no_dev_or_pub = steam_data[(steam_data['developers'].isnull()) & (steam_data['publishers'] == "['']")]

print('\nTotal games missing developer and publisher:', no_dev_or_pub.shape[0], '\n')
print_steam_links(no_dev_or_pub[:5])

Total games missing developer: 104 

Tycoon City: New York: https://store.steampowered.com/app/9730
Nikopol: Secrets of the Immortals: https://store.steampowered.com/app/11370
Crash Time 2: https://store.steampowered.com/app/11390
Hunting Unlimited 2010: https://store.steampowered.com/app/12690
18 Wheels of Steel: Extreme Trucker: https://store.steampowered.com/app/33730

Total games missing publisher: 213 

RIP - Trilogy™: https://store.steampowered.com/app/2540
Vigil: Blood Bitterness™: https://store.steampowered.com/app/2570
Bullet Candy: https://store.steampowered.com/app/6600
AudioSurf: https://store.steampowered.com/app/12900
Everyday Shooter: https://store.steampowered.com/app/16300

Total games missing developer and publisher: 67 

PlayClaw 5 - Game Recording and Streaming: https://store.steampowered.com/app/237370
Artemis Spaceship Bridge Simulator: https://store.steampowered.com/app/247350
A Walk in the Dark: https://store.steampowered.com/app/248730
Forge Quest: https://stor

It appears we are looking at a mix of titles, many of them smaller ones. Some of the smaller indie titles may have been self-published, while searching for others elsewhere shows that their data is simply wrong or missing. As our priority is creating a clean data set, and there are only a few hundred rows, it will be fine to remove them from the data.

Let's take a look at the structure of the data. Below we inspect some rows near the beginning of the dataframe. It looks like both columns are stored as lists which can have one or multiple values. We'll have to evaluate the rows as before, so they are recognised as lists, then index into them accordingly.

In [44]:
steam_data[['developers', 'publishers']].iloc[24:28]

Unnamed: 0,developers,publishers
24,['Valve'],['Valve']
25,"['Valve', 'Hidden Path Entertainment']",['Valve']
27,['Mark Healey'],['Mark Healey']
28,['Tripwire Interactive'],['Tripwire Interactive']


As we have some single values and some multiple, we have to decide how to handle them. Here are some potential solutions:

 - Create a column for each value in the list (i.e. developer_1, developer_2)
 - Create a column with the first value in the list and a column with the rest of the values (i.e. developer_1, other_developers)
 - Create a column with the first value in the list and disregard the rest
 - Combine all values into one column, simply unpacking the list
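
The four options can be sketched on a couple of rows as follows. This is only an illustration: it assumes the strings have already been evaluated into Python lists, and the new column names and the `', '` delimiter are placeholders rather than final choices.

```python
import pandas as pd

# Two rows of already-evaluated lists, taken from the values shown above
toy = pd.Series([['Valve'], ['Valve', 'Hidden Path Entertainment']], name='developers')

# Option 1: one column per list position (shorter lists are padded with missing values)
wide = pd.DataFrame(toy.tolist()).add_prefix('developer_')

# Option 2: first value in one column, remaining values in another
first = toy.apply(lambda names: names[0])
rest = toy.apply(lambda names: names[1:])

# Option 3: keep only the first value (the same as `first` above)
first_only = first

# Option 4: unpack the list into a single delimited string
joined = toy.apply(', '.join)

print(wide)
print(first.tolist(), rest.tolist())
print(joined.tolist())
```

Even with two rows we can see the drawback of the first two options: any row with a single developer ends up with an empty second column.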
 
Let's begin defining our function, and take a look at how many rows have multiple developers or publishers. After evaluating each row, we can find the length of the lists in each row by using the [Series.str.len()](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.len.html) method. By filtering only rows where the list has more than one element, we can find the number of rows with more than one value in each column.

In [45]:
def process_developers_and_publishers(df):
    # remove rows with missing data
    df = df[(df['developers'].notnull()) & (df['publishers'] != "['']")].copy()
    
    for col in ['developers', 'publishers']:
        df[col] = df[col].apply(lambda x: literal_eval(x))
        
        # filter dataframe to rows with lists longer than 1, and store the number of rows
        num_rows = df[df[col].str.len() > 1].shape[0]
        
        print('Rows in {} column with multiple values:'.format(col), num_rows)

process_developers_and_publishers(steam_data)

Rows in developers column with multiple values: 1720
Rows in publishers column with multiple values: 884


It turns out that the vast majority have only one value for these columns. If we went with the first or second solutions above, we'd be left with columns with mostly missing data. We could go with the third option, but the first value in the list isn't necessarily the most important, and this seems unfair if multiple teams were involved.

The best way forward seems to be the fourth option - if there are multiple values we combine them into the same column as a single delimited string. We can achieve this by calling the [str.join()](https://docs.python.org/3/library/stdtypes.html#str.join) method on a separator string and passing in the list of values. If we pass a list with only one value, we get a string with just that value. If we pass a list with multiple values, we get a comma-separated string as desired. We can see this below.

In [46]:
', '.join(['one item'])

'one item'

In [47]:
', '.join(['multiple', 'different', 'items'])

'multiple, different, items'

We can't join on a comma, however, as a number of developers and publishers have a comma in their name. A couple of these can be seen below.

In [48]:
steam_data.loc[steam_data['developers'].str.contains(",", na=False), ['steam_appid', 'developers', 'publishers']].head(4)

Unnamed: 0,steam_appid,developers,publishers
25,730,"['Valve', 'Hidden Path Entertainment']",['Valve']
66,2520,"['CINEMAX, s.r.o.']","['CINEMAX, s.r.o.']"
73,2630,"['Infinity Ward', 'Aspyr (Mac)']","['Activision', 'Aspyr (Mac)']"
97,3300,"['PopCap Games, Inc.']","['PopCap Games, Inc.']"


Instead we can join on a semi-colon (`;`). We have 3 rows which contain a semi-colon in the name, so we'll remove these. Handling it this way means we'll be able to identify and split individual developer/publisher names in the future.

In [49]:
steam_data.loc[steam_data['developers'].str.contains(";", na=False), ['steam_appid', 'developers', 'publishers']]

Unnamed: 0,steam_appid,developers,publishers
9550,460210,['bool games;'],['bool games;']
13489,568480,"[';)', 'Quickdraw Studios']",['Quickdraw Studios']
16871,665890,['Semicolon;'],['Semicolon;']
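
To see why the choice of separator matters, here is a short sketch of the round trip from a list to a joined string and back. The `raw` value is a constructed example combining two names seen above, not an actual row from the data.

```python
from ast import literal_eval

# Constructed example: a name containing a comma alongside a second name
raw = "['PopCap Games, Inc.', 'Aspyr (Mac)']"

# Evaluate the string so it is recognised as a list
names = literal_eval(raw)

# Joining on a semicolon lets us recover the original names exactly
joined = ';'.join(names)
recovered = joined.split(';')
print(joined)
print(recovered == names)

# Joining on a comma does not: the first name is split in two
comma_joined = ', '.join(names)
print(comma_joined.split(', '))
```

This only works if no name contains a semicolon itself, which is why the three rows above have to go.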


Now we're ready to finish the function we started. We'll abandon the for loop, as there is not too much repetition, and add it into the `process` function as always.

In [71]:
def process_developers_and_publishers(df):
    """Parse columns as semicolon-separated string."""
    # remove rows with missing data
    # df = df[(df['developers'].notnull()) & (df['publishers'] != "['']")].copy()
    df = df[df['developers'].notnull()].copy()
    
    # remove rows with semicolon in either column (~ means not)
    df = df[~(df['developers'].str.contains(';')) & ~(df['publishers'].str.contains(';'))]
    
    # evaluate each row as a list and join it into a semicolon-separated string
    df['developer'] = df['developers'].apply(lambda x: ';'.join(literal_eval(x)))
    df['publisher'] = df['publishers'].apply(lambda x: ';'.join(literal_eval(x)))
    
    # treat empty publisher strings as missing values
    df.loc[df['publisher'] == '', 'publisher'] = np.nan

    df = df.drop(['developers', 'publishers'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = drop_null_cols(df)
    
    # Process rest of columns
    df = process_name_type(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_language(df)
    df = process_developers_and_publishers(df)
    
    return df


steam_data = process(raw_steam_data)
steam_data[['name', 'steam_appid', 'developer', 'publisher']].head()

Unnamed: 0,name,steam_appid,developer,publisher
0,Counter-Strike,10,Valve,Valve
1,Team Fortress Classic,20,Valve,Valve
2,Day of Defeat,30,Valve,Valve
3,Deathmatch Classic,40,Valve,Valve
4,Half-Life: Opposing Force,50,Gearbox Software,Valve


In [73]:
steam_data[steam_data['publisher'].isnull()]

Unnamed: 0,name,steam_appid,required_age,detailed_description,about_the_game,short_description,header_image,website,pc_requirements,mac_requirements,linux_requirements,platforms,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors,price,english,developer,publisher
67,RIP - Trilogy™,2540,0,With the completion of the third title in the ...,With the completion of the third title in the ...,With the completion of the third title in the ...,https://steamcdn-a.akamaihd.net/steam/apps/254...,,{'minimum': '<strong>Minimum: </strong>Windows...,[],[],windows,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2007'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/254...,"{'ids': [], 'notes': None}",3.99,1,Elephant Games,
68,Vigil: Blood Bitterness™,2570,0,<p>Vigil: Blood Bitterness plunges you into th...,<p>Vigil: Blood Bitterness plunges you into th...,Vigil: Blood Bitterness plunges you into the d...,https://steamcdn-a.akamaihd.net/steam/apps/257...,,"{'minimum': 'Windows XP/2000, 1.2 GHz Processo...",[],[],windows,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '23', 'description': 'Indie'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '29 Jun, 2007'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/257...,"{'ids': [], 'notes': None}",0.00,1,Freegamer,
190,Bullet Candy,6600,0,"Bullet Candy is a fast paced, action packed ar...","Bullet Candy is a fast paced, action packed ar...","Bullet Candy is a fast paced, action packed ar...",https://steamcdn-a.akamaihd.net/steam/apps/660...,http://www.charliesgames.com,{'minimum': '<strong>Minimum: </strong><br>\t\...,[],[],windows,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,"{'total': 20, 'highlighted': [{'name': 'Casual...","{'coming_soon': False, 'date': '14 Feb, 2007'}","{'url': '', 'email': 'charlie@charliesgames.com'}",https://steamcdn-a.akamaihd.net/steam/apps/660...,"{'ids': [], 'notes': None}",2.79,1,R C Knight,
385,AudioSurf,12900,0,Ride your music.<br>\t\t\t\t\tAudiosurf is a m...,Ride your music.<br>\t\t\t\t\tAudiosurf is a m...,Ride your music. Audiosurf is a music-adapting...,https://steamcdn-a.akamaihd.net/steam/apps/129...,http://www.audio-surf.com/,{'minimum': '<strong>Minimum:</strong><br>\t\t...,[],[],windows,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '23', 'description': 'Indie'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 2030283, 'name': 'Audiosurf Trailer', ...","{'total': 19, 'highlighted': [{'name': 'Royal ...","{'coming_soon': False, 'date': '15 Feb, 2008'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/129...,"{'ids': [], 'notes': None}",6.99,1,Dylan Fitterer,
451,Everyday Shooter,16300,0,<p>Everyday Shooter is an album of games explo...,<p>Everyday Shooter is an album of games explo...,Everyday Shooter is an album of games explorin...,https://steamcdn-a.akamaihd.net/steam/apps/163...,http://www.everydayshooter.com/,{'minimum': '<ul>\n\t\t\t\t\t<li><strong>OS:</...,[],[],windows,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '23', 'description': 'Indie'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '8 May, 2008'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/163...,"{'ids': [], 'notes': None}",7.19,1,Queasy Games,
478,Dystopia,17580,0,<strong>Dystopia</strong> is a cyberpunk game ...,<strong>Dystopia</strong> is a cyberpunk game ...,Dystopia is a cyberpunk game on the Source eng...,https://steamcdn-a.akamaihd.net/steam/apps/175...,http://dystopia-game.com/,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,[],[],windows;linux,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256708878, 'name': 'Dystopia 1.2 Relea...","{'total': 42, 'highlighted': [{'name': 'Street...","{'coming_soon': False, 'date': '6 Feb, 2009'}",{'url': 'http://punyhumangames.com/contact.php...,https://steamcdn-a.akamaihd.net/steam/apps/175...,"{'ids': [], 'notes': None}",0.00,1,Puny Human,
623,Cogs,26500,0,<p>Cogs is a puzzle game where players build m...,<p>Cogs is a puzzle game where players build m...,Cogs is a puzzle game where players build mach...,https://steamcdn-a.akamaihd.net/steam/apps/265...,http://www.cogsgame.com,{'minimum': '<ul>\n\t <li><...,{'minimum': '<ul>\n\t<li><strong>OS:</strong> ...,[],windows;mac;linux,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '23', 'description': 'Indie'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,"{'total': 18, 'highlighted': [{'name': 'Appren...","{'coming_soon': False, 'date': '14 Apr, 2009'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/265...,"{'ids': [], 'notes': None}",4.79,1,Lazy 8 Studios,
657,Poker Night at the Inventory,31280,0,***Not Compatible with Mac OS 10.8.x and above...,***Not Compatible with Mac OS 10.8.x and above...,Prepare for a different kind of poker night in...,https://steamcdn-a.akamaihd.net/steam/apps/312...,http://www.telltalegames.com/pokernight,{'minimum': '<strong>Minimum:</strong><br>\t\t...,{'minimum': '<strong>Minimum:</strong><br>\t\t...,[],windows;mac,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,"{'total': 20, 'highlighted': [{'name': 'Win A ...","{'coming_soon': False, 'date': '22 Nov, 2010'}",{'url': 'http://www.telltalegames.com/communit...,https://steamcdn-a.akamaihd.net/steam/apps/312...,"{'ids': [], 'notes': None}",3.99,1,Telltale Games,
894,Super Meat Boy,40800,0,Super Meat Boy is a tough as nails platformer ...,Super Meat Boy is a tough as nails platformer ...,"The infamous, tough-as-nails platformer comes ...",https://steamcdn-a.akamaihd.net/steam/apps/408...,http://www.supermeatboy.com,"{'minimum': '<ul class=""bb_ul""><li><strong>OS:...","{'minimum': '<ul class=""bb_ul""><li><strong>OS:...","{'minimum': '<ul class=""bb_ul""><li><strong>OS:...",windows;mac;linux,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '23', 'description': 'Indie'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,"{'total': 48, 'highlighted': [{'name': 'Wood B...","{'coming_soon': False, 'date': '30 Nov, 2010'}","{'url': '', 'email': 'support@supermeatboy.com'}",https://steamcdn-a.akamaihd.net/steam/apps/408...,"{'ids': [], 'notes': None}",10.99,1,Team Meat,
1039,Pirates of Black Cove,49330,0,"Set in the golden age of pirates, this is your...","Set in the golden age of pirates, this is your...","Set in the golden age of pirates, this is your...",https://steamcdn-a.akamaihd.net/steam/apps/493...,http://www.blackcovegame.com/,"{'minimum': '<ul class=""bb_ul""><li><strong>OS:...",[],[],windows,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '2', 'description': 'Strategy'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,"{'total': 44, 'highlighted': [{'name': 'I love...","{'coming_soon': False, 'date': '2 Aug, 2011'}","{'url': '', 'email': 'support@paradoxplaza.com'}",https://steamcdn-a.akamaihd.net/steam/apps/493...,"{'ids': [], 'notes': None}",7.99,1,Nitro Games,


## Processing Achievements and Content Descriptors

The final two columns we will take care of in this section are `achievements` and `content_descriptors`. Let's take a look at the null counts for each column and a small sample of rows.

In [51]:
print('Achievements null counts:', steam_data['achievements'].isnull().sum())
print('Content Descriptors null counts:', steam_data['content_descriptors'].isnull().sum())

steam_data[['name', 'achievements', 'content_descriptors']].iloc[8:13]

Achievements null counts: 1946
Content Descriptors null counts: 0


Unnamed: 0,name,achievements,content_descriptors
8,Half-Life: Blue Shift,{'total': 0},"{'ids': [], 'notes': None}"
9,Half-Life 2,"{'total': 33, 'highlighted': [{'name': 'Defian...","{'ids': [], 'notes': None}"
10,Counter-Strike: Source,"{'total': 147, 'highlighted': [{'name': 'Someo...","{'ids': [2, 5], 'notes': 'Includes intense vio..."
11,Half-Life: Source,{'total': 0},"{'ids': [], 'notes': None}"
12,Day of Defeat: Source,"{'total': 54, 'highlighted': [{'name': 'Double...","{'ids': [], 'notes': None}"


It looks like both columns are stored as dictionaries, with standard formats if no details are provided or exist.

Below we take a closer look at a single row from the achievements column.

In [52]:
literal_eval(steam_data['achievements'][9])

{'total': 33,
 'highlighted': [{'name': 'Defiant',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_hit_cancop_withcan.jpg'},
  {'name': 'Submissive',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_put_canintrash.jpg'},
  {'name': 'Malcontent',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_escape_apartmentraid.jpg'},
  {'name': 'What cat?',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_break_miniteleporter.jpg'},
  {'name': 'Trusty Hardware',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_get_crowbar.jpg'},
  {'name': 'Barnacle Bowling',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_kill_barnacleswithbarrel.jpg'},
  {'name': "Anchor's Aweigh!",
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_get_airboat.jpg'},
  {'nam

There are two keys in the top level of the dictionary: `total` and `highlighted`. The `highlighted` key holds a selection of achievements specific to that game, which is too detailed for our purposes, so we will discard it. The `total` value, however, looks worth extracting.

Now let's take a look at the `content_descriptors` column.

In [53]:
steam_data['content_descriptors'].value_counts().head(10)

{'ids': [], 'notes': None}                                                                                                                                                                    25943
{'ids': [2, 5], 'notes': None}                                                                                                                                                                  428
{'ids': [1, 5], 'notes': None}                                                                                                                                                                  253
{'ids': [5], 'notes': None}                                                                                                                                                                     128
{'ids': [1, 2, 5], 'notes': None}                                                                                                                                                               122
{'ids': [2, 5], 'not

Content descriptors contain age-related warnings about the content of a game. They are identified by a numeric ID number, with optional notes supplied. Almost 26,000 rows have an empty list, indicating either no content descriptors or none provided. Because of this, and because the rows are highly specific to each game, we will drop this column entirely.

Let's now define a function.

In [74]:
def process_achievements_and_descriptors(df):
    """Drop content_descriptors and parse achievements as a total count."""
    df = df.copy()
    
    df = df.drop('content_descriptors', axis=1)
    
    def parse_achievements(x):
        # handle missing values first: no data means no achievements
        if pd.isnull(x):
            return 0
        try:
            return literal_eval(x)['total']
        except (ValueError, SyntaxError, KeyError):
            # safety in case of other problem: show the offending value
            print(x)
        
    df['achievements'] = df['achievements'].apply(parse_achievements)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = drop_null_cols(df)
    
    # Process rest of columns
    df = process_name_type(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_language(df)
    df = process_developers_and_publishers(df)
    df = process_achievements_and_descriptors(df)
    
    return df


steam_data = process(raw_steam_data)
steam_data[['name', 'achievements']].head()

Unnamed: 0,name,achievements
0,Counter-Strike,0
1,Team Fortress Classic,0
2,Day of Defeat,0
3,Deathmatch Classic,0
4,Half-Life: Opposing Force,0


We know that the first few rows have 0 total achievements so that's fine, but let's take a look at the value counts to verify everything went as expected.

In [55]:
with pd.option_context("display.max_rows", 12):
    display(steam_data['achievements'].value_counts().sort_index())

0       12494
1         272
2         102
3         143
4         215
5         373
        ...  
4996        1
4997        1
4999        1
5000       96
5394        1
9821        1
Name: achievements, Length: 411, dtype: int64

It looks like we were successful. We'll leave this column as it is for now, however we may wish to consider grouping the values together in bins, like we did for the age column. This is a decision we can make during the feature engineering stage of our analysis, and we can decide at that point if it will be more useful.
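
If we do go down the binning route, a sketch along these lines using `pd.cut` might be a starting point. The bin edges and labels here are hypothetical, picked only to illustrate the idea, and the series is a handful of toy values rather than the real column.

```python
import pandas as pd

# toy achievement totals, including a couple of extreme values
achievements = pd.Series([0, 1, 5, 33, 147, 5000, 9821])

# hypothetical bin edges and labels, chosen only for illustration
bin_edges = [-1, 0, 10, 50, 100, 1000, float('inf')]
bin_labels = ['0', '1-10', '11-50', '51-100', '101-1000', '1000+']

# each interval is open on the left and closed on the right, e.g. (0, 10]
achievement_bins = pd.cut(achievements, bins=bin_edges, labels=bin_labels)

print(achievement_bins.value_counts().sort_index())
```

Giving zero its own bin keeps games without achievements separate from games with only a few, which seems sensible given how many rows have a total of 0.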

## Export Partially Clean Data

As I said at the beginning, data cleaning is a lengthy process. Already this part is more than long enough, and there's still plenty more to do. In the next part we'll take care of most of the remaining columns, and we'll be exporting a bunch of data too.

Before we wrap up, let's take a look at the current state of the data, then export it ready to continue in part 2.

In [75]:
steam_data.head()

Unnamed: 0,name,steam_appid,required_age,detailed_description,about_the_game,short_description,header_image,website,pc_requirements,mac_requirements,linux_requirements,platforms,categories,genres,screenshots,movies,achievements,release_date,support_info,background,price,english,developer,publisher
0,Counter-Strike,10,0,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,https://steamcdn-a.akamaihd.net/steam/apps/10/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",windows;mac;linux,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'url': 'http://steamcommunity.com/app/10', 'e...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,7.19,1,Valve,Valve
1,Team Fortress Classic,20,0,One of the most popular online action games of...,One of the most popular online action games of...,One of the most popular online action games of...,https://steamcdn-a.akamaihd.net/steam/apps/20/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",windows;mac;linux,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/20/...,3.99,1,Valve,Valve
2,Day of Defeat,30,0,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,https://steamcdn-a.akamaihd.net/steam/apps/30/...,http://www.dayofdefeat.com/,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",windows;mac;linux,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '1 May, 2003'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/30/...,3.99,1,Valve,Valve
3,Deathmatch Classic,40,0,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,https://steamcdn-a.akamaihd.net/steam/apps/40/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",windows;mac;linux,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/40/...,3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,0,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,https://steamcdn-a.akamaihd.net/steam/apps/50/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",windows;mac;linux,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'url': 'https://help.steampowered.com', 'emai...",https://steamcdn-a.akamaihd.net/steam/apps/50/...,3.99,1,Gearbox Software,Valve


In [76]:
steam_data.isnull().sum()

name                       0
steam_appid                0
required_age               0
detailed_description      14
about_the_game            14
short_description         14
header_image               0
website                 9531
pc_requirements            0
mac_requirements           0
linux_requirements         0
platforms                  0
categories               511
genres                    38
screenshots                5
movies                  1782
achievements               0
release_date               0
support_info               0
background                 5
price                      0
english                    0
developer                  0
publisher                146
dtype: int64

In [77]:
steam_data.to_csv('../data/exports/steam_clean_part_1.csv', index=False)

# Steam Data Cleaning (Part 2)

*This is part of a larger series of notebooks on downloading, processing and analysing data from the steam store. [See all notebooks here.](../notebooks)*

See https://github.com/jbwhit/OSCON-2015/blob/master/develop/2015-07-16-jw-example-notebook-setup.ipynb for local imports

In [1]:
# load extensions and magics

# http://raw.github.com/jrjohansson/version_information/master/version_information.py
%load_ext version_information
%reload_ext version_information

%version_information numpy, pandas

Software,Version
Python,3.7.3 64bit [MSC v.1900 64 bit (AMD64)]
IPython,7.5.0
OS,Windows 10 10.0.17763 SP0
numpy,1.16.3
pandas,0.24.2
Mon Jun 03 15:52:33 2019 GMT Summer Time


# Exports

**TODO**: genre and categories section writeup

Welcome back to the second part in this Steam data cleaning series. Last time we (...)

In [2]:
# standard library imports
from ast import literal_eval
import itertools
import time
import re

# third-party imports
import numpy as np
import pandas as pd

# customisations: show up to 100 columns when displaying dataframes
pd.set_option("display.max_columns", 100)

## Import and Inspect Data

Continuing from where we left off, we'll import the data exported at the end of part 1 and inspect it.

In [3]:
imported_steam_data = pd.read_csv('../data/exports/steam_clean_part_1.csv')

print('Rows:', imported_steam_data.shape[0])
print('Columns:', imported_steam_data.shape[1])
imported_steam_data.head()

Rows: 28114
Columns: 24


Unnamed: 0,name,steam_appid,required_age,detailed_description,about_the_game,short_description,header_image,website,pc_requirements,mac_requirements,linux_requirements,platforms,categories,genres,screenshots,movies,achievements,release_date,support_info,background,price,english,developer,publisher
0,Counter-Strike,10,0,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,https://steamcdn-a.akamaihd.net/steam/apps/10/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",windows;mac;linux,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'url': 'http://steamcommunity.com/app/10', 'e...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,7.19,1,Valve,Valve
1,Team Fortress Classic,20,0,One of the most popular online action games of...,One of the most popular online action games of...,One of the most popular online action games of...,https://steamcdn-a.akamaihd.net/steam/apps/20/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",windows;mac;linux,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/20/...,3.99,1,Valve,Valve
2,Day of Defeat,30,0,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,https://steamcdn-a.akamaihd.net/steam/apps/30/...,http://www.dayofdefeat.com/,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",windows;mac;linux,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '1 May, 2003'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/30/...,3.99,1,Valve,Valve
3,Deathmatch Classic,40,0,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,https://steamcdn-a.akamaihd.net/steam/apps/40/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",windows;mac;linux,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/40/...,3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,0,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,https://steamcdn-a.akamaihd.net/steam/apps/50/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",windows;mac;linux,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'url': 'https://help.steampowered.com', 'emai...",https://steamcdn-a.akamaihd.net/steam/apps/50/...,3.99,1,Gearbox Software,Valve


Look at the null count to see how we're doing after the first round of cleaning.

In [4]:
imported_steam_data.isnull().sum()

name                       0
steam_appid                0
required_age               0
detailed_description      14
about_the_game            14
short_description         14
header_image               0
website                 9531
pc_requirements            0
mac_requirements           0
linux_requirements         0
platforms                  0
categories               511
genres                    38
screenshots                5
movies                  1782
achievements               0
release_date               0
support_info               0
background                 5
price                      0
english                    0
developer                  0
publisher                158
dtype: int64

Strangely, just by exporting to and importing from CSV, 12 new null values have appeared in the publisher column (158, up from 146). Let's take a look at some of the affected rows in the original, raw data.

In [5]:
raw_data = pd.read_csv('../data/raw/steam_app_data.csv')

raw_data[['name', 'steam_appid', 'publishers']][(raw_data['publishers'] == "['N/A']") | (raw_data['publishers'] == "['NA']")]

Unnamed: 0,name,steam_appid,publishers
4860,Alum,338420,['N/A']
5431,Scribble Space,351450,['N/A']
5949,Freshman Year,364450,['N/A']
7676,Cibele,408120,['N/A']
8858,Fantasy Tales Online,442710,['NA']
9895,Memoir En Code: Reissue,467940,['N/A']
12663,The Morgue Fissure Between Worlds,547150,['N/A']
14712,Kimmy,600660,['N/A']
14863,Night of Terror,604200,['N/A']
23124,Negative World,832130,['N/A']


Interestingly, by handling the data as we did we exposed some hidden null values. Only by re-importing the data were they recognised as actual null values, rather than the 'N/A' string (or in one case, 'NA' string). This happens because `read_csv` treats a set of strings, including 'N/A' and 'NA', as missing values by default. When it comes to defining our `process` function, we'll drop these rows.

Apart from that, it looks like all the null values are in columns we haven't yet cleaned, which is perfect.
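
The round-trip behaviour described above can be reproduced on a toy dataframe. This is only a sketch: the frame is made up (borrowing a few names from the table above) and is written to an in-memory buffer rather than a file. It also shows `keep_default_na=False`, an option of `read_csv` that would keep the strings as they are, though for this project dropping the rows is the chosen approach.

```python
import io

import pandas as pd

# toy data: two placeholder publishers and one real one
toy = pd.DataFrame({
    'name': ['Alum', 'Fantasy Tales Online', 'Counter-Strike'],
    'publisher': ['N/A', 'NA', 'Valve'],
})

# export to CSV text in memory instead of writing a file
csv_text = toy.to_csv(index=False)

# default behaviour: 'N/A' and 'NA' are parsed as missing values
default_read = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False leaves the strings untouched
literal_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

nulls_before = toy['publisher'].isnull().sum()
nulls_default = default_read['publisher'].isnull().sum()
nulls_literal = literal_read['publisher'].isnull().sum()

print(nulls_before, nulls_default, nulls_literal)
```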

## Processing Description Columns

We have a series of columns with descriptive text about each game: `detailed_description`, `about_the_game` and `short_description`. As the column names imply, these provide information about each game in string format. This is great for human readers, but free text is a lot trickier for machines to work with.

These columns could be used as the basis for an interesting [recommender system](https://en.wikipedia.org/wiki/Recommender_system) or keyword analysis project, however they are not required in our current project. We'll be removing them as they likely take up large amounts of space, and will only serve to slow down our project.
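
To put a rough number on how much space text columns take up, one option is `DataFrame.memory_usage` with `deep=True`, which includes the contents of string columns in the total. The sketch below uses a small made-up frame, so the proportions are illustrative only and will differ on the real data.

```python
import pandas as pd

# toy frame: one long text column alongside two numeric columns
toy = pd.DataFrame({
    'steam_appid': [10, 20, 30],
    'short_description': ['Play the online action game. ' * 20] * 3,
    'price': [7.19, 3.99, 3.99],
})

# deep=True counts the memory used by the strings themselves
memory_bytes = toy.memory_usage(index=False, deep=True)

# share of total memory used by each column
share = memory_bytes / memory_bytes.sum()

print(share.sort_values(ascending=False))
```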

We'll inspect the columns anyway, in case we find anomalies, and also export just the description data to a separate file, in case we want to use it in a future investigation.

In [6]:
imported_steam_data[['detailed_description', 'about_the_game', 'short_description']].isnull().sum()

detailed_description    14
about_the_game          14
short_description       14
dtype: int64

We have 14 rows with missing data for these columns, and chances are the 14 rows with missing `detailed_description` are the rows with missing `about_the_game` and `short_description` data too. 

By inspecting the individual rows below, we can see that this is true - all rows with missing data in one description column have missing data in the others as well.
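
The same claim can also be checked programmatically rather than by eye. A minimal sketch on a made-up frame: if the rows with any description missing are exactly the rows with every description missing, the two boolean masks are equal.

```python
import numpy as np
import pandas as pd

# toy frame: rows 0 and 2 are missing all three descriptions
toy = pd.DataFrame({
    'detailed_description': [np.nan, 'Some text', np.nan],
    'about_the_game': [np.nan, 'Some text', np.nan],
    'short_description': [np.nan, 'Short text', np.nan],
})

description_cols = ['detailed_description', 'about_the_game', 'short_description']
missing = toy[description_cols].isnull()

# rows where at least one description is missing
any_missing = missing.any(axis=1)
# rows where all three descriptions are missing
all_missing = missing.all(axis=1)

# True when the missing values always occur together
same_rows = any_missing.equals(all_missing)

print(same_rows, any_missing.sum())
```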

In [7]:
imported_steam_data[imported_steam_data['detailed_description'].isnull()]

Unnamed: 0,name,steam_appid,required_age,detailed_description,about_the_game,short_description,header_image,website,pc_requirements,mac_requirements,linux_requirements,platforms,categories,genres,screenshots,movies,achievements,release_date,support_info,background,price,english,developer,publisher
92,Bejeweled 2 Deluxe,3300,0,,,,https://steamcdn-a.akamaihd.net/steam/apps/330...,,{'minimum': '<p><strong>Minimum Requirements:<...,{'minimum': '<ul>\n\t<li><strong>OS:</strong> ...,[],windows;mac,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/330...,4.25,1,"PopCap Games, Inc.","PopCap Games, Inc."
93,Chuzzle Deluxe,3310,0,,,,https://steamcdn-a.akamaihd.net/steam/apps/331...,,{'minimum': '<p><strong>Minimum Requirements:<...,{'minimum': '<ul>\n\t<li><strong>OS:</strong> ...,[],windows;mac,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/331...,4.25,1,"PopCap Games, Inc.","PopCap Games, Inc."
94,Insaniquarium Deluxe,3320,0,,,,https://steamcdn-a.akamaihd.net/steam/apps/332...,,{'minimum': '<strong>Minimum Requirements:</st...,[],[],windows,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/332...,4.25,1,"PopCap Games, Inc.","PopCap Games, Inc."
96,AstroPop Deluxe,3340,0,,,,https://steamcdn-a.akamaihd.net/steam/apps/334...,,{'minimum': '<p><strong>Minimum Requirements:<...,[],[],windows,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/334...,4.25,1,"PopCap Games, Inc.","PopCap Games, Inc."
97,Bejeweled Deluxe,3350,0,,,,https://steamcdn-a.akamaihd.net/steam/apps/335...,,{'minimum': '<p><strong>Minimum Requirements:<...,[],[],windows,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/335...,4.25,1,"PopCap Games, Inc.","PopCap Games, Inc."
98,Big Money! Deluxe,3360,0,,,,https://steamcdn-a.akamaihd.net/steam/apps/336...,,{'minimum': '<p><strong>Minimum Requirements:<...,[],[],windows,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/336...,4.25,1,"PopCap Games, Inc.","PopCap Games, Inc."
99,Dynomite Deluxe,3380,0,,,,https://steamcdn-a.akamaihd.net/steam/apps/338...,,{'minimum': '<p><strong>Minimum Requirements:<...,[],[],windows,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/338...,4.25,1,"PopCap Games, Inc.","PopCap Games, Inc."
100,Feeding Frenzy 2 Deluxe,3390,0,,,,https://steamcdn-a.akamaihd.net/steam/apps/339...,,{'minimum': '<p><strong>Minimum Requirements:<...,[],[],windows,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/339...,4.25,1,"PopCap Games, Inc.","PopCap Games, Inc."
101,Hammer Heads Deluxe,3400,0,,,,https://steamcdn-a.akamaihd.net/steam/apps/340...,,{'minimum': '<p><strong>Minimum Requirements:<...,[],[],windows,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/340...,4.25,1,"PopCap Games, Inc.","PopCap Games, Inc."
103,Iggle Pop Deluxe,3420,0,,,,https://steamcdn-a.akamaihd.net/steam/apps/342...,,{'minimum': '<p><strong>Minimum Requirements:<...,[],[],windows,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/342...,4.25,1,"PopCap Games, Inc.","PopCap Games, Inc."


Interestingly, all of these titles are games from 2006 developed and published by PopCap Games. My best guess is that they were developed previously and all added to the Steam store in one go after Valve allowed third-party titles.

We'll remove these rows, as well as any with a description of 20 characters or fewer, like those below.

In [8]:
imported_steam_data[imported_steam_data['detailed_description'].str.len() <= 20]

Unnamed: 0,name,steam_appid,required_age,detailed_description,about_the_game,short_description,header_image,website,pc_requirements,mac_requirements,linux_requirements,platforms,categories,genres,screenshots,movies,achievements,release_date,support_info,background,price,english,developer,publisher
9883,Penguins Cretins,490990,0,...,...,...,https://steamcdn-a.akamaihd.net/steam/apps/490...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,windows,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '22 Jun, 2016'}","{'url': '', 'email': 'support@hfmgames.net'}",https://steamcdn-a.akamaihd.net/steam/apps/490...,1.69,1,HFM Games,HFM Games
19041,拼词游戏 2017,745840,0,带一点恐怖元素的休闲游戏,带一点恐怖元素的休闲游戏,一款有一点恐怖元素的休闲益智游戏。,https://steamcdn-a.akamaihd.net/steam/apps/745...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],windows;mac,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '25', 'description': 'Adventure'}, {'i...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256699963, 'name': 'alpha', 'thumbnail...",11,"{'coming_soon': False, 'date': '29 Nov, 2017'}","{'url': '', 'email': '12668934@qq.com'}",https://steamcdn-a.akamaihd.net/steam/apps/745...,0.79,0,Mianwotu,Mianwotu
20982,God Test,797660,0,God Test,God Test,God Test,https://steamcdn-a.akamaihd.net/steam/apps/797...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],windows,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}, {'id': ...",,,0,"{'coming_soon': False, 'date': '18 Apr, 2018'}","{'url': '', 'email': 'insanegamedev@outlook.com'}",,0.0,1,God Test,God Test
25149,В поисках Атлантиды,925640,0,Интересная игра,Интересная игра,Atlantis,https://steamcdn-a.akamaihd.net/steam/apps/925...,https://vk.com/atlantisforever,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],windows,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '25', 'description': 'Adventure'}, {'i...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256725871, 'name': 'Game', 'thumbnail'...",1,"{'coming_soon': False, 'date': '1 Nov, 2018'}","{'url': 'https://vk.com/atlantisforever', 'ema...",https://steamcdn-a.akamaihd.net/steam/apps/925...,1.69,0,Dmitr Che,Dmitr Che
25281,东方百问~TouHouAsked,930840,0,Null,Null,Null,https://steamcdn-a.akamaihd.net/steam/apps/930...,https://asked.touhou.ren/,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],windows,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256726640, 'name': 'TouHouAsked', 'thu...",2,"{'coming_soon': False, 'date': '7 Oct, 2018'}","{'url': 'https://asked.touhou.ren', 'email': '...",https://steamcdn-a.akamaihd.net/steam/apps/930...,0.79,0,Root Nine Studio,Root Nine Studio


To handle exporting the data to file, we'll write a reusable function which we can call upon for future columns. We will include the `steam_appid` column as it will allow us to match up these rows with rows in our primary data set later on, using a merge (like a join in SQL).

In [9]:
def export_data(df, filename, prefix='steam_', extension='.csv'):
    """Export dataframe to csv file, filename prepended with 'steam_'.
    
    filename : str without file extension
    """
    filepath = '../data/exports/' + prefix + filename + extension
    print_name = filename.replace('_', ' ')
    
    df.to_csv(filepath, index=False)
    
    print("Exported {} to '{}'".format(print_name, filepath))

We can now define a function to process and export the description columns. Notice we also remove the troublesome publisher rows.

In [10]:
def process_descriptions(df, export=False):
    """Export descriptions to external csv file then remove these columns."""
    # remove rows with missing description data
    df = df[df['detailed_description'].notnull()].copy()
    
    # remove rows with unusually small description
    df = df[df['detailed_description'].str.len() > 20]
    
    # by default we don't export, useful for calling function later
    if export:
        # create dataframe of description columns
        description_data = df[['steam_appid', 'detailed_description', 'about_the_game', 'short_description']]
        
        export_data(description_data, filename='description_data')
    
    # drop description columns from main dataframe
    df = df.drop(['detailed_description', 'about_the_game', 'short_description'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # drop rows with missing publisher and copy
    df = df[df['publisher'].notnull()].copy()
    
    # Process export columns
    df = process_descriptions(df, export=True)
    
    return df


steam_data = process(imported_steam_data)

Exported description data to '../data/exports/steam_description_data.csv'


In [11]:
# inspect exported data
pd.read_csv('../data/exports/steam_description_data.csv').head()

Unnamed: 0,steam_appid,detailed_description,about_the_game,short_description
0,10,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...
1,20,One of the most popular online action games of...,One of the most popular online action games of...,One of the most popular online action games of...
2,30,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...
3,40,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...
4,50,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...
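As a rough sketch of how the exported file could be matched back up later, the snippet below merges two small made-up dataframes on `steam_appid`. The names `main` and `descriptions` and all the values are assumptions for illustration only, not part of our real data set.

```python
import pandas as pd

# Toy stand-ins for the cleaned data set and the exported descriptions
# (all values are made up for illustration).
main = pd.DataFrame({
    'steam_appid': [10, 20, 30],
    'name': ['Game A', 'Game B', 'Game C'],
})
descriptions = pd.DataFrame({
    'steam_appid': [10, 30],
    'short_description': ['First description', 'Third description'],
})

# A left merge keeps every row of `main` and attaches a description where
# the steam_appid matches, much like a LEFT JOIN in SQL.
merged = main.merge(descriptions, on='steam_appid', how='left')
print(merged)
```

Rows without a matching description (app 20 here) are kept, with a missing value in the description column.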


## Processing Media Columns

Similar to the description columns, we have three columns that contain links to various images: `header_image`, `screenshots` and `background`. Whilst we won't be needing this data in this project, it could open the door to some interesting image analysis in the future. We will treat these columns in almost the same way, exporting the contents to a csv file then removing them from the dataset.

Again, let's check for missing values.

In [12]:
image_cols = ['header_image', 'screenshots', 'background']

for col in image_cols:
    print(col+':', steam_data[col].isnull().sum())

steam_data[image_cols].head()

header_image: 0
screenshots: 4
background: 4


Unnamed: 0,header_image,screenshots,background
0,https://steamcdn-a.akamaihd.net/steam/apps/10/...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/10/...
1,https://steamcdn-a.akamaihd.net/steam/apps/20/...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/20/...
2,https://steamcdn-a.akamaihd.net/steam/apps/30/...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/30/...
3,https://steamcdn-a.akamaihd.net/steam/apps/40/...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/40/...
4,https://steamcdn-a.akamaihd.net/steam/apps/50/...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/50/...


As with the description columns, it is likely that the 4 rows with no `screenshots` data are the same rows with no `background` data. There are so few that it is probably safe to remove them.

Before we make up our minds, let's inspect the rows in question. In part 1 of cleaning the data, we wrote a `print_steam_links` function to easily create links from a dataframe. To use it again, we could copy the code and define it here. Instead, we're going to use a handy trick in jupyter notebook. If we place the function in a separate python (.py) file inside the `src` folder, we can tell python to look there for local modules using `sys.path.append`. Next, we can import the function directly.

In [13]:
import sys
sys.path.append('../src/')

from datacleaning import print_steam_links
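If you'd like to try this trick in isolation, here is a minimal self-contained sketch. It writes a tiny module to a temporary folder rather than `src`, and the module name `demo_helpers` and its `make_link` function are hypothetical stand-ins, not the real `datacleaning` module.

```python
import os
import sys
import tempfile

# Create a throwaway folder containing a tiny module. The module name
# `demo_helpers` and its function are hypothetical.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, 'demo_helpers.py'), 'w') as f:
    f.write("def make_link(appid):\n"
            "    return 'https://store.steampowered.com/app/' + str(appid)\n")

# Tell python to also search this folder when importing modules
sys.path.append(module_dir)

from demo_helpers import make_link

print(make_link(10))
```

The import only works because the folder was added to `sys.path` first. Without that line, python would not know where to find the module.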

With the `print_steam_links` function now available, we can inspect the rows without screenshots. As we predicted, the rows without screenshots are also the rows without a background. It looks like two are unreleased, and if we'd dealt with the `release_date` column already these would already be removed. One was released recently (5 Jan, 2019), and perhaps didn't have screenshots at the time of downloading, and one simply doesn't have any. As we suspected, it's safe to remove all these rows.

In [14]:
no_screenshots = steam_data[steam_data['screenshots'].isnull()]

print_steam_links(no_screenshots)
no_screenshots

The Light Empire: https://store.steampowered.com/app/416220
Girl and Goblin: https://store.steampowered.com/app/880510
Arida: Backland's Awakening: https://store.steampowered.com/app/907760
Nukalypse: The Final War: https://store.steampowered.com/app/947940


Unnamed: 0,name,steam_appid,required_age,header_image,website,pc_requirements,mac_requirements,linux_requirements,platforms,categories,genres,screenshots,movies,achievements,release_date,support_info,background,price,english,developer,publisher
7525,The Light Empire,416220,0,https://steamcdn-a.akamaihd.net/steam/apps/416...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],windows,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '23', 'description': 'Indie'}, {'id': ...",,,4,"{'coming_soon': False, 'date': '2 Dec, 2015'}","{'url': '', 'email': 'Jemy.TLE@outlook.com'}",,4.79,1,Jemy,Jemy
23832,Girl and Goblin,880510,0,https://steamcdn-a.akamaihd.net/steam/apps/880...,,{'minimum': '<strong>最低配置:</strong><br><ul cla...,[],[],windows,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '25', 'description': 'Adventure'}, {'i...",,"[{'id': 256739772, 'name': '3', 'thumbnail': '...",1552,"{'coming_soon': False, 'date': '5 Jan, 2019'}","{'url': '', 'email': 'smagician13@yahoo.com'}",,0.79,1,Inverse Game,Inverse Game
24641,Arida: Backland's Awakening,907760,0,https://steamcdn-a.akamaihd.net/steam/apps/907...,http://www.aridagame.com,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,windows;mac,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '25', 'description': 'Adventure'}, {'i...",,"[{'id': 256729551, 'name': 'Teaser Beta 2018',...",0,"{'coming_soon': True, 'date': ''}","{'url': 'http://www.aridagame.com', 'email': '...",,0.0,1,Aoca Game Lab,Aoca Game Lab
25769,Nukalypse: The Final War,947940,0,https://steamcdn-a.akamaihd.net/steam/apps/947...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,windows,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...",,"[{'id': 256745274, 'name': 'Nukalypse: The Fin...",0,"{'coming_soon': True, 'date': 'Oct 2019'}","{'url': '', 'email': 'nukalypse@gmail.com'}",,0.0,1,Zion Games Studio,Zion Games Studio


There is also a `movies` column with similar data. Whilst having more missing values, presumably for games without videos, it appears to contain names, thumbnails and links to various videos and trailers. It's unlikely we'll need them but we can include them in the export and remove them from our data set.

In [15]:
steam_data['movies'].isnull().sum()

1746

In [16]:
with pd.option_context("display.max_colwidth", 1000):
    print(steam_data[steam_data['movies'].notnull()]['movies'].head(2))

9                                                                                                                                                                                                                                                                                                                                                         [{'id': 904, 'name': 'Half-Life 2 Trailer', 'thumbnail': 'https://steamcdn-a.akamaihd.net/steam/apps/904/movie.jpg?t=1507237301', 'webm': {'480': 'http://steamcdn-a.akamaihd.net/steam/apps/904/movie480.webm?t=1507237301', 'max': 'http://steamcdn-a.akamaihd.net/steam/apps/904/movie_max.webm?t=1507237301'}, 'highlight': True}, {'id': 5724, 'name': 'Free Yourself', 'thumbnail': 'https://steamcdn-a.akamaihd.net/steam/apps/5724/movie.293x165.jpg?t=1507237311', 'webm': {'480': 'http://steamcdn-a.akamaihd.net/steam/apps/5724/movie480.webm?t=1507237311', 'max': 'http://steamcdn-a.akamaihd.net/steam/apps/5724/movie_max.webm?t=1507237311'}, 'highlight': Fa

We can now put this all together and define a `process_media` function, adding it in to `process` as before.

In [17]:
def process_media(df, export=False):
    """Remove media columns from dataframe, optionally exporting them to csv first."""
    df = df[df['screenshots'].notnull()].copy()
    
    if export:
        media_data = df[['steam_appid', 'header_image', 'screenshots', 'background', 'movies']]
        
        export_data(media_data, 'media_data')
        
    df = df.drop(['header_image', 'screenshots', 'background', 'movies'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # drop rows with missing publisher and copy
    df = df[df['publisher'].notnull()].copy()
    
    # Process export columns
    df = process_descriptions(df)
    df = process_media(df, export=True)
    
    return df


steam_data = process(imported_steam_data)

Exported media data to '../data/exports/steam_media_data.csv'


In [18]:
# inspect exported data
pd.read_csv('../data/exports/steam_media_data.csv').head()

Unnamed: 0,steam_appid,header_image,screenshots,background,movies
0,10,https://steamcdn-a.akamaihd.net/steam/apps/10/...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,
1,20,https://steamcdn-a.akamaihd.net/steam/apps/20/...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/20/...,
2,30,https://steamcdn-a.akamaihd.net/steam/apps/30/...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/30/...,
3,40,https://steamcdn-a.akamaihd.net/steam/apps/40/...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/40/...,
4,50,https://steamcdn-a.akamaihd.net/steam/apps/50/...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/50/...,


Before we move on, we can inspect the memory savings of removing these columns by comparing the output of the [DataFrame.info](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) method. If we pass `memory_usage="deep"` we get the true memory usage of each DataFrame. Without this, pandas estimates the amount used. This is because of the way python stores object (string) columns under the hood. Essentially python keeps track of a list of pointers which point to the actual strings in memory. It's a bit like if you hid a bunch of items around the house, and kept a list of where everything was. You couldn't tell the total size of everything just by looking at the list, but you could take a rough guess. Only by following the list and inspecting each individual item could you get an exact figure.

The blog post '[Why Python Is Slow](https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/)' goes into more detail, but all we need to be aware of is that by passing the parameter we ensure we get the true value of memory usage. We also pass `verbose=False` to truncate unnecessary output.

We can see that already we have shrunk the memory usage from almost 300 MB to just over 60 MB. This is great because in general, the smaller the memory footprint the faster our code will run in future. And of course, we're not finished yet.

In [19]:
print('Imported Data:\n')
imported_steam_data.info(verbose=False, memory_usage="deep")

print('\nData with descriptions and media removed:\n')
steam_data.info(verbose=False, memory_usage="deep")

Imported Data:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28114 entries, 0 to 28113
Columns: 24 entries, name to publisher
dtypes: float64(1), int64(4), object(19)
memory usage: 297.7 MB

Data with descriptions and media removed:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 27933 entries, 0 to 28113
Columns: 17 entries, name to publisher
dtypes: float64(1), int64(4), object(12)
memory usage: 60.3 MB
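To illustrate the difference between the default and deep figures discussed above, here is a small sketch using a made-up dataframe. It uses `DataFrame.memory_usage` rather than `info` so the numbers can be compared directly, and it forces the string column to `object` dtype, which is an assumption made to match how our data is stored.

```python
import pandas as pd

# A small frame with one numeric and one string (object) column.
df = pd.DataFrame({
    'appid': range(1000),
    'description': pd.Series(['a fairly long description string'] * 1000, dtype=object),
})

shallow = df.memory_usage(deep=False).sum()  # counts only the pointers for object columns
deep = df.memory_usage(deep=True).sum()      # also measures each string being pointed to

print('shallow:', shallow, 'bytes')
print('deep:   ', deep, 'bytes')
```

The deep figure should come out noticeably larger, because it follows each pointer and adds up the size of the strings themselves.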


## Website and Support Info

Next we will look at the `website` and `support_info` columns. Seen below, they both contain links to external websites. The website column is simply stored as a string whereas the support info column is stored as a dictionary of `url` and `email`.

There are a large number of rows with no website listed, and while there are no null values in the `support_info` column, it looks like many will have empty `url` and `email` values inside the data.

For our dataset we'll be dropping both these columns, as they are far too specific to be useful in our analysis. As you may have guessed, we will extract and export this data as we have done before. Even if it isn't useful now, it could be interesting at a later date.

In [20]:
print('website null counts:', imported_steam_data['website'].isnull().sum())
print('support_info null counts:', imported_steam_data['support_info'].isnull().sum())

with pd.option_context("display.max_colwidth", 100): # ensures strings not cut short
    display(imported_steam_data[['name', 'website', 'support_info']][75:80])

website null counts: 9531
support_info null counts: 0


Unnamed: 0,name,website,support_info
75,X3: Reunion,http://www.egosoft.com/games/x3/info_en.php,"{'url': '', 'email': ''}"
76,X3: Terran Conflict,http://www.egosoft.com/games/x3tc/info_en.php,"{'url': '', 'email': 'info@egosoft.com'}"
77,X: Beyond the Frontier,http://www.egosoft.com/games/x/info_en.php,"{'url': '', 'email': ''}"
78,X: Tension,http://www.egosoft.com/games/x_tension/info_en.php,"{'url': '', 'email': ''}"
79,X Rebirth,http://www.egosoft.com/games/x_rebirth/info_en.php,"{'url': 'http://www.egosoft.com/support/index_en.php', 'email': 'info@egosoft.com'}"


We're going to split the support info into two separate columns. We'll keep all the code that parses the columns inside the export `if` statement, so it only runs if we wish to export to csv. We don't need to worry that the rows with missing website data contain `NaN` whereas the other two columns contain a blank string (`''`) for missing data, as once we have exported to csv they will be represented the same way.

In [21]:
def process_info(df, export=False):
    """Drop support information from dataframe, optionally exporting beforehand."""
    if export:
        support_info = df[['steam_appid', 'website', 'support_info']].copy()
        
        support_info['support_info'] = support_info['support_info'].apply(lambda x: literal_eval(x))
        support_info['support_url'] = support_info['support_info'].apply(lambda x: x['url'])
        support_info['support_email'] = support_info['support_info'].apply(lambda x: x['email'])
        
        support_info = support_info.drop('support_info', axis=1)
        
        support_info = support_info[(support_info['website'].notnull()) | (support_info['support_url'] != '') | (support_info['support_email'] != '')]

        export_data(support_info, 'support_info')
    
    df = df.drop(['website', 'support_info'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # drop rows with missing publisher and copy
    df = df[df['publisher'].notnull()].copy()
    
    # Process export columns
    df = process_descriptions(df)
    df = process_media(df)
    df = process_info(df, export=True)
    
    return df


steam_data = process(imported_steam_data)

Exported support info to '../data/exports/steam_support_info.csv'


In [22]:
# inspect exported file
pd.read_csv('../data/exports/steam_support_info.csv').head()

Unnamed: 0,steam_appid,website,support_url,support_email
0,10,,http://steamcommunity.com/app/10,
1,30,http://www.dayofdefeat.com/,,
2,50,,https://help.steampowered.com,
3,70,http://www.half-life.com/,http://steamcommunity.com/app/70,
4,80,,http://steamcommunity.com/app/80,
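We can check the claim that `NaN` and blank strings end up looking the same after a round trip through csv with a quick sketch. The dataframe below is made up, and it writes to an in-memory buffer instead of a real file.

```python
import io

import numpy as np
import pandas as pd

# Toy rows: one missing value stored as NaN, one stored as an empty string.
toy = pd.DataFrame({
    'steam_appid': [10, 20],
    'website': [np.nan, 'http://example.com'],
    'support_url': ['', 'http://example.com/support'],
})

# Write to an in-memory buffer rather than a file, then read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

print(round_trip.isnull())
```

Both the `NaN` and the blank string in the first row are written as an empty field, so both are read back as missing values.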


## System Requirements

At first it looks like we have data for every row.

In [23]:
req_cols = ['pc_requirements', 'mac_requirements', 'linux_requirements']

print('null counts:\n')

for col in req_cols:
    print(col+':', steam_data[col].isnull().sum())

null counts:

pc_requirements: 0
mac_requirements: 0
linux_requirements: 0


However, if we look at the data a little more closely, we see that some rows actually contain an empty list. These won't be counted as null, but once evaluated they won't provide any information, so we can treat them as missing values.

In [24]:
steam_data[['steam_appid', 'pc_requirements', 'mac_requirements', 'linux_requirements']].tail()

Unnamed: 0,steam_appid,pc_requirements,mac_requirements,linux_requirements
28109,1065230,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[]
28110,1065570,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[]
28111,1065650,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[]
28112,1066700,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,[]
28113,1069460,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,[]


We can check how many rows in each requirements column have empty lists using a simple boolean filter. By checking the first value in the `shape` attribute of the filtered dataframe, we can get a count of how many empty lists there are.

In [25]:
print('Empty list counts:\n')

for col in req_cols:
    print(col+':', steam_data[steam_data[col] == '[]'].shape[0])

Empty list counts:

pc_requirements: 13
mac_requirements: 16487
linux_requirements: 19422


That's over half of the rows for both mac and linux requirements. That probably means that there is not enough data in these two columns to be useful for our analysis.

It turns out most games are developed solely for windows, with mac and linux ports only becoming more common in recent years. Naturally it would make sense that any games that aren't supported on mac or linux would not have corresponding requirements.

As we have already cleaned our platforms column, we can check how many rows actually have missing data by comparing rows with empty lists in the requirements with the platforms listed in the `platforms` column. If a row has an empty list in the requirements column but the corresponding platform (mac/linux) appears in the `platforms` column, it means the data is missing.

In [26]:
for col in ['mac_requirements', 'linux_requirements']:
    platform = col.split('_')[0]
    print(platform+':', steam_data[(steam_data[col] == '[]') & (steam_data['platforms'].str.contains(platform))].shape[0])

mac: 134
linux: 155
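The logic of this check can be sketched on a tiny made-up dataframe, separating the empty lists we would expect from those that indicate genuinely missing data. All values here are assumptions for illustration.

```python
import pandas as pd

# Toy data in the same shape as the real columns (values are made up).
toy = pd.DataFrame({
    'platforms': ['windows', 'windows;mac', 'windows;mac;linux', 'windows;mac'],
    'mac_requirements': ['[]', '[]', "{'minimum': 'OS X 10.9'}", "{'minimum': 'OS X 10.12'}"],
})

no_requirements = toy['mac_requirements'] == '[]'
supports_mac = toy['platforms'].str.contains('mac')

# Empty list and no mac support: expected, nothing is missing.
expected_empty = (no_requirements & ~supports_mac).sum()
# Empty list but mac support listed: requirements are genuinely missing.
missing = (no_requirements & supports_mac).sum()

print('expected empty:', expected_empty)
print('missing:', missing)
```

Summing a boolean mask counts the `True` values, which gives the same count as filtering the dataframe and reading `shape[0]`.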


Whilst not an insignificant number, this means that the vast majority of rows are as they should be, and we're not looking at too many data errors.

Let's also have a look for missing values in the pc/windows column. We couldn't include it in our previous loop as the columns have different names, something we may wish to change later.

In [27]:
print('windows:', steam_data[(steam_data['pc_requirements'] == '[]') & (steam_data['platforms'].str.contains('windows'))].shape[0])

windows: 9


9 rows have missing system requirements. We can take a look at some of them below, and follow the links to the steam pages to try and discover if anything is amiss.

In [28]:
missing_windows_requirements = steam_data[(steam_data['pc_requirements'] == '[]') & (steam_data['platforms'].str.contains('windows'))]

print_steam_links(missing_windows_requirements[:5])
missing_windows_requirements.head()

Uplink: https://store.steampowered.com/app/1510
Battlestations: Midway: https://store.steampowered.com/app/6870
Grand Theft Auto 2: https://store.steampowered.com/app/12180
Shift 2 Unleashed: https://store.steampowered.com/app/47920
iBomber Defense: https://store.steampowered.com/app/104000


Unnamed: 0,name,steam_appid,required_age,pc_requirements,mac_requirements,linux_requirements,platforms,categories,genres,achievements,release_date,price,english,developer,publisher
31,Uplink,1510,0,[],[],[],windows;mac;linux,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '23', 'description': 'Indie'}, {'id': ...",0,"{'coming_soon': False, 'date': '23 Aug, 2006'}",6.99,1,Introversion Software,Introversion Software
191,Battlestations: Midway,6870,0,[],[],[],windows,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}]",0,"{'coming_soon': False, 'date': '15 Mar, 2007'}",4.99,1,Eidos Interactive,Square Enix
314,Grand Theft Auto 2,12180,0,[],[],[],windows,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}]",0,"{'coming_soon': False, 'date': '4 Jan, 2008'}",0.0,1,Rockstar North,Rockstar Games
931,Shift 2 Unleashed,47920,0,[],[],[],windows,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '9', 'description': 'Racing'}]",0,"{'coming_soon': False, 'date': '31 Mar, 2011'}",19.99,1,Slightly Mad Studios,Electronic Arts
1165,iBomber Defense,104000,0,[],[],[],windows;mac,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '23', 'description': 'Indie'}, {'id': ...",22,"{'coming_soon': False, 'date': '26 May, 2011'}",2.99,1,Cobra Mobile,Cobra Mobile


There doesn't appear to be any common issue in these rows - some of the games are quite old but that's about it. It may simply be that no requirements were supplied when the games were added to the steam store.

Let's say that the fictional company we're doing analysis for is interested in developing for windows only. We can also assume that a cross-platform game will have similar hardware requirements on each platform it supports. With this in mind we can safely drop both the mac and linux requirements columns, as we already know which games support these operating systems from our cleaned `platforms` column. That means we can focus on the `pc_requirements` column, which has information for almost every game in our data.

Now we will take a look at a couple of rows from the dataset to see how the data is stored.

In [29]:
display(steam_data['pc_requirements'].iloc[0])
display(steam_data['pc_requirements'].iloc[2000])
display(steam_data['pc_requirements'].iloc[15000])

"{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</strong> 500 mhz processor, 96mb ram, 16mb video card, Windows XP, Mouse, Keyboard, Internet Connection<br /></p>\\r\\n\\t\\t\\t<p><strong>Recommended:</strong> 800 mhz processor, 128mb ram, 32mb+ video card, Windows XP, Mouse, Keyboard, Internet Connection<br /></p>\\r\\n\\t\\t\\t'}"

'{\'minimum\': \'<strong>Minimum:</strong><br><ul class="bb_ul"><li><strong>OS:</strong> Windows XP or higher<br></li><li><strong>Processor:</strong> 1 GHz<br></li><li><strong>Memory:</strong> 512 MB RAM<br></li><li><strong>Graphics:</strong> OpenGL compatible graphics chip<br></li><li><strong>Storage:</strong> 2 GB available space</li></ul>\'}'

'{\'minimum\': \'<strong>Minimum:</strong><br><ul class="bb_ul"><li><strong>OS:</strong> Windows 10<br></li><li><strong>Processor:</strong> Intel Core i7<br></li><li><strong>Memory:</strong> 8 GB RAM<br></li><li><strong>Graphics:</strong> GTX 1070 or equivalent<br></li><li><strong>DirectX:</strong> Version 12<br></li><li><strong>Storage:</strong> 20 GB available space</li></ul>\'}'

In short: it's a mess. It looks like the data is stored as a dictionary, as we've seen before. There is definitely a key for 'minimum', but apart from that it is hard to see at a glance. The strings are full of html formatting, which is presumably parsed to display the information on the website. It also looks like there are different categories like Processor and Memory for some, but not all, rows.

Let's take a stab at cleaning out some of the unnecessary formatting and see if it becomes clearer.

By creating a dataframe from a selection of rows, we can easily and quickly make changes using the pandas .str accessor, allowing us to use python string formatting and regular expressions.

In [30]:
view_requirements = steam_data['pc_requirements'].iloc[[0, 2000, 15000]].copy()

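view_requirements = (view_requirements
                         .str.replace(r'\\[rtn]', '', regex=True)        # literal \r, \t and \n sequences
                         .str.replace(r'<[pbr]{1,2}>', ' ', regex=True)  # <p> and <br> tags become spaces
                         .str.replace(r'<[\/"=\w\s]+>', '', regex=True)  # strip remaining html tags
                    )

for i, row in view_requirements.items():
    display(row)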

"{'minimum': ' Minimum: 500 mhz processor, 96mb ram, 16mb video card, Windows XP, Mouse, Keyboard, Internet Connection Recommended: 800 mhz processor, 128mb ram, 32mb+ video card, Windows XP, Mouse, Keyboard, Internet Connection'}"

"{'minimum': 'Minimum: OS: Windows XP or higher Processor: 1 GHz Memory: 512 MB RAM Graphics: OpenGL compatible graphics chip Storage: 2 GB available space'}"

"{'minimum': 'Minimum: OS: Windows 10 Processor: Intel Core i7 Memory: 8 GB RAM Graphics: GTX 1070 or equivalent DirectX: Version 12 Storage: 20 GB available space'}"

We can now see more clearly the contents and structure of these rows. Some rows have both Minimum and Recommended requirements inside a 'minimum' key, some have separate 'minimum' and 'recommended' keys. Some have headings like 'Processor:' and 'Storage:' before various components, others simply have a list of components. Some state particular speeds for components, like a 2 GHz CPU, others state specific models, like 'Intel Core 2 Duo', amongst this information.
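To see what each of the three regular expressions used above contributes, here is a small sketch applying them one at a time with python's built-in `re` module. The input string is a shortened, made-up example in the style of the raw data, not an actual row.

```python
import re

# A shortened, made-up string in the same style as the raw data. It is a
# raw string, so sequences like \t are a literal backslash plus a letter.
raw = r'\r\n\t<p><strong>Minimum:</strong><br><ul class="bb_ul"><li><strong>OS:</strong> Windows XP<br></li></ul></p>'

step1 = re.sub(r'\\[rtn]', '', raw)          # drop the literal \r, \t and \n sequences
step2 = re.sub(r'<[pbr]{1,2}>', ' ', step1)  # swap <p> and <br> tags for a space
step3 = re.sub(r'<[\/"=\w\s]+>', '', step2)  # strip all remaining tags

print(repr(step1))
print(repr(step2))
print(repr(step3))
```

Replacing `<p>` and `<br>` with a space first, before stripping everything else, is what keeps the individual components from running into each other.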

It seems like it would be possible to extract individual component information from this data, however it would be a lengthy and complex process requiring the handling of many exceptions and individual cases. Whilst we may wish to tackle this in the future, as it could provide an interesting window into how the demands of gaming have changed over the years, it won't necessarily provide us with useful information for our current objectives.
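To give a flavour of what such an extraction might involve, here is a rough sketch that splits a cleaned string on a handful of headings. The `headings` list and the `split_components` function are hypothetical and only cover the simplest case. This is not something we'll use in the project.

```python
import re

# Headings assumed for illustration; real rows use many more variations.
headings = ['OS', 'Processor', 'Memory', 'Graphics', 'DirectX', 'Storage']


def split_components(text):
    """Split 'Heading: value' pairs out of a cleaned requirements string."""
    pattern = r'\b(' + '|'.join(headings) + r'):'
    # with a capture group, re.split returns [before, heading, value, heading, value, ...]
    parts = re.split(pattern, text)
    return {heading: value.strip() for heading, value in zip(parts[1::2], parts[2::2])}


cleaned = ('OS: Windows 10 Processor: Intel Core i7 Memory: 8 GB RAM '
           'Graphics: GTX 1070 or equivalent DirectX: Version 12 Storage: 20 GB available space')

components = split_components(cleaned)
print(components)

# a row with no headings gives us nothing back
print(split_components('500 mhz processor, 96mb ram, 16mb video card'))
```

Even in this simple form, rows without headings come back empty, which hints at how many special cases a full solution would need to handle.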

With that in mind, it seems best to proceed by cleaning the data slightly so it is readable, exporting to an external csv for future use, then removing the columns from our dataframe.

In [31]:
def process_requirements(df, export=False):
    if export:
        requirements = df[['steam_appid', 'pc_requirements']].copy()
        
        requirements = requirements[requirements['pc_requirements'] != '[]']
        
        requirements['requirements_clean'] = (requirements['pc_requirements']
                                                  .str.replace(r'\\[rtn]', '', regex=True)
                                                  .str.replace(r'<[pbr]{1,2}>', ' ', regex=True)
                                                  .str.replace(r'<[\/"=\w\s]+>', '', regex=True)
                                             )
        
        requirements['requirements_clean'] = requirements['requirements_clean'].apply(lambda x: literal_eval(x))
        
        requirements['minimum'] = requirements['requirements_clean'].apply(lambda x: x['minimum'].replace('Minimum:', '').strip() if 'minimum' in x.keys() else np.nan)
        requirements['recommended'] = requirements['requirements_clean'].apply(lambda x: x['recommended'].replace('Recommended:', '').strip() if 'recommended' in x.keys() else np.nan)
        
        requirements = requirements.drop('requirements_clean', axis=1)
        
        export_data(requirements, 'requirements_data')
        
    df = df.drop(['pc_requirements', 'mac_requirements', 'linux_requirements'], axis=1)
    
    return df

def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # drop rows with missing publisher and copy
    df = df[df['publisher'].notnull()].copy()
    
    # Process export columns
    df = process_descriptions(df)
    df = process_media(df)
    df = process_info(df)
    df = process_requirements(df, export=True)
    
    return df


steam_data = process(imported_steam_data)

Exported requirements data to '../data/exports/steam_requirements_data.csv'


In [32]:
# verify export
pd.read_csv('../data/exports/steam_requirements_data.csv').head()

Unnamed: 0,steam_appid,pc_requirements,minimum,recommended
0,10,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,"500 mhz processor, 96mb ram, 16mb video card, ...",
1,20,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,"500 mhz processor, 96mb ram, 16mb video card, ...",
2,30,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,"500 mhz processor, 96mb ram, 16mb video card, ...",
3,40,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,"500 mhz processor, 96mb ram, 16mb video card, ...",
4,50,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,"500 mhz processor, 96mb ram, 16mb video card, ...",


### Processing Categories and Genres

Drop rows with missing categories/genres?

In [33]:
print(steam_data['categories'].isnull().sum())

509


In [34]:
print(steam_data['categories'][0])

with pd.option_context("display.max_colwidth", 1000):
    display(steam_data['categories'].head())

[{'id': 1, 'description': 'Multi-player'}, {'id': 36, 'description': 'Online Multi-Player'}, {'id': 37, 'description': 'Local Multi-Player'}, {'id': 8, 'description': 'Valve Anti-Cheat enabled'}]


0    [{'id': 1, 'description': 'Multi-player'}, {'id': 36, 'description': 'Online Multi-Player'}, {'id': 37, 'description': 'Local Multi-Player'}, {'id': 8, 'description': 'Valve Anti-Cheat enabled'}]
1    [{'id': 1, 'description': 'Multi-player'}, {'id': 36, 'description': 'Online Multi-Player'}, {'id': 37, 'description': 'Local Multi-Player'}, {'id': 8, 'description': 'Valve Anti-Cheat enabled'}]
2                                                                                                       [{'id': 1, 'description': 'Multi-player'}, {'id': 8, 'description': 'Valve Anti-Cheat enabled'}]
3    [{'id': 1, 'description': 'Multi-player'}, {'id': 36, 'description': 'Online Multi-Player'}, {'id': 37, 'description': 'Local Multi-Player'}, {'id': 8, 'description': 'Valve Anti-Cheat enabled'}]
4                                                            [{'id': 2, 'description': 'Single-player'}, {'id': 1, 'description': 'Multi-player'}, {'id': 8, 'description': 'Valve Anti-Cheat enable

In [35]:
print_steam_links(steam_data[steam_data['categories'].isnull()].tail(20))

MOTiON by RADiCAL: https://store.steampowered.com/app/999900
The Marvellous Machine: https://store.steampowered.com/app/1000510
iDancer: https://store.steampowered.com/app/1004740
SubnetPing: https://store.steampowered.com/app/1008160
YouTube Center: https://store.steampowered.com/app/1009330
Discord Bot - Controls: https://store.steampowered.com/app/1010170
Wallpaper Maker （造物主视频桌面）: https://store.steampowered.com/app/1010800
Nero GameVR: https://store.steampowered.com/app/1011110
Greenland Melting: https://store.steampowered.com/app/1012510
VEGAS Movie Studio 16 Steam Edition: https://store.steampowered.com/app/1016810
VEGAS Movie Studio 16 Platinum Steam Edition: https://store.steampowered.com/app/1016840
Planet Evolution PC Live Wallpaper: https://store.steampowered.com/app/1017060
Screenbits - Screen Recorder: https://store.steampowered.com/app/1018680
Wondershare Video Converter Ultimate: https://store.steampowered.com/app/1025020
ACID Music Studio 11 Steam Edition: https://store

In [36]:
print(steam_data['genres'].isnull().sum())

37


In [37]:
print(steam_data['genres'][0])

with pd.option_context("display.max_colwidth", 1000):
    display(steam_data['genres'].iloc[100:105])

[{'id': '1', 'description': 'Action'}]


116    [{'id': '2', 'description': 'Strategy'}, {'id': '4', 'description': 'Casual'}]
117                                            [{'id': '4', 'description': 'Casual'}]
118                                            [{'id': '4', 'description': 'Casual'}]
119                                          [{'id': '2', 'description': 'Strategy'}]
120                                            [{'id': '4', 'description': 'Casual'}]
Name: genres, dtype: object

In [38]:
print_steam_links(steam_data[steam_data['genres'].isnull()].head(10))
print_steam_links(steam_data[steam_data['genres'].isnull()].tail(10))

Hot Dish: https://store.steampowered.com/app/12570
Dr. Daisy Pet Vet: https://store.steampowered.com/app/12580
Call of Cthulhu®: Dark Corners of the Earth: https://store.steampowered.com/app/22340
Super Granny Collection: https://store.steampowered.com/app/36270
Sacrifice: https://store.steampowered.com/app/38440
Nancy Drew® Dossier: Resorting to Danger!: https://store.steampowered.com/app/42200
Air Forte: https://store.steampowered.com/app/55020
Sonic Adventure DX: https://store.steampowered.com/app/71250
Portal 2 - The Final Hours: https://store.steampowered.com/app/104600
Sonic CD: https://store.steampowered.com/app/200940
EatWell: https://store.steampowered.com/app/678870
No Lights: https://store.steampowered.com/app/682910
Cyborg Arena: https://store.steampowered.com/app/706440
M.I.A. - Overture: https://store.steampowered.com/app/712060
VEHICLES FURY: https://store.steampowered.com/app/749290
The Big Three: https://store.steampowered.com/app/823390
BlueberryNOVA: https://store.st

In [39]:
steam_data[(steam_data['genres'].isnull()) | (steam_data['categories'].isnull())]

Unnamed: 0,name,steam_appid,required_age,platforms,categories,genres,achievements,release_date,price,english,developer,publisher
338,Hot Dish,12570,0,windows,"[{'id': 2, 'description': 'Single-player'}]",,0,"{'coming_soon': False, 'date': '29 Jul, 2008'}",5.99,1,Zemnott,ValuSoft
339,Dr. Daisy Pet Vet,12580,0,windows,"[{'id': 2, 'description': 'Single-player'}]",,0,"{'coming_soon': False, 'date': '29 Jul, 2008'}",5.99,1,Zemnott,ValuSoft
366,Tom Clancy's Ghost Recon® Island Thunder™,13630,0,windows,,"[{'id': '1', 'description': 'Action'}]",0,"{'coming_soon': False, 'date': '15 Jul, 2008'}",4.29,1,Red Storm Entertainment,Ubisoft
508,Call of Cthulhu®: Dark Corners of the Earth,22340,0,windows,"[{'id': 2, 'description': 'Single-player'}]",,0,"{'coming_soon': False, 'date': '16 Jun, 2009'}",3.99,1,Headfirst Productions,Bethesda Softworks
709,Westward Collection,36150,0,windows,,"[{'id': '4', 'description': 'Casual'}]",0,"{'coming_soon': False, 'date': '17 Jul, 2009'}",10.99,1,Sandlot Games,Sandlot Games
713,Super Granny Collection,36270,0,windows,"[{'id': 2, 'description': 'Single-player'}]",,0,"{'coming_soon': False, 'date': '17 Jul, 2009'}",10.99,1,Sandlot Games,Sandlot Games
766,Sacrifice,38440,0,windows,"[{'id': 2, 'description': 'Single-player'}]",,0,"{'coming_soon': False, 'date': '19 Aug, 2009'}",6.99,1,Shiny Entertainment,Interplay Inc.
785,Painkiller: Black Edition,39530,0,windows,,"[{'id': '1', 'description': 'Action'}]",0,"{'coming_soon': False, 'date': '24 Jan, 2007'}",8.99,1,People Can Fly,THQ Nordic
837,Nancy Drew® Dossier: Resorting to Danger!,42200,0,windows,"[{'id': 2, 'description': 'Single-player'}]",,0,"{'coming_soon': False, 'date': '19 Nov, 2009'}",5.19,1,HeR Interactive,HeR Interactive
936,Might & Magic: Heroes VI,48220,0,windows,,"[{'id': '3', 'description': 'RPG'}, {'id': '2'...",0,"{'coming_soon': False, 'date': '13 Oct, 2011'}",16.99,1,Blackhole,Ubisoft


In [40]:
def process_categories_and_genres(df, export=False):
    df = df[df['genres'].notnull()].copy()
    df = df[df['categories'].notnull()].copy()
    
    df['genres'] = df['genres'].apply(lambda x: ';'.join(item['description'] for item in literal_eval(x)))
    df['categories'] = df['categories'].apply(lambda x: ';'.join(item['description'] for item in literal_eval(x)))
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # drop rows with missing publisher and copy
    df = df[df['publisher'].notnull()].copy()
    
    # Process export columns
    df = process_descriptions(df)
    df = process_media(df)
    df = process_info(df)
    df = process_requirements(df)
    df = process_categories_and_genres(df)
    
    return df


steam_data = process(imported_steam_data)
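
To make the transformation inside `process_categories_and_genres` easier to follow, here is a small sketch applying the same steps to a single value outside of pandas. The `raw` string is copied from the categories output shown earlier; the variable names are just for illustration.

```python
from ast import literal_eval

# One raw value as it appears in the categories column (a string, not a list)
raw = "[{'id': 1, 'description': 'Multi-player'}, {'id': 8, 'description': 'Valve Anti-Cheat enabled'}]"

# Evaluate the string into a list of dicts, then keep only the descriptions
flattened = ';'.join(item['description'] for item in literal_eval(raw))
print(flattened)

# The flattened form can be split back into a list whenever it is needed
print(flattened.split(';'))
```

Storing a simple semicolon-separated string keeps the exported csv readable, while still letting us recover the individual values with a single `split`.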

## Export



In [41]:
steam_data.head()

Unnamed: 0,name,steam_appid,required_age,platforms,categories,genres,achievements,release_date,price,english,developer,publisher
0,Counter-Strike,10,0,windows;mac;linux,Multi-player;Online Multi-Player;Local Multi-P...,Action,0,"{'coming_soon': False, 'date': '1 Nov, 2000'}",7.19,1,Valve,Valve
1,Team Fortress Classic,20,0,windows;mac;linux,Multi-player;Online Multi-Player;Local Multi-P...,Action,0,"{'coming_soon': False, 'date': '1 Apr, 1999'}",3.99,1,Valve,Valve
2,Day of Defeat,30,0,windows;mac;linux,Multi-player;Valve Anti-Cheat enabled,Action,0,"{'coming_soon': False, 'date': '1 May, 2003'}",3.99,1,Valve,Valve
3,Deathmatch Classic,40,0,windows;mac;linux,Multi-player;Online Multi-Player;Local Multi-P...,Action,0,"{'coming_soon': False, 'date': '1 Jun, 2001'}",3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,0,windows;mac;linux,Single-player;Multi-player;Valve Anti-Cheat en...,Action,0,"{'coming_soon': False, 'date': '1 Nov, 1999'}",3.99,1,Gearbox Software,Valve


In [42]:
steam_data.isnull().sum()

name            0
steam_appid     0
required_age    0
platforms       0
categories      0
genres          0
achievements    0
release_date    0
price           0
english         0
developer       0
publisher       0
dtype: int64

In [43]:
steam_data.to_csv("../data/exports/steam_clean_part_2.csv", index=False)

# Steam Data Cleaning (Part 3)

*This is part of a larger series of notebooks on downloading, processing and analysing data from the steam store. [See all notebooks here.](../notebooks)*

In [2]:
# load extensions and magics

# http://raw.github.com/jrjohansson/version_information/master/version_information.py
%load_ext version_information
%reload_ext version_information

%version_information numpy, pandas

Software,Version
Python,3.7.3 64bit [MSC v.1900 64 bit (AMD64)]
IPython,7.5.0
OS,Windows 10 10.0.17763 SP0
numpy,1.16.3
pandas,0.24.2
Fri May 24 12:07:27 2019 GMT Summer Time,Fri May 24 12:07:27 2019 GMT Summer Time


# Release Date Optimisation and Combining

We are almost finished with the steam data. One final column remains to be cleaned, after which we can combine our data.

In [1]:
# standard library imports
from ast import literal_eval
import itertools
import time
import re
import sys
sys.path.append('../src/')

# third-party imports
import numpy as np
import pandas as pd

# local imports
from datacleaning import print_steam_links

# customisations
pd.set_option("display.max_columns", 100)

## Import and Inspect Data

In [2]:
imported_steam_data = pd.read_csv('../data/exports/steam_clean_part_2.csv')

print('Rows:', imported_steam_data.shape[0])
print('Columns:', imported_steam_data.shape[1])
imported_steam_data.head()

Rows: 27391
Columns: 12


Unnamed: 0,name,steam_appid,required_age,platforms,categories,genres,achievements,release_date,price,english,developer,publisher
0,Counter-Strike,10,0,windows;mac;linux,Multi-player;Online Multi-Player;Local Multi-P...,Action,0,"{'coming_soon': False, 'date': '1 Nov, 2000'}",7.19,1,Valve,Valve
1,Team Fortress Classic,20,0,windows;mac;linux,Multi-player;Online Multi-Player;Local Multi-P...,Action,0,"{'coming_soon': False, 'date': '1 Apr, 1999'}",3.99,1,Valve,Valve
2,Day of Defeat,30,0,windows;mac;linux,Multi-player;Valve Anti-Cheat enabled,Action,0,"{'coming_soon': False, 'date': '1 May, 2003'}",3.99,1,Valve,Valve
3,Deathmatch Classic,40,0,windows;mac;linux,Multi-player;Online Multi-Player;Local Multi-P...,Action,0,"{'coming_soon': False, 'date': '1 Jun, 2001'}",3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,0,windows;mac;linux,Single-player;Multi-player;Valve Anti-Cheat en...,Action,0,"{'coming_soon': False, 'date': '1 Nov, 1999'}",3.99,1,Gearbox Software,Valve


Check null counts

In [3]:
imported_steam_data.isnull().sum()

name            0
steam_appid     0
required_age    0
platforms       0
categories      0
genres          0
achievements    0
release_date    0
price           0
english         0
developer       0
publisher       0
dtype: int64

### Processing and Optimising Release Date

The final column to clean, release date, provides some interesting optimisation and learning challenges. We've encountered some columns with a similar structure already, so we can use what we've learned so far, but now we have some dates to handle.

First we shall inspect the raw format of the column. As we can see below, it is stored as a dictionary-like string object containing values for `coming_soon` and `date`. From the first few rows it would appear that the dates are stored in a uniform format - day as an integer, month as a 3-character string abbreviation, a comma, then the year as a four-digit number. We can parse this either using the python built-in datetime module, or as we already have pandas imported, we can use the [pd.to_datetime](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) function.

Also, as our analysis will involve looking at ownership and sales data, looking at games that are not released yet will not be useful to us. Intuitively, we can drop any titles which are marked as coming soon, presumably having this value set to true. As a side note, once parsed it may be worth checking that no release dates in our data are beyond the current date, just to make doubly sure none slip through.
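
As a quick illustration of the built-in route mentioned above, the sketch below parses a single date with the datetime module and then performs the "not in the future" sanity check. It assumes an English locale, since `%b` matches abbreviated month names according to the current locale.

```python
from datetime import datetime

# Parse the format seen in the first few rows: day, abbreviated month, comma, year
released = datetime.strptime('1 Nov, 2000', '%d %b, %Y')
print(released.year)

# Sanity check: a parsed release date should not lie in the future
is_future = released > datetime.now()
print(is_future)
```

The same format string can be passed to pd.to_datetime, which is the approach we'll take as it works on a whole column at once.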

In [7]:
display(imported_steam_data['release_date'][0])

"{'coming_soon': False, 'date': '1 Nov, 2000'}"

In [8]:
imported_steam_data[['name', 'release_date']].head()

Unnamed: 0,name,release_date
0,Counter-Strike,"{'coming_soon': False, 'date': '1 Nov, 2000'}"
1,Team Fortress Classic,"{'coming_soon': False, 'date': '1 Apr, 1999'}"
2,Day of Defeat,"{'coming_soon': False, 'date': '1 May, 2003'}"
3,Deathmatch Classic,"{'coming_soon': False, 'date': '1 Jun, 2001'}"
4,Half-Life: Opposing Force,"{'coming_soon': False, 'date': '1 Nov, 1999'}"


As usual, one of the first steps we'll take is to check for null values. Luckily, it seems that the cleaning we have performed already has removed any null values from our data set, as seen below. We may still have some hidden empty values of course. 

In [9]:
print('Null values:\n')
# print('Raw data:', imported_steam_data['release_date'].isnull().sum())
print('Partially cleaned:', imported_steam_data['release_date'].isnull().sum())

Null values:

Partially cleaned: 0


Exploring the data using the value_counts method brings a couple of data issues to light.

In the raw data, 64 rows have a release_date entry in which date is an empty string, ''. As we've seen before, these are not null values, but they may need to be treated as such depending on the cause. This may be data corruption, or something else entirely, so we will need to investigate further before deciding what to do with these cases.

Another issue we can notice is that while most of the dates are stored in the format we saw previously (dd mmm, yyyy), at least a couple are simply stored as the month and year (e.g. 'May 2019'). This means that the dates aren't all stored uniformly so we will have to take care when parsing them.

# May have to change this and discover later

In [5]:
display(imported_steam_data['release_date'].value_counts().head())

imported_steam_data['release_date'].value_counts().tail()

{'coming_soon': False, 'date': '13 Jul, 2018'}    64
{'coming_soon': False, 'date': '31 Jan, 2019'}    58
{'coming_soon': False, 'date': '5 Apr, 2016'}     56
{'coming_soon': False, 'date': '16 Nov, 2018'}    56
{'coming_soon': False, 'date': '31 May, 2018'}    55
Name: release_date, dtype: int64

{'coming_soon': False, 'date': '24 Sep, 2007'}    1
{'coming_soon': False, 'date': '19 Sep, 2008'}    1
{'coming_soon': False, 'date': '31 Jul, 2009'}    1
{'coming_soon': False, 'date': '11 Dec, 2012'}    1
{'coming_soon': False, 'date': '1 Apr, 2006'}     1
Name: release_date, dtype: int64
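
The counts above are computed over the whole dictionary-like string, so empty dates are easy to miss. One way to look for them directly is to extract the date first and count the results. The sketch below does this in plain Python on a small made-up `sample` list (not real rows from our data); the same idea carries over to a pandas column using apply and value_counts.

```python
from ast import literal_eval
from collections import Counter

# Made-up sample in the same shape as the release_date column
sample = [
    "{'coming_soon': False, 'date': '13 Jul, 2018'}",
    "{'coming_soon': False, 'date': 'May 2019'}",
    "{'coming_soon': False, 'date': ''}",
    "{'coming_soon': False, 'date': '13 Jul, 2018'}",
]

# Pull out just the date string from each entry
dates = [literal_eval(entry)['date'] for entry in sample]

# Count each distinct date, then look up the hidden empty values
date_counts = Counter(dates)
n_empty = date_counts['']
print(date_counts.most_common(1))
print('Empty dates:', n_empty)
```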

Before we move on, let's quickly inspect some of the rows which have a blank date. 

It looks like some are special re-releases, like anniversary or game of the year editions, some are early access and not officially released yet, and others simply have a missing date. Apart from that there don't appear to be any clear patterns emerging, so as there are only 22 such rows in our partially cleaned data, it may be best to remove them.

In [6]:
no_release_date = imported_steam_data[imported_steam_data['release_date'] == "{'coming_soon': False, 'date': ''}"]

print('Rows with no release date:', no_release_date.shape[0], '\n')
print_steam_links(no_release_date.head())
no_release_date.head()

Rows with no release date: 22 

Borderlands Game of the Year: https://store.steampowered.com/app/8980
Sherlock Holmes: The Mystery of the Persian Carpet: https://store.steampowered.com/app/11180
1... 2... 3... KICK IT! (Drop That Beat Like an Ugly Baby): https://store.steampowered.com/app/15540
The Great Art Race: https://store.steampowered.com/app/33580
SpellForce 2 - Anniversary Edition: https://store.steampowered.com/app/39550


Unnamed: 0,name,steam_appid,required_age,platforms,categories,genres,achievements,release_date,price,english,developer,publisher
224,Borderlands Game of the Year,8980,18,windows,Single-player;Multi-player;Co-op;Steam Achieve...,Action;RPG,80,"{'coming_soon': False, 'date': ''}",24.99,1,Gearbox Software,2K
275,Sherlock Holmes: The Mystery of the Persian Ca...,11180,0,windows,Single-player,Adventure;Casual,0,"{'coming_soon': False, 'date': ''}",6.99,1,Frogwares,Frogwares
366,1... 2... 3... KICK IT! (Drop That Beat Like a...,15540,0,windows,Single-player;Steam Achievements;Steam Trading...,Action;Early Access;Indie,1,"{'coming_soon': False, 'date': ''}",6.99,1,"Dejobaan Games, LLC","Dejobaan Games, LLC"
634,The Great Art Race,33580,0,windows,Single-player,Simulation;Strategy,0,"{'coming_soon': False, 'date': ''}",3.99,1,Ascaron Entertainment ltd.,Assemble Entertainment
757,SpellForce 2 - Anniversary Edition,39550,0,windows,Single-player;Multi-player;Online Multi-Player...,RPG;Strategy,0,"{'coming_soon': False, 'date': ''}",13.99,1,Phenomic;THQ Nordic,THQ Nordic


Taking a look at the format of the column, we'll need to be using literal_eval once more. Apart from that it should be straightforward enough to extract the date.

In [7]:
print(type(imported_steam_data['release_date'].iloc[0]))

imported_steam_data['release_date'].iloc[0]

<class 'str'>


"{'coming_soon': False, 'date': '1 Nov, 2000'}"

In [8]:
print(type(literal_eval(imported_steam_data['release_date'].iloc[0])))

literal_eval(imported_steam_data['release_date'].iloc[0])['date']

<class 'dict'>


'1 Nov, 2000'

Once extracted, we can use the pd.to_datetime function to interpret and store dates as datetime objects. This is particularly useful as it allows us to search and sort our data set by date when it comes to performing analysis. Say, for example, we only wish to examine games released in 2010: with our dates converted to a format Python recognises, this becomes very easy to achieve.
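
To illustrate the kind of filtering and sorting this enables, here is a minimal sketch using a made-up `games` dataframe. The titles and dates are invented for the example and do not come from our data set.

```python
import pandas as pd

# Made-up titles and dates, for illustration only
games = pd.DataFrame({
    'name': ['Game A', 'Game B', 'Game C'],
    'release_date': pd.to_datetime(['2009-12-31', '2010-03-15', '2010-11-02']),
})

# The .dt accessor exposes datetime attributes such as the year
released_2010 = games[games['release_date'].dt.year == 2010]
print(released_2010['name'].tolist())

# Sorting by date works as expected once the column is a datetime type
oldest_first = games.sort_values('release_date')
print(oldest_first['name'].iloc[0])
```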

As seen below, we can supply the to_datetime function with our date and pandas will automatically interpret the format. We can then inspect it or print an attribute like the year. We can also provide pandas with the format explicitly, so it knows what to look for and how to parse it, which may be [quicker for large sets of data](https://stackoverflow.com/questions/32034689/why-is-pandas-to-datetime-slow-for-non-standard-time-format-such-as-2014-12-31).

In [9]:
timestamp = pd.to_datetime(literal_eval(imported_steam_data['release_date'].iloc[0])['date'])

print(timestamp)
print(timestamp.year)

pd.to_datetime(literal_eval(imported_steam_data['release_date'].iloc[0])['date'], format='%d %b, %Y')

2000-11-01 00:00:00
2000


Timestamp('2000-11-01 00:00:00')

Now we are ready to begin defining our function. As we only want to keep released games, we first extract the values from the coming_soon key, and keep only those rows where the value is False. Next we extract the release date, and set missing dates to np.nan, the default way of storing null values in pandas.

Then, using the formats we learned previously, we interpret those dates using the to_datetime function. Once complete we pass over the dataframe once more with a general call to to_datetime, catching any dates we missed.

Finally we drop the columns we no longer need and return the dataframe.

Whilst functional, the process is quite slow. We can use the %timeit magic to test how long it takes to run our function, and we can see that on average it takes almost four seconds. Whilst manageable, we could certainly benefit from optimising our code, as this could quickly add up in larger data sets, where increasing efficiency can prove invaluable.

In [10]:
def process_release_date(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    # Only want to keep released games
    df = df[df['coming_soon'] == False].copy()
    
    # extract release date and set missing dates to null
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    df.loc[df['date'] == '', 'date'] = np.nan
    
    # Parse the date formats we have discovered
    df['datetime'] = pd.to_datetime(df['date'], format='%d %b, %Y', errors='ignore')
    df['datetime'] = pd.to_datetime(df['datetime'], format='%b %Y', errors='ignore')
    
    # Parse the rest of the date formats
    df['release_date'] = pd.to_datetime(df['datetime'])
    
    df = df.drop(['coming_soon', 'date', 'datetime'], axis=1)
    return df

%timeit process_release_date(imported_steam_data)

There are a few areas we can investigate to make improvements. When initially parsing the date, we end up calling literal_eval twice, which may be a source of slowdown. We also loop over the entire dataset multiple times when calling the to_datetime function. 

We'll investigate which part is causing the greatest slowdown, but reducing the number of traversals over the data set will most likely provide significant gains. There are also a few other issues that we'll dive into over the course of our optimisation process.

First, let's find out where the main slowdowns are. As we just saw, we can use the %timeit magic to time our function. We can also use the built-in time module to inspect parts of our code.

In [None]:
def process_release_date(df):
    df = df.copy()
    
    eval_start = time.time()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    print('Evaluation run-time:', time.time() - eval_start)
    
    df.loc[df['date'] == '', 'date'] = None
    
    first_parse_start = time.time()
    
    df['datetime'] = pd.to_datetime(df['date'], format='%d %b, %Y', errors='ignore')
    df['datetime'] = pd.to_datetime(df['datetime'], format='%b %Y', errors='ignore')
    
    print('First parse run-time:', time.time() - first_parse_start)
    
    second_parse_start = time.time()
    
    df['release_date'] = pd.to_datetime(df['datetime'])
    
    print('Final parse run-time:', time.time() - second_parse_start)
    
    df = df.drop(['coming_soon', 'date', 'datetime'], axis=1)
    return df

function_start = time.time()
process_release_date(imported_steam_data)
print('\nTotal run-time:', time.time() - function_start)

Immediately we can see that the majority of run-time is taken up by the final call to pd.to_datetime. This suggests that the first two calls are not functioning as expected - they are possibly terminating after the first error instead of skipping over it as desired - and most of the work is being done by the final call. Now it makes sense why it is slow - pandas has to figure out how each date is formatted, and since we know we have some variations this may be slowing it down considerably.

Whilst the evaluation run-time is much shorter, our multiple calls to literal_eval may be slowing the function as well, so we may wish to investigate that later. As we now know where the biggest slowdown is, we should begin there.
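
As an aside on the literal_eval point, one possible way to avoid the repeated work is to evaluate each string once and read both keys from the result. The sketch below compares the two approaches in plain Python on a made-up `sample` list; `eval_twice` and `eval_once` are hypothetical helpers written for this comparison, and timings will vary from machine to machine.

```python
from ast import literal_eval
import timeit

# Made-up sample in the same shape as the release_date column
sample = ["{'coming_soon': False, 'date': '1 Nov, 2000'}"] * 1000


def eval_twice(entries):
    """Evaluate every string twice, once per key."""
    coming_soon = [literal_eval(e)['coming_soon'] for e in entries]
    dates = [literal_eval(e)['date'] for e in entries]
    return coming_soon, dates


def eval_once(entries):
    """Evaluate every string once and read both keys from the result."""
    parsed = [literal_eval(e) for e in entries]
    return [p['coming_soon'] for p in parsed], [p['date'] for p in parsed]


# Both approaches should give identical results
same_result = eval_twice(sample) == eval_once(sample)
print(same_result)

# Timings vary by machine, so no expected numbers are given here
print('twice:', timeit.timeit(lambda: eval_twice(sample), number=5))
print('once: ', timeit.timeit(lambda: eval_once(sample), number=5))
```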

We now know that handling our dates in their current form is slow, and we know that we have some different formats mixed in there. Whilst there are likely many possible solutions to this problem, using regular expressions (or regex) comes to mind as they tend to excel at pattern matching in strings.

We know for sure two of the patterns, so let's build a regex for each of those. Then we can iteratively add more as we discover any other patterns. A powerful and useful tool for building and testing regex can be found at [regexr.com](https://regexr.com/).

In [None]:
pattern = r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}'
string = '13 Jul, 2018'

print(re.search(pattern, string))

pattern = r'[A-Za-z]{3} [\d]{4}'
string = 'Apr 2016'

print(re.search(pattern, string))
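
One thing worth keeping in mind is that re.search looks for the pattern anywhere in the string, so it will also accept longer strings that merely contain a date. If stricter matching were ever needed (an assumption on our part, as nothing in the data so far requires it), re.fullmatch is one option. The sketch below shows the difference using a made-up string.

```python
import re

day_month_year = re.compile(r'\d{1,2} [A-Za-z]{3}, \d{4}')
month_year = re.compile(r'[A-Za-z]{3} \d{4}')

# search finds the pattern anywhere inside the string
loose = bool(month_year.search('Coming Apr 2016'))

# fullmatch only succeeds if the entire string fits the pattern
strict = bool(month_year.fullmatch('Coming Apr 2016'))
exact = bool(month_year.fullmatch('Apr 2016'))

print(loose, strict, exact)
print(bool(day_month_year.fullmatch('13 Jul, 2018')))
```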

Using these two patterns we can start building out our function. We're going to apply a function to the date column which searches for each pattern, returning a standardised date string which we will then feed into the to_datetime function.

Our first search matches the 'mmm yyyy' pattern, like 'Apr 2019'. As we don't know the particular day for these matches we will assume it is the first of the month, returning '1 Apr 2019' in this example.

If we don't match this, we'll check for the second case. Our second match will be the 'dd mmm, yyyy' pattern, like '13 Jul, 2018'. In this case we will simply return the match with the comma removed, to become '13 Jul 2018'.

Finally we'll check for the empty string, and return it for now.

For anything else we'll simply print the string so we know what else we should be searching for.

In [None]:
def process_release_date(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    def parse_date(x):
        if re.search(r'[A-Za-z]{3} [\d]{4}', x):
            return '1 ' + x 
        elif re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', x):
            return x.replace(',', '')
        elif x == '':
            return x
        else:
            print(x)
            
    df['date'] = df['date'].apply(parse_date)
    df['release_date'] = pd.to_datetime(df['date'], infer_datetime_format=True)
    
    df = df.drop(['coming_soon', 'date'], axis=1)
    
    return df

result = process_release_date(imported_steam_data)

It looks like we've caught all of the patterns and don't have any to take care of.

Previously we used the `infer_datetime_format` parameter of to_datetime, which can speed up the process. However, as we now know exactly the format our dates will be in, we can explicitly set it ourselves, which should be the fastest way of doing things.

We also need to decide how to handle our missing dates - those with the empty strings. For now let's change the way the function handles errors from raise to coerce, which returns NaT (not a time) instead.
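
To show what coerce does in isolation, here is a minimal sketch on a small made-up Series containing two standardised dates and one empty string.

```python
import pandas as pd

# Standardised date strings, plus an empty string standing in for a missing date
dates = pd.Series(['13 Jul 2018', '1 Apr 2016', ''])

# With errors='coerce', anything that cannot be parsed becomes NaT
parsed = pd.to_datetime(dates, format='%d %b %Y', errors='coerce')
print(parsed)

# NaT counts as a null value, so the usual pandas tools apply
print(parsed.isnull().sum())
```

Because NaT is treated as null, we can decide later whether to drop or fill these rows using the same tools as for any other missing value.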

We can now rewrite our function and time it as we did before.

In [None]:
def process_release_date_old(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    # Simple parsing
    df['release_date'] = pd.to_datetime(df['date'])
    
    df = df.drop(['coming_soon', 'date'], axis=1)
    
    return df


def process_release_date_new(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    # Complex parsing
    def parse_date(x):
        if re.search(r'[A-Za-z]{3} [\d]{4}', x):
            return '1 ' + x
        elif re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', x):
            return x.replace(',', '')
        elif x == '':
            return x
        else:
            # Should be everything, print out anything left just in case
            print(x)
            
    df['date'] = df['date'].apply(parse_date)
    df['release_date'] = pd.to_datetime(df['date'], format='%d %b %Y', errors='coerce')
    
    df = df.drop(['coming_soon', 'date'], axis=1)
    
    return df

print('Testing date parsing:\n')
%timeit process_release_date_old(imported_steam_data)
%timeit process_release_date_new(imported_steam_data)

Our results show that the new method is almost four times faster, so we're on the right track.

Another optimisation we can make here is checking which part of the if/elif statements has the most matches. It makes sense to order our statements from most matches to least, so for the majority of rows we only have to search through once. 

To do this, instead of returning the date we'll return a number for each match. We can then print the value counts for the column and see which is the most frequent.

In [None]:
def optimise_regex_order(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    def parse_date(x):
        if re.search(r'[A-Za-z]{3} [\d]{4}', x):
            return '0: mmm yyyy' # '1 ' + x
        elif re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', x):
            return '1: dd mmm, yyyy' # x.replace(',', '')
        elif x == '':
            return '2: empty' # pass
        else:
            # Should be everything, print out anything left just in case
            print(x)
            
    df['release_date'] = df['date'].apply(parse_date)
    
    return df


result = optimise_regex_order(imported_steam_data)

result['release_date'].value_counts()

By far the majority of dates are in the 'dd mmm, yyyy' format, which is second in our if/else statements. This means that for all these rows we are unnecessarily searching the string twice. Simply by reordering our searches we should see a minor performance improvement.

In [None]:
def process_release_date_unordered(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    def parse_date(x):
        if re.search(r'[A-Za-z]{3} [\d]{4}', x):
            return '1 ' + x
        elif re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', x):
            return x.replace(',', '')
        elif x == '':
            return x
        else:
            # Should be everything, print out anything left just in case
            print(x)
            
    # Store the parsed strings back in 'date' so to_datetime converts the cleaned values
    df['date'] = df['date'].apply(parse_date)
    df['release_date'] = pd.to_datetime(df['date'], format='%d %b %Y', errors='coerce')
    df = df.drop(['coming_soon', 'date'], axis=1)
    
    return df


def process_release_date_ordered(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    def parse_date(x):
        if re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', x):
            return x.replace(',', '')
        elif re.search(r'[A-Za-z]{3} [\d]{4}', x):
            return '1 ' + x
        elif x == '':
            return x
        else:
            # Should be everything, print out anything left just in case
            print(x)
            
    # Store the parsed strings back in 'date' so to_datetime converts the cleaned values
    df['date'] = df['date'].apply(parse_date)
    df['release_date'] = pd.to_datetime(df['date'], format='%d %b %Y', errors='coerce')
    df = df.drop(['coming_soon', 'date'], axis=1)
    
    return df


%timeit process_release_date_unordered(imported_steam_data)
%timeit process_release_date_ordered(imported_steam_data)

It's an improvement, if only slightly, so we'll keep it. If anything this goes to show how fast regex pattern matching is, as there was hardly any slowdown in searching every string twice.
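
The `%timeit` magic only works inside IPython or Jupyter. If you want to run a similar comparison from a plain Python script, a minimal sketch using the standard library `timeit` module might look like this. The sample list and the `parse_ordered` and `run` helpers are assumptions for illustration only.

```python
import re
import timeit

# Hypothetical sample, repeated to give the timer something to measure
sample_dates = ['1 Nov, 2000', '30 Apr, 2014', 'Jun 2009', ''] * 1000


def parse_ordered(x):
    """Normalise a date string, checking the most common format first."""
    if re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', x):
        return x.replace(',', '')
    elif re.search(r'[A-Za-z]{3} [\d]{4}', x):
        return '1 ' + x
    return x


def run():
    return [parse_ordered(x) for x in sample_dates]


# Total time for 10 runs, in seconds
elapsed = timeit.timeit(run, number=10)
print(f'{elapsed / 10:.4f} s per loop')
```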

Now that parsing is well optimised, we can move on to the evaluation section.

In [None]:
# Testing evaluation methods
def evaluation_method_original(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])    
    df = df[df['coming_soon'] == False].copy()
    df['release_date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    return df


def evaluation_method_1(df):
    df = df.copy()
    
    df['release_date'] = df['release_date'].apply(lambda x: literal_eval(x))
    
    df['coming_soon'] = df['release_date'].apply(lambda x: x['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    
    df['release_date'] = df['release_date'].apply(lambda x: x['date'])
    
    return df


def evaluation_method_2(df):
    df = df.copy()
    
    df['release_date'] = df['release_date'].apply(lambda x: literal_eval(x))
    df_2 = df['release_date'].transform([lambda x: x['coming_soon'], lambda x: x['date']])
    df = pd.concat([df, df_2], axis=1)
    
    return df


def evaluation_method_3(df):
    df = df.copy()
    
    def eval_date(x):
        x = literal_eval(x)
        if x['coming_soon']:
            return np.nan
        else:
            return x['date']
    
    df['release_date'] = df['release_date'].apply(eval_date)
    df = df[df['release_date'].notnull()]  # could change to drop when '' and deal with missing release dates also
    
    return df


%timeit evaluation_method_original(imported_steam_data)

%timeit evaluation_method_1(imported_steam_data)
%timeit evaluation_method_2(imported_steam_data)
%timeit evaluation_method_3(imported_steam_data)

It looks like we may have been right in our assumption that multiple calls to literal_eval were slowing down the function - by calling it once instead of twice we almost halved the run-time.
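
To see why a single call is enough, here is a minimal sketch using a made-up raw value in the same shape as the release_date column (a stringified dictionary with coming_soon and date keys). Evaluating the string once gives us a dictionary from which both fields can be read.

```python
from ast import literal_eval

# Hypothetical raw value, in the same shape as the release_date column
raw = "{'coming_soon': False, 'date': '1 Nov, 2000'}"

# Evaluate once, then read both keys from the resulting dictionary
parsed = literal_eval(raw)
coming_soon = parsed['coming_soon']
date = parsed['date']

print(coming_soon, date)
```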

Of our new methods the final one was just about the fastest, which is useful because it contains flexible custom logic we can modify if needed. Let's put everything together into our final function, and time it once more to see the improvements we've made.

We'll make a couple of changes so we can easily remove missing values at the end, which should mean we end up with clean release dates.
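
The idea relies on `pd.to_datetime` with `errors='coerce'` turning anything it can't parse into a missing value, which can then be filtered out. Here is a small sketch with made-up values, assuming an English locale for the `%b` month abbreviations.

```python
import numpy as np
import pandas as pd

# Hypothetical values: two valid dates, one missing value, one unparseable string
dates = pd.Series(['1 Nov 2000', '1 Jun 2009', np.nan, 'not a date'])

# Anything that doesn't match the format is coerced to NaT rather than raising
parsed = pd.to_datetime(dates, format='%d %b %Y', errors='coerce')

# Keep only the rows with a valid date
cleaned = parsed[parsed.notnull()]
print(cleaned)
```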

In [12]:
def process_release_date(df):
    df = df.copy()
    
    def eval_date(x):
        x = literal_eval(x)
        if x['coming_soon']:
            return '' # return blank string so can drop missing at end
        else:
            return x['date']
    
    df['release_date'] = df['release_date'].apply(eval_date)
    
    def parse_date(x):
        if re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', x):
            return x.replace(',', '')
        elif re.search(r'[A-Za-z]{3} [\d]{4}', x):
            return '1 ' + x
        elif x == '':
            return np.nan
        else:
            # Should be everything, print out anything left just in case
            print(x)
            
    df['release_date'] = df['release_date'].apply(parse_date)
    df['release_date'] = pd.to_datetime(df['release_date'], format='%d %b %Y', errors='coerce')
    
    df = df[df['release_date'].notnull()]
    
    return df

%timeit process_release_date(imported_steam_data)

Referring back to our original time of 3.6s, we've achieved a 7x speed increase. That's almost an order of magnitude improvement. 

We'll now update our process function, run it on our data set, and move on to some final checks.

In [13]:
# def process(df):
#     """Process data set. Will eventually contain calls to all functions we write."""
    
#     # Copy the input dataframe to avoid accidentally modifying original data
#     df = df.copy()
    
#     # Remove duplicate rows - all appids should be unique
#     df = df.drop_duplicates()
    
#     # Remove columns with more than 50% null values
#     df = process_null_cols(df)
    
#     # Process rest of columns
#     df = df.drop(['achievements', 'content_descriptors'], axis=1)
#     df = process_type(df)
#     df = process_name(df)
#     df = process_age(df)
#     df = process_platforms(df)
#     df = process_price(df)
#     df = process_language(df)
#     df = process_developers_and_publishers(df)
#     df = process_release_date(df)
    
#     # Process columns which export data
#     df = process_descriptions(df, export=True)
#     df = process_images(df, export=True)
#     df = process_info(df, export=True)
#     df = process_requirements(df, export=True)
#     df = process_categories(df, export=True)
#     df = process_genres(df, export=True)
    
#     return df

steam_data = process_release_date(imported_steam_data)
steam_data.head()

Unnamed: 0,name,steam_appid,required_age,platforms,categories,genres,achievements,release_date,price,english,developer,publisher
0,Counter-Strike,10,0,windows;mac;linux,Multi-player;Online Multi-Player;Local Multi-P...,Action,0,2000-11-01,7.19,1,Valve,Valve
1,Team Fortress Classic,20,0,windows;mac;linux,Multi-player;Online Multi-Player;Local Multi-P...,Action,0,1999-04-01,3.99,1,Valve,Valve
2,Day of Defeat,30,0,windows;mac;linux,Multi-player;Valve Anti-Cheat enabled,Action,0,2003-05-01,3.99,1,Valve,Valve
3,Deathmatch Classic,40,0,windows;mac;linux,Multi-player;Online Multi-Player;Local Multi-P...,Action,0,2001-06-01,3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,0,windows;mac;linux,Single-player;Multi-player;Valve Anti-Cheat en...,Action,0,1999-11-01,3.99,1,Gearbox Software,Valve


## Final Steps

TODO: Could get rid of price > 60 here  
TODO: check size

Our data set is hopefully complete. Before we export it to CSV, let's take a look at the most expensive entries, check the size of the data frame, and check whether we have any null values.

In [14]:
steam_data[steam_data['price'] > 60]

Unnamed: 0,name,steam_appid,required_age,platforms,categories,genres,achievements,release_date,price,english,developer,publisher
1117,3DCoat 4.8,100980,0,windows,Steam Cloud,Animation & Modeling,0,2012-10-02,95.99,1,Pilgway,Pilgway
1878,Wing IDE 5,244830,0,windows;mac;linux,Single-player,Utilities,0,2014-04-30,60.99,1,Wingware,Wingware
1928,Clickteam Fusion 2.5,248170,0,windows,Single-player;Multi-player;MMO;Co-op;Shared/Sp...,Animation & Modeling;Education;Utilities;Web P...,0,2013-12-05,69.99,1,Clickteam,Clickteam
2012,Leadwerks Game Engine,251810,0,windows,Single-player;Steam Achievements;Steam Worksho...,Animation & Modeling;Design & Illustration;Edu...,3,2014-01-06,78.99,1,Leadwerks Software,Leadwerks Software
2073,Aartform Curvy 3D 3.0,253670,0,windows,Single-player,Animation & Modeling,0,2013-11-12,75.99,1,Aartform,Aartform
3862,Command: Modern Air / Naval Operations WOTY,321410,0,windows,Single-player;Steam Workshop;Includes level ed...,Simulation;Strategy,0,2014-09-26,60.99,1,WarfareSims,Slitherine Ltd.
3976,AppGameKit: Easy Game Development,325180,0,windows;mac;linux,Steam Workshop,Animation & Modeling;Design & Illustration;Edu...,0,2014-11-21,60.99,1,The Game Creators,The Game Creators
5410,RPG Maker MV,363890,0,windows;mac;linux,Steam Trading Cards;Partial Controller Support,Design & Illustration;Web Publishing,0,2015-10-23,60.99,1,KADOKAWA;Yoji Ojima,Degica
5699,Gary Grigsby's War in the East,370540,0,windows,Single-player;Multi-player,Simulation;Strategy,0,2015-07-09,60.99,1,2by3 Games,Slitherine Ltd.
7727,The Music Room,431030,0,windows,VR Support,Audio Production,0,2017-08-17,98.99,1,Chroma Coda,Chroma Coda


In [15]:
steam_data.info(verbose=False, memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
Int64Index: 27332 entries, 0 to 27390
Columns: 12 entries, name to publisher
dtypes: datetime64[ns](1), float64(1), int64(4), object(6)
memory usage: 13.9 MB


In [16]:
steam_data.isnull().sum()

name            0
steam_appid     0
required_age    0
platforms       0
categories      0
genres          0
achievements    0
release_date    0
price           0
english         0
developer       0
publisher       0
dtype: int64

Looks good. We also want to check that no games slipped through that aren't released yet (data scraped on or before 1st May 2019).

In [17]:
steam_data[steam_data['release_date'] > '2019-05-01']

Unnamed: 0,name,steam_appid,required_age,platforms,categories,genres,achievements,release_date,price,english,developer,publisher
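
The empty result means nothing slipped through. To show what this filter would catch, here is a small sketch on a made-up data frame, using an explicit `pd.Timestamp` for the cut-off. The sample rows and the `scrape_date` name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows: one released before the scrape date, one after
sample = pd.DataFrame({
    'name': ['Released Game', 'Future Game'],
    'release_date': pd.to_datetime(['2018-03-01', '2019-08-15']),
})

scrape_date = pd.Timestamp('2019-05-01')

# Any row returned here has a release date after the data was scraped
unreleased = sample[sample['release_date'] > scrape_date]
print(unreleased)
```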


### Combining and exporting data frames

Now that we're happy with our dataframe we are ready to export to file and finish this part of the project. 

First we export steam_data, then we merge genre_data and category_data into a new dataframe, check it for missing values, then export it.

In [18]:
steam_data.to_csv('../data/steam_data_clean.csv', index=False)
steam_data.head()

Unnamed: 0,name,steam_appid,required_age,platforms,categories,genres,achievements,release_date,price,english,developer,publisher
0,Counter-Strike,10,0,windows;mac;linux,Multi-player;Online Multi-Player;Local Multi-P...,Action,0,2000-11-01,7.19,1,Valve,Valve
1,Team Fortress Classic,20,0,windows;mac;linux,Multi-player;Online Multi-Player;Local Multi-P...,Action,0,1999-04-01,3.99,1,Valve,Valve
2,Day of Defeat,30,0,windows;mac;linux,Multi-player;Valve Anti-Cheat enabled,Action,0,2003-05-01,3.99,1,Valve,Valve
3,Deathmatch Classic,40,0,windows;mac;linux,Multi-player;Online Multi-Player;Local Multi-P...,Action,0,2001-06-01,3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,0,windows;mac;linux,Single-player;Multi-player;Valve Anti-Cheat en...,Action,0,1999-11-01,3.99,1,Gearbox Software,Valve
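
One thing to keep in mind when loading this file later: CSV stores dates as plain text, so the release_date column won't be read back as a datetime unless we ask for it. The sketch below shows this with a two-row sample and an in-memory buffer standing in for the file.

```python
import io

import pandas as pd

# Two rows taken from the output above, with release_date as datetimes
sample = pd.DataFrame({
    'name': ['Counter-Strike', 'Team Fortress Classic'],
    'release_date': pd.to_datetime(['2000-11-01', '1999-04-01']),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Read back without any options: dates come back as text
buffer.seek(0)
plain = pd.read_csv(buffer)

# Read back with parse_dates: dates are converted to datetimes again
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=['release_date'])

print(plain['release_date'].dtype)
print(parsed['release_date'].dtype)
```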


# Next steps

We could clean some of the data we exported, like description and requirements.

We are now ready to move on to cleaning our steamspy data.