# Data Cleaning

**TODO**: genre and categories section writeup

Currently our downloaded data is not in a very usable or useful state. Many of the columns contain lengthy strings or missing values, both of which are crippling to analysis and especially to any machine learning techniques we may wish to implement.

The main aims of this project are to investigate various sales and play-time statistics for games from the Steam store, and to see how different features of games may affect their success. Keeping this in mind will help inform our decisions about how to handle the various columns in our data set. However, it may be a good idea to keep columns which do not seem useful to this particular project, in order to provide a robust data set for future analysis projects.

To begin with, we'll import our libraries and set some options, then take a look at the data downloaded from the Steam API. Once that is taken care of, we will move on to the SteamSpy data and repeat the process. Hopefully by the end we will have clean data sets to use in the next step: exploratory analysis and visualisation.

### Aims:
- Improve functions
- Prepare notebook for delivery

### (Raw) Data Dictionary

Sort out data dictionary  

API and data dictionary:
https://steamspy.com/api.php

### Future ideas:
- pc requirements analysis over time
- picture analysis
- keyword/recommender analysis
- categories could make table in a database all on its own, perhaps in future
- for genres (and categories?) could create main genre, selected from list of key genres, allowing hybrids like action_adventure if contains both
- remove titles over £60/100?

In [1]:
# load extensions and magics

# http://raw.github.com/jrjohansson/version_information/master/version_information.py
%load_ext version_information
%reload_ext version_information

%version_information numpy, pandas

Software,Version
Python,3.7.3 64bit [MSC v.1915 64 bit (AMD64)]
IPython,7.5.0
OS,Windows 10 10.0.17763 SP0
numpy,1.16.3
pandas,0.24.2
Thu May 09 20:04:32 2019 GMT Summer Time


In [2]:
# import libraries
from ast import literal_eval
import itertools
import time
import re

import numpy as np
import pandas as pd

In [3]:
# customisations
pd.set_option("display.max_columns", 100)
# pd.reset_option("display.max_columns")

## Cleaning steam data

### Import Data

We begin by importing the raw steam data we generated previously in data collection, which can be viewed by following the link to `../deliver/1-data-collection.ipynb` below. From a quick inspection of the data, we can see that we have a mixture of numeric and string columns, plenty of missing values, and a number of columns stored as dictionaries.

In [4]:
from IPython.display import FileLink
FileLink("../deliver/1-data-collection.ipynb")

In [5]:
raw_steam_data = pd.read_csv('../data/raw/steam_app_data.csv')

print('Rows:', raw_steam_data.shape[0])
print('Columns:', raw_steam_data.shape[1])
raw_steam_data.head()

Rows: 29235
Columns: 39


Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,legal_notice,drm_notice,ext_user_account_notice,developers,publishers,demos,price_overview,packages,package_groups,platforms,metacritic,reviews,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
0,game,Counter-Strike,10,0.0,False,,,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,,"English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'GBP', 'initial': 719, 'final': 7...",[7],"[{'name': 'default', 'title': 'Buy Counter-Str...","{'windows': True, 'mac': True, 'linux': True}","{'score': 88, 'url': 'https://www.metacritic.c...",,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 65735},{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'url': 'http://steamcommunity.com/app/10', 'e...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
1,game,Team Fortress Classic,20,0.0,False,,,One of the most popular online action games of...,One of the most popular online action games of...,One of the most popular online action games of...,,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/20/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'GBP', 'initial': 399, 'final': 3...",[29],"[{'name': 'default', 'title': 'Buy Team Fortre...","{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 2802},{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/20/...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
2,game,Day of Defeat,30,0.0,False,,,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,,"English, French, German, Italian, Spanish - Spain",https://steamcdn-a.akamaihd.net/steam/apps/30/...,http://www.dayofdefeat.com/,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'GBP', 'initial': 399, 'final': 3...",[30],"[{'name': 'default', 'title': 'Buy Day of Defe...","{'windows': True, 'mac': True, 'linux': True}","{'score': 79, 'url': 'https://www.metacritic.c...",,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 1992},{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/30/...,"{'ids': [], 'notes': None}"
3,game,Deathmatch Classic,40,0.0,False,,,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/40/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'GBP', 'initial': 399, 'final': 3...",[31],"[{'name': 'default', 'title': 'Buy Deathmatch ...","{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 931},{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/40/...,"{'ids': [], 'notes': None}"
4,game,Half-Life: Opposing Force,50,0.0,False,,,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,,"English, French, German, Korean",https://steamcdn-a.akamaihd.net/steam/apps/50/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Gearbox Software'],['Valve'],,"{'currency': 'GBP', 'initial': 399, 'final': 3...",[32],"[{'name': 'default', 'title': 'Buy Half-Life: ...","{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 4355},{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'url': 'https://help.steampowered.com', 'emai...",https://steamcdn-a.akamaihd.net/steam/apps/50/...,"{'ids': [], 'notes': None}"


We can chain the `isnull()` and `sum()` methods to easily see how many missing values we have in each column. Immediately we can see that a number of columns have over 20,000 rows with missing data, and in a data set of almost 30,000 rows these are unlikely to provide any useful information.

In [6]:
raw_steam_data.isnull().sum()

type                         149
name                           1
steam_appid                    0
required_age                 149
is_free                      149
controller_support         23237
dlc                        24260
detailed_description         175
about_the_game               175
short_description            175
fullgame                   29235
supported_languages          163
header_image                 149
website                     9983
pc_requirements              149
mac_requirements             149
linux_requirements           149
legal_notice               19168
drm_notice                 29077
ext_user_account_notice    28723
developers                   264
publishers                   149
demos                      27096
price_overview              3712
packages                    3370
package_groups               149
platforms                    149
metacritic                 26254
reviews                    23330
categories                   714
genres    
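
Raw null counts are easier to judge when expressed as a share of all rows. The following is a minimal sketch on a made-up toy frame (the `toy` data and the name `missing_share` are illustrative assumptions, not part of the Steam data); it relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for raw_steam_data (values are invented for illustration)
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "dlc": [np.nan, np.nan, np.nan, "[1]"],
    "website": ["x", np.nan, "y", "z"],
})

# isnull() gives a boolean mask; its column-wise mean is the share of missing values
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the real data, the same expression would rank columns by how incomplete they are, which makes a cut-off such as 50% straightforward to reason about.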

## Defining Functions

We will most likely have to handle each column differently and individually, so we will write some functions to keep our methodology organised and to help us develop the process iteratively.


### Initial processing

Our first function, `process_null_cols`, will remove the columns with more than 50% missing values, taking care of the null counts we saw previously. We then look at the type and name columns, thinning out our data set a little by removing apps without either.
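
Dropping sparse columns in pandas is usually done with `DataFrame.dropna`, whose `thresh` argument is the minimum number of non-null values a column (or row) needs in order to be kept. The sketch below uses an invented toy frame to show that behaviour in isolation.

```python
import numpy as np
import pandas as pd

# Toy frame with columns of differing completeness (values are invented)
toy = pd.DataFrame({
    "mostly_full": [1, 2, 3, np.nan],             # 3 non-null values
    "half_full": [1, 2, np.nan, np.nan],          # 2 non-null values
    "mostly_empty": [1, np.nan, np.nan, np.nan],  # 1 non-null value
})

# Keep only the columns that have at least 2 non-null values
kept = toy.dropna(thresh=2, axis=1)
print(list(kept.columns))
```

Note that `thresh` counts values that are present rather than values that are missing, so a "proportion missing" cut-off has to be converted before it is passed in.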

In the data collection stage, if no information was returned for an app we just stored the name and steam_appid. As seen below, these rows contain no other information so we definitely need to remove them.

In [7]:
print('Rows to remove:', raw_steam_data[raw_steam_data['type'].isnull()].shape[0])

raw_steam_data[raw_steam_data['type'].isnull()].head()

Rows to remove: 149


Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,legal_notice,drm_notice,ext_user_account_notice,developers,publishers,demos,price_overview,packages,package_groups,platforms,metacritic,reviews,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
26,,Half-Life: Opposing Force,852,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
147,,Half-Life: Opposing Force,4330,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
256,,Half-Life: Opposing Force,8740,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
264,,Half-Life: Opposing Force,8955,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
336,,Half-Life: Opposing Force,11610,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


We can look at the counts of unique values in a column by using the pandas [Series.value_counts](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) method.

Once the null rows are removed, we can see that all the other rows have 'game' as their type, meaning this column isn't of any use and can be safely dropped.

In [8]:
raw_steam_data['type'].value_counts(dropna=False)

game    29086
NaN       149
Name: type, dtype: int64

In the name column we have a handful of rows without a title (either null, or with 'none' as the title). It looks like these can be safely removed.

In [9]:
raw_steam_data[(raw_steam_data['name'].isnull()) | (raw_steam_data['name'] == 'none')]

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,legal_notice,drm_notice,ext_user_account_notice,developers,publishers,demos,price_overview,packages,package_groups,platforms,metacritic,reviews,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
4918,game,none,339860,0.0,False,,,,,,,English<strong>*</strong><br><strong>*</strong...,https://steamcdn-a.akamaihd.net/steam/apps/339...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],,,,,[''],,,,[],"{'windows': True, 'mac': False, 'linux': False}",,,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '25', 'description': 'Adventure'}, {'i...",,,,"{'total': 3, 'highlighted': [{'name': 'Master ...","{'coming_soon': False, 'date': '27 Feb, 2015'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}"
6779,game,none,385020,0.0,False,,,- discontinued - (please remove),- discontinued - (please remove),- discontinued - (please remove),,"English, French, Italian, German, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/385...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,,,,['none'],[''],,,,[],"{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '4', 'description': 'Casual'}, {'id': ...",,,,{'total': 0},"{'coming_soon': False, 'date': '4 Nov, 2015'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}"
7235,game,,396420,0.0,True,,,Spookeningは3Dの恐怖ゲームで、あなたは毎夜に死んでゴーストとして復活します。<b...,Spookeningは3Dの恐怖ゲームで、あなたは毎夜に死んでゴーストとして復活します。<b...,Spookeningは3Dの恐怖ゲームで、あなたは毎夜に死んでゴーストとして復活します。 村...,,,https://steamcdn-a.akamaihd.net/steam/apps/396...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],,,,,[''],,,,[],"{'windows': True, 'mac': False, 'linux': False}",,,,,,,,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2016'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}"
7350,game,none,398970,0.0,False,,,,,,,English<strong>*</strong><br><strong>*</strong...,https://steamcdn-a.akamaihd.net/steam/apps/398...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,,,,['none'],['none'],"[{'appid': 516340, 'description': ''}]",,,[],"{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '25', 'description': 'Adventure'}, {'i...",,,,"{'total': 35, 'highlighted': [{'name': ""They'v...","{'coming_soon': False, 'date': '5 Nov, 2015'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}"


We also have some duplicated rows, likely caused by errors or overlapping in our data collection process. As we know for certain that all AppIDs should be unique, we can safely remove these duplicates straight away.

In [10]:
raw_steam_data[raw_steam_data.duplicated()].head()

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,legal_notice,drm_notice,ext_user_account_notice,developers,publishers,demos,price_overview,packages,package_groups,platforms,metacritic,reviews,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
31,game,SiN Episodes: Emergence,1300,0.0,False,,,"You are John Blade, commander of HardCorps, an...","You are John Blade, commander of HardCorps, an...","You are John Blade, commander of HardCorps, an...",,"English, Russian, French",https://steamcdn-a.akamaihd.net/steam/apps/130...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,[],[],,,,['Ritual Entertainment'],['Ritual Entertainment'],,"{'currency': 'GBP', 'initial': 719, 'final': 7...",[70],"[{'name': 'default', 'title': 'Buy SiN Episode...","{'windows': True, 'mac': False, 'linux': False}","{'score': 75, 'url': 'https://www.metacritic.c...",,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 265},{'total': 0},"{'coming_soon': False, 'date': '10 May, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/130...,"{'ids': [], 'notes': None}"
32,game,SiN Episodes: Emergence,1300,0.0,False,,,"You are John Blade, commander of HardCorps, an...","You are John Blade, commander of HardCorps, an...","You are John Blade, commander of HardCorps, an...",,"English, Russian, French",https://steamcdn-a.akamaihd.net/steam/apps/130...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,[],[],,,,['Ritual Entertainment'],['Ritual Entertainment'],,"{'currency': 'GBP', 'initial': 719, 'final': 7...",[70],"[{'name': 'default', 'title': 'Buy SiN Episode...","{'windows': True, 'mac': False, 'linux': False}","{'score': 75, 'url': 'https://www.metacritic.c...",,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 265},{'total': 0},"{'coming_soon': False, 'date': '10 May, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/130...,"{'ids': [], 'notes': None}"
356,game,Jagged Alliance 2 Gold,1620,0.0,False,,,<p>The small country of Arulco has been taken ...,<p>The small country of Arulco has been taken ...,The small country of Arulco has been taken ove...,,English,https://steamcdn-a.akamaihd.net/steam/apps/162...,http://www.jaggedalliance2.com/,{'minimum': '<p><strong>Minimum Configuration:...,[],[],,,,['Strategy First'],['Strategy First'],,"{'currency': 'GBP', 'initial': 1499, 'final': ...",[94],"[{'name': 'default', 'title': 'Buy Jagged Alli...","{'windows': True, 'mac': False, 'linux': False}",,,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '2', 'description': 'Strategy'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,,{'total': 0},"{'coming_soon': False, 'date': '6 Jul, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/162...,"{'ids': [], 'notes': None}"
493,game,Crazy Machines 1.5,18430,0.0,False,,,Create Your Own Zany &quot;Rube Goldberg&quot;...,Create Your Own Zany &quot;Rube Goldberg&quot;...,Create Your Own Zany &quot;Rube Goldberg&quot;...,,English,https://steamcdn-a.akamaihd.net/steam/apps/184...,,"{'minimum': '<ul class=""bb_ul""><li><strong>OS:...",[],[],,,,['Fakt Software'],['Viva Media'],,"{'currency': 'GBP', 'initial': 699, 'final': 6...","[1242, 58401]","[{'name': 'default', 'title': 'Buy Crazy Machi...","{'windows': True, 'mac': False, 'linux': False}","{'score': 78, 'url': 'https://www.metacritic.c...",,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,,{'total': 0},"{'coming_soon': False, 'date': '12 Dec, 2008'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/184...,"{'ids': [], 'notes': None}"
494,game,Crazy Machines 1.5,18430,0.0,False,,,Create Your Own Zany &quot;Rube Goldberg&quot;...,Create Your Own Zany &quot;Rube Goldberg&quot;...,Create Your Own Zany &quot;Rube Goldberg&quot;...,,English,https://steamcdn-a.akamaihd.net/steam/apps/184...,,"{'minimum': '<ul class=""bb_ul""><li><strong>OS:...",[],[],,,,['Fakt Software'],['Viva Media'],,"{'currency': 'GBP', 'initial': 699, 'final': 6...","[1242, 58401]","[{'name': 'default', 'title': 'Buy Crazy Machi...","{'windows': True, 'mac': False, 'linux': False}","{'score': 78, 'url': 'https://www.metacritic.c...",,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,,{'total': 0},"{'coming_soon': False, 'date': '12 Dec, 2008'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/184...,"{'ids': [], 'notes': None}"
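
With no arguments, `duplicated()` and `drop_duplicates()` only flag rows that match an earlier row in every column. If the same AppID had been collected twice with slightly different values, those rows would not be caught. The sketch below, on an invented toy frame, shows how the `subset` and `keep` parameters change what is flagged; whether such partial duplicates exist in our data is not checked here.

```python
import pandas as pd

# Toy data: appid 20 is an exact duplicate, appid 30 repeats with a different name
toy = pd.DataFrame({
    "steam_appid": [10, 20, 20, 30, 30],
    "name": ["A", "B", "B", "C", "C (updated)"],
})

# Default: flags rows identical in every column to an earlier row
full_dupes = toy.duplicated()

# subset: flags rows that repeat an appid, even when other columns differ
appid_dupes = toy.duplicated(subset="steam_appid")

# keep=False: flags every member of a duplicated group, useful for inspection
all_copies = toy.duplicated(subset="steam_appid", keep=False)

# Keep the first row seen for each appid
deduped = toy.drop_duplicates(subset="steam_appid", keep="first")
print(full_dupes.sum(), appid_dupes.sum(), all_copies.sum(), len(deduped))
```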


Here we define and run our functions to handle everything we just looked at. We also define a general `process` function which will run all of our processing functions on the data set, allowing us to slowly add to it as we build out to processing more columns. Finally we run this function on our raw data, inspecting the first few rows and viewing how many rows and columns we have dropped.

In [11]:
def process_null_cols(df, thresh=0.5):
    """Drop columns with more than a certain proportion of missing values (Default 50%)."""
    # dropna's thresh is the minimum count of non-null values needed to keep a column
    cutoff_count = len(df) * (1 - thresh)
    
    return df.dropna(thresh=cutoff_count, axis=1)


def drop_null_rows(df, col):
    """Drop rows with null values in a particular column."""
    return df[df[col].notnull()]


def process_type(df):
    """Remove rows with null values for type column, then drop the column."""
    df = drop_null_rows(df, 'type')
    df = df.drop('type', axis=1)
    
    return df
    
    
def process_name(df):
    """Remove rows with null values or 'none' in name column."""
    df = drop_null_rows(df, 'name')
    df = df[df['name'] != 'none']
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    
    return df

print(raw_steam_data.shape)
steam_data = process(raw_steam_data)
print(steam_data.shape)
steam_data.head()

(29235, 39)
(29075, 28)


Unnamed: 0,name,steam_appid,required_age,is_free,detailed_description,about_the_game,short_description,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,price_overview,packages,package_groups,platforms,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors
0,Counter-Strike,10,0.0,False,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,"English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],"{'currency': 'GBP', 'initial': 719, 'final': 7...",[7],"[{'name': 'default', 'title': 'Buy Counter-Str...","{'windows': True, 'mac': True, 'linux': True}","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'url': 'http://steamcommunity.com/app/10', 'e...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
1,Team Fortress Classic,20,0.0,False,One of the most popular online action games of...,One of the most popular online action games of...,One of the most popular online action games of...,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/20/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],"{'currency': 'GBP', 'initial': 399, 'final': 3...",[29],"[{'name': 'default', 'title': 'Buy Team Fortre...","{'windows': True, 'mac': True, 'linux': True}","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/20/...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
2,Day of Defeat,30,0.0,False,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,"English, French, German, Italian, Spanish - Spain",https://steamcdn-a.akamaihd.net/steam/apps/30/...,http://www.dayofdefeat.com/,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],"{'currency': 'GBP', 'initial': 399, 'final': 3...",[30],"[{'name': 'default', 'title': 'Buy Day of Defe...","{'windows': True, 'mac': True, 'linux': True}","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/30/...,"{'ids': [], 'notes': None}"
3,Deathmatch Classic,40,0.0,False,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/40/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],"{'currency': 'GBP', 'initial': 399, 'final': 3...",[31],"[{'name': 'default', 'title': 'Buy Deathmatch ...","{'windows': True, 'mac': True, 'linux': True}","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/40/...,"{'ids': [], 'notes': None}"
4,Half-Life: Opposing Force,50,0.0,False,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,"English, French, German, Korean",https://steamcdn-a.akamaihd.net/steam/apps/50/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Gearbox Software'],['Valve'],"{'currency': 'GBP', 'initial': 399, 'final': 3...",[32],"[{'name': 'default', 'title': 'Buy Half-Life: ...","{'windows': True, 'mac': True, 'linux': True}","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'url': 'https://help.steampowered.com', 'emai...",https://steamcdn-a.akamaihd.net/steam/apps/50/...,"{'ids': [], 'notes': None}"
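
Because every processing function takes a dataframe and returns a dataframe, the same steps could also be chained with `DataFrame.pipe`. This is only a sketch of an alternative style, using two hypothetical toy steps (`drop_untitled` and `add_name_length`) rather than the functions defined in this notebook.

```python
import pandas as pd


def drop_untitled(df):
    """Toy step: remove rows without a name."""
    return df[df["name"].notnull()]


def add_name_length(df):
    """Toy step: add a derived column holding the length of each name."""
    return df.assign(name_length=df["name"].str.len())


toy = pd.DataFrame({"name": ["Counter-Strike", None, "Day of Defeat"]})

# Each step receives the output of the previous one
cleaned = toy.pipe(drop_untitled).pipe(add_name_length)
print(cleaned)
```

A single `process` function, as used here, keeps the order of steps in one visible place; `pipe` is simply another way of expressing the same sequence.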


### Processing age

The next column we will look at is 'required_age'. We can see that it is already numeric (stored as floats, since the raw column contained missing values), and values range from 0 to 20, with one likely error (1818).

In [12]:
steam_data['required_age'].value_counts().sort_index()

0.0       28431
1.0           1
3.0          10
4.0           2
5.0           1
6.0           1
7.0           8
10.0          3
11.0          4
12.0         72
13.0         21
14.0          4
15.0         39
16.0        141
17.0         47
18.0        288
20.0          1
1818.0        1
Name: required_age, dtype: int64

Whilst fairly useful in its current state, we may benefit from reducing the number of categories that ages fall into. Instead of comparing games rated as 5, 6, 7 or 8, we could compare games rated 5+ or 8+, for example.

To decide which categories (or bins) we should use, we will look at the [PEGI age ratings](https://pegi.info/) as this is the system used in the United Kingdom, where we're performing our analysis. We can see that ratings fall into one of five categories (3, 7, 12, 16, 18), defining the minimum age required to buy a game.

Using this to inform our decision, we can use the [pandas.cut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html) function to sort our data into each of these categories. As our erroneous row (1818) is most likely meant to be rated 18 anyway, we can set our upper bound above this value to catch it inside this category.
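
To make the binning behaviour concrete, here is a small sketch on a handful of made-up ages. By default `pd.cut` treats each bin as open on the left and closed on the right, so with the cut points below an age of 3 falls in the first bin while 4 falls in the second. The lower bound of -1 is there so that an age of 0 is included.

```python
import pandas as pd

# A few illustrative ages, including the erroneous 1818
ages = pd.Series([0, 3, 4, 7, 13, 16, 17, 1818])

cut_points = [-1, 3, 7, 12, 16, 2000]
label_values = [3, 7, 12, 16, 18]

# Bins are (-1, 3], (3, 7], (7, 12], (12, 16], (16, 2000]
binned = pd.cut(ages, bins=cut_points, labels=label_values)
print(binned.astype(int).tolist())
# → [3, 3, 7, 7, 16, 16, 18, 18]
```

One consequence of this scheme is that ages are rounded up to the next PEGI category, so a game listed as 13 ends up in the 16 bin.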


In [13]:
def process_age(df):
    """Format ratings in age column to be in line with the PEGI Age Ratings system."""
    # PEGI Age ratings: 3, 7, 12, 16, 18
    cut_points = [-1, 3, 7, 12, 16, 2000]
    label_values = [3, 7, 12, 16, 18]
    
    df['required_age'] = pd.cut(df['required_age'], bins=cut_points, labels=label_values)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    
    return df


steam_data = process(raw_steam_data)
steam_data['required_age'].value_counts().sort_index()

3     28442
7        12
12       79
16      205
18      337
Name: required_age, dtype: int64

### Processing the platforms column

Whilst we could look at the next column in our dataframe, is_free, it would make sense that this is intrinsically linked to the price_overview column. Ultimately we may wish to combine these columns into one, where free games have a price of 0. Looking at the price_overview column, we can see it is stored in a dictionary-like structure, with multiple keys and values. Handling this may be quite tricky, so instead we'll look at a simpler example.

The platforms column appears to contain a key for each of the main operating systems (windows, mac and linux) and a corresponding boolean value, set to True or False depending on whether the game is available on that platform. This should be a reasonably straightforward place to start, and we can separate this data out into three columns, one for each platform, filled with boolean values.

In [14]:
steam_data['platforms'].head()

0    {'windows': True, 'mac': True, 'linux': True}
1    {'windows': True, 'mac': True, 'linux': True}
2    {'windows': True, 'mac': True, 'linux': True}
3    {'windows': True, 'mac': True, 'linux': True}
4    {'windows': True, 'mac': True, 'linux': True}
Name: platforms, dtype: object

So far the cleaning process has been relatively simple, requiring mainly checking for null values and dropping some rows or columns. Already we can see that handling the platforms will be a little more complex.

Our first hurdle is getting Python to recognise the data in the columns as dictionaries rather than just strings. This will allow us to access the different values separately, without having to do some unnecessarily complicated string formatting. As we can see below, even though the data looks like a dictionary it is in fact stored as a string.

In [15]:
print(type(steam_data['platforms'].iloc[0]))

steam_data['platforms'].iloc[0]

<class 'str'>


"{'windows': True, 'mac': True, 'linux': True}"

We can get around this using the handy [literal_eval](https://docs.python.org/3/library/ast.html#ast.literal_eval) function from the built-in `ast` module. As the name suggests, this will evaluate the string as a Python literal, giving us a real dictionary that we can index into.

In [16]:
print(type(literal_eval(steam_data['platforms'].iloc[0])))

literal_eval(steam_data['platforms'].iloc[0])['windows']

<class 'dict'>


True
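As a small standalone sketch of why `literal_eval` is a reasonable choice here (the strings below are made up for illustration, not taken from the data set): it only accepts plain Python literals, so a string containing a function call is rejected rather than executed.

```python
from ast import literal_eval

# a made-up string in the same shape as the platforms column
raw = "{'windows': True, 'mac': True, 'linux': False}"
parsed = literal_eval(raw)
print(type(parsed), parsed['linux'])

# anything that is not a plain literal raises ValueError instead of running
try:
    literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True

print(rejected)
```

This is the main reason to prefer it over the built-in `eval` when parsing text that came from an external source.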

We also need to check for null values, but fortunately there aren't any in this column.

In [17]:
steam_data['platforms'].isnull().sum()

0

Putting this all together, we'll be using the pandas [Series.apply](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html) method to help us quickly evaluate all of the rows, then we'll be calling `apply` again for each platform to create our new columns.

We could return the True/False value directly and store the values as boolean types, but since we'll be exporting the cleaned data to a csv file, let's store them as integers as this should reduce the file size slightly. Setting True as 1 and False as 0 can still be interpreted as a boolean type, but less data is used to store the information.
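To get a feel for the size difference, here is a minimal sketch using a toy dataframe (the values are invented, and the variable names are just for illustration). It writes the same flags to csv once as booleans and once as integers, then checks the integer version converts back cleanly.

```python
import io

import pandas as pd

# toy platform flags, standing in for the real columns
flags = pd.DataFrame({'windows': [True, True, False],
                      'mac': [True, False, False]})

bool_csv = flags.to_csv(index=False)             # writes True/False
int_csv = flags.astype(int).to_csv(index=False)  # writes 1/0

print(len(bool_csv), len(int_csv))

# the integer version converts back to booleans without loss
restored = pd.read_csv(io.StringIO(int_csv)).astype(bool)
```

The saving per value is only a few characters, so it only becomes noticeable across many rows and columns.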

In [18]:
def process_platforms(df):
    """Split platforms column into separate boolean columns for each platform."""
    # evaluate values in platforms column, so can index into dictionaries
    df['platforms'] = df['platforms'].apply(literal_eval)
    
    # loop across keys, the platforms, which we'll turn into columns
    # (use .iloc[0] so this still works if the row labelled 0 has been dropped)
    for platform in df['platforms'].iloc[0].keys():
        # set 1 if value for platform in original column is True, or 0 if it is False
        df[platform] = df['platforms'].apply(lambda x: 1 if x[platform] else 0)
    
    # remove the original platforms column
    df = df.drop('platforms', axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    
    return df


steam_data = process(raw_steam_data)
steam_data[['name', 'windows', 'mac', 'linux']].head()

Unnamed: 0,name,windows,mac,linux
0,Counter-Strike,1,1,1
1,Team Fortress Classic,1,1,1
2,Day of Defeat,1,1,1
3,Deathmatch Classic,1,1,1
4,Half-Life: Opposing Force,1,1,1
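As an aside, the same split can be sketched without looping over the keys, by building a dataframe directly from the parsed dictionaries. This is an alternative to `process_platforms`, not the approach used in the rest of this notebook, and the toy data below is invented.

```python
from ast import literal_eval

import pandas as pd

# toy data in the same shape as the raw platforms column
toy = pd.DataFrame({
    'name': ['Game A', 'Game B'],
    'platforms': ["{'windows': True, 'mac': False, 'linux': False}",
                  "{'windows': True, 'mac': True, 'linux': True}"],
})

# parse each string, then let pandas turn the dictionaries into columns
parsed = toy['platforms'].apply(literal_eval)
platform_cols = pd.DataFrame(parsed.tolist(), index=toy.index).astype(int)

# swap the original column for the three new ones
expanded = toy.drop(columns='platforms').join(platform_cols)
print(expanded)
```

One difference worth noting: this version takes the column names from every row rather than just the first, so a row with an unexpected key would add an extra column instead of being ignored.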


### Processing price

Now we have built up some intuition around how to deal with the data stored as dictionaries, let's return to the `is_free` and `price_overview` columns as we should now be able to handle them.

First let's check how many null values there are in `price_overview`.

In [19]:
steam_data['price_overview'].isnull().sum()

3559

Whilst that looks like a lot, we have to consider the impact that the `is_free` column might be having. Before jumping to conclusions, let's check if there are any rows with `is_free` marked as True and null values in the `price_overview` column.

In [20]:
free_and_null_price = steam_data[(steam_data['is_free']) & (steam_data['price_overview'].isnull())]

print(free_and_null_price.shape[0])
free_and_null_price.head()

2713


Unnamed: 0,name,steam_appid,required_age,is_free,detailed_description,about_the_game,short_description,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,price_overview,packages,package_groups,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors,windows,mac,linux
14,Half-Life 2: Lost Coast,340,3,True,Originally planned as a section of the Highway...,Originally planned as a section of the Highway...,Originally planned as a section of the Highway...,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/340...,http://www.half-life2.com,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,[],[],['Valve'],['Valve'],,,[],"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '27 Oct, 2005'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/340...,"{'ids': [], 'notes': None}",1,1,1
19,Team Fortress 2,440,3,True,"<h1>The Jungle Inferno Update</h1><p><a href=""...","<p><strong>""The most fun you can have online""<...",Nine distinct classes provide a broad range of...,"English<strong>*</strong>, Danish, Dutch, Finn...",https://steamcdn-a.akamaihd.net/steam/apps/440...,http://www.teamfortress.com/,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,['Valve'],['Valve'],,"[197845, 330198, 469]","[{'name': 'default', 'title': 'Buy Team Fortre...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256698790, 'name': 'Jungle Inferno', '...","{'total': 520, 'highlighted': [{'name': 'Head ...","{'coming_soon': False, 'date': '10 Oct, 2007'}","{'url': 'http://steamcommunity.com/app/440', '...",https://steamcdn-a.akamaihd.net/steam/apps/440...,"{'ids': [2, 5], 'notes': 'Includes cartoon vio...",1,1,1
22,Dota 2,570,3,True,<strong>The most-played game on Steam.</strong...,<strong>The most-played game on Steam.</strong...,"Every day, millions of players worldwide enter...","Bulgarian, Czech, Danish, Dutch, English<stron...",https://steamcdn-a.akamaihd.net/steam/apps/570...,http://www.dota2.com/,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,['Valve'],['Valve'],,"[197846, 330209]","[{'name': 'default', 'title': 'Buy Dota 2', 'd...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256692021, 'name': 'Dota 2 - Join the ...",,"{'coming_soon': False, 'date': '9 Jul, 2013'}","{'url': 'http://dev.dota2.com/', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/570...,"{'ids': [], 'notes': None}",1,1,1
24,Alien Swarm,630,3,True,Alien Swarm is a game and Source SDK release f...,Alien Swarm is a game and Source SDK release f...,Co-operative multiplayer game and complete cod...,English,https://steamcdn-a.akamaihd.net/steam/apps/630...,http://www.alienswarm.com,{'minimum': '<strong>Minimum:</strong><br>\t\t...,[],[],['Valve'],['Valve'],,,[],"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,"{'total': 66, 'highlighted': [{'name': 'Clear ...","{'coming_soon': False, 'date': '19 Jul, 2010'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/630...,"{'ids': [], 'notes': None}",1,0,0
25,Counter-Strike: Global Offensive,730,3,True,Counter-Strike: Global Offensive (CS: GO) expa...,Counter-Strike: Global Offensive (CS: GO) expa...,Counter-Strike: Global Offensive (CS: GO) expa...,"Czech, Danish, Dutch, English<strong>*</strong...",https://steamcdn-a.akamaihd.net/steam/apps/730...,http://blog.counter-strike.net/,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,"['Valve', 'Hidden Path Entertainment']",['Valve'],,"[329385, 298963, 54029]","[{'name': 'default', 'title': 'Buy Counter-Str...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 81958, 'name': 'CS:GO Trailer Long', '...","{'total': 167, 'highlighted': [{'name': 'Someo...","{'coming_soon': False, 'date': '21 Aug, 2012'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/730...,"{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1


Turns out this accounts for most of our null values in the `price_overview` column, meaning we can handle these by setting our final price as 0. This means that there are almost 850 rows which aren't free but have null values in the `price_overview` column. Let's investigate those.
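The figure comes from simple subtraction: 3559 null prices minus 2713 free games with a null price leaves 846, assuming `is_free` itself has no missing values. The same breakdown can be counted directly with boolean masks, sketched here on an invented toy dataframe rather than the real data.

```python
import pandas as pd

# toy data: two free games and one paid game are missing a price
toy = pd.DataFrame({
    'is_free': [True, True, False, False, False],
    'price_overview': [None, None, None,
                       "{'currency': 'GBP'}", "{'currency': 'GBP'}"],
})

missing = toy['price_overview'].isnull()

free_missing = int((toy['is_free'] & missing).sum())
paid_missing = int((~toy['is_free'] & missing).sum())

print(free_missing, paid_missing)
```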

In [21]:
not_free_and_null_price = steam_data[(steam_data['is_free'] == False) & (steam_data['price_overview'].isnull())]

not_free_and_null_price.head()

Unnamed: 0,name,steam_appid,required_age,is_free,detailed_description,about_the_game,short_description,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,price_overview,packages,package_groups,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors,windows,mac,linux
63,The Ship: Single Player,2420,3,False,For PC gamers who enjoy multiplayer games with...,For PC gamers who enjoy multiplayer games with...,The Ship is a murder mystery alternative to tr...,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/242...,http://www.blazinggriffin.com/games/the-ship-m...,{'minimum': '<strong>Minimum:</strong> 1.8 GHz...,[],[],['Outerlight Ltd.'],['Blazing Griffin Ltd.'],,[56669],"[{'name': 'default', 'title': 'Buy The Ship: S...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 2035597, 'name': 'the Ship: Intro', '...",{'total': 0},"{'coming_soon': False, 'date': '20 Nov, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/242...,"{'ids': [], 'notes': None}",1,0,0
75,RollerCoaster Tycoon® 3: Platinum,2700,3,False,Rollercoaster Tycoon 3 Platinum combines the e...,Rollercoaster Tycoon 3 Platinum combines the e...,Rollercoaster Tycoon 3 Platinum combines the e...,"English, French, Italian, German, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/270...,http://www.atari.com/rollercoastertycoon/us/in...,{'minimum': '<strong>Minimum: </strong><br>\t\...,"{'minimum': '<ul class=""bb_ul""><li><strong>OS:...",[],"['Frontier', 'Aspyr (Mac)']","['Atari', 'Aspyr (Mac)']",,,[],"[{'id': 2, 'description': 'Single-player'}]","[{'id': '28', 'description': 'Simulation'}, {'...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '12 Mar, 2008'}","{'url': 'http://www.atari.com/support/atari', ...",https://steamcdn-a.akamaihd.net/steam/apps/270...,"{'ids': [], 'notes': None}",1,1,0
220,BioShock™,7670,3,False,<h1>Special Offer</h1><p>Buying BioShock™ also...,BioShock is a shooter unlike any you've ever p...,BioShock is a shooter unlike any you've ever p...,"English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/767...,http://www.BioShockGame.com,"{'minimum': '<h2 class=""bb_tag""><strong>Minimu...",{'minimum': 'Please See BioShock Remastered'},[],"['2K Boston', '2K Australia']",['2K'],,"[451, 127633]","[{'name': 'default', 'title': 'Buy BioShock™',...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '21 Aug, 2007'}","{'url': 'support.2k.com', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/767...,"{'ids': [], 'notes': None}",1,0,0
234,Sam & Max 101: Culture Shock,8200,3,False,<strong>Sam &amp; Max: Episode 1 - Culture Sho...,<strong>Sam &amp; Max: Episode 1 - Culture Sho...,Sam &amp; Max: Episode 1 - Culture Shock - The...,"English, French, German, Italian",https://steamcdn-a.akamaihd.net/steam/apps/820...,http://store.steampowered.com/app/901660/,"{'minimum': 'Windows XP or Vista, 1.5GHz proce...",[],[],['Telltale Games'],['Telltale Games'],,"[357, 539]","[{'name': 'default', 'title': 'Buy Sam & Max 1...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '15 Jun, 2007'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/820...,"{'ids': [], 'notes': None}",1,0,0
235,Sam & Max 102: Situation: Comedy,8210,3,False,<strong>Sam &amp; Max: Episode 2 - Situation: ...,<strong>Sam &amp; Max: Episode 2 - Situation: ...,Sam &amp; Max: Episode 2 - Situation: Comedy -...,"English, German, Italian",https://steamcdn-a.akamaihd.net/steam/apps/821...,http://store.steampowered.com/app/901660/,"{'minimum': 'Windows XP or Vista, 1.5GHz proce...",[],[],['Telltale Games'],['Telltale Games'],,"[358, 539]","[{'name': 'default', 'title': 'Buy Sam & Max 1...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '15 Jun, 2007'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/821...,"{'ids': [], 'notes': None}",1,0,0


The first few rows contain big, well-known games which appear to have pretty complete data. It looks like we can rule out data errors, so let's dig a little deeper and see if we can find out what is going on.

We'll start by looking at the store pages for some of these titles. The url to an app on the steam website follows this structure:

    https://store.steampowered.com/app/[steam_appid]

This means we can easily generate these links using our above filter. We'll wrap it up in a function in case we want to use it later.

In [22]:
def print_steam_links(df):
    """Print links to store page for apps in a dataframe."""
    url_base = "https://store.steampowered.com/app/"
    
    for i, row in df.iterrows():
        appid = row['steam_appid']
        name = row['name']
        
        print(name + ':', url_base + str(appid))
        

print_steam_links(not_free_and_null_price[:5])

The Ship: Single Player: https://store.steampowered.com/app/2420
RollerCoaster Tycoon® 3: Platinum: https://store.steampowered.com/app/2700
BioShock™: https://store.steampowered.com/app/7670
Sam & Max 101: Culture Shock: https://store.steampowered.com/app/8200
Sam & Max 102: Situation: Comedy: https://store.steampowered.com/app/8210


For these games we can conclude that:

- The Ship: Single Player is a tutorial, and comes as part of The Ship: Murder Party
- RollerCoaster Tycoon 3: Platinum has been removed from Steam (and from another game store, GOG)  
  - "A spokesperson for GOG told Eurogamer it pulled the game "due to expiring licensing rights", and stressed it'll talk with "new distribution rights holders" to bring the game back as soon as possible." Source: [Eurogamer](https://www.eurogamer.net/articles/2018-05-09-rollercoaster-tycoon-3-pulled-from-steam-gog)
- BioShock has been replaced by BioShock Remastered
- Sam & Max 101 is sold as part of a season, and this can be found in the `package_groups` column

So we have a few options here. We could just drop these rows, we could try to figure out the price based on the `package_groups` column, or we could leave them for now and return to them later, which is what we will do. It may be that some or all of these rows are removed later in the cleaning process for other reasons.

Below we can view the games with similar names to the games we investigated, to help get an idea of what is happening.

In [23]:
steam_data[steam_data['name'].str.contains("The Ship:")]

Unnamed: 0,name,steam_appid,required_age,is_free,detailed_description,about_the_game,short_description,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,price_overview,packages,package_groups,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors,windows,mac,linux
62,The Ship: Murder Party,2400,3,True,<h1>Finding a Server</h1><p><strong>Ahoy Shipm...,"<strong>This package includes a tutorial, The ...",The Ship is a murder mystery multiplayer.,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/240...,http://www.blazinggriffin.com/games/the-ship-m...,{'minimum': '<strong>Minimum:</strong> 1.8 GHz...,[],[],['Outerlight Ltd.'],['Blazing Griffin Ltd.'],"{'currency': 'GBP', 'initial': 699, 'final': 6...",[56669],"[{'name': 'default', 'title': 'Buy The Ship: M...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 2034912, 'name': 'Single Player Intro'...",{'total': 0},"{'coming_soon': False, 'date': '11 Jul, 2006'}","{'url': 'http://www.blazinggriffin.com/', 'ema...",https://steamcdn-a.akamaihd.net/steam/apps/240...,"{'ids': [], 'notes': None}",1,0,0
63,The Ship: Single Player,2420,3,False,For PC gamers who enjoy multiplayer games with...,For PC gamers who enjoy multiplayer games with...,The Ship is a murder mystery alternative to tr...,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/242...,http://www.blazinggriffin.com/games/the-ship-m...,{'minimum': '<strong>Minimum:</strong> 1.8 GHz...,[],[],['Outerlight Ltd.'],['Blazing Griffin Ltd.'],,[56669],"[{'name': 'default', 'title': 'Buy The Ship: S...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 2035597, 'name': 'the Ship: Intro', '...",{'total': 0},"{'coming_soon': False, 'date': '20 Nov, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/242...,"{'ids': [], 'notes': None}",1,0,0
6722,The Ship: Remasted,383790,3,False,<h1>Now Includes World Leaders!</h1><p>Not onl...,The Ship: Remasted is a remake of the classic ...,You find yourself aboard a series of luxury 19...,English<strong>*</strong><br><strong>*</strong...,https://steamcdn-a.akamaihd.net/steam/apps/383...,http://www.blazinggriffin.com/games/the-ship-r...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,['Blazing Griffin'],['Blazing Griffin'],"{'currency': 'GBP', 'initial': 699, 'final': 6...",[253227],"[{'name': 'default', 'title': 'Buy The Ship: R...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256673834, 'name': 'All Aboard!', 'thu...","{'total': 22, 'highlighted': [{'name': 'Gone o...","{'coming_soon': False, 'date': '31 Oct, 2016'}","{'url': 'http://www.blazinggriffin.com/', 'ema...",https://steamcdn-a.akamaihd.net/steam/apps/383...,"{'ids': [], 'notes': None}",1,1,1


In [24]:
steam_data[steam_data['name'].str.contains("BioShock™")]

Unnamed: 0,name,steam_appid,required_age,is_free,detailed_description,about_the_game,short_description,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,price_overview,packages,package_groups,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors,windows,mac,linux
220,BioShock™,7670,3,False,<h1>Special Offer</h1><p>Buying BioShock™ also...,BioShock is a shooter unlike any you've ever p...,BioShock is a shooter unlike any you've ever p...,"English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/767...,http://www.BioShockGame.com,"{'minimum': '<h2 class=""bb_tag""><strong>Minimu...",{'minimum': 'Please See BioShock Remastered'},[],"['2K Boston', '2K Australia']",['2K'],,"[451, 127633]","[{'name': 'default', 'title': 'Buy BioShock™',...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '21 Aug, 2007'}","{'url': 'support.2k.com', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/767...,"{'ids': [], 'notes': None}",1,0,0
7734,BioShock™ Remastered,409710,18,False,<h1>Special Offer</h1><p>Buying BioShock™ Rema...,BioShock is a shooter unlike any you've ever p...,"BioShock is a shooter unlike any other, loaded...","English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/409...,http://www.BioShockGame.com,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,"['2K Boston', '2K Australia', 'Blind Squirrel'...","['2K', 'Feral Interactive (Mac)']","{'currency': 'GBP', 'initial': 999, 'final': 9...","[451, 127633]","[{'name': 'default', 'title': 'Buy BioShock™ R...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,"{'total': 65, 'highlighted': [{'name': 'Comple...","{'coming_soon': False, 'date': '15 Sep, 2016'}","{'url': 'support.2k.com', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/409...,"{'ids': [], 'notes': None}",1,1,0
7735,BioShock™ 2 Remastered,409720,18,False,<h1>Special Offer</h1><p>Buying BioShock 2™ Re...,BioShock 2 provides players with the perfect b...,"In BioShock 2, you step into the boots of the ...","English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/409...,http://www.bioshockgame.com/,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,"['2K Marin', '2K China', 'Digital Extremes', '...",['2K'],"{'currency': 'GBP', 'initial': 1399, 'final': ...","[81419, 127633]","[{'name': 'default', 'title': 'Buy BioShock™ 2...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,"{'total': 53, 'highlighted': [{'name': ""Daddy'...","{'coming_soon': False, 'date': '15 Sep, 2016'}","{'url': 'support.2k.com', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/409...,"{'ids': [5], 'notes': None}",1,0,0


In [25]:
steam_data[steam_data['name'].str.contains("Sam & Max 1")]

Unnamed: 0,name,steam_appid,required_age,is_free,detailed_description,about_the_game,short_description,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,price_overview,packages,package_groups,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors,windows,mac,linux
234,Sam & Max 101: Culture Shock,8200,3,False,<strong>Sam &amp; Max: Episode 1 - Culture Sho...,<strong>Sam &amp; Max: Episode 1 - Culture Sho...,Sam &amp; Max: Episode 1 - Culture Shock - The...,"English, French, German, Italian",https://steamcdn-a.akamaihd.net/steam/apps/820...,http://store.steampowered.com/app/901660/,"{'minimum': 'Windows XP or Vista, 1.5GHz proce...",[],[],['Telltale Games'],['Telltale Games'],,"[357, 539]","[{'name': 'default', 'title': 'Buy Sam & Max 1...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '15 Jun, 2007'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/820...,"{'ids': [], 'notes': None}",1,0,0
235,Sam & Max 102: Situation: Comedy,8210,3,False,<strong>Sam &amp; Max: Episode 2 - Situation: ...,<strong>Sam &amp; Max: Episode 2 - Situation: ...,Sam &amp; Max: Episode 2 - Situation: Comedy -...,"English, German, Italian",https://steamcdn-a.akamaihd.net/steam/apps/821...,http://store.steampowered.com/app/901660/,"{'minimum': 'Windows XP or Vista, 1.5GHz proce...",[],[],['Telltale Games'],['Telltale Games'],,"[358, 539]","[{'name': 'default', 'title': 'Buy Sam & Max 1...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '15 Jun, 2007'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/821...,"{'ids': [], 'notes': None}",1,0,0
236,"Sam & Max 103: The Mole, the Mob and the Meatball",8220,3,False,"<strong>Sam &amp; Max Episode 3 - The Mole, Th...","<strong>Sam &amp; Max Episode 3 - The Mole, Th...","Sam &amp; Max Episode 3 - The Mole, The Mob, a...","English, French, German, Italian",https://steamcdn-a.akamaihd.net/steam/apps/822...,http://store.steampowered.com/app/901660/,"{'minimum': 'Windows XP or Vista, 1.5GHz proce...",[],[],['Telltale Games'],['Telltale Games'],,"[359, 539]","[{'name': 'default', 'title': 'Buy Sam & Max 1...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '15 Jun, 2007'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/822...,"{'ids': [], 'notes': None}",1,0,0
237,Sam & Max 104: Abe Lincoln Must Die!,8230,3,False,<strong>Sam &amp; Max Episode 4 - Abe Lincoln ...,<strong>Sam &amp; Max Episode 4 - Abe Lincoln ...,Sam &amp; Max Episode 4 - Abe Lincoln Must Die...,"English, French, German, Italian",https://steamcdn-a.akamaihd.net/steam/apps/823...,http://store.steampowered.com/app/901660/,"{'minimum': 'Windows XP or Vista, 1.5GHz proce...",[],[],['Telltale Games'],['Telltale Games'],,"[360, 539]","[{'name': 'default', 'title': 'Buy Sam & Max 1...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '15 Jun, 2007'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/823...,"{'ids': [], 'notes': None}",1,0,0
238,Sam & Max 105: Reality 2.0,8240,3,False,With an internet crisis looming and a viral vi...,With an internet crisis looming and a viral vi...,With an internet crisis looming and a viral vi...,"English, French, German, Italian",https://steamcdn-a.akamaihd.net/steam/apps/824...,http://store.steampowered.com/app/901660/,"{'minimum': 'Windows XP or Vista, 1.5GHz proce...",[],[],['Telltale Games'],['Telltale Games'],,"[361, 539]","[{'name': 'default', 'title': 'Buy Sam & Max 1...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '15 Jun, 2007'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/824...,"{'ids': [], 'notes': None}",1,0,0
239,Sam & Max 106: Bright Side of the Moon,8250,3,False,<strong>Sam &amp; Max: Episode 6 - Bright Side...,<strong>Sam &amp; Max: Episode 6 - Bright Side...,Sam &amp; Max: Episode 6 - Bright Side of the ...,"English, French, German, Italian",https://steamcdn-a.akamaihd.net/steam/apps/825...,http://store.steampowered.com/app/901660/,"{'minimum': 'Windows XP or Vista, 1.5GHz proce...",[],[],['Telltale Games'],['Telltale Games'],,"[362, 539]","[{'name': 'default', 'title': 'Buy Sam & Max 1...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '15 Jun, 2007'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/825...,"{'ids': [], 'notes': None}",1,0,0


Finally, if we take a look at the data for the first row, we can see that the price is stored in a variety of formats. We have a `currency`, which is GBP, perfect as we are performing our analysis in the UK. Next we have a number of different values for the price, so which one do we use?

In [26]:
steam_data['price_overview'][0]

"{'currency': 'GBP', 'initial': 719, 'final': 719, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': '£7.19'}"

If we inspect another row, we see that there is an active discount, applying an 80% discount to the title. It looks like `initial` contains the normal price before discount, and `final` contains the discounted price. `initial_formatted` and `final_formatted` contain the price displayed in the currency. We don't have to worry about these, as we'll be storing the price as an integer (or float) and, if we really wanted, could format it like this when printing.

With all this in mind, it looks like we'll be checking the value under the currency key, and using the value in the initial key.
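To illustrate the point about formatting, here is a small sketch that rebuilds the display strings from the integer values. The string is copied from the discounted row discussed above, and `format_gbp` is a hypothetical helper written for this example, not part of the cleaning code.

```python
from ast import literal_eval

raw = ("{'currency': 'GBP', 'initial': 2299, 'final': 459, "
       "'discount_percent': 80, 'initial_formatted': '£22.99', "
       "'final_formatted': '£4.59'}")
price = literal_eval(raw)


def format_gbp(pence):
    """Format an integer number of pence as a pounds string (hypothetical helper)."""
    # integer arithmetic avoids any floating point rounding
    return f"£{pence // 100}.{pence % 100:02d}"


print(format_gbp(price['initial']), format_gbp(price['final']))
```

Since the formatted strings can be recovered like this, dropping them loses no information.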

In [27]:
steam_data['price_overview'][37]

"{'currency': 'GBP', 'initial': 2299, 'final': 459, 'discount_percent': 80, 'initial_formatted': '£22.99', 'final_formatted': '£4.59'}"

Now the preliminary investigation is complete we can begin defining our function. 

We start by evaluating the strings using `literal_eval` as before. However, if there is a null value (which raises a `ValueError`, caught by the try/except block), we return a properly formatted dictionary with -1 for the `initial` value. This will allow us to fill in a value of 0 for free games, then be left with an easily targetable value for the null rows.

In [28]:
def process_price(df):
    df = df.copy()
        
    def parse_price(x):
        try:
            return literal_eval(x)
        except ValueError:
            return {'currency': 'GBP', 'initial': -1}
    
    # evaluate as dictionary and set to -1 if missing
    df['price_overview'] = df['price_overview'].apply(parse_price)
    
    # Create columns from currency and initial values
    df['currency'] = df['price_overview'].apply(lambda x: x['currency'])
    df['price'] = df['price_overview'].apply(lambda x: x['initial'])
    
    # Set price of free games to 0
    df.loc[df['is_free'], 'price'] = 0
    
    return df

price_data = process_price(steam_data)[['name', 'currency', 'price']]
price_data.head()

Unnamed: 0,name,currency,price
0,Counter-Strike,GBP,719
1,Team Fortress Classic,GBP,399
2,Day of Defeat,GBP,399
3,Deathmatch Classic,GBP,399
4,Half-Life: Opposing Force,GBP,399
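It may help to see the fallback in isolation. Pandas represents a missing value in this column as a float `NaN`, and `literal_eval` raises a `ValueError` when handed something that is neither a string nor a syntax node. The sketch below repeats `parse_price` outside the dataframe with two hand-written inputs.

```python
from ast import literal_eval


def parse_price(x):
    try:
        return literal_eval(x)
    except ValueError:
        # reached for NaN, which is a float rather than a string
        return {'currency': 'GBP', 'initial': -1}


missing = parse_price(float('nan'))
present = parse_price("{'currency': 'GBP', 'initial': 719}")

print(missing['initial'], present['initial'])
```

One caveat: a badly formed string, such as one with an unclosed brace, raises `SyntaxError` rather than `ValueError`, so it would not be caught here. That doesn't appear to be an issue in this data set, but it is worth bearing in mind if reusing the function elsewhere.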


We're almost finished here, but let's check if any games don't have GBP listed as the currency.

In [29]:
price_data[price_data['currency'] != 'GBP']

Unnamed: 0,name,currency,price
991,Robin Hood: The Legend of Sherwood,USD,799
5767,Assassin’s Creed® Chronicles: India,EUR,999
27593,Mortal Kombat 11,USD,5999
27995,Pagan Online,EUR,2699


For some reason we have four games listed in either USD or EUR. We could use the current exchange rate to try and convert them into GBP, however as there are only four rows we will simply drop them.

We will also divide prices by 100 so they are displayed as floats in pounds.

In [30]:
def process_price(df):
    """Process price_overview column into formatted price column."""
    df = df.copy()
    
    def parse_price(x):
        try:
            return literal_eval(x)
        except ValueError:
            return {'currency': 'GBP', 'initial': -1}
    
    # evaluate as dictionary and set to -1 if missing
    df['price_overview'] = df['price_overview'].apply(parse_price)
    
    # create columns from currency and initial values
    df['currency'] = df['price_overview'].apply(lambda x: x['currency'])
    df['price'] = df['price_overview'].apply(lambda x: x['initial'])
    
    # set price of free games to 0
    df.loc[df['is_free'], 'price'] = 0
    
    # remove non-GBP rows (copy so the assignment below acts on an independent frame)
    df = df[df['currency'] == 'GBP'].copy()
    
    # change price to display in pounds (only applying to rows with a value greater than 0)
    df.loc[df['price'] > 0, 'price'] /= 100
    
    # remove columns no longer needed
    df = df.drop(['is_free', 'currency', 'price_overview'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    
    return df


steam_data = process(raw_steam_data)
steam_data[['name', 'price']].head()

Unnamed: 0,name,price
0,Counter-Strike,7.19
1,Team Fortress Classic,3.99
2,Day of Defeat,3.99
3,Deathmatch Classic,3.99
4,Half-Life: Opposing Force,3.99


### Processing Description Columns

Next we have a series of columns with descriptive text about each game: `detailed_description`, `about_the_game` and `short_description`. These columns could be used as the basis for an interesting recommender or key-word analysis project, however they are not required in our current project and should be removed from our final data set as they take up large amounts of space.

In case we find some anomalies, let's inspect these columns anyway.

In [31]:
steam_data[['detailed_description', 'about_the_game', 'short_description']].isnull().sum()

detailed_description    24
about_the_game          24
short_description       24
dtype: int64

It looks like we have 24 rows with missing data for these columns, and chances are the 24 rows with missing `detailed_description` are the rows with missing `about_the_game` and `short_description` data too. 

By inspecting the individual rows below, we can see that this is true: every row with missing data in one description column has missing data in the other two as well.
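
Rather than relying on reading the rows by eye, the same claim can be checked programmatically. The following is a minimal sketch on a toy dataframe standing in for `steam_data` (the values are made up for illustration): it compares the rows missing *any* description with the rows missing *all* of them.

```python
import numpy as np
import pandas as pd

desc_cols = ['detailed_description', 'about_the_game', 'short_description']

# toy stand-in for steam_data: the second row is missing all three descriptions
toy = pd.DataFrame({
    'steam_appid': [10, 3300, 20],
    'detailed_description': ['text', np.nan, 'text'],
    'about_the_game': ['text', np.nan, 'text'],
    'short_description': ['text', np.nan, 'text'],
})

missing = toy[desc_cols].isnull()

# rows missing at least one description vs rows missing every description
any_missing = missing.any(axis=1)
all_missing = missing.all(axis=1)

# True only when no row is partially missing
same_rows = any_missing.equals(all_missing)
print(same_rows, int(any_missing.sum()))
```

If the two masks are equal, removing rows based on `detailed_description` alone is enough to deal with all three columns.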

In [34]:
steam_data[steam_data['detailed_description'].isnull()]

Unnamed: 0,name,steam_appid,required_age,detailed_description,about_the_game,short_description,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,packages,package_groups,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors,windows,mac,linux,price
97,Bejeweled 2 Deluxe,3300,3,,,,"English, French, German, Italian, Spanish - Spain",https://steamcdn-a.akamaihd.net/steam/apps/330...,,{'minimum': '<p><strong>Minimum Requirements:<...,{'minimum': '<ul>\n\t<li><strong>OS:</strong> ...,[],"['PopCap Games, Inc.']","['PopCap Games, Inc.']","[121, 1160]","[{'name': 'default', 'title': 'Buy Bejeweled 2...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/330...,"{'ids': [], 'notes': None}",1,1,0,4.25
98,Chuzzle Deluxe,3310,3,,,,"English, French, German, Italian, Spanish - Spain",https://steamcdn-a.akamaihd.net/steam/apps/331...,,{'minimum': '<p><strong>Minimum Requirements:<...,{'minimum': '<ul>\n\t<li><strong>OS:</strong> ...,[],"['PopCap Games, Inc.']","['PopCap Games, Inc.']",[126],"[{'name': 'default', 'title': 'Buy Chuzzle Del...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/331...,"{'ids': [], 'notes': None}",1,1,0,4.25
99,Insaniquarium Deluxe,3320,3,,,,English,https://steamcdn-a.akamaihd.net/steam/apps/332...,,{'minimum': '<strong>Minimum Requirements:</st...,[],[],"['PopCap Games, Inc.']","['PopCap Games, Inc.']","[127, 1160]","[{'name': 'default', 'title': 'Buy Insaniquari...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/332...,"{'ids': [], 'notes': None}",1,0,0,4.25
101,AstroPop Deluxe,3340,3,,,,English,https://steamcdn-a.akamaihd.net/steam/apps/334...,,{'minimum': '<p><strong>Minimum Requirements:<...,[],[],"['PopCap Games, Inc.']","['PopCap Games, Inc.']",[132],"[{'name': 'default', 'title': 'Buy AstroPop De...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/334...,"{'ids': [], 'notes': None}",1,0,0,4.25
102,Bejeweled Deluxe,3350,3,,,,English,https://steamcdn-a.akamaihd.net/steam/apps/335...,,{'minimum': '<p><strong>Minimum Requirements:<...,[],[],"['PopCap Games, Inc.']","['PopCap Games, Inc.']",[122],"[{'name': 'default', 'title': 'Buy Bejeweled D...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/335...,"{'ids': [], 'notes': None}",1,0,0,4.25
103,Big Money! Deluxe,3360,3,,,,English,https://steamcdn-a.akamaihd.net/steam/apps/336...,,{'minimum': '<p><strong>Minimum Requirements:<...,[],[],"['PopCap Games, Inc.']","['PopCap Games, Inc.']",[136],"[{'name': 'default', 'title': 'Buy Big Money! ...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/336...,"{'ids': [], 'notes': None}",1,0,0,4.25
104,Dynomite Deluxe,3380,3,,,,English,https://steamcdn-a.akamaihd.net/steam/apps/338...,,{'minimum': '<p><strong>Minimum Requirements:<...,[],[],"['PopCap Games, Inc.']","['PopCap Games, Inc.']",[128],"[{'name': 'default', 'title': 'Buy Dynomite De...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/338...,"{'ids': [], 'notes': None}",1,0,0,4.25
105,Feeding Frenzy 2 Deluxe,3390,3,,,,English,https://steamcdn-a.akamaihd.net/steam/apps/339...,,{'minimum': '<p><strong>Minimum Requirements:<...,[],[],"['PopCap Games, Inc.']","['PopCap Games, Inc.']",[124],"[{'name': 'default', 'title': 'Buy Feeding Fre...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/339...,"{'ids': [], 'notes': None}",1,0,0,4.25
106,Hammer Heads Deluxe,3400,3,,,,English,https://steamcdn-a.akamaihd.net/steam/apps/340...,,{'minimum': '<p><strong>Minimum Requirements:<...,[],[],"['PopCap Games, Inc.']","['PopCap Games, Inc.']",[135],"[{'name': 'default', 'title': 'Buy Hammer Head...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/340...,"{'ids': [], 'notes': None}",1,0,0,4.25
108,Iggle Pop Deluxe,3420,3,,,,English,https://steamcdn-a.akamaihd.net/steam/apps/342...,,{'minimum': '<p><strong>Minimum Requirements:<...,[],[],"['PopCap Games, Inc.']","['PopCap Games, Inc.']",[129],"[{'name': 'default', 'title': 'Buy Iggle Pop D...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/342...,"{'ids': [], 'notes': None}",1,0,0,4.25


Browsing these games, it looks like about half are old PopCap games from 2006 and about half are from Telltale Games, similar to the Sam & Max title we encountered in the previous section.

There is also a dedicated server and a game which is now unlisted on the Steam store. These two should definitely be removed.

Let's remove all of the rows with missing descriptions for now; we can reintroduce them later if we wish.

As stated, the description columns may be useful for future projects, so before we remove them from this data set we will export them as a csv file. We will include the `steam_appid` column in this export as it will allow us to match up these rows with rows in our primary data set later on, using a merge (or a join in SQL). We will write a short function to handle this, which we can re-use later on if we have any more dataframes that need exporting.

In [35]:
def export_data(df, filename):
    """Export dataframe to csv file, filename prepended with 'steam_'.
    
    filename : str without file extension
    """
    filepath = '../data/exports/steam_' + filename + '.csv'
    formatted_name = filename.replace('_', ' ')
    
    df.to_csv(filepath, index=False)
    print("Exported {} to '{}'".format(formatted_name, filepath))

    
def process_descriptions(df, export=False):
    """Export descriptions to external csv file then remove these columns."""
    # remove rows with missing description data
    df = df[df['detailed_description'].notnull()].copy()
    
    if export:
        # create dataframe of description columns and export to csv
        description_data = df[['steam_appid', 'detailed_description', 'about_the_game', 'short_description']]
        
        export_data(description_data, filename='description_data')
    
    # drop description columns from main dataframe
    df = df.drop(['detailed_description', 'about_the_game', 'short_description'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_descriptions(df, export=True)
    
    return df


steam_data = process(raw_steam_data)
steam_data.head()

Exported description data to '../data/exports/steam_description_data.csv'


Unnamed: 0,name,steam_appid,required_age,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,packages,package_groups,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors,windows,mac,linux,price
0,Counter-Strike,10,3,"English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[7],"[{'name': 'default', 'title': 'Buy Counter-Str...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'url': 'http://steamcommunity.com/app/10', 'e...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,"{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,7.19
1,Team Fortress Classic,20,3,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/20/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[29],"[{'name': 'default', 'title': 'Buy Team Fortre...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/20/...,"{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,3.99
2,Day of Defeat,30,3,"English, French, German, Italian, Spanish - Spain",https://steamcdn-a.akamaihd.net/steam/apps/30/...,http://www.dayofdefeat.com/,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[30],"[{'name': 'default', 'title': 'Buy Day of Defe...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/30/...,"{'ids': [], 'notes': None}",1,1,1,3.99
3,Deathmatch Classic,40,3,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/40/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[31],"[{'name': 'default', 'title': 'Buy Deathmatch ...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/40/...,"{'ids': [], 'notes': None}",1,1,1,3.99
4,Half-Life: Opposing Force,50,3,"English, French, German, Korean",https://steamcdn-a.akamaihd.net/steam/apps/50/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Gearbox Software'],['Valve'],[32],"[{'name': 'default', 'title': 'Buy Half-Life: ...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'url': 'https://help.steampowered.com', 'emai...",https://steamcdn-a.akamaihd.net/steam/apps/50/...,"{'ids': [], 'notes': None}",1,1,1,3.99


In [36]:
# inspect exported data
pd.read_csv('../data/exports/steam_description_data.csv').head()

Unnamed: 0,steam_appid,detailed_description,about_the_game,short_description
0,10,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...
1,20,One of the most popular online action games of...,One of the most popular online action games of...,One of the most popular online action games of...
2,30,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...
3,40,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...
4,50,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...
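
To illustrate how the exported descriptions could be matched back up with the main data set later, here is a small sketch of a merge on `steam_appid`. The two toy dataframes are stand-ins for `steam_data` and the exported csv, trimmed to a couple of columns.

```python
import pandas as pd

# toy stand-ins: in the notebook these would be steam_data and the exported csv
main = pd.DataFrame({
    'steam_appid': [10, 20, 30],
    'name': ['Counter-Strike', 'Team Fortress Classic', 'Day of Defeat'],
})
descriptions = pd.DataFrame({
    'steam_appid': [10, 20],
    'short_description': ["Play the world's number 1 online action game. ...",
                          'One of the most popular online action games of...'],
})

# a left merge keeps every row of the main table, matching on steam_appid;
# apps without an exported description get NaN in the new column
merged = main.merge(descriptions, on='steam_appid', how='left')
print(merged)
```

A left merge is used here so that no rows of the main table are lost; an inner merge would instead keep only apps present in both tables.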


### Processing Languages

The next column is `supported_languages`. As we will be performing the analysis for an English company, we are only interested in apps that are in English. Whilst we could remove non-English apps at this stage, we will instead create a column marking English apps with a binary flag: 1 if the app supports English, 0 otherwise.

We begin as usual by looking for rows with null values.

In [37]:
steam_data['supported_languages'].isnull().sum()

4

Taking a closer look at these apps, it's possible one or two are not in English. As there are only 4 rows affected we will go ahead and remove these from the data set.

In [38]:
steam_data[steam_data['supported_languages'].isnull()]

Unnamed: 0,name,steam_appid,required_age,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,packages,package_groups,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors,windows,mac,linux,price
4866,Subsiege,338640,3,,https://steamcdn-a.akamaihd.net/steam/apps/338...,http://subsiege-game.com/,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],['Icebird Studios'],['Icebird Studios'],[56500],"[{'name': 'default', 'title': 'Buy Subsiege', ...",,,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256729398, 'name': 'Release Trailer', ...",{'total': 0},"{'coming_soon': False, 'date': '7 Sep, 2018'}","{'url': 'http://subsiege-game.com/', 'email': ...",https://steamcdn-a.akamaihd.net/steam/apps/338...,"{'ids': [], 'notes': None}",1,0,0,17.89
14560,MARS VR(全球使命VR),596560,3,,https://steamcdn-a.akamaihd.net/steam/apps/596...,http://qqsm.zygames.com/,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,"['Ying Pei Digital Technology Shanghai Co., Li...","['SHANGHAI ZHENYOU TECHNOLOGY CO.,LTD']",[156314],"[{'name': 'default', 'title': 'Buy MARS VR(全球使...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '73', 'description': 'Violent'}, {'id'...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256681371, 'name': 'marsvr', 'thumbnai...",{'total': 0},"{'coming_soon': False, 'date': '5 Apr, 2017'}","{'url': 'http://www.zygames.com/contact', 'ema...",https://steamcdn-a.akamaihd.net/steam/apps/596...,"{'ids': [], 'notes': None}",1,0,0,1.99
16386,Numberline 2,654970,3,,https://steamcdn-a.akamaihd.net/steam/apps/654...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],"['V34D4R', 'Egor Magurin']",['Indovers Studio'],[184646],"[{'name': 'default', 'title': 'Buy Numberline ...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256687192, 'name': 'Numberline 2 Trail...","{'total': 60, 'highlighted': [{'name': '1st le...","{'coming_soon': False, 'date': '14 Jul, 2017'}","{'url': '', 'email': 'radaew.zhenya@yandex.ru'}",https://steamcdn-a.akamaihd.net/steam/apps/654...,"{'ids': [], 'notes': None}",1,0,0,1.59
26855,SNUSE 221,948070,3,,https://steamcdn-a.akamaihd.net/steam/apps/948...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],['SNUSE GM'],['SNUSE GM'],[308421],"[{'name': 'default', 'title': 'Buy SNUSE 221',...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '25', 'description': 'Adventure'}, {'i...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256745662, 'name': 'snuse', 'thumbnail...",{'total': 0},"{'coming_soon': False, 'date': '2 Apr, 2019'}","{'url': 'vk.com/nilow_i', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/948...,"{'ids': [], 'notes': None}",1,0,0,0.79


By looking at the value for the first row and the most common values overall, it looks like languages are stored as a string which can be anything from a comma-separated list of languages to a mix of HTML and headings. It seems reasonably safe to assume that if the app is in English, the word English will appear somewhere in this string. With this in mind we can simply search the string and return a value based on the result.

In [39]:
print(steam_data['supported_languages'][0])
steam_data['supported_languages'].value_counts().head(10)

English<strong>*</strong>, French<strong>*</strong>, German<strong>*</strong>, Italian<strong>*</strong>, Spanish - Spain<strong>*</strong>, Simplified Chinese<strong>*</strong>, Traditional Chinese<strong>*</strong>, Korean<strong>*</strong><br><strong>*</strong>languages with full audio support


English                                                                                                        8702
English<strong>*</strong><br><strong>*</strong>languages with full audio support                               7669
English, Russian                                                                                                719
English, Simplified Chinese                                                                                     291
English, Japanese                                                                                               239
English<strong>*</strong>, Russian<strong>*</strong><br><strong>*</strong>languages with full audio support     227
English, French, Italian, German, Spanish - Spain                                                               188
Simplified Chinese                                                                                              168
English, German                                                         

In [40]:
def process_language(df):
    """Process supported_languages column into a boolean 'is english' column."""
    df = df.copy()
    
    # drop rows with missing language data
    df = df.dropna(subset=['supported_languages'])
    
    df['english'] = df['supported_languages'].apply(lambda x: 1 if 'english' in x.lower() else 0)
    df = df.drop('supported_languages', axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_descriptions(df)
    df = process_language(df)
    
    return df


steam_data = process(raw_steam_data)
steam_data[['name', 'english']].head()

Unnamed: 0,name,english
0,Counter-Strike,1
1,Team Fortress Classic,1
2,Day of Defeat,1
3,Deathmatch Classic,1
4,Half-Life: Opposing Force,1


Before moving on, we can take a quick look at our results and see that most of our apps are in English.

In [41]:
steam_data['english'].value_counts(dropna=False)

1    28500
0      543
Name: english, dtype: int64
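
As an aside, the same substring search can be written with pandas' vectorised string methods instead of `apply`. This is only an alternative sketch on a few sample strings taken from the value counts above, not a change to the function we wrote.

```python
import pandas as pd

languages = pd.Series([
    'English',
    'English<strong>*</strong><br><strong>*</strong>languages with full audio support',
    'Simplified Chinese',
    'English, Russian',
])

# lower-case each string, then test for the literal substring 'english'
english = languages.str.lower().str.contains('english', regex=False).astype(int)
print(english.tolist())
```

Both approaches depend on the assumption made earlier, that any app available in English has the word English somewhere in the string.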

### Processing Image Columns

Similar to our description columns, we have three columns that appear to contain links to various images: `header_image`, `screenshots` and `background`. We will treat these in almost the same way, exporting the contents to a csv file then removing the columns from our data set.

Whilst we won't be needing this data for our current project, it could open the door to some interesting image analysis in the future.

First we check for missing values.

In [42]:
image_cols = ['header_image', 'screenshots', 'background']

for col in image_cols:
    print(col+':', steam_data[col].isnull().sum())

header_image: 0
screenshots: 15
background: 15


Again it is likely that the 15 rows with missing screenshots data are the same rows with missing background data.

As seen below, some of these rows are also missing `pc_requirements`, some have no release date (a blank string in the `date` part of `release_date`), and most have a price of -1, meaning we couldn't find any price data earlier.

It seems like it would be a good idea to remove these rows before proceeding.
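
These observations can also be tallied in code. The sketch below uses a three-row toy dataframe built from values seen in this notebook (the `screenshots` string for the last row is a shortened stand-in) and flags, for rows without screenshots, whether the price is the -1 placeholder and whether the release date is blank.

```python
from ast import literal_eval

import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'name': ['Hector: Episode 1', 'The Light Empire', 'Counter-Strike'],
    # "[{'id': 0}]" is a shortened stand-in for a real screenshots value
    'screenshots': [np.nan, np.nan, "[{'id': 0}]"],
    'release_date': ["{'coming_soon': False, 'date': ''}",
                     "{'coming_soon': False, 'date': '2 Dec, 2015'}",
                     "{'coming_soon': False, 'date': '1 Nov, 2000'}"],
    'price': [-1.0, 4.79, 7.19],
})

no_screens = toy[toy['screenshots'].isnull()]

# flag the placeholder price and blank release dates among those rows
summary = pd.DataFrame({
    'name': no_screens['name'],
    'no_price': no_screens['price'] == -1,
    'no_date': no_screens['release_date'].apply(lambda x: literal_eval(x)['date'] == ''),
})
print(summary)
```

Summing the boolean columns of `summary` gives a quick count of how many of the affected rows have each problem.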

In [43]:
steam_data[steam_data['screenshots'].isnull()]

Unnamed: 0,name,steam_appid,required_age,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,packages,package_groups,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors,windows,mac,linux,price,english
652,Sam & Max 302: The Tomb of Sammun-Mak,31230,3,https://steamcdn-a.akamaihd.net/steam/apps/312...,,[],[],[],['Telltale Games'],['Telltale Games'],"[109586, 4172]","[{'name': 'default', 'title': 'Buy Sam & Max 3...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '25', 'description': 'Adventure'}]",,,{'total': 0},"{'coming_soon': False, 'date': ''}","{'url': 'https://telltale.com/support/', 'emai...",,"{'ids': [], 'notes': None}",1,1,0,-1.0,1
653,Sam & Max 303: They Stole Max's Brain!,31240,3,https://steamcdn-a.akamaihd.net/steam/apps/312...,,[],[],[],['Telltale Games'],['Telltale Games'],"[109587, 4172]","[{'name': 'default', 'title': ""Buy Sam & Max 3...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '25', 'description': 'Adventure'}]",,,{'total': 0},"{'coming_soon': False, 'date': ''}","{'url': 'https://telltale.com/support/', 'emai...",,"{'ids': [], 'notes': None}",1,1,0,-1.0,1
654,Sam & Max 304: Beyond the Alley of the Dolls,31250,3,https://steamcdn-a.akamaihd.net/steam/apps/312...,,[],[],[],['Telltale Games'],['Telltale Games'],"[109588, 4172]","[{'name': 'default', 'title': 'Buy Sam & Max 3...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '25', 'description': 'Adventure'}]",,,{'total': 0},"{'coming_soon': False, 'date': ''}","{'url': 'https://telltale.com/support/', 'emai...",,"{'ids': [], 'notes': None}",1,1,0,-1.0,1
655,Sam & Max 305: The City That Dares Not Sleep,31260,3,https://steamcdn-a.akamaihd.net/steam/apps/312...,,[],[],[],['Telltale Games'],['Telltale Games'],"[109589, 4172]","[{'name': 'default', 'title': 'Buy Sam & Max 3...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '25', 'description': 'Adventure'}]",,,{'total': 0},"{'coming_soon': False, 'date': ''}","{'url': 'https://telltale.com/support/', 'emai...",,"{'ids': [], 'notes': None}",1,1,0,-1.0,1
1238,Hector: Episode 1,94600,3,https://steamcdn-a.akamaihd.net/steam/apps/946...,,{'minimum': '<strong>Минимальные:</strong><br>...,{'minimum': '<strong>Минимальные:</strong><br>...,[],['Straandlooper'],[''],[11279],"[{'name': 'default', 'title': 'Buy Hector: Epi...",,,,,{'total': 0},"{'coming_soon': False, 'date': ''}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}",1,1,0,-1.0,1
1239,Hector: Episode 2,94610,3,https://steamcdn-a.akamaihd.net/steam/apps/946...,http://www.telltalegames.com/hector,{'minimum': 'Minimum:<br>\t\t\t\t\t\t\t\t\t\t\...,[],[],['Straandlooper'],[''],"[109595, 11279]","[{'name': 'default', 'title': 'Buy Hector: Epi...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '25', 'description': 'Adventure'}, {'i...",,,{'total': 0},"{'coming_soon': False, 'date': ''}","{'url': '', 'email': 'support@telltalegames.com'}",,"{'ids': [], 'notes': None}",1,1,0,-1.0,1
5218,Into The War,346370,3,https://steamcdn-a.akamaihd.net/steam/apps/346...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],['Small Town Studios'],['Small Town Studios'],,[],"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}, {'id': ...",,,"{'total': 1, 'highlighted': [{'name': 'First B...","{'coming_soon': False, 'date': '2 Dec, 2015'}","{'url': 'http://intothewar.com', 'email': 'nan...",,"{'ids': [], 'notes': None}",1,0,0,-1.0,1
7970,The Light Empire,416220,3,https://steamcdn-a.akamaihd.net/steam/apps/416...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],['Jemy'],['Jemy'],[83871],"[{'name': 'default', 'title': 'Buy The Light E...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '23', 'description': 'Indie'}, {'id': ...",,,"{'total': 4, 'highlighted': [{'name': 'We Begi...","{'coming_soon': False, 'date': '2 Dec, 2015'}","{'url': '', 'email': 'Jemy.TLE@outlook.com'}",,"{'ids': [], 'notes': None}",1,0,0,4.79,1
9408,A Land Fit For Heroes,456210,3,https://steamcdn-a.akamaihd.net/steam/apps/456...,http://landfitforheroes.com,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],['Liber Primus Games'],['Liber Primus Games'],,[],"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '25', 'description': 'Adventure'}, {'i...",,"[{'id': 256663531, 'name': 'A Land Fit For Her...",{'total': 0},"{'coming_soon': False, 'date': '3 May, 2016'}","{'url': 'http://landfitforheroes.com', 'email'...",,"{'ids': [], 'notes': None}",1,0,0,-1.0,1
19481,JumpSky,731910,3,https://steamcdn-a.akamaihd.net/steam/apps/731...,,[],{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,['none'],['none'],,[],"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]",,,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2017'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}",1,1,0,-1.0,1


There is also a `movies` column with similar data. Whilst it has more missing values, presumably for games without videos, it appears to contain names, thumbnails and links to various videos and trailers. It's unlikely we'll need them, but we can include them in the export and remove them from our data set.

In [44]:
steam_data['movies'].isnull().sum()

1893

In [45]:
with pd.option_context("display.max_colwidth", 1000):
    print(steam_data[steam_data['movies'].notnull()]['movies'].head(3))

9                                                                                                                                                                                                                                                                                                                                                         [{'id': 904, 'name': 'Half-Life 2 Trailer', 'thumbnail': 'https://steamcdn-a.akamaihd.net/steam/apps/904/movie.jpg?t=1507237301', 'webm': {'480': 'http://steamcdn-a.akamaihd.net/steam/apps/904/movie480.webm?t=1507237301', 'max': 'http://steamcdn-a.akamaihd.net/steam/apps/904/movie_max.webm?t=1507237301'}, 'highlight': True}, {'id': 5724, 'name': 'Free Yourself', 'thumbnail': 'https://steamcdn-a.akamaihd.net/steam/apps/5724/movie.293x165.jpg?t=1507237311', 'webm': {'480': 'http://steamcdn-a.akamaihd.net/steam/apps/5724/movie480.webm?t=1507237311', 'max': 'http://steamcdn-a.akamaihd.net/steam/apps/5724/movie_max.webm?t=1507237311'}, 'highlight': Fa
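
Should the exported movie data be needed in a future project, each value would first have to be turned back from a string into Python objects. A minimal sketch, using only the first entry of the value printed above and the same `literal_eval` approach we used for prices:

```python
from ast import literal_eval

# first entry of the movies value shown above, as it is stored in the csv (a string)
movies_str = (
    "[{'id': 904, 'name': 'Half-Life 2 Trailer', "
    "'thumbnail': 'https://steamcdn-a.akamaihd.net/steam/apps/904/movie.jpg?t=1507237301', "
    "'webm': {'480': 'http://steamcdn-a.akamaihd.net/steam/apps/904/movie480.webm?t=1507237301', "
    "'max': 'http://steamcdn-a.akamaihd.net/steam/apps/904/movie_max.webm?t=1507237301'}, "
    "'highlight': True}]"
)

# evaluate the string as a list of dictionaries
movies = literal_eval(movies_str)

# pull out the trailer names and the links to the highest quality videos
names = [m['name'] for m in movies]
max_urls = [m['webm']['max'] for m in movies]
print(names, max_urls)
```

Rows with no movies are stored as missing values, so they would need to be skipped or handled before calling `literal_eval`.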

In [46]:
def process_images(df, export=False):
    """Remove image columns from dataframe, optionally exporting them to csv first."""
    df = df[df['screenshots'].notnull()].copy()
    
    if export:
        image_data = df[['steam_appid', 'header_image', 'screenshots', 'background', 'movies']]
        
        export_data(image_data, 'image_data')
        
    df = df.drop(['header_image', 'screenshots', 'background', 'movies'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_descriptions(df)
    df = process_language(df)
    df = process_images(df, export=True)
    
    return df


steam_data = process(raw_steam_data)
steam_data.head()

Exported image data to '../data/exports/steam_image_data.csv'


Unnamed: 0,name,steam_appid,required_age,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,packages,package_groups,categories,genres,achievements,release_date,support_info,content_descriptors,windows,mac,linux,price,english
0,Counter-Strike,10,3,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[7],"[{'name': 'default', 'title': 'Buy Counter-Str...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'url': 'http://steamcommunity.com/app/10', 'e...","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,7.19,1
1,Team Fortress Classic,20,3,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[29],"[{'name': 'default', 'title': 'Buy Team Fortre...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'url': '', 'email': ''}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,3.99,1
2,Day of Defeat,30,3,http://www.dayofdefeat.com/,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[30],"[{'name': 'default', 'title': 'Buy Day of Defe...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'url': '', 'email': ''}","{'ids': [], 'notes': None}",1,1,1,3.99,1
3,Deathmatch Classic,40,3,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[31],"[{'name': 'default', 'title': 'Buy Deathmatch ...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'url': '', 'email': ''}","{'ids': [], 'notes': None}",1,1,1,3.99,1
4,Half-Life: Opposing Force,50,3,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Gearbox Software'],['Valve'],[32],"[{'name': 'default', 'title': 'Buy Half-Life: ...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'url': 'https://help.steampowered.com', 'emai...","{'ids': [], 'notes': None}",1,1,1,3.99,1


In [47]:
# inspect exported data
pd.read_csv('../data/exports/steam_image_data.csv').head()

Unnamed: 0,steam_appid,header_image,screenshots,background,movies
0,10,https://steamcdn-a.akamaihd.net/steam/apps/10/...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,
1,20,https://steamcdn-a.akamaihd.net/steam/apps/20/...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/20/...,
2,30,https://steamcdn-a.akamaihd.net/steam/apps/30/...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/30/...,
3,40,https://steamcdn-a.akamaihd.net/steam/apps/40/...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/40/...,
4,50,https://steamcdn-a.akamaihd.net/steam/apps/50/...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/50/...,


### Website and support info

Next we will look at the `website` and `support_info` columns, both containing links to external websites. There are a large number of rows with no website listed, and while there are no null values in the `support_info` column, each entry is a dictionary holding a url and an email, either of which may be an empty string.

For our data set we'll be dropping both these columns. It may still be useful, or at least interesting, to extract this data and export it to a csv file as we have before.

Below we can see the null counts and some example rows.

In [48]:
print('website null counts:', steam_data['website'].isnull().sum())
print('support_info null counts:', steam_data['support_info'].isnull().sum())

with pd.option_context("display.max_colwidth", 100): # ensures strings not cut short
    display(steam_data[['name', 'website', 'support_info']][80:85])

website null counts: 9787
support_info null counts: 0


Unnamed: 0,name,website,support_info
83,X: Tension,http://www.egosoft.com/games/x_tension/info_en.php,"{'url': '', 'email': ''}"
84,X Rebirth,http://www.egosoft.com/games/x_rebirth/info_en.php,"{'url': 'http://www.egosoft.com/support/index_en.php', 'email': 'info@egosoft.com'}"
85,688(I) Hunter/Killer,,"{'url': 'http://strategyfirst.com/products/support.html', 'email': ''}"
86,Fleet Command,,"{'url': 'http://strategyfirst.com/products/support.html', 'email': ''}"
87,Sub Command,,"{'url': '', 'email': ''}"


We keep all the code that parses the columns inside the export if statement, so it only runs if we wish to export to csv. Rows with missing website data contain NaN, whereas the support url and email use a blank string for missing data. We don't need to worry about this difference, as once we have exported to csv they will be treated the same.
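
A quick way to convince ourselves of this is a round trip through csv on a couple of made-up rows. This is only a sketch: the values and the in-memory buffer are assumptions for illustration and are not part of the pipeline.

```python
import io

import numpy as np
import pandas as pd

# Toy stand-in for the parsed support data (all values are made up)
toy = pd.DataFrame({
    'steam_appid': [10, 20],
    'website': [np.nan, 'http://example.com'],        # missing website stored as NaN
    'support_url': ['http://example.com/help', ''],   # missing url stored as ''
})

# Write to an in-memory buffer instead of a file, then read it straight back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

# Both kinds of missing value come back as NaN
print(round_trip.isnull().sum())
```

Both NaN and the blank string are written as an empty field, and `read_csv` reads empty fields back as NaN by default.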

In [49]:
def process_info(df, export=False):
    """Drop support information from dataframe, optionally exporting beforehand."""
    if export:
        support_info = df[['steam_appid', 'website', 'support_info']].copy()
        
        support_info['support_info'] = support_info['support_info'].apply(lambda x: literal_eval(x))
        support_info['support_url'] = support_info['support_info'].apply(lambda x: x['url'])
        support_info['support_email'] = support_info['support_info'].apply(lambda x: x['email'])
        
        support_info = support_info.drop('support_info', axis=1)
        
        support_info = support_info[(support_info['website'].notnull()) | (support_info['support_url'] != '') | (support_info['support_email'] != '')]

        export_data(support_info, 'support_info')
    
    df = df.drop(['website', 'support_info'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_descriptions(df)
    df = process_language(df)
    df = process_images(df)
    df = process_info(df, export=True)
    
    return df


steam_data = process(raw_steam_data)
steam_data.head()

Exported support info to '../data/exports/steam_support_info.csv'


Unnamed: 0,name,steam_appid,required_age,pc_requirements,mac_requirements,linux_requirements,developers,publishers,packages,package_groups,categories,genres,achievements,release_date,content_descriptors,windows,mac,linux,price,english
0,Counter-Strike,10,3,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[7],"[{'name': 'default', 'title': 'Buy Counter-Str...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,7.19,1
1,Team Fortress Classic,20,3,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[29],"[{'name': 'default', 'title': 'Buy Team Fortre...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,3.99,1
2,Day of Defeat,30,3,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[30],"[{'name': 'default', 'title': 'Buy Day of Defe...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'ids': [], 'notes': None}",1,1,1,3.99,1
3,Deathmatch Classic,40,3,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[31],"[{'name': 'default', 'title': 'Buy Deathmatch ...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'ids': [], 'notes': None}",1,1,1,3.99,1
4,Half-Life: Opposing Force,50,3,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Gearbox Software'],['Valve'],[32],"[{'name': 'default', 'title': 'Buy Half-Life: ...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'ids': [], 'notes': None}",1,1,1,3.99,1


In [50]:
# inspect exported file
pd.read_csv('../data/exports/steam_support_info.csv').head()

Unnamed: 0,steam_appid,website,support_url,support_email
0,10,,http://steamcommunity.com/app/10,
1,30,http://www.dayofdefeat.com/,,
2,50,,https://help.steampowered.com,
3,70,http://www.half-life.com/,http://steamcommunity.com/app/70,
4,80,,http://steamcommunity.com/app/80,


### System Requirements

At first it looks like we have data for every row.

In [51]:
req_cols = ['pc_requirements', 'mac_requirements', 'linux_requirements']

print('null counts:\n')

for col in req_cols:
    print(col+':', steam_data[col].isnull().sum())

null counts:

pc_requirements: 0
mac_requirements: 0
linux_requirements: 0


However if we look at the data a little more closely, we see that some rows actually have an empty list. These won't be counted as null, but once evaluated they won't provide any information, so for our purposes they can be treated as null.

In [52]:
steam_data[['steam_appid', 'pc_requirements', 'mac_requirements', 'linux_requirements']].tail()

Unnamed: 0,steam_appid,pc_requirements,mac_requirements,linux_requirements
29230,1065230,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[]
29231,1065570,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[]
29232,1065650,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[]
29233,1066700,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,[]
29234,1069460,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,[]


We can check how many rows in each requirements column have empty lists using a simple boolean filter. The first value of the filtered dataframe's `shape` attribute is its number of rows, which gives us a count of the empty lists.
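
As a small illustration on made-up values (not the real data), filtering and reading `shape[0]` gives the same count as summing the boolean mask directly, since each True counts as 1.

```python
import pandas as pd

# Made-up values in the same string form as the requirements columns
toy = pd.Series(["{'minimum': '...'}", "[]", "[]", "{'minimum': '...'}"])

is_empty = toy == '[]'                     # boolean mask, True where the string is '[]'

count_via_shape = toy[is_empty].shape[0]   # filter first, then count the rows left
count_via_sum = int(is_empty.sum())        # or sum the mask directly

print(count_via_shape, count_via_sum)
```

Either form works. The filter version is handy when we also want to look at the matching rows afterwards.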

In [53]:
print('Empty list counts:\n')

for col in req_cols:
    print(col+':', steam_data[steam_data[col] == '[]'].shape[0])

Empty list counts:

pc_requirements: 16
mac_requirements: 17125
linux_requirements: 20189


That's over half of the rows for both mac and linux requirements. That probably means that there is not enough data in these two columns to be useful for our analysis.

It turns out most games are developed solely for windows, with mac and linux ports only becoming more common in recent years. Naturally, any games that aren't supported on mac or linux would not have corresponding requirements.

As we have already cleaned our platforms column, we can check how many rows actually have missing data by comparing rows with empty lists in the requirements with data in the respective platform columns (mac/linux). If a row has an empty list in the requirements column but a 1 (True) in the platform column, it means the data is missing.

In [54]:
for col in ['mac_requirements', 'linux_requirements']:
    platform = col.split('_')[0]
    print(platform+':', steam_data[(steam_data[col] == '[]') & (steam_data[platform])].shape[0])

mac: 141
linux: 168


Whilst not an insignificant number, this means that the vast majority of rows are as they should be, and we're not looking at too many data errors.

Let's also have a look for missing values in the pc/windows column. We couldn't include it in our previous loop as the columns have different names, something we may wish to change later.
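
If we did want a single loop despite the differing names, one possible approach is a small lookup from each requirements column to its platform column. The sketch below runs on made-up rows, and the name `req_to_platform` is hypothetical.

```python
import pandas as pd

# Hypothetical mapping from each requirements column to its platform flag column
req_to_platform = {
    'pc_requirements': 'windows',
    'mac_requirements': 'mac',
    'linux_requirements': 'linux',
}

# Toy rows (made up): '[]' marks empty requirements, 1/0 marks platform support
toy = pd.DataFrame({
    'pc_requirements': ['[]', "{'minimum': '...'}", '[]'],
    'mac_requirements': ['[]', '[]', "{'minimum': '...'}"],
    'linux_requirements': ['[]', '[]', '[]'],
    'windows': [1, 1, 1],
    'mac': [0, 1, 1],
    'linux': [0, 0, 1],
})

missing_counts = {}
for req_col, platform_col in req_to_platform.items():
    # Platform is supported (flag == 1) but no requirements were recorded
    mask = (toy[req_col] == '[]') & (toy[platform_col] == 1)
    missing_counts[platform_col] = int(mask.sum())

print(missing_counts)
```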

In [55]:
print('windows:', steam_data[(steam_data['pc_requirements'] == '[]') & (steam_data['windows'])].shape[0])

windows: 11


11 rows have missing system requirements. We can take a look at some of them below, and follow the links to the steam pages to try and discover if anything is amiss.

In [56]:
missing_windows_requirements = steam_data[(steam_data['pc_requirements'] == '[]') & (steam_data['windows'])]

print_steam_links(missing_windows_requirements[:5])
missing_windows_requirements.head()

Uplink: https://store.steampowered.com/app/1510
Battlestations: Midway: https://store.steampowered.com/app/6870
Grand Theft Auto 2: https://store.steampowered.com/app/12180
Penumbra: Requiem: https://store.steampowered.com/app/22140
Sam & Max 301: The Penal Zone: https://store.steampowered.com/app/31220


Unnamed: 0,name,steam_appid,required_age,pc_requirements,mac_requirements,linux_requirements,developers,publishers,packages,package_groups,categories,genres,achievements,release_date,content_descriptors,windows,mac,linux,price,english
34,Uplink,1510,3,[],[],[],['Introversion Software'],['Introversion Software'],"[112, 14002]","[{'name': 'default', 'title': 'Buy Uplink', 'd...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '23', 'description': 'Indie'}, {'id': ...",{'total': 0},"{'coming_soon': False, 'date': '23 Aug, 2006'}","{'ids': [], 'notes': None}",1,1,1,6.99,1
197,Battlestations: Midway,6870,3,[],[],[],['Eidos Interactive'],['Square Enix'],[284],"[{'name': 'default', 'title': 'Buy Battlestati...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '15 Mar, 2007'}","{'ids': [], 'notes': None}",1,0,0,4.99,1
346,Grand Theft Auto 2,12180,3,[],[],[],['Rockstar North'],['Rockstar Games'],,[],"[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '4 Jan, 2008'}","{'ids': [], 'notes': None}",1,0,0,0.0,1
549,Penumbra: Requiem,22140,3,[],[],[],['Frictional Games'],['Frictional Games'],,[],"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '25', 'description': 'Adventure'}, {'i...",{'total': 0},"{'coming_soon': False, 'date': ''}","{'ids': [], 'notes': None}",1,1,1,-1.0,1
651,Sam & Max 301: The Penal Zone,31220,3,[],[],[],['Telltale Games'],['Telltale Games'],"[109585, 4172]","[{'name': 'default', 'title': 'Buy Sam & Max 3...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...",{'total': 0},"{'coming_soon': False, 'date': ''}","{'ids': [], 'notes': None}",1,0,0,-1.0,1


There doesn't appear to be any common issue in these rows - some of the games are quite old but that's about it. It may simply be that no requirements were supplied when the games were added to the steam store.

Let's say that the fictional company we're doing analysis for is interested in developing for windows only. We can also assume that a cross-platform game will have similar hardware requirements on each platform it supports. With this in mind we can safely drop both the mac and linux requirements columns, as we already know which games support these operating systems from our cleaned platform columns. That means we can focus on the pc_requirements column, which has information for almost every game in our data.

Now we will take a look at a couple of rows from the dataset to see how the data is stored.

In [57]:
display(steam_data['pc_requirements'].iloc[0])
display(steam_data['pc_requirements'].iloc[2000])
display(steam_data['pc_requirements'].iloc[15000])

"{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</strong> 500 mhz processor, 96mb ram, 16mb video card, Windows XP, Mouse, Keyboard, Internet Connection<br /></p>\\r\\n\\t\\t\\t<p><strong>Recommended:</strong> 800 mhz processor, 128mb ram, 32mb+ video card, Windows XP, Mouse, Keyboard, Internet Connection<br /></p>\\r\\n\\t\\t\\t'}"

'{\'minimum\': \'<strong>Minimum:</strong><br><ul class="bb_ul"><li><strong>OS:</strong> Windows 7, Windows 8<br></li><li><strong>Processor:</strong> Intel Core 2 Duo, AMD Athlon X2, or equal at 1.6GHz or better<br></li><li><strong>Memory:</strong> 2 GB RAM<br></li><li><strong>Graphics:</strong> DirectX 9.0c-compatible, SM 3.0-compatible<br></li><li><strong>DirectX:</strong> Version 9.0c<br></li><li><strong>Storage:</strong> 1 GB available space<br></li><li><strong>Sound Card:</strong> DirectX 9.0c-compatible, 16-bit</li></ul>\', \'recommended\': \'<strong>Recommended:</strong><br><ul class="bb_ul"><li><strong>OS:</strong> Windows 7, Windows 8<br></li><li><strong>Processor:</strong> QuadCore 2.0 GHz +<br></li><li><strong>Memory:</strong> 8 GB RAM<br></li><li><strong>Graphics:</strong> NVIDIA GeForce 8800 GTS or better, 512MB+ VRAM<br></li><li><strong>DirectX:</strong> Version 9.0c<br></li><li><strong>Storage:</strong> 1 GB available space<br></li><li><strong>Sound Card:</strong> Direct

'{\'minimum\': \'<strong>Minimum:</strong><br><ul class="bb_ul"><li><strong>OS:</strong> Microsoft Windows 7<br></li><li><strong>Processor:</strong> 2 GHz CPU<br></li><li><strong>Memory:</strong> 1 GB RAM<br></li><li><strong>DirectX:</strong> Version 9.0c<br></li><li><strong>Storage:</strong> 1 GB available space</li></ul>\', \'recommended\': \'<strong>Recommended:</strong><br><ul class="bb_ul"><li><strong>OS:</strong> Microsoft Windows 7<br></li><li><strong>Processor:</strong> 2 GHz CPU<br></li><li><strong>Memory:</strong> 1 GB RAM<br></li><li><strong>DirectX:</strong> Version 10<br></li><li><strong>Storage:</strong> 1 GB available space</li></ul>\'}'

In short: it's a mess. It looks like the data is stored as a dictionary, as we've seen before. There is definitely a key for 'minimum', but beyond that the structure is hard to make out at a glance. The strings are full of html formatting, which is presumably parsed to display the information on the website. It also looks like there are different categories like Processor and Memory for some, but not all, rows.

Let's take a stab at cleaning out some of the unnecessary formatting and see if it becomes clearer.

By creating a dataframe from a selection of rows, we can easily and quickly make changes using the pandas .str accessor, allowing us to use python string formatting and regular expressions.
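
To see what each pattern contributes, here is a minimal sketch using the standard `re` module on a short made-up string in the same style as the raw column. The three patterns are the ones applied to the real rows; only the sample string is an assumption.

```python
import re

# Made-up snippet: literal backslash sequences (two characters each) plus html tags
raw = r'\r\n\t<p><strong>Minimum:</strong> 96mb ram<br /></p>'

step1 = re.sub(r'\\[rtn]', '', raw)           # drop the literal \r, \n and \t sequences
step2 = re.sub(r'<[pbr]{1,2}>', ' ', step1)   # <p>, <br> and similar become a space
step3 = re.sub(r'<[\/"=\w\s]+>', '', step2)   # strip every remaining tag

print(repr(step3))  # ' Minimum: 96mb ram'
```

Replacing the opening `<p>` and `<br>` tags with a space, rather than nothing, stops neighbouring words from running together once the tags are gone.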

In [58]:
view_requirements = steam_data['pc_requirements'].iloc[[0, 2000, 15000]].copy()

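# regex=True is explicit because newer pandas versions default to literal matching
view_requirements = (view_requirements
                         .str.replace(r'\\[rtn]', '', regex=True)
                         .str.replace(r'<[pbr]{1,2}>', ' ', regex=True)
                         .str.replace(r'<[\/"=\w\s]+>', '', regex=True)
                    )

# Series.items() is used as iteritems() has been removed in newer pandas versions
for i, row in view_requirements.items():
    display(row)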

"{'minimum': ' Minimum: 500 mhz processor, 96mb ram, 16mb video card, Windows XP, Mouse, Keyboard, Internet Connection Recommended: 800 mhz processor, 128mb ram, 32mb+ video card, Windows XP, Mouse, Keyboard, Internet Connection'}"

"{'minimum': 'Minimum: OS: Windows 7, Windows 8 Processor: Intel Core 2 Duo, AMD Athlon X2, or equal at 1.6GHz or better Memory: 2 GB RAM Graphics: DirectX 9.0c-compatible, SM 3.0-compatible DirectX: Version 9.0c Storage: 1 GB available space Sound Card: DirectX 9.0c-compatible, 16-bit', 'recommended': 'Recommended: OS: Windows 7, Windows 8 Processor: QuadCore 2.0 GHz + Memory: 8 GB RAM Graphics: NVIDIA GeForce 8800 GTS or better, 512MB+ VRAM DirectX: Version 9.0c Storage: 1 GB available space Sound Card: DirectX 9.0c-compatible, 16-bit'}"

"{'minimum': 'Minimum: OS: Microsoft Windows 7 Processor: 2 GHz CPU Memory: 1 GB RAM DirectX: Version 9.0c Storage: 1 GB available space', 'recommended': 'Recommended: OS: Microsoft Windows 7 Processor: 2 GHz CPU Memory: 1 GB RAM DirectX: Version 10 Storage: 1 GB available space'}"

We can now see more clearly the contents and structure of these rows. Some rows have both Minimum and Recommended requirements inside a 'minimum' key, some have separate 'minimum' and 'recommended' keys. Some have headings like 'Processor:' and 'Storage:' before various components, others simply have a list of components. Some state particular speeds for components, like '2 GHz CPU', while others name specific models, like 'Intel Core 2 Duo'.
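
As a rough illustration of the first inconsistency only, the sketch below separates the minimum and recommended text for both layouts. The helper `split_requirements` is hypothetical. It assumes the cleaned string still evaluates to a dictionary (so rows containing '[]' have already been filtered out), and it makes no attempt at the individual components.

```python
from ast import literal_eval


def split_requirements(cleaned):
    """Return (minimum, recommended) text from a cleaned requirements string."""
    parsed = literal_eval(cleaned)
    minimum = parsed.get('minimum', '')
    recommended = parsed.get('recommended', '')

    # Some rows keep both parts under 'minimum', so split on the heading
    if not recommended and 'Recommended:' in minimum:
        minimum, recommended = minimum.split('Recommended:', 1)

    # Drop the leading headings and any surrounding whitespace
    minimum = minimum.replace('Minimum:', '', 1).strip()
    recommended = recommended.replace('Recommended:', '', 1).strip()
    return minimum, recommended


# Shortened, made-up examples of the two layouts
combined = "{'minimum': ' Minimum: 96mb ram Recommended: 128mb ram'}"
separate = "{'minimum': 'Minimum: OS: Windows 7', 'recommended': 'Recommended: OS: Windows 7'}"

print(split_requirements(combined))
print(split_requirements(separate))
```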

It seems like it would be possible to extract individual component information from this data, however it would be a lengthy and complex process requiring the handling of many exceptions and individual cases. Whilst we may wish to tackle this in the future, as it could provide an interesting window into how the demands of gaming have changed over the years, it won't necessarily provide us with useful information for our current objectives.

With that in mind, it seems best to proceed by cleaning the data slightly so it is readable, exporting to an external csv for future use, then removing the columns from our dataframe.

In [59]:
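def process_requirements(df, export=False):
    """Drop system requirements from dataframe, optionally exporting pc requirements first."""
    if export:
        requirements = df[['steam_appid', 'pc_requirements']].copy()
        
        # Skip rows with no requirements listed
        requirements = requirements[requirements['pc_requirements'] != '[]']
        
        # Strip literal escape sequences and html tags (regex=True is needed in newer pandas)
        requirements['requirements_clean'] = (requirements['pc_requirements']
                                                  .str.replace(r'\\[rtn]', '', regex=True)
                                                  .str.replace(r'<[pbr]{1,2}>', ' ', regex=True)
                                                  .str.replace(r'<[\/"=\w\s]+>', '', regex=True)
                                             )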
        
        export_data(requirements, 'requirements_data')
        
    df = df.drop(['pc_requirements', 'mac_requirements', 'linux_requirements'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_descriptions(df)
    df = process_language(df)
    df = process_images(df)
    df = process_info(df)
    df = process_requirements(df, export=True)
    
    return df


steam_data = process(raw_steam_data)
steam_data.head()

Exported requirements data to '../data/exports/steam_requirements_data.csv'


Unnamed: 0,name,steam_appid,required_age,developers,publishers,packages,package_groups,categories,genres,achievements,release_date,content_descriptors,windows,mac,linux,price,english
0,Counter-Strike,10,3,['Valve'],['Valve'],[7],"[{'name': 'default', 'title': 'Buy Counter-Str...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,7.19,1
1,Team Fortress Classic,20,3,['Valve'],['Valve'],[29],"[{'name': 'default', 'title': 'Buy Team Fortre...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,3.99,1
2,Day of Defeat,30,3,['Valve'],['Valve'],[30],"[{'name': 'default', 'title': 'Buy Day of Defe...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'ids': [], 'notes': None}",1,1,1,3.99,1
3,Deathmatch Classic,40,3,['Valve'],['Valve'],[31],"[{'name': 'default', 'title': 'Buy Deathmatch ...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'ids': [], 'notes': None}",1,1,1,3.99,1
4,Half-Life: Opposing Force,50,3,['Gearbox Software'],['Valve'],[32],"[{'name': 'default', 'title': 'Buy Half-Life: ...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'ids': [], 'notes': None}",1,1,1,3.99,1


In [60]:
# verify export
pd.read_csv('../data/exports/steam_requirements_data.csv').head()

Unnamed: 0,steam_appid,pc_requirements,requirements_clean
0,10,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,"{'minimum': ' Minimum: 500 mhz processor, 96mb..."
1,20,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,"{'minimum': ' Minimum: 500 mhz processor, 96mb..."
2,30,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,"{'minimum': ' Minimum: 500 mhz processor, 96mb..."
3,40,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,"{'minimum': ' Minimum: 500 mhz processor, 96mb..."
4,50,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,"{'minimum': ' Minimum: 500 mhz processor, 96mb..."


### Processing developers and publishers

The next two columns, developers and publishers, will most likely contain similar information so we can look at them together. 

We'll start by checking the null counts. The developers column has 111 null values. The publishers column doesn't appear to have any at first, but if we search for lists containing only an empty string (`['']`) we find 227 hidden null values.

In [61]:
print('developers null counts:', steam_data['developers'].isnull().sum())
print('developers empty list counts:', steam_data[steam_data['developers'] == "['']"].shape[0])

print('\npublishers null counts:', steam_data['publishers'].isnull().sum())
print('publishers empty list counts:', steam_data[steam_data['publishers'] == "['']"].shape[0])

developers null counts: 111
developers empty list counts: 0

publishers null counts: 0
publishers empty list counts: 227
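
The two kinds of missing value can be seen side by side on a few made-up entries. This is a sketch for illustration only: replacing the `['']` placeholder with NaN is one possible way to let a single `isnull()` call catch both, and it is not something the pipeline does.

```python
import numpy as np
import pandas as pd

# Made-up values mirroring the two ways a missing entry is stored
toy = pd.Series([np.nan, "['']", "['Valve']", "['']"])

true_nulls = int(toy.isnull().sum())        # only catches NaN
hidden_nulls = int((toy == "['']").sum())   # lists holding a single empty string

# Swapping the placeholder for NaN lets one isnull() call see both kinds
unified = toy.replace("['']", np.nan)
total_missing = int(unified.isnull().sum())

print(true_nulls, hidden_nulls, total_missing)
```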


In [62]:
no_dev = steam_data[steam_data['developers'].isnull()]

print('Total games missing developer:', no_dev.shape[0], '\n')
print_steam_links(no_dev[:5])

no_dev.head()

Total games missing developer: 111 

Tycoon City: New York: https://store.steampowered.com/app/9730
Nikopol: Secrets of the Immortals: https://store.steampowered.com/app/11370
Crash Time 2: https://store.steampowered.com/app/11390
Hunting Unlimited 2010: https://store.steampowered.com/app/12690
18 Wheels of Steel: Extreme Trucker: https://store.steampowered.com/app/33730


Unnamed: 0,name,steam_appid,required_age,developers,publishers,packages,package_groups,categories,genres,achievements,release_date,content_descriptors,windows,mac,linux,price,english
285,Tycoon City: New York,9730,3,,['Retroism'],[34667],"[{'name': 'default', 'title': 'Buy Tycoon City...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '28', 'description': 'Simulation'}]",{'total': 0},"{'coming_soon': False, 'date': '12 Mar, 2008'}","{'ids': [], 'notes': None}",1,0,0,6.99,1
330,Nikopol: Secrets of the Immortals,11370,3,,['Meridian4'],[1930],"[{'name': 'default', 'title': 'Buy Nikopol: Se...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '25', 'description': 'Adventure'}, {'i...",{'total': 0},"{'coming_soon': False, 'date': '30 Jul, 2009'}","{'ids': [], 'notes': None}",1,0,0,3.99,1
331,Crash Time 2,11390,3,,['Meridian4'],[2030],"[{'name': 'default', 'title': 'Buy Crash Time ...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...",{'total': 0},"{'coming_soon': False, 'date': '27 Aug, 2009'}","{'ids': [], 'notes': None}",1,0,0,6.99,1
379,Hunting Unlimited 2010,12690,3,,"['ValuSoft', 'Retroism']","[2680, 17219]","[{'name': 'default', 'title': 'Buy Hunting Unl...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '28', 'description': 'Simulation'}, {'...",{'total': 0},"{'coming_soon': False, 'date': '7 Jul, 2009'}","{'ids': [], 'notes': None}",1,0,0,6.99,1
742,18 Wheels of Steel: Extreme Trucker,33730,3,,"['ValuSoft', 'Play Hard Games']","[2679, 17219]","[{'name': 'default', 'title': 'Buy 18 Wheels o...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '28', 'description': 'Simulation'}]",{'total': 0},"{'coming_soon': False, 'date': '23 Sep, 2009'}","{'ids': [], 'notes': None}",1,0,0,6.99,1


In [63]:
no_pub = steam_data[steam_data['publishers'] == "['']"]

print('Total games missing publisher:', no_pub.shape[0], '\n')
print_steam_links(no_pub[:5])

no_pub.head()

Total games missing publisher: 227 

RIP - Trilogy™: https://store.steampowered.com/app/2540
Vigil: Blood Bitterness™: https://store.steampowered.com/app/2570
Bullet Candy: https://store.steampowered.com/app/6600
AudioSurf: https://store.steampowered.com/app/12900
Everyday Shooter: https://store.steampowered.com/app/16300


Unnamed: 0,name,steam_appid,required_age,developers,publishers,packages,package_groups,categories,genres,achievements,release_date,content_descriptors,windows,mac,linux,price,english
67,RIP - Trilogy™,2540,3,['Elephant Games'],[''],[346],"[{'name': 'default', 'title': 'Buy RIP - Trilo...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '4', 'description': 'Casual'}, {'id': ...",{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2007'}","{'ids': [], 'notes': None}",1,0,0,3.99,1
68,Vigil: Blood Bitterness™,2570,3,['Freegamer'],[''],,[],"[{'id': 2, 'description': 'Single-player'}]","[{'id': '23', 'description': 'Indie'}, {'id': ...",{'total': 0},"{'coming_soon': False, 'date': '29 Jun, 2007'}","{'ids': [], 'notes': None}",1,0,0,0.0,1
190,Bullet Candy,6600,3,['R C Knight'],[''],[258],"[{'name': 'default', 'title': 'Buy Bullet Cand...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '4', 'description': 'Casual'}, {'id': ...","{'total': 20, 'highlighted': [{'name': 'Casual...","{'coming_soon': False, 'date': '14 Feb, 2007'}","{'ids': [], 'notes': None}",1,0,0,2.79,1
385,AudioSurf,12900,3,['Dylan Fitterer'],[''],[636],"[{'name': 'default', 'title': 'Buy AudioSurf',...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '23', 'description': 'Indie'}]","{'total': 19, 'highlighted': [{'name': 'Royal ...","{'coming_soon': False, 'date': '15 Feb, 2008'}","{'ids': [], 'notes': None}",1,0,0,6.99,1
451,Everyday Shooter,16300,3,['Queasy Games'],[''],[724],"[{'name': 'default', 'title': 'Buy Everyday Sh...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '23', 'description': 'Indie'}, {'id': ...",{'total': 0},"{'coming_soon': False, 'date': '8 May, 2008'}","{'ids': [], 'notes': None}",1,0,0,7.19,1


In [64]:
no_dev_or_pub = steam_data[(steam_data['developers'].isnull()) & (steam_data['publishers'] == "['']")]

print('Total games missing developer and publisher:', no_dev_or_pub.shape[0], '\n')
print_steam_links(no_dev_or_pub[:5])

no_dev_or_pub.head()

Total games missing developer and publisher: 73 

Patterns: https://store.steampowered.com/app/218980
PlayClaw 5 - Game Recording and Streaming: https://store.steampowered.com/app/237370
Artemis Spaceship Bridge Simulator: https://store.steampowered.com/app/247350
A Walk in the Dark: https://store.steampowered.com/app/248730
Forge Quest: https://store.steampowered.com/app/249950


Unnamed: 0,name,steam_appid,required_age,developers,publishers,packages,package_groups,categories,genres,achievements,release_date,content_descriptors,windows,mac,linux,price,english
1701,Patterns,218980,3,,[''],,[],"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '4', 'description': 'Casual'}, {'id': ...",{'total': 0},"{'coming_soon': False, 'date': ''}","{'ids': [], 'notes': None}",1,1,0,-1.0,1
2011,PlayClaw 5 - Game Recording and Streaming,237370,3,,[''],[28917],"[{'name': 'default', 'title': 'Buy PlayClaw 5 ...","[{'id': 22, 'description': 'Steam Achievements'}]","[{'id': '52', 'description': 'Audio Production...","{'total': 10, 'highlighted': [{'name': 'Verbal...","{'coming_soon': False, 'date': '10 Sep, 2013'}","{'ids': [], 'notes': None}",1,0,0,29.99,1
2201,Artemis Spaceship Bridge Simulator,247350,3,,[''],"[29600, 31847]","[{'name': 'default', 'title': 'Buy Artemis Spa...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}, {'id': ...",{'total': 0},"{'coming_soon': False, 'date': '16 Sep, 2013'}","{'ids': [], 'notes': None}",1,0,0,4.99,1
2231,A Walk in the Dark,248730,3,,[''],[29907],"[{'name': 'default', 'title': 'Buy A Walk in t...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","{'total': 27, 'highlighted': [{'name': 'Toughe...","{'coming_soon': False, 'date': '7 Nov, 2013'}","{'ids': [], 'notes': None}",1,0,0,4.99,1
2251,Forge Quest,249950,3,,[''],"[30345, 35189]","[{'name': 'default', 'title': 'Buy Forge Quest...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","{'total': 49, 'highlighted': [{'name': 'Papers...","{'coming_soon': False, 'date': '29 May, 2015'}","{'ids': [], 'notes': None}",1,1,1,6.99,1


Options:
- remove rows with missing developer or publisher information
- impute missing information by filling the missing column with the value from the column we do have
- write missing information as 'unknown' or none
- keep everything
- remove rows with both missing developer and publisher information
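
To make the trade-offs between these options concrete, the sketch below applies three of them to a tiny hand-made frame. The frame, its values, and the variable names are illustrative assumptions rather than rows from our data set, and the columns are simplified to plain strings with `None` standing in for missing values.

```python
import pandas as pd

# Toy data: hypothetical rows, not taken from the Steam data set
toy = pd.DataFrame({
    'name': ['Game A', 'Game B', 'Game C', 'Game D'],
    'developer': ['Studio A', None, 'Studio C', None],
    'publisher': ['Studio A', 'Publisher B', None, None],
})

# Option: remove rows missing either developer or publisher
drop_either = toy.dropna(subset=['developer', 'publisher'], how='any')

# Option: remove only rows missing both developer and publisher
drop_both = toy.dropna(subset=['developer', 'publisher'], how='all')

# Option: impute each missing value from the other column, where available
imputed = toy.copy()
imputed['developer'] = imputed['developer'].fillna(imputed['publisher'])
imputed['publisher'] = imputed['publisher'].fillna(toy['developer'])

print('Rows kept (drop either):', len(drop_either))
print('Rows kept (drop both):', len(drop_both))
print(imputed)
```

Imputation keeps the most rows but cannot help where both values are missing, so it would still need to be paired with one of the other options.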

In [65]:
def process_developers_and_publishers(df):
    num_rows = df.shape[0]
    df = df[(df['developers'].notnull()) & (df['publishers'] != "['']")].copy()
    print('Before:', num_rows, '\nAfter:', df.shape[0], '\nRows dropped:', num_rows - df.shape[0])
            
    df['developers'] = df['developers'].apply(lambda x: literal_eval(x))
    df['publishers'] = df['publishers'].apply(lambda x: literal_eval(x))
    
    df['developer'] = df['developers'].apply(lambda x: x[0])
    df['publisher'] = df['publishers'].apply(lambda x: x[0])
    
    df['other_developers'] = df['developers'].apply(lambda x: ', '.join(x[1:]) if len(x) > 1 else np.nan)
    df['other_publishers'] = df['publishers'].apply(lambda x: ', '.join(x[1:]) if len(x) > 1 else np.nan)

    df = df.drop(['developers', 'publishers'], axis=1)
    
    return df

dev_pub_data = process_developers_and_publishers(steam_data)
dev_pub_data[['developer', 'publisher', 'other_developers', 'other_publishers']].head()

Before: 29028 
After: 28763 
Rows dropped: 265


Unnamed: 0,developer,publisher,other_developers,other_publishers
0,Valve,Valve,,
1,Valve,Valve,,
2,Valve,Valve,,
3,Valve,Valve,,
4,Gearbox Software,Valve,,


It may be worth investigating how many rows actually have other developers or publishers, as the other_developers and other_publishers columns are filled with null values for the first few rows.

In [66]:
print('Null counts:\n')

for col in ['developer', 'publisher', 'other_developers', 'other_publishers']:
    print(col + ':', dev_pub_data[col].isnull().sum())

Null counts:

developer: 0
publisher: 0
other_developers: 27002
other_publishers: 27860


It turns out that most games have only one developer and one publisher, so the other_developers and other_publishers columns are almost entirely null and of little use. It may be better to combine each pair of columns into one. We can do this fairly easily using the python join method on a string. By invoking join on a comma and a space, we get the value itself back when the list of developers/publishers holds a single item, and a comma-separated string when it holds several, like so:

In [67]:
', '.join(['one item'])

'one item'

In [68]:
', '.join(['multiple', 'different', 'items'])

'multiple, different, items'
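
In our data these lists are stored as strings, so they have to be evaluated before they can be joined. A minimal sketch of the two steps together, using values written in the same shape as the `developers` column (the helper name `join_names` is made up for illustration):

```python
from ast import literal_eval

def join_names(raw):
    """Turn a stringified list such as "['a', 'b']" into 'a, b'."""
    return ', '.join(literal_eval(raw))

single = join_names("['Valve']")
several = join_names("['Frontier', 'Aspyr (Mac)']")

print(single)
print(several)
```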

We can now modify and finish our function, and will be ready to move on to the next column.

In [69]:
def process_developers_and_publishers(df):
    df = df[(df['developers'].notnull()) & (df['publishers'] != "['']")].copy()
            
    df['developers'] = df['developers'].apply(lambda x: literal_eval(x))
    df['publishers'] = df['publishers'].apply(lambda x: literal_eval(x))
    
    df['developer'] = df['developers'].apply(lambda x: ', '.join(x))
    df['publisher'] = df['publishers'].apply(lambda x: ', '.join(x))

    df = df.drop(['developers', 'publishers'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_descriptions(df)
    df = process_language(df)
    df = process_images(df)
    df = process_info(df)
    df = process_requirements(df)
    df = process_developers_and_publishers(df)
    
    return df

steam_data = process(raw_steam_data)
steam_data.head()

Unnamed: 0,name,steam_appid,required_age,packages,package_groups,categories,genres,achievements,release_date,content_descriptors,windows,mac,linux,price,english,developer,publisher
0,Counter-Strike,10,3,[7],"[{'name': 'default', 'title': 'Buy Counter-Str...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,7.19,1,Valve,Valve
1,Team Fortress Classic,20,3,[29],"[{'name': 'default', 'title': 'Buy Team Fortre...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,3.99,1,Valve,Valve
2,Day of Defeat,30,3,[30],"[{'name': 'default', 'title': 'Buy Day of Defe...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
3,Deathmatch Classic,40,3,[31],"[{'name': 'default', 'title': 'Buy Deathmatch ...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,3,[32],"[{'name': 'default', 'title': 'Buy Half-Life: ...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Gearbox Software,Valve


### Processing Packages

We are not especially interested in the `packages` and `package_groups` columns, except for rows where price data is missing (which we filled with -1 earlier). We can now easily investigate these rows. Overall we have 811 rows with missing price data.

In [70]:
print(steam_data[steam_data['price'] == -1].shape[0])

811


We can split these rows into two categories: those with package_groups data and those without. If we take a quick look at the package_groups column we see that there are no null values, but rows without data are stored as empty lists.

In [71]:
print('Null counts:', steam_data['package_groups'].isnull().sum())
print('Empty list counts:', steam_data[steam_data['package_groups'] == "[]"].shape[0])

Null counts: 0
Empty list counts: 3307
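
The comparison against the string `"[]"` works because this column holds text rather than Python lists, which is also why `literal_eval` is needed elsewhere in this notebook. A small sketch of the distinction, where `has_packages` is a hypothetical helper and the non-empty value is invented for illustration:

```python
from ast import literal_eval

raw_empty = "[]"
raw_full = "[{'name': 'default', 'title': 'Buy Example Game'}]"

def has_packages(raw):
    """Return True when the stringified list contains at least one entry."""
    return len(literal_eval(raw)) > 0

print(raw_empty == [])          # False: a string never equals a list
print(raw_empty == "[]")        # True: plain string comparison
print(has_packages(raw_empty))  # False
print(has_packages(raw_full))   # True
```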


Using a combination of filters, we can find out how many rows are missing both price and package_groups data, and investigate them.

In [72]:
missing_price_and_package = steam_data[(steam_data['price'] == -1) & (steam_data['package_groups'] == "[]")]

print('Number of rows:', missing_price_and_package.shape[0], '\n')

print('First few rows:\n')
print_steam_links(missing_price_and_package[:5])

print('\nLast few rows:\n')
print_steam_links(missing_price_and_package[-10:-5])

missing_price_and_package.head()

Number of rows: 774 

First few rows:

RollerCoaster Tycoon® 3: Platinum: https://store.steampowered.com/app/2700
Beijing 2008™ - The Official Video Game of the Olympic Games: https://store.steampowered.com/app/10520
LUMINES™ Advance Pack: https://store.steampowered.com/app/11920
Midnight Club 2: https://store.steampowered.com/app/12160
Age of Booty™: https://store.steampowered.com/app/21600

Last few rows:

RoboVirus: https://store.steampowered.com/app/1001870
soko loco deluxe: https://store.steampowered.com/app/1003730
POCKET CAR : VRGROUND: https://store.steampowered.com/app/1004710
The Princess, the Stray Cat, and Matters of the Heart: https://store.steampowered.com/app/1010600
Mr Boom's Firework Factory: https://store.steampowered.com/app/1013670


Unnamed: 0,name,steam_appid,required_age,packages,package_groups,categories,genres,achievements,release_date,content_descriptors,windows,mac,linux,price,english,developer,publisher
75,RollerCoaster Tycoon® 3: Platinum,2700,3,,[],"[{'id': 2, 'description': 'Single-player'}]","[{'id': '28', 'description': 'Simulation'}, {'...",{'total': 0},"{'coming_soon': False, 'date': '12 Mar, 2008'}","{'ids': [], 'notes': None}",1,1,0,-1.0,1,"Frontier, Aspyr (Mac)","Atari, Aspyr (Mac)"
311,Beijing 2008™ - The Official Video Game of the...,10520,3,,[],"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '18', 'description': 'Sports'}]",{'total': 0},"{'coming_soon': False, 'date': '14 Aug, 2008'}","{'ids': [], 'notes': None}",1,0,0,-1.0,1,Eurocom,SEGA
337,LUMINES™ Advance Pack,11920,3,,[],"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]",{'total': 0},"{'coming_soon': False, 'date': '18 Apr, 2008'}","{'ids': [], 'notes': None}",1,0,0,-1.0,1,Q Entertainment Inc.,Q Entertainment Inc.
344,Midnight Club 2,12160,3,,[],"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '9', 'description': 'Racing'}]",{'total': 0},"{'coming_soon': False, 'date': '4 Jan, 2008'}","{'ids': [], 'notes': None}",1,0,0,-1.0,1,Rockstar San Diego,Rockstar Games
536,Age of Booty™,21600,3,,[],"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}, {'id': ...",{'total': 0},"{'coming_soon': False, 'date': '9 Mar, 2009'}","{'ids': [], 'notes': None}",1,0,0,-1.0,1,Certain Affinity™,Capcom


Most of the games with missing price data fall into this category. Judging by the store pages for the first few rows, these games are currently unavailable or have been delisted from the store. Judging by the last few rows, most of them haven't been released yet and haven't had a price set. We will take care of all unreleased games when we clean the release_date column, but we can remove all of these apps now.

Let's now take a look at the apps that have missing price data but do have package_groups data.

In [73]:
missing_price_have_package = steam_data.loc[(steam_data['price'] == -1) & (steam_data['package_groups'] != "[]"), ['name', 'steam_appid', 'package_groups', 'price']]

print('Number of rows:', missing_price_have_package.shape[0], '\n')

print('First few rows:\n')
print_steam_links(missing_price_have_package[:5])

print('\nLast few rows:\n')
print_steam_links(missing_price_have_package[-10:-5])

display(missing_price_have_package.head())
missing_price_have_package.iloc[-10:-5]

Number of rows: 37 

First few rows:

The Ship: Single Player: https://store.steampowered.com/app/2420
BioShock™: https://store.steampowered.com/app/7670
Sam & Max 101: Culture Shock: https://store.steampowered.com/app/8200
Sam & Max 102: Situation: Comedy: https://store.steampowered.com/app/8210
Sam & Max 103: The Mole, the Mob and the Meatball: https://store.steampowered.com/app/8220

Last few rows:

Viscera Cleanup Detail: Shadow Warrior: https://store.steampowered.com/app/255520
Space Hulk: Deathwing: https://store.steampowered.com/app/298900
7,62 Hard Life: https://store.steampowered.com/app/306290
Letter Quest: Grimm's Journey: https://store.steampowered.com/app/328730
Rad Rodgers: World One: https://store.steampowered.com/app/353580


Unnamed: 0,name,steam_appid,package_groups,price
63,The Ship: Single Player,2420,"[{'name': 'default', 'title': 'Buy The Ship: S...",-1.0
220,BioShock™,7670,"[{'name': 'default', 'title': 'Buy BioShock™',...",-1.0
234,Sam & Max 101: Culture Shock,8200,"[{'name': 'default', 'title': 'Buy Sam & Max 1...",-1.0
235,Sam & Max 102: Situation: Comedy,8210,"[{'name': 'default', 'title': 'Buy Sam & Max 1...",-1.0
236,"Sam & Max 103: The Mole, the Mob and the Meatball",8220,"[{'name': 'default', 'title': 'Buy Sam & Max 1...",-1.0


Unnamed: 0,name,steam_appid,package_groups,price
2421,Viscera Cleanup Detail: Shadow Warrior,255520,"[{'name': 'default', 'title': 'Buy Viscera Cle...",-1.0
3576,Space Hulk: Deathwing,298900,"[{'name': 'default', 'title': 'Buy Space Hulk:...",-1.0
3811,"7,62 Hard Life",306290,"[{'name': 'default', 'title': 'Buy 7,62 Hard L...",-1.0
4504,Letter Quest: Grimm's Journey,328730,"[{'name': 'default', 'title': ""Buy Letter Ques...",-1.0
5514,Rad Rodgers: World One,353580,"[{'name': 'default', 'title': 'Buy Rad Rodgers...",-1.0


Looking at a selection of these rows, the games appear to be one of the following: superseded by a newer release or remaster, part of a bigger bundle of games or an episodic series, or included with the purchase of another game.

Whilst we could extract prices from the package_groups data, the most sensible option seems to be removing these rows. Since our logic interacts heavily with the price data we will rewrite the process_price function rather than putting this logic inside its own function.
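
For reference, extracting a price from package_groups might look something like the sketch below. It rests on an assumption that the truncated output above does not confirm: that each group holds a `subs` list whose entries carry a `price_in_cents_with_discount` field. The helper name and the example value are both hypothetical.

```python
from ast import literal_eval

def price_from_package_groups(raw):
    """Return the cheapest package price in pounds, or None if unavailable.

    ASSUMPTION: each group has a 'subs' list whose entries include a
    'price_in_cents_with_discount' field. Check a full, untruncated
    value before relying on this layout.
    """
    prices = [
        sub['price_in_cents_with_discount']
        for group in literal_eval(raw)
        for sub in group.get('subs', [])
        if 'price_in_cents_with_discount' in sub
    ]
    return min(prices) / 100 if prices else None

# Hypothetical value shaped like the assumed layout
example = (
    "[{'name': 'default', 'title': 'Buy Example Game', 'subs': ["
    "{'packageid': 1, 'price_in_cents_with_discount': 499}, "
    "{'packageid': 2, 'price_in_cents_with_discount': 1499}]}]"
)

cheapest = price_from_package_groups(example)
nothing = price_from_package_groups("[]")
print(cheapest, nothing)
```

Even if the layout holds, a price recovered this way would describe a bundle or a discounted package rather than the game itself, which is part of why removing these rows is the simpler choice.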

In [74]:
def process_price(df):
    """Process price_overview column into formatted price column, and take care of package columns."""
    df = df.copy()
    
    def parse_price(x):
        try:
            return literal_eval(x)
        except ValueError:
            return {'currency': 'GBP', 'initial': -1}
    
    # evaluate as dictionary and set to -1 if missing
    df['price_overview'] = df['price_overview'].apply(parse_price)
    
    # create columns from currency and initial values
    df['currency'] = df['price_overview'].apply(lambda x: x['currency'])
    df['price'] = df['price_overview'].apply(lambda x: x['initial'])
    
    # set price of free games to 0
    df.loc[df['is_free'], 'price'] = 0
    
    # remove non-GBP rows
    df = df[df['currency'] == 'GBP']
    
    # remove rows where price is -1
    df = df[df['price'] != -1]
    
    # change price to display in pounds (can apply to all now -1 rows removed)
    df['price'] /= 100
    
    # remove columns no longer needed
    df = df.drop(['is_free', 'currency', 'price_overview', 'packages', 'package_groups'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_descriptions(df)
    df = process_language(df)
    df = process_images(df)
    df = process_info(df)
    df = process_requirements(df)
    df = process_developers_and_publishers(df)
    
    return df

steam_data = process(raw_steam_data)
steam_data.head()

Unnamed: 0,name,steam_appid,required_age,categories,genres,achievements,release_date,content_descriptors,windows,mac,linux,price,english,developer,publisher
0,Counter-Strike,10,3,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,7.19,1,Valve,Valve
1,Team Fortress Classic,20,3,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,3.99,1,Valve,Valve
2,Day of Defeat,30,3,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
3,Deathmatch Classic,40,3,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,3,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Gearbox Software,Valve
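
The fallback inside `parse_price` can be checked in isolation. The sketch below copies that inner function and calls it with a well-formed string and with a missing value, represented here by `float('nan')` on the assumption that this is how missing entries appear in the raw column. The price string itself is an invented example.

```python
from ast import literal_eval

def parse_price(x):
    """Evaluate a stringified dict, falling back to a placeholder price of -1."""
    try:
        return literal_eval(x)
    except ValueError:
        # literal_eval raises ValueError for non-string input such as NaN
        return {'currency': 'GBP', 'initial': -1}

present = parse_price("{'currency': 'GBP', 'initial': 719}")
missing = parse_price(float('nan'))

print(present['initial'], missing['initial'])
```

Note that only `ValueError` is caught: a string that is not valid Python syntax would raise `SyntaxError` instead and stop the run, which may or may not be the desired behaviour.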


### Processing Categories and Genres

First we check how many rows are missing categories or genres, to help decide whether those rows should be dropped.

In [75]:
print(steam_data['categories'].isnull().sum())

509


In [76]:
print(steam_data['categories'][0])

with pd.option_context("display.max_colwidth", 1000):
    display(steam_data['categories'].head())

[{'id': 1, 'description': 'Multi-player'}, {'id': 36, 'description': 'Online Multi-Player'}, {'id': 37, 'description': 'Local Multi-Player'}, {'id': 8, 'description': 'Valve Anti-Cheat enabled'}]


0    [{'id': 1, 'description': 'Multi-player'}, {'id': 36, 'description': 'Online Multi-Player'}, {'id': 37, 'description': 'Local Multi-Player'}, {'id': 8, 'description': 'Valve Anti-Cheat enabled'}]
1    [{'id': 1, 'description': 'Multi-player'}, {'id': 36, 'description': 'Online Multi-Player'}, {'id': 37, 'description': 'Local Multi-Player'}, {'id': 8, 'description': 'Valve Anti-Cheat enabled'}]
2                                                                                                       [{'id': 1, 'description': 'Multi-player'}, {'id': 8, 'description': 'Valve Anti-Cheat enabled'}]
3    [{'id': 1, 'description': 'Multi-player'}, {'id': 36, 'description': 'Online Multi-Player'}, {'id': 37, 'description': 'Local Multi-Player'}, {'id': 8, 'description': 'Valve Anti-Cheat enabled'}]
4                                                            [{'id': 2, 'description': 'Single-player'}, {'id': 1, 'description': 'Multi-player'}, {'id': 8, 'description': 'Valve Anti-Cheat enable

In [77]:
print_steam_links(steam_data[steam_data['categories'].isnull()].tail(20))

MOTiON by RADiCAL: https://store.steampowered.com/app/999900
The Marvellous Machine: https://store.steampowered.com/app/1000510
iDancer: https://store.steampowered.com/app/1004740
SubnetPing: https://store.steampowered.com/app/1008160
YouTube Center: https://store.steampowered.com/app/1009330
Discord Bot - Controls: https://store.steampowered.com/app/1010170
Wallpaper Maker （造物主视频桌面）: https://store.steampowered.com/app/1010800
Nero GameVR: https://store.steampowered.com/app/1011110
Greenland Melting: https://store.steampowered.com/app/1012510
VEGAS Movie Studio 16 Steam Edition: https://store.steampowered.com/app/1016810
VEGAS Movie Studio 16 Platinum Steam Edition: https://store.steampowered.com/app/1016840
Planet Evolution PC Live Wallpaper: https://store.steampowered.com/app/1017060
Screenbits - Screen Recorder: https://store.steampowered.com/app/1018680
Wondershare Video Converter Ultimate: https://store.steampowered.com/app/1025020
ACID Music Studio 11 Steam Edition: https://store

In [78]:
print(steam_data['genres'].isnull().sum())

37


In [79]:
print(steam_data['genres'][0])

with pd.option_context("display.max_colwidth", 1000):
    display(steam_data['genres'].iloc[100:105])

[{'id': '1', 'description': 'Action'}]


121    [{'id': '2', 'description': 'Strategy'}, {'id': '4', 'description': 'Casual'}]
122                                            [{'id': '4', 'description': 'Casual'}]
123                                            [{'id': '4', 'description': 'Casual'}]
124                                          [{'id': '2', 'description': 'Strategy'}]
125                                            [{'id': '4', 'description': 'Casual'}]
Name: genres, dtype: object

In [80]:
print_steam_links(steam_data[steam_data['genres'].isnull()].head(10))
print_steam_links(steam_data[steam_data['genres'].isnull()].tail(10))

Hot Dish: https://store.steampowered.com/app/12570
Dr. Daisy Pet Vet: https://store.steampowered.com/app/12580
Call of Cthulhu®: Dark Corners of the Earth: https://store.steampowered.com/app/22340
Super Granny Collection: https://store.steampowered.com/app/36270
Sacrifice: https://store.steampowered.com/app/38440
Nancy Drew® Dossier: Resorting to Danger!: https://store.steampowered.com/app/42200
Air Forte: https://store.steampowered.com/app/55020
Sonic Adventure DX: https://store.steampowered.com/app/71250
Portal 2 - The Final Hours: https://store.steampowered.com/app/104600
Sonic CD: https://store.steampowered.com/app/200940
EatWell: https://store.steampowered.com/app/678870
No Lights: https://store.steampowered.com/app/682910
Cyborg Arena: https://store.steampowered.com/app/706440
M.I.A. - Overture: https://store.steampowered.com/app/712060
VEHICLES FURY: https://store.steampowered.com/app/749290
The Big Three: https://store.steampowered.com/app/823390
BlueberryNOVA: https://store.st

In [81]:
steam_data[(steam_data['genres'].isnull()) | (steam_data['categories'].isnull())]

Unnamed: 0,name,steam_appid,required_age,categories,genres,achievements,release_date,content_descriptors,windows,mac,linux,price,english,developer,publisher
371,Hot Dish,12570,3,"[{'id': 2, 'description': 'Single-player'}]",,{'total': 0},"{'coming_soon': False, 'date': '29 Jul, 2008'}","{'ids': [], 'notes': None}",1,0,0,5.99,1,Zemnott,ValuSoft
372,Dr. Daisy Pet Vet,12580,3,"[{'id': 2, 'description': 'Single-player'}]",,{'total': 0},"{'coming_soon': False, 'date': '29 Jul, 2008'}","{'ids': [], 'notes': None}",1,0,0,5.99,1,Zemnott,ValuSoft
404,Tom Clancy's Ghost Recon® Island Thunder™,13630,3,,"[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '15 Jul, 2008'}","{'ids': [], 'notes': None}",1,0,0,4.29,1,Red Storm Entertainment,Ubisoft
557,Call of Cthulhu®: Dark Corners of the Earth,22340,3,"[{'id': 2, 'description': 'Single-player'}]",,{'total': 0},"{'coming_soon': False, 'date': '16 Jun, 2009'}","{'ids': [], 'notes': None}",1,0,0,3.99,1,Headfirst Productions,Bethesda Softworks
789,Westward Collection,36150,3,,"[{'id': '4', 'description': 'Casual'}]",{'total': 0},"{'coming_soon': False, 'date': '17 Jul, 2009'}","{'ids': [], 'notes': None}",1,0,0,10.99,1,Sandlot Games,Sandlot Games
793,Super Granny Collection,36270,3,"[{'id': 2, 'description': 'Single-player'}]",,{'total': 0},"{'coming_soon': False, 'date': '17 Jul, 2009'}","{'ids': [], 'notes': None}",1,0,0,10.99,1,Sandlot Games,Sandlot Games
846,Sacrifice,38440,3,"[{'id': 2, 'description': 'Single-player'}]",,{'total': 0},"{'coming_soon': False, 'date': '19 Aug, 2009'}","{'ids': [], 'notes': None}",1,0,0,6.99,1,Shiny Entertainment,Interplay Inc.
866,Painkiller: Black Edition,39530,3,,"[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '24 Jan, 2007'}","{'ids': [2, 5], 'notes': None}",1,0,0,8.99,1,People Can Fly,THQ Nordic
921,Nancy Drew® Dossier: Resorting to Danger!,42200,3,"[{'id': 2, 'description': 'Single-player'}]",,{'total': 0},"{'coming_soon': False, 'date': '19 Nov, 2009'}","{'ids': [], 'notes': None}",1,0,0,5.19,1,HeR Interactive,HeR Interactive
1029,Might & Magic: Heroes VI,48220,3,,"[{'id': '3', 'description': 'RPG'}, {'id': '2'...",{'total': 0},"{'coming_soon': False, 'date': '13 Oct, 2011'}","{'ids': [], 'notes': None}",1,0,0,16.99,1,Blackhole,Ubisoft


In [82]:
def process_categories(df, export=False):
    df = df[df['categories'].notnull()].copy()
    
    if export:
        category_data = df[['steam_appid', 'categories']].copy()

        category_data['categories'] = category_data['categories'].apply(lambda x: [item['description'] for item in literal_eval(x)])

        cols = set(list(itertools.chain(*category_data['categories'])))
        
        for col in sorted(cols):
            col_name = 'c_' + (col.lower()
                                  .replace('-', '_')
                                  .replace(' ', '_')
                                  .replace('(', '')
                                  .replace(')', '')
                                  .replace('/', '_or_')
                              )
            category_data[col_name] = category_data['categories'].apply(lambda x: 1 if col in x else 0)
        
        category_data = category_data.drop('categories', axis=1)
        export_data(category_data, 'category_data')
    
    df = df.drop('categories', axis=1)
    
    return df


def process_genres(df, export=False):
    df = df[df['genres'].notnull()].copy()
    
    if export:
        genre_data = df[['steam_appid', 'genres']].copy()

        genre_data['genres'] = genre_data['genres'].apply(lambda x: [item['description'] for item in literal_eval(x)])
        
        cols = set(list(itertools.chain(*genre_data['genres'])))

        for col in sorted(cols):
            col_name = 'g_' + (col.lower()
                            .replace(' ', '_')
                            .replace('&', 'and')
                       )
            genre_data[col_name] = genre_data['genres'].apply(lambda x: 1 if col in x else 0)

        genre_data = genre_data.drop('genres', axis=1)            
        export_data(genre_data, 'genre_data')
        
    df = df.drop('genres', axis=1)
    
    return df


process_categories(steam_data, export=True).head()
process_genres(steam_data, export=True).head()

Exported category data to '../data/exports/steam_category_data.csv'
Exported genre data to '../data/exports/steam_genre_data.csv'


Unnamed: 0,name,steam_appid,required_age,categories,achievements,release_date,content_descriptors,windows,mac,linux,price,english,developer,publisher
0,Counter-Strike,10,3,"[{'id': 1, 'description': 'Multi-player'}, {'i...",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,7.19,1,Valve,Valve
1,Team Fortress Classic,20,3,"[{'id': 1, 'description': 'Multi-player'}, {'i...",{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,3.99,1,Valve,Valve
2,Day of Defeat,30,3,"[{'id': 1, 'description': 'Multi-player'}, {'i...",{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
3,Deathmatch Classic,40,3,"[{'id': 1, 'description': 'Multi-player'}, {'i...",{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,3,"[{'id': 2, 'description': 'Single-player'}, {'...",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Gearbox Software,Valve
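
The core of both functions is turning a list of descriptions into one 0/1 column per description. The sketch below reproduces that idea with plain Python on two rows written in the same shape as the `categories` column; the second row is a shortened, hypothetical value, and the variable names are for illustration only.

```python
import itertools
from ast import literal_eval

rows = [
    "[{'id': 1, 'description': 'Multi-player'}, {'id': 8, 'description': 'Valve Anti-Cheat enabled'}]",
    "[{'id': 2, 'description': 'Single-player'}, {'id': 1, 'description': 'Multi-player'}]",
]

# Keep only the description of each entry
parsed = [[item['description'] for item in literal_eval(row)] for row in rows]

# Flatten every row into one sequence and keep the unique labels
labels = sorted(set(itertools.chain(*parsed)))

# One indicator per label per row: 1 if the row has the label, else 0
indicators = [{label: int(label in row) for label in labels} for row in parsed]

print(labels)
for row in indicators:
    print(row)
```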


In [83]:
def expand_columns(df, col):
    df[col] = df[col].apply(lambda x: [item['description'] for item in literal_eval(x)])
    new_cols = set(list(itertools.chain(*df[col])))
    
    for new_col in sorted(new_cols):
        new_col_name = (new_col.lower()
                               .replace('-', '_')
                               .replace(' ', '_')
                               .replace('(', '')
                               .replace(')', '')
                               .replace('/', '_or_')
                               .replace('&', 'and')
                       )
        df[new_col_name] = df[col].apply(lambda x: 1 if new_col in x else 0)
            
    return df.drop(col, axis=1)


def process_categories(df, export=False):
    df = df[df['categories'].notnull()].copy()
    
    category_data = df[['steam_appid', 'categories']].copy()
    category_data = expand_columns(category_data, 'categories')
    
    if export:
        export_data(category_data, 'category_data')
    
    df = df.drop('categories', axis=1)
    
    return df


def process_genres(df, export=False):
    df = df[df['genres'].notnull()].copy()
    
    genre_data = df[['steam_appid', 'genres']].copy()
    genre_data = expand_columns(genre_data, 'genres')
        
    if export:    
        export_data(genre_data, 'genre_data')
        
    df = df.drop('genres', axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_descriptions(df)
    df = process_language(df)
    df = process_images(df)
    df = process_info(df)
    df = process_requirements(df)
    df = process_developers_and_publishers(df)
    df = process_categories(df, export=True)
    df = process_genres(df, export=True)
    
    return df

steam_data = process(raw_steam_data)
steam_data.head()

Exported category data to '../data/exports/steam_category_data.csv'
Exported genre data to '../data/exports/steam_genre_data.csv'


Unnamed: 0,name,steam_appid,required_age,achievements,release_date,content_descriptors,windows,mac,linux,price,english,developer,publisher
0,Counter-Strike,10,3,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,7.19,1,Valve,Valve
1,Team Fortress Classic,20,3,{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,3.99,1,Valve,Valve
2,Day of Defeat,30,3,{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
3,Deathmatch Classic,40,3,{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,3,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Gearbox Software,Valve
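
The name clean-up inside `expand_columns` can also be checked on its own. The sketch below applies the same chain of replacements to three descriptions. 'Multi-player' appears in the data above, while the other two are assumed spellings inferred from the exported column names, and `to_column_name` is a hypothetical helper.

```python
def to_column_name(description):
    """Apply the same replacements as expand_columns to a single label."""
    return (description.lower()
                       .replace('-', '_')
                       .replace(' ', '_')
                       .replace('(', '')
                       .replace(')', '')
                       .replace('/', '_or_')
                       .replace('&', 'and'))

examples = ['Multi-player', 'Shared/Split Screen', 'Animation & Modeling']
cleaned = [to_column_name(label) for label in examples]

for label, name in zip(examples, cleaned):
    print(label, '->', name)
```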


In [84]:
pd.read_csv('../data/exports/steam_category_data.csv').head()

Unnamed: 0,steam_appid,captions_available,co_op,commentary_available,cross_platform_multiplayer,full_controller_support,in_app_purchases,includes_source_sdk,includes_level_editor,local_co_op,local_multi_player,mmo,mods,mods_require_hl2,multi_player,online_co_op,online_multi_player,partial_controller_support,shared_or_split_screen,single_player,stats,steam_achievements,steam_cloud,steam_leaderboards,steam_trading_cards,steam_turn_notifications,steam_workshop,steamvr_collectibles,vr_support,valve_anti_cheat_enabled
0,10,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
1,20,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
2,30,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,40,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
4,50,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1


In [85]:
pd.read_csv('../data/exports/steam_genre_data.csv').head()

Unnamed: 0,steam_appid,accounting,action,adventure,animation_and_modeling,audio_production,casual,design_and_illustration,documentary,early_access,education,free_to_play,game_development,gore,indie,massively_multiplayer,nudity,photo_editing,rpg,racing,sexual_content,simulation,software_training,sports,strategy,tutorial,utilities,video_production,violent,web_publishing
0,10,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,20,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,30,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,40,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,50,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


### Achievements and Content Descriptors

In [86]:
steam_data.head()

Unnamed: 0,name,steam_appid,required_age,achievements,release_date,content_descriptors,windows,mac,linux,price,english,developer,publisher
0,Counter-Strike,10,3,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,7.19,1,Valve,Valve
1,Team Fortress Classic,20,3,{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,3.99,1,Valve,Valve
2,Day of Defeat,30,3,{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
3,Deathmatch Classic,40,3,{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,3,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Gearbox Software,Valve


In [87]:
steam_data['achievements'].isnull().sum()

1855

In [88]:
literal_eval(steam_data['achievements'][9])

{'total': 33,
 'highlighted': [{'name': 'Defiant',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_hit_cancop_withcan.jpg'},
  {'name': 'Submissive',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_put_canintrash.jpg'},
  {'name': 'Malcontent',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_escape_apartmentraid.jpg'},
  {'name': 'What cat?',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_break_miniteleporter.jpg'},
  {'name': 'Trusty Hardware',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_get_crowbar.jpg'},
  {'name': 'Barnacle Bowling',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_kill_barnacleswithbarrel.jpg'},
  {'name': "Anchor's Aweigh!",
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_get_airboat.jpg'},
  {'nam

In [89]:
steam_data['content_descriptors'].isnull().sum()

0

In [90]:
steam_data['content_descriptors'].value_counts().head(6)

{'ids': [], 'notes': None}                                                                                                                                                                  25394
{'ids': [2, 5], 'notes': None}                                                                                                                                                                427
{'ids': [1, 5], 'notes': None}                                                                                                                                                                251
{'ids': [5], 'notes': None}                                                                                                                                                                   127
{'ids': [1, 2, 5], 'notes': None}                                                                                                                                                             122
{'ids': [2, 5], 'notes': 'This

In [91]:
def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    df = df.drop(['achievements', 'content_descriptors'], axis=1)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_descriptions(df)
    df = process_language(df)
    df = process_images(df)
    df = process_info(df)
    df = process_requirements(df)
    df = process_developers_and_publishers(df)
    df = process_categories(df)
    df = process_genres(df)
    
    return df

steam_data = process(raw_steam_data)
steam_data.head()

Unnamed: 0,name,steam_appid,required_age,release_date,windows,mac,linux,price,english,developer,publisher
0,Counter-Strike,10,3,"{'coming_soon': False, 'date': '1 Nov, 2000'}",1,1,1,7.19,1,Valve,Valve
1,Team Fortress Classic,20,3,"{'coming_soon': False, 'date': '1 Apr, 1999'}",1,1,1,3.99,1,Valve,Valve
2,Day of Defeat,30,3,"{'coming_soon': False, 'date': '1 May, 2003'}",1,1,1,3.99,1,Valve,Valve
3,Deathmatch Classic,40,3,"{'coming_soon': False, 'date': '1 Jun, 2001'}",1,1,1,3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,3,"{'coming_soon': False, 'date': '1 Nov, 1999'}",1,1,1,3.99,1,Gearbox Software,Valve


### Processing Release Date

The final column to clean, release date, provides some interesting optimisation and learning challenges. We've encountered some columns with a similar structure already, so we can use what we've learned so far, but now we have some dates to handle.

First we shall inspect the raw format of the column. As we can see below, it is stored as a dictionary-like string object containing values for `coming_soon` and `date`. From the first few rows it would appear that the dates are stored in a uniform format - day as an integer, month as a 3-character string abbreviation, a comma, then the year as a four-digit number. We can parse this either using the python built-in datetime module, or as we already have pandas imported, we can use the [pd.to_datetime](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) function.
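As a minimal sketch of the first option, the built-in `datetime.strptime` can parse the layout seen in the first few rows. The sample string is copied from the first row. The format string is our assumption about the layout, and `%b` relies on English month abbreviations being available in the current locale.

```python
from datetime import datetime

# Sample value copied from the first row of the data
sample = '1 Nov, 2000'

# %d = day of month, %b = abbreviated month name, %Y = four-digit year
parsed = datetime.strptime(sample, '%d %b, %Y')

print(parsed.year, parsed.month, parsed.day)
```

Unlike `pd.to_datetime`, this works on one string at a time, so it would need to be applied row by row.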

Also, as our analysis will involve looking at ownership and sales data, looking at games that are not released yet will not be useful to us. Intuitively, we can drop any titles which are marked as coming soon, presumably having this value set to true. As a side note, once parsed it may be worth checking that no release dates in our data are beyond the current date, just to make doubly sure none slip through.
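The future-date check mentioned above could be sketched as follows. The `example` dataframe and its values are hypothetical and only stand in for the cleaned data once `release_date` has been parsed into datetimes.

```python
import pandas as pd

# Hypothetical, already-parsed release dates for illustration only
example = pd.DataFrame({
    'name': ['Released Game', 'Future Game'],
    'release_date': pd.to_datetime(['2000-11-01', '2100-01-01']),
})

# Compare against today's date (normalised to midnight)
today = pd.Timestamp.today().normalize()
future_releases = example[example['release_date'] > today]

print(future_releases['name'].tolist())
```

An empty result on the real data would confirm that no unreleased titles slipped through.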

In [92]:
display(raw_steam_data['price_overview'][0])
display(raw_steam_data['release_date'][0])

"{'currency': 'GBP', 'initial': 719, 'final': 719, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': '£7.19'}"

"{'coming_soon': False, 'date': '1 Nov, 2000'}"

In [93]:
steam_data[['name', 'release_date']].head()

Unnamed: 0,name,release_date
0,Counter-Strike,"{'coming_soon': False, 'date': '1 Nov, 2000'}"
1,Team Fortress Classic,"{'coming_soon': False, 'date': '1 Apr, 1999'}"
2,Day of Defeat,"{'coming_soon': False, 'date': '1 May, 2003'}"
3,Deathmatch Classic,"{'coming_soon': False, 'date': '1 Jun, 2001'}"
4,Half-Life: Opposing Force,"{'coming_soon': False, 'date': '1 Nov, 1999'}"


As usual, one of the first steps we'll take is to check for null values. Luckily, it seems that the cleaning we have performed already has removed any null values from our data set, as seen below. We may still have some hidden empty values of course. 

In [94]:
print('Null values:\n')
print('Raw data:', raw_steam_data['release_date'].isnull().sum())
print('Partially cleaned:', steam_data['release_date'].isnull().sum())

Null values:

Raw data: 149
Partially cleaned: 0


Exploring the data using the value_counts method brings a couple of data issues to light.

In the raw data, we can see that 64 rows have data but date is an empty string, ''. Like we've seen before, this means they do not have null values, but may need to be treated as such depending on the reason. This may be data corruption, or it may be another reason entirely. We will probably have to decide what to do with these cases and investigate further.
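To illustrate why these rows are easy to miss, here is a small sketch on a hypothetical series of extracted dates: `isnull` does not count empty strings, so they have to be counted explicitly and converted to real nulls if we want to treat them as missing.

```python
import pandas as pd

# Hypothetical extracted dates, including an empty string
dates = pd.Series(['13 Jul, 2018', '', 'Oct 2015'])

# isnull() does not flag empty strings...
nulls_before = dates.isnull().sum()

# ...so count them explicitly, then mask them out (masked values become NaN)
empty_count = (dates == '').sum()
dates = dates.mask(dates == '')

print(nulls_before, empty_count, dates.isnull().sum())
```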

Another issue we can notice is that while most of the dates are stored in the format we saw previously (dd mmm, yyyy), at least a couple are simply stored as the month and year (e.g. 'May 2019'). This means that the dates aren't all stored uniformly so we will have to take care when parsing them.

In [95]:
display(raw_steam_data['release_date'].value_counts().head())

steam_data['release_date'].value_counts().tail()

{'coming_soon': False, 'date': '13 Jul, 2018'}    65
{'coming_soon': False, 'date': ''}                64
{'coming_soon': False, 'date': '31 Jan, 2019'}    59
{'coming_soon': False, 'date': '5 Apr, 2016'}     59
{'coming_soon': False, 'date': '17 May, 2018'}    56
Name: release_date, dtype: int64

{'coming_soon': False, 'date': '10 Aug, 2011'}    1
{'coming_soon': False, 'date': '13 Aug, 2017'}    1
{'coming_soon': False, 'date': '24 Sep, 2007'}    1
{'coming_soon': False, 'date': '1 May, 2016'}     1
{'coming_soon': False, 'date': 'Oct 2015'}        1
Name: release_date, dtype: int64

Before we move on, let's quickly inspect some of the rows which have a blank date. 

It looks like some are special re-releases, like anniversary or game of the year editions, some are early access and not officially released yet, and others simply have a missing date. Apart from that there don't appear to be any clear patterns emerging, so as there are only 22 rows it may be best to remove them.

In [96]:
no_release_date = steam_data[steam_data['release_date'] == "{'coming_soon': False, 'date': ''}"]

print('Rows with no release date:', no_release_date.shape[0], '\n')
print_steam_links(no_release_date.head())
no_release_date.head()

Rows with no release date: 22 

Borderlands Game of the Year: https://store.steampowered.com/app/8980
Sherlock Holmes: The Mystery of the Persian Carpet: https://store.steampowered.com/app/11180
1... 2... 3... KICK IT! (Drop That Beat Like an Ugly Baby): https://store.steampowered.com/app/15540
The Great Art Race: https://store.steampowered.com/app/33580
SpellForce 2 - Anniversary Edition: https://store.steampowered.com/app/39550


Unnamed: 0,name,steam_appid,required_age,release_date,windows,mac,linux,price,english,developer,publisher
266,Borderlands Game of the Year,8980,18,"{'coming_soon': False, 'date': ''}",1,0,0,24.99,1,Gearbox Software,2K
319,Sherlock Holmes: The Mystery of the Persian Ca...,11180,3,"{'coming_soon': False, 'date': ''}",1,0,0,6.99,1,Frogwares,Frogwares
426,1... 2... 3... KICK IT! (Drop That Beat Like a...,15540,3,"{'coming_soon': False, 'date': ''}",1,0,0,6.99,1,"Dejobaan Games, LLC","Dejobaan Games, LLC"
731,The Great Art Race,33580,3,"{'coming_soon': False, 'date': ''}",1,0,0,3.99,1,Ascaron Entertainment ltd.,Assemble Entertainment
868,SpellForce 2 - Anniversary Edition,39550,3,"{'coming_soon': False, 'date': ''}",1,0,0,13.99,1,"Phenomic, THQ Nordic",THQ Nordic


Taking a look at the format of the column, we'll need to use literal_eval once more. Apart from that it should be straightforward enough to extract the date.

In [97]:
print(type(steam_data['release_date'].iloc[0]))

steam_data['release_date'].iloc[0]

<class 'str'>


"{'coming_soon': False, 'date': '1 Nov, 2000'}"

In [98]:
print(type(literal_eval(steam_data['release_date'].iloc[0])))

literal_eval(steam_data['release_date'].iloc[0])['date']

<class 'dict'>


'1 Nov, 2000'

Once extracted, we can use the pd.to_datetime function to interpret and store dates as datetime objects. This will be particularly useful as it will allow us to search and sort our dataset when it comes to performing analysis. Say for example we only wish to examine games released in 2010: by converting our dates to a format pandas recognises, this will be very easy to achieve.
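For instance, the 2010 filter mentioned above might look like the sketch below. The `games` dataframe is hypothetical, with dates written in the same layout as our data.

```python
import pandas as pd

# Hypothetical frame with dates in the format seen in our data
games = pd.DataFrame({
    'name': ['Game A', 'Game B', 'Game C'],
    'date': ['1 Nov, 2000', '14 Jun, 2010', '3 Dec, 2010'],
})
games['release_date'] = pd.to_datetime(games['date'], format='%d %b, %Y')

# The .dt accessor exposes datetime attributes such as year
released_2010 = games[games['release_date'].dt.year == 2010]
print(released_2010['name'].tolist())

# Sorting by date also works once the column holds datetimes
newest_first = games.sort_values('release_date', ascending=False)
print(newest_first['name'].tolist())
```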

As seen below, we can supply the to_datetime function with our date and pandas will automatically interpret the format. We can then inspect it or print an attribute like the year. We can also provide pandas with the format explicitly, so it knows what to look for and how to parse it, which may be [quicker for large sets of data](https://stackoverflow.com/questions/32034689/why-is-pandas-to-datetime-slow-for-non-standard-time-format-such-as-2014-12-31).

In [99]:
timestamp = pd.to_datetime(literal_eval(steam_data['release_date'].iloc[0])['date'])

print(timestamp)
print(timestamp.year)

pd.to_datetime(literal_eval(steam_data['release_date'].iloc[0])['date'], format='%d %b, %Y')

2000-11-01 00:00:00
2000


Timestamp('2000-11-01 00:00:00')

Now we are ready to begin defining our function. As we only want to keep released games, we first extract the values from the coming_soon key and keep only the rows where the value is False. Next we extract the release date, and set missing dates to np.nan, the default way of storing null values in pandas.

Then, using the formats we learned previously, we interpret those dates using the to_datetime function. Once complete we pass over the dataframe once more with a general call to to_datetime, catching any dates we missed.

Finally we drop the columns we no longer need and return the dataframe.

Whilst functional, the process is quite slow. We can use the %timeit magic to test how long it takes to run our function, and we can see that on average it takes almost four seconds. Whilst manageable, we could certainly benefit from optimising our code, as this could quickly add up in larger data sets, where increasing efficiency can prove invaluable.

In [100]:
def process_release_date(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    # Only want to keep released games
    df = df[df['coming_soon'] == False].copy()
    
    # extract release date and set missing dates to null
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    df.loc[df['date'] == '', 'date'] = np.nan
    
    # Parse the date formats we have discovered
    df['datetime'] = pd.to_datetime(df['date'], format='%d %b, %Y', errors='ignore')
    df['datetime'] = pd.to_datetime(df['datetime'], format='%b %Y', errors='ignore')
    
    # Parse the rest of the date formats
    df['release_date'] = pd.to_datetime(df['datetime'])
    
    df = df.drop(['coming_soon', 'date', 'datetime'], axis=1)
    return df

%timeit process_release_date(steam_data)

3.72 s ± 45.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


There are a few areas we can investigate to make improvements. When initially parsing the date, we end up calling literal_eval twice, which may be a source of slowdown. We also loop over the entire dataset multiple times when calling the to_datetime function. 

We'll investigate which part is causing the greatest slowdown, but we can be certain that reducing the traversals over the data set will most likely provide significant gains. There are also a few other issues that we'll dive into over the course of our optimisation process.

First, let's find out where the main slowdowns are. As we just saw we can use the %timeit magic to time our function. We can also use the in-built time module to inspect parts of our code.
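As an aside, the repeated start/stop bookkeeping can be wrapped in a small helper. The sketch below is one possible approach rather than what this notebook uses: `timer` is a hypothetical name, and it relies on `time.perf_counter`, which is intended for measuring short durations.

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(label, results=None):
    """Print (and optionally record) the wall-clock time of a block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f'{label} run-time: {elapsed:.4f} s')

# Usage: time an arbitrary block and keep the measurement
timings = {}
with timer('Example', timings):
    total = sum(range(100_000))
```

Each section of a function could then be wrapped in its own `with timer(...)` block.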

In [101]:
def process_release_date(df):
    df = df.copy()
    
    eval_start = time.time()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    print('Evaluation run-time:', time.time() - eval_start)
    
    df.loc[df['date'] == '', 'date'] = None
    
    first_parse_start = time.time()
    
    df['datetime'] = pd.to_datetime(df['date'], format='%d %b, %Y', errors='ignore')
    df['datetime'] = pd.to_datetime(df['datetime'], format='%b %Y', errors='ignore')
    
    print('First parse run-time:', time.time() - first_parse_start)
    
    second_parse_start = time.time()
    
    df['release_date'] = pd.to_datetime(df['datetime'])
    
    print('Final parse run-time:', time.time() - second_parse_start)
    
    df = df.drop(['coming_soon', 'date', 'datetime'], axis=1)
    return df

function_start = time.time()
process_release_date(steam_data)
print('\nTotal run-time:', time.time() - function_start)

Evaluation run-time: 0.7607526779174805
First parse run-time: 0.008997678756713867
Final parse run-time: 2.897845506668091

Total run-time: 3.7045834064483643


Immediately we can see that the majority of run-time is taken up by the final call to pd.to_datetime. This suggests that the first two calls are not functioning as expected, most likely because errors='ignore' returns the input unchanged when a value cannot be parsed rather than skipping just that value, so most of the work is being done by the final call. Now it makes sense why it is slow: without a format, pandas has to figure out how each date is laid out, and since we know we have some variations this may be slowing it down considerably.
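One way to check how many values a given format actually parses is to switch to errors='coerce', which turns anything that does not fit into NaT. The sample series below is hypothetical and simply covers the three cases we have seen so far.

```python
import pandas as pd

# Hypothetical sample covering the three cases we have seen
dates = pd.Series(['13 Jul, 2018', 'Oct 2015', ''])

# errors='coerce' turns anything that does not fit the format into NaT
parsed = pd.to_datetime(dates, format='%d %b, %Y', errors='coerce')

unmatched = dates[parsed.isnull()]
print(len(unmatched), 'of', len(dates), 'values did not match')
print(unmatched.tolist())
```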

Whilst the evaluation run-time is much shorter, our multiple calls to literal_eval may be slowing the function as well, so we may wish to investigate that. As we know the biggest slowdown, we should begin there.

We now know that handling our dates in their current form is slow, and we know that we have some different formats mixed in there. Whilst there are likely many possible solutions to this problem, using regular expressions (or regex) comes to mind as they tend to excel at pattern matching in strings.

We know for sure two of the patterns, so let's build a regex for each of those. Then we can iteratively add more as we discover any other patterns. A powerful and useful tool for building and testing regex can be found at [regexr.com](https://regexr.com/).

In [102]:
pattern = r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}'
string = '13 Jul, 2018'

print(re.search(pattern, string))

pattern = r'[A-Za-z]{3} [\d]{4}'
string = 'Apr 2016'

print(re.search(pattern, string))

<re.Match object; span=(0, 12), match='13 Jul, 2018'>
<re.Match object; span=(0, 8), match='Apr 2016'>


Using these two patterns we can start building out our function. We're going to apply a function to the date column which searches for each pattern, returning a standardised date string which we will then feed into the to_datetime function.

Our first search matches the 'mmm yyyy' pattern, like 'Apr 2019'. As we don't know the particular day for these matches we will assume it is the first of the month, returning '1 Apr 2019' in this example.

If we don't match this, we'll check for the second case. Our second match will be the 'dd mmm, yyyy' pattern, like '13 Jul, 2018'. In this case we will simply return the match with the comma removed, to become '13 Jul 2018'.

Finally we'll check for the empty string, and return it for now.

For anything else we'll simply print the string so we know what else we should be searching for.

In [103]:
def process_release_date(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    def parse_date(x):
        if re.search(r'[A-Za-z]{3} [\d]{4}', x):
            return '1 ' + x 
        elif re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', x):
            return x.replace(',', '')
        elif x == '':
            return x
        else:
            print(x)
            
    df['date'] = df['date'].apply(parse_date)
    df['release_date'] = pd.to_datetime(df['date'], infer_datetime_format=True)
    
    df = df.drop(['coming_soon', 'date'], axis=1)
    
    return df

result = process_release_date(steam_data)

Nothing was printed, so it looks like we've caught all of the patterns and don't have any more to take care of.
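One caveat is that re.search matches anywhere in the string, so a value with extra text around a date would still count as a match. A stricter variant, sketched here with hypothetical names, compiles each pattern once and uses fullmatch so that the whole string has to fit.

```python
import re

# Compiled once, reused for every row; names are illustrative
DAY_MONTH_YEAR = re.compile(r'\d{1,2} [A-Za-z]{3}, \d{4}')
MONTH_YEAR = re.compile(r'[A-Za-z]{3} \d{4}')

def standardise_date(x):
    """Return a 'd Mon YYYY' string, '' for empty input, or None if unrecognised."""
    if DAY_MONTH_YEAR.fullmatch(x):
        return x.replace(',', '')
    if MONTH_YEAR.fullmatch(x):
        return '1 ' + x
    if x == '':
        return x
    return None

for raw in ['13 Jul, 2018', 'Oct 2015', '', 'Coming Oct 2015']:
    print(repr(raw), '->', repr(standardise_date(raw)))
```

Whether the stricter matching is worth it depends on how messy the remaining values are. For the data here, the looser search already catches everything.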

Previously we used the `infer_datetime_format` parameter of to_datetime, which can speed up the process. However, as we now know exactly the format our dates will be in, we can explicitly set it ourselves, which should be the fastest way of doing things.

We also need to decide how to handle our missing dates - those with the empty strings. For now let's change the way the function handles errors from raise to coerce, which returns NaT (not a time) instead.

We can now rewrite our function and time it as we did before.

In [104]:
def process_release_date_old(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    # Simple parsing
    df['release_date'] = pd.to_datetime(df['date'])
    
    df = df.drop(['coming_soon', 'date'], axis=1)
    
    return df


def process_release_date_new(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    # Complex parsing
    def parse_date(x):
        if re.search(r'[A-Za-z]{3} [\d]{4}', x):
            return '1 ' + x
        elif re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', x):
            return x.replace(',', '')
        elif x == '':
            return x
        else:
            # Should be everything, print out anything left just in case
            print(x)
            
    df['date'] = df['date'].apply(parse_date)
    df['release_date'] = pd.to_datetime(df['date'], format='%d %b %Y', errors='coerce')
    
    df = df.drop(['coming_soon', 'date'], axis=1)
    
    return df

print('Testing date parsing:\n')
%timeit process_release_date_old(steam_data)
%timeit process_release_date_new(steam_data)

Testing date parsing:

3.69 s ± 72.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
917 ms ± 7.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Our results show that the new method is about four times faster, so we're on the right track.

Another optimisation we can make here is checking which part of the if/elif statements has the most matches. It makes sense to order our statements from most matches to least, so for the majority of rows we only have to search through once. 

To do this, instead of returning the date we'll return a number for each match. We can then print the value counts for the column and see which is the most frequent.

In [105]:
def optimise_regex_order(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    def parse_date(x):
        if re.search(r'[A-Za-z]{3} [\d]{4}', x):
            return '0: mmm yyyy' # '1 ' + x
        elif re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', x):
            return '1: dd mmm, yyyy' # x.replace(',', '')
        elif x == '':
            return '2: empty' # pass
        else:
            # Should be everything, print out anything left just in case
            print(x)
            
    df['release_date'] = df['date'].apply(parse_date)
    
    return df


result = optimise_regex_order(steam_data)

result['release_date'].value_counts()

1: dd mmm, yyyy    27294
0: mmm yyyy           57
2: empty              22
Name: release_date, dtype: int64

By far the majority of dates are in the 'dd mmm, yyyy' format, which is second in our if/else statements. This means that for all these rows we are unnecessarily searching the string twice. Simply by reordering our searches we should see a minor performance improvement.

In [106]:
def process_release_date_unordered(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    def parse_date(x):
        if re.search(r'[A-Za-z]{3} [\d]{4}', x):
            return '1 ' + x
        elif re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', x):
            return x.replace(',', '')
        elif x == '':
            return x
        else:
            # Should be everything, print out anything left just in case
            print(x)
            
    # Standardise the date strings, then parse the standardised values
    df['date'] = df['date'].apply(parse_date)
    df['release_date'] = pd.to_datetime(df['date'], format='%d %b %Y', errors='coerce')
    df = df.drop(['coming_soon', 'date'], axis=1)
    
    return df


def process_release_date_ordered(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    def parse_date(x):
        if re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', x):
            return x.replace(',', '')
        elif re.search(r'[A-Za-z]{3} [\d]{4}', x):
            return '1 ' + x
        elif x == '':
            return x
        else:
            # Should be everything, print out anything left just in case
            print(x)
            
    # Standardise the date strings, then parse the standardised values
    df['date'] = df['date'].apply(parse_date)
    df['release_date'] = pd.to_datetime(df['date'], format='%d %b %Y', errors='coerce')
    df = df.drop(['coming_soon', 'date'], axis=1)
    
    return df


%timeit process_release_date_unordered(steam_data)
%timeit process_release_date_ordered(steam_data)

834 ms ± 3.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
809 ms ± 9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


It's an improvement, if only slightly, so we'll keep it. If anything this goes to show how fast regex pattern matching is, as there was hardly any slowdown in searching every string twice.

Now parsing is well-optimised we can move on to the evaluation section.

In [107]:
# Testing evaluation methods
def evaluation_method_original(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])    
    df = df[df['coming_soon'] == False].copy()
    df['release_date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    return df


def evaluation_method_1(df):
    df = df.copy()
    
    df['release_date'] = df['release_date'].apply(lambda x: literal_eval(x))
    
    df['coming_soon'] = df['release_date'].apply(lambda x: x['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    
    df['release_date'] = df['release_date'].apply(lambda x: x['date'])
    
    return df


def evaluation_method_2(df):
    df = df.copy()
    
    df['release_date'] = df['release_date'].apply(lambda x: literal_eval(x))
    df_2 = df['release_date'].transform([lambda x: x['coming_soon'], lambda x: x['date']])
    df = pd.concat([df, df_2], axis=1)
    
    return df


def evaluation_method_3(df):
    df = df.copy()
    
    def eval_date(x):
        x = literal_eval(x)
        if x['coming_soon']:
            return np.nan
        else:
            return x['date']
    
    df['release_date'] = df['release_date'].apply(eval_date)
    df = df[df['release_date'].notnull()]  # could change to drop when '' and deal with missing release dates also
    
    return df


%timeit evaluation_method_original(steam_data)

%timeit evaluation_method_1(steam_data)
%timeit evaluation_method_2(steam_data)
%timeit evaluation_method_3(steam_data)

734 ms ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
394 ms ± 2.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
387 ms ± 4.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
373 ms ± 2.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


It looks like we may have been right in our assumption that multiple calls to literal_eval were slowing down the function - by calling it once instead of twice we almost halved the run-time.
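Another way to evaluate each string only once, not timed here, is to expand the parsed dictionaries into their own columns. This is a sketch on hypothetical strings shaped like the release_date column, not one of the methods benchmarked above.

```python
from ast import literal_eval

import pandas as pd

# Hypothetical raw strings in the same shape as the release_date column
raw = pd.Series([
    "{'coming_soon': False, 'date': '1 Nov, 2000'}",
    "{'coming_soon': True, 'date': 'Oct 2025'}",
    "{'coming_soon': False, 'date': ''}",
])

# Evaluate each string exactly once, then expand the dicts into columns
parsed = raw.apply(literal_eval)
expanded = pd.DataFrame(parsed.tolist(), index=raw.index)

# Keep only rows that are not marked as coming soon
released = expanded[~expanded['coming_soon']]
print(released['date'].tolist())
```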

Of our new methods the final one was just about the fastest, which is useful because it contains flexible custom logic we can modify if needed. Let's put everything together into our final function, and time it once more to see the improvements we've made.

We'll make a couple of changes so we can easily remove missing values at the end, which should mean we end up with clean release dates.

In [108]:
def process_release_date(df):
    df = df.copy()
    
    def eval_date(x):
        x = literal_eval(x)
        if x['coming_soon']:
            return '' # return blank string so can drop missing at end
        else:
            return x['date']
    
    df['release_date'] = df['release_date'].apply(eval_date)
    
    def parse_date(x):
        if re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', x):
            return x.replace(',', '')
        elif re.search(r'[A-Za-z]{3} [\d]{4}', x):
            return '1 ' + x
        elif x == '':
            return np.nan
        else:
            # Should be everything, print out anything left just in case
            print(x)
            
    df['release_date'] = df['release_date'].apply(parse_date)
    df['release_date'] = pd.to_datetime(df['release_date'], format='%d %b %Y', errors='coerce')
    
    df = df[df['release_date'].notnull()]
    
    return df

%timeit process_release_date(steam_data)

511 ms ± 4.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Referring back to our original time of 3.7s, we've achieved a 7x speed increase. That's almost an order of magnitude improvement. 

We'll now update our process function, run it on our data set, and move on to some final checks.

In [109]:
def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = df.drop(['achievements', 'content_descriptors'], axis=1)
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_language(df)
    df = process_developers_and_publishers(df)
    df = process_release_date(df)
    
    # Process columns which export data
    df = process_descriptions(df, export=True)
    df = process_images(df, export=True)
    df = process_info(df, export=True)
    df = process_requirements(df, export=True)
    df = process_categories(df, export=True)
    df = process_genres(df, export=True)
    
    return df

steam_data = process(raw_steam_data)
steam_data.head()

Exported description data to '../data/exports/steam_description_data.csv'
Exported image data to '../data/exports/steam_image_data.csv'
Exported support info to '../data/exports/steam_support_info.csv'
Exported requirements data to '../data/exports/steam_requirements_data.csv'
Exported category data to '../data/exports/steam_category_data.csv'
Exported genre data to '../data/exports/steam_genre_data.csv'


Unnamed: 0,name,steam_appid,required_age,release_date,windows,mac,linux,price,english,developer,publisher
0,Counter-Strike,10,3,2000-11-01,1,1,1,7.19,1,Valve,Valve
1,Team Fortress Classic,20,3,1999-04-01,1,1,1,3.99,1,Valve,Valve
2,Day of Defeat,30,3,2003-05-01,1,1,1,3.99,1,Valve,Valve
3,Deathmatch Classic,40,3,2001-06-01,1,1,1,3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,3,1999-11-01,1,1,1,3.99,1,Gearbox Software,Valve


### Final Steps

Our data set is hopefully complete. Before we export it to csv, let's check if we have any null values.

In [110]:
steam_data.isnull().sum()

name            0
steam_appid     0
required_age    0
release_date    0
windows         0
mac             0
linux           0
price           0
english         0
developer       0
publisher       0
dtype: int64

We will also export a data set including the category and genre data, so let's also check those for missing values.

We'll also add a prefix to the column names in each data set to make them easily identifiable.
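To show why the prefixes help, here is a sketch of how the two exports might later be combined on `steam_appid`. The miniature dataframes are hypothetical, and using `merge` with `how='left'` is our assumption about how a combined data set could be built.

```python
import pandas as pd

# Hypothetical miniature versions of the exported files
genres = pd.DataFrame({'steam_appid': [10, 20], 'action': [1, 1], 'indie': [0, 1]})
categories = pd.DataFrame({'steam_appid': [10, 20], 'single_player': [0, 1]})

# Prefix every column, then restore the shared key name
genres = genres.add_prefix('genre_').rename(
    columns={'genre_steam_appid': 'steam_appid'})
categories = categories.add_prefix('category_').rename(
    columns={'category_steam_appid': 'steam_appid'})

# Join on the shared key; how='left' keeps every row of the left frame
combined = genres.merge(categories, on='steam_appid', how='left')
print(combined.columns.tolist())
```

With the prefixes in place, it stays clear which columns came from which file after the join.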

In [111]:
genre_data = pd.read_csv('../data/exports/steam_genre_data.csv')

# prepend with 'genre_' to make easily identifiable
genre_data = genre_data.add_prefix('genre_').rename({'genre_steam_appid':'steam_appid'}, axis=1)
genre_data.isnull().sum()

steam_appid                      0
genre_accounting                 0
genre_action                     0
genre_adventure                  0
genre_animation_and_modeling     0
genre_audio_production           0
genre_casual                     0
genre_design_and_illustration    0
genre_documentary                0
genre_early_access               0
genre_education                  0
genre_free_to_play               0
genre_game_development           0
genre_gore                       0
genre_indie                      0
genre_massively_multiplayer      0
genre_nudity                     0
genre_photo_editing              0
genre_rpg                        0
genre_racing                     0
genre_sexual_content             0
genre_simulation                 0
genre_software_training          0
genre_sports                     0
genre_strategy                   0
genre_tutorial                   0
genre_utilities                  0
genre_video_production           0
genre_violent       

In [112]:
category_data = pd.read_csv('../data/exports/steam_category_data.csv')
# prefix every column name with 'category_', then restore the shared 'steam_appid' key
category_data = category_data.add_prefix('category_').rename({'category_steam_appid':'steam_appid'}, axis=1)
category_data.isnull().sum()

steam_appid                            0
category_captions_available            0
category_co_op                         0
category_commentary_available          0
category_cross_platform_multiplayer    0
category_full_controller_support       0
category_in_app_purchases              0
category_includes_source_sdk           0
category_includes_level_editor         0
category_local_co_op                   0
category_local_multi_player            0
category_mmo                           0
category_mods                          0
category_mods_require_hl2              0
category_multi_player                  0
category_online_co_op                  0
category_online_multi_player           0
category_partial_controller_support    0
category_shared_or_split_screen        0
category_single_player                 0
category_stats                         0
category_steam_achievements            0
category_steam_cloud                   0
category_steam_leaderboards            0
category_steam_t

Looks good. We also want to check that no unreleased games slipped through: the data was scraped on or before 1st May 2019, so no release date should be later than that.

In [113]:
steam_data[steam_data['release_date'] > '2019-05-01']

Unnamed: 0,name,steam_appid,required_age,release_date,windows,mac,linux,price,english,developer,publisher
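The comparison above works when `release_date` is a datetime column or holds ISO-formatted strings such as `2000-11-01`. The column's dtype at this point is not shown, so here is a hedged sketch that makes the date handling explicit. `released_after` is a hypothetical helper and the toy rows are illustrative only.

```python
import pandas as pd


def released_after(df, cutoff, date_col="release_date"):
    """Return rows whose release date is strictly later than `cutoff`.

    Hypothetical helper for illustration. Dates that cannot be parsed
    become NaT and are excluded from the result, because NaT > cutoff
    evaluates to False.
    """
    dates = pd.to_datetime(df[date_col], errors="coerce")
    return df[dates > pd.Timestamp(cutoff)]


# Toy rows: one released before the scrape date, one after
toy = pd.DataFrame({
    "name": ["Released", "Upcoming"],
    "release_date": ["2000-11-01", "2019-06-15"],
})
late = released_after(toy, "2019-05-01")
```

An empty result, as in the output above, means no title has a release date after the cutoff.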


### Combining and exporting data frames

Now that we're happy with our dataframe, we are ready to export it to file and finish this part of the project.

First we export `steam_data`. Then we left-merge `genre_data` and `category_data` onto it by `steam_appid` to form a new dataframe, check that no column in the result contains missing values, and export that as well.

In [114]:
steam_data.to_csv('../data/steam_data_clean.csv', index=False)

steam_data_full = steam_data.merge(genre_data, how='left', on='steam_appid')
steam_data_full = steam_data_full.merge(category_data, how='left', on='steam_appid')

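# count the columns that contain at least one null; 0 means every
# steam_appid found a match in both the genre and category data
null_counts = steam_data_full.isnull().sum()
print(null_counts[null_counts > 0].shape[0])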

steam_data_full.to_csv('../data/steam_data_with_genre_and_category.csv', index=False)

steam_data_full.head()

0


Unnamed: 0,name,steam_appid,required_age,release_date,windows,mac,linux,price,english,developer,publisher,genre_accounting,genre_action,genre_adventure,genre_animation_and_modeling,genre_audio_production,genre_casual,genre_design_and_illustration,genre_documentary,genre_early_access,genre_education,genre_free_to_play,genre_game_development,genre_gore,genre_indie,genre_massively_multiplayer,genre_nudity,genre_photo_editing,genre_rpg,genre_racing,genre_sexual_content,genre_simulation,genre_software_training,genre_sports,genre_strategy,genre_tutorial,genre_utilities,genre_video_production,genre_violent,genre_web_publishing,category_captions_available,category_co_op,category_commentary_available,category_cross_platform_multiplayer,category_full_controller_support,category_in_app_purchases,category_includes_source_sdk,category_includes_level_editor,category_local_co_op,category_local_multi_player,category_mmo,category_mods,category_mods_require_hl2,category_multi_player,category_online_co_op,category_online_multi_player,category_partial_controller_support,category_shared_or_split_screen,category_single_player,category_stats,category_steam_achievements,category_steam_cloud,category_steam_leaderboards,category_steam_trading_cards,category_steam_turn_notifications,category_steam_workshop,category_steamvr_collectibles,category_vr_support,category_valve_anti_cheat_enabled
0,Counter-Strike,10,3,2000-11-01,1,1,1,7.19,1,Valve,Valve,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
1,Team Fortress Classic,20,3,1999-04-01,1,1,1,3.99,1,Valve,Valve,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
2,Day of Defeat,30,3,2003-05-01,1,1,1,3.99,1,Valve,Valve,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,Deathmatch Classic,40,3,2001-06-01,1,1,1,3.99,1,Valve,Valve,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
4,Half-Life: Opposing Force,50,3,1999-11-01,1,1,1,3.99,1,Gearbox Software,Valve,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1
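A left merge keeps every row of the left frame and fills unmatched rows with nulls, which is why the null count is checked after merging. As an optional extra safeguard, pandas' `merge` accepts `validate` and `indicator` arguments. The sketch below uses toy frames rather than the notebook's data to show how they expose duplicate keys and unmatched rows.

```python
import pandas as pd

# Toy frames standing in for steam_data and genre_data (illustrative only)
base = pd.DataFrame({"steam_appid": [10, 20, 30], "name": ["A", "B", "C"]})
genres = pd.DataFrame({"steam_appid": [10, 20], "genre_action": [1, 1]})

merged = base.merge(
    genres,
    how="left",
    on="steam_appid",
    validate="one_to_one",  # raises MergeError if a key is duplicated on either side
    indicator=True,         # adds a '_merge' column describing each row's origin
)

# Rows present only in the left frame had no genre data and now hold nulls
unmatched = merged.loc[merged["_merge"] == "left_only", "steam_appid"].tolist()
```

Here the app with id 30 has no genre row, so it appears in `unmatched` and its `genre_action` value is null. In the notebook's merge the null count of 0 indicates that no such rows exist.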


# Next steps

We could go on to clean some of the data we exported, such as the descriptions and requirements.

For now, we are ready to move on to cleaning our steamspy data.