# Steam Data Cleaning (Part 1)

*This forms part of a larger series of posts for my [blog](http://nik-davis.github.io) on downloading, processing and analysing data from the steam store. [See all posts here](http://nik-davis.github.io/tags/steam).*

In [56]:
# view software version information

# http://raw.github.com/jrjohansson/version_information/master/version_information.py
%load_ext version_information
%reload_ext version_information

%version_information numpy, pandas

Software,Version
Python,3.7.3 64bit [MSC v.1900 64 bit (AMD64)]
IPython,7.5.0
OS,Windows 10 10.0.17763 SP0
numpy,1.16.3
pandas,0.24.2
Thu May 30 14:51:14 2019 GMT Summer Time


<!-- PELICAN_BEGIN_SUMMARY -->

In the first part of this project, we downloaded and generated data sets from the Steam Store API and SteamSpy API. We now need to take this raw data and prepare it in a process commonly referred to as [data cleaning](https://en.wikipedia.org/wiki/Data_cleansing).

Currently the downloaded data is not in a very useful state. Many of the columns contain lengthy strings or missing values, which hinder analysis and are especially crippling to any machine learning techniques we may wish to implement. Data cleaning involves handling missing values, tidying up values, and ensuring data is neatly and consistently formatted.

<!-- PELICAN_END_SUMMARY -->

Data cleaning is often cited as being the lengthiest part of any project. As such, it will be broken up across a series of posts, starting with this one. We will begin by taking care of the columns in the Steam data that are easiest to deal with and outlining a framework for the process. Of course, it could all be done in one go and a lot more concisely, but we'll be stepping through the reasons for each decision and building the process iteratively.

The main aims of this project are to investigate various sales and play-time statistics for games from the Steam store, and to see how different features of games affect their success. Keeping this in mind will help inform our decisions about how to handle the various columns in our data set. That said, it may be a good idea to keep columns which don't seem useful to this particular project, in order to provide a robust data set for future projects.

In part 2, we'll take care of columns from which we'll export separate data of some kind, in order to store it for later. Finally for the Steam data, in part 3 we will walk through the process of optimising the handling of a column, before exporting the clean data.

Once that is complete, we will repeat the whole cleaning process for the SteamSpy data and combine the results in part 4, finishing with a complete data set ready for analysis.

The raw data can be found and downloaded on [Kaggle](https://www.kaggle.com/nikdavis/steam-store-raw).

## API references:

- https://partner.steamgames.com/doc/webapi
- https://wiki.teamfortress.com/wiki/User:RJackson/StorefrontAPI
- https://steamapi.xpaw.me/#
- https://steamspy.com/api.php

## Import Libraries and Inspect Data

To begin with, we'll import the required libraries and set customisation options, then take a look at the downloaded data by reading it into a pandas dataframe.

In [1]:
# standard library imports
from ast import literal_eval
import itertools
import time
import re

# third-party imports
import numpy as np
import pandas as pd

# customisations
# use the full option name; the short form is ambiguous in newer pandas versions
pd.set_option("display.max_columns", 100)

In [2]:
# read in downloaded data
raw_steam_data = pd.read_csv('../data/raw/steam_app_data.csv')

# print out number of rows and columns
print('Rows:', raw_steam_data.shape[0])
print('Columns:', raw_steam_data.shape[1])

# view first five rows
raw_steam_data.head()

Rows: 29235
Columns: 39


Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,legal_notice,drm_notice,ext_user_account_notice,developers,publishers,demos,price_overview,packages,package_groups,platforms,metacritic,reviews,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
0,game,Counter-Strike,10,0.0,False,,,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,,"English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'GBP', 'initial': 719, 'final': 7...",[7],"[{'name': 'default', 'title': 'Buy Counter-Str...","{'windows': True, 'mac': True, 'linux': True}","{'score': 88, 'url': 'https://www.metacritic.c...",,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 65735},{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'url': 'http://steamcommunity.com/app/10', 'e...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
1,game,Team Fortress Classic,20,0.0,False,,,One of the most popular online action games of...,One of the most popular online action games of...,One of the most popular online action games of...,,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/20/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'GBP', 'initial': 399, 'final': 3...",[29],"[{'name': 'default', 'title': 'Buy Team Fortre...","{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 2802},{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/20/...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
2,game,Day of Defeat,30,0.0,False,,,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,,"English, French, German, Italian, Spanish - Spain",https://steamcdn-a.akamaihd.net/steam/apps/30/...,http://www.dayofdefeat.com/,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'GBP', 'initial': 399, 'final': 3...",[30],"[{'name': 'default', 'title': 'Buy Day of Defe...","{'windows': True, 'mac': True, 'linux': True}","{'score': 79, 'url': 'https://www.metacritic.c...",,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 1992},{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/30/...,"{'ids': [], 'notes': None}"
3,game,Deathmatch Classic,40,0.0,False,,,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/40/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'GBP', 'initial': 399, 'final': 3...",[31],"[{'name': 'default', 'title': 'Buy Deathmatch ...","{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 931},{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/40/...,"{'ids': [], 'notes': None}"
4,game,Half-Life: Opposing Force,50,0.0,False,,,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,,"English, French, German, Korean",https://steamcdn-a.akamaihd.net/steam/apps/50/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Gearbox Software'],['Valve'],,"{'currency': 'GBP', 'initial': 399, 'final': 3...",[32],"[{'name': 'default', 'title': 'Buy Half-Life: ...","{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 4355},{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'url': 'https://help.steampowered.com', 'emai...",https://steamcdn-a.akamaihd.net/steam/apps/50/...,"{'ids': [], 'notes': None}"


From a quick inspection of the data, we can see that we have a mixture of numeric and string columns, plenty of missing values, and a number of columns that look to be stored as dictionaries or lists.

We can chain the `isnull()` and `sum()` methods to easily see how many missing values we have in each column. Immediately we can see that a number of columns have over 20,000 rows with missing data, and in a data set of roughly 30,000 rows these are unlikely to provide any meaningful information.

In [3]:
null_counts = raw_steam_data.isnull().sum()
null_counts

type                         149
name                           1
steam_appid                    0
required_age                 149
is_free                      149
controller_support         23237
dlc                        24260
detailed_description         175
about_the_game               175
short_description            175
fullgame                   29235
supported_languages          163
header_image                 149
website                     9983
pc_requirements              149
mac_requirements             149
linux_requirements           149
legal_notice               19168
drm_notice                 29077
ext_user_account_notice    28723
developers                   264
publishers                   149
demos                      27096
price_overview              3712
packages                    3370
package_groups               149
platforms                    149
metacritic                 26254
reviews                    23330
categories                   714
genres    

## Initial Processing

We will most likely have to handle each column individually, so we will write some functions to keep our methodology organised, and help iteratively develop the process.

Our first function will remove the columns with more than 50% missing values, taking care of the columns with high null counts. We can do this by running a filter on the dataframe, as seen below.

In [4]:
threshold = raw_steam_data.shape[0] // 2

print('Drop columns with more than {} missing rows'.format(threshold))
print()

drop_cols = raw_steam_data.columns[null_counts > threshold]

print('Columns to drop: {}'.format(list(drop_cols)))

Drop columns with more than 14617 missing rows

Columns to drop: ['controller_support', 'dlc', 'fullgame', 'legal_notice', 'drm_notice', 'ext_user_account_notice', 'demos', 'metacritic', 'reviews', 'recommendations']
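
Note that the comparison above is strict, so a column with exactly half of its values missing would be kept. A minimal sketch of this on a made-up dataframe (the `toy` frame and its values are illustrative only, not part of the Steam data):

```python
import numpy as np
import pandas as pd

# toy dataframe standing in for the raw data (values are made up)
toy = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'website': ['x.com', np.nan, np.nan, 'y.com'],  # 2 of 4 missing
    'dlc': [np.nan, np.nan, np.nan, '[1]'],         # 3 of 4 missing
})

toy_null_counts = toy.isnull().sum()
toy_threshold = toy.shape[0] // 2  # 2

# only columns with strictly more missing values than the threshold are selected
toy_drop_cols = toy.columns[toy_null_counts > toy_threshold]
print(list(toy_drop_cols))
```

Here only `dlc` is selected, while `website` sits exactly on the threshold and survives.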


We can then look at the `type` and `name` columns, thinning out our data set a little by removing apps without either.

In the data collection stage, if no information was returned from an app's API request, only the name and appid were stored. We can easily identify these apps by looking at rows with missing data in the `type` column, as all other apps have a value here. As seen below, these rows contain no other information so we can safely remove them.

In [5]:
print('Rows to remove:', raw_steam_data[raw_steam_data['type'].isnull()].shape[0])

# preview rows with missing type data
raw_steam_data[raw_steam_data['type'].isnull()].head()

Rows to remove: 149


Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,legal_notice,drm_notice,ext_user_account_notice,developers,publishers,demos,price_overview,packages,package_groups,platforms,metacritic,reviews,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
26,,Half-Life: Opposing Force,852,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
147,,Half-Life: Opposing Force,4330,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
256,,Half-Life: Opposing Force,8740,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
264,,Half-Life: Opposing Force,8955,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
336,,Half-Life: Opposing Force,11610,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


We can look at the counts of unique values in a column by using the pandas [Series.value_counts](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) method. By checking the value counts we see that all rows either have a missing value, as noted above, or 'game' in the `type` column.

Once the null rows are removed, we'll be able to remove this column as it doesn't provide us with any more useful information.

In [6]:
raw_steam_data['type'].value_counts(dropna=False)

game    29086
NaN       149
Name: type, dtype: int64

Taking a look now at the name column, we can check for rows which either have a null value or a string containing 'none'. This isn't recognised as a null value but should be treated as such.

We achieve this by combining boolean filters using brackets and a vertical bar, `|`, symbolising a logical 'or'.

There are only four rows which match these criteria, and they appear to be missing a lot of data in other columns so we should definitely remove them.

In [7]:
raw_steam_data[(raw_steam_data['name'].isnull()) | (raw_steam_data['name'] == 'none')]

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,legal_notice,drm_notice,ext_user_account_notice,developers,publishers,demos,price_overview,packages,package_groups,platforms,metacritic,reviews,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
4918,game,none,339860,0.0,False,,,,,,,English<strong>*</strong><br><strong>*</strong...,https://steamcdn-a.akamaihd.net/steam/apps/339...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],,,,,[''],,,,[],"{'windows': True, 'mac': False, 'linux': False}",,,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '25', 'description': 'Adventure'}, {'i...",,,,"{'total': 3, 'highlighted': [{'name': 'Master ...","{'coming_soon': False, 'date': '27 Feb, 2015'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}"
6779,game,none,385020,0.0,False,,,- discontinued - (please remove),- discontinued - (please remove),- discontinued - (please remove),,"English, French, Italian, German, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/385...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,,,,['none'],[''],,,,[],"{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '4', 'description': 'Casual'}, {'id': ...",,,,{'total': 0},"{'coming_soon': False, 'date': '4 Nov, 2015'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}"
7235,game,,396420,0.0,True,,,Spookeningは3Dの恐怖ゲームで、あなたは毎夜に死んでゴーストとして復活します。<b...,Spookeningは3Dの恐怖ゲームで、あなたは毎夜に死んでゴーストとして復活します。<b...,Spookeningは3Dの恐怖ゲームで、あなたは毎夜に死んでゴーストとして復活します。 村...,,,https://steamcdn-a.akamaihd.net/steam/apps/396...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],,,,,[''],,,,[],"{'windows': True, 'mac': False, 'linux': False}",,,,,,,,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2016'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}"
7350,game,none,398970,0.0,False,,,,,,,English<strong>*</strong><br><strong>*</strong...,https://steamcdn-a.akamaihd.net/steam/apps/398...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,,,,['none'],['none'],"[{'appid': 516340, 'description': ''}]",,,[],"{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '25', 'description': 'Adventure'}, {'i...",,,,"{'total': 35, 'highlighted': [{'name': ""They'v...","{'coming_soon': False, 'date': '5 Nov, 2015'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}"
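
As a small illustration of how the combined filter behaves, here is a sketch on a made-up dataframe (the names and IDs are invented for the example). Each condition is wrapped in brackets because `|` binds more tightly than `==`, and `~` inverts the mask to give the rows we want to keep:

```python
import numpy as np
import pandas as pd

# toy data: one null name, one 'none' string, two valid names (all made up)
toy = pd.DataFrame({
    'name': ['Game A', 'none', np.nan, 'Game B'],
    'steam_appid': [1, 2, 3, 4],
})

# True where the name is null OR the literal string 'none'
bad_name = (toy['name'].isnull()) | (toy['name'] == 'none')

rows_to_remove = toy[bad_name]
rows_to_keep = toy[~bad_name]

print(rows_to_remove)
print(rows_to_keep)
```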


As we know for certain that all AppIDs should be unique, any rows with the same ID need to be handled.

We can easily view duplicated rows using the [DataFrame.duplicated()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html) method of pandas. We can pass `keep=False` to view all duplicated rows, or leave the defaults (`keep='first'`) to skip over the first row and just show the rest of the duplicates. We can also pass a column label into `subset` if we want to filter by a single column.

As we only want to remove the extra rows, we can keep the default behaviour.

In [8]:
duplicate_rows = raw_steam_data[raw_steam_data.duplicated()]

print('Duplicate rows to remove:', duplicate_rows.shape[0])

duplicate_rows.head(3)

Duplicate rows to remove: 7


Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,legal_notice,drm_notice,ext_user_account_notice,developers,publishers,demos,price_overview,packages,package_groups,platforms,metacritic,reviews,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
31,game,SiN Episodes: Emergence,1300,0.0,False,,,"You are John Blade, commander of HardCorps, an...","You are John Blade, commander of HardCorps, an...","You are John Blade, commander of HardCorps, an...",,"English, Russian, French",https://steamcdn-a.akamaihd.net/steam/apps/130...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,[],[],,,,['Ritual Entertainment'],['Ritual Entertainment'],,"{'currency': 'GBP', 'initial': 719, 'final': 7...",[70],"[{'name': 'default', 'title': 'Buy SiN Episode...","{'windows': True, 'mac': False, 'linux': False}","{'score': 75, 'url': 'https://www.metacritic.c...",,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 265},{'total': 0},"{'coming_soon': False, 'date': '10 May, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/130...,"{'ids': [], 'notes': None}"
32,game,SiN Episodes: Emergence,1300,0.0,False,,,"You are John Blade, commander of HardCorps, an...","You are John Blade, commander of HardCorps, an...","You are John Blade, commander of HardCorps, an...",,"English, Russian, French",https://steamcdn-a.akamaihd.net/steam/apps/130...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,[],[],,,,['Ritual Entertainment'],['Ritual Entertainment'],,"{'currency': 'GBP', 'initial': 719, 'final': 7...",[70],"[{'name': 'default', 'title': 'Buy SiN Episode...","{'windows': True, 'mac': False, 'linux': False}","{'score': 75, 'url': 'https://www.metacritic.c...",,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 265},{'total': 0},"{'coming_soon': False, 'date': '10 May, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/130...,"{'ids': [], 'notes': None}"
356,game,Jagged Alliance 2 Gold,1620,0.0,False,,,<p>The small country of Arulco has been taken ...,<p>The small country of Arulco has been taken ...,The small country of Arulco has been taken ove...,,English,https://steamcdn-a.akamaihd.net/steam/apps/162...,http://www.jaggedalliance2.com/,{'minimum': '<p><strong>Minimum Configuration:...,[],[],,,,['Strategy First'],['Strategy First'],,"{'currency': 'GBP', 'initial': 1499, 'final': ...",[94],"[{'name': 'default', 'title': 'Buy Jagged Alli...","{'windows': True, 'mac': False, 'linux': False}",,,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '2', 'description': 'Strategy'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,,{'total': 0},"{'coming_soon': False, 'date': '6 Jul, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/162...,"{'ids': [], 'notes': None}"


Let's also quickly verify that we aren't missing any rows duplicated on just the `steam_appid` column by comparing the `duplicate_rows` dataframe with the one generated by passing `subset='steam_appid'` into the duplicated method.

In [9]:
duplicate_app_id_rows = raw_steam_data[raw_steam_data.duplicated(subset='steam_appid')]

print('True if same:', duplicate_app_id_rows.equals(duplicate_rows))

True if same: True
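
To make the difference between these options concrete, here is a minimal sketch on a made-up dataframe (not the Steam data). The last row shares an ID with earlier rows but has a different name, so it is only caught when checking the `steam_appid` column alone:

```python
import pandas as pd

# toy data: appid 20 appears three times, once with a different name
toy = pd.DataFrame({
    'steam_appid': [10, 20, 20, 30, 20],
    'name': ['A', 'B', 'B', 'C', 'B2'],
})

# default keep='first': the first occurrence is not flagged, later copies are
dup_default = toy.duplicated()

# keep=False: every row that has an identical copy is flagged
dup_all = toy.duplicated(keep=False)

# subset: only the given column is compared
dup_by_id = toy.duplicated(subset='steam_appid')

print(dup_default.tolist())
print(dup_all.tolist())
print(dup_by_id.tolist())
```

If the full-row check and the `subset` check disagree, as they do in this toy example, some rows share an ID while differing elsewhere and would need a closer look.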


We're now ready to define functions implementing the filters we just looked at. This allows us to easily make changes in the future if we want to alter how the columns are handled, or want to choose a different cut-off threshold for getting rid of columns, for example. 

We also define a general purpose `process` function which will run all the processing functions we create on the data set. This will allow us to slowly add to it as we develop more functions and ensure we're cleaning the correct dataframe.

Finally we run this function on the raw data, inspecting the first few rows and viewing how many rows and columns have been removed.

In [10]:
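def process_null_cols(df, thresh=0.5):
    """Drop columns with more than a certain proportion of missing values (Default 50%)."""
    # dropna's thresh is the minimum number of non-null values needed to keep a column,
    # so convert the allowed proportion of missing values into a required non-null count
    cutoff_count = len(df) * (1 - thresh)
    
    return df.dropna(thresh=cutoff_count, axis=1)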


def drop_null_rows(df, col):
    """Drop rows with null values in a column."""
    return df[df[col].notnull()]


def process_type(df):
    """Remove rows with null values for type column, then drop the column."""
    df = drop_null_rows(df, 'type')
    df = df.drop('type', axis=1)
    
    return df
    
    
def process_name(df):
    """Remove rows with null values or 'none' in name column."""
    df = drop_null_rows(df, 'name')
    df = df[df['name'] != 'none']
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    
    return df

print(raw_steam_data.shape)
steam_data = process(raw_steam_data)
print(steam_data.shape)
steam_data.head()

(29235, 39)
(29075, 28)


Unnamed: 0,name,steam_appid,required_age,is_free,detailed_description,about_the_game,short_description,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,price_overview,packages,package_groups,platforms,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors
0,Counter-Strike,10,0.0,False,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,"English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],"{'currency': 'GBP', 'initial': 719, 'final': 7...",[7],"[{'name': 'default', 'title': 'Buy Counter-Str...","{'windows': True, 'mac': True, 'linux': True}","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'url': 'http://steamcommunity.com/app/10', 'e...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
1,Team Fortress Classic,20,0.0,False,One of the most popular online action games of...,One of the most popular online action games of...,One of the most popular online action games of...,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/20/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],"{'currency': 'GBP', 'initial': 399, 'final': 3...",[29],"[{'name': 'default', 'title': 'Buy Team Fortre...","{'windows': True, 'mac': True, 'linux': True}","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/20/...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
2,Day of Defeat,30,0.0,False,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,"English, French, German, Italian, Spanish - Spain",https://steamcdn-a.akamaihd.net/steam/apps/30/...,http://www.dayofdefeat.com/,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],"{'currency': 'GBP', 'initial': 399, 'final': 3...",[30],"[{'name': 'default', 'title': 'Buy Day of Defe...","{'windows': True, 'mac': True, 'linux': True}","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/30/...,"{'ids': [], 'notes': None}"
3,Deathmatch Classic,40,0.0,False,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/40/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],"{'currency': 'GBP', 'initial': 399, 'final': 3...",[31],"[{'name': 'default', 'title': 'Buy Deathmatch ...","{'windows': True, 'mac': True, 'linux': True}","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/40/...,"{'ids': [], 'notes': None}"
4,Half-Life: Opposing Force,50,0.0,False,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,"English, French, German, Korean",https://steamcdn-a.akamaihd.net/steam/apps/50/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Gearbox Software'],['Valve'],"{'currency': 'GBP', 'initial': 399, 'final': 3...",[32],"[{'name': 'default', 'title': 'Buy Half-Life: ...","{'windows': True, 'mac': True, 'linux': True}","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'url': 'https://help.steampowered.com', 'emai...",https://steamcdn-a.akamaihd.net/steam/apps/50/...,"{'ids': [], 'notes': None}"
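
One detail worth spelling out: in `DataFrame.dropna`, `thresh` is the minimum number of non-null values needed to keep a row or column, rather than a number of missing values. A minimal sketch on a made-up dataframe (column names and values are illustrative only):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'full': [1, 2, 3, 4],                     # 4 non-null values
    'half': [1, np.nan, 3, np.nan],           # 2 non-null values
    'sparse': [np.nan, np.nan, np.nan, 4],    # 1 non-null value
})

# keep only columns with at least 2 non-null values (50% of the 4 rows)
kept = toy.dropna(thresh=2, axis=1)
print(list(kept.columns))
```

The `half` column has exactly two non-null values, so it meets the threshold and is kept, while `sparse` is dropped.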


## Processing Age

Next we'll look at the `required_age` column. By looking at the value counts we can see that values are stored as floats but are all whole numbers, ranging from 0 to 20, with one likely error (1818). There are no missing values in this column, but the vast majority of rows have a value of 0. We'll clean the column anyway, but this probably means it won't be of much use in analysis as there is little variance in the data.

In [11]:
steam_data['required_age'].value_counts(dropna=False).sort_index()

0.0       28431
1.0           1
3.0          10
4.0           2
5.0           1
6.0           1
7.0           8
10.0          3
11.0          4
12.0         72
13.0         21
14.0          4
15.0         39
16.0        141
17.0         47
18.0        288
20.0          1
1818.0        1
Name: required_age, dtype: int64

Whilst fairly useful in its current state, we may benefit from reducing the number of categories that ages fall into. For example, instead of comparing games rated as 5, 6, 7 or 8, we could compare games rated 5+ or 8+.

To decide which categories (or bins) we should use, we will look at the [PEGI age ratings](https://pegi.info/) as this is the system used in the United Kingdom, where we're performing our analysis. Ratings fall into one of five categories (3, 7, 12, 16, 18), defining the minimum age recommended to play a game.

Using this to inform our decision, we can use the [pandas.cut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html) function to sort our data into each of these categories. By default each bin includes its upper edge, so a rating that falls between two PEGI categories is rounded up to the next one (13 becomes 16, for example). Rows with 0 may mean they are unrated, unstated as in missing, or rated as suitable for everyone. Because we can't tell, we'll leave these as they are. As the erroneous row (1818) is most likely meant to be rated 18 anyway, we can set the upper bound above this value to catch it inside this category.

Below we define a `process_age` function to handle this, and add it into our `process` definition.

In [12]:
def process_age(df):
    """Format ratings in age column to be in line with the PEGI Age Ratings system."""
    # PEGI Age ratings: 3, 7, 12, 16, 18
    cut_points = [-1, 0, 3, 7, 12, 16, 2000]
    label_values = [0, 3, 7, 12, 16, 18]
    
    df['required_age'] = pd.cut(df['required_age'], bins=cut_points, labels=label_values)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    
    return df


steam_data = process(raw_steam_data)
steam_data['required_age'].value_counts().sort_index()

0     28431
3        11
7        12
12       79
16      205
18      337
Name: required_age, dtype: int64
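
To see how individual ages map onto the bins, here is a small sketch that applies the same cut points and labels to a handful of hand-picked ages (the `ages` series is made up for illustration):

```python
import pandas as pd

# a few sample ages, including boundary values and the erroneous 1818
ages = pd.Series([0, 1, 3, 4, 13, 16, 17, 1818])

cut_points = [-1, 0, 3, 7, 12, 16, 2000]
label_values = [0, 3, 7, 12, 16, 18]

# default right=True: each bin includes its right edge, e.g. (12, 16]
binned = pd.cut(ages, bins=cut_points, labels=label_values)

print(pd.DataFrame({'age': ages, 'binned': binned}))
```

The result is a categorical column rather than a numeric one, which is worth bearing in mind if we later want to do arithmetic on it.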

## Processing the Platforms Column

Whilst we could look at the next column in the dataframe, `is_free`, it would make sense that this is linked to the `price_overview` column. Ultimately we may wish to combine these columns into one, where free games would have a price of 0. 

Looking at the `price_overview` column, we can see it is stored in a dictionary-like structure, with multiple keys and values. Handling both of these together might be somewhat tricky, so instead we'll look at a simpler example.

In [13]:
steam_data['price_overview'].head()

0    {'currency': 'GBP', 'initial': 719, 'final': 7...
1    {'currency': 'GBP', 'initial': 399, 'final': 3...
2    {'currency': 'GBP', 'initial': 399, 'final': 3...
3    {'currency': 'GBP', 'initial': 399, 'final': 3...
4    {'currency': 'GBP', 'initial': 399, 'final': 3...
Name: price_overview, dtype: object

The `platforms` column appears to contain a key for each of the main operating systems - windows, mac and linux - and a corresponding boolean value, set to True or False depending on the availability on that platform. This should be a reasonably straightforward place to start. We can separate this data out into three columns - one for each platform - filled with boolean values.

In [14]:
steam_data['platforms'].head()

0    {'windows': True, 'mac': True, 'linux': True}
1    {'windows': True, 'mac': True, 'linux': True}
2    {'windows': True, 'mac': True, 'linux': True}
3    {'windows': True, 'mac': True, 'linux': True}
4    {'windows': True, 'mac': True, 'linux': True}
Name: platforms, dtype: object

So far the cleaning process has been relatively simple, mainly requiring checking for null values and dropping some rows or columns. Already we can see that handling the platforms will be a little more complex.

Our first hurdle is getting Python to recognise the data in the columns as dictionaries rather than just strings. This will allow us to access the different values separately, without having to do some unnecessarily complicated string formatting. As we can see below, even though the data looks like a dictionary it is in fact stored as a string.

In [15]:
platforms_first_row = steam_data['platforms'].iloc[0]

print(type(platforms_first_row))

platforms_first_row

<class 'str'>


"{'windows': True, 'mac': True, 'linux': True}"

We can get around this using the handy [literal_eval](https://docs.python.org/3/library/ast.html#ast.literal_eval) function from the built-in `ast` module. As the name suggests, this will allow us to evaluate the string, and then index into it as a dictionary.

In [16]:
eval_first_row = literal_eval(platforms_first_row)

print(type(eval_first_row))

eval_first_row['windows']

<class 'dict'>


True
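As a side note on why `literal_eval` suits this job, the sketch below (using a made-up `raw` string in the same shape as our data) shows that the value is not valid JSON, because of the single quotes and Python-style `True`, and that `literal_eval` only accepts literals, refusing anything that looks like a function call.

```python
import json
from ast import literal_eval

# Made-up string in the same shape as the platforms column
raw = "{'windows': True, 'mac': True, 'linux': False}"

# json.loads fails: single quotes and True are Python syntax, not JSON
try:
    json.loads(raw)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# literal_eval parses Python literals such as dicts, lists, strings and booleans
parsed = literal_eval(raw)

# ...but it refuses arbitrary expressions such as function calls
try:
    literal_eval("__import__('os').getcwd()")
    call_ok = True
except ValueError:
    call_ok = False

print(json_ok, parsed['linux'], call_ok)
```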

We also need to check for missing values, but fortunately it appears there aren't any in this column.

In [17]:
steam_data['platforms'].isnull().sum()

0

Putting this all together, we can use the pandas [Series.apply](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html) method to quickly evaluate all of the rows, then make calls to `apply` again to create the new columns for each platform.

We could return the True/False value directly and store the values as boolean types, but since we'll be exporting the cleaned data to a csv file, let's store them as integers as this should reduce the file size slightly. Setting True as 1 and False as 0 can still be interpreted as a boolean type, but less data is used to store the information.
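The size difference is easy to check on a toy example. The sketch below uses a made-up three-row frame (not our data) and compares the length of the CSV text when the flags are written as booleans versus integers, then confirms the integers convert back to the same booleans.

```python
import io
import pandas as pd

# Made-up frame standing in for the platform columns
toy = pd.DataFrame({'windows': [True, True, False],
                    'mac': [True, False, False]})

bool_csv = toy.to_csv(index=False)
int_csv = toy.astype(int).to_csv(index=False)

# 'True'/'False' take 4-5 characters per cell, '1'/'0' take one
print(len(bool_csv), len(int_csv))

# The integer version still round-trips to the original booleans
restored = pd.read_csv(io.StringIO(int_csv)).astype(bool)
```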

In [18]:
def process_platforms(df):
    """Split platforms column into separate boolean columns for each platform."""
    # evaluate values in platforms column, so can index into dictionaries
    df['platforms'] = df['platforms'].apply(literal_eval)
    
    # loop across keys (the platforms) which will be turned into columns
    # (use .iloc[0] so this still works if the row labelled 0 has been dropped)
    for platform in df['platforms'].iloc[0].keys():
        # set 1 if value for platform in original column is True, or 0 if it's False
        df[platform] = df['platforms'].apply(lambda x: 1 if x[platform] else 0)
    
    # remove the original platforms column
    df = df.drop('platforms', axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    
    return df


steam_data = process(raw_steam_data)
steam_data[['name', 'windows', 'mac', 'linux']].head()

|   | name | windows | mac | linux |
|---|------|---------|-----|-------|
| 0 | Counter-Strike | 1 | 1 | 1 |
| 1 | Team Fortress Classic | 1 | 1 | 1 |
| 2 | Day of Defeat | 1 | 1 | 1 |
| 3 | Deathmatch Classic | 1 | 1 | 1 |
| 4 | Half-Life: Opposing Force | 1 | 1 | 1 |
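As an alternative to looping over the keys, the dictionaries can be expanded into columns in one step. This is only a sketch on a made-up two-row frame (`toy`, `Game A` and `Game B` are placeholders), and it assumes every row has the same keys, as appears to be the case in our data.

```python
import pandas as pd
from ast import literal_eval

# Made-up rows in the same shape as the platforms column
toy = pd.DataFrame({
    'name': ['Game A', 'Game B'],
    'platforms': ["{'windows': True, 'mac': True, 'linux': False}",
                  "{'windows': True, 'mac': False, 'linux': False}"],
})

# Evaluate the strings, then build one column per dictionary key
parsed = toy['platforms'].apply(literal_eval)
platform_cols = pd.DataFrame(parsed.tolist(), index=toy.index).astype(int)

# Swap the original column for the new integer columns
result = toy.drop('platforms', axis=1).join(platform_cols)
print(result)
```

The loop in `process_platforms` is arguably easier to read, so which version to use is mostly a matter of taste.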


## Processing Price

Now we have built up some intuition around how to deal with data stored as dictionaries, let's return to the `is_free` and `price_overview` columns as we should now be able to handle them.

First let's check how many null values there are in `price_overview`.

In [19]:
steam_data['price_overview'].isnull().sum()

3559

Whilst that looks like a lot, we have to consider the impact that the `is_free` column might be having. Before jumping to conclusions let's check if there are any rows with `is_free` marked as True and null values in the `price_overview` column.

In [20]:
free_and_null_price = steam_data[(steam_data['is_free']) & (steam_data['price_overview'].isnull())]

print(free_and_null_price.shape[0])
free_and_null_price.head()

2713


Unnamed: 0,name,steam_appid,required_age,is_free,detailed_description,about_the_game,short_description,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,price_overview,packages,package_groups,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors,windows,mac,linux
14,Half-Life 2: Lost Coast,340,0,True,Originally planned as a section of the Highway...,Originally planned as a section of the Highway...,Originally planned as a section of the Highway...,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/340...,http://www.half-life2.com,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,[],[],['Valve'],['Valve'],,,[],"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '27 Oct, 2005'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/340...,"{'ids': [], 'notes': None}",1,1,1
19,Team Fortress 2,440,0,True,"<h1>The Jungle Inferno Update</h1><p><a href=""...","<p><strong>""The most fun you can have online""<...",Nine distinct classes provide a broad range of...,"English<strong>*</strong>, Danish, Dutch, Finn...",https://steamcdn-a.akamaihd.net/steam/apps/440...,http://www.teamfortress.com/,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,['Valve'],['Valve'],,"[197845, 330198, 469]","[{'name': 'default', 'title': 'Buy Team Fortre...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256698790, 'name': 'Jungle Inferno', '...","{'total': 520, 'highlighted': [{'name': 'Head ...","{'coming_soon': False, 'date': '10 Oct, 2007'}","{'url': 'http://steamcommunity.com/app/440', '...",https://steamcdn-a.akamaihd.net/steam/apps/440...,"{'ids': [2, 5], 'notes': 'Includes cartoon vio...",1,1,1
22,Dota 2,570,0,True,<strong>The most-played game on Steam.</strong...,<strong>The most-played game on Steam.</strong...,"Every day, millions of players worldwide enter...","Bulgarian, Czech, Danish, Dutch, English<stron...",https://steamcdn-a.akamaihd.net/steam/apps/570...,http://www.dota2.com/,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,['Valve'],['Valve'],,"[197846, 330209]","[{'name': 'default', 'title': 'Buy Dota 2', 'd...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256692021, 'name': 'Dota 2 - Join the ...",,"{'coming_soon': False, 'date': '9 Jul, 2013'}","{'url': 'http://dev.dota2.com/', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/570...,"{'ids': [], 'notes': None}",1,1,1
24,Alien Swarm,630,0,True,Alien Swarm is a game and Source SDK release f...,Alien Swarm is a game and Source SDK release f...,Co-operative multiplayer game and complete cod...,English,https://steamcdn-a.akamaihd.net/steam/apps/630...,http://www.alienswarm.com,{'minimum': '<strong>Minimum:</strong><br>\t\t...,[],[],['Valve'],['Valve'],,,[],"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,"{'total': 66, 'highlighted': [{'name': 'Clear ...","{'coming_soon': False, 'date': '19 Jul, 2010'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/630...,"{'ids': [], 'notes': None}",1,0,0
25,Counter-Strike: Global Offensive,730,0,True,Counter-Strike: Global Offensive (CS: GO) expa...,Counter-Strike: Global Offensive (CS: GO) expa...,Counter-Strike: Global Offensive (CS: GO) expa...,"Czech, Danish, Dutch, English<strong>*</strong...",https://steamcdn-a.akamaihd.net/steam/apps/730...,http://blog.counter-strike.net/,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,"['Valve', 'Hidden Path Entertainment']",['Valve'],,"[329385, 298963, 54029]","[{'name': 'default', 'title': 'Buy Counter-Str...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 81958, 'name': 'CS:GO Trailer Long', '...","{'total': 167, 'highlighted': [{'name': 'Someo...","{'coming_soon': False, 'date': '21 Aug, 2012'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/730...,"{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1


It turns out this accounts for most of the missing values in the `price_overview` column, meaning we can handle these by setting the final price as 0. This makes intuitive sense - free games wouldn't have a price.

Subtracting these from the total leaves 846 rows (3559 - 2713) which aren't free but have null values in the `price_overview` column. Let's investigate those next.
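The bookkeeping here can be sketched on a made-up five-row frame (the values in `toy` are illustrative). Every row with a missing price is either free or not free, so the two filtered counts should add up to the total number of nulls.

```python
import pandas as pd

# Made-up rows: two free games and one paid game with no price data
toy = pd.DataFrame({
    'is_free': [True, True, False, False, False],
    'price_overview': [None, None, "{'initial': 399}", None, "{'initial': 719}"],
})

null_price = toy['price_overview'].isnull()

total_null = null_price.sum()
free_and_null = (toy['is_free'] & null_price).sum()
# ~ negates a boolean column, equivalent to comparing with False
not_free_and_null = (~toy['is_free'] & null_price).sum()

print(total_null, free_and_null, not_free_and_null)
```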

In [21]:
not_free_and_null_price = steam_data[(steam_data['is_free'] == False) & (steam_data['price_overview'].isnull())]

not_free_and_null_price.head()

Unnamed: 0,name,steam_appid,required_age,is_free,detailed_description,about_the_game,short_description,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,price_overview,packages,package_groups,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors,windows,mac,linux
63,The Ship: Single Player,2420,0,False,For PC gamers who enjoy multiplayer games with...,For PC gamers who enjoy multiplayer games with...,The Ship is a murder mystery alternative to tr...,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/242...,http://www.blazinggriffin.com/games/the-ship-m...,{'minimum': '<strong>Minimum:</strong> 1.8 GHz...,[],[],['Outerlight Ltd.'],['Blazing Griffin Ltd.'],,[56669],"[{'name': 'default', 'title': 'Buy The Ship: S...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 2035597, 'name': 'the Ship: Intro', '...",{'total': 0},"{'coming_soon': False, 'date': '20 Nov, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/242...,"{'ids': [], 'notes': None}",1,0,0
75,RollerCoaster Tycoon® 3: Platinum,2700,0,False,Rollercoaster Tycoon 3 Platinum combines the e...,Rollercoaster Tycoon 3 Platinum combines the e...,Rollercoaster Tycoon 3 Platinum combines the e...,"English, French, Italian, German, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/270...,http://www.atari.com/rollercoastertycoon/us/in...,{'minimum': '<strong>Minimum: </strong><br>\t\...,"{'minimum': '<ul class=""bb_ul""><li><strong>OS:...",[],"['Frontier', 'Aspyr (Mac)']","['Atari', 'Aspyr (Mac)']",,,[],"[{'id': 2, 'description': 'Single-player'}]","[{'id': '28', 'description': 'Simulation'}, {'...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '12 Mar, 2008'}","{'url': 'http://www.atari.com/support/atari', ...",https://steamcdn-a.akamaihd.net/steam/apps/270...,"{'ids': [], 'notes': None}",1,1,0
220,BioShock™,7670,0,False,<h1>Special Offer</h1><p>Buying BioShock™ also...,BioShock is a shooter unlike any you've ever p...,BioShock is a shooter unlike any you've ever p...,"English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/767...,http://www.BioShockGame.com,"{'minimum': '<h2 class=""bb_tag""><strong>Minimu...",{'minimum': 'Please See BioShock Remastered'},[],"['2K Boston', '2K Australia']",['2K'],,"[451, 127633]","[{'name': 'default', 'title': 'Buy BioShock™',...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '21 Aug, 2007'}","{'url': 'support.2k.com', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/767...,"{'ids': [], 'notes': None}",1,0,0
234,Sam & Max 101: Culture Shock,8200,0,False,<strong>Sam &amp; Max: Episode 1 - Culture Sho...,<strong>Sam &amp; Max: Episode 1 - Culture Sho...,Sam &amp; Max: Episode 1 - Culture Shock - The...,"English, French, German, Italian",https://steamcdn-a.akamaihd.net/steam/apps/820...,http://store.steampowered.com/app/901660/,"{'minimum': 'Windows XP or Vista, 1.5GHz proce...",[],[],['Telltale Games'],['Telltale Games'],,"[357, 539]","[{'name': 'default', 'title': 'Buy Sam & Max 1...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '15 Jun, 2007'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/820...,"{'ids': [], 'notes': None}",1,0,0
235,Sam & Max 102: Situation: Comedy,8210,0,False,<strong>Sam &amp; Max: Episode 2 - Situation: ...,<strong>Sam &amp; Max: Episode 2 - Situation: ...,Sam &amp; Max: Episode 2 - Situation: Comedy -...,"English, German, Italian",https://steamcdn-a.akamaihd.net/steam/apps/821...,http://store.steampowered.com/app/901660/,"{'minimum': 'Windows XP or Vista, 1.5GHz proce...",[],[],['Telltale Games'],['Telltale Games'],,"[358, 539]","[{'name': 'default', 'title': 'Buy Sam & Max 1...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '15 Jun, 2007'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/821...,"{'ids': [], 'notes': None}",1,0,0


The first few rows contain some big, well-known games which appear to have pretty complete data. It looks like we can rule out data errors, so let's dig a little deeper and see if we can find out what is going on.

We'll start by looking at the store pages for some of these titles. The URL to an app on the Steam website follows this structure:

    https://store.steampowered.com/app/[steam_appid]

This means we can easily generate these links using our above filter. We'll wrap it up in a function in case we want to use it later.

In [22]:
def print_steam_links(df):
    """Print links to store page for apps in a dataframe."""
    url_base = "https://store.steampowered.com/app/"
    
    for i, row in df.iterrows():
        appid = row['steam_appid']
        name = row['name']
        
        print(name + ':', url_base + str(appid))
        

print_steam_links(not_free_and_null_price[:5])

The Ship: Single Player: https://store.steampowered.com/app/2420
RollerCoaster Tycoon® 3: Platinum: https://store.steampowered.com/app/2700
BioShock™: https://store.steampowered.com/app/7670
Sam & Max 101: Culture Shock: https://store.steampowered.com/app/8200
Sam & Max 102: Situation: Comedy: https://store.steampowered.com/app/8210


For these games we can conclude that:

- The Ship: Single Player is a tutorial, and comes as part of The Ship: Murder Party
- RollerCoaster Tycoon 3: Platinum has been removed from Steam (and another game website: [GOG](https://www.gog.com/))  
  - "A spokesperson for GOG told Eurogamer it pulled the game "due to expiring licensing rights", and stressed it'll talk with "new distribution rights holders" to bring the game back as soon as possible." Source: [Eurogamer](https://www.eurogamer.net/articles/2018-05-09-rollercoaster-tycoon-3-pulled-from-steam-gog)
- BioShock has been replaced by BioShock Remastered
- Sam & Max 101 is sold as part of a season, and this can be found in the `package_groups` column

So we have a couple of options here. We could just drop these rows, we could try to figure out the price based on the `package_groups` column, or we could leave them for now and return to them later. We'll leave them for now, handling the two price columns, then take a look at the packages next. It may also be that some of these rows are removed later in the cleaning process for other reasons.

If we want to find rows similar to these and deal with each case individually, we could use the `.str.contains()` method, as seen below.

In [23]:
steam_data[steam_data['name'].str.contains("BioShock™")]

Unnamed: 0,name,steam_appid,required_age,is_free,detailed_description,about_the_game,short_description,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,price_overview,packages,package_groups,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors,windows,mac,linux
220,BioShock™,7670,0,False,<h1>Special Offer</h1><p>Buying BioShock™ also...,BioShock is a shooter unlike any you've ever p...,BioShock is a shooter unlike any you've ever p...,"English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/767...,http://www.BioShockGame.com,"{'minimum': '<h2 class=""bb_tag""><strong>Minimu...",{'minimum': 'Please See BioShock Remastered'},[],"['2K Boston', '2K Australia']",['2K'],,"[451, 127633]","[{'name': 'default', 'title': 'Buy BioShock™',...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '21 Aug, 2007'}","{'url': 'support.2k.com', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/767...,"{'ids': [], 'notes': None}",1,0,0
7734,BioShock™ Remastered,409710,18,False,<h1>Special Offer</h1><p>Buying BioShock™ Rema...,BioShock is a shooter unlike any you've ever p...,"BioShock is a shooter unlike any other, loaded...","English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/409...,http://www.BioShockGame.com,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,"['2K Boston', '2K Australia', 'Blind Squirrel'...","['2K', 'Feral Interactive (Mac)']","{'currency': 'GBP', 'initial': 999, 'final': 9...","[451, 127633]","[{'name': 'default', 'title': 'Buy BioShock™ R...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,"{'total': 65, 'highlighted': [{'name': 'Comple...","{'coming_soon': False, 'date': '15 Sep, 2016'}","{'url': 'support.2k.com', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/409...,"{'ids': [], 'notes': None}",1,1,0
7735,BioShock™ 2 Remastered,409720,18,False,<h1>Special Offer</h1><p>Buying BioShock 2™ Re...,BioShock 2 provides players with the perfect b...,"In BioShock 2, you step into the boots of the ...","English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/409...,http://www.bioshockgame.com/,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,"['2K Marin', '2K China', 'Digital Extremes', '...",['2K'],"{'currency': 'GBP', 'initial': 1399, 'final': ...","[81419, 127633]","[{'name': 'default', 'title': 'Buy BioShock™ 2...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,"{'total': 53, 'highlighted': [{'name': ""Daddy'...","{'coming_soon': False, 'date': '15 Sep, 2016'}","{'url': 'support.2k.com', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/409...,"{'ids': [5], 'notes': None}",1,0,0
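One thing to be aware of is that `.str.contains()` treats the pattern as a regular expression by default. For plain text searches it can be safer to pass `regex=False`, and `case=False` helps when capitalisation varies. A small sketch, using made-up names (the all-caps title is purely illustrative):

```python
import pandas as pd

# Made-up names for illustration
names = pd.Series(['BioShock™', 'BioShock™ Remastered',
                   'BIOSHOCK Infinite', 'Half-Life'])

# Match the text literally rather than as a regular expression
literal_match = names.str.contains('BioShock™', regex=False)

# Ignore capitalisation as well
any_case_match = names.str.contains('bioshock', case=False, regex=False)

print(names[literal_match].tolist())
print(names[any_case_match].tolist())
```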


Now we need to figure out how to process the column.

If we take a look at the data for the first row, we can see that there are a variety of formats in which the price is stored. There is a currency, GBP, which is perfect as we are performing our analysis in the UK. Next we have a number of different values for the price, so which one do we use?

In [24]:
steam_data['price_overview'][0]

"{'currency': 'GBP', 'initial': 719, 'final': 719, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': '£7.19'}"

If we inspect another row, we see that there is an active discount, applying an 80% price reduction to the title. It looks like `initial` contains the normal price before discount, and `final` contains the discounted price. `initial_formatted` and `final_formatted` contain the price formatted and displayed in the currency. We don't have to worry about these last two, as we'll be storing the price as a float (or integer) and if we wanted, we could format it like this when printing.

With all this in mind, it looks like we'll be checking the value under the `currency` key, and using the value in the `initial` key.
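As a quick sanity check on this reading of the keys, we can apply the discount to `initial` for the discounted row (row 37) and compare it with `final`. This is only a sketch: working in whole pence and rounding down is an assumption about how the store calculates the price, not something the data confirms.

```python
from ast import literal_eval

# The discounted row's price_overview value, copied from the data
raw = "{'currency': 'GBP', 'initial': 2299, 'final': 459, 'discount_percent': 80, 'initial_formatted': '£22.99', 'final_formatted': '£4.59'}"
price = literal_eval(raw)

# Apply the discount in whole pence, rounding down (assumed behaviour)
expected_final = price['initial'] * (100 - price['discount_percent']) // 100

print(expected_final, price['final'])
```

The two values agree for this row, which supports treating `initial` as the normal price and `final` as the discounted one.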

In [25]:
steam_data['price_overview'][37]

"{'currency': 'GBP', 'initial': 2299, 'final': 459, 'discount_percent': 80, 'initial_formatted': '£22.99', 'final_formatted': '£4.59'}"

Now the preliminary investigation is complete we can begin defining our function.

We start by evaluating the strings using `literal_eval` as before, however if there is a null value (caught by the try/except block) we return a properly formatted dictionary with -1 for the `initial` value. This will allow us to fill in a value of 0 for free games, then be left with an easily targetable value for the null rows.

In [26]:
def process_price(df):
    df = df.copy()
        
    def parse_price(x):
        try:
            return literal_eval(x)
        except ValueError:
            return {'currency': 'GBP', 'initial': -1}
    
    # evaluate as dictionary and set to -1 if missing
    df['price_overview'] = df['price_overview'].apply(parse_price)
    
    # Create columns from currency and initial values
    df['currency'] = df['price_overview'].apply(lambda x: x['currency'])
    df['price'] = df['price_overview'].apply(lambda x: x['initial'])
    
    # Set price of free games to 0
    df.loc[df['is_free'], 'price'] = 0
    
    return df

price_data = process_price(steam_data)[['name', 'currency', 'price']]
price_data.head()

|   | name | currency | price |
|---|------|----------|-------|
| 0 | Counter-Strike | GBP | 719 |
| 1 | Team Fortress Classic | GBP | 399 |
| 2 | Day of Defeat | GBP | 399 |
| 3 | Deathmatch Classic | GBP | 399 |
| 4 | Half-Life: Opposing Force | GBP | 399 |
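The try/except works because pandas stores missing values as NaN, a float, and `literal_eval` raises a `ValueError` when handed anything that isn't a string or a literal node. The sketch below demonstrates this and shows a hypothetical variant, `parse_price_explicit`, which checks for nulls directly instead. It is an alternative for illustration, not the function used in this post.

```python
import numpy as np
import pandas as pd
from ast import literal_eval


def parse_price_explicit(x):
    """Hypothetical variant of parse_price using an explicit null check."""
    if pd.isnull(x):
        return {'currency': 'GBP', 'initial': -1}
    return literal_eval(x)


# NaN is a float, not a string, so literal_eval rejects it
try:
    literal_eval(np.nan)
    raised = False
except ValueError:
    raised = True

# Made-up column with one real value and one missing value
toy = pd.Series(["{'currency': 'GBP', 'initial': 719}", np.nan])
parsed = toy.apply(parse_price_explicit)

print(raised, parsed.tolist())
```

One trade-off is that the explicit check would let a genuinely malformed string raise an error, whereas the try/except version quietly turns it into -1.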


We're almost finished, but let's check if any games don't have GBP listed as the currency.

In [27]:
price_data[price_data['currency'] != 'GBP']

|   | name | currency | price |
|---|------|----------|-------|
| 991 | Robin Hood: The Legend of Sherwood | USD | 799 |
| 5767 | Assassin’s Creed® Chronicles: India | EUR | 999 |
| 27593 | Mortal Kombat 11 | USD | 5999 |
| 27995 | Pagan Online | EUR | 2699 |


For some reason there are four games listed in either USD or EUR. We could use the current exchange rate to try and convert them into GBP, however as there are only four rows it's easier and safer to simply drop them.

We can also divide the prices by 100 so they are displayed as floats in pounds.

In [28]:
def process_price(df):
    """Process price_overview column into formatted price column."""
    df = df.copy()
    
    def parse_price(x):
        try:
            return literal_eval(x)
        except ValueError:
            return {'currency': 'GBP', 'initial': -1}
    
    # evaluate as dictionary and set to -1 if missing
    df['price_overview'] = df['price_overview'].apply(parse_price)
    
    # create columns from currency and initial values
    df['currency'] = df['price_overview'].apply(lambda x: x['currency'])
    df['price'] = df['price_overview'].apply(lambda x: x['initial'])
    
    # set price of free games to 0
    df.loc[df['is_free'], 'price'] = 0
    
    # remove non-GBP rows (copy so the assignment below doesn't act on a view)
    df = df[df['currency'] == 'GBP'].copy()
    
    # change price to display in pounds (only applying to rows with a value greater than 0)
    df.loc[df['price'] > 0, 'price'] /= 100
    
    # remove columns no longer needed
    df = df.drop(['is_free', 'currency', 'price_overview'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    
    return df


steam_data = process(raw_steam_data)
steam_data[['name', 'price']].head()

|   | name | price |
|---|------|-------|
| 0 | Counter-Strike | 7.19 |
| 1 | Team Fortress Classic | 3.99 |
| 2 | Day of Defeat | 3.99 |
| 3 | Deathmatch Classic | 3.99 |
| 4 | Half-Life: Opposing Force | 3.99 |
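The reason for only converting values greater than 0 is to keep the -1 placeholder intact, otherwise it would become -0.01 and be harder to target later. A minimal sketch of the same idea on made-up prices, using `Series.where` as one possible alternative to the `.loc` assignment:

```python
import pandas as pd

# Made-up prices in pence: paid games, a free game (0) and a missing price (-1)
pence = pd.Series([719, 399, 0, -1, 2299])

# Keep values that are 0 or -1 as they are, convert the rest to pounds
pounds = pence.where(pence <= 0, pence / 100)

print(pounds.tolist())
```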


## Processing Packages

We can now take a look at the `packages` and `package_groups` columns to help decide what to do with rows that are missing price data. We're not incredibly interested in the columns themselves, as they don't appear to provide much new useful information, except which games come with others as part of a bundle.

In [29]:
# temporarily set a pandas option using with and option_context
with pd.option_context("display.max_colwidth", 500):
    display(steam_data[['steam_appid', 'packages', 'package_groups', 'price']].head())

Unnamed: 0,steam_appid,packages,package_groups,price
0,10,[7],"[{'name': 'default', 'title': 'Buy Counter-Strike', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 7, 'percent_savings_text': '', 'percent_savings': 0, 'option_text': 'Counter-Strike: Condition Zero - £7.19', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 719}]}]",7.19
1,20,[29],"[{'name': 'default', 'title': 'Buy Team Fortress Classic', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 29, 'percent_savings_text': '', 'percent_savings': 0, 'option_text': 'Team Fortress Classic - £3.99', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 399}]}]",3.99
2,30,[30],"[{'name': 'default', 'title': 'Buy Day of Defeat', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 30, 'percent_savings_text': '', 'percent_savings': 0, 'option_text': 'Day of Defeat - £3.99', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 399}]}]",3.99
3,40,[31],"[{'name': 'default', 'title': 'Buy Deathmatch Classic', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 31, 'percent_savings_text': '', 'percent_savings': 0, 'option_text': 'Deathmatch Classic - £3.99', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 399}]}]",3.99
4,50,[32],"[{'name': 'default', 'title': 'Buy Half-Life: Opposing Force', 'description': '', 'selection_text': 'Select a purchase option', 'save_text': '', 'display_type': 0, 'is_recurring_subscription': 'false', 'subs': [{'packageid': 32, 'percent_savings_text': '', 'percent_savings': 0, 'option_text': 'Opposing Force - £3.99', 'option_description': '', 'can_get_free_license': '0', 'is_free_license': False, 'price_in_cents_with_discount': 399}]}]",3.99


Overall we have 846 rows with missing price data, which we previously set to -1.

In [30]:
print(steam_data[steam_data['price'] == -1].shape[0])

846


We can split these rows into two categories: those with `package_groups` data and those without.

If we take a quick look at the `package_groups` column we see that there appear to be no null values. On closer inspection, rows without data are actually stored as empty lists (at this stage, as the string `"[]"`, since the column has not yet been evaluated).

In [31]:
print('Null counts:', steam_data['package_groups'].isnull().sum())
print('Empty list counts:', steam_data[steam_data['package_groups'] == "[]"].shape[0])

Null counts: 0
Empty list counts: 3353
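The comparison with `"[]"` only works because the column still holds strings. The sketch below, on a made-up series, shows that `isnull` misses these rows, that the string comparison finds them, and that once the values are evaluated with `literal_eval` the equivalent check would be on the length of each list.

```python
import pandas as pd
from ast import literal_eval

# Made-up values in the same shape as the package_groups column
raw = pd.Series(["[]", "[{'name': 'default'}]", "[]"])

null_count = raw.isnull().sum()
empty_string_count = (raw == "[]").sum()

# After evaluation the values are real lists, so check their length instead
parsed = raw.apply(literal_eval)
empty_list_count = (parsed.apply(len) == 0).sum()

print(null_count, empty_string_count, empty_list_count)
```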


Using a combination of filters, we can find out how many rows have both missing `price` and `package_groups` data and investigate. We'll count the rows and print links to some of the store pages and look for patterns.

In [32]:
missing_price_and_package = steam_data[(steam_data['price'] == -1) & (steam_data['package_groups'] == "[]")]

print('Number of rows:', missing_price_and_package.shape[0], '\n')

print('First few rows:\n')
print_steam_links(missing_price_and_package[:5])

print('\nLast few rows:\n')
print_steam_links(missing_price_and_package[-10:-5])

missing_price_and_package.head()

Number of rows: 799 

First few rows:

RollerCoaster Tycoon® 3: Platinum: https://store.steampowered.com/app/2700
Beijing 2008™ - The Official Video Game of the Olympic Games: https://store.steampowered.com/app/10520
LUMINES™ Advance Pack: https://store.steampowered.com/app/11920
Midnight Club 2: https://store.steampowered.com/app/12160
Age of Booty™: https://store.steampowered.com/app/21600

Last few rows:

RoboVirus: https://store.steampowered.com/app/1001870
soko loco deluxe: https://store.steampowered.com/app/1003730
POCKET CAR : VRGROUND: https://store.steampowered.com/app/1004710
The Princess, the Stray Cat, and Matters of the Heart: https://store.steampowered.com/app/1010600
Mr Boom's Firework Factory: https://store.steampowered.com/app/1013670


Unnamed: 0,name,steam_appid,required_age,detailed_description,about_the_game,short_description,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,packages,package_groups,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors,windows,mac,linux,price
75,RollerCoaster Tycoon® 3: Platinum,2700,0,Rollercoaster Tycoon 3 Platinum combines the e...,Rollercoaster Tycoon 3 Platinum combines the e...,Rollercoaster Tycoon 3 Platinum combines the e...,"English, French, Italian, German, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/270...,http://www.atari.com/rollercoastertycoon/us/in...,{'minimum': '<strong>Minimum: </strong><br>\t\...,"{'minimum': '<ul class=""bb_ul""><li><strong>OS:...",[],"['Frontier', 'Aspyr (Mac)']","['Atari', 'Aspyr (Mac)']",,[],"[{'id': 2, 'description': 'Single-player'}]","[{'id': '28', 'description': 'Simulation'}, {'...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '12 Mar, 2008'}","{'url': 'http://www.atari.com/support/atari', ...",https://steamcdn-a.akamaihd.net/steam/apps/270...,"{'ids': [], 'notes': None}",1,1,0,-1.0
311,Beijing 2008™ - The Official Video Game of the...,10520,0,<p> Embrace the competi...,<p> Embrace the competi...,Embrace the competitive spirit of the world's ...,English,https://steamcdn-a.akamaihd.net/steam/apps/105...,http://www.olympicvideogames.com,{'minimum': '<p><strong>Minimum:</strong></p> ...,[],[],['Eurocom'],['SEGA'],,[],"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '18', 'description': 'Sports'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '14 Aug, 2008'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/105...,"{'ids': [], 'notes': None}",1,0,0,-1.0
337,LUMINES™ Advance Pack,11920,0,<p>Ready for the next challenge? Prepare yours...,<p>Ready for the next challenge? Prepare yours...,Ready for the next challenge? Prepare yourself...,"English, French, German, Italian, Spanish - Spain",https://steamcdn-a.akamaihd.net/steam/apps/119...,,{'minimum': '<p><strong>Minimum:</strong></p>\...,[],[],['Q Entertainment Inc.'],['Q Entertainment Inc.'],,[],"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '18 Apr, 2008'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/119...,"{'ids': [], 'notes': None}",1,0,0,-1.0
344,Midnight Club 2,12160,0,Members of the world's most notorious illegal ...,Members of the world's most notorious illegal ...,The world's most notorious drivers meet each n...,"English<strong>*</strong>, French, Italian, Ge...",https://steamcdn-a.akamaihd.net/steam/apps/121...,http://www.rockstargames.com/midnightclub2,{'minimum': '<strong>Minimum:</strong><br>\t\t...,[],[],['Rockstar San Diego'],['Rockstar Games'],,[],"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '9', 'description': 'Racing'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '4 Jan, 2008'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/121...,"{'ids': [], 'notes': None}",1,0,0,-1.0
536,Age of Booty™,21600,0,"Set in the swashbuckling era, Age of Booty™ is...","Set in the swashbuckling era, Age of Booty™ is...","Set in the swashbuckling era, Age of Booty™ is...",English<strong>*</strong><br><strong>*</strong...,https://steamcdn-a.akamaihd.net/steam/apps/216...,http://www.certainaffinity.com/ageofbooty/,{'minimum': '<strong>Minimum:</strong> ...,[],[],['Certain Affinity™'],['Capcom'],,[],"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '9 Mar, 2009'}",{'url': 'http://www.capcom.co.jp/support/conta...,https://steamcdn-a.akamaihd.net/steam/apps/216...,"{'ids': [], 'notes': None}",1,0,0,-1.0


Most of the games - 799 of 846 - with missing price data fall into the above category. This probably means they can be safely removed.

From following the links for the first few rows to the store page, it looks like they are currently unavailable or have been delisted from the store. Looking at the last few rows, it appears most of them haven't yet been released and haven't had a price set. We'll take care of all the unreleased games when we clean the `release_date` column, but we can remove all of these apps here.

Let's now take a look at the rows that have missing price data but do have `package_groups` data. We may be interested in keeping these rows and extracting price data from the package data.

In [33]:
missing_price_have_package = steam_data.loc[(steam_data['price'] == -1) & (steam_data['package_groups'] != "[]"), ['name', 'steam_appid', 'package_groups', 'price']]

print('Number of rows:', missing_price_have_package.shape[0], '\n')

print('First few rows:\n')
print_steam_links(missing_price_have_package[:5])

print('\nLast few rows:\n')
print_steam_links(missing_price_have_package[-10:-5])

display(missing_price_have_package.head())
missing_price_have_package.iloc[-10:-5]

Number of rows: 47 

First few rows:

The Ship: Single Player: https://store.steampowered.com/app/2420
BioShock™: https://store.steampowered.com/app/7670
Sam & Max 101: Culture Shock: https://store.steampowered.com/app/8200
Sam & Max 102: Situation: Comedy: https://store.steampowered.com/app/8210
Sam & Max 103: The Mole, the Mob and the Meatball: https://store.steampowered.com/app/8220

Last few rows:

Viscera Cleanup Detail: Shadow Warrior: https://store.steampowered.com/app/255520
Space Hulk: Deathwing: https://store.steampowered.com/app/298900
7,62 Hard Life: https://store.steampowered.com/app/306290
Letter Quest: Grimm's Journey: https://store.steampowered.com/app/328730
Rad Rodgers: World One: https://store.steampowered.com/app/353580


Unnamed: 0,name,steam_appid,package_groups,price
63,The Ship: Single Player,2420,"[{'name': 'default', 'title': 'Buy The Ship: S...",-1.0
220,BioShock™,7670,"[{'name': 'default', 'title': 'Buy BioShock™',...",-1.0
234,Sam & Max 101: Culture Shock,8200,"[{'name': 'default', 'title': 'Buy Sam & Max 1...",-1.0
235,Sam & Max 102: Situation: Comedy,8210,"[{'name': 'default', 'title': 'Buy Sam & Max 1...",-1.0
236,"Sam & Max 103: The Mole, the Mob and the Meatball",8220,"[{'name': 'default', 'title': 'Buy Sam & Max 1...",-1.0


Unnamed: 0,name,steam_appid,package_groups,price
2421,Viscera Cleanup Detail: Shadow Warrior,255520,"[{'name': 'default', 'title': 'Buy Viscera Cle...",-1.0
3576,Space Hulk: Deathwing,298900,"[{'name': 'default', 'title': 'Buy Space Hulk:...",-1.0
3811,"7,62 Hard Life",306290,"[{'name': 'default', 'title': 'Buy 7,62 Hard L...",-1.0
4504,Letter Quest: Grimm's Journey,328730,"[{'name': 'default', 'title': ""Buy Letter Ques...",-1.0
5514,Rad Rodgers: World One,353580,"[{'name': 'default', 'title': 'Buy Rad Rodgers...",-1.0


Looking at a selection of these rows, the games appear to be either superseded by a newer release or remaster, part of a bigger bundle or episodic series, or included with the purchase of another game.

Whilst we could extract prices from the `package_groups` data, the most sensible option seems to be removing these rows. There are only 47 rows this applies to, and any with a newer release will still have the re-release in the data.

Since our logic interacts heavily with the price data we will update the `process_price` function rather than creating a new one.

In [34]:
def process_price(df):
    """Process price_overview column into formatted price column, and take care of package columns."""
    df = df.copy()
    
    def parse_price(x):
        try:
            return literal_eval(x)
        except ValueError:
            return {'currency': 'GBP', 'initial': -1}
    
    # evaluate as dictionary and set to -1 if missing
    df['price_overview'] = df['price_overview'].apply(parse_price)
    
    # create columns from currency and initial values
    df['currency'] = df['price_overview'].apply(lambda x: x['currency'])
    df['price'] = df['price_overview'].apply(lambda x: x['initial'])
    
    # set price of free games to 0
    df.loc[df['is_free'], 'price'] = 0
    
    # remove non-GBP rows
    df = df[df['currency'] == 'GBP']
    
    # remove rows where price is -1
    df = df[df['price'] != -1]
    
    # change price to display in pounds (can apply to all now -1 rows removed)
    df['price'] /= 100
    
    # remove columns no longer needed
    df = df.drop(['is_free', 'currency', 'price_overview', 'packages', 'package_groups'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    
    return df

steam_data = process(raw_steam_data)
steam_data.head()

Unnamed: 0,name,steam_appid,required_age,detailed_description,about_the_game,short_description,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors,windows,mac,linux,price
0,Counter-Strike,10,0,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,"English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'url': 'http://steamcommunity.com/app/10', 'e...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,"{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,7.19
1,Team Fortress Classic,20,0,One of the most popular online action games of...,One of the most popular online action games of...,One of the most popular online action games of...,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/20/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/20/...,"{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,3.99
2,Day of Defeat,30,0,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,"English, French, German, Italian, Spanish - Spain",https://steamcdn-a.akamaihd.net/steam/apps/30/...,http://www.dayofdefeat.com/,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/30/...,"{'ids': [], 'notes': None}",1,1,1,3.99
3,Deathmatch Classic,40,0,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/40/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/40/...,"{'ids': [], 'notes': None}",1,1,1,3.99
4,Half-Life: Opposing Force,50,0,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,"English, French, German, Korean",https://steamcdn-a.akamaihd.net/steam/apps/50/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Gearbox Software'],['Valve'],"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'url': 'https://help.steampowered.com', 'emai...",https://steamcdn-a.akamaihd.net/steam/apps/50/...,"{'ids': [], 'notes': None}",1,1,1,3.99


The next columns in the data are descriptive columns - `detailed_description`, `about_the_game` and `short_description`. We won't be handling them now, instead returning to them in a later post dealing with export columns. These are columns where we will export all or some of the data to a separate csv file as part of the cleaning.

## Processing Languages

Beyond that, the next column is `supported_languages`. As we will be performing the analysis for an English company, we will only be interested in games available in English. Whilst we could remove non-English games at this stage, we will instead create a column marking English games with a boolean value - True or False.

We begin as usual by looking for rows with null values.

In [35]:
steam_data['supported_languages'].isnull().sum()

4

Taking a closer look at these apps, it doesn't look like there's anything wrong with them. It may be that the data simply wasn't supplied. As there are only 4 rows affected we will go ahead and remove these from the data set.

In [36]:
steam_data[steam_data['supported_languages'].isnull()]

Unnamed: 0,name,steam_appid,required_age,detailed_description,about_the_game,short_description,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors,windows,mac,linux,price
4866,Subsiege,338640,0,"<img src=""https://steamcdn-a.akamaihd.net/stea...","<img src=""https://steamcdn-a.akamaihd.net/stea...",Subsiege is an intense real-time tactic game w...,,https://steamcdn-a.akamaihd.net/steam/apps/338...,http://subsiege-game.com/,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],['Icebird Studios'],['Icebird Studios'],,,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256729398, 'name': 'Release Trailer', ...",{'total': 0},"{'coming_soon': False, 'date': '7 Sep, 2018'}","{'url': 'http://subsiege-game.com/', 'email': ...",https://steamcdn-a.akamaihd.net/steam/apps/338...,"{'ids': [], 'notes': None}",1,0,0,17.89
14560,MARS VR(全球使命VR),596560,0,1.\t4K level audio-visual experience <br />\r\...,1.\t4K level audio-visual experience <br />\r\...,Welcome to 《Mars VR》. This is an immersive fir...,,https://steamcdn-a.akamaihd.net/steam/apps/596...,http://qqsm.zygames.com/,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,"['Ying Pei Digital Technology Shanghai Co., Li...","['SHANGHAI ZHENYOU TECHNOLOGY CO.,LTD']","[{'id': 2, 'description': 'Single-player'}]","[{'id': '73', 'description': 'Violent'}, {'id'...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256681371, 'name': 'marsvr', 'thumbnai...",{'total': 0},"{'coming_soon': False, 'date': '5 Apr, 2017'}","{'url': 'http://www.zygames.com/contact', 'ema...",https://steamcdn-a.akamaihd.net/steam/apps/596...,"{'ids': [], 'notes': None}",1,0,0,1.99
16386,Numberline 2,654970,0,NumberLine 2 is the continuation of the popula...,NumberLine 2 is the continuation of the popula...,NumberLine 2 is the continuation of the popula...,,https://steamcdn-a.akamaihd.net/steam/apps/654...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],"['V34D4R', 'Egor Magurin']",['Indovers Studio'],"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256687192, 'name': 'Numberline 2 Trail...","{'total': 60, 'highlighted': [{'name': '1st le...","{'coming_soon': False, 'date': '14 Jul, 2017'}","{'url': '', 'email': 'radaew.zhenya@yandex.ru'}",https://steamcdn-a.akamaihd.net/steam/apps/654...,"{'ids': [], 'notes': None}",1,0,0,1.59
26855,SNUSE 221,948070,0,<strong> Hey. My name is *&amp;#!$.<br>Today I...,<strong> Hey. My name is *&amp;#!$.<br>Today I...,Hey. My name is *&amp;#!$. Today I will tell y...,,https://steamcdn-a.akamaihd.net/steam/apps/948...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],['SNUSE GM'],['SNUSE GM'],"[{'id': 2, 'description': 'Single-player'}]","[{'id': '25', 'description': 'Adventure'}, {'i...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256745662, 'name': 'snuse', 'thumbnail...",{'total': 0},"{'coming_soon': False, 'date': '2 Apr, 2019'}","{'url': 'vk.com/nilow_i', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/948...,"{'ids': [], 'notes': None}",1,0,0,0.79


Next we'll take a look at the structure of the column. By looking at the value for the first row and the values for the most common rows, it looks like languages are stored as a string which can be anything from a comma-separated list of languages to a mix of html and headings. It seems reasonably safe to assume that if the app is in English, the word English will appear somewhere in this string. With this in mind we can simply search the string and return a value based on the result.

In [37]:
print(steam_data['supported_languages'][0])
steam_data['supported_languages'].value_counts().head(10)

English<strong>*</strong>, French<strong>*</strong>, German<strong>*</strong>, Italian<strong>*</strong>, Spanish - Spain<strong>*</strong>, Simplified Chinese<strong>*</strong>, Traditional Chinese<strong>*</strong>, Korean<strong>*</strong><br><strong>*</strong>languages with full audio support


English                                                                                                        8512
English<strong>*</strong><br><strong>*</strong>languages with full audio support                               7409
English, Russian                                                                                                707
English, Simplified Chinese                                                                                     280
English, Japanese                                                                                               235
English<strong>*</strong>, Russian<strong>*</strong><br><strong>*</strong>languages with full audio support     222
English, French, Italian, German, Spanish - Spain                                                               180
English, German                                                                                                 161
Simplified Chinese                                                      

It looks like English-only games make up a little over half the rows in our dataset (~16,000), and English plus other languages make up many more. We could create columns for any of the other languages by string searching, but for simplicity we'll focus on just the English ones.

Using the Series.apply method once again, we can check if the string 'english' appears in each row. We define an anonymous function on the fly using a [lambda](https://docs.python.org/3/tutorial/controlflow.html?highlight=lambda#lambda-expressions) expression. This returns 1 if 'english' is found and 0 otherwise. As mentioned in the platforms section, this can be interpreted as a boolean value. 

The variable `x` will take on the value of each row as the expression is evaluated. We apply the `lower()` string method so capitalisation doesn't matter.

In [38]:
def process_language(df):
    """Process supported_languages column into a boolean 'is english' column."""
    df = df.copy()
    
    # drop rows with missing language data
    df = df.dropna(subset=['supported_languages'])
    
    df['english'] = df['supported_languages'].apply(lambda x: 1 if 'english' in x.lower() else 0)
    df = df.drop('supported_languages', axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_language(df)
    
    return df


steam_data = process(raw_steam_data)
steam_data[['name', 'english']].head()

Unnamed: 0,name,english
0,Counter-Strike,1
1,Team Fortress Classic,1
2,Day of Defeat,1
3,Deathmatch Classic,1
4,Half-Life: Opposing Force,1


Before moving on, we can take a quick look at the results and see that most of the apps support English.

In [39]:
steam_data['english'].value_counts()

1    27699
0      522
Name: english, dtype: int64
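
As an aside, the same flag could probably be built without `apply`, using the vectorised `Series.str.contains` method with `case=False`. The snippet below is only a sketch on made-up values, not the code used in this project. Note that `str.contains` returns NaN for missing values, so rows with null language data would still need dropping first, as `process_language` does.

```python
import pandas as pd

# Toy stand-in for the supported_languages column (values are illustrative)
languages = pd.Series([
    'English',
    'English<strong>*</strong>, French<strong>*</strong>',
    'Simplified Chinese',
    'german, ENGLISH',
])

# Case-insensitive substring search, vectorised over the whole column,
# then converted from True/False to 1/0
english_flag = languages.str.contains('english', case=False).astype(int)

print(english_flag.tolist())
```

Either approach relies on the same assumption made above: that the word English appears somewhere in the string whenever the app supports English.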

## Processing Developers and Publishers

Again we'll skip the next few columns as we'll deal with them another time, and take a look at `developers` and `publishers`. They will most likely contain similar information so we can look at them together. 

We'll start by checking the null counts. While the publishers column doesn't appear to have any null values, searching for lists containing only an empty string (`['']`) reveals over 200 'hidden' null values.

In [40]:
print('Developers null counts:', steam_data['developers'].isnull().sum())
print('Developers empty list counts:', steam_data[steam_data['developers'] == "['']"].shape[0])

print('\npublishers null counts:', steam_data['publishers'].isnull().sum())
print('publishers empty list counts:', steam_data[steam_data['publishers'] == "['']"].shape[0])

Developers null counts: 104
Developers empty list counts: 0

publishers null counts: 0
publishers empty list counts: 213


Ultimately we want a data set with no missing values. That means we have a few options for dealing with these two columns:

- Remove all rows missing either developer or publisher information
- Impute missing information by replacing the missing column with the column we have (i.e. if developers is missing, fill it with the value in publishers)
- Fill missing information with 'Unknown' or 'None'
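
To make the three options concrete, here is a minimal sketch on a toy dataframe. The rows and names are invented for illustration, and the two kinds of missing value (NaN and `['']`) are first normalised to NaN so that they can be handled in the same way.

```python
import numpy as np
import pandas as pd

# Toy data mimicking the two kinds of missing value:
# NaN in developers and the string "['']" in publishers
toy = pd.DataFrame({
    'developers': ["['Studio A']", np.nan, np.nan, "['Studio C']"],
    'publishers': ["['']", "['Publisher B']", "['']", "['Publisher C']"],
})

# Normalise the 'hidden' nulls so both columns use NaN for missing data
toy = toy.replace("['']", np.nan)

# Option 1: drop rows missing either value
dropped = toy.dropna(subset=['developers', 'publishers'])

# Option 2: fill each column from the other where possible
imputed = toy.copy()
imputed['developers'] = imputed['developers'].fillna(toy['publishers'])
imputed['publishers'] = imputed['publishers'].fillna(toy['developers'])

# Option 3: label anything still missing as 'Unknown'
filled = imputed.fillna('Unknown')

print(len(dropped))
print(filled)
```

Even with imputation, rows missing both values (like the third toy row) would still need to be dropped or labelled.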

We can investigate some of the rows with missing data to help inform our decision.

In [41]:
no_dev = steam_data[steam_data['developers'].isnull()]

print('Total games missing developer:', no_dev.shape[0], '\n')

print_steam_links(no_dev[:5])

no_pub = steam_data[steam_data['publishers'] == "['']"]

print('\nTotal games missing publisher:', no_pub.shape[0], '\n')
print_steam_links(no_pub[:5])

no_dev_or_pub = steam_data[(steam_data['developers'].isnull()) & (steam_data['publishers'] == "['']")]

print('\nTotal games missing developer and publisher:', no_dev_or_pub.shape[0], '\n')
print_steam_links(no_dev_or_pub[:5])

Total games missing developer: 104 

Tycoon City: New York: https://store.steampowered.com/app/9730
Nikopol: Secrets of the Immortals: https://store.steampowered.com/app/11370
Crash Time 2: https://store.steampowered.com/app/11390
Hunting Unlimited 2010: https://store.steampowered.com/app/12690
18 Wheels of Steel: Extreme Trucker: https://store.steampowered.com/app/33730

Total games missing publisher: 213 

RIP - Trilogy™: https://store.steampowered.com/app/2540
Vigil: Blood Bitterness™: https://store.steampowered.com/app/2570
Bullet Candy: https://store.steampowered.com/app/6600
AudioSurf: https://store.steampowered.com/app/12900
Everyday Shooter: https://store.steampowered.com/app/16300

Total games missing developer and publisher: 67 

PlayClaw 5 - Game Recording and Streaming: https://store.steampowered.com/app/237370
Artemis Spaceship Bridge Simulator: https://store.steampowered.com/app/247350
A Walk in the Dark: https://store.steampowered.com/app/248730
Forge Quest: https://stor

These appear to be a mix of titles, mostly smaller ones, and some of the indie titles may well have been self-published. Others simply have wrong or missing data, as we can confirm by searching for the titles elsewhere. As our priority is creating a clean data set, and there are only a few hundred affected rows, we will remove them from the data.

Let's take a look at the structure of the data. Below we inspect some rows near the beginning of the dataframe. It looks like both columns are stored as lists which can have one or multiple values. We'll have to evaluate the rows as before, so they are recognised as lists, then index into them accordingly.

In [42]:
steam_data[['developers', 'publishers']].iloc[24:28]

Unnamed: 0,developers,publishers
24,['Valve'],['Valve']
25,"['Valve', 'Hidden Path Entertainment']",['Valve']
27,['Mark Healey'],['Mark Healey']
28,['Tripwire Interactive'],['Tripwire Interactive']


As we have some single values and some multiple, we have to decide how to handle them. Here are some potential solutions:

 - Create a column for each value in the list (i.e. developer_1, developer_2)
 - Create a column with the first value in the list and a column with the rest of the values (i.e. developer_1, other_developers)
 - Create a column with the first value in the list and disregard the rest
 - Combine all values into one column, simply unpacking the list
 
Let's begin defining our function, and take a look at how many rows have multiple developers or publishers. After evaluating each row, we can find the length of the lists in each row by using the [Series.str.len()](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.len.html) method. By filtering only rows where the list has more than one element, we can find the number of rows with more than one value in each column.

In [43]:
def process_developers_and_publishers(df):
    # remove rows with missing data
    df = df[(df['developers'].notnull()) & (df['publishers'] != "['']")].copy()
    
    for col in ['developers', 'publishers']:
        df[col] = df[col].apply(lambda x: literal_eval(x))
        
        # filter dataframe to rows with lists longer than 1, and store the number of rows
        num_rows = df[df[col].str.len() > 1].shape[0]
        
        print('Rows in {} column with multiple values:'.format(col), num_rows)

process_developers_and_publishers(steam_data)

Rows in developers column with multiple values: 1720
Rows in publishers column with multiple values: 884
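
Although `Series.str.len()` sits under the string accessor, it returns the length of each element, so it also works on a column of lists. A small sketch on made-up rows shows the counting step in isolation:

```python
import pandas as pd

# A Series whose elements are Python lists, as produced by literal_eval
devs = pd.Series([
    ['Valve'],
    ['Valve', 'Hidden Path Entertainment'],
    ['Mark Healey'],
])

# .str.len() gives the number of items in each list
lengths = devs.str.len()

# Summing the boolean mask counts rows with more than one value
multiple = (lengths > 1).sum()

print(lengths.tolist(), multiple)
```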


It turns out that the vast majority have only one value for these columns. If we went with the first or second solutions above, we'd be left with columns with mostly missing data. We could go with the third option, but the first value in the list isn't necessarily the most important, and this seems unfair if multiple teams were involved.

The best way forward seems to be the fourth option - if there are multiple values we combine them into the same column. We'll create a comma-separated list in this case. We can achieve this by calling the [str.join()](https://docs.python.org/3/library/stdtypes.html#str.join) method on a comma string (`', '`) and passing it the list of values. If we pass a list with only one value, we get a string with just that value. If we pass a list with multiple values, we get a comma-separated list as desired. We can see this below.

In [44]:
', '.join(['one item'])

'one item'

In [45]:
', '.join(['multiple', 'different', 'items'])

'multiple, different, items'

Now we're ready to finish the function we started. We'll abandon the for loop, as there is not too much repetition, and add it into the `process` function as always.

In [46]:
def process_developers_and_publishers(df):
    """Parse columns as comma-separated string."""
    # remove rows with missing data
    df = df[(df['developers'].notnull()) & (df['publishers'] != "['']")].copy()
            
    df['developers'] = df['developers'].apply(lambda x: literal_eval(x))
    df['publishers'] = df['publishers'].apply(lambda x: literal_eval(x))
    
    df['developer'] = df['developers'].apply(lambda x: ', '.join(x))
    df['publisher'] = df['publishers'].apply(lambda x: ', '.join(x))

    df = df.drop(['developers', 'publishers'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_language(df)
    df = process_developers_and_publishers(df)
    
    return df

steam_data = process(raw_steam_data)
steam_data[['name', 'steam_appid', 'developer', 'publisher']].head()

Unnamed: 0,name,steam_appid,developer,publisher
0,Counter-Strike,10,Valve,Valve
1,Team Fortress Classic,20,Valve,Valve
2,Day of Defeat,30,Valve,Valve
3,Deathmatch Classic,40,Valve,Valve
4,Half-Life: Opposing Force,50,Gearbox Software,Valve


There is one remaining issue with these columns. The names of some developers and publishers include commas, such as 'PopCap Games, Inc.'. Because of the way we handled the data, if we were to try and separate them out again by splitting on the comma we would run into problems. That's fine for now but we may have to keep that in mind in the future.

In [47]:
steam_data.loc[steam_data['developer'].str.contains('popcap', case=False), ['name', 'developer', 'publisher']].iloc[:2]

Unnamed: 0,name,developer,publisher
97,Bejeweled 2 Deluxe,"PopCap Games, Inc.","PopCap Games, Inc."
98,Chuzzle Deluxe,"PopCap Games, Inc.","PopCap Games, Inc."
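
The problem can be illustrated with plain Python strings. The sketch below also tries a semicolon as a separator, which is purely a hypothetical alternative and not something used in this project; it assumes no developer or publisher name contains a semicolon, which we haven't checked.

```python
# Joining with ', ' loses the boundary between names when a name contains a comma
names = ['PopCap Games, Inc.', 'Valve']

joined = ', '.join(names)
comma_split = joined.split(', ')
print(comma_split)  # three pieces rather than the original two names

# Hypothetical alternative: a separator unlikely to appear inside a name
alt = ';'.join(names)
semicolon_split = alt.split(';')
print(semicolon_split)  # the original two names
```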


## Processing Achievements and Content Descriptors

The final two columns we will take care of in this section are `achievements` and `content_descriptors`. Let's take a look at the null counts for each column and a small sample of rows.

In [48]:
print('Achievements null counts:', steam_data['achievements'].isnull().sum())
print('Content Descriptors null counts:', steam_data['content_descriptors'].isnull().sum())

steam_data[['name', 'achievements', 'content_descriptors']].iloc[8:13]

Achievements null counts: 1946
Content Descriptors null counts: 0


Unnamed: 0,name,achievements,content_descriptors
8,Half-Life: Blue Shift,{'total': 0},"{'ids': [], 'notes': None}"
9,Half-Life 2,"{'total': 33, 'highlighted': [{'name': 'Defian...","{'ids': [], 'notes': None}"
10,Counter-Strike: Source,"{'total': 147, 'highlighted': [{'name': 'Someo...","{'ids': [2, 5], 'notes': 'Includes intense vio..."
11,Half-Life: Source,{'total': 0},"{'ids': [], 'notes': None}"
12,Day of Defeat: Source,"{'total': 54, 'highlighted': [{'name': 'Double...","{'ids': [], 'notes': None}"


It looks like both columns are stored as dictionaries, with standard formats if no details are provided or exist.

Below we take a closer look at a single row from the achievements column.

In [49]:
literal_eval(steam_data['achievements'][9])

{'total': 33,
 'highlighted': [{'name': 'Defiant',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_hit_cancop_withcan.jpg'},
  {'name': 'Submissive',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_put_canintrash.jpg'},
  {'name': 'Malcontent',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_escape_apartmentraid.jpg'},
  {'name': 'What cat?',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_break_miniteleporter.jpg'},
  {'name': 'Trusty Hardware',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_get_crowbar.jpg'},
  {'name': 'Barnacle Bowling',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_kill_barnacleswithbarrel.jpg'},
  {'name': "Anchor's Aweigh!",
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_get_airboat.jpg'},
  {'nam

There are two keys in the top level of the dictionary: `total` and `highlighted`. The `highlighted` key looks too specific, being a selection of achievements particular to that game, so we will remove it. It may be worthwhile extracting the `total` value though.
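As a minimal sketch of that extraction, assuming each cell holds the string representation of a dictionary: the sample string below is made up for illustration (including the placeholder URL) and is not taken from the data set.

```python
from ast import literal_eval

# Hypothetical cell value, shaped like the rows shown above
raw = "{'total': 33, 'highlighted': [{'name': 'Defiant', 'path': 'https://example.com/a.jpg'}]}"

parsed = literal_eval(raw)  # safely evaluate the string into a Python dict
total = parsed['total']     # keep only the achievement count, discard 'highlighted'

print(total)
```

Using `literal_eval` rather than `eval` means only Python literals are evaluated, which is the safer choice for text that came from an external API.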

Now let's take a look at the `content_descriptors` column.

In [50]:
steam_data['content_descriptors'].value_counts().head(10)

{'ids': [], 'notes': None}                                                                                                                                                                  25946
{'ids': [2, 5], 'notes': None}                                                                                                                                                                428
{'ids': [1, 5], 'notes': None}                                                                                                                                                                253
{'ids': [5], 'notes': None}                                                                                                                                                                   128
{'ids': [1, 2, 5], 'notes': None}                                                                                                                                                             122
{'ids': [2, 5], 'notes': 'This

Content descriptors contain age-related warnings about the content of a game. Each is identified by a numeric ID, with optional notes supplied. Almost 26,000 rows have an empty list, indicating either that the game has no content descriptors or that none were provided. Because of this, and because the rows are highly specific to each game, we will drop this column entirely.
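To put a number on how sparse a column like this is, one option is to compare each value against the "empty" pattern and take the mean of the resulting boolean mask. The sketch below uses a small toy Series as a stand-in for `steam_data['content_descriptors']`, and its values are illustrative only.

```python
import pandas as pd

# Toy stand-in for the content_descriptors column (illustrative values)
descriptors = pd.Series([
    "{'ids': [], 'notes': None}",
    "{'ids': [], 'notes': None}",
    "{'ids': [2, 5], 'notes': None}",
    "{'ids': [], 'notes': None}",
])

# The string used when no descriptors are present
empty = "{'ids': [], 'notes': None}"

# The mean of a boolean mask is the fraction of True values
share_empty = (descriptors == empty).mean()
print(f"{share_empty:.0%} of rows have no descriptors")
```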

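Let's now define a function that drops `content_descriptors` and replaces `achievements` with its `total` value, treating missing values as 0, then add it to `process`.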

In [51]:
def process_achievements_and_descriptors(df):
    """Drop content descriptors and parse achievements as a total count."""
    df = df.copy()
    
    df = df.drop('content_descriptors', axis=1)
    
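    def parse_achievements(x):
        try:
            return literal_eval(x)['total']
        except ValueError:
            # handle missing values (NaN cannot be parsed by literal_eval)
            # pd.isnull is used because np.isnan raises TypeError on strings
            if pd.isnull(x):
                return 0
            else:
                # safety in case of other problem: show the value (returns None)
                print(x)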
        
    df['achievements'] = df['achievements'].apply(parse_achievements)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_language(df)
    df = process_developers_and_publishers(df)
    df = process_achievements_and_descriptors(df)
    
    return df

steam_data = process(raw_steam_data)
steam_data[['name', 'achievements']].head()

Unnamed: 0,name,achievements
0,Counter-Strike,0
1,Team Fortress Classic,0
2,Day of Defeat,0
3,Deathmatch Classic,0
4,Half-Life: Opposing Force,0


We know that the first few rows have 0 total achievements so that's fine, but let's take a look at the value counts to verify everything went as expected.

In [52]:
with pd.option_context("display.max_rows", 12):
    display(steam_data['achievements'].value_counts().sort_index())

0       12495
1         273
2         102
3         143
4         215
5         373
        ...  
4996        1
4997        1
4999        1
5000       96
5394        1
9821        1
Name: achievements, Length: 411, dtype: int64

It looks like we were successful. We'll leave this column as it is for now, though we may later wish to group the values into bins, as we did for the age column. That decision can wait until the feature engineering stage of our analysis, when we can judge whether binning is more useful.
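If we do decide to bin, one possible approach is `pd.cut`. This is a rough sketch only: the bin edges and labels are hypothetical choices, the sample values are illustrative, and it may not match how the age column was grouped earlier.

```python
import pandas as pd

# Illustrative achievement totals, not the real column
achievements = pd.Series([0, 1, 12, 33, 147, 5000])

# Hypothetical bin edges; by default intervals are closed on the right: (lower, upper]
bins = [-1, 0, 10, 50, 10000]
labels = ['none', '1-10', '11-50', '51+']

binned = pd.cut(achievements, bins=bins, labels=labels)
print(binned.value_counts().sort_index())
```

Starting the first edge at -1 puts a total of exactly 0 in its own bin, which seems sensible given how many games have no achievements.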

## Export Partially Clean Data

As I said at the beginning, data cleaning is a lengthy process. Already this part is far longer than it probably should be, and there's still plenty more to do. In the next part we'll take care of most of the remaining columns, and we'll be exporting a bunch of data too.

Before we wrap up, let's take a look at the current state of the data, then export it ready to continue in part 2.

In [53]:
steam_data.head()

Unnamed: 0,name,steam_appid,required_age,detailed_description,about_the_game,short_description,header_image,website,pc_requirements,mac_requirements,linux_requirements,categories,genres,screenshots,movies,achievements,release_date,support_info,background,windows,mac,linux,price,english,developer,publisher
0,Counter-Strike,10,0,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,https://steamcdn-a.akamaihd.net/steam/apps/10/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'url': 'http://steamcommunity.com/app/10', 'e...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,1,1,1,7.19,1,Valve,Valve
1,Team Fortress Classic,20,0,One of the most popular online action games of...,One of the most popular online action games of...,One of the most popular online action games of...,https://steamcdn-a.akamaihd.net/steam/apps/20/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/20/...,1,1,1,3.99,1,Valve,Valve
2,Day of Defeat,30,0,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,https://steamcdn-a.akamaihd.net/steam/apps/30/...,http://www.dayofdefeat.com/,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '1 May, 2003'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/30/...,1,1,1,3.99,1,Valve,Valve
3,Deathmatch Classic,40,0,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,https://steamcdn-a.akamaihd.net/steam/apps/40/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/40/...,1,1,1,3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,0,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,https://steamcdn-a.akamaihd.net/steam/apps/50/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'url': 'https://help.steampowered.com', 'emai...",https://steamcdn-a.akamaihd.net/steam/apps/50/...,1,1,1,3.99,1,Gearbox Software,Valve


In [54]:
steam_data.isnull().sum()

name                       0
steam_appid                0
required_age               0
detailed_description      14
about_the_game            14
short_description         14
header_image               0
website                 9487
pc_requirements            0
mac_requirements           0
linux_requirements         0
categories               509
genres                    37
screenshots                5
movies                  1762
achievements               0
release_date               0
support_info               0
background                 5
windows                    0
mac                        0
linux                      0
price                      0
english                    0
developer                  0
publisher                  0
dtype: int64

In [55]:
steam_data.to_csv('../data/exports/steam_clean_part_1.csv', index=False)
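One thing to keep in mind when picking this file up again: columns that still hold dictionary-like values, such as `release_date`, are written out as plain text and will come back as strings. The sketch below illustrates this with a one-row toy frame and an in-memory buffer in place of the real export path.

```python
import io
from ast import literal_eval

import pandas as pd

# One-row toy frame mimicking a few of the exported columns
df = pd.DataFrame({
    'name': ['Counter-Strike'],
    'achievements': [0],
    'release_date': ["{'coming_soon': False, 'date': '1 Nov, 2000'}"],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)

# Numeric columns are inferred again on load; dict-like columns remain strings
date_info = literal_eval(reloaded['release_date'][0])
print(reloaded.dtypes)
print(date_info['date'])
```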