# Steam Data Cleaning (Part 1)

*This is part of a larger series of posts on downloading, processing and analysing data from the Steam store - (add links and desc here)*

**TODO**: genre and categories section writeup

Currently our downloaded data is not in a very usable state. Many of the columns contain lengthy strings or missing values, both of which hinder analysis, and especially any machine learning techniques we may wish to apply.

The main aims of this project are to investigate various sales and play-time statistics for games from the Steam store, and to see how different features of games may affect their success. Keeping this in mind will help inform our decisions about how to handle the various columns in our data set. That said, it may be worth keeping columns that do not seem useful to this particular project, in order to provide a robust data set for future analysis projects.

To begin with, we'll import our libraries and set some options, then take a look at the downloaded data from the Steam API. Once that is taken care of we will move on to the SteamSpy data and repeat the process. Hopefully by the end we will have clean data sets to use in the next step, exploratory analysis and visualisation.

### Aims:
- Improve functions
- Prepare notebook for delivery

### (Raw) Data Dictionary

Sort out data dictionary  

API and data dictionary:
https://steamspy.com/api.php

### Future ideas:
- pc requirements analysis over time
- picture analysis
- keyword/recommender analysis
- categories could make table in a database all on its own, perhaps in future
- for genres (and categories?) could create main genre, selected from list of key genres, allowing hybrids like action_adventure if contains both
- remove titles over £60/100?

In [2]:
# load extensions and magics

# http://raw.github.com/jrjohansson/version_information/master/version_information.py
%load_ext version_information
%reload_ext version_information

%version_information numpy, pandas

Software,Version
Python,3.7.3 64bit [MSC v.1900 64 bit (AMD64)]
IPython,7.5.0
OS,Windows 10 10.0.17763 SP0
numpy,1.16.3
pandas,0.24.2
Fri May 24 12:07:27 2019 GMT Summer Time


In [3]:
# import libraries
from ast import literal_eval
import itertools
import time
import re

import numpy as np
import pandas as pd

In [4]:
# customisations
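# use the full option name: the short form "max_columns" can be ambiguous in newer pandas versions
pd.set_option("display.max_columns", 100)
# pd.reset_option("display.max_columns")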

## Cleaning steam data

### Import Data

We begin by importing the raw Steam data we generated previously in data collection, which can be viewed by following the link to the data collection notebook below. From a quick inspection of the data, we can see that we have a mixture of numeric and string columns, plenty of missing values, and a number of columns stored as dictionaries.

In [7]:
from IPython.display import FileLink
FileLink("../notebooks/1-data-collection.ipynb")

In [8]:
raw_steam_data = pd.read_csv('../data/raw/steam_app_data.csv')

print('Rows:', raw_steam_data.shape[0])
print('Columns:', raw_steam_data.shape[1])
raw_steam_data.head()

Rows: 29235
Columns: 39


Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,legal_notice,drm_notice,ext_user_account_notice,developers,publishers,demos,price_overview,packages,package_groups,platforms,metacritic,reviews,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
0,game,Counter-Strike,10,0.0,False,,,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,,"English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'GBP', 'initial': 719, 'final': 7...",[7],"[{'name': 'default', 'title': 'Buy Counter-Str...","{'windows': True, 'mac': True, 'linux': True}","{'score': 88, 'url': 'https://www.metacritic.c...",,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 65735},{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'url': 'http://steamcommunity.com/app/10', 'e...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
1,game,Team Fortress Classic,20,0.0,False,,,One of the most popular online action games of...,One of the most popular online action games of...,One of the most popular online action games of...,,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/20/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'GBP', 'initial': 399, 'final': 3...",[29],"[{'name': 'default', 'title': 'Buy Team Fortre...","{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 2802},{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/20/...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
2,game,Day of Defeat,30,0.0,False,,,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,,"English, French, German, Italian, Spanish - Spain",https://steamcdn-a.akamaihd.net/steam/apps/30/...,http://www.dayofdefeat.com/,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'GBP', 'initial': 399, 'final': 3...",[30],"[{'name': 'default', 'title': 'Buy Day of Defe...","{'windows': True, 'mac': True, 'linux': True}","{'score': 79, 'url': 'https://www.metacritic.c...",,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 1992},{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/30/...,"{'ids': [], 'notes': None}"
3,game,Deathmatch Classic,40,0.0,False,,,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/40/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'GBP', 'initial': 399, 'final': 3...",[31],"[{'name': 'default', 'title': 'Buy Deathmatch ...","{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 931},{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/40/...,"{'ids': [], 'notes': None}"
4,game,Half-Life: Opposing Force,50,0.0,False,,,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,,"English, French, German, Korean",https://steamcdn-a.akamaihd.net/steam/apps/50/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Gearbox Software'],['Valve'],,"{'currency': 'GBP', 'initial': 399, 'final': 3...",[32],"[{'name': 'default', 'title': 'Buy Half-Life: ...","{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 4355},{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'url': 'https://help.steampowered.com', 'emai...",https://steamcdn-a.akamaihd.net/steam/apps/50/...,"{'ids': [], 'notes': None}"


We can chain the `isnull()` and `sum()` methods to easily see how many missing values we have in each column. Immediately we can see that a number of columns have over 20,000 rows with missing data, and in a data set of almost 30,000 rows these are unlikely to provide any useful information.

In [9]:
raw_steam_data.isnull().sum()

type                         149
name                           1
steam_appid                    0
required_age                 149
is_free                      149
controller_support         23237
dlc                        24260
detailed_description         175
about_the_game               175
short_description            175
fullgame                   29235
supported_languages          163
header_image                 149
website                     9983
pc_requirements              149
mac_requirements             149
linux_requirements           149
legal_notice               19168
drm_notice                 29077
ext_user_account_notice    28723
developers                   264
publishers                   149
demos                      27096
price_overview              3712
packages                    3370
package_groups               149
platforms                    149
metacritic                 26254
reviews                    23330
categories                   714
genres    
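
Raw counts are often easier to judge as proportions. As a minimal sketch on a made-up frame (the `toy` data and its values are illustrative, not taken from the Steam data), `isnull().mean()` gives the fraction of missing values per column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, standing in for the real data set
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "website": ["x.com", np.nan, np.nan, "y.com"],
    "dlc": [np.nan, np.nan, np.nan, "[1]"],
})

# isnull() gives a boolean frame; the mean of booleans is the share of True values
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Sorting the result puts the sparsest columns at the top, which makes candidates for removal easy to spot.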

## Defining Functions

We will most likely have to handle each column differently and individually, so we will write some functions to keep our methodology organised, and to help us develop the process iteratively.


### Initial processing

Our first function, `process_null_cols`, will remove the columns with more than 50% missing values, taking care of the null counts we saw previously. We then look at the type and name columns, thinning out our data set a little by removing apps without either.

In the data collection stage, if no information was returned for an app we just stored the name and steam_appid. As seen below, these rows contain no other information so we definitely need to remove them.

In [13]:
# columns to be dropped
raw_steam_data.columns[raw_steam_data.isnull().sum() > (len(raw_steam_data) * 0.5)]

Index(['controller_support', 'dlc', 'fullgame', 'legal_notice', 'drm_notice',
       'ext_user_account_notice', 'demos', 'metacritic', 'reviews',
       'recommendations'],
      dtype='object')
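
One detail worth keeping in mind when turning this into a function: the `thresh` argument of `DataFrame.dropna` is the minimum number of non-null values needed to keep a column, not the number of nulls needed to drop it. A small sketch with made-up data (the `toy` frame and the `max_missing` name are hypothetical) shows the conversion:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 4 rows
toy = pd.DataFrame({
    "full": [1, 2, 3, 4],                    # 0 of 4 missing
    "half": [1, np.nan, 3, np.nan],          # 2 of 4 missing
    "sparse": [np.nan, np.nan, np.nan, 4],   # 3 of 4 missing
})

max_missing = 0.5  # tolerate at most 50% missing values per column

# dropna keeps a column only if it has at least `thresh` non-null values,
# so convert the allowed share of missing values into a required non-null count
min_non_null = int(np.ceil(len(toy) * (1 - max_missing)))

kept = toy.dropna(thresh=min_non_null, axis=1)
print(list(kept.columns))
```

At exactly 50% the two readings happen to coincide, but for any other proportion the `1 - max_missing` conversion matters.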

In [7]:
print('Rows to remove:', raw_steam_data[raw_steam_data['type'].isnull()].shape[0])

raw_steam_data[raw_steam_data['type'].isnull()].head()

Rows to remove: 149


Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,legal_notice,drm_notice,ext_user_account_notice,developers,publishers,demos,price_overview,packages,package_groups,platforms,metacritic,reviews,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
26,,Half-Life: Opposing Force,852,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
147,,Half-Life: Opposing Force,4330,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
256,,Half-Life: Opposing Force,8740,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
264,,Half-Life: Opposing Force,8955,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
336,,Half-Life: Opposing Force,11610,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


We can look at the counts of unique values in a column by using the pandas [Series.value_counts](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) method.

Once the null rows are removed, we can see that all the other rows have 'game' as their type, meaning this column isn't of any use and can be safely dropped.

In [8]:
raw_steam_data['type'].value_counts(dropna=False)

game    29086
NaN       149
Name: type, dtype: int64

In the name column we have a few rows without a title (or with 'none' as the title). It looks like these can be safely removed.

In [9]:
raw_steam_data[(raw_steam_data['name'].isnull()) | (raw_steam_data['name'] == 'none')]

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,legal_notice,drm_notice,ext_user_account_notice,developers,publishers,demos,price_overview,packages,package_groups,platforms,metacritic,reviews,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
4918,game,none,339860,0.0,False,,,,,,,English<strong>*</strong><br><strong>*</strong...,https://steamcdn-a.akamaihd.net/steam/apps/339...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],,,,,[''],,,,[],"{'windows': True, 'mac': False, 'linux': False}",,,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '25', 'description': 'Adventure'}, {'i...",,,,"{'total': 3, 'highlighted': [{'name': 'Master ...","{'coming_soon': False, 'date': '27 Feb, 2015'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}"
6779,game,none,385020,0.0,False,,,- discontinued - (please remove),- discontinued - (please remove),- discontinued - (please remove),,"English, French, Italian, German, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/385...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,,,,['none'],[''],,,,[],"{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '4', 'description': 'Casual'}, {'id': ...",,,,{'total': 0},"{'coming_soon': False, 'date': '4 Nov, 2015'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}"
7235,game,,396420,0.0,True,,,Spookeningは3Dの恐怖ゲームで、あなたは毎夜に死んでゴーストとして復活します。<b...,Spookeningは3Dの恐怖ゲームで、あなたは毎夜に死んでゴーストとして復活します。<b...,Spookeningは3Dの恐怖ゲームで、あなたは毎夜に死んでゴーストとして復活します。 村...,,,https://steamcdn-a.akamaihd.net/steam/apps/396...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],,,,,[''],,,,[],"{'windows': True, 'mac': False, 'linux': False}",,,,,,,,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2016'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}"
7350,game,none,398970,0.0,False,,,,,,,English<strong>*</strong><br><strong>*</strong...,https://steamcdn-a.akamaihd.net/steam/apps/398...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,,,,['none'],['none'],"[{'appid': 516340, 'description': ''}]",,,[],"{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '25', 'description': 'Adventure'}, {'i...",,,,"{'total': 35, 'highlighted': [{'name': ""They'v...","{'coming_soon': False, 'date': '5 Nov, 2015'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}"


We also have some duplicated rows, likely caused by errors or overlaps in our data collection process. As we know for certain that all AppIDs should be unique, we can safely remove these duplicates straight away.

In [10]:
raw_steam_data[raw_steam_data.duplicated()].head()

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,legal_notice,drm_notice,ext_user_account_notice,developers,publishers,demos,price_overview,packages,package_groups,platforms,metacritic,reviews,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
31,game,SiN Episodes: Emergence,1300,0.0,False,,,"You are John Blade, commander of HardCorps, an...","You are John Blade, commander of HardCorps, an...","You are John Blade, commander of HardCorps, an...",,"English, Russian, French",https://steamcdn-a.akamaihd.net/steam/apps/130...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,[],[],,,,['Ritual Entertainment'],['Ritual Entertainment'],,"{'currency': 'GBP', 'initial': 719, 'final': 7...",[70],"[{'name': 'default', 'title': 'Buy SiN Episode...","{'windows': True, 'mac': False, 'linux': False}","{'score': 75, 'url': 'https://www.metacritic.c...",,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 265},{'total': 0},"{'coming_soon': False, 'date': '10 May, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/130...,"{'ids': [], 'notes': None}"
32,game,SiN Episodes: Emergence,1300,0.0,False,,,"You are John Blade, commander of HardCorps, an...","You are John Blade, commander of HardCorps, an...","You are John Blade, commander of HardCorps, an...",,"English, Russian, French",https://steamcdn-a.akamaihd.net/steam/apps/130...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,[],[],,,,['Ritual Entertainment'],['Ritual Entertainment'],,"{'currency': 'GBP', 'initial': 719, 'final': 7...",[70],"[{'name': 'default', 'title': 'Buy SiN Episode...","{'windows': True, 'mac': False, 'linux': False}","{'score': 75, 'url': 'https://www.metacritic.c...",,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 265},{'total': 0},"{'coming_soon': False, 'date': '10 May, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/130...,"{'ids': [], 'notes': None}"
356,game,Jagged Alliance 2 Gold,1620,0.0,False,,,<p>The small country of Arulco has been taken ...,<p>The small country of Arulco has been taken ...,The small country of Arulco has been taken ove...,,English,https://steamcdn-a.akamaihd.net/steam/apps/162...,http://www.jaggedalliance2.com/,{'minimum': '<p><strong>Minimum Configuration:...,[],[],,,,['Strategy First'],['Strategy First'],,"{'currency': 'GBP', 'initial': 1499, 'final': ...",[94],"[{'name': 'default', 'title': 'Buy Jagged Alli...","{'windows': True, 'mac': False, 'linux': False}",,,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '2', 'description': 'Strategy'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,,{'total': 0},"{'coming_soon': False, 'date': '6 Jul, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/162...,"{'ids': [], 'notes': None}"
493,game,Crazy Machines 1.5,18430,0.0,False,,,Create Your Own Zany &quot;Rube Goldberg&quot;...,Create Your Own Zany &quot;Rube Goldberg&quot;...,Create Your Own Zany &quot;Rube Goldberg&quot;...,,English,https://steamcdn-a.akamaihd.net/steam/apps/184...,,"{'minimum': '<ul class=""bb_ul""><li><strong>OS:...",[],[],,,,['Fakt Software'],['Viva Media'],,"{'currency': 'GBP', 'initial': 699, 'final': 6...","[1242, 58401]","[{'name': 'default', 'title': 'Buy Crazy Machi...","{'windows': True, 'mac': False, 'linux': False}","{'score': 78, 'url': 'https://www.metacritic.c...",,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,,{'total': 0},"{'coming_soon': False, 'date': '12 Dec, 2008'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/184...,"{'ids': [], 'notes': None}"
494,game,Crazy Machines 1.5,18430,0.0,False,,,Create Your Own Zany &quot;Rube Goldberg&quot;...,Create Your Own Zany &quot;Rube Goldberg&quot;...,Create Your Own Zany &quot;Rube Goldberg&quot;...,,English,https://steamcdn-a.akamaihd.net/steam/apps/184...,,"{'minimum': '<ul class=""bb_ul""><li><strong>OS:...",[],[],,,,['Fakt Software'],['Viva Media'],,"{'currency': 'GBP', 'initial': 699, 'final': 6...","[1242, 58401]","[{'name': 'default', 'title': 'Buy Crazy Machi...","{'windows': True, 'mac': False, 'linux': False}","{'score': 78, 'url': 'https://www.metacritic.c...",,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,,{'total': 0},"{'coming_soon': False, 'date': '12 Dec, 2008'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/184...,"{'ids': [], 'notes': None}"
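
With no arguments, `duplicated()` only flags rows that match an earlier row in every column. Since AppIDs should be unique, a stricter check is to look for repeats in `steam_appid` alone. A minimal sketch on made-up rows (the `toy` frame is hypothetical; only the column name `steam_appid` comes from the data set):

```python
import pandas as pd

# Hypothetical rows: appid 20 is an exact repeat, appid 30 repeats with a different name
toy = pd.DataFrame({
    "steam_appid": [10, 20, 20, 30, 30],
    "name": ["A", "B", "B", "C", "C (updated)"],
})

# Rows identical to an earlier row in every column (what drop_duplicates() removes by default)
exact_dupes = toy.duplicated().sum()

# Rows repeating an earlier appid, even if other columns differ
appid_dupes = toy.duplicated(subset="steam_appid").sum()

print(exact_dupes, appid_dupes)
```

If the two counts differed on the real data, some AppIDs would appear with conflicting details and would need a closer look before dropping.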


Here we define and run our functions to handle everything we just looked at. We also define a general `process` function which will run all of our processing functions on the data set, allowing us to slowly add to it as we build out to processing more columns. Finally we run this function on our raw data, inspecting the first few rows and viewing how many rows and columns we have dropped.

In [11]:
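def process_null_cols(df, thresh=0.5):
    """Drop columns with more than a certain proportion of missing values (Default 50%)."""
    # dropna's thresh is the minimum number of non-null values a column needs to be kept,
    # so convert the allowed proportion of missing values into a required non-null count
    cutoff_count = int(np.ceil(len(df) * (1 - thresh)))
    
    return df.dropna(thresh=cutoff_count, axis=1)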


def drop_null_rows(df, col):
    """Drop rows with null values in a particular column."""
    return df[df[col].notnull()]


def process_type(df):
    """Remove rows with null values for type column, then drop the column."""
    df = drop_null_rows(df, 'type')
    df = df.drop('type', axis=1)
    
    return df
    
    
def process_name(df):
    """Remove rows with null values or 'none' in name column."""
    df = drop_null_rows(df, 'name')
    df = df[df['name'] != 'none']
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    
    return df

print(raw_steam_data.shape)
steam_data = process(raw_steam_data)
print(steam_data.shape)
steam_data.head()

(29235, 39)
(29075, 28)


Unnamed: 0,name,steam_appid,required_age,is_free,detailed_description,about_the_game,short_description,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,price_overview,packages,package_groups,platforms,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors
0,Counter-Strike,10,0.0,False,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,"English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],"{'currency': 'GBP', 'initial': 719, 'final': 7...",[7],"[{'name': 'default', 'title': 'Buy Counter-Str...","{'windows': True, 'mac': True, 'linux': True}","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'url': 'http://steamcommunity.com/app/10', 'e...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
1,Team Fortress Classic,20,0.0,False,One of the most popular online action games of...,One of the most popular online action games of...,One of the most popular online action games of...,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/20/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],"{'currency': 'GBP', 'initial': 399, 'final': 3...",[29],"[{'name': 'default', 'title': 'Buy Team Fortre...","{'windows': True, 'mac': True, 'linux': True}","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/20/...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
2,Day of Defeat,30,0.0,False,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,"English, French, German, Italian, Spanish - Spain",https://steamcdn-a.akamaihd.net/steam/apps/30/...,http://www.dayofdefeat.com/,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],"{'currency': 'GBP', 'initial': 399, 'final': 3...",[30],"[{'name': 'default', 'title': 'Buy Day of Defe...","{'windows': True, 'mac': True, 'linux': True}","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/30/...,"{'ids': [], 'notes': None}"
3,Deathmatch Classic,40,0.0,False,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/40/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],"{'currency': 'GBP', 'initial': 399, 'final': 3...",[31],"[{'name': 'default', 'title': 'Buy Deathmatch ...","{'windows': True, 'mac': True, 'linux': True}","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/40/...,"{'ids': [], 'notes': None}"
4,Half-Life: Opposing Force,50,0.0,False,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,"English, French, German, Korean",https://steamcdn-a.akamaihd.net/steam/apps/50/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Gearbox Software'],['Valve'],"{'currency': 'GBP', 'initial': 399, 'final': 3...",[32],"[{'name': 'default', 'title': 'Buy Half-Life: ...","{'windows': True, 'mac': True, 'linux': True}","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'url': 'https://help.steampowered.com', 'emai...",https://steamcdn-a.akamaihd.net/steam/apps/50/...,"{'ids': [], 'notes': None}"
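
As an aside, the same step-by-step structure could also be written with `DataFrame.pipe`, which passes the frame through each function in turn. This is only a sketch of an alternative, not what this notebook uses; the `drop_type` and `drop_unnamed` helpers and the toy data are hypothetical:

```python
import numpy as np
import pandas as pd


def drop_type(df):
    """Keep rows with a type, then drop the now-constant column."""
    return df[df["type"].notnull()].drop("type", axis=1)


def drop_unnamed(df):
    """Remove rows whose name is missing or the placeholder 'none'."""
    return df[df["name"].notnull() & (df["name"] != "none")]


# Hypothetical toy data with one untyped row, one 'none' name and one missing name
toy = pd.DataFrame({
    "type": ["game", np.nan, "game", "game"],
    "name": ["A", "B", "none", np.nan],
    "steam_appid": [10, 20, 30, 40],
})

# Each pipe call hands the current frame to the next cleaning step
cleaned = toy.drop_duplicates().pipe(drop_type).pipe(drop_unnamed)
print(cleaned.shape)
```

A single `process` function, as used here, has the advantage of keeping all the steps and their comments in one place as the pipeline grows.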


### Processing age

The next column we will look at is 'required_age'. The values are numeric (stored as floats rather than integers, as the decimal points in the output show) and range from 0 to 20, with one likely error (1818).

In [12]:
steam_data['required_age'].value_counts().sort_index()

0.0       28431
1.0           1
3.0          10
4.0           2
5.0           1
6.0           1
7.0           8
10.0          3
11.0          4
12.0         72
13.0         21
14.0          4
15.0         39
16.0        141
17.0         47
18.0        288
20.0          1
1818.0        1
Name: required_age, dtype: int64

Whilst fairly useful in its current state, we may benefit from reducing the number of categories that ages fall into. Instead of comparing games rated as 5, 6, 7 or 8, we could compare games rated 5+ or 8+, for example.

To decide which categories (or bins) we should use, we will look at the [PEGI age ratings](https://pegi.info/) as this is the system used in the United Kingdom, where we're performing our analysis. We can see that ratings fall into one of five categories (3, 7, 12, 16, 18), defining the minimum age required to buy a game.

Using this to inform our decision, we can use the [pandas.cut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html) function to sort our data into each of these categories. As our erroneous row (1818) is most likely meant to be rated 18 anyway, we can set our upper bound above this value to catch it inside this category.


In [13]:
def process_age(df):
    """Format ratings in age column to be in line with the PEGI Age Ratings system."""
    # PEGI Age ratings: 3, 7, 12, 16, 18
    cut_points = [-1, 3, 7, 12, 16, 2000]
    label_values = [3, 7, 12, 16, 18]
    
    df['required_age'] = pd.cut(df['required_age'], bins=cut_points, labels=label_values)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    
    return df


steam_data = process(raw_steam_data)
steam_data['required_age'].value_counts().sort_index()

3     28442
7        12
12       79
16      205
18      337
Name: required_age, dtype: int64
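
By default `pd.cut` builds intervals that are open on the left and closed on the right, so with these cut points a value between two PEGI ratings is rounded up to the next band (a 13 lands in 16, a 17 in 18), and 0 falls in the lowest band. A quick check on a few illustrative values (the `ages` series is made up):

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges
ages = pd.Series([0, 3, 4, 7, 13, 16, 17, 1818])

# Same cut points and labels as process_age: (-1, 3], (3, 7], (7, 12], (12, 16], (16, 2000]
binned = pd.cut(ages, bins=[-1, 3, 7, 12, 16, 2000], labels=[3, 7, 12, 16, 18])
print(binned.tolist())
```

Note that the result is a categorical column rather than a plain integer one, which is worth remembering if the values are later used in arithmetic.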

### Processing the platforms column

Whilst we could look at the next column in our dataframe, is_free, it would make sense that this is intrinsically linked to the price_overview column. Ultimately we may wish to combine these columns into one, where free games have a price of 0. Looking at the price_overview column, we can see it is stored in a dictionary-like structure, with multiple keys and values. Handling this may be quite tricky, so instead we'll look at a simpler example.

The platforms column appears to contain a key for each of the main operating systems - windows, mac and linux - and a corresponding boolean value, set to True or False depending on whether the game is available on that platform. This should be a reasonably straightforward place to start, and we can separate this data out into three columns, one for each platform, filled with boolean values.

In [14]:
steam_data['platforms'].head()

0    {'windows': True, 'mac': True, 'linux': True}
1    {'windows': True, 'mac': True, 'linux': True}
2    {'windows': True, 'mac': True, 'linux': True}
3    {'windows': True, 'mac': True, 'linux': True}
4    {'windows': True, 'mac': True, 'linux': True}
Name: platforms, dtype: object

So far the cleaning process has been relatively simple, requiring mainly checking for null values and dropping some rows or columns. Already we can see that handling the platforms will be a little more complex.

Our first hurdle is getting Python to recognise the data in the columns as dictionaries rather than just strings. This will allow us to access the different values separately, without having to do some unnecessarily complicated string formatting. As we can see below, even though the data looks like a dictionary it is in fact stored as a string.

In [15]:
print(type(steam_data['platforms'].iloc[0]))

steam_data['platforms'].iloc[0]

<class 'str'>


"{'windows': True, 'mac': True, 'linux': True}"

We can get around this using the handy [literal_eval](https://docs.python.org/3/library/ast.html#ast.literal_eval) function from the built-in `ast` module. As the name suggests, this will allow us to evaluate the string, and then index into the result as a dictionary.

In [16]:
print(type(literal_eval(steam_data['platforms'].iloc[0])))

literal_eval(steam_data['platforms'].iloc[0])['windows']

<class 'dict'>


True
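As a small aside on why `literal_eval` is a reasonable choice here, the sketch below (using made-up strings rather than our data) shows that it only accepts Python literals such as dictionaries, lists, numbers and booleans. Anything else, such as a function call, is rejected with a `ValueError` rather than being executed.

```python
from ast import literal_eval

# A string holding only Python literals is parsed into the real object
platforms = literal_eval("{'windows': True, 'mac': False, 'linux': False}")
print(platforms['mac'])

# Anything beyond a literal (here a function call) is rejected, not executed
try:
    literal_eval("len('abc')")
    rejected = False
except ValueError:
    rejected = True

print(rejected)
```

This is the main difference from the built-in `eval`, which would run whatever code the string contains.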

We also need to check for null values, but fortunately there aren't any in this column.

In [17]:
steam_data['platforms'].isnull().sum()

0

Putting this all together, we'll be using the pandas [Series.apply](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html) method to help us quickly evaluate all of the rows, then we'll be calling `apply` again for each platform to create our new columns.

We could return the True/False value directly and store the values as boolean types, but since we'll be exporting the cleaned data to a csv file, let's store them as integers as this should reduce the file size slightly. Setting True as 1 and False as 0 can still be interpreted as a boolean type, but less data is used to store the information.
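As a rough check of the file size claim, here is a minimal sketch on a toy column (the `flags` dataframe and the `csv_length` helper are made up for illustration). It writes the same values to csv as booleans and as integers and compares the number of characters produced.

```python
import io

import pandas as pd

# Toy column standing in for one platform flag
flags = pd.DataFrame({'windows': [True, False, True, True]})


def csv_length(df):
    """Return the number of characters df occupies when written as csv."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)
    return len(buffer.getvalue())


bool_size = csv_length(flags)
int_size = csv_length(flags.astype(int))
print(bool_size, int_size)

# The integers convert back to the original booleans without loss
round_trip = flags.astype(int).astype(bool)
print(round_trip.equals(flags))
```

Each `True` or `False` takes four or five characters in the file, whereas `1` or `0` takes one, so the saving grows with the number of rows.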

In [18]:
def process_platforms(df):
    """Split platforms column into separate boolean columns for each platform."""
    # evaluate values in platforms column, so can index into dictionaries
    df['platforms'] = df['platforms'].apply(literal_eval)
    
    # loop across keys, the platforms, which we'll turn into columns
    # (use iloc so this still works if the row labelled 0 has been dropped)
    for platform in df['platforms'].iloc[0].keys():
        # set 1 if value for platform in original column is True, or 0 if it is False
        df[platform] = df['platforms'].apply(lambda x: 1 if x[platform] else 0)
    
    # remove the original platforms column
    df = df.drop('platforms', axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    
    return df


steam_data = process(raw_steam_data)
steam_data[['name', 'windows', 'mac', 'linux']].head()

Unnamed: 0,name,windows,mac,linux
0,Counter-Strike,1,1,1
1,Team Fortress Classic,1,1,1
2,Day of Defeat,1,1,1
3,Deathmatch Classic,1,1,1
4,Half-Life: Opposing Force,1,1,1
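For comparison, the same split can be sketched without looping over the keys, by building a dataframe straight from the parsed dictionaries. This is an alternative to the function above rather than a replacement, shown on a toy dataframe (`toy` and its values are made up), and it assumes every row contains the same keys.

```python
from ast import literal_eval

import pandas as pd

# Toy data in the same string format as the platforms column
toy = pd.DataFrame({
    'name': ['Game A', 'Game B'],
    'platforms': [
        "{'windows': True, 'mac': True, 'linux': False}",
        "{'windows': True, 'mac': False, 'linux': False}",
    ],
})

# Parse each string into a dictionary
parsed = toy['platforms'].apply(literal_eval)

# One column per key, keeping the original index, stored as 1/0
expanded = pd.DataFrame(parsed.tolist(), index=toy.index).astype(int)

# Swap the original column for the new ones
result = toy.drop('platforms', axis=1).join(expanded)
print(result)
```

If a row were missing one of the keys, the corresponding cell would be null and the conversion to integers would fail, so that case would need handling first.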


### Processing price

Now we have built up some intuition around how to deal with the data stored as dictionaries, let's return to the `is_free` and `price_overview` columns as we should now be able to handle them.

First let's check how many null values there are in `price_overview`.

In [19]:
steam_data['price_overview'].isnull().sum()

3559

Whilst that looks like a lot, we have to consider the impact that the `is_free` column might be having. Before jumping to conclusions, let's check if there are any rows with `is_free` marked as True and null values in the `price_overview` column.

In [20]:
free_and_null_price = steam_data[(steam_data['is_free']) & (steam_data['price_overview'].isnull())]

print(free_and_null_price.shape[0])
free_and_null_price.head()

2713


Unnamed: 0,name,steam_appid,required_age,is_free,detailed_description,about_the_game,short_description,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,price_overview,packages,package_groups,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors,windows,mac,linux
14,Half-Life 2: Lost Coast,340,3,True,Originally planned as a section of the Highway...,Originally planned as a section of the Highway...,Originally planned as a section of the Highway...,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/340...,http://www.half-life2.com,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,[],[],['Valve'],['Valve'],,,[],"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '27 Oct, 2005'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/340...,"{'ids': [], 'notes': None}",1,1,1
19,Team Fortress 2,440,3,True,"<h1>The Jungle Inferno Update</h1><p><a href=""...","<p><strong>""The most fun you can have online""<...",Nine distinct classes provide a broad range of...,"English<strong>*</strong>, Danish, Dutch, Finn...",https://steamcdn-a.akamaihd.net/steam/apps/440...,http://www.teamfortress.com/,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,['Valve'],['Valve'],,"[197845, 330198, 469]","[{'name': 'default', 'title': 'Buy Team Fortre...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256698790, 'name': 'Jungle Inferno', '...","{'total': 520, 'highlighted': [{'name': 'Head ...","{'coming_soon': False, 'date': '10 Oct, 2007'}","{'url': 'http://steamcommunity.com/app/440', '...",https://steamcdn-a.akamaihd.net/steam/apps/440...,"{'ids': [2, 5], 'notes': 'Includes cartoon vio...",1,1,1
22,Dota 2,570,3,True,<strong>The most-played game on Steam.</strong...,<strong>The most-played game on Steam.</strong...,"Every day, millions of players worldwide enter...","Bulgarian, Czech, Danish, Dutch, English<stron...",https://steamcdn-a.akamaihd.net/steam/apps/570...,http://www.dota2.com/,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,['Valve'],['Valve'],,"[197846, 330209]","[{'name': 'default', 'title': 'Buy Dota 2', 'd...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256692021, 'name': 'Dota 2 - Join the ...",,"{'coming_soon': False, 'date': '9 Jul, 2013'}","{'url': 'http://dev.dota2.com/', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/570...,"{'ids': [], 'notes': None}",1,1,1
24,Alien Swarm,630,3,True,Alien Swarm is a game and Source SDK release f...,Alien Swarm is a game and Source SDK release f...,Co-operative multiplayer game and complete cod...,English,https://steamcdn-a.akamaihd.net/steam/apps/630...,http://www.alienswarm.com,{'minimum': '<strong>Minimum:</strong><br>\t\t...,[],[],['Valve'],['Valve'],,,[],"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,"{'total': 66, 'highlighted': [{'name': 'Clear ...","{'coming_soon': False, 'date': '19 Jul, 2010'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/630...,"{'ids': [], 'notes': None}",1,0,0
25,Counter-Strike: Global Offensive,730,3,True,Counter-Strike: Global Offensive (CS: GO) expa...,Counter-Strike: Global Offensive (CS: GO) expa...,Counter-Strike: Global Offensive (CS: GO) expa...,"Czech, Danish, Dutch, English<strong>*</strong...",https://steamcdn-a.akamaihd.net/steam/apps/730...,http://blog.counter-strike.net/,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,"['Valve', 'Hidden Path Entertainment']",['Valve'],,"[329385, 298963, 54029]","[{'name': 'default', 'title': 'Buy Counter-Str...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 81958, 'name': 'CS:GO Trailer Long', '...","{'total': 167, 'highlighted': [{'name': 'Someo...","{'coming_soon': False, 'date': '21 Aug, 2012'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/730...,"{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1


It turns out this accounts for most of the null values in the `price_overview` column, and we can handle these by setting the final price to 0. That leaves 846 rows (3559 - 2713) which aren't free but still have null values in the `price_overview` column. Let's investigate those.
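Before looking at those rows, note that the split between free and non-free rows with a missing price can also be read from a single cross-tabulation. The sketch below uses a small made-up dataframe (`toy`) rather than our data, so the counts are purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy data: two free games without a price, one paid game without a price,
# and two paid games with a price
toy = pd.DataFrame({
    'is_free': [True, True, False, False, False],
    'price_overview': [np.nan, np.nan, np.nan,
                       "{'initial': 719}", "{'initial': 399}"],
})

# Rows are is_free, columns are whether price_overview is null
counts = pd.crosstab(toy['is_free'], toy['price_overview'].isnull(),
                     rownames=['is_free'], colnames=['price_is_null'])
print(counts)

# The arithmetic for our real counts: total nulls minus free-and-null rows
print(3559 - 2713)
```

Running the same `pd.crosstab` call on `steam_data` should give all four counts at once.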

In [21]:
not_free_and_null_price = steam_data[~steam_data['is_free'] & steam_data['price_overview'].isnull()]

not_free_and_null_price.head()

Unnamed: 0,name,steam_appid,required_age,is_free,detailed_description,about_the_game,short_description,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,price_overview,packages,package_groups,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors,windows,mac,linux
63,The Ship: Single Player,2420,3,False,For PC gamers who enjoy multiplayer games with...,For PC gamers who enjoy multiplayer games with...,The Ship is a murder mystery alternative to tr...,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/242...,http://www.blazinggriffin.com/games/the-ship-m...,{'minimum': '<strong>Minimum:</strong> 1.8 GHz...,[],[],['Outerlight Ltd.'],['Blazing Griffin Ltd.'],,[56669],"[{'name': 'default', 'title': 'Buy The Ship: S...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 2035597, 'name': 'the Ship: Intro', '...",{'total': 0},"{'coming_soon': False, 'date': '20 Nov, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/242...,"{'ids': [], 'notes': None}",1,0,0
75,RollerCoaster Tycoon® 3: Platinum,2700,3,False,Rollercoaster Tycoon 3 Platinum combines the e...,Rollercoaster Tycoon 3 Platinum combines the e...,Rollercoaster Tycoon 3 Platinum combines the e...,"English, French, Italian, German, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/270...,http://www.atari.com/rollercoastertycoon/us/in...,{'minimum': '<strong>Minimum: </strong><br>\t\...,"{'minimum': '<ul class=""bb_ul""><li><strong>OS:...",[],"['Frontier', 'Aspyr (Mac)']","['Atari', 'Aspyr (Mac)']",,,[],"[{'id': 2, 'description': 'Single-player'}]","[{'id': '28', 'description': 'Simulation'}, {'...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '12 Mar, 2008'}","{'url': 'http://www.atari.com/support/atari', ...",https://steamcdn-a.akamaihd.net/steam/apps/270...,"{'ids': [], 'notes': None}",1,1,0
220,BioShock™,7670,3,False,<h1>Special Offer</h1><p>Buying BioShock™ also...,BioShock is a shooter unlike any you've ever p...,BioShock is a shooter unlike any you've ever p...,"English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/767...,http://www.BioShockGame.com,"{'minimum': '<h2 class=""bb_tag""><strong>Minimu...",{'minimum': 'Please See BioShock Remastered'},[],"['2K Boston', '2K Australia']",['2K'],,"[451, 127633]","[{'name': 'default', 'title': 'Buy BioShock™',...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '21 Aug, 2007'}","{'url': 'support.2k.com', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/767...,"{'ids': [], 'notes': None}",1,0,0
234,Sam & Max 101: Culture Shock,8200,3,False,<strong>Sam &amp; Max: Episode 1 - Culture Sho...,<strong>Sam &amp; Max: Episode 1 - Culture Sho...,Sam &amp; Max: Episode 1 - Culture Shock - The...,"English, French, German, Italian",https://steamcdn-a.akamaihd.net/steam/apps/820...,http://store.steampowered.com/app/901660/,"{'minimum': 'Windows XP or Vista, 1.5GHz proce...",[],[],['Telltale Games'],['Telltale Games'],,"[357, 539]","[{'name': 'default', 'title': 'Buy Sam & Max 1...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '15 Jun, 2007'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/820...,"{'ids': [], 'notes': None}",1,0,0
235,Sam & Max 102: Situation: Comedy,8210,3,False,<strong>Sam &amp; Max: Episode 2 - Situation: ...,<strong>Sam &amp; Max: Episode 2 - Situation: ...,Sam &amp; Max: Episode 2 - Situation: Comedy -...,"English, German, Italian",https://steamcdn-a.akamaihd.net/steam/apps/821...,http://store.steampowered.com/app/901660/,"{'minimum': 'Windows XP or Vista, 1.5GHz proce...",[],[],['Telltale Games'],['Telltale Games'],,"[358, 539]","[{'name': 'default', 'title': 'Buy Sam & Max 1...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '15 Jun, 2007'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/821...,"{'ids': [], 'notes': None}",1,0,0


The first few rows contain big, well-known games which appear to have pretty complete data. It looks like we can rule out data errors, so let's dig a little deeper and see if we can find out what is going on.

We'll start by looking at the store pages for some of these titles. The url to an app on the steam website follows this structure:

    https://store.steampowered.com/app/[steam_appid]

This means we can easily generate these links using our above filter. We'll wrap it up in a function in case we want to use it later.

In [22]:
def print_steam_links(df):
    """Print links to store page for apps in a dataframe."""
    url_base = "https://store.steampowered.com/app/"
    
    for i, row in df.iterrows():
        appid = row['steam_appid']
        name = row['name']
        
        print(name + ':', url_base + str(appid))
        

print_steam_links(not_free_and_null_price[:5])

The Ship: Single Player: https://store.steampowered.com/app/2420
RollerCoaster Tycoon® 3: Platinum: https://store.steampowered.com/app/2700
BioShock™: https://store.steampowered.com/app/7670
Sam & Max 101: Culture Shock: https://store.steampowered.com/app/8200
Sam & Max 102: Situation: Comedy: https://store.steampowered.com/app/8210


For these games we can conclude that:

- The Ship: Single Player is a tutorial, and comes as part of The Ship: Murder Party
- RollerCoaster Tycoon 3: Platinum has been removed from steam (and from another game website, GOG)  
  - "A spokesperson for GOG told Eurogamer it pulled the game 'due to expiring licensing rights', and stressed it'll talk with 'new distribution rights holders' to bring the game back as soon as possible." Source: [Eurogamer](https://www.eurogamer.net/articles/2018-05-09-rollercoaster-tycoon-3-pulled-from-steam-gog)
- BioShock has been replaced by BioShock Remastered
- Sam & Max 101 is sold as part of a season, and this can be found in the `package_groups` column

So we have a few options here. We could just drop these rows, we could try to figure out the price based on the `package_groups` column, or we could leave them for now and return to them later, which is what we will do. It may be that some or all of these rows are removed later in the cleaning process for other reasons.

Below we can view the games with similar names to the games we investigated, to help get an idea of what is happening.

In [23]:
steam_data[steam_data['name'].str.contains("The Ship:")]

Unnamed: 0,name,steam_appid,required_age,is_free,detailed_description,about_the_game,short_description,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,price_overview,packages,package_groups,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors,windows,mac,linux
62,The Ship: Murder Party,2400,3,True,<h1>Finding a Server</h1><p><strong>Ahoy Shipm...,"<strong>This package includes a tutorial, The ...",The Ship is a murder mystery multiplayer.,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/240...,http://www.blazinggriffin.com/games/the-ship-m...,{'minimum': '<strong>Minimum:</strong> 1.8 GHz...,[],[],['Outerlight Ltd.'],['Blazing Griffin Ltd.'],"{'currency': 'GBP', 'initial': 699, 'final': 6...",[56669],"[{'name': 'default', 'title': 'Buy The Ship: M...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 2034912, 'name': 'Single Player Intro'...",{'total': 0},"{'coming_soon': False, 'date': '11 Jul, 2006'}","{'url': 'http://www.blazinggriffin.com/', 'ema...",https://steamcdn-a.akamaihd.net/steam/apps/240...,"{'ids': [], 'notes': None}",1,0,0
63,The Ship: Single Player,2420,3,False,For PC gamers who enjoy multiplayer games with...,For PC gamers who enjoy multiplayer games with...,The Ship is a murder mystery alternative to tr...,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/242...,http://www.blazinggriffin.com/games/the-ship-m...,{'minimum': '<strong>Minimum:</strong> 1.8 GHz...,[],[],['Outerlight Ltd.'],['Blazing Griffin Ltd.'],,[56669],"[{'name': 'default', 'title': 'Buy The Ship: S...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 2035597, 'name': 'the Ship: Intro', '...",{'total': 0},"{'coming_soon': False, 'date': '20 Nov, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/242...,"{'ids': [], 'notes': None}",1,0,0
6722,The Ship: Remasted,383790,3,False,<h1>Now Includes World Leaders!</h1><p>Not onl...,The Ship: Remasted is a remake of the classic ...,You find yourself aboard a series of luxury 19...,English<strong>*</strong><br><strong>*</strong...,https://steamcdn-a.akamaihd.net/steam/apps/383...,http://www.blazinggriffin.com/games/the-ship-r...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,['Blazing Griffin'],['Blazing Griffin'],"{'currency': 'GBP', 'initial': 699, 'final': 6...",[253227],"[{'name': 'default', 'title': 'Buy The Ship: R...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256673834, 'name': 'All Aboard!', 'thu...","{'total': 22, 'highlighted': [{'name': 'Gone o...","{'coming_soon': False, 'date': '31 Oct, 2016'}","{'url': 'http://www.blazinggriffin.com/', 'ema...",https://steamcdn-a.akamaihd.net/steam/apps/383...,"{'ids': [], 'notes': None}",1,1,1


In [24]:
steam_data[steam_data['name'].str.contains("BioShock™")]

Unnamed: 0,name,steam_appid,required_age,is_free,detailed_description,about_the_game,short_description,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,price_overview,packages,package_groups,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors,windows,mac,linux
220,BioShock™,7670,3,False,<h1>Special Offer</h1><p>Buying BioShock™ also...,BioShock is a shooter unlike any you've ever p...,BioShock is a shooter unlike any you've ever p...,"English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/767...,http://www.BioShockGame.com,"{'minimum': '<h2 class=""bb_tag""><strong>Minimu...",{'minimum': 'Please See BioShock Remastered'},[],"['2K Boston', '2K Australia']",['2K'],,"[451, 127633]","[{'name': 'default', 'title': 'Buy BioShock™',...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '21 Aug, 2007'}","{'url': 'support.2k.com', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/767...,"{'ids': [], 'notes': None}",1,0,0
7734,BioShock™ Remastered,409710,18,False,<h1>Special Offer</h1><p>Buying BioShock™ Rema...,BioShock is a shooter unlike any you've ever p...,"BioShock is a shooter unlike any other, loaded...","English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/409...,http://www.BioShockGame.com,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,"['2K Boston', '2K Australia', 'Blind Squirrel'...","['2K', 'Feral Interactive (Mac)']","{'currency': 'GBP', 'initial': 999, 'final': 9...","[451, 127633]","[{'name': 'default', 'title': 'Buy BioShock™ R...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,"{'total': 65, 'highlighted': [{'name': 'Comple...","{'coming_soon': False, 'date': '15 Sep, 2016'}","{'url': 'support.2k.com', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/409...,"{'ids': [], 'notes': None}",1,1,0
7735,BioShock™ 2 Remastered,409720,18,False,<h1>Special Offer</h1><p>Buying BioShock 2™ Re...,BioShock 2 provides players with the perfect b...,"In BioShock 2, you step into the boots of the ...","English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/409...,http://www.bioshockgame.com/,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,"['2K Marin', '2K China', 'Digital Extremes', '...",['2K'],"{'currency': 'GBP', 'initial': 1399, 'final': ...","[81419, 127633]","[{'name': 'default', 'title': 'Buy BioShock™ 2...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,"{'total': 53, 'highlighted': [{'name': ""Daddy'...","{'coming_soon': False, 'date': '15 Sep, 2016'}","{'url': 'support.2k.com', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/409...,"{'ids': [5], 'notes': None}",1,0,0


In [25]:
steam_data[steam_data['name'].str.contains("Sam & Max 1")]

Unnamed: 0,name,steam_appid,required_age,is_free,detailed_description,about_the_game,short_description,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,price_overview,packages,package_groups,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors,windows,mac,linux
234,Sam & Max 101: Culture Shock,8200,3,False,<strong>Sam &amp; Max: Episode 1 - Culture Sho...,<strong>Sam &amp; Max: Episode 1 - Culture Sho...,Sam &amp; Max: Episode 1 - Culture Shock - The...,"English, French, German, Italian",https://steamcdn-a.akamaihd.net/steam/apps/820...,http://store.steampowered.com/app/901660/,"{'minimum': 'Windows XP or Vista, 1.5GHz proce...",[],[],['Telltale Games'],['Telltale Games'],,"[357, 539]","[{'name': 'default', 'title': 'Buy Sam & Max 1...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '15 Jun, 2007'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/820...,"{'ids': [], 'notes': None}",1,0,0
235,Sam & Max 102: Situation: Comedy,8210,3,False,<strong>Sam &amp; Max: Episode 2 - Situation: ...,<strong>Sam &amp; Max: Episode 2 - Situation: ...,Sam &amp; Max: Episode 2 - Situation: Comedy -...,"English, German, Italian",https://steamcdn-a.akamaihd.net/steam/apps/821...,http://store.steampowered.com/app/901660/,"{'minimum': 'Windows XP or Vista, 1.5GHz proce...",[],[],['Telltale Games'],['Telltale Games'],,"[358, 539]","[{'name': 'default', 'title': 'Buy Sam & Max 1...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '15 Jun, 2007'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/821...,"{'ids': [], 'notes': None}",1,0,0
236,"Sam & Max 103: The Mole, the Mob and the Meatball",8220,3,False,"<strong>Sam &amp; Max Episode 3 - The Mole, Th...","<strong>Sam &amp; Max Episode 3 - The Mole, Th...","Sam &amp; Max Episode 3 - The Mole, The Mob, a...","English, French, German, Italian",https://steamcdn-a.akamaihd.net/steam/apps/822...,http://store.steampowered.com/app/901660/,"{'minimum': 'Windows XP or Vista, 1.5GHz proce...",[],[],['Telltale Games'],['Telltale Games'],,"[359, 539]","[{'name': 'default', 'title': 'Buy Sam & Max 1...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '15 Jun, 2007'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/822...,"{'ids': [], 'notes': None}",1,0,0
237,Sam & Max 104: Abe Lincoln Must Die!,8230,3,False,<strong>Sam &amp; Max Episode 4 - Abe Lincoln ...,<strong>Sam &amp; Max Episode 4 - Abe Lincoln ...,Sam &amp; Max Episode 4 - Abe Lincoln Must Die...,"English, French, German, Italian",https://steamcdn-a.akamaihd.net/steam/apps/823...,http://store.steampowered.com/app/901660/,"{'minimum': 'Windows XP or Vista, 1.5GHz proce...",[],[],['Telltale Games'],['Telltale Games'],,"[360, 539]","[{'name': 'default', 'title': 'Buy Sam & Max 1...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '15 Jun, 2007'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/823...,"{'ids': [], 'notes': None}",1,0,0
238,Sam & Max 105: Reality 2.0,8240,3,False,With an internet crisis looming and a viral vi...,With an internet crisis looming and a viral vi...,With an internet crisis looming and a viral vi...,"English, French, German, Italian",https://steamcdn-a.akamaihd.net/steam/apps/824...,http://store.steampowered.com/app/901660/,"{'minimum': 'Windows XP or Vista, 1.5GHz proce...",[],[],['Telltale Games'],['Telltale Games'],,"[361, 539]","[{'name': 'default', 'title': 'Buy Sam & Max 1...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '15 Jun, 2007'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/824...,"{'ids': [], 'notes': None}",1,0,0
239,Sam & Max 106: Bright Side of the Moon,8250,3,False,<strong>Sam &amp; Max: Episode 6 - Bright Side...,<strong>Sam &amp; Max: Episode 6 - Bright Side...,Sam &amp; Max: Episode 6 - Bright Side of the ...,"English, French, German, Italian",https://steamcdn-a.akamaihd.net/steam/apps/825...,http://store.steampowered.com/app/901660/,"{'minimum': 'Windows XP or Vista, 1.5GHz proce...",[],[],['Telltale Games'],['Telltale Games'],,"[362, 539]","[{'name': 'default', 'title': 'Buy Sam & Max 1...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '15 Jun, 2007'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/825...,"{'ids': [], 'notes': None}",1,0,0


Finally, if we take a look at the data for the first row, we can see that the price is stored in a variety of formats. We have a `currency`, which is GBP, perfect as we are performing our analysis in the UK. Next we have a number of different values for the price, so which one do we use?

In [26]:
steam_data['price_overview'][0]

"{'currency': 'GBP', 'initial': 719, 'final': 719, 'discount_percent': 0, 'initial_formatted': '', 'final_formatted': '£7.19'}"

If we inspect another row, we see that there is an active discount, taking 80% off the price of the title. It looks like `initial` contains the normal price before discount, and `final` contains the discounted price. `initial_formatted` and `final_formatted` contain the price displayed in the currency. We don't have to worry about these, as we'll be storing the price as an integer (or float) and, if we really wanted, could format it like this when printing.

With all this in mind, it looks like we'll be checking the value under the `currency` key, and using the value under the `initial` key.
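As a quick check of this reading of the keys, the sketch below parses the discounted entry (row 37 of `price_overview`) and recomputes `final` from `initial` and `discount_percent`. Rounding down to a whole penny is an inference from this single row, not something the API documents, and the formatting line is just one way to rebuild the display string.

```python
from ast import literal_eval

# The discounted entry, copied from row 37 of price_overview
raw = ("{'currency': 'GBP', 'initial': 2299, 'final': 459, "
       "'discount_percent': 80, 'initial_formatted': '£22.99', "
       "'final_formatted': '£4.59'}")
price = literal_eval(raw)

# Apply the discount using integer arithmetic on pence (rounds down)
expected_final = price['initial'] * (100 - price['discount_percent']) // 100
print(expected_final, price['final'])

# Rebuild the display string from the integer value if we ever need it
formatted = f"£{price['initial'] / 100:.2f}"
print(formatted, price['initial_formatted'])
```

Both comparisons agree for this row, which supports treating `initial` as the undiscounted price in pence.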

In [27]:
steam_data['price_overview'][37]

"{'currency': 'GBP', 'initial': 2299, 'final': 459, 'discount_percent': 80, 'initial_formatted': '£22.99', 'final_formatted': '£4.59'}"

Now the preliminary investigation is complete, we can begin defining our function.

We start by evaluating the strings using `literal_eval` as before. If a value is null, `literal_eval` raises a `ValueError`, which the try/except block catches, and we return a properly formatted dictionary with -1 for the `initial` value. This will allow us to fill in a value of 0 for free games, and then be left with an easily targetable value for the remaining null rows.
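To see which inputs the try/except block actually covers, here is a small sketch using a hypothetical helper, `error_name`, which reports the exception that `literal_eval` raises for a given value. The inputs are made up for illustration.

```python
from ast import literal_eval


def error_name(value):
    """Return the name of the exception literal_eval raises, or None."""
    try:
        literal_eval(value)
    except Exception as exc:
        return type(exc).__name__
    return None


valid = error_name("{'initial': 719}")     # a well-formed dictionary string
missing = error_name(float('nan'))         # a missing value, as pandas stores it
truncated = error_name("{'initial': 719")  # a string cut off mid-dictionary

print(valid, missing, truncated)
```

Missing values raise a `ValueError`, so catching that exception handles them. A malformed string would raise a `SyntaxError` instead, which would not be caught and would surface as an error, a reasonable outcome since it would indicate a problem with the downloaded data.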

In [28]:
def process_price(df):
    """Create currency and price columns from price_overview (first draft)."""
    df = df.copy()
        
    def parse_price(x):
        try:
            return literal_eval(x)
        except ValueError:
            return {'currency': 'GBP', 'initial': -1}
    
    # evaluate as dictionary and set to -1 if missing
    df['price_overview'] = df['price_overview'].apply(parse_price)
    
    # Create columns from currency and initial values
    df['currency'] = df['price_overview'].apply(lambda x: x['currency'])
    df['price'] = df['price_overview'].apply(lambda x: x['initial'])
    
    # Set price of free games to 0
    df.loc[df['is_free'], 'price'] = 0
    
    return df

price_data = process_price(steam_data)[['name', 'currency', 'price']]
price_data.head()

Unnamed: 0,name,currency,price
0,Counter-Strike,GBP,719
1,Team Fortress Classic,GBP,399
2,Day of Defeat,GBP,399
3,Deathmatch Classic,GBP,399
4,Half-Life: Opposing Force,GBP,399


We're almost finished here, but let's check if any games don't have GBP listed as the currency.

In [29]:
price_data[price_data['currency'] != 'GBP']

Unnamed: 0,name,currency,price
991,Robin Hood: The Legend of Sherwood,USD,799
5767,Assassin’s Creed® Chronicles: India,EUR,999
27593,Mortal Kombat 11,USD,5999
27995,Pagan Online,EUR,2699


For some reason we have four games listed in either USD or EUR. We could use the current exchange rate to try and convert them into GBP, however as there are only four rows we will simply drop them.

We will also divide prices by 100 so they are displayed as floats in pounds.
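For reference, the conversion option we are not taking might look something like the sketch below. The function `convert_to_gbp` and the rates in `HYPOTHETICAL_RATES_TO_GBP` are placeholders made up for illustration, not real exchange rates, and the toy dataframe is not our data.

```python
import pandas as pd

# Placeholder rates for illustration only; these are NOT real exchange rates
HYPOTHETICAL_RATES_TO_GBP = {'GBP': 1.0, 'USD': 0.5, 'EUR': 0.5}


def convert_to_gbp(df, rates):
    """Return a copy of df with prices (in minor units) converted to GBP."""
    df = df.copy()

    # Look up the conversion factor for each row's currency
    factor = df['currency'].map(rates)

    # Convert and round to the nearest whole penny
    df['price'] = (df['price'] * factor).round().astype(int)
    df['currency'] = 'GBP'

    return df


toy = pd.DataFrame({
    'name': ['Game A', 'Game B'],
    'currency': ['GBP', 'USD'],
    'price': [719, 800],
})

converted = convert_to_gbp(toy, HYPOTHETICAL_RATES_TO_GBP)
print(converted)
```

A currency missing from the rates dictionary would produce a null factor and make the integer conversion fail, so a real version would need to decide how to treat unknown currencies.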

In [30]:
def process_price(df):
    """Process price_overview column into formatted price column."""
    df = df.copy()
    
    def parse_price(x):
        try:
            return literal_eval(x)
        except ValueError:
            return {'currency': 'GBP', 'initial': -1}
    
    # evaluate as dictionary and set to -1 if missing
    df['price_overview'] = df['price_overview'].apply(parse_price)
    
    # create columns from currency and initial values
    df['currency'] = df['price_overview'].apply(lambda x: x['currency'])
    df['price'] = df['price_overview'].apply(lambda x: x['initial'])
    
    # set price of free games to 0
    df.loc[df['is_free'], 'price'] = 0
    
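    # remove non-GBP rows (copy so the assignments below act on an independent frame)
    df = df[df['currency'] == 'GBP'].copy()
    
    # change price to display in pounds (only applying to rows with a value greater than 0)
    df['price'] = df['price'].astype(float)
    df.loc[df['price'] > 0, 'price'] /= 100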
    
    # remove columns no longer needed
    df = df.drop(['is_free', 'currency', 'price_overview'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    
    return df


steam_data = process(raw_steam_data)
steam_data[['name', 'price']].head()

Unnamed: 0,name,price
0,Counter-Strike,7.19
1,Team Fortress Classic,3.99
2,Day of Defeat,3.99
3,Deathmatch Classic,3.99
4,Half-Life: Opposing Force,3.99
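
To make the behaviour of the `parse_price` helper more concrete, here is a small standalone sketch. The raw string is an illustrative stand-in for a `price_overview` entry (real entries may contain more keys), and the missing value is represented by NaN, which is how pandas typically stores an empty cell.

```python
from ast import literal_eval


def parse_price(x):
    """Same logic as the helper nested inside process_price."""
    try:
        return literal_eval(x)
    except ValueError:
        # literal_eval raises ValueError when given a non-string such as NaN
        return {'currency': 'GBP', 'initial': -1}


# ASSUMPTION: illustrative values, not copied from the raw data
raw_value = "{'currency': 'GBP', 'initial': 719}"
missing_value = float('nan')

print(parse_price(raw_value))
print(parse_price(missing_value))
```

Note that a string which is not valid Python syntax would raise a `SyntaxError` instead, which this helper does not catch. That does not appear to be an issue for this data set, but it is worth keeping in mind if the function is reused elsewhere.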


### Processing Description Columns

Next we have a series of columns with descriptive text about each game: `detailed_description`, `about_the_game` and `short_description`. These columns could be used as the basis for an interesting recommender or keyword analysis project. However, they are not required in our current project and should be removed from our final data set as they take up large amounts of space.

In case we find some anomalies, let's inspect these columns anyway.

In [31]:
steam_data[['detailed_description', 'about_the_game', 'short_description']].isnull().sum()

detailed_description    24
about_the_game          24
short_description       24
dtype: int64

It looks like we have 24 rows with missing data for these columns, and chances are the 24 rows with missing `detailed_description` are the same rows with missing `about_the_game` and `short_description` data.

By inspecting the individual rows below, we can see that this is true: all rows with missing data in one description column have missing data in the others too.

In [34]:
steam_data[steam_data['detailed_description'].isnull()]

Unnamed: 0,name,steam_appid,required_age,detailed_description,about_the_game,short_description,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,packages,package_groups,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors,windows,mac,linux,price
97,Bejeweled 2 Deluxe,3300,3,,,,"English, French, German, Italian, Spanish - Spain",https://steamcdn-a.akamaihd.net/steam/apps/330...,,{'minimum': '<p><strong>Minimum Requirements:<...,{'minimum': '<ul>\n\t<li><strong>OS:</strong> ...,[],"['PopCap Games, Inc.']","['PopCap Games, Inc.']","[121, 1160]","[{'name': 'default', 'title': 'Buy Bejeweled 2...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/330...,"{'ids': [], 'notes': None}",1,1,0,4.25
98,Chuzzle Deluxe,3310,3,,,,"English, French, German, Italian, Spanish - Spain",https://steamcdn-a.akamaihd.net/steam/apps/331...,,{'minimum': '<p><strong>Minimum Requirements:<...,{'minimum': '<ul>\n\t<li><strong>OS:</strong> ...,[],"['PopCap Games, Inc.']","['PopCap Games, Inc.']",[126],"[{'name': 'default', 'title': 'Buy Chuzzle Del...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/331...,"{'ids': [], 'notes': None}",1,1,0,4.25
99,Insaniquarium Deluxe,3320,3,,,,English,https://steamcdn-a.akamaihd.net/steam/apps/332...,,{'minimum': '<strong>Minimum Requirements:</st...,[],[],"['PopCap Games, Inc.']","['PopCap Games, Inc.']","[127, 1160]","[{'name': 'default', 'title': 'Buy Insaniquari...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/332...,"{'ids': [], 'notes': None}",1,0,0,4.25
101,AstroPop Deluxe,3340,3,,,,English,https://steamcdn-a.akamaihd.net/steam/apps/334...,,{'minimum': '<p><strong>Minimum Requirements:<...,[],[],"['PopCap Games, Inc.']","['PopCap Games, Inc.']",[132],"[{'name': 'default', 'title': 'Buy AstroPop De...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/334...,"{'ids': [], 'notes': None}",1,0,0,4.25
102,Bejeweled Deluxe,3350,3,,,,English,https://steamcdn-a.akamaihd.net/steam/apps/335...,,{'minimum': '<p><strong>Minimum Requirements:<...,[],[],"['PopCap Games, Inc.']","['PopCap Games, Inc.']",[122],"[{'name': 'default', 'title': 'Buy Bejeweled D...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/335...,"{'ids': [], 'notes': None}",1,0,0,4.25
103,Big Money! Deluxe,3360,3,,,,English,https://steamcdn-a.akamaihd.net/steam/apps/336...,,{'minimum': '<p><strong>Minimum Requirements:<...,[],[],"['PopCap Games, Inc.']","['PopCap Games, Inc.']",[136],"[{'name': 'default', 'title': 'Buy Big Money! ...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/336...,"{'ids': [], 'notes': None}",1,0,0,4.25
104,Dynomite Deluxe,3380,3,,,,English,https://steamcdn-a.akamaihd.net/steam/apps/338...,,{'minimum': '<p><strong>Minimum Requirements:<...,[],[],"['PopCap Games, Inc.']","['PopCap Games, Inc.']",[128],"[{'name': 'default', 'title': 'Buy Dynomite De...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/338...,"{'ids': [], 'notes': None}",1,0,0,4.25
105,Feeding Frenzy 2 Deluxe,3390,3,,,,English,https://steamcdn-a.akamaihd.net/steam/apps/339...,,{'minimum': '<p><strong>Minimum Requirements:<...,[],[],"['PopCap Games, Inc.']","['PopCap Games, Inc.']",[124],"[{'name': 'default', 'title': 'Buy Feeding Fre...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/339...,"{'ids': [], 'notes': None}",1,0,0,4.25
106,Hammer Heads Deluxe,3400,3,,,,English,https://steamcdn-a.akamaihd.net/steam/apps/340...,,{'minimum': '<p><strong>Minimum Requirements:<...,[],[],"['PopCap Games, Inc.']","['PopCap Games, Inc.']",[135],"[{'name': 'default', 'title': 'Buy Hammer Head...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/340...,"{'ids': [], 'notes': None}",1,0,0,4.25
108,Iggle Pop Deluxe,3420,3,,,,English,https://steamcdn-a.akamaihd.net/steam/apps/342...,,{'minimum': '<p><strong>Minimum Requirements:<...,[],[],"['PopCap Games, Inc.']","['PopCap Games, Inc.']",[129],"[{'name': 'default', 'title': 'Buy Iggle Pop D...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/342...,"{'ids': [], 'notes': None}",1,0,0,4.25
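
Rather than relying on visual inspection alone, the claim that the missing values line up across all three columns could also be checked programmatically. The sketch below uses a toy dataframe (`toy`) in place of `steam_data`; the approach itself is just one possible way of doing it.

```python
import pandas as pd

desc_cols = ['detailed_description', 'about_the_game', 'short_description']

# Toy stand-in for steam_data: the second row is missing all three descriptions
toy = pd.DataFrame({
    'detailed_description': ['text', None, 'text'],
    'about_the_game': ['text', None, 'text'],
    'short_description': ['text', None, 'text'],
})

null_mask = toy[desc_cols].isnull()
any_missing = null_mask.any(axis=1)  # rows missing at least one description
all_missing = null_mask.all(axis=1)  # rows missing every description

# If the two masks are identical, no row is missing only some of the columns
same_rows = any_missing.equals(all_missing)
print(same_rows, any_missing.sum())
```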


Browsing these games it looks like about half are old PopCap games from 2006 and about half are from Telltale Games, similar to the Sam & Max title we encountered in the previous section.

There is also a dedicated server and a game which is now unlisted on the steam store. It would definitely be best to remove these two.

Let's remove these rows for now, but we can reintroduce them later if we wish.

As stated, the description columns may be useful for future projects, so before we remove them from this data set we will export them as a csv file. We will include the `steam_appid` column in this export, as it will allow us to match up these rows with rows in our primary data set later on using a merge (or a join in SQL). We will write a short function to handle this, which we can re-use later on if we have any more dataframes that need exporting.

In [35]:
def export_data(df, filename):
    """Export dataframe to csv file, filename prepended with 'steam_'.
    
    filename : str without file extension
    """
    filepath = '../data/exports/steam_' + filename + '.csv'
    formatted_name = filename.replace('_', ' ')
    
    df.to_csv(filepath, index=False)
    print("Exported {} to '{}'".format(formatted_name, filepath))

    
def process_descriptions(df, export=False):
    """Export descriptions to external csv file then remove these columns."""
    # remove rows with missing description data
    df = df[df['detailed_description'].notnull()].copy()
    
    if export:
        # create dataframe of description columns and export to csv
        description_data = df[['steam_appid', 'detailed_description', 'about_the_game', 'short_description']]
        
        export_data(description_data, filename='description_data')
    
    # drop description columns from main dataframe
    df = df.drop(['detailed_description', 'about_the_game', 'short_description'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_descriptions(df, export=True)
    
    return df


steam_data = process(raw_steam_data)
steam_data.head()

Exported description data to '../data/exports/steam_description_data.csv'


Unnamed: 0,name,steam_appid,required_age,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,packages,package_groups,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors,windows,mac,linux,price
0,Counter-Strike,10,3,"English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[7],"[{'name': 'default', 'title': 'Buy Counter-Str...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'url': 'http://steamcommunity.com/app/10', 'e...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,"{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,7.19
1,Team Fortress Classic,20,3,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/20/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[29],"[{'name': 'default', 'title': 'Buy Team Fortre...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/20/...,"{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,3.99
2,Day of Defeat,30,3,"English, French, German, Italian, Spanish - Spain",https://steamcdn-a.akamaihd.net/steam/apps/30/...,http://www.dayofdefeat.com/,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[30],"[{'name': 'default', 'title': 'Buy Day of Defe...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/30/...,"{'ids': [], 'notes': None}",1,1,1,3.99
3,Deathmatch Classic,40,3,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/40/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[31],"[{'name': 'default', 'title': 'Buy Deathmatch ...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/40/...,"{'ids': [], 'notes': None}",1,1,1,3.99
4,Half-Life: Opposing Force,50,3,"English, French, German, Korean",https://steamcdn-a.akamaihd.net/steam/apps/50/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Gearbox Software'],['Valve'],[32],"[{'name': 'default', 'title': 'Buy Half-Life: ...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'url': 'https://help.steampowered.com', 'emai...",https://steamcdn-a.akamaihd.net/steam/apps/50/...,"{'ids': [], 'notes': None}",1,1,1,3.99


In [36]:
# inspect exported data
pd.read_csv('../data/exports/steam_description_data.csv').head()

Unnamed: 0,steam_appid,detailed_description,about_the_game,short_description
0,10,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...
1,20,One of the most popular online action games of...,One of the most popular online action games of...,One of the most popular online action games of...
2,30,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...
3,40,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...
4,50,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...
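
Because `steam_appid` is kept in the export, the descriptions can be matched back up with the main data set whenever they are needed. A minimal sketch of such a merge is shown here, using two small toy dataframes (`main` and `descriptions`) in place of the real data and the exported csv file.

```python
import pandas as pd

# Toy stand-in for the cleaned data set
main = pd.DataFrame({
    'steam_appid': [10, 20, 30],
    'name': ['Counter-Strike', 'Team Fortress Classic', 'Day of Defeat'],
})

# Toy stand-in for the exported descriptions (appid 30 deliberately absent)
descriptions = pd.DataFrame({
    'steam_appid': [10, 20],
    'short_description': ['Play the world\'s number 1 online action game.',
                          'One of the most popular online action games.'],
})

# A left merge keeps every row of the main data set;
# rows without a matching description get NaN
merged = main.merge(descriptions, on='steam_appid', how='left')
print(merged)
```

Using `how='left'` is a design choice that preserves all rows in the main data set. An inner merge would instead silently drop any app without a description.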


### Processing Languages

The next column is `supported_languages`. As we will be performing the analysis for an English company, we are only interested in apps that are in English. Whilst we could remove non-English apps at this stage, we will instead create a column marking English apps with a boolean-style flag: 1 for English, 0 otherwise.

We begin as usual by looking for rows with null values.

In [37]:
steam_data['supported_languages'].isnull().sum()

4

Taking a closer look at these apps, it's possible one or two are not in English. As there are only 4 rows affected we will go ahead and remove these from the data set.

In [38]:
steam_data[steam_data['supported_languages'].isnull()]

Unnamed: 0,name,steam_appid,required_age,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,packages,package_groups,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors,windows,mac,linux,price
4866,Subsiege,338640,3,,https://steamcdn-a.akamaihd.net/steam/apps/338...,http://subsiege-game.com/,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],['Icebird Studios'],['Icebird Studios'],[56500],"[{'name': 'default', 'title': 'Buy Subsiege', ...",,,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256729398, 'name': 'Release Trailer', ...",{'total': 0},"{'coming_soon': False, 'date': '7 Sep, 2018'}","{'url': 'http://subsiege-game.com/', 'email': ...",https://steamcdn-a.akamaihd.net/steam/apps/338...,"{'ids': [], 'notes': None}",1,0,0,17.89
14560,MARS VR(全球使命VR),596560,3,,https://steamcdn-a.akamaihd.net/steam/apps/596...,http://qqsm.zygames.com/,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,"['Ying Pei Digital Technology Shanghai Co., Li...","['SHANGHAI ZHENYOU TECHNOLOGY CO.,LTD']",[156314],"[{'name': 'default', 'title': 'Buy MARS VR(全球使...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '73', 'description': 'Violent'}, {'id'...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256681371, 'name': 'marsvr', 'thumbnai...",{'total': 0},"{'coming_soon': False, 'date': '5 Apr, 2017'}","{'url': 'http://www.zygames.com/contact', 'ema...",https://steamcdn-a.akamaihd.net/steam/apps/596...,"{'ids': [], 'notes': None}",1,0,0,1.99
16386,Numberline 2,654970,3,,https://steamcdn-a.akamaihd.net/steam/apps/654...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],"['V34D4R', 'Egor Magurin']",['Indovers Studio'],[184646],"[{'name': 'default', 'title': 'Buy Numberline ...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256687192, 'name': 'Numberline 2 Trail...","{'total': 60, 'highlighted': [{'name': '1st le...","{'coming_soon': False, 'date': '14 Jul, 2017'}","{'url': '', 'email': 'radaew.zhenya@yandex.ru'}",https://steamcdn-a.akamaihd.net/steam/apps/654...,"{'ids': [], 'notes': None}",1,0,0,1.59
26855,SNUSE 221,948070,3,,https://steamcdn-a.akamaihd.net/steam/apps/948...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],['SNUSE GM'],['SNUSE GM'],[308421],"[{'name': 'default', 'title': 'Buy SNUSE 221',...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '25', 'description': 'Adventure'}, {'i...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256745662, 'name': 'snuse', 'thumbnail...",{'total': 0},"{'coming_soon': False, 'date': '2 Apr, 2019'}","{'url': 'vk.com/nilow_i', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/948...,"{'ids': [], 'notes': None}",1,0,0,0.79


By looking at the value for the first row and the most common values, it looks like languages are stored as a string which can be anything from a comma-separated list of languages to a mix of HTML and headings. It seems reasonably safe to assume that if the app is in English, the word English will appear somewhere in this string. With this in mind we can simply search the string and return a value based on the result.

In [39]:
print(steam_data['supported_languages'][0])
steam_data['supported_languages'].value_counts().head(10)

English<strong>*</strong>, French<strong>*</strong>, German<strong>*</strong>, Italian<strong>*</strong>, Spanish - Spain<strong>*</strong>, Simplified Chinese<strong>*</strong>, Traditional Chinese<strong>*</strong>, Korean<strong>*</strong><br><strong>*</strong>languages with full audio support


English                                                                                                        8702
English<strong>*</strong><br><strong>*</strong>languages with full audio support                               7669
English, Russian                                                                                                719
English, Simplified Chinese                                                                                     291
English, Japanese                                                                                               239
English<strong>*</strong>, Russian<strong>*</strong><br><strong>*</strong>languages with full audio support     227
English, French, Italian, German, Spanish - Spain                                                               188
Simplified Chinese                                                                                              168
English, German                                                         

In [40]:
def process_language(df):
    """Process supported_languages column into a boolean 'is english' column."""
    df = df.copy()
    
    # drop rows with missing language data
    df = df.dropna(subset=['supported_languages'])
    
    df['english'] = df['supported_languages'].apply(lambda x: 1 if 'english' in x.lower() else 0)
    df = df.drop('supported_languages', axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_descriptions(df)
    df = process_language(df)
    
    return df


steam_data = process(raw_steam_data)
steam_data[['name', 'english']].head()

Unnamed: 0,name,english
0,Counter-Strike,1
1,Team Fortress Classic,1
2,Day of Defeat,1
3,Deathmatch Classic,1
4,Half-Life: Opposing Force,1


Before moving on, we can take a quick look at our results and see that most of our apps are in English.

In [41]:
steam_data['english'].value_counts(dropna=False)

1    28500
0      543
Name: english, dtype: int64
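
The substring search is all we need here, but if a future project wanted the full list of languages, the raw string could be tidied up first. The following is only a sketch: `parse_languages` is a hypothetical helper, and it assumes the footnote about audio support always follows a `<br>` tag, as it does in the samples printed above.

```python
import re


def parse_languages(raw):
    """Return a list of language names from a raw supported_languages string."""
    # ASSUMPTION: anything after the first <br> is a footnote, not a language
    main_part = raw.split('<br>')[0]
    # Strip remaining HTML tags (e.g. <strong>) and the * audio markers
    no_tags = re.sub(r'<[^>]+>', '', main_part).replace('*', '')
    return [lang.strip() for lang in no_tags.split(',') if lang.strip()]


sample = ('English<strong>*</strong>, French<strong>*</strong>'
          '<br><strong>*</strong>languages with full audio support')

print(parse_languages(sample))
print(parse_languages('English, German, Spanish - Spain'))
```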

### Processing Image Columns

Similar to our description columns, we have three columns that appear to contain links to various images: `header_image`, `screenshots` and `background`. We will treat these in almost the same way, exporting the contents to a csv file then removing the columns from our data set.

Whilst we won't be needing this data for our current project, it could open the door to some interesting image analysis in the future.

First we check for missing values.

In [42]:
image_cols = ['header_image', 'screenshots', 'background']

for col in image_cols:
    print(col+':', steam_data[col].isnull().sum())

header_image: 0
screenshots: 15
background: 15


Again it is likely that the 15 rows with missing screenshots data are the same rows with missing background data.

As seen below, some of these rows also have missing `pc_requirements`, some have missing release dates (a blank string in the `date` part of `release_date`), and most have a price of -1, meaning we couldn't find any price data earlier.

It seems like it would be a good idea to remove these rows before proceeding.

In [43]:
steam_data[steam_data['screenshots'].isnull()]

Unnamed: 0,name,steam_appid,required_age,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,packages,package_groups,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors,windows,mac,linux,price,english
652,Sam & Max 302: The Tomb of Sammun-Mak,31230,3,https://steamcdn-a.akamaihd.net/steam/apps/312...,,[],[],[],['Telltale Games'],['Telltale Games'],"[109586, 4172]","[{'name': 'default', 'title': 'Buy Sam & Max 3...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '25', 'description': 'Adventure'}]",,,{'total': 0},"{'coming_soon': False, 'date': ''}","{'url': 'https://telltale.com/support/', 'emai...",,"{'ids': [], 'notes': None}",1,1,0,-1.0,1
653,Sam & Max 303: They Stole Max's Brain!,31240,3,https://steamcdn-a.akamaihd.net/steam/apps/312...,,[],[],[],['Telltale Games'],['Telltale Games'],"[109587, 4172]","[{'name': 'default', 'title': ""Buy Sam & Max 3...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '25', 'description': 'Adventure'}]",,,{'total': 0},"{'coming_soon': False, 'date': ''}","{'url': 'https://telltale.com/support/', 'emai...",,"{'ids': [], 'notes': None}",1,1,0,-1.0,1
654,Sam & Max 304: Beyond the Alley of the Dolls,31250,3,https://steamcdn-a.akamaihd.net/steam/apps/312...,,[],[],[],['Telltale Games'],['Telltale Games'],"[109588, 4172]","[{'name': 'default', 'title': 'Buy Sam & Max 3...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '25', 'description': 'Adventure'}]",,,{'total': 0},"{'coming_soon': False, 'date': ''}","{'url': 'https://telltale.com/support/', 'emai...",,"{'ids': [], 'notes': None}",1,1,0,-1.0,1
655,Sam & Max 305: The City That Dares Not Sleep,31260,3,https://steamcdn-a.akamaihd.net/steam/apps/312...,,[],[],[],['Telltale Games'],['Telltale Games'],"[109589, 4172]","[{'name': 'default', 'title': 'Buy Sam & Max 3...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '25', 'description': 'Adventure'}]",,,{'total': 0},"{'coming_soon': False, 'date': ''}","{'url': 'https://telltale.com/support/', 'emai...",,"{'ids': [], 'notes': None}",1,1,0,-1.0,1
1238,Hector: Episode 1,94600,3,https://steamcdn-a.akamaihd.net/steam/apps/946...,,{'minimum': '<strong>Минимальные:</strong><br>...,{'minimum': '<strong>Минимальные:</strong><br>...,[],['Straandlooper'],[''],[11279],"[{'name': 'default', 'title': 'Buy Hector: Epi...",,,,,{'total': 0},"{'coming_soon': False, 'date': ''}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}",1,1,0,-1.0,1
1239,Hector: Episode 2,94610,3,https://steamcdn-a.akamaihd.net/steam/apps/946...,http://www.telltalegames.com/hector,{'minimum': 'Minimum:<br>\t\t\t\t\t\t\t\t\t\t\...,[],[],['Straandlooper'],[''],"[109595, 11279]","[{'name': 'default', 'title': 'Buy Hector: Epi...","[{'id': 2, 'description': 'Single-player'}]","[{'id': '25', 'description': 'Adventure'}, {'i...",,,{'total': 0},"{'coming_soon': False, 'date': ''}","{'url': '', 'email': 'support@telltalegames.com'}",,"{'ids': [], 'notes': None}",1,1,0,-1.0,1
5218,Into The War,346370,3,https://steamcdn-a.akamaihd.net/steam/apps/346...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],['Small Town Studios'],['Small Town Studios'],,[],"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}, {'id': ...",,,"{'total': 1, 'highlighted': [{'name': 'First B...","{'coming_soon': False, 'date': '2 Dec, 2015'}","{'url': 'http://intothewar.com', 'email': 'nan...",,"{'ids': [], 'notes': None}",1,0,0,-1.0,1
7970,The Light Empire,416220,3,https://steamcdn-a.akamaihd.net/steam/apps/416...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],['Jemy'],['Jemy'],[83871],"[{'name': 'default', 'title': 'Buy The Light E...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '23', 'description': 'Indie'}, {'id': ...",,,"{'total': 4, 'highlighted': [{'name': 'We Begi...","{'coming_soon': False, 'date': '2 Dec, 2015'}","{'url': '', 'email': 'Jemy.TLE@outlook.com'}",,"{'ids': [], 'notes': None}",1,0,0,4.79,1
9408,A Land Fit For Heroes,456210,3,https://steamcdn-a.akamaihd.net/steam/apps/456...,http://landfitforheroes.com,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],['Liber Primus Games'],['Liber Primus Games'],,[],"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '25', 'description': 'Adventure'}, {'i...",,"[{'id': 256663531, 'name': 'A Land Fit For Her...",{'total': 0},"{'coming_soon': False, 'date': '3 May, 2016'}","{'url': 'http://landfitforheroes.com', 'email'...",,"{'ids': [], 'notes': None}",1,0,0,-1.0,1
19481,JumpSky,731910,3,https://steamcdn-a.akamaihd.net/steam/apps/731...,,[],{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,['none'],['none'],,[],"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]",,,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2017'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}",1,1,0,-1.0,1


There is also a `movies` column with similar data. Whilst it has more missing values, presumably for games without videos, it appears to contain names, thumbnails and links to various videos and trailers. It's unlikely we'll need them, but we can include them in the export and remove them from our data set.

In [44]:
steam_data['movies'].isnull().sum()

1893

In [45]:
with pd.option_context("display.max_colwidth", 1000):
    print(steam_data[steam_data['movies'].notnull()]['movies'].head(3))

9                                                                                                                                                                                                                                                                                                                                                         [{'id': 904, 'name': 'Half-Life 2 Trailer', 'thumbnail': 'https://steamcdn-a.akamaihd.net/steam/apps/904/movie.jpg?t=1507237301', 'webm': {'480': 'http://steamcdn-a.akamaihd.net/steam/apps/904/movie480.webm?t=1507237301', 'max': 'http://steamcdn-a.akamaihd.net/steam/apps/904/movie_max.webm?t=1507237301'}, 'highlight': True}, {'id': 5724, 'name': 'Free Yourself', 'thumbnail': 'https://steamcdn-a.akamaihd.net/steam/apps/5724/movie.293x165.jpg?t=1507237311', 'webm': {'480': 'http://steamcdn-a.akamaihd.net/steam/apps/5724/movie480.webm?t=1507237311', 'max': 'http://steamcdn-a.akamaihd.net/steam/apps/5724/movie_max.webm?t=1507237311'}, 'highlight': Fa
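
Should the exported movie data be needed later, each entry is a string representation of a list of dictionaries, so it could be parsed in the same way as the price data. A rough sketch follows, using a single entry copied from the output above. The key names are taken from that output and may not be present for every app.

```python
from ast import literal_eval

# One movie entry from the output above, stored as a string like in the csv file
raw_movies = (
    "[{'id': 904, 'name': 'Half-Life 2 Trailer', "
    "'thumbnail': 'https://steamcdn-a.akamaihd.net/steam/apps/904/movie.jpg?t=1507237301', "
    "'webm': {'480': 'http://steamcdn-a.akamaihd.net/steam/apps/904/movie480.webm?t=1507237301', "
    "'max': 'http://steamcdn-a.akamaihd.net/steam/apps/904/movie_max.webm?t=1507237301'}, "
    "'highlight': True}]"
)

# Convert the string into a real list of dictionaries
movies = literal_eval(raw_movies)

for movie in movies:
    # Print each trailer's name alongside its highest quality video link
    print(movie['name'], '->', movie['webm']['max'])
```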

In [46]:
def process_images(df, export=False):
    """Remove image columns from dataframe, optionally exporting them to csv first."""
    df = df[df['screenshots'].notnull()].copy()
    
    if export:
        image_data = df[['steam_appid', 'header_image', 'screenshots', 'background', 'movies']]
        
        export_data(image_data, 'image_data')
        
    df = df.drop(['header_image', 'screenshots', 'background', 'movies'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_descriptions(df)
    df = process_language(df)
    df = process_images(df, export=True)
    
    return df


steam_data = process(raw_steam_data)
steam_data.head()

Exported image data to '../data/exports/steam_image_data.csv'


Unnamed: 0,name,steam_appid,required_age,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,packages,package_groups,categories,genres,achievements,release_date,support_info,content_descriptors,windows,mac,linux,price,english
0,Counter-Strike,10,3,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[7],"[{'name': 'default', 'title': 'Buy Counter-Str...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'url': 'http://steamcommunity.com/app/10', 'e...","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,7.19,1
1,Team Fortress Classic,20,3,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[29],"[{'name': 'default', 'title': 'Buy Team Fortre...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'url': '', 'email': ''}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,3.99,1
2,Day of Defeat,30,3,http://www.dayofdefeat.com/,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[30],"[{'name': 'default', 'title': 'Buy Day of Defe...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'url': '', 'email': ''}","{'ids': [], 'notes': None}",1,1,1,3.99,1
3,Deathmatch Classic,40,3,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[31],"[{'name': 'default', 'title': 'Buy Deathmatch ...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'url': '', 'email': ''}","{'ids': [], 'notes': None}",1,1,1,3.99,1
4,Half-Life: Opposing Force,50,3,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Gearbox Software'],['Valve'],[32],"[{'name': 'default', 'title': 'Buy Half-Life: ...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'url': 'https://help.steampowered.com', 'emai...","{'ids': [], 'notes': None}",1,1,1,3.99,1
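
The comment in `process` notes that all appids should be unique, while `drop_duplicates()` called without arguments only removes rows that are identical in every column. A minimal sketch of how that assumption could be checked is shown here. The toy dataframe and its rows are made up for illustration and are not taken from the data set.

```python
import pandas as pd

# Hypothetical toy data: appid 20 appears twice with different names,
# and the 'Day of Defeat' row is repeated exactly
toy = pd.DataFrame({
    'steam_appid': [10, 20, 20, 30, 30],
    'name': ['Counter-Strike', 'Team Fortress Classic', 'TFC',
             'Day of Defeat', 'Day of Defeat'],
})

# Only the exact repeat (second 'Day of Defeat' row) is removed
deduped = toy.drop_duplicates()

# Rows that share an appid but differ in another column survive
repeated = deduped[deduped['steam_appid'].duplicated(keep=False)]

# True only if every appid occurs exactly once
appids_unique = deduped['steam_appid'].is_unique
```

If a check like this ever came back `False` on the real data, passing `subset='steam_appid'` to `drop_duplicates` would be one option, at the cost of silently keeping only the first row for each appid.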


In [47]:
# inspect exported data
pd.read_csv('../data/exports/steam_image_data.csv').head()

Unnamed: 0,steam_appid,header_image,screenshots,background,movies
0,10,https://steamcdn-a.akamaihd.net/steam/apps/10/...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,
1,20,https://steamcdn-a.akamaihd.net/steam/apps/20/...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/20/...,
2,30,https://steamcdn-a.akamaihd.net/steam/apps/30/...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/30/...,
3,40,https://steamcdn-a.akamaihd.net/steam/apps/40/...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/40/...,
4,50,https://steamcdn-a.akamaihd.net/steam/apps/50/...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/50/...,
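
The `screenshots` column in the exported file holds the text representation of a list of dictionaries, so it comes back from `read_csv` as a plain string. If the image data is picked up again later, it would need converting back into Python objects first. The following is a rough sketch of one way to do that with `ast.literal_eval`, assuming every value is a valid Python literal. The dataframe is a hypothetical stand-in for the export, and the URLs are placeholders.

```python
import ast
import io

import pandas as pd

# Hypothetical stand-in for the exported image data (placeholder URLs)
original = pd.DataFrame({
    'steam_appid': [10],
    'screenshots': [[{'id': 0, 'path_thumbnail': 'https://example.com/a.jpg'},
                     {'id': 1, 'path_thumbnail': 'https://example.com/b.jpg'}]],
})

# Round-trip through csv in memory: the list is written as its string form
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After reloading, the cell is a string rather than a list
raw_value = reloaded.loc[0, 'screenshots']

# literal_eval only evaluates Python literals, so it is safer than eval
parsed = reloaded['screenshots'].apply(ast.literal_eval)

# With real lists available, per-game counts are straightforward
n_screenshots = parsed.apply(len)
```

Missing values, such as the empty `movies` entries above, are read back as `NaN` and would raise an error in `literal_eval`, so they would need to be dropped or skipped before parsing.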


# Next Steps

The cleaning process continues in the next part.