# Data Cleaning

### Aims:
- Improve functions
- Prepare notebook for delivery

### Intro

The data we downloaded is not yet in a usable state. Many of the columns contain lengthy strings or missing values, both of which hinder analysis, and especially any machine learning techniques we may wish to implement.

The main aims of this project are to investigate various sales and play-time statistics for games from the Steam store, and to see how different features of games may affect their success. Keeping this in mind will help inform our decisions about how to handle the various columns in our data set. However, it may be a good idea to keep columns which do not seem useful to this particular project, in order to provide a robust data set for future analysis projects.

To begin with, we'll import our libraries and set some options, then take a look at the downloaded data from the Steam API. Once that is taken care of we will move on to the SteamSpy data and repeat the process. Hopefully by the end we will have clean data sets to use in the next step: exploratory analysis and visualisation.

### (Raw) Data Dictionary

Sort out data dictionary  

API and data dictionary:
https://steamspy.com/api.php

### Future ideas:
- pc requirements analysis over time
- picture analysis
- keyword/recommender analysis
- categories could make table in a database all on its own, perhaps in future
- for genres (and categories?) could create main genre, selected from list of key genres, allowing hybrids like action_adventure if contains both
- remove titles over £60/100?

In [1]:
# load extensions and magics

# http://raw.github.com/jrjohansson/version_information/master/version_information.py
%load_ext version_information
%reload_ext version_information

%version_information numpy, pandas

Software,Version
Python,3.7.3 64bit [MSC v.1915 64 bit (AMD64)]
IPython,7.5.0
OS,Windows 10 10.0.17763 SP0
numpy,1.16.3
pandas,0.24.2
Thu May 09 11:19:47 2019 GMT Summer Time


In [278]:
# import libraries
from ast import literal_eval
import itertools
import time
import re

import numpy as np
import pandas as pd

In [3]:
# customisations
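# use the full option name, as the short form can be ambiguous in newer pandas versions
pd.set_option("display.max_columns", 100)
# pd.reset_option("display.max_columns")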

## Cleaning steam data

### Import Data

We begin by importing the raw steam data we generated previously in data collection, which can be viewed by following the link to `../deliver/1-data-collection.ipynb` below. From a quick inspection of the data, we can see that we have a mixture of numeric and string columns, plenty of missing values, and a number of columns stored as dictionaries.

In [4]:
from IPython.display import FileLink
FileLink("../deliver/1-data-collection.ipynb")

In [5]:
raw_steam_data = pd.read_csv('../data/raw/steam_app_data.csv')

print('Rows:', raw_steam_data.shape[0])
print('Columns:', raw_steam_data.shape[1])
raw_steam_data.head()

Rows: 29235
Columns: 39


Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,legal_notice,drm_notice,ext_user_account_notice,developers,publishers,demos,price_overview,packages,package_groups,platforms,metacritic,reviews,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
0,game,Counter-Strike,10,0.0,False,,,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,,"English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'GBP', 'initial': 719, 'final': 7...",[7],"[{'name': 'default', 'title': 'Buy Counter-Str...","{'windows': True, 'mac': True, 'linux': True}","{'score': 88, 'url': 'https://www.metacritic.c...",,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 65735},{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'url': 'http://steamcommunity.com/app/10', 'e...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
1,game,Team Fortress Classic,20,0.0,False,,,One of the most popular online action games of...,One of the most popular online action games of...,One of the most popular online action games of...,,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/20/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'GBP', 'initial': 399, 'final': 3...",[29],"[{'name': 'default', 'title': 'Buy Team Fortre...","{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 2802},{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/20/...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
2,game,Day of Defeat,30,0.0,False,,,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,,"English, French, German, Italian, Spanish - Spain",https://steamcdn-a.akamaihd.net/steam/apps/30/...,http://www.dayofdefeat.com/,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'GBP', 'initial': 399, 'final': 3...",[30],"[{'name': 'default', 'title': 'Buy Day of Defe...","{'windows': True, 'mac': True, 'linux': True}","{'score': 79, 'url': 'https://www.metacritic.c...",,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 1992},{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/30/...,"{'ids': [], 'notes': None}"
3,game,Deathmatch Classic,40,0.0,False,,,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/40/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'GBP', 'initial': 399, 'final': 3...",[31],"[{'name': 'default', 'title': 'Buy Deathmatch ...","{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 931},{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/40/...,"{'ids': [], 'notes': None}"
4,game,Half-Life: Opposing Force,50,0.0,False,,,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,,"English, French, German, Korean",https://steamcdn-a.akamaihd.net/steam/apps/50/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Gearbox Software'],['Valve'],,"{'currency': 'GBP', 'initial': 399, 'final': 3...",[32],"[{'name': 'default', 'title': 'Buy Half-Life: ...","{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 4355},{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'url': 'https://help.steampowered.com', 'emai...",https://steamcdn-a.akamaihd.net/steam/apps/50/...,"{'ids': [], 'notes': None}"


We can chain the `isnull()` and `sum()` methods to easily see how many missing values we have in each column. Immediately we can see that a number of columns have over 20,000 rows with missing data, and in a data set of almost 30,000 rows these are unlikely to provide any useful information.

In [6]:
raw_steam_data.isnull().sum()

type                         149
name                           1
steam_appid                    0
required_age                 149
is_free                      149
controller_support         23237
dlc                        24260
detailed_description         175
about_the_game               175
short_description            175
fullgame                   29235
supported_languages          163
header_image                 149
website                     9983
pc_requirements              149
mac_requirements             149
linux_requirements           149
legal_notice               19168
drm_notice                 29077
ext_user_account_notice    28723
developers                   264
publishers                   149
demos                      27096
price_overview              3712
packages                    3370
package_groups               149
platforms                    149
metacritic                 26254
reviews                    23330
categories                   714
genres    
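
Raw counts are easier to judge as proportions of the data set. As a minimal sketch on a toy frame (the column names and values are made up for illustration, not taken from the Steam data), `isnull().mean()` gives the fraction of missing values per column:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not the real Steam data set
toy = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'website': [np.nan, 'http://example.com', np.nan, np.nan],
    'dlc': [np.nan, np.nan, np.nan, np.nan],
})

# isnull() gives a boolean frame; the mean of booleans is the fraction of True values
missing_fraction = toy.isnull().mean()

# Columns where more than half of the rows are missing
mostly_missing = missing_fraction[missing_fraction > 0.5].index.tolist()

print(missing_fraction)
print(mostly_missing)
```

The same two lines could be pointed at `raw_steam_data` to list the columns that fall above any chosen cut-off.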

## Defining Functions

We will most likely have to handle each column differently and individually, so we will write some functions to keep our methodology organised, and help iteratively develop the process.


### Initial processing

Our first function, `process_null_cols`, will remove the columns with more than 50% missing values, taking care of the null counts we saw previously. We then look at the type and name columns, thinning out our data set a little by removing apps without either.

In the data collection stage, if no information was returned for an app we wrote just the name and steam_appid to the file. As seen below, these rows contain no other information so we definitely need to remove them.

In [None]:
print('Rows to remove:', raw_steam_data[raw_steam_data['type'].isnull()].shape[0])

raw_steam_data[raw_steam_data['type'].isnull()].head()

Once the null rows are removed, we can see that all the other rows have 'game' as their type, meaning this column isn't of any use and can be safely dropped.

In [None]:
raw_steam_data['type'].value_counts()

In the name column we have a couple of rows without a title (or 'none' as the title). It looks like these can be safely removed.

In [None]:
raw_steam_data[(raw_steam_data['name'].isnull()) | (raw_steam_data['name'] == 'none')]

We also have some duplicated rows, likely caused by errors or overlapping in our data collection process. As we know for certain that all AppIDs should be unique, we can safely remove these duplicates straight away.

In [None]:
raw_steam_data[raw_steam_data.duplicated()].head()
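
To make the duplicate check concrete, here is a small sketch on made-up rows (the frame below is illustrative only). It counts fully duplicated rows, removes them, and then confirms that each AppID appears once:

```python
import pandas as pd

# Toy data for illustration only: the second and third rows are identical
toy = pd.DataFrame({
    'steam_appid': [10, 20, 20, 30],
    'name': ['Counter-Strike', 'Team Fortress Classic',
             'Team Fortress Classic', 'Day of Defeat'],
})

# duplicated() flags every repeat of an earlier row (the first occurrence is not flagged)
n_duplicates = toy.duplicated().sum()

deduped = toy.drop_duplicates()

# After removing full-row duplicates, every appid should appear exactly once
ids_unique = deduped['steam_appid'].is_unique

print(n_duplicates, len(deduped), ids_unique)
```

If `is_unique` came back False on the real data, it would suggest rows sharing an AppID but differing elsewhere, which `drop_duplicates()` alone would not catch.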

Here we define and run our functions to handle everything we just looked at. We also define a general `process` function which will run all of our processing functions on the data set, allowing us to slowly add to it as we build out to processing more columns. Finally we run this function on our raw data, inspecting the first few rows and viewing how many rows and columns we have dropped.

In [7]:
def process_null_cols(df, thresh=0.5):
    """Drop columns with more than a certain proportion of missing values (Default 50%)."""
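    # dropna's thresh is the minimum number of non-null values needed to keep a column,
    # so convert the allowed proportion of missing values into a required non-null count
    cutoff_count = len(df) * (1 - thresh)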
    
    return df.dropna(thresh=cutoff_count, axis=1)


def drop_null_rows(df, col):
    """Drop rows with null values in a particular column."""
    return df[df[col].notnull()]


def process_type(df):
    """Remove rows with null values for type column, then drop the column."""
    df = drop_null_rows(df, 'type')
    df = df.drop('type', axis=1)
    
    return df
    
    
def process_name(df):
    """Remove rows with null values or 'none' in name column."""
    df = drop_null_rows(df, 'name')
    df = df[df['name'] != 'none']
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    
    return df

print(raw_steam_data.shape)
steam_data = process(raw_steam_data)
print(steam_data.shape)
steam_data.head()

(29235, 39)
(29075, 28)


Unnamed: 0,name,steam_appid,required_age,is_free,detailed_description,about_the_game,short_description,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,price_overview,packages,package_groups,platforms,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors
0,Counter-Strike,10,0.0,False,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,"English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],"{'currency': 'GBP', 'initial': 719, 'final': 7...",[7],"[{'name': 'default', 'title': 'Buy Counter-Str...","{'windows': True, 'mac': True, 'linux': True}","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'url': 'http://steamcommunity.com/app/10', 'e...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
1,Team Fortress Classic,20,0.0,False,One of the most popular online action games of...,One of the most popular online action games of...,One of the most popular online action games of...,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/20/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],"{'currency': 'GBP', 'initial': 399, 'final': 3...",[29],"[{'name': 'default', 'title': 'Buy Team Fortre...","{'windows': True, 'mac': True, 'linux': True}","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/20/...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
2,Day of Defeat,30,0.0,False,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,"English, French, German, Italian, Spanish - Spain",https://steamcdn-a.akamaihd.net/steam/apps/30/...,http://www.dayofdefeat.com/,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],"{'currency': 'GBP', 'initial': 399, 'final': 3...",[30],"[{'name': 'default', 'title': 'Buy Day of Defe...","{'windows': True, 'mac': True, 'linux': True}","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/30/...,"{'ids': [], 'notes': None}"
3,Deathmatch Classic,40,0.0,False,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/40/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],"{'currency': 'GBP', 'initial': 399, 'final': 3...",[31],"[{'name': 'default', 'title': 'Buy Deathmatch ...","{'windows': True, 'mac': True, 'linux': True}","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/40/...,"{'ids': [], 'notes': None}"
4,Half-Life: Opposing Force,50,0.0,False,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,"English, French, German, Korean",https://steamcdn-a.akamaihd.net/steam/apps/50/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Gearbox Software'],['Valve'],"{'currency': 'GBP', 'initial': 399, 'final': 3...",[32],"[{'name': 'default', 'title': 'Buy Half-Life: ...","{'windows': True, 'mac': True, 'linux': True}","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'url': 'https://help.steampowered.com', 'emai...",https://steamcdn-a.akamaihd.net/steam/apps/50/...,"{'ids': [], 'notes': None}"
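
One detail worth spelling out: the `thresh` argument of `DataFrame.dropna` is the minimum number of non-null values a column needs in order to be kept, not the number of nulls allowed. A small sketch on made-up data (the column names are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy data for illustration only: 4 rows, with 0, 2 and 3 missing values per column
toy = pd.DataFrame({
    'full': [1, 2, 3, 4],
    'half': [1.0, np.nan, 3.0, np.nan],
    'sparse': [np.nan, np.nan, np.nan, 4.0],
})

# Keep columns with at least 2 non-null values (i.e. at most 50% missing)
min_non_null = 2
kept = toy.dropna(thresh=min_non_null, axis=1)

print(kept.columns.tolist())
```

Here `half` survives because exactly half of its values are present, while `sparse` is dropped.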


### Processing age

The next column we will look at is `required_age`. We can see that it is already numeric, holding whole-number ages that range from 0 to 20, with one likely error (1818).

In [None]:
steam_data['required_age'].value_counts().sort_index()

Whilst fairly useful in its current state, we may benefit from reducing the number of categories that ages fall into. Instead of comparing games rated as 5, 6, 7 or 8, we could compare games rated 5+ or 8+, for example.

To decide which categories (or bins) we should use, we will look at the [PEGI age ratings](https://pegi.info/) as this is the system used in the United Kingdom, where we're performing our analysis. We can see that ratings fall into one of five categories (3, 7, 12, 16, 18), defining the minimum age required to buy a game.

Using this to inform our decision, we can use the [pandas.cut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html) function to sort our data into each of these categories. As our erroneous row (1818) is most likely meant to be rated 18 anyway, we can set our upper bound above this value to catch it inside this category.


In [8]:
def process_age(df):
    """Format ratings in age column to be in line with the PEGI Age Ratings system."""
    # PEGI Age ratings: 3, 7, 12, 16, 18
    cut_points = [-1, 3, 7, 12, 16, 2000]
    label_values = [3, 7, 12, 16, 18]
    
    df['required_age'] = pd.cut(df['required_age'], bins=cut_points, labels=label_values)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    
    return df


steam_data = process(raw_steam_data)
steam_data['required_age'].value_counts().sort_index()

3     28442
7        12
12       79
16      205
18      337
Name: required_age, dtype: int64
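
To see how the bin edges behave, here is a short sketch applying the same cut points to a handful of made-up ages. Bins in `pd.cut` are right-inclusive by default, which is why the lower bound of -1 is needed for an age of 0 to land in the first bin:

```python
import pandas as pd

# Made-up ages for illustration, including the erroneous 1818 value
ages = pd.Series([0, 3, 6, 13, 17, 18, 1818])

# Bins are right-inclusive by default: (-1, 3], (3, 7], (7, 12], (12, 16], (16, 2000]
cut_points = [-1, 3, 7, 12, 16, 2000]
label_values = [3, 7, 12, 16, 18]

binned = pd.cut(ages, bins=cut_points, labels=label_values)

print(binned.tolist())
```

Note that the result is a categorical column rather than a plain integer column, which is worth keeping in mind when exporting or comparing values later.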

### Processing the platforms column

Whilst we could look at the next column in our dataframe, is_free, it would make sense that this is intrinsically linked to the price_overview column. Ultimately we may wish to combine these columns into one, where free games have a price of 0. Looking at the price_overview column, we can see it is stored in a dictionary-like structure, with multiple keys and values. Handling this may be quite tricky, so instead we'll look at a simpler example.

The platforms column appears to contain a key for each of the main operating systems - windows, mac and linux - and a corresponding boolean value, set to True or False depending on whether the game is available on that platform. This should be a reasonably straightforward place to start, and we can separate this data out into three columns, one for each platform, filled with boolean values.

In [None]:
steam_data['platforms'].head()

So far the cleaning process has been relatively simple, requiring mainly checking for null values and dropping some rows or columns. Already we can see that handling the platforms will be a little more complex.

Our first hurdle is getting python to recognise the data in the columns as dictionaries rather than just strings. This will allow us to access the different values separately, without having to do some unnecessarily complicated string formatting. As we can see below, even though the data looks like a dictionary it is in fact stored as a string.

In [None]:
print(type(steam_data['platforms'].iloc[0]))

steam_data['platforms'].iloc[0]

We can get around this using the handy [literal_eval](https://docs.python.org/3/library/ast.html#ast.literal_eval) function from the in-built `ast` module. As the name suggests, this will allow us to evaluate the string, and index into it as a dictionary.

In [None]:
print(type(literal_eval(steam_data['platforms'].iloc[0])))

literal_eval(steam_data['platforms'].iloc[0])['windows']
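
The same idea in isolation, as a minimal sketch using a hand-written string in the shape of the platforms column (the values are made up):

```python
from ast import literal_eval

# A dictionary stored as text, in the same shape as the platforms column
raw_value = "{'windows': True, 'mac': True, 'linux': False}"

parsed = literal_eval(raw_value)

print(type(raw_value).__name__, '->', type(parsed).__name__)
print(parsed['linux'])
```

Unlike the built-in `eval`, `literal_eval` only accepts Python literals (strings, numbers, tuples, lists, dictionaries, booleans and None), so it is a safer choice for text read from a file.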

We also need to check for null values, but fortunately there aren't any in this column.

In [None]:
steam_data['platforms'].isnull().sum()

Putting this all together, we'll be using the pandas [Series.apply](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html) method to help us quickly evaluate all of the rows, then we'll be calling `apply` again for each platform to create our new columns.

We could return the True/False value directly and store the values as boolean types, but since we'll be exporting the cleaned data to a csv file, let's store them as integers as this should reduce the file size slightly. Setting True as 1 and False as 0 can still be interpreted as a boolean type, but less data is used to store the information.

In [9]:
def process_platforms(df):
    """Split platforms column into separate boolean columns for each platform."""
    # evaluate values in platforms column, so can index into dictionaries
    df['platforms'] = df['platforms'].apply(literal_eval)
    
    # loop across keys, the platforms, which we'll turn into columns
    # (iloc takes the first row by position, as the row labelled 0 may have been dropped)
    for platform in df['platforms'].iloc[0].keys():
        # set 1 if value for platform in original column is True, or 0 if it is False
        df[platform] = df['platforms'].apply(lambda x: 1 if x[platform] else 0)
    
    # remove the original platforms column
    df = df.drop('platforms', axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    
    return df


steam_data = process(raw_steam_data)
steam_data[['name', 'windows', 'mac', 'linux']].head()

Unnamed: 0,name,windows,mac,linux
0,Counter-Strike,1,1,1
1,Team Fortress Classic,1,1,1
2,Day of Defeat,1,1,1
3,Deathmatch Classic,1,1,1
4,Half-Life: Opposing Force,1,1,1
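
As an aside, the same split can be done without looping over the keys. This is an alternative sketch rather than the approach used in `process_platforms`, and it runs on made-up rows: the parsed dictionaries are expanded into a dataframe in one step, then cast to integers.

```python
from ast import literal_eval

import pandas as pd

# Toy data for illustration only, in the same shape as the platforms column
toy = pd.DataFrame({
    'name': ['Game A', 'Game B'],
    'platforms': [
        "{'windows': True, 'mac': False, 'linux': False}",
        "{'windows': True, 'mac': True, 'linux': True}",
    ],
})

# Parse each string into a dictionary
parsed = toy['platforms'].apply(literal_eval)

# Build one column per key, keeping the original index so rows line up, then cast to 0/1
platform_cols = pd.DataFrame(parsed.tolist(), index=toy.index).astype(int)

result = toy.drop('platforms', axis=1).join(platform_cols)

print(result)
```

This relies on every row having the same keys; a row with a missing key would produce a null, and the integer cast would then fail.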


### Processing price

Now we have built up some intuition around how to deal with the data stored as dictionaries, let's return to the `is_free` and `price_overview` columns as we should now be able to handle them.

First let's check how many null values there are in `price_overview`.

In [None]:
steam_data['price_overview'].isnull().sum()

Whilst that looks like a lot, we have to consider the impact that the `is_free` column might be having. Before jumping to conclusions let's check if there are any rows with `is_free` marked as True and null values in the `price_overview` column.

In [None]:
free_and_null_price = steam_data[(steam_data['is_free']) & (steam_data['price_overview'].isnull())]

print(free_and_null_price.shape[0])
free_and_null_price.head()

It turns out this accounts for most of our null values in the `price_overview` column, meaning we can handle these by setting our final price as 0. That still leaves almost 850 rows which aren't free but have null values in the `price_overview` column. Let's investigate those.

In [None]:
not_free_and_null_price = steam_data[(steam_data['is_free'] == False) & (steam_data['price_overview'].isnull())]

not_free_and_null_price.head()

The first few rows contain big, well-known games which appear to have pretty complete data. It looks like we can rule out data errors, so let's dig a little deeper and see if we can find out what is going on.

We'll start by looking at the store pages for some of these titles. The URL of an app on the Steam website follows this structure:

    https://store.steampowered.com/app/[steam_appid]

This means we can easily generate these links using our above filter. We'll wrap it up in a function in case we want to use it later.

In [10]:
def print_steam_links(df):
    """Print links to store page for apps in a dataframe."""
    url_base = "https://store.steampowered.com/app/"
    
    for i, row in df.iterrows():
        appid = row['steam_appid']
        name = row['name']
        
        print(name + ':', url_base + str(appid))
        

print_steam_links(not_free_and_null_price[:5])


For these games we can conclude that:

- The Ship: Single Player is a tutorial, and comes as part of The Ship: Murder Party
- RollerCoaster Tycoon 3: Platinum has been removed from Steam (and another game website: GOG)  
  - "A spokesperson for GOG told Eurogamer it pulled the game "due to expiring licensing rights", and stressed it'll talk with "new distribution rights holders" to bring the game back as soon as possible." Source: [Eurogamer](https://www.eurogamer.net/articles/2018-05-09-rollercoaster-tycoon-3-pulled-from-steam-gog)
- BioShock has been replaced by BioShock Remastered
- Sam & Max 101 is sold as part of a season, and this can be found in the `package_groups` column

So we have a couple of options here. We could just drop these rows, we could try to figure out the price based on the package_groups column, or we could leave them for now and return to them later, which is what we will do. It may be that some or all of these rows are removed later in the cleaning process for other reasons.

Below we can view the games with similar names to the games we investigated, to help get an idea of what is happening.

In [None]:
steam_data[steam_data['name'].str.contains("The Ship:")]

In [None]:
steam_data[steam_data['name'].str.contains("BioShock™")]

In [None]:
steam_data[steam_data['name'].str.contains("Sam & Max 1")]

Finally, if we take a look at the data for the first row, we can see that our price is stored in a variety of formats. We have a `currency`, which is GBP, perfect as we are performing our analysis in the UK. Next we have a number of different values for the price, so which one do we use?

In [None]:
steam_data['price_overview'][0]

If we inspect another row, we see that there is an active discount of 80% on the title. It looks like `initial` contains the normal price before discount, and `final` contains the discounted price. `initial_formatted` and `final_formatted` contain the price as displayed in the currency. We don't have to worry about these, as we'll be storing the price as an integer (or float) and, if we really wanted, could format it like this when printing.

With all this in mind, it looks like we'll be checking the value under the currency key, and using the value in the initial key.

In [None]:
steam_data['price_overview'][37]
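
As a quick sketch of how these values relate, the dictionary below is a hypothetical entry with made-up numbers, using only the `currency`, `initial` and `final` keys. The values appear to be stored in pence, as in the rows shown earlier (719 for a price of £7.19):

```python
# Hypothetical price_overview entry for a discounted title; the numbers are made up
price_overview = {'currency': 'GBP', 'initial': 1999, 'final': 399}

# Use the undiscounted price, converted from pence to pounds
price_pounds = price_overview['initial'] / 100

# The discount can be recovered from the two values if it is ever needed
discount_percent = round(100 * (1 - price_overview['final'] / price_overview['initial']))

print(price_overview['currency'], price_pounds, discount_percent)
```

Using `initial` rather than `final` keeps the price independent of whichever sales happened to be running on the day the data was collected.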

Now the preliminary investigation is complete we can begin defining our function.

We start by evaluating the strings using literal_eval as before, however if there is a null value (caught by the try/except block) we return a properly formatted dictionary with -1 for the `initial` value. This will allow us to fill in a value of 0 for free games, then be left with an easily targetable value for the null rows.
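
The reason the try/except works is that missing values are read in as the float NaN, and `literal_eval` raises a `ValueError` when handed a float instead of a string. A minimal sketch, where `parse_or_default` is a hypothetical helper written only for this illustration:

```python
from ast import literal_eval

import numpy as np


def parse_or_default(value, default):
    """Hypothetical helper: evaluate a string literal, or return default for NaN."""
    try:
        return literal_eval(value)
    except ValueError:
        # literal_eval raises ValueError when given a float such as NaN
        return default


missing = parse_or_default(np.nan, {'currency': 'GBP', 'initial': -1})
present = parse_or_default("{'currency': 'GBP', 'initial': 719}", None)

print(missing['initial'], present['initial'])
```

A string that is not valid Python syntax at all would raise a `SyntaxError` instead, which this sketch deliberately does not catch, so genuinely malformed rows would still surface as errors.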

In [None]:
def process_price(df):
    df = df.copy()
        
    def parse_price(x):
        try:
            return literal_eval(x)
        except ValueError:
            return {'currency': 'GBP', 'initial': -1}
    
    # evaluate as dictionary and set to -1 if missing
    df['price_overview'] = df['price_overview'].apply(parse_price)
    
    # Create columns from currency and initial values
    df['currency'] = df['price_overview'].apply(lambda x: x['currency'])
    df['price'] = df['price_overview'].apply(lambda x: x['initial'])
    
    # Set price of free games to 0
    df.loc[df['is_free'], 'price'] = 0
    
    return df

price_data = process_price(steam_data)[['name', 'currency', 'price']]
price_data.head()

We're almost finished here, but let's check if any games don't have GBP listed as the currency.

In [None]:
price_data[price_data['currency'] != 'GBP']

For some reason we have four games listed in either USD or EUR. We could use the current exchange rate to try and convert them into GBP, however as there are only four rows we will simply drop them.

We will also divide prices by 100 so they are displayed as floats in pounds.

In [12]:
def process_price(df):
    """Process price_overview column into formatted price column."""
    df = df.copy()
    
    def parse_price(x):
        try:
            return literal_eval(x)
        except ValueError:
            return {'currency': 'GBP', 'initial': -1}
    
    # evaluate as dictionary and set to -1 if missing
    df['price_overview'] = df['price_overview'].apply(parse_price)
    
    # create columns from currency and initial values
    df['currency'] = df['price_overview'].apply(lambda x: x['currency'])
    df['price'] = df['price_overview'].apply(lambda x: x['initial'])
    
    # set price of free games to 0
    df.loc[df['is_free'], 'price'] = 0
    
    # remove non-GBP rows
    df = df[df['currency'] == 'GBP']
    
    # change price to display in pounds (only applying to rows with a value greater than 0)
    df.loc[df['price'] > 0, 'price'] /= 100
    
    # remove columns no longer needed
    df = df.drop(['is_free', 'currency', 'price_overview'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    
    return df


steam_data = process(raw_steam_data)
steam_data[['name', 'price']].head()

Unnamed: 0,name,price
0,Counter-Strike,7.19
1,Team Fortress Classic,3.99
2,Day of Defeat,3.99
3,Deathmatch Classic,3.99
4,Half-Life: Opposing Force,3.99
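
As a minimal sketch of the pence-to-pounds step, the toy frame below (hypothetical values, not taken from the data set) shows why only positive prices are divided: the -1 marker for missing prices and the 0 for free games should survive unchanged. It uses `Series.where` rather than the in-place division in `process_price`, purely as an illustration.

```python
import pandas as pd

# Hypothetical toy values in pence; -1 marks a missing price, 0 a free game
toy = pd.DataFrame({
    'name': ['paid game', 'free game', 'unknown price'],
    'price': [719, 0, -1],
})

# Keep marker values (<= 0) untouched and convert everything else to pounds
toy['price'] = toy['price'].where(toy['price'] <= 0, toy['price'] / 100)

print(toy)
```

Dividing every row would turn the -1 marker into -0.01, which would be harder to spot later on.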


### Processing Description Columns

Next we have a series of columns with descriptive text about each game: `detailed_description`, `about_the_game` and `short_description`. These columns could be used as the basis for an interesting recommender or key-word analysis project, however they are not required in our current project and should be removed from our final data set as they take up large amounts of space.

In case we find some anomalies, let's inspect these columns anyway.

In [None]:
steam_data[['detailed_description', 'about_the_game', 'short_description']].isnull().sum()

It looks like we have 24 rows with missing data for these columns, and chances are the 24 rows with missing `detailed_description` are the rows with missing `about_the_game` and `short_description` data too. 

By inspecting the individual rows below, we can see that this is true - all rows with missing data in one description column have missing data in the others too.

In [None]:
steam_data[steam_data['detailed_description'].isnull()]
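
The same overlap could also be checked without reading the rows by eye. The sketch below uses a small hypothetical frame in place of `steam_data` and compares the number of rows missing any description with the number missing all three; if the counts match, no row is only partially missing.

```python
import numpy as np
import pandas as pd

desc_cols = ['detailed_description', 'about_the_game', 'short_description']

# Hypothetical toy frame: the second row is missing all three descriptions
toy = pd.DataFrame({
    'steam_appid': [10, 20, 30],
    'detailed_description': ['text', np.nan, 'text'],
    'about_the_game': ['text', np.nan, 'text'],
    'short_description': ['text', np.nan, 'text'],
})

missing = toy[desc_cols].isnull()

# Rows missing at least one description vs rows missing every description
any_missing = missing.any(axis=1).sum()
all_missing = missing.all(axis=1).sum()

# If the two counts match, no row is only partially missing
print(any_missing, all_missing, any_missing == all_missing)
```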

Browsing these games it looks like about half are old PopCap games from 2006 and about half are from Telltale Games, similar to the Sam & Max title we encountered in the previous section.

There is also a dedicated server and a game which is now unlisted on the steam store. It would definitely be best to remove these two.

Let's remove these rows for now, but we can reintroduce them later if we wish.

As stated, the description columns may be useful for future projects, so before we remove them from this data set we will export them as a csv file. We will include the steam_appid column in this export as it will allow us to match up these rows with rows in our primary data set later on, using a merge (or a join in SQL). We will write a short function to handle this, which we can re-use later on if we have any more dataframes that need exporting.

In [13]:
def export_data(df, filename):
    """Export dataframe to csv file, filename prepended with 'steam_'.
    
    filename : str without file extension
    """
    filepath = '../data/exports/steam_' + filename + '.csv'
    formatted_name = filename.replace('_', ' ')
    
    df.to_csv(filepath, index=False)
    print("Exported {} to '{}'".format(formatted_name, filepath))

    
def process_descriptions(df, export=False):
    """Export descriptions to external csv file then remove these columns."""
    # remove rows with missing description data
    df = df[df['detailed_description'].notnull()].copy()
    
    if export:
        # create dataframe of description columns and export to csv
        description_data = df[['steam_appid', 'detailed_description', 'about_the_game', 'short_description']]
        
        export_data(description_data, filename='description_data')
    
    # drop description columns from main dataframe
    df = df.drop(['detailed_description', 'about_the_game', 'short_description'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_descriptions(df, export=True)
    
    return df


steam_data = process(raw_steam_data)
steam_data.head()

Exported description data to '../data/exports/steam_description_data.csv'


Unnamed: 0,name,steam_appid,required_age,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,packages,package_groups,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors,windows,mac,linux,price
0,Counter-Strike,10,3,"English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[7],"[{'name': 'default', 'title': 'Buy Counter-Str...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'url': 'http://steamcommunity.com/app/10', 'e...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,"{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,7.19
1,Team Fortress Classic,20,3,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/20/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[29],"[{'name': 'default', 'title': 'Buy Team Fortre...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/20/...,"{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,3.99
2,Day of Defeat,30,3,"English, French, German, Italian, Spanish - Spain",https://steamcdn-a.akamaihd.net/steam/apps/30/...,http://www.dayofdefeat.com/,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[30],"[{'name': 'default', 'title': 'Buy Day of Defe...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/30/...,"{'ids': [], 'notes': None}",1,1,1,3.99
3,Deathmatch Classic,40,3,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/40/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[31],"[{'name': 'default', 'title': 'Buy Deathmatch ...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/40/...,"{'ids': [], 'notes': None}",1,1,1,3.99
4,Half-Life: Opposing Force,50,3,"English, French, German, Korean",https://steamcdn-a.akamaihd.net/steam/apps/50/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Gearbox Software'],['Valve'],[32],"[{'name': 'default', 'title': 'Buy Half-Life: ...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'url': 'https://help.steampowered.com', 'emai...",https://steamcdn-a.akamaihd.net/steam/apps/50/...,"{'ids': [], 'notes': None}",1,1,1,3.99


In [None]:
# inspect exported data
pd.read_csv('../data/exports/steam_description_data.csv').head()
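
To illustrate how the exported descriptions could later be matched back to the main data on `steam_appid`, here is a minimal sketch using two small hypothetical frames rather than the real files. A left merge is assumed here so that apps without a description are kept.

```python
import pandas as pd

# Hypothetical stand-ins for the cleaned data and the exported descriptions
main = pd.DataFrame({'steam_appid': [10, 20, 30],
                     'name': ['Counter-Strike', 'Team Fortress Classic', 'Day of Defeat']})
descriptions = pd.DataFrame({'steam_appid': [10, 30],
                             'short_description': ['first description', 'third description']})

# A left merge keeps every row of the main data, filling NaN where no description exists
merged = main.merge(descriptions, on='steam_appid', how='left')
print(merged)
```

An inner merge (`how='inner'`) would instead keep only the apps present in both frames.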

### Processing Languages

The next column is `supported_languages`. As we will be performing the analysis for an English company, we are only interested in apps that are in English. Whilst we could remove non-English apps at this stage, we will instead create a column marking English apps with a binary flag: 1 if the app supports English, 0 otherwise.

We begin as usual by looking for rows with null values.

In [None]:
steam_data['supported_languages'].isnull().sum()

Taking a closer look at these apps, it's possible one or two are not in English. As there are only 4 rows affected we will go ahead and remove these from the data set.

In [None]:
steam_data[steam_data['supported_languages'].isnull()]

By looking at the value for the first row and the values for the most common rows, it looks like languages are stored as a string which can be anything from a comma-separated list of languages to a mix of html and headings. It seems reasonably safe to assume that if the app is in English, the word English will appear somewhere in this string. With this in mind we can simply search the string and return a value based on the result.

In [None]:
print(steam_data['supported_languages'][0])
steam_data['supported_languages'].value_counts().head(10)

In [14]:
def process_language(df):
    """Process supported_languages column into a boolean 'is english' column."""
    df = df.copy()
    
    # drop rows with missing language data
    df = df.dropna(subset=['supported_languages'])
    
    df['english'] = df['supported_languages'].apply(lambda x: 1 if 'english' in x.lower() else 0)
    df = df.drop('supported_languages', axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_descriptions(df)
    df = process_language(df)
    
    return df


steam_data = process(raw_steam_data)
steam_data[['name', 'english']].head()

Unnamed: 0,name,english
0,Counter-Strike,1
1,Team Fortress Classic,1
2,Day of Defeat,1
3,Deathmatch Classic,1
4,Half-Life: Opposing Force,1


Before moving on, we can take a quick look at our results and see that most of our apps are in English.

In [None]:
steam_data['english'].value_counts(dropna=False)
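
As an aside, the same flag could probably be computed with the vectorised `.str.contains` method instead of `apply`. The sketch below runs on a few hypothetical language strings, including the kind of HTML markup seen in the raw data.

```python
import pandas as pd

# Hypothetical language strings, including the HTML markup seen in the raw data
languages = pd.Series([
    'English<strong>*</strong>, French<strong>*</strong>',
    'German, French',
    'Simplified Chinese, english',
])

# Case-insensitive substring search, converted from True/False to 1/0
english = languages.str.contains('english', case=False).astype(int)
print(english.tolist())
```

Either approach relies on the same assumption: that the word English appears somewhere in the string whenever the app supports it.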

### Processing image columns

Similar to our description columns, we have three columns that appear to contain links to various images: `header_image`, `screenshots` and `background`. We will treat these in almost the same way, exporting the contents to a csv file then removing the columns from our data set.

Whilst we won't need this data for our current project, it could open the door to some interesting image analysis in the future.

First we check for missing values.

In [None]:
image_cols = ['header_image', 'screenshots', 'background']

for col in image_cols:
    print(col+':', steam_data[col].isnull().sum())

Again it is likely that the 15 rows with missing screenshots data are the same rows with missing background data.

As seen below, some of these rows are also missing `pc_requirements`, some have missing release dates (a blank string in the `date` part of `release_date`), and most have -1 for price, meaning we couldn't find any price data earlier.

It seems like it would be a good idea to remove these rows before proceeding.

In [None]:
steam_data[steam_data['screenshots'].isnull()]

There is also a `movies` column with similar data. Whilst it has more missing values, presumably for games without videos, it appears to contain names, thumbnails and links to various videos and trailers. It's unlikely we'll need them, but we can include them in the export and remove them from our data set.

In [None]:
steam_data['movies'].isnull().sum()

In [None]:
with pd.option_context("display.max_colwidth", 1000):
    print(steam_data[steam_data['movies'].notnull()]['movies'].head(3))

In [15]:
def process_images(df, export=False):
    """Remove image columns from dataframe, optionally exporting them to csv first."""
    df = df[df['screenshots'].notnull()].copy()
    
    if export:
        image_data = df[['steam_appid', 'header_image', 'screenshots', 'background', 'movies']]
        
        export_data(image_data, 'image_data')
        
    df = df.drop(['header_image', 'screenshots', 'background', 'movies'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_descriptions(df)
    df = process_language(df)
    df = process_images(df, export=True)
    
    return df


steam_data = process(raw_steam_data)
steam_data.head()

Exported image data to '../data/exports/steam_image_data.csv'


Unnamed: 0,name,steam_appid,required_age,website,pc_requirements,mac_requirements,linux_requirements,developers,publishers,packages,package_groups,categories,genres,achievements,release_date,support_info,content_descriptors,windows,mac,linux,price,english
0,Counter-Strike,10,3,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[7],"[{'name': 'default', 'title': 'Buy Counter-Str...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'url': 'http://steamcommunity.com/app/10', 'e...","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,7.19,1
1,Team Fortress Classic,20,3,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[29],"[{'name': 'default', 'title': 'Buy Team Fortre...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'url': '', 'email': ''}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,3.99,1
2,Day of Defeat,30,3,http://www.dayofdefeat.com/,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[30],"[{'name': 'default', 'title': 'Buy Day of Defe...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'url': '', 'email': ''}","{'ids': [], 'notes': None}",1,1,1,3.99,1
3,Deathmatch Classic,40,3,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[31],"[{'name': 'default', 'title': 'Buy Deathmatch ...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'url': '', 'email': ''}","{'ids': [], 'notes': None}",1,1,1,3.99,1
4,Half-Life: Opposing Force,50,3,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Gearbox Software'],['Valve'],[32],"[{'name': 'default', 'title': 'Buy Half-Life: ...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'url': 'https://help.steampowered.com', 'emai...","{'ids': [], 'notes': None}",1,1,1,3.99,1


In [None]:
# inspect exported data
pd.read_csv('../data/exports/steam_image_data.csv').head()

### Website and support info

Next we will look at the `website` and `support_info` columns, both containing links to external websites. There are a large number of rows with no website listed, and while there are no null values in the `support_info` column, it looks like many rows contain both an email and a URL inside the data.

For our data set we'll be dropping both of these columns, but it might be useful, or at least interesting, to extract this data and export it to a csv file as we have before.

Below we can see the null counts and some example rows.

In [None]:
print('website null counts:', steam_data['website'].isnull().sum())
print('support_info null counts:', steam_data['support_info'].isnull().sum())

with pd.option_context("display.max_colwidth", 100): # ensures strings not cut short
    display(steam_data[['name', 'website', 'support_info']][80:85])

We keep all the code that parses the columns inside the export if statement, so it only runs if we wish to export to csv. We don't need to worry that the rows with missing website data contain NaN whereas the other two columns contain a blank string for missing data, as once we have exported to csv they will be treated the same.

In [16]:
def process_info(df, export=False):
    """Drop support information from dataframe, optionally exporting beforehand."""
    if export:
        support_info = df[['steam_appid', 'website', 'support_info']].copy()
        
        support_info['support_info'] = support_info['support_info'].apply(lambda x: literal_eval(x))
        support_info['support_url'] = support_info['support_info'].apply(lambda x: x['url'])
        support_info['support_email'] = support_info['support_info'].apply(lambda x: x['email'])
        
        support_info = support_info.drop('support_info', axis=1)
        
        support_info = support_info[(support_info['website'].notnull()) | (support_info['support_url'] != '') | (support_info['support_email'] != '')]

        export_data(support_info, 'support_info')
    
    df = df.drop(['website', 'support_info'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_descriptions(df)
    df = process_language(df)
    df = process_images(df)
    df = process_info(df, export=True)
    
    return df


steam_data = process(raw_steam_data)
steam_data.head()

Exported support info to '../data/exports/steam_support_info.csv'


Unnamed: 0,name,steam_appid,required_age,pc_requirements,mac_requirements,linux_requirements,developers,publishers,packages,package_groups,categories,genres,achievements,release_date,content_descriptors,windows,mac,linux,price,english
0,Counter-Strike,10,3,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[7],"[{'name': 'default', 'title': 'Buy Counter-Str...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,7.19,1
1,Team Fortress Classic,20,3,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[29],"[{'name': 'default', 'title': 'Buy Team Fortre...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,3.99,1
2,Day of Defeat,30,3,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[30],"[{'name': 'default', 'title': 'Buy Day of Defe...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'ids': [], 'notes': None}",1,1,1,3.99,1
3,Deathmatch Classic,40,3,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Valve'],['Valve'],[31],"[{'name': 'default', 'title': 'Buy Deathmatch ...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'ids': [], 'notes': None}",1,1,1,3.99,1
4,Half-Life: Opposing Force,50,3,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",['Gearbox Software'],['Valve'],[32],"[{'name': 'default', 'title': 'Buy Half-Life: ...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'ids': [], 'notes': None}",1,1,1,3.99,1


In [None]:
# inspect exported file
pd.read_csv('../data/exports/steam_support_info.csv').head()
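
The earlier point about NaN and blank strings being treated the same once exported can be illustrated with a small round trip. This is only a sketch: it uses hypothetical rows and an in-memory buffer instead of a file on disk, and relies on the default settings of `to_csv` and `read_csv`.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical rows: one missing website stored as NaN, one as a blank string
toy = pd.DataFrame({
    'steam_appid': [10, 20, 30],
    'website': [np.nan, '', 'http://www.dayofdefeat.com/'],
})

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

# Both the NaN and the blank string come back as NaN
print(round_trip['website'].isnull().tolist())
```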

### System Requirements

At first it looks like we have data for every row.

In [None]:
req_cols = ['pc_requirements', 'mac_requirements', 'linux_requirements']

print('null counts:\n')

for col in req_cols:
    print(col+':', steam_data[col].isnull().sum())

However, if we look at the data a little more closely, we see that some rows actually contain an empty list. These won't appear as null rows, but once evaluated they won't provide any information, so they are effectively null values.

In [None]:
steam_data[['steam_appid', 'pc_requirements', 'mac_requirements', 'linux_requirements']].tail()

We can check how many rows in each requirements column have empty lists using a simple boolean filter. The first value in the `shape` attribute of the filtered dataframe gives the number of matching rows.

In [None]:
print('Empty list counts:\n')

for col in req_cols:
    print(col+':', steam_data[steam_data[col] == '[]'].shape[0])

That's over half of the rows for both mac and linux requirements. That probably means that there is not enough data in these two columns to be useful for our analysis.

It turns out most games are developed solely for windows, with mac and linux ports only growing in number in recent years. Naturally, it makes sense that any games that aren't supported on mac or linux would not have corresponding requirements.

As we have already cleaned our platforms column, we can check how many rows actually have missing data by comparing rows with empty lists in the requirements with data in the respective platform columns (mac/linux). If a row has an empty list in the requirements column but a 1 (True) in the platform column, it means the data is missing.

In [None]:
for col in ['mac_requirements', 'linux_requirements']:
    platform = col.split('_')[0]
    print(platform+':', steam_data[(steam_data[col] == '[]') & (steam_data[platform])].shape[0])
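
The logic of this check can be seen more easily on a toy example. The frame below is hypothetical; it assumes the platform column holds 1/0 flags, which are converted to booleans explicitly before being combined with the empty-list filter.

```python
import pandas as pd

# Hypothetical toy rows: requirements stored as strings, platforms as 1/0 flags
toy = pd.DataFrame({
    'mac_requirements': ['[]', "{'minimum': 'OS X 10.6'}", '[]'],
    'mac': [1, 1, 0],
})

empty_requirements = toy['mac_requirements'] == '[]'
supports_mac = toy['mac'].astype(bool)

# Empty requirements only count as missing when the platform is supported
truly_missing = (empty_requirements & supports_mac).sum()
print(truly_missing)
```

Only the first row counts as missing here: the third row also has an empty list, but the game doesn't support mac, so no requirements are expected.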

Whilst not an insignificant number, this means that the vast majority of rows are as they should be, and we're not looking at too many data errors.

Let's also have a look for missing values in the pc/windows column. We couldn't include it in our previous loop as the columns have different names, something we may wish to change later.

In [None]:
print('windows:', steam_data[(steam_data['pc_requirements'] == '[]') & (steam_data['windows'])].shape[0])

11 rows have missing system requirements. We can take a look at some of them below, and follow the links to the steam pages to try and discover if anything is amiss.

In [None]:
missing_windows_requirements = steam_data[(steam_data['pc_requirements'] == '[]') & (steam_data['windows'])]

print_steam_links(missing_windows_requirements[:5])
missing_windows_requirements.head()

There doesn't appear to be any common issue in these rows - some of the games are quite old but that's about it. It may simply be that no requirements were supplied when the games were added to the steam store.

Let's say that the fictional company we're doing analysis for is interested in developing for windows only. Also we can assume that a cross-platform game will have similar requirements in terms of hardware for each platform it supports. With this in mind we can safely drop both the mac and linux requirements columns, as we already know which games support these operating systems by our cleaned platform columns. That means we can focus on the pc_requirements column, which has information for almost every game in our data.

Now we will take a look at a couple of rows from the dataset to see how the data is stored.

In [None]:
display(steam_data['pc_requirements'].iloc[0])
display(steam_data['pc_requirements'].iloc[2000])
display(steam_data['pc_requirements'].iloc[15000])

In short: it's a mess. It looks like the data is stored as a dictionary, as we've seen before. There is definitely a key for 'minimum', but apart from that it is hard to see at a glance. The strings are full of html formatting, which is presumably parsed to display the information on the website. It also looks like there are different categories like Processor and Memory for some, but not all, rows.

Let's take a stab at cleaning out some of the unnecessary formatting and see if it becomes clearer.

By creating a dataframe from a selection of rows, we can easily and quickly make changes using the pandas .str accessor, allowing us to use python string formatting and regular expressions.

In [None]:
view_requirements = steam_data['pc_requirements'].iloc[[0, 2000, 15000]].copy()

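# regex=True is passed explicitly, as newer versions of pandas default to literal matching
view_requirements = (view_requirements
                         .str.replace(r'\\[rtn]', '', regex=True)
                         .str.replace(r'<[pbr]{1,2}>', ' ', regex=True)
                         .str.replace(r'<[\/"=\w\s]+>', '', regex=True)
                    )

for i, row in view_requirements.items():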
    display(row)
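
To see what each of the three patterns does, here is a small sketch applying them one at a time with Python's built-in `re` module. The input string is a shortened, hypothetical value written in the style of the raw data, not an actual row.

```python
import re

# Hypothetical raw value, written the way it appears in the csv: the \r, \n and \t
# are literal backslash sequences inside the string, not real control characters
raw = r"{'minimum': '\r\n\t<p><strong>Minimum:</strong> 500 mhz processor<br>96mb ram</p>'}"

cleaned = re.sub(r'\\[rtn]', '', raw)             # drop literal \r, \t and \n sequences
cleaned = re.sub(r'<[pbr]{1,2}>', ' ', cleaned)   # turn <p>, <b> and <br> into spaces
cleaned = re.sub(r'<[\/"=\w\s]+>', '', cleaned)   # remove any remaining simple tags

print(cleaned)
```

Replacing the paragraph and line-break tags with a space, rather than nothing, stops neighbouring words from running together once the tags are gone.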

We can now see more clearly the contents and structure of these rows. Some rows have both Minimum and Recommended requirements inside a 'minimum' key, some have separate 'minimum' and 'recommended' keys. Some have headings like 'Processor:' and 'Storage:' before various components, others simply have a list of components. Some state particular speeds for components, like a 2 GHz CPU, while others state specific models, like 'Intel Core 2 Duo'.

It seems like it would be possible to extract individual component information from this data, however it would be a lengthy and complex process requiring the handling of many exceptions and individual cases. Whilst we may wish to tackle this in the future, as it could provide an interesting window into how the demands of gaming have changed over the years, it won't necessarily provide us with useful information for our current objectives.

With that in mind, it seems best to proceed by cleaning the data slightly so it is readable, exporting to an external csv for future use, then removing the columns from our dataframe.

In [17]:
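def process_requirements(df, export=False):
    """Drop requirements columns from dataframe, optionally exporting pc_requirements to csv first."""
    if export:
        requirements = df[['steam_appid', 'pc_requirements']].copy()
        
        # remove rows with empty requirements (copy to avoid modifying a slice)
        requirements = requirements[requirements['pc_requirements'] != '[]'].copy()
        
        # strip escaped whitespace and html tags; regex=True is explicit as newer pandas defaults to literal matching
        requirements['requirements_clean'] = (requirements['pc_requirements']
                                                  .str.replace(r'\\[rtn]', '', regex=True)
                                                  .str.replace(r'<[pbr]{1,2}>', ' ', regex=True)
                                                  .str.replace(r'<[\/"=\w\s]+>', '', regex=True)
                                             )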
        
        export_data(requirements, 'requirements_data')
        
    df = df.drop(['pc_requirements', 'mac_requirements', 'linux_requirements'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_descriptions(df)
    df = process_language(df)
    df = process_images(df)
    df = process_info(df)
    df = process_requirements(df, export=True)
    
    return df


steam_data = process(raw_steam_data)
steam_data.head()

Exported requirements data to '../data/exports/steam_requirements_data.csv'


Unnamed: 0,name,steam_appid,required_age,developers,publishers,packages,package_groups,categories,genres,achievements,release_date,content_descriptors,windows,mac,linux,price,english
0,Counter-Strike,10,3,['Valve'],['Valve'],[7],"[{'name': 'default', 'title': 'Buy Counter-Str...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,7.19,1
1,Team Fortress Classic,20,3,['Valve'],['Valve'],[29],"[{'name': 'default', 'title': 'Buy Team Fortre...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,3.99,1
2,Day of Defeat,30,3,['Valve'],['Valve'],[30],"[{'name': 'default', 'title': 'Buy Day of Defe...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'ids': [], 'notes': None}",1,1,1,3.99,1
3,Deathmatch Classic,40,3,['Valve'],['Valve'],[31],"[{'name': 'default', 'title': 'Buy Deathmatch ...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'ids': [], 'notes': None}",1,1,1,3.99,1
4,Half-Life: Opposing Force,50,3,['Gearbox Software'],['Valve'],[32],"[{'name': 'default', 'title': 'Buy Half-Life: ...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'ids': [], 'notes': None}",1,1,1,3.99,1


In [None]:
# verify export
pd.read_csv('../data/exports/steam_requirements_data.csv').head()

### Processing developers and publishers

The next two columns, developers and publishers, will most likely contain similar information so we can look at them together. 

We'll start by checking the null counts. While the publishers column doesn't appear to have any null values at first, if we search for lists containing only an empty string (`['']`) we see that we have 227 hidden null values.

In [None]:
print('developers null counts:', steam_data['developers'].isnull().sum())
print('developers empty list counts:', steam_data[steam_data['developers'] == "['']"].shape[0])

print('\npublishers null counts:', steam_data['publishers'].isnull().sum())
print('publishers empty list counts:', steam_data[steam_data['publishers'] == "['']"].shape[0])
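
To see why `isnull` misses these rows, it helps to evaluate one of the stored strings. This is a minimal sketch using a literal string rather than the real data: the value `"['']"` is a perfectly valid (non-null) string which, once evaluated, turns out to be a list holding a single empty string.

```python
from ast import literal_eval

# The publishers column stores lists as strings; this is how a missing publisher looks
raw_value = "['']"

parsed = literal_eval(raw_value)

print(type(raw_value).__name__)   # the cell holds a string, so isnull() reports False
print(parsed)                     # a list containing one empty string
print(any(parsed))                # False: the list holds no non-empty names
```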

In [None]:
no_dev = steam_data[steam_data['developers'].isnull()]

print('Total games missing developer:', no_dev.shape[0], '\n')
print_steam_links(no_dev[:5])

no_dev.head()

In [None]:
no_pub = steam_data[steam_data['publishers'] == "['']"]

print('Total games missing publisher:', no_pub.shape[0], '\n')
print_steam_links(no_pub[:5])

no_pub.head()

In [None]:
no_dev_or_pub = steam_data[(steam_data['developers'].isnull()) & (steam_data['publishers'] == "['']")]

print('Total games missing developer and publisher:', no_dev_or_pub.shape[0], '\n')
print_steam_links(no_dev_or_pub[:5])

no_dev_or_pub.head()

Options:
- remove rows with missing developer or publisher information
- impute missing information by replacing missing columns with the column we have
- write missing information as 'unknown' or none
- keep everything
- remove rows with both missing developer and publisher information

In [None]:
def process_developers_and_publishers(df):
    num_rows = df.shape[0]
    df = df[(df['developers'].notnull()) & (df['publishers'] != "['']")].copy()
    print('Before:', num_rows, '\nAfter:', df.shape[0], '\nRows dropped:', num_rows - df.shape[0])
            
    df['developers'] = df['developers'].apply(lambda x: literal_eval(x))
    df['publishers'] = df['publishers'].apply(lambda x: literal_eval(x))
    
    df['developer'] = df['developers'].apply(lambda x: x[0])
    df['publisher'] = df['publishers'].apply(lambda x: x[0])
    
    df['other_developers'] = df['developers'].apply(lambda x: ', '.join(x[1:]) if len(x) > 1 else np.nan)
    df['other_publishers'] = df['publishers'].apply(lambda x: ', '.join(x[1:]) if len(x) > 1 else np.nan)

    df = df.drop(['developers', 'publishers'], axis=1)
    
    return df

dev_pub_data = process_developers_and_publishers(steam_data)
dev_pub_data[['developer', 'publisher', 'other_developers', 'other_publishers']].head()

It may be worth investigating how many rows actually have other developers or publishers, as the other_developers and other_publishers columns are filled with null values for the first few rows.

In [None]:
print('Null counts:\n')

for col in ['developer', 'publisher', 'other_developers', 'other_publishers']:
    print(col + ':', dev_pub_data[col].isnull().sum())

It turns out that most games only have one developer and one publisher, so the other_developers and other_publishers columns are almost entirely null and of little use. It may be better to combine each pair of columns into one. We can do this fairly easily using the Python join method on a string. If we call join on a comma-and-space separator, a list with a single developer/publisher simply returns that value, whereas a list with multiple values returns a comma-separated string, like so:

In [None]:
', '.join(['one item'])

In [None]:
', '.join(['multiple', 'different', 'items'])

We can now modify and finish our function, and will be ready to move on to the next column.

In [18]:
def process_developers_and_publishers(df):
    df = df[(df['developers'].notnull()) & (df['publishers'] != "['']")].copy()
            
    df['developers'] = df['developers'].apply(lambda x: literal_eval(x))
    df['publishers'] = df['publishers'].apply(lambda x: literal_eval(x))
    
    df['developer'] = df['developers'].apply(lambda x: ', '.join(x))
    df['publisher'] = df['publishers'].apply(lambda x: ', '.join(x))

    df = df.drop(['developers', 'publishers'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_descriptions(df)
    df = process_language(df)
    df = process_images(df)
    df = process_info(df)
    df = process_requirements(df)
    df = process_developers_and_publishers(df)
    
    return df

steam_data = process(raw_steam_data)
steam_data.head()

Unnamed: 0,name,steam_appid,required_age,packages,package_groups,categories,genres,achievements,release_date,content_descriptors,windows,mac,linux,price,english,developer,publisher
0,Counter-Strike,10,3,[7],"[{'name': 'default', 'title': 'Buy Counter-Str...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,7.19,1,Valve,Valve
1,Team Fortress Classic,20,3,[29],"[{'name': 'default', 'title': 'Buy Team Fortre...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,3.99,1,Valve,Valve
2,Day of Defeat,30,3,[30],"[{'name': 'default', 'title': 'Buy Day of Defe...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
3,Deathmatch Classic,40,3,[31],"[{'name': 'default', 'title': 'Buy Deathmatch ...","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,3,[32],"[{'name': 'default', 'title': 'Buy Half-Life: ...","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Gearbox Software,Valve


### Processing Packages

We are not especially interested in the `packages` and `package_groups` columns, except for rows where price data is missing (which we earlier filled with -1). We can now easily investigate these rows. Overall we have 811 rows with missing price data.

In [None]:
print(steam_data[steam_data['price'] == -1].shape[0])

We can split these rows into two categories: those with package_groups data and those without. If we take a quick look at the package_groups column we see that there are no null values, but rows without data are stored as empty lists.

In [None]:
print('Null counts:', steam_data['package_groups'].isnull().sum())
print('Empty list counts:', steam_data[steam_data['package_groups'] == "[]"].shape[0])

Using a combination of filters, we can find out how many rows have both missing price and package_group data and investigate.

In [None]:
missing_price_and_package = steam_data[(steam_data['price'] == -1) & (steam_data['package_groups'] == "[]")]

print('Number of rows:', missing_price_and_package.shape[0], '\n')

print('First few rows:\n')
print_steam_links(missing_price_and_package[:5])

print('\nLast few rows:\n')
print_steam_links(missing_price_and_package[-10:-5])

missing_price_and_package.head()

Most of our games with missing price data fall into the above category. From looking at the first few rows on the store page, it looks like they are currently unavailable or have been delisted from the store. Looking at the last few rows, it appears most of them haven't been released and haven't had a price set. We will take care of all unreleased games when we clean the release_date column, but we can remove all of these apps now.

Let's now take a look at the apps that have missing price data but do have package_groups data.

In [None]:
missing_price_have_package = steam_data.loc[(steam_data['price'] == -1) & (steam_data['package_groups'] != "[]"), ['name', 'steam_appid', 'package_groups', 'price']]

print('Number of rows:', missing_price_have_package.shape[0], '\n')

print('First few rows:\n')
print_steam_links(missing_price_have_package[:5])

print('\nLast few rows:\n')
print_steam_links(missing_price_have_package[-10:-5])

display(missing_price_have_package.head())
missing_price_have_package.iloc[-10:-5]

Looking at a selection of these rows, the games appear to be one of the following: superseded by a newer release or remaster, part of a bigger bundle of games or an episodic series, or included with the purchase of another game.

Whilst we could extract prices from the package_groups data, the most sensible option seems to be removing these rows. Since our logic interacts heavily with the price data we will rewrite the process_price function rather than putting this logic inside its own function.
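
For reference, extracting a price from `package_groups` might look something like the sketch below. It assumes each package group holds a `subs` list whose entries carry a `price_in_cents_with_discount` key; neither key is visible in the truncated output above, so treat the sample string and the helper `cheapest_package_price` as illustrative only.

```python
from ast import literal_eval

# Hypothetical, simplified package_groups string; the real column holds more keys
sample = (
    "[{'name': 'default', 'title': 'Buy Example Game', 'subs': ["
    "{'packageid': 1, 'price_in_cents_with_discount': 719}, "
    "{'packageid': 2, 'price_in_cents_with_discount': 1499}]}]"
)


def cheapest_package_price(package_groups):
    """Return the lowest package price in pounds, or -1 if none can be found."""
    groups = literal_eval(package_groups)
    prices = [
        sub['price_in_cents_with_discount']
        for group in groups
        for sub in group.get('subs', [])
        if 'price_in_cents_with_discount' in sub
    ]
    return min(prices) / 100 if prices else -1


print(cheapest_package_price(sample))
print(cheapest_package_price("[]"))
```

A package price may cover a bundle rather than the single game, so a value recovered this way would not necessarily be comparable with the other prices.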

In [19]:
def process_price(df):
    """Process price_overview column into formatted price column, and take care of package columns."""
    df = df.copy()
    
    def parse_price(x):
        try:
            return literal_eval(x)
        except ValueError:
            return {'currency': 'GBP', 'initial': -1}
    
    # evaluate as dictionary and set to -1 if missing
    df['price_overview'] = df['price_overview'].apply(parse_price)
    
    # create columns from currency and initial values
    df['currency'] = df['price_overview'].apply(lambda x: x['currency'])
    df['price'] = df['price_overview'].apply(lambda x: x['initial'])
    
    # set price of free games to 0
    df.loc[df['is_free'], 'price'] = 0
    
    # remove non-GBP rows
    df = df[df['currency'] == 'GBP']
    
    # remove rows where price is -1
    df = df[df['price'] != -1]
    
    # change price to display in pounds (can apply to all now -1 rows removed)
    df['price'] /= 100
    
    # remove columns no longer needed
    df = df.drop(['is_free', 'currency', 'price_overview', 'packages', 'package_groups'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_descriptions(df)
    df = process_language(df)
    df = process_images(df)
    df = process_info(df)
    df = process_requirements(df)
    df = process_developers_and_publishers(df)
    
    return df

steam_data = process(raw_steam_data)
steam_data.head()

Unnamed: 0,name,steam_appid,required_age,categories,genres,achievements,release_date,content_descriptors,windows,mac,linux,price,english,developer,publisher
0,Counter-Strike,10,3,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,7.19,1,Valve,Valve
1,Team Fortress Classic,20,3,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,3.99,1,Valve,Valve
2,Day of Defeat,30,3,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
3,Deathmatch Classic,40,3,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,3,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Gearbox Software,Valve
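
The fallback in `parse_price` works because `literal_eval` raises a `ValueError` when handed something that is not a string, which is presumably what a missing `price_overview` cell looks like once loaded (a float NaN). A small sketch of that behaviour outside of pandas, with made-up sample values:

```python
from ast import literal_eval


def parse_price(x):
    """Evaluate a price_overview string, falling back to a -1 placeholder."""
    try:
        return literal_eval(x)
    except ValueError:
        return {'currency': 'GBP', 'initial': -1}


# Made-up values: one populated cell and one missing cell (NaN)
present = "{'currency': 'GBP', 'initial': 719}"
missing = float('nan')

print(parse_price(present)['initial'])
print(parse_price(missing)['initial'])
```

Note that a malformed string raises `SyntaxError` rather than `ValueError`, so it would not be caught by this fallback.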


### Processing Categories and Genres

Drop rows with missing categories/genres?

In [None]:
print(steam_data['categories'].isnull().sum())

In [None]:
print(steam_data['categories'][0])

with pd.option_context("display.max_colwidth", 1000):
    display(steam_data['categories'].head())

In [None]:
print_steam_links(steam_data[steam_data['categories'].isnull()].tail(20))

{'Single-player', 'Steam Leaderboards', 'Steam Workshop', 'MMO', 'Online Multi-Player', 'Steam Turn Notifications', 'In-App Purchases', 'Commentary available', 'Local Co-op', 'Partial Controller Support', 'Captions available', 'Steam Achievements', 'Steam Trading Cards', 'VR Support', 'Includes level editor', 'Online Co-op', 'SteamVR Collectibles', 'Local Multi-Player', 'Mods (require HL2)', 'Multi-player', 'Full controller support', 'Shared/Split Screen', 'Stats', 'Co-op', 'Includes Source SDK', 'Steam Cloud', 'Valve Anti-Cheat enabled', 'Cross-Platform Multiplayer', 'Mods'}


Unnamed: 0,steam_appid,c_captions_available,c_co_op,c_commentary_available,c_cross_platform_multiplayer,c_full_controller_support,c_in_app_purchases,c_includes_source_sdk,c_includes_level_editor,c_local_co_op,c_local_multi_player,c_mmo,c_mods,c_mods_require_hl2,c_multi_player,c_online_co_op,c_online_multi_player,c_partial_controller_support,c_shared_or_split_screen,c_single_player,c_stats,c_steam_achievements,c_steam_cloud,c_steam_leaderboards,c_steam_trading_cards,c_steam_turn_notifications,c_steam_workshop,c_steamvr_collectibles,c_vr_support,c_valve_anti_cheat_enabled
0,10,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
1,20,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
2,30,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,40,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
4,50,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1


Exported category data to '../data/exports/steam_category_data.csv'


Unnamed: 0,name,steam_appid,required_age,genres,achievements,release_date,content_descriptors,windows,mac,linux,price,english,developer,publisher
0,Counter-Strike,10,3,"[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,7.19,1,Valve,Valve
1,Team Fortress Classic,20,3,"[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,3.99,1,Valve,Valve
2,Day of Defeat,30,3,"[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
3,Deathmatch Classic,40,3,"[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,3,"[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Gearbox Software,Valve


In [None]:
print(steam_data['genres'].isnull().sum())

In [None]:
print(steam_data['genres'][0])

with pd.option_context("display.max_colwidth", 1000):
    display(steam_data['genres'].iloc[100:105])

In [None]:
print_steam_links(steam_data[steam_data['genres'].isnull()].head(10))
print_steam_links(steam_data[steam_data['genres'].isnull()].tail(10))

In [None]:
steam_data[(steam_data['genres'].isnull()) | (steam_data['categories'].isnull())]

In [292]:
def process_categories(df, export=False):
    df = df[df['categories'].notnull()].copy()
    
    if export:
        category_data = df[['steam_appid', 'categories']].copy()

        category_data['categories'] = category_data['categories'].apply(lambda x: [item['description'] for item in literal_eval(x)])

        cols = set(list(itertools.chain(*category_data['categories'])))
        
        for col in sorted(cols):
            col_name = 'c_' + (col.lower()
                                  .replace('-', '_')
                                  .replace(' ', '_')
                                  .replace('(', '')
                                  .replace(')', '')
                                  .replace('/', '_or_')
                              )
            category_data[col_name] = category_data['categories'].apply(lambda x: 1 if col in x else 0)
        
        category_data = category_data.drop('categories', axis=1)
        export_data(category_data, 'category_data')
    
    df = df.drop('categories', axis=1)
    
    return df


def process_genres(df, export=False):
    df = df[df['genres'].notnull()].copy()
    
    if export:
        genre_data = df[['steam_appid', 'genres']].copy()

        genre_data['genres'] = genre_data['genres'].apply(lambda x: [item['description'] for item in literal_eval(x)])
        
        cols = set(list(itertools.chain(*genre_data['genres'])))

        for col in sorted(cols):
            col_name = 'g_' + (col.lower()
                            .replace(' ', '_')
                            .replace('&', 'and')
                       )
            genre_data[col_name] = genre_data['genres'].apply(lambda x: 1 if col in x else 0)

        genre_data = genre_data.drop('genres', axis=1)            
        export_data(genre_data, 'genre_data')
        
    df = df.drop('genres', axis=1)
    
    return df


process_categories(steam_data, export=True).head()
process_genres(steam_data, export=True).head()

{'Utilities', 'Adventure', 'Design & Illustration', 'Nudity', 'Documentary', 'Tutorial', 'Software Training', 'Racing', 'Game Development', 'Education', 'Early Access', 'Animation & Modeling', 'Audio Production', 'Accounting', 'Massively Multiplayer', 'Free to Play', 'Simulation', 'Sports', 'Indie', 'Video Production', 'Violent', 'Photo Editing', 'Action', 'Casual', 'RPG', 'Sexual Content', 'Strategy', 'Web Publishing', 'Gore'}


Unnamed: 0,steam_appid,g_accounting,g_action,g_adventure,g_animation_and_modeling,g_audio_production,g_casual,g_design_and_illustration,g_documentary,g_early_access,g_education,g_free_to_play,g_game_development,g_gore,g_indie,g_massively_multiplayer,g_nudity,g_photo_editing,g_rpg,g_racing,g_sexual_content,g_simulation,g_software_training,g_sports,g_strategy,g_tutorial,g_utilities,g_video_production,g_violent,g_web_publishing
0,10,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,20,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,30,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,40,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,50,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Unnamed: 0,name,steam_appid,required_age,categories,achievements,release_date,content_descriptors,windows,mac,linux,price,english,developer,publisher
0,Counter-Strike,10,3,"[{'id': 1, 'description': 'Multi-player'}, {'i...",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,7.19,1,Valve,Valve
1,Team Fortress Classic,20,3,"[{'id': 1, 'description': 'Multi-player'}, {'i...",{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,3.99,1,Valve,Valve
2,Day of Defeat,30,3,"[{'id': 1, 'description': 'Multi-player'}, {'i...",{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
3,Deathmatch Classic,40,3,"[{'id': 1, 'description': 'Multi-player'}, {'i...",{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,3,"[{'id': 2, 'description': 'Single-player'}, {'...",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Gearbox Software,Valve


In [310]:
def expand_columns(df, col):
    df[col] = df[col].apply(lambda x: [item['description'] for item in literal_eval(x)])
    new_cols = set(list(itertools.chain(*df[col])))
    
    for new_col in sorted(new_cols):
        new_col_name = (new_col.lower()
                               .replace('-', '_')
                               .replace(' ', '_')
                               .replace('(', '')
                               .replace(')', '')
                               .replace('/', '_or_')
                               .replace('&', 'and')
                       )
        df[new_col_name] = df[col].apply(lambda x: 1 if new_col in x else 0)
            
    return df.drop(col, axis=1)


def process_categories(df, export=False):
    df = df[df['categories'].notnull()].copy()
    
    category_data = df[['steam_appid', 'categories']].copy()
    category_data = expand_columns(category_data, 'categories')
    
    if export:
        export_data(category_data, 'category_data')
    
    df = df.drop('categories', axis=1)
    
    return df


def process_genres(df, export=False):
    df = df[df['genres'].notnull()].copy()
    
    genre_data = df[['steam_appid', 'genres']].copy()
    genre_data = expand_columns(genre_data, 'genres')
        
    if export:    
        export_data(genre_data, 'genre_data')
        
    df = df.drop('genres', axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_descriptions(df)
    df = process_language(df)
    df = process_images(df)
    df = process_info(df)
    df = process_requirements(df)
    df = process_developers_and_publishers(df)
    df = process_categories(df, export=True)
    df = process_genres(df, export=True)
    
    return df

steam_data = process(raw_steam_data)
steam_data.head()

Exported category data to '../data/exports/steam_category_data.csv'
Exported genre data to '../data/exports/steam_genre_data.csv'


Unnamed: 0,name,steam_appid,required_age,achievements,release_date,content_descriptors,windows,mac,linux,price,english,developer,publisher
0,Counter-Strike,10,3,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,7.19,1,Valve,Valve
1,Team Fortress Classic,20,3,{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,3.99,1,Valve,Valve
2,Day of Defeat,30,3,{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
3,Deathmatch Classic,40,3,{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,3,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Gearbox Software,Valve


In [311]:
pd.read_csv('../data/exports/steam_category_data.csv').head()

Unnamed: 0,steam_appid,captions_available,co_op,commentary_available,cross_platform_multiplayer,full_controller_support,in_app_purchases,includes_source_sdk,includes_level_editor,local_co_op,local_multi_player,mmo,mods,mods_require_hl2,multi_player,online_co_op,online_multi_player,partial_controller_support,shared_or_split_screen,single_player,stats,steam_achievements,steam_cloud,steam_leaderboards,steam_trading_cards,steam_turn_notifications,steam_workshop,steamvr_collectibles,vr_support,valve_anti_cheat_enabled
0,10,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
1,20,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
2,30,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,40,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
4,50,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1


In [312]:
pd.read_csv('../data/exports/steam_genre_data.csv').head()

Unnamed: 0,steam_appid,accounting,action,adventure,animation_and_modeling,audio_production,casual,design_and_illustration,documentary,early_access,education,free_to_play,game_development,gore,indie,massively_multiplayer,nudity,photo_editing,rpg,racing,sexual_content,simulation,software_training,sports,strategy,tutorial,utilities,video_production,violent,web_publishing
0,10,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,20,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,30,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,40,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,50,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
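
The chain of `replace` calls inside `expand_columns` is what turns descriptions into the column names shown above. Pulled out into a hypothetical helper (`to_column_name` is not part of the notebook), the mapping can be checked on a few of the category and genre names:

```python
def to_column_name(description):
    """Convert a category or genre description into a snake_case column name."""
    return (description.lower()
                       .replace('-', '_')
                       .replace(' ', '_')
                       .replace('(', '')
                       .replace(')', '')
                       .replace('/', '_or_')
                       .replace('&', 'and'))


samples = ['Multi-player', 'Shared/Split Screen', 'Mods (require HL2)', 'Animation & Modeling']

for description in samples:
    print(description, '->', to_column_name(description))
```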


### Processing Achievements and Content Descriptors

In [314]:
steam_data.head()

Unnamed: 0,name,steam_appid,required_age,achievements,release_date,content_descriptors,windows,mac,linux,price,english,developer,publisher
0,Counter-Strike,10,3,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,7.19,1,Valve,Valve
1,Team Fortress Classic,20,3,{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,3.99,1,Valve,Valve
2,Day of Defeat,30,3,{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
3,Deathmatch Classic,40,3,{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,3,{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Gearbox Software,Valve


In [317]:
steam_data['achievements'].isnull().sum()

1855

In [322]:
literal_eval(steam_data['achievements'][9])

{'total': 33,
 'highlighted': [{'name': 'Defiant',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_hit_cancop_withcan.jpg'},
  {'name': 'Submissive',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_put_canintrash.jpg'},
  {'name': 'Malcontent',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_escape_apartmentraid.jpg'},
  {'name': 'What cat?',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_break_miniteleporter.jpg'},
  {'name': 'Trusty Hardware',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_get_crowbar.jpg'},
  {'name': 'Barnacle Bowling',
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_kill_barnacleswithbarrel.jpg'},
  {'name': "Anchor's Aweigh!",
   'path': 'https://steamcdn-a.akamaihd.net/steamcommunity/public/images/apps/220/hl2_get_airboat.jpg'},
  {'nam

In [325]:
steam_data['content_descriptors'].isnull().sum()

0

In [329]:
steam_data['content_descriptors'].value_counts().head(6)

{'ids': [], 'notes': None}                                                                                                                                                                  25394
{'ids': [2, 5], 'notes': None}                                                                                                                                                                427
{'ids': [1, 5], 'notes': None}                                                                                                                                                                251
{'ids': [5], 'notes': None}                                                                                                                                                                   127
{'ids': [1, 2, 5], 'notes': None}                                                                                                                                                             122
{'ids': [2, 5], 'notes': 'This

In [330]:
def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    df = df.drop(['achievements', 'content_descriptors'], axis=1)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_descriptions(df)
    df = process_language(df)
    df = process_images(df)
    df = process_info(df)
    df = process_requirements(df)
    df = process_developers_and_publishers(df)
    df = process_categories(df)
    df = process_genres(df)
    
    return df

steam_data = process(raw_steam_data)
steam_data.head()

Unnamed: 0,name,steam_appid,required_age,release_date,windows,mac,linux,price,english,developer,publisher
0,Counter-Strike,10,3,"{'coming_soon': False, 'date': '1 Nov, 2000'}",1,1,1,7.19,1,Valve,Valve
1,Team Fortress Classic,20,3,"{'coming_soon': False, 'date': '1 Apr, 1999'}",1,1,1,3.99,1,Valve,Valve
2,Day of Defeat,30,3,"{'coming_soon': False, 'date': '1 May, 2003'}",1,1,1,3.99,1,Valve,Valve
3,Deathmatch Classic,40,3,"{'coming_soon': False, 'date': '1 Jun, 2001'}",1,1,1,3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,3,"{'coming_soon': False, 'date': '1 Nov, 1999'}",1,1,1,3.99,1,Gearbox Software,Valve


### Processing Release Date

Cleaning the release date column will prove to be a far more involved challenge than it has any right to be, providing some interesting optimisation and learning opportunities along the way.

First we shall inspect the raw format of the column. As we can see below, it is stored as a dictionary-like string containing a value for `coming_soon` and `date`. From the first few rows it would appear that the dates are stored in a uniform format: day as an integer, month as a 3-character abbreviation, a comma, then the year as a four-digit number. We can parse this either using the Python built-in datetime module, or as we already have pandas imported, we can use the [pd.to_datetime](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) function.

Also, as our analysis will involve looking at ownership and sales data, games that are not released yet will not be useful to us. Intuitively, we can drop any titles which are coming soon, presumably those having this value set to true. (As a side note, once parsed it may be worth checking that no release dates fall beyond the current date, just to make doubly sure none slip through.)
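
As a minimal sketch of the parsing step using the built-in `datetime` module, assuming the day-month-year format seen in the first few rows (the format string and sample value here are illustrative, and `%b` relies on English month abbreviations in the current locale):

```python
from datetime import datetime

# Sample value in the format seen in the first few rows
date_string = '1 Nov, 2000'

# %d = day of month, %b = abbreviated month name, %Y = four-digit year
released = datetime.strptime(date_string, '%d %b, %Y')

print(released.date())

# Sanity check: a released title should not have a date in the future
print(released <= datetime.now())
```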

In [None]:
display(raw_steam_data['price_overview'][0])
display(raw_steam_data['release_date'][0])

In [None]:
steam_data[['name', 'release_date']].head()

One of the first steps in investigating this column is to check for null values. Luckily, it seems that the cleaning we have performed already has removed any null values from our data set, as seen below. This doesn't mean we have caught all empty values however, as we shall see shortly.

In [None]:
print('Null values:\n')
print('Raw data:', raw_steam_data['release_date'].isnull().sum())
print('Partially cleaned:', steam_data['release_date'].isnull().sum())

Another useful step in our exploration of this column is to look at the counts of unique values by using the [Series.value_counts](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) method in pandas. Immediately, this brings a couple of interesting (and for cleaning purposes, annoying) quirks of the data to light.

We can see that 64 rows have data for this column, but the date is stated as '' (an empty string). This means they do not have null values, but they don't have a date specified for some reason. This may be due to corruption, or it may be another reason entirely. We will probably have to decide what to do with these cases and investigate further.

Another issue we can notice is that while most of the dates are stored in the format we saw previously (day month, year), a couple are simply stored as the month and year (e.g. 'Apr 2016'). This means that the dates aren't all stored uniformly so we will have to be careful when parsing them later on, else we may run into problems or worse, errors.
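To illustrate why the mixed formats matter, the sketch below tries a single strict format against a few sample strings using only the standard library. The sample values and the `fits_format` helper are assumptions made for illustration.

```python
from datetime import datetime


def fits_format(text, fmt='%d %b, %Y'):
    """Hypothetical helper: does the string parse with the given format?"""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        # Raised when the string does not match the format
        return False


samples = ['13 Jul, 2018', 'Apr 2016', '']
results = {s: fits_format(s) for s in samples}
print(results)
```

Only the first sample fits the 'day month, year' format, so a single explicit format will not be enough for the whole column.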

In [None]:
steam_data['release_date'].value_counts(dropna=False)

Before we move on, let's quickly inspect some of the rows which have a blank date. Whilst many of the rows appear to have a fair amount of missing (NaN) data, there doesn't appear to be any clear pattern emerging (such as if they were all demos or DLC). With this in mind, it may be safest to leave them for now and return to them later, perhaps at the end once we have dealt with more of the columns. We may find that in handling the other columns, most or all of these rows will be removed anyway.

In [None]:
steam_data[steam_data['release_date'] == "{'coming_soon': False, 'date': ''}"]

# NOW in platform section, need to modify

So far the cleaning process has been relatively simple, requiring mainly checking for null values and dropping some rows or columns. Already we can see that handling the dates will be a little trickier.

Our first hurdle is getting python to recognise the data in the columns as dictionaries rather than just strings. This will allow us to access the different values separately, without having to do some unnecessarily complicated string formatting. As we can see below, even though the data looks like a dictionary it is in fact stored as a string.

In [None]:
print(type(steam_data['release_date'].iloc[0]))

steam_data['release_date'].iloc[0]

We can get around this using the handy [literal_eval](https://docs.python.org/3/library/ast.html#ast.literal_eval) function from the built-in `ast` module. As the name suggests, this will allow us to evaluate the string, and index into it as a dictionary.
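As a small self-contained sketch, here is `literal_eval` applied to a string in the same shape as our column. The `raw` value is a made-up sample, and the second half shows that `literal_eval` only accepts Python literals, which is why it is generally a safer choice than `eval` for data like this.

```python
from ast import literal_eval

# Made-up sample in the same shape as the release_date column
raw = "{'coming_soon': False, 'date': '1 Nov, 2000'}"

parsed = literal_eval(raw)
print(type(parsed))
print(parsed['date'])

# Anything other than a literal (here, a function call) is rejected
try:
    literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True

print(rejected)
```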

In [None]:
print(type(literal_eval(steam_data['release_date'].iloc[0])))

literal_eval(steam_data['release_date'].iloc[0])['date']

# End of copy to platform

Putting this all together, we'll be using the pandas [apply](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html) method to help us quickly evaluate all of the dates, and the `pd.to_datetime` function to interpret and store them as datetime objects. This will be particularly useful as it will allow us to search and sort our dataset when it comes to performing analysis. Say, for example, we only wish to examine games released in 2010: by converting our dates to a format Python recognises, this will be very easy to achieve.

As seen below, we can supply the to_datetime function with our date and pandas will automatically interpret the format. We can then inspect it or print an attribute like the year. We can also provide pandas with the format explicitly, so it knows what to look for and how to parse it, which may be [quicker for large sets of data](https://stackoverflow.com/questions/32034689/why-is-pandas-to-datetime-slow-for-non-standard-time-format-such-as-2014-12-31).

In [None]:
timestamp = pd.to_datetime(literal_eval(steam_data['release_date'].iloc[0])['date'])

print(timestamp)
print(timestamp.year)

pd.to_datetime(literal_eval(steam_data['release_date'].iloc[0])['date'], format='%d %b, %Y')

Below, I have included my first solution to processing the release date column. As we will see it is quite slow, taking nearly 4 seconds on average to complete. This isn't horrendous, but it could quickly add up in a larger data set and I'm sure we can do better.

There are a few areas we can investigate to make improvements. When initially parsing the date, we end up calling literal_eval twice, which may be a source of slowdown. We also loop over the entire dataset multiple times when calling the to_datetime function. 

We'll investigate which part is causing the greatest slowdown, but reducing the number of traversals over the data set will most likely provide significant gains. There are also a few other issues and mistakes that we'll dive into over the course of our optimisation process.

In [None]:
# Original function
def process_release_date(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    # Only want to keep released games
    df = df[df['coming_soon'] == False].copy()
    
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    df.loc[df['date'] == '', 'date'] = None
    
    # Parse the date formats we have discovered
    df['datetime'] = pd.to_datetime(df['date'], format='%d %b, %Y', errors='ignore')
    df['datetime'] = pd.to_datetime(df['datetime'], format='%b %Y', errors='ignore')
    
    # Parse the rest of the date formats
    df['release_date'] = pd.to_datetime(df['datetime'])
    
    df = df.drop(['coming_soon', 'date', 'datetime'], axis=1)
    return df

%timeit process_release_date(steam_data)

First, let's find out where the main slowdowns are. As we just saw we can use the %timeit magic to time our function. We can also use the in-built time module to inspect parts of our code.
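One way to time sections of code without scattering start and stop variables around is a small context manager. This is only a sketch: the `timed` helper and the `timings` dictionary are hypothetical names, and it uses `time.perf_counter` rather than `time.time`, as it is intended for measuring short durations.

```python
import time
from contextlib import contextmanager

# Collected run-times, keyed by label
timings = {}


@contextmanager
def timed(label):
    """Hypothetical helper: record and print the run-time of the enclosed block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start
        print(label, 'run-time:', timings[label])


# Stand-in workload; in the notebook this would wrap a section of the function
with timed('Example'):
    total = sum(range(100000))
```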

In [None]:
# Original function
def process_release_date(df):
    df = df.copy()
    
    lit_start = time.time()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    print('Evaluation run-time:', time.time() - lit_start)
    
    df.loc[df['date'] == '', 'date'] = None
    
    first_parse_start = time.time()
    
    df['datetime'] = pd.to_datetime(df['date'], format='%d %b, %Y', errors='ignore')
    df['datetime'] = pd.to_datetime(df['datetime'], format='%b %Y', errors='ignore')
    
    print('First parse run-time:', time.time() - first_parse_start)
    
    final_parse_start = time.time()
    df['release_date'] = pd.to_datetime(df['datetime'])
    print('Final parse run-time:', time.time() - final_parse_start)
    
    df = df.drop(['coming_soon', 'date', 'datetime'], axis=1)
    return df

function_start = time.time()
process_release_date(steam_data)
print('\nTotal run-time:', time.time() - function_start)

Immediately we can see that the majority of run-time is taken up by the final call to pd.to_datetime. This suggests that the first two calls are not functioning as expected. With `errors='ignore'`, pandas returns the input unchanged when parsing fails rather than skipping over the problem values as desired, so most of the work is being done by this final call. Now it makes sense why it is slow: pandas has to figure out how each date is formatted, and since we know we have some variations this may be slowing it down considerably.

Whilst the evaluation run-time is much shorter, our multiple calls to literal_eval may be slowing the function down as well, so we may wish to also investigate that. But as we know the biggest slowdown, we shall begin there.

So we know that handling our dates all together as they are is slow, and we know that we have some different formats mixed in there. Whilst there are likely many possible solutions to this problem, using regular expressions (or regex) comes to mind as they tend to excel at pattern matching in strings.

We know for sure two of the patterns, so let's build a regex for each of those. Then we can iteratively add more as we discover any other patterns. A powerful and useful tool for building and testing regex can be found at [regexr.com](https://regexr.com/).

In [None]:
pattern = r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}'
string = '13 Jul, 2018'

print(re.search(pattern, string))

pattern = r'[A-Za-z]{3} [\d]{4}'
string = 'Apr 2016'

print(re.search(pattern, string))

Using these two patterns we can start building out our function. We're going to apply a function to the date column which searches for each pattern, returning a nicely formatted date string which we will then feed into the to_datetime function.

Our first search matches the 'month year' pattern, like 'Apr 2019'. As we don't know the particular day for these matches we will assume it is the first of the month, returning '1 Apr 2019' in this example.

If we don't match this, we'll check for the second case. Our second match will be the 'day month, year' pattern, such as '13 Jul, 2018'. In this case we will simply return the match with the comma removed, matching our first case, like '13 Jul 2018' in this example.

Finally we'll check for the empty string, returning it for now.

For anything else we'll simply print the string so we know what else we should be searching for.
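The steps above can be sketched on plain strings before wiring them into pandas. This is an illustrative stand-alone version: `normalise_date` is a hypothetical name, and it uses pre-compiled patterns with `fullmatch` (stricter than the `re.search` used in the notebook) so that each string must match the pattern in full.

```python
import re

MONTH_YEAR = re.compile(r'[A-Za-z]{3} \d{4}')
DAY_MONTH_YEAR = re.compile(r'\d{1,2} [A-Za-z]{3}, \d{4}')


def normalise_date(text):
    """Hypothetical helper: return a 'day month year' string, or None if unrecognised."""
    if MONTH_YEAR.fullmatch(text):
        # No day given, so assume the first of the month
        return '1 ' + text
    if DAY_MONTH_YEAR.fullmatch(text):
        # Drop the comma so both cases share one format
        return text.replace(',', '')
    if text == '':
        return text
    # Anything else is printed so we know which patterns are still missing
    print(text)
    return None


for sample in ['Apr 2019', '13 Jul, 2018', '']:
    print(repr(normalise_date(sample)))
```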

In [None]:
def building_regex(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    # Useful to find some examples for developing regex
    # print(df['release_date'].value_counts())
    
    def parse_date(x):
        if re.search(r'[A-Za-z]{3} [\d]{4}', x):
            return '1 ' + x 
        elif re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', x):
            return x.replace(',', '')
        elif x == '':
            return x
        else:
            print(x)
            
    df['date'] = df['date'].apply(parse_date)
    df['release_date'] = pd.to_datetime(df['date'], infer_datetime_format=True)
    
    df = df.drop(['coming_soon', 'date'], axis=1)
    
    return df

result = building_regex(steam_data)

It looks like we only have to deal with one extra case, where the month comes before the day. Let's factor this in and time our new function against the old one to see if there are any improvements.

Previously we used the `infer_datetime_format` parameter of to_datetime, which can speed up the process. However, as we now know exactly the format our dates will be in, we can explicitly set it ourselves, which should be the fastest way of doing things.

We also need to decide how to handle our missing dates - those with the empty strings. For now let's change the way the function handles errors from raise to coerce, which returns NaT (not a time) instead.
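As a quick sketch of what coercion does, the example below parses a small made-up Series with an explicit format. Values that cannot be parsed come back as NaT, which pandas treats as missing.

```python
import pandas as pd

# Made-up sample: one valid date, one empty string, one unparseable value
dates = pd.Series(['13 Jul 2018', '', 'unknown'])

parsed = pd.to_datetime(dates, format='%d %b %Y', errors='coerce')

print(parsed)
print('Missing after parsing:', parsed.isnull().sum())
```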

In [None]:
# Test parsing of dates

def process_release_date_old(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    # Simple parsing
    df['release_date'] = pd.to_datetime(df['date'])
    
    df = df.drop(['coming_soon', 'date'], axis=1)
    
    return df


def process_release_date_new(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    # Complex parsing
    def parse_date(x):
        if re.search(r'[A-Za-z]{3} [\d]{4}', x):
            return '1 ' + x
        elif re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', x):
            return x.replace(',', '')
        elif re.search(r'[A-Za-z]{3} [\d]{2}, [\d]{4}', x):
            return x[4:6] + ' ' + x[:3] + ' ' + x[-4:]
        elif x == '':
            return x
        else:
            # Should be everything, print out anything left just in case
            print(x)
            
    df['date'] = df['date'].apply(parse_date)
    df['release_date'] = pd.to_datetime(df['date'], format='%d %b %Y', errors='coerce')
    
    df = df.drop(['coming_soon', 'date'], axis=1)
    
    return df

print('Testing date parsing:\n')
%timeit process_release_date_old(steam_data)
%timeit process_release_date_new(steam_data)

That's almost 4 times as fast.

One final thing we can do here is check how many dates are matched by each pattern, so that we can order the searches from most to least frequent.

To do this, instead of returning the date we'll return a numbered label, different for each match. We can then print the value counts for the column and see which is the most frequent.
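The counting idea can be sketched on plain strings with `collections.Counter`. The sample dates, the labels and the `label_date` helper are assumptions for illustration; only the three patterns come from the notebook.

```python
import re
from collections import Counter

# (label, pattern) pairs, checked in order
PATTERNS = [
    ('mmm yyyy', re.compile(r'[A-Za-z]{3} \d{4}')),
    ('dd mmm, yyyy', re.compile(r'\d{1,2} [A-Za-z]{3}, \d{4}')),
    ('mmm dd, yyyy', re.compile(r'[A-Za-z]{3} \d{2}, \d{4}')),
]


def label_date(text):
    """Hypothetical helper: name the first pattern that matches the string."""
    if text == '':
        return 'empty'
    for label, pattern in PATTERNS:
        if pattern.search(text):
            return label
    return 'unmatched'


samples = ['13 Jul, 2018', '1 Nov, 2000', 'Apr 2016', 'Jul 13, 2018', '']
counts = Counter(label_date(s) for s in samples)
print(counts.most_common())
```

Checking the most common label first means most rows need only a single search.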

In [None]:
# Optimising regex search order
def optimise_regex_order(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    def parse_date(x):
        if re.search(r'[A-Za-z]{3} [\d]{4}', x):
            return '0: mmm yyyy' # '1 ' + x
        elif re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', x):
            return '1: dd mmm, yyyy' # x.replace(',', '')
        elif re.search(r'[A-Za-z]{3} [\d]{2}, [\d]{4}', x):
            return '2: mmm dd, yyyy' # x[4:6] + ' ' + x[:3] + ' ' + x[-4:]
        elif x == '':
            return '3: empty' # pass
        else:
            # Should be everything, print out anything left just in case
            print(x)
            
    df['release_date'] = df['date'].apply(parse_date)
    
    print(df['release_date'].value_counts())
    
    return df


result = optimise_regex_order(steam_data)

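Most of the dates fall into a single category. If we organise the checks so that the most common pattern comes first, we reduce the number of searches needed for each row.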

In [None]:
def process_release_date_unordered(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    def parse_date(x):
        if re.search(r'[A-Za-z]{3} [\d]{4}', x):
            return '1 ' + x
        elif re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', x):
            return x.replace(',', '')
        elif re.search(r'[A-Za-z]{3} [\d]{2}, [\d]{4}', x):
            return x[4:6] + ' ' + x[:3] + ' ' + x[-4:]
        elif x == '':
            return x
        else:
            # Should be everything, print out anything left just in case
            print(x)
            
    df['release_date'] = df['date'].apply(parse_date)
    # Parse the reformatted strings (not the raw 'date' column)
    df['release_date'] = pd.to_datetime(df['release_date'], format='%d %b %Y', errors='coerce')
    df = df.drop(['coming_soon', 'date'], axis=1)
    
    return df


def process_release_date_ordered(df):
    df = df.copy()
    
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])
    df = df[df['coming_soon'] == False].copy()
    df['date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    def parse_date(x):
        if re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', x):
            return x.replace(',', '')
        elif x == '':
            return x
        elif re.search(r'[A-Za-z]{3} [\d]{4}', x):
            return '1 ' + x
        elif re.search(r'[A-Za-z]{3} [\d]{2}, [\d]{4}', x):
            return x[4:6] + ' ' + x[:3] + ' ' + x[-4:]
        else:
            # Should be everything, print out anything left just in case
            print(x)
            
    df['release_date'] = df['date'].apply(parse_date)
    # Parse the reformatted strings (not the raw 'date' column)
    df['release_date'] = pd.to_datetime(df['release_date'], format='%d %b %Y', errors='coerce')
    df = df.drop(['coming_soon', 'date'], axis=1)
    
    return df


%timeit process_release_date_unordered(steam_data)
%timeit process_release_date_ordered(steam_data)

In [None]:
# Older versions of the functions, kept for backup


def process_release_date_old(df):
    df = df.copy()
    
    def eval_date(x):
        parsed_x = literal_eval(x)
        if parsed_x['coming_soon']:
            return None
        else:
            return parsed_x['date']
    
    df['release_date'] = df['release_date'].apply(eval_date)
    df = df[df['release_date'].notnull()]  # could change to drop when '' and deal with missing release dates also
    
    # Simple parsing
    df['release_date'] = pd.to_datetime(df['release_date'])
    
    return df


def process_release_date_new(df):
    df = df.copy()
    
    def eval_date(x):
        parsed_x = literal_eval(x)
        if parsed_x['coming_soon']:
            return None
        else:
            return parsed_x['date']
    
    df['release_date'] = df['release_date'].apply(eval_date)
    df = df[df['release_date'].notnull()]  # could change to drop when '' and deal with missing release dates at same time
    
    # Useful to find some examples for developing regex
    # print(df['release_date'].value_counts())
    
    # Complex parsing
    def parse_date(x):
        if re.search(r'[A-Za-z]{3} [\d]{4}', x):
            return '1 ' + x
        elif re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', x):
            return x.replace(',', '')
        elif re.search(r'[A-Za-z]{3} [\d]{2}, [\d]{4}', x):
            return x[4:6] + ' ' + x[:3] + ' ' + x[-4:]
        elif x == '':
            pass
        else:
            # Should be everything, print out anything left just in case
            print(x)
            
    df['release_date'] = df['release_date'].apply(parse_date)
    df['release_date'] = pd.to_datetime(df['release_date'], infer_datetime_format=True)
    
    return df




Minor improvement but we'll take it.

Next, let's investigate the evaluation part of the function, where we call literal_eval.
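Before testing on the full data set, the idea can be sketched on plain lists: evaluating each string once and reusing the result, versus calling `literal_eval` twice per row. The sample rows and both function names are hypothetical, and the timings will vary by machine.

```python
import timeit
from ast import literal_eval

# Made-up rows in the same shape as the release_date column
rows = ["{'coming_soon': False, 'date': '1 Nov, 2000'}"] * 1000


def eval_twice(rows):
    """Evaluate every string twice, once per key."""
    soon = [literal_eval(r)['coming_soon'] for r in rows]
    dates = [literal_eval(r)['date'] for r in rows]
    return [d for s, d in zip(soon, dates) if not s]


def eval_once(rows):
    """Evaluate every string once and reuse the resulting dictionary."""
    parsed = [literal_eval(r) for r in rows]
    return [p['date'] for p in parsed if not p['coming_soon']]


print('twice:', timeit.timeit(lambda: eval_twice(rows), number=5))
print('once: ', timeit.timeit(lambda: eval_once(rows), number=5))
```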

In [None]:
# Testing evaluation methods
def evaluation_method_original(df):
    df = df.copy()
    
    # Eval section
    df['release_date'] = df['release_date'].apply(lambda x: literal_eval(x))    
    df['coming_soon'] = df['release_date'].apply(lambda x: x['coming_soon'])
    # Only want released games
    df = df[df['coming_soon'] == False].copy()
    df['release_date'] = df['release_date'].apply(lambda x: x['date'])
    return df


def evaluation_method_1(df):
    df = df.copy()
    
    # Eval section
    df['coming_soon'] = df['release_date'].apply(lambda x: literal_eval(x)['coming_soon'])    
    df = df[df['coming_soon'] == False].copy()
    df['release_date'] = df['release_date'].apply(lambda x: literal_eval(x)['date'])
    
    return df


def evaluation_method_2(df):
    df = df.copy()
    
    # Return to - alternate method for eval, may be faster
    df_2 = df['release_date'].transform([lambda x: literal_eval(x)['coming_soon'], lambda x: literal_eval(x)['date']])
    df = pd.concat([df, df_2], axis=1)
    return df


def evaluation_method_3(df):
    df = df.copy()
    
    def eval_date(x):
        parsed_x = literal_eval(x)
        if parsed_x['coming_soon']:
            return None
        else:
            return parsed_x['date']
    
    df['release_date'] = df['release_date'].apply(eval_date)
    df = df[df['release_date'].notnull()]  # could change to drop when '' and deal with missing release dates also
    
    return df


%timeit evaluation_method_original(steam_data)
%timeit evaluation_method_1(steam_data)
%timeit evaluation_method_2(steam_data)
%timeit evaluation_method_3(steam_data)

The last method is just about the fastest, so we'll use that in the final function.  
(In a later run it appeared slower, so these timings need re-checking.)

In [None]:
# Final Function
def process_release_date(df):
    df = df.copy()
    
    def eval_date(x):
        parsed_x = literal_eval(x)
        if parsed_x['coming_soon']:
            return None
        else:
            return parsed_x['date']
    
    df['release_date'] = df['release_date'].apply(eval_date)
    df = df[df['release_date'].notnull()] # could change to drop when '' and deal with missing release dates also
    
    # Complex parsing
    def parse_date(x):
        if re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', x):
            return x.replace(',', '')
        elif x == '':
            return x
        elif re.search(r'[A-Za-z]{3} [\d]{4}', x):
            return '1 ' + x
        elif re.search(r'[A-Za-z]{3} [\d]{2}, [\d]{4}', x):
            return x[4:6] + ' ' + x[:3] + ' ' + x[-4:]
        else:
            # Should be everything, print out anything left just in case
            print(x)
            
    df['release_date'] = df['release_date'].apply(parse_date)
    df['release_date'] = pd.to_datetime(df['release_date'], format='%d %b %Y', errors='coerce')
    
    return df


%timeit process_release_date(steam_data)

In [None]:
def process(df):
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    
    print('Testing date evaluation')
    %timeit process_release_date_1(df)
    %timeit process_release_date_2(df)
    %timeit process_release_date_3(df)
    %timeit process_release_date_4(df)
    
    print('Testing date parsing')
    %timeit process_release_date_5(df)
    %timeit process_release_date_6(df)
    
    return df

steam_data = process(raw_steam_data)
steam_data[['name', 'steam_appid', 'release_date']].head()

In [None]:
# Final Function
def process_release_date(df):
    df = df.copy()
    
    def eval_date(x):
        parsed_x = literal_eval(x)
        if parsed_x['coming_soon']:
            return None
        else:
            return parsed_x['date']
    
    df['release_date'] = df['release_date'].apply(eval_date)
    df = df[df['release_date'].notnull()]  # could change to drop when '' and deal with missing release dates also
    
    # Complex parsing
    def parse_date(x):
        if re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', x):
            return x.replace(',', '')
        elif x == '':
            pass # may wish to change? or alter how errors are handled in pd.to_datetime
        elif re.search(r'[A-Za-z]{3} [\d]{4}', x):
            return '1 ' + x
        elif re.search(r'[A-Za-z]{3} [\d]{2}, [\d]{4}', x):
            return x[4:6] + ' ' + x[:3] + ' ' + x[-4:]        
        else:
            # Should be everything, print out anything left just in case
            print(x)
            
    df['release_date'] = df['release_date'].apply(parse_date)
    df['release_date'] = pd.to_datetime(df['release_date'], infer_datetime_format=True)
    # print(df['release_date'].value_counts())
    
    return df


def process(df):
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    
    # Check difference after reordering
    %timeit process_release_date_6(df)
    %timeit process_release_date(df)
    
    df = process_release_date(df)
    
    return df

print(raw_steam_data.shape)
steam_data = process(raw_steam_data)
print(steam_data.shape)
steam_data[['name', 'steam_appid', 'release_date']].head()

# Check none later than current date, also check total null is 64
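A minimal sketch of these two checks, assuming the release dates have already been parsed to datetimes. The `release_dates` Series here is a made-up stand-in for `steam_data['release_date']`.

```python
import pandas as pd

# Toy stand-in for the cleaned release_date column (values are made up)
release_dates = pd.Series(pd.to_datetime(['2000-11-01', '2018-07-13', None]))

now = pd.Timestamp.now()

# Comparisons with NaT evaluate to False, so missing dates are not counted here
future_count = (release_dates > now).sum()
null_count = release_dates.isnull().sum()

print('Dates in the future:', future_count)
print('Missing dates:', null_count)
```

On the real data we would hope to see no future dates, and a missing count in line with the blank dates found earlier.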

# Final process function

In [None]:
def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = df.drop(['achievements', 'content_descriptors'], axis=1)
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_developers_and_publishers(df)
    # df = process_release_date(df)
    
    # Process columns which export data
    df = process_descriptions(df, export=True)
    df = process_language(df, export=True)
    df = process_images(df, export=True)
    df = process_info(df, export=True)
    df = process_requirements(df, export=True)
    df = process_categories(df, export=True)
    df = process_genres(df, export=True)
    
    return df

steam_data = process(raw_steam_data)
steam_data.head()

# Run Functions

In [None]:
def process(df):
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    
    # df = process_release_date(df)
    # df = process_price(df)
    # df = process_language(df)
    # df = process_requirements(df)
    # df = process_developers_and_publishers(df)
    # df = process_packages(df)
    # df = process_platforms(df)
    
    return df

print(raw_steam_data.shape)
steam_data = process(raw_steam_data)
print(steam_data.shape)
steam_data.head()

In [None]:


steam_data = process_null_cols(raw_steam_data)
steam_data = process_null_row(steam_data, 'type')
steam_data = process_name(steam_data)
steam_data = steam_data.drop('type', axis=1)
steam_data = steam_data.drop_duplicates()
steam_data = process_age(steam_data)
steam_data = process_release_date(steam_data)
steam_data = process_price(steam_data)
steam_data = process_language(steam_data)
steam_data = process_requirements(steam_data)
steam_data = process_developers_and_publishers(steam_data)
steam_data = process_packages(steam_data)
steam_data = process_platforms(steam_data)

category_data = process_categories(steam_data)
# category_data.to_csv('../data/steam_category_data.csv', index=False)
steam_data = steam_data.drop('categories', axis=1)

genre_data = process_genres(steam_data)
# genre_data.to_csv('../data/steam_genre_data.csv', index=False)
steam_data = steam_data.drop('genres', axis=1)

# drop descriptions - could be useful for future recommender or key-word analysis, but drop for now
# drop header_image, screenshots, background - links to images. could be useful for future image analysis project
# drop website - nearly 10000 missing values, unlikely useful to this analysis
# drop movies column - just contains trailers (similar to screenshots)
# drop achievements column - too specific and redundant as have steam_achievements in categories
# drop support info column - not useful for analysis
drop_cols = ['detailed_description', 'about_the_game', 'short_description', 'header_image', 'screenshots', 'background', 'website', 'movies', 'achievements', 'support_info', 'content_descriptors']
steam_data = steam_data.drop(drop_cols, axis=1)

display(category_data.head())
display(genre_data.head())
steam_data.head()

### Combining and exporting data frames

In [None]:
steam_data.to_csv('../data/steam_data_clean.csv', index=False)

steam_data_full = steam_data.merge(genre_data, how='left', on='steam_appid')
steam_data_full = steam_data_full.merge(category_data, how='left', on='steam_appid')

steam_data_full.to_csv('../data/steam_data_clean_full.csv', index=False)

# Reference, extra resources & test code

### reference - old categories and genre functions

In [None]:
def parse_categories(x):
    try:
        return {c['description']:1 for c in literal_eval(x)}
    except ValueError:
        return {}

    
def get_col_list(series):
    cols_dict = {}
    
    def create_col_dict(x):
        for item in x.keys():
            cols_dict[item] = 1
    
    series.apply(create_col_dict)
    
    return list(cols_dict.keys())
    
    
def process_categories(df):
    df = df[['steam_appid', 'categories']].copy()
    
    df['categories'] = df['categories'].apply(parse_categories)
    col_list = get_col_list(df['categories'])
    
    def set_category(x, col):
        if col in x.keys():
            return 1
        else:
            return 0
    
    for col in col_list:
        col_name = (col.lower()
                       .replace('-', '_')
                       .replace(' ', '_')
                       .replace('(', '')
                       .replace(')', '')
                       .replace('/', '_or_')
                   )
        df[col_name] = df['categories'].apply(set_category, args=(col,))
    
    df = df.drop('categories', axis=1)
    
    return df

# TODO: refactor categories processing slightly to use these same functions
def parse_column(x):
    try:
        return {item['description']:1 for item in literal_eval(x)}
    except ValueError:
        return {}


def get_col_list(series):
    cols_dict = {}
    
    def create_col_dict(x):
        for item in x.keys():
            cols_dict[item] = 1
    
    series.apply(create_col_dict)
    
    return list(cols_dict.keys())


def set_category(x, col):
    if col in x.keys():
        return 1
    else:
        return 0

    
def process_genres(df):
    df = df[['steam_appid', 'genres']].copy()
    
    df['genres'] = df['genres'].apply(parse_column)
    
    col_list = get_col_list(df['genres'])
    
    for col in col_list:
        col_name = (col.lower()
                        .replace(' ', '_')
                        .replace('&', 'and')
                   )
        df[col_name] = df['genres'].apply(set_category, args=(col,))

    df = df.drop('genres', axis=1)
    
    return df

### category experimentation (aggregation)

In [279]:
def get_col_list(series):
    cols_dict = {}
    
    def create_col_dict(x):
        for item in x.keys():
            cols_dict[item] = 1
    
    series.apply(create_col_dict)
    
    return list(cols_dict.keys())


def get_col_list(series):
    col_list = []
    
    def create_col_list(x):
        for item in x:
            if item not in col_list:
                col_list.append(item)
    
    series.apply(create_col_list)
    
    return col_list
    
    
def parse_categories(x):    
    row = [d['description'] for d in literal_eval(x)]
    
    for item in row:
        col_dict[item] = 1
        
    return row


def process_categories(df, export=False):
    df = df[df['categories'].notnull()].copy()
    
    if export:
        category_data = df[['steam_appid', 'categories']].copy()

        #category_data['categories'] = category_data['categories'].apply(lambda x: {item['description']:0 for item in literal_eval(x)})
        category_data['categories'] = category_data['categories'].apply(lambda x: [item['description'] for item in literal_eval(x)])
        cat_cols = set(list(itertools.chain(*category_data['categories'])))
        
        
#         col_dict = {}
#         category_data['categories'] = category_data['categories'].apply(parse_categories)
#         print(col_dict)
        
#         def get_column_names(series): 
#             return reduce(lambda x, y: set(x).union(y), series.to_list())
        
#         col_set = category_data['categories'].agg(get_column_names)
#         print(col_set)
        
#         start = time.time()
#         category_data['categories'].agg(get_column_names)
#         print(time.time() - start)
        
#         start = time.time()
#         print(get_col_list(category_data['categories']))
#         print(time.time() - start)
        
#         cols_dict = {}
        
#         def create_col_dict(x):
#             for item in x:
#                 cols_dict[item] = 1
                
#         category_data['categories'].apply(create_col_dict)
#         print(cols_dict.keys())
        
        
#         for col in col_list:
#             col_name = (col.lower()
#                            .replace('-', '_')
#                            .replace(' ', '_')
#                            .replace('(', '')
#                            .replace(')', '')
#                            .replace('/', '_or_')
#                        )
#             col_name = 'category_' + col_name
#             category_data[col_name] = category_data['categories'].apply(lambda x: 1 if col in x.keys() else 0)

#         category_data = category_data.drop('categories', axis=1)
            
        # export_data(category_data, 'category_data')
        display(category_data.head())
    
    df = df.drop('categories', axis=1)
    
    return df

process_categories(steam_data, export=True).head()

Unnamed: 0,steam_appid,categories
0,10,"[Multi-player, Online Multi-Player, Local Mult..."
1,20,"[Multi-player, Online Multi-Player, Local Mult..."
2,30,"[Multi-player, Valve Anti-Cheat enabled]"
3,40,"[Multi-player, Online Multi-Player, Local Mult..."
4,50,"[Single-player, Multi-player, Valve Anti-Cheat..."


Unnamed: 0,name,steam_appid,required_age,genres,achievements,release_date,content_descriptors,windows,mac,linux,price,english,developer,publisher
0,Counter-Strike,10,3,"[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,7.19,1,Valve,Valve
1,Team Fortress Classic,20,3,"[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'ids': [2, 5], 'notes': 'Includes intense vio...",1,1,1,3.99,1,Valve,Valve
2,Day of Defeat,30,3,"[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
3,Deathmatch Classic,40,3,"[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,3,"[{'id': '1', 'description': 'Action'}]",{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'ids': [], 'notes': None}",1,1,1,3.99,1,Gearbox Software,Valve
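
The commented-out experiments in `process_categories` point towards one indicator column per category, with a cleaned `category_` prefixed column name. A minimal sketch of that idea follows. The `sample` frame and the helper names `clean_name` and `one_hot_categories` are assumptions for illustration only, not part of the notebook's pipeline.

```python
from ast import literal_eval
from itertools import chain

import pandas as pd

# Hypothetical sample mimicking the raw 'categories' strings.
sample = pd.DataFrame({
    'steam_appid': [10, 30],
    'categories': [
        "[{'id': 1, 'description': 'Multi-player'}, "
        "{'id': 8, 'description': 'Valve Anti-Cheat enabled'}]",
        "[{'id': 2, 'description': 'Single-player'}, "
        "{'id': 1, 'description': 'Multi-player'}]",
    ],
})

def clean_name(description):
    """Turn a category description into a snake_case column name."""
    name = (description.lower()
                       .replace('-', '_')
                       .replace(' ', '_')
                       .replace('(', '')
                       .replace(')', '')
                       .replace('/', '_or_'))
    return 'category_' + name

def one_hot_categories(df):
    out = df[['steam_appid', 'categories']].copy()
    # Parse each string into a plain list of category descriptions.
    out['categories'] = out['categories'].apply(
        lambda x: [item['description'] for item in literal_eval(x)])
    # Sorted so that the column order is deterministic.
    all_categories = sorted(set(chain.from_iterable(out['categories'])))
    for category in all_categories:
        # Bind the loop variable as a default argument to avoid late binding.
        out[clean_name(category)] = out['categories'].apply(
            lambda x, c=category: int(c in x))
    return out.drop('categories', axis=1)

category_data = one_hot_categories(sample)
```

Each row ends up with a 1 in the columns for the categories it lists and a 0 elsewhere, so the result could be exported as a separate table keyed on `steam_appid`.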


In [108]:
from functools import reduce

s = pd.Series([[1], [2], [3], [4]])
t = pd.Series([['a', 'b'], ['a'], ['b', 'c']])

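# First attempt: fold the rows together with repeated set unions.
def my_min(series):
    return reduce(lambda x, y: set(x).union(set(y)), series.to_list())

# Second attempt (this definition overrides the first):
# concatenate all the lists, then de-duplicate once at the end.
def my_min(series):
    col_list = reduce(lambda x, y: x+y, series.to_list())
    return set(col_list)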
# s.agg(my_min)

t.agg(my_min)

{'a', 'b', 'c'}

In [133]:
s = pd.Series([1, 2, 3, 4])

def test_sum(series):
    return reduce(lambda x, y: x+y, series)

test_sum(s)

10

In [275]:
t = pd.Series([['a', 'b'], ['a'], ['b', 'c']])
t = pd.concat([t]*1000, ignore_index=True)

def agg_1(series):
    return reduce(lambda x, y: set(x).union(y), series.to_list())

def f1(series):
    cols_dict = {}
    
    def create_col_dict(x):
        for item in x:
            cols_dict[item] = 1
    
    series.apply(create_col_dict)
    
    return list(cols_dict.keys())

def f2(series):
    cols_dict = {}
    
    for i, row in series.items():
        for item in row:
            cols_dict[item] = 1
    
    return list(cols_dict.keys())



def my_sum(*args):
    total = 0
    for a in args:
        total += a
    return total

my_sum(1, 2) # 3

def get_items_dict(*args):
    items = {}
    for a in args:
        items[a] = 1
    return items

get_items_dict(1, 2, 1, 3) # {1: 1, 2: 1, 3: 1}

def get_items_set(*args):
    return set(args)

get_items_set(1, 2, 1, 3) # {1, 2, 3}

def get_from_lists(series):
    print('s', series)
    red = reduce(lambda x, y: set(x).union(y), series.to_list())
    print('r', red)
    return red

# pd.Series([['a', 'b'], ['a'], ['b', 'c']]).agg(get_from_lists)

from itertools import chain


# %timeit t.agg(agg_1)
# %timeit f1(t)
%timeit f2(t)
%timeit set(list(chain(*t)))

851 µs ± 3.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
357 µs ± 2.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
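
In the timings above, the `chain` version took about 357 µs against 851 µs for the dictionary loop `f2`. The sketch below restates the approaches on plain Python lists, purely to check that they return the same set. The function names and the `rows` sample are hypothetical, and `chain.from_iterable` is used in place of `chain(*t)` so that the rows are not unpacked into arguments first.

```python
from functools import reduce
from itertools import chain

# Hypothetical rows standing in for the parsed 'categories' lists.
rows = [['a', 'b'], ['a'], ['b', 'c']] * 1000

def unique_reduce(rows):
    # Fold the rows together with repeated set unions (the agg_1 approach).
    return reduce(lambda x, y: set(x).union(y), rows)

def unique_dict(rows):
    # Record each item as a dictionary key (the f1/f2 approach).
    seen = {}
    for row in rows:
        for item in row:
            seen[item] = 1
    return set(seen)

def unique_chain(rows):
    # Flatten lazily, then de-duplicate; no intermediate list is built.
    return set(chain.from_iterable(rows))

results = [f(rows) for f in (unique_reduce, unique_dict, unique_chain)]
```

All three give the same set of unique items, so the choice between them comes down to speed and readability.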


### Reference - different process_platforms methods

In [None]:
def process_platforms(df):
    df = df.copy()
    df['platforms'] = df['platforms'].apply(literal_eval)
    
    df['windows'] = df['platforms'].apply(lambda x: x['windows'])
    df['mac'] = df['platforms'].apply(lambda x: x['mac'])
    df['linux'] = df['platforms'].apply(lambda x: x['linux'])
    
    df = df.drop('platforms', axis=1)
    
    return df

%timeit process_platforms(steam_data)[['name', 'windows', 'mac', 'linux']].head()

In [None]:
def process_platforms(df):
    df = df.copy()
    df['platforms'] = df['platforms'].apply(literal_eval)
    
    for platform in ['windows', 'mac', 'linux']:
        df[platform] = df['platforms'].apply(lambda x: x[platform])
    
    df = df.drop('platforms', axis=1)
    
    return df

%timeit process_platforms(steam_data)[['name', 'windows', 'mac', 'linux']].head()

In [None]:
def process_platforms(df):
    df = df.copy()
    df['platforms'] = df['platforms'].apply(literal_eval)
    
    def windows(x):
        return x['windows']
    
    def mac(x):
        return x['mac']
    
    def linux(x):
        return x['linux']
    
    platforms_df = df['platforms'].transform([windows, mac, linux])
    df = pd.concat([df, platforms_df], axis=1)
    df = df.drop('platforms', axis=1)
    
    return df

%timeit process_platforms(steam_data)[['name', 'windows', 'mac', 'linux']].head()

In [None]:
def process_platforms(df):
    df = df.copy()
    df['platforms'] = df['platforms'].apply(literal_eval)
    
    def windows(x):
        return x['windows']
    
    def mac(x):
        return x['mac']
    
    def linux(x):
        return x['linux']
    
    df[['windows', 'mac', 'linux']] = df['platforms'].transform([windows, mac, linux])
    #df = pd.concat([df, platforms_df], axis=1)
    df = df.drop('platforms', axis=1)
    
    return df

%timeit process_platforms(steam_data)[['name', 'windows', 'mac', 'linux']].head()
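
The variants above all parse the `platforms` string and then pull out one key per column. Another option, sketched here under the assumption that every row holds a dictionary with `windows`, `mac` and `linux` keys, is to parse once and let the `DataFrame` constructor expand the dictionaries into columns. The `sample` frame and the name `process_platforms_sketch` are hypothetical, and this version has not been timed against the others.

```python
from ast import literal_eval

import pandas as pd

# Hypothetical sample mimicking the raw 'platforms' strings.
sample = pd.DataFrame({
    'name': ['Game A', 'Game B'],
    'platforms': [
        "{'windows': True, 'mac': True, 'linux': False}",
        "{'windows': True, 'mac': False, 'linux': False}",
    ],
})

def process_platforms_sketch(df):
    df = df.copy()
    # Parse each string into a dictionary exactly once.
    parsed = df['platforms'].apply(literal_eval)
    # A list of dicts becomes one column per key; keep the original index.
    platform_cols = pd.DataFrame(parsed.tolist(), index=df.index)
    platform_cols = platform_cols[['windows', 'mac', 'linux']]
    return pd.concat([df.drop('platforms', axis=1), platform_cols], axis=1)

result = process_platforms_sketch(sample)
```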

### Reference - Processing requirements attempts

In [None]:
# attempt at cleaning min/recommended requirements and exporting

def process_requirements(df, export=False):
    if export:
        requirements = df[['steam_appid', 'pc_requirements', 'mac_requirements', 'linux_requirements']].copy()
        
        for col in ['pc_requirements', 'mac_requirements', 'linux_requirements']:
            requirements[col] = requirements[col].apply(literal_eval)
            
        def handle_pc(x, col):
            # Return '' when x is not a dict (TypeError) or lacks the key (KeyError).
            try:
                return x[col]
            except (TypeError, KeyError):
                return ''
        
        requirements['pc_min'] = requirements['pc_requirements'].apply(handle_pc, args=('minimum',))
        requirements['pc_rec'] = requirements['pc_requirements'].apply(handle_pc, args=('recommended',))
        
        # Strip control characters and HTML tags from both columns.
        # regex=True is passed explicitly because newer pandas versions
        # treat the pattern as a literal string by default.
        for col in ['pc_min', 'pc_rec']:
            requirements[col] = (requirements[col]
                                     .str.replace(r'[\r\t\n]', '', regex=True)
                                     .str.replace(r'<[\/"=\w\s]+>', '', regex=True)
                                )
        
        requirements['pc_rec_from_min'] = requirements['pc_min'].str.extract(r'(Recommended:.+)')
        requirements['pc_rec_from_min'] = requirements['pc_rec_from_min'].fillna('')
        
        requirements['pc_min'] = requirements['pc_min'].str.replace(r'Recommended:.+', '', regex=True).str.replace('Minimum:', '', regex=False)
        requirements['pc_rec'] = requirements['pc_rec'] + requirements['pc_rec_from_min']
        requirements['pc_rec'] = requirements['pc_rec'].str.replace('Recommended:', '')
        
        requirements = requirements.drop('pc_rec_from_min', axis=1)
        
        # print(requirements['pc_min'][0])
        # display(requirements.head())
        
        requirements.to_csv('../data/steam_requirements.csv', index=False)
        print("Exported requirements to '../data/steam_requirements.csv'")
        
    df = df.drop(['pc_requirements', 'mac_requirements', 'linux_requirements'], axis=1)
    
    return df
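
The cleaning steps in `process_requirements` can be tried on a single string with the standard `re` module. This is a rough sketch under stated assumptions: the sample text is hypothetical, `split_requirements` is an illustrative name, and the tag pattern `<[^>]+>` is a broader stand-in for the narrower pattern used above.

```python
import re

TAG_PATTERN = re.compile(r'<[^>]+>')      # any HTML tag (broader than the original pattern)
CONTROL_PATTERN = re.compile(r'[\r\t\n]')  # carriage returns, tabs, newlines

def strip_markup(text):
    """Remove control characters and HTML tags from a requirements string."""
    return TAG_PATTERN.sub('', CONTROL_PATTERN.sub('', text))

def split_requirements(minimum_html, recommended_html=''):
    """Return (minimum, recommended) as plain text.

    A 'Recommended:' section embedded in the minimum text is moved
    across to the recommended text.
    """
    minimum = strip_markup(minimum_html)
    recommended = strip_markup(recommended_html)
    match = re.search(r'Recommended:.+', minimum)
    if match:
        recommended += match.group(0)
        minimum = minimum[:match.start()]
    minimum = minimum.replace('Minimum:', '').strip()
    recommended = recommended.replace('Recommended:', '').strip()
    return minimum, recommended

# Hypothetical requirements string with both sections in the 'minimum' field.
raw = ('<strong>Minimum:</strong> 500 MHz processor, 96MB RAM<br>\r\n'
       '<strong>Recommended:</strong> 800 MHz processor, 128MB RAM')
minimum, recommended = split_requirements(raw)
```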

In [None]:
# pc_requirements CPU cleaning attempt - has potential but so many variations, would be tedious

view_requirements = steam_data[['steam_appid', 'pc_requirements']].iloc[:1000].copy()

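def parse_pc_requirements(x):
    # Parse once; anything that is not a dict is replaced by an empty minimum.
    parsed = literal_eval(x)
    return parsed if isinstance(parsed, dict) else {'minimum': ''}

view_requirements['pc_requirements'] = view_requirements['pc_requirements'].apply(parse_pc_requirements)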
view_requirements['min'] = view_requirements['pc_requirements'].apply(lambda x: x['minimum'])
view_requirements['rec'] = view_requirements['pc_requirements'].apply(lambda x: x['recommended'] if 'recommended' in x.keys() else '')

view_requirements['rec_from_min'] = view_requirements['min'].str.extract(r'(Recommended:.+)').fillna('')
view_requirements['rec'] = view_requirements['rec'] + view_requirements['rec_from_min']
view_requirements['rec'] = view_requirements['rec'].str.replace('Recommended:', '')

view_requirements['min'] = view_requirements['min'].str.replace(r'Recommended:.+', '', regex=True).str.replace('Minimum:', '', regex=False)

view_requirements['cpu'] = view_requirements['rec'].str.extract(r'(?:Processor:)(?:<\/strong>)?([\w\s.+]+)(?=<)').fillna('')

view_requirements['cpu_min'] = view_requirements['min'].str.extract(r'(?:Processor:)(?:<\/strong>)?([\w\s.+]+)(?=<)').fillna('')
view_requirements.loc[view_requirements['cpu'] == '', 'cpu'] = view_requirements.loc[view_requirements['cpu'] == '', 'cpu_min']

view_requirements['cpu_min'] = view_requirements['min'].str.extract(r'([\w\d\s]+)processor').fillna('')
view_requirements.loc[view_requirements['cpu'] == '', 'cpu'] = view_requirements.loc[view_requirements['cpu'] == '', 'cpu_min']

view_requirements = view_requirements.drop(['pc_requirements', 'rec_from_min'], axis=1)

with pd.option_context("display.max_colwidth", 1000):
    display(view_requirements)
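
To illustrate why the CPU extraction is described as tedious, the sketch below applies the same `Processor:` pattern to two hypothetical requirement strings. The helper name `extract_cpu` and both samples are assumptions. The second sample contains an `@`, which is outside the pattern's character class, so nothing is captured.

```python
import re

# Same idea as the pattern above: text after 'Processor:' up to the next tag.
CPU_PATTERN = re.compile(r'Processor:(?:</strong>)?([\w\s.+]+)(?=<)')

def extract_cpu(text):
    """Return the processor description, or '' when the pattern does not match."""
    match = CPU_PATTERN.search(text)
    return match.group(1).strip() if match else ''

# Hypothetical samples; real listings vary far more than this.
simple = '<li><strong>Processor:</strong> Pentium 4 3.0GHz<br></li>'
awkward = '<li><strong>Processor:</strong> Intel Core 2 Duo @ 2.4 GHz<br></li>'

cpu_simple = extract_cpu(simple)
cpu_awkward = extract_cpu(awkward)
```

Widening the character class would catch the second case, but each new variation tends to need its own adjustment.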

### Reference - developers and publishers, keeping missing rows

In [None]:
', '.join(['word'])

def process_developers_and_publishers(df):
    df = df.copy()
    
    def parse_associates(x):
        try:
            return literal_eval(x)
        except (ValueError, SyntaxError):
            return np.nan
            
    df['developers'] = df['developers'].apply(parse_associates)
    df['publishers'] = df['publishers'].replace("['']", np.nan).apply(parse_associates)
    
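    # Take the first listed name as the primary developer/publisher; missing or empty rows stay NaN.
    df['developer'] = df['developers'].apply(lambda x: x[0] if isinstance(x, list) and x else np.nan)
    df['publisher'] = df['publishers'].apply(lambda x: x[0] if isinstance(x, list) and x else np.nan)
    
    def parse_other_associates(x):
        # Names after the first; NaN when there are none or the row is missing.
        if isinstance(x, list) and len(x) > 1:
            return x[1:]
        return np.nan
    
    df['other_developers'] = df['developers'].apply(parse_other_associates)
    df['other_publishers'] = df['publishers'].apply(parse_other_associates)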

    
    # df['other_developers'] = df['developers'].apply(lambda x: x[1:] if len(x) > 1 else np.nan)
    # df['other_publishers'] = df['publishers'].apply(lambda x: x[1:] if len(x) > 1 else np.nan)

    df = df.drop(['developers', 'publishers'], axis=1)
    
    return df

dev_pub_data = process_developers_and_publishers(steam_data)
dev_pub_data.head()
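
The splitting logic can also be checked in isolation, without a DataFrame. The sketch below is one possible formulation, not the notebook's method. The name `split_associates` is hypothetical, and it uses `None` rather than `np.nan` for missing values so that it needs only the standard library.

```python
from ast import literal_eval

def split_associates(raw):
    """Split a stringified list of names into (primary, others).

    Returns (None, None) when the value is missing, unparsable,
    or contains only empty strings such as "['']".
    """
    try:
        names = literal_eval(raw)
    except (ValueError, SyntaxError):
        # literal_eval raises ValueError for non-string input such as NaN.
        return None, None
    if not isinstance(names, (list, tuple)):
        return None, None
    names = [name for name in names if name]  # drop empty strings
    if not names:
        return None, None
    return names[0], (names[1:] or None)

single = split_associates("['Valve']")
several = split_associates("['Gearbox Software', 'Valve']")
empty = split_associates("['']")
missing = split_associates(float('nan'))
```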