# Steam Data Cleaning (Part 2)

*This is part of a larger series of notebooks on downloading, processing and analysing data from the Steam store. [See all notebooks here.](../notebooks)*

See https://github.com/jbwhit/OSCON-2015/blob/master/develop/2015-07-16-jw-example-notebook-setup.ipynb for local imports

In [1]:
# load extensions and magics

# http://raw.github.com/jrjohansson/version_information/master/version_information.py
%load_ext version_information
%reload_ext version_information

%version_information numpy, pandas

Software,Version
Python,3.7.3 64bit [MSC v.1900 64 bit (AMD64)]
IPython,7.5.0
OS,Windows 10 10.0.17763 SP0
numpy,1.16.3
pandas,0.24.2
Mon Jun 03 15:52:33 2019 GMT Summer Time


# Exports

**TODO**: genre and categories section writeup

Welcome back to the second part in this Steam data cleaning series. Last time we (...)

In [2]:
# standard library imports
from ast import literal_eval
import itertools
import time
import re

# third-party imports
import numpy as np
import pandas as pd

# customisations
pd.set_option("display.max_columns", 100)

## Import and Inspect Data

Picking up where we left off, we import the data exported from part 1 and inspect it.

In [3]:
imported_steam_data = pd.read_csv('../data/exports/steam_clean_part_1.csv')

print('Rows:', imported_steam_data.shape[0])
print('Columns:', imported_steam_data.shape[1])
imported_steam_data.head()

Rows: 28114
Columns: 24


Unnamed: 0,name,steam_appid,required_age,detailed_description,about_the_game,short_description,header_image,website,pc_requirements,mac_requirements,linux_requirements,platforms,categories,genres,screenshots,movies,achievements,release_date,support_info,background,price,english,developer,publisher
0,Counter-Strike,10,0,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,https://steamcdn-a.akamaihd.net/steam/apps/10/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",windows;mac;linux,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'url': 'http://steamcommunity.com/app/10', 'e...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,7.19,1,Valve,Valve
1,Team Fortress Classic,20,0,One of the most popular online action games of...,One of the most popular online action games of...,One of the most popular online action games of...,https://steamcdn-a.akamaihd.net/steam/apps/20/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",windows;mac;linux,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/20/...,3.99,1,Valve,Valve
2,Day of Defeat,30,0,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,https://steamcdn-a.akamaihd.net/steam/apps/30/...,http://www.dayofdefeat.com/,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",windows;mac;linux,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '1 May, 2003'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/30/...,3.99,1,Valve,Valve
3,Deathmatch Classic,40,0,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,https://steamcdn-a.akamaihd.net/steam/apps/40/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",windows;mac;linux,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/40/...,3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,0,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,https://steamcdn-a.akamaihd.net/steam/apps/50/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",windows;mac;linux,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'url': 'https://help.steampowered.com', 'emai...",https://steamcdn-a.akamaihd.net/steam/apps/50/...,3.99,1,Gearbox Software,Valve


Look at the null count to see how we're doing after the first round of cleaning.

In [4]:
imported_steam_data.isnull().sum()

name                       0
steam_appid                0
required_age               0
detailed_description      14
about_the_game            14
short_description         14
header_image               0
website                 9531
pc_requirements            0
mac_requirements           0
linux_requirements         0
platforms                  0
categories               511
genres                    38
screenshots                5
movies                  1782
achievements               0
release_date               0
support_info               0
background                 5
price                      0
english                    0
developer                  0
publisher                158
dtype: int64

Strangely, just by exporting to and importing from csv, 12 null values have appeared in the publisher column. Let's look at some of these rows in the original raw data.

In [5]:
raw_data = pd.read_csv('../data/raw/steam_app_data.csv')

# rows where the raw publisher is a placeholder string
publisher_na = raw_data['publishers'].isin(["['N/A']", "['NA']"])

raw_data.loc[publisher_na, ['name', 'steam_appid', 'publishers']]

Unnamed: 0,name,steam_appid,publishers
4860,Alum,338420,['N/A']
5431,Scribble Space,351450,['N/A']
5949,Freshman Year,364450,['N/A']
7676,Cibele,408120,['N/A']
8858,Fantasy Tales Online,442710,['NA']
9895,Memoir En Code: Reissue,467940,['N/A']
12663,The Morgue Fissure Between Worlds,547150,['N/A']
14712,Kimmy,600660,['N/A']
14863,Night of Terror,604200,['N/A']
23124,Negative World,832130,['N/A']


Interestingly, by handling the data as we did we exposed some hidden null values. Only on re-importing the data were they recognised as actual null values rather than the 'N/A' string (or in one case, the 'NA' string), because `pd.read_csv` treats both of these strings as missing by default. When it comes to defining our `process` function, we'll drop these rows.

Apart from that, it looks like all the null values are in columns we haven't yet cleaned, which is perfect.
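
The hidden null values described above can be reproduced on a tiny example. This is a minimal sketch using made-up csv text rather than the real export: by default `pd.read_csv` converts strings such as 'N/A' and 'NA' to `NaN`, and passing `keep_default_na=False` keeps them as literal strings.

```python
import io

import pandas as pd

# made-up csv text standing in for the exported file (illustrative only)
csv_text = "name,publisher\nAlum,N/A\nFantasy Tales Online,NA\nCounter-Strike,Valve\n"

# default behaviour: 'N/A' and 'NA' are read as missing values
default_read = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False leaves them as ordinary strings
literal_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print('Nulls with default settings:', default_read['publisher'].isnull().sum())
print('Nulls with keep_default_na=False:', literal_read['publisher'].isnull().sum())
```

Here the default read reports two nulls and the second read reports none, which mirrors what happened to the publisher column on re-import.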

## Processing Description Columns

We have a series of columns with descriptive text about each game: `detailed_description`, `about_the_game` and `short_description`. As the column names imply, these provide information about each game in string format. This is great for human readers, but much trickier for machines to work with.

These columns could be used as the basis for an interesting [recommender system](https://en.wikipedia.org/wiki/Recommender_system) or keyword analysis project; however, they are not required in our current project. We'll be removing them as they likely take up large amounts of space, and will only serve to slow down our project.

We'll inspect the columns anyway, in case we find anomalies, and also export just the description data to a separate file, in case we want to use it in a future investigation.

In [6]:
imported_steam_data[['detailed_description', 'about_the_game', 'short_description']].isnull().sum()

detailed_description    14
about_the_game          14
short_description       14
dtype: int64

We have 14 rows with missing data for these columns, and chances are the 14 rows with missing `detailed_description` are the rows with missing `about_the_game` and `short_description` data too. 

By inspecting the individual rows below, we can see that this is true - all rows with missing data in one description column have missing data in the others as well.
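
Besides eyeballing the rows, the overlap can be checked programmatically. The sketch below uses a small toy frame (not the real data) and compares the number of rows with *any* missing description against the number with *all* descriptions missing; if the two counts match, the missing values sit in exactly the same rows.

```python
import numpy as np
import pandas as pd

# toy stand-in for the three description columns (illustrative only)
toy = pd.DataFrame({
    'detailed_description': ['some text', np.nan, np.nan],
    'about_the_game': ['some text', np.nan, np.nan],
    'short_description': ['some text', np.nan, np.nan],
})

null_mask = toy.isnull()

# rows missing at least one description vs rows missing every description
rows_any_null = null_mask.any(axis=1).sum()
rows_all_null = null_mask.all(axis=1).sum()

same_rows = rows_any_null == rows_all_null
print('Missing values line up across columns:', same_rows)
```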

In [7]:
imported_steam_data[imported_steam_data['detailed_description'].isnull()]

Unnamed: 0,name,steam_appid,required_age,detailed_description,about_the_game,short_description,header_image,website,pc_requirements,mac_requirements,linux_requirements,platforms,categories,genres,screenshots,movies,achievements,release_date,support_info,background,price,english,developer,publisher
92,Bejeweled 2 Deluxe,3300,0,,,,https://steamcdn-a.akamaihd.net/steam/apps/330...,,{'minimum': '<p><strong>Minimum Requirements:<...,{'minimum': '<ul>\n\t<li><strong>OS:</strong> ...,[],windows;mac,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/330...,4.25,1,"PopCap Games, Inc.","PopCap Games, Inc."
93,Chuzzle Deluxe,3310,0,,,,https://steamcdn-a.akamaihd.net/steam/apps/331...,,{'minimum': '<p><strong>Minimum Requirements:<...,{'minimum': '<ul>\n\t<li><strong>OS:</strong> ...,[],windows;mac,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/331...,4.25,1,"PopCap Games, Inc.","PopCap Games, Inc."
94,Insaniquarium Deluxe,3320,0,,,,https://steamcdn-a.akamaihd.net/steam/apps/332...,,{'minimum': '<strong>Minimum Requirements:</st...,[],[],windows,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/332...,4.25,1,"PopCap Games, Inc.","PopCap Games, Inc."
96,AstroPop Deluxe,3340,0,,,,https://steamcdn-a.akamaihd.net/steam/apps/334...,,{'minimum': '<p><strong>Minimum Requirements:<...,[],[],windows,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/334...,4.25,1,"PopCap Games, Inc.","PopCap Games, Inc."
97,Bejeweled Deluxe,3350,0,,,,https://steamcdn-a.akamaihd.net/steam/apps/335...,,{'minimum': '<p><strong>Minimum Requirements:<...,[],[],windows,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/335...,4.25,1,"PopCap Games, Inc.","PopCap Games, Inc."
98,Big Money! Deluxe,3360,0,,,,https://steamcdn-a.akamaihd.net/steam/apps/336...,,{'minimum': '<p><strong>Minimum Requirements:<...,[],[],windows,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/336...,4.25,1,"PopCap Games, Inc.","PopCap Games, Inc."
99,Dynomite Deluxe,3380,0,,,,https://steamcdn-a.akamaihd.net/steam/apps/338...,,{'minimum': '<p><strong>Minimum Requirements:<...,[],[],windows,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/338...,4.25,1,"PopCap Games, Inc.","PopCap Games, Inc."
100,Feeding Frenzy 2 Deluxe,3390,0,,,,https://steamcdn-a.akamaihd.net/steam/apps/339...,,{'minimum': '<p><strong>Minimum Requirements:<...,[],[],windows,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/339...,4.25,1,"PopCap Games, Inc.","PopCap Games, Inc."
101,Hammer Heads Deluxe,3400,0,,,,https://steamcdn-a.akamaihd.net/steam/apps/340...,,{'minimum': '<p><strong>Minimum Requirements:<...,[],[],windows,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/340...,4.25,1,"PopCap Games, Inc.","PopCap Games, Inc."
103,Iggle Pop Deluxe,3420,0,,,,https://steamcdn-a.akamaihd.net/steam/apps/342...,,{'minimum': '<p><strong>Minimum Requirements:<...,[],[],windows,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '4', 'description': 'Casual'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '30 Aug, 2006'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/342...,4.25,1,"PopCap Games, Inc.","PopCap Games, Inc."


Interestingly, all of these titles are games from 2006 developed and published by PopCap Games. My best guess is that they were developed previously and all added to the Steam store in one go after Valve allowed third-party titles.

We'll remove these rows, as well as any with a description of 20 characters or fewer, like those below.

In [8]:
imported_steam_data[imported_steam_data['detailed_description'].str.len() <= 20]

Unnamed: 0,name,steam_appid,required_age,detailed_description,about_the_game,short_description,header_image,website,pc_requirements,mac_requirements,linux_requirements,platforms,categories,genres,screenshots,movies,achievements,release_date,support_info,background,price,english,developer,publisher
9883,Penguins Cretins,490990,0,...,...,...,https://steamcdn-a.akamaihd.net/steam/apps/490...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,windows,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,0,"{'coming_soon': False, 'date': '22 Jun, 2016'}","{'url': '', 'email': 'support@hfmgames.net'}",https://steamcdn-a.akamaihd.net/steam/apps/490...,1.69,1,HFM Games,HFM Games
19041,拼词游戏 2017,745840,0,带一点恐怖元素的休闲游戏,带一点恐怖元素的休闲游戏,一款有一点恐怖元素的休闲益智游戏。,https://steamcdn-a.akamaihd.net/steam/apps/745...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],windows;mac,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '25', 'description': 'Adventure'}, {'i...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256699963, 'name': 'alpha', 'thumbnail...",11,"{'coming_soon': False, 'date': '29 Nov, 2017'}","{'url': '', 'email': '12668934@qq.com'}",https://steamcdn-a.akamaihd.net/steam/apps/745...,0.79,0,Mianwotu,Mianwotu
20982,God Test,797660,0,God Test,God Test,God Test,https://steamcdn-a.akamaihd.net/steam/apps/797...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],windows,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}, {'id': ...",,,0,"{'coming_soon': False, 'date': '18 Apr, 2018'}","{'url': '', 'email': 'insanegamedev@outlook.com'}",,0.0,1,God Test,God Test
25149,В поисках Атлантиды,925640,0,Интересная игра,Интересная игра,Atlantis,https://steamcdn-a.akamaihd.net/steam/apps/925...,https://vk.com/atlantisforever,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],windows,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '25', 'description': 'Adventure'}, {'i...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256725871, 'name': 'Game', 'thumbnail'...",1,"{'coming_soon': False, 'date': '1 Nov, 2018'}","{'url': 'https://vk.com/atlantisforever', 'ema...",https://steamcdn-a.akamaihd.net/steam/apps/925...,1.69,0,Dmitr Che,Dmitr Che
25281,东方百问~TouHouAsked,930840,0,Null,Null,Null,https://steamcdn-a.akamaihd.net/steam/apps/930...,https://asked.touhou.ren/,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],windows,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://steamcdn...","[{'id': 256726640, 'name': 'TouHouAsked', 'thu...",2,"{'coming_soon': False, 'date': '7 Oct, 2018'}","{'url': 'https://asked.touhou.ren', 'email': '...",https://steamcdn-a.akamaihd.net/steam/apps/930...,0.79,0,Root Nine Studio,Root Nine Studio


To handle exporting the data to file, we'll write a reusable function which we can call upon for future columns. We will include the `steam_appid` column as it will allow us to match up these rows with rows in our primary data set later on, using a merge (like a join in SQL).
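
To show why keeping `steam_appid` matters, here is a minimal sketch of matching exported rows back up with a primary data set. The two small frames and their placeholder text are invented for illustration; only the `steam_appid` key mirrors our data.

```python
import pandas as pd

# toy primary data set (illustrative only)
main = pd.DataFrame({
    'steam_appid': [10, 20, 30],
    'name': ['Counter-Strike', 'Team Fortress Classic', 'Day of Defeat'],
})

# toy exported descriptions, deliberately missing appid 20
descriptions = pd.DataFrame({
    'steam_appid': [10, 30],
    'short_description': ['Placeholder text A', 'Placeholder text B'],
})

# a left merge keeps every row of `main`, like a LEFT JOIN in SQL
merged = main.merge(descriptions, on='steam_appid', how='left')
print(merged)
```

Rows with no match in the exported file simply get `NaN` in the merged columns, so nothing in the primary data set is lost.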

In [9]:
def export_data(df, filename, prefix='steam_', extension='.csv'):
    """Export dataframe to csv file, filename prepended with 'steam_'.
    
    filename : str without file extension
    """
    filepath = '../data/exports/' + prefix + filename + extension
    print_name = filename.replace('_', ' ')
    
    df.to_csv(filepath, index=False)
    
    print("Exported {} to '{}'".format(print_name, filepath))

We can now define a function to process and export the description columns. Notice we also remove the troublesome publisher rows.

In [10]:
def process_descriptions(df, export=False):
    """Export descriptions to external csv file then remove these columns."""
    # remove rows with missing description data
    df = df[df['detailed_description'].notnull()].copy()
    
    # remove rows with unusually small description
    df = df[df['detailed_description'].str.len() > 20]
    
    # by default we don't export, useful for calling function later
    if export:
        # create dataframe of description columns
        description_data = df[['steam_appid', 'detailed_description', 'about_the_game', 'short_description']]
        
        export_data(description_data, filename='description_data')
    
    # drop description columns from main dataframe
    df = df.drop(['detailed_description', 'about_the_game', 'short_description'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # drop rows with missing publisher and copy
    df = df[df['publisher'].notnull()].copy()
    
    # Process export columns
    df = process_descriptions(df, export=True)
    
    return df


steam_data = process(imported_steam_data)

Exported description data to '../data/exports/steam_description_data.csv'


In [11]:
# inspect exported data
pd.read_csv('../data/exports/steam_description_data.csv').head()

Unnamed: 0,steam_appid,detailed_description,about_the_game,short_description
0,10,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...
1,20,One of the most popular online action games of...,One of the most popular online action games of...,One of the most popular online action games of...
2,30,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...
3,40,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...
4,50,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...


## Processing Media Columns

Similar to the description columns, we have three columns that contain links to various images: `header_image`, `screenshots` and `background`. Whilst we won't be needing this data in this project, it could open the door to some interesting image analysis in the future. We will treat these columns in almost the same way, exporting the contents to a csv file then removing them from the dataset.

Again, let's check for missing values.

In [12]:
image_cols = ['header_image', 'screenshots', 'background']

for col in image_cols:
    print(col+':', steam_data[col].isnull().sum())

steam_data[image_cols].head()

header_image: 0
screenshots: 4
background: 4


Unnamed: 0,header_image,screenshots,background
0,https://steamcdn-a.akamaihd.net/steam/apps/10/...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/10/...
1,https://steamcdn-a.akamaihd.net/steam/apps/20/...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/20/...
2,https://steamcdn-a.akamaihd.net/steam/apps/30/...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/30/...
3,https://steamcdn-a.akamaihd.net/steam/apps/40/...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/40/...
4,https://steamcdn-a.akamaihd.net/steam/apps/50/...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/50/...


As with the description columns, it is likely that the 4 rows with no `screenshots` data are the same rows with no `background` data. There are so few that it is probably safe to remove them.

Before we make up our minds, let's inspect the rows in question. In part 1 of cleaning the data, we wrote a `print_steam_links` function to easily create links from a dataframe. To use it again, we could copy the code and define it here. Instead, we're going to use a handy trick in Jupyter Notebook. If we place the function in a separate Python (.py) file inside the `src` folder, we can tell Python to look there for local modules using `sys.path.append`. Next, we can import the function directly.
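
The `sys.path.append` trick can be tried in isolation. The sketch below is self-contained and uses hypothetical names throughout: it writes a throwaway module called `demo_cleaning` (not the real `datacleaning` module) into a temporary folder, adds that folder to the module search path, then imports from it.

```python
import sys
import tempfile
from pathlib import Path

# create a temporary folder playing the role of '../src/'
src_dir = Path(tempfile.mkdtemp())

# write a small, hypothetical module into that folder
(src_dir / 'demo_cleaning.py').write_text(
    "def make_store_link(appid):\n"
    "    return 'https://store.steampowered.com/app/' + str(appid)\n"
)

# tell Python to also search this folder when importing
sys.path.append(str(src_dir))

from demo_cleaning import make_store_link

print(make_store_link(10))
```

Appending to `sys.path` only affects the current session, so the line needs to run in every notebook that relies on the local module.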

In [13]:
import sys
sys.path.append('../src/')

from datacleaning import print_steam_links

With the `print_steam_links` function now available, we can inspect the rows without screenshots. As we predicted, the rows without screenshots are also the rows without a background. It looks like two are unreleased, and if we'd dealt with the `release_date` column already these would have been removed. One was released recently (5 Jan, 2019) and perhaps didn't have screenshots at the time of downloading, and one simply doesn't have any. As we suspected, it's safe to remove all these rows.

In [14]:
no_screenshots = steam_data[steam_data['screenshots'].isnull()]

print_steam_links(no_screenshots)
no_screenshots

The Light Empire: https://store.steampowered.com/app/416220
Girl and Goblin: https://store.steampowered.com/app/880510
Arida: Backland's Awakening: https://store.steampowered.com/app/907760
Nukalypse: The Final War: https://store.steampowered.com/app/947940


Unnamed: 0,name,steam_appid,required_age,header_image,website,pc_requirements,mac_requirements,linux_requirements,platforms,categories,genres,screenshots,movies,achievements,release_date,support_info,background,price,english,developer,publisher
7525,The Light Empire,416220,0,https://steamcdn-a.akamaihd.net/steam/apps/416...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[],windows,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '23', 'description': 'Indie'}, {'id': ...",,,4,"{'coming_soon': False, 'date': '2 Dec, 2015'}","{'url': '', 'email': 'Jemy.TLE@outlook.com'}",,4.79,1,Jemy,Jemy
23832,Girl and Goblin,880510,0,https://steamcdn-a.akamaihd.net/steam/apps/880...,,{'minimum': '<strong>最低配置:</strong><br><ul cla...,[],[],windows,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '25', 'description': 'Adventure'}, {'i...",,"[{'id': 256739772, 'name': '3', 'thumbnail': '...",1552,"{'coming_soon': False, 'date': '5 Jan, 2019'}","{'url': '', 'email': 'smagician13@yahoo.com'}",,0.79,1,Inverse Game,Inverse Game
24641,Arida: Backland's Awakening,907760,0,https://steamcdn-a.akamaihd.net/steam/apps/907...,http://www.aridagame.com,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,windows;mac,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '25', 'description': 'Adventure'}, {'i...",,"[{'id': 256729551, 'name': 'Teaser Beta 2018',...",0,"{'coming_soon': True, 'date': ''}","{'url': 'http://www.aridagame.com', 'email': '...",,0.0,1,Aoca Game Lab,Aoca Game Lab
25769,Nukalypse: The Final War,947940,0,https://steamcdn-a.akamaihd.net/steam/apps/947...,,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,windows,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...",,"[{'id': 256745274, 'name': 'Nukalypse: The Fin...",0,"{'coming_soon': True, 'date': 'Oct 2019'}","{'url': '', 'email': 'nukalypse@gmail.com'}",,0.0,1,Zion Games Studio,Zion Games Studio


There is also a `movies` column with similar data. Whilst having more missing values, presumably for games without videos, it appears to contain names, thumbnails and links to various videos and trailers. It's unlikely we'll need them but we can include them in the export and remove them from our data set.

In [15]:
steam_data['movies'].isnull().sum()

1746

In [16]:
with pd.option_context("display.max_colwidth", 1000):
    print(steam_data[steam_data['movies'].notnull()]['movies'].head(2))

9                                                                                                                                                                                                                                                                                                                                                         [{'id': 904, 'name': 'Half-Life 2 Trailer', 'thumbnail': 'https://steamcdn-a.akamaihd.net/steam/apps/904/movie.jpg?t=1507237301', 'webm': {'480': 'http://steamcdn-a.akamaihd.net/steam/apps/904/movie480.webm?t=1507237301', 'max': 'http://steamcdn-a.akamaihd.net/steam/apps/904/movie_max.webm?t=1507237301'}, 'highlight': True}, {'id': 5724, 'name': 'Free Yourself', 'thumbnail': 'https://steamcdn-a.akamaihd.net/steam/apps/5724/movie.293x165.jpg?t=1507237311', 'webm': {'480': 'http://steamcdn-a.akamaihd.net/steam/apps/5724/movie480.webm?t=1507237311', 'max': 'http://steamcdn-a.akamaihd.net/steam/apps/5724/movie_max.webm?t=1507237311'}, 'highlight': Fa

We can now put this all together and define a `process_media` function, adding it in to `process` as before.

In [17]:
def process_media(df, export=False):
    """Remove media columns from dataframe, optionally exporting them to csv first."""
    df = df[df['screenshots'].notnull()].copy()
    
    if export:
        media_data = df[['steam_appid', 'header_image', 'screenshots', 'background', 'movies']]
        
        export_data(media_data, 'media_data')
        
    df = df.drop(['header_image', 'screenshots', 'background', 'movies'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # drop rows with missing publisher and copy
    df = df[df['publisher'].notnull()].copy()
    
    # Process export columns
    df = process_descriptions(df)
    df = process_media(df, export=True)
    
    return df


steam_data = process(imported_steam_data)

Exported media data to '../data/exports/steam_media_data.csv'


In [18]:
# inspect exported data
pd.read_csv('../data/exports/steam_media_data.csv').head()

Unnamed: 0,steam_appid,header_image,screenshots,background,movies
0,10,https://steamcdn-a.akamaihd.net/steam/apps/10/...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,
1,20,https://steamcdn-a.akamaihd.net/steam/apps/20/...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/20/...,
2,30,https://steamcdn-a.akamaihd.net/steam/apps/30/...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/30/...,
3,40,https://steamcdn-a.akamaihd.net/steam/apps/40/...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/40/...,
4,50,https://steamcdn-a.akamaihd.net/steam/apps/50/...,"[{'id': 0, 'path_thumbnail': 'https://steamcdn...",https://steamcdn-a.akamaihd.net/steam/apps/50/...,


Before we move on, we can inspect the memory savings of removing these columns by comparing the output of the [DataFrame.info](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) method. If we pass `memory_usage="deep"` we get the true memory usage of each DataFrame. Without this, pandas only estimates the amount used, because of the way object (string) columns are stored under the hood. Essentially, the column holds a list of pointers to the actual Python strings in memory. It's a bit like hiding a bunch of items around the house and keeping a list of where everything is. You couldn't tell the total size of everything just by looking at the list, but you could take a rough guess. Only by following the list and inspecting each individual item could you get an exact figure.

The blog post '[Why Python Is Slow](https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/)' goes into more detail, but all we need to be aware of is that by passing the parameter we ensure we get the true value of memory usage. We also pass `verbose=False` to truncate unnecessary output.
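
The difference between the shallow and deep figures is easy to see on a toy column. This is a minimal sketch using [DataFrame.memory_usage](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.memory_usage.html) rather than `info`, so that the numbers can be compared directly; the column is forced to `object` dtype to match how our string columns are stored.

```python
import pandas as pd

# 100 distinct, fairly long strings stored as Python objects
strings = [str(i) * 200 for i in range(100)]
toy = pd.DataFrame({'text': pd.Series(strings, dtype=object)})

# shallow: counts only the pointers held by the column
shallow = toy.memory_usage(deep=False).sum()

# deep: follows each pointer and measures the string it refers to
deep = toy.memory_usage(deep=True).sum()

print('Shallow bytes:', shallow)
print('Deep bytes:', deep)
```

The deep figure should come out far larger, since it includes the strings themselves rather than just the references to them.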

We can see that already we have shrunk the memory usage from almost 300 MB to just over 60 MB. This is great because in general, the smaller the memory footprint the faster our code will run in future. And of course, we're not finished yet.

In [19]:
print('Imported Data:\n')
imported_steam_data.info(verbose=False, memory_usage="deep")

print('\nData with descriptions and media removed:\n')
steam_data.info(verbose=False, memory_usage="deep")

Imported Data:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28114 entries, 0 to 28113
Columns: 24 entries, name to publisher
dtypes: float64(1), int64(4), object(19)
memory usage: 297.7 MB

Data with descriptions and media removed:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 27933 entries, 0 to 28113
Columns: 17 entries, name to publisher
dtypes: float64(1), int64(4), object(12)
memory usage: 60.3 MB


## Website and Support Info

Next we will look at the `website` and `support_info` columns. Seen below, they both contain links to external websites. The website column is simply stored as a string whereas the support info column is stored as a dictionary of `url` and `email`.

There are a large number of rows with no website listed, and while there are no null values in the `support_info` column, it looks like many will have empty `url` and `email` values inside the data.

For our dataset we'll be dropping both these columns, as they are far too specific to be useful in our analysis. As you may have guessed, we will extract and export this data as we have done before. Even if it isn't useful now, it could be interesting at a later date.

In [20]:
print('website null counts:', imported_steam_data['website'].isnull().sum())
print('support_info null counts:', imported_steam_data['support_info'].isnull().sum())

with pd.option_context("display.max_colwidth", 100): # ensures strings not cut short
    display(imported_steam_data[['name', 'website', 'support_info']][75:80])

website null counts: 9531
support_info null counts: 0


Unnamed: 0,name,website,support_info
75,X3: Reunion,http://www.egosoft.com/games/x3/info_en.php,"{'url': '', 'email': ''}"
76,X3: Terran Conflict,http://www.egosoft.com/games/x3tc/info_en.php,"{'url': '', 'email': 'info@egosoft.com'}"
77,X: Beyond the Frontier,http://www.egosoft.com/games/x/info_en.php,"{'url': '', 'email': ''}"
78,X: Tension,http://www.egosoft.com/games/x_tension/info_en.php,"{'url': '', 'email': ''}"
79,X Rebirth,http://www.egosoft.com/games/x_rebirth/info_en.php,"{'url': 'http://www.egosoft.com/support/index_en.php', 'email': 'info@egosoft.com'}"


We're going to split the support info into two separate columns. We'll keep all the code that parses the columns inside the export `if` statement, so it only runs if we wish to export to csv. We don't need to worry that the rows with missing website data contain `NaN` whereas the other two columns contain a blank string (`''`) for missing data, as once we have exported to csv they will be represented the same way.
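
The claim that `NaN` and `''` end up looking the same after export can be checked quickly. This is a small sketch, assuming a made-up frame and an in-memory buffer in place of a real file; it is only meant to illustrate the round trip, not the actual export.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy frame: one missing website as NaN, one as an empty string.
toy = pd.DataFrame({
    'steam_appid': [10, 20, 30],
    'website': [np.nan, '', 'http://www.half-life.com/'],
})

# Write to an in-memory buffer instead of a file, then read it straight back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

# Both kinds of missing value come back as null.
print(round_trip['website'].isnull().tolist())
```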

In [21]:
def process_info(df, export=False):
    """Drop support information from dataframe, optionally exporting beforehand."""
    if export:
        support_info = df[['steam_appid', 'website', 'support_info']].copy()
        
        support_info['support_info'] = support_info['support_info'].apply(literal_eval)
        support_info['support_url'] = support_info['support_info'].apply(lambda x: x['url'])
        support_info['support_email'] = support_info['support_info'].apply(lambda x: x['email'])
        
        support_info = support_info.drop('support_info', axis=1)
        
        support_info = support_info[(support_info['website'].notnull()) | (support_info['support_url'] != '') | (support_info['support_email'] != '')]

        export_data(support_info, 'support_info')
    
    df = df.drop(['website', 'support_info'], axis=1)
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # drop rows with missing publisher and copy
    df = df[df['publisher'].notnull()].copy()
    
    # Process export columns
    df = process_descriptions(df)
    df = process_media(df)
    df = process_info(df, export=True)
    
    return df


steam_data = process(imported_steam_data)

Exported support info to '../data/exports/steam_support_info.csv'


In [22]:
# inspect exported file
pd.read_csv('../data/exports/steam_support_info.csv').head()

Unnamed: 0,steam_appid,website,support_url,support_email
0,10,,http://steamcommunity.com/app/10,
1,30,http://www.dayofdefeat.com/,,
2,50,,https://help.steampowered.com,
3,70,http://www.half-life.com/,http://steamcommunity.com/app/70,
4,80,,http://steamcommunity.com/app/80,
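
To make the parsing and filtering steps inside `process_info` easier to follow, here is a minimal sketch on a made-up three-row frame. The rows and the `example.com` addresses are invented for illustration; only the column names and the logic mirror the function above.

```python
from ast import literal_eval

import numpy as np
import pandas as pd

# Hypothetical rows: no contact details, website only, support url only.
toy = pd.DataFrame({
    'steam_appid': [1, 2, 3],
    'website': [np.nan, 'http://example.com/', np.nan],
    'support_info': [
        "{'url': '', 'email': ''}",
        "{'url': '', 'email': ''}",
        "{'url': 'http://example.com/support', 'email': ''}",
    ],
})

# Turn each string into a real dictionary, then pull out the two keys.
parsed = toy['support_info'].apply(literal_eval)
toy['support_url'] = parsed.apply(lambda d: d['url'])
toy['support_email'] = parsed.apply(lambda d: d['email'])

# Keep a row if at least one of the three contact fields holds something.
has_contact = (
    toy['website'].notnull()
    | (toy['support_url'] != '')
    | (toy['support_email'] != '')
)
kept = toy.loc[has_contact, ['steam_appid', 'website', 'support_url', 'support_email']]

print(kept)
```

Only the first row is dropped, as it has nothing in any of the three fields.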


## System Requirements

At first it looks like we have data for every row.

In [23]:
req_cols = ['pc_requirements', 'mac_requirements', 'linux_requirements']

print('null counts:\n')

for col in req_cols:
    print(col+':', steam_data[col].isnull().sum())

null counts:

pc_requirements: 0
mac_requirements: 0
linux_requirements: 0


However, if we look at the data a little more closely, we see that some rows actually contain an empty list. These won't be counted as null, but once evaluated they won't provide any information, so we can treat them as missing data.

In [24]:
steam_data[['steam_appid', 'pc_requirements', 'mac_requirements', 'linux_requirements']].tail()

Unnamed: 0,steam_appid,pc_requirements,mac_requirements,linux_requirements
28109,1065230,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[]
28110,1065570,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[]
28111,1065650,{'minimum': '<strong>Minimum:</strong><br><ul ...,[],[]
28112,1066700,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,[]
28113,1069460,{'minimum': '<strong>Minimum:</strong><br><ul ...,{'minimum': '<strong>Minimum:</strong><br><ul ...,[]


We can check how many rows in each requirements column have empty lists using a simple boolean filter. The first value of the filtered DataFrame's `shape` attribute is its number of rows, which gives us a count of how many empty lists there are.

In [25]:
print('Empty list counts:\n')

for col in req_cols:
    print(col+':', steam_data[steam_data[col] == '[]'].shape[0])

Empty list counts:

pc_requirements: 13
mac_requirements: 16487
linux_requirements: 19422
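
As an aside, the same count can be obtained by summing the boolean mask directly, since `True` counts as 1. The sketch below uses a made-up three-row frame (not our data) to show that both approaches agree.

```python
import pandas as pd

# Hypothetical toy frame with a few empty-list strings.
toy = pd.DataFrame({
    'pc_requirements': ["{'minimum': 'x'}", '[]', "{'minimum': 'y'}"],
    'mac_requirements': ['[]', '[]', "{'minimum': 'z'}"],
})

# Approach used above: filter the frame, then read the row count from shape.
counts_via_shape = {col: toy[toy[col] == '[]'].shape[0] for col in toy.columns}

# Alternative: sum the boolean mask without building a filtered frame.
counts_via_sum = {col: int((toy[col] == '[]').sum()) for col in toy.columns}

print(counts_via_shape)
print(counts_via_sum)
```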


That's over half of the rows for both mac and linux requirements. That probably means that there is not enough data in these two columns to be useful for our analysis.

It turns out most games are developed solely for windows, with mac and linux ports only becoming more common in recent years. Naturally it would make sense that any games that aren't supported on mac or linux would not have corresponding requirements.

As we have already cleaned our platforms column, we can check how many rows actually have missing data by comparing rows with empty lists in the requirements against the `platforms` column. If a row has an empty list in the requirements column but the corresponding platform (mac or linux) is listed in `platforms`, it means the data is missing.

In [26]:
for col in ['mac_requirements', 'linux_requirements']:
    platform = col.split('_')[0]
    print(platform+':', steam_data[(steam_data[col] == '[]') & (steam_data['platforms'].str.contains(platform))].shape[0])

mac: 134
linux: 155
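
The logic of this cross-check is easier to see on a handful of rows. The following is a sketch on made-up data: an empty list is only a problem when the platform is actually supported.

```python
import pandas as pd

# Hypothetical rows covering the combinations of interest.
toy = pd.DataFrame({
    'platforms': ['windows', 'windows;mac', 'windows;mac;linux', 'windows;mac'],
    'mac_requirements': ['[]', '[]', "{'minimum': 'x'}", "{'minimum': 'y'}"],
})

empty = toy['mac_requirements'] == '[]'
supported = toy['platforms'].str.contains('mac', regex=False)

# Empty and unsupported: expected, nothing is missing.
expected_gaps = int((empty & ~supported).sum())

# Empty but supported: requirements should exist, so the data is missing.
genuinely_missing = int((empty & supported).sum())

print('expected gaps:', expected_gaps)
print('genuinely missing:', genuinely_missing)
```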


Whilst not an insignificant number, this means that the vast majority of rows are as they should be, and we're not looking at too many data errors.

Let's also have a look for missing values in the pc/windows column. We couldn't include it in our previous loop as the columns have different names, something we may wish to change later.

In [27]:
print('windows:', steam_data[(steam_data['pc_requirements'] == '[]') & (steam_data['platforms'].str.contains('windows'))].shape[0])

windows: 9


9 rows have missing system requirements. We can take a look at some of them below, and follow the links to the steam pages to try and discover if anything is amiss.

In [28]:
missing_windows_requirements = steam_data[(steam_data['pc_requirements'] == '[]') & (steam_data['platforms'].str.contains('windows'))]

print_steam_links(missing_windows_requirements[:5])
missing_windows_requirements.head()

Uplink: https://store.steampowered.com/app/1510
Battlestations: Midway: https://store.steampowered.com/app/6870
Grand Theft Auto 2: https://store.steampowered.com/app/12180
Shift 2 Unleashed: https://store.steampowered.com/app/47920
iBomber Defense: https://store.steampowered.com/app/104000


Unnamed: 0,name,steam_appid,required_age,pc_requirements,mac_requirements,linux_requirements,platforms,categories,genres,achievements,release_date,price,english,developer,publisher
31,Uplink,1510,0,[],[],[],windows;mac;linux,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '23', 'description': 'Indie'}, {'id': ...",0,"{'coming_soon': False, 'date': '23 Aug, 2006'}",6.99,1,Introversion Software,Introversion Software
191,Battlestations: Midway,6870,0,[],[],[],windows,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}]",0,"{'coming_soon': False, 'date': '15 Mar, 2007'}",4.99,1,Eidos Interactive,Square Enix
314,Grand Theft Auto 2,12180,0,[],[],[],windows,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}]",0,"{'coming_soon': False, 'date': '4 Jan, 2008'}",0.0,1,Rockstar North,Rockstar Games
931,Shift 2 Unleashed,47920,0,[],[],[],windows,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '9', 'description': 'Racing'}]",0,"{'coming_soon': False, 'date': '31 Mar, 2011'}",19.99,1,Slightly Mad Studios,Electronic Arts
1165,iBomber Defense,104000,0,[],[],[],windows;mac,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '23', 'description': 'Indie'}, {'id': ...",22,"{'coming_soon': False, 'date': '26 May, 2011'}",2.99,1,Cobra Mobile,Cobra Mobile


There doesn't appear to be any common issue in these rows - some of the games are quite old but that's about it. It may simply be that no requirements were supplied when the games were added to the steam store.

Let's say that the fictional company we're doing analysis for is interested in developing for windows only. We can also assume that a cross-platform game will have similar hardware requirements on each platform it supports. With this in mind we can safely drop both the mac and linux requirements columns, as we already know which games support these operating systems from our cleaned `platforms` column. That means we can focus on the `pc_requirements` column, which has information for almost every game in our data.

Now we will take a look at a couple of rows from the dataset to see how the data is stored.

In [29]:
display(steam_data['pc_requirements'].iloc[0])
display(steam_data['pc_requirements'].iloc[2000])
display(steam_data['pc_requirements'].iloc[15000])

"{'minimum': '\\r\\n\\t\\t\\t<p><strong>Minimum:</strong> 500 mhz processor, 96mb ram, 16mb video card, Windows XP, Mouse, Keyboard, Internet Connection<br /></p>\\r\\n\\t\\t\\t<p><strong>Recommended:</strong> 800 mhz processor, 128mb ram, 32mb+ video card, Windows XP, Mouse, Keyboard, Internet Connection<br /></p>\\r\\n\\t\\t\\t'}"

'{\'minimum\': \'<strong>Minimum:</strong><br><ul class="bb_ul"><li><strong>OS:</strong> Windows XP or higher<br></li><li><strong>Processor:</strong> 1 GHz<br></li><li><strong>Memory:</strong> 512 MB RAM<br></li><li><strong>Graphics:</strong> OpenGL compatible graphics chip<br></li><li><strong>Storage:</strong> 2 GB available space</li></ul>\'}'

'{\'minimum\': \'<strong>Minimum:</strong><br><ul class="bb_ul"><li><strong>OS:</strong> Windows 10<br></li><li><strong>Processor:</strong> Intel Core i7<br></li><li><strong>Memory:</strong> 8 GB RAM<br></li><li><strong>Graphics:</strong> GTX 1070 or equivalent<br></li><li><strong>DirectX:</strong> Version 12<br></li><li><strong>Storage:</strong> 20 GB available space</li></ul>\'}'

In short: it's a mess. It looks like the data is stored as a dictionary, as we've seen before. There is definitely a key for 'minimum', but apart from that it is hard to see at a glance. The strings are full of HTML formatting, which is presumably parsed to display the information on the website. It also looks like there are different categories like Processor and Memory for some, but not all, rows.

Let's take a stab at cleaning out some of the unnecessary formatting and see if it becomes clearer.

By creating a series from a selection of rows, we can easily and quickly make changes using the pandas `.str` accessor, allowing us to use python string methods and regular expressions.

In [30]:
view_requirements = steam_data['pc_requirements'].iloc[[0, 2000, 15000]].copy()

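# regex=True is passed explicitly, as newer pandas versions no longer treat patterns as regex by default
view_requirements = (view_requirements
                         .str.replace(r'\\[rtn]', '', regex=True)        # literal \r, \t and \n sequences
                         .str.replace(r'<[pbr]{1,2}>', ' ', regex=True)  # <p> and <br> tags become spaces
                         .str.replace(r'<[\/"=\w\s]+>', '', regex=True)  # any remaining html tags
                    )

# .items() replaces .iteritems(), which has been removed from newer pandas versions
for i, row in view_requirements.items():
    display(row)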

"{'minimum': ' Minimum: 500 mhz processor, 96mb ram, 16mb video card, Windows XP, Mouse, Keyboard, Internet Connection Recommended: 800 mhz processor, 128mb ram, 32mb+ video card, Windows XP, Mouse, Keyboard, Internet Connection'}"

"{'minimum': 'Minimum: OS: Windows XP or higher Processor: 1 GHz Memory: 512 MB RAM Graphics: OpenGL compatible graphics chip Storage: 2 GB available space'}"

"{'minimum': 'Minimum: OS: Windows 10 Processor: Intel Core i7 Memory: 8 GB RAM Graphics: GTX 1070 or equivalent DirectX: Version 12 Storage: 20 GB available space'}"

We can now see more clearly the contents and structure of these rows. Some rows have both Minimum and Recommended requirements inside a 'minimum' key, some have separate 'minimum' and 'recommended' keys. Some have headings like 'Processor:' and 'Storage:' before various components, others simply have a list of components. Some state particular speeds for components, like a 2 GHz CPU, whereas others name specific models, like 'Intel Core 2 Duo'.
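
To see what each of the three substitutions contributes, they can be applied one at a time with the standard `re` module. This is a sketch on two shortened, made-up strings modelled on the rows above; `clean_steps` is a hypothetical helper, not part of our processing code.

```python
import re


def clean_steps(text):
    """Apply the three substitutions in order and return every intermediate result."""
    # 1. Remove literal backslash escapes such as \r, \n and \t.
    no_escapes = re.sub(r'\\[rtn]', '', text)
    # 2. Turn <p>, <br> and similar one or two letter tags into spaces.
    spaced = re.sub(r'<[pbr]{1,2}>', ' ', no_escapes)
    # 3. Remove every remaining tag, including closing tags and attributes.
    no_tags = re.sub(r'<[\/"=\w\s]+>', '', spaced)
    return no_escapes, spaced, no_tags


# Raw string, so the backslashes are literal characters as in the stored data.
old_style = r'\r\n\t<p><strong>Minimum:</strong> 500 mhz processor<br /></p>'
new_style = (
    '<strong>Minimum:</strong><br><ul class="bb_ul">'
    '<li><strong>OS:</strong> Windows XP<br></li>'
    '<li><strong>Memory:</strong> 512 MB RAM</li></ul>'
)

for sample in (old_style, new_style):
    for step in clean_steps(sample):
        print(repr(step))
    print()
```

Note that the older style keeps a leading space where the opening `<p>` tag used to be, which matches the first row in the output above.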

It seems like it would be possible to extract individual component information from this data; however, it would be a lengthy and complex process requiring the handling of many exceptions and individual cases. Whilst we may wish to tackle this in the future, as it could provide an interesting window into how the demands of gaming have changed over the years, it won't necessarily provide us with useful information for our current objectives.
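
As a rough illustration of both the idea and the difficulty, here is a sketch that only handles the tidy case, where every component is introduced by a single-word heading followed by a colon. The `split_headed_fields` helper and its pattern are assumptions made for this example and are not used anywhere else in the notebook.

```python
import re


def split_headed_fields(text):
    """Split 'Heading: value Heading: value ...' into a dictionary.

    Assumes every heading is a single word followed by ': '.
    """
    # Capture a heading, then lazily capture text up to the next heading or the end.
    pairs = re.findall(r'(\w+): (.*?)(?= \w+:|$)', text)
    return dict(pairs)


tidy = ('OS: Windows 10 Processor: Intel Core i7 Memory: 8 GB RAM '
        'Graphics: GTX 1070 or equivalent DirectX: Version 12 '
        'Storage: 20 GB available space')
untidy = '500 mhz processor, 96mb ram, 16mb video card, Windows XP'

print(split_headed_fields(tidy))
print(split_headed_fields(untidy))
```

The tidy row splits cleanly into six components, but the older, heading-free row produces an empty dictionary, and multi-word headings would also be mishandled. Covering those cases is where most of the effort would go.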

With that in mind, it seems best to proceed by cleaning the data slightly so it is readable, exporting to an external csv for future use, then removing the columns from our dataframe.

In [31]:
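def process_requirements(df, export=False):
    """Drop system requirements from dataframe, optionally exporting pc requirements beforehand."""
    if export:
        requirements = df[['steam_appid', 'pc_requirements']].copy()
        
        # remove rows with no requirements data
        requirements = requirements[requirements['pc_requirements'] != '[]']
        
        # strip escape sequences and html tags (regex=True made explicit for newer pandas versions)
        requirements['requirements_clean'] = (requirements['pc_requirements']
                                                  .str.replace(r'\\[rtn]', '', regex=True)
                                                  .str.replace(r'<[pbr]{1,2}>', ' ', regex=True)
                                                  .str.replace(r'<[\/"=\w\s]+>', '', regex=True)
                                             )
        
        requirements['requirements_clean'] = requirements['requirements_clean'].apply(literal_eval)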
        
        requirements['minimum'] = requirements['requirements_clean'].apply(lambda x: x['minimum'].replace('Minimum:', '').strip() if 'minimum' in x.keys() else np.nan)
        requirements['recommended'] = requirements['requirements_clean'].apply(lambda x: x['recommended'].replace('Recommended:', '').strip() if 'recommended' in x.keys() else np.nan)
        
        requirements = requirements.drop('requirements_clean', axis=1)
        
        export_data(requirements, 'requirements_data')
        
    df = df.drop(['pc_requirements', 'mac_requirements', 'linux_requirements'], axis=1)
    
    return df

def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # drop rows with missing publisher and copy
    df = df[df['publisher'].notnull()].copy()
    
    # Process export columns
    df = process_descriptions(df)
    df = process_media(df)
    df = process_info(df)
    df = process_requirements(df, export=True)
    
    return df


steam_data = process(imported_steam_data)

Exported requirements data to '../data/exports/steam_requirements_data.csv'


In [32]:
# verify export
pd.read_csv('../data/exports/steam_requirements_data.csv').head()

Unnamed: 0,steam_appid,pc_requirements,minimum,recommended
0,10,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,"500 mhz processor, 96mb ram, 16mb video card, ...",
1,20,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,"500 mhz processor, 96mb ram, 16mb video card, ...",
2,30,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,"500 mhz processor, 96mb ram, 16mb video card, ...",
3,40,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,"500 mhz processor, 96mb ram, 16mb video card, ...",
4,50,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,"500 mhz processor, 96mb ram, 16mb video card, ...",
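
The step that splits the cleaned dictionary into `minimum` and `recommended` can also be tried in isolation. This is a minimal sketch using two made-up input strings; `split_requirements` is a hypothetical stand-alone version of the two lambdas in `process_requirements`.

```python
from ast import literal_eval

import numpy as np


def split_requirements(text):
    """Return (minimum, recommended) from a cleaned requirements string."""
    parsed = literal_eval(text)
    minimum = (parsed['minimum'].replace('Minimum:', '').strip()
               if 'minimum' in parsed else np.nan)
    recommended = (parsed['recommended'].replace('Recommended:', '').strip()
                   if 'recommended' in parsed else np.nan)
    return minimum, recommended


# Hypothetical cleaned strings, one with both keys and one with 'minimum' only.
both = "{'minimum': 'Minimum: OS: Windows 7', 'recommended': 'Recommended: OS: Windows 10'}"
only_min = "{'minimum': 'Minimum: OS: Windows XP or higher'}"

print(split_requirements(both))
print(split_requirements(only_min))
```

Rows like the first five in the export, where the recommended specs sit inside the 'minimum' key, will still carry the 'Recommended:' text in the `minimum` column, as only the 'Minimum:' label is removed from it.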


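## Processing Categories and Genres

Both of these columns have some missing values. Before deciding whether to drop those rows, we check how many are affected and what kind of products they are.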

In [33]:
print(steam_data['categories'].isnull().sum())

509


In [34]:
print(steam_data['categories'][0])

with pd.option_context("display.max_colwidth", 1000):
    display(steam_data['categories'].head())

[{'id': 1, 'description': 'Multi-player'}, {'id': 36, 'description': 'Online Multi-Player'}, {'id': 37, 'description': 'Local Multi-Player'}, {'id': 8, 'description': 'Valve Anti-Cheat enabled'}]


0    [{'id': 1, 'description': 'Multi-player'}, {'id': 36, 'description': 'Online Multi-Player'}, {'id': 37, 'description': 'Local Multi-Player'}, {'id': 8, 'description': 'Valve Anti-Cheat enabled'}]
1    [{'id': 1, 'description': 'Multi-player'}, {'id': 36, 'description': 'Online Multi-Player'}, {'id': 37, 'description': 'Local Multi-Player'}, {'id': 8, 'description': 'Valve Anti-Cheat enabled'}]
2                                                                                                       [{'id': 1, 'description': 'Multi-player'}, {'id': 8, 'description': 'Valve Anti-Cheat enabled'}]
3    [{'id': 1, 'description': 'Multi-player'}, {'id': 36, 'description': 'Online Multi-Player'}, {'id': 37, 'description': 'Local Multi-Player'}, {'id': 8, 'description': 'Valve Anti-Cheat enabled'}]
4                                                            [{'id': 2, 'description': 'Single-player'}, {'id': 1, 'description': 'Multi-player'}, {'id': 8, 'description': 'Valve Anti-Cheat enable

In [35]:
print_steam_links(steam_data[steam_data['categories'].isnull()].tail(20))

MOTiON by RADiCAL: https://store.steampowered.com/app/999900
The Marvellous Machine: https://store.steampowered.com/app/1000510
iDancer: https://store.steampowered.com/app/1004740
SubnetPing: https://store.steampowered.com/app/1008160
YouTube Center: https://store.steampowered.com/app/1009330
Discord Bot - Controls: https://store.steampowered.com/app/1010170
Wallpaper Maker （造物主视频桌面）: https://store.steampowered.com/app/1010800
Nero GameVR: https://store.steampowered.com/app/1011110
Greenland Melting: https://store.steampowered.com/app/1012510
VEGAS Movie Studio 16 Steam Edition: https://store.steampowered.com/app/1016810
VEGAS Movie Studio 16 Platinum Steam Edition: https://store.steampowered.com/app/1016840
Planet Evolution PC Live Wallpaper: https://store.steampowered.com/app/1017060
Screenbits - Screen Recorder: https://store.steampowered.com/app/1018680
Wondershare Video Converter Ultimate: https://store.steampowered.com/app/1025020
ACID Music Studio 11 Steam Edition: https://store

In [36]:
print(steam_data['genres'].isnull().sum())

37


In [37]:
print(steam_data['genres'][0])

with pd.option_context("display.max_colwidth", 1000):
    display(steam_data['genres'].iloc[100:105])

[{'id': '1', 'description': 'Action'}]


116    [{'id': '2', 'description': 'Strategy'}, {'id': '4', 'description': 'Casual'}]
117                                            [{'id': '4', 'description': 'Casual'}]
118                                            [{'id': '4', 'description': 'Casual'}]
119                                          [{'id': '2', 'description': 'Strategy'}]
120                                            [{'id': '4', 'description': 'Casual'}]
Name: genres, dtype: object

In [38]:
print_steam_links(steam_data[steam_data['genres'].isnull()].head(10))
print_steam_links(steam_data[steam_data['genres'].isnull()].tail(10))

Hot Dish: https://store.steampowered.com/app/12570
Dr. Daisy Pet Vet: https://store.steampowered.com/app/12580
Call of Cthulhu®: Dark Corners of the Earth: https://store.steampowered.com/app/22340
Super Granny Collection: https://store.steampowered.com/app/36270
Sacrifice: https://store.steampowered.com/app/38440
Nancy Drew® Dossier: Resorting to Danger!: https://store.steampowered.com/app/42200
Air Forte: https://store.steampowered.com/app/55020
Sonic Adventure DX: https://store.steampowered.com/app/71250
Portal 2 - The Final Hours: https://store.steampowered.com/app/104600
Sonic CD: https://store.steampowered.com/app/200940
EatWell: https://store.steampowered.com/app/678870
No Lights: https://store.steampowered.com/app/682910
Cyborg Arena: https://store.steampowered.com/app/706440
M.I.A. - Overture: https://store.steampowered.com/app/712060
VEHICLES FURY: https://store.steampowered.com/app/749290
The Big Three: https://store.steampowered.com/app/823390
BlueberryNOVA: https://store.st

In [39]:
steam_data[(steam_data['genres'].isnull()) | (steam_data['categories'].isnull())]

Unnamed: 0,name,steam_appid,required_age,platforms,categories,genres,achievements,release_date,price,english,developer,publisher
338,Hot Dish,12570,0,windows,"[{'id': 2, 'description': 'Single-player'}]",,0,"{'coming_soon': False, 'date': '29 Jul, 2008'}",5.99,1,Zemnott,ValuSoft
339,Dr. Daisy Pet Vet,12580,0,windows,"[{'id': 2, 'description': 'Single-player'}]",,0,"{'coming_soon': False, 'date': '29 Jul, 2008'}",5.99,1,Zemnott,ValuSoft
366,Tom Clancy's Ghost Recon® Island Thunder™,13630,0,windows,,"[{'id': '1', 'description': 'Action'}]",0,"{'coming_soon': False, 'date': '15 Jul, 2008'}",4.29,1,Red Storm Entertainment,Ubisoft
508,Call of Cthulhu®: Dark Corners of the Earth,22340,0,windows,"[{'id': 2, 'description': 'Single-player'}]",,0,"{'coming_soon': False, 'date': '16 Jun, 2009'}",3.99,1,Headfirst Productions,Bethesda Softworks
709,Westward Collection,36150,0,windows,,"[{'id': '4', 'description': 'Casual'}]",0,"{'coming_soon': False, 'date': '17 Jul, 2009'}",10.99,1,Sandlot Games,Sandlot Games
713,Super Granny Collection,36270,0,windows,"[{'id': 2, 'description': 'Single-player'}]",,0,"{'coming_soon': False, 'date': '17 Jul, 2009'}",10.99,1,Sandlot Games,Sandlot Games
766,Sacrifice,38440,0,windows,"[{'id': 2, 'description': 'Single-player'}]",,0,"{'coming_soon': False, 'date': '19 Aug, 2009'}",6.99,1,Shiny Entertainment,Interplay Inc.
785,Painkiller: Black Edition,39530,0,windows,,"[{'id': '1', 'description': 'Action'}]",0,"{'coming_soon': False, 'date': '24 Jan, 2007'}",8.99,1,People Can Fly,THQ Nordic
837,Nancy Drew® Dossier: Resorting to Danger!,42200,0,windows,"[{'id': 2, 'description': 'Single-player'}]",,0,"{'coming_soon': False, 'date': '19 Nov, 2009'}",5.19,1,HeR Interactive,HeR Interactive
936,Might & Magic: Heroes VI,48220,0,windows,,"[{'id': '3', 'description': 'RPG'}, {'id': '2'...",0,"{'coming_soon': False, 'date': '13 Oct, 2011'}",16.99,1,Blackhole,Ubisoft


In [40]:
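def process_categories_and_genres(df, export=False):
    """Drop rows with missing genres or categories, then flatten both columns to semicolon-delimited strings."""
    df = df[df['genres'].notnull()].copy()
    df = df[df['categories'].notnull()].copy()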
    
    df['genres'] = df['genres'].apply(lambda x: ';'.join(item['description'] for item in literal_eval(x)))
    df['categories'] = df['categories'].apply(lambda x: ';'.join(item['description'] for item in literal_eval(x)))
    
    return df


def process(df):
    """Process data set. Will eventually contain calls to all functions we write."""
    
    # drop rows with missing publisher and copy
    df = df[df['publisher'].notnull()].copy()
    
    # Process export columns
    df = process_descriptions(df)
    df = process_media(df)
    df = process_info(df)
    df = process_requirements(df)
    df = process_categories_and_genres(df)
    
    return df


steam_data = process(imported_steam_data)
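
The conversion applied to both columns can be seen on a single value. The sketch below uses a shortened categories string in the same format as the data; `join_descriptions` is a hypothetical named version of the lambda used above.

```python
from ast import literal_eval


def join_descriptions(text):
    """Turn a stringified list of {'id', 'description'} dicts into 'a;b;c'."""
    return ';'.join(item['description'] for item in literal_eval(text))


categories = ("[{'id': 1, 'description': 'Multi-player'}, "
              "{'id': 8, 'description': 'Valve Anti-Cheat enabled'}]")

joined = join_descriptions(categories)
print(joined)

# The individual descriptions can be recovered later by splitting on the delimiter.
recovered = joined.split(';')
print(recovered)
```

The ids are discarded, so this format only works as long as the descriptions themselves never contain a semicolon.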

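## Export

With every column in this part processed, we take a final look at the data, check for any remaining null values, and write the result to csv.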



In [41]:
steam_data.head()

Unnamed: 0,name,steam_appid,required_age,platforms,categories,genres,achievements,release_date,price,english,developer,publisher
0,Counter-Strike,10,0,windows;mac;linux,Multi-player;Online Multi-Player;Local Multi-P...,Action,0,"{'coming_soon': False, 'date': '1 Nov, 2000'}",7.19,1,Valve,Valve
1,Team Fortress Classic,20,0,windows;mac;linux,Multi-player;Online Multi-Player;Local Multi-P...,Action,0,"{'coming_soon': False, 'date': '1 Apr, 1999'}",3.99,1,Valve,Valve
2,Day of Defeat,30,0,windows;mac;linux,Multi-player;Valve Anti-Cheat enabled,Action,0,"{'coming_soon': False, 'date': '1 May, 2003'}",3.99,1,Valve,Valve
3,Deathmatch Classic,40,0,windows;mac;linux,Multi-player;Online Multi-Player;Local Multi-P...,Action,0,"{'coming_soon': False, 'date': '1 Jun, 2001'}",3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,0,windows;mac;linux,Single-player;Multi-player;Valve Anti-Cheat en...,Action,0,"{'coming_soon': False, 'date': '1 Nov, 1999'}",3.99,1,Gearbox Software,Valve


In [42]:
steam_data.isnull().sum()

name            0
steam_appid     0
required_age    0
platforms       0
categories      0
genres          0
achievements    0
release_date    0
price           0
english         0
developer       0
publisher       0
dtype: int64

In [43]:
steam_data.to_csv("../data/exports/steam_clean_part_2.csv", index=False)