# Data Cleaning

The downloaded data is not yet in a usable state. Many of the columns contain lengthy strings or missing values, both of which hinder analysis and, in particular, any machine learning techniques we may wish to apply.

The main aims of this project are to investigate various sales and play-time statistics for games from the Steam store, and to see how different features of games may affect their success. Keeping this in mind will help inform our decisions about how to handle the various columns in our data set. However, it may be worth keeping columns that do not seem useful to this particular project, in order to provide a robust data set for future analysis projects.

To begin with, we'll import our libraries and set some options, then take a look at the data downloaded from the Steam API. Once that is taken care of, we will move on to the SteamSpy data and repeat the process. By the end we should have clean data sets to use in the next step: exploratory analysis and visualisation.

### Aims:
- Improve functions
- Prepare notebook for delivery

### (Raw) Data Dictionary

Sort out data dictionary  

API and data dictionary:
https://steamspy.com/api.php

### Future ideas:
- pc requirements analysis over time
- picture analysis
- keyword/recommender analysis
- categories could make table in a database all on its own, perhaps in future
- for genres (and categories?) could create main genre, selected from list of key genres, allowing hybrids like action_adventure if contains both
- remove titles over £60/100?

In [1]:
# load extensions and magics

# http://raw.github.com/jrjohansson/version_information/master/version_information.py
%load_ext version_information
%reload_ext version_information

%version_information numpy, pandas

Software,Version
Python,3.7.3 64bit [MSC v.1915 64 bit (AMD64)]
IPython,7.5.0
OS,Windows 10 10.0.17763 SP0
numpy,1.16.3
pandas,0.24.2
Thu May 09 19:58:30 2019 GMT Summer Time


In [2]:
# import libraries
from ast import literal_eval
import itertools
import time
import re

import numpy as np
import pandas as pd

In [3]:
# customisations
pd.set_option("display.max_columns", 100)
# pd.reset_option("max_columns")

## Cleaning steam data

### Import Data

We begin by importing the raw Steam data we generated previously during data collection. The data collection notebook can be viewed by following the link to `../deliver/1-data-collection.ipynb` below.

In [4]:
from IPython.display import FileLink
FileLink("../deliver/1-data-collection.ipynb")

For a full explanation of the data cleaning process, follow the next link to `../deliver/2-data-cleaning-steam-explanation.ipynb`.

In [8]:
FileLink("../deliver/2-data-cleaning-steam-explanation.ipynb")

In [5]:
raw_steam_data = pd.read_csv('../data/raw/steam_app_data.csv')

print('Rows:', raw_steam_data.shape[0])
print('Columns:', raw_steam_data.shape[1])
raw_steam_data.head()

Rows: 29235
Columns: 39


Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,fullgame,supported_languages,header_image,website,pc_requirements,mac_requirements,linux_requirements,legal_notice,drm_notice,ext_user_account_notice,developers,publishers,demos,price_overview,packages,package_groups,platforms,metacritic,reviews,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
0,game,Counter-Strike,10,0.0,False,,,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,,"English<strong>*</strong>, French<strong>*</st...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'GBP', 'initial': 719, 'final': 7...",[7],"[{'name': 'default', 'title': 'Buy Counter-Str...","{'windows': True, 'mac': True, 'linux': True}","{'score': 88, 'url': 'https://www.metacritic.c...",,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 65735},{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'url': 'http://steamcommunity.com/app/10', 'e...",https://steamcdn-a.akamaihd.net/steam/apps/10/...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
1,game,Team Fortress Classic,20,0.0,False,,,One of the most popular online action games of...,One of the most popular online action games of...,One of the most popular online action games of...,,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/20/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'GBP', 'initial': 399, 'final': 3...",[29],"[{'name': 'default', 'title': 'Buy Team Fortre...","{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 2802},{'total': 0},"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/20/...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
2,game,Day of Defeat,30,0.0,False,,,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,,"English, French, German, Italian, Spanish - Spain",https://steamcdn-a.akamaihd.net/steam/apps/30/...,http://www.dayofdefeat.com/,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'GBP', 'initial': 399, 'final': 3...",[30],"[{'name': 'default', 'title': 'Buy Day of Defe...","{'windows': True, 'mac': True, 'linux': True}","{'score': 79, 'url': 'https://www.metacritic.c...",,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 1992},{'total': 0},"{'coming_soon': False, 'date': '1 May, 2003'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/30/...,"{'ids': [], 'notes': None}"
3,game,Deathmatch Classic,40,0.0,False,,,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,,"English, French, German, Italian, Spanish - Sp...",https://steamcdn-a.akamaihd.net/steam/apps/40/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Valve'],['Valve'],,"{'currency': 'GBP', 'initial': 399, 'final': 3...",[31],"[{'name': 'default', 'title': 'Buy Deathmatch ...","{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 931},{'total': 0},"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'url': '', 'email': ''}",https://steamcdn-a.akamaihd.net/steam/apps/40/...,"{'ids': [], 'notes': None}"
4,game,Half-Life: Opposing Force,50,0.0,False,,,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,,"English, French, German, Korean",https://steamcdn-a.akamaihd.net/steam/apps/50/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,{'minimum': 'Minimum: OS X Snow Leopard 10.6....,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual...",,,,['Gearbox Software'],['Valve'],,"{'currency': 'GBP', 'initial': 399, 'final': 3...",[32],"[{'name': 'default', 'title': 'Buy Half-Life: ...","{'windows': True, 'mac': True, 'linux': True}",,,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://steamcdn...",,{'total': 4355},{'total': 0},"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'url': 'https://help.steampowered.com', 'emai...",https://steamcdn-a.akamaihd.net/steam/apps/50/...,"{'ids': [], 'notes': None}"
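
Many of the columns above (for example `platforms`, `price_overview` and `release_date`) hold Python dictionaries or lists stored as strings. Below is a minimal sketch of how such values can be parsed with `ast.literal_eval`, using a toy frame. The helper name `safe_eval` and the sample values are illustrative assumptions, not part of the project code:

```python
from ast import literal_eval

import numpy as np
import pandas as pd


def safe_eval(value, default=None):
    """Parse a stringified Python literal, returning `default` for missing or malformed input."""
    if not isinstance(value, str):
        return default  # a missing value read by read_csv is a float NaN, not a string
    try:
        return literal_eval(value)
    except (ValueError, SyntaxError):
        return default


# Toy data (made up), shaped like the platforms column
toy = pd.DataFrame({
    'steam_appid': [10, 20, 30],
    'platforms': [
        "{'windows': True, 'mac': True, 'linux': True}",
        "{'windows': True, 'mac': False, 'linux': False}",
        np.nan,
    ],
})

parsed = toy['platforms'].apply(safe_eval)

# Index into the parsed dictionaries, keeping NaN where nothing could be parsed
toy['mac'] = parsed.apply(lambda d: int(d['mac']) if d is not None else np.nan)
print(toy[['steam_appid', 'mac']])
```

Checking the type before parsing keeps missing values separate from malformed strings, so rows can be dropped or filled deliberately afterwards.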


### Define functions
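
One of the functions below, `process_age`, relies on `pd.cut` to map arbitrary `required_age` values onto rating bands. Here is a small sketch of how the bin edges and labels line up, using made-up ages. With the default `right=True`, each bin includes its right edge:

```python
import pandas as pd

# Made-up ages for illustration
ages = pd.Series([0, 3, 6, 12, 13, 17, 18, 21])

# Bins are (-1, 3], (3, 7], (7, 12], (12, 16], (16, 2000]
cut_points = [-1, 3, 7, 12, 16, 2000]
label_values = [3, 7, 12, 16, 18]

binned = pd.cut(ages, bins=cut_points, labels=label_values)
print(pd.DataFrame({'raw': ages, 'rating': binned}))
```

With these edges an age of 0 falls into the lowest band, 3, which is consistent with the cleaned output further down, where the Valve titles have a `required_age` of 3.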

In [6]:
def export_data(df, filename):
    """Export dataframe to csv file, filename prepended with 'steam_'.
    
    filename : str without file extension
    """
    filepath = '../data/exports/steam_' + filename + '.csv'
    formatted_name = filename.replace('_', ' ')
    
    df.to_csv(filepath, index=False)
    print("Exported {} to '{}'".format(formatted_name, filepath))
        

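def process_null_cols(df, thresh=0.5):
    """Drop columns with more than a certain proportion of missing values (Default 50%)."""
    # dropna's thresh is the minimum number of non-null values a column needs to be kept
    cutoff_count = int(np.ceil(len(df) * (1 - thresh)))
    
    return df.dropna(thresh=cutoff_count, axis=1)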


def drop_null_rows(df, col):
    """Drop rows with null values in a particular column."""
    return df[df[col].notnull()]


def process_type(df):
    """Remove rows with null values for type column, then drop the column."""
    df = drop_null_rows(df, 'type')
    df = df.drop('type', axis=1)
    
    return df
    
    
def process_name(df):
    """Remove rows with null values or 'none' in name column."""
    df = drop_null_rows(df, 'name')
    df = df[df['name'] != 'none']
    
    return df


def process_age(df):
    """Format ratings in age column to be in line with the PEGI Age Ratings system."""
    # PEGI Age ratings: 3, 7, 12, 16, 18
    cut_points = [-1, 3, 7, 12, 16, 2000]
    label_values = [3, 7, 12, 16, 18]
    
    df['required_age'] = pd.cut(df['required_age'], bins=cut_points, labels=label_values)
    
    return df


def process_platforms(df):
    """Split platforms column into separate boolean columns for each platform."""
    # evaluate values in platforms column, so can index into dictionaries
    df['platforms'] = df['platforms'].apply(literal_eval)
    
    # loop across keys, the platforms, which we'll turn into columns
    # (use iloc so this works even if the row labelled 0 was dropped earlier)
    for platform in df['platforms'].iloc[0].keys():
        # set 1 if value for platform in original column is True, or 0 if it is False
        df[platform] = df['platforms'].apply(lambda x: 1 if x[platform] else 0)
    
    # remove the original platforms column
    df = df.drop('platforms', axis=1)
    
    return df


def process_price(df):
    """Process price_overview column into formatted price column.
    
    Note: this first version is overridden by the second process_price defined further down.
    """
    df = df.copy()
    
    def parse_price(x):
        try:
            return literal_eval(x)
        except ValueError:
            return {'currency': 'GBP', 'initial': -1}
    
    # evaluate as dictionary and set to -1 if missing
    df['price_overview'] = df['price_overview'].apply(parse_price)
    
    # create columns from currency and initial values
    df['currency'] = df['price_overview'].apply(lambda x: x['currency'])
    df['price'] = df['price_overview'].apply(lambda x: x['initial'])
    
    # set price of free games to 0
    df.loc[df['is_free'], 'price'] = 0
    
    # remove non-GBP rows
    df = df[df['currency'] == 'GBP']
    
    # change price to display in pounds (only applying to rows with a value greater than 0)
    df.loc[df['price'] > 0, 'price'] /= 100
    
    # remove columns no longer needed
    df = df.drop(['is_free', 'currency', 'price_overview'], axis=1)
    
    return df


def process_descriptions(df, export=False):
    """Export descriptions to external csv file then remove these columns."""
    # remove rows with missing description data
    df = df[df['detailed_description'].notnull()].copy()
    
    if export:
        # create dataframe of description columns and export to csv
        description_data = df[['steam_appid', 'detailed_description', 'about_the_game', 'short_description']]
        
        export_data(description_data, filename='description_data')
    
    # drop description columns from main dataframe
    df = df.drop(['detailed_description', 'about_the_game', 'short_description'], axis=1)
    
    return df


def process_language(df):
    """Process supported_languages column into a boolean 'is english' column."""
    df = df.copy()
    
    # drop rows with missing language data
    df = df.dropna(subset=['supported_languages'])
    
    df['english'] = df['supported_languages'].apply(lambda x: 1 if 'english' in x.lower() else 0)
    df = df.drop('supported_languages', axis=1)
    
    return df


def process_images(df, export=False):
    """Remove image columns from dataframe, optionally exporting them to csv first."""
    df = df[df['screenshots'].notnull()].copy()
    
    if export:
        image_data = df[['steam_appid', 'header_image', 'screenshots', 'background', 'movies']]
        
        export_data(image_data, 'image_data')
        
    df = df.drop(['header_image', 'screenshots', 'background', 'movies'], axis=1)
    
    return df


def process_info(df, export=False):
    """Drop support information from dataframe, optionally exporting beforehand."""
    if export:
        support_info = df[['steam_appid', 'website', 'support_info']].copy()
        
        support_info['support_info'] = support_info['support_info'].apply(lambda x: literal_eval(x))
        support_info['support_url'] = support_info['support_info'].apply(lambda x: x['url'])
        support_info['support_email'] = support_info['support_info'].apply(lambda x: x['email'])
        
        support_info = support_info.drop('support_info', axis=1)
        
        support_info = support_info[(support_info['website'].notnull()) | (support_info['support_url'] != '') | (support_info['support_email'] != '')]

        export_data(support_info, 'support_info')
    
    df = df.drop(['website', 'support_info'], axis=1)
    
    return df


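def process_requirements(df, export=False):
    """Drop requirements columns from dataframe, optionally exporting cleaned pc_requirements first."""
    if export:
        requirements = df[['steam_appid', 'pc_requirements']].copy()
        
        requirements = requirements[requirements['pc_requirements'] != '[]']
        
        # strip escaped whitespace characters and html tags (patterns are regular expressions)
        requirements['requirements_clean'] = (requirements['pc_requirements']
                                                  .str.replace(r'\\[rtn]', '', regex=True)
                                                  .str.replace(r'<[pbr]{1,2}>', ' ', regex=True)
                                                  .str.replace(r'<[\/"=\w\s]+>', '', regex=True)
                                             )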
        
        export_data(requirements, 'requirements_data')
        
    df = df.drop(['pc_requirements', 'mac_requirements', 'linux_requirements'], axis=1)
    
    return df


def process_developers_and_publishers(df):
    """Join developer and publisher lists into comma-separated strings, dropping rows without them."""
    df = df[(df['developers'].notnull()) & (df['publishers'] != "['']")].copy()
            
    df['developers'] = df['developers'].apply(lambda x: literal_eval(x))
    df['publishers'] = df['publishers'].apply(lambda x: literal_eval(x))
    
    df['developer'] = df['developers'].apply(lambda x: ', '.join(x))
    df['publisher'] = df['publishers'].apply(lambda x: ', '.join(x))

    df = df.drop(['developers', 'publishers'], axis=1)
    
    return df


def process_price(df):
    """Process price_overview column into formatted price column, and take care of package columns."""
    df = df.copy()
    
    def parse_price(x):
        try:
            return literal_eval(x)
        except ValueError:
            return {'currency': 'GBP', 'initial': -1}
    
    # evaluate as dictionary and set to -1 if missing
    df['price_overview'] = df['price_overview'].apply(parse_price)
    
    # create columns from currency and initial values
    df['currency'] = df['price_overview'].apply(lambda x: x['currency'])
    df['price'] = df['price_overview'].apply(lambda x: x['initial'])
    
    # set price of free games to 0
    df.loc[df['is_free'], 'price'] = 0
    
    # remove non-GBP rows
    df = df[df['currency'] == 'GBP']
    
    # remove rows where price is -1 (copy so the division below acts on a new frame, not a view)
    df = df[df['price'] != -1].copy()
    
    # change price to display in pounds (can apply to all now -1 rows removed)
    df['price'] = df['price'] / 100
    
    # remove columns no longer needed
    df = df.drop(['is_free', 'currency', 'price_overview', 'packages', 'package_groups'], axis=1)
    
    return df


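def expand_columns(df, col):
    """Replace a column of stringified description lists with one 0/1 column per description."""
    df[col] = df[col].apply(lambda x: [item['description'] for item in literal_eval(x)])
    new_cols = set(itertools.chain.from_iterable(df[col]))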
    
    for new_col in sorted(new_cols):
        new_col_name = (new_col.lower()
                               .replace('-', '_')
                               .replace(' ', '_')
                               .replace('(', '')
                               .replace(')', '')
                               .replace('/', '_or_')
                               .replace('&', 'and')
                       )
        df[new_col_name] = df[col].apply(lambda x: 1 if new_col in x else 0)
            
    return df.drop(col, axis=1)


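def process_categories(df, export=False):
    """Drop rows without categories, optionally export the expanded category data, then drop the column."""
    df = df[df['categories'].notnull()].copy()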
    
    category_data = df[['steam_appid', 'categories']].copy()
    category_data = expand_columns(category_data, 'categories')
    
    if export:
        export_data(category_data, 'category_data')
    
    df = df.drop('categories', axis=1)
    
    return df


def process_genres(df, export=False):
    """Drop rows without genres, optionally export the expanded genre data, then drop the column."""
    df = df[df['genres'].notnull()].copy()
    
    genre_data = df[['steam_appid', 'genres']].copy()
    genre_data = expand_columns(genre_data, 'genres')
        
    if export:    
        export_data(genre_data, 'genre_data')
        
    df = df.drop('genres', axis=1)
    
    return df


def process_release_date(df):
    """Parse release_date into datetimes, dropping unreleased games and unparseable dates."""
    df = df.copy()
    
    def eval_date(x):
        x = literal_eval(x)
        if x['coming_soon']:
            return '' # return blank string so can drop missing at end
        else:
            return x['date']
    
    df['release_date'] = df['release_date'].apply(eval_date)
    
    def parse_date(x):
        if re.search(r'[\d]{1,2} [A-Za-z]{3}, [\d]{4}', x):
            return x.replace(',', '')
        elif re.search(r'[A-Za-z]{3} [\d]{4}', x):
            return '1 ' + x
        elif x == '':
            return np.nan
        else:
            # Should be everything, print out anything left just in case
            print(x)
            return np.nan
            
    df['release_date'] = df['release_date'].apply(parse_date)
    df['release_date'] = pd.to_datetime(df['release_date'], format='%d %b %Y', errors='coerce')
    
    df = df[df['release_date'].notnull()]
    
    return df
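
`expand_columns` turns a column of description lists into one indicator column per distinct description. The following is a self-contained sketch of the same idea on toy data. The function name `one_hot_descriptions` and the sample rows are assumptions made for illustration, and only part of the original name clean-up is reproduced:

```python
from ast import literal_eval
import itertools

import pandas as pd


def one_hot_descriptions(df, col):
    """Return a copy of df where `col` is replaced by 0/1 columns, one per description."""
    df = df.copy()
    # Each cell is a stringified list of {'id': ..., 'description': ...} dicts
    lists = df[col].apply(lambda x: [item['description'] for item in literal_eval(x)])

    for description in sorted(set(itertools.chain.from_iterable(lists))):
        col_name = description.lower().replace('-', '_').replace(' ', '_')
        # Bind the current description as a default argument of the lambda
        df[col_name] = lists.apply(lambda items, d=description: int(d in items))

    return df.drop(col, axis=1)


# Toy rows (made up), shaped like the categories column
toy = pd.DataFrame({
    'steam_appid': [10, 50],
    'categories': [
        "[{'id': 1, 'description': 'Multi-player'}]",
        "[{'id': 2, 'description': 'Single-player'}, {'id': 1, 'description': 'Multi-player'}]",
    ],
})

print(one_hot_descriptions(toy, 'categories'))
```

Working on a copy means the caller's frame is left untouched, whereas `expand_columns` modifies the column of the frame it is given before dropping it.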

### Run Functions

In [7]:
def process(df):
    """Process data set by calling each cleaning function in turn."""
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = process_null_cols(df)
    
    # Process rest of columns
    df = df.drop(['achievements', 'content_descriptors'], axis=1)
    df = process_type(df)
    df = process_name(df)
    df = process_age(df)
    df = process_platforms(df)
    df = process_price(df)
    df = process_language(df)
    df = process_developers_and_publishers(df)
    df = process_release_date(df)
    
    # Process columns which export data
    df = process_descriptions(df, export=False)
    df = process_images(df, export=False)
    df = process_info(df, export=False)
    df = process_requirements(df, export=False)
    df = process_categories(df, export=False)
    df = process_genres(df, export=False)
    
    return df

steam_data = process(raw_steam_data)
steam_data.head()

Unnamed: 0,name,steam_appid,required_age,release_date,windows,mac,linux,price,english,developer,publisher
0,Counter-Strike,10,3,2000-11-01,1,1,1,7.19,1,Valve,Valve
1,Team Fortress Classic,20,3,1999-04-01,1,1,1,3.99,1,Valve,Valve
2,Day of Defeat,30,3,2003-05-01,1,1,1,3.99,1,Valve,Valve
3,Deathmatch Classic,40,3,2001-06-01,1,1,1,3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,3,1999-11-01,1,1,1,3.99,1,Gearbox Software,Valve
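
The `release_date` values above are parsed from raw strings in two shapes: a full date such as `'1 Nov, 2000'`, or a month and year with no day. Below is a hedged sketch of that normalisation step on a handful of made-up strings. `normalise_date` is a hypothetical stand-in for the notebook's `parse_date`, the `^...$` anchors are an added assumption, and `%b` assumes English month abbreviations:

```python
import re

import numpy as np
import pandas as pd


def normalise_date(text):
    """Convert 'D Mon, YYYY' or 'Mon YYYY' to 'D Mon YYYY'; anything else becomes NaN."""
    if re.search(r'^\d{1,2} [A-Za-z]{3}, \d{4}$', text):
        return text.replace(',', '')
    if re.search(r'^[A-Za-z]{3} \d{4}$', text):
        return '1 ' + text  # assume the first of the month when no day is given
    return np.nan


# Made-up inputs: two valid shapes, an empty string and free text
raw = pd.Series(['1 Nov, 2000', 'Jun 2009', '', 'Coming soon'])
dates = pd.to_datetime(raw.apply(normalise_date), format='%d %b %Y', errors='coerce')
print(dates)
```

Strings that match neither pattern end up as `NaT`, so they can be removed with a single `notnull()` filter, as `process_release_date` does.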


### Export Clean Data

In [141]:
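# steam_data.to_csv('../data/steam_data_clean.csv', index=False)

# genre_data and category_data are local to process_genres and process_categories,
# so rebuild them here for the apps that survived cleaning
kept_rows = (raw_steam_data[raw_steam_data['steam_appid'].isin(steam_data['steam_appid'])]
             .dropna(subset=['genres', 'categories'])
             .drop_duplicates(subset='steam_appid'))

def add_prefix(df, prefix):
    """Prefix every column except steam_appid, e.g. 'action' becomes 'genre_action'."""
    return df.rename(columns=lambda c: c if c == 'steam_appid' else prefix + c)

genre_data = add_prefix(expand_columns(kept_rows[['steam_appid', 'genres']].copy(), 'genres'), 'genre_')
category_data = add_prefix(expand_columns(kept_rows[['steam_appid', 'categories']].copy(), 'categories'), 'category_')

steam_data_full = steam_data.merge(genre_data, how='left', on='steam_appid')
steam_data_full = steam_data_full.merge(category_data, how='left', on='steam_appid')

null_counts = steam_data_full.isnull().sum()
print(null_counts[null_counts > 0].shape[0])

# steam_data_full.to_csv('../data/steam_data_with_genre_and_category.csv', index=False)

steam_data_full.head()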

0


Unnamed: 0,name,steam_appid,required_age,release_date,windows,mac,linux,price,english,developer,publisher,genre_accounting,genre_action,genre_adventure,genre_animation_and_modeling,genre_audio_production,genre_casual,genre_design_and_illustration,genre_documentary,genre_early_access,genre_education,genre_free_to_play,genre_game_development,genre_gore,genre_indie,genre_massively_multiplayer,genre_nudity,genre_photo_editing,genre_rpg,genre_racing,genre_sexual_content,genre_simulation,genre_software_training,genre_sports,genre_strategy,genre_tutorial,genre_utilities,genre_video_production,genre_violent,genre_web_publishing,category_captions_available,category_co_op,category_commentary_available,category_cross_platform_multiplayer,category_full_controller_support,category_in_app_purchases,category_includes_source_sdk,category_includes_level_editor,category_local_co_op,category_local_multi_player,category_mmo,category_mods,category_mods_require_hl2,category_multi_player,category_online_co_op,category_online_multi_player,category_partial_controller_support,category_shared_or_split_screen,category_single_player,category_stats,category_steam_achievements,category_steam_cloud,category_steam_leaderboards,category_steam_trading_cards,category_steam_turn_notifications,category_steam_workshop,category_steamvr_collectibles,category_vr_support,category_valve_anti_cheat_enabled
0,Counter-Strike,10,3,2000-11-01,1,1,1,7.19,1,Valve,Valve,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
1,Team Fortress Classic,20,3,1999-04-01,1,1,1,3.99,1,Valve,Valve,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
2,Day of Defeat,30,3,2003-05-01,1,1,1,3.99,1,Valve,Valve,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,Deathmatch Classic,40,3,2001-06-01,1,1,1,3.99,1,Valve,Valve,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
4,Half-Life: Opposing Force,50,3,1999-11-01,1,1,1,3.99,1,Gearbox Software,Valve,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1
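
Because every `steam_appid` should appear only once in each table, the merges above can be guarded against accidental row duplication. The following is a minimal sketch on toy frames, using the `validate` and `indicator` parameters of `DataFrame.merge`. The frames and their values are made up:

```python
import pandas as pd

# Toy frames (made up): one app has no matching genre row
games = pd.DataFrame({'steam_appid': [10, 20, 30], 'name': ['A', 'B', 'C']})
genres = pd.DataFrame({'steam_appid': [10, 20], 'genre_action': [1, 0]})

# validate raises a MergeError if steam_appid is duplicated on either side;
# indicator adds a _merge column showing which rows found a match
merged = games.merge(genres, how='left', on='steam_appid',
                     validate='one_to_one', indicator=True)

unmatched = merged[merged['_merge'] == 'left_only']
print(merged)
print('Rows without genre data:', len(unmatched))
```

A check like this makes the "no null columns" result easier to trust, since any app missing from the genre or category table shows up as `left_only` rather than as silent `NaN` values.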
