# Steam Data Analysis

## TODO:

* Some apps have different steam_appids from the ones we are downloading them from - need a check for that since SteamSpy might have different appid from the storefront in that case
* Check the data update: add different options for full/partial load. Check timestamp on the lastupdate
* Automating data gathering (separating into different scripts to run in parallel?)
* Parallelism

## Data Acquisition

There are multiple Steam-related datasets publicly available either at [Kaggle](https://www.kaggle.com/datasets) or different website but most of them were either outdated at the start of this project, lacked some information I was interested in or both. Hence, it was reasonable to extract it directly either with API or web scraping.

There are two main sources of Steam App data that allow connecting with API - Steam itself (with it's multiple APIs) and [SteamSpy](https://steamspy.com/about). SteamSpy allows gathering quite a lot of data on games but some of it's data is not perfectly suitable for analysis (like the Owner values that might be a very rough estimation).

Steam has multiple APIs that are useful for the data acquisition: 
* [Steamworks API](https://partner.steamgames.com/) that requires developer key for quite a lot of it's functions. The documentation is quite detailed and we'll be using it to get the list of AppIDs we are downloading and the Review data
* [Storefront API](https://wiki.teamfortress.com/wiki/User:RJackson/StorefrontAPI) - this one can be used without a key and is very well [described here](https://wiki.teamfortress.com/wiki/User:RJackson/StorefrontAPI) 

Thanks to [Nick Davis](https://nik-davis.github.io) and [Vicente Arce](https://twitter.com/Duerkos) I already had a rough pipeline and good scripts for gathering information from these APIs but some adjustments were needed to be made in accordance to the project goals. Also, [SteamDB](https://steamdb.info/faq/#attribution-and-technologies-we-use) has quite a lot of good references and is overall a great tool to check the Steam Data.

## Process:

- Create an app list and gather available data from the Steam ISteamApps API
- Retrieve individual app data from the Steam Storefront API, SteamSpy and Steam appreviews API by iterating through app list
- Compile the data collection error table
- Try redownloading the data for the IDs we didn't get on the first run
- Export App list, Steam data, SteamSpy data and Steam Reviews data to csv files

## API references:

- https://partner.steamgames.com/doc/webapi
- https://partner.steamgames.com/doc/store/reviews
- https://wiki.teamfortress.com/wiki/User:RJackson/StorefrontAPI
- https://steamapi.xpaw.me/#
- https://steamspy.com/api.php
- https://steamdb.info/faq/#attribution-and-technologies-we-use


## Credits

Huge props to [Nick Davis](https://nik-davis.github.io) for making the clean Steam Data dataset and thoroughly describing the process in his blog and [Vicente Arce](https://twitter.com/Duerkos) for making awesome notebooks this work is forked [from]](https://github.com/Duerkos/steam_analysis).

## Utility functions

In [46]:
# standard library imports
import csv
import datetime as dt
import json
import os
import statistics
import time

# third-party imports
import numpy as np
import pandas as pd
import requests
import requests.auth

# customization  - ensure tables show all columns
pd.set_option('max_columns', 100)

In [2]:
# setting up proxies and Steam API key:
try:
    with open('../data/_credentials/steam_key.txt') as f:
        APIKey = f.read()
except:
    APIKey = ''
    
try:
    with open('../data/_credentials/proxies.txt', 'r') as f:
        proxies = eval(f.read())
    if not isinstance(proxies, dict):
        proxies = None
except:
    proxies = None

The next function uses requests library to get JSON response from web APIs. It is based on Nik Davis previous work, and it is quite standard as (thankfully) web APIs use a standard format, and requests makes it really easy.

In [3]:
def get_request(url,parameters=None, steamspy=False):
    """Return json-formatted response of a get request using optional parameters.
    
    Parameters
    ----------
    url : string
    parameters : {'parameter': 'value'}
        parameters to pass as part of get request
    
    Returns
    -------
    json_data
        json-formatted response (dict-like)
    """
    try:
        headers = {'Accept': 'application/json'}
        response = requests.get(url=url, params=parameters, headers = headers, proxies = proxies)
    except requests.exceptions.SSLError as s:
        print('SSL Error:', s)
        
        for i in range(5, 0, -1):
            print('\rWaiting... ({})'.format(i), end='')
            time.sleep(1)
        print('\rRetrying.' + ' '*10)
        
        # recursively try again
        return get_request(url, parameters, steamspy)
    
    if response:
        return response.json()
    else:
        # We do not know how many pages steamspy has... and it seems to work well, so we will use no response to stop.
        if steamspy:
            return 'stop'
        else :
            # response is none usually means too many requests. Wait and try again 
            print('No response, waiting 15 seconds...')
            time.sleep(15)
            print('Retrying.')
            return get_request(url, parameters, steamspy)        

## List of IDs

APPs on steam have an unique ID. The requests to Steam Storefront API (which has more information than Steam Spy) have to be made for a specific ID. This means we have to get  a list of IDs first.

We can do this in several way:

* Using Steam Spy API (see https://steamspy.com/api.php) to get the list of IDs and also the metadata from Steam Spy (at the same time. Unfortunately, using this method gives a lot of duplicates.

* Alternatively, we could use Steam API to get a list of apps, then filter them (see https://api.steampowered.com/IStoreService/GetAppList/v1/? or https://steamapi.xpaw.me/#IStoreService/GetAppInfo for reference)

Using Steam Store API provides some additional benefits as we can filter applications by type, times of the last data modification and the price change.
We will use the Steam GetAppList API and use the list of apps from it to index across all the tables we use eventually for consistency.

## Define Download Logic

This is strongly based on Nik Davis previous work, to get the info about the app IDs from Steam. Currently it's using the same index-based approach to do get the data but in the future I'm planning on using the data modification parameter from the IStoreService and create scripts capable to run in parallel to make the automatic data retrieval faster.

I will keep the original comments from Nik Davis `in quotes` to let the reader understand the process.

`Now we have the app_list dataframe, we can iterate over the app IDs and request individual app data from the servers. Here we set out our logic to retrieve and process this information, then finally store the data as a csv file.`

`Because it takes a long time to retrieve the data, it would be dangerous to attempt it all in one go as any errors or connection time-outs could cause the loss of all our data. For this reason we define a function to download and process the requests in batches, appending each batch to an external file and keeping track of the highest index written in a separate file.`

`This not only provides security, allowing us to easily restart the process if an error is encountered, but also means we can complete the download across multiple sessions.`

`Again, we provide verbose output for rows exported, batches complete, time taken and estimated time remaining.`

In [49]:
def get_app_data(app_list, start, stop, parser, pause, errors_list,
                 download_appid = False, last_modified = False):
    """Return list of app data generated from parser.
    
    parser : function to handle request
    
    TODO:
    download_id - add id from app_list for the downloaded app
    last_modified - add last_modified for the downloaded app
    
    """
    app_data = []
    """
    # iterate through each row of app_list, confined by start and stop
    for index, appid in app_list[start:stop].iteritems():
        print('Current index: {}'.format(index), end='\r')

        # retrive app data for a row, handled by supplied parser, and append to list
        try:
            data = parser(appid)
        except Exception as ex:
            errors_list.append(appid)
            print('\nError getting data for {} with exception {}\n'.format(appid, type(ex).__name__))
        app_data.append(data)

        time.sleep(pause) # prevent overloading api with requests
    """
    # iterate through each row of app_list, confined by start and stop
    for index, row in app_list[start:stop].iterrows():
        print('Current index: {}'.format(index), end='\r')

        # retrive app data for a row, handled by supplied parser, and append to list
        try:
            data = parser(row['download_appid'])
        except Exception as ex:
            errors_list.append(row['download_appid'])
            print('\nError getting data for {} with exception {}\n'.format(row['download_appid'], type(ex).__name__))
        if download_appid:
            data['download_appid'] = row['download_appid']
        if last_modified:
            data['last_modified'] = row['last_modified']
        app_data.append(data)

        time.sleep(pause) # prevent overloading api with requests

    return app_data


def process_batches(parser, app_list, download_path, data_filename, index_filename,
                    errors_list, columns,
                    begin=0, end=-1, batchsize=100, pause=1,
                    download_appid = False, last_modified = False):
    """Process app data in batches, writing directly to file.
    
    parser : custom function to format request
    app_list : dataframe of appid and name
    download_path : path to store data
    data_filename : filename to save app data
    index_filename : filename to store highest index written
    errors_list : list to store appid errors
    columns : column names for file
    
    Keyword arguments:
    
    begin : starting index (get from index_filename, default 0)
    end : index to finish (defaults to end of app_list)
    batchsize : number of apps to write in each batch (default 100)
    pause : time to wait after each api request (defualt 1)
    download_id - add id from app_list for the downloaded app
    last_modified - add last_modified for the downloaded app
    
    returns: none
    """
    print('Starting at index {}:\n'.format(begin))
    
    # by default, process all apps in app_list
    if end == -1:
        end = len(app_list) + 1
    
    # generate array of batch begin and end points
    batches = np.arange(begin, end, batchsize)
    batches = np.append(batches, end)
    
    apps_written = 0
    batch_times = []
    
    for i in range(len(batches) - 1):
        start_time = time.time()
        
        start = batches[i]
        stop = batches[i+1]
        
        app_data = get_app_data(app_list, start, stop, parser, pause, errors_list, download_appid, last_modified)
        
        rel_path = os.path.join(download_path, data_filename)
        
        # writing app data to file
        with open(rel_path, 'a', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=columns, extrasaction='ignore')
            
            for j in range(3,0,-1):
                print("\rAbout to write data, don't stop script! ({})".format(j), end='')
                time.sleep(0.5)
            
            writer.writerows(app_data)
            print('\rExported lines {}-{} to {}.'.format(start, stop-1, data_filename), end=' ')
            
        apps_written += len(app_data)
        
        idx_path = os.path.join(download_path, index_filename)
        
        # writing last index to file
        with open(idx_path, 'w') as f:
            index = stop
            print(index, file=f)
            
        # logging time taken
        end_time = time.time()
        time_taken = end_time - start_time
        
        batch_times.append(time_taken)
        mean_time = statistics.mean(batch_times)
        
        est_remaining = (len(batches) - i - 2) * mean_time
        
        remaining_td = dt.timedelta(seconds=round(est_remaining))
        time_td = dt.timedelta(seconds=round(time_taken))
        mean_td = dt.timedelta(seconds=round(mean_time))
        
        print('Batch {} time: {} (avg: {}, remaining: {})'.format(i, time_td, mean_td, remaining_td))
            
    print('\nProcessing batches complete. {} apps written'.format(apps_written))

The best way to use this function and still only get the newer apps, would be to instead of passing it the fully app_list, preprocess it so it only contains the "app_delta". Also we would need it to keep the final dataframe in a different file, to perform after an append to it (in case we are adding only new app_ids), or if we add some kind of updating process, a join.

`Next we define some functions to handle and prepare the external files.`

`We use reset_index for testing and demonstration, allowing us to easily reset the index in the stored file to 0, effectively restarting the entire download process.`

`We define get_index to retrieve the index from file, maintaining persistence across sessions. Every time a batch of information (app data) is written to file, we write the highest index within app_data that was retrieved. As stated, this is partially for security, ensuring that if there is an error during the download we can read the index from file and continue from the end of the last successful batch. Keeping track of the index also allows us to pause the download, continuing at a later time.`

`Finally, the prepare_data_file function readies the csv for storing the data. If the index we retrieved is 0, it means we are either starting for the first time or starting over. In either case, we want a blank csv file with only the header row to begin writing to, se we wipe the file (by opening in write mode) and write the header. Conversely, if the index is anything other than 0, it means we already have downloaded information, and can leave the csv file alone.`

In [5]:
def reset_index(download_path, index_filename):
    """Reset index in file to 0."""
    rel_path = os.path.join(download_path, index_filename)
    
    f = open(rel_path, 'w')
    f.write('0')
        

def get_index(download_path, index_filename):
    """Retrieve index from file, returning 0 if file not found."""
    try:
        rel_path = os.path.join(download_path, index_filename)
        with open(rel_path, 'r') as f:
            index = int(f.readline())
            #This just reads the initial line
    
    except FileNotFoundError:
        index = 0
        
    return index


def prepare_data_file(download_path, filename, index, columns):
    """Create file and write headers if index is 0."""
    if index == 0:
        rel_path = os.path.join(download_path, filename)

        with open(rel_path, 'w', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=columns)
            writer.writeheader()

## Download Steam Data

`Now we are ready to start downloading data and writing to file. We define our logic particular to handling the steam API - in fact if no data is returned we return just the name and appid - then begin setting some parameters. We define the files we will write our data and index to, and the columns for the csv file. The API doesn't return every column for every app, so it is best to explicitly set these.`

`Next we run our functions to set up the files, and make a call to process_batches to begin the process. Some additional parameters have been added for demonstration, to constrain the download to just a few rows and smaller batches. Removing these would allow the entire download process to be repeated.`

I retouched many of these parameters just to check if the download could made in batches (requesting several steamapps at the same time), or even putting a faster polling rate (right now it is one second).

The storefront API (http://store.steampowered.com/api/) is very much undocumented, but the key getaway is that is not possible. The official SteamWorks API lets us do other things and it is quite well documented, but we cannot get the data available at a steam webpage, which are the things interesting for us.

The storefront API is only accessible with these [requests](https://wiki.teamfortress.com/wiki/User:RJackson/StorefrontAPI), and according to this [stackoverflow discussion](https://stackoverflow.com/questions/46330864/steam-api-all-games):
`There is a general API rate limit for each unique IP adress of 200 requests in five minutes which is one request every 1.5 seconds.` This matches our experience, Nik Davis put a pause between requests of just 1 second, and with this we get some but a few reconnect errors. If we put no pause at all, at the end we are limited by the 200 requests every 5 minutes.

That means that for a volume of around 50k this process will take around 21 hours. Thankfully we can resume it and do it in several batches.

If we were to build a web app and we wanted to update the information daily, we could instead try pulling the full applist from steam along with the "last updated" and only request the full appid information for those ids. This is probably what Steam Spy does, and SteamDB instead uses a more sofisticated approach by being notified of any changes to appids via steamworks.

For a one shot analysis (and not a web page where the user could explore the information), the full download approach is fine.

The goal of the next two functions is getting the full list of apps from Steam, and also getting from our already downloaded data (if any) which are the appids we already have, to get the "delta" so we only have to download the new app ids.

We could retouch them a bit, as they have a "last updated" key, so we also update new information and not just new ids. But that would be more suitable for a live webpage updated once a day, not for a full on analysis like we are presenting.

When getting the application list using IStoreService API you can use different parameters to gather different application type or filter outh the applications modified before the certain date. In this case I'm using 'include_dlc' parameters to include game DLCs for additional analysis.

Another important thing to note - obtaining Steam Data through the Storefront API is **region-based** and Steam storefront refuses to return the data for the games not available in the region we are downloading from (obtained by GeoIP, by the looks of it). It also changes prices to the currencty for the region, although this can be changed through the 'cc' optional parameter.

In [6]:
def getAppListBatch(url, parameters):
    """
    Getting application list data in batches, since IStoreService
    has limited max output
    """
    json_data = get_request(url, parameters=parameters)
    steam_id = pd.DataFrame.from_dict(json_data['response']['apps'])
    try:
        more_results = json_data['response']['have_more_results']
        last_appid =  json_data['response']['last_appid']
    except:
        more_results = False
        last_appid = False
    return more_results, steam_id, last_appid

In [20]:
def getAppList():
    """
    Getting the list of applications using Steam IStoreService
    Use https://steamapi.xpaw.me/#IStoreService/GetAppInfo as a reference
    for additional parameters
    
    Example of a single app data:
    "appid":10,"name":"Counter-Strike","last_modified":1602535893,"price_change_number":13853601
    """

    url = 'https://api.steampowered.com/IStoreService/GetAppList/v1/?'
    parameters = {'key': APIKey,
                 'include_dlc': 'true'}
    more_results = True
    begin = True
    # from the request we get the more_results flag and also the last_appid, so we use them for the next requests.
    while (more_results):
        more_results, steam_ids, last_appid = getAppListBatch(url, parameters)
        parameters['last_appid'] = last_appid
        if (begin):
            steam_allids = steam_ids
            begin = False
        else:
            steam_allids = steam_allids.append(steam_ids)
    steam_allids.rename(columns = {'appid': 'download_appid'}, inplace = True)
    return steam_allids.loc[:,['download_appid', 'last_modified', 'price_change_number']]

In [34]:
def get_update_ids(new_list, old_list):
    """
    Getting a table of apps that we are to download by comparing:
    
    new_list : table of apps from IStoreService,
    old_list : table of available apps. Requires to have 'steam_appid' column, change to the column series?
    
    """
    #We are going to forget about names and only care about IDs.
    """
    idList = idList['appid', '']
    oldFullList = oldFullList['steam_appid']
    oldFullList.columns = ['appid']
    updatedList = idList.append(oldFullList)
    updatedList = updatedList.drop_duplicates(keep=False)
    updatedList = updatedList.reset_index(drop=True)
    
    full_ids = pd.read_csv('../data/download/full_steam_ids.csv')
    missing_collection = full_ids[~full_ids['appid'].isin(steam_app_data.index) & ~full_ids.index.isin(missing_ids.index)].copy()
    missing_ids = pd.concat([steam_app_errors,steam_spy_errors,steam_reviews_errors,missing_collection]).drop_duplicates().reset_index(drop=True)
    
    """
    old_list = old_list[['download_appid']]
    updated_list = new_list[~new_list['download_appid'].isin(old_list['download_appid'])].copy().reset_index(drop=True)
    return updated_list

In [35]:
def parse_steam_request(appid):
    """Unique parser to handle data from Steam Store API.
    
    Returns : json formatted data (dict-like)
    """
    url = 'http://store.steampowered.com/api/appdetails/'
    parameters = {'appids': appid, 'key': APIKey}
    
    json_data = get_request(url, parameters=parameters)
    json_app_data = json_data[str(appid)]
    
    if json_app_data['success']:
        data = json_app_data['data']
    else:
        data = {'steam_appid': appid}
        
    return data


# Set file parameters
download_path = '../data/download/'
steam_app_data = 'steam_app_data.csv'
steam_app_data_delta = 'steam_app_data_delta.csv'
steam_index = 'steam_index.txt'

steam_columns = [
    'type', 'name', 'steam_appid', 'required_age', 'is_free', 'controller_support',
    'dlc', 'detailed_description', 'about_the_game', 'short_description', 'fullgame',
    'supported_languages', 'header_image', 'website', 'pc_requirements', 'mac_requirements',
    'linux_requirements', 'legal_notice', 'drm_notice', 'ext_user_account_notice',
    'developers', 'publishers', 'demos', 'price_overview', 'packages', 'package_groups',
    'platforms', 'metacritic', 'reviews', 'categories', 'genres', 'screenshots',
    'movies', 'recommendations', 'achievements', 'release_date', 'support_info',
    'background', 'content_descriptors',
    'download_appid', 'last_modified'
]

steam_errors = []

# Overwrites last index for demonstration (would usually store highest index so can continue across sessions)
if (os.path.isfile(download_path+steam_app_data_delta) == False):
    reset_index(download_path, steam_index)
    index = 0
else:
    # Retrieve last index downloaded from file
    index = get_index(download_path, steam_index)
    
# Wipe or create data file and write headers if no previous  data
if (os.path.isfile(download_path+steam_app_data) == False):
    prepare_data_file(download_path, steam_app_data, index, steam_columns)
    
# Wipe or create data file delta and write headers if index is 0
if (os.path.isfile(download_path+steam_app_data_delta) == False):
    prepare_data_file(download_path, steam_app_data_delta, index, steam_columns)
    
    
# Here we get the list of appids from steam
full_steam_ids = getAppList()

# Here we get the real list of ids not yet in our dataframe. If this is the first time we are downloading the data, we can skip
# This step and instead use the full app_list.
try:
    oldlist = pd.read_csv('../data/download/steam_app_data.csv', usecols = ['name','download_appid'])
    steam_ids = get_update_ids(full_steam_ids, oldlist)
except FileNotFoundError:
    print('Pre-existing file not found. First time downloading full app data from steam. This will take a while.\n')
    steam_ids = full_steam_ids 
 

In [36]:
print(f'New IDs detected: {str(len(steam_ids))}')
print(steam_ids)
#index = 0

New IDs detected: 105035
        download_appid  last_modified  price_change_number
0                   10     1602535893             13853601
1                   20     1579634708             13853601
2                   30     1512413490             13853601
3                   40     1568752159             13853601
4                   50     1579628243             13853601
...                ...            ...                  ...
105030         2057790     1655329480                    0
105031         2058200     1655377759                    0
105032         2058310     1655316155                    0
105033         2058440     1655357944             15115858
105034         2060820     1655397280             15147236

[105035 rows x 3 columns]


In [50]:
# I separated the long process to be able to debug it better.
# Set end and chunksize for demonstration - remove to run through entire app list
# Here by default we passed "app_list" that contained all the information and saved it, now we will modify it a bit
# And add pre-processing and post-processing
print(f'Adding {str(len(steam_ids))} new ids.\n')

# Adding download start timestamp
log_time = []
log_time.append(['Storefront download start', time.time()])

process_batches(
    parser=parse_steam_request,
    app_list=steam_ids,
    download_path=download_path,
    data_filename=steam_app_data_delta,
    index_filename=steam_index,
    errors_list=steam_errors,
    columns=steam_columns,
    begin=index,
    #end=10,
    #pause=0.5
    batchsize=100,
    pause=1,
    download_appid = True,
    last_modified = True
)

log_time.append(['Storefront download end', time.time()])

try:
    oldlist = pd.read_csv('../data/download/steam_app_data.csv')
    # We change the old file to backup, so remove any backup named this way before...
    os.replace('../data/download/steam_app_data.csv', '../data/download/steam_app_data_backup.csv')
    newlist = pd.read_csv('../data/download/steam_app_data_delta.csv')
    oldlist = oldlist.append(newlist, ignore_index=True)
    oldlist.to_csv('../data/download/steam_app_data.csv', index=False)
except FileNotFoundError:
    os.rename('../data/download/steam_app_data_delta.csv', '../data/download/steam_app_data.csv')
    
# Saving errors and download times
steam_errors_df = pd.DataFrame(steam_errors, columns=['appid'])
steam_errors_df.to_csv('../data/download/steam_errors.csv', index=False)

log_columns = ['operation', 'timestamp']
try:
    log_df = pd.read_csv('../data/download/download_log.csv', header=0)
except:
    log_df = pd.DataFrame(columns=log_columns)

log_df = log_df.append(pd.DataFrame(columns=log_columns, data=log_time), ignore_index=True)
log_df.to_csv('../data/download/download_log.csv', index=False)

Adding 105035 new ids.

Starting at index 0:

Exported lines 0-99 to steam_app_data_delta.csv. Batch 0 time: 0:03:22 (avg: 0:03:22, remaining: 2 days, 10:48:38)
Exported lines 100-199 to steam_app_data_delta.csv. Batch 1 time: 0:03:21 (avg: 0:03:21, remaining: 2 days, 10:40:04)
Exported lines 200-299 to steam_app_data_delta.csv. Batch 2 time: 0:03:23 (avg: 0:03:22, remaining: 2 days, 10:48:28)
Exported lines 300-399 to steam_app_data_delta.csv. Batch 3 time: 0:03:23 (avg: 0:03:22, remaining: 2 days, 10:49:22)
Exported lines 400-499 to steam_app_data_delta.csv. Batch 4 time: 0:03:29 (avg: 0:03:24, remaining: 2 days, 11:09:07)
Exported lines 500-599 to steam_app_data_delta.csv. Batch 5 time: 0:03:23 (avg: 0:03:24, remaining: 2 days, 11:05:27)
Exported lines 600-699 to steam_app_data_delta.csv. Batch 6 time: 0:03:23 (avg: 0:03:24, remaining: 2 days, 11:01:22)
Exported lines 700-799 to steam_app_data_delta.csv. Batch 7 time: 0:03:24 (avg: 0:03:24, remaining: 2 days, 10:58:15)
Exported line

  exec(code_obj, self.user_global_ns, self.user_ns)


Let's ensure we have no duplicate ids and that we got them all!

In [103]:
steam_app_data = pd.read_csv('../data/download/steam_app_data.csv')

In [104]:
#Checking for duplicates
print(f'appid duplicates:', steam_app_data.duplicated(subset='steam_appid').sum())
print(f'download_appid duplicates:', steam_app_data.duplicated(subset='download_appid').sum())

appid duplicates: 10
download_appid duplicates: 0


We got some duplicates here. Let's compare that to the full set of ids.
(And also save the current full Steam ids for the future reference)

In [196]:
steam_ids.to_csv("../data/download/full_steam_ids.csv", index=False)
steam_ids.duplicated(subset="download_appid").sum()

0

Even though we cleaned before, it is possible we got some ids twice. Let's compare the size.

In [62]:
len(steam_app_data)-len(full_steam_ids)

0

Since we were using old data, it seems like ids that are no longer available are still there, along with a few duplicates. Let's run again the function which would get new ids, just to make sure.

In [63]:
diff_ids = get_update_ids(full_steam_ids, steam_app_data)

In [64]:
len(diff_ids)

12

There are not many new IDs. Taking into account that around a hundred apps get uploaded to Steam everyday, this makes sense, so we do not need to download anything new for the moment, just cleaning.

Well, from Steam... now we have to make sure we got most of these IDs from SteamSpy as well.

Let's do a bit of pre-cleaning, to ensure we download only the ids we need from Steam Spy.

We are going to consider valid apps those that at least have a name for the moment. Then delete the remaining duplicates.

In [65]:
steam_app_data = steam_app_data.drop_duplicates(subset='download_appid', keep='last')
steam_app_data.to_csv('../data/download/steam_app_data.csv', index=False)

Appids that were not downloaded

In [66]:
steam_errors_df = pd.DataFrame(steam_errors, columns =['appid'])
steam_errors_df.to_csv('../data/download/steam_errors.csv', index=False)
steam_errors_df

Unnamed: 0,appid
0,45796
1,294974
2,624621
3,756950
4,1061400
5,1150270
6,1351710
7,1357200
8,1421310
9,1444140


In [67]:
# inspect downloaded data
steam_app_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 105023 entries, 0 to 105034
Data columns (total 41 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   type                     104343 non-null  object
 1   name                     104332 non-null  object
 2   steam_appid              105023 non-null  int64 
 3   required_age             104343 non-null  object
 4   is_free                  104343 non-null  object
 5   controller_support       26063 non-null   object
 6   dlc                      9804 non-null    object
 7   detailed_description     104181 non-null  object
 8   about_the_game           104180 non-null  object
 9   short_description        104176 non-null  object
 10  fullgame                 35360 non-null   object
 11  supported_languages      104152 non-null  object
 12  header_image             104343 non-null  object
 13  website                  61129 non-null   object
 14  pc_requirements     

## Steam Spy API

Steam Spy API accepts requests in a GET string and returns JSON arrays. The data is updated once every day. We are going to use `steamspy.com/api.php?request=appdetails&appid=download_appid` to return information on each game. API allows 1 request per second for such requests.

The information we can get through this API is:

* appid - Steam Application ID. If it's 999999, then data for this application is hidden on developer's request.
* name - game's name
* developer - comma separated list of the developers of the game
* publisher - comma separated list of the publishers of the game
* score_rank - score rank of the game based on user reviews
* owners - owners of this application on Steam as a range.
* average_forever - average playtime since March 2009. In minutes.
* average_2weeks - average playtime in the last two weeks. In minutes.
* median_forever - median playtime since March 2009. In minutes.
* median_2weeks - median playtime in the last two weeks. In minutes.
* ccu - peak CCU yesterday.
* price - current US price in cents.
* initialprice - original US price in cents.
* discount - current discount in percents.
* tags - game's tags with votes in JSON array.
* languages - list of supported languages.
* genre - list of genres.

In [68]:
def parse_steamspy_request(appid):
    """Parser to handle SteamSpy API data."""
    url = 'https://steamspy.com/api.php'
    parameters = {'request': 'appdetails', 'appid': appid}
    
    json_data = get_request(url, parameters)
    return json_data


# set files and columns
download_path = '../data/download'
steamspy_data = 'steamspy_data.csv'
steamspy_index = 'steamspy_index.txt'

steamspy_columns = [
    'appid', 'name', 'developer', 'publisher', 'score_rank', 'positive',
    'negative', 'userscore', 'owners', 'average_forever', 'average_2weeks',
    'median_forever', 'median_2weeks', 'price', 'initialprice', 'discount',
    'languages', 'genre', 'ccu', 'tags'
]

steamspy_errors = []

reset_index(download_path, steamspy_index)
index = get_index(download_path, steamspy_index)

# Adding download start timestamp
log_time = []
log_time.append(['SteamSpy download start', time.time()])

# Wipe data file if index is 0
prepare_data_file(download_path, steamspy_data, index, steamspy_columns)

full_steam_ids=pd.read_csv('../data/download/steam_app_data.csv')

process_batches(
    parser=parse_steamspy_request,
    app_list=full_steam_ids,
    download_path=download_path, 
    data_filename=steamspy_data,
    index_filename=steamspy_index,
    errors_list=steamspy_errors,
    columns=steamspy_columns,
    begin=index,
    end=len(full_steam_ids),
    batchsize=300,
    pause=0.1
)

# Saving download times
log_time.append(['SteamSpy download end', time.time()])

log_columns = ['operation', 'timestamp']
try:
    log_df = pd.read_csv('../data/download/download_log.csv', header=0)
except:
    log_df = pd.DataFrame(columns=log_columns)

log_df = log_df.append(pd.DataFrame(columns=log_columns, data=log_time), ignore_index=True)
log_df.to_csv('../data/download/download_log.csv', index=False)

  exec(code_obj, self.user_global_ns, self.user_ns)


Starting at index 0:

Exported lines 0-299 to steamspy_data.csv. Batch 0 time: 0:03:02 (avg: 0:03:02, remaining: 17:41:34)
Exported lines 300-599 to steamspy_data.csv. Batch 1 time: 0:03:03 (avg: 0:03:03, remaining: 17:41:33)
Exported lines 600-899 to steamspy_data.csv. Batch 2 time: 0:03:03 (avg: 0:03:03, remaining: 17:39:50)
Exported lines 900-1199 to steamspy_data.csv. Batch 3 time: 0:03:03 (avg: 0:03:03, remaining: 17:36:35)
Exported lines 1200-1499 to steamspy_data.csv. Batch 4 time: 0:03:03 (avg: 0:03:03, remaining: 17:33:53)
Exported lines 1500-1799 to steamspy_data.csv. Batch 5 time: 0:03:03 (avg: 0:03:03, remaining: 17:30:47)
Exported lines 1800-2099 to steamspy_data.csv. Batch 6 time: 0:03:04 (avg: 0:03:03, remaining: 17:28:30)
Exported lines 2100-2399 to steamspy_data.csv. Batch 7 time: 0:03:04 (avg: 0:03:03, remaining: 17:26:04)
Exported lines 2400-2699 to steamspy_data.csv. Batch 8 time: 0:03:03 (avg: 0:03:03, remaining: 17:22:56)
Exported lines 2700-2999 to steamspy_data.

In [69]:
steamspy_errors_df = pd.DataFrame(steamspy_errors, columns =['appid'])
steamspy_errors_df.to_csv('../data/download/steamspy_errors.csv', index=False)
steamspy_errors

[243450, 1087820, 1267580, 1497640, 1800600]

Let's quickly check if we have valid data inside.

In [70]:
steam_spy_data = pd.read_csv('../data/download/steamspy_data.csv')

In [71]:
steam_spy_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105023 entries, 0 to 105022
Data columns (total 20 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   appid            105023 non-null  int64  
 1   name             104758 non-null  object 
 2   developer        93779 non-null   object 
 3   publisher        84455 non-null   object 
 4   score_rank       52 non-null      float64
 5   positive         105023 non-null  int64  
 6   negative         105023 non-null  int64  
 7   userscore        105023 non-null  int64  
 8   owners           105023 non-null  object 
 9   average_forever  105023 non-null  int64  
 10  average_2weeks   105023 non-null  int64  
 11  median_forever   105023 non-null  int64  
 12  median_2weeks    105023 non-null  int64  
 13  price            94137 non-null   float64
 14  initialprice     94148 non-null   float64
 15  discount         94148 non-null   float64
 16  languages        93925 non-null   obje

In [73]:
steam_spy_data.duplicated(subset='appid').sum()

5

Looks good!

Now we have the Steam Spy data available on `../data/download/steamspy_data.csv` and the Steam Store data available on `../data/download/steam_app_data.csv`

## Steam Reviews

Steam Storefront and Steam Spy reviews data is a bit different and we can use the review data obtained directly from the AppReviews API as an additional and more reliable data point.

In [86]:
def parse_steamreviews_request(appid):
    """Parser to handle SteamSpy API data."""
    url = 'https://store.steampowered.com/appreviews/' + str(appid)
    #todo: add purchase_type=all in parameters for the next version
    parameters = {'json': 1, 'num_per_page': '0', 'language': 'all', 'purchase_type': 'all'}
    json_data = get_request(url, parameters)
    json_data = json_data['query_summary']
    json_data['appid']=appid
    return json_data


# set files and columns
download_path = '../data/download'
steamreviews_data = 'steamreviews_data.csv'
steamreviews_index = 'steamreviews_index.txt'

steamreviews_columns = [
    'appid', 'review_score', 'review_score_desc', 'total_positive', 'total_negative', 'total_reviews', 'download_appid'
]

steamreviews_errors = []

#Reset index if to download the reviews from 0
#reset_index(download_path, steamreviews_index)
index = get_index(download_path, steamreviews_index)

# Wipe data file if index is 0
prepare_data_file(download_path, steamreviews_data, index, steamreviews_columns)

full_steam_ids=pd.read_csv('../data/download/steam_app_data.csv')

# Adding download start timestamp
log_time = []
log_time.append(['Reviews download start', time.time()])

process_batches(
    parser=parse_steamreviews_request,
    app_list=full_steam_ids,
    download_path=download_path, 
    data_filename=steamreviews_data,
    index_filename=steamreviews_index,
    errors_list=steamreviews_errors,
    columns=steamreviews_columns,
    begin=index,
    end=len(full_steam_ids),
    batchsize=300,
    pause=0,
    download_appid = True
)

# Saving download times
log_time.append(['Reviews download end', time.time()])
log_columns = ['operation', 'timestamp']
try:
    log_df = pd.read_csv('../data/download/download_log.csv', header=0)
except:
    log_df = pd.DataFrame(columns=log_columns)

log_df = log_df.append(pd.DataFrame(columns=log_columns, data=log_time), ignore_index=True)
log_df.to_csv('../data/download/download_log.csv', index=False)

Starting at index 0:

Current index: 193
Error getting data for 6830 with exception ProxyError

Exported lines 0-299 to steamreviews_data.csv. Batch 0 time: 0:03:19 (avg: 0:03:19, remaining: 19:21:45)
Exported lines 300-599 to steamreviews_data.csv. Batch 1 time: 0:03:01 (avg: 0:03:10, remaining: 18:25:00)
Exported lines 600-899 to steamreviews_data.csv. Batch 2 time: 0:03:05 (avg: 0:03:08, remaining: 18:11:52)
Exported lines 900-1199 to steamreviews_data.csv. Batch 3 time: 0:03:01 (avg: 0:03:06, remaining: 17:57:54)
Exported lines 1200-1499 to steamreviews_data.csv. Batch 4 time: 0:02:57 (avg: 0:03:05, remaining: 17:44:10)
Exported lines 1500-1799 to steamreviews_data.csv. Batch 5 time: 0:02:59 (avg: 0:03:04, remaining: 17:35:54)
Exported lines 1800-2099 to steamreviews_data.csv. Batch 6 time: 0:02:57 (avg: 0:03:03, remaining: 17:27:24)
Exported lines 2100-2399 to steamreviews_data.csv. Batch 7 time: 0:02:57 (avg: 0:03:02, remaining: 17:20:32)
Exported lines 2400-2699 to steamreviews_

In [87]:
steamreviews=pd.read_csv('../data/download/steamreviews_data.csv')

In [88]:
steamreviews_errors_df = pd.DataFrame(steamreviews_errors, columns =['appid'])
steamreviews_errors_df.to_csv('../data/download/steamreviews_errors.csv', index=False)
steamreviews_errors

[6830, 369310, 646470, 944050, 1111490, 1198690, 1388730, 1829520]

In [89]:
steamreviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105023 entries, 0 to 105022
Data columns (total 7 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   appid              105023 non-null  int64 
 1   review_score       105023 non-null  int64 
 2   review_score_desc  105023 non-null  object
 3   total_positive     105023 non-null  int64 
 4   total_negative     105023 non-null  int64 
 5   total_reviews      105023 non-null  int64 
 6   download_appid     105023 non-null  int64 
dtypes: int64(6), object(1)
memory usage: 5.6+ MB


In [90]:
steamreviews['total_reviews'].value_counts()

0        30127
1         6526
2         4588
3         3664
4         3069
         ...  
3228         1
7446         1
3088         1
15929        1
6115         1
Name: total_reviews, Length: 4752, dtype: int64

In [91]:
steamreviews['review_score_desc'].value_counts()

No user reviews            30127
Very Positive              12304
Mixed                      12262
Positive                   10726
Mostly Positive             8461
1 user reviews              6526
2 user reviews              4588
3 user reviews              3664
4 user reviews              3069
5 user reviews              2616
6 user reviews              2180
Mostly Negative             2131
7 user reviews              1921
8 user reviews              1643
9 user reviews              1521
Overwhelmingly Positive      919
Negative                     279
Very Negative                 70
Overwhelmingly Negative       16
Name: review_score_desc, dtype: int64

In [92]:
steamreviews['review_score'].value_counts()

0    57855
8    12304
5    12262
7    10726
6     8461
4     2131
9      919
3      279
2       70
1       16
Name: review_score, dtype: int64

In [100]:
#Checking for duplicates
print(f'appid duplicates:', steamreviews.duplicated(subset='appid').sum())
print(f'download_appid duplicates:', steamreviews.duplicated(subset='download_appid').sum())

appid duplicates: 8
download_appid duplicates: 8


In [105]:
#Cleaning up duplicates
steamreviews = steamreviews.drop_duplicates(subset='download_appid', keep='last')
steamreviews.to_csv('../data/download/steamreviews_data.csv', index=False)

This looks very good. Review Score Description actually gives us more information than the Score alone and we have some detailed data (and can even request the reviews themselves through this API if we wanted to. But it'll take **a lot** of time.

We might want to keep only the total reviews as popularity, and feature a column to have a score. But it the categories already stablished in Steam seem to be adequate. In any case, a continuous score is also good, so let's use the one stablished at steam DB: https://steamdb.info/blog/steamdb-rating/

## Compiling data colection errors table

Keeping track on what apps are missing/removed from the dataset is quite helpful and is necessary for the proper statistical analysis. So here we'll compile the dataset with the missing apps and the rough reasons of why they are missing. It will be used during cleanup as well.

### Loading data

In [197]:
# Loading data tables
steam_app_data = pd.read_csv('../data/download/steam_app_data.csv')
steam_spy_data = pd.read_csv('../data/download/steamspy_data.csv')
steamreviews = pd.read_csv('../data/download/steamreviews_data.csv')

# Loading full ids table
full_ids = pd.read_csv('../data/download/full_steam_ids.csv')

  exec(code_obj, self.user_global_ns, self.user_ns)


In [198]:
# Getting the missing IDs
steam_errors = full_ids[~full_ids['download_appid'].isin(steam_app_data.download_appid)]
steamspy_errors = full_ids[~full_ids['download_appid'].isin(steam_spy_data.appid)]
steamreview_errors = full_ids[~full_ids['download_appid'].isin(steamreviews.download_appid)]
missing_collection = pd.concat([steam_errors, steamspy_errors, steamreview_errors]).drop_duplicates().reset_index(drop=True)

### Trying to redownload the missing ids

While some data was not retrieved due to the region block or malformed json data, there might be some cases when a simple redownload with extended pause can help. Let's try redownloading the missing data before finishing up with the download.

In [200]:
steam_errors = []
steamspy_errors = []
steamreviews_errors = []
steam_app_data_filename  = 'steam_app_data.csv'
steamspy_data_filename = 'steamspy_data.csv'
steamreviews_data_filename = 'steamreviews_data.csv'

print('Attempting to redownload the missing Storefront data')
process_batches(
    parser=parse_steam_request,
    app_list=missing_collection,
    download_path=download_path,
    data_filename=steam_app_data_filename,
    index_filename=steam_index,
    errors_list=steam_errors,
    columns=steam_columns,
    begin=0,
    end=len(missing_collection),
    batchsize=100,
    pause=1,
    download_appid = True,
    last_modified = True
)

print('Attempting to redownload the missing SteamSpy data')
process_batches(
    parser=parse_steamspy_request,
    app_list=missing_collection,
    download_path=download_path, 
    data_filename=steamspy_data,
    index_filename=steamspy_index,
    errors_list=steamspy_errors,
    columns=steamspy_columns,
    begin=0,
    end=len(missing_collection),
    batchsize=100,
    pause=1
)

print('Attempting to redownload the missing Steam Review data')
process_batches(
    parser=parse_steamreviews_request,
    app_list=missing_collection,
    download_path=download_path, 
    data_filename=steamreviews_data,
    index_filename=steamreviews_index,
    errors_list=steamreviews_errors,
    columns=steamreviews_columns,
    begin=0,
    end=len(missing_collection),
    batchsize=100,
    pause=1,
    download_appid = True
)

Attempting to redownload the missing Storefront data
Starting at index 0:

Exported lines 0-24 to steam_app_data.csv. Batch 0 time: 0:00:52 (avg: 0:00:52, remaining: 0:00:00)

Processing batches complete. 25 apps written
Attempting to redownload the missing SteamSpy data
Starting at index 0:

Exported lines 0-24 to steamspy_data.csv. Batch 0 time: 0:00:39 (avg: 0:00:39, remaining: 0:00:00)

Processing batches complete. 25 apps written
Attempting to redownload the missing Steam Review data
Starting at index 0:

Exported lines 0-24 to steamreviews_data.csv. Batch 0 time: 0:00:42 (avg: 0:00:42, remaining: 0:00:00)

Processing batches complete. 25 apps written


In [230]:
# Removing duplicates
steam_app_data = pd.read_csv('../data/download/steam_app_data.csv')
steam_app_data = steam_app_data.drop_duplicates(subset='download_appid', keep='last')
steam_app_data.to_csv('../data/download/steam_app_data.csv', index=False)

steam_spy_data = pd.read_csv('../data/download/steamspy_data.csv')
steam_spy_data = steam_spy_data.drop_duplicates(subset='appid', keep='last')
steam_spy_data.to_csv('../data/download/steamspy_data.csv', index=False)

steamreviews = pd.read_csv('../data/download/steamreviews_data.csv')
steamreviews = steamreviews.drop_duplicates(subset="download_appid", keep='last')
steamreviews.to_csv('../data/download/steamreviews_data.csv', index=False)


  exec(code_obj, self.user_global_ns, self.user_ns)


### Creating the final missing_ids table

In [231]:
steam_app_errors =  pd.DataFrame(steam_errors, columns =['appid'])
steam_app_errors.to_csv('../data/download/steam_errors.csv', index=False)

steam_spy_errors =  pd.DataFrame(steamspy_errors, columns =['appid'])
steam_spy_errors.to_csv('../data/download/steamspy_errors.csv', index=False)

steam_reviews_errors = pd.DataFrame(steamreviews_errors, columns =['appid'])
steam_reviews_errors.to_csv('../data/download/steamreviews_errors.csv', index=False)

In [232]:
#Creating error table based on steam_app_errors table
missing_ids = steam_app_errors.copy()
missing_ids['reason'] = "Steam Download Error"

In [233]:
#Adding SteamSpy download errors (checking if they are already present)
steam_spy_errors['reason'] = 'SteamSpy Download Error'
#Adding Steam Reviews errors
steam_reviews_errors['reason'] = 'Steam Review Download Error'
#Adding the missing ids to the error list
missing_ids = pd.concat([missing_ids,steam_spy_errors,steam_reviews_errors])

In [234]:
missing_ids

Unnamed: 0,appid,reason
0,1061400,Steam Download Error
1,1444140,Steam Download Error
2,1863540,Steam Download Error


In [235]:
#Getting the list of the missing ids by comparing full_ids witht steam_app data
#df.loc[~df.index.isin(df.merge(df2.assign(a='key'),how='left').dropna().index)]
missing_collection = full_ids[~full_ids['download_appid'].isin(steam_app_data.download_appid) & ~full_ids.download_appid.isin(missing_ids.appid)][['download_appid']]
missing_collection['reason'] = 'Steam Storefront Error'
missing_collection.rename(columns = {'download_appid': 'appid'}, inplace = True)
missing_ids = pd.concat([missing_ids,missing_collection]).reset_index(drop=True)

In [236]:
steam_app_data[steam_app_data['download_appid'].isin(missing_collection.appid)][['steam_appid','download_appid']]

Unnamed: 0,steam_appid,download_appid


In [237]:
#Saving resulting missing ids dataframe to csv
missing_ids.to_csv('../data/download/missing_ids.csv', index=False)

In [238]:
# Comparing the data tables and making steamreviews and steam_spy_data to be consistent with steam_app_data:
index_missing_reviews = steam_app_data.index.difference(steamreviews.index).to_list()
index_missing_steamspy = steam_app_data.index.difference(steam_spy_data.index).to_list()
print(f'Difference between Steam reviews and Storefront: {index_missing_reviews}')
print(f'Difference between SteamSpy and Storefront: {index_missing_steamspy}')

Difference between Steam reviews and Storefront: [105035, 105036, 105037, 105038, 105039, 105040, 105041, 105042, 105043, 105044, 105045, 105046, 105047, 105048, 105049, 105050, 105052, 105053, 105054, 105055, 105057, 105059, 105060]
Difference between SteamSpy and Storefront: [105035, 105036, 105037, 105038, 105039, 105040, 105041, 105042, 105043, 105044, 105045, 105046, 105047, 105048, 105049, 105050, 105052, 105053, 105054, 105055, 105057, 105059, 105060]


## Next Steps

Here we have defined and demonstrated the download process used to generate the data sets. This is similar to what Nik Davis did in the past, with the exception that now the process can be reinitiated to get only the new IDs in the full id list and add them to the previous dataset. 

Now we have three tables with information obtained from the Steam Store and Steam Spy and a table with AppIDs we were not able to get. This should be enough to get a lot of interesting insights but we can still expand it by getting some additional data like platime, streaming stats, detailed reviews information etc.

Another possible way to improve it is with automation through scripting to refresh the dataset automatically/on schedule. 

We can now check and clean the data we've obtained and prepare for the it's analysis and fishing for interesting insights.