# Steam Data Analysis

# TODO:

* Some apps have different steam_appids from the ones we are downloading them from - need a check for that since SteamSpy might have different appid from the storefront in that case
* Check the data update: add different options for full/partial load. Check timestamp on the lastupdate
* Automating data gathering (separating into different scripts to run in parallel?)
* Parallelism

## Project Goals

<!-- PELICAN_BEGIN_SUMMARY -->

The motivation is gather, process and analyze Steam Store data to get insights about trends in the videogame market. As it is an online marketplace with public available data, it offers us more possibilities than analyzing console games data, where we would have to rely on an existing dataset.

We want to focus on two main aspects, first a general market analysis to know which genres are the most popular, pricing strategies and so on, which could be interesting for a new developer trying to make a new game or deciding a price policy. This has been studied already by other enthusiasts in internet, and also by Marketing companies helping publishers.

But to offer a different analysis, we want to also focus on the developers and publishers, to see which ones are the most successfull, how they have improved / worsen between the years, which titles have cemented their success and so on. In the light of recent years we have seen many acquisitions by large publishers such as Tencent, Microsoft and Sony, so this is very interesting concept.

This will be a complete data project, with a data acquisition section (by using some APIs and web scrapping), then data cleaning and joining data from different sources, an exploratory data analysis, and finally some key conclusions.


## Data Acquisition

This is the section where I struggled initially. There were several datasets available at [kaggle](https://www.kaggle.com/datasets), reddit and similar websites, but most were outdated or did not contain all the information I wanted to explore. Also I wanted to extract it directly from an API or use web scrapping, if possible, to learn a bit more (I already had experience with Twitter which has an excellent API).

[SteamSpy](https://steamspy.com/about) is a webpage which offers data about Steam games. In the past it was even able to deliver a good guess of sales, but that has become harder throughout the years. Check [VG Insights](https://vginsights.com/insights/article/how-to-estimate-steam-video-game-sales) for more information. It is also a good webpage if you want to explore market data on your own.
The most important thing for Steamspy is that it has its own API [here](https://steamspy.com/api.php). It can provide us easily with an already filtered list of games (not other apps or DLCs), and also some metrics not available at Steam directly such as an estimate of sales and the positive or negative reviews (Steam only gives us total number reviews).

Regarding Steam directly, the API is available at https://partner.steamgames.com/ , however you need a developer key and some (most of the functions) are tied to your key as they are intended to be used to manage your own products at the Steam store. Thanks to [Nik Davis](http://nik-davis.github.io) I discovered there were also a few API functions via the WEB API which can be used without a key at all. See here for more details: [StorefrontAPI](https://wiki.teamfortress.com/wiki/User:RJackson/StorefrontAPI).

Getting the information from Steam will be a bit more difficult, but it will give us additional metrics, such as release date, genre...

We will retrieve first the list of appids and the information available at Steam Spy, then get for each appid the information from Steam and combine them in an unique dataframe. There will be no loss of information as app ids are unique. Afterwards, we will perform cleaning and finally start analyzing our dataset.

## Process:

- Create an app list and gather available data from SteamSpy API using 'all' request
- Retrieve individual app data from Steam API, by iterating through app list
- Export app list, Steam data and SteamSpy data to csv files

## API references:

- https://partner.steamgames.com/doc/webapi
- https://wiki.teamfortress.com/wiki/User:RJackson/StorefrontAPI
- https://steamapi.xpaw.me/#
- https://steamspy.com/api.php


## Credits

The most important source I found while looking how to connect to the API was Nik Davis, check his blog for a different analysis on steam data (from 2019) http://nik-davis.github.io
Download functions for the APIs are based on his notebook for "Steam Data Download". I had to make some changes and simplify a bit.

Steamspy seems to have changed its API, so I had to change the download method to instead download all the data by page (set of 1000 ids). The functions defined for Steam API itself still work as is. 

In [1]:
# standard library imports
import csv
import datetime as dt
import json
import os
import statistics
import time

# third-party imports
import numpy as np
import pandas as pd
import requests
import requests.auth

# customisations - ensure tables show all columns
pd.set_option("max_columns", 100)

In [2]:
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
#setting up proxies and Steam API key:
try:
    with open('../data/_credentials/steam_key.txt') as f:
        APIKey = f.read()
except:
    APIKey = ""
    
try:
    with open('../data/_credentials/proxies.txt', 'r') as f:
        proxies = eval(f.read())
except:
    proxies = None

In [4]:
type(proxies)

dict

The next function uses requests library to get JSON response from web APIs. It is based on Nik Davis previous work, and it is quite standard as (thankfully) web APIs use a standard format, and requests makes it really easy.

In [5]:
def get_request(url,parameters=None, steamspy=False):
    """Return json-formatted response of a get request using optional parameters.
    
    Parameters
    ----------
    url : string
    parameters : {'parameter': 'value'}
        parameters to pass as part of get request
    
    Returns
    -------
    json_data
        json-formatted response (dict-like)
    """
    try:
        headers = {'Accept': 'application/json'}
        response = requests.get(url=url, params=parameters, headers = headers, proxies = proxies)
    except requests.exceptions.SSLError as s:
        print('SSL Error:', s)
        
        for i in range(5, 0, -1):
            print('\rWaiting... ({})'.format(i), end='')
            time.sleep(1)
        print('\rRetrying.' + ' '*10)
        
        # recursively try again
        return get_request(url, parameters, steamspy)
    
    if response:
        return response.json()
    else:
        # We do not know how many pages steamspy has... and it seems to work well, so we will use no response to stop.
        if steamspy:
            return "stop"
        else :
            # response is none usually means too many requests. Wait and try again 
            print('No response, waiting 15 seconds...')
            time.sleep(15)
            print('Retrying.')
            return get_request(url, parameters, steamspy)        

## List of IDs

APPs on steam have an unique ID. The requests to Steam API (which has more information than Steam Spy) have to be made for a specific ID. This means we have to get first a list of ids.

We can do this in several ways, but this is what I decided to follow:

* Using Steam Spy API (see https://steamspy.com/api.php) to get the list of IDs and also the metadata from Steam Spy (at the same time. Unfortunately, using this method gives a lot of duplicates and headaches.


* Alternatively, we could use Steam API to get a list of apps, then filter them (see https://api.steampowered.com/ISteamApps/GetAppList/v1/? or https://steamapi.xpaw.me/#IStoreService/GetAppInfo)

We will use the Steam GetAppList API and use the list of apps from it to index across all the tables we use eventually for consistency.

## Define Download Logic

This is strongly based on Nik Davis previous work, to get the info about the app IDs from Steam. Initially I prefered to focus on the analysis rather than in the acquisition phase, but I watn to modify these functions to change them from an index based approach for the update, to instead check already existing ids in the database and download only the delta.

Later, we could maybe also add a check to see if the app has been updated or not and even redownload the info from not just new IDs, but IDs that have been updated.

I will keep the original comments from Nik Davis `in quotes` to let the reader understand the process.

`Now we have the app_list dataframe, we can iterate over the app IDs and request individual app data from the servers. Here we set out our logic to retrieve and process this information, then finally store the data as a csv file.`

`Because it takes a long time to retrieve the data, it would be dangerous to attempt it all in one go as any errors or connection time-outs could cause the loss of all our data. For this reason we define a function to download and process the requests in batches, appending each batch to an external file and keeping track of the highest index written in a separate file.`

`This not only provides security, allowing us to easily restart the process if an error is encountered, but also means we can complete the download across multiple sessions.`

`Again, we provide verbose output for rows exported, batches complete, time taken and estimated time remaining.`

In [6]:
def get_app_data(app_list, start, stop, parser, pause, errors_list):
    """Return list of app data generated from parser.
    
    parser : function to handle request
    """
    app_data = []
    
    # iterate through each row of app_list, confined by start and stop
    for index, appid in app_list[start:stop].iteritems():
        print('Current index: {}'.format(index), end='\r')

        # retrive app data for a row, handled by supplied parser, and append to list
        try:
            data = parser(appid)
        except Exception as ex:
            errors_list.append(appid)
            print('\nError getting data for {} with exception {}\n'.format(appid, type(ex).__name__))
        app_data.append(data)

        time.sleep(pause) # prevent overloading api with requests
    
    return app_data


def process_batches(parser, app_list, download_path, data_filename, index_filename,
                    errors_list, columns,
                    begin=0, end=-1, batchsize=100, pause=1):
    """Process app data in batches, writing directly to file.
    
    parser : custom function to format request
    app_list : dataframe of appid and name
    download_path : path to store data
    data_filename : filename to save app data
    index_filename : filename to store highest index written
    errors_list : list to store appid errors
    columns : column names for file
    
    Keyword arguments:
    
    begin : starting index (get from index_filename, default 0)
    end : index to finish (defaults to end of app_list)
    batchsize : number of apps to write in each batch (default 100)
    pause : time to wait after each api request (defualt 1)
    
    returns: none
    """
    print('Starting at index {}:\n'.format(begin))
    
    # by default, process all apps in app_list
    if end == -1:
        end = len(app_list) + 1
    
    # generate array of batch begin and end points
    batches = np.arange(begin, end, batchsize)
    batches = np.append(batches, end)
    
    apps_written = 0
    batch_times = []
    
    for i in range(len(batches) - 1):
        start_time = time.time()
        
        start = batches[i]
        stop = batches[i+1]
        
        app_data = get_app_data(app_list, start, stop, parser, pause, errors_list)
        
        rel_path = os.path.join(download_path, data_filename)
        
        # writing app data to file
        with open(rel_path, 'a', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=columns, extrasaction='ignore')
            
            for j in range(3,0,-1):
                print("\rAbout to write data, don't stop script! ({})".format(j), end='')
                time.sleep(0.5)
            
            writer.writerows(app_data)
            print('\rExported lines {}-{} to {}.'.format(start, stop-1, data_filename), end=' ')
            
        apps_written += len(app_data)
        
        idx_path = os.path.join(download_path, index_filename)
        
        # writing last index to file
        with open(idx_path, 'w') as f:
            index = stop
            print(index, file=f)
            
        # logging time taken
        end_time = time.time()
        time_taken = end_time - start_time
        
        batch_times.append(time_taken)
        mean_time = statistics.mean(batch_times)
        
        est_remaining = (len(batches) - i - 2) * mean_time
        
        remaining_td = dt.timedelta(seconds=round(est_remaining))
        time_td = dt.timedelta(seconds=round(time_taken))
        mean_td = dt.timedelta(seconds=round(mean_time))
        
        print('Batch {} time: {} (avg: {}, remaining: {})'.format(i, time_td, mean_td, remaining_td))
            
    print('\nProcessing batches complete. {} apps written'.format(apps_written))

The best way to use this function and still only get the newer apps, would be to instead of passing it the fully app_list, preprocess it so it only contains the "app_delta". Also we would need it to keep the final dataframe in a different file, to perform after an append to it (in case we are adding only new app_ids), or if we add some kind of updating process, a join.

`Next we define some functions to handle and prepare the external files.`

`We use reset_index for testing and demonstration, allowing us to easily reset the index in the stored file to 0, effectively restarting the entire download process.`

`We define get_index to retrieve the index from file, maintaining persistence across sessions. Every time a batch of information (app data) is written to file, we write the highest index within app_data that was retrieved. As stated, this is partially for security, ensuring that if there is an error during the download we can read the index from file and continue from the end of the last successful batch. Keeping track of the index also allows us to pause the download, continuing at a later time.`

`Finally, the prepare_data_file function readies the csv for storing the data. If the index we retrieved is 0, it means we are either starting for the first time or starting over. In either case, we want a blank csv file with only the header row to begin writing to, se we wipe the file (by opening in write mode) and write the header. Conversely, if the index is anything other than 0, it means we already have downloaded information, and can leave the csv file alone.`

In [7]:
def reset_index(download_path, index_filename):
    """Reset index in file to 0."""
    rel_path = os.path.join(download_path, index_filename)
    
    f= open(rel_path, 'w')
    f.write("0")
        

def get_index(download_path, index_filename):
    """Retrieve index from file, returning 0 if file not found."""
    try:
        rel_path = os.path.join(download_path, index_filename)
        with open(rel_path, 'r') as f:
            index = int(f.readline())
            #This just reads the initial line
    
    except FileNotFoundError:
        index = 0
        
    return index


def prepare_data_file(download_path, filename, index, columns):
    """Create file and write headers if index is 0."""
    if index == 0:
        rel_path = os.path.join(download_path, filename)

        with open(rel_path, 'w', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=columns)
            writer.writeheader()

## Download Steam Data

`Now we are ready to start downloading data and writing to file. We define our logic particular to handling the steam API - in fact if no data is returned we return just the name and appid - then begin setting some parameters. We define the files we will write our data and index to, and the columns for the csv file. The API doesn't return every column for every app, so it is best to explicitly set these.`

`Next we run our functions to set up the files, and make a call to process_batches to begin the process. Some additional parameters have been added for demonstration, to constrain the download to just a few rows and smaller batches. Removing these would allow the entire download process to be repeated.`

I retouched many of these parameters just to check if the download could made in batches (requesting several steamapps at the same time), or even putting a faster polling rate (right now it is one second).

The storefront API (http://store.steampowered.com/api/) is very much undocumented, but the key getaway is that is not possible. The official SteamWorks API lets us do other things and it is quite well documented, but we cannot get the data available at a steam webpage, which are the things interesting for us.

The storefront API is only accessible with these [requests](https://wiki.teamfortress.com/wiki/User:RJackson/StorefrontAPI), and according to this [stackoverflow discussion](https://stackoverflow.com/questions/46330864/steam-api-all-games):
`There is a general API rate limit for each unique IP adress of 200 requests in five minutes which is one request every 1.5 seconds.` This matches our experience, Nik Davis put a pause between requests of just 1 second, and with this we get some but a few reconnect errors. If we put no pause at all, at the end we are limited by the 200 requests every 5 minutes.

That means that for a volume of around 50k at January 2022 (the steam apps available also at steam spy, already filtered by game and some owner data...) this download will take around 21 hours. Thankfully we can resume it and do it in several batches.

If we were to build a web app and we wanted to update the information daily, we could instead try pulling the full applist from steam along with the "last updated" and only request the full appid information for those ids. This is probably what Steam Spy does, and SteamDB instead uses a more sofisticated approach by being notified of any changes to appids via steamworks.

In any case, for a one shot analysis (and not a web page where the user could explore the information), the full download approach is fine.

The goal of the next two functions is getting the full list of apps from Steam, and also getting from our already downloaded data (if any) which are the appids we already have, to get the "delta" so we only have to download the new app ids.

We could retouch them a bit, as they have a "last updated" key, so we also update new information and not just new ids. But that would be more suitable for a live webpage updated once a day, not for a full on analysis like we are presenting.
*NOTE:* Added 'include_dlc' key to include game DLCs for additional analysis

In [8]:
def getAppListBatch(url, parameters):
    json_data = get_request(url, parameters=parameters)
    steam_id = pd.DataFrame.from_dict(json_data["response"]["apps"])
    try:
        more_results = json_data["response"]["have_more_results"]
        last_appid =  json_data["response"]["last_appid"]
    except:
        more_results = False
        last_appid = False
    return more_results, steam_id, last_appid

def get_update_ids_old(updatedlist, oldlist):
    updatedlist['key1'] = 1
    oldlist['key2'] = 1
    updatedlist = pd.merge(updatedlist, oldlist, right_on=['steam_appid','name'],left_on=['appid','name'], how = 'outer')
    updatedlist = updatedlist[~(updatedlist.key2 == updatedlist.key1)]
    updatedlist = updatedlist.drop(['key1','key2','steam_appid'], axis=1)
    return updatedlist

def get_update_ids(idList, oldFullList):
    #We are going to forget about names and only care about IDs.
    idList = idList["appid"]
    oldFullList = oldFullList["steam_appid"]
    oldFullList.columns = ["appid"]
    updatedList = idList.append(oldFullList)
    updatedList = updatedList.drop_duplicates(keep=False)
    updatedList = updatedList.reset_index(drop=True)
    return updatedList


In [9]:
def getAppList():

    url = "https://api.steampowered.com/IStoreService/GetAppList/v1/?"
    parameters = {"key": APIKey,
                 "include_dlc": "true"}
    more_results = True
    begin = True
    # from the request we get the more_results flag and also the last_appid, so we use them for the next requests.
    while (more_results):
        more_results, steam_ids, last_appid = getAppListBatch(url, parameters)
        parameters["last_appid"] = last_appid
        if (begin):
            steam_allids = steam_ids
            begin = False
        else:
            steam_allids = steam_allids.append(steam_ids)
    return steam_allids

In [10]:
def parse_steam_request(appid):
    """Unique parser to handle data from Steam Store API.
    
    Returns : json formatted data (dict-like)
    """
    url = "http://store.steampowered.com/api/appdetails/"
    parameters = {"appids": appid, "key": APIKey}
    
    json_data = get_request(url, parameters=parameters)
    json_app_data = json_data[str(appid)]
    
    if json_app_data['success']:
        data = json_app_data['data']
    else:
        data = {'steam_appid': appid}
        
    return data


# Set file parameters
download_path = '../data/download/'
steam_app_data = 'steam_app_data.csv'
steam_app_data_delta = 'steam_app_data_delta.csv'
steam_index = 'steam_index.txt'

steam_columns = [
    'type', 'name', 'steam_appid', 'required_age', 'is_free', 'controller_support',
    'dlc', 'detailed_description', 'about_the_game', 'short_description', 'fullgame',
    'supported_languages', 'header_image', 'website', 'pc_requirements', 'mac_requirements',
    'linux_requirements', 'legal_notice', 'drm_notice', 'ext_user_account_notice',
    'developers', 'publishers', 'demos', 'price_overview', 'packages', 'package_groups',
    'platforms', 'metacritic', 'reviews', 'categories', 'genres', 'screenshots',
    'movies', 'recommendations', 'achievements', 'release_date', 'support_info',
    'background', 'content_descriptors'
]

steam_errors = []

# Overwrites last index for demonstration (would usually store highest index so can continue across sessions)
if (os.path.isfile(download_path+steam_app_data_delta) == False):
    reset_index(download_path, steam_index)

# Retrieve last index downloaded from file
index = get_index(download_path, steam_index)

# Wipe or create data file and write headers if no previous  data
if (os.path.isfile(download_path+steam_app_data) == False):
    prepare_data_file(download_path, steam_app_data, index, steam_columns)
    
# Wipe or create data file delta and write headers if index is 0
if (os.path.isfile(download_path+steam_app_data_delta) == False):
    prepare_data_file(download_path, steam_app_data_delta, index, steam_columns)
    
    
# Here we get the list of appids from steam
full_steam_ids = getAppList()

# Here we get the real list of ids not yet in our dataframe. If this is the first time we are downloading the data, we can skip
# This step and instead use the full app_list.
try:
    oldlist = pd.read_csv('../data/download/steam_app_data.csv', usecols = ['name','steam_appid'])
    steam_ids = get_update_ids(full_steam_ids, oldlist)
except FileNotFoundError:
    print("Pre-existing file not found. First time downloading full app data from steam. This will take a while.\n")
    steam_ids = full_steam_ids

In [11]:
print("New IDs detected: "+str(len(steam_ids)))
print(steam_ids)
index = 0

New IDs detected: 103145
0              10
1              20
2              30
3              40
4              50
           ...   
103140    2028023
103141    2028055
103142    2028056
103143    2028062
103144    2028850
Length: 103145, dtype: int64


In [None]:
# I separated the long process to be able to debug it better.
# Set end and chunksize for demonstration - remove to run through entire app list
# Here by default we passed "app_list" that contained all the information and saved it, now we will modify it a bit
# And add pre-processing and post-processing
print("Adding "+str(len(steam_ids))+" new ids.\n")
process_batches(
    parser=parse_steam_request,
    app_list=steam_ids,
    download_path=download_path,
    data_filename=steam_app_data_delta,
    index_filename=steam_index,
    errors_list=steam_errors,
    columns=steam_columns,
    begin=index,
    #end=10,
    #pause=0.5
    batchsize=100,
    pause=1
)

try:
    oldlist = pd.read_csv('../data/download/steam_app_data.csv')
    # We change the old file to backup, so remove any backup named this way before...
    os.replace('../data/download/steam_app_data.csv', '../data/download/steam_app_data_backup.csv')
    newlist = pd.read_csv('../data/download/steam_app_data_delta.csv')
    oldlist = oldlist.append(newlist, ignore_index=True)
    oldlist.to_csv('../data/download/steam_app_data.csv', index=False)
    steam_errors_df = pd.DataFrame(steam_errors, columns=["appid"])
    steam_errors_df.to_csv('../data/download/steam_errors.csv', index=False)
except FileNotFoundError:
    os.rename('../data/download/steam_app_data_delta.csv', '../data/download/steam_app_data.csv')

Let's ensure we have no duplicate ids and that we got them all!

In [13]:
steam_app_data = pd.read_csv('../data/download/steam_app_data.csv')

In [14]:
steam_app_data.duplicated(subset="steam_appid").sum()

28

We got some duplicates here. Let's compare that to the full set of ids.
(And also save the current full Steam ids for the future reference)

In [15]:
full_steam_ids1 = getAppList()
full_steam_ids_df = pd.DataFrame(full_steam_ids1, columns =['appid'])
full_steam_ids_df.to_csv("../data/download/full_steam_ids.csv", index=False)
full_steam_ids1.duplicated(subset="appid").sum()

0

Even though we cleaned before, it is possible we got some ids twice. Let's compare the size.

In [16]:
len(steam_app_data)-len(full_steam_ids)

0

Since we were using old data, it seems like ids that are no longer available are still there, along with a few duplicates. Let's run again the function which would get new ids, just to make sure.

In [17]:
diff_ids = get_update_ids(full_steam_ids, steam_app_data)

In [18]:
len(diff_ids)

36

There are only 14 new ids. Taking into account that around a hundred apps get uploaded to Steam everyday, this makes sense, so we do not need to download anything new for the moment, just cleaning.

Well, from Steam... now we have to make sure we got most of these IDs from SteamSpy as well.

Let's do a bit of pre-cleaning, to ensure we download only the ids we need from Steam Spy.

We are going to consider valid apps those that at least have a name for the moment. Then delete the remaining duplicates.

In [19]:
steam_app_data = steam_app_data.drop_duplicates(subset="steam_appid", keep="last")
steam_app_data.to_csv("../data/download/steam_app_data.csv", index=False)

Appids that were not downloaded

In [20]:
steam_errors_df = pd.DataFrame(steam_errors, columns =['appid'])
steam_errors_df.to_csv("../data/download/steam_errors.csv", index=False)
steam_errors_df

Unnamed: 0,appid
0,281341
1,712350
2,1023100
3,1061400
4,1072170
5,1163550
6,1215750
7,1262349
8,1354970
9,1389830


In [21]:
# inspect downloaded data
steam_app_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 102436 entries, 0 to 103144
Data columns (total 39 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   type                     102436 non-null  object
 1   name                     102436 non-null  object
 2   steam_appid              102436 non-null  int64 
 3   required_age             102436 non-null  object
 4   is_free                  102436 non-null  object
 5   controller_support       25479 non-null   object
 6   dlc                      9693 non-null    object
 7   detailed_description     102286 non-null  object
 8   about_the_game           102285 non-null  object
 9   short_description        102282 non-null  object
 10  fullgame                 34580 non-null   object
 11  supported_languages      102262 non-null  object
 12  header_image             102436 non-null  object
 13  website                  60038 non-null   object
 14  pc_requirements     

## Steam Spy API

APPs on steam have an unique ID. The requests to Steam API (which has more information than Steam Spy) have to be made for a specific ID. This means we have to get first a list of ids.

We can do this in several ways, but this is what I decided to follow:

* Using Steam Spy API (see https://steamspy.com/api.php) to get the list of IDs and also the metadata from Steam Spy (at the same time). Alternatively, we could use Steam API to get a list of apps, then filter them (see https://api.steampowered.com/ISteamApps/GetAppList/v2/? or https://steamapi.xpaw.me/#IStoreService/GetAppInfo). Unfortunately, using this method got me 

* Then using Steam API to loop for each ID from the list and getting the complete info.

We are going to use this request: https://steamspy.com/api.php?request=all&page=1 - return apps 1,000-1,999 of all apps.

In [None]:
def parse_steamspy_request(appid):
    """Parser to handle SteamSpy API data."""
    url = "https://steamspy.com/api.php"
    parameters = {"request": "appdetails", "appid": appid}
    
    json_data = get_request(url, parameters)
    return json_data


# set files and columns
download_path = '../data/download'
steamspy_data = 'steamspy_data.csv'
steamspy_index = 'steamspy_index.txt'

steamspy_columns = [
    'appid', 'name', 'developer', 'publisher', 'score_rank', 'positive',
    'negative', 'userscore', 'owners', 'average_forever', 'average_2weeks',
    'median_forever', 'median_2weeks', 'price', 'initialprice', 'discount',
    'languages', 'genre', 'ccu', 'tags'
]

steamspy_errors = []

reset_index(download_path, steamspy_index)
index = get_index(download_path, steamspy_index)

# Wipe data file if index is 0
prepare_data_file(download_path, steamspy_data, index, steamspy_columns)

process_batches(
    parser=parse_steamspy_request,
    app_list=full_steam_ids["appid"],
    download_path=download_path, 
    data_filename=steamspy_data,
    index_filename=steamspy_index,
    errors_list=steamspy_errors,
    columns=steamspy_columns,
    begin=index,
    end=len(full_steam_ids),
    batchsize=300,
    pause=0.1
)

In [23]:
steamspy_errors_df = pd.DataFrame(steamspy_errors, columns =['appid'])
steamspy_errors_df.to_csv("../data/download/steamspy_errors.csv", index=False)
steamspy_errors

[326460, 392780, 590187, 1018960, 1018990, 1019000, 1019190, 1103750, 1705313]

Let's quickly check if we have valid data inside.

In [24]:
steam_spy_data = pd.read_csv('../data/download/steamspy_data.csv')

In [25]:
steam_spy_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103145 entries, 0 to 103144
Data columns (total 20 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   appid            103145 non-null  int64  
 1   name             102896 non-null  object 
 2   developer        92436 non-null   object 
 3   publisher        83261 non-null   object 
 4   score_rank       52 non-null      float64
 5   positive         103145 non-null  int64  
 6   negative         103145 non-null  int64  
 7   userscore        103145 non-null  int64  
 8   owners           103145 non-null  object 
 9   average_forever  103145 non-null  int64  
 10  average_2weeks   103145 non-null  int64  
 11  median_forever   103145 non-null  int64  
 12  median_2weeks    103145 non-null  int64  
 13  price            92793 non-null   float64
 14  initialprice     92804 non-null   float64
 15  discount         92804 non-null   float64
 16  languages        92583 non-null   obje

In [26]:
steam_spy_data.duplicated(subset="appid").sum()

9

Looks good!

Now we have the Steam Spy data available on `../data/download/steamspy_data.csv` and the Steam Store data available on `../data/download/steam_app_data.csv`

## Steam Reviews

While exploring why the steam reviews on steam spy and the steam store webpage itself were not the same, we got a way to check via the API partners itself. Let's try to obtain this information.

In [None]:
def parse_steamreviews_request(appid):
    """Parser to handle SteamSpy API data."""
    url = "https://store.steampowered.com/appreviews/" + str(appid)
    #todo: add purchase_type=all in parameters for the next version
    parameters = {"json": 1, "num_per_page": "0", "language": "all", "purchase_type": "all"}
    json_data = get_request(url, parameters)
    json_data = json_data['query_summary']
    json_data["appid"]=appid
    return json_data


# set files and columns
download_path = '../data/download'
steamreviews_data = 'steamreviews_data.csv'
steamreviews_index = 'steamreviews_index.txt'

steamreviews_columns = [
    'appid', 'review_score', 'review_score_desc', 'total_positive', 'total_negative', 'total_reviews'
]

steamreviews_errors = []

#Reset index if to download the reviews from 0
#reset_index(download_path, steamreviews_index)
index = get_index(download_path, steamreviews_index)

# Wipe data file if index is 0
prepare_data_file(download_path, steamreviews_data, index, steamreviews_columns)

full_steam_ids=pd.read_csv("../data/download/steam_app_data.csv")

process_batches(
    parser=parse_steamreviews_request,
    app_list=full_steam_ids["steam_appid"],
    download_path=download_path, 
    data_filename=steamreviews_data,
    index_filename=steamreviews_index,
    errors_list=steamreviews_errors,
    columns=steamreviews_columns,
    begin=index,
    end=len(full_steam_ids),
    batchsize=300,
    pause=0
)

In [28]:
steamreviews=pd.read_csv("../data/download/steamreviews_data.csv")

In [29]:
steamreviews_errors_df = pd.DataFrame(steamreviews_errors, columns =['appid'])
steamreviews_errors_df.to_csv("../data/download/steamreviews_errors.csv", index=False)
steamreviews_errors

[303530, 433743, 602180, 1085460, 1508400, 1890670]

In [30]:
steamreviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102436 entries, 0 to 102435
Data columns (total 6 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   appid              102436 non-null  int64  
 1   review_score       102396 non-null  float64
 2   review_score_desc  102396 non-null  object 
 3   total_positive     102396 non-null  float64
 4   total_negative     102396 non-null  float64
 5   total_reviews      102396 non-null  float64
dtypes: float64(4), int64(1), object(1)
memory usage: 4.7+ MB


In [31]:
steamreviews["total_reviews"].value_counts()

0.0       37626
1.0        7625
2.0        5171
3.0        3727
4.0        3006
          ...  
1602.0        1
2996.0        1
3588.0        1
1938.0        1
3198.0        1
Name: total_reviews, Length: 3893, dtype: int64

In [32]:
steamreviews["review_score_desc"].value_counts()

No user reviews            37626
Very Positive               9557
Mixed                       8970
Positive                    8877
1 user reviews              7625
Mostly Positive             6455
2 user reviews              5171
3 user reviews              3727
4 user reviews              3006
5 user reviews              2464
6 user reviews              1887
7 user reviews              1723
Mostly Negative             1505
8 user reviews              1464
9 user reviews              1238
Overwhelmingly Positive      784
Negative                     256
Very Negative                 52
Overwhelmingly Negative        9
Name: review_score_desc, dtype: int64

In [33]:
steamreviews["review_score"].value_counts()

0.0    65931
8.0     9557
5.0     8970
7.0     8877
6.0     6455
4.0     1505
9.0      784
3.0      256
2.0       52
1.0        9
Name: review_score, dtype: int64

In [34]:
#Checking for duplicates
steamreviews.duplicated(subset="appid").sum()

6

In [35]:
#Cleaning up duplicates
steamreviews = steamreviews.drop_duplicates(subset="appid", keep="last")
steamreviews.to_csv("../data/download/steamreviews_data.csv", index=False)

This looks very good. Review Score Description actually gives us more information than the Score alone... although it lumps together all games with less than 10 reviews as a score of 0.

We might want to keep only the total reviews as popularity, and feature a column to have a score. But it the categories already stablished in Steam seem to be adequate. In any case, a continuous score is also good, so let's use the one stablished at steam DB: https://steamdb.info/blog/steamdb-rating/

## Compiling data colection errors table

Keeping track on what apps are missing/removed from the dataset is quite helpful and is necessary for the proper statistical analysis. So here we'll compile the dataset with the missing apps and the rough reasons of why they are missing. It will be used during cleanup as well.

### Loading data

In [135]:
#Loading data tables
steam_app_data = pd.read_csv("../data/download/steam_app_data.csv")
steam_spy_data = pd.read_csv("../data/download/steamspy_data.csv")
steamreviews = pd.read_csv("../data/download/steamreviews_data.csv")

steam_app_data = steam_app_data.set_index("steam_appid")
steam_spy_data = steam_spy_data.set_index("appid")
steamreviews = steamreviews.set_index("appid")

#Loading error tables
steam_app_errors = pd.read_csv("../data/download/steam_errors.csv")
steam_spy_errors = pd.read_csv("../data/download/steamspy_errors.csv")
steam_reviews_errors = pd.read_csv("../data/download/steamreviews_errors.csv")

#Loading full ids table
full_ids = pd.read_csv("../data/download/full_steam_ids.csv")
missing_collection = full_ids[~full_ids["appid"].isin(steam_app_data.index) & ~full_ids.index.isin(missing_ids.index)].copy()
missing_ids = pd.concat([steam_app_errors,steam_spy_errors,steam_reviews_errors,missing_collection]).drop_duplicates().reset_index(drop=True)

  exec(code_obj, self.user_global_ns, self.user_ns)


In [136]:
missing_collection = full_ids[~full_ids["appid"].isin(steam_app_data.index) & ~full_ids.index.isin(missing_ids.index)].copy()

### Trying to redownload the missing ids

In [75]:
steam_errors = []
steamspy_errors = []
steamreviews_errors = []

print("Attempting to redownload Storefront data")
process_batches(
    parser=parse_steam_request,
    app_list=missing_ids["appid"],
    download_path=download_path,
    data_filename=steam_app_data_data,
    index_filename=steam_index,
    errors_list=steam_errors,
    columns=steam_columns,
    begin=0,
    end=len(missing_ids),
    batchsize=100,
    pause=1
)

print("Attempting to redownload SteamSpy data")
process_batches(
    parser=parse_steamspy_request,
    app_list=missing_ids["appid"],
    download_path=download_path, 
    data_filename=steamspy_data,
    index_filename=steamspy_index,
    errors_list=steamspy_errors,
    columns=steamspy_columns,
    begin=0,
    end=len(missing_ids),
    batchsize=100,
    pause=1
)

print("Attempting to redownload Steam Review data")
process_batches(
    parser=parse_steamreviews_request,
    app_list=missing_ids["appid"],
    download_path=download_path, 
    data_filename=steamreviews_data,
    index_filename=steamreviews_index,
    errors_list=steamreviews_errors,
    columns=steamreviews_columns,
    begin=0,
    end=len(missing_ids),
    batchsize=100,
    pause=1
)

Starting at index 0:

Current index: 3
Error getting data for 1061400 with exception Expecting value: line 1 column 1 (char 0)

Current index: 11
Error getting data for 1444140 with exception Expecting value: line 1 column 1 (char 0)

Current index: 17
Error getting data for 1863540 with exception Expecting value: line 1 column 1 (char 0)

Exported lines 0-99 to steam_app_data_delta.csv. Batch 0 time: 0:03:18 (avg: 0:03:18, remaining: 0:23:05)
Exported lines 100-199 to steam_app_data_delta.csv. Batch 1 time: 0:03:16 (avg: 0:03:17, remaining: 0:19:40)
Exported lines 200-299 to steam_app_data_delta.csv. Batch 2 time: 0:03:16 (avg: 0:03:17, remaining: 0:16:23)
Exported lines 300-399 to steam_app_data_delta.csv. Batch 3 time: 0:03:15 (avg: 0:03:16, remaining: 0:13:05)
Exported lines 400-499 to steam_app_data_delta.csv. Batch 4 time: 0:03:15 (avg: 0:03:16, remaining: 0:09:48)
Exported lines 500-599 to steam_app_data_delta.csv. Batch 5 time: 0:03:16 (avg: 0:03:16, remaining: 0:06:32)
Exporte

In [142]:
# Removing duplicates
steam_app_data = pd.read_csv("../data/download/steam_app_data.csv")
steam_app_data = steam_app_data.drop_duplicates(subset="steam_appid", keep="last")
steam_app_data.to_csv("../data/download/steam_app_data.csv", index=False)

steam_spy_data = pd.read_csv("../data/download/steamspy_data.csv")
steam_spy_data = steam_spy_data.drop_duplicates(subset="appid", keep="last")
steam_spy_data.to_csv("../data/download/steamspy_data.csv", index=False)

steamreviews = pd.read_csv("../data/download/steamreviews_data.csv")
steamreviews = steamreviews.drop_duplicates(subset="appid", keep="last")
steamreviews.to_csv("../data/download/steamreviews_data.csv", index=False)

steam_app_data = steam_app_data.set_index("steam_appid")
steam_spy_data = steam_spy_data.set_index("appid")
steamreviews = steamreviews.set_index("appid")

### Creating the final missing_ids table

In [125]:
steam_app_errors =  pd.DataFrame(steam_errors, columns =['appid'])
steam_app_errors.to_csv("../data/download/steam_errors.csv", index=False)

steam_spy_errors =  pd.DataFrame(steamspy_errors, columns =['appid'])
steam_spy_errors.to_csv("../data/download/steamspy_errors.csv", index=False)

steam_reviews_errors = pd.DataFrame(steamreviews_errors, columns =['appid'])
steam_reviews_errors.to_csv("../data/download/steamreviews_errors.csv", index=False)

In [126]:
#Creating error table based on steam_app_errors table
missing_ids = steam_app_errors.copy()
missing_ids["reason"] = "Steam Download Error"

In [127]:
#Adding SteamSpy download errors (checking if they are already present)
steam_spy_errors["reason"] = "SteamSpy Download Error"
#Adding Steam Reviews errors
steam_reviews_errors["reason"] = "Steam Review Download Error"
#Adding the missing ids to the error list
missing_ids = pd.concat([missing_ids,steam_spy_errors,steam_reviews_errors])

In [133]:
#Getting the list of the missing ids by comparing full_ids witht steam_app data
#df.loc[~df.index.isin(df.merge(df2.assign(a='key'),how='left').dropna().index)]
missing_collection = full_ids[~full_ids["appid"].isin(steam_app_data.index) & ~full_ids.index.isin(missing_ids.index)].copy()
missing_collection["reason"] = "Steam Storefront Error"
missing_ids = pd.concat([missing_ids,missing_collection]).reset_index(drop=True)

In [134]:
#Saving resulting missing ids dataframe to csv
missing_ids.to_csv("../data/download/missing_ids.csv", index=False)

In [None]:
# Comparing the data tables and making steamreviews and steam_spy_data to be consistent with steam_app_data:
index_missing_reviews = steam_app_data.index.difference(steamreviews.index)
index_missing_steamspy = steam_app_data.index.difference(steam_spy_data.index)

## Next Steps

Here we have defined and demonstrated the download process used to generate the data sets. This is similar to what Nik Davis did in the past, with the exception that now the process can be reinitiated to get only the new IDs in the full id list and add them to the previous dataset. This might be expanded to get also the IDs with data updated.

We have two tables now with a lot of information from the apps on Stem Store. From the Steam Store API we have a lot of metadata, which is used by the Steam Store itself to display the Store page. We will consider this the main source of information. The most useful but missing information is the quantity of positive or negative recommendations, we only have the total. Also, the tags (possibly as they were added later) are not available. There might be available in a separate request which is not public, or Valve just forgot to add it to the list.

From Steam Spy we have some additional information as it tries to track the concurrent users, we have averages , top... It also offers an estimate of owners, with a very large margin of error. We will check exactly what to keep and how to clean it in the next section.

After reviewing the reviews (positive/negative reviews) vs total reviews from Steam Spy and Steam Store in the cleaning section, we discovered that they did not agree so we got a third dataset - review metadata from the Steam Store API (partners). We could have changed the request and instead get all individual reviews, which could be an interesting machine learning analysis - we have data about if they are positive or negative, and we could do a sentiment analysis from the text using NLP. But that is not the focus of our current analysis. Also we have to note that getting the metadata was about 5h, but getting all individual reviews could take quite some time.