# Gathering Data

As part of our data collection and for the purpose of learning, we will obtain the data required using API. Typically an API is a great way for developers to allow access to databases and information on a server.

Below are the two API that we will be using for the data collection. 

1. https://partner.steamgames.com/doc/webapi_overview
2. https://steamspy.com/about

Valve (the company behind Steam) has API available at https://partner.steamgames.com/. An API such as this allows anyone to interface with data on a website in a controlled way, usually providing a host of useful features to the end-user. 

SteamSpy is a Steam stats-gathering service and crucially has data easily available through its own API. It provides a number of useful metrics including an estimation for total owners of each game.

## Problem Statement

(Potential)

Using data from Steam Games combined with data from Steam Spy, I will seek to identify games and genres with high ratings that people are playing these days and target to identify any possible relationship between ratings and genre (if possible, depending on the data). A recommender will then be built for people who would like to have suggestions on other games that they will be interested in.

Should time permits, the recommender can be built into an application for others to use.

## Data Collection method

Functions are written to collect data using API. ***Below code shows the functions used and collects a sample of the data.***

Total data estimated to be $51,749$ games details. The data collection is still in progress as limited to only 5,000 game details per iteration. Reason being the following error will be encountered if more that 5,000.<br> `'Connection failed: Too many connections'`

**Targeting to complete data collection by next week**. 



### Tracker for Steam Store
1. begin = 0, end = 1000, pause = 5, batch_pause = 180, time taken = 2h 8min 6s
2. Loop for 1000 to 5000, pause = 5, batch_pause = 180, time taken = 8h 23min 46s
3. Loop for 5000 to 10000, pause = 5, batch_pause = 180, time taken = 10h 23min 52s
4. Loop for 10000 to 15000, pause = 5, batch_pause = 180, time taken = 10h 25min 58s
5. Loop for 15000 to 20000, pause = 5, batch_pause = 180, time taken = 10h 27min 37s

### Tracker for Steam Spy
1. begin = 0, end = 1000, pause = 1.5, batch_pause = 60, time taken = 41min 11s
2. Loop for 1000 to 5000, pause = 1.5, batch_pause = 60, time taken = 2h 45min 3s
3. Loop for 5000 to 10000, pause = 1.5, batch_pause = 60, time taken = 3h 25min 2s
4. Loop for 10000 to 15000, pause = 1.5, batch_pause = 60, time taken = 3h 18min 25s
5. Loop for 15000 to 20000, pause = 1.5, batch_pause = 60, time taken = 3h 18min 53s
6. Loop for 20000 to 25000, pause = 1.5, batch_pause = 60, time taken = 3h 25min 11s
7. Loop for 25000 to 30000, pause = 1.5, batch_pause = 60, time taken = 3h 27min 51s
8. Loop for 30000 to 35000, pause = 1.5, batch_pause = 60, time taken = 3h 23min 33s
9. Loop for 35000 to 40000, pause = 1.5, batch_pause = 60, time taken = 3h 32min 48s
10. Loop for 40000 to 45000, pause = 1.5, batch_pause = 60, time taken = 3h 25min 28s

## Data Manipulation

After data collection is completed, we will then merge the two files into one and keep only required columns for the analysis and modelling. 


## Recommender

A recommender will be built based on the data, where user will be recommended similar games focusing on the genre and game title. 

---

## Import Libraries

In this section, we will import all the libraries that will be used in this notebook. 

In [1]:
# To read url
import requests

# For Calculation and Data Manipulation
import numpy as np
import pandas as pd

# For `.pkl` file exportion folding creation
import os

# for datetime conversion
import datetime

# for data collection server buffer time
import time

# this setting widens how many characters pandas will display in a column:
pd.options.display.max_colwidth = 400

---

## Functions

In this section, we will list down all the functions that are being used in the notebook as a summary. The functions can be found when looking through the code. 

1. Generic Function to get requests from an API. ([link](#Generic-Function-to-get-requests-from-an-API))
2. Function to save dataframe to file ([link](#Function-to-save-dataframe))
3. Functions to get data ([link](#Functions-to-get-data))
4. Function to process data obtained in batches ([link](#Function-to-process-data-obtained-in-batches))


---

To start off, we will write a generic function to get requests from an API. 

#### Generic Function to get requests from an API

We define a generic function to get requests from an API. This function will take in 2 parameters:
1. URL in string; and 
2. API parameters in dictionary form. 

The API paramenters supplied is passed into the _get requests_ automatically, depending on the API. 

We will add in a couple of scenarios to getting the response: 
1. If a SSL Error occurs during extraction, we will wait for 5 seconds before prompting user to advise if they would like to retry (by calling the function again) and providing a feedback during the function run. 
2. If there is no responses, we wait 10 seconds before retry.

In both scenarios, the loop will end once user choose not to continue with getting the response. 

In [2]:
# Generic Function to get requests from an API

def get_request(url, parameters=None):
    """
    Return json-formatted response of a get request using optional parameters
    
    Parameters
    ----------
    Parameters to pass as part of get_requests
    
    url : string
        url of API
    parameters : {'API_prarmeter': 'value'}
        parameters that the api accepts
        
    Return
    ------
    value returned after calling the function
    
    response.json : json_data
        json-formatted resposnse (dict-like)
    None
    
    """
    
    # code run, to run `except` line if error
    try: 
        response = requests.get(url = url, params = parameters)
     
    # code to run SSLError if error prompt from `try` code
    except SSLError as s:
        print(f'SSL Error : {s}')
        
        # Waiting time of 5s
        # \r: carriage return, return print to start of curent line instead 
        for i in range(5, 0, -1):
            print('\rWaiting...  ({})')  # print countdown of waiting time
            time.sleep(1)     # wait for 1 second
        
        # check with user if want to retry:
        user_input = input("Do you want to retry?\nResponse (Y/N):")
        
        # if input is neither 'y' or 'n'
        if (user_input.lower() != 'y') & (user_input.lower() != 'n'):
            # inform that their input is not a valid option and reprompt user
            print('Invalid response received.')
            user_input = input("Do you want to retry?\nResponse (Y/N):")
        
        # if input is 'n'
        elif user_input.lower() == 'n': 
            # user indicated that they do not wad to retry
            print('User do not want to continue, loop stop.')
            return None
        else: 
            # to inform user about retrying to get request
            print('\rRetrying...' + ' '*8)    # inform user we are retrying to get the request
            
            # rerun function to get request
            return get_request(url, parameters)
    
    if response: 
        return response.json()
    else:
        # reponses is none means there is too many requests. Wait and try again
        print('No response, waiting 10 seconds...')
        time.sleep(10)
        # inform user function will retry
        print('Retrying...')
        return get_request(url, parameters)

#### Function to save dataframe

We define a generic function to save dataframe into pkl file. This function will take in 2 parameters:
1. filename and path in string; and 
2. dataframe. 

The function saves a dataframe into a pkl file. 

In [3]:
### Function to save dataframe into pkl
def pkl_output(filename, df):
    
    """
    Creating a pkl file or overwriting existing pkl file
    
    Parameters
    ----------
    Parameters to pass as part of pkl_output
    
    filename : string
        folder/file path and name of file. E.g. '../data/name.pkl'
    df : Dataframe
        Dataframe that we want to save into a file path
        
    Return
    ------
    value returned after calling the function
    
    None
        A pkl file should be created in the folder location indicated by filename as specified
    
    """
    
    # filename of output
    output_path = filename
    
    # create data folder if it does not exist in current folder,
    if not os.path.exists('../data'):
        os.makedirs('../data')

    # append file to pkl file
    pd.DataFrame(df).to_pickle(output_path)

---

## Generate List of App IDs

Every app on steam store has a unique `app ID`, even if the name is the same. This will be our `unique identifier`, which will be used to identify apps between the two extracted data, and eventually merging the tables of data. 

As such, we will generate a list of `App IDs` which will be used to build our data sets. While it is possible to generate the list of `App IDs` from Steam API from the url (https://api.steampowered.com/ISteamApps/GetAppList/v2/https://api.steampowered.com/ISteamApps/GetAppList/v2/), there is a large number of entries and could possibly consists of demos and videos, we will not be able to tell them apart from just the `App ID`. 

SteamSpy provides an `'all'` request, supplying some information on the apps they track. While it does not supply all information about each app, it provides a good starting point. 

After getting the response, we will store it into a pandas dataframe. 

In [4]:
%%time

# define url and parameters to get all App IDs
url_steamspy = 'https://steamspy.com/api.php'
param_appid = {'request' : 'all', 'page': 0}

# show the current number of page being scrap
print(f'\rCurrent page: 0')

# request 'all' from steamspy and parse into dataframe
json_data = get_request(url = url_steamspy, parameters= param_appid)
steam_spy_all_df = pd.DataFrame.from_dict(json_data, orient='index')

# create page counter
counter = 1

# create temporary variable that is length 1000 for while loop to work
data_add = ['temp']*1000

# Create loop for appid extraction
# as each iteration will scrap 1000 entries per page, loop continues if data last obtained is 1000
while len(data_add) == 1000:
    
    # to include buffer timing for each request
    # API indicated that request is every 60s. 
    time.sleep(61)
    
    # show the current number of page being scrap
    print(f'\rCurrent page: {counter}')
    
    # update 'page' parameter
    param_appid['page'] = counter
    
    # create dataframe by getting the json data
    data_add = pd.DataFrame.from_dict(get_request(url = url_steamspy, parameters= param_appid), orient='index')
    
    # concat the additional data
    steam_spy_all_df = pd.concat([steam_spy_all_df, data_add])
    
    # update counter
    counter += 1
    
    # to comment out only when getting all data as code will take around an hour to get complete data
    # used for testing of the below codes
    if counter == 5:    # Line A1
       break           # Line A2

# create pkl file of extraction for future code usage as all data takes about an hour to extract
# to remove comment only if all data is being extracted, i.e. line A1 and A2 are being commented out
pkl_output('../data/sample_app_id_and_game_list.pkl', steam_spy_all_df.sort_values('appid'))

# create dataframe for app_list, keeping only App ID and name
app_list = steam_spy_all_df[['appid', 'name']].sort_values('appid').reset_index(drop=True)
app_list.rename(columns={'appid': 'app_id', 'name': 'game_name'}, inplace=True)

Current page: 0
Current page: 1
Current page: 2
Current page: 3
Current page: 4
Wall time: 4min 5s


In [5]:
# look at app list shape and data
print(app_list.shape)
app_list.head()

(5000, 2)


Unnamed: 0,app_id,game_name
0,10,Counter-Strike
1,20,Team Fortress Classic
2,30,Day of Defeat
3,40,Deathmatch Classic
4,50,Half-Life: Opposing Force


In [6]:
# look at app list info to see if there is any null value
app_list.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   app_id     5000 non-null   int64 
 1   game_name  5000 non-null   object
dtypes: int64(1), object(1)
memory usage: 78.2+ KB


---
## Define download logic

Now that we have the `app_list` dataframe, we can iterate over the App IDs and request individual game data from the servers. 

As there is alot of data that will be retrieved from the internet, we will avoid attempting to retrieve all at once as any errors or connection time-outs could cause the loss of all retrieved data. For this reason, we define a function to download and process the requests in batches, appending each batch to an external file while keeping track of the highest index. 

This will allow us to easily restart the process if an error is encountered, but also suggests we can complete the download across multiple sessions. 

#### Functions to get data

We define 3 functions to get data. 

1. The first function is used to return one batch of game data to the multiple batch process function. It takes in 4 parameters. 
    1. index for starting value corresponding to app_list
    2. index for stopping value corresponding to app_list
    3. function that will be used to scrap data
    4. integer indicating number of seconds in between each request
    
    The function returns a dataframe to the batch process function.
    <br><br>
2. The second function is used to scrap data from steam. It takes in 2 parameters. 
    1. unique identifier of the game
    2. name of the game
    
    The function returns a dictionary containing the game data. 
    <br><br>
3. The third function is used to scrap data from steamspy. It takes in 2 parameters. 
    1. unique identifier of the game
    2. name of the game
    
    The function returns a dictionary containing the game data. 

In [7]:
# Function to get data
def get_game_data(start, stop, fn, pause = 2):
    """
    Function to get and store one batch of game data before returning to multiple batch process function
    
    Parameters
    ----------
    Parameters to pass as part of get_data
    
    start : integer
        index of starting value for app_list (overall)
    stop : integer
        index of stopping value for app_list (overall)
    fn : function
        Function used to scrap data
    pause : integer / float
        value of pause time in seconds for each scrapping. Default is 2 seconds.
        
    Return
    ------
    value returned after calling the function
    
    game_data : dataframe
        dataframe containing the game data for app_list[start:stop]
    
    """
    
    game_data = pd.DataFrame()
    
    # start printing on new line
    print()
    
    # iterate through each row of app_id within start and stop
    for index, row in app_list[start:stop].iterrows():
        # inform user current which index is being requested
        print(f'Current index: {index}', end='\r')
        
        appid = row['app_id']
        name = row['game_name']
        
        # retrieve game data for a row, handled by supllied function, and append to dataframe
        data = fn(appid, name)
        game_data = game_data.append(data, ignore_index=True)
        
        # prevent overloading api with requests
        time.sleep(pause)
    return game_data

In [8]:
def steam_data_request(appid, name):
    """
    Function to get game data from steam
    
    Parameters
    ----------
    Parameters to pass as part of steam_data_request
    
    appid : integer
        application / game id
    name : string
        application / game name
        
    Return
    ------
    value returned after calling the function
    
    data : dictionary
        dictionary containing the game data for appid/name.
    
    """
    
    # set the parameters to get request from steam
    steam_url = "http://store.steampowered.com/api/appdetails/"
    steam_parameters = {"appids": appid}
    
    # request the game details using API
    steam_json_data = get_request(steam_url, parameters=steam_parameters)
    steam_json_game_data = steam_json_data[str(appid)]
    
    # value to return depending of request success
    if steam_json_game_data['success']:
        data = steam_json_game_data['data']
    else:
        data = {'name':name, 'steam_appid': appid}
    
    return data

In [9]:
def steamspy_data_request(appid, name):
    """
    Function to get game data from steamspy
    
    Parameters
    ----------
    Parameters to pass as part of steamspy_data_request
    
    appid : integer
        application / game id
    name : string
        application / game name
        
    Return
    ------
    value returned after calling the function
    
    steamspy_json_data : dictionary
        dictionary containing the game data for appid/name.
    
    """
    
    # set the parameters to get request from steam
    steamspy_url = "https://steamspy.com/api.php"
    steamspy_parameters = {"request": "appdetails", "appid": appid}
    
    # request game details using API
    steamspy_json_data = get_request(steamspy_url, steamspy_parameters)
    
    return steamspy_json_data

#### Function to process data obtained in batches

We define a generic function to save dataframes from batch scrapping into pkl file. This function will take in 8 parameters:
1. a function used to scrap data
2. dataframe containing `app_id` and `game_name`
3. file path and name of file output
4. starting index of scrapping dataframe
5. ending index of scrapping dataframe
6. batchsize of each batch request
7. integer indicating pause time between each scrapping
8. integer indicating pause time between each batch

The function saves file into a pkl file

In [10]:
def batch_process(fn, app_list, data_filename, begin=0, end=-1, batchsize=1000, pause=2, batch_pause=300):
    """
    Function to get game data in batches and stored to pkl file
    
    Parameters
    ----------
    Parameters to pass as part of batch_process
    
    fn : function
        Function used to scrap data
    app_list : Dataframe
        dataframe containing app_id and game_name
    data_filename : string
        folder/file path and name of file. E.g. '../data/name.pkl'
    begin : integer
        starting index of scrapping. Default is 0. 
    end : integer
        last index of scrapping. Default to -1
    batchsize : integer
        Size of each batch iteration. Default is 1000.
    pause : integer
        value of pause time in seconds for each scrapping. Default is 2 seconds.
    batch_pause : integer
        value of pause time in seconds for each batch. Default is 300 seconds
        
    Return
    ------
    value returned after calling the function
    
    None
        a pkl file is generated after the running the function
    """
    
    print(f'Starting at index {begin}\n')
    
    # if user did not define where to stop, by default, process all apps in app_list
    if end == -1:
        end = app_list.shape[0]
    
    # generate list of batch begin and end points
    batches = [i for i in range(begin, end, batchsize)]
    # if end not in batches, append it in
    if batches[-1] != end:
        batches.append(end)
    
    # counter - number of games written
    game_written = 0
    
    for i in range(len(batches)-1):
        
        # set start and stop value of batch i
        start = batches[i]
        stop = batches[i+1]
        
        # feedback to user data is being scrapped
        print(f'\rStarting lines {start} to {stop-1} scrapping                        ', end='')
        
        # get dataframe of game_data for batch i
        game_data = get_game_data(start, stop, fn, pause)
        
        # update counter
        game_written += game_data.shape[0]
        
        # feedback to user data has been collected for export
        print(f'\rData exporting for lines {start} to {stop-1}                        ', end='')
        
        # save (append) game_data into data_filename
        temp = pd.read_pickle(data_filename)          # read original datafile
        game_data = pd.concat([temp, game_data], ignore_index=True)   # combine original with newly scrapped
        pkl_output(data_filename, game_data)   # save pickle file
        
        # feedback to user data being scrapped
        print(f'\rData exported for lines {start} to {stop-1}                        ')
        
        # rest before next batch
        time.sleep(batch_pause)
        
    print(f'\nAll batches complete. {game_written} games extracted')

After defining all the functions for the download logic, we will start the data extraction for Steam and Steamspy.

---

## Download Steam Game

We will start downloading the game data for the games identified in our `app_list`. 

In [12]:
%%time

# if user is interested in obtaining a sample data
# instead of running the remaining cell in this section
# this cell can be run instead
# by removining the comments for lines identified as code

# pkl file name to be saved
sample_steam_filename = '../data/sample_steam_game_data.pkl'    # code

# create empty pickle file for function usage using empty dataframe
empty_df = pd.DataFrame()   # code
pkl_output(sample_steam_filename, empty_df)   # code

# last run index, default is 0 to start scrapping
sample_steam_index_value = 0

# download game data from steam based on app_list
# below are all code
batch_process(
    fn = steam_data_request,  # Function used to scrap data
    app_list = app_list,      # dataframe containing app_id and game_name
    data_filename = sample_steam_filename,       # folder/file path and name of file. E.g. '../data/name.pkl'
    begin = sample_steam_index_value,      # starting index of scrapping. Default is 0
    end=20,                   # last index of scrapping. Default to -1
    batchsize=10,           # Size of each batch iteration. Default is 1000
    pause=5,                  # value of pause time in seconds for each scrapping. Default is 2 seconds
    batch_pause=180                   # value of pause time in seconds for each batch. Default is 300 seconds
)

# read in sample and look at dataframe
sample_steam_game_df = pd.read_pickle(sample_steam_filename)
sample_steam_game_df.info()

Starting at index 0

Starting lines 0 to 9 scrapping                        
Data exported for lines 0 to 9                         
Starting lines 10 to 19 scrapping                        
Data exported for lines 10 to 19                         

All batches complete. 20 games extracted
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 33 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   about_the_game        20 non-null     object 
 1   background            20 non-null     object 
 2   categories            20 non-null     object 
 3   content_descriptors   20 non-null     object 
 4   detailed_description  20 non-null     object 
 5   developers            20 non-null     object 
 6   genres                20 non-null     object 
 7   header_image          20 non-null     object 
 8   is_free               20 non-null     float64
 9   linux_requirements    20 non-null     ob

In [13]:
print(sample_steam_game_df.shape)
sample_steam_game_df.head()

(20, 33)


Unnamed: 0,about_the_game,background,categories,content_descriptors,detailed_description,developers,genres,header_image,is_free,linux_requirements,...,short_description,steam_appid,support_info,supported_languages,type,website,dlc,achievements,demos,movies
0,Play the world's number 1 online action game. Engage in an incredibly realistic brand of terrorist warfare in this wildly popular team-based game. Ally with teammates to complete strategic missions. Take out enemy sites. Rescue hostages. Your role affects your team's success. Your team's success affects your role.,https://cdn.akamai.steamstatic.com/steam/apps/10/page_bg_generated_v6b.jpg?t=1602535893,"[{'id': 1, 'description': 'Multi-player'}, {'id': 49, 'description': 'PvP'}, {'id': 36, 'description': 'Online PvP'}, {'id': 37, 'description': 'Shared/Split Screen PvP'}, {'id': 8, 'description': 'Valve Anti-Cheat enabled'}]","{'ids': [2, 5], 'notes': 'Includes intense violence and blood.'}",Play the world's number 1 online action game. Engage in an incredibly realistic brand of terrorist warfare in this wildly popular team-based game. Ally with teammates to complete strategic missions. Take out enemy sites. Rescue hostages. Your role affects your team's success. Your team's success affects your role.,[Valve],"[{'id': '1', 'description': 'Action'}]",https://cdn.akamai.steamstatic.com/steam/apps/10/header.jpg?t=1602535893,0.0,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual-core from Intel or AMD at 2.8 GHz, 1GB Memory, nVidia GeForce 8600/9600GT, ATI/AMD Radeaon HD2600/3600 (Graphic Drivers: nVidia 310, AMD 12.11), OpenGL 2.1, 4GB Hard Drive Space, OpenAL Compatible Sound Card'}",...,Play the world's number 1 online action game. Engage in an incredibly realistic brand of terrorist warfare in this wildly popular team-based game. Ally with teammates to complete strategic missions. Take out enemy sites. Rescue hostages. Your role affects your team's success. Your team's success affects your role.,10.0,"{'url': 'http://steamcommunity.com/app/10', 'email': ''}","English<strong>*</strong>, French<strong>*</strong>, German<strong>*</strong>, Italian<strong>*</strong>, Spanish - Spain<strong>*</strong>, Simplified Chinese<strong>*</strong>, Traditional Chinese<strong>*</strong>, Korean<strong>*</strong><br><strong>*</strong>languages with full audio support",game,,,,,
1,"One of the most popular online action games of all time, Team Fortress Classic features over nine character classes -- from Medic to Spy to Demolition Man -- enlisted in a unique style of online team warfare. Each character class possesses unique weapons, items, and abilities, as teams compete online in a variety of game play modes.",https://cdn.akamai.steamstatic.com/steam/apps/20/page_bg_generated_v6b.jpg?t=1579634708,"[{'id': 1, 'description': 'Multi-player'}, {'id': 49, 'description': 'PvP'}, {'id': 36, 'description': 'Online PvP'}, {'id': 37, 'description': 'Shared/Split Screen PvP'}, {'id': 8, 'description': 'Valve Anti-Cheat enabled'}, {'id': 44, 'description': 'Remote Play Together'}]","{'ids': [2, 5], 'notes': 'Includes intense violence and blood.'}","One of the most popular online action games of all time, Team Fortress Classic features over nine character classes -- from Medic to Spy to Demolition Man -- enlisted in a unique style of online team warfare. Each character class possesses unique weapons, items, and abilities, as teams compete online in a variety of game play modes.",[Valve],"[{'id': '1', 'description': 'Action'}]",https://cdn.akamai.steamstatic.com/steam/apps/20/header.jpg?t=1579634708,0.0,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual-core from Intel or AMD at 2.8 GHz, 1GB Memory, nVidia GeForce 8600/9600GT, ATI/AMD Radeaon HD2600/3600 (Graphic Drivers: nVidia 310, AMD 12.11), OpenGL 2.1, 4GB Hard Drive Space, OpenAL Compatible Sound Card'}",...,"One of the most popular online action games of all time, Team Fortress Classic features over nine character classes -- from Medic to Spy to Demolition Man -- enlisted in a unique style of online team warfare. Each character class possesses unique weapons, items, and abilities, as teams compete online in a variety of game play modes.",20.0,"{'url': '', 'email': ''}","English, French, German, Italian, Spanish - Spain, Korean, Russian, Simplified Chinese, Traditional Chinese",game,,,,,
2,"Enlist in an intense brand of Axis vs. Allied teamplay set in the WWII European Theatre of Operations. Players assume the role of light/assault/heavy infantry, sniper or machine-gunner class, each with a unique arsenal of historical weaponry at their disposal. Missions are based on key historical operations. And, as war rages, players must work together with their squad to accomplish a variety...",https://cdn.akamai.steamstatic.com/steam/apps/30/page_bg_generated_v6b.jpg?t=1512413490,"[{'id': 1, 'description': 'Multi-player'}, {'id': 8, 'description': 'Valve Anti-Cheat enabled'}]","{'ids': [], 'notes': None}","Enlist in an intense brand of Axis vs. Allied teamplay set in the WWII European Theatre of Operations. Players assume the role of light/assault/heavy infantry, sniper or machine-gunner class, each with a unique arsenal of historical weaponry at their disposal. Missions are based on key historical operations. And, as war rages, players must work together with their squad to accomplish a variety...",[Valve],"[{'id': '1', 'description': 'Action'}]",https://cdn.akamai.steamstatic.com/steam/apps/30/header.jpg?t=1512413490,0.0,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual-core from Intel or AMD at 2.8 GHz, 1GB Memory, nVidia GeForce 8600/9600GT, ATI/AMD Radeaon HD2600/3600 (Graphic Drivers: nVidia 310, AMD 12.11), OpenGL 2.1, 4GB Hard Drive Space, OpenAL Compatible Sound Card'}",...,"Enlist in an intense brand of Axis vs. Allied teamplay set in the WWII European Theatre of Operations. Players assume the role of light/assault/heavy infantry, sniper or machine-gunner class, each with a unique arsenal of historical weaponry at their disposal. Missions are based on key historical operations.",30.0,"{'url': '', 'email': ''}","English, French, German, Italian, Spanish - Spain",game,http://www.dayofdefeat.com/,,,,
3,"Enjoy fast-paced multiplayer gaming with Deathmatch Classic (a.k.a. DMC). Valve's tribute to the work of id software, DMC invites players to grab their rocket launchers and put their reflexes to the test in a collection of futuristic settings.",https://cdn.akamai.steamstatic.com/steam/apps/40/page_bg_generated_v6b.jpg?t=1568752159,"[{'id': 1, 'description': 'Multi-player'}, {'id': 49, 'description': 'PvP'}, {'id': 36, 'description': 'Online PvP'}, {'id': 37, 'description': 'Shared/Split Screen PvP'}, {'id': 8, 'description': 'Valve Anti-Cheat enabled'}, {'id': 44, 'description': 'Remote Play Together'}]","{'ids': [], 'notes': None}","Enjoy fast-paced multiplayer gaming with Deathmatch Classic (a.k.a. DMC). Valve's tribute to the work of id software, DMC invites players to grab their rocket launchers and put their reflexes to the test in a collection of futuristic settings.",[Valve],"[{'id': '1', 'description': 'Action'}]",https://cdn.akamai.steamstatic.com/steam/apps/40/header.jpg?t=1568752159,0.0,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual-core from Intel or AMD at 2.8 GHz, 1GB Memory, nVidia GeForce 8600/9600GT, ATI/AMD Radeaon HD2600/3600 (Graphic Drivers: nVidia 310, AMD 12.11), OpenGL 2.1, 4GB Hard Drive Space, OpenAL Compatible Sound Card'}",...,"Enjoy fast-paced multiplayer gaming with Deathmatch Classic (a.k.a. DMC). Valve's tribute to the work of id software, DMC invites players to grab their rocket launchers and put their reflexes to the test in a collection of futuristic settings.",40.0,"{'url': '', 'email': ''}","English, French, German, Italian, Spanish - Spain, Korean, Russian, Simplified Chinese, Traditional Chinese",game,,,,,
4,"Return to the Black Mesa Research Facility as one of the military specialists assigned to eliminate Gordon Freeman. Experience an entirely new episode of single player action. Meet fierce alien opponents, and experiment with new weaponry. Named 'Game of the Year' by the Academy of Interactive Arts and Sciences.",https://cdn.akamai.steamstatic.com/steam/apps/50/page_bg_generated_v6b.jpg?t=1579628243,"[{'id': 2, 'description': 'Single-player'}, {'id': 1, 'description': 'Multi-player'}, {'id': 8, 'description': 'Valve Anti-Cheat enabled'}, {'id': 44, 'description': 'Remote Play Together'}]","{'ids': [], 'notes': None}","Return to the Black Mesa Research Facility as one of the military specialists assigned to eliminate Gordon Freeman. Experience an entirely new episode of single player action. Meet fierce alien opponents, and experiment with new weaponry. Named 'Game of the Year' by the Academy of Interactive Arts and Sciences.",[Gearbox Software],"[{'id': '1', 'description': 'Action'}]",https://cdn.akamai.steamstatic.com/steam/apps/50/header.jpg?t=1579628243,0.0,"{'minimum': 'Minimum: Linux Ubuntu 12.04, Dual-core from Intel or AMD at 2.8 GHz, 1GB Memory, nVidia GeForce 8600/9600GT, ATI/AMD Radeaon HD2600/3600 (Graphic Drivers: nVidia 310, AMD 12.11), OpenGL 2.1, 4GB Hard Drive Space, OpenAL Compatible Sound Card'}",...,"Return to the Black Mesa Research Facility as one of the military specialists assigned to eliminate Gordon Freeman. Experience an entirely new episode of single player action. Meet fierce alien opponents, and experiment with new weaponry. Named 'Game of the Year' by the Academy of Interactive Arts and Sciences.",50.0,"{'url': 'https://help.steampowered.com', 'email': ''}","English, French, German, Korean",game,,,,,


## Download Steamspy Game
We will start downloading the game data from steamspy for the games identified in our app_list.

In [14]:
%%time

# if user is interested in obtaining a sample data
# instead of running the remaining cell in this section
# this cell can be run instead
# by removining the comments for lines identified as code

# pkl file name to be saved
sample_steamspy_filename = '../data/sample_steamspy_game_data.pkl'    # code

# create empty pickle file for function usage using empty dataframe
empty_df = pd.DataFrame()
pkl_output(sample_steamspy_filename, empty_df)   # code

# last run index, default is 0 to start scrapping
sample_steamspy_index_value = 0

# download game data from steam based on app_list
# below are all code
batch_process(
    fn = steamspy_data_request,  # Function used to scrap data
    app_list = app_list,      # dataframe containing app_id and game_name
    data_filename = sample_steamspy_filename,       # folder/file path and name of file. E.g. '../data/name.pkl'
    begin = sample_steamspy_index_value,      # starting index of scrapping. Default is 0
    end=20,                   # last index of scrapping. Default to -1
    batchsize=10,           # Size of each batch iteration. Default is 1000
    pause=2,                  # value of pause time in seconds for each scrapping. Default is 2 seconds
    batch_pause=120                   # value of pause time in seconds for each batch. Default is 300 seconds
)

# read in sample and look at dataframe
sample_steamspy_game_df = pd.read_pickle(sample_steamspy_filename)
sample_steamspy_game_df.info()

Starting at index 0

Starting lines 0 to 9 scrapping                        
Data exported for lines 0 to 9                         
Starting lines 10 to 19 scrapping                        
Data exported for lines 10 to 19                         

All batches complete. 20 games extracted
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 20 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   appid            20 non-null     float64
 1   average_2weeks   20 non-null     float64
 2   average_forever  20 non-null     float64
 3   ccu              20 non-null     float64
 4   developer        20 non-null     object 
 5   discount         20 non-null     object 
 6   genre            20 non-null     object 
 7   initialprice     20 non-null     object 
 8   languages        20 non-null     object 
 9   median_2weeks    20 non-null     float64
 10  median_forever   20 non-null     float64
 11  nam

In [15]:
print(sample_steamspy_game_df.shape)
sample_steamspy_game_df.head()

(20, 20)


Unnamed: 0,appid,average_2weeks,average_forever,ccu,developer,discount,genre,initialprice,languages,median_2weeks,median_forever,name,negative,owners,positive,price,publisher,score_rank,tags,userscore
0,10.0,113.0,11343.0,16273.0,Valve,0,Action,999,"English, French, German, Italian, Spanish - Spain, Simplified Chinese, Traditional Chinese, Korean",34.0,201.0,Counter-Strike,4951.0,"10,000,000 .. 20,000,000",193377.0,999,Valve,,"{'Action': 5379, 'FPS': 4801, 'Multiplayer': 3362, 'Shooter': 3327, 'Classic': 2758, 'Team-Based': 1845, 'First-Person': 1692, 'Competitive': 1587, 'Tactical': 1323, '1990's': 1181, 'e-sports': 1174, 'PvP': 865, 'Old School': 751, 'Military': 623, 'Strategy': 604, 'Survival': 296, 'Score Attack': 284, '1980s': 256, 'Assassin': 222, 'Violent': 64}",0.0
1,20.0,0.0,182.0,82.0,Valve,0,Action,499,"English, French, German, Italian, Spanish - Spain, Korean, Russian, Simplified Chinese, Traditional Chinese",0.0,13.0,Team Fortress Classic,896.0,"5,000,000 .. 10,000,000",5417.0,499,Valve,,"{'Action': 745, 'FPS': 306, 'Multiplayer': 257, 'Classic': 232, 'Hero Shooter': 213, 'Shooter': 206, 'Team-Based': 188, 'Class-Based': 181, 'First-Person': 169, '1990's': 132, 'Old School': 106, 'Co-op': 89, 'Competitive': 68, 'Fast-Paced': 61, 'Retro': 55, 'Online Co-Op': 51, 'Violent': 45, 'Mod': 36, 'Funny': 35, 'Remake': 35}",0.0
2,30.0,0.0,127.0,158.0,Valve,0,Action,499,"English, French, German, Italian, Spanish - Spain",0.0,21.0,Day of Defeat,558.0,"5,000,000 .. 10,000,000",5018.0,499,Valve,,"{'FPS': 788, 'World War II': 249, 'Multiplayer': 202, 'Shooter': 188, 'Action': 160, 'War': 151, 'Team-Based': 131, 'Classic': 126, 'First-Person': 105, 'Class-Based': 77, 'Military': 64, 'Historical': 57, 'Tactical': 40, 'Singleplayer': 37, 'Co-op': 34, 'Difficult': 18, 'Old School': 16, 'Retro': 14, 'World War I': 14, 'Strategy': 13}",0.0
3,40.0,0.0,50.0,3.0,Valve,0,Action,499,"English, French, German, Italian, Spanish - Spain, Korean, Russian, Simplified Chinese, Traditional Chinese",0.0,12.0,Deathmatch Classic,412.0,"5,000,000 .. 10,000,000",1860.0,499,Valve,,"{'Action': 629, 'FPS': 139, 'Classic': 107, 'Multiplayer': 96, 'Shooter': 94, 'First-Person': 70, 'Arena Shooter': 44, 'Old School': 33, 'Sci-fi': 33, 'Competitive': 23, 'Fast-Paced': 15, 'Retro': 14, 'Gore': 14, 'Co-op': 13, 'Difficult': 12, '1990's': 8}",0.0
4,50.0,21.0,476.0,111.0,Gearbox Software,0,Action,499,"English, French, German, Korean",21.0,230.0,Half-Life: Opposing Force,666.0,"5,000,000 .. 10,000,000",13345.0,499,Valve,,"{'FPS': 881, 'Action': 322, 'Classic': 251, 'Sci-fi': 248, 'Singleplayer': 225, 'Shooter': 220, 'First-Person': 187, 'Aliens': 172, '1990's': 133, 'Adventure': 114, 'Atmospheric': 105, 'Military': 91, 'Story Rich': 74, 'Silent Protagonist': 65, 'Great Soundtrack': 50, 'Gore': 38, 'Puzzle': 35, 'Co-op': 31, 'Moddable': 29, 'Retro': 18}",0.0


By changing the paramenters, we will be able to get the complete 51,749 data from the servers for both servers.

Once full data is obtained, we will clean and merge both datasets before conducting EDA and machine learning (recommender). 