## GetGameData

This notebook scrapes Steam store pages and stores the collected game data as a JSON file.

In [1]:
%matplotlib inline

from bs4 import BeautifulSoup
import datetime
import json
import requests
import time

In [2]:
game_dict = {};

cookies = {"birthtime": "670000000"};

fields = ["game_name",
          "release_year", "release_month", "release_day",
          "lifetime", "tags",
          "platform_windows", "platform_mac", "platform_linux",
          "price_discount", "price_original", "metascore",
          "good_review_percentage_recent", "n_reviews_recent",
          "good_review_percentage", "n_reviews"];

# Map three-letter month abbreviations (lower case) to month numbers
month_map = {"jan": 1, "feb": 2, "mar": 3, "apr": 4,
             "may": 5, "jun": 6, "jul": 7, "aug": 8,
             "sep": 9, "oct": 10, "nov": 11, "dec": 12};

In [3]:
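def IsValidDateWordList(wordlist):
    # Expect exactly three tokens: month name, day, year (e.g. ["Apr", "18", "2011"])
    if len(wordlist) != 3:
        return False;
    
    if not wordlist[0].isalpha():
        return False;
    
    # The first three letters of the month must be a key of month_map
    month_key = wordlist[0][:3].lower();
    if month_key not in month_map:
        return False;
    
    if not wordlist[1].isdigit():
        return False;
    
    if not wordlist[2].isdigit():
        return False;
    
    day = int(wordlist[1]);
    year = int(wordlist[2]);
    
    if year < 1970:
        return False;
    
    # Let datetime reject impossible dates (day 0, Feb 30, ...)
    try:
        datetime.date(year, month_map[month_key], day);
    except ValueError:
        return False;
    
    return True;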
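
As a usage illustration, the validation and conversion steps can be folded into a single helper. The sketch below uses a hypothetical `parse_release_date` function and a `MONTHS` table (neither is part of this notebook) and assumes release dates arrive as strings such as "Apr 18, 2011". It is a simplified variant rather than the notebook's own logic: it skips the `isalpha` and `year < 1970` checks and returns `None` for anything it cannot interpret.

```python
import datetime

# Month abbreviations mapped to month numbers, mirroring month_map above
MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

def parse_release_date(text):
    """Return a datetime.date for strings like 'Apr 18, 2011', else None."""
    words = text.replace(",", "").split()
    if len(words) != 3:
        return None
    month = MONTHS.get(words[0][:3].lower())
    if month is None or not (words[1].isdigit() and words[2].isdigit()):
        return None
    try:
        # datetime.date raises ValueError for impossible dates (e.g. Feb 30)
        return datetime.date(int(words[2]), month, int(words[1]))
    except ValueError:
        return None

print(parse_release_date("Apr 18, 2011"))   # 2011-04-18
print(parse_release_date("Coming soon"))    # None
```

Returning `None` instead of raising keeps the calling loop free of exception handling for pages whose date field holds free text.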

In [7]:
with open('game_data.json', 'r') as f:
    game_dict = json.load(f);

The next block of code is the main body of work in this notebook. It loops over Steam store pages by app number and scrapes the relevant information from each. Currently, it pulls the game name, release date, lifetime in days (the number of days since release), tags, supported operating systems, current and original (undiscounted) prices, Metacritic score, the number of reviews and percentage of positive reviews in the past 30 days, and the same two figures overall.
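
The review figures are read from a tooltip string by position: the first token is the percentage and the fourth is the review count. A minimal sketch of that step follows, using a hypothetical `parse_review_summary` helper that is not part of this notebook. The sample tooltip is an assumed wording inferred from the token positions the loop relies on; the actual text on a Steam page may differ.

```python
def parse_review_summary(tooltip):
    """Return (percent_positive, n_reviews), using -1 and 0 as 'unknown' defaults."""
    percent, count = -1, 0
    words = tooltip.split()
    if len(words) <= 3:
        return percent, count
    # First token is the percentage with a trailing '%'
    if words[0][:-1].isdigit():
        percent = int(words[0][:-1])
    # Fourth token is the review count, possibly with thousands separators
    count_text = words[3].replace(",", "")
    if count_text.isdigit():
        count = int(count_text)
    return percent, count

# Assumed example tooltip; the exact wording on Steam may differ
sample = "92% of the 1,234 user reviews in the last 30 days are positive."
print(parse_review_summary(sample))  # (92, 1234)
```

The defaults of -1 and 0 match the sentinel values the loop assigns before scraping, so a failed parse is indistinguishable from a page with no reviews.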

For testing, the app number list [1, 620, 245810, 227700, 230050, 391220, 391680] covers the main cases. App 1 should redirect to the home page. Apps 620, 227700, and 230050 are games; 227700 is free to play and 230050 is missing the Metascore. Apps 245810 and 391680 are DLC pages. Finally, 391220 is held behind an age gate.
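
The free-to-play case is worth checking in isolation, because the price field then holds text rather than a number. The sketch below is a hypothetical `parse_price` helper, not the notebook's code: it assumes prices are US-dollar strings such as "$19.99" and that free titles contain the word "free". The loop in this notebook instead drops the first character and looks for "ree".

```python
def parse_price(price_text):
    """Return the price as a float, 0.0 for free titles, or None if unparseable."""
    text = price_text.strip()
    if "free" in text.lower():
        return 0.0
    # Drop a single leading currency symbol, assumed here to be '$'
    if text.startswith("$"):
        text = text[1:]
    try:
        return float(text)
    except ValueError:
        return None

print(parse_price("$19.99"))        # 19.99
print(parse_price("Free to Play"))  # 0.0
print(parse_price("N/A"))           # None
```

Checking for "free" before stripping the currency symbol avoids losing the leading "F", which is the reason the loop has to match on "ree".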

In [15]:
outlog = open('GetGameData.log', 'a');

print(time.ctime());
for appnumber in range(288000, 300000, 10):
    if(appnumber % 100 == 0):
        print("On app number %d" % appnumber);
        print(time.ctime());

    # Define variables and defaults
    game_name = '';
    release_date = datetime.date(1970, 1, 1); # Default
    release_year = 1970;
    release_month = 1;
    release_day = 1;
    lifetime = 0; # Default of age 0
    tags = [];
    platform_windows = 0;
    platform_mac     = 0;
    platform_linux   = 0;
    price_discount = -1.00;
    price_original = -1.00;
    metascore = -1;
    good_review_percentage_recent = -1;
    n_reviews_recent = 0;
    good_review_percentage = -1;
    n_reviews = 0;
    
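    try:
        game_page_html_test = requests.get("http://store.steampowered.com/app/%d/" % appnumber,
                                           cookies=cookies);
    except requests.exceptions.RequestException:
        # Only network/HTTP failures are skipped; other errors should surface
        outlog.write("%d: Couldn't get html?\n" % appnumber);
        continue;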
    soup = BeautifulSoup(game_page_html_test.text, "html.parser");
    time.sleep(1.5);
    
    ##--- Check if the page redirected to the Steam homepage ---##
    title = soup.title.get_text();
    if("Welcome to Steam" in title):
        continue;
    
    ##--- Look for entry indicating a DLC item ---##
    html_entries = soup.findAll("div", attrs={"class": "game_area_dlc_bubble game_area_bubble"});
    if(len(html_entries) > 0):
        continue;
    
    ##--- Get the game name ---##
    # Find entries that will include the game name
    html_entries = soup.findAll("div", attrs={"class": "apphub_AppName"});
    
    # Should only have one entry
    if(len(html_entries) > 1):
        outlog.write("%d: More than 1 game name?\n" % appnumber);
    if(len(html_entries) == 0):
        outlog.write("%d: No game name?\n" % appnumber);
    else:
        # Extract the game name
        game_name = html_entries[0].get_text();
    #print(game_name);
    
    ##--- Get the release date and age ---##
    # Find entries that will include the release date
    html_entries = soup.findAll("span", attrs={"class": "date"});
    
    # Should only have one entry
    if(len(html_entries) > 1):
        outlog.write("%d: More than 1 release date?\n" % appnumber);
    if(len(html_entries) == 0):
        outlog.write("%d: No release date?\n" % appnumber);
    else:
        # Extract text from date entry, remove comma
        release_date_unicode = html_entries[0].get_text();
        release_date_unicode = release_date_unicode.replace(',', '').strip();
        
        # Tokenize the string, should be Mmm DD YYYY
        words = release_date_unicode.split();
        # Extract release date, calculate age
        if IsValidDateWordList(words):
            # Make month all lower case
            words[0] = words[0][:3].lower();
            release_month = month_map[words[0]];
            release_day = int(words[1]);
            release_year = int(words[2]);
            release_date = datetime.date(release_year, release_month, release_day);
            today = datetime.date.today();
            lifetime = (today - release_date).days;
        else:
            outlog.write("%d: Invalid release date?\n" % appnumber);
    #print(release_date);
    #print(lifetime);
    
    ##--- Get the game tags ---##
    # Find entries that will include the tags
    html_entries = soup.findAll("a", attrs={"class": "app_tag"});
    # Extract each tag and strip trailing/leading white space from them
    tags = [entry.get_text().strip() for entry in html_entries];
    #print tags;
    
    ##--- Get the supported operating systems ---##
    # Find entries that will include the supported operating systems
    html_entries = soup.findAll("div", attrs={"class": "game_area_purchase_platform"});
    # The "entries" are just the appearance of an image
    # There is no actual text, so simply use (length > 0) as indication of a supported OS
    if(len(html_entries) == 0):
        outlog.write("%d: No supported OS?\n" % appnumber);
    else:
        platform_windows = int(len(html_entries[0].findAll("span", attrs={"class": "platform_img win"})) > 0);
        platform_mac     = int(len(html_entries[0].findAll("span", attrs={"class": "platform_img mac"})) > 0);
        platform_linux   = int(len(html_entries[0].findAll("span", attrs={"class": "platform_img linux"})) > 0);
    #print platform_windows, platform_mac, platform_linux;
    
    ##--- Get the current and original prices ---##
    # Find entries that will include the game prices
    html_entries = soup.findAll("div", attrs={"class": "game_purchase_action_bg"});
    # The tag to find is different based on whether the game is currently discounted
    # Search for both
    if(len(html_entries) == 0):
        outlog.write("%d: No price block?\n" % appnumber);
    else:
        price_original_block = html_entries[0].findAll("div", attrs={"class": "game_purchase_price"});
        price_discount_block = html_entries[0].findAll("div", attrs={"class": "discount_prices"});
        
        if(len(price_original_block) == 0 and len(price_discount_block) == 0):
            outlog.write("%d: No price?\n" % appnumber);
        if(len(price_original_block) > 0):
            # Try evaluating the 'no discount' entries first
            # Extract price text, remove dollar sign
            price_text = price_original_block[0].get_text().strip()[1:];
        
            # Evaluate prices as floats
            if(price_text.replace('.', '').isdigit()):
                price_discount = float(price_text);
                price_original = float(price_text);
            elif ("ree" in price_text):
                # The first character of the string was cut,
                # so free to play games have the 'F' dropped
                price_discount = 0.00;
                price_original = 0.00;
            else:
                outlog.write("%d: Non-numeric word for price?\n" % appnumber);
        if(len(price_discount_block) > 0):
            # Evaluate 'discount' entry second
            # This will overwrite 'no discount' results if both were found
            price_text = price_discount_block[0].findAll("div", attrs={"class": "discount_final_price"})[0].get_text().strip()[1:];
            if(price_text.replace('.', '').isdigit()):
                price_discount = float(price_text);
            elif ("ree" in price_text):
                price_discount = 0.00;
            else:
                outlog.write("%d: Non-numeric word for price?\n" % appnumber);
            
            price_text = price_discount_block[0].findAll("div", attrs={"class": "discount_original_price"})[0].get_text().strip()[1:];
            if(price_text.replace('.', '').isdigit()):
                price_original = float(price_text);
            elif ("ree" in price_text):
                price_original = 0.00;
            else:
                outlog.write("%d: Non-numeric word for price?\n" % appnumber);
    #print(price_discount);
    #print(price_original);

    ##--- Get the Metacritic score ---##
    # Find entry that will be the parent of the entry that includes the score
    html_entries = soup.findAll("div", attrs={"id": "game_area_metascore"});
    if(len(html_entries) > 1):
        outlog.write("%d: More than 1 Metascore?\n" % appnumber);
    if(len(html_entries) == 0):
        outlog.write("%d: No Metascore?\n" % appnumber);
    else:
        # The entry that includes the score should be the first
        score_entry = html_entries[0].findAll("div");
        
        if(len(score_entry) == 0):
            outlog.write("%d: No score in the Metascore area?\n" % appnumber);
        else:
            score_text = score_entry[0].get_text().strip();
            if(score_text.isdigit()):
                metascore = int(score_text);
            else:
                outlog.write("%d: Non-numeric Metascore?\n" % appnumber);
    #print metascore;
    
    ##--- Get the user review information ---##
    # Find entries that will include the user review information
    html_entries = soup.findAll("div", attrs={"class": "user_reviews_summary_row"});
    # There should be at most two entries: one recent summary and one overall summary
    if(len(html_entries) > 2):
        outlog.write("%d: More than 2 user review summaries?\n" % appnumber);
    # Extract text from entries, strip trailing/leading white space
    
    if(len(html_entries) == 0):
        outlog.write("%d: No user review information?\n" % appnumber);
    else:
        summary_recent = '';
        summary_overall = html_entries[0].attrs["data-store-tooltip"].strip();
        if(len(html_entries) == 2):
            summary_recent = html_entries[0].attrs["data-store-tooltip"].strip();
            summary_overall = html_entries[1].attrs["data-store-tooltip"].strip();
        
        # Evaluate recent entry information
        if("in the last 30 days" in summary_recent):
            # Tokenize the recent summary text
            words = summary_recent.split();
            
            if(len(words) <= 3):
                outlog.write("%d: Too short review string?\n" % appnumber);
            else:
                # The review percentage should be the first word, ignore the '%'
                review_text = words[0][:-1];
                if(review_text.isdigit()):
                    good_review_percentage_recent = int(review_text);
                else:
                    outlog.write("%d: Non-numeric review percentage?\n" % appnumber);
                
                # Get total number of reviews, remove commas
                review_text = words[3].replace(',', '');
                if(review_text.isdigit()):
                    n_reviews_recent = int(review_text);
                else:
                    outlog.write("%d: Non-numeric number of reviews?\n" % appnumber);
        
        # Evaluate overall entry information
        if("in the last 30 days" not in summary_overall):
            words = summary_overall.split();
            
            if(len(words) <= 3):
                outlog.write("%d: Too short review string?\n" % appnumber);
            else:
                review_text = words[0][:-1];
                if(review_text.isdigit()):
                    good_review_percentage = int(review_text);
                else:
                    outlog.write("%d: Non-numeric review percentage?\n" % appnumber);
                
                review_text = words[3].replace(',', '');
                if(review_text.isdigit()):
                    n_reviews = int(review_text);
                else:
                    outlog.write("%d: Non-numeric number of reviews?\n" % appnumber);
    #print(good_review_percentage_recent);
    #print(n_reviews_recent);
    #print(good_review_percentage);
    #print(n_reviews);
    #print("\n");
    
    game_dict[str(appnumber)] = dict(zip(fields, [game_name,
                                                  release_year, release_month, release_day,
                                                  lifetime, tags,
                                                  platform_windows, platform_mac, platform_linux,
                                                  price_discount, price_original, metascore,
                                                  good_review_percentage_recent, n_reviews_recent,
                                                  good_review_percentage, n_reviews]));

outlog.close();
print(time.ctime());
print(len(game_dict));

Sun Dec 25 10:11:50 2016
On app number 288000
Sun Dec 25 10:11:50 2016
On app number 288100
Sun Dec 25 10:12:15 2016
On app number 288200
Sun Dec 25 10:12:40 2016
On app number 288300
Sun Dec 25 10:13:07 2016
On app number 288400
Sun Dec 25 10:13:32 2016
On app number 288500
Sun Dec 25 10:13:56 2016
On app number 288600
Sun Dec 25 10:14:22 2016
On app number 288700
Sun Dec 25 10:14:47 2016
On app number 288800
Sun Dec 25 10:15:13 2016
On app number 288900
Sun Dec 25 10:15:38 2016
On app number 289000
Sun Dec 25 10:16:03 2016
On app number 289100
Sun Dec 25 10:16:28 2016
On app number 289200
Sun Dec 25 10:16:54 2016
On app number 289300
Sun Dec 25 10:17:18 2016
On app number 289400
Sun Dec 25 10:17:48 2016
On app number 289500
Sun Dec 25 10:18:14 2016
On app number 289600
Sun Dec 25 10:18:41 2016
On app number 289700
Sun Dec 25 10:19:06 2016
On app number 289800
Sun Dec 25 10:19:34 2016
On app number 289900
Sun Dec 25 10:20:01 2016
On app number 290000
Sun Dec 25 10:20:26 2016
On app nu

Now that the data has been scraped, it can be saved as a JSON file. This is the last part of the GetGameData notebook. The data will be loaded in the SteamVis notebook for visualization!
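
Before overwriting the real data file, it can be worth confirming that the dictionary survives a save and reload unchanged. The following is a minimal sketch of such a round trip. The `sample_dict` contents and the temporary file name are placeholders made up for illustration, not scraped values.

```python
import json
import os
import tempfile

# Small stand-in for game_dict; the values are placeholders, not scraped data
sample_dict = {"0": {"game_name": "Example Game",
                     "metascore": -1,
                     "tags": ["Example Tag"]}}

# Write to a throwaway directory so the real game_data.json is untouched
path = os.path.join(tempfile.mkdtemp(), "game_data_example.json")

# The context manager closes the file even if json.dump raises
with open(path, "w") as f:
    json.dump(sample_dict, f)

with open(path, "r") as f:
    loaded = json.load(f)

print(loaded == sample_dict)  # True
```

JSON object keys are always strings, which is consistent with the loop storing each entry under `str(appnumber)`; integer keys would come back as strings after reloading.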

In [16]:
len(game_dict)

4492

In [17]:
with open("game_data.json", "w") as file_game_data:
    json.dump(game_dict, file_game_data);

In [27]:
with open("game_data_sample.json", "w") as file_game_data:
    json.dump(game_dict, file_game_data);

In [28]:
game_page_html_test = requests.get("http://store.steampowered.com/app/620/");
#game_page_html_test.text
soup = BeautifulSoup(game_page_html_test.text, "html.parser");
print(soup.prettify())

<!DOCTYPE html>
<html class=" responsive" lang="en">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type">
   <meta content="width=device-width,initial-scale=1" name="viewport">
    <meta content="#171a21" name="theme-color">
     <title>
      Save 80% on Portal 2 on Steam
     </title>
     <link href="/favicon.ico" rel="shortcut icon" type="image/x-icon">
      <link href="http://store.akamai.steamstatic.com/public/shared/css/motiva_sans.css?v=Sd0odMs2NjL1" rel="stylesheet" type="text/css">
       <link href="http://store.akamai.steamstatic.com/public/shared/css/shared_global.css?v=qUauRjAB6F_h" rel="stylesheet" type="text/css">
        <link href="http://store.akamai.steamstatic.com/public/shared/css/buttons.css?v=FMXZx9fv9yp_" rel="stylesheet" type="text/css">
         <link href="http://store.akamai.steamstatic.com/public/css/v6/store.css?v=WMjWukom2M23" rel="stylesheet" type="text/css">
          <link href="http://store.akamai.steamstatic.com/public/shar