Steam Scraper Wrapper
---
This notebook calls the scraper notebook again and again until a certain target number of records is stored on disk.

We need over 20,000 records for our analysis.

As each record requires quite a few URL calls with manual delays in between, this scraping is quite slow. The only way to finish this in our lifetime is to run the notebook at all times.

The scraper notebook itself is vulnerable to network errors that are out of our control, and this blasted Windows machine just decides to restart itself from time to time (known issue, no fix; "feature not a bug", etc). Because of this, we can't just tell the scraper to scrape a bunch of games and then go on about our business.

This notebook aims to provide the desired set-it-and-forget-it functionality by calling the scraper notebook in a 'while' loop with a try/except block that logs process & errors, then builds in a delay after a failed scraping attempt. The loop is written to try f_o_r_e_v_e_r until the desired number of records are scraped, but since I live here, I'll check it periodically to make sure it isn't broken.

Each loop only attempts to scrape a certain number of games (as defined by {interval}), merges the results into the existing main file, then begins the loop again until {target_records} is reached.

Because the game records are added to the main file within each iteration, the maximum number of partially-scraped records that can be lost due to a crash is limited by the {interval} variable.

In [9]:
import pandas as pd
import time
import datetime

In [10]:
with open('../data/raw/0 - Scraped Games DF.pkl', 'rb') as file:
    existing_records = pd.read_pickle(file)

starting_records = len(existing_records)
print(starting_records)

66207


In [11]:
# This notebook will run until this many records exist in the data/raw directory.
target_records = 100000
current_records = starting_records

# This variable is fed to the other notebook to determine how many games should be scraped
# per notebook run.
interval = 10
%store interval

Stored 'interval' (int)


In [12]:
# Set up your tracker variables.
successful_iterations = 0
failed_iterations = 0
successive_failed_iterations = 0
start_time = time.time()

# Loop over the scraper notebook until {target_records} exist in the directory.
# Each call of the notebook aims to add {interval} records.
while current_records < target_records :
    loop_start_time = time.time()
    try :
        %run "0.0-jod-steam-scraper.ipynb"
        successful_iterations += 1
        successive_failed_iterations = 0
    except Exception as e :
        print(f"This exception printed from the wrapper: {e}")
        failed_iterations += 1
        successive_failed_iterations += 1
        print(f"Successive failed iterations: {successive_failed_iterations}")
        print("Pausing for 2 minutes...")
        time.sleep(120)
        print("OK, GO!")

    # The scraped records are added to the main file from inside the other notebook.
    # Thus, we have to read it here to see what our current total count is.
    # (For a variety of reasons, each notebook run might not scrape exactly the number
    # of games that it was supposed to.)
    with open('../data/raw/0 - Scraped Games DF.pkl', 'rb') as file:
        existing_records = pd.read_pickle(file)
    new_record_count = len(existing_records)

    # Final record.
    loop_end_time = time.time()
    loop_final_time = loop_end_time - loop_start_time
    total_time_so_far = loop_end_time - start_time
    print("-----------------------------")
    print(f"Loop complete at {datetime.now()}")
    print(f"Games scraped this loop: {new_record_count - current_records}")
    print(f"Time spent on this loop: {int(loop_final_time/60)}m {int(loop_final_time%60)}s")
    print(f"Games scraped so far: {new_record_count - starting_records}")
    print(f"Time spent so far: {int((loop_end_time-start_time)/60)}m")
    print(f"Current record count/target: {new_record_count}/{target_records}")
    print(f"Successful loops: {successful_iterations}")
    print(f"Failed loops: {failed_iterations}")
    print("-----------------------------")
    print("")
    current_records = new_record_count

finish_time = time.time()
total_runtime = finish_time - start_time

hours = int(total_runtime / (60**2))
minutes = int(total_runtime / 60)
seconds = int(total_runtime % 60)

# print(f"{hours}h, {minutes}m, {seconds}s")
print("******************************************************")
print("******************************************************")
print(f"Completed. Started at {starting_records}, added {current_records-starting_records}, ended at {current_records}.")
print(f"{successful_iterations} successful iterations, {failed_iterations} failed iterations.")
print(f"Total runtime: {hours}h {minutes}m {seconds}s")
print("******************************************************")
print("******************************************************")

Identified 66207 existing game records.
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&category1=998&supportedlang=english&page=3009
-----------------------------
Loop complete at 2024-03-12 16:37:53.020934
Games scraped this loop: 10
Time spent on this loop: 2m 24s
Games scraped so far: 10
Time spent so far: 2m
Current record count/target: 66217/100000
Successful loops: 1
Failed loops: 0
-----------------------------

Identified 66217 existing game records.
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&

KeyError: 'app_id'

This exception printed from the wrapper: 'app_id'
Successive failed iterations: 1
Pausing for 2 minutes...
OK, GO!
-----------------------------
Loop complete at 2024-03-12 19:17:51.419025
Games scraped this loop: 0
Time spent on this loop: 2m 26s
Games scraped so far: 566
Time spent so far: 2m
Current record count/target: 66773/100000
Successful loops: 57
Failed loops: 1
-----------------------------

Identified 66773 existing game records.
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&category1=998&supportedlang=english&page=3043
-----------------------------
Loop complete at 2024-03-12 19:20:42.631342
Games scraped this loop: 10
Time spent on this loop: 2m 51s
Games scraped so far: 576
Time spent so far: 2m
Current record count/target: 66783

UserWarning: No games scraped from search results. Cannot continue.

This exception printed from the wrapper: No games scraped from search results. Cannot continue.
Successive failed iterations: 1
Pausing for 2 minutes...
OK, GO!
-----------------------------
Loop complete at 2024-03-12 21:16:27.531427
Games scraped this loop: 0
Time spent on this loop: 4m 36s
Games scraped so far: 961
Time spent so far: 4m
Current record count/target: 67168/100000
Successful loops: 97
Failed loops: 2
-----------------------------

Identified 67168 existing game records.
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&category1=998&supportedlang=english&page=3064
-----------------------------
Loop complete at 2024-03-12 21:19:35.940226
Games scraped this loop: 10
Time spent on this loop: 3m 8s
Games scraped so far: 971
Time spent 

URLError: <urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>

This exception printed from the wrapper: <urlopen error [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>
Successive failed iterations: 1
Pausing for 2 minutes...
OK, GO!
-----------------------------
Loop complete at 2024-03-12 23:31:36.858112
Games scraped this loop: 0
Time spent on this loop: 2m 41s
Games scraped so far: 1410
Time spent so far: 2m
Current record count/target: 67617/100000
Successful loops: 143
Failed loops: 3
-----------------------------

Identified 67617 existing game records.
Stored 'next_link' (str)
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&category1=998&supportedlang=

AttributeError: 'bool' object has no attribute 'timeout'

This exception printed from the wrapper: 'bool' object has no attribute 'timeout'
Successive failed iterations: 1
Pausing for 2 minutes...
OK, GO!
-----------------------------
Loop complete at 2024-03-13 11:14:56.889006
Games scraped this loop: 0
Time spent on this loop: 4m 38s
Games scraped so far: 4056
Time spent so far: 4m
Current record count/target: 70263/100000
Successful loops: 408
Failed loops: 4
-----------------------------

Identified 70263 existing game records.
Stored 'next_link' (str)
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&category1=998&supportedlang=english&page=3318
-----------------------------
Loop complete at 2024-03-13 11:17:37.984314
Games scraped this loop: 10
Time spent on this loop: 2m 41s
Games scraped so far: 4

AttributeError: 'bool' object has no attribute 'timeout'

This exception printed from the wrapper: 'bool' object has no attribute 'timeout'
Successive failed iterations: 1
Pausing for 2 minutes...
OK, GO!
-----------------------------
Loop complete at 2024-03-13 12:19:48.733077
Games scraped this loop: 0
Time spent on this loop: 4m 39s
Games scraped so far: 4276
Time spent so far: 4m
Current record count/target: 70483/100000
Successful loops: 430
Failed loops: 5
-----------------------------

Identified 70483 existing game records.
Stored 'next_link' (str)
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&category1=998&supportedlang=english&page=3332
-----------------------------
Loop complete at 2024-03-13 12:22:18.199379
Games scraped this loop: 10
Time spent on this loop: 2m 29s
Games scraped so far: 4

UserWarning: No games scraped from search results. Cannot continue.

This exception printed from the wrapper: No games scraped from search results. Cannot continue.
Successive failed iterations: 1
Pausing for 2 minutes...
OK, GO!
-----------------------------
Loop complete at 2024-03-13 13:31:11.498326
Games scraped this loop: 0
Time spent on this loop: 4m 36s
Games scraped so far: 4516
Time spent so far: 4m
Current record count/target: 70723/100000
Successful loops: 454
Failed loops: 6
-----------------------------

Identified 70723 existing game records.
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&category1=998&supportedlang=english&page=3346
-----------------------------
Loop complete at 2024-03-13 13:34:12.242927
Games scraped this loop: 10
Time spent on this loop: 3m 0s
Games scraped so far: 4526
Time spe

UserWarning: No games scraped from search results. Cannot continue.

This exception printed from the wrapper: No games scraped from search results. Cannot continue.
Successive failed iterations: 1
Pausing for 2 minutes...
OK, GO!
-----------------------------
Loop complete at 2024-03-13 13:44:54.644502
Games scraped this loop: 0
Time spent on this loop: 4m 36s
Games scraped so far: 4546
Time spent so far: 4m
Current record count/target: 70753/100000
Successful loops: 457
Failed loops: 7
-----------------------------

Identified 70753 existing game records.
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&category1=998&supportedlang=english&page=3349
-----------------------------
Loop complete at 2024-03-13 13:47:34.090491
Games scraped this loop: 10
Time spent on this loop: 2m 39s
Games scraped so far: 4556
Time sp

UserWarning: No games scraped from search results. Cannot continue.

This exception printed from the wrapper: No games scraped from search results. Cannot continue.
Successive failed iterations: 1
Pausing for 2 minutes...
OK, GO!
-----------------------------
Loop complete at 2024-03-13 18:14:51.334815
Games scraped this loop: 0
Time spent on this loop: 4m 44s
Games scraped so far: 5461
Time spent so far: 4m
Current record count/target: 71668/100000
Successful loops: 550
Failed loops: 8
-----------------------------

Identified 71668 existing game records.
Stored 'next_link' (str)
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&category1=998&supportedlang=english&page=3415
-----------------------------
Loop complete at 2024-03-13 18:17:46.518228
Games scraped this loop: 10
Time spent on this loop: 2m 55s
Games scr

UserWarning: No games scraped from search results. Cannot continue.

This exception printed from the wrapper: No games scraped from search results. Cannot continue.
Successive failed iterations: 1
Pausing for 2 minutes...
OK, GO!
-----------------------------
Loop complete at 2024-03-13 19:12:36.392392
Games scraped this loop: 0
Time spent on this loop: 4m 36s
Games scraped so far: 5644
Time spent so far: 4m
Current record count/target: 71851/100000
Successful loops: 569
Failed loops: 9
-----------------------------

Identified 71851 existing game records.
Stored 'next_link' (str)
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&category1=998&supportedlang=english&page=3431
-----------------------------
Loop complete at 2024-03-13 19:15:38.849740
Games scraped this loop: 10
Time spent on this loop: 3m 2s
Games scra

UserWarning: No games scraped from search results. Cannot continue.

This exception printed from the wrapper: No games scraped from search results. Cannot continue.
Successive failed iterations: 1
Pausing for 2 minutes...
OK, GO!
-----------------------------
Loop complete at 2024-03-13 19:20:15.578339
Games scraped this loop: 0
Time spent on this loop: 4m 36s
Games scraped so far: 5654
Time spent so far: 4m
Current record count/target: 71861/100000
Successful loops: 570
Failed loops: 10
-----------------------------

Identified 71861 existing game records.
Stored 'next_link' (str)
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&category1=998&supportedlang=english&page=3432
-----------------------------
Loop complete at 2024-03-13 19:23:35.089206
Games scraped this loop: 10
Time spent on this loop: 3m 19s
Games sc

AttributeError: 'bool' object has no attribute 'timeout'

This exception printed from the wrapper: 'bool' object has no attribute 'timeout'
Successive failed iterations: 1
Pausing for 2 minutes...
OK, GO!
-----------------------------
Loop complete at 2024-03-13 19:28:13.425047
Games scraped this loop: 0
Time spent on this loop: 4m 38s
Games scraped so far: 5664
Time spent so far: 4m
Current record count/target: 71871/100000
Successful loops: 571
Failed loops: 11
-----------------------------

Identified 71871 existing game records.
Stored 'next_link' (str)
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&category1=998&supportedlang=english&page=3434
-----------------------------
Loop complete at 2024-03-13 19:31:17.461134
Games scraped this loop: 10
Time spent on this loop: 3m 4s
Games scraped so far: 5