Steam Scraper Wrapper
---
This notebook runs the scraper notebook repeatedly until a target number of records is stored on disk.

We need over 20,000 records for our analysis.

Each record requires several URL requests with deliberate delays in between, so scraping is slow. The only way to finish this in our lifetime is to keep the notebook running at all times.

The scraper notebook itself is vulnerable to network errors outside our control, and this blasted Windows machine just decides to restart itself from time to time (known issue, no fix; "feature not a bug", etc.). Because of this, we can't just tell the scraper to scrape a bunch of games and then go about our business.

This notebook aims to provide the desired set-it-and-forget-it functionality by calling the scraper notebook in a 'while' loop with a try/except block that logs progress and errors, then pauses after a failed scraping attempt. The loop is written to try f_o_r_e_v_e_r until the desired number of records is scraped, but since I live here, I'll check it periodically to make sure it isn't broken.

Each iteration attempts to scrape only a fixed number of games (defined by {interval}) and merges the results into the existing main file. The loop then repeats until {target_records} is reached.

Because the game records are added to the main file within each iteration, the maximum number of partially-scraped records that can be lost due to a crash is limited by the {interval} variable.
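The control flow described above can be sketched in plain Python, outside the notebook machinery. This is only an illustration under stated assumptions: `run_until_target`, `scrape_batch`, `count_records`, and the simulated `fake_batch` are hypothetical stand-ins for running the scraper notebook and re-reading the main file, not part of the actual project.

```python
import time


def run_until_target(scrape_batch, count_records, target,
                     pause_seconds=120, sleep=time.sleep):
    """Call scrape_batch() repeatedly until count_records() reaches target.

    scrape_batch and count_records are hypothetical stand-ins for running
    the scraper notebook and re-reading the main file, respectively.
    """
    successes = failures = 0
    current = count_records()
    while current < target:
        try:
            scrape_batch()        # adds up to one interval of records, or raises
            successes += 1
        except Exception as e:
            failures += 1
            print(f"Batch failed: {e}")
            sleep(pause_seconds)  # wait before the next attempt
        # Re-read the count: a batch may add fewer records than requested.
        current = count_records()
    return successes, failures, current


# Simulated scraper (assumption): every third call fails, the rest add 10 records.
store = {"records": 0, "calls": 0}


def fake_batch():
    store["calls"] += 1
    if store["calls"] % 3 == 0:
        raise ConnectionError("simulated network error")
    store["records"] += 10


result = run_until_target(fake_batch, lambda: store["records"], target=50,
                          sleep=lambda s: None)  # skip real sleeping in the demo
print(result)
```

Because the count is re-read after every attempt, a failed batch costs at most one interval's worth of work before the loop tries again.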

In [None]:
import pandas as pd
import time
from datetime import datetime  # the loop summary calls datetime.now()

In [None]:
with open('../data/raw/0 - Scraped Games DF.pkl', 'rb') as file:
    existing_records = pd.read_pickle(file)

starting_records = len(existing_records)
print(starting_records)

In [None]:
# This notebook will run until the main file in data/raw holds this many records.
target_records = 25000
current_records = starting_records

# This variable is fed to the other notebook to determine how many games should be scraped
# per notebook run.
interval = 10
%store interval

In [None]:
# Set up your tracker variables.
successful_iterations = 0
failed_iterations = 0
successive_failed_iterations = 0
start_time = time.time()

# Loop over the scraper notebook until {target_records} exist in the directory.
# Each call of the notebook aims to add {interval} records.
while current_records < target_records:
    loop_start_time = time.time()
    try:
        %run "0.0-jod-steam-scraper.ipynb"
        successful_iterations += 1
        successive_failed_iterations = 0
    except Exception as e:
        print(f"This exception printed from the wrapper: {e}")
        failed_iterations += 1
        successive_failed_iterations += 1
        print(f"Successive failed iterations: {successive_failed_iterations}")
        print("Pausing for 2 minutes...")
        time.sleep(120)
        print("OK, GO!")

    # The scraped records are added to the main file from inside the other notebook.
    # Thus, we have to read it here to see what our current total count is.
    # (For a variety of reasons, each notebook run might not scrape exactly the number
    # of games that it was supposed to.)
    with open('../data/raw/0 - Scraped Games DF.pkl', 'rb') as file:
        existing_records = pd.read_pickle(file)
    new_record_count = len(existing_records)

    # Log a summary of this iteration.
    now = time.time()
    loop_final_time = now - loop_start_time
    print("-----------------------------")
    print(f"Loop complete at {datetime.now()}")
    print(f"Successfully scraped this loop: {new_record_count - current_records}")
    print(f"Current record count/target: {new_record_count}/{target_records}")
    print(f"Time spent on this loop: {int(loop_final_time/60)}min")
    print(f"Total time spent so far: {int((now-start_time)/60)}min")
    print(f"Successful loops: {successful_iterations}")
    print(f"Failed loops: {failed_iterations}")
    print("-----------------------------")
    print("")
    current_records = new_record_count

finish_time = time.time()
total_runtime = finish_time - start_time

# Break the total runtime into hours, leftover minutes, and leftover seconds.
hours = int(total_runtime // (60**2))
minutes = int((total_runtime % (60**2)) // 60)
seconds = int(total_runtime % 60)

# print(f"{hours}h, {minutes}m, {seconds}s")
print("******************************************************")
print("******************************************************")
print(f"Completed. Started at {starting_records}, added {current_records-starting_records}, ended at {current_records}.")
print(f"{successful_iterations} successful iterations, {failed_iterations} failed iterations.")
print(f"Total runtime: {hours}h {minutes}m {seconds}s")
print("******************************************************")
print("******************************************************")
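The hour/minute/second breakdown used in the final summary could also be factored into a small helper. This is a minimal sketch using the built-in `divmod`; the name `split_runtime` is hypothetical and not part of the notebook.

```python
def split_runtime(total_seconds):
    """Split a duration in seconds into whole (hours, minutes, seconds)."""
    hours, remainder = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    return hours, minutes, seconds


hours, minutes, seconds = split_runtime(3725.4)
print(f"Total runtime: {hours}h {minutes}m {seconds}s")  # Total runtime: 1h 2m 5s
```

Taking the minutes from the remainder after removing whole hours keeps the minutes value below 60, so the three parts add back up to the total.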