Steam Scraper Wrapper
---
This notebook calls the scraper notebook again and again until a certain target number of records is stored on disk.

We need over 20,000 records for our analysis.

As each record requires quite a few URL calls with manual delays in between, this scraping is quite slow. The only way to finish this in our lifetime is to run the notebook at all times.

The scraper notebook itself is vulnerable to network errors that are out of our control, and this blasted Windows machine just decides to restart itself from time to time (known issue, no fix; "feature not a bug", etc). Because of this, we can't just tell the scraper to scrape a bunch of games and then go on about our business.

This notebook aims to provide the desired set-it-and-forget-it functionality by calling the scraper notebook in a 'while' loop with a try/except block that logs process & errors, then builds in a delay after a failed scraping attempt. The loop is written to try f_o_r_e_v_e_r until the desired number of records are scraped, but since I live here, I'll check it periodically to make sure it isn't broken.

Each loop only attempts to scrape a certain number of games (as defined by {interval}), merges the results into the existing main file, then begins the loop again until {target_records} is reached.

Because the game records are added to the main file within each iteration, the maximum number of partially-scraped records that can be lost due to a crash is limited by the {interval} variable.

In [1]:
import pandas as pd
import time
import datetime

In [2]:
with open('../data/raw/0 - Scraped Games DF.pkl', 'rb') as file:
    existing_records = pd.read_pickle(file)

starting_records = len(existing_records)
print(starting_records)

78935


In [3]:
# This notebook will run until this many records exist in the data/raw directory.
target_records = 100000
current_records = starting_records

# This variable is fed to the other notebook to determine how many games should be scraped
# per notebook run.
interval = 10
%store interval

Stored 'interval' (int)


In [4]:
# Set up your tracker variables.
successful_iterations = 0
failed_iterations = 0
successive_failed_iterations = 0
start_time = time.time()

# Loop over the scraper notebook until {target_records} exist in the directory.
# Each call of the notebook aims to add {interval} records.
while current_records < target_records :
    loop_start_time = time.time()
    try :
        %run "0.0-jod-steam-scraper.ipynb"
        successful_iterations += 1
        successive_failed_iterations = 0
    except Exception as e :
        print(f"This exception printed from the wrapper: {e}")
        failed_iterations += 1
        successive_failed_iterations += 1
        print(f"Successive failed iterations: {successive_failed_iterations}")
        print("Pausing for 2 minutes...")
        time.sleep(120)
        print("OK, GO!")

    # The scraped records are added to the main file from inside the other notebook.
    # Thus, we have to read it here to see what our current total count is.
    # (For a variety of reasons, each notebook run might not scrape exactly the number
    # of games that it was supposed to.)
    with open('../data/raw/0 - Scraped Games DF.pkl', 'rb') as file:
        existing_records = pd.read_pickle(file)
    new_record_count = len(existing_records)

    # Final record.
    loop_end_time = time.time()
    loop_final_time = loop_end_time - loop_start_time
    total_time_so_far = loop_end_time - start_time
    print("-----------------------------")
    print(f"Loop complete at {datetime.now()}")
    print(f"Games scraped this loop: {new_record_count - current_records}")
    print(f"Time spent on this loop: {int(loop_final_time/60)}m {int(loop_final_time%60)}s")
    print(f"Games scraped so far: {new_record_count - starting_records}")
    print(f"Time spent so far: {int((loop_end_time-start_time)/60)}m")
    print(f"Current record count/target: {new_record_count}/{target_records}")
    print(f"Successful loops: {successful_iterations}")
    print(f"Failed loops: {failed_iterations}")
    print("-----------------------------")
    print("")
    current_records = new_record_count

finish_time = time.time()
total_runtime = finish_time - start_time

hours = int(total_runtime / (60**2))
minutes = int(total_runtime / 60)
seconds = int(total_runtime % 60)

# print(f"{hours}h, {minutes}m, {seconds}s")
print("******************************************************")
print("******************************************************")
print(f"Completed. Started at {starting_records}, added {current_records-starting_records}, ended at {current_records}.")
print(f"{successful_iterations} successful iterations, {failed_iterations} failed iterations.")
print(f"Total runtime: {hours}h {minutes}m {seconds}s")
print("******************************************************")
print("******************************************************")

Identified 78935 existing game records.
Stored 'next_link' (str)
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=4118
-----------------------------
Loop complete at 2024-03-19 21:01:14.928056
Games scraped this loop: 10
Time spent on this loop: 2m 43s
Games scraped so far: 10
Time spent so far: 2m
Current record count/target: 78945/100000
Successful loops: 1
Failed loops: 0
-----------------------------

Identified 78945 existing game records.
Stored 'next_link' (str)
Stored 'next_link' (str)
Stored 'next_link' (str)
Stored 'next_link' (str)
Stored 'next_link' (str)
Stored 'next_link' (str)
Stored 'next_link' (str)
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 

  scraped_search_results_df = pd.DataFrame(games)


Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=4172
-----------------------------
Loop complete at 2024-03-19 23:25:27.679946
Games scraped this loop: 10
Time spent on this loop: 3m 54s
Games scraped so far: 496
Time spent so far: 3m
Current record count/target: 79431/100000
Successful loops: 51
Failed loops: 0
-----------------------------

Identified 79431 existing game records.
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=4172
-----------------------------
Loop complete at 2024-03-19 23:28:47.355111
Games scraped this loop: 10
Time spent on th

  scraped_search_results_df = pd.DataFrame(games)


Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=4187
-----------------------------
Loop complete at 2024-03-20 01:16:45.105740
Games scraped this loop: 10
Time spent on this loop: 2m 17s
Games scraped so far: 902
Time spent so far: 2m
Current record count/target: 79837/100000
Successful loops: 92
Failed loops: 0
-----------------------------

Identified 79837 existing game records.
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...


  scraped_search_results_df = pd.DataFrame(games)


Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=4187
-----------------------------
Loop complete at 2024-03-20 01:19:30.495844
Games scraped this loop: 10
Time spent on this loop: 2m 45s
Games scraped so far: 912
Time spent so far: 2m
Current record count/target: 79847/100000
Successful loops: 93
Failed loops: 0
-----------------------------

Identified 79847 existing game records.
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=4187
-----------------------------
Loop complete at 2024-03-20 01:22:16.998683
Games scraped this loop: 10
Time spent on th

  scraped_search_results_df = pd.DataFrame(games)


Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=4201
-----------------------------
Loop complete at 2024-03-20 02:30:17.749762
Games scraped this loop: 10
Time spent on this loop: 2m 41s
Games scraped so far: 1192
Time spent so far: 2m
Current record count/target: 80127/100000
Successful loops: 121
Failed loops: 0
-----------------------------

Identified 80127 existing game records.
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=4201
-----------------------------
Loop complete at 2024-03-20 02:33:15.164366
Games scraped this loop: 10
Time spent on 

  scraped_search_results_df = pd.DataFrame(games)


Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=4202
-----------------------------
Loop complete at 2024-03-20 02:36:11.440048
Games scraped this loop: 10
Time spent on this loop: 2m 56s
Games scraped so far: 1212
Time spent so far: 2m
Current record count/target: 80147/100000
Successful loops: 123
Failed loops: 0
-----------------------------

Identified 80147 existing game records.
Stored 'next_link' (str)
Stored 'next_link' (str)
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=4204
-----------------------------
Loop complete at 2024-03-20 02:39:09

  scraped_search_results_df = pd.DataFrame(games)


Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=4209
-----------------------------
Loop complete at 2024-03-20 02:55:38.831158
Games scraped this loop: 10
Time spent on this loop: 2m 33s
Games scraped so far: 1282
Time spent so far: 2m
Current record count/target: 80217/100000
Successful loops: 130
Failed loops: 0
-----------------------------

Identified 80217 existing game records.
Stored 'next_link' (str)
Stored 'next_link' (str)
Stored 'next_link' (str)
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=4212
-----------------------------
Loop comple

  scraped_search_results_df = pd.DataFrame(games)


Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=4228
-----------------------------
Loop complete at 2024-03-20 03:47:24.639312
Games scraped this loop: 10
Time spent on this loop: 2m 42s
Games scraped so far: 1482
Time spent so far: 2m
Current record count/target: 80417/100000
Successful loops: 150
Failed loops: 0
-----------------------------

Identified 80417 existing game records.
Stored 'next_link' (str)
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=4229
-----------------------------
Loop complete at 2024-03-20 03:50:18.889270
Games scraped thi

  scraped_search_results_df = pd.DataFrame(games)


Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=4236
-----------------------------
Loop complete at 2024-03-20 04:08:39.780235
Games scraped this loop: 10
Time spent on this loop: 2m 25s
Games scraped so far: 1562
Time spent so far: 2m
Current record count/target: 80497/100000
Successful loops: 158
Failed loops: 0
-----------------------------

Identified 80497 existing game records.
Stored 'next_link' (str)
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=4237
-----------------------------
Loop complete at 2024-03-20 04:10:56.732129
Games scraped thi

  scraped_search_results_df = pd.DataFrame(games)


Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=4267
-----------------------------
Loop complete at 2024-03-20 12:18:39.809898
Games scraped this loop: 10
Time spent on this loop: 2m 55s
Games scraped so far: 3201
Time spent so far: 2m
Current record count/target: 82136/100000
Successful loops: 322
Failed loops: 0
-----------------------------

Identified 82136 existing game records.
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=4267
-----------------------------
Loop complete at 2024-03-20 12:21:28.139235
Games scraped this loop: 10
Time spent on 

  scraped_search_results_df = pd.DataFrame(games)


Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=4269
-----------------------------
Loop complete at 2024-03-20 12:48:42.293368
Games scraped this loop: 10
Time spent on this loop: 2m 53s
Games scraped so far: 3301
Time spent so far: 2m
Current record count/target: 82236/100000
Successful loops: 332
Failed loops: 0
-----------------------------

Identified 82236 existing game records.

KeyboardInterrupt: 

KeyboardInterrupt: 

In [None]:
# LINK RESETTER

next_link = 'https://store.steampowered.com/search/?sort_by=&sort_order=0&page=3665'

%store next_link

Stored 'next_link' (str)
