Steam Scraper Wrapper
---
This notebook calls the scraper notebook again and again until a certain target number of records is stored on disk.

We need over 20,000 records for our analysis.

As each record requires quite a few URL calls with manual delays in between, this scraping is quite slow. The only way to finish this in our lifetime is to run the notebook at all times.

The scraper notebook itself is vulnerable to network errors that are out of our control, and this blasted Windows machine just decides to restart itself from time to time (known issue, no fix; "feature not a bug", etc). Because of this, we can't just tell the scraper to scrape a bunch of games and then go on about our business.

This notebook aims to provide the desired set-it-and-forget-it functionality by calling the scraper notebook in a 'while' loop with a try/except block that logs process & errors, then builds in a delay after a failed scraping attempt. The loop is written to try f_o_r_e_v_e_r until the desired number of records are scraped, but since I live here, I'll check it periodically to make sure it isn't broken.

Each loop only attempts to scrape a certain number of games (as defined by {interval}), merges the results into the existing main file, then begins the loop again until {target_records} is reached.

Because the game records are added to the main file within each iteration, the maximum number of partially-scraped records that can be lost due to a crash is limited by the {interval} variable.

In [1]:
import pandas as pd
import time
import datetime

In [2]:
with open('../data/raw/0 - Scraped Games DF.pkl', 'rb') as file:
    existing_records = pd.read_pickle(file)

starting_records = len(existing_records)
print(starting_records)

95916


In [3]:
# This notebook will run until this many records exist in the data/raw directory.
target_records = 120000
current_records = starting_records

# This variable is fed to the other notebook to determine how many games should be scraped
# per notebook run.
interval = 10
%store interval

Stored 'interval' (int)


In [4]:
# Set up your tracker variables.
successful_iterations = 0
failed_iterations = 0
successive_failed_iterations = 0
start_time = time.time()

# Loop over the scraper notebook until {target_records} exist in the directory.
# Each call of the notebook aims to add {interval} records.
while current_records < target_records :
    loop_start_time = time.time()
    try :
        %run "0.0-jod-steam-scraper.ipynb"
        successful_iterations += 1
        successive_failed_iterations = 0
    except Exception as e :
        print(f"This exception printed from the wrapper: {e}")
        failed_iterations += 1
        successive_failed_iterations += 1
        print(f"Successive failed iterations: {successive_failed_iterations}")
        print("Pausing for 2 minutes...")
        time.sleep(120)
        print("OK, GO!")

    # The scraped records are added to the main file from inside the other notebook.
    # Thus, we have to read it here to see what our current total count is.
    # (For a variety of reasons, each notebook run might not scrape exactly the number
    # of games that it was supposed to.)
    with open('../data/raw/0 - Scraped Games DF.pkl', 'rb') as file:
        existing_records = pd.read_pickle(file)
    new_record_count = len(existing_records)

    # Final record.
    loop_end_time = time.time()
    loop_final_time = loop_end_time - loop_start_time
    total_time_so_far = loop_end_time - start_time
    print("-----------------------------")
    print(f"Loop complete at {datetime.now()}")
    print(f"Games scraped this loop: {new_record_count - current_records}")
    print(f"Time spent on this loop: {int(loop_final_time/60)}m {int(loop_final_time%60)}s")
    print(f"Games scraped so far: {new_record_count - starting_records}")
    print(f"Time spent so far: {int((loop_end_time-start_time)/60)}m")
    print(f"Current record count/target: {new_record_count}/{target_records}")
    print(f"Successful loops: {successful_iterations}")
    print(f"Failed loops: {failed_iterations}")
    print("-----------------------------")
    print("")
    current_records = new_record_count

finish_time = time.time()
total_runtime = finish_time - start_time

hours = int(total_runtime / (60**2))
minutes = int(total_runtime / 60)
seconds = int(total_runtime % 60)

# print(f"{hours}h, {minutes}m, {seconds}s")
print("******************************************************")
print("******************************************************")
print(f"Completed. Started at {starting_records}, added {current_records-starting_records}, ended at {current_records}.")
print(f"{successful_iterations} successful iterations, {failed_iterations} failed iterations.")
print(f"Total runtime: {hours}h {minutes}m {seconds}s")
print("******************************************************")
print("******************************************************")

Identified 95916 existing game records.
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=5588
-----------------------------
Loop complete at 2024-03-24 18:58:18.630402
Games scraped this loop: 10
Time spent on this loop: 4m 0s
Games scraped so far: 10
Time spent so far: 3m
Current record count/target: 95926/120000
Successful loops: 1
Failed loops: 0
-----------------------------

Identified 95926 existing game records.
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=e

  scraped_search_results_df = pd.DataFrame(games)


Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=5606
-----------------------------
Loop complete at 2024-03-24 19:31:10.619135
Games scraped this loop: 10
Time spent on this loop: 2m 24s
Games scraped so far: 120
Time spent so far: 2m
Current record count/target: 96036/120000
Successful loops: 12
Failed loops: 0
-----------------------------

Identified 96036 existing game records.
Stored 'next_link' (str)
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=5607
-----------------------------
Loop complete at 2024-03-24 19:33:47.849894
Games scraped this 

  scraped_search_results_df = pd.DataFrame(games)


Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=5612
-----------------------------
Loop complete at 2024-03-24 19:40:37.125733
Games scraped this loop: 10
Time spent on this loop: 3m 37s
Games scraped so far: 150
Time spent so far: 3m
Current record count/target: 96066/120000
Successful loops: 15
Failed loops: 0
-----------------------------

Identified 96066 existing game records.
Stored 'next_link' (str)
Stored 'next_link' (str)
Stored 'next_link' (str)
Stored 'next_link' (str)
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=5616
------------------

  scraped_search_results_df = pd.DataFrame(games)


Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=5621
-----------------------------
Loop complete at 2024-03-24 19:50:32.732771
Games scraped this loop: 10
Time spent on this loop: 3m 20s
Games scraped so far: 180
Time spent so far: 3m
Current record count/target: 96096/120000
Successful loops: 18
Failed loops: 0
-----------------------------

Identified 96096 existing game records.
Stored 'next_link' (str)
Stored 'next_link' (str)
Stored 'next_link' (str)
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=5624
-----------------------------
Loop complete

  scraped_search_results_df = pd.DataFrame(games)


Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=5672
-----------------------------
Loop complete at 2024-03-24 21:17:51.405902
Games scraped this loop: 10
Time spent on this loop: 2m 37s
Games scraped so far: 470
Time spent so far: 2m
Current record count/target: 96386/120000
Successful loops: 47
Failed loops: 0
-----------------------------

Identified 96386 existing game records.
Stored 'next_link' (str)
Stored 'next_link' (str)
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=5674
-----------------------------
Loop complete at 2024-03-24 21:20:54.9

  scraped_search_results_df = pd.DataFrame(games)


Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=5680
-----------------------------
Loop complete at 2024-03-24 21:29:41.249729
Games scraped this loop: 10
Time spent on this loop: 2m 50s
Games scraped so far: 510
Time spent so far: 2m
Current record count/target: 96426/120000
Successful loops: 51
Failed loops: 0
-----------------------------

Identified 96426 existing game records.
Stored 'next_link' (str)
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=5681
-----------------------------
Loop complete at 2024-03-24 21:32:44.792655
Games scraped this 

  scraped_search_results_df = pd.DataFrame(games)


Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=5681
-----------------------------
Loop complete at 2024-03-24 21:38:43.816811
Games scraped this loop: 10
Time spent on this loop: 2m 48s
Games scraped so far: 540
Time spent so far: 2m
Current record count/target: 96456/120000
Successful loops: 54
Failed loops: 0
-----------------------------

Identified 96456 existing game records.
Stored 'next_link' (str)
Stored 'next_link' (str)
Stored 'next_link' (str)
Stored 'next_link' (str)
Stored 'next_link' (str)
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&pag

  scraped_search_results_df = pd.DataFrame(games)


Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=5692
-----------------------------
Loop complete at 2024-03-24 21:50:26.266077
Games scraped this loop: 10
Time spent on this loop: 2m 41s
Games scraped so far: 580
Time spent so far: 2m
Current record count/target: 96496/120000
Successful loops: 58
Failed loops: 0
-----------------------------

Identified 96496 existing game records.
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=5692
-----------------------------
Loop complete at 2024-03-24 21:53:16.927369
Games scraped this loop: 10
Time spent on th

  scraped_search_results_df = pd.DataFrame(games)


Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=5723
-----------------------------
Loop complete at 2024-03-24 22:39:38.926497
Games scraped this loop: 10
Time spent on this loop: 1m 53s
Games scraped so far: 750
Time spent so far: 1m
Current record count/target: 96666/120000
Successful loops: 75
Failed loops: 0
-----------------------------

Identified 96666 existing game records.
Stored 'next_link' (str)
Stored 'next_link' (str)
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=5725
-----------------------------
Loop complete at 2024-03-24 22:42:25.8

  scraped_search_results_df = pd.DataFrame(games)


Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=5729
-----------------------------
Loop complete at 2024-03-24 22:52:10.165564
Games scraped this loop: 10
Time spent on this loop: 1m 57s
Games scraped so far: 800
Time spent so far: 1m
Current record count/target: 96716/120000
Successful loops: 80
Failed loops: 0
-----------------------------

Identified 96716 existing game records.
Stored 'next_link' (str)
Stored 'next_link' (str)
Stored 'next_link' (str)
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=5732
-----------------------------
Loop complete

  scraped_search_results_df = pd.DataFrame(games)


Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=5868
-----------------------------
Loop complete at 2024-03-25 02:16:43.839056
Games scraped this loop: 10
Time spent on this loop: 2m 36s
Games scraped so far: 1480
Time spent so far: 2m
Current record count/target: 97396/120000
Successful loops: 148
Failed loops: 0
-----------------------------

Identified 97396 existing game records.
Stored 'next_link' (str)
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=5869
-----------------------------
Loop complete at 2024-03-25 02:20:06.066662
Games scraped thi

  scraped_search_results_df = pd.DataFrame(games)


Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=6014
-----------------------------
Loop complete at 2024-03-25 05:53:25.961579
Games scraped this loop: 10
Time spent on this loop: 2m 20s
Games scraped so far: 2234
Time spent so far: 2m
Current record count/target: 98150/120000
Successful loops: 224
Failed loops: 0
-----------------------------

Identified 98150 existing game records.
Stored 'next_link' (str)
Stored 'next_link' (str)
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=6016
-----------------------------
Loop complete at 2024-03-25 05:56:53

  scraped_search_results_df = pd.DataFrame(games)


Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=6235
-----------------------------
Loop complete at 2024-03-25 10:08:06.860933
Games scraped this loop: 10
Time spent on this loop: 3m 24s
Games scraped so far: 3129
Time spent so far: 3m
Current record count/target: 99045/120000
Successful loops: 314
Failed loops: 0
-----------------------------

Identified 99045 existing game records.
Stored 'next_link' (str)
Stored 'next_link' (str)
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=6237
-----------------------------
Loop complete at 2024-03-25 10:11:02

  scraped_search_results_df = pd.DataFrame(games)


Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=6242
-----------------------------
Loop complete at 2024-03-25 10:25:17.211979
Games scraped this loop: 10
Time spent on this loop: 2m 27s
Games scraped so far: 3189
Time spent so far: 2m
Current record count/target: 99105/120000
Successful loops: 320
Failed loops: 0
-----------------------------

Identified 99105 existing game records.
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=6242
-----------------------------
Loop complete at 2024-03-25 10:28:55.780547
Games scraped this loop: 10
Time spent on 

  scraped_search_results_df = pd.DataFrame(games)


Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=6250
-----------------------------
Loop complete at 2024-03-25 11:08:28.541339
Games scraped this loop: 10
Time spent on this loop: 2m 45s
Games scraped so far: 3329
Time spent so far: 2m
Current record count/target: 99245/120000
Successful loops: 334
Failed loops: 0
-----------------------------

Identified 99245 existing game records.
Stored 'next_link' (str)
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=6251
-----------------------------
Loop complete at 2024-03-25 11:11:06.336018
Games scraped thi

  scraped_search_results_df = pd.DataFrame(games)


Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=6252
-----------------------------
Loop complete at 2024-03-25 11:22:46.381938
Games scraped this loop: 10
Time spent on this loop: 2m 53s
Games scraped so far: 3379
Time spent so far: 2m
Current record count/target: 99295/120000
Successful loops: 339
Failed loops: 0
-----------------------------

Identified 99295 existing game records.
Stored 'next_link' (str)
Stored 'next_link' (str)
Stored 'next_link' (str)
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=6255
-----------------------------
Loop comple

  scraped_search_results_df = pd.DataFrame(games)


Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=6307
-----------------------------
Loop complete at 2024-03-25 14:44:22.134481
Games scraped this loop: 10
Time spent on this loop: 2m 58s
Games scraped so far: 4029
Time spent so far: 2m
Current record count/target: 99945/120000
Successful loops: 404
Failed loops: 0
-----------------------------

Identified 99945 existing game records.
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=6307
-----------------------------
Loop complete at 2024-03-25 14:47:59.314861
Games scraped this loop: 10
Time spent on 

  scraped_search_results_df = pd.DataFrame(games)


Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=6366
-----------------------------
Loop complete at 2024-03-25 17:11:27.670173
Games scraped this loop: 10
Time spent on this loop: 3m 7s
Games scraped so far: 4488
Time spent so far: 3m
Current record count/target: 100404/120000
Successful loops: 450
Failed loops: 0
-----------------------------

Identified 100404 existing game records.
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=6366
-----------------------------
Loop complete at 2024-03-25 17:14:41.578308
Games scraped this loop: 10
Time spent on

  scraped_search_results_df = pd.DataFrame(games)


Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=6443
-----------------------------
Loop complete at 2024-03-25 19:41:34.671908
Games scraped this loop: 10
Time spent on this loop: 2m 56s
Games scraped so far: 4948
Time spent so far: 2m
Current record count/target: 100864/120000
Successful loops: 496
Failed loops: 0
-----------------------------

Identified 100864 existing game records.
Stored 'next_link' (str)
Stored 'next_link' (str)
Stored 'next_link' (str)
Stored 'next_link' (str)
10 games scraped from search page.
Scraping individual game page data...
Scraped 10 game pages.
Scraping comment counts. This might take a while...
Scraped 10 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&supportedlang=english&page=6446
-----------------------------
Loop comp

KeyboardInterrupt: 

KeyboardInterrupt: 

In [None]:
# LINK RESETTER

# next_link = 'https://store.steampowered.com/search/?sort_by=&sort_order=0&page=3665'

# %store next_link

Stored 'next_link' (str)
