Scrape!
---

This notebook scrapes the following info for a specific number of games from the Steam store and adds it to our existing database of game records, currently saved in ../data/raw/0 - Scraped Games DF.pkl:

------

'app_id' <-- The unique ID assigned to each game by the Steam store (int)

'title' <-- Game's title

'release_date' <-- Game's release date

'positive_review_percent' <-- Self-explanatory

'number_of_reviews' <-- Number of reviews in ALL languages

'price' <-- In cents (so a $10.00 game will be 1000.0)

'game_page_link' <-- Link to that game's Steam store page

'tags' <-- Numeric representation of the top 7 (max) descriptive tags assigned by users of the Steam store

'tag_list' <-- List of strings holding the top 20 (max) descriptive tags assigned by users of the Steam store

'date_scraped' <-- Date that this info was added to the db

'developer' <-- Game's developer(s)

'publisher' <-- Game's publisher(s)

'description' <-- The little blurb

'interface_languages' <-- Languages with full interface support

'full_audio_languages' <-- Languages with full audio support

'subtitles_languages' <-- Languages with full subtitle support

'english' <-- This and all following columns contain the number of comments in that language

'schinese', 'tchinese', 'japanese', 'koreana', 'thai', 'bulgarian', 'czech', 'danish', 'german', 'spanish', 'latam', 'greek', 'french', 'italian', 'indonesian', 'hungarian', 'dutch', 'norwegian', 'polish', 'brazilian', 'romanian', 'russian', 'finnish', 'swedish', 'turkish', 'vietnamese'

-------
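
As an illustration of the 'price' convention above, here is a minimal sketch of turning a displayed US-dollar price into integer cents. The helper name `price_to_cents` and the sample strings (including the "Free" wording) are assumptions for illustration; the notebook itself does this conversion inline while scraping.

```python
def price_to_cents(raw_price):
    """Convert a display price such as '$1,299.99' to integer cents.

    Hypothetical helper: assumes a US-dollar string with exactly two
    decimal places, or a label that starts with 'Free'.
    """
    text = raw_price.strip()
    if text.lower().startswith("free"):
        return 0
    # Dropping '$', ',' and '.' leaves the digits of the amount in cents.
    digits = text.replace("$", "").replace(",", "").replace(".", "")
    return int(digits)


print(price_to_cents("$10.00"))     # 1000
print(price_to_cents("$1,299.99"))  # 129999
```

Note that this only holds for prices shown with two decimal places; a string like "$10" would be read as 10 cents.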

Other notebooks will further transform this data in preparation for modeling.

The number of games to scrape is in the second executable cell, and can be set on each run.

IMPORTANT NOTE: This notebook only functions if the Steam store is loading WITHOUT infinite scroll. Thus far, I have been able to make sure that it doesn't use infinite scroll by logging into my Steam account from a browser (Chrome) and unchecking the "Enable infinite scroll when searching" box under "Store Preferences." IF YOU HAVEN'T LOGGED IN IN A WHILE, THIS SETTING MAY REVERT! So to be sure, check your settings before each run. 
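
One way to catch the infinite-scroll problem early might be to check the first results page for pagination buttons before scraping. The sketch below rests on an assumption: that paginated pages contain `a` tags with the class `pagebtn` (the same tags this notebook uses later to find the next page) and infinite-scroll pages do not. The names `PageButtonCounter` and `has_page_buttons` and the sample HTML are hypothetical. It uses the standard library's `html.parser` so it runs without extra installs.

```python
from html.parser import HTMLParser


class PageButtonCounter(HTMLParser):
    """Counts <a> tags whose class list includes 'pagebtn'."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "pagebtn" in classes:
            self.count += 1


def has_page_buttons(html_text):
    """Return True if the page source contains at least one pagebtn link."""
    counter = PageButtonCounter()
    counter.feed(html_text)
    return counter.count > 0


# Hypothetical snippets standing in for real page source.
paginated = '<div><a class="pagebtn" href="?page=2">&gt;</a></div>'
scrolling = '<div id="search_results"></div>'
print(has_page_buttons(paginated))  # True
print(has_page_buttons(scrolling))  # False
```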

In [1]:
# Basic DS stuff
import numpy as np
import pandas as pd

# Trying not to get blocked while scraping by inserting
# random delays between GET requests.
import random
import time

# Web scraping
import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen

# I needed some extra help locating specific parts within a
# bs4 tag object, so I got this.
import re

# For labeling records, tracking files, and formatting
from datetime import datetime

# To help see if we have existing data or not.
import os

# For Rick
import pickle

start_time = time.time()

In [2]:
# This variable will determine how many games to scrape per notebook execution.
# I'm putting it all the way up here to make it easier to find & modify.
%store -r interval

games_to_scrape = interval

Step 1: Learn about the page
---

In [3]:
# NOTE: THIS CODE ONLY WORKS IF YOUR STEAM SETTINGS ARE SET TO PAGINATED
# SEARCH RESULTS, NOT INFINITE SCROLL.
# It works for me if I log into the Steam store on a browser (Chrome),
# go to "Store Preferences," then uncheck the "Enable infinite scroll
# when searching" box.
# That setting apparently applies to all requests made by this notebook
# as well.

# This url is for the "all products" search with the result type
# limited to "Games" (category1=998)
url = "https://store.steampowered.com/search/?category1=998"
html = urlopen(url)
current_page_soup = BeautifulSoup(html, 'lxml')

Step 2: Scrape the first set of data from the search results pages
---

In [4]:
# Now that we know what our soups will look like, we can write functions to do the scraping.
# The first function will scrape all the relevant data off of the current results page.
# The second function will programmatically switch to the next page of results.
# Later, we will run both functions within a loop in order to scrape all results data
# from all pages.

# This is only the first round of scraping. Later, we will scrape more data from each
# game's store page. Since that process is completely different, we will define new
# functions for it later, after this round of scraping is complete.

# Loop through the HTML blocks for each game and scrape the key info into a dictionary,
# then add the dictionaries to the list.
# I'm not cleaning up the data types at this point - I'm learning as I'm going, so I'm
# prioritizing getting all the info I need into the df, and then working with data
# types later either by doing operations on the df or re-writing some of this code.
def scrape_current_page(current_page_soup) :

    """
    This function takes the soup of a paginated Steam search results page (NOT infinite scroll)
    and scrapes the:
    
    title
    release_date
    positive_review_percent
    number_of_reviews
    price
    game_page_link
    tags
    app_id
    
    from every game on the page. It also records date_scraped. It puts these values into
    dictionaries and appends them to the list called "games". 
    """

    for listing in current_page_soup.find_all('a', class_='search_result_row ds_collapse_flag') :

        # Create (or clean out) an empty dictionary to hold the new info.
        game = {}

        # Check if we have the specified number of games yet.
        if len(games) == games_to_scrape :
            return

        # Listings on results pages can be one of two types - standalone games, or bundles.
        # We only want to work with standalone games.
        # Only apps have this tag in their listing.
        if listing.has_attr('data-ds-appid') :
            raw_app_id = listing.get('data-ds-appid')
            # To exclude bundles, we'll skip any listing with multiple app_ids.
            # Since the app_id is scraped as a string, lists of app_ids will have
            # commas separating them, and we can identify them by those commas.
            if "," in raw_app_id :
                continue
            
            app_id = int(raw_app_id)

            # Make sure we haven't already scraped this one.
            if app_id not in already_scraped_app_ids.values :
                
                game['app_id'] = app_id
                    
                # The title and release date seem to be at uniform locations in all listings.
                game['title'] = listing.find('span', class_='title').get_text()
                raw_date = listing.find('div', class_='col search_released responsive_secondrow').get_text()
                raw_date = raw_date.strip()
                try:
                    formatted_date = datetime.strptime(raw_date, '%b %d, %Y')
                    game['release_date'] = formatted_date
                except:
                    try: 
                        formatted_date = datetime.strptime(raw_date, '%b %Y')
                        game['release_date'] = formatted_date
                    except :
                        game['release_date'] = raw_date

                # Not all games have reviews listed, so we have to account for code blocks that omit this part.
                # I might eventually remove this part and scrape the review data from the individual game pages
                # instead, since it seems to be more complete there. This is just proof of concept for now.
                try:
                    review_string = re.split('>| of|the | user', listing.find('div', class_='col search_reviewscore responsive_secondrow') \
                                                                .find('span').get('data-tooltip-html'))
                    raw_review_percent = review_string[1][:-1]
                    float_review_percent = int(raw_review_percent) / 100
                    formatted_review_percent = round(float_review_percent, 2)
                    game['positive_review_percent'] = formatted_review_percent
                except:
                    game['positive_review_percent'] = np.nan
                
                try: 
                    review_string = re.split('>| of|the | user', listing.find('div', class_='col search_reviewscore responsive_secondrow') \
                                                .find('span').get('data-tooltip-html'))
                    raw_review_number = review_string[3].replace(',', '')
                    formatted_review_number = int(raw_review_number)
                    game['number_of_reviews'] = formatted_review_number
                except: 
                    game['number_of_reviews'] = np.nan
                
                # Same for price - many unreleased games do not have price info, so we have to skip them.
                # Some games have an original price and a discounted price listed, but for the time being
                # I've decided to only go by original prices, so I'll default to that and only return
                # a null value if no kind of price whatsoever is listed.
                try: 
                    raw_price = listing.find('div', class_="discount_original_price").get_text()
                    bare_price = raw_price.replace("$", "").replace(",", "").replace(".", "")
                    formatted_price = int(bare_price)
                    game['price'] = formatted_price
                except:
                    try:
                        raw_price = listing.find('div', class_="discount_final_price").get_text()
                        bare_price = raw_price.replace("$", "").replace(",", "").replace(".", "")
                        formatted_price = int(bare_price)
                        game['price'] = formatted_price
                    except:
                        try:
                            raw_price = listing.find('div', class_="discount_final_price free").get_text()
                            formatted_price = 0
                            game['price'] = formatted_price
                        except:
                            game['price'] = np.nan

                # Weirdly enough, not every game seems to have its own page.
                try:
                    game['game_page_link'] = listing.get('href')
                except:
                    game['game_page_link'] = 'Failed'

                # Now we grab the tags, which will be a major feature in our analysis.
                try :
                    raw_tags = listing.get('data-ds-tagids')
                    formatted_tags = raw_tags.strip('[]').split(',')
                    game['tags'] = formatted_tags
                except :
                    game['tags'] = 'Failed'

                # Add the current date as a reference for future generations (and versioning).
                todays_date = datetime.now()
                game['date_scraped'] = todays_date.strftime("%Y-%m-%d")

                # Now we add this dict to the list, rinse and repeat.
                games.append(game)
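
The review tooltip parsing above can be hard to follow, so here is a minimal sketch of the same idea using one regular expression instead of `re.split`. It assumes the tooltip text looks like the sample string below, which is a guess consistent with the split pattern used in the function; the helper name `parse_review_tooltip` is hypothetical and this is not the notebook's own implementation.

```python
import re

# Captures the percentage and the (comma-grouped) review count.
TOOLTIP_PATTERN = re.compile(r"(\d+)% of the ([\d,]+) user reviews")


def parse_review_tooltip(tooltip):
    """Return (positive_fraction, review_count), or (None, None) if no match."""
    match = TOOLTIP_PATTERN.search(tooltip or "")
    if match is None:
        return None, None
    fraction = int(match.group(1)) / 100
    count = int(match.group(2).replace(",", ""))
    return fraction, count


# Assumed tooltip format, for illustration only.
sample = "Very Positive<br>94% of the 12,345 user reviews for this game are positive."
print(parse_review_tooltip(sample))  # (0.94, 12345)
print(parse_review_tooltip(None))    # (None, None)
```

Returning both values from a single match avoids parsing the same tooltip twice, and a missing tooltip becomes an explicit `(None, None)` rather than an exception.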

In [5]:
# Now we create the function that determines if there is a next page of
# results, or if we're already at the last page.

def get_next_page_url(current_page_soup) :

        """
        This function takes the soup of a paginated Steam search results page (NOT infinite scroll)
        and determines whether it is the last page of results.

        If it is not the last page, the URL of the next page is stored in "next_link".

        If it is the last page, "next_link" will be set to False.
        """

        # First, we check to make sure there IS a next page. We can tell by looking
        # at the 'pagebtn' tags.
        pagebtn_tags = current_page_soup.find_all('a', class_='pagebtn')

        # This is the variable that we will use to store the next link, or set it to
        # False to let the loop know that we're done scraping.
        global next_link

        # If it is any of the middle pages, there will be two pagebtn tags.
        # The link we need is inside the pagebtn tag that displays the text '>'.

        # After a while this loop fails because we get an empty pagebtn tag, which
        # means that the page failed to load altogether. I will assume this is because
        # the request timed out, and build in an additional delay to deal with that
        # eventuality.
        loops = 0
        delay = 0

        while loops < 5 :
                try :
                        if len(pagebtn_tags) == 2 :
                                next_link = pagebtn_tags[1].get('href')

                        # If there is only one pagebtn tag, that means we're on the first page or the 
                        # last page. If it's the first page, then the pagebtn tag will contain the
                        # character '>'.
                        elif pagebtn_tags[0].get_text() == ">" :
                                next_link = pagebtn_tags[0].get('href')

                        # If neither of the above conditions are met, then we're on the last page and
                        # we can set "next_link" to False, triggering the loop to stop scraping.
                        else :
                                next_link = False

                        # We will use loops=100 as the indicator that the scrape was successful.
                        # This in no way means that 100 loops have actually happened.
                        # It's just an arbitrary value that can't (or shouldn't) occur naturally.
                        loops = 100

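                except :
                        # If the scrape was unsuccessful, we will try again up to 4 more
                        # times, with an increasingly long delay between each attempt.
                        delay += 10
                        print("Search page failure "+str(delay//10)+"/5...")
                        print(next_link)
                        time.sleep(delay)
                        loops += 1
                        # Re-request the current page (next_link still holds its URL at
                        # this point); otherwise each retry would just re-check the same
                        # empty list of pagebtn tags.
                        try :
                                retry_soup = BeautifulSoup(urlopen(next_link), 'lxml')
                                pagebtn_tags = retry_soup.find_all('a', class_='pagebtn')
                        except Exception :
                                pass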
        
        if loops != 100 :
                print('Failed to parse next_link on search results page:')
                print(next_link)
                next_link = False
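
The retry-with-growing-delay idea in the function above can be sketched in a more general form. This is a hedged illustration, not the notebook's implementation: the helper name `retry_with_growing_delay`, its parameters, and the stand-in `flaky` action are all hypothetical. It returns an explicit success flag instead of using a sentinel loop count, and takes the sleep function as a parameter so it can be tried out without actually waiting.

```python
import time


def retry_with_growing_delay(action, attempts=5, step_seconds=10, sleep=time.sleep):
    """Call action() until it succeeds or the attempts run out.

    Waits step_seconds after the first failure, 2 * step_seconds after the
    second, and so on. Returns (True, result) on success and (False, None)
    once every attempt has failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return True, action()
        except Exception as error:
            print(f"Attempt {attempt}/{attempts} failed: {error}")
            if attempt < attempts:
                sleep(step_seconds * attempt)
    return False, None


# Stand-in action that fails twice, then succeeds.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise IndexError("no pagebtn tags found")
    return "next-page-url"

waits = []  # records the requested delays instead of sleeping
print(retry_with_growing_delay(flaky, sleep=waits.append))  # (True, 'next-page-url')
print(waits)  # [10, 20]
```

For the retry to be useful, `action` has to redo the request itself, since re-checking an already-parsed page gives the same answer every time.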

In [6]:
# Now that we have our functions, we'll iterate over them to scrape the data.

# Set the first url to be processed to the first page of search results...
# OR to a couple pages before the last-scraped search results page from the previous
# notebook run.
try :
    %store -r next_link
except :
    next_link = url

# Create the list that will hold the dictionaries of game info & errors.
games = []

# Now we decide how many results we want. 
# 
# The main constraint here is time - since
# we don't want to get IP banned, we'll have to set a delay between each get request.
# This isn't so important for this loop, since we can get 25 games in one get request.
# However, later we'll be going through the games' pages one-by-one, and in some cases
# we'll have to do 10 different get requests per game to scrape language-specific data.
# Therefore, adding 1 game adds at least 11 get requests & delays to our process.
# (I ended up scraping for over 10 hours.)
#
# Search results come in pages of 25, but scrape_current_page stops as soon as
# games_to_scrape is reached, even in the middle of a page.
# If games_to_scrape is greater than the number of games in the search results, then
# the loop will automatically stop trying to scrape when it reaches the end of the
# final page of results, because get_next_page_url will set the next_link variable to False.

# Since we want to update this database from time to time, let's devise a way to scrape only
# game data that is not already in the pickle file that holds our master list. We'll do this
# by loading that file (if it exists) and pulling out only the app_ids to
# check against.

if os.path.exists('../data/raw/0 - Scraped Games DF.pkl') :
    with open('../data/raw/0 - Scraped Games DF.pkl', 'rb') as file:
        check_df = pickle.load(file)
        already_scraped_app_ids = check_df['app_id']
        print(f"Identified {len(already_scraped_app_ids)} existing game records.")
else :
    print('No scraped games detected. Scraping from scratch.')
    already_scraped_app_ids = pd.Series(dtype='int64')

# Now, loop. Keep scraping as long as our games list is shorter than the games_to_scrape var.
    
while len(games) < games_to_scrape :

    # Soup up the page in question.
    html = urlopen(next_link)
    current_page_soup = BeautifulSoup(html, 'lxml')
    
    # Scrape that page.
    scrape_current_page(current_page_soup)

    # We store the next_link variable after successful scrape but before updating
    # because the scrape_current_page function escapes when reaching the total
    # number of games_to_scrape, NOT specifically when all results on the page
    # are scraped.
    # Thus, if we want to start scraping again, we should start with the current
    # page and make sure there aren't any leftover games there before updating
    # next_link.
    %store next_link

    # Set "next_link" to the next URL we want to scrape.
    get_next_page_url(current_page_soup)

    # Include a random delay to prevent getting IP blocked.
    interval = 0.5 + random.random() * 0.3
    time.sleep(interval)

    if next_link == False :
        print('Fewer than '+str(games_to_scrape)+' games in the available search results.')
        if len(games) == 0 :
            raise UserWarning('No games scraped from search results. Cannot continue.')
        # Stop here: there is no next page, so requesting next_link again would fail.
        break

print(str(len(games))+' games scraped from search page.')

Identified 17503 existing game records.
Stored 'next_link' (str)
2 games scraped from search page.
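
The "skip what we already have" check used during the search-page scrape can also be sketched with a plain set, which keeps lookups fast as the database grows and catches repeats within the same run. This is an illustrative alternative with hypothetical names (`filter_new_app_ids`) and made-up IDs, not the code the notebook runs.

```python
def filter_new_app_ids(candidate_ids, already_scraped):
    """Return candidates not seen before, preserving order and skipping repeats."""
    seen = set(already_scraped)
    fresh = []
    for app_id in candidate_ids:
        if app_id not in seen:
            fresh.append(app_id)
            seen.add(app_id)
    return fresh


# Hypothetical IDs for illustration only.
print(filter_new_app_ids([10, 20, 30, 20, 40], already_scraped=[20, 99]))  # [10, 30, 40]
```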


In [7]:
# Frame it and check.
scraped_search_results_df = pd.DataFrame(games)

# This results in some duplicates - sometimes different versions of the game have the same app id.
# Because we're interested in the relative ratio of comment frequencies, not in the total number
# of games or total number of comments, we can safely drop duplicates even if they have different
# comments.
# Since we already excluded app_ids that are duplicates with our previously-scraped files, now we
# must remove duplicates that might exist within that last scrape.
scraped_search_results_df = scraped_search_results_df.drop_duplicates(subset='app_id', keep='first')
scraped_search_results_df = scraped_search_results_df.reset_index(drop=True)

# Save this as a json to safeguard against crashes - running this scraper takes hours.
scraped_search_results_df.to_json('../data/raw/temp/Scraped Search Results.json', orient='records')

Step 3: Scrape additional data for each game from its individual game page
---

In [8]:
# Now we're ready to use the URLs we just scraped to go through the pages
# one-by-one and scrape more data.

# We'll put all this data in a completely different df, then join them
# when we're done on app_id.
def scrape_game_page_data(current_page_soup, app_id) :

    """
    This function scrapes info from one individual game page (passed in as
    current_page_soup) for the given app_id. We put the info in a dict
    "game", then append it to "games_extend_list".
    
    Later, we will turn that list into another df and merge it with
    scraped_search_results_df on app_id.

    Scraped information is:

    app_id
    developer
    publisher
    description
    interface_languages
    full_audio_languages
    subtitles_languages
    tag_list
    english     <-- the number of user comments in English
    """
    # For bugfixing
    global touched_ids
    
    # Create/clear out the dictionary.
    game = {}

    game["app_id"] = app_id
    touched_ids.append(game['app_id'])

    # We can get the developer and publisher from the same code block.
    try :
        code_block = current_page_soup.find('div', attrs={'id':'appHeaderGridContainer'})
    except :
        pass

    # The developer name is at a fixed location.
    try:
        raw_string = code_block.find('div', class_='grid_content').get_text()
        # Don't know why it always brings in a newline at the beginning of the string and a
        # space at the end. Let's take those out.
        formatted_string = raw_string[1:-1]
        formatted_list = formatted_string.split(',')
        game['developer'] = formatted_list
    except :
        game['developer'] = None

    # The publisher name is also at a fixed location. Not every game has a publisher, though.
    try :
        raw_string = code_block.find('div', class_='grid_label', string='Publisher').find_next('a').get_text()
        formatted_list = raw_string.split(',')
        game['publisher'] = formatted_list
    except :
        game['publisher'] = None

    # Descriptions are at a fixed location.
    try:
        game['description'] = current_page_soup.find('meta', attrs={'name':'Description'}).get('content')
    except :
        game['description'] = 'Failed'

    # The languages are listed as rows of a table.
    # There are three different ways languages can be implemented in the game.
    # As we look through the table, we'll store the languages in separate lists.
    interface_languages = []
    full_audio_languages = []
    subtitles_languages = []
    language_types = [interface_languages, full_audio_languages, subtitles_languages]

    # The source code is complex so let's isolate the relevant block for safety.
    try :
        languages_code_block = current_page_soup.find('table', class_='game_language_options')
    # I'll leave a note for myself to help with bugfixing if needed.
    except :
        language_types[0] = 'Did not find code block'
        
    # Each "row" of the table is separated by a tr tag. However, there's an extra
    # tr tag at the beginning of languages_code_block that I couldn't find a better
    # way to work around - since it has no text, it'll throw an error on .get_text,
    # so we can just try/except our way out of it.
    try :
        for row in languages_code_block.find_all('tr', class_='') :
            try :
                current_language = row.find('td', class_='ellipsis').get_text()
                # The text has a lot of formatting in it. No more!
                current_language = re.sub('\t|\n|\r', '', current_language)

                # The code block represents each cell of the row with a td class='checkcol'
                # tag. In order, the three cells of each row are interface, full audio,
                # and subtitles. If the language of that row does not have one of those
                # services, then there will be no more code inside the tags. If it does,
                # then there will be a "span" tag in there along with a checkmark.

                # Since the three types of language services are always in order,
                # we can basically use 'counter' to iterate through the list of lists
                # of language service types and only append the name of the language
                # if that section of code has the "span" tag that indicates a checkmark.
                counter = 0
                for column in row.find_all('td', class_='checkcol') :
                    if column.find('span') :
                        language_types[counter].append(current_language)
                    counter += 1
            except :
                pass
    # For bugfixing.
    except :
        language_types[0] = 'Found code block, failed to parse within code block'

    # The full list of tags by name (instead of code) is in a different part of the page
    # and requires a new code block.
    tag_names_code_block = current_page_soup.find('div', attrs={'class':'glance_tags popular_tags'})
    tag_names_list = []
    try : 
        for tag_section in tag_names_code_block.find_all('a', class_='app_tag') :
            tag_name = tag_section.get_text().strip()
            tag_names_list.append(tag_name)
        game['tag_list'] = tag_names_list.copy()
    except :
        game['tag_list'] = tag_names_list.copy()

    # Now we add the lists to our dictionary. We can access the lists via
    # the index of the language_types list of lists.
    game['interface_languages'] = language_types[0]
    game['full_audio_languages'] = language_types[1]
    game['subtitles_languages'] = language_types[2]

    # I would love to have rating data available for the games, but Steam does not
    # present it systematically (probably because so many games are not rated,
    # and because there are different rating systems.)
    # Maybe someday.
    # game['rating'] = PG, Mature Audiences, etc...

    # Now we get the number of reviews that are in English.
    # To get the numbers for other languages, we'll have to modify the URL parameters
    # and get the page again, so that'll be a big ol' loop that we'll do later.
    try:
        raw_english_comments = current_page_soup.find('label', attrs={'for':'review_language_mine'}) \
                                                    .find_next('span', class_='user_reviews_count').get_text()
        formatted_english_comments = int(raw_english_comments.replace(',', '').strip('()'))
        game['english'] = formatted_english_comments
                                                            
    except:
        game['english'] = 0

    # Rinse and repeat.
    games_extend_list.append(game)
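
The language-table logic in the function above (three checkmark columns, always in the same order) can be illustrated without any HTML. The sketch below assumes each table row has already been reduced to a language name plus three booleans; the function name `sort_language_support` and the sample rows are hypothetical. It pairs columns with flags using `zip` rather than a manual counter.

```python
def sort_language_support(rows):
    """Group language names by the kind of support they have.

    rows: iterable of (language, (interface, full_audio, subtitles)) pairs,
    where each flag is True if that table cell held a checkmark.
    """
    support = {
        "interface_languages": [],
        "full_audio_languages": [],
        "subtitles_languages": [],
    }
    for language, flags in rows:
        # zip pairs each column name with the flag in the same position.
        for column, has_check in zip(support, flags):
            if has_check:
                support[column].append(language)
    return support


# Hypothetical rows for illustration only.
rows = [
    ("English", (True, True, True)),
    ("French", (True, False, True)),
    ("Japanese", (True, False, False)),
]
result = sort_language_support(rows)
print(result["interface_languages"])   # ['English', 'French', 'Japanese']
print(result["full_audio_languages"])  # ['English']
print(result["subtitles_languages"])   # ['English', 'French']
```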

In [9]:
# I'm declaring/cleaning out the list in a different cell because I hit a lot of 
# exceptions while testing this, and I didn't want to accidentally clean out all
# my previous hard work each time I made a fix and continued the process. 
games_extend_list = []

# Since running the following cell requires repeated get requests and sleep intervals,
# and since many failures tend to happen 20 minutes or more into the process,
# we can build in a ticker that keeps track of how far we got LAST time.
# Then, after we debug, we can start right over from where we left off. 
ticker = 0

In [10]:
# For bugfixing.
touched_ids = []

print("Scraping individual game page data...")

# Now we loop over all app_ids in the df we created earlier.
for index, row in scraped_search_results_df.iterrows() :
    
    # This is for bugfixing. If the loop throws an exception, I can use the ticker
    # variable to quickly pick up where we left off.
    if index == ticker :
        # Soup up the page.
        url = row['game_page_link']
        html = urlopen(url)
        current_page_soup = BeautifulSoup(html, 'lxml')

        # Scrape the page.
        scrape_game_page_data(current_page_soup, row['app_id'])

        # Include a random delay to prevent getting IP blocked.
        interval = 0.5 + random.random() * 0.3
        time.sleep(interval)
        
        # If the loop throws an exception on a game, 'ticker' will thus be equal
        # to that game's index in the df, and I can go see what the problem was.
        ticker = index + 1
        

# Turn the new list of dicts into a new df.
scraped_game_pages_df = pd.DataFrame(games_extend_list)
scraped_game_pages_df.to_json('../data/raw/temp/Scraped Game Pages.json', orient='records')

print(f"Scraped {len(games_extend_list)} game pages.")

Scraping individual game page data...
Scraped 2 game pages.
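
The ticker pattern used in the loop above (remember how far we got, fix the problem, resume from there) can be sketched on its own. This is a simplified illustration with hypothetical names (`process_from`, `handle`) and made-up items; unlike the notebook loop, it catches the exception and hands it back together with the resume index.

```python
def process_from(items, start, handle):
    """Process items[start:] in order.

    Returns (next_index, error): next_index is the position of the first
    unprocessed item, and error is the exception that stopped the run
    (None if every item was handled).
    """
    for index in range(start, len(items)):
        try:
            handle(items[index])
        except Exception as error:
            return index, error
    return len(items), None


# Hypothetical run: the third item fails once, then succeeds on the rerun.
done = []
broken = {"c"}

def handle(item):
    if item in broken:
        raise ValueError(f"could not scrape {item}")
    done.append(item)

ticker, error = process_from(["a", "b", "c", "d"], 0, handle)
print(ticker, error)  # 2 could not scrape c
broken.clear()        # stand-in for fixing the bug
ticker, error = process_from(["a", "b", "c", "d"], ticker, handle)
print(ticker, done)   # 4 ['a', 'b', 'c', 'd']
```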


In [11]:
# Now we join our dataframes to create our core dataset.
# I say "core," even though our all-important label has yet to be scraped.
# Bear with me. I'm new at this.
joined_games_df = pd.merge(scraped_search_results_df, scraped_game_pages_df, on="app_id", how='inner')
joined_games_df.to_json('../data/raw/temp/Joined Games DF.json', orient='records')
# joined_games_df.info()

Step 4: Scrape the number of comments in each language from the games' pages
---

In [12]:
# Now we begin the task of getting all the comment counts for each different language.
# Since this process requires a huge amount of get requests/time, we'll limit our exploration
# to the 10 most common languages for game localization (assuming the source text is English).

# Here's a list of all the language codes on Steam, for good measure.
# Don't know if we'll use it, but here it is.
all_languages = ['schinese', 'tchinese', 'japanese', 'koreana', 'thai', 'bulgarian', 'czech', 'danish', \
                 'german', 'english', 'spanish', 'latam', 'greek', 'french', 'italian', 'indonesian', \
                 'hungarian', 'dutch', 'norwegian', 'polish', 'portugese', 'brazilian', 'romanian', \
                 'russian', 'finnish', 'swedish', 'turkish', 'vietnamese', 'ukranian']

# For some languages, Steam displays 'comments in my language' as including English. Let's make a list
# of them for reference.
# We will NOT subtract EN counts from these language comment counts for now, since the behavior of that was
# wonky.
# Since we're worried about ratio of comments normally to comments on this one game, this is not a HUGE
# problem.... but I'll need to find a better solution for this at some point.
languages_counted_with_english = ['german', 'danish', 'greek', 'dutch', 'norwegian', 'finnish', 'swedish']

# Since we already have the EN comment counts, let's make a list that excludes EN for future
# scraping. We can also remove all the languages that don't have their own distinct counts.
all_counted_non_english_languages = all_languages.copy()
all_counted_non_english_languages.remove('english')
all_counted_non_english_languages.remove('portugese')
all_counted_non_english_languages.remove('ukranian')

# These are the generally-accepted top 10 languages to localize into from EN.
# The count of EN comments is important for our analysis, but it's already in the df.
# No idea why they put an a on the end of Korean.
top_10_languages = ['german', 'french', 'spanish', 'brazilian', 'russian', 'italian', 'schinese', \
                    'japanese', 'koreana', 'polish']

# %store all_languages
# %store top_10_languages

In [13]:
# Now we build a function that will find the number of reviews in each given language for a given game.
# We'll call it once per app_id (iterating over the first df, which is also the smallest, for
# good measure); each call appends a dict of language counts to a list that we frame later.
app_comment_languages = []
single_app_comment_languages = {}

# We'll need this later. Just trust me.
en_comment_counts_by_app_id = scraped_game_pages_df.set_index('app_id')['english']


def comments_in_all_languages(app_id, languages) :
    """
    Takes a Steam app id and a list of languages (as spelled in Steam's html)
    and creates a dictionary, then appends that dictionary to a list.

    Intended to be iterated over.

    The first key in the dictionary is "app_id", and the value is the app id.

    The rest of the keys are the names of the languages, and the values are
    the number of comments on that game/app's page that are in that language.
    """
    
    # Make sure the dict is empty at the beginning of each loop.
    single_app_comment_languages = {}
    
    # Store the app_id in the dict.
    single_app_comment_languages['app_id'] = app_id

    # Soup up the game's page in the current language.
    for language in languages :
        url = 'https://store.steampowered.com/app/'+str(app_id)+'/?l='+language
        html = urlopen(url)
        current_page_soup = BeautifulSoup(html, 'lxml')

        # There are 2 types of game page source code, used on games with different language settings.
        # We'll try the most common one first, then try to execute the other type if this throws an exception.
        try :
            raw_comment_count = current_page_soup.find('label', attrs={'for':'review_language_mine'}).find_next('span').get_text()
            formatted_comment_count = int(raw_comment_count.replace(',', '').strip('()'))
            single_app_comment_languages[language] = formatted_comment_count
        
        # If that's no good, we try scraping the other way.
        # The 'other way' can't be scraped effectively by urlopen(), so we'll use requests.get() instead.
        except :
            try :
                url = 'https://store.steampowered.com/app/'+str(app_id)+'/?l='+language
                html = requests.get(url)
                html_string = str(html.content)
                raw_comment_count = re.split('<span class="user_reviews_count">|</span> <a class="tooltip" data-tooltip-html=', html_string)[-2]
                formatted_comment_count = int(raw_comment_count.replace(',', '').strip('()'))
                single_app_comment_languages[language] = formatted_comment_count
            # If both fail, then it's a loss.
            except:
                single_app_comment_languages[language] = np.nan
        
        # Additional cleaning...
        try :
            # If the code block doesn't parse, the var ends up null.
            # Postgres disapproves.
            if single_app_comment_languages[language] is None :
                single_app_comment_languages[language] = 0
        except :
            # If something else went wrong, I need to know.
            # This will cause the table to fail to ingest, so I can check.
            single_app_comment_languages[language] = 'Failed_2'

    # Rinse and repeat.
    app_comment_languages.append(single_app_comment_languages)

    interval = 0.5 + random.random() * 0.3
    time.sleep(interval)
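
Both branches of the function above end by cleaning a displayed count into an integer. A minimal sketch of that step follows. It assumes the count is shown as digits with optional thousands separators inside parentheses, which is what the `.replace(',', '').strip('()')` calls imply; the helper name `parse_count` and the sample strings are hypothetical.

```python
def parse_count(raw_text):
    """Convert a display count such as '(1,234)' to an int; None if it is not a number."""
    cleaned = raw_text.strip().strip("()").replace(",", "")
    return int(cleaned) if cleaned.isdigit() else None


print(parse_count("(1,234)"))  # 1234
print(parse_count(" (87) "))   # 87
print(parse_count("n/a"))      # None
```

Returning `None` for unparseable text keeps the failure explicit, so the caller can decide whether to store a 0, a NaN, or a marker value.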

In [14]:
# Now we iterate over that function for all app ids.
# I'm also resetting the dict/list variables here since I ran these cells out of order a lot
# during bugfixing.
app_comment_languages = []
single_app_comment_languages = {}

# Pass each app_id into the function along with our list of target languages.

print(f"Scraping comment counts. This might take a while...")

try :
    for index, row in scraped_search_results_df.iterrows() :
        comments_in_all_languages(row['app_id'], top_10_languages)
except Exception as e :
    print(e)

# Export because I'm risk-averse.
comment_languages_df = pd.DataFrame(app_comment_languages)
comment_languages_df.to_json('../data/raw/temp/Comment Languages DF.json', orient='records')

print(f"Scraped {len(app_comment_languages)} games' comment counts.")
print(f"Next scrape starts from {next_link}")

Scraping comment counts. This might take a while...
Scraped 2 games' comment counts.
Next scrape starts from https://store.steampowered.com/search/?sort_by=&sort_order=0&category1=998&supportedlang=english&page=873


Step 5: Save and Quit
---

In [15]:
# Now we merge our separate dfs.
games_df = pd.merge(joined_games_df, comment_languages_df, on="app_id", how='inner')

# If we have existing records on disk, we add to them.
if os.path.exists('../data/raw/0 - Scraped Games DF.pkl') :

    with open('../data/raw/0 - Scraped Games DF.pkl', 'rb') as file:
        existing_records = pd.read_pickle(file)
    
    brand_new_fancy_updated_version = pd.concat([existing_records, games_df], axis=0, ignore_index=True)
    
    with open('../data/raw/0 - Scraped Games DF.pkl', 'wb') as file:
        brand_new_fancy_updated_version.to_pickle(file)
    
# If not, we begin the records on disk.
else :
    with open('../data/raw/0 - Scraped Games DF.pkl', 'wb+') as file:
        pickle.dump(games_df, file)

# Data scraped!
        
end_time = time.time()

total_runtime = end_time - start_time

seconds = int(total_runtime % 60)
minutes = int((total_runtime % (60**2)) // 60)
hours = int(total_runtime // (60**2))

# print(f"{hours}h, {minutes}m, {seconds}s")

-------------------------