
Video game localization prioritization tool proposal
---

My goal is to scrape and analyze data from the video game platform
Steam to help studios and localization service providers choose which
languages to localize into so that they maximize their localization ROI.

The dataset would be a table of games with columns for genre, sales,
price, number of reviews, percentage of positive reviews, available
languages, and the languages in which positive and negative reviews are
written. Any relationships found among these variables (especially among
language, genre, sales, price, and positive reviews) could inform
business decisions at the studio or language service provider level.

The code below is the beginning of my scraper. It scrapes a search result
page to gather the name, release date, price, number of reviews, percent
of positive reviews, and individual game page URL for all listed games.
To suit the needs of my project, it must be expanded to also do the
following:

1. Scroll through the results list so that the page loads more results
(the scraper currently captures only the first 50). Tools exist for this,
but I haven't had the time to study them yet.

2. Perform a secondary scrape of the individual games' pages to collect
the remaining columns that I haven't scraped yet. This is theoretically
possible with my current limited skillset, though I worry that so many
rapid requests will cause Steam to ban my IP address, so I should also
study tools that slow down and/or randomize the request timing.

3. Be able to ascertain the language in which a review is written. I think
there are tools available for this - worst case scenario, I just ask
ChatGPT 3.5 which language it is, using a rotating cast of IP addresses to
bypass the daily message limit.
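
As a starting point for step 2, request timing can be slowed down and randomized with the standard library alone. The sketch below is a minimal illustration, not a tested Steam client: the function name `fetch_politely`, the 1 to 3 second delay range, and the User-Agent string are all assumptions chosen for the example, not limits published by Steam.

```python
import random
import time
from urllib.request import Request, urlopen


def fetch_politely(urls, min_delay=1.0, max_delay=3.0, fetch=None, sleep=time.sleep):
    """Fetch each URL in turn, pausing a random interval between requests."""
    if fetch is None:
        # Default fetcher: plain urllib with an identifying User-Agent.
        # The User-Agent value is a hypothetical placeholder.
        def fetch(url):
            request = Request(url, headers={"User-Agent": "localization-research-scraper"})
            with urlopen(request, timeout=30) as response:
                return response.read()

    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Randomized wait so requests are neither rapid nor evenly spaced.
            sleep(random.uniform(min_delay, max_delay))
        pages[url] = fetch(url)
    return pages
```

The `fetch` and `sleep` parameters exist only so the pacing logic can be exercised without touching the network, for example by passing `fetch=lambda u: u` and `sleep=some_list.append`.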

In [1]:
# Basic DS stuff
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Web scraping
from bs4 import BeautifulSoup
from urllib.request import urlopen

# I needed some extra help locating specific parts within a
# bs4 tag object, so I got this.
import re

# I didn't end up using this one, but that might be because
# I still have no idea what the eff I'm doing. Leaving it for
# now in case I need it later.
import requests

In [2]:
# I'm using the search results for the word "Hello" as a test case.
# The search URL can be extended with extra query parameters to add
# specific search conditions, including available languages, price, genre, etc.

url = "https://store.steampowered.com/search/?term=hello"
html = urlopen(url)
my_soup = BeautifulSoup(html, 'lxml')
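
Rather than editing the URL string by hand, the extra search conditions could be assembled with `urllib.parse.urlencode`. This is a minimal sketch: `build_search_url` is a hypothetical helper, and the filter names used in the usage note (`supportedlang`, `maxprice`) are assumptions that should be checked against the URLs the Steam search page itself produces when filters are applied.

```python
from urllib.parse import urlencode

BASE_SEARCH_URL = "https://store.steampowered.com/search/"


def build_search_url(term, **filters):
    """Return a search URL with the term and any extra filters as query parameters."""
    params = {"term": term}
    # Drop filters left as None so they do not appear in the URL.
    params.update({key: value for key, value in filters.items() if value is not None})
    return f"{BASE_SEARCH_URL}?{urlencode(params)}"
```

Calling `build_search_url("hello")` reproduces the test URL used in this cell, while `build_search_url("hello", supportedlang="french", maxprice=20)` would append the two assumed filters after the search term.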

In [3]:
# To understand how to scrape further, I take a look at the HTML.

print(my_soup.prettify())

<!DOCTYPE html>
<html class="responsive" lang="en">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="width=device-width,initial-scale=1" name="viewport"/>
  <meta content="#171a21" name="theme-color"/>
  <title>
   Steam Search
  </title>
  <link href="/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
  <link href="https://store.cloudflare.steamstatic.com/public/shared/css/motiva_sans.css?v=2C1Oh9QFVTyK&amp;l=english&amp;_cdn=cloudflare" rel="stylesheet" type="text/css"/>
  <link href="https://store.cloudflare.steamstatic.com/public/shared/css/shared_global.css?v=Y_UvvOKYFpvs&amp;l=english&amp;_cdn=cloudflare" rel="stylesheet" type="text/css"/>
  <link href="https://store.cloudflare.steamstatic.com/public/shared/css/buttons.css?v=hFJKQ6HV7IKT&amp;l=english&amp;_cdn=cloudflare" rel="stylesheet" type="text/css"/>
  <link href="https://store.cloudflare.steamstatic.com/public/css/v6/store.css?v=VGCwnkgjilIU&amp;l=english&amp;_cdn=cloudfl

In [7]:
# From looking at the whole page's HTML, I can tell which tag to call in order
# to get the information relevant to only a single game.

single_game_example = my_soup.find('a', class_='search_result_row ds_collapse_flag')

print(single_game_example.prettify())

<a class="search_result_row ds_collapse_flag" data-ds-appid="521890" data-ds-crtrids="[6858705,35483784]" data-ds-itemkey="App_521890" data-ds-steam-deck-compat-handled="true" data-ds-tagids="[1667,3810,1687,3839,4106,1742,3978]" data-gpnav="item" data-search-page="1" href="https://store.steampowered.com/app/521890/Hello_Neighbor/?snr=1_7_7_151_150_1" onmouseout="HideGameHover( this, event, 'global_hover' )" onmouseover="GameHover( this, event, 'global_hover', {&quot;type&quot;:&quot;app&quot;,&quot;id&quot;:521890,&quot;public&quot;:1,&quot;v6&quot;:1} );">
 <div class="col search_capsule">
  <img src="https://cdn.cloudflare.steamstatic.com/steam/apps/521890/capsule_sm_120.jpg?t=1670593236" srcset="https://cdn.cloudflare.steamstatic.com/steam/apps/521890/capsule_sm_120.jpg?t=1670593236 1x, https://cdn.cloudflare.steamstatic.com/steam/apps/521890/capsule_231x87.jpg?t=1670593236 2x"/>
 </div>
 <div class="responsive_search_name_combined">
  <div class="col search_name ellipsis">
   <spa

In [12]:
# Now I can begin scraping this info. In this step I'll also scrape the URL of
# the game's individual page for future scraping.

# Create an empty list for the game info dictionaries to go into later.

games = []

# Loop through the HTML blocks for each game and scrape the key info into a dictionary,
# then add the dictionaries to the list.
# I'm not cleaning up the data types at this point - I'm learning as I'm going, so I'm
# prioritizing getting all the info I need into the df, and then working with data
# types later either by doing operations on the df or re-writing some of this code.

for listing in my_soup.find_all('a', class_='search_result_row ds_collapse_flag') :

    # Create (or clean out) an empty dictionary to hold the new info.

    game = {}

    # The title and release date seem to be at uniform locations in all listings.

    game['title'] = listing.find('span', class_='title').get_text()
    game['release_date'] = listing.find('div', class_='col search_released responsive_secondrow').get_text()

    # Not all games have reviews listed, so we have to account for listings that omit this part.
    # I might eventually remove this part and scrape the review data from the individual game pages
    # instead, since it seems to be more complete there. This is just proof of concept for now.

    try:
        tooltip = (listing.find('div', class_='col search_reviewscore responsive_secondrow')
                          .find('span').get('data-tooltip-html'))
        review_string = re.split('>| of|the | user', tooltip)
        game['positive_review_percent'] = review_string[1]
        game['number_of_reviews'] = review_string[3]
    except (AttributeError, TypeError, IndexError):
        # A missing tag, a missing tooltip, or an unexpected tooltip format all end up here.
        game['positive_review_percent'] = np.nan
        game['number_of_reviews'] = np.nan
    
    # Same for price - many unreleased games do not have price info, so we have to skip them.
    # Some games have an original price and a discounted price listed, but for the time being
    # I've decided to only go by original prices, so I'll default to that and only return
    # a null value if no kind of price whatsoever is listed.

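    try:
        game['price'] = listing.find('div', class_="discount_original_price").get_text()
    except AttributeError:
        # find() returned None, so there is no original price; try the final price instead.
        try:
            game['price'] = listing.find('div', class_="discount_final_price").get_text()
        except AttributeError:
            game['price'] = np.nan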

    # Weirdly enough, not every game seems to have its own page.

    # Tag.get() returns its default instead of raising when 'href' is missing,
    # so no try/except is needed here.
    game['game_page_link'] = listing.get('href', False)

    # Now we add this dict to the list, rinse and repeat.

    games.append(game)

# After the loop, we check...
print(len(games))
print(games[0])

50
{'title': 'Hello Neighbor', 'release_date': 'Dec 8, 2017', 'positive_review_percent': '84%', 'number_of_reviews': '9,599', 'price': '$29.99', 'game_page_link': 'https://store.steampowered.com/app/521890/Hello_Neighbor/?snr=1_7_7_151_150_1'}
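
The `re.split` call in the loop relies on positional indexes, which is easy to break if the tooltip wording shifts. A named-group regex is one possible alternative. This sketch assumes the tooltip looks like `Very Positive<br>84% of the 9,599 user reviews ...`, a shape inferred from the split pattern and the output above rather than taken from any Steam documentation. The helper name `parse_review_tooltip` is hypothetical.

```python
import re

# Assumed tooltip shape, inferred from the split pattern and output above:
# "<summary><br>84% of the 9,599 user reviews ..."
REVIEW_PATTERN = re.compile(r"(?P<percent>\d+%) of the (?P<count>[\d,]+) user")


def parse_review_tooltip(tooltip):
    """Return (percent, count) as strings, or (None, None) if the tooltip is absent or unparseable."""
    if not tooltip:
        return None, None
    match = REVIEW_PATTERN.search(tooltip)
    if match is None:
        return None, None
    return match.group("percent"), match.group("count")
```

The helper returns the same raw strings the loop currently stores (for example `'84%'` and `'9,599'`), and `None` values are treated as missing by pandas in the same way as `np.nan` in an object column.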


In [18]:
# Frame it and check.

game_info_df = pd.DataFrame(games)
print(game_info_df.info())
print(game_info_df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   title                    50 non-null     object
 1   release_date             50 non-null     object
 2   positive_review_percent  35 non-null     object
 3   number_of_reviews        35 non-null     object
 4   price                    32 non-null     object
 5   game_page_link           50 non-null     object
dtypes: object(6)
memory usage: 2.5+ KB
None
              title  release_date positive_review_percent number_of_reviews  \
0    Hello Neighbor   Dec 8, 2017                     84%             9,599   
1     Hello Goodboy  May 25, 2023                    100%                34   
2    Hello, Goodbye  Apr 18, 2019                     91%                48   
3  Hello Neighbor 2   Dec 6, 2022                     73%             1,444   
4     Hello Teacher  Jun 16, 2021 
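
Every column above is still `object` (string) dtype. The deferred type cleanup could be done on the DataFrame afterwards, roughly as sketched below. This assumes US-style formatting (`$29.99`, `9,599`, `Dec 8, 2017`); other store locales would need different handling. `clean_types` is a hypothetical helper, and `sample_df` is a two-row stand-in built from the values shown above, with the second price left missing for illustration.

```python
import pandas as pd

# Stand-in rows in the same string format as the scraped output above.
# The missing price in the second row is illustrative only.
sample_df = pd.DataFrame({
    "release_date": ["Dec 8, 2017", "May 25, 2023"],
    "positive_review_percent": ["84%", "100%"],
    "number_of_reviews": ["9,599", "34"],
    "price": ["$29.99", None],
})


def clean_types(df):
    """Return a copy with the percent, count, price and date columns converted from strings."""
    out = df.copy()
    out["positive_review_percent"] = pd.to_numeric(
        out["positive_review_percent"].str.rstrip("%"), errors="coerce")
    out["number_of_reviews"] = pd.to_numeric(
        out["number_of_reviews"].str.replace(",", "", regex=False), errors="coerce")
    # Keep only digits and the decimal point; anything unparseable becomes NaN.
    out["price"] = pd.to_numeric(
        out["price"].str.replace(r"[^\d.]", "", regex=True), errors="coerce")
    # Entries that are not real dates become NaT.
    out["release_date"] = pd.to_datetime(out["release_date"], errors="coerce")
    return out


cleaned = clean_types(sample_df)
```

Using `errors="coerce"` keeps the conversion from failing on odd entries, at the cost of silently turning them into missing values, so it is worth checking the null counts before and after.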

*That's all for now!*