# Steam Game Review Scraper

Scrape game review data from Steam, including the user, the profile link, and the content of the review itself.

In [34]:
from selenium import webdriver
from selenium.webdriver.edge.service import Service
from selenium.webdriver.edge.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import re
from time import sleep
from datetime import datetime
from openpyxl import Workbook
import csv

## Requirements
You'll need to install the following libraries before beginning this project:
- [Selenium](https://www.selenium.dev/downloads/): for automating the web browser. Setup can be involved, so check my [short YouTube video](https://youtu.be/9XAH_TvxwLg) for a walkthrough.
- [OpenPyXL](https://openpyxl.readthedocs.io/en/stable/) : for saving the data to an Excel spreadsheet (optional)

## Example
If you want to see an example of the output, here are the results of running the scraper for about 5 minutes on a particular game:  
[Click to view Excel file](https://drive.google.com/file/d/1Ld04lwFY7OjIMU2wJxRcgdPvJ0o43BRo/view?usp=sharing)

## Getting started

Look up the game id by searching on Steam, navigating to the game's home page, and copying the number embedded in the URL just before the game title.

In [35]:
# Warhammer 40k: Space Marine 2
game_id = 2183900
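If you'd rather not copy the number by hand, the lookup can be sketched with a regular expression. This is a minimal sketch that assumes the id always follows `/app/` in the URL; `extract_game_id` is a hypothetical helper name, and the title slug in the example URL is illustrative.

```python
import re


def extract_game_id(url):
    """Return the numeric app id that follows '/app/' in a Steam URL, or None."""
    match = re.search(r'/app/(\d+)', url)
    return int(match.group(1)) if match else None


# Example store URL (the title slug after the id is illustrative)
example_url = 'https://store.steampowered.com/app/2183900/Warhammer_40000_Space_Marine_2/'
print(extract_game_id(example_url))  # → 2183900
```

The same pattern also works on community URLs such as the review pages used later in this notebook.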

The url template below can be altered to filter by sentiment, language, and recency.  

Check the [website](https://steamcommunity.com/app/387990/positivereviews/?browsefilter=mostrecent) to see what options are available. For this project, I'm going to focus on **Positive** reviews only and sort by **Most Recent**.

In [36]:
template = 'https://steamcommunity.com/app/{}/positivereviews/?browsefilter=mostrecent'
template_with_language = 'https://steamcommunity.com/app/{}/positivereviews/?browsefilter=mostrecent&filterLanguage=english'

url = template_with_language.format(game_id)
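To switch filters without editing the template string each time, the URL can be assembled from its parts. This is a rough sketch: `build_review_url` is a hypothetical helper, and the only values confirmed by this notebook are `positivereviews`, `mostrecent`, and `english`. Treat any other value as an assumption and check it against the website first.

```python
from urllib.parse import urlencode


def build_review_url(game_id, sentiment='positivereviews',
                     browsefilter='mostrecent', language=None):
    """Build a Steam community review URL from its filter parts."""
    params = {'browsefilter': browsefilter}
    if language:
        # Only add the language filter when one is requested
        params['filterLanguage'] = language
    return f'https://steamcommunity.com/app/{game_id}/{sentiment}/?{urlencode(params)}'


# Reproduces the URL built from template_with_language above
print(build_review_url(2183900, language='english'))
```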

In [37]:
# Set up options for Edge (Chromium)
options = Options()
options.use_chromium = True

# Provide the correct path to the Edge WebDriver
service = Service(executable_path=r"C:\Windows\System32\msedgedriver.exe")

# Initialize the WebDriver correctly using service
driver = webdriver.Edge(service=service, options=options)

Maximize the window and load the starting URL.

In [38]:
driver.maximize_window()
driver.get(url)

## Scrape the data

The page scrolls continuously, loading more reviews as you go, so you'll need to grab the cards, scroll down to the bottom, and repeat until finished. For this project, we are going to collect the following information:
- Steam ID
- Profile URL
- Review Text
- Review Recommendation
- Review Length (chars)
- Play Hours
- Date Posted
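Several of these fields arrive as display text, so it can help to normalize them after scraping. The sketch below assumes play hours look like `1,234.5 hrs on record` and that the recommendation text contains either `Recommended` or `Not Recommended`. Both formats are assumptions about the page, and the helper names are hypothetical.

```python
import re


def parse_play_hours(text):
    """Extract hours as a float from text such as '1,234.5 hrs on record'."""
    match = re.search(r'([\d,]+(?:\.\d+)?)\s*hrs', text)
    return float(match.group(1).replace(',', '')) if match else None


def classify_recommendation(thumb_text):
    """Return 'Recommended', 'Not Recommended', or None if neither appears."""
    # Test the negative label first: 'Recommended' is a substring of it
    if 'Not Recommended' in thumb_text:
        return 'Not Recommended'
    if 'Recommended' in thumb_text:
        return 'Recommended'
    return None
```

Checking the negative label first matters because a plain `"Recommended" in text` test is also true for "Not Recommended".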

In [39]:
# Get current position of y scrollbar
last_position = driver.execute_script("return window.pageYOffset;")

# Initialize counters and lists for both types of reviews
reviews = []
recommended_reviews = 0
not_recommended_reviews = 0
review_ids = set()

# Set the review limit for each type
review_limit = 10000

running = True

while running and (recommended_reviews < review_limit or not_recommended_reviews < review_limit):
    # Get cards on the page (the URL above sets the filter to 'Most Recent')
    cards = driver.find_elements(By.CLASS_NAME, 'apphub_Card')

    for card in cards[-20:]:  # Only the tail end are new cards

        # Gamer profile URL
        profile_url = card.find_element(By.XPATH, './/div[@class="apphub_friend_block"]/div/a[2]').get_attribute('href')

        # Steam ID
        steam_id = profile_url.split('/')[-2]

        # Check to see if I've already collected this review
        if steam_id in review_ids:
            continue
        else:
            review_ids.add(steam_id)

        # Username
        user_name = card.find_element(By.XPATH, './/div[@class="apphub_friend_block"]/div/a[2]').text

        # Date posted
        date_posted = card.find_element(By.XPATH, './/div[@class="apphub_CardTextContent"]/div').text
        review_content = card.find_element(By.XPATH, './/div[@class="apphub_CardTextContent"]').text.replace(date_posted, '').strip()

        # Review length
        review_length = len(review_content.replace(' ', ''))

        # Recommendation status (either "Recommended" or "Not Recommended")
        thumb_text = card.find_element(By.XPATH, './/div[contains(@class, "title")][1]').text

        # Amount of play hours
        play_hours = card.find_element(By.XPATH, './/div[@class="reviewInfo"]/div[3]').text

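        # Only collect reviews if we're under the limit for that type.
        # "Recommended" is a substring of "Not Recommended", so test the negative label first.
        is_negative = "Not Recommended" in thumb_text
        if is_negative and not_recommended_reviews < review_limit:
            not_recommended_reviews += 1
        elif not is_negative and "Recommended" in thumb_text and recommended_reviews < review_limit:
            recommended_reviews += 1
        else:
            continue  # Skip if the limit is reached for this type or the label is unrecognized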

        # Save the review
        review = (steam_id, profile_url, review_content, thumb_text, review_length, play_hours, date_posted)
        reviews.append(review)

        # Check if we have collected enough reviews of both types
        if recommended_reviews >= review_limit and not_recommended_reviews >= review_limit:
            running = False
            break

    # Attempt to scroll down thrice, then break
    scroll_attempt = 0
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(0.5)
        curr_position = driver.execute_script("return window.pageYOffset;")

        if curr_position == last_position:
            scroll_attempt += 1
            sleep(0.5)

            if scroll_attempt >= 3:
                running = False
                break
        else:
            last_position = curr_position
            break  # Continue scraping the results

# Shut down the web driver and end the browser session
driver.quit()
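The scroll-and-retry step can also be pulled out into a small function, which keeps the main loop shorter and makes the logic testable without a browser. This is a sketch of the same idea, not a drop-in replacement: `scroll_to_bottom` is a hypothetical name, and unlike the loop above it waits `pause` seconds once per attempt rather than twice on a failed attempt.

```python
from time import sleep


def scroll_to_bottom(driver, last_position, max_attempts=3, pause=0.5):
    """Scroll to the bottom of the page and report whether the page moved.

    Returns (position, moved). moved is False when the scroll position has
    not changed after max_attempts tries, which suggests the end of the feed.
    """
    for _ in range(max_attempts):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)  # Give the page time to load more cards
        position = driver.execute_script("return window.pageYOffset;")
        if position != last_position:
            return position, True
    return last_position, False
```

In the main loop, this would be called as `last_position, moved = scroll_to_bottom(driver, last_position)`, stopping when `moved` is `False`.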


## Save the results

You can push the data wherever you want. For this project, I'm going to save it to an Excel spreadsheet using the [OpenPyXL](https://openpyxl.readthedocs.io/en/stable/) library, and also to a CSV file.

In [40]:
# save the file to Excel Worksheet
wb = Workbook()
ws = wb.worksheets[0]
ws.append(['SteamId', 'ProfileURL', 'ReviewText', 'Review', 'ReviewLength(Chars)', 'PlayHours', 'DatePosted'])
for row in reviews:
    ws.append(row)
    
today = datetime.today().strftime('%Y%m%d')
wb.save(f'Steam_Reviews_{game_id}_{today}.xlsx')
wb.close()

In [41]:
# save the file to a CSV file
today = datetime.today().strftime('%Y%m%d')
with open(f'Steam_Reviews_{game_id}_{today}.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['SteamId', 'ProfileURL', 'ReviewText', 'Review', 'ReviewLength(Chars)', 'PlayHours', 'DatePosted'])
    writer.writerows(reviews)
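As a quick sanity check, the CSV can be read back to confirm the row count and column names. This is a minimal sketch using the standard library. `load_reviews` is a hypothetical helper, and it assumes the file was written with the header row shown above.

```python
import csv


def load_reviews(path):
    """Read a saved review CSV into a list of dicts keyed by column name."""
    with open(path, newline='', encoding='utf-8') as f:
        return list(csv.DictReader(f))
```

For example, `rows = load_reviews(f'Steam_Reviews_{game_id}_{today}.csv')` followed by `len(rows)` should match `len(reviews)`. Note that every value comes back as a string, including `ReviewLength(Chars)`.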