<a href="https://colab.research.google.com/github/lundkvistbenjamin/steam-sales-scraper/blob/main/steam_sales_scraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Steam Sales Scraper
This notebook scrapes discounted games from the Steam store, extracts key information, and saves it into a CSV file.

In [1]:
# Install dependencies
!pip install beautifulsoup4 pandas requests



## 1. Fetch and Parse Steam Sales Pages
We will fetch the first five pages of discounted games from the Steam search results and concatenate their HTML into a single string.

In [2]:
import requests

pages_html = ""

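# Loop through the first 5 result pages (1 to 5)
for page_number in range(1, 6):
    res = requests.get(
        f"https://store.steampowered.com/search/?supportedlang=english&specials=1&page={page_number}&ndl=1",
        timeout=10,  # Avoid hanging indefinitely on a stalled connection
    )
    res.raise_for_status()  # Stop on an HTTP error instead of parsing an error page
    pages_html += res.text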
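
As a possible variation on the loop above, the pages can be kept in a list instead of one concatenated string, which makes it easy to see which page a result came from. The sketch below is only an illustration: `fetch_pages`, its `fetch` parameter, and `delay_seconds` are hypothetical names, and the demonstration uses a stand-in fetcher so that it runs without network access.

```python
import time


def fetch_pages(fetch, page_numbers, delay_seconds=0.0):
    """Return one HTML string per page number, in order.

    `fetch` is any callable that maps a URL to its HTML text, for example
    lambda url: requests.get(url, timeout=10).text
    """
    base_url = (
        "https://store.steampowered.com/search/"
        "?supportedlang=english&specials=1&page={}&ndl=1"
    )
    pages = []
    for number in page_numbers:
        pages.append(fetch(base_url.format(number)))
        time.sleep(delay_seconds)  # Optional pause between requests
    return pages


# Offline demonstration with a stand-in fetcher (assumption: no real request is made)
def fake_fetch(url):
    return f"<html><!-- fetched {url} --></html>"


pages = fetch_pages(fake_fetch, range(1, 6))
print(len(pages))
```

Joining the list with `"".join(pages)` gives the same single string as `pages_html` if the rest of the notebook is left unchanged.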

## 2. Parse HTML with BeautifulSoup
We will parse the HTML and extract relevant game information.

In [3]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(pages_html, "html.parser")

# Find all game containers
game_containers = soup.find_all("div", {"class": "responsive_search_name_combined"})

## 3. Extract Game Titles
Each game title is stored in a `<span>` tag with the class `"title"`.

In [4]:
titles = [
    game.find("span", {"class": "title"}).text
    if game.find("span", {"class": "title"}) else None
    for game in game_containers
]
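
To make the lookup concrete, here is a dependency-free sketch of the same idea using the standard library's `html.parser` instead of BeautifulSoup. It is not what the notebook runs; `TitleExtractor` and `sample_html` are hypothetical, and the fragment only imitates the markup described above.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text of every <span class="title"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._depth = 0    # Greater than 0 while inside a title span
        self._chunks = []  # Text pieces of the title being read

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1  # A span nested inside a title span
        elif "title" in classes:
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.titles.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


# Hypothetical fragment shaped like two search results
sample_html = (
    '<div class="responsive_search_name_combined">'
    '<span class="title">Total War: WARHAMMER III</span></div>'
    '<div class="responsive_search_name_combined">'
    '<span class="title">Lost Records: Bloom &amp; Rage</span></div>'
)

extractor = TitleExtractor()
extractor.feed(sample_html)
extractor.close()
print(extractor.titles)
# ['Total War: WARHAMMER III', 'Lost Records: Bloom & Rage']
```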

## 4. Extract Game Ratings and Review Counts
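We read each game's review tooltip, convert the rating description to its index in `rating_system` (0 = Overwhelmingly Negative, 8 = Overwhelmingly Positive), and extract the number of reviews.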

In [5]:
rating_system = ["Overwhelmingly Negative", "Very Negative", "Negative", "Mostly Negative",
                 "Mixed", "Mostly Positive", "Positive", "Very Positive", "Overwhelmingly Positive"]

ratings = []
reviews = []

for game in game_containers:
    rating_span = game.find("span", {"class": "search_review_summary"})

    if rating_span:
        data_tooltip = rating_span["data-tooltip-html"]
        rating = data_tooltip.split("<br>")[0]
        ratings.append(rating_system.index(rating))  # Convert rating to index
        reviews.append(data_tooltip.split("<br>")[1].split(" ")[3])  # Extract review count
    else:
        ratings.append(None)
        reviews.append(None)
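
The cell above relies on the tooltip having a fixed word order. A more defensive variant might look like the sketch below, which returns `None` for an unknown rating label instead of raising and converts the review count to an integer. `parse_review_tooltip` is a hypothetical helper, and the tooltip wording in the example is an assumption about what Steam currently returns.

```python
import re

RATING_SYSTEM = [
    "Overwhelmingly Negative", "Very Negative", "Negative", "Mostly Negative",
    "Mixed", "Mostly Positive", "Positive", "Very Positive", "Overwhelmingly Positive",
]


def parse_review_tooltip(tooltip):
    """Return (rating_index, review_count); either value may be None."""
    label, _, detail = tooltip.partition("<br>")
    rating = RATING_SYSTEM.index(label) if label in RATING_SYSTEM else None

    # Look for the number directly before "user reviews", e.g. "300,690"
    match = re.search(r"([\d,]+) user reviews", detail)
    count = int(match.group(1).replace(",", "")) if match else None
    return rating, count


# Assumed tooltip format
example = "Very Positive<br>85% of the 300,690 user reviews for this game are positive."
print(parse_review_tooltip(example))
# (7, 300690)
```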

## 5. Extract Discounts, Prices, and Original Prices

In [6]:
import re

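def parse_price(price_str):
    """Convert a price such as "13,99€" to a float; return None if it contains no digits."""
    if price_str:
        clean_price = re.sub(r"[^\d.,]", "", price_str)  # Keep only digits, '.' and ','
        if re.search(r"\d", clean_price):  # Skip labels such as "Free"
            return float(clean_price.replace(",", "."))  # Treat ',' as the decimal separator
    return None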

discounts = [
    int(game.find("div", {"class": "discount_pct"}).text.strip("%"))
    if game.find("div", {"class": "discount_pct"}) else None
    for game in game_containers
]

prices = [
    parse_price(game.find("div", {"class": "discount_final_price"}).text)
    if game.find("div", {"class": "discount_final_price"}) else None
    for game in game_containers
]

original_prices = [
    parse_price(game.find("div", {"class": "discount_original_price"}).text)
    if game.find("div", {"class": "discount_original_price"}) else None
    for game in game_containers
]
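
`parse_price` assumes a comma is the decimal separator, so a price with a thousands separator such as `1.299,00€` would not convert. If that matters for your region, a more tolerant parser could look like the sketch below. `parse_price_any_separator` is a hypothetical helper, and the rule it uses (the last separator is the decimal point only when one or two digits follow it) is a heuristic, not something taken from Steam.

```python
import re


def parse_price_any_separator(price_str):
    """Parse prices like "13,99€", "1.299,00€" or "$1,299.00"; return None without digits."""
    if not price_str:
        return None
    digits = re.sub(r"[^\d.,]", "", price_str)
    if not re.search(r"\d", digits):
        return None  # For example "Free"

    last_sep = max(digits.rfind("."), digits.rfind(","))
    decimals = len(digits) - last_sep - 1
    if last_sep != -1 and decimals in (1, 2):
        # The last separator is the decimal point; drop all earlier separators
        whole = re.sub(r"[.,]", "", digits[:last_sep]) or "0"
        return float(f"{whole}.{digits[last_sep + 1:]}")

    # No decimal part: every separator is a thousands separator
    return float(re.sub(r"[.,]", "", digits))


for text in ["13,99€", "1.299,00€", "$1,299.00", "Free"]:
    print(text, "->", parse_price_any_separator(text))
```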

## 6. Extract Release Dates

In [7]:
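release_dates = []

for game in game_containers:
    released = game.find("div", {"class": "search_released"})

    # Guard against a missing element; near-empty text means no date is listed
    if released and len(released.text) > 2:
        release_dates.append(released.text.strip())
    else:
        release_dates.append(None)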
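
The release dates are kept as display strings such as `Jun 3, 2020`. If sortable dates are preferred, one option is to convert them to ISO format, as in this sketch. `parse_release_date` is a hypothetical helper; it assumes the `Mon D, YYYY` format seen in the output below and returns `None` for anything else.

```python
from datetime import datetime


def parse_release_date(text):
    """Return an ISO date string (YYYY-MM-DD), or None if the text is not 'Mon D, YYYY'."""
    try:
        # %b matches abbreviated month names in the current locale (English by default)
        return datetime.strptime(text.strip(), "%b %d, %Y").date().isoformat()
    except ValueError:
        return None


print(parse_release_date("Jun 3, 2020"))
# 2020-06-03
print(parse_release_date("Coming soon"))
# None
```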

## 7. Extract Platform Compatibility

In [8]:
win = []
lin = []
osx = []

for game in game_containers:
    platforms = [platform["class"][1] for platform in game.find_all("span", {"class": "platform_img"})]

    win.append(1 if "win" in platforms else 0)
    lin.append(1 if "linux" in platforms else 0)
    osx.append(1 if "mac" in platforms else 0)
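
The loop above reads the second class name of each platform icon, which fails if an icon carries only one class. A position-independent check could be sketched as follows. `platform_flags` is a hypothetical helper that takes the class lists BeautifulSoup returns for each `platform_img` span, and the sample input is made up.

```python
def platform_flags(class_lists):
    """Map the class lists of a game's platform icons to 0/1 flags."""
    # Flatten all class names so their position within each list does not matter
    names = {name for classes in class_lists for name in classes}
    return {
        "Windows": int("win" in names),
        "Linux": int("linux" in names),
        "MacOS": int("mac" in names),
    }


# Hypothetical class lists for a game available on Windows and macOS
sample_classes = [["platform_img", "win"], ["platform_img", "mac"]]
print(platform_flags(sample_classes))
# {'Windows': 1, 'Linux': 0, 'MacOS': 1}
```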

## 8. Store Data in Pandas DataFrame

In [9]:
from datetime import datetime
import pandas as pd

current_time = datetime.now().strftime("%Y-%m-%d %H:%M")
fetch_times = [current_time for _ in game_containers]

data = {
    "Game Name": titles,
    "Rating": ratings,
    "#Reviews": reviews,
    "Discount%": discounts,
    "Price (€)": prices,
    "Original Price (€)": original_prices,
    "Release Date": release_dates,
    "Windows": win,
    "Linux": lin,
    "MacOS": osx,
    "Fetched At": fetch_times
}

steam_sales = pd.DataFrame(data)

# Drop missing values
filtered_steam_sales = steam_sales.dropna()

# Show first few rows
filtered_steam_sales.head()

| | Game Name | Rating | #Reviews | Discount% | Price (€) | Original Price (€) | Release Date | Windows | Linux | MacOS | Fetched At |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Sea of Thieves: 2024 Edition | 7 | 300690 | -65.0 | 13.99 | 39.99 | Jun 3, 2020 | 1 | 0 | 0 | 2025-02-19 16:20 |
| 1 | Total War: WARHAMMER III | 5 | 81002 | -60.0 | 23.99 | 59.99 | Feb 16, 2022 | 1 | 1 | 1 | 2025-02-19 16:20 |
| 2 | NBA 2K25 | 4 | 14146 | -67.0 | 23.09 | 69.99 | Oct 28, 2024 | 1 | 0 | 0 | 2025-02-19 16:20 |
| 3 | The Outlast Trials | 7 | 51475 | -60.0 | 15.99 | 39.99 | Mar 5, 2024 | 1 | 0 | 0 | 2025-02-19 16:20 |
| 4 | Lost Records: Bloom & Rage | 7 | 312 | -10.0 | 35.99 | 39.99 | Feb 18, 2025 | 1 | 0 | 0 | 2025-02-19 16:20 |


## 9. Save Data to CSV
If the file doesn't exist, create it. Otherwise, append the new data.

In [10]:
import os

file_path = "steam_sales.csv"

if not os.path.exists(file_path):
    filtered_steam_sales.to_csv(file_path, index=False)
else:
    filtered_steam_sales.to_csv(file_path, mode="a", index=False, header=False)

print(f"Data saved to {file_path}")

Data saved to steam_sales.csv
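
The create-or-append rule can also be illustrated without pandas. The sketch below uses the standard library's `csv` module and writes the header only when the file is new or empty. `append_rows` and the demo file name are hypothetical, and the demonstration writes to a temporary directory so that it does not touch `steam_sales.csv`.

```python
import csv
import os
import tempfile


def append_rows(file_path, fieldnames, rows):
    """Append dict rows to a CSV file, writing the header only for a new or empty file."""
    needs_header = not os.path.exists(file_path) or os.path.getsize(file_path) == 0
    with open(file_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if needs_header:
            writer.writeheader()
        writer.writerows(rows)


# Two "runs" against the same file: the header should appear only once
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "steam_sales_demo.csv")
    fields = ["Game Name", "Discount%"]
    append_rows(demo_path, fields, [{"Game Name": "NBA 2K25", "Discount%": -67}])
    append_rows(demo_path, fields, [{"Game Name": "The Outlast Trials", "Discount%": -60}])
    with open(demo_path, newline="", encoding="utf-8") as handle:
        saved_rows = list(csv.reader(handle))

print(saved_rows)
```

Checking for an empty file as well as a missing one avoids a headerless CSV if an earlier run created the file but wrote nothing.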
