### Project for a data engineering course.

DATASET https://store.steampowered.com/search/?specials=1&page=1

The goal of this project is to scrape data from Steam's store page for products with reduced prices.


### Reflection

From the start I found this project a lot more interesting than the first one, maybe because it felt like I was working with information that genuinely interests me.
I got off to a good start: Dennis went through BeautifulSoup thoroughly so that every student would understand it, and the tips were very helpful as well.
I also used YouTube and some forums as resources, but the lectures helped a good deal.

As usual, the first couple of hours went into a lot of testing and trying to understand the fundamentals of BeautifulSoup as well as IPython. Once I had a solid starting point, everything started to flow and I made quite fast progress until I started struggling with the loops. It took a while to display all 125 games from the first five pages instead of just the games from a single page.

Overall, I'm happy about how this project turned out and I'm looking forward to the next one. I tried to comment my code as clearly as possible. 

In [243]:
import pandas as pd
import os
import requests
from bs4 import BeautifulSoup
from IPython.display import HTML, display

In [244]:
# Download the first page of the Steam specials. This soup is not used later on,
# because the scraper below needs to loop through the first 5 pages itself.

SEARCH_URL = "https://store.steampowered.com/search/?specials=1&page=1"
res = requests.get(SEARCH_URL)
soup = BeautifulSoup(res.text, 'html.parser')
prettysoup = soup.prettify()
type(soup)

bs4.BeautifulSoup

In [245]:
# 1. Data Scraping and Data Munging

# Making an empty list to store the data
data = []

# Downloading the first 5 pages of the Steam data, looping through with page_num.
for page_num in range(1, 6):

    # Constructing the URL for steam search with page number
    SEARCH_URL = f"https://store.steampowered.com/search/?specials=1&page={page_num}"
    
    # Sending a GET request to the Steam search URL
    res = requests.get(SEARCH_URL)

    # Parsing the HTML content of the response using BeautifulSoup
    soup = BeautifulSoup(res.text, 'html.parser')

    # Finding the container for search results, this is from where we get access to the game information later
    games = soup.find('div', {'id': 'search_results'})

    # Looping through the 25 games displayed on each page and collecting their information
    for game in games.find_all('div', class_='responsive_search_name_combined')[:25]:

        # Finding game title
        title = game.find('span', class_='title').get_text(strip=True)

        # Finding release date
        release_date = game.find('div', class_='search_released').get_text(strip=True)

        # Finding platforms (the last CSS class of each icon is the platform name, e.g. 'win')
        platform_spans = game.find_all('span', class_='platform_img')
        platforms = [span['class'][-1] for span in platform_spans]

        # Finding reviews; the if-else avoids errors when a game has no reviews
        reviews = game.find('span', class_='search_review_summary')
        if reviews:
            tooltip_html = reviews.get('data-tooltip-html', '')
            review_start = tooltip_html.find('<br>') + len('<br>')
            review_end = tooltip_html.find('%', review_start)
            review = tooltip_html[review_start:review_end]
        else:
            review = None

        # Finding sale percentage
        sale_percentage = game.find('div', class_='discount_pct').get_text(strip=True)

        # Finding original price
        original_price = game.find('div', class_='discount_original_price').get_text(strip=True)

        # Finding current price
        current_price = game.find('div', class_='discount_final_price').get_text(strip=True)

        # Checking if there is 'win', 'linux', or 'mac' in the platforms
        windows_check = int('win' in platforms)
        linux_check = int('linux' in platforms)
        mac_check = int('mac' in platforms)

        # Appending the data to the list
        data.append([title, release_date, review, sale_percentage, current_price, original_price, windows_check, linux_check, mac_check])


# Creating a pandas DataFrame to display the collected data
columns = ["Game", "Release date", "Rating", "Sale %", "Price", "Original price", "Windows", "Linux", "Mac"]
df = pd.DataFrame(data, columns=columns)

# Setting the option to display all rows in the DataFrame
pd.set_option('display.max_rows', None)

# Saving the data to a CSV file
csv_file_path = "steam_data.csv"
if not os.path.exists(csv_file_path):
    # If the file does not exist yet, create it with a header row
    df.to_csv(csv_file_path, index=False)
else:
    # If it already exists, append the new rows without repeating the header
    df.to_csv(csv_file_path, mode='a', header=False, index=False)

df

Unnamed: 0,Game,Release date,Rating,Sale %,Price,Original price,Windows,Linux,Mac
0,Euro Truck Simulator 2,"12 Oct, 2012",97.0,-75%,"4,99€","19,99€",1,1,1
1,Horizon Zero Dawn™ Complete Edition,"7 Aug, 2020",87.0,-75%,"12,49€","49,99€",1,0,0
2,EA SPORTS FC™ 24,"28 Sep, 2023",57.0,-50%,"34,99€","69,99€",1,0,0
3,Hell Let Loose,"27 Jul, 2021",84.0,-35%,"29,24€","44,99€",1,0,0
4,Tom Clancy's Ghost Recon® Breakpoint,"23 Jan, 2023",69.0,-80%,"11,99€","59,99€",1,0,0
5,Cyberpunk 2077: Phantom Liberty,"25 Sep, 2023",89.0,-15%,"25,49€","29,99€",1,0,0
6,Tom Clancy's Rainbow Six® Siege,"1 Dec, 2015",86.0,-60%,"7,99€","19,99€",1,0,0
7,American Truck Simulator,"2 Feb, 2016",96.0,-75%,"4,99€","19,99€",1,1,1
8,Last Train Home,"28 Nov, 2023",84.0,-15%,"33,99€","39,99€",1,0,0
9,Ancestors Legacy,"22 May, 2018",80.0,-90%,"3,49€","34,99€",1,0,0
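
The scraped values are still strings such as `4,99€` and `-75%`, so a further munging step could convert them to numbers and dates. The following is a minimal sketch using only the standard library. The helper names (`parse_price`, `parse_discount`, `parse_release_date`) are hypothetical and not part of the notebook, and the formats are assumed to match the sample rows above (comma as decimal separator, currency symbol as suffix, English month abbreviations).

```python
import re
from datetime import datetime


def parse_price(text):
    """Convert a price string such as '4,99€' to a float, or None if not numeric."""
    # Keep only digits and the decimal comma, then switch the comma to a dot.
    cleaned = re.sub(r"[^\d,]", "", text).replace(",", ".")
    try:
        return float(cleaned)
    except ValueError:
        return None


def parse_discount(text):
    """Convert a discount string such as '-75%' to a positive integer."""
    match = re.search(r"(\d+)\s*%", text)
    return int(match.group(1)) if match else None


def parse_release_date(text):
    """Convert a date string such as '12 Oct, 2012' to a datetime.date."""
    try:
        return datetime.strptime(text, "%d %b, %Y").date()
    except ValueError:
        # Assumption: any other format is treated as missing.
        return None


print(parse_price("4,99€"))                # 4.99
print(parse_discount("-75%"))              # 75
print(parse_release_date("12 Oct, 2012"))  # 2012-10-12
```

With the DataFrame from the cell above, these helpers could be applied per column, for example `df["Price"].map(parse_price)`.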


In [250]:
games = soup.find('div', {'id': 'search_resultsRows'})
HTML(str(games))

In [252]:
# Testing

# Find the search results container
games_container = soup.find('div', {'id': 'search_results'})

# Find and display the details of the first 20 games
for game in games_container.find_all('div', class_='responsive_search_name_combined')[:20]:
    
    # Find title
    title = game.find('span', class_='title')

    # Find platforms (the last CSS class of each icon is the platform name)
    platform_spans = game.find_all('span', class_='platform_img')
    platforms = [span['class'][-1] for span in platform_spans]

    # Find release date
    release_date = game.find('div', class_='search_released')

    # Find reviews (guard against games without a review summary)
    reviews = game.find('span', class_='search_review_summary')
    if reviews:
        tooltip_html = reviews.get('data-tooltip-html', '')
        percentage_start = tooltip_html.find('<br>') + len('<br>')
        percentage_end = tooltip_html.find('%', percentage_start)
        percentage = tooltip_html[percentage_start:percentage_end]
    else:
        percentage = None

    # Find sale percentage
    sale_percentage = game.find('div', class_='discount_pct')

    # Find original price
    original_price = game.find('div', class_='discount_original_price')

    # Find current price
    current_price = game.find('div', class_='discount_final_price')


    display(HTML(str(title)))
    display(HTML(str(release_date)))
    display(HTML(str(percentage)))
    display(HTML(str(sale_percentage)))
    display(HTML(str(current_price)))
    display(HTML(str(original_price)))
    display(HTML(str(platforms)))
    print("____________")

____________
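
The slicing between `<br>` and `%` used for the review score can also be written as a small regular-expression helper, which returns `None` instead of an arbitrary slice when the tooltip contains no percentage. This is only a sketch: `extract_review_percentage` is a hypothetical name, and the sample tooltip string is an assumed example of the `data-tooltip-html` format, not output captured from Steam.

```python
import re


def extract_review_percentage(tooltip_html):
    """Return the review percentage from a tooltip string as an int, or None."""
    if not tooltip_html:
        return None
    # Match a 1-3 digit number followed by '%' that comes right after a <br> tag.
    match = re.search(r"<br\s*/?>\s*(\d{1,3})%", tooltip_html)
    return int(match.group(1)) if match else None


# Assumed example of a tooltip value, for illustration only.
sample = "Very Positive<br>97% of the 1,000 user reviews for this game are positive."
print(extract_review_percentage(sample))  # 97
print(extract_review_percentage(""))      # None
```

In the loop above it could be called as `extract_review_percentage(reviews.get('data-tooltip-html', ''))` once `reviews` is known not to be `None`.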

