## Data Downloader

This Jupyter notebook downloads data about globally top selling games from digital gaming platform Steam. These data contain title, release date and information about reviews and prices of individual games. Output of this Downloader is CSV file.

First, we download a few packages necessary for our downloader to be able to scrape data from Steam webpage, then other packages help us display raw data and pandas help us to create the dataframe.

In [1]:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup
import pandas as pd
from IPython.display import display, HTML
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime

We begin by creating a class, which we call 'Downloader'. Functions of few first attributes of this downloader come naturally from their names. Then we create attributes 'dataf', 'hoarder' and 'download_data'. As the list of games is too long for just one webpage, Steam saved it in more than 600 webpages in total. Thus, downloading our data works in following way, using  'download_data' attribute:
- It begins by creating a list of urls of given number of pages (starting from the first one) using 'hoarder' attribute.
- Then it uses 'dataf' attribute, which downloads HTML for every url in the list, applies first few attributes of class 'Downloader' and creates pandas dataframe of data for every given url.
- Last, 'download_data' appends dataframe for every url to one large dataframe we want to end up with. It also assigns indexes to individual games.

In [2]:
class Downloader:
    def __init__(self, link):
        self.link = link
        self.uClient = uReq(link)
        self.page_html = self.uClient.read()
        self.uClient.close()
        self.soup = BeautifulSoup(self.page_html, "lxml")
        
    
    def get_titles(self):
        td = self.soup.findAll('span', {"class":"title"})
        titles = []
        for ind in td:
            # lstrip and rstrip remove symbols from sides, strip removes white spaces
            try:
                titles.append(str(ind).lstrip('<span class="title">').rstrip('span>').rstrip('</'))
            except:
                titles.append("Title not available")                
        return titles
    
    def get_release_dates(self):
        td = self.soup.findAll('div', {"class":"col search_released responsive_secondrow"})
        release_dates = []
        for ind in td:
            if ind.text != "":
            # lstrip and rstrip remove symbols from sides, strip removes white spaces
                try:
                    release_dates.append(str(ind).rstrip('</div>').split(">")[-1])
                except:
                    release_dates.append(None)
            else:
                release_dates.append(None)
        return release_dates
    
    def reviews(self):
        tds = self.soup.findAll('div', {"class":"col search_reviewscore responsive_secondrow"})
        reviews = []
        for td in tds:     
            try:
                children = td.findChildren('span', recursive=False)
                for ind in children:
                    # lstrip and rstrip remove symbols from sides, strip removes white spaces
                    try:
                        reviews.append(str(ind).split("html=")[-1])
                    except:
                        reviews.append(None)
            except:
                reviews.append(None)
        return reviews
     
    def get_share_positive_reviews(self):
        text = self.reviews()
        shares = []
        for percent in text:
            try:
                shares.append(percent.split("%")[0].split("br&gt;")[1])
            except:
                shares.append(None)
        return shares
        
    def get_number_user_reviews(self):  
        text = self.reviews()
        numbers = []
        for number in text:
            try:
                start = number.find("of the ") + len("of the ")
                end = number.find(" user reviews")
                numbers.append(number[start:end].replace(",",""))
            except:
                numbers.append(None)
        return numbers
        
    def get_prices(self):

        td = self.soup.findAll('div', {"class":"col search_price_discount_combined responsive_secondrow"} or 
                          {"class":"col search_price discounted responsive_secondrow"})
        prices = []
        for ind in td:
            try:
                if "888888" not in str(ind):
                    try:
                        if (len(str(ind).split("\r\n")[-1].split("</div>\n</div>")[0].strip()) <10):    
                            prices.append(str(ind).split("\r\n")[-1].split("</div>\n</div>")[0].strip().replace("€","").replace(",",".").replace("-","0").replace("Free","0")) 
                        else:
                            prices.append(None)
                    except:
                        prices.append(None)
                else:
                    start1 = str(ind).find("><strike>") + len("><strike>")
                    end1 = str(ind).find("</strike>")
                    prices.append(str(ind)[start1:end1].replace("€","").replace(",",".").replace("-","0").replace("Free","0"))
            except: 
                prices.append(None)
        return prices
    
    def get_price_after_sale(self):

        td = self.soup.findAll('div', {"class":"col search_price_discount_combined responsive_secondrow"} or 
                          {"class":"col search_price discounted responsive_secondrow"})
        sales = []
        for ind in td:
            if "888888" not in str(ind):
                sales.append(None)
            else:
                try:
                    if len(str(ind).split("br/>")[-1].split("€")) < 10:
                        sales.append(str(ind).split("br/>")[-1].split("€")[0].strip().replace(",",".").replace("-","0").replace("Free","0"))
                    else:
                        sales.append("0")
                except:
                    sales.append("0")
        return sales
    
    
    def get_rate_of_sale(self):

        td = self.soup.findAll('div', {"class":"col search_price_discount_combined responsive_secondrow"} or 
                          {"class":"col search_price discounted responsive_secondrow"})
        percent = []
        for ind in td:
            if "888888" not in str(ind):
                percent.append(None)
            else:
                try:
                    start = str(ind).find(">\n<span>-")+len(">\n<span>-")
                    end = str(ind).find("%")
                    percent.append(str(ind)[start:end])
                except:
                    percent.append(None)
        return percent
    
    def dataf(self):
        titles = self.get_titles()
        dates = self.get_release_dates()
        share_reviews = self.get_share_positive_reviews()
        number_reviews = self.get_number_user_reviews()
        normal_prices = self.get_prices()
        sale_price = self.get_price_after_sale()
        sale_rate = self.get_rate_of_sale()
        
        
        self.data = pd.DataFrame({
             'Title': pd.Series(titles),
             'Release date': pd.to_datetime(pd.Series(dates),format='%d %b, %Y', errors = 'coerce'),
             'Share of positive reviews (in %)': pd.to_numeric(share_reviews, errors='coerce'),
             'Total number of reviews': pd.to_numeric(number_reviews, errors='coerce'),
             'Normal price (€)': pd.to_numeric(normal_prices, errors='coerce'),
             'Discounted price if there is a sale (€)': pd.to_numeric(sale_price, errors='coerce'),
             'Sale rate (in %)': pd.to_numeric(sale_rate, errors='coerce')})
        return self.data
     
    def total_games(self):
        total_games = int(self.soup.find('div', {"class":"search_pagination_left"}).text.split("of")[1].strip())
        return total_games
    
    def last_page(self):
        last_page = int(np.ceil((self.total_games())/25))
        return last_page
            
    def hoarder(self):
        urls = [] 
        for i in range(self.last_page()):
            urls.append(self.link + f"&page={1+i}")
        return urls
    
    def download_data(self):
        urls = self.hoarder()
        Frame = pd.DataFrame()
        for url in urls:
            Frame = Frame.append(pd.DataFrame(data = Downloader(url).dataf()))
        Frame.index = range(self.total_games())
        return Frame

This is the link for Steam page of global top sellers ordered by reviews.

In [3]:
link = 'https://store.steampowered.com/search/?sort_by=Reviews_DESC&os=win&filter=globaltopsellers'

# The Data

First, we initialize the link in order to continue working with it. Next, we can apply the 'download_data' attribute and take a look at the craped data.

In [4]:
first = Downloader(link)
#first.download() # in order to explore page html in a reasonable way, one can use online javascript beautifier, available at:
# beautifier.io
df = first.download_data()
df


Unnamed: 0,Title,Release date,Share of positive reviews (in %),Total number of reviews,Normal price (€),Discounted price if there is a sale (€),Sale rate (in %)
0,The Witcher 3: Wild Hunt - Expansion Pass,2015-05-19,99,3351,24.99,,
1,Senren＊Banka,2020-02-14,99,2854,29.99,,
2,A Short Hike,2019-07-30,99,2757,6.59,,
3,Aseprite,2016-02-22,99,2751,14.99,,
4,Doki Doki Literature Club Fan Pack,2017-09-22,99,1498,9.99,,
...,...,...,...,...,...,...,...
14473,Men of War: Assault Squad 2 - Cold War,2019-09-12,17,539,21.99,,
14474,Far Cry® 5 - Dead Living Zombies,2018-08-28,17,585,7.99,3.99,50.0
14475,Command &amp; Conquer 4: Tiberian Twilight,2010-03-16,17,2340,19.99,,
14476,Tom Clancy's Ghost Recon® Wildlands - Narco Road,2017-04-25,16,501,14.99,,


# Saving the data as CSV file

Now we can save the data as CSV file.

In [6]:
df.to_csv('steam_global_sellers_by_reviews')