# Scraping websites to get data for Dota analysis

Import Selenium and the time module to control the browser and pause while a page loads. Some websites load content dynamically with AJAX, so some elements are missing from the page for the first few seconds. bs4 parses the HTML, and pandas loads the data into a dataframe.

In [208]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import pandas as pd
import time

In [209]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
service_obj = Service(r'C:\Program Files (x86)\chromedriver.exe')
driver = webdriver.Chrome(service=service_obj)

List of Dota player regions. Each region has its own leaderboard.

In [210]:
all_regions = ['americas', 'europe', 'se_asia', 'china']

The function uses the region argument to build a URL, opens that page, and saves the region's leaderboard to a CSV file.

In [211]:
def scrape(region):
    # Go to Dota Leaderboards page
    driver.get(f'https://www.dota2.com/leaderboards/#{region}-0')
    
    # Give the page a second to load everything
    time.sleep(1)
    
    # Grab the page source and put it into a variable
    webpage = driver.page_source
    
    # Parse the page source as a soup object
    soup = BeautifulSoup(webpage, 'html.parser')
    
    # Find the table body (the first tbody tag on the page)
    table = soup.tbody
    
    # Find all rows in table and put them into a list
    rows = table.find_all('tr')
    
    # Initialize lists to capture data
    rankings = []
    players = []
    country_codes = []
    
    # Append each row's rank and player name to rankings and players respectively
    for row in rows:
        rank_tag = row.td
        rankings.append(rank_tag.string)
        player_tag = rank_tag.next_sibling
        players.append(player_tag.get_text().strip())
        # Some players have no country listed, so the row has no flag image
        try:
            flag = row.img['src']
            country_code = flag[-6:-4]
            country_codes.append(country_code.upper())
        # row.img is None (TypeError) or has no src (KeyError): store an empty string
        except (TypeError, KeyError):
            country_codes.append('')
            
    # Put the lists into a dict and load as a dataframe
    df = pd.DataFrame({'rank': rankings, 'player': players, 'country_code': country_codes})
    
    # Save the dataframe to a CSV file without the index column
    df.to_csv(f'Dota leaderboards {region.title()}.csv', index=False)
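
The row-parsing steps in scrape() can be tried without opening a browser. The sketch below runs them on a small hand-written HTML fragment. The markup, the player names, and the flag URL are hypothetical stand-ins, and the real leaderboard page may be structured differently. It also uses find_all('td') rather than next_sibling, a swapped-in alternative that does not depend on whether there is whitespace between cells.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, written by hand to mimic a leaderboard table
SAMPLE_HTML = (
    '<table><tbody>'
    '<tr><td>1</td><td><img src="https://example.com/flags/se.png"> PlayerOne</td></tr>'
    '<tr><td>2</td><td>PlayerTwo</td></tr>'
    '</tbody></table>'
)


def parse_rows(html):
    """Return a list of (rank, player, country_code) tuples from the table body."""
    soup = BeautifulSoup(html, 'html.parser')
    records = []
    for row in soup.tbody.find_all('tr'):
        cells = row.find_all('td')
        rank = cells[0].get_text(strip=True)
        player = cells[1].get_text(strip=True)
        img = row.find('img')
        # Assumes the flag file name ends in '<two-letter code>.png'
        if img is not None and img.has_attr('src'):
            code = img['src'][-6:-4].upper()
        else:
            code = ''
        records.append((rank, player, code))
    return records


records = parse_rows(SAMPLE_HTML)
```

The second sample row has no flag image, so its country code comes back as an empty string, matching the fallback in scrape().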

Call the function once for each region in the list


In [212]:
for region in all_regions:
    scrape(region)

# Scraping one more site to map country codes to country names

Using the requests module this time

In [213]:
import requests

Fetch the website

In [214]:
response = requests.get('https://www.iban.com/country-codes')

Convert page into a soup object

In [215]:
soup = BeautifulSoup(response.text, 'html.parser')

Find the table

In [216]:
table = soup.find('tbody')

Find all the rows within the table

In [217]:
rows = table.find_all('tr')

Initialize lists

In [218]:
country_names = []
country_codes = []

Append each country's name and two-letter code to the respective lists

In [219]:
for row in rows:
    name = row.td
    country_names.append(name.string)
    code = name.next_sibling.next_sibling
    country_codes.append(code.string)
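
The double next_sibling above is needed because the whitespace between two td tags is itself a node in the parse tree. The sketch below shows this on a hand-written row; the row content is a hypothetical sample, not copied from the site. It also shows find_next_sibling('td'), which skips text nodes and may be a less fragile choice.

```python
from bs4 import BeautifulSoup

# Hypothetical row with a newline between cells, as in typical formatted HTML
SAMPLE_ROW = (
    '<table><tbody><tr>\n'
    '<td>Sweden</td>\n'
    '<td>SE</td>\n'
    '</tr></tbody></table>'
)

soup = BeautifulSoup(SAMPLE_ROW, 'html.parser')
name_cell = soup.tr.td

# The immediate sibling is the newline text node, not the next cell
whitespace = name_cell.next_sibling

# Two hops reach the code cell, which is what the loop above relies on
code_by_hops = name_cell.next_sibling.next_sibling

# find_next_sibling('td') ignores text nodes and returns the next td directly
code_by_tag = name_cell.find_next_sibling('td')
```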

Store the lists into a dataframe

In [220]:
df = pd.DataFrame({'country': country_names, 'code': country_codes})

Save the dataframe to a CSV file without the index

In [221]:
df.to_csv('country info.csv', index=False)
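
One way the two outputs might be combined later is a left merge on the two-letter code. This is a sketch under assumptions: the rows below are made-up sample data standing in for the CSV files, and the merge itself is not part of the original notebook. Note that pandas reads the string NA as a missing value by default, which would silently drop Namibia's code, so keep_default_na=False is passed when reading.

```python
import io

import pandas as pd

# Made-up sample data standing in for the two CSV files written above
leaderboard_csv = io.StringIO(
    'rank,player,country_code\n'
    '1,PlayerOne,SE\n'
    '2,PlayerTwo,NA\n'
    '3,PlayerThree,\n'
)
countries_csv = io.StringIO(
    'country,code\n'
    'Sweden,SE\n'
    'Namibia,NA\n'
)

# keep_default_na=False keeps 'NA' and '' as plain strings instead of NaN
leaderboard = pd.read_csv(leaderboard_csv, keep_default_na=False)
countries = pd.read_csv(countries_csv, keep_default_na=False)

# Left merge keeps every player, including those without a country
merged = leaderboard.merge(
    countries, how='left', left_on='country_code', right_on='code'
).drop(columns='code')
```

Players with an empty country code stay in the result with a missing country name.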