# WIP - Dynamic - Selenium
## IES - Python - Project
### Marathon Results
### Scraping
#### David Koubek, Jiri Zelenka

### Import required packages.

In [1]:
import requests # for robots check
from bs4 import BeautifulSoup # prettify HTML
from selenium import webdriver # scraping JS dynamic elements
from time import sleep # for sleeping (slowing down) inside a function
import random # for random number sleeping
import pandas as pd # for dataframe
import numpy as np # for arrays

### Robots.txt

Are we allowed to scrape?

In [2]:
requests.get('https://www.runczech.com/robots.txt')

<Response [200]>

A 200 response means the request succeeded. Next, let's print the actual robots.txt file to see what is allowed and what is not.

In [3]:
print(requests.get('https://www.runczech.com/robots.txt').text)

#
# robots.txt
#

# exclude these directories
User-agent: *
Disallow: /srv/
Disallow: /cgi/
Allow: /srv/www/qf/*/ramjet/eventList
Allow: /srv/www/qf/*/ramjet/eventVoucherList
Allow: /srv/www/qf/*/ramjet/contactPage
Allow: /srv/www/qf/*/ramjet/raceDetail
Allow: /srv/www/qf/*/ramjet/leagueDetail
Allow: /srv/www/qf/*/ramjet/results/list
Allow: /srv/www/qf/*/ramjet/results/league
Allow: /srv/www/qf/*/ramjet/results/league/detail
Allow: /srv/www/qf/*/ramjet/resultsEventDetail
Allow: /srv/www/qf/*/ramjet/resultsSubEventUserDetail
Allow: /srv/www/qf/*/ramjet/resultsSubEventGroupDetail
Allow: /srv/www/qf/*/ramjet/event/runnerList

Sitemap: https://www.runczech.com/sitemap-cs.xml
Sitemap: https://www.runczech.com/sitemap-en.xml
Sitemap: https://www.runczech.com/sitemap-de.xml
Sitemap: https://www.runczech.com/sitemap-it.xml
Sitemap: https://www.runczech.com/sitemap-fr.xml
Sitemap: https://www.runczech.com/sitemap-es.xml
Sitemap: https://www.runczech.com/sitemap-pl.xml
Sitemap: https://www.runcz

Although "/srv/" is disallowed in general, the more specific Allow rule for "resultsEventDetail" covers the pages we want to scrape, so we can proceed.
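The check above was done by eye. Note that, to our knowledge, Python's built-in `urllib.robotparser` applies rules in file order and does not expand `*` inside paths, so it would stop at `Disallow: /srv/` for this file. Below is a minimal sketch of the longest-match rule described in RFC 9309 instead. `rule_to_regex` and `is_allowed` are hypothetical helpers, the group handling is simplified to the `User-agent: *` section only, and `SAMPLE` is just an excerpt of the file printed above.

```python
import re

# Excerpt of the robots.txt shown above (assumption: the rest of the file does not matter here)
SAMPLE = """
# exclude these directories
User-agent: *
Disallow: /srv/
Disallow: /cgi/
Allow: /srv/www/qf/*/ramjet/results/list
Allow: /srv/www/qf/*/ramjet/resultsEventDetail
Sitemap: https://www.runczech.com/sitemap-en.xml
"""

def rule_to_regex(pattern):
    # Translate a robots.txt path pattern: '*' matches any characters,
    # a trailing '$' anchors the end of the path
    anchored = pattern.endswith('$')
    if anchored:
        pattern = pattern[:-1]
    regex = '.*'.join(re.escape(part) for part in pattern.split('*'))
    return re.compile(regex + ('$' if anchored else ''))

def is_allowed(robots_txt, path):
    # Longest matching rule wins; on a tie, Allow wins; no match means allowed
    best_len, allowed = -1, True
    in_star_group = False
    for raw in robots_txt.splitlines():
        line = raw.split('#', 1)[0].strip()  # drop comments
        if ':' not in line:
            continue
        field, value = (s.strip() for s in line.split(':', 1))
        field = field.lower()
        if field == 'user-agent':
            in_star_group = (value == '*')
        elif in_star_group and field in ('allow', 'disallow') and value:
            if rule_to_regex(value).match(path):
                is_allow = (field == 'allow')
                if len(value) > best_len or (len(value) == best_len and is_allow):
                    best_len, allowed = len(value), is_allow
    return allowed

print(is_allowed(SAMPLE, '/srv/www/qf/en/ramjet/resultsEventDetail?eventId=22166'))
print(is_allowed(SAMPLE, '/srv/private/file'))
```

The first call covers the results pages we scrape below, the second a path that only the general `Disallow: /srv/` rule matches.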

# Scraping JavaScript dynamic website
 - https://www.google.com/search?q=python+scrape+website+that+has+script+inside+html&oq=python+scrape+website+that+has+script+inside+html&aqs=chrome..69i57.14882j0j7&sourceid=chrome&ie=UTF-8
     - https://stackoverflow.com/questions/26680590/how-to-scrape-imbeded-script-on-webpage-in-python
     - https://stanford.edu/~mgorkove/cgi-bin/rpython_tutorials/Scraping_a_Webpage_Rendered_by_Javascript_Using_Python.php
     - https://www.youtube.com/watch?v=FSH77vnOGqU
     - https://www.youtube.com/watch?v=vsmxMLmroyQ

## Selenium

First make sure chromedriver is available in the environment, e.g. on the PATH (download from https://sites.google.com/a/chromium.org/chromedriver/ ); otherwise starting the webdriver raises an error.

### Find all marathon links

The table in the middle of the page is not static HTML: it is loaded dynamically via JavaScript only after the page opens in a browser. So we have to use a dynamic scraping method, e.g. Selenium. Once the rendered page is available, we extract the "a" tags of class "indexList_link", whose "href" attributes hold the URLs of the desired marathon events.

In [4]:
# Selenium scraping

# Working with chrome, first open one Chrome window that'll be displaying our URLs
browser = webdriver.Chrome()

We need to slow down the scraping inside the get_soup function so that the URL is fully loaded in the browser before it is scraped (the JS table takes about 1-2 s to pull its data from the servers). Otherwise the soup object will contain only the static parts of the page and not the dynamic ones we care about.

In [5]:
# Scrapes dynamic webpage content using the Selenium browser, returns a BeautifulSoup object of the rendered page
def get_soup(url):
    # Then navigate browser to desired url and get the source code
    browser.get(url) # navigate to the page

    # Wait a random 1.0-1.5 seconds so the JS table has time to load (1.0 s should be enough) and so our requests look less bot-like
    sleep(random.uniform(1.0, 1.5)) # time in seconds, sleep can take a float value
    
    # Take all the inner code of the displayed webpage
    innerHTML = browser.execute_script("return document.body.innerHTML") #returns the inner HTML as a string
    
    # Clean with BeautifulSoup:
    soup = BeautifulSoup(innerHTML,'lxml')
    return soup
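A fixed sleep either wastes time or is too short on a slow connection. As a hedged alternative, the sketch below polls the page source until a marker string (e.g. the "indexList_link" class name) shows up. `wait_for_marker` is a hypothetical helper, not part of Selenium; it only relies on the `get()` method and `page_source` property that a Selenium browser object provides. Selenium also ships its own explicit waits (`WebDriverWait`), which serve the same purpose.

```python
import time

def wait_for_marker(browser, url, marker, timeout=10.0, poll=0.25):
    # Navigate to the page, then re-read the rendered source until the marker appears
    browser.get(url)
    deadline = time.monotonic() + timeout
    while True:
        html = browser.page_source
        if marker in html:
            return html  # the dynamic part is present, safe to parse
        if time.monotonic() >= deadline:
            raise TimeoutError('marker %r not found within %.1f s' % (marker, timeout))
        time.sleep(poll)

# Hypothetical usage with the browser opened above:
# html = wait_for_marker(browser, url_results, 'indexList_link')
# soup = BeautifulSoup(html, 'lxml')
```

The returned string can be passed to BeautifulSoup exactly like the innerHTML in get_soup.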

In [6]:
# For a given RunCzech Results URL, returns a list of events' URLs (marathons)
def get_all_links(url):
    soup = get_soup(url) # call get_soup function on the desired url and get back the soup from bs (of the dynamic HTML with JS elements loaded)
    a_elements = soup.find_all('a',{'class':'indexList_link'}) # class "indexList_link" contains the href link we desire
    urls_events = ['https://www.runczech.com' + a['href'] for a in a_elements] # list comprehension/function for links, join runczech url with the href ending of the events
    return urls_events
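The string concatenation in get_all_links assumes that every href starts with "/". A slightly more defensive sketch uses `urllib.parse.urljoin` from the standard library, which also leaves hrefs that are already absolute untouched. `absolutize` is a hypothetical helper name.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.runczech.com'

def absolutize(hrefs, base_url=BASE_URL):
    # urljoin resolves relative hrefs against the base and keeps absolute ones as they are
    return [urljoin(base_url, href) for href in hrefs]

links = absolutize([
    '/srv/www/qf/en/ramjet/resultsEventDetail?eventId=22175',
    'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=22166',
])
for link in links:
    print(link)
```

Inside get_all_links this would replace the list comprehension: `absolutize(a['href'] for a in a_elements)`.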

In [7]:
# URL of Results webpage which contains links to marathons
url_results = "https://www.runczech.com/srv/www/qf/en/ramjet/results/list?&page=1&per_page=15"

In [8]:
urls_marathons = get_all_links(url_results)

In [9]:
urls_marathons

['https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=22175',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=22166',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=22163',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=22114',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=21460',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=21453',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=21448',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=21636',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=21443',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=21438',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=21429',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=21426',
 'ht

### Data table from marathon events

#### Scrape a single table/link

In [10]:
# TBD recode this later
url_marathon_2019 = urls_marathons[1]
print(url_marathon_2019)

https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=22166


In [12]:
# Selenium scraping

# Working with chrome, first open one Chrome window that'll be displaying our URLs
browser = webdriver.Chrome()

In [13]:
# Get the whole table of data that interests us
def get_table(url):
    soup = get_soup(url)

    trs = soup.find_all('tr') # "tr" table-row element tag
    
    # Starting column 1
    tds_col_1 = [tr.find('td',{'class':'hidden980'}) for tr in trs] # hidden980 is the class of first column, "Rank in filter"
    tds_col_1 = [x for x in tds_col_1 if x is not None] # drop the None entries (rows where no matching td was present in the tr tag)
    # filter(None, tds_col_1) would also work, but it drops every falsy value (e.g. 0), which is more dangerous in certain situations
    
    # Column 2 - "Rank"
    tds_col_2 = [td.find_next('td') for td in tds_col_1] # find_next returns the next 'td' in document order, here the second column "Rank"
    contents_col_2 = [td.contents[0] for td in tds_col_2] # returns just the text inside tags
    
    # Column 3 - "Name"
    tds_col_3 = [td.find_next('td') for td in tds_col_2]
    contents_col_3 = [td.contents[0] for td in tds_col_3]
    
    # Column 5 - "Chip time"
    tds_col_5 = [td.find_next('td').find_next('td') for td in tds_col_3]
    contents_col_5 = [td.contents[0] for td in tds_col_5]
    
    # Column 6 - "St. number"
    tds_col_6 = [td.find_next('td') for td in tds_col_5]
    contents_col_6 = [td.contents[0] for td in tds_col_6]
    
    # Column 7 - "Nationality"
    tds_col_7 = [td.find_next('td') for td in tds_col_6]
    contents_col_7 = [td.contents[0] for td in tds_col_7]
    
    # Column 8 - "Age cat."
    tds_col_8 = [td.find_next('td') for td in tds_col_7]
    contents_col_8 = [td.contents[0] for td in tds_col_8]
    
    # Merge data
    # https://cmdlinetips.com/2018/01/how-to-create-pandas-dataframe-from-multiple-lists/
    # zip function to merge lists
    table = list(zip(contents_col_2, contents_col_3, contents_col_5, contents_col_6, contents_col_7, contents_col_8))
    
    # Create pandas dataframe
    labels = ['Rank', 'Name', 'Chip time', 'St. number', 'Nationality', 'Age cat.']
    df = pd.DataFrame(table, columns = labels)
    
    return df
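The find_next chain in get_table depends on the cells following each other in document order. As an alternative sketch, each row can be read into a list of cells and the wanted columns picked by position. To keep the example self-contained it uses the standard library's `html.parser` instead of BeautifulSoup. `RowParser`, `WANTED` and the sample markup are assumptions: the real page's HTML is not reproduced here, and the content of the skipped fourth column is unknown.

```python
from html.parser import HTMLParser

class RowParser(HTMLParser):
    # Collects the text of every <td> cell, grouped by <tr> row
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            if self._row:  # header rows contain no <td> and are skipped
                self.rows.append(self._row)
            self._row = None

# Zero-based positions of Rank, Name, Chip time, St. number, Nationality, Age cat.
WANTED = (1, 2, 4, 5, 6, 7)

# Hypothetical markup with the same column order as described above
sample_html = (
    '<table><tr><th>Rank in filter</th><th>Rank</th></tr>'
    '<tr><td class="hidden980">1</td><td>1</td><td>Almahjoub DAZZA</td>'
    '<td>(unknown)</td><td>2:05:58</td><td>1</td><td>BHR</td><td>MAM</td></tr></table>'
)

parser = RowParser()
parser.feed(sample_html)
records = [[row[i] for i in WANTED] for row in parser.rows]
print(records)
```

The resulting list of records can be passed to `pd.DataFrame(records, columns=labels)` in the same way as the zipped lists.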

In [20]:
url_marathon_2019_part_1 = 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=22166&frm.subeventId=22167&page='
url_marathon_2019_part_2 = '&per_page=15&sort=finishTime'
num_of_pages = 487 # TBD recode this to automate it, for now manually set it for one year data
# pages_2019 = np.arange(0, num_of_pages) # get a range of numbers 0-487 (included-excluded, so total of 487 indexes)
marathon_2019_pages = np.empty(num_of_pages, dtype=object) # initialise an empty array of length 487

for i in range(0, num_of_pages):
    marathon_2019_pages[i] = url_marathon_2019_part_1 + str(i + 1) + url_marathon_2019_part_2
    
marathon_2019_pages[-2:] # display the last two links to check them visually

array(['https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=22166&frm.subeventId=22167&page=486&per_page=15&sort=finishTime',
       'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=22166&frm.subeventId=22167&page=487&per_page=15&sort=finishTime'],
      dtype=object)
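The page count is set by hand above. If the total number of result rows is known, the count follows from a ceiling division, and the URLs can be built with a list comprehension. This is only a sketch: `build_page_urls` is a hypothetical helper, and it assumes the total (7300 rows for this event, as the scraped dataframe later shows) can be obtained before scraping, which this notebook does not do yet.

```python
import math

BASE = 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail'

def build_page_urls(event_id, subevent_id, total_rows, per_page=15):
    # Number of pages needed to show total_rows rows, per_page rows at a time
    num_pages = math.ceil(total_rows / per_page)
    return [
        f'{BASE}?eventId={event_id}&frm.subeventId={subevent_id}'
        f'&page={page}&per_page={per_page}&sort=finishTime'
        for page in range(1, num_pages + 1)  # pages are numbered from 1
    ]

pages = build_page_urls(22166, 22167, total_rows=7300)
print(len(pages))
print(pages[-1])
```

With 7300 rows and 15 rows per page this gives the same 487 pages as the manual setting.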

In [21]:
df = pd.DataFrame() # initialise empty df
position = 0 # initialise a page counter

# For loop that scrapes each page of the marathon year and concatenates the data table into one dataframe variable
for page in marathon_2019_pages:
    position += 1 # increment a page counter
    print('Scraping page ' + str(position) + '/' + str(len(marathon_2019_pages)))
    
    # Scrape the data table on this page
    df_add = get_table(page)
    
    # Concatenating table
    # http://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
    frames = [df, df_add]
    df = pd.concat(frames, ignore_index = True) # ignore_index = True ignores original 0-14 indices and makes a new one

print('Done scraping')

Scraping page 1/487
Scraping page 2/487
Scraping page 3/487
Scraping page 4/487
Scraping page 5/487
Scraping page 6/487
Scraping page 7/487
Scraping page 8/487
Scraping page 9/487
Scraping page 10/487
Scraping page 11/487
Scraping page 12/487
Scraping page 13/487
Scraping page 14/487
Scraping page 15/487
Scraping page 16/487
Scraping page 17/487
Scraping page 18/487
Scraping page 19/487
Scraping page 20/487
Scraping page 21/487
Scraping page 22/487
Scraping page 23/487
Scraping page 24/487
Scraping page 25/487
Scraping page 26/487
Scraping page 27/487
Scraping page 28/487
Scraping page 29/487
Scraping page 30/487
Scraping page 31/487
Scraping page 32/487
Scraping page 33/487
Scraping page 34/487
Scraping page 35/487
Scraping page 36/487
Scraping page 37/487
Scraping page 38/487
Scraping page 39/487
Scraping page 40/487
Scraping page 41/487
Scraping page 42/487
Scraping page 43/487
Scraping page 44/487
Scraping page 45/487
Scraping page 46/487
Scraping page 47/487
Scraping page 48/487
S
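With 487 pages, a single failed page load aborts the whole loop. A small retry wrapper, sketched below, would make the run more robust. `fetch_with_retry` is a hypothetical helper, and the number of attempts and the delays are arbitrary assumptions.

```python
import time

def fetch_with_retry(fetch, url, attempts=3, base_delay=1.0):
    # Call fetch(url); on an exception wait and try again, doubling the delay each time
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(base_delay * 2 ** (attempt - 1))

# Hypothetical usage inside the scraping loop:
# df_add = fetch_with_retry(get_table, page)
```

The wrapper re-raises the last error, so a page that keeps failing still stops the loop visibly instead of being skipped silently.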

In [28]:
# Inspect the concatenated data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7300 entries, 0 to 7299
Data columns (total 6 columns):
Rank           7300 non-null object
Name           7300 non-null object
Chip time      7300 non-null object
St. number     7300 non-null object
Nationality    7300 non-null object
Age cat.       7300 non-null object
dtypes: object(6)
memory usage: 342.3+ KB
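All six columns are of dtype object, i.e. plain strings. For any later numeric work the chip time would have to be converted first. A minimal sketch, assuming every value has the form "H:MM:SS" (`chip_time_to_seconds` is a hypothetical helper):

```python
def chip_time_to_seconds(text):
    # Split 'H:MM:SS' into its three parts and convert to a total number of seconds
    hours, minutes, seconds = (int(part) for part in text.strip().split(':'))
    return hours * 3600 + minutes * 60 + seconds

print(chip_time_to_seconds('2:05:58'))  # 7558

# Hypothetical usage on the dataframe:
# df['Chip time (s)'] = df['Chip time'].map(chip_time_to_seconds)
```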


In [29]:
df.head()

|   | Rank | Name | Chip time | St. number | Nationality | Age cat. |
|---|---|---|---|---|---|---|
| 0 | 1 | Almahjoub DAZZA | 2:05:58 | 1 | BHR | MAM |
| 1 | 2 | Dawit WOLDE | 2:06:18 | 12 | ETH | MAM |
| 2 | 3 | Aychew BANTIE | 2:06:23 | 7 | ETH | MAM |
| 3 | 4 | Amos KIPRUTO | 2:06:46 | 2 | KEN | MAM |
| 4 | 5 | Solomon Kirwa YEGO | 2:07:30 | 3 | KEN | MAM |


In [30]:
df.tail()

|   | Rank | Name | Chip time | St. number | Nationality | Age cat. |
|---|---|---|---|---|---|---|
| 7295 | 7296 | Jaroslav Sopuch | 6:45:30 | 4652 | SVK | M65 |
| 7296 | 7297 | Dong Tran | 6:39:37 | 7542 | VNM | MAM |
| 7297 | 7298 | Jiří Přidal | 6:57:39 | 6743 | CZE | M65 |
| 7298 | 7299 | Iva Valentová | 6:42:54 | F2060 | CZE | W45 |
| 7299 | 7300 | EHUD AVNI | 6:59:04 | 7331 | ISR | M50 |


#### Save the data to a file

In [25]:
# Save the data
df.to_csv('Data_Marathons_Prague/data_2019.csv', index = False) # "index = False" avoids saving the index column which is duplicated once loaded
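Note that `to_csv` does not create missing folders, so the call fails if "Data_Marathons_Prague" does not exist yet. A small sketch that creates the folder first (`ensure_parent_dir` is a hypothetical helper):

```python
from pathlib import Path

def ensure_parent_dir(file_path):
    # Create the folder that will contain file_path (no error if it already exists)
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path

# Hypothetical usage before saving:
# df.to_csv(ensure_parent_dir('Data_Marathons_Prague/data_2019.csv'), index=False)
```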

In [27]:
# Test load the saved data
df_loaded = pd.read_csv('Data_Marathons_Prague/data_2019.csv')
df_loaded

|   | Rank | Name | Chip time | St. number | Nationality | Age cat. |
|---|---|---|---|---|---|---|
| 0 | 1 | Almahjoub DAZZA | 2:05:58 | 1 | BHR | MAM |
| 1 | 2 | Dawit WOLDE | 2:06:18 | 12 | ETH | MAM |
| 2 | 3 | Aychew BANTIE | 2:06:23 | 7 | ETH | MAM |
| 3 | 4 | Amos KIPRUTO | 2:06:46 | 2 | KEN | MAM |
| 4 | 5 | Solomon Kirwa YEGO | 2:07:30 | 3 | KEN | MAM |
| 5 | 6 | Hamid Ben DAOUD | 2:08:14 | 9 | ESP | MAM |
| 6 | 7 | Paul Muchemi MAINA | 2:09:17 | 4 | KEN | MAM |
| 7 | 8 | Girmaw AMARE | 2:09:54 | 19 | ISR | MAM |
| 8 | 9 | Nicodemus Kipkurui KIMUTAI | 2:10:00 | 17 | KEN | MAM |
| 9 | 10 | Goitom KIFLE | 2:10:18 | 15 | ERI | MAM |
