# WIP - Dynamic - Selenium
## IES - Python - Project
### Marathon Results Analysis
#### David Koubek, Jiri Zelenka

#### Import required packages.

In [1]:
import requests # for robots check
from bs4 import BeautifulSoup # parse HTML
from selenium import webdriver # scraping JS dynamic elements
from time import sleep # for sleeping (slowing down) inside a function

### Robots.txt

Are we allowed to scrape?

In [2]:
requests.get('https://www.runczech.com/robots.txt')

<Response [200]>

The response code 200 means the request succeeded. Let's print the actual robots.txt file to see what is allowed and what is not.

In [3]:
print(requests.get('https://www.runczech.com/robots.txt').text)

#
# robots.txt
#

# exclude these directories
User-agent: *
Disallow: /srv/
Disallow: /cgi/
Allow: /srv/www/qf/*/ramjet/eventList
Allow: /srv/www/qf/*/ramjet/eventVoucherList
Allow: /srv/www/qf/*/ramjet/contactPage
Allow: /srv/www/qf/*/ramjet/raceDetail
Allow: /srv/www/qf/*/ramjet/leagueDetail
Allow: /srv/www/qf/*/ramjet/results/list
Allow: /srv/www/qf/*/ramjet/results/league
Allow: /srv/www/qf/*/ramjet/results/league/detail
Allow: /srv/www/qf/*/ramjet/resultsEventDetail
Allow: /srv/www/qf/*/ramjet/resultsSubEventUserDetail
Allow: /srv/www/qf/*/ramjet/resultsSubEventGroupDetail
Allow: /srv/www/qf/*/ramjet/event/runnerList

Sitemap: https://www.runczech.com/sitemap-cs.xml
Sitemap: https://www.runczech.com/sitemap-en.xml
Sitemap: https://www.runczech.com/sitemap-de.xml
Sitemap: https://www.runczech.com/sitemap-it.xml
Sitemap: https://www.runczech.com/sitemap-fr.xml
Sitemap: https://www.runczech.com/sitemap-es.xml
Sitemap: https://www.runczech.com/sitemap-pl.xml
Sitemap: https://www.runcz

The "resultsEventDetail" path, which we want to scrape, is explicitly allowed, and so is "results/list", which holds the links to the events. We can proceed.
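The Allow lines above are exceptions to the broader "Disallow: /srv/" rule. A minimal sketch of how such rules can be evaluated, assuming the longest-match convention of RFC 9309 (the most specific matching pattern wins, and Allow wins a tie). The helper names `rule_to_regex` and `is_allowed` are hypothetical, and only a few of the rules printed above are copied in:

```python
import re

def rule_to_regex(rule_path):
    # '*' matches any sequence of characters; a trailing '$' anchors the end
    anchored = rule_path.endswith('$')
    if anchored:
        rule_path = rule_path[:-1]
    pattern = '.*'.join(re.escape(part) for part in rule_path.split('*'))
    return re.compile(pattern + ('$' if anchored else ''))

def is_allowed(path, rules):
    """rules: list of (directive, path_pattern) pairs for one user agent."""
    best_len, allowed = -1, True  # no matching rule means the path is allowed
    for directive, pattern in rules:
        if rule_to_regex(pattern).match(path):
            is_allow = directive.lower() == 'allow'
            # The longest pattern wins; on equal length, Allow wins
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed

# A subset of the rules from the robots.txt output above
rules = [
    ('Disallow', '/srv/'),
    ('Disallow', '/cgi/'),
    ('Allow', '/srv/www/qf/*/ramjet/results/list'),
    ('Allow', '/srv/www/qf/*/ramjet/resultsEventDetail'),
]

print(is_allowed('/srv/www/qf/en/ramjet/resultsEventDetail?eventId=22163', rules))  # True
print(is_allowed('/srv/other', rules))  # False
```

This is only an illustration of the matching logic, not a full robots.txt parser (it ignores user-agent groups and percent-encoding).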

# Scraping JavaScript dynamic website
 - https://www.google.com/search?q=python+scrape+website+that+has+script+inside+html&oq=python+scrape+website+that+has+script+inside+html&aqs=chrome..69i57.14882j0j7&sourceid=chrome&ie=UTF-8
     - https://stackoverflow.com/questions/26680590/how-to-scrape-imbeded-script-on-webpage-in-python
     - https://stanford.edu/~mgorkove/cgi-bin/rpython_tutorials/Scraping_a_Webpage_Rendered_by_Javascript_Using_Python.php
     - https://www.youtube.com/watch?v=FSH77vnOGqU
     - https://www.youtube.com/watch?v=vsmxMLmroyQ

## Selenium

First make sure chromedriver is installed and available in the environment, i.e. on the PATH (download from https://sites.google.com/a/chromium.org/chromedriver/ ), otherwise the webdriver scraping raises an error.

### Find all marathon links

The table in the middle of the results page is not static HTML; it is loaded dynamically via JavaScript only after the page opens in the browser. So we have to use a dynamic scraping method, e.g. Selenium. Once the rendered code is scraped, we collect the "a" tags of class "indexList_link", whose "href" attributes contain the links to the desired marathon events.

We need to slow down the scraping inside the get_soup function so that the page is fully loaded in the browser before it is scraped (the JS table takes about 1-2s to pull data from the servers). Otherwise the soup object will contain only the static parts of the website and not the dynamic ones we care about.

In [4]:
# Scrapes dynamic webpage content using a Selenium-driven browser, returns a BeautifulSoup object of the rendered page
def get_soup(url):
    # Working with Chrome, first open a window
    browser = webdriver.Chrome()
    try:
        # Then navigate the browser to the desired url
        browser.get(url) # navigate to the page

        # Wait 1-2s (1s might just be enough, but it is safer to stay closer to 2s)
        sleep(2) # time in seconds, can also take a float value

        # Take all the inner code of the displayed webpage
        innerHTML = browser.execute_script("return document.body.innerHTML") # returns the inner HTML as a string
    finally:
        # Close the browser even if loading fails, so windows do not pile up
        browser.quit()

    # Parse with BeautifulSoup:
    soup = BeautifulSoup(innerHTML,'lxml')
    return soup
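
A fixed sleep(2) either wastes time or is too short when the server is slow. A possible alternative is to poll until the dynamic content has appeared. The sketch below is a generic helper, not part of the notebook: the name `wait_for` and its parameters are assumptions, and the page load is simulated so the block runs without a browser. Selenium also ships its own `WebDriverWait` class for the same purpose.

```python
from time import monotonic, sleep

def wait_for(condition, timeout=10.0, poll=0.25):
    """Call condition() repeatedly until it returns a truthy value, then return it.

    Raises TimeoutError if nothing truthy is returned within `timeout` seconds.
    """
    deadline = monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if monotonic() >= deadline:
            raise TimeoutError('condition not met within %.1f s' % timeout)
        sleep(poll)

# Simulated page: the dynamic table "appears" on the third check
checks = []
def table_loaded():
    checks.append(1)
    return len(checks) >= 3

wait_for(table_loaded, timeout=2, poll=0.01)
print(len(checks))  # 3
```

Inside get_soup, the condition could be something like `lambda: 'indexList_link' in browser.page_source`, which replaces the fixed pause with a check that returns as soon as the links are present.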

In [5]:
# For a given RunCzech Results URL, returns a list of events' URLs (marathons)
def get_all_links(url):
    soup = get_soup(url) # call get_soup function on the desired url and get back the soup from bs (of the dynamic HTML with JS elements loaded)
    a_elements = soup.find_all('a',{'class':'indexList_link'}) # class "indexList_link" contains the href link we desire
    urls_events = ['https://www.runczech.com' + a['href'] for a in a_elements] # list comprehension/function for links, join runczech url with the href ending of the events
    return urls_events
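
The link extraction can also be kept separate from the browser, which makes it easy to try on a saved or hand-written HTML string. The following is a rough sketch under that assumption: `extract_event_links` and the sample HTML are hypothetical, `urljoin` replaces the string concatenation, and `href=True` skips anchors without an href (where `a['href']` would raise a KeyError).

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = 'https://www.runczech.com'

def extract_event_links(html, base_url=BASE_URL):
    # html.parser is the built-in parser, so lxml is not needed here
    soup = BeautifulSoup(html, 'html.parser')
    # Keep only anchors of the expected class that actually carry an href
    anchors = soup.find_all('a', class_='indexList_link', href=True)
    return [urljoin(base_url, a['href']) for a in anchors]

# Hand-written sample, not the real page source
sample_html = '''
<a class="indexList_link" href="/srv/www/qf/en/ramjet/resultsEventDetail?eventId=22175">Event</a>
<a class="indexList_link">anchor without href</a>
<a class="other" href="/srv/www/qf/en/ramjet/contactPage">Contact</a>
'''

print(extract_event_links(sample_html))
# ['https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=22175']
```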

In [6]:
# URL of Results webpage which contains links to marathons
url_results = "https://www.runczech.com/srv/www/qf/en/ramjet/results/list?&page=1&per_page=15"

In [7]:
urls_marathons = get_all_links(url_results)

In [8]:
urls_marathons

['https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=22175',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=22166',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=22163',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=22114',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=21460',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=21453',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=21448',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=21636',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=21443',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=21438',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=21429',
 'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=21426',
 'ht

### Data table from marathon events

#### Scrape a single table

Coded for one link/table for now:

In [9]:
url_marathon_2019 = urls_marathons[2]
url_marathon_2019

'https://www.runczech.com/srv/www/qf/en/ramjet/resultsEventDetail?eventId=22163'

In [10]:
# DELETE LATER, for playing around with simple one find cases
def get_names(url):
    soup = get_soup(url)

    tr = soup.find('tr') # "tr" table-row element tag
    
    el = tr.find('th',{'class':'hidden767'}).find_next('th').contents[0]
    
    return tr
#     return el

names = get_names(url_marathon_2019)
print(names)

<tr>
<th class="hidden767">Avatar</th>
<th>Name</th>
<th>Avg. chip<br/>time</th>
<th>Members<br/>- in group</th>
<th class="hidden767">Members<br/>- participants</th>
<th class="hidden767">Group type</th>
</tr>


In [11]:
def get_names(url):
    soup = get_soup(url)

    trs = soup.find_all('tr') # "tr" table-row element tag
    
    tds = [tr.find('td',{'class':'hidden980'}) for tr in trs] # hidden980 is class of first column
    tds = [x for x in tds if x is not None] # filter out the None elements in tds (rows without such a td, e.g. the header row)
    # could also use filter(None, tds), but that drops every falsy value (e.g. 0 as well), which is more dangerous in certain situations
    tds_sibsibling = [td.find_next('td').find_next('td') for td in tds] # second 'td' after the hidden980 one, which holds the name
    contents = [td_sibling.contents[0] for td_sibling in tds_sibsibling] # returns just the text inside tags
    
    return contents # names are two cells after the hidden980 class element/column

In [12]:
# Get "Official time" column
def get_times(url):
    soup = get_soup(url)

    trs = soup.find_all('tr') # "tr" table-row element tag
    
    tds = [tr.find('td',{'class':'hidden767'}) for tr in trs] # first hidden767 cell in each row, it comes right after "Official time"
    tds = [x for x in tds if x is not None] # filter out the None elements in tds (rows without such a td, e.g. the header row)
    # could also use filter(None, tds), but that drops every falsy value (e.g. 0 as well), which is more dangerous in certain situations
    tds_sibling = [td.find_previous('td') for td in tds] # the 'td' just before the hidden767 one, in this case the "Official time"
    contents = [td_sibling.contents[0] for td_sibling in tds_sibling] # returns just the text inside tags
    
    return contents
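
get_names and get_times each call get_soup, so the same page is loaded in a browser twice. A possible variation is to read both columns from one soup object in a single pass. The sketch below assumes the row layout implied by the two functions above (name two cells after the hidden980 cell, time just before the first hidden767 cell). The function `parse_rows` and the sample table are hypothetical, and the cells marked "x" are placeholders.

```python
from bs4 import BeautifulSoup

def parse_rows(soup):
    rows = []
    for tr in soup.find_all('tr'):
        first = tr.find('td', {'class': 'hidden980'})
        marker = tr.find('td', {'class': 'hidden767'})
        if first is None or marker is None:
            continue  # e.g. header rows, which contain 'th' cells only
        name_td = first.find_next('td').find_next('td')  # two cells after the first column
        time_td = marker.find_previous('td')             # cell just before the hidden767 one
        rows.append((name_td.get_text(strip=True), time_td.get_text(strip=True)))
    return rows

# Hand-written sample mimicking the assumed layout, not the real page source
sample = '''
<table>
<tr><th class="hidden980">x</th><th>x</th><th>Name</th><th>Official time</th><th class="hidden767">x</th></tr>
<tr><td class="hidden980">x</td><td>x</td><td>Benard KIMELI</td><td>0:59:07</td><td class="hidden767">x</td></tr>
<tr><td class="hidden980">x</td><td>x</td><td>Felix KIBITOK</td><td>0:59:08</td><td class="hidden767">x</td></tr>
</table>
'''

print(parse_rows(BeautifulSoup(sample, 'html.parser')))
# [('Benard KIMELI', '0:59:07'), ('Felix KIBITOK', '0:59:08')]
```

Returning pairs also keeps each name attached to its own time, instead of relying on two separate lists staying in the same order.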

In [15]:
names_2019 = get_names(url_marathon_2019)
names_2019

['Benard KIMELI',
 'Felix KIBITOK',
 'Stephen KIPROP',
 'Geoffrey Kimutai KOECH',
 'Henry RONO',
 'Moses KIBET',
 'Moses Kipngetich KEMEI',
 'Yohanes GHEBREGERGIS',
 'Ishmael Chelanga KALALE',
 'Philimon Kipkorir MARITIM',
 'Abel KIPCHUMBA',
 'Jiří HOMOLÁČ',
 'Felix BOUR',
 'Igor OLEFIRENKO',
 'Caroline Chepkoech KIPKIRUI']

In [16]:
times_2019 = get_times(url_marathon_2019)
times_2019

['0:59:07',
 '0:59:08',
 '0:59:20',
 '1:00:30',
 '1:00:37',
 '1:00:59',
 '1:01:19',
 '1:01:44',
 '1:01:45',
 '1:02:04',
 '1:02:07',
 '1:04:03',
 '1:04:18',
 '1:04:23',
 '1:05:44']
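
The times come back as strings, so they need to be converted before any analysis. A small sketch of one way to do that, using the first three names and times from the outputs above. The helper `parse_time` is hypothetical and assumes every value has the form H:MM:SS.

```python
from datetime import timedelta

def parse_time(text):
    # Split 'H:MM:SS' into its three integer parts
    hours, minutes, seconds = (int(part) for part in text.split(':'))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

names = ['Benard KIMELI', 'Felix KIBITOK', 'Stephen KIPROP']
times = ['0:59:07', '0:59:08', '0:59:20']

# Pair each runner with a parsed time
results = [(name, parse_time(t)) for name, t in zip(names, times)]

winner_time = results[0][1]
for name, t in results:
    gap = t - winner_time  # time behind the first runner
    print(name, int(t.total_seconds()), int(gap.total_seconds()))
# Benard KIMELI 3547 0
# Felix KIBITOK 3548 1
# Stephen KIPROP 3560 13
```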