# WIP - Dynamic - Selenium
## IES - Python - Project
### Marathon Results Analysis
#### David Koubek, Jiri Zelenka

#### Import required packages.

In [1]:
import requests # for robots check
from bs4 import BeautifulSoup # prettify HTML
from selenium import webdriver # scraping JS dynamic elements
from time import sleep # for sleeping (slowing down) inside a function
# from tqdm import tqdm
# from IPython.core.debugger import Tracer

### Robots.txt

Are we allowed to scrape?

In [2]:
requests.get('https://www.runczech.com/robots.txt')

<Response [200]>

A status code of 200 means the request succeeded. Let's look at the robots.txt file itself to see what is allowed and what is not.

In [3]:
print(requests.get('https://www.runczech.com/robots.txt').text)

#
# robots.txt
#

# exclude these directories
User-agent: *
Disallow: /srv/
Disallow: /cgi/
Allow: /srv/www/qf/*/ramjet/eventList
Allow: /srv/www/qf/*/ramjet/eventVoucherList
Allow: /srv/www/qf/*/ramjet/contactPage
Allow: /srv/www/qf/*/ramjet/raceDetail
Allow: /srv/www/qf/*/ramjet/leagueDetail
Allow: /srv/www/qf/*/ramjet/results/list
Allow: /srv/www/qf/*/ramjet/results/league
Allow: /srv/www/qf/*/ramjet/results/league/detail
Allow: /srv/www/qf/*/ramjet/resultsEventDetail
Allow: /srv/www/qf/*/ramjet/resultsSubEventUserDetail
Allow: /srv/www/qf/*/ramjet/resultsSubEventGroupDetail
Allow: /srv/www/qf/*/ramjet/event/runnerList

Sitemap: https://www.runczech.com/sitemap-cs.xml
Sitemap: https://www.runczech.com/sitemap-en.xml
Sitemap: https://www.runczech.com/sitemap-de.xml
Sitemap: https://www.runczech.com/sitemap-it.xml
Sitemap: https://www.runczech.com/sitemap-fr.xml
Sitemap: https://www.runczech.com/sitemap-es.xml
Sitemap: https://www.runczech.com/sitemap-pl.xml
Sitemap: https://www.runcz

Everything under /srv/ is disallowed by default, but the "resultsEventDetail" path we want to scrape is explicitly allowed (as is "results/list"), so we can proceed.
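
The same check can be sketched in code. The snippet below is a minimal illustration, not a full robots.txt parser: it hard-codes a few of the rules printed above and assumes the common "longest matching pattern wins" interpretation, with `*` treated as a wildcard. The names `RULES`, `rule_to_regex` and `is_allowed` are made up for this example.

```python
import re

# A few rules copied from the robots.txt output above (User-agent: *)
RULES = [
    ("Disallow", "/srv/"),
    ("Disallow", "/cgi/"),
    ("Allow", "/srv/www/qf/*/ramjet/results/list"),
    ("Allow", "/srv/www/qf/*/ramjet/resultsEventDetail"),
]

def rule_to_regex(pattern):
    # '*' matches any run of characters; a trailing '$' anchors the end of the path
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(path, rules=RULES):
    # Assumption: the longest matching pattern wins, ties go to Allow,
    # and a path that matches no rule is allowed
    best = None  # (pattern length, is_allow)
    for kind, pattern in rules:
        if rule_to_regex(pattern).match(path):
            candidate = (len(pattern), kind == "Allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

print(is_allowed("/srv/www/qf/cs/ramjet/resultsEventDetail?eventId=22175"))
print(is_allowed("/srv/www/qf/cs/ramjet/somethingElse"))
```

Under these assumptions the first path is allowed, because the long Allow pattern beats the short `Disallow: /srv/`, and the second is not.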

# Scraping JavaScript dynamic website
 - https://www.google.com/search?q=python+scrape+website+that+has+script+inside+html&oq=python+scrape+website+that+has+script+inside+html&aqs=chrome..69i57.14882j0j7&sourceid=chrome&ie=UTF-8
     - https://stackoverflow.com/questions/26680590/how-to-scrape-imbeded-script-on-webpage-in-python
     - https://stanford.edu/~mgorkove/cgi-bin/rpython_tutorials/Scraping_a_Webpage_Rendered_by_Javascript_Using_Python.php
     - https://www.youtube.com/watch?v=FSH77vnOGqU
     - https://www.youtube.com/watch?v=vsmxMLmroyQ

## Selenium

In [4]:
url_results = "https://www.runczech.com/srv/www/qf/cs/ramjet/results/list?&page=1&per_page=15"
url_marathon = "https://www.runczech.com/srv/www/qf/cs/ramjet/resultsEventDetail?eventId=22175&frm.subeventId=22176&page=1&per_page=15&sort=finishTime"

In [5]:
# DELETE LATER, just for playing around
# Simple sleep for 1s
def countdown(from_number):
    if from_number < 1:
        print("Done")
    else:
        print(from_number)
        sleep(1)
        countdown(from_number - 1)
        
countdown(3)

3
2
1
Done


First make sure chromedriver is installed and on the PATH (download from https://sites.google.com/a/chromium.org/chromedriver/ ); otherwise webdriver.Chrome() raises an error.

### Events' URLs

We need to pause inside the get_soup function so that the page is fully loaded in the browser before we read it (the JS table takes about 1-2s to pull its data from the server). Otherwise the soup object will contain only the static parts of the page and not the dynamic ones we care about.
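
A fixed pause is the simplest option. An alternative is to poll until the content has actually appeared, with a timeout as a safety net. The helper below is a plain-Python sketch of that idea; `wait_until` and `table_loaded` are hypothetical names, and the "page" here is simulated so the example runs without a browser.

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.25):
    """Call condition() repeatedly until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll_interval)

# Stand-in for "has the JS table appeared yet?": becomes truthy on the third check
checks = {"count": 0}

def table_loaded():
    checks["count"] += 1
    return checks["count"] >= 3

wait_until(table_loaded, timeout=2.0, poll_interval=0.01)
print(checks["count"])
```

With a real browser, the condition could be something like `lambda: browser.find_elements("css selector", "a.indexList_link")`, which returns an empty (falsy) list until the links exist. Selenium also ships its own `WebDriverWait` class in `selenium.webdriver.support.ui` for the same purpose.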

In [6]:
# Scrapes dynamic webpage content using a Selenium browser, returns a BeautifulSoup object of the rendered page
def get_soup(url):
    # Working with Chrome, first open a window
    browser = webdriver.Chrome()
    try:
        # Navigate the browser to the requested url (the function argument, not the global url_results)
        browser.get(url)

        # Wait 1-2s (1s might just be enough, but better to be safe and stay closer to 2s)
        sleep(2) # time in seconds, can also take a float value

        # Take all the inner code of the displayed webpage
        innerHTML = browser.execute_script("return document.body.innerHTML") # returns the inner HTML as a string
    finally:
        browser.quit() # close the browser window even if something above fails

    # Parse with BeautifulSoup:
    soup = BeautifulSoup(innerHTML, 'lxml')
    return soup

In [7]:
# For a given RunCzech Results URL, returns a list of events' URLs (marathons)
def get_all_links(url):
    soup = get_soup(url) # call get_soup function on the desired url and get back the soup from bs (of the dynamic HTML with JS elements loaded)
    a_elements = soup.find_all('a',{'class':'indexList_link'}) # class "indexList_link" contains the href link we desire
    urls_events = ['https://www.runczech.com' + a['href'] for a in a_elements] # list comprehension/function for links, join runczech url with the href ending of the events
    return urls_events
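
As a side note on building the links: plain string concatenation works as long as every href starts with "/". A slightly more defensive sketch uses `urljoin` from the standard library, which also copes with hrefs that are already absolute. The href values below are assumed for illustration, and `absolute_links` is a hypothetical helper name.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.runczech.com"

def absolute_links(hrefs, base=BASE_URL):
    # urljoin resolves each href against the base, similar to what a browser does
    return [urljoin(base, href) for href in hrefs]

hrefs = [
    "/srv/www/qf/cs/ramjet/resultsEventDetail?eventId=22175",  # site-relative href (assumed form)
    "https://www.runczech.com/srv/www/qf/cs/ramjet/resultsEventDetail?eventId=22166",  # already absolute (hypothetical case)
]
links = absolute_links(hrefs)
for link in links:
    print(link)
```

Both entries come out as full URLs on www.runczech.com, whereas simple concatenation would double the domain in the second case.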

In [8]:
marathons_links = get_all_links(url_results)

In [9]:
marathons_links

['https://www.runczech.com/srv/www/qf/cs/ramjet/resultsEventDetail?eventId=22175',
 'https://www.runczech.com/srv/www/qf/cs/ramjet/resultsEventDetail?eventId=22166',
 'https://www.runczech.com/srv/www/qf/cs/ramjet/resultsEventDetail?eventId=22163',
 'https://www.runczech.com/srv/www/qf/cs/ramjet/resultsEventDetail?eventId=22114',
 'https://www.runczech.com/srv/www/qf/cs/ramjet/resultsEventDetail?eventId=21460',
 'https://www.runczech.com/srv/www/qf/cs/ramjet/resultsEventDetail?eventId=21453',
 'https://www.runczech.com/srv/www/qf/cs/ramjet/resultsEventDetail?eventId=21448',
 'https://www.runczech.com/srv/www/qf/cs/ramjet/resultsEventDetail?eventId=21636',
 'https://www.runczech.com/srv/www/qf/cs/ramjet/resultsEventDetail?eventId=21443',
 'https://www.runczech.com/srv/www/qf/cs/ramjet/resultsEventDetail?eventId=21438',
 'https://www.runczech.com/srv/www/qf/cs/ramjet/resultsEventDetail?eventId=21429',
 'https://www.runczech.com/srv/www/qf/cs/ramjet/resultsEventDetail?eventId=21426',
 'ht

In [10]:
marathon_2019 = marathons_links[2]
marathon_2019

'https://www.runczech.com/srv/www/qf/cs/ramjet/resultsEventDetail?eventId=22163'

### Data table from event URL