# The Race Around The Netherlands
Part 2: webscraping the leaderboards

There are 4 editions of the race. The leaderboards with the checkpoint timings can be found here:

2018: https://ratn2018.legendstracking.com/#

2019: https://ratn2019.legendstracking.com/# 

2020: https://ratn2020.legendstracking.com/#

2021: https://ratn2021.legendstracking.com/# AND/OR
https://www.dotwatcher.cc/race/race-around-the-netherlands-2020?reverse=true

Unfortunately, the table I need is rendered by JavaScript and only becomes visible after clicking a button, so I cannot simply fetch the page and parse it with BeautifulSoup. I first have to click the button using Selenium and, as I'm working in Chrome, the Chrome webdriver.
- selenium: https://pypi.org/project/selenium/
- chrome webdriver: https://chromedriver.chromium.org/getting-started

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re
import datetime

In [3]:
#for scraping JavaScript-rendered pages
import selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
# from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.support.expected_conditions import presence_of_element_located
# import time
# import sys

## RATN 2018

In [None]:
url = 'https://ratn2018.legendstracking.com/#'

#run Chrome headless (in the background, without opening a browser window)
chrome_options = Options()
chrome_options.add_argument("--headless")

#insert absolute path of chrome driver
driver = webdriver.Chrome('C:/Users/jetsa/chromedriver.exe', options = chrome_options)
driver.get(url)

#### Click the right button
The buttons are placed here in the code:

`<!-- nav tabs -->`

`<li id="leaderboard-icon" class="active">`

I search for the button using the tab's id: "leaderboard-icon"

In [None]:
# Find and click the right button, using the tab's id
button = driver.find_element(By.ID, "leaderboard-icon")
button.click()

In [None]:
#From here on you can use BS4. The page source now also contains the HTML code of the table that became visible by clicking the button.
source = driver.page_source

#shut down the webdriver that runs in the background (quit() also ends the chromedriver process)
driver.quit()

soup = BeautifulSoup(source, 'html')
print(soup.prettify())

#### Find the right tables
There are 4 tables in the code.

- The first one contains nothing of interest

- The second one contains nothing of interest either

- The third one contains the headers (Start, Timing 1, Timing 2 etc) and timings of solo riders

- The fourth one contains the headers (Start, Timing 1, Timing 2 etc) and timings of rider-pairs

I'll start with the solo riders.

In [None]:
#pick the third table (index 2). This one contains information about the solo riders.
table_solo = soup.find_all('table')[2]
table_solo

We have another challenge:
there is no good character on which to split the text of a row into columns. Splitting the plain text becomes a horrible mess: double names end up in time slots, etc.
Also, you cannot isolate the icon that displays the sex.

Solution: convert each table row into one long string. This way you can both grab the description of the icon (mars or venus) and split the columns on the closing `</td>`.
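To illustrate the idea on a single row, here is a minimal sketch. The row below is a shortened, made-up example, and the icon class name (`fa fa-mars`) is an assumption about the markup, not copied from the page.

```python
# Hypothetical, shortened leaderboard row; real rows have more cells,
# and the icon class name ("fa fa-mars") is an assumption.
row = (
    '<tr><td>1)</td>'
    '<td><i aria-hidden="true" class="fa fa-mars"></i></td>'
    '<td>Robison</td><td>Kevin</td>'
    '<td><b>Wed 8:00</b></td></tr>'
)

# Splitting on the closing tag keeps every cell intact, including the icon markup.
cells = row.split("</td>")

for position, cell in enumerate(cells):
    print(position, cell)

# The icon cell still carries the class name, so the sex can be read from it.
icon_cell = cells[1]
is_male = "mars" in icon_cell
```

The leftover opening tags (`<tr>`, `<td>`, `<b>`) stay in each piece; they are removed later in the cleaning step.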

In [None]:
rows_solo = table_solo.findChildren('tr')

#convert each element into a string to prevent a horrible mess
string_rows = [str(row) for row in rows_solo]
print(string_rows[7])

In [None]:
#convert the string+rows into a dataframe. Name the column "riders"
solo_riders = pd.DataFrame(string_rows)
solo_riders.columns=["riders"]
solo_riders.head()

In [None]:
#split the strings on </td> 
solo_riders = solo_riders["riders"].str.split("</td>", expand = True)
solo_riders.head()

In [None]:
# CUSTOMIZE drop useless columns: colnr 1; 2; 17
solo_riders.drop(solo_riders.columns[[1,2,17]], axis=1, inplace = True)
solo_riders.head()

In [None]:
#rename columns
solo_riders.columns=["Place", "MarsVenus","Lastname", "Firstname", "StartTiming", "Timing1", "Timing2", "Timing3", "Timing4","Timing5", "Timing6", "Timing7", "Timing8", "Timing9", "TimingFinish"]

In [None]:
#CUSTOMIZE drop useless rows
solo_riders = solo_riders.drop([0, 16, 24]).reset_index()
solo_riders.head()

#### Extract the gender of each rider from the icon information.

In [None]:
#extract the gender
gender = []

for row in solo_riders["MarsVenus"]:
    if 'mars' in row:
        gender.append('male')
    if 'venus' in row:
        gender.append('female')

# Add gender as a column to the solo_riders dataframe and drop the MarsVenus column and the automatically created index column
solo_riders["Gender"] = gender
solo_riders = solo_riders.drop(columns=["MarsVenus", "index"])
solo_riders.head()
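One weak spot of the loop above: a row whose icon cell contains neither word adds nothing to the list, and a cell containing both words adds two entries. In either case the list no longer has one entry per row and the column assignment fails. A minimal sketch of a stricter variant follows; `classify_icon` is a hypothetical helper name and the `"unknown"` label is my own choice.

```python
def classify_icon(cell):
    """Map the icon markup of one table cell to a gender label."""
    cell = cell or ""
    # Check the most specific name first: "venus-mars" also contains "mars".
    if "venus-mars" in cell:
        return "mixed"
    if "venus" in cell:
        return "female"
    if "mars" in cell:
        return "male"
    # Always return something, so there is exactly one label per row.
    return "unknown"


labels = [classify_icon(c) for c in ["fa-mars", "fa-venus-double", "fa-venus-mars", None]]
print(labels)
```

With a dataframe this could be applied as `solo_riders["MarsVenus"].map(classify_icon)`, which returns exactly one label per row.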

In [None]:
#CUSTOMIZE append columns with start date&time, and whether it's a solo or duo ride.
StartDate = datetime.datetime(2018, 5, 1, 8)
solo_riders.insert(0, 'Solo or Duo', 'Solo')
solo_riders.insert(0, 'StartDate', StartDate)

solo_riders.head()

##  Prepare the duo-table
Prepare the duo-table, so the cleaning can then be done at the same time

In [None]:
#grab the duo table
table_duo = soup.find_all('table')[3]
table_duo

In [None]:
#grab the rows
rows_duo = table_duo.findChildren('tr')

#convert each element into a string to prevent an even more horrible mess
string_rows = [str(row) for row in rows_duo]
print(string_rows[3])

In [None]:
#convert the string+rows into a dataframe. Name the column "riders"
duo_riders = pd.DataFrame(string_rows)
duo_riders.columns=["riders"]
duo_riders.head()

#split the strings on </td>
duo_riders = duo_riders["riders"].str.split("</td>", expand = True)
duo_riders.head()

In [None]:
# CUSTOMIZE drop useless columns: colnr 1; 2; 17
duo_riders = duo_riders.drop(duo_riders.columns[[1,2,17]], axis=1)
duo_riders

#rename columns
duo_riders.columns=["Place", "MarsVenus","Lastname", "Firstname", "StartTiming", "Timing1", "Timing2", "Timing3", "Timing4","Timing5", "Timing6", "Timing7", "Timing8", "Timing9", "TimingFinish"]

#check which rows are empty/useless and you should drop in the next cell. 
duo_riders.head()

In [None]:
#CUSTOMIZE: drop useless ROWS
duo_riders = duo_riders.drop([0, 2]).reset_index()
duo_riders.head()

In [None]:
#extract the gender
gender = []

for row in duo_riders["MarsVenus"]:
    if 'mars-double' in row:
        gender.append('male')
    if 'venus-double' in row:
        gender.append('female')
    if 'venus-mars' in row:
        gender.append('mixed')

#Add gender as a column to the duo_riders dataframe and drop the MarsVenus column and the automatically created index column
duo_riders["Gender"] = gender
duo_riders = duo_riders.drop(columns=["MarsVenus", "index"])
duo_riders.head()

In [None]:
#CUSTOMIZE append columns with the start date (line 1) and whether it is a duo or solo ride (line 2)
StartDate = datetime.datetime(2018, 5, 1, 8)
duo_riders.insert(0, 'Solo or Duo', 'Duo')
duo_riders.insert(0, 'StartDate', StartDate)
duo_riders.head()

### Appending the dataframes of solo and duo riders together

In [None]:
# glue solo and duo-riders together
all_riders = pd.concat([solo_riders, duo_riders], ignore_index=True)
pd.set_option('display.max_rows', all_riders.shape[0]+1)
all_riders

### Cleaning - delete unwanted parts of the strings in each column  

In [None]:
#set of patterns you want to delete from the columns:
del_patterns = ['<td>', r'\)', '<b>', '</b>', r'<span id="leaderboard_\d{4}_\d+">', '</span>', '<tr>', '</tr>', None]

In [None]:
all_riders = all_riders.replace(to_replace = del_patterns, value = '', regex = True)
all_riders.head()
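As an alternative to listing every tag separately, a single regular expression could strip any HTML tag plus the closing parenthesis. This is only a sketch of that idea, not what the cell above does: `clean_cell` is a hypothetical helper, the sample values are made up, and unlike the cell above it also trims surrounding whitespace.

```python
import re

# One pattern for everything to strip: any HTML tag, or a closing parenthesis.
TAG_OR_PAREN = re.compile(r"<[^>]+>|\)")


def clean_cell(cell):
    """Return the visible text of one cell, or '' for a missing cell."""
    if cell is None:
        return ""
    return TAG_OR_PAREN.sub("", cell).strip()


# Hypothetical sample cells in the shape produced by splitting on </td>.
samples = [
    "<tr><td>1)",
    '<td><span id="leaderboard_17478_1"><b>Thu 9:30</b></span>',
    "<td>Robison",
    None,
]
cleaned = [clean_cell(s) for s in samples]
print(cleaned)
```

Because the tag pattern does not depend on the number of digits in the `leaderboard_..._...` ids, it would not need adjusting per edition.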

In [None]:
# add a column that tells you if the rider: finished on time (finisher); did not start (DNS) or did not finish (DNF)
# There are now white spaces when no place is assigned (DNS or DNF), and this overcomplicates stuff. Remove the white spaces.
all_riders['Place'] = all_riders['Place'].str.strip()

# create a list of the three conditions
conditions = [
    (all_riders["Place"] != ''), #finisher
    (all_riders["Place"] == '') & (all_riders['StartTiming'] != ''), #DNF
    (all_riders["Place"] == '') & (all_riders['StartTiming'] == '') #DNS
    ]

# create a list of the values we want to assign for each condition
values = ["Finisher", "DNF", "DNS"]

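# create a new column and use np.select to assign values to it using our lists as arguments
# (a string default avoids mixing the text labels with np.select's numeric default of 0, which newer NumPy versions reject)
all_riders['Status'] = np.select(conditions, values, default='Unknown')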

# display updated DataFrame
all_riders.head()
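To see how the three conditions map onto the labels, here is a small sketch on made-up rows (one finisher, one rider who started but got no place, one who never started). The `toy` dataframe is purely illustrative.

```python
import numpy as np
import pandas as pd

# Made-up rows: a finisher, a DNF (started, no place) and a DNS (never started).
toy = pd.DataFrame({
    "Place": ["1", "", ""],
    "StartTiming": ["Wed 8:00", "Wed 8:00", ""],
})

conditions = [
    toy["Place"] != "",                                  # finisher
    (toy["Place"] == "") & (toy["StartTiming"] != ""),   # DNF
    (toy["Place"] == "") & (toy["StartTiming"] == ""),   # DNS
]
labels = ["Finisher", "DNF", "DNS"]

# np.select picks, per row, the label of the first condition that is True.
toy["Status"] = np.select(conditions, labels, default="Unknown")
print(toy)
```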

In [None]:
# for prettiness sake, change the order of the columns a bit (place, first name, last name, gender, all timings)
all_riders = all_riders[["Place", "Firstname", "Lastname", "Gender", "Solo or Duo" , "StartDate", "Status", "StartTiming", "Timing1", "Timing2", "Timing3", "Timing4","Timing5", "Timing6", "Timing7", "Timing8", "Timing9", "TimingFinish"]]
all_riders.head()

In [None]:
#CUSTOMIZE write to csv, use the year in the name!
all_riders.to_csv('all_riders_2018.csv')

# Do the same for edition 2019

In [23]:
#CUSTOMIZE URL
url = 'https://ratn2019.legendstracking.com/#'

#run Chrome headless (in the background, without opening a browser window)
chrome_options = Options()
chrome_options.add_argument("--headless")

#insert absolute path of chrome driver
driver = webdriver.Chrome('C:/Users/jetsa/chromedriver.exe', options = chrome_options)
driver.get(url)

# Find and click the right button, using the tab's id
button = driver.find_element(By.ID, "leaderboard-icon")
button.click()

In [24]:
#From here on you can use BS4. The page source now also contains the HTML code of the table that became visible by clicking the button.
source = driver.page_source

#shut down the webdriver (quit() also ends the chromedriver process)
driver.quit()

In [25]:
#make the soup
soup = BeautifulSoup(source, 'html')
print(soup.prettify())

<html>
 <head>
  <meta content="https://www.legendstracking.com/_lib/img/logo-facebook.png" property="og:image"/>
  <meta content="1200" property="og:image:width"/>
  <meta content="630" property="og:image:height"/>
  <meta content="Legends Tracking" property="og:title"/>
  <meta content="https://www.legendstracking.com" property="og:url"/>
  <meta content="Live gps tracking services for your event" property="og:description"/>
  <meta content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" name="viewport"/>
  <link href="../css/font-awesome.min.css" rel="stylesheet" type="text/css"/>
  <link href="../css/ol.css" rel="stylesheet" type="text/css"/>
  <link href="../css/trackers1.css" rel="stylesheet" type="text/css"/>
  <link href="../css/ol3-sidebar.css" rel="stylesheet" type="text/css"/>
  <link href="../css/ol3-layerswitcher.css" rel="stylesheet" type="text/css"/>
  <link href="../css/flags.css" rel="stylesheet" type="text/css"/>
  <link href="../dev/css/j

In [26]:
#pick the third table (index 2). This one contains information about the solo riders.
#this only worked for me when put in a separate cell, possibly because the page needs time to load the tables first
table_solo = soup.find_all('table')[2]
table_solo

IndexError: list index out of range
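The `IndexError` is what you would expect if the page source was read before the leaderboard tables were rendered, though I have not confirmed that this is the cause. A minimal sketch of waiting explicitly for the tables before reading the source is given below. The function name `fetch_leaderboard_html`, the threshold of 4 tables and the 20-second timeout are my own assumptions; the Selenium imports sit inside the function so the definition itself does not require Selenium, and passing the driver path through `Service` assumes Selenium 4.

```python
def fetch_leaderboard_html(url, driver_path, min_tables=4, timeout=20):
    """Open the page, click the leaderboard tab and return the rendered HTML."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(service=Service(driver_path), options=options)
    try:
        driver.get(url)
        driver.find_element(By.ID, "leaderboard-icon").click()
        # Block until enough <table> elements exist (assumed: 4, as in 2018),
        # or raise a TimeoutException after `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            lambda d: len(d.find_elements(By.TAG_NAME, "table")) >= min_tables
        )
        return driver.page_source
    finally:
        # Always shut the browser down, even if the wait times out.
        driver.quit()
```

Usage would look like `source = fetch_leaderboard_html('https://ratn2019.legendstracking.com/#', 'C:/Users/jetsa/chromedriver.exe')`, after which the soup can be made as before.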

In [21]:
#find all the rows from this specific table
rows_solo = table_solo.findChildren('tr')

#convert each element into a string to prevent an even more horrible mess
string_rows = [str(row) for row in rows_solo]
# print(string_rows[7]) #test

#convert the string+rows into a dataframe. Name the column "riders"
solo_riders = pd.DataFrame(string_rows)
solo_riders.columns=["riders"]
solo_riders.head()

#split the strings on </td> (this is removed)
solo_riders = solo_riders["riders"].str.split("</td>", expand = True)
solo_riders.head() #to check which columns you should delete in the next step

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
0,"<tr><td colspan=""6"">",<td>Start,<td>Timing 1,<td>Timing 2,<td>Timing 3,<td>Timing 4,<td>Timing 5,<td>Timing 6,<td>Timing 7,<td>Timing 8,<td>Timing 9,<td>Finish,</tr>,,,,,
1,<tr><td>1),"<td style=""vertical-align:top"">26","<td style=""vertical-align:top;""><div style=""mi...","<td style=""vertical-align:top""><i aria-hidden=...",<td>Robison,<td>Kevin,<td><b>Wed 8:00</b>,"<td><span id=""leaderboard_17478_1""><b>Wed 13:5...","<td><span id=""leaderboard_17478_2""><b>Thu 11:1...","<td><span id=""leaderboard_17478_3""><b>Thu 14:5...","<td><span id=""leaderboard_17478_4""><b>Thu 21:0...","<td><span id=""leaderboard_17478_5""><b>Fri 18:0...","<td><span id=""leaderboard_17478_6""><b>Sat 17:5...","<td><span id=""leaderboard_17478_7""><b>Sun 5:50...","<td><span id=""leaderboard_17478_8""><b>Sun 21:5...","<td><span id=""leaderboard_17478_9""><b>Mon 18:4...","<td><span id=""leaderboard_17478_10""><b>Tue 15:...",</tr>
2,<tr><td>2),"<td style=""vertical-align:top"">18","<td style=""vertical-align:top;""><div style=""mi...","<td style=""vertical-align:top""><i aria-hidden=...",<td>Robison,<td>Jamie,<td><b>Wed 8:00</b>,"<td><span id=""leaderboard_17477_1""><b>Wed 13:5...","<td><span id=""leaderboard_17477_2""><b>Thu 11:1...","<td><span id=""leaderboard_17477_3""><b>Thu 14:5...","<td><span id=""leaderboard_17477_4""><b>Thu 21:0...","<td><span id=""leaderboard_17477_5""><b>Fri 18:0...","<td><span id=""leaderboard_17477_6""><b>Sat 17:5...","<td><span id=""leaderboard_17477_7""><b>Sun 5:51...","<td><span id=""leaderboard_17477_8""><b>Sun 22:0...","<td><span id=""leaderboard_17477_9""><b>Mon 18:4...","<td><span id=""leaderboard_17477_10""><b>Tue 15:...",</tr>
3,<tr><td>3),"<td style=""vertical-align:top"">16","<td style=""vertical-align:top;""><div style=""mi...","<td style=""vertical-align:top""><i aria-hidden=...",<td>Crawford,<td>Christopher,<td><b>Wed 8:00</b>,"<td><span id=""leaderboard_17475_1""></span>","<td><span id=""leaderboard_17475_2""><b>Thu 15:3...","<td><span id=""leaderboard_17475_3""><b>Thu 20:3...","<td><span id=""leaderboard_17475_4""><b>Fri 10:2...","<td><span id=""leaderboard_17475_5""></span>","<td><span id=""leaderboard_17475_6""><b>Sat 12:5...","<td><span id=""leaderboard_17475_7""><b>Sat 18:2...","<td><span id=""leaderboard_17475_8""><b>Sun 21:3...","<td><span id=""leaderboard_17475_9""><b>Mon 17:3...","<td><span id=""leaderboard_17475_10""><b>Tue 15:...",</tr>
4,<tr><td>4),"<td style=""vertical-align:top"">133","<td style=""vertical-align:top;""><div style=""mi...","<td style=""vertical-align:top""><i aria-hidden=...",<td>Gebing,<td>Pascal,<td><b>Wed 8:00</b>,"<td><span id=""leaderboard_17494_1""><b>Wed 15:3...","<td><span id=""leaderboard_17494_2""><b>Thu 15:4...","<td><span id=""leaderboard_17494_3""><b>Thu 19:4...","<td><span id=""leaderboard_17494_4""><b>Fri 10:4...","<td><span id=""leaderboard_17494_5""></span>","<td><span id=""leaderboard_17494_6""><b>Sat 12:5...","<td><span id=""leaderboard_17494_7""><b>Sat 18:1...","<td><span id=""leaderboard_17494_8""><b>Sun 21:3...","<td><span id=""leaderboard_17494_9""><b>Mon 17:2...","<td><span id=""leaderboard_17494_10""><b>Tue 15:...",</tr>


In [19]:
# MIGHT NEED CUSTOMIZATION inspect the head and decide which columns to drop. In this (2019) case: colnr 1; 2; 17
solo_riders.drop(solo_riders.columns[[1,2,17]], axis=1, inplace = True)
# print(solo_riders.head()) # to check if the dropping went right

#rename columns
solo_riders.columns=["Place", "MarsVenus","Lastname", "Firstname", "StartTiming", "Timing1", "Timing2", "Timing3", "Timing4","Timing5", "Timing6", "Timing7", "Timing8", "Timing9", "TimingFinish"]

#check which rows are empty/useless and you should drop in the next cell. You'll need to see all rows (display.max_rows).
# In this case the rows you want to delete are nr 0; 45, 84
pd.set_option('display.max_rows', solo_riders.shape[0]+1)
solo_riders

Unnamed: 0,Place,MarsVenus,Lastname,Firstname,StartTiming,Timing1,Timing2,Timing3,Timing4,Timing5,Timing6,Timing7,Timing8,Timing9,TimingFinish
0,"<tr><td colspan=""6"">",<td>Timing 2,<td>Timing 3,<td>Timing 4,<td>Timing 5,<td>Timing 6,<td>Timing 7,<td>Timing 8,<td>Timing 9,<td>Finish,</tr>,,,,
1,<tr><td>1),"<td style=""vertical-align:top""><i aria-hidden=...",<td>Robison,<td>Kevin,<td><b>Wed 8:00</b>,"<td><span id=""leaderboard_17478_1""><b>Wed 13:5...","<td><span id=""leaderboard_17478_2""><b>Thu 11:1...","<td><span id=""leaderboard_17478_3""><b>Thu 14:5...","<td><span id=""leaderboard_17478_4""><b>Thu 21:0...","<td><span id=""leaderboard_17478_5""><b>Fri 18:0...","<td><span id=""leaderboard_17478_6""><b>Sat 17:5...","<td><span id=""leaderboard_17478_7""><b>Sun 5:50...","<td><span id=""leaderboard_17478_8""><b>Sun 21:5...","<td><span id=""leaderboard_17478_9""><b>Mon 18:4...","<td><span id=""leaderboard_17478_10""><b>Tue 15:..."
2,<tr><td>2),"<td style=""vertical-align:top""><i aria-hidden=...",<td>Robison,<td>Jamie,<td><b>Wed 8:00</b>,"<td><span id=""leaderboard_17477_1""><b>Wed 13:5...","<td><span id=""leaderboard_17477_2""><b>Thu 11:1...","<td><span id=""leaderboard_17477_3""><b>Thu 14:5...","<td><span id=""leaderboard_17477_4""><b>Thu 21:0...","<td><span id=""leaderboard_17477_5""><b>Fri 18:0...","<td><span id=""leaderboard_17477_6""><b>Sat 17:5...","<td><span id=""leaderboard_17477_7""><b>Sun 5:51...","<td><span id=""leaderboard_17477_8""><b>Sun 22:0...","<td><span id=""leaderboard_17477_9""><b>Mon 18:4...","<td><span id=""leaderboard_17477_10""><b>Tue 15:..."
3,<tr><td>3),"<td style=""vertical-align:top""><i aria-hidden=...",<td>Crawford,<td>Christopher,<td><b>Wed 8:00</b>,"<td><span id=""leaderboard_17475_1""></span>","<td><span id=""leaderboard_17475_2""><b>Thu 15:3...","<td><span id=""leaderboard_17475_3""><b>Thu 20:3...","<td><span id=""leaderboard_17475_4""><b>Fri 10:2...","<td><span id=""leaderboard_17475_5""></span>","<td><span id=""leaderboard_17475_6""><b>Sat 12:5...","<td><span id=""leaderboard_17475_7""><b>Sat 18:2...","<td><span id=""leaderboard_17475_8""><b>Sun 21:3...","<td><span id=""leaderboard_17475_9""><b>Mon 17:3...","<td><span id=""leaderboard_17475_10""><b>Tue 15:..."
4,<tr><td>4),"<td style=""vertical-align:top""><i aria-hidden=...",<td>Gebing,<td>Pascal,<td><b>Wed 8:00</b>,"<td><span id=""leaderboard_17494_1""><b>Wed 15:3...","<td><span id=""leaderboard_17494_2""><b>Thu 15:4...","<td><span id=""leaderboard_17494_3""><b>Thu 19:4...","<td><span id=""leaderboard_17494_4""><b>Fri 10:4...","<td><span id=""leaderboard_17494_5""></span>","<td><span id=""leaderboard_17494_6""><b>Sat 12:5...","<td><span id=""leaderboard_17494_7""><b>Sat 18:1...","<td><span id=""leaderboard_17494_8""><b>Sun 21:3...","<td><span id=""leaderboard_17494_9""><b>Mon 17:2...","<td><span id=""leaderboard_17494_10""><b>Tue 15:..."
5,<tr><td>5),"<td style=""vertical-align:top""><i aria-hidden=...",<td>Rutten,<td>René,<td><b>Wed 8:00</b>,"<td><span id=""leaderboard_17484_1""><b>Wed 14:3...","<td><span id=""leaderboard_17484_2""><b>Thu 13:1...","<td><span id=""leaderboard_17484_3""><b>Thu 16:5...","<td><span id=""leaderboard_17484_4""><b>Fri 8:01...","<td><span id=""leaderboard_17484_5""><b>Fri 20:1...","<td><span id=""leaderboard_17484_6""><b>Sat 19:0...","<td><span id=""leaderboard_17484_7""><b>Sun 9:53...","<td><span id=""leaderboard_17484_8""><b>Mon 9:44...","<td><span id=""leaderboard_17484_9""><b>Tue 7:57...","<td><span id=""leaderboard_17484_10""><b>Tue 20:..."
6,<tr><td>6),"<td style=""vertical-align:top""><i aria-hidden=...",<td>Clark,<td>Gawaine,<td><b>Wed 8:00</b>,"<td><span id=""leaderboard_17414_1""><b>Wed 15:1...","<td><span id=""leaderboard_17414_2""><b>Thu 13:2...","<td><span id=""leaderboard_17414_3""><b>Thu 17:4...","<td><span id=""leaderboard_17414_4""><b>Fri 9:17...","<td><span id=""leaderboard_17414_5""><b>Sat 9:43...","<td><span id=""leaderboard_17414_6""><b>Sun 9:53...","<td><span id=""leaderboard_17414_7""><b>Sun 14:3...","<td><span id=""leaderboard_17414_8""><b>Mon 16:4...","<td><span id=""leaderboard_17414_9""><b>Tue 12:0...","<td><span id=""leaderboard_17414_10""><b>Wed 2:2..."
7,<tr><td>7),"<td style=""vertical-align:top""><i aria-hidden=...",<td>Hunt,<td>Richard,<td><b>Wed 8:00</b>,"<td><span id=""leaderboard_17451_1""><b>Wed 15:1...","<td><span id=""leaderboard_17451_2""><b>Thu 13:2...","<td><span id=""leaderboard_17451_3""><b>Thu 17:4...","<td><span id=""leaderboard_17451_4""><b>Fri 9:16...","<td><span id=""leaderboard_17451_5""><b>Sat 9:48...","<td><span id=""leaderboard_17451_6""><b>Sun 9:46...","<td><span id=""leaderboard_17451_7""><b>Sun 14:3...","<td><span id=""leaderboard_17451_8""><b>Mon 16:5...","<td><span id=""leaderboard_17451_9""><b>Tue 11:5...","<td><span id=""leaderboard_17451_10""><b>Wed 2:3..."
8,<tr><td>8),"<td style=""vertical-align:top""><i aria-hidden=...",<td>Alinski,<td>Marc,<td><b>Wed 8:00</b>,"<td><span id=""leaderboard_17497_1""><b>Wed 14:4...","<td><span id=""leaderboard_17497_2""><b>Thu 14:1...","<td><span id=""leaderboard_17497_3""><b>Thu 17:5...","<td><span id=""leaderboard_17497_4""><b>Fri 9:39...","<td><span id=""leaderboard_17497_5""><b>Sat 10:0...","<td><span id=""leaderboard_17497_6""><b>Sun 11:0...","<td><span id=""leaderboard_17497_7""><b>Sun 15:5...","<td><span id=""leaderboard_17497_8""><b>Mon 18:3...","<td><span id=""leaderboard_17497_9""><b>Tue 15:5...","<td><span id=""leaderboard_17497_10""><b>Wed 6:0..."
9,<tr><td>9),"<td style=""vertical-align:top""><i aria-hidden=...",<td>Barkowsky,<td>Maren,<td><b>Wed 8:00</b>,"<td><span id=""leaderboard_17498_1""><b>Wed 14:2...","<td><span id=""leaderboard_17498_2""><b>Thu 14:1...","<td><span id=""leaderboard_17498_3""><b>Thu 17:5...","<td><span id=""leaderboard_17498_4""><b>Fri 10:1...","<td><span id=""leaderboard_17498_5""><b>Sat 10:0...","<td><span id=""leaderboard_17498_6""><b>Sun 11:0...","<td><span id=""leaderboard_17498_7""><b>Sun 15:5...","<td><span id=""leaderboard_17498_8""><b>Mon 18:3...","<td><span id=""leaderboard_17498_9""><b>Tue 15:5...","<td><span id=""leaderboard_17498_10""><b>Wed 6:1..."


In [22]:
#CUSTOMIZE: drop useless ROWS
solo_riders = solo_riders.drop([0, 45, 84]).reset_index()
solo_riders.head()

KeyError: '[45 84] not found in axis'
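Hard-coded row numbers break as soon as the table has a different length, as the `KeyError` shows. A possible alternative, sketched here on made-up rows, is to keep only the rows that have a value in every column: `str.split(..., expand=True)` pads shorter rows with missing values. This assumes that header and separator rows always have fewer cells than rider rows, which holds for the header row in the output above but which I have not checked for the other rows.

```python
import pandas as pd

# Hypothetical, shortened rows: a header row, two rider rows and a separator row.
string_rows = [
    '<tr><td colspan="2"></td><td>Start</td></tr>',
    "<tr><td>1)</td><td>Robison</td><td>Kevin</td><td>Wed 8:00</td></tr>",
    "<tr><td>2)</td><td>Robison</td><td>Jamie</td><td>Wed 8:00</td></tr>",
    '<tr><td colspan="4">DNF</td></tr>',
]

cells = pd.Series(string_rows).str.split("</td>", expand=True)

# Rows with fewer cells are padded with missing values by expand=True,
# so a row without gaps is assumed to be a rider row.
is_rider_row = cells.notna().all(axis=1)
riders = cells[is_rider_row].reset_index(drop=True)
print(riders)
```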

In [None]:
#extract the gender
gender = []

for row in solo_riders["MarsVenus"]:
    if 'mars' in row:
        gender.append('male')
    if 'venus' in row:
        gender.append('female')

# Add gender as a column to the solo_riders dataframe and drop the MarsVenus column and the automatically created index column
solo_riders["Gender"] = gender
solo_riders = solo_riders.drop(columns=["MarsVenus", "index"])
solo_riders.head()

In [None]:
#CUSTOMIZE append columns with start date&time, and whether it's a solo or duo ride.
StartDate = datetime.datetime(2019, 5, 1, 8)
solo_riders.insert(0, 'Solo or Duo', 'Solo')
solo_riders.insert(0, 'StartDate', StartDate)

solo_riders.head()

##  Prepare the duo-table 2019
Prepare the duo-table, so the cleaning can then be done at the same time

In [None]:
#grab the duo table
table_duo = soup.find_all('table')[3]
table_duo

In [None]:
#grab the rows
rows_duo = table_duo.findChildren('tr')

#convert each element into a string to prevent an even more horrible mess
string_rows = [str(row) for row in rows_duo]
print(string_rows[3])

In [None]:
#convert the string+rows into a dataframe. Name the column "riders"
duo_riders = pd.DataFrame(string_rows)
duo_riders.columns=["riders"]
duo_riders.head()

#split the strings on </td>
duo_riders = duo_riders["riders"].str.split("</td>", expand = True)
duo_riders

In [None]:
# CUSTOMIZE drop useless columns: colnr 1; 2; 17
duo_riders = duo_riders.drop(duo_riders.columns[[1,2,17]], axis=1)
duo_riders

#rename columns
duo_riders.columns=["Place", "MarsVenus","Lastname", "Firstname", "StartTiming", "Timing1", "Timing2", "Timing3", "Timing4","Timing5", "Timing6", "Timing7", "Timing8", "Timing9", "TimingFinish"]

#check which rows are empty/useless and you should drop in the next cell. 
duo_riders

In [None]:
#CUSTOMIZE: drop useless ROWS
duo_riders = duo_riders.drop([0, 20, 28]).reset_index()
duo_riders.head()

In [None]:
# NOTE: the 2019 pair table does not make clear who is paired with whom
#extract the gender
gender = []

for row in duo_riders["MarsVenus"]:
    if 'mars' in row:
        gender.append('male')
    if 'venus' in row:
        gender.append('female')

#Add gender as a column to the duo_riders dataframe and drop the MarsVenus column and the automatically created index column
duo_riders["Gender"] = gender
duo_riders = duo_riders.drop(columns=["MarsVenus", "index"])
duo_riders

In [None]:
#CUSTOMIZE append columns with the start date (line 1) and whether it is a duo or solo ride (line 2)
StartDate = datetime.datetime(2019, 5, 1, 8)
duo_riders.insert(0, 'Solo or Duo', 'Duo')
duo_riders.insert(0, 'StartDate', StartDate)
duo_riders.head()

In [None]:
# glue solo and duo-riders together
all_riders = pd.concat([solo_riders, duo_riders], ignore_index=True)
pd.set_option('display.max_rows', all_riders.shape[0]+1)
all_riders

In [None]:
#set of patterns you want to delete from the columns:
#the 2019 ids have five digits (e.g. leaderboard_17478_1), so match any number of digits
del_patterns = ['<td>', r'\)', '<b>', '</b>', r'<span id="leaderboard_\d+_\d+">', '</span>', '<tr>', '</tr>', None]

In [None]:
all_riders = all_riders.replace(to_replace = del_patterns, value = '', regex = True)
all_riders.head()

In [None]:
# add a column that tells you if the rider: finished on time (finisher); did not start (DNS) or did not finish (DNF)
# There are now white spaces when no place is assigned (DNS or DNF), and this overcomplicates stuff. Remove the white spaces.
all_riders['Place'] = all_riders['Place'].str.strip()

# create a list of the three conditions
conditions = [
    (all_riders["Place"] != ''), #finisher
    (all_riders["Place"] == '') & (all_riders['StartTiming'] != ''), #DNF
    (all_riders["Place"] == '') & (all_riders['StartTiming'] == '') #DNS
    ]

# create a list of the values we want to assign for each condition
values = ["Finisher", "DNF", "DNS"]

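# create a new column and use np.select to assign values to it using our lists as arguments
# (a string default avoids mixing the text labels with np.select's numeric default of 0, which newer NumPy versions reject)
all_riders['Status'] = np.select(conditions, values, default='Unknown')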

# display updated DataFrame
all_riders.head()

In [None]:
# for prettiness sake, change the order of the columns a bit (place, first name, last name, gender, all timings)
all_riders = all_riders[["Place", "Firstname", "Lastname", "Gender", "Solo or Duo" , "StartDate", "Status", "StartTiming", "Timing1", "Timing2", "Timing3", "Timing4","Timing5", "Timing6", "Timing7", "Timing8", "Timing9", "TimingFinish"]]
all_riders.head()

In [None]:
#CUSTOMIZE write to csv use the year in the name!
all_riders.to_csv('all_riders_2019.csv')

# PUT THIS INTO THE CODE!!!!!
#2019 ONLY: there is a mistake in the data. Neil Crawford gets assigned place 19 (and will therefore be categorized as a 'Finisher' later on).
# He should get DNF. By removing the 19, he will be recorded as DNF.

# solo_riders['Place'] = solo_riders['Place'].str.replace('19', '')
# solo_riders
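A caveat with the commented-out line: `str.replace('19', '')` removes the text "19" from every value in the column, so any other place containing those digits would be altered too. A more targeted sketch selects the rider by name instead. The rows below are made up apart from the Crawford entry, and this assumes the fix is applied after the cleaning step, once the names no longer carry a `<td>` prefix.

```python
import pandas as pd

# Hypothetical cleaned rows; only the Crawford entry is taken from the note above.
riders = pd.DataFrame({
    "Place": ["18", "19", "119"],
    "Firstname": ["Example", "Neil", "Another"],
    "Lastname": ["Rider", "Crawford", "Rider"],
})

# Select the one rider by name instead of replacing the text "19" everywhere.
is_neil = (riders["Firstname"] == "Neil") & (riders["Lastname"] == "Crawford")
riders.loc[is_neil, "Place"] = ""
print(riders)
```

With the place emptied, the status step would classify this rider as DNF, provided a start timing is present.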

# Do the same for edition 2020

In [None]:
#CUSTOMIZE URL
url = 'https://ratn2020.legendstracking.com/#'

#run Chrome headless (in the background, without opening a browser window)
chrome_options = Options()
chrome_options.add_argument("--headless")

#insert absolute path of chrome driver
driver = webdriver.Chrome('C:/Users/jetsa/chromedriver.exe', options = chrome_options)
driver.get(url)

# Find and click the right button, using the tab's id
button = driver.find_element(By.ID, "leaderboard-icon")
button.click()

In [None]:
#From here on you can use BS4. The page source now also contains the HTML code of the table that became visible by clicking the button.
source = driver.page_source

#shut down the webdriver (quit() also ends the chromedriver process)
driver.quit()

In [None]:
#make the soup
soup = BeautifulSoup(source, 'html')
print(soup.prettify())

In [None]:
#MIGHT NEED CUSTOMIZATION: check that index 2 really gives you the right table.
#The table should contain information about the timings of the solo riders.
#this only worked for me when put in a separate cell, possibly because the page needs time to load the tables first
table_solo = soup.find_all('table')[2]
table_solo

In [None]:
#find all the rows from this specific table
rows_solo = table_solo.findChildren('tr')

#convert each element into a string to prevent an even more horrible mess
string_rows = [str(row) for row in rows_solo]
# print(string_rows[7]) #test

#convert the string+rows into a dataframe. Name the column "riders"
solo_riders = pd.DataFrame(string_rows)
solo_riders.columns=["riders"]
solo_riders.head()

#split the strings on </td> (this is removed)
solo_riders = solo_riders["riders"].str.split("</td>", expand = True)
solo_riders.head() #to check which columns you should delete in the next step

In [None]:
# MIGHT NEED CUSTOMIZATION - inspect the head and decide which columns to drop. In this (2020) case: colnr 1; 2; 17
solo_riders.drop(solo_riders.columns[[1,2,17]], axis=1, inplace = True)
# print(solo_riders.head()) # to check if the dropping went right

#rename columns
solo_riders.columns=["Place", "MarsVenus","Lastname", "Firstname", "StartTiming", "Timing1", "Timing2", "Timing3", "Timing4","Timing5", "Timing6", "Timing7", "Timing8", "Timing9", "TimingFinish"]

#check which rows are empty/useless and you should drop in the next cell. You'll need to see all rows (display.max_rows).
# In this case the rows you want to delete are nr 0; 61; 77
pd.set_option('display.max_rows', solo_riders.shape[0]+1)
solo_riders

In [None]:
#CUSTOMIZE: drop useless ROWS
solo_riders = solo_riders.drop([0, 61, 77]).reset_index()
solo_riders.head()

In [None]:
#extract the gender
gender = []

for row in solo_riders["MarsVenus"]:
    if 'mars' in row:
        gender.append('male')
    if 'venus' in row:
        gender.append('female')

# Add gender as a column to the solo_riders dataframe and drop the MarsVenus column and the automatically created index column
solo_riders["Gender"] = gender
solo_riders = solo_riders.drop(columns=["MarsVenus", "index"])
solo_riders.head()

In [None]:
#CUSTOMIZE append columns with start date&time, and whether it's a solo or duo ride.
StartDate = datetime.datetime(2020, 8, 29, 8)
solo_riders.insert(0, 'Solo or Duo', 'Solo')
solo_riders.insert(0, 'StartDate', StartDate)

solo_riders.head()

### Prepare the duo-table 2020

In [None]:
#grab the duo table
table_duo = soup.find_all('table')[3]
table_duo

In [None]:
#grab the rows
rows_duo = table_duo.findChildren('tr')

#convert each element into a string to prevent an even more horrible mess
string_rows = [str(row) for row in rows_duo]
print(string_rows[3])

In [None]:
#convert the string+rows into a dataframe. Name the column "riders"
duo_riders = pd.DataFrame(string_rows)
duo_riders.columns=["riders"]
duo_riders.head()

#split the strings on </td>
duo_riders = duo_riders["riders"].str.split("</td>", expand = True)
duo_riders

In [None]:
# CUSTOMIZE drop useless columns: colnr 1; 2; 17
duo_riders = duo_riders.drop(duo_riders.columns[[1,2,17]], axis=1)
duo_riders

#rename columns
duo_riders.columns=["Place", "MarsVenus","Lastname", "Firstname", "StartTiming", "Timing1", "Timing2", "Timing3", "Timing4","Timing5", "Timing6", "Timing7", "Timing8", "Timing9", "TimingFinish"]

#check which rows are empty/useless and you should drop in the next cell. 
duo_riders

In [None]:
#CUSTOMIZE: drop useless ROWS
duo_riders = duo_riders.drop([0, 4, 6]).reset_index()
duo_riders.head()

In [None]:
#extract the gender
gender = []

for row in duo_riders["MarsVenus"]:
    if 'mars-double' in row:
        gender.append('male')
    if 'venus-double' in row:
        gender.append('female')
    if 'venus-mars' in row:
        gender.append('mixed')

#Add gender as a column to the duo_riders dataframe and drop the MarsVenus column and the automatically created index column
duo_riders["Gender"] = gender
duo_riders = duo_riders.drop(columns=["MarsVenus", "index"])
duo_riders
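
The same mapping can also be written as a small function, which returns exactly one label per row by construction. This is only a sketch: `duo_gender` is a name I made up, and the sample cells are hypothetical (the real cells contain more markup around the icon name).

```python
def duo_gender(cell_html):
    """Map the content of a duo 'MarsVenus' cell to a gender label."""
    markers = {
        "mars-double": "male",
        "venus-double": "female",
        "venus-mars": "mixed",
    }
    for marker, label in markers.items():
        if marker in cell_html:
            return label
    return "unknown"  # no known icon found in this cell


# Hypothetical cell contents, for illustration only.
samples = [
    '<td><i class="fa fa-mars-double"></i>',
    '<td><i class="fa fa-venus-double"></i>',
    '<td><i class="fa fa-venus-mars"></i>',
    '<td>',
]
labels = [duo_gender(cell) for cell in samples]
print(labels)
```

On the dataframe this would be used as `duo_riders["MarsVenus"].map(duo_gender)`, before the MarsVenus column is dropped.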

In [None]:
#CUSTOMIZE: add columns with the start date (line 1) and whether it is a duo or solo ride (line 2)
StartDate = datetime.datetime(2020, 8, 29, 8)
duo_riders.insert(0, 'Solo or Duo', 'Duo')
duo_riders.insert(0, 'StartDate', StartDate)
duo_riders.head()

In [None]:
# glue solo and duo riders together; ignore_index=True renumbers the rows from 0
all_riders = pd.concat([solo_riders, duo_riders], ignore_index=True)
pd.set_option('display.max_rows', all_riders.shape[0]+1)
all_riders
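
To see what the gluing does to the row numbers, here is a tiny example with invented riders. Without renumbering, both tables would keep their own 0, 1, 2, ... index and the combined table would contain duplicate labels; `ignore_index=True` gives one continuous index.

```python
import pandas as pd

# Invented mini tables standing in for solo_riders and duo_riders.
solo_demo = pd.DataFrame({"Lastname": ["A", "B"], "Solo or Duo": "Solo"})
duo_demo = pd.DataFrame({"Lastname": ["C"], "Solo or Duo": "Duo"})

# Plain concat keeps the original row labels (0, 1 and again 0).
stacked = pd.concat([solo_demo, duo_demo])

# ignore_index=True replaces them with a fresh 0..n-1 index.
combined = pd.concat([solo_demo, duo_demo], ignore_index=True)
print(combined)
```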

In [None]:
#set of patterns you want to delete from the columns (raw strings, so that \) and \d reach the regex engine unchanged):
del_patterns = [r'<td>', r'\)', r'<b>', r'</b>', r'<span id="leaderboard_\d{4,}_\d+">', r'</span>', r'<tr>', r'</tr>', None]

In [None]:
all_riders = all_riders.replace(to_replace = del_patterns, value = '', regex = True)
all_riders.head()
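
To check what the patterns actually remove, it helps to try them on a single cell first. The sketch below uses Python's standard `re` module and folds the patterns into one expression. `strip_markup` is a hypothetical helper and the sample cell is invented, so treat this as a way to test the regex, not as a replacement for the `replace` call above.

```python
import re

# The same patterns as in del_patterns, combined into one alternation.
MARKUP = re.compile(
    r'<td>|\)|</?b>|<span id="leaderboard_\d{4,}_\d+">|</span>|</?tr>'
)


def strip_markup(cell):
    """Remove the leaderboard markup from one cell; None becomes ''."""
    if cell is None:
        return ""
    return MARKUP.sub("", cell)


# Invented example of a timing cell after splitting on </td>.
sample = '<td><span id="leaderboard_12345_3">1d 02:15</span>'
print(strip_markup(sample))
```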

In [None]:
# add a column that tells you whether the rider finished (Finisher), did not finish (DNF) or did not start (DNS)
# Riders without a place (DNS or DNF) have only white space in the Place column, which complicates the comparisons below. Strip the white space.
all_riders['Place'] = all_riders['Place'].str.strip()

# create a list of the three conditions
conditions = [
    (all_riders["Place"] != ''), #finisher
    (all_riders["Place"] == '') & (all_riders['StartTiming'] != ''), #DNF
    (all_riders["Place"] == '') & (all_riders['StartTiming'] == '') #DNS
    ]

# create a list of the values we want to assign for each condition
values = ["Finisher", "DNF", "DNS"]

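# create a new column and use np.select to assign values to it using our lists as arguments
# the string default replaces np.select's integer default 0, which recent NumPy versions may refuse to combine with string values
all_riders['Status'] = np.select(conditions, values, default='Unknown')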

# display updated DataFrame
all_riders.head()
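
The three conditions are easier to follow on a toy table with one rider of each kind. The data below is made up. The `default` argument is an addition of mine: it is the value for rows that match none of the conditions.

```python
import numpy as np
import pandas as pd

# One invented rider per case: finished, started but no place, never started.
demo = pd.DataFrame({
    "Place": ["1", "", ""],
    "StartTiming": ["08:00", "08:00", ""],
})

conditions = [
    demo["Place"] != "",                                  # Finisher
    (demo["Place"] == "") & (demo["StartTiming"] != ""),  # DNF
    (demo["Place"] == "") & (demo["StartTiming"] == ""),  # DNS
]
values = ["Finisher", "DNF", "DNS"]

# np.select picks, per row, the value of the first condition that is True.
demo["Status"] = np.select(conditions, values, default="Unknown")
print(demo)
```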

In [None]:
# for readability, reorder the columns (place, first name, last name, gender, ride type, start date, status, all timings)
all_riders = all_riders[["Place", "Firstname", "Lastname", "Gender", "Solo or Duo" , "StartDate", "Status", "StartTiming", "Timing1", "Timing2", "Timing3", "Timing4","Timing5", "Timing6", "Timing7", "Timing8", "Timing9", "TimingFinish"]]
all_riders.head()

In [None]:
#CUSTOMIZE: write to csv; use the year in the file name!
all_riders.to_csv('all_riders_2020.csv')
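
One thing to be aware of when reading the file back later: by default `to_csv` also writes the row index as an unnamed first column. The sketch below shows the difference on an invented two-row table, writing to a string instead of a file so that nothing lands on disk.

```python
import io
import pandas as pd

demo = pd.DataFrame({"Place": ["1", "2"], "Lastname": ["A", "B"]})

# Without a path, to_csv returns the csv text instead of writing a file.
with_index = demo.to_csv()
without_index = demo.to_csv(index=False)

print(with_index.splitlines()[0])     # header starts with an empty index field
print(without_index.splitlines()[0])  # header holds only the real columns

# Reading the index-free version back gives the original columns again.
restored = pd.read_csv(io.StringIO(without_index), dtype=str)
print(restored)
```

Whether to keep the index is a matter of taste; with the default, `pd.read_csv(..., index_col=0)` restores it on reading.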

## RATN 2021

In [None]:
#CUSTOMIZE URL
url = 'https://ratn2021.legendstracking.com/#'

#run Chrome headless (behind the scenes, in the background)
chrome_options = Options()
chrome_options.add_argument("--headless")

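#insert absolute path of chrome driver
#Selenium 4 takes the driver path through a Service object; on Selenium 3, pass the path as the first argument instead
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service('C:/Users/jetsa/chromedriver.exe'), options = chrome_options)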
driver.get(url)

# Find and click the right button, using the tab's id
button = driver.find_element(By.ID, "leaderboard-icon")
button.click()

In [None]:
#From here on you can use BS4. The page source now also contains the html code that became visible by clicking the button.
source = driver.page_source

#quit the webdriver (closes the browser and ends the driver session)
driver.quit()

In [None]:
#make the soup
soup = BeautifulSoup(source, 'html')
print(soup.prettify())

In [None]:
#MIGHT NEED CUSTOMIZATION
# check in the output below whether index 2 selects the right table.
# The table should contain the timings of the solo riders.
# This only worked for me in a separate cell, perhaps because the previous cell has to finish first.
table_solo = soup.find_all('table')[2]
table_solo
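
Picking the table by position means checking the output by eye every year. A possible alternative is to pick the table by something it contains. The sketch below assumes that the gender icons are elements with a CSS class such as `fa-mars`; I have not verified that against the real page, so the markup and the helper `tables_containing` are hypothetical.

```python
from bs4 import BeautifulSoup

# Invented page with three tables; only the class names matter here.
demo_html = """
<table id="menu"><tr><td>About</td></tr></table>
<table id="solo"><tr><td><i class="fa fa-mars"></i></td><td>Rider A</td></tr></table>
<table id="duo"><tr><td><i class="fa fa-venus-mars"></i></td><td>Pair B</td></tr></table>
"""


def tables_containing(soup, css_class):
    """Return all tables that hold at least one element with this CSS class."""
    return [t for t in soup.find_all("table") if t.find(class_=css_class) is not None]


demo_soup = BeautifulSoup(demo_html, "html.parser")
solo_tables = tables_containing(demo_soup, "fa-mars")
duo_tables = tables_containing(demo_soup, "fa-venus-mars")
print([t.get("id") for t in solo_tables])
print([t.get("id") for t in duo_tables])
```

`class_` matches whole class names, so `fa-mars` does not accidentally match `fa-venus-mars`.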

In [None]:
#find all the rows from this specific table
rows_solo = table_solo.findChildren('tr')

#convert each element into a string to prevent an even more horrible mess
string_rows = []

for i in rows_solo:
    i = str(i)
    string_rows.append(i)
# print(string_rows[7]) #test

#convert the string rows into a dataframe. Name the column "riders"
solo_riders = pd.DataFrame(string_rows)
solo_riders.columns=["riders"]
solo_riders.head()

#split the strings on </td> (the split removes the </td> itself)
solo_riders = solo_riders["riders"].str.split("</td>", expand = True)
solo_riders.head() #to check which columns you should delete in the next step

In [None]:
# MIGHT NEED CUSTOMIZATION - inspect the head and decide which columns to drop. In this case: column numbers 1, 2 and 17
solo_riders.drop(solo_riders.columns[[1,2,17]], axis=1, inplace = True)
# print(solo_riders.head()) # to check if the dropping went right

#rename columns
solo_riders.columns=["Place", "MarsVenus","Lastname", "Firstname", "StartTiming", "Timing1", "Timing2", "Timing3", "Timing4","Timing5", "Timing6", "Timing7", "Timing8", "Timing9", "TimingFinish"]

#check which rows are empty/useless; drop them in the next cell. You'll need to see all rows (display.max_rows).
# In this case the rows to delete are numbers 0, 32 and 61
pd.set_option('display.max_rows', solo_riders.shape[0]+1)
solo_riders

In [None]:
#CUSTOMIZE: drop useless ROWS (basically the rows that have a None in the Lastname column)
solo_riders = solo_riders.drop([0, 32, 61]).reset_index()
solo_riders.head()

#extract the gender
gender = []

for row in solo_riders["MarsVenus"]:
    if 'mars' in row:
        gender.append('male')
    elif 'venus' in row:
        gender.append('female')
    else:
        gender.append('unknown') #keeps the list as long as the dataframe when an icon is missing

# Add gender as a column to the solo_riders dataframe and drop the MarsVenus column and the automatically created index-column
solo_riders["Gender"] = gender
solo_riders = solo_riders.drop(columns=["MarsVenus", "index"])
solo_riders.head()

In [None]:
#CUSTOMIZE: add columns with the start date and time, and whether it is a solo or duo ride.
StartDate = 2021 #NOTE: only the year is stored, because the riders started on different dates and times because of corona
solo_riders.insert(0, 'Solo or Duo', 'Solo')
solo_riders.insert(0, 'StartDate', StartDate)

solo_riders.head()
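
A side effect of storing the plain number 2021 is that the StartDate column holds integers here, while the 2020 table holds real datetimes. If the yearly tables are ever combined, one option (an assumption on my side, not something this notebook does) is to mark the unknown start as `pd.NaT`, pandas' missing-value marker for dates, so the column stays a datetime column. The riders below are invented.

```python
import datetime
import pandas as pd

# Invented one-row tables standing in for the 2020 and 2021 results.
riders_2020 = pd.DataFrame({
    "Lastname": ["A"],
    "StartDate": [datetime.datetime(2020, 8, 29, 8)],
})
riders_2021 = pd.DataFrame({
    "Lastname": ["B"],
    "StartDate": [pd.NaT],  # start moment unknown / different per rider
})

combined = pd.concat([riders_2020, riders_2021], ignore_index=True)
print(combined.dtypes)
print(combined)
```

The year itself would then need its own column, for example an extra "Year" column, since `NaT` carries no information.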

### Duo-riders 2021

In [None]:
#grab the duo table
table_duo = soup.find_all('table')[3]
table_duo

In [None]:
#grab the rows
rows_duo = table_duo.findChildren('tr')

#convert each element into a string to prevent an even more horrible mess
string_rows = []

for i in rows_duo:
    i = str(i)
    string_rows.append(i)
print(string_rows[1])

In [None]:
#convert the string rows into a dataframe. Name the column "riders"
duo_riders = pd.DataFrame(string_rows)
duo_riders.columns=["riders"]
duo_riders.head()

#split the strings on </td>
duo_riders = duo_riders["riders"].str.split("</td>", expand = True)
duo_riders

In [None]:
# CUSTOMIZE: drop useless columns: column numbers 1, 2 and 17
duo_riders = duo_riders.drop(duo_riders.columns[[1,2,17]], axis=1)
duo_riders

#rename columns
duo_riders.columns=["Place", "MarsVenus","Lastname", "Firstname", "StartTiming", "Timing1", "Timing2", "Timing3", "Timing4","Timing5", "Timing6", "Timing7", "Timing8", "Timing9", "TimingFinish"]

#check which rows are empty/useless; drop them in the next cell.
duo_riders

In [None]:
#CUSTOMIZE: drop useless ROWS
duo_riders = duo_riders.drop([0, 3]).reset_index()
duo_riders.head()

In [None]:
#extract the gender
gender = []

for row in duo_riders["MarsVenus"]:
    if 'mars-double' in row:
        gender.append('male')
    elif 'venus-double' in row:
        gender.append('female')
    elif 'venus-mars' in row:
        gender.append('mixed')
    else:
        gender.append('unknown') #keeps the list as long as the dataframe when an icon is missing

#Add gender as a column to the duo_riders dataframe and drop the MarsVenus column and the automatically created index-column
duo_riders["Gender"] = gender
duo_riders = duo_riders.drop(columns=["MarsVenus", "index"])
duo_riders

In [None]:
#CUSTOMIZE: add columns with the start date (line 1) and whether it is a duo or solo ride (line 2)
StartDate = 2021 #NOTE: only the year is stored, because in 2021 the riders started on different dates and times because of corona
duo_riders.insert(0, 'Solo or Duo', 'Duo')
duo_riders.insert(0, 'StartDate', StartDate)
duo_riders.head()

In [None]:
# glue solo and duo riders together; ignore_index=True renumbers the rows from 0
all_riders = pd.concat([solo_riders, duo_riders], ignore_index=True)
pd.set_option('display.max_rows', all_riders.shape[0]+1)
all_riders

In [None]:
# CUSTOMIZE: set of patterns you want to delete from the columns (raw strings, so that \) and \d reach the regex engine unchanged):
del_patterns = [r'<td>', r'\)', r'<b>', r'</b>', r'<span id="leaderboard_\d{4,}_\d+">', r'</span>', r'<tr>', r'</tr>', None]

all_riders = all_riders.replace(to_replace = del_patterns, value = '', regex = True)
all_riders # check if the cleaning went right

In [None]:
# add a column that tells you whether the rider finished (Finisher), did not finish (DNF) or did not start (DNS)
# Riders without a place (DNS or DNF) have only white space in the Place column, which complicates the comparisons below. Strip the white space.
all_riders['Place'] = all_riders['Place'].str.strip()

# create a list of the three conditions
conditions = [
    (all_riders["Place"] != ''), #finisher
    (all_riders["Place"] == '') & (all_riders['StartTiming'] != ''), #DNF
    (all_riders["Place"] == '') & (all_riders['StartTiming'] == '') #DNS
    ]

# create a list of the values we want to assign for each condition
values = ["Finisher", "DNF", "DNS"]

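# create a new column and use np.select to assign values to it using our lists as arguments
# the string default replaces np.select's integer default 0, which recent NumPy versions may refuse to combine with string values
all_riders['Status'] = np.select(conditions, values, default='Unknown')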

# display updated DataFrame
all_riders.head()

In [None]:
# for readability, reorder the columns (place, first name, last name, gender, ride type, start date, status, all timings)
all_riders = all_riders[["Place", "Firstname", "Lastname", "Gender", "Solo or Duo" , "StartDate", "Status", "StartTiming", "Timing1", "Timing2", "Timing3", "Timing4","Timing5", "Timing6", "Timing7", "Timing8", "Timing9", "TimingFinish"]]
all_riders

In [None]:
#CUSTOMIZE: write to csv; use the year in the file name!
all_riders.to_csv('all_riders_2021.csv')