# Scraping event history of a weekly 5k in KN
This script scrapes the event history of a weekly 5k running event in Konstanz. Because the organisation does not permit web scraping, the scraped data will be anonymised and the script will be adjusted so that it contains no direct references to the 5k event.

The scraped data will be saved as an anonymised csv file.

In [103]:
# import relevant libraries
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

First I use the requests library to fetch the HTML content of the website. It is important to pass a User-Agent header with the request; otherwise the server responds with a 403 (Forbidden) error page instead of the actual content.

In [104]:
URL = 'https://www.parkrun.com.de/hockgraben/results/eventhistory/'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:129.0) Gecko/20100101 Firefox/129.0'}
page = requests.get(URL, headers=headers)

print(page.text)

<!DOCTYPE html>
<html lang="de-DE">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<link rel="apple-touch-icon" sizes="180x180" href="/wp-content/themes/parkrun/favicons/apple-touch-icon.png">
<link rel="icon" type="image/png" sizes="32x32" href="/wp-content/themes/parkrun/favicons/favicon-32x32.png">
<link rel="icon" type="image/png" sizes="16x16" href="/wp-content/themes/parkrun/favicons/favicon-16x16.png">
<link rel="manifest" href="/wp-content/themes/parkrun/favicons/site.webmanifest">
<link rel="mask-icon" href="/wp-content/themes/parkrun/favicons/safari-pinned-tab.svg" color="#2b233d">
<link rel="shortcut icon" href="/wp-content/themes/parkrun/favicons/favicon.ico">
<meta name="msapplication-TileColor" content="#da532c">
<meta name="msapplication-config" content="/wp-content/themes/parkrun/favicons/browserconfig.xml">
<meta name="theme-color" content="#ffffff">
<meta name="geo.placename" content="Friedrichstrasse" />
<meta
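
A slightly more defensive version of the request could look like the sketch below. The function name `fetch_html`, the constant `DEFAULT_USER_AGENT` and the timeout value are assumptions for illustration and not part of the notebook. `raise_for_status()` turns a 403 (or any other HTTP error) into an exception instead of silently returning the error page.

```python
import requests

# Hypothetical constant: the same browser-like User-Agent string used above.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:129.0) "
    "Gecko/20100101 Firefox/129.0"
)


def fetch_html(url, user_agent=DEFAULT_USER_AGENT, timeout=10):
    """Return the HTML of `url`, raising on HTTP errors such as 403."""
    response = requests.get(
        url,
        headers={"User-Agent": user_agent},
        timeout=timeout,  # assumed value; prevents the call from hanging forever
    )
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text
```

With this helper, `fetch_html(URL)` would return the same text as `page.text`, but a blocked request would fail loudly.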

I then use BeautifulSoup to parse the html content of the website.

In [105]:
soup = BeautifulSoup(page.content, 'html.parser')

From the browser's developer tools, I know that the information I want is stored in a table with the class Results-table.

In [106]:
table = soup.find('table', {'class':'Results-table'})
print(table.prettify())

<table class="Results-table Results-table--compact js-ResultsTable">
 <thead>
  <tr class="Results-table-thead">
   <th class="Results-table-th Results-table-th--position">
    <span class="Results-hideTablet">
     Lauf #
    </span>
    <span class="Results-tablet">
     #
    </span>
   </th>
   <th class="Results-table-th hideDetailed--mobile">
    Datum
   </th>
   <th class="Results-table-th detailed--mobile-tableCell">
    Datum/zuerst im Ziel
   </th>
   <th class="Results-table-th">
    Finisher
   </th>
   <th class="Results-table-th">
    Helfende
   </th>
   <th class="Results-table-th Results-hideTablet" colspan="2">
    Erster Mann im Ziel
   </th>
   <th class="Results-table-th Results-hideTablet" colspan="2">
    Erste Frau im Ziel
   </th>
   <th class="Results-table-th Results-tablet Results-tablet--tableCell Results-hideMobile">
    Erster Mann im Ziel
   </th>
   <th class="Results-table-th Results-tablet Results-tablet--tableCell Results-hideMobile">
    Erste Frau

In fact, I can retrieve all the information I want from the opening table row (tr) tag of each row, which carries the values as data-* attributes. So I first create a list of all the table rows.

In [107]:
rows = table.find_all('tr', class_='Results-table-row')

Once I have this list, I create one list for each column I want in my dataframe later and use regular expressions to extract a rough cut of the information. At this stage each match still includes the attribute name that describes what kind of information it is.

In [108]:
dates = []
first_f = []
time_f = []
finishers = []
first_m = []
time_m = []
event_num = []
num_vols = []

for r in rows:
    row_html = str(r)  # serialise the row once and reuse it for every search
    dates.append(re.search(r'data-date="\d{4}-\d{2}-\d{2}"', row_html).group(0))
    first_f.append(re.search(r'data-female="[^0-9]+(?: [^0-9]+)+\.?" ', row_html).group(0))
    time_f.append(re.search(r'data-femaletime="\d+"', row_html).group(0))
    finishers.append(re.search(r'data-finishers="\d+"', row_html).group(0))
    first_m.append(re.search(r'data-male="[^0-9]+(?: [^0-9]+)+\.?" ', row_html).group(0))
    time_m.append(re.search(r'data-maletime="\d+"', row_html).group(0))
    event_num.append(re.search(r'data-parkrun="\d+"', row_html).group(0))
    num_vols.append(re.search(r'data-volunteers="\d+"', row_html).group(0))
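
To illustrate what such a rough cut looks like, here is a minimal sketch on a made-up row string. The sample values and the helper name `find_attribute` are hypothetical. Unlike the loop above, the helper returns `None` when a pattern does not match instead of raising an `AttributeError`.

```python
import re

# Made-up opening tag of one table row (placeholder values, not real data).
sample_row = (
    '<tr class="Results-table-row" data-date="2000-01-01" '
    'data-finishers="42" data-volunteers="7">'
)


def find_attribute(pattern, text):
    """Return the full match of `pattern` in `text`, or None if it is absent."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


rough_date = find_attribute(r'data-date="\d{4}-\d{2}-\d{2}"', sample_row)
missing = find_attribute(r'data-maletime="\d+"', sample_row)

print(rough_date)  # data-date="2000-01-01"
print(missing)     # None
```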


I then clean the information so only the relevant bits remain.

In [109]:
lists = [dates, first_f, time_f, finishers, first_m, time_m, event_num, num_vols]

# keep only the value between the quotes and drop the attribute name
for column in lists:
    for i, item in enumerate(column):
        column[i] = re.search(r'[a-zA-Z\-]="([^"]+)"', item).group(1)
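
The extraction and cleaning steps can also be combined by reading the attributes straight from the parsed tag, since BeautifulSoup exposes them through `Tag.get`. This is an alternative sketch, not the approach used in this notebook. The HTML snippet and its values are made up, and it assumes the values live in data-* attributes of the tr tag, as the patterns above suggest.

```python
from bs4 import BeautifulSoup

# Made-up table with a single row; all attribute values are placeholders.
sample_html = """
<table class="Results-table">
  <tr class="Results-table-row" data-date="2000-01-01"
      data-finishers="42" data-volunteers="7"
      data-male="Runner A" data-maletime="1830">
    <td>1</td>
  </tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_rows = sample_soup.find_all("tr", class_="Results-table-row")

records = []
for row in sample_rows:
    records.append({
        "date": row.get("data-date"),  # Tag.get returns None if the attribute is missing
        "number_of_finishers": row.get("data-finishers"),
        "number_of_volunteers": row.get("data-volunteers"),
        "first_male": row.get("data-male"),
        "first_male_time": row.get("data-maletime"),
        "first_female": row.get("data-female"),  # absent in the sample row
    })

print(records[0]["date"])          # 2000-01-01
print(records[0]["first_female"])  # None
```

Because the values arrive without the attribute name, no second cleaning pass is needed, and a missing attribute shows up as `None` rather than as an exception.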

Finally, the times are stored as four digits (mmss), so I insert a colon to make them readable. I assume here that the fastest times are always in the format mm:ss. This is a fairly reasonable assumption, as winning 5k times are never below ten minutes and rarely over an hour.

In [110]:
for times in [time_f, time_m]:
    for i, item in enumerate(times):
        times[i] = item[:2] + ':' + item[2:]
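
The slicing above can be wrapped in a small helper that checks the four-digit assumption explicitly. This is only a sketch: the function names `format_mmss` and `mmss_to_seconds` are hypothetical, and values that are not exactly four digits are rejected rather than guessed at.

```python
import re


def format_mmss(raw):
    """Turn a four-digit 'mmss' string such as '1830' into '18:30'."""
    if not re.fullmatch(r"[0-9]{4}", raw):
        raise ValueError(f"expected four digits (mmss), got {raw!r}")
    return raw[:2] + ":" + raw[2:]


def mmss_to_seconds(raw):
    """Convert a four-digit 'mmss' string into total seconds."""
    minutes, seconds = format_mmss(raw).split(":")
    return int(minutes) * 60 + int(seconds)


print(format_mmss("1830"))      # 18:30
print(mmss_to_seconds("1830"))  # 1110
```

Raising on unexpected input makes a time outside the mm:ss assumption visible instead of producing a wrongly placed colon.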

In the last step, I create a pandas dataframe from the lists and write it to a csv file.

In [111]:
data_dict = {'date': dates, 
             'first_female': first_f,
             'first_female_time': time_f,
             'first_male': first_m,
             'first_male_time': time_m,
             'number_of_finishers': finishers,
             'number_of_volunteers': num_vols,
             'event_number': event_num}

data = pd.DataFrame(data_dict)

print(data.head())  # preview the first rows

data.to_csv('5k_KN_history.csv', index=False)
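
The anonymisation itself is not shown in this notebook. One possible approach, sketched here under that caveat, is to replace each distinct name with a stable label before writing the file. The function name `pseudonymise` and the label format are assumptions for illustration.

```python
def pseudonymise(names, mapping=None, prefix="runner"):
    """Replace each distinct name with a label like 'runner_001'.

    Labels are assigned in order of first appearance. Passing the same
    `mapping` dict to several calls keeps labels consistent across lists.
    """
    if mapping is None:
        mapping = {}
    result = []
    for name in names:
        if name not in mapping:
            mapping[name] = f"{prefix}_{len(mapping) + 1:03d}"
        result.append(mapping[name])
    return result


# Made-up names, not real participants.
print(pseudonymise(["Runner A", "Runner B", "Runner A"]))
# ['runner_001', 'runner_002', 'runner_001']
```

In this notebook, the helper could be applied to first_f and first_m with one shared mapping before the dataframe is built, so that the same person gets the same label in both columns.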