# Scraping all runner data from an event
This script scrapes the majority of the data: all runner results from every run that has taken place at this event. It largely reuses the code from [the other scraping script](2024-09-08_scraping_5k_KN_history.ipynb), with some adjustments that allow it to scrape a different part of the website.

In [1]:
# import relevant libraries
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import hashlib

First, I use the requests library to fetch the HTML content of the website. It is important to pass a User-Agent header with the request; otherwise the returned content is a 403 error message.

You need to enter the base URL of the event. The script then opens the latest results page, from which it reads how many events have taken place.

In [2]:
base_URL = 'REDACTED'
URL = base_URL + '/results/latestresults/'
headers = {'User-Agent': 'REDACTED'}
page = requests.get(URL, headers=headers)
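Because a request without a suitable User-Agent header comes back as a 403, it can help to check the status before handing the content to the parser. The following is a minimal sketch and not part of the original notebook: `fetch_html` and its `session` parameter are hypothetical names, and the 30-second timeout is an arbitrary assumption.

```python
def fetch_html(url, user_agent, session=None):
    """Return the page HTML, failing loudly instead of parsing an error page."""
    if session is None:
        # Imported lazily so the helper can also be exercised with a stub session.
        import requests
        session = requests.Session()

    response = session.get(url, headers={'User-Agent': user_agent}, timeout=30)
    if response.status_code == 403:
        # The status the notebook reports when the User-Agent header is missing.
        raise PermissionError(f'403 Forbidden for {url}: check the User-Agent header')
    response.raise_for_status()  # any other 4xx/5xx status raises an HTTPError
    return response.text
```

With the variables from the cell above, this could be called as `fetch_html(URL, headers['User-Agent'])`.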

I use BeautifulSoup to parse the HTML content and extract the number of the latest event.

In [3]:
soup = BeautifulSoup(page.content, 'html.parser')
res_header = soup.find('div', {'class': 'Results-header'})
event_number = int(str(res_header.find(string=re.compile(r'#\d+')))[1:])
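The cell above relies on the header containing a text node that starts with `#` followed only by digits. As a hedged alternative, the digits can be captured directly with a regex group, which also gives a clearer error when the pattern is absent. `parse_event_number` is a hypothetical helper, and the sample header text is made up for illustration.

```python
import re

def parse_event_number(header_text):
    """Extract the integer that follows '#' in the results header text."""
    match = re.search(r'#(\d+)', header_text)
    if match is None:
        raise ValueError(f'no event number found in {header_text!r}')
    return int(match.group(1))

# Made-up header text, for illustration only.
print(parse_event_number('Sunday 08/09/2024 | #123'))  # prints 123
```

In the notebook this could be applied to the header's text, for example `parse_event_number(res_header.get_text())`.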

In [29]:
events = [] # number of the event each result row belongs to
position = []
name = []
agegroup = []
gender = []
club = []
times = []
age_grade = [] # age graded ranking (percentage)
runs = [] # number of runs the runner has completed
volunteered = [] # number of times the runner has volunteered at a running event

Next, I create a for loop that repeats the following steps for every run in the event's history.

In [37]:
# The event website uses a captcha to restrict access. Accordingly, you have to adjust
# the range of this for loop and run it multiple times.
for event in range(event_number):
    # scrape html content of the event result page
    URL = base_URL + f'/results/{event+1}'
    page = requests.get(URL, headers=headers)
    
    # make a list of all rows from the results table
    soup = BeautifulSoup(page.content, 'html.parser')
    table = soup.find('table', {'class':'Results-table'})
    rows = table.find_all('tr', class_='Results-table-row')
    
    # iterate over each row and extract the information
    for r in rows:
        events.append(event+1)
        
        if re.search(r'data-name="[^0-9]+(?: [^0-9]+)*\.?"', str(r)) is not None:
            name.append(re.search(r'data-name="[^0-9]+(?: [^0-9]+)*\.?"', str(r)).group(0))
        else:
            name.append('NaN')
            
        if re.search(r'data-position="\d+"', str(r)) is not None:
            position.append(re.search(r'data-position="\d+"', str(r)).group(0))
        else:
            position.append('NaN')
            
        if re.search(r'data-gender="[^"]*"', str(r)) is not None:
            gender.append(re.search(r'data-gender="[^"]*"', str(r)).group(0))
        else:
            gender.append('NaN')
            
        if re.search(r'data-club="[^"]*"', str(r)) is not None:
            club.append(re.search(r'data-club="[^"]*"', str(r)).group(0))
        else:
            club.append('NaN')
        
        if re.search(r'data-vols="\d+"', str(r)) is not None:
            volunteered.append(re.search(r'data-vols="\d+"', str(r)).group(0))
        else:
            volunteered.append('NaN')
            
        if re.search(r'data-runs="\d+"', str(r)) is not None:
            runs.append(re.search(r'data-runs="\d+"', str(r)).group(0))
        else:
            runs.append('NaN')
        
        if re.search(r'data-agegrade="\d+\.\d+"', str(r)) is not None:
            age_grade.append(re.search(r'data-agegrade="\d+\.\d+"', str(r)).group(0))
        else:
            age_grade.append('NaN')
            
        if re.search(r'data-agegroup="[A-Z]{2}\d{2}(-\d{2})?"', str(r)) is not None:
            agegroup.append(re.search(r'data-agegroup="[A-Z]{2}\d{2}(-\d{2})?"', str(r)).group(0))
        else:
            agegroup.append('NaN')

        if r.find(string=re.compile(r'(\d:)?\d{2}:\d{2}')) is not None:
            finish_time = r.find(string=re.compile(r'(\d:)?\d{2}:\d{2}'))
            times.append(finish_time)
        else:
            times.append('NaN')
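The comment at the top of the loop notes that the captcha forces the scrape to be done in several passes with an adjusted range. One way to keep track of those passes is sketched here. `event_batches` and `batch_size` are hypothetical names, and the batch size of 10 is an arbitrary assumption, not a known limit of the website.

```python
def event_batches(last_event, batch_size, first_event=1):
    """Yield (start, stop) pairs that cover events first_event..last_event."""
    start = first_event
    while start <= last_event:
        stop = min(start + batch_size - 1, last_event)
        yield start, stop
        start = stop + 1

# Example: 25 events scraped in passes of at most 10 events each.
for start, stop in event_batches(25, 10):
    print(start, stop)
# prints:
# 1 10
# 11 20
# 21 25
```

Each pair corresponds to `range(start - 1, stop)` in the loop above, since the loop requests `event+1`. The result lists should be initialised only once, so that later passes append to the rows already collected.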

Some of the lists still contain more information than we want: each item holds the full attribute, such as `data-club="..."`. This step extracts only the value between the quotation marks for every item in these lists.

In [46]:
lists = [agegroup, gender, club, age_grade, runs, volunteered, position]

for l in lists:
    for i, item in enumerate(l):
        if re.search(r'[a-zA-Z\-]="([^"]*)"', item) is not None:
            l[i] = re.search(r'[a-zA-Z\-]="([^"]*)"', item).group(1)
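The search in the loop and the clean-up step can also be folded into one, by putting a capture group around the attribute value. This is a sketch under assumptions, not the notebook's method: `attr_value` is a hypothetical helper, and the row markup below is made up for illustration.

```python
import re

def attr_value(row_html, attr, value_pattern=r'[^"]*', default='NaN'):
    """Return the value of a data-<attr>="..." attribute, or default if absent."""
    match = re.search(rf'data-{re.escape(attr)}="({value_pattern})"', row_html)
    return match.group(1) if match else default

# Hypothetical row markup, made up for illustration.
row = '<tr class="Results-table-row" data-position="7" data-club="" data-runs="12">'
print(attr_value(row, 'position', r'\d+'))       # prints 7
print(attr_value(row, 'club'))                   # prints an empty line (empty club)
print(attr_value(row, 'agegrade', r'\d+\.\d+'))  # prints NaN
```

The fallback stays the string `'NaN'`, matching the placeholder used in the loop.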

Now I combine all lists into a DataFrame and anonymise the data by hashing the runner names.

In [53]:
data_dict = {'event_nr': events,
             'position': position,
             'runner': name,
             'agegroup': agegroup,
             'gender': gender,
             'club': club,
             'time': times,
             'age_grade': age_grade,
             'no_runs': runs,
             'no_volunteered': volunteered}

data = pd.DataFrame(data_dict)

data['runner'] = data['runner'].apply(lambda x: hashlib.sha256(x.encode()).hexdigest())
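SHA-256 is deterministic, so the same name string always maps to the same 64-character hex digest, which keeps a runner's rows linkable across events without storing the name. The sketch below illustrates this with made-up names. The optional `salt` parameter is an assumption and not part of the notebook; with the default empty salt the function produces the same digest as the lambda above.

```python
import hashlib

def anonymise(name, salt=''):
    """Map a runner name to a SHA-256 hex digest (salt is an optional assumption)."""
    return hashlib.sha256((salt + name).encode()).hexdigest()

# Made-up names, for illustration only.
a1 = anonymise('Jane DOE')
a2 = anonymise('Jane DOE')
b = anonymise('John ROE')
print(a1 == a2)  # prints True
print(a1 == b)   # prints False
print(len(a1))   # prints 64
```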

Finally, I save the DataFrame to a CSV file.

In [57]:
data.to_csv('5k_KN_full_results.csv', index=False)