# Web scraping marathon results from marathonguide.com

[Marathonguide.com](http://www.marathonguide.com/) has a database of marathon results from c. 2000 onwards. For each documented marathon, the database lists names, times, overall place, place within sex, age group division (e.g. Female 30-34), and whether or not the time would have qualified the runner for the Boston Marathon (one of the most prestigious road marathons in the world).



The website also has an overview for each marathon, which typically documents some combination of: number of finishers (also split male/female), winning finishing times (split male/female), average finish time, and standard deviation. It is this overview that I'll be scraping, with the aim of analysing the following:
- Whether average finishing time has improved in the past two decades
- Whether technology has had an impact on marathon racing over the past two decades
- Any other trends
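
To compare average finishing times across years, the scraped time strings would first need to become numbers. The following is a minimal sketch, assuming the overview reports times as `H:MM:SS` strings (the exact format on the site is not confirmed here). The function names and sample values are hypothetical.

```python
def finish_time_to_seconds(text):
    """Convert an 'H:MM:SS' finish time string into total seconds."""
    hours, minutes, seconds = (int(part) for part in text.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds


def mean_finish_seconds(times):
    """Average a list of 'H:MM:SS' strings, returning seconds as a float."""
    values = [finish_time_to_seconds(t) for t in times]
    return sum(values) / len(values)


# Hypothetical sample values for illustration, not scraped data
sample = ["4:30:00", "4:15:30"]
print(mean_finish_seconds(sample))
```

Working in seconds keeps later comparisons and plots simple; converting back to `H:MM:SS` is only needed for display.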

As of the last cell, there are only two errored marathon links; both redirect to a website maintenance page. No entries in the dictionary are set to NA, so of the roughly 12k marathons from 2000-2020 we have valid data for all but two of them.
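
Because the failed links redirect to a maintenance page, trying them again after a pause might recover them. The sketch below is one possible approach and is not part of the original notebook. `fetch_with_retry`, `fetch`, and `is_valid` are hypothetical names, and the caller supplies both the download function and the check for a usable page.

```python
import time


def fetch_with_retry(fetch, url, is_valid, attempts=3, delay=1.0):
    """Call fetch(url) until is_valid(result) is true or attempts run out.

    fetch    -- callable taking a URL and returning a response (assumed)
    is_valid -- callable returning True when the response is usable (assumed)
    Returns the first valid result, or None if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            result = fetch(url)
            if is_valid(result):
                return result
        except Exception:
            pass  # treat a raised error the same as an invalid result
        if attempt < attempts:
            time.sleep(delay)  # pause before the next try
    return None
```

In this notebook, `fetch` could wrap `requests.get`, and `is_valid` could parse the page and check that it contains more than eight tables, since the scraping loop reads `soup.find_all('table')[8]`.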

In [1]:
from bs4 import BeautifulSoup
import requests
import time
import re
import pandas as pd

In [4]:
#link to access results for each marathon
link = 'http://www.marathonguide.com/results/'
#set up dictionary to write into
maras = {}
errors = {}
NA = ['NA', 'NA', 'NA', 'NA', 'NA', 'NA']

#loop through each year's results page, 2000 to 2020
for i in range(0, 21):
    j = f"20{i:02d}" #zero-pad so 0 -> "2000" and 10 -> "2010"
    #rebuild year_URL each iteration so years are not appended to one another
    year_URL = 'http://www.marathonguide.com/results/browse.cfm?Year=' + j
    print(year_URL)
    page = requests.get(year_URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    #find table with list of marathons by date
    by_date = soup.find_all('table')[7].td
    marathons = by_date.find_all('a')
    #loop through each marathon in table
    for marathon in marathons:
        marathon_name = marathon.text
        marathon_name = marathon_name + " " + j #naming as name + year
        link_suffix = marathon['href']
        URL = link + link_suffix
        page = requests.get(URL)
        soup = BeautifulSoup(page.content, 'html.parser')
        try:
            rel_table = soup.find_all('table')[8]
            td = rel_table.find_all('td')[0]
            try:
                info_list = td.text.splitlines()
                if info_list[3] == "":
                    info_list = info_list[5:12] #for those entries with weird formatting
                else:
                    info_list = info_list[1:7] #cut out start and end of list
                maras[marathon_name] = info_list
            except Exception: #e.g. overview cell has fewer lines than expected
                maras[marathon_name] = NA
        except Exception: #e.g. expected table missing from the page
            print(URL)
            errors[marathon_name] = URL
            print('error')
        print(marathon_name)

http://www.marathonguide.com/results/browse.cfm?Year=2000
Christmas Marathon 2000
Jacksonville Marathon 2000
Dallas White Rock Marathon 2000
Honolulu Marathon 2000
Hops Marathon by the Bay 2000
Kiawah Island Marathon 2000
Rocket City Marathon 2000
California International Marathon 2000
Memphis Marathon 2000
Raleigh Marathon 2000
Tucson Marathon 2000
Western Hemisphere Marathon 2000
Seattle Marathon 2000
Space Coast Marathon 2000
North Central Trail Marathon 2000
Atlanta Marathon 2000
Gobbler Grind Marathon 2000
Philadelphia Marathon 2000
Midsouth Marathon 2000
Oklahoma Marathon 2000
Richmond Marathon 2000
Valley of Fire Marathon 2000
Long Beach Marathon 2000
Ocean State Marathon 2000
Chickamauga Battlefield Marathon 2000
Harrisburg Marathon 2000
Mid-Atlantic Cross Country Challenge 2000
Montgomery County Marathon In The Parks 2000
New York City Marathon 2000
San Antonio Marathon 2000
Santa Clarita Marathon 2000
Vulcan Marathon 2000
Dublin Marathon 2000
Cape Cod Marathon 2000
Columbus M
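
The slicing applied to `info_list` in the loop above can be pulled into a small function so that it can be checked without hitting the website. This is a sketch that mirrors the notebook's two slices as written; `extract_overview` is a hypothetical name, and the sample strings in the usage lines are made up.

```python
NA = ['NA'] * 6  # same placeholder as in the notebook


def extract_overview(cell_text):
    """Return the overview lines from a table cell's text.

    Mirrors the notebook: if the fourth line is empty, take lines 5..11,
    otherwise take lines 1..6. Falls back to NA when the text is too short.
    """
    lines = cell_text.splitlines()
    try:
        if lines[3] == "":
            return lines[5:12]  # entries with the unusual layout
        return lines[1:7]       # usual layout: drop the leading line
    except IndexError:
        return list(NA)


# Hypothetical cell text, for illustration only
print(extract_overview("\nA\nB\nC\nD\nE\nF\nG"))
print(extract_overview("too short"))
```

Note that the first slice keeps up to seven lines while the second keeps six, so rows produced by the two branches can differ in length.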

In [7]:
maras_series = {}
#number the rows from 1 for simple dictionary keys
for y, value in enumerate(maras.values(), start=1):
    maras_series[str(y)] = pd.Series(value)
df = pd.DataFrame.from_dict(maras_series, orient = 'index') #save each entry as rows
df.to_csv('marathons_all_no_index.csv', index = False)
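
The cell above replaces each marathon name with a running number, so the name and year are not written to the CSV. If they are needed later, one option is to write the key as the first column. The sketch below uses the standard library `csv` module rather than pandas, purely to stay dependency-free. `write_with_names` and the sample dictionary are hypothetical.

```python
import csv
import io


def write_with_names(maras, handle):
    """Write one CSV row per marathon: the name first, then its values."""
    writer = csv.writer(handle)
    for name, values in maras.items():
        writer.writerow([name, *values])


# Hypothetical entry for illustration, not scraped data
sample = {"Example Marathon 2000": ["a", "b"]}
buffer = io.StringIO()
write_with_names(sample, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open(path, 'w', newline='')` as `handle`.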

In [11]:
print(errors)
for key, value in maras.items():
    if value == NA: #list any marathon stored with the placeholder values
        print(key)

{'Turkey Swamp Races 2012': 'http://www.marathonguide.com/results/browse.cfm?MIDD=4588120819', 'Sunflower Trail Marathon 2019': 'http://www.marathonguide.com/results/browse.cfm?MIDD=7058190504'}
