I'm working on converting the scores on USAU tournament pages into a data table for analysis with pandas. Having played organized Ultimate Frisbee for over 3 years and inspected the USAU website's score tables, I have a rough idea of what I'm looking for. The USAU tournament info is split into slides/tabs for the pools and brackets.

Pool play is under the section with `id="poolSlide"`. Within it, each pool sits under a section with `data-type="pool"` and is laid out as a table with `class="scores_table"`. Each game has its own table row with a `data-game` identifier.
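
As a minimal sketch of that nesting, the CSS selector below walks from the pool slide down to the individual game rows. The HTML snippet is a made-up, heavily simplified stand-in, not a copy of the USAU markup: only `poolSlide`, `data-type="pool"`, `scores_table`, and `data-game` come from the page inspection described above, while the tag names, game IDs, and the `13 - 9` score format are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; tag names, game IDs, and score format are assumptions.
SAMPLE_HTML = """
<section id="poolSlide">
  <div data-type="pool">
    <table class="global_table scores_table">
      <tbody>
        <tr data-game="101"><td>Harvard (11)</td><td>Oklahoma (20)</td><td>13 - 9</td></tr>
        <tr data-game="102"><td>Kansas (13)</td><td>Kennesaw State (14)</td><td>10 - 12</td></tr>
      </tbody>
    </table>
  </div>
</section>
<section id="bracketSlide"></section>
"""

def pool_games(soup):
    """Return (game id, cell texts) for every game row under the pool slide."""
    selector = '#poolSlide [data-type="pool"] table.scores_table tr[data-game]'
    games = []
    for tr in soup.select(selector):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        games.append((tr["data-game"], cells))
    return games

games = pool_games(BeautifulSoup(SAMPLE_HTML, "html.parser"))
for game_id, cells in games:
    print(game_id, cells)
```

Scoping the selector to `#poolSlide` keeps any score tables that live under the bracket slide out of the result.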

Bracket play is under the section with `id="bracketSlide"` and is laid out quite differently, because the markup is arranged to look like the bracket itself.

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

In [31]:
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
# US english
LANGUAGE = "en-US,en;q=0.5"

def get_soup(url):
    """Constructs and returns a soup using the HTML content of `url` passed"""
    # initialize a session
    session = requests.Session()
    # set the User-Agent as a regular browser
    session.headers['User-Agent'] = USER_AGENT
    # request for english content (optional)
    session.headers['Accept-Language'] = LANGUAGE
    session.headers['Content-Language'] = LANGUAGE
    # make the request and fail loudly on HTTP errors (4xx/5xx)
    response = session.get(url)
    response.raise_for_status()
    # return the soup
    return bs(response.content, "html.parser")

def get_all_scoretables(soup):
    """Extracts and returns all score tables in a soup object"""
    # select() takes a CSS selector; find_all() would treat the string as a tag name
    return soup.select("table.global_table.scores_table")

def get_table_name(table):
    """Given a table soup, returns the table's name"""
    return table.find("thead").find("tr").find("th").text.strip()

def get_table_headers(table):
    """Given a table soup, returns all the headers"""
    headers = []
    for th in table.find("tbody").find("tr").find_all("th"):
        headers.append(th.text.strip())
    return headers

def get_table_rows(table):
    """Given a table, returns all its rows"""
    rows = []
    for tr in table.find("tbody").find_all("tr")[1:]:
        cells = []
        # grab all td tags in this table row
        tds = tr.find_all("td")
        if len(tds) == 0:
            # if there are no td tags, fall back to th tags
            # (some tables, e.g. on Wikipedia, use th cells in body rows)
            ths = tr.find_all("th")
            for th in ths:
                cells.append(th.text.strip())
        else:
            # use regular td tags
            for td in tds:
                cells.append(td.text.strip())
        rows.append(cells)
    return rows

In [34]:
url = "https://play.usaultimate.org/events/Florida-Warm-Up-2019/schedule/Men/CollegeMen/"
soup = get_soup(url)
tables = get_all_scoretables(soup)
print(f"Found a total of {len(tables)} tables.")
for i, table in enumerate(tables, start=1):
    headers = get_table_headers(table)
    rows = get_table_rows(table)
    table_name = f"table{i}"
    # index=False avoids a stray "Unnamed: 0" column when the CSV is read back
    pd.DataFrame(rows, columns=headers).to_csv(f"{table_name}.csv", index=False)

Found a total of 5 tables.
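
`get_table_name` is defined above but never used, so the CSV files end up with positional names like `table2.csv`. One possible way to get more descriptive file names is to turn each table's name into a filesystem-safe slug. The sketch below uses only the standard library; `slugify` is a hypothetical helper, and "Pool A" is an assumed example of what `get_table_name` might return.

```python
import re

def slugify(name, fallback="table"):
    """Lowercase `name` and collapse runs of non-alphanumerics into underscores."""
    slug = re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")
    # An empty or all-punctuation name would give an empty slug, so use a fallback.
    return slug or fallback

print(slugify("Pool A"))
print(slugify("  Pool B (Men's) "))
```

In the loop above, `f"{slugify(get_table_name(table))}.csv"` could stand in for the positional file name. Two tables with the same name would overwrite each other, so keeping the index in the name may still be worthwhile.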


In [37]:
df = pd.read_csv('./table2.csv')
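# split the combined Score column into one integer column per team
# (expand=False makes str.extract return a Series instead of a one-column DataFrame)
df['Team 1 Score'] = df['Score'].str.extract(r'(^\d+)', expand=False).astype(int)
df['Team 2 Score'] = df['Score'].str.extract(r'(\d+$)', expand=False).astype(int)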
df = df[['Date','Time','Team 1','Team 2','Team 1 Score','Team 2 Score']]
df.head()

,Date,Time,Team 1,Team 2,Team 1 Score,Team 2 Score
0,Fri 2/8,9:00 AM,Harvard (11),Oklahoma (20),13,9
1,Fri 2/8,9:00 AM,Kansas (13),Kennesaw State (14),10,12
2,Fri 2/8,9:00 AM,Alabama-Huntsville (30),Texas-Dallas (28),7,13
3,Fri 2/8,9:00 AM,Texas A&M (25),South Carolina (23),10,12
4,Fri 2/8,9:00 AM,Rutgers (22),Michigan (16),10,13
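
The same parsing can be checked on plain strings before touching the DataFrame. The sketch below uses only `re`. It assumes the raw `Score` cell looks like `13 - 9` (two integers around a separator) and that the number in parentheses after each team name is its seed. `split_score` and `split_seed` are hypothetical helpers, not part of the notebook above.

```python
import re

def split_score(score):
    """Return (team 1 score, team 2 score), or None if either side is not a number."""
    match = re.fullmatch(r"\s*(\d+)\D+(\d+)\s*", score)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def split_seed(team):
    """Split 'Harvard (11)' into ('Harvard', 11); the seed is None when absent."""
    match = re.fullmatch(r"(.*?)\s*\((\d+)\)\s*", team)
    if match is None:
        return team.strip(), None
    return match.group(1), int(match.group(2))

print(split_score("13 - 9"))
print(split_seed("Kennesaw State (14)"))
# "W - F" is a hypothetical non-numeric entry, e.g. a forfeit marker.
print(split_score("W - F"))
```

Returning `None` for an unparseable score makes rows such as forfeits or unplayed games easy to filter out, whereas `.astype(int)` on the extracted column would raise as soon as one of them appears.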


Shout out to https://www.thepythoncode.com/article/convert-html-tables-into-csv-files-in-python.