I'm working on converting the scores on USAU tournament pages into a data table for analysis with pandas. Having played organized Ultimate Frisbee for over 3 years and inspected the USAU website's score tables, I have a rough idea of what I'm looking for. The USAU tournament info is split into slides/tabs for the pools and brackets.

Pool play is under the section with `id="poolSlide"`. Within it, each pool sits in a section with `data-type="pool"`, and its games are laid out in a table with `class="scores_table"`. Each game is its own table row with a `data-game` identifier.
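
The nesting described above can be walked with BeautifulSoup roughly as follows. This is only a sketch against a hand-written HTML fragment: the fragment, its tag names and cell values, and the `iter_pool_games` helper are assumptions for illustration. Only the `id`, `data-type`, `class`, and `data-game` attributes come from inspecting the site.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the real page; tag names and values are assumptions.
SAMPLE_HTML = """
<div id="poolSlide">
  <div data-type="pool">
    <table class="scores_table">
      <thead><tr><th>Pool A</th></tr></thead>
      <tbody>
        <tr><th>Date</th><th>Score</th></tr>
        <tr data-game="101"><td>Fri 2/8</td><td>13 - 9</td></tr>
        <tr data-game="102"><td>Fri 2/8</td><td>11 - 12</td></tr>
      </tbody>
    </table>
  </div>
</div>
"""

def iter_pool_games(soup):
    """Yield (game_id, cell_texts) for every game row under the pool slide."""
    slide = soup.find(id="poolSlide")
    if slide is None:
        return
    for pool in slide.find_all(attrs={"data-type": "pool"}):
        for table in pool.find_all("table", class_="scores_table"):
            # Only rows carrying a data-game attribute are actual games.
            for row in table.find_all("tr", attrs={"data-game": True}):
                cells = [td.get_text(strip=True) for td in row.find_all("td")]
                yield row["data-game"], cells

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
pool_games = list(iter_pool_games(soup))
```

Matching on attributes rather than tag names keeps the sketch from depending on whether the site uses `div`, `section`, or something else for each level.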

Bracket play is under the section with `id="bracketSlide"`. It is set up quite differently, because the markup is arranged so that the page looks like the bracket itself.

In [61]:
import os
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

In [73]:
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
# US english
LANGUAGE = "en-US,en;q=0.5"

def get_soup(url):
    """Constructs and returns a soup using the HTML content of `url` passed"""
    # initialize a session
    session = requests.Session()
    # set the User-Agent as a regular browser
    session.headers['User-Agent'] = USER_AGENT
    # request English content (optional); Content-Language describes a body,
    # so it is not needed on a GET request
    session.headers['Accept-Language'] = LANGUAGE
    # make the request and fail loudly on HTTP errors
    response = session.get(url)
    response.raise_for_status()
    # return the soup
    return bs(response.content, "html.parser")

Get pool play -- or consolation pool -- data

In [81]:
def get_all_scoretables(soup):
    """Extracts and returns all tables in a soup object"""
    return soup.find_all("table",class_="scores_table")

def get_scoretable_name(table):
    """Given a table soup, returns the table's name"""
    return table.find("thead").find("tr").find("th").text.strip()

def get_table_headers(table):
    """Given a table soup, returns all the headers"""
    headers = []
    for th in table.find("tbody").find("tr").find_all("th"):
        headers.append(th.text.strip())
    return headers

def get_table_rows(table):
    """Given a table, returns all its rows"""
    rows = []
    for tr in table.find("tbody").find_all("tr")[1:]:
        cells = []
        # grab all td tags in this table row
        tds = tr.find_all("td")
        if len(tds) == 0:
            # if no td tags, fall back to th tags
            # (some pages mark up body rows with th cells instead)
            ths = tr.find_all("th")
            for th in ths:
                cells.append(th.text.strip())
        else:
            # use regular td tags
            for td in tds:
                cells.append(td.text.strip())
        rows.append(cells)
    return rows
#Shout out to https://www.thepythoncode.com/article/convert-html-tables-into-csv-files-in-python.

def scrape_and_clean_scoretables(base_url, event):
    """Scrapes every score table for `event`, saves each as a CSV, then cleans the scores"""
    # scrape
    soup = get_soup(base_url + event)
    tables = get_all_scoretables(soup)
    print(f"Found {len(tables)} score tables")
    filepath = f"./data/ultimate/{event}"
    os.makedirs(filepath, exist_ok=True)
    print(f"Data files stored under {filepath}")
    # organize
    for i, table in enumerate(tables, start=1):
        headers = get_table_headers(table)
        rows = get_table_rows(table)
        df = pd.DataFrame(rows, columns=headers)
        df['Table'] = get_scoretable_name(table)
        df['Event'] = event
        # index=False keeps a stray "Unnamed: 0" column out of the CSV
        df.to_csv(os.path.join(filepath, f"table{i}.csv"), index=False)
    # clean
    # the range ends at len(tables) + 1 so the last table is not skipped
    for i in range(1, len(tables) + 1):
        csv_path = os.path.join(filepath, f"table{i}.csv")
        df = pd.read_csv(csv_path)
        if 'Score' not in df.columns:
            continue
        # split the score column into one integer column per team
        df['Team 1 Score'] = df['Score'].str.extract(r'(^\d+)').astype(int)
        df['Team 2 Score'] = df['Score'].str.extract(r'(\d+$)').astype(int)
        df = df[['Event','Table','Date','Time','Team 1','Team 2','Team 1 Score','Team 2 Score']]
        df.to_csv(csv_path, index=False)

In [82]:
base_url = "https://play.usaultimate.org/events/"
event = "Florida-Warm-Up-2019/schedule/Men/CollegeMen/"
scrape_and_clean_scoretables(base_url,event)

Found 3 score tables
Data files stored under ./data/ultimate/Florida-Warm-Up-2019/schedule/Men/CollegeMen/


AssertionError: 8 columns passed, passed data had 100 columns
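
The `AssertionError` above is raised by `pd.DataFrame(rows, columns=headers)` when a row holds a different number of cells than there are headers (here 8 headers against a row of 100 cells). One way to narrow it down is to set aside the rows whose width does not match before building the DataFrame. The helper name `split_rows_by_width` and the sample data below are made up for illustration; this is a diagnostic sketch, not a fix for whatever causes the oversized row.

```python
def split_rows_by_width(headers, rows):
    """Separate rows whose cell count matches the header count from those that do not."""
    good, bad = [], []
    for row in rows:
        (good if len(row) == len(headers) else bad).append(row)
    return good, bad

# Made-up data: two headers, one well-formed row, one row with too many cells.
headers = ["Team 1", "Score"]
rows = [["A", "13 - 9"], ["B", "11 - 12", "extra", "cells"]]
good, bad = split_rows_by_width(headers, rows)
print(f"{len(good)} usable rows, {len(bad)} rows with unexpected width")
```

In `scrape_and_clean_scoretables`, `good` could be passed to `pd.DataFrame` while `bad` is printed for inspection. Why this page yields a 100-cell row has not been determined here.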

Now let's look at bracket play

In [76]:
base_url = "https://play.usaultimate.org/events/"
event = "Florida-Warm-Up-2019/schedule/Men/CollegeMen/"
soup = get_soup(base_url+event)
bracketgames = soup.find_all("div",class_="bracket_game")
print(bracketgames[0])

<div class="bracket_game top_game" data-index="1" data-relation="" id="game206416">
<div class="gameID_area">
<div class="gameID">
<span></span>
<p><a href="/teams/events/match_report/?EventGameId=xlo48rmrkjLrXsswELflat3wy3sU5%2b0YzNZlu17L5Qo%3d">G1</a></p>
</div>
</div>
<div class="top_area winner">
<span class="isScore">
<span class="score adjust-data" data-type="game-score-home">15</span>
</span>
<span class="isName">
<span class="team adjust-data" data-type="game-team-home"><a href="/events/teams/?EventTeamId=qO2i5gaszd2WIAT%2fYAFGl4cSLWjNsJocDhD7EiOQr2Y%3d">Brown (2)</a></span>
</span>
</div>
<div class="btm_area loser">
<span class="isScore">
<span class="score adjust-data" data-type="game-score-away">9</span>
</span>
<span class="isName">
<span class="team adjust-data" data-type="game-team-away"><a href="/events/teams/?EventTeamId=%2f%2f0z18TPo74yY8%2bc2GMmvtvGXkA1uMsdzz8gEdmyglc%3d">Central Florida (4)</a></span>
</span>
</div>
<p class="location">field 2</p>
<span class="game-

In [80]:
games = []
for game in bracketgames:
    # each lookup assumes the span exists, as in the game printed above
    home_team = game.find("span", {"class": "team", "data-type": "game-team-home"}).text.strip()
    away_team = game.find("span", {"class": "team", "data-type": "game-team-away"}).text.strip()
    home_score = game.find("span", {"class": "score", "data-type": "game-score-home"}).text.strip()
    away_score = game.find("span", {"class": "score", "data-type": "game-score-away"}).text.strip()
    game_time = game.find("span", {"class": "date"}).text.strip()
    games.append([event, game_time, home_team, away_team, home_score, away_score])

headers = ['event','game_time','home_team','away_team','home_score','away_score']
filepath = f"./data/ultimate/{event}"
os.makedirs(filepath, exist_ok=True)
pd.DataFrame(games, columns=headers).to_csv(os.path.join(filepath, "bracketplay.csv"), index=False)
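
In the bracket markup printed above, team names carry the seed in parentheses (`Brown (2)`, `Central Florida (4)`). A possible cleaning step is to split the two apart. This is a sketch that assumes every team follows the `Name (seed)` pattern, which has only been observed on the one game printed here; the helper name `split_seed` is hypothetical.

```python
import re

# Lazily capture the name, then a trailing "(digits)" seed.
SEED_PATTERN = re.compile(r"^(?P<name>.*?)\s*\((?P<seed>\d+)\)\s*$")

def split_seed(team_text):
    """Return (team_name, seed) from text like 'Brown (2)'; seed is None if absent."""
    text = team_text.strip()
    match = SEED_PATTERN.match(text)
    if match is None:
        return text, None
    return match.group("name"), int(match.group("seed"))

print(split_seed("Brown (2)"))
print(split_seed("Central Florida (4)"))
```

Names without a trailing seed are returned unchanged with `None` as the seed, so the helper does not fail on entries that do not fit the assumed pattern.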

I still need a good way to assign tournament IDs. So far I have been assigning them manually, but I may want to look into anonymizing the games or other data cleaning techniques.
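
One option for the tournament IDs, sketched here as an alternative to the manual method, is to derive a short, repeatable ID from the event path with a hash. The function name `tournament_id` and the 8-character length are arbitrary choices for illustration.

```python
import hashlib

def tournament_id(event, length=8):
    """Derive a short, repeatable ID from an event path such as the `event` string above."""
    # Normalize so trailing slashes or case differences do not change the ID.
    key = event.strip("/").lower()
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:length]

event = "Florida-Warm-Up-2019/schedule/Men/CollegeMen/"
print(tournament_id(event))
```

The same event always maps to the same ID, so re-running the scrape does not renumber anything. Note that a hash of a public event path is repeatable but not anonymizing, since anyone who knows the path can recompute it.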