I am working on converting the scores on USAU tournament websites into a data table for analysis with pandas. Having played organized Ultimate Frisbee for over 3 years and inspected the USAU website's score tables, I have a rough idea of what I'm looking for. The USAU tournament info is split into slides/tabs for the pools and brackets.

Pool play is under the section with `id="poolSlide"`. Within it, each pool sits in a section with `data-type="pool"`, and its games are laid out in a table with `class="scores_table"`. Each game is its own table row with a `data-game` identifier.
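The nesting described above can be sketched on a small hand-written fragment. The HTML here is a hypothetical stand-in, not copied from the USAU site: only the `id`, `data-type`, `class`, and `data-game` hooks come from the description, while the element names, game ids, and cell contents are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mirrors the hooks described above
SAMPLE_HTML = """
<div id="poolSlide">
  <div data-type="pool">
    <table class="scores_table">
      <thead><tr><th>Pool A Schedule &amp; Scores</th></tr></thead>
      <tbody>
        <tr><th>Date</th><th>Score</th></tr>
        <tr data-game="1001"><td>Fri 2/8</td><td>13 - 9</td></tr>
        <tr data-game="1002"><td>Fri 2/8</td><td>10 - 12</td></tr>
      </tbody>
    </table>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Walk down: pool slide -> pool sections -> score tables -> game rows
pool_slide = soup.find(id="poolSlide")
game_ids = []
for pool in pool_slide.find_all(attrs={"data-type": "pool"}):
    for table in pool.find_all("table", class_="scores_table"):
        # attrs={"data-game": True} keeps only rows that carry the attribute,
        # which skips the header row
        for row in table.find_all("tr", attrs={"data-game": True}):
            game_ids.append(row["data-game"])

print(game_ids)
```

Filtering rows on the presence of `data-game` is one way to skip the header row without slicing by position.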

Bracket play is under the section with `id="bracketSlide"`. Its markup is unusual because the page is laid out to look like the bracket itself rather than a plain table.

In [61]:
import os
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

In [73]:
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
# US english
LANGUAGE = "en-US,en;q=0.5"

def get_soup(url):
    """Constructs and returns a soup using the HTML content of `url` passed"""
    # initialize a session
    session = requests.Session()
    # set the User-Agent as a regular browser
    session.headers['User-Agent'] = USER_AGENT
    # request for english content (optional)
    session.headers['Accept-Language'] = LANGUAGE
    session.headers['Content-Language'] = LANGUAGE
    # make the request
    html = session.get(url)
    # return the soup
    return bs(html.content, "html.parser")

In [107]:
def get_all_scoretables(soup):
    """Extracts and returns all score tables (class `scores_table`) in a soup object"""
    return soup.find_all("table",class_="scores_table")

def get_scoretable_name(table):
    """Given a table soup, returns the table's name"""
    return table.find("thead").find("tr").find("th").text.strip()

def get_table_headers(table):
    """Given a table soup, returns all the headers"""
    headers = []
    for th in table.find("tbody").find("tr").find_all("th"):
        headers.append(th.text.strip())
    return headers

def get_table_rows(table):
    """Given a table, returns all its rows"""
    rows = []
    for tr in table.find("tbody").find_all("tr")[1:]:
        cells = []
        # grab all td tags in this table row
        tds = tr.find_all("td")
        if len(tds) == 0:
            # if no td tags, fall back to th tags
            # (some tables use th cells for an entire row)
            ths = tr.find_all("th")
            for th in ths:
                cells.append(th.text.strip())
        else:
            # use regular td tags
            for td in tds:
                cells.append(td.text.strip())
        rows.append(cells)
    return rows
#Shout out to https://www.thepythoncode.com/article/convert-html-tables-into-csv-files-in-python.

def scrape_and_clean_scoretables(base_url, event):
    #scrape
    soup = get_soup(base_url+event)
    tables = get_all_scoretables(soup)
    T = len(tables)
    print(f"Found {T} score tables")
    #clean
    frames = []
    for i, table in enumerate(tables, start=1):
        headers = get_table_headers(table)
        rows = get_table_rows(table)
        df = pd.DataFrame(rows, columns=headers)
        df['table'] = get_scoretable_name(table)
        df['event'] = event
        df['home_score'] = df['Score'].str.extract(r'(^\d+)').astype(int)
        df['away_score'] = df['Score'].str.extract(r'(\d+$)').astype(int)
        df = df[['event','table','Date','Time','Team 1','Team 2','home_score','away_score']]
        frames.append(df)
    result = pd.concat(frames)
    return result

def scrape_and_clean_brackets(base_url, event):
    #scrape
    soup = get_soup(base_url+event)
    bracketgames = soup.find_all("div",{"class":"bracket_game"})
    print(f"Found {len(bracketgames)} bracket games")
    #clean
    games = []
    for i, game in enumerate(bracketgames):
        game_id = game['id']
        home_team = game.find("span",{"class":"team","data-type":"game-team-home"}).text
        away_team = game.find("span",{"class":"team","data-type":"game-team-away"}).text
        home_score = game.find("span",{"class":"score","data-type":"game-score-home"}).text
        away_score = game.find("span",{"class":"score","data-type":"game-score-away"}).text
        game_time = game.find("span",{"class":"date"}).text
        games.append([event,game_id,game_time,home_team,away_team,home_score,away_score])
    headers = ['event','game_id','game_time','home_team','away_team','home_score','away_score']
    result = pd.DataFrame(games,columns=headers)
    return result
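The `astype(int)` casts in `scrape_and_clean_scoretables` assume every `Score` cell starts and ends with digits; a cell that does not match either pattern yields `NaN`, and the cast then raises. A minimal sketch of a more forgiving parse follows. The sample strings, including the `"13 - 9"` layout and the forfeit-style marker, are assumptions for illustration and are not taken from the site.

```python
import pandas as pd

# Hypothetical 'Score' cells; the layout and the forfeit marker are assumptions
scores = pd.Series(["13 - 9", "10 - 12", "W - F", ""])

# expand=False returns a Series when the pattern has a single capture group
home_raw = scores.str.extract(r"^(\d+)", expand=False)
away_raw = scores.str.extract(r"(\d+)$", expand=False)

# to_numeric(errors="coerce") plus the nullable Int64 dtype keeps
# unparseable scores as <NA> instead of raising
home_score = pd.to_numeric(home_raw, errors="coerce").astype("Int64")
away_score = pd.to_numeric(away_raw, errors="coerce").astype("Int64")

parsed = pd.DataFrame({"home_score": home_score, "away_score": away_score})
print(parsed)
```

Rows with missing scores can then be dropped or inspected explicitly before computing differentials.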

**Let's get our data!**

In [108]:
BASE_URL = "https://play.usaultimate.org/events/"
EVENT = "Florida-Warm-Up-2019/schedule/Men/CollegeMen/"
FILE_PATH = f"./data/ultimate/{EVENT}"
os.makedirs(FILE_PATH, exist_ok=True)
print(f"Data files will be stored under {FILE_PATH}")

Data files will be stored under ./data/ultimate/Florida-Warm-Up-2019/schedule/Men/CollegeMen/


In [109]:
poolplay = scrape_and_clean_scoretables(BASE_URL,EVENT)
poolplay.to_csv(f"{FILE_PATH}poolplay.csv")
bracketplay = scrape_and_clean_brackets(BASE_URL, EVENT)
bracketplay.to_csv(f"{FILE_PATH}bracketplay.csv")

Found 3 score tables
Found 56 bracket games


Now that we have our data in a nice format, we can start doing some analysis.
Let's start by looking at each team's goal differential at a tournament.

In [110]:
poolplay['home_diff'] = poolplay['home_score'] - poolplay['away_score']
poolplay['away_diff'] = poolplay['away_score'] - poolplay['home_score']
poolplay.head()

Unnamed: 0,event,table,Date,Time,Team 1,Team 2,home_score,away_score,home_diff,away_diff
0,Florida-Warm-Up-2019/schedule/Men/CollegeMen/,Pool A Schedule & Scores,Fri 2/8,9:00 AM,Harvard (11),Oklahoma (20),13,9,4,-4
1,Florida-Warm-Up-2019/schedule/Men/CollegeMen/,Pool A Schedule & Scores,Fri 2/8,9:00 AM,Kansas (13),Kennesaw State (14),10,12,-2,2
2,Florida-Warm-Up-2019/schedule/Men/CollegeMen/,Pool A Schedule & Scores,Fri 2/8,9:00 AM,Alabama-Huntsville (30),Texas-Dallas (28),7,13,-6,6
3,Florida-Warm-Up-2019/schedule/Men/CollegeMen/,Pool A Schedule & Scores,Fri 2/8,9:00 AM,Texas A&M (25),South Carolina (23),10,12,-2,2
4,Florida-Warm-Up-2019/schedule/Men/CollegeMen/,Pool A Schedule & Scores,Fri 2/8,9:00 AM,Rutgers (22),Michigan (16),10,13,-3,3


In [121]:
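# .copy() avoids renaming columns on a view of poolplay
home = poolplay[['Team 1','home_diff']].copy()
home.columns = ['team','diff']
away = poolplay[['Team 2','away_diff']].copy()
away.columns = ['team','diff']
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
merged = pd.concat([home, away], ignore_index=True)
merged.head(10)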


Unnamed: 0,team,diff
0,Harvard (11),4
1,Kansas (13),-2
2,Alabama-Huntsville (30),-6
3,Texas A&M (25),-2
4,Rutgers (22),-3
5,Northeastern (18),-1
6,Georgia (10),1
7,Illinois State (12),-2
8,Central Florida (4),2
9,LSU (15),-1
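
A natural next step is to sum these per-game differentials by team. The sketch below is one possible way to do that, not part of the original notebook: the rows are hypothetical placeholders in the same `team`/`diff` shape as `merged`, and treating the parenthesized number as a seed to strip is an assumption.

```python
import pandas as pd

# Hypothetical rows shaped like `merged` above (placeholder teams, not real results)
merged = pd.DataFrame({
    "team": ["Team A (1)", "Team B (2)", "Team A (1)", "Team C (3)", "Team B (2)"],
    "diff": [4, -4, 2, -2, -1],
})

# Drop a trailing "(number)" so the same team groups together even if the suffix varies
merged["name"] = merged["team"].str.replace(r"\s*\(\d+\)$", "", regex=True)

# Games played and total goal differential per team, best differential first
totals = (
    merged.groupby("name")["diff"]
    .agg(games="count", total_diff="sum")
    .sort_values("total_diff", ascending=False)
)
print(totals)
```

With the real `merged` frame, the same `groupby` would give one row per team for the tournament's pool play.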
