Parsing Trial
=============

I can get the raw data, but raw curling box scores aren't as interesting as the data that goes into strategic decisions, like whether to blank an end, or how often a team converts the hammer into points. We can get a lot of this team-agnostic data from just the box scores, which is great!

The initial work to pull in data will be almost identical to what we did during the exploratory analysis, which makes another case for developing my own access API for the data...

In [1]:
%matplotlib inline

import re

import matplotlib.pyplot as plt
import numpy
import pandas
import seaborn

from bs4 import BeautifulSoup
import requests

In [2]:
name = re.compile(r'[\w\s]+')  # raw string so \w and \s reach the regex engine untouched
params = {'tournamentId': 555, 'associationId': 0, 'drawNumber': 0}
results_site = r'http://results.worldcurling.org/Championship/DisplayResults'

r = requests.get(results_site, params=params)
assert r.status_code == 200
soup = BeautifulSoup(r.text, 'html.parser')

In [3]:
box_scores = soup.find_all('table', class_='game-table')
len(box_scores)

71

In [4]:
print(box_scores[0])

<table class="table game-table">
<thead>
<tr class="game-header-row">
<th class="game-header text-center" colspan="3">
Draw #1            </th>
<th class="text-center">1</th>
<th class="text-center">2</th>
<th class="text-center">3</th>
<th class="text-center">4</th>
<th class="text-center">5</th>
<th class="text-center">6</th>
<th class="text-center">7</th>
<th class="text-center">8</th>
<th class="text-center">9</th>
<th class="text-center">10</th>
<th class="text-center"> </th>
<th class="text-center"> </th>
<th class="text-right">Total</th>
</tr>
</thead>
<tbody>
<tr>
<td class="game-sheet" rowspan="2" style="vertical-align:middle">
A            </td>
<td class="game-team">
                Sweden
            </td>
<td class="game-hammer">
                     
            </td>
<td class=" game-end10 text-center">
                    1
                </td>
<td class=" game-end10 text-center">
                    0
                </td>
<td class=" game-end10 text-center"

OK, we have our box scores loaded in correctly, so let's work on pulling out the data. We'll build up just the general raw data first, then do some operations on it. Let's grab our basic information from the single game data, then worry about organizing the trickier end score data.

In [5]:
game = box_scores[0]  # just to test

draw = game.find('th', class_='game-header').text.strip()
sheet = game.find('td', class_='game-sheet').text.strip()
teams = [t.text.strip() for t in game.find_all('td', class_='game-team')]
hammer = [h.text.strip() for h in game.find_all('td', class_='game-hammer')]
final_score = [int(s.text.strip()) for s in game.find_all('td', class_='game-total')]

for t, h, f in zip(teams, hammer, final_score):
    print(t, h, f)

Sweden  8
Japan * 5


In [14]:
draw

'Draw #1'

In [6]:
rows = game.find_all('tr', class_=None)

end_data = []
for row in rows:
    scores = row.find_all('td', class_='game-end10')
    scores = [s.text.strip() for s in scores]
    scores = [int(s) for s in scores if s != '']  # compare by value, not identity
    end_data.append(scores)

print(end_data)
for data in zip(teams, hammer, end_data, final_score):
    print(data)

[[1, 0, 2, 0, 1, 0, 2, 0, 0, 2], [0, 1, 0, 2, 0, 0, 0, 2, 0, 0]]
('Sweden', '', [1, 0, 2, 0, 1, 0, 2, 0, 0, 2], 8)
('Japan', '*', [0, 1, 0, 2, 0, 0, 0, 2, 0, 0], 5)


Awesome! I think this way is a little easier or clearer than what I had done before, plus it gets around having a bunch of `if` statements just to avoid the information present in the header row. This is looking good. Now, I can generate real data from this information.
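Before building on these lists, a quick consistency check might be worthwhile. This is only a sketch: `check_box_score` is a hypothetical helper (not part of the notebook), and it assumes every game has exactly two teams, that exactly one team is marked with `*`, and that the per-end scores should add up to the `game-total` cell.

```python
def check_box_score(teams, hammer, end_data, final_score):
    """Raise ValueError if the parsed pieces of one game look inconsistent."""
    if not (len(teams) == len(hammer) == len(end_data) == len(final_score) == 2):
        raise ValueError('expected exactly two teams per game')
    if hammer.count('*') != 1:
        raise ValueError('expected exactly one team to start with the hammer')
    if len(end_data[0]) != len(end_data[1]):
        raise ValueError('teams have a different number of ends')
    for team, ends, total in zip(teams, end_data, final_score):
        if sum(ends) != total:
            raise ValueError('{}: ends sum to {}, total says {}'.format(team, sum(ends), total))


# values copied from the output above; passes silently if everything agrees
check_box_score(['Sweden', 'Japan'], ['', '*'],
                [[1, 0, 2, 0, 1, 0, 2, 0, 0, 2], [0, 1, 0, 2, 0, 0, 0, 2, 0, 0]],
                [8, 5])
```

If a later game trips one of these checks, that would be a hint that its table is laid out differently (an unfinished game, say) and needs a closer look.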

Here's the (very rough) idea:
1. keep track of which team currently has the hammer
1. walk through the `end_data` and record the "type" of end for both teams
1. once you have this for every end, get aggregate data

Yeah, that is pretty vague, but we'll see.

Looking back at my previous code, I had to use `re` to get the team name... I'm not sure why that was, but I'll keep an eye on this just in case I get weird results using the above instead of `re`.

In [7]:
# note: use 'blank' for end you don't score in...? probably need more types
end_types = ('blank', 'blank-with-hammer', 'score-with-hammer', 'steal')  # steal == score-without-hammer
team0_data = [teams[0]]
team1_data = [teams[1]]

# transpose end data using zip trick
end_dataT = [[s1, s2] for s1, s2 in zip(*end_data)]
hammer_team = hammer.index('*')

def decide_end_type(points, hammer=False):
    if points > 0 and hammer:
        return end_types[2]
    elif points == 0 and hammer:
        return end_types[1]
    elif points > 0 and not hammer:
        return end_types[3]
    elif points == 0 and not hammer:
        return end_types[0]

for score in end_dataT:
    team0_score, team1_score = score
    t0_end, t1_end = [team0_score], [team1_score]
    
    t0_end.append(decide_end_type(team0_score, hammer=(hammer_team == 0)))
    t1_end.append(decide_end_type(team1_score, hammer=(hammer_team == 1)))
    
    team0_data.append(t0_end)
    team1_data.append(t1_end)
    
    # adjust hammer if score, don't if neither team scores
    if team0_score > 0:
        hammer_team = 1
    elif team1_score > 0:
        hammer_team = 0

for data in zip(team0_data, team1_data):
    print(data)

('Sweden', 'Japan')
([1, 'steal'], [0, 'blank-with-hammer'])
([0, 'blank'], [1, 'score-with-hammer'])
([2, 'score-with-hammer'], [0, 'blank'])
([0, 'blank'], [2, 'score-with-hammer'])
([1, 'score-with-hammer'], [0, 'blank'])
([0, 'blank'], [0, 'blank-with-hammer'])
([2, 'steal'], [0, 'blank-with-hammer'])
([0, 'blank'], [2, 'score-with-hammer'])
([0, 'blank-with-hammer'], [0, 'blank'])
([2, 'score-with-hammer'], [0, 'blank'])


As a first pass, this isn't bad. I obviously have to couple the two streams together more, since recording a `'blank-with-hammer'` when the other team actually scored is not good... It looks like `decide_end_type` should just take in both scores *and* the team with the hammer and return two values (or three, if you want it to also handle updating the team with the hammer). That would probably make it nicer.
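For reference, the three-value variant might look something like the sketch below. The name `classify_end` is hypothetical, and it assumes the same hammer rule the loop above uses: a team that scores hands the hammer to the other team, while a blank end or a steal leaves the hammer where it was.

```python
def classify_end(t0_score, t1_score, hammer_team):
    """Return (team 0 end type, team 1 end type, hammer team for the next end)."""
    if hammer_team not in (0, 1):
        raise ValueError('hammer_team must be 0 or 1')
    if t0_score > 0 and t1_score > 0:
        raise ValueError('only one team can score in an end')

    scores = (t0_score, t1_score)
    other = 1 - hammer_team
    types = ['blank', 'blank']

    if scores[hammer_team] > 0:
        types[hammer_team] = 'score-with-hammer'
        next_hammer = other          # scoring gives up the hammer
    elif scores[other] > 0:
        types[other] = 'steal'
        next_hammer = hammer_team    # a stolen end leaves the hammer where it was
    else:
        types[hammer_team] = 'blank-with-hammer'
        next_hammer = hammer_team    # a blank end keeps the hammer

    return types[0], types[1], next_hammer


# first end of the test game: Sweden (team 0) scores 1 while Japan holds the hammer
print(classify_end(1, 0, 1))
```

Indexing by `hammer_team` instead of spelling out each combination keeps the branches down to three, at the cost of being a bit less explicit than a flat `if`/`elif` chain.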

In [8]:
def determine_end_types(t0_score, t1_score, hammer_team):
    # regular scoring
    if t0_score > 0 and hammer_team == 0:
        return 'score-with-hammer', 'blank'
    elif t1_score > 0 and hammer_team == 1:
        return 'blank', 'score-with-hammer'
    # steals
    elif t0_score > 0 and hammer_team == 1:
        return 'steal', 'blank'
    elif t1_score > 0 and hammer_team == 0:
        return 'blank', 'steal'
    # blanks
    elif t0_score == 0 and t1_score == 0 and hammer_team == 0:
        return 'blank-with-hammer', 'blank'
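    elif t0_score == 0 and t1_score == 0 and hammer_team == 1:
        return 'blank', 'blank-with-hammer'
    # anything else (bad hammer_team, negative score) is a parsing problem
    raise ValueError('unexpected end: {}, {}, hammer {}'.format(t0_score, t1_score, hammer_team))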

In [9]:
team0_data = [teams[0]]
team1_data = [teams[1]]
hammer_team = hammer.index('*')

for score in end_dataT:
    t0_score, t1_score = score
    t0_type, t1_type = determine_end_types(t0_score, t1_score, hammer_team)
    
    t0_end, t1_end = [t0_score, t0_type], [t1_score, t1_type]
    team0_data.append(t0_end)
    team1_data.append(t1_end)
    
    # adjust hammer if score, don't if neither team scores
    if t0_score > 0:
        hammer_team = 1
    elif t1_score > 0:
        hammer_team = 0

for data in zip(team0_data, team1_data):
    print(data)

('Sweden', 'Japan')
([1, 'steal'], [0, 'blank'])
([0, 'blank'], [1, 'score-with-hammer'])
([2, 'score-with-hammer'], [0, 'blank'])
([0, 'blank'], [2, 'score-with-hammer'])
([1, 'score-with-hammer'], [0, 'blank'])
([0, 'blank'], [0, 'blank-with-hammer'])
([2, 'steal'], [0, 'blank'])
([0, 'blank'], [2, 'score-with-hammer'])
([0, 'blank-with-hammer'], [0, 'blank'])
([2, 'score-with-hammer'], [0, 'blank'])


OK, great! Now that we have this, we can start getting aggregate data from the game. Let's keep track of total ends, total ends with hammer, stolen ends, ends scored with the hammer, and ends scoring 2+ points with the hammer. I'm going to limit how much of each team's stats depends on what the other team does.

In [10]:
team0_data

['Sweden',
 [1, 'steal'],
 [0, 'blank'],
 [2, 'score-with-hammer'],
 [0, 'blank'],
 [1, 'score-with-hammer'],
 [0, 'blank'],
 [2, 'steal'],
 [0, 'blank'],
 [0, 'blank-with-hammer'],
 [2, 'score-with-hammer']]

In [11]:
def get_aggregate(team_data):
    data = {'blank': 0, 'blank-with-hammer': 0, 'steal': 0,
            'score-with-hammer': 0, 'score-2+-with-hammer': 0,
            'team-name': team_data[0], 'total-ends': len(team_data) - 1,
            'total-score': 0, 'stolen-points': 0}
    ends = team_data[1:]
    for end in ends:
        data[end[1]] += 1
        data['total-score'] += end[0]
        if end[1] == 'score-with-hammer' and end[0] > 1:
            data['score-2+-with-hammer'] += 1
        if end[1] == 'steal':
            data['stolen-points'] += end[0]
    return data

t0_aggregate = get_aggregate(team0_data)
t1_aggregate = get_aggregate(team1_data)
print(t0_aggregate)
print(t1_aggregate)

{'team-name': 'Sweden', 'stolen-points': 3, 'blank': 4, 'total-score': 8, 'blank-with-hammer': 1, 'steal': 2, 'score-with-hammer': 3, 'total-ends': 10, 'score-2+-with-hammer': 2}
{'team-name': 'Japan', 'stolen-points': 0, 'blank': 6, 'total-score': 5, 'blank-with-hammer': 1, 'steal': 0, 'score-with-hammer': 3, 'total-ends': 10, 'score-2+-with-hammer': 2}


OK, that seems like pretty decent aggregate data for now. It's not that bad for just a single box score.

We should also keep track of where the points are coming from, since stealing a lot of points is good. Let's include that above, so we don't have to reimplement everything.
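As a rough sketch of where this could go, here is one way to turn the counts into rates. `hammer_rates` and its output keys are hypothetical names, and treating "conversion" as scoring 2+ with the hammer is my assumption. One wrinkle: an end where the team held the hammer but got stolen on is recorded as a plain `'blank'`, so counting ends with hammer needs the opponent's `'steal'` count too.

```python
def hammer_rates(team_agg, opponent_agg):
    """Derive per-end rates for one team from the two aggregate dicts of a game."""
    # Ends where this team held the hammer: it scored, it blanked,
    # or the opponent stole (recorded as a plain 'blank' for this team).
    ends_with_hammer = (team_agg['score-with-hammer']
                        + team_agg['blank-with-hammer']
                        + opponent_agg['steal'])
    ends_without_hammer = team_agg['total-ends'] - ends_with_hammer

    def rate(count, total):
        # None when the denominator is zero, rather than dividing by zero
        return count / total if total else None

    return {
        'ends-with-hammer': ends_with_hammer,
        'hammer-conversion': rate(team_agg['score-2+-with-hammer'], ends_with_hammer),
        'steal-rate': rate(team_agg['steal'], ends_without_hammer),
        'stolen-share': rate(team_agg['stolen-points'], team_agg['total-score']),
    }


# aggregate values copied from the printed output above
sweden = {'team-name': 'Sweden', 'stolen-points': 3, 'blank': 4, 'total-score': 8,
          'blank-with-hammer': 1, 'steal': 2, 'score-with-hammer': 3,
          'total-ends': 10, 'score-2+-with-hammer': 2}
japan = {'team-name': 'Japan', 'stolen-points': 0, 'blank': 6, 'total-score': 5,
         'blank-with-hammer': 1, 'steal': 0, 'score-with-hammer': 3,
         'total-ends': 10, 'score-2+-with-hammer': 2}

print(hammer_rates(sweden, japan))
print(hammer_rates(japan, sweden))
```

For this game that gives Sweden 4 ends with the hammer and Japan 6, which matches walking the hammer through the ten ends by hand.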

This seems like a good way to build up aggregate data for a single tournament: loop over all of the round-robin games, then use that aggregate data to predict the match-up winners in bracket play.
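Summing the per-game dictionaries by team could be as simple as the sketch below. `combine_aggregates` is a hypothetical helper, and the extra `'games'` counter is my addition; it assumes every numeric key in the per-game dict should just be added up.

```python
from collections import Counter, defaultdict


def combine_aggregates(game_aggregates):
    """Sum per-game aggregate dicts into one Counter per team."""
    totals = defaultdict(Counter)
    for agg in game_aggregates:
        team = agg['team-name']
        for key, value in agg.items():
            if key != 'team-name':
                totals[team][key] += value
        totals[team]['games'] += 1
    return dict(totals)


# toy input: Sweden's line from the game above, counted twice just to show the summing
sweden = {'team-name': 'Sweden', 'stolen-points': 3, 'blank': 4, 'total-score': 8,
          'blank-with-hammer': 1, 'steal': 2, 'score-with-hammer': 3,
          'total-ends': 10, 'score-2+-with-hammer': 2}
totals = combine_aggregates([sweden, sweden])
print(totals['Sweden'])
```

A `Counter` per team means a key that is missing from one game simply counts as zero, so the per-game dictionaries don't all have to carry the same keys.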

In [12]:
print(len(box_scores))

final = box_scores[-1]
game_type = final.find('th', class_='game-header').text.strip()
print(game_type)

71
Final


In [13]:
for game in box_scores:
    game_type = game.find('th', class_='game-header')
    if game_type is not None:
        game_type = game_type.text.strip().lower()
        if not game_type.startswith('draw'):
            print(game_type)

play-off 1/2
play-off 3/4
semifinals
bronze game
final


In [15]:
s = [4, 0]
determine_end_types(*s, 0)  # unpack the test score itself, not the leftover loop variable `score`

('score-with-hammer', 'blank')