Exploratory Analysis
====================

First try at pulling the game data down and exploring how the website is organized.

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy
import pandas
import seaborn

from bs4 import BeautifulSoup
import requests

In [2]:
# future note: there may be 555 saved bonspiels on the site...
# looping over that number could be possible for pulling everything down
test_site = 'http://results.worldcurling.org/Championship/Details/555'
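
The loop hinted at in the comment above could look roughly like the sketch below. It assumes that tournament IDs run contiguously from 1 to 555 and that every one uses the same `Championship/Details/<id>` pattern, neither of which I have verified. `details_url` and `MAX_TOURNAMENT_ID` are names made up for illustration. Nothing is downloaded here; the code only builds the URLs.

```python
# Sketch only: the 1..555 ID range is an unverified guess from the note above.
BASE_URL = 'http://results.worldcurling.org/Championship/Details/{}'
MAX_TOURNAMENT_ID = 555  # assumed upper bound, not confirmed against the site


def details_url(tournament_id):
    """Return the details-page URL for one tournament ID (hypothetical helper)."""
    return BASE_URL.format(tournament_id)


# Build every candidate URL up front; fetching them would be a separate step.
all_urls = [details_url(i) for i in range(1, MAX_TOURNAMENT_ID + 1)]
print(len(all_urls))
print(all_urls[-1])
```

If this is ever pointed at the live site, it would be polite to pause between requests and to skip IDs that do not return a 200.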

In [4]:
r = requests.get(test_site)
assert r.status_code == 200

soup = BeautifulSoup(r.text, 'html.parser')

Game Scores
-----------

From my quick search before setting up this notebook, I was only pulling the final scores, not the box scores... Let's see if I can figure that out.

These are pulled from the **Draw** tab within each tournament results page.

In [5]:
scores = soup.find_all('td', class_='col-md-2')
print(len(scores))
print(scores[0])

88
<td class="col-md-2 text-center">
(1)                            <br/>
SWE - JPN                            <br/>
8 - 5                    </td>


In [6]:
data = scores[0].text.split()
# data looks like ['(1)', 'SWE', '-', 'JPN', '8', '-', '5']
game_data = {'draw': int(data[0].lstrip('(').rstrip(')')),
             'team1': data[1], 'team2': data[3],
             'team1_score': int(data[4]), 'team2_score': int(data[6])}
print(game_data)

{'team2': 'JPN', 'team1_score': 8, 'draw': 1, 'team2_score': 5, 'team1': 'SWE'}


In [7]:
def create_game_data(game_soup):
    """Parse one Draw-tab cell into a dict; return None for non-game cells."""
    data = game_soup.text.split()
    if len(data) <= 1:
        return None
    draw = data[0].lstrip('(').rstrip(')')
    game_data = {'team1': data[1], 'team2': data[3],
                 'team1_score': int(data[4]), 'team2_score': int(data[6])}
    try:
        # numeric draw labels are round-robin draws
        game_data['draw'] = 'D{}'.format(int(draw))
        game_data['playoffs'] = False
    except ValueError:
        # anything non-numeric is treated as a playoff game
        game_data['draw'] = draw
        game_data['playoffs'] = True
    return game_data

tourney_data = [create_game_data(s) for s in scores]
tourney_data = [d for d in tourney_data if d is not None]
print(len(tourney_data))  # games kept; the remaining cells from soup.find_all were false positives

71
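
As an alternative to splitting on whitespace and indexing by position, the same cell text could be matched with one regular expression. This is only a sketch: the pattern is inferred from the single cell printed earlier, `parse_game_text` is a hypothetical helper, and draw labels containing spaces would not match.

```python
import re

# Matches text such as "(1) SWE - JPN 8 - 5"; the layout is an assumption
# based on the one cell shown above.
GAME_RE = re.compile(r'\((\w+)\)\s+(\w+)\s*-\s*(\w+)\s+(\d+)\s*-\s*(\d+)')


def parse_game_text(text):
    """Return a dict for one game cell, or None if the text does not match."""
    match = GAME_RE.search(text)
    if match is None:
        return None
    draw, team1, team2, score1, score2 = match.groups()
    return {'draw': draw, 'team1': team1, 'team2': team2,
            'team1_score': int(score1), 'team2_score': int(score2)}


sample = '(1)\nSWE - JPN\n8 - 5'
print(parse_game_text(sample))
print(parse_game_text('Position'))  # a non-game cell gives None
```

The upside is that a cell that does not look like a game is rejected explicitly instead of being detected by its token count.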


Box Scores
----------

But we can't fit a useful model to final scores alone: it would probably just predict that the teams with consistently higher scores (or score differentials, $\Delta$score) win, which *a priori* should not hold for every tourney. We want box scores to capture some of the strategy (blanking ends, converting with the hammer, etc.) without having to watch the games.
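
To make those strategy terms concrete, here is a minimal sketch of how hammer possession, blank ends and steals could be derived from two line scores. It assumes the standard rule that the team that scores gives up last stone and that a blank end leaves the hammer where it was. `track_hammer` and `summarize_game` are hypothetical helpers, and the line scores are hard-coded for illustration.

```python
def track_hammer(ends_a, ends_b, a_has_lsfe):
    """Return which team ('A' or 'B') holds the hammer in each end."""
    hammer = 'A' if a_has_lsfe else 'B'
    holders = []
    for a, b in zip(ends_a, ends_b):
        holders.append(hammer)
        if a > 0:
            hammer = 'B'  # A scored, so B gets last stone in the next end
        elif b > 0:
            hammer = 'A'  # B scored, so A gets last stone in the next end
        # blank end (0-0): the hammer stays where it is
    return holders


def summarize_game(ends_a, ends_b, a_has_lsfe):
    """Count blank ends and steals (points scored without the hammer)."""
    holders = track_hammer(ends_a, ends_b, a_has_lsfe)
    blanks = sum(1 for a, b in zip(ends_a, ends_b) if a == 0 and b == 0)
    steals_a = sum(1 for h, a in zip(holders, ends_a) if h == 'B' and a > 0)
    steals_b = sum(1 for h, b in zip(holders, ends_b) if h == 'A' and b > 0)
    return {'hammer': holders, 'blank_ends': blanks,
            'steals_a': steals_a, 'steals_b': steals_b}


# Illustrative line scores; team B starts with last stone.
team_a = [1, 0, 2, 0, 1, 0, 2, 0, 0, 2]
team_b = [0, 1, 0, 2, 0, 0, 0, 2, 0, 0]
print(summarize_game(team_a, team_b, a_has_lsfe=False))
```

Features like these are only available from end-by-end scores, which is why the final scores alone are not enough.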

In [8]:
box = soup.find_all('div', class_='col-md-12')
print(len(box))
print(box[0])

53
<div class="col-md-12">
<h4>Men</h4>
</div>


In [9]:
box[1]

<div class="col-md-12">
<b>
            St Jakobshalle
        </b>
</div>

In [10]:
for i in range(10):
    print(box[i])

<div class="col-md-12">
<h4>Men</h4>
</div>
<div class="col-md-12">
<b>
            St Jakobshalle
        </b>
</div>
<div class="col-md-12">
    BASEL
    <br/>
    Switzerland
</div>
<div class="col-md-12">
        4/2/2016 - 4/10/2016
    </div>
<div class="col-md-12">
<div class="panel panel-default">
<div class="table-responsive">
<table class="table">
<thead>
<tr>
<th class="col-md-1 text-center">Position</th>
<th class="col-md-1 text-center">Record</th>
<th class="col-md-1"> </th>
<th class="col-md-9">Association</th>
</tr>
</thead>
<tbody>
<tr>
<td class="col-md-1 text-center">
                                1
                            </td>
<td class="col-md-1 text-center">
                                12 - 1
                            </td>
<td class="col-md-1">
<img alt="Canada" class="img-border-line" height="20" src="http://wcfresults.azurewebsites.net/Content/Images/Flags/canada.gif" width="40"/>
</td>
<td class="col-md-9">
<div class="col-md-12">
<b>Canada</b>
</

These are obviously not the box scores...and I think I know why.

On a fresh load of the page, the **Results** tab (which contains the box scores) *does not* show any scores by default. Instead, there are a couple of links that load the data onto the page, presumably because most people just want a few scores and not the entire block.

For my test page, the link **Show all games** has the target:
```
http://results.worldcurling.org/Championship/DisplayResults?tournamentId=555&associationId=0&drawNumber=0
```
This link format may be how I get REST-style access to the scores: just visiting that URL gives a raw dump of the scores!
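
A small helper can build that URL from its query parameters instead of hard-coding the string. This is a sketch under assumptions: the parameter names are copied from the link above, but treating `0` as "all associations" and "all draws" is only a guess based on the **Show all games** link, and `results_url` is a hypothetical name.

```python
from urllib.parse import urlencode

RESULTS_URL = 'http://results.worldcurling.org/Championship/DisplayResults'


def results_url(tournament_id, association_id=0, draw_number=0):
    """Build a DisplayResults URL; 0 is assumed to mean 'no filter'."""
    # A list of pairs keeps the parameters in the same order as the site link.
    query = urlencode([('tournamentId', tournament_id),
                       ('associationId', association_id),
                       ('drawNumber', draw_number)])
    return '{}?{}'.format(RESULTS_URL, query)


print(results_url(555))
```

Passing a non-zero `draw_number` might return a single draw, which could be one way to recover draw numbers, but that is untested.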

In [11]:
test_site_2 = 'http://results.worldcurling.org/Championship/DisplayResults?tournamentId=555&associationId=0&drawNumber=0'

In [12]:
r2 = requests.get(test_site_2)
assert r2.status_code == 200

soup2 = BeautifulSoup(r2.text, 'html.parser')

In [13]:
box2 = soup2.find_all('div', class_='col-md-12')
print(len(box2))
print(box2[1])

285
<div class="col-md-12">
<h5>Draw #1</h5>
<p>
<b>4/2/2016 2:00 PM</b>
</p>
<p>
<table class="table game-table">
<thead>
<tr class="game-header-row">
<th class="game-header text-center" colspan="3">
Draw #1            </th>
<th class="text-center">1</th>
<th class="text-center">2</th>
<th class="text-center">3</th>
<th class="text-center">4</th>
<th class="text-center">5</th>
<th class="text-center">6</th>
<th class="text-center">7</th>
<th class="text-center">8</th>
<th class="text-center">9</th>
<th class="text-center">10</th>
<th class="text-center"> </th>
<th class="text-center"> </th>
<th class="text-right">Total</th>
</tr>
</thead>
<tbody>
<tr>
<td class="game-sheet" rowspan="2" style="vertical-align:middle">
A            </td>
<td class="game-team">
                Sweden
            </td>
<td class="game-hammer">
                     
            </td>
<td class=" game-end10 text-center">
                    1
                </td>
<td class=" game-end10 text-center">

Funny, the first entry is *the entire tourney!* Start at entry 1 to get individual game data.

In [14]:
just_box_scores = soup2.find_all('table', class_='game-table')
print(len(just_box_scores))
# print(just_box_scores[0])

71


OK, so this might be better than grabbing the `div`s and doing something with them. The good news is that we have the same number of total games (71) as getting just the final scores from the previous site, so we shouldn't be missing anything. We can verify that later, which wouldn't be too bad.

So, getting the tables for the scores, now we need to convert those into data structures that we can actually use.

In [15]:
for row in just_box_scores[0].find_all('tr'):
    for i, col in enumerate(row.find_all('td')):
        print('{}: {}'.format(i, col))

0: <td class="game-sheet" rowspan="2" style="vertical-align:middle">
A            </td>
1: <td class="game-team">
                Sweden
            </td>
2: <td class="game-hammer">
                     
            </td>
3: <td class=" game-end10 text-center">
                    1
                </td>
4: <td class=" game-end10 text-center">
                    0
                </td>
5: <td class=" game-end10 text-center">
                    2
                </td>
6: <td class=" game-end10 text-center">
                    0
                </td>
7: <td class=" game-end10 text-center">
                    1
                </td>
8: <td class=" game-end10 text-center">
                    0
                </td>
9: <td class=" game-end10 text-center">
                    2
                </td>
10: <td class=" game-end10 text-center">
                    0
                </td>
11: <td class=" game-end10 text-center">
                    0
                </

So let's look at the individual data for just this one game to think about what data we should save and how we should save it.

- Sheet: td class `game-sheet`, only in the first row
- LSFE (last stone in the first end, i.e. which team starts with the hammer): td class `game-hammer` contains an asterisk (`*`)
- Team names: td class `game-team`, in each row separately
- Final score: td class `game-total`, in each row
- Individual end scores: td class `game-end10`, with points scored in that end for each team

We can build up an array-like structure to save the individual end scores, and everything else can just be a string or value. This data does not contain the draw number, but maybe we can do that with the RESTful access to the site. I'll have to look more into that.
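
One possible shape for that structure is sketched below using dataclasses: a list for the end scores and plain values for everything else. The class and field names (`TeamLine`, `Game`, `has_lsfe`, and so on) are my own invention for illustration, not anything defined by the site.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TeamLine:
    """One team's row in a box score."""
    name: str
    has_lsfe: bool
    ends: List[int] = field(default_factory=list)
    total: Optional[int] = None

    def ends_match_total(self):
        """Sanity check: the end scores should add up to the reported total."""
        return self.total is None or sum(self.ends) == self.total


@dataclass
class Game:
    """One game: a sheet plus two team rows."""
    sheet: str
    team1: TeamLine
    team2: TeamLine
    draw: Optional[str] = None  # not present in the box-score table itself


game = Game(
    sheet='A',
    team1=TeamLine('Sweden', False, [1, 0, 2, 0, 1, 0, 2, 0, 0, 2], 8),
    team2=TeamLine('Japan', True, [0, 1, 0, 2, 0, 0, 0, 2, 0, 0], 5),
)
print(game.team1.ends_match_total(), game.team2.ends_match_total())
```

Keeping `draw` optional leaves room to fill it in later if the draw number can be recovered some other way.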

In [16]:
import re

# name = re.compile(r'\w+')  # only matches 'United' for United States of America
name = re.compile(r'[\w\s]+')  # matches full USA!

for box_score in just_box_scores:
    for row in box_score.find_all('tr'):
        sheet, team_name = None, None
        sheet = row.find('td', class_='game-sheet')
        team = row.find('td', class_='game-team')
        if sheet is not None:
            sheet = sheet.text.strip()
        if team is not None:
            team_name = re.search(name, team.text.strip()).group(0)
            print('sheet: {}, team: {}'.format(sheet, team_name))

sheet: A, team: Sweden
sheet: None, team: Japan
sheet: B, team: Korea
sheet: None, team: Scotland
sheet: C, team: Norway
sheet: None, team: Russia
sheet: D, team: Germany
sheet: None, team: Switzerland
sheet: A, team: Russia
sheet: None, team: Korea
sheet: B, team: United States of America
sheet: None, team: Denmark
sheet: C, team: Canada
sheet: None, team: Finland
sheet: D, team: Norway
sheet: None, team: Scotland
sheet: B, team: Switzerland
sheet: None, team: Sweden
sheet: C, team: Germany
sheet: None, team: Japan
sheet: A, team: Denmark
sheet: None, team: Canada
sheet: B, team: Scotland
sheet: None, team: Russia
sheet: C, team: Korea
sheet: None, team: Norway
sheet: D, team: United States of America
sheet: None, team: Finland
sheet: A, team: Japan
sheet: None, team: Switzerland
sheet: B, team: Canada
sheet: None, team: United States of America
sheet: C, team: Finland
sheet: None, team: Denmark
sheet: D, team: Sweden
sheet: None, team: Germany
sheet: A, team: Germany
sheet: None, tea

So a quick and easy `re` search *will* pull out the team names correctly...except for country names with spaces. Let's see if I can fix that. **FIXED!**

So, I can extract the sheet and the two teams fine, and I can probably reset building up the box score by the double-`None`s that I see for the `team_name` and `sheet` values. Next, I need to actually get the score data from the rows. Let's try doing this separately.
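
Here is a rough sketch of the grouping step on `(sheet, team_name)` pairs like the ones printed above. Note that it swaps in a slightly different trigger than the double-`None` reset: a new game starts whenever a sheet letter appears, which relies on the sheet cell only being present in the first row of each game. `group_rows` is a hypothetical helper and the rows are hard-coded.

```python
def group_rows(rows):
    """Group (sheet, team) rows into games; a non-None sheet starts a new game."""
    games = []
    for sheet, team in rows:
        if team is None:
            continue  # header rows carry neither a sheet nor a team
        if sheet is not None:
            games.append({'sheet': sheet, 'teams': [team]})
        elif games:
            games[-1]['teams'].append(team)
    return games


rows = [(None, None), ('A', 'Sweden'), (None, 'Japan'),
        (None, None), ('B', 'Korea'), (None, 'Scotland')]
for game in group_rows(rows):
    print(game)
```

Since every `game-table` holds exactly one game in the output so far, looping table by table may make this grouping unnecessary, but it is a cheap consistency check.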

In [18]:
for i, box_score in enumerate(just_box_scores):
    print('=== GAME {} ==='.format(i + 1))
    for row in box_score.find_all('tr'):
        hammer = row.find('td', class_='game-hammer')
        scores = row.find_all('td', class_='game-end10')
        total = row.find('td', class_='game-total')
        if scores:
            scores = [end.text.strip().replace('X', '') for end in scores]
            scores = [int(end) for end in scores if end.isdigit()]
        if hammer:
            # an asterisk marks the team with last stone in the first end
            hammer = hammer.text.strip() == '*'
        if total:
            total = int(total.text.strip())
            print('LSFE: {}, scores: {}, total: {} ({})'.format(hammer, scores, total, sum(scores)))
    print()

=== GAME 1 ===
LSFE: False, scores: [1, 0, 2, 0, 1, 0, 2, 0, 0, 2], total: 8 (8)
LSFE: True, scores: [0, 1, 0, 2, 0, 0, 0, 2, 0, 0], total: 5 (5)

=== GAME 2 ===
LSFE: False, scores: [0, 0, 1, 0, 0, 0, 2, 0], total: 3 (3)
LSFE: True, scores: [0, 2, 0, 0, 2, 3, 0, 2], total: 9 (9)

=== GAME 3 ===
LSFE: True, scores: [2, 0, 0, 2, 0, 0, 4, 3], total: 11 (11)
LSFE: False, scores: [0, 2, 0, 0, 1, 0, 0, 0], total: 3 (3)

=== GAME 4 ===
LSFE: True, scores: [0, 0, 1, 0, 1, 0, 0, 0], total: 2 (2)
LSFE: False, scores: [1, 1, 0, 3, 0, 2, 0, 1], total: 8 (8)

=== GAME 5 ===
LSFE: False, scores: [0, 3, 0, 0, 2, 0, 1, 0, 0, 2, 1], total: 9 (9)
LSFE: True, scores: [2, 0, 1, 1, 0, 2, 0, 0, 2, 0, 0], total: 8 (8)

=== GAME 6 ===
LSFE: False, scores: [0, 0, 0, 0, 1, 0, 0, 0], total: 1 (1)
LSFE: True, scores: [2, 0, 1, 0, 0, 1, 1, 2], total: 7 (7)

=== GAME 7 ===
LSFE: True, scores: [1, 0, 1, 0, 1, 1, 0, 0, 3], total: 7 (7)
LSFE: False, scores: [0, 1, 0, 1, 0, 0, 0, 1, 0], total: 3 (3)

=== GAME 8 ===
LS

Awesome! Other things to extract would be `game-total` for the final scores of the game (not really necessary except as a check that I pulled all of the data properly) and `game-hammer` to see which team has LSFE/hammer.

Since the above numbers are still strings, I would need to parse them into numbers (dropping the Xs for unplayed ends), and probably drop the empty ends from the list. Let's try to include this in the code above. I'm also including the total score check from both summing the ends and getting the total from the desired `td` on the site.

*Note: I pulled the `print` statement in this and team name extraction into the last `if` statement to get rid of the excess `None` rows from the output. This should be good to use when I'm loading everything into a dataframe.*
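
For the dataframe step, one option is to flatten each team's line score into one record per end (long format) and check the total on the way. The sketch below uses plain dicts so it runs without pandas; `to_records` and the column names are assumptions of mine, and the values are hard-coded from the first game above.

```python
def to_records(game_id, team, has_lsfe, ends, total):
    """Return one dict per end for a single team's line score."""
    if sum(ends) != total:
        raise ValueError('end scores for {} sum to {}, expected {}'.format(
            team, sum(ends), total))
    return [{'game': game_id, 'team': team, 'lsfe': has_lsfe,
             'end': end_number, 'points': points}
            for end_number, points in enumerate(ends, start=1)]


records = (to_records(1, 'Sweden', False, [1, 0, 2, 0, 1, 0, 2, 0, 0, 2], 8)
           + to_records(1, 'Japan', True, [0, 1, 0, 2, 0, 0, 0, 2, 0, 0], 5))
print(len(records))
print(records[0])
# pandas.DataFrame(records) would then give one row per team per end.
```

The long format copes with games of different lengths (8, 10 or 11 ends in the output above) without padding.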