Exploratory Analysis
====================

First try at pulling the game data down and exploring how the website is organized.

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy
import pandas
import seaborn

from bs4 import BeautifulSoup
import requests

In [2]:
# future note: there may be 555 saved bonspiels on the site...
# looping over that number could be possible for pulling everything down
test_site = 'http://results.worldcurling.org/Championship/Details/555'
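A rough sketch of the loop mentioned in the comment above, assuming the tournament IDs really do run contiguously from 1 to 555 (unverified). `DETAILS_URL` and `details_urls` are hypothetical names introduced here for illustration.

```python
# Assumption: every ID from first_id to last_id has a page; gaps are possible.
DETAILS_URL = 'http://results.worldcurling.org/Championship/Details/{}'


def details_urls(first_id=1, last_id=555):
    """Build the details-page URL for each candidate tournament ID."""
    return [DETAILS_URL.format(i) for i in range(first_id, last_id + 1)]


urls = details_urls()
print(len(urls))
print(urls[-1])
```

Each URL could then go through the same `requests.get` call used below, skipping any ID that does not return a 200.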

In [3]:
r = requests.get(test_site)
assert r.status_code == 200

soup = BeautifulSoup(r.text, 'html.parser')

Game Scores
-----------

From my quick search before setting up this notebook, I was only pulling the final scores, not the box scores... Let's see if I can figure that out.

These are pulled from the **Draw** tab within each tournament results page.

In [4]:
scores = soup.find_all('td', class_='col-md-2')
print(len(scores))
print(scores[0])

88
<td class="col-md-2 text-center">
(1)                            <br/>
SWE - JPN                            <br/>
8 - 5                    </td>


In [5]:
data = scores[0].text.split()
# tokens: ['(1)', 'SWE', '-', 'JPN', '8', '-', '5']
game_data = {'draw': int(data[0].lstrip('(').rstrip(')')),
             'team1': data[1], 'team2': data[3],
             'team1_score': int(data[4]), 'team2_score': int(data[6])}
print(game_data)

{'team2_score': 5, 'draw': 1, 'team2': 'JPN', 'team1_score': 8, 'team1': 'SWE'}


In [6]:
def create_game_data(game_soup):
    """Parse one score cell into a dict; returns None for cells without a game."""
    data = game_soup.text.split()
    if len(data) > 1:
        draw = data[0].lstrip('(').rstrip(')')
        game_data = {'team1': data[1], 'team2': data[3],
                     'team1_score': int(data[4]), 'team2_score': int(data[6])}
        try:
            # numbered draws are round-robin games; any other label is treated as a playoff
            game_data['draw'] = 'D{}'.format(int(draw))
            game_data['playoffs'] = False
        except ValueError:
            game_data['playoffs'] = True
            game_data['draw'] = draw
        return game_data

tourney_data = [create_game_data(s) for s in scores]
tourney_data = [d for d in tourney_data if d is not None]
print(len(tourney_data))  # to see how many false positives we got from our soup.find_all

71
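
Splitting on whitespace depends on every token landing at a fixed position. A regular expression makes the expected cell layout explicit and rejects anything else. This is only a sketch: `SCORE_CELL` and `parse_score_cell` are hypothetical names, and the `(SF)` playoff label in the usage lines is an assumed example, not something seen in the output above.

```python
import re

# Expected layout: "(draw)", "TEAM1 - TEAM2", "score1 - score2", separated by whitespace.
SCORE_CELL = re.compile(
    r"\((?P<draw>[^)]+)\)\s+"
    r"(?P<team1>\S+)\s+-\s+(?P<team2>\S+)\s+"
    r"(?P<score1>\d+)\s+-\s+(?P<score2>\d+)"
)


def parse_score_cell(text):
    """Return a dict for a cell that looks like a game, or None otherwise."""
    match = SCORE_CELL.search(text)
    if match is None:
        return None
    draw = match.group('draw')
    return {
        'draw': draw,
        'playoffs': not draw.isdigit(),  # non-numeric draw labels are treated as playoffs
        'team1': match.group('team1'),
        'team2': match.group('team2'),
        'team1_score': int(match.group('score1')),
        'team2_score': int(match.group('score2')),
    }


# Text of the first cell shown earlier, with the <br/> tags already dropped.
cell_text = "\n(1)                            \nSWE - JPN                            \n8 - 5                    "
print(parse_score_cell(cell_text))
print(parse_score_cell("(SF) AAA - BBB 5 - 3"))  # made-up playoff cell
```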


Box Scores
----------

But we can't fit a useful model to final scores alone: it would probably just predict that the teams with consistently higher scores (or $\Delta$scores) win, which *a priori* should not hold for every tourney. We want box scores to capture some of the strategy (blanking ends, converting with the hammer, etc.) without having to watch the games.
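
To make those strategy features concrete, here is a minimal sketch of how they could be derived from two line scores. It relies on the standard rule that the team that scores gives up the hammer for the next end, while a blank end leaves the hammer where it is. The function names and the line scores are made up for illustration and do not come from the site.

```python
def hammer_by_end(ends_a, ends_b, first_hammer):
    """List which team ('a' or 'b') held the hammer at the start of each end."""
    hammer = first_hammer
    holders = []
    for a, b in zip(ends_a, ends_b):
        holders.append(hammer)
        if a > 0:
            hammer = 'b'   # team a scored, so team b gets the hammer next
        elif b > 0:
            hammer = 'a'   # team b scored, so team a gets the hammer next
        # blank end (0-0): the hammer does not change hands
    return holders


def end_summary(ends_a, ends_b, first_hammer):
    """Count blank ends and steals (points scored without the hammer)."""
    holders = hammer_by_end(ends_a, ends_b, first_hammer)
    blanks = 0
    steals = {'a': 0, 'b': 0}
    for a, b, hammer in zip(ends_a, ends_b, holders):
        if a == 0 and b == 0:
            blanks += 1
        elif a > 0 and hammer == 'b':
            steals['a'] += 1
        elif b > 0 and hammer == 'a':
            steals['b'] += 1
    return {'hammer': holders, 'blank_ends': blanks, 'steals': steals}


# Toy four-end game: team a starts with the hammer.
summary = end_summary([1, 0, 2, 0], [0, 0, 0, 1], first_hammer='a')
print(summary)
```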

In [7]:
box = soup.find_all('div', class_='col-md-12')
print(len(box))
print(box[0])

53
<div class="col-md-12">
<h4>Men</h4>
</div>


In [8]:
box[1]

<div class="col-md-12">
<b>
            St Jakobshalle
        </b>
</div>

In [9]:
for i in range(10):
    print(box[i])

<div class="col-md-12">
<h4>Men</h4>
</div>
<div class="col-md-12">
<b>
            St Jakobshalle
        </b>
</div>
<div class="col-md-12">
    BASEL
    <br/>
    Switzerland
</div>
<div class="col-md-12">
        4/2/2016 - 4/10/2016
    </div>
<div class="col-md-12">
<div class="panel panel-default">
<div class="table-responsive">
<table class="table">
<thead>
<tr>
<th class="col-md-1 text-center">Position</th>
<th class="col-md-1 text-center">Record</th>
<th class="col-md-1"> </th>
<th class="col-md-9">Association</th>
</tr>
</thead>
<tbody>
<tr>
<td class="col-md-1 text-center">
                                1
                            </td>
<td class="col-md-1 text-center">
                                12 - 1
                            </td>
<td class="col-md-1">
<img alt="Canada" class="img-border-line" height="20" src="http://wcfresults.azurewebsites.net/Content/Images/Flags/canada.gif" width="40"/>
</td>
<td class="col-md-9">
<div class="col-md-12">
<b>C

These are obviously not the box scores...and I think I know why.

On a fresh load of the page, the **Results** tab (which contains the box scores) *does not* show any scores by default. Instead, there are a couple of links that load the data into the page, since most people probably just want a few scores and not the entire block of scores.

For my test page, the link **Show all games** has the target:
```
http://results.worldcurling.org/Championship/DisplayResults?tournamentId=555&associationId=0&drawNumber=0
```
This link format may be how I get RESTful access to the scores... Just going to that URL gives a raw dump of the scores!
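
The query string in that link can be assembled from its three parameters. A minimal sketch: `results_url` is a hypothetical helper, and reading `0` as "all associations" and "all draws" is only a guess based on the **Show all games** label.

```python
from urllib.parse import urlencode

RESULTS_BASE = 'http://results.worldcurling.org/Championship/DisplayResults'


def results_url(tournament_id, association_id=0, draw_number=0):
    """Build a DisplayResults URL; the meaning of 0 for the last two values is assumed."""
    query = urlencode({
        'tournamentId': tournament_id,
        'associationId': association_id,
        'drawNumber': draw_number,
    })
    return '{}?{}'.format(RESULTS_BASE, query)


print(results_url(555))
```

If `drawNumber` turns out to filter by draw, the same helper could request one draw at a time.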

In [10]:
test_site_2 = 'http://results.worldcurling.org/Championship/DisplayResults?tournamentId=555&associationId=0&drawNumber=0'

In [11]:
r2 = requests.get(test_site_2)
assert r2.status_code == 200

soup2 = BeautifulSoup(r2.text, 'html.parser')

In [12]:
box2 = soup2.find_all('div', class_='col-md-12')
print(len(box2))
print(box2[1])

285
<div class="col-md-12">
<h5>Draw #1</h5>
<p>
<b>4/2/2016 2:00 PM</b>
</p>
<p>
<table class="table game-table">
<thead>
<tr class="game-header-row">
<th class="game-header text-center" colspan="3">
Draw #1            </th>
<th class="text-center">1</th>
<th class="text-center">2</th>
<th class="text-center">3</th>
<th class="text-center">4</th>
<th class="text-center">5</th>
<th class="text-center">6</th>
<th class="text-center">7</th>
<th class="text-center">8</th>
<th class="text-center">9</th>
<th class="text-center">10</th>
<th class="text-center"> </th>
<th class="text-center"> </th>
<th class="text-right">Total</th>
</tr>
</thead>
<tbody>
<tr>
<td class="game-sheet" rowspan="2" style="vertical-align:middle">
A            </td>
<td class="game-team">
                Sweden
            </td>
<td class="game-hammer">
                     
            </td>
<td class=" game-end10 text-center">
                    1
                </td>
<td class=" game-end10 text-center">

Funny, the first entry (`box2[0]`) is *the entire tourney!* Start at entry 1 to get individual game data.

In [13]:
just_box_scores = soup2.find_all('table', class_='game-table')
print(len(just_box_scores))
# print(just_box_scores[0])

71


OK, so this might be better than grabbing the `div`s and doing something with them. The good news is that we have the same number of total games (71) as we got from the final scores on the previous page, so we shouldn't be missing anything. Verifying that later should not be hard.
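
One possible shape for that check, as a sketch only. The final scores use country codes (`SWE`) while the box scores use full names (`Sweden`), so this compares final-score pairs rather than team names. The helper names and the score pairs are made up.

```python
from collections import Counter


def score_multiset(pairs):
    """Count games by unordered final score, so (8, 5) and (5, 8) are the same game."""
    return Counter(tuple(sorted(pair)) for pair in pairs)


def unmatched_games(draw_pairs, box_pairs):
    """Return score pairs found only in the draw data and only in the box scores."""
    draw = score_multiset(draw_pairs)
    box = score_multiset(box_pairs)
    return draw - box, box - draw


# Toy data: the 3-2 game appears in the draw data only.
only_draw, only_box = unmatched_games([(8, 5), (6, 7), (3, 2)], [(5, 8), (7, 6)])
print(only_draw, only_box)
```

Two empty counters would mean both pages report the same set of final scores. Games that share a score line cannot be told apart this way.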

So, getting the tables for the scores, now we need to convert those into data structures that we can actually use.

In [14]:
for row in just_box_scores[0].find_all('tr'):
    for i, col in enumerate(row.find_all('td')):
        print('{}: {}'.format(i, col))

0: <td class="game-sheet" rowspan="2" style="vertical-align:middle">
A            </td>
1: <td class="game-team">
                Sweden
            </td>
2: <td class="game-hammer">
                     
            </td>
3: <td class=" game-end10 text-center">
                    1
                </td>
4: <td class=" game-end10 text-center">
                    0
                </td>
5: <td class=" game-end10 text-center">
                    2
                </td>
6: <td class=" game-end10 text-center">
                    0
                </td>
7: <td class=" game-end10 text-center">
                    1
                </td>
8: <td class=" game-end10 text-center">
                    0
                </td>
9: <td class=" game-end10 text-center">
                    2
                </td>
10: <td class=" game-end10 text-center">
                    0
                </td>
11: <td class=" game-end10 text-center">
                    0
                </

So let's look at the individual data for just this one game to think about what data we should save and how we should save it.

- Sheet: td class `game-sheet`, only in the first row
- LSFE (last stone in the first end, i.e. who starts with the hammer): td class `game-hammer` contains an asterisk (`*`)
- Team names: td class `game-team`, in each row separately
- Final score: td class `game-total`, in each row
- Individual end scores: td class `game-end10`, with points scored in that end for each team

We can build up an array-like structure to save the individual end scores, and everything else can just be a string or value. The `td` cells do not contain the draw number. The `game-header` cell in the table's header row (`Draw #1` in the earlier output) looks like one place to get it, or maybe we can do that with the RESTful access to the site. I'll have to look more into that.
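
A minimal sketch of one such structure, built from the class names listed above. `TeamLine`, `to_int`, and `build_team_line` are hypothetical names, the row is made-up data, and treating non-numeric end cells (such as an `X` for an unplayed end) as `None` is an assumption about the site. To keep the sketch free of network access, each cell is passed in as a `(classes, text)` pair.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TeamLine:
    team: str = ''
    lsfe: bool = False  # True when the game-hammer cell holds an asterisk
    ends: List[Optional[int]] = field(default_factory=list)
    total: Optional[int] = None


def to_int(text):
    """Return an int for numeric cells and None for anything else."""
    text = text.strip()
    return int(text) if text.isdigit() else None


def build_team_line(cells):
    """Turn one table row into (sheet, TeamLine); sheet is None outside the first row.

    With BeautifulSoup, cells could be built as
    [(td.get('class', []), td.text) for td in row.find_all('td')].
    """
    sheet = None
    line = TeamLine()
    for classes, text in cells:
        if 'game-sheet' in classes:
            sheet = text.strip()
        elif 'game-team' in classes:
            line.team = text.strip()
        elif 'game-hammer' in classes:
            line.lsfe = '*' in text
        elif 'game-end10' in classes:
            line.ends.append(to_int(text))
        elif 'game-total' in classes:
            line.total = to_int(text)
    return sheet, line


# Made-up row: three ends, the last one not played.
row = [
    (['game-sheet'], ' A '),
    (['game-team'], ' Team A '),
    (['game-hammer'], ' * '),
    (['game-end10', 'text-center'], ' 1 '),
    (['game-end10', 'text-center'], ' 0 '),
    (['game-end10', 'text-center'], ' X '),
    (['game-total'], ' 1 '),
]
sheet, line = build_team_line(row)
print(sheet, line)
```

A game would then be two `TeamLine` objects plus the sheet, with the draw number attached once its source is settled.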