In [1]:
import wcf

from processing.game import Game
from processing.pipeline import convert, fields

print(wcf.__version__)
conn = wcf.API('../credentials/wcf.json').connect()

0.6.1


In [2]:
test_tourney = conn.get_draws_by_tournament(579)
test_game = Game(test_tourney[-1])
raw_data = test_game.data
test_game.convert()

print(test_game)

RUS  88.0 | 0 0 0 0 1 0 2 0 | 3
CAN* 80.0 | 0 2 1 0 0 3 0 2 | 8


Build End Info
----------------

From our data, what we really need is the game state at each end, which we can use to predict which team will eventually win the game. The game state will include
- the current end
- the team that had the hammer (encoded as whether team 1 had it)
- the team that scored, and how many points they scored (for a blank end, the team with the hammer is recorded as scoring 0)
- the score differential
- the stolen-points and stolen-ends differentials
- the number of blank ends so far
- additional metrics...?

All differentials are taken with respect to team 1 (the other team is team 0), which should make it easier to predict whether or not team 1 will win. Plus, the model's probability for team 1 can simply be complemented (1 - p) to get the probability for team 0.
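
As a rough illustration of the state described above, here is a minimal sketch that walks a linescore end by end and tracks the hammer, score differential, steals, and blanks, all relative to team 1. It is not the project's `convert` implementation: the function name `end_states` and the dictionary keys are assumptions made for this example, and it assumes the usual rule that the scoring team gives up the hammer while a blank end leaves it in place.

```python
def end_states(team0_ends, team1_ends, first_hammer):
    """Return one dict per end with the running game state.

    Hypothetical helper for illustration only. All differentials are
    team 1 minus team 0; `first_hammer` is 0 or 1.
    """
    states = []
    hammer = first_hammer
    score_diff = steal_diff = steal_ends_diff = blanks = 0

    for end_number, (pts0, pts1) in enumerate(zip(team0_ends, team1_ends), start=1):
        if pts0 == 0 and pts1 == 0:
            # Blank end: record the hammer team as scoring 0.
            scoring_team, points = hammer, 0
            blanks += 1
        else:
            scoring_team = 1 if pts1 > 0 else 0
            points = pts1 if scoring_team == 1 else pts0
            sign = 1 if scoring_team == 1 else -1
            score_diff += sign * points
            if scoring_team != hammer:
                # Scoring without the hammer is a steal.
                steal_diff += sign * points
                steal_ends_diff += sign

        states.append({
            "end_number": end_number,
            "hammer": hammer,  # who held the hammer during this end
            "scoring_team": scoring_team,
            "scored_points": points,
            "score_diff": score_diff,
            "steal_diff": steal_diff,
            "steal_ends_diff": steal_ends_diff,
            "blanks": blanks,
        })

        if points > 0:
            # The team that scored gives up the hammer for the next end.
            hammer = 1 - scoring_team

    return states


# Linescore of the final printed above: RUS is team 0, CAN is team 1 (first hammer).
rus = [0, 0, 0, 0, 1, 0, 2, 0]
can = [0, 2, 1, 0, 0, 3, 0, 2]
states = end_states(rus, can, first_hammer=1)
print([s["score_diff"] for s in states])  # [0, 2, 3, 3, 2, 5, 3, 5]
```

Recording the hammer before updating it keeps each row describing the end as it was played, which is what a model would see when predicting from that point in the game.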

In [3]:
test_game_processed = convert([test_game])
test_game_processed

Unnamed: 0,tourney_id,game_id,round,team_0,team_1,end_number,hammer,scored_points,scoring_team,score_diff,score_ends_diff,score_hammer_diff,score_hammer_ends_diff,steal_diff,steal_ends_diff,blanks,winning_team
0,579,25534,F,RUS,CAN,1,1,1,0,0,0,0,0,0,0,1,1
1,579,25534,F,RUS,CAN,2,1,1,2,2,1,2,1,0,0,1,1
2,579,25534,F,RUS,CAN,3,0,1,1,3,2,2,1,1,1,1,1
3,579,25534,F,RUS,CAN,4,0,0,0,3,2,2,1,1,1,2,1
4,579,25534,F,RUS,CAN,5,0,0,1,2,1,1,0,1,1,2,1
5,579,25534,F,RUS,CAN,6,1,1,3,5,2,4,1,1,1,2,1
6,579,25534,F,RUS,CAN,7,0,0,2,3,1,2,0,1,1,2,1
7,579,25534,F,RUS,CAN,8,1,1,2,5,2,4,1,1,1,2,1


Great! We have a basic processing pipeline set up, and loading everything into a `pandas.DataFrame` looks to have worked well. We can repeat this processing for many tournaments (and many games) to build up a large dataset.

One thing that we probably should include is the gender of the game. Alternatively, when we process the games we could save the information in separate data directories by gender. This second option is probably the better choice, so let's set up the processing pipeline that way.

Plan
-----

Currently, `pipeline.convert` is just a simple function that returns a nested list of the converted information. Instead, we want to create one large pandas dataframe containing all of the information from a single tournament, and save it as a data file for later processing. As each tournament will probably have around 70 games, each data file will have around 600-700 entries (one row per end).
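
The per-gender directory layout could be handled by a small path helper along these lines. This is only a sketch: the function name `tournament_csv_path` and its parameters are hypothetical, and it uses a temporary directory so the example runs without touching the real `data/` folder.

```python
import tempfile
from pathlib import Path


def tournament_csv_path(data_root, category, tourney_id):
    """Return <data_root>/<category>/<tourney_id>.csv, creating the directory.

    Hypothetical helper: `category` is the sub-directory name (for example
    'women'), and `tourney_id` is the WCF tournament id.
    """
    directory = Path(data_root) / category
    # exist_ok=True makes the call safe to repeat for every tournament.
    directory.mkdir(parents=True, exist_ok=True)
    return directory / f"{tourney_id}.csv"


# Demonstrate against a throwaway root instead of the real data directory.
with tempfile.TemporaryDirectory() as root:
    path = tournament_csv_path(root, "women", 579)
    relative = path.relative_to(root).as_posix()
    created = path.parent.is_dir()

print(relative)  # women/579.csv
```

The returned path can then be handed straight to `df.to_csv(...)` for each tournament.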

In [4]:
len(test_tourney)

71

We need to create a `Game` for each of the 71 games, then run them all through `convert` to get a single dataframe with all of the information.

In [5]:
games_converted = []
for game in test_tourney:
    single = Game(game)
    single.convert()
    games_converted.append(single)
df = convert(games_converted)

print(df.shape)
df.head()

(656, 17)


Unnamed: 0,tourney_id,game_id,round,team_0,team_1,end_number,hammer,scored_points,scoring_team,score_diff,score_ends_diff,score_hammer_diff,score_hammer_ends_diff,steal_diff,steal_ends_diff,blanks,winning_team
0,579,25464,1,USA,SCO,1,1,1,3,3,1,3,1,0,0,0,1
1,579,25464,1,USA,SCO,2,0,0,0,3,1,3,1,0,0,1,1
2,579,25464,1,USA,SCO,3,0,0,2,1,0,1,0,0,0,1,1
3,579,25464,1,USA,SCO,4,1,0,1,0,-1,1,0,-1,-1,1,1
4,579,25464,1,USA,SCO,5,1,1,3,3,0,4,1,-1,-1,1,1


In [6]:
656 / 71  # average ends per game

9.23943661971831

In [7]:
import os

os.getcwd()

'/home/mikemoran/bin/curling'

In [9]:
# Create the output directory if it does not exist yet (safe to re-run).
os.makedirs('data/women', exist_ok=True)
df.to_csv('data/women/579.csv')

In [10]:
! ls data/women/

255.csv  293.csv  371.csv  402.csv  454.csv  507.csv  579.csv
274.csv  313.csv  382.csv  444.csv  487.csv  554.csv


Great! We have a basic pipeline that takes the raw data from the WCF, converts it into a dataframe, and saves that as a CSV file for later use. We can repeat this for a bunch of tournaments, saving them within the correct directory, and load everything in to do some analysis.
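
Loading everything back in could look roughly like the sketch below. It uses only the standard library so that it runs without the project's dependencies; with pandas, the equivalent would be a `pd.concat` over `pd.read_csv` for each file. The function name `load_rows` and the demo file names are assumptions for illustration, and the demo rows are a subset of columns copied from the tables above.

```python
import csv
import tempfile
from pathlib import Path


def load_rows(directory):
    """Read every CSV file in `directory` and return all rows as dicts.

    Hypothetical helper: values come back as strings, so numeric columns
    still need converting before any analysis.
    """
    rows = []
    for csv_path in sorted(Path(directory).glob("*.csv")):
        with csv_path.open(newline="") as handle:
            rows.extend(csv.DictReader(handle))
    return rows


# Demo files with placeholder names; the values are taken from the output above.
with tempfile.TemporaryDirectory() as root:
    header = "tourney_id,game_id,end_number,score_diff\n"
    Path(root, "demo_a.csv").write_text(header + "579,25534,1,0\n")
    Path(root, "demo_b.csv").write_text(header + "579,25464,1,3\n")
    rows = load_rows(root)

print(len(rows))  # 2
```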

That should wrap up the exploratory work here. The next step is to build a single script that processes everything and gets all of the data files in place.