In [1]:
from traitlets.config.manager import BaseJSONConfigManager
path = "/Users/matthiaszunhammer/anaconda/etc/jupyter/nbconfig"
cm = BaseJSONConfigManager(config_dir=path)
cm.update('livereveal', {
              'theme': 'simple',
              'transition': 'none',
              'start_slideshow_at': 'selected',
})

cm.update('livereveal', {
              'width': 1024,
              'height': 768,
})

{'height': 768,
 'start_slideshow_at': 'selected',
 'theme': 'simple',
 'transition': 'none',
 'width': 1024}

# Background
<link rel="stylesheet" href="reveal.js/css/theme/black.css" id="theme">
After a Sunday leisure trip to "Rennbahn Düsseldorf", I got the idea that horse racing is an ideal training ground for practicing machine learning (ML) and Big Data handling because:
* Interesting topic and good conversation starter (I'm not actually into betting, though)
* Lots of cases for prediction available (200-800 races/day)
* Lots of data available for each horse (jockey weight, horse age, past performance...)
* "Parimutuel betting", i.e. you compete against all other bettors rather than against a bookie and his ML team (after paying a hefty house commission of approx. 15-20%)
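
To make the parimutuel idea concrete, here is a minimal sketch of how a win-pool payout can be computed. The pool sizes and the 15% commission are made-up illustrative numbers, and real tracks also round payouts, which this sketch ignores.

```python
def parimutuel_payout(total_pool, pool_on_winner, commission=0.15):
    """Payout per 1 unit staked on the winning horse.

    The house first removes its commission from the total pool; the
    remainder is shared among winning tickets in proportion to stake.
    """
    if pool_on_winner <= 0:
        raise ValueError("nobody backed the winner")
    net_pool = total_pool * (1.0 - commission)
    return net_pool / pool_on_winner


# Hypothetical race: 1000 EUR in the win pool, 200 EUR of it on the winner.
payout = parimutuel_payout(total_pool=1000.0, pool_on_winner=200.0)
print(f"Payout per 1 EUR staked: {payout:.2f} EUR")
```

The payout depends on how everybody else bet, so a model only pays off if it finds horses that the crowd underestimates by more than the commission.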

# Aims
* Improve Python skills (I mainly work with MATLAB and R in neuroscience)
* Get exposure to database systems like MongoDB and SQL (I mainly work with single-file-based data in neuroscience)
* Aim added later: Work with git at team level, together with my brother Florian (before: solo)
* Ultimately: find out if it is possible to "beat the odds" with machine learning (which I doubt)

# Roadmap
1. Get data
2. Clean data
3. Modelling
4. Test model
5. Reflect on results

# 1.1) Where to get data? – The problem

Some sites offer data (e.g. [Betwise][1], [Betfair][2]).
But:
* A) They usually charge money (>100€ at Betwise)
* B) They usually offer a limited scope of variables (especially Betfair)
* C) They usually offer data in a format different from the race sheets of upcoming races, making it difficult to implement an ML workflow

[1]: https://www.betwise.co.uk/smartform
[2]: https://www.betfair.com/de




# 1.1) Where to get data?  – The solution
* Betting sites usually provide lots of data for upcoming and past races.
* The [requests](http://docs.python-requests.org/en/master/) package for Python offers a powerful tool to download webpages.
* ... so I wrote a couple of functions to download and parse data from one of the big betting services.


# 1.1) Where to get data?  – The solution: scraping (cont.)
In this process, called "web scraping", I learned:
* how to use the requests package
* how to use PHP queries to access data
* how to use HTTP's GET and POST methods to log in automatically
* some JSON (to post HTTP headers)
* that servers do not like receiving requests...
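
The login-then-download pattern can be sketched roughly as follows. This is not the code from the "scrape" module: the host name, the `login.php` and `races.php` paths, and the form-field names are all hypothetical placeholders, and `session` is assumed to behave like a `requests.Session`.

```python
from urllib.parse import urlencode

# All URLs, paths and form-field names below are hypothetical placeholders.
BASE_URL = "https://betting-site.example"


def build_query_url(base_url, page, **params):
    """Build a URL with a query string, e.g. .../races.php?date=2017-09-14."""
    return f"{base_url}/{page}?{urlencode(params)}"


def login_and_fetch(session, username, password, race_date):
    """Log in with a POST request, then GET one day's race listing.

    `session` is expected to behave like `requests.Session`: cookies set
    by the login response are sent along with the follow-up GET.
    """
    login = session.post(
        f"{BASE_URL}/login.php",
        data={"user": username, "password": password},
    )
    login.raise_for_status()  # stop early if the login was rejected
    listing = session.get(build_query_url(BASE_URL, "races.php", date=race_date))
    listing.raise_for_status()
    return listing.text
```

With the real package this would be called as `login_and_fetch(requests.Session(), ...)`. Pausing between requests (e.g. with `time.sleep`) is a simple way to stay polite towards the server.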

# 1.1) Where to get data?  – The solution: scraping (cont.)
<link rel="stylesheet" href="lib/css/zenburn.css">
The code can be found in the "scrape" module:

```py
import hracing.scrape
hracing.scrape.main()
```

Note: It will not work on your machine, because I stored the host info and my login in a local .ini file for privacy reasons.

In [None]:
import hracing.scrape
hracing.scrape.main()

# 1.2) Parse data — The problem
A pretty simple problem:
* The information sits in HTML elements
* It has to be extracted and stored in some readily accessible format

# 1.2) Parse data — The solution

* At first, I extracted all data with regexp...
* ... then I learned that what I was trying to do is called parsing...
* ... and what parsers are good for.
* BeautifulSoup makes this task easy.
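
To illustrate what a parser does that a regexp does not, here is a small sketch that turns an HTML table into a list of dicts. It deliberately uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages (BeautifulSoup would make it shorter). The sample HTML and its column names are invented for illustration.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Invented sample markup, not taken from any real betting site.
SAMPLE_HTML = """
<table class="runners">
  <tr><th>No</th><th>Horse</th><th>Weight</th></tr>
  <tr><td>1</td><td>Example Horse A</td><td>69.0</td></tr>
  <tr><td>2</td><td>Example Horse B</td><td>68.5</td></tr>
</table>
"""


def parse_runners(html):
    """Return one dict per table row, keyed by the header cells."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


runners = parse_runners(SAMPLE_HTML)
print(runners)
```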

# 1.3) Storing data — The problem
* One race consists of the following info:
    * Race-level info (e.g. race_ID, daytime, location...)
    * Horse-level info (e.g. name, weight, sex, jockey...)
    * Short-forms: A table describing the latest performance for each horse
    * Long-forms: A table describing all-time performance for each horse
    * A table describing the finish (sometimes for all horses, sometimes only for the first three, depending on the track)
 >> Hierarchical data structure: Past performances nested in horses, horses and finishers nested in races.
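
As a sketch, one race could look like the nested dict below. The field names and values follow the example race shown in this notebook, shortened to three runners and one past race date each; this is an illustration and not necessarily the full stored schema.

```python
from datetime import datetime

# Shortened illustration: three of six runners, one past race date each.
race = {
    "race_ID": 1278271,
    "race_name": "Auteuil",
    "race_date_time": datetime(2017, 9, 14, 12, 5),
    "distance": 3600.0,
    "horses": [  # horse-level info, nested in the race
        {"starter_no1": 1, "sex": "Wallach", "weight": 69.0,
         "short_forms": {"past_racedates": [datetime(2017, 9, 1)]}},
        {"starter_no1": 4, "sex": "Stute", "weight": 68.0,
         "short_forms": {"past_racedates": [datetime(2017, 4, 4)]}},
        {"starter_no1": 5, "sex": "Wallach", "weight": 67.0,
         "short_forms": {"past_racedates": [datetime(2017, 9, 1)]}},
    ],
    "finish": [  # finishers, also nested in the race
        {"starter_no1": 1, "place": 1},
        {"starter_no1": 5, "place": 2},
        {"starter_no1": 4, "place": 3},
    ],
}

# Walking the hierarchy: race -> horses -> past performances.
for horse in race["horses"]:
    n_past = len(horse["short_forms"]["past_racedates"])
    print(horse["starter_no1"], horse["sex"], n_past)

# Look up the finishing place by starter number.
places = {entry["starter_no1"]: entry["place"] for entry in race["finish"]}
```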

# 1.3) Storing data — The solution, take 1
* create a race class with variables stored as properties
* save class instances in separate "pickled" files
* give up...
    
++ Good to practice class syntax

-- Inefficient data storage (lots of disk space, inaccessible, inflexible)
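
A minimal sketch of this first attempt, assuming a stripped-down hypothetical `Race` class (the real one held many more fields) and one pickle file per race:

```python
import pickle
import tempfile
from pathlib import Path


class Race:
    """Hypothetical minimal race class with variables stored as attributes."""

    def __init__(self, race_id, location, horses):
        self.race_id = race_id
        self.location = location
        self.horses = horses  # list of per-horse dicts


def save_race(race, folder):
    """Pickle one race into its own file, named after the race ID."""
    path = Path(folder) / f"race_{race.race_id}.pkl"
    with open(path, "wb") as handle:
        pickle.dump(race, handle)
    return path


def load_race(path):
    """Read one pickled race back from disk."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as folder:
    saved = save_race(Race(1278271, "Auteuil", [{"starter_no1": 1}]), folder)
    restored = load_race(saved)
    print(restored.race_id, restored.location)
```

The drawback shows up as soon as a question spans many races: every file has to be unpickled before a single field can be inspected.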

# 1.3) Storing data — The solution, take 2
* create an SQL database
* save races in a relation keyed by race_ID, horses in a relation keyed by horse_ID
* give up...
   
++ Good to practice SQL syntax

-- Not actually an efficient way to store this data, as the nested hierarchy does not map naturally onto flat relations and the schema is inflexible (esp. when new variables become available)
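
The relational layout can be sketched with the standard library's `sqlite3`. The table and column names here are a hypothetical, heavily reduced schema and not the one I actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE races  (race_ID INTEGER PRIMARY KEY, location TEXT, distance REAL);
    CREATE TABLE horses (horse_ID INTEGER PRIMARY KEY,
                         race_ID  INTEGER REFERENCES races(race_ID),
                         starter_no INTEGER, weight REAL);
""")
conn.execute("INSERT INTO races VALUES (?, ?, ?)", (1278271, "Auteuil", 3600.0))
conn.executemany(
    "INSERT INTO horses (race_ID, starter_no, weight) VALUES (?, ?, ?)",
    [(1278271, 1, 69.0), (1278271, 4, 68.0)],
)

# Re-assembling one race means joining the relations back together.
rows = conn.execute("""
    SELECT r.location, h.starter_no, h.weight
    FROM races AS r JOIN horses AS h ON h.race_ID = r.race_ID
    ORDER BY h.starter_no
""").fetchall()
print(rows)

# A newly available variable requires a schema change before it can be stored.
conn.execute("ALTER TABLE horses ADD COLUMN jockey TEXT")
```

Each further nesting level (short forms, long forms, finish) would need its own relation and another join.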

# 1.3) Storing data — The solution, take 3
* parse data into a hierarchical dict
* create a MongoDB database and store the dicts according to race ID
* :)
   
++ Efficient storage, conserves natural data hierarchy, flexible if new variables become available

-- Why did I not try this earlier?
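
Storing the parsed dicts can be sketched as below. The helper names `store_race` and `runners_of` are made up for this illustration, and `collection` is assumed to be a pymongo collection; `replace_one` with `upsert=True` is one way to keep a single document per race ID.

```python
def store_race(collection, race):
    """Insert a parsed race dict, or overwrite it if the race ID exists.

    `collection` is expected to be a pymongo collection; `replace_one`
    with `upsert=True` keeps exactly one document per race_ID.
    """
    return collection.replace_one({"race_ID": race["race_ID"]}, race, upsert=True)


def runners_of(collection, race_id):
    """Fetch one race and return its nested list of horses."""
    race = collection.find_one({"race_ID": race_id})
    return [] if race is None else race["horses"]
```

Assuming a MongoDB server running on localhost, `collection` could be obtained with `pymongo.MongoClient().races.races`. New variables simply become new keys in the dict; no schema change is needed.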

# 1.3) Storing data — The solution, take 3 (cont.)


In [142]:
import pandas as pd
import pymongo
from hracing import db
client = pymongo.MongoClient()
db = client.races

a=db.races.find_one()

def race_to_df(race_dict):
    '''Generate a pandas df from a db race entry.
    The df will contain one line per runner with race-level and finish info.'''
    # Generate tables with race-level, horse-level, and finish info
    race_level_keys=race_dict.keys()-['_id','horses','finish']
    race_generals = { k: race_dict[k] for k in race_level_keys }
    df_race_level=pd.DataFrame(race_generals,index=[race_generals['race_ID']])
    df_horse_level=pd.DataFrame(race_dict['horses'])
    df_finish=pd.DataFrame(race_dict['finish'])
    # Cross join race*horse via a temporary key
    # (pandas >= 1.2 also supports pd.merge(..., how='cross') directly)
    df_race_level['temp_key']=1
    df_horse_level['temp_key']=1
    df_race_n_horse=pd.merge(df_race_level,df_horse_level,on='temp_key')
    # Left join on starter_no1 to add info on winners
    df=pd.merge(df_race_n_horse,df_finish[['starter_no1','place']], on='starter_no1',how='left')
    return df

race_to_df(a)

Unnamed: 0,country,currency,distance,ground,n_starter,race_ID,race_date_time,race_name,race_number,stakes,...,nonrunner,odd,owner,sex,short_forms,starter_no1,starter_no2,trainer,weight,place
0,FRA,EUR,3600.0,,6,1278271,2017-09-14 12:05:00,Auteuil,1,53.0,...,False,28.0,A.Bocquet\n\t\t\t\t,Wallach,"{'past_racedates': [2017-09-01 00:00:00, 2017-...",1,,Serge Foucher,69.0,1.0
1,FRA,EUR,3600.0,,6,1278271,2017-09-14 12:05:00,Auteuil,1,53.0,...,False,3.5,Ra.Green\n\t\t\t\t,Wallach,"{'past_racedates': [2017-08-25 00:00:00, 2017-...",2,,Guillaume Macaire,69.0,
2,FRA,EUR,3600.0,,6,1278271,2017-09-14 12:05:00,Auteuil,1,53.0,...,False,4.3,Jdg Bloodstock Services\n\t\t\t\t,Wallach,"{'past_racedates': [2017-08-20 00:00:00, 2017-...",3,,Arnaud Chaille-Chaille,69.0,
3,FRA,EUR,3600.0,,6,1278271,2017-09-14 12:05:00,Auteuil,1,53.0,...,False,13.0,Mme I.Pacault\n\t\t\t\t,Stute,"{'past_racedates': [2017-04-04 00:00:00, 2017-...",4,,Isabelle Pacault,68.0,3.0
4,FRA,EUR,3600.0,,6,1278271,2017-09-14 12:05:00,Auteuil,1,53.0,...,False,1.9,Mme P.Papot\n\t\t\t\t,Wallach,"{'past_racedates': [2017-09-01 00:00:00, 2016-...",5,,Guillaume Macaire,67.0,2.0
5,FRA,EUR,3600.0,,6,1278271,2017-09-14 12:05:00,Auteuil,1,53.0,...,False,22.0,J.Seror\n\t\t\t\t,Wallach,"{'past_racedates': [2017-08-24 00:00:00, 2017-...",6,,Mickael Seror,67.0,


In [80]:
print([1,2]+([3,4]))

[1, 2, 3, 4]


# 3.) Machine learning

# 4.) Bet?