[logo]: http://mlb.mlb.com/mlb/components/fantasy/bts/y2016/images/bts_250x250.jpg "Beat the Streak"

![alt text][logo]

To run the code in this notebook clone the github repo at: https://github.com/matthewmcnew/beat-the-streak

# Beat the Streak? 

MLB.com offers an online game titled, Beat the Streak. In the game players can attempt to beat Joe Dimaggio's hitting streak by picking a player that gets a hit in 56 consecutive games. Players are allowed to pick two players for each day and even pass on picking on a day. Since the game's inception in 2001, no online player has managed to beat the streak and win the grand prize of $5,600,000. 

This project will attempt to model beat the streak as a classification problem. The machine learning models will attempt to predict which players have a high likelihood of getting a hit for a given day. Due to the importance of picking correctly, the machine learning models will be optimized for precision. Ultimately, the model will suggest one or a few batters out of hundreds as suggested batters for a day.

Even with a very accurate model the probability of actually beating the streak is very low. The realistic goal of the project will be to see if the machine learning models can outperform baseball experts. If the models are relatively useful or accurate an online website could be set up to provide daily beat the streak recommendations. 

#### Ok but, really how hard is it to Beat the Streak?

Well, if we were able to predict who would get a hit in 90% of the games correctly the probability would be:

In [7]:
.90 ** 56, "%"

(0.002738927449953412, '%')

00.2% is a pretty low. Oh, and reaching 90% accuracy would be really hard as well. 

With that said, let's dive in shall we?

### Acquiring the Baseball Data:

Luckily, there is a wealth of baseball data available. This project pulled data from the online Retrosheet database. This data was aggregated to create possible 'choices' for each day and each player in the starting lineup for that day. Then additional data such as a batter's 'hitting average' is calculated for each choice. 

The steps taken to aquire, clean, process and present the data can be viewed in the Calculating_Averages.ipynb and Build_Choices.ipynb notebooks on the project's [Github repository](https://github.com/matthewmcnew/beat-the-streak).

### Exploratory Data Analysis

Complete explorations and visualizations of the imported baseball data is available in the [Analysis notebook](https://github.com/matthewmcnew/beat-the-streak/blob/master/Analysis.ipynb).

Although, there are plenty of possible covariants for each choice most variables did not appear to have much predictive power. The two strongest variables were the batter's hitting average and the ballpark the game was played in.

### Machine Learning. 

The choices were prepared for machine learning modeling by creating a couple of additional features: 

* Dummy variables were created for Categorical Variables.
    * Weather
    * Day or Night 
    * Field Condition
    * Precipitation Condition
    * Sky Condition
* [ESPN's HitsFactor statistic](http://espn.go.com/mlb/stats/parkfactor/_/sort/hitsFactor) to represent how hitter friendly the ballpark for each choice was.
* *is_high_attendance* a binary variable that corresponds to 1 if the attendance was greater than 38000.
* *is_coors* a binary variable that is 1 if the ballpark is Coors Field. (The most hitter friendly ballpark)
* *is_hity_pitcher* a binary variable that is 1 if the starting pitcher allows hits in more than 23% of his batters faced. 
* *is_really_hity_pitcher* a binary variable that is 1 if the starting pitcher allows hits in more than 26% of his batters faced. 
* *is_hity_bullpen* a binary variable that is 1 if opposing team's bullpen allows hits by more than 25% of their batters faced. 
* *pitcher_hand_diff* a continuous variable that equals the difference between the percent of hits given by the starting pitcher and percent of hits given by the starting pitcher's against the batting hand of the batter
* *hitter_hand_diff* a continuous variable that equals the difference between the percent of hits given achieved by the batter and percent of hits achieved by the batter against pitchers with the starting pitcher's hand. 

The final dataset prepped for machine learning can be viewed below:

In [5]:
from beat_the_streak import dataset
choices = dataset.load_dataset_starting_at_day('2015-05-30')
choices.head()


Unnamed: 0,game_date,first_name_tx,last_name_tx,game_id,bat_id,bat_home_id,bat_lineup_id,array_agg,count,best_hit,...,sky_3,sky_5,is_high_attendance,is_coors,is_hity_pitcher,is_really_hity_pitcher,is_hity_bull,ballpark,pitcher_hand_diff,hitter_hand_diff
450,2015-05-31,Erick,Aybar,ANA201505310,aybae001,1,1,"[u'R', u'R', u'R', u'R', u'L']",5,1,...,0.0,0.0,0,0,0,0,0,0.939,0.053628,0.053885
451,2015-05-31,Miguel,Cabrera,ANA201505310,cabrm001,0,3,"[u'R', u'R', u'R', u'R']",4,1,...,0.0,0.0,0,0,1,0,0,0.939,0.029108,0.012163
452,2015-05-31,Kole,Calhoun,ANA201505310,calhk001,1,6,"[u'L', u'L', u'L', u'L']",4,0,...,0.0,0.0,0,0,0,0,0,0.939,0.008109,-0.015102
453,2015-05-31,Yoenis,Cespedes,ANA201505310,cespy001,0,4,"[u'R', u'R', u'R', u'R']",4,0,...,0.0,0.0,0,0,1,0,0,0.939,0.029108,0.0338
455,2015-05-31,David,Freese,ANA201505310,freed001,1,4,"[u'R', u'R', u'R', u'R']",4,1,...,0.0,0.0,0,0,0,0,0,0.939,-0.018864,-0.004872


### Quantifying Model Performance 

A successful model needs to be optimized for extreme precision. The batter for each day with the highest probability of a hit would be the choice for the Beat the Streak game. 

To quantify this desired performance a custom Scikit Learn metric was created. This metric, BestPickForEachDayGotHitPercent, takes the number of days to calculate the top choice for and calculates the percent of successful picks for a model's predicted probability. 

In [11]:
from beat_the_streak import dataset
choices = dataset.load_dataset_starting_at_day('2015-05-30')

In [13]:
from beat_the_streak.metrics import BestPickForEachDayGotHitPercent

metric = BestPickForEachDayGotHitPercent(number_of_choices_per_day=1, game_dates=choices.game_date)

metric

<beat_the_streak.metrics.BestPickForEachDayGotHitPercent at 0x114192b50>

The models in this project were evaluated with this metric and it will be used throughout this notebook.

In order to prevent overfitting the data and to demonstrate realistic performance,  the data is split into a test and a training set. To replicate a realistic test/training split the data is split on a specific date with the training set occurring before the date and the test set occurring after the date. Demonstrations of models in this notebook will use this split and will be trained on the training set and tested on the testing set. 

In [18]:
from beat_the_streak.dataset import test_train_split
train_set, test_set = test_train_split(day='2015-07-01')

### Naive Model Implementation

The first step in developing a machine learning model was to create a realistic baseline model. A naive simple hitting model was developed that calculated the probability of a hit solely based on the batter's hitting average.

This naive model performed decently. The percent of days where it successfully picked a batter who got a hit:


In [13]:
from beat_the_streak.simple import SimpleHittingModel
from beat_the_streak.dataset import test_train_split
from beat_the_streak.metrics import BestPickForEachDayGotHitPercent

_, test = test_train_split(day='2015-07-01')
simple_hitting_model = SimpleHittingModel()
BestPickForEachDayGotHitPercent(test.game_date, 1)(clf=simple_hitting_model, X=test[['hitting_average']], y=test.got_hit)

0.6483516483516484

All things considered, 65% is not a terrible accuracy. Can a more complicated model perform better?

### Logistic Regression Model

The next step was to train and test a logistic regression model. Scikit-learn's Logistic Regression is demonstrated below:

In [7]:
from sklearn import cross_validation, datasets, linear_model, ensemble
from beat_the_streak.dataset import test_train_split, train_cols
from beat_the_streak.metrics import BestPickForEachDayGotHitPercent

train, test = test_train_split(day='2015-08-01')
lg = linear_model.LogisticRegression().fit(train[train_cols], train.got_hit)

BestPickForEachDayGotHitPercent(test.game_date)(clf=lg, X=test[train_cols], y=test.got_hit)

0.703125

### Random Forest Model

The parameters for a random forrest model were optimized with GridSearch performing cross validation. The resulting  scikit-learn random forest is shown below:

In [11]:
from sklearn import ensemble
from beat_the_streak.dataset import test_train_split, train_cols
rf = ensemble.RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                max_depth=None, max_features='auto', max_leaf_nodes=100,
                min_samples_leaf=1, min_samples_split=20,
                min_weight_fraction_leaf=0.0, n_estimators=15, n_jobs=1,
                oob_score=False, random_state=None, verbose=0,
                warm_start=False)

train, test = test_train_split(day='2015-08-01')

voting_model.fit(train[train_cols], train.got_hit)

BestPickForEachDayGotHitPercent(test.game_date)(clf=voting_model, X=test[train_cols], y=test.got_hit)


0.6875

### Player Model (unique random forest for each batter)

One of the potential pitfalls of only running models over the entire dataset is that unique attributes of certain batters cannot be captured. For example, over the dataset as a whole the sky conditions of a game have little influence on the probability of a batter getting a hit. However, certain batters may be strongly influenced by the weather or sky conditions. 

An individual random forest model for each batter might be able to capture these irregularities.

A custom model, called PlayerModel, was created for this project. The PlayerModel creates and fits a new unique model each batter in the training data. Then it predicts probabilities from that batter's unique model. 

Below is an attempt to train and test the PlayerModel with a unique random forest for each batter.

In [12]:
from sklearn import ensemble
from beat_the_streak.players import PlayerModel

def factory():
    return ensemble.RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                max_depth=None, max_features='auto', max_leaf_nodes=100,
                min_samples_leaf=1, min_samples_split=20,
                min_weight_fraction_leaf=0.0, n_estimators=15, n_jobs=1,
                oob_score=False, random_state=None, verbose=0,
                warm_start=False)

cls = PlayerModel(model_cls=factory)

train, test = test_train_split(day='2015-08-01')

cls.fit(train[train_cols], train.got_hit)
BestPickForEachDayGotHitPercent(test.game_date)(clf=voting_model, X=test[train_cols], y=test.got_hit)

0.703125

A 70% accuracy is not a extraordinary improvement. However, hopefully the insights gained from the PlayerModel can be used to improve the ensemble model below. 

### Ensemble Model 

An ensemble model was built and optimized with grid search [grid search](https://github.com/matthewmcnew/beat-the-streak/blob/master/Gridsearch.ipynb). This model is a 'voting' classifier that combines the probabilities of three different models. The three combined models were:

* LogisticRegression model
* RandomForestClassifier model
* Player Model that ran a unique RandomForestClassifer for each batter in the dataset

Each model takes a separate subset of the features. This is accomplished with a FeatureSelector transformer as part of a Pipeline.

This model is trained and tested below:

In [5]:
from sklearn import ensemble, linear_model, pipeline
from beat_the_streak.transformers import FeatureSelector
from beat_the_streak.list_subtract import subtract
from beat_the_streak.players import PlayerModel
from sklearn.ensemble import VotingClassifier
from beat_the_streak.dataset import test_train_split, train_cols
from beat_the_streak.metrics import BestPickForEachDayGotHitPercent

def classifier_factory():
    return ensemble.RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                max_depth=None, max_features='auto', max_leaf_nodes=100,
                min_samples_leaf=1, min_samples_split=20,
                min_weight_fraction_leaf=0.0, n_estimators=15, n_jobs=1,
                oob_score=False, random_state=None, verbose=0,
                warm_start=False)

rf = pipeline.Pipeline([
        ('sel', FeatureSelector(subtract(train_cols, ['player_hash']))),
        ('clf', ensemble.RandomForestClassifier(n_estimators=25, max_depth=12, min_samples_split=1, max_leaf_nodes=10))])
lg = pipeline.Pipeline([
        ('sel', FeatureSelector(subtract(train_cols, ['player_hash']))),
        ('clf', linear_model.LogisticRegression())]) 
ply = pipeline.Pipeline([
        ('sel', FeatureSelector(subtract(train_cols, ['hitting_average']))),
        ('clf', PlayerModel(model_cls=classifier_factory))])

voting_model = VotingClassifier(estimators=[('rf', rf), ('lg', lg), ('ply', ply)], weights=[1,1,1], voting='soft')

train, test = test_train_split(day='2015-08-01')

voting_model.fit(train[train_cols], train.got_hit)

BestPickForEachDayGotHitPercent(test.game_date)(clf=voting_model, X=test[train_cols], y=test.got_hit)

0.734375

### Well?

The ensemble voting classifier is a small but decent improvement over the naive pick the best hitter approach. However, it is definitely not perfect. With its predictions we are not going to beat the streak any time soon. 

This project only focused on data from 2015. An important next step would be would be to import and test against another year's dataset. This would help confirm that the ensemble voting classifier still had predictive power in unseen data in future years. There is a risk that these models have been overfit and optimized for the 2015 dataset. 

### What is left on the table?

There are plenty of potential models and covariants that have not been studied yet. Perhaps some of these unused techniques would provide additional predictive power. 

A common attribute when predicting how likely a player is to get a hit is their recent past performance. For example, perhaps a batter is in a hitting slump or a hitting boom. This recent 'hotness' of a batter could be very useful. The models studied in the project only used the hitting average of batter throughout the the entire season up to the game date in question. 

In addition to the recent performance of the batter, it might be beneficial to study the recent performance of the starting pitcher, opposing bullpen, or even the baseball teams playing.

Another angle that might be useful is historical data surrounding the batters and starting pitchers in each beat the streak 'choice'. The batting average of a player in past seasons or the number of years a pitcher has pitched may be useful when predicting the probability of a hit.

## Game Time

#### Try to Beat the Streak with the help of the Voting Classifier:

Run the code below and to see how many days you can pick a batter from the top probabilities of the voting classifier. 

In [None]:
from beat_the_streak.game import play
from beat_the_streak.dataset import test_train_split

_, test = test_train_split(day='2015-08-01')

play(voting_model, test)

   
Your current streak: 0
Game Date to choose 2015-08-02
       hitting_average  pitcher_hitting_average             name      prob
18984         0.257353                 0.228495      Erick Aybar  0.452759
18993         0.164557                 0.228495   Chris Iannetta  0.459151
37943         0.202614                 0.221223    Leonys Martin  0.461126
39468         0.258294                 0.220690  Alcides Escobar  0.465815
27182         0.179012                 0.186620    Roberto Perez  0.465846
20257         0.253776                 0.225577     Martin Prado  0.468348
6494          0.162791                 0.236538      Rene Rivera  0.473165
39469         0.197425                 0.213884       Ryan Goins  0.488830
16207         0.231707                 0.214815       Jed Lowrie  0.575495
20255         0.117647                 0.225577      Jeff Mathis  0.611366

Make your choice:
