# Data loader

I wrote some helper functions to make loading data easier for experimentation. The code is located in `Notebook/utils/data_loader.py`. Below are examples of how to use it.

In [11]:
from utils.data_loader import Dataset

Everything is housed inside a `Dataset`. It only requires a name (used for saving and loading).

In [12]:
train_ds = Dataset(name='my_training_dataset')

A new `Dataset` starts out empty. You can always see what data it has loaded through the `.data` attribute, which is `None` until something is loaded and a pandas DataFrame afterwards.

In [13]:
train_ds.data is None

True
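
When checking whether anything has been loaded, an identity test (`is None`) is the safer idiom: once `.data` holds a DataFrame, comparing it to `None` with `==` is evaluated element by element and no longer yields a single boolean. A minimal sketch with plain pandas and toy data (the `has_data` helper is hypothetical and not part of `data_loader.py`):

```python
import pandas as pd


def has_data(data):
    """Return True if `data` holds a loaded DataFrame (hypothetical helper)."""
    return data is not None


empty = None
# Toy frame standing in for a loaded dataset
loaded = pd.DataFrame({"home_team": ["TOR", "SEA"], "home_win": [1.0, 1.0]})

print(has_data(empty))   # False
print(has_data(loaded))  # True

# `==` against None is applied element-wise, so the result is a DataFrame, not a bool
elementwise = loaded == None  # noqa: E711
print(type(elementwise).__name__)
```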

## Games

Load games using the `load_games` method. Optionally you can specify start and end dates. By default it gets all data from 2000 to 2015 inclusive.

In [14]:
train_ds.load_games(start_date='2000-01-01', end_date='2003-01-01')

Unnamed: 0,date,Y,M,D,home_team,away_team,home_win,home_pitcher,away_pitcher,home_elo,...,avg_diff,obp_diff,slg_diff,avg_pct_diff,obp_pct_diff,slg_pct_diff,home_rest,away_rest,away_team_season_game_num,home_team_season_game_num
0,2001-04-01,2001,4.0,1.0,TOR,TEX,1.0,loaizes01,helliri01,1499.562988,...,-0.008060,-0.010103,0.023271,-2.947374,-2.977845,4.989568,5.0,5.0,0,0
1,2001-04-02,2001,4.0,2.0,SEA,OAK,1.0,garcifr03,hudsoti01,1519.463989,...,-0.000864,0.001190,-0.016229,-0.323318,0.331871,-3.705210,5.0,5.0,0,0
2,2001-04-02,2001,4.0,2.0,NYA,KCA,1.0,clemero02,suppaje01,1529.510986,...,-0.010188,0.006929,0.024787,-3.703559,1.970596,5.554343,5.0,5.0,0,0
3,2001-04-02,2001,4.0,2.0,CIN,ATL,0.0,harnipe01,burkejo03,1527.274048,...,0.003972,-0.001729,0.020216,1.459194,-0.506960,4.555242,5.0,5.0,0,0
4,2001-04-02,2001,4.0,2.0,CHN,WAS,0.0,liebejo01,vazquja01,1462.510010,...,-0.010158,0.009335,-0.018992,-3.996340,2.803560,-4.646432,5.0,5.0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4850,2002-09-29,2002,9.0,29.0,TEX,OAK,0.0,benoijo01,zitoba01,1498.168945,...,-0.004118,-0.015355,-0.006046,-1.562305,-4.366582,-1.402798,1.0,1.0,80,80
4851,2002-09-29,2002,9.0,29.0,SLN,MIL,1.0,benesan01,frankwa01,1564.499023,...,0.043451,0.055172,0.090154,17.782738,16.750874,22.641031,1.0,1.0,80,80
4852,2002-09-29,2002,9.0,29.0,LAN,SDN,0.0,alvarvi01,perezol01,1538.897949,...,0.079375,0.086138,0.092961,29.143755,25.025467,24.589701,1.0,1.0,80,80
4853,2002-09-29,2002,9.0,29.0,ARI,COL,1.0,pattejo02,starkde01,1556.958008,...,0.006467,0.046106,0.085538,2.213280,11.996187,16.898954,1.0,1.0,80,80


This returns all games in the requested date range (data comes from `data/mlb_games_df.csv`). The method returns the data and also stores it in the `.data` attribute.
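
As a rough illustration of the date filtering, the sketch below keeps rows whose `date` falls inside an inclusive range. It is not the actual `load_games` implementation: the `filter_games` helper, its default end date of `2015-12-31`, and the toy rows are assumptions made for the example.

```python
import pandas as pd


def filter_games(games, start_date="2000-01-01", end_date="2015-12-31"):
    """Keep games dated within [start_date, end_date] (hypothetical helper)."""
    dates = pd.to_datetime(games["date"])
    mask = (dates >= pd.Timestamp(start_date)) & (dates <= pd.Timestamp(end_date))
    return games.loc[mask].reset_index(drop=True)


# Toy data: only the `date` and `home_team` columns are modelled here
games = pd.DataFrame({
    "date": ["1999-10-03", "2001-04-01", "2002-09-29", "2003-04-01"],
    "home_team": ["NYA", "TOR", "ARI", "SEA"],
})

subset = filter_games(games, start_date="2000-01-01", end_date="2003-01-01")
print(subset)
```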

In [15]:
# The same as what was printed out above
train_ds.data.head()

Unnamed: 0,date,Y,M,D,home_team,away_team,home_win,home_pitcher,away_pitcher,home_elo,...,avg_diff,obp_diff,slg_diff,avg_pct_diff,obp_pct_diff,slg_pct_diff,home_rest,away_rest,away_team_season_game_num,home_team_season_game_num
0,2001-04-01,2001,4.0,1.0,TOR,TEX,1.0,loaizes01,helliri01,1499.562988,...,-0.00806,-0.010103,0.023271,-2.947374,-2.977845,4.989568,5.0,5.0,0,0
1,2001-04-02,2001,4.0,2.0,SEA,OAK,1.0,garcifr03,hudsoti01,1519.463989,...,-0.000864,0.00119,-0.016229,-0.323318,0.331871,-3.70521,5.0,5.0,0,0
2,2001-04-02,2001,4.0,2.0,NYA,KCA,1.0,clemero02,suppaje01,1529.510986,...,-0.010188,0.006929,0.024787,-3.703559,1.970596,5.554343,5.0,5.0,0,0
3,2001-04-02,2001,4.0,2.0,CIN,ATL,0.0,harnipe01,burkejo03,1527.274048,...,0.003972,-0.001729,0.020216,1.459194,-0.50696,4.555242,5.0,5.0,0,0
4,2001-04-02,2001,4.0,2.0,CHN,WAS,0.0,liebejo01,vazquja01,1462.51001,...,-0.010158,0.009335,-0.018992,-3.99634,2.80356,-4.646432,5.0,5.0,0,0


## Team stats

You can add team stats (from `data/team_stats.csv`) from previous years with the `.add_team_stats()` method. By default it uses the previous year's data, but you can add different or additional years by specifying `year_offset` (default of 1). For example, `year_offset=2` gets team stats from two years prior. New columns are prefixed with `home_` or `away_` and suffixed with the offset; with the default offset, for example, `home_Avg_Attendance_offset1year` holds the home team's average attendance from the previous year.

By default, **no columns are actually loaded from the team stats**, so you need to specify them using `cols=[...]`. Check the `team_stats.csv` file to see which columns are available.

In [16]:
train_ds.add_team_stats(cols=['Avg_Attendance', 'W-L-pct'])

Unnamed: 0,date,Y,M,D,home_team,away_team,home_win,home_pitcher,away_pitcher,home_elo,...,obp_pct_diff,slg_pct_diff,home_rest,away_rest,away_team_season_game_num,home_team_season_game_num,home_Avg_Attendance_offset1year,home_W-L-pct_offset1year,away_Avg_Attendance_offset1year,away_W-L-pct_offset1year
0,2001-04-01,2001,4.0,1.0,TOR,TEX,1.0,loaizes01,helliri01,1499.562988,...,-2.977845,4.989568,5.0,5.0,0,0,24861.419753,0.512346,32341.993789,0.438272
1,2001-04-02,2001,4.0,2.0,SEA,OAK,1.0,garcifr03,hudsoti01,1519.463989,...,0.331871,-3.705210,5.0,5.0,0,0,33215.672840,0.561728,26058.875776,0.565217
2,2001-04-02,2001,4.0,2.0,NYA,KCA,1.0,clemero02,suppaje01,1529.510986,...,1.970596,5.554343,5.0,5.0,0,0,37914.055901,0.540373,24699.740741,0.475309
3,2001-04-02,2001,4.0,2.0,CIN,ATL,0.0,harnipe01,burkejo03,1527.274048,...,-0.506960,4.555242,5.0,5.0,0,0,34630.858025,0.524691,35842.962733,0.586420
4,2001-04-02,2001,4.0,2.0,CHN,WAS,0.0,liebejo01,vazquja01,1462.510010,...,2.803560,-4.646432,5.0,5.0,0,0,34714.518750,0.401235,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4850,2002-09-29,2002,9.0,29.0,TEX,OAK,0.0,benoijo01,zitoba01,1498.168945,...,-4.366582,-1.402798,1.0,1.0,80,80,31576.515528,0.450617,27327.695652,0.629630
4851,2002-09-29,2002,9.0,29.0,SLN,MIL,1.0,benesan01,frankwa01,1564.499023,...,16.750874,22.641031,1.0,1.0,80,80,36152.633540,0.574074,32171.693750,0.419753
4852,2002-09-29,2002,9.0,29.0,LAN,SDN,0.0,alvarvi01,perezol01,1538.897949,...,25.025467,24.589701,1.0,1.0,80,80,35078.895062,0.530864,30563.919255,0.487654
4853,2002-09-29,2002,9.0,29.0,ARI,COL,1.0,pattejo02,starkde01,1556.958008,...,11.996187,16.898954,1.0,1.0,80,80,32990.739130,0.567901,34977.111111,0.450617
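
One way to picture the year offset is as a left join between the games and a copy of the team stats whose season has been shifted forward. The sketch below is only an illustration under assumptions: the `add_offset_stats` helper, the `team` and `year` column names in the stats table, and the column suffix used for offsets other than 1 are all hypothetical. The `W-L-pct` values are taken from the output above.

```python
import pandas as pd


def add_offset_stats(games, team_stats, cols, year_offset=1):
    """Attach each team's stats from `year_offset` seasons earlier (hypothetical helper)."""
    out = games.copy()
    # Move the stats forward so that season Y lines up with games played in Y + year_offset
    shifted = team_stats[["team", "year"] + cols].copy()
    shifted["year"] = shifted["year"] + year_offset
    for side in ("home", "away"):
        names = {c: f"{side}_{c}_offset{year_offset}year" for c in cols}
        names.update({"team": f"{side}_team", "year": "Y"})
        # A left join keeps every game; teams with no earlier season get NaN
        out = out.merge(shifted.rename(columns=names), on=[f"{side}_team", "Y"], how="left")
    return out


games = pd.DataFrame({
    "Y": [2001, 2001],
    "home_team": ["TOR", "CHN"],
    "away_team": ["TEX", "WAS"],
})
team_stats = pd.DataFrame({
    "team": ["TOR", "TEX", "CHN"],
    "year": [2000, 2000, 2000],
    "W-L-pct": [0.512346, 0.438272, 0.401235],
})

result = add_offset_stats(games, team_stats, cols=["W-L-pct"])
print(result)
```

The away team in the second row has no entry for the earlier season, so its value stays missing, which matches the empty cells in the output above.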


## Team pitching

Similarly, you can add team pitching stats (at the team level, _not_ the pitcher level) using `.add_team_pitching_stats()`. The parameters are the same as for `.add_team_stats()`. Data comes from `data/team_pitching_stats.csv`.

In [17]:
train_ds.add_team_pitching_stats(cols=['WHIP', 'ERA'])

Unnamed: 0,date,Y,M,D,home_team,away_team,home_win,home_pitcher,away_pitcher,home_elo,...,away_team_season_game_num,home_team_season_game_num,home_Avg_Attendance_offset1year,home_W-L-pct_offset1year,away_Avg_Attendance_offset1year,away_W-L-pct_offset1year,home_WHIP_offset1year,home_ERA_offset1year,away_WHIP_offset1year,away_ERA_offset1year
0,2001-04-01,2001,4.0,1.0,TOR,TEX,1.0,loaizes01,helliri01,1499.562988,...,0,0,24861.419753,0.512346,32341.993789,0.438272,1.513465,5.17,1.640308,5.52
1,2001-04-02,2001,4.0,2.0,SEA,OAK,1.0,garcifr03,hudsoti01,1519.463989,...,0,0,33215.672840,0.561728,26058.875776,0.565217,1.440466,4.53,1.498153,4.58
2,2001-04-02,2001,4.0,2.0,NYA,KCA,1.0,clemero02,suppaje01,1529.510986,...,0,0,37914.055901,0.540373,24699.740741,0.475309,1.428973,4.76,1.582934,5.48
3,2001-04-02,2001,4.0,2.0,CIN,ATL,0.0,harnipe01,burkejo03,1527.274048,...,0,0,34630.858025,0.524691,35842.962733,0.586420,1.445642,4.33,1.327686,4.06
4,2001-04-02,2001,4.0,2.0,CHN,WAS,0.0,liebejo01,vazquja01,1462.510010,...,0,0,34714.518750,0.401235,,,1.487416,5.26,1.512428,5.14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4850,2002-09-29,2002,9.0,29.0,TEX,OAK,0.0,benoijo01,zitoba01,1498.168945,...,80,80,31576.515528,0.450617,27327.695652,0.629630,1.575690,5.71,1.246668,3.59
4851,2002-09-29,2002,9.0,29.0,SLN,MIL,1.0,benesan01,frankwa01,1564.499023,...,80,80,36152.633540,0.574074,32171.693750,0.419753,1.334402,3.96,1.475524,4.65
4852,2002-09-29,2002,9.0,29.0,LAN,SDN,0.0,alvarvi01,perezol01,1538.897949,...,80,80,35078.895062,0.530864,30563.919255,0.487654,1.317749,4.25,1.385224,4.52
4853,2002-09-29,2002,9.0,29.0,ARI,COL,1.0,pattejo02,starkde01,1556.958008,...,80,80,32990.739130,0.567901,34977.111111,0.450617,1.242462,3.88,1.482517,5.29


## Pitcher stats

Finally, you can add stats for an individual pitcher (from `data/pitchers_games.csv`). Rather than a year offset, this uses `game_offset`, and it only considers games from the same season, so for a pitcher's first start of the season any "previous" stats will be missing (`NaN`). By default it uses the previous game's stats. Note that these are games that the pitcher started, so the stats may come from several team games back if that was the most recent game in which the pitcher started.

In [18]:
train_ds.add_pitcher_stats(cols=['WHIP', 'ERA', 'IP'])

Unnamed: 0,date,Y,M,D,home_team,away_team,home_win,home_pitcher,away_pitcher,home_elo,...,away_WHIP_offset1year,away_ERA_offset1year,home_pitcher_season_game,WHIP,ERA,IP,away_pitcher_season_game,WHIP_1,ERA_1,IP_1
0,2001-04-01,2001,4.0,1.0,TOR,TEX,1.0,loaizes01,helliri01,1499.562988,...,1.640308,5.52,1.0,,,,1.0,,,
1,2001-04-02,2001,4.0,2.0,SEA,OAK,1.0,garcifr03,hudsoti01,1519.463989,...,1.498153,4.58,1.0,,,,1.0,,,
2,2001-04-02,2001,4.0,2.0,NYA,KCA,1.0,clemero02,suppaje01,1529.510986,...,1.582934,5.48,1.0,,,,1.0,,,
3,2001-04-02,2001,4.0,2.0,CIN,ATL,0.0,harnipe01,burkejo03,1527.274048,...,1.327686,4.06,1.0,,,,1.0,,,
4,2001-04-02,2001,4.0,2.0,CHN,WAS,0.0,liebejo01,vazquja01,1462.510010,...,1.512428,5.14,1.0,,,,1.0,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4850,2002-09-29,2002,9.0,29.0,TEX,OAK,0.0,benoijo01,zitoba01,1498.168945,...,1.246668,3.59,6.0,3.437500,4.70,3.2,18.0,3.666667,3.58,3.0
4851,2002-09-29,2002,9.0,29.0,SLN,MIL,1.0,benesan01,frankwa01,1564.499023,...,1.475524,4.65,8.0,0.704225,4.58,7.1,2.0,1.800000,9.00,5.0
4852,2002-09-29,2002,9.0,29.0,LAN,SDN,0.0,alvarvi01,perezol01,1538.897949,...,1.385224,4.52,1.0,,,,8.0,1.000000,2.82,7.0
4853,2002-09-29,2002,9.0,29.0,ARI,COL,1.0,pattejo02,starkde01,1556.958008,...,1.482517,5.29,2.0,0.833333,1.50,6.0,8.0,1.147541,2.84,6.1
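
The "previous start within the same season" lookup can be sketched with a grouped shift in pandas. This is an assumption about one possible approach, not the code in `data_loader.py`; the column names (`pitcher`, `season`, `IP_prev_start`) and the toy values are made up for the example.

```python
import pandas as pd

# Toy log of starts: one row per game a pitcher started
starts = pd.DataFrame({
    "pitcher": ["a", "b", "a", "a", "b"],
    "season": [2001, 2001, 2001, 2002, 2001],
    "date": pd.to_datetime(
        ["2001-04-01", "2001-04-02", "2001-04-06", "2002-04-01", "2001-04-20"]
    ),
    "IP": [6.0, 5.0, 7.0, 8.0, 4.0],
})

# Order each pitcher's starts chronologically, then shift by one within (pitcher, season)
starts = starts.sort_values(["pitcher", "season", "date"]).reset_index(drop=True)
starts["IP_prev_start"] = starts.groupby(["pitcher", "season"])["IP"].shift(1)

print(starts)
```

Because the shift is applied per pitcher and season, the first start of each season has no previous value, and a long gap between two starts does not matter: the value always comes from the most recent earlier start.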


You can also ask for the average over the past few games by setting `game_offset` to a value greater than 1 (the default). For example, with `game_offset=5` each stat is averaged over the past five games the pitcher started. Note that if the pitcher has started _fewer than_ five games (for example, early in the season), the average only covers the games they actually started.

In [19]:
train_ds.add_pitcher_stats(cols=['WHIP', 'ERA', 'IP'], game_offset=5)

Unnamed: 0,date,Y,M,D,home_team,away_team,home_win,home_pitcher,away_pitcher,home_elo,...,away_WHIP_offset1year,away_ERA_offset1year,home_pitcher_season_game,away_pitcher_season_game,home_pitcher_WHIP_avg_5games,home_pitcher_ERA_avg_5games,home_pitcher_IP_avg_5games,away_pitcher_WHIP_avg_5games,away_pitcher_ERA_avg_5games,away_pitcher_IP_avg_5games
0,2001-04-01,2001,4.0,1.0,TOR,TEX,1.0,loaizes01,helliri01,1499.562988,...,1.640308,5.52,1.0,1.0,,,,,,
1,2001-04-02,2001,4.0,2.0,SEA,OAK,1.0,garcifr03,hudsoti01,1519.463989,...,1.498153,4.58,1.0,1.0,,,,,,
2,2001-04-02,2001,4.0,2.0,NYA,KCA,1.0,clemero02,suppaje01,1529.510986,...,1.582934,5.48,1.0,1.0,,,,,,
3,2001-04-02,2001,4.0,2.0,CIN,ATL,0.0,harnipe01,burkejo03,1527.274048,...,1.327686,4.06,1.0,1.0,,,,,,
4,2001-04-02,2001,4.0,2.0,CHN,WAS,0.0,liebejo01,vazquja01,1462.510010,...,1.512428,5.14,1.0,1.0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4850,2002-09-29,2002,9.0,29.0,TEX,OAK,0.0,benoijo01,zitoba01,1498.168945,...,1.246668,3.59,6.0,18.0,2.577602,3.765555,4.300000,1.476823,3.224,6.06
4851,2002-09-29,2002,9.0,29.0,SLN,MIL,1.0,benesan01,frankwa01,1564.499023,...,1.475524,4.65,8.0,2.0,1.458013,7.117777,5.400000,1.800000,9.000,5.00
4852,2002-09-29,2002,9.0,29.0,LAN,SDN,0.0,alvarvi01,perezol01,1538.897949,...,1.385224,4.52,1.0,8.0,1.000000,2.820000,7.000000,1.161152,3.282,6.46
4853,2002-09-29,2002,9.0,29.0,ARI,COL,1.0,pattejo02,starkde01,1556.958008,...,1.482517,5.29,2.0,8.0,0.959016,2.036000,6.040000,1.462938,2.762,5.66


In [20]:
train_ds.data.tail()

Unnamed: 0,date,Y,M,D,home_team,away_team,home_win,home_pitcher,away_pitcher,home_elo,...,away_WHIP_offset1year,away_ERA_offset1year,home_pitcher_season_game,away_pitcher_season_game,home_pitcher_WHIP_avg_5games,home_pitcher_ERA_avg_5games,home_pitcher_IP_avg_5games,away_pitcher_WHIP_avg_5games,away_pitcher_ERA_avg_5games,away_pitcher_IP_avg_5games
4850,2002-09-29,2002,9.0,29.0,TEX,OAK,0.0,benoijo01,zitoba01,1498.168945,...,1.246668,3.59,6.0,18.0,2.577602,3.765555,4.3,1.476823,3.224,6.06
4851,2002-09-29,2002,9.0,29.0,SLN,MIL,1.0,benesan01,frankwa01,1564.499023,...,1.475524,4.65,8.0,2.0,1.458013,7.117777,5.4,1.8,9.0,5.0
4852,2002-09-29,2002,9.0,29.0,LAN,SDN,0.0,alvarvi01,perezol01,1538.897949,...,1.385224,4.52,1.0,8.0,1.0,2.82,7.0,1.161152,3.282,6.46
4853,2002-09-29,2002,9.0,29.0,ARI,COL,1.0,pattejo02,starkde01,1556.958008,...,1.482517,5.29,2.0,8.0,0.959016,2.036,6.04,1.462938,2.762,5.66
4854,2002-09-29,2002,9.0,29.0,ANA,SEA,1.0,seleaa01,valdeis01,1572.269043,...,1.2,3.54,15.0,16.0,1.430306,4.9,6.044445,1.19173,3.782,6.44
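
The averaging behaviour described above, including the shorter window early in the season, resembles a rolling mean with `min_periods=1` applied to the earlier starts only. The following is a sketch under assumptions rather than the loader's real code: `rolling_prior_mean`, the column names, and the toy values are hypothetical, and the rows are assumed to be in chronological order already.

```python
import pandas as pd


def rolling_prior_mean(starts, col, game_offset=5):
    """Mean of `col` over up to `game_offset` earlier starts in the same season
    (hypothetical helper; expects rows sorted chronologically)."""
    def per_group(values):
        # shift(1) excludes the current start; min_periods=1 averages whatever exists
        return values.shift(1).rolling(window=game_offset, min_periods=1).mean()

    return starts.groupby(["pitcher", "season"])[col].transform(per_group)


starts = pd.DataFrame({
    "pitcher": ["a"] * 4,
    "season": [2002] * 4,
    "IP": [6.0, 7.0, 5.0, 8.0],
})
starts["IP_avg_2games"] = rolling_prior_mean(starts, "IP", game_offset=2)

print(starts)
```

With a window of two, the second start averages over a single earlier game, while later starts average over the two most recent ones.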


## Saving and loading

Finally, you can save and load your data easily. By default files are saved to `data/saved_datasets/{your_dataset_name}.csv`.

In [19]:
train_ds.save()

If you want to load data that you previously saved, create a new blank dataset with the same name as your saved data (for example, to load the dataset we just saved, we would create a new dataset with the same `name='my_training_dataset'`). Then run `.load()`.

In [20]:
new_train_ds = Dataset(name='my_training_dataset')
new_train_ds.load()

In [21]:
new_train_ds.data.head()

Unnamed: 0,away_team_season_game_num,home_team_season_game_num,date,Y,M,D,home_team,away_team,home_win,home_pitcher,...,team_home_WHIP_offset1,team_home_ERA_offset1,team_away_WHIP_offset1,team_away_ERA_offset1,pitcher_home_WHIP_offset1,pitcher_home_ERA_offset1,pitcher_home_IP_offset1,pitcher_away_WHIP_offset1,pitcher_away_ERA_offset1,pitcher_away_IP_offset1
0,13,13,2001-04-28,2001,4.0,28.0,CHA,SEA,0.0,biddlro01,...,1.464037,4.67,1.440466,4.53,2.54902,3.86,5.1,,,
1,18,19,2001-05-11,2001,5.0,11.0,TOR,SEA,0.0,hamiljo02,...,1.513465,5.17,1.440466,4.53,1.285714,4.99,7.0,1.568628,4.7,5.1
2,21,22,2001-05-22,2001,5.0,22.0,MIN,SEA,1.0,radkebr01,...,1.501187,5.16,1.440466,4.53,2.195122,3.39,4.1,4.545454,6.66,2.2
3,26,26,2001-05-28,2001,5.0,28.0,KCA,SEA,0.0,durbich01,...,1.582934,5.48,1.440466,4.53,3.333333,5.2,3.0,1.0,5.67,9.0
4,28,35,2001-06-14,2001,6.0,14.0,COL,SEA,0.0,astacpe01,...,1.507692,5.29,1.440466,4.53,2.0,5.28,6.0,1.142857,4.25,7.0


In [23]:
new_train_ds.data.shape == train_ds.data.shape

True
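
For reference, a name-based CSV round trip can be sketched with plain pandas as follows. The `save_dataset` and `load_dataset` helpers are hypothetical stand-ins, not the `Dataset.save()` and `Dataset.load()` implementations, and the example writes to a temporary directory instead of `data/saved_datasets/`.

```python
import os
import tempfile

import pandas as pd


def save_dataset(data, name, directory):
    """Write `data` to `{directory}/{name}.csv` (hypothetical helper)."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, f"{name}.csv")
    # index=False leaves the row index out of the file; otherwise pandas writes it
    # as an unnamed first column, which read_csv then labels "Unnamed: 0"
    data.to_csv(path, index=False)
    return path


def load_dataset(name, directory):
    """Read back the CSV written by `save_dataset` (hypothetical helper)."""
    return pd.read_csv(os.path.join(directory, f"{name}.csv"))


# Toy frame standing in for a populated dataset
data = pd.DataFrame({"home_team": ["TOR", "SEA"], "home_win": [1.0, 1.0]})

with tempfile.TemporaryDirectory() as tmp:
    save_dataset(data, "my_training_dataset", tmp)
    restored = load_dataset("my_training_dataset", tmp)

print(restored.shape == data.shape)
```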