# Data validation

The point of this notebook is to carefully check the data. Do different data sets contain _exactly_ the same games? Are pitchers in one data set always present in other data sets? These are the types of questions I want to have definitive answers to here. **I'm only looking at data used in the data loader.**

In [66]:
import pandas as pd
import numpy as np
import math

from utils.data_cleaning import uniform_name

In [67]:
mlb_games_df = pd.read_csv('../data/mlb_games_df.csv')

pitchers_games_df = pd.read_csv('../data/starting_pitchers_games.csv')
pitchers_summary_df = pd.read_csv('../data/pitchers_summary.csv')

team_stats_df = pd.read_csv('../data/team_stats.csv')
team_pitching_df = pd.read_csv('../data/team_pitching_stats.csv')

In [68]:
mlb_games_df['date'] = pd.to_datetime(mlb_games_df['date'])
pitchers_games_df['Date'] = pd.to_datetime(pitchers_games_df['Date'])

## Games

Are the same games present in all data sets? `mlb_games_df` starts in 2001, while `pitchers_games_df` starts in 2000 (which is fine and expected), so we'll trim `pitchers_games_df` for the comparison.

In [69]:
pitchers_games_2001_df = pitchers_games_df[pitchers_games_df['Date'] >= '2001-01-01']

In [70]:
mlb_games_df.head()

Unnamed: 0,date,Y,M,D,home_team,away_team,home_win,home_pitcher,away_pitcher,home_elo,...,avg_diff,obp_diff,slg_diff,avg_pct_diff,obp_pct_diff,slg_pct_diff,home_rest,away_rest,away_team_season_game_num,home_team_season_game_num
0,2001-04-01,2001,4.0,1.0,TOR,TEX,1.0,loaizes01,helliri01,1499.563,...,-0.00806,-0.010103,0.023271,-2.947374,-2.977845,4.989568,5.0,5.0,0,0
1,2001-04-02,2001,4.0,2.0,SEA,OAK,1.0,garcifr03,hudsoti01,1519.464,...,-0.000864,0.00119,-0.016229,-0.323318,0.331871,-3.70521,5.0,5.0,0,0
2,2001-04-02,2001,4.0,2.0,NYA,KCA,1.0,clemero02,suppaje01,1529.511,...,-0.010188,0.006929,0.024787,-3.703559,1.970596,5.554343,5.0,5.0,0,0
3,2001-04-02,2001,4.0,2.0,CIN,ATL,0.0,harnipe01,burkejo03,1527.274,...,0.003972,-0.001729,0.020216,1.459194,-0.50696,4.555242,5.0,5.0,0,0
4,2001-04-02,2001,4.0,2.0,CHN,WAS,0.0,liebejo01,vazquja01,1462.51,...,-0.010158,0.009335,-0.018992,-3.99634,2.80356,-4.646432,5.0,5.0,0,0


In [71]:
pitchers_games_2001_df.head()

Unnamed: 0,Gcar,Gtm,Date,Tm,Opp,Inngs,Dec,DR,IP,H,...,Home_Tm,WHIP,Result,Tm_Score,Opp_Score,name,DFS(DK),DFS(FD),Year,season_game
4,541,11,2001-04-14,ARI,COL,GS-4,,2,4.0,8,...,COL,2.0,L,8,9,morgami01,,,2001,4.0
5,7,156,2019-09-22,ANA,HOU,GS-2,L(0-1),4,2.0,4,...,HOU,2.5,L,5,13,rodrijo07,-2.5,0.0,2019,7.0
6,1,80,2017-07-01,MIN,KCR,GS-6,W(1-0),99,5.0,7,...,KCR,1.6,W,10,5,jorgefe01,,,2017,1.0
7,2,86,2017-07-07,MIN,BAL,GS-3,,5,2.2,7,...,MIN,3.636364,W,9,6,jorgefe01,,,2017,2.0
30,129,103,2001-07-28,CLE,DET,GS-6,L(1-1),2,5.2,8,...,DET,1.730769,L,2,4,woodast01,,,2001,17.0


In [72]:
assert set(mlb_games_df['date'].unique()) == set(pitchers_games_2001_df['Date'].unique())

So both contain exactly the same dates, good. What about team names?

In [73]:
set(mlb_games_df['home_team'].unique()) - set(pitchers_games_2001_df['Home_Tm'].unique())

{'CHA', 'CHN', 'KCA', 'LAN', 'NYA', 'NYN', 'SDN', 'SFN', 'SLN', 'TBA', 'WAS'}

In [74]:
set(pitchers_games_2001_df['Home_Tm'].unique()) - set(mlb_games_df['home_team'].unique())

{'CHC',
 'CHW',
 'FLA',
 'KCR',
 'LAA',
 'LAD',
 'MON',
 'NYM',
 'NYY',
 'SDP',
 'SFG',
 'STL',
 'TBD',
 'TBR',
 'WSN'}

Not at all! We'll run both through the standardizer function, `uniform_name`.
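The two one-sided set differences above get repeated a few times in this notebook, so they could be wrapped in a small helper. This is only a sketch: `compare_values` is a made-up name, and the lists below are toy values rather than the real columns.

```python
def compare_values(left, right):
    """Return (only_in_left, only_in_right) for two iterables of labels."""
    left_set, right_set = set(left), set(right)
    return left_set - right_set, right_set - left_set


# Toy example, not the real data
only_games, only_pitchers = compare_values(['NYA', 'BOS', 'CHN'],
                                           ['NYY', 'BOS', 'CHC'])
print(sorted(only_games))
print(sorted(only_pitchers))
```

With the real data this would be called as `compare_values(mlb_games_df['home_team'], pitchers_games_2001_df['Home_Tm'])`.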

In [75]:
mlb_games_df['home_team'] = mlb_games_df['home_team'].apply(uniform_name)
mlb_games_df['away_team'] = mlb_games_df['away_team'].apply(uniform_name)

pitchers_games_df['Tm'] = pitchers_games_df['Tm'].apply(uniform_name)
pitchers_games_df['Home_Tm'] = pitchers_games_df['Home_Tm'].apply(uniform_name)

pitchers_games_2001_df['Tm'] = pitchers_games_2001_df['Tm'].apply(uniform_name)
pitchers_games_2001_df['Home_Tm'] = pitchers_games_2001_df['Home_Tm'].apply(uniform_name)

nan is invalid
...


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  pitchers_games_2001_df['Tm'] = pitchers_games_2001_df['Tm'].apply(uniform_name)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  pitchers_games_2001_df['Home_Tm'] = pitchers_games_2001_df['Home_Tm'].apply(uniform_name)
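The warning above comes from assigning a column on `pitchers_games_2001_df`, which was created by filtering `pitchers_games_df`. One common way to avoid it is to take an explicit `.copy()` when creating the subset. A minimal sketch on a made-up frame (the column values are placeholders, and `str.lower` just stands in for `uniform_name`):

```python
import pandas as pd

# Made-up stand-in for pitchers_games_df
df = pd.DataFrame({
    'Date': pd.to_datetime(['2000-06-05', '2001-04-14', '2001-07-28']),
    'Tm': ['ARI', 'ARI', 'CLE'],
})

# .copy() makes the subset an independent DataFrame, so assigning a column
# to it is unambiguous and does not touch `df`
subset = df[df['Date'] >= '2001-01-01'].copy()
subset['Tm'] = subset['Tm'].str.lower()
```

The original frame is left unchanged, which is what we want here since both frames are standardized separately.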


In [76]:
set(mlb_games_df['home_team'].unique()) - set(pitchers_games_2001_df['Home_Tm'].unique())

set()

In [77]:
set(pitchers_games_2001_df['Home_Tm'].unique()) - set(mlb_games_df['home_team'].unique())

set()

So the team names now match, although the `nan is invalid` messages show that `pitchers_games_df` is missing some of them.

In [78]:
pitchers_games_df[pitchers_games_df['Tm'].isna()]

Unnamed: 0,Gcar,Gtm,Date,Tm,Opp,Inngs,Dec,DR,IP,H,...,Home_Tm,WHIP,Result,Tm_Score,Opp_Score,name,DFS(DK),DFS(FD),Year,season_game
1652,1,6,2003-04-06,,DET,GS-7,,99,6.2,7,...,CHA,1.290323,W,10,2,tewajo02,,,2003,1.0
1653,2,11,2003-04-12,,DET,GS-5,L(0-1),5,4.1,5,...,DET,1.951220,L,3,4,tewajo02,,,2003,2.0
1654,3,17,2003-04-19,,CLE,GS-6,W(1-1),6,6.0,5,...,CHA,1.333333,W,12,3,tewajo02,,,2003,3.0
1655,4,22,2003-04-24,,BAL,GS-5,,4,5.0,4,...,BAL,1.600000,L,4,5,tewajo02,,,2003,4.0
1656,5,30,2003-05-03,,SEA,GS-4,L(1-2),8,3.2,7,...,CHA,3.750000,L,2,12,tewajo02,,,2003,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96005,152,132,2016-08-31,,BOS,GS-5,,4,5.0,7,...,BOS,1.800000,L,6,8,mylydr01,,,2016,26.0
96006,153,138,2016-09-07,,BAL,GS-4,,6,3.2,7,...,TBA,3.125000,W,7,6,mylydr01,,,2016,27.0
96007,154,144,2016-09-13,,TOR,GS-6,W(7-11),5,5.2,5,...,TOR,1.153846,W,6,2,mylydr01,,,2016,28.0
96008,155,150,2016-09-20,,NYY,GS-6,,6,6.0,4,...,TBA,1.000000,L,3,5,mylydr01,,,2016,29.0


In [79]:
pitchers_games_df[pitchers_games_df['Home_Tm'].isna()].shape[0]

0

I'll join to `mlb_games_df` and take the names from there. That only works from 2001 onward, so hopefully I can fill in the ones from 2000 by hand.

In [80]:
pitchers_games_df = pitchers_games_df.merge(mlb_games_df, left_on=['Date', 'Home_Tm'], 
                                             right_on=['date', 'home_team'], how='left')
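One thing to watch with a left merge like this: if `mlb_games_df` has more than one row for a `(date, home_team)` key, for example on doubleheader days, every matching pitcher row gets duplicated. I haven't confirmed that this happens here, but pandas can check it via the `validate` argument. A sketch with made-up rows:

```python
import pandas as pd
from pandas.errors import MergeError

# Made-up doubleheader: the same (date, home_team) key appears twice
games = pd.DataFrame({
    'date': pd.to_datetime(['2001-04-02', '2001-04-02']),
    'home_team': ['SEA', 'SEA'],
    'away_team': ['OAK', 'OAK'],
})
pitchers = pd.DataFrame({
    'Date': pd.to_datetime(['2001-04-02']),
    'Home_Tm': ['SEA'],
    'name': ['garcifr03'],
})

try:
    # many_to_one: many pitcher rows may share a key, but each key must be
    # unique on the games side
    pitchers.merge(games, left_on=['Date', 'Home_Tm'],
                   right_on=['date', 'home_team'],
                   how='left', validate='many_to_one')
    duplicated_keys = False
except MergeError:
    duplicated_keys = True

print(duplicated_keys)
```

If the check fails on the real data, the merge would need an extra key (such as a game number) or a de-duplicated right-hand side.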

In [81]:
pitchers_games_df['Tm'].isna().sum()

6062

In [82]:
min_cols = ['Date', 'Tm', 'Opp', 'Home_Tm', 'home_team', 'away_team']
pitchers_games_df[pitchers_games_df['Tm'].isna()][min_cols].head(10)

Unnamed: 0,Date,Tm,Opp,Home_Tm,home_team,away_team
1681,2003-04-06,,DET,CHA,CHA,DET
1682,2003-04-12,,DET,DET,DET,CHA
1683,2003-04-19,,CLE,CHA,CHA,CLE
1684,2003-04-24,,BAL,BAL,BAL,CHA
1685,2003-05-03,,SEA,CHA,CHA,SEA
1686,2004-08-21,,BOS,CHA,CHA,BOS
1687,2004-08-26,,CLE,CLE,CLE,CHA
2055,2013-09-04,,NYY,NYA,NYA,CHA
2056,2013-09-10,,DET,CHA,CHA,DET
2057,2013-09-16,,MIN,CHA,CHA,MIN


In [83]:
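def get_Tm(x):
    """Return the pitcher's team, inferring it from the merged game row if `Tm` is missing."""
    tm = x['Tm']
    if isinstance(tm, str):
        return tm
    # Tm is missing: the pitcher's team is whichever side is not the opponent.
    # Caveat: `Opp` was never passed through uniform_name, so it may not use
    # the same abbreviations as `home_team`.
    if x['home_team'] != x['Opp']:
        return x['home_team']
    return x['away_team']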

In [84]:
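def not_away_team(x):
    """Return `Opp` when the opponent is not the home team, else the existing `Tm`."""
    tm = x['Tm']
    # Caveat: `Opp` is unstandardized while `Home_Tm` is standardized, so this
    # comparison can differ purely because of the abbreviation scheme.
    # Note that on those rows an existing `Tm` is overwritten with `Opp`.
    if x['Opp'] != x['Home_Tm']:
        return x['Opp']
    return tm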

In [85]:
pitchers_games_df['Tm'] = pitchers_games_df.apply(get_Tm, axis='columns')
pitchers_games_df['Tm'] = pitchers_games_df.apply(not_away_team, axis='columns')
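A note of caution on the two functions above: `Opp` is never run through `uniform_name` in this notebook, so comparing it to the standardized `Home_Tm` or `home_team` can give a mismatch just because of the naming scheme (e.g. `CHC` vs `CHN`). Below is a rough sketch of an alternative fill that standardizes `Opp` first and only touches rows where `Tm` is missing. The `ALIASES` table and `standardize` are hypothetical stand-ins for `uniform_name`, and the rows are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical alias table standing in for utils.data_cleaning.uniform_name
ALIASES = {'CHC': 'CHN', 'CHW': 'CHA', 'NYY': 'NYA'}


def standardize(code):
    """Map an abbreviation to one scheme; missing values pass through."""
    if pd.isna(code):
        return code
    return ALIASES.get(code, code)


df = pd.DataFrame({
    'Tm':        [np.nan, np.nan, 'ARI'],
    'Opp':       ['DET',  'NYY',  'CHC'],
    'Home_Tm':   ['CHA',  'NYA',  'CHN'],
    'home_team': ['CHA',  'NYA',  'CHN'],
    'away_team': ['DET',  'CHA',  'ARI'],
})

opp = df['Opp'].map(standardize)
# The pitcher's team is whichever side of the game is not the opponent
inferred = df['home_team'].where(df['home_team'] != opp, df['away_team'])
# Only fill rows where Tm is missing; existing values are left alone
df['Tm'] = df['Tm'].fillna(inferred)
print(df['Tm'].tolist())
```

Rows without a matching game (such as the 2000 season) would keep a missing `Tm` under this approach, since both `home_team` and `away_team` are missing there.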

In [86]:
pitchers_games_df['Tm'].isna().sum()

100

In [87]:
pitchers_games_df[pitchers_games_df['Tm'].isna()]['Date'].dt.year.value_counts()

2000    100
Name: Date, dtype: int64

That fills in every missing `Tm` from 2001 on, but 100 rows from 2000 are still missing. **To-do: This year 2000 issue still needs fixing.**

In [88]:
pitchers_games_df[pitchers_games_df['Tm'].isna()][min_cols].head(10)

Unnamed: 0,Date,Tm,Opp,Home_Tm,home_team,away_team
5643,2000-07-17,,BOS,BOS,,
6833,2000-06-09,,TOR,TOR,,
14900,2000-04-11,,PIT,PIT,,
14904,2000-05-05,,MIL,MIL,,
14909,2000-06-01,,CIN,CIN,,
14911,2000-06-12,,MIL,MIL,,
14915,2000-07-05,,ATL,ATL,,
14917,2000-07-18,,BOS,BOS,,
14920,2000-08-04,,HOU,HOU,,
14921,2000-08-09,,ARI,ARI,,


In [89]:
# Drop the columns that came from the merge, then recreate 2001+ df
pitchers_games_df = pitchers_games_df.drop(mlb_games_df.columns, axis='columns')
pitchers_games_2001_df = pitchers_games_df[pitchers_games_df['Date'] >= '2001-01-01']
pitchers_games_df.to_csv('../data/pitchers_games.csv', index=False)

And what about the exact teams playing? `pitchers_games_df` doesn't store the away team explicitly (it's just the opposite of `Home_Tm`), so for simplicity we'll just compare the home teams.

In [90]:
mlb_games_df[['date', 'home_team']].equals(pitchers_games_2001_df[['Date', 'Home_Tm']])

False
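As far as I understand, `DataFrame.equals` also requires identical shape and labels, so with differently named columns and two pitcher rows per game it would return `False` even if the games agreed. A more forgiving check is to compare the unique `(date, home team)` keys with an outer merge and `indicator=True`. A sketch on made-up rows:

```python
import pandas as pd

games = pd.DataFrame({
    'date': pd.to_datetime(['2010-06-25', '2010-06-26']),
    'home_team': ['TOR', 'SEA'],
})
# Two rows per game, one per starting pitcher
pitchers = pd.DataFrame({
    'Date': pd.to_datetime(['2010-06-25', '2010-06-25',
                            '2010-06-26', '2010-06-26']),
    'Home_Tm': ['PHI', 'PHI', 'SEA', 'SEA'],
})

game_keys = games[['date', 'home_team']].drop_duplicates()
pitcher_keys = (pitchers[['Date', 'Home_Tm']]
                .drop_duplicates()
                .rename(columns={'Date': 'date', 'Home_Tm': 'home_team'}))

# _merge is 'both', 'left_only' (games only) or 'right_only' (pitchers only)
compared = game_keys.merge(pitcher_keys, on=['date', 'home_team'],
                           how='outer', indicator=True)
mismatches = compared[compared['_merge'] != 'both']
print(mismatches)
```

Counting `mismatches['_merge']` then gives the number of games missing on each side without double counting the two starters.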

In [107]:
min_cols = ['date', 'Date', 'home_team', 'Home_Tm', 'away_team', 'Tm']
home_merged_df = mlb_games_df.merge(pitchers_games_2001_df, 
                                    left_on=['date', 'home_team'],
                                    right_on=['Date', 'Home_Tm'], how='outer')[min_cols]

In [108]:
home_merged_df.head()

Unnamed: 0,date,Date,home_team,Home_Tm,away_team,Tm
0,2001-04-01,2001-04-01,TOR,TOR,TEX,TEX
1,2001-04-01,2001-04-01,TOR,TOR,TEX,TEX
2,2001-04-02,2001-04-02,SEA,SEA,OAK,OAK
3,2001-04-02,2001-04-02,SEA,SEA,OAK,OAK
4,2001-04-02,2001-04-02,NYA,NYA,KCA,NYY


In [109]:
home_merged_df[home_merged_df['home_team'].isna()].head()

Unnamed: 0,date,Date,home_team,Home_Tm,away_team,Tm
97005,NaT,2010-06-25,,PHI,,TOR
97006,NaT,2010-06-25,,PHI,,TOR
97007,NaT,2015-05-02,,TBA,,BAL
97008,NaT,2015-05-02,,TBA,,TBR
97009,NaT,2011-06-26,,SEA,,FLA


In [111]:
home_merged_df['home_team'].isna().sum(), home_merged_df['Home_Tm'].isna().sum()

(26, 14)

So there are 26 rows (13 games, with two starting-pitcher rows each) that appear in the pitching data but not in `mlb_games_df`, and 14 games that appear in `mlb_games_df` but not in the pitching data.

In [113]:
home_merged_df[home_merged_df['Home_Tm'].isna()].head()

Unnamed: 0,date,Date,home_team,Home_Tm,away_team,Tm
48001,2010-06-25,NaT,TOR,,PHI,
48015,2010-06-26,NaT,TOR,,PHI,
48069,2010-06-27,NaT,TOR,,PHI,
53131,2011-06-24,NaT,MIA,,SEA,
53162,2011-06-25,NaT,MIA,,SEA,


It looks like the home and away teams are swapped on some of them. For example, on 2010-06-25 `mlb_games_df` shows TOR (home) vs PHI (away), but `pitchers_games_df` has PHI as the home team and TOR as the away team. [The correct one](https://www.baseball-reference.com/boxes/PHI/PHI201006250.shtml) is Phillies at home, so `mlb_games_df` is wrong. Since I don't know whether the stats for these games were computed correctly, I'll ignore the problem for now. Dropping them also wouldn't be a big deal, as it would only remove a small number of games spread across several seasons.
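To list candidates for swapped home/away teams rather than eyeballing them, one option is to merge the *away* team of one table against the home team of the other. This is only a sketch on made-up rows; a hit means the pitching data has that game hosted by the team `mlb_games_df` calls the away team, which still needs checking against a box score.

```python
import pandas as pd

games = pd.DataFrame({
    'date': pd.to_datetime(['2010-06-25', '2010-06-26']),
    'home_team': ['TOR', 'SEA'],
    'away_team': ['PHI', 'OAK'],
})
pitchers = pd.DataFrame({
    'Date': pd.to_datetime(['2010-06-25', '2010-06-26']),
    'Home_Tm': ['PHI', 'SEA'],
})

# Match the away team in `games` against the home team in `pitchers`
swapped = games.merge(pitchers, left_on=['date', 'away_team'],
                      right_on=['Date', 'Home_Tm'], how='inner')
swapped = swapped[['date', 'home_team', 'away_team']].drop_duplicates()
print(swapped)
```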

In [119]:
home_merged_df[home_merged_df['home_team'].isna()][['Home_Tm', 'Date']].value_counts()

Home_Tm  Date      
TBA      2015-05-03    2
         2015-05-02    2
         2015-05-01    2
SEA      2011-06-26    2
         2011-06-25    2
         2011-06-24    2
PHI      2010-06-27    2
         2010-06-26    2
         2010-06-25    2
MIL      2017-09-17    2
         2017-09-16    2
         2017-09-15    2
DET      2019-05-19    2
dtype: int64

In [120]:
home_merged_df[home_merged_df['Home_Tm'].isna()][['home_team', 'date']].value_counts()

home_team  date      
TOR        2010-06-27    1
           2010-06-26    1
           2010-06-25    1
MIA        2017-09-17    1
           2017-09-16    1
           2017-09-15    1
           2011-06-26    1
           2011-06-25    1
           2011-06-24    1
DET        2019-09-06    1
CIN        2013-07-23    1
BAL        2015-05-03    1
           2015-05-02    1
           2015-05-01    1
dtype: int64

## Teams

Making sure team names are consistent throughout all files.

In [121]:
team_stats_df.head()

Unnamed: 0,Year,Team,Wins,Losses,W-L-pct,Avg_Attendance,Ghome,ERA,W,E,...,SOA,WCWin,HBP,BBA,L,FP,HR,PPF,CS,HRA
0,2000.0,LAN,86.0,76.0,0.530864,35699.94375,81.0,4.1,86,135,...,1154,N,51.0,600,76,0.978,211,94,42.0,176
1,2001.0,LAN,86.0,76.0,0.530864,35078.895062,81.0,4.25,86,116,...,1212,N,56.0,524,76,0.981,206,91,42.0,184
2,2002.0,LAN,92.0,70.0,0.567901,34668.956522,81.0,3.69,92,90,...,1132,N,53.0,555,70,0.985,155,92,37.0,165
3,2003.0,LAN,85.0,77.0,0.524691,33715.166667,81.0,3.16,85,119,...,1289,N,72.0,526,77,0.981,124,94,36.0,127
4,2004.0,LAN,93.0,69.0,0.574074,37452.246914,81.0,4.01,93,73,...,1066,N,62.0,521,69,0.988,203,95,41.0,178


In [122]:
team_pitching_df.head()

Unnamed: 0,Team,W,L,ERA,G,GS,CG,ShO,SV,HLD,...,ER,HR,BB,IBB,HBP,WP,BK,SO,Year,WHIP
0,ATL,95,67,4.06,538,162,13,6,53,,...,649,165,484,52,37,23,6,1093,2000,1.327686
1,TEX,71,91,5.52,577,162,3,0,39,,...,876,202,661,40,63,40,6,918,2000,1.640308
2,KCA,77,85,5.48,491,162,10,3,29,,...,877,239,693,35,42,77,5,927,2000,1.582934
3,HOU,72,90,5.42,572,162,8,1,30,,...,865,234,598,25,60,55,3,1064,2000,1.526579
4,BAL,74,88,5.37,558,162,14,2,33,,...,855,202,665,32,36,51,1,1017,2000,1.543507


In [123]:
team_stats_df['Team'] = team_stats_df['Team'].apply(uniform_name)
team_pitching_df['Team'] = team_pitching_df['Team'].apply(uniform_name)

In [124]:
assert set(team_stats_df['Team'].unique()) == set(team_pitching_df['Team'].unique())

In [125]:
assert set(team_pitching_df['Team'].unique()) == set(mlb_games_df['home_team'].unique())

In [127]:
team_stats_df.to_csv('../data/team_stats.csv', index=False)
team_pitching_df.to_csv('../data/team_pitching_stats.csv', index=False)

## Pitchers

Are the same pitchers present in all datasets?

In [126]:
mlb_games_df.head()

Unnamed: 0,date,Y,M,D,home_team,away_team,home_win,home_pitcher,away_pitcher,home_elo,...,avg_diff,obp_diff,slg_diff,avg_pct_diff,obp_pct_diff,slg_pct_diff,home_rest,away_rest,away_team_season_game_num,home_team_season_game_num
0,2001-04-01,2001,4.0,1.0,TOR,TEX,1.0,loaizes01,helliri01,1499.563,...,-0.00806,-0.010103,0.023271,-2.947374,-2.977845,4.989568,5.0,5.0,0,0
1,2001-04-02,2001,4.0,2.0,SEA,OAK,1.0,garcifr03,hudsoti01,1519.464,...,-0.000864,0.00119,-0.016229,-0.323318,0.331871,-3.70521,5.0,5.0,0,0
2,2001-04-02,2001,4.0,2.0,NYA,KCA,1.0,clemero02,suppaje01,1529.511,...,-0.010188,0.006929,0.024787,-3.703559,1.970596,5.554343,5.0,5.0,0,0
3,2001-04-02,2001,4.0,2.0,CIN,ATL,0.0,harnipe01,burkejo03,1527.274,...,0.003972,-0.001729,0.020216,1.459194,-0.50696,4.555242,5.0,5.0,0,0
4,2001-04-02,2001,4.0,2.0,CHN,WAS,0.0,liebejo01,vazquja01,1462.51,...,-0.010158,0.009335,-0.018992,-3.99634,2.80356,-4.646432,5.0,5.0,0,0


In [129]:
pitchers_games_df.head()

Unnamed: 0,Gcar,Gtm,Date,Tm,Opp,Inngs,Dec,DR,IP,H,...,Home_Tm,WHIP,Result,Tm_Score,Opp_Score,name,DFS(DK),DFS(FD),Year,season_game
0,498,57,2000-06-05,CHC,CHC,GS-5,,5,4.2,5,...,CHN,1.904762,L,3,4,morgami01,,,2000,21.0
1,501,64,2000-06-13,LAD,LAD,GS-5,L(1-1),2,4.2,8,...,LAN,2.380952,L,1,6,morgami01,,,2000,24.0
2,506,79,2000-06-30,CIN,CIN,GS-5,L(3-2),2,5.0,8,...,ARI,2.0,L,4,5,morgami01,,,2000,29.0
3,512,97,2000-07-21,ARI,CIN,GS-5,,2,5.0,10,...,CIN,2.4,W,5,4,morgami01,,,2000,35.0
4,541,11,2001-04-14,ARI,COL,GS-4,,2,4.0,8,...,COL,2.0,L,8,9,morgami01,,,2001,4.0


In [130]:
mlb_pitchers = set(mlb_games_df['home_pitcher'].unique()).union(set(mlb_games_df['away_pitcher'].unique()))
games_pitchers = set(pitchers_games_df['name'].unique())

In [131]:
mlb_pitchers == games_pitchers

False

In [134]:
len(mlb_pitchers - games_pitchers)

280

In [135]:
len(games_pitchers - mlb_pitchers)

364

There is a known issue: in `pitchers_games.csv` I had accidentally stripped the letters "c", "s" and "v" from the start of names. How many pitchers in that data set would be fixed by prepending the correct letter?

In [142]:
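fixed_pitchers = dict()
no_match = []

for p in games_pitchers:
    # Reset for every pitcher. Without this, `fixed` stays True after the
    # first match and `no_match` never fills up.
    fixed = False
    for mlb_p in mlb_pitchers:
        # Also matches names that are already identical
        if mlb_p.endswith(p):
            fixed_pitchers[p] = mlb_p
            fixed = True
            break
    if not fixed:
        no_match.append(p)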

In [143]:
fixed_pitchers

{'walrole01': 'walrole01',
 'abreed01': 'cabreed01',
 'abatc.01': 'sabatc.01',
 'diazmi02': 'diazmi02',
 'adzeco01': 'sadzeco01',
 'ahiltr01': 'cahiltr01',
 'broadla01': 'broadla01',
 'oleral01': 'soleral01',
 'foppeje01': 'foppeje01',
 'eovalna01': 'eovalna01',
 'tsaoch01': 'tsaoch01',
 'duplajo01': 'duplajo01',
 'bolanro01': 'bolanro01',
 'arpech01': 'carpech01',
 'weavelu01': 'weavelu01',
 'burnea.01': 'burnea.01',
 'queveru01': 'queveru01',
 'augenbr01': 'augenbr01',
 'nippedu01': 'nippedu01',
 'fistedo01': 'fistedo01',
 'hollade01': 'hollade01',
 'runzlda01': 'runzlda01',
 'ampsad01': 'sampsad01',
 'redmama01': 'redmama01',
 'mayermi01': 'mayermi01',
 'astrmi01': 'castrmi01',
 'mchugco01': 'mchugco01',
 'phillja03': 'phillja03',
 'durapmo01': 'durapmo01',
 'illoro01': 'villoro01',
 'borucry01': 'borucry01',
 'tepesni01': 'tepesni01',
 'locubr01': 'slocubr01',
 'lopezwi01': 'lopezwi01',
 'hernaru03': 'hernaru03',
 'tollbbr01': 'tollbbr01',
 'burrebr01': 'burrebr01',
 'myettaa01': '

In [144]:
no_match

[]

I'll use `replace` to apply this mapping to `pitchers_games_df`, which fixes the truncated names.
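The matching above accepts any `mlb_games_df` ID that ends with the truncated name. A stricter variant, sketched below, only prepends one of the three known letters and sets aside names with more than one candidate for a manual look. `build_prefix_fixes` is a made-up helper, and the last ID pair in the example is invented to show the ambiguous case.

```python
def build_prefix_fixes(truncated_names, full_names, letters=('c', 's', 'v')):
    """Map names that lost one leading letter to their full form.

    Returns (fixes, ambiguous): names with exactly one candidate, and names
    with several candidates that need a manual look.
    """
    full = set(full_names)
    fixes, ambiguous = {}, {}
    for name in truncated_names:
        if name in full:
            continue  # already a valid ID, nothing to repair
        candidates = [letter + name for letter in letters
                      if letter + name in full]
        if len(candidates) == 1:
            fixes[name] = candidates[0]
        elif candidates:
            ambiguous[name] = candidates
    return fixes, ambiguous


# IDs from the output above, plus one invented ambiguous case
fixes, ambiguous = build_prefix_fixes(
    ['abreed01', 'illoro01', 'diazmi02', 'mithjo01'],
    ['cabreed01', 'villoro01', 'diazmi02', 'smithjo01', 'cmithjo01'],
)
print(fixes)
print(ambiguous)
```

The resulting `fixes` dict can be passed to `Series.replace` in the same way as `fixed_pitchers`.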

In [149]:
pitchers_games_df['name'] = pitchers_games_df['name'].replace(fixed_pitchers)

In [150]:
pitchers_games_df['name'].isna().sum()

0

In [151]:
games_pitchers = set(pitchers_games_df['name'].unique())

In [152]:
len(mlb_pitchers - games_pitchers)

1

In [153]:
len(games_pitchers - mlb_pitchers)

84

Much better! Let's look into both now.

In [154]:
mlb_pitchers - games_pitchers

{'cookaa01'}

In [157]:
list(games_pitchers - mlb_pitchers)[:10]

['johnsma05',
 'andrecl01',
 'hillke01',
 'ikorbr01',
 'blacktr01',
 'turnbde01',
 'yarnaed01',
 'adamsau02',
 'grosski01',
 'delacjo01']

In [161]:
pitchers_games_df[pitchers_games_df['name'].isin(list(games_pitchers - mlb_pitchers))]['Date'].dt.year.value_counts()

2000    528
2012     15
2016      9
2004      6
2019      5
2013      3
2007      2
2018      1
2017      1
Name: Date, dtype: int64

So it's almost entirely pitchers who only played in the year 2000, which is missing in `mlb_games_df`, so that's not surprising. Let's look at just those after the year 2000.

In [165]:
unmatched_mask = pitchers_games_df['name'].isin(list(games_pitchers - mlb_pitchers))
pitchers_games_df[unmatched_mask & (pitchers_games_df['Date'] > '2001-01-01')].head(1)

Unnamed: 0,Gcar,Gtm,Date,Tm,Opp,Inngs,Dec,DR,IP,H,...,Home_Tm,WHIP,Result,Tm_Score,Opp_Score,name,DFS(DK),DFS(FD),Year,season_game
3698,235,22,2019-04-21,SEA,SEA,GS-1,,1,1.0,0,...,ANA,0.0,W,8,6,robleha01,4.25,6.0,2019,11.0


In [164]:
mlb_games_df[(mlb_games_df['date'] == '2019-04-21') & (mlb_games_df['home_team'] == 'ANA')]

Unnamed: 0,date,Y,M,D,home_team,away_team,home_win,home_pitcher,away_pitcher,home_elo,...,avg_diff,obp_diff,slg_diff,avg_pct_diff,obp_pct_diff,slg_pct_diff,home_rest,away_rest,away_team_season_game_num,home_team_season_game_num
44051,2019-04-21,2019,4.0,21.0,ANA,SEA,1.0,barrija01,leakemi01,1497.622483,...,0.05728,0.032478,0.016462,20.12987,8.907355,3.265796,1.0,1.0,12,10


So `mlb_games_df` lists completely different starting pitchers for this game. Again, [pitchers_games_df is the correct one](https://www.baseball-reference.com/boxes/ANA/ANA201904210.shtml). I'll just leave them though, as it's only about 30 games. Let's update the starting pitchers and resave everything.
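To find every game where the two data sets disagree on the starter, rather than checking one row by hand, the pitcher rows could be merged onto the games and compared against both starter columns. A sketch using the 2019-04-21 game from above; the second pitcher row is made up so that one row agrees and one does not.

```python
import pandas as pd

games = pd.DataFrame({
    'date': pd.to_datetime(['2019-04-21']),
    'home_team': ['ANA'],
    'home_pitcher': ['barrija01'],
    'away_pitcher': ['leakemi01'],
})
starts = pd.DataFrame({
    'Date': pd.to_datetime(['2019-04-21', '2019-04-21']),
    'Home_Tm': ['ANA', 'ANA'],
    'name': ['robleha01', 'leakemi01'],  # second row is made up
})

merged = starts.merge(games, left_on=['Date', 'Home_Tm'],
                      right_on=['date', 'home_team'], how='inner')
# A start "agrees" if the pitcher is listed as either starter for that game
agrees = ((merged['name'] == merged['home_pitcher'])
          | (merged['name'] == merged['away_pitcher']))
disagreements = merged.loc[~agrees, ['Date', 'Home_Tm', 'name']]
print(disagreements)
```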

In [166]:
pitchers_games_df.to_csv('../data/pitchers_games.csv', index=False)

In [177]:
starting_pitchers_df = pitchers_games_df[(pitchers_games_df['Inngs'].str.startswith('GS')) | (pitchers_games_df['Inngs'].str.startswith('CG'))]

In [178]:
starting_pitchers_df.head()

Unnamed: 0,Gcar,Gtm,Date,Tm,Opp,Inngs,Dec,DR,IP,H,...,Home_Tm,WHIP,Result,Tm_Score,Opp_Score,name,DFS(DK),DFS(FD),Year,season_game
0,498,57,2000-06-05,CHC,CHC,GS-5,,5,4.2,5,...,CHN,1.904762,L,3,4,morgami01,,,2000,1.0
1,501,64,2000-06-13,LAD,LAD,GS-5,L(1-1),2,4.2,8,...,LAN,2.380952,L,1,6,morgami01,,,2000,2.0
2,506,79,2000-06-30,CIN,CIN,GS-5,L(3-2),2,5.0,8,...,ARI,2.0,L,4,5,morgami01,,,2000,3.0
3,512,97,2000-07-21,ARI,CIN,GS-5,,2,5.0,10,...,CIN,2.4,W,5,4,morgami01,,,2000,4.0
4,541,11,2001-04-14,ARI,COL,GS-4,,2,4.0,8,...,COL,2.0,L,8,9,morgami01,,,2001,1.0


In [179]:
starting_pitchers_df.shape[0]

97955

In [180]:
starting_pitchers_df.to_csv('../data/starting_pitchers_games.csv', index=False)