# This notebook prepares the primary data sets and visits some tips and techniques along the way

* Full data pulled from retrosheet.org
* Chadwick tools used for converting retrosheet data http://chadwick.sourceforge.net/doc/cwtools.html

We'll be using Retrosheet baseball data for our examples in this discussion. At the center of the work are the Retrosheet 'Event Files' from 2017. Each row in an event file describes some kind of action on the field: any time the game situation differs from the previous pitch, one or more events are created. The events are rich in data. They indicate the game, the inning, the batter, the type of hit, the direction of the hit, the number of runners on base, and so on. We'll take a look at the event codes a bit later to get an impression of the depth of information. There are 191,196 events recorded for 2017. There were 81\*30, or 2,430, scheduled games in 2017. This yields, on average, about 79 events per game.
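The events-per-game figure can be checked with a quick calculation. This is only a sketch: the event count is the row count that `df_events.shape` reports later in this notebook, and the variable names are illustrative.

```python
# Row count reported by df_events.shape later in this notebook
events = 191196

# 30 teams, each hosting 81 home games
games = 81 * 30

events_per_game = events / games
print(games, round(events_per_game, 1))
```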

In [None]:
import pandas as pd
import os
import matplotlib

# % and %% are IPython 'magics'
# ! is ipython's shell execution shortcut

%history

In [None]:
!ls data_public/*.EV*

In [None]:
%%sh
head -3 data_public/2017CHA.EVA
echo ""
echo 'data_public/2017CHA.EVA'
echo ""
sort -k2 -t, data_public/2017CHA.EVA |head -5

#### Useful techniques for interacting with the shell

* <b>Use assignment to capture the output of your ! command</b>

In [None]:
files = !ls

In [None]:
type(files)

In [None]:
files.grep(r'\.i.*')  # raw string so the backslash reaches the regex unchanged

In [None]:
files.p
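To make the filtering step concrete without a shell, here is a rough plain-Python approximation of what `files.grep` does to the captured listing. The file names are hypothetical, and the case-insensitive `re.search` is an assumption about how the captured list matches patterns, not a statement of its exact implementation.

```python
import re

# Hypothetical listing standing in for the result of `files = !ls`
files = ["notes.ipynb", "data_public", "README.md", "plot.ipynb"]

# Keep every entry in which the pattern is found anywhere in the name
matches = [name for name in files if re.search(r"\.i.*", name, re.IGNORECASE)]
print(matches)
```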

## Passing Python output to the shell
#### Here we'll do the opposite, which is the more powerful technique
* <b>We'll use {expression} to pass values from IPython to the shell</b>

In [None]:
extension = 'ipynbb'

In [None]:
!ls *.{extension}

* Hmm... Extra trailing 'b'. Let's take a slice of the extension string

In [None]:
!ls *.{extension[0:-1]}
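The expression inside the braces is ordinary Python that is evaluated before the command line reaches the shell. The sketch below imitates that in plain Python: it builds the same glob pattern with an f-string and applies it to a hypothetical list of file names using the standard-library `fnmatch` module instead of `ls`.

```python
import fnmatch

extension = "ipynbb"              # deliberately carries the extra trailing 'b'
pattern = f"*.{extension[0:-1]}"  # same slice as in the cell above

# Hypothetical file names standing in for the working directory
names = ["prep.ipynb", "notes.txt", "plots.ipynb"]
matched = fnmatch.filter(names, pattern)
print(pattern, matched)
```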

### This shows us a few techniques and examples. <br> In the following section, we'll use these techniques to bring in the data set that we'll use for our analysis.

### Here we're preparing the file.  Run the cwevent executable with -n and capture the header

In [None]:
#Chadwick expects a 'team' file in the cwd
!ln -s ./data_public/TEAM2017 team

In [None]:
! cwevent -n data_public/2017SEA.EVA |head -1 >data_public/atbats.txt

### Now we'll shell out and run a loop to invoke the converter on each event file.  We'll also concatenate the roster files in a separate command

In [None]:
%%sh
for x in data_public/*.EV*; do cwevent "$x" >>data_public/atbats.txt; done
cat data_public/*.ROS >data_public/rosters.txt
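For readers who would rather stay in Python, the roster concatenation can be sketched with `pathlib`. The example runs in a temporary directory with two made-up roster files; the file names and their contents are placeholders, not real Retrosheet data.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    # Hypothetical stand-ins for two roster files
    (data_dir / "AAA2017.ROS").write_text("playa001,Alpha,Al\n")
    (data_dir / "BBB2017.ROS").write_text("playb001,Beta,Bo\n")

    # Equivalent of `cat *.ROS > rosters.txt`, in sorted file order
    out_path = data_dir / "rosters.txt"
    with out_path.open("w") as out:
        for ros in sorted(data_dir.glob("*.ROS")):
            out.write(ros.read_text())

    combined = out_path.read_text().splitlines()

print(combined)
```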

### Standard Python file to dictionary
* Constants for event codes

In [None]:
event_code = {}
with open("data_public/event_codes.txt") as f:
    for line in f:
        val, key = line.split()
        event_code[key] = int(val)

In [101]:
event_code

{'UNK': 0,
 'NONE': 1,
 'GENERIC_OUT': 2,
 'K': 3,
 'SB': 4,
 'DEF_INDIFFERENCE': 5,
 'SB_CAUGHT': 6,
 'ERROR_PICKOFF': 7,
 'PICKOFF': 8,
 'WP': 9,
 'PB': 10,
 'BK': 11,
 'OTHER_ADVANCE': 12,
 'ERROR_FOUL': 13,
 'BB': 14,
 'IBB': 15,
 'HBP': 16,
 'INTERFERENCE': 17,
 'ERROR': 18,
 'FC': 19,
 'SINGLE': 20,
 'DOUBLE': 21,
 'TRIPLE': 22,
 'HR': 23,
 'MISSING': 24}
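Since the data frame stores numeric codes, it is often handy to map a code back to its name. The sketch below parses a small in-memory excerpt and inverts the dictionary; the excerpt assumes the "number, then name" layout implied by the loading loop, and `code_name` is an illustrative name.

```python
# Hypothetical excerpt in the "<number> <NAME>" layout the loading loop expects
sample = """20 SINGLE
21 DOUBLE
22 TRIPLE
23 HR"""

event_code = {}
for line in sample.splitlines():
    val, key = line.split()
    event_code[key] = int(val)

# Invert the mapping to translate numeric codes back to names
code_name = {val: key for key, val in event_code.items()}
print(event_code["HR"], code_name[23])
```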

In [None]:
df_events=pd.read_csv('./data_public/atbats.txt')
df_players=pd.read_csv('./data_public/players.txt')

### Here are a few Pandas tools for getting an overview of a dataframe

In [87]:
df_events.shape

(191196, 36)

In [88]:
df_events.columns

Index(['GAME_ID', 'AWAY_TEAM_ID', 'INN_CT', 'BAT_HOME_ID', 'OUTS_CT',
       'BALLS_CT', 'STRIKES_CT', 'AWAY_SCORE_CT', 'HOME_SCORE_CT',
       'RESP_BAT_ID', 'RESP_BAT_HAND_CD', 'RESP_PIT_ID', 'RESP_PIT_HAND_CD',
       'BASE1_RUN_ID', 'BASE2_RUN_ID', 'BASE3_RUN_ID', 'EVENT_TX',
       'LEADOFF_FL', 'PH_FL', 'BAT_FLD_CD', 'BAT_LINEUP_ID', 'EVENT_CD',
       'BAT_EVENT_FL', 'AB_FL', 'H_CD', 'SH_FL', 'SF_FL', 'EVENT_OUTS_CT',
       'RBI_CT', 'WP_FL', 'PB_FL', 'ERR_CT', 'BAT_DEST_ID', 'RUN1_DEST_ID',
       'RUN2_DEST_ID', 'RUN3_DEST_ID'],
      dtype='object')

In [89]:
df_events.head()

Unnamed: 0,GAME_ID,AWAY_TEAM_ID,INN_CT,BAT_HOME_ID,OUTS_CT,BALLS_CT,STRIKES_CT,AWAY_SCORE_CT,HOME_SCORE_CT,RESP_BAT_ID,...,SF_FL,EVENT_OUTS_CT,RBI_CT,WP_FL,PB_FL,ERR_CT,BAT_DEST_ID,RUN1_DEST_ID,RUN2_DEST_ID,RUN3_DEST_ID
0,ANA201704070,SEA,1,0,0,3,2,0,0,seguj002,...,F,0,0,F,F,0,1,0,0,0
1,ANA201704070,SEA,1,0,0,1,2,0,0,hanim001,...,F,1,0,F,F,0,0,1,0,0
2,ANA201704070,SEA,1,0,1,1,1,0,0,canor001,...,F,1,0,F,F,0,0,1,0,0
3,ANA201704070,SEA,1,0,2,0,1,0,0,cruzn002,...,F,0,0,F,F,0,0,2,0,0
4,ANA201704070,SEA,1,0,2,2,2,0,0,cruzn002,...,F,1,0,F,F,0,0,0,2,0


In [90]:
df_events.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 191196 entries, 0 to 191195
Data columns (total 36 columns):
GAME_ID             191196 non-null object
AWAY_TEAM_ID        191196 non-null object
INN_CT              191196 non-null int64
BAT_HOME_ID         191196 non-null int64
OUTS_CT             191196 non-null int64
BALLS_CT            191196 non-null int64
STRIKES_CT          191196 non-null int64
AWAY_SCORE_CT       191196 non-null int64
HOME_SCORE_CT       191196 non-null int64
RESP_BAT_ID         191196 non-null object
RESP_BAT_HAND_CD    191196 non-null object
RESP_PIT_ID         191196 non-null object
RESP_PIT_HAND_CD    191196 non-null object
BASE1_RUN_ID        61225 non-null object
BASE2_RUN_ID        37543 non-null object
BASE3_RUN_ID        19364 non-null object
EVENT_TX            191196 non-null object
LEADOFF_FL          191196 non-null object
PH_FL               191196 non-null object
BAT_FLD_CD          191196 non-null int64
BAT_LINEUP_ID       191196 non-null int64

### Now we'll look at a few pandas techniques
* First we'll restrict the dataframe to a single column
* Next we'll restrict the dataframe to a set of columns
* Third we'll break down the contents of a column
* Fourth we'll use value_counts() to get a summary

In [94]:
df_events['GAME_ID'].head()

0    ANA201704070
1    ANA201704070
2    ANA201704070
3    ANA201704070
4    ANA201704070
Name: GAME_ID, dtype: object

* Notice that we index with dataframe[] and pass a Python list of the columns, ['item1','item2',...], which is why the brackets are doubled

In [104]:
df_events[['GAME_ID','AWAY_TEAM_ID','BALLS_CT','RESP_BAT_ID','OUTS_CT','EVENT_CD']].head()

Unnamed: 0,GAME_ID,AWAY_TEAM_ID,BALLS_CT,RESP_BAT_ID,OUTS_CT,EVENT_CD
0,ANA201704070,SEA,3,seguj002,0,14
1,ANA201704070,SEA,1,hanim001,0,3
2,ANA201704070,SEA,1,canor001,1,2
3,ANA201704070,SEA,0,cruzn002,2,4
4,ANA201704070,SEA,2,cruzn002,2,3
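The difference between single and double brackets shows up in the type that comes back. This is a minimal sketch on a tiny made-up frame that borrows a few of the column names above.

```python
import pandas as pd

# Tiny hypothetical frame reusing a few of the column names above
df = pd.DataFrame({
    "GAME_ID": ["ANA201704070", "ANA201704070"],
    "BALLS_CT": [3, 1],
    "EVENT_CD": [14, 3],
})

one_col = df["GAME_ID"]                 # single label -> Series
two_cols = df[["GAME_ID", "EVENT_CD"]]  # list of labels -> DataFrame
print(type(one_col).__name__, type(two_cols).__name__)
```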


In [93]:
df_events['GAME_ID'].str[0:3].head()

0    ANA
1    ANA
2    ANA
3    ANA
4    ANA
Name: GAME_ID, dtype: object

### Here we apply the series.value_counts( ) method to return a series of counts

In [108]:
df_events['GAME_ID'].str[0:3].value_counts()

DET    6547
BOS    6542
TEX    6530
MIN    6525
CHN    6483
BAL    6471
ATL    6440
ARI    6439
OAK    6435
SFN    6433
CIN    6433
MIL    6426
COL    6420
PIT    6411
PHI    6407
MIA    6403
WAS    6370
NYN    6369
CHA    6336
SLN    6327
SEA    6325
NYA    6320
HOU    6310
TBA    6290
TOR    6248
KCA    6244
ANA    6230
CLE    6185
LAN    6169
SDN    6128
Name: GAME_ID, dtype: int64
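The same two steps, slicing with `.str` and then counting, can be seen on a small scale. The game IDs below are illustrative values in the same layout, where the first three characters are the home team.

```python
import pandas as pd

# Illustrative game IDs: home team code followed by a date and game number
game_ids = pd.Series(["ANA201704070", "ANA201704080", "BOS201704050"])

home_team = game_ids.str[0:3]      # element-wise string slice
counts = home_team.value_counts()  # Series indexed by team, sorted by count
print(counts)
```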

### Let's look at the data quickly to get a bit of a better idea of what's in the files

In [104]:
df_events[['GAME_ID','AWAY_TEAM_ID','BALLS_CT','RESP_BAT_ID','OUTS_CT','EVENT_CD']].head()

Unnamed: 0,GAME_ID,AWAY_TEAM_ID,BALLS_CT,RESP_BAT_ID,OUTS_CT,EVENT_CD
0,ANA201704070,SEA,3,seguj002,0,14
1,ANA201704070,SEA,1,hanim001,0,3
2,ANA201704070,SEA,1,canor001,1,2
3,ANA201704070,SEA,0,cruzn002,2,4
4,ANA201704070,SEA,2,cruzn002,2,3


* Note above that there are two consecutive events for the same batter (cruzn002).  We'll use the pandas positional indexer (.iloc) to pull one row of the dataframe out as a series whose index is the column names and whose values are the data from that row. Event codes here are 14=walk, 3=strikeout (K), 2=generic out and 4=stolen base

In [105]:
df_events.iloc[3].loc['EVENT_CD']

4
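As a small illustration of what `.iloc` hands back, the sketch below selects one row from a made-up two-row frame and inspects it. The frame is hypothetical; only the column names and the first two batter IDs come from the output above.

```python
import pandas as pd

df = pd.DataFrame({
    "RESP_BAT_ID": ["seguj002", "hanim001"],
    "EVENT_CD": [14, 3],
})

row = df.iloc[1]                   # second row by position, returned as a Series
event_cd = row.loc["EVENT_CD"]     # label lookup on that Series
print(type(row).__name__, list(row.index), event_cd)
```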

In [None]:
type(df_events['AWAY_TEAM_ID'].value_counts())

In [None]:
df_events['AWAY_TEAM_ID'].value_counts().index

#### Next we'll build boolean filters that select events by event code and by team, then combine them.

In [None]:
flt_homers = df_events['EVENT_CD'] == 23
flt_redsox = (df_events['GAME_ID'].str.startswith('BOS')) | (df_events['AWAY_TEAM_ID'] == 'BOS')
flt_yankees = (df_events['GAME_ID'].str.startswith('NYA')) | (df_events['AWAY_TEAM_ID'] == 'NYA')

In [None]:
filters = {}
for team in df_events['AWAY_TEAM_ID'].value_counts().index:
   filters[team] = (df_events['GAME_ID'].str.startswith(team)) | (df_events['AWAY_TEAM_ID'] == team)
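Each filter is a boolean Series aligned with the frame, so filters combine with `&` (and), `|` (or) and `~` (not). The sketch below applies the same idea to four made-up rows; note that a home run in a Boston game may have been hit by either club, so these masks select games, not batting teams.

```python
import pandas as pd

# Hypothetical rows in the same layout as df_events
df = pd.DataFrame({
    "GAME_ID": ["BOS201704030", "NYA201704100", "BOS201704260", "NYA201706060"],
    "AWAY_TEAM_ID": ["PIT", "TBA", "NYA", "BOS"],
    "EVENT_CD": [23, 20, 23, 3],
})

is_homer = df["EVENT_CD"] == 23
at_home = df["GAME_ID"].str.startswith("BOS")
involves_bos = at_home | (df["AWAY_TEAM_ID"] == "BOS")

bos_homers = df[involves_bos & is_homer]  # homers in any Boston game
bos_away = df[involves_bos & ~at_home]    # Boston games played on the road
print(len(bos_homers), list(bos_away.index))
```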

In [None]:
df_events[(filters['BOS']) & (~df_events['GAME_ID'].str.startswith('BOS'))]['EVENT_CD'].value_counts()[13]

In [None]:
df_events[(filters['BOS']) & (df_events['EVENT_CD'] == 13)].shape

In [None]:
df_events.columns

In [None]:
len(df_events.GAME_ID.value_counts())

In [None]:
df_events[flt_redsox]['GAME_ID'].shape

In [None]:
df_events[flt_redsox & flt_homers]['BAT_HOME_ID'].value_counts()

In [None]:
for team in filters:  # avoid shadowing the built-in filter()
    print(team)
    print(df_events[filters[team] & flt_homers]['BAT_HOME_ID'].value_counts())

In [None]:
df_events['GAME_ID'].str.startswith('BOS').value_counts()

In [None]:
grp_teams_homers = df_events.groupby([df_events['GAME_ID'].str[0:3], df_events['AWAY_TEAM_ID'], df_events['EVENT_CD']==23])

In [None]:
grp_teams_homers['EVENT_CD'].count()
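Grouping by several keys at once, including a boolean Series, produces one count per combination of key values. Here is a minimal sketch on made-up rows; the `rename` calls and the names `HOME_TEAM_ID` and `IS_HR` are illustrative additions that only label the index levels.

```python
import pandas as pd

# Hypothetical rows in the same layout as df_events
df = pd.DataFrame({
    "GAME_ID": ["BOS201704030", "BOS201704030", "BOS201704030", "NYA201704100"],
    "AWAY_TEAM_ID": ["PIT", "PIT", "PIT", "TBA"],
    "EVENT_CD": [23, 20, 23, 3],
})

home = df["GAME_ID"].str[0:3].rename("HOME_TEAM_ID")
is_homer = (df["EVENT_CD"] == 23).rename("IS_HR")

# One count per (home team, away team, homer or not) combination
counts = df.groupby([home, df["AWAY_TEAM_ID"], is_homer])["EVENT_CD"].count()
print(counts)
```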

In [None]:
grp_away = df_events.groupby(df_events['AWAY_TEAM_ID'])
grp_home = df_events.groupby(df_events['GAME_ID'].str[0:3])

In [None]:
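# Count home-run events per away-team ID (covers both clubs' homers in those games)
df_events[df_events['EVENT_CD'] == 23].groupby('AWAY_TEAM_ID')['EVENT_CD'].count()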

In [None]:
df_events[df_events['EVENT_CD'] == 23]['EVENT_CD']

In [None]:
df_events[df_events['EVENT_CD'] == 23]['RESP_BAT_ID'].value_counts()

In [None]:
%matplotlib inline

In [None]:
grp_home['EVENT_CD'].head()

In [None]:
df_events[filters['NYA']]

In [None]:
grp_hit_type=df_events.groupby('EVENT_CD')

In [None]:
grp_hit_type.describe()

In [None]:
for grpname,grprec in grp_hit_type:
    print(grpname)
    print(grprec)

In [None]:
df_events[df_events['EVENT_CD'] == 23].groupby('EVENT_CD')

In [None]:
df_events[df_events['EVENT_CD'] == 23].groupby('BAT_HOME_ID').min()

In [None]:
# Assumption: var holds captured shell output, as with `files = !ls` earlier
var = !ls
for f in var:
    print(f)
    

In [None]:
type(var)

In [None]:
var.n

In [None]:
type(var.p)

In [None]:
var.grep('^r.*')

In [None]:
flt = '*.ip*'

In [None]:
%ls {flt}