# Pitcher summary

This notebook collects summary data on pitchers, which will then be used for later analysis and wrangling.

In [1]:
import pandas as pd
import numpy as np
import pybaseball as pyb
import requests
import string
import time

from bs4 import BeautifulSoup

## Pitcher summary

A summary of all pitchers. I manually downloaded the data from FanGraphs season by season and joined the files together to get `fg_pitchers.csv`. Technically `pybaseball` can do something similar, but when I tried it, it only pulled a subset of the pitchers, so this approach seems more reliable.
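
The join step itself isn't shown in this notebook. A minimal sketch of one way to do it, assuming each season's export has already been loaded into its own DataFrame; the `combine_seasons` helper and the toy rows are hypothetical, not the code that produced `fg_pitchers.csv`:

```python
import pandas as pd


def combine_seasons(frames_by_year):
    """Stack per-season DataFrames and tag each row with its season.

    frames_by_year: dict mapping a season (int) to that season's DataFrame.
    """
    tagged = [df.assign(year=year) for year, df in sorted(frames_by_year.items())]
    return pd.concat(tagged, ignore_index=True)


# Toy stand-ins for two season exports (made-up rows, not real FanGraphs data)
season_2000 = pd.DataFrame({'Name': ['Pitcher A'], 'playerid': [1], 'G': [30]})
season_2001 = pd.DataFrame({'Name': ['Pitcher A', 'Pitcher B'], 'playerid': [1, 2], 'G': [28, 5]})

combined = combine_seasons({2001: season_2001, 2000: season_2000})
```

Adding the `year` column at concatenation time is what makes the later `groupby` on first and last season possible.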

In [131]:
fg_pitching_df = pd.read_csv('../data/fg_pitchers.csv')

In [132]:
fg_pitching_df = fg_pitching_df.sort_values('year')

In [133]:
fg_pitching_df.head()

Unnamed: 0,Name,Team,W,L,SV,G,GS,IP,K/9,BB/9,...,LOB%,GB%,HR/FB,EV,ERA,FIP,xFIP,WAR,playerid,year
0,John Frascatore,Blue Jays,2,4,0,60,0,73.0,3.7,4.07,...,70.8%,,,,5.42,6.45,,-1.1,1004304,2000
400,Mike Lincoln,Twins,0,3,0,8,4,20.2,6.53,5.66,...,70.3%,,,,10.89,10.15,,-0.8,1457,2000
401,Alan Mills,- - -,4,1,2,41,0,49.1,6.57,6.39,...,79.6%,,,,5.29,6.3,,-0.7,1008949,2000
402,Rick Aguilera,Cubs,1,2,29,54,0,47.2,7.17,3.4,...,76.5%,,,,4.91,5.92,,-0.7,1000086,2000
403,Chris Fussell,Royals,5,3,0,20,9,70.0,5.91,5.66,...,72.3%,,,,6.3,7.13,,-0.7,1004410,2000


Get each pitcher's first and last seasons, plus the number of games played and games started. Also record all the teams they played for, and how many teams that is.

In [134]:
pitchers_summary_df = fg_pitching_df.groupby(['Name', 'playerid']).agg({'year': ['min', 'max'],
                                                              'G': 'sum', 'GS': 'sum'})
pitchers_summary_df['teams'] = fg_pitching_df.groupby(['Name', 'playerid'])['Team'].unique()
pitchers_summary_df['num_teams'] = fg_pitching_df.groupby(['Name', 'playerid'])['Team'].nunique()
pitchers_summary_df.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,year,year,G,GS,teams,num_teams
Unnamed: 0_level_1,Unnamed: 1_level_1,min,max,sum,sum,Unnamed: 6_level_1,Unnamed: 7_level_1
Name,playerid,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
A.J. Achter,11387,2014,2016,45,0,"[Twins, Angels]",2
A.J. Burnett,512,2000,2015,428,423,"[Marlins, Blue Jays, Yankees, Pirates, Phillies]",5
A.J. Cole,11467,2015,2019,79,19,"[Nationals, - - -, Indians]",3
A.J. Griffin,11132,2012,2017,88,85,"[Athletics, Rangers]",2
A.J. Minter,18655,2017,2019,117,0,[Braves],1
A.J. Morris,9919,2016,2016,7,0,[Reds],1
A.J. Murray,3422,2007,2008,16,4,[Rangers],1
A.J. Puk,19343,2019,2019,10,0,[Athletics],1
A.J. Reed,16246,2019,2019,1,0,[White Sox],1
A.J. Schugel,11432,2015,2017,73,0,"[Diamondbacks, Pirates]",2


In [135]:
pitchers_summary_df.columns = ['first_season', 'last_season', 'games_played', 'games_started', 'teams', 'num_teams']
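
As an alternative to flattening the column MultiIndex by hand, pandas' named aggregation can produce the final column names directly. A sketch on made-up rows; the toy data is hypothetical, the column names follow `fg_pitching_df`, and storing `teams` as a comma-separated string is an assumption made here so the value survives a CSV round trip:

```python
import pandas as pd

# Made-up rows with the same column names as fg_pitching_df
toy_df = pd.DataFrame({
    'Name': ['Pitcher A', 'Pitcher A', 'Pitcher B'],
    'playerid': [1, 1, 2],
    'Team': ['Twins', 'Angels', 'Reds'],
    'year': [2014, 2016, 2016],
    'G': [30, 15, 7],
    'GS': [0, 0, 2],
})

summary = (
    toy_df.groupby(['Name', 'playerid'])
    .agg(
        first_season=('year', 'min'),
        last_season=('year', 'max'),
        games_played=('G', 'sum'),
        games_started=('GS', 'sum'),
        # Join the distinct teams into one string, in order of first appearance
        teams=('Team', lambda s: ','.join(s.unique())),
        num_teams=('Team', 'nunique'),
    )
    .reset_index()
)
```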

In [136]:
pitchers_summary_df = pitchers_summary_df.reset_index()

In [137]:
# Store it as a float because missing values can't exist with ints (important when joining later)
pitchers_summary_df['playerid'] = pitchers_summary_df['playerid'].astype(float)

In [138]:
pitchers_summary_df.head()

Unnamed: 0,Name,playerid,first_season,last_season,games_played,games_started,teams,num_teams
0,A.J. Achter,11387.0,2014,2016,45,0,"[Twins, Angels]",2
1,A.J. Burnett,512.0,2000,2015,428,423,"[Marlins, Blue Jays, Yankees, Pirates, Phillies]",5
2,A.J. Cole,11467.0,2015,2019,79,19,"[Nationals, - - -, Indians]",3
3,A.J. Griffin,11132.0,2012,2017,88,85,"[Athletics, Rangers]",2
4,A.J. Minter,18655.0,2017,2019,117,0,[Braves],1


In [141]:
pitchers_summary_df['playerid'] = pitchers_summary_df['playerid'].astype(int)

Search by name. Note that **this isn't necessary in this case**: FanGraphs provides each player's ID, so we can do a reverse lookup instead, which is much faster. I'm keeping the code in case it's useful in other situations.

In [None]:
# pitcher_keys = []
# ten_pct_inc = int(pitchers_summary_df['Name'].nunique() / 10)

# for i, name in enumerate(pitchers_summary_df['Name'].unique()):
#     # Try to get their first and last name to search for. If the name can't be split
#     # into two parts, record it and move on
#     try:
#         first, last = name.split(' ', 1)
#         if '.' in first:
#             first = first.replace('.', '. ')
#             first = first.rstrip(' ')
#     except Exception as e:
#         row = [name] + [None]*4
#         pitcher_keys.append(row)
#         continue
        
#     # If you get a first and last name, look them up. If this returns more than one player,
#     # record it and move on. If not, get their data and record their keys.
#     pitcher_data = pyb.playerid_lookup(last, first)
#     if pitcher_data.shape[0] > 1:
#         row = [name] + [None]*4
#         pitcher_keys.append(row)
#         continue
#     else:
#         try:
#             row = [name] + list(pitcher_data[['key_mlbam', 'key_retro', 'key_bbref', 'key_fangraphs']].values[0])
#         except Exception as e:
#             row = [name] + [None]*4
#         pitcher_keys.append(row)
        
#     # Sleep for one second to avoid rate limiting
#     time.sleep(1)
    
#     if i % ten_pct_inc == 0:
#         print(f'{10*i/ten_pct_inc}% complete')

In [139]:
player_keys = pyb.playerid_reverse_lookup(pitchers_summary_df['playerid'], key_type='fangraphs')

Gathering player lookup table. This may take a moment.


In [140]:
player_keys.head()

Unnamed: 0,name_last,name_first,key_mlbam,key_retro,key_bbref,key_fangraphs,mlb_played_first,mlb_played_last
0,aardsma,david,430911,aardd001,aardsda01,1902,2004.0,2015.0
1,abad,fernando,472551,abadf001,abadfe01,4994,2010.0,2019.0
2,abbott,paul,110015,abbop001,abbotpa01,1061,1990.0,2004.0
3,abreu,bryan,650556,abreb002,abreubr01,16609,2019.0,2020.0
4,abreu,juan,444874,abrej002,abreuju01,6306,2011.0,2011.0


In [142]:
player_keys = player_keys[['key_mlbam', 'key_retro', 'key_bbref', 'key_fangraphs']]
player_keys.head()

Unnamed: 0,key_mlbam,key_retro,key_bbref,key_fangraphs
0,430911,aardd001,aardsda01,1902
1,472551,abadf001,abadfe01,4994
2,110015,abbop001,abbotpa01,1061
3,650556,abreb002,abreubr01,16609
4,444874,abrej002,abreuju01,6306


In [143]:
pitchers_summary_df = pitchers_summary_df.merge(player_keys, left_on='playerid', right_on='key_fangraphs')
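
Because `merge` defaults to an inner join, any pitcher whose FanGraphs ID is absent from the lookup table is dropped silently, and an `isna()` check on the merged frame cannot see them. A sketch of one way to surface such rows using `indicator=True`; the toy frames and IDs are hypothetical:

```python
import pandas as pd

# Toy stand-ins: Pitcher B has no entry in the lookup table
summary = pd.DataFrame({'Name': ['Pitcher A', 'Pitcher B'], 'playerid': [1, 2]})
keys = pd.DataFrame({'key_fangraphs': [1], 'key_bbref': ['pitcha01']})

# A left join keeps every pitcher; the _merge column records where each row matched
checked = summary.merge(keys, left_on='playerid', right_on='key_fangraphs',
                        how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Name'].tolist()
print(unmatched)
```

Comparing the row count before and after the inner join gives the same information more cheaply, but without the names.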

In [144]:
pitchers_summary_df.head()

Unnamed: 0,Name,playerid,first_season,last_season,games_played,games_started,teams,num_teams,key_mlbam,key_retro,key_bbref,key_fangraphs
0,A.J. Achter,11387,2014,2016,45,0,"[Twins, Angels]",2,592091,achta001,achteaj01,11387
1,A.J. Burnett,512,2000,2015,428,423,"[Marlins, Blue Jays, Yankees, Pirates, Phillies]",5,150359,burna001,burnea.01,512
2,A.J. Cole,11467,2015,2019,79,19,"[Nationals, - - -, Indians]",3,595918,colea002,coleaj01,11467
3,A.J. Griffin,11132,2012,2017,88,85,"[Athletics, Rangers]",2,456167,grifa002,griffaj01,11132
4,A.J. Minter,18655,2017,2019,117,0,[Braves],1,621345,minta001,minteaj01,18655


In [145]:
pitchers_summary_df = pitchers_summary_df.drop('playerid', axis='columns')

In [146]:
pitchers_summary_df.isna().sum()

Name             0
first_season     0
last_season      0
games_played     0
games_started    0
teams            0
num_teams        0
key_mlbam        0
key_retro        0
key_bbref        0
key_fangraphs    0
dtype: int64

In [147]:
pitchers_summary_df.shape[0]

3308

In [148]:
pitchers_summary_df.to_csv('../data/pitchers_summary.csv', index=False)

## Fetch pitchers game-by-game data from BR

`pybaseball` doesn't seem to give access to game-by-game stats for pitchers, which I need, so I took their code and modified it to pull directly from Baseball Reference (BR). Note that this data _does_ include ERA but _doesn't_ include WHIP.

In [5]:
pitchers_summary_df = pd.read_csv('../../data/pitchers_summary.csv')

In [6]:
pitchers_summary_df.head()

Unnamed: 0,Name,first_season,last_season,games_played,games_started,teams,num_teams,key_mlbam,key_retro,key_bbref,key_fangraphs
0,A.J. Burnett,2000,2015,428,423,"MIA,TOR,NYA,PIT,PHI",5,150359,burna001,burnea.01,512
1,A.J. Cole,2015,2019,79,19,"WAS,CLE",3,595918,colea002,coleaj01,11467
2,A.J. Griffin,2012,2017,88,85,"OAK,TEX",2,456167,grifa002,griffaj01,11132
3,A.J. Murray,2007,2008,16,4,TEX,1,451262,murra001,murraaj01,3422
4,Aaron Blair,2016,2017,16,16,ATL,1,594760,blaia001,blairaa01,14934


Get pitcher game-level data from Baseball Reference.

In [7]:
def pitcher_bref(br_id, season):
    """
    Get game-level pitching statistics for a specific pitcher and season (from Baseball-Reference)
    ARGUMENTS:
    br_id : str : The BR unique identifier. You can get this from playerid_lookup in the key_bbref column
    season : int : season you want data for (data is returned on a game-by-game basis)
    """

    url = f"https://www.baseball-reference.com/players/gl.fcgi?id={br_id}&t=p&year={season}"

    data = []
    headings = None
    stats_url = url
    response = requests.get(stats_url)
    soup = BeautifulSoup(response.content, 'html.parser')

    table = soup.find_all('table', {'id': 'pitching_gamelogs'})
    if len(table) > 0:
        table = table[0]
    else:
        return None

    if headings is None:
        headings = [row.text.strip() for row in table.find_all('th')]
        # Even within BR, it seems like different years (or perhaps players?) have
        # different numbers of columns (i.e. different stats being calculated). Annoyingly,
        # finding all "th" includes a bunch of junk column headers. You can identify where 
        # the junk ones start, because you just start getting integers (e.g. "5", "6", etc.).
        # So my work around here is to look for the first integer, and then just filter out
        # everything after that point.
        for i, h in enumerate(headings):
            try:
                int(h)
                headings = headings[1:i]
                break
            except ValueError:
                pass

    rows = table.find_all('tr')
    # Skip the last row, as this is a footer with only yearly summary data
    for row in rows[:-1]:
        cols = row.find_all('td')
        # Some rows are just basically a few long td's with text, explaining things
        # like trades. This is a really crude way to filter those out.
        if len(cols) > 40:
            cols = [ele.text.strip() for ele in cols]
            cols = [col.replace('*', '').replace('#', '') for col in cols]  # Removes '*' and '#' from some names
            cols = [col for col in cols if 'Totals' not in col and 'NL teams' not in col and 'AL teams' not in col]  # Removes Team Totals and other rows
            cols.insert(2, int(season))
            data.append([ele for ele in cols[0:]])

    headings.insert(2, "Year")
    
    data = pd.DataFrame(data=data, columns=headings) # [:-5]  # -5 to remove Team Totals and other rows (didn't work in multi-year queries)
    data.columns = [x if x != '' else 'at' for x in data.columns]
    data = data.dropna()  # Removes Row of All Nones
    data.reset_index(drop=True, inplace=True)  # Fixes index issue (Index was named 'W' for some reason)
    
    return data
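
The header clean-up inside `pitcher_bref` can be checked in isolation, without hitting the network. A sketch of that step as a standalone helper; `trim_headings` and the sample header list are hypothetical, and the real BR header set may differ:

```python
def trim_headings(headings):
    """Drop the leading rank column and everything from the first integer-like header on."""
    for i, heading in enumerate(headings):
        try:
            int(heading)
        except ValueError:
            continue
        # First integer-like header found: keep what lies between column 0 and it
        return headings[1:i]
    # No integer-like header found: leave the list untouched, as the original loop does
    return headings


sample = ['Rk', 'Gcar', 'Gtm', 'Date', 'Tm', '', 'Opp', '1', '2', '3']
trimmed = trim_headings(sample)
```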

Clean up the data and calculate WHIP.

In [79]:
def get_pitcher_game_stats(br_id, year, prepend=False, verbose=0):
    pitcher_df = pitcher_bref(br_id, year)
    if pitcher_df is None:
        if not prepend:
            return None
        else:
            # Try 's' and 'c' first, then the remaining letters in alphabetical order
            lcl = ['s', 'c'] + [c for c in string.ascii_lowercase if c not in ('s', 'c')]
            for c in lcl:
                pitcher_df = pitcher_bref(c + br_id, year)
                if pitcher_df is not None:
                    if verbose > 0:
                        print(f'Fixed {br_id} by changing it to {c + br_id}')
                    br_id = c + br_id
                    break
            else:
                if verbose > 1:
                    print(f"Couldn't fix {br_id} in {year}")
                return None, None
    pitcher_df['Year'] = pitcher_df['Year'].astype(int).astype(str)
    pitcher_df['Date'] = pitcher_df[['Date', 'Year']].agg(' '.join, axis=1)
    
    # Note double-headers and remove the (#) marker from the date so that Pandas can parse it
    patt = r'\((\d+)\)'
    pitcher_df['Double_Header'] = pitcher_df['Date'].str.extract(patt, expand=False)
    pitcher_df['Date'] = pitcher_df['Date'].str.replace(patt, '', regex=True)
    
    # Weird date formatting
    pitcher_df['Date'] = pitcher_df['Date'].str.replace('susp', '')
    
    pitcher_df = pitcher_df.drop('Year', axis='columns')
    
    home_team = []
    for i in range(pitcher_df.shape[0]):
        if pitcher_df.loc[i, 'at'] == '@':
            home_team.append(pitcher_df.loc[i, 'Opp'])
        else:
            home_team.append(pitcher_df.loc[i, 'Tm'])
            
    pitcher_df['Home_Tm'] = home_team
    pitcher_df = pitcher_df.drop('at', axis='columns')
    
    pitcher_df['WHIP'] = (pitcher_df['BB'].astype(int) + pitcher_df['H'].astype(int)) / pitcher_df['IP'].astype(float)
    
    pitcher_df[['Result', 'Final_Score']] = pitcher_df['Rslt'].str.split(',', n=1, expand=True)
    pitcher_df[['Tm_Score', 'Opp_Score']] = pitcher_df['Final_Score'].str.split('-', n=1, expand=True)
    pitcher_df = pitcher_df.drop(['Rslt', 'Final_Score'], axis='columns')
    if not prepend:
        return pitcher_df
    else:
        return pitcher_df, br_id
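
One caveat with the WHIP line above: Baseball Reference writes innings pitched in baseball notation, where `5.2` means 5 and 2/3 innings, so casting `IP` straight to float slightly understates the innings and inflates WHIP. A sketch of a conversion that could be applied first; `ip_to_innings` and `whip` are hypothetical helpers, not part of the notebook:

```python
def ip_to_innings(ip):
    """Convert baseball IP notation ('5.2' = 5 innings + 2 outs) to true innings."""
    whole, _, outs = str(ip).partition('.')
    return int(whole) + int(outs or 0) / 3


def whip(walks, hits, ip):
    """Walks plus hits per true inning pitched."""
    return (walks + hits) / ip_to_innings(ip)


print(round(whip(2, 5, '5.2'), 3))  # 7 / 5.667, prints 1.235
```

A pitcher who records no outs has an IP of 0, so guard against division by zero before applying this to every row.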

In [72]:
ventuyo, fixed_name = get_pitcher_game_stats('entuyo01', 2001, prepend=True, verbose=2)

Couldn't fix entuyo01 in 2001


In [73]:
ventuyo, fixed_name = get_pitcher_game_stats('entuyo01', 2013, prepend=True, verbose=2)

Fixed entuyo01 by changing it to ventuyo01


In [62]:
fixed_name

'ventuyo01'
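
The prefix repair tries candidate IDs one at a time, so the order of the letters decides how many requests a fix costs. A sketch of an ordered, duplicate-free candidate generator; `candidate_ids` and its `priority` argument are hypothetical:

```python
import string


def candidate_ids(br_id, priority=('s', 'c')):
    """Yield br_id with each lowercase letter prepended, priority letters first."""
    seen = set()
    for letter in list(priority) + list(string.ascii_lowercase):
        if letter not in seen:
            seen.add(letter)
            yield letter + br_id


candidates = list(candidate_ids('entuyo01'))
```

Using a generator means no further candidates are built once a working ID is found.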

In [64]:
ventuyo

Unnamed: 0,Gcar,Gtm,Date,Tm,Opp,Inngs,Dec,DR,IP,H,...,cWPA,RE24,Entered,Exited,Double_Header,Home_Tm,WHIP,Result,Tm_Score,Opp_Score
0,1,151,Sep 17 2013,KCR,CLE,GS-6,,99,5.2,5,...,0.09%,1.5,1t start tie,6t 1-3 2 out a2,,KCR,1.346154,L,3,5
1,2,156,Sep 23 2013,KCR,SEA,GS-6,,5,5.2,2,...,0.02%,1.27,1b start tie,6b 12- 2 out a1,,SEA,0.961538,W,6,5
2,3,161,Sep 28 2013,KCR,CHW,GS-4,L(0-1),4,4.0,6,...,0.00%,-1.99,1b start tie,4b 3 out d4,,CHW,1.75,L,5,6


Get this data for all pitchers and save it.

In [16]:
pitchers_games_df = pd.read_csv('../../data/pitchers_games.csv')
pitchers_games_df = pitchers_games_df[pitchers_games_df['Inngs'].str.startswith('GS-')]
pitchers_games_df.head()


Unnamed: 0,Gcar,Gtm,Date,Tm,Opp,Inngs,Dec,DR,IP,H,...,Home_Tm,WHIP,Result,Tm_Score,Opp_Score,name,DFS(DK),DFS(FD),Year,season_game
20,498,57,2000-06-05,ARI,CHC,GS-5,,5,4.2,5,...,CHC,1.904762,L,3,4,morgami01,,,2000,21.0
23,501,64,2000-06-13,ARI,LAD,GS-5,L(1-1),2,4.2,8,...,LAD,2.380952,L,1,6,morgami01,,,2000,24.0
28,506,79,2000-06-30,ARI,CIN,GS-5,L(3-2),2,5.0,8,...,ARI,2.0,L,4,5,morgami01,,,2000,29.0
34,512,97,2000-07-21,ARI,CIN,GS-5,,2,5.0,10,...,CIN,2.4,W,5,4,morgami01,,,2000,35.0
63,541,11,2001-04-14,ARI,COL,GS-4,,2,4.0,8,...,COL,2.0,L,8,9,morgami01,,,2001,4.0


In [20]:
merged_df = pitchers_summary_df.merge(pitchers_games_df, left_on='key_bbref', right_on='name', how='right')

In [22]:
merged_df.head()

Unnamed: 0,Name,first_season,last_season,games_played,games_started,teams,num_teams,key_mlbam,key_retro,key_bbref,...,Home_Tm,WHIP,Result,Tm_Score,Opp_Score,name,DFS(DK),DFS(FD),Year,season_game
0,Mike Morgan,2000.0,2002.0,120.0,5.0,ARI,1.0,119374.0,morgm001,morgami01,...,CHC,1.904762,L,3,4,morgami01,,,2000,21.0
1,Mike Morgan,2000.0,2002.0,120.0,5.0,ARI,1.0,119374.0,morgm001,morgami01,...,LAD,2.380952,L,1,6,morgami01,,,2000,24.0
2,Mike Morgan,2000.0,2002.0,120.0,5.0,ARI,1.0,119374.0,morgm001,morgami01,...,ARI,2.0,L,4,5,morgami01,,,2000,29.0
3,Mike Morgan,2000.0,2002.0,120.0,5.0,ARI,1.0,119374.0,morgm001,morgami01,...,CIN,2.4,W,5,4,morgami01,,,2000,35.0
4,Mike Morgan,2000.0,2002.0,120.0,5.0,ARI,1.0,119374.0,morgm001,morgami01,...,COL,2.0,L,8,9,morgami01,,,2001,4.0


In [96]:
missing_pitchers = merged_df[merged_df['key_bbref'].isna()]['name'].unique()

In [97]:
missing_pitchers[:10]

array(['entuyo01', 'trasst01', 'tewajo02', 'houpe01', 'ahiltr01',
       'uppaje01', 'osajo02', 'imonja01', 'hambjo03', 'ribtr01'],
      dtype=object)

In [36]:
pitchers_games_df[pitchers_games_df['name'] == 'tewajo02']

Unnamed: 0,Gcar,Gtm,Date,Tm,Opp,Inngs,Dec,DR,IP,H,...,Home_Tm,WHIP,Result,Tm_Score,Opp_Score,name,DFS(DK),DFS(FD),Year,season_game
5672,1,6,2003-04-06,,DET,GS-7,,99,6.2,7,...,CHW,1.290323,W,10,2,tewajo02,,,2003,1.0
5673,2,11,2003-04-12,,DET,GS-5,L(0-1),5,4.1,5,...,DET,1.95122,L,3,4,tewajo02,,,2003,2.0
5674,3,17,2003-04-19,,CLE,GS-6,W(1-1),6,6.0,5,...,CHW,1.333333,W,12,3,tewajo02,,,2003,3.0
5675,4,22,2003-04-24,,BAL,GS-5,,4,5.0,4,...,BAL,1.6,L,4,5,tewajo02,,,2003,4.0
5676,5,30,2003-05-03,,SEA,GS-4,L(1-2),8,3.2,7,...,CHW,3.75,L,2,12,tewajo02,,,2003,5.0
5677,6,120,2004-08-21,,BOS,GS-4,L(0-1),99,3.1,8,...,CHW,2.903226,L,7,10,tewajo02,,,2004,1.0
5679,8,125,2004-08-26,,CLE,GS-4,,2,3.1,7,...,CLE,2.903226,W,14,9,tewajo02,,,2004,3.0


In [83]:
successfully_processed_pitchers = []
failed_processing_pitchers = []
num_pitchers = len(missing_pitchers)

for i, key in enumerate(missing_pitchers):
    pitcher_yearly_df = None
    try:
        for y in range(2000, 2020):
            year_df, new_key = get_pitcher_game_stats(key, y, prepend=True, verbose=1)
            if new_key is not None:
                key = new_key
            if year_df is not None:
                if pitcher_yearly_df is None:
                    pitcher_yearly_df = year_df
                else:
                    pitcher_yearly_df = pd.concat([pitcher_yearly_df, year_df])
        pitcher_yearly_df['Date'] = pd.to_datetime(pitcher_yearly_df['Date'])
        pitcher_yearly_df = pitcher_yearly_df.sort_values('Date')
        pitcher_yearly_df.to_csv(f'../../data/pitchers_games/{key}.csv', index=False)
        successfully_processed_pitchers.append(key)
    except Exception as e:
        print(key, e)
        failed_processing_pitchers.append(key)
    
    if i % 50 == 0 and i > 0:
        num_success = len(successfully_processed_pitchers)
        num_fails = len(failed_processing_pitchers)
        print(f'{100*(i+1) / num_pitchers:.2f}% processed')
        print(f'{num_success} ({100*num_success / (num_success + num_fails):.2f}%) successful')
        print('='*40)
    time.sleep(1)

houpe01 'NoneType' object is not subscriptable
Fixed ahiltr01 by changing it to cahiltr01
Fixed uppaje01 by changing it to suppaje01
Fixed osajo02 by changing it to sosajo02
Fixed imonja01 by changing it to simonja01
Fixed hambjo03 by changing it to chambjo03
ribtr01 'NoneType' object is not subscriptable


KeyboardInterrupt: 

In [229]:
successfully_processed_pitchers = []
failed_processing_pitchers = []
num_pitchers = pitchers_summary_df['key_bbref'].nunique()

for i, key in enumerate(pitchers_summary_df['key_bbref'].unique()):
    pitcher_yearly_df = None
    try:
        for y in range(2000, 2020):
            year_df = get_pitcher_game_stats(key, y)
            if year_df is not None:
                if pitcher_yearly_df is None:
                    pitcher_yearly_df = year_df
                else:
                    pitcher_yearly_df = pd.concat([pitcher_yearly_df, year_df])
        pitcher_yearly_df['Date'] = pd.to_datetime(pitcher_yearly_df['Date'])
        pitcher_yearly_df = pitcher_yearly_df.sort_values('Date')
        pitcher_yearly_df.to_csv(f'../data/pitchers_games/{key}.csv', index=False)
        successfully_processed_pitchers.append(key)
    except Exception as e:
        failed_processing_pitchers.append(key)
    
    if i % 50 == 0 and i > 0:
        num_success = len(successfully_processed_pitchers)
        num_fails = len(failed_processing_pitchers)
        print(f'{100*(i+1) / num_pitchers:.2f}% processed')
        print(f'{num_success} ({100*num_success / (num_success + num_fails):.2f}%) successful')
        print('='*40)
    time.sleep(1)

1.54% processed
51 (100.00%) successful
3.05% processed
101 (100.00%) successful
4.56% processed
151 (100.00%) successful
6.08% processed
201 (100.00%) successful
7.59% processed
251 (100.00%) successful
9.10% processed
301 (100.00%) successful
10.61% processed
350 (99.72%) successful
12.12% processed
400 (99.75%) successful
13.63% processed
450 (99.78%) successful
15.15% processed
500 (99.80%) successful
16.66% processed
550 (99.82%) successful
18.17% processed
600 (99.83%) successful
19.68% processed
650 (99.85%) successful
21.19% processed
700 (99.86%) successful
22.70% processed
750 (99.87%) successful
24.21% processed
800 (99.88%) successful
25.73% processed
850 (99.88%) successful
27.24% processed
900 (99.89%) successful
28.75% processed
950 (99.89%) successful
30.26% processed
1000 (99.90%) successful
31.77% processed
1050 (99.90%) successful
33.28% processed
1100 (99.91%) successful
34.79% processed
1150 (99.91%) successful
36.31% processed
1200 (99.92%) successful
37.82% proce

In [230]:
failed_processing_pitchers

['thompbr01', 'woodmi01', 'perezto03']

In [231]:
for i, key in enumerate(failed_processing_pitchers):
    pitcher_yearly_df = None
    try:
        for y in range(2000, 2020):
            year_df = get_pitcher_game_stats(key, y)
            if year_df is not None:
                if pitcher_yearly_df is None:
                    pitcher_yearly_df = year_df
                else:
                    pitcher_yearly_df = pd.concat([pitcher_yearly_df, year_df])
        pitcher_yearly_df['Date'] = pd.to_datetime(pitcher_yearly_df['Date'])
        pitcher_yearly_df = pitcher_yearly_df.sort_values('Date')
        pitcher_yearly_df.to_csv(f'../data/pitchers_games/{key}.csv', index=False)
        successfully_processed_pitchers.append(key)
    except Exception as e:
        print(key)
        print(e)
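
The three download loops above share the same per-pitcher logic. A sketch of how it could be factored into one helper that takes the fetch function as an argument; `collect_seasons` and `fake_fetch` are hypothetical, and the real fetcher would be `get_pitcher_game_stats`:

```python
import pandas as pd


def collect_seasons(fetch, key, years):
    """Fetch one DataFrame per season and return them stacked and sorted by date.

    Returns None when no season yields data, instead of failing on a None frame.
    """
    frames = [df for df in (fetch(key, year) for year in years) if df is not None]
    if not frames:
        return None
    combined = pd.concat(frames, ignore_index=True)
    combined['Date'] = pd.to_datetime(combined['Date'])
    return combined.sort_values('Date').reset_index(drop=True)


def fake_fetch(key, year):
    """Stand-in for the real scraper: only 2013 has data."""
    if year != 2013:
        return None
    return pd.DataFrame({'Date': ['Sep 23 2013', 'Sep 17 2013'], 'IP': ['5.2', '5.2']})


games = collect_seasons(fake_fetch, 'ventuyo01', range(2012, 2015))
```

Returning `None` for a pitcher with no data makes that case explicit, rather than relying on the `'NoneType' object is not subscriptable` exception to flag it.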

Examples of this data can be seen [here](https://www.baseball-reference.com/players/gl.fcgi?id=cookaa01&t=p&year=2002).

Descriptions of columns:
- Gcar -- Career Game Number for Player
- Gtm -- Season Game Number for Team. Number in parentheses indicates number of team games the player did not play in from one appearance to next.
- Date -- A number in parentheses indicates which game of a doubleheader.
- Rslt -- Game Result for Team. W - Win, L - Loss, T - Tie (for a suspended game)
- Inngs -- Innings Played by Player
    - CG - Complete Game started and finished
    - GS-# - Game Started to what inning
    - #-GF - Inning entered to end of game
    - #-# - Inning Entered to Inning Left
    - (#) Game did not go 9 innings (only shown when player finished the game).
    - For pitchers, an SHO means they shut out the opposition. A zero for the innings means the innings played is unknown.
- Dec -- Decision, Save, or Hold
    - W - Win (pitcher record after game)
    - L - Loss (pitcher record after game)
    - BW - Blown Save and Win (pitcher record after game)
    - BL - Blown Save and Loss (pitcher record after game)
    - S - Save (pitcher saves thus far)
    - BSv - Blown Save (pitcher blown saves thus far)
    - H - Hold (pitcher holds thus far)
- DR -- Days Rest. Number of days since their previous appearance. 99 if start of season or 99 or more days (may include demotions). -1 if pitching both games of double-header.
- IP -- Innings Pitched
- H -- Hits/Hits Allowed
- R -- Runs Scored/Allowed
- ER -- Earned Runs Allowed
- BB -- Bases on Balls/Walks
- SO -- Strikeouts
- HR -- Home Runs Hit/Allowed
- HBP -- Times Hit by a Pitch.
- ERA -- 9 * ER / IP. For recent years, leaders need 1 IP per team game played.
- BF -- Batters Faced
- Pit -- Number of pitches in the PA.
- Str -- Strikes. Includes both pitches in the zone and those swung at out of the zone.
- StL -- Strikes Looking. Strikes called by the umpire.
- StS -- Strikes Swinging. Strikes due to a swing and a miss.
- GB -- Ground Balls. Includes bunts and all other ground balls.
- FB -- Fly Balls. Includes Fly Balls, Line Drives, and Pop-Ups.
- LD -- Line Drives. These are double-counted in Fly Balls as well.
- PU -- Pop Ups. Generally, high fly balls that land within the infield circle. These are double-counted in Fly Balls as well.
- Unk -- Unknown batted ball type. A ball in play for which we don’t know the type.
- GSc -- Game Score. Developed by Bill James
    1. Start with 50 points.
    2. Add 1 point for each out recorded, so 3 points for every complete inning pitched.
    3. Add 2 points for each inning completed after the 4th.
    4. Add 1 point for each strikeout.
    5. Subtract 2 points for each hit allowed.
    6. Subtract 4 points for each earned run allowed.
    7. Subtract 2 points for each unearned run allowed.
    8. Subtract 1 point for each walk.
- IR -- Inherited Runners. Number of runners on base when pitcher entered the game.
- IS -- Inherited Score. Number or percentage of runners on base when pitcher entered the game who subsequently scored. These runners show up in the previous pitcher’s ERA.
- SB -- Stolen Bases
- CS -- Caught Stealing
- PO -- Pickoffs. Runner picked off a base. May include cases they were safe on an error. Also includes Pickoff Caught Stealing plays.
- AB -- At Bats
- 2B -- Doubles Hit/Allowed
- 3B -- Triples Hit/Allowed
- IBB -- Intentional Bases on Balls
- GDP -- Double Plays Grounded Into. Only includes standard 6-4-3, 4-3, etc. double plays. For gamelogs only in seasons we have play-by-play, we include triple plays as well. All official seasonal totals do not include GITP's.
- SF -- Sacrifice Flies
- ROE -- Reached On Error. Times a batter reached due to an error. DOES NOT include a fielder’s choice where no out was recorded.
- aLI -- Average Leverage Index. The average pressure the pitcher or batter saw in this game or season. 1.0 is average pressure, below 1.0 is low pressure and above 1.0 is high pressure.
- WPA -- Win Probability Added by Pitcher. Given average teams, this is the change in probability. A change of +/- 1 would indicate one win added or lost.
- acLI -- Average Championship Leverage Index. The average pressure the pitcher or batter saw in this game or season. 1.0 is average pressure, below 1.0 is low pressure and above 1.0 is high pressure.
- cWPA -- Championship Win Probability Added by Pitcher. Given average teams, this is the change in probability, displayed in percentage points. A change of +/- 100% would indicate one world series win added or lost.
- RE24 -- Base-Out Runs Saved. Given the bases occupied/out situation, how many runs did the pitcher save in the resulting play. Compared to average, so 0 is average, and above 0 is better than average
- Entered -- The situation when pitcher entered game. 
    - Inning top or bottom: 8b (bottom of 8th) 
    - bases occupied or start of inning: ’---’ (bases empty) 
    - score from pitching team’s perspective 
        - ahead/down and runs or tie: a4 (ahead by 4 runs) 
- Exited -- The situation when pitcher exited game
    - Inning top or bottom: 4t (top of 4th)
    - bases occupied, 3 outs, or end of game: ’123’ (bases loaded)
    - score from pitching team’s perspective
        - ahead/down and runs or tie: d2 (down by 2 runs)
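
The Game Score (GSc) rules listed above translate directly into a small function. A sketch, taking outs recorded rather than IP so that partial innings are unambiguous; `game_score` and its parameter names are hypothetical:

```python
def game_score(outs, strikeouts, hits, earned_runs, unearned_runs, walks):
    """Bill James Game Score, following the eight steps in the GSc description."""
    innings_completed = outs // 3
    score = 50                                   # 1. start with 50 points
    score += outs                                # 2. one point per out recorded
    score += 2 * max(0, innings_completed - 4)   # 3. two points per inning completed after the 4th
    score += strikeouts                          # 4. one point per strikeout
    score -= 2 * hits                            # 5. minus two per hit allowed
    score -= 4 * earned_runs                     # 6. minus four per earned run
    score -= 2 * unearned_runs                   # 7. minus two per unearned run
    score -= walks                               # 8. minus one per walk
    return score


# A complete game: 27 outs, 10 strikeouts, 3 hits, no runs, 1 walk
print(game_score(27, 10, 3, 0, 0, 1))  # prints 90
```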