# The Lahman Baseball Database Exploratory Data Analysis

### SUMMARY

The Lahman Baseball Database is a comprehensive record of batting and pitching statistics from 1871 to 2016. It also contains fielding statistics, standings, team stats, managerial records, post-season data, and a number of other data points. There are many possible investigations that would likely uncover interesting findings. Here I focus more narrowly on general trends related to performance and measures of success. My hope is that a more distinctive investigation can be extended from this initial one.

### INITIAL INVESTIGATION - Exploratory Data Analysis

It's hard to argue against the notion that hitting is a very important factor in baseball. For the purposes of this investigation it's the low-hanging fruit, so I'll investigate it first.


The most relevant tables from The Lahman Baseball Database with data regarding batting performance and measure of success are:
* Regular-season batting statistics (Batting.csv)
* Post-season batting statistics (BattingPost.csv)
* All-Star appearances (AllstarFull.csv)
* Player salary (Salaries.csv)
* Awards received (AwardsPlayers.csv)

Let's have a look at the tables. An interesting player who appears in many of them is Hank Aaron (playerID: aaronha01), one of the most highly regarded players in baseball history:

##### LOAD AND BUILD UP TABLES

In [1]:
# import declarations and settings
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 6)

In [2]:
# a function to make loading tables easier
def load(data_file):
    return pd.read_csv('supporting-files/baseballdatabank-2017/core/' + data_file)

##### Regular Season Batting Statistics
The Regular-season batting statistics table contains records for each player's batting performance for each year played. The operations carried out on this table are:
* load original batting table
* group by playerID and sum to find career statistics for each player
* drop the yearID column
* calculate lifetime batting average
* remove players who never batted
* fill NaN fields with 0 (in this case, NaN values are equivalent to 0).

# remove pitchers test

In [3]:
appearances_full = load('Appearances.csv')
appearances_full

Unnamed: 0,yearID,teamID,lgID,playerID,G_all,GS,G_batting,G_defense,G_p,G_c,G_1b,G_2b,G_3b,G_ss,G_lf,G_cf,G_rf,G_of,G_dh,G_ph,G_pr
0,1871,TRO,,abercda01,1,,1,1,0,0,0,0,0,1,0,0,0,0,,,
1,1871,RC1,,addybo01,25,,25,25,0,0,0,22,0,3,0,0,0,0,,,
2,1871,CL1,,allisar01,29,,29,29,0,0,0,2,0,0,0,29,0,29,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102758,2016,CHN,NL,zobribe01,147,142.0,147,146,0,0,1,119,0,1,27,0,24,46,0.0,4.0,0.0
102759,2016,SEA,AL,zuninmi01,55,48.0,55,52,0,52,0,0,0,0,0,0,0,0,2.0,3.0,0.0
102760,2016,SEA,AL,zychto01,12,0.0,0,12,12,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0


In [25]:
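def is_pitcher(row):
    # position 8 is G_p; positions 9-16 are G_c through G_rf
    # note: the denominator leaves out G_p, so (judging by the output below)
    # a season spent only pitching divides by zero and is reported as False
    try:
        ratio = float(row.iloc[8]) / sum(row.iloc[9:17])
    except ZeroDivisionError:
        return False
    return ratio > 0.95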

In [26]:
pitchers = appearances_full[appearances_full.apply(is_pitcher, axis=1)]
pitchers

Unnamed: 0,yearID,teamID,lgID,playerID,G_all,GS,G_batting,G_defense,G_p,G_c,G_1b,G_2b,G_3b,G_ss,G_lf,G_cf,G_rf,G_of,G_dh,G_ph,G_pr
37,1871,RC1,,fishech01,25,,25,25,24,0,2,1,0,0,0,0,0,0,,,
73,1871,TRO,,mcmuljo01,29,,29,29,29,0,0,0,0,1,0,0,0,0,,,
86,1871,CL1,,prattal01,29,,29,29,28,0,0,0,0,0,5,0,2,7,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102277,2016,CHN,NL,pattosp01,16,0.0,16,16,16,0,0,0,0,0,1,0,0,1,0.0,0.0,0.0
102588,2016,CHN,NL,stroppe01,55,0.0,52,55,54,0,0,0,0,0,1,0,0,1,0.0,0.0,0.0
102736,2016,CHN,NL,woodtr01,81,0.0,76,78,77,0,0,0,0,0,3,0,0,3,0.0,2.0,3.0


In [27]:
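# sum only the numeric columns (recent pandas no longer drops text columns silently)
pitchers = pitchers.groupby(pitchers.playerID).sum(numeric_only=True)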
pitchers

playerID,yearID,G_all,GS,G_batting,G_defense,G_p,G_c,G_1b,G_2b,G_3b,G_ss,G_lf,G_cf,G_rf,G_of,G_dh,G_ph,G_pr
alexape01,1914,48,39.0,48,47,46,0,0,0,0,0,1,0,0,1,0.0,0.0,0.0
altroni01,11454,137,3.0,137,137,129,0,5,0,0,0,2,0,1,3,0.0,0.0,0.0
anderva01,1895,36,,36,36,29,0,0,0,0,0,1,0,1,2,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zay01,1886,1,,1,1,1,0,0,0,0,0,0,1,0,1,,,
zettlge01,11239,177,,177,177,172,0,11,1,0,0,0,1,12,13,,,
zinnji01,3841,41,13.0,41,40,38,0,0,0,0,0,0,0,2,2,0.0,1.0,0.0


# Remove less than X games test

In [28]:
appearances_full_2 = load('Appearances.csv')
appearances_full_2

Unnamed: 0,yearID,teamID,lgID,playerID,G_all,GS,G_batting,G_defense,G_p,G_c,G_1b,G_2b,G_3b,G_ss,G_lf,G_cf,G_rf,G_of,G_dh,G_ph,G_pr
0,1871,TRO,,abercda01,1,,1,1,0,0,0,0,0,1,0,0,0,0,,,
1,1871,RC1,,addybo01,25,,25,25,0,0,0,22,0,3,0,0,0,0,,,
2,1871,CL1,,allisar01,29,,29,29,0,0,0,2,0,0,0,29,0,29,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102758,2016,CHN,NL,zobribe01,147,142.0,147,146,0,0,1,119,0,1,27,0,24,46,0.0,4.0,0.0
102759,2016,SEA,AL,zuninmi01,55,48.0,55,52,0,52,0,0,0,0,0,0,0,0,2.0,3.0,0.0
102760,2016,SEA,AL,zychto01,12,0.0,0,12,12,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0


In [29]:
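def more_x_games(row):
    # positions 8-16 are G_p through G_rf; a game played at two positions is
    # counted twice, so this sum can exceed G_all
    try:
        sum_games = sum(row.iloc[8:17])
    except TypeError:
        return False
    return sum_games > 100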

In [30]:
games_larger = appearances_full_2[appearances_full_2.apply(more_x_games, axis=1)]
games_larger

Unnamed: 0,yearID,teamID,lgID,playerID,G_all,GS,G_batting,G_defense,G_p,G_c,G_1b,G_2b,G_3b,G_ss,G_lf,G_cf,G_rf,G_of,G_dh,G_ph,G_pr
1683,1883,CHN,NL,ansonca01,98,,98,98,2,1,98,0,0,0,0,0,1,1,,,
1760,1883,DTN,NL,farrejo01,101,,101,101,0,0,0,0,101,0,0,0,0,0,,,
1763,1883,CHN,NL,flintsi01,85,,85,85,0,83,0,0,0,0,0,0,23,23,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102745,2016,MIA,NL,yelicch01,155,150.0,155,149,0,0,0,0,0,0,120,31,0,149,1.0,5.0,0.0
102757,2016,WAS,NL,zimmery01,115,109.0,115,114,0,0,114,0,0,0,0,0,0,0,1.0,0.0,0.0
102758,2016,CHN,NL,zobribe01,147,142.0,147,146,0,0,1,119,0,1,27,0,24,46,0.0,4.0,0.0


# Make these two conditions first, this is the starting dataframe. Then go off this:

In [None]:
# show only those who have pitched
# .copy() so the new column added below is not written to a slice of the original
appearances_full = appearances_full[appearances_full.G_p != 0].copy()
appearances_full

In [None]:
positions_list = ['G_p','G_c','G_1b','G_2b','G_3b','G_ss','G_lf','G_cf','G_rf']
positions_list

In [None]:
# appearances_full['new_col'] = appearances_full[positions_list].sum(axis=1)
# appearances_full['new_col'] = appearances_full['G_p'] / appearances_full[positions_list].sum(axis=1)
appearances_full['new_col'] = appearances_full.loc[:,'G_p'] / appearances_full.loc[:,positions_list].sum(axis=1)
appearances_full

In [None]:
appearances_full = appearances_full[appearances_full.new_col < .95]
appearances_full
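
The note above asks for both conditions (not mostly a pitcher, and more than X games) to be applied first. A minimal sketch of how the two filters might be combined in one vectorized step is given here. The rows are invented for illustration, `toy`, `games`, `pitch_share` and `starting` are hypothetical names, and the thresholds (100 games, 0.95) are simply the ones used in the test cells:

```python
import pandas as pd

positions = ['G_p', 'G_c', 'G_1b', 'G_2b', 'G_3b', 'G_ss', 'G_lf', 'G_cf', 'G_rf']

# Invented season rows in the shape of Appearances.csv (not real players).
rows = [
    ['pitcher01', 30, 0,   0, 0, 0, 0, 0, 0, 0],
    ['fielder01',  0, 0, 150, 0, 0, 0, 0, 0, 0],
    ['utility01',  2, 1,  98, 0, 0, 0, 0, 0, 1],
    ['benchpl01',  0, 0,  40, 0, 0, 0, 0, 0, 0],
]
toy = pd.DataFrame(rows, columns=['playerID'] + positions)

# Games summed across the nine positions, and the share of them spent pitching.
games = toy[positions].sum(axis=1)
pitch_share = toy['G_p'] / games

# Keep seasons with more than 100 positional games that were not mostly pitching.
starting = toy[(games > 100) & (pitch_share < 0.95)]
print(starting['playerID'].tolist())
```

Because `G_p` is part of the denominator here, a pitcher-only season gets a share of 1.0 rather than a division by zero.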

In [None]:
# load original batting table
batting = load('Batting.csv')

# group by playerID and sum to find career statistics for each player 
batting_career = batting.groupby(batting.playerID).sum(numeric_only=True)

# drop the yearID column
batting_career = batting_career.drop('yearID',axis=1)

# calculate lifetime batting average
batting_career['lifetime_BA'] = round((batting_career['H']/batting_career['AB']),4)

# remove players who never batted
batting_career.dropna(subset=['lifetime_BA'], inplace=True)

# fill all NaN fields with 0
batting_career.fillna(value=0, inplace=True)

batting_career
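
The cell above sums hits and at-bats over a career before dividing, rather than averaging the yearly batting averages. A small sketch with two invented seasons for one hypothetical player (`toy01`) suggests why the order matters: a short hot season would otherwise count as much as a long ordinary one.

```python
import pandas as pd

# Two invented seasons for one hypothetical player.
seasons = pd.DataFrame({
    'playerID': ['toy01', 'toy01'],
    'yearID': [2000, 2001],
    'AB': [100, 500],
    'H': [40, 100],
})

# Career average: total hits over total at-bats (140 / 600).
career = seasons.groupby('playerID')[['AB', 'H']].sum()
career_ba = (career['H'] / career['AB']).round(4)

# Mean of the yearly averages (0.400 and 0.200) weights both seasons equally.
yearly_ba = seasons['H'] / seasons['AB']
mean_of_yearly = round(yearly_ba.mean(), 4)

print(career_ba['toy01'], mean_of_yearly)
```

Here the career figure comes out near .233 while the mean of the yearly figures is .300.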

##### Post Season Batting Statistics
The Post-season batting statistics table also contains records for each player's batting performance, but for each year of post-season activity:

* load original post-season batting table
* group by playerID and sum to find career statistics for each player
* drop the yearID column
* calculate lifetime batting average
* remove players who never batted in the post-season
* fill NaN fields with 0 (in this case, NaN values are equivalent to 0).

In [None]:
# load original post-season batting table
batting_post = load('BattingPost.csv')

# group by playerID and sum to find career post-season statistics for each player
batting_post_career = batting_post.groupby(batting_post.playerID).sum(numeric_only=True)

# drop yearID field
batting_post_career = batting_post_career.drop('yearID',axis=1)

# calculate lifetime batting average
batting_post_career['lifetime_post_BA'] = round((batting_post_career['H']/batting_post_career['AB']),4)

# remove players who never batted in the post-season
batting_post_career.dropna(subset=['lifetime_post_BA'], inplace=True)

# fill NaN fields with 0
batting_post_career.fillna(value=0, inplace=True)

batting_post_career

##### All-Star Appearances
The average number of all-star games a player is sent to per year seems like the only way to quantify this as a measure of success. To find this we'll have to know how long a player's career lasted. The Master table (Master.csv) has the debut and final game for each player:

* load master table
* compute number of years played
    * convert debut and finalGame columns to timestamps for calculation purposes
    * calculate and add column for difference between debut and final game played
    * fill NaN
    * convert days to years
* group by playerID

In [None]:
# load master table
master = load('Master.csv')

# compute number of years played
# convert debut and finalGame columns to timestamps for calculation purposes
master['debut'] = pd.to_datetime(master['debut'])
master['finalGame'] = pd.to_datetime(master['finalGame'])
# calculate and add column for difference between debut and final game played
master['career-years'] = master['finalGame'] - master['debut']
# fill missing spans with a zero-length timedelta
master['career-years'] = master['career-years'].fillna(pd.Timedelta(0))
# convert days to years
master['career-years'] = master['career-years'].dt.days / 365

# group by playerID
master = master.groupby(master.playerID).first()

master

Career All-Star appearances are tallied up here. Note that GP (Games Played) may be fewer than the number of All-Star games the player was sent to. Also, a player's career in years may be less than the number of All-Star games the player was sent to, because at periods in the 50's and 60's [more than one all star game was played](http://www.nytimes.com/2008/07/15/sports/baseball/15sandomir.html) per year. This also means that the average number of All-Star games a player was sent to per year (calculated below) may be greater than 1 for players like Hank Aaron:
* load original all-star appearances table
* count number of times player was sent to all-star game(s)
* compute career all-star appearances
* drop yearID, gameNum and startingPos
* concatenate this dataframe with career-years from master dataframe
* remove any row with NaN (player never went to All-Star game)

In [None]:
# load original all-star appearances table
all_star = load('AllstarFull.csv')

# count number of times player was sent to all-star game(s)
all_star['sent_to_AS'] = 1

# compute career all-star appearances
all_star_career = all_star.groupby(all_star.playerID).sum(numeric_only=True)

# drop yearID, gameNum and startingPos
all_star_career = all_star_career.drop(['yearID','gameNum','startingPos'],axis=1)

# concatenate this dataframe with career-years from master dataframe
all_star_career = pd.concat([all_star_career, master['career-years']], axis=1)

# remove any row with NaN (player never went to All-Star game)
all_star_career.dropna(inplace=True)

# calculate all-star appearances average
all_star_career['avg-all-star'] = all_star_career['sent_to_AS'] / all_star_career['career-years']

all_star_career
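
To see how the per-year average can exceed 1, here is a minimal sketch with invented rows: one hypothetical player sent to two All-Star games in each of two years. The `career_years` Series stands in for `master['career-years']`, and its values are made up:

```python
import pandas as pd

# Invented All-Star rows: one hypothetical player, two games in each of two years.
all_star_toy = pd.DataFrame({
    'playerID': ['toy01'] * 4,
    'yearID': [1959, 1959, 1960, 1960],
    'gameNum': [1, 2, 1, 2],
})
all_star_toy['sent_to_AS'] = 1
sent = all_star_toy.groupby('playerID')['sent_to_AS'].sum()

# Hypothetical career lengths in years; toy02 never went to an All-Star game.
career_years = pd.Series({'toy01': 2.0, 'toy02': 5.0}, name='career-years')

# Outer-join on playerID, then drop players with no All-Star rows.
combined = pd.concat([sent, career_years], axis=1).dropna()
combined['avg-all-star'] = combined['sent_to_AS'] / combined['career-years']
print(combined)
```

The concatenation keeps every player from either side, which is why the `dropna` step is needed to discard players such as `toy02`.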

##### Player salary
A player's salary in 1985 dollars is not comparable to one in 2016 dollars, so I've adjusted for inflation using the [United States Bureau of Labor Statistics' Consumer Price Index](https://www.bls.gov/data/) historical data. Each salary is adjusted independently to the 2016 annual average (2017 doesn't yet have a year-end average), and the adjusted salaries are then averaged over the player's career:
* load the salaries and cpi dataframes
* define a function that queries the cpi dataframe for the value corresponding to the year x
* create the 'avg-adjusted' column by applying that function to each salary's year
* suppress scientific notation in the dataframe
* compute average salary over the course of a player's career
    * group by playerID and find the average
    * drop yearID column
    * rename the columns to 'avg-salary' and 'avg-salary-adj'

In [None]:
# load the salaries and cpi dataframes
salaries = load('Salaries.csv')
cpi = load('consumer_price_index.csv')

# look up the CPI annual average for the year x in the cpi dataframe
# cpi_val = lambda x: float(cpi.loc[cpi['Year'] == x, 'Annual-Avg'].values[0])

def cpi_val(x):
    return float(cpi.loc[cpi['Year'] == x, 'Annual-Avg'].values[0])

# create the 'avg-adjusted' column: scale each salary by the ratio of the
# 2016 CPI annual average (240.007) to the CPI for the salary's year
salaries['avg-adjusted'] = (240.007 * salaries['salary']) / salaries['yearID'].apply(cpi_val)

# suppress scientific notation in the dataframe
pd.set_option('display.float_format', lambda x: '%.0f' % x)

# compute average salary over the course of a player's career
# group by playerID and find the average
salaries_career = salaries.groupby(salaries.playerID).mean(numeric_only=True)
# drop yearID column
salaries_career = salaries_career.drop('yearID', axis=1)
# rename columns to 'avg-salary' and 'avg-salary-adj'
salaries_career.columns = ['avg-salary', 'avg-salary-adj']

salaries_career
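
The adjustment is salary × (target-year CPI) / (salary-year CPI). As a hedged alternative to calling a lookup function once per row, the CPI table could be turned into a Series indexed by year and applied with `map`. The sketch below assumes the same 'Year' and 'Annual-Avg' column names; the index values and salaries are placeholders, not BLS or Lahman figures:

```python
import pandas as pd

# Placeholder CPI values chosen for easy arithmetic (not real BLS figures).
cpi_toy = pd.DataFrame({'Year': [1985, 2016], 'Annual-Avg': [100.0, 240.0]})

# Invented salaries for one hypothetical player.
salaries_toy = pd.DataFrame({
    'playerID': ['toy01', 'toy01'],
    'yearID': [1985, 2016],
    'salary': [500000, 1200000],
})

# Year -> CPI lookup as a Series, so map() can align it with yearID.
cpi_by_year = cpi_toy.set_index('Year')['Annual-Avg']
target_cpi = cpi_by_year.loc[2016]

salaries_toy['adjusted'] = (
    salaries_toy['salary'] * target_cpi / salaries_toy['yearID'].map(cpi_by_year)
)
avg_adjusted = salaries_toy.groupby('playerID')['adjusted'].mean()
print(avg_adjusted)
```

With these placeholder numbers both seasons adjust to 1,200,000, so the career average is the same figure. A year missing from the CPI table would produce NaN here instead of the IndexError that the `.values[0]` lookup would raise.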

##### Awards Received
As with All-Star appearances, the total number of awards a player received over his career is tallied up here, and the average number of awards won per year is computed from the player's career length:


In [None]:
# load original awards dataframe
awards_players = load('AwardsPlayers.csv')

# tally up total awards
awards_players['awarded'] = 1

# group by awards received over course of career
awards_players_career = awards_players.groupby(awards_players.playerID).sum(numeric_only=True)

# drop yearID
awards_players_career = awards_players_career.drop('yearID', axis=1)

# compute average number of awards won over the course of a player's career
# concatenate this dataframe with career-years from master dataframe
awards_players_career = pd.concat([awards_players_career, master['career-years']], axis=1)

# undo suppression of scientific notation
pd.set_option('display.float_format', None)

# remove any row with NaN (artifact of concatenating)
awards_players_career.dropna(inplace=True)

# calculate average number of awards won per year
awards_players_career['avg-awards'] = awards_players_career['awarded'] / awards_players_career['career-years']

# awards_players
awards_players_career

##### INITIAL PLOTS
Arguably the best measure of hitting ability is batting average. It should bear some relationship to the measures of success we have here: awards received, All-Star appearances and player salary. Recording of each of these measures began in a different year.

##### Batting average vs. Awards received (awards recorded since 1877)

In [None]:
awards_ba = pd.concat([awards_players_career, batting_career['lifetime_BA']], axis=1)
awards_ba.dropna(inplace=True)
awards_ba

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
ax = sns.regplot(x='lifetime_BA', y='avg-awards', data=awards_ba, scatter_kws={'alpha':0.5})

plt.show()

##### Batting average vs. All-Star appearances (All-Star appearances recorded since 1933)

In [None]:
all_star_ba = pd.concat([all_star_career, batting_career['lifetime_BA']], axis=1)
all_star_ba

In [None]:
all_star_ba.dropna(inplace=True)
all_star_ba

In [None]:
ax = sns.regplot(x='lifetime_BA', y='avg-all-star', data=all_star_ba, scatter_kws={'alpha':0.05})
plt.show()

##### Batting average vs. player salary (salaries recorded since 1985)

In [None]:
salary_ba = pd.concat([salaries_career, batting_career['lifetime_BA']], axis=1)
salary_ba

In [None]:
salary_ba.dropna(inplace=True)

# suppress scientific notation in the dataframe
pd.set_option('display.float_format', lambda x: '%.0f' % x)

salary_ba

In [None]:
ax = sns.regplot(x='lifetime_BA', y='avg-salary-adj', data=salary_ba, scatter_kws={'alpha':0.05})
plt.show()

In [None]:
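# test plot
# note: main_df is not built in the cells above; it is assumed to be a
# per-player dataframe holding the columns listed in cols
import seaborn as sns
import matplotlib.pyplot as plt

sns.set()
cols = ['H','HR','RBI','GP','awarded']
# 'height' replaces the 'size' argument of older seaborn releases
sns.pairplot(main_df[cols], height=2.5)
plt.show()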

# Adaptability score

In [None]:
# load data
appearances_full = pd.read_csv('supporting-files/baseballdatabank-2017/core/Appearances.csv')
appearances_full

In [None]:
# collapse and sum by player ID
appearances = appearances_full.groupby(appearances_full.playerID).sum(numeric_only=True)



# drop yearID column
appearances = appearances.drop('yearID', axis=1)
appearances

In [None]:
# create player position adaptability score
positions = ['G_p','G_c','G_1b','G_2b','G_3b','G_ss','G_lf','G_cf','G_rf']
appearances['adapt_score'] = 3 - round((appearances[positions].std(axis=1, ddof=1)/appearances[positions].mean(axis=1)),4)



# total appearances
appearances
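
The score above is 3 minus the coefficient of variation (sample standard deviation over mean) of a player's games across the nine positions. One reading of the constant 3, which the notebook does not state, is that a player who appeared at only one of nine positions has a coefficient of variation of exactly √9 = 3, so the score would run from 0 (one position only) to 3 (games spread evenly). A sketch with invented career totals (`toy`, `cv` and `adapt` are hypothetical names):

```python
import pandas as pd

positions = ['G_p', 'G_c', 'G_1b', 'G_2b', 'G_3b', 'G_ss', 'G_lf', 'G_cf', 'G_rf']

# Invented career totals for three hypothetical players.
toy = pd.DataFrame(
    [
        [0, 0, 900, 0, 0, 0, 0, 0, 0],                   # a single position
        [100, 100, 100, 100, 100, 100, 100, 100, 100],   # evenly spread
        [0, 0, 0, 300, 300, 300, 0, 0, 0],               # three positions
    ],
    index=['one_pos', 'all_pos', 'three_pos'],
    columns=positions,
)

# Coefficient of variation per player, using the sample standard deviation.
cv = toy.std(axis=1, ddof=1) / toy.mean(axis=1)
adapt = 3 - cv.round(4)
print(adapt)
```

For these rows the scores work out to 0 for the single-position player, 3 for the evenly spread player and 1.5 for the three-position player. A player with no games at any position has a mean of 0, so the ratio is NaN, which is what the `dropna` step in the next cell removes.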

In [None]:
# remove any players who never made on-field appearances
appearances.dropna(subset=['adapt_score'], inplace=True)
appearances

In [None]:
# check again for adaptability scores that were not computed
rows_count = len(appearances.index)
adapt_nan = appearances['adapt_score'].isnull().sum()
print("Rows in the table: %s" %rows_count)
print("Player position adaptability scores not computed: %s" %adapt_nan)

In [None]:
appearances['adapt_score']

In [None]:
main_df

In [None]:
new_df = pd.concat([main_df, appearances['adapt_score']], axis=1)
new_df

In [None]:
new_df.dropna(subset=['adapt_score'], inplace=True)

In [None]:
sns.set()
cols = ['H','HR','RBI','GP','awarded', 'adapt_score','lifetime_ba']
# 'height' replaces the 'size' argument of older seaborn releases
sns.pairplot(new_df[cols], height=2.5)
plt.show()

I want to align these tables so that a player's career post-season batting statistics line up 


Things that stand out from looking at these data:
* the 'playerID' field is the stand-out identifier that is present in each table
* a player's measurements (number of hits in a given year for example) span multiple years and will 

If I could pull one measure from each table, it would be:
* Calculate career hits from the Batting Table: Getting a hit is probably the most reasonable measure of hitting ability.
* Number of times the player was sent to the All-Star game from the All-Star Table
* Average career salary from the Salary Table

Combine these two
* Number of awards received from the AwardPlayers Table
* Number of awards shared from the AwardSharePlayers Table

One challenge with aligning these measurements from their respective tables is... So I'll have to collapse and average data where appropriate.

##### Calculate career hits