# The Lahman Baseball Database Exploratory Data Analysis

### Summary:

The Lahman Baseball Database is a comprehensive record of batting and pitching statistics from 1871 to 2016. It also contains fielding statistics, standings, team stats, managerial records, post-season data, and a number of other data points. Any given investigation of this data set could consume many hours and would likely uncover a range of interesting findings. Here I will initially focus more narrowly on some general trends related to performance and measures of success. My hope is that an additional, more distinctive investigation can be extended from the initial one.

### INITIAL INVESTIGATION - Exploratory Data Analysis

It's hard to argue against the notion that hitting is a very important factor in baseball. For the purposes of this investigation, it's the low-hanging fruit that I'll investigate first.


The tables from the Lahman Baseball Database most relevant to measuring performance and success are:
* Regular-season batting statistics (Batting.csv)
* Post-season batting statistics (BattingPost.csv)
* All-Star appearances (AllstarFull.csv)
* Player salary (Salaries.csv)
* Awards received (AwardsPlayers.csv)

Let's have a look at the tables, each of which runs through the 2016 season:

##### LOAD TABLES:

In [1]:
# import declarations and settings
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 6)

In [2]:
# a function to make loading tables easier
def load(data_file):
    return pd.read_csv('supporting-files/baseballdatabank-2017/core/' + data_file)

##### Regular Season Batting Statistics

In [3]:
batting = load('Batting.csv')
batting

Unnamed: 0,playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
0,abercda01,1871,1,TRO,,1,4,0,0,0,0,0,0.0,0.0,0.0,0,0.0,,,,,
1,addybo01,1871,1,RC1,,25,118,30,32,6,0,0,13.0,8.0,1.0,4,0.0,,,,,
2,allisar01,1871,1,CL1,,29,137,28,40,4,5,0,19.0,3.0,1.0,2,5.0,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102813,zobribe01,2016,1,CHN,NL,147,523,94,142,31,3,18,76.0,6.0,4.0,96,82.0,6.0,4.0,4.0,4.0,17.0
102814,zuninmi01,2016,1,SEA,AL,55,164,16,34,7,0,12,31.0,0.0,0.0,21,65.0,0.0,6.0,0.0,1.0,0.0
102815,zychto01,2016,1,SEA,AL,12,0,0,0,0,0,0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0


##### Post Season Batting Statistics

In [4]:
batting_post = load('BattingPost.csv')
batting_post

Unnamed: 0,yearID,round,playerID,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
0,1884,WS,becanbu01,NY4,AA,1,2,0,1,0,0,0,0,0,,0,0,0,,,,
1,1884,WS,bradyst01,NY4,AA,3,10,1,0,0,0,0,0,0,,0,1,0,,,,
2,1884,WS,carrocl01,PRO,NL,3,10,2,1,0,0,0,1,0,,1,1,0,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13540,2016,NLCS,zobribe01,CHN,NL,6,20,3,3,1,0,0,1,0,0.0,4,3,0,0.0,0.0,1.0,0.0
13541,2016,NLDS1,zobribe01,CHN,NL,4,16,1,3,2,0,0,2,0,0.0,0,4,0,0.0,0.0,0.0,0.0
13542,2016,WS,zobribe01,CHN,NL,7,28,5,10,2,1,0,2,0,0.0,3,4,0,0.0,0.0,0.0,0.0


##### All-Star Appearances

In [5]:
all_star = load('AllstarFull.csv')
all_star

Unnamed: 0,playerID,yearID,gameNum,gameID,teamID,lgID,GP,startingPos
0,gomezle01,1933,0,ALS193307060,NYA,AL,1.0,1.0
1,ferreri01,1933,0,ALS193307060,BOS,AL,1.0,2.0
2,gehrilo01,1933,0,ALS193307060,NYA,AL,1.0,3.0
...,...,...,...,...,...,...,...,...
5145,syndeno01,2016,0,ALS201607120,NYN,NL,0.0,
5146,teherju01,2016,0,ALS201607120,ATL,NL,1.0,
5147,zobribe01,2016,0,ALS201607120,CHN,NL,1.0,4.0


##### Player salary
Note, I've adjusted for inflation using the United States Bureau of Labor Statistics' Consumer Price Index historical data. Each salary was adjusted to the 2016 annual average.

In [7]:
# load the salaries and cpi dataframes
salaries = load('Salaries.csv')
cpi = load('consumer_price_index.csv')

# lambda function queries the cpi dataframe to find a value corresponding to a year
cpi_val = lambda x: float(cpi.loc[cpi['Year'] == x, 'Annual-Avg'].values[0])

# create 'adjusted' column: scale each salary by the 2016 CPI (240.007) over the CPI of its own year, looked up row by row
salaries['adjusted'] = (240.007 * salaries['salary']) / salaries['yearID'].apply(cpi_val)

# setting to suppress scientific notation in the dataframe
pd.set_option('display.float_format', lambda x: '%.0f' % x)

salaries

Unnamed: 0,yearID,teamID,lgID,playerID,salary,adjusted
0,1985,ATL,NL,barkele01,870000,1940577
1,1985,ATL,NL,bedrost01,550000,1226802
2,1985,ATL,NL,benedbr01,545000,1215649
...,...,...,...,...,...,...
26425,2016,WSN,NL,treinbl01,524900,524900
26426,2016,WSN,NL,werthja01,21733615,21733615
26427,2016,WSN,NL,zimmery01,14000000,14000000
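
The adjustment above looks up the CPI once per row through `apply`. As a minimal sketch of the same calculation, the year-to-CPI lookup can instead be built once and applied with `Series.map`. The CPI values and salary rows below are illustrative placeholders, not official BLS figures or rows from `Salaries.csv`; only the column names (`Year`, `Annual-Avg`, `yearID`, `salary`) are taken from the cell above.

```python
import pandas as pd

# Hypothetical CPI table using the same column names as above.
# The values are placeholders chosen for easy arithmetic, not BLS data.
cpi = pd.DataFrame({'Year': [1985, 2000, 2016],
                    'Annual-Avg': [100.0, 160.0, 240.0]})

# Toy salary rows standing in for Salaries.csv
salaries = pd.DataFrame({'yearID': [1985, 2000, 2016, 2016],
                         'salary': [500000, 1000000, 524900, 14000000]})

# Build a year -> CPI lookup once, then map every row's year through it
cpi_by_year = cpi.set_index('Year')['Annual-Avg']
base_cpi = cpi_by_year.loc[2016]  # CPI of the target year

salaries['adjusted'] = salaries['salary'] * base_cpi / salaries['yearID'].map(cpi_by_year)
print(salaries)
```

Reading the base-year CPI from the table, rather than hard-coding it, keeps the target year and the constant from drifting apart if the target year changes.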


##### Awards Received

In [None]:
award_players = load('AwardsPlayers.csv')
award_players

Some initial observations and notes:
* Post-season batting statistics for all field players, except pitchers, will be the primary measure of performance. Fielding statistics would have been a useful measure as well; however, the fielding data here is not comprehensive or well structured enough to be helpful for this exploration.
* The computations below are limited to regular-season data.
* All-Star participation, awards received and salary will be the measures of success.
* Measures of performance will be computed/normalized over the career of a given player.


# TEST: Clean, align and scatter plot

In [None]:
# compute career batting statistics (sum numeric columns only; team and league IDs are text)
batting_career = batting.groupby('playerID').sum(numeric_only=True)

# calculate lifetime batting average
batting_career['lifetime_ba'] = (batting_career['H'] / batting_career['AB']).round(4)

# players with no at-bats get NaN from 0/0; treat their average as 0
batting_career.fillna(value=0, inplace=True)

# drop yearID field
batting_career = batting_career.drop('yearID',axis=1)
batting_career

In [None]:
# compute career all-star appearances (sum numeric columns only)
all_star_career = all_star.groupby('playerID').sum(numeric_only=True)

# drop yearID, gameNum and startingPos
all_star_career = all_star_career.drop(['yearID','gameNum','startingPos'],axis=1)
all_star_career

# Hank Aaron should be 25
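
One thing to keep in mind when checking a career total like the one noted above: summing `GP` counts games actually played, while the table can also contain rows with `GP` equal to 0.0 (the `syndeno01` row in the 2016 output is one). A minimal sketch on made-up rows, comparing the number of rows per player with the sum of `GP`:

```python
import pandas as pd

# Toy rows shaped like AllstarFull.csv; the player IDs are made up
all_star_toy = pd.DataFrame({
    'playerID': ['playera01', 'playera01', 'playera01', 'playerb01'],
    'yearID':   [2014, 2015, 2016, 2016],
    'GP':       [1.0, 0.0, 1.0, 1.0],   # 0.0 = on the roster but did not play
})

summary = all_star_toy.groupby('playerID').agg(
    entries=('GP', 'size'),       # one row per All-Star roster entry
    games_played=('GP', 'sum'),   # only games actually played
)
print(summary)
```

If the two columns disagree for a player in the real table, the expected total depends on which definition of "appearance" is intended.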

In [None]:
# omit salary until I figure out how to adjust for inflation

In [None]:
# compute total awards (each row is one award; sum numeric columns only)
award_players['awarded'] = 1
awards_players_career = award_players.groupby('playerID').sum(numeric_only=True)

# drop yearID
awards_players_career = awards_players_career.drop('yearID', axis=1)
awards_players_career

In [None]:
# concatenate the tables
main_df = pd.concat([batting_career,all_star_career,awards_players_career], axis=1)

# fill NaN with 0.0
main_df.fillna(value=0, inplace=True)
main_df
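
The concatenation above relies on each career table being indexed by `playerID`. As a small sketch with made-up players, `pd.concat(..., axis=1)` matches rows by index label and, with its default outer join, keeps every player that appears in any table, leaving `NaN` where a table has no row for that player:

```python
import pandas as pd

# Toy career tables indexed by made-up player IDs
batting_toy = pd.DataFrame({'H': [150, 20]}, index=['playera01', 'playerb01'])
all_star_toy = pd.DataFrame({'GP': [3.0]}, index=['playera01'])
awards_toy = pd.DataFrame({'awarded': [2]}, index=['playerc01'])

# axis=1 places the tables side by side, matching rows by index label;
# the default outer join keeps every player found in any of the tables
combined = pd.concat([batting_toy, all_star_toy, awards_toy], axis=1)
print(combined.isna().sum())  # missing entries per column before filling

combined = combined.fillna(0)
print(combined)
```

Note that after `fillna(0)` a player who has an award row but no batting row shows up with zero hits, which is indistinguishable from a player who batted and never got a hit.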

In [None]:
# test plot
import seaborn as sns
import matplotlib.pyplot as plt

sns.set()
cols = ['H','HR','RBI','GP','awarded']
sns.pairplot(main_df[cols], height=2.5)  # 'size' was renamed to 'height' in seaborn 0.9
plt.show()

# Adaptability score

In [None]:
# load data
appearances_full = load('Appearances.csv')
appearances_full

In [None]:
# collapse and sum by player ID (numeric columns only)
appearances = appearances_full.groupby('playerID').sum(numeric_only=True)



# drop yearID column
appearances = appearances.drop('yearID', axis=1)
appearances

In [None]:
# create player position adaptability score
positions = ['G_p','G_c','G_1b','G_2b','G_3b','G_ss','G_lf','G_cf','G_rf']
appearances['adapt_score'] = 3 - round((appearances[positions].std(axis=1, ddof=1)/appearances[positions].mean(axis=1)),4)



# total appearances
appearances
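
The score above is 3 minus the coefficient of variation (sample standard deviation divided by mean) of games played across the nine positions. My reading of the constant 3, offered as an interpretation rather than something stated in the notebook: for a player whose $x$ games all come at one of $n$ positions, the mean is $x/n$ and the sample standard deviation is $x/\sqrt{n}$, so the ratio is $\sqrt{n} = 3$ when $n = 9$. That would put the score between 0 (one position only) and 3 (games spread perfectly evenly). A minimal sketch on three made-up career lines:

```python
import pandas as pd

positions = ['G_p', 'G_c', 'G_1b', 'G_2b', 'G_3b', 'G_ss', 'G_lf', 'G_cf', 'G_rf']

# Three made-up career lines: a one-position specialist, a perfectly even
# utility player, and a player with no games at any listed position
toy = pd.DataFrame(
    [[0, 0, 0, 0, 0, 90, 0, 0, 0],
     [10] * 9,
     [0] * 9],
    columns=positions,
    index=['specialist', 'utility', 'no_games'],
)

# coefficient of variation across positions, then the same "3 - CV" score as above
cv = toy[positions].std(axis=1, ddof=1) / toy[positions].mean(axis=1)
toy['adapt_score'] = 3 - cv.round(4)
print(toy['adapt_score'])
```

The `no_games` row has a mean of zero, so its score comes out as `NaN`; those are the rows that a later `dropna` on `adapt_score` removes.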

In [None]:
# remove any players who never made on-field appearances
appearances.dropna(subset=['adapt_score'], inplace=True)
appearances

In [None]:
# check again for adaptability scores that were not computed
rows_count = len(appearances.index)
adapt_nan = appearances['adapt_score'].isnull().sum()
print("Rows in the table: %s" %rows_count)
print("Player position adaptability scores not computed: %s" %adapt_nan)

In [None]:
appearances['adapt_score']

In [None]:
main_df

In [None]:
new_df = pd.concat([main_df, appearances['adapt_score']], axis=1)
new_df

In [None]:
new_df.dropna(subset=['adapt_score'], inplace=True)

In [None]:
sns.set()
cols = ['H','HR','RBI','GP','awarded', 'adapt_score','lifetime_ba']
sns.pairplot(new_df[cols], height=2.5)  # 'size' was renamed to 'height' in seaborn 0.9
plt.show()

I want to align these tables so that a player's career post-season batting statistics line up 


Things that stand out from looking at these data:
* the 'playerID' field is the stand-out identifier that is present in each table
* a player's measurements (number of hits in a given year for example) span multiple years and will 

If I could pull one measure from each table, it would be:
* Calculate career hits from the Batting Table: Getting a hit is probably the most reasonable measure of hitting ability.
* Number of times the player was sent to the All-Star game from the All-Star Table
* Average career salary from the Salary Table

Combine these two
* Number of awards received from the AwardPlayers Table
* Number of awards shared from the AwardSharePlayers Table

One challenge with aligning these measurements from their respective tables is... So I'll have to collapse and average data where appropriate.
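
As a rough sketch of what "collapse and average" could look like, assuming per-season rows keyed by `playerID`: counting stats such as hits are summed over a career, while salary is averaged. The data, the `_toy` names, and the `avg_salary` column name below are hypothetical.

```python
import pandas as pd

# Toy per-season rows; IDs and numbers are made up
batting_toy = pd.DataFrame({'playerID': ['playera01', 'playera01', 'playerb01'],
                            'yearID': [2015, 2016, 2016],
                            'H': [100, 150, 40]})
salaries_toy = pd.DataFrame({'playerID': ['playera01', 'playera01', 'playerb01'],
                             'yearID': [2015, 2016, 2016],
                             'adjusted': [1000000.0, 3000000.0, 500000.0]})

# Counting stats are summed over a career; salary is averaged
career_hits = batting_toy.groupby('playerID')['H'].sum()
career_salary = salaries_toy.groupby('playerID')['adjusted'].mean().rename('avg_salary')

# Both results are indexed by playerID, so they line up one row per player
career = pd.concat([career_hits, career_salary], axis=1)
print(career)
```

Selecting the column before aggregating avoids summing or averaging `yearID`, so there is nothing to drop afterwards.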

##### Calculate career hits