# Moneyball

The purpose of this project is to use historical baseball statistics to build a team of nine players based on their historical on-base percentage (OBP).

This was the method used by the Oakland A's to build a championship-caliber team on a shoestring budget. This story was also the basis for the motion picture "Moneyball", starring Brad Pitt and Jonah Hill.

To accomplish this, we will need each player's on-base percentage and salary for each season played.

### On Base Percentage (OBP)

Per Wikipedia: On-base percentage (OBP), also known as on-base average/OBA, measures how frequently a batter reaches base.[1] It is the ratio of the batter's times-on-base (TOB) (the sum of hits, walks, and times hit by pitch) to their number of plate appearances.[1] OBP does not credit the batter for reaching base due to fielding error, fielder's choice, dropped/uncaught third strike, fielder's obstruction, or catcher's interference.

#### The Formula

       OBP = (H + BB + HBP) / (AB + BB + HBP + SF)
      
Where:
* H = Hits
* BB = Bases on Balls (Walks)
* HBP = Hit By Pitch
* AB = At Bats
* SF = Sacrifice fly
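
As a quick check of the formula, here is a minimal sketch of OBP as a plain Python function. The function name `on_base_percentage` is an illustration and not part of the notebook; the input values are Barry Bonds's 2004 line as it appears in the batting data used later in this notebook.

```python
def on_base_percentage(h, bb, hbp, ab, sf):
    """Return OBP = (H + BB + HBP) / (AB + BB + HBP + SF)."""
    denominator = ab + bb + hbp + sf
    if denominator == 0:
        # With no at-bats, walks, hit-by-pitches, or sacrifice flies, OBP is undefined.
        return float("nan")
    return (h + bb + hbp) / denominator


# Barry Bonds, 2004: H=135, BB=232, HBP=9, AB=373, SF=3
bonds_2004 = on_base_percentage(h=135, bb=232, hbp=9, ab=373, sf=3)
print(round(bonds_2004, 4))  # 0.6094
```

The numerator is 376 and the denominator is 617, which matches the 0.6094 reported for that season further down.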

Data courtesy of http://www.seanlahman.com/baseball-archive/statistics/

In [1]:
import pandas as pd
import numpy as np

In [2]:
# create the datetime objects on read
people_df = pd.read_csv('Data/People.csv',  parse_dates=['debut', 'finalGame'])
batting_df = pd.read_csv('Data/Batting.csv')
appearances_df = pd.read_csv('Data/Appearances.csv')
salaries_df = pd.read_csv('Data/Salaries.csv')

In [3]:
pd.set_option('display.max_columns', 45)
pd.set_option('display.max_rows', 90)

In [4]:
people_df.head(3)

Unnamed: 0,playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,deathCountry,deathState,deathCity,nameFirst,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID
0,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01
1,aaronha01,1934.0,2.0,5.0,USA,AL,Mobile,,,,,,,Hank,Aaron,Henry Louis,180.0,72.0,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01
2,aaronto01,1939.0,8.0,5.0,USA,AL,Mobile,1984.0,8.0,16.0,USA,GA,Atlanta,Tommie,Aaron,Tommie Lee,190.0,75.0,R,R,1962-04-10,1971-09-26,aarot101,aaronto01


In [5]:
players = people_df[['playerID', 'nameFirst', 'nameLast', 'nameGiven']]

In [6]:
players.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19953 entries, 0 to 19952
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   playerID   19953 non-null  object
 1   nameFirst  19916 non-null  object
 2   nameLast   19953 non-null  object
 3   nameGiven  19916 non-null  object
dtypes: object(4)
memory usage: 623.7+ KB


In [7]:
batting_df.head(3)

Unnamed: 0,playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
0,abercda01,1871,1,TRO,,1,4,0,0,0,0,0,0.0,0.0,0.0,0,0.0,,,,,0.0
1,addybo01,1871,1,RC1,,25,118,30,32,6,0,0,13.0,8.0,1.0,4,0.0,,,,,0.0
2,allisar01,1871,1,CL1,,29,137,28,40,4,5,0,19.0,3.0,1.0,2,5.0,,,,,1.0


In [8]:
batting = batting_df[['playerID', 'yearID', 'teamID', 'AB', 'H', 'BB', 'HBP', 'SF']]

In [9]:
batting.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 107429 entries, 0 to 107428
Data columns (total 8 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   playerID  107429 non-null  object 
 1   yearID    107429 non-null  int64  
 2   teamID    107429 non-null  object 
 3   AB        107429 non-null  int64  
 4   H         107429 non-null  int64  
 5   BB        107429 non-null  int64  
 6   HBP       104612 non-null  float64
 7   SF        71325 non-null   float64
dtypes: float64(2), int64(4), object(2)
memory usage: 6.6+ MB


In [10]:
# inner join on playerID, the only column the two frames share
player_bb = pd.merge(players, batting, on='playerID')
len(player_bb)

107429
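
The merged length equals the number of batting rows, which suggests every batting row matched exactly one player. A sketch of how that assumption could be checked explicitly, using the `validate` and `indicator` parameters of `pd.merge`. The two small frames below are made-up stand-ins, not rows from the Lahman data.

```python
import pandas as pd

# Toy stand-ins for the `players` and `batting` frames (made-up rows).
players_demo = pd.DataFrame({
    "playerID": ["aaa01", "bbb01"],
    "nameLast": ["Alpha", "Bravo"],
})
batting_demo = pd.DataFrame({
    "playerID": ["aaa01", "aaa01", "bbb01"],
    "yearID": [2001, 2002, 2001],
    "AB": [100, 120, 80],
})

merged_demo = pd.merge(
    players_demo,
    batting_demo,
    on="playerID",
    how="inner",
    validate="one_to_many",  # raises MergeError if a playerID repeats in players_demo
    indicator=True,          # adds a `_merge` column describing where each row came from
)

print(len(merged_demo))  # 3
print(len(merged_demo) == len(batting_demo))  # True: no batting row was lost or duplicated
```

If a `playerID` appeared twice in the player table, `validate="one_to_many"` would raise an error instead of silently duplicating batting rows.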

In [11]:
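# checkpoint: work on a copy so player_bb stays untouched
player_bb1 = player_bb.copy()
# treat missing counts (HBP and SF have nulls in the batting data) as 0;
# note that this also replaces any missing name fields with 0
player_bb1.fillna(0, inplace=True)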
print(len(player_bb1))
player_bb1.head()

107429


Unnamed: 0,playerID,nameFirst,nameLast,nameGiven,yearID,teamID,AB,H,BB,HBP,SF
0,aardsda01,David,Aardsma,David Allan,2004,SFN,0,0,0,0.0,0.0
1,aardsda01,David,Aardsma,David Allan,2006,CHN,2,0,0,0.0,0.0
2,aardsda01,David,Aardsma,David Allan,2007,CHA,0,0,0,0.0,0.0
3,aardsda01,David,Aardsma,David Allan,2008,BOS,1,0,0,0.0,0.0
4,aardsda01,David,Aardsma,David Allan,2009,SEA,0,0,0,0.0,0.0


In [12]:
player_bb1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 107429 entries, 0 to 107428
Data columns (total 11 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   playerID   107429 non-null  object 
 1   nameFirst  107429 non-null  object 
 2   nameLast   107429 non-null  object 
 3   nameGiven  107429 non-null  object 
 4   yearID     107429 non-null  int64  
 5   teamID     107429 non-null  object 
 6   AB         107429 non-null  int64  
 7   H          107429 non-null  int64  
 8   BB         107429 non-null  int64  
 9   HBP        107429 non-null  float64
 10  SF         107429 non-null  float64
dtypes: float64(2), int64(4), object(5)
memory usage: 9.8+ MB


In [13]:
salaries_df.head()

Unnamed: 0,yearID,teamID,lgID,playerID,salary
0,1985,ATL,NL,barkele01,870000
1,1985,ATL,NL,bedrost01,550000
2,1985,ATL,NL,benedbr01,545000
3,1985,ATL,NL,campri01,633333
4,1985,ATL,NL,ceronri01,625000


In [14]:
salaries = salaries_df[['yearID', 'playerID', 'salary']]
salaries.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26428 entries, 0 to 26427
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   yearID    26428 non-null  int64 
 1   playerID  26428 non-null  object
 2   salary    26428 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 619.5+ KB


In [15]:
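# inner join on the two shared columns, so each batting row gets that season's salary
player_bb_salary = pd.merge(player_bb1, salaries, on=['playerID', 'yearID'])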
player_bb_salary.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28294 entries, 0 to 28293
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   playerID   28294 non-null  object 
 1   nameFirst  28294 non-null  object 
 2   nameLast   28294 non-null  object 
 3   nameGiven  28294 non-null  object 
 4   yearID     28294 non-null  int64  
 5   teamID     28294 non-null  object 
 6   AB         28294 non-null  int64  
 7   H          28294 non-null  int64  
 8   BB         28294 non-null  int64  
 9   HBP        28294 non-null  float64
 10  SF         28294 non-null  float64
 11  salary     28294 non-null  int64  
dtypes: float64(2), int64(5), object(5)
memory usage: 2.8+ MB
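
The merged frame has 28,294 rows although the salary table has only 26,428. An inner join can return more rows than the salary table only if some (`playerID`, `yearID`) pairs appear more than once in the batting data; one plausible cause is a player with more than one stint in a season. The sketch below reproduces the effect on made-up rows and shows one possible way to get a single row per player-season by summing the counting stats first. This is an alternative for illustration, not what the notebook does.

```python
import pandas as pd

# Made-up rows: player aaa01 has two stints (two teams) in 2001.
batting_demo = pd.DataFrame({
    "playerID": ["aaa01", "aaa01", "bbb01"],
    "yearID": [2001, 2001, 2001],
    "teamID": ["NYA", "BOS", "SEA"],
    "AB": [200, 150, 400],
    "H": [60, 40, 120],
})
salaries_demo = pd.DataFrame({
    "playerID": ["aaa01", "bbb01"],
    "yearID": [2001, 2001],
    "salary": [500000, 900000],
})

# Merging directly repeats the salary on every stint row.
naive = pd.merge(batting_demo, salaries_demo, on=["playerID", "yearID"])
print(len(naive))  # 3

# Summing the counting stats per player-season first gives one row per salary.
season_totals = (
    batting_demo
    .groupby(["playerID", "yearID"], as_index=False)[["AB", "H"]]
    .sum()
)
per_season = pd.merge(
    season_totals,
    salaries_demo,
    on=["playerID", "yearID"],
    validate="one_to_one",  # raises MergeError if either side still has duplicate keys
)
print(len(per_season))  # 2
```

Whether to aggregate depends on the question being asked: per-stint rows keep the team information, while per-season totals avoid counting one salary several times.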


In [17]:
# checkpoint before calculations
stats_w_salaries = player_bb_salary.copy()
stats_w_salaries.head(2)

Unnamed: 0,playerID,nameFirst,nameLast,nameGiven,yearID,teamID,AB,H,BB,HBP,SF,salary
0,aardsda01,David,Aardsma,David Allan,2004,SFN,0,0,0,0.0,0.0,300000
1,aardsda01,David,Aardsma,David Allan,2007,CHA,0,0,0,0.0,0.0,387500


In [26]:
stats_w_salaries['salary'].value_counts()

109000     674
200000     467
500000     437
1000000    402
300000     367
          ... 
657000       1
1935000      1
509600       1
6515828      1
0            1
Name: salary, Length: 3353, dtype: int64

In [27]:
# drop the single row with a salary of 0; a zero salary cannot be used in a salary-based ratio
stats_w_salaries.drop(stats_w_salaries[stats_w_salaries['salary'] == 0].index , inplace=True)

### Calculating 'On Base Percentage'

Reminder: OBP = (H + BB + HBP) / (AB + BB + HBP + SF)

Remove player-seasons with fewer than 20 at-bats, since an OBP computed from so few at-bats reflects luck more than skill.

In [28]:
stats_w_salaries.drop(stats_w_salaries[stats_w_salaries.AB < 20].index, inplace=True)
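
To illustrate why a minimum number of at-bats matters, here is a small sketch on made-up rows. A player with one hit in one at-bat gets a perfect OBP of 1.000, which says little about ability. The constant `MIN_AB` and the frame `demo` are names chosen for this example; the boolean mask keeps the same rows that the `drop` call above would keep.

```python
import pandas as pd

# Made-up rows: one player with a single at-bat, one with a full season.
demo = pd.DataFrame({
    "playerID": ["tiny01", "full01"],
    "AB": [1, 500],
    "H": [1, 150],
    "BB": [0, 60],
    "HBP": [0.0, 5.0],
    "SF": [0.0, 5.0],
})
demo["OBP"] = (demo["H"] + demo["BB"] + demo["HBP"]) / (
    demo["AB"] + demo["BB"] + demo["HBP"] + demo["SF"]
)
print(demo[["playerID", "AB", "OBP"]])

# Keep only rows with at least MIN_AB at-bats (same threshold as the notebook).
MIN_AB = 20
qualified = demo[demo["AB"] >= MIN_AB]
print(qualified["playerID"].tolist())  # ['full01']
```

A side effect of the filter is that every remaining row has a non-zero denominator in the OBP formula.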

In [29]:
stats_w_salaries.head()

Unnamed: 0,playerID,nameFirst,nameLast,nameGiven,yearID,teamID,AB,H,BB,HBP,SF,salary
17,abbotje01,Jeff,Abbott,Jeffrey William,1998,CHA,244,68,9,0.0,5.0,175000
18,abbotje01,Jeff,Abbott,Jeffrey William,1999,CHA,57,9,5,0.0,1.0,255000
19,abbotje01,Jeff,Abbott,Jeffrey William,2000,CHA,215,59,21,2.0,1.0,255000
20,abbotje01,Jeff,Abbott,Jeffrey William,2001,FLO,42,11,3,1.0,0.0,300000
30,abbotji01,Jim,Abbott,James Anthony,1999,MIL,21,2,0,0.0,0.0,400000


In [83]:
# add OBP column
obp_df = stats_w_salaries.copy()

In [84]:
# vectorized OBP: (H + BB + HBP) / (AB + BB + HBP + SF)
obp_df['OBP'] = (obp_df['H'] + obp_df['BB'] + obp_df['HBP']) / (obp_df['AB'] + obp_df['BB'] + obp_df['HBP'] + obp_df['SF'])
obp_df.sort_values(by=['OBP'], inplace=True, ascending=False)

In [88]:
obp_df.head(2)

Unnamed: 0,playerID,nameFirst,nameLast,nameGiven,yearID,teamID,AB,H,BB,HBP,SF,salary,OBP
2518,bondsba01,Barry,Bonds,Barry Lamar,2004,SFN,373,135,232,9.0,3.0,18000000,0.6094
2516,bondsba01,Barry,Bonds,Barry Lamar,2002,SFN,403,149,198,9.0,2.0,15000000,0.581699


In [98]:
# OBP per $10,000 of salary
obp_df['OBP_to_salary_ratio'] = obp_df['OBP'] / (obp_df['salary'] / 10000)
obp_df.sort_values(by=['OBP_to_salary_ratio'], inplace=True, ascending=False)

In [99]:
obp_df.head()

Unnamed: 0,playerID,nameFirst,nameLast,nameGiven,yearID,teamID,AB,H,BB,HBP,SF,salary,OBP,OBP_to_salary_ratio
23508,silveda01,Dave,Silvestri,David Joseph,1993,NYA,21,6,5,0.0,0.0,10900,0.423077,0.388144
16183,mazzile01,Lee,Mazzilli,Lee Louis,1986,NYN,58,16,12,2.0,0.0,60000,0.416667,0.069444
13018,jonestr01,Tracy,Jones,Tracy Donald,1986,CIN,86,30,9,0.0,1.0,60000,0.40625,0.067708
13854,krukjo01,John,Kruk,John Martin,1986,SDN,278,86,45,0.0,2.0,60000,0.403077,0.067179
9505,greenmi01,Mike,Greenwell,Michael Lewis,1986,BOS,35,11,5,0.0,0.0,60000,0.4,0.066667
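
The ratio divides OBP by salary in units of $10,000, so the top row works out to 0.423077 / 1.09 ≈ 0.388144. Its salary of 10,900 is one tenth of 109,000, the most frequent salary in the `value_counts()` output above, so it may be worth checking against the source data. The sketch below recomputes the ratio and flags salaries far below the median for the same year. The threshold `LOW_SALARY_FRACTION`, the column `salary_suspect`, and all rows except the first are assumptions made up for this example.

```python
import pandas as pd

# First row uses the salary and OBP shown above; the other rows are made up.
demo = pd.DataFrame({
    "playerID": ["silveda01", "demo_a", "demo_b", "demo_c"],
    "yearID": [1993, 1993, 1993, 1993],
    "salary": [10900, 109000, 150000, 500000],
    "OBP": [0.423077, 0.350, 0.340, 0.380],
})

# Same definition as the notebook: OBP per $10,000 of salary.
demo["OBP_to_salary_ratio"] = demo["OBP"] / (demo["salary"] / 10000)
print(round(demo.loc[0, "OBP_to_salary_ratio"], 6))  # 0.388144

# Hypothetical sanity check: flag salaries below a fraction of that year's median.
LOW_SALARY_FRACTION = 0.1  # arbitrary threshold chosen for this example
year_median = demo.groupby("yearID")["salary"].transform("median")
demo["salary_suspect"] = demo["salary"] < LOW_SALARY_FRACTION * year_median
print(demo.loc[demo["salary_suspect"], "playerID"].tolist())  # ['silveda01']
```

Because the ratio grows as salary shrinks, a single understated salary can dominate the ranking, which is why a check of this kind could be useful before picking players.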
