# Moneyball

This project uses historical baseball statistics to build a team of nine players selected by their historical on-base percentage.

This was the method used by the Oakland A's to build a championship-caliber team on a shoestring budget. The story was also the basis for the motion picture "Moneyball", starring Brad Pitt and Jonah Hill.

### On Base Percentage (OBP)

Per Wikipedia: On-base percentage (OBP), also known as on-base average/OBA, measures how frequently a batter reaches base.[1] It is the ratio of the batter's times-on-base (TOB) (the sum of hits, walks, and times hit by pitch) to their number of plate appearances.[1] OBP does not credit the batter for reaching base due to fielding error, fielder's choice, dropped/uncaught third strike, fielder's obstruction, or catcher's interference.

#### The Formula

       OBP = (H + BB + HBP) / (AB + BB + HBP + SF)
      
Where:
* H = Hits
* BB = Bases on Balls (Walks)
* HBP = Hit By Pitch
* AB = At bats
* SF = Sacrifice fly
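
The formula above can be sketched in pandas as follows. This is a minimal illustration, not the notebook's own implementation: the function name `on_base_percentage` and the sample numbers are hypothetical, and the column names `H`, `BB`, `HBP`, `AB`, and `SF` are assumed to match the abbreviations in the list above. Treating missing `HBP` and `SF` values as zero is also an assumption.

```python
import numpy as np
import pandas as pd


def on_base_percentage(batting: pd.DataFrame) -> pd.Series:
    """Compute OBP = (H + BB + HBP) / (AB + BB + HBP + SF) row by row."""
    # Assumption: a missing HBP or SF count is treated as zero.
    hbp = batting['HBP'].fillna(0)
    sf = batting['SF'].fillna(0)

    times_on_base = batting['H'] + batting['BB'] + hbp
    denominator = batting['AB'] + batting['BB'] + hbp + sf

    # A zero denominator yields NaN rather than a division error or inf.
    return times_on_base / denominator.replace(0, np.nan)


# Made-up sample rows: one ordinary season and one with no plate appearances.
sample = pd.DataFrame({
    'H': [150, 0],
    'BB': [60, 0],
    'HBP': [5, 0],
    'AB': [500, 0],
    'SF': [5, 0],
})
sample_obp = on_base_percentage(sample)
print(sample_obp)
```

Returning NaN for rows with an empty denominator keeps such players out of any later ranking without needing a separate filter.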

Data courtesy of http://www.seanlahman.com/baseball-archive/statistics/

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Parse the debut and finalGame columns as datetimes on read
data = pd.read_csv('Data/People.csv',  parse_dates=['debut', 'finalGame'])
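
As a small, self-contained illustration of what `parse_dates` does, the sketch below reads a two-row in-memory CSV instead of `Data/People.csv`. The names `csv_text`, `people`, and `career_days` are hypothetical; the two rows reuse values from the player table in this notebook.

```python
import io

import pandas as pd

# In-memory stand-in for the real file, with only the relevant columns.
csv_text = (
    "playerID,debut,finalGame\n"
    "aardsda01,2004-04-06,2015-08-23\n"
    "aaronha01,1954-04-13,1976-10-03\n"
)

people = pd.read_csv(io.StringIO(csv_text), parse_dates=['debut', 'finalGame'])

# Both columns are now datetime64, so date arithmetic works directly.
career_days = (people['finalGame'] - people['debut']).dt.days
print(people.dtypes)
print(career_days)
```

Without `parse_dates`, both columns would be read as plain strings and the subtraction would fail.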

In [3]:
pd.set_option('display.max_columns', 45)
pd.set_option('display.max_rows', 90)


In [4]:
data.head(10)

Unnamed: 0,playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,deathCountry,deathState,deathCity,nameFirst,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID
0,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01
1,aaronha01,1934.0,2.0,5.0,USA,AL,Mobile,,,,,,,Hank,Aaron,Henry Louis,180.0,72.0,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01
2,aaronto01,1939.0,8.0,5.0,USA,AL,Mobile,1984.0,8.0,16.0,USA,GA,Atlanta,Tommie,Aaron,Tommie Lee,190.0,75.0,R,R,1962-04-10,1971-09-26,aarot101,aaronto01
3,aasedo01,1954.0,9.0,8.0,USA,CA,Orange,,,,,,,Don,Aase,Donald William,190.0,75.0,R,R,1977-07-26,1990-10-03,aased001,aasedo01
4,abadan01,1972.0,8.0,25.0,USA,FL,Palm Beach,,,,,,,Andy,Abad,Fausto Andres,184.0,73.0,L,L,2001-09-10,2006-04-13,abada001,abadan01
5,abadfe01,1985.0,12.0,17.0,D.R.,La Romana,La Romana,,,,,,,Fernando,Abad,Fernando Antonio,235.0,74.0,L,L,2010-07-28,2019-09-28,abadf001,abadfe01
6,abadijo01,1850.0,11.0,4.0,USA,PA,Philadelphia,1905.0,5.0,17.0,USA,NJ,Pemberton,John,Abadie,John W.,192.0,72.0,R,R,1875-04-26,1875-06-10,abadj101,abadijo01
7,abbated01,1877.0,4.0,15.0,USA,PA,Latrobe,1957.0,1.0,6.0,USA,FL,Fort Lauderdale,Ed,Abbaticchio,Edward James,170.0,71.0,R,R,1897-09-04,1910-09-15,abbae101,abbated01
8,abbeybe01,1869.0,11.0,11.0,USA,VT,Essex,1962.0,6.0,11.0,USA,VT,Colchester,Bert,Abbey,Bert Wood,175.0,71.0,R,R,1892-06-14,1896-09-23,abbeb101,abbeybe01
9,abbeych01,1866.0,10.0,14.0,USA,NE,Falls City,1926.0,4.0,27.0,USA,CA,San Francisco,Charlie,Abbey,Charles S.,169.0,68.0,L,L,1893-08-16,1897-08-19,abbec101,abbeych01


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19953 entries, 0 to 19952
Data columns (total 24 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   playerID      19953 non-null  object        
 1   birthYear     19839 non-null  float64       
 2   birthMonth    19671 non-null  float64       
 3   birthDay      19529 non-null  float64       
 4   birthCountry  19892 non-null  object        
 5   birthState    19403 non-null  object        
 6   birthCity     19780 non-null  object        
 7   deathYear     9825 non-null   float64       
 8   deathMonth    9824 non-null   float64       
 9   deathDay      9823 non-null   float64       
 10  deathCountry  9821 non-null   object        
 11  deathState    9771 non-null   object        
 12  deathCity     9816 non-null   object        
 13  nameFirst     19916 non-null  object        
 14  nameLast      19953 non-null  object        
 15  nameGiven     19916 non-null  object