# Lecture 20: Baseball Prospectus team statistics
***

In this notebook, we'll work on computing the Pythagorean Winning Percentage and comparing ERA and FIP.

Start by loading NumPy and pandas using their common aliases, np and pd, along with Matplotlib for plotting.

In [9]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

local_path = 'data/winningPercentageData.csv'

# Select the path that works for you 
file_path = local_path 

# Load the data into a DataFrame 
dfTW = pd.read_csv(file_path)

# Inspect some of the data
dfTW.head()

Unnamed: 0,yearID,teamID,W,L,R,RA
0,2010,ARI,65,97,713,836
1,2010,ATL,91,71,738,629
2,2010,BAL,66,96,613,785
3,2010,BOS,89,73,818,744
4,2010,CHA,88,74,752,704


The data has columns for: 

- **yearID**: Year 
- **teamID**: Team
- **W**: Wins for season
- **L**: Losses for season
- **R**: Runs scored that season
- **RA**: Runs allowed that season

### Exercise 1 - How meaningful is Pythagorean WP?
***
Compare the actual winning percentage for teams at the end of the season to their Pythagorean score. How strong is the correlation? 
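
For reference, the Pythagorean estimate computed in this exercise uses runs scored $R$, runs allowed $RA$, and the exponent 1.83 that appears in the code cell for this exercise. The symbol $WP_{pyth}$ is a label chosen here for illustration:

$$WP_{pyth} = \frac{R^{1.83}}{R^{1.83} + RA^{1.83}}$$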

In [13]:
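# Note: W / L is the win-loss ratio (it exceeds 1 for winning teams);
# the winning percentage proper would be W / (W + L).
dfTW["WP"] = dfTW["W"] / dfTW["L"]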
dfTW["Pythagorean_WP"] = (dfTW["R"]**1.83) / (dfTW["R"]**1.83 + dfTW["RA"]**1.83)
dfTW.head()

Unnamed: 0,yearID,teamID,W,L,R,RA,WP,Pythagorean_WP
0,2010,ARI,65,97,713,836,0.670103,0.4277
1,2010,ATL,91,71,738,629,1.28169,0.572598
2,2010,BAL,66,96,613,785,0.6875,0.388744
3,2010,BOS,89,73,818,744,1.219178,0.543272
4,2010,CHA,88,74,752,704,1.189189,0.530139
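
The `WP` column above is the ratio `W / L`, which is why it exceeds 1 for winning teams. The winning percentage as usually defined is `W / (W + L)`. The sketch below uses only the five rows displayed above and computes that quantity next to the Pythagorean estimate. The helper names `winning_pct` and `pythagorean_wp` are illustrative and not part of the original notebook.

```python
# Rows copied from the dfTW.head() output: (teamID, W, L, R, RA).
rows = [
    ("ARI", 65, 97, 713, 836),
    ("ATL", 91, 71, 738, 629),
    ("BAL", 66, 96, 613, 785),
    ("BOS", 89, 73, 818, 744),
    ("CHA", 88, 74, 752, 704),
]


def winning_pct(wins, losses):
    """Fraction of games won (hypothetical helper)."""
    return wins / (wins + losses)


def pythagorean_wp(runs, runs_allowed, exponent=1.83):
    """Pythagorean estimate with the exponent used in this notebook."""
    scored = runs ** exponent
    allowed = runs_allowed ** exponent
    return scored / (scored + allowed)


# Map each team to (actual winning percentage, Pythagorean estimate).
results = {
    team: (winning_pct(w, l), pythagorean_wp(r, ra))
    for team, w, l, r, ra in rows
}

for team, (actual, pyth) in results.items():
    print(f"{team}: actual={actual:.3f}  pythagorean={pyth:.3f}")
```

Both columns now live on the same 0 to 1 scale, so they can be compared team by team as well as correlated.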


In [18]:
# Pearson correlation between WP and Pythagorean_WP
r = dfTW["WP"].corr(dfTW["Pythagorean_WP"], method="pearson")
print("Correlation: ", r)

# R^2
print("R^2: ", r ** 2)

# Correlation matrix of the numeric columns (teamID holds strings, so drop it too)
dfTW.drop(columns=["yearID", "teamID"]).corr(method="pearson")

Correlation:  0.9279679105390983
R^2:  0.8611244429903


Unnamed: 0,W,L,R,RA,WP,Pythagorean_WP
W,1.0,-0.999765,0.562137,-0.724604,0.989779,0.934464
L,-0.999765,1.0,-0.562261,0.724352,-0.989774,-0.934356
R,0.562137,-0.562261,1.0,0.036645,0.562999,0.655795
RA,-0.724604,0.724352,0.036645,1.0,-0.712847,-0.726632
WP,0.989779,-0.989774,0.562999,-0.712847,1.0,0.927968
Pythagorean_WP,0.934464,-0.934356,0.655795,-0.726632,0.927968,1.0
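
To make explicit what `.corr(method='pearson')` reports, here is a minimal pure-Python sketch of the Pearson coefficient and its square. It is applied only to the five `WP` and `Pythagorean_WP` values shown in the earlier `head()` output, so its result will not match the full-dataset figures printed above. The function name `pearson_r` is an assumption made for illustration.

```python
import math

# Values copied from the head() output earlier in this exercise.
wp = [0.670103, 1.28169, 0.6875, 1.219178, 1.189189]
pyth = [0.4277, 0.572598, 0.388744, 0.543272, 0.530139]


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations; the 1/n factors cancel.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)


r = pearson_r(wp, pyth)
r_squared = r ** 2
print("r on five rows:", r)
print("R^2 on five rows:", r_squared)
```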


### Exercise 2 - FIP vs. ERA
***
Imagine you are hanging out in CSEL, listening to the raucous conversations about sports that often occur in computer science. Anyway, you hear a wayward CSCI1300 student claim that FIP is a useless stat because it's the same as ERA. Any pitcher with a low ERA will also have a low FIP, and vice versa. You decide to investigate and possibly give him the mathematical smackdown of his life. Use the pitchingIn2013.csv file that magically appears before you, and on Moodle.

Use cFIP = 3.048.

$$FIP = \frac{13 \cdot HR + 3 \cdot (BB + HBP) - 2 \cdot K}{IP} + cFIP$$

Here $K$ is strikeouts, which the data file stores in the `SO` column.

In [19]:
local_path1 = 'data/pitchingIn2013.csv'

# Select the path that works for you 
file_path1 = local_path1 

# Load the data into a DataFrame 
dfP = pd.read_csv(file_path1)

# Inspect some of the data
dfP.head()

Unnamed: 0,playerID,teamID,HR,BB,HBP,SO,IP,ERA
0,aardsda01,NYN,7,19,4,36,39.6667,4.31
1,abadfe01,WAS,3,10,1,32,37.6667,3.35
2,aceveal01,BOS,8,22,0,24,37.0,4.86
3,adamsmi03,PHI,5,11,1,23,25.0,3.96
4,affelje01,SFN,2,17,4,21,33.6667,3.74


In [25]:
cFIP = 3.048
dfP["FIP"] = ((13 * dfP["HR"]) + (3*(dfP["BB"] + dfP["HBP"])) - (2 * dfP["SO"]))/dfP["IP"] + cFIP
dfP.head()

Unnamed: 0,playerID,teamID,HR,BB,HBP,SO,IP,ERA,FIP
0,aardsda01,NYN,7,19,4,36,39.6667,4.31,5.266486
1,abadfe01,WAS,3,10,1,32,37.6667,3.35,3.260389
2,aceveal01,BOS,8,22,0,24,37.0,4.86,6.345297
3,adamsmi03,PHI,5,11,1,23,25.0,3.96,5.248
4,affelje01,SFN,2,17,4,21,33.6667,3.74,4.444038
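
As a check on the vectorized expression, the following sketch evaluates the same formula for single pitchers, using the first two rows of the table above. The helper `fip` and its parameter names are hypothetical and not part of the original notebook.

```python
C_FIP = 3.048  # constant given in the exercise


def fip(hr, bb, hbp, so, ip, c_fip=C_FIP):
    """Fielding Independent Pitching for one pitcher (hypothetical helper)."""
    return (13 * hr + 3 * (bb + hbp) - 2 * so) / ip + c_fip


# Inputs copied from rows 0 and 1 of the dfP.head() output.
first_row = fip(hr=7, bb=19, hbp=4, so=36, ip=39.6667)
second_row = fip(hr=3, bb=10, hbp=1, so=32, ip=37.6667)

print(f"row 0 FIP: {first_row:.6f}")
print(f"row 1 FIP: {second_row:.6f}")
```

The two results should agree with the `FIP` column for those rows, which confirms that the DataFrame expression implements the formula term by term.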


In [23]:
#Correlation
# Drop rows whose FIP is not finite (e.g. inf or NaN from dividing by IP = 0)
dfP = dfP[np.isfinite(dfP["FIP"])]

print ("Correlation: ",dfP['ERA'].corr(dfP['FIP'], method = 'pearson'))

Correlation:  0.7898740671889416
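
The `np.isfinite` filter is needed because dividing by `IP` produces `inf` or `NaN` whenever a row has zero innings pitched, which is presumably the case for some rows in the file. A small sketch with made-up rows (not taken from `pitchingIn2013.csv`) shows the effect. The names `toy` and `clean` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows chosen to trigger each case; not real pitching data.
toy = pd.DataFrame({
    "HR": [1, 0, 0],
    "BB": [2, 1, 0],
    "HBP": [0, 0, 0],
    "SO": [5, 0, 0],
    "IP": [9.0, 0.0, 0.0],
})

c_fip = 3.048
numerator = 13 * toy["HR"] + 3 * (toy["BB"] + toy["HBP"]) - 2 * toy["SO"]
toy["FIP"] = numerator / toy["IP"] + c_fip

# Row 0 is an ordinary value, row 1 is nonzero / 0 (inf), row 2 is 0 / 0 (NaN).
print(toy["FIP"].tolist())

# np.isfinite is False for both inf and NaN, so only row 0 survives.
clean = toy[np.isfinite(toy["FIP"])]
print(len(clean))
```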


- After calculating FIP and correlating it with ERA, we get a Pearson correlation of about 0.79. The two statistics are clearly related, but a correlation well short of 1 means that a low ERA does not guarantee a low FIP, or vice versa. ERA and FIP are not the same statistic, so the student is wrong.
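
One way to put a number on how closely two statistics track each other is the coefficient of determination, the square of the Pearson coefficient. The short sketch below applies it to the two correlations printed in this notebook. The variable names are illustrative.

```python
# Correlations copied from the printed outputs in this notebook.
r_wp = 0.9279679105390983   # WP vs. Pythagorean_WP (Exercise 1)
r_fip = 0.7898740671889416  # ERA vs. FIP (Exercise 2)

# Squaring gives the share of variance a linear fit would account for.
r2_wp = r_wp ** 2
r2_fip = r_fip ** 2

print(f"Exercise 1 R^2: {r2_wp:.4f}")
print(f"Exercise 2 R^2: {r2_fip:.4f}")
```

Under a linear fit, FIP would account for roughly 62% of the variance in ERA, compared with about 86% for the pair in Exercise 1.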