
In 2003, Michael Lewis published “Moneyball: The Art of Winning an Unfair Game”.  It chronicled the efforts of Oakland Athletics General Manager Billy Beane to determine whether there were previously overlooked measures of a player’s value to a team beyond the obvious Batting Average (BA) and Fielding Percentage (FP).  Beane theorized that a team’s payroll could be used far more effectively by paying less money for these heretofore unrecognized and undervalued skills.  Before Beane’s tenure as GM of the Oakland A’s, team ownership dictated drastic cuts in payroll after years of high payroll and losing records.  Beane had to win with fewer resources.  His teams had great success from 2000-2003 with one of the lowest team payrolls in the league.  The following analysis attempts to determine the extent to which Beane’s theories have proven true.

 

    1. Description of Dataset:

www.fangraph.com has compiled MLB data from 1871 to the present.  The extensive collection includes statistics organized by league, team, player, and position.  In addition to expected data items such as games won/lost, batting averages, ERA, FP, and player salaries, more recently devised statistical measures such as UZR and OBP are included from the point at which they began to be collected.  Upon payment of a monthly or annual subscription, the interface below is used to generate reports, which can be exported as CSV files.



Obtaining the single dataset needed for this analysis required a broad export of 58 columns covering all 147 years, comprising 573 KB of data.
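
A minimal sketch of how such an export might be loaded and sanity-checked with pandas. The small inline CSV and its values are placeholders, not FanGraphs data, and the helper name `load_export` is hypothetical; the column names `year`, `wins`, and `payroll` follow those used in the code at the end of this report.

```python
import io
import pandas as pd

# Placeholder rows standing in for the exported CSV (values are made up).
SAMPLE_CSV = """year,team,wins,payroll
2002,AAA,103,40000000
2002,BBB,72,
2008,AAA,75,48000000
2008,BBB,97,44000000
"""

def load_export(source):
    """Read an exported report; return the raw frame and the complete rows."""
    raw = pd.read_csv(source)
    clean = raw.dropna()  # drop any row with a missing value
    return raw, clean

raw, clean = load_export(io.StringIO(SAMPLE_CSV))
print("rows before/after dropna:", len(raw), len(clean))
print("seasons covered:", clean["year"].min(), "to", clean["year"].max())
```

Comparing the row counts before and after `dropna()` shows how many team-seasons are lost to missing values, which matters for early years where fields such as payroll may not be populated.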



    2. Analytic Questions

        A. Is it true that winning teams have been able to win with lower payrolls because previously unrecognized and undervalued skills are now being considered?  That is, has the correlation between team payroll and wins changed from before Moneyball to five years later?

        B. Is it true that focusing on factors other than hitting, e.g. fielding, has produced more Wins/Season?

        C. Is it true that a focus on On Base Percentage (OBP) is more valuable than a focus on Batting Average (BA), as evidenced by a higher correlation of wins to OBP than to BA since the Moneyball concepts were published?

        D. Has a focus on skills other than hitting caused any fielding (by position) to receive more focus and improve?
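
Questions A through C all reduce to comparing a per-season correlation before and after Moneyball. A minimal sketch of that comparison follows, assuming a table with `year`, `wins`, and `payroll` columns; the helper `corr_by_year` is hypothetical and the numbers are made up for illustration, not real MLB data.

```python
import pandas as pd

def corr_by_year(df, x, y, years):
    """Return {year: Pearson r between columns x and y} for the given seasons."""
    out = {}
    for yr in years:
        season = df[df["year"] == yr]
        out[yr] = season[x].corr(season[y])  # Series.corr defaults to Pearson
    return out

# Made-up illustrative numbers (payroll in $M), not real MLB data.
toy = pd.DataFrame({
    "year":    [2002] * 4 + [2008] * 4,
    "wins":    [60, 75, 90, 100, 70, 95, 80, 85],
    "payroll": [40, 60, 80, 120, 50, 60, 90, 120],
})

r = corr_by_year(toy, "payroll", "wins", [2002, 2008])
for yr, value in r.items():
    print(yr, "payroll v. wins r =", round(value, 3))
```

The same helper applies to question C by passing `"OBP"` or `"BA"` in place of `"payroll"`. Pearson correlation is unchanged by min-max scaling, so it can be computed on raw or normalized columns alike.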


    3. Further Research

        A. How does the data trend for periods further out than five years?

Solution:  Obtain data from further out than is presently available.


        B. Will UZR (Ultimate Zone Rating) and OBP prove to be better indicators of ‘wins’ value than Fielding Percentage (FP) or BA?  UZR was invented with the concepts in Moneyball, and was not collected prior to 2003.  It is thought to be a better analysis of fielding ability than FP. Recreation of UZR data prior to 2003 might show if a focus on it is warranted and beneficial, in terms of wins.  A similar comparison of OBP v. Batting Average could be made as well.

Solution:  Construct/obtain UZR & OBP data from prior to the Moneyball era to compare to later years.
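
For the OBP half of that reconstruction, the standard formulas are BA = H / AB and OBP = (H + BB + HBP) / (AB + BB + HBP + SF). The sketch below computes both from basic counting statistics; the two season lines are hypothetical and are chosen only to show how a lower-BA hitter can have the higher OBP.

```python
def batting_average(hits, at_bats):
    """BA = H / AB."""
    return hits / at_bats

def on_base_percentage(hits, walks, hbp, at_bats, sac_flies):
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF)."""
    return (hits + walks + hbp) / (at_bats + walks + hbp + sac_flies)

# Hypothetical season lines, not real players.
contact_hitter = dict(hits=150, walks=20, hbp=2, at_bats=500, sac_flies=3)
patient_hitter = dict(hits=135, walks=90, hbp=5, at_bats=500, sac_flies=5)

for name, line in [("contact", contact_hitter), ("patient", patient_hitter)]:
    ba = batting_average(line["hits"], line["at_bats"])
    obp = on_base_percentage(**line)
    print(f"{name}: BA={ba:.3f} OBP={obp:.3f}")
```

Because OBP credits walks and hit-by-pitches, it can be rebuilt for earlier seasons wherever those counts were recorded, whereas UZR depends on batted-ball location data and cannot be derived from box-score totals in the same way.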

        C. Has an emphasis on non-traditional measures such as UZR and OBP produced higher salaries for players who have not had high Batting Averages?

Solution: Determine which positions, if any, have become more valuable since Moneyball in terms of wins, and ascertain whether those positions now command greater salaries (adjusted for inflation).

        D. Have the theories of Moneyball shifted payrolls upward overall, because other players now earn more while the formerly obvious superstars still earn large salaries?

Solution:  Analyze the percentage of salaries by position and OBP to determine if they have grown since MoneyBall.
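
One way the salary-share analysis could be approached is sketched below. It assumes a player-level table with `year`, `position`, and `salary` columns; those column names, the helper `payroll_share`, and the figures are all hypothetical, since the dataset described above is not known to be laid out this way.

```python
import pandas as pd

def payroll_share(df):
    """Fraction of each season's total salary paid to each position."""
    by_pos = df.groupby(["year", "position"])["salary"].sum()
    season_total = by_pos.groupby(level="year").transform("sum")
    return by_pos / season_total

# Made-up salaries in $M, not real MLB data.
toy = pd.DataFrame({
    "year":     [2002, 2002, 2002, 2008, 2008, 2008],
    "position": ["SS", "1B", "1B", "SS", "1B", "1B"],
    "salary":   [10.0, 20.0, 10.0, 20.0, 15.0, 5.0],
})

share = payroll_share(toy)
print(share)
```

Because each share is a ratio within a single season, no inflation adjustment is needed before comparing shares across years.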
    4. Data and Conclusions

        A. In 2002, the correlation between team payroll and Wins was fairly strong, although three American League teams won over 90 games with payrolls well below the median.



Five years after Moneyball, Wins are less closely tied to payroll: winning seasons have come with lower team payrolls.


        B. 2002 data reveal a consistent correlation between Fielding Percentage and Wins per season.




OBP data contradict the Moneyball theory that OBP should be more highly valued.



        C. UZR data reveals a marked improvement in fielding skills from the center field and third base positions. Shortstop skills have also significantly improved since MoneyBall.


In [None]:
# -*- coding: utf-8 -*-
"""
Created on Mon Jan 21 11:53:44 2019

@author: Jack Sonntag

Plots MLB data in an attempt to gauge the impact of 2003's "Moneyball" on 
baseball, specifically, Payroll v. Wins, Wins v. Fielding %, Wins v. OBP 
(On Base Percentage), and changes in positional UZR (Ultimate Zone Rating).

To determine if "Moneyball" has had an impact on the game, data from 2002, the 
year before its publication, is compared to data from 2008, five years after.

"""
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import minmax_scale

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=Warning)

INFLA_FACTOR = 1.19  # multiplier applied to 2002 payrolls to express them in 2008 dollars
# Wins v. Payroll from the year before Moneyball to 5 years after.
#

data_path = 'all bball data.csv'
df = pd.read_csv(data_path).dropna()  # drop rows with any missing value
print(df.describe())
df.dropna(axis=1, inplace=True)

# Sanity check: split the rows into two random halves and t-test their win
# totals; a large p-value means the halves do not differ significantly.
X_train, X_test = train_test_split(df, test_size=0.5)

x = stats.ttest_ind(X_train.wins, X_test.wins)
print("TTest stat ->", round(x[0], 4), "pval ->", round(x[1], 4))

def do_payroll(year):
    wins = df.loc[df['year'] == int(year)]['wins']
    sal = df.loc[df['year'] == int(year)]['payroll']
    if (year == '2002'):
        sal = sal*INFLA_FACTOR  #adjust for inflation
        
    sal_df= pd.DataFrame()
    sal_df['wins'] = minmax_scale(wins) #normalize data
    sal_df['payroll'] = minmax_scale(sal)
       
    hi_sal = int(sal.max()/1000000) # set ends of axis'
    lo_sal = int(sal.min()/1000000)
    hi_wins = int(wins.max())
    lo_wins = int(wins.min())

    plt.xticks([])  #negate old ticks & build new x,y axis & title
    plt.yticks([])
    xspaces = "                                              "
    xlab = str(lo_wins) + xspaces + str(int((lo_wins+hi_wins)/2)) + xspaces + str(hi_wins) + "\nWins(normalized)"
    plt.xlabel(xlab)
    yspaces = "                          "
    ylab = "Payroll M$ - 2008$$\n"+ str(lo_sal) + yspaces + str(int((lo_sal+hi_sal)/2)) + \
    yspaces + str(hi_sal) 
    plt.ylabel(ylab)

    #plot data
    title_str = ("2002 v. 2008  Wins v. Payroll")
    plt.title(title_str)
    if year == '2002':
        leg_lbl = year + " wins"
        sns.kdeplot(sal_df["wins"], linestyle=':',shade = True,label=leg_lbl,color='y')
        leg_lbl = year + " payroll"
        sns.kdeplot(sal_df["payroll"], linestyle='-',label=leg_lbl,shade = True,color='y')
    else:
        leg_lbl = year + " wins"
        sns.kdeplot(sal_df["wins"], linestyle=':',shade = True,label=leg_lbl,color='b')
        leg_lbl = year + " payroll"
        sns.kdeplot(sal_df["payroll"], linestyle='-',shade = True,label=leg_lbl,color='b')
    print(year ," - correlation of wins to payroll ->", sal_df['wins'].corr(sal_df['payroll']))
#    print("Suggesting payroll has become much less important to winning. Moneyball affirming")
do_payroll("2002")
do_payroll("2008")
print("Suggesting payroll has become much less important to winning. Moneyball affirming")
plt.show()

def do_fielding(year):
#Wins v. Fielding %
    
    data_path = 'all bball data.csv'
    df = pd.read_csv(data_path).dropna()#,skiprows=[0],axis=1, inplace=True
    df.dropna(axis=1, inplace=True)
        
    wins = df.loc[df['year'] == int(year)]['wins']
    fp = df.loc[df['year'] == int(year)]['FP']
       
    fp_df= pd.DataFrame()
    fp_df['wins'] = minmax_scale(wins) #normalize data
    fp_df['fp'] = minmax_scale(fp)
       
    hi_fp = int( (fp.max()*100))
    lo_fp = int( (fp.min()*100))
    hi_wins = int(wins.max())
    lo_wins = int(wins.min())
    
#    print("fp->",lo_fp,hi_fp)

    plt.xticks([])  #negate old ticks & build new x,y axis & title
    plt.yticks([])
    xspaces = "                                              "
    xlab = str(lo_wins) + xspaces + str(int((lo_wins+hi_wins)/2)) + xspaces + str(hi_wins) + "\nWins(normalized)"
    plt.xlabel(xlab)
    yspaces = "                          "
    ylab = "Fielding % \n"+ str(lo_fp) + yspaces + str(int((lo_fp+hi_fp)/2)) + \
    yspaces + str(hi_fp) 
    plt.ylabel(ylab)

#    title = " Wins v. Fielding % Green=2002, Yellow=2008"
    title_str = ("2002 v. 2008  Wins v. Fielding %")
    plt.title(title_str)
    if year == '2002':
        leg_lbl = year + " wins"
        sns.kdeplot(fp_df["wins"], linestyle=':',shade = True,label=leg_lbl,color='y')
        leg_lbl = year + " FP"
        sns.kdeplot(fp_df["fp"], linestyle='-',label=leg_lbl,shade = True,color='y')
    else:
        leg_lbl = year + " wins"
        sns.kdeplot(fp_df["wins"], linestyle=':',shade = True,label=leg_lbl,color='b')
        leg_lbl = year + " FP"
        sns.kdeplot(fp_df["fp"], linestyle='-',shade = True,label=leg_lbl,color='b')
    
do_fielding("2002")
do_fielding("2008")
print("Suggesting FP has become less important to winning. Moneyball contradictory")
plt.show()

# end fielding

def do_OBP(year):
    data_path = 'all bball data.csv'
    df = pd.read_csv(data_path).dropna()#,skiprows=[0],axis=1, inplace=True
    df.dropna(axis=1, inplace=True)
        
    wins = df.loc[df['year'] == int(year)]['wins']
    obp = df.loc[df['year'] == int(year)]['OBP']
       
    fp_df= pd.DataFrame()
    fp_df['wins'] = minmax_scale(wins) #normalize data
    fp_df['obp'] = minmax_scale(obp)
       
    hi_obp = int( (obp.max()*100))
    lo_obp = int( (obp.min()*100))
    hi_wins = int(wins.max())
    lo_wins = int(wins.min())
    
 #   print("obp->",lo_obp,hi_obp)

    plt.xticks([])  #negate old ticks & build new x,y axis & title
    plt.yticks([])
    xspaces = "                                              "
    xlab = str(lo_wins) + xspaces + str(int((lo_wins+hi_wins)/2)) + xspaces + str(hi_wins) + "\nWins(normalized)"
    plt.xlabel(xlab)
    yspaces = "                          "
    ylab = "OBP% \n"+ str(lo_obp) + yspaces + str(int((lo_obp+hi_obp)/2)) + \
    yspaces + str(hi_obp) 
    plt.ylabel(ylab)

    title_str = ("2002 v. 2008  Wins v. OBP %")
    plt.title(title_str)
    if year == '2002':
        leg_lbl = year + " wins"
        sns.kdeplot(fp_df["wins"], linestyle=':',shade = True,label=leg_lbl,color='y')
        leg_lbl = year + " OBP"  # was mislabeled " FP" in the 2002 legend
        sns.kdeplot(fp_df["obp"], linestyle='-',label=leg_lbl,shade = True,color='y')
    else:
        leg_lbl = year + " wins"
        sns.kdeplot(fp_df["wins"], linestyle=':',shade = True,label=leg_lbl,color='b')
        leg_lbl = year + " OBP"
        sns.kdeplot(fp_df["obp"], linestyle='-',shade = True,label=leg_lbl,color='b')
    print(year ," - correlation of wins to OBP % ->", (fp_df['wins'].corr(fp_df['obp']) ))
    
do_OBP("2002")
do_OBP("2008")
print("Suggesting OBP has become less important to winning. Moneyball contradictory")
plt.show()
#
# UZR comparison
# Illustrate whether 'Moneyball' has had an impact on defensive factors such
# as UZR

plt.ylabel("UZR %")
plt.yticks([6.0,8.0,10.0,12.0,14.0,16.0,18.0])
plt.title("UZR Changes 2003 - 2014")

df = pd.read_csv('UZR data.csv')  # read_csv already returns a DataFrame
plt.plot(df['1b'],color='r',label="1b")
plt.plot(df['2b'],color='y',label="2b")
plt.plot(df['3b'],color='g',label="3b")
plt.plot(df['ss'],color='b',label="ss")
plt.plot(df['lf'],color='c',label="lf")
plt.plot(df['rf'],color='k',label="rf")
plt.plot(df['cf'],color='m',label="cf")
plt.legend(loc="upper right")

yr_labels = ['2003', '2006', '2009', '2012']

plt.xticks([]) 
spaces = "                                                                                        "
xlbl = "2003"  + spaces + "2014\nYears"
plt.xlabel(xlbl)

plt.show()
