# Who will win the 2020 Masters?

This analysis attempts to predict the 2020 Masters winner using multivariate linear regression.  The Masters is one of golf's four major championships.  Unlike American football or basketball, golf is played individually, and unlike tennis, the Masters is not decided by head-to-head matches.  The field is made up of 96 players invited based on a set of qualifying criteria.  The lowest score after four rounds of golf (72 holes) wins.


## Data set
I chose PGA Tour ShotLink data from 1980 to 2019 (www.shotlink.com and www.pgatour.com).  This is the largest set of data available on these metrics.  The data is manually collected by volunteers who follow every PGA Tour player and record every shot made in every PGA Tour tournament.  I was looking into scraping the data when I realized someone on Kaggle had already done this (https://www.kaggle.com/bradklassen/pga-tour-20102018-data).  I instead used this data set, formatted it, and cleaned out obvious errors.

<b>Limitations:</b>
- <b>Having enough data:</b> Unfortunately, the data was only available through year-end 2019.  I would have liked to analyze players' 2020 tournament performances leading up to the Masters and create a "trending score".  Here is the explanation of why the data was deleted (https://www.kaggle.com/data/174010).
- <b>Having the right variables:</b> While the PGA Tour tracks hundreds of variables, I would ideally like more data on how players play the course that hosts the Masters (Augusta National Golf Club in Georgia) and potentially how they play similar holes at other courses leading up to the tournament.


## Conclusion
After analyzing the lowest scorers at prior Masters tournaments and the top finishers over the last 3 years, I found that the P-values were high and the R-squared values were low for most variables.  The independent variables also behaved inconsistently between these two subsets of data.  This is not ideal for linear regression.  Other approaches (machine learning models, the Elo rating system) may be better suited to predicting the 2020 Masters winner.

After trying multiple iterations of potential models, I settled on a multivariate linear regression model that uses the 3 independent variables with the lowest P-values and highest R-squared values (GIR, Scrambling and Putts).  I fit it on a subset of the data that aggregates players' results from the last two Masters Tournaments (2018 and 2019).

<b>Multivariate linear regression model:</b>  Score = 59.0038 - (0.2305 x GIR) - (0.0149 x Scrambling) + (17.4504 x Putts)
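To make the equation concrete, here is a minimal sketch that plugs one row from the 2018 Masters table further down (Patrick Reed: GIR 66.67, Scrambling 62.5, Putts 1.444) into the coefficients quoted above. The helper name `predict_score` is introduced here purely for illustration and does not appear in the notebook.

```python
# Coefficients copied from the fitted model quoted above.
INTERCEPT = 59.0038
COEF_GIR = -0.2305
COEF_SCRAMBLING = -0.0149
COEF_PUTTS = 17.4504


def predict_score(gir: float, scrambling: float, putts: float) -> float:
    """Predicted average round score (hypothetical helper, for illustration only)."""
    return (INTERCEPT
            + COEF_GIR * gir
            + COEF_SCRAMBLING * scrambling
            + COEF_PUTTS * putts)


# Patrick Reed's 2018 Masters row: GIR 66.67 %, Scrambling 62.5 %, 1.444 putts per hole.
reed_2018 = predict_score(gir=66.67, scrambling=62.5, putts=1.444)
print(round(reed_2018, 2))  # roughly 67.9, versus his actual average of 68.25
```

A lower predicted score is better, so ranking players by this value in ascending order gives the projected order of finish.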

<b>Projected Winners</b> (ranked from most likely to least likely to win):
1. 	Rickie Fowler
2. 	Patrick Cantlay
3. 	Brooks Koepka
4. 	Jordan Spieth
5. 	Justin Thomas
6. 	Tiger Woods
7. 	Patton Kizzire
8. 	Dustin Johnson
9. 	Francesco Molinari
10. 	Jon Rahm
11. 	Patrick Reed
12. 	Aaron Wise
13. 	Bubba Watson
14. 	Jason Day
15. 	Tony Finau
16. 	Charley Hoffman
17. 	Louis Oosthuizen
18. 	Justin Rose
19. 	Webb Simpson
20. 	Andrew Landry

# Detailed steps shown below

## Importing, Formatting and Cleaning the Data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from scipy import stats
sns.set_palette("GnBu_d")

In [2]:
df = pd.read_csv('pga_data_historical.csv') #1980-2019 PGA Tour data, source: https://www.kaggle.com/bradklassen/pga-tour-20102018-data (as of August 2020, the author has taken the data down)

In [3]:
df.head()

Unnamed: 0,player_name,date,tournament,statistic,variable,value
0,Rik Massengale,1980-01-13,Bob Hope Desert Classic,Final Round Scoring Average,AVG,70.0
1,Bobby Nichols,1980-01-13,Bob Hope Desert Classic,Final Round Scoring Average,AVG,73.0
2,Andy North,1980-01-13,Bob Hope Desert Classic,Final Round Scoring Average,AVG,73.0
3,John Mahaffey,1980-01-13,Bob Hope Desert Classic,Final Round Scoring Average,AVG,73.0
4,Peter Jacobsen,1980-01-13,Bob Hope Desert Classic,Final Round Scoring Average,AVG,73.0


In [4]:
df.shape

(46147897, 6)

In [5]:
#Unearthing the available variables in the "statistic" column
df.statistic.unique()

array(['Final Round Scoring Average', 'All-Around Ranking',
       'Par 4 Birdie or Better Leaders', 'Scoring Average Before Cut',
       'Scoring Average (Actual)', 'Round 3 Scoring Average',
       'Official Money', 'Greens in Regulation Percentage',
       'Total Birdies', 'Birdie or Better Conversion Percentage',
       'Eagles (Holes per)', 'Total Driving', 'Scoring Average',
       'Par 3 Birdie or Better Leaders', 'Driving Distance',
       'Sand Save Percentage', 'Putts Per Round',
       'Percentage of potential money won', 'Driving Accuracy Percentage',
       'Par 5 Birdie or Better Leaders', 'Par Breakers', 'Ball Striking',
       'Birdie Average', 'Total Money (Official and Unofficial)',
       'Total Eagles', 'Front 9 Round 1 Scoring Average',
       'Back 9 Round 1 Scoring Average', 'Back 9 Par 5 Scoring Average',
       'Front 9 Round 2 Scoring Average', 'Front 9 Par 4 Scoring Average',
       'Back 9 Par 3 Scoring Average', 'Front 9 Par 3 Scoring Average',
       'Fron

In [6]:
df.statistic.value_counts()

Total Eagles                               256346
Total Money (Official and Unofficial)      256156
Total Birdies                              255870
Percentage of potential money won          254714
Final Round Scoring Average                252576
                                            ...  
Late Round 5 Scoring Average                  890
Tenth Tee Early Round 5 Scoring Average       528
First Tee Late Round 5 Scoring Average        468
Tenth Tee Late Round 5 Scoring Average        422
First Tee Early Round 5 Scoring Average       276
Name: statistic, Length: 442, dtype: int64

In [7]:
#Selecting a handful of the hundreds of variables in the "statistic" column

chosen_statistic = ['Greens in Regulation Percentage',
                    'Driving Accuracy Percentage',
                    'Scrambling',
                    'Scoring Average (Actual)',
                    'Official Money',
                    'Percentage of Available Purse Won',
                    'Driving Distance',
                    'Sand Save Percentage',
                    'Overall Putting Average']

#Definitions for each variable:
    #Greens in Regulation Percentage: The percent of time a player was able to hit the green in regulation (greens hit in regulation/holes played). Note: A green is considered hit in regulation if any portion of the ball is touching the putting surface after the GIR stroke has been taken. (The GIR stroke is determined by subtracting 2 from par (1st stroke on a par 3, 2nd on a par 4, 3rd on a par 5))
    #Driving Accuracy Percentage: The percentage of time a tee shot comes to rest in the fairway (regardless of club)
    #Scrambling: The percent of time a player misses the green in regulation, but still makes par or better.
    #Scoring Average (Actual): The average number of strokes per completed round.
    #Official Money: The total official money a player has earned year-to-date. Note: This is for PGA TOUR members only.
    #Percentage of Available Purse Won: For official events, the player's total money won as a percentage of the total purse available.
    #Driving Distance: The average number of yards per measured drive. These drives are measured on two holes per round. Care is taken to select two holes which face in opposite directions to counteract the effect of wind. Drives are measured to the point at which they come to rest regardless of whether they are in the fairway or not.
    #Sand Save Percentage: The percent of time a player was able to get 'up and down' once in a greenside sand bunker (regardless of score). Note: 'Up and down' indicates it took the player 2 shots or less to put the ball in the hole from that point.
    #Overall Putting Average: The average number of putts for all holes played (total putts / total holes played).
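The GIR definition above boils down to a small piece of arithmetic, sketched below. The function names are illustrative, and counting a green reached in fewer strokes than the GIR stroke as a hit is my reading of the definition rather than something the data dictionary states explicitly.

```python
def gir_stroke(par: int) -> int:
    """Stroke after which the ball must be on the green: par minus 2."""
    return par - 2


def hit_green_in_regulation(par: int, strokes_to_reach_green: int) -> bool:
    """True if the ball was on the putting surface by the GIR stroke (assumed rule)."""
    return strokes_to_reach_green <= gir_stroke(par)


def gir_percentage(holes: list[tuple[int, int]]) -> float:
    """holes is a list of (par, strokes_to_reach_green) pairs."""
    hits = sum(hit_green_in_regulation(par, strokes) for par, strokes in holes)
    return 100 * hits / len(holes)


# Made-up four-hole stretch: two greens hit in regulation, two missed.
sample_holes = [(4, 2), (3, 1), (5, 4), (4, 3)]
print(gir_percentage(sample_holes))  # 50.0
```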

In [8]:
#Adjusting the DataFrame
df = df[df['statistic'].isin(chosen_statistic)]
df = df[df['variable'] != 'RANK THIS WEEK'] #Multiple variables have the value 'RANK THIS WEEK', so unstacking will create duplicate columns of 'RANK THIS WEEK', causing an error ("Index contains duplicate entries, cannot reshape"). This line eliminates the error
df = df.drop(columns = ['variable'])
df.head()

Unnamed: 0,player_name,date,tournament,statistic,value
145,Bill Rogers,1980-01-13,Bob Hope Desert Classic,Scoring Average (Actual),70.6
146,Gil Morgan,1980-01-13,Bob Hope Desert Classic,Scoring Average (Actual),70.6
147,Bob Gilder,1980-01-13,Bob Hope Desert Classic,Scoring Average (Actual),70.6
148,Billy Kratzert,1980-01-13,Bob Hope Desert Classic,Scoring Average (Actual),70.6
149,Roger Maltbie,1980-01-13,Bob Hope Desert Classic,Scoring Average (Actual),70.6


In [9]:
#Unstacking the data in the DataFrame
df = df.set_index(['player_name', 'date', 'tournament','statistic'])['value'].unstack('statistic').reset_index()
df.head()

statistic,player_name,date,tournament,Driving Accuracy Percentage,Driving Distance,Greens in Regulation Percentage,Official Money,Overall Putting Average,Percentage of Available Purse Won,Sand Save Percentage,Scoring Average (Actual),Scrambling
0,A.J. Duncan,1988-04-03,KMart Greater Greensboro Open,58.93,283.5,58.33,"$1,810",,0.18,33.33,73.75,
1,A.J. Duncan,1990-03-11,Honda Classic,44.64,264.1,52.78,"$1,810",,0.18,14.29,77.0,
2,A.J. McInerney,2017-11-05,Shriners Hospitals for Children Open,50.0,317.8,76.39,,1.694,2.22,22.22,69.5,52.94
3,A.J. McInerney,2018-06-10,FedEx St. Jude Classic,50.0,298.5,56.94,,1.514,0.22,33.33,70.75,61.29
4,Aaron Baddeley,2000-03-12,Honda Classic,64.29,283.6,70.83,,1.625,,50.0,70.25,66.67
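The long-to-wide reshape used above, and the duplicate-entry error mentioned in the earlier comment, can be reproduced on a toy frame. This is only a sketch: the player names and values below are made up.

```python
import pandas as pd

# Toy long-format frame with the same columns as the real data (values are made up).
long_df = pd.DataFrame({
    "player_name": ["A", "A", "B", "B"],
    "date": ["2019-04-14"] * 4,
    "tournament": ["Masters Tournament"] * 4,
    "statistic": ["Scrambling", "Driving Distance"] * 2,
    "value": ["60.0", "295.1", "70.0", "301.4"],
})

keys = ["player_name", "date", "tournament", "statistic"]

# One row per (player, date, tournament), one column per statistic.
wide_df = long_df.set_index(keys)["value"].unstack("statistic").reset_index()
print(wide_df)

# Repeating one (player, date, tournament, statistic) key makes unstack fail.
dup_df = pd.concat([long_df, long_df.iloc[[0]]], ignore_index=True)
unstack_failed = False
try:
    dup_df.set_index(keys)["value"].unstack("statistic")
except ValueError as err:
    unstack_failed = True
    print("unstack failed:", err)
```

This is why the rows with 'RANK THIS WEEK' have to be removed before unstacking: each index combination must map to exactly one value.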


In [10]:
df.shape

(128160, 12)

In [11]:
df.describe()

statistic,player_name,date,tournament,Driving Accuracy Percentage,Driving Distance,Greens in Regulation Percentage,Official Money,Overall Putting Average,Percentage of Available Purse Won,Sand Save Percentage,Scoring Average (Actual),Scrambling
count,128160,128160,128160,124669.0,123306.0,124543.0,123674,87568.0,118148.0,123011.0,126125.0,87574.0
unique,2440,1697,304,333.0,1169.0,111.0,11277,206.0,772.0,134.0,129.0,425.0
top,Davis Love III,1995-07-23,PGA Championship,64.29,271.9,66.67,$,1.611,0.22,50.0,70.0,66.67
freq,546,178,3018,7030.0,322.0,8056.0,200,6397.0,7025.0,19184.0,6630.0,4253.0


## Fixing the Data types

In [12]:
print(df.dtypes)

statistic
player_name                          object
date                                 object
tournament                           object
Driving Accuracy Percentage          object
Driving Distance                     object
Greens in Regulation Percentage      object
Official Money                       object
Overall Putting Average              object
Percentage of Available Purse Won    object
Sand Save Percentage                 object
Scoring Average (Actual)             object
Scrambling                           object
dtype: object


In [13]:
df.head()

statistic,player_name,date,tournament,Driving Accuracy Percentage,Driving Distance,Greens in Regulation Percentage,Official Money,Overall Putting Average,Percentage of Available Purse Won,Sand Save Percentage,Scoring Average (Actual),Scrambling
0,A.J. Duncan,1988-04-03,KMart Greater Greensboro Open,58.93,283.5,58.33,"$1,810",,0.18,33.33,73.75,
1,A.J. Duncan,1990-03-11,Honda Classic,44.64,264.1,52.78,"$1,810",,0.18,14.29,77.0,
2,A.J. McInerney,2017-11-05,Shriners Hospitals for Children Open,50.0,317.8,76.39,,1.694,2.22,22.22,69.5,52.94
3,A.J. McInerney,2018-06-10,FedEx St. Jude Classic,50.0,298.5,56.94,,1.514,0.22,33.33,70.75,61.29
4,Aaron Baddeley,2000-03-12,Honda Classic,64.29,283.6,70.83,,1.625,,50.0,70.25,66.67


In [14]:
#Cleaning the "Official Money" column so we can analyze it
df['Official Money'] = df['Official Money'].str.replace(r'\$', '', regex = True) #raw string avoids an invalid-escape warning
df['Official Money'] = df['Official Money'].str.replace(',', '', regex = True)
df['Official Money'] = df['Official Money'].str.replace('O', '0', regex = True) #replace the letter O with the digit 0
df['Official Money'] = df['Official Money'].str.replace(' ', '', regex = True)
df['Official Money'] = df['Official Money'].str.strip()
df['Official Money'] = df['Official Money'].replace("", np.nan) #assign back rather than calling replace with inplace=True on a column
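As a compact illustration of the cleaning steps above, the sketch below runs equivalent replacements on a few made-up strings and then converts them with `pd.to_numeric`. Collapsing the symbol, comma and whitespace removal into one character class is my own shortcut, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Made-up samples: a normal value, a bare "$", a letter O typo with padding, a missing value.
raw = pd.Series(["$1,810", "$", " $2,25O,000 ", None])

cleaned = (
    raw.str.replace(r"[$,\s]", "", regex=True)  # drop "$", "," and whitespace in one pass
       .str.replace("O", "0", regex=False)      # letter O typed in place of zero
       .replace("", np.nan)                     # a bare "$" becomes a missing value
)
money = pd.to_numeric(cleaned)
print(money.tolist())
```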

In [15]:
#Converting data types from objects to float
df['Greens in Regulation Percentage']=df['Greens in Regulation Percentage'].astype('float')
df['Driving Accuracy Percentage']=df['Driving Accuracy Percentage'].astype('float')
df['Scrambling']=df['Scrambling'].astype('float')
df['Scoring Average (Actual)']=df['Scoring Average (Actual)'].astype('float')
df['Official Money']=df['Official Money'].astype('float')
df['Percentage of Available Purse Won']=df['Percentage of Available Purse Won'].astype('float')
df['Driving Distance']=df['Driving Distance'].astype('float')
df['Sand Save Percentage']=df['Sand Save Percentage'].astype('float')
df['Overall Putting Average']=df['Overall Putting Average'].astype('float')

In [16]:
print(df.dtypes)

statistic
player_name                           object
date                                  object
tournament                            object
Driving Accuracy Percentage          float64
Driving Distance                     float64
Greens in Regulation Percentage      float64
Official Money                       float64
Overall Putting Average              float64
Percentage of Available Purse Won    float64
Sand Save Percentage                 float64
Scoring Average (Actual)             float64
Scrambling                           float64
dtype: object


In [17]:
#Cleaning up the DataFrame - renaming certain columns
df.rename(columns = {'Greens in Regulation Percentage':'GIR',
                     'Driving Accuracy Percentage':'Fairways',
                     'Scrambling':'Scrambling',
                     'Scoring Average (Actual)':'Score',
                     'Official Money':'Money',
                     'Percentage of Available Purse Won':'Purse',
                     'Driving Distance':'Distance',
                     'Sand Save Percentage':'Sandies',
                     'Overall Putting Average':'Putts'}, inplace=True)

In [18]:
df.head()

statistic,player_name,date,tournament,Fairways,Distance,GIR,Money,Putts,Purse,Sandies,Score,Scrambling
0,A.J. Duncan,1988-04-03,KMart Greater Greensboro Open,58.93,283.5,58.33,1810.0,,0.18,33.33,73.75,
1,A.J. Duncan,1990-03-11,Honda Classic,44.64,264.1,52.78,1810.0,,0.18,14.29,77.0,
2,A.J. McInerney,2017-11-05,Shriners Hospitals for Children Open,50.0,317.8,76.39,,1.694,2.22,22.22,69.5,52.94
3,A.J. McInerney,2018-06-10,FedEx St. Jude Classic,50.0,298.5,56.94,,1.514,0.22,33.33,70.75,61.29
4,Aaron Baddeley,2000-03-12,Honda Classic,64.29,283.6,70.83,,1.625,,50.0,70.25,66.67


In [19]:
df.describe()

statistic,Fairways,Distance,GIR,Money,Putts,Purse,Sandies,Score,Scrambling
count,124669.0,123306.0,124543.0,123474.0,87568.0,118148.0,123011.0,126125.0,87574.0
mean,65.607781,275.949496,66.55593,48706.11,1.604443,1.40743,51.691787,70.651087,61.097847
std,10.898736,19.243917,8.459017,123111.4,0.084746,2.585814,24.156935,1.862798,11.346915
min,17.86,182.6,20.83,96.0,0.611,0.13,0.0,63.25,4.17
25%,58.93,261.5,61.11,4265.0,1.556,0.23,36.36,69.25,53.57
50%,66.07,275.3,66.67,13325.0,1.611,0.52,50.0,70.5,61.29
75%,73.21,289.8,72.22,39875.0,1.653,1.45,66.67,71.75,68.75
max,100.0,362.1,487.5,2250000.0,2.056,30.0,200.0,81.25,100.0


### Fixing errors in the data
There is at least one GIR value above 100% (the max in the table above is 487.5) and at least one Sandies value above 100% (the max in the table above is 200.0).  Neither is possible for a percentage, so I drop any rows where GIR or Sandies is above 100%:

In [20]:
df=df[df.GIR <= 100]
df=df[df.Sandies <= 100]
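One side effect of this kind of filter is worth seeing on toy data: a comparison against NaN evaluates to False, so rows with a missing value are dropped along with the out-of-range ones. The sketch below uses made-up numbers; the `lenient` variant is an alternative shown for contrast, not what the notebook does.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"GIR": [66.67, 487.5, np.nan, 100.0]})  # made-up values

strict = toy[toy.GIR <= 100]                      # drops 487.5 and the NaN row
lenient = toy[(toy.GIR <= 100) | toy.GIR.isna()]  # drops only 487.5

print(len(strict), len(lenient))  # 2 3
```

This matches the `describe()` output that follows, where the GIR and Sandies counts are identical after filtering.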

In [21]:
df.describe()

statistic,Fairways,Distance,GIR,Money,Putts,Purse,Sandies,Score,Scrambling
count,122818.0,121486.0,122848.0,118460.0,86385.0,113663.0,122848.0,122691.0,86391.0
mean,65.584648,276.036268,66.498876,48430.05,1.604132,1.391657,51.688042,70.651976,61.093638
std,10.904967,19.236936,8.347419,121399.8,0.084406,2.558169,24.150509,1.852941,11.312598
min,17.86,182.6,20.83,280.0,0.611,0.13,0.0,63.25,4.17
25%,58.93,261.5,61.11,4332.0,1.556,0.23,36.36,69.33,53.57
50%,66.07,275.4,66.67,13392.0,1.611,0.52,50.0,70.5,61.29
75%,73.21,289.9,72.22,39900.0,1.653,1.45,66.67,71.75,68.75
max,100.0,362.1,100.0,2250000.0,2.056,30.0,100.0,81.25,100.0


In [22]:
#Adding the year and month as columns to the dataframe
df['year'] = pd.DatetimeIndex(df['date']).year
df['month'] = pd.DatetimeIndex(df['date']).month
df.head()

statistic,player_name,date,tournament,Fairways,Distance,GIR,Money,Putts,Purse,Sandies,Score,Scrambling,year,month
0,A.J. Duncan,1988-04-03,KMart Greater Greensboro Open,58.93,283.5,58.33,1810.0,,0.18,33.33,73.75,,1988,4
1,A.J. Duncan,1990-03-11,Honda Classic,44.64,264.1,52.78,1810.0,,0.18,14.29,77.0,,1990,3
2,A.J. McInerney,2017-11-05,Shriners Hospitals for Children Open,50.0,317.8,76.39,,1.694,2.22,22.22,69.5,52.94,2017,11
3,A.J. McInerney,2018-06-10,FedEx St. Jude Classic,50.0,298.5,56.94,,1.514,0.22,33.33,70.75,61.29,2018,6
4,Aaron Baddeley,2000-03-12,Honda Classic,64.29,283.6,70.83,,1.625,,50.0,70.25,66.67,2000,3
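An equivalent way to derive these two columns, shown here only as a sketch on a two-row toy frame, is to parse the dates once with `pd.to_datetime` and read the parts through the `.dt` accessor.

```python
import pandas as pd

toy = pd.DataFrame({"date": ["1988-04-03", "2019-04-14"]})

dates = pd.to_datetime(toy["date"])  # parse the strings once and reuse the result
toy["year"] = dates.dt.year
toy["month"] = dates.dt.month
print(toy)
```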


# Predicting the 2020 Masters winner

## Reviewing who scored the lowest in Masters Tournaments over the last decade
For simplicity, this section reviews every player who averaged 69 or lower per round at a Masters Tournament since 2010.  Par is 72 at Augusta National, the course that hosts the Masters, so these players averaged at least 3 strokes under par per round.
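To put the threshold in tournament terms, this small sketch converts an average round score into a four-round total relative to par. The helper name is illustrative.

```python
PAR_PER_ROUND = 72  # par at Augusta National, as stated above
ROUNDS = 4


def total_relative_to_par(avg_round_score: float) -> float:
    """Four-round total relative to par (negative means under par)."""
    return (avg_round_score - PAR_PER_ROUND) * ROUNDS


print(total_relative_to_par(69))    # -12
print(total_relative_to_par(67.5))  # -18.0
```

So the cut-off of 69 corresponds to finishing the tournament at 12 under par or better.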

In [23]:
df_masters_winners = df.loc[(df['tournament'] == 'Masters Tournament') & (df.year > 2009) & (df.Score <= 69)]
df_masters_winners.nsmallest(50, 'Score')

statistic,player_name,date,tournament,Fairways,Distance,GIR,Money,Putts,Purse,Sandies,Score,Scrambling,year,month
68943,Jordan Spieth,2015-04-12,Masters Tournament,69.64,282.6,75.0,1800000.0,1.5,18.0,50.0,67.5,66.67,2015,4
98008,Phil Mickelson,2010-04-11,Masters Tournament,60.71,297.1,75.0,1350000.0,1.611,18.0,0.0,68.0,77.78,2010,4
94192,Patrick Reed,2018-04-08,Masters Tournament,73.21,299.3,66.67,1980000.0,1.444,18.0,50.0,68.25,62.5,2018,4
70128,Justin Rose,2015-04-12,Masters Tournament,78.57,293.8,75.0,880000.0,1.611,8.8,66.67,68.5,61.11,2015,4
100693,Rickie Fowler,2018-04-08,Masters Tournament,71.43,290.6,70.83,1188000.0,1.569,10.8,50.0,68.5,76.19,2018,4
21310,Charl Schwartzel,2011-04-10,Masters Tournament,66.07,278.4,68.06,1440000.0,1.486,19.2,42.86,68.5,78.26,2011,4
98091,Phil Mickelson,2015-04-12,Masters Tournament,69.64,293.9,70.83,880000.0,1.583,8.8,57.14,68.5,76.19,2015,4
69000,Jordan Spieth,2018-04-08,Masters Tournament,67.86,287.1,72.22,748000.0,1.625,6.8,66.67,68.75,70.0,2018,4
117798,Tiger Woods,2019-04-14,Masters Tournament,62.5,294.6,80.56,2070000.0,1.667,,66.67,68.75,50.0,2019,4
77684,Lee Westwood,2010-04-11,Masters Tournament,62.5,292.3,80.56,,1.681,10.8,60.0,68.75,64.29,2010,4


In [24]:
#Fairways
Winners_Fairways_model = sm.OLS.from_formula("Score ~ Fairways", data=df_masters_winners)
Winners_Fairways_results = Winners_Fairways_model.fit()
print("Score + Fairways | P-value:", Winners_Fairways_results.pvalues.iloc[1]/2, "| R-squared:", Winners_Fairways_results.rsquared)

#Distance
Winners_Distance_model = sm.OLS.from_formula("Score ~ Distance", data=df_masters_winners)
Winners_Distance_results = Winners_Distance_model.fit()
print("Score + Distance | P-value:", Winners_Distance_results.pvalues.iloc[1]/2, "| R-squared:", Winners_Distance_results.rsquared)

#GIR
Winners_GIR_model = sm.OLS.from_formula("Score ~ GIR", data=df_masters_winners)
Winners_GIR_results = Winners_GIR_model.fit()
print("Score + GIR | P-value:", Winners_GIR_results.pvalues.iloc[1]/2, "| R-squared:", Winners_GIR_results.rsquared)

#Putts
Winners_Putts_model = sm.OLS.from_formula("Score ~ Putts", data=df_masters_winners)
Winners_Putts_results = Winners_Putts_model.fit()
print("Score + Putts | P-value:", Winners_Putts_results.pvalues.iloc[1]/2, "| R-squared:", Winners_Putts_results.rsquared)

#Sandies
Winners_Sandies_model = sm.OLS.from_formula("Score ~ Sandies", data=df_masters_winners)
Winners_Sandies_results = Winners_Sandies_model.fit()
print("Score + Sandies | P-value:", Winners_Sandies_results.pvalues.iloc[1]/2, "| R-squared:", Winners_Sandies_results.rsquared)

#Scrambling
Winners_Scrambling_model = sm.OLS.from_formula("Score ~ Scrambling", data=df_masters_winners)
Winners_Scrambling_results = Winners_Scrambling_model.fit()
print("Score + Scrambling | P-value:", Winners_Scrambling_results.pvalues.iloc[1]/2, "| R-squared:", Winners_Scrambling_results.rsquared)

#Money
Winners_Money_model = sm.OLS.from_formula("Score ~ Money", data=df_masters_winners)
Winners_Money_results = Winners_Money_model.fit()
print("Score + Money | P-value:", Winners_Money_results.pvalues.iloc[1]/2, "| R-squared:", Winners_Money_results.rsquared)


#Note:
#pvalues.iloc[1] selects the slope's P-value by position (position 0 is the intercept).
#The P-value is the probability, assuming no true relationship, of seeing a slope at least as extreme as the one estimated.  It ranges from 0 to 1.
#The P>|t| value in the OLS summary is a two-sided P-value; it is divided by 2 here to give the one-sided value.
#Halving is only valid when the estimated slope has the sign that the one-sided hypothesis expects.
#P-values quoted in the summary of this notebook are one-sided.


Score + Fairways | P-value: 0.2240475033709749 | R-squared: 0.038883150137074884
Score + Distance | P-value: 0.15807395452360568 | R-squared: 0.06689997097484623
Score + GIR | P-value: 0.41398534430943346 | R-squared: 0.0032494005209643406
Score + Putts | P-value: 0.035671100457042304 | R-squared: 0.2006769600829218
Score + Sandies | P-value: 0.0019330609663386158 | R-squared: 0.4368900571096984
Score + Scrambling | P-value: 0.12970249401895406 | R-squared: 0.08391424355080668
Score + Money | P-value: 0.002362477426547019 | R-squared: 0.44554907438337543
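The halving of the two-sided P-value can be checked on a small made-up data set. The sketch below uses `scipy.stats.linregress`, whose `alternative` argument is available in SciPy 1.7 and later (an assumption about the installed version), instead of the statsmodels calls used in the notebook.

```python
from scipy import stats

# Made-up data with a clearly negative slope.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [9.1, 8.7, 8.9, 7.6, 7.9, 6.8, 6.9, 6.1]

two_sided = stats.linregress(x, y)                       # H1: slope != 0
less = stats.linregress(x, y, alternative="less")        # H1: slope < 0
greater = stats.linregress(x, y, alternative="greater")  # H1: slope > 0

print("slope is negative:", two_sided.slope < 0)
print(less.pvalue, two_sided.pvalue / 2)         # equal: the slope has the hypothesized sign
print(greater.pvalue, 1 - two_sided.pvalue / 2)  # equal: the slope has the opposite sign
```

In other words, dividing by 2 gives the one-sided P-value only for the direction in which the fitted slope actually points; for the other direction the one-sided value is close to 1.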


In [25]:
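df_masters_winners.select_dtypes(include='number').corr() #restrict to numeric columns; pandas 2.0+ raises an error on the text columns otherwise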

statistic,Fairways,Distance,GIR,Money,Putts,Purse,Sandies,Score,Scrambling,year,month
Fairways,1.0,0.076079,-0.132931,-0.009744,-0.133614,-0.157185,0.094395,-0.197188,-0.122191,0.263912,
Distance,0.076079,1.0,0.07307,-0.036656,0.297111,-0.036458,0.257322,0.25865,0.015398,0.624001,
GIR,-0.132931,0.07307,1.0,0.199351,0.755508,-0.190875,-0.007945,-0.057004,-0.375639,-0.08444,
Money,-0.009744,-0.036656,0.199351,1.0,-0.396587,0.913606,-0.428993,-0.667495,-0.045039,0.2381,
Putts,-0.133614,0.297111,0.755508,-0.396587,1.0,-0.66306,0.218628,0.44797,-0.284665,0.040814,
Purse,-0.157185,-0.036458,-0.190875,0.913606,-0.66306,1.0,-0.657227,-0.766933,0.466345,-0.082705,
Sandies,0.094395,0.257322,-0.007945,-0.428993,0.218628,-0.657227,1.0,0.660977,-0.23359,0.308651,
Score,-0.197188,0.25865,-0.057004,-0.667495,0.44797,-0.766933,0.660977,1.0,-0.28968,0.08235,
Scrambling,-0.122191,0.015398,-0.375639,-0.045039,-0.284665,0.466345,-0.23359,-0.28968,1.0,-0.068381,
year,0.263912,0.624001,-0.08444,0.2381,0.040814,-0.082705,0.308651,0.08235,-0.068381,1.0,
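The month column in the table above is empty because every row in this subset falls in April, and a column with no variation has an undefined correlation. A minimal sketch, using the Score and Putts values from the first four rows of the earlier table:

```python
import pandas as pd

toy = pd.DataFrame({
    "Score": [67.5, 68.0, 68.25, 68.5],
    "Putts": [1.5, 1.611, 1.444, 1.611],
    "month": [4, 4, 4, 4],  # constant: zero variance
})

corr = toy.corr()
print(corr)  # every entry involving "month" is NaN
```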


### Summary
- Nearly all the P-values are above an alpha of 0.05; the exceptions are Sandies (0.002) and Putts (0.036).
- Sandies (0.44) and Putts (0.20) have the highest R-squared values of the variables available.
- The average number of putts per hole was 1.681 or lower for every player in this subset.
- With the exception of Mickelson in 2010 and Schwartzel in 2011, sand saves were 50% or higher.  Sand saves are difficult to use as a restriction in an analysis, because they are only relevant if the player misses the green (the opposite of a GIR) and the ball lands in the sand.  In addition, it is difficult to apply a player's prior Sandies record from other tournaments to their 2020 Masters performance, because the sand used in Augusta's bunkers is not what is typically used at other golf courses (it is made of material from feldspar mines in North Carolina).
- Note that this summary excludes Money, because Money is the amount each player won at the Masters Tournament itself, so it is a result of the score rather than a predictor of it.
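The R-squared values and the correlation table are two views of the same numbers: for a regression with a single independent variable, R-squared equals the square of the Pearson correlation between that variable and Score. A quick check using figures copied from the outputs above:

```python
# Pearson correlations with Score, copied from the correlation table above.
corr_with_score = {"Sandies": 0.660977, "Putts": 0.44797, "Money": -0.667495}

# R-squared values printed by the single-variable regressions above.
reported_r2 = {
    "Sandies": 0.4368900571096984,
    "Putts": 0.2006769600829218,
    "Money": 0.44554907438337543,
}

for name, r in corr_with_score.items():
    print(f"{name}: r^2 = {r ** 2:.4f} | reported R-squared = {reported_r2[name]:.4f}")
```

The sign of the correlation is lost when squaring, so the correlation table is the place to check whether a variable moves Score up or down.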

## Analyzing prior Masters Tournament top finishers by year

### 2017

In [26]:
#2017
#Sergio Garcia won in a playoff
df_2017_winners = df.loc[(df['tournament'] == 'Masters Tournament') & (df['year']==2017)]
df_2017_winners = df_2017_winners.nsmallest(10, 'Score')
df_2017_winners.head(10)

statistic,player_name,date,tournament,Fairways,Distance,GIR,Money,Putts,Purse,Sandies,Score,Scrambling,year,month
70164,Justin Rose,2017-04-09,Masters Tournament,62.5,279.8,75.0,1188000.0,1.667,11.88,0.0,69.75,50.0,2017,4
110292,Sergio Garcia,2017-04-09,Masters Tournament,80.36,291.9,75.0,1980000.0,1.653,19.8,83.33,69.75,66.67,2017,4
21401,Charl Schwartzel,2017-04-09,Masters Tournament,69.64,280.8,68.06,748000.0,1.625,7.48,60.0,70.5,56.52,2017,4
85495,Matt Kuchar,2017-04-09,Masters Tournament,67.86,269.9,62.5,484000.0,1.556,4.84,33.33,70.75,70.37,2017,4
117410,Thomas Pieters,2017-04-09,Masters Tournament,60.71,293.0,66.67,,1.611,4.84,44.44,70.75,58.33,2017,4
95037,Paul Casey,2017-04-09,Masters Tournament,62.5,277.9,77.78,396000.0,1.764,3.96,0.0,71.0,56.25,2017,4
73210,Kevin Chappell,2017-04-09,Masters Tournament,60.71,288.5,70.83,354750.0,1.708,3.55,20.0,71.25,42.86,2017,4
104543,Rory McIlroy,2017-04-09,Masters Tournament,51.79,288.0,61.11,354750.0,1.611,3.55,42.86,71.25,75.0,2017,4
744,Adam Scott,2017-04-09,Masters Tournament,64.29,291.0,73.61,308000.0,1.722,3.08,57.14,71.5,47.37,2017,4
106010,Ryan Moore,2017-04-09,Masters Tournament,75.0,276.8,63.89,308000.0,1.653,3.08,50.0,71.5,61.54,2017,4
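Because `nsmallest(10, 'Score')` keeps exactly 10 rows, players tied with the last row can be cut off (the table above ends with two players at 71.5, and the output does not show whether others shared that score). The sketch below, on made-up data with placeholder names, shows the `keep="all"` option that retains such ties.

```python
import pandas as pd

toy = pd.DataFrame({
    "player_name": ["P1", "P2", "P3", "P4"],  # placeholder names
    "Score": [69.75, 71.5, 71.5, 71.5],
})

first_two = toy.nsmallest(2, "Score")              # default keep="first": exactly 2 rows
with_ties = toy.nsmallest(2, "Score", keep="all")  # keeps every row tied for the last place

print(len(first_two), len(with_ties))  # 2 4
```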


In [27]:
#Fairways
winners17_Fairways_model = sm.OLS.from_formula("Score ~ Fairways", data=df_2017_winners)
winners17_Fairways_results = winners17_Fairways_model.fit()
print("Score + Fairways | P-value:", winners17_Fairways_results.pvalues.iloc[1]/2, "| R-squared:", winners17_Fairways_results.rsquared)

#Distance
winners17_Distance_model = sm.OLS.from_formula("Score ~ Distance", data=df_2017_winners)
winners17_Distance_results = winners17_Distance_model.fit()
print("Score + Distance | P-value:", winners17_Distance_results.pvalues.iloc[1]/2, "| R-squared:", winners17_Distance_results.rsquared)

#GIR
winners17_GIR_model = sm.OLS.from_formula("Score ~ GIR", data=df_2017_winners)
winners17_GIR_results = winners17_GIR_model.fit()
print("Score + GIR | P-value:", winners17_GIR_results.pvalues.iloc[1]/2, "| R-squared:", winners17_GIR_results.rsquared)

#Putts
winners17_Putts_model = sm.OLS.from_formula("Score ~ Putts", data=df_2017_winners)
winners17_Putts_results = winners17_Putts_model.fit()
print("Score + Putts | P-value:", winners17_Putts_results.pvalues.iloc[1]/2, "| R-squared:", winners17_Putts_results.rsquared)

#Sandies
winners17_Sandies_model = sm.OLS.from_formula("Score ~ Sandies", data=df_2017_winners)
winners17_Sandies_results = winners17_Sandies_model.fit()
print("Score + Sandies | P-value:", winners17_Sandies_results.pvalues.iloc[1]/2, "| R-squared:", winners17_Sandies_results.rsquared)

#Scrambling
winners17_Scrambling_model = sm.OLS.from_formula("Score ~ Scrambling", data=df_2017_winners)
winners17_Scrambling_results = winners17_Scrambling_model.fit()
print("Score + Scrambling | P-value:", winners17_Scrambling_results.pvalues.iloc[1]/2, "| R-squared:", winners17_Scrambling_results.rsquared)

#Money
winners17_Money_model = sm.OLS.from_formula("Score ~ Money", data=df_2017_winners)
winners17_Money_results = winners17_Money_model.fit()
print("Score + Money | P-value:", winners17_Money_results.pvalues.iloc[1]/2, "| R-squared:", winners17_Money_results.rsquared)


#Note:
#pvalues.iloc[1] selects the slope's P-value by position (position 0 is the intercept).
#The P-value is the probability, assuming no true relationship, of seeing a slope at least as extreme as the one estimated.  It ranges from 0 to 1.
#The P>|t| value in the OLS summary is a two-sided P-value; it is divided by 2 here to give the one-sided value.
#Halving is only valid when the estimated slope has the sign that the one-sided hypothesis expects.
#P-values quoted in the summary of this notebook are one-sided.

Score + Fairways | P-value: 0.15559497919745446 | R-squared: 0.1274599195858377
Score + Distance | P-value: 0.49501838002690585 | R-squared: 2.0745460284632422e-05
Score + GIR | P-value: 0.13312562098999206 | R-squared: 0.1514993235012131
Score + Putts | P-value: 0.27370999525603185 | R-squared: 0.04700206273177776
Score + Sandies | P-value: 0.45663874723993125 | R-squared: 0.001576655314487141
Score + Scrambling | P-value: 0.4014889669113523 | R-squared: 0.008246901643381377
Score + Money | P-value: 0.00046596396807391336 | R-squared: 0.8106279188767553


### 2018

In [28]:
#2018 
df_2018_winners = df.loc[(df['tournament'] == 'Masters Tournament') & (df['year']==2018)]
df_2018_winners = df_2018_winners.nsmallest(10, 'Score')
df_2018_winners.head(10)

statistic,player_name,date,tournament,Fairways,Distance,GIR,Money,Putts,Purse,Sandies,Score,Scrambling,year,month
94192,Patrick Reed,2018-04-08,Masters Tournament,73.21,299.3,66.67,1980000.0,1.444,18.0,50.0,68.25,62.5,2018,4
100693,Rickie Fowler,2018-04-08,Masters Tournament,71.43,290.6,70.83,1188000.0,1.569,10.8,50.0,68.5,76.19,2018,4
69000,Jordan Spieth,2018-04-08,Masters Tournament,67.86,287.1,72.22,748000.0,1.625,6.8,66.67,68.75,70.0,2018,4
68287,Jon Rahm,2018-04-08,Masters Tournament,73.21,295.4,68.06,528000.0,1.556,4.8,75.0,69.25,60.87,2018,4
18650,Bubba Watson,2018-04-08,Masters Tournament,83.93,304.0,77.78,386375.0,1.694,3.51,60.0,69.75,62.5,2018,4
19797,Cameron Smith,2018-04-08,Masters Tournament,58.93,293.3,68.06,386375.0,1.583,3.51,40.0,69.75,69.57,2018,4
50235,Henrik Stenson,2018-04-08,Masters Tournament,76.79,284.9,70.83,386375.0,1.625,3.51,50.0,69.75,57.14,2018,4
104555,Rory McIlroy,2018-04-08,Masters Tournament,62.5,302.6,59.72,386375.0,1.514,3.51,80.0,69.75,75.86,2018,4
80513,Marc Leishman,2018-04-08,Masters Tournament,53.57,291.8,65.28,319000.0,1.569,2.9,66.67,70.0,72.0,2018,4
38593,Dustin Johnson,2018-04-08,Masters Tournament,66.07,304.9,68.06,286000.0,1.667,2.6,50.0,70.25,65.22,2018,4


In [29]:
#Fairways
winners18_Fairways_model = sm.OLS.from_formula("Score ~ Fairways", data=df_2018_winners)
winners18_Fairways_results = winners18_Fairways_model.fit()
print("Score + Fairways | P-value:", winners18_Fairways_results.pvalues.iloc[1]/2, "| R-squared:", winners18_Fairways_results.rsquared)

#Distance
winners18_Distance_model = sm.OLS.from_formula("Score ~ Distance", data=df_2018_winners)
winners18_Distance_results = winners18_Distance_model.fit()
print("Score + Distance | P-value:", winners18_Distance_results.pvalues.iloc[1]/2, "| R-squared:", winners18_Distance_results.rsquared)

#GIR
winners18_GIR_model = sm.OLS.from_formula("Score ~ GIR", data=df_2018_winners)
winners18_GIR_results = winners18_GIR_model.fit()
print("Score + GIR | P-value:", winners18_GIR_results.pvalues.iloc[1]/2, "| R-squared:", winners18_GIR_results.rsquared)

#Putts
winners18_Putts_model = sm.OLS.from_formula("Score ~ Putts", data=df_2018_winners)
winners18_Putts_results = winners18_Putts_model.fit()
print("Score + Putts | P-value:", winners18_Putts_results.pvalues.iloc[1]/2, "| R-squared:", winners18_Putts_results.rsquared)

#Sandies
winners18_Sandies_model = sm.OLS.from_formula("Score ~ Sandies", data=df_2018_winners)
winners18_Sandies_results = winners18_Sandies_model.fit()
print("Score + Sandies | P-value:", winners18_Sandies_results.pvalues.iloc[1]/2, "| R-squared:", winners18_Sandies_results.rsquared)

#Scrambling
winners18_Scrambling_model = sm.OLS.from_formula("Score ~ Scrambling", data=df_2018_winners)
winners18_Scrambling_results = winners18_Scrambling_model.fit()
print("Score + Scrambling | P-value:", winners18_Scrambling_results.pvalues.iloc[1]/2, "| R-squared:", winners18_Scrambling_results.rsquared)

#Money
winners18_Money_model = sm.OLS.from_formula("Score ~ Money", data=df_2018_winners)
winners18_Money_results = winners18_Money_model.fit()
print("Score + Money | P-value:", winners18_Money_results.pvalues.iloc[1]/2, "| R-squared:", winners18_Money_results.rsquared)


#Note:
#pvalues.iloc[1] selects the slope's P-value by position (position 0 is the intercept).
#The P-value is the probability, assuming no true relationship, of seeing a slope at least as extreme as the one estimated.  It ranges from 0 to 1.
#The P>|t| value in the OLS summary is a two-sided P-value; it is divided by 2 here to give the one-sided value.
#Halving is only valid when the estimated slope has the sign that the one-sided hypothesis expects.
#P-values quoted in the summary of this notebook are one-sided.

Score + Fairways | P-value: 0.19739577117732404 | R-squared: 0.09179765405919837
Score + Distance | P-value: 0.21361245814094232 | R-squared: 0.08040710586512434
Score + GIR | P-value: 0.3582503900023769 | R-squared: 0.01738941757680501
Score + Putts | P-value: 0.05322932035005617 | R-squared: 0.29252273556899966
Score + Sandies | P-value: 0.41845116329306054 | R-squared: 0.005621856116004853
Score + Scrambling | P-value: 0.4001296028615949 | R-squared: 0.008480017718602406
Score + Money | P-value: 0.00024682507331018165 | R-squared: 0.7983813756913664
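
The halving rule in the note above only holds when the fitted slope points in the hypothesized direction. As a minimal sketch (the helper name `one_sided_p` and the example numbers are hypothetical and not part of the original notebook), the conversion from a two-sided to a one-sided P-value could look like this:

```python
def one_sided_p(two_sided_p, coef, expected_sign):
    """Convert a two-sided p-value from a symmetric test (such as the OLS t-test)
    into a one-sided p-value.

    expected_sign: +1 if the alternative hypothesis is coef > 0,
                   -1 if the alternative hypothesis is coef < 0.
    """
    if expected_sign not in (1, -1):
        raise ValueError("expected_sign must be +1 or -1")
    half = two_sided_p / 2
    # Estimate lies in the hypothesized direction: keep the smaller tail.
    if coef * expected_sign > 0:
        return half
    # Estimate points the other way (or is exactly zero): use the larger tail.
    return 1 - half


# Hypothetical example: two-sided p = 0.10 with a positive fitted slope.
p_expected_positive = one_sided_p(0.10, coef=2.5, expected_sign=1)
p_expected_negative = one_sided_p(0.10, coef=2.5, expected_sign=-1)
print(p_expected_positive, p_expected_negative)
```

In the notebook's setting, the coefficient would come from the fitted results object (for example its `params` Series), and the expected sign would have to be chosen per variable before looking at the fit.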


### 2019

In [30]:
#2019
df_2019_winners = df.loc[(df['tournament'] == 'Masters Tournament') & (df['year']==2019)]
df_2019_winners = df_2019_winners.nsmallest(10, 'Score')
df_2019_winners.head(10)

statistic,player_name,date,tournament,Fairways,Distance,GIR,Money,Putts,Purse,Sandies,Score,Scrambling,year,month
117798,Tiger Woods,2019-04-14,Masters Tournament,62.5,294.6,80.56,2070000.0,1.667,,66.67,68.75,50.0,2019,4
17525,Brooks Koepka,2019-04-14,Masters Tournament,69.64,313.6,73.61,858667.0,1.639,,100.0,69.0,68.42,2019,4
38616,Dustin Johnson,2019-04-14,Masters Tournament,60.71,308.0,70.83,858667.0,1.569,,100.0,69.0,80.95,2019,4
127350,Xander Schauffele,2019-04-14,Masters Tournament,62.5,305.8,70.83,858667.0,1.597,,57.14,69.0,57.14,2019,4
40810,Francesco Molinari,2019-04-14,Masters Tournament,73.21,294.8,65.28,403938.0,1.458,,66.67,69.25,84.0,2019,4
54817,Jason Day,2019-04-14,Masters Tournament,66.07,296.5,70.83,403938.0,1.556,,33.33,69.25,71.43,2019,4
123063,Tony Finau,2019-04-14,Masters Tournament,67.86,316.3,66.67,403938.0,1.556,,63.64,69.25,79.17,2019,4
126086,Webb Simpson,2019-04-14,Masters Tournament,83.93,283.1,68.06,403938.0,1.556,,50.0,69.25,78.26,2019,4
68306,Jon Rahm,2019-04-14,Masters Tournament,76.79,308.4,70.83,310500.0,1.694,,71.43,69.5,80.95,2019,4
94045,Patrick Cantlay,2019-04-14,Masters Tournament,64.29,299.9,61.11,310500.0,1.472,,71.43,69.5,67.86,2019,4


In [31]:
#Fairways
winners19_Fairways_model = sm.OLS.from_formula("Score ~ Fairways", data=df_2019_winners)
winners19_Fairways_results = winners19_Fairways_model.fit()
print("Score + Fairways | P-value:", winners19_Fairways_results.pvalues.iloc[1]/2, "| R-squared:", winners19_Fairways_results.rsquared)

#Distance
winners19_Distance_model = sm.OLS.from_formula("Score ~ Distance", data=df_2019_winners)
winners19_Distance_results = winners19_Distance_model.fit()
print("Score + Distance | P-value:", winners19_Distance_results.pvalues.iloc[1]/2, "| R-squared:", winners19_Distance_results.rsquared)

#GIR
winners19_GIR_model = sm.OLS.from_formula("Score ~ GIR", data=df_2019_winners)
winners19_GIR_results = winners19_GIR_model.fit()
print("Score + GIR | P-value:", winners19_GIR_results.pvalues.iloc[1]/2, "| R-squared:", winners19_GIR_results.rsquared)

#Putts
winners19_Putts_model = sm.OLS.from_formula("Score ~ Putts", data=df_2019_winners)
winners19_Putts_results = winners19_Putts_model.fit()
print("Score + Putts | P-value:", winners19_Putts_results.pvalues.iloc[1]/2, "| R-squared:", winners19_Putts_results.rsquared)

#Sandies
winners19_Sandies_model = sm.OLS.from_formula("Score ~ Sandies", data=df_2019_winners)
winners19_Sandies_results = winners19_Sandies_model.fit()
print("Score + Sandies | P-value:", winners19_Sandies_results.pvalues.iloc[1]/2, "| R-squared:", winners19_Sandies_results.rsquared)

#Scrambling
winners19_Scrambling_model = sm.OLS.from_formula("Score ~ Scrambling", data=df_2019_winners)
winners19_Scrambling_results = winners19_Scrambling_model.fit()
print("Score + Scrambling | P-value:", winners19_Scrambling_results.pvalues.iloc[1]/2, "| R-squared:", winners19_Scrambling_results.rsquared)

#Money
winners19_Money_model = sm.OLS.from_formula("Score ~ Money", data=df_2019_winners)
winners19_Money_results = winners19_Money_model.fit()
print("Score + Money | P-value:", winners19_Money_results.pvalues.iloc[1]/2, "| R-squared:", winners19_Money_results.rsquared)


#Note:
#The P-value is the probability, assuming no true relationship, of seeing a slope at least this extreme.  It ranges from 0 to 1 (0% to 100%).
#The P>|t| value in the OLS summary is a two-sided P-value; dividing it by 2 gives the one-sided value, provided the slope has the hypothesized sign.
#P-values quoted in the summary of this notebook are the one-sided P-values.

Score + Fairways | P-value: 0.08540572714124003 | R-squared: 0.2205895551707081
Score + Distance | P-value: 0.47479371262309983 | R-squared: 0.000531671540754286
Score + GIR | P-value: 0.0035442625261141794 | R-squared: 0.616955220664926
Score + Putts | P-value: 0.14386293137365389 | R-squared: 0.13950084050055178
Score + Sandies | P-value: 0.24288836874753833 | R-squared: 0.06257362073341999
Score + Scrambling | P-value: 0.03227008456360929 | R-squared: 0.36455719703955947
Score + Money | P-value: 0.0004258759264315939 | R-squared: 0.769735230545112
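
For a regression with a single predictor and an intercept, R-squared equals the squared Pearson correlation between the two columns, which gives a quick way to sanity-check the numbers above without statsmodels. The sketch below recomputes the 2019 GIR result from the ten rows printed in the table; the helper `pearson_r` is introduced here for illustration only.

```python
from math import sqrt

# Score and GIR for the 2019 top 10, copied row by row from the table above.
score = [68.75, 69.0, 69.0, 69.0, 69.25, 69.25, 69.25, 69.25, 69.5, 69.5]
gir = [80.56, 73.61, 70.83, 70.83, 65.28, 70.83, 66.67, 68.06, 70.83, 61.11]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r = pearson_r(gir, score)
r_squared = r ** 2  # should agree with the "Score + GIR" R-squared printed above
print(round(r, 3), round(r_squared, 3))
```

The negative sign of `r` reflects that more greens in regulation went with lower (better) scores in this ten-player sample.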


### Summary

The results differ from those for the players who scored the lowest at the Masters over the last decade:
- Nearly all the P-values are above the alpha of 0.05; the exceptions are 2019 GIR (0.004) and 2019 Scrambling (0.032)
- The highest R-squared was 2019 GIR (0.617)
- Note that the summary analysis excludes Money, because Money here is the prize money each player won at that Masters Tournament, so it follows from the score rather than predicting it


|		|	2017 P-value	|	2017 R-squared	|	2018 P-value	|	2018 R-squared	|	2019 P-value	|	2019 R-squared	|
| :- | :-: | :-: | :-: | :-: | :-: | :-: |
|	Score + Fairways	|	0.156	|	0.127	|	0.197	|	0.092	|	0.085	|	0.221	|
|	Score + Distance	|	0.495	|	0.000	|	0.214	|	0.080	|	0.475	|	0.001	|
|	Score + GIR	|	0.133	|	0.151	|	0.358	|	0.017	|	0.004	|	0.617	|
|	Score + Putts	|	0.274	|	0.047	|	0.053	|	0.293	|	0.144	|	0.140	|
|	Score + Sandies	|	0.457	|	0.002	|	0.418	|	0.006	|	0.243	|	0.063	|
|	Score + Scrambling	|	0.401	|	0.008	|	0.400	|	0.008	|	0.032	|	0.365	|
|	Score + Money	|	0.000	|	0.811	|	0.000	|	0.798	|	0.000	|	0.770	|


## Summary of findings

So far the results are not ideal. For most variables, the P-values are high and the R-squared values are low.  The variables that stand out are also inconsistent between the lowest scorers at prior Masters tournaments and the top finishers over the last 3 years.

|		|	P-value	|	R-squared	|
| :- | :-: | :-: |
|	Top finishers at the 2019 Masters: Score + GIR	|	0.004	|	0.617	|
|	Top finishers at the 2019 Masters: Score + Scrambling	|	0.032	|	0.365	|
|	Lowest scorers from previous Masters: Score + Putts	|	0.036	|	0.201	|

## Reducing the data down to the last 2 years

In [32]:
df_2020winner = df[(df.year >= 2018) & (df.year <= 2020) & (df.Money >= 1)] #Also removing player's tournaments where they did not win any money
df_2020winner.head()

statistic,player_name,date,tournament,Fairways,Distance,GIR,Money,Putts,Purse,Sandies,Score,Scrambling,year,month
230,Aaron Baddeley,2018-02-04,Waste Management Phoenix Open,48.21,289.4,62.5,51060.0,1.556,0.74,70.0,69.0,74.07,2018,2
231,Aaron Baddeley,2018-02-11,AT&T Pebble Beach Pro-Am,61.82,288.3,63.89,16576.0,1.569,0.22,50.0,71.0,65.38,2018,2
232,Aaron Baddeley,2018-02-18,Genesis Open,46.43,307.1,48.61,133200.0,1.431,1.85,28.57,69.5,78.38,2018,2
233,Aaron Baddeley,2018-03-11,Valspar Championship,40.38,301.0,56.94,15431.0,1.597,0.24,60.0,71.5,70.97,2018,3
234,Aaron Baddeley,2018-04-01,Houston Open,67.86,290.3,69.44,13440.0,1.653,0.19,80.0,71.75,63.64,2018,4


## Aggregating the data by player

In [33]:
aggregation_functions = {'Score': 'mean', 'Fairways': 'mean', 'Distance': 'mean', 'GIR': 'mean', 'Putts': 'mean', 'Sandies': 'mean', 'Scrambling': 'mean', 'Money': 'sum'} #'player_name': 'first'
df_2020winner_agg = df_2020winner.groupby('player_name').aggregate(aggregation_functions) #group the filtered frame directly instead of relying on index alignment with df
df_2020winner_agg.nsmallest(30, 'Score')

player_name,Score,Fairways,Distance,GIR,Putts,Sandies,Scrambling,Money
Andres Romero,67.0,51.79,302.5,69.44,1.444,44.44,72.73,337560.0
Jerry Kelly,67.0,67.86,288.0,73.61,1.625,66.67,78.95,108500.0
Jason Gore,67.25,71.43,282.8,72.22,1.556,50.0,65.0,92960.0
John Merrick,67.75,82.14,290.4,79.17,1.583,0.0,73.33,97772.0
Jason Bohn,68.0,71.43,259.5,68.06,1.514,100.0,69.57,73660.0
Y.E. Yang,68.0,69.64,310.1,72.22,1.625,42.86,70.0,31040.0
Jim Herman,68.25,69.64,292.4,73.61,1.639,37.5,68.42,19488.0
Brendon Todd,68.4375,72.7675,280.5,65.97,1.49675,79.8075,74.7375,255132.0
Parker McLachlin,68.5,66.5175,281.375,72.2225,1.59,60.12,62.2175,233341.0
Stuart Appleby,68.5625,73.66,279.85,62.845,1.52775,79.615,76.2175,170244.0
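
The lowest averages in the output above appear to come from players with very few paid events in the window (their summed Money is small), so a per-player event count can help when reading the ranking. The following is a rough sketch using pandas named aggregation on a tiny made-up frame; `df_demo`, the `Events` column, and the `min_events` threshold are all assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature of the per-tournament table (values are illustrative only).
df_demo = pd.DataFrame({
    "player_name": ["Player A", "Player A", "Player A", "Player B"],
    "Score": [70.0, 71.0, 69.0, 67.0],
    "GIR": [65.0, 70.0, 66.0, 72.0],
    "Money": [50000.0, 20000.0, 80000.0, 100000.0],
})

# Same idea as the aggregation cell above, plus a count of events per player.
agg = df_demo.groupby("player_name").agg(
    Score=("Score", "mean"),
    GIR=("GIR", "mean"),
    Money=("Money", "sum"),
    Events=("Score", "count"),
)

min_events = 3  # assumed threshold; the right value is a judgment call
agg_filtered = agg[agg["Events"] >= min_events]
print(agg_filtered.nsmallest(30, "Score"))
```

Whether to filter at all is a design choice: it trades a smaller player pool for averages that are less sensitive to one good week.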


In [34]:
df_2020winner_agg.head()

player_name,Score,Fairways,Distance,GIR,Putts,Sandies,Scrambling,Money
Aaron Baddeley,70.214286,55.461905,293.066667,64.020476,1.570857,57.58381,64.538571,1153971.0
Aaron Wise,69.651786,64.709286,303.189286,70.138929,1.619536,46.785714,54.126071,4321645.0
Abraham Ancer,69.75,67.656486,296.311765,67.941622,1.594919,50.063514,63.862703,4277384.0
Adam Hadwin,69.986486,66.744865,290.232353,68.131081,1.596811,54.415946,61.842162,3961044.0
Adam Long,69.6875,69.343333,294.708333,63.889167,1.552083,56.904167,63.2075,1742343.0


## Comparing independent variables

#### Comparing the variables to "Score" (average number of strokes per completed round)

In [35]:
#Fairways
Score_Fairways_model = sm.OLS.from_formula("Score ~ Fairways", data=df_2020winner_agg)
Score_Fairways_results = Score_Fairways_model.fit()
print("Score + Fairways | P-value:", Score_Fairways_results.pvalues.iloc[1]/2, "| R-squared:", Score_Fairways_results.rsquared)

#Distance
Score_Distance_model = sm.OLS.from_formula("Score ~ Distance", data=df_2020winner_agg)
Score_Distance_results = Score_Distance_model.fit()
print("Score + Distance | P-value:", Score_Distance_results.pvalues.iloc[1]/2, "| R-squared:", Score_Distance_results.rsquared)

#GIR
Score_GIR_model = sm.OLS.from_formula("Score ~ GIR", data=df_2020winner_agg)
Score_GIR_results = Score_GIR_model.fit()
print("Score + GIR | P-value:", Score_GIR_results.pvalues.iloc[1]/2, "| R-squared:", Score_GIR_results.rsquared)

#Putts
Score_Putts_model = sm.OLS.from_formula("Score ~ Putts", data=df_2020winner_agg)
Score_Putts_results = Score_Putts_model.fit()
print("Score + Putts | P-value:", Score_Putts_results.pvalues.iloc[1]/2, "| R-squared:", Score_Putts_results.rsquared)

#Sandies
Score_Sandies_model = sm.OLS.from_formula("Score ~ Sandies", data=df_2020winner_agg)
Score_Sandies_results = Score_Sandies_model.fit()
print("Score + Sandies | P-value:", Score_Sandies_results.pvalues.iloc[1]/2, "| R-squared:", Score_Sandies_results.rsquared)

#Scrambling
Score_Scrambling_model = sm.OLS.from_formula("Score ~ Scrambling", data=df_2020winner_agg)
Score_Scrambling_results = Score_Scrambling_model.fit()
print("Score + Scrambling | P-value:", Score_Scrambling_results.pvalues.iloc[1]/2, "| R-squared:", Score_Scrambling_results.rsquared)

#Money
Score_Money_model = sm.OLS.from_formula("Score ~ Money", data=df_2020winner_agg)
Score_Money_results = Score_Money_model.fit()
print("Score + Money | P-value:", Score_Money_results.pvalues.iloc[1]/2, "| R-squared:", Score_Money_results.rsquared)


#Note:
#The P-value is the probability, assuming no true relationship, of seeing a slope at least this extreme.  It ranges from 0 to 1 (0% to 100%).
#The P>|t| value in the OLS summary is a two-sided P-value; dividing it by 2 gives the one-sided value, provided the slope has the hypothesized sign.
#P-values quoted in the summary of this notebook are the one-sided P-values.

Score + Fairways | P-value: 0.0046821950855813425 | R-squared: 0.022896560833514434
Score + Distance | P-value: 0.001852749598786064 | R-squared: 0.028577606637016806
Score + GIR | P-value: 1.9101645065828996e-27 | R-squared: 0.3292806086437271
Score + Putts | P-value: 1.2441167889370516e-09 | R-squared: 0.11480449519486935
Score + Sandies | P-value: 0.041579068038456554 | R-squared: 0.010245821008425526
Score + Scrambling | P-value: 8.401530269278429e-29 | R-squared: 0.34338699649251914
Score + Money | P-value: 8.47088898461933e-06 | R-squared: 0.06150984862995712


In [36]:
#Fairways
Money_Fairways_model = sm.OLS.from_formula("Money ~ Fairways", data=df_2020winner_agg)
Money_Fairways_results = Money_Fairways_model.fit()
print("Money + Fairways | P-value:", Money_Fairways_results.pvalues.iloc[1]/2, "| R-squared:", Money_Fairways_results.rsquared)

#Distance
Money_Distance_model = sm.OLS.from_formula("Money ~ Distance", data=df_2020winner_agg)
Money_Distance_results = Money_Distance_model.fit()
print("Money + Distance | P-value:", Money_Distance_results.pvalues.iloc[1]/2, "| R-squared:", Money_Distance_results.rsquared)

#GIR
Money_GIR_model = sm.OLS.from_formula("Money ~ GIR", data=df_2020winner_agg)
Money_GIR_results = Money_GIR_model.fit()
print("Money + GIR | P-value:", Money_GIR_results.pvalues.iloc[1]/2, "| R-squared:", Money_GIR_results.rsquared)

#Putts
Money_Putts_model = sm.OLS.from_formula("Money ~ Putts", data=df_2020winner_agg)
Money_Putts_results = Money_Putts_model.fit()
print("Money + Putts | P-value:", Money_Putts_results.pvalues.iloc[1]/2, "| R-squared:", Money_Putts_results.rsquared)

#Sandies
Money_Sandies_model = sm.OLS.from_formula("Money ~ Sandies", data=df_2020winner_agg)
Money_Sandies_results = Money_Sandies_model.fit()
print("Money + Sandies | P-value:", Money_Sandies_results.pvalues.iloc[1]/2, "| R-squared:", Money_Sandies_results.rsquared)

#Scrambling
Money_Scrambling_model = sm.OLS.from_formula("Money ~ Scrambling", data=df_2020winner_agg)
Money_Scrambling_results = Money_Scrambling_model.fit()
print("Money + Scrambling | P-value:", Money_Scrambling_results.pvalues.iloc[1]/2, "| R-squared:", Money_Scrambling_results.rsquared)

#Score
Money_Score_model = sm.OLS.from_formula("Money ~ Score", data=df_2020winner_agg)
Money_Score_results = Money_Score_model.fit()
print("Money + Score | P-value:", Money_Score_results.pvalues.iloc[1]/2, "| R-squared:", Money_Score_results.rsquared)


Money + Fairways | P-value: 0.1894646566085978 | R-squared: 0.002652295179736641
Money + Distance | P-value: 1.1417847103729464e-09 | R-squared: 0.1156851158568385
Money + GIR | P-value: 0.0058611230597433615 | R-squared: 0.02155628692597822
Money + Putts | P-value: 0.07946860796274853 | R-squared: 0.006784101132768172
Money + Sandies | P-value: 0.0253325205374173 | R-squared: 0.013016766319043027
Money + Scrambling | P-value: 0.0034739569925010694 | R-squared: 0.02468623478122689
Money + Score | P-value: 8.470888984614513e-06 | R-squared: 0.06150984862995723


In [37]:
df_2020winner_agg.corr()

Unnamed: 0,Score,Fairways,Distance,GIR,Putts,Sandies,Scrambling,Money
Score,1.0,-0.151316,-0.169049,-0.57383,0.338828,-0.101222,-0.585992,-0.248012
Fairways,-0.151316,1.0,-0.440184,0.3166,0.245902,-0.234139,-0.019394,-0.0515
Distance,-0.169049,-0.440184,1.0,0.208092,0.135373,-0.086258,-0.017175,0.340125
GIR,-0.57383,0.3166,0.208092,1.0,0.453204,-0.087838,0.040143,0.146821
Putts,0.338828,0.245902,0.135373,0.453204,1.0,-0.197015,-0.546536,-0.082366
Sandies,-0.101222,-0.234139,-0.086258,-0.087838,-0.197015,1.0,0.351386,0.114091
Scrambling,-0.585992,-0.019394,-0.017175,0.040143,-0.546536,0.351386,1.0,0.157119
Money,-0.248012,-0.0515,0.340125,0.146821,-0.082366,0.114091,0.157119,1.0
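
The notebook does not state how the predictors for the multivariate model were chosen. One plausible reading, offered here as an assumption, is that the three variables with the largest absolute correlation with Score were kept. A small sketch that ranks the Score row of the matrix above (values copied from the output; the variable names are introduced for illustration):

```python
# Score row of the correlation matrix above (Score's correlation with itself omitted).
corr_with_score = {
    "Fairways": -0.151316,
    "Distance": -0.169049,
    "GIR": -0.57383,
    "Putts": 0.338828,
    "Sandies": -0.101222,
    "Scrambling": -0.585992,
    "Money": -0.248012,
}

# Rank by strength of the linear relationship, ignoring its direction.
ranked = sorted(corr_with_score, key=lambda name: abs(corr_with_score[name]), reverse=True)
top_three = ranked[:3]
print(top_three)
```

Ranking by pairwise correlation ignores how the predictors relate to each other; for instance, Putts and Scrambling are themselves correlated (-0.547 in the matrix), which affects how their coefficients should be read in a joint model.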


## Creating the model using Score as the dependent variable

In [38]:
Model = sm.OLS.from_formula("Score ~ GIR + Scrambling + Putts", data=df_2020winner_agg)
Results = Model.fit()
print(Results.summary())

                            OLS Regression Results                            
Dep. Variable:                  Score   R-squared:                       0.812
Model:                            OLS   Adj. R-squared:                  0.810
Method:                 Least Squares   F-statistic:                     418.2
Date:                Sun, 11 Oct 2020   Prob (F-statistic):          5.75e-105
Time:                        11:20:19   Log-Likelihood:                -146.44
No. Observations:                 294   AIC:                             300.9
Df Residuals:                     290   BIC:                             315.6
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     64.4367      1.369     47.061      0.0

In [39]:
df_2020winner_agg['Prediction-Score'] = 64.4367 + (-0.1984*df_2020winner_agg.GIR) + (-0.0408*df_2020winner_agg.Scrambling) + (13.3478*df_2020winner_agg.Putts)
df_2020winner_agg.nsmallest(20, 'Prediction-Score') #use .nsmallest for lowest score

player_name,Score,Fairways,Distance,GIR,Putts,Sandies,Scrambling,Money,Prediction-Score
John Merrick,67.75,82.14,290.4,79.17,1.583,0.0,73.33,97772.0,66.867075
Andres Romero,67.0,51.79,302.5,69.44,1.444,44.44,72.73,337560.0,66.966643
Jason Gore,67.25,71.43,282.8,72.22,1.556,50.0,65.0,92960.0,68.225429
Brendon Todd,68.4375,72.7675,280.5,65.97,1.49675,79.8075,74.7375,255132.0,68.277282
Jerry Kelly,67.0,67.86,288.0,73.61,1.625,66.67,78.95,108500.0,68.301491
Jason Bohn,68.0,71.43,259.5,68.06,1.514,100.0,69.57,73660.0,68.303709
Anders Albertson,69.111667,62.798333,299.1,73.61,1.6225,75.0,70.568333,209712.0,68.610094
Geoff Ogilvy,68.75,85.71,293.4,75.0,1.653,33.33,72.22,39116.0,68.674037
Joey Garber,68.721667,67.858333,301.35,72.765,1.601167,44.445,64.25,355681.0,68.750776
Brett Stegmaier,68.75,57.144,298.78,73.334,1.6,71.906,60.95,211928.0,68.756954
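
As a worked check of the prediction formula, the sketch below applies the hard-coded coefficients from the cell above to a single row of the output (John Merrick). The function name `predict_score` is introduced here for illustration only.

```python
# Coefficients exactly as hard-coded in the prediction cell above.
INTERCEPT = 64.4367
COEF = {"GIR": -0.1984, "Scrambling": -0.0408, "Putts": 13.3478}


def predict_score(gir, scrambling, putts):
    """Linear prediction: intercept plus coefficient times value for each predictor."""
    return (
        INTERCEPT
        + COEF["GIR"] * gir
        + COEF["Scrambling"] * scrambling
        + COEF["Putts"] * putts
    )


# John Merrick's aggregated GIR, Scrambling and Putts from the output above.
merrick = predict_score(gir=79.17, scrambling=73.33, putts=1.583)
print(round(merrick, 6))  # compare with his Prediction-Score in the table
```

Retyping rounded coefficients is easy to get wrong. With a fitted statsmodels results object, `Results.params` holds the unrounded values and `Results.predict(df_2020winner_agg)` should produce the same column directly, up to rounding.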


## Creating the model using Money as the dependent variable

In [40]:
#Model = sm.OLS.from_formula("Money ~ GIR + Scrambling + Putts", data=df_2020winner)
#Results = Model.fit()
#print(Results.summary())

In [41]:
#df_2020winner['Prediction-Money'] = 946500 + (10020*df_2020winner.GIR) + (1727.52*df_2020winner.Scrambling) + (-1012000*df_2020winner.Putts)
#df_2020winner.nlargest(10, 'Prediction-Money') #use .nlargest for largest amount of Money to be won

In [42]:
Model = sm.OLS.from_formula("Money ~ GIR + Scrambling + Putts", data=df_2020winner_agg)
Results = Model.fit()
print(Results.summary())

                            OLS Regression Results                            
Dep. Variable:                  Money   R-squared:                       0.053
Model:                            OLS   Adj. R-squared:                  0.043
Method:                 Least Squares   F-statistic:                     5.401
Date:                Sun, 11 Oct 2020   Prob (F-statistic):            0.00125
Time:                        11:20:20   Log-Likelihood:                -4768.8
No. Observations:                 294   AIC:                             9546.
Df Residuals:                     290   BIC:                             9560.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   4.237e+06   9.22e+06      0.460      0.6

In [43]:
df_2020winner_agg['Prediction-Money'] = 4.237e+06 + (1.459e+05*df_2020winner_agg.GIR) + (4.044e+04*df_2020winner_agg.Scrambling) + (-9.039e+06*df_2020winner_agg.Putts)
df_2020winner_agg.nlargest(20, 'Prediction-Money') #use .nlargest for largest amount of Money to be won

player_name,Score,Fairways,Distance,GIR,Putts,Sandies,Scrambling,Money,Prediction-Score,Prediction-Money
John Merrick,67.75,82.14,290.4,79.17,1.583,0.0,73.33,97772.0,66.867075,4444631.0
Andres Romero,67.0,51.79,302.5,69.44,1.444,44.44,72.73,337560.0,66.966643,4257181.0
Jerry Kelly,67.0,67.86,288.0,73.61,1.625,66.67,78.95,108500.0,68.301491,3481062.0
Brendon Todd,68.4375,72.7675,280.5,65.97,1.49675,79.8075,74.7375,255132.0,68.277282,3355284.0
Jason Gore,67.25,71.43,282.8,72.22,1.556,50.0,65.0,92960.0,68.225429,3337814.0
Jason Bohn,68.0,71.43,259.5,68.06,1.514,100.0,69.57,73660.0,68.303709,3295319.0
Anders Albertson,69.111667,62.798333,299.1,73.61,1.6225,75.0,70.568333,209712.0,68.610094,3164705.0
Geoff Ogilvy,68.75,85.71,293.4,75.0,1.653,33.33,72.22,39116.0,68.674037,3158610.0
Joey Garber,68.721667,67.858333,301.35,72.765,1.601167,44.445,64.25,355681.0,68.750776,2978738.0
Brett Stegmaier,68.75,57.144,298.78,73.334,1.6,71.906,60.95,211928.0,68.756954,2938849.0


## Creating the model only using previous Masters data

### Reducing the data

In [44]:
df_2020winner_mastersonly = df_2020winner[df_2020winner['tournament'] == 'Masters Tournament'] #df_2020winner is already limited to 2018 onward and Money >= 1
df_2020winner_mastersonly.head()

statistic,player_name,date,tournament,Fairways,Distance,GIR,Money,Putts,Purse,Sandies,Score,Scrambling,year,month
315,Aaron Wise,2019-04-14,Masters Tournament,66.07,305.8,68.06,184000.0,1.597,,50.0,70.25,56.52,2019,4
455,Adam Hadwin,2018-04-08,Masters Tournament,69.64,291.9,69.44,93775.0,1.667,0.85,83.33,71.75,54.55,2018,4
760,Adam Scott,2018-04-08,Masters Tournament,69.64,290.3,68.06,63663.0,1.736,0.58,80.0,72.25,56.52,2018,4
778,Adam Scott,2019-04-14,Masters Tournament,71.43,310.3,75.0,161000.0,1.667,,33.33,70.5,66.67,2019,4
1233,Alex Noren,2019-04-14,Masters Tournament,76.79,291.0,55.56,25415.0,1.639,,33.33,74.0,56.25,2019,4


In [45]:
aggregation_functions = {'Score': 'mean', 'Fairways': 'mean', 'Distance': 'mean', 'GIR': 'mean', 'Putts': 'mean', 'Sandies': 'mean', 'Scrambling': 'mean', 'Money': 'sum'} #'player_name': 'first'
df_2020winner_mastersonly_agg = df_2020winner_mastersonly.groupby('player_name').aggregate(aggregation_functions) #group the filtered frame directly instead of relying on index alignment with df
df_2020winner_mastersonly_agg.nsmallest(20, 'Score')

player_name,Score,Fairways,Distance,GIR,Putts,Sandies,Scrambling,Money
Brooks Koepka,69.0,69.64,313.6,73.61,1.639,100.0,68.42,858667.0
Rickie Fowler,69.0,74.11,295.55,67.36,1.5205,41.665,72.71,1498500.0
Jon Rahm,69.375,75.0,301.9,69.445,1.625,73.215,70.91,838500.0
Patrick Cantlay,69.5,64.29,299.9,61.11,1.472,71.43,67.86,310500.0
Dustin Johnson,69.625,63.39,306.45,69.445,1.618,75.0,73.085,1144667.0
Jordan Spieth,69.75,66.965,289.85,70.83,1.618,76.19,66.82,855956.0
Tony Finau,69.75,61.61,303.3,63.195,1.549,41.82,70.62,689938.0
Bubba Watson,69.875,70.535,298.45,73.61,1.6805,70.0,65.34,611775.0
Patrick Reed,69.875,75.0,297.7,62.5,1.5275,50.0,62.915,2035488.0
Aaron Wise,70.25,66.07,305.8,68.06,1.597,50.0,56.52,184000.0


In [46]:
df_2020winner_mastersonly_agg.head()

player_name,Score,Fairways,Distance,GIR,Putts,Sandies,Scrambling,Money
Aaron Wise,70.25,66.07,305.8,68.06,1.597,50.0,56.52,184000.0
Adam Hadwin,71.75,69.64,291.9,69.44,1.667,83.33,54.55,93775.0
Adam Scott,71.375,70.535,300.3,71.53,1.7015,56.665,61.595,224663.0
Alex Noren,74.0,76.79,291.0,55.56,1.639,33.33,56.25,25415.0
Andrew Landry,72.0,80.36,287.6,65.28,1.583,66.67,60.0,37950.0


### Using Score as the dependent variable

In [47]:
Model = sm.OLS.from_formula("Score ~ GIR + Scrambling + Putts", data=df_2020winner_mastersonly_agg)
Results = Model.fit()
print(Results.summary())

                            OLS Regression Results                            
Dep. Variable:                  Score   R-squared:                       0.827
Model:                            OLS   Adj. R-squared:                  0.819
Method:                 Least Squares   F-statistic:                     98.98
Date:                Sun, 11 Oct 2020   Prob (F-statistic):           1.32e-23
Time:                        11:20:20   Log-Likelihood:                -52.837
No. Observations:                  66   AIC:                             113.7
Df Residuals:                      62   BIC:                             122.4
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     59.0038      2.348     25.126      0.0

In [48]:
df_2020winner_mastersonly_agg['Prediction-Score-MastersOnly'] = 59.0038 + (-0.2305*df_2020winner_mastersonly_agg.GIR) + (-0.0149*df_2020winner_mastersonly_agg.Scrambling) + (17.4504*df_2020winner_mastersonly_agg.Putts)
df_2020winner_mastersonly_agg.nsmallest(30, 'Prediction-Score-MastersOnly') #use .nsmallest for lowest score

player_name,Score,Fairways,Distance,GIR,Putts,Sandies,Scrambling,Money,Prediction-Score-MastersOnly
Rickie Fowler,69.0,74.11,295.55,67.36,1.5205,41.665,72.71,1498500.0,68.927274
Patrick Cantlay,69.5,64.29,299.9,61.11,1.472,71.43,67.86,310500.0,69.59382
Brooks Koepka,69.0,69.64,313.6,73.61,1.639,100.0,68.42,858667.0,69.618443
Jordan Spieth,69.75,66.965,289.85,70.83,1.618,76.19,66.82,855956.0,69.916614
Justin Thomas,70.5,66.96,307.3,75.0,1.674,55.0,64.24,395900.0,69.971094
Tiger Woods,70.5,58.035,294.2,73.615,1.646,50.0,50.0,2133663.0,70.013901
Patton Kizzire,70.5,71.43,299.9,66.67,1.569,75.0,62.5,161000.0,70.084793
Dustin Johnson,69.625,63.39,306.45,69.445,1.618,75.0,73.085,1144667.0,70.142508
Francesco Molinari,70.375,75.0,291.95,67.36,1.59,66.67,71.545,532088.0,70.157435
Jon Rahm,69.375,75.0,301.9,69.445,1.625,73.215,70.91,838500.0,70.297068


### Using Money as the dependent variable

In [49]:
Model = sm.OLS.from_formula("Money ~ GIR + Scrambling + Putts", data=df_2020winner_mastersonly_agg)
Results = Model.fit()
print(Results.summary())

                            OLS Regression Results                            
Dep. Variable:                  Money   R-squared:                       0.371
Model:                            OLS   Adj. R-squared:                  0.341
Method:                 Least Squares   F-statistic:                     12.21
Date:                Sun, 11 Oct 2020   Prob (F-statistic):           2.24e-06
Time:                        11:20:20   Log-Likelihood:                -934.56
No. Observations:                  66   AIC:                             1877.
Df Residuals:                      62   BIC:                             1886.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   2.666e+06   1.49e+06      1.792      0.0
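
The "Adj. R-squared" line in each summary penalizes R-squared for the number of predictors relative to the number of observations. As a sketch, the standard formula can be applied to the rounded values printed in the four summaries in this notebook; because the inputs are rounded to three decimals, the results should only match the reported values to within rounding. The function name and dictionary are introduced here for illustration.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept and n_predictors regressors."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# (R-squared, No. Observations, Df Model) as printed in the four OLS summaries.
reported = {
    "Score, all tournaments": (0.812, 294, 3),
    "Money, all tournaments": (0.053, 294, 3),
    "Score, Masters only": (0.827, 66, 3),
    "Money, Masters only": (0.371, 66, 3),
}

adjusted = {name: adjusted_r_squared(*values) for name, values in reported.items()}
for name, value in adjusted.items():
    print(f"{name}: {value:.3f}")
```

The penalty is larger for the Masters-only models because they have far fewer observations (66 versus 294) for the same three predictors.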

In [50]:
df_2020winner_mastersonly_agg['Prediction-Money-MastersOnly'] = 2.666e+06 + (4.859e+04*df_2020winner_mastersonly_agg.GIR) + (6042.6003*df_2020winner_mastersonly_agg.Scrambling) + (-3.605e+06*df_2020winner_mastersonly_agg.Putts)
df_2020winner_mastersonly_agg.nlargest(20, 'Prediction-Money-MastersOnly') #use .nlargest for largest amount of Money to be won

player_name,Score,Fairways,Distance,GIR,Putts,Sandies,Scrambling,Money,Prediction-Score-MastersOnly,Prediction-Money-MastersOnly
Rickie Fowler,69.0,74.11,295.55,67.36,1.5205,41.665,72.71,1498500.0,68.927274,896977.367813
Brooks Koepka,69.0,69.64,313.6,73.61,1.639,100.0,68.42,858667.0,69.618443,747549.612526
Patrick Cantlay,69.5,64.29,299.9,61.11,1.472,71.43,67.86,310500.0,69.59382,738825.756358
Jordan Spieth,69.75,66.965,289.85,70.83,1.618,76.19,66.82,855956.0,69.916614,678506.252046
Justin Thomas,70.5,66.96,307.3,75.0,1.674,55.0,64.24,395900.0,69.971094,663656.643272
Dustin Johnson,69.625,63.39,306.45,69.445,1.618,75.0,73.085,1144667.0,70.142508,649065.992925
Francesco Molinari,70.375,75.0,291.95,67.36,1.59,66.67,71.545,532088.0,70.157435,639390.238464
Patton Kizzire,70.5,71.43,299.9,66.67,1.569,75.0,62.5,161000.0,70.084793,626912.81875
Tiger Woods,70.5,58.035,294.2,73.615,1.646,50.0,50.0,2133663.0,70.013901,611252.865
Jon Rahm,69.375,75.0,301.9,69.445,1.625,73.215,70.91,838500.0,70.297068,610688.337273
