# Background and Motivations

As more advanced metrics like WAR and ERA+ have overtaken traditional stats in Major League Baseball, so too have strokes gained and its variants come to dominate golf analytics. Strokes gained is a straightforward concept: before each shot, the player has an expected number of shots remaining on the hole, say 4.2 before teeing off on a moderate par 4. Suppose the player hits a great drive, after which they now expect to take only 2.8 shots (on average) to complete the hole. Then in just one shot, the expected number of shots has decreased by 4.2 - 2.8 = 1.4, and so their drive has gained them 0.4 shots (a decrease of 1.4 expected shots minus the 1 shot used to achieve the decrease). If we perform this kind of calculation for every shot across a round and then segment the data across different phases of the game, we can get an idea of where a player is gaining and losing shots. For example, a player might realize they are averaging 1.3 strokes gained per round off the tee while losing 0.7 shots per round with their putting. Time for them to go practice 6-footers. We should note that expected shots remaining is always computed with respect to some standard, typically the average PGA Tour player.
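
To make the arithmetic concrete, here is a minimal sketch of the per-shot calculation. The function names and the expected-shot values for the approach and putts are my own made-up illustrations (only the 4.2 and 2.8 come from the example above); a real implementation would look up expected shots from a baseline table.

```python
def strokes_gained(expected_before, expected_after, strokes_used=1):
    """Strokes gained for one shot: the drop in expected shots minus the shots used."""
    return expected_before - expected_after - strokes_used


def strokes_gained_by_category(shots):
    """Sum strokes gained per category.

    `shots` is an iterable of (category, expected_before, expected_after) tuples.
    """
    totals = {}
    for category, before, after in shots:
        totals[category] = totals.get(category, 0.0) + strokes_gained(before, after)
    return totals


# The drive from the text: 4.2 expected shots before, 2.8 after.
drive = strokes_gained(4.2, 2.8)

# A hypothetical par 4 played in four shots (expected values are invented).
hole = [
    ("off_the_tee", 4.2, 2.8),
    ("approach", 2.8, 1.9),
    ("putting", 1.9, 1.0),
    ("putting", 1.0, 0.0),
]
totals = strokes_gained_by_category(hole)
print(round(drive, 2), {k: round(v, 2) for k, v in totals.items()})
```

The per-category totals always add up to the starting expectation minus the shots actually taken, which is why the categories can be read as a decomposition of the player's score against the baseline.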


Since Mark Broadie's introduction of strokes gained around 2011, these stats have become the primary lens through which PGA Tour players' skillsets are assessed, at least among the analytically inclined. For example, when discussing the best drivers of the ball, the stat of total driving (ranking players by the average of their rank in the driving distance and fairway % categories) has given way to strokes gained: off the tee. In addition, strokes gained has been revelatory in explaining how the best players in the world gain an edge. The common adage of 'drive for show, putt for dough' proved to be more or less completely wrong, at least at the PGA Tour level, where it turns out that the very best players are almost all quite adept off the tee, and strokes gained from top drivers are typically both greater and more stable (both event-to-event and year-to-year) than those from top putters. Likewise, it turns out, contrary to popular wisdom at the time, that the approach category (think iron shots from 100-225 yards) actually plays the largest role of any phase of the game. Tiger Woods' ungodly numbers in the strokes gained: approach category contributed most to his dominance in the 2000s.

All that said, I have always wondered if the fancy strokes gained metrics, which require complicated data collection and very large databases, are more or less recoverable from the conventional statistics we have always used. For example, I imagine strokes gained: off the tee is roughly just some affine function of average distance and fairway %. Then, once you have a good estimate for strokes gained: off the tee, could you combine that with average greens in regulation and proximity to the hole to approximate strokes gained: approach, and so on? If that were the case, then we could estimate strokes gained data for years before it was collected and for tournaments where it was unavailable (for example, the majors were slow to incorporate strokes gained relative to the PGA Tour). That kind of analysis would be particularly valuable in describing historic players like early-career Tiger or Jack Nicklaus. Right now, our sense of what made them so great is largely anecdotal.


# Conclusions and Limitations 

As mentioned in the README, a number of the inherently linear approaches (linear regression, linear SVR, ridge, lasso) are able to recreate the long game stats (off the tee, approach, and their sum) reasonably well, with R^2 of roughly 0.7, 0.75, and 0.85, respectively, in our final SVR model. Our features were just some core on-course statistics along with overall strokes gained, which requires nothing more than comparing scorecards to others in the field, along with a notion of strength of field. I refrained from using anything like wins, top 10s, etc., and the only stat that requires some intensive collection is proximity to the hole, which you'd have to pace off on every hole. So something like this could be incorporated reasonably easily at, say, the SEC Championship, long before they have the ShotLink data required for full strokes gained info.
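
For readers who want to see the shape of the simplest of these models, here is a rough sketch of an affine fit and its R^2 in plain Python. It is ordinary least squares only, not the SVR, ridge, or lasso models from the project, and the eight player seasons below are numbers I invented purely for illustration; the helper names are hypothetical too.

```python
def fit_affine(X, y):
    """Ordinary least squares for y ~ b0 + b1*x1 + ... via the normal equations."""
    rows = [[1.0] + list(x) for x in X]  # prepend an intercept column
    k = len(rows[0])
    # Augmented matrix [A^T A | A^T y]
    M = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(M[idx][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for idx in range(k):
            if idx != col:
                factor = M[idx][col] / M[col][col]
                M[idx] = [a - factor * b for a, b in zip(M[idx], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]


def predict(coef, x):
    """Evaluate the fitted affine function at one feature vector."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Invented seasons: (driving distance in yards, fairway %, strokes gained: off the tee)
seasons = [
    (285.0, 68.0, -0.30), (290.0, 64.0, -0.10), (295.0, 66.0, 0.15),
    (298.0, 60.0, 0.05), (302.0, 58.0, 0.20), (305.0, 62.0, 0.45),
    (310.0, 55.0, 0.40), (315.0, 57.0, 0.75),
]
X = [(dist, fw) for dist, fw, _ in seasons]
y = [sg for _, _, sg in seasons]

coef = fit_affine(X, y)
r2 = r_squared(y, [predict(coef, x) for x in X])
print(coef, r2)
```

Note that this R^2 is computed on the same rows used for fitting, so it is optimistic; the figures quoted above should be read as coming from the project's own evaluation, not from anything like this toy fit.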

The addition of the proximity feature from the second data set seemed key in raising the R^2 for strokes gained: approach from 0.64 to 0.74, but we should note that the comparison is not exactly apples to apples, because incorporating this feature requires us to consider only the seasons from 2015-2019 where the player finished in the top 200 of the FedEx Cup. That took us from roughly 1700 to roughly 700 data points. So perhaps we introduced some bias into the data with this merge, and proximity actually isn't telling us much that greens in regulation isn't already giving us.

I should also note that the data sets are a little bit old and incomplete, in that they only include certain tournaments (which? I am unsure) from the 2010-2018 time frame on which they draw. Unlike some other pro sports, like the NBA, which have easy-to-use APIs for historic data collection, the PGA Tour's website is notoriously difficult and has changed frequently, so existing scraping techniques no longer apply in 2024. Making my own scraper is beyond my current capabilities. DataGolf also has a paid API for strokes gained data, which seems to be the industry standard, but alas I am too cheap for that, and it's not clear to me whether it also has the conventional stats I wanted as my features.

# Existing and Future Work

Finally, I should note other similar work that has been done. The closest that I can find is from
https://www.fantasylabs.com/articles/conventional-data-and-strokes-gained-are-not-that-different/
where Mr. Colin Davy ran some linear regressions to show that strokes gained: tee to green and strokes gained: putting can be predicted fairly accurately using conventional statistics. That was valuable at the time, when strokes gained was available only for certain PGA Tour events. I'm not sure if that's still the case or what the deal is on the DP World Tour, LIV, the Korn Ferry Tour, the Pine Valley member-guest, etc.

The difference in my work here is that I've predicted many more statistics and also used other regression techniques (though they panned out similarly to the linear approach he used).

Mr. Khudabux carried out a similar analysis here:
https://github.com/NishadKhudabux/Data-Science-in-Golf-Strokes-Gained-vs-Traditional-Metrics
where he used tree models to predict scores from strokes gained and conventional statistics.

In terms of future work, it would be interesting to analyze what sort of players tend to be guessed well and poorly by our models. Are there players whose conventional stats consistently underrate their strokes gained performance? I suspect these would be players who avoid penalty shots or something. Maybe not Jordan Spieth. We could also try more regression methods, like tree-based methods or KNN, though KNN doesn't make much sense here: it doesn't really tell us about feature selection and won't have much hope of extrapolating to amateur players (who are 'far' from any professional player's stats). Moreover, given that we have multiple seasons from many players, KNN runs the risk of just letting us use, say, Phil Mickelson's 2014 and 2016 stats to predict his 2015 ones, which is not really in the spirit of the exercise. We could get around this issue with stratified sampling methods and some kind of constrained clustering. I think the real next project would be to do this carefully with a better, more modern data set, both in terms of not missing players and tournaments and in terms of having all the desired conventional stats. I really wish I had had 3-putt %.
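
One simple way to guard against that same-player leakage, sketched below, is to split by player rather than by row, so that all of a player's seasons land on the same side of the train/test split. This is my own illustration, not something done in the project: the function name, the `player` and `season` keys, and the placeholder player names are all hypothetical.

```python
import random


def split_by_player(rows, test_fraction=0.25, seed=0):
    """Split player-season rows so each player is entirely in train or entirely in test."""
    players = sorted({row["player"] for row in rows})
    rng = random.Random(seed)  # seeded so the split is reproducible
    rng.shuffle(players)
    n_test = max(1, round(len(players) * test_fraction))
    test_players = set(players[:n_test])
    train = [row for row in rows if row["player"] not in test_players]
    test = [row for row in rows if row["player"] in test_players]
    return train, test


# Placeholder player seasons; a real row would also carry the stats.
rows = [
    {"player": name, "season": season}
    for name in ["Player A", "Player B", "Player C", "Player D"]
    for season in (2014, 2015, 2016)
]
train, test = split_by_player(rows)
print(sorted({r["player"] for r in train}), sorted({r["player"] for r in test}))
```

If the models are built with scikit-learn, its `GroupKFold` splitter applies the same idea across cross-validation folds, with the player name passed as the group label.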

