#### Data Collection

I started by looking at the following sources:
- https://www.owgr.com/
- https://www.pgatour.com/stats/
- https://datagolf.com/

The only site whose robots.txt allowed scraping was pgatour.com, so it will be the primary source for this analysis. I also found the API that pgatour.com uses to serve its data, but scraping it directly was disallowed.
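A check like this can be automated with the standard library's `urllib.robotparser`. The sketch below is only illustrative: the rules in `EXAMPLE_ROBOTS_TXT` and the `example.com` URLs are made up and do not reflect the actual robots.txt of any site listed above, and `is_allowed` is a hypothetical helper name. To test a live site, one would instead call `set_url(...)` and `read()` on the parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /api/
Allow: /stats/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A stats page is permitted by the example rules, an API path is not.
print(is_allowed(EXAMPLE_ROBOTS_TXT, "https://example.com/stats/stat.120.html"))
print(is_allowed(EXAMPLE_ROBOTS_TXT, "https://example.com/api/players"))
```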

Plan for collecting data

1) Select six North American events
2) Find appropriate features that affect scoring average
3) Determine a base URL to loop through and scrape data
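Step 3 could look roughly like the sketch below. The URL pattern is the one used by the scraping helper in this notebook; `build_stat_url` is a hypothetical name, and exposing the year as a parameter is an assumption (the analysis itself only uses 2022).

```python
def build_stat_url(stat_cd: int, event_cd: str, year: int = 2022) -> str:
    """Build the stats page URL for one stat code and one tournament code."""
    return f"https://www.pgatour.com/stats/stat.{stat_cd}.y{year}.eon.t{event_cd}.html"


# Two of the six events, with the tournament codes used later in the notebook.
events = {"U.S. Open": "026", "The Memorial Tournament": "023"}

# One URL per event for stat code 120 (scoring average).
urls = {event: build_stat_url(120, code) for event, code in events.items()}
for event, url in urls.items():
    print(event, url)
```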


NOTES
- Each stat is a 4-day average
- Only players who make it to the final round (4 rounds) are considered



In [2]:
#IMPORTS
import pandas as pd
import requests
import time


Let's define a helper function that builds a data frame for a given stat code and title. It is important to rate limit our requests so that we don't overwhelm the servers.

In [3]:
def data_frame_builder(title: str, statCd: int) -> pd.DataFrame:
    allDfs = []
    
    tournamentMap =  {
    "U.S. Open" : '026',
    "The Memorial Tournament": '023',
    "RBC Canadian Open": '032',
    "AT&T Byron Nelson": '019',
    "Travelers Championship": '034',
    "Wells Fargo Championship": '480'
    }
    
    for event, eventCd in tournamentMap.items():
        #rate limit our requests
        time.sleep(2)
        
        #build the url and send a GET request for the html
        url = f'https://www.pgatour.com/stats/stat.{statCd}.y2022.eon.t{eventCd}.html'
        rs = requests.get(url)
        #stop early on an HTTP error instead of parsing an error page
        rs.raise_for_status()
        
        #use pandas to read <table> tags and generate a data frame (second table on the page is desired)
        scrapedDfs = pd.read_html(rs.text)
        currDf = scrapedDfs[1]
        
        #add event column for multi indexing
        currDf['EVENT'] = event
        
        #append df to list
        allDfs.append(currDf)
    
    completedDf = pd.concat(allDfs)
    
    #clean the NaN data
    cleanedDf = completedDf.dropna(axis=1, how='all')
    cleanedDf = cleanedDf.drop(cleanedDf.columns[0], axis = 1)
    
    return cleanedDf
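To make the clean-up at the end of the helper concrete, here is a small sketch on toy data. The column names (`RANK`, `PLAYER`, `AVG`, `EMPTY`) and all values are invented stand-ins, not the real layout of the scraped tables; only the `concat`, `dropna`, and `drop` steps mirror the function above.

```python
import pandas as pd

# Toy stand-ins for two scraped tables; names and values are invented.
df_a = pd.DataFrame({
    "RANK": [1, 2],
    "PLAYER": ["Player A", "Player B"],
    "AVG": [68.5, 69.0],
    "EMPTY": [float("nan"), float("nan")],
})
df_b = pd.DataFrame({
    "RANK": [1, 2],
    "PLAYER": ["Player C", "Player D"],
    "AVG": [70.25, 70.5],
    "EMPTY": [float("nan"), float("nan")],
})

# Tag each table with its event before stacking them.
df_a["EVENT"] = "Event A"
df_b["EVENT"] = "Event B"
combined = pd.concat([df_a, df_b])

# Drop columns that contain only NaN, then drop the first remaining column.
cleaned = combined.dropna(axis=1, how="all")
cleaned = cleaned.drop(cleaned.columns[0], axis=1)

print(list(cleaned.columns))
print(list(cleaned.index))
```

Note that `pd.concat` keeps each table's original row labels, so the index repeats across events; passing `ignore_index=True` to `pd.concat` would renumber the rows instead.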

The following data frames were collected from the 2022 season. Please refer to the column definitions below:

Scoring Average: Average score of a player over 4 rounds.

Driving Distance: The average distance off the tee, excluding par threes.

Driving Accuracy: The percentage of time a tee shot comes to rest in the fairway (regardless of club).

Greens in Regulation: The percent of time a player was able to hit the green in regulation (greens hit in regulation/holes played). Note: A green is considered hit in regulation if any portion of the ball is touching the putting surface after the GIR stroke has been taken. The GIR stroke is determined by subtracting 2 from par: the 1st stroke on a par 3, the 2nd on a par 4, and the 3rd on a par 5.
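As a worked illustration of the GIR definition, the sketch below computes the GIR stroke for each par and a GIR percentage. The function names are hypothetical and the 52-of-72 figure is an invented example, not data from this analysis.

```python
def gir_stroke(par: int) -> int:
    """Stroke by which the ball must be on the green: par minus 2."""
    return par - 2


def gir_percentage(greens_hit: int, holes_played: int) -> float:
    """Greens hit in regulation divided by holes played, as a percentage."""
    if holes_played <= 0:
        raise ValueError("holes_played must be positive")
    return 100.0 * greens_hit / holes_played


for par in (3, 4, 5):
    print(f"par {par}: GIR stroke is stroke {gir_stroke(par)}")

# Invented example: 52 greens hit over 4 rounds of 18 holes (72 holes).
print(round(gir_percentage(52, 72), 2))
```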

Scrambling: The percent of time a player misses the green in regulation, but still makes par or better.

Sand Saves: The percent of time a player was able to get 'up and down' once in a greenside sand bunker (regardless of score). Note: 'Up and down' indicates it took the player 2 shots or less to put the ball in the hole from that point.

Putts Per Round: The average number of putts per round.

In [18]:
statsMap = {
  "scoring_average" : 120,
  "driving_distance" : 101,
  "driving_accuracy" : 102,
  "greens_in_regulation":103,
  "proximity_to_the_hole": 331,
  "scrambling": 130,
  "sand_saves": 111,
  "putts_per_round" : 119
}

#loop through the map and write one csv per stat
for title, val in statsMap.items():
    newDf = data_frame_builder(title, val)
    newDf.to_csv(f"../data/raw_data/{title}.csv")
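One practical detail: `to_csv` does not create missing directories, so the output folder has to exist before the loop runs. A minimal sketch of how that could be ensured with `pathlib` follows; `ensure_output_dir` is a hypothetical helper, and the example runs inside a temporary directory so it does not touch the real `../data/raw_data` folder.

```python
import tempfile
from pathlib import Path


def ensure_output_dir(path) -> Path:
    """Create the directory (and any missing parents) if needed and return it."""
    out_dir = Path(path)
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir


# Demonstrate inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = ensure_output_dir(Path(tmp) / "data" / "raw_data")
    created = raw_dir.is_dir()
    # Calling it a second time is harmless because of exist_ok=True.
    ensure_output_dir(raw_dir)
    print(created)
```

In the notebook itself, the equivalent call would be `ensure_output_dir("../data/raw_data")` before the loop.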