## Goal

Scrape the PGA Tour public facing website to create CSVs. This notebook is only for FedExCup Standings over all years. End goal is to have these CSVs be used to create a relational database that can be queried to conduct basic analysis of golfers on the PGA Tour and to compare to historical record. 

In [11]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import re
import os.path
from os import path

In [12]:
# Statistic to scrape (THIS NEEDS TO CHANGE EACH TIME)

stat_cat = 'Off the tee'
stat_id = '102'  ## Driving accuracy

In [13]:
!pwd

/Users/nicholasbeaudoin/Desktop/PGA-Tour-Analytics/notebooks/scrape jobs


In [16]:
file_path = '/Users/nicholasbeaudoin/Desktop/PGA-Tour-Analytics/data/'

In [17]:
# Import tournament list
df_tourney = pd.read_csv(file_path + 'setup/stat_id_{stat_id}_tournaments.csv'.format(stat_id=stat_id))
df_tourney.head()

Unnamed: 0,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,t044,t044,t045,t044,t044,t001,t057,t495,t060,t060,...,t045,t060,t060,t060,t060,t060,t060,t060,t060,t536
1,t040,t040,t044,t045,t045,t044,t001,t057,t057,t495,...,t493,t028,t028,t028,t028,t028,t028,t028,t028,t041
2,t031,t041,t042,t040,t040,t045,t041,t060,t001,t045,...,t464,t505,t505,t505,t505,t505,t505,t027,t027,t470
3,t041,t039,t041,t041,t041,t040,t045,t001,t045,t041,...,t047,t027,t027,t027,t027,t027,t027,t013,t013,t534
4,t039,t042,t040,t042,t042,t041,t044,t045,t495,t057,...,t060,t013,t013,t013,t013,t013,t013,t472,t033,t010


## Driving Accuracy - All years, all tournaments

In [None]:

for year in range(1980, 2021):

    for tournament in df_tourney[str(year)]:
        
        try:
            
            print(tournament)
            print(year)

            ### Get Title of Stats Page ###
            url = "https://www.pgatour.com/content/pgatour/stats/stat.{stat_id}.y{year}.eon.{tournament}.html".format(stat_id=stat_id, tournament=tournament, year=year)
            html = urlopen(url)
            soup = BeautifulSoup(html)
            bread_crumbs = soup.findAll('div', {'class' : 'breadcrumbs'})
            title = [crumb.text for crumb in bread_crumbs][0][87:].strip()
            print(title)

            ### Get tournament name ###
            url = "https://www.pgatour.com/content/pgatour/stats/stat.{stat_id}.y{year}.eon.{tournament}.html".format(stat_id=stat_id, year=year, tournament=tournament)
            html = urlopen(url)
            soup = BeautifulSoup(html)

            tourney_container = soup.findAll("div", {"class": "with-chevron"})[2]
            tourney_container
            tag = tourney_container.findAll("option", {"value" : tournament})[0]
            tourney_name = tag.text
            print(tourney_name)
            print(' ')

            ### Get column headers ###
            html = urlopen(url)
            soup = BeautifulSoup(html)

            # Extract table header rows
            soup.findAll('tr', limit=2)[1].findAll('th')    

            # Store column headers
            column_headers  = [th.getText() for th in 
                                            soup.findAll('tr', limit=2)[1].findAll('th')]

            ### Get data for dataframe ###

            data_rows = soup.findAll('tr')[2:]  # skip the first 2 header rows

            player_data = []  # create an empty list to hold all the data (in lists)

            for p in range(len(data_rows)):  # for each table row
                player_row = []  # create an empty list for each player

            # for each table data element from each table row
                for td in data_rows[p].findAll('td'):        
                    # get the text content and append to the player_row 
                    player_row.append(td.getText())        

                # then append each player to the player_data matrix
                player_data.append(player_row)

            # Convert list of lists to DF
            df = pd.DataFrame(player_data, columns=column_headers)

            # Add features
            df['YEAR'] = year
            df['Tournament'] = tourney_name

            ### Data Cleaning ###

            # Convert to numerics
            df = df.convert_objects(convert_numeric=True)

            # Clean player names
            df['PLAYER NAME'] = [player.replace('\n','') for player in df['PLAYER NAME']]

            # Drop RANK LAST WEEK
            df.drop('RANK LAST WEEK', axis=1, inplace=True)
            df.drop(df.columns[0], axis=1, inplace=True)


            ### Export ###
            if not os.path.isfile(file_path + 'data/statistics/{stat_cat}_{title}.csv'.format(stat_cat=stat_cat, title=title)):
                print('File does not exist --> CREATING')
                df.to_csv(file_path + 'data/statistics/{stat_cat}_{title}.csv'.format(stat_cat=stat_cat, title=title), header='column_names')

            else: 
                print('File exists --> appending data to file')
                df.to_csv(file_path + 'data/statistics/{stat_cat}_{title}.csv'.format(stat_cat=stat_cat, title=title), mode='a', header=False)
        
        except:
            pass

t044
1980
Driving Accuracy Percentage
Pensacola Open
 


For all other conversions use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.


t040
1980
Driving Accuracy Percentage
Southern Open
 
t031
1980
Driving Accuracy Percentage
Anheuser-Busch Golf Classic
 
t041
1980
Driving Accuracy Percentage
San Antonio Texas Open
 
t039
1980
Driving Accuracy Percentage
Hall Of Fame
 
t038
1980
Driving Accuracy Percentage
Pleasant Valley Jimmy Fund Classic
 
t037
1980
Driving Accuracy Percentage
B.C. Open
 
t035
1980
Driving Accuracy Percentage
Buick-Goodwrench Open
 
t036
1980
Driving Accuracy Percentage
World Series of Golf
 
t027
1980
Driving Accuracy Percentage
Manufacturers Hanover Westchester Classic
 
t033
1980
Driving Accuracy Percentage
PGA Championship
 
t181
1980
Driving Accuracy Percentage
IVB-Golf Classic
 
t034
1980
t030
1980
Driving Accuracy Percentage
Quad Cities Open
 
t029
1980
Driving Accuracy Percentage
Greater Milwaukee Open
 
t028
1980
Driving Accuracy Percentage
Western Open
 
t025
1980
Driving Accuracy Percentage
Danny Thomas Memphis Classic
 
t032
1980
Driving Accuracy Percentage
Canadian Open
 
t026
1980
Dr

t003
1982
Driving Accuracy Percentage
Phoenix Open
 
t002
1982
Driving Accuracy Percentage
Bob Hope Desert Classic
 
t001
1982
Driving Accuracy Percentage
Joe Garagiola-Tucson Open
 
nan
1982
Driving Accuracy Percentage
nan
1982
Driving Accuracy Percentage
nan
1982
Driving Accuracy Percentage
nan
1982
Driving Accuracy Percentage
nan
1982
Driving Accuracy Percentage
t044
1983
Driving Accuracy Percentage
Pensacola Open
 
t045
1983
Driving Accuracy Percentage
Walt Disney World Golf Classic
 
t040
1983
Driving Accuracy Percentage
Southern Open
 
t041
1983
Driving Accuracy Percentage
Texas Open
 
t042
1983
Driving Accuracy Percentage
Lajet Coors Classic
 
t047
1983
Driving Accuracy Percentage
Panasonic Las Vegas Pro Celebrity Classic
 
t038
1983
Driving Accuracy Percentage
Bank of Boston Classic
 
t037
1983
Driving Accuracy Percentage
B.C. Open
 
t036
1983
Driving Accuracy Percentage
World Series of Golf
 
t034
1983
Driving Accuracy Percentage
Sammy Davis Jr.-Greater Hartford Open
 
t035
19

Driving Accuracy Percentage
Hertz Bay Hill Classic
 
t010
1985
Driving Accuracy Percentage
Honda Classic
 
t008
1985
Driving Accuracy Percentage
Doral-Eastern Open
 
t004
1985
Driving Accuracy Percentage
Isuzu-Andy Williams San Diego Open
 
t006
1985
Driving Accuracy Percentage
Hawaiian Open
 
t005
1985
Driving Accuracy Percentage
Bing Crosby National Pro-Am
 
t007
1985
Driving Accuracy Percentage
Los Angeles Open
 
t003
1985
Driving Accuracy Percentage
Phoenix Open
 
t002
1985
Driving Accuracy Percentage
Bob Hope Classic
 
nan
1985
Driving Accuracy Percentage
nan
1985
Driving Accuracy Percentage
nan
1985
Driving Accuracy Percentage
nan
1985
Driving Accuracy Percentage
nan
1985
Driving Accuracy Percentage
nan
1985
Driving Accuracy Percentage
nan
1985
Driving Accuracy Percentage
t057
1986
Driving Accuracy Percentage
Tallahassee Open
 
t001
1986
Driving Accuracy Percentage
Seiko-Tucson Match Play Championship
 
t041
1986
Driving Accuracy Percentage
Vantage Championship
 
t045
1986
Drivin

t028
1988
Driving Accuracy Percentage
Beatrice Western Open
 
t022
1988
Driving Accuracy Percentage
Georgia-Pacific Atlanta Golf Classic
 
t026
1988
Driving Accuracy Percentage
U.S. Open Championship
 
t027
1988
Driving Accuracy Percentage
Manufacturers Hanover Westchester Classic
 
t024
1988
Driving Accuracy Percentage
Kemper Open
 
t023
1988
Driving Accuracy Percentage
Memorial Tournament
 
t021
1988
Driving Accuracy Percentage
Colonial National Invitation
 
t019
1988
Driving Accuracy Percentage
GTE Byron Nelson Golf Classic
 
t047
1988
Driving Accuracy Percentage
Panasonic Las Vegas Invitational
 
t020
1988
Driving Accuracy Percentage
Independent Insurance Agent Open
 
t018
1988
Driving Accuracy Percentage
USF&G Classic
 
t012
1988
Driving Accuracy Percentage
MCI Heritage Golf Classic
 
t014
1988
Driving Accuracy Percentage
Masters Tournament
 
t054
1988
Driving Accuracy Percentage
Deposit Guaranty Golf Classic
 
t013
1988
Driving Accuracy Percentage
KMart Greater Greensboro Open
 
