## Data Collection and Sorting

I scraped the data from the official PGA Tour website (http://www.pgatour.com/). The site publishes several tables of player statistics, each covering a different category. For each of the 9 tables, I issued a separate request to its URL and scraped the player names and their respective statistics. I then built a DataFrame from each page's statistics and the players they belong to, and finally merged all the DataFrames on player name into one large DataFrame containing each golfer's name and all of his statistics.
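
The per-table steps described above (find the name cells, find the statistic cells, clean both) can be sketched as one function. This is only an illustration: `parse_stat_table` and `SAMPLE_HTML` are hypothetical names, and the sample markup is made up to mimic the two CSS classes the notebook searches for.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates the classes used on the stats pages
SAMPLE_HTML = """
<table>
  <tr>
    <td class="player-name"><a href="#">Player&nbsp;One</a></td>
    <td class="hidden-small hidden-medium">20</td>
    <td class="hidden-small hidden-medium">8</td>
  </tr>
  <tr>
    <td class="player-name"><a href="#">Player&nbsp;Two</a></td>
    <td class="hidden-small hidden-medium">18</td>
    <td class="hidden-small hidden-medium"></td>
  </tr>
</table>
"""


def parse_stat_table(html):
    """Return (names, cells): player names and a flat, row-major list of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    names = [
        td.find("a").get_text().replace("\xa0", " ").strip()
        for td in soup.find_all("td", {"class": "player-name"})
    ]
    cells = [
        td.get_text(strip=True) or "0"  # empty cells become "0"
        for td in soup.find_all("td", {"class": "hidden-small hidden-medium"})
    ]
    return names, cells


names, cells = parse_stat_table(SAMPLE_HTML)
print(names)
print(cells)
```

With a helper like this, each of the 9 sections would differ only in its URL and in how the flat list of cells is split into columns.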

In [1]:
import requests
import time
from bs4 import BeautifulSoup

In [2]:
import numpy as np
import pandas as pd

In [3]:
req = requests.get("http://www.pgatour.com/stats/stat.138.html")
soup = BeautifulSoup(req.text, "html.parser")
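
Each page is requested with no status check and no pause between requests. A minimal sketch of a more defensive fetch loop follows; `fetch_pages` and its `get` parameter are hypothetical, with `get` expected to behave like `requests.get` (accept a `timeout` keyword and return an object with `raise_for_status()` and `.text`).

```python
import time


def fetch_pages(urls, get, delay=1.0, timeout=10):
    """Fetch each URL in turn and return a dict mapping url -> response text.

    `get` is any callable shaped like `requests.get`; it is passed in
    (an assumption of this sketch) so the helper can be tested offline.
    """
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # pause between requests to go easy on the server
        response = get(url, timeout=timeout)
        response.raise_for_status()  # fail on 4xx/5xx instead of parsing an error page
        pages[url] = response.text
    return pages
```

In the notebook this could be called as `fetch_pages(list_of_stat_urls, requests.get)`, after which each page's text is handed to `BeautifulSoup` as before.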

## Top 10 Finishes (Events, Top 10, 1st, 2nd, 3rd) (164 Golfers)

In [4]:
# First Scrape Golfer names into a list

names = []
for n in soup.find_all('td', {'class':'player-name'}):
    names.append(n.find('a').contents[0])
names_clean = [str(name.replace("\xa0", " ")) for name in names]

In [5]:
len(names_clean)

164

In [6]:
# Scrape stats
stats = []

for stat in soup.find_all('td', {'class':'hidden-small hidden-medium'}):
    try:
        stats.append(stat.contents[0])
    except IndexError:  # the cell has no contents
        stats.append(0)

In [7]:
# Break up into separate lists for each statistic

stats1 = [int(num) for num in stats]
events = stats1[::5]
num_top10 = stats1[1::5]  # count of top-10 finishes (distinct from the top10 DataFrame below)
num1st = stats1[2::5]
num2nd = stats1[3::5]
num3rd = stats1[4::5]
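
The stride slices above work because the scraped cells form one flat list in row order, so every fifth value belongs to the same column. A small sketch of the same idea with a length check is shown below; `split_columns` is a hypothetical helper and the numbers are made up.

```python
def split_columns(cells, n_cols):
    """Split a flat, row-major list of table cells into n_cols column lists."""
    if len(cells) % n_cols != 0:
        raise ValueError(
            f"{len(cells)} cells do not divide evenly into {n_cols} columns"
        )
    return [cells[i::n_cols] for i in range(n_cols)]


# Made-up values: two golfers, five statistics each
flat = [20, 8, 2, 1, 0, 18, 5, 0, 2, 1]
events, top10_count, first, second, third = split_columns(flat, 5)
print(events)       # [20, 18]
print(top10_count)  # [8, 5]
```

The length check catches the case where a page's column count differs from the stride, which would otherwise silently shift values into the wrong columns.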

In [8]:
# Convert to DataFrame

top10 = pd.DataFrame({
        'Name':names_clean,
        'Events':events,
        '1st':num1st,
        '2nd':num2nd,
        '3rd':num3rd
    })

## Scoring Average (Rounds, Avg Score, Total Strokes, Total Adjustment, Total Rounds) (203 Golfers)

In [9]:
req = requests.get("http://www.pgatour.com/stats/stat.120.html")
soup = BeautifulSoup(req.text, "html.parser")

In [10]:
# First Scrape Golfer names into a list

names = []
for n in soup.find_all('td', {'class':'player-name'}):
    names.append(n.find('a').contents[0])
names_clean2 = [str(name.replace("\xa0", " ")) for name in names]

In [11]:
len(names_clean2)

203

In [12]:
# Scrape stats
stats = []

for stat in soup.find_all('td', {'class':'hidden-small hidden-medium'}):
    try:
        stats.append(stat.contents[0])
    except IndexError:  # the cell has no contents
        stats.append(0)

In [13]:
# Convert to numeric

stats1 = []
for num in stats:
    num = str(num).replace(',', '').replace(' ', '')
    stats1.append(float(num))
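
The same conversion loop recurs in every later section, so it could be factored into a function. The sketch below is one way to do that; `to_number` is a hypothetical name, and it removes commas and all whitespace (including non-breaking spaces) before calling `float`.

```python
def to_number(raw):
    """Convert a scraped cell such as ' 7,012 ' to a float."""
    text = str(raw).replace(",", "")
    text = "".join(text.split())  # drop every whitespace character, including \xa0
    return float(text)


print(to_number("7,012"))     # 7012.0
print(to_number(" 69.374 "))  # 69.374
print(to_number(0))           # 0.0
```

With this in place, each section's conversion step reduces to `stats1 = [to_number(num) for num in stats]`.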

In [14]:
# Break up into separate lists for each statistic

rounds = stats1[::5]
avg_score = stats1[1::5]
total_strokes = stats1[2::5]
total_adjustment = stats1[3::5]

In [15]:
# Convert to DataFrame

scoring_avg = pd.DataFrame({
        'Name':names_clean2,
        'Rounds':rounds,
        'Average Score':avg_score,
        'Total Adjustment':total_adjustment,
        'Total Strokes':total_strokes,
    })

## Driving Distance (Rounds, Average Drive, Total Distance, Total Drives) (205 Golfers)

In [16]:
req = requests.get("http://www.pgatour.com/stats/stat.101.html")
soup = BeautifulSoup(req.text, "html.parser")

In [17]:
# First Scrape Golfer names into a list

names = []
for n in soup.find_all('td', {'class':'player-name'}):
    names.append(n.find('a').contents[0])
names_clean3 = [str(name.replace("\xa0", " ")) for name in names]

In [18]:
len(names_clean3)

205

In [19]:
# Scrape stats
stats = []

for stat in soup.find_all('td', {'class':'hidden-small hidden-medium'}):
    try:
        stats.append(stat.contents[0])
    except IndexError:  # the cell has no contents
        stats.append(0)

In [20]:
# Convert to numeric

stats1 = []
for num in stats:
    num = str(num).replace(',', '').replace(' ', '')
    stats1.append(float(num))

In [21]:
# Break up into separate lists for each statistic

avg_drive = stats1[1::4]

In [22]:
# Convert to DataFrame

driving_distance = pd.DataFrame({
        'Name':names_clean3,
        'Average Drive':avg_drive,
    })

## Driving Accuracy Percentage (Rounds, %, Fairways Hit, Possible Fairways) (205 Golfers)

In [23]:
req = requests.get("http://www.pgatour.com/stats/stat.102.html")
soup = BeautifulSoup(req.text, "html.parser")

In [24]:
# First Scrape Golfer names into a list

names = []
for n in soup.find_all('td', {'class':'player-name'}):
    names.append(n.find('a').contents[0])
names_clean4 = [str(name.replace("\xa0", " ")) for name in names]

In [25]:
len(names_clean4)

205

In [26]:
# Scrape stats
stats = []

for stat in soup.find_all('td', {'class':'hidden-small hidden-medium'}):
    try:
        stats.append(stat.contents[0])
    except IndexError:  # the cell has no contents
        stats.append(0)

In [27]:
# Convert to numeric

stats1 = []
for num in stats:
    num = str(num).replace(',', '').replace(' ', '')
    stats1.append(float(num))

In [28]:
# Break up into separate lists for each statistic

perc = stats1[1::4]
fairways_hit = stats1[2::4]
possible_fairways = stats1[3::4]

In [29]:
# Convert to DataFrame

driving_accuracy = pd.DataFrame({
        'Name':names_clean4,
        'Driving Percentage':perc,
        'Fairways Hit':fairways_hit,
        'Possible Fairways':possible_fairways
    })

## Greens in Regulation Percentage (Rounds, %) (205 Golfers)

In [30]:
req = requests.get("http://www.pgatour.com/stats/stat.103.html")
soup = BeautifulSoup(req.text, "html.parser")

In [31]:
# First Scrape Golfer names into a list

names = []
for n in soup.find_all('td', {'class':'player-name'}):
    names.append(n.find('a').contents[0])
names_clean5 = [str(name.replace("\xa0", " ")) for name in names]

In [32]:
len(names_clean5)

205

In [33]:
# Scrape stats
stats = []

for stat in soup.find_all('td', {'class':'hidden-small hidden-medium'}):
    try:
        stats.append(stat.contents[0])
    except IndexError:  # the cell has no contents
        stats.append(0)

In [34]:
# Convert to numeric

stats1 = []
for num in stats:
    num = str(num).replace(',', '').replace(' ', '')
    stats1.append(float(num))

In [35]:
# Break up into separate lists for each statistic

perc = stats1[1::5]

In [36]:
len(perc)

205

In [37]:
# Convert to DataFrame

greens_in_regulation = pd.DataFrame({
        'Name':names_clean5,
        'Greens Percentage':perc
    })

## SG: Tee-to-Green (205 Golfers)

In [38]:
req = requests.get("http://www.pgatour.com/stats/stat.02674.html")
soup = BeautifulSoup(req.text, "html.parser")

In [39]:
# First Scrape Golfer names into a list

names = []
for n in soup.find_all('td', {'class':'player-name'}):
    names.append(n.find('a').contents[0])
names_clean6 = [str(name.replace("\xa0", " ")) for name in names]

In [40]:
len(names_clean6)

205

In [41]:
# Scrape stats
stats = []

for stat in soup.find_all('td', {'class':'hidden-small hidden-medium'}):
    try:
        stats.append(stat.contents[0])
    except IndexError:  # the cell has no contents
        stats.append(0)

In [42]:
# Convert to numeric

stats1 = []
for num in stats:
    num = str(num).replace(',', '').replace(' ', '')
    stats1.append(float(num))

In [43]:
# Break up into separate lists for each statistic

perc = stats1[1::6]

In [44]:
# Convert to DataFrame

sg_tee_to_green = pd.DataFrame({
        'Name':names_clean6,
        'SG: Tee-to-Green':perc
    })

## SG: Total (205 Golfers)

In [45]:
req = requests.get("http://www.pgatour.com/stats/stat.02675.html")
soup = BeautifulSoup(req.text, "html.parser")

In [46]:
# First Scrape Golfer names into a list

names = []
for n in soup.find_all('td', {'class':'player-name'}):
    names.append(n.find('a').contents[0])
names_clean7 = [str(name.replace("\xa0", " ")) for name in names]

In [47]:
len(names_clean7)

205

In [48]:
# Scrape stats
stats = []

for stat in soup.find_all('td', {'class':'hidden-small hidden-medium'}):
    try:
        stats.append(stat.contents[0])
    except IndexError:  # the cell has no contents
        stats.append(0)

In [49]:
# Convert to numeric

stats1 = []
for num in stats:
    num = str(num).replace(',', '').replace(' ', '')
    stats1.append(float(num))

In [50]:
# Break up into separate lists for each statistic

perc = stats1[1::6]

In [51]:
# Convert to DataFrame

sg_total = pd.DataFrame({
        'Name':names_clean7,
        'SG: Total':perc
    })

## SG: Putting (205 Golfers)

In [52]:
req = requests.get("http://www.pgatour.com/stats/stat.02564.html")
soup = BeautifulSoup(req.text, "html.parser")

In [53]:
# First Scrape Golfer names into a list

names = []
for n in soup.find_all('td', {'class':'player-name'}):
    names.append(n.find('a').contents[0])
names_clean8 = [str(name.replace("\xa0", " ")) for name in names]

In [54]:
len(names_clean8)

205

In [55]:
# Scrape stats
stats = []

for stat in soup.find_all('td', {'class':'hidden-small hidden-medium'}):
    try:
        stats.append(stat.contents[0])
    except IndexError:  # the cell has no contents
        stats.append(0)

In [56]:
# Convert to numeric

stats1 = []
for num in stats:
    num = str(num).replace(',', '').replace(' ', '')
    stats1.append(float(num))

In [57]:
# Break up into separate lists for each statistic

perc = stats1[1::4]

In [58]:
# Convert to DataFrame

sg_putting = pd.DataFrame({
        'Name':names_clean8,
        'SG: Putting':perc
    })

## Scrambling (205 Golfers)

In [59]:
req = requests.get("http://www.pgatour.com/stats/stat.130.html")
soup = BeautifulSoup(req.text, "html.parser")

In [60]:
# First Scrape Golfer names into a list

names = []
for n in soup.find_all('td', {'class':'player-name'}):
    names.append(n.find('a').contents[0])
names_clean9 = [str(name.replace("\xa0", " ")) for name in names]

In [61]:
len(names_clean9)

205

In [62]:
# Scrape stats
stats = []

for stat in soup.find_all('td', {'class':'hidden-small hidden-medium'}):
    try:
        stats.append(stat.contents[0])
    except IndexError:  # the cell has no contents
        stats.append(0)

In [63]:
# Convert to numeric

stats1 = []
for num in stats:
    num = str(num).replace(',', '').replace(' ', '')
    stats1.append(float(num))

In [64]:
# Break up into separate lists for each statistic

perc = stats1[1::4]

In [65]:
# Convert to DataFrame

scrambling = pd.DataFrame({
        'Name':names_clean9,
        'Scrambling %':perc
    })

## Combine Stats Into One DataFrame

In [66]:
# DataFrames to merge: top10, scoring_avg, driving_distance, driving_accuracy,
# greens_in_regulation, sg_tee_to_green, sg_total, sg_putting, scrambling

data = (
    top10.merge(scoring_avg, how='right', on='Name')
    .merge(driving_distance, on='Name')
    .merge(driving_accuracy, on='Name')
    .merge(greens_in_regulation, on='Name')
    .merge(sg_tee_to_green, on='Name')
    .merge(sg_total, on='Name')
    .merge(sg_putting, on='Name')
    .merge(scrambling, on='Name')
)
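
Every merge after the first uses the pandas default `how='inner'`, so a golfer missing from any one table is dropped from the result. The sketch below illustrates that effect on made-up frames; `merge_on_name` is a hypothetical helper that folds a list of frames together with `functools.reduce`.

```python
from functools import reduce

import pandas as pd

# Made-up frames standing in for three of the scraped tables
a = pd.DataFrame({"Name": ["A", "B", "C"], "Average Score": [69.5, 70.1, 70.8]})
b = pd.DataFrame({"Name": ["A", "B"], "Average Drive": [301.2, 295.4]})
c = pd.DataFrame({"Name": ["A", "B", "C"], "Scrambling %": [61.0, 58.3, 55.9]})


def merge_on_name(frames, how="inner"):
    """Merge a list of DataFrames pairwise on the 'Name' column."""
    return reduce(lambda left, right: left.merge(right, on="Name", how=how), frames)


inner = merge_on_name([a, b, c])
outer = merge_on_name([a, b, c], how="outer")
print(list(inner["Name"]))                  # ['A', 'B']
print(outer["Average Drive"].isna().sum())  # 1
```

Comparing the inner and outer row counts is a quick way to see how many golfers the inner merges discard.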


In [67]:
# Golfers absent from the Top 10 table get NaN in these columns after the right merge; fill with 0

for col in ['1st', '2nd', '3rd', 'Events']:
    data[col] = data[col].fillna(0)

In [68]:
# write out to csv

data.to_csv('golfers.csv')
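
By default `to_csv` also writes the row index as an unnamed first column. The sketch below, on a made-up frame, shows the difference `index=False` makes and how to read the default output back without picking up an extra column.

```python
import io

import pandas as pd

# Made-up frame standing in for the merged data
df = pd.DataFrame({"Name": ["A", "B"], "Average Score": [69.5, 70.1]})

with_index = df.to_csv()                # returns the CSV text when no path is given
without_index = df.to_csv(index=False)
print(with_index.splitlines()[0])       # ,Name,Average Score
print(without_index.splitlines()[0])    # Name,Average Score

# Reading the default output back: treat the first column as the index
restored = pd.read_csv(io.StringIO(with_index), index_col=0)
print(list(restored.columns))           # ['Name', 'Average Score']
```

Either choice works as long as the code that later reads `golfers.csv` matches it.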

## Collect Data on the Course Itself (Erin Hills)
The US Open this year will be played at Erin Hills. To prepare for the machine learning model, I did some preliminary research on the course: I built a separate dataset describing each hole (distance, par, and description) and took notes on the course so I could better estimate which attributes would give players an edge in this tournament.

In [69]:
req = requests.get("https://erinhills.com/golf/hole-by-hole/")
soup = BeautifulSoup(req.text, "html.parser")

In [70]:
dists = []
for dist in soup.find_all('p', {'class':'tee-distance'})[::5]:
    dists.append(int(dist.contents[0]))

In [71]:
par = []
for num in soup.find_all('div', {'class':'par'}):
    # The par digit is read from a fixed character offset (index 13) of the element's first child
    p = str(num.contents[0])[13]
    par.append(int(p))
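
Reading the par from a fixed character position depends on the exact whitespace in the page. A less position-dependent alternative is to take the first run of digits with a regular expression. This is a sketch under the assumption that the par is the first number in the element's text; `first_int` is a hypothetical helper and the sample strings are made up.

```python
import re


def first_int(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


print(first_int("Par 4"))         # 4
print(first_int("\n   Par: 5 "))  # 5
print(first_int("no digits"))     # None
```

Returning `None` rather than raising makes it easy to spot holes where the expected text was not found.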

In [72]:
descs = []
for desc in soup.find_all('div', {'class':'hole-copy'}):
    # Slice off the first 3 and last 4 characters (presumably the surrounding <p> and </p> tags)
    descs.append(str(desc.contents[0])[3:-4])

In [73]:
course = pd.DataFrame({
        'Hole':np.arange(1,19),
        'Distance':dists,
        'Par':par,
        'Description':descs
    })

In [74]:
# write to csv

course.to_csv('course.csv')