<a href="https://colab.research.google.com/github/jonathannocek/pga-data-analysis/blob/master/pga_data_scraping_nocek.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **CS377: Data Science Final Project / CS395: Data Science Capstone**
## **Goal**

The goal of this project is to analyze PGA Tour statistics using the data published on pgatour.com.

## **Data Scraping**

First, I need to scrape the data from the PGA TOUR website and load it into a dataframe.

### **Variables**
Descriptions are taken from the PGA TOUR's website. The Strokes Gained concept can be confusing, so [here](https://www.pgatour.com/news/2016/05/31/strokes-gained-defined.html) is a detailed explanation of the statistic. It has revolutionized golf statistics and provided significant insight into how players analyze their game.

The Strokes Gained concept is a by-product of the PGA TOUR's ShotLink Intelligence Program, which encourages academics to perform research against ShotLink statistical data. Professor Mark Broadie from Columbia Business School developed the early concept, which was later refined by the TOUR.

*   **NAME** - The name of the golfer
*   **ROUNDS** - The number of rounds played in the given season
*   **SCORING** - The weighted scoring average which takes the stroke average of the field into account. It is computed by adding a player's total strokes to an adjustment and dividing by the total rounds played. The adjustment is computed by determining the stroke average of the field for each round played. This average is subtracted from par to create an adjustment for each round. A player accumulates these adjustments for each round played.
*   **DRIVE_DISTANCE** - The average number of yards per measured drive. These drives are measured on two holes per round. Care is taken to select two holes which face in opposite directions to counteract the effect of wind. Drives are measured to the point at which they come to rest regardless of whether they are in the fairway or not 
*   **FWY_%** - The percentage of time a tee shot comes to rest in the fairway (regardless of club).
*   **GIR_%** - The percent of time a player was able to hit the green in regulation (greens hit in regulation/holes played). Note: A green is considered hit in regulation if any portion of the ball is touching the putting surface after the GIR stroke has been taken. (The GIR stroke is determined by subtracting 2 from par (1st stroke on a par 3, 2nd on a par 4, 3rd on a par 5))
*   **SG_P (Strokes Gained Putting)** - The number of putts a player takes from a specific distance is measured against a statistical baseline to determine the player's strokes gained or lost on a hole. The sum of the values for all holes played in a round, minus the field average strokes gained/lost for the round, is the player's strokes gained/lost for that round. The sum of strokes gained across rounds is divided by total rounds played.
*   **SG_ATG (Strokes Gained Around the Green)** - The number of Around the Green strokes a player takes from specific locations and distances is measured against a statistical baseline to determine the player's strokes gained or lost on a hole.
*   **SG_APP (Strokes Gained Approach)** - The number of Approach the Green strokes a player takes from specific locations and distances is measured against a statistical baseline to determine the player's strokes gained or lost on a hole.
*   **SG_OTT (Strokes Gained Off the Tee)** - The number of strokes a player takes from a specific distance off the tee on par 4s and par 5s is measured against a statistical baseline to determine the player's strokes gained or lost off the tee on a hole.
*   **SG_TTG (Strokes Gained Tee to Green)** - The per round average of the number of strokes the player was better or worse than the field average on the same course and event, minus the player's Strokes Gained Putting value.
*   **SG_T (Strokes Gained Total)** - The per round average of the number of Strokes the player was better or worse than the field average on the same course & event.
*   **PAR3 (Par 3 Scoring)** - The average score on all par 3's played (e.g. 3.22)
*   **PAR4 (Par 4 Scoring)** - The average score on all par 4's played (e.g. 4.22)
*   **PAR5 (Par 5 Scoring)** - The average score on all par 5's played (e.g. 5.22)
*   **POINTS** - Total number of FedEx Cup points
*   **TOP 10** - The number of top 10 finishes
*   **1ST** - The number of tournaments won
*   **YEAR** - The year of the season for the statistics
*   **UD_%** - A subjective stat.  Most college programs define it as a short shot (i.e. a pitch or chip shot) and the percentage of time that they get the ball up and down, or a chip and a putt.
*   **SS_%** - The percent of time a player was able to get 'up and down' once in a greenside sand bunker (regardless of score). Note: 'Up and down' indicates it took the player 2 shots or less to put the ball in the hole from that point.
*   **SCRAM_%** - The percent of time a player misses the green in regulation, but still makes par or better. 
*   **PUTTS** - The average number of putts taken during a round.
*   **1PUTT_%** - The percentage of holes a player one putts
*   **2PUTT_%** - The percentage of holes a player two putts
*   **3PUTT_%** - The percentage of holes a player three putts
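
As a worked illustration of the SCORING definition above, the sketch below computes an adjusted scoring average for a made-up three-round season. The function name and all numbers are hypothetical; only the formula follows the description (each round's adjustment is par minus the field's stroke average, and the result is total strokes plus total adjustment, divided by rounds played).

```python
def adjusted_scoring_average(rounds):
    """rounds: list of (player_strokes, par, field_average) tuples, one per round."""
    total_strokes = sum(strokes for strokes, _, _ in rounds)
    # Adjustment for each round: par minus the field's stroke average that day.
    total_adjustment = sum(par - field_avg for _, par, field_avg in rounds)
    return (total_strokes + total_adjustment) / len(rounds)

# Hypothetical season: three rounds on par-72 courses.
season = [
    (70, 72, 73.0),  # hard day: field averaged one over par, adjustment -1.0
    (71, 72, 71.5),  # easy day: field averaged under par, adjustment +0.5
    (69, 72, 72.0),  # neutral day: adjustment 0.0
]
raw_average = sum(strokes for strokes, _, _ in season) / len(season)
adjusted = adjusted_scoring_average(season)
print(raw_average, round(adjusted, 3))
```

Here the raw average is 70.0, and the net adjustment of -0.5 strokes over three rounds lowers the adjusted figure slightly.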




## **Imports**

In [0]:
import requests 
import pandas as pd 
import numpy as np 
from bs4 import BeautifulSoup # Used for pulling data out of HTML files
import seaborn as sns

## **Data Scraping**

This data scraping function is modified from [here](https://github.com/daronprater/PGA-Tour-Data-Science-Project/blob/master/PGAtour.com%20Web%20Scraper.ipynb).

### **Get Column Headers**

This function gets the column names to use for the dataframe. Using BeautifulSoup, it pulls the header with the rounds class, then all headers with the col-stat class, and returns them together as a list.

In [0]:
def get_headers(soup):
    headers = [] #Initialize empty list for column names
    
    #Get rounds header
    rounds = soup.find_all(class_="rounds hidden-small hidden-medium")[0].get_text()
    headers.append(rounds)
    
    #Get other headers
    stat_headers = soup.find_all(class_="col-stat hidden-small hidden-medium")
    for header in stat_headers:
        headers.append(header.get_text())
    
    return headers
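
One detail worth knowing about the selectors above: when class_ is given a string that contains spaces, BeautifulSoup compares it against the exact value of the class attribute, so the full set of classes and their order must match. The sketch below uses a made-up HTML fragment (not the real pgatour.com markup) and the built-in html.parser to show this, alongside a CSS selector that does not depend on class order.

```python
from bs4 import BeautifulSoup

# Hypothetical header row; the real page markup may differ.
html = """
<table><thead><tr>
  <th class="rounds hidden-small hidden-medium">ROUNDS</th>
  <th class="col-stat hidden-small hidden-medium">AVG</th>
  <th class="hidden-small col-stat hidden-medium">TOTAL STROKES</th>
</tr></thead></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact-string match: only the header whose class attribute is written in this order.
exact = [th.get_text() for th in soup.find_all(class_="col-stat hidden-small hidden-medium")]

# CSS selector: matches any element carrying all three classes, in any order.
by_css = [th.get_text() for th in soup.select(".col-stat.hidden-small.hidden-medium")]

print(exact)
print(by_css)
```

If the site ever reorders or adds classes, the exact-string form would silently return an empty list, which is worth checking for before indexing with [0].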

### **Get Player Names**

This function takes the BeautifulSoup object and uses it to gather the players' names.

In [0]:
def get_players(soup):    
    player_list = [] #Initialize empty list for players
    
    #Get player as html tags
    players = soup.select('td a')[1:] #Use 1 because the first line of all tables is not useful.
    #Loop through list
    for player in players:
        player_list.append(player.get_text())
    
    return player_list
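
To make the td a selector and the [1:] slice concrete, here is a small sketch on a made-up table fragment (hypothetical markup, parsed with the built-in html.parser). It assumes, as the comment in the function says, that the first matching link is not a player.

```python
from bs4 import BeautifulSoup

# Hypothetical table fragment; the first link is not a player name.
html = """
<table>
  <tr><td><a href="#">Stat description link</a></td></tr>
  <tr><td>1</td><td><a href="#"> Matt Kuchar </a></td><td>97</td></tr>
  <tr><td>2</td><td><a href="#">Steve Stricker</a></td><td>73</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

links = soup.select("td a")  # every <a> nested anywhere inside a <td>
# Drop the first, non-player link and trim surrounding whitespace.
names = [a.get_text(strip=True) for a in links[1:]]
print(names)
```

Passing strip=True is optional, but trimmed names are safer to use later as merge keys.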

### **Get Stats**

This function finds all of the statistic cells on the page and groups them into a list of rows, with one list of values per table row.


In [0]:
def get_stats(soup, categories):
    '''Takes the soup created before and the number of stat columns 
    (categories) per table row, and returns the stats grouped into one list per row'''
    
    #Finds all tags with class specified and puts into a list
    stats = soup.find_all(class_="hidden-small hidden-medium")
    
    #Initialize stats list
    stat_list = []
    
    #Loop through 
    for i in range(0, len(stats)-categories+1, categories):
        temp_list = []
        for j in range(categories):
            temp_list.append(stats[i + j].get_text())
        stat_list.append(temp_list)
            
    return stat_list
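
The grouping step in get_stats is independent of the HTML, so it can be sketched on a plain list. The helper name chunk_rows and the cell values below are made up for illustration; the range arithmetic is the same as in the function above.

```python
def chunk_rows(cells, categories):
    """Group a flat list of cell values into rows of `categories` items each."""
    rows = []
    # Stop early enough that a short, incomplete trailing group is dropped.
    for i in range(0, len(cells) - categories + 1, categories):
        rows.append(cells[i:i + categories])
    return rows

# Two complete rows of three cells, plus one stray trailing cell.
cells = ["97", "69.606", "6,752", "73", "69.660", "5,085", "stray"]
rows = chunk_rows(cells, 3)
print(rows)

# A wrong column count does not fail; it just shifts values into the wrong rows.
misaligned = chunk_rows(cells, 2)
print(misaligned)
```

Because a wrong categories value produces misaligned rows rather than an error, the per-page numbers passed to make_dataframe have to match each stat page's column count.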

### **Create Stats Dictionary**

This function takes two lists, players and stats, and combines them into a dictionary with the player as the key.

In [0]:
def stats_dict(players, stats):
        '''This function takes two lists, players and stats, 
        and creates a dictionary with the player being the key 
        and the stats as the values (as a list)'''
    
        #initialize player dictionary
        player_dict = {}
    
        #Loop through player list
        for i, player in enumerate(players):
            player_dict[player] = stats[i]
    
        return player_dict
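
Two properties of this pairing are easy to miss: a dictionary keeps only one entry per key, so a repeated name would overwrite the earlier row, and the lists are assumed to have the same length. The sketch below shows both on made-up data. The helper pair_players_with_stats is a hypothetical variant, not part of the original notebook.

```python
def pair_players_with_stats(players, stats):
    """Hypothetical variant of stats_dict that refuses mismatched inputs."""
    if len(players) != len(stats):
        raise ValueError(f"{len(players)} players but {len(stats)} stat rows")
    return dict(zip(players, stats))

players = ["Player A", "Player B", "Player A"]  # made-up data with a repeated name
stats = [["97", "69.606"], ["73", "69.660"], ["12", "71.250"]]

player_dict = pair_players_with_stats(players, stats)
print(len(player_dict))         # the repeated name collapsed into one key
print(player_dict["Player A"])  # the later row replaced the earlier one
```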

### **Creating the Dataframe**

make_dataframe takes the URL where the desired data is located and the number of stat columns (categories) in each table row.

The function will get headers, get players, get stats, and finally combine them into a dictionary. From there, it will use pandas to create a dataframe with the dictionary and the headers as column names. 

In [0]:
def make_dataframe(url, categories):
        
    ##Create soup object from url.
    response = requests.get(url)
    response.raise_for_status() #Stop early on an HTTP error instead of parsing an error page
    text = response.text
    soup = BeautifulSoup(text, 'lxml')
    
    #1. Get Headers
    headers = get_headers(soup)
    
    #2. Get Players
    players = get_players(soup)
    
    #3. Get Stats
    stats = get_stats(soup, categories)
    
    #4. Make stats dictionary.
    stats_dictionary = stats_dict(players, stats)
    
    #Make dataframe
    frame = pd.DataFrame(stats_dictionary, index = headers).T
    
    #Reset index
    frame = frame.reset_index()
    
    #For each Dataframe, change index column to 'NAME'
    frame = frame.rename(index = str, columns = {'index': 'NAME'})
    return frame
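
The dataframe construction at the end of make_dataframe can be tried offline on a tiny dictionary. The names and values below are made up. The second construction, using DataFrame.from_dict, is an alternative shown for comparison and is not what the notebook uses.

```python
import pandas as pd

headers = ["ROUNDS", "AVG"]
stats_dictionary = {
    "Player A": ["97", "69.606"],
    "Player B": ["73", "69.660"],
}

# Dict keys become columns and `headers` labels the rows, so transpose afterwards.
frame = pd.DataFrame(stats_dictionary, index=headers).T
frame = frame.reset_index().rename(columns={"index": "NAME"})

# Alternative without the transpose: treat each dict key as a row label.
alt = pd.DataFrame.from_dict(stats_dictionary, orient="index", columns=headers)
alt = alt.rename_axis("NAME").reset_index()

print(frame)
print(alt)
```

In both forms the number of headers has to equal the length of each stats list, otherwise pandas raises an error when building the frame.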

Use the list below to specify which seasons are scraped.

In [0]:
years = [str(i) for i in range(2010, 2020)]

### **Putting it all together**

Here, the for loop iterates through the years 2010-2019 to collect the following statistics:
1. FedEx Cup points
2. Top 10s
3. Wins
4. Scoring Average
5. Driving Distance
6. Driving Accuracy
7. Greens in Regulation
8. Strokes Gained - Tee to Green
9. Strokes Gained - Off the Tee
10. Strokes Gained - Approach the Green
11. Strokes Gained - Around the Green
12. Strokes Gained - Putting
13. Strokes Gained - Total
14. Par 3 Scoring
15. Par 4 Scoring
16. Par 5 Scoring 

In [0]:
for year in years:
    print(year)
    #Fedex cup points
    fcp = make_dataframe("https://www.pgatour.com/stats/stat.02671.{}.html".format(year), 6)[['NAME', 'POINTS']]
    #Top 10's and wins
    top10 = make_dataframe("https://www.pgatour.com/stats/stat.138.{}.html".format(year), 5)[['NAME', 'TOP 10', '1ST']]

    # Scoring statistics, keep rounds from this page as it most accurately reflects total rounds player completed in season.
    scoring = make_dataframe("https://www.pgatour.com/stats/stat.120.{}.html".format(year), 5)[['NAME', 'ROUNDS', 'AVG']]
    scoring = scoring.rename(columns={'AVG':'SCORING'})

    # Driving Distance
    drivedistance = make_dataframe("https://www.pgatour.com/stats/stat.101.{}.html".format(year), 4)[['NAME', 'AVG.']]
    #Rename Columns
    drivedistance = drivedistance.rename(columns = {'AVG.':'DRIVE_DISTANCE'})

    # Driving Accuracy
    driveacc = make_dataframe("https://www.pgatour.com/stats/stat.102.{}.html".format(year), 4)[['NAME', '%']]
    # Change column name from % to FWY %
    driveacc = driveacc.rename(columns = {'%': "FWY_%"})

    # Greens in Regulation.
    gir = make_dataframe("https://www.pgatour.com/stats/stat.103.{}.html".format(year), 5)[['NAME', '%']]
    # Change column name from % to GIR %
    gir = gir.rename(columns = {'%': "GIR_%"})

    # SG-Tee to Green
    sg_teetogreen = make_dataframe("https://www.pgatour.com/stats/stat.02674.{}.html".format(year), 6)[['NAME', 'AVERAGE']]
    #Change name of average column
    sg_teetogreen = sg_teetogreen.rename(columns = {'AVERAGE' : 'SG_TTG'})

    # SG-Off the Tee
    sg_ott = make_dataframe("https://www.pgatour.com/stats/stat.02567.{}.html".format(year), 4)[['NAME', 'AVERAGE']]
    #Rename Columns
    sg_ott = sg_ott.rename(columns = {'AVERAGE':'SG_OTT'})
    
    # SG-Approach
    sg_approach = make_dataframe("https://www.pgatour.com/stats/stat.02568.{}.html".format(year), 4)[['NAME', 'AVERAGE']]
    #Rename Columns
    sg_approach = sg_approach.rename(columns = {'AVERAGE':'SG_APP'})

    # SG-Around the Green
    sg_around = make_dataframe("https://www.pgatour.com/stats/stat.02569.{}.html".format(year), 4)[['NAME', 'AVERAGE']]
    #Rename Columns
    sg_around = sg_around.rename(columns = {'AVERAGE':'SG_ATG'})

    # SG-Putting
    sg_putting = make_dataframe("https://www.pgatour.com/stats/stat.02564.{}.html".format(year), 4)[['NAME', 'AVERAGE']]
    #Change name of average column
    sg_putting = sg_putting.rename(columns = {'AVERAGE': 'SG_P'})
   
    # SG-Total
    sg_total = make_dataframe("https://www.pgatour.com/stats/stat.02675.{}.html".format(year), 6)[['NAME', 'AVERAGE']]
    sg_total = sg_total.rename(columns = {'AVERAGE':'SG_T'})

    # Par 3 Scoring
    par3_scoring = make_dataframe("https://www.pgatour.com/stats/stat.142.{}.html".format(year), 4)[['NAME', 'AVG']]
    par3_scoring = par3_scoring.rename(columns = {'AVG':'PAR3'})
    
    # Par 4 Scoring
    par4_scoring = make_dataframe("https://www.pgatour.com/stats/stat.143.{}.html".format(year), 4)[['NAME', 'AVG']]
    par4_scoring = par4_scoring.rename(columns = {'AVG':'PAR4'})

    # Par 5 Scoring
    par5_scoring = make_dataframe("https://www.pgatour.com/stats/stat.144.{}.html".format(year), 4)[['NAME', 'AVG']]
    par5_scoring = par5_scoring.rename(columns = {'AVG':'PAR5'})

    #Get Dataframes into list.
    data_frames = [drivedistance, driveacc, gir, sg_putting, sg_around, sg_approach, sg_ott, sg_teetogreen, sg_total, par3_scoring, par4_scoring, par5_scoring]
    
    #Merge all Dataframes together (default inner join: keeps only players present in every table)
    df_one = scoring
    for df in data_frames:
        df_one = pd.merge(df_one, df, on='NAME')
        
    

    #Merge FedEx Cup points
    df_one = pd.merge(df_one, fcp, how='outer', on='NAME')
    #Merge top 10's
    df_one = pd.merge(df_one, top10, how='outer', on='NAME')
    
    #Only keep players whose scoring average isn't null (copy so the Year column can be added safely).
    df_one = df_one.loc[df_one['SCORING'].notnull()].copy()
    
    #Add year column
    df_one['Year'] = year
    
    #Concat dataframe to overall dataframe (start fresh on the first year in the list)
    if year == years[0]:
        df_total = df_one
    else:
        df_total = pd.concat([df_total, df_one], axis=0)

2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
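
The loop above repeats the same fetch, select and rename pattern for each statistic. One way to express that pattern is a small table of stat definitions. This is only a sketch: STAT_PAGES, build_season and the fake_fetch stub (which stands in for make_dataframe so the example runs offline) are hypothetical names, and the sample values are made up.

```python
from functools import reduce
import pandas as pd

# (stat id, categories, source column, output column); ids taken from the URLs above.
STAT_PAGES = [
    ("120", 5, "AVG", "SCORING"),
    ("101", 4, "AVG.", "DRIVE_DISTANCE"),
]

def fake_fetch(stat_id, year, categories):
    """Offline stand-in for make_dataframe(url, categories); returns made-up rows."""
    data = {
        "120": {"NAME": ["Player A", "Player B"], "AVG": ["69.6", "70.1"]},
        "101": {"NAME": ["Player A"], "AVG.": ["286.9"]},
    }
    return pd.DataFrame(data[stat_id])

def build_season(year, fetch=fake_fetch):
    frames = []
    for stat_id, categories, source, target in STAT_PAGES:
        df = fetch(stat_id, year, categories)[["NAME", source]]
        frames.append(df.rename(columns={source: target}))
    # Inner merges: a player missing from any one table is dropped.
    season = reduce(lambda left, right: pd.merge(left, right, on="NAME"), frames)
    season["Year"] = year
    return season

# Collect one frame per season, then concatenate once at the end.
df_demo = pd.concat([build_season(y) for y in ["2010", "2011"]], ignore_index=True)
print(df_demo)
```

In the stub data, Player B has no driving distance row, so the inner merge drops that player. The same thing happens in the notebook to any golfer missing from one of the core stat pages.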


In [0]:
df_total.head(15)

| Unnamed: 0 | NAME | ROUNDS | SCORING | DRIVE_DISTANCE | FWY_% | GIR_% | SG_P | SG_ATG | SG_APP | SG_OTT | SG_TTG | SG_T | PAR3 | PAR4 | PAR5 | POINTS | TOP 10 | 1ST | Year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Matt Kuchar | 97 | 69.606 | 286.9 | 67.89 | 69.36 | 0.648 | 0.334 | 0.336 | 0.158 | 0.827 | 1.461 | 3.02 | 3.96 | 4.56 | 2728 | 11 | 1.0 | 2010 |
| 1 | Steve Stricker | 73 | 69.66 | 282.9 | 68.5 | 68.29 | 0.437 | 0.419 | 0.773 | 0.191 | 1.383 | 1.818 | 2.99 | 3.99 | 4.58 | 2028 | 9 | 2.0 | 2010 |
| 2 | Retief Goosen | 75 | 69.718 | 291.4 | 64.79 | 65.96 | 0.679 | 0.395 | 0.185 | 0.337 | 0.917 | 1.598 | 3.04 | 3.99 | 4.6 | 1360 | 10 |  | 2010 |
| 3 | Paul Casey | 64 | 69.72 | 294.2 | 61.31 | 68.68 | 0.812 | -0.111 | 0.483 | 0.215 | 0.587 | 1.411 | 3.03 | 3.98 | 4.64 | 2250 | 7 |  | 2010 |
| 4 | Jim Furyk | 76 | 69.828 | 276.0 | 71.01 | 67.12 | 0.402 | 0.367 | 0.641 | 0.15 | 1.159 | 1.564 | 3.02 | 3.99 | 4.7 | 2980 | 7 | 3.0 | 2010 |
| 5 | Ernie Els | 72 | 69.843 | 288.4 | 60.16 | 67.86 | 0.33 | 0.043 | 0.735 | 0.215 | 0.992 | 1.322 | 3.13 | 3.98 | 4.63 | 1438 | 7 | 2.0 | 2010 |
| 6 | Luke Donald | 71 | 69.85 | 277.0 | 62.36 | 65.28 | 0.87 | 0.464 | 0.661 | -0.506 | 0.619 | 1.493 | 3.0 | 4.0 | 4.71 | 2700 | 7 |  | 2010 |
| 7 | Justin Rose | 78 | 69.885 | 287.8 | 65.17 | 66.31 | 0.243 | 0.447 | 0.168 | 0.338 | 0.952 | 1.195 | 3.03 | 4.0 | 4.63 | 718 | 4 | 2.0 | 2010 |
| 8 | Bo Van Pelt | 104 | 69.955 | 292.0 | 65.23 | 69.23 | 0.098 | 0.107 | 0.26 | 0.724 | 1.091 | 1.192 | 3.06 | 3.99 | 4.61 | 445 | 8 |  | 2010 |
| 9 | Phil Mickelson | 76 | 69.966 | 299.1 | 52.66 | 65.13 | -0.147 | 0.228 | 0.738 | 0.185 | 1.151 | 1.001 | 3.05 | 4.01 | 4.58 | 843 | 6 | 1.0 | 2010 |


In [0]:
df_total.describe()

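| Unnamed: 0 | NAME | ROUNDS | SCORING | DRIVE_DISTANCE | FWY_% | GIR_% | SG_P | SG_ATG | SG_APP | SG_OTT | SG_TTG | SG_T | PAR3 | PAR4 | PAR5 | POINTS | TOP 10 | 1ST | Year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 1866 | 1866 | 1866.0 | 1866.0 | 1866.0 | 1866.0 | 1866.0 | 1866.0 | 1866.0 | 1866.0 | 1866.0 | 1866.0 | 1866.0 | 1866.0 | 1866.0 | 1861 | 1540 | 1540.0 | 1866 |
| unique | 465 | 73 | 1299.0 | 401.0 | 1195.0 | 835.0 | 993.0 | 774.0 | 1036.0 | 1005.0 | 1261.0 | 1278.0 | 30.0 | 29.0 | 46.0 | 1037 | 14 | 6.0 | 10 |
| top | Webb Simpson | 81 | 70.966 | 288.1 | 60.1 | 66.67 | 0.0 | -0.111 | 0.054 | 0.079 | -0.003 | 0.03 | 3.06 | 4.04 | 4.67 | 142 | 1 |  | 2018 |
| freq | 10 | 57 | 6.0 | 16.0 | 6.0 | 22.0 | 7.0 | 9.0 | 7.0 | 6.0 | 7.0 | 6.0 | 182.0 | 210.0 | 135.0 | 7 | 390 | 1194.0 | 193 |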
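
The count, unique, top and freq rows above are what describe() reports for non-numeric (object) columns, which suggests the scraped values are still stored as strings. Below is a minimal sketch of converting them, on a made-up three-row frame. The comma in POINTS is an assumption about how large numbers may be formatted on the site.

```python
import pandas as pd

# Made-up sample of scraped values, all stored as strings.
df = pd.DataFrame({
    "NAME": ["Player A", "Player B", "Player C"],
    "SCORING": ["69.606", "69.660", "69.718"],
    "POINTS": ["2,728", "2,028", None],  # assumed thousands separator
})

numeric_cols = ["SCORING", "POINTS"]
for col in numeric_cols:
    # Strip thousands separators, then coerce anything unparseable to NaN.
    cleaned = df[col].str.replace(",", "", regex=False)
    df[col] = pd.to_numeric(cleaned, errors="coerce")

print(df.dtypes)
print(df[numeric_cols].describe())
```

After the conversion, describe() reports mean, std and quantiles for these columns instead of string frequencies.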


In [0]:
from google.colab import files

df_total.to_csv('2010_2019_PGA_Stats.csv')
files.download('2010_2019_PGA_Stats.csv')
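
A note on the export: to_csv writes the row index by default, and reading such a file back typically yields an Unnamed: 0 column like the one in the tables above. Because each season is concatenated with its own row labels, those labels also repeat across years. The sketch below reproduces both effects on made-up data, using in-memory buffers instead of files, and shows one possible way to avoid them.

```python
import io
import pandas as pd

season_2010 = pd.DataFrame({"NAME": ["Player A", "Player B"], "Year": ["2010", "2010"]})
season_2011 = pd.DataFrame({"NAME": ["Player A"], "Year": ["2011"]})

# Plain concat keeps each season's own row labels, so labels repeat (0, 1, 0).
stacked = pd.concat([season_2010, season_2011], axis=0)

# Writing with the default index=True stores those labels as an unnamed first column.
with_index = io.StringIO()
stacked.to_csv(with_index)
with_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)

# Resetting the index and writing with index=False avoids the extra column.
without_index = io.StringIO()
stacked.reset_index(drop=True).to_csv(without_index, index=False)
without_index.seek(0)
cols_without = list(pd.read_csv(without_index).columns)

print(list(stacked.index))
print(cols_with)
print(cols_without)
```

The google.colab import in the cell above only works inside Colab. Outside Colab, the to_csv call alone writes the file to the working directory.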