# Analysis on PGA Data from 2010-2018
This notebook explores how Professional Golfers' Association (PGA) Tour golfers perform.  A dataset from [Kaggle](https://www.kaggle.com/jmpark746/pga-tour-data-2010-2018) covers relevant PGA Tour data from 2010 to 2018.  Let's dive into the data.

#### We will follow the CRISP-DM process throughout this notebook 
1. [Business Understanding](#dm1) <br>
2. [Data Understanding](#dm2)<br>
3. [Data Preparation](#dm3)<br>
&ensp;- [Clean](#dm31)<br>
&ensp;- [Explore](#dm32)<br>

---

Let's ask some questions that we are interested in answering in the ***Business Understanding*** <a name="dm1"></a> phase of CRISP-DM:
   - Does the lowest score correlate with making the most money?
   - Which aspect of the golf game relates most strongly to the highest earnings?
   - Have golf courses evolved since 2010 in a way that changes which stats contribute to success?
   - Are those who play more rounds more likely to shoot lower scores?
   - How well can we predict player performance in the future from previous years?

To ensure our ***Data Understanding***<a name="dm2"></a>, note that all the data used to answer the above questions comes from the [Kaggle](https://www.kaggle.com/jmpark746/pga-tour-data-2010-2018) dataset.  It is a fairly comprehensive dataset that the author built by web scraping [pgatour.com](https://www.pgatour.com/) for relevant PGA Tour data from 2010 to 2018.

We need to start by importing the necessary libraries and reading the raw data.

In [14]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pga = pd.read_csv('./data/pgaTourData.csv')
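
Before cleaning, it can help to see which columns contain missing values and how each column was typed on read. The sketch below is one way to do that. `summarize_columns` is a hypothetical helper (not part of the original notebook), and the toy frame uses made-up values so the example runs without the CSV.

```python
import numpy as np
import pandas as pd

def summarize_columns(df):
    """Return one row per column with its dtype and number of missing values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
    })

# Toy stand-in for the raw file; values are made up for illustration only
toy = pd.DataFrame({
    "Player Name": ["A", "B", "C"],
    "Wins": [1.0, np.nan, np.nan],
    "Points": ["1,234", "56", None],
    "Money": ["$1,000,000", "$25,000", None],
})

summary = summarize_columns(toy)
print(summary)
```

On the real data, `summarize_columns(pga)` would give the same kind of overview for every column.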

To start the ***Data Preparation*** <a name="dm3"></a><a name="dm31"></a> process, we note that the raw data is fairly clean but needs some additional munging to handle NaNs and object columns.

In [15]:
def clean_df(df):
    """Prepare the data for sklearn by dropping or filling NaNs"""
    
    # Many of the columns have the same number of NaNs; with this much missing data those rows don't help any
    #  predictions. thresh=12 keeps only the rows that have at least 12 non-NaN values.
    #  .copy() returns an independent frame, so the assignment below does not trigger SettingWithCopyWarning
    df = df.dropna(thresh=12).copy()

    # If Wins or Top 10 are NaN, that means the player had 0 for that year
    df.loc[:,['Wins', 'Top 10']] = df.loc[:,['Wins', 'Top 10']].fillna(0)
    
    # There are 4 golfers that have NaNs for points; after investigation, these players did NOT have their PGA Tour card
    #  during the listed year.  It is safe to drop these rows
    df = df.dropna(subset=['Points'])
    
    return df

def transform_df(df):
    """Take the object dtypes and convert into numeric columns"""
    
    # Work on a copy so the caller's frame is left untouched
    df = df.copy()
    
    # Points is treated as an object because it uses commas as a thousands separator
    df['Points'] = df['Points'].str.replace(',','').astype(np.int64)
    
    # Money uses dollar sign and commas as a thousands separator
    df['Money'] = df['Money'].str.replace(r'\$|,','', regex=True).astype(np.int64)
    
    return df

pga = clean_df(pga)
pga = transform_df(pga)
pga.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1674 entries, 0 to 1677
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Player Name         1674 non-null   object 
 1   Rounds              1674 non-null   float64
 2   Fairway Percentage  1674 non-null   float64
 3   Year                1674 non-null   int64  
 4   Avg Distance        1674 non-null   float64
 5   gir                 1674 non-null   float64
 6   Average Putts       1674 non-null   float64
 7   Average Scrambling  1674 non-null   float64
 8   Average Score       1674 non-null   float64
 9   Points              1674 non-null   int64  
 10  Wins                1674 non-null   float64
 11  Top 10              1674 non-null   float64
 12  Average SG Putts    1674 non-null   float64
 13  Average SG Total    1674 non-null   float64
 14  SG:OTT              1674 non-null   float64
 15  SG:APR              1674 non-null   float64
 16  SG:ARG
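
To make the cleaning steps concrete, here is a minimal sketch on a few made-up rows (not taken from the Kaggle file). It shows that `thresh` counts non-missing values, that missing wins are treated as zero, and how the currency string becomes an integer. The threshold of 3 is chosen only to suit the four-column toy frame.

```python
import numpy as np
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "Player Name": ["A", "B", "C"],
    "Rounds": [80.0, np.nan, 60.0],
    "Wins": [2.0, np.nan, np.nan],
    "Money": ["$1,250,000", None, "$45,460"],
})

# thresh counts NON-missing values: keep rows with at least 3 of the 4 present.
# Row B has only its name filled in, so it is dropped.
kept = toy.dropna(thresh=3).copy()

# A missing win count is read as zero wins that season
kept["Wins"] = kept["Wins"].fillna(0)

# Strip "$" and "," before casting to integer dollars
kept["Money"] = kept["Money"].str.replace(r"[$,]", "", regex=True).astype(np.int64)

print(kept)
```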


Split the DataFrame by year.  Part of the motivation for this is to attempt to use the 2010-2017 data to predict the 2018 data.

In [81]:
pga_years = [pga[pga['Year']==year] for year in sorted(pga['Year'].unique())]
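
As an alternative sketch, the per-year frames can be keyed by season instead of list position, and the most recent season can be held out for evaluation. The toy frame and the names `by_year`, `train`, and `test` are assumptions for illustration; they are not from the original notebook.

```python
import pandas as pd

# Made-up stand-in for the cleaned `pga` frame
toy = pd.DataFrame({
    "Player Name": ["A", "B", "A", "B", "A"],
    "Year": [2016, 2016, 2017, 2017, 2018],
    "Average Score": [70.1, 71.3, 69.8, 71.0, 70.4],
})

# Keying by year avoids having to remember which list index is which season
by_year = {year: frame for year, frame in toy.groupby("Year")}

# Hold out the most recent season; everything earlier is available for fitting
holdout_year = toy["Year"].max()
train = toy[toy["Year"] < holdout_year]
test = toy[toy["Year"] == holdout_year]

print(sorted(by_year), len(train), len(test))
```

Splitting on time rather than at random keeps later seasons out of the training data, which matches the stated goal of predicting 2018 from earlier years.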

**Data Preparation**<a name="dm32"></a> also requires some preliminary exploration to decide how to answer our questions.

In [13]:
pga.sort_values('Money', ascending=False)

| | Player Name | Rounds | Fairway Percentage | Year | Avg Distance | gir | Average Putts | Average Scrambling | Average Score | Points | Wins | Top 10 | Average SG Putts | Average SG Total | SG:OTT | SG:APR | SG:ARG | Money |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 647 | Jordan Spieth | 91.0 | 62.91 | 2015 | 291.8 | 67.87 | 27.82 | 65.03 | 68.938 | 4169 | 4.0 | 14.0 | 0.571 | 2.154 | 0.494 | 0.618 | 0.471 | 12030465 |
| 361 | Justin Thomas | 86.0 | 54.09 | 2017 | 309.3 | 67.33 | 28.29 | 60.54 | 69.359 | 2689 | 4.0 | 9.0 | 0.332 | 1.724 | 0.452 | 0.738 | 0.289 | 9921560 |
| 303 | Jordan Spieth | 85.0 | 59.48 | 2017 | 295.0 | 69.97 | 28.34 | 61.75 | 68.846 | 2671 | 3.0 | 8.0 | 0.278 | 1.988 | 0.321 | 0.896 | 0.429 | 9433033 |
| 729 | Jason Day | 75.0 | 55.94 | 2015 | 313.7 | 70.83 | 28.44 | 65.34 | 69.161 | 2459 | 3.0 | 8.0 | 0.586 | 2.106 | 0.772 | 0.461 | 0.287 | 9403330 |
| 520 | Dustin Johnson | 87.0 | 57.17 | 2016 | 313.6 | 67.82 | 28.49 | 59.58 | 69.172 | 2701 | 2.0 | 12.0 | 0.328 | 1.993 | 1.117 | 0.477 | 0.070 | 9365185 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1173 | Kyle Thompson | 51.0 | 62.65 | 2012 | 284.2 | 64.78 | 30.13 | 48.99 | 72.296 | 44 | 0.0 | 0.0 | -1.209 | -1.398 | -0.204 | 0.204 | -0.189 | 45460 |
| 183 | Andrew Yun | 51.0 | 53.58 | 2018 | 290.6 | 61.85 | 30.20 | 44.01 | 73.624 | 16 | 0.0 | 0.0 | -0.600 | -1.936 | -0.457 | -0.447 | -0.432 | 41566 |
| 543 | Robert Allenby | 51.0 | 55.06 | 2016 | 282.2 | 63.73 | 30.41 | 50.75 | 73.117 | 9 | 0.0 | 0.0 | -0.491 | -1.950 | -0.631 | -0.460 | -0.368 | 25271 |
| 79 | Kyle Thompson | 50.0 | 62.99 | 2018 | 285.3 | 62.44 | 29.61 | 54.02 | 72.673 | 10 | 0.0 | 0.0 | -0.750 | -1.262 | -0.534 | 0.036 | -0.014 | 24878 |
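
One possible next step in this exploration is to rank the numeric columns by how strongly they move with `Money`. The sketch below uses Spearman rank correlation, which is a choice made here (earnings span several orders of magnitude in the table above), not something the original notebook specifies. The toy values are made up so the example runs on its own.

```python
import pandas as pd

# Made-up stand-in for the cleaned `pga` frame
toy = pd.DataFrame({
    "Player Name": ["A", "B", "C", "D", "E"],
    "Rounds": [91.0, 86.0, 75.0, 60.0, 51.0],
    "Average Score": [68.9, 69.4, 70.2, 71.5, 73.1],
    "Money": [9000000, 5000000, 1000000, 200000, 50000],
})

# Rank correlation of every numeric column with Money;
# non-numeric columns such as the player name are excluded first
corr_with_money = (
    toy.select_dtypes("number")
       .corr(method="spearman")["Money"]
       .drop("Money")
       .sort_values()
)

print(corr_with_money)
```

In this toy frame lower scores always pair with higher earnings, so `Average Score` sits at -1; on the real data the values would be less extreme.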
