# Initial Exploration of Golf Dataset
##### The purpose is to understand the amount of missing data, the cleaning required, and the basic distribution of each column.

### Loading Packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Loading Data

In [2]:
raw_file = pd.read_csv('../data/pga_2015-2022.csv')
raw_file

Unnamed: 0,Player_initial_last,tournament id,player id,hole_par,strokes,hole_DKP,hole_FDP,hole_SDP,streak_DKP,streak_FDP,...,purse,season,no_cut,Finish,sg_putt,sg_arg,sg_app,sg_ott,sg_t2g,sg_total
0,A. Ancer,401353224,9261,288,289,60.0,51.1,56,3,7.6,...,12.0,2022,0,T32,0.20,-0.13,-0.08,0.86,0.65,0.85
1,A. Hadwin,401353224,5548,288,286,72.5,61.5,61,8,13.0,...,12.0,2022,0,T18,0.36,0.75,0.31,0.18,1.24,1.60
2,A. Lahiri,401353224,4989,144,147,21.5,17.4,27,0,0.0,...,12.0,2022,0,CUT,-0.56,0.74,-1.09,0.37,0.02,-0.54
3,A. Long,401353224,6015,144,151,20.5,13.6,17,0,0.4,...,12.0,2022,0,CUT,-1.46,-1.86,-0.02,0.80,-1.08,-2.54
4,A. Noren,401353224,3832,144,148,23.5,18.1,23,0,1.2,...,12.0,2022,0,CUT,0.53,-0.36,-1.39,0.19,-1.56,-1.04
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36859,V. Singh,2271,392,144,146,33.0,26.4,26,0,0.6,...,6.0,2015,0,,,,,,,
36860,W. Kim,2271,7082,144,150,18.5,12.9,21,0,0.2,...,6.0,2015,0,,,,,,,
36861,W. McGirt,2271,3532,216,215,44.5,40.6,45,0,6.2,...,6.0,2015,0,,,,,,,
36862,Z. Blair,2271,9040,288,278,73.0,70.8,74,3,23.2,...,6.0,2015,0,,,,,,,


The dataset has 36,864 rows and 37 columns.

In [3]:
# Group column names by their pandas dtype
data_types = {}
for colname in raw_file.columns:
    col_dtype = raw_file[colname].dtype
    data_types.setdefault(col_dtype, []).append(colname)

# Print each dtype, its columns, and the column count
for dtype, colnames in data_types.items():
    print(dtype, colnames, len(colnames), '\n')

object ['Player_initial_last', 'player', 'tournament name', 'course', 'date', 'Finish'] 6 

int64 ['tournament id', 'player id', 'hole_par', 'strokes', 'hole_SDP', 'streak_DKP', 'streak_SDP', 'n_rounds', 'made_cut', 'finish_DKP', 'finish_FDP', 'finish_SDP', 'total_SDP', 'season', 'no_cut'] 15 

float64 ['hole_DKP', 'hole_FDP', 'streak_FDP', 'pos', 'total_DKP', 'total_FDP', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4', 'purse', 'sg_putt', 'sg_arg', 'sg_app', 'sg_ott', 'sg_t2g', 'sg_total'] 16 



The dataset contains three data types: object (strings; 6 columns), int64 (15 columns), and float64 (16 columns).
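The same dtype tally can be obtained without an explicit loop. The sketch below is a minimal illustration on a hypothetical three-column toy frame (`toy` stands in for `raw_file`; its values are only illustrative). It uses `DataFrame.dtypes` with `value_counts()` for the counts and `select_dtypes` to pull out the numeric columns.

```python
import pandas as pd

# Hypothetical toy frame standing in for raw_file (assumption, not the real data)
toy = pd.DataFrame({
    "player": ["A. Ancer", "A. Hadwin"],
    "strokes": [289, 286],
    "sg_total": [0.85, 1.60],
})

# Number of columns per dtype, keyed by the dtype's name
dtype_counts = toy.dtypes.astype(str).value_counts()

# Names of all numeric (integer and float) columns
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

print(dtype_counts)
print(numeric_cols)
```

Applied to `raw_file`, `select_dtypes(include="number")` would give the columns that are candidates for distribution plots later on.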

Three column names stand out: Unnamed: 2 through Unnamed: 4. (And what happened to Unnamed: 1?) Names of this form are what pandas assigns to columns with a blank header, so these columns are likely empty.

Let's look at the missing values for each column.

In [9]:
for colname in raw_file.columns:
    num_null = raw_file[colname].isnull().sum()
    print(colname, ":", round(100*num_null/len(raw_file)), "%")

Player_initial_last : 0 %
tournament id : 0 %
player id : 0 %
hole_par : 0 %
strokes : 0 %
hole_DKP : 0 %
hole_FDP : 0 %
hole_SDP : 0 %
streak_DKP : 0 %
streak_FDP : 0 %
streak_SDP : 0 %
n_rounds : 0 %
made_cut : 0 %
pos : 42 %
finish_DKP : 0 %
finish_FDP : 0 %
finish_SDP : 0 %
total_DKP : 0 %
total_FDP : 0 %
total_SDP : 0 %
player : 0 %
Unnamed: 2 : 100 %
Unnamed: 3 : 100 %
Unnamed: 4 : 100 %
tournament name : 0 %
course : 0 %
date : 0 %
purse : 0 %
season : 0 %
no_cut : 0 %
Finish : 21 %
sg_putt : 21 %
sg_arg : 21 %
sg_app : 21 %
sg_ott : 21 %
sg_t2g : 21 %
sg_total : 21 %


As expected, Unnamed: 2-4 are entirely null: every value is missing.
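Columns that are entirely null carry no information and can be dropped during cleaning. The following is a minimal sketch on a hypothetical toy frame (the values are illustrative, and `toy` is an assumption standing in for `raw_file`). It lists the all-null columns first, so the drop can be reviewed, and then removes them with `dropna(axis=1, how="all")`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for raw_file (assumption, not the real data)
toy = pd.DataFrame({
    "player": ["A. Ancer", "V. Singh"],
    "Unnamed: 2": [np.nan, np.nan],
    "sg_total": [0.85, np.nan],
})

# Columns in which every value is null
empty_cols = toy.columns[toy.isnull().all()].tolist()

# Drop only those columns; partially filled columns such as sg_total are kept
cleaned = toy.dropna(axis=1, how="all")

print(empty_cols)
print(cleaned.columns.tolist())
```

Using `how="all"` rather than the default `how="any"` matters here: the default would also drop every column that has even one missing value.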

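We can also see that pos (position) is missing 42% of values. This is plausible, since just under half of the players are cut before the final two rounds of a typical golf tournament and so have no finishing position. The sg_ columns (strokes gained) and Finish are each missing 21% of values. That is half of the pos figure, but the rows displayed above suggest the cut is not the cause: players who missed the cut in 2022 (for example A. Lahiri) still have sg_ values, while the 2015 rows at the end of the table lack both Finish and sg_ values even for a player who completed four rounds (Z. Blair, hole_par 288). The missing sg_ data therefore appears to cover whole tournaments or seasons rather than individual cut players.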
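Explanations for missing data like these can be checked by comparing the share of nulls across groups. The sketch below is a minimal illustration on a hypothetical toy frame. The rows are made up, `missing_share_by` is a helper name introduced here, and it assumes `made_cut` is coded 1 for players who made the cut and 0 otherwise.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; values are illustrative, not taken from the dataset.
# Assumption: made_cut is 1 if the player made the cut, 0 otherwise.
toy = pd.DataFrame({
    "season":   [2022, 2022, 2022, 2015, 2015, 2015],
    "made_cut": [1, 0, 0, 1, 0, 1],
    "pos":      [32.0, np.nan, np.nan, 10.0, np.nan, 5.0],
    "sg_total": [0.85, -0.54, -2.54, np.nan, np.nan, np.nan],
})

def missing_share_by(df, group_col, value_cols):
    """Fraction of null values in each of value_cols within each group."""
    flags = df[value_cols].isnull()
    return flags.groupby(df[group_col]).mean()

by_cut = missing_share_by(toy, "made_cut", ["pos", "sg_total"])
by_season = missing_share_by(toy, "season", ["pos", "sg_total"])

print(by_cut)
print(by_season)
```

If a column's nulls are driven by the cut, its missing share should be close to 1 for `made_cut == 0` and close to 0 otherwise. If they are driven by data availability, the split by `season` (or by tournament) will separate them more cleanly.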

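All other columns show 0% missing, which is nice. Because the percentages are rounded to whole numbers, a column with only a handful of nulls would also show 0%, so exact null counts are worth confirming.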
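A per-column summary that reports exact null counts next to the percentages avoids relying on rounded figures. The following is a minimal sketch on a hypothetical toy frame. `missing_summary` is a helper name introduced here, and the toy values are illustrative only.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Null count and percentage per column, most-missing first."""
    n_null = df.isnull().sum()
    summary = pd.DataFrame({
        "n_null": n_null,
        "pct_null": (100 * n_null / len(df)).round(2),
    })
    return summary.sort_values("n_null", ascending=False)

# Hypothetical toy frame standing in for raw_file (assumption, not the real data)
toy = pd.DataFrame({
    "player": ["A. Ancer", "A. Hadwin", "V. Singh", "W. Kim"],
    "Unnamed: 2": [np.nan] * 4,
    "sg_total": [0.85, 1.60, np.nan, np.nan],
})

summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(raw_file)` would give the same information as the loop above, with the added `n_null` column showing whether a "0 %" column is truly complete.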