## 2021: Week 6 - Comparing Prize Money for Professional Golfers

What's one of the benefits of preparing your own data?
Being able to start your analysis sooner!

Sometimes I find that opening Tableau Desktop to explore my data gets a little distracting, because I start trying to visualise it before I've decided on the story. Starting my analysis of the dataset in Tableau Prep helps me, personally, to stay more focused! It's clear where the outliers are, what the distribution of the dataset is and therefore what the story should be.

For this week's challenge we're looking at a dataset that was used in December 2020 for Sports Viz Sunday (thanks to Kate Brown for sharing!). This dataset comes from the PGA and LPGA 2019 golf tours and lists the total prize money for the top 100 players on each tour. For those of us who aren't too familiar with golf, the PGA is the men's tour, whilst the LPGA is the women's tour.

### Input

![img](https://1.bp.blogspot.com/-n1nJAhjFwFE/YB1DtLB2OrI/AAAAAAAAAtc/okyuUbZ672006nCq_cenaAu_9SWa1HlBgCLcBGAsYHQ/w400-h223/2021W06%2BInput.png)

### Requirements

- Input the data
- Answer these questions:
    - What's the Total Prize Money earned by players for each tour?
    - How many players are in this dataset for each tour?
    - How many events in total did players participate in for each tour?
    - How much do players win per event? What's the average of this for each tour?
    - How do players rank by prize money for each tour? What about overall? What is the average difference between where they are ranked within their tour compared to the overall rankings where both tours are combined?
        - Here we would like the difference to be positive, as you would presume that combining the tours would cause a player's rank number to increase
- Combine the answers to these questions into one dataset
- Pivot the data so that we have a column for each tour, with each row representing an answer to the above questions
- Clean up the Measure field and create a new column showing the difference between the tours for each measure
    - We're subtracting the PGA value from the LPGA value (LPGA - PGA), so in most instances this number will be negative
- Output the data

### Output

![img2](https://1.bp.blogspot.com/-iKTMfxcBhx8/YCPot6fySAI/AAAAAAAAAvE/KqpS4RH8QQo_0HJMnXXwFfjLycQn_CQPwCLcBGAsYHQ/w400-h131/2021W06%2BOutput.png)

4 fields
- Measure
- PGA
- LPGA
- Difference between tours

5 rows (6 including headers)

In [180]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

In [181]:
df = pd.read_excel("./data/PGALPGAMoney2019.xlsx")

In [182]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   PLAYER NAME  200 non-null    object
 1   MONEY        200 non-null    int64 
 2   EVENTS       200 non-null    int64 
 3   TOUR         200 non-null    object
dtypes: int64(2), object(2)
memory usage: 6.4+ KB


In [183]:
df.describe(include="number").T

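|  | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MONEY | 200.0 | 1575683.83 | 1515693.82 | 127365.0 | 407763.5 | 1249998.0 | 2168541.75 | 9684006.0 |
| EVENTS | 200.0 | 22.74 | 4.17 | 8.0 | 21.0 | 23.0 | 26.0 | 35.0 |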


In [184]:
df.describe(include="object").T

|  | count | unique | top | freq |
| --- | --- | --- | --- | --- |
| PLAYER NAME | 200 | 200 | Brooks Koepka | 1 |
| TOUR | 200 | 2 | PGA | 100 |


In [185]:
df.head()

|  | PLAYER NAME | MONEY | EVENTS | TOUR |
| --- | --- | --- | --- | --- |
| 0 | Brooks Koepka | 9684006 | 21 | PGA |
| 1 | Rory McIlroy | 7785286 | 19 | PGA |
| 2 | Matt Kuchar | 6294690 | 22 | PGA |
| 3 | Patrick Cantlay | 6121488 | 21 | PGA |
| 4 | Gary Woodland | 5690965 | 24 | PGA |


### What's the Total Prize Money earned by players for each tour?

In [186]:
total_prize = df.groupby(["TOUR"])["MONEY"].sum()
total_prize

TOUR
LPGA     58410411
PGA     256726356
Name: MONEY, dtype: int64

### How many players are in this dataset for each tour?

In [187]:
player_num = df.groupby(["TOUR"])["PLAYER NAME"].count()
player_num

TOUR
LPGA    100
PGA     100
Name: PLAYER NAME, dtype: int64

### How many events in total did players participate in for each tour?

In [188]:
total_events = df.groupby(["TOUR"])["EVENTS"].sum()
total_events

TOUR
LPGA    2266
PGA     2282
Name: EVENTS, dtype: int64

### How much do players win per event?

In [189]:
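# MONEY / EVENTS is already a per-player value, so plain column division is enough.
# Sorting by player name keeps the same row order the groupby version produced.
grouped = (df.assign(**{"Avg Money per Event": df["MONEY"] / df["EVENTS"]})
             .loc[:, ["PLAYER NAME", "TOUR", "Avg Money per Event"]]
             .sort_values(["PLAYER NAME", "TOUR"])
             .reset_index(drop=True))
grouped.sample(10)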

|  | PLAYER NAME | TOUR | Avg Money per Event |
| --- | --- | --- | --- |
| 195 | Wei-Ling Hsu | LPGA | 13343.25 |
| 61 | Emiliano Grillo | PGA | 76178.87 |
| 191 | Troy Merritt | PGA | 72911.57 |
| 112 | Kiradech Aphibarnrat | PGA | 81503.65 |
| 178 | Si Woo Kim | PGA | 78278.86 |
| 62 | Eun-Hee Ji | LPGA | 32903.96 |
| 134 | Max Homa | PGA | 82544.24 |
| 160 | Rickie Fowler | PGA | 197290.5 |
| 70 | Hannah Green | LPGA | 45371.17 |
| 98 | Joaquin Niemann | PGA | 51232.82 |


### What's the average of this for each tour?

In [190]:
avg_money_per_event = grouped.groupby(["TOUR"])["Avg Money per Event"].mean().round(0).astype(int)
avg_money_per_event

TOUR
LPGA     25525
PGA     120282
Name: Avg Money per Event, dtype: int32
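
The figure above is the mean of each player's own money-per-event ratio. A different reading of the question divides each tour's total money by its total event entries, and the two do not have to agree. The sketch below only reuses the per-tour totals and averages already printed in this notebook; the variable names are illustrative, not part of the original solution.

```python
import pandas as pd

# Per-tour totals copied from the notebook outputs above.
totals = pd.DataFrame(
    {"MONEY": [58410411, 256726356], "EVENTS": [2266, 2282]},
    index=pd.Index(["LPGA", "PGA"], name="TOUR"),
)

# Ratio of totals: every event entry carries the same weight.
ratio_of_totals = totals["MONEY"] / totals["EVENTS"]

# Mean of per-player ratios, as reported by the cell above.
mean_of_player_ratios = pd.Series({"LPGA": 25525, "PGA": 120282})

comparison = pd.DataFrame({
    "ratio of totals": ratio_of_totals.round(0),
    "mean of player ratios": mean_of_player_ratios,
})
print(comparison)
```

The mean of per-player ratios weights every player equally, so a player with few events and large winnings pulls it up more than they would pull up the ratio of totals.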

### How do players rank by prize money for each tour?

In [191]:
ranking_per_tour = df.groupby(["TOUR"])["MONEY"].rank(ascending=False).reset_index(drop=True)
df["RANK_PER_TOUR"] = ranking_per_tour
df.head()

|  | PLAYER NAME | MONEY | EVENTS | TOUR | RANK_PER_TOUR |
| --- | --- | --- | --- | --- | --- |
| 0 | Brooks Koepka | 9684006 | 21 | PGA | 1.0 |
| 1 | Rory McIlroy | 7785286 | 19 | PGA | 2.0 |
| 2 | Matt Kuchar | 6294690 | 22 | PGA | 3.0 |
| 3 | Patrick Cantlay | 6121488 | 21 | PGA | 4.0 |
| 4 | Gary Woodland | 5690965 | 24 | PGA | 5.0 |

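The ranks come back as floats because `rank` defaults to `method="average"`, which gives tied values the mean of the positions they span. As a small sketch of how that differs from `method="min"`, using made-up prize values with a deliberate tie (these numbers are hypothetical and not from the dataset):

```python
import pandas as pd

# Hypothetical prize money with a deliberate tie; not taken from the dataset.
money = pd.Series([300, 200, 200, 100], name="MONEY")

# Default method="average": tied players share the mean of their positions.
avg_rank = money.rank(ascending=False)

# method="min": tied players all get the best position of the tie,
# which always yields whole numbers and can be cast to int.
min_rank = money.rank(ascending=False, method="min").astype(int)

print(pd.DataFrame({"MONEY": money, "average": avg_rank, "min": min_rank}))
```

Which method is appropriate depends on how ties should be presented; if every MONEY value is distinct, both methods give the same ranks.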

### What about overall ranking?

In [192]:
ranking_overall = df["MONEY"].rank(ascending=False)
df["RANK_OVERALL"] = ranking_overall
df.head()

|  | PLAYER NAME | MONEY | EVENTS | TOUR | RANK_PER_TOUR | RANK_OVERALL |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Brooks Koepka | 9684006 | 21 | PGA | 1.0 | 1.0 |
| 1 | Rory McIlroy | 7785286 | 19 | PGA | 2.0 | 2.0 |
| 2 | Matt Kuchar | 6294690 | 22 | PGA | 3.0 | 3.0 |
| 3 | Patrick Cantlay | 6121488 | 21 | PGA | 4.0 | 4.0 |
| 4 | Gary Woodland | 5690965 | 24 | PGA | 5.0 | 5.0 |


### What is the average difference between where they are ranked within their tour compared to the overall rankings where both tours are combined?

In [193]:
avg_rank_per_tour = df.groupby(["TOUR"])["RANK_PER_TOUR"].mean()
avg_rank_overall = df.groupby(["TOUR"])["RANK_OVERALL"].mean()
avg_diff_btw_tours = abs(avg_rank_per_tour - avg_rank_overall)
avg_diff_btw_tours

TOUR
LPGA   96.13
PGA     3.87
dtype: float64
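
The cell above subtracts two tour-level means and then takes the absolute value. An equivalent way to read the requirement is to compute the difference per player first and average it afterwards. Because a player's overall rank can never be better than their rank within their own tour, subtracting in the order overall minus tour already gives non-negative values. A minimal sketch, assuming two PGA rows from `df.head()` plus two placeholder LPGA rows that are invented for illustration:

```python
import pandas as pd

# PGA values are copied from df.head(); the LPGA values are hypothetical.
sample = pd.DataFrame({
    "TOUR": ["PGA", "PGA", "LPGA", "LPGA"],
    "MONEY": [9684006, 7785286, 1000000, 500000],
})

sample["RANK_PER_TOUR"] = sample.groupby("TOUR")["MONEY"].rank(ascending=False)
sample["RANK_OVERALL"] = sample["MONEY"].rank(ascending=False)

# Per-player difference: overall rank minus rank within the tour.
sample["RANK_DIFF"] = sample["RANK_OVERALL"] - sample["RANK_PER_TOUR"]
avg_rank_diff = sample.groupby("TOUR")["RANK_DIFF"].mean()

# Same result as subtracting the two tour-level means.
diff_of_means = (sample.groupby("TOUR")["RANK_OVERALL"].mean()
                 - sample.groupby("TOUR")["RANK_PER_TOUR"].mean())

print(avg_rank_diff)
```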

In [194]:
pd.options.display.float_format = "{:.2f}".format

final_output = pd.concat([total_prize, total_events, 
                          player_num, avg_money_per_event, avg_diff_btw_tours, ], axis=1, join="inner")

final_output.columns=["Total Prize Money", "Number of Events", 
                      "Number of Players", "Avg Money per Event", "Avg Difference in Ranking"]
final_output = final_output.transpose()
final_output = final_output.reset_index()
final_output = final_output.rename(columns={"index": "Measure"})
final_output.columns.name = None
final_output = final_output.loc[: , ["Measure", "PGA", "LPGA"]]
final_output["Difference between tours"] = final_output["LPGA"] - final_output["PGA"]
final_output

|  | Measure | PGA | LPGA | Difference between tours |
| --- | --- | --- | --- | --- |
| 0 | Total Prize Money | 256726356.0 | 58410411.0 | -198315945.0 |
| 1 | Number of Events | 2282.0 | 2266.0 | -16.0 |
| 2 | Number of Players | 100.0 | 100.0 | 0.0 |
| 3 | Avg Money per Event | 120282.0 | 25525.0 | -94757.0 |
| 4 | Avg Difference in Ranking | 3.87 | 96.13 | 92.26 |

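As an alternative to concatenating separate Series and renaming the columns by position, the first four measures could be built in one pass with named aggregation and then transposed. This is only a sketch on a tiny frame: the PGA rows come from `df.head()`, the LPGA rows and the `MONEY_PER_EVENT` helper column are made up for illustration.

```python
import pandas as pd

# PGA rows are copied from df.head(); the LPGA rows are hypothetical.
sample = pd.DataFrame({
    "PLAYER NAME": ["Brooks Koepka", "Rory McIlroy",
                    "Hypothetical Player A", "Hypothetical Player B"],
    "MONEY": [9684006, 7785286, 1000000, 500000],
    "EVENTS": [21, 19, 25, 20],
    "TOUR": ["PGA", "PGA", "LPGA", "LPGA"],
})
sample["MONEY_PER_EVENT"] = sample["MONEY"] / sample["EVENTS"]

# Named aggregation: each output column is labelled where it is defined,
# so the labels cannot drift out of step with the values.
summary = sample.groupby("TOUR").agg(**{
    "Total Prize Money": ("MONEY", "sum"),
    "Number of Players": ("PLAYER NAME", "count"),
    "Number of Events": ("EVENTS", "sum"),
    "Avg Money per Event": ("MONEY_PER_EVENT", "mean"),
})

# Transpose so each tour is a column and each measure is a row.
result = summary.T.rename_axis("Measure").reset_index()
result.columns.name = None
result = result[["Measure", "PGA", "LPGA"]]
result["Difference between tours"] = result["LPGA"] - result["PGA"]
print(result)
```

The ranking measure is left out here because it needs the two rank columns first; it could be added to the same `agg` call once a per-player rank difference column exists.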

In [195]:
# index=False keeps the file to the four required fields (no extra index column)
final_output.to_csv("./output/Week6_output.csv", index=False)