## 2021: Week 13 - Premier League Statistics

Before Simon joined The Data School in the UK, he was a professional sporting performance analyst. Simon has reached into his previous professional life to come up with a football (read soccer) based challenge for this week. 

Simon is channelling his inner fanalyst to use data to understand more about the game that he enjoys. 

This week we want to create a data set that allows us to analyse 'Open Play Goals' scored. We will rank the players overall and by their position. 

### Input

5 csv files, all with a similar structure. There are a lot of columns in these data sets.

![img](https://1.bp.blogspot.com/-LgrltPPfiQI/YGGYbH3V12I/AAAAAAAACIw/9RiC8fLny98MIw2GjjagIg1DpZBfPAp7gCLcBGAsYHQ/w640-h206/Screenshot%2B2021-03-29%2Bat%2B10.05.25.png)

### Requirement
Open play goal scoring prowess in the Premier League 2015-2020
1. Input all the files
2. Remove all goalkeepers from the data set
3. Remove all records where appearances = 0	
4. In this challenge we are interested in the goals scored from open play
    - Create a new “Open Play Goals” field (the goals scored from open play is the number of goals scored that weren’t penalties or freekicks)
    - Note some players will have scored free kicks or penalties with their left or right foot
    - Be careful how Prep handles null fields! (have a look at those penalty and free kick fields) 
    - Rename the original Goals scored field to Total Goals Scored
5. Calculate the totals for each of the key metrics across the whole time period for each player, (be careful not to lose their position)
6. Create an open play goals per appearance field across the whole time period
7. Rank the players for the amount of open play goals scored across the whole time period, we are only interested in the top 20 (including those that are tied for position) – Output 1
8. Rank the players for the amount of open play goals scored across the whole time period by position, we are only interested in the top 20 (including those that are tied for position) – Output 2
9. Output the data – in your solution on twitter / the forums, state the name of the player who was the only non-forward to make it into the overall top 20 for open play goals scored

### Output

- Overall Rank
![img](https://1.bp.blogspot.com/-xc4fVyBVWO0/YGMhwxNq8LI/AAAAAAAACJA/lEK27Th2KfclGSFXaPXihdLtWUOLzuTPQCLcBGAsYHQ/w640-h126/Screenshot%2B2021-03-30%2Bat%2B14.03.36.png)

- Rank by Position
![img](https://1.bp.blogspot.com/-K97KWwGSF6A/YGMiYn0oLoI/AAAAAAAACJI/Ey8XzxlsS8E6YFVa0H7YTFHY_SPpkpLEACLcBGAsYHQ/w640-h122/Screenshot%2B2021-03-30%2Bat%2B14.06.08.png)

Two files:
1. Overall Rank
- 22 Rows (23 including headers)
- 10 Fields:
    - Open Play Goals
    - Goals with Right Foot
    - Goals with Left Foot 
    - Position
    - Appearances
    - Rank
    - Total Goals
    - Open Play Goals / Game
    - Headed Goals
    - Name
2. Rank by Position
    - 65 Rows (66 including headers)
    - 10 Fields : as per the first output file

In [730]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Input all the files

In [731]:
pl_15_16 = pd.read_csv("./data/pl_15-16.csv")
pl_16_17 = pd.read_csv("./data/pl_16-17.csv")
pl_17_18 = pd.read_csv("./data/pl_17-18.csv")
pl_18_19 = pd.read_csv("./data/pl_18-19.csv")
pl_19_20 = pd.read_csv("./data/pl_19-20.csv")

In [732]:
df = pd.concat([pl_15_16,
                pl_16_17,
                pl_17_18,
                pl_18_19,
                pl_19_20], axis=0)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4247 entries, 0 to 973
Data columns (total 54 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Season                  4247 non-null   object 
 1   Name                    4247 non-null   object 
 2   Position                4247 non-null   object 
 3   Appearances             4247 non-null   int64  
 4   Clean sheets            1824 non-null   float64
 5   Goals conceded          1824 non-null   float64
 6   Tackles                 3768 non-null   float64
 7   Tackle success %        2883 non-null   object 
 8   Last man tackles        1345 non-null   float64
 9   Blocked shots           3768 non-null   float64
 10  Interceptions           3768 non-null   float64
 11  Clearances              3768 non-null   float64
 12  Headed Clearance        3768 non-null   float64
 13  Clearances off line     1345 non-null   float64
 14  Recoveries              2883 non-null   f

In [733]:
df["Name"] = df.Name.str.strip()

In [734]:
df.columns

Index(['Season', 'Name', 'Position', 'Appearances', 'Clean sheets',
       'Goals conceded', 'Tackles', 'Tackle success %', 'Last man tackles',
       'Blocked shots', 'Interceptions', 'Clearances', 'Headed Clearance',
       'Clearances off line', 'Recoveries', 'Duels won', 'Duels lost',
       'Successful 50/50s', 'Aerial battles won', 'Aerial battles lost',
       'Own goals', 'Errors leading to goal', 'Assists', 'Passes',
       'Passes per match', 'Big chances created', 'Crosses',
       'Cross accuracy %', 'Through balls', 'Accurate long balls',
       'Yellow cards', 'Red cards', 'Fouls', 'Offsides', 'Goals',
       'Headed goals', 'Goals with right foot', 'Goals with left foot',
       'Hit woodwork', 'Goals per match', 'Penalties scored',
       'Freekicks scored', 'Shots', 'Shots on target', 'Shooting accuracy %',
       'Big chances missed', 'Saves', 'Penalties saved', 'Punches',
       'High Claims', 'Catches', 'Sweeper clearances', 'Throw outs',
       'Goal Kicks'],
     

### Remove all goalkeepers from the data set

In [735]:
goalkeepers = df.loc[df["Position"] == "Goalkeeper"].index
df = df.drop(goalkeepers, axis=0)
df = df.reset_index(drop=True)

### Remove all records where appearances = 0

In [736]:
appearances = df.loc[df["Appearances"] == 0].index
df = df.drop(appearances, axis=0)
df = df.reset_index(drop=True)
df.shape

(1457, 54)

### In this challenge we are interested in the goals scored from open play
- Create a new “Open Play Goals” field (the goals scored from open play is the number of goals scored that weren’t penalties or freekicks)
- Note some players will have scored free kicks or penalties with their left or right foot
- Be careful how Prep handles null fields! (have a look at those penalty and free kick fields) 
- Rename the original Goals scored field to Total Goals Scored

In [737]:
df["Open Play Goals"] = df["Goals with left foot"] + df["Goals with right foot"] + df["Headed goals"] - df["Penalties scored"] - df["Freekicks scored"] 
df = df.rename(columns={"Goals": "Total Goals"})

### Calculate the totals for each of the key metrics across the whole time period for each player, (be careful not to lose their position)

In [738]:
grouped = df.groupby(["Name"])[["Open Play Goals", "Goals with right foot", "Goals with left foot", 
                                            "Appearances", "Total Goals", "Headed goals"]].agg("sum")

### Create an open play goals per appearance field across the whole time period

In [739]:
grouped["Open Play Goals/Game"] = grouped["Open Play Goals"] / grouped["Appearances"]

In [740]:
position = df[["Name", "Position"]]
position = position.drop_duplicates()

In [741]:
grouped = grouped.reset_index().merge(position, how="left", on="Name")

### Rank the players for the amount of open play goals scored across the whole time period, we are only interested in the top 20 (including those that are tied for position)

In [742]:
output_1 = grouped.sort_values(by="Open Play Goals", ascending=False).reset_index()
output_1["Rank"] = output_1["Open Play Goals"].rank(ascending=False, method="min")

In [743]:
output_1 = output_1.loc[:, ["Name", "Position", "Rank", "Total Goals", "Open Play Goals", "Goals with right foot",
                            "Goals with left foot", "Headed goals", "Appearances", "Open Play Goals/Game"]]
output_1[["Rank" ,"Total Goals", "Open Play Goals", "Goals with right foot", "Goals with left foot", "Headed goals"]] = output_1[["Rank" ,"Total Goals", "Open Play Goals", 
                                                                                                                                  "Goals with right foot", "Goals with left foot", "Headed goals"]].astype(int)
output_1.head()

Unnamed: 0,Name,Position,Rank,Total Goals,Open Play Goals,Goals with right foot,Goals with left foot,Headed goals,Appearances,Open Play Goals/Game
0,Sadio Mané,Forward,1,74,74,44,20,10,164,0.45122
1,Jamie Vardy,Forward,2,85,67,52,22,11,142,0.471831
2,Mohamed Salah,Forward,3,73,66,11,58,4,108,0.611111
3,Sergio Agüero,Forward,4,65,53,46,7,12,86,0.616279
4,Harry Kane,Forward,5,55,47,32,15,7,75,0.626667


In [744]:
output_1[output_1["Rank"].isin(np.arange(1, 21))]

Unnamed: 0,Name,Position,Rank,Total Goals,Open Play Goals,Goals with right foot,Goals with left foot,Headed goals,Appearances,Open Play Goals/Game
0,Sadio Mané,Forward,1,74,74,44,20,10,164,0.45122
1,Jamie Vardy,Forward,2,85,67,52,22,11,142,0.471831
2,Mohamed Salah,Forward,3,73,66,11,58,4,108,0.611111
3,Sergio Agüero,Forward,4,65,53,46,7,12,86,0.616279
4,Harry Kane,Forward,5,55,47,32,15,7,75,0.626667
5,Eden Hazard,Forward,6,48,40,35,13,0,138,0.289855
6,Romelu Lukaku,Forward,7,41,39,9,23,9,71,0.549296
7,Son Heung-Min,Forward,7,39,39,20,16,3,126,0.309524
8,Chris Wood,Forward,9,34,33,12,8,14,94,0.351064
9,Roberto Firmino,Forward,9,34,33,19,8,7,106,0.311321


In [745]:
output_1.to_csv("./output/Week12_output_1.csv")

### Rank the players for the amount of open play goals scored across the whole time period by position, we are only interested in the top 20 (including those that are tied for position) 

In [747]:
output_2 = grouped.copy()
output_2["Rank by Position"] = output_2.groupby(["Position"])["Open Play Goals"].rank(ascending=False, method="min")

In [748]:
output_2["Rank by Position"] = output_2["Rank by Position"].astype(int)
grouped = output_2.groupby(["Position"])
defender = grouped.get_group("Defender")
defender.loc[defender["Rank by Position"].isin(np.arange(1, 21))]

Unnamed: 0,Name,Open Play Goals,Goals with right foot,Goals with left foot,Appearances,Total Goals,Headed goals,Open Play Goals/Game,Position,Rank by Position
1,Aaron Cresswell,0.0,1.0,5.0,124,6,0.0,0.0,Defender,2
5,Aaron Wan-Bissaka,0.0,0.0,0.0,77,0,0.0,0.0,Defender,2
8,Abdul Rahman Baba,0.0,0.0,0.0,15,0,0.0,0.0,Defender,2
14,Adam Masina,0.0,0.0,0.0,14,0,0.0,0.0,Defender,2
15,Adam Matthews,0.0,0.0,0.0,1,0,0.0,0.0,Defender,2
...,...,...,...,...,...,...,...,...,...,...
789,Yan Valery,0.0,1.0,1.0,34,2,0.0,0.0,Defender,2
794,Yohan Benalouane,0.0,0.0,0.0,11,0,0.0,0.0,Defender,2
796,Younes Kaboul,0.0,2.0,0.0,48,2,0.0,0.0,Defender,2
800,Zanka,0.0,1.0,1.0,62,3,1.0,0.0,Defender,2


In [None]:
forward = output_2.loc[output_2["Position"] == "Forward"].sort_values(by="Rank by Position", ascending=True).head(20)
midfielder = output_2.loc[output_2["Position"] == "Midfielder"].sort_values(by="Rank by Position", ascending=True).head(20)
defender = output_2.loc[output_2["Position"] == "Defender"].sort_values(by="Rank by Position", ascending=True).head(20)
output_2 = pd.concat([forward, midfielder, defender], axis=0)
output_2.reset_index(drop=True)

In [645]:
output_2.to_csv("./output/Week12_output_2.csv")