# The David De Gea Dilemma: Comparing Goalkeeper Greats Throughout History

### June 24th 2023

##### David De Gea is Manchester United's goalkeeper and has been the club's starter since joining in 2011. With a question mark looming over his contract after an underwhelming 2022-2023 campaign, part of the Manchester United fanbase is debating whether De Gea deserves to be called a United legend.

##### Let's inspect some data and assess how De Gea compares with retired goalkeepers who are regarded as legends, not just at United but at other major European clubs, namely:

1. Peter Schmeichel (Manchester United)
2. Edwin van der Sar (Manchester United)
3. Petr Cech (Chelsea)
4. Iker Casillas (Real Madrid)
5. Gianluigi Buffon (Juventus)

alongside these active players who have consistently performed at a high level:

6. Manuel Neuer (Bayern Munich)
7. Alisson Becker (Liverpool)
8. Ederson (Manchester City)
9. Thibaut Courtois (Real Madrid)

In [11]:
gks = ["ps","vds","cech","iker","buffon","neuer","alisson","ederson","courtois","ddg"]
full_name = ["Peter Schmeichel","Edwin van der Sar","Petr Cech","Iker Casillas","Gianluigi Buffon","Manuel Neuer","Alisson Becker","Ederson","Thibaut Courtois","David De Gea"]

## Penalty Kicks (+ shootouts)

#### Let's first observe how De Gea compares to these keepers regarding penalty kicks.

In [5]:
pk_save_rate = [3/37,11/60,17/85,23/100,39/124,22/76,14/33,7/54,14/66,14/74]

In [12]:
import plotly.graph_objects as go

fig = go.Figure(data=[go.Bar(x=gks, y=pk_save_rate,hovertext=full_name)])
# Customize aspect
fig.update_traces(marker_color='rgb(158,202,100)', marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5, opacity=0.6)
fig.update_layout(title_text='Penalty Kick Save Rates among top Keepers (all comps, excluding shootouts)')
fig.show()

##### While **De Gea**'s penalty-saving record is not something he can be particularly proud of, I, as a United fan, find it quite funny that his save rate is better than those of both van der Sar and Peter Schmeichel, two United legends.
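##### As a quick sanity check on that claim, here's a minimal sketch that ranks the ten keepers by save rate. The `pk_record` and `save_rate` names are ones I'm making up for this illustration; the numbers are just the fractions from `pk_save_rate` above.

```python
# Same order as the `gks` list used throughout this notebook.
gks = ["ps", "vds", "cech", "iker", "buffon", "neuer", "alisson", "ederson", "courtois", "ddg"]

# (saves, penalties faced), copied from the fractions in pk_save_rate.
# `pk_record` is a hypothetical helper name, not part of the original cells.
pk_record = [(3, 37), (11, 60), (17, 85), (23, 100), (39, 124),
             (22, 76), (14, 33), (7, 54), (14, 66), (14, 74)]

save_rate = {gk: saved / faced for gk, (saved, faced) in zip(gks, pk_record)}

# Highest save rate first.
ranking = sorted(save_rate, key=save_rate.get, reverse=True)

for rank, gk in enumerate(ranking, start=1):
    print(f"{rank:>2}. {gk:<9} {save_rate[gk]:.1%}")
```

With these numbers, De Gea lands seventh of the ten, just ahead of van der Sar, with Schmeichel last.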

##### This time, let's try using [KL divergence](https://en.wikipedia.org/wiki/Kullback–Leibler_divergence) to compare this data. KL divergence, in short, lets us compare two probability distributions by calculating the expected log-ratio of their probabilities, with the expectation taken under the first distribution.

##### I couldn't find a consensus figure for penalty conversion after going through several data sources, but the number appears to be somewhere north of 80 percent. So let's assume that players have an 85% chance of scoring and a 15% chance of not scoring, for the sake of simplicity. (Some data sources set aside a separate percentage for players missing the goal completely, but let's combine that with the GK saving the penalty, because in my opinion a miss of any kind can be attributed to the keeper. It's a mental game!)

##### To that end, our base distribution will be $p_{scored} = .85$ and $p_{saved}=.15$ (the implication being that the average keeper, in the context of PKs, will prevail against the shooter only 15 percent of the time), to which we will compare each goalkeeper's individual penalty kick distribution.
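##### For reference, here is the standard definition of KL divergence for discrete distributions, written out for our two-outcome case. The symbol $s$ for a keeper's save rate is one I'm introducing just for this illustration, so that the keeper's distribution is $q = (1-s,\ s)$:

$$
D_{KL}(p \,\|\, q) = \sum_{i} p_i \ln\frac{p_i}{q_i} = 0.85 \ln\frac{0.85}{1-s} + 0.15 \ln\frac{0.15}{s}
$$

The natural logarithm is used here because that is what `np.log` computes in the code.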

In [13]:
p = [.85,.15]

In [32]:
import numpy as np

def kl_divergence(p, q):
    # D(p || q): expected log-ratio of p to q, taken under p.
    # The built-in sum avoids NumPy's deprecated np.sum(generator).
    return sum(p[i] * np.log(p[i] / q[i]) for i in range(len(p)))

In [34]:
gks_kld = []

for save_rate in pk_save_rate:
    q = [1-save_rate, save_rate]
    gks_kld.append(kl_divergence(p,q))

fig1= go.Figure(data=[go.Bar(x=gks, y=gks_kld,hovertext=full_name)])
# Customize aspect
fig1.update_traces(marker_color='rgb(158,202,100)', marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5, opacity=0.6)
fig1.update_layout(title_text='KL Divergence among top Keepers (entire career in all comps, excluding shootouts)')
fig1.show()

#### Two identical distributions produce a KL divergence of 0, so the more similar two distributions are, the closer the KL divergence will be to 0. Thus, we can infer that:

1. The likes of Edwin van der Sar, Petr Cech, Ederson, and **De Gea** are pretty much average PK savers.
2. A higher KL divergence only means a keeper is further from average, not necessarily better. Peter Schmeichel's higher value doesn't imply that he's better than the previously mentioned keepers, but that he's **worse** than the average keeper at saving penalty kicks (we can infer this from the previous visualization, where we saw his 8 percent PK save rate, the lowest of the 10 keepers here).
3. Buffon and Neuer are great at saving PKs, but not as great as Alisson!

(The subtle assumption here is that all penalty kicks are equally difficult, regardless of the competition, whether the keeper's team is losing or winning at the time it gives away the penalty, how good a penalty taker the GK is up against, etc.)
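##### Since KL divergence is never negative, the bar chart alone can't tell us whether a keeper is above or below average. One possible way around that, sketched below, is to attach a sign based on which side of the 15 percent baseline the keeper's save rate falls. This is my own illustrative convention, not a standard statistic, and `signed_kl` and `baseline_save_rate` are hypothetical names.

```python
import math

def kl_divergence(p, q):
    """KL divergence D(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def signed_kl(save_rate, baseline_save_rate=0.15):
    """KL divergence from the baseline, negated when the keeper is below it.

    Assumption: the sign is an illustrative convention only; it is not part
    of the definition of KL divergence.
    """
    p = [1 - baseline_save_rate, baseline_save_rate]  # average keeper
    q = [1 - save_rate, save_rate]                    # this keeper
    kld = kl_divergence(p, q)
    return kld if save_rate >= baseline_save_rate else -kld

print(signed_kl(3 / 37))   # Schmeichel: negative, below the baseline
print(signed_kl(14 / 33))  # Alisson: positive, above the baseline
print(signed_kl(0.15))     # identical to the baseline, so zero
```

With this convention, Schmeichel's bar would point down and Alisson's would point up, which matches how we read the chart above.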

#### However, this is an analysis based on non-shootout PKs, meaning that it's excluding some historic moments such as:

1. Edwin van der Sar's three penalty saves in the Community Shield (2007) against Chelsea, and his decisive save from Anelka in the Champions League Final (2008), also against Chelsea.
2. Petr Cech single-handedly securing Chelsea's first Champions League victory against Bayern Munich in 2012 by denying Olic and Schweinsteiger.
3. De Gea going zero for 11 (0/11) against Villarreal in the Europa League final shootout (2020/2021).
4. Neuer denying Kaka and Ronaldo and, with the help of Ramos sending his penalty to the moon, beating Real Madrid in the Champions League semi-final (2011/2012).

### (I'm having a hard time finding public data regarding shootouts, so I will try to manually collect this data on my own)