## Ordinary vs Bayesian Regression for Soccer Analytics

<b> Team Members </b>: Ahmed Rizk - 70995758 <br>
<b> Theme </b>: Comparison of a Bayesian estimator with a non-Bayesian estimator

This project was inspired by this blog post: https://jramkiss.github.io/2020/03/01/regression-vs-bayesian-regression/

#### Github Repo

https://github.com/izk20/STAT447C   

All commits were made by me.

#### Introduction

Data analytics is becoming a crucial part of soccer. Top teams use data on player performance and opposition analysis to inform player recruitment, player development and even tactical strategy. Given the nature of the data collected, frequentist methods are predominantly used. This project explores Bayesian regression approaches to predicting goals or assists and asks whether they could perform better than frequentist approaches.

#### Candidate Datasets

The two original candidate datasets are:

https://www.kaggle.com/datasets/koklengyeo/big-5-player-statistics-2021-2022-season?select=Passing.csv

https://www.kaggle.com/datasets/koklengyeo/big-5-player-statistics-2021-2022-season?select=Shooting.csv 

Both datasets contain real statistics for players in the top 5 leagues during the 2021-2022 season, scraped from FBREF. The former contains passing and chance creation metrics, and the latter contains shooting and goal scoring metrics. Minimal modifications have been made: columns were renamed, and players with low playing time (fewer than 15 full 90s played) were removed, as were defenders and goalkeepers. The aim of these modifications is to remove outliers (e.g. players with outstanding stats but very few minutes of playing time) as well as players whose role on the field is not related to goals or chance creation.
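The filtering step described above can be sketched in base R as follows. This is a minimal illustration on a made-up data frame, not the script used to produce the filtered files: the toy rows are invented, and dropping any position code that contains `DF` or `GK` (including hybrids) is an assumption about how such codes were handled.

```r
# Toy stand-in for the raw FBREF export; all values are invented for illustration.
raw <- data.frame(
  Player = c("A", "B", "C", "D", "E"),
  Pos    = c("FW", "DF", "MF", "GK", "MFFW"),
  X90s   = c(20.1, 30.0, 9.5, 34.0, 15.0)
)

min_90s <- 15  # threshold stated in the text

# Keep players with at least 15 full 90s whose position code contains
# neither DF nor GK (assumed treatment of hybrid position codes).
keep <- raw$X90s >= min_90s & !grepl("DF|GK", raw$Pos)
filtered <- raw[keep, ]
filtered
```

Only the rows that pass both conditions remain, so a player with very few minutes cannot dominate the per-90 statistics.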

In [26]:
library(tidyverse)

── Attaching core tidyverse packages ── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────

In [25]:
passing <- read.csv('Passing_filtered.csv')
head(passing)

| | Player | Nation | Pos | Squad | Comp | Age | Born | X90s | total_cmp | total_att | ⋯ | long_cmp. | Ast | xAG | xA | A.xAG | KP | final_third | PPA | CrsPA | PrgP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<int>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | Salis Abdul Samed | gh GHA | MF | Clermont Foot | fr Ligue 1 | 21 | 2000 | 27.4 | 56.0 | 62.1 | ⋯ | 83.7 | 0.0 | 0.03 | 0.03 | -0.03 | 0.62 | 3.18 | 0.47 | 0.04 | 3.94 |
| 2 | Laurent Abergel | fr FRA | MF | Lorient | fr Ligue 1 | 28 | 1993 | 32.8 | 40.9 | 50.6 | ⋯ | 64.4 | 0.06 | 0.11 | 0.08 | -0.05 | 1.07 | 4.48 | 0.7 | 0.27 | 5.21 |
| 3 | Tammy Abraham | eng ENG | FW | Roma | it Serie A | 23 | 1997 | 34.3 | 14.1 | 20.3 | ⋯ | 71.8 | 0.12 | 0.14 | 0.07 | -0.02 | 1.11 | 1.02 | 0.79 | 0.09 | 2.04 |
| 4 | Che Adams | sct SCO | FW | Southampton | eng Premier League | 25 | 1996 | 22.7 | 13.4 | 20.8 | ⋯ | 52.9 | 0.13 | 0.13 | 0.07 | 0.0 | 1.15 | 1.06 | 0.44 | 0.0 | 1.45 |
| 5 | Tyler Adams | us USA | MF | RB Leipzig | de Bundesliga | 22 | 1999 | 15.0 | 52.9 | 61.2 | ⋯ | 69.0 | 0.07 | 0.06 | 0.09 | 0.01 | 0.47 | 5.67 | 0.53 | 0.0 | 6.4 |
| 6 | Yacine Adli | fr FRA | MFFW | Bordeaux | fr Ligue 1 | 21 | 2000 | 25.1 | 36.1 | 49.8 | ⋯ | 56.3 | 0.28 | 0.23 | 0.17 | 0.05 | 2.35 | 4.86 | 1.43 | 0.16 | 6.53 |


In [29]:
glimpse(passing)

Rows: 666
Columns: 31
$ Player      <chr> "Salis Abdul Samed", "Laurent Abergel", "Tammy Abraham", "…
$ Nation      <chr> "gh GHA", "fr FRA", "eng ENG", "sct SCO", "us USA", "fr FR…
$ Pos         <chr> "MF", "MF", "FW", "FW", "MF", "MFFW", "MF", "FW", "MF", "M…
$ Squad       <chr> "Clermont Foot", "Lorient", "Roma", "Southampton", "RB Lei…
$ Comp        <chr> "fr Ligue 1", "fr Ligue 1", "it Serie A", "eng Premier Lea…
$ Age         <int> 21, 28, 23, 25, 22, 21, 19, 27, 22, 28, 30, 23, 17, 30, 26…
$ Born        <int> 2000, 1993, 1997, 1996, 1999, 2000, 2002, 1994, 1999, 1992…
$ X90s        <dbl> 27.4, 32.8, 34.3, 22.7, 15.0, 25.1, 20.8, 30.4, 15.8, 26.1…
$ total_cmp   <dbl> 56.0, 40.9, 14.1, 13.4, 52.9, 36.1, 40.2, 21.5, 25.2, 62.3…
$ total_att   <dbl> 62.1, 50.6, 20.3, 20.8, 61.2, 49.8, 47.9, 29.8,

In [22]:
shooting <- read.csv('Shooting_filtered.csv')
tail(shooting)

| | Player | Nation | Pos | Squad | Comp | Age | Born | X90s | Gls | Sh | ⋯ | G.SoT | Dist | FK | PK | PKatt | xG | npxG | npxG.Sh | G.xG | np.G.xG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<int>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 661 | Duván Zapata | co COL | FW | Atalanta | it Serie A | 30 | 1991 | 19.1 | 0.52 | 4.08 | ⋯ | 0.23 | 13.3 | 0.0 | 0.16 | 0.16 | 0.71 | 0.59 | 0.14 | -0.18 | -0.22 |
| 662 | Piotr Zieliński | pl POL | MF | Napoli | it Serie A | 27 | 1994 | 23.2 | 0.26 | 1.85 | ⋯ | 0.4 | 20.0 | 0.04 | 0.0 | 0.0 | 0.15 | 0.15 | 0.08 | 0.11 | 0.11 |
| 663 | Martín Zubimendi | es ESP | MF | Real Sociedad | es La Liga | 22 | 1999 | 28.8 | 0.07 | 0.69 | ⋯ | 0.22 | 13.2 | 0.0 | 0.0 | 0.0 | 0.08 | 0.08 | 0.11 | -0.01 | -0.01 |
| 664 | Szymon Żurkowski | pl POL | MF | Empoli | it Serie A | 23 | 1997 | 25.6 | 0.23 | 1.84 | ⋯ | 0.35 | 19.2 | 0.0 | 0.0 | 0.0 | 0.14 | 0.14 | 0.08 | 0.09 | 0.09 |
| 665 | Martin Ødegaard | no NOR | MFFW | Arsenal | eng Premier League | 22 | 1998 | 30.9 | 0.23 | 1.72 | ⋯ | 0.33 | 21.0 | 0.39 | 0.0 | 0.0 | 0.16 | 0.16 | 0.09 | 0.07 | 0.07 |
| 666 | Milan Đurić | ba BIH | FW | Salernitana | it Serie A | 31 | 1990 | 24.1 | 0.21 | 2.03 | ⋯ | 0.27 | 11.1 | 0.0 | 0.04 | 0.04 | 0.22 | 0.19 | 0.09 | -0.01 | -0.02 |


In [28]:
glimpse(shooting)

Rows: 666
Columns: 25
$ Player  <chr> "Salis Abdul Samed", "Laurent Abergel", "Tammy Abraham", "Che …
$ Nation  <chr> "gh GHA", "fr FRA", "eng ENG", "sct SCO", "us USA", "fr FRA", …
$ Pos     <chr> "MF", "MF", "FW", "FW", "MF", "MFFW", "MF", "FW", "MF", "MF", …
$ Squad   <chr> "Clermont Foot", "Lorient", "Roma", "Southampton", "RB Leipzig…
$ Comp    <chr> "fr Ligue 1", "fr Ligue 1", "it Serie A", "eng Premier League"…
$ Age     <int> 21, 28, 23, 25, 22, 21, 19, 27, 22, 28, 30, 23, 17, 30, 26, 27…
$ Born    <int> 2000, 1993, 1997, 1996, 1999, 2000, 2002, 1994, 1999, 1992, 19…
$ X90s    <dbl> 27.4, 32.8, 34.3, 22.7, 15.0, 25.1, 20.8, 30.4, 15.8, 26.1, 17…
$ Gls     <dbl> 0.04, 0.00, 0.50, 0.31, 0.00, 0.04, 0.00, 0.39, 0.00, 0.19, 0.…
$ Sh      <dbl> 0.66, 0.88, 2.68, 2.16, 0.13, 1.27, 0.72, 1.84, 0.95

#### Approaches

In both datasets, all statistics are normalized per 90 minutes: count variables such as goals and shots are reported per 90 minutes played rather than as season totals. In the first dataset, the target variable will be either assists per 90 or expected assists per 90. In the second, it will be either goals per 90 or expected goals per 90. Here are some helpful definitions:

- Expected Goals (xG) is a metric that estimates the probability of a given shot resulting in a goal based on various factors like shot angle, distance from goal etc.
- Expected Assists (xA) estimates the likelihood that a given pass will become an assist, based on factors like pass location, type, and receiving player's position.

In our datasets, xG and xA per 90 are the sums of the probability values assigned to each shot or pass made by the player, scaled to a 90-minute basis. In the frequentist approach, the fitted model is unlikely to differ much whichever target variable is chosen.
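As a small worked illustration of the per-90 scaling, the snippet below computes goals per 90 and xG per 90 for a single hypothetical player. The shot probabilities, minutes and goal total are invented values, not rows from the datasets.

```r
# Hypothetical shot log for one player: modelled probability that each shot scores.
shot_xg <- c(0.05, 0.31, 0.12, 0.76, 0.08)
minutes_played <- 450   # invented value
goals <- 2              # invented value

nineties  <- minutes_played / 90        # number of 90s played
gls_per90 <- goals / nineties           # goals per 90
xg_per90  <- sum(shot_xg) / nineties    # expected goals per 90

c(gls_per90 = gls_per90, xg_per90 = xg_per90)
```

The same calculation applies to xA, with pass probabilities in place of shot probabilities.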

To build the frequentist model, we first look for sources of multicollinearity, which are very likely to be present in both datasets. Given the large number of predictor variables, reducing their number would help prevent overfitting. Forward stepwise selection can be used to identify a suitable subset of predictor variables for the model.
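A minimal sketch of both steps, using simulated data rather than the real datasets. The variable names (`sh`, `sot`, `dist`, `gls`) and their relationships are assumptions made for illustration; the variance inflation factor is computed by hand so that no extra package is needed, and `step()` from base R performs forward selection by AIC.

```r
set.seed(447)
n <- 200

# Simulated stand-ins for per-90 predictors; sot is built to be collinear with sh.
sh   <- rnorm(n, mean = 2, sd = 0.8)
sot  <- 0.4 * sh + rnorm(n, sd = 0.1)
dist <- rnorm(n, mean = 17, sd = 3)
gls  <- 0.2 + 0.15 * sh - 0.01 * dist + rnorm(n, sd = 0.1)
sim  <- data.frame(gls, sh, sot, dist)

# Variance inflation factor: 1 / (1 - R^2), where R^2 comes from
# regressing each predictor on all the other predictors.
predictors <- c("sh", "sot", "dist")
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = sim))$r.squared
  1 / (1 - r2)
})
round(vif, 2)

# Forward selection by AIC, starting from the intercept-only model.
null_model <- lm(gls ~ 1, data = sim)
forward <- step(
  null_model,
  scope = list(lower = ~ 1, upper = ~ sh + sot + dist),
  direction = "forward",
  trace = 0
)
formula(forward)
```

Large VIF values flag predictors that are nearly linear combinations of the others, which is one reason to prune them before or during selection.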

I am looking to explore the different Bayesian models that can be built for this data. We can use a Poisson likelihood for goals and assists if the mean and variance of the counts are roughly equal. Because the Poisson distribution is defined on integer counts, the per-90 rates would first need to be converted back to totals (rate multiplied by the number of 90s played), with the 90s played acting as the exposure. If there is evidence of overdispersion, the Negative Binomial distribution is a better choice, as it can handle the extra variability. Some players have very low (sometimes 0) goals and assists per 90, and an excess of zeros can also make the Poisson distribution unsuitable. We can determine the more suitable likelihood after carrying out the exploratory data analysis.
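One way to check for overdispersion is sketched below on simulated counts. The data are generated from a Negative Binomial on purpose, so the check is expected to flag them; the rate, the `size` parameter and the range of 90s played are illustrative assumptions, not estimates from the real data.

```r
set.seed(447)
n <- 300

# Simulated exposure (90s played) and season goal totals with
# extra-Poisson variation. These are not the real data.
nineties <- runif(n, min = 15, max = 38)
goals    <- rnbinom(n, mu = 0.25 * nineties, size = 1.5)

# Quick check: for a Poisson variable the variance is close to the mean.
c(mean = mean(goals), variance = var(goals))

# Model-based check: Poisson GLM on the counts with log(90s) as an offset,
# then the Pearson dispersion statistic (about 1 under a Poisson model).
fit <- glm(goals ~ 1 + offset(log(nineties)), family = poisson)
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
dispersion
```

A dispersion statistic well above 1 would point towards the Negative Binomial rather than the Poisson.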

For xG and xA, which are continuous and represent aggregated probabilities rather than counts, we can use a normal distribution as the model's likelihood. This can be suitable because the values are normalized per 90 minutes and may be approximately normally distributed.
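To make the normal-likelihood idea concrete, here is a minimal base R sketch of Bayesian linear regression with a conjugate Gaussian prior, compared against ordinary least squares. Everything here is an assumption for illustration: the data are simulated, the prior is a zero-mean Gaussian with variance `tau2` on each coefficient, and the noise variance is plugged in from the OLS fit instead of being given its own prior. A full analysis would more likely use a sampler such as Stan.

```r
set.seed(447)
n <- 150

# Simulated shots per 90 and xG per 90 (not the real data).
sh <- rnorm(n, mean = 2, sd = 0.8)
xg <- 0.02 + 0.10 * sh + rnorm(n, sd = 0.05)

X   <- cbind(intercept = 1, sh = sh)
ols <- lm(xg ~ sh)

# Plug-in noise variance from the OLS residuals (simplifying assumption).
sigma2 <- summary(ols)$sigma^2

# Prior: beta ~ N(0, tau2 * I); tau2 = 1 is an illustrative choice.
tau2 <- 1
prior_prec <- diag(ncol(X)) / tau2

# Conjugate update for a normal likelihood with known variance:
# posterior covariance = (X'X / sigma2 + prior precision)^-1
# posterior mean       = posterior covariance %*% (X'y / sigma2)
post_cov  <- solve(crossprod(X) / sigma2 + prior_prec)
post_mean <- post_cov %*% (crossprod(X, xg) / sigma2)

cbind(ols = coef(ols), bayes = drop(post_mean))
```

With a weak prior and this much data the two sets of estimates are nearly identical; a tighter prior (smaller `tau2`) would shrink the Bayesian coefficients towards zero.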