Prediction of match results in English Premier League based on football statistics

L.T. Ozgur Yildirim

Index:

Purpose of the study
Materials and Methods
Tableau
Model Pipeline
Conclusion
Glossary
Files
References
License

Purpose of the study

In this project, I am working as data analyst for a company related to football statistics. My company created machine learning models to predict the results of matches in English Premier League based on football statistics. The main objectives of this study are 1) analyzing relationships between football statistics, and 2) building classification models to predict the results of football matches based on football statistics.

Materials and Methods

The football statistics in the dataset are scraped on https://fbref.com/en/ website using BeautifulSoup library. The tables in the dataset include shooting, passing, goalkeeping, passing types, goal and shot creation, defense, possession, and miscellaneous statistics (Figure 1). After scraping, the tables are exported as .csv files into csv_main and csv_validation folders. The dataset consists of statistics of 27 teams in the premier league ranging from 2013 season to 2022 season. The main_data include statistics from 2018-2022 seasons to train the classification models. The validation_data include statistics from 2013-2017 seasons to test/validate the classification models. Moreover, Premier League standings for the last 22 seasons (from kaggle.com) are used to analyze the championship and relegation numbers of the teams. The epl_standings.csv file is stored in csv_main folder. The studies are performed on .csv and excel files using Python and Tableau softwares.

Figure 1

A few NaN values in the dataset are replaced by mean value of the columns. No outliers were observed in the dataset. After that, the features are selected using variance threshold and KBEST methods to prepare the dataset for the classification models. After the results are obtained, the results are exported as .csv and excel files. Extra excel files of dataset are created using Python to compare actual data and predicted data in Tableau.

The methods of this study include the followings:

Web scraping
Get data
Exploratory data analyses
Data cleaning
Data visualization
Data wrangling
Variance threshold
KBEST
Hyperparameters search
Model Pipeline
Apply Random Forest Classifier Model
Evaluate the results

Tableau

Link to Tableau

The .csv and excel files are imported into Tableau. The football statistics are plotted to visualize and observe their relations in Tableau.

Model Pipeline

The columns contributing to the prediction of the results in cleaned data are selected using Variance Threshold and KBEST methods. The model pipeline includes random forest classifier, decision tree classifier and logistic regression models. Hyperparameters for random forest classifier and decision tree classifier are searched using gridsearch method. Then, the models are trained using the train portion of the main_data in the pipeline. The model with the best accuracy after using the test portion of the main_data is random forest classifier model.

The steps above are applied for the validation_data (2013-2017 seasons). The accuracy score of the random forest classifier applied on the validation_data is 0.77. The comparison of actual win, draw, and lose data and predicted win, draw, and lose results are shown in confusion matrix of the validation_data (Figure 2).

There are 3292 predictions of the match results in validation data (2013-2017 season). It was predicted that the relevant team would win in 1381 matches. The actual data shows that the relevant team would win in 1046 matches. It was predicted that the relevant team would lose in 1764 matches. The actual data shows that the relevant team would lose in 1186 matches. It was predicted that the relevant team would draw in 147 matches. The actual data shows that the relevant team would draw in 116 matches.

The random forest classifier's accuracy is pretty good. Random Forest Classifier Model Results_Seasons(2013-2017)

Accuracy Score: 0.71

Precision: 0.76

Recall: 0.77

F1: 0.76

Figure 2

Conclusion

This study investigates the results of matches in English Premier League in seasons 2013-2022 based on football statistics using classification models. The random forest classifier model is selected as it has the best accuracy score. The confusion matrix shows that the model predicts well (accuracy Score: 0.71) about the results of the matches. The future study can be prediction of football matches which are going to be played in the future.

Glossary

The files with descriptions of the statistics in the dataset are located at "glossary" folder.

Files

csv_main: season 2018-2022 data exported from Python and imported to Tableau

csv_validation: season 2013-2017 data exported from Python and imported to Tableau

excel_files: exported from Python and imported to Tableau

glossary: descriptions of statistics in the dataset

images: images of visualized data in dataset

pickle: saved data of models, encoder, transformer.

presentation: pdf file of the presentation

.ipynb files: Jupiter notebook files

References

The dataset is scraped from https://fbref.com/en/ website. Moreover, Premier League standings for the last 22 seasons (https://www.kaggle.com) is used. Lecture notes from the web scraping lessons in Ironhack Data Analytics Bootcamp and web scraping videos of Dataquest channel are used to complete web scraping part of this project.

License

This is an educational project; therefore, all materials can be used freelly.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
csv_main		csv_main
csv_validation		csv_validation
excel_files		excel_files
glossary		glossary
images		images
pickle		pickle
presentation		presentation
1_Web_Scraping.ipynb		1_Web_Scraping.ipynb
2_Main_Data_Model.ipynb		2_Main_Data_Model.ipynb
3_Web_Scraping_Validation.ipynb		3_Web_Scraping_Validation.ipynb
4_Validation_Model.ipynb		4_Validation_Model.ipynb
5_Data_Visualization.ipynb		5_Data_Visualization.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Prediction of match results in English Premier League based on football statistics

Index:

Purpose of the study

Materials and Methods

Tableau

Model Pipeline

Conclusion

Glossary

Files

References

License

About

Releases

Packages

Languages

ltaylanozgur/Football_Match_Predictions

Folders and files

Latest commit

History

Repository files navigation

Prediction of match results in English Premier League based on football statistics

Index:

Purpose of the study

Materials and Methods

Tableau

Model Pipeline

Conclusion

Glossary

Files

References

License

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages