This project had two aims. First, it asked whether football results can be predicted with a basic machine learning approach (unsurprisingly, it was not possible). Second, the learning goal was to build an end-to-end [kedro](https://kedro.readthedocs.io/en/stable/) pipeline using solely the [pyspark](https://spark.apache.org/docs/latest/api/python/) API. The motivation for the latter is the parallelism Spark offers, which pandas lacks.

## Problem Description
The goal of this project was threefold. First, I wanted to see how well the outcome of football matches can be predicted with a basic machine learning approach. Second, I wanted to gain hands-on experience with [Spark ML](https://spark.apache.org/docs/latest/ml-guide). All pipelines (excluding some parts of the EDA pipelines) were therefore written exclusively in PySpark. Third, the project uses the kedro pipelining framework, and a follow-up blog post deploys the forecasting pipeline on Microsoft Azure.

## Framework
As mentioned above, this project builds mainly on two technologies. The pipelining and project structure are managed by [kedro](https://kedro.readthedocs.io/en/stable/), whereas the scripting happens mostly in [PySpark](https://spark.apache.org/docs/latest/api/python/), leveraging the parallelism of Spark.

## Repository
The entire code-base can be found here



## Package Imports

In [9]:
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("football_madness").getOrCreate()

## Data
The data for this project came from two different locations. First, the bulk of the data used for training and assessing the model was taken from the [football-data.co.uk](http://www.football-data.co.uk/) website. All seasons from 2000-01 through 2022-23 were used.


In [13]:
print(pd.read_excel("../data/01_raw/all-euro-data-2000-2001.xls"))

    Div       Date       HomeTeam       AwayTeam  FTHG  FTAG FTR  HTHG  HTAG  \
0    E0 2000-08-19       Charlton       Man City     4     0   H     2     0   
1    E0 2000-08-19        Chelsea       West Ham     4     2   H     1     0   
2    E0 2000-08-19       Coventry  Middlesbrough     1     3   A     1     1   
3    E0 2000-08-19          Derby    Southampton     2     2   D     1     2   
4    E0 2000-08-19          Leeds        Everton     2     0   H     2     0   
..   ..        ...            ...            ...   ...   ...  ..   ...   ...   
375  E0 2001-05-19       Man City        Chelsea     1     2   A     1     1   
376  E0 2001-05-19  Middlesbrough       West Ham     2     1   H     2     1   
377  E0 2001-05-19      Newcastle    Aston Villa     3     0   H     2     0   
378  E0 2001-05-19    Southampton        Arsenal     3     2   H     0     1   
379  E0 2001-05-19      Tottenham     Man United     3     1   H     1     1   

    HTR  ...   IWA   LBH   LBD   LBA   

### Summary Statistics
There are quite 


## Feature Engineering
The feature space is kept deliberately simple. Most features are rolling averages over the last n matches of bookmaker odds, goals scored, goals conceded, etc. Additionally, the share of the last n home and away games that were won is included, to give the model a sense of whether a team performs better at home or away. The rather challenging part was 

In [15]:
spark.read.parquet("../data/04_feature/team_spine/").show()


In [8]:
spark.read.parquet("../data/04_feature/master_table/").show()


## Modeling


## Inference


In [11]:
spark.read.parquet("../data/07_model_output/inverted_model_predictions_inference/").show()



## Model Evaluation


![title](../data/08_reporting/confusion_matrix_plot.png)

In [16]:
spark.read.parquet("../data/07_model_output/standing_table_inference/").show()

+--------------+-----------------------+------------------+--------------+---------+------+
|          team|predicted_number_points|true_number_points|predicted_rank|true_rank|league|
+--------------+-----------------------+------------------+--------------+---------+------+
|      Man City|                     49|                93|             9|        1|    E0|
|     Liverpool|                     59|                92|             6|        2|    E0|
|       Chelsea|                     68|                74|             1|        3|    E0|
|     Tottenham|                     62|                71|             4|        4|    E0|
|       Arsenal|                     61|                69|             5|        5|    E0|
|    Man United|                     65|                58|             2|        6|    E0|
|      West Ham|                     63|                56|             3|        7|    E0|
|     Leicester|                     61|                52|             5|      

## Next steps