# EPL Match Outcome Prediction
## Data Exploration & Feature Engineering Overview

This notebook provides an overview of the dataset used in the project,
explains the feature engineering choices, and discusses the challenges
of predicting football match outcomes.



### Objective

The goal of this project is to predict the outcome of English Premier League
matches (Home win / Draw / Away win) using historical match statistics.

The objective is not to beat bookmakers, but to demonstrate a complete
machine learning workflow:
- data preparation
- feature engineering
- model training
- evaluation
- comparison with a strong baseline (bookmakers)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv("../data/processed/model_data.csv")
df.head()

### Dataset Description

Each row in the dataset represents one football match.

The features are engineered as **differences between home and away teams**
based on recent performance (rolling averages over previous matches).

The target variable is:
- `1`  → Home win
- `0`  → Draw
- `-1` → Away win

In [None]:
df["target"].value_counts().plot(
    kind="bar",
    title="Match Outcome Distribution"
)
plt.xlabel("Match outcome")
plt.ylabel("Number of matches")
plt.show()


### Class Distribution

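The dataset is slightly imbalanced:
- Home wins are the most frequent outcome
- Draws are less frequent than home wins

This imbalance partly explains why draws are particularly difficult to predict,
both for machine learning models and for bookmakers: a draw is rarely the single
most likely outcome of a match, so a classifier that picks the most probable
class seldom predicts one.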
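
To put a number on the imbalance, the class shares and the accuracy of always predicting the most frequent class can be computed directly. The snippet below is a minimal sketch on a made-up `target` series; in the notebook the same calls would run on `df["target"]`.

```python
import pandas as pd

# Toy target column for illustration only (not real match data).
# Encoding follows the notebook: 1 = home win, 0 = draw, -1 = away win.
target = pd.Series([1, 1, 1, 0, -1, 1, -1, 0, 1, -1])

# Share of each outcome instead of raw counts.
class_share = target.value_counts(normalize=True)

# A model that always predicts the most frequent class reaches this accuracy,
# so any trained model should at least beat it.
majority_class = class_share.idxmax()
majority_accuracy = class_share.max()

print(class_share)
print(f"Always predicting {majority_class} gives {majority_accuracy:.0%} accuracy")
```

The majority-class accuracy is a useful floor when reading the model results later on.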

In [None]:
feature_cols = [c for c in df.columns if c.startswith("diff_")]
len(feature_cols), feature_cols[:10]

### Feature Engineering Strategy

Only **pre-match information** is used.

Features are based on:
- rolling averages of team performance
- differences between home and away teams
- recent form (last 5–10 matches)

Because every feature is computed only from matches played before kickoff,
this approach avoids data leakage and reflects the information
available before a match is played.
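
The leakage point is easiest to see in code. The sketch below is not the project's actual pipeline: the `matches` table, the `goals` statistic, the window of 3, and the `diff_goals_roll` name are all assumptions chosen for illustration. The key step is `shift(1)`, which removes the current match from its own rolling window.

```python
import pandas as pd

# Hypothetical long-format table: one row per team per match.
matches = pd.DataFrame({
    "team":  ["A", "A", "A", "A", "B", "B", "B", "B"],
    "date":  pd.to_datetime(
        ["2023-08-12", "2023-08-19", "2023-08-26", "2023-09-02"] * 2
    ),
    "goals": [2, 0, 3, 1, 1, 1, 0, 2],
})
matches = matches.sort_values(["team", "date"]).reset_index(drop=True)

WINDOW = 3  # illustrative window size, not the project's setting

# shift(1) moves each value one match later, so the rolling mean for a match
# only sees that team's earlier matches. The first match of each team is NaN.
matches["goals_roll"] = (
    matches.groupby("team")["goals"]
    .transform(lambda s: s.shift(1).rolling(WINDOW, min_periods=1).mean())
)

# Toy assumption: team A hosts team B on every date, so the difference
# feature is simply home rolling average minus away rolling average.
home = matches[matches["team"] == "A"].set_index("date")["goals_roll"]
away = matches[matches["team"] == "B"].set_index("date")["goals_roll"]
diff_goals_roll = home - away

print(matches)
print(diff_goals_roll)
```

Without the `shift(1)`, each rolling average would include the goals scored in the very match being predicted, which is exactly the leakage the strategy is meant to avoid.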

### Models Used

Two models are trained:
- Logistic Regression (baseline, interpretable)
- Random Forest (non-linear model)

Performance is evaluated using:
- Accuracy
- Log-loss
- Confusion matrices
- Classification reports
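
The setup listed above could look roughly like the following scikit-learn sketch. It runs on synthetic data standing in for the `diff_*` features, and the chronological split, the hyperparameters, and the `results` dictionary are assumptions made for illustration, not the project's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, log_loss

# Synthetic stand-in for the feature matrix and target (not real match data).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
signal = X[:, 0] + 0.5 * rng.normal(size=600)
y = np.where(signal > 0.5, 1, np.where(signal < -0.5, -1, 0))

# Assumed chronological split: earlier rows for training, later rows for
# testing, with no shuffling, so the test set plays the role of future matches.
split = 450
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)  # columns are ordered as model.classes_
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "log_loss": log_loss(y_test, proba, labels=model.classes_),
        "confusion": confusion_matrix(y_test, pred, labels=[-1, 0, 1]),
    }

for name, metrics in results.items():
    print(name, round(metrics["accuracy"], 3), round(metrics["log_loss"], 3))
```

Log-loss is worth reporting next to accuracy because it scores the predicted probabilities themselves, not only the most likely class.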

A bookmaker baseline is used for comparison.
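
The notebook does not spell out how the bookmaker baseline is built. One common approach, shown here as a sketch under that assumption, is to turn decimal odds into implied probabilities, normalise away the bookmaker margin, and predict the outcome with the highest probability. The odds and outcomes below are invented.

```python
import numpy as np

# Hypothetical decimal odds per match, in the column order (home, draw, away).
odds = np.array([
    [1.80, 3.60, 4.50],
    [2.90, 3.30, 2.50],
    [1.40, 4.80, 8.00],
])
outcomes = np.array([1, -1, 0])  # made-up results, notebook encoding

# Inverse odds sum to more than 1 per match because of the bookmaker margin,
# so each row is rescaled to sum to exactly 1.
raw = 1.0 / odds
implied = raw / raw.sum(axis=1, keepdims=True)

# Baseline prediction: the outcome with the shortest odds.
labels = np.array([1, 0, -1])  # target value for each column of `odds`
baseline_pred = labels[implied.argmax(axis=1)]
baseline_accuracy = (baseline_pred == outcomes).mean()

# Log-loss of the implied probabilities on the observed outcomes.
column_of = {1: 0, 0: 1, -1: 2}
p_true = implied[np.arange(len(outcomes)), [column_of[o] for o in outcomes]]
baseline_log_loss = -np.log(p_true).mean()

print(baseline_pred, baseline_accuracy, baseline_log_loss)
```

Dividing by the row sum is the simplest way to remove the margin; other normalisation schemes exist and would shift the probabilities slightly.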

### Results Discussion

Both machine learning models achieve around 53–54% accuracy,
which is comparable to the bookmaker baseline.

Matching the baseline rather than beating it is expected, as bookmakers'
odds aggregate a large amount of information not present in the dataset.

The models struggle to predict draws, which is a known difficulty
in football analytics.

### Limitations and Future Work

Limitations:
- No information about injuries or lineups
- No in-game events
- Limited contextual variables

Possible improvements:
- Incorporating team ratings (Elo)
- Adding player-level information
- Using more advanced probabilistic models
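
As an illustration of the first improvement listed above, a basic Elo update for a single match can be sketched as follows. The K-factor of 20, the optional `home_advantage` offset, and the function names are assumptions for this example, not part of the project.

```python
def expected_score(rating_a, rating_b):
    """Expected score of A against B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(home, away, result, k=20.0, home_advantage=0.0):
    """Return updated (home, away) ratings after one match.

    result uses the notebook's encoding: 1 home win, 0 draw, -1 away win.
    k and home_advantage are illustrative defaults, not tuned values.
    """
    score_home = {1: 1.0, 0: 0.5, -1: 0.0}[result]
    exp_home = expected_score(home + home_advantage, away)
    delta = k * (score_home - exp_home)
    # Elo is zero-sum: what the home side gains, the away side loses.
    return home + delta, away - delta


new_home, new_away = update_elo(1500.0, 1500.0, result=1)
print(new_home, new_away)
```

To stay consistent with the pre-match rule, the rating difference used as a feature for a match would have to be the one recorded before that match's update is applied.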