# 1. Introduction

This project aims to predict the outcomes of English Premier League (EPL) soccer/football matches using machine learning and deep learning techniques from class.

Code: https://github.com/nich5850/DeepLearningFinal

**Motivation:**  
I watch and play soccer all the time. Predicting soccer outcomes is challenging because so many factors influence each game, and good predictions are useful for fans, analysts, betting markets, and clubs. This project investigates whether engineered features from historical match data can help classify match results.

**Approach:**  
I engineered features like form, past performance, and relative stats, then trained two models:
1. Random Forest Classifier — for interpretability
2. MLP Classifier — for capturing non-linear patterns


# 2. Dataset

### Dataset Overview

I use a cleaned dataset, `final_dataset.csv`, that contains pre-match stats and match outcomes for each game. Here's how the dataset is loaded and previewed:

```python
import pandas as pd

df = pd.read_csv("final_dataset.csv")
print(f"Dataset shape: {df.shape}")
df.head()
```

### Target Variable

I then added a `ResultLabel` column to represent the match outcome:
- 0: Home Win
- 1: Away Win
- 2: Draw
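
A minimal sketch of how such a label could be derived, assuming the raw result is stored in the `FTR` column with the codes `H`, `A`, and `D` (home win, away win, draw). The code letters and the toy rows are assumptions for illustration; the original mapping code is not shown in this report.

```python
import pandas as pd

# Assumed result codes: 'H' = home win, 'A' = away win, 'D' = draw
RESULT_MAP = {"H": 0, "A": 1, "D": 2}

# Toy rows for illustration only (not real match data)
toy = pd.DataFrame({"FTR": ["H", "D", "A", "H"]})

# Map each result code to its integer class
toy["ResultLabel"] = toy["FTR"].map(RESULT_MAP)

# Fail loudly if any code was not covered by the mapping
assert toy["ResultLabel"].notna().all(), "unmapped result code found"

print(toy["ResultLabel"].tolist())
```

The label has to be created before `FTR` is dropped as a leakage column.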


# 3. Data Cleaning and EDA

I removed columns that would cause data leakage or were irrelevant for modeling.

### Data Transformation Considerations
Some features had different ranges (e.g., `FormPts` ranges from 0–15, while `GoalDiff` could be much larger).

- Random Forest: I did not standardize features before training because tree-based splits do not depend on feature scale.
- MLP: this model is sensitive to feature scales, so I applied `StandardScaler` to the features before training.
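
The scaling step can be sketched as follows. The toy arrays stand in for the real feature matrix and are assumptions; the point is that the scaler is fit on the training split only and then reused on the test split, so no test-set statistics leak into training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy features on very different scales (illustrative values only)
X_train_toy = np.array([[3.0, 10.0], [9.0, -20.0], [15.0, 40.0], [6.0, 0.0]])
X_test_toy = np.array([[12.0, 25.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_toy)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test_toy)        # reuse the training statistics

print(X_train_scaled.mean(axis=0).round(6))  # each column is centered near 0
print(X_train_scaled.std(axis=0).round(6))   # each column has unit standard deviation
```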

### Missing Data & Cleaning
- The dataset was preprocessed to remove rows with missing values.
  
### Important Feature Hypotheses
Based on basic knowledge of soccer, I hypothesized the following features to be most predictive:

- `HTFormPts` and `ATFormPts`: recent team performance  
- `HST`, `AST`: shots on target (a rough proxy for chance quality, loosely comparable to xG)  
- `HTGS`, `ATGS`: season-to-date goals scored  

The feature importance analysis in Section 5 supports the form and goals hypotheses. No shots-on-target feature appears in the top 10.

---

### Cleaning Code Snippet

```python
import numpy as np

# Drop columns that reveal the final score or post-match info
leakage_cols = ['FTHG', 'FTAG', 'FTR', 'HTHG', 'HTAG', 'HTR']
df.drop(columns=[col for col in leakage_cols if col in df.columns], inplace=True)

# Drop identifiers and team names
df.drop(columns=['Date', 'HomeTeam', 'AwayTeam'], inplace=True)

# Drop columns that directly compute the label
df.drop(columns=['HTP', 'ATP', 'DiffPts'], inplace=True)

# Keep only numeric data
df = df.select_dtypes(include=[np.number])
```


# 4. Train/Test Split

I split the data 80/20 using sklearn's `train_test_split`.

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=['ResultLabel'])
y = df['ResultLabel']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

Shape of the resulting datasets:

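- `X_train`: (number of training samples, number of features); with an 80/20 split this is roughly 5,470 rows
- `X_test`: (1368, number of features); the support totals in the evaluation reports show 1,368 test samples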
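
Because the classes are imbalanced (draws make up only 100 of the 1,368 test samples in the reports below), one option is a stratified split that keeps class proportions equal in both sets. This is a sketch on toy labels, not the split used for the reported results; `stratify=y_toy` is the only change from the call above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with a 70/20/10 class mix (illustrative only)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 70 + [1] * 20 + [2] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Class counts in each split mirror the overall 70/20/10 mix
print(np.bincount(y_tr), np.bincount(y_te))
```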


# 5. Random Forest Classifier

### Training

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
```

### Evaluation

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print("Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Classification Report:")
print(classification_report(y_test, y_pred_rf))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf))
```

Output:

```
Training Random Forest...

Random Forest Evaluation:
Accuracy: 0.8961988304093568
Confusion Matrix:
 [[579  50   2]
 [ 63 574   0]
 [ 14  13  73]]
Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.92      0.90       631
           1       0.90      0.90      0.90       637
           2       0.97      0.73      0.83       100

    accuracy                           0.90      1368
   macro avg       0.92      0.85      0.88      1368
weighted avg       0.90      0.90      0.90      1368
```
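
To make the report easier to read, here is a small sketch that recomputes accuracy and per-class precision and recall directly from the Random Forest confusion matrix printed above. Rows are true labels and columns are predicted labels, which is scikit-learn's convention.

```python
import numpy as np

# Random Forest confusion matrix copied from the output above
cm = np.array([[579, 50, 2],
               [63, 574, 0],
               [14, 13, 73]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)     # per true class (row totals)
precision = np.diag(cm) / cm.sum(axis=0)  # per predicted class (column totals)

for label, name in enumerate(["Home Win", "Away Win", "Draw"]):
    print(f"{name}: precision={precision[label]:.2f}, recall={recall[label]:.2f}")
print(f"Accuracy: {accuracy:.4f}")
```

The draw row shows the pattern clearly: only 73 of 100 true draws are recovered, but 73 of the 75 draw predictions are correct.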

### Feature Importances

```python
importances = rf.feature_importances_
importance_df = pd.DataFrame({'Feature': X.columns, 'Importance': importances})
top_10 = importance_df.sort_values(by='Importance', ascending=False).head(10)

print("Top 10 Important Features:")
print(top_10)
```

Output:

```
Top 10 Important Features:
        Feature  Importance
18  DiffFormPts    0.192975
16         HTGD    0.188572
17         ATGD    0.179686
6     HTFormPts    0.059762
4          ATGC    0.059145
7     ATFormPts    0.058501
2          ATGS    0.057365
1          HTGS    0.054247
3          HTGC    0.054056
5            MW    0.048716
```

## Random Forest Summary

- Accuracy: ~89.6%
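- Performed well on home and away wins (F1 of 0.90 each); draws were weaker (F1 of 0.83)
- Most important features are the form-point difference (`DiffFormPts`) and each team's goal difference (`HTGD`, `ATGD`)
- Draws were the hardest class to recover (recall 0.73), but draw predictions were rarely wrong (precision 0.97)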


# 6. MLP Classifier

### Training

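```python
from sklearn.neural_network import MLPClassifier

# X_train and X_test are assumed to be the StandardScaler-transformed features described in Section 3
# Two hidden layers with 64 and 32 units
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=42)
mlp.fit(X_train, y_train)
y_pred_mlp = mlp.predict(X_test)
```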

### Evaluation

```python
print("Accuracy:", accuracy_score(y_test, y_pred_mlp))
print("Classification Report:")
print(classification_report(y_test, y_pred_mlp))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_mlp))
```

Output:

```
Training MLP Classifier...

MLP Evaluation:
Accuracy: 0.7785087719298246
Confusion Matrix:
 [[617  14   0]
 [205 402  30]
 [ 48   6  46]]
Classification Report:
               precision    recall  f1-score   support

           0       0.71      0.98      0.82       631
           1       0.95      0.63      0.76       637
           2       0.61      0.46      0.52       100

    accuracy                           0.78      1368
   macro avg       0.76      0.69      0.70      1368
weighted avg       0.81      0.78      0.77      1368
```


## MLP Classifier Summary

- Accuracy: ~77.9%
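- Draws were predicted poorly (recall 0.46, F1 0.52), and 205 of 637 away wins were misclassified as home wins
- May capture non-linear patterns, but is less interpretable than the Random Forest
- Still outperformed the majority-class baseline of about 46.6% (637 of 1,368 test matches belong to the largest class)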
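
"Random guessing" can be made concrete with two simple baselines computed from the class supports in the MLP report above: uniform guessing over three classes, and always predicting the most common class. This is plain arithmetic on the reported numbers, not an additional experiment.

```python
# Test-set class supports from the classification report above
support = {"Home Win": 631, "Away Win": 637, "Draw": 100}
total = sum(support.values())

uniform_baseline = 1 / len(support)                # guess each class with equal probability
majority_baseline = max(support.values()) / total  # always predict the largest class
mlp_accuracy = 0.7785087719298246                  # reported MLP accuracy

print(f"Uniform guessing:  {uniform_baseline:.3f}")
print(f"Majority class:    {majority_baseline:.3f}")
print(f"MLP gain over majority baseline: {mlp_accuracy - majority_baseline:.3f}")
```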


# 7. Discussion

### Feature Impact:
- Random Forest ranked `DiffFormPts`, `HTGD`, and `ATGD` highest, followed by per-team form points and goals scored and conceded.
- These represent recent form and goal difference, which are intuitive predictors of match outcome.

### Model Comparison:
- Random Forest outperformed MLP in both accuracy (~89.6% vs. ~77.9%) and interpretability.
- MLP struggled more with predicting draws, which is common in sports classification tasks.
- Adding layers or tuning hyperparameters might improve the MLP, but this has not been verified here.
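
Whether a deeper network helps is an empirical question. A sketch of how it could be checked with cross-validation follows; the synthetic data from `make_classification` and the candidate layer sizes are assumptions for illustration, and the scores it prints say nothing about the EPL data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class stand-in for the real features (assumption for the demo)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)

# Hypothetical candidate architectures: the original and a deeper variant
candidates = {"64-32": (64, 32), "128-64-32": (128, 64, 32)}

scores = {}
for name, layers in candidates.items():
    # Scaling lives inside the pipeline, so each fold's scaler is fit on
    # that fold's training portion only
    model = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=layers, max_iter=300, random_state=42),
    )
    scores[name] = cross_val_score(model, X_demo, y_demo, cv=3).mean()

for name, score in scores.items():
    print(f"{name}: mean CV accuracy = {score:.3f}")
```

On the real data, `X_demo` and `y_demo` would be replaced by the training features and labels, keeping the test set untouched until a final configuration is chosen.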

### Realism & Validity:
- No future-dependent features or final scores were used in training.
- Early versions of my code had a lot of data leakage, so they always returned 100% accuracy.
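
One way to guard against this kind of regression is a small check that runs before training and reports any known post-match column still present among the features. The helper name `find_leaky_columns` is hypothetical; the column names are the ones dropped in the cleaning step.

```python
# Post-match or label-derived columns dropped during cleaning
LEAKAGE_COLS = {"FTHG", "FTAG", "FTR", "HTHG", "HTAG", "HTR", "HTP", "ATP", "DiffPts"}

def find_leaky_columns(feature_names):
    """Return the feature names that are known to leak the match result."""
    return sorted(set(feature_names) & LEAKAGE_COLS)

# Example feature lists (illustrative): one clean, one with a leftover score column
clean = ["HTGD", "ATGD", "DiffFormPts", "MW"]
leaky = ["HTGD", "FTHG", "DiffFormPts"]

print(find_leaky_columns(clean))
print(find_leaky_columns(leaky))
```

In the actual pipeline this could be called as `find_leaky_columns(X.columns)` right before fitting, with training aborted if the result is not empty.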


# 8. Conclusion

### What I Learned:
- EPL match outcome prediction is feasible using engineered pre-match features.
- Random Forest performed well and provided interpretable insights.
- MLP was less accurate but still showed promise.

### Limitations:
- Models do not account for injuries, weather, or lineup changes.
- Static snapshot: the random 80/20 split is not time-aware, and no rolling features are computed.
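
A time-aware alternative would train on earlier matches and test on later ones. The sketch below assumes the `Date` column is still available at split time (it is dropped during cleaning in this project) and uses toy rows for illustration.

```python
import pandas as pd

# Toy matches in shuffled order (illustrative values only)
matches = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2020-03-01", "2019-08-10", "2021-01-15", "2019-12-26", "2020-09-12"]
    ),
    "DiffFormPts": [3, -1, 0, 5, -4],
    "ResultLabel": [0, 1, 2, 0, 1],
})

# Sort chronologically, then hold out the most recent 20% as the test set
matches = matches.sort_values("Date").reset_index(drop=True)
cutoff = int(len(matches) * 0.8)
train, test = matches.iloc[:cutoff], matches.iloc[cutoff:]

# Every training match precedes every test match
print(train["Date"].max() < test["Date"].min())
```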

### Future Work:
- Introduce betting odds.
- Explore recurrent neural networks or sequence models.
- Collect richer data, including match location, player stats, etc.
