# xG Football (StatsBomb Open Data)

This notebook accompanies the capstone project and shows:

- Preparation and cleaning of StatsBomb shot data;
- Exploratory data analysis (EDA) and descriptive statistics;
- Model comparison (Logistic Regression vs RandomForest);
- Feature importance analysis and simple hyperparameter tuning.


## 1. Imports and setup

Make sure you have installed the project dependencies (see `README.md`).


In [None]:
from pathlib import Path

import pandas as pd

from xg_futebol.data_prep import FEATURE_COLUMNS, load_shots_dataframe


## 2. StatsBomb data

Clone the StatsBomb open-data repository from a terminal (outside this notebook):

```bash
cd cohorts/2025/capstone
git clone https://github.com/statsbomb/open-data.git data/statsbomb-open-data
```

The events will be in `data/statsbomb-open-data/data/events`.


In [None]:
events_dir = Path("data/statsbomb-open-data/data/events")
events_dir
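
Before loading anything, it can help to confirm that the clone landed where the notebook expects it. The sketch below assumes the events folder holds one JSON file per match; `count_event_files` is a hypothetical helper and not part of `xg_futebol`.

```python
from pathlib import Path


def count_event_files(events_dir: Path) -> int:
    """Count the JSON event files in events_dir, or raise if the folder is missing."""
    if not events_dir.is_dir():
        raise FileNotFoundError(
            f"{events_dir} not found; clone the StatsBomb open-data repository first"
        )
    # Only top-level *.json files are counted; other files are ignored.
    return sum(1 for _ in events_dir.glob("*.json"))


# Usage in this notebook (raises if the repository has not been cloned yet):
# count_event_files(events_dir)
```

A count well below the `limit_files` value used later would suggest an incomplete clone.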


### 2.1 Load shots

We load only a subset of match files (`limit_files=50`) for quick exploration.


In [None]:
df = load_shots_dataframe(events_dir, limit_files=50)
df.shape


In [None]:
df.head()


### 2.2 Basic data cleaning

The `load_shots_dataframe` function already removes shots without distance/angle/minute. Here we inspect the share of missing values per column and the summary statistics of the numeric columns.


In [None]:
df.isna().mean()


In [None]:
df.describe()
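
The `distance` and `angle` columns are geometric features derived from the shot location. The notebook does not show how `load_shots_dataframe` computes them, so the sketch below is only one common definition, and the helper names `shot_distance` and `shot_angle` are hypothetical. It relies on the StatsBomb pitch coordinates (120 x 80, goal centred at (120, 40), posts at y = 36 and y = 44); the units used by the project (yards vs metres, radians vs degrees) are an assumption.

```python
import math

# StatsBomb pitch coordinates: the attacking goal sits on the line x = 120.
GOAL_X = 120.0
GOAL_CENTER_Y = 40.0
POST_LOW_Y, POST_HIGH_Y = 36.0, 44.0


def shot_distance(x: float, y: float) -> float:
    """Euclidean distance from the shot location to the centre of the goal."""
    return math.hypot(GOAL_X - x, GOAL_CENTER_Y - y)


def shot_angle(x: float, y: float) -> float:
    """Angle in radians subtended by the two posts as seen from (x, y)."""
    dx = GOAL_X - x
    to_low_post = math.atan2(POST_LOW_Y - y, dx)
    to_high_post = math.atan2(POST_HIGH_Y - y, dx)
    return abs(to_high_post - to_low_post)


# Penalty spot: 12 units out, straight in front of goal.
print(shot_distance(108, 40), shot_angle(108, 40))
```

A central shot sees a wider goal mouth than a shot from the same depth near the touchline, which is why the angle usually carries information that distance alone does not.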


### 2.3 Target distribution (goal vs non-goal)


In [None]:
df["is_goal"].value_counts(normalize=True)


## 3. EDA: distance, angle and time


In [None]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

df["distance"].hist(bins=30, ax=axes[0])
axes[0].set_title("Distance distribution")

df["angle"].hist(bins=30, ax=axes[1])
axes[1].set_title("Angle distribution")

df["minute"].hist(bins=30, ax=axes[2])
axes[2].set_title("Minute distribution")

plt.tight_layout()
plt.show()
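
Histograms show where shots are taken from, but not how often they go in. A possible follow-up is the goal rate per distance band. The sketch below uses synthetic data so that it runs on its own; the bin edges are arbitrary assumptions, and in the notebook `demo` would be replaced by `df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the shots table: closer shots score more often.
rng = np.random.default_rng(0)
distance = rng.uniform(1, 40, size=2000)
is_goal = (rng.random(2000) < np.exp(-distance / 10)).astype(int)
demo = pd.DataFrame({"distance": distance, "is_goal": is_goal})

# Arbitrary distance bands; adjust to the units of the real `distance` column.
bins = [0, 6, 12, 18, 25, 40]
bands = pd.cut(demo["distance"], bins=bins)

by_distance = demo.groupby(bands, observed=True)["is_goal"].agg(
    shots="size", goal_rate="mean"
)
print(by_distance)
```

The same pattern works for `angle` or `minute` by swapping the column and the bin edges.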


## 4. Prepare features for modeling

We will use `FEATURE_COLUMNS` and apply simple one-hot encoding with `pandas.get_dummies`.


In [None]:
X = pd.get_dummies(df[FEATURE_COLUMNS], drop_first=True)
y = df["is_goal"].values

X.shape, y.shape
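
To make the effect of `drop_first=True` concrete, here is a toy example. The column `body_part` and its values are illustrative assumptions; the actual contents of `FEATURE_COLUMNS` are not shown in this notebook.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "distance": [11.0, 24.5, 8.2],
        "body_part": ["Right Foot", "Head", "Left Foot"],  # hypothetical feature
    }
)

# Numeric columns pass through; each categorical column becomes k - 1 indicators.
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
# ['distance', 'body_part_Left Foot', 'body_part_Right Foot']
```

The first category in sorted order (`Head` here) becomes the implicit baseline, which avoids perfectly collinear indicator columns in the logistic regression.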


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

X_train.shape, X_val.shape
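
Because goals are the minority class, `stratify=y` matters: it keeps the goal rate the same in both splits. A small self-contained check on synthetic labels (10% positives, an assumed rate chosen only for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1000 synthetic shots, 100 of them goals.
y_demo = np.array([1] * 100 + [0] * 900)
X_demo = np.arange(1000).reshape(-1, 1)

_, _, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# With stratification both splits keep the overall goal rate.
print("train goal rate:", y_tr.mean())
print("validation goal rate:", y_va.mean())
```

In the notebook, the equivalent check is comparing `y_train.mean()` with `y_val.mean()`.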


## 5. Model comparison and selection

We compare Logistic Regression and RandomForest by their mean ROC AUC under 5-fold cross-validation on the training split.


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import numpy as np

models = {
    "logreg": LogisticRegression(
        max_iter=1000,
        class_weight="balanced",
        n_jobs=-1,
    ),
    "random_forest": RandomForestClassifier(
        n_estimators=200,
        max_depth=10,
        min_samples_leaf=5,
        random_state=42,
        n_jobs=-1,
    ),
}



In [None]:
for name, model in models.items():
    scores = cross_val_score(
        model,
        X_train,
        y_train,
        cv=5,
        scoring="roc_auc",
        n_jobs=-1,
    )
    print(f"{name} ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
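
The split in section 4 sets aside `X_val` and `y_val`; one way to use them is a final held-out check after cross-validation. The sketch below runs on synthetic data from `make_classification` so that it is self-contained. Reporting the Brier score next to ROC AUC is a suggestion, not something the project prescribes: ROC AUC measures ranking only, while an xG model is also judged on how good its probabilities are.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the shots data (about 10% positives).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=6, weights=[0.9, 0.1], random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_va)[:, 1]  # probability of the positive class

val_auc = roc_auc_score(y_va, proba)
val_brier = brier_score_loss(y_va, proba)
print(f"validation ROC AUC: {val_auc:.3f}, Brier score: {val_brier:.3f}")
```

In the notebook, `X_tr`, `X_va`, `y_tr`, `y_va` correspond to `X_train`, `X_val`, `y_train`, `y_val`.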


## 6. Feature importance (Logistic Regression)

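After fitting Logistic Regression, we rank the features by the absolute value of their coefficients. The features are not standardized here, so each magnitude depends on the scale of its feature and the ranking is only a rough guide.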


In [None]:
logreg = models["logreg"]
logreg.fit(X_train, y_train)

importance = pd.Series(
    np.abs(logreg.coef_[0]),
    index=X_train.columns,
).sort_values(ascending=False)

importance.head(20)
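
Coefficient magnitudes are easier to compare when every feature is on the same scale. A minimal sketch of that idea, assuming a `StandardScaler` step in front of the model (this is not necessarily how the project preprocesses its data), on synthetic shots with made-up effect sizes:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic shots: goals get less likely with distance, more likely with angle.
rng = np.random.default_rng(0)
n = 1000
demo = pd.DataFrame(
    {
        "distance": rng.uniform(1, 40, n),
        "angle": rng.uniform(0.05, 1.5, n),
    }
)
logit = (1.0 - 0.15 * demo["distance"] + 1.5 * demo["angle"]).to_numpy()
goal = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(demo, goal)

# Coefficients now refer to a one-standard-deviation change in each feature.
coefs = pd.Series(
    pipe.named_steps["logisticregression"].coef_[0], index=demo.columns
)
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index))
```

Keeping the sign (instead of taking `np.abs` straight away) also shows the direction of each effect.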


## 7. Simple hyperparameter tuning (RandomForest)

We run a small `GridSearchCV` for RandomForest over `n_estimators` and `max_depth`, scored by ROC AUC with 3-fold cross-validation.


In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [5, 10],
}

rf = RandomForestClassifier(
    min_samples_leaf=5,
    random_state=42,
    n_jobs=-1,
)

grid = GridSearchCV(
    rf,
    param_grid=param_grid,
    scoring="roc_auc",
    cv=3,
    n_jobs=-1,
)

grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print("Best ROC AUC:", grid.best_score_)
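
The best score alone does not show how close the other candidates were. One option is to tabulate `cv_results_`; the sketch below does this on synthetic data with a deliberately small grid so that it runs quickly. The grid values here are illustrative and differ from the ones above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, weights=[0.9, 0.1], random_state=42
)

search = GridSearchCV(
    RandomForestClassifier(min_samples_leaf=5, random_state=42),
    param_grid={"n_estimators": [20, 40], "max_depth": [3, 5]},
    scoring="roc_auc",
    cv=3,
)
search.fit(X_demo, y_demo)

# One row per parameter combination, best candidate first.
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")
print(results)
```

If the gap between candidates is smaller than `std_test_score`, the simpler (shallower, fewer trees) configuration is often a reasonable pick.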


## 8. Conclusion and connection to `train.py`

- This notebook is for **exploration**, feature selection and model comparison.
- The `train.py` script in the project trains the final model (with preprocessing + calibration) on all shots and saves the pipeline in `models/`.
- The web service (`predict.py`) loads this pipeline and exposes the `/predict` endpoint to receive shots and return `prob_goal`.
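
The "preprocessing + calibration" and "saves the pipeline" steps mentioned above can be sketched roughly as follows. This is not the project's `train.py`: the choice of `StandardScaler`, sigmoid calibration, `joblib`, and the file name `xg_pipeline.joblib` are all assumptions made for illustration, and the data is synthetic.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, weights=[0.9, 0.1], random_state=42
)

# Preprocessing + model, wrapped in cross-validated probability calibration.
base = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=5)
calibrated.fit(X_demo, y_demo)

# Persist and reload, as a prediction service would do at start-up.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "xg_pipeline.joblib"  # hypothetical file name
    joblib.dump(calibrated, path)
    restored = joblib.load(path)

prob_goal = restored.predict_proba(X_demo[:5])[:, 1]
print(np.round(prob_goal, 3))
```

Saving the whole fitted object, rather than only the model weights, means the service applies exactly the same preprocessing at prediction time as was used in training.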
