# Exploratory Data Analysis for xG Tutor

This notebook explores the StatsBomb-derived SQLite dataset used by **xG Tutor**.
We will audit data quality, engineer key shot geometry features (distance and
angle), and look for early relationships between goal outcomes, StatsBomb xG,
and other shot attributes.

## Imports & Notebook Setup

We start by loading the Python packages needed for data access, numerical
analysis, and visualization.

In [None]:
import json
import math
import sqlite3
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

pd.options.display.max_columns = 60
pd.options.display.float_format = "{:.3f}".format
sns.set_context("notebook")

## Load the StatsBomb-Derived Dataset

Set the path to the SQLite database populated by the ETL pipeline. If you have
not run the loader yet, execute the following command from the project root to
create `data/xg_tutor.db` first:

```bash
poetry run python -m xgoal_tutor.etl /path/to/events.json data/xg_tutor.db
```

Update `DB_PATH` below if you store the database elsewhere.

In [None]:
DB_PATH = Path("data/xg_tutor.db")
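if not DB_PATH.exists():
    # Include the resolved path so a wrong working directory is easy to spot.
    raise FileNotFoundError(
        f"SQLite database not found at {DB_PATH.resolve()}. "
        "Run the ETL loader before executing the notebook."
    )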

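# sqlite3's connection context manager only commits or rolls back a transaction;
# it does not close the connection, so close it explicitly.
conn = sqlite3.connect(DB_PATH)
try:
    shots = pd.read_sql("SELECT * FROM shots", conn)
    events = pd.read_sql("SELECT * FROM events", conn)
    freeze_frames = pd.read_sql("SELECT * FROM freeze_frames", conn)
finally:
    conn.close()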

shots.head()
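
Before reading whole tables, it can help to confirm that the loader actually created them. The following is a minimal sketch, not part of the ETL pipeline: `missing_tables` is a hypothetical helper, and the demo runs against an in-memory database with a made-up schema rather than `data/xg_tutor.db`.

```python
import sqlite3
from contextlib import closing

# Tables this notebook reads from the database.
EXPECTED_TABLES = {"shots", "events", "freeze_frames"}


def missing_tables(conn: sqlite3.Connection, expected: set[str]) -> set[str]:
    """Return the expected table names that are absent from the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    present = {name for (name,) in rows}
    return expected - present


# Demo on an in-memory database with toy (hypothetical) columns.
with closing(sqlite3.connect(":memory:")) as demo_conn:
    demo_conn.execute("CREATE TABLE shots (shot_id TEXT, outcome TEXT)")
    demo_conn.execute("CREATE TABLE events (event_id TEXT)")
    absent = missing_tables(demo_conn, EXPECTED_TABLES)

print(absent)
```

Against the real database, an empty set would suggest that all three tables exist; a non-empty set points to an incomplete ETL run.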

## Initial Data Audit

Inspect the basic structure and completeness of the `shots` table to surface
potential data quality issues.

In [None]:
shots.info()

In [None]:
missing_summary = (
    shots.isna()
    .mean()
    .rename("missing_fraction")
    .to_frame()
    .assign(missing_count=shots.isna().sum())
    .sort_values("missing_fraction", ascending=False)
)
missing_summary.head(20)

In [None]:
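# `datetime_is_numeric` was removed in pandas 2.0, where datetimes are always
# summarized numerically, so the argument is omitted here.
shots.describe(include="all")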

### Logical Consistency Checks

A few quick validation queries help reveal duplicated identifiers and confirm
basic assumptions about coordinates and categorical values.

In [None]:
consistency_checks = {
    "duplicate_shot_ids": shots["shot_id"].duplicated().sum(),
    "unique_outcomes": shots["outcome"].dropna().unique(),
    "start_x_range": shots["start_x"].agg(["min", "max"]),
    "start_y_range": shots["start_y"].agg(["min", "max"]),
}
consistency_checks
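
The min/max ranges above show whether any coordinate leaves the pitch, but not which shots are affected. One possible way to list the offending rows is sketched below; `out_of_bounds_mask` is a hypothetical helper, the toy frame uses invented values, and treating a missing coordinate as out of bounds is an assumption of this sketch.

```python
import pandas as pd

SB_PITCH_LENGTH = 120.0
SB_PITCH_WIDTH = 80.0


def out_of_bounds_mask(df: pd.DataFrame) -> pd.Series:
    """True where a start location is missing or outside the 120x80 grid."""
    x_ok = df["start_x"].between(0, SB_PITCH_LENGTH)
    y_ok = df["start_y"].between(0, SB_PITCH_WIDTH)
    # `between` returns False for NaN, so missing coordinates are flagged too.
    return ~(x_ok & y_ok)


# Toy rows with invented values, purely for illustration.
toy_shots = pd.DataFrame(
    {
        "shot_id": ["a", "b", "c", "d"],
        "start_x": [108.0, 121.5, 95.0, None],
        "start_y": [40.0, 35.0, -2.0, 40.0],
    }
)
flagged = toy_shots.loc[out_of_bounds_mask(toy_shots), "shot_id"].tolist()
print(flagged)
```

On the real data the same mask could be applied as `shots[out_of_bounds_mask(shots)]` to inspect suspicious rows before engineering geometry features.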

## Geometry Feature Engineering

StatsBomb records shot locations on a 120×80 coordinate grid. We rescale those
coordinates linearly to a 105×68 meter pitch, assuming the attacked goal sits
at the `x = 120` end, and compute two key geometry features:

- **Distance to the goal center** in meters
- **Goal opening angle**, the angle subtended by the two goalposts at the shot
  location, in both radians and degrees

In [None]:
SB_PITCH_LENGTH = 120.0
SB_PITCH_WIDTH = 80.0
REAL_PITCH_LENGTH = 105.0
REAL_PITCH_WIDTH = 68.0
GOAL_WIDTH = 7.32
GOAL_X = REAL_PITCH_LENGTH
GOAL_Y = REAL_PITCH_WIDTH / 2
UPPER_POST_Y = GOAL_Y - GOAL_WIDTH / 2
LOWER_POST_Y = GOAL_Y + GOAL_WIDTH / 2


def statsbomb_to_metric(x: pd.Series, y: pd.Series) -> tuple[pd.Series, pd.Series]:
    """Convert StatsBomb shot coordinates to metric units."""

    scale_x = REAL_PITCH_LENGTH / SB_PITCH_LENGTH
    scale_y = REAL_PITCH_WIDTH / SB_PITCH_WIDTH
    return x * scale_x, y * scale_y


def goal_distance(x_m: pd.Series, y_m: pd.Series) -> pd.Series:
    """Compute Euclidean distance to the goal center in meters."""

    return np.hypot(GOAL_X - x_m, GOAL_Y - y_m)


def goal_opening_angle(x_m: pd.Series, y_m: pd.Series) -> pd.Series:
    """Compute the goal opening angle (radians) from the shot location."""

    shot_points = np.column_stack([x_m, y_m])
    upper_post = np.array([GOAL_X, UPPER_POST_Y])
    lower_post = np.array([GOAL_X, LOWER_POST_Y])

    vec_upper = upper_post - shot_points
    vec_lower = lower_post - shot_points

    dot = np.einsum("ij,ij->i", vec_upper, vec_lower)
    norms = np.linalg.norm(vec_upper, axis=1) * np.linalg.norm(vec_lower, axis=1)

    cos_theta = np.clip(dot / norms, -1.0, 1.0)
    return np.arccos(cos_theta)


shots = shots.copy()
shots["start_x_m"], shots["start_y_m"] = statsbomb_to_metric(shots["start_x"], shots["start_y"])
shots["distance_to_goal_m"] = goal_distance(shots["start_x_m"], shots["start_y_m"])
shots["goal_angle_rad"] = goal_opening_angle(shots["start_x_m"], shots["start_y_m"])
shots["goal_angle_deg"] = np.degrees(shots["goal_angle_rad"])

shots[[
    "shot_id",
    "start_x",
    "start_y",
    "start_x_m",
    "start_y_m",
    "distance_to_goal_m",
    "goal_angle_rad",
    "goal_angle_deg",
]].head()
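
As a sanity check on the geometry, the sketch below works through a single point by hand. It assumes the penalty spot sits at `(108, 40)` in StatsBomb units, and it uses a closed-form `atan2` expression for the opening angle as an independent cross-check of the vectorized dot-product version above; it is not the notebook's own method.

```python
import math

# Constants repeated from the cell above so this sketch runs on its own.
SB_PITCH_LENGTH, SB_PITCH_WIDTH = 120.0, 80.0
REAL_PITCH_LENGTH, REAL_PITCH_WIDTH = 105.0, 68.0
GOAL_WIDTH = 7.32

# Assumption: the penalty spot sits at (108, 40) in StatsBomb units.
sb_x, sb_y = 108.0, 40.0

x_m = sb_x * REAL_PITCH_LENGTH / SB_PITCH_LENGTH  # 94.5
y_m = sb_y * REAL_PITCH_WIDTH / SB_PITCH_WIDTH    # 34.0

dx = REAL_PITCH_LENGTH - x_m      # distance to the goal line along x
dy = REAL_PITCH_WIDTH / 2 - y_m   # lateral offset from the goal center

distance_m = math.hypot(dx, dy)

# Closed form for the angle between the two post directions:
# cross product = GOAL_WIDTH * dx
# dot product   = dx^2 + dy^2 - (GOAL_WIDTH / 2)^2
angle_rad = math.atan2(GOAL_WIDTH * dx, dx**2 + dy**2 - (GOAL_WIDTH / 2) ** 2)
angle_deg = math.degrees(angle_rad)

print(f"distance = {distance_m:.2f} m, angle = {angle_deg:.1f} deg")
```

Under this linear rescaling the spot lands 10.5 m from the goal line rather than the regulation 11 m, because the 120-unit length is compressed uniformly to 105 m. Distances near the goal may therefore be slightly understated.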

### Geometry Feature Distributions

Plot the distributions of the engineered geometry features to understand their
ranges and detect potential outliers.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sns.histplot(shots["distance_to_goal_m"], kde=True, bins=30, ax=axes[0])
axes[0].set_title("Shot Distance to Goal (m)")
axes[0].set_xlabel("Distance (m)")

sns.histplot(shots["goal_angle_deg"], kde=True, bins=30, ax=axes[1])
axes[1].set_title("Goal Opening Angle (°)")
axes[1].set_xlabel("Angle (degrees)")

plt.tight_layout()
plt.show()

## Relationship Between Outcomes and Features

Add helper columns for goal outcomes and inspect how distance, angle, and
StatsBomb xG differ between scoring and non-scoring shots.

In [None]:
shots["is_goal"] = shots["outcome"].eq("Goal")

summary_by_outcome = shots.groupby("is_goal")[
    ["distance_to_goal_m", "goal_angle_deg", "statsbomb_xg"]
].agg(["mean", "median", "std", "count"])
summary_by_outcome
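
A two-group summary hides how conversion changes along a feature. One possible complement, sketched here on invented toy values, bins shots by distance and compares the observed goal rate with the mean StatsBomb xG per band. The bin edges are arbitrary assumptions chosen for illustration.

```python
import pandas as pd

# Toy shots with invented values, purely for illustration.
toy = pd.DataFrame(
    {
        "distance_to_goal_m": [4.0, 8.0, 9.5, 14.0, 17.0, 22.0, 28.0, 31.0],
        "is_goal": [True, True, False, False, True, False, False, False],
        "statsbomb_xg": [0.60, 0.35, 0.30, 0.12, 0.10, 0.05, 0.03, 0.02],
    }
)

# Arbitrary distance bands in meters (right-closed intervals).
bins = [0, 10, 20, 40]
toy["distance_band"] = pd.cut(toy["distance_to_goal_m"], bins=bins)

band_summary = toy.groupby("distance_band", observed=True).agg(
    shots=("is_goal", "size"),
    goal_rate=("is_goal", "mean"),
    mean_xg=("statsbomb_xg", "mean"),
)
print(band_summary)
```

On the real `shots` frame, bands where `goal_rate` and `mean_xg` diverge strongly are worth a closer look, keeping in mind that bands with few shots give noisy rates.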

In [None]:
features = ["distance_to_goal_m", "goal_angle_deg", "statsbomb_xg"]

# The three features live on very different scales (meters, degrees,
# probabilities), so draw one panel per feature instead of sharing a y-axis.
fig, axes = plt.subplots(1, len(features), figsize=(15, 5))
for ax, feature in zip(axes, features):
    sns.boxplot(data=shots, x="is_goal", y=feature, ax=ax)
    ax.set_title(feature)
    ax.set_xlabel("Goal Scored")
    ax.set_ylabel("Value")

fig.suptitle("Feature Distributions by Goal Outcome")
plt.tight_layout()
plt.show()

### Correlation Matrix

Compute a correlation matrix between StatsBomb xG, the engineered geometry
features, and a selection of contextual flags to identify promising modelling
signals.

In [None]:
corr_features = [
    "statsbomb_xg",
    "is_goal",
    "distance_to_goal_m",
    "goal_angle_deg",
    "first_time",
    "one_on_one",
    "open_goal",
    "under_pressure",
    "is_set_piece",
    "is_corner",
    "is_free_kick",
]

corr_matrix = shots[corr_features].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(
    corr_matrix,
    annot=True,
    fmt=".2f",
    cmap="RdBu_r",
    center=0,
    linewidths=0.5,
)
plt.title("Correlation Between Key Features and Outcomes")
plt.tight_layout()
plt.show()
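
The heatmap can be hard to read when the goal is simply to rank candidate signals. A minimal sketch of one way to do that is shown below: it takes the `is_goal` column of a correlation matrix and sorts it by absolute value. The toy frame uses invented values and only a subset of the columns above. Casting to `float` is an assumption that every column is boolean or numeric.

```python
import pandas as pd

# Toy frame with invented values, purely for illustration.
toy = pd.DataFrame(
    {
        "is_goal": [True, False, False, True, False, False],
        "distance_to_goal_m": [6.0, 25.0, 18.0, 9.0, 30.0, 12.0],
        "goal_angle_deg": [55.0, 12.0, 18.0, 35.0, 9.0, 25.0],
        "under_pressure": [0, 1, 0, 1, 1, 0],
    }
)

corr_with_goal = (
    toy.astype(float)            # make boolean and integer flags numeric
    .corr()["is_goal"]           # Pearson correlation of each column with the outcome
    .drop("is_goal")             # drop the trivial self-correlation
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_goal)
```

Pearson correlation against a binary outcome captures only linear association, so a weak value here does not rule out a useful non-linear signal.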

## Next Steps

- Investigate additional contextual features (defensive pressure, body part,
  play pattern) for predictive power.
- Blend geometry metrics with freeze-frame derived defender proximity features.
- Calibrate an initial xG model using the engineered feature set.