# 00 - Data exploration

Open and inspect raw and merged datasets to understand schemas and missingness.

### Objective
Explore and understand the Big Data Bowl 2026 dataset (input, output, supplementary).
Identify which fields are available, what’s missing, and what must be reconstructed.


In [1]:
# IMPORTANT: SOME KAGGLE DATA SOURCES ARE PRIVATE
# RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES.
import kagglehub
kagglehub.login()


In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

The following step-by-step checklist covers `01_data_exploration.ipynb`.
Each step lists **what to do**, **why it matters**, and **what to produce/check**, so the reason behind every action stays on record.

Copy this into your notebook as headings and run each step one-by-one.

---

## Step 0 — Notebook header

**Aim:** declare purpose and keep the notebook self-documenting.
**Do:** add a top markdown cell:

* Title: `01_data_exploration.ipynb`
* Objective: “Inspect input/output/supplementary; confirm throw-window logic; list fields to reconstruct; produce Data Schema & Time Logic Summary.”

**Produce:** short header so anyone opening the notebook knows the goal.

---

## Step 1 — Imports & file paths

**Aim:** make environment reproducible and files easy to load.
**Do:** import pandas, numpy, matplotlib, seaborn (optional), pathlib. Set data paths for `data/raw/`.
**Check:** notebook runs without import errors.
**Produce:** a small code cell showing file paths.
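
A minimal sketch of such a file-path cell. The `data/raw/` location comes from this step; the helper name `find_week_files` and the glob patterns built from `input_*.csv` / `output_*.csv` are assumptions for illustration, so adjust them to the actual download layout.

```python
from pathlib import Path

# Assumed location of the raw CSVs; change this to match your environment.
RAW_DIR = Path("data/raw")

def find_week_files(raw_dir):
    """Collect input/output week files and the supplementary file, if present."""
    if not raw_dir.is_dir():
        # Nothing downloaded yet: return an empty inventory instead of failing.
        return {"input": [], "output": [], "supplementary": None}
    supp = raw_dir / "supplementary.csv"
    return {
        "input": sorted(raw_dir.glob("input_*.csv")),
        "output": sorted(raw_dir.glob("output_*.csv")),
        "supplementary": supp if supp.exists() else None,
    }

files = find_week_files(RAW_DIR)
print(f"input files:   {len(files['input'])}")
print(f"output files:  {len(files['output'])}")
print(f"supplementary: {files['supplementary']}")
```

Printing the counts up front makes a missing or misnamed folder obvious before any loading code runs.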

---

## Step 2 — Quick file load & shape check

**Aim:** confirm files load and see basic row/col counts.
**Do:** `pd.read_csv()` each of `input_*.csv`, `output_*.csv`, `supplementary.csv` for at least one week file. Print `.shape` and `.columns`.
**Why:** ensures files exist and columns match expectations; reveals immediate size/memory considerations.
**Produce / Check:** prints like:

* `Input shape: (n_rows, n_cols)`
* `Output shape: ...`
* `Supplementary shape: ...`
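
The load-and-report step could look roughly like the sketch below. `report_shape` is a hypothetical helper, and the in-memory CSV with made-up values only stands in for a real week file so the cell runs anywhere.

```python
import io
import pandas as pd

def report_shape(name, source):
    """Read a CSV (path or file-like object) and print its shape and columns."""
    df = pd.read_csv(source)
    print(f"{name} shape: {df.shape}")
    print(f"{name} columns: {df.columns.tolist()}")
    return df

# Stand-in for a real file such as one of the input_*.csv weeks (values are made up).
fake_input = io.StringIO(
    "game_id,play_id,nfl_id,frame_id,x,y\n"
    "1,10,100,1,50.0,20.0\n"
    "1,10,100,2,50.5,20.1\n"
)
input_df = report_shape("Input", fake_input)
```

With real data, pass the file path instead of the `StringIO` object and repeat the call for the output and supplementary files.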

---

## Step 3 — Column inventory per file

**Aim:** list what fields exist where (quick data dictionary).
**Do:** for each df run `df.columns.tolist()` and `df.dtypes`. Note fields present only in input (e.g., `s, a, o, dir, ball_land_x/y`), and fields present only in output (`frame_id`, `x,y` after throw).
**Why:** identifies exactly what must be reconstructed or merged.
**Produce:** a tiny markdown table showing `Field | Input | Output | Supplementary`.
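
One possible way to build the `Field | Input | Output | Supplementary` table is a presence matrix. The helper name `column_inventory` and the toy column sets are assumptions; the real frames come from Step 2.

```python
import pandas as pd

def column_inventory(frames):
    """Build a Field x File presence table from a {name: DataFrame} mapping."""
    all_cols = sorted(set().union(*(df.columns for df in frames.values())))
    return pd.DataFrame(
        {name: [col in df.columns for col in all_cols] for name, df in frames.items()},
        index=pd.Index(all_cols, name="Field"),
    )

# Toy frames with illustrative columns only (no rows needed for an inventory).
input_df = pd.DataFrame(columns=["game_id", "play_id", "nfl_id", "s", "a", "ball_land_x"])
output_df = pd.DataFrame(columns=["game_id", "play_id", "nfl_id", "frame_id", "x", "y"])
supp_df = pd.DataFrame(columns=["game_id", "play_id"])

inventory = column_inventory(
    {"Input": input_df, "Output": output_df, "Supplementary": supp_df}
)
print(inventory)
```

Rows that are `True` in only one column are the fields that have to be merged in or reconstructed later.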

---

## Step 4 — Missingness & types

**Aim:** detect nulls, weird types, or garbage values early.
**Do:** `df.isna().sum()` and `df.describe()` for numeric ranges (especially `x`, `y`). Look for NaNs in `ball_land_x/y`, `num_frames_output`, or `frame_id`.
**Why:** missing values affect merging, reconstruction, and metric validity.
**Produce / Check:** list of columns with >0% missing; note whether `x,y` have invalid ranges (<0 or >field dims).
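
A small sketch of the missingness and range checks. The helper names are hypothetical, and the field bounds (0 to 120 in `x`, 0 to 53.3 in `y`) are the ones quoted in Step 8; the toy values are made up.

```python
import numpy as np
import pandas as pd

FIELD_X_MAX, FIELD_Y_MAX = 120.0, 53.3  # yards, as quoted in Step 8

def missing_report(df):
    """Percent missing per column, keeping only columns with any missing values."""
    pct = df.isna().mean() * 100
    return pct[pct > 0].sort_values(ascending=False)

def out_of_bounds(df):
    """Rows whose non-null x or y lies outside the field rectangle."""
    bad_x = df["x"].notna() & ~df["x"].between(0, FIELD_X_MAX)
    bad_y = df["y"].notna() & ~df["y"].between(0, FIELD_Y_MAX)
    return df[bad_x | bad_y]

df = pd.DataFrame({
    "x": [10.0, 125.0, 60.0, np.nan],
    "y": [5.0, 20.0, -1.0, 30.0],
    "ball_land_x": [np.nan, 40.0, 40.0, 40.0],
})
print(missing_report(df))
print(out_of_bounds(df))
```

Missing coordinates are reported by `missing_report` rather than counted as out of bounds, so the two lists do not overlap.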

---

## Step 5 — Uniqueness & key-structure validation

**Aim:** verify identity integrity fundamentals.
**Do:** run checks:

* `input_df.duplicated(['game_id','play_id','nfl_id']).any()` (add `frame_id` to the key if the input file holds one row per frame)
* `output_df.duplicated(['game_id','play_id','nfl_id','frame_id']).any()`
* `supp_df.duplicated(['game_id','play_id']).any()`

Also compare `input`’s `num_frames_output` with the maximum `frame_id` in `output` per `(game_id, play_id, nfl_id)`.
**Why:** prevents silent ID mixing later.
**Produce / Check:** assert tests or a small report listing any violations (log any that exist).

---

## Step 6 — Frame rate & temporal consistency

**Aim:** confirm frame rate (frames per second) and per-play frame continuity.
**Do:** for several plays compute `max(frame_id)` and infer duration as `max_frame / assumed_frame_rate` (assume 10Hz initially). Check if `frame_id` is sequential 1..N.
**Why:** smoothing and derivative formulas depend on correct Δt.
**Produce / Check:** list of sample plays with `max_frame`, and note if any non-1 start or gaps exist.
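
A sketch of the continuity check, assuming the output frame has `game_id`, `play_id`, `nfl_id`, and `frame_id` columns. The 10 Hz rate is the initial assumption from this step, not a confirmed property of the data, and `frame_continuity` is a hypothetical helper.

```python
import pandas as pd

ASSUMED_FRAME_RATE = 10  # Hz; an assumption to be verified, as noted in this step

def frame_continuity(output_df):
    """Per player-play: first/last frame, frame count, and whether frames run 1..N."""
    keys = ["game_id", "play_id", "nfl_id"]
    report = output_df.groupby(keys)["frame_id"].agg(
        min_frame="min", max_frame="max", n_frames="count", n_unique="nunique"
    )
    # N distinct integers with minimum 1 and maximum N are exactly 1..N.
    report["sequential"] = (
        (report["min_frame"] == 1)
        & (report["max_frame"] == report["n_frames"])
        & (report["n_unique"] == report["n_frames"])
    )
    report["duration_s"] = report["max_frame"] / ASSUMED_FRAME_RATE
    return report

# Toy data: player 100 has frames 1..3, player 200 is missing frame 3.
output_df = pd.DataFrame({
    "game_id": [1] * 6,
    "play_id": [10] * 6,
    "nfl_id": [100, 100, 100, 200, 200, 200],
    "frame_id": [1, 2, 3, 1, 2, 4],
})
report = frame_continuity(output_df)
print(report)
```

Any row with `sequential == False` marks a play to inspect before differentiating positions, since a gap changes the effective Δt.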

---

## Step 7 — Cross-file play matching

**Aim:** ensure plays in input map to output and supplementary.
**Do:** compute counts per play and compare:

* Unique `(game_id,play_id)` in input vs supplementary vs output.

Report plays present in input but missing in output, and so on.
**Why:** some plays may be filtered out (penalties, scrambles); know which to drop.
**Produce / Check:** small Venn or counts and list of mismatched play ids.
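
The comparison can be sketched with plain Python sets of `(game_id, play_id)` pairs. `play_keys` and `compare_plays` are hypothetical helpers, and the toy frames are made up.

```python
import pandas as pd

def play_keys(df):
    """Set of unique (game_id, play_id) pairs in a frame."""
    pairs = df[["game_id", "play_id"]].drop_duplicates()
    return set(pairs.itertuples(index=False, name=None))

def compare_plays(input_df, output_df, supp_df):
    """Counts per file plus the plays that do not line up across files."""
    inp, out, sup = play_keys(input_df), play_keys(output_df), play_keys(supp_df)
    return {
        "n_input": len(inp),
        "n_output": len(out),
        "n_supplementary": len(sup),
        "in_input_not_output": sorted(inp - out),
        "in_input_not_supplementary": sorted(inp - sup),
        "in_output_not_input": sorted(out - inp),
    }

input_df = pd.DataFrame({"game_id": [1, 1, 1], "play_id": [10, 10, 11]})
output_df = pd.DataFrame({"game_id": [1, 1], "play_id": [10, 10]})
supp_df = pd.DataFrame({"game_id": [1, 1, 1], "play_id": [10, 11, 12]})

mismatch = compare_plays(input_df, output_df, supp_df)
print(mismatch)
```

The three difference lists are the raw material for the "plays to drop" decision mentioned above.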

---

## Step 8 — Quick visualization: single-play sanity check

**Aim:** visually confirm coordinate orientation and consistency.
**Do:** pick 2–3 sample plays (one deep pass, one short, one contested). Plot `x,y` before throw (input) and after throw (output) on same axes. Color by `frame_id` (use alpha).
**Why:** confirms coordinate system (0–120 x axis, 0–53.3 y) and play_direction orientation.
**Produce / Check:** 2 plots per play: pre-throw scatter + post-throw trace. Verify shapes look like football plays, not noisy clouds.
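
A rough plotting sketch for one player on one play. It assumes both frames carry `frame_id`, `x`, and `y`; `plot_player_path` is a hypothetical helper and the coordinates are made up. The `Agg` backend line is only there so the cell also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

def plot_player_path(pre, post, ax=None):
    """Scatter pre-throw positions (colored by frame) and trace post-throw positions."""
    if ax is None:
        _, ax = plt.subplots(figsize=(10, 4.5))
    ax.scatter(pre["x"], pre["y"], c=pre["frame_id"], cmap="viridis",
               alpha=0.6, s=15, label="pre-throw (input)")
    ax.plot(post["x"], post["y"], color="tab:red", alpha=0.8,
            label="post-throw (output)")
    # Field extent quoted in this step: x in 0..120, y in 0..53.3 yards.
    ax.set_xlim(0, 120)
    ax.set_ylim(0, 53.3)
    ax.set_xlabel("x (yards)")
    ax.set_ylabel("y (yards)")
    ax.legend()
    return ax

# Made-up coordinates for a single player.
pre = pd.DataFrame({"frame_id": [1, 2, 3], "x": [30.0, 31.0, 32.5], "y": [20.0, 20.5, 21.5]})
post = pd.DataFrame({"frame_id": [1, 2, 3], "x": [34.0, 36.0, 38.0], "y": [22.5, 24.0, 25.0]})
ax = plot_player_path(pre, post)
```

Fixing the axis limits to the field size makes flipped or mis-scaled coordinates stand out immediately.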

---

## Step 9 — Identify target / receiver & defender sets

**Aim:** make sure you can identify targeted receiver and defender pool per play.
**Do:** from input look at `player_role` and `player_side` to find `Targeted Receiver` and `Defensive Coverage`. For each targeted receiver, list defenders present (their `nfl_id`s) in output frames.
**Why:** SG and CCI need nearest defenders and defender sets.
**Produce / Check:** sample mapping for a play like `target_nfl_id: [def1, def2,...]`.
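
A sketch of the target-to-defender mapping. The role labels `Targeted Receiver` and `Defensive Coverage` are taken from this step; the helper name, the `Passer` label, and all ids in the toy data are assumptions.

```python
import pandas as pd

PLAY_KEYS = ["game_id", "play_id"]

def target_defender_map(input_df, output_df):
    """Map (game_id, play_id, target_nfl_id) to defender ids that appear in output frames."""
    roles = input_df.drop_duplicates(PLAY_KEYS + ["nfl_id"])
    ids_in_output = output_df.groupby(PLAY_KEYS)["nfl_id"].apply(set).to_dict()
    mapping = {}
    for key, grp in roles.groupby(PLAY_KEYS):
        present = ids_in_output.get(key, set())
        targets = grp.loc[grp["player_role"] == "Targeted Receiver", "nfl_id"].tolist()
        defenders = [
            d for d in grp.loc[grp["player_role"] == "Defensive Coverage", "nfl_id"].tolist()
            if d in present  # keep only defenders tracked after the throw
        ]
        for target in targets:
            mapping[(int(key[0]), int(key[1]), target)] = defenders
    return mapping

# Toy data: defender 201 has no output frames, so it is excluded.
input_df = pd.DataFrame({
    "game_id": [1, 1, 1, 1, 1],
    "play_id": [10, 10, 10, 10, 10],
    "nfl_id": [100, 100, 200, 201, 300],
    "player_role": ["Targeted Receiver", "Targeted Receiver",
                    "Defensive Coverage", "Defensive Coverage", "Passer"],
})
output_df = pd.DataFrame({
    "game_id": [1, 1, 1], "play_id": [10, 10, 10],
    "nfl_id": [100, 200, 200], "frame_id": [1, 1, 2],
})
mapping = target_defender_map(input_df, output_df)
print(mapping)
```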

---

## Step 10 — List fields to reconstruct and how

**Aim:** finalize the reconstruction checklist so future work is clear.
**Do:** create a markdown table of fields you must derive from `x,y` in output:

* `s` = speed = sqrt((dx/dt)^2 + (dy/dt)^2)
* `vpx, vpy` = velocity components = dx/dt, dy/dt
* `a` = acceleration = dv/dt
* `dir` or heading = atan2(dy, dx)
* `closing radial velocity` = projection of velocity toward ball_land point

**Why:** explicit formulas let you prototype and test consistently.
**Produce:** the table with formula lines and note to compute on **smoothed** coordinates (not raw).
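
The formulas above can be prototyped with finite differences, as in the sketch below. Several choices here are assumptions rather than the notebook's settled method: a 10 Hz frame rate, `a` taken as the rate of change of speed, heading in the mathematical convention (degrees counter-clockwise from the +x axis, which may differ from the dataset's `dir` convention), and raw rather than smoothed coordinates.

```python
import numpy as np
import pandas as pd

def add_kinematics(df, frame_rate=10.0):
    """Finite-difference kinematics per player-play; needs x, y, ball_land_x, ball_land_y."""
    keys = ["game_id", "play_id", "nfl_id"]
    df = df.sort_values(keys + ["frame_id"]).copy()
    # Velocity components: position change per frame times frames per second.
    df["vpx"] = df.groupby(keys)["x"].diff() * frame_rate
    df["vpy"] = df.groupby(keys)["y"].diff() * frame_rate
    df["s"] = np.hypot(df["vpx"], df["vpy"])
    # Assumption: acceleration is the rate of change of speed (a scalar).
    df["a"] = df.groupby(keys)["s"].diff() * frame_rate
    df["heading_deg"] = np.degrees(np.arctan2(df["vpy"], df["vpx"]))
    # Closing velocity: velocity projected onto the unit vector toward the landing point.
    rx = df["ball_land_x"] - df["x"]
    ry = df["ball_land_y"] - df["y"]
    dist = np.hypot(rx, ry)
    df["closing_v"] = (df["vpx"] * rx + df["vpy"] * ry) / dist.where(dist > 0)
    return df

# Toy player moving 1 yard per frame straight toward a landing point at (10, 0).
frames = pd.DataFrame({
    "game_id": 1, "play_id": 10, "nfl_id": 100,
    "frame_id": [1, 2, 3], "x": [0.0, 1.0, 2.0], "y": [0.0, 0.0, 0.0],
    "ball_land_x": 10.0, "ball_land_y": 0.0,
})
kin = add_kinematics(frames)
print(kin[["frame_id", "vpx", "vpy", "s", "a", "heading_deg", "closing_v"]])
```

The first frame of each player-play has no predecessor, so its derived values are `NaN`; positive `closing_v` means the player is moving toward the landing point.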

---

## Step 11 — Small statistical checks for distributions

**Aim:** get baseline stats to detect outliers and guide smoothing thresholds.
**Do:** compute distribution summaries for a sample of reconstructed-like values (or for `s`/`a` if present in input file): mean, median, max. Note plausible physical bounds (max speed ~ 11–12 yd/s).
**Why:** helps choose smoothing parameters and speed/accel caps.
**Produce / Check:** histograms for speed & acceleration; flag any extreme outliers for inspection.
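
A minimal sketch of the summary-and-flag step for speed. The 12 yd/s cap is the upper end of the range quoted in this step and is only a starting threshold; `speed_summary` and the sample values are made up.

```python
import pandas as pd

MAX_PLAUSIBLE_SPEED = 12.0  # yd/s; upper end of the range quoted in this step

def speed_summary(speed):
    """Return mean/median/max plus the values above the plausibility cap."""
    stats = speed.agg(["mean", "median", "max"])
    outliers = speed[speed > MAX_PLAUSIBLE_SPEED]
    return stats, outliers

speed = pd.Series([3.0, 5.0, 7.0, 25.0], name="s")
stats, outliers = speed_summary(speed)
print(stats)
print("flagged rows:", outliers.index.tolist())
```

The same pattern applies to acceleration with its own cap, and `speed.hist(bins=50)` gives the histogram asked for above.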

---

## Step 12 — Produce the “Data Schema & Time Logic Summary” (deliverable)

**Aim:** create the one-page reference describing what you have and the reconstruction needs.
**Do:** write a single markdown cell with:

* Dataset files used
* Key fields present in each file
* The throw-window definition (output frames = ball-in-air)
* Fields to reconstruct and formulas (from Step 10)
* Integrity checks performed and their results (Step 5 & 7)

**Why:** this is your official checkpoint before reconstruction; include it in `reports/00_data_availability.md`.
**Produce:** one-page summary saved to repo.

---

## Step 13 — Save a small merged sample for next notebook

**Aim:** prepare a tiny working file for reconstruction prototyping.
**Do:** merge one or two fully validated plays (on keys `game_id,play_id,nfl_id`) into a small parquet `data/interim/merged_sample.parquet` containing both input columns (player attributes + ball_land) and output frames for those plays.
**Why:** speeds up development and avoids repeatedly reloading massive files.
**Produce:** `merged_sample.parquet` (2–5 plays).
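
One way to build the merged sample is sketched below. The helper name and the default list of per-player columns are assumptions (the column names themselves appear in earlier steps), and the toy frames are made up. The parquet write is left as a comment because it needs `pyarrow` or `fastparquet` installed.

```python
import pandas as pd

KEYS = ["game_id", "play_id", "nfl_id"]

def build_merged_sample(input_df, output_df, plays, static_cols=None):
    """Attach per-player input attributes to the output frames of the chosen plays."""
    if static_cols is None:
        # Assumed per-player fields; only those present in input_df are used.
        static_cols = ["player_role", "player_side", "num_frames_output",
                       "ball_land_x", "ball_land_y"]
    cols = [c for c in static_cols if c in input_df.columns]
    players = input_df[KEYS + cols].drop_duplicates(KEYS)
    wanted = pd.DataFrame(plays, columns=["game_id", "play_id"])
    sample = output_df.merge(wanted, on=["game_id", "play_id"])
    # many_to_one: several output frames per player, one attribute row per player.
    return sample.merge(players, on=KEYS, how="left", validate="many_to_one")

input_df = pd.DataFrame({
    "game_id": [1, 1, 1], "play_id": [10, 10, 11], "nfl_id": [100, 100, 100],
    "frame_id": [1, 2, 1], "x": [30.0, 31.0, 50.0], "y": [20.0, 20.5, 10.0],
    "player_role": ["Targeted Receiver"] * 3,
    "num_frames_output": [2, 2, 1],
    "ball_land_x": [40.0, 40.0, 60.0], "ball_land_y": [25.0, 25.0, 12.0],
})
output_df = pd.DataFrame({
    "game_id": [1, 1, 1], "play_id": [10, 10, 11], "nfl_id": [100, 100, 100],
    "frame_id": [1, 2, 1], "x": [32.0, 33.0, 51.0], "y": [21.0, 21.5, 10.5],
})
sample = build_merged_sample(input_df, output_df, plays=[(1, 10)])
print(sample)
# sample.to_parquet("data/interim/merged_sample.parquet", index=False)
```

Selecting the per-player columns explicitly avoids name clashes between pre-throw and post-throw `x`, `y`, and `frame_id`.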

---

## Step 14 — Quick checklist & next actions

**Aim:** finalize and move to reconstruction with clarity.
**Do:** confirm:

* All uniqueness checks passed or logged.
* You have sample plays and `merged_sample.parquet`.
* Data Schema & Time Logic Summary saved.
* Notebook saved & committed to GitHub.

**Why:** ensures reproducibility and a clear handoff to `02_data_reconstruction.ipynb`.
**Produce:** short todo list for next notebook: smoothing method(s) to test, validation plays, config constants (frame_rate, smoothing windows).

---

### Small code hints (copy-paste friendly)

* Uniqueness check example:

```python
assert not output_df.duplicated(['game_id','play_id','nfl_id','frame_id']).any()
```

* Max frame vs num_frames_output:

```python
# `merged` = output frames joined with input's num_frames_output on (game_id, play_id, nfl_id)
check = merged.groupby(['game_id','play_id','nfl_id']).agg(max_frame=('frame_id','max'), expected=('num_frames_output','first'))
(check['max_frame']==check['expected']).value_counts()
```

* Simple velocity estimate (for prototyping):

```python
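keys = ['game_id','play_id','nfl_id']
df = df.sort_values(keys + ['frame_id'])
df['dx'] = df.groupby(keys)['x'].diff()  # difference within each player-play, not across them
df['dy'] = df.groupby(keys)['y'].diff()
df['s_raw'] = np.sqrt(df['dx']**2 + df['dy']**2) * frame_rate  # frame_rate in Hz, e.g. 10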
```

---

If you follow these steps, at the end of this notebook you will have:

* A crystal-clear understanding of what data exists and what must be reconstructed,
* Validation that IDs and frames align,
* A saved small merged sample for fast prototyping,
* The one-page Data Schema & Time Logic Summary that documents the decisions.



To push changes to your GitHub repository, you need to set up Git in your notebook environment.

In [6]:
# Install Git if it's not already installed
!apt-get update && apt-get install -y git

# Configure Git with your name and email
!git config --global user.name "muma005"
!git config --global user.email "2307556@students.kcau.ac.ke"

# Use Colab's userdata to access the stored secret
from google.colab import userdata

# Get the GitHub token from Secret Manager
GITHUB_TOKEN = userdata.get('GITHUB_TOKEN')

# Replace with your repository details
YOUR_GITHUB_USERNAME = "muma005"  # Replace with your GitHub username
YOUR_REPOSITORY_NAME = "nfl-ball-flight-analysis" # Replace with your repository name

# Construct the URL with the token for authentication
# This format is username:token@github.com/username/repo.git
# Using the username is optional here, the token is what authenticates.
# You can use any placeholder username, like your GitHub username.
repo_url = f"https://{YOUR_GITHUB_USERNAME}:{GITHUB_TOKEN}@github.com/{YOUR_GITHUB_USERNAME}/{YOUR_REPOSITORY_NAME}.git"


# Clone the repository using the authenticated URL built above
!git clone {repo_url}

# Change directory to the cloned repository
%cd {YOUR_REPOSITORY_NAME}

# Add your changes
!git add .

# Commit your changes (only if there are changes)
# Check if there are changes to commit first to avoid error
!git status --porcelain | grep -q . && git commit -m "changes to notebook" || echo "No changes to commit"

# Push your changes
!git push origin master # or the branch you are working on

Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 https://cli.github.com/packages stable InRelease
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:4 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy InR

**Important:**

*   Replace the `user.name` and `user.email` values, `YOUR_GITHUB_USERNAME`, and `YOUR_REPOSITORY_NAME` with your actual information.
*   Make sure you have the necessary permissions to push to the repository. The cell above reads a personal access token from the Colab secret `GITHUB_TOKEN`; SSH keys are an alternative.
*   If you are not working on the `master` branch, replace `master` with the name of your branch in the `git push` command.
