# 🏌️‍♂️ Golf Performance Analytics: Governed Data Preparation Notebook

**Author:** Mike Phillips  
**Program:** NewForce WV Data Analytics Bootcamp (July–December 2025)  
**Notebook Scope:** Phases 1–3: *Business Understanding → Data Understanding → Data Preparation*  
**Environment:** `golfcapstone` (Conda)  
**Last Updated:** 2025-11-07  

---

## 📘 Purpose

This notebook is a **governed analytics engineering pipeline** that turns 15+ years of GolfShot round- and shot-level history into clean, validated, analysis-ready data.

It is designed the way a data & analytics consultancy would deliver it:

- every step declares its inputs,
- every transformation is logged,
- every assumption is recorded,
- nothing sensitive leaves the local machine,
- and the whole run can be reproduced later.

You’re not just analyzing golf; you’re operating a traceable data product.

---

## 🧭 Phase Overview

### Phase 1 – Setup & Governance Foundations
- Initialize project paths (`data/`, `deliverables/`, `reference/`).
- Load secrets from `.env` (e.g. Golf Course API).
- Define global governance containers: `STEP_LOG`, `VALIDATION_LOG`, `ASSUMPTIONS_LOG`, `TRANSFORM_LOG`, `DATA_DICTIONARIES`.
- Register helper utilities:  
  - `log_step(...)`  
  - `validate_columns(...)`  
  - `record_assumption(...)`  
  - `track_transform(...)`  
  - `generate_data_dictionary(...)`  
  - `render_governance_summary(...)`
- Log environment + run metadata so downstream exports are traceable.

### Phase 2 – Data Understanding
- Load GolfShot export and confirm schema.  
- Check GPS availability, scoring completeness, and vendor yardage consistency.  
- Add GPS-derived shot distance (`yardage_calc`), error metrics, and `yardage_suspect` flags.  
- Log tolerance assumptions for downstream modeling.  
- Capture all of the above in governance logs (no data dropped, just labeled).

### Phase 3 – Data Preparation
**3.1 Round-Level Validity & Features**
- Filter out non-scoring rows.  
- Count holes per round and flag partial rounds.  
- Keep only complete 9- or 18-hole rounds.  
- Sequence rounds per player and per player×course.  
- Derive time, calendar, and seasonal context.  
- Normalize naming (`round_holes_scored`, `round_is_partial`, etc.).  
- Close out with a round-level checkpoint.
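
The round-level steps above can be sketched on toy data as follows. This is a minimal illustration, not the notebook's actual cell: `round_holes_scored` and `round_is_partial` are the names listed above, while `player`, `round_id`, `hole_number`, `hole_score`, and the "score of 0 means not scored" rule are assumptions made for the example.

```python
import pandas as pd

# Toy hole-level data. Column names other than the two round_* outputs are
# illustrative assumptions, not the notebook's confirmed schema.
holes = pd.DataFrame({
    "player": ["A"] * 14,
    "round_id": ["r1"] * 9 + ["r2"] * 5,
    "hole_number": list(range(1, 10)) + list(range(1, 6)),
    "hole_score": [4, 5, 3, 4, 4, 6, 5, 4, 3, 5, 4, 0, 4, 5],
})

# 1. Filter out non-scoring rows (assumed here: a score of 0 means "not scored").
scored = holes[holes["hole_score"] > 0].copy()

# 2. Count distinct scored holes per round and broadcast the count to each row.
scored["round_holes_scored"] = (
    scored.groupby(["player", "round_id"])["hole_number"].transform("nunique")
)

# 3. Flag partial rounds: anything that is not exactly 9 or 18 holes.
scored["round_is_partial"] = ~scored["round_holes_scored"].isin([9, 18])

# 4. Keep only complete rounds.
complete = scored[~scored["round_is_partial"]]
```

Using `transform` keeps the data at hole grain, so the flag can be inspected before any rows are removed.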

**3.2 Hole-Level Validity & Features**
- Derive hole context (par bucket, strokes over par, score name).  
- Derive putting context (putts over expected, 3+ putts, GIR 3+ putts).  
- Derive outcome quality (scramble opportunity/success, wasted GIR, chip-ins, situation tags).
- Derive hole GPS coordinates based on shot-level data centroids. 
- Close out with a hole-level checkpoint to ensure schema stability for shot-level work.
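
As a rough sketch of the hole-context and putting features listed above, the helpers below derive strokes over par, a score name, and two putting flags. The label vocabulary and the function names are assumptions for illustration; the notebook's own labels may differ.

```python
# Assumed naming convention for score labels (hypothetical, for illustration).
SCORE_NAMES = {
    -3: "albatross",
    -2: "eagle",
    -1: "birdie",
    0: "par",
    1: "bogey",
    2: "double bogey",
}


def strokes_over_par(hole_score: int, par: int) -> int:
    """Signed difference between the score and par (negative is under par)."""
    return hole_score - par


def score_name(hole_score: int, par: int) -> str:
    """Map a hole score to a label; a score of 1 is always a hole-in-one."""
    if hole_score == 1:
        return "hole-in-one"
    diff = strokes_over_par(hole_score, par)
    if diff >= 3:
        return "triple bogey or worse"
    return SCORE_NAMES.get(diff, "other")


def is_three_putt_plus(putts: int) -> bool:
    """True when the hole took three or more putts."""
    return putts >= 3


def is_gir_three_putt_plus(gir: bool, putts: int) -> bool:
    """True when a green in regulation was followed by three or more putts."""
    return gir and is_three_putt_plus(putts)
```

For example, `score_name(3, 4)` gives `"birdie"` and `is_gir_three_putt_plus(True, 3)` gives `True`.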

**3.3 Club-Level Validity & Features**
- Build `player_club_profile` from actual shots: counts, percents, p50/p65/p80 planning distances.  
- Build GPS-based dispersion per player×club×hole.  
- Roll it up to club-level dispersion and join back to the profile.  
- Close out club-level work in this phase so Phase 4 can consume it directly.
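
One way the p50/p65/p80 planning distances could be computed is a grouped quantile, sketched below on toy shots. The column names `player`, `club`, `yardage`, `shot_count`, and `shot_pct` are assumptions for the example; only `player_club_profile` and the percentile levels come from the list above.

```python
import pandas as pd

# Toy shot-level data (hypothetical values and column names).
shots = pd.DataFrame({
    "player": ["A"] * 8,
    "club": ["7i"] * 5 + ["PW"] * 3,
    "yardage": [140, 145, 150, 155, 160, 100, 110, 120],
})

grouped = shots.groupby(["player", "club"])["yardage"]

# Quantiles per player x club; unstack turns the quantile level into columns
# ordered 0.50, 0.65, 0.80, which are then renamed positionally.
quantiles = grouped.quantile([0.50, 0.65, 0.80]).unstack()
quantiles.columns = ["p50", "p65", "p80"]

# Attach shot counts and each club's share of the player's shots.
player_club_profile = quantiles.join(grouped.count().rename("shot_count")).reset_index()
player_club_profile["shot_pct"] = (
    player_club_profile["shot_count"]
    / player_club_profile.groupby("player")["shot_count"].transform("sum")
)
```

pandas interpolates linearly between observations by default, so with few shots per club the higher percentiles are sensitive to single outliers.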

**3.4 Facility-Level Validity & Features**
- Rebuild a single `facilities` table from observed rounds.  
- Derive shot-based centroids and promote to canonical geo columns.  
- Build IANA time zones from coordinates.  
- Cache GolfCourseAPI lookups to `facilities_api_cache.csv`.  
- Leave 3.4.6–3.4.9 (JSON parsing, USGA cross-check, final merge-back) **on hold** for a later pass.  
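
The caching step might look something like the sketch below. It only illustrates the read-through pattern: the `fetch` callable stands in for the real API request, and the cache columns `facility_name` and `response_json` are hypothetical. Only the file name `facilities_api_cache.csv` comes from the list above.

```python
import csv
import tempfile
from pathlib import Path
from typing import Callable, Dict


def load_cache(path: Path) -> Dict[str, str]:
    """Read the cache CSV into a dict; an absent file is an empty cache."""
    if not path.exists():
        return {}
    with path.open(newline="", encoding="utf-8") as fh:
        return {row["facility_name"]: row["response_json"] for row in csv.DictReader(fh)}


def cached_lookup(name: str, cache_path: Path, fetch: Callable[[str], str]) -> str:
    """Return the cached response for `name`, calling `fetch` only on a miss."""
    cache = load_cache(cache_path)
    if name in cache:
        return cache[name]
    response = fetch(name)
    is_new_file = not cache_path.exists()
    with cache_path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        if is_new_file:
            writer.writerow(["facility_name", "response_json"])
        writer.writerow([name, response])
    return response


# Stand-in for the real API call (hypothetical); records how often it is hit.
calls = []


def fake_fetch(name: str) -> str:
    calls.append(name)
    return '{"facility": "%s"}' % name


cache_file = Path(tempfile.mkdtemp()) / "facilities_api_cache.csv"
first = cached_lookup("Example Golf Club", cache_file, fake_fetch)
second = cached_lookup("Example Golf Club", cache_file, fake_fetch)
```

The second call is served from the CSV, so re-running the notebook does not repeat requests for facilities already looked up.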

**3.5 Governance Close-Out (Data Preparation)**
- Snapshot key working tables (at minimum: `golf_valid`, `facilities`, `player_club_profile`).  
- Export all governance logs (steps, validations, assumptions, transforms).  
- Export the consolidated data dictionary for everything created so far.  
- Save data in `data/processed` for Phases 4-6, and save governed public exports in `deliverables/` for documentation in GitHub.
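
Because the governance logs are plain lists of dicts, exporting them could be as simple as the sketch below. The function name `export_log`, the output file name, and the choice to JSON-encode list and dict values are assumptions for illustration, and the demo writes to a temporary folder rather than `deliverables/`.

```python
import csv
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def export_log(entries: List[Dict[str, Any]], path: Path) -> int:
    """Write a list-of-dicts log to CSV and return the number of rows written."""
    if not entries:
        return 0
    # Union of keys across entries, in first-seen order.
    fields = list(dict.fromkeys(key for entry in entries for key in entry))
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields)
        writer.writeheader()
        for entry in entries:
            # Lists and dicts (e.g. new_cols, extra_info) are JSON-encoded
            # so they can be parsed back later.
            writer.writerow({
                key: json.dumps(value) if isinstance(value, (list, dict)) else value
                for key, value in entry.items()
            })
    return len(entries)


# Demo entry shaped like a TRANSFORM_LOG record.
demo_log = [
    {"ts": "2025-11-16 19:33:12", "stage_name": "2.1_raw_import", "new_cols": ["row_id"]},
]
out_dir = Path(tempfile.mkdtemp())
n_written = export_log(demo_log, out_dir / "transform_log.csv")
```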

---

## 🔒 Governance & Data Integrity

| Control | Description |
|--------|-------------|
| **Schema validation** | Every major cell calls `validate_columns(...)`; execution halts on mismatch. |
| **Transformation logging** | `track_transform(...)` records row deltas, new/dropped columns, and notes. |
| **Step logging** | `log_step(...)` captures the narrative: what ran, why, inputs, outputs, and key metrics. |
| **Assumption catalog** | `record_assumption(...)` documents business rules (e.g. “only 9/18-hole rounds are valid”). |
| **Data dictionaries** | Every new feature set can be registered and later exported as `data_dictionary.csv`. |
| **Idempotency** | Cells are written so re-running them does not create duplicate columns or bad joins. |
| **Local sensitivity** | Raw and private data stay under `data/raw/` and `data/private/`, which are git-ignored. |

This is the backbone that makes the rest of the project auditable.
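
To make the idempotency control concrete, here is one possible pattern for a re-runnable merge: drop any previously derived column before joining, and assert that the row count is unchanged. The function and the `round_id` and `hole_number` columns are hypothetical, shown only to illustrate the idea.

```python
import pandas as pd


def add_round_hole_count(df: pd.DataFrame) -> pd.DataFrame:
    """Attach round_holes_scored via a merge that is safe to re-run."""
    derived = ["round_holes_scored"]
    # Drop copies left by a previous run so the merge cannot create _x/_y columns.
    base = df.drop(columns=[c for c in derived if c in df.columns])
    counts = base.groupby("round_id", as_index=False).agg(
        round_holes_scored=("hole_number", "nunique")
    )
    # validate= raises if the right side has duplicate keys (a "bad join").
    out = base.merge(counts, on="round_id", how="left", validate="many_to_one")
    assert len(out) == len(df), "join changed the row count"
    return out


holes = pd.DataFrame({"round_id": ["r1", "r1", "r2"], "hole_number": [1, 2, 1]})
once = add_round_hole_count(holes)
twice = add_round_hole_count(once)  # second run yields the same frame
```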

---

## 📦 What This Notebook Produces Right Now

- A cleaned, governed hole-level fact table: **`golf_valid`**
- A governed facility dimension: **`facilities`**
- A per-player, per-club profile (distance + dispersion): **`player_club_profile`**
- A full set of governance logs:
  - `STEP_LOG`
  - `VALIDATION_LOG`
  - `ASSUMPTIONS_LOG`
  - `TRANSFORM_LOG`
  - plus a consolidated **data dictionary**

---

## 🧱 How to Use This Notebook

1. Run Section 1 top-to-bottom to initialize paths, secrets, and governance helpers.  
2. Run Sections 2 and 3 in order; each section expects the previous one’s columns to exist.  
3. At the end of Phase 3, run the close-out cell to write working tables to `data/processed/` and the governance logs and public exports to `deliverables/`.  
4. Take those CSV/XLSX exports into Tableau, Power BI, or a modeling notebook for Phase 4+.

---

## 🏁 Next Steps (Outside This Notebook)

- Phase 4: visual analytics, performance dashboards, course comparisons.  
- Phase 5: improvement ideas (dispersion triangles, practice prescriptions).  
- Phase 6: final control/deployment, which confirms that the exports from this notebook are the system of record.

*This notebook is the system-of-record builder. Everything else consumes what you create here.*


## ======================================================
## 1.0 Phase 1 – Business Understanding
## ======================================================

**PURPOSE**  
Establish the technical, governance, and reproducibility foundation for the Golf Performance Analytics pipeline before any data ingestion or analysis.  
This phase answers: *“What environment, rules, and audit structures will ensure our project remains reproducible and professionally governed?”*

**STEPS IN THIS PHASE**  
- **1.1 Setup & Configuration**  
  - Imports all required Python libraries (data, geospatial, and visualization)  
  - Loads environment variables and API keys securely via `.env`  
  - Defines canonical project paths (`/data`, `/private`, `/deliverables`, etc.)  
  - Initializes all governance containers (`STEP_LOG`, `ASSUMPTIONS_LOG`, `TRANSFORM_LOG`, etc.)  
  - Defines helper utilities for logging, validation, lineage tracking, business rule checks, data dictionaries, and export management  
  - Captures the first step log and assumption to formally document the setup environment  
- **1.2 Governance Reporting Utilities (view layer)**  
  - Defines the function `render_governance_summary(current_phase)` for interactive HTML summaries of all governance logs  
  - Enables filtered reviews by phase (e.g., “2.” or “3.”) to simplify QA  
  - Acts as the notebook’s built-in governance dashboard, allowing instructors, reviewers, or auditors to see all assumptions, validations, and transformations at a glance  

**WHY IT MATTERS**  
Phase 1 is the **governance infrastructure layer** of the entire pipeline.  
It ensures that every downstream step (ingest, cleaning, enrichment, analysis) automatically captures:  
- **What changed**  
- **Why it changed**  
- **When it changed**  
- **Who or what made the decision**

This transforms the notebook from an exploratory script into a **traceable, auditable, and reproducible analytics system**, a standard expected in professional data engineering, consulting, and continuous improvement environments.

**OUTPUTS OF PHASE 1**  
- Initialized governance framework and directory structure  
- API key loaded from `.env` (`GOLF_API_KEY`); presence is checked, but no API request is made in this phase  
- Verified environment setup  
- Live governance summary function available via:  
  ```python
  render_governance_summary("1.")
  ```


### ======================================================
### 1.1 Setup & Configuration
### ======================================================

**INPUTS**  
- `.env` file (securely excluded from Git) containing `GOLF_API_KEY`  
- Conda environment file `golfcapstone.yml` specifying dependency versions  

**WHAT THIS STEP DOES**  
- Imports all required core, analytical, and geospatial libraries  
- Loads environment variables and validates API key presence  
- Defines canonical project paths (`DATA_PATH`, `PRIVATE_PATH`, etc.)  
- Creates critical folders if they don’t exist  
- Establishes pandas display and numeric precision standards  
- Initializes all governance containers (`STEP_LOG`, `ASSUMPTIONS_LOG`, etc.)  
- Defines governance helper utilities for consistent audit logging:  
  - `log_step()`: capture actions with context and timestamp  
  - `record_assumption()`: track methodological assumptions  
  - `validate_columns()`: enforce schema gates  
  - `track_transform()`: capture row/column lineage deltas  
  - `generate_data_dictionary()`: define schema metadata
- Adds a built-in `render_governance_summary()` utility for human-readable, HTML governance summaries directly inside notebooks  
- Logs the environment setup to `STEP_LOG` and registers assumptions  

**WHY IT MATTERS**  
This is the **governed environment bootstrapping** step.  
By initializing logging, validation, and audit tools at the top of the pipeline, every subsequent cell (2.x → 5.x) automatically inherits the same governance rigor.  
This ensures reproducibility, traceability, and readiness for DMAIC’s **Control** phase, all from the first line of code.

**OUTPUTS**  
- API key presence verified and environment initialized  
- Canonical folder structure created (data, deliverables, private, etc.)  
- Governance containers and helper utilities registered  
- Step logged to `STEP_LOG` with metadata (versions, paths, API key presence)  
- Initial assumption recorded (API key usage for course enrichment)  
- Ready-to-use `render_governance_summary()` function for quality reviews


In [1]:
# ======================================================
# 1.1 Setup & Configuration
# ======================================================

# ------------------------------------------------------
# 1. Imports (core, geospatial, http, display)
# ------------------------------------------------------
import os
import json
import math
import time
import logging
import warnings
from pathlib import Path
from datetime import datetime
from typing import List, Dict, Any, Optional

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

from dotenv import load_dotenv
from geopy.distance import geodesic
from geopy.geocoders import Nominatim, ArcGIS
from geopy.extra.rate_limiter import RateLimiter
from timezonefinder import TimezoneFinder
from math import cos, radians
from IPython.display import display, HTML

# ------------------------------------------------------
# 2. Load environment / secrets
# ------------------------------------------------------
load_dotenv()
GOLF_API_KEY = os.getenv("GOLF_API_KEY")
if not GOLF_API_KEY:
    raise RuntimeError(
        "❌ GOLF_API_KEY not found. Create a .env file in the project root "
        "(see .env.example) and add your API key."
    )

# ------------------------------------------------------
# 3. Canonical project paths
# ------------------------------------------------------
ROOT = Path.cwd().parent           # repo root (one level above /notebooks)
DATA_PATH = ROOT / "data"
CACHE_PATH = DATA_PATH / "raw"
PROCESSED_PATH = DATA_PATH / "processed"    # folder for final exports
PRIVATE_PATH = DATA_PATH / "private"
OUTPUT_PATH = ROOT / "deliverables"   # folder for final governed outputs
NOTEBOOKS_PATH = ROOT / "notebooks"

# ensure required folders exist
for p in [DATA_PATH, CACHE_PATH, PROCESSED_PATH, PRIVATE_PATH, OUTPUT_PATH]:
    p.mkdir(parents=True, exist_ok=True)

# ------------------------------------------------------
# 4. Display / pandas options (reproducibility / readability)
# ------------------------------------------------------
pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 120)
pd.set_option("display.float_format", lambda x: f"{x:,.3f}")

# ------------------------------------------------------
# 5. Global governance containers
# ------------------------------------------------------
STEP_LOG: List[Dict[str, Any]] = []
VALIDATION_LOG: List[Dict[str, Any]] = []
ASSUMPTIONS_LOG: List[Dict[str, Any]] = []
TRANSFORM_LOG: List[Dict[str, Any]] = []
DATA_DICTIONARIES: List[pd.DataFrame] = []

# ------------------------------------------------------
# 6. Helper Utilities (governance / auditability)
# ------------------------------------------------------
def _now_ts() -> str:
    """Return current timestamp as string for consistent logging."""
    return datetime.now().strftime("%Y-%m-%d %H:%M:%S")


# 6.1 log_step ---------------------------------------------------------------
def log_step(step_name: str,
             description: str = "",
             inputs: Optional[List[str]] = None,
             outputs: Optional[List[str]] = None,
             df: Optional[pd.DataFrame] = None,
             extra_info: Optional[Dict[str, Any]] = None):
    entry = {
        "ts": _now_ts(),
        "step_name": step_name,
        "description": description,
        "inputs": inputs if inputs else [],
        "outputs": outputs if outputs else [],
        "rows": df.shape[0] if df is not None else None,
        "cols": df.shape[1] if df is not None else None,
        "extra_info": extra_info if extra_info else {},
    }
    STEP_LOG.append(entry)

    # console echo
    print(f"✅ {step_name} @ {entry['ts']}")
    if df is not None:
        print(f"   DataFrame shape: {entry['rows']} rows × {entry['cols']} cols")
    if description:
        print(f"   {description}")
    if extra_info:
        for k, v in extra_info.items():
            print(f"   {k}: {v}")


# 6.2 validate_columns -------------------------------------------------------
def validate_columns(df: pd.DataFrame,
                     required_cols: List[str],
                     context_name: str = "(unnamed step)"):
    missing = [c for c in required_cols if c not in df.columns]
    entry = {
        "ts": _now_ts(),
        "context": context_name,
        "required_cols": required_cols,
        "missing_cols": missing,
        "passed": (len(missing) == 0),
        "rows": df.shape[0],
        "cols": df.shape[1],
    }
    VALIDATION_LOG.append(entry)

    if missing:
        raise KeyError(
            f"[{context_name}] Missing expected column(s): {missing}. "
            "Upstream step may have failed or schema changed."
        )


# 6.3 record_assumption ------------------------------------------------------
def record_assumption(text: str,
                      rationale: str,
                      impact_area: str):
    entry = {
        "ts": _now_ts(),
        "assumption": text,
        "rationale": rationale,
        "impact_area": impact_area,
    }
    ASSUMPTIONS_LOG.append(entry)
    print(f"📌 Assumption logged: {text}  | Impact: {impact_area}")


# 6.4 track_transform --------------------------------------------------------
def track_transform(stage_name: str,
                    df_before: Optional[pd.DataFrame],
                    df_after: pd.DataFrame,
                    notes: str = "",
                    new_cols: Optional[List[str]] = None,
                    dropped_cols: Optional[List[str]] = None):
    rows_before = df_before.shape[0] if df_before is not None else None
    rows_after = df_after.shape[0]
    cols_before = df_before.shape[1] if df_before is not None else None
    cols_after = df_after.shape[1]

    if new_cols is None and df_before is not None:
        new_cols = [c for c in df_after.columns if c not in df_before.columns]
    if dropped_cols is None and df_before is not None:
        dropped_cols = [c for c in df_before.columns if c not in df_after.columns]

    pct_change = None
    if rows_before is not None and rows_before != 0:
        pct_change = (rows_after - rows_before) / rows_before

    entry = {
        "ts": _now_ts(),
        "stage_name": stage_name,
        "notes": notes,
        "rows_before": rows_before,
        "rows_after": rows_after,
        "row_delta": (rows_after - rows_before) if rows_before is not None else None,
        "row_pct_change": pct_change,
        "cols_before": cols_before,
        "cols_after": cols_after,
        "new_cols": new_cols if new_cols else [],
        "dropped_cols": dropped_cols if dropped_cols else [],
    }
    TRANSFORM_LOG.append(entry)

    print(f"🔄 Transform logged: {stage_name}")
    if rows_before is not None:
        print(f"   Rows {rows_before} → {rows_after} ({entry['row_delta']} change)")
    if notes:
        print(f"   {notes}")


# 6.5 generate_data_dictionary ----------------------------------------------
def generate_data_dictionary(df: pd.DataFrame,
                             table_name: str,
                             desc_map: Optional[Dict[str, str]] = None,
                             lineage_map: Optional[Dict[str, str]] = None) -> pd.DataFrame:
    rows = []
    for col in df.columns:
        rows.append({
            "table": table_name,
            "column_name": col,
            "dtype": str(df[col].dtype),
            "description": (desc_map.get(col, "") if desc_map else ""),
            "lineage": (lineage_map.get(col, "") if lineage_map else ""),
        })
    dict_df = pd.DataFrame(rows)
    DATA_DICTIONARIES.append(dict_df)
    print(f"📘 Data dictionary generated for table '{table_name}' "
          f"({len(dict_df)} columns).")
    return dict_df


# ------------------------------------------------------
# 7. Governance reporting view (baked in from step 1)
#    NOTE: section 1.2 redefines this function with a fuller column set.
# ------------------------------------------------------
def render_governance_summary(current_phase: Optional[str] = None):
    sections = []

    if len(STEP_LOG) > 0:
        df_steps = pd.DataFrame(STEP_LOG).copy()
        df_steps["phase"] = df_steps["step_name"].str.extract(r"^(\d+\.\d+)")
        if current_phase:
            # na=False: step names without a numeric prefix count as non-matches
            df_steps = df_steps[df_steps["phase"].str.startswith(current_phase, na=False)]
        sections.append("<h3>📘 Step Log</h3>")
        sections.append(
            df_steps.loc[:, ["ts", "phase", "step_name", "description"]].to_html(index=False)
        )

    if len(TRANSFORM_LOG) > 0:
        df_trans = pd.DataFrame(TRANSFORM_LOG).copy()
        df_trans["phase"] = df_trans["stage_name"].str.extract(r"^(\d+\.\d+)")
        if current_phase:
            # na=False: stage names without a numeric prefix count as non-matches
            df_trans = df_trans[df_trans["phase"].str.startswith(current_phase, na=False)]
        df_trans["row_delta"] = df_trans["rows_after"] - df_trans["rows_before"]
        sections.append("<h3>🔄 Transform Log</h3>")
        sections.append(
            df_trans.loc[:, ["ts", "phase", "stage_name", "notes", "row_delta", "new_cols"]].to_html(index=False)
        )

    if len(VALIDATION_LOG) > 0:
        df_val = pd.DataFrame(VALIDATION_LOG).copy()
        if current_phase:
            df_val = df_val[df_val["context"].str.startswith(current_phase)]
        sections.append("<h3>✅ Validation Log</h3>")
        sections.append(
            df_val.loc[:, ["ts", "context", "passed", "missing_cols"]].to_html(index=False)
        )

    if len(ASSUMPTIONS_LOG) > 0:
        df_ass = pd.DataFrame(ASSUMPTIONS_LOG).copy()
        sections.append("<h3>📌 Assumptions</h3>")
        sections.append(
            df_ass.loc[:, ["ts", "assumption", "rationale", "impact_area"]].to_html(index=False)
        )

    html_out = "<br>".join(sections) if sections else "<p>No governance entries yet.</p>"
    display(HTML(html_out))
    print("Governance summary rendered.")

# ------------------------------------------------------
# 8. Environment banner / audit log of setup itself
# ------------------------------------------------------
run_info = {
    "project_root": str(ROOT),
    "data_path": str(DATA_PATH),
    "private_data_path": str(PRIVATE_PATH),
    "output_path": str(OUTPUT_PATH),
    "api_key_loaded": bool(GOLF_API_KEY),
    "pandas_version": pd.__version__,
    "matplotlib_version": matplotlib.__version__,
    "numpy_version": np.__version__,
    "conda_env": "golfcapstone",
}

log_step(
    step_name="1.1 Setup & Configuration",
    description="Environment, paths, secrets, governance containers, and helper utilities initialized.",
    inputs=[],
    outputs=[],
    df=None,
    extra_info={
        **run_info,
        "phase": "1.1",
        "category": "setup",
    },
)

record_assumption(
    text="API key will be used to enrich course metadata (slope/rating/etc.).",
    rationale="Needed for course difficulty context in analysis.",
    impact_area="Course enrichment / difficulty-adjusted scoring",
)


✅ 1.1 Setup & Configuration @ 2025-11-16 19:33:12
   Environment, paths, secrets, governance containers, and helper utilities initialized.
   project_root: C:\Users\micha\onedrive\documents\newforce\golf-capstone
   data_path: C:\Users\micha\onedrive\documents\newforce\golf-capstone\data
   private_data_path: C:\Users\micha\onedrive\documents\newforce\golf-capstone\data\private
   output_path: C:\Users\micha\onedrive\documents\newforce\golf-capstone\deliverables
   api_key_loaded: True
   pandas_version: 2.3.3
   matplotlib_version: 3.9.4
   numpy_version: 2.3.4
   conda_env: golfcapstone
   phase: 1.1
   category: setup
📌 Assumption logged: API key will be used to enrich course metadata (slope/rating/etc.).  | Impact: Course enrichment / difficulty-adjusted scoring


### ======================================================
### 1.2 Governance Reporting Utilities (view layer)
### ======================================================

**INPUTS**  
- In-memory governance containers created in **1.1 Setup & Configuration**:  
  `STEP_LOG`, `TRANSFORM_LOG`, `VALIDATION_LOG`, `ASSUMPTIONS_LOG`
- Display utilities from Jupyter/IPython (`display`, `HTML`)

**WHAT THIS STEP DOES**  
- Defines a single convenience function: `render_governance_summary(...)`  
- Produces a human-readable, HTML-formatted view of all governance activity to date  
- Optionally filters by phase (e.g. `"2."` to show only Section 2 steps)  
- Consolidates 4 separate logs into one reviewer-friendly output:
  - **Step Log** → what ran, in what order
  - **Transform Log** → what changed, rows/cols, new fields
  - **Validation Log** → schema gates and pass/fail
  - **Assumptions Log** → decisions we are baking into the analysis
- Does not yet render **Business Rule Checks** (domain-level QA); the function as defined in this step covers only the four logs above
- Prints a confirmation message so notebook users know the utilities were registered

**WHY IT MATTERS**  
Without a view layer, governance is trapped in Python lists and CSV exports.  
This utility turns those raw logs into something a project sponsor, instructor, or peer reviewer can actually read inside the notebook, at any point in the pipeline.  
It also standardizes how we review individual phases (2.x, 3.x) without rewriting reporting code.

**OUTPUTS**  
- Callable function: `render_governance_summary(current_phase: str = None)`  
- Console confirmation: “1.2 Governance reporting utilities registered...”  
- Ready-to-use HTML governance report for interactive QA
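
The phase filter can be illustrated on a toy step log. The sketch below assumes, as the function does, that step names begin with a numeric prefix such as `2.1`. A name without that prefix yields a missing phase, so the example passes `na=False` to treat it as a non-match rather than letting a missing value into the boolean mask. The step names are made up for the demo.

```python
import pandas as pd

# Toy step log; the last name deliberately lacks a numeric prefix.
step_log = pd.DataFrame({
    "step_name": [
        "1.1 Setup & Configuration",
        "2.1 Load Raw GolfShot Data",
        "2.5 Initial Vendor Yardage Validation",
        "ad hoc check",
    ]
})

# Extract the leading "<major>.<minor>" token; non-matching names become NaN.
step_log["phase"] = step_log["step_name"].str.extract(r"^(\d+\.\d+)", expand=False)

# Keep only Phase 2 steps; na=False drops rows whose phase is missing.
phase_2 = step_log[step_log["phase"].str.startswith("2.", na=False)]
```

Passing `"2.5"` instead of `"2."` narrows the view to a single step, which is the filtered-review behavior described above.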


In [2]:
# ======================================================
# 1.2 Governance Reporting Utilities (view layer)
# ======================================================

# ------------------------------------------------------
# 1. Human-readable governance renderer
# ------------------------------------------------------
def render_governance_summary(current_phase: Optional[str] = None):
    """
    Render an HTML summary of all governance logs captured so far.
    - If `current_phase` is provided (e.g. "2." or "2.5"), the view is filtered
      to just those steps/stages whose names start with that phase.
    - Designed for notebook reviewers so they don't have to open each log CSV separately.
    """
    sections = []

    # --------------------------------------------------
    # STEP_LOG (chronological pipeline actions)
    # --------------------------------------------------
    if len(STEP_LOG) > 0:
        df_steps = pd.DataFrame(STEP_LOG).copy()
        df_steps["phase"] = df_steps["step_name"].str.extract(r"^(\d+\.\d+)")
        if current_phase:
            # na=False: step names without a numeric prefix count as non-matches
            df_steps = df_steps[df_steps["phase"].str.startswith(current_phase, na=False)]
        df_steps = df_steps.sort_values("ts")
        sections.append("<h3>📘 Step Log</h3>")
        sections.append(
            df_steps.loc[:, ["ts", "phase", "step_name", "description", "rows", "cols"]]
            .to_html(index=False)
        )

    # --------------------------------------------------
    # TRANSFORM_LOG (lineage, row/col deltas)
    # --------------------------------------------------
    if len(TRANSFORM_LOG) > 0:
        df_trans = pd.DataFrame(TRANSFORM_LOG).copy()
        df_trans["phase"] = df_trans["stage_name"].str.extract(r"^(\d+\.\d+)")
        if current_phase:
            # na=False: stage names without a numeric prefix count as non-matches
            df_trans = df_trans[df_trans["phase"].str.startswith(current_phase, na=False)]
        df_trans["row_delta"] = df_trans["rows_after"] - df_trans["rows_before"]
        df_trans = df_trans.sort_values("ts")
        sections.append("<h3>🔄 Transform Log</h3>")
        sections.append(
            df_trans.loc[
                :,
                [
                    "ts",
                    "phase",
                    "stage_name",
                    "notes",
                    "rows_before",
                    "rows_after",
                    "row_delta",
                    "new_cols",
                    "dropped_cols",
                ],
            ].to_html(index=False)
        )

    # --------------------------------------------------
    # VALIDATION_LOG (schema gates)
    # --------------------------------------------------
    if len(VALIDATION_LOG) > 0:
        df_val = pd.DataFrame(VALIDATION_LOG).copy()
        if current_phase:
            df_val = df_val[df_val["context"].str.startswith(current_phase)]
        df_val = df_val.sort_values("ts")
        sections.append("<h3>✅ Validation Log</h3>")
        sections.append(
            df_val.loc[
                :, ["ts", "context", "required_cols", "missing_cols", "passed"]
            ].to_html(index=False)
        )

    # --------------------------------------------------
    # ASSUMPTIONS_LOG (methodological decisions)
    # --------------------------------------------------
    if len(ASSUMPTIONS_LOG) > 0:
        df_ass = pd.DataFrame(ASSUMPTIONS_LOG).copy()
        df_ass = df_ass.sort_values("ts")
        sections.append("<h3>📌 Assumptions</h3>")
        sections.append(
            df_ass.loc[:, ["ts", "assumption", "rationale", "impact_area"]]
            .to_html(index=False)
        )

    # --------------------------------------------------
    # 2. Render or fallback
    # --------------------------------------------------
    html_out = "<br>".join(sections) if sections else "<p>No governance entries yet.</p>"
    display(HTML(html_out))
    print("Governance summary rendered.")

# lightweight confirmation so the user sees it was defined
print("✅ 1.2 Governance reporting utilities registered. Use render_governance_summary('2.') to view Phase 2.")


✅ 1.2 Governance reporting utilities registered. Use render_governance_summary('2.') to view Phase 2.


## ======================================================
## 2.0 Phase 2 – Data Understanding
## ======================================================

**PURPOSE**  
Establish a governed, analyzable base table from the raw GolfShot vendor export before any cleaning, filtering, or enrichment in Section 3.  
This phase answers: *“What do we actually have, how complete is it, and what problems do we need to control for?”*

**STEPS IN THIS PHASE**  
- **2.1 Load Raw GolfShot Data**  
  - Ingest the vendor CSV as-is  
  - Add a persistent `row_id` for end-to-end lineage  
  - Validate that the supplier schema hasn’t drifted  
- **2.2 Standardize Schema & Datatypes**  
  - Rename vendor columns to internal, snake_case names  
  - Parse timestamps and normalize vendor GIR format  
  - Apply consistent nullable numeric/string dtypes  
  - Establish the internal schema contract for all downstream work  
- **2.3 Initial Data Profiling**  
  - Measure null coverage on critical analytics fields  
  - Count distinct players, facilities, and facility–course combinations  
  - Assess GPS completeness for shot-level analytics  
- **2.4 Core Field Validation & Logical Consistency**  
  - Run domain checks (hole number range, nonpositive hole scores, missing timestamps, implausible round scores)  
  - Document cleanup assumptions for Section 3  
- **2.5 Initial Vendor Yardage Validation**  
  - Recalculate shot distance from GPS  
  - Compare to vendor `yardage` and flag suspect values  
  - Preserve diagnostics on the master `golf` table  
- **2.6 Phase 2 Governance Close-Out**  
  - Capture a terminal profile and data dictionary  
  - Export a snapshot of the governed dataset to `/data/private`  
  - Export all governance logs to `/deliverables/...`  
  - Record a “phase closed” assumption for audit traceability  

**WHY IT MATTERS**  
By the end of Phase 2 we have:  
1. A single, standardized DataFrame `golf` that is traceable back to the vendor export via `row_id`.  
2. Documented data-quality issues (not silently fixed).  
3. Governance evidence (steps, validations, assumptions, business rules) ready to export.  
4. A frozen, auditable baseline that Section 3 (Data Preparation) can safely transform.

**OUTPUTS OF PHASE 2**  
- Governed base table: `golf`  
- Governance artifacts in memory: `STEP_LOG`, `VALIDATION_LOG`, `ASSUMPTIONS_LOG`, `TRANSFORM_LOG`, `DATA_DICTIONARIES`  
- Snapshot export: `golf_clean_snapshot.xlsx`  



### ======================================================
### 2.1 Load Raw GolfShot Data
### ======================================================

**INPUTS**  
- File: `data/raw/golfshot_export_all.csv` (vendor export)  
- Columns expected (pre-standardization):  
  `Name`, `Round DateTime (UTC)`, `Facility Name`, `Course Name`,  
  `Round Score`, `Round GIR`, `Hole Number`, `Hole Score`,  
  `Shot Start Latitude`, `Shot Start Longitude`  

**WHAT THIS STEP DOES**  
- Loads the unmodified vendor CSV export into memory as `golf_raw`  
- Assigns a persistent surrogate key `row_id` for end-to-end traceability  
- Validates the presence of required vendor columns  
- Logs ingestion metadata to the governance framework:  
  - **Transform Log** → establishes lineage baseline (`2.1_raw_import`)  
  - **Assumptions Log** → documents that `row_id` uniquely anchors every record  
  - **Step Log** → records the import and column validation activity  
- Serves as the authoritative raw ingest point (no filtering or mutation)

**WHY IT MATTERS**  
This step formalizes the data hand-off between GolfShot and the analytics pipeline.  
By introducing `row_id` immediately, we ensure that every observation downstream, whether in filtered, enriched, or reshaped form, can be traced back to its original vendor row.  
If GolfShot’s schema changes, this step will fail fast, preventing silent corruption of later transformations.

**OUTPUTS**  
- `golf_raw`: raw vendor export with appended `row_id`  
- Updated governance artifacts:  
  - `STEP_LOG`  
  - `TRANSFORM_LOG`  
  - `ASSUMPTIONS_LOG`  
- Data preview (`golf_raw.head()`) for developer sanity check  
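
The ingest pattern above can be sketched on a tiny in-memory CSV. This is an illustration only: `check_required_columns` is a hypothetical stand-in for the notebook's `validate_columns` helper (defined in Phase 1 and not shown in this section), and the assumption that a missing column raises an error is not confirmed by the source.

```python
import io

import pandas as pd

# Hypothetical two-row export; the real file has many more columns.
SAMPLE_CSV = """Name,Hole Number,Hole Score
Player A,1,7
Player A,2,2
"""


def check_required_columns(df, required_cols, context_name):
    """Raise immediately if any expected vendor column is absent."""
    missing = [c for c in required_cols if c not in df.columns]
    if missing:
        raise KeyError(f"{context_name}: missing columns {missing}")


sample_raw = pd.read_csv(io.StringIO(SAMPLE_CSV))
sample_raw["row_id"] = range(1, len(sample_raw) + 1)

# Passes: both required columns are present.
check_required_columns(sample_raw, ["Name", "Hole Score"], "2.1 sketch")

# A filtered view still carries row_id, so each row maps back to the ingest.
low_scores = sample_raw[sample_raw["Hole Score"] < 5]
traced_ids = low_scores["row_id"].tolist()

# A renamed or dropped vendor column is caught before later steps run.
try:
    check_required_columns(sample_raw, ["Round GIR"], "2.1 sketch")
    schema_ok = True
except KeyError:
    schema_ok = False
```

Because `row_id` is assigned by position, it identifies the same vendor row across runs only while the export's row order stays the same.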


In [3]:
# ======================================================
# 2.1 Load Raw GolfShot Data
# ======================================================

# ------------------------------------------------------
# 1. Locate and load vendor export
# ------------------------------------------------------
raw_csv_path = CACHE_PATH / "golfshot_export_all.csv"

golf_raw = pd.read_csv(raw_csv_path)

# ------------------------------------------------------
# 2. Attach persistent surrogate key (row-level lineage)
# ------------------------------------------------------
# Recreate row_id every time we load the raw file so the ingest is idempotent.
# row_id is positional, so it is stable only while the export's row order is unchanged.
golf_raw["row_id"] = range(1, len(golf_raw) + 1)

# ------------------------------------------------------
# 3. Validate presence of required vendor columns
# ------------------------------------------------------
required_cols_2_1 = [
    "Name",
    "Round DateTime (UTC)",
    "Facility Name",
    "Course Name",
    "Round Score",
    "Round GIR",
    "Hole Number",
    "Hole Score",
    "Shot Start Latitude",
    "Shot Start Longitude",
]

validate_columns(
    golf_raw,
    required_cols=required_cols_2_1,
    context_name="2.1 Load Raw GolfShot Data",
)

# ------------------------------------------------------
# 4. Governance logging
# ------------------------------------------------------
# 4a. Establish lineage baseline at ingest
track_transform(
    stage_name="2.1_raw_import",
    df_before=None,
    df_after=golf_raw,
    notes="Initial GolfShot CSV ingest + deterministic row_id assignment.",
    new_cols=["row_id"],
)

# 4b. Assumption: this is the authoritative raw hand-off
record_assumption(
    text="Each vendor row ingested in 2.1 is treated as authoritative raw input and receives a unique row_id.",
    rationale="We need stable row-level lineage for audit, QA, and troubleshooting after filtering/merging.",
    impact_area="All downstream analysis, data retention reporting, error investigation",
)

# 4c. Human-readable step log
log_step(
    step_name="2.1 Load Raw GolfShot Data",
    description="Loaded GolfShot vendor export, added row_id, and validated expected raw columns.",
    inputs=[str(raw_csv_path)],
    outputs=["golf_raw"],
    df=golf_raw,
    extra_info={
        "phase": "2.1",
        "category": "ingest",
        "note": "Authoritative raw ingest. Do not clean here.",
        "preview_cols": list(golf_raw.columns)[:12],
    },
)

# ------------------------------------------------------
# 5. Reviewer convenience
# ------------------------------------------------------
golf_raw.head()


🔄 Transform logged: 2.1_raw_import
   Initial GolfShot CSV ingest + deterministic row_id assignment.
📌 Assumption logged: Each vendor row ingested in 2.1 is treated as authoritative raw input and receives a unique row_id.  | Impact: All downstream analysis, data retention reporting, error investigation
✅ 2.1 Load Raw GolfShot Data @ 2025-11-16 19:33:12
   DataFrame shape: 4287 rows × 24 cols
   Loaded GolfShot vendor export, added row_id, and validated expected raw columns.
   phase: 2.1
   category: ingest
   note: Authoritative raw ingest. Do not clean here.
   preview_cols: ['Name', 'Round DateTime (UTC)', 'Facility Name', 'Course Name', 'Round Score', 'Round Fairway Hits', 'Round Putts', 'Round GIR', 'Hole Number', 'Hole Par', 'Hole Fairway Hit Type', 'Hole Fairway Hits']


Unnamed: 0,Name,Round DateTime (UTC),Facility Name,Course Name,Round Score,Round Fairway Hits,Round Putts,Round GIR,Hole Number,Hole Par,Hole Fairway Hit Type,Hole Fairway Hits,Hole Putts,Hole Score,Hole GIR,Shot Club,Shot Direction,Shot Start Latitude,Shot Start Longitude,Shot End Latitude,Shot End Longitude,Yardage,Yardage To Pin,row_id
0,Mike Phillips,3/1/2011 12:00:00 AM,Sligo Creek Golf Course,Sligo Creek,94,56,38,"2,777.78%",1,4,Left,5,2,7,No,,,,,,,,,1
1,Mike Phillips,3/1/2011 12:00:00 AM,Sligo Creek Golf Course,Sligo Creek,94,56,38,"2,777.78%",2,3,Unknown,1,1,2,Yes,,,,,,,,,2
2,Mike Phillips,3/1/2011 12:00:00 AM,Sligo Creek Golf Course,Sligo Creek,94,56,38,"2,777.78%",3,4,Hit,2,2,4,Yes,,,,,,,,,3
3,Mike Phillips,3/1/2011 12:00:00 AM,Sligo Creek Golf Course,Sligo Creek,94,56,38,"2,777.78%",4,3,Unknown,5,1,6,No,,,,,,,,,4
4,Mike Phillips,3/1/2011 12:00:00 AM,Sligo Creek Golf Course,Sligo Creek,94,56,38,"2,777.78%",5,4,Right,3,3,6,No,,,,,,,,,5


### ======================================================
### 2.2 Standardize Schema & Datatypes
### ======================================================

**INPUTS**  
- DataFrame: `golf_raw` (output from 2.1 Load Raw GolfShot Data)  
- Vendor columns expected (selected):  
  `Name`, `Round DateTime (UTC)`, `Facility Name`, `Course Name`,  
  `Round Score`, `Round GIR`, `Hole Number`, `Hole Score`,  
  `Shot Start Latitude`, `Shot Start Longitude`, `Shot End Latitude`, `Shot End Longitude`,  
  `Yardage`, `Yardage To Pin`

**WHAT THIS STEP DOES**  
- Renames vendor-supplied column headers to an internal, snake_case schema  
- Preserves the surrogate key `row_id` from 2.1  
- Normalizes `round_gir` from vendor strings (e.g. `"2,777.78%"`) to real percents (0–100 scale)  
- Parses `round_dt` into pandas datetime  
- Applies consistent nullable numeric dtypes (Int64 / Float64) for analytics  
- Trims text fields and coerces to string dtype  
- Validates the new internal schema for downstream steps  
- Logs the transformation for lineage and records assumptions about timestamp semantics

**WHY IT MATTERS**  
This is the first step where we define the **internal contract** for the rest of the notebook.  
Every subsequent enrichment (holes, facilities, clubs) will assume these names and dtypes.  
By cleaning percent fields and timestamps here, we avoid inconsistent logic later on.
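
The preview rows in 2.1 show timestamps such as `3/1/2011 12:00:00 AM`. As a sketch, assuming every value follows that month/day/year 12-hour pattern (this has not been confirmed for the full export), an explicit `format` lets pandas parse without inferring a format per element. `ASSUMED_DT_FORMAT` is an illustrative name, not a notebook constant.

```python
import pandas as pd

# Assumed pattern, taken from the preview rows only.
ASSUMED_DT_FORMAT = "%m/%d/%Y %I:%M:%S %p"

sample_dt = pd.Series(["3/1/2011 12:00:00 AM", " 7/4/2015 2:30:00 PM ", "not a date"])

parsed_dt = pd.to_datetime(
    sample_dt.str.strip(),
    format=ASSUMED_DT_FORMAT,
    errors="coerce",  # values that do not match the format become NaT
)
unparsed_count = int(parsed_dt.isna().sum())
```

Counting `NaT` values after parsing, as the step's QC metric does, shows whether the assumed format actually covers the data.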

**OUTPUTS**  
- DataFrame: `golf` (standardized, typed, analysis-ready)  
- Updated logs: `STEP_LOG`, `TRANSFORM_LOG`, `ASSUMPTIONS_LOG`, `VALIDATION_LOG`  
- QC metrics in the step log (null `round_dt`, min/max `round_gir`)
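
To make the GIR rule concrete: the vendor string `"2,777.78%"` becomes `27.7778` once whitespace, separators, and the percent sign are stripped and the value is divided by 100. A minimal sketch on a toy Series follows; `normalize_gir` is an illustrative name, not a helper from this notebook.

```python
import pandas as pd


def normalize_gir(raw: pd.Series) -> pd.Series:
    """Convert vendor GIR strings such as '2,777.78%' to a 0-100 percent."""
    cleaned = raw.astype(str).str.replace(r"[\s,%]", "", regex=True)
    # Unparseable or missing values become NaN instead of raising.
    return pd.to_numeric(cleaned, errors="coerce") / 100.0


sample_gir = pd.Series(["2,777.78%", "0.00%", None, "n/a"])
gir_pct = normalize_gir(sample_gir)
```

Logging the minimum and maximum after this step is a cheap check that the result really lands on the 0–100 scale.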


In [4]:
# ======================================================
# 2.2 Standardize Schema & Datatypes
# ======================================================

# ------------------------------------------------------
# 1. Capture pre-transform state
# ------------------------------------------------------
_before = golf_raw.copy()

# ------------------------------------------------------
# 2. Rename vendor columns → internal schema
# ------------------------------------------------------
RENAME_MAP = {
    "Name": "player_name",
    "Round DateTime (UTC)": "round_dt",
    "Facility Name": "facility",
    "Course Name": "course",
    "Round Score": "round_score",
    "Round Fairway Hits": "round_fairway_hits",
    "Round Putts": "round_putts",
    "Round GIR": "round_gir",
    "Hole Number": "hole_number",
    "Hole Par": "hole_par",
    "Hole Fairway Hit Type": "hole_fairway_hit_type",
    "Hole Fairway Hits": "hole_fairway_hits",
    "Hole Putts": "hole_putts",
    "Hole Score": "hole_score",
    "Hole GIR": "hole_gir",
    "Shot Club": "shot_club",
    "Shot Direction": "shot_direction",
    "Shot Start Latitude": "shot_start_lat",
    "Shot Start Longitude": "shot_start_lon",
    "Shot End Latitude": "shot_end_lat",
    "Shot End Longitude": "shot_end_lon",
    "Yardage": "yardage",
    "Yardage To Pin": "yardage_to_pin",
}

golf = golf_raw.rename(columns=RENAME_MAP).copy()

# Ensure row_id survived
if "row_id" not in golf.columns:
    raise RuntimeError("row_id was expected from Step 2.1 but is missing after rename.")

# ------------------------------------------------------
# 3. Clean round_gir into numeric percent (0–100)
# ------------------------------------------------------
if "round_gir" in golf.columns:
    gir_clean = (
        golf["round_gir"]
        .astype(str)
        .str.replace(r"\s+", "", regex=True)
        .str.replace("%", "", regex=False)
        .str.replace(",", "", regex=False)
    )
    gir_clean = pd.to_numeric(gir_clean, errors="coerce")
    golf["round_gir"] = gir_clean / 100.0  # vendor value is 100x too large, e.g. 2777.78 -> 27.78

# ------------------------------------------------------
# 4. Parse round_dt into datetime
# ------------------------------------------------------
if "round_dt" in golf.columns:
    golf["round_dt"] = golf["round_dt"].astype(str).str.strip()
    golf["round_dt"] = pd.to_datetime(golf["round_dt"], errors="coerce")

# ------------------------------------------------------
# 5. Apply numeric dtypes (nullable)
# ------------------------------------------------------
INT_LIKE_COLS = [
    "round_score",
    "round_fairway_hits",
    "round_putts",
    "hole_number",
    "hole_par",
    "hole_fairway_hits",
    "hole_putts",
    "hole_score",
]

FLOAT_LIKE_COLS = [
    "shot_start_lat",
    "shot_start_lon",
    "shot_end_lat",
    "shot_end_lon",
    "yardage",
    "yardage_to_pin",
]

for col in INT_LIKE_COLS:
    if col in golf.columns:
        golf[col] = pd.to_numeric(golf[col], errors="coerce").astype("Int64")

for col in FLOAT_LIKE_COLS:
    if col in golf.columns:
        golf[col] = pd.to_numeric(golf[col], errors="coerce").astype("Float64")

# ------------------------------------------------------
# 6. Apply string dtypes + trim
# ------------------------------------------------------
STRING_COLS = [
    "player_name",
    "facility",
    "course",
    "hole_fairway_hit_type",
    "shot_club",
    "shot_direction",
    "hole_gir",
]

for col in STRING_COLS:
    if col in golf.columns:
        golf[col] = golf[col].astype("string").str.strip()

# ------------------------------------------------------
# 7. Validate cleaned schema
# ------------------------------------------------------
required_cols_2_2 = [
    "row_id",
    "player_name",
    "round_dt",
    "facility",
    "course",
    "hole_number",
    "hole_score",
    "round_score",
    "round_gir",
    "shot_start_lat",
    "shot_start_lon",
]

validate_columns(
    golf,
    required_cols=required_cols_2_2,
    context_name="2.2 Standardize Schema & Datatypes",
)

# ------------------------------------------------------
# 8. Governance logging
# ------------------------------------------------------
# Assumptions about semantics
record_assumption(
    text="Treat 'Round DateTime (UTC)' as the local tee time recorded by the device, not verified true UTC.",
    rationale="Empirical comparison showed timestamps align to played local time, not actual UTC offsets.",
    impact_area="Time-of-day analysis, temporal feature engineering",
)
record_assumption(
    text="Normalize vendor 'Round GIR' strings into real % on 0–100 scale.",
    rationale="Vendor encodes GIR in scaled percentage form; we need comparable KPIs.",
    impact_area="Round-level GIR reporting and trend analysis",
)

# QC metrics
gir_min = (
    float(golf["round_gir"].min())
    if "round_gir" in golf and golf["round_gir"].notna().any()
    else None
)
gir_max = (
    float(golf["round_gir"].max())
    if "round_gir" in golf and golf["round_gir"].notna().any()
    else None
)
null_round_dt = int(golf["round_dt"].isna().sum()) if "round_dt" in golf else None

# Lineage
track_transform(
    stage_name="2.2_standardize_schema",
    df_before=_before,
    df_after=golf,
    notes="Renamed vendor columns to internal schema, parsed timestamps, normalized GIR, and enforced dtypes.",
)

# Step log
log_step(
    step_name="2.2 Standardize Schema & Datatypes",
    description="Standardized column names, cleaned vendor percent fields, parsed datetime, and applied nullable numeric dtypes.",
    inputs=["golf_raw"],
    outputs=["golf"],
    df=golf,
    extra_info={
        "phase": "2.2",
        "category": "transform",
        "round_dt_nulls": null_round_dt,
        "round_gir_min": gir_min,
        "round_gir_max": gir_max,
        "note": "This establishes the internal schema contract for all 2.x and 3.x steps.",
    },
)

# ------------------------------------------------------
# 9. Reviewer convenience
# ------------------------------------------------------
golf.head()


📌 Assumption logged: Treat 'Round DateTime (UTC)' as the local tee time recorded by the device, not verified true UTC.  | Impact: Time-of-day analysis, temporal feature engineering
📌 Assumption logged: Normalize vendor 'Round GIR' strings into real % on 0–100 scale.  | Impact: Round-level GIR reporting and trend analysis
🔄 Transform logged: 2.2_standardize_schema
   Rows 4287 → 4287 (0 change)
   Renamed vendor columns to internal schema, parsed timestamps, normalized GIR, and enforced dtypes.
✅ 2.2 Standardize Schema & Datatypes @ 2025-11-16 19:33:12
   DataFrame shape: 4287 rows × 24 cols
   Standardized column names, cleaned vendor percent fields, parsed datetime, and applied nullable numeric dtypes.
   phase: 2.2
   category: transform
   round_dt_nulls: 0
   round_gir_min: 0.0
   round_gir_max: 66.6667
   note: This establishes the internal schema contract for all 2.x and 3.x steps.



Unnamed: 0,player_name,round_dt,facility,course,round_score,round_fairway_hits,round_putts,round_gir,hole_number,hole_par,hole_fairway_hit_type,hole_fairway_hits,hole_putts,hole_score,hole_gir,shot_club,shot_direction,shot_start_lat,shot_start_lon,shot_end_lat,shot_end_lon,yardage,yardage_to_pin,row_id
0,Mike Phillips,2011-03-01,Sligo Creek Golf Course,Sligo Creek,94,56,38,27.778,1,4,Left,5,2,7,No,,,,,,,,,1
1,Mike Phillips,2011-03-01,Sligo Creek Golf Course,Sligo Creek,94,56,38,27.778,2,3,Unknown,1,1,2,Yes,,,,,,,,,2
2,Mike Phillips,2011-03-01,Sligo Creek Golf Course,Sligo Creek,94,56,38,27.778,3,4,Hit,2,2,4,Yes,,,,,,,,,3
3,Mike Phillips,2011-03-01,Sligo Creek Golf Course,Sligo Creek,94,56,38,27.778,4,3,Unknown,5,1,6,No,,,,,,,,,4
4,Mike Phillips,2011-03-01,Sligo Creek Golf Course,Sligo Creek,94,56,38,27.778,5,4,Right,3,3,6,No,,,,,,,,,5


### ======================================================
### 2.3 Initial Data Profiling
### ======================================================

**INPUTS**  
- DataFrame: `golf` (output from 2.2)  
- Expected columns for profiling:  
  `row_id`, `player_name`, `round_dt`, `facility`, `course`,  
  `hole_number`, `hole_score`, `shot_start_lat`, `shot_start_lon`,  
  `shot_end_lat`, `shot_end_lon`, `yardage`

**WHAT THIS STEP DOES**  
- Captures a baseline, **pre-cleaning** profile of the standardized dataset  
- Measures null coverage on critical analytics fields  
- Counts distinct players, facilities, and facility–course pairs  
- Assesses GPS completeness (share of rows with full start/end coords)  
- Logs any data-quality assumptions (e.g. missing GPS or hole_score) for downstream use

**WHY IT MATTERS**  
This is the DMAIC "Measure" checkpoint: we need to know the actual data quality before dropping, enriching, or reshaping anything.  
By freezing a profile here, we can later prove that rows were removed or repaired intentionally and not lost silently.

**OUTPUTS**
- Updated logs: `STEP_LOG`, `ASSUMPTIONS_LOG` (no transform is logged because the data is not changed)  
- Inline null and distinct summaries for reviewer QA
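
A minimal sketch of the two core profiling measures, null coverage and GPS completeness, on a hypothetical four-row frame. The column names follow the internal schema from 2.2; the values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "hole_score": pd.array([4, 5, 3, 6], dtype="Int64"),
        "shot_start_lat": pd.array([38.99, None, 39.01, None], dtype="Float64"),
        "shot_start_lon": pd.array([-77.01, None, -77.02, None], dtype="Float64"),
        "shot_end_lat": pd.array([38.991, None, None, None], dtype="Float64"),
        "shot_end_lon": pd.array([-77.011, None, None, None], dtype="Float64"),
    }
)

# Null coverage per column, as a count and as a percent of all rows.
null_profile = toy.isna().sum().rename("null_count").to_frame()
null_profile["null_pct"] = (null_profile["null_count"] / len(toy) * 100).round(2)

# A row is GPS-complete only when all four coordinates are present.
gps_cols = ["shot_start_lat", "shot_start_lon", "shot_end_lat", "shot_end_lon"]
gps_complete_pct = float(toy[gps_cols].notna().all(axis=1).mean() * 100)
```

Per-column null rates alone can overstate usable GPS coverage: here the start coordinates are 50% populated, yet only 25% of rows have a complete start/end pair.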


In [5]:
# ======================================================
# 2.3 Initial Data Profiling
# ======================================================

# ------------------------------------------------------
# 1. Basic completeness metrics
# ------------------------------------------------------
total_rows = len(golf)
total_cols = golf.shape[1]

critical_fields = [
    "round_dt",
    "player_name",
    "facility",
    "course",
    "hole_number",
    "hole_score",
    "shot_start_lat",
    "shot_start_lon",
    "shot_end_lat",
    "shot_end_lon",
    "yardage",
]

null_summary = (
    golf[critical_fields]
    .isna()
    .sum()
    .rename("null_count")
    .to_frame()
)
null_summary["null_pct"] = (null_summary["null_count"] / total_rows * 100).round(2)

# ------------------------------------------------------
# 2. Distinct entity counts
# ------------------------------------------------------
distinct_summary = {
    "distinct_players": golf["player_name"].nunique(dropna=True) if "player_name" in golf else None,
    "distinct_facilities": golf["facility"].nunique(dropna=True) if "facility" in golf else None,
    "distinct_facility_course_pairs": (
        golf[["facility", "course"]].dropna().drop_duplicates().shape[0]
        if all(c in golf.columns for c in ["facility", "course"])
        else None
    ),
}

# ------------------------------------------------------
# 3. Hole score range and GPS completeness
# ------------------------------------------------------
hole_score_min = (
    float(golf["hole_score"].min())
    if "hole_score" in golf and golf["hole_score"].notna().any()
    else None
)
hole_score_max = (
    float(golf["hole_score"].max())
    if "hole_score" in golf and golf["hole_score"].notna().any()
    else None
)

if all(c in golf.columns for c in ["shot_start_lat", "shot_start_lon", "shot_end_lat", "shot_end_lon"]):
    gps_valid_mask = (
        golf["shot_start_lat"].notna()
        & golf["shot_start_lon"].notna()
        & golf["shot_end_lat"].notna()
        & golf["shot_end_lon"].notna()
    )
    gps_valid_pct = float(gps_valid_mask.mean() * 100)
else:
    gps_valid_pct = None


# ------------------------------------------------------
# 4. Assumptions from observed quality
# ------------------------------------------------------
if gps_valid_pct is not None and gps_valid_pct < 70:
    record_assumption(
        text=f"Only {gps_valid_pct:.1f}% of rows have full GPS start/end coordinates.",
        rationale="Geospatial/yardage/dispersion analyses must restrict to GPS-complete rows.",
        impact_area="Shot-level yardage validation and dispersion mapping",
    )

if "hole_score" in golf:
    missing_hole_scores = int(golf["hole_score"].isna().sum())
    if missing_hole_scores > 0:
        record_assumption(
            text=f"{missing_hole_scores} rows are missing hole_score.",
            rationale="Rows without a hole score cannot contribute to hole- or round-level scoring metrics.",
            impact_area="Scoring completeness checks and scoring statistics",
        )

# ------------------------------------------------------
# 5. Step log
# ------------------------------------------------------
log_step(
    step_name="2.3 Initial Data Profiling",
    description="Captured baseline completeness, distinct entities, and GPS availability for standardized dataset.",
    inputs=["golf"],
    df=golf,
    extra_info={
        "phase": "2.3",
        "category": "profile",
        "total_rows": total_rows,
        "total_cols": total_cols,
        "gps_valid_pct": gps_valid_pct,
        "hole_score_range": (hole_score_min, hole_score_max),
        "distinct_players": distinct_summary["distinct_players"],
        "distinct_facilities": distinct_summary["distinct_facilities"],
        "distinct_facility_course_pairs": distinct_summary["distinct_facility_course_pairs"],
        "null_pct_preview": null_summary["null_pct"].head(6).to_dict(),
    },
)

# ------------------------------------------------------
# 6. Reviewer convenience
# ------------------------------------------------------
display(null_summary.head(20))
display(pd.DataFrame([distinct_summary]))


📌 Assumption logged: Only 45.6% of rows have full GPS start/end coordinates.  | Impact: Shot-level yardage validation and dispersion mapping
✅ 2.3 Initial Data Profiling @ 2025-11-16 19:33:12
   DataFrame shape: 4287 rows × 24 cols
   Captured baseline completeness, distinct entities, and GPS availability for standardized dataset.
   phase: 2.3
   category: profile
   total_rows: 4287
   total_cols: 24
   gps_valid_pct: 45.579659435502684
   hole_score_range: (0.0, 15.0)
   distinct_players: 12
   distinct_facilities: 28
   distinct_facility_course_pairs: 30
   null_pct_preview: {'round_dt': 0.0, 'player_name': 0.0, 'facility': 0.0, 'course': 0.0, 'hole_number': 0.0, 'hole_score': 0.0}


Unnamed: 0,null_count,null_pct
round_dt,0,0.0
player_name,0,0.0
facility,0,0.0
course,0,0.0
hole_number,0,0.0
hole_score,0,0.0
shot_start_lat,2333,54.42
shot_start_lon,2333,54.42
shot_end_lat,2333,54.42
shot_end_lon,2333,54.42


Unnamed: 0,distinct_players,distinct_facilities,distinct_facility_course_pairs
0,12,28,30


### ======================================================
### 2.4 Core Field Validation & Logical Consistency
### ======================================================

**INPUTS**  
- DataFrame: `golf` (post 2.3 profiling)  

**WHAT THIS STEP DOES**  
- Runs domain/business sanity checks on hole- and round-level fields  
- Flags out-of-range hole numbers, nonpositive hole scores, missing timestamps, and implausible round scores  
- Records assumptions about how invalid records will be treated later (in Section 3)  
- Logs the QA step to lineage so we can prove validation happened

**WHY IT MATTERS**  
We want to discover bad or suspicious records **before** the cleaning and enrichment phases.  
By logging these findings instead of dropping rows now, we preserve auditability and give downstream steps the option to filter or repair.

**OUTPUTS**
- `STEP_LOG` / `TRANSFORM_LOG` updated, plus `ASSUMPTIONS_LOG` when a rule finds violations  
- Reviewer-facing violation summary
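
The audit-without-mutating idea can be sketched as a dictionary of rule functions, each returning a boolean mask where `True` marks a violating row. The toy data and the name `toy_rules` are hypothetical; the two rules mirror the hole-number and hole-score checks described above.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "hole_number": pd.array([1, 18, 19, 7], dtype="Int64"),
        "hole_score": pd.array([4, 0, 5, None], dtype="Int64"),
    }
)

# Each rule returns a boolean mask where True marks a violating row.
toy_rules = {
    "hole_number_out_of_range": lambda df: ~df["hole_number"].between(1, 18),
    "hole_score_nonpositive": lambda df: df["hole_score"].notna() & (df["hole_score"] <= 0),
}

summary_rows = []
for rule_name, rule_fn in toy_rules.items():
    # Treat an undetermined (NA) result as "not a violation" for counting.
    mask = rule_fn(toy).fillna(False)
    summary_rows.append(
        {
            "rule_name": rule_name,
            "violations": int(mask.sum()),
            "violation_pct": round(float(mask.mean()) * 100, 2),
        }
    )

toy_summary = pd.DataFrame(summary_rows)
rows_after = len(toy)  # nothing is dropped; rows are only counted
```

With nullable dtypes, a comparison against a missing value yields `NA` rather than `False`, so the sketch makes the choice explicit with `fillna(False)`. Whether a missing value should count as a violation is a policy decision left to the rule author.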


In [6]:
# ======================================================
# 2.4 Core Field Validation & Logical Consistency
# ======================================================

# ------------------------------------------------------
# 1. Define rule checks
# ------------------------------------------------------
# Each rule returns a boolean mask (True = violation). When the column is
# absent, the rule returns a constant mask aligned to the DataFrame index.
def rule_invalid_hole_number(df):
    if "hole_number" not in df.columns:
        return pd.Series(False, index=df.index)
    return ~df["hole_number"].between(1, 18, inclusive="both")

def rule_invalid_hole_score(df):
    if "hole_score" not in df.columns:
        return pd.Series(False, index=df.index)
    return (df["hole_score"].notna()) & (df["hole_score"] <= 0)

def rule_missing_round_dt(df):
    # A missing round_dt column means every row lacks a timestamp.
    if "round_dt" not in df.columns:
        return pd.Series(True, index=df.index)
    return df["round_dt"].isna()

def rule_implausible_round_score(df):
    if "round_score" not in df.columns:
        return pd.Series(False, index=df.index)
    return (
        df["round_score"].notna()
        & ((df["round_score"] < 20) | (df["round_score"] > 200))
    )

ruleset_name = "2.4_core_field_validation"
rules_dict = {
    "hole_number_out_of_range": rule_invalid_hole_number,
    "hole_score_nonpositive": rule_invalid_hole_score,
    "round_dt_missing": rule_missing_round_dt,
    "round_score_implausible": rule_implausible_round_score,
}

# ------------------------------------------------------
# 2. Build a reviewer summary
# ------------------------------------------------------

violation_summary_rows = []
for rule_name, fn in rules_dict.items():
    mask = fn(golf) if callable(fn) else fn
    total_viol = int(mask.sum())
    pct_viol = (total_viol / len(golf) * 100.0) if len(golf) > 0 else 0.0
    violation_summary_rows.append({
        "rule_name": rule_name,
        "violations": total_viol,
        "violation_pct": round(pct_viol, 2),
    })
violation_summary_df = pd.DataFrame(violation_summary_rows).sort_values(
    by="violation_pct", ascending=False
)

# ------------------------------------------------------
# 3. Record assumptions / downstream decisions
# ------------------------------------------------------
# Nonpositive hole scores → will be excluded later
nonpos_hole_score_count = int(
    violation_summary_df.loc[
        violation_summary_df["rule_name"] == "hole_score_nonpositive",
        "violations",
    ].iloc[0]
) if "hole_score_nonpositive" in violation_summary_df["rule_name"].values else 0

if nonpos_hole_score_count > 0:
    record_assumption(
        text=f"{nonpos_hole_score_count} rows contain hole_score <= 0. These rows will be excluded from scoring KPIs.",
        rationale="hole_score <= 0 indicates incomplete/invalid hole tracking, not real golf results.",
        impact_area="Round completeness checks and scoring statistics in Data Preparation",
    )

# Missing round_dt → unusable for time-based analyses
round_dt_missing_count = int(
    violation_summary_df.loc[
        violation_summary_df["rule_name"] == "round_dt_missing",
        "violations",
    ].iloc[0]
) if "round_dt_missing" in violation_summary_df["rule_name"].values else 0

if round_dt_missing_count > 0:
    record_assumption(
        text=f"{round_dt_missing_count} rows are missing round_dt timestamps.",
        rationale="Rows without a reliable timestamp cannot be used for time-of-day / seasonal analysis.",
        impact_area="Temporal feature engineering and trend analysis",
    )

# ------------------------------------------------------
# 4. Lineage + step log
# ------------------------------------------------------
track_transform(
    stage_name="2.4_core_field_validation",
    df_before=golf,
    df_after=golf,
    notes="Executed logical sanity checks (hole_number range, hole_score>0, timestamp presence, plausible round_score) without mutating data.",
)

log_step(
    step_name="2.4 Core Field Validation & Logical Consistency",
    description="Audited golf data for logical consistency before any row filtering. Logged rule violations and downstream cleanup assumptions.",
    inputs=["golf"],
    outputs=["violation_summary_df"],
    df=golf,
    extra_info={
        "phase": "2.4",
        "category": "validation",
        "most_common_violations": violation_summary_df.head(3).to_dict(orient="records"),
        "note": "No rows were dropped in 2.4. Cleanup happens in Section 3.",
    },
)

# ------------------------------------------------------
# 5. Reviewer display
# ------------------------------------------------------
display(violation_summary_df)


📌 Assumption logged: 275 rows contain hole_score <= 0. These rows will be excluded from scoring KPIs.  | Impact: Round completeness checks and scoring statistics in Data Preparation
🔄 Transform logged: 2.4_core_field_validation
   Rows 4287 → 4287 (0 change)
   Executed logical sanity checks (hole_number range, hole_score>0, timestamp presence, plausible round_score) without mutating data.
✅ 2.4 Core Field Validation & Logical Consistency @ 2025-11-16 19:33:12
   DataFrame shape: 4287 rows × 24 cols
   Audited golf data for logical consistency before any row filtering. Logged rule violations and downstream cleanup assumptions.
   phase: 2.4
   category: validation
   most_common_violations: [{'rule_name': 'hole_score_nonpositive', 'violations': 275, 'violation_pct': 6.41}, {'rule_name': 'hole_number_out_of_range', 'violations': 0, 'violation_pct': 0.0}, {'rule_name': 'round_dt_missing', 'violations': 0, 'violation_pct': 0.0}]
   note: No rows were dropped in 2.4. Cleanup happens in Section 3.

Unnamed: 0,rule_name,violations,violation_pct
1,hole_score_nonpositive,275,6.41
0,hole_number_out_of_range,0,0.0
2,round_dt_missing,0,0.0
3,round_score_implausible,0,0.0


### ======================================================
### 2.5 Initial Vendor Yardage Validation
### ======================================================

**INPUTS**  
- DataFrame: `golf` (post 2.4 validation)  
- Expected columns:  
  `row_id`, `yardage`, `shot_start_lat`, `shot_start_lon`, `shot_end_lat`, `shot_end_lon`

**WHAT THIS STEP DOES**  
- Identifies rows eligible for yardage validation (have vendor yardage + full GPS)  
- Computes GPS-based shot distance using start/end coordinates (geodesic, meters → yards)  
- Compares computed yardage to vendor-supplied `yardage`  
- Calculates absolute and percent error and flags "suspect" rows where the percent error exceeds the 5% tolerance and the computed distance is at least 30 yards  
- Merges diagnostics back onto the main `golf` DataFrame  
- Logs validation tolerance as an assumption and records summary metrics
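
The notebook code calls `geodesic(p1, p2).meters`, which looks like geopy's ellipsoidal distance (the import is not shown in this section, so that is an assumption). For intuition, the sketch below swaps in the simpler haversine great-circle formula, which needs only NumPy. It assumes a spherical Earth, so it can differ from an ellipsoidal geodesic by a fraction of a percent. `haversine_yards` and `EARTH_RADIUS_M` are illustrative names.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_008.8   # mean Earth radius; spherical approximation
YARDS_PER_METER = 1.0936133    # same constant the notebook uses


def haversine_yards(lat1, lon1, lat2, lon2):
    """Great-circle distance in yards between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    meters = 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))
    return meters * YARDS_PER_METER


# One degree of latitude along a meridian is about 111.2 km on this sphere.
one_degree_yards = float(haversine_yards(0.0, 0.0, 1.0, 0.0))
same_point_yards = float(haversine_yards(39.0, -77.0, 39.0, -77.0))
```

Because the function accepts arrays, it can be applied to whole coordinate columns at once instead of row by row.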

**WHY IT MATTERS**  
Club-distance and dispersion analytics in 3.x depend on trustworthy yardage.  
By flagging bad yardages here (without dropping them), we give downstream steps a clean, governed way to filter or down-weight low-quality shots while preserving lineage.

**OUTPUTS**  
- DataFrame: `golf` with new columns:  
  `yardage_calc`, `yardage_error`, `yardage_error_abs`, `yardage_error_pct`, `yardage_suspect`  
- Updated logs: `STEP_LOG`, `TRANSFORM_LOG`, `ASSUMPTIONS_LOG`  
- Reviewer preview of worst-mismatch shots
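
A worked example of the tolerance rule on hypothetical shots. `TOL_PCT` and `MIN_YARDS` mirror the notebook's `YARDAGE_TOL_PCT` and `YARDAGE_MIN_YARDS`; the yardage values are invented for illustration.

```python
import numpy as np
import pandas as pd

TOL_PCT = 5.0      # mirrors YARDAGE_TOL_PCT
MIN_YARDS = 30.0   # mirrors YARDAGE_MIN_YARDS

# Hypothetical shots: vendor yardage vs. GPS-derived yardage.
shots = pd.DataFrame(
    {
        "yardage": [150.0, 200.0, 12.0, 0.0],
        "yardage_calc": [152.0, 230.0, 20.0, 40.0],
    }
)

shots["yardage_error"] = shots["yardage_calc"] - shots["yardage"]
shots["yardage_error_abs"] = shots["yardage_error"].abs()
# A vendor yardage of 0 would divide by zero, so it is treated as missing.
shots["yardage_error_pct"] = (
    shots["yardage_error_abs"] / shots["yardage"].replace(0, np.nan) * 100
)
shots["yardage_suspect"] = (
    (shots["yardage_error_pct"] > TOL_PCT) & (shots["yardage_calc"] >= MIN_YARDS)
)
```

In this toy data only the 200-yard shot (15% error) is flagged. The 12-yard shot has a 66.7% error but is skipped because its computed distance is under 30 yards, and the zero-yardage row yields a missing percent error rather than a division error.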


In [7]:
# ======================================================
# 2.5 Initial Vendor Yardage Validation
# ======================================================

# ------------------------------------------------------
# 1. Schema gate
# ------------------------------------------------------
validate_columns(
    golf,
    required_cols=[
        "row_id",
        "shot_start_lat",
        "shot_start_lon",
        "shot_end_lat",
        "shot_end_lon",
    ],
    context_name="2.5 Initial Vendor Yardage Validation",
)

_before = golf.copy()

# ------------------------------------------------------
# 2. Build subset of eligible rows
# ------------------------------------------------------
gps_cols_available = all(
    c in golf.columns
    for c in ["shot_start_lat", "shot_start_lon", "shot_end_lat", "shot_end_lon"]
)

if gps_cols_available and "yardage" in golf.columns:
    shot_df = (
        golf[
            golf["yardage"].notna()
            & golf["shot_start_lat"].notna()
            & golf["shot_start_lon"].notna()
            & golf["shot_end_lat"].notna()
            & golf["shot_end_lon"].notna()
        ]
        .copy()
    )
else:
    shot_df = golf.head(0).copy()

if "row_id" not in golf.columns:
    raise RuntimeError("row_id missing in `golf`. It should have been added in 2.1 and preserved through 2.2.")

# ------------------------------------------------------
# 3. Parameters / assumptions
# ------------------------------------------------------
YARDAGE_TOL_PCT = 5.0     # flag if abs % error > 5%
YARDAGE_MIN_YARDS = 30.0  # ignore very short shots
YARDS_PER_METER = 1.0936133

record_assumption(
    text=f"Vendor yardage validated against geodesic distance with ±{YARDAGE_TOL_PCT:.1f}% tolerance for shots ≥ {YARDAGE_MIN_YARDS} yards.",
    rationale="Short/green-side shots are GPS-noisy and not useful for club benchmarking.",
    impact_area="Club-distance profiling, dispersion analysis, approach/tee club selection",
)

# ------------------------------------------------------
# 4. Compute GPS-based yardage
# ------------------------------------------------------
def compute_geo_yardage(row):
    try:
        p1 = (row["shot_start_lat"], row["shot_start_lon"])
        p2 = (row["shot_end_lat"], row["shot_end_lon"])
        dist_meters = geodesic(p1, p2).meters
        return dist_meters * YARDS_PER_METER
    except Exception:
        return np.nan

shot_df["yardage_calc"] = shot_df.apply(compute_geo_yardage, axis=1)

# ------------------------------------------------------
# 5. Compare to vendor yardage and flag suspects
# ------------------------------------------------------
shot_df["yardage_error"] = shot_df["yardage_calc"] - shot_df["yardage"]
shot_df["yardage_error_abs"] = shot_df["yardage_error"].abs()

shot_df["yardage_error_pct"] = (
    shot_df["yardage_error_abs"] / shot_df["yardage"].replace(0, np.nan) * 100
)

shot_df["yardage_suspect"] = (
    (shot_df["yardage_error_pct"] > YARDAGE_TOL_PCT)
    & (shot_df["yardage_calc"] >= YARDAGE_MIN_YARDS)
)

# ------------------------------------------------------
# 6. Merge diagnostics back into main golf
# ------------------------------------------------------
merge_cols = [
    "row_id",
    "yardage_calc",
    "yardage_error",
    "yardage_error_abs",
    "yardage_error_pct",
    "yardage_suspect",
]

golf = golf.merge(
    shot_df[merge_cols],
    on="row_id",
    how="left",
    suffixes=("", "_calcsrc"),
)

# ------------------------------------------------------
# 7. Business rule audit + summary
# ------------------------------------------------------
def rule_yardage_suspect(df):
    return df["yardage_suspect"].fillna(False)


def _safe_stat(val, digits=2):
    if val is None:
        return None
    try:
        if pd.isna(val):
            return None
    except Exception:
        pass
    try:
        return round(float(val), digits)
    except Exception:
        return None

eligible_shots = len(shot_df)
valid_calc_rows = int(shot_df["yardage_calc"].notna().sum()) if eligible_shots > 0 else 0
mean_abs_err = _safe_stat(shot_df["yardage_error_abs"].mean(), 2) if eligible_shots > 0 else None
max_abs_err = _safe_stat(shot_df["yardage_error_abs"].max(), 2) if eligible_shots > 0 else None
pct_flagged_raw = shot_df["yardage_suspect"].fillna(False).mean() * 100 if eligible_shots > 0 else 0
pct_flagged = _safe_stat(pct_flagged_raw, 2)

# ------------------------------------------------------
# 8. Governance logging
# ------------------------------------------------------
track_transform(
    stage_name="2.5_vendor_yardage_validation",
    df_before=_before,
    df_after=golf,
    notes="Added GPS-based yardage diagnostics and suspect flag for GPS-complete shot rows.",
    new_cols=[
        "yardage_calc",
        "yardage_error",
        "yardage_error_abs",
        "yardage_error_pct",
        "yardage_suspect",
    ],
)

log_step(
    step_name="2.5 Initial Vendor Yardage Validation",
    description="Validated vendor yardage against GPS-derived yardage for all eligible rows; flagged large discrepancies.",
    inputs=["golf"],
    outputs=["golf (with yardage diagnostics)"],
    df=golf,
    extra_info={
        "phase": "2.5",
        "category": "validation",
        "eligible_shots_evaluated": eligible_shots,
        "with_valid_calc": valid_calc_rows,
        "mean_abs_error_yards": mean_abs_err,
        "max_abs_error_yards": max_abs_err,
        "pct_flagged_suspect": pct_flagged,
        "tolerance_pct": YARDAGE_TOL_PCT,
        "min_distance_evaluated": YARDAGE_MIN_YARDS,
    },
)

# ------------------------------------------------------
# 9. Reviewer display (non-null yardage_calc only)
# ------------------------------------------------------
shot_preview = (
    shot_df.loc[shot_df["yardage_calc"].notna(), [
        "player_name",
        "facility",
        "course",
        "shot_club",
        "yardage",
        "yardage_calc",
        "yardage_error_abs",
        "yardage_error_pct",
        "yardage_suspect",
    ]]
    .sort_values("yardage_error_abs", ascending=False)
    .head(20)
)
display(shot_preview)


📌 Assumption logged: Vendor yardage validated against geodesic distance with ±5.0% tolerance for shots ≥ 30.0 yards.  | Impact: Club-distance profiling, dispersion analysis, approach/tee club selection
🔄 Transform logged: 2.5_vendor_yardage_validation
   Rows 4287 → 4287 (0 change)
   Added GPS-based yardage diagnostics and suspect flag for GPS-complete shot rows.
✅ 2.5 Initial Vendor Yardage Validation @ 2025-11-16 19:33:12
   DataFrame shape: 4287 rows × 29 cols
   Validated vendor yardage against GPS-derived yardage for all eligible rows; flagged large discrepancies.
   phase: 2.5
   category: validation
   eligible_shots_evaluated: 1954
   with_valid_calc: 1954
   mean_abs_error_yards: 0.52
   max_abs_error_yards: 1.82
   pct_flagged_suspect: 0.0
   tolerance_pct: 5.0
   min_distance_evaluated: 30.0


Unnamed: 0,player_name,facility,course,shot_club,yardage,yardage_calc,yardage_error_abs,yardage_error_pct,yardage_suspect
2606,Mike Phillips,Langston Golf Course,Langston,1W,344.488,346.308,1.819,0.528,False
3729,Mike Phillips,Lewisburg Elks Country Club,Lewisburg Elks,1W,256.999,258.67,1.671,0.65,False
2167,Mike Phillips,Langston Golf Course,Langston,1W,272.31,273.969,1.66,0.609,False
3642,Mike Phillips,Lewisburg Elks Country Club,Lewisburg Elks,1W,253.718,255.36,1.642,0.647,False
3255,Mike Phillips,Lewisburg Elks Country Club,Lewisburg Elks,3Hy,255.906,257.525,1.62,0.633,False
3578,Mike Phillips,Pipestem Golf Club,Regulation,1W,313.867,315.478,1.611,0.513,False
3714,Mike Phillips,Lewisburg Elks Country Club,Lewisburg Elks,1W,236.22,237.818,1.597,0.676,False
3215,Mike Phillips,Lewisburg Elks Country Club,Lewisburg Elks,3Hy,242.782,244.368,1.586,0.653,False
3929,Mike Phillips,Lewisburg Elks Country Club,Lewisburg Elks,1W,271.216,272.795,1.579,0.582,False
2395,Mike Phillips,Langston Golf Course,Langston,1W,240.595,242.168,1.573,0.654,False
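
The suspect rule from step 2.5 can be restated on plain scalars, which makes the thresholds easy to check by hand. The sketch below is illustrative only: `haversine_yards` and `is_suspect` are hypothetical helpers, and the haversine formula assumes a spherical Earth, so it approximates (rather than reproduces) the geodesic distance the notebook computes.

```python
import math

YARDS_PER_METER = 1.0936133
EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; spherical approximation (assumption)


def haversine_yards(lat1, lon1, lat2, lon2):
    """Great-circle distance in yards between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a)) * YARDS_PER_METER


def is_suspect(vendor_yards, calc_yards, tol_pct=5.0, min_yards=30.0):
    """Mirror the 2.5 rule: percent error above tolerance, on shots of at least min_yards."""
    if vendor_yards == 0 or math.isnan(vendor_yards) or math.isnan(calc_yards):
        return False  # undefined percent error is never flagged
    error_pct = abs(calc_yards - vendor_yards) / vendor_yards * 100
    return error_pct > tol_pct and calc_yards >= min_yards


# First preview row above: 344.488 vendor yards vs 346.308 computed yards
top_row_flag = is_suspect(344.488, 346.308)
# Made-up values: a 7.5% gap on a long shot, and a large gap on a very short shot
long_shot_flag = is_suspect(200.0, 215.0)
short_shot_flag = is_suspect(10.0, 20.0)
```

The short-shot case shows why the minimum-distance guard matters: a 10-yard discrepancy near the green is a large percentage but is deliberately not flagged.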


### ======================================================
### 2.6 Phase 2 Governance Close-Out (Data Understanding)
### ======================================================

**PURPOSE**  
Freeze and archive the post-validation, pre-preparation state of the GolfShot dataset so that all 3.x data preparation work originates from a governed baseline.

**WHAT THIS STEP DOES**  
- Summarizes governance log population (step, validation, transform, assumptions, business rules)  
- Captures a terminal profile snapshot of `golf`  
- Generates a data dictionary for the current schema  
- Records a “phase closed” assumption for change control  
- Exports a governed snapshot of `golf` to `/data/private`  
- Exports a full artifact bundle (data + logs + dictionaries) to `/deliverables/run_<timestamp>`

**WHY IT MATTERS**  
This is the control gate between **Data Understanding (2.x)** and **Data Preparation (3.x)**.  
By archiving here, we can always roll back or compare later steps to a known-good, validated version.

**OUTPUTS**  
- `PRIVATE_PATH / golf_clean_snapshot.xlsx`  
- Deliverable bundle in `OUTPUT_PATH / run_<timestamp>`  
- Updated logs (`STEP_LOG`, `ASSUMPTIONS_LOG`)
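
The close-out cell in this section writes the snapshot workbook; the `run_<timestamp>` bundle could be produced along the following lines. This is a minimal sketch under stated assumptions: `export_governance_bundle` is a hypothetical helper, the logs are assumed to be lists of dicts, and JSON is chosen here only for illustration.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


def export_governance_bundle(output_root, logs, timestamp=None):
    """Write each governance log to its own JSON file under run_<timestamp>.

    `logs` maps a log name (e.g. "STEP_LOG") to a list of dict entries.
    Returns the run directory that was created.
    """
    stamp = timestamp or datetime.now().strftime("%Y%m%d_%H%M%S")
    run_dir = Path(output_root) / f"run_{stamp}"
    run_dir.mkdir(parents=True, exist_ok=True)
    for name, entries in logs.items():
        target = run_dir / f"{name.lower()}.json"
        # default=str keeps timestamps and paths serializable
        target.write_text(json.dumps(entries, indent=2, default=str), encoding="utf-8")
    return run_dir


# Hypothetical log contents, for illustration only
demo_logs = {
    "STEP_LOG": [{"step_name": "2.6 Phase 2 Governance Close-Out"}],
    "ASSUMPTIONS_LOG": [{"text": "Phase 2 closed and archived."}],
}
demo_root = tempfile.mkdtemp()
run_dir = export_governance_bundle(demo_root, demo_logs, timestamp="20251116_193315")
written_files = sorted(p.name for p in run_dir.iterdir())
```

Passing the timestamp explicitly keeps the folder name reproducible when a run is replayed.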


In [8]:
# ======================================================
# 2.6 Phase 2 Governance Close-Out (Data Understanding)
# ======================================================

# ------------------------------------------------------
# 1. Governance completeness snapshot
# ------------------------------------------------------
print("STEP_LOG entries:", len(STEP_LOG))
print("VALIDATION_LOG entries:", len(VALIDATION_LOG))
print("TRANSFORM_LOG entries:", len(TRANSFORM_LOG))
print("ASSUMPTIONS_LOG entries:", len(ASSUMPTIONS_LOG))

# ------------------------------------------------------
# 2. Generate data dictionary for current schema
# ------------------------------------------------------
generate_data_dictionary(
    golf,
    table_name="golf",
    desc_map=None,
    lineage_map=None,
)

# ------------------------------------------------------
# 3. Phase-close assumption
# ------------------------------------------------------
record_assumption(
    text="Phase 2 (Data Understanding) closed and archived. Section 3 (Data Preparation) will build from this governed baseline.",
    rationale="Ensures reproducibility and auditability of all downstream transformations.",
    impact_area="Change control / data lineage / reproducibility",
)

# ------------------------------------------------------
# 4. Export snapshot + artifacts
# ------------------------------------------------------
snapshot_path = PRIVATE_PATH / "golf_clean_snapshot.xlsx"
golf.to_excel(snapshot_path, index=False)
print(f"📦 Phase 2 snapshot exported → {snapshot_path}")

# ------------------------------------------------------
# 5. Step log
# ------------------------------------------------------
log_step(
    step_name="2.6 Phase 2 Governance Close-Out",
    description="Locked and exported governed baseline prior to Section 3 (Data Preparation).",
    inputs=["golf"],
    outputs=["golf_clean_snapshot.xlsx", "governance artifact bundle"],
    df=golf,
    extra_info={
        "phase": "2.6",
        "category": "export",
        "export_path": str(snapshot_path),
        "note": "End-of-phase control gate completed.",
    },
)


STEP_LOG entries: 6
VALIDATION_LOG entries: 3
TRANSFORM_LOG entries: 4
ASSUMPTIONS_LOG entries: 7
📘 Data dictionary generated for table 'golf' (29 columns).
📌 Assumption logged: Phase 2 (Data Understanding) closed and archived. Section 3 (Data Preparation) will build from this governed baseline.  | Impact: Change control / data lineage / reproducibility
📦 Phase 2 snapshot exported → C:\Users\micha\onedrive\documents\newforce\golf-capstone\data\private\golf_clean_snapshot.xlsx
✅ 2.6 Phase 2 Governance Close-Out @ 2025-11-16 19:33:15
   DataFrame shape: 4287 rows × 29 cols
   Locked and exported governed baseline prior to Section 3 (Data Preparation).
   phase: 2.6
   category: export
   export_path: C:\Users\micha\onedrive\documents\newforce\golf-capstone\data\private\golf_clean_snapshot.xlsx
   note: End-of-phase control gate completed.


## ======================================================
## Phase 3 – Data Preparation  
## *(Define → Acquire → Prepare → Enrich → Analyze → Improve → Control)*  
## ======================================================

### **Purpose**
Phase 3 transforms raw, inconsistent golf-shot and round data into a **governed, analysis-ready data foundation**.  
Across 10+ submodules, it establishes data quality, feature engineering, and governance infrastructure that make the downstream analysis (Phases 4–6) **trustworthy, reproducible, and auditable**.  

This phase is the technical backbone of the Golf Performance Analytics Capstone, turning 15 years of round-level data and 5 years of shot-level data into validated, feature-rich datasets aligned with Six Sigma **DMAIC** and **CRISP-DM** standards.

---

### **Scope Overview**

| Section | Focus | Core Outputs |
|----------|--------|---------------|
| **3.1 Round-Level Validity & Features** | Cleans and validates the foundational unit of performance: rounds. Adds date/time attributes, part-of-day logic, and scoring integrity checks. | `golf_valid` (base analytical fact table) |
| **3.2 Hole-Level Validity & Features** | Derives KPIs from each hole: score classification, putting efficiency, scramble outcomes, and contextual flags for GIR and recovery. | Enhanced `golf_valid` with hole-level metrics |
| **3.3 Club-Level Validity & Features** | Builds per-player, per-club profiles (distance distributions, directional bias, and dispersion metrics) for planning accuracy. | `player_club_profile`, `player_club_dispersion_rollup` |
| **3.4 Facility-Level Validity & Features** | Consolidates facility/course metadata, geocodes locations, computes centroids, and attaches IANA timezones for temporal/geospatial analysis. | `facilities` table with full spatial enrichment |
| **3.5 Governance Close-Out** | Validates all schemas, logs lineage, consolidates assumptions and data dictionaries, and exports every governed artifact. | Phase 3 Handoff Package (CSV + XLSX + Logs + Dictionary) |

---

### **What This Phase Does**
1. **Standardizes structure** across multiple granularities (round, hole, shot, club, facility).  
2. **Implements validation gates** (`validate_columns`, `log_step`, `track_transform`, `record_assumption`) to enforce schema and business-rule compliance.  
3. **Derives engineered features** for scoring, putting, dispersion, and temporal analyses.  
4. **Integrates governance and lineage**, making every transformation traceable from input to output.  
5. **Exports all artifacts** (datasets, logs, and dictionaries) into a timestamped, reproducible package.  
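
A validation gate in the sense of item 2 can be sketched as follows. `check_required_columns` is a hypothetical stand-in written for illustration; it is not the notebook's `validate_columns` helper, which is registered in Phase 1 and may log to `VALIDATION_LOG` rather than raise.

```python
import pandas as pd


def check_required_columns(df, required_cols, context_name):
    """Raise if any required column is missing; return the checked list."""
    missing = [col for col in required_cols if col not in df.columns]
    if missing:
        raise KeyError(f"[{context_name}] missing required columns: {missing}")
    return list(required_cols)


demo = pd.DataFrame({"player_name": ["A"], "hole_number": [1], "hole_score": [4]})

# Passes: every required column is present
checked = check_required_columns(demo, ["player_name", "hole_score"], "demo gate")

# Fails fast: round_dt is absent from the demo frame
try:
    check_required_columns(demo, ["player_name", "round_dt"], "demo gate")
    gate_error = None
except KeyError as exc:
    gate_error = str(exc)
```

Failing before the transformation runs keeps a schema drift from surfacing later as a confusing error deep inside a merge or aggregation.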

---

### **Why It Matters**
Phase 3 ensures that all downstream insights, whether visualized in Tableau or modeled statistically, are based on **accurate, verified, and explainable data**.  
It transforms raw, inconsistent exports into a **single source of analytical truth**, allowing analysts, coaches, or operations leaders to trust every number and reproduce every result.

---

### **Outputs**
| Category | Deliverable | Description |
|-----------|--------------|-------------|
| **Primary Fact Tables** | `golf_valid`, `player_club_profile`, `facilities` | Clean, joined, feature-enriched data for analysis |
| **Supplementary Tables** | `player_club_hole_dispersion`, `player_club_dispersion_rollup` | Aggregated dispersion and directional accuracy metrics |
| **Governance Artifacts** | `STEP_LOG`, `TRANSFORM_LOG`, `ASSUMPTIONS_LOG`, `VALIDATION_LOG`, `DATA_DICTIONARIES`| Complete operational lineage and quality audit trail |
| **Phase 3 Handoff Workbook** | `phase3_exports.xlsx` | Multi-sheet Excel file containing all final tables, logs, and data dictionary |

---

### **Phase Outcome**
By the end of Phase 3, the project achieves:
- **Full traceability:** every column and transformation documented.  
- **Stable schemas:** all tables validated and version-locked for downstream use.  
- **Ready-to-analyze datasets:** formatted for Tableau, Power BI, or Python modeling.  
- **Governed export package:** ensuring reproducibility and long-term audit integrity.  

---

### **Next Steps**
Proceed to **Phase 4: Analysis & Visualization**, using the governed Phase 3 exports (`golf_valid`, `facilities`, `player_club_profile`) as the analytical foundation for modeling, storytelling, and improvement planning.


### ======================================================
### 3.1 Round-level Validity & Features
### ======================================================

**FOCUS**  
Ensure every round in the dataset represents a complete, valid, and chronologically traceable scoring event.  
This section transforms hole-level raw data into a structured, governed round-level dataset.

**SUB-STEPS**
1. **3.1.1 Filter Invalid Hole Records** – Removes non-scoring or placeholder rows (`hole_score ≤ 0` or missing).  
2. **3.1.2 Calculate Holes Scored per Round** – Counts distinct scored holes per round to assess completeness.  
3. **3.1.3 Attach Round Hole Counts & Identify Partial Rounds** – Flags incomplete rounds for exclusion.  
4. **3.1.4 Keep Only Complete Rounds** – Retains only valid 9- and 18-hole rounds.  
5. **3.1.5 Assign Sequential Round Numbers per Player** – Adds player-specific chronological sequencing.  
6. **3.1.6 Assign Sequential Visit Numbers per Player × Course** – Tracks repeat visits to the same course.  
7. **3.1.7 Create Round Identifiers & Sequencing** – Generates durable `round_id` and human-readable `round_key`.  
8. **3.1.8 Round-level Time Validity & Features** – Derives human-readable time, validates tee times, and adds seasonal context:  
   - *3.1.8.1 Derive Time Features*  
   - *3.1.8.2 Validate Round Time Quality*  
   - *3.1.8.3 Derive Seasonal & Annual Context*  
9. **3.1.9 Rename Columns for Analysis** – Normalizes the schema to `round_`, `hole_`, and `shot_` prefixes and standardizes `hole_gir`.  
10. **3.1.10 Round-level Validity & Features Closeout** – Performs a formal governance and schema validation checkpoint, captures an audit snapshot, and marks `golf_valid` as ready for the hole-level phase.

**KEY BENEFITS**
- Guarantees **round-level completeness and integrity**  
- Adds **chronological, temporal, and contextual metadata** for trend analysis  
- Establishes a **governed, audit-ready foundation** for Phase 3.2 (*Hole-level Validity & Features*)  
- Creates a **formal closeout checkpoint (3.1.10)** ensuring data lineage, schema consistency, and traceability before progressing
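
The sequencing in sub-steps 3.1.5 and 3.1.6 can be sketched with `groupby(...).cumcount()`. This is an illustration on made-up data, not the notebook's implementation: the column names `player_round_number` and `player_course_visit_number` are hypothetical, and the frame is assumed to hold one row per round (the hole-level table would need de-duplicating to round level first).

```python
import pandas as pd

rounds = pd.DataFrame({
    "player_name": ["A", "A", "A", "B"],
    "course": ["Langston", "River Run", "Langston", "Langston"],
    "round_dt": pd.to_datetime(
        ["2012-05-01 09:00", "2011-04-03 11:46", "2013-06-10 14:30", "2011-04-03 11:46"]
    ),
})

# Sort chronologically so cumcount follows play order
rounds = rounds.sort_values(["player_name", "round_dt"]).reset_index(drop=True)

# 1-based chronological round number per player (3.1.5)
rounds["player_round_number"] = rounds.groupby("player_name").cumcount() + 1

# 1-based visit number per player and course (3.1.6)
rounds["player_course_visit_number"] = (
    rounds.groupby(["player_name", "course"]).cumcount() + 1
)
```

Player A's second Langston round is their third round overall but only their second visit to that course, which is the distinction the two counters are meant to capture.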


#### ======================================================
#### 3.1.1 Filter Invalid Hole Records
#### ======================================================

**INPUTS**  
- DataFrame: `golf` (Phase 2 output)  
- Key columns: `row_id`, `player_name`, `round_dt`, `facility`, `course`, `hole_number`, `hole_score`, `round_score`

**WHAT THIS STEP DOES**  
- Validates schema before filtering  
- Removes non-scoring and placeholder hole rows (`hole_score ≤ 0` or missing)  
- Produces `golf_valid`, a dataset containing only real scoring holes  
- Logs row-retention metrics and schema integrity  
- Updates governance artifacts (`STEP_LOG`, `TRANSFORM_LOG`, `ASSUMPTIONS_LOG`)

**WHY IT MATTERS**  
All subsequent preparation and analysis steps rely on authentic scoring data.  
Rows with zero or missing scores usually indicate incomplete rounds, test entries, or device-sync placeholders.  
Filtering them here ensures that later KPI and round-completeness logic operate on clean, valid records.

**OUTPUTS**  
- `golf_valid` – filtered DataFrame of scored holes only  
- `STEP_LOG`, `TRANSFORM_LOG`, and `ASSUMPTIONS_LOG` updated  
- Reviewer preview of dropped records for transparency
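
The scoring rule can be checked on a toy frame before it is applied to the full dataset. The sketch below uses made-up rows and shows that zero and missing scores both fall out of the mask.

```python
import numpy as np
import pandas as pd

holes = pd.DataFrame({
    "row_id": [1, 2, 3, 4],
    "hole_score": [4, 0, np.nan, 5],
})

# NaN > 0 evaluates to False, so the comparison alone already drops missing
# scores; notna() is kept to state the rule explicitly
mask_valid = holes["hole_score"].notna() & (holes["hole_score"] > 0)
kept = holes[mask_valid]
dropped = holes[~mask_valid]
```

Keeping the inverted mask around, as the step does for its reviewer preview, makes it cheap to show exactly which rows were excluded.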


In [9]:
# ======================================================
# 3.1.1 Filter Invalid Hole Records
# ======================================================

# ------------------------------------------------------
# 1. Schema gate (ensure we can even filter)
# ------------------------------------------------------
validate_columns(
    golf,
    required_cols=[
        "row_id",
        "player_name",
        "round_dt",
        "facility",
        "course",
        "hole_number",
        "hole_score",
        "round_score",
    ],
    context_name="3.1.1 Filter Invalid Hole Records – source check",
)

# ------------------------------------------------------
# 2. Keep reference to pre-filter state for lineage
# ------------------------------------------------------
_before = golf.copy()

# ------------------------------------------------------
# 3. Apply scoring rule
#    Keep only rows where hole_score is present and > 0
# ------------------------------------------------------
mask_valid = golf["hole_score"].notna() & (golf["hole_score"] > 0)
golf_valid = golf[mask_valid].copy()

# ------------------------------------------------------
# 4. Retention / quality metrics
# ------------------------------------------------------
start_rows = len(golf)
end_rows = len(golf_valid)
pct_retained = (end_rows / start_rows * 100.0) if start_rows > 0 else 0.0
any_zero_scores_left = bool(
    ("hole_score" in golf_valid.columns)
    and (golf_valid["hole_score"] <= 0).any()
)

# ------------------------------------------------------
# 5. Schema gate on the filtered frame (defensive)
# ------------------------------------------------------
validate_columns(
    golf_valid,
    required_cols=[
        "row_id",
        "player_name",
        "round_dt",
        "facility",
        "course",
        "hole_number",
        "hole_score",
        "round_score",
    ],
    context_name="3.1.1 Filter Invalid Hole Records – post-filter",
)

# ------------------------------------------------------
# 6. Record the business assumption
# ------------------------------------------------------
record_assumption(
    text="Rows with hole_score <= 0 or missing are excluded from scoring and round-validity analysis.",
    rationale="These rows typically represent placeholders, incomplete entries, or non-scoring events and would distort round completeness.",
    impact_area="3.1 Round Validity, scoring KPIs",
)

# ------------------------------------------------------
# 7. Lineage logging
# ------------------------------------------------------
track_transform(
    stage_name="3.1.1_filter_invalid_holes",
    df_before=_before,
    df_after=golf_valid,
    notes=(
        f"Filtered to scored holes only (hole_score > 0). "
        f"Retained {end_rows} of {start_rows} rows ({pct_retained:.2f}%)."
    ),
)

# ------------------------------------------------------
# 8. Human-readable pipeline log
# ------------------------------------------------------
log_step(
    step_name="3.1.1 Filter Invalid Hole Records",
    description="Removed non-scoring / placeholder hole rows so all 3.1.x steps operate on real, scored holes.",
    inputs=["golf (phase 2 output)"],
    outputs=["golf_valid"],
    df=golf_valid,
    extra_info={
        "phase": "3.1.1",
        "category": "filter",
        "rows_before": start_rows,
        "rows_after": end_rows,
        "pct_retained": round(pct_retained, 2),
        "non_scoring_rows_dropped": int((~mask_valid).sum()),
        "any_zero_scores_left": any_zero_scores_left,
        "note": "Downstream 3.1.x steps should consume `golf_valid`.",
    },
)

# ------------------------------------------------------
# 9. Reviewer peek (what got dropped)
# ------------------------------------------------------
dropped_rows_preview = _before[~mask_valid].head(2)
display(dropped_rows_preview)


📌 Assumption logged: Rows with hole_score <= 0 or missing are excluded from scoring and round-validity analysis.  | Impact: 3.1 Round Validity, scoring KPIs
🔄 Transform logged: 3.1.1_filter_invalid_holes
   Rows 4287 → 4012 (-275 change)
   Filtered to scored holes only (hole_score > 0). Retained 4012 of 4287 rows (93.59%).
✅ 3.1.1 Filter Invalid Hole Records @ 2025-11-16 19:33:15
   DataFrame shape: 4012 rows × 29 cols
   Removed non-scoring / placeholder hole rows so all 3.1.x steps operate on real, scored holes.
   phase: 3.1.1
   category: filter
   rows_before: 4287
   rows_after: 4012
   pct_retained: 93.59
   non_scoring_rows_dropped: 275
   any_zero_scores_left: False
   note: Downstream 3.1.x steps should consume `golf_valid`.


Unnamed: 0,player_name,round_dt,facility,course,round_score,round_fairway_hits,round_putts,round_gir,hole_number,hole_par,hole_fairway_hit_type,hole_fairway_hits,hole_putts,hole_score,hole_gir,shot_club,shot_direction,shot_start_lat,shot_start_lon,shot_end_lat,shot_end_lon,yardage,yardage_to_pin,row_id,yardage_calc,yardage_error,yardage_error_abs,yardage_error_pct,yardage_suspect
30,Saurabh Gupta,2011-04-03 11:46:03,Langston Golf Course,Langston,33,26,7,25.0,4,3,Unknown,0,0,0,No,,,,,,,,,31,,,,,
32,Matt Johnson,2011-04-03 11:46:03,Langston Golf Course,Langston,24,18,6,0.0,4,3,Unknown,0,0,0,No,,,,,,,,,33,,,,,


#### ======================================================
#### 3.1.2 Calculate Holes Scored per Round
#### ======================================================

**INPUTS**  
- DataFrame: `golf_valid` (output from 3.1.1 Filter Invalid Hole Records)  
- Required columns: `player_name`, `round_dt`, `facility`, `course`, `hole_number`, `hole_score`

**WHAT THIS STEP DOES**  
- Aggregates hole-level data to a round-level helper table  
- Counts how many **distinct** holes were actually scored for each `(player_name, round_dt, facility, course)` combination  
- Produces a new DataFrame `holes_per_round` used later to flag partial or non-standard rounds  
- Logs the distribution of hole counts (e.g. 9, 18, or “weird” values) for QA  
- Records the assumption that valid rounds are 9- or 18-hole rounds

**WHY IT MATTERS**  
Round completeness is central to scoring analysis.  
If a round has only 6 or 14 holes logged, we don’t want to treat it the same as a full 9- or 18-hole round.  
By creating this helper early, later steps (3.1.3–3.1.4) can reliably join to it and keep only complete rounds.

**OUTPUTS**  
- `holes_per_round` – round-level helper with `holes_scored_in_round`  
- Governance updated: `STEP_LOG`, `TRANSFORM_LOG`, `ASSUMPTIONS_LOG`  
- Reviewer preview of non-standard hole counts
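
Counting distinct holes rather than rows matters whenever a hole can appear on more than one row, for example one row per tracked shot. The sketch below uses made-up data and a hypothetical `rows_in_round` column to contrast the two counts.

```python
import pandas as pd

shots = pd.DataFrame({
    "player_name": ["A"] * 5,
    "round_dt": ["2011-04-03 11:46:03"] * 5,
    "facility": ["Langston Golf Course"] * 5,
    "course": ["Langston"] * 5,
    # hole 1 appears three times and hole 2 twice
    "hole_number": [1, 1, 1, 2, 2],
})

round_key = ["player_name", "round_dt", "facility", "course"]
per_round = shots.groupby(round_key, as_index=False).agg(
    rows_in_round=("hole_number", "size"),
    holes_scored_in_round=("hole_number", "nunique"),
)
```

A plain row count would report five for this round, while `nunique` reports the two holes that were actually played.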


In [10]:
# ======================================================
# 3.1.2 Calculate Holes Scored per Round
# ======================================================

# ------------------------------------------------------
# 1. Schema gate on source (golf_valid)
# ------------------------------------------------------
validate_columns(
    golf_valid,
    required_cols=[
        "player_name",
        "round_dt",
        "facility",
        "course",
        "hole_number",
        "hole_score",
    ],
    context_name="3.1.2 Calculate Holes Scored per Round – source check",
)

# ------------------------------------------------------
# 2. Build round-level helper (count distinct holes)
# ------------------------------------------------------
holes_per_round = (
    golf_valid
    .groupby(["player_name", "round_dt", "facility", "course"], as_index=False)["hole_number"]
    .nunique()
    .rename(columns={"hole_number": "holes_scored_in_round"})
)

# ------------------------------------------------------
# 3. QA distribution (how many rounds have 9, 18, or weird counts)
# ------------------------------------------------------
holecount_distribution = (
    holes_per_round["holes_scored_in_round"]
    .value_counts()
    .sort_index()
    .to_dict()
)

# ------------------------------------------------------
# 4. Assumption: only 9 or 18 are considered “complete” rounds
# ------------------------------------------------------
record_assumption(
    text="A valid golf round must contain 9 or 18 distinct scored holes.",
    rationale="Non-standard hole counts typically indicate partial logging, practice rounds, or device interruptions.",
    impact_area="3.1 Round Validity; completeness filtering in 3.1.3–3.1.4",
)

# ------------------------------------------------------
# 5. Lineage logging (helper creation)
# ------------------------------------------------------
track_transform(
    stage_name="3.1.2_build_round_hole_counts_helper",
    df_before=None,  # we created a new helper; we didn't mutate golf_valid
    df_after=holes_per_round,
    notes="Created round-level helper (holes_per_round) from golf_valid to support completeness checks.",
    new_cols=["holes_scored_in_round"],
)

# ------------------------------------------------------
# 6. Step log
# ------------------------------------------------------
log_step(
    step_name="3.1.2 Calculate Holes Scored per Round",
    description="Computed distinct holes per (player, round_dt, facility, course) to assess round completeness.",
    inputs=["golf_valid"],
    outputs=["holes_per_round"],
    df=holes_per_round,
    extra_info={
        "phase": "3.1.2",
        "category": "aggregation",
        "unique_rounds_detected": len(holes_per_round),
        "holecount_distribution": holecount_distribution,
        "note": "Rounds with hole counts not in {9, 18} will be flagged/merged in 3.1.3.",
    },
)

# ------------------------------------------------------
# 7. Reviewer peek: show non-standard rounds
# ------------------------------------------------------
nonstandard_rounds = holes_per_round[~holes_per_round["holes_scored_in_round"].isin([9, 18])]
display(nonstandard_rounds.head(30))


📌 Assumption logged: A valid golf round must contain 9 or 18 distinct scored holes.  | Impact: 3.1 Round Validity; completeness filtering in 3.1.3–3.1.4
🔄 Transform logged: 3.1.2_build_round_hole_counts_helper
   Created round-level helper (holes_per_round) from golf_valid to support completeness checks.
✅ 3.1.2 Calculate Holes Scored per Round @ 2025-11-16 19:33:15
   DataFrame shape: 210 rows × 5 cols
   Computed distinct holes per (player, round_dt, facility, course) to assess round completeness.
   phase: 3.1.2
   category: aggregation
   unique_rounds_detected: 210
   holecount_distribution: {3: 1, 4: 2, 9: 101, 14: 1, 18: 105}
   note: Rounds with hole counts not in {9, 18} will be flagged/merged in 3.1.3.


Unnamed: 0,player_name,round_dt,facility,course,holes_scored_in_round
13,Don Price,2012-09-02 13:35:25,River Run Golf Club,River Run,14
23,Jon Whitmore,2011-04-03 11:46:03,Langston Golf Course,Langston,4
29,Matt Johnson,2011-04-03 11:46:03,Langston Golf Course,Langston,3
183,Saurabh Gupta,2011-04-03 11:46:03,Langston Golf Course,Langston,4


#### ======================================================
#### 3.1.3 Attach Round Hole Counts & Identify Partial Rounds
#### ======================================================

**INPUTS**  
- DataFrame: `golf_valid` (output from 3.1.1)  
- DataFrame: `holes_per_round` (output from 3.1.2)  
- Shared key: (`player_name`, `round_dt`, `facility`, `course`)

**WHAT THIS STEP DOES**  
- Merges each scored hole in `golf_valid` with its total `holes_scored_in_round` value  
- Flags any rounds that are **not exactly 9 or 18 holes** as `is_partial_round = True`  
- Creates a review table `partial_rounds` showing all non-standard rounds  
- Logs schema validation, assumptions, and lineage changes

**WHY IT MATTERS**  
Partial or non-standard rounds can distort round-level KPIs and scoring trends.  
By flagging them early, we preserve round completeness integrity before filtering in **3.1.4**.  
This ensures that only legitimate 9- or 18-hole rounds are included in scoring and trend analyses.

**OUTPUTS**  
- `golf_valid` – enriched with `holes_scored_in_round` and `is_partial_round`  
- `partial_rounds` – summary of non-standard rounds  
- Updated `STEP_LOG`, `TRANSFORM_LOG`, and `ASSUMPTIONS_LOG` for governance tracking
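
The merge-and-flag pattern can be sketched on made-up data, with a shortened two-column key for brevity. It also shows what the `validate="m:1"` guard protects against: a helper table with a duplicated round key would otherwise multiply hole rows silently.

```python
import pandas as pd

key = ["player_name", "round_dt"]
holes = pd.DataFrame({
    "player_name": ["A", "A", "B"],
    "round_dt": ["2011-04-03", "2011-04-03", "2012-09-02"],
    "hole_number": [1, 2, 1],
})
counts = pd.DataFrame({
    "player_name": ["A", "B"],
    "round_dt": ["2011-04-03", "2012-09-02"],
    "holes_scored_in_round": [9, 14],
})

# validate="m:1" raises MergeError if `counts` has duplicate keys, so the
# join cannot silently multiply hole rows
merged = holes.merge(counts, on=key, how="left", validate="m:1")
merged["is_partial_round"] = ~merged["holes_scored_in_round"].isin([9, 18])

# A helper with a duplicated key trips the guard
bad_counts = pd.concat([counts, counts.iloc[[0]]], ignore_index=True)
try:
    holes.merge(bad_counts, on=key, how="left", validate="m:1")
    guard_tripped = False
except pd.errors.MergeError:
    guard_tripped = True
```

Because the row count is unchanged after a valid many-to-one join, the lineage log for this step should report a zero row delta.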


In [11]:
# ======================================================
# 3.1.3 Attach Round Hole Counts & Identify Partial Rounds
# ======================================================

# ------------------------------------------------------
# 1. Schema gates on inputs
# ------------------------------------------------------
validate_columns(
    golf_valid,
    required_cols=[
        "player_name",
        "round_dt",
        "facility",
        "course",
        "hole_number",
        "hole_score",
    ],
    context_name="3.1.3 golf_valid input",
)

validate_columns(
    holes_per_round,
    required_cols=[
        "player_name",
        "round_dt",
        "facility",
        "course",
        "holes_scored_in_round",
    ],
    context_name="3.1.3 holes_per_round input",
)

# ------------------------------------------------------
# 2. Keep pre-merge state for lineage
# ------------------------------------------------------
_before = golf_valid.copy()

# ------------------------------------------------------
# 3. Merge per-round hole counts onto every scored row
# ------------------------------------------------------
golf_valid = golf_valid.merge(
    holes_per_round,
    on=["player_name", "round_dt", "facility", "course"],
    how="left",
    validate="m:1",
)

# (optional) normalize dtype for consistency
if "holes_scored_in_round" in golf_valid.columns:
    golf_valid["holes_scored_in_round"] = pd.to_numeric(
        golf_valid["holes_scored_in_round"], errors="coerce"
    ).astype("Int64")

# ------------------------------------------------------
# 4. Flag partial rounds (not exactly 9 or 18)
# ------------------------------------------------------
golf_valid["is_partial_round"] = ~golf_valid["holes_scored_in_round"].isin([9, 18])

# ------------------------------------------------------
# 5. Build reviewer table of partial rounds
# ------------------------------------------------------
partial_rounds = (
    golf_valid[golf_valid["is_partial_round"]]
    [["player_name", "round_dt", "facility", "course", "holes_scored_in_round"]]
    .drop_duplicates()
    .sort_values(["player_name", "round_dt", "facility", "course"])
    .reset_index(drop=True)
)
num_partial_rounds = len(partial_rounds)

# ------------------------------------------------------
# 6. Assumption: partial rounds are excluded downstream
# ------------------------------------------------------
record_assumption(
    text="Rounds with holes_scored_in_round not in {9, 18} are considered partial and will be excluded downstream.",
    rationale="Partial rounds bias round-level KPIs, trends, and sequencing logic.",
    impact_area="3.1 Round Validity",
)

# ------------------------------------------------------
# 7. Lineage logging
# ------------------------------------------------------
track_transform(
    stage_name="3.1.3_attach_round_hole_counts",
    df_before=_before,
    df_after=golf_valid,
    notes=f"Joined per-round hole counts and flagged {num_partial_rounds} partial round(s).",
    new_cols=["holes_scored_in_round", "is_partial_round"],
)

# ------------------------------------------------------
# 8. Step log
# ------------------------------------------------------
log_step(
    step_name="3.1.3 Attach Round Hole Counts & Identify Partial Rounds",
    description="Merged holes_per_round helper onto golf_valid and flagged non-standard (partial) rounds.",
    inputs=["golf_valid (3.1.1)", "holes_per_round (3.1.2)"],
    outputs=["golf_valid", "partial_rounds"],
    df=golf_valid,
    extra_info={
        "phase": "3.1.3",
        "category": "enrichment",
        "partial_rounds_detected": num_partial_rounds,
        "standard_round_rule": "Valid rounds are exactly 9 or 18 distinct scored holes.",
        "note": "3.1.4 will filter out rows where is_partial_round == True.",
    },
)

# ------------------------------------------------------
# 9. Reviewer peek
# ------------------------------------------------------
display(partial_rounds.head(30))


📌 Assumption logged: Rounds with holes_scored_in_round not in {9, 18} are considered partial and will be excluded downstream.  | Impact: 3.1 Round Validity
🔄 Transform logged: 3.1.3_attach_round_hole_counts
   Rows 4012 → 4012 (0 change)
   Joined per-round hole counts and flagged 4 partial round(s).
✅ 3.1.3 Attach Round Hole Counts & Identify Partial Rounds @ 2025-11-16 19:33:15
   DataFrame shape: 4012 rows × 31 cols
   Merged holes_per_round helper onto golf_valid and flagged non-standard (partial) rounds.
   phase: 3.1.3
   category: enrichment
   partial_rounds_detected: 4
   standard_round_rule: Valid rounds are exactly 9 or 18 distinct scored holes.
   note: 3.1.4 will filter out rows where is_partial_round == True.


Unnamed: 0,player_name,round_dt,facility,course,holes_scored_in_round
0,Don Price,2012-09-02 13:35:25,River Run Golf Club,River Run,14
1,Jon Whitmore,2011-04-03 11:46:03,Langston Golf Course,Langston,4
2,Matt Johnson,2011-04-03 11:46:03,Langston Golf Course,Langston,3
3,Saurabh Gupta,2011-04-03 11:46:03,Langston Golf Course,Langston,4
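
The partial-round flag from 3.1.3 can be reproduced on a toy frame. The sketch below is illustrative only: the data is made up, and counting distinct `hole_number` values with `transform("nunique")` is an assumption about how the `holes_per_round` helper derives its counts, not a copy of it.

```python
import pandas as pd

# Toy hole-level data (hypothetical values, not from the GolfShot export).
holes = pd.DataFrame({
    "player_name": ["A"] * 9 + ["B"] * 4,
    "round_dt": pd.to_datetime(["2011-04-03 11:46:03"] * 13),
    "facility": ["Demo GC"] * 13,
    "course": ["Demo"] * 13,
    "hole_number": list(range(1, 10)) + [1, 2, 3, 4],
    "hole_score": [4] * 13,
})

round_keys = ["player_name", "round_dt", "facility", "course"]

# Count distinct scored holes per round and broadcast the count to every hole row.
holes["holes_scored_in_round"] = (
    holes.groupby(round_keys)["hole_number"].transform("nunique")
)

# Same rule as 3.1.3: anything other than 9 or 18 holes is flagged as partial.
holes["is_partial_round"] = ~holes["holes_scored_in_round"].isin([9, 18])

# Reviewer table: one row per partial round.
partial_demo = (
    holes.loc[holes["is_partial_round"], round_keys + ["holes_scored_in_round"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
print(partial_demo)
```

Only player B's 4-hole round lands in `partial_demo`; no rows are dropped at this stage, they are only labeled.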


#### ======================================================
#### 3.1.4 Keep Only Complete Rounds (9 or 18 Holes)
#### ======================================================

**INPUTS**  
- DataFrame: `golf_valid` (output from 3.1.3, already enriched with `holes_scored_in_round`)  
- Required columns:  
  `player_name`, `round_dt`, `facility`, `course`,  
  `hole_number`, `hole_score`, `holes_scored_in_round`

**WHAT THIS STEP DOES**  
- Applies the completeness rule: keep only rounds that have **exactly 9 or 18 scored holes**  
- Drops partial, abandoned, or mislogged rounds from the working dataset  
- Quantifies retention (rows before vs. after) and captures the distribution of surviving round sizes  
- Logs the assumption that only standard 9/18-hole rounds will be used in scoring and trend analysis  
- Updates lineage so auditors can see this was an intentional filter, not silent row loss

**WHY IT MATTERS**  
Round-level analytics (scoring averages, progression, facility comparisons) are only meaningful when the round is complete.  
Including partial rounds would artificially lower round totals (fewer holes means fewer strokes) and distort player trends.  
By locking this rule here, all downstream 3.1.x steps can safely assume “this is a real round.”

**OUTPUTS**  
- `golf_valid`: filtered to complete 9- or 18-hole rounds only  
- Updated `STEP_LOG` and `TRANSFORM_LOG` with retention metrics  
- Reviewer preview of remaining rounds for spot-checking
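
As a rough illustration of the rule and the retention metrics described above, here is a minimal sketch on made-up data. The `round_label` column is a hypothetical stand-in for the real round keys (`player_name`, `round_dt`, `facility`, `course`).

```python
import pandas as pd

# Toy hole-level frame: one 9-hole round, one 18-hole round, one 4-hole round.
# All values are made up for illustration.
demo = pd.DataFrame({
    "round_label": ["r9"] * 9 + ["r18"] * 18 + ["r4"] * 4,
    "holes_scored_in_round": [9] * 9 + [18] * 18 + [4] * 4,
})

rows_before = len(demo)

# Completeness rule: keep rows whose round has exactly 9 or 18 scored holes.
complete = demo[demo["holes_scored_in_round"].isin([9, 18])].copy()

rows_after = len(complete)
pct_retained = rows_after / rows_before * 100.0 if rows_before else None

# Round-size distribution over unique rounds, with plain-int keys for logging.
round_size_dist = {
    int(size): int(count)
    for size, count in (
        complete.drop_duplicates("round_label")["holes_scored_in_round"]
        .value_counts()
        .sort_index()
        .items()
    )
}

print(rows_before, rows_after, round_size_dist)
```

Casting the keys to plain `int` is optional; it only keeps NumPy scalar reprs such as `np.int64(9)` out of logged dictionaries.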


In [12]:
# ======================================================
# 3.1.4 Keep Only Complete Rounds (9 or 18 Holes)
# ======================================================

# ------------------------------------------------------
# 1. Schema gate (must have counts from 3.1.3)
# ------------------------------------------------------
validate_columns(
    golf_valid,
    required_cols=[
        "player_name",
        "round_dt",
        "facility",
        "course",
        "hole_number",
        "hole_score",
        "holes_scored_in_round",
    ],
    context_name="3.1.4 Keep Only Complete Rounds – source check",
)

# ------------------------------------------------------
# 2. Record the completeness assumption (explicit)
# ------------------------------------------------------
record_assumption(
    text="Only rounds with exactly 9 or 18 scored holes are retained in the working dataset.",
    rationale="Partial or mislogged rounds bias round-level KPIs, scoring averages, and sequencing.",
    impact_area="3.1 Round Validity; downstream scoring analysis",
)

# ------------------------------------------------------
# 3. Preserve pre-filter state for lineage
# ------------------------------------------------------
_before = golf_valid.copy()

# ------------------------------------------------------
# 4. Apply completeness rule
# ------------------------------------------------------
golf_valid = golf_valid[golf_valid["holes_scored_in_round"].isin([9, 18])].copy()

# ------------------------------------------------------
# 5. Retention / distribution metrics
# ------------------------------------------------------
rows_before = len(_before)
rows_after = len(golf_valid)
pct_retained = (rows_after / rows_before * 100.0) if rows_before > 0 else None

# round-level size distribution (unique rounds only)
round_size_dist = (
    golf_valid[["player_name", "round_dt", "facility", "course", "holes_scored_in_round"]]
    .drop_duplicates()
    ["holes_scored_in_round"]
    .value_counts()
    .sort_index()
    .to_dict()
)

# ------------------------------------------------------
# 6. Track lineage
# ------------------------------------------------------
track_transform(
    stage_name="3.1.4_keep_only_complete_rounds",
    df_before=_before,
    df_after=golf_valid,
    notes=(
        "Filtered working dataset to retain only rounds with holes_scored_in_round in {9, 18}."
        + (
            f" Retained {rows_after} of {rows_before} rows ({pct_retained:.2f}%)."
            if pct_retained is not None
            else ""
        )
    ),
)

# ------------------------------------------------------
# 7. Step log
# ------------------------------------------------------
log_step(
    step_name="3.1.4 Keep Only Complete Rounds",
    description="Removed partial or incomplete rounds based on the 9/18-hole completeness rule.",
    inputs=["golf_valid (enriched with holes_scored_in_round)"],
    outputs=["golf_valid (complete rounds only)"],
    df=golf_valid,
    extra_info={
        "phase": "3.1.4",
        "category": "filter",
        "rows_before": rows_before,
        "rows_after": rows_after,
        "pct_retained": round(pct_retained, 2) if pct_retained is not None else None,
        "round_size_distribution": round_size_dist,
        "note": "Working dataset now contains only complete 9- or 18-hole rounds; downstream steps can assume completeness.",
    },
)

# ------------------------------------------------------
# 8. Reviewer preview (remaining rounds)
# ------------------------------------------------------
display(
    golf_valid[
        ["player_name", "round_dt", "facility", "course", "holes_scored_in_round"]
    ]
    .drop_duplicates()
    .head(10)
)


📌 Assumption logged: Only rounds with exactly 9 or 18 scored holes are retained in the working dataset.  | Impact: 3.1 Round Validity; downstream scoring analysis
🔄 Transform logged: 3.1.4_keep_only_complete_rounds
   Rows 4012 → 3987 (-25 change)
   Filtered working dataset to retain only rounds with holes_scored_in_round in {9, 18}. Retained 3987 of 4012 rows (99.38%).
✅ 3.1.4 Keep Only Complete Rounds @ 2025-11-16 19:33:15
   DataFrame shape: 3987 rows × 31 cols
   Removed partial or incomplete rounds based on the 9/18-hole completeness rule.
   phase: 3.1.4
   category: filter
   rows_before: 4012
   rows_after: 3987
   pct_retained: 99.38
   round_size_distribution: {np.int64(9): 101, np.int64(18): 105}
   note: Working dataset now contains only complete 9- or 18-hole rounds; downstream steps can assume completeness.


Unnamed: 0,player_name,round_dt,facility,course,holes_scored_in_round
0,Mike Phillips,2011-03-01 00:00:00,Sligo Creek Golf Course,Sligo Creek,18
21,Mike Phillips,2011-04-03 11:46:03,Langston Golf Course,Langston,9
38,Mike Phillips,2011-04-10 12:01:39,Langston Golf Course,Langston,18
56,Mike Phillips,2011-04-23 09:19:31,Sligo Creek Golf Course,Sligo Creek,9
65,Mike Phillips,2011-04-24 07:09:47,The Preserve at Eisenhower Golf Course,The Preserve at Eisenhower Golf Course,18
66,Don Price,2011-04-24 07:09:47,The Preserve at Eisenhower Golf Course,The Preserve at Eisenhower Golf Course,18
101,Mike Phillips,2011-04-30 14:05:36,Sligo Creek Golf Course,Sligo Creek,9
102,Saurabh Gupta,2011-04-30 14:05:36,Sligo Creek Golf Course,Sligo Creek,9
119,Mike Phillips,2011-04-30 19:36:00,Renditions Golf Grand Slam Experience,Renditions,18
137,Jon Whitmore,2011-05-22 16:48:01,Sligo Creek Golf Course,Sligo Creek,9


#### ======================================================
#### 3.1.5 Assign Sequential Round Numbers per Player
#### ======================================================

**INPUTS**  
- DataFrame: `golf_valid` (output from 3.1.4 Keep Only Complete Rounds)  
- Required columns:  
  `player_name`, `round_dt`, `facility`, `course`, `holes_scored_in_round`, `round_score`

**WHAT THIS STEP DOES**  
- Sorts all complete rounds by player and round date to establish true chronological order  
- Assigns each player a sequential round index (`round_no_player`) using a **dense rank** on `round_dt` (1, 2, 3, …), so every hole row in a round shares the same number  
- Logs the assumption that `round_dt` is the correct chronological anchor for round progression  
- Adds lineage tracking and step logging to document this new feature  

**WHY IT MATTERS**  
Player performance improvement can only be measured relative to the order of play.  
By assigning a sequential round number, we enable time-series or “career progression” analysis that is resilient to calendar irregularities or data gaps.  
This forms the basis for “round progression,” moving averages, and skill-trend features in later phases.

**OUTPUTS**  
- `golf_valid`: now includes `round_no_player` on every hole-level row  
- Governance artifacts updated: `STEP_LOG`, `TRANSFORM_LOG`, `ASSUMPTIONS_LOG`  
- Reviewer preview: the first 20 rounds in player-then-date order, for verification
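
The dense-rank behavior described above can be checked on a toy frame. This is a minimal sketch with made-up data; it also shows what happens to a missing `round_dt`, which `rank()` leaves unranked by default.

```python
import pandas as pd

# Toy data: two hole rows per round; player "A" has one row with a missing timestamp.
demo = pd.DataFrame({
    "player_name": ["A", "A", "A", "A", "A", "B", "B"],
    "round_dt": pd.to_datetime([
        "2011-04-10 12:01:39", "2011-04-10 12:01:39",
        "2011-03-01 00:00:00", "2011-03-01 00:00:00",
        None,
        "2011-04-24 07:09:47", "2011-04-24 07:09:47",
    ]),
})

demo["round_no_player"] = (
    demo.groupby("player_name")["round_dt"]
    .rank(method="dense")   # rows with the same round_dt share one rank, no gaps
    .astype("Int64")        # nullable integer keeps <NA> for unranked rows
)

print(demo)
```

Row order does not matter to `rank()`: the earlier round gets 1 even though it appears second in the frame, and the row with a missing timestamp ends up with `<NA>` rather than a number.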


In [13]:
# ======================================================
# 3.1.5 Assign Sequential Round Numbers per Player
# ======================================================

# ------------------------------------------------------
# 1. Schema gate (must be post 3.1.4)
# ------------------------------------------------------
validate_columns(
    golf_valid,
    required_cols=[
        "player_name",
        "round_dt",
        "facility",
        "course",
        "holes_scored_in_round",
        "round_score",
    ],
    context_name="3.1.5 Assign Sequential Round Numbers per Player – source check",
)

# ------------------------------------------------------
# 2. Capture pre-transform state
# ------------------------------------------------------
_before = golf_valid.copy()

# ------------------------------------------------------
# 3. Sort to establish chronology
#    We sort on player → round_dt → facility → course to make the grouping stable.
# ------------------------------------------------------
golf_valid = (
    golf_valid
    .sort_values(by=["player_name", "round_dt", "facility", "course"])
    .copy()
)

# ------------------------------------------------------
# 4. Assign per-player sequential round numbers
#    Using dense rank: 1, 2, 3, ... with no gaps.
#    If round_dt is ever null (NaT), rank() leaves that row unranked,
#    so round_no_player is <NA> there after the Int64 cast.
# ------------------------------------------------------
golf_valid["round_no_player"] = (
    golf_valid.groupby("player_name")["round_dt"]
    .rank(method="dense")
    .astype("Int64")
)

# ------------------------------------------------------
# 5. Assumption: round_dt is the chronological anchor
# ------------------------------------------------------
record_assumption(
    text="Player-level round sequencing uses round_dt as the primary chronology key.",
    rationale="round_dt is the most consistently available temporal signal after 2.x; using it enables progression analysis (1st, 2nd, 3rd round).",
    impact_area="3.1.x round-level feature engineering, trend/progression views",
)

# ------------------------------------------------------
# 6. Lineage logging
# ------------------------------------------------------
track_transform(
    stage_name="3.1.5_assign_round_no_player",
    df_before=_before,
    df_after=golf_valid,
    notes="Added sequential round index per player (dense rank on round_dt).",
    new_cols=["round_no_player"],
)

# ------------------------------------------------------
# 7. Build QA preview
# ------------------------------------------------------
round_sequence_preview = (
    golf_valid[
        [
            "player_name",
            "round_dt",
            "facility",
            "course",
            "holes_scored_in_round",
            "round_score",
            "round_no_player",
        ]
    ]
    .drop_duplicates()
    .sort_values(["player_name", "round_dt"])
    .head(20)
)

unique_rounds_after_seq = (
    golf_valid[["player_name", "round_dt", "facility", "course"]]
    .drop_duplicates()
    .shape[0]
)

# ------------------------------------------------------
# 8. Step log
# ------------------------------------------------------
log_step(
    step_name="3.1.5 Assign Sequential Round Numbers per Player",
    description="Created chronological round index per player for progression/improvement analysis.",
    inputs=["golf_valid (complete rounds only)"],
    outputs=["golf_valid (with round_no_player)"],
    df=golf_valid,
    extra_info={
        "phase": "3.1.5",
        "category": "feature_engineering",
        "unique_rounds_after_seq": unique_rounds_after_seq,
        "preview_rows": len(round_sequence_preview),
        "note": "round_no_player is now available for player progression views.",
    },
)

# ------------------------------------------------------
# 9. Reviewer peek
# ------------------------------------------------------
display(round_sequence_preview)


📌 Assumption logged: Player-level round sequencing uses round_dt as the primary chronology key.  | Impact: 3.1.x round-level feature engineering, trend/progression views
🔄 Transform logged: 3.1.5_assign_round_no_player
   Rows 3987 → 3987 (0 change)
   Added sequential round index per player (dense rank on round_dt).
✅ 3.1.5 Assign Sequential Round Numbers per Player @ 2025-11-16 19:33:15
   DataFrame shape: 3987 rows × 32 cols
   Created chronological round index per player for progression/improvement analysis.
   phase: 3.1.5
   category: feature_engineering
   unique_rounds_after_seq: 206
   preview_rows: 20
   note: round_no_player is now available for player progression views.


Unnamed: 0,player_name,round_dt,facility,course,holes_scored_in_round,round_score,round_no_player
193,David Brooks,2011-06-05 06:58:01,Rock Creek Park Golf Course,Rock Creek Park,18,87,1
408,David Brooks,2011-09-04 08:03:26,Sligo Creek Golf Course,Sligo Creek,9,37,2
534,David Brooks,2012-06-07 07:40:57,East Potomac Park Golf Course,Blue,18,88,3
1178,David Brooks,2015-07-03 06:56:13,East Potomac Park Golf Course,White,9,39,4
1195,David Brooks,2015-07-03 09:07:42,East Potomac Park Golf Course,Red,9,33,5
1853,David Brooks,2018-05-27 08:14:04,Sligo Creek Golf Course,Sligo Creek,9,41,6
66,Don Price,2011-04-24 07:09:47,The Preserve at Eisenhower Golf Course,The Preserve at Eisenhower Golf Course,18,97,1
245,Don Price,2011-06-11 10:55:36,Sligo Creek Golf Course,Sligo Creek,9,47,2
300,Don Price,2011-07-03 12:58:38,Ocean Pines Golf & Country Club,Ocean Pines,18,96,3
461,Don Price,2012-05-27 11:48:40,Eagle's Landing Golf Course,Eagle's Landing,18,102,4


#### ======================================================
#### 3.1.6 Assign Sequential Visit Number per Player × Course
#### ======================================================

**INPUTS**  
- DataFrame: `golf_valid` (output from 3.1.5, already filtered to complete rounds and with `round_no_player`)  
- Required columns:  
  `player_name`, `round_dt`, `facility`, `course`,  
  `holes_scored_in_round`, `round_score`, `round_no_player`

**WHAT THIS STEP DOES**  
- Orders each player’s history **within each facility/course** by round date  
- Assigns a dense visit counter, `round_no_player_course`, that represents:  
  - “this was the player’s 1st time at this course,”  
  - “this was their 2nd time,” etc.  
- Captures course-familiarity over time, which is useful for explaining improvements that are course-specific  
- Logs the assumption that `round_dt` is the correct chronology key for visit ordering  
- Records lineage for auditability

**WHY IT MATTERS**  
Players often improve faster on a course the more they play it.  
By sequencing visits at the `(player, facility, course)` level, we can separate “overall skill growth” from “I know this course better now,” which leads to better coaching and more accurate performance analytics.

**OUTPUTS**  
- `golf_valid`: now includes `round_no_player_course`  
- Updated governance artifacts: `STEP_LOG`, `TRANSFORM_LOG`, `ASSUMPTIONS_LOG`  
- Reviewer preview of visit sequences for manual spot-checking
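
To make the difference between the career-level and course-level counters concrete, here is a small sketch on made-up round-level data. The facility and course names are hypothetical.

```python
import pandas as pd

# Toy round-level data for one player (values are illustrative only).
demo = pd.DataFrame({
    "player_name": ["A"] * 4,
    "facility": ["Alpha GC", "Beta GC", "Alpha GC", "Alpha GC"],
    "course": ["Alpha", "Beta", "Alpha", "Alpha"],
    "round_dt": pd.to_datetime(
        ["2011-09-04", "2012-06-10", "2018-05-27", "2015-07-03"]
    ),
})

# Career-level sequence: order of play across all courses.
demo["round_no_player"] = (
    demo.groupby("player_name")["round_dt"].rank(method="dense").astype("Int64")
)

# Course-level sequence: order of play within each (player, facility, course).
demo["round_no_player_course"] = (
    demo.groupby(["player_name", "facility", "course"])["round_dt"]
    .rank(method="dense")
    .astype("Int64")
)

print(demo.sort_values("round_dt"))
```

The 2018 round is the player's 4th round overall but only the 3rd visit to Alpha, which is the distinction that course-familiarity analysis relies on.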


In [14]:
# ======================================================
# 3.1.6 Assign Sequential Visit Number per Player × Course
# ======================================================

# ------------------------------------------------------
# 1. Schema gate (must follow 3.1.5)
# ------------------------------------------------------
validate_columns(
    golf_valid,
    required_cols=[
        "player_name",
        "round_dt",
        "facility",
        "course",
        "holes_scored_in_round",
        "round_score",
        "round_no_player",  # created in 3.1.5
    ],
    context_name="3.1.6 Assign Sequential Visit Number per Player × Course – source check",
)

# ------------------------------------------------------
# 2. Preserve pre-transform state
# ------------------------------------------------------
_before = golf_valid.copy()

# ------------------------------------------------------
# 3. Sort to ensure deterministic player × course chronology
# ------------------------------------------------------
golf_valid = (
    golf_valid
    .sort_values(["player_name", "facility", "course", "round_dt"])
    .copy()
)

# ------------------------------------------------------
# 4. Assign dense visit numbers per (player, facility, course)
#    This captures course familiarity over time.
# ------------------------------------------------------
golf_valid["round_no_player_course"] = (
    golf_valid
    .groupby(["player_name", "facility", "course"])["round_dt"]
    .rank(method="dense")
    .astype("Int64")
)

# ------------------------------------------------------
# 5. Assumption: round_dt is the visit order key for player × course
# ------------------------------------------------------
record_assumption(
    text="Course-familiarity sequencing uses round_dt within each (player, facility, course) group.",
    rationale="round_dt is the most reliable chronological signal for when a player visited a specific course.",
    impact_area="3.1 round-level feature engineering; facility/course familiarity analysis",
)

# ------------------------------------------------------
# 6. Lineage logging
# ------------------------------------------------------
track_transform(
    stage_name="3.1.6_assign_round_no_player_course",
    df_before=_before,
    df_after=golf_valid,
    notes="Added per-player × facility × course visit sequencing (round_no_player_course).",
    new_cols=["round_no_player_course"],
)

# ------------------------------------------------------
# 7. QA / storytelling preview
# ------------------------------------------------------
round_visit_preview = (
    golf_valid[
        [
            "player_name",
            "round_dt",
            "facility",
            "course",
            "holes_scored_in_round",
            "round_score",
            "round_no_player",
            "round_no_player_course",
        ]
    ]
    .drop_duplicates()
    .sort_values(["player_name", "facility", "course", "round_dt"])
    .head(30)
)

# ------------------------------------------------------
# 8. Summary metrics
# ------------------------------------------------------
num_unique_player_course_pairs = (
    golf_valid[["player_name", "facility", "course"]]
    .drop_duplicates()
    .shape[0]
)

# ------------------------------------------------------
# 9. Step log
# ------------------------------------------------------
log_step(
    step_name="3.1.6 Assign Sequential Visit Number per Player × Course",
    description="Created course-specific visit sequencing (1st visit, 2nd visit, …) for each player.",
    inputs=["golf_valid"],
    outputs=["golf_valid"],
    df=golf_valid,
    extra_info={
        "phase": "3.1.6",
        "category": "feature_engineering",
        "unique_player_course_pairs": num_unique_player_course_pairs,
        "round_visit_preview_rows": len(round_visit_preview),
        "note": "round_no_player_course now available for course-familiarity and adaptation analysis.",
    },
)

# ------------------------------------------------------
# 10. Reviewer peek
# ------------------------------------------------------
display(round_visit_preview)


📌 Assumption logged: Course-familiarity sequencing uses round_dt within each (player, facility, course) group.  | Impact: 3.1 round-level feature engineering; facility/course familiarity analysis
🔄 Transform logged: 3.1.6_assign_round_no_player_course
   Rows 3987 → 3987 (0 change)
   Added per-player × facility × course visit sequencing (round_no_player_course).
✅ 3.1.6 Assign Sequential Visit Number per Player × Course @ 2025-11-16 19:33:15
   DataFrame shape: 3987 rows × 33 cols
   Created course-specific visit sequencing (1st visit, 2nd visit, …) for each player.
   phase: 3.1.6
   category: feature_engineering
   unique_player_course_pairs: 57
   round_visit_preview_rows: 30
   note: round_no_player_course now available for course-familiarity and adaptation analysis.


Unnamed: 0,player_name,round_dt,facility,course,holes_scored_in_round,round_score,round_no_player,round_no_player_course
534,David Brooks,2012-06-07 07:40:57,East Potomac Park Golf Course,Blue,18,88,3,1
1195,David Brooks,2015-07-03 09:07:42,East Potomac Park Golf Course,Red,9,33,5,1
1178,David Brooks,2015-07-03 06:56:13,East Potomac Park Golf Course,White,9,39,4,1
193,David Brooks,2011-06-05 06:58:01,Rock Creek Park Golf Course,Rock Creek Park,18,87,1,1
408,David Brooks,2011-09-04 08:03:26,Sligo Creek Golf Course,Sligo Creek,9,37,2,1
1853,David Brooks,2018-05-27 08:14:04,Sligo Creek Golf Course,Sligo Creek,9,41,6,2
570,Don Price,2012-06-10 07:16:49,Bay Hills Golf Club,Bay Hills,18,102,5,1
669,Don Price,2012-07-07 06:13:21,Bay Hills Golf Club,Bay Hills,18,88,6,2
740,Don Price,2012-08-03 14:07:43,Bay Hills Golf Club,Bay Hills,18,93,7,3
1673,Don Price,2017-06-30 12:26:18,Bay Hills Golf Club,Bay Hills,18,101,12,4


#### ======================================================
#### 3.1.7 Create Round Identifiers & Sequencing
#### ======================================================

**INPUTS**  
- DataFrame: `golf_valid` (after 3.1.6, so it already has player- and course-level sequencing)  
- Required columns:  
  `player_name`, `round_dt`, `facility`, `course`,  
  `round_no_player`, `round_no_player_course`, `holes_scored_in_round`

**WHAT THIS STEP DOES**  
- Builds a **round-level index** (one record per distinct round) from the hole-level data  
- Assigns a durable surrogate key `round_id` to every round  
- Creates a human-readable `round_key` that combines player, timestamp, facility, and course for audit/debugging  
- Merges `round_id` and `round_key` back onto the hole-level `golf_valid` so every hole now knows which round it belongs to  
- Logs the transformation for lineage

**WHY IT MATTERS**  
Hole-level data is great for shot and putting analysis, but most reporting is done at the **round** grain.  
A surrogate `round_id` gives us a stable join key for round-level aggregates, exports, and dimensional tables, even if the original vendor keys change. Because it is assigned from a fixed sort order, re-ingesting the same rounds reproduces the same IDs, while adding rounds can renumber them.  
The audit-friendly `round_key` makes it easy to trace a record back to “who played where and when.”

**OUTPUTS**  
- `golf_valid`: now includes `round_id` and `round_key` on every row  
- `round_index` ‚Äî round-level table (1 row per round) that can be exported or reused  
- Governance updated: `STEP_LOG`, `TRANSFORM_LOG`
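
The round-index pattern described above can be sketched end to end on a toy frame. The data and the `_slug` helper are hypothetical; the calls themselves (`drop_duplicates`, `dt.strftime`, `merge(..., validate="m:1")`) are standard pandas.

```python
import pandas as pd

round_cols = ["player_name", "round_dt", "facility", "course"]

# Toy hole-level data: two rounds, two hole rows each (illustrative values).
holes = pd.DataFrame({
    "player_name": ["B Player", "B Player", "A Player", "A Player"],
    "round_dt": pd.to_datetime(
        ["2012-06-10 07:16:49"] * 2 + ["2011-06-05 06:58:01"] * 2
    ),
    "facility": ["Beta Golf Club"] * 2 + ["Alpha Golf Course"] * 2,
    "course": ["Beta"] * 2 + ["Alpha"] * 2,
    "hole_number": [1, 2, 1, 2],
})

# One row per distinct round, in a deterministic order.
round_index = (
    holes[round_cols].drop_duplicates().sort_values(round_cols).reset_index(drop=True)
)
round_index["round_id"] = range(1, len(round_index) + 1)


def _slug(s: pd.Series) -> pd.Series:
    """Replace whitespace runs with underscores (hypothetical helper)."""
    return s.astype(str).str.replace(r"\s+", "_", regex=True)


round_index["round_key"] = (
    _slug(round_index["player_name"])
    + "__" + round_index["round_dt"].dt.strftime("%Y%m%d_%H%M%S").fillna("NA")
    + "__" + _slug(round_index["facility"])
    + "__" + _slug(round_index["course"])
)

# validate="m:1" raises MergeError if round_index has duplicate round keys.
holes = holes.merge(round_index, on=round_cols, how="left", validate="m:1")
print(holes[["player_name", "hole_number", "round_id", "round_key"]])
```

The `validate="m:1"` argument acts as a cheap guard: if the round index ever stops being unique on its keys, the merge fails loudly instead of silently duplicating hole rows.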


In [15]:
# ======================================================
# 3.1.7 Create Round Identifiers & Sequencing
# ======================================================

# ------------------------------------------------------
# 1. Schema gate – must follow 3.1.5 and 3.1.6
# ------------------------------------------------------
validate_columns(
    golf_valid,
    required_cols=[
        "player_name",
        "round_dt",
        "facility",
        "course",
        "round_no_player",         # from 3.1.5
        "round_no_player_course",  # from 3.1.6
        "holes_scored_in_round",
    ],
    context_name="3.1.7 Create Round Identifiers & Sequencing – source check",
)

# ------------------------------------------------------
# 2. Build a round-level index (1 row per distinct round)
# ------------------------------------------------------
round_index = (
    golf_valid[
        [
            "player_name",
            "round_dt",
            "facility",
            "course",
            "round_no_player",
            "round_no_player_course",
            "holes_scored_in_round",
        ]
    ]
    .drop_duplicates()
    .sort_values(["player_name", "round_dt", "facility", "course"])
    .reset_index(drop=True)
)

# ------------------------------------------------------
# 3. Assign surrogate round_id
# ------------------------------------------------------
round_index["round_id"] = range(1, len(round_index) + 1)

# ------------------------------------------------------
# 4. Create an audit-friendly round_key
#    Make sure NaT in round_dt doesn't break strftime.
# ------------------------------------------------------
round_dt_str = round_index["round_dt"].dt.strftime("%Y%m%d_%H%M%S").fillna("NA")

round_index["round_key"] = (
    round_index["player_name"].astype(str).str.replace(r"\s+", "_", regex=True)
    + "__"
    + round_dt_str
    + "__"
    + round_index["facility"].astype(str).str.replace(r"\s+", "_", regex=True)
    + "__"
    + round_index["course"].astype(str).str.replace(r"\s+", "_", regex=True)
)

# ------------------------------------------------------
# 5. Merge surrogate ID back to hole-level data
# ------------------------------------------------------
_before = golf_valid.copy()

golf_valid = golf_valid.merge(
    round_index[
        [
            "player_name",
            "round_dt",
            "facility",
            "course",
            "round_id",
            "round_key",
        ]
    ],
    on=["player_name", "round_dt", "facility", "course"],
    how="left",
    validate="m:1",
)

# ------------------------------------------------------
# 6. Lineage logging
# ------------------------------------------------------
track_transform(
    stage_name="3.1.7_create_round_id",
    df_before=_before,
    df_after=golf_valid,
    notes="Attached surrogate round_id and audit-friendly round_key to hole-level data.",
    new_cols=["round_id", "round_key"],
)

# ------------------------------------------------------
# 7. Step log
# ------------------------------------------------------
log_step(
    step_name="3.1.7 Create Round Identifiers & Sequencing",
    description="Created round-level surrogate IDs and merged them onto golf_valid for stable joins.",
    inputs=["golf_valid (hole-level)"],
    outputs=["golf_valid (with round_id, round_key)", "round_index (round-level)"],
    df=golf_valid,
    extra_info={
        "phase": "3.1.7",
        "category": "enrichment",
        "distinct_rounds_created": len(round_index),
        "note": "round_index can be exported or joined later for round-level reporting.",
    },
)

# ------------------------------------------------------
# 8. Reviewer peek
# ------------------------------------------------------
display(round_index.head(20))


🔄 Transform logged: 3.1.7_create_round_id
   Rows 3987 → 3987 (0 change)
   Attached surrogate round_id and audit-friendly round_key to hole-level data.
✅ 3.1.7 Create Round Identifiers & Sequencing @ 2025-11-16 19:33:15
   DataFrame shape: 3987 rows × 35 cols
   Created round-level surrogate IDs and merged them onto golf_valid for stable joins.
   phase: 3.1.7
   category: enrichment
   distinct_rounds_created: 206
   note: round_index can be exported or joined later for round-level reporting.


Unnamed: 0,player_name,round_dt,facility,course,round_no_player,round_no_player_course,holes_scored_in_round,round_id,round_key
0,David Brooks,2011-06-05 06:58:01,Rock Creek Park Golf Course,Rock Creek Park,1,1,18,1,David_Brooks__20110605_065801__Rock_Creek_Park...
1,David Brooks,2011-09-04 08:03:26,Sligo Creek Golf Course,Sligo Creek,2,1,9,2,David_Brooks__20110904_080326__Sligo_Creek_Gol...
2,David Brooks,2012-06-07 07:40:57,East Potomac Park Golf Course,Blue,3,1,18,3,David_Brooks__20120607_074057__East_Potomac_Pa...
3,David Brooks,2015-07-03 06:56:13,East Potomac Park Golf Course,White,4,1,9,4,David_Brooks__20150703_065613__East_Potomac_Pa...
4,David Brooks,2015-07-03 09:07:42,East Potomac Park Golf Course,Red,5,1,9,5,David_Brooks__20150703_090742__East_Potomac_Pa...
5,David Brooks,2018-05-27 08:14:04,Sligo Creek Golf Course,Sligo Creek,6,2,9,6,David_Brooks__20180527_081404__Sligo_Creek_Gol...
6,Don Price,2011-04-24 07:09:47,The Preserve at Eisenhower Golf Course,The Preserve at Eisenhower Golf Course,1,1,18,7,Don_Price__20110424_070947__The_Preserve_at_Ei...
7,Don Price,2011-06-11 10:55:36,Sligo Creek Golf Course,Sligo Creek,2,1,9,8,Don_Price__20110611_105536__Sligo_Creek_Golf_C...
8,Don Price,2011-07-03 12:58:38,Ocean Pines Golf & Country Club,Ocean Pines,3,1,18,9,Don_Price__20110703_125838__Ocean_Pines_Golf_&...
9,Don Price,2012-05-27 11:48:40,Eagle's Landing Golf Course,Eagle's Landing,4,1,18,10,Don_Price__20120527_114840__Eagle's_Landing_Go...


#### ======================================================
#### 3.1.8 Round-level Time Validity & Features
#### ======================================================

**FOCUS**  
Establish temporal context and data quality at the **round level**, ensuring that every round has accurate, interpretable, and analysis-ready time features.

**SUB-STEPS**
1. **3.1.8.1 Derive Time Features**  
   Extracts human-readable time fields (`date`, `hour`, `dow`, `part_of_day`, etc.) from `round_dt`.

2. **3.1.8.2 Validate Round Time Quality**  
   Applies quality rules for plausible tee times (05:00–20:59) and flags unrealistic timestamps while retaining all rows.

3. **3.1.8.3 Derive Seasonal & Annual Context**  
   Adds higher-level calendar and seasonal features (`year`, `month`, `season`, `round_year_index`) to support longitudinal analysis.

**KEY BENEFITS**
- Enables rich **time-based analysis** (e.g., weekday vs weekend, seasonality, trend over years)  
- Improves **temporal data quality** through validation flags  
- Supports **continuous improvement tracking** at both round and player-year levels
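
As a rough sketch of the flag-don't-drop idea behind sub-step 3.1.8.2, the plausible window of 05:00 to 20:59 maps to hours 5 through 20 inclusive. The column name `tee_time_suspect` and the sample timestamps are assumptions made for illustration.

```python
import pandas as pd

# Made-up timestamps covering both sides of the plausible window.
demo = pd.DataFrame({
    "round_dt": pd.to_datetime([
        "2011-03-01 00:00:00",  # midnight
        "2011-04-03 11:46:03",
        "2011-04-30 19:36:00",
        "2011-05-01 21:00:00",  # just past the window
    ]),
})

# 05:00-20:59 corresponds to hour values 5..20 inclusive.
hour = demo["round_dt"].dt.hour

# Hypothetical flag name; rows are labeled, not removed.
demo["tee_time_suspect"] = ~hour.between(5, 20)

print(demo)
```

All four rows survive; only the midnight and 21:00 rows are marked, so downstream steps can decide how to treat them.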


##### ======================================================
##### 3.1.8.1 Derive Time Features
##### ======================================================

**INPUTS**  
- DataFrame: `golf_valid` (output from 3.1.7, now with round-level identifiers)  
- Required columns:  
  `round_dt`, `player_name`, `facility`, `course`,  
  `round_id`, `round_no_player`, `round_no_player_course`

**WHAT THIS STEP DOES**  
- Converts `round_dt` into a rich set of time-derived features:  
  - `date` (calendar date)  
  - `time` (tee time, HH:MM)  
  - `hour` (integer hour of day)  
  - `dow` (day-of-week short label)  
  - `is_weekend` (boolean flag)  
  - `part_of_day` (Morning / Afternoon / Evening / Night)  
- These contextual dimensions are essential for temporal trend analysis, time-of-day performance splits, and seasonality patterns.  
- Logs the assumption that the timestamp reflects **local tee time** as recorded by GolfShot.

**WHY IT MATTERS**  
Time features allow analysts to connect scoring performance with real-world temporal factors, for example identifying whether early-morning rounds produce better results, or if weekend rounds are more error-prone due to pace-of-play or fatigue.  
This step begins building the "temporal dimension" that enables longitudinal and behavioral insight.

**OUTPUTS**  
- `golf_valid`: enriched with time context fields (`date`, `time`, `hour`, `dow`, `is_weekend`, `part_of_day`)  
- Updated governance: `STEP_LOG`, `TRANSFORM_LOG`, `ASSUMPTIONS_LOG`  
- Reviewer preview showing representative time features per round
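
The bucketing rule described above can be restated for a single timestamp in plain Python. This is a minimal sketch for checking the cut points by hand, not the notebook's pandas implementation; the names `DOW_LABELS`, `part_of_day`, and `derive_time_features` are illustrative assumptions.

```python
from datetime import datetime

# Fixed labels avoid locale-dependent strftime("%a") output.
DOW_LABELS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")


def part_of_day(hour):
    """Bucket an integer hour (0-23) with the cut points used in this step."""
    if hour is None:
        return "Unknown"
    if 5 <= hour < 12:
        return "Morning"
    if 12 <= hour < 16:
        return "Afternoon"
    if 16 <= hour < 21:
        return "Evening"
    if 0 <= hour < 5 or 21 <= hour <= 23:
        return "Night"
    return "Unknown"


def derive_time_features(round_dt):
    """Return the six round-level time fields for one datetime."""
    return {
        "date": round_dt.date().isoformat(),
        "time": round_dt.strftime("%H:%M"),
        "hour": round_dt.hour,
        "dow": DOW_LABELS[round_dt.weekday()],
        "is_weekend": round_dt.weekday() >= 5,  # Saturday=5, Sunday=6
        "part_of_day": part_of_day(round_dt.hour),
    }


# Timestamp taken from the round preview in this notebook.
example = derive_time_features(datetime(2011, 7, 3, 12, 58, 38))
print(example)
```

Keeping the rule in a small pure function like this makes the boundary hours (4, 5, 12, 16, 21) easy to test independently of the DataFrame.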


In [16]:
# ======================================================
# 3.1.8.1 Derive Time Features
# ======================================================

# ------------------------------------------------------
# 1. Schema gate (must include round_dt + round_id)
# ------------------------------------------------------
validate_columns(
    golf_valid,
    required_cols=[
        "round_dt",
        "player_name",
        "facility",
        "course",
        "round_id",
        "round_no_player",
        "round_no_player_course",
    ],
    context_name="3.1.8.1 Derive Time Features – source check",
)

# ------------------------------------------------------
# 2. Confirm round_dt type integrity
# ------------------------------------------------------
if not pd.api.types.is_datetime64_any_dtype(golf_valid["round_dt"]):
    raise TypeError(
        "[3.1.8.1 Derive Time Features] 'round_dt' must be datetime64 dtype. "
        "Confirm parsing was handled in Step 2.2 (Standardize Schema & Datatypes)."
    )

_before = golf_valid.copy()

# ------------------------------------------------------
# 3. Derive calendar/time-based features
# ------------------------------------------------------
golf_valid["date"] = golf_valid["round_dt"].dt.date.astype("string")
golf_valid["time"] = golf_valid["round_dt"].dt.strftime("%H:%M").astype("string")
golf_valid["hour"] = golf_valid["round_dt"].dt.hour.astype("Int64")
golf_valid["dow"] = golf_valid["round_dt"].dt.day_name().str[:3].astype("string")
golf_valid["is_weekend"] = golf_valid["round_dt"].dt.weekday.isin([5, 6])

# ------------------------------------------------------
# 4. Derive part_of_day categorical variable
# ------------------------------------------------------
hour_series = golf_valid["hour"].astype("Int64")

conditions = [
    (hour_series >= 5)  & (hour_series < 12),  # Morning
    (hour_series >= 12) & (hour_series < 16),  # Afternoon
    (hour_series >= 16) & (hour_series < 21),  # Evening
    ((hour_series >= 21) & (hour_series <= 23)) | (hour_series < 5),  # Night
]
choices = ["Morning", "Afternoon", "Evening", "Night"]

golf_valid["part_of_day"] = pd.Series(
    np.select(conditions, choices, default="Unknown"),
    index=golf_valid.index,
).astype("string")

# ------------------------------------------------------
# 5. Record assumption about time derivation
# ------------------------------------------------------
record_assumption(
    text="Derived time-of-day context (hour, DOW, weekend, part_of_day) from round_dt assuming it reflects local tee time.",
    rationale="GolfShot timestamps behave like local start times; using them enables time-based segmentation.",
    impact_area="3.1.8 Round-level temporal features and fatigue/time-of-day performance analysis",
)

# ------------------------------------------------------
# 6. QA preview for reviewers
# ------------------------------------------------------
time_preview = (
    golf_valid[
        [
            "player_name",
            "round_dt",
            "date",
            "time",
            "hour",
            "dow",
            "is_weekend",
            "part_of_day",
            "facility",
            "course",
            "round_id",
            "round_no_player",
            "round_no_player_course",
        ]
    ]
    .drop_duplicates(subset=["player_name", "round_dt", "facility", "course"])
    .sort_values(["player_name", "round_dt"])
    .head(20)
)

part_of_day_counts = (
    golf_valid["part_of_day"].value_counts(dropna=False).sort_index().to_dict()
)

# ------------------------------------------------------
# 7. Lineage tracking
# ------------------------------------------------------
track_transform(
    stage_name="3.1.8.1_derive_time_features",
    df_before=_before,
    df_after=golf_valid,
    notes="Added time-derived context fields (date, time, hour, dow, is_weekend, part_of_day).",
    new_cols=["date", "time", "hour", "dow", "is_weekend", "part_of_day"],
)

# ------------------------------------------------------
# 8. Step log
# ------------------------------------------------------
log_step(
    step_name="3.1.8.1 Derive Time Features",
    description="Derived time-of-day, weekday/weekend, and part-of-day features from round_dt.",
    inputs=["golf_valid (round-level enriched)"],
    outputs=["golf_valid (with temporal features)"],
    df=golf_valid,
    extra_info={
        "phase": "3.1.8.1",
        "category": "feature_engineering",
        "rows_enriched": len(golf_valid),
        "part_of_day_distribution": part_of_day_counts,
        "note": "Temporal context now ready for quality validation in 3.1.8.2.",
    },
)

# ------------------------------------------------------
# 9. Reviewer preview
# ------------------------------------------------------
display(time_preview)


📌 Assumption logged: Derived time-of-day context (hour, DOW, weekend, part_of_day) from round_dt assuming it reflects local tee time.  | Impact: 3.1.8 Round-level temporal features and fatigue/time-of-day performance analysis
🔄 Transform logged: 3.1.8.1_derive_time_features
   Rows 3987 → 3987 (0 change)
   Added time-derived context fields (date, time, hour, dow, is_weekend, part_of_day).
✅ 3.1.8.1 Derive Time Features @ 2025-11-16 19:33:15
   DataFrame shape: 3987 rows × 41 cols
   Derived time-of-day, weekday/weekend, and part-of-day features from round_dt.
   phase: 3.1.8.1
   category: feature_engineering
   rows_enriched: 3987
   part_of_day_distribution: {'Afternoon': 588, 'Evening': 489, 'Morning': 2874, 'Night': 36}
   note: Temporal context now ready for quality validation in 3.1.8.2.


Unnamed: 0,player_name,round_dt,date,time,hour,dow,is_weekend,part_of_day,facility,course,round_id,round_no_player,round_no_player_course
36,David Brooks,2011-06-05 06:58:01,2011-06-05,06:58,6,Sun,True,Morning,Rock Creek Park Golf Course,Rock Creek Park,1,1,1
54,David Brooks,2011-09-04 08:03:26,2011-09-04,08:03,8,Sun,True,Morning,Sligo Creek Golf Course,Sligo Creek,2,2,1
0,David Brooks,2012-06-07 07:40:57,2012-06-07,07:40,7,Thu,False,Morning,East Potomac Park Golf Course,Blue,3,3,1
27,David Brooks,2015-07-03 06:56:13,2015-07-03,06:56,6,Fri,False,Morning,East Potomac Park Golf Course,White,4,4,1
18,David Brooks,2015-07-03 09:07:42,2015-07-03,09:07,9,Fri,False,Morning,East Potomac Park Golf Course,Red,5,5,1
63,David Brooks,2018-05-27 08:14:04,2018-05-27,08:14,8,Sun,True,Morning,Sligo Creek Golf Course,Sligo Creek,6,6,2
261,Don Price,2011-04-24 07:09:47,2011-04-24,07:09,7,Sun,True,Morning,The Preserve at Eisenhower Golf Course,The Preserve at Eisenhower Golf Course,7,1,1
243,Don Price,2011-06-11 10:55:36,2011-06-11,10:55,10,Sat,True,Morning,Sligo Creek Golf Course,Sligo Creek,8,2,1
171,Don Price,2011-07-03 12:58:38,2011-07-03,12:58,12,Sun,True,Afternoon,Ocean Pines Golf & Country Club,Ocean Pines,9,3,1
153,Don Price,2012-05-27 11:48:40,2012-05-27,11:48,11,Sun,True,Morning,Eagle's Landing Golf Course,Eagle's Landing,10,4,1


##### ======================================================
##### 3.1.8.2 Validate Round Time Quality
##### ======================================================

**INPUTS**  
- DataFrame: `golf_valid` (output from 3.1.8.1 with time-derived features)  
- Required columns:  
  `round_dt`, `hour`, `part_of_day`,  
  `player_name`, `facility`, `course`,  
  `round_id`, `round_no_player`, `round_no_player_course`

**WHAT THIS STEP DOES**  
- Applies a business rule that a plausible tee time is **between 05:00 and 20:59**  
- Adds a boolean flag `round_time_valid` to every row  
- For implausible tee times, normalizes `part_of_day` to `"Unknown"` so they don't pollute time-of-day analysis  
- Keeps **all** rows; this is a quality annotation step, not a filtering step  
- Logs how many rounds had suspect times

**WHY IT MATTERS**  
Timestamps in mobile golf apps can be off due to timezone, sync, or manual-entry quirks.  
If we don't flag those out-of-band times, any analysis of "morning vs afternoon" or "weekday vs weekend" can get skewed.  
By annotating (not deleting) these rows, we preserve scoring data but make it easy to filter on quality later.

**OUTPUTS**  
- `golf_valid`: now includes `round_time_valid` and cleaned `part_of_day`  
- Governance updated: `STEP_LOG`, `TRANSFORM_LOG`, `ASSUMPTIONS_LOG`  
- Reviewer preview of rounds with suspect times
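
As a rough illustration of the annotate-don't-filter rule, the sketch below applies the 05:00–20:59 window to a few round records in plain Python. The helper names (`round_time_valid`, `annotate_time_quality`) and the choice to treat a missing hour as invalid are assumptions made for this example; the hours and round IDs come from the previews in this notebook.

```python
def round_time_valid(hour):
    """True when the tee-time hour falls in the 05:00-20:59 window."""
    # Assumption for this sketch: a missing hour counts as not valid.
    return hour is not None and 5 <= hour <= 20


def annotate_time_quality(rounds):
    """Return new records with a validity flag; no record is dropped."""
    annotated = []
    for record in rounds:
        valid = round_time_valid(record["hour"])
        annotated.append(
            {
                **record,
                "round_time_valid": valid,
                # Suspect rounds keep their scores but lose their time bucket.
                "part_of_day": record["part_of_day"] if valid else "Unknown",
            }
        )
    return annotated


rounds = [
    {"round_id": 9, "hour": 12, "part_of_day": "Afternoon"},
    {"round_id": 28, "hour": 0, "part_of_day": "Night"},
    {"round_id": 41, "hour": 23, "part_of_day": "Night"},
]
annotated = annotate_time_quality(rounds)
suspect_ids = [r["round_id"] for r in annotated if not r["round_time_valid"]]
print(len(annotated), suspect_ids)
```

Because every hour labelled `Night` in 3.1.8.1 (21–23 and 0–4) lies outside the valid window, this rule moves all `Night` rows to `Unknown`.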


In [17]:
# ======================================================
# 3.1.8.2 Validate Round Time Quality
# ======================================================

# ------------------------------------------------------
# 1. Schema gate – must follow 3.1.8.1
# ------------------------------------------------------
validate_columns(
    golf_valid,
    required_cols=[
        "round_dt",
        "hour",
        "part_of_day",
        "player_name",
        "facility",
        "course",
        "round_id",
        "round_no_player",
        "round_no_player_course",
    ],
    context_name="3.1.8.2 Validate Round Time Quality – source check",
)

_before = golf_valid.copy()

# ------------------------------------------------------
# 2. Apply business rule for plausible tee times
#    Valid = 05:00–20:59 (inclusive)
# ------------------------------------------------------
golf_valid["round_time_valid"] = golf_valid["hour"].between(5, 20, inclusive="both")

# ------------------------------------------------------
# 3. Normalize part_of_day for invalid times
#    We keep the row but force the bucket to "Unknown"
# ------------------------------------------------------
invalid_mask = ~golf_valid["round_time_valid"]
golf_valid.loc[invalid_mask, "part_of_day"] = "Unknown"
golf_valid["part_of_day"] = golf_valid["part_of_day"].astype("string")

# ------------------------------------------------------
# 4. Build preview of suspect rounds (for reviewer)
# ------------------------------------------------------
invalid_rounds_preview = (
    golf_valid.loc[
        invalid_mask,
        [
            "player_name",
            "round_dt",
            "date",
            "time",
            "hour",
            "dow",
            "facility",
            "course",
            "round_time_valid",
            "part_of_day",
            "round_id",
            "round_no_player",
            "round_no_player_course",
        ],
    ]
    .drop_duplicates(subset=["player_name", "round_dt", "facility", "course"])
    .sort_values(["player_name", "round_dt"])
    .reset_index(drop=True)
)

num_invalid_rounds = len(invalid_rounds_preview)

# ------------------------------------------------------
# 5. Record assumption about suspect times
# ------------------------------------------------------
record_assumption(
    text="Rounds with tee times outside 05:00–20:59 are retained but treated as time-quality-suspect.",
    rationale="Early/late device syncs or timezone mismatches can produce odd timestamps; keeping rows avoids losing scoring data.",
    impact_area="3.1.8 round-level temporal analysis; time-of-day performance splits",
)

# ------------------------------------------------------
# 6. Lineage logging
# ------------------------------------------------------
track_transform(
    stage_name="3.1.8.2_validate_round_time_quality",
    df_before=_before,
    df_after=golf_valid,
    notes="Added round_time_valid flag and normalized part_of_day='Unknown' for implausible tee times.",
    new_cols=["round_time_valid"],
)

# ------------------------------------------------------
# 7. Step log
# ------------------------------------------------------
log_step(
    step_name="3.1.8.2 Validate Round Time Quality",
    description="Flagged unrealistic tee times using 05:00–20:59 rule; kept rows but marked them for exclusion in some analyses.",
    inputs=["golf_valid (with temporal features)"],
    outputs=["golf_valid (with round_time_valid)"],
    df=golf_valid,
    extra_info={
        "phase": "3.1.8.2",
        "category": "data_quality",
        "distinct_rounds_with_suspect_times": num_invalid_rounds,
        "business_rule": "Valid tee times are 05:00–20:59 local; others flagged.",
        "impact_note": "No rows dropped; suspect rounds can be filtered in analysis.",
    },
)

# ------------------------------------------------------
# 8. Reviewer preview
# ------------------------------------------------------
display(invalid_rounds_preview.head(20))


📌 Assumption logged: Rounds with tee times outside 05:00–20:59 are retained but treated as time-quality-suspect.  | Impact: 3.1.8 round-level temporal analysis; time-of-day performance splits
🔄 Transform logged: 3.1.8.2_validate_round_time_quality
   Rows 3987 → 3987 (0 change)
   Added round_time_valid flag and normalized part_of_day='Unknown' for implausible tee times.
✅ 3.1.8.2 Validate Round Time Quality @ 2025-11-16 19:33:15
   DataFrame shape: 3987 rows × 42 cols
   Flagged unrealistic tee times using 05:00–20:59 rule; kept rows but marked them for exclusion in some analyses.
   phase: 3.1.8.2
   category: data_quality
   distinct_rounds_with_suspect_times: 2
   business_rule: Valid tee times are 05:00–20:59 local; others flagged.
   impact_note: No rows dropped; suspect rounds can be filtered in analysis.


Unnamed: 0,player_name,round_dt,date,time,hour,dow,facility,course,round_time_valid,part_of_day,round_id,round_no_player,round_no_player_course
0,Mike Phillips,2011-03-01 00:00:00,2011-03-01,00:00,0,Tue,Sligo Creek Golf Course,Sligo Creek,False,Unknown,28,1,1
1,Mike Phillips,2011-06-30 23:56:35,2011-06-30,23:56,23,Thu,Ocean City Golf Club,Seaside,False,Unknown,41,14,1


##### ======================================================
##### 3.1.8.3 Derive Seasonal & Annual Context
##### ======================================================

**INPUTS**  
- DataFrame: `golf_valid` (enriched with time features from 3.1.8.1–3.1.8.2)  
- Required columns:  
  `round_dt`, `player_name`, `facility`, `course`,  
  `round_id`, `round_no_player`, `round_no_player_course`

**WHAT THIS STEP DOES**  
- Adds higher-level **temporal context** for longitudinal and trend analyses:
  - `year`, `month`, `month_name`, `year_month` (calendar-level)
  - `season` (Winter, Spring, Summer, Fall) using Northern Hemisphere mapping
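  - `round_year_index`: the **nth distinct calendar year of recorded play** for each player (1 for the first year with a round, 2 for the next year with a round, and so on; years with no rounds are skipped, so the index does not count elapsed years)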
- Merges these back to the working dataset for continuity across phases.

**WHY IT MATTERS**  
This step enables **seasonality**, **progression**, and **year-over-year** comparisons.  
By structuring play into time-based dimensions, we can:
- Identify seasonal scoring patterns  
- Detect improvement over multiple years of data  
- Normalize results by season or year in downstream analysis

**OUTPUTS**  
- `golf_valid`: now includes `year`, `month`, `month_name`, `year_month`, `season`, and `round_year_index`  
- Governance updated: `STEP_LOG`, `TRANSFORM_LOG`, `ASSUMPTIONS_LOG`  
- Reviewer preview showing distinct rounds across multiple seasons
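
The two derived fields that need more than a direct calendar lookup are `season` and `round_year_index`. The following plain-Python sketch shows one way to express both; `SEASON_BY_MONTH` and `build_round_year_index` are hypothetical names for illustration, and the player/year pairs are taken from the preview rows in this notebook.

```python
# Northern Hemisphere mapping, as stated in this step's assumption.
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}


def build_round_year_index(player_years):
    """Map (player, year) to the 1-based rank of that year for the player."""
    years_by_player = {}
    for player, year in player_years:
        years_by_player.setdefault(player, set()).add(year)
    index = {}
    for player, years in years_by_player.items():
        # Ranking distinct years means gap years are skipped, not counted.
        for rank, year in enumerate(sorted(years), start=1):
            index[(player, year)] = rank
    return index


rounds = [
    ("David Brooks", 2011), ("David Brooks", 2011), ("David Brooks", 2012),
    ("David Brooks", 2015), ("David Brooks", 2015), ("David Brooks", 2018),
    ("Don Price", 2011), ("Don Price", 2012),
]
year_index = build_round_year_index(rounds)
print(year_index[("David Brooks", 2015)], SEASON_BY_MONTH.get(7, "Unknown"))
```

A dictionary lookup with a default keeps the month mapping safe for unexpected values, which is the same role the `"Unknown"` fallback plays in the notebook's function.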


In [18]:
# ======================================================
# 3.1.8.3 Derive Seasonal & Annual Context
# ======================================================

# ------------------------------------------------------
# 1. Schema validation
# ------------------------------------------------------
validate_columns(
    golf_valid,
    required_cols=[
        "round_dt",
        "player_name",
        "facility",
        "course",
        "round_id",
        "round_no_player",
        "round_no_player_course",
    ],
    context_name="3.1.8.3 Derive Seasonal & Annual Context – source check",
)

_before = golf_valid.copy()

# ------------------------------------------------------
# 2. Derive calendar features
# ------------------------------------------------------
golf_valid["year"] = golf_valid["round_dt"].dt.year.astype("Int64")
golf_valid["month"] = golf_valid["round_dt"].dt.month.astype("Int64")
golf_valid["month_name"] = golf_valid["round_dt"].dt.month_name().str[:3].astype("string")
golf_valid["year_month"] = golf_valid["round_dt"].dt.strftime("%Y-%m").astype("string")

# ------------------------------------------------------
# 3. Map month → season
# ------------------------------------------------------
def _month_to_season(m):
    if m in (12, 1, 2):
        return "Winter"
    elif m in (3, 4, 5):
        return "Spring"
    elif m in (6, 7, 8):
        return "Summer"
    elif m in (9, 10, 11):
        return "Fall"
    return "Unknown"

golf_valid["season"] = golf_valid["month"].apply(_month_to_season).astype("string")

# ------------------------------------------------------
# 4. Derive player-relative annual index
# ------------------------------------------------------
player_year_seq = (
    golf_valid[["player_name", "year"]]
    .drop_duplicates()
    .sort_values(["player_name", "year"])
    .reset_index(drop=True)
)
player_year_seq["round_year_index"] = player_year_seq.groupby("player_name").cumcount() + 1

golf_valid = golf_valid.merge(
    player_year_seq,
    on=["player_name", "year"],
    how="left",
    validate="m:1",
)
golf_valid["round_year_index"] = golf_valid["round_year_index"].astype("Int64")

# ------------------------------------------------------
# 5. Governance logging
# ------------------------------------------------------
record_assumption(
    text="Seasons mapped as Winter(12–2), Spring(3–5), Summer(6–8), Fall(9–11).",
    rationale="Standard Northern Hemisphere golf season alignment.",
    impact_area="3.1.8 seasonal and trend analysis",
)

track_transform(
    stage_name="3.1.8.3_derive_seasonal_annual_context",
    df_before=_before,
    df_after=golf_valid,
    notes="Added year/month/season/year_month and round_year_index for longitudinal context.",
    new_cols=["year", "month", "month_name", "year_month", "season", "round_year_index"],
)

# ------------------------------------------------------
# 6. Summary stats and preview
# ------------------------------------------------------
season_dist = golf_valid["season"].value_counts(dropna=False).sort_index().to_dict()
player_years = golf_valid[["player_name", "year"]].drop_duplicates().shape[0]

season_preview = (
    golf_valid[
        [
            "player_name",
            "round_dt",
            "facility",
            "course",
            "year",
            "month",
            "month_name",
            "season",
            "year_month",
            "round_no_player",
            "round_no_player_course",
            "round_year_index",
        ]
    ]
    .drop_duplicates(subset=["player_name", "round_dt", "facility", "course"])
    .sort_values(["player_name", "round_dt"])
    .head(12)
    .reset_index(drop=True)
)

# ------------------------------------------------------
# 7. Log the step
# ------------------------------------------------------
log_step(
    step_name="3.1.8.3 Derive Seasonal & Annual Context",
    description="Added higher-level calendar and seasonal context to each round for trend analysis.",
    inputs=["golf_valid (with time features)"],
    outputs=["golf_valid (with seasonal/annual context)"],
    df=golf_valid,
    extra_info={
        "unique_player_year_pairs": player_years,
        "season_distribution": season_dist,
        "note": "round_year_index enables per-player longitudinal performance review.",
    },
)

# ------------------------------------------------------
# 8. Reviewer preview
# ------------------------------------------------------
display(season_preview)


📌 Assumption logged: Seasons mapped as Winter(12–2), Spring(3–5), Summer(6–8), Fall(9–11).  | Impact: 3.1.8 seasonal and trend analysis
🔄 Transform logged: 3.1.8.3_derive_seasonal_annual_context
   Rows 3987 → 3987 (0 change)
   Added year/month/season/year_month and round_year_index for longitudinal context.
✅ 3.1.8.3 Derive Seasonal & Annual Context @ 2025-11-16 19:33:15
   DataFrame shape: 3987 rows × 48 cols
   Added higher-level calendar and seasonal context to each round for trend analysis.
   unique_player_year_pairs: 41
   season_distribution: {'Fall': 1216, 'Spring': 575, 'Summer': 2178, 'Winter': 18}
   note: round_year_index enables per-player longitudinal performance review.


Unnamed: 0,player_name,round_dt,facility,course,year,month,month_name,season,year_month,round_no_player,round_no_player_course,round_year_index
0,David Brooks,2011-06-05 06:58:01,Rock Creek Park Golf Course,Rock Creek Park,2011,6,Jun,Summer,2011-06,1,1,1
1,David Brooks,2011-09-04 08:03:26,Sligo Creek Golf Course,Sligo Creek,2011,9,Sep,Fall,2011-09,2,1,1
2,David Brooks,2012-06-07 07:40:57,East Potomac Park Golf Course,Blue,2012,6,Jun,Summer,2012-06,3,1,2
3,David Brooks,2015-07-03 06:56:13,East Potomac Park Golf Course,White,2015,7,Jul,Summer,2015-07,4,1,3
4,David Brooks,2015-07-03 09:07:42,East Potomac Park Golf Course,Red,2015,7,Jul,Summer,2015-07,5,1,3
5,David Brooks,2018-05-27 08:14:04,Sligo Creek Golf Course,Sligo Creek,2018,5,May,Spring,2018-05,6,2,4
6,Don Price,2011-04-24 07:09:47,The Preserve at Eisenhower Golf Course,The Preserve at Eisenhower Golf Course,2011,4,Apr,Spring,2011-04,1,1,1
7,Don Price,2011-06-11 10:55:36,Sligo Creek Golf Course,Sligo Creek,2011,6,Jun,Summer,2011-06,2,1,1
8,Don Price,2011-07-03 12:58:38,Ocean Pines Golf & Country Club,Ocean Pines,2011,7,Jul,Summer,2011-07,3,1,1
9,Don Price,2012-05-27 11:48:40,Eagle's Landing Golf Course,Eagle's Landing,2012,5,May,Spring,2012-05,4,1,2


#### ======================================================
#### 3.1.9 Rename Columns for Analysis (Idempotent)
#### ======================================================

**INPUTS**  
- DataFrame: `golf_valid` (finalized round-level working dataset after 3.1.8)  
- Expected columns include:  
  `hole_fairway_hits`, `round_fairway_hits`, `hole_fairway_hit_type`,  
  `holes_scored_in_round`, `is_partial_round`, and `hole_gir`

**WHAT THIS STEP DOES**  
- Standardizes column naming conventions to match a consistent **prefix-based schema**:  
  - `round_` for round-level metrics  
  - `hole_` for hole-level metrics  
  - `shot_` for shot-level metrics  
- Normalizes `hole_gir` to a **boolean** field for analytical consistency  
- Applies only if old names exist (safe to rerun; **idempotent**)

**WHY IT MATTERS**  
Consistent naming conventions make the dataset **predictable, easier to join, and safer for reuse** in future pipelines.  
They also prevent subtle errors when combining tables of different grains (round, hole, or shot).

**OUTPUTS**  
- `golf_valid`: updated with standardized column names and boolean `hole_gir`  
- Governance logs: `STEP_LOG`, `TRANSFORM_LOG`, and `ASSUMPTIONS_LOG`  
- Reviewer preview of renamed columns
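
Idempotency here means that running the step twice gives the same result as running it once. The sketch below checks that property for both operations in plain Python. The helpers `apply_renames` and `normalize_gir` are illustrative assumptions rather than the notebook's functions; the rename pairs are the ones listed in this step.

```python
RENAME_MAP = {
    "hole_fairway_hits": "hole_fairway_strokes",
    "round_fairway_hits": "round_fairway_strokes",
    "hole_fairway_hit_type": "shot_fairway_hit_type",
    "holes_scored_in_round": "round_holes_scored",
    "is_partial_round": "round_is_partial",
}


def apply_renames(columns, rename_map):
    """Rename only the columns whose old name is still present."""
    return [rename_map.get(col, col) for col in columns]


def normalize_gir(value):
    """Map 'yes' (any case, padded) to True; pass booleans through unchanged."""
    if isinstance(value, bool):
        return value
    return isinstance(value, str) and value.strip().lower() == "yes"


columns = ["player_name", "hole_fairway_hits", "is_partial_round", "hole_gir"]
once = apply_renames(columns, RENAME_MAP)
twice = apply_renames(once, RENAME_MAP)

gir_raw = ["Yes", " yes ", "No", None]
gir_once = [normalize_gir(v) for v in gir_raw]
gir_twice = [normalize_gir(v) for v in gir_once]

print(once == twice, gir_once == gir_twice)
```

The boolean pass-through in `normalize_gir` is what makes the second pass safe: without it, an already converted `True` would be compared with `"yes"` as text and come back `False`.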


In [19]:
# ======================================================
# 3.1.9 Rename Columns for Analysis (idempotent)
# ======================================================

# ------------------------------------------------------
# 1. Define canonical rename map (old → new)
# ------------------------------------------------------
rename_map = {
    "hole_fairway_hits": "hole_fairway_strokes",
    "round_fairway_hits": "round_fairway_strokes",
    "hole_fairway_hit_type": "shot_fairway_hit_type",
    "holes_scored_in_round": "round_holes_scored",
    "is_partial_round": "round_is_partial",
}

# ------------------------------------------------------
# 2. Flexible schema gate:
#    For each pair, we require EITHER the old name OR the new name to exist.
#    This keeps the step idempotent (safe to re-run).
# ------------------------------------------------------
required_either = [
    ("hole_fairway_hits", "hole_fairway_strokes"),
    ("round_fairway_hits", "round_fairway_strokes"),
    ("hole_fairway_hit_type", "shot_fairway_hit_type"),
    ("holes_scored_in_round", "round_holes_scored"),
    ("is_partial_round", "round_is_partial"),
]

missing_pairs = []
for old_col, new_col in required_either:
    if (old_col not in golf_valid.columns) and (new_col not in golf_valid.columns):
        missing_pairs.append((old_col, new_col))

# we always expect hole_gir in *some* form by this point
if "hole_gir" not in golf_valid.columns:
    missing_pairs.append(("hole_gir", "hole_gir"))

if missing_pairs:
    flat_missing = [f"{old}|{new}" for (old, new) in missing_pairs]
    raise KeyError(
        "[3.1.9 Rename Columns for Analysis] Missing expected column(s) "
        f"(either old or new name): {flat_missing}"
    )

# ------------------------------------------------------
# 3. Capture pre-state for lineage
# ------------------------------------------------------
_before = golf_valid.copy()
before_cols = golf_valid.columns.tolist()

# ------------------------------------------------------
# 4. Perform renames ONLY where the old name still exists
# ------------------------------------------------------
actual_renames = {}
for old_col, new_col in rename_map.items():
    if old_col in golf_valid.columns:
        actual_renames[old_col] = new_col

if actual_renames:
    golf_valid = golf_valid.rename(columns=actual_renames)

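# ------------------------------------------------------
# 5. Normalize hole_gir to boolean
#    (turns "yes"/"Yes"/"YES" into True; skipped when the column is
#    already boolean, because casting True/False to text and comparing
#    with "yes" would reset every value to False on a re-run)
# ------------------------------------------------------
if not pd.api.types.is_bool_dtype(golf_valid["hole_gir"]):
    golf_valid["hole_gir"] = (
        golf_valid["hole_gir"].astype("string").str.strip().str.lower() == "yes"
    )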

# ------------------------------------------------------
# 6. Assumption: enforce naming convention
# ------------------------------------------------------
record_assumption(
    text="Round-level fields use the prefix 'round_', hole-level fields use 'hole_', and shot-level fields use 'shot_'.",
    rationale="Consistent prefixes make join logic, exports, and downstream feature engineering clearer.",
    impact_area="3.x Data Preparation; final exports",
)

# ------------------------------------------------------
# 7. Lineage logging
# ------------------------------------------------------
track_transform(
    stage_name="3.1.9_rename_columns_for_analysis",
    df_before=_before,
    df_after=golf_valid,
    notes="Standardized mixed-grain columns to round_*, hole_*, shot_* naming and normalized hole_gir to boolean.",
    new_cols=[c for c in golf_valid.columns if c not in before_cols],
    dropped_cols=[c for c in before_cols if c not in golf_valid.columns],
)

# ------------------------------------------------------
# 8. Step log
# ------------------------------------------------------
log_step(
    step_name="3.1.9 Rename Columns for Analysis",
    description="Standardized column prefixes and normalized GIR indicator. Safe to re-run.",
    inputs=["golf_valid"],
    outputs=["golf_valid"],
    df=golf_valid,
    extra_info={
        "phase": "3.1.9",
        "category": "standardization",
        "applied_renames": actual_renames,
        "note": "This is an idempotent cleanup step to make 3.2–3.5 easier to read.",
    },
)

# ------------------------------------------------------
# 9. Reviewer peek
# ------------------------------------------------------
preview_cols = [
    "player_name",
    "round_dt",
    "round_fairway_strokes",
    "round_holes_scored",
    "round_is_partial",
    "hole_number",
    "hole_par",
    "hole_score",
    "hole_fairway_strokes",
    "hole_gir",
    "shot_fairway_hit_type",
]
existing_preview_cols = [c for c in preview_cols if c in golf_valid.columns]
display(golf_valid[existing_preview_cols].head(25))


📌 Assumption logged: Round-level fields use the prefix 'round_', hole-level fields use 'hole_', and shot-level fields use 'shot_'.  | Impact: 3.x Data Preparation; final exports
🔄 Transform logged: 3.1.9_rename_columns_for_analysis
   Rows 3987 → 3987 (0 change)
   Standardized mixed-grain columns to round_*, hole_*, shot_* naming and normalized hole_gir to boolean.
✅ 3.1.9 Rename Columns for Analysis @ 2025-11-16 19:33:15
   DataFrame shape: 3987 rows × 48 cols
   Standardized column prefixes and normalized GIR indicator. Safe to re-run.
   phase: 3.1.9
   category: standardization
   applied_renames: {'hole_fairway_hits': 'hole_fairway_strokes', 'round_fairway_hits': 'round_fairway_strokes', 'hole_fairway_hit_type': 'shot_fairway_hit_type', 'holes_scored_in_round': 'round_holes_scored', 'is_partial_round': 'round_is_partial'}
   note: This is an idempotent cleanup step to make 3.2–3.5 easier to read.


Unnamed: 0,player_name,round_dt,round_fairway_strokes,round_holes_scored,round_is_partial,hole_number,hole_par,hole_score,hole_fairway_strokes,hole_gir,shot_fairway_hit_type
0,David Brooks,2012-06-07 07:40:57,54,18,False,1,4,7,6,False,Unknown
1,David Brooks,2012-06-07 07:40:57,54,18,False,2,4,5,3,False,Unknown
2,David Brooks,2012-06-07 07:40:57,54,18,False,3,5,6,4,False,Unknown
3,David Brooks,2012-06-07 07:40:57,54,18,False,4,3,4,2,False,Unknown
4,David Brooks,2012-06-07 07:40:57,54,18,False,5,4,4,2,True,Unknown
5,David Brooks,2012-06-07 07:40:57,54,18,False,6,5,6,3,True,Unknown
6,David Brooks,2012-06-07 07:40:57,54,18,False,7,4,5,3,False,Unknown
7,David Brooks,2012-06-07 07:40:57,54,18,False,8,3,4,3,False,Unknown
8,David Brooks,2012-06-07 07:40:57,54,18,False,9,4,4,3,False,Unknown
9,David Brooks,2012-06-07 07:40:57,54,18,False,10,4,3,2,True,Unknown


#### ======================================================
#### 3.1.10 Round-level Validity & Features Closeout
#### ======================================================

**OBJECTIVE**  
Formally conclude the round-level portion of Data Preparation (3.1) by validating the schema, confirming governance completeness, and capturing a profile snapshot before proceeding to hole-level enrichment.

**KEY ACTIONS**
- Verified presence of all expected round-level features
- Logged the final transformation (`track_transform`) and governance summary (`log_step`)  
- Declared readiness for the next phase (3.2 Hole-level Validity & Features)

**OUTPUTS**
- `golf_valid`: clean, complete, round-level dataset ready for hole-level analysis  
- Governance updates: `STEP_LOG`, `TRANSFORM_LOG`, `VALIDATION_LOG`  
- Transition metadata noting 3.1 completion and next-phase readiness

**WHY IT MATTERS**
This step establishes a **formal control point**: verifying completeness, audit trails, and quality prior to enrichment.  
It acts as a quality gate inside the **CRISP-DM Data Preparation** phase and is comparable to a **Six Sigma DMAIC "Measure → Analyze" handoff**, ensuring analytical integrity as the dataset advances in grain and complexity.
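
A control point of this kind is only as strong as the column list it checks. The sketch below contrasts a strict check against the full expected list with a lenient check against only the columns already present. `missing_columns` and `assert_schema` are hypothetical helpers written for this example; they are not the notebook's `validate_columns`, whose implementation is not shown in this section.

```python
def missing_columns(present, required):
    """Return the required columns that are absent, in required order."""
    present_set = set(present)
    return [col for col in required if col not in present_set]


def assert_schema(present, required, context_name):
    """Raise KeyError when any required column is absent."""
    missing = missing_columns(present, required)
    if missing:
        raise KeyError(f"[{context_name}] Missing required column(s): {missing}")
    return True


required = ["round_id", "round_time_valid", "season", "round_year_index"]
present = ["round_id", "round_time_valid", "season"]  # one column absent

# Lenient: pre-filtering the list to present columns can never report a gap.
lenient = [col for col in required if col in present]
lenient_gaps = missing_columns(present, lenient)

# Strict: checking the full expected list surfaces the absent column.
strict_gaps = missing_columns(present, required)

print(lenient_gaps, strict_gaps)
```

Which variant fits depends on intent: the lenient form tolerates optional columns, while the strict form is the one that can actually stop a run at the gate.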


In [20]:
# ======================================================
# 3.1.10 Round-level Validity & Features Closeout
# ======================================================

"""
Formal closeout for 3.1.
- Confirm round-level schema is present
- Capture audit/profile snapshot
- Log completion so 3.2 can safely build on `golf_valid`
"""

# ------------------------------------------------------
# 1. Validate final round-level schema
#    (lenient: only expected columns that are present get checked,
#    so a missing column does not fail this gate)
# ------------------------------------------------------
required_final_cols_3_1 = [
    "player_name",
    "round_dt",
    "facility",
    "course",
    "round_id",
    "round_key",
    "round_no_player",
    "round_no_player_course",
    "round_holes_scored",
    "round_fairway_strokes",
    "round_is_partial",
    "round_time_valid",
    "part_of_day",
    "season",
    "year",
    "month",
    "round_year_index",
]

validate_columns(
    golf_valid,
    required_cols=[c for c in required_final_cols_3_1 if c in golf_valid.columns],
    context_name="3.1.10 Round-level Validity & Features Closeout",
)

# ------------------------------------------------------
# 2. Governance logging (transform + step)
# ------------------------------------------------------
track_transform(
    stage_name="3.1.10_round_level_closeout",
    df_before=None,
    df_after=golf_valid,
    notes="Phase 3.1 complete: round-level filters, sequencing, time features, and naming conventions applied.",
)

log_step(
    step_name="3.1.10 Round-level Validity & Features Closeout",
    description="Validated round-level schema, captured audit snapshot, and marked dataset ready for 3.2 Hole-level Validity & Features.",
    inputs=["golf_valid (post 3.1.9)"],
    outputs=["golf_valid (ready for 3.2)"],
    df=golf_valid,
    extra_info={
        "phase_complete": True,
        "validated_columns": [c for c in required_final_cols_3_1 if c in golf_valid.columns],
        "next_step": "3.2 Hole-level Validity & Features",
        "note": "Use the same golf_valid for 3.2; continue appending hole-level fields.",
    },
)

# ------------------------------------------------------
# 3. Optional: render a mini governance dashboard for 3.1
# ------------------------------------------------------
render_governance_summary(current_phase="3.1")


🔄 Transform logged: 3.1.10_round_level_closeout
   Phase 3.1 complete: round-level filters, sequencing, time features, and naming conventions applied.
✅ 3.1.10 Round-level Validity & Features Closeout @ 2025-11-16 19:33:15
   DataFrame shape: 3987 rows × 48 cols
   Validated round-level schema, captured audit snapshot, and marked dataset ready for 3.2 Hole-level Validity & Features.
   phase_complete: True
   validated_columns: ['player_name', 'round_dt', 'facility', 'course', 'round_id', 'round_key', 'round_no_player', 'round_no_player_course', 'round_holes_scored', 'round_fairway_strokes', 'round_is_partial', 'round_time_valid', 'part_of_day', 'season', 'year', 'month', 'round_year_index']
   next_step: 3.2 Hole-level Validity & Features
   note: Use the same golf_valid for 3.2; continue appending hole-level fields.


ts,phase,step_name,description,rows,cols
2025-11-16 19:33:15,3.1,3.1.1 Filter Invalid Hole Records,"Removed non-scoring / placeholder hole rows so all 3.1.x steps operate on real, scored holes.",4012.0,29.0
2025-11-16 19:33:15,3.1,3.1.2 Calculate Holes Scored per Round,"Computed distinct holes per (player, round_dt, facility, course) to assess round completeness.",210.0,5.0
2025-11-16 19:33:15,3.1,3.1.3 Attach Round Hole Counts & Identify Partial Rounds,Merged holes_per_round helper onto golf_valid and flagged non-standard (partial) rounds.,4012.0,31.0
2025-11-16 19:33:15,3.1,3.1.4 Keep Only Complete Rounds,Removed partial or incomplete rounds based on the 9/18-hole completeness rule.,3987.0,31.0
2025-11-16 19:33:15,3.1,3.1.5 Assign Sequential Round Numbers per Player,Created chronological round index per player for progression/improvement analysis.,3987.0,32.0
2025-11-16 19:33:15,3.1,3.1.6 Assign Sequential Visit Number per Player × Course,"Created course-specific visit sequencing (1st visit, 2nd visit, …) for each player.",3987.0,33.0
2025-11-16 19:33:15,3.1,3.1.7 Create Round Identifiers & Sequencing,Created round-level surrogate IDs and merged them onto golf_valid for stable joins.,3987.0,35.0
2025-11-16 19:33:15,3.1,3.1.8.1 Derive Time Features,"Derived time-of-day, weekday/weekend, and part-of-day features from round_dt.",3987.0,41.0
2025-11-16 19:33:15,3.1,3.1.8.2 Validate Round Time Quality,Flagged unrealistic tee times using 05:00–20:59 rule; kept rows but marked them for exclusion in some analyses.,3987.0,42.0
2025-11-16 19:33:15,3.1,3.1.8.3 Derive Seasonal & Annual Context,Added higher-level calendar and seasonal context to each round for trend analysis.,3987.0,48.0

ts,phase,stage_name,notes,rows_before,rows_after,row_delta,new_cols,dropped_cols
2025-11-16 19:33:15,3.1,3.1.1_filter_invalid_holes,Filtered to scored holes only (hole_score > 0). Retained 4012 of 4287 rows (93.59%).,4287.0,4012,-275.0,[],[]
2025-11-16 19:33:15,3.1,3.1.2_build_round_hole_counts_helper,Created round-level helper (holes_per_round) from golf_valid to support completeness checks.,,210,,[holes_scored_in_round],[]
2025-11-16 19:33:15,3.1,3.1.3_attach_round_hole_counts,Joined per-round hole counts and flagged 4 partial round(s).,4012.0,4012,0.0,"[holes_scored_in_round, is_partial_round]",[]
2025-11-16 19:33:15,3.1,3.1.4_keep_only_complete_rounds,"Filtered working dataset to retain only rounds with holes_scored_in_round in {9, 18}. Retained 3987 of 4012 rows (99.38%).",4012.0,3987,-25.0,[],[]
2025-11-16 19:33:15,3.1,3.1.5_assign_round_no_player,Added sequential round index per player (dense rank on round_dt).,3987.0,3987,0.0,[round_no_player],[]
2025-11-16 19:33:15,3.1,3.1.6_assign_round_no_player_course,Added per-player × facility × course visit sequencing (round_no_player_course).,3987.0,3987,0.0,[round_no_player_course],[]
2025-11-16 19:33:15,3.1,3.1.7_create_round_id,Attached surrogate round_id and audit-friendly round_key to hole-level data.,3987.0,3987,0.0,"[round_id, round_key]",[]
2025-11-16 19:33:15,3.1,3.1.8.1_derive_time_features,"Added time-derived context fields (date, time, hour, dow, is_weekend, part_of_day).",3987.0,3987,0.0,"[date, time, hour, dow, is_weekend, part_of_day]",[]
2025-11-16 19:33:15,3.1,3.1.8.2_validate_round_time_quality,Added round_time_valid flag and normalized part_of_day='Unknown' for implausible tee times.,3987.0,3987,0.0,[round_time_valid],[]
2025-11-16 19:33:15,3.1,3.1.8.3_derive_seasonal_annual_context,Added year/month/season/year_month and round_year_index for longitudinal context.,3987.0,3987,0.0,"[year, month, month_name, year_month, season, round_year_index]",[]

ts,context,required_cols,missing_cols,passed
2025-11-16 19:33:15,3.1.1 Filter Invalid Hole Records – source check,"[row_id, player_name, round_dt, facility, course, hole_number, hole_score, round_score]",[],True
2025-11-16 19:33:15,3.1.1 Filter Invalid Hole Records – post-filter,"[row_id, player_name, round_dt, facility, course, hole_number, hole_score, round_score]",[],True
2025-11-16 19:33:15,3.1.2 Calculate Holes Scored per Round – source check,"[player_name, round_dt, facility, course, hole_number, hole_score]",[],True
2025-11-16 19:33:15,3.1.3 golf_valid input,"[player_name, round_dt, facility, course, hole_number, hole_score]",[],True
2025-11-16 19:33:15,3.1.3 holes_per_round input,"[player_name, round_dt, facility, course, holes_scored_in_round]",[],True
2025-11-16 19:33:15,3.1.4 Keep Only Complete Rounds – source check,"[player_name, round_dt, facility, course, hole_number, hole_score, holes_scored_in_round]",[],True
2025-11-16 19:33:15,3.1.5 Assign Sequential Round Numbers per Player – source check,"[player_name, round_dt, facility, course, holes_scored_in_round, round_score]",[],True
2025-11-16 19:33:15,3.1.6 Assign Sequential Visit Number per Player × Course – source check,"[player_name, round_dt, facility, course, holes_scored_in_round, round_score, round_no_player]",[],True
2025-11-16 19:33:15,3.1.7 Create Round Identifiers & Sequencing – source check,"[player_name, round_dt, facility, course, round_no_player, round_no_player_course, holes_scored_in_round]",[],True
2025-11-16 19:33:15,3.1.8.1 Derive Time Features – source check,"[round_dt, player_name, facility, course, round_id, round_no_player, round_no_player_course]",[],True

ts,assumption,rationale,impact_area
2025-11-16 19:33:12,API key will be used to enrich course metadata (slope/rating/etc.).,Needed for course difficulty context in analysis.,Course enrichment / difficulty-adjusted scoring
2025-11-16 19:33:12,Each vendor row ingested in 2.1 is treated as authoritative raw input and receives a unique row_id.,"We need stable row-level lineage for audit, QA, and troubleshooting after filtering/merging.","All downstream analysis, data retention reporting, error investigation"
2025-11-16 19:33:12,"Treat 'Round DateTime (UTC)' as the local tee time recorded by the device, not verified true UTC.","Empirical comparison showed timestamps align to played local time, not actual UTC offsets.","Time-of-day analysis, temporal feature engineering"
2025-11-16 19:33:12,Normalize vendor 'Round GIR' strings into real % on 0–100 scale.,Vendor encodes GIR in scaled percentage form; we need comparable KPIs.,Round-level GIR reporting and trend analysis
2025-11-16 19:33:12,Only 45.6% of rows have full GPS start/end coordinates.,Geospatial/yardage/dispersion analyses must restrict to GPS-complete rows.,Shot-level yardage validation and dispersion mapping
2025-11-16 19:33:12,275 rows contain hole_score <= 0. These rows will be excluded from scoring KPIs.,"hole_score <= 0 indicates incomplete/invalid hole tracking, not real golf results.",Round completeness checks and scoring statistics in Data Preparation
2025-11-16 19:33:12,Vendor yardage validated against geodesic distance with ±5.0% tolerance for shots ≥ 30.0 yards.,Short/green-side shots are GPS-noisy and not useful for club benchmarking.,"Club-distance profiling, dispersion analysis, approach/tee club selection"
2025-11-16 19:33:12,Phase 2 (Data Understanding) closed and archived. Section 3 (Data Preparation) will build from this governed baseline.,Ensures reproducibility and auditability of all downstream transformations.,Change control / data lineage / reproducibility
2025-11-16 19:33:15,Rounds with tee times outside 05:00–20:59 are retained but treated as time-quality-suspect.,Early/late device syncs or timezone mismatches can produce odd timestamps; keeping rows avoids losing scoring data.,3.1.8 round-level temporal analysis; time-of-day performance splits
2025-11-16 19:33:15,"Derived time-of-day context (hour, DOW, weekend, part_of_day) from round_dt assuming it reflects local tee time.",GolfShot timestamps behave like local start times; using them enables time-based segmentation.,3.1.8 Round-level temporal features and fatigue/time-of-day performance analysis


Governance summary rendered.


### ======================================================
### 3.2 Hole-Level Validity & Features
### ======================================================

This section engineers and validates all *hole-level* performance features from the cleaned dataset `golf_valid` produced in 3.1.  
It captures per-hole context, putting efficiency, and short-game outcome signals that underpin the later *shot-level* and *strokes-gained* analyses.

| Step | Title | Purpose |
|------|--------|----------|
| **3.2.1** | Derive Hole-Level Context Features | Adds `hole_par_bucket`, `hole_strokes_over_par`, and descriptive `hole_score_name` (Birdie, Par, Bogey, etc.). |
| **3.2.2** | Derive Hole-Level Putting Flags | Calculates putting performance indicators such as 3-putt frequency, over-expected putts, and GIR vs. non-GIR putting outcomes. |
| **3.2.3** | Derive Hole Outcome Quality Signals | Creates scramble and recovery metrics‚Äîe.g., scramble opportunity/success, wasted GIRs, chip-ins, and scoring/recovery tags. |
| **3.2.4** | Hole-Level Validity & Features Closeout | Performs governance validation, data-quality snapshotting, and final schema confirmation before moving to 3.3 Shot-Level Features. |

Together, these transformations make each hole a *unit of performance analysis*, connecting scoring outcomes to process metrics (GIR, putting, scrambling, recovery).  
All results are logged, profiled, and ready for downstream aggregation and visualization.
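
As a rough illustration of the downstream aggregation these flags are meant to support, the sketch below computes the share of holes with a given flag per group. It uses plain Python rather than the notebook's pandas workflow, and both the `rate_by_group` helper and the sample records are hypothetical, not taken from the dataset.

```python
from collections import defaultdict


def rate_by_group(records, group_key, flag_key):
    """Share of records with a truthy flag, per group value."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        totals[group] += 1
        hits[group] += bool(rec[flag_key])
    return {group: hits[group] / totals[group] for group in totals}


# Hypothetical hole records, for illustration only.
holes = [
    {"hole_par_bucket": "Par 3", "hole_putts_3plus": False},
    {"hole_par_bucket": "Par 4", "hole_putts_3plus": True},
    {"hole_par_bucket": "Par 4", "hole_putts_3plus": False},
    {"hole_par_bucket": "Par 5", "hole_putts_3plus": True},
]

three_putt_rate = rate_by_group(holes, "hole_par_bucket", "hole_putts_3plus")
print(three_putt_rate)  # {'Par 3': 0.0, 'Par 4': 0.5, 'Par 5': 1.0}
```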


#### ======================================================
#### 3.2.1 Derive Hole-Level Context Features
#### ======================================================

**INPUTS**  
- DataFrame: `golf_valid` (output from 3.1.10, already round-valid and renamed)  
- Required columns:  
  `player_name`, `round_dt`, `facility`, `course`,  
  `hole_number`, `hole_par`, `hole_score`,  
  `hole_gir`, `hole_fairway_strokes`

**WHAT THIS STEP DOES**  
- Adds standard, analysis-friendly hole-level attributes:
  - `hole_par_bucket` → labels holes as “Par 3”, “Par 4”, “Par 5”, etc.
  - `hole_strokes_over_par` → quick scoring delta (`hole_score - hole_par`)
  - `hole_score_name` → human-readable outcome (“Birdie”, “Par”, “Bogey”, “Double Bogey”, “Disaster Hole”, etc.)
- Registers the new columns in the governance logs and data dictionary so they’re traceable and reusable downstream.
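
The labelling rule can be sketched as a plain function, independent of pandas. This mirrors the thresholds used in the code cell for this step (a score of 1 on a par 3, 4, or 5 takes precedence; anything four or more over par is a “Disaster Hole”), but the function name `hole_score_name` and the lookup-table layout are illustrative choices, not the notebook's implementation.

```python
# Outcome names for score-minus-par differences between -2 and +3.
SCORE_NAMES = {
    -2: "Eagle",
    -1: "Birdie",
    0: "Par",
    1: "Bogey",
    2: "Double Bogey",
    3: "Triple Bogey",
}


def hole_score_name(par, score):
    """Label a hole outcome from par and score."""
    if par is None or score is None:
        return "Unknown"
    # A hole-in-one is named as such rather than as Eagle/Albatross.
    if par in (3, 4, 5) and score == 1:
        return "Hole in 1"
    diff = score - par
    if diff <= -3:
        return "Albatross"
    if diff >= 4:
        return "Disaster Hole"
    return SCORE_NAMES[diff]


print(hole_score_name(4, 7))     # Triple Bogey
print(hole_score_name(4, 3))     # Birdie
print(hole_score_name(3, 1))     # Hole in 1
print(hole_score_name(4, None))  # Unknown
```

The first two calls use par/score pairs that appear in the reviewer preview for this step; the last two are hypothetical inputs chosen to exercise the special cases.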

**WHY IT MATTERS**  
Most golf performance questions at the hole level boil down to “how did we score on par X holes” or “which holes blew up the round?”  
By standardizing these features here, later sections (putting flags, outcome quality, dispersion) can assume a consistent, labeled hole context without having to recalculate it.

**OUTPUTS**  
- `golf_valid`: enriched with `hole_par_bucket`, `hole_strokes_over_par`, and `hole_score_name`  
- Governance updated: `STEP_LOG`, `TRANSFORM_LOG`, `DATA_DICTIONARIES`  
- Reviewer preview of enriched hole-level records


In [21]:
# ======================================================
# 3.2.1 Derive Hole-Level Context Features
# ======================================================

"""
Enrich the working dataset (golf_valid) with derived hole-level fields to make
scoring analysis, GIR analysis, and hole-type rollups easier.

Adds:
- hole_par_bucket (e.g. "Par 3", "Par 4")
- hole_strokes_over_par (score - par)
- hole_score_name (Birdie, Par, Bogey, etc.)
"""

# ------------------------------------------------------
# 1. Schema gate – must come after 3.1.9
# ------------------------------------------------------
validate_columns(
    golf_valid,
    required_cols=[
        "player_name",
        "round_dt",
        "facility",
        "course",
        "hole_number",
        "hole_par",
        "hole_score",
        "hole_gir",              # normalized in 3.1.9
        "hole_fairway_strokes",  # renamed in 3.1.9
    ],
    context_name="3.2.1 Derive Hole-Level Context Features – source check",
)

golf_before = golf_valid.copy()
before_cols = set(golf_valid.columns)

# ------------------------------------------------------
# 2. Derive hole_par_bucket
# ------------------------------------------------------
golf_valid["hole_par_bucket"] = "Par " + golf_valid["hole_par"].astype("string")

# ------------------------------------------------------
# 3. Derive strokes over par
# ------------------------------------------------------
golf_valid["hole_strokes_over_par"] = golf_valid["hole_score"] - golf_valid["hole_par"]

# ------------------------------------------------------
# 4. Derive human-friendly hole score name
# ------------------------------------------------------
def _hole_score_name(row):
    par_ = row["hole_par"]
    score_ = row["hole_score"]

    # guard against missing
    if pd.isna(par_) or pd.isna(score_):
        return "Unknown"

    # special / rare
    if (par_ in (3, 4, 5)) and (score_ == 1):
        return "Hole in 1"

    diff = score_ - par_

    if diff <= -3:
        return "Albatross"
    elif diff == -2:
        return "Eagle"
    elif diff == -1:
        return "Birdie"
    elif diff == 0:
        return "Par"
    elif diff == 1:
        return "Bogey"
    elif diff == 2:
        return "Double Bogey"
    elif diff == 3:
        return "Triple Bogey"
    else:
        return "Disaster Hole"

golf_valid["hole_score_name"] = golf_valid.apply(_hole_score_name, axis=1).astype("string")

# ------------------------------------------------------
# 5. Lineage / governance
# ------------------------------------------------------
new_cols = [c for c in golf_valid.columns if c not in before_cols]

track_transform(
    stage_name="3.2.1_hole_level_context_features",
    df_before=golf_before,
    df_after=golf_valid,
    notes="Added hole_par_bucket, hole_strokes_over_par, and hole_score_name derived from hole_par and hole_score.",
    new_cols=new_cols,
)

log_step(
    step_name="3.2.1 Derive Hole-Level Context Features",
    description="Created standard hole-level context fields for scoring/GIR analysis.",
    inputs=["golf_valid"],
    outputs=["golf_valid"],
    df=golf_valid,
    extra_info={
        "new_cols": new_cols,
        "distinct_hole_score_names": int(golf_valid["hole_score_name"].nunique()),
        "note": "These fields are safe to reuse in 3.2.2–3.2.3.",
    },
)

# ------------------------------------------------------
# 6. Optional: data dictionary for the new hole-level fields
# ------------------------------------------------------
generate_data_dictionary(
    golf_valid[
        [
            "hole_par_bucket",
            "hole_strokes_over_par",
            "hole_score_name",
            "hole_gir",
            "hole_fairway_strokes",
        ]
    ].copy(),
    table_name="golf_valid_hole_features",
)

# ------------------------------------------------------
# 7. Reviewer peek
# ------------------------------------------------------
display(
    golf_valid[
        [
            "player_name",
            "round_dt",
            "facility",
            "course",
            "hole_number",
            "hole_par",
            "hole_score",
            "hole_par_bucket",
            "hole_strokes_over_par",
            "hole_score_name",
            "hole_gir",
            "hole_fairway_strokes",
        ]
    ].head(25)
)


🔄 Transform logged: 3.2.1_hole_level_context_features
   Rows 3987 → 3987 (0 change)
   Added hole_par_bucket, hole_strokes_over_par, and hole_score_name derived from hole_par and hole_score.
✅ 3.2.1 Derive Hole-Level Context Features @ 2025-11-16 19:33:15
   DataFrame shape: 3987 rows × 51 cols
   Created standard hole-level context fields for scoring/GIR analysis.
   new_cols: ['hole_par_bucket', 'hole_strokes_over_par', 'hole_score_name']
   distinct_hole_score_names: 7
   note: These fields are safe to reuse in 3.2.2–3.2.3.
📘 Data dictionary generated for table 'golf_valid_hole_features' (5 columns).


Unnamed: 0,player_name,round_dt,facility,course,hole_number,hole_par,hole_score,hole_par_bucket,hole_strokes_over_par,hole_score_name,hole_gir,hole_fairway_strokes
0,David Brooks,2012-06-07 07:40:57,East Potomac Park Golf Course,Blue,1,4,7,Par 4,3,Triple Bogey,False,6
1,David Brooks,2012-06-07 07:40:57,East Potomac Park Golf Course,Blue,2,4,5,Par 4,1,Bogey,False,3
2,David Brooks,2012-06-07 07:40:57,East Potomac Park Golf Course,Blue,3,5,6,Par 5,1,Bogey,False,4
3,David Brooks,2012-06-07 07:40:57,East Potomac Park Golf Course,Blue,4,3,4,Par 3,1,Bogey,False,2
4,David Brooks,2012-06-07 07:40:57,East Potomac Park Golf Course,Blue,5,4,4,Par 4,0,Par,True,2
5,David Brooks,2012-06-07 07:40:57,East Potomac Park Golf Course,Blue,6,5,6,Par 5,1,Bogey,True,3
6,David Brooks,2012-06-07 07:40:57,East Potomac Park Golf Course,Blue,7,4,5,Par 4,1,Bogey,False,3
7,David Brooks,2012-06-07 07:40:57,East Potomac Park Golf Course,Blue,8,3,4,Par 3,1,Bogey,False,3
8,David Brooks,2012-06-07 07:40:57,East Potomac Park Golf Course,Blue,9,4,4,Par 4,0,Par,False,3
9,David Brooks,2012-06-07 07:40:57,East Potomac Park Golf Course,Blue,10,4,3,Par 4,-1,Birdie,True,2


#### ======================================================
#### 3.2.2 Derive Hole-Level Putting Flags
#### ======================================================

**INPUTS**  
- DataFrame: `golf_valid` (output from 3.2.1)  
- Required columns:  
  `hole_putts`, `hole_gir`

**WHAT THIS STEP DOES**  
- Adds standardized hole-level putting indicators for deeper short-game performance analysis:
  - `hole_putts_over_expected`: deviation from the baseline expectation of two putts per hole  
  - `hole_putts_3plus`: flags holes with three or more putts  
  - `hole_gir_putts_3plus`: isolates 3+ putts on holes where the green was hit in regulation  
  - `hole_notgir_putts_3plus`: isolates 3+ putts on holes *not* hit in regulation  
- Records lineage and adds these fields to the data dictionary for reuse in later stages.
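
For a single hole, the four indicators reduce to the comparisons sketched below. This is a plain-Python illustration under the two-putt baseline stated above; the `putting_flags` helper is hypothetical, and treating an unknown GIR value (`None`) as belonging to neither split is an assumption of this sketch.

```python
def putting_flags(putts, gir, baseline=2):
    """Return the four putting indicators for one hole."""
    three_plus = putts >= 3
    return {
        "hole_putts_over_expected": putts - baseline,
        "hole_putts_3plus": three_plus,
        # The GIR and non-GIR splits are mutually exclusive; a missing GIR
        # value (None) lands in neither of them.
        "hole_gir_putts_3plus": three_plus and gir is True,
        "hole_notgir_putts_3plus": three_plus and gir is False,
    }


one_putt = putting_flags(putts=1, gir=False)
three_putt_on_gir = putting_flags(putts=3, gir=True)
unknown_gir = putting_flags(putts=4, gir=None)

print(one_putt["hole_putts_over_expected"])        # -1
print(three_putt_on_gir["hole_gir_putts_3plus"])   # True
print(unknown_gir["hole_notgir_putts_3plus"])      # False
```

Because the two splits are exclusive, their counts should sum to the overall 3+ putt count whenever GIR is known for every such hole. The logged run is consistent with that: 211 GIR plus 480 non-GIR equals the 691 total.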

**WHY IT MATTERS**  
Putting is one of the largest differentiators in scoring consistency.  
These standardized flags allow for quick segmentation of putting performance (by GIR vs. non-GIR, over-expected performance, and overall 3-putt frequency) without re-engineering logic in downstream notebooks or dashboards.

**OUTPUTS**  
- `golf_valid` enriched with new putting-related fields  
- Governance logs updated (`STEP_LOG`, `TRANSFORM_LOG`, `DATA_DICTIONARIES`)  
- Reviewer preview confirming correct feature creation


In [22]:
# ======================================================
# 3.2.2 Derive Hole-Level Putting Flags
# ======================================================

"""
Add standardized putting indicators to golf_valid so that putting quality
can be analyzed by GIR vs non-GIR, and by over/under expected putts.
"""

# ------------------------------------------------------
# 1. Schema gate – must follow 3.2.1 and earlier renames
# ------------------------------------------------------
validate_columns(
    golf_valid,
    required_cols=[
        "hole_putts",
        "hole_gir",  # boolean from 3.1.9
    ],
    context_name="3.2.2 Derive Hole-Level Putting Flags – source check",
)

golf_before = golf_valid.copy()
before_cols = set(golf_valid.columns)

# ------------------------------------------------------
# 2. Core putting features
# ------------------------------------------------------

# baseline expectation: 2 putts per hole
golf_valid["hole_putts_over_expected"] = golf_valid["hole_putts"] - 2

# 3+ putts (general)
golf_valid["hole_putts_3plus"] = golf_valid["hole_putts"] >= 3

# 3+ putts on GIR
golf_valid["hole_gir_putts_3plus"] = (
    (golf_valid["hole_gir"] == True) & (golf_valid["hole_putts"] >= 3)
)

# 3+ putts on non-GIR
golf_valid["hole_notgir_putts_3plus"] = (
    (golf_valid["hole_gir"] == False) & (golf_valid["hole_putts"] >= 3)
)

new_cols = [c for c in golf_valid.columns if c not in before_cols]

# ------------------------------------------------------
# 3. Lineage / governance
# ------------------------------------------------------
track_transform(
    stage_name="3.2.2_hole_level_putting_flags",
    df_before=golf_before,
    df_after=golf_valid,
    notes="Added hole-level putting flags and over-expected metric.",
    new_cols=new_cols,
)

log_step(
    step_name="3.2.2 Derive Hole-Level Putting Flags",
    description="Created standardized hole-level putting indicators (over-expected, 3+ putts, GIR/non-GIR splits).",
    inputs=["golf_valid"],
    outputs=["golf_valid"],
    df=golf_valid,
    extra_info={
        "total_rows": int(golf_valid.shape[0]),
        "total_3plus": int(golf_valid["hole_putts_3plus"].sum()),
        "gir_3plus": int(golf_valid["hole_gir_putts_3plus"].sum()),
        "nongir_3plus": int(golf_valid["hole_notgir_putts_3plus"].sum()),
        "note": "These flags support miss/lag-putting analysis and hole-out quality signals.",
    },
)

# ------------------------------------------------------
# 4. Optional: data dictionary for putting-related fields
# ------------------------------------------------------
generate_data_dictionary(
    golf_valid[
        [
            "hole_putts_over_expected",
            "hole_putts_3plus",
            "hole_gir_putts_3plus",
            "hole_notgir_putts_3plus",
        ]
    ].copy(),
    table_name="golf_valid_hole_features_putting",
)

# ------------------------------------------------------
# 5. Reviewer preview
# ------------------------------------------------------
display(
    golf_valid[
        [
            "player_name",
            "facility",
            "course",
            "hole_number",
            "hole_par",
            "hole_score",
            "hole_gir",
            "hole_putts",
            "hole_putts_over_expected",
            "hole_putts_3plus",
            "hole_gir_putts_3plus",
            "hole_notgir_putts_3plus",
        ]
    ].head(2)
)


🔄 Transform logged: 3.2.2_hole_level_putting_flags
   Rows 3987 → 3987 (0 change)
   Added hole-level putting flags and over-expected metric.
✅ 3.2.2 Derive Hole-Level Putting Flags @ 2025-11-16 19:33:15
   DataFrame shape: 3987 rows × 55 cols
   Created standardized hole-level putting indicators (over-expected, 3+ putts, GIR/non-GIR splits).
   total_rows: 3987
   total_3plus: 691
   gir_3plus: 211
   nongir_3plus: 480
   note: These flags support miss/lag-putting analysis and hole-out quality signals.
📘 Data dictionary generated for table 'golf_valid_hole_features_putting' (4 columns).


Unnamed: 0,player_name,facility,course,hole_number,hole_par,hole_score,hole_gir,hole_putts,hole_putts_over_expected,hole_putts_3plus,hole_gir_putts_3plus,hole_notgir_putts_3plus
0,David Brooks,East Potomac Park Golf Course,Blue,1,4,7,False,1,-1,False,False,False
1,David Brooks,East Potomac Park Golf Course,Blue,2,4,5,False,2,0,False,False,False


#### ======================================================
#### 3.2.3 Derive Hole Outcome Quality Signals
#### ======================================================

**INPUTS**  
- DataFrame: `golf_valid` (output from 3.2.2)  
- Required columns:  
  `hole_par`, `hole_score`, `hole_gir`, `hole_fairway_strokes`, `hole_putts`

**WHAT THIS STEP DOES**  
This step derives *qualitative hole-outcome indicators* that explain how each hole's score came about in terms of GIR, non-putt strokes, and putts. Specifically, it adds:  
- `hole_scramble_opportunity`: missed-GIR holes where non-putt strokes (`hole_fairway_strokes`) were below par, so at least one stroke remained to make par  
- `hole_scramble_success`: scramble opportunities converted to par or better with at least one putt (zero-putt saves are captured by `hole_notgir_chip_in` instead)  
- `hole_gir_wasted`: GIR holes that finished over par  
- `hole_notgir_chip_in`: non-GIR holes completed with 0 putts and par-or-better result  
- `hole_is_scoring_chance`: marks all GIR holes as scoring opportunities  
- `hole_is_recovery`: marks all scramble-opportunity holes as recovery situations  
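
The flag definitions can be sketched for a single hole as follows. The conditions follow the code cell for this step, with `non_putt_strokes` standing in for `hole_fairway_strokes`; the `outcome_flags` helper itself is hypothetical and written in plain Python for illustration.

```python
def outcome_flags(par, non_putt_strokes, putts, score, gir):
    """Hole-outcome flags for one hole under the simplified scramble model."""
    missed_gir = gir is False
    par_or_better = score <= par
    # Opportunity: missed the green, but at least one stroke was left for par.
    opportunity = missed_gir and (par - non_putt_strokes) >= 1
    return {
        "hole_scramble_opportunity": opportunity,
        "hole_scramble_success": opportunity and par_or_better and putts >= 1,
        "hole_gir_wasted": gir is True and score > par,
        "hole_notgir_chip_in": missed_gir and putts == 0 and par_or_better,
        "hole_is_scoring_chance": gir is True,
        "hole_is_recovery": opportunity,
    }


# Par 4, six non-putt strokes, one putt: no path to par remained.
blow_up = outcome_flags(par=4, non_putt_strokes=6, putts=1, score=7, gir=False)

# Par 4, three non-putt strokes, two putts: opportunity, but not converted.
missed_save = outcome_flags(par=4, non_putt_strokes=3, putts=2, score=5, gir=False)

# Hypothetical: fourth shot holed from off the green for par.
holed_for_par = outcome_flags(par=4, non_putt_strokes=4, putts=0, score=4, gir=False)

print(blow_up["hole_scramble_opportunity"])        # False
print(missed_save["hole_scramble_opportunity"])    # True
print(missed_save["hole_scramble_success"])        # False
print(holed_for_par["hole_notgir_chip_in"])        # True
print(holed_for_par["hole_scramble_opportunity"])  # False
```

The first two calls reuse values from the reviewer preview for this step. The third shows a boundary of the simplified model: a shot holed on the par stroke counts as a chip-in but not as a scramble opportunity, because no stroke remained after the non-putt strokes.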

These derived indicators allow for later aggregation of *short-game efficiency*, *scrambling rate*, and *wasted opportunities*, all critical dimensions in continuous improvement analysis.
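
As a small worked example of such an aggregation, a scrambling rate is the number of scramble successes divided by the number of scramble opportunities. The sketch below uses the counts logged by this step's run (248 successes out of 1429 opportunities); the `safe_rate` helper is a hypothetical convenience that avoids dividing by zero for groups with no opportunities.

```python
def safe_rate(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    if denominator == 0:
        return None
    return numerator / denominator


# Counts as logged for step 3.2.3 in this run.
scramble_opportunities = 1429
scramble_successes = 248

scramble_rate = safe_rate(scramble_successes, scramble_opportunities)
print(f"Scrambling rate: {scramble_rate:.1%}")
```

This works out to roughly 17%. Because `hole_scramble_success` requires at least one putt, zero-putt saves are not included in the numerator.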

**WHY IT MATTERS**  
Beyond raw scores, golf performance improvement depends on *understanding how each hole unfolded*.  
This logic introduces structured metrics for recovery performance, opportunity conversion, and avoidable errors, bridging the gap between outcome (score) and process (execution quality).

**OUTPUTS**  
- `golf_valid` enriched with 6 new outcome-quality flags  
- Recorded assumption: simplified scramble model based on par minus non-putt strokes  
- Governance artifacts updated (`TRANSFORM_LOG`, `STEP_LOG`, `DATA_DICTIONARIES`)  
- Reviewer preview displaying hole-level outcomes for quick validation


In [23]:
# ======================================================
# 3.2.3 Derive Hole Outcome Quality Signals
# (scramble, wasted GIR, chip-ins, situation tags)
# ======================================================

"""
Add analytic hole-outcome flags so we can explain *why* a hole ended
the way it did (great recovery, wasted GIR, chip-in save, etc.).
These are descriptive, not filtering.
"""

# ------------------------------------------------------
# 1. Schema gate – must follow 3.2.1 and 3.2.2
# ------------------------------------------------------
validate_columns(
    golf_valid,
    required_cols=[
        "hole_par",
        "hole_score",
        "hole_gir",              # boolean from 3.1.9
        "hole_fairway_strokes",  # renamed in 3.1.9
        "hole_putts",
    ],
    context_name="3.2.3 Derive Hole Outcome Quality Signals – source check",
)

golf_before = golf_valid.copy()
before_cols = set(golf_valid.columns)

# ------------------------------------------------------
# 2. Helper: strokes remaining for par after non-putt strokes
#    (i.e. did we have a chance to get up-and-down?)
# ------------------------------------------------------
strokes_remaining_for_par = golf_valid["hole_par"] - golf_valid["hole_fairway_strokes"]

# ------------------------------------------------------
# 3. Scramble opportunity: missed GIR but still had a path to par
# ------------------------------------------------------
golf_valid["hole_scramble_opportunity"] = (
    (golf_valid["hole_gir"] == False) &
    (strokes_remaining_for_par >= 1)
)

# ------------------------------------------------------
# 4. Scramble success: opportunity existed AND we made par or better with at least one putt
#    (zero-putt saves are flagged as chip-ins in step 6)
# ------------------------------------------------------
golf_valid["hole_scramble_success"] = (
    golf_valid["hole_scramble_opportunity"]
    & (golf_valid["hole_score"] <= golf_valid["hole_par"])
    & (golf_valid["hole_putts"] >= 1)
)

# ------------------------------------------------------
# 5. Wasted GIR: we hit the green in regulation but did NOT make par
# ------------------------------------------------------
golf_valid["hole_gir_wasted"] = (
    (golf_valid["hole_gir"] == True)
    & (golf_valid["hole_score"] > golf_valid["hole_par"])
)

# ------------------------------------------------------
# 6. Non-GIR chip-in: missed GIR, zero putts, and still par or better
# ------------------------------------------------------
golf_valid["hole_notgir_chip_in"] = (
    (golf_valid["hole_gir"] == False)
    & (golf_valid["hole_putts"] == 0)
    & (golf_valid["hole_score"] <= golf_valid["hole_par"])
)

# ------------------------------------------------------
# 7. Situation tags – for rollups in later sections
# ------------------------------------------------------
# scoring chance = on in regulation
golf_valid["hole_is_scoring_chance"] = (golf_valid["hole_gir"] == True)

# recovery = scramble was on the table
golf_valid["hole_is_recovery"] = (golf_valid["hole_scramble_opportunity"] == True)

# ------------------------------------------------------
# 8. Governance / lineage
# ------------------------------------------------------
new_cols = [c for c in golf_valid.columns if c not in before_cols]

# assumption: we’re using a simplified scramble model
record_assumption(
    text="Scramble logic uses (missed GIR + had strokes left for par) as the opportunity definition.",
    rationale="We don’t have full lie/shot-type detail, so we approximate scramble chances from par minus non-putt strokes.",
    impact_area="3.2 hole-outcome analytics, short-game performance storytelling",
)

track_transform(
    stage_name="3.2.3_hole_outcome_quality_signals",
    df_before=golf_before,
    df_after=golf_valid,
    notes="Added scramble opportunity/success, wasted GIR, non-GIR chip-ins, and situation tags.",
    new_cols=new_cols,
)

log_step(
    step_name="3.2.3 Derive Hole Outcome Quality Signals",
    description="Derived hole-level outcome flags (scramble, wasted GIR, chip-ins) to explain scoring results.",
    inputs=["golf_valid"],
    outputs=["golf_valid"],
    df=golf_valid,
    extra_info={
        "scramble_opportunities": int(golf_valid["hole_scramble_opportunity"].sum()),
        "scramble_successes": int(golf_valid["hole_scramble_success"].sum()),
        "gir_wasted": int(golf_valid["hole_gir_wasted"].sum()),
        "nongir_chip_ins": int(golf_valid["hole_notgir_chip_in"].sum()),
        "note": "These are descriptive flags; no rows were dropped.",
    },
)

# ------------------------------------------------------
# 9. Optional data dictionary for these signals
# ------------------------------------------------------
generate_data_dictionary(
    golf_valid[
        [
            "hole_scramble_opportunity",
            "hole_scramble_success",
            "hole_gir_wasted",
            "hole_notgir_chip_in",
            "hole_is_scoring_chance",
            "hole_is_recovery",
        ]
    ].copy(),
    table_name="golf_valid_hole_outcome_features",
)

# ------------------------------------------------------
# 10. Reviewer preview
# ------------------------------------------------------
display(
    golf_valid[
        [
            "player_name",
            "facility",
            "course",
            "hole_number",
            "hole_par",
            "hole_fairway_strokes",
            "hole_putts",
            "hole_score",
            "hole_gir",
            "hole_scramble_opportunity",
            "hole_scramble_success",
            "hole_gir_wasted",
            "hole_notgir_chip_in",
            "hole_is_scoring_chance",
            "hole_is_recovery",
        ]
    ].head(3)
)


📌 Assumption logged: Scramble logic uses (missed GIR + had strokes left for par) as the opportunity definition.  | Impact: 3.2 hole-outcome analytics, short-game performance storytelling
🔄 Transform logged: 3.2.3_hole_outcome_quality_signals
   Rows 3987 → 3987 (0 change)
   Added scramble opportunity/success, wasted GIR, non-GIR chip-ins, and situation tags.
✅ 3.2.3 Derive Hole Outcome Quality Signals @ 2025-11-16 19:33:16
   DataFrame shape: 3987 rows × 61 cols
   Derived hole-level outcome flags (scramble, wasted GIR, chip-ins) to explain scoring results.
   scramble_opportunities: 1429
   scramble_successes: 248
   gir_wasted: 204
   nongir_chip_ins: 14
   note: These are descriptive flags; no rows were dropped.
📘 Data dictionary generated for table 'golf_valid_hole_outcome_features' (6 columns).


Unnamed: 0,player_name,facility,course,hole_number,hole_par,hole_fairway_strokes,hole_putts,hole_score,hole_gir,hole_scramble_opportunity,hole_scramble_success,hole_gir_wasted,hole_notgir_chip_in,hole_is_scoring_chance,hole_is_recovery
0,David Brooks,East Potomac Park Golf Course,Blue,1,4,6,1,7,False,False,False,False,False,False,False
1,David Brooks,East Potomac Park Golf Course,Blue,2,4,3,2,5,False,True,False,False,False,False,True
2,David Brooks,East Potomac Park Golf Course,Blue,3,5,4,2,6,False,True,False,False,False,False,True
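
The three preview rows above are consistent with a simple reading of the logged assumption. The sketch below assumes that an opportunity means "missed GIR and non-putt strokes (`hole_fairway_strokes`) still below par", and that success means "opportunity and score at or under par". The function name `flag_scramble` is hypothetical, and the notebook's own derivation earlier in this step may treat edge cases differently.

```python
import pandas as pd

def flag_scramble(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: approximate scramble flags from hole-level columns."""
    out = df.copy()
    missed_gir = ~out["hole_gir"].astype(bool)
    # Assumption: "strokes left for par" = par minus non-putt strokes is at least 1.
    strokes_left = out["hole_par"] - out["hole_fairway_strokes"]
    out["hole_scramble_opportunity"] = missed_gir & (strokes_left >= 1)
    # Assumption: a scramble succeeds when the hole is finished at par or better.
    out["hole_scramble_success"] = out["hole_scramble_opportunity"] & (
        out["hole_score"] <= out["hole_par"]
    )
    return out

# The three preview rows shown above (identifying columns omitted).
preview = pd.DataFrame({
    "hole_par": [4, 4, 5],
    "hole_fairway_strokes": [6, 3, 4],
    "hole_putts": [1, 2, 2],
    "hole_score": [7, 5, 6],
    "hole_gir": [False, False, False],
})

flags = flag_scramble(preview)
print(flags[["hole_scramble_opportunity", "hole_scramble_success"]])
```

On these rows the sketch reproduces the previewed flags: no opportunity on hole 1 (six non-putt strokes on a par 4), and an unconverted opportunity on holes 2 and 3.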


#### ======================================================
#### 3.2.4 Hole-Level Validity & Features Closeout
#### ======================================================

**INPUTS**  
- DataFrame: `golf_valid` (from 3.2.3)  
- Columns validated: all hole-level enrichment fields from 3.2.1–3.2.3  
- Shot-level fields: `shot_start_lat`, `shot_start_lon`, `shot_end_lat`, `shot_end_lon` (used to derive centroids)

---

**WHAT THIS STEP DOES**  
This step serves as the final **governance and spatial enrichment checkpoint** for hole-level data.  
In addition to validating all hole-level features and performance metrics, it introduces **hole-level shot centroids**: the median latitude and longitude of all shots observed for each `(facility, course, hole_number)` combination.  

These centroids (`hole_lat`, `hole_lon`) establish a stable **geospatial anchor** for hole-level analysis and visualization in later phases, ensuring that every hole can be mapped even if full tee and pin coordinates are unavailable.

---

**KEY ACTIONS**  
- ✅ **Schema validation:** Confirms all expected hole-level fields from 3.2.1–3.2.3 are present  
- 📍 **Hole centroid derivation:** Calculates per-hole median shot coordinate (`hole_lat`, `hole_lon`) using all available GPS points  
- 📊 **Data quality snapshot:** Summarizes key hole-level performance metrics (3-putt %, scramble %, GIR wasted %, and number of holes with centroids)  
- 🧩 **Governance updates:** Records assumptions; logs transformations

---

**OUTPUTS**  
- `golf_valid` (now includes `hole_lat`, `hole_lon`, and `hole_gps_points`)  
- Governance logs updated (`ASSUMPTIONS_LOG`, `STEP_LOG`, `TRANSFORM_LOG`)  
- Reviewer-friendly summary table of key hole-level KPIs and spatial coverage  

---

**WHY THIS MATTERS**  
By introducing geospatial centroids, the dataset now supports **hole-level mapping and drill-downs** in Tableau or Power BI, enabling analysts to:
- Visualize dispersion and scoring trends by hole location  
- Align hole-level KPIs with geographic context  
- Seamlessly integrate hole- and shot-level data for performance storytelling in Phase 4  

The dataset emerging from this step is both **validated** and **geospatially enriched**, forming a solid bridge between traditional performance metrics and modern spatial analytics.
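
As a toy illustration of why the centroid uses the median rather than the mean, the sketch below builds a handful of made-up shot rows (the facility name and every coordinate are hypothetical) and derives `hole_lat`, `hole_lon`, and `hole_gps_points` with the same "prefer start, else end" coordinate rule. One deliberately bad GPS fix on hole 1 barely moves the median.

```python
import pandas as pd

# Hypothetical shot rows; the third row carries a bad latitude fix (39.5).
shots = pd.DataFrame({
    "facility": ["Demo GC"] * 5,
    "course": ["Blue"] * 5,
    "hole_number": [1, 1, 1, 2, 2],
    "shot_start_lat": [38.8720, 38.8722, 39.5000, 38.8690, None],
    "shot_start_lon": [-77.0270, -77.0272, -77.0271, -77.0250, None],
    "shot_end_lat": [38.8722, 38.8725, 38.8727, 38.8693, 38.8695],
    "shot_end_lon": [-77.0272, -77.0274, -77.0275, -77.0252, -77.0254],
})

# Prefer the shot start coordinate; fall back to the end coordinate when missing.
coords = shots.assign(
    lat=shots["shot_start_lat"].fillna(shots["shot_end_lat"]),
    lon=shots["shot_start_lon"].fillna(shots["shot_end_lon"]),
).dropna(subset=["lat", "lon"])

centroids = (
    coords
    .groupby(["facility", "course", "hole_number"], as_index=False)
    .agg(
        hole_lat=("lat", "median"),
        hole_lon=("lon", "median"),
        hole_gps_points=("lat", "size"),
    )
)

mean_lat_hole_1 = coords.loc[coords["hole_number"] == 1, "lat"].mean()
print(centroids)
print("mean latitude for hole 1 (pulled by the outlier):", round(mean_lat_hole_1, 4))
```

The median keeps hole 1 anchored near the two plausible fixes, while the mean is dragged toward the outlier. That is the practical reason a median-based anchor suits noisy consumer GPS data.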


In [24]:
# ======================================================
# 3.2.4 Hole-Level Validity & Features Closeout
# (now also derives per-hole shot centroids)
# ======================================================

"""
Closeout checkpoint for hole-level feature engineering (3.2.1–3.2.3).

New in this version:
- For any hole where we have at least one shot-level GPS point, we compute a
  hole-level centroid at grain (facility, course, hole_number) and attach:
    - hole_lat
    - hole_lon
    - hole_gps_points (number of GPS points behind the centroid)
  This gives us a hole anchor we can carry downstream into 3.5.1 when we split
  to golf_holes and golf_shots, so Tableau maps can drill to hole grain.
"""

# ------------------------------------------------------
# 1. Schema validation gate
#    (checks identifying keys plus the hole-level columns created in 3.2.x)
# ------------------------------------------------------
validate_columns(
    golf_valid,
    required_cols=[
        "player_name", "round_dt", "facility", "course", "hole_number",
        "hole_par", "hole_score", "hole_putts", "hole_gir",
        "hole_par_bucket", "hole_strokes_over_par", "hole_score_name",
        "hole_putts_over_expected", "hole_putts_3plus",
        "hole_gir_putts_3plus", "hole_notgir_putts_3plus",
        "hole_scramble_opportunity", "hole_scramble_success",
        "hole_gir_wasted", "hole_notgir_chip_in",
        "hole_is_scoring_chance", "hole_is_recovery",
    ],
    context_name="3.2.4 Hole-Level Validity & Features Closeout",
)

_before = golf_valid.copy()

# ------------------------------------------------------
# 2. Derive per-hole shot centroids (facility × course × hole_number)
#    using available shot GPS
# ------------------------------------------------------
# build a working view of shot-level rows
shot_like = golf_valid[[
    "facility", "course", "hole_number",
    "shot_start_lat", "shot_start_lon",
    "shot_end_lat", "shot_end_lon",
]].copy()

# prefer shot_start_*, else fall back to shot_end_*
shot_like["hole_coord_lat"] = shot_like["shot_start_lat"].where(
    shot_like["shot_start_lat"].notna(),
    shot_like["shot_end_lat"],
)
shot_like["hole_coord_lon"] = shot_like["shot_start_lon"].where(
    shot_like["shot_start_lon"].notna(),
    shot_like["shot_end_lon"],
)

# keep only rows that actually have a usable coord
shot_like = shot_like[
    shot_like["hole_coord_lat"].notna() & shot_like["hole_coord_lon"].notna()
].copy()

if len(shot_like) > 0:
    hole_centroids = (
        shot_like
        .groupby(["facility", "course", "hole_number"], as_index=False)
        .agg(
            hole_lat=("hole_coord_lat", "median"),
            hole_lon=("hole_coord_lon", "median"),
            hole_gps_points=("hole_coord_lat", "size"),
        )
        .sort_values(["facility", "course", "hole_number"])
        .reset_index(drop=True)
    )

    # idempotent merge: don't create _x/_y, and don't blow away existing cols
    golf_valid = golf_valid.merge(
        hole_centroids,
        on=["facility", "course", "hole_number"],
        how="left",
        suffixes=("", "_new"),
    )

    # coalesce in case user reruns the cell
    for col in ["hole_lat", "hole_lon", "hole_gps_points"]:
        new_col = f"{col}_new"
        if new_col in golf_valid.columns:
            if col in golf_valid.columns:
                golf_valid[col] = golf_valid[col].combine_first(golf_valid[new_col])
            else:
                golf_valid[col] = golf_valid[new_col]
            golf_valid.drop(columns=[new_col], inplace=True, errors="ignore")

    print(f"📍 Hole centroids added/updated for {len(hole_centroids)} facility×course×hole combos.")
else:
    print("📍 No shot-level GPS available to derive hole centroids; skipping centroid step.")

# ------------------------------------------------------
# 3. Data quality / summary stats
# ------------------------------------------------------
hole_summary_stats = {
    "total_holes": len(golf_valid),
    "unique_rounds": (
        golf_valid["round_id"].nunique() if "round_id" in golf_valid.columns else None
    ),
    "putt_3plus_rate_pct": round(float(golf_valid["hole_putts_3plus"].mean() * 100), 2)
        if "hole_putts_3plus" in golf_valid.columns else None,
    "scramble_success_rate_pct": (
        round(
            float(
                golf_valid.loc[golf_valid["hole_scramble_opportunity"], "hole_scramble_success"].mean()
                * 100
            ),
            2,
        )
        if (
            "hole_scramble_opportunity" in golf_valid.columns
            and golf_valid["hole_scramble_opportunity"].any()
        )
        else None
    ),
    "gir_wasted_rate_pct": round(float(golf_valid["hole_gir_wasted"].mean() * 100), 2)
        if "hole_gir_wasted" in golf_valid.columns else None,
    # new: how many rows of golf_valid carry a centroid
    # (a row count, not the number of distinct facility×course×hole combos)
    "holes_with_centroids": int(
        golf_valid[["hole_lat", "hole_lon"]].dropna(how="any").shape[0]
    ) if ("hole_lat" in golf_valid.columns and "hole_lon" in golf_valid.columns) else 0,
}

# ------------------------------------------------------
# 4. Governance checkpoints
# ------------------------------------------------------
record_assumption(
    text="Hole-level enrichment (3.2.1–3.2.3) completed and schema locked; hole-level GPS centroids added where shot data exists.",
    rationale="Ensures hole-level visualizations (e.g. Tableau maps) have a consistent spatial anchor.",
    impact_area="3.x Data Preparation → spatial / mapping views",
)

track_transform(
    stage_name="3.2.4_hole_level_closeout",
    df_before=_before,
    df_after=golf_valid,
    notes="Hole-level features finalized; validation, profile, and hole centroids captured.",
)

log_step(
    step_name="3.2.4 Hole-Level Validity & Features Closeout",
    description="Validated hole-level schema, captured profile, added per-hole shot centroids, and logged readiness for next section.",
    inputs=["golf_valid (post 3.2.3)"],
    outputs=["golf_valid (ready for 3.3 / next family)"],
    df=golf_valid,
    extra_info=hole_summary_stats,
)

# ------------------------------------------------------
# 5. Optional: render governance for 3.2 only
# ------------------------------------------------------
render_governance_summary(current_phase="3.2")

# ------------------------------------------------------
# 6. Reviewer-friendly view of the summary
# ------------------------------------------------------
display(
    pd.DataFrame([hole_summary_stats]).T.rename(columns={0: "value"})
)

# optional peek of holes that actually got a centroid
if "hole_lat" in golf_valid.columns and "hole_lon" in golf_valid.columns:
    display(
        golf_valid[
            [
                "player_name",
                "facility",
                "course",
                "hole_number",
                "hole_lat",
                "hole_lon",
            ]
        ]
        .dropna(subset=["hole_lat", "hole_lon"])
        .drop_duplicates()
        .head(40)
    )


📍 Hole centroids added/updated for 126 facility×course×hole combos.
📌 Assumption logged: Hole-level enrichment (3.2.1–3.2.3) completed and schema locked; hole-level GPS centroids added where shot data exists.  | Impact: 3.x Data Preparation → spatial / mapping views
🔄 Transform logged: 3.2.4_hole_level_closeout
   Rows 3987 → 3987 (0 change)
   Hole-level features finalized; validation, profile, and hole centroids captured.
✅ 3.2.4 Hole-Level Validity & Features Closeout @ 2025-11-16 19:33:16
   DataFrame shape: 3987 rows × 64 cols
   Validated hole-level schema, captured profile, added per-hole shot centroids, and logged readiness for next section.
   total_holes: 3987
   unique_rounds: 206
   putt_3plus_rate_pct: 17.33
   scramble_success_rate_pct: 17.35
   gir_wasted_rate_pct: 5.12
   holes_with_centroids: 2408


ts,phase,step_name,description,rows,cols
2025-11-16 19:33:15,3.2,3.2.1 Derive Hole-Level Context Features,Created standard hole-level context fields for scoring/GIR analysis.,3987.0,51.0
2025-11-16 19:33:15,3.2,3.2.2 Derive Hole-Level Putting Flags,"Created standardized hole-level putting indicators (over-expected, 3+ putts, GIR/non-GIR splits).",3987.0,55.0
2025-11-16 19:33:16,3.2,3.2.3 Derive Hole Outcome Quality Signals,"Derived hole-level outcome flags (scramble, wasted GIR, chip-ins) to explain scoring results.",3987.0,61.0
2025-11-16 19:33:16,3.2,3.2.4 Hole-Level Validity & Features Closeout,"Validated hole-level schema, captured profile, added per-hole shot centroids, and logged readiness for next section.",3987.0,64.0

ts,phase,stage_name,notes,rows_before,rows_after,row_delta,new_cols,dropped_cols
2025-11-16 19:33:15,3.2,3.2.1_hole_level_context_features,"Added hole_par_bucket, hole_strokes_over_par, and hole_score_name derived from hole_par and hole_score.",3987.0,3987,0.0,"[hole_par_bucket, hole_strokes_over_par, hole_score_name]",[]
2025-11-16 19:33:15,3.2,3.2.2_hole_level_putting_flags,Added hole-level putting flags and over-expected metric.,3987.0,3987,0.0,"[hole_putts_over_expected, hole_putts_3plus, hole_gir_putts_3plus, hole_notgir_putts_3plus]",[]
2025-11-16 19:33:16,3.2,3.2.3_hole_outcome_quality_signals,"Added scramble opportunity/success, wasted GIR, non-GIR chip-ins, and situation tags.",3987.0,3987,0.0,"[hole_scramble_opportunity, hole_scramble_success, hole_gir_wasted, hole_notgir_chip_in, hole_is_scoring_chance, hole_is_recovery]",[]
2025-11-16 19:33:16,3.2,3.2.4_hole_level_closeout,"Hole-level features finalized; validation, profile, and hole centroids captured.",3987.0,3987,0.0,"[hole_lat, hole_lon, hole_gps_points]",[]

ts,context,required_cols,missing_cols,passed
2025-11-16 19:33:15,3.2.1 Derive Hole-Level Context Features – source check,"[player_name, round_dt, facility, course, hole_number, hole_par, hole_score, hole_gir, hole_fairway_strokes]",[],True
2025-11-16 19:33:15,3.2.2 Derive Hole-Level Putting Flags – source check,"[hole_putts, hole_gir]",[],True
2025-11-16 19:33:16,3.2.3 Derive Hole Outcome Quality Signals – source check,"[hole_par, hole_score, hole_gir, hole_fairway_strokes, hole_putts]",[],True
2025-11-16 19:33:16,3.2.4 Hole-Level Validity & Features Closeout,"[player_name, round_dt, facility, course, hole_number, hole_par, hole_score, hole_putts, hole_gir, hole_par_bucket, hole_strokes_over_par, hole_score_name, hole_putts_over_expected, hole_putts_3plus, hole_gir_putts_3plus, hole_notgir_putts_3plus, hole_scramble_opportunity, hole_scramble_success, hole_gir_wasted, hole_notgir_chip_in, hole_is_scoring_chance, hole_is_recovery]",[],True

ts,assumption,rationale,impact_area
2025-11-16 19:33:12,API key will be used to enrich course metadata (slope/rating/etc.).,Needed for course difficulty context in analysis.,Course enrichment / difficulty-adjusted scoring
2025-11-16 19:33:12,Each vendor row ingested in 2.1 is treated as authoritative raw input and receives a unique row_id.,"We need stable row-level lineage for audit, QA, and troubleshooting after filtering/merging.","All downstream analysis, data retention reporting, error investigation"
2025-11-16 19:33:12,"Treat 'Round DateTime (UTC)' as the local tee time recorded by the device, not verified true UTC.","Empirical comparison showed timestamps align to played local time, not actual UTC offsets.","Time-of-day analysis, temporal feature engineering"
2025-11-16 19:33:12,Normalize vendor 'Round GIR' strings into real % on 0–100 scale.,Vendor encodes GIR in scaled percentage form; we need comparable KPIs.,Round-level GIR reporting and trend analysis
2025-11-16 19:33:12,Only 45.6% of rows have full GPS start/end coordinates.,Geospatial/yardage/dispersion analyses must restrict to GPS-complete rows.,Shot-level yardage validation and dispersion mapping
2025-11-16 19:33:12,275 rows contain hole_score <= 0. These rows will be excluded from scoring KPIs.,"hole_score <= 0 indicates incomplete/invalid hole tracking, not real golf results.",Round completeness checks and scoring statistics in Data Preparation
2025-11-16 19:33:12,Vendor yardage validated against geodesic distance with ±5.0% tolerance for shots ≥ 30.0 yards.,Short/green-side shots are GPS-noisy and not useful for club benchmarking.,"Club-distance profiling, dispersion analysis, approach/tee club selection"
2025-11-16 19:33:12,Phase 2 (Data Understanding) closed and archived. Section 3 (Data Preparation) will build from this governed baseline.,Ensures reproducibility and auditability of all downstream transformations.,Change control / data lineage / reproducibility
2025-11-16 19:33:15,"Round-level fields use the prefix 'round_', hole-level fields use 'hole_', and shot-level fields use 'shot_'.","Consistent prefixes make join logic, exports, and downstream feature engineering clearer.",3.x Data Preparation; final exports
2025-11-16 19:33:15,"Seasons mapped as Winter(12–2), Spring(3–5), Summer(6–8), Fall(9–11).",Standard Northern Hemisphere golf season alignment.,3.1.8 seasonal and trend analysis


Governance summary rendered.


Unnamed: 0,value
total_holes,3987.0
unique_rounds,206.0
putt_3plus_rate_pct,17.33
scramble_success_rate_pct,17.35
gir_wasted_rate_pct,5.12
holes_with_centroids,2408.0


Unnamed: 0,player_name,facility,course,hole_number,hole_lat,hole_lon
0,David Brooks,East Potomac Park Golf Course,Blue,1,38.872,-77.027
1,David Brooks,East Potomac Park Golf Course,Blue,2,38.869,-77.025
2,David Brooks,East Potomac Park Golf Course,Blue,3,38.866,-77.024
3,David Brooks,East Potomac Park Golf Course,Blue,4,38.863,-77.024
4,David Brooks,East Potomac Park Golf Course,Blue,5,38.863,-77.023
5,David Brooks,East Potomac Park Golf Course,Blue,6,38.866,-77.023
6,David Brooks,East Potomac Park Golf Course,Blue,7,38.868,-77.024
7,David Brooks,East Potomac Park Golf Course,Blue,8,38.87,-77.025
8,David Brooks,East Potomac Park Golf Course,Blue,9,38.873,-77.026
9,David Brooks,East Potomac Park Golf Course,Blue,10,38.872,-77.028


### ======================================================
### 3.3 Club-Level Validity & Features Overview
### ======================================================

**FOCUS**  
The club-level feature engineering phase (3.3.x) transforms **shot-level performance data** into **strategic, per-club profiles** that quantify both *distance control* and *accuracy dispersion*.  
It connects detailed shot geometry to player–club performance insights, producing calibrated dispersion cones and reliability metrics for every club in the bag.

**QUESTIONS ADDRESSED**  
- How consistent is each club’s strike pattern for a given player?  
- What are the typical (p50), realistic (p65), and strong (p80) yardages per club?  
- How wide is each club’s expected dispersion cone at each distance band?  
- Which clubs have sufficient shot volume for statistically meaningful conclusions?  
- How do dispersion and distance trade off across the player’s full set?

**KEY STEPS (3.3.1–3.3.6)**  
| Step | Name | Description |
|------|------|-------------|
| 3.3.1 | **Build Player–Club Planning Profile** | Computes yardage distributions, directional mix, and planning distances (short, hit, long) per player–club. |
| 3.3.2 | **Player–Club–Hole Dispersion** | Measures left/right dispersion at the hole level using first→last shot geometry. |
| 3.3.3 | **Player–Club Dispersion Rollup** | Aggregates dispersion metrics per player–club, normalizing by sample volume and consistency. |
| 3.3.4 | **Attach Dispersion to Player–Club Profile** | Integrates dispersion rollup back into the unified player–club table. |
| 3.3.5 | **Enrich Player–Club Profile with Dispersion Cone Fields** | Calibrates a driver-based dispersion half-angle and applies it to all clubs to precompute cone widths (short, hit, long). |
| 3.3.6 | **Club-Level Validity & Features Closeout** | Validates all club artifacts, confirms cone-field integrity, captures QA metrics, and logs Phase 3.3 completion. |

**DATA OUTPUTS**  
| Dataset | Description |
|----------|--------------|
| `player_club_profile` | Master club-level dataset with planning distances, directional tendencies, dispersion metrics, and cone geometry fields. |
| `player_club_hole_dispersion` | Per-hole dispersion measurements derived from GPS shot geometry. |
| `player_club_dispersion_rollup` | Club-level rollups summarizing typical width, shape, and sufficiency. |

**DOWNSTREAM USE**  
The finalized club-level tables provide a **ready-to-visualize foundation** for Phase 4 analysis and Tableau dashboards:  
- **Power vs. Precision Maps**: distance vs. dispersion by club.  
- **Dispersion Cones**: precomputed geometries for realistic “miss area” overlays.  
- **Player Consistency Models**: identify strengths, weak clubs, and distance gaps.  

By completing this phase, the project establishes a **validated, geometry-aware baseline** for player performance analytics, bridging raw shot data and actionable strategy insights.
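
Step 3.3.5 is only summarized in the table above, so the geometry below is an illustrative assumption rather than the notebook's implementation. It treats the cone as symmetric about the line of play, derives a half-angle from one reference lateral miss at one reference distance, and scales the full width linearly with distance. The function names and every number are hypothetical.

```python
import math

def half_angle_deg(lateral_yd: float, distance_yd: float) -> float:
    """Half-angle (degrees) subtended by a lateral miss at a given carry distance."""
    return math.degrees(math.atan2(lateral_yd, distance_yd))

def cone_width_yd(distance_yd: float, half_angle: float) -> float:
    """Full left-to-right cone width at a distance, for a symmetric half-angle in degrees."""
    return 2.0 * distance_yd * math.tan(math.radians(half_angle))

# Hypothetical calibration: a driver that misses 25 yd offline at 250 yd.
theta = half_angle_deg(lateral_yd=25.0, distance_yd=250.0)

# Hypothetical planning distances for a single club (short / hit / long).
plan = {"short": 140.0, "hit": 150.0, "long": 160.0}
widths = {label: cone_width_yd(d, theta) for label, d in plan.items()}

print(round(theta, 2))
print({label: round(w, 1) for label, w in widths.items()})
```

Under this assumption a single calibrated angle yields a width for every planning distance, which is what makes the cone fields cheap to precompute per club.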


#### ======================================================
#### 3.3.1 Build Player–Club Planning Profile
#### ======================================================

**INPUTS**  
- DataFrame: `golf_valid` (output from 3.2.x; hole-level features already attached)  
- Required columns (schema gate):  
  - `player_name`  
  - `shot_club`  
  - `yardage`  
  - `shot_fairway_hit_type` (standardized name from 3.1.9)  

**WHAT THIS STEP DOES**  
- Aggregates shot-level data to the **player × club** grain  
- Calculates distance stats for each player–club (mean, std, min, max, p25, p50, p65, p80, p90)  
- Derives **planning distances**:  
  - `plan_distance_short` → p50  
  - `plan_distance_hit` → p65  
  - `plan_distance_long` → p80  
- Computes the app-recorded directional mix from `shot_fairway_hit_type` → `app_pct_left`, `app_pct_right`, `app_pct_short`, etc.  
- Flags whether the player–club combo has enough data (`shots_sufficient`, e.g. ≥ 5 shots)  
- Logs the transform and step metadata to governance containers  
- Generates a data dictionary for the new planning table

**WHY IT MATTERS**  
This is the **club-level anchor table** for the rest of 3.3.  
Later steps (dispersion, rollups, player-facing summaries) will join back to this profile so we don’t have to recompute yardage statistics.  
By enforcing a schema gate here and recording the sufficiency rule, we can explain to stakeholders why some player–club pairs don’t have trustworthy distances.

**OUTPUTS**  
- `player_club_profile` (DataFrame, grain: player × club)  
- Governance updates:  
  - `STEP_LOG` → “3.3.1 Build Player–Club Planning Profile”  
  - `TRANSFORM_LOG` → creation of aggregated planning table  
  - `DATA_DICTIONARIES` → `player_club_profile` structure  
- Optional: preview of first 15 rows for notebook reviewers
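
To make the planning percentiles concrete, here is a small sketch on made-up yardages (the player, clubs, and distances are all hypothetical). It uses pandas' default linear interpolation for quantiles and the same `MIN_SHOTS = 5` sufficiency rule described above.

```python
import pandas as pd

MIN_SHOTS = 5  # sufficiency gate, mirroring the rule described above

# Hypothetical shots: five with a 7-iron, two with a 3-iron.
shots = pd.DataFrame({
    "player_name": ["A"] * 7,
    "shot_club": ["7i"] * 5 + ["3i"] * 2,
    "yardage": [140.0, 145.0, 150.0, 155.0, 160.0, 180.0, 190.0],
})

grouped = shots.groupby(["player_name", "shot_club"])["yardage"]

# One row per player × club, one column per requested quantile (0.50, 0.65, 0.80).
profile = grouped.quantile([0.50, 0.65, 0.80]).unstack()
profile.columns = ["plan_distance_short", "plan_distance_hit", "plan_distance_long"]

profile["shot_count"] = grouped.count()
profile["shots_sufficient"] = profile["shot_count"] >= MIN_SHOTS
profile = profile.reset_index()

print(profile)
```

For the 7-iron the three planning distances come out to 150, 153, and 156 yards. The 3-iron still gets numbers, but with only two shots it is flagged `shots_sufficient = False`, which is the signal downstream consumers should check before trusting a distance.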


In [25]:
# ======================================================
# 3.3.1 Build Player–Club Planning Profile
# ======================================================

"""
Construct a per-player × per-club profile that captures:
- shot count sufficiency
- planning yardages at key percentiles
- directional / outcome mix recorded by the app

We call this table `player_club_profile` because later 3.3.x steps
will append dispersion- and quality-related metrics to the same structure.
"""

# ------------------------------------------------------
# 0) Tunable planning percentiles
# ------------------------------------------------------
PLAN_PCT_SHORT = 0.50   # median
PLAN_PCT_ON    = 0.65   # realistic target
PLAN_PCT_LONG  = 0.80   # strong strike
MIN_SHOTS      = 5      # sufficiency gate

# ------------------------------------------------------
# 1) Schema gate
# ------------------------------------------------------
required_cols_3_3_1 = [
    "player_name",
    "shot_club",
    "yardage",
    "shot_fairway_hit_type",  # standardized in 3.1.9
]
validate_columns(
    golf_valid,
    required_cols=required_cols_3_3_1,
    context_name="3.3.1 Build Player–Club Planning Profile",
)

_before = None  # we are creating a new DF, not transforming golf_valid

# ------------------------------------------------------
# 2) App-recorded directional / outcome mix per player × club
# ------------------------------------------------------
direction_levels = ["Unknown", "Left", "Right", "Short", "Long", "Hit"]

dir_counts = (
    golf_valid
    .groupby(["player_name", "shot_club", "shot_fairway_hit_type"])
    .size()
    .unstack(fill_value=0)
)

# ensure all expected direction columns exist, in a consistent order
for d in direction_levels:
    if d not in dir_counts.columns:
        dir_counts[d] = 0

dir_totals = dir_counts.sum(axis=1).replace(0, float("nan"))  # NaN (not pd.NA) keeps the percentages float-typed
dir_pct = dir_counts.div(dir_totals, axis=0) * 100.0

dir_pct = (
    dir_pct[direction_levels]
    .reset_index()
    .rename(columns={d: f"app_pct_{d.lower()}" for d in direction_levels})
)

# ------------------------------------------------------
# 3) Distance stats per player × club
# ------------------------------------------------------
def top5(lst: pd.Series):
    return lst.sort_values(ascending=False).head(5).tolist()

dist_stats = (
    golf_valid
    .groupby(["player_name", "shot_club"])
    .agg(
        shot_count=("yardage", "count"),
        yardage_mean=("yardage", "mean"),
        yardage_std=("yardage", "std"),
        yardage_min=("yardage", "min"),
        yardage_max=("yardage", "max"),
        yardage_p25=("yardage", lambda x: x.quantile(0.25)),
        yardage_p50=("yardage", lambda x: x.quantile(PLAN_PCT_SHORT)),
        yardage_p65=("yardage", lambda x: x.quantile(PLAN_PCT_ON)),
        yardage_p80=("yardage", lambda x: x.quantile(PLAN_PCT_LONG)),
        yardage_p90=("yardage", lambda x: x.quantile(0.90)),
        top5_max_yardages=("yardage", top5),
    )
    .reset_index()
)

# ------------------------------------------------------
# 4) Planning distances (explicit columns)
# ------------------------------------------------------
dist_stats["plan_distance_short"] = dist_stats["yardage_p50"]
dist_stats["plan_distance_hit"]    = dist_stats["yardage_p65"]
dist_stats["plan_distance_long"]  = dist_stats["yardage_p80"]

# ------------------------------------------------------
# 5) Merge directional mix → final player–club profile
# ------------------------------------------------------
player_club_profile = dist_stats.merge(
    dir_pct,
    on=["player_name", "shot_club"],
    how="left",
)

# ------------------------------------------------------
# 6) Sufficiency flag
# ------------------------------------------------------
player_club_profile["shots_sufficient"] = (
    player_club_profile["shot_count"] >= MIN_SHOTS
)

sufficient_count = int(player_club_profile["shots_sufficient"].sum())
total_groups = int(len(player_club_profile))

# ------------------------------------------------------
# 7) Governance logging
# ------------------------------------------------------
track_transform(
    stage_name="3.3.1_build_player_club_profile",
    df_before=_before,
    df_after=player_club_profile,
    notes=(
        "Built per-player × per-club planning profile with p50/p65/p80 distances "
        "and app-recorded directional mix from shot_fairway_hit_type. "
        f"{sufficient_count}/{total_groups} groups have ≥{MIN_SHOTS} shots."
    ),
)

log_step(
    step_name="3.3.1 Build Player–Club Planning Profile",
    description="Computed planning distances (p50, p65, p80) and app_pct_* directional mix; stored in player_club_profile.",
    inputs=["golf_valid"],
    outputs=["player_club_profile"],
    df=player_club_profile,
    extra_info={
        "groups_total": total_groups,
        "groups_sufficient": sufficient_count,
        "min_shots_threshold": MIN_SHOTS,
        "plan_pct_short": PLAN_PCT_SHORT,
        "plan_pct_on": PLAN_PCT_ON,
        "plan_pct_long": PLAN_PCT_LONG,
    },
)

# ------------------------------------------------------
# 8) Optional data dictionary
# ------------------------------------------------------
generate_data_dictionary(
    player_club_profile,
    table_name="player_club_profile",
    desc_map={
        "shot_count": "Number of shots for this player+club",
        "yardage_mean": "Average yardage",
        "yardage_std": "Std dev of yardage",
        "yardage_min": "Shortest shot observed",
        "yardage_max": "Longest shot observed",
        "yardage_p25": "25th percentile yardage",
        # round() before int() so tunables such as 0.57 do not truncate to 56
        "yardage_p50": f"{int(round(PLAN_PCT_SHORT * 100))}th percentile yardage (plan_distance_short)",
        "yardage_p65": f"{int(round(PLAN_PCT_ON * 100))}th percentile yardage (plan_distance_hit)",
        "yardage_p80": f"{int(round(PLAN_PCT_LONG * 100))}th percentile yardage (plan_distance_long)",
        "yardage_p90": "90th percentile yardage (rarely achieved distance)",
        "plan_distance_short": "Conservative / typical planning distance",
        "plan_distance_hit": "Realistic target planning distance",
        "plan_distance_long": "Strong-strike planning distance",
        "app_pct_unknown": "% outcomes marked Unknown by the app",
        "app_pct_left": "% outcomes marked Left by the app",
        "app_pct_right": "% outcomes marked Right by the app",
        "app_pct_short": "% outcomes marked Short by the app",
        "app_pct_long": "% outcomes marked Long by the app",
        "app_pct_hit": "% outcomes marked Hit by the app",
        "top5_max_yardages": "5 longest shots logged for this player+club",
        "shots_sufficient": f"True if shot_count ≥ {MIN_SHOTS}",
    },
)

# ------------------------------------------------------
# 9) Reviewer peek
# ------------------------------------------------------
display(player_club_profile.head(15))


🔄 Transform logged: 3.3.1_build_player_club_profile
   Built per-player × per-club planning profile with p50/p65/p80 distances and app-recorded directional mix from shot_fairway_hit_type. 14/14 groups have ≥5 shots.
✅ 3.3.1 Build Player–Club Planning Profile @ 2025-11-16 19:33:16
   DataFrame shape: 14 rows × 23 cols
   Computed planning distances (p50, p65, p80) and app_pct_* directional mix; stored in player_club_profile.
   groups_total: 14
   groups_sufficient: 14
   min_shots_threshold: 5
   plan_pct_short: 0.5
   plan_pct_on: 0.65
   plan_pct_long: 0.8
📘 Data dictionary generated for table 'player_club_profile' (23 columns).


Unnamed: 0,player_name,shot_club,shot_count,yardage_mean,yardage_std,yardage_min,yardage_max,yardage_p25,yardage_p50,yardage_p65,yardage_p80,yardage_p90,top5_max_yardages,plan_distance_short,plan_distance_hit,plan_distance_long,app_pct_unknown,app_pct_left,app_pct_right,app_pct_short,app_pct_long,app_pct_hit,shots_sufficient
0,Mike Phillips,1W,456,228.301,52.267,0.0,386.045,211.175,237.314,249.344,263.561,277.81,"[386.0454948999999, 344.4881895, 341.2073496, ...",237.314,249.344,263.561,0.219,14.693,23.684,8.333,14.912,38.158,True
1,Mike Phillips,3Hy,62,171.534,63.433,21.872,276.684,123.852,185.367,208.06,223.535,252.406,"[276.6841649, 266.8416452, 259.1863521, 255.90...",185.367,208.06,223.535,11.29,8.065,27.419,17.742,11.29,24.194,True
2,Mike Phillips,3W,145,187.289,67.216,7.655,308.399,156.387,201.225,226.378,245.407,262.03,"[308.3989506, 295.275591, 284.339458, 278.8713...",201.225,226.378,245.407,1.379,9.655,26.897,13.103,6.897,42.069,True
3,Mike Phillips,3i,17,177.165,37.453,73.272,235.127,166.229,182.633,187.008,200.131,216.098,"[235.1268595, 218.72266, 214.3482068, 202.3184...",182.633,187.008,200.131,52.941,11.765,5.882,11.765,11.765,5.882,True
4,Mike Phillips,4i,37,154.318,51.013,65.617,274.497,112.642,168.416,181.102,198.6,207.568,"[274.4969383, 211.0673669, 209.9737536, 208.88...",168.416,181.102,198.6,21.622,18.919,21.622,13.514,2.703,21.622,True
5,Mike Phillips,5i,30,138.245,46.572,39.37,200.815,113.462,150.372,163.878,178.696,190.836,"[200.8149263402087, 197.9440073, 195.7567807, ...",150.372,163.878,178.696,36.667,16.667,6.667,6.667,6.667,26.667,True
6,Mike Phillips,6i,75,151.221,46.739,0.0,212.161,139.436,165.136,173.885,181.977,194.007,"[212.1609802, 208.8801403, 208.8801403, 204.50...",165.136,173.885,181.977,58.667,8.0,16.0,2.667,2.667,12.0,True
7,Mike Phillips,7i,60,132.432,46.491,3.281,207.787,107.721,147.638,154.582,168.416,174.978,"[207.786527, 195.7567807, 191.3823275, 179.352...",147.638,154.582,168.416,53.333,1.667,8.333,5.0,13.333,18.333,True
8,Mike Phillips,8i,59,123.274,43.008,9.843,190.289,100.612,135.608,146.544,154.386,171.697,"[190.2887142, 178.2589679, 178.2589679, 173.88...",135.608,146.544,154.386,25.424,6.78,15.254,5.085,6.78,40.678,True
9,Mike Phillips,9i,59,109.47,44.157,4.374,230.752,90.77,119.204,129.812,138.67,152.231,"[230.7524063, 169.5100615, 166.2292216, 156.38...",119.204,129.812,138.67,25.424,13.559,15.254,10.169,5.085,30.508,True


#### ======================================================
#### 3.3.2 Player–Club–Hole Dispersion
#### ======================================================

**INPUTS**  
- DataFrame: `golf_valid` (must contain GPS start/end points)  
- Required columns (schema gate):  
  - `player_name`  
  - `shot_club`  
  - `facility`  
  - `course`  
  - `hole_number`  
  - `shot_start_lat`, `shot_start_lon`  
  - `shot_end_lat`, `shot_end_lon`  

**WHAT THIS STEP DOES**  
- For each **player × facility × course × hole**, defines a **line of play** using the first shot’s start → last shot’s end  
- Projects every shot on that hole into a local coordinate system (forward vs. lateral)  
- Calculates **sided dispersion** at a tunable percentile (e.g. p50):  
  - `left_dispersion_yards`  
  - `right_dispersion_yards`  
  - `total_dispersion_yards` (left + right)  
- Also captures along-line distances (`length_p50`, `length_pX`, `length_max`)  
- Retains geospatial anchors (`calc_start_*`, `calc_end_*`) so we can map or debug the line-of-play later  
- Stores the result in a new DataFrame at the **player × club × facility × course × hole** grain
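
As a rough illustration of the projection described above, the sketch below converts one ball position into forward and lateral yards relative to a line of play. The helper names (`to_local_xy`, `project_onto_line`) and the coordinates are hypothetical, and the sign convention (positive lateral means right of the line) is an assumption of this sketch.

```python
from math import cos, hypot, radians

M_TO_YARDS = 1.0936133
EARTH_RADIUS_M = 6_371_000  # mean Earth radius


def to_local_xy(lat_ref, lon_ref, lat, lon):
    """Equirectangular approximation: meters east (x) and north (y) of the reference."""
    x = radians(lon - lon_ref) * EARTH_RADIUS_M * cos(radians(lat_ref))
    y = radians(lat - lat_ref) * EARTH_RADIUS_M
    return x, y


def project_onto_line(origin, target, point):
    """Return (forward_yards, lateral_yards) of `point` relative to origin -> target."""
    tx, ty = to_local_xy(*origin, *target)
    px, py = to_local_xy(*origin, *point)
    norm = hypot(tx, ty)
    ux, uy = tx / norm, ty / norm  # unit vector along the line of play
    rx, ry = uy, -ux               # forward rotated 90 degrees clockwise = right of the line
    forward = (px * ux + py * uy) * M_TO_YARDS
    lateral = (px * rx + py * ry) * M_TO_YARDS
    return forward, lateral


# Hypothetical hole playing due north; the ball finishes slightly east of the line.
origin = (38.9260, -76.8180)
target = (38.9300, -76.8180)
ball = (38.9280, -76.8178)
forward_yd, lateral_yd = project_onto_line(origin, target, ball)
print(round(forward_yd, 1), round(lateral_yd, 1))
```

Because the hole plays north and the ball ends up east of the line, the lateral value comes out positive, i.e. a miss to the right under this convention.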

**WHY IT MATTERS**  
Distance alone doesn’t tell the full story: a club that goes 150y but sprays 25y left/right isn’t the same as a club that goes 150y and stays within 8y.  
This step converts raw GPS into **operational dispersion metrics** that can be rolled up in 3.3.3 and attached back to `player_club_profile` in 3.3.4.  
By logging the anchors and the percentile used, the dispersion logic is reproducible and auditable.
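
To make the "tight vs. wild" comparison concrete, here is a minimal sketch of a sided percentile computed on signed lateral misses. The function name `sided_dispersion` and the toy values are hypothetical; negative values are assumed to be left of the line and positive values right.

```python
import pandas as pd


def sided_dispersion(laterals: pd.Series, pct: float = 0.5) -> dict:
    """Percentile of miss size on each side of the line, plus their sum."""
    left = laterals[laterals < 0].abs()   # miss sizes to the left
    right = laterals[laterals > 0]        # miss sizes to the right
    left_px = float(left.quantile(pct)) if not left.empty else 0.0
    right_px = float(right.quantile(pct)) if not right.empty else 0.0
    return {"left": left_px, "right": right_px, "total": left_px + right_px}


# Toy lateral misses in yards for two clubs that carry the same distance.
tight = pd.Series([-8.0, -4.0, 3.0, 6.0, 7.0])
wild = pd.Series([-25.0, -18.0, 12.0, 22.0, 24.0])

print(sided_dispersion(tight))
print(sided_dispersion(wild))
```

Taking the absolute value before the quantile keeps "higher percentile" meaning "bigger miss" on both sides, which matters once the percentile is tuned away from 0.50.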

**OUTPUTS**  
- `player_club_hole_dispersion` (DataFrame, grain: player × club × facility × course × hole)  
- Governance updates:  
  - `STEP_LOG` → “3.3.2 Player–Club–Hole Dispersion”  
  - `TRANSFORM_LOG` → creation of dispersion table with notes on percentile used  
  - `DATA_DICTIONARIES` → structure of `player_club_hole_dispersion`  
- Columns available for mapping/debug: `calc_start_lat/lon`, `calc_end_lat/lon`


In [26]:
# ======================================================
# 3.3.2 Player–Club–Hole Dispersion
# ======================================================

"""
Measure lateral (left/right) and along-line dispersion for each
player × club × facility × course × hole using observed GPS shots.

Approach:
1. For each hole, define a line-of-play using the FIRST shot's start
   and the LAST shot's end.
2. Project every shot on that hole into a local XY system (forward + lateral).
3. Compute sided dispersion (left, right, total) at a chosen percentile.
4. Keep calc_* anchors so we can map/diagnose later.

This produces a hole-grain dispersion table that 3.3.3 can roll up.
"""

# ------------------------------------------------------
# 0) Tunable dispersion percentile
# ------------------------------------------------------
DISP_PCT = 0.50   # can be tuned to 0.75, 0.90, etc.
M_TO_YARDS = 1.0936133

# ------------------------------------------------------
# 1) Schema gate
# ------------------------------------------------------
validate_columns(
    golf_valid,
    required_cols=[
        "player_name",
        "shot_club",
        "facility",
        "course",
        "hole_number",
        "shot_start_lat",
        "shot_start_lon",
        "shot_end_lat",
        "shot_end_lon",
    ],
    context_name="3.3.2 Player–Club–Hole Dispersion",
)

# ------------------------------------------------------
# 2) Prepare working copy and sequence shots within each hole
# ------------------------------------------------------
shots = golf_valid.copy()

shots["shot_seq_in_hole"] = (
    shots.sort_index()
    .groupby(["player_name", "facility", "course", "hole_number"])
    .cumcount()
)


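from math import cos, radians  # harmless if already imported earlier in the notebook


def latlon_to_xy_meters(lat_ref, lon_ref, lat, lon):
    """
    Convert (lat, lon) to local x/y meters relative to (lat_ref, lon_ref)
    using an equirectangular approximation (adequate over the few hundred
    yards of a single hole).
    x = east-west (east positive), y = north-south (north positive)
    """
    R = 6_371_000  # mean Earth radius in meters
    dlat = radians(lat - lat_ref)
    dlon = radians(lon - lon_ref)
    x = dlon * R * cos(radians(lat_ref))
    y = dlat * R
    return x, y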

# ------------------------------------------------------
# 3) Project every (player, facility, course, hole) group into local XY
# ------------------------------------------------------
records = []

hole_groups = shots.groupby(
    ["player_name", "facility", "course", "hole_number"], sort=False
)

for (player, fac, crs, hole), grp in hole_groups:
    grp = grp.sort_values("shot_seq_in_hole")

    first = grp.iloc[0]
    origin_lat = first["shot_start_lat"]
    origin_lon = first["shot_start_lon"]

    last = grp.iloc[-1]
    fwd_lat = last["shot_end_lat"]
    fwd_lon = last["shot_end_lon"]

    # if we can't define a line-of-play for this hole, skip it
    if (
        pd.isna(origin_lat)
        or pd.isna(origin_lon)
        or pd.isna(fwd_lat)
        or pd.isna(fwd_lon)
    ):
        continue

    # forward vector (meters)
    fwd_x_m, fwd_y_m = latlon_to_xy_meters(origin_lat, origin_lon, fwd_lat, fwd_lon)
    fwd_norm_m = (fwd_x_m**2 + fwd_y_m**2) ** 0.5
    if fwd_norm_m == 0:
        continue

    # unit forward
    fwd_ux = fwd_x_m / fwd_norm_m
    fwd_uy = fwd_y_m / fwd_norm_m

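    # unit lateral: forward rotated 90° clockwise, so it points to the RIGHT
    # of the line-of-play (x = east, y = north). Positive lateral = right,
    # negative lateral = left, which is what the sided split in step 5 expects.
    lat_ux = fwd_uy
    lat_uy = -fwd_ux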

    # project each shot's END point
    for _, shot in grp.iterrows():
        end_lat = shot["shot_end_lat"]
        end_lon = shot["shot_end_lon"]

        if pd.isna(end_lat) or pd.isna(end_lon):
            continue

        sx_m, sy_m = latlon_to_xy_meters(origin_lat, origin_lon, end_lat, end_lon)

        # distance along line-of-play
        length_m = sx_m * fwd_ux + sy_m * fwd_uy
        # sideways (signed)
        lateral_m = sx_m * lat_ux + sy_m * lat_uy

        records.append(
            {
                "player_name": player,
                "facility": fac,
                "course": crs,
                "hole_number": hole,
                "shot_club": shot["shot_club"],
                "shot_seq_in_hole": shot["shot_seq_in_hole"],
                "length_calc": length_m * M_TO_YARDS,
                "lateral_yards": lateral_m * M_TO_YARDS,
                # anchors for this hole's line-of-play
                "calc_start_lat": origin_lat,
                "calc_start_lon": origin_lon,
                "calc_end_lat": fwd_lat,
                "calc_end_lon": fwd_lon,
            }
        )

# ------------------------------------------------------
# 4) Turn projected points into a DataFrame
# ------------------------------------------------------
if records:
    dispersion_points = pd.DataFrame(records)
else:
    dispersion_points = pd.DataFrame(
        columns=[
            "player_name",
            "facility",
            "course",
            "hole_number",
            "shot_club",
            "shot_seq_in_hole",
            "length_calc",
            "lateral_yards",
            "calc_start_lat",
            "calc_start_lon",
            "calc_end_lat",
            "calc_end_lon",
        ]
    )

# ------------------------------------------------------
# 5) Aggregate to player × club × facility × course × hole
#    with sided dispersion at chosen percentile
# ------------------------------------------------------
hole_rows = []

grouped_disp = dispersion_points.groupby(
    ["player_name", "shot_club", "facility", "course", "hole_number"], sort=False
)

for (player, club, fac, crs, hole), grp in grouped_disp:
    shots_used = len(grp)
    laterals = grp["lateral_yards"]

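    # sided splits (negative lateral = left of line, positive = right)
    # take abs() BEFORE the quantile so a higher DISP_PCT always means a
    # larger miss; identical to the old result at DISP_PCT = 0.50
    left_side = laterals[laterals < 0].abs()
    if len(left_side) > 0:
        left_px = left_side.quantile(DISP_PCT)
    else:
        left_px = 0.0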

    right_side = laterals[laterals > 0]
    if len(right_side) > 0:
        right_px = right_side.quantile(DISP_PCT)
    else:
        right_px = 0.0

    total_px_dispersion_yards = left_px + right_px

    # along-line distances
    length_p50 = grp["length_calc"].quantile(0.50)
    length_px = grp["length_calc"].quantile(DISP_PCT)
    length_max = grp["length_calc"].max()

    # anchors (same for all rows in grp)
    first_row = grp.iloc[0]
    calc_start_lat = first_row["calc_start_lat"]
    calc_start_lon = first_row["calc_start_lon"]
    calc_end_lat = first_row["calc_end_lat"]
    calc_end_lon = first_row["calc_end_lon"]

    hole_rows.append(
        {
            "player_name": player,
            "shot_club": club,
            "facility": fac,
            "course": crs,
            "hole_number": hole,
            "shots_used": shots_used,
            "left_dispersion_yards": left_px,
            "right_dispersion_yards": right_px,
            "total_dispersion_yards": total_px_dispersion_yards,
            "length_p50": length_p50,
            "length_pX": length_px,
            "length_max": length_max,
            "calc_start_lat": calc_start_lat,
            "calc_start_lon": calc_start_lon,
            "calc_end_lat": calc_end_lat,
            "calc_end_lon": calc_end_lon,
        }
    )

player_club_hole_dispersion = pd.DataFrame(hole_rows)

# ------------------------------------------------------
# 6) Governance
# ------------------------------------------------------
track_transform(
    stage_name="3.3.2_player_club_hole_dispersion",
    df_before=None,
    df_after=player_club_hole_dispersion,
    notes=(
        "Per hole, used first→last shot as line-of-play; computed sided dispersion "
        f"at p={DISP_PCT}; kept calc_* anchors for mapping/debug."
    ),
)

log_step(
    step_name="3.3.2 Player–Club–Hole Dispersion",
    description="Measured per-hole left/right dispersion from derived line-of-play, with tunable percentile and mapping anchors.",
    inputs=["golf_valid"],
    outputs=["player_club_hole_dispersion"],
    df=player_club_hole_dispersion,
    extra_info={
        "hole_groups_with_dispersion": int(len(player_club_hole_dispersion)),
        "raw_points_used": int(len(dispersion_points)),
        "dispersion_percentile": DISP_PCT,
    },
)

generate_data_dictionary(
    player_club_hole_dispersion,
    table_name="player_club_hole_dispersion",
    desc_map={
        "shots_used": "Number of shots on this hole used for dispersion",
        "left_dispersion_yards": f"{int(DISP_PCT*100)}th percentile leftward miss (absolute yards)",
        "right_dispersion_yards": f"{int(DISP_PCT*100)}th percentile rightward miss (yards)",
        "total_dispersion_yards": "Left + right dispersion at chosen percentile",
        "length_p50": "Median distance along line-of-play for this player/club/hole",
        "length_pX": f"{int(DISP_PCT*100)}th percentile distance along line-of-play",
        "length_max": "Maximum distance along line-of-play (furthest projected shot)",
        "calc_start_lat": "Latitude of hole line-of-play origin (first shot start)",
        "calc_start_lon": "Longitude of hole line-of-play origin (first shot start)",
        "calc_end_lat": "Latitude of hole line-of-play target (last shot end)",
        "calc_end_lon": "Longitude of hole line-of-play target (last shot end)",
    },
)

# ------------------------------------------------------
# 7) Reviewer peek
# ------------------------------------------------------
display(player_club_hole_dispersion.head(3))


🔄 Transform logged: 3.3.2_player_club_hole_dispersion
   Per hole, used first→last shot as line-of-play; computed sided dispersion at p=0.5; kept calc_* anchors for mapping/debug.
✅ 3.3.2 Player–Club–Hole Dispersion @ 2025-11-16 19:33:22
   DataFrame shape: 250 rows × 16 cols
   Measured per-hole left/right dispersion from derived line-of-play, with tunable percentile and mapping anchors.
   hole_groups_with_dispersion: 250
   raw_points_used: 778
   dispersion_percentile: 0.5
📘 Data dictionary generated for table 'player_club_hole_dispersion' (16 columns).


Unnamed: 0,player_name,shot_club,facility,course,hole_number,shots_used,left_dispersion_yards,right_dispersion_yards,total_dispersion_yards,length_p50,length_pX,length_max,calc_start_lat,calc_start_lon,calc_end_lat,calc_end_lon
0,Mike Phillips,Lw,Enterprise Golf Course,Enterprise Golf Course,1,1,0.0,0.0,0.0,485.124,485.124,485.124,38.926,-76.818,38.93,-76.818
1,Mike Phillips,Lw,Enterprise Golf Course,Enterprise Golf Course,5,1,28.043,0.0,28.043,-21.118,-21.118,-21.118,38.931,-76.817,38.932,-76.816
2,Mike Phillips,9i,Enterprise Golf Course,Enterprise Golf Course,5,1,0.607,0.0,0.607,1.137,1.137,1.137,38.931,-76.817,38.932,-76.816


#### ======================================================
#### 3.3.3 Player–Club Dispersion Rollup (Shot-Weighted)
#### ======================================================

**INPUTS**  
- DataFrame: `player_club_hole_dispersion` (output from 3.3.2)  
- Required columns (schema gate):  
  - `player_name`  
  - `shot_club`  
  - `left_dispersion_yards`, `right_dispersion_yards`, `total_dispersion_yards`  
  - `shots_used`  

**WHAT THIS STEP DOES**  
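- Aggregates the **hole-level dispersion** values into a **single, player–club–level record**.  
- Supports a **shot-weighted rollup** (`WEIGHT_BY_SHOTS = True`) so that holes with more shot data have greater influence on the club’s final dispersion values; when the flag is off, the unweighted `ROLLUP_STRATEGY` (p75 or median of the nonzero hole values) is used instead.  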
- Computes:  
  - `club_left_dispersion_yards`  
  - `club_right_dispersion_yards`  
  - `club_total_dispersion_yards` = left + right  
- Flags clubs with enough valid data (`dispersion_sufficient`) for downstream reliability.  
- Records a formal assumption explaining why shot-weighting is used and logs the entire transformation through governance checkpoints.
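
The difference between the shot-weighted rollup and the unweighted strategies can be seen on a toy example. The frame below is made up for illustration; only the column names mirror the ones this step requires.

```python
import pandas as pd

# Hypothetical hole-level rows for one player and club.
hole_level = pd.DataFrame(
    {
        "left_dispersion_yards": [10.0, 30.0, 0.0, 20.0],
        "shots_used": [6, 1, 2, 3],
    }
)

# Holes with no leftward miss are excluded, as in the nonzero-only rollup.
nonzero = hole_level[hole_level["left_dispersion_yards"] > 0]
values = nonzero["left_dispersion_yards"]
weights = nonzero["shots_used"]

# Shot-weighted mean: the 6-shot hole pulls the estimate toward 10 yards.
weighted_mean = float((values * weights).sum() / weights.sum())

# Unweighted alternatives: every hole counts the same.
unweighted_p75 = float(values.quantile(0.75))
unweighted_median = float(values.median())

print(weighted_mean, unweighted_p75, unweighted_median)
```

Here the single-shot hole with a 30-yard miss dominates the unweighted p75 but barely moves the weighted mean, which is the trade-off the `WEIGHT_BY_SHOTS` flag controls.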

**WHY IT MATTERS**  
This step converts **raw hole-by-hole dispersion** into **usable, club-level accuracy metrics**, essential for understanding which clubs are consistent versus erratic.  
By weighting results based on the number of shots, this method gives more confidence to dispersion metrics supported by robust data and reduces noise from under-sampled holes.  
The output serves as the **accuracy companion** to the distance metrics from 3.3.1 and will be merged in 3.3.4 to create a unified performance profile.

**OUTPUTS**  
- DataFrame: `player_club_dispersion_rollup` (grain: player √ó club)  
- Key columns:  
  - `club_left_dispersion_yards`, `club_right_dispersion_yards`, `club_total_dispersion_yards`  
  - `holes_contributing`, `total_shots_used`, `dispersion_sufficient`  
- Governance updates:  
  - `ASSUMPTIONS_LOG` → rationale for shot-weighting and sufficiency threshold  
  - `TRANSFORM_LOG` → rollup strategy and weighting method recorded  
  - `STEP_LOG` → "3.3.3 Player–Club Dispersion Rollup" summary  
  - `DATA_DICTIONARIES` → field-level descriptions for reproducibility  


In [27]:
# ======================================================
# 3.3.3 Player–Club Dispersion Rollup (shot-weighted)
# ======================================================

"""
Take the hole-grain dispersion table from 3.3.2 and roll it up to
one row per (player_name, shot_club).

Why:
- Holes have different widths and different # of shots
- We want a single, reasonable dispersion number per club per player
- We want to remember how much data supported that number
"""

# ------------------------------------------------------
# 0) Rollup parameters (tunable)
# ------------------------------------------------------
from typing import Optional  # used in the _agg_nonzero annotation; harmless if already imported

ROLLUP_STRATEGY = "nonzero_p75"   # or "nonzero_median"; applies only when WEIGHT_BY_SHOTS is False
WEIGHT_BY_SHOTS = False           # True: shot-weighted mean, so holes with more shots get more influence
MIN_HOLES_FOR_CONFIDENCE = 2      # relaxed, because per-hole data can be thin

# ------------------------------------------------------
# 1) Schema gate
# ------------------------------------------------------
validate_columns(
    player_club_hole_dispersion,
    required_cols=[
        "player_name",
        "shot_club",
        "total_dispersion_yards",
        "left_dispersion_yards",
        "right_dispersion_yards",
        "shots_used",
    ],
    context_name="3.3.3 Player–Club Dispersion Rollup",
)

# ------------------------------------------------------
# 2) Assumption(s) we are baking in
# ------------------------------------------------------
record_assumption(
    text="Club-level dispersion should be influenced more by holes where the player hit more shots.",
    rationale="More shots on a hole give a more stable estimate of lateral spread, so we weight by shots_used.",
    impact_area="3.3 Club-level Validity & Features",
)

# ------------------------------------------------------
# 3) Roll up hole-grain dispersion → player × club
# ------------------------------------------------------
rollup_records = []

for (player, club), grp in player_club_hole_dispersion.groupby(
    ["player_name", "shot_club"], sort=False
):
    holes_contributing = len(grp)

    # nonzero-only series
    left_series = grp.loc[grp["left_dispersion_yards"] > 0, "left_dispersion_yards"]
    right_series = grp.loc[grp["right_dispersion_yards"] > 0, "right_dispersion_yards"]

    # matching weights for those nonzero rows
    left_weights = grp.loc[left_series.index, "shots_used"] if WEIGHT_BY_SHOTS else None
    right_weights = grp.loc[right_series.index, "shots_used"] if WEIGHT_BY_SHOTS else None

    def _agg_nonzero(series: pd.Series, weights: Optional[pd.Series]) -> float:
        """Aggregate nonzero dispersion values using either shot-weighting or a percentile/median."""
        if series.empty:
            return 0.0

        # shot-weighted
        if WEIGHT_BY_SHOTS and weights is not None:
            wsum = float(weights.sum())
            if wsum == 0:
                # fallback to median if weights are somehow zero
                return float(series.median())
            return float((series * weights).sum() / wsum)

        # unweighted
        if ROLLUP_STRATEGY == "nonzero_p75":
            return float(series.quantile(0.75))
        return float(series.median())

    club_left = _agg_nonzero(left_series, left_weights)
    club_right = _agg_nonzero(right_series, right_weights)
    club_total = club_left + club_right

    total_shots_used = int(grp["shots_used"].sum()) if "shots_used" in grp.columns else None

    rollup_records.append(
        {
            "player_name": player,
            "shot_club": club,
            "holes_contributing": holes_contributing,
            "club_left_dispersion_yards": club_left,
            "club_right_dispersion_yards": club_right,
            "club_total_dispersion_yards": club_total,
            "total_shots_used": total_shots_used,
        }
    )

player_club_dispersion_rollup = pd.DataFrame(rollup_records)

# ------------------------------------------------------
# 4) Sufficiency flag (do we trust this club’s dispersion?)
# ------------------------------------------------------
player_club_dispersion_rollup["dispersion_sufficient"] = (
    player_club_dispersion_rollup["holes_contributing"] >= MIN_HOLES_FOR_CONFIDENCE
)

# ------------------------------------------------------
# 5) Governance
# ------------------------------------------------------
track_transform(
    stage_name="3.3.3_player_club_dispersion_rollup",
    df_before=None,
    df_after=player_club_dispersion_rollup,
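    notes=(
        "Rolled up hole-level dispersion to player+club using "
        f"WEIGHT_BY_SHOTS={WEIGHT_BY_SHOTS}, strategy={ROLLUP_STRATEGY}. "
        + (
            "Holes with more shots influence the club-level dispersion more."
            if WEIGHT_BY_SHOTS
            else "Unweighted: each hole counts equally and ROLLUP_STRATEGY is applied to nonzero hole values."
        )
    ),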
)

log_step(
    step_name="3.3.3 Player–Club Dispersion Rollup",
    description="Rolled up per-hole dispersion to one row per player+club, preserving coverage and sufficiency signals.",
    inputs=["player_club_hole_dispersion"],
    outputs=["player_club_dispersion_rollup"],
    df=player_club_dispersion_rollup,
    extra_info={
        "player_club_rows": int(len(player_club_dispersion_rollup)),
        "rollup_strategy": ROLLUP_STRATEGY,
        "weighted_by_shots": WEIGHT_BY_SHOTS,
        "min_holes_for_confidence": MIN_HOLES_FOR_CONFIDENCE,
    },
)

generate_data_dictionary(
    player_club_dispersion_rollup,
    table_name="player_club_dispersion_rollup",
    desc_map={
        "player_name": "Player identifier as recorded in shots",
        "shot_club": "Club name / label as captured by the app",
        "holes_contributing": "Number of hole-level dispersion rows used for this player+club",
        "club_left_dispersion_yards": "Final left dispersion in yards for this player+club (shot-weighted if enabled)",
        "club_right_dispersion_yards": "Final right dispersion in yards for this player+club (shot-weighted if enabled)",
        "club_total_dispersion_yards": "Left + right dispersion: typical width for this club",
        "total_shots_used": "Total number of shots across all contributing holes",
        "dispersion_sufficient": f"True if at least {MIN_HOLES_FOR_CONFIDENCE} holes contributed data",
    },
)

# ------------------------------------------------------
# 6) Reviewer peek
# ------------------------------------------------------
display(player_club_dispersion_rollup.head(14))


📌 Assumption logged: Club-level dispersion should be influenced more by holes where the player hit more shots.  | Impact: 3.3 Club-level Validity & Features
🔄 Transform logged: 3.3.3_player_club_dispersion_rollup
   Rolled up hole-level dispersion to player+club using WEIGHT_BY_SHOTS=False, strategy=nonzero_p75. Holes with more shots influence the club-level dispersion more.
✅ 3.3.3 Player–Club Dispersion Rollup @ 2025-11-16 19:33:22
   DataFrame shape: 13 rows × 8 cols
   Rolled up per-hole dispersion to one row per player+club, preserving coverage and sufficiency signals.
   player_club_rows: 13
   rollup_strategy: nonzero_p75
   weighted_by_shots: False
   min_holes_for_confidence: 2
📘 Data dictionary generated for table 'player_club_dispersion_rollup' (8 columns).


Unnamed: 0,player_name,shot_club,holes_contributing,club_left_dispersion_yards,club_right_dispersion_yards,club_total_dispersion_yards,total_shots_used,dispersion_sufficient
0,Mike Phillips,Lw,49,19.735,32.141,51.877,227,True
1,Mike Phillips,9i,12,37.431,39.5,76.931,21,True
2,Mike Phillips,1W,42,37.408,29.257,66.665,203,True
3,Mike Phillips,3Hy,15,23.843,48.663,72.507,24,True
4,Mike Phillips,3W,22,62.137,54.083,116.22,43,True
5,Mike Phillips,7i,7,66.89,69.802,136.692,10,True
6,Mike Phillips,8i,11,36.668,55.484,92.152,22,True
7,Mike Phillips,Gw,23,39.11,42.266,81.376,49,True
8,Mike Phillips,Pw,19,41.837,36.287,78.125,57,True
9,Mike Phillips,Sw,24,27.135,28.613,55.748,71,True


#### ======================================================
#### 3.3.4 Attach Dispersion to Player–Club Profile (Idempotent)
#### ======================================================

**INPUTS**  
- `player_club_profile` (from 3.3.1): baseline per-player, per-club distance and directional planning data  
- `player_club_dispersion_rollup` (from 3.3.3): aggregated accuracy/dispersion metrics by player–club  

**WHAT THIS STEP DOES**  
- Integrates **dispersion metrics** (accuracy data) into the unified player–club planning table.  
- Ensures **idempotent** execution, meaning that rerunning this cell won’t duplicate or conflict with existing columns.  
- Merges these new fields:
  - `plan_dispersion` – typical left+right shot spread for the club  
  - `dispersion_hole_count` – number of holes contributing to dispersion  
  - `dispersion_shot_count` – number of shots contributing to dispersion  
- Cleans any prior dispersion columns, re-attaches current ones, and documents the transformation in the governance logs.
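
The drop-then-merge pattern behind the idempotency claim can be sketched on toy frames. The function `attach_dispersion` and the sample rows are hypothetical; the point is that running the attach twice yields the same table as running it once.

```python
import pandas as pd

DISPERSION_COLS = ["plan_dispersion", "dispersion_hole_count"]


def attach_dispersion(profile: pd.DataFrame, rollup: pd.DataFrame) -> pd.DataFrame:
    """Drop any stale dispersion columns, then left-join the fresh ones."""
    base = profile.drop(columns=DISPERSION_COLS, errors="ignore")
    return base.merge(rollup, on=["player_name", "shot_club"], how="left")


# Toy inputs: the 3i has no dispersion row, so it should come back as NaN.
profile = pd.DataFrame(
    {"player_name": ["A", "A"], "shot_club": ["1W", "3i"], "shot_count": [40, 5]}
)
rollup = pd.DataFrame(
    {
        "player_name": ["A"],
        "shot_club": ["1W"],
        "plan_dispersion": [60.0],
        "dispersion_hole_count": [12],
    }
)

once = attach_dispersion(profile, rollup)
twice = attach_dispersion(once, rollup)  # simulates re-running the cell
print(once.equals(twice))
```

Dropping first means the merge never sees a name collision, so no `_x`/`_y` or custom suffix columns are created on a rerun.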

**WHY IT MATTERS**  
This step fuses the **distance (power)** and **dispersion (accuracy)** metrics into one master profile, giving each player–club combination a complete performance fingerprint.  
It’s a key integration point between **shot-level precision** and **strategic planning analytics**, bridging accuracy and consistency with realistic on-course club selection expectations.  
By making this join idempotent and fully governed, the process remains repeatable and traceable for future enrichment phases.

**OUTPUTS**  
- DataFrame: `player_club_profile` (now includes both distance + dispersion metrics)  
- Key columns added:
  - `plan_dispersion`
  - `dispersion_hole_count`
  - `dispersion_shot_count`
- Governance artifacts updated:
  - `TRANSFORM_LOG` → Enrichment lineage from 3.3.3 documented  
  - `STEP_LOG` → “3.3.4 Attach Dispersion to Player–Club Profile” recorded  
  - `DATA_DICTIONARIES` → Updated definitions for dispersion-related columns  
  - Optional future link: supports downstream visualization of “Power vs. Precision” club mapping in 4.x analytics


In [28]:
# ======================================================
# 3.3.4 Attach Dispersion to Player–Club Profile (idempotent)
# ======================================================

"""
Goal
----
Take the accuracy metrics we just built in 3.3.3
(`player_club_dispersion_rollup`) and attach them to the planning /
distance table from 3.3.1 (`player_club_profile`).

This must be SAFE TO RERUN (idempotent):
- if the columns already exist, we refresh them
- if they don’t, we create them
"""

# ------------------------------------------------------
# 1) Schema gates
# ------------------------------------------------------
validate_columns(
    player_club_profile,
    required_cols=["player_name", "shot_club"],
    context_name="3.3.4 Attach Dispersion to Player–Club Profile (base)",
)

validate_columns(
    player_club_dispersion_rollup,
    required_cols=[
        "player_name",
        "shot_club",
        "club_total_dispersion_yards",
        "holes_contributing",
        "total_shots_used",
    ],
    context_name="3.3.4 Attach Dispersion to Player–Club Profile (dispersion)",
)

# keep a copy for lineage
_before = player_club_profile.copy()

# ------------------------------------------------------
# 2) Make base profile clean of old dispersion cols (idempotent)
# ------------------------------------------------------
dispersion_cols = [
    "plan_dispersion",
    "dispersion_hole_count",
    "dispersion_shot_count",
]

for col in dispersion_cols:
    if col in player_club_profile.columns:
        player_club_profile = player_club_profile.drop(columns=[col])

# ------------------------------------------------------
# 3) Prepare dispersion DF for join
# ------------------------------------------------------
dispersion_for_join = (
    player_club_dispersion_rollup[
        [
            "player_name",
            "shot_club",
            "club_total_dispersion_yards",
            "holes_contributing",
            "total_shots_used",
        ]
    ]
    .rename(
        columns={
            "club_total_dispersion_yards": "plan_dispersion",
            "holes_contributing": "dispersion_hole_count",
            "total_shots_used": "dispersion_shot_count",
        }
    )
)

# ------------------------------------------------------
# 4) Merge (left) and normalize suffixes
# ------------------------------------------------------
merged = player_club_profile.merge(
    dispersion_for_join,
    on=["player_name", "shot_club"],
    how="left",
    suffixes=("", "_new"),
)

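# defensive: step 2 should prevent *_new columns, but if one appears,
# prefer the fresh values and fall back to the old ones only where fresh is null
for col in dispersion_cols:
    new_col = f"{col}_new"
    if new_col in merged.columns:
        merged[col] = merged[new_col].combine_first(merged[col])
        merged = merged.drop(columns=[new_col])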

player_club_profile = merged

# ------------------------------------------------------
# 5) Governance
# ------------------------------------------------------
track_transform(
    stage_name="3.3.4_attach_dispersion_to_player_club_profile",
    df_before=_before,
    df_after=player_club_profile,
    notes=(
        "Enriched player_club_profile with dispersion metrics from 3.3.3 "
        "(plan_dispersion, dispersion_hole_count, dispersion_shot_count) in an idempotent way."
    ),
    new_cols=[c for c in dispersion_cols if c in player_club_profile.columns],
)

log_step(
    step_name="3.3.4 Attach Dispersion to Player–Club Profile",
    description="Joined club-level dispersion metrics onto the player–club planning profile.",
    inputs=["player_club_profile (3.3.1)", "player_club_dispersion_rollup (3.3.3)"],
    outputs=["player_club_profile"],
    df=player_club_profile,
    extra_info={
        "rows_after_enrichment": int(player_club_profile.shape[0]),
        "added_dispersion_columns": dispersion_cols,
        "note": "This is the unified per-player, per-club table (distance + dispersion).",
    },
)

generate_data_dictionary(
    player_club_profile,
    table_name="player_club_profile",
    desc_map={
        "plan_dispersion": "Typical left+right dispersion in yards for this player+club (from 3.3.3 rollup).",
        "dispersion_hole_count": "Number of holes that contributed to the dispersion estimate (3.3.3).",
        "dispersion_shot_count": "Total shots across contributing holes used to estimate dispersion (3.3.3).",
    },
)

# ------------------------------------------------------
# 6) Reviewer peek
# ------------------------------------------------------
display(
    player_club_profile[
        [
            "player_name",
            "shot_club",
            "shot_count",
            "plan_distance_short",
            "plan_distance_hit",
            "plan_distance_long",
            "plan_dispersion",
            "dispersion_hole_count",
            "dispersion_shot_count",
        ]
    ].head(14)
)


🔄 Transform logged: 3.3.4_attach_dispersion_to_player_club_profile
   Rows 14 → 14 (0 change)
   Enriched player_club_profile with dispersion metrics from 3.3.3 (plan_dispersion, dispersion_hole_count, dispersion_shot_count) in an idempotent way.
✅ 3.3.4 Attach Dispersion to Player–Club Profile @ 2025-11-16 19:33:22
   DataFrame shape: 14 rows × 26 cols
   Joined club-level dispersion metrics onto the player–club planning profile.
   rows_after_enrichment: 14
   added_dispersion_columns: ['plan_dispersion', 'dispersion_hole_count', 'dispersion_shot_count']
   note: This is the unified per-player, per-club table (distance + dispersion).
📘 Data dictionary generated for table 'player_club_profile' (26 columns).


Unnamed: 0,player_name,shot_club,shot_count,plan_distance_short,plan_distance_hit,plan_distance_long,plan_dispersion,dispersion_hole_count,dispersion_shot_count
0,Mike Phillips,1W,456,237.314,249.344,263.561,66.665,42.0,203.0
1,Mike Phillips,3Hy,62,185.367,208.06,223.535,72.507,15.0,24.0
2,Mike Phillips,3W,145,201.225,226.378,245.407,116.22,22.0,43.0
3,Mike Phillips,3i,17,182.633,187.008,200.131,,,
4,Mike Phillips,4i,37,168.416,181.102,198.6,91.697,7.0,10.0
5,Mike Phillips,5i,30,150.372,163.878,178.696,114.673,6.0,13.0
6,Mike Phillips,6i,75,165.136,173.885,181.977,37.234,13.0,28.0
7,Mike Phillips,7i,60,147.638,154.582,168.416,136.692,7.0,10.0
8,Mike Phillips,8i,59,135.608,146.544,154.386,92.152,11.0,22.0
9,Mike Phillips,9i,59,119.204,129.812,138.67,76.931,12.0,21.0


#### ======================================================
#### 3.3.5 Enrich Player–Club Profile with Dispersion Cone Fields
#### ======================================================

**INPUTS**  
- `player_club_profile` (from 3.3.4, already merged with dispersion rollup)  
  - must contain:  
    - `player_name`  
    - `shot_club`  
    - `plan_distance_short`  
    - `plan_distance_hit` *(formerly `plan_distance_on`)*  
    - `plan_distance_long`  
    - `plan_dispersion` *(club-level width from 3.3.4)*  

**WHAT THIS STEP DOES**  
Calibrates a **per-player dispersion cone** from the player’s **driver (1W)**, the club with the most complete data, and then **projects that cone onto every club in the profile** (the driver row included).  
This lets Tableau (or any downstream viz) draw consistent club cones without re-deriving trigonometry in the dashboard.

**How it works:**
1. For each player, find their driver row where `shot_club == "1W"`.  
2. Use the driver’s `plan_dispersion` (width in yards) and `plan_distance_long` (p80 distance) to compute a **dispersion half-angle**, `arctan((plan_dispersion / 2) / plan_distance_long)`, stored as:  
   - `disp_half_angle_rad`  
   - `disp_half_angle_deg`  
   - `disp_tan_half_angle`  
3. Apply that same angle to each club’s planning distances (`width = 2 × distance × disp_tan_half_angle`) to get **expected dispersion widths** at all 3 planning levels:  
   - `plan_dispersion_short` ← from `plan_distance_short`  
   - `plan_dispersion_hit` ← from `plan_distance_hit`  
   - `plan_dispersion_long` ← from `plan_distance_long`  
4. Write these fields back onto every row in `player_club_profile` so all downstream exports already have them.
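
The geometry above can be checked by hand. The sketch below uses only the standard library and the driver and 7i values shown in the 3.3.4 reviewer peek; the helper names `cone_from_driver` and `projected_width` are illustrative and are not part of the notebook.

```python
import math


def cone_from_driver(driver_width_yd: float, driver_distance_yd: float) -> tuple[float, float]:
    """Return (half_angle_deg, tan_half_angle) for a cone calibrated on one club."""
    # Half the total width over the distance gives the tangent of the half-angle.
    tan_half = (driver_width_yd / 2.0) / driver_distance_yd
    return math.degrees(math.atan(tan_half)), tan_half


def projected_width(distance_yd: float, tan_half: float) -> float:
    """Full cone width (left edge to right edge) at a given distance."""
    return 2.0 * distance_yd * tan_half


# Driver row: plan_dispersion = 66.665 yd, plan_distance_long = 263.561 yd.
half_angle_deg, tan_half = cone_from_driver(66.665, 263.561)

# 7i row: plan_distance_long = 168.416 yd.
width_7i_long = projected_width(168.416, tan_half)

print(round(half_angle_deg, 3), round(width_7i_long, 3))
```

Both numbers should agree with the `disp_half_angle_deg` and `plan_dispersion_long` values the notebook reports for the 7i row after this step runs.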

**WHY IT MATTERS**  
- Keeps the **dispersion logic centralized** in the data pipeline (not in Tableau).  
- Guarantees that all clubs for a given player share the **same angular model**, making cones visually consistent.  
- Uses the **best-measured club (1W)** as the authority to avoid overfitting to sparse irons/wedges.  
- Makes Phase 4 dashboards simpler: width = “already in the table.”

**OUTPUTS**  
New columns added to `player_club_profile`:
- `disp_half_angle_rad` – per-player dispersion half-angle in radians (from 1W)  
- `disp_half_angle_deg` – same, in degrees  
- `disp_tan_half_angle` – precomputed tangent for easy width projections  
- `plan_dispersion_short` – expected width at p50 distance  
- `plan_dispersion_hit` – expected width at p65 “on/normal” distance  
- `plan_dispersion_long` – expected width at p80 “strong” distance  
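
As one possible illustration of how a downstream consumer might use these columns, the sketch below turns a distance and `disp_tan_half_angle` into the three corner points of a cone. It assumes a symmetric cone about the aim line and a flat coordinate system in yards with the tee at the origin; both are assumptions made for this example, and `cone_outline` is a hypothetical helper, not a notebook function.

```python
def cone_outline(distance_yd: float, tan_half_angle: float) -> list[tuple[float, float]]:
    """Corner points (tee, left edge, right edge) of a symmetric cone.

    Assumption: the aim line runs along +y and x is the lateral offset in yards.
    """
    half_width = distance_yd * tan_half_angle
    return [
        (0.0, 0.0),                  # tee / origin
        (-half_width, distance_yd),  # left edge at the target distance
        (half_width, distance_yd),   # right edge at the target distance
    ]


# Example: 7i plan_distance_hit with the driver-calibrated tangent.
tan_half = (66.665 / 2.0) / 263.561
outline_7i = cone_outline(154.582, tan_half)
cone_width = outline_7i[2][0] - outline_7i[1][0]
print(outline_7i, round(cone_width, 3))
```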

Governance:
- `track_transform(...)` logged under `3.3.5_enrich_club_profile_with_cone_fields`  
- `log_step(...)` records players processed and total rows updated  

The updated `player_club_profile` is now ready to be exported in 3.5.2 with all cone metrics baked in.


In [29]:
# ======================================================
# 3.3.5 Enrich Player–Club Profile with Dispersion Cone Fields
# ======================================================

"""
Calibrate a dispersion half-angle per player from driver ("1W") and project that
angle to every other club. Assumes player_club_profile already has:
- player_name
- shot_club
- plan_distance_short
- plan_distance_hit   # was plan_distance_on upstream
- plan_distance_long
- plan_dispersion
and that we want to write:
- disp_half_angle_rad / disp_half_angle_deg / disp_tan_half_angle
- plan_dispersion_short / plan_dispersion_hit / plan_dispersion_long
"""

validate_columns(
    player_club_profile,
    required_cols=[
        "player_name",
        "shot_club",
        "plan_distance_short",
        "plan_distance_hit",
        "plan_distance_long",
        "plan_dispersion",
    ],
    context_name="3.3.5 Enrich Player–Club Profile with Dispersion Cone Fields",
)

pcp_before = player_club_profile.copy()
enriched_chunks = []

for player, grp in player_club_profile.groupby("player_name", sort=False):
    # Locate the calibration row: the player's driver ("1W").
    driver_rows = grp[grp["shot_club"] == "1W"]
    if driver_rows.empty:
        raise ValueError(f"[3.3.5] Player '{player}' has no driver ('1W') row in player_club_profile.")
    driver = driver_rows.iloc[0]
    driver_dist_long = driver["plan_distance_long"]
    driver_disp = driver["plan_dispersion"]

    # Fail fast on unusable calibration inputs instead of emitting a bad cone.
    if pd.isna(driver_dist_long) or driver_dist_long <= 0:
        raise ValueError(f"[3.3.5] Player '{player}' driver has invalid plan_distance_long={driver_dist_long!r}.")
    if pd.isna(driver_disp) or driver_disp <= 0:
        raise ValueError(f"[3.3.5] Player '{player}' driver has invalid plan_dispersion={driver_disp!r}.")

    # Half-angle: half the driver's total width over the driver's long distance.
    half_angle_rad = np.arctan((driver_disp / 2.0) / driver_dist_long)
    half_angle_deg = np.degrees(half_angle_rad)
    tan_half_angle = np.tan(half_angle_rad)

    grp = grp.copy()
    grp["disp_half_angle_rad"] = half_angle_rad
    grp["disp_half_angle_deg"] = half_angle_deg
    grp["disp_tan_half_angle"] = tan_half_angle

    # Project the full cone width (2 * distance * tan) at each planning distance.
    # Missing distances propagate as missing widths.
    for level in ("short", "hit", "long"):
        grp[f"plan_dispersion_{level}"] = 2.0 * grp[f"plan_distance_{level}"] * tan_half_angle
    enriched_chunks.append(grp)

player_club_profile = pd.concat(enriched_chunks, ignore_index=True)

track_transform(
    stage_name="3.3.5_enrich_club_profile_with_cone_fields",
    df_before=pcp_before,
    df_after=player_club_profile,
    notes=(
        "Calibrated per-player dispersion cone from driver (1W) using plan_dispersion and plan_distance_long, "
        "then projected expected dispersion to all clubs at short/hit/long planning distances."
    ),
    new_cols=[
        "disp_half_angle_rad",
        "disp_half_angle_deg",
        "disp_tan_half_angle",
        "plan_dispersion_short",
        "plan_dispersion_hit",
        "plan_dispersion_long",
    ],
)

log_step(
    step_name="3.3.5 Enrich Player–Club Profile with Dispersion Cone Fields",
    description="Added dispersion cone metrics so Tableau can draw cones without extra calcs.",
    inputs=["player_club_profile (from 3.3.4)"],
    outputs=["player_club_profile"],
    df=player_club_profile,
    extra_info={
        "players_processed": player_club_profile["player_name"].nunique(),
        "rows_processed": len(player_club_profile),
    },
)

display(
    player_club_profile[
        [
            "player_name",
            "shot_club",
            "plan_distance_short",
            "plan_distance_hit",
            "plan_distance_long",
            "plan_dispersion",
            "disp_half_angle_deg",
            "plan_dispersion_short",
            "plan_dispersion_hit",
            "plan_dispersion_long",
        ]
    ].head(25)
)


🔄 Transform logged: 3.3.5_enrich_club_profile_with_cone_fields
   Rows 14 → 14 (0 change)
   Calibrated per-player dispersion cone from driver (1W) using plan_dispersion and plan_distance_long, then projected expected dispersion to all clubs at short/hit/long planning distances.
✅ 3.3.5 Enrich Player–Club Profile with Dispersion Cone Fields @ 2025-11-16 19:33:22
   DataFrame shape: 14 rows × 32 cols
   Added dispersion cone metrics so Tableau can draw cones without extra calcs.
   players_processed: 1
   rows_processed: 14


Unnamed: 0,player_name,shot_club,plan_distance_short,plan_distance_hit,plan_distance_long,plan_dispersion,disp_half_angle_deg,plan_dispersion_short,plan_dispersion_hit,plan_dispersion_long
0,Mike Phillips,1W,237.314,249.344,263.561,66.665,7.208,60.026,63.069,66.665
1,Mike Phillips,3Hy,185.367,208.06,223.535,72.507,7.208,46.887,52.626,56.541
2,Mike Phillips,3W,201.225,226.378,245.407,116.22,7.208,50.898,57.26,62.073
3,Mike Phillips,3i,182.633,187.008,200.131,,7.208,46.195,47.302,50.621
4,Mike Phillips,4i,168.416,181.102,198.6,91.697,7.208,42.599,45.808,50.234
5,Mike Phillips,5i,150.372,163.878,178.696,114.673,7.208,38.035,41.451,45.199
6,Mike Phillips,6i,165.136,173.885,181.977,37.234,7.208,41.769,43.982,46.029
7,Mike Phillips,7i,147.638,154.582,168.416,136.692,7.208,37.343,39.1,42.599
8,Mike Phillips,8i,135.608,146.544,154.386,92.152,7.208,34.301,37.067,39.05
9,Mike Phillips,9i,119.204,129.812,138.67,76.931,7.208,30.151,32.835,35.075


#### ======================================================
#### 3.3.6 Club-Level Validity & Features Closeout
#### ======================================================

**INPUTS**  
- `player_club_profile`: enriched club-level master table (includes dispersion cone fields from 3.3.5)  
- `player_club_hole_dispersion`: per-hole dispersion metrics from 3.3.2  
- `player_club_dispersion_rollup`: club-level dispersion rollups from 3.3.3  

**WHAT THIS STEP DOES**  
This step formally **closes out Section 3.3 (Club-Level Validity & Features)** by validating that all club-related artifacts are present, schema-complete, and ready for downstream export.  
It also verifies that the new **dispersion cone fields (3.3.5)** were applied to every player’s club profile and captures QA metrics for coverage, sufficiency, and readiness for Phase 4 dashboards.
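
One way to make the cone-field check stricter is to count a row as covered only when every cone column is populated, rather than a single representative column. The sketch below runs on toy rows (not the real profile); `cone_coverage_pct` and `CONE_COLS` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Cone columns written by 3.3.5.
CONE_COLS = [
    "disp_half_angle_rad",
    "disp_half_angle_deg",
    "disp_tan_half_angle",
    "plan_dispersion_short",
    "plan_dispersion_hit",
    "plan_dispersion_long",
]


def cone_coverage_pct(profile: pd.DataFrame, cone_cols=CONE_COLS) -> float:
    """Percent of rows where every cone field is non-null (0.0 for an empty table)."""
    if profile.empty:
        return 0.0
    complete = profile[cone_cols].notna().all(axis=1)
    return round(float(complete.mean()) * 100, 2)


# Toy rows: the second club is missing one projected width.
toy_profile = pd.DataFrame(
    {
        "shot_club": ["1W", "7i"],
        "disp_half_angle_rad": [0.1258, 0.1258],
        "disp_half_angle_deg": [7.208, 7.208],
        "disp_tan_half_angle": [0.1265, 0.1265],
        "plan_dispersion_short": [60.0, 37.3],
        "plan_dispersion_hit": [63.1, 39.1],
        "plan_dispersion_long": [66.7, None],
    }
)
coverage = cone_coverage_pct(toy_profile)
print(coverage)
```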

**KEY ACTIONS**
| Category | Description | Outputs |
|-----------|--------------|----------|
| **Schema Validation** | Ensures all three core club tables (`player_club_profile`, `player_club_hole_dispersion`, `player_club_dispersion_rollup`) meet structural and completeness gates. | Validation log entries in `VALIDATION_LOG`. |
| **Cone Field Verification** | Confirms that every club in `player_club_profile` includes the driver-calibrated dispersion cone fields: `disp_half_angle_rad`, `disp_half_angle_deg`, `disp_tan_half_angle`, and `plan_dispersion_*`. | % of clubs with cone fields populated. |
| **QA Coverage Summary** | Measures data sufficiency across player–club combinations (dispersion presence, sufficient shots, average distance & dispersion). | `club_closeout_metrics` dictionary. |
| **Governance & Lineage Logs** | Records assumption that 3.3.x artifacts are now stable and ready for Phase 4 visualization. | STEP_LOG, TRANSFORM_LOG, ASSUMPTIONS_LOG updated. |
| **Optional Export** | Writes the finalized `player_club_profile.xlsx` to `PRIVATE_PATH` for audit or Tableau import. | Excel file with all dispersion and cone metrics. |

**WHY IT MATTERS**  
- Confirms that **all dispersion and cone enrichments** were applied successfully.  
- Establishes **data lineage** from hole- to club-level metrics.  
- Locks the club-level artifacts for **stable downstream modeling and visualization**.  
- Enables **ready-to-plot dispersion cones** in Tableau without further calculations.

**OUTPUTS**  
| Output Type | Description |
|--------------|-------------|
| **Validated Tables** | `player_club_profile`, `player_club_hole_dispersion`, `player_club_dispersion_rollup`. |
| **QA Metrics** | Total player–club combinations, % with dispersion, % with cone fields, average distances & widths. |
| **Governance Logs** | Updates in `STEP_LOG`, `ASSUMPTIONS_LOG`, `TRANSFORM_LOG`. |
| **Excel Export (Optional)** | `player_club_profile.xlsx` written to `/data/private/`. |

**✅ Phase 3.3 Summary**  
All club-level enrichment steps (3.3.1–3.3.5) have been validated and documented.  
`player_club_profile` now includes calibrated **dispersion cone geometry**, providing a complete, audit-ready foundation for Phase 4 analysis and Tableau visualization.


In [30]:
# ======================================================
# 3.3.6 Club-Level Validity & Features Closeout
# ======================================================

"""
Close out all 3.3.x club-level work.

Covers:
- 3.3.1 → player_club_profile (planning distances + app pct)
- 3.3.2 → player_club_hole_dispersion (per-hole, sided)
- 3.3.3 → player_club_dispersion_rollup (club-level rollup)
- 3.3.4 → attach dispersion to profile
- 3.3.5 → add dispersion-cone fields (driver-calibrated)

Goals:
- Confirm the three core club artifacts are present and well-formed.
- Confirm the new 3.3.5 fields exist on player_club_profile.
- Capture QA/coverage so Phase 4 can treat these as stable dimensional inputs.
"""

# ------------------------------------------------------
# 1) Schema gates for the 3 club artifacts
# ------------------------------------------------------
validate_columns(
    player_club_profile,
    required_cols=[
        "player_name",
        "shot_club",
        "shot_count",
        "plan_distance_short",
        "plan_distance_hit",
        "plan_distance_long",
        # from 3.3.4
        "plan_dispersion",
        "dispersion_hole_count",
        "dispersion_shot_count",
        # from 3.3.5 (cone fields)
        "disp_half_angle_rad",
        "disp_half_angle_deg",
        "disp_tan_half_angle",
        "plan_dispersion_short",
        "plan_dispersion_hit",
        "plan_dispersion_long",
    ],
    context_name="3.3.6 Closeout → player_club_profile",
)

validate_columns(
    player_club_hole_dispersion,
    required_cols=[
        "player_name",
        "shot_club",
        "facility",
        "course",
        "hole_number",
        "shots_used",
        "left_dispersion_yards",
        "right_dispersion_yards",
        "total_dispersion_yards",
    ],
    context_name="3.3.6 Closeout → player_club_hole_dispersion",
)

validate_columns(
    player_club_dispersion_rollup,
    required_cols=[
        "player_name",
        "shot_club",
        "club_left_dispersion_yards",
        "club_right_dispersion_yards",
        "club_total_dispersion_yards",
        "holes_contributing",
        "total_shots_used",
        "dispersion_sufficient",
    ],
    context_name="3.3.6 Closeout → player_club_dispersion_rollup",
)

# snapshot for lineage
_before = player_club_profile.copy()

# ------------------------------------------------------
# 2) QA / coverage metrics
# ------------------------------------------------------
total_player_clubs = len(player_club_profile)

clubs_with_dispersion = int(player_club_profile["plan_dispersion"].notna().sum())

clubs_with_cone = int(
    player_club_profile["disp_half_angle_rad"].notna().sum()
)

clubs_with_sufficient_shots = (
    int(player_club_profile["shots_sufficient"].sum())
    if "shots_sufficient" in player_club_profile.columns
    else None
)

avg_plan_hit = float(player_club_profile["plan_distance_hit"].mean())
avg_plan_dispersion = (
    float(player_club_profile["plan_dispersion"].mean())
    if player_club_profile["plan_dispersion"].notna().any()
    else None
)

def _pct(count, total):
    """Percentage rounded to 2 decimals, or None when the table is empty."""
    return round(count / total * 100, 2) if total > 0 else None


club_closeout_metrics = {
    "total_player_clubs": total_player_clubs,
    "clubs_with_dispersion": clubs_with_dispersion,
    "clubs_with_dispersion_pct": _pct(clubs_with_dispersion, total_player_clubs),
    "clubs_with_cone_fields": clubs_with_cone,
    "clubs_with_cone_fields_pct": _pct(clubs_with_cone, total_player_clubs),
    "clubs_with_sufficient_shots": clubs_with_sufficient_shots,
    "avg_plan_distance_hit": round(avg_plan_hit, 2),
    "avg_plan_dispersion": (
        round(avg_plan_dispersion, 2) if avg_plan_dispersion is not None else None
    ),
}

# ------------------------------------------------------
# 3) Governance assumptions
# ------------------------------------------------------
record_assumption(
    text=(
        "Club-level artifacts (3.3.1–3.3.5), including driver-calibrated "
        "dispersion-cone fields, are considered stable for Phase 4 visualization."
    ),
    rationale=(
        "All three source club tables validated; dispersion was attached back to "
        "player_club_profile; cone fields successfully derived per player."
    ),
    impact_area="Phase 4 dashboards, dispersion maps, player-level club analysis",
)

# ------------------------------------------------------
# 4) Lineage log
# ------------------------------------------------------
track_transform(
    stage_name="3.3.6_club_level_closeout",
    df_before=_before,
    df_after=player_club_profile,
    notes=(
        "Closed out 3.3 club-level features; validated base, hole-level, rollup, "
        "and cone-enriched club profile; captured QA metrics."
    ),
)

# ------------------------------------------------------
# 5) Step log for notebook reviewers
# ------------------------------------------------------
log_step(
    step_name="3.3.6 Club-Level Validity & Features Closeout",
    description=(
        "Validated player_club_profile (distances + dispersion + cone fields), "
        "player_club_hole_dispersion, and player_club_dispersion_rollup; "
        "captured coverage metrics for Phase 4."
    ),
    inputs=[
        "player_club_profile (3.3.1, 3.3.4, 3.3.5)",
        "player_club_hole_dispersion (3.3.2)",
        "player_club_dispersion_rollup (3.3.3)",
    ],
    outputs=["player_club_profile (ready for Phase 4)"],
    df=player_club_profile,
    extra_info=club_closeout_metrics,
)

# ------------------------------------------------------
# 6) Optional export of the final club profile
# ------------------------------------------------------
player_club_profile_path = PRIVATE_PATH / "player_club_profile.xlsx"
player_club_profile.to_excel(player_club_profile_path, index=False)
print(f"📁 Exported player_club_profile → {player_club_profile_path}")

# ------------------------------------------------------
# 7) Reviewer-friendly summary
# ------------------------------------------------------
display(pd.DataFrame([club_closeout_metrics]).T.rename(columns={0: "value"}))


📌 Assumption logged: Club-level artifacts (3.3.1–3.3.5), including driver-calibrated dispersion-cone fields, are considered stable for Phase 4 visualization.  | Impact: Phase 4 dashboards, dispersion maps, player-level club analysis
🔄 Transform logged: 3.3.6_club_level_closeout
   Rows 14 → 14 (0 change)
   Closed out 3.3 club-level features; validated base, hole-level, rollup, and cone-enriched club profile; captured QA metrics.
✅ 3.3.6 Club-Level Validity & Features Closeout @ 2025-11-16 19:33:22
   DataFrame shape: 14 rows × 32 cols
   Validated player_club_profile (distances + dispersion + cone fields), player_club_hole_dispersion, and player_club_dispersion_rollup; captured coverage metrics for Phase 4.
   total_player_clubs: 14
   clubs_with_dispersion: 13
   clubs_with_dispersion_pct: 92.86
   clubs_with_cone_fields: 14
   clubs_with_cone_fields_pct: 100.0
   clubs_with_sufficient_shots: 14
   avg_plan_distance_hit: 151.86
   avg_plan_dispersion: 82.45
📁 Exported

Unnamed: 0,value
total_player_clubs,14.0
clubs_with_dispersion,13.0
clubs_with_dispersion_pct,92.86
clubs_with_cone_fields,14.0
clubs_with_cone_fields_pct,100.0
clubs_with_sufficient_shots,14.0
avg_plan_distance_hit,151.86
avg_plan_dispersion,82.45


### ======================================================
### 3.4 Facility-Level Validity & Features  
### ======================================================

#### **Purpose**  
This phase expanded the dataset from individual rounds and clubs to the *facility level*, consolidating each golf course’s identifying, geographic, and temporal attributes into a unified dimension table (`facilities`). The goal was to establish each facility’s **identity**, **location**, and **timezone** so that future enrichment and analysis can align all shot-, hole-, and round-level data to the correct geospatial context.

---

#### **Inputs**  
- `golf_valid` (validated shot- and round-level data from Step 3.1)  
- Prior exports in `data/private` and cached user inputs in `data/raw/facilities.csv`  

---

#### **What This Step Does**  
| Substep | Description | Output |
|----------|--------------|---------|
| **3.4.1 Build / Refresh Facilities Table** | Aggregated `golf_valid` by facility × course to create a base facilities dimension, capturing first/last observed round and total rounds played. Preserved any prior user-provided city/state hints. | `facilities.csv` initialized with `hint_city` and `hint_state_abbr` columns. |
| **3.4.2 Derive Shot-Based Facility Centroids** | Calculated median shot coordinates per facility × course to estimate on-course centroids, stored as `shot_centroid_lat` / `shot_centroid_lon`, ensuring idempotent re-runs. | Facilities table enriched with shot-based centroids. |
| **3.4.3 Geocode Remaining Facilities** | Promoted any shot-based centroids into canonical `geo_lat` / `geo_lon` columns, then geocoded unresolved facilities using Nominatim → ArcGIS, recording the `geo_source`. | Facilities table with standardized geographic coordinates. |
| **3.4.4 Build Facility Timezones from Coordinates** | Applied `TimezoneFinder` to assign each facility an IANA timezone (`facility_tz`), defaulting to “America/New_York” when coordinates were missing or ambiguous. | Facilities table with validated timezone coverage. |
| **3.4.5 Enrich Facilities from GolfCourseAPI (cache-only)** | Queried the external GolfCourseAPI using controlled API-rate logic, capturing full JSON payloads to a separate cache file (`facilities_api_cache.csv`) for later parsing. | External data staged safely for analysis without altering core datasets. |

---
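
As a rough sketch of substeps 3.4.2 and the first half of 3.4.3, the example below computes median shot coordinates per facility × course and promotes them into `geo_lat` / `geo_lon` with a `geo_source` label. The shot-level column names `shot_lat` and `shot_lon` are assumptions made for this example (the notebook's actual GPS column names are not shown in this section), and the data is a toy frame.

```python
import pandas as pd

# Toy shot rows; `shot_lat` / `shot_lon` are hypothetical column names.
shots = pd.DataFrame(
    {
        "facility": ["A", "A", "A", "B"],
        "course": ["Blue", "Blue", "Blue", "Red"],
        "shot_lat": [38.86, 38.87, 38.88, None],
        "shot_lon": [-77.02, -77.03, -77.04, None],
    }
)

# Median coordinates per facility x course, using only shots that have GPS.
centroids = (
    shots.dropna(subset=["shot_lat", "shot_lon"])
    .groupby(["facility", "course"], as_index=False)
    .agg(
        shot_centroid_lat=("shot_lat", "median"),
        shot_centroid_lon=("shot_lon", "median"),
        shots_used_for_centroids=("shot_lat", "count"),
    )
)

# Left-join so facilities without GPS shots stay in the table with empty centroids.
facilities_toy = (
    shots[["facility", "course"]]
    .drop_duplicates()
    .merge(centroids, on=["facility", "course"], how="left")
)

# Promote centroids to canonical coordinates and record where they came from.
has_centroid = facilities_toy["shot_centroid_lat"].notna()
facilities_toy["geo_lat"] = facilities_toy["shot_centroid_lat"]
facilities_toy["geo_lon"] = facilities_toy["shot_centroid_lon"]
facilities_toy["geo_source"] = pd.NA
facilities_toy.loc[has_centroid, "geo_source"] = "shot_centroid"

print(facilities_toy)
```

Rows left with an empty `geo_source` are the ones the geocoding fallback would still need to resolve.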

#### **What’s On Hold (Planned for Later Release)**  
| WBS | Task | Status | Description |
|------|------|---------|-------------|
| **3.4.6** | Parse and Normalize Cached Facility API Data | 🟡 On Hold | Parse the JSON payloads from `facilities_api_cache.csv` and normalize key fields (e.g., slope, rating, par, tee boxes) for integration. |
| **3.4.7** | Validate Facility Locations vs Geocoded Data | 🟡 On Hold | Compare API-provided coordinates to geocoded lat/lon and flag mismatches or outliers. |
| **3.4.8** | Check GolfCourseAPI Course Data Against USGA | 🟡 On Hold | Perform external verification against USGA data; document any discrepancies or missing courses. |
| **3.4.9** | Apply to Facilities Table | 🟡 On Hold | After validation, merge trusted API-derived attributes back into the main `facilities` table. |

---

#### **Outputs**  
- **Primary:** `facilities.csv` (governed facility dimension with lat/lon, timezone, hint columns)  
- **Secondary:** `facilities_api_cache.csv` (raw API payloads awaiting normalization)

---

#### **Why It Matters**  
This step establishes a **trusted facility-level foundation** that links observed performance data to real geographic and temporal contexts. It also creates a controlled **enrichment pipeline** where external data (GolfCourseAPI, USGA) can be vetted, cached, and integrated without compromising the integrity of the core dataset.  

---

**Next Step → 3.5 Governance Close-Out (Data Preparation):**  
We will perform a comprehensive quality and lineage audit, confirming that all round-, hole-, club-, and facility-level tables are synchronized, validated, and ready for downstream analysis and performance modeling.


#### ======================================================
#### 3.4.1 Build / Refresh Facilities Table (Single Source)
#### ======================================================

**INPUTS**  
- `golf_valid`: Cleaned and validated round-level dataset from Phase 3.1.  
- Optional: `facilities.csv` (if it already exists in `CACHE_PATH`), containing user-provided city and state hints.  

**WHAT THIS STEP DOES**  
- Rebuilds the canonical **Facilities Table**, representing every unique `facility × course` pair observed in `golf_valid`.  
- Computes foundational metadata: `first_seen_round_dt`, `last_seen_round_dt`, and `rounds_observed`.  
- Checks for an existing `facilities.csv` file; if present, merges back any **user-maintained hint columns** such as `hint_city` and `hint_state_abbr`.  
- Ensures that all expected columns exist, even if this is the first build.  
- Exports the refreshed table back to the cache folder so users can fill missing hint data before API/geocoding steps.  
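
The rebuild-then-merge pattern described above can be sketched on toy frames: rebuildable metrics always come from the fresh aggregation, while user-entered hints survive from the cached file. The `validate="one_to_one"` argument is an extra guard added for this example (it raises if the cached file holds duplicate facility/course keys); it is not part of the notebook cell.

```python
import pandas as pd

# Fresh rebuild from observed rounds (toy values).
fresh = pd.DataFrame(
    {
        "facility": ["A", "B"],
        "course": ["Blue", "Red"],
        "rounds_observed": [12, 2],
    }
)

# Previously cached file: a stale metric plus a user-entered hint for one facility.
cached = pd.DataFrame(
    {
        "facility": ["A"],
        "course": ["Blue"],
        "rounds_observed": [11],
        "hint_city": ["Washington"],
        "hint_state_abbr": ["DC"],
    }
)

# Drop columns that are always recomputed so they cannot clash on merge.
rebuildable = ["first_seen_round_dt", "last_seen_round_dt", "rounds_observed"]
merged = fresh.merge(
    cached.drop(columns=rebuildable, errors="ignore"),
    on=["facility", "course"],
    how="left",
    validate="one_to_one",  # assumption: added guard against duplicate keys
)

print(merged)
```

Facility A keeps its hint and picks up the refreshed round count, while the newly observed facility B appears with empty hints for the user to fill in.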

**WHY IT MATTERS**  
This step creates the **single source of truth for all facility-level operations** (geocoding, timezone lookup, and GolfCourseAPI enrichment).  
By merging and preserving editable columns, it enables a reproducible, human-in-the-loop workflow: users can correct or supplement the data without breaking the pipeline.  
It also logs the assumption that **hint_city** and **hint_state_abbr** will be user-supplied when possible, ensuring transparency in future enrichment accuracy.

**OUTPUTS**  
- `facilities`: DataFrame containing all unique facility–course pairs and observation metadata.  
- `CACHE_PATH/facilities.csv`: Saved CSV file ready for user editing.  
- Governance artifacts:
  - `ASSUMPTIONS_LOG` → user-supplied geocoding hints recorded  
  - `TRANSFORM_LOG` → facility table creation lineage  
  - `STEP_LOG` → human-readable record of this phase  
- Key QA metrics:
  - Count of distinct `facility × course` combinations  
  - Number of rows missing city/state hints  


In [31]:
# ======================================================
# 3.4.1 Build / Refresh Facilities Table (single source)
# ======================================================

"""
ACTION
- Derive the current universe of facility × course combinations from `golf_valid`.
- Rebuild observable metrics (first_seen_round_dt, last_seen_round_dt, rounds_observed).
- If a prior facilities file exists, merge user-editable columns back in.
- Save the refreshed table so users can fill in geocoding hints.

GOVERNANCE
- Schema gate on `golf_valid`
- Idempotent merge with existing facilities.csv
- Assumption logged about user-maintained columns
- Lineage and step logs captured
"""

# ------------------------------------------------------
# 1) Paths / setup
# ------------------------------------------------------
FACILITY_CACHE_PATH = CACHE_PATH / "facilities.csv"

# ------------------------------------------------------
# 2) Schema gate on upstream data
# ------------------------------------------------------
validate_columns(
    golf_valid,
    required_cols=["facility", "course", "round_dt", "round_id"],
    context_name="3.4.1 Build / Refresh Facilities Table",
)

# ------------------------------------------------------
# 3) Build fresh base from observed rounds
# ------------------------------------------------------
facilities_new = (
    golf_valid
    .groupby(["facility", "course"], as_index=False)
    .agg(
        first_seen_round_dt=("round_dt", "min"),
        last_seen_round_dt=("round_dt", "max"),
        rounds_observed=("round_id", "nunique"),
    )
    .sort_values(["facility", "course"])
    .reset_index(drop=True)
)

# ------------------------------------------------------
# 4) If an existing file is present, bring user columns back
# ------------------------------------------------------
user_cols = ["hint_city", "hint_state_abbr"]

if FACILITY_CACHE_PATH.exists():
    facilities_existing = pd.read_csv(FACILITY_CACHE_PATH)

    # make sure key cols exist in the old file
    for col in ["facility", "course"]:
        if col not in facilities_existing.columns:
            raise KeyError(
                f"[3.4.1] Existing facilities file is missing required key column: {col}"
            )

    # drop rebuildable cols from existing so we don't clash on merge
    facilities_existing = facilities_existing.drop(
        columns=["first_seen_round_dt", "last_seen_round_dt", "rounds_observed"],
        errors="ignore",
    )

    facilities = facilities_new.merge(
        facilities_existing,
        on=["facility", "course"],
        how="left",
    )
else:
    facilities = facilities_new.copy()

# ------------------------------------------------------
# 5) Ensure user-editable columns exist (for the CSV the user will edit)
# ------------------------------------------------------
for col in user_cols:
    if col not in facilities.columns:
        facilities[col] = pd.NA

# ------------------------------------------------------
# 6) Record assumption about user-maintained hint fields
# ------------------------------------------------------
record_assumption(
    text="Facility geocoding / API enrichment will rely on user-supplied hint_city and hint_state_abbr when present.",
    rationale="Observed shot data does not always contain enough location context to disambiguate facilities; user hints reduce API errors.",
    impact_area="3.4 Facility-level Validity & Features; future geocoding and course-enrichment steps",
)

# ------------------------------------------------------
# 7) Data dictionary (optional, for handoff)
# ------------------------------------------------------
generate_data_dictionary(
    facilities,
    table_name="facilities",
    desc_map={
        "facility": "Name of the club/facility as recorded in GolfShot.",
        "course": "Name of the course/loop at that facility.",
        "first_seen_round_dt": "Earliest round_dt observed for this facility×course.",
        "last_seen_round_dt": "Latest round_dt observed for this facility×course.",
        "rounds_observed": "Count of distinct rounds played on this facility×course in this dataset.",
        "hint_city": "USER-EDITABLE: city hint to improve geocoding/API lookup.",
        "hint_state_abbr": "USER-EDITABLE: 2-letter state hint to improve geocoding/API lookup.",
    },
)

# ------------------------------------------------------
# 8) Write refreshed facilities to cache for user editing
# ------------------------------------------------------
facilities.to_csv(FACILITY_CACHE_PATH, index=False)

# ------------------------------------------------------
# 9) Coverage / QA metrics
# ------------------------------------------------------
no_city = facilities["hint_city"].isna() | (facilities["hint_city"].astype(str).str.strip() == "")
no_state = facilities["hint_state_abbr"].isna() | (facilities["hint_state_abbr"].astype(str).str.strip() == "")
rows_needing_hints = int((no_city | no_state).sum())

# ------------------------------------------------------
# 10) Lineage logging (new artifact, so df_before=None)
# ------------------------------------------------------
track_transform(
    stage_name="3.4.1_build_or_refresh_facilities",
    df_before=None,
    df_after=facilities,
    notes="Built facilities from golf_valid and merged user-provided hint columns from cache if available.",
    new_cols=list(facilities.columns),
)

# ------------------------------------------------------
# 11) Step log
# ------------------------------------------------------
log_step(
    step_name="3.4.1 Build / Refresh Facilities Table",
    description="Derived unique facility×course combos with observed round metrics; preserved user-editable hint columns.",
    inputs=["golf_valid", str(FACILITY_CACHE_PATH)],
    outputs=["facilities", str(FACILITY_CACHE_PATH)],
    df=facilities,
    extra_info={
        "distinct_facility_course": len(facilities),
        "rows_needing_hints": rows_needing_hints,
        "user_editable_columns": user_cols,
        "note": "User should open the cached facilities.csv and fill in hint_city / hint_state_abbr where blank.",
    },
)

# ------------------------------------------------------
# 12) Reviewer peek
# ------------------------------------------------------
display(facilities.head(20))


📌 Assumption logged: Facility geocoding / API enrichment will rely on user-supplied hint_city and hint_state_abbr when present.  | Impact: 3.4 Facility-level Validity & Features; future geocoding and course-enrichment steps
📘 Data dictionary generated for table 'facilities' (16 columns).
🔄 Transform logged: 3.4.1_build_or_refresh_facilities
   Built facilities from golf_valid and merged user-provided hint columns from cache if available.
✅ 3.4.1 Build / Refresh Facilities Table @ 2025-11-16 19:33:22
   DataFrame shape: 30 rows × 16 cols
   Derived unique facility×course combos with observed round metrics; preserved user-editable hint columns.
   distinct_facility_course: 30
   rows_needing_hints: 0
   user_editable_columns: ['hint_city', 'hint_state_abbr']
   note: User should open the cached facilities.csv and fill in hint_city / hint_state_abbr where blank.


Unnamed: 0,facility,course,first_seen_round_dt,last_seen_round_dt,rounds_observed,hint_city,hint_state_abbr,facility_course_key,shot_centroid_lat,shot_centroid_lon,shots_used_for_centroids,geo_lat,geo_lon,geo_source,geo_query_used,facility_tz
0,Atlas Valley Country Club,Atlas Valley,2018-08-20 07:15:27,2018-08-20 07:15:27,1,Grand Blanc,MI,Atlas_Valley_Country_Club__Atlas_Valley,,,,42.944,-83.539,nominatim,"Atlas Valley Country Club, Grand Blanc, MI",America/New_York
1,Bay Hills Golf Club,Bay Hills,2012-06-10 07:16:49,2017-07-12 17:07:03,11,Arnold,MD,Bay_Hills_Golf_Club__Bay_Hills,,,,39.045,-76.467,nominatim,"Bay Hills Golf Club, Arnold, MD",America/New_York
2,Delray Beach Golf Club,Delray Beach,2013-06-22 09:04:57,2013-06-22 09:04:57,1,Delray Beach,FL,Delray_Beach_Golf_Club__Delray_Beach,,,,26.453,-80.099,nominatim,"Delray Beach Golf Club, Delray Beach, FL",America/New_York
3,Eagle's Landing Golf Course,Eagle's Landing,2012-05-27 11:48:40,2012-05-27 11:48:40,2,Berlin,MD,Eagle's_Landing_Golf_Course__Eagle's_Landing,,,,38.304,-75.13,arcgis,"Eagle's Landing Golf Course, Berlin, MD",America/New_York
4,East Potomac Park Golf Course,Blue,2012-06-07 07:40:57,2021-10-23 07:31:48,12,Washington,DC,East_Potomac_Park_Golf_Course__Blue,38.868,-77.025,352.0,38.868,-77.025,shot_centroid,,America/New_York
5,East Potomac Park Golf Course,Red,2015-07-03 09:07:42,2015-07-03 09:07:42,2,Washington,DC,East_Potomac_Park_Golf_Course__Red,,,,38.877,-77.03,arcgis,"East Potomac Park Golf Course, Washington, DC",America/New_York
6,East Potomac Park Golf Course,White,2012-06-16 09:23:12,2021-09-12 06:53:53,5,Washington,DC,East_Potomac_Park_Golf_Course__White,38.876,-77.031,17.0,38.876,-77.031,shot_centroid,,America/New_York
7,Enterprise Golf Course,Enterprise Golf Course,2021-10-31 09:02:58,2021-10-31 09:02:58,1,Mitchellville,MD,Enterprise_Golf_Course__Enterprise_Golf_Course,38.928,-76.817,26.0,38.928,-76.817,shot_centroid,,America/New_York
8,Hampshire Greens Golf Course,Hampshire Greens Golf Course,2017-08-23 09:45:23,2017-08-23 09:45:23,1,Ashton,MD,Hampshire_Greens_Golf_Course__Hampshire_Greens...,,,,39.128,-77.0,arcgis,"Hampshire Greens Golf Course, Ashton, MD",America/New_York
9,Langston Golf Course,Langston,2011-04-03 11:46:03,2022-03-05 10:46:04,19,Washington,DC,Langston_Golf_Course__Langston,38.904,-76.966,529.0,38.904,-76.966,shot_centroid,,America/New_York


#### ======================================================
#### 3.4.2 Derive Shot-Based Facility Centroids (Idempotent)
#### ======================================================

**INPUTS**  
- `golf_valid`: Contains shot-level latitude and longitude data (`shot_start_lat`, `shot_start_lon`, `shot_end_lat`, `shot_end_lon`).  
- `facilities`: Unified facility–course table from **3.4.1 Build / Refresh Facilities Table**.  

**WHAT THIS STEP DOES**  
- Uses observed **shot GPS coordinates** to estimate a representative **centroid (latitude, longitude)** for each `facility × course` pair.  
- For each shot, prioritizes the starting GPS coordinate; if missing, substitutes the ending coordinate.  
- Filters implausible latitude/longitude values (latitude outside ±90, longitude outside ±180) to maintain data integrity.  
- Aggregates by facility and course using the **median** to produce stable, outlier-resistant centroids.  
- Merges the results back into the `facilities` table **idempotently**: rerunning this step does not create duplicate or `_x / _y` suffixed columns, and centroid values already in the table are kept (only blanks are filled).  
- Saves the refreshed `facilities.csv` and records full governance lineage.
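
The coordinate fallback, plausibility filter, and median aggregation listed above can be sketched on a handful of made-up shots. This is a minimal standard-library illustration, not the notebook's pandas implementation; the helper `pick_coord` and all sample values are hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical shots: (facility, course, start_lat, start_lon, end_lat, end_lon)
shots = [
    ("Langston Golf Course", "Langston", 38.904, -76.966, 38.905, -76.967),
    ("Langston Golf Course", "Langston", None, None, 38.906, -76.965),      # start missing: use end
    ("Langston Golf Course", "Langston", 138.0, -76.960, 38.900, -76.960),  # implausible latitude: dropped
    ("Langston Golf Course", "Langston", 38.902, -76.968, None, None),
]


def pick_coord(start, end):
    """Prefer the starting coordinate; fall back to the ending one."""
    return start if start is not None else end


grouped = defaultdict(list)
for fac, crs, s_lat, s_lon, e_lat, e_lon in shots:
    lat = pick_coord(s_lat, e_lat)
    lon = pick_coord(s_lon, e_lon)
    if lat is None or lon is None:
        continue  # no usable GPS on this shot
    if -90 <= lat <= 90 and -180 <= lon <= 180:
        grouped[(fac, crs)].append((lat, lon))

# Median per facility x course, plus the number of shots that contributed.
centroids = {
    key: {
        "shot_centroid_lat": median(lat for lat, _ in pts),
        "shot_centroid_lon": median(lon for _, lon in pts),
        "shots_used_for_centroids": len(pts),
    }
    for key, pts in grouped.items()
}
print(centroids)
```

The median is used rather than the mean so that a few stray GPS fixes (for example, a shot logged from the parking lot) do not drag the centroid off the course.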

**WHY IT MATTERS**  
This step creates the **first spatial grounding** of each course purely from observed data.  
These provisional centroids are essential for mapping, time zone derivation, and later geocoding/API validation steps.  
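Because the merge is idempotent, analysts can safely rerun this step after upstream changes without corrupting prior edits or cached data. Existing centroid values take precedence, so a rerun fills blanks but does not refresh centroids that are already stored.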
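
A small pandas sketch of the idempotent merge pattern follows. It assumes a toy two-row `facilities` table; the helper name `attach` and the sample values are hypothetical and only illustrate the suffix-and-coalesce idea.

```python
import pandas as pd

facilities = pd.DataFrame(
    {"facility": ["A Club", "B Club"], "course": ["Main", "Main"]}
)
new_centroids = pd.DataFrame(
    {"facility": ["A Club"], "course": ["Main"], "shot_centroid_lat": [38.9]}
)


def attach(fac, cents, col="shot_centroid_lat"):
    """Left-merge `cents` onto `fac`, coalescing into `col` rather than leaving suffixed columns."""
    merged = fac.merge(
        cents, on=["facility", "course"], how="left", suffixes=("", "_new")
    )
    new_col = f"{col}_new"
    if new_col in merged.columns:
        # Existing values win; the incoming value only fills blanks.
        merged[col] = merged[col].combine_first(merged[new_col])
        merged = merged.drop(columns=[new_col])
    return merged


once = attach(facilities, new_centroids)
twice = attach(once, new_centroids)  # rerun against the already-enriched table

print(list(twice.columns))
print(once.equals(twice))
```

On the first pass the centroid column does not exist yet, so the merge adds it under its plain name. On the second pass the incoming copy gets the `_new` suffix, is coalesced into the existing column, and is dropped, which leaves the table unchanged.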

**OUTPUTS**  
- `facilities`: Updated with:
  - `shot_centroid_lat`
  - `shot_centroid_lon`
  - `shots_used_for_centroids`
- `CACHE_PATH/facilities.csv`: Persisted version ready for downstream steps.  
- Governance artifacts:
  - `ASSUMPTIONS_LOG` → Records that shot-based centroids are provisional  
  - `TRANSFORM_LOG` → Logs creation and merge lineage  
  - `STEP_LOG` → Documents execution details and statistics (rows updated, valid shots used)  
- Key QA metrics:
  - Number of facility–course centroids built  
  - Count of valid shots contributing to centroid calculation  
  - Confirmation that reruns are safe (idempotent merge verified)  


In [32]:
# ======================================================
# 3.4.2 Derive Shot-Based Facility Centroids (idempotent)
# ======================================================

"""
ACTION
- Use observed GPS from shots to estimate a per-facility × course centroid.
- Attach those centroid columns to the unified `facilities` table.
- Make it safe to re-run without creating _x / _y duplicate columns.

GOVERNANCE
- Schema gates on both `golf_valid` and `facilities`
- Assumption logged that shot-derived centroids are a first-pass location
- Lineage captured via track_transform()
- Refreshed facilities table written back to cache for inspection
"""

# ------------------------------------------------------
# 1) Schema gates
# ------------------------------------------------------
validate_columns(
    golf_valid,
    required_cols=[
        "facility",
        "course",
        "shot_start_lat",
        "shot_start_lon",
        "shot_end_lat",
        "shot_end_lon",
    ],
    context_name="3.4.2 Derive Shot-Based Facility Centroids (golf_valid)",
)

validate_columns(
    facilities,
    required_cols=[
        "facility",
        "course",
        "first_seen_round_dt",
        "last_seen_round_dt",
        "rounds_observed",
    ],
    context_name="3.4.2 Derive Shot-Based Facility Centroids (facilities)",
)

fac_before = facilities.copy()

# ------------------------------------------------------
# 2) Pick a single lat/lon per shot row
#    Prefer shot_start_*; fall back to shot_end_* if start is missing
# ------------------------------------------------------
shots = golf_valid[
    [
        "facility",
        "course",
        "shot_start_lat",
        "shot_start_lon",
        "shot_end_lat",
        "shot_end_lon",
    ]
].copy()

shots["coord_lat"] = shots["shot_start_lat"].where(
    shots["shot_start_lat"].notna(),
    shots["shot_end_lat"],
)
shots["coord_lon"] = shots["shot_start_lon"].where(
    shots["shot_start_lon"].notna(),
    shots["shot_end_lon"],
)

# ------------------------------------------------------
# 3) Keep only plausible coordinates
# ------------------------------------------------------
valid_mask = (
    shots["coord_lat"].between(-90, 90)
    & shots["coord_lon"].between(-180, 180)
)
shots_valid = shots.loc[valid_mask].copy()

# ------------------------------------------------------
# 4) Aggregate to facility × course → median centroid
# ------------------------------------------------------
shot_centroids = (
    shots_valid
    .groupby(["facility", "course"], as_index=False)
    .agg(
        shot_centroid_lat=("coord_lat", "median"),
        shot_centroid_lon=("coord_lon", "median"),
        shots_used_for_centroids=("coord_lat", "size"),
    )
    .sort_values(["facility", "course"])
    .reset_index(drop=True)
)

print(
    f"📍 Built centroids for {len(shot_centroids)} facility×course combos "
    f"from {len(shots_valid):,} valid shots."
)

# ------------------------------------------------------
# 5) Merge onto facilities (idempotent pattern)
# ------------------------------------------------------
facilities = facilities.merge(
    shot_centroids,
    on=["facility", "course"],
    how="left",
    suffixes=("", "_new"),
)

# coalesce so re-runs don't create duplicates
for col in ["shot_centroid_lat", "shot_centroid_lon", "shots_used_for_centroids"]:
    new_col = f"{col}_new"
    if new_col in facilities.columns:
        # if we already had a value (maybe user edited), keep it; else use the new one
        facilities[col] = facilities[col].combine_first(facilities[new_col])
        facilities.drop(columns=[new_col], inplace=True, errors="ignore")

# ------------------------------------------------------
# 6) Assumption: shot-based centroids are provisional
# ------------------------------------------------------
record_assumption(
    text="Facility centroids are initially derived from observed shot GPS; later geocoding/API steps may improve accuracy.",
    rationale="Shot clusters usually sit on the course but may not match the clubhouse/geocoding centroid exactly.",
    impact_area="3.4 Facility-level Validity & Features; mapping; timezone derivation",
)

# ------------------------------------------------------
# 7) Persist refreshed facilities to cache
# ------------------------------------------------------
facilities_path = CACHE_PATH / "facilities.csv"
facilities.to_csv(facilities_path, index=False)

# ------------------------------------------------------
# 8) Governance logging
# ------------------------------------------------------
track_transform(
    stage_name="3.4.2_derive_shot_based_centroids",
    df_before=fac_before,
    df_after=facilities,
    notes="Added/updated shot_centroid_* columns from observed shot GPS; re-runnable without duplicate columns.",
    new_cols=["shot_centroid_lat", "shot_centroid_lon", "shots_used_for_centroids"],
)

log_step(
    step_name="3.4.2 Derive Shot-Based Facility Centroids",
    description="Estimated facility/course centroids from observed shots and merged onto facilities table.",
    inputs=["golf_valid", "facilities (pre-3.4.2)"],
    outputs=["facilities (with shot centroids)", str(facilities_path)],
    df=facilities,
    extra_info={
        "facility_course_rows": len(facilities),
        "centroids_built_this_run": len(shot_centroids),
        "valid_shots_used": len(shots_valid),
        "note": "User-edited facility rows will be preserved on re-run.",
    },
)

# ------------------------------------------------------
# 9) Reviewer peek
# ------------------------------------------------------
display(
    facilities[
        [
            "facility",
            "course",
            "shot_centroid_lat",
            "shot_centroid_lon",
            "shots_used_for_centroids",
        ]
    ].head(20)
)


📍 Built centroids for 10 facility×course combos from 1,946 valid shots.
📌 Assumption logged: Facility centroids are initially derived from observed shot GPS; later geocoding/API steps may improve accuracy.  | Impact: 3.4 Facility-level Validity & Features; mapping; timezone derivation
🔄 Transform logged: 3.4.2_derive_shot_based_centroids
   Rows 30 → 30 (0 change)
   Added/updated shot_centroid_* columns from observed shot GPS; re-runnable without duplicate columns.
✅ 3.4.2 Derive Shot-Based Facility Centroids @ 2025-11-16 19:33:23
   DataFrame shape: 30 rows × 16 cols
   Estimated facility/course centroids from observed shots and merged onto facilities table.
   facility_course_rows: 30
   centroids_built_this_run: 10
   valid_shots_used: 1946
   note: User-edited facility rows will be preserved on re-run.


Unnamed: 0,facility,course,shot_centroid_lat,shot_centroid_lon,shots_used_for_centroids
0,Atlas Valley Country Club,Atlas Valley,,,
1,Bay Hills Golf Club,Bay Hills,,,
2,Delray Beach Golf Club,Delray Beach,,,
3,Eagle's Landing Golf Course,Eagle's Landing,,,
4,East Potomac Park Golf Course,Blue,38.868,-77.025,352.0
5,East Potomac Park Golf Course,Red,,,
6,East Potomac Park Golf Course,White,38.876,-77.031,17.0
7,Enterprise Golf Course,Enterprise Golf Course,38.928,-76.817,26.0
8,Hampshire Greens Golf Course,Hampshire Greens Golf Course,,,
9,Langston Golf Course,Langston,38.904,-76.966,529.0


#### ======================================================
#### 3.4.3 Geocode Remaining Facilities (Nominatim → ArcGIS)
#### ======================================================

**INPUTS**  
- `facilities`: Table from **3.4.2 Derive Shot-Based Facility Centroids**, containing `shot_centroid_lat`, `shot_centroid_lon`, and user hint columns `hint_city`, `hint_state_abbr`.  
- External services: **Nominatim** (OpenStreetMap) and **ArcGIS** geocoders for resolving missing coordinates.  

**WHAT THIS STEP DOES**  
- Promotes any available `shot_centroid_lat` / `shot_centroid_lon` values into canonical columns `geo_lat`, `geo_lon`, and sets `geo_source = 'shot_centroid'` when those fields were previously empty.  
- Identifies facilities still missing coordinates and geocodes them using public services, prioritizing **Nominatim** (for open data) and falling back to **ArcGIS** when necessary.  
- Builds descriptive `geo_query_used` strings for each geocode attempt based on `facility`, `course`, and available user hints (`hint_city`, `hint_state_abbr`).  
- Merges results back into the `facilities` table safely (idempotently) without overwriting valid existing coordinates or creating duplicate columns.  
- Writes the updated table back to cache as `CACHE_PATH/facilities.csv`, ensuring downstream reproducibility.  
- Logs coverage statistics and provenance, i.e., how many rows came from shot centroids, Nominatim, or ArcGIS, or remain missing.
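
The query building and the Nominatim-then-ArcGIS fallback can be sketched without network access by passing the geocoders in as plain callables. Everything below is illustrative: `build_query`, `geocode_with_fallback`, and the two stand-in services are hypothetical, and the comma-separated query format mirrors the cached `geo_query_used` values shown in the outputs rather than the exact string the cell assembles.

```python
from typing import Callable, Optional, Sequence, Tuple

Coord = Tuple[float, float]
Geocoder = Callable[[str], Optional[Coord]]


def build_query(facility: str, city: Optional[str] = None, state: Optional[str] = None) -> str:
    """Join the facility name with whichever hints are non-blank."""
    parts = [facility.strip()]
    parts += [p.strip() for p in (city, state) if p and p.strip()]
    return ", ".join(parts)


def geocode_with_fallback(query: str, geocoders: Sequence[Tuple[str, Geocoder]]):
    """Try each (name, geocoder) in order; return (lat, lon, source) or (None, None, None)."""
    for name, geocode in geocoders:
        try:
            hit = geocode(query)
        except Exception as exc:  # one failing service should not stop the run
            print(f"{name} failed for {query!r}: {exc}")
            continue
        if hit is not None:
            lat, lon = hit
            return lat, lon, name
    return None, None, None


# Stand-in services (hypothetical): the first finds nothing, the second answers.
def fake_nominatim(query: str) -> Optional[Coord]:
    return None


def fake_arcgis(query: str) -> Optional[Coord]:
    return (38.304, -75.13)


query = build_query("Eagle's Landing Golf Course", "Berlin", "MD")
result = geocode_with_fallback(
    query, [("nominatim", fake_nominatim), ("arcgis", fake_arcgis)]
)
print(query)
print(result)
```

Returning the name of the service that answered is what lets `geo_source` record provenance for each row.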

**WHY IT MATTERS**  
Accurate geographic coordinates are essential for later **time zone alignment, API enrichment, and mapping visualizations**.  
This step gives every facility–course combination the best available location data (observed shot data first, then geocoding services guided by user hints) while maintaining a clear audit trail of data provenance (`geo_source`).  
The idempotent merge guarantees safe re-runs after users fill in missing hints or correct locations manually.

**OUTPUTS**  
- Updated `facilities` table including:  
  - `geo_lat`, `geo_lon`, `geo_source`, and `geo_query_used`  
- Cached file: `CACHE_PATH/facilities.csv` (persisted for later enrichment steps)  
- Governance artifacts:  
  - `ASSUMPTIONS_LOG` → notes reliance on external geocoding as fallback  
  - `TRANSFORM_LOG` → records changes to geo_* fields and data lineage  
  - `STEP_LOG` → captures counts for each `geo_source` type (shot_centroid, nominatim, arcgis, missing)  
- Key QA metrics:  
  - Number of facilities promoted from shot-based centroids  
  - Number successfully geocoded via Nominatim and ArcGIS  
  - Number still missing coordinates after geocoding pass  
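
As a rough illustration of the QA metrics above, the per-source counts can be tallied with a `Counter`. The `geo_sources` list is made-up sample data, with `None` standing in for a row that still has no coordinates.

```python
from collections import Counter

# Hypothetical geo_source values; None marks a row with no coordinates yet.
geo_sources = ["shot_centroid", "nominatim", "arcgis", None, "shot_centroid", "arcgis"]

# Label blanks as "missing" so they appear in the report instead of vanishing.
source_counts = Counter(src if src is not None else "missing" for src in geo_sources)
rows_still_missing = source_counts.get("missing", 0)
coverage = 1 - rows_still_missing / len(geo_sources)

print(dict(source_counts))
print(f"coverage: {coverage:.0%}")
```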


In [33]:
# ======================================================
# 3.4.3 Geocode Remaining Facilities (Nominatim → ArcGIS)
# ======================================================

"""
ACTION
- Promote shot-based centroids (from 3.4.2) into canonical geo_* columns when empty.
- Geocode only those facility×course rows that still have no coordinates.
- Try Nominatim first, then ArcGIS, using user-provided hint_city / hint_state_abbr.
- Persist back to CACHE_PATH / "facilities.csv".
- Log source counts so we know where our locations came from.

GOVERNANCE
- Idempotent: re-running will not create duplicate columns.
- We do NOT overwrite an existing geo_* value created by the user or by shots.
- We log assumptions about external geocoding reliability and user hints.
"""

# ------------------------------------------------------
# 1) Schema gates
# ------------------------------------------------------
validate_columns(
    facilities,
    required_cols=[
        "facility",
        "course",
        "first_seen_round_dt",
        "last_seen_round_dt",
        "rounds_observed",
        "shot_centroid_lat",
        "shot_centroid_lon",
        "hint_city",
        "hint_state_abbr",
    ],
    context_name="3.4.3 Geocode Remaining Facilities",
)

fac_before = facilities.copy()

# ------------------------------------------------------
# 2) Ensure canonical geo_* columns exist
# ------------------------------------------------------
if "geo_lat" not in facilities.columns:
    facilities["geo_lat"] = pd.NA
if "geo_lon" not in facilities.columns:
    facilities["geo_lon"] = pd.NA
if "geo_source" not in facilities.columns:
    facilities["geo_source"] = pd.NA
if "geo_query_used" not in facilities.columns:
    facilities["geo_query_used"] = pd.NA

# ------------------------------------------------------
# 3) Promote shot centroids → canonical geo_* (only when empty)
# ------------------------------------------------------
shot_mask = (
    facilities["geo_lat"].isna()
    & facilities["geo_lon"].isna()
    & facilities["shot_centroid_lat"].notna()
    & facilities["shot_centroid_lon"].notna()
)

facilities.loc[shot_mask, "geo_lat"] = facilities.loc[shot_mask, "shot_centroid_lat"]
facilities.loc[shot_mask, "geo_lon"] = facilities.loc[shot_mask, "shot_centroid_lon"]
facilities.loc[shot_mask, "geo_source"] = "shot_centroid"

promoted_from_shots = int(shot_mask.sum())
print(f"📍 Promoted {promoted_from_shots} facility×course rows from shot centroids → geo_*.")

# ------------------------------------------------------
# 4) Figure out what still needs geocoding
# ------------------------------------------------------
need_geo_mask = facilities["geo_lat"].isna() | facilities["geo_lon"].isna()
to_geo = facilities.loc[need_geo_mask].copy()
print(f"🌐 Facilities still needing external geocode: {len(to_geo)}")

# If we still have rows to geocode, document the assumption up front
if len(to_geo) > 0:
    record_assumption(
        text="External geocoding (Nominatim → ArcGIS) is used only for facility×course rows without shot or user coordinates.",
        rationale="We want deterministic, user-overridable coordinates; external services are a fallback.",
        impact_area="3.4 Facility-level Validity & Features / location enrichment",
    )

# ------------------------------------------------------
# 5) Run geocoding for the remaining rows (if any)
# ------------------------------------------------------
geo_rows = []
if len(to_geo) > 0:
    # geocoders already imported in Step 1
    nominatim = Nominatim(user_agent="golf-capstone-geocoder", timeout=10)
    nominatim_geocode = RateLimiter(nominatim.geocode, min_delay_seconds=1.25, max_retries=2)

    arcgis = ArcGIS(timeout=10)
    arcgis_geocode = RateLimiter(arcgis.geocode, min_delay_seconds=1.25, max_retries=2)

    for _, row in to_geo.iterrows():
        fac = row["facility"]
        crs = row["course"]
        city = row["hint_city"]
        state = row["hint_state_abbr"]

        # build query
        parts = [str(fac).strip()]
        if pd.notna(crs) and str(crs).strip() != "":
            parts.append(str(crs).strip())

        loc_parts = []
        if pd.notna(city) and str(city).strip() != "":
            loc_parts.append(str(city).strip())
        if pd.notna(state) and str(state).strip() != "":
            loc_parts.append(str(state).strip())
        if loc_parts:
            parts.append(", ".join(loc_parts))

        query = " ".join(parts)

        lat = None
        lon = None
        source = None

        # try Nominatim first
        try:
            loc = nominatim_geocode(query)
            if loc:
                lat = loc.latitude
                lon = loc.longitude
                source = "nominatim"
        except Exception as e:
            print(f"⚠️ Nominatim failed for {fac}/{crs} ({query}): {e}")

        # then ArcGIS
        if lat is None or lon is None:
            try:
                loc2 = arcgis_geocode(query)
                if loc2:
                    lat = loc2.latitude
                    lon = loc2.longitude
                    source = "arcgis"
            except Exception as e:
                print(f"⚠️ ArcGIS failed for {fac}/{crs} ({query}): {e}")

        geo_rows.append(
            {
                "facility": fac,
                "course": crs,
                "geo_lat_new": lat,
                "geo_lon_new": lon,
                "geo_source_new": source,
                "geo_query_used": query,
            }
        )

    geo_df = pd.DataFrame(geo_rows)

    # --------------------------------------------------
    # 6) Merge the geocoded results back, coalescing
    # --------------------------------------------------
    # Rename the query column before merging: `facilities` already has
    # `geo_query_used`, so merging it as-is would produce _x / _y suffixed
    # columns and the later lookup of `geo_query_used` would raise a KeyError.
    facilities = facilities.merge(
        geo_df.rename(columns={"geo_query_used": "geo_query_used_new"}),
        on=["facility", "course"],
        how="left",
    )

    # Coalesce: existing values win, new values only fill blanks.
    # geo_query_used keeps the actual query we used (for audit / reruns).
    for col in ["geo_lat", "geo_lon", "geo_source", "geo_query_used"]:
        new_col = f"{col}_new"
        if new_col in facilities.columns:
            facilities[col] = facilities[col].combine_first(facilities[new_col])
            facilities.drop(columns=[new_col], inplace=True)

# ------------------------------------------------------
# 7) Persist to cache
# ------------------------------------------------------
facilities_path = CACHE_PATH / "facilities.csv"
facilities.to_csv(facilities_path, index=False)

# ------------------------------------------------------
# 8) Build and log a clear source report
# ------------------------------------------------------
source_counts = (
    facilities["geo_source"]
    .fillna("missing")
    .value_counts(dropna=False)
    .to_dict()
)

track_transform(
    stage_name="3.4.3_geocode_remaining_facilities",
    df_before=fac_before,
    df_after=facilities,
    notes="Filled geo_* from shot centroids where present, then geocoded remaining rows via Nominatim → ArcGIS.",
)

log_step(
    step_name="3.4.3 Geocode Remaining Facilities",
    description="Promoted shot centroids to canonical geo_* and geocoded leftover facilities using public geocoders.",
    inputs=["facilities (from 3.4.1/3.4.2)"],
    outputs=["facilities", str(facilities_path)],
    df=facilities,
    extra_info={
        "promoted_from_shots": promoted_from_shots,
        "geo_source_counts": source_counts,
        "rows_still_missing_geo": int(
            (facilities["geo_lat"].isna() | facilities["geo_lon"].isna()).sum()
        ),
        "note": "If some rows are still missing, add hint_city / hint_state_abbr and rerun 3.4.3.",
    },
)

# ------------------------------------------------------
# 9) Reviewer peek
# ------------------------------------------------------
display(
    facilities[
        [
            "facility",
            "course",
            "shot_centroid_lat",
            "shot_centroid_lon",
            "geo_lat",
            "geo_lon",
            "geo_source",
            "geo_query_used",
        ]
    ].head(30)
)


📍 Promoted 0 facility×course rows from shot centroids → geo_*.
🌐 Facilities still needing external geocode: 0
🔄 Transform logged: 3.4.3_geocode_remaining_facilities
   Rows 30 → 30 (0 change)
   Filled geo_* from shot centroids where present, then geocoded remaining rows via Nominatim → ArcGIS.
✅ 3.4.3 Geocode Remaining Facilities @ 2025-11-16 19:33:23
   DataFrame shape: 30 rows × 16 cols
   Promoted shot centroids to canonical geo_* and geocoded leftover facilities using public geocoders.
   promoted_from_shots: 0
   geo_source_counts: {'arcgis': 12, 'shot_centroid': 10, 'nominatim': 8}
   rows_still_missing_geo: 0
   note: If some rows are still missing, add hint_city / hint_state_abbr and rerun 3.4.3.


Unnamed: 0,facility,course,shot_centroid_lat,shot_centroid_lon,geo_lat,geo_lon,geo_source,geo_query_used
0,Atlas Valley Country Club,Atlas Valley,,,42.944,-83.539,nominatim,"Atlas Valley Country Club, Grand Blanc, MI"
1,Bay Hills Golf Club,Bay Hills,,,39.045,-76.467,nominatim,"Bay Hills Golf Club, Arnold, MD"
2,Delray Beach Golf Club,Delray Beach,,,26.453,-80.099,nominatim,"Delray Beach Golf Club, Delray Beach, FL"
3,Eagle's Landing Golf Course,Eagle's Landing,,,38.304,-75.13,arcgis,"Eagle's Landing Golf Course, Berlin, MD"
4,East Potomac Park Golf Course,Blue,38.868,-77.025,38.868,-77.025,shot_centroid,
5,East Potomac Park Golf Course,Red,,,38.877,-77.03,arcgis,"East Potomac Park Golf Course, Washington, DC"
6,East Potomac Park Golf Course,White,38.876,-77.031,38.876,-77.031,shot_centroid,
7,Enterprise Golf Course,Enterprise Golf Course,38.928,-76.817,38.928,-76.817,shot_centroid,
8,Hampshire Greens Golf Course,Hampshire Greens Golf Course,,,39.128,-77.0,arcgis,"Hampshire Greens Golf Course, Ashton, MD"
9,Langston Golf Course,Langston,38.904,-76.966,38.904,-76.966,shot_centroid,


#### ======================================================
#### 3.4.4 Build Facility Timezones from Coordinates
#### ======================================================

**INPUTS**  
- `facilities` table (from 3.4.3) with columns:  
  - `facility`, `course`: unique identifiers  
  - `geo_lat`, `geo_lon`: canonical facility coordinates  
- Utility: `TimezoneFinder()` for IANA timezone lookup  

**WHAT THIS STEP DOES**  
- Iterates through all facility×course rows and assigns an appropriate IANA timezone (e.g., `America/New_York`, `America/Chicago`) based on geographic coordinates.  
- Uses **TimezoneFinder**’s `timezone_at()` as the primary method and falls back to `closest_timezone_at()` when coordinates are near borders or fail to resolve.  
- If coordinates are missing or cannot be resolved, assigns a **safe default** (`America/New_York`) to ensure temporal consistency for downstream analytics.  
- Writes the updated table back to `CACHE_PATH/facilities.csv`.  
- Logs how many facilities were successfully resolved vs. defaulted, ensuring full traceability.  
- The step is **idempotent** ‚Äî it only fills missing timezones and never overwrites existing `facility_tz` values.  
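
The resolve-or-default logic can be sketched with a stand-in lookup so that it runs without the `timezonefinder` package. `resolve_tz`, `fake_timezone_at`, and `KNOWN_ZONES` are hypothetical names; the stand-in only imitates the keyword-style call `timezone_at(lng=..., lat=...)` used in the cell.

```python
from collections import Counter
from typing import Callable, Optional, Tuple

DEFAULT_TZ = "America/New_York"

# Hypothetical stand-in for TimezoneFinder().timezone_at(lng=..., lat=...).
KNOWN_ZONES = {
    (38.904, -76.966): "America/New_York",
    (41.9, -87.6): "America/Chicago",
}


def fake_timezone_at(*, lng: float, lat: float) -> Optional[str]:
    return KNOWN_ZONES.get((lat, lng))


def resolve_tz(lat, lon, existing, lookup: Callable[..., Optional[str]]) -> Tuple[str, str]:
    """Return (timezone, status); status is 'kept', 'resolved', or 'defaulted'."""
    if existing and str(existing).strip():
        return existing, "kept"  # never overwrite a stored timezone
    if lat is None or lon is None:
        return DEFAULT_TZ, "defaulted"
    try:
        tz = lookup(lng=float(lon), lat=float(lat))
    except Exception:
        tz = None
    return (tz, "resolved") if tz else (DEFAULT_TZ, "defaulted")


rows = [
    (38.904, -76.966, None),             # resolvable
    (41.9, -87.6, None),                 # resolvable
    (None, None, None),                  # no coordinates
    (10.0, 10.0, None),                  # lookup finds nothing
    (41.9, -87.6, "America/Denver"),     # already set: left untouched
]
results = [resolve_tz(lat, lon, tz, fake_timezone_at) for lat, lon, tz in rows]
status_counts = Counter(status for _, status in results)

print(results)
print(dict(status_counts))
```

Tracking a separate `kept` status makes it visible when a rerun touched nothing, which is the situation in the logged output where both `resolved_from_coords` and the defaulted count are 0.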

**WHY IT MATTERS**  
Establishing a timezone for each facility is a critical precursor to all **temporal normalization** tasks in later phases.  
It enables accurate conversion of UTC timestamps into local round start times, ensuring that time-of-day, weekday/weekend, and seasonal analyses are consistent across regions.  
By enforcing defaults and maintaining an audit trail, this step guarantees completeness and reproducibility in temporal data enrichment.  

**OUTPUTS**  
- Updated `facilities` table with new column:  
  - `facility_tz`: the facility’s derived IANA timezone or default fallback  
- Cached file: `CACHE_PATH/facilities.csv`  
- Governance artifacts:  
  - `ASSUMPTIONS_LOG`: documents default-timezone logic and fallback rationale  
  - `TRANSFORM_LOG`: records addition of the `facility_tz` column  
  - `STEP_LOG`: captures counts for resolved and defaulted facilities  
- Key QA metrics:  
  - Facilities successfully resolved from coordinates  
  - Facilities defaulted to `America/New_York` due to missing or invalid coordinates  


In [34]:
# ======================================================
# 3.4.4 Build Facility Timezones from Coordinates
# ======================================================

"""
ACTION
- Use each facility’s latitude/longitude (geo_lat / geo_lon) to derive its IANA timezone via TimezoneFinder.
- If coordinates are missing or unresolvable, assign a safe default (America/New_York).
- Append the new field `facility_tz` to the facilities table and persist back to cache.

GOVERNANCE
- Schema validation enforces presence of geo_lat / geo_lon.
- Lineage tracked via track_transform() and log_step().
- Idempotent: rerunning this cell updates only rows missing a timezone.
"""

# ------------------------------------------------------
# 1) Schema gate
# ------------------------------------------------------
validate_columns(
    facilities,
    required_cols=[
        "facility",
        "course",
        "geo_lat",
        "geo_lon",
    ],
    context_name="3.4.4 Build Facility Timezones from Coordinates",
)

fac_before = facilities.copy()

# ------------------------------------------------------
# 2) Ensure column exists
# ------------------------------------------------------
if "facility_tz" not in facilities.columns:
    facilities["facility_tz"] = pd.NA

# ------------------------------------------------------
# 3) Resolve timezones
# ------------------------------------------------------
tf = TimezoneFinder()
resolved = 0
defaulted = 0
DEFAULT_TZ = "America/New_York"

record_assumption(
    text="Timezone derived via TimezoneFinder using geo_lat/geo_lon; defaults to America/New_York for missing or invalid coordinates.",
    rationale="Guarantees that every facility has a valid IANA timezone for downstream temporal normalization.",
    impact_area="Facility-level enrichment / datetime alignment",
)

for idx, row in facilities.iterrows():
    lat = row["geo_lat"]
    lon = row["geo_lon"]

    # skip rows that already have a timezone
    if pd.notna(row.get("facility_tz")) and str(row["facility_tz"]).strip() != "":
        continue

    if pd.notna(lat) and pd.notna(lon):
        try:
            tz = tf.timezone_at(lng=float(lon), lat=float(lat))
        except Exception:
            tz = None

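        if tz is None:
            # NOTE (assumption): closest_timezone_at() may not exist in every
            # timezonefinder release. If it is missing, the AttributeError is
            # caught here and the row falls through to DEFAULT_TZ.
            try:
                tz = tf.closest_timezone_at(lng=float(lon), lat=float(lat))
            except Exception:
                tz = None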

        if tz:
            facilities.at[idx, "facility_tz"] = tz
            resolved += 1
        else:
            facilities.at[idx, "facility_tz"] = DEFAULT_TZ
            defaulted += 1
    else:
        facilities.at[idx, "facility_tz"] = DEFAULT_TZ
        defaulted += 1

# ------------------------------------------------------
# 4) Persist updates
# ------------------------------------------------------
facilities_path = CACHE_PATH / "facilities.csv"
facilities.to_csv(facilities_path, index=False)

# ------------------------------------------------------
# 5) Governance logging
# ------------------------------------------------------
log_step(
    step_name="3.4.4 Build Facility Timezones from Coordinates",
    description="Derived IANA timezones using TimezoneFinder; defaulted unresolved rows.",
    inputs=["facilities"],
    outputs=["facilities", str(facilities_path)],
    df=facilities,
    extra_info={
        "resolved_from_coords": resolved,
        "defaulted_due_to_missing_or_unresolved_coords": defaulted,
        "default_tz_used": DEFAULT_TZ,
    },
)

track_transform(
    stage_name="3.4.4_build_facility_timezones",
    df_before=fac_before,
    df_after=facilities,
    notes="Appended facility_tz field from geographic coordinates; idempotent and reproducible.",
)

# ------------------------------------------------------
# 6) Reviewer sample
# ------------------------------------------------------
display(
    facilities[
        [
            "facility",
            "course",
            "geo_lat",
            "geo_lon",
            "facility_tz",
        ]
    ].head(30)
)


📌 Assumption logged: Timezone derived via TimezoneFinder using geo_lat/geo_lon; defaults to America/New_York for missing or invalid coordinates.  | Impact: Facility-level enrichment / datetime alignment
✅ 3.4.4 Build Facility Timezones from Coordinates @ 2025-11-16 19:33:25
   DataFrame shape: 30 rows × 16 cols
   Derived IANA timezones using TimezoneFinder; defaulted unresolved rows.
   resolved_from_coords: 0
   defaulted_due_to_missing_or_unresolved_coords: 0
   default_tz_used: America/New_York
🔄 Transform logged: 3.4.4_build_facility_timezones
   Rows 30 → 30 (0 change)
   Appended facility_tz field from geographic coordinates; idempotent and reproducible.


Unnamed: 0,facility,course,geo_lat,geo_lon,facility_tz
0,Atlas Valley Country Club,Atlas Valley,42.944,-83.539,America/New_York
1,Bay Hills Golf Club,Bay Hills,39.045,-76.467,America/New_York
2,Delray Beach Golf Club,Delray Beach,26.453,-80.099,America/New_York
3,Eagle's Landing Golf Course,Eagle's Landing,38.304,-75.13,America/New_York
4,East Potomac Park Golf Course,Blue,38.868,-77.025,America/New_York
5,East Potomac Park Golf Course,Red,38.877,-77.03,America/New_York
6,East Potomac Park Golf Course,White,38.876,-77.031,America/New_York
7,Enterprise Golf Course,Enterprise Golf Course,38.928,-76.817,America/New_York
8,Hampshire Greens Golf Course,Hampshire Greens Golf Course,39.128,-77.0,America/New_York
9,Langston Golf Course,Langston,38.904,-76.966,America/New_York


#### ======================================================
#### 3.4.5 Enrich Facilities from GolfCourseAPI (Cache-Only)
#### ======================================================

**INPUTS**  
- `facilities` (from Step 3.4.4): includes `facility`, `course`, `facility_course_key`, `hint_city`, and `hint_state_abbr`.  
- `CACHE_PATH/facilities_api_cache.csv`: local cache of past API lookups (if any).  
- External data source: **GolfCourseAPI** (authenticated via your stored `GOLF_API_KEY`).  

**WHAT THIS STEP DOES**  
- Determines which facilities still lack GolfCourseAPI data and selectively calls the API:  
  - **If `run_api_again == False`** → skip entirely.  
  - **If `run_api_again == True`** → use *only* the user-supplied `user_query` field (manual rerun).  
  - **If blank or NaN and no prior success** → use *auto candidate strategies* (facility + course + hint city/state).  
- Calls the GolfCourseAPI's `/v1/search` and `/v1/courses/{id}` endpoints, with a per-run call cap, a pause between auto strategies, and a clean stop on `429 Too Many Requests`.  
- Writes results to a **cache-only file** (`facilities_api_cache.csv`), not to the main `facilities` table.  
- Each record captures:  
  - `api_course_id`, `api_payload_json`, `api_query_used`, `api_match_strategy`, `api_success`, `run_api_again`, and `user_query`.  
- Keeps the step **idempotent** and **auditable**: existing cache entries are preserved, and only rows with no prior API result or rows flagged `run_api_again=True` are updated.  
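
The three-way rule above can be sketched as a small pure function. This is an illustrative sketch only: `lookup_mode` and `_is_missing` are hypothetical names that do not appear in the notebook, and accepting string flags such as `"TRUE"` is an assumption about how a hand-edited CSV cell might arrive.

```python
import math


def _is_missing(value):
    """True for None, NaN, or blank strings (how an empty CSV cell usually arrives)."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value).strip() == ""


def lookup_mode(run_api_again, api_course_id, api_payload_json):
    """Return 'skip', 'user_query', or 'auto' for one cache row."""
    if not _is_missing(run_api_again):
        # An explicit flag wins: False -> skip, True -> manual rerun.
        flag = str(run_api_again).strip().lower() in {"true", "1"}
        return "user_query" if flag else "skip"
    # Flag left blank: only look up rows that never had an API result.
    if _is_missing(api_course_id) and _is_missing(api_payload_json):
        return "auto"
    return "skip"


print(lookup_mode(False, None, None))            # explicit skip
print(lookup_mode(True, 28981, "{}"))            # manual rerun with user_query
print(lookup_mode(float("nan"), None, None))     # never enriched, use auto strategies
```

Keeping the rule in one function makes it easy to unit-test the override flags separately from any network code.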

**WHY IT MATTERS**  
This is the controlled integration point with the external **GolfCourseAPI**.  
It decouples network-dependent enrichment from the core dataset, ensuring that:  
- The pipeline can be safely re-run offline using cached data.  
- Users can manually correct or re-run specific facilities (`run_api_again=True`).  
- No external API errors or rate limits can corrupt the canonical `facilities` table.  
In later steps (3.4.6–3.4.9, currently on hold), this cache is intended to become the source for slope, rating, and location metadata enrichment.  
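
Why a cache makes offline re-runs safe can be shown with a minimal "newest row per key wins" upsert. The sketch below uses plain dictionaries instead of DataFrames, and `upsert_cache` plus the sample rows are hypothetical illustrations, not notebook code or real API results.

```python
def upsert_cache(existing_rows, new_rows, key="facility_course_key"):
    """Merge new rows into existing ones; the newest row per key wins."""
    merged = {row[key]: row for row in existing_rows}
    merged.update({row[key]: row for row in new_rows})
    return list(merged.values())


# Hypothetical cache contents and one fresh lookup result.
cache = [
    {"facility_course_key": "A::Main", "api_course_id": 101, "api_success": True},
    {"facility_course_key": "B::Main", "api_course_id": None, "api_success": False},
]
fresh = [{"facility_course_key": "B::Main", "api_course_id": 202, "api_success": True}]

cache = upsert_cache(cache, fresh)

# An offline re-run produces no new rows, so the cache is left unchanged.
assert upsert_cache(cache, []) == cache
```

Because a run with no new rows returns the same cache, repeating the step without network access cannot change downstream results.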

**OUTPUTS**  
- Updated **`facilities_api_cache.csv`** with the latest enrichment results.  
- Governance artifacts:  
  - `TRANSFORM_LOG` → documents API call volume, cache update mode, and run scope.  
  - `STEP_LOG` → logs counts for `api_calls_made`, rate-limit hits, and cache growth.  
  - (Optional) `ASSUMPTIONS_LOG` entry noting that cache writes do not modify `facilities`.  
- Key QA metrics:  
  - Total rows in API cache after update.  
  - Number of API calls made this run.  
  - Whether rate-limit protection triggered early termination.  
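
The call-count and rate-limit metrics listed above could be produced by a loop like the following sketch. `RateLimited`, `run_with_budget`, and `fake_fetch` are hypothetical stand-ins so the example runs without network access. Note one assumption: this sketch counts every attempt against the budget, whereas the notebook cell increments `api_calls_made` only after a call returns.

```python
class RateLimited(Exception):
    """Stand-in for an HTTP 429 response (hypothetical; not a requests class)."""


def run_with_budget(queries, fetch, max_calls=30):
    """Call fetch(query) until the budget is spent or a rate limit is hit."""
    results, calls_made, hit_rate_limit = {}, 0, False
    for query in queries:
        if calls_made >= max_calls:
            break
        calls_made += 1  # count the attempt even if it fails
        try:
            results[query] = fetch(query)
        except RateLimited:
            hit_rate_limit = True
            break
    metrics = {"api_calls_made": calls_made, "hit_rate_limit": hit_rate_limit}
    return results, metrics


def fake_fetch(query):
    """Pretend API: the third query triggers a rate limit."""
    if query == "c":
        raise RateLimited(query)
    return [query.upper()]


results, metrics = run_with_budget(["a", "b", "c", "d"], fake_fetch, max_calls=10)
print(results, metrics)
```

Returning the metrics as a dictionary keeps them ready to pass into a step log as extra information.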


In [35]:
# ======================================================
# 3.4.5 Enrich Facilities from GolfCourseAPI (cache-only)
# ======================================================

"""
ACTION
- Look at the current facilities table and figure out which facility×course
  rows still need an API lookup.
- Call GolfCourseAPI ONLY for those rows (honoring user override flags).
- Write/refresh CACHE_PATH / "facilities_api_cache.csv".
- Do NOT mutate the main `facilities` DataFrame beyond adding a key column.

NOTES
- This is deliberately decoupled from the core dataset so API flakiness
  never corrupts `facilities`.
- Users can open the cache CSV and set:
    - run_api_again = TRUE
    - user_query = "Club at XYZ, City ST"
  ...then re-run this step to try again for just those rows.
"""

# ------------------------------------------------------
# 0) Config / knobs
# ------------------------------------------------------
VERBOSE = True
API_CACHE_PATH = CACHE_PATH / "facilities_api_cache.csv"

MAX_API_CALLS_THIS_RUN = 30       # hard cap so we don't hammer the service
PAUSE_SECONDS = 0.75              # between auto strategies
STOP_ON_FIRST_429 = True          # safety

# ------------------------------------------------------
# 1) Schema gate on facilities
#    (we also tolerate missing facility_course_key by creating it)
# ------------------------------------------------------
validate_columns(
    facilities,
    required_cols=[
        "facility",
        "course",
        "hint_city",
        "hint_state_abbr",
    ],
    context_name="3.4.5 Enrich Facilities from GolfCourseAPI (facilities input)"
)

fac_before = facilities.copy()

# ensure we have a durable key for merges/cache
if "facility_course_key" not in facilities.columns:
    facilities["facility_course_key"] = (
        facilities["facility"].astype(str).str.strip()
        + "::"
        + facilities["course"].astype(str).str.strip()
    )

# ------------------------------------------------------
# 2) Load or initialize the API cache
# ------------------------------------------------------
if API_CACHE_PATH.exists():
    facilities_api_cache = pd.read_csv(API_CACHE_PATH)
else:
    facilities_api_cache = pd.DataFrame(
        columns=[
            "facility_course_key",
            "api_course_id",
            "api_payload_json",
            "api_query_used",
            "api_last_checked_ts",
            "api_match_strategy",
            "api_last_note",
            "api_success",
            "api_tried_all_strategies",
            "run_api_again",
            "user_query",
        ]
    )

# make sure expected columns exist even if cache is old
for col in [
    "api_course_id",
    "api_payload_json",
    "api_query_used",
    "api_last_checked_ts",
    "api_match_strategy",
    "api_last_note",
    "api_success",
    "api_tried_all_strategies",
    "run_api_again",
    "user_query",
]:
    if col not in facilities_api_cache.columns:
        facilities_api_cache[col] = pd.NA

# ------------------------------------------------------
# 3) Join facilities with cache to decide what to call
# ------------------------------------------------------
fac_enrich = facilities.merge(
    facilities_api_cache,
    on="facility_course_key",
    how="left",
    suffixes=("", "_cached"),
)

def _clean(s):
    if s is None:
        return None
    s = str(s).strip()
    if s == "" or s.lower() == "nan":
        return None
    return s

def _needs_api(row) -> bool:
    """
    Rules:
    - run_api_again == False → never call
    - run_api_again == True  → call, but user_query-only
    - else:
        - if no prior api_course_id and no payload → call (auto strategies)
        - otherwise → skip
    """
    run_flag = row.get("run_api_again")

    # explicit skip
    if run_flag is False:
        return False

    # explicit rerun
    if run_flag is True:
        return True

    # implicit (never had API success)
    never_had_api = pd.isna(row.get("api_course_id")) and pd.isna(row.get("api_payload_json"))
    return bool(never_had_api)

rows_needing_api = fac_enrich[fac_enrich.apply(_needs_api, axis=1)].copy()

if VERBOSE:
    print(f"🔎 Total facilities in scope: {len(fac_enrich)}")
    print(f"🔎 Facilities needing API this run: {len(rows_needing_api)}")

# ------------------------------------------------------
# 4) Helper functions for API
# ------------------------------------------------------
def auto_candidate_queries(facility, course, hint_city=None, hint_state=None):
    """Generate a few reasonable search strings for the API."""
    hint_bits = []
    if _clean(hint_city):
        hint_bits.append(_clean(hint_city))
    if _clean(hint_state):
        hint_bits.append(_clean(hint_state))
    hint_str = " ".join(hint_bits) if hint_bits else None

    fac_course = f"{facility} {course}".strip()

    if hint_str:
        yield f"{fac_course} {hint_str}"
    yield fac_course
    if hint_str:
        yield f"{facility} {hint_str}"
    yield facility

def gca_search(query: str):
    headers = {
        "Authorization": f"Key {GOLF_API_KEY}",
        "accept": "application/json",
    }
    resp = requests.get(
        "https://api.golfcourseapi.com/v1/search",
        headers=headers,
        params={"search_query": query},
        timeout=15,
    )
    resp.raise_for_status()
    return resp.json().get("courses", []) or []

def gca_course_detail(course_id: int):
    headers = {
        "Authorization": f"Key {GOLF_API_KEY}",
        "accept": "application/json",
    }
    resp = requests.get(
        f"https://api.golfcourseapi.com/v1/courses/{course_id}",
        headers=headers,
        timeout=15,
    )
    resp.raise_for_status()
    return resp.json()

# ------------------------------------------------------
# 5) Main API loop (selective, rate-limited)
# ------------------------------------------------------
enriched_rows = []
api_calls_made = 0
hit_rate_limit = False

for _, r in rows_needing_api.iterrows():
    if api_calls_made >= MAX_API_CALLS_THIS_RUN:
        if VERBOSE:
            print("⚠️ API call cap hit; stopping for this run.")
        break
    if hit_rate_limit:
        break

    fac = r["facility"]
    crs = r["course"]
    key = r["facility_course_key"]
    hint_city = r.get("hint_city")
    hint_state = r.get("hint_state_abbr")
    user_query = _clean(r.get("user_query"))
    run_flag = r.get("run_api_again")

    best_match = None
    detail_payload = None
    last_note = None

    # ---------- CASE 1: user-triggered rerun ----------
    if run_flag is True:
        if not user_query:
            # user wanted rerun but didn't give a query
            enriched_rows.append({
                "facility_course_key": key,
                "api_course_id": None,
                "api_payload_json": None,
                "api_query_used": None,
                "api_last_checked_ts": datetime.utcnow().isoformat(),
                "api_match_strategy": None,
                "api_last_note": "run_api_again=True but user_query missing",
                "api_success": False,
                "api_tried_all_strategies": False,
                "run_api_again": True,
                "user_query": r.get("user_query", pd.NA),
            })
            if VERBOSE:
                print(f"❌ {fac} / {crs}: run_api_again=True but no user_query; recorded stub.")
            continue

        try:
            courses = gca_search(user_query)
            api_calls_made += 1

            if courses:
                best_match = courses[0]
                # optionally fetch detail
                if api_calls_made < MAX_API_CALLS_THIS_RUN:
                    try:
                        detail_payload = gca_course_detail(best_match["id"])
                        api_calls_made += 1
                        last_note = "success (user_query)"
                    except requests.HTTPError as e_det:
                        detail_payload = None
                        last_note = f"user_query detail failed: {e_det}"
                else:
                    last_note = "success (user_query, detail skipped due to cap)"

                enriched_rows.append({
                    "facility_course_key": key,
                    "api_course_id": best_match.get("id"),
                    "api_payload_json": json.dumps(detail_payload) if detail_payload else None,
                    "api_query_used": user_query,
                    "api_last_checked_ts": datetime.utcnow().isoformat(),
                    "api_match_strategy": 5,  # arbitrary bucket for "user query"
                    "api_last_note": last_note,
                    "api_success": True,
                    "api_tried_all_strategies": False,
                    "run_api_again": True,  # keep true so user can retry if this is still wrong
                    "user_query": r.get("user_query", pd.NA),
                })

                if VERBOSE:
                    print(f"✅ {fac} / {crs}: matched via user_query='{user_query}'")

            else:
                enriched_rows.append({
                    "facility_course_key": key,
                    "api_course_id": None,
                    "api_payload_json": None,
                    "api_query_used": user_query,
                    "api_last_checked_ts": datetime.utcnow().isoformat(),
                    "api_match_strategy": None,
                    "api_last_note": "user_query returned no results",
                    "api_success": False,
                    "api_tried_all_strategies": False,
                    "run_api_again": True,
                    "user_query": r.get("user_query", pd.NA),
                })
                if VERBOSE:
                    print(f"❌ {fac} / {crs}: user_query='{user_query}' returned no results.")

        except requests.HTTPError as e:
            status = e.response.status_code
            last_note = f"{status} {e}"
            enriched_rows.append({
                "facility_course_key": key,
                "api_course_id": None,
                "api_payload_json": None,
                "api_query_used": user_query,
                "api_last_checked_ts": datetime.utcnow().isoformat(),
                "api_match_strategy": None,
                "api_last_note": last_note,
                "api_success": False,
                "api_tried_all_strategies": False,
                "run_api_again": True,
                "user_query": r.get("user_query", pd.NA),
            })
            if status == 429 and STOP_ON_FIRST_429:
                print("⛔ Hit API rate limit; stopping this run.")
                hit_rate_limit = True
            continue

    # ---------- CASE 2: implicit auto-enrichment ----------
    else:
        tried_all = True
        for strat_idx, q in enumerate(
            auto_candidate_queries(fac, crs, hint_city, hint_state),
            start=1
        ):
            if api_calls_made >= MAX_API_CALLS_THIS_RUN:
                # budget ran out before every strategy was tried
                tried_all = False
                break

            try:
                courses = gca_search(q)
                api_calls_made += 1

                if courses:
                    best_match = courses[0]
                    # try detail if we still have budget
                    if api_calls_made < MAX_API_CALLS_THIS_RUN:
                        try:
                            detail_payload = gca_course_detail(best_match["id"])
                            api_calls_made += 1
                            last_note = "success"
                        except requests.HTTPError as e_det:
                            detail_payload = None
                            last_note = f"detail failed: {e_det}"
                    else:
                        last_note = "success (detail skipped due to cap)"

                    enriched_rows.append({
                        "facility_course_key": key,
                        "api_course_id": best_match.get("id"),
                        "api_payload_json": json.dumps(detail_payload) if detail_payload else None,
                        "api_query_used": q,
                        "api_last_checked_ts": datetime.utcnow().isoformat(),
                        "api_match_strategy": strat_idx,
                        "api_last_note": last_note,
                        "api_success": True,
                        "api_tried_all_strategies": False,
                        "run_api_again": r.get("run_api_again", pd.NA),
                        "user_query": r.get("user_query", pd.NA),
                    })

                    if VERBOSE:
                        print(f"✅ {fac} / {crs}: matched with auto query='{q}' (strategy {strat_idx})")
                    break

                else:
                    if VERBOSE:
                        print(f"   ... no results for query='{q}', trying next")
                    time.sleep(PAUSE_SECONDS)

            except requests.HTTPError as e:
                status = e.response.status_code
                last_note = f"{status} {e}"
                if status == 429 and STOP_ON_FIRST_429:
                    print(f"⛔ Hit API rate limit on '{q}'; stopping this run.")
                    enriched_rows.append({
                        "facility_course_key": key,
                        "api_course_id": None,
                        "api_payload_json": None,
                        "api_query_used": q,
                        "api_last_checked_ts": datetime.utcnow().isoformat(),
                        "api_match_strategy": None,
                        "api_last_note": last_note,
                        "api_success": False,
                        "api_tried_all_strategies": False,
                        "run_api_again": r.get("run_api_again", pd.NA),
                        "user_query": r.get("user_query", pd.NA),
                    })
                    hit_rate_limit = True
                    tried_all = False
                    break
                else:
                    if VERBOSE:
                        print(f"⚠️ API fetch failed for {fac} / {crs} with query='{q}': {e}")
                    time.sleep(PAUSE_SECONDS)

            except Exception as e:
                last_note = str(e)
                if VERBOSE:
                    print(f"⚠️ Unexpected error for {fac} / {crs} with query='{q}': {e}")
                time.sleep(PAUSE_SECONDS)

        # exhausted strategies with no match
        if (best_match is None) and (not hit_rate_limit):
            enriched_rows.append({
                "facility_course_key": key,
                "api_course_id": None,
                "api_payload_json": None,
                "api_query_used": None,
                "api_last_checked_ts": datetime.utcnow().isoformat(),
                "api_match_strategy": None,
                "api_last_note": last_note or "no strategy matched",
                "api_success": False,
                "api_tried_all_strategies": tried_all,
                "run_api_again": r.get("run_api_again", pd.NA),
                "user_query": r.get("user_query", pd.NA),
            })
            if VERBOSE:
                print(f"❌ No match for {fac} / {crs} after auto strategies.")

# ------------------------------------------------------
# 6) Upsert into cache (in-memory) and write to disk
# ------------------------------------------------------
if enriched_rows:
    new_api_df = pd.DataFrame(enriched_rows)

    # drop duplicate keys in existing, keep last
    facilities_api_cache = facilities_api_cache.drop_duplicates(
        subset=["facility_course_key"],
        keep="last",
    )

    facilities_api_cache = (
        pd.concat([facilities_api_cache, new_api_df], ignore_index=True)
        .drop_duplicates(subset=["facility_course_key"], keep="last")
    )

# write cache ONLY
facilities_api_cache.to_csv(API_CACHE_PATH, index=False)

# ------------------------------------------------------
# 7) Governance: assumption + lineage + step log
# ------------------------------------------------------
record_assumption(
    text="External GolfCourseAPI enrichment is staged in a cache file and not applied directly to the main facilities table.",
    rationale="Prevents bad API matches or rate-limit errors from corrupting production-like facility dimensions.",
    impact_area="Facility enrichment pipeline (3.4.x)"
)

track_transform(
    stage_name="3.4.5_enrich_facilities_from_golfcourseapi_cache_only",
    df_before=fac_enrich,
    df_after=fac_enrich,  # structure unchanged, but we log the action
    notes=(
        "Queried GolfCourseAPI for facilities that needed enrichment; "
        f"updated cache at {API_CACHE_PATH.name}; API calls made: {api_calls_made}."
    ),
)

log_step(
    step_name="3.4.5 Enrich Facilities from GolfCourseAPI (cache-only)",
    description="Selective GolfCourseAPI lookups based on run_api_again and cache status; wrote results to facilities_api_cache.csv only.",
    inputs=["facilities", str(API_CACHE_PATH)],
    outputs=[str(API_CACHE_PATH)],
    df=fac_enrich,
    extra_info={
        "rows_in_facilities": len(facilities),
        "rows_needing_api": len(rows_needing_api),
        "api_calls_made": api_calls_made,
        "hit_rate_limit": hit_rate_limit,
        "cache_rows_after": len(facilities_api_cache),
    },
)

# ------------------------------------------------------
# 8) Reviewer peek
# ------------------------------------------------------
display(facilities_api_cache.head(30))


🔎 Total facilities in scope: 30
🔎 Facilities needing API this run: 0
📌 Assumption logged: External GolfCourseAPI enrichment is staged in a cache file and not applied directly to the main facilities table.  | Impact: Facility enrichment pipeline (3.4.x)
🔄 Transform logged: 3.4.5_enrich_facilities_from_golfcourseapi_cache_only
   Rows 30 → 30 (0 change)
   Queried GolfCourseAPI for facilities that needed enrichment; updated cache at facilities_api_cache.csv; API calls made: 0.
✅ 3.4.5 Enrich Facilities from GolfCourseAPI (cache-only) @ 2025-11-16 19:33:25
   DataFrame shape: 30 rows × 26 cols
   Selective GolfCourseAPI lookups based on run_api_again and cache status; wrote results to facilities_api_cache.csv only.
   rows_in_facilities: 30
   rows_needing_api: 0
   api_calls_made: 0
   hit_rate_limit: False
   cache_rows_after: 30


Unnamed: 0,facility_course_key,api_course_id,api_payload_json,api_query_used,api_last_checked_ts,api_match_strategy,api_last_note,api_success,api_tried_all_strategies,run_api_again,user_query
0,Atlas_Valley_Country_Club__Atlas_Valley,28981.0,"{""course"": {""id"": 28981, ""club_name"": ""Atlas V...",Atlas Valley Country Club Atlas Valley Grand B...,2025-11-04T20:45:12.175921,9.0,,True,False,False,
1,Bay_Hills_Golf_Club__Bay_Hills,20030.0,"{""course"": {""id"": 20030, ""club_name"": ""Bay Hil...",Bay Hills Golf Club Bay Hills,2025-11-04T20:45:15.245823,4.0,,True,False,False,
2,Delray_Beach_Golf_Club__Delray_Beach,30297.0,"{""course"": {""id"": 30297, ""club_name"": ""Delray ...",Delray Beach Golf Club Delray Beach Delray Beach,2025-11-04T20:45:16.493554,2.0,,True,False,False,
3,Enterprise_Golf_Course__Enterprise_Golf_Course,20099.0,"{""course"": {""id"": 20099, ""club_name"": ""Enterpr...",Enterprise Golf Course Enterprise Golf Course,2025-11-04T21:16:05.975763,4.0,,True,False,False,
4,Langston_Golf_Course__Langston,20241.0,"{""course"": {""id"": 20241, ""club_name"": ""Langsto...",Langston Golf Course Langston,2025-11-04T21:17:30.567011,4.0,,True,False,False,
5,Lewisburg_Elks_Country_Club__Lewisburg_Elks,22404.0,"{""course"": {""id"": 22404, ""club_name"": ""Lewisbu...",lewisburg elks lewisburg elks,2025-11-04T21:17:39.189668,9.0,,True,False,False,
6,Marlton_Golf_Club__Marlton,19958.0,"{""course"": {""id"": 19958, ""club_name"": ""Marlton...",Marlton Golf Club Marlton,2025-11-04T21:17:42.336157,4.0,,True,False,False,
7,Needwood_Golf_Course__Needwood,19995.0,"{""course"": {""id"": 19995, ""club_name"": ""Needwoo...",needwood needwood,2025-11-04T21:17:49.938526,9.0,,True,False,False,
8,Northwest_Golf_Course__Main,19985.0,"{""course"": {""id"": 19985, ""club_name"": ""Northwe...",Northwest Golf Course,2025-11-04T21:17:55.787601,7.0,,True,False,False,
9,Ocean_City_Golf_Club__Seaside,6862.0,"{""course"": {""id"": 6862, ""club_name"": ""Ocean Ci...",Ocean City Golf Club Seaside,2025-11-04T21:18:09.008119,4.0,,True,False,False,


#### (ON HOLD) 3.4.6  Parse and Normalize Cached Facility API Data
Parse JSON payloads and normalize for merging.

#### (ON HOLD) 3.4.7 Validate Facility Locations vs our geocoded locations
Compare the vendor's course coordinates with the geocoded locations from Steps 3.4.2-3.4.x.
Add an `api_facility_location_suspect` flag when the distance between the API coordinates and the observed centroid exceeds a threshold (e.g. >10 km).
This is a QA signal that answers: "Is the API record really the course we think it is?"
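
Since this step is on hold, the following is only a sketch of how the distance check might look, using the haversine great-circle formula. `haversine_km` and `location_suspect` are hypothetical names, and the 10 km default simply mirrors the example threshold mentioned above.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def location_suspect(api_lat, api_lon, obs_lat, obs_lon, threshold_km=10.0):
    """Flag a facility when API and observed coordinates disagree by more than the threshold."""
    return haversine_km(api_lat, api_lon, obs_lat, obs_lon) > threshold_km


# Two courses at the same facility (coordinates from the 3.4.4 sample) sit close together.
print(location_suspect(38.868, -77.025, 38.877, -77.030))
# Coordinates from different states are clearly a mismatch.
print(location_suspect(42.944, -83.539, 26.453, -80.099))
```

A flag like this only labels rows for review; consistent with the rest of the notebook, nothing would be dropped automatically.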

#### (ON HOLD) 3.4.8 Check GolfCourseAPI Course Data Against Official USGA
Manual spot-check step to confirm that slope, rating, par, and similar values from the API agree with officially published numbers.
This step will not pull from USGA automatically (legal terms); instead, you will manually verify a few key courses and record notes in `facilities`.

#### (ON HOLD) 3.4.9 Apply to Facilities Table
Apply confirmed data to facilities table.

#### (ON HOLD) 3.4.10 Facility-level Close Out

## ======================================================
## 3.5 Phase 3 Close-Out Summary
## ======================================================

#### **Purpose**
Step 3.5 finalizes **Phase 3: Data Preparation**, transforming the governed, cleaned, and enriched GolfShot dataset into a fully documented, analysis-ready foundation for Phases 4–6.  
It completes the bridge between **data engineering** and **analytics deployment**, ensuring every table, column, and transformation has been validated, logged, and exported with complete traceability.

---

#### **What Was Done**
| Substep | Description | Key Outcomes |
|----------|--------------|---------------|
| **3.5.1 – Create DataFrames for Phase 4** | Introduced dimensional data modeling by splitting `golf_valid` into round-, hole-, and shot-level datasets with unique relational keys (`round_id`, `hole_id`, `row_id`). Added round-level context to hole and shot tables for flexible filtering and dashboard interactivity. | Created `golf_rounds`, `golf_holes`, and `golf_shots` DataFrames. Updated data dictionary with all new tables, columns, and lineage details. |
| **3.5.2 – Data Preparation Close-Out (Exports)** | Consolidated all governed DataFrames, logs, and dictionaries into non-timestamped, structured export folders for long-term reproducibility and BI integration. | Delivered clean exports in `/deliverables/phase3_deliverables` and `/data/processed/phase3_exports`, plus a unified `phase3_exports.xlsx` for Tableau and Power BI. |

---

#### **Why It Matters**
- Establishes a **single source of truth** for all downstream analysis.  
- Locks in schema stability and documentation, ensuring full reproducibility.  
- Delivers single-grain, relationally keyed datasets (round context is deliberately repeated on hole and shot rows) suited to Tableau dashboards.  
- Creates a governed audit trail of every transformation and validation step.

---

#### **Outputs**
| Output Type | Description |
|--------------|-------------|
| **Analysis-Ready Tables** | `golf_rounds`, `golf_holes`, `golf_shots`, `facilities`, `player_club_profile` |
| **Governance Artifacts** | `assumptions_log`, `validation_log`, `transform_log`, `step_log`, and master `data_dictionary` |
| **Excel Deliverables** | `/deliverables/phase3_deliverables/phase3_deliverables.xlsx` and `/data/processed/phase3_exports.xlsx` (all Phase 3 tables and logs in one governed file) |

---

#### **Transition to Next Phases**
Phase 3 concludes the data preparation journey.  
The project now enters the **analytics and visualization phase**, where insights, performance metrics, and dashboards will be developed using the exported, governed datasets.

**‚úÖ Phase 3 Complete:**  
All data has been validated, documented, and exported, forming the governed foundation for analytical modeling, visualization, and deployment.


### ======================================================
### 3.5.1 Create DataFrames for Phase 4
### ======================================================

**Inputs**  
- `golf_valid` (fully prepared table from Phase 3; contains round-, hole-, and shot-grain fields)  
- Governance helpers: `validate_columns`, `track_transform`, `log_step`, `generate_data_dictionary`

**What this step does**  
1. Adds a durable `hole_id` (`round_id` + `"-"` + `hole_number`) to `golf_valid` so every hole has a unique, reproducible key.  
2. Splits the governed master table into three analysis-ready DataFrames:
   - **`golf_rounds`**: 1 row per `round_id`, containing all round-level, player-level, facility-level, and temporal context fields needed for filtering and aggregation.  
   - **`golf_holes`**: 1 row per scored hole (`hole_id`), with all hole-level scoring and performance metrics, plus round context for flexible dashboard filtering.  
   - **`golf_shots`**: 1 row per shot (where `shot_club` exists), retaining shot-level GPS and yardage diagnostics from Step 2.5, along with its round/hole context for blending and drill-downs.  
3. Registers all four tables (**`golf_valid`**, **`golf_rounds`**, **`golf_holes`**, and **`golf_shots`**) in the master data dictionary.  
   - Each call to `generate_data_dictionary()` documents table-level metadata, including:
     - `table_name`
     - each column's `dtype`
     - a clear **business definition** in `description`
     - the **lineage** (which 3.x step or source system produced it).  
4. This ensures that as Phase 4 begins, every analytical dataset can be traced back to its preparation logic, supporting reproducibility and governance standards.
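
The key construction and grain split described in steps 1 and 2 can be illustrated on a toy table. This is a sketch with made-up rows and a reduced column set; filtering shots with `notna()` on `shot_club` is an assumption about how "where `shot_club` exists" is implemented.

```python
import pandas as pd

# Toy master table: one row per shot, with round and hole fields repeated.
master = pd.DataFrame({
    "round_id": [1, 1, 1, 2],
    "hole_number": [1, 1, 2, 1],
    "row_id": [10, 11, 12, 13],
    "round_score": [90, 90, 90, 85],
    "hole_score": [5, 5, 4, 3],
    "shot_club": ["Driver", "7i", None, "Putter"],
})

# Durable hole key: round_id + "-" + hole_number.
master["hole_id"] = master["round_id"].astype(str) + "-" + master["hole_number"].astype(str)

# Round grain: one row per round_id.
rounds = (
    master[["round_id", "round_score"]]
    .drop_duplicates(subset=["round_id"])
    .reset_index(drop=True)
)

# Hole grain: one row per hole_id, keeping round_id as context.
holes = (
    master[["round_id", "hole_id", "hole_number", "hole_score"]]
    .drop_duplicates(subset=["hole_id"])
    .reset_index(drop=True)
)

# Shot grain: only rows that actually record a club.
shots = (
    master.loc[master["shot_club"].notna(), ["round_id", "hole_id", "row_id", "shot_club"]]
    .reset_index(drop=True)
)

print(len(rounds), len(holes), len(shots))
```

Building `hole_id` on the master table before splitting means the hole and shot tables inherit the same key values without recomputing them.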

**Why it matters**  
- Business Intelligence (BI) tools like Tableau and Power BI work best on single-grain datasets, where each table has exactly one row per round, hole, or shot.  
- Carrying round-level context down into the hole- and shot-level tables keeps cross-table filters (e.g., by player, facility, season, or time of day) working when users interact with visualizations at different grains.  
- Adding `hole_id` now establishes a **consistent relational key structure** (`round_id` → `hole_id` → `row_id`) for downstream dashboards.  
- Expanding the project's **data dictionary** at this point locks in complete documentation of every column's purpose and lineage, preventing ambiguity later.
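
One way to confirm the key chain holds is a small integrity check like the sketch below. `check_key_chain` is a hypothetical helper, not one of the notebook's governance utilities, and the sample frames are made up.

```python
import pandas as pd


def check_key_chain(rounds, holes, shots):
    """Report whether each key is unique and every child row has a parent."""
    return {
        "round_id_unique": bool(rounds["round_id"].is_unique),
        "hole_id_unique": bool(holes["hole_id"].is_unique),
        "row_id_unique": bool(shots["row_id"].is_unique),
        "holes_have_round": bool(holes["round_id"].isin(rounds["round_id"]).all()),
        "shots_have_hole": bool(shots["hole_id"].isin(holes["hole_id"]).all()),
    }


rounds_demo = pd.DataFrame({"round_id": [1, 2]})
holes_demo = pd.DataFrame({"hole_id": ["1-1", "1-2", "2-1"], "round_id": [1, 1, 2]})
shots_demo = pd.DataFrame({"row_id": [10, 11, 12], "hole_id": ["1-1", "1-1", "2-1"]})

report = check_key_chain(rounds_demo, holes_demo, shots_demo)
print(report)
```

A report of plain booleans is easy to pass to a step log, so a broken relationship shows up in the audit trail rather than in a dashboard.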

**Outputs**  
- `golf_valid` (same as prior, but now with `hole_id`)  
- `golf_rounds` (round grain; key = `round_id`)  
- `golf_holes` (hole grain; keys = `hole_id`, `round_id`)  
- `golf_shots` (shot grain; keys = `row_id`, `hole_id`, `round_id`)  
- Updated data dictionary entries for all four DataFrames (`golf_valid`, `golf_rounds`, `golf_holes`, `golf_shots`) with column descriptions and lineage mapped to their respective 3.x derivation steps.  
- Step logs documenting record counts and key creation.


In [36]:
# ======================================================
# 3.5.1 Create DataFrames for Phase 4
# ======================================================

# 1) Schema validation gate
validate_columns(
    golf_valid,
    required_cols=[
        "round_id",
        "hole_number",
        "round_dt",
        "facility",
        "course",
        "player_name",
    ],
    context_name="3.5.1 Create DataFrames for Phase 4",
)

gv_before = golf_valid.copy()

# ------------------------------------------------------
# 2) Ensure hole_id exists on the master DF first
#    (round_id + "-" + hole_number) so holes and shots can both reuse it
# ------------------------------------------------------
golf_valid["hole_id"] = (
    golf_valid["round_id"].astype(str) + "-" + golf_valid["hole_number"].astype(str)
)

# ------------------------------------------------------
# 3) Round-level DataFrame (1 row per round_id)
# ------------------------------------------------------
round_cols = [
    "round_id",
    "round_key",
    "player_name",
    "facility",
    "course",
    "round_score",
    "round_fairway_strokes",
    "round_putts",
    "round_gir",
    "round_holes_scored",
    "round_is_partial",
    "round_no_player",
    "round_no_player_course",
    "round_dt",
    "date",
    "time",
    "hour",
    "dow",
    "is_weekend",
    "part_of_day",
    "round_time_valid",
    "year",
    "month",
    "month_name",
    "year_month",
    "season",
    "round_year_index",
]

golf_rounds = (
    golf_valid[round_cols]
    .drop_duplicates(subset=["round_id"])
    .reset_index(drop=True)
)

# ------------------------------------------------------
# 4) Hole-level DataFrame (1 row per round_id + hole_number)
#    includes round context for Tableau convenience
# ------------------------------------------------------
hole_cols = [
    # keys
    "round_key",
    "round_id",
    "hole_id",
    "player_name",
    "facility",
    "course",
    # hole grain
    "hole_number",
    "hole_lat",
    "hole_lon",
    "hole_par",
    "hole_par_bucket",
    "hole_score",
    "hole_strokes_over_par",
    "hole_score_name",
    "hole_fairway_strokes",
    "hole_putts",
    "hole_putts_over_expected",
    "hole_gir",
    "hole_putts_3plus",
    "hole_gir_putts_3plus",
    "hole_notgir_putts_3plus",
    "hole_scramble_opportunity",
    "hole_scramble_success",
    "hole_gir_wasted",
    "hole_notgir_chip_in",
    "hole_is_scoring_chance",
    "hole_is_recovery",
    "shot_fairway_hit_type",
    # round context for filters
    "round_score",
    "round_fairway_strokes",
    "round_putts",
    "round_gir",
    "round_holes_scored",
    "round_no_player",
    "round_no_player_course",
    "round_dt",
    "date",
    "dow",
    "is_weekend",
    "part_of_day",
    "year",
    "month",
    "month_name",
    "year_month",
    "season",
]

golf_holes = (
    golf_valid[hole_cols]
    .drop_duplicates(subset=["hole_id"])
    .reset_index(drop=True)
)

# ------------------------------------------------------
# 5) Shot-level DataFrame (1 row per shot; only rows with actual shots)
#    includes both round and hole context to make Tableau joins easy
# ------------------------------------------------------
shot_cols = [
    # keys
    "round_key",
    "round_id",
    "hole_id",
    "row_id",
    "hole_number",
    # player/course
    "player_name",
    "facility",
    "course",
    # shot grain
    "shot_club",
    "shot_direction",
    "shot_start_lat",
    "shot_start_lon",
    "shot_end_lat",
    "shot_end_lon",
    "yardage",
    "yardage_to_pin",
    "yardage_calc",
    "yardage_error",
    "yardage_error_abs",
    "yardage_error_pct",
    "yardage_suspect",
    "shot_fairway_hit_type",
    # hole context
    "hole_lat",
    "hole_lon",
    "hole_par",
    "hole_par_bucket",
    "hole_score",
    "hole_strokes_over_par",
    "hole_score_name",
    "hole_fairway_strokes",
    "hole_putts",
    "hole_putts_over_expected",
    "hole_gir",
    "hole_putts_3plus",
    "hole_gir_putts_3plus",
    "hole_notgir_putts_3plus",
    "hole_scramble_opportunity",
    "hole_scramble_success",
    "hole_gir_wasted",
    "hole_notgir_chip_in",
    "hole_is_scoring_chance",
    "hole_is_recovery",
    # round context
    "round_score",
    "round_putts",
    "round_gir",
    "round_holes_scored",
    "round_no_player",
    "round_no_player_course",
    "round_dt",
    "date",
    "dow",
    "is_weekend",
    "part_of_day",
    "year",
    "month",
    "month_name",
    "year_month",
    "season",
]

golf_shots = (
    golf_valid.loc[golf_valid["shot_club"].notna(), shot_cols]
    .reset_index(drop=True)
)

# ------------------------------------------------------
# 6) Data dictionaries for derived Phase 4 tables
# ------------------------------------------------------

# a) golf_rounds
generate_data_dictionary(
    golf_rounds,
    table_name="golf_rounds",
    desc_map={
        "round_id": "Surrogate round key (1 row per round).",
        "round_key": "Human-readable round identifier, stable across reruns.",
        "player_name": "Player / golfer name as captured in source.",
        "round_dt": "Timestamp of round start from source.",
        "facility": "Facility / club name from source.",
        "course": "Course / loop name at facility.",
        "round_score": "Total scored strokes for the round.",
        "round_fairway_strokes": "Total fairway strokes recorded for the round.",
        "round_putts": "Total putts recorded for the round.",
        "round_gir": "Greens in regulation value at round grain (source/derived).",
        "round_holes_scored": "Number of distinct scored holes in this round.",
        "round_no_player": "Sequential round number per player (1 = earliest).",
        "round_no_player_course": "Sequential round (visit) number per player×course.",
        "date": "Calendar date of the round, derived from round_dt.",
        "dow": "Day-of-week label (Mon, Tue, ...).",
        "is_weekend": "True if round_dt is Saturday or Sunday.",
        "part_of_day": "Bucketed tee-time band (Morning, Afternoon, Evening, Night, Unknown).",
        "year": "Calendar year of round_dt.",
        "month": "Calendar month (1–12) of round_dt.",
        "month_name": "3-letter month label.",
        "year_month": "YYYY-MM string for easy grouping.",
        "season": "Season bucket derived from month.",
    },
    lineage_map={
        "round_id": "Created in 3.1.7 from canonical round key.",
        "round_key": "Created in 3.1.7 alongside round_id.",
        "round_holes_scored": "Carried forward from 3.1.3/3.1.4 completeness logic.",
        "round_no_player": "Assigned in 3.1.5.",
        "round_no_player_course": "Assigned in 3.1.6.",
        "part_of_day": "Derived in 3.1.8.1 and validated in 3.1.8.2.",
        "season": "Derived in 3.1.8.3.",
    },
)

# b) golf_holes
generate_data_dictionary(
    golf_holes,
    table_name="golf_holes",
    desc_map={
        "round_key": "Parent round identifier.",
        "hole_id": "Surrogate hole key = round_id + '-' + hole_number.",
        "round_id": "Parent round key.",
        "player_name": "Player name (carried down from round).",
        "round_dt": "Round timestamp (carried down).",
        "facility": "Facility / club name.",
        "course": "Course / loop name.",
        "hole_number": "Hole number within the round (1–18).",
        "hole_lat": "Hole latitude derived from the shot-level centroid.",
        "hole_lon": "Hole longitude derived from the shot-level centroid.",
        "hole_par": "Par for this hole.",
        "hole_par_bucket": "String label for par (e.g. 'Par 3', 'Par 4').",
        "hole_score": "Strokes taken on this hole.",
        "hole_strokes_over_par": "hole_score - hole_par.",
        "hole_score_name": "Golf-friendly name for outcome (Birdie, Par, Bogey, ...).",
        "hole_putts": "Putts taken on this hole.",
        "hole_putts_over_expected": "Putts - 2 (simple expectation).",
        "hole_putts_3plus": "Flag for 3+ putts.",
        "hole_gir": "Green in regulation flag.",
        "hole_gir_putts_3plus": "3+ putts on GIR holes.",
        "hole_notgir_putts_3plus": "3+ putts on non-GIR holes.",
        "hole_scramble_opportunity": "True if player missed GIR but still had strokes to save par.",
        "hole_scramble_success": "True if scramble opportunity and par-or-better achieved.",
        "hole_gir_wasted": "True if GIR but scored over par.",
        "hole_notgir_chip_in": "True if non-GIR save with 0 putts.",
        "hole_is_scoring_chance": "Tag for analysis: GIR or better opportunity.",
        "hole_is_recovery": "Tag for analysis: scramble/recovery situation.",
        "round_score": "Round grain metric carried down for dashboard filtering.",
        "round_fairway_strokes": "Round grain metric carried down for dashboard filtering.",
        "round_putts": "Round grain metric carried down for dashboard filtering.",
        "round_gir": "Round grain metric carried down for dashboard filtering.",
        "round_holes_scored": "Round grain metric carried down for dashboard filtering.",
        "round_no_player": "Round sequence per player (carried down).",
        "round_no_player_course": "Course-specific sequence per player (carried down).",
        "dow": "Day-of-week from round_dt (carried down).",
        "is_weekend": "Weekend flag from round_dt (carried down).",
        "part_of_day": "Part-of-day from round_dt (carried down).",
        "year": "Year from round_dt (carried down).",
        "month": "Month from round_dt (carried down).",
        "month_name": "Month label from round_dt (carried down).",
        "year_month": "YYYY-MM from round_dt (carried down).",
        "season": "Season from round_dt (carried down).",
        "shot_fairway_hit_type": "Captured value from source at hole/shot marking time.",
    },
    lineage_map={
        "hole_id": "Created in 3.5.1 to support Tableau relationships.",
        "hole_par_bucket": "Derived in 3.2.1.",
        "hole_strokes_over_par": "Derived in 3.2.1.",
        "hole_score_name": "Derived in 3.2.1.",
        "hole_putts_over_expected": "Derived in 3.2.2.",
        "hole_scramble_opportunity": "Derived in 3.2.3.",
        "hole_scramble_success": "Derived in 3.2.3.",
        "hole_gir_wasted": "Derived in 3.2.3.",
        "hole_notgir_chip_in": "Derived in 3.2.3.",
        "hole_is_scoring_chance": "Derived in 3.2.3.",
        "hole_is_recovery": "Derived in 3.2.3.",
    },
)

# c) golf_shots
generate_data_dictionary(
    golf_shots,
    table_name="golf_shots",
    desc_map={
        "round_key": "Parent round identifier.",
        "row_id": "Original row identifier from ingest (shot-level).",
        "round_id": "Parent round key.",
        "hole_id": "Parent hole key (round_id + '-' + hole_number).",
        "hole_number": "Hole number this shot belongs to.",
        "hole_lat": "Hole latitude derived from the shot-level centroid.",
        "hole_lon": "Hole longitude derived from the shot-level centroid.",
        "hole_par": "Par for this hole.",
        "hole_par_bucket": "String label for par (e.g. 'Par 3', 'Par 4').",
        "hole_score": "Strokes taken on this hole.",
        "hole_strokes_over_par": "hole_score - hole_par.",
        "hole_score_name": "Golf-friendly name for outcome (Birdie, Par, Bogey, ...).",
        "hole_putts": "Putts taken on this hole.",
        "hole_putts_over_expected": "Putts - 2 (simple expectation).",
        "hole_putts_3plus": "Flag for 3+ putts.",
        "hole_gir": "Green in regulation flag.",
        "hole_gir_putts_3plus": "3+ putts on GIR holes.",
        "hole_notgir_putts_3plus": "3+ putts on non-GIR holes.",
        "hole_scramble_opportunity": "True if player missed GIR but still had strokes to save par.",
        "hole_scramble_success": "True if scramble opportunity and par-or-better achieved.",
        "hole_gir_wasted": "True if GIR but scored over par.",
        "hole_notgir_chip_in": "True if non-GIR save with 0 putts.",
        "hole_is_scoring_chance": "Tag for analysis: GIR or better opportunity.",
        "hole_is_recovery": "Tag for analysis: scramble/recovery situation.",
        "player_name": "Player name.",
        "facility": "Facility / club name.",
        "course": "Course / loop name.",
        "shot_club": "Club recorded for this shot.",
        "shot_direction": "Directional outcome from source app, if present.",
        "shot_start_lat": "GPS start latitude for this shot.",
        "shot_start_lon": "GPS start longitude for this shot.",
        "shot_end_lat": "GPS end latitude for this shot.",
        "shot_end_lon": "GPS end longitude for this shot.",
        "yardage": "Vendor-provided yardage for this shot.",
        "yardage_calc": "GPS-derived yardage from 2.5.",
        "yardage_error": "GPS-derived yardage minus vendor yardage.",
        "yardage_error_abs": "Absolute error between GPS and vendor yardage.",
        "yardage_error_pct": "Absolute percent error between GPS and vendor yardage.",
        "yardage_suspect": "True if yardage error exceeded tolerance in 2.5.",
        "round_score": "Round-level metric carried down for dashboard filtering.",
        "round_fairway_strokes": "Round-level metric carried down for dashboard filtering.",
        "round_putts": "Round-level metric carried down for dashboard filtering.",
        "round_gir": "Round-level metric carried down for dashboard filtering.",
        "round_holes_scored": "Round-level metric carried down for dashboard filtering.",
        "round_no_player": "Player-level round sequence (carried down).",
        "round_no_player_course": "Player×course visit sequence (carried down).",
        "dow": "Day-of-week from round_dt (carried down).",
        "is_weekend": "Weekend flag from round_dt (carried down).",
        "part_of_day": "Part-of-day from round_dt (carried down).",
        "year": "Year from round_dt (carried down).",
        "month": "Month from round_dt (carried down).",
        "month_name": "Month label from round_dt (carried down).",
        "year_month": "YYYY-MM from round_dt (carried down).",
        "season": "Season from round_dt (carried down).",
        "shot_fairway_hit_type": "Outcome marking captured at the shot/hole entry time.",
    },
    lineage_map={
        "yardage_calc": "Computed in 2.5 using geodesic distance.",
        "yardage_error": "Computed in 2.5 as yardage_calc - yardage.",
        "yardage_suspect": "Flagged in 2.5 when error exceeded tolerance.",
        "hole_id": "Created in 3.5.1.",
    },
)

# d) golf_valid
generate_data_dictionary(
    golf_valid,
    table_name="golf_valid",
    desc_map={
        "hole_id": "Surrogate hole key added in 3.5.1; not present in raw source.",
    },
    lineage_map={
        "hole_id": "Created in 3.5.1 to support dimensional exports.",
    },
)

# ------------------------------------------------------
# 7) Governance
# ------------------------------------------------------
record_assumption(
    text="Round-, hole-, and shot-level tables will be exported separately for Phase 4 / Tableau, each carrying enough context to filter independently.",
    rationale="Tableau relationship model is simpler when lower-grain tables already contain higher-grain filter columns.",
    impact_area="Phase 4 dashboards; dimensional modeling; export layer",
)

track_transform(
    stage_name="3.5.1_create_phase4_dataframes",
    df_before=gv_before,
    df_after=golf_valid,
    notes="Derived golf_rounds, golf_holes (with hole_id), and golf_shots from golf_valid.",
    new_cols=["hole_id"],
)

log_step(
    step_name="3.5.1 Create DataFrames for Phase 4",
    description="Split governance-ready golf_valid into round-, hole-, and shot-level tables with relational keys.",
    inputs=["golf_valid"],
    outputs=["golf_rounds", "golf_holes", "golf_shots"],
    df=golf_valid,
    extra_info={
        "golf_rounds_rows": len(golf_rounds),
        "golf_holes_rows": len(golf_holes),
        "golf_shots_rows": len(golf_shots),
        "note": "These will be exported in 3.5.2 close-out.",
    },
)

display({
    "golf_rounds": golf_rounds.head(5),
    "golf_holes": golf_holes.head(5),
    "golf_shots": golf_shots.head(5),
})


📘 Data dictionary generated for table 'golf_rounds' (27 columns).
📘 Data dictionary generated for table 'golf_holes' (45 columns).
📘 Data dictionary generated for table 'golf_shots' (58 columns).
📘 Data dictionary generated for table 'golf_valid' (65 columns).
📌 Assumption logged: Round-, hole-, and shot-level tables will be exported separately for Phase 4 / Tableau, each carrying enough context to filter independently.  | Impact: Phase 4 dashboards; dimensional modeling; export layer
🔄 Transform logged: 3.5.1_create_phase4_dataframes
   Rows 3987 → 3987 (0 change)
   Derived golf_rounds, golf_holes (with hole_id), and golf_shots from golf_valid.
✅ 3.5.1 Create DataFrames for Phase 4 @ 2025-11-16 19:33:25
   DataFrame shape: 3987 rows × 65 cols
   Split governance-ready golf_valid into round-, hole-, and shot-level tables with relational keys.
   golf_rounds_rows: 206
   golf_holes_rows: 2799
   golf_shots_rows: 1946
   note: These will be exported in 3.5.2 close

{'golf_rounds':    round_id                                          round_key   player_name                       facility  \
 0         3  David_Brooks__20120607_074057__East_Potomac_Pa...  David Brooks  East Potomac Park Golf Course   
 1         5  David_Brooks__20150703_090742__East_Potomac_Pa...  David Brooks  East Potomac Park Golf Course   
 2         4  David_Brooks__20150703_065613__East_Potomac_Pa...  David Brooks  East Potomac Park Golf Course   
 3         1  David_Brooks__20110605_065801__Rock_Creek_Park...  David Brooks    Rock Creek Park Golf Course   
 4         2  David_Brooks__20110904_080326__Sligo_Creek_Gol...  David Brooks        Sligo Creek Golf Course   
 
             course  round_score  round_fairway_strokes  round_putts  round_gir  round_holes_scored  round_is_partial  \
 0             Blue           88                     54           34     33.333                  18             False   
 1              Red           33                     17           16 

### ======================================================
### 3.5.2 Data Preparation Close-Out (Exports)
### ======================================================

#### **Purpose**
This step officially closes **Phase 3: Data Preparation**, consolidating all validated data, metadata, and governance artifacts into final export formats.  
It functions as the **handoff point** between data engineering and analysis, ensuring that all deliverables are complete, reproducible, and ready for Tableau, Power BI, or Python-based modeling in Phase 4.

---

#### **Inputs**
- Fully prepared DataFrames:
  - `golf_valid`, `golf_rounds`, `golf_holes`, `golf_shots` (fact tables)
  - `facilities`, `player_club_profile` (dimension tables)
  - `golf_raw` (raw ingest) and `golf` (Phase 2 standardized table, exported as `golf_clean`)
- Supporting or intermediate DataFrames:
  - `round_index`, `holes_per_round`, `player_club_hole_dispersion`, `player_club_dispersion_rollup`
  - Optional: `facilities_api_cache` (if API step was run)
- Governance artifacts:
  - `STEP_LOG`, `TRANSFORM_LOG`, `VALIDATION_LOG`, `ASSUMPTIONS_LOG`, `DATA_DICTIONARIES`
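
Because some of these inputs are optional, the export needs a lookup that tolerates tables that were never created. Below is a hedged sketch of one way to do that. `resolve_table` and the `namespace` dictionary are illustrative names rather than the notebook's own helper, and the `isinstance` check reflects an assumption that any non-DataFrame value should be treated as "not available".

```python
import pandas as pd


def resolve_table(namespace: dict, name: str) -> pd.DataFrame:
    """Return namespace[name] if it is a DataFrame, otherwise an empty DataFrame."""
    candidate = namespace.get(name)
    if isinstance(candidate, pd.DataFrame):
        return candidate
    return pd.DataFrame()


# Hypothetical namespace: one real table, one placeholder left as None,
# and one table ("round_index") that was never created.
namespace = {
    "facilities": pd.DataFrame({"facility": ["Example Golf Club"]}),
    "facilities_api_cache": None,
}

present = resolve_table(namespace, "facilities")
skipped = resolve_table(namespace, "facilities_api_cache")
missing = resolve_table(namespace, "round_index")

print(len(present), skipped.empty, missing.empty)
```

In a notebook, `globals()` could be passed as `namespace`. Returning an empty frame instead of raising keeps the set of exported sheets stable when an optional step is skipped.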

---

#### **What This Step Does**
| Category | Description | Outputs |
|-----------|--------------|----------|
| **Governance Exports** | Writes all governance logs (assumptions, validation, step, transform, and data dictionary) to `/deliverables/phase3_deliverables`. | Individual CSV files + a single Excel summary (`phase3_deliverables.xlsx`). |
| **Processed Data Exports** | Exports all core analytical DataFrames to `/data/processed/phase3_exports` for direct use in Tableau or modeling. | One `.xlsx` file per dataset, e.g. `golf_rounds.xlsx`, `facilities.xlsx`, etc. |
| **Consolidated Workbook** | Builds a single multi-sheet Excel file, `/data/processed/phase3_exports.xlsx`, containing all final tables and governance logs in the correct order for handoff. | Unified Excel handoff file for external tools. |
| **Deprecation Cleanup** | Removes unneeded artifacts (`profiles`, `business_rules`, `run_summary`, `golf_phase2`) from the export process per new standards. | Leaner deliverables folder, focused on governed outputs only. |
| **Governance Logging** | Records completion of close-out and the paths of exported files into STEP_LOG for audit traceability. | STEP_LOG entry confirming export success. |
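
The governance export in the table above amounts to turning each in-memory log into a flat file. Here is a minimal sketch under stated assumptions: the logs are lists of dictionaries, the entries shown are invented examples, and a temporary directory stands in for `/deliverables/phase3_deliverables`. It writes CSV only; the Excel summary is omitted to keep the example free of an Excel engine dependency.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Invented example entries; the real logs are populated by the notebook's helpers.
logs = {
    "assumptions_log": [{"text": "Example assumption", "impact_area": "export layer"}],
    "step_log": [{"step_name": "3.5.2 Close-Out", "rows": 3}],
    "validation_log": [],  # an empty log should still produce a file
}

# Fixed (non-timestamped) folder name under a temporary root.
out_dir = Path(tempfile.mkdtemp()) / "phase3_deliverables"
out_dir.mkdir(parents=True, exist_ok=True)

written = {}
for log_name, entries in logs.items():
    frame = pd.DataFrame(entries)        # empty list -> empty DataFrame
    target = out_dir / f"{log_name}.csv"
    frame.to_csv(target, index=False)    # one CSV per governance log
    written[log_name] = target

print(sorted(path.name for path in written.values()))
```

Writing a file even for an empty log keeps the deliverables folder predictable, so a reviewer can tell "nothing was logged" apart from "the export did not run".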

---

#### **Why It Matters**
- Establishes a **controlled analytical baseline**: the version of record for all future phases.  
- Consolidates 15 years of GolfShot data into clean, relational tables (round-, hole-, and shot-level) for dashboard and model use.  
- Guarantees **schema stability and full traceability**: every column is documented and logged.  
- Creates reproducible, shareable deliverables for business stakeholders, instructors, or recruiters reviewing the capstone.
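
The traceability claim can be spot-checked by comparing each exported table's columns with the data dictionary. The sketch below assumes a dictionary frame with `table_name` and `column_name` columns. The real layout produced by `generate_data_dictionary(...)` is not shown in this section, so treat those column names and the helper `undocumented_columns` as hypothetical.

```python
import pandas as pd


def undocumented_columns(
    table: pd.DataFrame, table_name: str, dictionary: pd.DataFrame
) -> list:
    """List columns of `table` that have no dictionary entry for `table_name`."""
    documented = set(
        dictionary.loc[dictionary["table_name"] == table_name, "column_name"]
    )
    return [col for col in table.columns if col not in documented]


# Toy dictionary and table (invented rows, assumed column layout).
toy_dictionary = pd.DataFrame(
    {
        "table_name": ["golf_rounds", "golf_rounds", "golf_holes"],
        "column_name": ["round_id", "round_score", "hole_id"],
    }
)
toy_rounds = pd.DataFrame(
    {"round_id": [1], "round_score": [88], "season": ["Summer"]}
)

gaps = undocumented_columns(toy_rounds, "golf_rounds", toy_dictionary)
print(gaps)  # ['season']
```

An empty result for every exported table would support the statement that every column is documented; a non-empty one points at the columns still needing an entry.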

---

#### **Outputs**
| Output Type | Location | Description |
|--------------|-----------|-------------|
| **Governance Logs (CSV)** | `/deliverables/phase3_deliverables/` | `assumptions_log.csv`, `data_dictionary.csv`, `step_log.csv`, `transform_log.csv`, `validation_log.csv` |
| **Governance Workbook (Excel)** | `/deliverables/phase3_deliverables/phase3_deliverables.xlsx` | Single file containing all five logs above |
| **Processed Data Exports (Excel)** | `/data/processed/phase3_exports/` | Individual `.xlsx` exports for `facilities`, `player_club_profile`, `golf_rounds`, `golf_holes`, `golf_shots`, `golf_valid`, `golf_raw`, `golf_clean`, and `data_dictionary` |
| **Consolidated Workbook** | `/data/processed/phase3_exports.xlsx` | Multi-sheet workbook combining data, logs, and reference tables (for Tableau / Power BI import) |
| **Visualization Workbook** | `/data/processed/golf_capstone_exports.xlsx` | Phase 4 workbook with sheets `data_dictionary`, `facilities`, `rounds`, `holes`, `shots`, and `clubs` |
| **Tableau Public CSVs** | `/data/processed/golf_capstone_exports_csvs/` | `facilities.csv`, `rounds.csv`, `holes.csv`, `shots.csv`, `clubs.csv` |
| **Governance Logs Update** | `STEP_LOG` | Final confirmation entry noting completion of Phase 3 exports and their storage locations |
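
Since folder and file names are fixed, a post-export check can confirm that every governance file listed above actually exists. This is a minimal sketch: `missing_outputs` is a hypothetical helper, and a temporary directory with placeholder files stands in for the real deliverables folder.

```python
import tempfile
from pathlib import Path

# File names taken from the Outputs table for the governance deliverables folder.
EXPECTED_GOVERNANCE_FILES = [
    "assumptions_log.csv",
    "data_dictionary.csv",
    "step_log.csv",
    "transform_log.csv",
    "validation_log.csv",
    "phase3_deliverables.xlsx",
]


def missing_outputs(folder: Path, expected: list) -> list:
    """Return the expected file names that are not present in `folder`."""
    return [name for name in expected if not (folder / name).is_file()]


# Demo folder: create every file except the Excel workbook.
demo_dir = Path(tempfile.mkdtemp())
for name in EXPECTED_GOVERNANCE_FILES[:-1]:
    (demo_dir / name).touch()

still_missing = missing_outputs(demo_dir, EXPECTED_GOVERNANCE_FILES)
print(still_missing)  # ['phase3_deliverables.xlsx']
```

The same helper could be pointed at the processed-data folder with its own list of expected names.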

---

#### **Next Steps**
- **Phase 4 – Analysis & Visualization:** Build Tableau dashboards leveraging `golf_rounds`, `golf_holes`, `golf_shots`, and `player_club_profile`.  
- **Phase 5 – Modeling & Improvement:** Use dispersion, GIR, and scoring consistency metrics for predictive modeling and scenario testing.  
- **Phase 6 – Control & Deployment:** Confirm reproducibility, publish final dashboards, and document model performance.
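
Before relationships are defined in a dashboard tool, it may help to confirm that child tables only reference parents that exist. This sketch uses toy frames in place of the exported `golf_rounds`, `golf_holes`, and `golf_shots`; `orphan_keys` is a hypothetical helper and the values are invented.

```python
import pandas as pd


def orphan_keys(child: pd.DataFrame, parent: pd.DataFrame, key: str) -> list:
    """Return sorted key values found in `child` but absent from `parent`."""
    merged = (
        child[[key]]
        .drop_duplicates()
        .merge(parent[[key]].drop_duplicates(), on=key, how="left", indicator=True)
    )
    return sorted(merged.loc[merged["_merge"] == "left_only", key])


# Toy stand-ins for the three exported grains.
rounds = pd.DataFrame({"round_id": [1, 2]})
holes = pd.DataFrame(
    {"hole_id": ["1-1", "1-2", "2-1"], "round_id": [1, 1, 2]}
)
shots = pd.DataFrame(
    {"hole_id": ["1-1", "1-1", "2-1", "3-1"], "round_id": [1, 1, 2, 3]}
)

holes_without_round = orphan_keys(holes, rounds, "round_id")
shots_without_hole = orphan_keys(shots, holes, "hole_id")
print(holes_without_round, shots_without_hole)
```

Here the shot on hole `3-1` has no parent hole, which is the kind of mismatch that would otherwise show up as unexplained nulls in a joined view.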

---

**✅ Phase 3 Close-Out Summary:**  
All validated datasets, logs, and definitions have been exported to governed locations.  
This marks the official transition from data preparation to analysis: the analytical foundation of the Golf Performance Capstone is now complete, auditable, and ready for deployment.


In [37]:
# ======================================================
# 3.5.2 Data Preparation Close-Out (Exports)
# ======================================================
"""
This step finalizes Phase 3 by exporting:
1) Governance logs → /deliverables/phase3_deliverables (CSV + 1 Excel)
2) Analysis-ready tables → /data/processed/phase3_exports (one file per table)
3) A single consolidated workbook → /data/processed/phase3_exports.xlsx

Notes from requirements:
- DO NOT export profiles, business-rule logs, phase 2 scratch, or timestamped folders.
- Folder names are FIXED (not timestamped).
- Some tables may not exist (e.g. facilities_api_cache if 3.4.5 was skipped); handle safely.
"""

# ------------------------------------------------------
# 0. Helper: safe-get for optional tables
# ------------------------------------------------------
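def _maybe_df(name: str) -> pd.DataFrame:
    """
    Return the global DataFrame called `name`, else an empty DataFrame.
    This lets us keep the export structure stable even if a step was skipped.
    """
    obj = globals().get(name)
    # Guard against names bound to None or other non-DataFrame objects.
    return obj if isinstance(obj, pd.DataFrame) else pd.DataFrame()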


# ------------------------------------------------------
# 1. Reconstruct small helper tables from golf_valid if needed
#    (these weren't always kept as standalone DataFrames earlier)
# ------------------------------------------------------
# hole-level feature subsets (3.2.1–3.2.3)
if "golf_valid" not in globals():
    raise RuntimeError("golf_valid not found; cannot close out Phase 3 exports.")

# 1a) hole features (par bucket, strokes over par, score name)
golf_valid_hole_features = (
    golf_valid[
        [
            "round_id",
            "player_name",
            "facility",
            "course",
            "hole_number",
            "hole_id",  # added in 3.5.1
            "hole_par_bucket",
            "hole_strokes_over_par",
            "hole_score_name",
        ]
    ]
    .copy()
    if "hole_par_bucket" in golf_valid.columns
    else pd.DataFrame()
)

# 1b) hole putting features
golf_valid_hole_features_putting = (
    golf_valid[
        [
            "round_id",
            "player_name",
            "facility",
            "course",
            "hole_number",
            "hole_id",
            "hole_putts",
            "hole_putts_over_expected",
            "hole_putts_3plus",
            "hole_gir_putts_3plus",
            "hole_notgir_putts_3plus",
        ]
    ]
    .copy()
    if "hole_putts_over_expected" in golf_valid.columns
    else pd.DataFrame()
)

# 1c) hole outcome features (scramble, wasted GIR, etc.)
golf_valid_hole_outcome_features = (
    golf_valid[
        [
            "round_id",
            "player_name",
            "facility",
            "course",
            "hole_number",
            "hole_id",
            "hole_scramble_opportunity",
            "hole_scramble_success",
            "hole_gir_wasted",
            "hole_notgir_chip_in",
            "hole_is_scoring_chance",
            "hole_is_recovery",
        ]
    ]
    .copy()
    if "hole_scramble_opportunity" in golf_valid.columns
    else pd.DataFrame()
)

# ------------------------------------------------------
# 2. Build export directory structure (non-timestamped)
# ------------------------------------------------------
deliver_dir = OUTPUT_PATH / "phase3_deliverables"
deliver_dir.mkdir(parents=True, exist_ok=True)

processed_dir = PROCESSED_PATH / "phase3_exports"
processed_dir.mkdir(parents=True, exist_ok=True)

# ------------------------------------------------------
# 3. Gather governance logs (CSV → deliverables)
# ------------------------------------------------------
assumptions_df = pd.DataFrame(ASSUMPTIONS_LOG) if len(ASSUMPTIONS_LOG) > 0 else pd.DataFrame()
step_df = pd.DataFrame(STEP_LOG) if len(STEP_LOG) > 0 else pd.DataFrame()
transform_df = pd.DataFrame(TRANSFORM_LOG) if len(TRANSFORM_LOG) > 0 else pd.DataFrame()
validation_df = pd.DataFrame(VALIDATION_LOG) if len(VALIDATION_LOG) > 0 else pd.DataFrame()

# data dictionary: concatenate everything we collected
if len(DATA_DICTIONARIES) > 0:
    data_dict_df = pd.concat(DATA_DICTIONARIES, ignore_index=True)
else:
    data_dict_df = pd.DataFrame()

# write CSVs to /deliverables/phase3_deliverables
assumptions_df.to_csv(deliver_dir / "assumptions_log.csv", index=False)
data_dict_df.to_csv(deliver_dir / "data_dictionary.csv", index=False)
step_df.to_csv(deliver_dir / "step_log.csv", index=False)
transform_df.to_csv(deliver_dir / "transform_log.csv", index=False)
validation_df.to_csv(deliver_dir / "validation_log.csv", index=False)

# also: single XLSX in the same folder with those 5 tabs
with pd.ExcelWriter(deliver_dir / "phase3_deliverables.xlsx", engine="xlsxwriter") as writer:
    assumptions_df.to_excel(writer, sheet_name="assumptions_log", index=False)
    data_dict_df.to_excel(writer, sheet_name="data_dictionary", index=False)
    step_df.to_excel(writer, sheet_name="step_log", index=False)
    transform_df.to_excel(writer, sheet_name="transform_log", index=False)
    validation_df.to_excel(writer, sheet_name="validation_log", index=False)

# ------------------------------------------------------
# 4. Export analysis tables to /data/processed/phase3_exports
#    one file per table, NO timestamps
# ------------------------------------------------------
def _to_xlsx(df: pd.DataFrame, path: Path):
    # save even empty frames to keep structure predictable
    df.to_excel(path, index=False)

# core Phase 3 tables
_to_xlsx(_maybe_df("facilities"), processed_dir / "facilities.xlsx")
_to_xlsx(_maybe_df("player_club_profile"), processed_dir / "player_club_profile.xlsx")
_to_xlsx(_maybe_df("golf_rounds"), processed_dir / "golf_rounds.xlsx")
_to_xlsx(_maybe_df("golf_holes"), processed_dir / "golf_holes.xlsx")
_to_xlsx(_maybe_df("golf_shots"), processed_dir / "golf_shots.xlsx")
_to_xlsx(_maybe_df("golf_valid"), processed_dir / "golf_valid.xlsx")

# phase 2 / raw
_to_xlsx(_maybe_df("golf_raw"), processed_dir / "golf_raw.xlsx")
_to_xlsx(_maybe_df("golf"), processed_dir / "golf_clean.xlsx")  # golf == Phase 2 standardized

# data dictionary
_to_xlsx(data_dict_df, processed_dir / "data_dictionary.xlsx")

# ------------------------------------------------------
# 5. Consolidated workbook → /data/processed/phase3_exports.xlsx
#    with all requested sheets in the order specified
# ------------------------------------------------------
consolidated_path = PROCESSED_PATH / "phase3_exports.xlsx"
with pd.ExcelWriter(consolidated_path, engine="xlsxwriter") as writer:
    # 1. data_dictionary
    data_dict_df.to_excel(writer, sheet_name="data_dictionary", index=False)

    # 2–6. main fact/dimension tables
    _maybe_df("golf_rounds").to_excel(writer, sheet_name="golf_rounds", index=False)
    _maybe_df("golf_holes").to_excel(writer, sheet_name="golf_holes", index=False)
    _maybe_df("golf_shots").to_excel(writer, sheet_name="golf_shots", index=False)
    _maybe_df("facilities").to_excel(writer, sheet_name="facilities", index=False)
    _maybe_df("player_club_profile").to_excel(writer, sheet_name="player_club_profile", index=False)

    # 7–10. governance logs
    assumptions_df.to_excel(writer, sheet_name="assumptions_log", index=False)
    step_df.to_excel(writer, sheet_name="step_log", index=False)
    transform_df.to_excel(writer, sheet_name="transform_log", index=False)
    validation_df.to_excel(writer, sheet_name="validation_log", index=False)

    # 11–13. source / intermediate
    _maybe_df("golf_raw").to_excel(writer, sheet_name="golf_raw", index=False)
    _maybe_df("golf").to_excel(writer, sheet_name="golf_clean", index=False)
    _maybe_df("golf_valid").to_excel(writer, sheet_name="golf_valid", index=False)

    # 14–15. round helpers
    _maybe_df("round_index").to_excel(writer, sheet_name="round_index", index=False)
    _maybe_df("holes_per_round").to_excel(writer, sheet_name="holes_per_round", index=False)

    # 16–18. hole feature subsets
    golf_valid_hole_features.to_excel(writer, sheet_name="hole_features", index=False)
    golf_valid_hole_features_putting.to_excel(writer, sheet_name="hole_putting", index=False)
    golf_valid_hole_outcome_features.to_excel(writer, sheet_name="hole_outcome", index=False)

    # 19–20. dispersion tables
    _maybe_df("player_club_hole_dispersion").to_excel(writer, sheet_name="hole_dispersion", index=False)
    _maybe_df("player_club_dispersion_rollup").to_excel(writer, sheet_name="club_dispersion", index=False)

    # 21. facilities_api_cache (may be empty)
    _maybe_df("facilities_api_cache").to_excel(writer, sheet_name="facilities_api_cache", index=False)

print(f"📦 Phase 3 exports written to:\n - {deliver_dir}\n - {processed_dir}\n - {consolidated_path}")

# ------------------------------------------------------
# 6. Consolidated workbook for Phase 4 → /data/processed/golf_capstone_exports.xlsx
#    with all requested sheets in the order specified
# ------------------------------------------------------
viz_path = PROCESSED_PATH / "golf_capstone_exports.xlsx"
with pd.ExcelWriter(viz_path, engine="xlsxwriter") as writer:
    # 1. data_dictionary
    data_dict_df.to_excel(writer, sheet_name="data_dictionary", index=False)

    # 2–6. main fact/dimension tables
    _maybe_df("facilities").to_excel(writer, sheet_name="facilities", index=False)
    _maybe_df("golf_rounds").to_excel(writer, sheet_name="rounds", index=False)
    _maybe_df("golf_holes").to_excel(writer, sheet_name="holes", index=False)
    _maybe_df("golf_shots").to_excel(writer, sheet_name="shots", index=False)
    _maybe_df("player_club_profile").to_excel(writer, sheet_name="clubs", index=False)

# ------------------------------------------------------
# 7. CSV exports for Tableau Public → /data/processed/golf_capstone_exports_csvs
#    (no data_dictionary CSV)
# ------------------------------------------------------
csv_dir = PROCESSED_PATH / "golf_capstone_exports_csvs"
csv_dir.mkdir(parents=True, exist_ok=True)

csv_specs = [
    ("facilities",           "facilities.csv"),
    ("golf_rounds",          "rounds.csv"),
    ("golf_holes",           "holes.csv"),
    ("golf_shots",           "shots.csv"),
    ("player_club_profile",  "clubs.csv"),
]

for df_name, filename in csv_specs:
    df = _maybe_df(df_name)
    if df is not None:
        df.to_csv(csv_dir / filename, index=False)

print(
    "📦 Phase 3 Visualization exports written to:\n"
    f" - {deliver_dir}\n"
    f" - {processed_dir}\n"
    f" - {viz_path}\n"
    f" - CSVs: {csv_dir}"
)



# ------------------------------------------------------
# 8. Governance: log that the close-out completed
# ------------------------------------------------------
log_step(
    step_name="3.5.2 Data Preparation Close-Out (Exports)",
    description="Exported Phase 3 governance logs to /deliverables and all analysis tables to /data/processed.",
    inputs=[
        "golf_valid",
        "golf_rounds",
        "golf_holes",
        "golf_shots",
        "facilities",
        "player_club_profile",
    ],
    outputs=[
        str(deliver_dir / "assumptions_log.csv"),
        str(deliver_dir / "data_dictionary.csv"),
        str(deliver_dir / "step_log.csv"),
        str(deliver_dir / "transform_log.csv"),
        str(deliver_dir / "validation_log.csv"),
        str(deliver_dir / "phase3_deliverables.xlsx"),
        str(processed_dir),
        str(consolidated_path),
        str(viz_path),
        str(csv_dir),
    ],
    df=None,
    extra_info={
        "note": "Profiles and business-rule exports were intentionally excluded per requirements.",
    },
)


📦 Phase 3 exports written to:
 - C:\Users\micha\onedrive\documents\newforce\golf-capstone\deliverables\phase3_deliverables
 - C:\Users\micha\onedrive\documents\newforce\golf-capstone\data\processed\phase3_exports
 - C:\Users\micha\onedrive\documents\newforce\golf-capstone\data\processed\phase3_exports.xlsx
📦 Phase 3 Visualization exports written to:
 - C:\Users\micha\onedrive\documents\newforce\golf-capstone\deliverables\phase3_deliverables
 - C:\Users\micha\onedrive\documents\newforce\golf-capstone\data\processed\phase3_exports
 - C:\Users\micha\onedrive\documents\newforce\golf-capstone\data\processed\golf_capstone_exports.xlsx
 - CSVs: C:\Users\micha\onedrive\documents\newforce\golf-capstone\data\processed\golf_capstone_exports_csvs
✅ 3.5.2 Data Preparation Close-Out (Exports) @ 2025-11-16 19:35:37
   Exported Phase 3 governance logs to /deliverables and all analysis tables to /data/processed.
   note: Profiles and business-rule exports were intentionally excluded per require

## Phase 4. Modeling & Visualization

**Objective:** Identify patterns and relationships that explain performance, surface the information needed to execute shot strategy on the course, and provide tailored practice-priority insights. **This work is currently in progress in other tools, using the exports from 3.5 Governance Close-Out (Data Preparation).**

### 4.1 Diagnostic Performance Analysis using all relevant KPIs developed in data preparation

A dashboard to evaluate scoring trends and trouble patterns: scoring by type (eagle, birdie, etc.), scrambling, putting, and similar KPIs, all filterable by person, facility, course, round, and time features.
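
As a minimal sketch of the scoring-by-type breakdown, each hole can be labeled by its score relative to par and then counted. The `score_type` helper, the label wording, and the sample `(par, strokes)` pairs are all hypothetical and used only for illustration; the real labels and column names in `golf_holes` may differ.

```python
from collections import Counter


def score_type(strokes: int, par: int) -> str:
    """Label a hole score by its difference from par (hypothetical helper)."""
    diff = strokes - par
    if diff <= -2:
        return "Eagle or better"
    labels = {-1: "Birdie", 0: "Par", 1: "Bogey", 2: "Double bogey"}
    # Anything three or more over par falls through to the default label.
    return labels.get(diff, "Triple bogey or worse")


# Hypothetical (par, strokes) pairs; not taken from the GolfShot export.
sample_holes = [(4, 4), (3, 2), (5, 3), (4, 7)]

# Count how many holes fall into each scoring type.
score_counts = Counter(score_type(strokes, par) for par, strokes in sample_holes)
```

The same labeling could be applied per person, facility, course, or round before counting, which is what the dashboard filters would do.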

### 4.2 Shot-level Modeling (to execute my on-course strategy)

Consider developing a more statistically sound way to build a miss vector for visualization. Using data from 3.3.3, draw a triangle: use the club with the most statistically significant data (Driver) to set the dispersion at the far end, then draw back to your location so the triangle covers all planned distances in `player_club_profile`.
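
One possible way to sketch the triangle idea is to convert the Driver's lateral dispersion into a half-angle at the player's location, then scale that angle to each planned distance. This is only an illustration under stated assumptions: using one standard deviation of lateral miss as the triangle edge is a choice made here, not the notebook's method, and the function names, sample misses, and planned distances are hypothetical rather than values from 3.3.3 or `player_club_profile`.

```python
import math
import statistics


def dispersion_half_angle(lateral_misses_yd, mean_distance_yd):
    """Half-angle (radians) of the triangle at the player's location.

    Assumption: the edge sits one standard deviation of lateral miss
    away from the target line at the club's mean distance.
    """
    spread = statistics.stdev(lateral_misses_yd)
    return math.atan2(spread, mean_distance_yd)


def half_width_at(distance_yd, half_angle_rad):
    """Lateral half-width of the triangle at a given distance from the player."""
    return distance_yd * math.tan(half_angle_rad)


# Hypothetical Driver misses in yards (negative = left of target line).
driver_lateral = [-20.0, -5.0, 0.0, 10.0, 15.0]
driver_mean_distance = 250.0

theta = dispersion_half_angle(driver_lateral, driver_mean_distance)

# Hypothetical planned distances per club.
planned = {"Driver": 250.0, "7 Iron": 150.0, "Pitching Wedge": 100.0}
widths = {club: half_width_at(dist, theta) for club, dist in planned.items()}
```

Because the half-angle is fixed by the Driver data, the half-width shrinks linearly with distance, which gives the straight-sided triangle drawn back to the player's location.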

### 4.3 Case Study: Lewisburg Elks Hole 2

Combine facility-level, round-level, and hole-level context from the dashboard in 4.1 with shot-level detail (mistakes with driver on hole 2) from 4.2 to produce a visual story of dispersion and miss tendencies.


## Phase 5. Evaluation

**Objective:** Interpret analytical results and identify improvement levers. **This is planned work.**

### 5.1 Summarize Key Findings & Hypotheses

Interpret analysis outcomes vs hypotheses.

### 5.2 Quantify Improvement Opportunities

Identify practice targets and measurable improvements.

Which areas of my game should I practice to improve scoring?
- Driving?
- Putting?
- Scrambling from < 30 yards?
- Approaches from < 100 yards?
- Approaches from > 100 yards (based on dispersion insights)?

### 5.3 Document Lessons Learned

Record key assumptions and insights for reuse. Also identify metrics worth tracking in the future to address current data limitations.

## Phase 6. Deployment

**Objective:** Package outputs, document governance, and sustain the process.

✅ DRAFT Data Dictionary exported in 3.5 Governance Close-Out (Data Preparation).

📌 PENDING: Final Data Dictionary (complete field definitions for final datasets).

✅ ALL DRAFT Artifacts Exported (datasets, logs, dictionaries, and documentation).

✅ Control Plan Summarized: README and this notebook header outline how to re-run, validate, and sustain improvements.
