## Automated Dataset Audit – EDA Checklist

This code performs a standardized structural audit of every CSV file within the current season folder.  
The purpose is to evaluate data quality, structure, and merge-readiness before any cleaning or transformation.

For each file, the script checks:

- **Shape**: Number of rows and columns
- **Duplicate rows**: Detects potential data duplication
- **Grain guess**: Infers likely unit of analysis (team-season, player-season, draft-pick, etc.)
- **Percent columns**: Flags columns that look like percentages, either by name (contains `%` or ends in `pct`) or because most sampled string values contain `%`
- **Numeric-like object columns**: Identifies columns that appear numeric but are stored as object dtype
- **Key columns present**: Detects potential join keys (e.g., Team, Player, Season)
- **Missing data patterns**: Reports columns with non-zero missing percentages

Importantly, this step does **not modify any data**.  
It is strictly an exploratory assessment to document structural issues and identify
what cleaning or preparation will be required in the Data Preparation phase.
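
The read-only claim can be checked rather than taken on trust. The following is a minimal sketch, not part of the original audit: `file_digests` and `assert_read_only` are hypothetical helper names, and the check assumes that comparing SHA-256 digests of the CSV files before and after the run is a sufficient definition of "unmodified".

```python
import hashlib
from pathlib import Path


def file_digests(folder: Path, pattern: str = "*.csv") -> dict:
    """Map each matching file name to the SHA-256 digest of its bytes."""
    return {
        p.name: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(folder.glob(pattern))
    }


def assert_read_only(folder: Path, audit) -> None:
    """Run `audit(folder)` and raise if any CSV was changed, added, or removed."""
    before = file_digests(folder)
    audit(folder)
    after = file_digests(folder)
    changed = sorted(
        name
        for name in before.keys() | after.keys()
        if before.get(name) != after.get(name)
    )
    if changed:
        raise AssertionError(f"Audit modified files: {changed}")
```

Once `run_folder_checklist` is defined, calling `assert_read_only(FOLDER, run_folder_checklist)` would run the audit and fail loudly if any file on disk differs afterwards.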

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np
import re

# ----------------------------
# CONFIG (edit these)
# ----------------------------
FOLDER = Path.cwd()  # if notebook is inside the same season folder
# FOLDER = Path("2015 NFL Season")  # example if notebook is at repo root
KEY_CANDIDATES = ["Tm", "Team", "Player", "season", "Season", "Year", "Pick", "Round"]


# ----------------------------
# Helpers
# ----------------------------
def _infer_pct_cols(df: pd.DataFrame):
    """Find columns that look like percentage columns by name or values."""
    pct_by_name = [c for c in df.columns if "%" in c or c.lower().endswith("pct")]
    pct_by_values = []
    for c in df.columns:
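        # object dtype, or the dedicated string dtype that newer pandas versions may infer
        if pd.api.types.is_object_dtype(df[c]) or pd.api.types.is_string_dtype(df[c]):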
            s = df[c].dropna().astype(str).head(50)
            if len(s) and (s.str.contains(r"%").mean() > 0.5):
                pct_by_values.append(c)
    return sorted(set(pct_by_name + pct_by_values))


def _infer_numeric_as_object_cols(df: pd.DataFrame):
    """
    Identify object columns that *mostly* look numeric (e.g., "12.3", "1,234", "$45", "67%").
    These are candidates for cleaning.
    """
    candidates = []
    for c in df.columns:
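        # skip anything that is neither object dtype nor a dedicated string dtype
        if not (pd.api.types.is_object_dtype(df[c]) or pd.api.types.is_string_dtype(df[c])):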
            continue
        s = df[c].dropna().astype(str).head(200)
        if s.empty:
            continue

        # strip common formatting and test numeric parse rate
        stripped = (
            s.str.replace(",", "", regex=False)
             .str.replace("%", "", regex=False)
             .str.replace("$", "", regex=False)
             .str.strip()
        )

        parsed = pd.to_numeric(stripped, errors="coerce")
        parse_rate = parsed.notna().mean()

        # if most values parse, it's likely numeric stored as object
        if parse_rate >= 0.8:
            candidates.append((c, round(float(parse_rate), 3)))
    return candidates


def _pick_key_cols(df: pd.DataFrame, key_candidates=KEY_CANDIDATES):
    """Return the key candidates present in df (in preference order)."""
    present = [k for k in key_candidates if k in df.columns]
    return present


def eda_checklist_report_for_file(csv_path: Path):
    """Audit one CSV (read-only) and return a (summary, details) pair of dicts."""
    df = pd.read_csv(csv_path)

    # Basic structure
    n_rows, n_cols = df.shape
    columns = list(df.columns)

    # Missingness
    missing_counts = df.isna().sum()
    missing_pct = (df.isna().mean() * 100).round(2)

    # Duplicates
    n_dup_rows = int(df.duplicated().sum())

    # Percent & numeric-as-object flags
    pct_cols = _infer_pct_cols(df)
    num_as_obj = _infer_numeric_as_object_cols(df)  # list of tuples (col, parse_rate)

    # Key checks
    keys_present = _pick_key_cols(df)
    key_uniques = {k: int(df[k].nunique(dropna=True)) for k in keys_present}

    # A simple "grain guess" heuristic
    grain_guess = "unknown"
    if "Tm" in df.columns and n_rows in (32, 33):  # 32 teams typical; sometimes includes 'Lg Avg'
        grain_guess = "team-season (likely)"
    elif "Player" in df.columns and n_rows > 32:
        grain_guess = "player-season or player-level (likely)"
    elif "Pick" in df.columns or "Round" in df.columns:
        grain_guess = "draft-pick level (likely)"

    # Top missing columns (if any)
    top_missing = (
        missing_pct[missing_pct > 0]
        .sort_values(ascending=False)
        .head(10)
        .to_dict()
    )

    # Build a compact summary row (for a table)
    summary = {
        "file": csv_path.name,
        "rows": n_rows,
        "cols": n_cols,
        "duplicate_rows": n_dup_rows,
        "grain_guess": grain_guess,
        "pct_cols_found": ", ".join(pct_cols) if pct_cols else "",
        "object_cols_numeric_like": ", ".join([f"{c}({r})" for c, r in num_as_obj]) if num_as_obj else "",
        "keys_present": ", ".join(keys_present) if keys_present else "",
        "top_missing_cols_%": top_missing if top_missing else {},
    }

    # Also return details if you want to print per-file
    details = {
        "shape": (n_rows, n_cols),
        "columns": columns,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_counts": missing_counts.to_dict(),
        "missing_pct": missing_pct.to_dict(),
        "duplicate_rows": n_dup_rows,
        "pct_cols": pct_cols,
        "numeric_as_object_candidates": num_as_obj,
        "keys_present": keys_present,
        "key_uniques": key_uniques,
        "grain_guess": grain_guess,
    }

    return summary, details


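def run_folder_checklist(folder: Path, pattern: str = "*.csv", verbose: bool = False):
    """Audit every CSV matching `pattern` in `folder`; return (summary_df, details keyed by file name)."""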
    csv_files = sorted(folder.glob(pattern))
    if not csv_files:
        raise FileNotFoundError(f"No CSV files found in: {folder.resolve()}")

    summaries = []
    all_details = {}

    for f in csv_files:
        summary, details = eda_checklist_report_for_file(f)
        summaries.append(summary)
        all_details[f.name] = details

        if verbose:
            print("\n" + "=" * 80)
            print(f"FILE: {f.name}")
            print(f"Shape: {details['shape']} | Duplicates: {details['duplicate_rows']} | Grain: {details['grain_guess']}")
            if details["pct_cols"]:
                print(f"Percent-like cols: {details['pct_cols']}")
            if details["numeric_as_object_candidates"]:
                print("Numeric-like object cols:", details["numeric_as_object_candidates"])
            if details["keys_present"]:
                print("Key cols present:", details["keys_present"])
                print("Key nunique:", details["key_uniques"])
            top_missing = {k: v for k, v in details["missing_pct"].items() if v > 0}
            if top_missing:
                top10 = dict(sorted(top_missing.items(), key=lambda x: x[1], reverse=True)[:10])
                print("Top missing %:", top10)

    summary_df = pd.DataFrame(summaries).sort_values(["rows", "file"]).reset_index(drop=True)
    return summary_df, all_details


# ----------------------------
# Run it
# ----------------------------
summary_df, details_by_file = run_folder_checklist(FOLDER, verbose=True)

summary_df


FILE: 2016 Defense.csv
Shape: (36, 28) | Duplicates: 0 | Grain: unknown
Top missing %: {'Unnamed: 0': 8.33, 'Unnamed: 2': 8.33, 'Unnamed: 27': 5.56}

FILE: 2016 Draft Selections.csv
Shape: (254, 29) | Duplicates: 0 | Grain: unknown
Top missing %: {'Unnamed: 25': 72.83, 'Unnamed: 26': 64.17, 'Unnamed: 24': 31.89, 'Unnamed: 11': 12.6, 'Unnamed: 28': 9.45, 'Unnamed: 6': 6.69, 'Approx Val': 6.69, 'Unnamed: 12': 6.69, 'Passing': 6.69, 'Unnamed: 14': 6.69}

FILE: 2016 Pass Defense.csv
Shape: (35, 25) | Duplicates: 0 | Grain: unknown
Percent-like cols: ['Cmp%', 'Int%', 'Sk%', 'TD%']
Key cols present: ['Tm']
Key nunique: {'Tm': 35}
Top missing %: {'Rk': 8.57, 'G': 8.57, 'EXP': 5.71}

FILE: 2016 Passing.csv
Shape: (98, 33) | Duplicates: 0 | Grain: player-season or player-level (likely)
Percent-like cols: ['Cmp%', 'Int%', 'Sk%', 'Succ%', 'TD%']
Key cols present: ['Team', 'Player']
Key nunique: {'Team': 32, 'Player': 98}
Top missing %: {'Awards': 78.57, 'QBrec': 44.9, 'Lng': 17.35, 'Succ%': 16.3

| | file | rows | cols | duplicate_rows | grain_guess | pct_cols_found | object_cols_numeric_like | keys_present | top_missing_cols_% |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016 Team Performances.csv | 32 | 13 | 0 | team-season (likely) | W-L% | | Tm | {} |
| 1 | 2016 Pass Defense.csv | 35 | 25 | 0 | unknown | Cmp%, Int%, Sk%, TD% | | Tm | {'Rk': 8.57, 'G': 8.57, 'EXP': 5.71} |
| 2 | 2016 Rush Defense.csv | 35 | 9 | 0 | unknown | | | Tm | {'Rk': 8.57, 'G': 8.57, 'EXP': 5.71} |
| 3 | 2016 Defense.csv | 36 | 28 | 0 | unknown | | | | {'Unnamed: 0': 8.33, 'Unnamed: 2': 8.33, 'Unna... |
| 4 | 2016 Passing.csv | 98 | 33 | 0 | player-season or player-level (likely) | Cmp%, Int%, Sk%, Succ%, TD% | | Team, Player | {'Awards': 78.57, 'QBrec': 44.9, 'Lng': 17.35,... |
| 5 | 2016 Draft Selections.csv | 254 | 29 | 0 | unknown | | | | {'Unnamed: 25': 72.83, 'Unnamed: 26': 64.17, '... |
| 6 | 2016 Receiving.csv | 500 | 21 | 0 | unknown | | | | {'Unnamed: 20': 92.0, 'Unnamed: 14': 1.8, 'Unn... |
| 7 | 2016 Rushing.csv | 500 | 18 | 0 | unknown | | | | {'Unnamed: 17': 90.6, 'Unnamed: 12': 34.6, 'Un... |
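
Four of the files above report `Unnamed: N` columns and no detected keys. One plausible cause, given that names such as `Passing` and `Approx Val` appear next to the `Unnamed` ones, is a two-row header (a group row above the real column names). The sketch below is an assumption-laden illustration, not the notebook's method: `flatten_two_row_header` is a hypothetical helper, it assumes exactly two header rows, and it assumes a blank group cell continues the previous group.

```python
import csv
from pathlib import Path


def flatten_two_row_header(csv_path: Path) -> list:
    """Combine a group-header row and a column-name row into flat names."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        top = next(reader)      # assumed group row, e.g. ",,Passing,,Rushing,"
        bottom = next(reader)   # assumed real column names

    # Pad the shorter row so both have the same width.
    width = max(len(top), len(bottom))
    top += [""] * (width - len(top))
    bottom += [""] * (width - len(bottom))

    names = []
    group = ""
    for i, (g, name) in enumerate(zip(top, bottom)):
        g, name = g.strip(), name.strip()
        if g:
            group = g  # assumption: blank group cells inherit the last group seen
        if not name:
            name = f"col{i}"  # placeholder for a column with no name at all
        names.append(f"{group}_{name}" if group else name)
    return names
```

The resulting list could then be passed to `pd.read_csv(path, header=None, skiprows=2, names=flatten_two_row_header(path))`. Whether a given file really has two header rows should be confirmed by opening it first.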


## Automated Dataset Audit – 2016 Season Findings Summary

The automated audit of the 2016 season datasets indicates structural consistency with later seasons (2017–2025), with the notable absence of the Defense Advanced dataset.

### Key Observations

- Team-level data (Team Performances) contains 32 rows, consistent with one row per NFL team for the season.
- Defensive summary tables (Defense, Pass Defense, Rush Defense) contain slightly more than 32 rows (35–36), likely due to league-average or aggregate rows that will require filtering during Data Preparation.
- The Defense Advanced dataset is not available for 2016, consistent with its absence in 2017 and availability beginning in 2018.
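- Player-level datasets (Passing, Rushing, Receiving) contain substantially more rows (98–500), consistent with a player-season grain. Only Passing was labeled as such by the heuristic; Rushing and Receiving came back `unknown` because no `Player` column was detected.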
- The Draft dataset contains 254 rows, consistent with draft-pick-level data.
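- Percentage-based metrics (e.g., `Cmp%`, `Int%`, `Sk%`, `W-L%`) were flagged by column name. No file reported numeric-like object columns, so these fields may already be parsed as numeric; their dtypes should be confirmed before any conversion is planned for Data Preparation.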
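- Several files (Defense, Draft Selections, Receiving, Rushing) contain `Unnamed` columns, likely artifacts of multi-row headers in the source table export. In these files no key columns were detected and the grain guess is `unknown`, so the headers will need to be repaired during cleaning before the files can be joined.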
- Missing data patterns are consistent with later seasons, primarily concentrated in award-related or role-specific fields.
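
Two of the preparation steps implied by these observations are mechanical enough to sketch. This is an illustration for the Data Preparation phase, not something the audit itself does. The helper names are hypothetical, and the row filter assumes that aggregate rows are exactly those with a missing `Rk` (the audit reports `Rk` missing in 8.57% of Pass Defense rows, i.e. 3 of 35), which should be verified by inspecting the rows.

```python
import pandas as pd


def drop_unranked_rows(df: pd.DataFrame, rank_col: str = "Rk") -> pd.DataFrame:
    """Return a copy without rows whose rank is missing (assumed aggregates)."""
    return df.loc[df[rank_col].notna()].copy()


def percent_to_float(s: pd.Series) -> pd.Series:
    """Parse values such as '61.5%' or 61.5 to floats; unparseable values become NaN."""
    cleaned = s.astype(str).str.replace("%", "", regex=False).str.strip()
    return pd.to_numeric(cleaned, errors="coerce")
```

Both helpers return new objects and leave their inputs untouched, so they can be tried on a loaded file without affecting the audit results. `percent_to_float` is written to accept columns that are already numeric as well as `%`-suffixed strings.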

### Overall Assessment

The 2016 datasets are structurally sound and consistent with subsequent seasons, aside from the absence of Defense Advanced metrics. Together with 2017, the 2016 season falls in the pre-advanced-metrics period of the dataset; the resulting structural breakpoint at 2018 must be accounted for during multi-year modeling.
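
To locate breakpoints like this one across all seasons, dataset availability can be tabulated from file names alone. The sketch below is hypothetical: it assumes files are named `<year> <dataset>.csv` (as in `2016 Defense.csv`) and live somewhere under a common root folder, and the function names are illustrative.

```python
import re
from pathlib import Path


def dataset_availability(root: Path) -> dict:
    """Map each dataset name to the sorted list of seasons that have a CSV for it."""
    name_re = re.compile(r"^(\d{4}) (.+)\.csv$")  # assumed naming: '<year> <dataset>.csv'
    found = {}
    for path in root.rglob("*.csv"):
        m = name_re.match(path.name)
        if m:
            found.setdefault(m.group(2), set()).add(int(m.group(1)))
    return {name: sorted(years) for name, years in sorted(found.items())}


def missing_seasons(availability: dict) -> dict:
    """For each dataset, list the seasons (among all seasons seen) that lack a file."""
    all_years = sorted({y for years in availability.values() for y in years})
    return {
        name: [y for y in all_years if y not in years]
        for name, years in availability.items()
        if len(years) < len(all_years)
    }
```

Running `missing_seasons(dataset_availability(repo_root))` on the full repository would list, per dataset, the seasons that must be handled separately or excluded when building multi-year features.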