## Automated Dataset Audit – EDA Checklist

This code performs a standardized structural audit of every CSV file within the current season folder.  
The purpose is to evaluate data quality, structure, and merge-readiness before any cleaning or transformation.

For each file, the script checks:

- **Shape**: Number of rows and columns
- **Duplicate rows**: Detects potential data duplication
- **Grain guess**: Infers likely unit of analysis (team-season, player-season, draft-pick, etc.)
- **Percent columns**: Flags columns whose name contains `%` or ends in `pct`, or whose string values mostly contain `%`
- **Numeric-like object columns**: Identifies columns that appear numeric but are stored as object dtype
- **Key columns present**: Detects potential join keys (e.g., Team, Player, Season)
- **Missing data patterns**: Reports columns with non-zero missing percentages

Importantly, this step does **not modify any data**.  
It is strictly an exploratory assessment to document structural issues and identify
what cleaning or preparation will be required in the Data Preparation phase.
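
Merge-readiness depends on whether a candidate key actually identifies one row. The sketch below is not part of the audit script; it is a minimal illustration of such a check, and the helper name `key_is_unique` and the toy frames are assumptions made for the example.

```python
import pandas as pd


def key_is_unique(df: pd.DataFrame, key_cols: list) -> bool:
    """Return True when key_cols have no missing values and no repeated combinations."""
    keys = df[key_cols]
    has_no_missing = keys.notna().all().all()
    has_no_repeats = not keys.duplicated().any()
    return bool(has_no_missing and has_no_repeats)


# Toy data (hypothetical): one row per team, and several rows per team for players
teams = pd.DataFrame({"Tm": ["A", "B", "C"], "W": [10, 8, 5]})
players = pd.DataFrame({"Team": ["A", "A", "B"], "Player": ["P1", "P2", "P3"]})

team_key_ok = key_is_unique(teams, ["Tm"])                    # unique per row
player_team_ok = key_is_unique(players, ["Team"])             # repeated, not a key
player_pair_ok = key_is_unique(players, ["Team", "Player"])   # unique as a pair
```

Like the audit itself, this check only reads the data and leaves it unchanged.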

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np
import re

# ----------------------------
# CONFIG (edit these)
# ----------------------------
FOLDER = Path.cwd()  # if notebook is inside the same season folder
# FOLDER = Path("2015 NFL Season")  # example if notebook is at repo root
KEY_CANDIDATES = ["Tm", "Team", "Player", "season", "Season", "Year", "Pick", "Round"]


# ----------------------------
# Helpers
# ----------------------------
def _infer_pct_cols(df: pd.DataFrame):
    """Find columns that look like percentage columns by name or values."""
    pct_by_name = [c for c in df.columns if "%" in c or c.lower().endswith("pct")]
    pct_by_values = []
    for c in df.columns:
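        # match object columns and, on pandas versions that infer one, the dedicated string dtype
        if pd.api.types.is_object_dtype(df[c]) or pd.api.types.is_string_dtype(df[c]):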
            s = df[c].dropna().astype(str).head(50)
            if len(s) and (s.str.contains(r"%").mean() > 0.5):
                pct_by_values.append(c)
    return sorted(set(pct_by_name + pct_by_values))


def _infer_numeric_as_object_cols(df: pd.DataFrame):
    """
    Identify object columns that *mostly* look numeric (e.g., "12.3", "1,234", "$45", "67%").
    These are candidates for cleaning.
    """
    candidates = []
    for c in df.columns:
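        # skip columns that are neither object nor a dedicated string dtype
        if not (pd.api.types.is_object_dtype(df[c]) or pd.api.types.is_string_dtype(df[c])):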
            continue
        s = df[c].dropna().astype(str).head(200)
        if s.empty:
            continue

        # strip common formatting and test numeric parse rate
        stripped = (
            s.str.replace(",", "", regex=False)
             .str.replace("%", "", regex=False)
             .str.replace("$", "", regex=False)
             .str.strip()
        )

        parsed = pd.to_numeric(stripped, errors="coerce")
        parse_rate = parsed.notna().mean()

        # if most values parse, it's likely numeric stored as object
        if parse_rate >= 0.8:
            candidates.append((c, round(float(parse_rate), 3)))
    return candidates


def _pick_key_cols(df: pd.DataFrame, key_candidates=KEY_CANDIDATES):
    """Return the key candidates present in df (in preference order)."""
    present = [k for k in key_candidates if k in df.columns]
    return present


def eda_checklist_report_for_file(csv_path: Path):
    """Audit one CSV and return (summary, details) dicts; the file is only read, never modified."""
    df = pd.read_csv(csv_path)

    # Basic structure
    n_rows, n_cols = df.shape
    columns = list(df.columns)

    # Missingness
    missing_counts = df.isna().sum()
    missing_pct = (df.isna().mean() * 100).round(2)

    # Duplicates
    n_dup_rows = int(df.duplicated().sum())

    # Percent & numeric-as-object flags
    pct_cols = _infer_pct_cols(df)
    num_as_obj = _infer_numeric_as_object_cols(df)  # list of tuples (col, parse_rate)

    # Key checks
    keys_present = _pick_key_cols(df)
    key_uniques = {k: int(df[k].nunique(dropna=True)) for k in keys_present}

    # A simple "grain guess" heuristic
    grain_guess = "unknown"
    if "Tm" in df.columns and n_rows in (32, 33):  # 32 teams typical; sometimes includes 'Lg Avg'
        grain_guess = "team-season (likely)"
    elif "Player" in df.columns and n_rows > 32:
        grain_guess = "player-season or player-level (likely)"
    elif "Pick" in df.columns or "Round" in df.columns:
        grain_guess = "draft-pick level (likely)"

    # Top missing columns (if any)
    top_missing = (
        missing_pct[missing_pct > 0]
        .sort_values(ascending=False)
        .head(10)
        .to_dict()
    )

    # Build a compact summary row (for a table)
    summary = {
        "file": csv_path.name,
        "rows": n_rows,
        "cols": n_cols,
        "duplicate_rows": n_dup_rows,
        "grain_guess": grain_guess,
        "pct_cols_found": ", ".join(pct_cols) if pct_cols else "",
        "object_cols_numeric_like": ", ".join([f"{c}({r})" for c, r in num_as_obj]) if num_as_obj else "",
        "keys_present": ", ".join(keys_present) if keys_present else "",
        "top_missing_cols_%": top_missing if top_missing else {},
    }

    # Also return details if you want to print per-file
    details = {
        "shape": (n_rows, n_cols),
        "columns": columns,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_counts": missing_counts.to_dict(),
        "missing_pct": missing_pct.to_dict(),
        "duplicate_rows": n_dup_rows,
        "pct_cols": pct_cols,
        "numeric_as_object_candidates": num_as_obj,
        "keys_present": keys_present,
        "key_uniques": key_uniques,
        "grain_guess": grain_guess,
    }

    return summary, details


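def run_folder_checklist(folder: Path, pattern: str = "*.csv", verbose: bool = False):
    """Audit every file in folder matching pattern; return (summary_df, details keyed by file name)."""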
    csv_files = sorted(folder.glob(pattern))
    if not csv_files:
        raise FileNotFoundError(f"No CSV files found in: {folder.resolve()}")

    summaries = []
    all_details = {}

    for f in csv_files:
        summary, details = eda_checklist_report_for_file(f)
        summaries.append(summary)
        all_details[f.name] = details

        if verbose:
            print("\n" + "=" * 80)
            print(f"FILE: {f.name}")
            print(f"Shape: {details['shape']} | Duplicates: {details['duplicate_rows']} | Grain: {details['grain_guess']}")
            if details["pct_cols"]:
                print(f"Percent-like cols: {details['pct_cols']}")
            if details["numeric_as_object_candidates"]:
                print("Numeric-like object cols:", details["numeric_as_object_candidates"])
            if details["keys_present"]:
                print("Key cols present:", details["keys_present"])
                print("Key nunique:", details["key_uniques"])
            top_missing = {k: v for k, v in details["missing_pct"].items() if v > 0}
            if top_missing:
                top10 = dict(sorted(top_missing.items(), key=lambda x: x[1], reverse=True)[:10])
                print("Top missing %:", top10)

    summary_df = pd.DataFrame(summaries).sort_values(["rows", "file"]).reset_index(drop=True)
    return summary_df, all_details


# ----------------------------
# Run it
# ----------------------------
summary_df, details_by_file = run_folder_checklist(FOLDER, verbose=True)

summary_df


FILE: 2025 Defense Adv.csv
Shape: (32, 19) | Duplicates: 0 | Grain: team-season (likely)
Percent-like cols: ['Bltz%', 'Hrry%', 'Prss%', 'QBKD%']
Key cols present: ['Tm']
Key nunique: {'Tm': 32}

FILE: 2025 Defense.csv
Shape: (36, 28) | Duplicates: 0 | Grain: unknown
Top missing %: {'Unnamed: 0': 8.33, 'Unnamed: 2': 8.33, 'Unnamed: 27': 5.56}

FILE: 2025 Draft Selections.csv
Shape: (258, 29) | Duplicates: 0 | Grain: unknown
Top missing %: {'Unnamed: 25': 91.47, 'Unnamed: 26': 82.17, 'Unnamed: 24': 53.1, 'Unnamed: 11': 12.79, 'Unnamed: 6': 11.63, 'Approx Val': 11.63, 'Unnamed: 12': 11.63, 'Passing': 11.63, 'Unnamed: 14': 11.63, 'Unnamed: 15': 11.63}

FILE: 2025 Pass Defense.csv
Shape: (35, 25) | Duplicates: 0 | Grain: unknown
Percent-like cols: ['Cmp%', 'Int%', 'Sk%', 'TD%']
Key cols present: ['Tm']
Key nunique: {'Tm': 35}
Top missing %: {'Rk': 8.57, 'G': 8.57, 'EXP': 5.71, 'Int': 2.86, 'Int%': 2.86}

FILE: 2025 Passing.csv
Shape: (103, 33) | Duplicates: 0 | Grain: player-season or play

| | file | rows | cols | duplicate_rows | grain_guess | pct_cols_found | object_cols_numeric_like | keys_present | top_missing_cols_% |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025 Defense Adv.csv | 32 | 19 | 0 | team-season (likely) | Bltz%, Hrry%, Prss%, QBKD% | | Tm | {} |
| 1 | 2025 Team Performance.csv | 32 | 13 | 0 | team-season (likely) | W-L% | | Tm | {} |
| 2 | 2025 Pass Defense.csv | 35 | 25 | 0 | unknown | Cmp%, Int%, Sk%, TD% | | Tm | {'Rk': 8.57, 'G': 8.57, 'EXP': 5.71, 'Int': 2.... |
| 3 | 2025 Rush Defense.csv | 35 | 9 | 0 | unknown | | | Tm | {'Rk': 8.57, 'G': 8.57, 'EXP': 5.71} |
| 4 | 2025 Defense.csv | 36 | 28 | 0 | unknown | | | | {'Unnamed: 0': 8.33, 'Unnamed: 2': 8.33, 'Unna... |
| 5 | 2025 Passing.csv | 103 | 33 | 0 | player-season or player-level (likely) | Cmp%, Int%, Sk%, Succ%, TD% | | Team, Player | {'Awards': 91.26, 'QBrec': 37.86, 'Lng': 16.5,... |
| 6 | 2025 Draft Selections.csv | 258 | 29 | 0 | unknown | | | | {'Unnamed: 25': 91.47, 'Unnamed: 26': 82.17, '... |
| 7 | 2025 Schedule.csv | 286 | 14 | 0 | unknown | | | | {'Unnamed: 5': 53.5, 'PtsW': 0.7, 'PtsL': 0.7,... |
| 8 | 2025 Receiving.csv | 501 | 21 | 0 | unknown | | | | {'Unnamed: 20': 94.01, 'Unnamed: 14': 1.0, 'Un... |
| 9 | 2025 Rushing.csv | 636 | 19 | 0 | unknown | | | | {'Unnamed: 18': 95.44, 'Rushing.16': 46.54, 'R... |
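
The summary table can be filtered to list the files that need attention first. The sketch below assumes only the `file`, `grain_guess`, and `keys_present` columns of `summary_df`; it uses a three-row sample copied from the table above (named `summary_sample` here for illustration) so that it runs on its own.

```python
import pandas as pd

# Three rows copied from the audit summary; only the columns needed for triage
summary_sample = pd.DataFrame(
    [
        {"file": "2025 Defense Adv.csv", "grain_guess": "team-season (likely)", "keys_present": "Tm"},
        {"file": "2025 Pass Defense.csv", "grain_guess": "unknown", "keys_present": "Tm"},
        {"file": "2025 Rushing.csv", "grain_guess": "unknown", "keys_present": ""},
    ]
)

# Files where none of the candidate key columns were found
no_keys = summary_sample.loc[summary_sample["keys_present"].eq(""), "file"].tolist()

# Files where the grain heuristic could not decide
unknown_grain = summary_sample.loc[summary_sample["grain_guess"].eq("unknown"), "file"].tolist()
```

Applying the same two filters to the full `summary_df` would give the list of files to inspect before any merge is attempted.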


## Automated Dataset Audit – Findings Summary

The automated audit shows that the 2025 season datasets contain no duplicate rows and are largely complete, but several files need header and row-level repairs before they are merge-ready.

### Key Observations

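- Two team-level datasets (Defense Adv, Team Performance) contain exactly 32 rows, consistent with one row per NFL team for the season.
- Three defensive summary tables (Defense, Pass Defense, Rush Defense) contain 35 or 36 rows. In Pass Defense and Rush Defense, `Rk` and `G` are missing in 8.57% of rows (3 of 35), which matches three league-average or aggregate rows that will require filtering during data preparation. Because of these extra rows, the grain heuristic reports `unknown` for all three files.
- Passing (103 rows) is classified as player-season or player-level. Rushing (636 rows) and Receiving (501 rows) have row counts consistent with the same grain, but the audit reports `unknown` and finds no key columns for them.
- The Draft Selections dataset contains 258 rows, consistent with draft-pick-level data, although the audit reports its grain as `unknown` and detects no key columns.
- Several datasets contain columns with a percent sign in the name (e.g., `Bltz%`, `Hrry%`, `W-L%`). These were flagged by name, and the audit does not establish how they are stored: `object_cols_numeric_like` is empty for every file, including files whose second header row appears to have been read as data. The `dtype == "object"` test may therefore be missing string columns (for example, under a pandas version that infers a dedicated string dtype), so dtypes should be checked directly before planning numeric conversion in the Data Preparation phase.
- Several exported tables include unnamed columns (e.g., `Unnamed: 0`, `Unnamed: 25`). Many of these are well populated (`Unnamed: 0` in Defense is only 8.33% missing), and suffixed names such as `Rushing.16` indicate a repeated group label, so they likely come from a two-row header in the source export rather than from empty columns. These files should be re-read with the correct header row; dropping the columns would discard data.
- Missing data in team-level datasets is limited to at most 8.57% of any column and is consistent with the aggregate rows. It is more prevalent in player-level datasets, particularly in role-specific or award-related columns (e.g., `Awards` is 91.26% missing in Passing).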
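
The unnamed columns and suffixed names such as `Rushing.16` in the audit output suggest that some exports carry a two-row header (a group label row above the real column names). That is an inference from the output, not something the audit verifies. Under that assumption, a minimal sketch of rebuilding the column names looks like this; the CSV text and player labels are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical export with a group-label row above the real header row
raw_csv = io.StringIO(
    ",,Rushing,Rushing\n"
    "Rk,Player,Att,Yds\n"
    "1,Player A,250,1100\n"
    "2,Player B,180,760\n"
)

# Read everything as text with no header so both header rows stay visible
raw = pd.read_csv(raw_csv, header=None, dtype=str)

group_row = raw.iloc[0].fillna("")  # blank cells mean "no group label"
name_row = raw.iloc[1]

# Prefix each column name with its group label when one exists
flat_names = [f"{g}_{n}" if g else n for g, n in zip(group_row, name_row)]

# Keep only the data rows and attach the rebuilt names
data = raw.iloc[2:].reset_index(drop=True)
data.columns = flat_names
```

If the group labels are not needed, passing `header=1` to `pd.read_csv` skips the first row and uses the second as the header. Either way, values remain strings here and still need type conversion afterwards.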

### Overall Assessment

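The datasets are suitable for further analysis once the identified issues are addressed. These issues (unnamed columns from multi-row headers, aggregate rows in team tables, and unverified typing of percentage columns) are formatting artifacts rather than substantive data problems and can be addressed systematically during Data Preparation. Seven of the ten files currently report an `unknown` grain and five have no detected key columns, so the audit should be re-run after header repair to confirm grain and join keys.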
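
As a preview of the Data Preparation phase, the sketch below shows one way to drop aggregate rows and convert a percentage column. It relies on the audit's observation that `Rk` is missing only in the extra rows of the team tables. The team names, the aggregate labels (`Avg Team`, `League Total`), the values, and the helper `pct_to_float` are assumptions made for the example.

```python
import pandas as pd


def pct_to_float(series: pd.Series) -> pd.Series:
    """Strip a trailing percent sign and parse to float; unparseable values become NaN."""
    cleaned = series.astype(str).str.replace("%", "", regex=False).str.strip()
    return pd.to_numeric(cleaned, errors="coerce")


# Toy team table (hypothetical values) with two aggregate rows that have no rank
team_df = pd.DataFrame(
    {
        "Rk": [1, 2, None, None],
        "Tm": ["Team A", "Team B", "Avg Team", "League Total"],
        "Cmp%": ["61.2%", "64.8%", "63.0%", None],
    }
)

# Keep ranked rows only; work on a copy so the loaded frame stays untouched
teams_only = team_df[team_df["Rk"].notna()].copy()
teams_only["Cmp%"] = pct_to_float(teams_only["Cmp%"])
```

The helper also accepts columns that are already numeric, since it converts to text before parsing, so it can be applied without first knowing how a given `%` column was stored.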