# Week 1 — Part 02: Data profiling script (CSV → JSON/Markdown)

**Estimated time:** 90–120 minutes

## Learning Objectives

- Treat real-world CSV data as untrusted input
- Build a deterministic profiling artifact (`profile.json` + `profile.md`)
- Fail fast with clear errors for missing/empty inputs
- Add optional schema/required-column checks


## Overview

In AI/ML/LLM projects, most pain starts with data issues:

- wrong column names
- unexpected types
- empty files
- missing values

A data profiling script makes these issues visible early.

---

## Pre-study (Self-learn)

Foundamental Course assumes Self-learn is complete. If you need a refresher on modules, exceptions, file I/O, or JSON:

- [Foundamental Course Pre-study index](../PRESTUDY.md)
- [Self-learn — Modules and exception handling](../self_learn/Chapters/2/02_modules_exceptions.md)

---

## What success looks like (end of Part 02)

Given the same input CSV, your code should always produce:

- `output/profile.json` (machine-readable)
- `output/profile.md` (human-readable)

And it should fail with clear errors for:

- missing file
- empty file
- missing required columns (optional extension)

### Checkpoint

After you run the notebook end-to-end, you should see `output/profile.json` and `output/profile.md` on disk.

Key reproducibility detail: keep outputs deterministic so diffs are meaningful.

---

## Output contract (recap)

Given the same input CSV, the script should always produce:

- `output/profile.json`
- `output/profile.md`

### What this cell does
Defines the core data structures and helper functions for the profiling script:

- **`Profile` dataclass**: a typed container for the profile fields (row and column counts, column names, dtypes, and missing-value counts per column). Using a dataclass makes it easy to serialize to JSON with `asdict()`.
- **`load_csv(path)`**: validates that the file exists and is non-empty *before* reading it. This is "fail fast" defensive programming: catch problems at the boundary, not deep inside your logic.
- **`make_profile(df)`**: computes the profile from a DataFrame. Note the `int(...)` casts on the missing counts: `isna().sum()` yields NumPy integers, which the standard `json` module cannot serialize.

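**Why `sort_keys=True` matters:** Running the same script twice on the same input should produce byte-for-byte identical JSON. Python dicts keep insertion order (guaranteed since Python 3.7), so without `sort_keys=True` the key order in the file depends on how the dict was built, and a harmless refactor can change the output. Sorting the keys removes that dependency.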
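
To make the effect of `sort_keys=True` concrete, here is a minimal sketch using only the standard library. The helper name `dumps_stable` and the two small dicts are illustrative assumptions, not part of the profiling script.

```python
import json


def dumps_stable(obj: dict) -> str:
    """Serialize with sorted keys so insertion order does not affect the text."""
    return json.dumps(obj, indent=2, sort_keys=True)


# Same content, different insertion order (hypothetical example data).
a = {"rows": 4, "cols": 3}
b = {"cols": 3, "rows": 4}

# Plain json.dumps follows insertion order, so the two strings differ.
plain_equal = json.dumps(a) == json.dumps(b)

# With sorted keys, both dicts serialize to the same text.
stable_equal = dumps_stable(a) == dumps_stable(b)

print(plain_equal, stable_equal)
```

The two dicts compare equal as Python objects, yet only the sorted serialization gives identical text, which is what a file diff sees.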

In [1]:
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Dict, List


try:
    import pandas as pd
except Exception as e:  # pragma: no cover
    pd = None
    _pd_import_error = e


OUTPUT_DIR = Path("output")
OUTPUT_DIR.mkdir(exist_ok=True)


@dataclass
class Profile:
    rows: int
    cols: int
    columns: List[str]
    dtypes: Dict[str, str]
    missing_by_column: Dict[str, int]


def load_csv(path: Path):
    if not path.exists():
        raise FileNotFoundError("Input file not found: %s" % path)
    if path.stat().st_size == 0:
        raise ValueError("Input file is empty: %s" % path)
    if pd is None:
        raise RuntimeError("pandas is required: %s" % _pd_import_error)
    return pd.read_csv(path)


def make_profile(df) -> Profile:
    missing = df.isna().sum().to_dict()
    dtypes = {col: str(dtype) for col, dtype in df.dtypes.to_dict().items()}
    return Profile(
        rows=int(df.shape[0]),
        cols=int(df.shape[1]),
        columns=list(df.columns),
        dtypes=dtypes,
        missing_by_column={k: int(v) for k, v in missing.items()},
    )


print("ready")

ready
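
The fail-fast checks in `load_csv()` can be exercised without pandas. The sketch below is an assumption-laden stand-in: `check_input` and `describe_failure` are hypothetical helpers that repeat only the two boundary checks (existence and non-zero size) and run them against files in a temporary directory.

```python
import tempfile
from pathlib import Path


def check_input(path: Path) -> None:
    """Hypothetical helper: the same two checks load_csv() makes, without pandas."""
    if not path.exists():
        raise FileNotFoundError(f"Input file not found: {path}")
    if path.stat().st_size == 0:
        raise ValueError(f"Input file is empty: {path}")


def describe_failure(path: Path) -> str:
    """Return the name of the exception raised for `path`, or "ok"."""
    try:
        check_input(path)
    except (FileNotFoundError, ValueError) as exc:
        return type(exc).__name__
    return "ok"


with tempfile.TemporaryDirectory() as tmp:
    missing_file = Path(tmp) / "missing.csv"  # never created
    empty_file = Path(tmp) / "empty.csv"
    empty_file.touch()  # exists, but zero bytes
    good_file = Path(tmp) / "good.csv"
    good_file.write_text("user_id\n1\n", encoding="utf-8")
    results = [describe_failure(p) for p in (missing_file, empty_file, good_file)]

print(results)  # ['FileNotFoundError', 'ValueError', 'ok']
```

Each bad input maps to a distinct exception type, so a caller can tell "file not there" apart from "file has no content" without parsing the message.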


### What this cell does
Creates a small sample CSV file with intentional data quality issues (a `None` in `age`, a `None` in `country`) so you can see how the profiler handles missing values.

**Why synthetic data?** Using a known dataset lets you verify that the profiler's output is correct, because you know exactly how many missing values to expect: one in `age`, one in `country`, and none in `user_id`. Real data often has surprises; start with controlled inputs first.

In [2]:
# Create a small sample CSV for profiling (illustrative data, not a real dataset)
if pd is not None:
    sample_path = OUTPUT_DIR / "sample_profile.csv"
    df = pd.DataFrame(
        {
            "user_id": [1, 2, 3, 4],
            "age": [22, None, 35, 29],
            "country": ["US", "SG", None, "US"],
        }
    )
    df.to_csv(sample_path, index=False)
    print("wrote sample:", sample_path)

wrote sample: output/sample_profile.csv
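
One way to cross-check the expected missing counts independently of pandas is to count empty fields with the standard `csv` module. This is only a rough sketch: the `SAMPLE_CSV` text is an assumption about what the cell above writes (pandas writes `None` as an empty field), and `count_empty_cells` is a hypothetical helper. Note that `pd.read_csv` also treats tokens such as `NA` and `NaN` as missing by default, which this sketch does not.

```python
import csv
import io

# Assumed contents of output/sample_profile.csv. The age column holds floats
# because a column with a gap cannot stay integer-typed in pandas by default.
SAMPLE_CSV = "user_id,age,country\n1,22.0,US\n2,,SG\n3,35.0,\n4,29.0,US\n"


def count_empty_cells(text: str) -> dict:
    """Count empty-string cells per column (a rough stand-in for isna().sum())."""
    reader = csv.DictReader(io.StringIO(text))
    counts = {name: 0 for name in (reader.fieldnames or [])}
    for row in reader:
        for name, value in row.items():
            if value == "":
                counts[name] += 1
    return counts


counts = count_empty_cells(SAMPLE_CSV)
print(counts)  # {'user_id': 0, 'age': 1, 'country': 1}
```

If the profiler later reports different numbers for this file, the disagreement points at either the input or the profiling logic, which is the point of starting with controlled data.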


### What this cell does
Defines `profile_to_markdown()` — converts the `Profile` dataclass into a human-readable Markdown table — then runs the full pipeline: load → profile → write both `profile.json` and `profile.md`.

**Why two output formats?**
- `profile.json` is machine-readable: downstream scripts can parse it programmatically.
- `profile.md` is human-readable: you can open it in any editor or GitHub to quickly inspect results.
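
As an illustration of a downstream consumer of `profile.json`, the sketch below flags columns whose share of missing values exceeds a threshold. The JSON text is an assumed example matching the `Profile` fields (the exact dtype strings depend on your pandas version), and `columns_over_threshold` and the 0.2 threshold are hypothetical.

```python
import json

# Assumed shape of profile.json, mirroring the Profile dataclass fields.
PROFILE_JSON = """
{
  "cols": 3,
  "columns": ["user_id", "age", "country"],
  "dtypes": {"age": "float64", "country": "object", "user_id": "int64"},
  "missing_by_column": {"age": 1, "country": 1, "user_id": 0},
  "rows": 4
}
"""


def columns_over_threshold(profile: dict, max_missing_ratio: float) -> list:
    """Return, sorted by name, the columns whose missing ratio exceeds the threshold."""
    rows = profile["rows"]
    if rows == 0:
        return []  # no rows, so no ratio to compute
    return sorted(
        col
        for col, missing in profile["missing_by_column"].items()
        if missing / rows > max_missing_ratio
    )


profile = json.loads(PROFILE_JSON)
flagged = columns_over_threshold(profile, max_missing_ratio=0.2)
print(flagged)  # ['age', 'country']
```

A check like this could run in CI against the JSON file, while the Markdown file stays the thing a person reads.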

**Reproducibility check:** Run this cell twice on the same input. The output files should be byte-for-byte identical. If they differ, something in your pipeline is non-deterministic (e.g., unsorted dict keys, timestamps).

In [3]:
def profile_to_markdown(p: Profile) -> str:
    lines = []
    lines.append("# Data Profile")
    lines.append("")
    lines.append(f"- Rows: {p.rows}")
    lines.append(f"- Columns: {p.cols}")
    lines.append("")
    lines.append("## Columns")
    lines.append("")
    lines.append("| column | dtype | missing |")
    lines.append("|---|---|---:|")
    for col in p.columns:
        lines.append(f"| {col} | {p.dtypes.get(col, '')} | {p.missing_by_column.get(col, 0)} |")
    lines.append("")
    return "\n".join(lines)


if pd is not None:
    df2 = load_csv(sample_path)
    p = make_profile(df2)
    (OUTPUT_DIR / "profile.json").write_text(json.dumps(asdict(p), indent=2, sort_keys=True), encoding="utf-8")
    (OUTPUT_DIR / "profile.md").write_text(profile_to_markdown(p), encoding="utf-8")
    print("wrote:", OUTPUT_DIR / "profile.json")
    print("wrote:", OUTPUT_DIR / "profile.md")

wrote: output/profile.json
wrote: output/profile.md
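
The reproducibility check can be automated by comparing file hashes between runs. The following is a minimal sketch under stated assumptions: `sha256_of` is a hypothetical helper, and the two "runs" are simulated by writing the same text twice to a temporary directory, then adding a timestamp-like line to one copy.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash the raw bytes of a file; equal digests mean byte-identical content."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    run1 = Path(tmp) / "run1_profile.md"
    run2 = Path(tmp) / "run2_profile.md"
    content = "# Data Profile\n\n- Rows: 4\n- Columns: 3\n"

    # Two runs that write the same text produce the same digest.
    run1.write_text(content, encoding="utf-8")
    run2.write_text(content, encoding="utf-8")
    identical = sha256_of(run1) == sha256_of(run2)

    # Adding something run-specific (here a fake timestamp) breaks that.
    run2.write_text(content + "- Generated at: 12:00:01\n", encoding="utf-8")
    drifted = sha256_of(run1) != sha256_of(run2)

print(identical, drifted)  # True True
```

In practice you would keep a copy of the outputs from the first run and hash both copies after the second run.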


### What this cell does
Defines `require_columns()` — a validator that checks whether all required columns are present — and a `numeric_summary_todo()` stub for you to implement.

**Why validate columns explicitly?** Downstream code often assumes specific column names exist. Without this check, a missing column causes a cryptic `KeyError` deep inside your logic. Validating at the boundary gives a clear, actionable error message.

**Your task:** Implement `numeric_summary_todo()` to return min/mean/max for each numeric column. Hint: `df.select_dtypes(include='number').describe()` gives you most of what you need.

In [4]:
from typing import Any, Dict, List


def require_columns(df, required: List[str]) -> None:
    missing = [c for c in required if c not in df.columns]
    if missing:
        raise ValueError("Missing required columns: %s" % missing)


def numeric_summary_todo(df) -> Dict[str, Any]:
    # TODO: implement basic numeric summaries (mean/min/max) per numeric column.
    # Hint: df.select_dtypes(include='number').describe().to_dict() is a good starting point.
    return {}


print("TODO: extend with required columns + numeric summaries")

TODO: extend with required columns + numeric summaries
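
To see the error message `require_columns()` produces, the sketch below copies the function from the cell above and calls it on a stand-in object. The stand-in is an assumption made so the example runs without pandas: the function only reads the `.columns` attribute, so a `SimpleNamespace` with a list of names is enough. The column `signup_date` is hypothetical.

```python
from types import SimpleNamespace
from typing import List


def require_columns(df, required: List[str]) -> None:
    # Copied from the cell above so this sketch runs on its own.
    missing = [c for c in required if c not in df.columns]
    if missing:
        raise ValueError("Missing required columns: %s" % missing)


# Stand-in for a DataFrame: only the .columns attribute is used.
fake_df = SimpleNamespace(columns=["user_id", "age", "country"])

# All required columns present: returns None without raising.
require_columns(fake_df, ["user_id", "age"])

# One required column absent: the error names exactly what is missing.
try:
    require_columns(fake_df, ["user_id", "signup_date"])
except ValueError as exc:
    message = str(exc)

print(message)  # Missing required columns: ['signup_date']
```

Calling the validator right after `load_csv()` keeps the failure at the boundary, before any downstream code indexes into the DataFrame.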


## Appendix: Solutions (peek only after trying)

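Reference implementation for the numeric summaries extension. It returns every statistic that `describe()` computes (count, mean, std, min, quartiles, max), which is a superset of the min/mean/max the task asks for.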

In [None]:
def numeric_summary_todo(df) -> dict:
    if pd is None:
        raise RuntimeError(f"pandas is required: {_pd_import_error}")
    return df.select_dtypes(include='number').describe().to_dict()


if pd is not None:
    df2 = load_csv(sample_path)
    print(list(numeric_summary_todo(df2).keys())[:5])
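
If you want only min/mean/max, as the task statement asks, one possible variant uses `DataFrame.agg` instead of `describe()`. This is a sketch rather than the reference solution: `numeric_min_mean_max` is a hypothetical name, the sample DataFrame is rebuilt inline so the block runs on its own, and pandas is assumed to be installed.

```python
import pandas as pd


def numeric_min_mean_max(df: pd.DataFrame) -> dict:
    """Hypothetical variant: keep only min/mean/max per numeric column."""
    numeric = df.select_dtypes(include="number")
    if numeric.shape[1] == 0:
        return {}  # nothing numeric to summarize
    summary = numeric.agg(["min", "mean", "max"])
    # Cast to plain floats so the result is safe to pass to json.dumps.
    return {
        col: {stat: float(value) for stat, value in stats.items()}
        for col, stats in summary.to_dict().items()
    }


# Same data as the sample CSV created earlier in this notebook.
df = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 4],
        "age": [22, None, 35, 29],
        "country": ["US", "SG", None, "US"],
    }
)
summary = numeric_min_mean_max(df)
print(summary["user_id"])  # {'min': 1.0, 'mean': 2.5, 'max': 4.0}
```

The non-numeric `country` column is dropped by `select_dtypes`, and the missing `age` value is skipped when computing the mean, which is the pandas default.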
