# Week 6 — Part 02: Sampling and compression for tabular data

**Estimated time:** 75–120 minutes

## What success looks like (end of Part 02)

- You can build a deterministic compressed table representation (same output for the same input and seed).
- You can write a bounded JSON artifact suitable for an LLM under `output/`.
- You can extend compression with at least one extra summary (numeric stats or top categories).

### Checkpoint

After running this notebook, you should have:

- a printed JSON payload for a `CompressedTable`
- an `output/compressed_input.json` file

## Learning Objectives

- Explain why compression is required for large tables
- Build a compressed representation (stats + sample rows)
- Produce deterministic JSON artifacts for LLM input
- Complete the TODO exercises that extend the compression with extra summaries

## Overview

A full dataset rarely fits in an LLM's context window, so you send a compressed representation instead:

- descriptive stats
- missingness summary
- a small sample of rows
- detected anomalies
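
The cells in this notebook cover shape, dtypes, missingness, and a row sample, but they do not detect anomalies. As one possible approach, the sketch below flags numeric outliers with the common interquartile-range rule. The function name `flag_iqr_outliers`, the multiplier `k=1.5`, and the `readings` data are assumptions made for illustration, and the code uses only the standard library so it runs without pandas.

```python
import statistics
from typing import Dict, List, Optional


def flag_iqr_outliers(values: List[Optional[float]], k: float = 1.5) -> Dict[str, object]:
    """Report values outside [Q1 - k*IQR, Q3 + k*IQR]; None entries are skipped."""
    present = [v for v in values if v is not None]
    if len(present) < 4:
        # Too few points for quartiles to be meaningful; report nothing.
        return {"lower": None, "upper": None, "outliers": []}
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = sorted(v for v in present if v < lower or v > upper)
    return {"lower": lower, "upper": upper, "outliers": outliers}


# Hypothetical column values, including one missing entry.
readings = [10, 12, 11, None, 13, 12, 11, 95]
anomalies = flag_iqr_outliers(readings)
print(anomalies)
```

A result like this is small and JSON-friendly, so it could sit next to the missingness summary in the payload instead of sending the raw rows that contain the outliers.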

---

## Underlying theory: you are fitting information into a fixed budget

The model has a fixed context window, so your input must satisfy a budget constraint:

$$
C \ge T_{\text{prompt}} + T_{\text{table}} + T_{\text{output}}
$$

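Here $C$ is the context window size in tokens, and $T_{\text{prompt}}$, $T_{\text{table}}$, and $T_{\text{output}}$ are the token counts of the instructions, the table payload, and the space reserved for the model's response. If your table is large, $T_{\text{table}}$ dominates. Compression reduces $T_{\text{table}}$ by replacing raw rows with summaries.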

Practical implication: good compression keeps *the facts that matter for the task* while dropping redundant detail.
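
To make the budget constraint concrete, here is a minimal sketch that rearranges it into a table budget, $C - T_{\text{prompt}} - T_{\text{output}}$, and compares a payload against it. The helper names, the numbers, and the four-characters-per-token ratio are all assumptions for illustration; a real tokenizer would give exact counts.

```python
import json
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the 4-chars-per-token ratio is an assumed heuristic."""
    return math.ceil(len(text) / chars_per_token)


def table_budget(context_window: int, prompt_tokens: int, output_tokens: int) -> int:
    """Tokens left for the table: C - T_prompt - T_output (never negative)."""
    return max(0, context_window - prompt_tokens - output_tokens)


# Hypothetical numbers chosen only for illustration.
budget = table_budget(context_window=8000, prompt_tokens=600, output_tokens=1000)
payload = json.dumps({"shape": [5, 3], "columns": ["age", "city", "salary"]}, sort_keys=True)
used = estimate_tokens(payload)
print(f"table budget: {budget} tokens, payload estimate: {used} tokens, fits: {used <= budget}")
```

Because the estimate is approximate, it is safer to leave some headroom than to fill the budget exactly.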

In [None]:
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Tuple


try:
    import pandas as pd
except Exception as e:  # pragma: no cover
    pd = None
    _pd_import_error = e


@dataclass
class CompressedTable:
    shape: Tuple[int, int]
    columns: List[str]
    dtypes: Dict[str, str]
    missing: Dict[str, int]
    sample_rows: List[Dict[str, object]]
    sample_seed: int


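def compress_table_v2(df, *, sample_n: int = 6, seed: int = 7) -> CompressedTable:
    """Summarize df as shape, dtypes, missing counts, and a seeded row sample."""
    # A fixed random_state makes the sample reproducible for the same input.
    sample = df.sample(n=min(sample_n, len(df)), random_state=seed) if len(df) > 0 else df
    # Map missing values to None so they serialize as JSON null; json.dumps
    # would otherwise emit the non-standard literal NaN for missing numbers.
    sample = sample.astype(object).where(sample.notna(), None)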
    return CompressedTable(
        shape=(int(df.shape[0]), int(df.shape[1])),
        columns=list(df.columns),
        dtypes={c: str(t) for c, t in df.dtypes.to_dict().items()},
        missing={c: int(v) for c, v in df.isna().sum().to_dict().items()},
        sample_rows=sample.to_dict(orient="records"),
        sample_seed=seed,
    )


def to_json(ct: CompressedTable) -> str:
    payload = {
        "shape": list(ct.shape),
        "columns": ct.columns,
        "dtypes": ct.dtypes,
        "missing": ct.missing,
        "sample_rows": ct.sample_rows,
        "sample_seed": ct.sample_seed,
    }
    return json.dumps(payload, indent=2, sort_keys=True)


OUTPUT_DIR = Path("output")
OUTPUT_DIR.mkdir(exist_ok=True)


if pd is None:
    print("pandas not available:", _pd_import_error)
else:
    df = pd.DataFrame(
        {
            "age": [29, 31, None, 42, 25],
            "city": ["NY", "LA", "LA", "SF", "NY"],
            "salary": [90000, 120000, 95000, None, 70000],
        }
    )
    ct = compress_table_v2(df, sample_n=4, seed=101)
    payload_str = to_json(ct)
    print(payload_str)
    (OUTPUT_DIR / "compressed_input.json").write_text(payload_str, encoding="utf-8")
    print("wrote:", OUTPUT_DIR / "compressed_input.json")
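
pandas stores missing numeric values as NaN, and `json.dumps` writes NaN as the bare literal `NaN` by default, which strict JSON parsers reject. The check below is a small sketch for confirming that an artifact is strict JSON before it goes to an LLM or another tool. The helper name `is_strict_json` is hypothetical; `parse_constant` is the standard `json.loads` hook for these literals.

```python
import json


def _reject_constant(name: str):
    # json.loads calls this hook for the literals NaN, Infinity, and -Infinity.
    raise ValueError(f"non-standard JSON constant: {name}")


def is_strict_json(text: str) -> bool:
    """Return True if text parses as JSON and contains no NaN/Infinity literals."""
    try:
        json.loads(text, parse_constant=_reject_constant)
    except ValueError:  # also covers json.JSONDecodeError
        return False
    return True


lenient = json.dumps({"age": float("nan")})  # default settings emit NaN
strict = json.dumps({"age": None})           # None becomes JSON null
print(lenient, is_strict_json(lenient))
print(strict, is_strict_json(strict))
```

To run the same check on a written file, pass it the result of `Path(...).read_text(encoding="utf-8")`.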

## Why the design choices matter

- sampling uses a `seed` so results are stable across runs
- `sort_keys=True` produces deterministic JSON (diff-friendly)
- a structured object (`CompressedTable`) makes it easier to evolve the contract later
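
The first two points can be checked directly. The sketch below uses the standard library's `random.Random(seed)` as a stand-in for `df.sample(random_state=seed)`, so it runs without pandas; `seeded_sample` and the example rows are illustrative assumptions, not part of the notebook's code.

```python
import json
import random
from typing import Dict, List


def seeded_sample(rows: List[Dict[str, object]], n: int, seed: int) -> List[Dict[str, object]]:
    """Draw up to n rows with a private RNG so global random state is untouched."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))


rows = [{"id": i, "city": "NY" if i % 2 else "LA"} for i in range(20)]
run_a = seeded_sample(rows, n=5, seed=101)
run_b = seeded_sample(rows, n=5, seed=101)
print("same seed, same sample:", run_a == run_b)

# Same content, different key insertion order.
first = {"shape": [20, 2], "sample_seed": 101}
second = {"sample_seed": 101, "shape": [20, 2]}
print("plain dumps equal:", json.dumps(first) == json.dumps(second))
print("sorted dumps equal:", json.dumps(first, sort_keys=True) == json.dumps(second, sort_keys=True))
```

Without `sort_keys=True`, the serialized text follows dictionary insertion order, so two logically identical payloads can produce different files and noisy diffs.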

Calibration tip:

- start with a small `sample_n` (e.g., 5–10)
- if the LLM misses important patterns, add targeted summaries rather than dumping more rows
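
While calibrating `sample_n`, it can help to enforce a hard cap on the artifact size so it stays bounded. The following is a minimal sketch under stated assumptions: the function `fit_sample_to_budget` is hypothetical, and it uses character count as a cheap proxy for tokens.

```python
import json
from typing import Dict, List


def fit_sample_to_budget(
    summary: Dict[str, object],
    rows: List[Dict[str, object]],
    max_chars: int,
) -> str:
    """Serialize summary plus as many leading rows as fit within max_chars.

    If the summary alone exceeds max_chars, the payload is returned with an
    empty sample so the caller can detect the overflow by checking its length.
    """
    kept = list(rows)
    while True:
        payload = json.dumps({**summary, "sample_rows": kept}, sort_keys=True)
        if len(payload) <= max_chars or not kept:
            return payload
        kept.pop()  # drop the last row and re-serialize


# Hypothetical summary and rows for illustration.
summary = {"shape": [100, 2], "columns": ["id", "city"]}
rows = [{"id": i, "city": "NY"} for i in range(10)]
payload = fit_sample_to_budget(summary, rows, max_chars=200)
kept_rows = json.loads(payload)["sample_rows"]
print(len(payload), len(kept_rows))
```

Rows are dropped before summaries here because, as the tip above suggests, a targeted summary usually carries more information per character than one more sample row.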

In [None]:
from typing import Any, Dict


def add_numeric_summary(df) -> Dict[str, Any]:
    # TODO: add min/mean/max per numeric column.
    return {}


def add_top_categories(df, top_k: int = 3) -> Dict[str, Any]:
    # TODO: compute top categories for object/string columns.
    return {}


if pd is None:
    print("pandas not available; skipping exercise cells")
else:
    extra = {
        "numeric_summary": add_numeric_summary(df),
        "top_categories": add_top_categories(df, top_k=3),
    }
    (OUTPUT_DIR / "compressed_extras.json").write_text(json.dumps(extra, indent=2, sort_keys=True), encoding="utf-8")
    print("wrote:", OUTPUT_DIR / "compressed_extras.json")

## Appendix: Solutions (peek only after trying)

Reference implementations for the TODO functions in this notebook.

In [None]:
def add_numeric_summary(df) -> Dict[str, Any]:
    numeric = df.select_dtypes(include="number")
    out: Dict[str, Any] = {}
    if numeric.shape[1] == 0:
        return out
    for col in numeric.columns:
        s = numeric[col].dropna()
        if len(s) == 0:
            out[str(col)] = {"min": None, "mean": None, "max": None}
        else:
            out[str(col)] = {
                "min": float(s.min()),
                "mean": float(s.mean()),
                "max": float(s.max()),
            }
    return out


def add_top_categories(df, top_k: int = 3) -> Dict[str, Any]:
    """Return the top_k most frequent values for each object-dtype column."""
    # Only object-dtype columns are selected; categorical columns are not covered.
    obj = df.select_dtypes(include=["object"])
    out: Dict[str, Any] = {}

    for col in obj.columns:
        # Count missing values under an explicit "<NA>" label instead of dropping them.
        vc = obj[col].fillna("<NA>").astype(str).value_counts(dropna=False)
        top = vc.head(int(top_k))
        out[str(col)] = [{"value": str(idx), "count": int(cnt)} for idx, cnt in top.items()]
    return out


if pd is not None:
    extra_solution = {
        "numeric_summary": add_numeric_summary(df),
        "top_categories": add_top_categories(df, top_k=3),
    }
    (OUTPUT_DIR / "compressed_extras_solution.json").write_text(
        json.dumps(extra_solution, indent=2, sort_keys=True),
        encoding="utf-8",
    )
    print("wrote:", OUTPUT_DIR / "compressed_extras_solution.json")