# Week 6 — Part 02: Sampling and compression for tabular data

**Estimated time:** 75–120 minutes

## Learning Objectives

- Explain why compression is required for large tables
- Build a compressed representation (stats + sample rows)
- Produce deterministic JSON artifacts for LLM input
- Extend the compression with numeric and categorical summaries (TODO exercises)


## Overview

You usually cannot send a full dataset to an LLM because it will not fit in the context window. Instead, you send a compressed representation:

- descriptive stats
- missingness summary
- a small sample of rows
- detected anomalies

---

## Underlying theory: you are fitting information into a fixed budget

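The model has a fixed context window of $C$ tokens. The prompt instructions ($T_{\text{prompt}}$), the table representation ($T_{\text{table}}$), and the tokens reserved for the response ($T_{\text{output}}$) must fit inside it together: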

$$
C \ge T_{\text{prompt}} + T_{\text{table}} + T_{\text{output}}
$$

If your table is large, $T_{\text{table}}$ dominates. Compression reduces $T_{\text{table}}$ by replacing raw rows with summaries.

Practical implication: good compression keeps *the facts that matter for the task* while dropping redundant detail.
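
The budget constraint can be turned into a quick arithmetic check. The sketch below is illustrative only: `estimate_tokens`, `table_budget`, and `fits` are hypothetical helpers, and the ratio of roughly 4 characters per token is an assumption that varies by tokenizer and by content. For real limits, count tokens with the tokenizer of the model you actually use.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. The chars-per-token ratio is an assumption."""
    return math.ceil(len(text) / chars_per_token)


def table_budget(context_window: int, prompt_tokens: int, output_tokens: int) -> int:
    """Tokens left for the table: C - T_prompt - T_output, floored at zero."""
    return max(0, context_window - prompt_tokens - output_tokens)


def fits(table_text: str, budget: int) -> bool:
    """True if the estimated size of the table text is within the budget."""
    return estimate_tokens(table_text) <= budget


# Example numbers are made up for illustration.
budget = table_budget(context_window=8000, prompt_tokens=500, output_tokens=1000)
print(budget)  # 6500

raw_rows = "x" * 100_000  # stand-in for a large table dumped as text
summary = "x" * 2_000     # stand-in for a compressed representation

print(fits(raw_rows, budget))  # False (about 25000 estimated tokens)
print(fits(summary, budget))   # True (about 500 estimated tokens)
```

Reserving the output tokens up front matters: a table that exactly fills the remaining window leaves the model no room to answer.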

```python
from __future__ import annotations

import json
from dataclasses import dataclass


try:
    import pandas as pd
except ImportError as e:  # pragma: no cover
    pd = None
    _pd_import_error = e


@dataclass
class CompressedTable:
    """Compact, JSON-serializable summary of a DataFrame."""

    shape: tuple[int, int]
    columns: list[str]
    dtypes: dict[str, str]
    missing: dict[str, int]
    sample_rows: list[dict]
    sample_seed: int


def compress_table_v2(df, *, sample_n: int = 6, seed: int = 7) -> CompressedTable:
    # Seeded sampling keeps the selected rows stable across runs.
    sample = df.sample(n=min(sample_n, len(df)), random_state=seed) if len(df) > 0 else df
    # Map missing values to None so json.dumps writes null instead of the
    # non-standard NaN token (assumes scalar cell values).
    rows = [
        {k: (None if pd.isna(v) else v) for k, v in row.items()}
        for row in sample.to_dict(orient="records")
    ]
    return CompressedTable(
        shape=(int(df.shape[0]), int(df.shape[1])),
        columns=list(df.columns),
        dtypes={c: str(t) for c, t in df.dtypes.to_dict().items()},
        missing={c: int(v) for c, v in df.isna().sum().to_dict().items()},
        sample_rows=rows,
        sample_seed=seed,
    )


def to_json(ct: CompressedTable) -> str:
    payload = {
        "shape": ct.shape,
        "columns": ct.columns,
        "dtypes": ct.dtypes,
        "missing": ct.missing,
        "sample_rows": ct.sample_rows,
        "sample_seed": ct.sample_seed,
    }
    # sort_keys=True fixes the key order so the output is diff-friendly.
    return json.dumps(payload, indent=2, sort_keys=True)


if pd is None:
    print("pandas not available:", _pd_import_error)
else:
    df = pd.DataFrame(
        {
            "age": [29, 31, None, 42, 25],
            "city": ["NY", "LA", "LA", "SF", "NY"],
            "salary": [90000, 120000, 95000, None, 70000],
        }
    )
    ct = compress_table_v2(df, sample_n=4, seed=101)
    print(to_json(ct))
```
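
Missing values need care at serialization time. pandas stores missing numerics as float NaN, and `json.dumps` writes a float NaN as the bare token `NaN`, which is not valid JSON and which strict parsers reject. The stdlib-only sketch below shows that behavior and one way to map NaN to `null`. `clean_missing` is a hypothetical helper introduced here for illustration, and the row is made up.

```python
import json
import math


def clean_missing(row: dict) -> dict:
    """Map float NaN values to None so they serialize as JSON null."""
    return {
        k: (None if isinstance(v, float) and math.isnan(v) else v)
        for k, v in row.items()
    }


# Made-up row with a missing numeric value.
row = {"age": float("nan"), "city": "LA", "salary": 95000.0}

# Default behavior: NaN is written as a non-standard token.
default_text = json.dumps(row, sort_keys=True)
print(default_text)  # {"age": NaN, "city": "LA", "salary": 95000.0}

# Strict mode refuses to serialize NaN at all.
try:
    json.dumps(row, allow_nan=False)
    strict_rejected = False
except ValueError:
    strict_rejected = True
print(strict_rejected)  # True

# After cleaning, strict mode succeeds and emits null.
clean_text = json.dumps(clean_missing(row), sort_keys=True, allow_nan=False)
print(clean_text)  # {"age": null, "city": "LA", "salary": 95000.0}
```

Passing `allow_nan=False` when writing the final artifact is a cheap guard: it fails loudly if a NaN slips through instead of producing a payload that only some consumers can parse.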

## Why the design choices matter

- sampling takes a `seed`, so the same rows are drawn on every run for the same input
- `sort_keys=True` fixes the key order, so identical data always serializes to identical JSON (diff-friendly)
- a structured object (`CompressedTable`) gives the payload an explicit contract that is easier to evolve later
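
The determinism claims can be checked directly. The following is a minimal stdlib-only sketch: it uses `random.Random(seed).sample` as a stand-in for seeded DataFrame sampling, and `build_artifact` and `fingerprint` are hypothetical names. The rows are made up for illustration.

```python
import hashlib
import json
import random

# Made-up rows standing in for a table.
ROWS = [
    {"age": 29, "city": "NY"},
    {"age": 31, "city": "LA"},
    {"age": 42, "city": "SF"},
    {"age": 25, "city": "NY"},
    {"age": 37, "city": "LA"},
]


def build_artifact(rows: list[dict], *, sample_n: int, seed: int) -> str:
    """Seeded sample plus sorted keys gives a byte-identical string per input."""
    rng = random.Random(seed)  # local generator; does not touch global state
    sample = rng.sample(rows, k=min(sample_n, len(rows)))
    payload = {"sample_seed": seed, "n_rows": len(rows), "sample_rows": sample}
    return json.dumps(payload, indent=2, sort_keys=True)


def fingerprint(text: str) -> str:
    """Short hash of the artifact, handy for logs and cache keys."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


first = build_artifact(ROWS, sample_n=3, seed=101)
second = build_artifact(ROWS, sample_n=3, seed=101)
print(first == second)                            # True
print(fingerprint(first) == fingerprint(second))  # True

# Key insertion order no longer matters once keys are sorted.
same_order = json.dumps({"b": 1, "a": 2}, sort_keys=True) == json.dumps(
    {"a": 2, "b": 1}, sort_keys=True
)
print(same_order)  # True
```

Storing the seed inside the payload, as `CompressedTable.sample_seed` does, lets anyone regenerate the same sample from the same data later.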

Calibration tip:

- start with a small `sample_n` (e.g., 5–10)
- if the LLM misses important patterns, add targeted summaries rather than dumping more rows
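
One way to calibrate is to measure how the payload grows with `sample_n` before deciding to send more rows. The sketch below uses only the standard library and synthetic rows invented for illustration. `payload_chars` is a hypothetical helper, and character count is only a proxy for token count.

```python
import json


def payload_chars(rows: list[dict], sample_n: int) -> int:
    """Size in characters of a JSON payload holding the first sample_n rows."""
    return len(json.dumps(rows[:sample_n], sort_keys=True))


# Synthetic rows, invented for illustration only.
cities = ["NY", "LA", "SF"]
rows = [
    {"age": 20 + i % 40, "city": cities[i % 3], "salary": 50000 + 1000 * i}
    for i in range(200)
]

sizes = {n: payload_chars(rows, n) for n in (5, 10, 50, 200)}
for n, size in sizes.items():
    print(f"sample_n={n:>3}: {size} chars")
```

Row samples cost roughly the same amount per extra row, while a per-column summary has a size that depends on the number of columns rather than the number of rows. That is why targeted summaries tend to be the cheaper way to expose a pattern the sample missed.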

```python
def add_numeric_summary(df) -> dict:
    # TODO: add min/mean/max per numeric column.
    raise NotImplementedError


def add_top_categories(df, top_k: int = 3) -> dict:
    # TODO: compute top categories for object/string columns.
    raise NotImplementedError


print("Implement add_numeric_summary() and add_top_categories().")
```