# 06: Private Evolution for Tabular Data

This notebook implements Private Evolution (PE) adapted for the DCA telemetry wide table, following Lin et al. (2024) and Swanberg et al. (2025). Instead of training a generative model with DP-SGD, PE uses black-box API access to a foundation model (GPT-5 nano) and a DP nearest-neighbor histogram to iteratively select synthetic candidates that best approximate the real data distribution.

## Outline

1. Load the wide training table (from notebook 05)
2. Configure PE parameters and privacy budget
3. Run PE (RANDOM_API -> DP histogram -> selection -> VARIATION_API; the VARIATION_API step is skipped when T=1)
4. Decompose synthetic wide table into reporting tables
5. Run the benchmark queries on the PE synthetic data
6. Compare with ground truth and DP-SGD results

In [1]:
import sys
import os
from pathlib import Path

import numpy as np
import pandas as pd
from IPython.display import display, Markdown
from dotenv import load_dotenv

load_dotenv(Path("../.env"))

sys.path.insert(0, str(Path("..").resolve()))

REPORTING = Path("../data/reporting")
QUERIES_DIR = Path("../docs/queries")
REAL_RESULTS = Path("../data/results/real")
PE_REPORTING = Path("../data/reporting/pe")
PE_RESULTS = Path("../data/results/pe")
MODEL = "gpt-5-nano"

---
## Step 1: Load the wide training table

In [2]:
wide = pd.read_parquet(REPORTING / "wide_training_table.parquet")

cat_cols = ["chassistype", "countryname_normalized", "modelvendor_normalized",
            "os", "cpuname", "cpucode", "cpu_family", "persona", "processornumber"]
numeric_cols = [c for c in wide.columns if c != "guid" and c not in cat_cols]

display(Markdown(
    f"Wide table: {len(wide):,} rows x {len(wide.columns)} columns\n\n"
    f"Categorical: {len(cat_cols)} columns, Numeric: {len(numeric_cols)} columns"
))

Wide table: 1,000,000 rows x 69 columns

Categorical: 9 columns, Numeric: 59 columns

---
## Step 2: Configure PE and privacy budget

Following Swanberg et al. (2025), who find that T=1 is optimal for tabular PE, we use a single iteration as the primary setting. We match the DP-SGD privacy budget: epsilon=4.0, delta=1e-5.

The noise multiplier sigma is calibrated with the analytic Gaussian mechanism (Balle and Wang, 2018). Composition across iterations follows Dong et al. (2019): T adaptive iterations, each adding Gaussian noise with standard deviation sigma, compose into a single Gaussian mechanism whose effective sensitivity is sqrt(T). With T=1 this reduces to a single Gaussian release.
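As a rough illustration of this calibration, the sketch below solves the analytic Gaussian condition of Balle and Wang (2018) for sigma by bisection, using only the standard library. It is not the code in `src.pe.privacy`: the names `normal_cdf`, `gaussian_delta` and `calibrate_sigma_sketch` are hypothetical, and it assumes that one record changes the vote histogram by at most 1 in L2 norm per iteration, so that T iterations have effective sensitivity sqrt(T).

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF, written with erfc so small tails stay accurate."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def gaussian_delta(sigma: float, epsilon: float, sensitivity: float) -> float:
    """Delta achieved by N(0, sigma^2) noise at a given epsilon and L2 sensitivity
    (the analytic Gaussian mechanism expression)."""
    a = sensitivity / (2.0 * sigma)
    b = epsilon * sigma / sensitivity
    return normal_cdf(a - b) - math.exp(epsilon) * normal_cdf(-a - b)


def calibrate_sigma_sketch(epsilon: float, delta: float, T: int = 1) -> float:
    """Smallest sigma (up to float precision) meeting (epsilon, delta)-DP.

    Assumption: per-iteration L2 sensitivity is 1, so T iterations are
    treated as one Gaussian mechanism with sensitivity sqrt(T).
    """
    sensitivity = math.sqrt(T)
    lo, hi = 1e-3, 1e3  # delta is far too large at lo and effectively 0 at hi
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gaussian_delta(mid, epsilon, sensitivity) > delta:
            lo = mid  # too little noise
        else:
            hi = mid  # enough noise, try less
    return hi


sigma_sketch = calibrate_sigma_sketch(4.0, 1e-5, T=1)
print(f"sigma for T=1: {sigma_sketch:.4f}")
print(f"sigma for T=4: {calibrate_sigma_sketch(4.0, 1e-5, T=4):.4f}")
```

Because the expression depends on sigma only through the ratio sigma / sensitivity, the calibrated sigma in this sketch grows as sqrt(T): running more iterations at a fixed budget means proportionally more noise per histogram.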

In [3]:
from src.pe.privacy import calibrate_sigma, compute_epsilon

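N_SYNTH = 50000   # number of synthetic records kept after selection
T = 1             # PE iterations
L = 3             # population multiplier: the initial population has N_SYNTH * L candidates
EPSILON = 4.0     # target privacy budget, matched to the DP-SGD run
DELTA = 1e-5
MODEL = "gpt-5-nano"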

sigma = calibrate_sigma(EPSILON, DELTA, T)

display(Markdown(
    f"PE configuration:\n\n"
    f"- Model: `{MODEL}`\n"
    f"- N_synth: {N_SYNTH:,}\n"
    f"- T (iterations): {T}\n"
    f"- L (variations per candidate + 1): {L}\n"
    f"- Target epsilon: {EPSILON}, delta: {DELTA}\n"
    f"- Calibrated sigma: {sigma:.4f}\n"
    f"- Initial population: {N_SYNTH * L:,} (N_synth x L)\n"
    f"- Privacy guarantee: (epsilon={EPSILON}, delta={DELTA})-DP via analytic Gaussian mechanism"
))

PE configuration:

- Model: `gpt-5-nano`
- N_synth: 50,000
- T (iterations): 1
- L (variations per candidate + 1): 3
- Target epsilon: 4.0, delta: 1e-05
- Calibrated sigma: 1.0812
- Initial population: 150,000 (N_synth x L)
- Privacy guarantee: (epsilon=4.0, delta=1e-05)-DP via analytic Gaussian mechanism

---
## Step 3: Run Private Evolution

The PE loop:
1. RANDOM_API generates 150,000 initial candidates (N_synth x L = 50,000 x 3)
2. Each of the 1M real records casts one vote for its nearest synthetic candidate under the workload-aware distance
3. Gaussian noise with standard deviation sigma is added to the vote histogram to satisfy DP
4. The 50,000 candidates with the highest noisy vote counts are selected

With T=1, there is no VARIATION_API call (selection is the final step).
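The voting and selection steps above can be sketched on toy data as follows. This is an illustration, not the `private_evolution` implementation: `dp_nn_select` is a hypothetical helper, plain squared Euclidean distance stands in for the notebook's workload-aware distance, and the full distance matrix is built in memory rather than in chunks.

```python
import numpy as np


def dp_nn_select(real, candidates, n_select, sigma, rng):
    """Pick n_select candidates by noisy nearest-neighbour votes (toy sketch)."""
    # Pairwise squared Euclidean distances, shape (n_real, n_candidates).
    # Assumption: a stand-in for the workload-aware distance.
    diffs = real[:, None, :] - candidates[None, :, :]
    dist2 = (diffs ** 2).sum(axis=2)

    # Each real record casts exactly one vote, for its nearest candidate.
    nearest = dist2.argmin(axis=1)
    votes = np.bincount(nearest, minlength=len(candidates)).astype(float)

    # Gaussian noise is added to every histogram bin.
    noisy_votes = votes + rng.normal(0.0, sigma, size=len(candidates))

    # Keep the candidates with the highest noisy counts.
    selected = np.argsort(-noisy_votes)[:n_select]
    return selected, votes


rng = np.random.default_rng(0)
real = rng.normal(size=(1_000, 4))       # toy "real" records
candidates = rng.normal(size=(30, 4))    # toy population: N_synth x L = 10 x 3
selected, votes = dp_nn_select(real, candidates, n_select=10, sigma=1.0812, rng=rng)
print(len(selected), int(votes.sum()))   # 10 1000
```

In this sketch the real records influence the result only through the noisy histogram; ranking and truncation operate on the noisy counts alone.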

In [4]:
import importlib
import src.pe.api, src.pe.distance, src.pe.privacy, src.pe.histogram
importlib.reload(src.pe.api)
importlib.reload(src.pe.distance)
importlib.reload(src.pe.privacy)
importlib.reload(src.pe.histogram)
from src.pe.histogram import private_evolution
from src.pe.api import PEApi

api = PEApi(wide, model=MODEL, max_concurrent=50)

USE_BATCH = True
WORK_DIR = Path("../data/batch_jobs")
CHECKPOINT_DIR = Path("../data/pe_checkpoints")
WORK_DIR.mkdir(parents=True, exist_ok=True)
CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)

synth_wide, pe_history = await private_evolution(
    real_df=wide,
    api=api,
    n_synth=N_SYNTH,
    T=T,
    L=L,
    epsilon=EPSILON,
    delta=DELTA,
    real_chunk=5000,
    synth_chunk=10000,
    batch_size=10,
    variation_batch_size=5,
    use_batch=USE_BATCH,
    work_dir=WORK_DIR,
    checkpoint_dir=CHECKPOINT_DIR,
)

display(Markdown(
    f"PE complete:\n\n"
    f"- Synthetic records: {len(synth_wide):,}\n"
    f"- Total time: {pe_history['total_time']:.1f}s\n"
    f"- Actual epsilon: {pe_history['actual_epsilon']:.4f}\n"
    f"- Sigma: {pe_history['sigma']:.4f}\n"
    f"- Mode: {'Batch API (50% cheaper)' if USE_BATCH else 'Realtime API'}"
))

PE config: N_synth=50000, T=1, L=3, epsilon=4.0, delta=1e-05, sigma=1.0812, voting_records=1,000,000, mode=Batch API (50% cheaper)

--- Generating initial population (N=150000) ---
RANDOM_API_BATCH: 150000 records (18750 calls across 24 batch(es), 25% buffer)
  24 sequential chunk(s) of up to 800 requests
  RANDOM chunk 1/24: loaded 7893 cached records
  RANDOM chunk 2/24: loaded 7886 cached records
  RANDOM chunk 3/24: loaded 7894 cached records
  RANDOM chunk 4/24: loaded 7979 cached records
  RANDOM chunk 5/24: resuming batch batch_698d9d3ccdfc81909f0ff84f6cf51f4d
  RANDOM chunk 5/24: 780/800 done, 0 failed [in_progress]
  RANDOM chunk 5/24: 780/800 done, 0 failed [in_progress]
  RANDOM chunk 5/24: 780/800 done, 0 failed [in_progress]
  RANDOM chunk 5/24: 780/800 done, 0 failed [in_progress]
  RANDOM chunk 5/24: 780/800 done, 0 failed [in_progress]
  RANDOM chunk 5/24: 800/800 done, 0 failed [finalizing]
  RANDOM chunk 5/24: 800/800 done, 0 failed [finalizing]
  RANDOM chunk 5/24: 8

APITimeoutError: Request timed out.

In [None]:
synth_wide.to_parquet(REPORTING / "pe_wide_table.parquet", index=False)

display(Markdown(f"Saved PE synthetic wide table: {len(synth_wide):,} rows x {len(synth_wide.columns)} columns"))
display(synth_wide.head())

### Inspect sparsity patterns

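A key question: does the LLM generate realistic sparsity patterns? The cell below compares, for each numeric column, the percentage of nonzero (strictly positive) values in the real table and in the PE synthetic table.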

In [None]:
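sparsity_rows = []
for c in numeric_cols:
    real_nz = (wide[c] > 0).mean() * 100  # percent of rows with a positive value
    # NaN marks a column missing from the synthetic table, so it is not mistaken for an all-zero column.
    synth_nz = (synth_wide[c] > 0).mean() * 100 if c in synth_wide.columns else np.nan
    sparsity_rows.append({
        "column": c,
        "real_nonzero_pct": round(real_nz, 1),
        "synth_nonzero_pct": round(synth_nz, 1),
    })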

sparsity_df = pd.DataFrame(sparsity_rows)
display(Markdown("Nonzero percentage comparison (real vs PE synthetic):"))
display(sparsity_df)

---
## Step 4: Decompose into reporting tables

In [None]:
from src.eval.decompose import decompose_wide_table

counts = decompose_wide_table(synth_wide, PE_REPORTING)

rows = "\n".join(f"| {t} | {c:,} |" for t, c in counts.items())
display(Markdown(f"Decomposed into {len(counts)} synthetic reporting tables:\n\n| Table | Rows |\n|---|---|\n{rows}"))

---
## Step 5: Benchmark evaluation

Run the same 8 benchmark queries that were used to evaluate DP-SGD, this time against the PE synthetic reporting tables.

In [None]:
from src.eval.benchmark import run_benchmark

eval_queries = [
    "avg_platform_power_c0_freq_temp_by_chassis",
    "Xeon_network_consumption",
    "pkg_power_by_country",
    "ram_utilization_histogram",
    "battery_power_on_geographic_summary",
    "persona_web_cat_usage_analysis",
    "popular_browsers_by_count_usage_percentage",
    "most_popular_browser_in_each_country_by_system_count",
]

pe_results = run_benchmark(eval_queries, QUERIES_DIR, PE_REPORTING, PE_RESULTS)

display(Markdown(f"{len(pe_results)}/{len(eval_queries)} queries executed on PE synthetic data."))
for name, df in pe_results.items():
    display(Markdown(f"### `{name}` ({len(df)} rows)"))
    display(df.head(10))

---
## Step 6: Comparison with ground truth and DP-SGD

In [None]:
DPSGD_RESULTS = Path("../data/results/synthetic")

comparison_rows = []
for name in eval_queries:
    real_path = REAL_RESULTS / f"{name}.csv"
    dpsgd_path = DPSGD_RESULTS / f"{name}.csv"
    pe_path = PE_RESULTS / f"{name}.csv"

    if not real_path.exists():
        continue
    real_df = pd.read_csv(real_path)
    # Load each synthetic result once per query instead of once per column.
    dpsgd_df = pd.read_csv(dpsgd_path) if dpsgd_path.exists() else None
    pe_df = pd.read_csv(pe_path) if pe_path.exists() else None

    for col in real_df.select_dtypes(include=[np.number]).columns:
        real_mean = real_df[col].mean()
        # Skip near-zero means, where the relative error is undefined or unstable.
        if abs(real_mean) < 1e-10:
            continue

        row = {"query": name.replace("_", " "), "column": col, "real_mean": real_mean}

        for label, synth_df in (("dpsgd", dpsgd_df), ("pe", pe_df)):
            if synth_df is not None and col in synth_df.columns:
                synth_mean = synth_df[col].mean()
                row[f"{label}_mean"] = synth_mean
                row[f"{label}_rel_error"] = abs(real_mean - synth_mean) / abs(real_mean)

        comparison_rows.append(row)

comp_df = pd.DataFrame(comparison_rows)
display(Markdown("Column-level mean comparison (real vs DP-SGD vs PE):"))
display(comp_df)

In [None]:
browser_query = "most_popular_browser_in_each_country_by_system_count"
real_browsers = pd.read_csv(REAL_RESULTS / f"{browser_query}.csv")

# Score each method independently so that a missing PE result does not hide the DP-SGD comparison.
for label, results_dir in (("PE", PE_RESULTS), ("DP-SGD", DPSGD_RESULTS)):
    synth_path = results_dir / f"{browser_query}.csv"
    if not synth_path.exists():
        continue
    synth_browsers = pd.read_csv(synth_path)
    # Inner join: only countries present in both result sets are scored.
    merged = real_browsers.merge(synth_browsers, on="country", suffixes=("_real", "_synth"), how="inner")
    total = len(merged)
    if total == 0:
        # Avoid dividing by zero when the two result sets share no countries.
        display(Markdown(f"Browser ranking accuracy ({label}): no countries in common"))
        continue
    matches = (merged["browser_real"] == merged["browser_synth"]).sum()
    display(Markdown(
        f"Browser ranking accuracy ({label}): {matches}/{total} countries correct "
        f"({100*matches/total:.0f}%)"
    ))

---
## Summary

In [None]:
summary_lines = [
    "| | DP-SGD (VAE) | Private Evolution |",
    "|---|---|---|",
    "| Model | DP-VAE (505K params) | GPT-5 nano (API) |",
    f"| Privacy | (3.996, 1e-5)-DP | ({pe_history['actual_epsilon']:.3f}, 1e-5)-DP |",
    f"| Synthetic records | 1,000,000 | {len(synth_wide):,} |",
    f"| Training/generation time | 360 min (CPU) | {pe_history['total_time']:.0f}s |",
    f"| Iterations | 20 epochs | {T} PE iteration(s) |",
]

display(Markdown("\n".join(summary_lines)))