# 06: Private Evolution for Tabular Data

This notebook implements Private Evolution (PE) adapted for the DCA telemetry wide table, following Lin et al. (2024) and Swanberg et al. (2025). Instead of training a generative model with DP-SGD, PE uses black-box API access to a foundation model (GPT-5 nano) and a DP nearest-neighbor histogram to iteratively select synthetic candidates that best approximate the real data distribution.

## Outline

1. Load the wide training table (from notebook 05)
2. Configure PE parameters and the privacy budget, and smoke-test the API
3. Run PE (RANDOM_API -> DP histogram -> selection; VARIATION_API is only called when T > 1)
4. Decompose the synthetic wide table into reporting tables
5. Run the benchmark queries
6. Compare with ground truth and DP-SGD results

In [1]:
import sys
import os
from pathlib import Path

import numpy as np
import pandas as pd
from IPython.display import display, Markdown
from dotenv import load_dotenv

load_dotenv(Path("../.env"))

sys.path.insert(0, str(Path("..").resolve()))

REPORTING = Path("../data/reporting")
QUERIES_DIR = Path("../docs/queries")
REAL_RESULTS = Path("../data/results/real")
PE_REPORTING = Path("../data/reporting/pe")
PE_RESULTS = Path("../data/results/pe")
MODEL = "gpt-5-nano"

---
## Step 1: Load the wide training table

In [2]:
wide = pd.read_parquet(REPORTING / "wide_training_table.parquet")

cat_cols = ["chassistype", "countryname_normalized", "modelvendor_normalized",
            "os", "cpuname", "cpucode", "cpu_family", "persona", "processornumber"]
numeric_cols = [c for c in wide.columns if c != "guid" and c not in cat_cols]

display(Markdown(
    f"Wide table: {len(wide):,} rows x {len(wide.columns)} columns\n\n"
    f"Categorical: {len(cat_cols)} columns, Numeric: {len(numeric_cols)} columns"
))

Wide table: 1,000,000 rows x 69 columns

Categorical: 9 columns, Numeric: 59 columns

---
## Step 2: Configure PE and privacy budget

Following Swanberg et al. (2025), who report that a single iteration works best for tabular PE, we use T=1 as the primary setting. We match the DP-SGD privacy budget: epsilon=4.0, delta=1e-5.

The noise multiplier sigma is calibrated with the analytic Gaussian mechanism (Balle and Wang, 2018) under adaptive composition (Dong et al., 2019): T iterations, each adding Gaussian noise with standard deviation sigma, compose to a single Gaussian mechanism whose effective sensitivity is sqrt(T) times the per-iteration sensitivity.
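To make the calibration concrete, here is a minimal sketch of how such a sigma could be found. It is not the project's `calibrate_sigma` (whose implementation is not shown here). The helper names `normal_cdf`, `gaussian_delta` and `calibrate_sigma_sketch` are hypothetical, and the sketch assumes that adding or removing one real record changes a single histogram bin by 1, i.e. a per-iteration L2 sensitivity of 1. It evaluates the exact delta of the Gaussian mechanism from Balle and Wang (2018) and bisects on sigma.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF via erfc, which stays accurate in the tails."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def gaussian_delta(epsilon: float, sigma: float, sensitivity: float = 1.0) -> float:
    """Exact delta of the Gaussian mechanism at a given epsilon."""
    mu = sensitivity / sigma
    return (normal_cdf(mu / 2 - epsilon / mu)
            - math.exp(epsilon) * normal_cdf(-mu / 2 - epsilon / mu))


def calibrate_sigma_sketch(epsilon: float, delta: float, T: int = 1,
                           base_sensitivity: float = 1.0) -> float:
    """Smallest sigma (up to bisection tolerance) meeting (epsilon, delta)-DP.

    Assumption: T rounds compose into one Gaussian mechanism whose
    sensitivity is sqrt(T) times the per-round sensitivity.
    """
    sensitivity = base_sensitivity * math.sqrt(T)
    lo, hi = 1e-3 * sensitivity, 1e3 * sensitivity
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gaussian_delta(epsilon, mid, sensitivity) > delta:
            lo = mid  # too little noise: delta is still above the target
        else:
            hi = mid  # enough noise: try a smaller sigma
    return hi


sigma_t1 = calibrate_sigma_sketch(4.0, 1e-5, T=1)
sigma_t4 = calibrate_sigma_sketch(4.0, 1e-5, T=4)
print(f"T=1: sigma={sigma_t1:.4f}, T=4: sigma={sigma_t4:.4f}")
```

Because the effective sensitivity grows as sqrt(T), the required sigma grows by the same factor, which is one way to see why extra iterations cost accuracy at a fixed budget.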

### Step 2b: API smoke test

Generate a small batch to verify the API is working, inspect the raw output, and check sparsity patterns before committing to the full run.

In [3]:
from src.pe.api import PEApi

api = PEApi(wide, model=MODEL, max_concurrent=50)
test_df = await api.random_api(20, batch_size=10)

display(Markdown(f"Generated {len(test_df)} records with {len(test_df.columns)} columns"))
display(test_df.head(5))

RANDOM_API: generating 20 records (3 batches of 10, 25% buffer)...
  RANDOM_API: 3/3 calls (42s, 0.1 calls/s)
RANDOM_API: 27 raw records, returning 20 (74% of raw)


Generated 20 records with 68 columns

| | chassistype | countryname_normalized | modelvendor_normalized | os | cpuname | cpucode | cpu_family | persona | processornumber | ram | ... | psys_rap_nrs | psys_rap_avg | pkg_c0_nrs | pkg_c0_avg | avg_freq_nrs | avg_freq_avg | temp_nrs | temp_avg | pkg_power_nrs | pkg_power_avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Notebook | United States of America | Dell | Win11 | 8th Gen i7 | i7-7700HQ | Core i7 | Office/Productivity | 14 nm | 16.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Desktop | Germany | HP | Win10 | 7th Gen i5 | i5-7500U | Core i5 | Office/Productivity | 14 nm | 8.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2 in 1 | Japan | Lenovo | Win11 | 10th Gen i5 | i5-8265U | Core i5 | Web User | 22 nm | 4.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | Notebook | India | Acer | Win11 | 8th Gen i5 | i5-8250U | Core i5 | Casual User | 14 nm | 8.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | Desktop | Brazil | Dell | Win11 | 7th Gen i7 | i7-7700HQ | Core i7 | Gamer | 14 nm | 32.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [4]:
display(Markdown("Categorical value distribution:"))
for c in cat_cols:
    if c in test_df.columns:
        display(Markdown(f"`{c}`: {test_df[c].value_counts().to_dict()}"))

display(Markdown("Numeric sparsity (nonzero counts):"))
sparsity_check = []
for c in numeric_cols:
    if c in test_df.columns:
        nz = int((test_df[c] > 0).sum())
        real_nz_pct = (wide[c] > 0).mean() * 100
        sparsity_check.append({"column": c, "synth_nonzero": f"{nz}/{len(test_df)}", "real_nonzero_pct": f"{real_nz_pct:.1f}%"})
display(pd.DataFrame(sparsity_check))

Categorical value distribution:

`chassistype`: {'Notebook': 7, 'Desktop': 5, '2 in 1': 3, 'Server/WS': 2, 'Tablet': 2, ' Notebook': 1}

`countryname_normalized`: {'United States of America': 4, 'Germany': 3, 'India': 3, 'Japan': 2, 'Brazil': 2, 'Korea, Republic of': 2, 'United Kingdom of Great Britain and Northern Ireland': 2, 'Russian Federation': 1, 'China': 1}

`modelvendor_normalized`: {'Dell': 5, 'HP': 3, 'Lenovo': 3, 'Asus': 3, 'Acer': 2, 'Gigabyte': 2, 'Intel': 2}

`os`: {'Win11': 14, 'Win10': 3, 'Win Server': 2, 'n/a': 1}

`cpuname`: {'10th Gen i5': 5, '8th Gen i7': 3, '8th Gen i5': 3, '7th Gen i5': 2, '7th Gen i7': 2, '6th Gen i5': 2, '6th Gen i7': 2, '3rd Gen i5': 1}

`cpucode`: {'i5-8265U': 4, 'i7-7700HQ': 3, 'i5-8250U': 3, 'i5-7500U': 2, 'i7-6700HQ': 2, 'i7-7500U': 1, 'i5-6200U': 1, 'i7-8750H': 1, 'i5-1035G1': 1, 'i5-6300HQ': 1, 'i5-6400U': 1}

`cpu_family`: {'Core i5': 13, 'Core i7': 7}

`persona`: {'Office/Productivity': 6, 'Casual User': 4, 'Web User': 3, 'Gamer': 3, 'Entertainment': 2, 'Communication': 1, 'Content Creator/IT': 1}

`processornumber`: {'14 nm': 14, '22 nm': 3, '32 nm': 2, '45 nm': 1}

Numeric sparsity (nonzero counts):

| | column | synth_nonzero | real_nonzero_pct |
|---|---|---|---|
| 0 | ram | 20/20 | 99.9% |
| 1 | net_nrs | 0/20 | 3.7% |
| 2 | net_received_bytes | 0/20 | 3.7% |
| 3 | net_sent_bytes | 0/20 | 3.7% |
| 4 | mem_nrs | 0/20 | 6.9% |
| 5 | mem_avg_pct_used | 0/20 | 6.9% |
| 6 | mem_sysinfo_ram | 0/20 | 6.9% |
| 7 | batt_num_power_ons | 0/20 | 2.0% |
| 8 | batt_duration_mins | 0/20 | 2.0% |
| 9 | web_chrome_duration | 1/20 | 5.3% |
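
The categorical output above includes `' Notebook'` (with a leading space) next to `'Notebook'`, so it may be worth checking generated values against the real vocabulary before the full run. The following is only a sketch: `vocab_report` is a hypothetical helper, not part of `src.pe`, and the toy vocabulary below is illustrative rather than taken from the real table.

```python
def vocab_report(synth_values, real_values):
    """Split synthetic categorical values into exact, strip-fixable, and unknown."""
    allowed = set(real_values)
    report = {"exact": 0, "fixable_by_strip": {}, "unknown": {}}
    for value in synth_values:
        if value in allowed:
            report["exact"] += 1
            continue
        # Only strings can carry stray whitespace; leave other types untouched.
        cleaned = value.strip() if isinstance(value, str) else value
        bucket = "fixable_by_strip" if cleaned in allowed else "unknown"
        report[bucket][value] = report[bucket].get(value, 0) + 1
    return report


# Toy inputs (assumed, for illustration only).
real_vocab = ["Notebook", "Desktop", "2 in 1", "Tablet"]
synth = ["Notebook", " Notebook", "Desktop", "Mainframe", "Mainframe"]
report = vocab_report(synth, real_vocab)
print(report)
```

Since the function only iterates over its inputs, it should also accept pandas Series, for example `vocab_report(test_df["chassistype"], wide["chassistype"])`.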


In [5]:
from src.pe.privacy import calibrate_sigma, compute_epsilon

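N_SYNTH = 50000   # synthetic records kept after selection
T = 1             # PE iterations
L = 3             # initial population multiplier (N_SYNTH x L candidates)
EPSILON = 4.0     # matches the DP-SGD privacy budget
DELTA = 1e-5
# MODEL is already defined in the setup cell and was used to build `api`.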

sigma = calibrate_sigma(EPSILON, DELTA, T)

display(Markdown(
    f"PE configuration:\n\n"
    f"- Model: `{MODEL}`\n"
    f"- N_synth: {N_SYNTH:,}\n"
    f"- T (iterations): {T}\n"
    f"- L (variations per candidate + 1): {L}\n"
    f"- Target epsilon: {EPSILON}, delta: {DELTA}\n"
    f"- Calibrated sigma: {sigma:.4f}\n"
    f"- Initial population: {N_SYNTH * L:,} (N_synth x L)\n"
    f"- Privacy guarantee: (epsilon={EPSILON}, delta={DELTA})-DP via analytic Gaussian mechanism"
))

PE configuration:

- Model: `gpt-5-nano`
- N_synth: 50,000
- T (iterations): 1
- L (variations per candidate + 1): 3
- Target epsilon: 4.0, delta: 1e-05
- Calibrated sigma: 1.0812
- Initial population: 150,000 (N_synth x L)
- Privacy guarantee: (epsilon=4.0, delta=1e-05)-DP via analytic Gaussian mechanism

---
## Step 3: Run Private Evolution

The PE loop:
1. RANDOM_API generates 150,000 initial candidates (N_synth x L = 50K x 3)
2. Each of the 1M real records votes for its nearest synthetic candidate under the workload-aware distance, producing a vote histogram over candidates
3. Gaussian noise with standard deviation sigma is added to the histogram to ensure DP
4. The 50,000 top-ranked candidates under the noisy histogram are selected

With T=1, there is no VARIATION_API call (selection is the final step).
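
The voting and selection steps above can be sketched on toy data as follows. This is an illustration, not the code in `src.pe.histogram`: `dp_nn_histogram` and `select_top` are hypothetical names, plain Euclidean distance stands in for the workload-aware distance (which is not specified here), and the full distance matrix is built in memory, which only works for small inputs.

```python
import numpy as np


def dp_nn_histogram(real, candidates, sigma, rng):
    """Noisy count of how many real rows have each candidate as nearest neighbor."""
    # Pairwise squared Euclidean distances, shape (n_real, n_candidates).
    # Assumption: Euclidean distance replaces the workload-aware distance.
    diffs = real[:, None, :] - candidates[None, :, :]
    dist2 = (diffs ** 2).sum(axis=2)
    votes = dist2.argmin(axis=1)  # one vote per real row
    hist = np.bincount(votes, minlength=len(candidates)).astype(float)
    # Adding or removing one real row changes exactly one bin by 1, so
    # Gaussian noise with standard deviation sigma is added to every bin.
    return hist + rng.normal(0.0, sigma, size=hist.shape)


def select_top(candidates, noisy_hist, n_keep):
    """Keep the n_keep candidates with the largest noisy vote counts."""
    top = np.argsort(-noisy_hist)[:n_keep]
    return candidates[top]


rng = np.random.default_rng(0)
real = np.array([[0.1, 0.0], [0.0, 0.2], [-0.1, 0.1], [5.2, 4.9]])
candidates = np.array([[0.0, 0.0], [10.0, 10.0], [5.0, 5.0]])

noisy = dp_nn_histogram(real, candidates, sigma=1.0812, rng=rng)
kept = select_top(candidates, noisy, n_keep=2)
print(noisy, kept, sep="\n")
```

At the notebook's scale (1M real rows against 150,000 candidates) a dense distance matrix would not fit in memory, which is presumably why the call below takes `real_chunk` and `synth_chunk` arguments.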

In [None]:
from src.pe.histogram import private_evolution

USE_BATCH = True
WORK_DIR = Path("../data/batch_jobs")
WORK_DIR.mkdir(parents=True, exist_ok=True)

synth_wide, pe_history = await private_evolution(
    real_df=wide,
    api=api,
    n_synth=N_SYNTH,
    T=T,
    L=L,
    epsilon=EPSILON,
    delta=DELTA,
    real_chunk=5000,
    synth_chunk=10000,
    batch_size=10,
    variation_batch_size=5,
    use_batch=USE_BATCH,
    work_dir=WORK_DIR,
)

display(Markdown(
    f"PE complete:\n\n"
    f"- Synthetic records: {len(synth_wide):,}\n"
    f"- Total time: {pe_history['total_time']:.1f}s\n"
    f"- Actual epsilon: {pe_history['actual_epsilon']:.4f}\n"
    f"- Sigma: {pe_history['sigma']:.4f}\n"
    f"- Mode: {'Batch API (50% cheaper)' if USE_BATCH else 'Realtime API'}"
))

In [None]:
synth_wide.to_parquet(REPORTING / "pe_wide_table.parquet", index=False)

display(Markdown(f"Saved PE synthetic wide table: {len(synth_wide):,} rows x {len(synth_wide.columns)} columns"))
display(synth_wide.head())

### Inspect sparsity patterns

A key question: does the LLM generate realistic sparsity patterns?

In [None]:
sparsity_rows = []
for c in numeric_cols:
    real_nz = (wide[c] > 0).mean() * 100
    synth_nz = (synth_wide[c] > 0).mean() * 100 if c in synth_wide.columns else 0
    sparsity_rows.append({"column": c, "real_nonzero_pct": round(real_nz, 1), "synth_nonzero_pct": round(synth_nz, 1)})

sparsity_df = pd.DataFrame(sparsity_rows)
display(Markdown("Nonzero percentage comparison (real vs PE synthetic):"))
display(sparsity_df)

---
## Step 4: Decompose into reporting tables

In [None]:
from src.eval.decompose import decompose_wide_table

counts = decompose_wide_table(synth_wide, PE_REPORTING)

rows = "\n".join(f"| {t} | {c:,} |" for t, c in counts.items())
display(Markdown(f"Decomposed into {len(counts)} synthetic reporting tables:\n\n| Table | Rows |\n|---|---|\n{rows}"))

---
## Step 5: Benchmark evaluation

Run the same 8 benchmark queries evaluated for DP-SGD.

In [None]:
from src.eval.benchmark import run_benchmark

eval_queries = [
    "avg_platform_power_c0_freq_temp_by_chassis",
    "Xeon_network_consumption",
    "pkg_power_by_country",
    "ram_utilization_histogram",
    "battery_power_on_geographic_summary",
    "persona_web_cat_usage_analysis",
    "popular_browsers_by_count_usage_percentage",
    "most_popular_browser_in_each_country_by_system_count",
]

pe_results = run_benchmark(eval_queries, QUERIES_DIR, PE_REPORTING, PE_RESULTS)

display(Markdown(f"{len(pe_results)}/{len(eval_queries)} queries executed on PE synthetic data."))
for name, df in pe_results.items():
    display(Markdown(f"### `{name}` ({len(df)} rows)"))
    display(df.head(10))

---
## Step 6: Comparison with ground truth and DP-SGD

In [None]:
DPSGD_RESULTS = Path("../data/results/synthetic")

comparison_rows = []
for name in eval_queries:
    real_path = REAL_RESULTS / f"{name}.csv"
    if not real_path.exists():
        continue
    real_df = pd.read_csv(real_path)

    # Load each method's results once per query rather than once per column.
    method_dfs = {}
    for label, path in [("dpsgd", DPSGD_RESULTS / f"{name}.csv"),
                        ("pe", PE_RESULTS / f"{name}.csv")]:
        if path.exists():
            method_dfs[label] = pd.read_csv(path)

    for col in real_df.select_dtypes(include=[np.number]).columns:
        real_mean = real_df[col].mean()
        if abs(real_mean) < 1e-10:
            continue  # relative error is undefined for a (near-)zero reference mean

        row = {"query": name.replace("_", " "), "column": col, "real_mean": real_mean}

        for label, method_df in method_dfs.items():
            if col in method_df.columns:
                method_mean = method_df[col].mean()
                row[f"{label}_mean"] = method_mean
                row[f"{label}_rel_error"] = abs(real_mean - method_mean) / abs(real_mean)

        comparison_rows.append(row)

comp_df = pd.DataFrame(comparison_rows)
display(Markdown("Column-level mean comparison (real vs DP-SGD vs PE):"))
display(comp_df)

In [None]:
browser_query = "most_popular_browser_in_each_country_by_system_count"
real_browsers = pd.read_csv(REAL_RESULTS / f"{browser_query}.csv")

pe_browsers_path = PE_RESULTS / f"{browser_query}.csv"
if pe_browsers_path.exists():
    pe_browsers = pd.read_csv(pe_browsers_path)
    merged = real_browsers.merge(pe_browsers, on="country", suffixes=("_real", "_pe"), how="inner")
    matches = (merged["browser_real"] == merged["browser_pe"]).sum()
    total = len(merged)
    display(Markdown(
        f"Browser ranking accuracy (PE): {matches}/{total} countries correct "
        f"({100*matches/total:.0f}%)"
    ))

    dpsgd_browsers_path = DPSGD_RESULTS / f"{browser_query}.csv"
    if dpsgd_browsers_path.exists():
        dpsgd_browsers = pd.read_csv(dpsgd_browsers_path)
        merged_dpsgd = real_browsers.merge(dpsgd_browsers, on="country", suffixes=("_real", "_dpsgd"), how="inner")
        dpsgd_matches = (merged_dpsgd["browser_real"] == merged_dpsgd["browser_dpsgd"]).sum()
        dpsgd_total = len(merged_dpsgd)
        display(Markdown(
            f"Browser ranking accuracy (DP-SGD): {dpsgd_matches}/{dpsgd_total} countries correct "
            f"({100*dpsgd_matches/dpsgd_total:.0f}%)"
        ))

---
## Summary

In [None]:
summary_lines = [
    "| | DP-SGD (VAE) | Private Evolution |",
    "|---|---|---|",
    "| Model | DP-VAE (505K params) | GPT-5 nano (API) |",
    f"| Privacy | (3.996, 1e-5)-DP | ({pe_history['actual_epsilon']:.3f}, 1e-5)-DP |",
    f"| Synthetic records | 1,000,000 | {len(synth_wide):,} |",
    f"| Training/generation time | 360 min (CPU) | {pe_history['total_time']:.0f}s |",
    f"| Iterations | 20 epochs | {T} PE iteration(s) |",
]

display(Markdown("\n".join(summary_lines)))