# Differential Privacy Demo

**This notebook will:**

* Upload a dataset (CSV)
* Automatically detect categorical and numerical columns
* Train a privacy-preserving synthesizer (patectgan / dpctgan)
* Generate synthetic data while preserving structure
* Match the schema (int/float precision, categorical types) of the original dataset
* Compare the real and synthetic datasets
* Save & download the synthetic dataset

### Step 1: Install dependencies

In [None]:
!pip install pandas numpy smartnoise-synth

This installs Pandas, NumPy, and SmartNoise Synth (Microsoft's DP synthetic data library).

### Step 2: Import libraries

In [None]:
import pandas as pd
from snsynth import Synthesizer
from google.colab import files
import ipywidgets as widgets
from IPython.display import display

**We load the libraries:**

* pandas → data handling
* snsynth → SmartNoise synthesizers
* colab.files → file upload/download
* ipywidgets → to make interactive controls (slider for epsilon, etc.)

### Step 3: Upload your dataset

In [None]:
# Prompt user to upload CSV
print("📂 Please upload your CSV dataset:")
uploaded = files.upload()

# Get uploaded filename
INPUT_CSV = list(uploaded.keys())[0]
print(f"✅ Uploaded file: {INPUT_CSV}")

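This cell lets you upload a CSV file. Colab saves it to the runtime's working directory, where the later steps read it by filename. If you upload several files, only the first one is used.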

### Step 4: Configure privacy budget (epsilon)

In [None]:
# Interactive slider for epsilon
epsilon_slider = widgets.FloatSlider(
    value=1.0,
    min=0.1,
    max=10.0,
    step=0.1,
    description='Epsilon:',
    continuous_update=False
)
display(epsilon_slider)

**Here you can control epsilon (privacy budget).**

* Smaller epsilon = stronger privacy, lower accuracy
* Larger epsilon = weaker privacy, higher accuracy
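
To make the trade-off above concrete, here is a minimal sketch of the classic Laplace mechanism applied to a single count. It is an illustration only: the GAN-based synthesizers in this notebook use different mechanisms internally, and the helper `laplace_count` and the true count of 100 are made up for this example.

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float, rng: np.random.Generator) -> float:
    """Release a count with Laplace noise (a count query has sensitivity 1)."""
    scale = 1.0 / epsilon  # noise scale = sensitivity / epsilon
    return true_count + rng.laplace(loc=0.0, scale=scale)

rng = np.random.default_rng(0)
errors = {}
for eps in (0.1, 1.0, 10.0):
    noisy = np.array([laplace_count(100, eps, rng) for _ in range(10_000)])
    errors[eps] = float(np.abs(noisy - 100).mean())
    print(f"epsilon={eps:>4}: mean absolute error ~ {errors[eps]:.2f}")
```

The average error shrinks roughly as 1/epsilon, which is the same direction of trade-off the slider controls: less noise means more accurate output but a weaker privacy guarantee.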

### Step 5: Helper functions

In [None]:
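def infer_column_types(df, cat_threshold=15):
    """Split columns into categorical and continuous using a simple heuristic.

    A column is categorical if it holds strings/objects or has at most
    cat_threshold distinct values; everything else is continuous.
    The ordinal list is never populated and is returned empty.
    """
    categorical, continuous, ordinal = [], [], []
    for col in df.columns:
        unique_vals = df[col].nunique(dropna=True)
        if pd.api.types.is_object_dtype(df[col]) or unique_vals <= cat_threshold:
            categorical.append(col)
        else:
            continuous.append(col)
    return categorical, continuous, ordinal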

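def preprocess_dataframe(df, continuous_cols):
    """Coerce continuous columns to numeric and fill missing values with the column mean."""
    df = df.copy()  # avoid mutating the caller's DataFrame
    for col in continuous_cols:
        df[col] = pd.to_numeric(df[col], errors="coerce")
        df[col] = df[col].fillna(df[col].mean())
    return df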

def count_decimals(series: pd.Series) -> int:
    """Infer the maximum number of decimal places in a numeric column."""
    decimals = []
    for val in series.dropna().astype(str):
        if "." in val:
            decimals.append(len(val.split(".")[1]))
    return max(decimals) if decimals else 0

def enforce_schema(df_synth, df_original):
    """Match the original column order, integer dtypes, and float decimal places.

    Float columns are returned as fixed-decimal strings so the CSV keeps
    trailing zeros; cast them back with astype(float) before doing arithmetic.
    """
    df_fixed = df_synth.copy()
    for col in df_original.columns:
        if pd.api.types.is_integer_dtype(df_original[col]):
            df_fixed[col] = df_fixed[col].round().astype(int)
        elif pd.api.types.is_float_dtype(df_original[col]):
            dp = count_decimals(df_original[col])
            df_fixed[col] = df_fixed[col].round(dp).astype(float)
            # Format as text: a float column cannot store trailing zeros.
            df_fixed[col] = df_fixed[col].map(lambda x: f"{x:.{dp}f}")
        elif pd.api.types.is_object_dtype(df_original[col]):
            df_fixed[col] = df_fixed[col].astype(str)
    return df_fixed[df_original.columns]


**These functions:**

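* Detect column types (categorical vs continuous) from dtype and number of distinct values
* Fill missing values in continuous columns with the column mean
* Restore column order, integer dtypes, and decimal places in the synthetic output (float columns are written as fixed-decimal text)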
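
One plausible reason for formatting float columns as text is that a float column cannot remember trailing zeros once it is written to CSV. The sketch below shows the difference on a toy `price` column (invented for this example) and checks that the text version still reads back as floats.

```python
import io
import pandas as pd

prices = pd.Series([1.5, 2.25, 3.0], name="price")

# Plain floats: pandas writes the shortest representation, so 1.50 becomes 1.5.
as_float = prices.round(2).to_frame().to_csv(index=False)

# Fixed-decimal strings: every value keeps exactly two decimal places.
as_text = prices.map(lambda x: f"{x:.2f}").to_frame().to_csv(index=False)

print(as_float)
print(as_text)

# Reading the text version back still yields a numeric column.
round_trip = pd.read_csv(io.StringIO(as_text))
print(round_trip["price"].dtype)
```

The cost of this choice is that the in-memory synthetic DataFrame holds strings for those columns, so any later numeric work needs a cast back to float.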

### Step 6: Train synthesizer & generate synthetic data

In [None]:
EPSILON = epsilon_slider.value
SYNTH_NAME = "patectgan"  # or "dpctgan"

# Load dataset
df = pd.read_csv(INPUT_CSV)
print(f"Loaded {df.shape[0]} rows, {df.shape[1]} columns from {INPUT_CSV}")

# Detect schema
categorical_cols, continuous_cols, ordinal_cols = infer_column_types(df)
print("🔎 Detected categorical columns:", categorical_cols)
print("🔎 Detected continuous columns:", continuous_cols)

# Preprocess
df = preprocess_dataframe(df, continuous_cols)

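# Create and train synthesizer
# preprocessor_eps is taken out of the total budget, so keep it below EPSILON
# (the slider allows values as low as 0.1).
PREPROCESSOR_EPS = min(0.2, EPSILON / 2)
synth = Synthesizer.create(SYNTH_NAME, epsilon=EPSILON, verbose=True)
synth.fit(
    df,
    categorical_columns=categorical_cols,
    continuous_columns=continuous_cols,
    ordinal_columns=ordinal_cols,
    preprocessor_eps=PREPROCESSOR_EPS
)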

# Generate synthetic dataset
synth_df = synth.sample(df.shape[0])

# Enforce schema
synth_df = enforce_schema(synth_df, df)

print("✅ Synthetic data generated!")
synth_df.head()

This cell reads the current slider value, trains the synthesizer, and displays the first 5 rows of synthetic data. Re-run it after changing epsilon.
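
The `fit` call above sets aside part of the budget through `preprocessor_eps`. Assuming SmartNoise Synth deducts that amount from the total epsilon when it has to infer bounds for continuous columns (worth confirming against the library documentation), the budget left for training can be sketched as below. `training_budget` is a hypothetical helper written for this illustration, not part of `snsynth`.

```python
def training_budget(epsilon: float, preprocessor_eps: float) -> float:
    """Hypothetical helper: epsilon left for training after preprocessing."""
    remaining = epsilon - preprocessor_eps
    if remaining <= 0:
        raise ValueError(
            f"preprocessor_eps={preprocessor_eps} uses up epsilon={epsilon}"
        )
    return remaining

# Try a few slider positions with a preprocessing share of 0.2.
for eps in (0.1, 0.5, 1.0, 10.0):
    try:
        print(f"epsilon={eps}: {training_budget(eps, 0.2):.2f} left for training")
    except ValueError as err:
        print(f"epsilon={eps}: {err}")
```

Under that assumption, very small slider values leave little or nothing for training, so the preprocessing share should be chosen relative to the total.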

### Step 7: Save & download synthetic dataset

In [None]:
OUTPUT_CSV = "synthetic_dataset.csv"
synth_df.to_csv(OUTPUT_CSV, index=False)
print(f"✅ Synthetic dataset saved to {OUTPUT_CSV}")

files.download(OUTPUT_CSV)

This saves your synthetic dataset to the Colab runtime and starts a browser download of the file.

### Step 8: Evaluate synthetic dataset quality

In [None]:
import matplotlib.pyplot as plt

def compare_distributions(df_real, df_synth, max_cols=6):
    """Plot real vs synthetic distributions side by side for the first max_cols columns."""
    cols = df_real.columns[:max_cols]  # limit to first N columns for readability
    n = len(cols)
    # squeeze=False keeps axes two-dimensional even when n == 1
    fig, axes = plt.subplots(n, 2, figsize=(12, 4 * n), squeeze=False)

    for i, col in enumerate(cols):
        ax1, ax2 = axes[i]

        if pd.api.types.is_numeric_dtype(df_real[col]):
            ax1.hist(df_real[col].dropna(), bins=30, alpha=0.7, color="blue")
            # enforce_schema returns float columns as strings, so cast back to float
            ax2.hist(df_synth[col].dropna().astype(float), bins=30, alpha=0.7, color="orange")
        else:
            df_real[col].value_counts().plot(kind="bar", ax=ax1, color="blue", alpha=0.7)
            df_synth[col].value_counts().plot(kind="bar", ax=ax2, color="orange", alpha=0.7)

        ax1.set_title(f"Real: {col}")
        ax2.set_title(f"Synthetic: {col}")

    plt.tight_layout()
    plt.show()

# Run evaluation
compare_distributions(df, synth_df, max_cols=6)

**This will:**

* Plot up to 6 columns.
* Use histograms for numeric columns.
* Use bar charts for categorical columns.
* Display side-by-side comparison (real vs synthetic).
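
Plots are easy to misread, so a numeric summary can complement them. The sketch below is one possible approach, not part of the original notebook: total variation distance for a categorical column, and mean/standard deviation for a numeric one. The helpers `total_variation` and `numeric_summary` and the toy frames are assumptions made for this example.

```python
import pandas as pd

def total_variation(real: pd.Series, synth: pd.Series) -> float:
    """Total variation distance between category frequencies (0 = identical, 1 = disjoint)."""
    p = real.value_counts(normalize=True)
    q = synth.value_counts(normalize=True)
    # Align on the union of categories; a category missing on one side counts as 0.
    p, q = p.align(q, fill_value=0.0)
    return 0.5 * float((p - q).abs().sum())

def numeric_summary(real: pd.Series, synth: pd.Series) -> pd.DataFrame:
    """Mean and standard deviation of a numeric column, real vs synthetic."""
    return pd.DataFrame({
        "real": real.agg(["mean", "std"]),
        # Synthetic floats may arrive as fixed-decimal strings, so cast first.
        "synthetic": synth.astype(float).agg(["mean", "std"]),
    })

# Toy data standing in for df and synth_df.
real_toy = pd.DataFrame({"color": ["red", "red", "blue", "green"],
                         "price": [1.0, 2.0, 3.0, 4.0]})
synth_toy = pd.DataFrame({"color": ["red", "blue", "blue", "blue"],
                          "price": ["1.50", "2.50", "2.50", "3.50"]})

tv = total_variation(real_toy["color"], synth_toy["color"])
summary = numeric_summary(real_toy["price"], synth_toy["price"])
print(f"Total variation distance (color): {tv:.2f}")
print(summary)
```

On the notebook's data, the same calls would take columns of `df` and `synth_df`; smaller distances and closer means suggest the synthetic data tracks the original more closely.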