# Differentially Private Synthetic Data

[![Run on Colab](https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab)](https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/differential-privacy/differential-privacy.ipynb)

This notebook demonstrates how to train a generator with differential privacy (DP) guarantees, and explores how the DP settings affect data fidelity.

For further background and analysis see also [this blog post](https://mostly.ai/blog/differentially-private-synthetic-data-with-mostly-ai) on "_Differentially Private Synthetic Data with MOSTLY AI_".

In [None]:
#!pip install -U 'mostlyai[local]'

## Load Original Data

In [1]:
import pandas as pd

# fetch original data
df_original = pd.read_csv("https://github.com/mostly-ai/public-demo-data/raw/dev/census/census.csv.gz")
df_original.head(5)

| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


## Train Generators with and without Differential Privacy

In [2]:
from mostlyai.sdk import MostlyAI

# initialize SDK
mostly = MostlyAI(local=True)  # or: MostlyAI(base_url='xxx', api_key='xxx')

Train a generator without DP until fully converged.

In [3]:
g_no_dp = mostly.train(
    config={
        "name": "US Census without DP - full",
        "tables": [
            {
                "name": "census",
                "data": df_original,
            }
        ],
    },
)


Train a generator without DP, but limited to 5 epochs.

In [4]:
g_no_dp_e5 = mostly.train(
    config={
        "name": "US Census without DP - 5 epochs",
        "tables": [
            {
                "name": "census",
                "data": df_original,
                "tabular_model_configuration": {
                    "max_epochs": 5,
                },
            }
        ],
    },
)


Train a generator with DP, keeping all defaults.

In [5]:
g_dp_A = mostly.train(
    config={
        "name": "Census with DP - 1.5 1",
        "tables": [
            {
                "name": "census",
                "data": df_original,
                "tabular_model_configuration": {
                    "differential_privacy": {
                        # upper limit for epsilon; training is terminated early once this threshold is exceeded
                        "max_epsilon": None,
                        # probability with which the privacy guarantee may fail to hold
                        "delta": 1e-5,
                        # ratio of the Gaussian noise's standard deviation to the L2 sensitivity
                        # of the function the noise is added to (i.e. how much noise to add)
                        "noise_multiplier": 1.5,
                        # maximum norm to which the per-sample gradients are clipped during DP training
                        "max_grad_norm": 1.0,
                    },
                },
            }
        ],
    },
)

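
The `noise_multiplier` and `max_grad_norm` settings match the two ingredients of DP-SGD-style training: each per-sample gradient is clipped to a maximum norm, and Gaussian noise scaled to that norm is added before the update. The sketch below illustrates the idea in plain Python. It is not the SDK's training code; the function name `noisy_mean_gradient` and the toy gradients are hypothetical and chosen only for illustration.

```python
import math
import random


def noisy_mean_gradient(per_sample_grads, max_grad_norm=1.0, noise_multiplier=1.5, rng=None):
    """Illustrative DP-SGD-style step: clip per-sample gradients, sum, add noise, average."""
    rng = rng or random.Random()
    n_params = len(per_sample_grads[0])
    total = [0.0] * n_params
    for grad in per_sample_grads:
        # L2 norm of this sample's gradient
        norm = math.sqrt(sum(g * g for g in grad))
        # shrink the gradient only if its norm exceeds max_grad_norm
        scale = min(1.0, max_grad_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            total[i] += g * scale
    # noise standard deviation = noise_multiplier * max_grad_norm
    sigma = noise_multiplier * max_grad_norm
    noisy_sum = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(per_sample_grads) for v in noisy_sum]


# toy batch: the first gradient has norm 5.0 and is clipped, the second (norm 0.5) is kept
toy_grads = [[3.0, 4.0], [0.3, 0.4]]
print(noisy_mean_gradient(toy_grads, rng=random.Random(0)))
```

A larger `noise_multiplier` adds more noise per step and therefore spends less privacy budget per step, at the cost of a noisier training signal.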

Train a generator with DP, using stricter settings than the defaults.

In [7]:
g_dp_B = mostly.train(
    config={
        "name": "Census with DP - 4 2",
        "tables": [
            {
                "name": "census",
                "data": df_original,
                "tabular_model_configuration": {
                    "differential_privacy": {
                        "max_epsilon": None,
                        "delta": 1e-5,
                        "noise_multiplier": 4.0,  # increased compared to default
                        "max_grad_norm": 2.0,  # increased compared to default
                    },
                },
            }
        ],
    },
)


## Compare Metrics across these Runs

In [63]:
generators = [g_no_dp, g_no_dp_e5, g_dp_A, g_dp_B]
for g in generators:
    # fetch final epsilon from message of last model checkpoint
    messages = pd.DataFrame(g.training.progress().steps[3].messages)
    final_msg = messages.loc[messages.is_checkpoint == 1, :].tail(1)
    final_eps = next(iter(final_msg.to_dict("list").get("dp_eps")))
    # print out stats
    print(f"# {g.name}\nAccuracy: {g.accuracy:.1%}\nRuntime: {g.training_time}\nEpsilon: {final_eps}\n")

# US Census without DP - full
Accuracy: 98.2%
Runtime: None
Epsilon: None

# US Census without DP - 5 epochs
Accuracy: 92.4%
Runtime: None
Epsilon: None

# Census with DP - 1.5 1
Accuracy: 95.9%
Runtime: None
Epsilon: 2.53

# Census with DP - 4 2
Accuracy: 92.5%
Runtime: None
Epsilon: 0.84
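
To make the accuracy-privacy trade-off easier to read, the printed numbers can be collected and ordered by epsilon. The snippet below is a small sketch that hard-codes the values from the output above, so a rerun of the notebook will produce different figures. The variable names `results` and `dp_runs` are illustrative.

```python
# values copied from the printed output above; they will differ between runs
results = [
    {"name": "US Census without DP - full", "accuracy": 0.982, "epsilon": None},
    {"name": "US Census without DP - 5 epochs", "accuracy": 0.924, "epsilon": None},
    {"name": "Census with DP - 1.5 1", "accuracy": 0.959, "epsilon": 2.53},
    {"name": "Census with DP - 4 2", "accuracy": 0.925, "epsilon": 0.84},
]

# use the fully converged non-DP run as the fidelity baseline
baseline = results[0]["accuracy"]

# keep only the DP runs and order them from strongest to weakest guarantee
dp_runs = sorted(
    (r for r in results if r["epsilon"] is not None),
    key=lambda r: r["epsilon"],
)

for r in dp_runs:
    gap = baseline - r["accuracy"]
    print(f"{r['name']}: epsilon={r['epsilon']:.2f}, accuracy={r['accuracy']:.1%}, gap to baseline={gap:.1%}")
```

In this run, the stricter configuration reaches a lower epsilon (0.84 versus 2.53) but sits 5.7 percentage points below the fully converged baseline, compared with 2.3 points for the default DP settings.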



## Further exercises

In addition to walking through the above instructions, we suggest that you:
* experiment with different DP settings
* study the impact of the total size of the training data on the final epsilon
* evaluate the accuracy-privacy trade-off for other datasets
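
One possible starting point for the first exercise is to generate a small grid of training configurations. The sketch below only builds the config dictionaries, reusing the keys shown earlier in this notebook. The helper `build_dp_config` and the chosen grid values are hypothetical, and `data` is a placeholder that you would replace with `df_original` (or a subsample of it, for the second exercise).

```python
from itertools import product


def build_dp_config(data, noise_multiplier, max_grad_norm, delta=1e-5, max_epsilon=None):
    """Build a training config with the same structure as the DP examples above."""
    return {
        "name": f"Census with DP - {noise_multiplier} {max_grad_norm}",
        "tables": [
            {
                "name": "census",
                "data": data,
                "tabular_model_configuration": {
                    "differential_privacy": {
                        "max_epsilon": max_epsilon,
                        "delta": delta,
                        "noise_multiplier": noise_multiplier,
                        "max_grad_norm": max_grad_norm,
                    },
                },
            }
        ],
    }


data = None  # placeholder; pass df_original when running inside this notebook
noise_multipliers = [1.0, 1.5, 4.0]  # illustrative values, not recommendations
max_grad_norms = [1.0, 2.0]

configs = [
    build_dp_config(data, nm, gn)
    for nm, gn in product(noise_multipliers, max_grad_norms)
]
print([c["name"] for c in configs])

# inside the notebook, each config could then be trained as in the cells above:
# generators = [mostly.train(config=c) for c in configs]
```

Each training run can take a while, so it may be worth starting with two or three combinations before widening the grid.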

## Conclusion

This tutorial demonstrated how to train generators with and without differential privacy guarantees. Note that DP provides additional mathematical guarantees for use cases that require them. Given the other privacy mechanisms built into the SDK, synthetic data can be considered anonymous even without these stricter DP guarantees. See the [blog post](https://mostly.ai/blog/differentially-private-synthetic-data-with-mostly-ai) for further discussion.