# Differential Privacy (DP) Aggregate Seeded Synthesizer

> Example based on: https://github.com/microsoft/synthetic-data-showcase/blob/main/packages/lib-pacsynth/samples/dp_aggregate_seeded_detailed_example.ipynb

The DP Aggregate Seeded synthesizer is a differentially private synthesizer that relies on DP marginals to build synthetic data. It computes DP marginals (called aggregates) for your dataset using a specified `reporting length`, and then synthesizes data based on the computed aggregate counts.

> Aggregates are computed for attribute combinations of every length up to and including the `reporting length`.

## 1. Overview

### 1.1. Aggregate data generation with DP
Let's consider the following input as an example:

| A | B | C |
| -- | -- | -- |
| a1 | b1 | c1 |
| a1 | b2 | c1 |
| a2 |    | c2 |
| a2 | b2 | c1 |
| a1 | b2 |    |

The input data is assumed to be categorical and the domain will be inferred from the input dataset:

- `A` possible values are `a1,a2`
- `B` possible values are `b1,b2`
- `C` possible values are `c1,c2`

For a `reporting length=2`, the aggregates in the dataset above could be:

- 1-counts:
    - `A:a1`: 3 + NOISE
    - `A:a2`: 2 + NOISE
    - `B:b1`: 1 + NOISE
    - `B:b2`: 3 + NOISE
    - `C:c1`: 3 + NOISE
    - `C:c2`: 1 + NOISE

- 2-counts:
    - `A:a1;C:c1`: 2 + NOISE
    - `A:a2;B:b2`: 1 + NOISE
    - `B:b1;C:c1`: 1 + NOISE
    - `A:a1;B:b1`: 1 + NOISE
    - `A:a1;B:b2`: 2 + NOISE
    - `B:b2;C:c1`: 2 + NOISE
    - `A:a2;C:c2`: 1 + NOISE
    - `B:b2;C:c2`: 0 + NOISE

Also, some spurious combinations might be created and reported to preserve the differential privacy guarantees. Notice that `B:b2;C:c2` does not exist in the sensitive dataset, but it has been _fabricated_ and added to the output.

Similarly, some attribute combinations might be suppressed. For example, even though `A:a2;C:c1` exists in the sensitive dataset, it has not been reported as an aggregate.
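
The counting behind this example can be sketched in plain Python. This is only an illustration of how the exact (noise-free) counts relate to the reported list above, not the library's implementation: the helper name `count_aggregates` is hypothetical, and the noise, fabrication and suppression mechanisms themselves are not modeled.

```python
from collections import Counter
from itertools import combinations

# The example input from the table above; '' marks a missing value.
HEADERS = ['A', 'B', 'C']
RECORDS = [
    ['a1', 'b1', 'c1'],
    ['a1', 'b2', 'c1'],
    ['a2', '', 'c2'],
    ['a2', 'b2', 'c1'],
    ['a1', 'b2', ''],
]


def count_aggregates(headers, records, reporting_length):
    """Count every attribute combination of length 1..reporting_length (no noise)."""
    counts = Counter()
    for record in records:
        # Empty values do not contribute attributes.
        attrs = [f'{h}:{v}' for h, v in zip(headers, record) if v != '']
        for length in range(1, reporting_length + 1):
            for combo in combinations(attrs, length):
                counts[';'.join(combo)] += 1
    return counts


exact = count_aggregates(HEADERS, RECORDS, reporting_length=2)

# The 2-counts listed as reported in the example above.
reported_2 = {
    'A:a1;C:c1', 'A:a2;B:b2', 'B:b1;C:c1', 'A:a1;B:b1',
    'A:a1;B:b2', 'B:b2;C:c1', 'A:a2;C:c2', 'B:b2;C:c2',
}
exact_2 = {key for key in exact if key.count(';') == 1}

fabricated = reported_2 - exact_2  # reported, but absent from the data
suppressed = exact_2 - reported_2  # present in the data, but not reported

print(exact['A:a1'], exact['A:a1;C:c1'])
print(sorted(fabricated), sorted(suppressed))
```

Comparing the two sets recovers exactly the fabricated and suppressed combinations called out in the text.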

### 1.2. Synthesis
Synthetic data is then generated directly from the aggregates computed with differential privacy. Because synthesis only post-processes the DP aggregates, the synthetic data retains the same DP guarantees.

## 2. Imports and global config

In [1]:
import pandas as pd
import math

from snsynth.aggregate_seeded import (
    AggregateSeededSynthesizer,
    AccuracyMode,
    FabricationMode,
    AggregateSeededDataset,
)

from utils import gen_data_frame, ErrorReport

## 3. Generating an example data frame with random data

> `gen_data_frame` is just a utility to generate some example data (the code for it is in [`utils.py`](./utils.py)).

To illustrate the library, let's start by creating an example data frame:

In [2]:
number_of_records_to_generate = 6000

sensitive_df = gen_data_frame(number_of_records_to_generate)

sensitive_df

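|    | H1 | H2 | H3 | H4 | H5 | H6 | H7 | H8 | H9 | H10 |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 0 | 1 | 2 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 |
| 1 | 2 | 2 |    | 1 | 1 | 1 | 0 | 1 | 0 | 1 |
| 2 | 1 |    | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 2 | 3 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 4 |    | 2 | 5 | 1 | 0 | 0 | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5995 |    |    | 6 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 5996 | 2 |    |    | 1 | 1 | 1 | 0 | 1 | 0 | 0 |
| 5997 |    | 4 | 6 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| 5998 | 2 | 4 | 8 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |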


## 4. Creating the sensitive dataset

### 4.1. Creation from constructor

The library uses an internal representation of the data to optimize execution time complexity.

If the data is already in the required raw format, you can call the constructor directly:

In [3]:
sensitive_raw_data = [
    # headers
    ['A', 'B', 'C', 'D'],
    # records
    ['a1', 'b1', 'c1', '0'],
    ['a1', 'b1', 'c2', '0'],
    ['a2', '', 'c2', '1'],
]
sensitive_dataset = AggregateSeededDataset(sensitive_raw_data)
sensitive_dataset.to_data_frame()

|    | A | B | C | D |
| -- | -- | -- | -- | -- |
| 0 | a1 | b1 | c1 |    |
| 1 | a1 | b1 | c2 |    |
| 2 | a2 |    | c2 | 1.0 |
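
As a rough illustration of the raw format, the sketch below builds the same list of lists from CSV text with Python's standard `csv` module. The helper name `csv_to_raw_data` is hypothetical and not part of the library; the only assumption about the format is what the cell above shows: one list of headers followed by one list of string values per record.

```python
import csv
import io


def csv_to_raw_data(csv_text):
    """Parse CSV text into [headers, record, record, ...] with string values."""
    return [row for row in csv.reader(io.StringIO(csv_text))]


# Same content as `sensitive_raw_data` in the cell above, written as CSV.
csv_text = 'A,B,C,D\na1,b1,c1,0\na1,b1,c2,0\na2,,c2,1\n'
raw_data = csv_to_raw_data(csv_text)
print(raw_data)
```

The resulting list has the same shape as `sensitive_raw_data`, so it could be passed to the `AggregateSeededDataset` constructor in the same way.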


### 4.2. Negative value interpretation

The library distinguishes 'positive' attribute values that indicate the presence of specific sensitive data from 'negative' attribute values that indicate the absence of such data. By default, the integer zero (`0`) and the empty string (`""`) are not taken into account when creating and counting attribute combinations. Any columns where zero values are of interest (and thus sensitive) should be listed as `sensitive_zeros`, so they will be treated the same way as positive values.

> For more parameters see the library documentation - `help('pacsynth.Dataset')`.

In [4]:
sensitive_raw_data = [
    # headers
    ['A', 'B', 'C', 'D'],
    # records
    ['a1', 'b1', 'c1', '0'],
    ['a1', 'b1', 'c2', '0'],
    ['a2', '', 'c2', '1'],
]
sensitive_dataset = AggregateSeededDataset(sensitive_raw_data, sensitive_zeros=['D'])
sensitive_dataset.to_data_frame()

|    | A | B | C | D |
| -- | -- | -- | -- | -- |
| 0 | a1 | b1 | c1 | 0 |
| 1 | a1 | b1 | c2 | 0 |
| 2 | a2 |    | c2 | 1 |
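
The effect of `sensitive_zeros` on which values count as attributes can be mimicked with a few lines of plain Python. This is a sketch of the rule described above, not the library's code: the function name `positive_attributes` is hypothetical, and it assumes zeros arrive as the string `'0'`, as in the raw data of this example.

```python
def positive_attributes(headers, record, sensitive_zeros=()):
    """Return the 'column:value' attributes that would be counted for a record."""
    attrs = []
    for header, value in zip(headers, record):
        if value == '':
            continue  # empty strings are skipped
        if value == '0' and header not in sensitive_zeros:
            continue  # zeros are skipped unless the column is listed as sensitive
        attrs.append(f'{header}:{value}')
    return attrs


headers = ['A', 'B', 'C', 'D']
record = ['a1', 'b1', 'c1', '0']

default_attrs = positive_attributes(headers, record)
with_sensitive_zeros = positive_attributes(headers, record, sensitive_zeros=['D'])
print(default_attrs)
print(with_sensitive_zeros)
```

With the default settings the zero in column `D` is dropped, which matches the empty `D` cells in the output of section 4.1; listing `D` as a sensitive zero column keeps `D:0`.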


### 4.3. Creating sensitive dataset from a pandas data frame
For convenience, a method is provided to build a dataset from a pandas data frame:

In [5]:
sensitive_dataset = AggregateSeededDataset.from_data_frame(sensitive_df)
sensitive_dataset.to_data_frame()

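|    | H1 | H2 | H3 | H4 | H5 | H6 | H7 | H8 | H9 | H10 |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 0 | 1 | 2 | 1 | 1 |    |    | 1 |    | 1 | 1 |
| 1 | 2 | 2 |    | 1 | 1 | 1 |    | 1 |    | 1 |
| 2 | 1 |    | 1 |    |    | 1 |    |    |    | 1 |
| 3 | 2 | 3 | 2 |    |    |    | 1 |    |    | 1 |
| 4 |    | 2 | 5 | 1 |    |    | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5995 |    |    | 6 |    | 1 |    |    |    | 1 | 1 |
| 5996 | 2 |    |    | 1 | 1 | 1 |    | 1 |    |    |
| 5997 |    | 4 | 6 |    |    |    | 1 | 1 | 1 |    |
| 5998 | 2 | 4 | 8 |    | 1 |    |    | 1 | 1 |    |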


## 5. Generating the synthetic data

### 5.1. Defining synthesizer parameters

To create the synthesizer with its default parameters:

In [6]:
synth = AggregateSeededSynthesizer()
print(synth.parameters)

{
  "reporting_length": 3,
  "epsilon": 4.0,
  "delta": null,
  "percentile_percentage": 99,
  "percentile_epsilon_proportion": 0.01,
  "sigma_proportions": [
    1.0,
    0.5,
    0.3333333333333333
  ],
  "number_of_records_epsilon_proportion": 0.005,
  "threshold": {
    "type": "Adaptive",
    "valuesByLen": {
      "3": 1.0,
      "2": 1.0
    }
  },
  "empty_value": "",
  "use_synthetic_counts": false,
  "weight_selection_percentile": 95,
  "aggregate_counts_scale_factor": null
}


However, the defaults might not produce the best output for your dataset and downstream analysis tasks, so you can tune the synthesizer by setting its parameters explicitly:

In [7]:
# this explicitly outlines the default parameters
synth = AggregateSeededSynthesizer(
    reporting_length = 3,
    epsilon = 4.0,
    delta = 1.0 / (math.log(len(sensitive_df)) * len(sensitive_df)),
    percentile_percentage = 99,
    percentile_epsilon_proportion = 0.01,
    accuracy_mode = AccuracyMode.prioritize_long_combinations(),
    number_of_records_epsilon_proportion = 0.005,
    fabrication_mode = FabricationMode.uncontrolled(),
    empty_value = "",
    weight_selection_percentile = 95,
    use_synthetic_counts = False,
    aggregate_counts_scale_factor = 1.0,
)
print(synth.parameters)


{
  "reporting_length": 3,
  "epsilon": 4.0,
  "delta": 0.000019158156689251674,
  "percentile_percentage": 99,
  "percentile_epsilon_proportion": 0.01,
  "sigma_proportions": [
    1.0,
    0.5,
    0.3333333333333333
  ],
  "number_of_records_epsilon_proportion": 0.005,
  "threshold": {
    "type": "Adaptive",
    "valuesByLen": {
      "3": 1.0,
      "2": 1.0
    }
  },
  "empty_value": "",
  "use_synthetic_counts": false,
  "weight_selection_percentile": 95,
  "aggregate_counts_scale_factor": 1.0
}
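
The `delta` value printed above can be reproduced by hand. The sketch below assumes `len(sensitive_df)` is 6000, the number of records generated earlier. The second part is only an observation about the printed `sigma_proportions` for this configuration; whether the same pattern holds for other reporting lengths or accuracy modes is not confirmed here.

```python
import math

# Assumption: the data frame holds the 6000 generated records.
number_of_records = 6000

# Same expression as the `delta` argument passed to the synthesizer above.
delta = 1.0 / (math.log(number_of_records) * number_of_records)
print(delta)

# Observation only: the printed sigma_proportions equal 1/k
# for k = 1..reporting_length when reporting_length is 3.
reporting_length = 3
sigma_proportions = [1.0 / k for k in range(1, reporting_length + 1)]
print(sigma_proportions)
```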


To continue with this example, let's set the parameters we care about for now:

In [8]:
reporting_length = 4

synth = AggregateSeededSynthesizer(
    reporting_length = reporting_length,
    epsilon = 4.0,
    accuracy_mode = AccuracyMode.prioritize_long_combinations(),
    fabrication_mode = FabricationMode.uncontrolled(),
    use_synthetic_counts = True,
)

### 5.2. Building the model and synthesizing data

In [9]:
synth.fit(sensitive_dataset)

# DP-protected estimate of the number of records;
# we can choose whether or not to use it as the number of samples
protected_number_of_records = synth.get_dp_number_of_records()

print('Number of records protected with DP:', protected_number_of_records)

# if the desired number of samples is not specified, the synthesizer will
# use all the available attributes based on the 1-counts to synthesize records;
# here we explicitly request the protected number of records
synthetic_raw_data = synth.sample(protected_number_of_records)
synthetic_dataset = AggregateSeededDataset(synthetic_raw_data)

# as an example, let's create a pandas data frame from the raw synthetic data
synthetic_df = AggregateSeededDataset.raw_data_to_data_frame(synthetic_raw_data)
synthetic_df

Number of records protected with DP: 6007


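|    | H1 | H2 | H3 | H4 | H5 | H6 | H7 | H8 | H9 | H10 |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 0 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6002 |    | 4 | 3 |    |    |    |    |    | 1 |    |
| 6003 |    | 4 | 3 |    | 1 |    |    |    |    |    |
| 6004 |    | 5 | 1 |    |    |    |    |    | 1 |    |
| 6005 |    | 6 | 1 | 1 |    |    |    |    |    |    |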


## 6. Generating/exporting aggregate data

In [10]:
# generate sensitive aggregates
sensitive_aggregates = synth.get_sensitive_aggregates(';')

# export the differentially private aggregates (internal to the synthesizer)
dp_aggregates = synth.get_dp_aggregates(';')

# generate aggregates from the synthetic data
synthetic_aggregates = synthetic_dataset.get_aggregates(reporting_length, ';')

# let's take a look at the DP aggregates
list(dp_aggregates.items())[:20]

[('H2:6;H3:10;H5:1;H7:1', 52),
 ('H2:5;H3:9;H6:1;H8:1', 57),
 ('H1:1;H3:7;H6:1', 73),
 ('H1:1;H2:1;H3:1;H4:1', 6),
 ('H10:1;H2:1;H3:3;H4:1', 32),
 ('H2:4;H3:8;H5:1;H9:1', 30),
 ('H10:1;H3:5;H4:1;H6:1', 46),
 ('H10:1;H1:1;H3:7;H8:1', 38),
 ('H2:4;H3:10;H4:1;H9:1', 32),
 ('H3:1;H4:1;H7:1;H8:1', 77),
 ('H1:1;H2:2;H7:1;H8:1', 50),
 ('H2:4;H3:7;H4:1;H6:1', 15),
 ('H10:1;H2:6;H3:8;H8:1', 39),
 ('H10:1;H1:1;H6:1;H8:1', 250),
 ('H1:1;H7:1;H8:1', 527),
 ('H1:1;H3:7;H5:1;H9:1', 36),
 ('H3:1;H6:1;H7:1;H9:1', 48),
 ('H10:1;H1:1;H3:7;H6:1', 42),
 ('H2:3;H3:2;H4:1;H9:1', 39),
 ('H3:7;H6:1;H7:1;H8:1', 47)]
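
Each key in the exported dictionary joins `column:value` attributes with the chosen delimiter (`;`), so aggregates of different lengths are mixed together. A minimal sketch of grouping them by combination length, using a few entries copied from the output above (the variable names are illustrative only):

```python
from collections import defaultdict

# A few entries copied from the output above.
dp_aggregates_sample = {
    'H2:6;H3:10;H5:1;H7:1': 52,
    'H1:1;H3:7;H6:1': 73,
    'H10:1;H1:1;H6:1;H8:1': 250,
    'H1:1;H7:1;H8:1': 527,
}

by_length = defaultdict(dict)
for key, count in dp_aggregates_sample.items():
    # e.g. 'H1:1;H3:7;H6:1' becomes ('H1:1', 'H3:7', 'H6:1')
    combination = tuple(key.split(';'))
    by_length[len(combination)][combination] = count

for length in sorted(by_length):
    print(length, len(by_length[length]), 'combinations')
```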

## 7. Evaluating DP aggregates and DP synthetic data

This section is an example evaluation of both the DP aggregates and the DP synthetic data, as well as of the influence of some synthesizer parameters on them.

### 7.1. Evaluating current results

> `ErrorReport` is just an example way to evaluate results (the code for it is in [`utils.py`](./utils.py)).

- **Count**: mean of the aggregate counts for the given length
- **Error**: mean of `abs(sensitive_count - dp_aggregated_count)` or `abs(sensitive_count - synthetic_count)`
- **Suppressed %**: percentage of combinations present in the sensitive dataset, but not present in the aggregated/synthetic data
- **Fabricated %**: percentage of combinations that were reported in the aggregated/synthetic data, but do not exist in the sensitive dataset
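
To make these definitions concrete, here is a minimal sketch of how the four metrics could be computed for a single combination length. It is not the `ErrorReport` code from `utils.py`: the function `length_report` is hypothetical, and the choices of denominators and of which combinations enter the error are assumptions, labeled in the comments.

```python
def length_report(sensitive, reported, length):
    """Summarize one combination length; both inputs map tuple keys to counts.

    Assumes both inputs contain at least one shared combination of that length.
    """
    sens = {k: v for k, v in sensitive.items() if len(k) == length}
    rep = {k: v for k, v in reported.items() if len(k) == length}
    shared = sens.keys() & rep.keys()

    # Mean of the sensitive counts for this length.
    mean_count = sum(sens.values()) / len(sens)
    # Assumption: the error is averaged over combinations present in both inputs.
    mean_error = sum(abs(sens[k] - rep[k]) for k in shared) / len(shared)
    # Assumption: suppression is relative to the sensitive combinations,
    # fabrication relative to the reported combinations.
    suppressed_pct = 100.0 * len(sens.keys() - rep.keys()) / len(sens)
    fabricated_pct = 100.0 * len(rep.keys() - sens.keys()) / len(rep)
    return mean_count, mean_error, suppressed_pct, fabricated_pct


# Toy inputs: 'B:b1' is suppressed and 'C:c9' is fabricated.
sensitive = {('A:a1',): 3, ('A:a2',): 2, ('B:b1',): 1, ('B:b2',): 3}
reported = {('A:a1',): 4, ('A:a2',): 1, ('B:b2',): 3, ('C:c9',): 2}

mean_count, mean_error, suppressed_pct, fabricated_pct = length_report(
    sensitive, reported, length=1
)
print(mean_count, mean_error, suppressed_pct, fabricated_pct)
```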

In [11]:
sensitive_aggregates_parsed = {
    tuple(agg.split(';')): count for (agg, count) in sensitive_aggregates.items()
}
dp_aggregates_parsed = {
    tuple(agg.split(';')): count for (agg, count) in dp_aggregates.items()
}
synthetic_aggregates_parsed = {
    tuple(agg.split(';')): count for (agg, count) in synthetic_aggregates.items()
}

**Sensitive Data vs. DP Aggregates**

In [12]:
ErrorReport(sensitive_aggregates_parsed, dp_aggregates_parsed).gen()

|    | Length | Count +/- Error | Suppressed % | Fabricated % |
| -- | -- | -- | -- | -- |
| 0 | 1 | 1379.40 +/- 12.60 | 0.00 % | 0.00 % |
| 1 | 2 | 424.05 +/- 12.46 | 0.00 % | 7.11 % |
| 2 | 3 | 147.83 +/- 12.35 | 0.00 % | 6.78 % |
| 3 | 4 | 55.39 +/- 11.36 | 3.14 % | 2.41 % |
| 4 | All | 110.38 +/- 11.70 | 2.13 % | 3.87 % |


**Sensitive Data vs. DP Synthetic Data**

In [13]:
ErrorReport(sensitive_aggregates_parsed, synthetic_aggregates_parsed).gen()

|    | Length | Count +/- Error | Suppressed % | Fabricated % |
| -- | -- | -- | -- | -- |
| 0 | 1 | 1379.40 +/- 77.44 | 0.00 % | 0.00 % |
| 1 | 2 | 424.05 +/- 56.91 | 0.00 % | 7.11 % |
| 2 | 3 | 147.83 +/- 30.71 | 0.00 % | 5.72 % |
| 3 | 4 | 55.39 +/- 13.56 | 3.14 % | 1.83 % |
| 4 | All | 110.38 +/- 21.15 | 2.13 % | 3.20 % |


### 7.2. Targeting less fabrication

Let's set the fabrication mode to `FabricationMode.minimize()` and keep the other parameters unchanged:

In [14]:
synth = AggregateSeededSynthesizer(
    reporting_length = reporting_length,
    epsilon = 4.0,
    accuracy_mode = AccuracyMode.prioritize_long_combinations(),
    fabrication_mode = FabricationMode.minimize(),
    use_synthetic_counts = True,
)

synth.fit(sensitive_dataset)

synthetic_raw_data = synth.sample(protected_number_of_records)
synthetic_dataset = AggregateSeededDataset(synthetic_raw_data)

synthetic_df = AggregateSeededDataset.raw_data_to_data_frame(synthetic_raw_data)
synthetic_df

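|    | H1 | H2 | H3 | H4 | H5 | H6 | H7 | H8 | H9 | H10 |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 0 | 1 | 1 |    | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 |    | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 1 | 1 |    | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | 1 | 1 |    | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 | 1 | 1 |    | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6002 | 1 | 1 | 4 |    |    |    |    |    |    |    |
| 6003 | 1 | 1 | 4 |    |    |    |    |    |    |    |
| 6004 | 1 | 1 | 4 |    |    |    |    |    |    |    |
| 6005 | 1 | 1 | 4 |    |    |    |    |    |    |    |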


Evaluating again:

In [15]:
sensitive_aggregates = sensitive_dataset.get_aggregates(reporting_length, ';')
dp_aggregates = synth.get_dp_aggregates(';')
synthetic_aggregates = synthetic_dataset.get_aggregates(reporting_length, ';')

sensitive_aggregates_parsed = {
    tuple(agg.split(';')): count for (agg, count) in sensitive_aggregates.items()
}
dp_aggregates_parsed = {
    tuple(agg.split(';')): count for (agg, count) in dp_aggregates.items()
}
synthetic_aggregates_parsed = {
    tuple(agg.split(';')): count for (agg, count) in synthetic_aggregates.items()
}

**Sensitive Data vs. DP Aggregates**

In [16]:
ErrorReport(sensitive_aggregates_parsed, dp_aggregates_parsed).gen()

|    | Length | Count +/- Error | Suppressed % | Fabricated % |
| -- | -- | -- | -- | -- |
| 0 | 1 | 1379.40 +/- 13.24 | 0.00 % | 0.00 % |
| 1 | 2 | 424.05 +/- 13.30 | 0.00 % | 0.00 % |
| 2 | 3 | 147.83 +/- 14.32 | 9.81 % | 0.00 % |
| 3 | 4 | 55.39 +/- 10.32 | 37.87 % | 0.00 % |
| 4 | All | 110.38 +/- 11.88 | 28.15 % | 0.00 % |


**Sensitive Data vs. DP Synthetic Data**

In [17]:
ErrorReport(sensitive_aggregates_parsed, synthetic_aggregates_parsed).gen()

|    | Length | Count +/- Error | Suppressed % | Fabricated % |
| -- | -- | -- | -- | -- |
| 0 | 1 | 1379.40 +/- 149.60 | 0.00 % | 0.00 % |
| 1 | 2 | 424.05 +/- 105.16 | 0.00 % | 0.00 % |
| 2 | 3 | 147.83 +/- 58.27 | 9.81 % | 0.00 % |
| 3 | 4 | 55.39 +/- 27.56 | 37.87 % | 0.00 % |
| 4 | All | 110.38 +/- 45.07 | 28.15 % | 0.00 % |


### 7.3. Prioritize short combinations

Let's switch the accuracy mode to `prioritize_short_combinations`. Note that this cell also sets the fabrication mode back to `uncontrolled` and sets `use_synthetic_counts = False`:

In [18]:
synth = AggregateSeededSynthesizer(
    reporting_length = reporting_length,
    epsilon = 4.0,
    accuracy_mode = AccuracyMode.prioritize_short_combinations(),
    fabrication_mode = FabricationMode.uncontrolled(),
    use_synthetic_counts = False,
)

synth.fit(sensitive_dataset)

synthetic_raw_data = synth.sample(protected_number_of_records)
synthetic_dataset = AggregateSeededDataset(synthetic_raw_data)

synthetic_df = AggregateSeededDataset.raw_data_to_data_frame(synthetic_raw_data)
synthetic_df

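|    | H1 | H2 | H3 | H4 | H5 | H6 | H7 | H8 | H9 | H10 |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 0 |    | 1 |    | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 |    | 1 |    | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 |    | 1 |    | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 |    | 1 |    | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 |    | 1 |    | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6002 |    | 6 | 9 |    |    |    |    |    |    |    |
| 6003 |    | 6 | 9 |    |    |    |    |    |    |    |
| 6004 |    | 6 | 9 |    |    |    |    |    |    |    |
| 6005 |    | 6 | 9 |    |    |    |    |    |    |    |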


Evaluating again:

In [19]:
sensitive_aggregates = sensitive_dataset.get_aggregates(reporting_length, ';')
dp_aggregates = synth.get_dp_aggregates(';')
synthetic_aggregates = synthetic_dataset.get_aggregates(reporting_length, ';')

sensitive_aggregates_parsed = {
    tuple(agg.split(';')): count for (agg, count) in sensitive_aggregates.items()
}
dp_aggregates_parsed = {
    tuple(agg.split(';')): count for (agg, count) in dp_aggregates.items()
}
synthetic_aggregates_parsed = {
    tuple(agg.split(';')): count for (agg, count) in synthetic_aggregates.items()
}

**Sensitive Data vs. DP Aggregates**

In [20]:
ErrorReport(sensitive_aggregates_parsed, dp_aggregates_parsed).gen()

|    | Length | Count +/- Error | Suppressed % | Fabricated % |
| -- | -- | -- | -- | -- |
| 0 | 1 | 1379.40 +/- 2.96 | 0.00 % | 0.00 % |
| 1 | 2 | 424.05 +/- 9.13 | 0.00 % | 9.52 % |
| 2 | 3 | 147.83 +/- 22.14 | 0.88 % | 9.83 % |
| 3 | 4 | 55.39 +/- 27.72 | 29.58 % | 4.49 % |
| 4 | All | 110.38 +/- 24.34 | 20.24 % | 6.60 % |


**Sensitive Data vs. DP Synthetic Data**

In [21]:
ErrorReport(sensitive_aggregates_parsed, synthetic_aggregates_parsed).gen()

|    | Length | Count +/- Error | Suppressed % | Fabricated % |
| -- | -- | -- | -- | -- |
| 0 | 1 | 1379.40 +/- 68.40 | 0.00 % | 0.00 % |
| 1 | 2 | 424.05 +/- 120.91 | 0.00 % | 8.73 % |
| 2 | 3 | 147.83 +/- 76.83 | 0.99 % | 7.23 % |
| 3 | 4 | 55.39 +/- 48.68 | 30.96 % | 2.89 % |
| 4 | All | 110.38 +/- 63.38 | 21.20 % | 4.76 % |
