# DP Aggregate Seeded Synthesizer
The DP Aggregate Seeded synthesizer is a differentially private synthesizer that relies on DP Marginals to build synthetic data. It computes DP Marginals (called aggregates) for your dataset using a specified `reporting length`, and synthesizes data based on the computed aggregate counts.

> The `reporting length` is the maximum combination length that aggregates are going to be computed for.

## 1. Overview

### 1.1. Aggregated data generation with DP
Let's consider the following input as an example:

| A | B | C |
| -- | -- | -- |
| a1 | b1 | c1 |
| a1 | b2 | c1 |
| a2 |    | c2 |
| a2 | b2 | c1 |
| a1 | b2 |    |

The input data is assumed to be categorical and the domain will be inferred from the input dataset:

- `A` possible values are - `a1,a2`
- `B` possible values are - `b1,b2`
- `C` possible values are - `c1,c2`
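
As a rough illustration of what inferring the domain means here, the sketch below collects the distinct non-empty values per column of the example table. The helper `infer_domain` is hypothetical and not part of the library, and treating the empty string as a missing value is an assumption that mirrors the example.

```python
# Example table from above; "" marks a missing value (assumption of this sketch).
headers = ["A", "B", "C"]
records = [
    ["a1", "b1", "c1"],
    ["a1", "b2", "c1"],
    ["a2", "", "c2"],
    ["a2", "b2", "c1"],
    ["a1", "b2", ""],
]


def infer_domain(headers, records):
    """Collect the distinct non-empty values observed in each column."""
    domain = {header: set() for header in headers}
    for record in records:
        for header, value in zip(headers, record):
            if value != "":
                domain[header].add(value)
    return {header: sorted(values) for header, values in domain.items()}


print(infer_domain(headers, records))
# {'A': ['a1', 'a2'], 'B': ['b1', 'b2'], 'C': ['c1', 'c2']}
```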

For a `reporting length=2`, the aggregates in the dataset above could be:

- 1-counts
    - `A:a1`: 3 + NOISE
    - `A:a2`: 2 + NOISE
    - `B:b1`: 1 + NOISE
    - `B:b2`: 3 + NOISE
    - `C:c1`: 3 + NOISE
    - `C:c2`: 1 + NOISE

- 2-counts:
    - `A:a1;C:c1`: 2 + NOISE
    - `A:a2;B:b2`: 1 + NOISE
    - `B:b1;C:c1`: 1 + NOISE
    - `A:a1;B:b1`: 1 + NOISE
    - `A:a1;B:b2`: 2 + NOISE
    - `B:b2;C:c1`: 2 + NOISE
    - `A:a2;C:c2`: 1 + NOISE
    - `B:b2;C:c2`: 0 + NOISE
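
The noise-free part of the counts listed above can be reproduced with a few lines of plain Python. This is only a sketch for the toy table: `count_combinations` is a hypothetical helper, not a library function, and it ignores empty values as the example does.

```python
from collections import Counter
from itertools import combinations

headers = ["A", "B", "C"]
records = [
    ["a1", "b1", "c1"],
    ["a1", "b2", "c1"],
    ["a2", "", "c2"],
    ["a2", "b2", "c1"],
    ["a1", "b2", ""],
]


def count_combinations(headers, records, reporting_length):
    """Count every attribute combination of length 1..reporting_length."""
    counts = Counter()
    for record in records:
        # keep only the attributes that are present in this record
        attributes = [f"{h}:{v}" for h, v in zip(headers, record) if v != ""]
        for k in range(1, reporting_length + 1):
            for combo in combinations(attributes, k):
                counts[";".join(combo)] += 1
    return counts


counts = count_combinations(headers, records, reporting_length=2)
print(counts["A:a1"], counts["A:a1;B:b2"], counts["A:a2;C:c1"], counts["B:b2;C:c2"])
# 3 2 1 0
```

A `Counter` returns 0 for combinations that never occur, which matches the true count of `B:b2;C:c2` in the sensitive data.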

Some spurious combinations might also be created and reported in order to preserve the differential privacy guarantees: notice that `B:b2;C:c2` does not exist in the sensitive dataset, but it has been _fabricated_ and added to the output.

Along the same lines, some attribute combinations might be suppressed: even though `A:a2;C:c1` exists in the sensitive dataset, it has not been reported as an aggregate.
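
To build some intuition for why fabrication and suppression can both happen, here is a toy sketch that adds Gaussian noise to a few candidate counts and keeps only the noisy counts that reach a threshold. It is not the algorithm used by the library: `noisy_release`, `sigma`, `threshold` and the candidate list are all assumptions made for illustration.

```python
import random


def noisy_release(true_counts, candidates, sigma, threshold, seed=0):
    """Toy release: add Gaussian noise to every candidate count and keep
    only the noisy counts that reach the threshold."""
    rng = random.Random(seed)
    released = {}
    for combo in candidates:
        noisy = true_counts.get(combo, 0) + rng.gauss(0.0, sigma)
        if noisy >= threshold:
            released[combo] = round(noisy)
    return released


true_counts = {"A:a2;C:c1": 1, "A:a1;B:b2": 2}
# the last candidate does not exist in the sensitive data (true count 0)
candidates = ["A:a2;C:c1", "A:a1;B:b2", "B:b2;C:c2"]

released = noisy_release(true_counts, candidates, sigma=1.0, threshold=1.0)
# existing combinations whose noisy count fell below the threshold
suppressed = [c for c in true_counts if c not in released]
# released combinations that never occur in the sensitive data
fabricated = [c for c in released if true_counts.get(c, 0) == 0]
print(released, suppressed, fabricated)
```

With noise in play, a combination with a small true count can fall below the threshold (suppression), while a combination with a true count of zero can rise above it (fabrication).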

### 1.2. Synthesis
The synthetic data is then generated directly from the aggregates computed with differential privacy. Because synthesis only post-processes those DP aggregates, the synthetic data retains the same DP guarantees.

## 2. Imports and global config

In [1]:
import pandas as pd

from sdssynth import init_logger, set_number_of_threads
from sdssynth import Dataset
from sdssynth import DpAggregateSeededParametersBuilder, AccuracyMode, FabricationMode
from sdssynth import DpAggregatedSeededSynthesizer

from utils import gen_dataset, ErrorReport

# The library allows the desired log level to be set if wanted
# ('off' || 'error' || 'warn' || 'info' || 'debug' || 'trace')
# init_logger('trace')

# some algorithms have parallel implementations, so the desired number of threads can be set
# (the default is one thread per CPU core)
# set_number_of_threads(2)

## 3. Generating an example data frame with random data

> `gen_dataset` is just a utility to generate some example data (the code for it is in [`utils.py`](./utils.py))

To demonstrate the library, let's create an example data frame that will be used later on.

In [2]:
number_of_records_to_generate = 6000

sensitive_df = gen_dataset(number_of_records_to_generate)

sensitive_df

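|  | H1 | H2 | H3 | H4 | H5 | H6 | H7 | H8 | H9 | H10 |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 0 | 2 |  |  | 1 | 0 | 1 | 1 | 1 | 0 | 0 |
| 1 |  | 2 | 5 | 1 | 0 | 0 | 1 | 1 | 1 | 0 |
| 2 | 1 | 3 | 5 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 3 | 2 | 3 | 3 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| 4 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5995 | 2 | 4 | 8 | 0 | 1 | 1 | 1 | 1 | 0 | 1 |
| 5996 | 2 | 6 | 6 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5997 |  | 6 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5998 | 2 |  | 6 | 1 | 1 | 1 | 0 | 1 | 0 | 0 |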


## 4. Creating the sensitive dataset

### 4.1. Creation from constructor

The library uses an internal representation of the data to optimize the execution time complexity.

To create a dataset, you can call the constructor directly if the data is in the raw format:

In [3]:
sensitive_raw_data = [
    # headers
    ['A', 'B', 'C', 'D'],
    # records
    ['a1', 'b1', 'c1', '0'],
    ['a1', 'b1', 'c2', '0'],
    ['a2', '', 'c2', '1'],
]
sensitive_dataset = Dataset(sensitive_raw_data)

### 4.2. Negative value interpretation

The library distinguishes 'positive' attribute values that indicate the presence of specific sensitive data from 'negative' attribute values that indicate the absence of such data. By default, the integer zero (`0`) and the empty string (`""`) are not taken into account when creating and counting attribute combinations. Any columns where zero values are of interest (and thus sensitive) should be listed as `sensitive_zeros`, so they will be treated the same way as positive values.

> For more parameters see the library documentation.

In [4]:
sensitive_raw_data = [
    # headers
    ['A', 'B', 'C', 'D'],
    # records
    ['a1', 'b1', 'c1', '0'],
    ['a1', 'b1', 'c2', '0'],
    ['a2', '', 'c2', '1'],
]
sensitive_dataset = Dataset(sensitive_raw_data, sensitive_zeros=['D'])
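
The effect of `sensitive_zeros` on counting can be pictured with the plain-Python sketch below. It is only an illustration of the rule described above, not the library's internal logic: `count_single_attributes` is a hypothetical helper, and it assumes that empty strings are always skipped while `'0'` is skipped only for columns not listed as sensitive.

```python
from collections import Counter


def count_single_attributes(raw_data, sensitive_zeros=()):
    """Count `column:value` pairs, skipping empty strings and, unless the
    column is listed in `sensitive_zeros`, skipping '0' as well."""
    headers, *records = raw_data
    counts = Counter()
    for record in records:
        for header, value in zip(headers, record):
            if value == "":
                continue
            if value == "0" and header not in sensitive_zeros:
                continue
            counts[f"{header}:{value}"] += 1
    return counts


raw = [
    ["A", "B", "C", "D"],
    ["a1", "b1", "c1", "0"],
    ["a1", "b1", "c2", "0"],
    ["a2", "", "c2", "1"],
]

print(count_single_attributes(raw)["D:0"])                         # 0
print(count_single_attributes(raw, sensitive_zeros=["D"])["D:0"])  # 2
```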

### 4.3. Creating sensitive dataset from a pandas data frame
For convenience, a method is provided to build a dataset from a pandas data frame:

In [5]:
sensitive_dataset = Dataset.from_data_frame(sensitive_df)

## 5. Generating the synthetic data

### 5.1. Defining synthesizer parameters

If you just want to create the synthesizer with default parameters:

In [6]:
synth = DpAggregatedSeededSynthesizer()
print(synth.parameters)

{
  "reporting_length": 3,
  "epsilon": 0.1,
  "delta": null,
  "percentile_percentage": 99,
  "percentile_epsilon_proportion": 0.01,
  "sigma_proportions": [
    1.0,
    0.5,
    0.3333333333333333
  ],
  "number_of_records_epsilon": 0.1,
  "threshold": {
    "type": "Adaptive",
    "valuesByLen": {
      "3": 1.0,
      "2": 1.0
    }
  },
  "empty_value": "",
  "use_synthetic_counts": false,
  "weight_selection_percentile": 95,
  "aggregate_counts_scale_factor": null
}


However, the defaults might not produce the best output for your dataset.

To tune the synthesizer parameters, we provide a builder (`DpAggregateSeededParametersBuilder`). Only the parameters you set explicitly are updated; the rest keep their default values:

In [7]:
# this explicitly sets every parameter through the builder; note that `epsilon`
# and `delta` below differ from the defaults printed above
# (`build()` returns the parameters object, not a synthesizer)
params = DpAggregateSeededParametersBuilder() \
        .reporting_length(3) \
        .epsilon(6.0) \
        .delta(1 / (2.0 * len(sensitive_df))) \
        .percentile_percentage(99) \
        .percentile_epsilon_proportion(0.01) \
        .accuracy_mode(AccuracyMode.prioritize_large_counts()) \
        .number_of_records_epsilon(0.1) \
        .fabrication_mode(FabricationMode.uncontrolled()) \
        .empty_value("") \
        .weight_selection_percentile(95) \
        .use_synthetic_counts(False) \
        .aggregate_counts_scale_factor(1.0) \
        .build()
print(params)


{
  "reporting_length": 3,
  "epsilon": 6.0,
  "delta": 0.00008333333333333333,
  "percentile_percentage": 99,
  "percentile_epsilon_proportion": 0.01,
  "sigma_proportions": [
    1.0,
    0.5,
    0.3333333333333333
  ],
  "number_of_records_epsilon": 0.1,
  "threshold": {
    "type": "Adaptive",
    "valuesByLen": {
      "3": 1.0,
      "2": 1.0
    }
  },
  "empty_value": "",
  "use_synthetic_counts": false,
  "weight_selection_percentile": 95,
  "aggregate_counts_scale_factor": 1.0
}


To continue with this example, let's set the parameters we care about for now:

In [8]:
reporting_length = 4

builder = DpAggregateSeededParametersBuilder() \
    .reporting_length(reporting_length) \
    .epsilon(0.2) \
    .accuracy_mode(AccuracyMode.prioritize_large_counts()) \
    .fabrication_mode(FabricationMode.uncontrolled()) \
    .use_synthetic_counts(True)

synth = DpAggregatedSeededSynthesizer(builder.build())

### 5.2. Building the model and synthesizing data

In [9]:
synth.fit(sensitive_dataset)

# if the desired number of samples is not specified, the synthesizer will
# use all the available attributes, based on the 1-counts, to synthesize records;
# here we explicitly request as many records as the sensitive data frame has
synthetic_raw_data = synth.sample(len(sensitive_df))
synthetic_dataset = Dataset(synthetic_raw_data)

# as an example, let's create a pandas data frame from the raw synthetic data
synthetic_df = Dataset.raw_data_to_data_frame(synthetic_raw_data)
synthetic_df

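|  | H1 | H2 | H3 | H4 | H5 | H6 | H7 | H8 | H9 | H10 |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 0 | 1 |  |  | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 |  |  | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 1 |  |  | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | 1 |  |  | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 | 1 |  |  | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5995 | 2 | 1 | 7 |  |  |  |  |  |  |  |
| 5996 | 2 | 1 | 7 |  |  |  |  |  |  |  |
| 5997 | 2 | 5 | 4 |  |  |  |  |  |  |  |
| 5998 | 2 | 5 | 4 |  |  |  |  |  |  |  |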


## 6. Generating/exporting aggregated data

In [10]:
# generate sensitive aggregates
sensitive_aggregates = sensitive_dataset.get_aggregates(reporting_length, ';')

# export the differentially private aggregates (internal to the synthesizer)
dp_aggregates = synth.get_dp_aggregates(';')

# generate aggregates from the synthetic data
synthetic_aggregates = synthetic_dataset.get_aggregates(reporting_length, ';')

# let's take a look at the DP aggregates
dp_aggregates

{'H1:1;H3:6;H4:1;H8:1': 49,
 'H1:1;H3:3;H6:1;H8:1': 15,
 'H1:1;H3:7;H6:1': 200,
 'H2:6;H3:9;H8:1': 50,
 'H10:1;H3:6;H8:1': 180,
 'H1:2;H2:5;H4:1': 143,
 'H10:1;H2:1;H3:8': 48,
 'H10:1;H1:1;H3:7;H8:1': 69,
 'H10:1;H1:2;H5:1;H6:1': 227,
 'H10:1;H3:7;H5:1': 130,
 'H3:1;H4:1;H7:1;H8:1': 129,
 'H1:1;H2:2;H7:1;H8:1': 46,
 'H10:1;H1:1;H6:1;H8:1': 149,
 'H1:1;H3:8;H4:1': 99,
 'H1:1;H7:1;H8:1': 603,
 'H1:2;H5:1;H6:1;H9:1': 239,
 'H1:1;H3:7;H5:1;H9:1': 4,
 'H10:1;H2:6;H3:8;H8:1': 69,
 'H2:5;H3:2;H4:1;H7:1': 68,
 'H10:1;H1:1;H3:7;H6:1': 5,
 'H2:3;H4:1;H6:1;H8:1': 127,
 'H3:6;H4:1;H9:1': 127,
 'H10:1;H1:1;H4:1;H5:1': 256,
 'H10:1;H3:6;H4:1;H9:1': 2,
 'H2:3;H3:2;H4:1;H9:1': 5,
 'H3:7;H6:1;H7:1;H8:1': 14,
 'H3:7;H5:1;H8:1': 41,
 'H3:7;H5:1;H6:1;H9:1': 29,
 'H1:2;H2:5;H4:1;H8:1': 143,
 'H1:1;H3:2;H8:1': 19,
 'H6:1': 3045,
 'H1:2;H3:1;H8:1': 87,
 'H10:1;H2:2;H4:1;H7:1': 28,
 'H1:1;H7:1': 953,
 'H10:1;H2:5;H3:8;H8:1': 32,
 'H2:6;H3:9;H6:1;H8:1': 50,
 'H1:1;H4:1;H5:1;H9:1': 258,
 'H3:9;H9:1': 181,
 ...
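
Since each key joins its attributes with the `;` delimiter, the combination length can be recovered by splitting the key. The sketch below groups a handful of entries copied from the output above by length; `group_by_length` is a hypothetical helper written for this example, not a library function.

```python
from collections import defaultdict

# a few entries copied from the output above
sample = {
    "H6:1": 3045,
    "H1:1;H7:1": 953,
    "H3:9;H9:1": 181,
    "H1:1;H7:1;H8:1": 603,
    "H1:1;H3:6;H4:1;H8:1": 49,
}


def group_by_length(aggregates, delimiter=";"):
    """Group aggregate counts by the number of attributes in each key."""
    grouped = defaultdict(dict)
    for combination, count in aggregates.items():
        length = len(combination.split(delimiter))
        grouped[length][combination] = count
    return dict(grouped)


by_length = group_by_length(sample)
print({length: len(items) for length, items in sorted(by_length.items())})
# {1: 1, 2: 2, 3: 1, 4: 1}
```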

## 7. Evaluate DP aggregates and Synthetic data

This section is an example evaluation of both the DP aggregates and the synthetic data, as well as of the influence that some synthesizer parameters have on them.

### 7.1. Evaluating current results

> `ErrorReport` is just an example way to evaluate results (the code for it is in [`utils.py`](./utils.py))

- **Count**: mean of the aggregate counts for the given length
- **Error**: mean of the `abs(sensitive_count - dp_aggregated_count)` or `abs(sensitive_count - synthetic_count)`
- **Suppressed %**: percentage of combinations present in the sensitive dataset, but not present in the aggregated/synthetic data
- **Fabricated %**: percentage of combinations that were reported in the aggregated/synthetic data, but do not exist in the sensitive dataset
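
To make the four metrics concrete, here is a minimal sketch that computes them for one set of combinations (for example, all combinations of a single length). It is not the `ErrorReport` implementation from `utils.py`. The function `summarize` is hypothetical, and two choices in it are assumptions: the error is averaged only over combinations present in both dictionaries, and the percentages use the sensitive and reported totals as their respective denominators. Both inputs are assumed to be non-empty and to share at least one combination.

```python
def summarize(sensitive, reported):
    """Compute count, error, suppressed % and fabricated % for two
    dictionaries that map a combination to its count."""
    shared = sensitive.keys() & reported.keys()
    suppressed = sensitive.keys() - reported.keys()
    fabricated = reported.keys() - sensitive.keys()
    return {
        # mean of the sensitive aggregate counts
        "count": sum(sensitive.values()) / len(sensitive),
        # mean absolute difference over combinations present in both
        "error": sum(abs(sensitive[c] - reported[c]) for c in shared) / len(shared),
        # present in the sensitive data, missing from the reported data
        "suppressed_pct": 100.0 * len(suppressed) / len(sensitive),
        # present in the reported data, missing from the sensitive data
        "fabricated_pct": 100.0 * len(fabricated) / len(reported),
    }


sensitive = {("A:a1",): 3, ("A:a2",): 2, ("A:a1", "B:b2"): 2, ("A:a2", "C:c1"): 1}
reported = {("A:a1",): 4, ("A:a2",): 2, ("A:a1", "B:b2"): 1, ("B:b2", "C:c2"): 1}

print(summarize(sensitive, reported))
```

In this toy input one of four sensitive combinations is missing from the reported data and one of four reported combinations does not exist in the sensitive data, so both percentages come out at 25.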

In [11]:
sensitive_aggregates_parsed = {
    tuple(agg.split(';')): count for (agg, count) in sensitive_aggregates.items()
}
dp_aggregates_parsed = {
    tuple(agg.split(';')): count for (agg, count) in dp_aggregates.items()
}
synthetic_aggregates_parsed = {
    tuple(agg.split(';')): count for (agg, count) in synthetic_aggregates.items()
}

**Sensitive x DP Aggregates**

In [12]:
ErrorReport(sensitive_aggregates_parsed, dp_aggregates_parsed).gen()

|  | Length | Count +/- Error | Suppressed % | Fabricated % |
| -- | -- | -- | -- | -- |
| 0 | 1 | 1378.36 +/- 42.68 | 0.00 % | 0.00 % |
| 1 | 2 | 424.55 +/- 59.39 | 0.00 % | 6.28 % |
| 2 | 3 | 148.79 +/- 44.08 | 9.59 % | 6.82 % |
| 3 | 4 | 56.26 +/- 29.51 | 46.04 % | 3.74 % |
| 4 | All | 111.24 +/- 37.42 | 33.62 % | 5.03 % |


**Sensitive x Synthetic Data**

In [13]:
ErrorReport(sensitive_aggregates_parsed, synthetic_aggregates_parsed).gen()

|  | Length | Count +/- Error | Suppressed % | Fabricated % |
| -- | -- | -- | -- | -- |
| 0 | 1 | 1378.36 +/- 218.52 | 0.00 % | 0.00 % |
| 1 | 2 | 424.55 +/- 148.59 | 0.00 % | 6.28 % |
| 2 | 3 | 148.79 +/- 81.39 | 9.70 % | 6.72 % |
| 3 | 4 | 56.26 +/- 41.18 | 47.17 % | 3.52 % |
| 4 | All | 111.24 +/- 67.03 | 34.41 % | 4.89 % |


### 7.2. Targeting less fabrication

Let's update the current synthesizer parameters to `minimize` fabrication:

In [14]:
synth = DpAggregatedSeededSynthesizer(
    builder \
        .fabrication_mode(FabricationMode.minimize()) \
        .build()
)

synth.fit(sensitive_dataset)

synthetic_raw_data = synth.sample(len(sensitive_df))
synthetic_dataset = Dataset(synthetic_raw_data)

synthetic_df = Dataset.raw_data_to_data_frame(synthetic_raw_data)
synthetic_df

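|  | H1 | H2 | H3 | H4 | H5 | H6 | H7 | H8 | H9 | H10 |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 0 | 1 |  |  | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 |  |  | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 1 |  |  | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | 1 |  |  | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 | 1 |  |  | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5995 | 2 |  | 5 |  |  |  |  |  |  |  |
| 5996 | 2 |  | 5 |  |  |  |  |  |  |  |
| 5997 | 2 |  | 5 |  |  |  |  |  |  |  |
| 5998 | 2 |  | 5 |  |  |  |  |  |  |  |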


Evaluating again:

In [15]:
sensitive_aggregates = sensitive_dataset.get_aggregates(reporting_length, ';')
dp_aggregates = synth.get_dp_aggregates(';')
synthetic_aggregates = synthetic_dataset.get_aggregates(reporting_length, ';')

sensitive_aggregates_parsed = {
    tuple(agg.split(';')): count for (agg, count) in sensitive_aggregates.items()
}
dp_aggregates_parsed = {
    tuple(agg.split(';')): count for (agg, count) in dp_aggregates.items()
}
synthetic_aggregates_parsed = {
    tuple(agg.split(';')): count for (agg, count) in synthetic_aggregates.items()
}

**Sensitive x DP Aggregates**

In [16]:
ErrorReport(sensitive_aggregates_parsed, dp_aggregates_parsed).gen()

|  | Length | Count +/- Error | Suppressed % | Fabricated % |
| -- | -- | -- | -- | -- |
| 0 | 1 | 1378.36 +/- 61.64 | 0.00 % | 0.00 % |
| 1 | 2 | 424.55 +/- 47.11 | 19.14 % | 0.00 % |
| 2 | 3 | 148.79 +/- 66.12 | 79.60 % | 0.00 % |
| 3 | 4 | 56.26 +/- 40.96 | 94.93 % | 0.00 % |
| 4 | All | 111.24 +/- 53.38 | 85.83 % | 0.00 % |


**Sensitive x Synthetic Data**

In [17]:
ErrorReport(sensitive_aggregates_parsed, synthetic_aggregates_parsed).gen()

|  | Length | Count +/- Error | Suppressed % | Fabricated % |
| -- | -- | -- | -- | -- |
| 0 | 1 | 1378.36 +/- 504.68 | 0.00 % | 0.00 % |
| 1 | 2 | 424.55 +/- 284.83 | 19.14 % | 0.00 % |
| 2 | 3 | 148.79 +/- 168.41 | 79.60 % | 0.00 % |
| 3 | 4 | 56.26 +/- 56.45 | 94.93 % | 0.00 % |
| 4 | All | 111.24 +/- 197.48 | 85.83 % | 0.00 % |


### 7.3. Prioritize small counts

Let's reset the fabrication mode to `uncontrolled` and update the accuracy mode to `prioritize_small_counts`:

In [18]:
synth = DpAggregatedSeededSynthesizer(
    builder \
        .fabrication_mode(FabricationMode.uncontrolled()) \
        .accuracy_mode(AccuracyMode.prioritize_small_counts()) \
        .build()
)

synth.fit(sensitive_dataset)

synthetic_raw_data = synth.sample(len(sensitive_df))
synthetic_dataset = Dataset(synthetic_raw_data)

synthetic_df = Dataset.raw_data_to_data_frame(synthetic_raw_data)
synthetic_df

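|  | H1 | H2 | H3 | H4 | H5 | H6 | H7 | H8 | H9 | H10 |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 0 |  |  |  | 1 | 1 | 1 |  | 1 | 1 | 1 |
| 1 |  |  |  | 1 | 1 | 1 |  | 1 | 1 | 1 |
| 2 |  |  |  | 1 | 1 | 1 |  | 1 | 1 | 1 |
| 3 |  |  |  | 1 | 1 | 1 |  | 1 | 1 | 1 |
| 4 |  |  |  | 1 | 1 | 1 |  | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5995 | 1 | 5 | 9 |  |  |  |  |  |  |  |
| 5996 | 1 | 5 | 9 |  |  |  |  |  |  |  |
| 5997 | 2 | 2 | 8 |  |  |  |  |  |  |  |
| 5998 | 2 | 3 | 5 |  |  |  |  |  |  |  |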


Evaluating again:

In [19]:
sensitive_aggregates = sensitive_dataset.get_aggregates(reporting_length, ';')
dp_aggregates = synth.get_dp_aggregates(';')
synthetic_aggregates = synthetic_dataset.get_aggregates(reporting_length, ';')

sensitive_aggregates_parsed = {
    tuple(agg.split(';')): count for (agg, count) in sensitive_aggregates.items()
}
dp_aggregates_parsed = {
    tuple(agg.split(';')): count for (agg, count) in dp_aggregates.items()
}
synthetic_aggregates_parsed = {
    tuple(agg.split(';')): count for (agg, count) in synthetic_aggregates.items()
}

**Sensitive x DP Aggregates**

In [20]:
ErrorReport(sensitive_aggregates_parsed, dp_aggregates_parsed).gen()

|  | Length | Count +/- Error | Suppressed % | Fabricated % |
| -- | -- | -- | -- | -- |
| 0 | 1 | 1378.36 +/- 15.48 | 0.00 % | 0.00 % |
| 1 | 2 | 424.55 +/- 38.28 | 0.00 % | 7.93 % |
| 2 | 3 | 148.79 +/- 61.96 | 20.62 % | 11.00 % |
| 3 | 4 | 56.26 +/- 43.64 | 68.71 % | 8.46 % |
| 4 | All | 111.24 +/- 50.32 | 51.79 % | 9.38 % |


**Sensitive x Synthetic Data**

In [21]:
ErrorReport(sensitive_aggregates_parsed, synthetic_aggregates_parsed).gen()

|  | Length | Count +/- Error | Suppressed % | Fabricated % |
| -- | -- | -- | -- | -- |
| 0 | 1 | 1378.36 +/- 313.00 | 0.00 % | 0.00 % |
| 1 | 2 | 424.55 +/- 201.49 | 0.00 % | 7.52 % |
| 2 | 3 | 148.79 +/- 109.06 | 21.06 % | 9.37 % |
| 3 | 4 | 56.26 +/- 61.82 | 69.42 % | 7.12 % |
| 4 | All | 111.24 +/- 103.07 | 52.38 % | 8.05 % |
