# Differential Privacy (DP) Aggregate Seeded Synthesizer
The DP Aggregate Seeded synthesizer is a differentially private synthesizer that relies on DP Marginals to build synthetic data. It computes DP Marginals (called aggregates) for your dataset using a specified `reporting length`, and then synthesizes data based on the computed aggregate counts.

> Aggregates will be computed for all lengths of attribute combination up to and including the `reporting length`.

## 1. Overview

### 1.1. Aggregate data generation with DP
Let's consider the following input as an example:

| A | B | C |
| -- | -- | -- |
| a1 | b1 | c1 |
| a1 | b2 | c1 |
| a2 |    | c2 |
| a2 | b2 | c1 |
| a1 | b2 |    |

The input data is assumed to be categorical and the domain will be inferred from the input dataset:

- `A` possible values are `a1,a2`
- `B` possible values are `b1,b2`
- `C` possible values are `c1,c2`

For a `reporting length=2`, the aggregates in the dataset above could be:

- 1-counts:
    - `A:a1`: 3 + NOISE
    - `A:a2`: 2 + NOISE
    - `B:b1`: 1 + NOISE
    - `B:b2`: 3 + NOISE
    - `C:c1`: 3 + NOISE
    - `C:c2`: 1 + NOISE

- 2-counts:
    - `A:a1;C:c1`: 2 + NOISE
    - `A:a2;B:b2`: 1 + NOISE
    - `B:b1;C:c1`: 1 + NOISE
    - `A:a1;B:b1`: 1 + NOISE
    - `A:a1;B:b2`: 2 + NOISE
    - `B:b2;C:c1`: 2 + NOISE
    - `A:a2;C:c2`: 1 + NOISE
    - `B:b2;C:c2`: 0 + NOISE

Some spurious combinations might also be created and reported to preserve the differential privacy guarantees. Notice that `B:b2;C:c2` does not exist in the sensitive dataset, but it has been _fabricated_ and added to the output.

Similarly, some attribute combinations might be suppressed. For example, even though `A:a2;C:c1` exists in the sensitive dataset, it has not been reported as an aggregate.
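
To make the counts above concrete, here is a minimal sketch that computes the exact (noise-free) aggregates for the example table in plain Python. The helper `count_aggregates` is hypothetical and only illustrates the counting rule; it is not the library's implementation, and it adds no noise, fabrication, or suppression.

```python
from collections import Counter
from itertools import combinations

# The example table from above; '' marks an empty cell.
HEADERS = ['A', 'B', 'C']
RECORDS = [
    ['a1', 'b1', 'c1'],
    ['a1', 'b2', 'c1'],
    ['a2', '', 'c2'],
    ['a2', 'b2', 'c1'],
    ['a1', 'b2', ''],
]


def count_aggregates(headers, records, reporting_length, sep=';'):
    """Count every attribute combination of length 1..reporting_length."""
    counts = Counter()
    for record in records:
        # keep only the non-empty cells, formatted as 'column:value'
        attrs = [f'{h}:{v}' for h, v in zip(headers, record) if v != '']
        for length in range(1, reporting_length + 1):
            for combo in combinations(attrs, length):
                counts[sep.join(combo)] += 1
    return counts


exact_counts = count_aggregates(HEADERS, RECORDS, reporting_length=2)
for combo, count in sorted(exact_counts.items()):
    print(combo, count)
```

In this exact view, `A:a2;C:c1` has a count of 1 and `B:b2;C:c2` does not appear at all, which is the opposite of what the DP aggregates listed above report for those two combinations.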

### 1.2. Synthesis
Synthetic data is then generated directly from the aggregates computed with differential privacy, so the synthetic data carries the same DP guarantees.

## 2. Imports and global config

In [1]:
import pandas as pd

from pacsynth import init_logger, set_number_of_threads
from pacsynth import Dataset
from pacsynth import DpAggregateSeededParametersBuilder, AccuracyMode, FabricationMode
from pacsynth import DpAggregateSeededSynthesizer

from utils import gen_data_frame, ErrorReport

# The library allows the desired log level to be set if wanted
# ('off' || 'error' || 'warn' || 'info' || 'debug' || 'trace')
# init_logger('trace')

# some algorithms have parallel implementations, so the desired number of threads can be set
# (the default is one thread per CPU core)
# set_number_of_threads(2)

## 3. Generating an example data frame with random data

> `gen_data_frame` is just a utility to generate some example data (the code for it is in [`utils.py`](./utils.py))

To illustrate the library, let's start by creating an example data frame:

In [2]:
number_of_records_to_generate = 6000

sensitive_df = gen_data_frame(number_of_records_to_generate)

sensitive_df

Unnamed: 0,H1,H2,H3,H4,H5,H6,H7,H8,H9,H10
0,1,3,,0,0,1,0,0,1,0
1,1,1,3,1,0,1,0,0,0,1
2,,,1,1,0,1,0,0,0,0
3,,1,3,0,1,0,1,1,0,1
4,,1,4,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...
5995,1,6,9,1,1,1,1,0,1,0
5996,2,6,8,0,1,1,1,1,0,0
5997,2,6,,0,0,0,0,1,1,1
5998,2,6,,1,1,0,0,1,1,0


## 4. Creating the sensitive dataset

### 4.1. Creation from constructor

The library uses an internal representation of the data to reduce execution time.

If the data is already in the required raw format, you can call the constructor directly:

In [3]:
sensitive_raw_data = [
    # headers
    ['A', 'B', 'C', 'D'],
    # records
    ['a1', 'b1', 'c1', '0'],
    ['a1', 'b1', 'c2', '0'],
    ['a2', '', 'c2', '1'],
]
sensitive_dataset = Dataset(sensitive_raw_data)
sensitive_dataset.to_data_frame()

Unnamed: 0,A,B,C,D
0,a1,b1,c1,
1,a1,b1,c2,
2,a2,,c2,1.0


### 4.2. Negative value interpretation

The library distinguishes 'positive' attribute values that indicate the presence of specific sensitive data from 'negative' attribute values that indicate the absence of such data. By default, the integer zero (`0`) and the empty string (`""`) are not taken into account when creating and counting attribute combinations. Any columns where zero values are of interest (and thus sensitive) should be listed as `sensitive_zeros`, so they will be treated the same way as positive values.

> For more parameters see the library documentation - `help('pacsynth.Dataset')`.
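
The rule described above can be illustrated with a small plain-Python sketch. The function `positive_attributes` is a hypothetical helper written for this explanation only; it mimics the described behavior (skip empty strings, skip zeros unless the column is listed as sensitive) and is not how the library implements it internally.

```python
def positive_attributes(headers, record, sensitive_zeros=()):
    """Return the (column, value) pairs that would take part in combinations."""
    attributes = []
    for column, value in zip(headers, record):
        if value == '':
            continue  # empty cells are never counted
        if value == '0' and column not in sensitive_zeros:
            continue  # zeros are skipped unless the column is listed
        attributes.append((column, value))
    return attributes


headers = ['A', 'B', 'C', 'D']
record = ['a1', 'b1', 'c1', '0']

# default behavior: the zero in column D is ignored
default_attrs = positive_attributes(headers, record)

# with D listed as a sensitive-zero column, the zero is kept
sensitive_attrs = positive_attributes(headers, record, sensitive_zeros=['D'])

print(default_attrs)
print(sensitive_attrs)
```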

In [4]:
sensitive_raw_data = [
    # headers
    ['A', 'B', 'C', 'D'],
    # records
    ['a1', 'b1', 'c1', '0'],
    ['a1', 'b1', 'c2', '0'],
    ['a2', '', 'c2', '1'],
]
sensitive_dataset = Dataset(sensitive_raw_data, sensitive_zeros=['D'])
sensitive_dataset.to_data_frame()

Unnamed: 0,A,B,C,D
0,a1,b1,c1,0
1,a1,b1,c2,0
2,a2,,c2,1


### 4.3. Creating sensitive dataset from a pandas data frame
For convenience, a method is provided to build a dataset from a pandas data frame:

In [5]:
sensitive_dataset = Dataset.from_data_frame(sensitive_df)
sensitive_dataset.to_data_frame()

Unnamed: 0,H1,H2,H3,H4,H5,H6,H7,H8,H9,H10
0,1,3,,,,1,,,1,
1,1,1,3,1,,1,,,,1
2,,,1,1,,1,,,,
3,,1,3,,1,,1,1,,1
4,,1,4,,1,,,,,
...,...,...,...,...,...,...,...,...,...,...
5995,1,6,9,1,1,1,1,,1,
5996,2,6,8,,1,1,1,1,,
5997,2,6,,,,,,1,1,1
5998,2,6,,1,1,,,1,1,


## 5. Generating the synthetic data

### 5.1. Defining synthesizer parameters

If you just want to create the synthesizer with default parameters:

In [6]:
synth = DpAggregateSeededSynthesizer()
print(synth.parameters)

{
  "reporting_length": 3,
  "epsilon": 0.1,
  "delta": null,
  "percentile_percentage": 99,
  "percentile_epsilon_proportion": 0.01,
  "sigma_proportions": [
    1.0,
    0.5,
    0.3333333333333333
  ],
  "number_of_records_epsilon": 0.1,
  "threshold": {
    "type": "Adaptive",
    "valuesByLen": {
      "3": 1.0,
      "2": 1.0
    }
  },
  "empty_value": "",
  "use_synthetic_counts": false,
  "weight_selection_percentile": 95,
  "aggregate_counts_scale_factor": null
}


However, this might not produce the optimal output for your dataset and downstream analysis tasks.

To tune the synthesizer parameters, we provide a builder (`DpAggregateSeededParametersBuilder`). Any parameters passed to the builder will override the default values accordingly:

In [7]:
# this explicitly outlines the default parameters
params = DpAggregateSeededParametersBuilder() \
        .reporting_length(3) \
        .epsilon(0.1) \
        .delta(1 / (2.0 * len(sensitive_df))) \
        .percentile_percentage(99) \
        .percentile_epsilon_proportion(0.01) \
        .accuracy_mode(AccuracyMode.prioritize_large_counts()) \
        .number_of_records_epsilon(0.1) \
        .fabrication_mode(FabricationMode.uncontrolled()) \
        .empty_value("") \
        .weight_selection_percentile(95) \
        .use_synthetic_counts(False) \
        .aggregate_counts_scale_factor(1.0) \
        .build()
print(params)


{
  "reporting_length": 3,
  "epsilon": 0.1,
  "delta": 0.00008333333333333333,
  "percentile_percentage": 99,
  "percentile_epsilon_proportion": 0.01,
  "sigma_proportions": [
    1.0,
    0.5,
    0.3333333333333333
  ],
  "number_of_records_epsilon": 0.1,
  "threshold": {
    "type": "Adaptive",
    "valuesByLen": {
      "3": 1.0,
      "2": 1.0
    }
  },
  "empty_value": "",
  "use_synthetic_counts": false,
  "weight_selection_percentile": 95,
  "aggregate_counts_scale_factor": 1.0
}
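
Two of the printed values can be reproduced by hand. The sketch below assumes the 6000 records generated earlier for `delta`, and assumes that `sigma_proportions` follows the `S(k) = S(1) / k` relation described in the `prioritize_large_counts` help text in section 8. The second point is an inference from that help text and the printed values, not a confirmed description of the library internals.

```python
number_of_records = 6000  # size of the example data frame generated earlier
reporting_length = 3

# delta as passed to the builder above: 1 / (2 * n)
delta = 1 / (2.0 * number_of_records)

# assumed noise scale proportion per combination length: S(k) = S(1) / k
sigma_proportions = [1.0 / length for length in range(1, reporting_length + 1)]

print(delta)
print(sigma_proportions)
```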


To continue with this example, let's set the parameters we care about for now:

In [8]:
reporting_length = 4

builder = DpAggregateSeededParametersBuilder() \
    .reporting_length(reporting_length) \
    .epsilon(0.9) \
    .accuracy_mode(AccuracyMode.prioritize_large_counts()) \
    .fabrication_mode(FabricationMode.uncontrolled()) \
    .use_synthetic_counts(True)

synth = DpAggregateSeededSynthesizer(builder.build())

### 5.2. Building the model and synthesizing data

In [9]:
synth.fit(sensitive_dataset)

# the record count protected with DP; we may or may not use it as the sample size
protected_number_of_records = synth.get_dp_number_of_records()

print('Number of records protected with DP:', protected_number_of_records)

# if the desired number of samples is not specified, the synthesizer will
# use all the available attributes based on the 1-counts to synthesize records
synthetic_raw_data = synth.sample(protected_number_of_records)
synthetic_dataset = Dataset(synthetic_raw_data)

# as an example, let's create a pandas data frame from the raw synthetic data
synthetic_df = Dataset.raw_data_to_data_frame(synthetic_raw_data)
synthetic_df

Number of records protected with DP: 6001


Unnamed: 0,H1,H2,H3,H4,H5,H6,H7,H8,H9,H10
0,1,2,,1,1,1,1,1,1,1
1,1,2,,1,1,1,1,1,1,1
2,1,2,,1,1,1,1,1,1,1
3,1,2,,1,1,1,1,1,1,1
4,1,2,,1,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...
5996,,6,5,1,,,,,,
5997,,6,5,1,,,,,,
5998,,6,5,1,,,,,,
5999,,6,5,1,,,,,,


## 6. Generating/exporting aggregate data

In [10]:
# generate sensitive aggregates
sensitive_aggregates = sensitive_dataset.get_aggregates(reporting_length, ';')

# export the differentially private aggregates (internal to the synthesizer)
dp_aggregates = synth.get_dp_aggregates(';')

# generate aggregates from the synthetic data
synthetic_aggregates = synthetic_dataset.get_aggregates(reporting_length, ';')

# let's take a look at the DP aggregates
list(dp_aggregates.items())[:20]

[('H1:1;H3:2;H5:1;H8:1', 41),
 ('H2:6;H3:10;H5:1;H7:1', 82),
 ('H1:1;H3:6;H4:1;H8:1', 53),
 ('H1:1;H3:3;H6:1;H8:1', 30),
 ('H2:5;H3:9;H6:1;H8:1', 28),
 ('H1:1;H3:7;H6:1', 74),
 ('H1:2;H2:5;H3:9;H9:1', 22),
 ('H2:6;H3:9;H8:1', 89),
 ('H10:1;H3:6;H8:1', 102),
 ('H1:2;H2:5;H4:1', 148),
 ('H10:1;H2:1;H3:3;H4:1', 21),
 ('H2:4;H3:8;H5:1;H9:1', 23),
 ('H10:1;H3:5;H4:1;H6:1', 75),
 ('H10:1;H1:1;H3:7;H8:1', 23),
 ('H2:1;H3:2;H6:1;H8:1', 4),
 ('H1:2;H2:4;H3:9;H9:1', 4),
 ('H1:2;H2:6;H3:7;H4:1', 32),
 ('H10:1;H1:2;H5:1;H6:1', 230),
 ('H10:1;H3:7;H5:1', 115),
 ('H2:4;H3:10;H4:1;H9:1', 21)]
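
Because each key is a `;`-joined list of `column:value` attributes, the combination length can be recovered by splitting the key. The sketch below groups a few entries copied from the output above by length; the variable names are illustrative only.

```python
from collections import defaultdict

# a few entries copied from the DP aggregates printed above
dp_aggregates_sample = {
    'H1:1;H3:2;H5:1;H8:1': 41,
    'H1:1;H3:7;H6:1': 74,
    'H2:6;H3:9;H8:1': 89,
    'H10:1;H3:7;H5:1': 115,
}

# group the aggregates by the number of attributes in each combination
by_length = defaultdict(dict)
for combination, count in dp_aggregates_sample.items():
    length = len(combination.split(';'))
    by_length[length][combination] = count

for length in sorted(by_length):
    print(length, by_length[length])
```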

## 7. Evaluating DP aggregates and DP synthetic data

This section is an example evaluation of both the DP aggregates and the DP synthetic data, as well as the influence of some synthesizer parameters on both.

### 7.1. Evaluating current results

> `ErrorReport` is just an example way to evaluate results (the code for it is in [`utils.py`](./utils.py))

- **Count**: mean of the aggregate counts for the given length
- **Error**: mean of `abs(sensitive_count - dp_aggregated_count)` or `abs(sensitive_count - synthetic_count)`
- **Suppressed %**: percentage of combinations present in the sensitive dataset, but not present in the aggregated/synthetic data
- **Fabricated %**: percentage of combinations that were reported in the aggregated/synthetic data, but do not exist in the sensitive dataset
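
As a rough illustration of these four metrics, the sketch below computes them for two small dictionaries. It is not the `ErrorReport` code from `utils.py`. The function name `error_summary`, the toy counts, and the choices of averaging the error over combinations present in both dictionaries and of dividing by the sensitive and reported totals are all assumptions made for this example.

```python
def error_summary(sensitive, reported):
    """Summarize how `reported` counts deviate from `sensitive` counts."""
    shared = sensitive.keys() & reported.keys()
    suppressed = sensitive.keys() - reported.keys()
    fabricated = reported.keys() - sensitive.keys()
    # assumption: the error is averaged over combinations present in both
    mean_error = (
        sum(abs(sensitive[k] - reported[k]) for k in shared) / len(shared)
        if shared else 0.0
    )
    return {
        'mean_count': sum(sensitive.values()) / len(sensitive),
        'mean_error': mean_error,
        'suppressed_pct': 100.0 * len(suppressed) / len(sensitive),
        'fabricated_pct': 100.0 * len(fabricated) / len(reported),
    }


# toy data: the reported counts are made up for illustration
toy_sensitive = {('A:a1',): 3, ('A:a2',): 2, ('A:a2', 'C:c1'): 1}
toy_reported = {('A:a1',): 4, ('A:a2',): 2, ('B:b2', 'C:c2'): 1}

summary = error_summary(toy_sensitive, toy_reported)
print(summary)
```

Here `('A:a2', 'C:c1')` counts as suppressed and `('B:b2', 'C:c2')` as fabricated, mirroring the example in section 1.1.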

In [11]:
sensitive_aggregates_parsed = {
    tuple(agg.split(';')): count for (agg, count) in sensitive_aggregates.items()
}
dp_aggregates_parsed = {
    tuple(agg.split(';')): count for (agg, count) in dp_aggregates.items()
}
synthetic_aggregates_parsed = {
    tuple(agg.split(';')): count for (agg, count) in synthetic_aggregates.items()
}
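
The same parsing step is applied to all three aggregate dictionaries, so it could be factored into a small helper. `parse_aggregates` below is a hypothetical name, and the sample entries are copied from the DP aggregates shown in section 6.

```python
def parse_aggregates(aggregates, sep=';'):
    """Turn 'H1:1;H3:7' style keys into ('H1:1', 'H3:7') tuples."""
    return {tuple(key.split(sep)): count for key, count in aggregates.items()}


sample = {'H1:1;H3:7;H6:1': 74, 'H2:6;H3:9;H8:1': 89}
parsed = parse_aggregates(sample)
print(parsed)
```

Tuple keys make it easy to read the combination length with `len(key)` when grouping results by length.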

**Sensitive Data vs. DP Aggregates**

In [12]:
ErrorReport(sensitive_aggregates_parsed, dp_aggregates_parsed).gen()

| Length | Count +/- Error | Suppressed % | Fabricated % |
| -- | -- | -- | -- |
| 1 | 1380.28 +/- 41.84 | 0.00 % | 0.00 % |
| 2 | 425.34 +/- 24.54 | 0.00 % | 7.11 % |
| 3 | 149.06 +/- 24.62 | 1.32 % | 7.16 % |
| 4 | 56.37 +/- 18.69 | 15.58 % | 2.80 % |
| All | 111.44 +/- 20.95 | 10.88 % | 4.35 % |


**Sensitive Data vs. DP Synthetic Data**

In [13]:
ErrorReport(sensitive_aggregates_parsed, synthetic_aggregates_parsed).gen()

| Length | Count +/- Error | Suppressed % | Fabricated % |
| -- | -- | -- | -- |
| 1 | 1380.28 +/- 85.60 | 0.00 % | 0.00 % |
| 2 | 425.34 +/- 71.05 | 0.00 % | 7.11 % |
| 3 | 149.06 +/- 44.88 | 1.32 % | 6.09 % |
| 4 | 56.37 +/- 24.38 | 15.84 % | 2.24 % |
| All | 111.44 +/- 33.82 | 11.05 % | 3.68 % |


### 7.2. Targeting less fabrication

Let's update the current synthesizer parameters to `minimize` fabrication:

In [14]:
synth = DpAggregateSeededSynthesizer(
    builder \
        .fabrication_mode(FabricationMode.minimize()) \
        .build()
)

synth.fit(sensitive_dataset)

synthetic_raw_data = synth.sample(len(sensitive_df))
synthetic_dataset = Dataset(synthetic_raw_data)

synthetic_df = Dataset.raw_data_to_data_frame(synthetic_raw_data)
synthetic_df

Unnamed: 0,H1,H2,H3,H4,H5,H6,H7,H8,H9,H10
0,1,3,,1,1,1,1,1,1,1
1,1,3,,1,1,1,1,1,1,1
2,1,3,,1,1,1,1,1,1,1
3,1,3,,1,1,1,1,1,1,1
4,1,3,,1,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...
5995,1,,1,,,,,,,
5996,1,,1,,,,,,,
5997,1,,1,,,,,,,
5998,1,,1,,,,,,,


Evaluating again:

In [15]:
sensitive_aggregates = sensitive_dataset.get_aggregates(reporting_length, ';')
dp_aggregates = synth.get_dp_aggregates(';')
synthetic_aggregates = synthetic_dataset.get_aggregates(reporting_length, ';')

sensitive_aggregates_parsed = {
    tuple(agg.split(';')): count for (agg, count) in sensitive_aggregates.items()
}
dp_aggregates_parsed = {
    tuple(agg.split(';')): count for (agg, count) in dp_aggregates.items()
}
synthetic_aggregates_parsed = {
    tuple(agg.split(';')): count for (agg, count) in synthetic_aggregates.items()
}

**Sensitive Data vs. DP Aggregates**

In [16]:
ErrorReport(sensitive_aggregates_parsed, dp_aggregates_parsed).gen()

| Length | Count +/- Error | Suppressed % | Fabricated % |
| -- | -- | -- | -- |
| 1 | 1380.28 +/- 25.60 | 0.00 % | 0.00 % |
| 2 | 425.34 +/- 25.96 | 0.96 % | 0.00 % |
| 3 | 149.06 +/- 28.07 | 42.56 % | 0.00 % |
| 4 | 56.37 +/- 24.28 | 74.86 % | 0.00 % |
| All | 111.44 +/- 26.02 | 61.65 % | 0.00 % |


**Sensitive Data vs. DP Synthetic Data**

In [17]:
ErrorReport(sensitive_aggregates_parsed, synthetic_aggregates_parsed).gen()

| Length | Count +/- Error | Suppressed % | Fabricated % |
| -- | -- | -- | -- |
| 1 | 1380.28 +/- 307.76 | 0.00 % | 0.00 % |
| 2 | 425.34 +/- 185.35 | 0.96 % | 0.00 % |
| 3 | 149.06 +/- 102.71 | 42.56 % | 0.00 % |
| 4 | 56.37 +/- 51.05 | 74.86 % | 0.00 % |
| All | 111.44 +/- 96.23 | 61.65 % | 0.00 % |


### 7.3. Prioritize small counts

Let's update the current synthesizer parameters to `prioritize_small_counts`:

In [18]:
synth = DpAggregateSeededSynthesizer(
    builder \
        .fabrication_mode(FabricationMode.uncontrolled()) \
        .accuracy_mode(AccuracyMode.prioritize_small_counts()) \
        .use_synthetic_counts(False) \
        .build()
)

synth.fit(sensitive_dataset)

synthetic_raw_data = synth.sample(len(sensitive_df))
synthetic_dataset = Dataset(synthetic_raw_data)

synthetic_df = Dataset.raw_data_to_data_frame(synthetic_raw_data)
synthetic_df

Unnamed: 0,H1,H2,H3,H4,H5,H6,H7,H8,H9,H10
0,1,,,1,1,1,1,1,1,1
1,1,,,1,1,1,1,1,1,1
2,1,,,1,1,1,1,1,1,1
3,1,,,1,1,1,1,1,1,1
4,1,,,1,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...
5995,,4,1,,,,,,,
5996,,4,1,,,,,,,
5997,,4,1,,,,,,,
5998,,6,3,,,,,,,


Evaluating again:

In [19]:
sensitive_aggregates = sensitive_dataset.get_aggregates(reporting_length, ';')
dp_aggregates = synth.get_dp_aggregates(';')
synthetic_aggregates = synthetic_dataset.get_aggregates(reporting_length, ';')

sensitive_aggregates_parsed = {
    tuple(agg.split(';')): count for (agg, count) in sensitive_aggregates.items()
}
dp_aggregates_parsed = {
    tuple(agg.split(';')): count for (agg, count) in dp_aggregates.items()
}
synthetic_aggregates_parsed = {
    tuple(agg.split(';')): count for (agg, count) in synthetic_aggregates.items()
}

**Sensitive Data vs. DP Aggregates**

In [20]:
ErrorReport(sensitive_aggregates_parsed, dp_aggregates_parsed).gen()

| Length | Count +/- Error | Suppressed % | Fabricated % |
| -- | -- | -- | -- |
| 1 | 1380.28 +/- 11.72 | 0.00 % | 0.00 % |
| 2 | 425.34 +/- 16.25 | 0.00 % | 7.11 % |
| 3 | 149.06 +/- 32.59 | 2.76 % | 8.03 % |
| 4 | 56.37 +/- 30.42 | 38.79 % | 4.94 % |
| All | 111.44 +/- 29.83 | 26.96 % | 6.15 % |


**Sensitive Data vs. DP Synthetic Data**

In [21]:
ErrorReport(sensitive_aggregates_parsed, synthetic_aggregates_parsed).gen()

| Length | Count +/- Error | Suppressed % | Fabricated % |
| -- | -- | -- | -- |
| 1 | 1380.28 +/- 98.88 | 0.00 % | 0.00 % |
| 2 | 425.34 +/- 129.07 | 0.00 % | 6.70 % |
| 3 | 149.06 +/- 87.48 | 2.98 % | 6.58 % |
| 4 | 56.37 +/- 60.39 | 40.64 % | 3.01 % |
| All | 111.44 +/- 75.86 | 28.26 % | 4.56 % |


## 8. Library documentation

The library is fully documented. To check the documentation, you can use Python's built-in `help`:

In [22]:
# you can check the docs for the entire library
# help('pacsynth')

In [23]:
# for a class
help('pacsynth.DpAggregateSeededSynthesizer')

Help on class DpAggregateSeededSynthesizer in pacsynth:

pacsynth.DpAggregateSeededSynthesizer = class DpAggregateSeededSynthesizer(object)
 |  pacsynth.DpAggregateSeededSynthesizer(parameters=None)
 |  
 |  Differential Privacy (DP) Aggregate Seeded Synthesizer.
 |  
 |  DP Aggregate Seeded synthesizer is a differentially private synthesizer that relies on
 |  DP Marginals to build synthetic data. It will compute DP Marginals (called aggregates)
 |  for your dataset (`.fit`) using the specified parameters, and synthesize data (`.sample`) based on the
 |  computed aggregated counts (`.get_dp_aggregates`).
 |  
 |  Arguments:
 |      * parameters: Optional[DpAggregateSeededParameters] - parameters constructed with DpAggregateSeededParametersBuilder
 |          - if not provided, default parameters will be used: `DpAggregateSeededParametersBuilder().build()`
 |  
 |  Returns:
 |      New DpAggregateSeededSynthesizer
 |  
 |  Methods defined here:
 |  
 |  fit(self)
 |      Computes the d

In [24]:
# for a method
help('pacsynth.AccuracyMode.prioritize_large_counts')

Help on built-in function prioritize_large_counts in pacsynth.AccuracyMode:

pacsynth.AccuracyMode.prioritize_large_counts = prioritize_large_counts()
    This mode will ensure that more privacy budget is spent
    for larger attribute combination lengths.
    
    For example, if reporting_length=3 and S(i) the scale of a gaussian noise
    added to the correspondent combination length:
        - single attribute counts      (1-counts) = S(1)
        - combinations of 2 attributes (2-counts) = S(2) = S(1) / 2
        - combinations of 3 attributes (3-counts) = S(3) = S(1) / 3
    
    So 3 times MORE BUDGET is going to be spent with the 3-counts
    than with the 1-counts, meaning that the scale of noise related to the 1-counts
    will be 3 times bigger than the scale related with the 3-counts.
    
    Summary:
        Use this if you want smaller errors for larger attribute combination lengths
        (e.g. the accuracy for 3-counts is more important than for 1-counts)
    
   

In [25]:
# for a function
help('pacsynth.init_logger')

Help on built-in function init_logger in pacsynth:

pacsynth.init_logger = init_logger(level_str)
    Enables logging and sets the desired log level.
    This is supposed to be called only once.
    
    Arguments:
        * level_str: str - 'off' || 'error' || 'warn' || 'info' || 'debug' || 'trace'

