## User Story D
**Research on benchmark datasets**


This demo shows how to use `PETsARD`'s benchmark datasets to evaluate synthesis algorithms.

In this demonstration, you are an advanced user with a basic understanding of differential privacy and synthetic data technologies and their corresponding evaluation metrics. You want to compare these technologies and explore related academic and practical questions.

`PETsARD` provides a complete platform that integrates benchmark datasets commonly used in academia, competitions, and practical applications. You can set up different benchmark datasets, run various synthesis algorithms on them, and apply different combinations of evaluations. This gives you comprehensive empirical support and lets you focus on your academic or development work.

---

## Environment

In [1]:
import os
import pprint
import sys

import yaml


# Setting up the path to the PETsARD package
path_petsard = os.path.dirname(os.getcwd())
sys.path.append(path_petsard)
# Pretty-printer used to display the loaded YAML configuration
pp = pprint.PrettyPrinter(depth=3, sort_dicts=False)


from petsard import Executor
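
The `depth=3` setting on the pretty-printer affects how configurations are displayed in this demo: any container nested more than three levels deep is printed as `{...}`. A minimal sketch with a made-up nested dictionary (the keys are illustrative, not `PETsARD` settings):

```python
import pprint

# Same printer settings as the environment cell above.
pp = pprint.PrettyPrinter(depth=3, sort_dicts=False)

# Hypothetical nested mapping, four levels deep.
nested = {"module": {"experiment": {"option": {"detail": 1}}}}

# The fourth-level dict is elided as {...}.
shown = pp.pformat(nested)
print(shown)
```

This is why a nested option can appear as `{...}` in a printed configuration. Raising `depth` shows the full value.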

---

## User Story D-1
**Synthesizing on default data**

Given a specified data generation algorithm, the default benchmark datasets serve as input, and the pipeline uses that algorithm to generate the corresponding privacy-enhanced datasets as output.

In [2]:
config_file = '../yaml/User Story D-1.yaml'

with open(config_file, 'r') as yaml_file:
    yaml_raw: dict = yaml.safe_load(yaml_file)
pp.pprint(yaml_raw)

{'Loader': {'demo': {'method': 'default'}},
 'Preprocessor': {'demo': {'method': 'default'}},
 'Synthesizer': {'sdv-gaussian': {'method': 'sdv-single_table-gaussiancopula'}},
 'Postprocessor': {'demo': {'method': 'default'}},
 'Reporter': {'save_data': {'method': 'save_data',
                            'output': 'User Story D-1',
                            'source': 'Postprocessor'}}}


In [3]:
exec = Executor(config=config_file)
exec.run()

Now is Loader with demo...
Loader - Benchmarker : Success download the benchmark dataset from https://petsard-benchmark.s3.amazonaws.com/adult-income.csv.
Now is Preprocessor with demo...
[I 20241225 10:24:54] MediatorMissing is created.
[I 20241225 10:24:54] MediatorOutlier is created.
[I 20241225 10:24:54] MediatorEncoder is created.
[I 20241225 10:24:54] missing fitting done.
[I 20241225 10:24:54] <petsard.processor.mediator.MediatorMissing object at 0x329019240> fitting done.
[I 20241225 10:24:54] outlier fitting done.
[I 20241225 10:24:54] <petsard.processor.mediator.MediatorOutlier object at 0x32901aec0> fitting done.
[I 20241225 10:24:54] encoder fitting done.
[I 20241225 10:24:54] <petsard.processor.mediator.MediatorEncoder object at 0x32901b040> fitting done.
[I 20241225 10:24:54] scaler fitting done.
[I 20241225 10:24:54] missing transformation done.
[I 20241225 10:24:54] <petsard.processor.mediator.MediatorMissing object at 0x329019240> transformation done.
[I 20241225 10:24



[I 20241225 10:24:54] No rounding scheme detected for column 'relationship'. Data will not be rounded.
[I 20241225 10:24:54] No rounding scheme detected for column 'race'. Data will not be rounded.
[I 20241225 10:24:54] No rounding scheme detected for column 'gender'. Data will not be rounded.
[I 20241225 10:24:54] No rounding scheme detected for column 'capital-gain'. Data will not be rounded.
[I 20241225 10:24:54] No rounding scheme detected for column 'capital-loss'. Data will not be rounded.
[I 20241225 10:24:54] No rounding scheme detected for column 'hours-per-week'. Data will not be rounded.
[I 20241225 10:24:54] No rounding scheme detected for column 'native-country'. Data will not be rounded.
[I 20241225 10:24:54] No rounding scheme detected for column 'income'. Data will not be rounded.
[I 20241225 10:24:54] {'EVENT': 'Fit processed data', 'TIMESTAMP': datetime.datetime(2024, 12, 25, 10, 24, 54, 780050), 'SYNTHESIZER CLASS NAME': 'GaussianCopulaSynthesizer', 'SYNTHESIZER ID':

In [4]:
pp.pprint(exec.get_result())

{'Loader[demo]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]_Reporter[save_data]': {'Loader[demo]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]':        age         workclass  fnlwgt     education  educational-num  \
0       36           Private   63960    Assoc-acdm               10   
1       21  Self-emp-not-inc   85886       HS-grad                9   
2       44           Private   51705       HS-grad               11   
3       41           Private  192972     Bachelors                8   
4       51      Self-emp-inc  258854     Bachelors               12   
...    ...               ...     ...           ...              ...   
48837   41           Private  292629       HS-grad               10   
48838   19       Federal-gov  190607  Some-college                9   
48839   21           Private  223081     Bachelors               10   
48840   51       Federal-gov  159476  Some-college                9   
48841   47           Private  35576
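
The keys of the result dictionary concatenate each module name with its experiment name in square brackets. As a convenience, a small helper can split such a key back into its parts. `parse_experiment_key` is a hypothetical name and not part of `PETsARD`, and the sketch assumes experiment names contain no square brackets.

```python
import re

# Matches one "Module[experiment]" segment of a result key.
_SEGMENT = re.compile(r"([A-Za-z]+)\[([^\]]*)\]")


def parse_experiment_key(key: str) -> dict[str, str]:
    """Split a result key into {module: experiment}, keeping pipeline order.

    Hypothetical helper (not a PETsARD API).
    """
    return dict(_SEGMENT.findall(key))


key = (
    "Loader[demo]_Preprocessor[demo]_Synthesizer[sdv-gaussian]"
    "_Postprocessor[demo]_Reporter[save_data]"
)
parts = parse_experiment_key(key)
print(parts)
```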

---

## User Story D-2
**Synthesizing on multiple datasets**

Building on User Story D-1, the user can instead specify a list of datasets.

In [5]:
config_file = '../yaml/User Story D-2.yaml'

with open(config_file, 'r') as yaml_file:
    yaml_raw: dict = yaml.safe_load(yaml_file)
pp.pprint(yaml_raw)

{'Loader': {'default': {'method': 'default'},
            'adult': {'filepath': 'benchmark://adult', 'na_values': {...}}},
 'Preprocessor': {'demo': {'method': 'default'}},
 'Synthesizer': {'sdv-gaussian': {'method': 'sdv-single_table-gaussiancopula'}},
 'Postprocessor': {'demo': {'method': 'default'}},
 'Reporter': {'save_data': {'method': 'save_data',
                            'output': 'User Story D-2',
                            'source': 'Postprocessor'}}}
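
With two Loader experiments and one experiment in each remaining module, this configuration describes two pipelines. Assuming the Executor runs every combination of per-module experiments (which the result keys in this demo suggest), the combinations can be enumerated as follows. The sketch uses only the module and experiment names from the configuration above and does not call `PETsARD`.

```python
from itertools import product

# Module -> experiment names, copied from the D-2 configuration.
experiments = {
    "Loader": ["default", "adult"],
    "Preprocessor": ["demo"],
    "Synthesizer": ["sdv-gaussian"],
    "Postprocessor": ["demo"],
    "Reporter": ["save_data"],
}

# One pipeline per element of the Cartesian product of experiment names.
pipelines = [
    "_".join(f"{module}[{name}]" for module, name in zip(experiments, combo))
    for combo in product(*experiments.values())
]

for pipeline in pipelines:
    print(pipeline)
```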


In [6]:
exec = Executor(config=config_file)
exec.run()

Now is Loader with default...
Loader - Benchmarker: file benchmark/adult-income.csv already exist and match SHA-256.
                      petsard will ignore download and use local data directly.
Now is Preprocessor with demo...
[I 20241225 10:24:57] MediatorMissing is created.
[I 20241225 10:24:57] MediatorOutlier is created.
[I 20241225 10:24:57] MediatorEncoder is created.
[I 20241225 10:24:57] missing fitting done.
[I 20241225 10:24:57] <petsard.processor.mediator.MediatorMissing object at 0x32901bd30> fitting done.
[I 20241225 10:24:57] outlier fitting done.
[I 20241225 10:24:57] <petsard.processor.mediator.MediatorOutlier object at 0x328fb3970> fitting done.
[I 20241225 10:24:57] encoder fitting done.
[I 20241225 10:24:57] <petsard.processor.mediator.MediatorEncoder object at 0x328fb3a90> fitting done.
[I 20241225 10:24:57] scaler fitting done.
[I 20241225 10:24:57] missing transformation done.
[I 20241225 10:24:57] <petsard.processor.mediator.MediatorMissing object at 0x32901bd



[I 20241225 10:24:58] No rounding scheme detected for column 'occupation'. Data will not be rounded.
[I 20241225 10:24:58] No rounding scheme detected for column 'relationship'. Data will not be rounded.
[I 20241225 10:24:58] No rounding scheme detected for column 'race'. Data will not be rounded.
[I 20241225 10:24:58] No rounding scheme detected for column 'gender'. Data will not be rounded.
[I 20241225 10:24:58] No rounding scheme detected for column 'capital-gain'. Data will not be rounded.
[I 20241225 10:24:58] No rounding scheme detected for column 'capital-loss'. Data will not be rounded.
[I 20241225 10:24:58] No rounding scheme detected for column 'hours-per-week'. Data will not be rounded.
[I 20241225 10:24:58] No rounding scheme detected for column 'native-country'. Data will not be rounded.
[I 20241225 10:24:58] No rounding scheme detected for column 'income'. Data will not be rounded.
[I 20241225 10:24:58] {'EVENT': 'Fit processed data', 'TIMESTAMP': datetime.datetime(2024, 



Synthesizer (SDV): Fitting GaussianCopula spent 1.4851 sec.
[I 20241225 10:25:04] {'EVENT': 'Sample', 'TIMESTAMP': datetime.datetime(2024, 12, 25, 10, 25, 3, 911577), 'SYNTHESIZER CLASS NAME': 'GaussianCopulaSynthesizer', 'SYNTHESIZER ID': 'GaussianCopulaSynthesizer_1.17.1_7e933de8f72d4291881dfbb7dc822c5f', 'TOTAL NUMBER OF TABLES': 1, 'TOTAL NUMBER OF ROWS': 32561, 'TOTAL NUMBER OF COLUMNS': 15}
Synthesizer (SDV): Sampling GaussianCopula # 32561 rows (same as Loader data) in 0.4257 sec.
Now is Postprocessor with demo...
[I 20241225 10:25:04] MediatorEncoder is created.
[I 20241225 10:25:04] scaler inverse transformation done.
[I 20241225 10:25:04] <petsard.processor.mediator.MediatorEncoder object at 0x32ab09cf0> transformation done.
[I 20241225 10:25:04] encoder inverse transformation done.
[I 20241225 10:25:04] missing inverse transformation done.
[I 20241225 10:25:04] age changes data dtype from float64 to int8 for metadata alignment.
[I 20241225 10:25:04] workclass changes data dt
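
The Loader log above shows the Benchmarker reusing a local copy of the dataset because its SHA-256 digest matches. The idea behind such a check can be sketched as below. This is a generic illustration using `hashlib`, not the `PETsARD` implementation, and `sha256_of`, `needs_download`, and their arguments are hypothetical names.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(path: Path, expected_sha256: str) -> bool:
    """True if the file is missing or its digest differs (hypothetical helper)."""
    return not path.exists() or sha256_of(path) != expected_sha256


# Demonstrate with a temporary file standing in for a cached dataset.
with tempfile.TemporaryDirectory() as tmp:
    cached = Path(tmp) / "example.csv"
    content = b"age,income\n39,<=50K\n"
    cached.write_bytes(content)
    expected = hashlib.sha256(content).hexdigest()

    fresh = needs_download(cached, expected)  # digest matches
    stale = needs_download(cached, "0" * 64)  # digest differs
    missing = needs_download(Path(tmp) / "none.csv", expected)  # no file

print(fresh, stale, missing)
```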

In [7]:
pp.pprint(exec.get_result())

{'Loader[default]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]_Reporter[save_data]': {'Loader[default]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]':        age         workclass  fnlwgt     education  educational-num  \
0       53           Private  289692     Bachelors               15   
1       32           Private  229890       HS-grad               11   
2       35  Self-emp-not-inc  274308       HS-grad                7   
3       35           Private  218093  Some-college               10   
4       50         Local-gov   30968          11th               12   
...    ...               ...     ...           ...              ...   
48837   25           Private  110113       HS-grad                8   
48838   48           Private  127665       HS-grad               11   
48839   46           Private  132146     Assoc-voc                9   
48840   41           Private  118226       HS-grad               13   
48841   27           Private 

---

## User Story D-3
**Synthesizing and Evaluating on default data**

Building on User Story D-1, if the user enables the evaluation step, the evaluation module produces a report covering the default privacy risk and utility metrics for all datasets.

In [8]:
config_file = '../yaml/User Story D-3.yaml'

with open(config_file, 'r') as yaml_file:
    yaml_raw: dict = yaml.safe_load(yaml_file)
pp.pprint(yaml_raw)

{'Loader': {'demo': {'method': 'default'}},
 'Preprocessor': {'demo': {'method': 'default'}},
 'Synthesizer': {'sdv-gaussian': {'method': 'sdv-single_table-gaussiancopula'}},
 'Postprocessor': {'demo': {'method': 'default'}},
 'Evaluator': {'demo': {'method': 'default'}},
 'Reporter': {'save_data': {'method': 'save_data',
                            'output': 'User Story D-3',
                            'source': 'Postprocessor'},
              'save_report_global': {'method': 'save_report',
                                     'output': 'User Story D-3',
                                     'eval': 'demo',
                                     'granularity': 'global'}}}
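
Compared with User Story D-1, this configuration adds an `Evaluator` module and a second `Reporter` experiment. The difference can be expressed as a small dictionary transformation. The sketch below works on plain Python dictionaries that mirror the printed configurations and does not call `PETsARD`.

```python
import copy

# D-1 configuration as printed earlier, with the output name updated for D-3.
d1 = {
    "Loader": {"demo": {"method": "default"}},
    "Preprocessor": {"demo": {"method": "default"}},
    "Synthesizer": {"sdv-gaussian": {"method": "sdv-single_table-gaussiancopula"}},
    "Postprocessor": {"demo": {"method": "default"}},
    "Reporter": {
        "save_data": {
            "method": "save_data",
            "output": "User Story D-3",
            "source": "Postprocessor",
        }
    },
}

# Rebuild the mapping so that Evaluator sits just before Reporter.
d3 = {}
for module, settings in copy.deepcopy(d1).items():
    if module == "Reporter":
        d3["Evaluator"] = {"demo": {"method": "default"}}
    d3[module] = settings

# Add the global evaluation report alongside the saved data.
d3["Reporter"]["save_report_global"] = {
    "method": "save_report",
    "output": "User Story D-3",
    "eval": "demo",
    "granularity": "global",
}

print(list(d3))
```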


In [9]:
exec = Executor(config=config_file)
exec.run()

Now is Loader with demo...
Loader - Benchmarker: file benchmark/adult-income.csv already exist and match SHA-256.
                      petsard will ignore download and use local data directly.
Now is Preprocessor with demo...
[I 20241225 10:25:04] MediatorMissing is created.
[I 20241225 10:25:04] MediatorOutlier is created.
[I 20241225 10:25:04] MediatorEncoder is created.
[I 20241225 10:25:04] missing fitting done.
[I 20241225 10:25:04] <petsard.processor.mediator.MediatorMissing object at 0x32ac2b7f0> fitting done.
[I 20241225 10:25:04] outlier fitting done.
[I 20241225 10:25:04] <petsard.processor.mediator.MediatorOutlier object at 0x32ab5ae30> fitting done.
[I 20241225 10:25:04] encoder fitting done.
[I 20241225 10:25:04] <petsard.processor.mediator.MediatorEncoder object at 0x32ab5aef0> fitting done.
[I 20241225 10:25:04] scaler fitting done.
[I 20241225 10:25:04] missing transformation done.
[I 20241225 10:25:04] <petsard.processor.mediator.MediatorMissing object at 0x32ac2b7f0>



[I 20241225 10:25:05] No rounding scheme detected for column 'occupation'. Data will not be rounded.
[I 20241225 10:25:05] No rounding scheme detected for column 'relationship'. Data will not be rounded.
[I 20241225 10:25:05] No rounding scheme detected for column 'race'. Data will not be rounded.
[I 20241225 10:25:05] No rounding scheme detected for column 'gender'. Data will not be rounded.
[I 20241225 10:25:05] No rounding scheme detected for column 'capital-gain'. Data will not be rounded.
[I 20241225 10:25:05] No rounding scheme detected for column 'capital-loss'. Data will not be rounded.
[I 20241225 10:25:05] No rounding scheme detected for column 'hours-per-week'. Data will not be rounded.
[I 20241225 10:25:05] No rounding scheme detected for column 'native-country'. Data will not be rounded.
[I 20241225 10:25:05] No rounding scheme detected for column 'income'. Data will not be rounded.
[I 20241225 10:25:05] {'EVENT': 'Fit processed data', 'TIMESTAMP': datetime.datetime(2024, 

  contingency_real = real.groupby(list(columns), dropna=False).size() / len(real)
  contingency_synthetic = synthetic.groupby(list(columns), dropna=False).size() / len(

(2/2) Evaluating Column Pair Trends: |██████████| 105/105 [00:00<00:00, 254.29it/s]|
Column Pair Trends Score: 60.67%




Overall Score (Average): 78.04%

Now is Reporter with save_data...
Now is User Story D-3_Loader[demo]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo] save to csv...
Now is Reporter with save_report_global...
Now is User Story D-3[Report]_demo_[global] save to csv...
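
Two figures in this log can be cross-checked with simple arithmetic. The progress bar counts 105 column pairs, which is the number of unordered pairs in a 15-column table (the column count reported in the D-2 synthesizer log). The overall score is labelled as an average; assuming it is the plain mean of exactly two property scores, one of which is Column Pair Trends, the other score (cut off in the output above) would be about 95.41%. That value is inferred under this assumption, not read from the log.

```python
from math import comb

# Number of unordered column pairs in a 15-column table.
n_columns = 15
n_pairs = comb(n_columns, 2)

# Scores as printed in the log, in percent.
overall = 78.04
column_pair_trends = 60.67

# Assumption: overall = mean of exactly two property scores.
implied_other_score = round(2 * overall - column_pair_trends, 2)

print(n_pairs, implied_other_score)
```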


In [10]:
pp.pprint(exec.get_result())

{'Loader[demo]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]_Evaluator[demo]_Reporter[save_data]': {'Loader[demo]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]':        age         workclass  fnlwgt     education  educational-num  \
0       38      Self-emp-inc  180795     Bachelors               10   
1       47           Private  220719  Some-college                7   
2       44         State-gov  201048     Bachelors               11   
3       31           Private  177691       HS-grad                8   
4       42           Private   75222       Masters               13   
...    ...               ...     ...           ...              ...   
48837   39           Private   55673  Some-college               10   
48838   20  Self-emp-not-inc   41130    Assoc-acdm                8   
48839   21           Private  195825       HS-grad               11   
48840   54         Local-gov   43509     Assoc-voc                9   
48841   39         