## User Story D
**Research on Benchmark datasets**


This demo shows how to use `PETsARD`'s benchmark datasets to evaluate synthesis algorithms.

In this scenario you are an advanced user with a basic understanding of differential privacy and synthetic data technologies and of their evaluation metrics, and you want to compare these technologies for academic or practical purposes.

`PETsARD` integrates benchmark datasets commonly used in academia, competitions, and practical applications. You can configure several benchmark datasets, run them through different synthesis algorithms, and apply different evaluation combinations in one place, so the results you need come with little setup and you can focus on your research or development work.

---

## Environment

In [11]:
import os
import pprint
import sys

import yaml


# Setting up the path to the PETsARD package
path_petsard = os.path.dirname(os.getcwd())
print(path_petsard)
sys.path.append(path_petsard)
# Settings for pretty-printing the YAML content
pp = pprint.PrettyPrinter(depth=3, sort_dicts=False)


from PETsARD import Executor

/Users/justyn.chen/Dropbox/310_Career_工作/20231016_NICS_資安院/41_PETsARD/PETsARD


---

## User Story D-1
**Synthesizing on default data**

Once a data generation algorithm is specified, the default benchmark dataset collection serves as input, and the pipeline outputs the corresponding privacy-enhanced datasets generated by that algorithm.

In [12]:
config_file = '../yaml/User Story D-1.yaml'

with open(config_file, 'r') as yaml_file:
    yaml_raw: dict = yaml.safe_load(yaml_file)
pp.pprint(yaml_raw)

{'Loader': {'demo': {'method': 'default'}},
 'Preprocessor': {'demo': {'method': 'default'}},
 'Synthesizer': {'sdv-gaussian': {'method': 'sdv-single_table-gaussiancopula'}},
 'Postprocessor': {'demo': {'method': 'default'}},
 'Reporter': {'save_data': {'method': 'save_data',
                            'output': 'User Story D-1',
                            'source': 'Postprocessor'}}}
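
The printed configuration is a nested mapping: module name, then experiment name, then that experiment's parameters. As a minimal sketch built only from the dictionary shown above, the following flattens it into one row per experiment. `list_experiments` is a hypothetical helper written for illustration and is not part of `PETsARD`.

```python
# The configuration printed above, copied as a plain Python dict.
config = {
    "Loader": {"demo": {"method": "default"}},
    "Preprocessor": {"demo": {"method": "default"}},
    "Synthesizer": {"sdv-gaussian": {"method": "sdv-single_table-gaussiancopula"}},
    "Postprocessor": {"demo": {"method": "default"}},
    "Reporter": {
        "save_data": {
            "method": "save_data",
            "output": "User Story D-1",
            "source": "Postprocessor",
        }
    },
}


def list_experiments(config: dict) -> list[tuple[str, str, str]]:
    """Flatten module -> experiment -> params into (module, experiment, method) rows.

    Hypothetical helper for reading a config; not a PETsARD function.
    """
    rows = []
    for module, experiments in config.items():
        for name, params in experiments.items():
            rows.append((module, name, params.get("method", "")))
    return rows


for module, name, method in list_experiments(config):
    print(f"{module:<14} {name:<13} {method}")
```

Reading the config this way makes it easy to see that each module here holds exactly one named experiment.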


In [13]:
exec = Executor(config=config_file)
exec.run()

Now is Loader with demo...
Loader - Benchmarker: file benchmark/adult-income.csv already exist and match SHA-256.
                      PETsARD will ignore download and use local data directly.
Now is Preprocessor with demo...
[I 20241108 10:53:28] MediatorMissing is created.
[I 20241108 10:53:28] MediatorOutlier is created.
[I 20241108 10:53:28] MediatorEncoder is created.
[I 20241108 10:53:28] missing fitting done.
[I 20241108 10:53:28] <PETsARD.processor.mediator.MediatorMissing object at 0x323bbf2e0> fitting done.
[I 20241108 10:53:28] outlier fitting done.
[I 20241108 10:53:28] <PETsARD.processor.mediator.MediatorOutlier object at 0x32564e3e0> fitting done.
[I 20241108 10:53:28] encoder fitting done.
[I 20241108 10:53:28] <PETsARD.processor.mediator.MediatorEncoder object at 0x32564f880> fitting done.
[I 20241108 10:53:28] scaler fitting done.
[I 20241108 10:53:28] missing transformation done.
[I 20241108 10:53:28] <PETsARD.processor.mediator.MediatorMissing object at 0x323bbf2e0>



[I 20241108 10:53:28] No rounding scheme detected for column 'relationship'. Data will not be rounded.
[I 20241108 10:53:28] No rounding scheme detected for column 'race'. Data will not be rounded.
[I 20241108 10:53:28] No rounding scheme detected for column 'gender'. Data will not be rounded.
[I 20241108 10:53:28] No rounding scheme detected for column 'capital-gain'. Data will not be rounded.
[I 20241108 10:53:28] No rounding scheme detected for column 'capital-loss'. Data will not be rounded.
[I 20241108 10:53:28] No rounding scheme detected for column 'hours-per-week'. Data will not be rounded.
[I 20241108 10:53:28] No rounding scheme detected for column 'native-country'. Data will not be rounded.
[I 20241108 10:53:28] No rounding scheme detected for column 'income'. Data will not be rounded.
[I 20241108 10:53:28] {'EVENT': 'Fit processed data', 'TIMESTAMP': datetime.datetime(2024, 11, 8, 10, 53, 28, 663617), 'SYNTHESIZER CLASS NAME': 'GaussianCopulaSynthesizer', 'SYNTHESIZER ID': 
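
The Loader log above reports that the local benchmark file already exists and matches its SHA-256 digest, so the download is skipped. The following is a minimal sketch of that kind of check using only the standard library. The function names `sha256_of` and `needs_download` are assumptions made for illustration, and this is not a description of how `PETsARD` implements it internally.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(path: Path, expected_sha256: str) -> bool:
    """True when the file is missing or its digest differs from the expected one.

    Hypothetical helper; the expected digest would come from whatever
    registry describes the benchmark dataset.
    """
    return not path.is_file() or sha256_of(path) != expected_sha256.lower()


# Demonstration on a temporary file rather than a real benchmark dataset.
with tempfile.TemporaryDirectory() as tmp:
    content = b"age,income\n39,<=50K\n"
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes(content)
    expected = hashlib.sha256(content).hexdigest()
    print(needs_download(sample, expected))                     # file present, digest matches
    print(needs_download(sample, "0" * 64))                     # digest mismatch
    print(needs_download(Path(tmp) / "missing.csv", expected))  # file absent
```

Reading in chunks keeps memory use flat even for large benchmark files.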

In [14]:
pp.pprint(exec.get_result())

{'Loader[demo]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]_Reporter[save_data]': {'Loader[demo]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]':        age  workclass  fnlwgt     education  educational-num  \
0       36          ?  178572  Some-college               16   
1       37    Private  234756       HS-grad               11   
2       28    Private  239395    Assoc-acdm                7   
3       40    Private  158074       HS-grad               10   
4       28    Private  153816     Assoc-voc               13   
...    ...        ...     ...           ...              ...   
48837   39    Private   50983  Some-college                7   
48838   19    Private   38358     Bachelors               12   
48839   25  Local-gov  127916       HS-grad               10   
48840   43    Private   74163     Bachelors               13   
48841   51    Private   99458       Masters                9   

           marital-status         occupation   
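
The result dictionary is keyed by strings of the form `Module[experiment]` joined with underscores, as in the output above. A small sketch, assuming that naming pattern holds for every key, that splits such a key back into ordered (module, experiment) pairs. `parse_result_key` is a hypothetical helper, not a `PETsARD` function.

```python
import re

# Each step appears as Module[experiment]; steps are joined with underscores.
# Matching on the brackets keeps experiment names such as "save_data" intact.
STEP_PATTERN = re.compile(r"([A-Za-z]+)\[([^\]]+)\]")


def parse_result_key(key: str) -> list[tuple[str, str]]:
    """Split a result key into ordered (module, experiment) pairs."""
    return STEP_PATTERN.findall(key)


key = (
    "Loader[demo]_Preprocessor[demo]_Synthesizer[sdv-gaussian]"
    "_Postprocessor[demo]_Reporter[save_data]"
)
for module, experiment in parse_result_key(key):
    print(module, experiment)
```

This can help when filtering results by a particular synthesizer or loader name.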

---

## User Story D-2
**Synthesizing on multiple data**

Following User Story D-1, the user can instead specify a list of datasets.

In [15]:
config_file = '../yaml/User Story D-2.yaml'

with open(config_file, 'r') as yaml_file:
    yaml_raw: dict = yaml.safe_load(yaml_file)
pp.pprint(yaml_raw)

{'Loader': {'default': {'method': 'default'},
            'adult': {'filepath': 'benchmark://adult', 'na_values': {...}}},
 'Preprocessor': {'demo': {'method': 'default'}},
 'Synthesizer': {'sdv-gaussian': {'method': 'sdv-single_table-gaussiancopula'}},
 'Postprocessor': {'demo': {'method': 'default'}},
 'Reporter': {'save_data': {'method': 'save_data',
                            'output': 'User Story D-2',
                            'source': 'Postprocessor'}}}
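
With two entries under `Loader`, the configuration above describes more than one experiment. The sketch below enumerates the combinations as a Cartesian product over the experiment names in each module and formats them like the result keys printed in this notebook. Treating the combinations as a plain product is an assumption made for illustration; `experiment_names` is a hypothetical helper and does not reflect `PETsARD`'s internal scheduling.

```python
from itertools import product

# Experiment names per module, taken from the configuration printed above.
experiments = {
    "Loader": ["default", "adult"],
    "Preprocessor": ["demo"],
    "Synthesizer": ["sdv-gaussian"],
    "Postprocessor": ["demo"],
    "Reporter": ["save_data"],
}


def experiment_names(experiments: dict[str, list[str]]) -> list[str]:
    """Build one 'Module[name]_...' string per combination of experiments."""
    modules = list(experiments)
    return [
        "_".join(f"{module}[{name}]" for module, name in zip(modules, combo))
        for combo in product(*experiments.values())
    ]


for name in experiment_names(experiments):
    print(name)
```

Adding a second synthesizer to the mapping would double the number of combinations under this assumption.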


In [16]:
exec = Executor(config=config_file)
exec.run()

Now is Loader with default...
Loader - Benchmarker: file benchmark/adult-income.csv already exist and match SHA-256.
                      PETsARD will ignore download and use local data directly.
Now is Preprocessor with demo...
[I 20241108 10:53:31] MediatorMissing is created.
[I 20241108 10:53:31] MediatorOutlier is created.
[I 20241108 10:53:31] MediatorEncoder is created.
[I 20241108 10:53:31] missing fitting done.
[I 20241108 10:53:31] <PETsARD.processor.mediator.MediatorMissing object at 0x325553a00> fitting done.
[I 20241108 10:53:31] outlier fitting done.
[I 20241108 10:53:31] <PETsARD.processor.mediator.MediatorOutlier object at 0x3256a5360> fitting done.
[I 20241108 10:53:31] encoder fitting done.
[I 20241108 10:53:31] <PETsARD.processor.mediator.MediatorEncoder object at 0x3256a7760> fitting done.
[I 20241108 10:53:31] scaler fitting done.
[I 20241108 10:53:31] missing transformation done.
[I 20241108 10:53:31] <PETsARD.processor.mediator.MediatorMissing object at 0x325553a



[I 20241108 10:53:32] No rounding scheme detected for column 'gender'. Data will not be rounded.
[I 20241108 10:53:32] No rounding scheme detected for column 'capital-gain'. Data will not be rounded.
[I 20241108 10:53:32] No rounding scheme detected for column 'capital-loss'. Data will not be rounded.
[I 20241108 10:53:32] No rounding scheme detected for column 'hours-per-week'. Data will not be rounded.
[I 20241108 10:53:32] No rounding scheme detected for column 'native-country'. Data will not be rounded.
[I 20241108 10:53:32] No rounding scheme detected for column 'income'. Data will not be rounded.
[I 20241108 10:53:32] {'EVENT': 'Fit processed data', 'TIMESTAMP': datetime.datetime(2024, 11, 8, 10, 53, 32, 386597), 'SYNTHESIZER CLASS NAME': 'GaussianCopulaSynthesizer', 'SYNTHESIZER ID': 'GaussianCopulaSynthesizer_1.17.1_d3175d5161e9425da3f126b53068e931', 'TOTAL NUMBER OF TABLES': 1, 'TOTAL NUMBER OF ROWS': 28558, 'TOTAL NUMBER OF COLUMNS': 15}
[I 20241108 10:53:32] Fitting Gaussian



Synthesizer (SDV): Fitting GaussianCopula spent 1.5715 sec.
[I 20241108 10:53:37] {'EVENT': 'Sample', 'TIMESTAMP': datetime.datetime(2024, 11, 8, 10, 53, 37, 101767), 'SYNTHESIZER CLASS NAME': 'GaussianCopulaSynthesizer', 'SYNTHESIZER ID': 'GaussianCopulaSynthesizer_1.17.1_32e927e3dc154256897a1d698d0539aa', 'TOTAL NUMBER OF TABLES': 1, 'TOTAL NUMBER OF ROWS': 32561, 'TOTAL NUMBER OF COLUMNS': 15}
Synthesizer (SDV): Sampling GaussianCopula # 32561 rows (same as Loader data) in 0.3903 sec.
Now is Postprocessor with demo...
[I 20241108 10:53:37] MediatorEncoder is created.
[I 20241108 10:53:37] scaler inverse transformation done.
[I 20241108 10:53:37] <PETsARD.processor.mediator.MediatorEncoder object at 0x3256a6fb0> transformation done.
[I 20241108 10:53:37] encoder inverse transformation done.
[I 20241108 10:53:37] missing inverse transformation done.
[I 20241108 10:53:37] age changes data dtype from float64 to int8 for metadata alignment.
[I 20241108 10:53:37] workclass changes data dt

In [17]:
pp.pprint(exec.get_result())

{'Loader[default]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]_Reporter[save_data]': {'Loader[default]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]':        age         workclass  fnlwgt     education  educational-num  \
0       45           Private   84837  Some-college               16   
1       21           Private   77454       HS-grad               11   
2       33  Self-emp-not-inc   40725     Bachelors                7   
3       45           Private  200900  Some-college               10   
4       53      Self-emp-inc  249495     Bachelors               11   
...    ...               ...     ...           ...              ...   
48837   34  Self-emp-not-inc  262869  Some-college                8   
48838   20       Federal-gov  243501       HS-grad               12   
48839   28           Private  284252  Some-college               10   
48840   49         State-gov  139576       HS-grad               12   
48841   38           Private 

---

## User Story D-3
**Synthesizing and Evaluating on default data**

Following User Story D-1, if the user enables the evaluation step, the evaluation module produces a report covering the default privacy risk and utility metrics for all datasets.

In [18]:
config_file = '../yaml/User Story D-3.yaml'

with open(config_file, 'r') as yaml_file:
    yaml_raw: dict = yaml.safe_load(yaml_file)
pp.pprint(yaml_raw)

{'Loader': {'demo': {'method': 'default'}},
 'Preprocessor': {'demo': {'method': 'default'}},
 'Synthesizer': {'sdv-gaussian': {'method': 'sdv-single_table-gaussiancopula'}},
 'Postprocessor': {'demo': {'method': 'default'}},
 'Evaluator': {'demo': {'method': 'default'}},
 'Reporter': {'save_data': {'method': 'save_data',
                            'output': 'User Story D-3',
                            'source': 'Postprocessor'},
              'save_report_global': {'method': 'save_report',
                                     'output': 'User Story D-3',
                                     'eval': 'demo',
                                     'granularity': 'global'}}}


In [19]:
exec = Executor(config=config_file)
exec.run()

Now is Loader with demo...
Loader - Benchmarker: file benchmark/adult-income.csv already exist and match SHA-256.
                      PETsARD will ignore download and use local data directly.
Now is Preprocessor with demo...
[I 20241108 10:53:37] MediatorMissing is created.
[I 20241108 10:53:37] MediatorOutlier is created.
[I 20241108 10:53:37] MediatorEncoder is created.
[I 20241108 10:53:37] missing fitting done.
[I 20241108 10:53:37] <PETsARD.processor.mediator.MediatorMissing object at 0x3256a78b0> fitting done.
[I 20241108 10:53:37] outlier fitting done.
[I 20241108 10:53:37] <PETsARD.processor.mediator.MediatorOutlier object at 0x3255ca110> fitting done.
[I 20241108 10:53:37] encoder fitting done.
[I 20241108 10:53:37] <PETsARD.processor.mediator.MediatorEncoder object at 0x3255c9840> fitting done.
[I 20241108 10:53:37] scaler fitting done.
[I 20241108 10:53:37] missing transformation done.
[I 20241108 10:53:37] <PETsARD.processor.mediator.MediatorMissing object at 0x3256a78b0>



[I 20241108 10:53:38] No rounding scheme detected for column 'gender'. Data will not be rounded.
[I 20241108 10:53:38] No rounding scheme detected for column 'capital-gain'. Data will not be rounded.
[I 20241108 10:53:38] No rounding scheme detected for column 'capital-loss'. Data will not be rounded.
[I 20241108 10:53:38] No rounding scheme detected for column 'hours-per-week'. Data will not be rounded.
[I 20241108 10:53:38] No rounding scheme detected for column 'native-country'. Data will not be rounded.
[I 20241108 10:53:38] No rounding scheme detected for column 'income'. Data will not be rounded.
[I 20241108 10:53:38] {'EVENT': 'Fit processed data', 'TIMESTAMP': datetime.datetime(2024, 11, 8, 10, 53, 38, 294565), 'SYNTHESIZER CLASS NAME': 'GaussianCopulaSynthesizer', 'SYNTHESIZER ID': 'GaussianCopulaSynthesizer_1.17.1_e3ad61c08ef54498915d53300fc9d68c', 'TOTAL NUMBER OF TABLES': 1, 'TOTAL NUMBER OF ROWS': 28558, 'TOTAL NUMBER OF COLUMNS': 15}
[I 20241108 10:53:38] Fitting Gaussian

  contingency_real = real.groupby(list(columns), dropna=False).size() / len(real)
  contingency_synthetic = synthetic.groupby(list(columns), dropna=False).size() / len(

(2/2) Evaluating Column Pair Trends: |██████████| 105/105 [00:00<00:00, 298.06it/s]|
Column Pair Trends Score: 60.65%

Overall Score (Average): 78.02%

Now is Reporter with save_data...
Now is User Story D-3_Loader[demo]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo] save to csv...



Now is Reporter with save_report_global...
Now is User Story D-3[Report]_demo_[global] save to csv...
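
The evaluation log above reports a Column Pair Trends score of 60.65% and an overall score of 78.02% labelled "(Average)", with the step counter showing "(2/2)". The first property's score is cut off in the captured output. As a worked sketch, assuming the overall score is the plain mean of exactly two property scores, the missing score can be back-calculated. Both function names are hypothetical, and the result is only approximate because the printed values are rounded.

```python
def overall_score(property_scores: dict[str, float]) -> float:
    """Plain mean of the per-property scores (assumed aggregation rule)."""
    return sum(property_scores.values()) / len(property_scores)


def implied_missing_score(overall: float, known: list[float], n_properties: int) -> float:
    """Solve the mean for the one score that is not shown."""
    return overall * n_properties - sum(known)


column_pair_trends = 60.65  # from the log
overall = 78.02             # from the log

# Assumption: two properties in total, so one score is missing.
first_property = implied_missing_score(overall, [column_pair_trends], n_properties=2)
print(f"Implied first property score: {first_property:.2f}%")

# Consistency check: averaging the two scores reproduces the overall score.
check = overall_score({"first": first_property, "pair_trends": column_pair_trends})
print(f"Recomputed overall score: {check:.2f}%")
```

If the evaluator weighted properties differently, the implied value would change, so treat it as a sanity check rather than a reported metric.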


In [20]:
pp.pprint(exec.get_result())

{'Loader[demo]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]_Evaluator[demo]_Reporter[save_data]': {'Loader[demo]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]':        age         workclass  fnlwgt     education  educational-num  \
0       45         Local-gov  220745     Bachelors                9   
1       31         Local-gov  181721       HS-grad                8   
2       51           Private  210846       HS-grad               11   
3       35           Private  196491  Some-college                8   
4       45  Self-emp-not-inc   45410       Masters               13   
...    ...               ...     ...           ...              ...   
48837   36           Private   69793  Some-college               10   
48838   43           Private  117850       HS-grad                8   
48839   26           Private  214151       HS-grad               11   
48840   50           Private   40547     Bachelors                9   
48841   37         