## User Story D
**Research on Benchmark datasets**


This demo shows how to use `PETsARD`'s benchmark datasets to evaluate synthesis algorithms.

In this demonstration, you are an advanced user with a basic understanding of differential privacy and synthetic data technologies and their corresponding evaluation metrics. You want to compare these technologies and explore related academic and practical questions.

`PETsARD` provides a complete platform with pre-integrated benchmark datasets that are commonly used in academia, competitions, and practical applications. You can set up different benchmark datasets, run them through various synthesis algorithms, and apply different evaluations as experiment combinations. This gives you comprehensive supporting data so that you can focus on your academic or development work.

---

## Environment

In [1]:
import os
import pprint
import sys

import yaml


# Setting up the path to the PETsARD package
path_petsard = os.path.dirname(os.getcwd())
print(path_petsard)
sys.path.append(path_petsard)
# Settings for pretty-printing the YAML content
pp = pprint.PrettyPrinter(depth=3, sort_dicts=False)


from PETsARD import Executor

d:\Dropbox\89_other_application\GitHub\PETsARD


---

## User Story D-1
**Synthesizing on default data**

Given a specified data generation algorithm, the pipeline takes the default benchmark dataset as input and uses that algorithm to produce the corresponding privacy-enhanced dataset as output.

In [2]:
config_file = '../yaml/User Story D-1.yaml'

with open(config_file, 'r') as yaml_file:
    yaml_raw: dict = yaml.safe_load(yaml_file)
pp.pprint(yaml_raw)

{'Loader': {'demo': {'method': 'default'}},
 'Preprocessor': {'demo': {'method': 'default'}},
 'Synthesizer': {'sdv-gaussian': {'method': 'sdv-single_table-gaussiancopula'}},
 'Postprocessor': {'demo': {'method': 'default'}},
 'Reporter': {'save_data': {'method': 'save_data',
                            'output': 'User Story D-1',
                            'source': 'Postprocessor'}}}
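
The YAML file itself is not shown in this demo, only the dictionary that `yaml.safe_load` returns. As a rough sketch, the dictionary above can be rendered back into block-style YAML to see what such a file might look like. The `to_yaml` helper is hypothetical (it is not part of `PETsARD` or PyYAML) and only handles the nested dict-of-strings shape printed above; the real file may differ in quoting and layout.

```python
# Hypothetical helper, not part of PETsARD or PyYAML: renders a nested
# dict whose leaves are strings as block-style YAML text.
def to_yaml(node: dict, indent: int = 0) -> str:
    lines = []
    pad = " " * indent
    for key, value in node.items():
        if isinstance(value, dict):
            # A nested mapping: emit the key, then its children indented.
            lines.append(f"{pad}{key}:")
            lines.append(to_yaml(value, indent + 2))
        else:
            # A leaf: quote the value so spaces and hyphens are preserved.
            lines.append(f"{pad}{key}: '{value}'")
    return "\n".join(lines)


# The configuration exactly as printed by pp.pprint(yaml_raw) above.
config = {
    "Loader": {"demo": {"method": "default"}},
    "Preprocessor": {"demo": {"method": "default"}},
    "Synthesizer": {"sdv-gaussian": {"method": "sdv-single_table-gaussiancopula"}},
    "Postprocessor": {"demo": {"method": "default"}},
    "Reporter": {
        "save_data": {
            "method": "save_data",
            "output": "User Story D-1",
            "source": "Postprocessor",
        }
    },
}

yaml_text = to_yaml(config)
print(yaml_text)
```

Each top-level key names a pipeline module, the second level names an experiment within that module (`demo`, `sdv-gaussian`, `save_data`), and the third level holds that experiment's settings.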


In [3]:
exec = Executor(config=config_file)
exec.run()

Now is Loader with demo...
Loader - Benchmarker: file benchmark\adult-income.csv already exist and match SHA-256.
                      PETsARD will ignore download and use local data directly.
Now is Preprocessor with demo...
Now is Synthesizer with sdv-gaussian...
Synthesizer (SDV - SingleTable): Metafile loading time: 0.0668 sec.
Synthesizer (SDV - SingleTable): Fitting GaussianCopula.




Synthesizer (SDV - SingleTable): Fitting  GaussianCopula spent 10.3226 sec.
Synthesizer (SDV - SingleTable): Sampling GaussianCopula # 28558 rows (same as raw) in 2.0695 sec.
Now is Postprocessor with demo...
Now is Reporter with save_data...
Now is User Story D-1_Loader[demo]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo] save to csv...


In [4]:
pp.pprint(exec.get_result())

{'Loader[demo]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]_Reporter[save_data]': {'Loader[demo]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]':              age  workclass         fnlwgt     education  educational-num  \
0      41.300239  Local-gov  260041.632035     Bachelors         9.357748   
1      38.307939    Private  300656.664397       HS-grad         7.804312   
2      35.926336    Private  370704.146330     Bachelors        11.474106   
3      36.448291    Private  139757.197871  Some-college         7.847397   
4      21.507367    Private   98790.880130          11th        14.081383   
...          ...        ...            ...           ...              ...   
28553  31.414335    Private   53695.309925     Bachelors        14.792524   
28554  42.516247    Private   76986.361495       HS-grad        11.159744   
28555  35.612312    Private   61447.883304  Some-college        10.527065   
28556  44.798094    Private  269301.558394    
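
The keys of the result dictionary follow a visible `Module[experiment]` pattern joined by underscores. A minimal sketch of splitting such a key back into its parts is given below; `parse_result_key` is a hypothetical helper written for this illustration, and it assumes experiment names never contain a `]` character.

```python
import re

# One "Module[experiment]" segment of a result key.
SEGMENT = re.compile(r"([A-Za-z]+)\[([^\]]+)\]")


def parse_result_key(key: str) -> dict:
    """Map each module name found in a result key to its experiment name."""
    return dict(SEGMENT.findall(key))


# Outer key copied from the get_result() output above.
key = (
    "Loader[demo]_Preprocessor[demo]_Synthesizer[sdv-gaussian]"
    "_Postprocessor[demo]_Reporter[save_data]"
)
parsed = parse_result_key(key)
print(parsed)
```

This can be handy for filtering results by synthesizer or loader once several experiments have been run.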

---

## User Story D-2
**Synthesizing on multiple data**

Building on User Story D-1, you can instead specify a list of datasets. The pipeline then runs once for each dataset.

In [5]:
config_file = '../yaml/User Story D-2.yaml'

with open(config_file, 'r') as yaml_file:
    yaml_raw: dict = yaml.safe_load(yaml_file)
pp.pprint(yaml_raw)

{'Loader': {'default': {'method': 'default'},
            'adult': {'filepath': 'benchmark://adult', 'na_values': {...}}},
 'Preprocessor': {'demo': {'method': 'default'}},
 'Synthesizer': {'sdv-gaussian': {'method': 'sdv-single_table-gaussiancopula'}},
 'Postprocessor': {'demo': {'method': 'default'}},
 'Reporter': {'save_data': {'method': 'save_data',
                            'output': 'User Story D-2',
                            'source': 'Postprocessor'}}}
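
With two `Loader` entries and one entry in every other module, this configuration yields two full pipeline passes. One way to picture this is as a Cartesian product over the experiments of each module. The sketch below only illustrates that idea: it assumes `PETsARD` expands experiments this way, which this demo does not confirm beyond the two passes it runs, and `enumerate_runs` is a hypothetical helper rather than a `PETsARD` function.

```python
from itertools import product

# Experiment names per module, taken from the configuration printed above.
# The Reporter is left out because the saved file names do not include it.
experiments = {
    "Loader": ["default", "adult"],
    "Preprocessor": ["demo"],
    "Synthesizer": ["sdv-gaussian"],
    "Postprocessor": ["demo"],
}


def enumerate_runs(experiments: dict) -> list:
    """List one 'Module[experiment]_...' name per combination of experiments."""
    modules = list(experiments)
    runs = []
    for combo in product(*(experiments[m] for m in modules)):
        runs.append("_".join(f"{m}[{e}]" for m, e in zip(modules, combo)))
    return runs


runs = enumerate_runs(experiments)
for name in runs:
    print(name)
```

Under this assumption, adding a second synthesizer to the configuration would double the number of passes to four.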


In [6]:
exec = Executor(config=config_file)
exec.run()

Now is Loader with default...
Loader - Benchmarker: file benchmark\adult-income.csv already exist and match SHA-256.
                      PETsARD will ignore download and use local data directly.


Now is Preprocessor with demo...
Now is Synthesizer with sdv-gaussian...
Synthesizer (SDV - SingleTable): Metafile loading time: 0.0354 sec.
Synthesizer (SDV - SingleTable): Fitting GaussianCopula.




Synthesizer (SDV - SingleTable): Fitting  GaussianCopula spent 12.8732 sec.
Synthesizer (SDV - SingleTable): Sampling GaussianCopula # 28558 rows (same as raw) in 2.0057 sec.
Now is Postprocessor with demo...
Now is Reporter with save_data...
Now is User Story D-2_Loader[default]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo] save to csv...
Now is Loader with adult...
Loader - Benchmarker: file benchmark\adult_uci.csv already exist and match SHA-256.
                      PETsARD will ignore download and use local data directly.
Now is Preprocessor with demo...
Now is Synthesizer with sdv-gaussian...
Synthesizer (SDV - SingleTable): Metafile loading time: 0.0675 sec.
Synthesizer (SDV - SingleTable): Fitting GaussianCopula.




Synthesizer (SDV - SingleTable): Fitting  GaussianCopula spent 7.4971 sec.
Synthesizer (SDV - SingleTable): Sampling GaussianCopula # 18997 rows (same as raw) in 1.2266 sec.
Now is Postprocessor with demo...
Now is Reporter with save_data...
Now is User Story D-2_Loader[adult]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo] save to csv...


In [7]:
pp.pprint(exec.get_result())

{'Loader[default]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]_Reporter[save_data]': {'Loader[default]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]':              age workclass         fnlwgt     education  educational-num  \
0      52.657660   Private  235761.599869       HS-grad        14.969192   
1      27.596470   Private  221023.517485       HS-grad        11.026189   
2      30.045651   Private  304731.381011  Some-college         7.305732   
3      44.118947   Private  152724.650856  Some-college         9.770654   
4      23.482847   Private  174168.801863          10th        13.395511   
...          ...       ...            ...           ...              ...   
28553  45.494394   Private  120104.363513          10th         8.645080   
28554  63.001282   Private  147910.770055       HS-grad        10.886884   
28555  23.537015   Private  105019.024470  Some-college        13.960612   
28556  42.967905   Private  381918.934221  Some-co

---

## User Story D-3
**Synthesizing and Evaluating on default data**

Building on User Story D-1, if you enable the evaluation step, the evaluation module produces a report covering the default privacy risk and utility metrics for all datasets.

In [8]:
config_file = '../yaml/User Story D-3.yaml'

with open(config_file, 'r') as yaml_file:
    yaml_raw: dict = yaml.safe_load(yaml_file)
pp.pprint(yaml_raw)

{'Loader': {'demo': {'method': 'default'}},
 'Preprocessor': {'demo': {'method': 'default'}},
 'Synthesizer': {'sdv-gaussian': {'method': 'sdv-single_table-gaussiancopula'}},
 'Postprocessor': {'demo': {'method': 'default'}},
 'Evaluator': {'demo': {'method': 'default'}},
 'Reporter': {'save_data': {'method': 'save_data',
                            'output': 'User Story D-3',
                            'source': 'Postprocessor'},
              'save_report_global': {'method': 'save_report',
                                     'output': 'User Story D-3',
                                     'eval': 'demo',
                                     'granularity': 'global'}}}
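
In this configuration, the `eval` value of `save_report_global` matches the experiment name under `Evaluator` (both are `demo`). Assuming `eval` is meant to refer to an `Evaluator` experiment by name, a small consistency check before running can catch a mistyped reference. The `unknown_eval_refs` helper below is hypothetical and not part of `PETsARD`.

```python
def unknown_eval_refs(config: dict) -> list:
    """Return (reporter_name, eval_name) pairs with no matching Evaluator entry."""
    evaluators = set(config.get("Evaluator", {}))
    missing = []
    for name, settings in config.get("Reporter", {}).items():
        ref = settings.get("eval")
        # Reporters without an 'eval' key (such as save_data) are skipped.
        if ref is not None and ref not in evaluators:
            missing.append((name, ref))
    return missing


# The Evaluator and Reporter parts of the configuration printed above.
config = {
    "Evaluator": {"demo": {"method": "default"}},
    "Reporter": {
        "save_data": {
            "method": "save_data",
            "output": "User Story D-3",
            "source": "Postprocessor",
        },
        "save_report_global": {
            "method": "save_report",
            "output": "User Story D-3",
            "eval": "demo",
            "granularity": "global",
        },
    },
}

print(unknown_eval_refs(config))
```

An empty list means every report refers to an evaluation that is actually configured.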


In [9]:
exec = Executor(config=config_file)
exec.run()

Now is Loader with demo...
Loader - Benchmarker: file benchmark\adult-income.csv already exist and match SHA-256.
                      PETsARD will ignore download and use local data directly.
Now is Preprocessor with demo...
Now is Synthesizer with sdv-gaussian...
Synthesizer (SDV - SingleTable): Metafile loading time: 0.0622 sec.
Synthesizer (SDV - SingleTable): Fitting GaussianCopula.




Synthesizer (SDV - SingleTable): Fitting  GaussianCopula spent 11.3317 sec.
Synthesizer (SDV - SingleTable): Sampling GaussianCopula # 28558 rows (same as raw) in 2.1094 sec.
Now is Postprocessor with demo...
Now is Evaluator with demo...
Generating report ...
(1/2) Evaluating Column Shapes: : 100%|██████████| 15/15 [00:00<00:00, 70.47it/s]
(2/2) Evaluating Column Pair Trends: : 100%|██████████| 105/105 [00:11<00:00,  9.44it/s]

Overall Score: 77.49%

Properties:
- Column Shapes: 94.41%
- Column Pair Trends: 60.58%
Now is Reporter with save_data...
Now is User Story D-3_Loader[demo]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo] save to csv...
Now is Reporter with save_report_global...
Now is User Story D-3[Report]_demo_[global] save to csv...
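
The reported numbers are consistent with the overall score being the unweighted mean of the two property scores. The sketch below reproduces that arithmetic from the rounded percentages in the log. Treat the averaging rule as an assumption inferred from this output; the inputs are already rounded to two decimals, so the result can differ from the reported 77.49% in the last digit.

```python
# Property scores as printed in the log above, converted to fractions.
property_scores = {
    "Column Shapes": 0.9441,
    "Column Pair Trends": 0.6058,
}

# Assumed rule: the overall score is the plain average of the property scores.
overall = sum(property_scores.values()) / len(property_scores)
print(f"Overall score (recomputed): {overall:.4%}")
```

The gap between the two properties shows that individual column distributions are reproduced far better than the relationships between column pairs, which the single overall figure hides.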


In [10]:
pp.pprint(exec.get_result())

{'Loader[demo]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]_Evaluator[demo]_Reporter[save_data]': {'Loader[demo]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]':              age         workclass         fnlwgt     education  \
0      41.887845           Private   74732.501369     Bachelors   
1      41.879273           Private  130998.151124  Some-college   
2      22.697555           Private   87530.503125  Some-college   
3      35.820800           Private  145147.597971       HS-grad   
4      33.255472           Private  380621.117785    Assoc-acdm   
...          ...               ...            ...           ...   
28553  37.672567  Self-emp-not-inc  297937.029849     Bachelors   
28554  34.497576      Self-emp-inc  241131.708649          10th   
28555  49.701021           Private  347759.329644       7th-8th   
28556  30.407721           Private  149817.405338       HS-grad   
28557  45.412698           Private  159729.941197   Prof-school