## User Story D
**Research on Benchmark datasets**


This demo shows how to use `PETsARD`'s benchmark datasets to evaluate synthesis algorithms.

In this scenario you are an advanced user with a basic understanding of differential privacy and synthetic data techniques, and of the metrics used to evaluate them. You want to compare these techniques to address academic or practical questions.

`PETsARD` integrates benchmark datasets commonly used in academia, competitions, and practical applications. This lets you set up experiment combinations across benchmark datasets, synthesis algorithms, and evaluations with little effort, so that you can gather comprehensive supporting data and focus on your academic or development work.

---

## Environment

In [1]:
import os
import pprint
import sys

import yaml


# Setting up the path to the PETsARD package
path_petsard = os.path.dirname(os.getcwd())
print(path_petsard)
sys.path.append(path_petsard)


from PETsARD import Executor

d:\Dropbox\89_other_application\GitHub\PETsARD


> `Demo_UserStory` is a helper function that prints the details of each demonstration; it can be ignored.

In [2]:
def Demo_UserStory(config_file: str):
    """Print the YAML config, run the PETsARD Executor on it, and print the result."""
    print(f"YAML file {config_file} is ...")
    print()
    pp = pprint.PrettyPrinter(depth=3, sort_dicts=False)
    # Load the YAML here only to display it; the Executor reads the file itself.
    with open(config_file, 'r') as yaml_file:
        yaml_raw: dict = yaml.safe_load(yaml_file)
    pp.pprint(yaml_raw)

    print()
    print()
    print("Here's the execution...")
    print()
    # Named `executor` to avoid shadowing the built-in `exec`.
    executor = Executor(config=config_file)
    executor.run()

    print()
    print()
    print("Here's the result...")
    print()
    pp.pprint(executor.get_result())

---

## User Story D-1
**Synthesizing on default data**

Given a specified data synthesis algorithm, the pipeline takes the default benchmark dataset as input (`adult-income` in the run below) and outputs the corresponding privacy-enhanced dataset generated by that algorithm.
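
The YAML file itself is not reproduced in this demo; only its parsed form is printed by the next cell. As a rough sketch, a file that parses to that structure could be written as shown here. The layout is reconstructed from the printed dictionary and the file name `user_story_d1_sketch.yaml` is a placeholder, so treat both as assumptions rather than the configuration shipped with `PETsARD`.

```python
from pathlib import Path

# YAML text reconstructed from the parsed configuration printed by the demo;
# the actual file shipped with PETsARD may be laid out differently.
CONFIG_TEXT = """\
Loader:
  demo:
    method: 'default'
Preprocessor:
  demo:
    method: 'default'
Synthesizer:
  sdv-gaussian:
    method: 'sdv-single_table-gaussiancopula'
Postprocessor:
  demo:
    method: 'default'
Reporter:
  save_data:
    method: 'save_data'
    output: 'UserStory_08'
    source: 'Postprocessor'
"""

# Hypothetical output path, chosen only for this sketch.
config_path = Path("user_story_d1_sketch.yaml")
config_path.write_text(CONFIG_TEXT, encoding="utf-8")
print(config_path.read_text(encoding="utf-8"))
```

Each top-level key names a pipeline module, and each second-level key names one experiment for that module.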

In [3]:
config_file = '../yaml/User Story D-1.yaml'

Demo_UserStory(config_file = config_file)

YAML file user_story/User Story D-1.yaml is ...

{'Loader': {'demo': {'method': 'default'}},
 'Preprocessor': {'demo': {'method': 'default'}},
 'Synthesizer': {'sdv-gaussian': {'method': 'sdv-single_table-gaussiancopula'}},
 'Postprocessor': {'demo': {'method': 'default'}},
 'Reporter': {'save_data': {'method': 'save_data',
                            'output': 'UserStory_08',
                            'source': 'Postprocessor'}}}


Here's the execution...

Now is Loader with demo...
Loader - Benchmarker : Success download the benchmark dataset from https://petsard-benchmark.s3.amazonaws.com/adult-income.csv.
Now is Preprocessor with demo...
Now is Synthesizer with sdv-gaussian...
Synthesizer (SDV - SingleTable): Metafile loading time: 0.0328 sec.
Synthesizer (SDV - SingleTable): Fitting GaussianCopula.
Synthesizer (SDV - SingleTable): Fitting  GaussianCopula spent 11.3368 sec.
Synthesizer (SDV - SingleTable): Sampling GaussianCopula # 28558 rows (same as raw) in 1.7402 sec.
Now is Postprocessor with demo...
Now is Reporter with save_data...
Now is UserStory_08_Loader[demo]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo] save to csv...


Here's the result...

{'Loader[demo]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]_Reporter[save_data]': {'Loader[demo]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]':              age     workclass         fnlwgt     education  educational-num  \
0      44.262005       Private  149382.882910       HS-grad        15.773848   
1      35.889065       Private  200163.081419       HS-grad        10.696364   
2      32.005830       Private  194471.515056     Bachelors         6.993173   
3      37.649825       Private  142304.062201  Some-college        10.057643   
4      23.896207     Local-gov  2715

---

## User Story D-2
**Synthesizing on multiple datasets**

Following User Story D-1, the user can instead specify a list of datasets. In the configuration below the Loader defines two entries, `default` and `adult`, and the pipeline runs once for each.
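
The log of the run below shows one pipeline per Loader entry, each named by joining `Module[experiment]` parts with underscores. A minimal sketch of that enumeration, assuming a plain Cartesian product over the modules that precede the Reporter; the helper `experiment_names` is hypothetical and not part of `PETsARD`.

```python
from itertools import product

# Experiment names per module, taken from the D-2 configuration in this section.
config = {
    "Loader": ["default", "adult"],
    "Preprocessor": ["demo"],
    "Synthesizer": ["sdv-gaussian"],
    "Postprocessor": ["demo"],
}


def experiment_names(modules: dict[str, list[str]]) -> list[str]:
    """Hypothetical helper: one name per combination of module experiments."""
    labelled = [
        [f"{module}[{name}]" for name in names]
        for module, names in modules.items()
    ]
    # product() varies the last module fastest, so Loader entries stay in order.
    return ["_".join(parts) for parts in product(*labelled)]


for name in experiment_names(config):
    print(name)
```

Adding a second Synthesizer entry to `config` would double the number of names, which is how a single configuration can describe a grid of experiments.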

In [4]:
config_file = '../yaml/User Story D-2.yaml'

Demo_UserStory(config_file = config_file)

YAML file user_story/User Story D-2.yaml is ...

{'Loader': {'default': {'method': 'default'},
            'adult': {'filepath': 'benchmark://adult', 'na_values': {...}}},
 'Preprocessor': {'demo': {'method': 'default'}},
 'Synthesizer': {'sdv-gaussian': {'method': 'sdv-single_table-gaussiancopula'}},
 'Postprocessor': {'demo': {'method': 'default'}},
 'Reporter': {'save_data': {'method': 'save_data',
                            'output': 'User Story D-2',
                            'source': 'Postprocessor'}}}


Here's the execution...

Now is Loader with default...
Loader - Benchmarker: file benchmark\adult-income.csv already exist and match SHA-256.
                      PETsARD will ignore download and use local data directly.
Now is Preprocessor with demo...
Now is Synthesizer with sdv-gaussian...
Synthesizer (SDV - SingleTable): Metafile loading time: 0.0313 sec.
Synthesizer (SDV - SingleTable): Fitting GaussianCopula.
Synthesizer (SDV - SingleTable): Fitting  GaussianCopula spent 10.7583 sec.
Synthesizer (SDV - SingleTable): Sampling GaussianCopula # 28558 rows (same as raw) in 1.775 sec.
Now is Postprocessor with demo...
Now is Reporter with save_data...
Now is User Story D-2_Loader[default]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo] save to csv...
Now is Loader with adult...
Loader - Benchmarker: file benchmark\adult_uci.csv already exist and match SHA-256.
                      PETsARD will ignore download and use local data directly.
Now is Preprocessor with demo...
Now is Synthesizer with sdv-gaussian...
Synthesizer (SDV - SingleTable): Metafile loading time: 0.0352 sec.
Synthesizer (SDV - SingleTable): Fitting GaussianCopula.
Synthesizer (SDV - SingleTable): Fitting  GaussianCopula spent 6.7353 sec.
Synthesizer (SDV - SingleTable): Sampling GaussianCopula # 18997 rows (same as raw) in 1.2054 sec.
Now is Postprocessor with demo...
Now is Reporter with save_data...
Now is User Story D-2_Loader[adult]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo] save to csv...


Here's the result...

{'Loader[default]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]_Reporter[save_data]': {'Loader[default]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]':              age  workclass         fnlwgt     education  educational-num  \
0      53.034322    Private  295530.480319     Bachelors        15.119559   
1      21.419505    Private  224816.718640       HS-grad        11.119809   
2      24.263561    Private  367977.906073       HS-grad         7.586702   
3      47.628559    Private  179975.294493  Some-college         9.868676   
4      26.260571    Private   73542.533176 
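
The Loader log above reports that a local benchmark file is reused when it already exists and matches its SHA-256 digest. A minimal sketch of such a check using only the standard library follows. The function name `matches_sha256` and the sample file are hypothetical, and `PETsARD`'s own implementation may differ.

```python
import hashlib
import tempfile
from pathlib import Path


def matches_sha256(path: Path, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file exists and its SHA-256 digest equals expected_hex."""
    if not path.is_file():
        return False
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so that large CSV files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.lower()


# Demonstration on a throwaway file (not a real benchmark dataset).
with tempfile.TemporaryDirectory() as tmp:
    content = b"age,workclass\n39,Private\n"
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes(content)
    expected = hashlib.sha256(content).hexdigest()
    print(matches_sha256(sample, expected))   # digest matches: reuse local file
    print(matches_sha256(sample, "0" * 64))   # digest differs: download again
```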

---

## User Story D-3
**Synthesizing and Evaluating on default data**

Following User Story D-1, if the user enables the evaluation step, the evaluation module produces a report covering the default privacy risk and utility metrics for all datasets.

In [5]:
config_file = '../yaml/User Story D-3.yaml'

Demo_UserStory(config_file = config_file)

YAML file user_story/User Story D-3.yaml is ...

{'Loader': {'demo': {'method': 'default'}},
 'Preprocessor': {'demo': {'method': 'default'}},
 'Synthesizer': {'sdv-gaussian': {'method': 'sdv-single_table-gaussiancopula'}},
 'Postprocessor': {'demo': {'method': 'default'}},
 'Evaluator': {'demo': {'method': 'default'}},
 'Reporter': {'save_data': {'method': 'save_data',
                            'output': 'User Story D-3',
                            'source': 'Postprocessor'},
              'save_report_global': {'method': 'save_report',
                                     'output': 'User Story D-3',
                                     'eval': 'demo',
                                     'granularity': 'global'}}}


Here's the execution...

Now is Loader with demo...
Loader - Benchmarker: file benchmark\adult-income.csv already exist and match SHA-256.
                      PETsARD will ignore download and use local data directly.
Now is Preprocessor with demo...
Now is Synthesize



Synthesizer (SDV - SingleTable): Fitting  GaussianCopula spent 12.5061 sec.
Synthesizer (SDV - SingleTable): Sampling GaussianCopula # 28558 rows (same as raw) in 1.7105 sec.
Now is Postprocessor with demo...
Now is Evaluator with demo...
Generating report ...
(1/2) Evaluating Column Shapes: : 100%|██████████| 15/15 [00:00<00:00, 67.31it/s]
(2/2) Evaluating Column Pair Trends: : 100%|██████████| 105/105 [00:06<00:00, 16.14it/s]

Overall Score: 77.55%

Properties:
- Column Shapes: 94.5%
- Column Pair Trends: 60.6%
Now is Reporter with save_data...
Now is User Story D-3_Loader[demo]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo] save to csv...
Now is Reporter with save_report_global...
Now is User Story D-3[Report]_demo_[global] save to csv...


Here's the result...

{'Loader[demo]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]_Evaluator[demo]_Reporter[save_data]': {'Loader[demo]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]':       
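
The figures in the quality report printed during the run above can be cross-checked by hand. The overall score is consistent with a plain average of the two property scores, and the progress-bar totals correspond to 15 columns and the 105 unordered column pairs they form. A short check, with all values copied from the log:

```python
from math import comb, isclose

# Values copied from the evaluation log above (percentages as printed).
column_shapes = 94.5
column_pair_trends = 60.6
reported_overall = 77.55

# The overall score is consistent with the mean of the two property scores.
overall = (column_shapes + column_pair_trends) / 2
print(f"{overall:.2f}%")
print(isclose(overall, reported_overall, abs_tol=1e-9))

# 15 columns form C(15, 2) unordered pairs, the total of the second progress bar.
n_columns = 15
print(comb(n_columns, 2))
```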