# User Story A
**Privacy Enhancing Data Generation**

This demo shows how to generate privacy-enhanced data with `PETsARD`.

In this scenario, you already have a data file on your local machine. `PETsARD` loads that file and then generates a privacy-enhanced version of it.

Privacy-enhancing algorithms often restrict the input format and need specific pre-processing and post-processing steps to work correctly. `PETsARD` handles this for you: it offers both default and customizable pre-processing and post-processing workflows so that you can get started quickly.

---

## Environment

In [1]:
import os
import pprint
import sys

import yaml


# Add the parent directory to sys.path so the local PETsARD package can be imported
path_petsard = os.path.dirname(os.getcwd())
sys.path.append(path_petsard)
# Pretty-printer used to display the parsed YAML configuration
pp = pprint.PrettyPrinter(depth=3, sort_dicts=False)


from petsard import Executor  # noqa: E402

---

## User Story A-1
**Default Synthesizing**

Given an original dataset and no specified algorithm, the pipeline generates a set of privacy-enhanced datasets using the default algorithms.

In [2]:
config_file = '../yaml/User Story A-1.yaml'

with open(config_file, 'r') as yaml_file:
    yaml_raw: dict = yaml.safe_load(yaml_file)
pp.pprint(yaml_raw)

{'Loader': {'adult': {'filepath': '../benchmark/adult-income.csv',
                      'na_values': {...}}},
 'Preprocessor': {'demo': {'method': 'default'}},
 'Synthesizer': {'demo': {'method': 'default'}},
 'Postprocessor': {'demo': {'method': 'default'}},
 'Reporter': {'save_data': {'method': 'save_data',
                            'output': 'User Story A-1',
                            'source': 'Postprocessor'}}}


In [3]:
exec = Executor(config=config_file)
exec.run()



Synthesizer (SDV): Fitting GaussianCopula.
Synthesizer (SDV): Fitting GaussianCopula spent 1.6788 sec.


INFO:root:age changes data dtype from float64 to int8 for metadata alignment.
INFO:root:workclass changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:fnlwgt changes data dtype from float64 to int32 for metadata alignment.
INFO:root:education changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:educational-num changes data dtype from float64 to int8 for metadata alignment.
INFO:root:marital-status changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:occupation changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:relationship changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:race changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:gender changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:capi

Synthesizer (SDV): Sampling GaussianCopula # 48842 rows (same as Loader data) in 0.582 sec.
Now is User Story A-1_Loader[adult]_Preprocessor[demo]_Synthesizer[demo]_Postprocessor[demo] save to csv...
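The file name in the last log line follows a visible pattern: each module contributes a `Module[experiment]` part, the parts are joined with underscores, and the Reporter's `output` value is used as a prefix. The sketch below reproduces that pattern from the configuration dictionary. It is inferred only from the output shown in this demo; `chain_label` and `saved_name` are hypothetical helpers rather than part of the `PETsARD` API, and the `na_values` entry is left out because its content is elided in the printed configuration.

```python
# Hypothetical helpers: they mirror the naming seen in this demo's output,
# not PETsARD's internal implementation.

config = {
    "Loader": {"adult": {"filepath": "../benchmark/adult-income.csv"}},
    "Preprocessor": {"demo": {"method": "default"}},
    "Synthesizer": {"demo": {"method": "default"}},
    "Postprocessor": {"demo": {"method": "default"}},
    "Reporter": {
        "save_data": {
            "method": "save_data",
            "output": "User Story A-1",
            "source": "Postprocessor",
        }
    },
}


def chain_label(config: dict, last_module: str) -> str:
    """Join 'Module[experiment]' parts up to and including last_module.

    Assumes exactly one experiment per module, as in this demo.
    """
    parts = []
    for module, experiments in config.items():
        (experiment,) = experiments  # fails loudly if there is not exactly one
        parts.append(f"{module}[{experiment}]")
        if module == last_module:
            break
    return "_".join(parts)


def saved_name(config: dict, reporter: str = "save_data") -> str:
    """Prefix the chain label with the Reporter's 'output' value."""
    settings = config["Reporter"][reporter]
    return f"{settings['output']}_{chain_label(config, settings['source'])}"


print(saved_name(config))
print(chain_label(config, "Reporter"))
```

The first printed string matches the name in the log line above, and the second matches the outer key of the dictionary returned by `get_result()` in this run.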


In [4]:
pp.pprint(exec.get_result())

{'Loader[adult]_Preprocessor[demo]_Synthesizer[demo]_Postprocessor[demo]_Reporter[save_data]': {'Loader[adult]_Preprocessor[demo]_Synthesizer[demo]_Postprocessor[demo]':        age  workclass  fnlwgt   education  educational-num  \
0       60    Private   93485   Bachelors               15   
1       32    Private  100200     HS-grad               10   
2       34    Private   44330     HS-grad                7   
3       39    Private  184431     HS-grad               10   
4       44        nan  268664  Assoc-acdm               11   
...    ...        ...     ...         ...              ...   
48837   36    Private  267334     HS-grad                8   
48838   50    Private  245409     HS-grad               11   
48839   43    Private  333487        10th                9   
48840   56  Local-gov  151796     HS-grad               12   
48841   26  Local-gov  292967     HS-grad               10   

           marital-status         occupation   relationship   race  gender  \
0      
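The printed result is a dictionary nested two levels deep: the outer key names the full chain including the Reporter, and the inner key names the chain whose data was saved. A minimal sketch for pulling the inner objects out follows. `flatten_result` is a hypothetical helper, not a `PETsARD` function, and a plain list stands in for the pandas DataFrame so that the snippet runs without `PETsARD` installed.

```python
def flatten_result(result: dict) -> dict:
    """Collapse {outer_key: {inner_key: data}} into {inner_key: data}."""
    flat = {}
    for inner in result.values():
        for inner_key, data in inner.items():
            flat[inner_key] = data
    return flat


# Stand-in for exec.get_result(); in the real run the values are DataFrames.
chain = "Loader[adult]_Preprocessor[demo]_Synthesizer[demo]_Postprocessor[demo]"
mock_result = {f"{chain}_Reporter[save_data]": {chain: ["row 0", "row 1"]}}

for name, data in flatten_result(mock_result).items():
    print(name, "->", len(data), "rows")
```

In the notebook itself, the same helper could be applied to the real result, for example `flatten_result(exec.get_result())[chain]`, to obtain the synthetic data directly.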

---

## User Story A-2
**Customized Synthesizing**

Given an original dataset together with a specified privacy-enhancing data generation algorithm and its parameters, the pipeline generates a privacy-enhanced dataset accordingly.

In [5]:
config_file = '../yaml/User Story A-2.yaml'

with open(config_file, 'r') as yaml_file:
    yaml_raw: dict = yaml.safe_load(yaml_file)
pp.pprint(yaml_raw)

{'Loader': {'adult': {'filepath': '../benchmark/adult-income.csv',
                      'na_values': {...}}},
 'Preprocessor': {'demo': {'method': 'default'}},
 'Synthesizer': {'sdv-gaussian': {'method': 'sdv-single_table-gaussiancopula'}},
 'Postprocessor': {'demo': {'method': 'default'}},
 'Reporter': {'save_data': {'method': 'save_data',
                            'output': 'User Story A-1',
                            'source': 'Postprocessor'}}}
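Comparing this configuration with the one printed for User Story A-1 shows what "customized" means here. The sketch below copies both dictionaries as printed and lists the modules whose settings differ. `diff_modules` is a hypothetical helper, and the `na_values` entry is omitted from both dictionaries because its content is elided in the printed output.

```python
# Configuration as printed for User Story A-1 (na_values omitted: elided).
config_a1 = {
    "Loader": {"adult": {"filepath": "../benchmark/adult-income.csv"}},
    "Preprocessor": {"demo": {"method": "default"}},
    "Synthesizer": {"demo": {"method": "default"}},
    "Postprocessor": {"demo": {"method": "default"}},
    "Reporter": {"save_data": {"method": "save_data",
                               "output": "User Story A-1",
                               "source": "Postprocessor"}},
}

# Configuration as printed for User Story A-2 (na_values omitted: elided).
config_a2 = {
    "Loader": {"adult": {"filepath": "../benchmark/adult-income.csv"}},
    "Preprocessor": {"demo": {"method": "default"}},
    "Synthesizer": {
        "sdv-gaussian": {"method": "sdv-single_table-gaussiancopula"}
    },
    "Postprocessor": {"demo": {"method": "default"}},
    "Reporter": {"save_data": {"method": "save_data",
                               "output": "User Story A-1",
                               "source": "Postprocessor"}},
}


def diff_modules(left: dict, right: dict) -> list:
    """Return the module names whose settings differ between two configs."""
    modules = list(left) + [m for m in right if m not in left]
    return [m for m in modules if left.get(m) != right.get(m)]


print(diff_modules(config_a1, config_a2))  # → ['Synthesizer']
```

Only the Synthesizer block changes: the experiment is named `sdv-gaussian` and the method is given explicitly. The Reporter's `output` value is `User Story A-1` in both printed configurations.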


In [6]:
exec = Executor(config=config_file)
exec.run()



Synthesizer (SDV): Fitting GaussianCopula.
Synthesizer (SDV): Fitting GaussianCopula spent 2.0192 sec.


INFO:root:age changes data dtype from float64 to int8 for metadata alignment.
INFO:root:workclass changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:fnlwgt changes data dtype from float64 to int32 for metadata alignment.
INFO:root:education changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:educational-num changes data dtype from float64 to int8 for metadata alignment.
INFO:root:marital-status changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:occupation changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:relationship changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:race changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:gender changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:capi

Synthesizer (SDV): Sampling GaussianCopula # 48842 rows (same as Loader data) in 0.6032 sec.
Now is User Story A-1_Loader[adult]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo] save to csv...


In [7]:
pp.pprint(exec.get_result())

{'Loader[adult]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]_Reporter[save_data]': {'Loader[adult]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]':        age         workclass  fnlwgt     education  educational-num  \
0       59               nan   83648  Some-college               15   
1       32           Private  108365       HS-grad               11   
2       24           Private   50659  Some-college                8   
3       44           Private  177742       HS-grad               10   
4       27           Private  331487       7th-8th               13   
...    ...               ...     ...           ...              ...   
48837   41         Local-gov  249685       HS-grad                8   
48838   46           Private  165371       HS-grad               11   
48839   53           Private  314941          10th                8   
48840   47         Local-gov  177315       HS-grad               13   
48841   30  Self-emp-not-inc  278