# User Story A
**Privacy Enhancing Data Generation**

This demo shows how to generate privacy-enhanced data with `PETsARD`.

In this scenario, you already have a data file on your local machine. `PETsARD` loads that file and generates a privacy-enhanced version of it.

Privacy-enhancing algorithms often impose format restrictions and need specific pre-processing and post-processing steps to work correctly. `PETsARD` handles this for you: it provides both default and customizable pre-processing and post-processing workflows so that you can get started quickly.
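To make the role of pre-processing and post-processing concrete, here is a minimal sketch of the encode, synthesize, decode round trip for one categorical column. It is not how `PETsARD` implements these steps: the helpers `fit_encoder`, `encode`, `decode`, and `toy_synthesize` are hypothetical, and the "synthesizer" is a simple frequency-based resampler used only as a stand-in.

```python
import random
from collections import Counter


def fit_encoder(values):
    """Pre-processing: map each distinct category to an integer code."""
    categories = sorted(set(values))
    return {cat: code for code, cat in enumerate(categories)}


def encode(values, mapping):
    """Replace every category with its integer code."""
    return [mapping[v] for v in values]


def decode(codes, mapping):
    """Post-processing: turn integer codes back into readable categories."""
    inverse = {code: cat for cat, code in mapping.items()}
    return [inverse[c] for c in codes]


def toy_synthesize(codes, n_rows, seed=0):
    """Stand-in synthesizer: resample codes by their observed frequencies."""
    rng = random.Random(seed)
    counts = Counter(codes)
    return rng.choices(list(counts.keys()), weights=list(counts.values()), k=n_rows)


# Hypothetical toy column, loosely modelled on the 'workclass' field.
original = ["Private", "Local-gov", "Private", "State-gov", "Private"]
mapping = fit_encoder(original)
synthetic_codes = toy_synthesize(encode(original, mapping), n_rows=8)
synthetic = decode(synthetic_codes, mapping)
print(synthetic)
```

The synthesizer only ever sees numeric codes, which is why the inverse mapping has to be kept and applied afterwards to get human-readable output.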

---

## Environment

In [8]:
import os
import pprint
import sys

import yaml


# Setting up the path to the PETsARD package
path_petsard = os.path.dirname(os.getcwd())
print(path_petsard)
sys.path.append(path_petsard)
# Pretty-printer used to display the parsed YAML config
pp = pprint.PrettyPrinter(depth=3, sort_dicts=False)


from PETsARD import Executor

/Users/justyn.chen/Dropbox/310_Career_工作/20231016_NICS_資安院/41_PETsARD/PETsARD


---

## User Story A-1
**Default Synthesizing**

Given an original dataset and no specified algorithm, the pipeline generates a set of privacy-enhanced datasets using the default algorithms.

In [9]:
config_file = '../yaml/User Story A-1.yaml'

with open(config_file, 'r') as yaml_file:
    yaml_raw: dict = yaml.safe_load(yaml_file)
pp.pprint(yaml_raw)

{'Loader': {'adult': {'filepath': 'benchmark/adult-income.csv',
                      'na_values': {...}}},
 'Preprocessor': {'demo': {'method': 'default'}},
 'Synthesizer': {'demo': {'method': 'default'}},
 'Postprocessor': {'demo': {'method': 'default'}},
 'Reporter': {'save_data': {'method': 'save_data',
                            'output': 'User Story A-1',
                            'source': 'Postprocessor'}}}
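The printed config suggests a two-level layout: each top-level key names a module, and each entry beneath it names an experiment with its parameters. The sketch below walks that structure and lists the steps in order. The helper `describe_steps` is hypothetical and not part of `PETsARD`, and the `na_values` entry is left out because its contents are elided in the output above.

```python
# Config copied from the printed output above, minus the elided 'na_values'.
config = {
    "Loader": {"adult": {"filepath": "benchmark/adult-income.csv"}},
    "Preprocessor": {"demo": {"method": "default"}},
    "Synthesizer": {"demo": {"method": "default"}},
    "Postprocessor": {"demo": {"method": "default"}},
    "Reporter": {"save_data": {"method": "save_data",
                               "output": "User Story A-1",
                               "source": "Postprocessor"}},
}


def describe_steps(cfg):
    """Return (module, experiment name, method) for each configured step."""
    steps = []
    for module, experiments in cfg.items():
        for name, params in experiments.items():
            # Not every module declares a 'method' (the Loader here does not).
            steps.append((module, name, params.get("method")))
    return steps


for module, name, method in describe_steps(config):
    print(f"{module}[{name}] -> method={method}")
```

The `Module[name]` labels printed here follow the same pattern as the keys that appear later in the result dictionary.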


In [10]:
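# Build the pipeline from the YAML config and run it.
# Note: the name `exec` shadows Python's built-in exec() in this session.
exec = Executor(config=config_file)
exec.run()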

Now is Loader with adult...
Now is Preprocessor with demo...
[I 20241108 10:53:24] MediatorMissing is created.
[I 20241108 10:53:24] MediatorOutlier is created.
[I 20241108 10:53:24] MediatorEncoder is created.
[I 20241108 10:53:24] missing fitting done.
[I 20241108 10:53:24] <PETsARD.processor.mediator.MediatorMissing object at 0x328c01ea0> fitting done.
[I 20241108 10:53:24] outlier fitting done.
[I 20241108 10:53:24] <PETsARD.processor.mediator.MediatorOutlier object at 0x328c43c10> fitting done.
[I 20241108 10:53:24] encoder fitting done.
[I 20241108 10:53:24] <PETsARD.processor.mediator.MediatorEncoder object at 0x328c40340> fitting done.
[I 20241108 10:53:24] scaler fitting done.
[I 20241108 10:53:24] missing transformation done.
[I 20241108 10:53:24] <PETsARD.processor.mediator.MediatorMissing object at 0x328c01ea0> transformation done.
[I 20241108 10:53:24] outlier transformation done.
[I 20241108 10:53:24] <PETsARD.processor.mediator.MediatorOutlier object at 0x328c43c10> tran



[I 20241108 10:53:25] No rounding scheme detected for column 'income'. Data will not be rounded.
[I 20241108 10:53:25] {'EVENT': 'Fit processed data', 'TIMESTAMP': datetime.datetime(2024, 11, 8, 10, 53, 25, 117100), 'SYNTHESIZER CLASS NAME': 'GaussianCopulaSynthesizer', 'SYNTHESIZER ID': 'GaussianCopulaSynthesizer_1.17.1_59c6cf4b2d5a40d5811c8108878a0527', 'TOTAL NUMBER OF TABLES': 1, 'TOTAL NUMBER OF ROWS': 26933, 'TOTAL NUMBER OF COLUMNS': 15}
[I 20241108 10:53:25] Fitting GaussianMultivariate(distribution="{'age': <class 'copulas.univariate.beta.BetaUnivariate'>, 'workclass': <class 'copulas.univariate.beta.BetaUnivariate'>, 'fnlwgt': <class 'copulas.univariate.beta.BetaUnivariate'>, 'education': <class 'copulas.univariate.beta.BetaUnivariate'>, 'educational-num': <class 'copulas.univariate.beta.BetaUnivariate'>, 'marital-status': <class 'copulas.univariate.beta.BetaUnivariate'>, 'occupation': <class 'copulas.univariate.beta.BetaUnivariate'>, 'relationship': <class 'copulas.univariat

In [11]:
pp.pprint(exec.get_result())

{'Loader[adult]_Preprocessor[demo]_Synthesizer[demo]_Postprocessor[demo]_Reporter[save_data]': {'Loader[adult]_Preprocessor[demo]_Synthesizer[demo]_Postprocessor[demo]':        age         workclass  fnlwgt     education  educational-num  \
0       37         Local-gov   68339       Masters                9   
1       22         State-gov  102557       HS-grad                9   
2       29           Private   50999  Some-college               12   
3       46           Private  171244  Some-college                8   
4       34           Private  343554     Bachelors               14   
...    ...               ...     ...           ...              ...   
48837   43  Self-emp-not-inc  251450  Some-college                9   
48838   20       Federal-gov  149590     Bachelors                8   
48839   31           Private  266889       HS-grad               10   
48840   41         State-gov  165104     Bachelors                9   
48841   48           Private  311847       Master
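
Judging by the output above, the result is a nested mapping: the outer key is the full experiment name ending in the Reporter step, and the inner key is the name of the data the Reporter picked up. The sketch below shows one way to walk such a structure. The helpers `parse_result_key` and `flatten_result` are hypothetical, the key format is inferred from this output only, and a placeholder string stands in for the real DataFrame.

```python
import re


def parse_result_key(key):
    """Split 'Loader[adult]_Preprocessor[demo]' into (module, name) pairs."""
    return re.findall(r"([A-Za-z]+)\[([^\]]+)\]", key)


def flatten_result(result):
    """Yield (reporter step, data key, data) from the nested result mapping."""
    for outer_key, inner in result.items():
        reporter = parse_result_key(outer_key)[-1]  # last step in the outer key
        for inner_key, data in inner.items():
            yield reporter, inner_key, data


base = "Loader[adult]_Preprocessor[demo]_Synthesizer[demo]_Postprocessor[demo]"
# Placeholder standing in for the synthetic DataFrame in the real result.
fake_result = {f"{base}_Reporter[save_data]": {base: "<synthetic DataFrame placeholder>"}}

for reporter, data_key, data in flatten_result(fake_result):
    print(reporter, parse_result_key(data_key), data)
```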

---

## User Story A-2
**Customized Synthesizing**

Given an original dataset together with a specified privacy-enhancing data generation algorithm and its parameters, the pipeline generates a privacy-enhanced dataset accordingly.

In [12]:
config_file = '../yaml/User Story A-2.yaml'

with open(config_file, 'r') as yaml_file:
    yaml_raw: dict = yaml.safe_load(yaml_file)
pp.pprint(yaml_raw)

{'Loader': {'adult': {'filepath': 'benchmark/adult-income.csv',
                      'na_values': {...}}},
 'Preprocessor': {'demo': {'method': 'default'}},
 'Synthesizer': {'sdv-gaussian': {'method': 'sdv-single_table-gaussiancopula'}},
 'Postprocessor': {'demo': {'method': 'default'}},
 'Reporter': {'save_data': {'method': 'save_data',
                            'output': 'User Story A-1',
                            'source': 'Postprocessor'}}}
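Comparing this printed config with the one from User Story A-1, only the `Synthesizer` block differs. The sketch below makes that comparison explicit with a small recursive diff. The helper `diff_configs` is hypothetical, and the `Loader` and `Reporter` blocks are omitted because they are identical in both printed outputs.

```python
def diff_configs(a, b, path=()):
    """Return (path, old, new) for every value that differs between two nested dicts."""
    changes = []
    for key in sorted(set(a) | set(b)):
        here = path + (key,)
        left, right = a.get(key), b.get(key)
        if isinstance(left, dict) and isinstance(right, dict):
            # Both sides have a sub-dict: compare them entry by entry.
            changes.extend(diff_configs(left, right, here))
        elif left != right:
            changes.append((here, left, right))
    return changes


# Blocks copied from the printed A-1 and A-2 configs.
a1 = {"Preprocessor": {"demo": {"method": "default"}},
      "Synthesizer": {"demo": {"method": "default"}},
      "Postprocessor": {"demo": {"method": "default"}}}
a2 = {"Preprocessor": {"demo": {"method": "default"}},
      "Synthesizer": {"sdv-gaussian": {"method": "sdv-single_table-gaussiancopula"}},
      "Postprocessor": {"demo": {"method": "default"}}}

for path, old, new in diff_configs(a1, a2):
    print(" > ".join(path), ":", old, "->", new)
```

The logs of both runs in this notebook report `GaussianCopulaSynthesizer`, which suggests that the default method resolved to the same model as the explicit setting here. That is an observation from these logs, not a documented guarantee.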


In [13]:
# Re-create the executor with the A-2 config and run the pipeline again.
exec = Executor(config=config_file)
exec.run()

Now is Loader with adult...
Now is Preprocessor with demo...
[I 20241108 10:53:27] MediatorMissing is created.
[I 20241108 10:53:27] MediatorOutlier is created.
[I 20241108 10:53:27] MediatorEncoder is created.
[I 20241108 10:53:27] missing fitting done.
[I 20241108 10:53:27] <PETsARD.processor.mediator.MediatorMissing object at 0x328c42da0> fitting done.
[I 20241108 10:53:27] outlier fitting done.
[I 20241108 10:53:27] <PETsARD.processor.mediator.MediatorOutlier object at 0x328c01f90> fitting done.
[I 20241108 10:53:27] encoder fitting done.
[I 20241108 10:53:27] <PETsARD.processor.mediator.MediatorEncoder object at 0x328c003a0> fitting done.
[I 20241108 10:53:27] scaler fitting done.
[I 20241108 10:53:27] missing transformation done.
[I 20241108 10:53:27] <PETsARD.processor.mediator.MediatorMissing object at 0x328c42da0> transformation done.
[I 20241108 10:53:27] outlier transformation done.
[I 20241108 10:53:27] <PETsARD.processor.mediator.MediatorOutlier object at 0x328c01f90> tran



[I 20241108 10:53:28] No rounding scheme detected for column 'capital-loss'. Data will not be rounded.
[I 20241108 10:53:28] No rounding scheme detected for column 'hours-per-week'. Data will not be rounded.
[I 20241108 10:53:28] No rounding scheme detected for column 'native-country'. Data will not be rounded.
[I 20241108 10:53:28] No rounding scheme detected for column 'income'. Data will not be rounded.
[I 20241108 10:53:28] {'EVENT': 'Fit processed data', 'TIMESTAMP': datetime.datetime(2024, 11, 8, 10, 53, 28, 334683), 'SYNTHESIZER CLASS NAME': 'GaussianCopulaSynthesizer', 'SYNTHESIZER ID': 'GaussianCopulaSynthesizer_1.17.1_06ae7d20e94c4085a879446507fb9320', 'TOTAL NUMBER OF TABLES': 1, 'TOTAL NUMBER OF ROWS': 26933, 'TOTAL NUMBER OF COLUMNS': 15}
[I 20241108 10:53:28] Fitting GaussianMultivariate(distribution="{'age': <class 'copulas.univariate.beta.BetaUnivariate'>, 'workclass': <class 'copulas.univariate.beta.BetaUnivariate'>, 'fnlwgt': <class 'copulas.univariate.beta.BetaUnivar

In [14]:
pp.pprint(exec.get_result())

{'Loader[adult]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]_Reporter[save_data]': {'Loader[adult]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]':        age  workclass  fnlwgt     education  educational-num  \
0       43  Local-gov  192394          10th                9   
1       25    Private  166077       HS-grad                8   
2       28    Private  242177       HS-grad               12   
3       44    Private  180775  Some-college                8   
4       30    Private  123392          10th               14   
...    ...        ...     ...           ...              ...   
48837   31        nan   47222       HS-grad               10   
48838   22    Private   48518     Assoc-voc                9   
48839   36    Private  176452           9th               10   
48840   31    Private   48300  Some-college               10   
48841   36    Private   84106       HS-grad               10   

           marital-status         occupation 