# User Story

This demo is a checklist for the "User Stories" section of the PETsARD Spec (the Data Generation Algorithm Analysis and Evaluation System Spec). It can also help users set up their own config files. Enjoy : )

## Stocktaking

Type 1: Generation

1. User Story 1: Describing - **Incomplete**
   - Missing DescriberOperator
   - Missing Reporter, ReporterOperator, and method = 'save'

2. User Story 2: Synthesizing - **Incomplete**
   - Missing Reporter, ReporterOperator, and method = 'save'

3. User Story 3: Default Synthesizing - **Incomplete**
   - Missing Reporter, ReporterOperator, and method = 'save'

4. User Story 4: Evaluating - **Incomplete**
   - Missing Evaluator method = 'default'
   - Missing Reporter, ReporterOperator, and method = 'save'

5. User Story 5: Customized Evaluating - **Incomplete**
   - Missing Evaluator method = 'custom_method'
   - Missing Reporter, ReporterOperator, and method = 'save'

Type 2: Evaluating data

6. User Story 6: Given data evaluating
   - Missing Splitter method = 'custom_data'

7. User Story 7: Given data customized evaluating
   - Missing Splitter method = 'custom_data'

Type 3: Evaluating algorithm

8. User Story 8: Synthesizing on default data



## Environment

In [1]:
import os
import sys

sys.path.append(os.path.dirname(os.getcwd()))

In [2]:
import pprint
from typing import List

import yaml

from PETsARD import Executor


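def DevTest_UserStory(config_file: str, sequence: List[str]):
    """Print the YAML config, run the PETsARD pipeline, and print its results."""
    pp = pprint.PrettyPrinter(depth=3)
    with open(config_file, 'r') as yaml_file:
        yaml_raw: dict = yaml.safe_load(yaml_file)
    pp.pprint(yaml_raw)

    # Named `executor` to avoid shadowing the built-in `exec`.
    executor = Executor(config=config_file, sequence=sequence)
    executor.run()
    pp.pprint(executor.get_result())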

# Type 1: Generation

> Type 1: Generate a privacy-enhanced dataset along with reports

## User Story 1: Describing

> Given a dataset as input, the pipeline can run the "describe" module to produce a summary of the dataset.

In [3]:
config_file = 'UserStory_1.yaml'
sequence = ['Loader'] # 'Describer', 'Reporter'

DevTest_UserStory(
    config_file = config_file,
    sequence = sequence
)

{'Loader': {'adult': {'filepath': 'benchmark://adult', 'na_values': {...}}}}
Now is Loader with adult...
Loader - Benchmarker: file benchmark\adult.csv already exist and match SHA-256.
                      PETsARD will ignore download and use local data directly.
{'Loader[adult]':        age     workclass  fnlwgt     education  educational-num  \
0       25       Private  226802          11th                7   
1       38       Private   89814       HS-grad                9   
2       28     Local-gov  336951    Assoc-acdm               12   
3       44       Private  160323  Some-college               10   
4       18           NaN  103497  Some-college               10   
...    ...           ...     ...           ...              ...   
48837   27       Private  257302    Assoc-acdm               12   
48838   40       Private  154374       HS-grad                9   
48839   58       Private  151910       HS-grad                9   
48840   22       Private  201490       HS-grad 

## User Story 2: Synthesizing

> Given an original dataset and a specified privacy-enhancing data generation algorithm with its parameters, the pipeline generates a privacy-enhanced dataset.

In [4]:
config_file = 'UserStory_2.yaml'
sequence = ['Loader', 'Preprocessor', 'Synthesizer', 'Postprocessor'] # 'Reporter'

DevTest_UserStory(
    config_file = config_file,
    sequence = sequence
)

{'Loader': {'adult': {'filepath': 'benchmark://adult', 'na_values': {...}}},
 'Postprocessor': {'demo': {'method': 'default'}},
 'Preprocessor': {'demo': {'method': 'default'}},
 'Synthesizer': {'sdv-gaussian': {'method': 'sdv-single_table-gaussiancopula'}}}
Now is Loader with adult...
Loader - Benchmarker: file benchmark\adult.csv already exist and match SHA-256.
                      PETsARD will ignore download and use local data directly.
Now is Preprocessor with demo...
Now is Synthesizer with sdv-gaussian...
Synthesizer (SDV - SingleTable): Metafile loading time: 0.0857 sec.
Synthesizer (SDV - SingleTable): Fitting GaussianCopula.
Synthesizer (SDV - SingleTable): Fitting  GaussianCopula spent 9.7242 sec.
Synthesizer (SDV - SingleTable): Sampling GaussianCopula # 26933 rows (same as raw) in 2.1716 sec.
Now is Postprocessor with demo...
{'Loader[adult]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]':              age         workclass         fnlwgt     education 
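The result key printed above appears to be built by joining one `Module[experiment]` part per step, in sequence order, with underscores. The sketch below reproduces that pattern so a key can be predicted before a run. The helper name `expected_result_key` is hypothetical, and the naming rule is inferred from the printed output rather than taken from the PETsARD API.

```python
from typing import Dict, List


def expected_result_key(sequence: List[str], experiments: Dict[str, str]) -> str:
    """Join 'Module[experiment]' parts with underscores, in sequence order.

    Assumption: this mirrors the key format seen in the printed results;
    it is not a documented PETsARD function.
    """
    return "_".join(f"{module}[{experiments[module]}]" for module in sequence)


# Experiment names taken from the User Story 2 config printed above.
key = expected_result_key(
    ["Loader", "Preprocessor", "Synthesizer", "Postprocessor"],
    {
        "Loader": "adult",
        "Preprocessor": "demo",
        "Synthesizer": "sdv-gaussian",
        "Postprocessor": "demo",
    },
)
print(key)
```

This only covers the case shown here, with one experiment per module. How keys are formed when a module has several experiments is not visible in this output.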

## User Story 3: Default Synthesizing

> Given an original dataset with no algorithm specified, the pipeline generates a list of privacy-enhanced datasets using the default algorithms.

In [5]:
config_file = 'UserStory_3.yaml'
sequence = ['Loader', 'Preprocessor', 'Synthesizer', 'Postprocessor'] # 'Reporter'

DevTest_UserStory(
    config_file = config_file,
    sequence = sequence
)

{'Loader': {'adult': {'filepath': 'benchmark://adult', 'na_values': {...}}},
 'Postprocessor': {'demo': {'method': 'default'}},
 'Preprocessor': {'demo': {'method': 'default'}},
 'Synthesizer': {'demo': {'method': 'default'}}}
Now is Loader with adult...
Loader - Benchmarker: file benchmark\adult.csv already exist and match SHA-256.
                      PETsARD will ignore download and use local data directly.
Now is Preprocessor with demo...
Now is Synthesizer with demo...
Synthesizer (SDV - SingleTable): Metafile loading time: 0.0312 sec.
Synthesizer (SDV - SingleTable): Fitting GaussianCopula.
Synthesizer (SDV - SingleTable): Fitting  GaussianCopula spent 9.1163 sec.
Synthesizer (SDV - SingleTable): Sampling GaussianCopula # 26933 rows (same as raw) in 1.9443 sec.
Now is Postprocessor with demo...
{'Loader[adult]_Preprocessor[demo]_Synthesizer[demo]_Postprocessor[demo]':              age  workclass         fnlwgt     education  educational-num  \
0      44.235812  State-gov  183690

## User Story 4: Evaluating

> Following User Stories 2 and 3, if users enable the "evaluate" step, the evaluation module creates a report covering the default privacy risk and utility metrics.

In [6]:
config_file = 'UserStory_4.yaml'
sequence = ['Loader', 'Preprocessor', 'Synthesizer', 'Postprocessor'] # 'Evaluator', 'Reporter'

DevTest_UserStory(
    config_file = config_file,
    sequence = sequence
)

{'Evaluator': {'demo': {'method': 'default'}},
 'Loader': {'adult': {'filepath': 'benchmark://adult', 'na_values': {...}}},
 'Postprocessor': {'demo': {'method': 'default'}},
 'Preprocessor': {'demo': {'method': 'default'}},
 'Synthesizer': {'demo': {'method': 'default'}}}
Now is Loader with adult...
Loader - Benchmarker: file benchmark\adult.csv already exist and match SHA-256.
                      PETsARD will ignore download and use local data directly.
Now is Preprocessor with demo...
Now is Synthesizer with demo...
Synthesizer (SDV - SingleTable): Metafile loading time: 0.027 sec.
Synthesizer (SDV - SingleTable): Fitting GaussianCopula.
Synthesizer (SDV - SingleTable): Fitting  GaussianCopula spent 9.8807 sec.
Synthesizer (SDV - SingleTable): Sampling GaussianCopula # 26933 rows (same as raw) in 1.8995 sec.
Now is Postprocessor with demo...
{'Loader[adult]_Preprocessor[demo]_Synthesizer[demo]_Postprocessor[demo]':              age         workclass         fnlwgt     education  \
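In this run the config defines an `Evaluator` section, but `Evaluator` is not in `sequence`, and the log shows no "Now is Evaluator" step. A small pre-flight check can surface such mismatches before a long run. The sketch below uses hypothetical helper names and plain dictionaries, and makes no assumption about how PETsARD itself validates a config.

```python
from typing import Dict, List


def sections_not_in_sequence(config: Dict[str, dict], sequence: List[str]) -> List[str]:
    """Config sections that appear in no step of the sequence."""
    return sorted(name for name in config if name not in sequence)


def steps_without_section(config: Dict[str, dict], sequence: List[str]) -> List[str]:
    """Sequence steps that have no matching section in the config."""
    return [name for name in sequence if name not in config]


# Config shape copied from the User Story 4 output (Loader details omitted).
config = {
    "Evaluator": {"demo": {"method": "default"}},
    "Loader": {"adult": {"filepath": "benchmark://adult"}},
    "Postprocessor": {"demo": {"method": "default"}},
    "Preprocessor": {"demo": {"method": "default"}},
    "Synthesizer": {"demo": {"method": "default"}},
}
sequence = ["Loader", "Preprocessor", "Synthesizer", "Postprocessor"]

print(sections_not_in_sequence(config, sequence))
print(steps_without_section(config, sequence))
```

Here the first call reports `Evaluator` as configured but not run, and the second reports nothing missing.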

## User Story 5: Customized Evaluating

> Following User Story 4, if specific types of metrics are set or a customized evaluation script is provided, the module creates a customized evaluation report.

In [7]:
config_file = 'UserStory_5.yaml'
sequence = ['Loader', 'Preprocessor', 'Synthesizer', 'Postprocessor'] # 'Evaluator', 'Reporter'

DevTest_UserStory(
    config_file = config_file,
    sequence = sequence
)

{'Evaluator': {'custom': {'custom_method': 'my_method_filename',
                          'method': 'custom_method'}},
 'Loader': {'adult': {'filepath': 'benchmark://adult', 'na_values': {...}}},
 'Postprocessor': {'demo': {'method': 'default'}},
 'Preprocessor': {'demo': {'method': 'default'}},
 'Synthesizer': {'demo': {'method': 'default'}}}
Now is Loader with adult...
Loader - Benchmarker: file benchmark\adult.csv already exist and match SHA-256.
                      PETsARD will ignore download and use local data directly.
Now is Preprocessor with demo...
Now is Synthesizer with demo...
Synthesizer (SDV - SingleTable): Metafile loading time: 0.0464 sec.
Synthesizer (SDV - SingleTable): Fitting GaussianCopula.
Synthesizer (SDV - SingleTable): Fitting  GaussianCopula spent 9.3985 sec.
Synthesizer (SDV - SingleTable): Sampling GaussianCopula # 26933 rows (same as raw) in 1.9573 sec.
Now is Postprocessor with demo...
{'Loader[adult]_Preprocessor[demo]_Synthesizer[demo]_Postprocessor[

# Type 2: Evaluating data

> Type 2: Evaluate a privacy-enhanced dataset

## User Story 6: Given data evaluating

> Given an original dataset and its corresponding privacy-enhanced dataset as inputs to the evaluation module, the pipeline creates a report covering default/general privacy risk and utility metrics.
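For readers writing their own config, the following is a minimal sketch of what `UserStory_6.yaml` might contain. It is reconstructed from the config dictionary printed by the run below, so key order and quoting are guesses and the actual file may differ. The helper `write_config` is hypothetical and only writes the text to disk.

```python
from pathlib import Path

# Reconstructed from the printed config of User Story 6.
# Assumption: the real file may order or quote these entries differently.
USER_STORY_6_YAML = """\
Loader:
  adult_ori:
    filepath: 'ori.csv'
Synthesizer:
  custom:
    method: 'custom_data'
    filepath: 'syn.csv'
Evaluator:
  demo:
    method: 'sdmetrics-qualityreport'
"""


def write_config(path: str, text: str = USER_STORY_6_YAML) -> Path:
    """Write the YAML text to `path` and return the resulting Path."""
    target = Path(path)
    target.write_text(text, encoding="utf-8")
    return target
```

Calling `write_config('UserStory_6.yaml')` would produce a file that can be passed as `config_file` in the cell below. Here the `Synthesizer` section with `method: 'custom_data'` points at an existing synthetic file instead of fitting a model.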

In [5]:
config_file = 'UserStory_6.yaml'
sequence = ['Loader', 'Synthesizer', 'Evaluator'] # 'Reporter'

DevTest_UserStory(
    config_file = config_file,
    sequence = sequence
)

{'Evaluator': {'demo': {'method': 'sdmetrics-qualityreport'}},
 'Loader': {'adult_ori': {'filepath': 'ori.csv'}},
 'Synthesizer': {'custom': {'filepath': 'syn.csv', 'method': 'custom_data'}}}
Now is Loader with adult_ori...
Now is Synthesizer with custom...
Now is Evaluator with demo...
Generating report ...
(1/2) Evaluating Column Shapes: : 100%|██████████| 15/15 [00:00<00:00, 78.50it/s]
(2/2) Evaluating Column Pair Trends: : 100%|██████████| 105/105 [00:04<00:00, 21.21it/s]

Overall Score: 78.22%

Properties:
- Column Shapes: 95.42%
- Column Pair Trends: 61.03%
{'Loader[adult_ori]_Synthesizer[custom]_Evaluator[demo]': {'columnwise':                       Property        Metric     Score
age              Column Shapes  KSComplement  0.953697
workclass        Column Shapes  TVComplement  0.993424
fnlwgt           Column Shapes  KSComplement  0.961962
education        Column Shapes  TVComplement  0.953049
educational-num  Column Shapes  KSComplement  0.854475
marital-status   Column Sha
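The overall score in the report above is consistent with a plain average of the two property scores. The sketch below redoes that arithmetic with the rounded percentages as printed, assuming an unweighted mean. Because the inputs are already rounded, the result only matches the printed 78.22% up to rounding.

```python
from statistics import mean

# Property scores as printed in the report above (already rounded).
property_scores = {
    "Column Shapes": 0.9542,
    "Column Pair Trends": 0.6103,
}

# Assumption: the overall score is the unweighted mean of the property scores.
overall = mean(property_scores.values())
print(f"Overall score is approximately {overall:.3f}")
```

The low Column Pair Trends score (61.03%) pulls the average down, even though the per-column shapes match closely.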

## User Story 7: Given data customized evaluating

> Following User Story 6, if specific types of metrics are set or a customized evaluation script is provided, the module creates a customized evaluation report.

In [None]:
config_file = 'UserStory_7.yaml'
sequence = ['Loader', 'Synthesizer', 'Evaluator'] # 'Reporter'

DevTest_UserStory(
    config_file = config_file,
    sequence = sequence
)

# Type 3: Evaluating algorithm

> Type 3: Evaluate a data generation algorithm (including data processing)

## User Story 8: Synthesizing on default data

> With a specified data generation algorithm, a default collection of benchmark datasets serves as input, and the pipeline outputs the corresponding privacy-enhanced datasets generated by the selected algorithm.

In [3]:
config_file = 'UserStory_8.yaml'
sequence = ['Loader', 'Preprocessor', 'Synthesizer', 'Postprocessor'] # 'Reporter'

DevTest_UserStory(
    config_file = config_file,
    sequence = sequence
)

{'Loader': {'demo': {'method': 'default'}},
 'Postprocessor': {'demo': {'method': 'default'}},
 'Preprocessor': {'demo': {'method': 'default'}},
 'Synthesizer': {'sdv-gaussian': {'method': 'sdv-single_table-gaussiancopula'}}}
Now is Loader with demo...
Loader - Benchmarker: file benchmark\adult.csv already exist and match SHA-256.
                      PETsARD will ignore download and use local data directly.
Now is Preprocessor with demo...
Now is Synthesizer with sdv-gaussian...
Synthesizer (SDV - SingleTable): Metafile loading time: 0.0365 sec.
Synthesizer (SDV - SingleTable): Fitting GaussianCopula.
Synthesizer (SDV - SingleTable): Fitting  GaussianCopula spent 11.9046 sec.
Synthesizer (SDV - SingleTable): Sampling GaussianCopula # 28558 rows (same as raw) in 1.8215 sec.
Now is Postprocessor with demo...
{'Loader[demo]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]': None}
