# User Story

This demo is a checklist for the "User Stories" section of the PETsARD Spec (the specification for the data generation algorithm analysis and evaluation system). It can also help users set up their own config files. Enjoy : )

## Stocktaking

Type 1: Generation

1. User Story 1: Describing - **Incomplete**
   - Missing DescriberOperator
   - Missing Reporter, ReporterOperator, and method = 'save'

2. User Story 2: Synthesizing - **Incomplete**
   - Missing Reporter, ReporterOperator, and method = 'save'

3. User Story 3: Default Synthesizing - **Incomplete**
   - Missing Reporter, ReporterOperator, and method = 'save'

4. User Story 4: Evaluating - **Incomplete**
   - Missing Evaluator method = 'default'
   - Missing Reporter, ReporterOperator, and method = 'save'

5. User Story 5: Customized Evaluating - **Incomplete**
   - Missing Evaluator method = 'custom_method'
   - Missing Reporter, ReporterOperator, and method = 'save'

## Environment

In [1]:
import os
import sys

sys.path.append(os.path.dirname(os.getcwd()))

In [7]:
import pprint
from typing import List

import yaml

from PETsARD import Executor


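def DevTest_UserStory(config_file: str, sequence: List[str]) -> None:
    """Print the raw YAML config, run the Executor, then print its results."""
    pp = pprint.PrettyPrinter(depth=3)  # nesting deeper than 3 levels prints as {...}
    with open(config_file, 'r') as yaml_file:
        yaml_raw: dict = yaml.safe_load(yaml_file)
    pp.pprint(yaml_raw)

    # Named `executor` rather than `exec`, which would shadow the Python built-in
    executor = Executor(config=config_file, sequence=sequence)
    executor.run()
    pp.pprint(executor.get_result())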
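
The `{...}` placeholders that appear in the printed configs throughout this notebook come from `depth=3` in the helper's `PrettyPrinter`, which elides anything nested more than three levels deep. Here is a minimal sketch of that behavior using only the standard library. The `na_values` entry is a hypothetical stand-in, since the real values are not shown in this notebook:

```python
import pprint

# Hypothetical config: the structure mirrors the Loader config printed in this
# notebook, but the contents of 'na_values' are made up for illustration.
config = {
    'Loader': {
        'adult': {
            'filepath': 'benchmark://adult',
            'na_values': {'some_column': '?'},  # hypothetical placeholder
        }
    }
}

# Levels 1-3 (config, 'Loader', 'adult') are printed in full;
# the dict at level 4 ('na_values') is replaced by {...}.
text = pprint.pformat(config, depth=3)
print(text)
```

To inspect the full config, one option is to raise `depth` or drop it entirely when constructing the `PrettyPrinter`.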

# Type 1: Generation

> Type 1: Generate a privacy-enhanced dataset along with related reports

## User Story 1: Describing

> Given a dataset as input, the pipeline can call the "describe" module to get a summary of the dataset.

In [3]:
config_file = 'UserStory_1.yaml'
sequence = ['Loader']  # 'Describer' and 'Reporter' are not available yet

DevTest_UserStory(
    config_file = config_file,
    sequence = sequence
)

{'Loader': {'adult': {'filepath': 'benchmark://adult', 'na_values': {...}}}}
Now is Loader with adult...
Loader - Benchmarker: file benchmark\adult.csv already exist and match SHA-256.
                      PETsARD will ignore download and use local data directly.
{'Loader[adult]':        age     workclass  fnlwgt     education  educational-num  \
0       25       Private  226802          11th                7   
1       38       Private   89814       HS-grad                9   
2       28     Local-gov  336951    Assoc-acdm               12   
3       44       Private  160323  Some-college               10   
4       18           NaN  103497  Some-college               10   
...    ...           ...     ...           ...              ...   
48837   27       Private  257302    Assoc-acdm               12   
48838   40       Private  154374       HS-grad                9   
48839   58       Private  151910       HS-grad                9   
48840   22       Private  201490       HS-grad 
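
Since the Stocktaking list marks the Describer as missing, the run above only loads the data. As a rough illustration of what a dataset "summary" could contain, here is a standard-library sketch over the first five rows shown in the output. The `summarize` function and its output fields are assumptions made for this example, not the PETsARD "describe" module:

```python
from statistics import mean
from typing import Dict, List, Optional, Union

Row = Dict[str, Union[int, Optional[str]]]

# First five rows of the 'age' and 'workclass' columns from the output above;
# the NaN in row 4 is represented as None.
rows: List[Row] = [
    {'age': 25, 'workclass': 'Private'},
    {'age': 38, 'workclass': 'Private'},
    {'age': 28, 'workclass': 'Local-gov'},
    {'age': 44, 'workclass': 'Private'},
    {'age': 18, 'workclass': None},
]


def summarize(rows: List[Row]) -> dict:
    """Hypothetical summary: row count, age range and mean, missing workclass count."""
    ages = [row['age'] for row in rows]
    return {
        'rows': len(rows),
        'age_min': min(ages),
        'age_max': max(ages),
        'age_mean': mean(ages),
        'workclass_missing': sum(row['workclass'] is None for row in rows),
    }


summary = summarize(rows)
print(summary)
```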

## User Story 2: Synthesizing

> Given an original dataset and a specified privacy-enhancing data generation algorithm with its parameters, the pipeline will generate a privacy-enhanced dataset.

In [4]:
config_file = 'UserStory_2.yaml'
sequence = ['Loader', 'Preprocessor', 'Synthesizer', 'Postprocessor'] # 'Reporter'

DevTest_UserStory(
    config_file = config_file,
    sequence = sequence
)

{'Loader': {'adult': {'filepath': 'benchmark://adult', 'na_values': {...}}},
 'Postprocessor': {'demo': {'method': 'default'}},
 'Preprocessor': {'demo': {'method': 'default'}},
 'Synthesizer': {'sdv-gaussian': {'method': 'sdv-single_table-gaussiancopula'}}}
Now is Loader with adult...
Loader - Benchmarker: file benchmark\adult.csv already exist and match SHA-256.
                      PETsARD will ignore download and use local data directly.
Now is Preprocessor with demo...
Now is Synthesizer with sdv-gaussian...
Synthesizer (SDV - SingleTable): Metafile loading time: 0.0857 sec.
Synthesizer (SDV - SingleTable): Fitting GaussianCopula.
Synthesizer (SDV - SingleTable): Fitting  GaussianCopula spent 9.7242 sec.
Synthesizer (SDV - SingleTable): Sampling GaussianCopula # 26933 rows (same as raw) in 2.1716 sec.
Now is Postprocessor with demo...
{'Loader[adult]_Preprocessor[demo]_Synthesizer[sdv-gaussian]_Postprocessor[demo]':              age         workclass         fnlwgt     education 
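
The result key in the output above appears to join each module in the sequence with its experiment name, in the form `Module[experiment]`, separated by underscores. The sketch below reproduces that naming pattern as observed. The `result_key` function is a hypothetical helper and not how PETsARD builds the key internally:

```python
from typing import Dict, List


def result_key(sequence: List[str], experiments: Dict[str, str]) -> str:
    """Join 'Module[experiment]' parts with underscores, in sequence order."""
    return '_'.join(f'{module}[{experiments[module]}]' for module in sequence)


# Module and experiment names taken from the UserStory_2.yaml config printed above.
key = result_key(
    sequence=['Loader', 'Preprocessor', 'Synthesizer', 'Postprocessor'],
    experiments={
        'Loader': 'adult',
        'Preprocessor': 'demo',
        'Synthesizer': 'sdv-gaussian',
        'Postprocessor': 'demo',
    },
)
print(key)
```

Knowing the pattern makes it easier to pick one dataset out of the dict returned by `get_result()`.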

## User Story 3: Default Synthesizing

> Given an original dataset without a specified algorithm, the pipeline will generate a list of privacy-enhanced datasets using the default algorithms.

In [5]:
config_file = 'UserStory_3.yaml'
sequence = ['Loader', 'Preprocessor', 'Synthesizer', 'Postprocessor'] # 'Reporter'

DevTest_UserStory(
    config_file = config_file,
    sequence = sequence
)

{'Loader': {'adult': {'filepath': 'benchmark://adult', 'na_values': {...}}},
 'Postprocessor': {'demo': {'method': 'default'}},
 'Preprocessor': {'demo': {'method': 'default'}},
 'Synthesizer': {'demo': {'method': 'default'}}}
Now is Loader with adult...
Loader - Benchmarker: file benchmark\adult.csv already exist and match SHA-256.
                      PETsARD will ignore download and use local data directly.
Now is Preprocessor with demo...
Now is Synthesizer with demo...
Synthesizer (SDV - SingleTable): Metafile loading time: 0.0312 sec.
Synthesizer (SDV - SingleTable): Fitting GaussianCopula.
Synthesizer (SDV - SingleTable): Fitting  GaussianCopula spent 9.1163 sec.
Synthesizer (SDV - SingleTable): Sampling GaussianCopula # 26933 rows (same as raw) in 1.9443 sec.
Now is Postprocessor with demo...
{'Loader[adult]_Preprocessor[demo]_Synthesizer[demo]_Postprocessor[demo]':              age  workclass         fnlwgt     education  educational-num  \
0      44.235812  State-gov  183690

## User Story 4: Evaluating

> Following User Stories 2 and 3, if users enable the "evaluate" step, the evaluation module will create a report covering the default privacy risk and utility metrics.

In [6]:
config_file = 'UserStory_4.yaml'
sequence = ['Loader', 'Preprocessor', 'Synthesizer', 'Postprocessor']  # 'Evaluator', 'Reporter'

DevTest_UserStory(
    config_file = config_file,
    sequence = sequence
)

{'Evaluator': {'demo': {'method': 'default'}},
 'Loader': {'adult': {'filepath': 'benchmark://adult', 'na_values': {...}}},
 'Postprocessor': {'demo': {'method': 'default'}},
 'Preprocessor': {'demo': {'method': 'default'}},
 'Synthesizer': {'demo': {'method': 'default'}}}
Now is Loader with adult...
Loader - Benchmarker: file benchmark\adult.csv already exist and match SHA-256.
                      PETsARD will ignore download and use local data directly.
Now is Preprocessor with demo...
Now is Synthesizer with demo...
Synthesizer (SDV - SingleTable): Metafile loading time: 0.027 sec.
Synthesizer (SDV - SingleTable): Fitting GaussianCopula.
Synthesizer (SDV - SingleTable): Fitting  GaussianCopula spent 9.8807 sec.
Synthesizer (SDV - SingleTable): Sampling GaussianCopula # 26933 rows (same as raw) in 1.8995 sec.
Now is Postprocessor with demo...
{'Loader[adult]_Preprocessor[demo]_Synthesizer[demo]_Postprocessor[demo]':              age         workclass         fnlwgt     education  \
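
In this run the config defines an `Evaluator`, but the sequence stops at `Postprocessor`, so no evaluation takes place. This matches the missing Evaluator `method = 'default'` noted under Stocktaking. A small sketch for spotting such configured-but-skipped modules follows. The `skipped_modules` helper is hypothetical, and the config dict copies only the module and experiment names printed above, leaving out the elided `na_values`:

```python
from typing import Dict, List


def skipped_modules(config: Dict[str, dict], sequence: List[str]) -> List[str]:
    """Return modules that appear in the config but not in the run sequence."""
    return sorted(set(config) - set(sequence))


# Mirrors the UserStory_4.yaml config printed above (na_values omitted).
config = {
    'Loader': {'adult': {'filepath': 'benchmark://adult'}},
    'Preprocessor': {'demo': {'method': 'default'}},
    'Synthesizer': {'demo': {'method': 'default'}},
    'Postprocessor': {'demo': {'method': 'default'}},
    'Evaluator': {'demo': {'method': 'default'}},
}
sequence = ['Loader', 'Preprocessor', 'Synthesizer', 'Postprocessor']

skipped = skipped_modules(config, sequence)
print(skipped)
```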

## User Story 5: Customized Evaluating

> Following User Story 4, if specific types of metrics are set or a customized evaluation script is provided, the module will create a customized evaluation report.

In [7]:
config_file = 'UserStory_5.yaml'
sequence = ['Loader', 'Preprocessor', 'Synthesizer', 'Postprocessor']  # 'Evaluator', 'Reporter'

DevTest_UserStory(
    config_file = config_file,
    sequence = sequence
)

{'Evaluator': {'custom': {'custom_method': 'my_method_filename',
                          'method': 'custom_method'}},
 'Loader': {'adult': {'filepath': 'benchmark://adult', 'na_values': {...}}},
 'Postprocessor': {'demo': {'method': 'default'}},
 'Preprocessor': {'demo': {'method': 'default'}},
 'Synthesizer': {'demo': {'method': 'default'}}}
Now is Loader with adult...
Loader - Benchmarker: file benchmark\adult.csv already exist and match SHA-256.
                      PETsARD will ignore download and use local data directly.
Now is Preprocessor with demo...
Now is Synthesizer with demo...
Synthesizer (SDV - SingleTable): Metafile loading time: 0.0464 sec.
Synthesizer (SDV - SingleTable): Fitting GaussianCopula.
Synthesizer (SDV - SingleTable): Fitting  GaussianCopula spent 9.3985 sec.
Synthesizer (SDV - SingleTable): Sampling GaussianCopula # 26933 rows (same as raw) in 1.9573 sec.
Now is Postprocessor with demo...
{'Loader[adult]_Preprocessor[demo]_Synthesizer[demo]_Postprocessor[

# Type 2: Evaluation

> Type 2: Evaluate a privacy-enhanced dataset

## User Story 6: Evaluating Given Data

> Given an original dataset and the corresponding privacy-enhanced dataset as input to the evaluation module, the pipeline will create a report covering default/general metrics of privacy risk and utility.

In [8]:
config_file = 'UserStory_6.yaml'
sequence = ['Loader', 'Synthesizer', 'Evaluator'] # 'Reporter'

DevTest_UserStory(
    config_file = config_file,
    sequence = sequence
)

{'Evaluator': {'demo': {'method': 'sdmetrics-qualityreport'}},
 'Loader': {'adult_ori': {'filepath': 'ori.csv'}},
 'Synthesizer': {'custom': {'filepath': 'syn.csv', 'method': 'custom_data'}}}
Now is Loader with adult_ori...
Now is Synthesizer with custom...
Now is Evaluator with demo...


TypeError: SDMetricsBase.__init__() missing 1 required positional argument: 'data'
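
This kind of `TypeError` is raised when a class is instantiated without an argument that its `__init__` requires. The minimal reproduction below uses a hypothetical class. The name `MetricsBaseSketch` and its signature are assumptions for illustration only, since the real `SDMetricsBase` signature is not shown in this notebook:

```python
class MetricsBaseSketch:
    """Hypothetical stand-in for an evaluator base class that requires `data`."""

    def __init__(self, config: dict, data: dict):
        self.config = config
        self.data = data


try:
    # The caller passes the config only and omits the required `data` argument.
    MetricsBaseSketch({'method': 'sdmetrics-qualityreport'})
except TypeError as err:
    message = str(err)
    print(message)
```

The message suggests that the evaluator was constructed without the original and synthetic data, so the call site that builds it is a reasonable first place to look.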

In [2]:
from PETsARD import Synthesizer


syn = Synthesizer(method='custom_data', filepath='syn.csv')
syn.create(data={})
syn.fit_sample()

In [3]:
syn.data_syn

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,47.770363,Local-gov,210507.671875,Bachelors,9.212439,Married-civ-spouse,Prof-specialty,Not-in-family,White,Male,0.0,0.0,45.634048,United-States,>50K
1,23.035469,,170300.359375,HS-grad,8.032124,Never-married,,Not-in-family,White,Male,0.0,0.0,42.279007,United-States,>50K
2,57.541721,,232871.812500,HS-grad,10.631953,Separated,,Own-child,White,Female,0.0,0.0,39.142406,United-States,<=50K
3,39.859409,Private,190946.984375,Some-college,8.096195,Married-civ-spouse,Adm-clerical,Husband,White,Male,0.0,0.0,41.688583,United-States,<=50K
4,40.325970,Private,70642.859375,Masters,12.995354,Never-married,Craft-repair,Not-in-family,Black,Female,0.0,0.0,38.131351,Mexico,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21632,58.880329,Private,114618.070312,Bachelors,9.107739,Married-civ-spouse,Craft-repair,Own-child,White,Male,0.0,0.0,37.773296,Thailand,<=50K
21633,45.098606,Private,299006.250000,HS-grad,10.240206,Married-civ-spouse,Exec-managerial,Own-child,White,Male,0.0,0.0,48.648289,United-States,<=50K
21634,27.457649,Private,226504.015625,Some-college,11.489886,Divorced,Sales,Wife,White,Female,0.0,0.0,41.446903,United-States,<=50K
21635,40.984989,Private,126128.070312,Assoc-voc,11.299727,Married-civ-spouse,Other-service,Husband,Asian-Pac-Islander,Male,0.0,0.0,35.957973,Ireland,>50K
