# User Story C
**Privacy Enhancing Data Evaluation**


This demo shows how to evaluate privacy-enhanced data using `PETsARD`.

In this demonstration, you already have a data file on your local machine along with its corresponding synthetic data, most likely produced by your existing privacy protection service. `PETsARD` reads these files and evaluates the results, so you can compare your current solution with other technologies.

---

## Environment

In [1]:
import os
import pprint
import sys

import yaml


# Setting up the path to the PETsARD package
path_petsard = os.path.dirname(os.getcwd())
sys.path.append(path_petsard)
# Settings for pretty-printing YAML
pp = pprint.PrettyPrinter(depth=3, sort_dicts=False)


from petsard import Executor # noqa: E402

---

## User Story C-1
**Default Describing**

Given a dataset as input, the pipeline runs the "describe" module to produce a summary of that dataset.

In [2]:
config_file = '../yaml/User Story C-1.yaml'

with open(config_file, 'r') as yaml_file:
    yaml_raw: dict = yaml.safe_load(yaml_file)
pp.pprint(yaml_raw)

{'Loader': {'adult': {'filepath': '../benchmark/adult-income.csv',
                      'na_values': {...}}},
 'Describer': {'demo': {'method': 'default'}},
 'Reporter': {'save_report_global': {'method': 'save_report',
                                     'output': 'User Story C-1',
                                     'eval': 'demo',
                                     'granularity': 'global'},
              'save_report_columnwise': {'method': 'save_report',
                                         'output': 'User Story C-1',
                                         'eval': 'demo',
                                         'granularity': 'columnwise'},
              'save_report_pairwise': {'method': 'save_report',
                                       'output': 'User Story C-1',
                                       'eval': 'demo',
                                       'granularity': 'pairwise'}}}
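The three Reporter entries above differ only in their `granularity`. As a minimal sketch, the same mapping could be generated in plain Python before being written to YAML. The helper name `build_reporter_config` is hypothetical and is not part of PETsARD.

```python
def build_reporter_config(output, eval_name, granularities):
    """Build one 'save_report' entry per granularity (hypothetical helper)."""
    return {
        f"save_report_{granularity}": {
            "method": "save_report",
            "output": output,
            "eval": eval_name,
            "granularity": granularity,
        }
        for granularity in granularities
    }


# Reproduce the Reporter section printed above.
reporter = build_reporter_config(
    output="User Story C-1",
    eval_name="demo",
    granularities=["global", "columnwise", "pairwise"],
)
print(list(reporter))
```

Generating the entries this way keeps `output` and `eval` in one place, which may help when the same report is needed at several granularities.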


In [3]:
exec = Executor(config=config_file)
exec.run()

Now is User Story C-1[Report]_demo_[global] save to csv...
Now is User Story C-1[Report]_demo_[columnwise] save to csv...
Now is User Story C-1[Report]_demo_[pairwise] save to csv...


In [4]:
pp.pprint(exec.get_result())

{'Loader[adult]_Describer[demo]_Reporter[save_report_global]': {'demo_[global]':                          full_expt_name Loader      Describer  demo_row_count  \
0  Loader[adult]_Describerdemo_[global]  adult  demo_[global]           48842   

   demo_col_count  demo_na_count  
0              15           3620  },
 'Loader[adult]_Describer[demo]_Reporter[save_report_columnwise]': {'demo_[columnwise]':                               full_expt_name Loader          Describer  \
0   Loader[adult]_Describerdemo_[columnwise]  adult  demo_[columnwise]   
1   Loader[adult]_Describerdemo_[columnwise]  adult  demo_[columnwise]   
2   Loader[adult]_Describerdemo_[columnwise]  adult  demo_[columnwise]   
3   Loader[adult]_Describerdemo_[columnwise]  adult  demo_[columnwise]   
4   Loader[adult]_Describerdemo_[columnwise]  adult  demo_[columnwise]   
5   Loader[adult]_Describerdemo_[columnwise]  adult  demo_[columnwise]   
6   Loader[adult]_Describerdemo_[columnwise]  adult  demo_[columnwise]   
7  
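The global report above carries three counts: rows, columns, and missing values. To illustrate what those numbers mean, the sketch below computes the same kind of counts with the standard library. It is not PETsARD's implementation: the toy data, the `summarize` helper, and the choice to treat empty strings as missing are all assumptions made for this example.

```python
import csv
import io

# Toy CSV standing in for a real file; the values are made up for illustration.
TOY_CSV = """age,workclass,income
39,State-gov,<=50K
50,,>50K
38,Private,
"""


def summarize(csv_text, na_markers=("",)):
    """Count rows, columns, and missing cells in CSV text (hypothetical helper)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    # A cell counts as missing when its raw text is one of the NA markers.
    na_count = sum(
        1 for row in rows for value in row.values() if value in na_markers
    )
    return {"row_count": len(rows), "col_count": len(columns), "na_count": na_count}


summary = summarize(TOY_CSV)
print(summary)  # {'row_count': 3, 'col_count': 3, 'na_count': 2}
```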

---

## User Story C-2
**Given data evaluating**

Given an original dataset and its privacy-enhanced counterpart, the evaluation module produces a report covering default (general) metrics of privacy risk and utility.

### User Story C-2a

C-2a demonstrates Evaluator methods that compare "original data" with "synthetic data", for instance `method = 'default'` or the SDMetrics tools whose method names start with `'sdmetrics-'`.

In [5]:
config_file = '../yaml/User Story C-2a.yaml'

with open(config_file, 'r') as yaml_file:
    yaml_raw: dict = yaml.safe_load(yaml_file)
pp.pprint(yaml_raw)

{'Loader': {'adult_ori': {'filepath': '../benchmark/adult-income_ori.csv'}},
 'Synthesizer': {'custom': {'method': 'custom_data',
                            'filepath': '../benchmark/adult-income_syn.csv'}},
 'Evaluator': {'demo': {'method': 'default'}},
 'Reporter': {'save_report_global': {'method': 'save_report',
                                     'output': 'User Story C-2a',
                                     'eval': 'demo',
                                     'granularity': 'global'}}}


In [6]:
exec = Executor(config=config_file)
exec.run()

INFO:root:workclass changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:education changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:marital-status changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:occupation changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:relationship changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:race changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:gender changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:capital-gain changes data dtype from int8 to int32 for metadata alignment.
INFO:root:capital-loss changes data dtype from int8 to int16 for metadata alignment.
INFO:root:native-country changes data dtype from category[object] to category[object] for meta

Generating report ...

(1/2) Evaluating Column Shapes: |██████████| 15/15 [00:00<00:00, 88.57it/s]|
Column Shapes Score: 94.32%

(2/2) Evaluating Column Pair Trends: |██████████| 105/105 [00:00<00:00, 277.04it/s]|
Column Pair Trends Score: 60.42%



Overall Score (Average): 77.37%

Now is User Story C-2a[Report]_demo_[global] save to csv...


In [7]:
pp.pprint(exec.get_result())

{'Loader[adult_ori]_Synthesizer[custom]_Evaluator[demo]_Reporter[save_report_global]': {'demo_[global]':                                            full_expt_name     Loader  \
result  Loader[adult_ori]_Synthesizer[custom]_Evaluato...  adult_ori   

       Synthesizer      Evaluator  demo_Score  demo_Column Shapes  \
result      custom  demo_[global]    0.773713            0.943236   

        demo_Column Pair Trends  
result                 0.604191  }}
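The numbers in this report can be checked by hand. The overall score is reported as an average, and it matches the mean of the two property scores; the 105 column pairs in the progress bar match the number of unordered pairs among the 15 columns. A short sketch of that arithmetic, using only values copied from the output above:

```python
from itertools import combinations
from statistics import mean

# Values copied from the report output above.
column_shapes = 0.943236
column_pair_trends = 0.604191
n_columns = 15

# "Overall Score (Average)" is the mean of the two property scores.
overall = mean([column_shapes, column_pair_trends])

# Column Pair Trends evaluates every unordered pair of columns.
n_pairs = len(list(combinations(range(n_columns), 2)))

print(f"Overall: {overall:.2%}, column pairs: {n_pairs}")
```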


### User Story C-2b


C-2b demonstrates Evaluator methods that involve three datasets: "original data used in synthesis" (abbreviated as ori), "original data not used in synthesis" (abbreviated as control), and "synthesized data" (abbreviated as syn). An example is the Anonymeter tools, whose method names start with `'anonymeter-'`.
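The difference between C-2a and C-2b comes down to which datasets a method compares. The sketch below encodes that rule as described in the text; `required_datasets` is a hypothetical helper written for illustration, not a PETsARD function, and PETsARD's own dispatch may work differently.

```python
def required_datasets(method):
    """Return which datasets an evaluation method compares (illustrative only)."""
    # 'default' and SDMetrics-based methods compare original vs. synthetic data.
    if method == "default" or method.startswith("sdmetrics-"):
        return ("ori", "syn")
    # Anonymeter-based methods also need held-out original data (control).
    if method.startswith("anonymeter-"):
        return ("ori", "control", "syn")
    raise ValueError(f"Unrecognized evaluation method: {method!r}")


for name in ("default", "anonymeter-singlingout"):
    print(name, "->", required_datasets(name))
```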

In [8]:
config_file = '../yaml/User Story C-2b.yaml'

with open(config_file, 'r') as yaml_file:
    yaml_raw: dict = yaml.safe_load(yaml_file)
pp.pprint(yaml_raw)

{'Splitter': {'custom': {'method': 'custom_data', 'filepath': {...}}},
 'Synthesizer': {'custom': {'method': 'custom_data',
                            'filepath': '../benchmark/adult-income_syn.csv'}},
 'Evaluator': {'demo': {'method': 'default'},
               'anony-singling': {'method': 'anonymeter-singlingout'}},
 'Reporter': {'save_report_global_demo': {'method': 'save_report',
                                          'output': 'User Story C-2b',
                                          'eval': 'demo',
                                          'granularity': 'global'},
              'save_report_global_anony': {'method': 'save_report',
                                           'output': 'User Story C-2b',
                                           'eval': 'anony-singling',
                                           'granularity': 'global'}}}


In [9]:
exec = Executor(config=config_file)
exec.run()

INFO:root:workclass changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:education changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:marital-status changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:occupation changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:relationship changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:race changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:gender changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:capital-gain changes data dtype from int8 to int32 for metadata alignment.
INFO:root:capital-loss changes data dtype from int8 to int16 for metadata alignment.
INFO:root:native-country changes data dtype from category[object] to category[object] for meta

Generating report ...

(1/2) Evaluating Column Shapes: |██████████| 15/15 [00:00<00:00, 92.05it/s]|
Column Shapes Score: 94.32%

(2/2) Evaluating Column Pair Trends: |██████████| 105/105 [00:00<00:00, 276.29it/s]|
Column Pair Trends Score: 60.42%



Overall Score (Average): 77.37%

Now is User Story C-2b[Report]_demo_[global] save to csv...


INFO:root:race changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:gender changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:capital-gain changes data dtype from int8 to int32 for metadata alignment.
INFO:root:capital-loss changes data dtype from int8 to int16 for metadata alignment.
INFO:root:native-country changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:income changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:workclass changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:education changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:marital-status changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:occupation changes data dtype from category[object] to category[object] for metadata a

Now is User Story C-2b[Report]_anony-singling_[global] save to csv...


In [10]:
pp.pprint(exec.get_result())

{'Splitter[custom_[1-1]]_Synthesizer[custom]_Evaluator[demo]_Reporter[save_report_global_demo]': {'demo_[global]':                                            full_expt_name      Splitter  \
result  Splitter[custom_[1-1]]_Synthesizer[custom]_Eva...  custom_[1-1]   

       Synthesizer      Evaluator  demo_Score  demo_Column Shapes  \
result      custom  demo_[global]    0.773713            0.943236   

        demo_Column Pair Trends  
result                 0.604191  },
 'Splitter[custom_[1-1]]_Synthesizer[custom]_Evaluator[demo]_Reporter[save_report_global_anony]': {},
 'Splitter[custom_[1-1]]_Synthesizer[custom]_Evaluator[anony-singling]_Reporter[save_report_global_demo]': {},
 'Splitter[custom_[1-1]]_Synthesizer[custom]_Evaluator[anony-singling]_Reporter[save_report_global_anony]': {'anony-singling_[global]':                                            full_expt_name      Splitter  \
result  Splitter[custom_[1-1]]_Synthesizer[custom]_Eva...  custom_[1-1]   

       Synthesizer       
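In the result above, every Evaluator is paired with every Reporter, and the pairings whose Reporter `eval` does not match the Evaluator come back as empty dicts. A minimal sketch of dropping those empty entries follows. The keys are copied from the output above, while the values are placeholder strings standing in for the real report tables.

```python
# Stand-in for the result dict; placeholder strings replace the report tables
# so that the sketch runs on its own.
prefix = "Splitter[custom_[1-1]]_Synthesizer[custom]_"
results = {
    prefix + "Evaluator[demo]_Reporter[save_report_global_demo]": {
        "demo_[global]": "<report table>"
    },
    prefix + "Evaluator[demo]_Reporter[save_report_global_anony]": {},
    prefix + "Evaluator[anony-singling]_Reporter[save_report_global_demo]": {},
    prefix + "Evaluator[anony-singling]_Reporter[save_report_global_anony]": {
        "anony-singling_[global]": "<report table>"
    },
}

# Keep only the Evaluator/Reporter pairings that actually produced a report.
non_empty = {name: report for name, report in results.items() if report}

for name in non_empty:
    print(name)
```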

---

## User Story C-3
**Given data customized evaluating**

> Following User Story C-2, if specific metrics are selected or a custom evaluation script is provided, the module produces a customized evaluation report.

> This aligns with User Story 7 of the Spec.

In [11]:
config_file = '../yaml/User Story C-3.yaml'

with open(config_file, 'r') as yaml_file:
    yaml_raw: dict = yaml.safe_load(yaml_file)
pp.pprint(yaml_raw)

{'Loader': {'adult_ori': {'filepath': '../benchmark/adult-income_ori.csv'}},
 'Synthesizer': {'custom': {'method': 'custom_data',
                            'filepath': '../benchmark/adult-income_syn.csv'}},
 'Evaluator': {'custom': {'method': 'custom_method', 'custom_method': {...}}},
 'Reporter': {'save_report_global': {'method': 'save_report',
                                     'output': 'User Story C-3',
                                     'eval': 'custom',
                                     'granularity': 'global'},
              'save_report_columnwise': {'method': 'save_report',
                                         'output': 'User Story C-3',
                                         'eval': 'custom',
                                         'granularity': 'columnwise'},
              'save_report_pairwise': {'method': 'save_report',
                                       'output': 'User Story C-3',
                                       'eval': 'custom',
              

In [12]:
exec = Executor(config=config_file)
exec.run()

INFO:root:workclass changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:education changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:marital-status changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:occupation changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:relationship changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:race changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:gender changes data dtype from category[object] to category[object] for metadata alignment.
INFO:root:capital-gain changes data dtype from int8 to int32 for metadata alignment.
INFO:root:capital-loss changes data dtype from int8 to int16 for metadata alignment.
INFO:root:native-country changes data dtype from category[object] to category[object] for meta

Now is User Story C-3[Report]_custom_[global] save to csv...
Now is User Story C-3[Report]_custom_[columnwise] save to csv...
Now is User Story C-3[Report]_custom_[pairwise] save to csv...
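The run above saves one report per granularity. The sketch below shows the shape such a custom evaluation might produce at each level. It is not PETsARD's custom-evaluator interface (the `custom_method` settings are elided in the config above); the function name, the column names, and the constant score are all placeholders.

```python
from itertools import combinations


def evaluate_constant(columns, score=100):
    """Return a placeholder score at each of the three granularities."""
    return {
        # One result for the whole dataset.
        "global": {"score": score},
        # One result per column.
        "columnwise": {column: {"score": score} for column in columns},
        # One result per unordered pair of columns.
        "pairwise": {pair: {"score": score} for pair in combinations(columns, 2)},
    }


report = evaluate_constant(["age", "workclass", "income"])
print(len(report["columnwise"]), len(report["pairwise"]))  # 3 3
```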


In [13]:
pp.pprint(exec.get_result())

{'Loader[adult_ori]_Synthesizer[custom]_Evaluator[custom]_Reporter[save_report_global]': {'custom_[global]':                                            full_expt_name     Loader  \
result  Loader[adult_ori]_Synthesizer[custom]_Evaluato...  adult_ori   

       Synthesizer        Evaluator  custom_score  
result      custom  custom_[global]           100  },
 'Loader[adult_ori]_Synthesizer[custom]_Evaluator[custom]_Reporter[save_report_columnwise]': {'custom_[columnwise]':                                        full_expt_name     Loader Synthesizer  \
0   Loader[adult_ori]_Synthesizer[custom]_Evaluato...  adult_ori      custom   
1   Loader[adult_ori]_Synthesizer[custom]_Evaluato...  adult_ori      custom   
2   Loader[adult_ori]_Synthesizer[custom]_Evaluato...  adult_ori      custom   
3   Loader[adult_ori]_Synthesizer[custom]_Evaluato...  adult_ori      custom   
4   Loader[adult_ori]_Synthesizer[custom]_Evaluato...  adult_ori      custom   
5   Loader[adult_ori]_Synthesizer[custom]_E