# User Story C
**Privacy Enhancing Data Evaluation**


This demo shows how to evaluate privacy-enhanced data using `PETsARD`.

In this demonstration, you already have a data file on your local machine together with its corresponding synthetic data, most likely produced by your existing privacy protection service. `PETsARD` reads these files and evaluates the results, helping you compare your current solution with other technologies.

---

## Environment

In [1]:
import os
import pprint
import sys

import yaml


# Setting up the path to the PETsARD package
path_petsard = os.path.dirname(os.getcwd())
print(path_petsard)
sys.path.append(path_petsard)
# Settings for pretty-printing the YAML configuration
pp = pprint.PrettyPrinter(depth=3, sort_dicts=False)


from PETsARD import Executor

/Users/justyn.chen/Dropbox/310_Career_工作/20231016_NICS_資安院/41_PETsARD/PETsARD


---

## User Story C-1
**Default Describing**

Given a dataset as input, the pipeline runs the "describe" module (`Describer`) to produce a summary of that dataset.
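To make the idea of a dataset summary concrete, here is a minimal sketch of the three figures that the global report later in this section contains: row count, column count, and missing-value count. It uses only the standard library. The function name `describe_global`, the sample CSV text, and the choice of `?` and the empty string as missing-value markers are illustrative assumptions, not PETsARD's implementation.

```python
import csv
import io


def describe_global(csv_text: str, na_values: frozenset = frozenset({"", "?"})) -> dict:
    """Count rows, columns, and missing cells in CSV text (illustrative only)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []
    # A cell counts as missing when its stripped text is one of the NA markers.
    na_count = sum(
        1 for row in rows for col in columns if (row[col] or "").strip() in na_values
    )
    return {"row_count": len(rows), "col_count": len(columns), "na_count": na_count}


# Toy data: the second row uses "?" and the third row leaves workclass empty.
sample = "age,workclass,income\n39,Private,<=50K\n50,?,>50K\n38,,<=50K\n"
summary = describe_global(sample)
print(summary)
```

The columnwise and pairwise granularities would follow the same pattern, computed per column and per column pair instead of over the whole table.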

In [2]:
config_file = '../yaml/User Story C-1.yaml'

with open(config_file, 'r') as yaml_file:
    yaml_raw: dict = yaml.safe_load(yaml_file)
pp.pprint(yaml_raw)

{'Loader': {'adult': {'filepath': 'benchmark/adult-income.csv',
                      'na_values': {...}}},
 'Describer': {'demo': {'method': 'default'}},
 'Reporter': {'save_report_global': {'method': 'save_report',
                                     'output': 'User Story C-1',
                                     'eval': 'demo',
                                     'granularity': 'global'},
              'save_report_columnwise': {'method': 'save_report',
                                         'output': 'User Story C-1',
                                         'eval': 'demo',
                                         'granularity': 'columnwise'},
              'save_report_pairwise': {'method': 'save_report',
                                       'output': 'User Story C-1',
                                       'eval': 'demo',
                                       'granularity': 'pairwise'}}}
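The three `Reporter` entries printed above differ only in their `granularity`, so the same structure can be generated rather than written by hand. The sketch below builds an equivalent dictionary in plain Python. The helper name `build_describe_config` is hypothetical, and the `na_values` entry is left out because its contents are elided in the printout.

```python
import json


def build_describe_config(filepath: str, output: str, eval_name: str = "demo") -> dict:
    """Build a Loader/Describer/Reporter config with one report per granularity."""
    granularities = ("global", "columnwise", "pairwise")
    return {
        "Loader": {"adult": {"filepath": filepath}},
        "Describer": {eval_name: {"method": "default"}},
        "Reporter": {
            # One save_report entry per granularity, all pointing at the same eval.
            f"save_report_{g}": {
                "method": "save_report",
                "output": output,
                "eval": eval_name,
                "granularity": g,
            }
            for g in granularities
        },
    }


config = build_describe_config("benchmark/adult-income.csv", "User Story C-1")
print(json.dumps(config, indent=2))
```

Note that each Reporter's `eval` value must match the experiment name given under `Describer` (here `demo`).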


In [3]:
exec = Executor(config=config_file)
exec.run()

Now is Loader with adult...
Now is Describer with demo...
Now is Reporter with save_report_global...
Now is User Story C-1[Report]_demo_[global] save to csv...
Now is Reporter with save_report_columnwise...
Now is User Story C-1[Report]_demo_[columnwise] save to csv...
Now is Reporter with save_report_pairwise...
Now is User Story C-1[Report]_demo_[pairwise] save to csv...


In [4]:
pp.pprint(exec.get_result())

{'Loader[adult]_Describer[demo]_Reporter[save_report_global]': {'demo_[global]':                          full_expt_name Loader      Describer  demo_row_count  \
0  Loader[adult]_Describerdemo_[global]  adult  demo_[global]           48842   

   demo_col_count  demo_na_count  
0              15           3620  },
 'Loader[adult]_Describer[demo]_Reporter[save_report_columnwise]': {'demo_[columnwise]':                               full_expt_name Loader          Describer  \
0   Loader[adult]_Describerdemo_[columnwise]  adult  demo_[columnwise]   
1   Loader[adult]_Describerdemo_[columnwise]  adult  demo_[columnwise]   
2   Loader[adult]_Describerdemo_[columnwise]  adult  demo_[columnwise]   
3   Loader[adult]_Describerdemo_[columnwise]  adult  demo_[columnwise]   
4   Loader[adult]_Describerdemo_[columnwise]  adult  demo_[columnwise]   
5   Loader[adult]_Describerdemo_[columnwise]  adult  demo_[columnwise]   
6   Loader[adult]_Describerdemo_[columnwise]  adult  demo_[columnwise]   
7  

---

## User Story C-2
**Given data evaluating**

Given an original dataset and its privacy-enhanced counterpart, the evaluation module produces a report covering default (general-purpose) metrics of privacy risk and utility.

### User Story C-2a

C-2a demonstrates an Evaluator method that compares "original data" with "synthetic data", for instance `method = 'default'` or the SDMetrics tools whose method names start with `'sdmetrics-'`.
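As an illustration of what an original-versus-synthetic comparison measures, the sketch below computes the complement of the total variation distance for a single categorical column. To my understanding this is the kind of statistic SDMetrics applies to categorical columns when scoring column shapes, but the code here is a standard-library sketch with a hypothetical function name and toy data, not what PETsARD or SDMetrics actually runs.

```python
from collections import Counter


def tv_complement(real: list, synthetic: list) -> float:
    """Return 1 - total variation distance between two categorical samples."""
    real_freq = Counter(real)
    syn_freq = Counter(synthetic)
    categories = set(real_freq) | set(syn_freq)
    # Counter returns 0 for categories missing from one side.
    tvd = 0.5 * sum(
        abs(real_freq[c] / len(real) - syn_freq[c] / len(synthetic))
        for c in categories
    )
    return 1.0 - tvd


real_col = ["Private", "Private", "State-gov", "State-gov"]
syn_col = ["Private", "Private", "Private", "State-gov"]
score = tv_complement(real_col, syn_col)
print(score)
```

A score of 1.0 means the two category distributions are identical; lower values indicate that the synthetic column over- or under-represents some categories.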

In [5]:
config_file = '../yaml/User Story C-2a.yaml'

with open(config_file, 'r') as yaml_file:
    yaml_raw: dict = yaml.safe_load(yaml_file)
pp.pprint(yaml_raw)

{'Loader': {'adult_ori': {'filepath': 'benchmark/adult-income_ori.csv'}},
 'Synthesizer': {'custom': {'method': 'custom_data',
                            'filepath': 'benchmark/adult-income_syn.csv'}},
 'Evaluator': {'demo': {'method': 'default'}},
 'Reporter': {'save_report_global': {'method': 'save_report',
                                     'output': 'User Story C-2a',
                                     'eval': 'demo',
                                     'granularity': 'global'}}}


In [6]:
exec = Executor(config=config_file)
exec.run()

Now is Loader with adult_ori...
Now is Synthesizer with custom...
Now is Evaluator with demo...
[I 20241108 10:31:00] age changes data dtype from float32 to int8 for metadata alignment.
[I 20241108 10:31:00] workclass changes data dtype from category[object] to category[object] for metadata alignment.
[I 20241108 10:31:00] fnlwgt changes data dtype from float32 to int32 for metadata alignment.
[I 20241108 10:31:00] education changes data dtype from category[object] to category[object] for metadata alignment.
[I 20241108 10:31:00] educational-num changes data dtype from float32 to int8 for metadata alignment.
[I 20241108 10:31:00] marital-status changes data dtype from category[object] to category[object] for metadata alignment.
[I 20241108 10:31:00] occupation changes data dtype from category[object] to category[object] for metadata alignment.
[I 20241108 10:31:00] relationship changes data dtype from category[object] to category[object] for metadata alignment.
[I 20241108 10:31:00] ra

  contingency_real = real.groupby(list(columns), dropna=False).size() / len(real)
  contingency_synthetic = synthetic.groupby(list(columns), dropna=False).size() / len(

(2/2) Evaluating Column Pair Trends: |██████████| 105/105 [00:00<00:00, 282.40it/s]|
Column Pair Trends Score: 60.36%

Overall Score (Average): 77.4%

Now is Reporter with save_report_global...
Now is User Story C-2a[Report]_demo_[global] save to csv...



In [7]:
pp.pprint(exec.get_result())

{'Loader[adult_ori]_Synthesizer[custom]_Evaluator[demo]_Reporter[save_report_global]': {'demo_[global]':                                            full_expt_name     Loader  \
result  Loader[adult_ori]_Synthesizer[custom]_Evaluato...  adult_ori   

       Synthesizer      Evaluator  demo_Score  demo_Column Shapes  \
result      custom  demo_[global]    0.774043            0.944469   

        demo_Column Pair Trends  
result                 0.603616  }}
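The figures above are consistent with `demo_Score` being the plain average of the two property scores, as the "Overall Score (Average)" log line suggests. A quick arithmetic check, using only the values printed in the result:

```python
from statistics import mean

# Property scores copied from the result table above.
property_scores = {"Column Shapes": 0.944469, "Column Pair Trends": 0.603616}

overall = mean(list(property_scores.values()))
print(f"Overall score: {overall:.1%}")
```

Because the overall score is an unweighted mean, a weak Column Pair Trends score pulls the total down even when the individual column distributions match well.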


### User Story C-2b


C-2b demonstrates an Evaluator method that takes three datasets: the original data used in synthesis (abbreviated as ori), the original data not used in synthesis (abbreviated as control), and the synthesized data (abbreviated as syn). Examples are the Anonymeter tools whose `method` starts with `'anonymeter-'`.
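The control set matters because some attacks succeed purely from population-level patterns, without any leakage from the synthetic data. A common way to account for this, and to my understanding the idea behind Anonymeter's risk figure, is to normalise the attack success rate on ori by the success rate on control. The sketch below shows that arithmetic; the function name, the clamping at zero, and the sample rates are assumptions made for illustration.

```python
def privacy_risk(attack_rate: float, control_rate: float) -> float:
    """Excess attack success on training records, relative to the control baseline."""
    if not 0.0 <= attack_rate <= 1.0:
        raise ValueError("attack_rate must be within [0, 1]")
    if not 0.0 <= control_rate < 1.0:
        raise ValueError("control_rate must be within [0, 1)")
    # Success that is also achieved on control records is not attributed to leakage.
    return max(0.0, (attack_rate - control_rate) / (1.0 - control_rate))


# Hypothetical rates: 30% of attacks succeed on ori, 10% on control.
risk = privacy_risk(attack_rate=0.30, control_rate=0.10)
print(round(risk, 4))
```

When the attack works no better on ori than on control, the risk is zero: the synthetic data reveals nothing beyond what the general population already does.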

In [8]:
config_file = '../yaml/User Story C-2b.yaml'

with open(config_file, 'r') as yaml_file:
    yaml_raw: dict = yaml.safe_load(yaml_file)
pp.pprint(yaml_raw)

{'Splitter': {'custom': {'method': 'custom_data', 'filepath': {...}}},
 'Synthesizer': {'custom': {'method': 'custom_data',
                            'filepath': 'benchmark/adult-income_syn.csv'}},
 'Evaluator': {'demo': {'method': 'default'},
               'anony-singling': {'method': 'anonymeter-singlingout'}},
 'Reporter': {'save_report_global_demo': {'method': 'save_report',
                                          'output': 'User Story C-2b',
                                          'eval': 'demo',
                                          'granularity': 'global'},
              'save_report_global_anony': {'method': 'save_report',
                                           'output': 'User Story C-2b',
                                           'eval': 'anony-singling',
                                           'granularity': 'global'}}}


In [9]:
exec = Executor(config=config_file)
exec.run()

Now is Splitter with custom_[1-1]...


Now is Synthesizer with custom...
Now is Evaluator with demo...
[I 20241108 10:31:00] age changes data dtype from float32 to int8 for metadata alignment.
[I 20241108 10:31:00] workclass changes data dtype from category[object] to category[object] for metadata alignment.
[I 20241108 10:31:00] fnlwgt changes data dtype from float32 to int32 for metadata alignment.
[I 20241108 10:31:00] education changes data dtype from category[object] to category[object] for metadata alignment.
[I 20241108 10:31:00] educational-num changes data dtype from float32 to int8 for metadata alignment.
[I 20241108 10:31:00] marital-status changes data dtype from category[object] to category[object] for metadata alignment.
[I 20241108 10:31:01] occupation changes data dtype from category[object] to category[object] for metadata alignment.
[I 20241108 10:31:01] relationship changes data dtype from category[object] to category[object] for metadata alignment.
[I 20241108 10:31:01] race changes data dtype from categ

  contingency_real = real.groupby(list(columns), dropna=False).size() / len(real)
  contingency_synthetic = synthetic.groupby(list(columns), dropna=False).size() / len(

(2/2) Evaluating Column Pair Trends: |██████████| 105/105 [00:00<00:00, 280.06it/s]|
Column Pair Trends Score: 60.36%

Overall Score (Average): 77.4%

Now is Reporter with save_report_global_demo...
Now is User Story C-2b[Report]_demo_[global] save to csv...
Now is Reporter with save_report_global_anony...
Now is Evaluator with anony-singling...
[I 20241108 10:31:01] age changes data dtype from float32 to int8 for metadata alignment.
[I 20241108 10:31:01] workclass changes data dtype from category[object] to category[object] for metadata alignment.
[I 20241108 10:31:01] fnlwgt changes data dtype from float32 to int32 for metadata alignment.
[I 20241108 10:31:01] education changes data dtype from category[object] to category[object] for metadata alignment.
[I 20241108 10:31:01] educational-num changes data dtype from float32 to int8 for metadata alignment.
[I 20241108 10:31:01] marital-status changes data dtype from category[object] to category[object] for metadata alignment.
[I 2024110


[W 20241108 10:31:04] Found 1622 failed queries out of 2000. Check DEBUG messages for more details.
[W 20241108 10:47:34] Reached maximum number of attempts 500000 when generating singling out queries. Returning 259 instead of the requested 2000.To avoid this, increase the number of attempts or set it to ``None`` to disable The limitation entirely.
[W 20241108 10:47:34] Attack `multivariate` could generate only 259 singling out queries out of the requested 2000. This can probably lead to an underestimate of the singling out risk.
Now is Reporter with save_report_global_demo...
Now is Reporter with save_report_global_anony...
Now is User Story C-2b[Report]_anony-singling_[global] save to csv...


In [10]:
pp.pprint(exec.get_result())

{'Splitter[custom_[1-1]]_Synthesizer[custom]_Evaluator[demo]_Reporter[save_report_global_demo]': {'demo_[global]':                                            full_expt_name      Splitter  \
result  Splitter[custom_[1-1]]_Synthesizer[custom]_Eva...  custom_[1-1]   

       Synthesizer      Evaluator  demo_Score  demo_Column Shapes  \
result      custom  demo_[global]    0.774043            0.944469   

        demo_Column Pair Trends  
result                 0.603616  },
 'Splitter[custom_[1-1]]_Synthesizer[custom]_Evaluator[demo]_Reporter[save_report_global_anony]': {},
 'Splitter[custom_[1-1]]_Synthesizer[custom]_Evaluator[anony-singling]_Reporter[save_report_global_demo]': {},
 'Splitter[custom_[1-1]]_Synthesizer[custom]_Evaluator[anony-singling]_Reporter[save_report_global_anony]': {'anony-singling_[global]':                                            full_expt_name      Splitter  \
result  Splitter[custom_[1-1]]_Synthesizer[custom]_Eva...  custom_[1-1]   

       Synthesizer       
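The result dictionary is keyed by the full experiment name, and combinations where the Reporter's `eval` does not match the Evaluator come back as empty dictionaries. The sketch below shows one way to drop the empty entries and split a key into its module and experiment parts. Both helpers are hypothetical conveniences rather than part of PETsARD, and the parser assumes that module names start with a capital letter, as they do in the keys above.

```python
import re


def parse_experiment_key(key: str) -> dict:
    """Split 'Module[name]_Module[name]...' into a {module: name} mapping."""
    body = key[:-1] if key.endswith("]") else key
    parsed = {}
    # Split on the "]_" that precedes the next capitalised module name, so
    # nested brackets such as "custom_[1-1]" stay intact.
    for part in re.split(r"\]_(?=[A-Z])", body):
        module, _, name = part.partition("[")
        parsed[module] = name
    return parsed


def drop_empty(results: dict) -> dict:
    """Keep only the experiments that actually produced a report."""
    return {key: value for key, value in results.items() if value}


# Stand-in results: the report values are placeholders, not real DataFrames.
results = {
    "Splitter[custom_[1-1]]_Synthesizer[custom]_Evaluator[demo]_Reporter[save_report_global_demo]": {
        "demo_[global]": "report placeholder"
    },
    "Splitter[custom_[1-1]]_Synthesizer[custom]_Evaluator[demo]_Reporter[save_report_global_anony]": {},
}

non_empty = drop_empty(results)
for key in non_empty:
    print(parse_experiment_key(key))
```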

---

## User Story C-3
**Given data customized evaluating**

Following User Story C-2, if specific types of metrics are set or a customized evaluation script is provided, the module creates a customized evaluation report.

This aligns with User Story 7 of the Spec.
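To give a feel for what a customized evaluation script might contain, here is a toy evaluator that scores how many original columns also appear in the synthetic data. Everything about its shape is an assumption made for illustration: the class name, the method names (`create`, `eval`, `get_global`, `get_columnwise`), and the layout of `data`. Check the PETsARD documentation for the interface a `custom_method` script must actually expose.

```python
class ColumnCoverageEvaluator:
    """Toy metric: share of original columns that also exist in the synthetic data."""

    def __init__(self, config: dict):
        self.config = config
        self.data = {}
        self.result = {}

    def create(self, data: dict) -> None:
        # Assumed layout: {"ori": [row_dict, ...], "syn": [row_dict, ...]}
        self.data = data

    def eval(self) -> None:
        ori_cols = set(self.data["ori"][0])
        syn_cols = set(self.data["syn"][0])
        # Record, per original column, whether the synthetic data has it too.
        self.result = {col: col in syn_cols for col in sorted(ori_cols)}

    def get_global(self) -> dict:
        covered = sum(self.result.values())
        return {"score": 100 * covered / len(self.result)}

    def get_columnwise(self) -> dict:
        return {col: {"score": 100 if ok else 0} for col, ok in self.result.items()}


evaluator = ColumnCoverageEvaluator(config={})
evaluator.create({
    "ori": [{"age": 39, "income": "<=50K"}],
    "syn": [{"age": 41, "income": ">50K"}],
})
evaluator.eval()
print(evaluator.get_global())
print(evaluator.get_columnwise())
```

Separating the evaluation step from the per-granularity getters mirrors the global, columnwise, and pairwise reports requested in the configuration; a pairwise getter would follow the same pattern over column pairs.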

In [11]:
config_file = '../yaml/User Story C-3.yaml'

with open(config_file, 'r') as yaml_file:
    yaml_raw: dict = yaml.safe_load(yaml_file)
pp.pprint(yaml_raw)

{'Loader': {'adult_ori': {'filepath': 'benchmark/adult-income_ori.csv'}},
 'Synthesizer': {'custom': {'method': 'custom_data',
                            'filepath': 'benchmark/adult-income_syn.csv'}},
 'Evaluator': {'custom': {'method': 'custom_method', 'custom_method': {...}}},
 'Reporter': {'save_report_global': {'method': 'save_report',
                                     'output': 'User Story C-3',
                                     'eval': 'custom',
                                     'granularity': 'global'},
              'save_report_columnwise': {'method': 'save_report',
                                         'output': 'User Story C-3',
                                         'eval': 'custom',
                                         'granularity': 'columnwise'},
              'save_report_pairwise': {'method': 'save_report',
                                       'output': 'User Story C-3',
                                       'eval': 'custom',
                    

In [12]:
exec = Executor(config=config_file)
exec.run()

Now is Loader with adult_ori...
Now is Synthesizer with custom...
Now is Evaluator with custom...
[I 20241108 10:47:54] age changes data dtype from float32 to int8 for metadata alignment.
[I 20241108 10:47:54] workclass changes data dtype from category[object] to category[object] for metadata alignment.
[I 20241108 10:47:54] fnlwgt changes data dtype from float32 to int32 for metadata alignment.
[I 20241108 10:47:54] education changes data dtype from category[object] to category[object] for metadata alignment.
[I 20241108 10:47:54] educational-num changes data dtype from float32 to int8 for metadata alignment.
[I 20241108 10:47:54] marital-status changes data dtype from category[object] to category[object] for metadata alignment.
[I 20241108 10:47:54] occupation changes data dtype from category[object] to category[object] for metadata alignment.
[I 20241108 10:47:54] relationship changes data dtype from category[object] to category[object] for metadata alignment.
[I 20241108 10:47:54] 

In [13]:
pp.pprint(exec.get_result())

{'Loader[adult_ori]_Synthesizer[custom]_Evaluator[custom]_Reporter[save_report_global]': {'custom_[global]':                                            full_expt_name     Loader  \
result  Loader[adult_ori]_Synthesizer[custom]_Evaluato...  adult_ori   

       Synthesizer        Evaluator  custom_score  
result      custom  custom_[global]           100  },
 'Loader[adult_ori]_Synthesizer[custom]_Evaluator[custom]_Reporter[save_report_columnwise]': {'custom_[columnwise]':                                        full_expt_name     Loader Synthesizer  \
0   Loader[adult_ori]_Synthesizer[custom]_Evaluato...  adult_ori      custom   
1   Loader[adult_ori]_Synthesizer[custom]_Evaluato...  adult_ori      custom   
2   Loader[adult_ori]_Synthesizer[custom]_Evaluato...  adult_ori      custom   
3   Loader[adult_ori]_Synthesizer[custom]_Evaluato...  adult_ori      custom   
4   Loader[adult_ori]_Synthesizer[custom]_Evaluato...  adult_ori      custom   
5   Loader[adult_ori]_Synthesizer[custom]_E