# PETsARD Adapter Tests

This notebook tests the functionality of each Adapter, with settings adjusted to match those in test.ipynb.

## LoaderAdapter Tests

In [22]:
from petsard.adapter import LoaderAdapter

# Test loading benchmark data
loader_adpt = LoaderAdapter(config={"filepath": "benchmark://adult-income"})

# Run the loader
loader_adpt.run(input={})

# Retrieve the results
data = loader_adpt.get_result()
meta = loader_adpt.get_metadata()

print(f"data: {type(data)}, meta: {type(meta)}")
print(f"data shape: {data.shape}")
print(f"Schema id: {meta.id}")
print(f"Schema name: {meta.name}")

data: <class 'pandas.core.frame.DataFrame'>, meta: <class 'petsard.metadater.metadata.Schema'>
data shape: (48842, 15)
Schema id: adult-income
Schema name: adult-income.csv


In [23]:
# Test loading a local file
loader_adpt_local = LoaderAdapter(config={"filepath": "benchmark/adult-income.csv"})

loader_adpt_local.run(input={})

data_local = loader_adpt_local.get_result()
meta_local = loader_adpt_local.get_metadata()

print(f"data: {type(data_local)}, meta: {type(meta_local)}")
print(f"data shape: {data_local.shape}")

data: <class 'pandas.core.frame.DataFrame'>, meta: <class 'petsard.metadater.metadata.Schema'>
data shape: (48842, 15)


## Simplify: keep only the first 100 rows

In [24]:
data = data.loc[0:99, :]
data.shape

(100, 15)

## SplitterAdapter Tests

In [25]:
from petsard.adapter import SplitterAdapter

# Test data splitting, using the same configuration as test.ipynb
splitter_adpt = SplitterAdapter(config={"num_samples": 5, "train_split_ratio": 0.8})

# Prepare the input data
splitter_input = {"data": data, "metadata": meta, "exclude_index": []}

splitter_adpt.run(input=splitter_input)

# Get the split results (the first split is used)
split_results = splitter_adpt.get_result()
split_meta = splitter_adpt.get_metadata()

# Get the training and test sets of the first split (only the first split is accessible)
train_data = split_results["train"]
test_data = split_results["validation"]

print(f"train_data: {type(train_data)}, test_data: {type(test_data)}")
print(f"train_data shape: {train_data.shape}, test_data shape: {test_data.shape}")

train_data: <class 'pandas.core.frame.DataFrame'>, test_data: <class 'pandas.core.frame.DataFrame'>
train_data shape: (80, 15), test_data shape: (20, 15)
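
The 80/20 row counts above follow from `train_split_ratio=0.8` applied to the 100-row subset. A minimal sketch of that arithmetic, assuming a simple random holdout split; the helper `holdout_split` and the seed are hypothetical, and this is not PETsARD's actual implementation:

```python
import random


def holdout_split(n_rows: int, train_split_ratio: float, seed: int = 0):
    """Return (train_idx, validation_idx) for a random holdout split."""
    rng = random.Random(seed)
    n_train = round(n_rows * train_split_ratio)
    # Sample row positions for the training set without replacement.
    train_idx = sorted(rng.sample(range(n_rows), n_train))
    train_set = set(train_idx)
    # Every remaining row goes to the validation set.
    validation_idx = [i for i in range(n_rows) if i not in train_set]
    return train_idx, validation_idx


train_idx, validation_idx = holdout_split(n_rows=100, train_split_ratio=0.8)
print(len(train_idx), len(validation_idx))  # 80 20
```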


In [26]:
# Test splitting with custom data
splitter_adpt_custom = SplitterAdapter(
    config={
        "method": "custom_data",
        "filepath": {
            "ori": "benchmark://adult-income_ori",
            "control": "benchmark://adult-income_control",
        },
    }
)

# Custom data does not require input data
splitter_adpt_custom.run(input={"exclude_index": []})

custom_split_result = splitter_adpt_custom.get_result()
print(f"Custom split result keys: {custom_split_result.keys()}")

Custom split result keys: dict_keys(['train', 'validation'])


## PreprocessorAdapter Tests

In [27]:
from petsard.adapter import PreprocessorAdapter

# Test default preprocessing
preproc_adpt = PreprocessorAdapter(config={"method": "default"})

preproc_input = {"data": train_data, "metadata": meta}

preproc_adpt.run(input=preproc_input)

default_preproc_data = preproc_adpt.get_result()
preproc_meta = preproc_adpt.get_metadata()

print(f"default_preproc_data shape: {default_preproc_data.shape}")

default_preproc_data shape: (39, 15)


In [28]:
# Test missing-value handling only
preproc_adpt_missing = PreprocessorAdapter(
    config={
        "method": "custom",
        "config": {
            "missing": {
                "age": "missing_mean",
            },
        },
        "sequence": ["missing"],
    }
)

preproc_adpt_missing.run(input=preproc_input)

missing_preproc_data = preproc_adpt_missing.get_result()

print(f"missing_preproc_data shape: {missing_preproc_data.shape}")
print("Age column after missing value processing:")
print(missing_preproc_data["age"].head(10))

missing_preproc_data shape: (80, 15)
Age column after missing value processing:
0    25
1    38
2    18
3    63
4    24
5    55
6    65
7    36
8    26
9    58
Name: age, dtype: int64


In [29]:
# Test outlier handling only
preproc_adpt_outlier = PreprocessorAdapter(
    config={
        "method": "custom",
        "config": {
            "outlier": {
                "age": "outlier_zscore",
            },
        },
        "sequence": ["outlier"],
    }
)

preproc_adpt_outlier.run(input=preproc_input)

outlier_preproc_data = preproc_adpt_outlier.get_result()

print(f"outlier_preproc_data shape: {outlier_preproc_data.shape}")
print("Age column after outlier processing:")
print(outlier_preproc_data["age"].head(10))

outlier_preproc_data shape: (39, 15)
Age column after outlier processing:
0    25
1    38
2    18
3    24
4    26
5    58
6    20
7    43
8    37
9    34
Name: age, dtype: int64


In [30]:
# Test encoding only
preproc_adpt_encoder = PreprocessorAdapter(
    config={
        "method": "custom",
        "config": {
            "encoder": {
                "workclass": "encoder_onehot",
            },
        },
        "sequence": ["encoder"],
    }
)

preproc_adpt_encoder.run(input=preproc_input)

encoder_preproc_data = preproc_adpt_encoder.get_result()

print(f"encoder_preproc_data shape: {encoder_preproc_data.shape}")
print("Workclass columns after encoding:")
workclass_cols = [
    col for col in encoder_preproc_data.columns if col.startswith("workclass_")
]
print(encoder_preproc_data[workclass_cols].head(10))

encoder_preproc_data shape: (80, 20)
Workclass columns after encoding:
   workclass_Federal-gov  workclass_Local-gov  workclass_Private  \
0                    0.0                  0.0                1.0   
1                    0.0                  0.0                1.0   
2                    0.0                  0.0                0.0   
3                    0.0                  0.0                0.0   
4                    0.0                  0.0                1.0   
5                    0.0                  0.0                1.0   
6                    0.0                  0.0                1.0   
7                    1.0                  0.0                0.0   
8                    0.0                  0.0                1.0   
9                    0.0                  0.0                0.0   

   workclass_Self-emp-inc  workclass_Self-emp-not-inc  workclass_State-gov  
0                     0.0                         0.0                  0.0  
1                     0.0 
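
The column count of 20 is consistent with a one-hot encoding that drops one category: the 15 original columns lose `workclass` and gain 6 indicator columns. A minimal sketch of that arithmetic in plain Python, assuming the encoder drops the first category in sorted order (here `?`); which category PETsARD actually drops is an assumption, and `one_hot_drop_first` is a hypothetical helper:

```python
def one_hot_drop_first(values, prefix):
    """One-hot encode `values`, dropping the first category in sorted order."""
    categories = sorted(set(values))
    kept = categories[1:]  # the dropped category is encoded as all zeros
    columns = [f"{prefix}_{cat}" for cat in kept]
    rows = [[1.0 if v == cat else 0.0 for cat in kept] for v in values]
    return columns, rows


# Seven categories: the six shown in the output plus the assumed dropped one.
workclass = ["Private", "?", "Federal-gov", "Local-gov",
             "Self-emp-inc", "Self-emp-not-inc", "State-gov"]
columns, rows = one_hot_drop_first(workclass, prefix="workclass")

print(columns)
print(rows[1])                 # "?" maps to all zeros
print(15 - 1 + len(columns))   # 20
```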

In [31]:
# Test scaling only
preproc_adpt_scaler = PreprocessorAdapter(
    config={
        "method": "custom",
        "config": {
            "scaler": {
                "age": "scaler_minmax",
            },
        },
        "sequence": ["scaler"],
    }
)

preproc_adpt_scaler.run(input=preproc_input)

scaler_preproc_data = preproc_adpt_scaler.get_result()

print(f"scaler_preproc_data shape: {scaler_preproc_data.shape}")
print("Age column after scaling:")
print(scaler_preproc_data["age"].head(10))

scaler_preproc_data shape: (80, 15)
Age column after scaling:
0    0.145455
1    0.381818
2    0.018182
3    0.836364
4    0.127273
5    0.690909
6    0.872727
7    0.345455
8    0.163636
9    0.745455
Name: age, dtype: float64
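
Min-max scaling maps each value to `(x - min) / (max - min)`. The scaled values above are consistent with a minimum age of 17 and a maximum of 72 in the training data; both bounds are inferred from the printed output rather than stated in the notebook, and `min_max_scale` is a hypothetical helper, not part of PETsARD:

```python
def min_max_scale(x: float, lo: float, hi: float) -> float:
    """Scale x linearly so that lo maps to 0 and hi maps to 1."""
    return (x - lo) / (hi - lo)


# Assumed bounds, inferred from the printed output (not given in the notebook).
AGE_MIN, AGE_MAX = 17, 72

for age in (25, 38, 18):
    print(age, round(min_max_scale(age, AGE_MIN, AGE_MAX), 6))
# 25 0.145455
# 38 0.381818
# 18 0.018182
```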


## SynthesizerAdapter Tests

In [32]:
from petsard.adapter import SynthesizerAdapter

# Test the synthesizer with the default method
synth_adpt = SynthesizerAdapter(config={"method": "default"})

synth_input = {"data": default_preproc_data, "metadata": preproc_meta}

synth_adpt.run(input=synth_input)

syn_data = synth_adpt.get_result()

print(f"syn_data shape: {syn_data.shape}")
print(f"syn_data type: {type(syn_data)}")

syn_data shape: (39, 15)
syn_data type: <class 'pandas.core.frame.DataFrame'>


## PostprocessorAdapter Tests

In [33]:
from petsard.adapter import PostprocessorAdapter


# Test postprocessing
postproc_adpt = PostprocessorAdapter(config={"method": "default"})

postproc_input = {
    "data": syn_data,
    "preprocessor": preproc_adpt.processor,
}

postproc_adpt.run(input=postproc_input)

postproc_data = postproc_adpt.get_result()

print(f"postproc_data shape: {postproc_data.shape}")
print(f"postproc_data type: {type(postproc_data)}")

postproc_data shape: (39, 15)
postproc_data type: <class 'pandas.core.frame.DataFrame'>


In [34]:
syn_data.head(6)

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,0.926637,0.359285,-0.641638,0.192437,-0.367067,0.311414,0.374914,0.239188,0.064783,0.178628,-0.206788,-0.159976,0.417968,0.118736,0.218596
1,-1.23475,0.958577,-1.076748,0.409309,-0.016214,0.954452,0.099444,0.331003,0.691053,0.03732,-0.206788,-0.159976,-0.442144,0.001253,0.480117
2,-0.474929,0.475705,-1.134067,0.546928,-0.469547,0.647718,0.584887,0.984289,0.17013,0.747837,-0.206788,-0.159976,-0.538532,0.140874,0.263473
3,-1.324611,0.50147,-0.895602,0.516445,0.069241,0.210802,0.564291,0.807002,0.147331,0.183251,-0.206788,-0.159976,-0.077975,0.631607,0.342356
4,0.78636,0.138399,1.790982,0.035266,-1.059852,0.688465,0.767122,0.091359,0.50537,0.025802,-0.206788,-0.159976,-0.438297,0.072877,0.498143
5,-1.35242,0.699954,-1.237256,0.559875,0.037777,0.773665,0.327584,0.231884,0.767578,0.078699,-0.206788,-0.159976,-0.69227,0.401531,0.223814


In [35]:
postproc_data.head(6)

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,51,Private,127358,HS-grad,8,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,44,United-States,<=50K
1,19,State-gov,82475,Some-college,9,Widowed,Other-service,Husband,White,Male,0,0,33,United-States,<=50K
2,30,Private,76562,Some-college,8,Never-married,Adm-clerical,Wife,White,Female,0,0,32,United-States,<=50K
3,18,Private,101161,Some-college,9,Married-civ-spouse,Adm-clerical,Own-child,White,Male,0,0,38,United-States,<=50K
4,49,Private,378293,HS-grad,7,Never-married,?,Husband,White,Male,0,0,33,United-States,<=50K
5,18,Self-emp-inc,65918,Some-college,9,Never-married,Machine-op-inspct,Husband,White,Male,0,0,30,United-States,<=50K


## ConstrainerAdapter Tests

In [36]:
from petsard.adapter import ConstrainerAdapter


# Test the constrainer with an empty configuration
constr_adpt = ConstrainerAdapter(config={})

constr_input = {"data": postproc_data}

constr_adpt.run(input=constr_input)

cnst_data = constr_adpt.get_result()

print(f"cnst_data shape: {cnst_data.shape}")
print(f"cnst_data type: {type(cnst_data)}")

cnst_data shape: (39, 15)
cnst_data type: <class 'pandas.core.frame.DataFrame'>


In [37]:
# Test the constrainer using resample_until_satisfy
satisfy_data = constr_adpt.constrainer.resample_until_satisfy(
    data=postproc_data,
    target_rows=postproc_data.shape[0],
    synthesizer=synth_adpt.synthesizer,
    max_trials=300,
    sampling_ratio=10.0,
    verbose_step=10,
)

print(f"satisfy_data shape: {satisfy_data.shape}")

satisfy_data shape: (39, 15)
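
The arguments passed to `resample_until_satisfy` suggest a rejection-sampling loop: oversample by `sampling_ratio`, keep the rows that satisfy the constraints, and repeat up to `max_trials` times until `target_rows` rows are collected. A minimal sketch of that idea under those assumptions; `resample_until_enough`, `sample_fn`, and `is_valid` are hypothetical names, and this is not the PETsARD implementation:

```python
import random
from typing import Callable, List


def resample_until_enough(
    sample_fn: Callable[[int], List[int]],
    is_valid: Callable[[int], bool],
    target_rows: int,
    max_trials: int = 300,
    sampling_ratio: float = 10.0,
) -> List[int]:
    """Collect valid rows by repeated oversampling and filtering."""
    kept: List[int] = []
    for _ in range(max_trials):
        if len(kept) >= target_rows:
            break
        # Draw more rows than needed, then keep only those that pass the check.
        batch = sample_fn(int(target_rows * sampling_ratio))
        kept.extend(row for row in batch if is_valid(row))
    # May return fewer than target_rows if max_trials is exhausted.
    return kept[:target_rows]


rng = random.Random(0)
ages = resample_until_enough(
    sample_fn=lambda n: [rng.randint(0, 100) for _ in range(n)],
    is_valid=lambda age: age >= 18,  # stand-in constraint
    target_rows=39,
)
print(len(ages))  # 39
```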


## EvaluatorAdapter Tests

In [38]:
from petsard.adapter import EvaluatorAdapter

# Test the evaluator with the default method
eval_adpt = EvaluatorAdapter(config={"method": "default"})

eval_input = {
    "data": {
        "ori": train_data,
        "control": test_data,
        "syn": satisfy_data,
    },
    "schema": meta,
}

eval_adpt.run(input=eval_input)

eval_result = eval_adpt.get_result()

print(f"Evaluation result keys: {eval_result.keys()}")
print("\nGlobal evaluation results:")
print(eval_result["global"].head(1))
print("\nColumnwise evaluation results:")
print(eval_result["columnwise"].head(1))
print("\nPairwise evaluation results:")
print(eval_result["pairwise"].head(1))

Generating report ...

(1/2) Evaluating Column Shapes: |██████████| 15/15 [00:00<00:00, 2367.79it/s]|
Column Shapes Score: 78.58%

(2/2) Evaluating Column Pair Trends: |██████████| 105/105 [00:00<00:00, 821.12it/s]|
Column Pair Trends Score: 49.82%

Overall Score (Average): 64.2%

Evaluation result keys: dict_keys(['global', 'columnwise', 'pairwise'])

Global evaluation results:
        Score  Column Shapes  Column Pair Trends
result   0.64           0.79                 0.5

Columnwise evaluation results:
          Property        Metric     Score
age  Column Shapes  KSComplement  0.652244

Pairwise evaluation results:
                         Property                 Metric     Score  \
age workclass  Column Pair Trends  ContingencySimilarity  0.544551   

               Real Correlation  Synthetic Correlation Error  
age workclass               NaN                    NaN  None  
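
The overall score in the report is the plain average of the two property scores printed above. A quick check of that arithmetic (the variable names are illustrative only):

```python
column_shapes = 78.58       # Column Shapes Score (%), from the output above
column_pair_trends = 49.82  # Column Pair Trends Score (%), from the output above

overall = (column_shapes + column_pair_trends) / 2
print(f"{overall:.1f}%")  # 64.2%
```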


In [39]:
# Test the evaluator with MLUtility classification
eval_adpt_mlutility = EvaluatorAdapter(
    config={
        "method": "mlutility-classification",
        "target": "income",
    }
)

eval_input_anon = {
    "data": {
        "ori": train_data,
        "control": test_data,
        "syn": satisfy_data,
    },
    "schema": meta,
}

eval_adpt_mlutility.run(input=eval_input_anon)

eval_result_mlutility = eval_adpt_mlutility.get_result()

print(eval_result_mlutility["global"].head(1))

The cardinality of the column workclass is too high. Ori: Over row numbers 80, column cardinality 7. Syn: Over row numbers 39, column cardinality 6. The column workclass is removed.
The cardinality of the column education is too high. Ori: Over row numbers 80, column cardinality 12. Syn: Over row numbers 39, column cardinality 3. The column education is removed.
The cardinality of the column marital-status is too high. Ori: Over row numbers 80, column cardinality 6. Syn: Over row numbers 39, column cardinality 5. The column marital-status is removed.
The cardinality of the column occupation is too high. Ori: Over row numbers 80, column cardinality 13. Syn: Over row numbers 39, column cardinality 9. The column occupation is removed.
The cardinality of the column relationship is too high. Ori: Over row numbers 80, column cardinality 5. Syn: Over row numbers 39, column cardinality 5. The column relationship is removed.


   ori_mean  ori_std  syn_mean  syn_std  diff
0      0.76     0.04       0.7      0.0 -0.06


## DescriberAdapter Tests

In [40]:
from petsard.adapter import DescriberAdapter

# Test the describer
desc_op = DescriberAdapter(config={"method": "default"})

desc_input = {"data": {"data": satisfy_data}}

desc_op.run(input=desc_input)

desc_result = desc_op.get_result()

print(f"Description result type: {type(desc_result)}")
if isinstance(desc_result, dict):
    print(f"Description result keys: {desc_result.keys()}")

    print(desc_result["global"].head(1))
    print(desc_result["columnwise"].head(1))
    print(desc_result["pairwise"].head(1))

Description result type: <class 'dict'>
Description result keys: dict_keys(['global', 'columnwise', 'pairwise'])
   row_count  col_count  na_count
0         39         15         0
      mean  median   std   min   max  kurtosis  skew    q1    q3  na_count  \
age  28.74    23.0  13.7  18.0  69.0      1.89  1.52  18.0  33.5         0   

    nunique  
age    <NA>  
  column1 column2  corr
0     age     age   1.0


## ReporterAdapter Tests

In [41]:
from petsard.adapter import ReporterAdapter

# Test the reporter: save_data
reporter_op_data = ReporterAdapter(
    config={
        "method": "save_data",
        "source": "Postprocessor",
    }
)

report_data_input = {"data": {("Postprocessor", "exp1"): satisfy_data}}

reporter_op_data.run(input=report_data_input)

report_result_data = reporter_op_data.get_result()

print(f"Save data report result: {report_result_data}")

Save data report result: {'Postprocessor[exp1]':     age         workclass  fnlwgt     education  educational-num  \
0    19           Private   79828       HS-grad                9   
1    24           Private   76979  Some-college                9   
2    49           Private  378293       HS-grad                7   
3    34           Private   88681       HS-grad                9   
4    18           Private  213215       HS-grad                9   
5    29           Private  149181  Some-college                7   
6    48           Private  190957       HS-grad                8   
7    18           Private  179257  Some-college                9   
8    42           Private   99869       HS-grad                8   
9    68  Self-emp-not-inc  265063          11th                7   
10   19                 ?  259700       HS-grad                8   
11   18           Private  152910  Some-college                9   
12   39           Private  249446       HS-grad                9   

In [42]:
# Test the reporter: save_report
reporter_op_report = ReporterAdapter(
    config={
        "method": "save_report",
        "granularity": "global",
    }
)

report_report_input = {
    "data": {
        ("Evaluator", "eval1_[global]"): desc_result["global"],
    }
}

reporter_op_report.run(input=report_report_input)

report_result_report = reporter_op_report.get_result()

print(f"Save report result: {report_result_report}")

Save report result: {'[global]':       full_expt_name Evaluator  eval1_row_count  eval1_col_count  \
0  Evaluator[global]  [global]               39               15   

   eval1_na_count  
0               0  }


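## Summary

This notebook shows how to use each Adapter class to run PETsARD's functionality, with settings adjusted to match test.ipynb:

1. **LoaderAdapter**: loads data and metadata
2. **SplitterAdapter**: splits data into training and test sets
3. **PreprocessorAdapter**: preprocesses data (missing values, outliers, encoding, scaling)
4. **SynthesizerAdapter**: synthesizes data
5. **PostprocessorAdapter**: postprocesses the synthetic data
6. **ConstrainerAdapter**: applies data constraints
7. **EvaluatorAdapter**: evaluates synthetic data quality
8. **DescriberAdapter**: describes data characteristics
9. **ReporterAdapter**: generates reports

Every Adapter follows the same pattern:
- Initialize with a configuration
- Prepare the input data
- Call the `run()` method
- Retrieve the result with `get_result()`
- Retrieve the metadata with `get_metadata()` (where applicable)

This design lets each module be tested independently, without relying on a Status object.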
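
The shared pattern above can be captured in a small helper. This is a minimal sketch, not part of PETsARD: `run_adapter` is a hypothetical function, and `EchoAdapter` is a stand-in class that only mimics the method names used in this notebook so the example runs without PETsARD installed.

```python
from typing import Any, Optional, Tuple


def run_adapter(adapter: Any, adapter_input: dict) -> Tuple[Any, Optional[Any]]:
    """Run an adapter and return (result, metadata or None)."""
    adapter.run(input=adapter_input)
    result = adapter.get_result()
    # Not every adapter in this notebook is shown calling get_metadata().
    get_metadata = getattr(adapter, "get_metadata", None)
    metadata = get_metadata() if callable(get_metadata) else None
    return result, metadata


class EchoAdapter:
    """Hypothetical stand-in that returns its input data unchanged."""

    def __init__(self, config: dict):
        self.config = config
        self._result = None

    def run(self, input: dict) -> None:
        self._result = input.get("data")

    def get_result(self):
        return self._result


result, metadata = run_adapter(EchoAdapter(config={}), {"data": [1, 2, 3]})
print(result, metadata)  # [1, 2, 3] None
```

With PETsARD available, the same helper could presumably be called with a real adapter instance, for example `run_adapter(LoaderAdapter(config=...), {})`, though that usage is untested here.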