# PETsARD Adapter Tests

This notebook tests each Adapter in turn, with settings adjusted to match test.ipynb.

## LoaderAdapter Tests

In [1]:
from petsard.adapter import LoaderAdapter

# Test loading a benchmark dataset
loader_adpt = LoaderAdapter(config={"filepath": "benchmark://adult-income"})

# Run the load
loader_adpt.run(input={})

# Fetch the results
data = loader_adpt.get_result()
meta = loader_adpt.get_metadata()

print(f"data: {type(data)}, meta: {type(meta)}")
print(f"data shape: {data.shape}")
print(f"metadata schema_id: {meta.schema_id}")

data: <class 'pandas.core.frame.DataFrame'>, meta: <class 'petsard.metadater.schema.schema_types.SchemaMetadata'>
data shape: (48842, 15)
metadata schema_id: adult-income


In [2]:
# Test loading a local file
loader_adpt_local = LoaderAdapter(config={"filepath": "benchmark/adult-income.csv"})

loader_adpt_local.run(input={})

data_local = loader_adpt_local.get_result()
meta_local = loader_adpt_local.get_metadata()

print(f"data: {type(data_local)}, meta: {type(meta_local)}")
print(f"data shape: {data_local.shape}")

data: <class 'pandas.core.frame.DataFrame'>, meta: <class 'petsard.metadater.schema.schema_types.SchemaMetadata'>
data shape: (48842, 15)


## Simplify: keep only the first 100 rows

In [3]:
data = data.loc[0:99, :]
data.shape

(100, 15)

## SplitterAdapter Tests

In [4]:
from petsard.adapter import SplitterAdapter

# Test data splitting, using the same configuration as test.ipynb
splitter_adpt = SplitterAdapter(config={"num_samples": 5, "train_split_ratio": 0.8})

# Prepare the input data
splitter_input = {"data": data, "metadata": meta, "exclude_index": []}

splitter_adpt.run(input=splitter_input)

# Fetch the split results (the first split is used)
split_results = splitter_adpt.get_result()
split_meta = splitter_adpt.get_metadata()

# Take the train and test sets of the first split (only the first split is accessible here)
train_data = split_results["train"]
test_data = split_results["validation"]

print(f"train_data: {type(train_data)}, test_data: {type(test_data)}")
print(f"train_data shape: {train_data.shape}, test_data shape: {test_data.shape}")

train_data: <class 'pandas.core.frame.DataFrame'>, test_data: <class 'pandas.core.frame.DataFrame'>
train_data shape: (80, 15), test_data shape: (20, 15)
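
The 80/20 row counts above follow from `train_split_ratio=0.8` applied to the 100-row sample. As a rough illustration only, and not PETsARD's actual implementation, a ratio-based split can be sketched in plain Python; `split_indices` and its `seed` parameter are hypothetical names introduced here:

```python
import random


def split_indices(n_rows, train_ratio, seed=0):
    """Shuffle row indices and cut them into train / validation parts.

    Hypothetical helper for illustration; not part of PETsARD.
    """
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_train = int(round(n_rows * train_ratio))
    # Sort each part so the original row order is preserved within it
    return sorted(indices[:n_train]), sorted(indices[n_train:])


train_idx, validation_idx = split_indices(100, 0.8)
print(len(train_idx), len(validation_idx))  # 80 20
```

The two index lists never overlap, so every row lands in exactly one of the two sets.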


In [5]:
# Test a custom data split
splitter_adpt_custom = SplitterAdapter(
    config={
        "method": "custom_data",
        "filepath": {
            "ori": "benchmark://adult-income_ori",
            "control": "benchmark://adult-income_control",
        },
    }
)

# A custom split needs no input data
splitter_adpt_custom.run(input={"exclude_index": []})

custom_split_result = splitter_adpt_custom.get_result()
print(f"Custom split result keys: {custom_split_result.keys()}")

Custom split result keys: dict_keys(['train', 'validation'])


## PreprocessorAdapter Tests

In [6]:
from petsard.adapter import PreprocessorAdapter

# Test default preprocessing
preproc_adpt = PreprocessorAdapter(config={"method": "default"})

preproc_input = {"data": train_data, "metadata": meta}

preproc_adpt.run(input=preproc_input)

default_preproc_data = preproc_adpt.get_result()
preproc_meta = preproc_adpt.get_metadata()

print(f"default_preproc_data shape: {default_preproc_data.shape}")

default_preproc_data shape: (54, 15)


In [7]:
# Test missing-value handling only
preproc_adpt_missing = PreprocessorAdapter(
    config={
        "method": "custom",
        "config": {
            "missing": {
                "age": "missing_mean",
            },
        },
        "sequence": ["missing"],
    }
)

preproc_adpt_missing.run(input=preproc_input)

missing_preproc_data = preproc_adpt_missing.get_result()

print(f"missing_preproc_data shape: {missing_preproc_data.shape}")
print("Age column after missing value processing:")
print(missing_preproc_data["age"].head(10))

missing_preproc_data shape: (80, 15)
Age column after missing value processing:
0    38
1    44
2    18
3    34
4    29
5    63
6    24
7    55
8    65
9    36
Name: age, dtype: int64
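
The row count stays at 80 and `age` is still `int64`, which is consistent with this sample having no missing ages to fill. For reference, mean imputation itself can be sketched as below. This is a plain-Python illustration under the assumption that missing values are represented as `None`; `fill_missing_with_mean` is a hypothetical helper, not the code behind `missing_mean`:

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed entries.

    Hypothetical helper for illustration; not part of PETsARD.
    """
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]


ages = [38, None, 18, 34]
filled = fill_missing_with_mean(ages)
print(filled)  # [38, 30.0, 18, 34]
```

Observed values pass through unchanged; only the gap is filled with the mean of 38, 18 and 34.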


In [8]:
# Test outlier handling only
preproc_adpt_outlier = PreprocessorAdapter(
    config={
        "method": "custom",
        "config": {
            "outlier": {
                "age": "outlier_zscore",
            },
        },
        "sequence": ["outlier"],
    }
)

preproc_adpt_outlier.run(input=preproc_input)

outlier_preproc_data = preproc_adpt_outlier.get_result()

print(f"outlier_preproc_data shape: {outlier_preproc_data.shape}")
print("Age column after outlier processing:")
print(outlier_preproc_data["age"].head(10))

outlier_preproc_data shape: (54, 15)
Age column after outlier processing:
0    38
1    18
2    34
3    29
4    24
5    36
6    26
7    58
8    43
9    20
Name: age, dtype: int64


In [9]:
# Test encoding only
preproc_adpt_encoder = PreprocessorAdapter(
    config={
        "method": "custom",
        "config": {
            "encoder": {
                "workclass": "encoder_onehot",
            },
        },
        "sequence": ["encoder"],
    }
)

preproc_adpt_encoder.run(input=preproc_input)

encoder_preproc_data = preproc_adpt_encoder.get_result()

print(f"encoder_preproc_data shape: {encoder_preproc_data.shape}")
print("Workclass columns after encoding:")
workclass_cols = [
    col for col in encoder_preproc_data.columns if col.startswith("workclass_")
]
print(encoder_preproc_data[workclass_cols].head(10))

encoder_preproc_data shape: (80, 20)
Workclass columns after encoding:
   workclass_Federal-gov  workclass_Local-gov  workclass_Private  \
0                    0.0                  0.0                1.0   
1                    0.0                  0.0                1.0   
2                    0.0                  0.0                0.0   
3                    0.0                  0.0                1.0   
4                    0.0                  0.0                0.0   
5                    0.0                  0.0                0.0   
6                    0.0                  0.0                1.0   
7                    0.0                  0.0                1.0   
8                    0.0                  0.0                1.0   
9                    1.0                  0.0                0.0   

   workclass_Self-emp-inc  workclass_Self-emp-not-inc  workclass_State-gov  
0                     0.0                         0.0                  0.0  
1                     0.0 
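
The column count grows from 15 to 20 because the single `workclass` column is replaced by the six indicator columns listed in the output (15 - 1 + 6 = 20). A minimal plain-Python sketch of one-hot encoding follows; `one_hot` is a hypothetical helper written for illustration and is not how `encoder_onehot` is implemented:

```python
def one_hot(values, prefix):
    """Turn a list of category labels into one indicator dict per row.

    Hypothetical helper for illustration; not part of PETsARD.
    """
    categories = sorted(set(values))
    return [
        {f"{prefix}_{c}": 1.0 if v == c else 0.0 for c in categories}
        for v in values
    ]


rows = one_hot(["Private", "State-gov", "Private"], prefix="workclass")
print(rows[0])  # {'workclass_Private': 1.0, 'workclass_State-gov': 0.0}
```

In this sketch each row has exactly one indicator set to 1.0.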

In [10]:
# Test scaling only
preproc_adpt_scaler = PreprocessorAdapter(
    config={
        "method": "custom",
        "config": {
            "scaler": {
                "age": "scaler_minmax",
            },
        },
        "sequence": ["scaler"],
    }
)

preproc_adpt_scaler.run(input=preproc_input)

scaler_preproc_data = preproc_adpt_scaler.get_result()

print(f"scaler_preproc_data shape: {scaler_preproc_data.shape}")
print("Age column after scaling:")
print(scaler_preproc_data["age"].head(10))

scaler_preproc_data shape: (80, 15)
Age column after scaling:
0    0.381818
1    0.490909
2    0.018182
3    0.309091
4    0.218182
5    0.836364
6    0.127273
7    0.690909
8    0.872727
9    0.345455
Name: age, dtype: float64
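
Min-max scaling maps each value to `(x - min) / (max - min)`. The notebook does not print the fitted minimum and maximum. Comparing the raw ages from the missing-value cell (38, 44, 18, ...) with the scaled values above, the numbers are consistent with a minimum of 17 and a maximum of 72; treat those two constants as an inference rather than something the source states. A quick check in plain Python:

```python
def min_max_scale(values, lo, hi):
    """Scale values linearly so that lo maps to 0 and hi maps to 1."""
    return [(v - lo) / (hi - lo) for v in values]


ages = [38, 44, 18]
# lo=17 and hi=72 are inferred from the printed outputs, not read from PETsARD
scaled = min_max_scale(ages, lo=17, hi=72)
print([round(s, 6) for s in scaled])  # [0.381818, 0.490909, 0.018182]
```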


## SynthesizerAdapter Tests

In [11]:
from petsard.adapter import SynthesizerAdapter

# Test the synthesizer with the default method
synth_adpt = SynthesizerAdapter(config={"method": "default"})

synth_input = {"data": default_preproc_data, "metadata": preproc_meta}

synth_adpt.run(input=synth_input)

syn_data = synth_adpt.get_result()

print(f"syn_data shape: {syn_data.shape}")
print(f"syn_data type: {type(syn_data)}")

syn_data shape: (54, 15)
syn_data type: <class 'pandas.core.frame.DataFrame'>


## PostprocessorAdapter Tests

In [12]:
from petsard.adapter import PostprocessorAdapter


# Test postprocessing
postproc_adpt = PostprocessorAdapter(config={"method": "default"})

postproc_input = {
    "data": syn_data,
    "preprocessor": preproc_adpt.processor,
}

postproc_adpt.run(input=postproc_input)

postproc_data = postproc_adpt.get_result()

print(f"postproc_data shape: {postproc_data.shape}")
print(f"postproc_data type: {type(postproc_data)}")

postproc_data shape: (54, 15)
postproc_data type: <class 'pandas.core.frame.DataFrame'>


In [14]:
syn_data

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,1.263466,0.2248,-0.04695,0.276308,0.720054,0.07401,0.425869,0.330918,0.339613,0.621573,-0.223148,-0.112509,0.390424,0.513947,0.62392
1,-0.448639,0.927445,0.735033,0.701279,-0.394736,0.538855,0.54004,0.836891,0.642224,0.530226,-0.223148,-0.112509,-0.051486,0.350513,0.81505
2,0.375738,0.004116,1.340796,0.623551,0.249906,0.548137,0.197636,0.349818,0.242217,0.285552,-0.223148,-0.112509,0.430574,0.73889,0.11711
3,-0.631964,0.544332,0.00963,0.35533,-0.637003,0.863457,0.822887,0.641169,0.199144,0.821985,-0.223148,-0.112509,-0.44303,0.895837,0.400057
4,-0.910935,0.405946,-0.778915,0.799101,1.524742,0.612711,0.267258,0.071919,0.781337,0.223755,-0.223148,-0.112509,0.937623,0.253915,0.050389
5,-1.300358,0.861791,1.022668,0.666851,-0.367858,0.586494,0.653389,0.521052,0.476802,0.91565,-0.223148,-0.112509,-0.901512,0.768087,0.549407
6,0.35597,0.927445,-0.178916,0.90609,0.436346,0.272645,0.826819,0.335638,0.642224,0.106431,-0.223148,-0.112509,0.727151,0.001717,0.868395
7,-1.211905,0.939218,-0.088136,0.749515,0.946631,0.719826,0.033261,0.919733,0.107095,0.579884,-0.223148,-0.112509,-0.342201,0.164929,0.431889
8,-1.159886,0.334501,-0.640886,0.43132,-0.245388,0.863457,0.577327,0.460181,0.864121,0.859174,-0.223148,-0.112509,-0.99463,0.493907,0.543397
9,-1.072509,0.844915,-0.627326,0.674512,0.894834,0.243417,0.605945,0.349818,0.417629,0.706174,-0.223148,-0.112509,0.726912,0.597377,0.057563


In [13]:
postproc_data

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,56,,188225,,11,,,,,,0,0,44,,
1,31,,267157,,8,,,,,,0,0,38,,
2,43,,328302,,10,,,,,,0,0,44,,
3,28,,193936,,8,,,,,,0,0,33,,
4,24,,114342,,13,,,,,,0,0,51,,
5,19,,296191,,8,,,,,,0,0,27,,
6,43,,174905,,11,,,,,,0,0,48,,
7,20,,184068,,12,,,,,,0,0,34,,
8,21,,128274,,9,,,,,,0,0,26,,
9,22,,129643,,12,,,,,,0,0,48,,


## ConstrainerAdapter Tests

In [None]:
from petsard.adapter import ConstrainerAdapter


# Test the constrainer with an empty config
constr_adpt = ConstrainerAdapter(config={})

constr_input = {"data": postproc_data}

constr_adpt.run(input=constr_input)

cnst_data = constr_adpt.get_result()

print(f"cnst_data shape: {cnst_data.shape}")
print(f"cnst_data type: {type(cnst_data)}")

cnst_data shape: (48, 15)
cnst_data type: <class 'pandas.core.frame.DataFrame'>


In [None]:
# Test the constrainer using resample_until_satisfy
satisfy_data = constr_adpt.constrainer.resample_until_satisfy(
    data=postproc_data,
    target_rows=postproc_data.shape[0],
    synthesizer=synth_adpt.synthesizer,
    max_trials=300,
    sampling_ratio=10.0,
    verbose_step=10,
)

print(f"satisfy_data shape: {satisfy_data.shape}")

satisfy_data shape: (48, 15)
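
The parameter names in the call above (`target_rows`, `max_trials`, `sampling_ratio`) suggest a rejection-sampling loop: oversample, keep the rows that satisfy the constraints, and repeat until enough rows are collected. The sketch below shows that general idea in plain Python. It is an assumption about the technique, not PETsARD's implementation, and `rejection_resample`, `sample_fn`, and `is_valid` are hypothetical names:

```python
import random


def rejection_resample(sample_fn, is_valid, target_rows,
                       max_trials=300, sampling_ratio=10.0):
    """Oversample, filter by a constraint, and stop once enough rows pass.

    Hypothetical sketch of rejection sampling; not part of PETsARD.
    """
    kept = []
    for _ in range(max_trials):
        batch = sample_fn(int(target_rows * sampling_ratio))
        kept.extend(row for row in batch if is_valid(row))
        if len(kept) >= target_rows:
            return kept[:target_rows]
    # Trials exhausted: return whatever satisfied the constraint
    return kept


rng = random.Random(0)


def sample_ages(n):
    """Toy generator standing in for a synthesizer."""
    return [rng.randint(0, 100) for _ in range(n)]


rows = rejection_resample(sample_ages, lambda age: age >= 18, target_rows=54)
print(len(rows))  # 54
```

When the trial budget runs out first, a loop like this returns fewer rows than requested.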


## EvaluatorAdapter Tests

In [None]:
from petsard.adapter import EvaluatorAdapter

# Test the evaluator with the default method
eval_adpt = EvaluatorAdapter(config={"method": "default"})

eval_input = {
    "data": {
        "ori": train_data,
        "control": test_data,
        "syn": satisfy_data,
    }
}

eval_adpt.run(input=eval_input)

eval_result = eval_adpt.get_result()

print(f"Evaluation result keys: {eval_result.keys()}")
print("\nGlobal evaluation results:")
print(eval_result["global"].head(1))
print("\nColumnwise evaluation results:")
print(eval_result["columnwise"].head(1))
print("\nPairwise evaluation results:")
print(eval_result["pairwise"].head(1))

Generating report ...

(1/2) Evaluating Column Shapes: |██████████| 15/15 [00:00<00:00, 2650.71it/s]|
Column Shapes Score: 87.08%

(2/2) Evaluating Column Pair Trends: |██████████| 105/105 [00:00<00:00, 411.01it/s]|
Column Pair Trends Score: 8.34%

Overall Score (Average): 47.71%

Evaluation result keys: dict_keys(['global', 'columnwise', 'pairwise'])

Global evaluation results:
        Score  Column Shapes  Column Pair Trends
result   0.48           0.87                0.08

Columnwise evaluation results:
          Property        Metric     Score Error
age  Column Shapes  KSComplement  0.916667  None

Pairwise evaluation results:
                         Property                 Metric  Score  \
age workclass  Column Pair Trends  ContingencySimilarity    0.0   

               Real Correlation  Synthetic Correlation  
age workclass               NaN                    NaN  


In [None]:
# Test the evaluator with MLUtility classification
eval_adpt_mlutility = EvaluatorAdapter(
    config={
        "method": "mlutility-classification",
        "target": "income",
    }
)

eval_input_anon = {
    "data": {
        "ori": train_data,
        "control": test_data,
        "syn": satisfy_data,
    }
}

eval_adpt_mlutility.run(input=eval_input_anon)

eval_result_mlutility = eval_adpt_mlutility.get_result()

print(eval_result_mlutility["global"].head(1))

Data for syn is empty after removing missing values
TIMING_ERROR|EvaluatorAdapter|run|1757322121.40925|0.029840946197509766|Data for syn is empty after removing missing values


ConfigError: Data for syn is empty after removing missing values
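
One plausible cause of this error, judging from the `postproc_data` preview earlier in the notebook where every categorical column is blank: if each synthetic row has at least one missing value, dropping rows with missing values leaves nothing to evaluate. This is an inference, since the notebook does not display `satisfy_data` at this point. The effect can be reproduced in plain Python, with `None` standing in for a missing value and `drop_rows_with_missing` as a hypothetical helper:

```python
def drop_rows_with_missing(rows):
    """Keep only rows in which every field has a value."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Toy rows mimicking the preview: numeric fields filled, categorical fields missing
syn_rows = [
    {"age": 56, "workclass": None, "income": None},
    {"age": 31, "workclass": None, "income": None},
]
print(len(drop_rows_with_missing(syn_rows)))  # 0
```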

## DescriberOperator Tests

In [None]:
from petsard.operator import DescriberOperator

# Test the describer
desc_op = DescriberOperator(config={"method": "default"})

desc_input = {"data": {"data": satisfy_data}}

desc_op.run(input=desc_input)

desc_result = desc_op.get_result()

print(f"Description result type: {type(desc_result)}")
if isinstance(desc_result, dict):
    print(f"Description result keys: {desc_result.keys()}")

    print(desc_result["global"].head(1))
    print(desc_result["columnwise"].head(1))
    print(desc_result["pairwise"].head(1))

Description result type: <class 'dict'>
Description result keys: dict_keys(['global', 'columnwise', 'pairwise'])
   row_count  col_count  na_count
0         38         15         0
      mean  median    std    min   max  kurtosis  skew     q1    q3  na_count  \
age  33.88   33.57  12.91  18.07  69.0      0.39  0.87  22.77  40.0       0.0   

    nunique  
age    <NA>  
  column1 column2  corr
0     age     age   1.0


## ReporterOperator Tests

In [None]:
from petsard.operator import ReporterOperator

# Test the reporter: save data
reporter_op_data = ReporterOperator(
    config={
        "method": "save_data",
        "source": "Postprocessor",
    }
)

report_data_input = {"data": {("Postprocessor", "exp1"): satisfy_data}}

reporter_op_data.run(input=report_data_input)

report_result_data = reporter_op_data.get_result()

print(f"Save data report result: {report_result_data}")

Now is petsard_Postprocessor[exp1] save to csv...
Save data report result: {'Postprocessor[exp1]':           age         workclass         fnlwgt     education  educational-num  \
0   21.368340       Federal-gov  136036.234374  Some-college        12.000000   
1   18.192728           Private  262862.772405       HS-grad        12.000000   
2   69.000000           Private  259823.864157       HS-grad        12.000000   
3   30.074558      Self-emp-inc   59099.210250       HS-grad         9.026702   
4   35.140393           Private  236828.797072       HS-grad        12.000000   
5   39.069126         State-gov  162763.149486     Assoc-voc        12.000000   
6   37.411303           Private  220947.186829  Some-college        10.642865   
7   35.126467           Private  141962.609039       HS-grad        12.000000   
8   35.732487           Private   69002.496064  Some-college         9.477471   
9   64.298642                 ?  131163.179409       HS-grad         9.065433   
10  53.899

In [None]:
# Test the reporter: save report
reporter_op_report = ReporterOperator(
    config={
        "method": "save_report",
        "granularity": "global",
    }
)

report_report_input = {
    "data": {
        ("Evaluator", "eval1_[global]"): desc_result["global"],
    }
}

reporter_op_report.run(input=report_report_input)

report_result_report = reporter_op_report.get_result()

print(f"Save report result: {report_result_report}")

Now is petsard[Report]_[global] save to csv...
Save report result: {'[global]':       full_expt_name Evaluator  eval1_row_count  eval1_col_count  \
0  Evaluator[global]  [global]               38               15   

   eval1_na_count  
0               0  }


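## Summary

This notebook shows how to use each Adapter class to run PETsARD's functionality, with settings adjusted to match test.ipynb. The class names below are the ones imported in the cells above:

1. **LoaderAdapter**: loads data and metadata
2. **SplitterAdapter**: splits data into training and test sets
3. **PreprocessorAdapter**: preprocesses data (missing values, outliers, encoding, scaling)
4. **SynthesizerAdapter**: synthesizes data
5. **PostprocessorAdapter**: postprocesses synthetic data
6. **ConstrainerAdapter**: applies data constraints
7. **EvaluatorAdapter**: evaluates synthetic data quality
8. **DescriberOperator**: describes data characteristics
9. **ReporterOperator**: generates reports

Each class follows the same pattern:
- Initialize with a config
- Prepare the input data
- Call the `run()` method
- Fetch the result with `get_result()`
- Fetch the metadata with `get_metadata()` (where applicable)

This design lets each module be tested on its own, without depending on a Status object.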
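
The shared pattern can be illustrated with a toy class. `UppercaseAdapter` below is hypothetical and exists only to show the config, `run()`, `get_result()` flow without any PETsARD dependency:

```python
class UppercaseAdapter:
    """Hypothetical toy adapter; not part of PETsARD."""

    def __init__(self, config):
        # Step 1: initialize with a config dict
        self.config = config
        self._result = None

    def run(self, input):
        # Step 3: do the work using only the config and the input dict
        suffix = self.config.get("suffix", "")
        self._result = [s.upper() + suffix for s in input["data"]]

    def get_result(self):
        # Step 4: hand back whatever run() produced
        return self._result


adapter = UppercaseAdapter(config={"suffix": "!"})
adapter.run(input={"data": ["a", "b"]})  # Step 2: input prepared inline
print(adapter.get_result())  # ['A!', 'B!']
```

Because `run()` depends only on its own config and the `input` dict, a class shaped like this can be exercised in isolation, which is the property the notebook relies on.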