1. Download the files from here: https://drive.google.com/drive/folders/1LR-ftaIeV6_KJvVz8q-xbodA-oXtJuvV?usp=sharing
2. Place features.csv and metrics.csv in the following path, relative to the project root: resources/tabzilla/raw
3. Run this notebook.
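
Before running the notebook, it can help to confirm that the raw files are where step 2 puts them. The snippet below is a minimal sketch using only the standard library; the helper name `missing_raw_files` is hypothetical and not part of the `ms` package, while the folder and file names are taken from the steps above.

```python
from pathlib import Path

# Folder and file names come from the setup steps above.
RAW_DIR = Path("resources") / "tabzilla" / "raw"
REQUIRED_FILES = ("features.csv", "metrics.csv")


def missing_raw_files(project_root="."):
    """Return the required raw files that are absent under project_root.

    Hypothetical helper for illustration; not part of the ms package.
    """
    raw_dir = Path(project_root) / RAW_DIR
    return [name for name in REQUIRED_FILES if not (raw_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_raw_files()
    if missing:
        print(f"Missing in {RAW_DIR}: {', '.join(missing)}")
    else:
        print("All raw files are in place.")
```

Run it from the project root; if the notebook lives in a subfolder, pass the path to the project root as `project_root`.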

In [None]:
from ms.handler.data_source import TabzillaSource
from ms.metadataset.data_formatter import TabzillaFormatter
from ms.metadataset.data_filter import TabzillaFilter
from ms.metadataset.target_builder import TargetPerfBuilder, TargetDiffBuilder
from ms.metadataset.data_preprocessor import ScalePreprocessor, CorrelationPreprocessor
from ms.pipeline.pipeline_constants import data_transform

In [None]:
md_source = TabzillaSource()
metric_name = "F1__test"

model_classes = {
    "rtdl_FTTransformer": "nn",
    "rtdl_MLP": "nn",
    "rtdl_ResNet": "nn",
    "LinearModel": "classic",
    "RandomForest": "classic",
    "XGBoost": "classic"
}

classes_names = ["nn", "classic"]

The formatter processes the raw TabZilla files: it aggregates per-fold values and formats the metrics.

Formatted files will be saved here: resources/tabzilla/formatted
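
To make "fold values aggregation" concrete, here is a rough pandas sketch of the idea. It is not the `TabzillaFormatter` implementation: the column names (`dataset_name`, `alg_name`, `fold`), the toy values, and the choice of the mean as the aggregate are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical raw metrics: one row per (dataset, model, fold).
raw_metrics = pd.DataFrame({
    "dataset_name": ["d1", "d1", "d1", "d1"],
    "alg_name": ["XGBoost", "XGBoost", "rtdl_MLP", "rtdl_MLP"],
    "fold": [0, 1, 0, 1],
    "F1__test": [0.80, 0.90, 0.70, 0.74],
})

# Collapse the folds: one value per (dataset, model).
# Using the mean here is an assumption, not a documented behaviour.
aggregated = (
    raw_metrics
    .groupby(["dataset_name", "alg_name"], as_index=False)["F1__test"]
    .mean()
)

# One row per dataset, one column per model.
wide = aggregated.pivot(index="dataset_name", columns="alg_name", values="F1__test")
print(wide)
```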

In [None]:
formatter = TabzillaFormatter(
    features_folder="raw",
    metrics_folder="raw",
    test_mode=False,
)
formatter.handle_features(to_rewrite=False)
formatter.handle_metrics(to_rewrite=False)

The filter removes unsuitable features.

Filtered files will be saved here: resources/tabzilla/filtered
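
As an illustration of what such filtering might look like, the sketch below drops feature columns by summary-function name and by magnitude. It is a guess at the general idea, not the `TabzillaFilter` logic: the dot-separated column naming, the toy data, and the rule "drop a column if any absolute value exceeds the threshold" are assumptions. Only the `funcs_to_exclude` list and the `1e6` threshold come from the cell that follows.

```python
import pandas as pd

funcs_to_exclude = ["count", "histogram", "iq_range", "median", "quantiles", "range"]
value_threshold = 1e6

# Hypothetical meta-features; the naming scheme is an assumption.
features = pd.DataFrame({
    "attr_conc.mean": [0.1, 0.2, 0.3],
    "attr_conc.median": [0.1, 0.2, 0.3],
    "cov.histogram.0": [1.0, 2.0, 3.0],
    "eigenvalues.max": [5.0, 2e7, 1.0],
})


def is_excluded(column):
    """True if any dot-separated part of the name is an excluded function."""
    parts = column.split(".")
    return any(func in parts for func in funcs_to_exclude)


# Step 1: drop columns produced by excluded summary functions.
kept = [column for column in features.columns if not is_excluded(column)]
filtered = features[kept]

# Step 2 (assumed rule): drop columns containing values above the threshold.
too_large = filtered.abs().gt(value_threshold).any()
filtered = filtered.loc[:, ~too_large]

print(list(filtered.columns))
```

Matching on whole name parts rather than substrings keeps `iq_range` and `range` from being confused with each other.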

In [None]:
md_filter = TabzillaFilter(
    features_folder="formatted",
    metrics_folder="formatted",
    funcs_to_exclude=[
        "count",
        "histogram",
        "iq_range",
        "median",
        "quantiles",
        "range",
    ],
    models_list=[
        "XGBoost",
        "RandomForest",
        "LinearModel",
        "rtdl_ResNet",
        "rtdl_FTTransformer",
        "rtdl_MLP",
    ],
    test_mode=False,
    value_threshold=1e6,
)

md_filter.handle_features(to_rewrite=False)
md_filter.handle_metrics(to_rewrite=False)

The target builder creates a target using a specific strategy: the rank of absolute or relative performance, or the difference between the best-performing models.

Targets will be saved here: resources/tabzilla/target
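
The three strategies can be pictured with a small pandas example. This is only a sketch under stated assumptions, not the code of `TargetPerfBuilder` or `TargetDiffBuilder`: the toy scores are invented, "relative" is assumed to mean the score divided by the best score on the same dataset, and the difference target is assumed to compare the best model of each class.

```python
import pandas as pd

# Hypothetical per-dataset scores (rows: datasets, columns: models).
metrics = pd.DataFrame(
    {
        "XGBoost": [0.90, 0.60, 0.75, 0.80],
        "RandomForest": [0.85, 0.65, 0.70, 0.82],
        "rtdl_MLP": [0.80, 0.70, 0.72, 0.90],
    },
    index=["d1", "d2", "d3", "d4"],
)
model_classes = {"XGBoost": "classic", "RandomForest": "classic", "rtdl_MLP": "nn"}

# Absolute performance: bin each model's scores into 2 quantile bins.
abs_target = metrics.apply(lambda col: pd.qcut(col, q=2, labels=False))

# Relative performance (assumed definition): score / best score on that dataset.
relative = metrics.div(metrics.max(axis=1), axis=0)
rel_target = relative.apply(lambda col: pd.qcut(col, q=2, labels=False))

# Difference: best "nn" score minus best "classic" score per dataset.
best_per_class = metrics.T.groupby(model_classes).max().T
diff = best_per_class["nn"] - best_per_class["classic"]
diff_target = (diff > 0).astype(int)  # 1 where the best nn model wins

print(abs_target, rel_target, diff_target, sep="\n\n")
```

With `n_bins=2` and quantile binning, each model's column is split at its median, so the target becomes a binary "below or above typical performance" label.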

In [None]:
abs_perf_builder = TargetPerfBuilder(
    md_source=md_source,
    features_folder="filtered",
    metrics_folder="filtered",
    metric_name=metric_name,
    perf_type="abs",
    n_bins=2,
    strategy="quantile",
    test_mode=False,
)

rel_perf_builder = TargetPerfBuilder(
    md_source=md_source,
    features_folder="filtered",
    metrics_folder="filtered",
    metric_name=metric_name,
    perf_type="rel",
    n_bins=2,
    strategy="quantile",
    test_mode=False,
)

diff_builder = TargetDiffBuilder(
    classes=classes_names,
    model_classes=model_classes,
    md_source=md_source,
    features_folder="filtered",
    metrics_folder="filtered",
    metric_name=metric_name,
    test_mode=False,
)

abs_perf_builder.handle_metrics()
rel_perf_builder.handle_metrics()
diff_builder.handle_metrics()

The preprocessor scales the data for a specific target. Choose the target type by passing the metrics_suffix argument to the preprocess method (the suffix should correspond to one of the files in the target folder).

Preprocessed data will be saved here: resources/tabzilla/preprocessed
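
For intuition about the `"power"` scaling option, the sketch below applies a power transform with scikit-learn. Whether `ScalePreprocessor` uses `PowerTransformer`, and with which settings, is an assumption; the feature names and data are made up.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Hypothetical skewed meta-features.
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.lognormal(size=(50, 2)), columns=["feat_a", "feat_b"])

# Yeo-Johnson power transform followed by standardization
# (zero mean, unit variance per column).
transformer = PowerTransformer(method="yeo-johnson", standardize=True)
scaled = pd.DataFrame(
    transformer.fit_transform(features),
    columns=features.columns,
    index=features.index,
)

print(scaled.describe().loc[["mean", "std"]])
```

A power transform reduces skew in heavy-tailed features, which tends to help models that are sensitive to feature scale.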

In [None]:
scaler = ScalePreprocessor(
    md_source=md_source,
    features_folder="filtered",
    metrics_folder="target",
    to_scale=["power"],
    perf_type="abs",
    remove_outliers=False,
    test_mode=False,
)
scaled_features, scaled_metrics = scaler.preprocess(
    feature_suffix=None,
    metrics_suffix="perf_abs"
)
scaler.preprocess(
    feature_suffix=None,
    metrics_suffix="perf_rel"
)
scaler.preprocess(
    feature_suffix=None,
    metrics_suffix="diff"
)

In [None]:
corr_filter = CorrelationPreprocessor(
    md_source=md_source,
    features_folder="preprocessed",
    metrics_folder="preprocessed",
    corr_method="spearman",
    corr_value_threshold=0.9,
    vif_value_threshold=20000,
    vif_count_threshold=None,
    test_mode=False,
)

corr_features, corr_metrics = corr_filter.preprocess(
    feature_suffix=data_transform,
    metrics_suffix="perf_abs"
)
corr_filter.preprocess(
    feature_suffix=data_transform,
    metrics_suffix="perf_rel"
)
corr_filter.preprocess(
    feature_suffix=data_transform,
    metrics_suffix="diff"
)
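
The correlation step above can be pictured as follows. This is a minimal sketch of threshold-based filtering with Spearman correlation, not the `CorrelationPreprocessor` implementation: the helper `drop_correlated`, the greedy "keep the first, drop the later column" rule, and the toy data are assumptions, and the VIF-based part is not covered here.

```python
import pandas as pd


def drop_correlated(features, threshold=0.9, method="spearman"):
    """Greedily drop the later column of every pair correlated above threshold.

    Hypothetical helper for illustration; not part of the ms package.
    """
    corr = features.corr(method=method).abs()
    columns = list(corr.columns)
    to_drop = set()
    for i, first in enumerate(columns):
        if first in to_drop:
            continue
        for second in columns[i + 1:]:
            if second not in to_drop and corr.loc[first, second] > threshold:
                to_drop.add(second)
    return features.drop(columns=sorted(to_drop))


demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [1.0, 4.0, 9.0, 16.0, 25.0],  # monotone in "a", so Spearman is 1.0
    "c": [2.0, 1.0, 4.0, 3.0, 5.0],    # Spearman with "a" is 0.8
})
reduced = drop_correlated(demo, threshold=0.9)
print(list(reduced.columns))
```

Spearman correlation compares ranks, so it also flags nonlinear but monotone relationships such as `b = a ** 2`.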