# TabZilla Metadataset Tutorial

This notebook demonstrates how to analyze our experimental results, including some of the results from our paper.

### First Things First

1. Download the TabZilla results dataset `metadataset_clean.csv` and the dataset meta-features `metafeatures_clean.csv` from our Google Drive folder [here](https://drive.google.com/drive/folders/1cHisTmruPHDCYVOYnaqvTdybLngMkB8R?usp=sharing), and place them in the same directory as this notebook.
2. Run this notebook in a Python (3.11+) environment with `pandas` installed.

### Read the datasets

In [1]:
import pandas as pd

metadataset_df = pd.read_csv("./metadataset_clean.csv")
metafeatures_df = pd.read_csv("./metafeatures_clean.csv")
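
Before going further, it can help to confirm that a loaded frame contains the columns this tutorial relies on. The snippet below is a minimal sketch: `check_columns` and `EXPECTED_COLUMNS` are hypothetical names introduced here for illustration (they are not part of TabZilla), and the toy frame stands in for `metadataset_df` so the example runs without the CSV files.

```python
import pandas as pd

# Identifier columns used throughout this tutorial (hypothetical constant name).
EXPECTED_COLUMNS = ["dataset_fold_id", "dataset_name", "alg_name", "hparam_source"]


def check_columns(df: pd.DataFrame, expected: list[str]) -> list[str]:
    """Return the expected column names that are missing from df."""
    return [col for col in expected if col not in df.columns]


# Toy frame standing in for metadataset_df, so the sketch runs on its own.
# It deliberately omits "hparam_source" to show what a failed check looks like.
toy_df = pd.DataFrame(
    {
        "dataset_fold_id": ["openml__adult-census__3953__fold_0"],
        "dataset_name": ["openml__adult-census__3953"],
        "alg_name": ["CatBoost"],
    }
)

missing = check_columns(toy_df, EXPECTED_COLUMNS)
print(f"rows={len(toy_df)}, columns={toy_df.shape[1]}, missing={missing}")
```

With the real data, the same check would be `check_columns(metadataset_df, EXPECTED_COLUMNS)`; an empty list means all identifier columns are present.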

# 1. Explore our experiment results (`metadataset_clean.csv`)

The most important columns in this dataset are:
- `dataset_fold_id`: the name of the "dataset fold". Each dataset is split into 10 train/test/validation splits for these experiments.
- `dataset_name`: the name of the dataset, not including the fold.
- `alg_name`: the name of the algorithm.
- `hparam_source`: the set of hyperparameters used with the algorithm.
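
The four identifier columns above can be combined with `groupby` to see how the results are organized, for example how many folds each dataset has and how many rows each algorithm/hyperparameter combination contributes. This is a rough sketch on a toy frame: the dataset names `ds_a`/`ds_b` and the algorithm `OtherAlg` are made-up placeholders, not values from the real file.

```python
import pandas as pd

# Toy rows standing in for metadataset_df; all values are illustrative placeholders.
toy_df = pd.DataFrame(
    {
        "dataset_name": ["ds_a", "ds_a", "ds_a", "ds_b"],
        "dataset_fold_id": ["ds_a__fold_0", "ds_a__fold_1", "ds_a__fold_0", "ds_b__fold_0"],
        "alg_name": ["CatBoost", "CatBoost", "OtherAlg", "CatBoost"],
        "hparam_source": ["default", "default", "default", "default"],
    }
)

# Number of distinct folds recorded for each dataset.
folds_per_dataset = toy_df.groupby("dataset_name")["dataset_fold_id"].nunique()

# Number of result rows for each (algorithm, hyperparameter set) pair.
rows_per_config = toy_df.groupby(["alg_name", "hparam_source"]).size()

print(folds_per_dataset)
print(rows_per_config)
```

Replacing `toy_df` with `metadataset_df` should give the same summaries for the full results file.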

Each row contains the results for a single algorithm and hyperparameter set on one dataset fold: the algorithm is trained on the training set (80% of the dataset) and then evaluated on the validation and test sets (10% each).

This file includes the following metrics, each reported for the three splits (train, test, and validation):
- Log Loss
- AUC
- Accuracy
- F1 Score
- runtime ("time")

These columns follow the naming convention "{metric}__{split}". For example, the column "Log Loss__val" is the Log Loss calculated on the validation set, and "time__test" is the runtime to evaluate the test set.
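
The "{metric}__{split}" convention can be wrapped in a small helper so that column names are built rather than typed by hand. This is only a sketch: `metric_column`, `METRICS`, and `SPLITS` are hypothetical names, and the spellings of the metric prefixes other than "Log Loss" and "time" are assumed to match the list above, so it is worth checking them against `metadataset_df.columns`.

```python
# Metric prefixes as listed above; spellings other than "Log Loss" and "time"
# are assumptions and should be checked against the real column names.
METRICS = ["Log Loss", "AUC", "Accuracy", "F1 Score", "time"]

# Split suffixes; "val" follows the "Log Loss__val" example.
SPLITS = ["train", "val", "test"]


def metric_column(metric: str, split: str) -> str:
    """Build a column name following the "{metric}__{split}" convention."""
    if metric not in METRICS:
        raise ValueError(f"unknown metric: {metric!r}")
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    return f"{metric}__{split}"


# All three Log Loss columns, in split order.
log_loss_columns = [metric_column("Log Loss", split) for split in SPLITS]
print(log_loss_columns)
```

A list like `log_loss_columns` can then be passed directly as the column selector in `metadataset_df.loc[...]`.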

For example, here are the log loss and training time results for CatBoost using default hyperparameters, for all splits of the dataset "openml__adult-census__3953":

In [2]:
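# Select the CatBoost runs with default hyperparameters on one dataset,
# keeping the identifier columns, the log loss per split, and the training time.
metadataset_df.loc[
    (metadataset_df["alg_name"] == "CatBoost")
    & (metadataset_df["hparam_source"] == "default")
    & (metadataset_df["dataset_name"] == "openml__adult-census__3953"),
    [
        "dataset_fold_id",
        "alg_name",
        "hparam_source",
        "Log Loss__train",
        "Log Loss__val",
        "Log Loss__test",
        "training_time",
    ],
]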

| | dataset_fold_id | alg_name | hparam_source | Log Loss__train | Log Loss__val | Log Loss__test | training_time |
|---|---|---|---|---|---|---|---|
| 103510 | openml__adult-census__3953__fold_0 | CatBoost | default | 0.291891 | 0.302058 | 0.301728 | 2.485499 |
| 103978 | openml__adult-census__3953__fold_1 | CatBoost | default | 0.293162 | 0.286871 | 0.301518 | 1.641616 |
| 104446 | openml__adult-census__3953__fold_2 | CatBoost | default | 0.293819 | 0.295939 | 0.28677 | 1.63266 |
| 104914 | openml__adult-census__3953__fold_3 | CatBoost | default | 0.29351 | 0.294855 | 0.296569 | 1.648335 |
| 105382 | openml__adult-census__3953__fold_4 | CatBoost | default | 0.293627 | 0.301237 | 0.29569 | 1.590125 |
| 105850 | openml__adult-census__3953__fold_5 | CatBoost | default | 0.292957 | 0.297009 | 0.301207 | 1.593544 |
| 106318 | openml__adult-census__3953__fold_6 | CatBoost | default | 0.29363 | 0.301441 | 0.29675 | 1.595759 |
| 106786 | openml__adult-census__3953__fold_7 | CatBoost | default | 0.293684 | 0.295741 | 0.30172 | 1.592632 |
| 107254 | openml__adult-census__3953__fold_8 | CatBoost | default | 0.295376 | 0.297267 | 0.297509 | 1.589603 |
| 107722 | openml__adult-census__3953__fold_9 | CatBoost | default | 0.293293 | 0.303367 | 0.295878 | 1.790339 |
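
Per-fold results like these are often summarized across folds. As a small sketch, the snippet below rebuilds the `Log Loss__test` and `training_time` columns from the output above and computes their mean and standard deviation over the ten folds. The frame name `fold_results` is introduced here for illustration; with the real data, the same `agg` call could be applied to the result of the `loc` selection.

```python
import pandas as pd

# Test log loss and training time for the ten folds, copied from the output above.
fold_results = pd.DataFrame(
    {
        "dataset_fold_id": [f"openml__adult-census__3953__fold_{i}" for i in range(10)],
        "Log Loss__test": [
            0.301728, 0.301518, 0.28677, 0.296569, 0.29569,
            0.301207, 0.29675, 0.30172, 0.297509, 0.295878,
        ],
        "training_time": [
            2.485499, 1.641616, 1.63266, 1.648335, 1.590125,
            1.593544, 1.595759, 1.592632, 1.589603, 1.790339,
        ],
    }
)

# Mean and sample standard deviation across folds for each numeric column.
summary = fold_results[["Log Loss__test", "training_time"]].agg(["mean", "std"])
print(summary)
```

The `mean` row gives a single cross-fold score per column, and the `std` row indicates how much the folds disagree.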
