# Tutorial 2: TabZilla Dataset Featurizer

This notebook demonstrates how to use TabZilla to calculate metafeatures of tabular datasets ("featurize" them) using `pymfe`. 

### Requirements

1. Please complete [Tutorial 1](TabZilla/tutorials/1-preprocess-datasets.ipynb), which shows how to download and pre-process datasets using TabZilla and `openml`. You will need to have at least one dataset pre-processed to complete this tutorial.

2. You need a Python environment with the packages listed below. We recommend following the instructions in our [README](README.md) to prepare a virtual environment with `venv`. Required packages:

- [`openml`](https://pypi.org/project/openml/)
- [`argparse`](https://pypi.org/project/argparse/)
- [`pandas`](https://pypi.org/project/pandas/)
- [`scikit-learn`](https://pypi.org/project/scikit-learn/)
- [`pymfe`](https://pypi.org/project/pymfe/)

3. Like all of our code, this notebook must be run from the TabZilla directory. Run the following cell to `cd` one level up:

In [1]:
%cd ../

/Users/duncan/research/active_projects/tabzilla/TabZilla


## Featurizing a dataset

### Read the pre-processed dataset

First, read the dataset you want to featurize. We will use the audiology dataset that was pre-processed in Tutorial 1, which was written to `TabZilla/datasets/openml__audiology__7`:

In [2]:
from tabzilla_datasets import TabularDataset
from pathlib import Path

dataset = TabularDataset.read(Path("./datasets/openml__audiology__7"))
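
If you are unsure which pre-processed datasets are available, you can list them first. The sketch below is not part of TabZilla: it only assumes that each pre-processed dataset lives in its own subdirectory of `./datasets`, and the helper name `list_dataset_dirs` is hypothetical.

```python
from pathlib import Path


def list_dataset_dirs(root):
    """Return the sorted subdirectories of `root`.

    Hypothetical helper for illustration; not part of TabZilla.
    """
    root = Path(root)
    if not root.is_dir():
        # Nothing has been pre-processed yet, or we are in the wrong directory.
        return []
    return sorted(p for p in root.iterdir() if p.is_dir())


# Print the name of every candidate dataset directory.
for dataset_dir in list_dataset_dirs("./datasets"):
    print(dataset_dir.name)
```

Any of the printed directory names can then be passed to `TabularDataset.read` in the same way as `openml__audiology__7` above.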

### Featurize the dataset

We use `pymfe` to calculate dataset metafeatures. Please see the [`pymfe` website](https://pymfe.readthedocs.io/en/latest/auto_pages/meta_features_description.html) for a description of these features. 

`pymfe` emits many warnings while it computes metafeatures, so we suppress them when we featurize the dataset. Many of these warnings correspond to metafeatures that evaluate to NaN, which we can ignore.

In [3]:
from tabzilla_featurizer import featurize_dataset
import warnings 

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    dataset_metafeatures = featurize_dataset(Path("./datasets/openml__audiology__7"))



Processing...


10it [00:17,  1.76s/it]
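
If you would rather know how many warnings were suppressed than discard them silently, the standard library can record them instead. This is a minimal sketch using `warnings.catch_warnings(record=True)`; the wrapper name `run_and_count_warnings` is hypothetical and not part of TabZilla.

```python
import warnings


def run_and_count_warnings(func, *args, **kwargs):
    """Call `func` and return (result, number of warnings it raised).

    Hypothetical helper for illustration; not part of TabZilla.
    """
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, including repeats of the same message.
        warnings.simplefilter("always")
        result = func(*args, **kwargs)
    return result, len(caught)
```

For example, `run_and_count_warnings(featurize_dataset, Path("./datasets/openml__audiology__7"))` should return the same metafeatures as the cell above, together with a warning count. Recorded warnings are captured rather than printed.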


### Review metafeatures

There is one set of metafeatures for each of the 10 folds defined during pre-processing. The metafeatures for each fold are stored in a dictionary: each key-value pair is one metafeature of the dataset, except for the key `"dataset_name"`, whose value is the name (and split) of the dataset.

Please see the `pymfe` [website](https://pymfe.readthedocs.io/en/latest/auto_pages/meta_features_description.html) and [GitHub page](https://github.com/ealcobaca/pymfe) for a description of each metafeature.

In [5]:
# there should be 10 items in dataset_metafeatures, one for each dataset split:
len(dataset_metafeatures)

10

In [6]:
# the dataset name and split are stored at key "dataset_name":
dataset_metafeatures[3]["dataset_name"]

'openml__audiology__7__fold_3'

In [7]:
# all other keys are the names of the metafeatures:
list(dataset_metafeatures[3].keys())[:5]

['dataset_name',
 'f__pymfe.landmarking.best_node.count',
 'f__pymfe.landmarking.best_node.count.relative',
 'f__pymfe.landmarking.best_node.histogram.0',
 'f__pymfe.landmarking.best_node.histogram.0.relative']
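
The per-fold dictionaries are often easier to inspect as a table. The sketch below assumes `dataset_metafeatures` is a list of dictionaries, as the cells above suggest. It uses a toy stand-in with made-up metafeature names and values, so it runs without TabZilla.

```python
import pandas as pd

# Toy stand-in for `dataset_metafeatures`: one dictionary per fold.
# The metafeature names and values are made up for illustration.
toy_metafeatures = [
    {"dataset_name": "toy__fold_0", "f__a": 1.0, "f__b": float("nan")},
    {"dataset_name": "toy__fold_1", "f__a": 2.0, "f__b": 0.5},
]

# One row per fold, one column per metafeature.
folds = pd.DataFrame(toy_metafeatures).set_index("dataset_name")

# Number of folds in which each metafeature is NaN.
nan_counts = folds.isna().sum()
print(nan_counts)
```

Replacing `toy_metafeatures` with `dataset_metafeatures` should show which metafeatures came back as NaN and in how many folds.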

## Creating a CSV of metafeatures for all datasets

Once you have pre-processed several datasets, each written to its own folder in `TabZilla/datasets`, you can featurize *all* of them with the script below. It first searches this folder for pre-processed datasets, then writes their metafeatures to a single CSV at `TabZilla/metafeatures.csv`:

``````
python -m tabzilla_featurizer
``````
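
To make the idea concrete, here is a rough sketch of a loop of this kind. It is not the actual implementation of `tabzilla_featurizer`: the function name `featurize_all` and its parameters are hypothetical, and it assumes the featurizing function returns a list of per-fold dictionaries.

```python
from pathlib import Path

import pandas as pd


def featurize_all(dataset_root, featurize, output_csv):
    """Featurize every subdirectory of `dataset_root` and write one CSV.

    Hypothetical sketch, not TabZilla's implementation. `featurize` is any
    callable that takes a dataset path and returns a list of per-fold
    dictionaries.
    """
    rows = []
    for dataset_dir in sorted(Path(dataset_root).iterdir()):
        if dataset_dir.is_dir():
            rows.extend(featurize(dataset_dir))
    table = pd.DataFrame(rows)
    # Assumption: the row index is written too (the pandas default).
    table.to_csv(output_csv)
    return table
```

With TabZilla available, a call such as `featurize_all("./datasets", featurize_dataset, "./my_metafeatures.csv")` would follow the same pattern; the output filename here is only an example.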

In [11]:
# you can read the metafeatures into a pandas dataframe like this:
import pandas as pd
metafeatures = pd.read_csv("./metafeatures.csv") 

metafeatures.head()

Unnamed: 0,dataset_name,f__pymfe.landmarking.best_node.count,f__pymfe.landmarking.best_node.count.relative,f__pymfe.landmarking.best_node.histogram.0,f__pymfe.landmarking.best_node.histogram.0.relative,f__pymfe.landmarking.best_node.histogram.1,f__pymfe.landmarking.best_node.histogram.1.relative,f__pymfe.landmarking.best_node.histogram.2,f__pymfe.landmarking.best_node.histogram.2.relative,f__pymfe.landmarking.best_node.histogram.3,...,f__pymfe.relative.worst_node.quantiles.4,f__pymfe.relative.worst_node.quantiles.4.relative,f__pymfe.relative.worst_node.range,f__pymfe.relative.worst_node.range.relative,f__pymfe.relative.worst_node.sd,f__pymfe.relative.worst_node.sd.relative,f__pymfe.relative.worst_node.skewness,f__pymfe.relative.worst_node.skewness.relative,f__pymfe.statistical.iq_range,f__pymfe.statistical.t_mean
0,openml__cjs__14967__fold_0,10,4.0,0.3,6.5,0.0,2.5,0.2,6.5,0.2,...,0.166667,1.0,0.0,1.0,2.925695e-17,1.0,,7.0,,
1,openml__cjs__14967__fold_1,10,4.0,0.1,4.0,0.0,3.0,0.0,2.5,0.2,...,0.171573,1.0,0.007937,1.0,0.001912,1.0,1.025611,7.0,,
2,openml__cjs__14967__fold_2,10,4.0,0.3,7.0,0.1,5.5,0.1,4.5,0.0,...,0.166667,1.0,0.0,1.0,2.925695e-17,1.0,,7.0,,
3,openml__cjs__14967__fold_3,10,4.0,0.1,3.5,0.1,4.5,0.1,6.0,0.3,...,0.166667,1.0,0.0,1.0,2.925695e-17,1.0,,7.0,,
4,openml__cjs__14967__fold_4,10,4.0,0.1,3.5,0.1,6.5,0.1,5.5,0.4,...,0.17197,1.0,0.010497,1.0,0.002801491,1.0,-0.04571,5.0,,
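
Because each row is one fold of one dataset, a common next step is to split `dataset_name` into a dataset identifier and a fold number, then aggregate over folds. The sketch below uses a toy frame with made-up values and assumes every name ends in `__fold_<k>`, as in the output above.

```python
import pandas as pd

# Toy rows shaped like metafeatures.csv; the values are made up.
toy = pd.DataFrame(
    {
        "dataset_name": [
            "openml__cjs__14967__fold_0",
            "openml__cjs__14967__fold_1",
        ],
        "f__example": [0.3, 0.1],
    }
)

# Split "<dataset>__fold_<k>" into its two parts.
parts = toy["dataset_name"].str.rsplit("__fold_", n=1, expand=True)
toy["dataset"] = parts[0]
toy["fold"] = parts[1].astype(int)

# Average each metafeature over folds, giving one row per dataset.
per_dataset = toy.groupby("dataset")[["f__example"]].mean()
print(per_dataset)
```

The `Unnamed: 0` column in the output above is most likely the row index that was saved with the CSV. If so, passing `index_col=0` to `pd.read_csv` would load it as the index instead of as a data column.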
