In [1]:
from sklearn.model_selection import train_test_split

from pytorch_tabular.utils import load_covertype_dataset

In [None]:
data, cat_col_names, num_col_names, target_col = load_covertype_dataset()

# Importing the Library

In [2]:
from pytorch_tabular import TabularModel, model_sweep
from pytorch_tabular.models import (
    CategoryEmbeddingModelConfig,
    DANetConfig,
    GANDALFConfig,
    FTTransformerConfig,
    TabNetModelConfig
)
from pytorch_tabular.config import DataConfig, OptimizerConfig, TrainerConfig, ExperimentConfig
from pytorch_tabular.models.common.heads import LinearHeadConfig


In [3]:
train, test = train_test_split(data, random_state=42)


NameError: name 'data' is not defined

## Model Sweep

Define the data config, trainer config, and optimizer config and do a sweep of multiple models.

In [4]:
data_config = DataConfig(
    # target should always be a list. Multi-targets are only supported for regression.
    # Multi-task classification is not implemented.
    target=[target_col],
    continuous_cols=num_col_names,
    categorical_cols=cat_col_names,
)
trainer_config = TrainerConfig(
    batch_size=1024,
    max_epochs=25,
    auto_lr_find=True,
    early_stopping="valid_loss",  # Monitor valid_loss for early stopping
    early_stopping_mode="min",  # Use "min" because a lower valid_loss is better
    early_stopping_patience=5,  # No. of epochs without improvement to wait before terminating
    checkpoints="valid_loss",  # Save the best checkpoint, monitoring valid_loss
    load_best=True,  # After training, load the best checkpoint
    progress_bar="none",  # Turning off Progress bar
    trainer_kwargs=dict(enable_model_summary=False),  # Turning off model summary
    accelerator="cpu",
    fast_dev_run=True,  # Smoke test: runs only a single batch per stage; set to False for a real sweep
    data_aware_init_batch_size=1024,
)
optimizer_config = OptimizerConfig()

head_config = LinearHeadConfig(
    layers="", dropout=0.1, initialization="kaiming"  # No additional layer in head, just a mapping layer to output_dim
).__dict__  # Convert to dict to pass to the model config (OmegaConf doesn't accept objects)


NameError: name 'target_col' is not defined
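The early-stopping settings in the trainer config above (`early_stopping="valid_loss"`, `early_stopping_mode="min"`, `early_stopping_patience=5`, `load_best=True`) can be pictured with a small pure-Python sketch. This is a simplified illustration, not the code that PyTorch Tabular or Lightning actually runs: the function name `simulate_early_stopping` and the loss values are made up for the example, and refinements such as a minimum improvement threshold are ignored.

```python
def simulate_early_stopping(monitored_values, patience=5, mode="min"):
    """Return (best_epoch, last_epoch_run) for a sequence of per-epoch values.

    Hypothetical helper for illustration only; it mimics the idea of
    "stop after `patience` epochs without improvement".
    """
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'min' or 'max'")

    best_value = None
    best_epoch = None
    last_epoch = None
    epochs_without_improvement = 0

    for epoch, value in enumerate(monitored_values):
        last_epoch = epoch
        if best_value is None:
            improved = True
        elif mode == "min":
            improved = value < best_value
        else:
            improved = value > best_value

        if improved:
            # A new best: this is the epoch a "best checkpoint" would correspond to.
            best_value, best_epoch = value, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # patience exhausted, stop training

    return best_epoch, last_epoch


# Made-up validation losses, one per epoch (not real training output).
example_losses = [0.9, 0.7, 0.6, 0.65, 0.66, 0.64, 0.7, 0.8, 0.5]
best_epoch, last_epoch = simulate_early_stopping(example_losses, patience=5, mode="min")
print(best_epoch, last_epoch)  # 2 7
```

In this toy run the loss stops improving after epoch 2, so training halts five epochs later at epoch 7 and never reaches the final value. Under this simplified picture, `load_best=True` corresponds to going back to the weights from `best_epoch` rather than keeping the last ones.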

## Model Sweep API

<!-- Args:
    task (str): The type of prediction task. Either 'classification' or 'regression'

    train (pd.DataFrame): The training data

    test (pd.DataFrame): The test data on which performance is evaluated

    data_config (Union[DataConfig, str]): DataConfig object or path to the yaml file.

    optimizer_config (Union[OptimizerConfig, str]): OptimizerConfig object or path to the yaml file.

    trainer_config (Union[TrainerConfig, str]): TrainerConfig object or path to the yaml file.

    models (Union[str, List[Union[ModelConfig, str]]], optional): The list of models to compare. This can be one of
            the presets defined in ``pytorch_tabular.MODEL_SWEEP_PRESETS`` or a list of ``ModelConfig`` objects.
            Defaults to "fast".

    metrics (Optional[List[str]]): the list of metrics you need to track during training. The metrics
            should be one of the functional metrics implemented in ``torchmetrics``. By default, it is
            accuracy if classification and mean_squared_error for regression

    metrics_prob_input (Optional[bool]): Is a mandatory parameter for classification metrics defined in
            the config. This defines whether the input to the metric function is the probability or the class.
            Length should be same as the number of metrics. Defaults to None.

    metrics_params (Optional[List]): The parameters to be passed to the metrics function. `task` is forced to
            be `multiclass` because the multiclass version can handle binary as well and for simplicity we are
            only using `multiclass`.

    validation (Optional[DataFrame], optional):
            If provided, will use this dataframe as the validation while training.
            Used in Early Stopping and Logging. If left empty, will use 20% of Train data as validation.
            Defaults to None.

    experiment_config (Optional[Union[ExperimentConfig, str]], optional): ExperimentConfig object or path to the yaml file.

    common_model_args (Optional[dict], optional): The model argument which are common to all models. The list of params can
        be found in ``ModelConfig``. If not provided, will use defaults. Defaults to {}.

    rank_metric (Optional[Tuple[str, str]], optional): The metric to use for ranking the models. The first element of the tuple
        is the metric name and the second element is the direction. Defaults to ('loss', "lower_is_better").

    return_best_model (bool, optional): If True, will return the best model. Defaults to True.

    seed (int, optional): The seed for reproducibility. Defaults to 42.

    ignore_oom (bool, optional): If True, will ignore the Out of Memory error and continue with the next model. -->

The model sweep lets you quickly sweep through different models and configurations. It takes a list of model configs or one of the presets defined in ``pytorch_tabular.MODEL_SWEEP_PRESETS``, trains them on the data, ranks the models by the metric provided, and returns the best model.

These are the major args:
- ``task``: The type of prediction task. Either 'classification' or 'regression'
- ``train``: The training data
- ``test``: The test data on which performance is evaluated
- All the config objects (``data_config``, ``optimizer_config``, ``trainer_config``) can be passed either as the object or as the path to the YAML file.
- ``model_list``: The list of models to compare. This can be one of the presets defined in ``pytorch_tabular.MODEL_SWEEP_PRESETS`` or a list of ``ModelConfig`` objects.
- ``metrics``: The list of metrics to track during training. Each metric should be one of the functional metrics implemented in ``torchmetrics``. By default, it is accuracy for classification and mean_squared_error for regression.
- ``metrics_prob_input``: Mandatory for classification metrics defined in the config. This defines whether the input to each metric function is the probability or the class. Its length should be the same as the number of metrics. Defaults to None.
- ``metrics_params``: The parameters to be passed to the metrics functions.
- ``rank_metric``: The metric to use for ranking the models. The first element of the tuple is the metric name and the second element is the direction. Defaults to ('loss', "lower_is_better").
- ``return_best_model``: If True, will return the best model. Defaults to True.
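Two of the arguments above are easy to get wrong: the three metric lists must line up one-to-one, and ``rank_metric`` is a `(name, direction)` tuple. The sketch below illustrates both ideas in plain Python. It is not the library's implementation: the helper names `check_metric_args` and `rank_results`, the `model_a`/`model_b`/`model_c` rows, and all numbers are hypothetical placeholders, and the `test_<metric>` key naming is only assumed from the column names used later in this notebook.

```python
def check_metric_args(metrics, metrics_params, metrics_prob_input):
    """Raise if the three per-metric lists do not have the same length."""
    if not (len(metrics) == len(metrics_params) == len(metrics_prob_input)):
        raise ValueError(
            "metrics, metrics_params and metrics_prob_input must have the same length"
        )


def rank_results(results, rank_metric):
    """Sort result rows best-first according to a (metric_name, direction) tuple."""
    metric_name, direction = rank_metric
    if direction not in ("lower_is_better", "higher_is_better"):
        raise ValueError("direction must be 'lower_is_better' or 'higher_is_better'")
    key = f"test_{metric_name}"  # assumed naming, e.g. test_accuracy, test_loss
    return sorted(
        results,
        key=lambda row: row[key],
        reverse=(direction == "higher_is_better"),
    )


# One params dict and one prob-input flag per metric.
check_metric_args(
    metrics=["accuracy", "f1_score"],
    metrics_params=[{}, {"average": "weighted"}],
    metrics_prob_input=[False, True],
)

# Placeholder rows with made-up numbers, not real sweep results.
results = [
    {"model": "model_a", "test_accuracy": 0.80, "test_loss": 0.50},
    {"model": "model_b", "test_accuracy": 0.85, "test_loss": 0.55},
    {"model": "model_c", "test_accuracy": 0.75, "test_loss": 0.40},
]

by_accuracy = rank_results(results, ("accuracy", "higher_is_better"))
by_loss = rank_results(results, ("loss", "lower_is_better"))
print([row["model"] for row in by_accuracy])  # ['model_b', 'model_a', 'model_c']
print([row["model"] for row in by_loss])      # ['model_c', 'model_a', 'model_b']
```

Note that the two rankings disagree on the best row, which is why it is worth choosing ``rank_metric`` deliberately instead of relying on the default of ranking by loss.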

In [5]:
from pytorch_tabular import MODEL_SWEEP_PRESETS
MODEL_SWEEP_PRESETS.keys()

dict_keys(['lite', 'full', 'high_memory'])

In [6]:
sweep_df, best_model = model_sweep(
    task="classification",  # One of "classification", "regression"
    train=train,
    test=test,
    data_config=data_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
    model_list="lite",  # One of the keys in MODEL_SWEEP_PRESETS, or a list of model configs
    common_model_args=dict(head="LinearHead", head_config=head_config),
    metrics=["accuracy", "f1_score"],
    metrics_params=[{}, {"average": "weighted"}],
    metrics_prob_input=[False, True],
    rank_metric=("accuracy", "higher_is_better"),
    progress_bar=True,
    verbose=False
)

NameError: name 'train' is not defined

In [7]:
sweep_df.drop(columns=["params", "time_taken", "epochs"]).style.highlight_max(
    subset=["test_accuracy", "test_f1_score"], color="lightgreen"
).highlight_min(subset=["test_loss"], color="lightgreen")

NameError: name 'sweep_df' is not defined

We chose the `lite` preset, a set of four models with a comparable number of parameters that train relatively quickly with lower memory requirements.

We can see that GANDALF performs the best in terms of accuracy, loss, and F1 score. We can also see that its training time is comparable to that of a regular MLP. A natural next step would be to tune the model a bit more and find the best parameters.

In [None]:
mlp = CategoryEmbeddingModelConfig(
    task="classification",
    layers="256-128-64",
    head="LinearHead",
    head_config=head_config,
)

danet = DANetConfig(
    task="classification",
    n_layers=8,
    abstlay_dim_1=8,
    k=5,
    head="LinearHead",
    head_config=head_config,
)

gandalf = GANDALFConfig(
    task="classification",
    gflu_stages=6,
    head="LinearHead",
    head_config=head_config,
)

tabnet = TabNetModelConfig(
    task="classification",
    n_d=32,
    n_a=32,
    n_steps=3,
    gamma=1.5,
    n_independent=1,
    n_shared=2,
    head="LinearHead",
    head_config=head_config,
)
model_list = [mlp, danet, gandalf, tabnet]

In [10]:
from pytorch_tabular import available_models

In [11]:
[m for m in available_models() if m not in ["MDNConfig", "NodeConfig"]]

['AutoIntConfig',
 'CategoryEmbeddingModelConfig',
 'DANetConfig',
 'FTTransformerConfig',
 'GANDALFConfig',
 'GatedAdditiveTreeEnsembleConfig',
 'TabNetModelConfig',
 'TabTransformerConfig']