[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lenguajenatural-ai/autotransformers/blob/main/notebooks/classification/train_multilabel.ipynb)

# Multilabel training

This tutorial closely follows the one on emotion classification, since both are classification tasks. Steps already covered in that previous tutorial are explained only briefly here.

We first import the needed modules. If you are running this notebook in Google Colab, uncomment the cell below and run it before importing, in order to install `autotransformers`.

We import `DatasetConfig`, the class that configures how datasets are managed inside `AutoTrainer`. We also need `ModelConfig` to define the models to train, and `ResultsPlotter` to plot the experiment results.
Additionally, we import the default hyperparameter space for base-sized models.

In [None]:
# !pip install git+https://github.com/lenguajenatural-ai/autotransformers.git 

In [None]:
from autotransformers import DatasetConfig, ModelConfig, AutoTrainer, ResultsPlotter
from autotransformers.default_param_spaces import hp_space_base

We need to define a preprocessing function, to end up with the correct format for the dataset. `AutoTrainer` expects multilabel datasets to be of the following form: one column for the text, and the rest of the columns for the labels. As our dataset is not originally in that format, we will pass a `pre_func` to `DatasetConfig` to preprocess it before tokenizing.

## Dataset Configuration

In [None]:
def pre_parse_func(example):
    label_cols = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "L", "M", "N", "Z"]
    new_example = {"text": example["abstractText"]}
    for col in label_cols:
        new_example[f"label_{col}"] = example[col]
    return new_example
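To see the target format concretely, the sketch below runs the same preprocessing function on a single made-up record. The record contents and the extra `Title` field are assumptions used only for illustration; the function is repeated so that the snippet runs on its own.

```python
LABEL_COLS = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "L", "M", "N", "Z"]


def pre_parse_func(example):
    # Same logic as the function above: keep the text and rename label columns.
    new_example = {"text": example["abstractText"]}
    for col in LABEL_COLS:
        new_example[f"label_{col}"] = example[col]
    return new_example


# Hypothetical record: made-up text, an extra field, and two active labels.
toy_example = {"abstractText": "A made-up abstract about cardiac imaging.", "Title": "Made-up title"}
toy_example.update({col: 0 for col in LABEL_COLS})
toy_example["A"] = 1
toy_example["E"] = 1

parsed = pre_parse_func(toy_example)
active_labels = [name for name, value in parsed.items() if name != "text" and value == 1]

print(parsed["text"])
print(active_labels)
```

The result has one `text` column plus one `label_*` column per label, and fields that the function does not copy (here `Title`) are absent from the returned dictionary.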

Next we define the fixed training arguments.

In [None]:
fixed_train_args = {
        "evaluation_strategy": "steps",
        "num_train_epochs": 10,
        "do_train": True,
        "do_eval": True,
        "logging_strategy": "steps",
        "eval_steps": 1,
        "save_steps": 1,
        "logging_steps": 1,
        "save_strategy": "steps",
        "save_total_limit": 2,
        "seed": 69,
        "fp16": False,
        "load_best_model_at_end": True,
        "per_device_eval_batch_size": 16,
        "max_steps": 1
    }
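The arguments above set both `num_train_epochs` and `max_steps`. Assuming they are forwarded to Hugging Face `TrainingArguments`, a positive `max_steps` takes precedence over `num_train_epochs`, so this demo run stops after a single optimization step. The sketch below illustrates that rule with a hypothetical helper, `planned_training_steps`; the dataset size and batch size are made-up numbers, and gradient accumulation and multi-device training are ignored.

```python
import math


def planned_training_steps(num_examples, batch_size, num_train_epochs, max_steps=-1):
    """Approximate the number of optimizer steps of a training run (simplified)."""
    if max_steps > 0:
        # A positive max_steps overrides the epoch-based computation.
        return max_steps
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_train_epochs


# Hypothetical sizes, for illustration only.
demo_steps = planned_training_steps(40_000, 8, num_train_epochs=10, max_steps=1)
full_steps = planned_training_steps(40_000, 8, num_train_epochs=10)

print(demo_steps, full_steps)
```

For a real experiment, you would likely drop `max_steps` and raise `eval_steps`, `save_steps` and `logging_steps`, since evaluating and saving at every step is slow.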

The default arguments for the dataset are the same as in the classification tutorial.

In [None]:
default_args_dataset = {
        "seed": 44,
        "direction_optimize": "maximize",
        "metric_optimize": "eval_f1-score",
        "retrain_at_end": False,
        "fixed_training_args": fixed_train_args
}

Next we need to define the configuration for the PubMed dataset.

For multilabel classification, we need to pass `is_multilabel=True`. `label_col` can be any of the label columns, so it does not matter which one you choose. We must also pass `multilabel_label_names`, that is, the names of all the labels in the multilabel task. As we need to preprocess the dataset before tokenizing the text, we set `pre_func=pre_parse_func`, using the function defined at the beginning of the tutorial. We also set `remove_fields_pre_func=True` to remove the unnecessary data fields after applying `pre_func`, as they would cause an error in the tokenization step if kept in the dataset. To configure the number of distinct labels, we use `config_num_labels=14`. Finally, as this dataset only has a `train` split, we need to perform a full split of the dataset with `split=True`, which creates validation and test splits. If the dataset had a test split but no validation split, we could have used `partial_split=True` instead.

In [None]:
pubmed_config = default_args_dataset.copy()
pubmed_config.update(
    {
        "dataset_name": "pubmed",
        "alias": "pubmed",
        "task": "classification",
        "is_multilabel": True,
        "multilabel_label_names": [f"label_{col}" for col in ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "L", "M", "N", "Z"]],
        "text_field": "text",
        "label_col": "label_A",
        "hf_load_kwargs": {"path": "owaiskha9654/PubMed_MultiLabel_Text_Classification_Dataset_MeSH"},
        "pre_func": pre_parse_func,
        "remove_fields_pre_func": True,
        "config_num_labels": 14,  # for multilabel we need to pass the number of labels for the config.
        "split": True  # as the dataset only comes with train split, we need to split in train, val, test.
    }
)
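Several of these settings have to agree with each other: the number of label names should match `config_num_labels`, and `label_col` should be one of the label names. A small sanity check along these lines can catch typos before a long run. The helper `check_multilabel_settings` is hypothetical and not part of `autotransformers`; it works on a plain dictionary like the one above.

```python
def check_multilabel_settings(settings):
    """Raise ValueError if the multilabel settings are inconsistent (hypothetical helper)."""
    names = settings["multilabel_label_names"]
    if len(set(names)) != len(names):
        raise ValueError("multilabel_label_names contains duplicates")
    if len(names) != settings["config_num_labels"]:
        raise ValueError(
            f"expected {settings['config_num_labels']} labels, got {len(names)}"
        )
    if settings["label_col"] not in names:
        raise ValueError("label_col must be one of multilabel_label_names")
    return True


label_letters = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "L", "M", "N", "Z"]
settings_to_check = {
    "multilabel_label_names": [f"label_{col}" for col in label_letters],
    "label_col": "label_A",
    "config_num_labels": 14,
}

settings_ok = check_multilabel_settings(settings_to_check)
print(settings_ok)
```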

In [None]:
pubmed_config = DatasetConfig(**pubmed_config)
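To make the difference between a full split and a partial split concrete, here is a minimal sketch in plain Python. It is not the implementation used by `autotransformers`: the function names, the 10% fractions and the shuffling strategy are all assumptions chosen for illustration.

```python
import random


def full_split(indices, val_fraction=0.1, test_fraction=0.1, seed=44):
    """Split a train-only dataset into train, validation and test (illustrative)."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    return {
        "test": shuffled[:n_test],
        "validation": shuffled[n_test:n_test + n_val],
        "train": shuffled[n_test + n_val:],
    }


def partial_split(train_indices, val_fraction=0.1, seed=44):
    """Carve a validation set out of train when a test split already exists (illustrative)."""
    shuffled = list(train_indices)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return {"validation": shuffled[:n_val], "train": shuffled[n_val:]}


full = full_split(range(100))
partial = partial_split(range(100))

print({name: len(rows) for name, rows in full.items()})
print({name: len(rows) for name, rows in partial.items()})
```

In both cases every example ends up in exactly one split; the only difference is whether a test split also has to be created.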

## Models Configuration

Now we can configure models, like in the classification tutorial.

**Note that we are using Spanish models for an English task. Since this notebook is solely for learning purposes and we are not trying to train well-performing models, this does not matter. However, please make sure you choose models that fit your tasks when using `autotransformers` for real projects.**

In [None]:
bertin_config = ModelConfig(
        name="bertin-project/bertin-roberta-base-spanish",
        save_name="bertin",
        hp_space=hp_space_base,
        n_trials=1,
)
beto_config = ModelConfig(
        name="dccuchile/bert-base-spanish-wwm-cased",
        save_name="beto",
        hp_space=hp_space_base,
        n_trials=1,
)

## Create AutoTrainer

We can now create the `AutoTrainer`, using the model configs and the dataset config we have just created. We additionally define a metrics directory, where metrics will be saved after training.

In [None]:
autotrainer = AutoTrainer(
        model_configs=[bertin_config, beto_config],
        dataset_configs=[pubmed_config],
        metrics_dir="pubmed_metrics"
)

## Train!

In [None]:
results = autotrainer()
print(results)

## Plot the Results

Once the models have been trained, we might want to compare their performance. `ResultsPlotter` helps with this, as shown in the next cells.

In [None]:
plotter = ResultsPlotter(
        metrics_dir=autotrainer.metrics_dir,
        model_names=[model_config.save_name for model_config in autotrainer.model_configs],
        dataset_to_task_map={dataset_config.alias: dataset_config.task for dataset_config in autotrainer.dataset_configs},
)

In [None]:
ax = plotter.plot_metrics()