[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/avacaondata/nlpboost/blob/main/notebooks/NER/train_spanish_ner.ipynb)

# Named Entity Recognition in Spanish

In this tutorial, we will train Spanish models on two different NER tasks: `conll2002`, which uses typical tags such as PER, LOC, and ORG, and `ehealth_kd`, whose labels correspond to entities in the biomedical domain. These datasets do not come in the same format, so we will also see how to add a preprocessing function to `DatasetConfig`, which lets us train on NER datasets in many different formats.

We first import the needed modules. If you are running this notebook in Google Colab, uncomment the cell below and run it before importing, in order to install `nlpboost`.

We import `DatasetConfig`, the class that configures how datasets are managed inside `AutoTrainer`. We also need `ModelConfig` to define the models to train, and `ResultsPlotter` to plot the experiment results. The function `dict_to_list` will help us with the `ehealth_kd` dataset, which has one field with the text and another field with the entities as a list of dictionaries. For NER training, however, each data instance needs two equally sized lists: the list of tokens and the list of labels for those tokens. `dict_to_list` performs that preprocessing for us.
Additionally, we import the default hyperparameter space for base-sized models.

In [None]:
# !pip install git+https://github.com/avacaondata/nlpboost.git 

In [None]:
from nlpboost import AutoTrainer, DatasetConfig, ModelConfig, dict_to_list, ResultsPlotter
from transformers import EarlyStoppingCallback
from nlpboost.default_param_spaces import hp_space_base
from functools import partial

## Configure the dataset

The next step is to define the fixed training arguments, which will be the `transformers.TrainingArguments` passed to `transformers.Trainer` inside `nlpboost.AutoTrainer`. For a full list of arguments, check the [TrainingArguments documentation](https://huggingface.co/docs/transformers/v4.25.1/en/main_classes/trainer#transformers.TrainingArguments). `DatasetConfig` expects these arguments in dictionary format.

To save time, we set `max_steps` to 1, which takes precedence over `num_train_epochs`. In a real setting we would need to define these arguments differently, but that is out of scope for this tutorial. To learn how to work with Transformers and how to configure the training arguments, please check the Hugging Face Course on NLP.

In [None]:
fixed_train_args = {
        "evaluation_strategy": "steps",
        "num_train_epochs": 10,
        "do_train": True,
        "do_eval": True,
        "logging_strategy": "steps",
        "eval_steps": 1,
        "save_steps": 1,
        "logging_steps": 1,
        "save_strategy": "steps",
        "save_total_limit": 2,
        "seed": 69,
        "fp16": True,
        "no_cuda": False,
        "dataloader_num_workers": 2,
        "load_best_model_at_end": True,
        "per_device_eval_batch_size": 16,
        "adam_epsilon": 1e-6,
        "adam_beta1": 0.9,
        "adam_beta2": 0.999,
        "max_steps": 1
    }
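To make the effect of `max_steps` concrete, here is a minimal sketch of how the total number of optimizer steps is determined when both `num_train_epochs` and `max_steps` are present. The helper `total_optimizer_steps`, the dataset size, and the batch size are hypothetical and only serve as illustration; the sketch ignores gradient accumulation and multi-device training.

```python
import math


def total_optimizer_steps(num_examples, batch_size, num_train_epochs, max_steps=-1):
    """Hypothetical helper: approximate number of optimizer steps in a training run."""
    # A positive max_steps overrides num_train_epochs.
    if max_steps > 0:
        return max_steps
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_train_epochs


# Hypothetical numbers: 8000 training examples, batch size 8.
quick_run = total_optimizer_steps(8000, 8, num_train_epochs=10, max_steps=1)
full_run = total_optimizer_steps(8000, 8, num_train_epochs=10)
print(quick_run, full_run)
```

For a real experiment, removing `max_steps` (or setting it to `-1`) and raising `eval_steps`, `save_steps`, and `logging_steps` would be a reasonable starting point.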

Now we define the default arguments that all NER datasets will share. That common config includes the random seed, the direction to optimize, the metric, callbacks and fixed training arguments.

In [None]:
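default_args_dataset = {
        "seed": 44,
        "direction_optimize": "maximize",
        "metric_optimize": "eval_f1-score",
        "retrain_at_end": False,
        # Stop when the metric fails to improve by more than the threshold for 1 evaluation.
        "callbacks": [EarlyStoppingCallback(early_stopping_patience=1, early_stopping_threshold=0.00001)],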
        "fixed_training_args": fixed_train_args
}
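The early stopping callback above takes a patience and a threshold. The following is a simplified sketch of that kind of rule for a metric that should be maximized; it is not the code used by `transformers`, and the function name `first_stop_index` and the example scores are hypothetical.

```python
def first_stop_index(scores, patience=1, threshold=0.00001):
    """Return the index of the evaluation at which training would stop, or None."""
    best = None
    evals_without_improvement = 0
    for index, score in enumerate(scores):
        # An evaluation counts as an improvement only if it beats the best
        # score so far by more than the threshold.
        if best is None or score - best > threshold:
            best = score
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
        if evals_without_improvement >= patience:
            return index
    return None


# Hypothetical F1 scores from four consecutive evaluations.
print(first_stop_index([0.70, 0.74, 0.74, 0.80]))
```

With a patience of 1, a single evaluation without improvement ends training, so the later jump to 0.80 in this made-up sequence is never reached. A larger patience trades compute for robustness to noisy evaluations.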

We can now start building the `conll2002` configuration. As this dataset already comes with a list of tokens and a list of labels for each row, we can directly use these two columns as `text_field` and `label_col` respectively.

In [None]:
conll2002_config = default_args_dataset.copy()
conll2002_config.update(
    {
        "dataset_name": "conll2002",
        "alias": "conll2002",
        "task": "ner",
        "text_field": "tokens",
        "hf_load_kwargs": {"path": "conll2002", "name": "es"},
        "label_col": "ner_tags",
    }
)

In [None]:
conll2002_config = DatasetConfig(**conll2002_config)
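One detail worth keeping in mind when deriving configs from a shared dictionary: `dict.copy()` is a shallow copy, so nested objects such as the `fixed_training_args` dictionary are shared between the copies. The sketch below uses a small stand-in dictionary (not the real config) to show the difference from `copy.deepcopy`.

```python
import copy

# Stand-in for the shared defaults; the values are illustrative only.
defaults = {"seed": 44, "fixed_training_args": {"max_steps": 1}}

shallow = defaults.copy()
shallow["seed"] = 0  # replaces a top-level key: the original is unaffected
shallow["fixed_training_args"]["max_steps"] = 500  # mutates the shared nested dict

print(defaults["seed"])  # still 44
print(defaults["fixed_training_args"]["max_steps"])  # now 500

deep = copy.deepcopy(defaults)
deep["fixed_training_args"]["max_steps"] = 1000  # only the deep copy changes
print(defaults["fixed_training_args"]["max_steps"])  # still 500
```

In this tutorial the `update` call only replaces top-level keys, so the shallow copy is harmless. A deep copy becomes relevant if one dataset needs its own training arguments.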

Next, we prepare the configuration for `ehealth_kd`. In this case we use a `pre_func` (`dict_to_list`) to preprocess the dataset. Since that function returns the list of labels in a column called `label_list`, that is the name we use for `label_col` in the config.

In [None]:
ehealth_config = default_args_dataset.copy()

ehealth_config.update(
    {
        "dataset_name": "ehealth_kd",
        "alias": "ehealth",
        "task": "ner",
        "text_field": "token_list",
        "hf_load_kwargs": {"path": "ehealth_kd"},
        "label_col": "label_list",
        "pre_func": partial(dict_to_list, nulltoken=100)
    }
)

In [None]:
ehealth_config = DatasetConfig(**ehealth_config)
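To illustrate the kind of transformation a `pre_func` performs here, the sketch below turns a text plus a list of entity dictionaries into two equally sized lists of tokens and labels. It is not the implementation of `dict_to_list`: the helper `spans_to_token_labels`, the field names (`text`, `start`, `end`, `label`), the whitespace tokenization, and the example sentence are all assumptions made for illustration.

```python
import re


def spans_to_token_labels(text, entities, null_label="O"):
    """Hypothetical helper: align character-span entities to whitespace tokens."""
    tokens, labels = [], []
    for match in re.finditer(r"\S+", text):
        start, end = match.span()
        label = null_label
        for entity in entities:
            # A token takes the label of the first entity span that overlaps it.
            if start < entity["end"] and end > entity["start"]:
                label = entity["label"]
                break
        tokens.append(match.group())
        labels.append(label)
    return tokens, labels


# Made-up example in a text-plus-entities format.
example = {
    "text": "El asma afecta a los pulmones",
    "entities": [
        {"start": 3, "end": 7, "label": "Concept"},
        {"start": 8, "end": 14, "label": "Action"},
        {"start": 21, "end": 29, "label": "Concept"},
    ],
}
token_list, label_list = spans_to_token_labels(example["text"], example["entities"])
print(list(zip(token_list, label_list)))
```

The two output lists play the role of the `token_list` and `label_list` columns referenced in the config above.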

In [None]:
dataset_configs = [
        conll2002_config,
        ehealth_config
]

## Configure Models

We will configure three Spanish models. We only need to define the `name`, which is the path to the model (either on the Hugging Face Hub or a local path), `save_name`, which is an arbitrary name for the model, the hyperparameter space, and the number of trials. There are more parameters, which you can check in the documentation.

In [None]:
bertin_config = ModelConfig(
        name="bertin-project/bertin-roberta-base-spanish",
        save_name="bertin",
        hp_space=hp_space_base,
        n_trials=1,
)
beto_config = ModelConfig(
        name="dccuchile/bert-base-spanish-wwm-cased",
        save_name="beto",
        hp_space=hp_space_base,
        n_trials=1,
)
albert_config = ModelConfig(
        name="CenIA/albert-tiny-spanish",
        save_name="albert",
        hp_space=hp_space_base,
        n_trials=1
)
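If the default space does not fit your needs, a custom one could look like the sketch below. It assumes that `hp_space` is a callable that receives an Optuna-style trial and returns a dictionary of training arguments, which is the convention used by `transformers.Trainer.hyperparameter_search` with the Optuna backend; check the `nlpboost` documentation before relying on this. `my_hp_space` and `LowerBoundTrial` are hypothetical names, and the stand-in trial only exists so that the example runs without Optuna installed.

```python
class LowerBoundTrial:
    """Stand-in for an Optuna trial: picks the first choice or the lower bound."""

    def suggest_categorical(self, name, choices):
        return choices[0]

    def suggest_float(self, name, low, high, log=False):
        return low


def my_hp_space(trial):
    """Hypothetical search space; the ranges are illustrative, not recommendations."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16]
        ),
        "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.1),
    }


sampled = my_hp_space(LowerBoundTrial())
print(sampled)
```

Under the stated assumption, such a function would be passed as `hp_space=my_hp_space` in place of `hp_space_base`.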

## Let's Train

Now we can configure `AutoTrainer` with the dataset configs and model configs defined above, and we are ready to train just by calling the autotrainer.

In [None]:
autotrainer = AutoTrainer(
        model_configs=[bertin_config, beto_config, albert_config],
        dataset_configs=dataset_configs,
        metrics_dir="metrics_spanish_ner",
)
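As a rough way to estimate the workload, the sketch below enumerates the experiment grid under the assumption that every model is trained on every dataset with `n_trials` trials each. The iteration order shown is not necessarily the one `AutoTrainer` follows.

```python
from itertools import product

# Names taken from the configs above.
model_names = ["bertin", "beto", "albert"]
dataset_aliases = ["conll2002", "ehealth"]
n_trials = 1

# One run per (dataset, model) pair.
runs = list(product(dataset_aliases, model_names))
total_trials = len(runs) * n_trials

for dataset_alias, model_name in runs:
    print(f"{model_name} on {dataset_alias}")
print("total trials:", total_trials)
```

The count grows multiplicatively, so adding one model or raising `n_trials` has a direct effect on total training time.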

In [None]:
results = autotrainer()
print(results)

## Plot the Results

Once the models have been trained, we might want to compare their performance. With `ResultsPlotter` we can easily do that.

In [None]:
plotter = ResultsPlotter(
        metrics_dir=autotrainer.metrics_dir,
        model_names=[model_config.save_name for model_config in autotrainer.model_configs],
        dataset_to_task_map={dataset_config.alias: dataset_config.task for dataset_config in autotrainer.dataset_configs},
)

In [None]:
ax = plotter.plot_metrics()