[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lenguajenatural-ai/autotransformers/blob/main/notebooks/NER/train_spanish_ner.ipynb)

# Introduction to Seq2Seq Learning with autotransformers: Mastering Text Summarization

Welcome to this tutorial on sequence-to-sequence (seq2seq) learning with autotransformers, focused on text summarization. Seq2seq models handle natural language processing (NLP) problems that convert one sequence into another, and they underpin applications such as machine translation, text summarization, question answering, and chatbot development.

Text summarization, the process of distilling the most important information from a source text to produce a shorter, concise version, serves as an exemplary case study to understand and harness the power of seq2seq models. This task not only demonstrates the model's ability to comprehend and generate text but also showcases its potential in extracting and condensing information, which is crucial for both academic research and real-world applications.

In this tutorial, we'll guide you through the steps of training a seq2seq model for text summarization using autotransformers, a library designed to streamline the development and training of language models. We'll cover everything from data preparation and model selection to training strategies and evaluation metrics. By the end of this tutorial, you'll be able to adapt the techniques learned here to other seq2seq tasks beyond summarization, such as machine translation and text generation.

Our goal is to provide hands-on experience alongside the theory, so that you can apply seq2seq models with autotransformers to your own tasks.

In [None]:
from autotransformers import AutoTrainer, DatasetConfig, ModelConfig, ResultsPlotter
from transformers import EarlyStoppingCallback
from transformers import Seq2SeqTrainer, MT5ForConditionalGeneration, XLMProphetNetForConditionalGeneration




## Configure the dataset

For training sequence-to-sequence (seq2seq) tasks such as text summarization with autotransformers, the initial step involves defining the dataset configuration along with the training arguments. These configurations play a critical role in customizing the training process, ensuring that it is optimized for the specific requirements of the task at hand.

### Fixed Training Arguments

The first component of the dataset configuration is the `fixed_train_args`. This dictionary encapsulates a set of `transformers.TrainingArguments` that are later passed to the `transformers.Trainer` within `autotransformers.AutoTrainer`. The `TrainingArguments` class provides a comprehensive range of options to fine-tune the training behavior, from the evaluation strategy to the model's saving behavior.

Here is an overview of the key training arguments we're setting:

- `evaluation_strategy`: "epoch" - Evaluates the model performance at the end of each epoch.
- `num_train_epochs`: 10 - Specifies the total number of training epochs.
- `do_train`: True - Enables the training process.
- `do_eval`: True - Enables the evaluation process.
- `logging_strategy`: "epoch" - Logs metrics at the end of each epoch.
- `save_strategy`: "epoch" - Saves the model at the end of each epoch.
- `save_total_limit`: 2 - Limits the total number of model checkpoints to save.
- `seed`: 69 - Sets the seed for generating random numbers.
- `bf16`: True - Enables bfloat16 mixed precision training for faster computation. This requires hardware that supports bfloat16.
- `dataloader_num_workers`: 16 - Sets the number of subprocesses to use for data loading.
- `load_best_model_at_end`: True - Loads the best model found during training when training is finished.
- `optim`: "adafactor" - Uses the Adafactor optimizer, which is more memory efficient than AdamW.

For a comprehensive list of available training arguments, refer to the [TrainingArguments documentation](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).

In [2]:
fixed_train_args = {
    "evaluation_strategy": "epoch",
    "num_train_epochs": 10,
    "do_train": True,
    "do_eval": True,
    "logging_strategy": "epoch",
    "save_strategy": "epoch",
    "save_total_limit": 2,
    "seed": 69,
    "bf16": True,
    "dataloader_num_workers": 16,
    "load_best_model_at_end": True,
    "optim": "adafactor",
    # NOTE: Added only so the tutorial runs quickly. A positive max_steps
    # overrides num_train_epochs; remove it for a real training run.
    "max_steps": 1,
}


### Dataset Configuration for Text Summarization

After establishing the training arguments, we define the `mlsum_config` dictionary, which outlines the specific settings for the text summarization task using the MLSum dataset. This configuration includes both the `fixed_train_args` and additional parameters tailored to the dataset and task:

- `seed`: 44 - Seed for random number generation, ensuring reproducibility.
- `direction_optimize`: "maximize" - The direction of optimization for the metric of interest.
- `metric_optimize`: "eval_rouge2" - The metric to optimize during training, in this case, ROUGE-2 for summarization quality.
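- `callbacks`: A list of callbacks for training. Here we use an `EarlyStoppingCallback` with a patience of 1 and a threshold of 0.00001, which stops training once the optimized metric stops improving and so helps prevent overfitting.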
- `fixed_training_args`: The dictionary of training arguments defined previously.

Additional parameters specific to the MLSum dataset and the summarization task are also specified:

- `dataset_name` and `alias`: Both set to "mlsum" for identification.
- `retrain_at_end`: False - Indicates whether to retrain the model on the entire dataset after validation.
- `task`: "summarization" - Specifies the NLP task.
- `hf_load_kwargs`: Arguments for loading the dataset, including the path and name.
- `label_col`: "summary" - Defines the column to use as the label for summarization.
- `num_proc`: 16 - The number of processes used for data preprocessing.

Lastly, the `mlsum_config` dictionary is transformed into a `DatasetConfig` object, encapsulating all the necessary configurations for the dataset and training setup.

This structured approach to configuring the dataset and training parameters ensures that you can adapt and optimize the seq2seq model training for text summarization, with the flexibility to adjust settings for other seq2seq tasks as well.

To delve deeper into configuring training arguments and understanding their impact on model performance, consider exploring the Hugging Face Course on NLP, which offers extensive guidance on working with the Transformers library.


In [3]:

mlsum_config = {
    "seed": 44,
    "direction_optimize": "maximize",
    "metric_optimize": "eval_rouge2",
    "callbacks": [
        EarlyStoppingCallback(
            early_stopping_patience=1, early_stopping_threshold=0.00001
        )
    ],
    "fixed_training_args": fixed_train_args
}

mlsum_config.update(
    {
        "dataset_name": "mlsum",
        "alias": "mlsum",
        "retrain_at_end": False,
        "task": "summarization",
        "hf_load_kwargs": {"path": "mlsum", "name": "es"},
        "label_col": "summary",
        "num_proc": 16}
)

mlsum_config = DatasetConfig(**mlsum_config)
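
To make the early-stopping settings above more concrete (a patience of 1 and a threshold of 0.00001 on a metric that is maximized), here is a minimal, standard-library sketch of patience-based early stopping. It is a simplified model of the idea and not the code used by `transformers`; the function name `early_stopping_epoch` and the per-epoch scores are hypothetical and invented for illustration.

```python
def early_stopping_epoch(scores, patience=1, threshold=1e-5):
    """Return the 1-based epoch at which training would stop, or None.

    Simplified model of patience-based early stopping for a metric that
    should be maximized: an evaluation counts as an improvement only if it
    beats the best score so far by more than `threshold`.
    """
    best = None
    evals_without_improvement = 0
    for epoch, score in enumerate(scores, start=1):
        if best is None or score - best > threshold:
            best = score
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return epoch
    return None


# Hypothetical per-epoch eval_rouge2 values, invented for illustration.
rouge2_per_epoch = [0.101, 0.118, 0.1180005, 0.125]
stop_epoch = early_stopping_epoch(rouge2_per_epoch, patience=1, threshold=1e-5)
print(stop_epoch)  # 3: the gain at epoch 3 is smaller than the threshold
```

With a patience of 1, a single evaluation without a sufficient gain ends training, so the later improvement at epoch 4 is never reached. A larger patience trades extra compute for tolerance to noisy evaluations.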

## Models Configuration

In this section, we define the hyperparameter search space and preprocessing functions, followed by the configuration for each model we plan to evaluate. The goal is to find the optimal set of parameters that yields the best performance on our seq2seq task of text summarization. We will explore different models to demonstrate the versatility of autotransformers in handling various architectures efficiently.

### Hyperparameter Search Space

The `hp_space` function is designed to define the hyperparameter search space for the optimization process. This function takes a `trial` object as input and returns a dictionary mapping hyperparameter names to their suggested values. We use `suggest_categorical` for simplicity, specifying discrete choices for each hyperparameter:

In [4]:
def hp_space(trial):
    return {
        "learning_rate": trial.suggest_categorical(
            "learning_rate", [3e-5, 5e-5, 7e-5, 2e-4]
        ),
        "num_train_epochs": trial.suggest_categorical(
            "num_train_epochs", [10]
        ),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8]),
        "per_device_eval_batch_size": trial.suggest_categorical(
            "per_device_eval_batch_size", [8]),
        "gradient_accumulation_steps": trial.suggest_categorical(
            "gradient_accumulation_steps", [8]),
        "warmup_ratio": trial.suggest_categorical(
            "warmup_ratio", [0.08]
        ),
    }
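
As a rough illustration of what this search space implies, the sketch below enumerates the distinct combinations and estimates the effective batch size and warmup length. The choice lists are copied from `hp_space`; the single-device setup and the 64,000-example training set are assumptions made only for the arithmetic, and the `Trainer`'s exact step rounding may differ.

```python
import itertools
import math

# Choice lists copied from hp_space.
search_space = {
    "learning_rate": [3e-5, 5e-5, 7e-5, 2e-4],
    "num_train_epochs": [10],
    "per_device_train_batch_size": [8],
    "per_device_eval_batch_size": [8],
    "gradient_accumulation_steps": [8],
    "warmup_ratio": [0.08],
}

# Every distinct combination the search could sample.
combinations = [
    dict(zip(search_space, values))
    for values in itertools.product(*search_space.values())
]
print(len(combinations))  # 4: only the learning rate varies

num_devices = 1          # assumption: a single GPU
num_examples = 64_000    # assumption: hypothetical training set size

config = combinations[0]
effective_batch_size = (
    config["per_device_train_batch_size"]
    * config["gradient_accumulation_steps"]
    * num_devices
)
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
total_steps = steps_per_epoch * config["num_train_epochs"]
warmup_steps = math.ceil(total_steps * config["warmup_ratio"])

print(effective_batch_size)  # 64
print(total_steps)           # 10000
print(warmup_steps)          # 800
```

Because every hyperparameter except the learning rate has a single choice, the search here is effectively a learning-rate sweep with a fixed effective batch size of 64 per device.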


### Preprocessing Function
Before feeding the data to our models, we need to preprocess it. The `preprocess_function` tokenizes the input texts and the target summaries. Inputs are truncated to 1024 tokens and summaries to `dataset_config.max_length_summary` tokens. The token IDs of the summaries are then stored under the `labels` key of the model inputs:

In [5]:

def preprocess_function(examples, tokenizer, dataset_config):
    model_inputs = tokenizer(
        examples[dataset_config.text_field],
        truncation=True,
        max_length=1024
    )
    labels = tokenizer(
        text_target=examples[dataset_config.summary_field],
        max_length=dataset_config.max_length_summary,
        truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
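
To see the shape of what `preprocess_function` returns without downloading a model, the sketch below runs it with stand-ins. `ToyTokenizer` is a hypothetical whitespace tokenizer, not a Hugging Face tokenizer, and `toy_config` is a plain namespace whose field values (`"text"`, `"summary"`, a summary limit of 3) are assumptions chosen for the example rather than the defaults of `DatasetConfig`.

```python
from types import SimpleNamespace


class ToyTokenizer:
    """Whitespace tokenizer stand-in; NOT a Hugging Face tokenizer."""

    def __init__(self):
        self.vocab = {}

    def _encode(self, text, max_length):
        # Assign each new token the next free integer ID, then truncate.
        ids = [self.vocab.setdefault(tok, len(self.vocab)) for tok in text.split()]
        return ids[:max_length]

    def __call__(self, text=None, text_target=None, truncation=True, max_length=None):
        batch = text if text is not None else text_target
        return {"input_ids": [self._encode(item, max_length) for item in batch]}


def preprocess_function(examples, tokenizer, dataset_config):
    # Same logic as the tutorial's function, repeated so the block runs alone.
    model_inputs = tokenizer(
        examples[dataset_config.text_field], truncation=True, max_length=1024
    )
    labels = tokenizer(
        text_target=examples[dataset_config.summary_field],
        max_length=dataset_config.max_length_summary,
        truncation=True,
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


# Hypothetical config values and a one-example batch, for illustration only.
toy_config = SimpleNamespace(
    text_field="text", summary_field="summary", max_length_summary=3
)
batch = {
    "text": ["el gato duerme en el sofá toda la tarde"],
    "summary": ["el gato duerme mucho hoy"],
}
features = preprocess_function(batch, ToyTokenizer(), toy_config)
print(features["input_ids"])  # [[0, 1, 2, 3, 0, 4, 5, 6, 7]]
print(features["labels"])     # [[0, 1, 2]]
```

The five-word summary is cut to three IDs by `max_length_summary`, while the nine-word input fits within the 1024-token limit and is kept whole.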


### Model Configurations
We then define configurations for each model we intend to train. Here, we showcase configurations for `mt5-large` and `xprophetnet-large-wiki100-cased`, specifying the model's name, the hyperparameter search space, and the preprocessing function, among other settings. Each configuration is encapsulated in a `ModelConfig` object:

In [6]:

mt5_config = ModelConfig(
    name="google/mt5-large",
    save_name="mt5-large",
    hp_space=hp_space,
    num_beams=4,
    trainer_cls_summarization=Seq2SeqTrainer,
    model_cls_summarization=MT5ForConditionalGeneration,
    custom_tokenization_func=preprocess_function,
    n_trials=1,
    random_init_trials=1
)
xprophetnet_config = ModelConfig(
    name="microsoft/xprophetnet-large-wiki100-cased",
    save_name="xprophetnet",
    hp_space=hp_space,
    num_beams=4,
    trainer_cls_summarization=Seq2SeqTrainer,
    model_cls_summarization=XLMProphetNetForConditionalGeneration,
    custom_tokenization_func=preprocess_function,
    n_trials=1,
    random_init_trials=1
)

These configurations are crucial for setting up our experiments, allowing us to systematically evaluate and compare the performance of different models on the summarization task.

## Training and Evaluating Models

After configuring our models and the dataset, we proceed to instantiate the `AutoTrainer` class. This powerful class from autotransformers orchestrates the training and evaluation process for the specified models and datasets. It's designed to streamline the experimentation process, making it easier to compare the performance of different model configurations across various tasks.

### Setting Up AutoTrainer

The `AutoTrainer` is initialized with the following key components:

- `model_configs`: A list of model configurations to be trained and evaluated. In our case, we include the configurations for `mt5-large` and `xprophetnet-large-wiki100-cased`.
- `dataset_configs`: A list containing the dataset configurations. Here, it includes our earlier defined `mlsum_config` for the text summarization task.
- `metrics_dir`: The directory path where the evaluation metrics for each model will be saved. We specify `"mlsum_multilingual_models"` to organize our results.
- `metrics_cleaner`: The function or script used to process and clean the metrics data. We use `"metrics_mlsum"` to ensure our results are formatted correctly and easily interpretable.

In [None]:
autotrainer = AutoTrainer(
    model_configs=[mt5_config, xprophetnet_config],
    dataset_configs=[mlsum_config],
    metrics_dir="mlsum_multilingual_models",
    metrics_cleaner="metrics_mlsum"
)


### Running the Training and Evaluation Process
With the `AutoTrainer` configured, we simply call it to start the training and evaluation process across our specified models and dataset. The results, including performance metrics for each model, are captured and printed out, providing a comprehensive overview of how each model performed on the summarization task:

In [None]:
results = autotrainer()
print(results)


This process not only facilitates an efficient way to train and evaluate multiple models but also organizes and presents the results in a manner that aids in decision-making for selecting the best-performing model for your specific NLP task.

## Visualizing Model Performance

After training and evaluating our models, it's important to visualize their performance to make informed decisions. autotransformers provides a convenient way to do this through the `ResultsPlotter` class, which generates comparative plots of model metrics across different configurations. This visualization helps in understanding the strengths and weaknesses of each model in a more intuitive manner.

### Setting Up ResultsPlotter

The `ResultsPlotter` is initialized with several parameters to specify the source of the metrics data and how it should be visualized:

- `metrics_dir`: The directory where the metrics are stored. We use `autotrainer.metrics_dir` to automatically fetch the directory specified during the AutoTrainer setup.
- `model_names`: A list of model names to include in the plot. This is dynamically generated from the model configurations used in the AutoTrainer, ensuring that all evaluated models are represented.
- `dataset_to_task_map`: A mapping of dataset aliases to their respective tasks, helping in the categorization and labeling of plot data. This mapping is constructed from the dataset configurations used in the AutoTrainer.
- `metric_field`: The specific metric to be plotted. We choose `"rouge2"`, a common metric for evaluating text summarization models that measures the bigram overlap between generated and reference summaries.


In [None]:

plotter = ResultsPlotter(
    metrics_dir=autotrainer.metrics_dir,
    model_names=[model_config.save_name for model_config in autotrainer.model_configs],
    dataset_to_task_map={dataset_config.alias: dataset_config.task for dataset_config in autotrainer.dataset_configs},
    metric_field="rouge2"
)
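
For intuition about the `rouge2` metric being plotted, here is a minimal sketch of a ROUGE-2 F1 score as clipped bigram overlap between a candidate and a reference summary. It is a simplification under stated assumptions: whitespace tokenization, lowercasing, and no stemming, so its values will not match those of a full ROUGE implementation. The helper names `bigrams` and `rouge2_f1` and the example sentences are hypothetical.

```python
from collections import Counter


def bigrams(text):
    """Count adjacent token pairs after lowercasing and whitespace splitting."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2_f1(candidate, reference):
    """F1 of the clipped bigram overlap between candidate and reference."""
    cand, ref = bigrams(candidate), bigrams(reference)
    overlap = sum((cand & ref).values())  # Counter & keeps the minimum counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
candidate = "the cat sat on a mat"
score = rouge2_f1(candidate, reference)
print(round(score, 3))  # 0.6: 3 of the 5 bigrams on each side match
```

Changing a single word breaks two bigrams, which is why ROUGE-2 is more sensitive to word order and phrasing than unigram-based ROUGE-1.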


### Generating and Saving the Plot
With the `ResultsPlotter` configured, we generate the metrics plot by calling `plot_metrics()`. This method returns a matplotlib Axes object, which we can then use to further customize the plot or save it directly. Here, we save the plot as "results.png", providing a visual summary of our models' performance on the text summarization task:

In [None]:
ax = plotter.plot_metrics()
ax.figure.savefig("results.png")