[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/lenguajenatural-ai/autotransformers/blob/master/notebooks/chatbot_instructions/train_instructional_chatbot.ipynb)

In [None]:
!pip install autotransformers

# Train Instructional Chatbots in Spanish

In this tutorial, we'll explore how to train instructional chatbots in Spanish using the [somos-clean-alpaca](https://huggingface.co/datasets/somosnlp/somos-clean-alpaca-es) dataset. This dataset provides a rich collection of conversational and instructional interactions in Spanish, making it an ideal resource for developing chatbots capable of understanding and executing specific instructions. We'll leverage the `autotransformers` library to streamline the training process, applying advanced techniques such as LoRA and quantization for efficient model adaptation and performance. Whether you're looking to enhance an existing chatbot or build a new one from scratch, this guide will equip you with the knowledge and tools needed to succeed.

## Importing Necessary Libraries

Before we begin, it's essential to import the necessary libraries that will be used throughout the tutorial. These libraries provide the foundational tools required for loading datasets, configuring models, and training. Below is a brief overview of each import and its role in our project:

- `from autotransformers import AutoTrainer, DatasetConfig, ModelConfig`: Imports the `AutoTrainer` class for orchestrating the training process, and `DatasetConfig` and `ModelConfig` for configuring the dataset and model parameters, respectively, within the `autotransformers` library.

- `from autotransformers.llm_templates import instructions_to_chat, NEFTuneTrainer, QLoraWrapperModelInit, modify_tokenizer, qlora_config, SavePeftModelCallback`: These imports from the library's large language model (LLM) templates module provide `instructions_to_chat` for converting instruction samples into chat messages, `NEFTuneTrainer` for training with NEFTune (noisy embeddings), `QLoraWrapperModelInit` for initializing models with LoRA wrappers, `modify_tokenizer` for adapting the tokenizer to our task, `qlora_config` for configuring quantization (QLoRA), and `SavePeftModelCallback` for saving PEFT (Parameter-Efficient Fine-Tuning) models.

- `from functools import partial`: The `partial` function from the `functools` module is used to partially apply functions, allowing us to pre-specify some arguments of a function, which is particularly useful for customizing our tokenizer modification function.

- `from peft import LoraConfig`: Imports the `LoraConfig` class from the `peft` library, which is used to specify configurations for LoRA (Low-Rank Adaptation), an efficient method for adapting pre-trained models to new tasks with minimal computational overhead.

- `from datasets import load_dataset`: From the `datasets` library, we import the `load_dataset` function, which loads datasets hosted on the Hugging Face Hub, including our target dataset `somos-clean-alpaca-es`.

Ensure all these libraries are installed in your environment before proceeding with the tutorial.


In [None]:
from autotransformers import AutoTrainer, DatasetConfig, ModelConfig
from autotransformers.llm_templates import instructions_to_chat, NEFTuneTrainer, QLoraWrapperModelInit, modify_tokenizer, qlora_config, SavePeftModelCallback
from functools import partial
from peft import LoraConfig
from datasets import load_dataset
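
Since `partial` is used later to pre-configure both the dataset mapping function and the tokenizer modification function, a small standalone illustration may help. The function `configure_tokenizer` below is a hypothetical stand-in written for this example; it is not part of `autotransformers`.

```python
from functools import partial


def configure_tokenizer(settings: dict, max_length: int, pad_token: str) -> dict:
    """Hypothetical stand-in for a tokenizer-modifying function."""
    updated = dict(settings)  # copy so the caller's dict is left untouched
    updated["model_max_length"] = max_length
    updated["pad_token"] = pad_token
    return updated


# Fix the keyword arguments now; the positional argument is supplied later.
configure_for_chat = partial(configure_tokenizer, max_length=4096, pad_token="[PAD]")

base_settings = {"model_max_length": 8192}
chat_settings = configure_for_chat(base_settings)
print(chat_settings)
```

The object returned by `partial` behaves like a one-argument function, which is the shape a library needs when it calls the function with only the tokenizer.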

## Creating the Chat Template

To correctly format the conversations for training, we define a chat template using Jinja2 templating syntax. This template iterates through each message in a conversation, categorizing and formatting them based on their role:

- **User Messages**: Wrapped with `<user>` tags to clearly indicate messages from the user. These are the instructions or queries directed at the chatbot.

- **System Messages**: Enclosed within `<system>` tags, followed by line breaks for readability. These messages might include system-generated instructions or context that guides the chatbot's responses.

- **Assistant Responses**: Appended directly after the closing `</user>` tag, with no opening tag of their own, and terminated by an `</assistant>` tag followed by the end-of-sequence (EOS) token. These are the chatbot's replies at each turn of the conversation.

- **Input Data**: Marked with `<input>` tags to distinguish any additional input or contextual information provided to the chatbot.

This structured format is crucial for the model to understand the different components of a conversation, enabling it to generate appropriate responses based on the role of each message.

Typically, a conversation starts with the system message, continues with an optional input containing additional context for the assistant, and is followed by one or more user-assistant turns.

In [None]:
CHAT_TEMPLATE = """{% for message in messages %}
    {% if message['role'] == 'user' %}
        {{'<user> ' + message['content'].strip() + ' </user>' }}
    {% elif message['role'] == 'system' %}
        {{'<system>\\n' + message['content'].strip() + '\\n</system>\\n\\n' }}
    {% elif message['role'] == 'assistant' %}
        {{ message['content'].strip() + ' </assistant>' + eos_token }}
    {% elif message['role'] == 'input' %}
        {{'<input> ' + message['content'] + ' </input>' }}
    {% endif %}
{% endfor %}"""
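
To make the role handling concrete, here is a plain-Python approximation of the same formatting rules. It is only a sketch: `render_chat` is a hypothetical helper, `<eos>` is a placeholder for the tokenizer's real EOS token, and the whitespace produced by the actual Jinja2 template may differ depending on how it is rendered.

```python
def render_chat(messages: list[dict], eos_token: str) -> str:
    """Plain-Python approximation of the role handling in the chat template."""
    parts = []
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "user":
            parts.append("<user> " + content.strip() + " </user>")
        elif role == "system":
            parts.append("<system>\n" + content.strip() + "\n</system>\n\n")
        elif role == "assistant":
            # No opening tag: the reply follows the previous message directly.
            parts.append(content.strip() + " </assistant>" + eos_token)
        elif role == "input":
            parts.append("<input> " + content + " </input>")
    return "".join(parts)


conversation = [
    {"role": "system", "content": "Eres un asistente."},
    {"role": "input", "content": "Madrid, Sevilla"},
    {"role": "user", "content": "Ordena las ciudades alfabéticamente."},
    {"role": "assistant", "content": "Madrid, Sevilla"},
]
rendered = render_chat(conversation, eos_token="<eos>")
print(rendered)
```

Because the assistant reply ends with `</assistant>` plus the EOS token, the model can learn where a response stops.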

## Dataset Preparation

The dataset preparation phase is crucial for structuring the data in a way that's conducive to training a chatbot. We first load the dataset from the hub and then utilize a custom function, `process_alpaca`, to transform each sample from the `somos-clean-alpaca` dataset into a format that mirrors a real conversation flow involving a system message, user input, and assistant response.

### The `process_alpaca` Function

`process_alpaca` takes a dictionary representing a single dataset sample and restructures it by categorizing and ordering messages based on their role in a conversation:

- It starts by adding a **system message** that sets the context for the chatbot as an assistant designed to follow user instructions.
- If present, **input data** is added next to provide additional context or information needed to fulfill the user's request.
- The **user's instruction** is then added, followed by the **assistant's output**, which is the response to the user's request.

This restructuring results in a `messages` list within the sample dictionary, containing all conversation elements in their logical order.

### Applying the Transformation

To apply this transformation across the entire dataset:

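- We use the `.map` method with a mapping function, setting `batched=False` to process samples individually and `num_proc=4` to parallelize the operation. The cell below passes the library's `instructions_to_chat` helper, configured with `partial` to read the `1-instruction`, `2-input`, and `3-output` fields nested under `inputs`; `process_alpaca` could be passed instead.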
- Columns not part of the `messages` structure are removed to streamline the dataset.

Finally, the dataset is split into training and test sets with a 20% test size, ensuring that we can evaluate our chatbot's performance on unseen data. This split is achieved using the `train_test_split` method, providing a solid foundation for training and validating the chatbot model.


In [None]:
alpaca = load_dataset("somosnlp/somos-clean-alpaca-es")

In [None]:
def process_alpaca(sample: dict) -> dict:
    """
    Processes a single sample from the alpaca dataset to structure it for chatbot training.

    This function transforms the dataset sample into a format suitable for training,
    where each message is categorized by its role in the conversation (system, input, user, assistant).
    It initializes the conversation with a system message, then conditionally adds an input message,
    follows with the user's instruction, and finally, the assistant's output based on the provided inputs.

    Parameters
    ----------
    sample : dict
        A dictionary representing a single sample from the dataset. It must contain
        keys corresponding to input and output components of the conversation.

    Returns
    -------
    dict
        A modified dictionary with a 'messages' key that contains a list of ordered messages,
        each annotated with its role in the conversation.
    """
    chat = [
        {"role": "system", "content": "Eres un asistente que resuelve las instrucciones del usuario. Si se proporciona contexto adicional, utiliza esa información para completar la instrucción."}
    ]
    inp_ = sample["inputs"]["2-input"] 
    if inp_ is not None and inp_ != "":
        chat.append(
            {"role": "input", "content": inp_}
        )
    chat.extend(
        [
            {"role": "user", "content": sample["inputs"]["1-instruction"]},
            {"role": "assistant", "content": sample["inputs"]["3-output"]}
        ]
    )
    sample["messages"] = chat
    return sample
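
As a quick check of the ordering logic, the sketch below restates it in a condensed, self-contained form and runs it on two made-up samples. `build_messages` is a hypothetical helper for this illustration, the system prompt is shortened, and the sample contents are invented; only the field names come from the dataset.

```python
SYSTEM_PROMPT = "Eres un asistente que resuelve las instrucciones del usuario."  # shortened


def build_messages(sample: dict) -> list[dict]:
    """Condensed restatement of the ordering logic in process_alpaca."""
    fields = sample["inputs"]
    chat = [{"role": "system", "content": SYSTEM_PROMPT}]
    if fields["2-input"]:  # skips both None and the empty string
        chat.append({"role": "input", "content": fields["2-input"]})
    chat.append({"role": "user", "content": fields["1-instruction"]})
    chat.append({"role": "assistant", "content": fields["3-output"]})
    return chat


with_context = {
    "inputs": {"1-instruction": "Resume el texto.", "2-input": "Un texto largo.", "3-output": "Un resumen."}
}
without_context = {
    "inputs": {"1-instruction": "Di hola.", "2-input": "", "3-output": "Hola."}
}

roles_with = [m["role"] for m in build_messages(with_context)]
roles_without = [m["role"] for m in build_messages(without_context)]
print(roles_with)
print(roles_without)
```

The `input` role appears only when the sample actually carries additional context.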


Alternatively, we can directly use the `instructions_to_chat` function from `llm_templates`.

In [None]:
alpaca = alpaca.map(
    partial(
        instructions_to_chat,
        input_field="1-instruction",
        context_field="2-input",
        output_field="3-output",
        nested_field="inputs"
    ),
    batched=False,
    num_proc=4,
    remove_columns=[col for col in alpaca["train"].column_names if col != "messages"],
)

In [None]:
alpaca = alpaca["train"].train_test_split(0.2, seed=203984)
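
The role of the seed can be illustrated with a standard-library sketch of a seeded 80/20 split. This is not how the `datasets` library implements `train_test_split`; `split_indices` is a hypothetical helper that only shows why fixing the seed makes the split reproducible.

```python
import random


def split_indices(n_rows: int, test_size: float, seed: int) -> tuple[list[int], list[int]]:
    """Shuffle row indices with a seeded generator, then cut off the test share."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(n_rows=10, test_size=0.2, seed=203984)
print(len(train_idx), len(test_idx))

# Calling again with the same seed yields the same partition.
print(split_indices(n_rows=10, test_size=0.2, seed=203984) == (train_idx, test_idx))
```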

## Configuring the Dataset for AutoTransformers

To ensure our instructional chatbot model trains efficiently and effectively, we meticulously configure our dataset using the `autotransformers` library's `DatasetConfig`. This step is essential for tailoring the training process to our specific needs, including hyperparameter settings, dataset particulars, and training strategies.

### Setting Up Training Arguments

A set of fixed training arguments (`fixed_train_args`) is defined to control various aspects of the training process:

- **Batch sizes** for both training and evaluation are set to 1, indicating that samples are processed individually. This can be particularly useful for large models or when GPU memory is limited.
- **Gradient accumulation** is used with 16 steps, allowing us to effectively simulate a larger batch size and stabilize training without exceeding memory limits.
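- A **warmup ratio** of 0.03 is set, which is meant to ramp the learning rate up gradually over the first 3% of training steps and avoid unstable early updates. Note that `lr_scheduler_type` is `"constant"` in the cell below; in Hugging Face `transformers` that scheduler keeps the learning rate fixed and ignores warmup, so `"constant_with_warmup"` would be needed for the warmup to take effect.
- **Optimization settings** include a learning rate of 2e-4, a weight decay of 0.001, gradient clipping at a maximum norm of 0.3, and the `paged_adamw_32bit` optimizer. `bf16` precision and gradient checkpointing are enabled to reduce memory usage.
- **Evaluation and saving strategies** are both step-based: the model is evaluated every 200 steps and a checkpoint is saved every 50 steps, keeping at most 50 checkpoints. This enables monitoring and resuming training from the last saved state.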
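
The interaction between these settings is easier to see with a little arithmetic. The sketch below assumes a single GPU and a hypothetical training-set size of 40,000 samples, chosen only for illustration; the step count is approximate, since the trainer's exact rounding may differ.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
eval_steps = 200
save_steps = 50
save_total_limit = 50

n_devices = 1             # assumption: a single GPU, as on Colab
n_train_samples = 40_000  # hypothetical training-set size, for illustration only

# Samples contributing to each optimizer update.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * n_devices

# Approximate number of optimizer updates in one epoch.
optimizer_steps = n_train_samples // effective_batch_size

n_evaluations = optimizer_steps // eval_steps
n_checkpoints_written = optimizer_steps // save_steps
n_checkpoints_kept = min(n_checkpoints_written, save_total_limit)

print(f"effective batch size: {effective_batch_size}")
print(f"optimizer steps per epoch: {optimizer_steps}")
print(f"evaluations: {n_evaluations}, checkpoints kept: {n_checkpoints_kept}")
```

With these numbers, checkpoints are written four times as often as evaluations run, so most saved checkpoints have no evaluation loss of their own.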

### Crafting the Dataset Configuration

The `alpaca_config` dictionary encompasses all necessary information for dataset preparation and integration:

- **Dataset details** such as name, task type, and specific columns to use for text and labels ensure that the model trains on the correct data format.
- **Training parameters** are included by passing the `fixed_train_args` dictionary through the `fixed_training_args` key.
- **Callback classes**, such as `SavePeftModelCallback`, automate important steps like model saving during training.
- **Process optimizations** like setting a seed for reproducibility, specifying the optimization direction and metric, and enabling partial splits for validation set creation.

In [None]:
fixed_train_args = {
    "per_device_train_batch_size": 1,
    "per_device_eval_batch_size": 1,
    "gradient_accumulation_steps": 16,
    "warmup_ratio": 0.03,
    "learning_rate": 2e-4,
    "bf16": True,
    "logging_steps": 50,
    "lr_scheduler_type": "constant",
    "weight_decay": 0.001,
    "eval_steps": 200,
    "save_steps": 50,
    "num_train_epochs": 1,
    "logging_first_step": True,
    "evaluation_strategy": "steps",
    "save_strategy": "steps",
    "max_grad_norm": 0.3,
    "optim": "paged_adamw_32bit",
    "gradient_checkpointing": True,
    "group_by_length": False,
    "save_total_limit": 50,
    "adam_beta2": 0.999
}

In [None]:
alpaca_config = {
        "seed": 9834,
        "direction_optimize": "minimize",
        "metric_optimize": "eval_loss",
        "callbacks": [SavePeftModelCallback],
        "fixed_training_args": fixed_train_args,
        "dataset_name": "alpaca",
        "alias": "alpaca",
        "retrain_at_end": False,
        "task": "chatbot",
        "text_field": "messages",
        "label_col": "messages",
        "num_proc": 4,
        "loaded_dataset": alpaca,
        "partial_split": True, # to create a validation split.
}

In [None]:
alpaca_config = DatasetConfig(**alpaca_config)

## Model Configuration

This section sets up the model configuration with `autotransformers`, combining LoRA (Low-Rank Adaptation) for parameter-efficient adaptation with quantization for memory efficiency. Together, these make it feasible to fine-tune the model within the memory limits of a single GPU.

### LoRA Configuration

The `LoraConfig` object is instantiated with parameters tailored to enhance model adaptability while maintaining efficiency:

- **r (rank)** and **lora_alpha** control the capacity and scaling of the LoRA layers: `r=64` sets the rank of the low-rank update matrices, and the update is scaled by `lora_alpha / r` (here 32 / 64 = 0.5), balancing model flexibility against overfitting risk.
- **target_modules** specifies which parts of the model receive LoRA adapters. Here it is set to "all-linear", so adapters are attached to the model's linear layers rather than to a hand-picked subset.
- **lora_dropout** is set to 0.05. The comment in the code records a rule of thumb of 0.1 for models under 13B parameters and 0.05 otherwise, so 0.1 would also be a reasonable choice for this 2B model.
- **bias** is set to "none", meaning that no bias parameters are trained as part of the LoRA adaptation.
- The **task_type** is specified as "CAUSAL_LM" to indicate the causal language modeling task, aligning with the instructional chatbot's nature.
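
In standard LoRA, a frozen weight matrix of shape `d_out × d_in` is augmented with a trainable update `(lora_alpha / r) · B · A`, where `A` has shape `r × d_in` and `B` has shape `d_out × r`. The sketch below works out the scaling factor and the number of trainable parameters for a single layer; the 2048 × 2048 layer size is a hypothetical value chosen for illustration.

```python
r = 64
lora_alpha = 32
d_in, d_out = 2048, 2048  # hypothetical layer dimensions, for illustration only

# Factor applied to the low-rank update B @ A.
scaling = lora_alpha / r

# Parameters in the frozen weight matrix versus the two LoRA matrices.
full_params = d_in * d_out
lora_params = r * d_in + d_out * r  # A is (r x d_in), B is (d_out x r)
trainable_fraction = lora_params / full_params

print(f"scaling factor: {scaling}")
print(f"frozen parameters: {full_params:,}")
print(f"LoRA parameters: {lora_params:,}")
print(f"trainable fraction for this layer: {trainable_fraction:.2%}")
```

Raising `r` increases the adapter's capacity and parameter count linearly, while `lora_alpha` changes only how strongly the update is weighted.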

### Gemma Model Configuration

The `ModelConfig` for the Gemma model includes several key parameters and customizations:

- **Model Name**: Specifies the pre-trained model to be adapted, "google/gemma-2b-it" in this case.
- **Save Name and Directory**: Defines the naming convention and location for saving the fine-tuned model.
- **Custom Parameters**: Includes model-specific settings, such as enabling trust in remote code and configuring device mapping for training.
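- **Model Initialization Wrapper**: `QLoraWrapperModelInit` is used to initialize the model following the QLoRA approach, which combines a quantized base model with LoRA adapters.
- **Quantization and PEFT Configurations**: These are applied via the `quantization_config` and `peft_config` parameters. In QLoRA, the base model's weights are quantized when the model is loaded and stay frozen, while only the LoRA adapters are trained, so this is not post-training quantization of the fine-tuned model.
- **NEFTune**: `neftune_noise_alpha=10` together with `custom_trainer_cls=NEFTuneTrainer` enables NEFTune, which adds noise to the token embeddings during training as a form of regularization.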
- **Tokenizer Modification**: A partial function is used to customize the tokenizer, adjusting sequence length, adding special tokens, and incorporating the chat template designed for our conversational context.
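
A rough estimate shows why quantization matters on a Colab-sized GPU. The sketch below assumes that `qlora_config` selects 4-bit weights, as is usual for QLoRA, and uses a hypothetical round parameter count. It covers weight storage only and ignores quantization constants, activations, adapter weights, and optimizer state.

```python
def weight_memory_gib(n_params: int, bits_per_weight: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 1024**3


n_params = 2_500_000_000  # hypothetical round figure, not an exact parameter count

bf16_gib = weight_memory_gib(n_params, 16)
int4_gib = weight_memory_gib(n_params, 4)  # assumption: 4-bit quantized weights

print(f"16-bit weights: {bf16_gib:.2f} GiB")
print(f" 4-bit weights: {int4_gib:.2f} GiB")
```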

In [None]:
lora_config = LoraConfig(
        r=64,
        lora_alpha=32,
        target_modules="all-linear",  # "query_key_value" # "Wqkv"
        lora_dropout=0.05,  # 0.1 for <13B models, 0.05 otherwise.
        bias="none",
        task_type="CAUSAL_LM"
)

In [None]:
gemma_config = ModelConfig(
    name="google/gemma-2b-it",
    save_name="gemma_2b_alpaca",
    save_dir="./gemma_2b_alpaca",
    custom_params_model={"trust_remote_code": True, "device_map": {"": 0}},
    model_init_wrap_cls=QLoraWrapperModelInit,
    quantization_config=qlora_config,
    peft_config=lora_config,
    neftune_noise_alpha=10,
    custom_trainer_cls=NEFTuneTrainer,
    func_modify_tokenizer=partial(
        modify_tokenizer,
        new_model_seq_length=4096, # lower the maximum seq length to 4096 instead of 8192 to fit in google colab GPUs.
        add_special_tokens={"pad_token": "[PAD]"}, # add pad token.
        chat_template=CHAT_TEMPLATE # add the new chat template including the system and input roles.
    )
)

## Let's Train

With our dataset and model configurations in place, we're now ready to initiate the training process. This is where the `AutoTrainer` class from the `autotransformers` library comes into play, orchestrating the entire training operation based on the specifications we've provided.

### Setting Up the AutoTrainer

The `AutoTrainer` is a comprehensive class designed to streamline the training of machine learning models, especially tailored for large language models. It accepts several parameters to control the training process:

- **Model Configurations**: A list of `ModelConfig` objects, each defining the settings and customizations for a model. For our instructional chatbot, we include the configuration for the Gemma model adapted with LoRA and quantization.
- **Dataset Configurations**: Similar to model configurations, these are specified using `DatasetConfig` objects. We pass the configuration for our pre-processed and structured `alpaca` dataset, ensuring it's utilized effectively during training.
- **Metrics Directory**: Specifies the directory where training metrics will be stored, allowing for performance monitoring and evaluation.
- **Hyperparameter Search Mode**: Set to "fixed" in our case, indicating that we're not exploring different hyperparameters but rather training with a predetermined set.
- **Clean**: A boolean flag to clean any previous runs' data, ensuring a fresh start for each training session.
- **Metrics Cleaner**: Specifies the utility for handling temporary metrics data, keeping our metrics directory tidy and focused on significant results.
- **Use Auth Token**: Enables the use of an authentication token, necessary for accessing certain models or datasets that may have access restrictions.

### Initiating the Training

With the `AutoTrainer` configured, we start training by calling the `autotrainer` object directly, as in `results = autotrainer()`. The process involves:

- Automatically loading and preparing the dataset according to our `DatasetConfig`.
- Adapting and fine-tuning the model based on the `ModelConfig`, including any specified LoRA or quantization enhancements.
- Regularly evaluating the model's performance using the provided validation set, allowing us to monitor its effectiveness in real-time.
- Saving model checkpoints and training metrics, enabling both introspection of the training process and the resumption of training from the last saved state.

Upon completion, the training results, including performance metrics and model checkpoints, are available for analysis. At that point the chatbot is ready for testing and, eventually, deployment in real-world scenarios.


In [None]:
autotrainer = AutoTrainer(
    model_configs=[gemma_config],
    dataset_configs=[alpaca_config],
    metrics_dir="./metrics_alpaca",
    hp_search_mode="fixed",
    clean=True,
    metrics_cleaner="tmp_metrics_cleaner",
    use_auth_token=True
)

In [None]:
results = autotrainer()