# Training NLP-based Models with Hugging Face

Training an NLP-based model involves several steps, including loading the data, encoding the data, defining the model architecture, and conducting the actual training process.

Archai implements abstract base classes that define the expected behavior of core components, such as datasets (`DatasetProvider`) and trainers (`TrainerBase`). It also offers ready-made implementations for the most common frameworks, such as a `DatasetProvider` compatible with `huggingface/datasets` and a `TrainerBase` compatible with `huggingface/transformers`.
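To illustrate the idea of an abstract base class that fixes the expected behavior of a provider, here is a minimal sketch in plain Python. The names `ToyDatasetProvider` and `InMemoryProvider` are hypothetical and are not part of Archai; the actual `DatasetProvider` interface may differ.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List


class ToyDatasetProvider(ABC):
    """Hypothetical base class: every provider must return a training set."""

    @abstractmethod
    def get_train_dataset(self) -> List[str]:
        """Return the training samples."""


class InMemoryProvider(ToyDatasetProvider):
    """Hypothetical concrete provider backed by a Python list."""

    def __init__(self, samples: Iterable[str]) -> None:
        self.samples = list(samples)

    def get_train_dataset(self) -> List[str]:
        return self.samples


provider = InMemoryProvider(["first sample", "second sample"])
train_samples = provider.get_train_dataset()
```

Because `get_train_dataset` is abstract, instantiating `ToyDatasetProvider` directly raises a `TypeError`, so any code that receives a provider can rely on the method being implemented.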

## Loading and Encoding the Data

When using a dataset provider, such as Hugging Face's `datasets` library, the data loading process is simplified, as the provider takes care of downloading and pre-processing the required dataset. Next, the data needs to be encoded, typically by converting text data into numerical representations that can be fed into the model. 
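As a rough picture of what "encoding" means, the sketch below maps whitespace-separated words to integer ids, truncates to a maximum length, and pads the remainder. It is a deliberate simplification: the tokenizer used in this notebook works on subword units rather than whitespace, and the helpers `build_vocab` and `encode` are hypothetical.

```python
from typing import Dict, List


def build_vocab(texts: List[str]) -> Dict[str, int]:
    """Assign an integer id to every distinct whitespace-separated token."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for text in texts:
        for token in text.split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text: str, vocab: Dict[str, int], max_length: int) -> Dict[str, List[int]]:
    """Convert text to ids, truncate to max_length, and pad on the right."""
    ids = [vocab.get(token, vocab["<unk>"]) for token in text.split()][:max_length]
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [vocab["<pad>"]] * padding,
        "attention_mask": [1] * len(ids) + [0] * padding,
    }


vocab = build_vocab(["the cat sat", "the dog ran"])
encoded = encode("the dog sat down", vocab, max_length=6)
print(encoded["input_ids"])       # [2, 5, 4, 1, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 0, 0]
```

The word "down" is not in the vocabulary, so it maps to the `<unk>` id, and the attention mask marks which positions hold real tokens rather than padding.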

This step follows the same approach as the [previous notebook](./hf_dataset_provider.ipynb):

In [1]:
from transformers import AutoTokenizer, DataCollatorForLanguageModeling
from archai.datasets.nlp.hf_dataset_provider import HfHubDatasetProvider
from archai.datasets.nlp.hf_dataset_provider_utils import tokenize_dataset

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono", model_max_length=1024)
tokenizer.pad_token = tokenizer.eos_token

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
dataset_provider = HfHubDatasetProvider(dataset="wikitext", subset="wikitext-103-raw-v1")

# When loading `train_dataset`, we will override the split argument to only load 1%
# of the data and speed up its encoding
train_dataset = dataset_provider.get_train_dataset(split="train[:1%]")
encoded_train_dataset = train_dataset.map(tokenize_dataset, batched=True, fn_kwargs={"tokenizer": tokenizer})
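With `mlm=False`, the collator prepares batches for causal language modeling. To the best of my understanding, it pads the sequences in a batch to a common length and copies `input_ids` into `labels`, replacing padding positions with `-100` so they are ignored by the loss. The plain-Python sketch below mimics that behavior; `collate_causal_lm` is a hypothetical helper, not the library's implementation, and right-side padding is an assumption.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def collate_causal_lm(examples: List[List[int]], pad_token_id: int) -> Dict[str, List[List[int]]]:
    """Pad each example to the longest length and build labels from the inputs."""
    max_len = max(len(ids) for ids in examples)
    input_ids, labels = [], []
    for ids in examples:
        padded = ids + [pad_token_id] * (max_len - len(ids))
        input_ids.append(padded)
        # Labels equal the inputs, except that pad positions are masked out.
        labels.append([IGNORE_INDEX if tok == pad_token_id else tok for tok in padded])
    return {"input_ids": input_ids, "labels": labels}


batch = collate_causal_lm([[5, 6, 7], [8, 9]], pad_token_id=0)
print(batch["input_ids"])  # [[5, 6, 7], [8, 9, 0]]
print(batch["labels"])     # [[5, 6, 7], [8, 9, -100]]
```

One consequence worth keeping in mind: because the notebook sets `pad_token` to `eos_token`, masking by pad id would also mask genuine end-of-sequence tokens in the labels.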


## Defining the Model

Once the data is encoded, we can define any NLP-based model. In this example, we will use a CodeGen architecture from `huggingface/transformers`.

In [2]:
from transformers import CodeGenConfig, CodeGenForCausalLM

config = CodeGenConfig(
    n_positions=1024,
    n_embd=768,
    n_layer=12,
    n_head=12,
    rotary_dim=16,
    bos_token_id=0,
    eos_token_id=0,
    vocab_size=50295,
)
model = CodeGenForCausalLM(config=config)
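The size of this configuration can be checked by hand. The sketch below counts parameters under a few assumptions about the CodeGen layer layout: one layer norm per block, attention projections without bias, an MLP with inner dimension `4 * n_embd`, and an output head with bias that is not tied to the embedding. Under those assumptions the total matches the `Number of trainable parameters` value reported in the training log.

```python
vocab_size, n_embd, n_layer = 50295, 768, 12
n_inner = 4 * n_embd  # assumed MLP inner dimension

embedding = vocab_size * n_embd
layer_norm = 2 * n_embd  # weight and bias

# Attention: fused query/key/value projection plus output projection, no bias (assumed).
attention = n_embd * (3 * n_embd) + n_embd * n_embd

# MLP: two linear layers with bias.
mlp = (n_embd * n_inner + n_inner) + (n_inner * n_embd + n_embd)

block = layer_norm + attention + mlp

# Output head: linear layer with bias, assumed not tied to the embedding.
lm_head = n_embd * vocab_size + vocab_size

total = embedding + n_layer * block + layer_norm + lm_head
print(total)  # 162304119
```

Roughly half of the parameters sit in the embedding and output head, which is typical for a small model with a large vocabulary.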

## Running the Trainer

The final step is to use the Hugging Face trainer abstraction (`HfTrainer`) to run training: an optimizer updates the model's parameters to minimize the loss computed on batches of training data. In a full run, this is repeated until the model reaches a satisfactory loss or the step budget is exhausted. Here, `max_steps=1` limits training to a single optimization step so the example finishes quickly.
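The loop described above can be sketched on a toy problem in plain Python. This is not how `HfTrainer` is implemented: it uses plain gradient descent on a single scalar weight instead of the AdamW optimizer that `TrainingArguments` selects by default, and the helpers `linear_lr` and `train` are hypothetical.

```python
from typing import List, Tuple


def linear_lr(base_lr: float, step: int, total_steps: int) -> float:
    """Linearly decay the learning rate from base_lr to 0, without warmup."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)


def train(
    xs: List[float],
    ys: List[float],
    base_lr: float = 0.05,
    weight_decay: float = 0.0,
    total_steps: int = 200,
) -> Tuple[float, List[float]]:
    """Fit y = w * x by minimizing the mean squared error."""
    w = 0.0
    losses = []
    n = len(xs)
    for step in range(total_steps):
        preds = [w * x for x in xs]
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n
        grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        lr = linear_lr(base_lr, step, total_steps)
        w -= lr * weight_decay * w  # decoupled weight decay
        w -= lr * grad              # gradient step
        losses.append(loss)
    return w, losses


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x for x in xs]
w, losses = train(xs, ys)
print(round(w, 4))  # 2.0
```

Assuming the trainer uses a linear schedule without warmup, `linear_lr(0.01, 1, 1)` evaluates to `0.0`, which is consistent with the `'learning_rate': 0.0` entry logged after the single step in this notebook.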

In [3]:
from transformers import TrainingArguments
from archai.trainers.nlp.hf_trainer import HfTrainer

training_args = TrainingArguments(
    "hf-codegen",
    evaluation_strategy="no",
    logging_steps=1,
    per_device_train_batch_size=1,
    learning_rate=0.01,
    weight_decay=0.1,
    max_steps=1,
)
trainer = HfTrainer(
    model=model,
    args=training_args,
    data_collator=collator,
    train_dataset=encoded_train_dataset,
)

trainer.train()

max_steps is given, it will override any value given in num_train_epochs
The following columns in the training set don't have a corresponding argument in `CodeGenForCausalLM.forward` and have been ignored: text. If text are not expected by `CodeGenForCausalLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 18014
  Num Epochs = 1
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 1
  Number of trainable parameters = 162304119


You're using a CodeGenTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.

Training completed. Do not forget to share your model on huggingface.co/models =)




{'loss': 11.0746, 'learning_rate': 0.0, 'epoch': 0.0}
{'train_runtime': 13.891, 'train_samples_per_second': 0.072, 'train_steps_per_second': 0.072, 'train_loss': 11.074588775634766, 'epoch': 0.0}


TrainOutput(global_step=1, training_loss=11.074588775634766, metrics={'train_runtime': 13.891, 'train_samples_per_second': 0.072, 'train_steps_per_second': 0.072, 'train_loss': 11.074588775634766, 'epoch': 0.0})
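Two numbers in this output can be sanity-checked with simple arithmetic. A model that assigned equal probability to every token would score a cross-entropy of `ln(vocab_size)`, so a loss near that value is what one would expect from a randomly initialized model after a single step. The throughput figure follows from the runtime and the single sample processed. The sketch assumes a single device, as the log's total batch size of 1 suggests.

```python
import math

vocab_size = 50295

# Cross-entropy of a uniform prediction over the vocabulary (about 10.83).
uniform_loss = math.log(vocab_size)

# One optimization step with a batch size of 1 processes a single sample.
train_runtime = 13.891
num_samples = 1 * 1  # max_steps * per_device_train_batch_size
samples_per_second = num_samples / train_runtime

print(round(samples_per_second, 3))  # 0.072
```

The reported loss of 11.0746 is slightly above the uniform baseline, which is plausible for untrained weights; a meaningful drop would require many more steps than this example runs.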