# Training NLP-based Models with Hugging Face

In this notebook, we train an NLP model using a pre-defined architecture from `huggingface/transformers`.

Archai implements abstract base classes that define the expected behavior of its core components, such as datasets (`DatasetProvider`) and trainers (`TrainerBase`). Additionally, we offer boilerplate classes for the most common frameworks, such as a `DatasetProvider` compatible with `huggingface/datasets` and a `TrainerBase` compatible with `huggingface/transformers`.
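
To illustrate the pattern, here is a minimal, self-contained sketch of an abstract base class and one concrete subclass. The names `SketchDatasetProvider` and `InMemoryDatasetProvider` are hypothetical and are not Archai's actual classes; only the method names `get_train_dataset` and `get_val_dataset` mirror the calls used later in this notebook.

```python
from abc import ABC, abstractmethod
from typing import List


class SketchDatasetProvider(ABC):
    """Hypothetical stand-in for a dataset-provider interface (not Archai's class)."""

    @abstractmethod
    def get_train_dataset(self) -> List[str]:
        """Return the training split."""

    @abstractmethod
    def get_val_dataset(self) -> List[str]:
        """Return the validation split."""


class InMemoryDatasetProvider(SketchDatasetProvider):
    """Toy provider that serves a fixed list of sentences."""

    def __init__(self, samples: List[str], n_val: int = 1) -> None:
        # Hold out the last `n_val` samples for validation.
        self._train = samples[:-n_val]
        self._val = samples[-n_val:]

    def get_train_dataset(self) -> List[str]:
        return self._train

    def get_val_dataset(self) -> List[str]:
        return self._val


provider = InMemoryDatasetProvider(["a b c", "d e", "f g h i"])
train_split = provider.get_train_dataset()
val_split = provider.get_val_dataset()
```

Because the base class declares its methods as abstract, Python refuses to instantiate it directly, so any code written against the interface can rely on every provider exposing both splits.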

## Loading and Encoding the Data

In [None]:
from transformers import AutoTokenizer, DataCollatorForLanguageModeling
from archai.datasets.nlp.hf_dataset_provider import HfHubDatasetProvider
from archai.datasets.nlp.hf_dataset_provider_utils import tokenize_dataset

# Load the CodeGen tokenizer and cap sequences at 1024 tokens.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono", model_max_length=1024)
# The tokenizer has no padding token by default, so reuse the end-of-sequence token.
tokenizer.pad_token = tokenizer.eos_token

# mlm=False selects causal (next-token) language modeling instead of masked language modeling.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
dataset_provider = HfHubDatasetProvider(dataset="wikitext", subset="wikitext-103-raw-v1")

train_dataset = dataset_provider.get_train_dataset()
eval_dataset = dataset_provider.get_val_dataset()

# Tokenize both splits in batches; fn_kwargs forwards the tokenizer to tokenize_dataset.
encoded_train_dataset = train_dataset.map(tokenize_dataset, batched=True, fn_kwargs={"tokenizer": tokenizer})
encoded_eval_dataset = eval_dataset.map(tokenize_dataset, batched=True, fn_kwargs={"tokenizer": tokenizer})
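
With `batched=True`, `map` passes the function a dictionary that maps each column name to a list of values and expects a dictionary of lists in return. The sketch below shows only that input/output shape; it is not the implementation of `tokenize_dataset`, and `whitespace_tokenizer` and `toy_tokenize_batch` are hypothetical helpers that avoid downloading a real tokenizer.

```python
from typing import Callable, Dict, List


def whitespace_tokenizer(text: str) -> List[int]:
    """Hypothetical toy tokenizer: maps each whitespace-separated word to its length."""
    return [len(word) for word in text.split()]


def toy_tokenize_batch(
    examples: Dict[str, List[str]],
    tokenizer: Callable[[str], List[int]],
    max_length: int = 8,
) -> Dict[str, List[List[int]]]:
    """Tokenize a batch shaped like the dict that `map(batched=True)` passes in."""
    # Truncate every sequence to at most `max_length` token ids.
    input_ids = [tokenizer(text)[:max_length] for text in examples["text"]]
    # Without padding, every position is a real token, so the mask is all ones.
    attention_mask = [[1] * len(ids) for ids in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = {"text": ["hello world", "a bc def"]}
encoded = toy_tokenize_batch(batch, tokenizer=whitespace_tokenizer)
```

Extra keyword arguments such as `tokenizer` reach the function through `fn_kwargs`, which is how the cell above hands the real tokenizer to `tokenize_dataset`.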

## Defining the Model

In [None]:
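from transformers import CodeGenConfig, CodeGenForCausalLM

# Build a CodeGen configuration from scratch; no pretrained weights are loaded.
config = CodeGenConfig(
    n_positions=1024,  # maximum sequence length
    n_embd=768,  # hidden size
    n_layer=12,  # number of transformer blocks
    n_head=12,  # attention heads per block
    rotary_dim=16,  # dimensions per head that receive rotary position embeddings
    bos_token_id=0,
    eos_token_id=0,
    vocab_size=50295,
)
# Instantiating from a config yields a randomly initialized model.
model = CodeGenForCausalLM(config=config)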
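
The attention settings in the configuration are related: the hidden size is split evenly across the heads, and rotary embeddings cover the first `rotary_dim` dimensions of each head. A minimal sketch of that arithmetic, using a hypothetical helper `check_attention_dims` that is not part of `transformers`:

```python
from typing import Tuple


def check_attention_dims(n_embd: int, n_head: int, rotary_dim: int) -> Tuple[int, float]:
    """Return the per-head dimension and the fraction of it covered by rotary embeddings."""
    # The hidden size must divide evenly across the attention heads.
    if n_embd % n_head != 0:
        raise ValueError("n_embd must be divisible by n_head")
    head_dim = n_embd // n_head
    # Rotary embeddings cannot span more dimensions than a single head has.
    if rotary_dim > head_dim:
        raise ValueError("rotary_dim must not exceed the per-head dimension")
    return head_dim, rotary_dim / head_dim


# Values copied from the configuration in this section.
head_dim, rotary_fraction = check_attention_dims(n_embd=768, n_head=12, rotary_dim=16)
```

For the values used here, each head has 64 dimensions and a quarter of them are rotated.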

## Running the Trainer
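
The trainer cell in this section reads its settings from an `args` object. One way to provide it in a notebook, sketched here under assumptions, is an `argparse.Namespace` built by hand. Every value below is a hypothetical placeholder rather than a recommended or verified setting; in particular, the accepted strings for `dataset_name`, `vocab_type`, `strategy`, and `optim` should be checked against the Archai documentation.

```python
from argparse import Namespace

# Hypothetical placeholder values; the field names match those read by the trainer cell.
args = Namespace(
    seed=42,
    no_cuda=False,
    logging_steps=10,
    eval_steps=100,
    dataset_name="wt103",  # assumed identifier, verify before use
    vocab_type="gpt2",  # assumed identifier, verify before use
    vocab_size=50295,
    global_batch_size=8,
    seq_len=192,
    strategy="ddp",  # assumed identifier, verify before use
    max_steps=100,
    gradient_accumulation_steps=1,
    optim="adam",  # assumed identifier, verify before use
)
```

In a script, the same namespace would typically come from `argparse.ArgumentParser().parse_args()` so the settings can be changed from the command line.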

In [None]:
from archai.trainers.nlp.nvidia_trainer import NvidiaTrainer
from archai.trainers.nlp.nvidia_training_args import NvidiaTrainingArguments

# `args` is expected to be a namespace of settings (for example, parsed with argparse)
# that exists before this cell runs.
training_args = NvidiaTrainingArguments(
    "nvidia-gpt2",
    seed=args.seed,
    no_cuda=args.no_cuda,
    logging_steps=args.logging_steps,
    eval_steps=args.eval_steps,
    dataset_name=args.dataset_name,
    vocab_type=args.vocab_type,
    vocab_size=args.vocab_size,
    global_batch_size=args.global_batch_size,
    seq_len=args.seq_len,
    strategy=args.strategy,
    max_steps=args.max_steps,
    gradient_accumulation_steps=args.gradient_accumulation_steps,
    optim=args.optim,
)
trainer = NvidiaTrainer(model=model, args=training_args)

# Start training with the arguments configured above.
trainer.train()