# 🤗 Pretraining and finetuning with Hugging Face models

Want to pretrain and finetune a Hugging Face model with Composer? No problem. Here, we'll walk through using Composer to pretrain and finetune a Hugging Face BERT model.

### Recommended Background

This tutorial assumes you are familiar with transformer models for NLP and with Hugging Face.

To better understand the Composer part, make sure you're comfortable with the material in our [Getting Started][getting_started] tutorial.

### Tutorial Goals and Concepts Covered

The goal of this tutorial is to demonstrate how to pretrain and finetune a Hugging Face transformer using the Composer library!

We will focus on pretraining and finetuning a small version of BERT to show how it is done, and then you can scale up to get better performance!

Along the way, we will touch on:

* Creating our Hugging Face BERT model, tokenizer, and data loaders
* Wrapping the Hugging Face model as a `ComposerModel` for use with the Composer trainer
* Training with Composer

Let's do this 🚀

[getting_started]: https://docs.mosaicml.com/en/stable/examples/getting_started.html

## Install Composer

To use Hugging Face with Composer, we'll need to install Composer *with the NLP dependencies*. If you haven't already, run: 

In [None]:
%pip install 'mosaicml[nlp]'

## Import Hugging Face Model
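First, we import a BERT model and its associated tokenizer from the `transformers` library. We use BERT-tiny (`prajjwal1/bert-tiny`), a much smaller version of BERT, to keep training fast in this notebook. Because the model is built from its config with `from_config`, it starts from randomly initialized weights, which is what we want for pretraining.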

In [None]:
import transformers

# Create a BERT masked language modeling model using Hugging Face transformers
config = transformers.AutoConfig.from_pretrained('prajjwal1/bert-tiny')
model = transformers.AutoModelForMaskedLM.from_config(config)
tokenizer = transformers.AutoTokenizer.from_pretrained('prajjwal1/bert-tiny') 

## Creating Dataloaders

Next, we download the WikiText-2 dataset, tokenize it, and build dataloaders whose collator applies masked language modeling masking to each batch.

In [None]:
import datasets
from torch.utils.data import DataLoader

wikitext_dataset = datasets.load_dataset('wikitext', 'wikitext-2-v1')
train_dataset = wikitext_dataset['train']
eval_dataset = wikitext_dataset['validation']

padding = "max_length"
text_column_name = "text"

def tokenize_function(examples):
    # Remove empty lines
    examples[text_column_name] = [
        line for line in examples[text_column_name] if len(line) > 0 and not line.isspace()
    ]
    return tokenizer(
        examples[text_column_name],
        padding=padding,
        truncation=True,
        max_length=256,
        # We use this option because DataCollatorForLanguageModeling (see below) is more efficient when it
        # receives the `special_tokens_mask`.
        return_special_tokens_mask=True,
    )

tokenized_train = train_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=[text_column_name],
    load_from_cache_file=False,
)
tokenized_eval = eval_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=[text_column_name],
    load_from_cache_file=False,
)

collator = transformers.DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

train_dataloader = DataLoader(tokenized_train, batch_size=16, collate_fn=collator)
eval_dataloader = DataLoader(tokenized_eval, batch_size=16, collate_fn=collator)
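
To make the collator's role more concrete, here is a minimal standard-library sketch of BERT-style masking for a single sequence. It is not the `DataCollatorForLanguageModeling` implementation: the function name `mask_tokens`, the token ids, and the vocabulary size are illustrative assumptions, and the 80/10/10 split mirrors the collator's default behavior. It also shows why the tokenizer is asked for the `special_tokens_mask`: special tokens are never selected for masking.

```python
import random

IGNORE_INDEX = -100  # label value that the loss and the metrics skip


def mask_tokens(input_ids, special_tokens_mask, mask_id, vocab_size,
                mlm_probability=0.15, rng=None):
    """Return (masked_inputs, labels) for one sequence of token ids (sketch)."""
    if rng is None:
        rng = random.Random()
    masked, labels = list(input_ids), []
    for i, (token, is_special) in enumerate(zip(input_ids, special_tokens_mask)):
        # Special tokens (e.g. [CLS], [SEP], [PAD]) are never selected.
        if is_special or rng.random() >= mlm_probability:
            labels.append(IGNORE_INDEX)
            continue
        labels.append(token)  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return masked, labels


# Illustrative ids only: 101/102/0 stand in for [CLS]/[SEP]/[PAD], 103 for [MASK].
ids = [101, 7592, 2088, 2003, 2307, 102, 0, 0]
special = [1, 0, 0, 0, 0, 1, 1, 1]
masked, labels = mask_tokens(ids, special, mask_id=103, vocab_size=30522,
                             rng=random.Random(0))
```

Every position that was not selected gets the label `-100`, which is the same value passed as `ignore_index` to the metrics later in this tutorial.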

## Convert model to `ComposerModel`

Composer uses `HuggingFaceModel` as a convenient interface for wrapping a Hugging Face model (such as the one we created above) in a `ComposerModel`. Its parameters are:

- `model`: The Hugging Face model to wrap.
- `tokenizer`: The Hugging Face tokenizer used to create the input data.
- `metrics`: A list of torchmetrics to apply to the output of `eval_forward` (a `ComposerModel` method).
- `use_logits`: A boolean which, if True, flags that the model's output logits should be used to calculate validation metrics.

See the [API Reference][api] for additional details.

[api]: https://docs.mosaicml.com/en/stable/api_reference/generated/composer.models.HuggingFaceModel.html

In [None]:
from composer.metrics.nlp import LanguageCrossEntropy, MaskedAccuracy
from composer.models.huggingface import HuggingFaceModel

metrics = [
    LanguageCrossEntropy(ignore_index=-100, vocab_size=model.config.vocab_size),
    MaskedAccuracy(ignore_index=-100)
]
# Package as a trainer-friendly Composer model
composer_model = HuggingFaceModel(model, tokenizer=tokenizer, metrics=metrics, use_logits=True)
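
Both metrics are created with `ignore_index=-100`, so only the positions that were selected for masking count toward the score. The following standard-library sketch illustrates that idea on plain lists. It is not Composer's `MaskedAccuracy` implementation, and the function name and example values are assumptions made for illustration.

```python
IGNORE_INDEX = -100


def masked_accuracy(predictions, labels, ignore_index=IGNORE_INDEX):
    """Fraction of correct predictions over positions whose label is not ignored."""
    scored = [(p, y) for p, y in zip(predictions, labels) if y != ignore_index]
    if not scored:
        return 0.0  # nothing to score in this batch
    return sum(p == y for p, y in scored) / len(scored)


# Hypothetical predicted token ids and labels for a five-token sequence.
predictions = [5, 9, 7, 3, 2]
labels = [-100, 9, -100, 4, 2]
accuracy = masked_accuracy(predictions, labels)  # 2 of the 3 scored positions match
```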

## Optimizers and Learning Rate Schedulers

The last setup step is to create an optimizer and a learning rate scheduler. We will use Composer's `DecoupledAdamW` optimizer and a linear learning rate scheduler with warmup.

In [None]:
from composer.optim import DecoupledAdamW, LinearWithWarmupScheduler

optimizer = DecoupledAdamW(model.parameters(), lr=5.0e-4, betas=[0.9, 0.98], eps=1.0e-06, weight_decay=1.0e-5)
lr_scheduler = LinearWithWarmupScheduler(t_warmup='0.06dur', alpha_f=0.02)
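
To get a feel for what `t_warmup='0.06dur'` and `alpha_f=0.02` do, here is a rough standard-library sketch of the learning rate multiplier over training. It assumes the multiplier rises linearly from 0 to 1 during the first 6% of training and then decays linearly to `alpha_f` by the end. The function name and the step counts are hypothetical, and the real scheduler works on Composer time objects rather than plain integers.

```python
def linear_warmup_multiplier(step, total_steps, warmup_frac=0.06, alpha_f=0.02):
    """Learning rate multiplier at `step` out of `total_steps` (sketch)."""
    warmup_steps = warmup_frac * total_steps
    if step < warmup_steps:
        return step / warmup_steps  # linear warmup from 0 to 1
    remaining = total_steps - warmup_steps
    frac = min(1.0, (step - warmup_steps) / remaining) if remaining > 0 else 0.0
    return 1.0 + frac * (alpha_f - 1.0)  # linear decay from 1 to alpha_f


base_lr = 5.0e-4
# Learning rate at a few points of a hypothetical 100-step run.
schedule = [base_lr * linear_warmup_multiplier(s, total_steps=100)
            for s in (0, 3, 6, 53, 100)]
```

The learning rate peaks at `base_lr` when warmup ends and finishes at `alpha_f * base_lr`.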

## Composer Trainer

We will now specify a Composer `Trainer` object and run our training! `Trainer` has many arguments that are described in our [documentation](https://docs.mosaicml.com/en/stable/api_reference/generated/composer.Trainer.html#trainer), so we'll discuss only the less-obvious arguments used below:

- `max_duration` - a string specifying how long to train. This can be in terms of batches (e.g., `'10ba'` is 10 batches) or epochs (e.g., `'1ep'` is 1 epoch), [among other options][time].
- `save_folder` - a string specifying where to save checkpoints to.
- `schedulers` - a (list of) PyTorch or Composer learning rate scheduler(s) that will be composed together.
- `device` - specifies if the training will be done on CPU or GPU by using `'cpu'` or `'gpu'`, respectively. You can omit this to automatically train on GPUs if they're available and fall back to the CPU if not.
- `train_subset_num_batches` - specifies the number of training batches to use for each epoch. This is not a necessary argument but is useful for quickly testing code.
- `precision` - whether to do the training in full precision (`'fp32'`) or mixed precision (`'amp'`). Mixed precision can provide a ~2x training speedup on recent NVIDIA GPUs.
- `seed` - sets the random seed for the training run, so the results are reproducible!

[time]: https://docs.mosaicml.com/en/stable/trainer/time.html
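
The time strings used in this tutorial all follow the same pattern: a number followed by a unit suffix. As a rough illustration, here is a standard-library sketch that splits such a string into its two parts. It is not Composer's own parser, it handles only the three units that appear in this tutorial, and the function name is an assumption.

```python
import re

# Only the units used in this tutorial; Composer supports others as well.
UNITS = {'ep': 'epochs', 'ba': 'batches', 'dur': 'fraction of training'}


def parse_time_string(value):
    """Split a string such as '10ba' or '0.06dur' into (amount, unit name)."""
    match = re.fullmatch(r'(\d+(?:\.\d+)?)(ep|ba|dur)', value.strip())
    if match is None:
        raise ValueError(f'Unrecognized time string: {value!r}')
    number, unit = match.groups()
    if unit == 'dur':
        return float(number), UNITS[unit]  # a fraction of the whole run
    if '.' in number:
        raise ValueError(f'Expected a whole number of {UNITS[unit]}: {value!r}')
    return int(number), UNITS[unit]


parsed = [parse_time_string(s) for s in ('10ba', '1ep', '0.06dur')]
```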

In [None]:
import torch
from composer import Trainer

# Create Trainer Object
trainer = Trainer(
    model=composer_model, # This is the model from the HuggingFaceModel wrapper class.
    train_dataloader=train_dataloader,
    eval_dataloader=eval_dataloader,
    max_duration='1ep',
    save_folder='checkpoints/pretraining',
    optimizers=optimizer,
    schedulers=[lr_scheduler],
    device='gpu' if torch.cuda.is_available() else 'cpu',
    train_subset_num_batches=10,
    eval_subset_num_batches=10,
    precision='fp32',
    seed=17,
)
# Start training
trainer.fit()

## Loading the pretrained model for finetuning

Now that we have a pretrained BERT model, we will load it and finetune it on a sequence classification task. Composer provides utilities to reload a Hugging Face model and tokenizer from a Composer checkpoint and to add a task-specific head to the model so that it can be finetuned for a new task.

In [None]:
from torchmetrics import Accuracy
from composer.metrics import CrossEntropy

# delete our previous model and tokenizer because we will be loading them in from our saved checkpoint
del model, tokenizer

# Note: this does not load the weights, just the right model class. The weights will be loaded by the Composer trainer
model, tokenizer = HuggingFaceModel.hf_from_composer_checkpoint(
    'checkpoints/pretraining/latest-rank0.pt',
    model_instantiation_class='transformers.AutoModelForSequenceClassification',
    model_config_kwargs={'num_labels': 2})

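# Note: torchmetrics 0.11 and later require a `task` argument here, e.g.
# Accuracy(task='multiclass', num_classes=2); the metric's name in the trainer state may then differ.
metrics = [CrossEntropy(), Accuracy()]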
composer_model = HuggingFaceModel(model, tokenizer=tokenizer, metrics=metrics, use_logits=True)
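
Swapping the masked language modeling head for a classification head means the pretraining checkpoint and the new model no longer have identical parameter names. The standard-library sketch below illustrates what a non-strict weight load has to tolerate. The helper name is hypothetical and the key names are illustrative examples of BERT parameter names, not a dump of the actual checkpoint (whose keys may carry additional prefixes).

```python
def plan_weight_loading(checkpoint_keys, model_keys):
    """Group parameter names by how a non-strict load would treat them (sketch)."""
    checkpoint_keys, model_keys = set(checkpoint_keys), set(model_keys)
    return {
        'loaded': sorted(checkpoint_keys & model_keys),      # copied from the checkpoint
        'unexpected': sorted(checkpoint_keys - model_keys),  # old head, dropped
        'missing': sorted(model_keys - checkpoint_keys),     # new head, freshly initialized
    }


# Illustrative parameter names only.
pretrained_keys = ['bert.embeddings.word_embeddings.weight', 'cls.predictions.bias']
classifier_keys = ['bert.embeddings.word_embeddings.weight',
                   'classifier.weight', 'classifier.bias']
plan = plan_weight_loading(pretrained_keys, classifier_keys)
```

The shared encoder weights carry over from pretraining, while the classification head starts from scratch and is learned during finetuning.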

The next part should look very familiar if you have already gone through the (TODO: link) tutorial, as it is exactly the same except we are starting from our own pretrained model instead of BERT base!

We will finetune on the Stanford Sentiment Treebank v2 (SST-2) dataset. First, we download and tokenize the dataset and create the PyTorch `DataLoader`s.

In [None]:
import datasets
from multiprocessing import cpu_count

# Tokenize a batch of SST-2 sentences with the BERT tokenizer loaded above
def tokenize_function(sample):
    return tokenizer(
        text=sample['sentence'],
        padding="max_length",
        max_length=256,
        truncation=True
    )

# Tokenize SST-2
sst2_dataset = datasets.load_dataset("glue", "sst2")
tokenized_sst2_dataset = sst2_dataset.map(tokenize_function,
                                          batched=True, 
                                          num_proc=cpu_count(),
                                          batch_size=100,
                                          remove_columns=['idx', 'sentence'])

# Split dataset into train and validation sets
train_dataset = tokenized_sst2_dataset["train"]
eval_dataset = tokenized_sst2_dataset["validation"]

from torch.utils.data import DataLoader

# The default collator stacks the already-padded examples into tensors
data_collator = transformers.default_data_collator
train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=False, drop_last=False, collate_fn=data_collator)
eval_dataloader = DataLoader(eval_dataset, batch_size=16, shuffle=False, drop_last=False, collate_fn=data_collator)
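
Because every example is tokenized with `padding="max_length"` and `truncation=True`, all sequences end up with the same length, which is why the default collator can simply stack them. Here is a simplified standard-library sketch of that step. The function name and token ids are assumptions, and a real tokenizer also keeps the closing special token when it truncates, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly `max_length` long (sketch)."""
    ids = list(token_ids[:max_length])      # truncation: drop tokens past max_length
    attention_mask = [1] * len(ids)         # 1 marks real tokens
    padding = max_length - len(ids)
    # padding: fill the rest with pad_id and mark it with 0 in the mask
    return ids + [pad_id] * padding, attention_mask + [0] * padding


# Hypothetical token ids for a short and a long sentence.
short_ids, short_mask = pad_or_truncate([101, 2307, 3185, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)
```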

Next we will create our optimizer and learning rate scheduler for the finetuning task.

In [None]:
from composer.optim import DecoupledAdamW, LinearWithWarmupScheduler

optimizer = DecoupledAdamW(model.parameters(), lr=3.0e-5, betas=[0.9, 0.98], eps=1.0e-06, weight_decay=3.0e-6)
lr_scheduler = LinearWithWarmupScheduler(t_warmup='0.06dur', alpha_f=0.02)

Lastly we can make our finetuning trainer and train!

In [None]:
import torch
from composer import Trainer

# Create Trainer Object
trainer = Trainer(
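    model=composer_model, # This is the model from the HuggingFaceModel wrapper class.
    # Load the pretrained weights saved above. Strict loading is disabled because the
    # checkpoint's masked language modeling head is replaced by a new classification head.
    load_path='checkpoints/pretraining/latest-rank0.pt',
    load_weights_only=True,
    load_strict_model_weights=False,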
    train_dataloader=train_dataloader,
    eval_dataloader=eval_dataloader,
    max_duration="1ep",
    save_folder='checkpoints/finetuning',
    optimizers=optimizer,
    schedulers=[lr_scheduler],
    device='gpu' if torch.cuda.is_available() else 'cpu',
    train_subset_num_batches=150,
    precision='fp32',
    seed=17
)
# Start training
trainer.fit()

In [None]:
trainer.state.eval_metrics['eval']['Accuracy'].compute()

In [None]:
type(model)

In [None]:
!rm checkpoints/pretraining/ep1-ba10-rank0.pt
