# 🤗 Train a BERT Model From Scratch

## **Module 2: Pre-Training**

Here, we'll walk through using Composer to load a Hugging Face BERT model and pre-train it on the Colossal Clean Crawled Corpus (C4) dataset, which is derived from [Common Crawl](https://commoncrawl.org/).

### Recommended Background

This tutorial assumes you are familiar with transformer models for NLP and with Hugging Face.

To better understand the Composer part, make sure you're comfortable with the material in our [Getting Started][getting_started] tutorial.

### Tutorial Goals and Concepts Covered

The goal of this tutorial is to demonstrate how to pre-train a Hugging Face transformer using the Composer library!

In the next module, we will walk through a [tutorial][huggingface_models] on fine-tuning a pretrained BERT-base model. After both the pre-training and fine-tuning, the BERT model should be able to determine if a sentence has positive or negative sentiment.

Along the way, we will touch on:

* Creating our Hugging Face BERT model, tokenizer, and data loaders
* Wrapping the Hugging Face model as a `ComposerModel` for use with the Composer trainer
* Training with Composer
* Visualization examples

Let's do this 🚀

[getting_started]: https://docs.mosaicml.com/en/stable/examples/getting_started.html
[huggingface_models]: https://docs.mosaicml.com/en/stable/examples/huggingface_models.html

## Install Composer

To use Hugging Face with Composer, we'll need to install Composer *with the NLP dependencies*. If you haven't already, run: 

In [None]:
%pip install -U pip
%pip install -U 'mosaicml[nlp, streaming]==0.10.1'
# To install from source instead of the last release, comment out the command above and uncomment the following one.
# %pip install 'mosaicml[nlp, tensorboard] @ git+https://github.com/mosaicml/composer.git'

## Import Hugging Face Model
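First, we create a BERT model (specifically, BERT-base for uncased text) from its default configuration and load the associated tokenizer from the transformers library. Because the model is built with `from_config`, its weights are randomly initialized rather than pre-trained, which is what we want when training from scratch.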

In [None]:
import transformers
from transformers.models.bert.configuration_bert import BertConfig

# Create a randomly initialized BERT pre-training model using Hugging Face transformers
config = BertConfig()
model = transformers.AutoModelForPreTraining.from_config(config)
tokenizer = transformers.AutoTokenizer.from_pretrained('bert-base-uncased')
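
To get a feel for the size of the model we just created, here is a small back-of-the-envelope sketch that counts the parameters of a BERT-base encoder. The hyperparameter values are the `BertConfig()` defaults as we recall them (an assumption worth checking against your installed transformers version), and the helper names are hypothetical. The count covers the embeddings, the encoder layers, and the pooler only; the pre-training heads that `AutoModelForPreTraining` adds on top are not included.

```python
# Default BertConfig() values, as recalled from the transformers library (assumption).
VOCAB_SIZE = 30522
HIDDEN = 768
LAYERS = 12
INTERMEDIATE = 3072
MAX_POSITIONS = 512
TYPE_VOCAB = 2


def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm_params(width: int) -> int:
    """Scale and shift vectors of a LayerNorm."""
    return 2 * width


def bert_encoder_param_count() -> int:
    """Count parameters in the embeddings, encoder layers, and pooler."""
    embeddings = (
        VOCAB_SIZE * HIDDEN        # word embeddings
        + MAX_POSITIONS * HIDDEN   # position embeddings
        + TYPE_VOCAB * HIDDEN      # segment (token type) embeddings
        + layer_norm_params(HIDDEN)
    )
    # Query, key, value, and output projections, followed by a LayerNorm.
    attention = 4 * linear_params(HIDDEN, HIDDEN) + layer_norm_params(HIDDEN)
    # Two dense layers (expand, then project back), followed by a LayerNorm.
    feed_forward = (
        linear_params(HIDDEN, INTERMEDIATE)
        + linear_params(INTERMEDIATE, HIDDEN)
        + layer_norm_params(HIDDEN)
    )
    pooler = linear_params(HIDDEN, HIDDEN)
    return embeddings + LAYERS * (attention + feed_forward) + pooler


print(f"{bert_encoder_param_count():,} parameters")
```

Under these assumptions the encoder comes out at roughly 110 million parameters, which is why a full pre-training run is far too slow for a notebook session.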

## Creating Dataloaders

Next, we will download and tokenize the C4 dataset.

In [None]:
from composer.datasets import StreamingC4
from multiprocessing import cpu_count

# Stream and tokenize the C4 dataset.
# Note: both the training and evaluation datasets below read the 'val' split.
train_dataset = StreamingC4(
    remote='s3://mosaicml-internal-temporary-202210-ocwdemo/mds/1-gz',
    local='/tmp/c4local',
    shuffle=True,
    max_seq_len=128,
    split='val',
    tokenizer_name='bert-base-uncased',
)
eval_dataset = StreamingC4(
    remote='s3://mosaicml-internal-temporary-202210-ocwdemo/mds/1-gz',
    local='/tmp/c4local',
    shuffle=True,
    max_seq_len=128,
    split='val',
    tokenizer_name='bert-base-uncased',
)

Here, we will create a PyTorch `DataLoader` for each of the datasets generated in the previous block.

In [None]:
from torch.utils.data import DataLoader
data_collator = transformers.DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)
# data_collator = transformers.DefaultDataCollator(return_tensors='pt')
train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=False, drop_last=False, collate_fn=data_collator)
eval_dataloader = DataLoader(eval_dataset, batch_size=16, shuffle=False, drop_last=False, collate_fn=data_collator)
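
The collator above is what turns plain token sequences into masked language modeling examples. As a rough illustration of the idea, here is a pure-Python sketch of BERT-style masking. It is not the Hugging Face implementation: `mask_tokens` is a hypothetical helper, the 15% selection rate and the 80/10/10 replacement split reflect our understanding of the collator's defaults, and, unlike the real collator, this sketch makes no attempt to skip special tokens such as `[CLS]` and `[SEP]`.

```python
import random

MASK_ID = 103         # [MASK] token id in the bert-base-uncased vocabulary
VOCAB_SIZE = 30522    # size of the bert-base-uncased vocabulary
IGNORE_INDEX = -100   # label value that the loss and metrics skip


def mask_tokens(input_ids, mlm_probability=0.15, seed=0):
    """Return (masked_ids, labels) for one sequence of token ids."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, token in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            continue                      # token not selected: no prediction target
        labels[i] = token                 # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_ID           # most selected tokens become [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(VOCAB_SIZE)  # some become a random token
        # otherwise the token is left unchanged but is still a target
    return masked, labels


token_ids = list(range(1000, 1020))
masked_ids, labels = mask_tokens(token_ids, seed=1)
print(masked_ids)
print(labels)
```

Every position that was not selected carries the label `-100`, which is why the metrics in the next section are told to ignore that index.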

## Convert model to `ComposerModel`

Composer uses `HuggingFaceModel` as a convenient interface for wrapping a Hugging Face model (such as the one we created above) in a `ComposerModel`. Its parameters are:

- `model`: The Hugging Face model to wrap.
- `metrics`: A list of torchmetrics to apply to the output of `validate` (a `ComposerModel` method).
- `use_logits`: A boolean which, if True, flags that the model's output logits should be used to calculate validation metrics.

See the [API Reference][api] for additional details.

[api]: https://docs.mosaicml.com/en/stable/api_reference/generated/composer.models.HuggingFaceModel.html

In [None]:
from torchmetrics.collections import MetricCollection
from composer.models.huggingface import HuggingFaceModel
from composer.metrics import LanguageCrossEntropy, MaskedAccuracy

metrics = [LanguageCrossEntropy(vocab_size=tokenizer.vocab_size), MaskedAccuracy(ignore_index=-100)]
# Package as a trainer-friendly Composer model
composer_model = HuggingFaceModel(model, metrics=metrics, use_logits=True)
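
To make the role of `ignore_index=-100` concrete, here is a minimal sketch of an accuracy computation that only scores masked positions. This is an illustration of the idea, not Composer's `MaskedAccuracy` implementation, and `masked_accuracy` is a hypothetical helper operating on plain Python lists.

```python
def masked_accuracy(predictions, labels, ignore_index=-100):
    """Fraction of correct predictions among positions whose label is not ignore_index."""
    scored = [(p, y) for p, y in zip(predictions, labels) if y != ignore_index]
    if not scored:
        return 0.0  # nothing to score in this batch
    return sum(p == y for p, y in scored) / len(scored)


# Only positions 0 and 2 carry a real label; position 0 is right, position 2 is wrong.
print(masked_accuracy([5, 7, 9, 2], [5, -100, 8, -100]))
```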

## Optimizers and Learning Rate Schedulers

The last setup step is to create an optimizer and a learning rate scheduler. We will use PyTorch's `AdamW` optimizer and `LinearLR` learning rate scheduler, the same pairing that is typically used to fine-tune BERT on tasks such as SST-2.

In [None]:
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR

optimizer = AdamW(
    params=composer_model.parameters(),
    lr=3e-5, betas=(0.9, 0.98),
    eps=1e-6, weight_decay=3e-6
)
linear_lr_decay = LinearLR(
    optimizer, start_factor=1.0,
    end_factor=0, total_iters=150
)
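
To see what this schedule does to the learning rate, here is a small sketch of the closed-form value that a linear schedule of this shape produces at a given scheduler step. `linear_lr` is a hypothetical helper written for illustration; its defaults mirror the arguments used above.

```python
def linear_lr(base_lr, step, start_factor=1.0, end_factor=0.0, total_iters=150):
    """Learning rate after `step` scheduler steps of a linear ramp between two factors."""
    progress = min(step, total_iters) / total_iters  # clamp once the ramp is finished
    return base_lr * (start_factor + (end_factor - start_factor) * progress)


for step in (0, 75, 150):
    print(step, linear_lr(3e-5, step))
```

With `end_factor=0` the learning rate reaches zero after `total_iters` steps and stays there, so `total_iters` should cover the number of steps we actually intend to train for.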

## Composer Trainer

We will now specify a Composer `Trainer` object and run our training! `Trainer` has many arguments that are described in our [documentation](https://docs.mosaicml.com/en/stable/api_reference/generated/composer.Trainer.html#trainer), so we'll discuss only the less-obvious arguments used below:

- `max_duration` - a string specifying how long to train. This can be in terms of batches (e.g., `'10ba'` is 10 batches) or epochs (e.g., `'1ep'` is 1 epoch), [among other options][time].
- `schedulers` - a (list of) PyTorch or Composer learning rate scheduler(s) that will be composed together.
- `device` - specifies if the training will be done on CPU or GPU by using `'cpu'` or `'gpu'`, respectively. You can omit this to automatically train on GPUs if they're available and fall back to the CPU if not.
- `train_subset_num_batches` - specifies the number of training batches to use for each epoch. This is not a necessary argument but is useful for quickly testing code.
- `precision` - whether to do the training in full precision (`'fp32'`) or mixed precision (`'amp'`). Mixed precision can provide a ~2x training speedup on recent NVIDIA GPUs.
- `seed` - sets the random seed for the training run, so the results are reproducible!

[time]: https://docs.mosaicml.com/en/stable/trainer/time.html
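
As a rough illustration of how a duration string combines with `train_subset_num_batches`, here is a pure-Python sketch. It does not use Composer's own time parsing: `parse_duration` and `batches_in_run` are hypothetical helpers, and the list of unit suffixes is our reading of the time documentation, so treat it as an assumption.

```python
import re

# Unit suffixes as we understand them from the Composer time documentation (assumption).
UNITS = {"ep": "epochs", "ba": "batches", "sp": "samples", "tok": "tokens", "dur": "fraction of run"}


def parse_duration(text):
    """Split a duration string such as '10ba' into (value, unit)."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(ep|ba|sp|tok|dur)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration string: {text!r}")
    value, unit = match.groups()
    return (float(value) if "." in value else int(value)), unit


def batches_in_run(max_duration, batches_per_epoch):
    """Number of training batches implied by an epoch- or batch-based duration."""
    value, unit = parse_duration(max_duration)
    if unit == "ep":
        return value * batches_per_epoch
    if unit == "ba":
        return value
    raise ValueError(f"this sketch only handles 'ep' and 'ba', got {unit!r}")


batch_size = 16
total_batches = batches_in_run("1ep", batches_per_epoch=150)
print(total_batches, "batches,", total_batches * batch_size, "samples")
```

With `max_duration="1ep"` and `train_subset_num_batches=150`, one epoch is capped at 150 batches, which at a batch size of 16 amounts to 2,400 samples.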

### IMPORTANT NOTE

A full pre-training run inside this notebook would take a VERY long time to complete. We will not be waiting for this job to run to completion. 

After you launch the code cell below, look for the status messages and progress bar to see that the training work is starting. We will let the pre-training run progress for a few minutes, and then we will be stopping the run early.

### **Later in this session:** We will demo launching a pre-training run on MosaicML Cloud

In [None]:
import torch
from composer import Trainer

# Create Trainer Object
trainer = Trainer(
    model=composer_model, # This is the model from the HuggingFaceModel wrapper class.
    train_dataloader=train_dataloader,
    eval_dataloader=eval_dataloader,
    max_duration="1ep",
    optimizers=optimizer,
    schedulers=[linear_lr_decay],
    device='gpu' if torch.cuda.is_available() else 'cpu',
    train_subset_num_batches=150,
    precision='fp32',
    seed=17
)
# Start training
trainer.fit()

# **STOP HERE**

This is the end of Module 2 - Pre-Training. We will take a break here, and make time for Q&A. However, if you want to continue without a break, feel free to do so.

## **Interrupt the previous cell!** 
**Please remember to stop execution of the ongoing pre-training run in the cell above.** To do that, click the "stop" square inside the circle at the top-left corner of the notebook cell. 

## **Module 3: Fine-Tuning**

## Import Hugging Face Pretrained Model
First, we import a pretrained BERT model (specifically, BERT-base for uncased text) and its associated tokenizer from the transformers library.

Sentiment classification has two labels, so we set `num_labels=2` when creating our model.

In [None]:
# Create a BERT sequence classification model using Hugging Face transformers
sentiment_model = transformers.AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
sst2_tokenizer = transformers.AutoTokenizer.from_pretrained('bert-base-uncased') 

## Creating Dataloaders for Fine-Tuning

Fine-tuning a model requires far less data and far less compute than pre-training. A pre-trained BERT model has already learned a great deal about the structure and context of the language, so adapting it to a specific use case can be done much more quickly.

Next, we will download and tokenize the Stanford Sentiment Treebank v2 (SST-2) dataset. The SST-2 dataset contains a variety of sentences that have been labeled as having either Positive or Negative sentiment.

In [None]:
import datasets
from multiprocessing import cpu_count

# Define a function that tokenizes a batch of SST-2 sentences
def tokenize_function(sample):
    return sst2_tokenizer(
        text=sample['sentence'],
        padding="max_length",
        max_length=256,
        truncation=True
    )

# Tokenize SST-2
sst2_dataset = datasets.load_dataset("glue", "sst2")
tokenized_sst2_dataset = sst2_dataset.map(tokenize_function,
                                          batched=True, 
                                          num_proc=cpu_count(),
                                          batch_size=100,
                                          remove_columns=['idx', 'sentence'])

# Split dataset into train and validation sets
sst2_train_dataset = tokenized_sst2_dataset["train"]
sst2_eval_dataset = tokenized_sst2_dataset["validation"]
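
The `padding="max_length"` and `truncation=True` options make every example the same length. Here is a simplified sketch of that behavior on a list of token ids. `pad_or_truncate` is a hypothetical helper, not the tokenizer's implementation; in particular, the real tokenizer keeps the closing `[SEP]` token when it truncates, which this sketch does not attempt.

```python
PAD_ID = 0  # [PAD] token id in the bert-base-uncased vocabulary


def pad_or_truncate(token_ids, max_length=256, pad_id=PAD_ID):
    """Cut or pad a sequence to exactly max_length and build its attention mask."""
    ids = list(token_ids[:max_length])       # drop anything beyond max_length
    attention_mask = [1] * len(ids)          # 1 marks real tokens
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * padding,
        "attention_mask": attention_mask + [0] * padding,  # 0 marks padding
    }


example = pad_or_truncate([101, 2023, 102], max_length=5)
print(example["input_ids"])
print(example["attention_mask"])
```

The attention mask is what lets the model ignore padded positions, so padding short sentences to 256 tokens costs compute but does not change what the model attends to.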

Here, we will create a PyTorch `DataLoader` for each of the datasets generated in the previous block.

In [None]:
from torch.utils.data import DataLoader
sst2_data_collator = transformers.data.data_collator.default_data_collator
sst2_train_dataloader = DataLoader(sst2_train_dataset, batch_size=16, shuffle=False, drop_last=False, collate_fn=sst2_data_collator)
sst2_eval_dataloader = DataLoader(sst2_eval_dataset, batch_size=16, shuffle=False, drop_last=False, collate_fn=sst2_data_collator)

### Composer Sentiment Analysis Model

In [None]:
from torchmetrics import Accuracy
from torchmetrics.collections import MetricCollection
from composer.metrics import CrossEntropy

metrics = [CrossEntropy(), Accuracy()]
# Package as a trainer-friendly Composer model
composer_sentiment_model = HuggingFaceModel(sentiment_model, metrics=metrics, use_logits=True)
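
For intuition about what these two metrics measure on classification logits, here is a minimal pure-Python sketch. The functions `cross_entropy` and `accuracy` are hypothetical helpers written for illustration; they are not the torchmetrics or Composer implementations, and they operate on plain lists rather than tensors.

```python
import math


def cross_entropy(logits, labels):
    """Mean negative log-probability that the softmax assigns to the correct class."""
    total = 0.0
    for row, y in zip(logits, labels):
        peak = max(row)  # subtract the max before exponentiating for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_sum_exp - row[y]
    return total / len(labels)


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    hits = sum(
        max(range(len(row)), key=row.__getitem__) == y
        for row, y in zip(logits, labels)
    )
    return hits / len(labels)


batch_logits = [[2.0, 0.0], [0.0, 1.0]]
batch_labels = [0, 0]
print(cross_entropy(batch_logits, batch_labels))
print(accuracy(batch_logits, batch_labels))
```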

## Optimizers and Learning Rate Schedulers

The last setup step is to create an optimizer and a learning rate scheduler. We will use PyTorch's AdamW optimizer and linear learning rate scheduler since these are typically used to fine-tune BERT on tasks such as SST-2.

In [None]:
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR

sst2_optimizer = AdamW(
    params=composer_sentiment_model.parameters(),
    lr=3e-5, betas=(0.9, 0.98),
    eps=1e-6, weight_decay=3e-6
)
sst2_linear_lr_decay = LinearLR(
    sst2_optimizer, start_factor=1.0,
    end_factor=0, total_iters=150
)

### Composer Trainer

Here is our second Composer `Trainer` object for the fine-tuning step. You can refer back to the `Trainer` [documentation](https://docs.mosaicml.com/en/stable/api_reference/generated/composer.Trainer.html#trainer), and review details from our previous `Trainer` in Module 2:

- `max_duration` - a string specifying how long to train. This can be in terms of batches (e.g., `'10ba'` is 10 batches) or epochs (e.g., `'1ep'` is 1 epoch), [among other options][time].
- `schedulers` - a (list of) PyTorch or Composer learning rate scheduler(s) that will be composed together.
- `device` - specifies if the training will be done on CPU or GPU by using `'cpu'` or `'gpu'`, respectively. You can omit this to automatically train on GPUs if they're available and fall back to the CPU if not.
- `train_subset_num_batches` - specifies the number of training batches to use for each epoch. This is not a necessary argument but is useful for quickly testing code.
- `precision` - whether to do the training in full precision (`'fp32'`) or mixed precision (`'amp'`). Mixed precision can provide a ~2x training speedup on recent NVIDIA GPUs.
- `seed` - sets the random seed for the training run, so the results are reproducible!

[time]: https://docs.mosaicml.com/en/stable/trainer/time.html

In [None]:
# Create Trainer Object
sentiment_trainer = Trainer(
    model=composer_sentiment_model, # This is the model from the HuggingFaceModel wrapper class.
    train_dataloader=sst2_train_dataloader,
    eval_dataloader=sst2_eval_dataloader,
    max_duration="1ep",
    optimizers=sst2_optimizer,
    schedulers=[sst2_linear_lr_decay],
    device='gpu' if torch.cuda.is_available() else 'cpu',
    train_subset_num_batches=150,
    precision='fp32',
    seed=17
)
# Start training
sentiment_trainer.fit()

To check the validation accuracy, we read `state.eval_metrics` from the `Trainer` object:

In [None]:
sentiment_trainer.state.eval_metrics

Our model reaches ~86% accuracy with only 150 iterations of training! 
Let's visualize a few samples from the validation set to see how our model performs.

In [None]:
eval_batch = next(iter(sst2_eval_dataloader))

# Move the batch to the GPU if one is available; otherwise leave it on the CPU
eval_batch = {k: v.cuda() if torch.cuda.is_available() else v for k, v in eval_batch.items()}
with torch.no_grad():
    predictions = composer_sentiment_model(eval_batch)["logits"].argmax(dim=1)

# Visualize only 5 samples
predictions = predictions[:5]

label = ['negative', 'positive']
for i, prediction in enumerate(predictions):
    sentence = sst2_dataset["validation"][i]["sentence"]
    correct_label = label[sst2_dataset["validation"][i]["label"]]
    prediction_label = label[prediction]
    print(f"Sample: {sentence}")
    print(f"Label: {correct_label}")
    print(f"Prediction: {prediction_label}")
    print()
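
The key step in the cell above is turning each row of logits into a class index with `argmax` and then into a readable label. Here is the same idea as a small pure-Python sketch on made-up logits. `predict_labels` is a hypothetical helper, and the label order follows the `label` list used above.

```python
LABELS = ["negative", "positive"]  # index 0 is negative, index 1 is positive


def predict_labels(logits):
    """Map each row of two-class logits to the name of its highest-scoring class."""
    return [LABELS[max(range(len(row)), key=row.__getitem__)] for row in logits]


# Made-up logits for two sentences, purely for illustration.
print(predict_labels([[0.1, 0.9], [2.0, -1.0]]))
```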

## Save Pre-Trained Model

Finally, to save the pre-trained model parameters, we call `torch.save` and pass it the model's `state_dict`:

In [None]:
torch.save(trainer.state.model.state_dict(), 'model.pt')

# **Congratulations! LAB COMPLETE**

You've now seen how to use the Composer `Trainer` to pre-train a Hugging Face BERT model on the C4 dataset. Following that, you've fine-tuned a Hugging Face BERT model on the SST-2 dataset and seen how the model predicts whether a sentence has positive or negative sentiment.

## What next?

## Come get involved with MosaicML!

We'd love for you to get involved with the MosaicML community in any of these ways:

### [Star Composer on GitHub](https://github.com/mosaicml/composer)

Help make others aware of our work by [starring Composer on GitHub](https://github.com/mosaicml/composer).

### [Join the MosaicML Slack](https://join.slack.com/t/mosaicml-community/shared_invite/zt-w0tiddn9-WGTlRpfjcO9J5jyrMub1dg)

Head on over to the [MosaicML Slack](https://join.slack.com/t/mosaicml-community/shared_invite/zt-w0tiddn9-WGTlRpfjcO9J5jyrMub1dg) to join other ML efficiency enthusiasts. Come for the paper discussions, stay for the memes!

### Contribute to Composer

Is there a bug you noticed or a feature you'd like? File an [issue](https://github.com/mosaicml/composer/issues) or make a [pull request](https://github.com/mosaicml/composer/pulls)!