Before getting into the model and fine-tuning, we need to download some data to fine-tune on.

In [1]:
from datasets import load_dataset

# load the first 1K rows of the TREC dataset
trec = load_dataset('trec', split='train[:1000]')
trec

Using custom data configuration default
Reusing dataset trec (/Users/jamesbriggs/.cache/huggingface/datasets/trec/default/1.1.0/751da1ab101b8d297a3d6e9c79ee9b0173ff94c4497b75677b59b61d5467a9b9)


Dataset({
    features: ['label-coarse', 'label-fine', 'text'],
    num_rows: 1000
})

In [2]:
trec[0]

{'label-coarse': 0,
 'label-fine': 0,
 'text': 'How did serfdom develop in and then leave Russia ?'}

There are also a few data preparation steps we need to perform. We need to tokenize our input text `text`, one-hot encode our labels `label-coarse`, and then place these together in a dataset and dataloader.

For tokenization we need a *tokenizer*. We will use the `bert-base-uncased` tokenizer from Hugging Face.

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# we have a small dataset, so we can tokenize everything at once
# (every sequence is padded or truncated to 512 tokens)
tokens = tokenizer(
    trec['text'], max_length=512,
    truncation=True, padding='max_length'
)

The tokenizer returns a `BatchEncoding`. Indexing it with an integer or a slice gives the underlying `Encoding` objects:

In [4]:
tokens[:2]

[Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
 Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])]

We can access individual components of an encoding, for example its token IDs:

In [5]:
tokens[0].ids

[101, 2129, 2106, 14262, 2546, 9527, 4503, 1999, 1998, 2059, 2681, 3607, 1029, 102, 0, 0, 0, ...]

(output truncated: the remaining entries are all `0` padding, up to the `max_length` of 512)

Next we one-hot encode our labels.

In [6]:
import numpy as np

# initialize array to be used
labels = np.zeros(
    (len(trec), max(trec['label-coarse'])+1)
)
# one-hot encode
labels[np.arange(len(trec)), trec['label-coarse']] = 1
labels[:5]

array([[1., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0.]])
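
The same encoding can also be produced by indexing an identity matrix, which some readers may find easier to follow. This is a minimal sketch, not the notebook's method: `sample_labels` and `num_classes` are illustrative names, with the five label values read off the output above.

```python
import numpy as np

# illustrative stand-in for trec['label-coarse'][:5] (values read from the output above)
sample_labels = [0, 1, 0, 1, 2]
num_classes = 6  # matches the six columns in the array above

# row i of the identity matrix is the one-hot vector for class i,
# so indexing with the label list picks one row per sample
one_hot = np.eye(num_classes)[sample_labels]
print(one_hot)
```

Each row contains a single `1.` in the column given by its label, matching the array produced by the indexing approach above.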

In [7]:
import torch

labels = torch.Tensor(labels)

Now we're ready to create the dataset object.

In [8]:
class TrecDataset(torch.utils.data.Dataset):
    def __init__(self, tokens, labels):
        self.tokens = tokens
        self.labels = labels

    def __getitem__(self, idx):
        input_ids = self.tokens[idx].ids
        attention_mask = self.tokens[idx].attention_mask
        labels = self.labels[idx]
        return {
            'input_ids': torch.tensor(input_ids),
            'attention_mask': torch.tensor(attention_mask),
            # labels is already a tensor, re-wrapping it raises a UserWarning
            'labels': labels
        }

    def __len__(self):
        return len(self.labels)

dataset = TrecDataset(tokens, labels)

In [9]:
loader = torch.utils.data.DataLoader(
    dataset, batch_size=64
)
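
With 1,000 rows and a batch size of 64, the loader will not divide the data evenly. The arithmetic can be sketched as follows (the variable names are illustrative and only restate the values used above):

```python
import math

num_rows = 1000   # rows loaded from TREC above
batch_size = 64   # as passed to the DataLoader

# the DataLoader keeps the final partial batch by default (drop_last=False)
num_batches = math.ceil(num_rows / batch_size)
last_batch_size = num_rows - (num_batches - 1) * batch_size
print(num_batches, last_batch_size)
```

This gives 16 batches, the last holding only 40 samples, which is worth remembering when comparing per-batch timings.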

Now let's try training a BERT model. We'll use it with our TREC data to compare training step times on CPU vs MPS.

In [10]:
from transformers import BertForSequenceClassification, BertConfig

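config = BertConfig.from_pretrained('bert-base-uncased')
config.num_labels = max(trec['label-coarse'])+1
# load the pretrained BERT weights; BertForSequenceClassification(config)
# alone would start from randomly initialized weights
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased', config=config
)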

Fine-tuning the entire BERT model on a first-gen M1 Mac is not going to work, but we can fine-tune the classification head, so let's test that by freezing all BERT layer parameters. This leaves only the final classification layers to be fine-tuned.

In [11]:
for param in model.bert.parameters():
    param.requires_grad = False
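
To confirm that freezing has the intended effect, we can count the parameters that still require gradients. The sketch below uses a tiny hypothetical stand-in (`TinyClassifier`, with a `bert` submodule and a `classifier` head) rather than the real model, so that it runs without downloading anything. The same two `sum(...)` lines should apply to `model` unchanged.

```python
import torch
from torch import nn

class TinyClassifier(nn.Module):
    """Hypothetical stand-in: `bert` plays the encoder, `classifier` the head."""
    def __init__(self):
        super().__init__()
        self.bert = nn.Linear(8, 8)
        self.classifier = nn.Linear(8, 6)

    def forward(self, x):
        return self.classifier(self.bert(x))

toy = TinyClassifier()

# freeze the "encoder" exactly as done for model.bert above
for param in toy.bert.parameters():
    param.requires_grad = False

# only the classification head should remain trainable
trainable = sum(p.numel() for p in toy.parameters() if p.requires_grad)
total = sum(p.numel() for p in toy.parameters())
print(f"trainable: {trainable} / {total}")
```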

Training prep

In [12]:
# activate training mode of model
model.train()

# initialize Adam optimizer (plain Adam, no weight decay)
optim = torch.optim.Adam(model.parameters(), lr=5e-5)
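
To run the same training step on MPS rather than on CPU, the model and every tensor in each batch need to live on the MPS device. A minimal sketch of that pattern, assuming a PyTorch build that may or may not include the MPS backend; `toy_model` and `toy_batch` are hypothetical stand-ins for `model` and a batch from `loader`:

```python
import torch

# fall back to CPU when the MPS backend is missing or unavailable
mps_backend = getattr(torch.backends, "mps", None)
use_mps = mps_backend is not None and mps_backend.is_available()
device = torch.device("mps" if use_mps else "cpu")

# hypothetical stand-ins for the model and a batch from the loader
toy_model = torch.nn.Linear(4, 2).to(device)
toy_batch = {"x": torch.ones(3, 4)}

# every tensor in the batch must be on the same device as the model
toy_batch = {k: v.to(device) for k, v in toy_batch.items()}
out = toy_model(toy_batch["x"])
print(device, out.shape)
```

In the training loop, the equivalent would be to call `model.to(device)` once before creating the optimizer and to move each `batch` the same way before `model(**batch)`.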

In [13]:
from time import time
from tqdm.auto import tqdm

loop_time = []

# setup loop (using tqdm for the progress bar)
loop = tqdm(loader, leave=True)
for batch in loop:
    t0 = time()
    # reset gradients accumulated in the previous step
    optim.zero_grad()
    # run the forward pass on the batch and return outputs (incl. loss)
    outputs = model(**batch)
    # extract loss
    loss = outputs[0]
    # backpropagate to compute gradients for every trainable parameter
    loss.backward()
    # update parameters
    optim.step()
    loop_time.append(time()-t0)
    # print relevant info to progress bar
    loop.set_postfix(loss=loss.item())

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


100%|██████████| 16/16 [08:32<00:00, 32.00s/it, loss=0.627]


In [14]:
loop_time

[33.2337121963501,
 32.69078779220581,
 32.49973511695862,
 31.878857851028442,
 31.820831060409546,
 32.20605492591858,
 33.72793507575989,
 33.6209659576416,
 34.16751790046692,
 32.69612193107605,
 32.390515089035034,
 34.32974076271057,
 32.47592902183533,
 33.35161590576172,
 33.055907011032104,
 17.779623985290527]
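
The final entry is much shorter because the last batch holds only 40 of the usual 64 samples (1,000 rows do not divide evenly by 64). When summarizing these timings it is fairer to leave that batch out. A rough sketch, using the values above rounded to two decimals:

```python
from statistics import mean

# per-batch times in seconds, rounded from the loop_time output above
loop_time = [33.23, 32.69, 32.50, 31.88, 31.82, 32.21, 33.73, 33.62,
             34.17, 32.70, 32.39, 34.33, 32.48, 33.35, 33.06, 17.78]

full_batches = loop_time[:-1]   # drop the final, partial batch
per_batch = mean(full_batches)  # average seconds per 64-sample batch
per_sample = per_batch / 64     # average seconds per sample

print(f"mean full-batch time: {per_batch:.2f}s")
print(f"approx. time per sample: {per_sample:.3f}s")
```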

---