# Playing around with examples from the HuggingFace NLP Course

In [1]:
import torch.nn.functional as F

## Behind the Pipeline

* Tokenizer: Maps raw input text to tokens (indices into a vocabulary). The output is a 2D tensor of ints
  (batch, sequence).
* Model:
  - Input embeddings: Map each index to a latent vector. The output is a 3D tensor of floats
    (batch, sequence, hidden).
  - Backbone: Transforms the 3D tensor through multiple layers, using self-attention, FFN, layer normalization,
    dropout, etc. The output is a 3D tensor of the same size.
  - Head: Maps the 3D tensor to the output tensor, which is 2D or 3D, depending on the model type. For some
    models (such as sequence classification), the sequence dimension is removed, for example by using only
    the `[CLS]` position. The head is usually a linear layer, followed by a model-type-specific nonlinearity.
* Postprocessing: Either a loss function (training) or a mapping from outputs to probabilities (inference)
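
As a rough illustration of the postprocessing step, the following sketch maps raw head outputs (logits) to probabilities with a softmax. It uses plain Python rather than `torch`, and the logit values are made up for illustration; they are not outputs of the model used in this notebook.

```python
import math

def softmax(logits):
    """Map a list of raw scores to probabilities that sum to 1."""
    # Subtract the maximum before exponentiating, for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits of a two-class head: one row per sequence in the batch.
logits = [[-2.0, 2.0], [1.5, -1.0]]
probs = [softmax(row) for row in logits]

# The predicted class is the index of the largest probability in each row.
predictions = [max(range(len(p)), key=p.__getitem__) for p in probs]
```

With tensors, the same step would typically be a softmax over the last dimension of the logits.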

In [2]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [3]:
tokenizer

DistilBertTokenizerFast(name_or_path='distilbert-base-uncased-finetuned-sst-2-english', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

The tokenizer has a vocabulary of size 30522. Its maximum sequence length is 512 tokens. Both padding and truncation are done on the right side: any sequence of more than 512 tokens is truncated by dropping tokens on the right.

The tokenizer defines a number of special tokens, which do not map to subwords. `[PAD]` is used for padding, `[CLS]` marks the start of a sequence, `[SEP]` marks the end of a sequence or separates multiple sequences, `[MASK]` is the mask token for cloze expressions, and `[UNK]` stands in for input that cannot be mapped to the vocabulary.
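
To make the roles of `[CLS]`, `[SEP]` and right-side truncation concrete, here is a simplified sketch in plain Python. The helper `add_special_tokens` is hypothetical and not part of `transformers`. It assumes the ids 101 and 102 from the tokenizer output above, and that truncation leaves room for the two special tokens.

```python
CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP] in this vocabulary

def add_special_tokens(token_ids, max_length=512):
    """Hypothetical helper: wrap subword ids in [CLS] ... [SEP], truncating on the right."""
    # Reserve two positions for [CLS] and [SEP], and drop surplus ids on the right.
    content = token_ids[: max_length - 2]
    return [CLS_ID] + content + [SEP_ID]

ids = [2023, 2338, 2003, 2074, 8239, 12476]  # subword ids, for illustration

full = add_special_tokens(ids)                 # short enough, nothing is dropped
short = add_special_tokens(ids, max_length=5)  # only the first three ids survive
```

In this sketch, `[SEP]` is kept even when the sequence is truncated, so a truncated sequence still has a well-formed start and end.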

In [4]:
raw_inputs = [
    "This book is just fucking awesome",
    "I'd lie if I pretended reading this book was enjoyable"
]

inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt").to("mps")
print(inputs)

{'input_ids': tensor([[  101,  2023,  2338,  2003,  2074,  8239, 12476,   102,     0,     0,
             0,     0,     0,     0],
        [  101,  1045,  1005,  1040,  4682,  2065,  1045, 14688,  3752,  2023,
          2338,  2001, 22249,   102]], device='mps:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='mps:0')}


Note:
* Each token sequence starts with `[CLS] = 101` and ends with `[SEP] = 102`
* The first sequence is shorter and is padded to the same length as the second using `[PAD] = 0`.
  This is called dynamic padding, whereas static padding would pad both sequences to a fixed
  length, such as the full 512 tokens (fixed shapes are preferable on TPUs)
* `attention_mask` is 1 for non-padding tokens and 0 for padding tokens, so that attention
  ignores the padding
* The tensors were moved to the `mps` device, so PyTorch processes them on the Apple Silicon GPU
  (Metal Performance Shaders backend)
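
The padding and the attention mask in the output above can be reproduced with a few lines of plain Python. This is only a sketch of the idea: `pad_batch` is a hypothetical helper, not the tokenizer's actual implementation.

```python
PAD_ID = 0  # id of [PAD] in this vocabulary

def pad_batch(sequences, pad_id=PAD_ID):
    """Hypothetical helper: right-pad all sequences to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that attention should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# The two unpadded id sequences from the output above.
unpadded = [
    [101, 2023, 2338, 2003, 2074, 8239, 12476, 102],
    [101, 1045, 1005, 1040, 4682, 2065, 1045, 14688, 3752, 2023, 2338, 2001, 22249, 102],
]
padded = pad_batch(unpadded)
```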

In [9]:
for x in inputs["input_ids"]:
    print(x)

tensor([  101,  2023,  2338,  2003,  2074,  8239, 12476,   102,     0,     0,
            0,     0,     0,     0], device='mps:0')
tensor([  101,  1045,  1005,  1040,  4682,  2065,  1045, 14688,  3752,  2023,
         2338,  2001, 22249,   102], device='mps:0')


In [10]:
token_seqs = [
    [
        f"{ind:5d} {token}"
        for ind, token in zip(row, tokenizer.convert_ids_to_tokens(row))
    ]
    for row in inputs["input_ids"]
]

In [11]:
token_seqs

[['  101 [CLS]',
  ' 2023 this',
  ' 2338 book',
  ' 2003 is',
  ' 2074 just',
  ' 8239 fucking',
  '12476 awesome',
  '  102 [SEP]',
  '    0 [PAD]',
  '    0 [PAD]',
  '    0 [PAD]',
  '    0 [PAD]',
  '    0 [PAD]',
  '    0 [PAD]'],
 ['  101 [CLS]',
  ' 1045 i',
  " 1005 '",
  ' 1040 d',
  ' 4682 lie',
  ' 2065 if',
  ' 1045 i',
  '14688 pretended',
  ' 3752 reading',
  ' 2023 this',
  ' 2338 book',
  ' 2001 was',
  '22249 enjoyable',
  '  102 [SEP]']]

This vocabulary seems to contain whole-word tokens for at least the common English words. All tokens are lower-case, since the model is uncased. Note the token for "i" (1045), which occurs twice in the second sequence, and that "I'd" is split into the three tokens `i`, `'` and `d`.

## Fine-Tuning

This example is about fine-tuning a pre-trained model on an additional dataset. Here, the MRPC dataset is used, where the input consists of two sentences, and the binary label states whether they are paraphrases (i.e., mean the same) or not. MRPC is part of the GLUE benchmark suite.

In [12]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

In [13]:
raw_datasets["train"]

Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 3668
})

In [17]:
raw_datasets["train"][2222]

{'sentence1': 'The jury verdict , reached Wednesday after less than four hours of deliberation , followed a 2 1 / 2 week trial , during which Waagner represented himself .',
 'sentence2': 'The quick conviction followed a 2 1 / 2 week trial , during which the Venango County man represented himself .',
 'label': 1,
 'idx': 2475}

In [18]:
raw_datasets["train"].features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

In [20]:
raw_datasets["train"].features["label"].names

['not_equivalent', 'equivalent']

In [22]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [23]:
tokenizer

BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

We need to preprocess the data. First, there are two sentences for each datapoint, whereas the model expects a single input sequence. The tokenizer can be fed a pair of sentences and concatenates them. Each sentence ends with `[SEP]`, and `[CLS]` is prepended to the combined sequence.
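
The layout of such a sentence pair can be sketched in plain Python as follows. The helper `encode_pair` is hypothetical and only mimics the layout a BERT tokenizer produces. It assumes the sentences are already converted to subword ids, and it also builds the `token_type_ids`, which mark the sentence each position belongs to.

```python
CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP] in this vocabulary

def encode_pair(ids_a, ids_b):
    """Hypothetical helper: build [CLS] A [SEP] B [SEP] together with token type ids."""
    first = [CLS_ID] + ids_a + [SEP_ID]
    second = ids_b + [SEP_ID]
    return {
        "input_ids": first + second,
        # 0 for [CLS], sentence A and its [SEP]; 1 for sentence B and its [SEP].
        "token_type_ids": [0] * len(first) + [1] * len(second),
    }

# Subword ids for "this book" and "was enjoyable", for illustration.
pair = encode_pair([2023, 2338], [2001, 22249])
```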

Preprocessing is done by calling `map` on the dataset with a mapping function, which is applied to each case. The `datasets` library keeps its data in Apache Arrow files on disk, so the dataset does not need to fit into memory.
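
With `batched=True`, as used in the next cell, the mapping function receives a dictionary of lists instead of a single case. The following plain-Python sketch mimics this calling convention. `map_batched` is a hypothetical stand-in for illustration, not the `datasets` implementation, and the toy mapping function simply counts words.

```python
def map_batched(columns, fn, batch_size=2):
    """Hypothetical stand-in: apply `fn` to slices of a column-oriented dataset."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # Each batch is a dict mapping column names to lists of values.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Columns returned by `fn` are added to (or replace) the original ones.
    return {**columns, **new_columns}

def count_words(batch):
    # Toy mapping function: total number of words in each sentence pair.
    return {
        "n_words": [
            len(a.split()) + len(b.split())
            for a, b in zip(batch["sentence1"], batch["sentence2"])
        ]
    }

toy = {"sentence1": ["a b", "c d e", "f"], "sentence2": ["g", "h i", "j k l"]}
mapped = map_batched(toy, count_words)
```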

In [25]:
# `example` can contain a batch of cases, in that its dictionary values are
# lists of strings. This allows for batched processing, which is faster.
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

In [27]:
print(tokenized_datasets["train"][0])
print(tokenized_datasets["train"][1])

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', 'label': 1, 'idx': 0, 'input_ids': [101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 7727, 2000, 2032, 2004, 2069, 1000, 1996, 7409, 1000, 1010, 2572, 3217, 5831, 5496, 2010, 2567, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'sentence1': "Yucaipa owned Dominick 's before selling the chain to Safewa

Note that each sequence has a different number of tokens, so we still need to do the padding.

Except on TPUs, it is most efficient to use dynamic padding, so that the sequences in each batch are padded only to the length of the longest one in that batch. This means that each batch can have a different sequence length.

A **collate** function is used to put together cases into a batch. The padding is done there.

In [29]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [30]:
# Create a batch of size 8 by hand
samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
[len(x) for x in samples["input_ids"]]

[50, 59, 47, 67, 59, 50, 62, 32]

In [36]:
batch = data_collator(samples).to("mps")

In [38]:
{k: v.shape for k, v in batch.items()}

{'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'attention_mask': torch.Size([8, 67]),
 'labels': torch.Size([8])}
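
To get a feeling for what dynamic padding saves, the following sketch counts the padding tokens for the batch above. It uses the sequence lengths printed earlier and compares padding to the longest sequence (67) with padding to the full model length (512).

```python
lengths = [50, 59, 47, 67, 59, 50, 62, 32]  # token counts of the 8 samples
model_max_length = 512

real_tokens = sum(lengths)

# Dynamic padding: every row is padded to the longest sequence in the batch.
dynamic_slots = len(lengths) * max(lengths)
dynamic_padding = dynamic_slots - real_tokens

# Static padding: every row is padded to the maximum model length.
static_slots = len(lengths) * model_max_length
static_padding = static_slots - real_tokens

print(f"real tokens: {real_tokens}")
print(f"dynamic padding: {dynamic_padding} of {dynamic_slots} positions")
print(f"static padding:  {static_padding} of {static_slots} positions")
```

For this batch, dynamic padding adds 110 padding tokens to 426 real tokens, whereas static padding to 512 would add 3670.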

Let us write a preprocessing function that works for all of the GLUE datasets:

In [52]:
task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

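# Try two-column signatures first. Otherwise ("sentence", None) could also
# match QNLI, which has both a "question" and a "sentence" column.
unique_keys = sorted(set(task_to_keys.values()), key=lambda pair: pair[1] is None)

def signature(example, pair):
    # Replace every key of `pair` that is missing from `example` by None
    return tuple(None if k is None or example.get(k) is None else k for k in pair)

def glue_tokenize_function(example):
    # Find the column names of the GLUE task this example (or batch) comes from
    keys = None
    for pair in unique_keys:
        if signature(example, pair) == pair:
            keys = pair
            break
    assert keys is not None, f"example keys = {list(example.keys())} do not match any of the GLUE sets"
    print(keys)  # debug output: the detected column names
    k1, k2 = keys
    seqs = (example[k1],)
    # Single-sentence tasks have no second column
    if k2 is not None:
        seqs = seqs + (example[k2],)
    return tokenizer(*seqs, truncation=True)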