## Hugging Face Transformers

Hugging Face is a company that aims to democratize AI through natural language. Its open-source `transformers` library is widely used in the NLP community for a range of NLP and NLU tasks. It provides thousands of pre-trained models covering more than 100 languages. One advantage of the library is that it is compatible with both PyTorch and TensorFlow.

We can install transformers directly using pip as shown in the following:

In [2]:
!pip install transformers


### Generating BERT Embeddings

In this section, we will learn how to extract embeddings from the pre-trained BERT model.
Consider the sentence 'I love Paris'. Let's see how to obtain the contextualized embedding of every word in the sentence using the pre-trained BERT model with Hugging Face's transformers library.

First, let's import the necessary modules:

In [3]:
from transformers import BertModel, BertTokenizer
import torch

Next, we download the pre-trained BERT model. The available pre-trained models are listed here: https://huggingface.co/docs/transformers/index

We use the 'bert-base-uncased' model. As the name suggests, it is the BERT-base model, which has 12 encoder layers, and it was trained on uncased (lowercased) tokens.

Since we are using BERT-base, the representation size will be 768.

Download and load the pre-trained BERT model:

In [40]:
model = BertModel.from_pretrained('bert-base-uncased')


In [27]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer

BertTokenizer(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

### Preprocessing the input

Define the sentence:

In [28]:
s = 'I love Paris'

Tokenize the sentence and obtain the tokens:

In [29]:
tokens = tokenizer.tokenize(s)
tokens

['i', 'love', 'paris']

Now, we will add the [CLS] token at the beginning and [SEP] token at the end of the tokens list:

In [30]:
tokens = ['[CLS]'] + tokens + ['[SEP]']
tokens

['[CLS]', 'i', 'love', 'paris', '[SEP]']

As we can observe, the [CLS] token is at the beginning and the [SEP] token is at the end of our tokens list. We can also observe that the length of the list is 5.

Say we need the tokens list to have length 7. In that case, we add two [PAD] tokens at the end, as shown in the following:

In [31]:
tokens += ['[PAD]']*2
tokens

['[CLS]', 'i', 'love', 'paris', '[SEP]', '[PAD]', '[PAD]']

Next, we create the attention mask. We set the mask value to 1 if the token is not a [PAD] token and to 0 otherwise, as shown below:

In [32]:
attention_mask = [1 if t!= '[PAD]' else 0 for t in tokens]

In [33]:
attention_mask

[1, 1, 1, 1, 1, 0, 0]

Next, we convert all the tokens to their token_ids, as shown below:

In [34]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)
token_ids

[101, 1045, 2293, 3000, 102, 0, 0]

Then we convert the token_ids and attention_mask to tensors and add a batch dimension with unsqueeze(0):

In [35]:
token_ids = torch.tensor(token_ids).unsqueeze(0)
attention_mask = torch.tensor(attention_mask).unsqueeze(0)

In [36]:
print(token_ids)
print(attention_mask)

tensor([[ 101, 1045, 2293, 3000,  102,    0,    0]])
tensor([[1, 1, 1, 1, 1, 0, 0]])
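The preprocessing steps above can be collected into one helper. The following is a minimal pure-Python sketch, not the library's implementation: the `encode` function and `TOY_VOCAB` are hypothetical names introduced for illustration, and the vocabulary contains only the IDs that appear in the outputs above.

```python
# Toy vocabulary (assumption): only the IDs seen in the outputs above.
# The real tokenizer holds the full 30,522-entry vocabulary.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
             "i": 1045, "love": 2293, "paris": 3000}


def encode(tokens, max_length, vocab=TOY_VOCAB):
    """Add special tokens, pad to max_length, and build the attention mask."""
    tokens = ["[CLS]"] + list(tokens) + ["[SEP]"]
    if len(tokens) > max_length:
        raise ValueError("sequence is longer than max_length")
    tokens += ["[PAD]"] * (max_length - len(tokens))
    # 1 marks a real token, 0 marks padding.
    attention_mask = [0 if t == "[PAD]" else 1 for t in tokens]
    # Tokens missing from the vocabulary fall back to the [UNK] id.
    token_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    return token_ids, attention_mask


demo_ids, demo_mask = encode(["i", "love", "paris"], max_length=7)
print(demo_ids)   # [101, 1045, 2293, 3000, 102, 0, 0]
print(demo_mask)  # [1, 1, 1, 1, 1, 0, 0]
```

In practice, calling the tokenizer directly, for example `tokenizer(s, padding='max_length', max_length=7, return_tensors='pt')`, should perform these steps in one call and return the IDs and the attention mask as tensors.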


### Getting the embedding

As shown in the following code, we feed token_ids and attention_mask to the model and get the embeddings. Note that the model returns an output object with two fields, which can also be indexed like a tuple.

The first field, last_hidden_state, holds the representation of every token obtained from the final encoder (encoder 12). The second field, pooler_output, is derived from the representation of the [CLS] token: it is the [CLS] hidden state passed through a linear layer and a tanh activation.

In [57]:
output=model(token_ids,attention_mask=attention_mask)
output.keys()

odict_keys(['last_hidden_state', 'pooler_output'])

Let's look at what each value means.

First, last_hidden_state is the hidden state of the last layer. For bert-base-uncased it is a tensor of shape (batch_size, sequence_length, 768). This value is generally regarded as the final embedding that BERT produces for the input text, and it is the embedding used for downstream tasks.

In [58]:
print(output[0]==output.last_hidden_state) # output[0] is last_hidden_state
print(output[1].shape) # pooler_output: the [CLS] hidden state passed through a linear layer and tanh

tensor([[[True, True, True,  ..., True, True, True],
         [True, True, True,  ..., True, True, True],
         [True, True, True,  ..., True, True, True],
         ...,
         [True, True, True,  ..., True, True, True],
         [True, True, True,  ..., True, True, True],
         [True, True, True,  ..., True, True, True]]])
torch.Size([1, 768])


In [53]:
output[0][0][0].shape

torch.Size([768])

In [55]:
output[1][0].shape

torch.Size([768])
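To make the difference between the raw [CLS] vector and pooler_output concrete, here is a minimal pure-Python sketch of a dense layer followed by tanh. The `pooler` function, the 3-dimensional vector, and the identity weights are all hypothetical values chosen for illustration; BERT-base uses 768 dimensions and learned weights.

```python
import math


def pooler(cls_hidden, weight, bias):
    """Apply a dense layer followed by tanh to the [CLS] hidden state."""
    pooled = []
    for row, b in zip(weight, bias):
        z = sum(w * h for w, h in zip(row, cls_hidden)) + b
        pooled.append(math.tanh(z))
    return pooled


# Hypothetical toy inputs, not values taken from the model.
cls_hidden = [0.5, -1.0, 2.0]
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]
bias = [0.0, 0.0, 0.0]

pooled = pooler(cls_hidden, weight, bias)
print(len(pooled))  # 3: the output has the same size as the input vector
```

Because of the tanh, every entry of the pooled vector lies strictly between -1 and 1, which is one way to tell it apart from the unbounded values in last_hidden_state.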

In [60]:
from transformers import BertModel

model1 = BertModel.from_pretrained("bert-base-uncased", add_pooling_layer=False, output_hidden_states=True, output_attentions=True)

In [61]:
output=model1(token_ids,attention_mask=attention_mask)
output.keys()

odict_keys(['last_hidden_state', 'hidden_states', 'attentions'])

hidden_states is a tuple that collects the hidden state of every layer, ordered so that later layers come later. In other words, hidden_states[-1] is the same as last_hidden_state. For bert-base-uncased it has length 13 (the first element is the output of the BertEmbeddings module), and each element is a tensor of shape (batch_size, sequence_length, 768).

In [72]:
print(output[0].shape)
print(output.hidden_states[0].shape)
print(type(output[2]))

torch.Size([1, 7, 768])
torch.Size([1, 7, 768])
<class 'tuple'>


In [74]:
output[0]==output.hidden_states[-1]

tensor([[[True, True, True,  ..., True, True, True],
         [True, True, True,  ..., True, True, True],
         [True, True, True,  ..., True, True, True],
         ...,
         [True, True, True,  ..., True, True, True],
         [True, True, True,  ..., True, True, True],
         [True, True, True,  ..., True, True, True]]])

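attentions is a tuple that collects the attention weights of every layer, again ordered so that later layers come later. For bert-base-uncased it has length 12, and each element is a tensor of shape (batch_size, 12, sequence_length, sequence_length), where 12 is the number of attention heads.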

In [77]:
print(len(output.attentions))
print(output.attentions[0].shape)

12
torch.Size([1, 12, 7, 7])
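Each row of an attention matrix is a softmax over the 7 positions, and the attention mask built earlier keeps the [PAD] positions from receiving weight. The following pure-Python sketch illustrates the idea for a single row. The `masked_softmax` function and the scores are hypothetical, and it assumes at least one unmasked position; real implementations typically add a very large negative number to the masked scores rather than an exact negative infinity.

```python
import math


def masked_softmax(scores, attention_mask):
    """Softmax over scores, with masked (0) positions pushed to -inf."""
    masked = [s if m == 1 else float("-inf")
              for s, m in zip(scores, attention_mask)]
    top = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw attention scores for one query token over 7 positions.
scores = [2.0, 1.0, 0.5, 0.0, -1.0, 3.0, 3.0]
attention_mask_row = [1, 1, 1, 1, 1, 0, 0]

weights = masked_softmax(scores, attention_mask_row)
print(weights[5], weights[6])  # 0.0 0.0: the padded positions get no weight
```

Even though the two padded positions have the highest raw scores in this toy example, the weights are distributed only over the five real tokens and still sum to 1.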
