# Explore Transformers

We will use the `transformers` package from `HuggingFace`, which provides a unified interface to a variety of Transformer models.

In [1]:
!pip3 install transformers

Collecting transformers
Collecting sacremoses
Collecting tokenizers<0.11,>=0.10.1
Installing collected packages: sacremoses, tokenizers, transformers
Successfully installed sacremoses-0.0.45 tokenizers-0.10.2 transformers-4.5.1


In [2]:
from transformers import AutoModel, AutoTokenizer

Pretrained models can be downloaded directly from the HuggingFace repository.

We also need the tokenizer that belongs to the chosen model, since each model may split text into tokens differently.

In [3]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

Here we get the base BERT model `uncased`, i.e. the variant where tokens are all lowercased.

In [4]:
model = AutoModel.from_pretrained('bert-base-uncased') 

In [5]:
wps_ids = tokenizer.encode("Hypatia was a mathematician")
wps_ids

[101, 1044, 22571, 10450, 2050, 2001, 1037, 13235, 102]

In [6]:
wordpieces = tokenizer.convert_ids_to_tokens(wps_ids)

See what they are:

In [7]:
wordpieces

['[CLS]', 'h', '##yp', '##ati', '##a', 'was', 'a', 'mathematician', '[SEP]']
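
The split of "hypatia" into `h`, `##yp`, `##ati`, `##a` is what a greedy longest-match-first lookup against a subword vocabulary produces. The sketch below illustrates that idea only; `wordpiece_tokenize` and `toy_vocab` are hypothetical names, and the seven-entry vocabulary is a stand-in for the real one shipped with the model.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercased word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (assumption): just enough entries for the example sentence.
toy_vocab = {"h", "##yp", "##ati", "##a", "was", "a", "mathematician"}

print(wordpiece_tokenize("hypatia", toy_vocab))
print(wordpiece_tokenize("mathematician", toy_vocab))
```

A word that is in the vocabulary as a whole, such as "mathematician", stays in one piece; a rare name is broken into several.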

Convert the IDs to a tensor, wrapping them in a list to add the batch dimension:

In [8]:
import torch
wps_tensor = torch.tensor([wps_ids])

In [9]:
outputs = model(wps_tensor, output_hidden_states=True, output_attentions=True)

In [10]:
last_hidden_state = outputs.last_hidden_state
pooler_output = outputs.pooler_output
hidden_states = outputs.hidden_states
attentions = outputs.attentions

`last_hidden_state` is the sequence of hidden-states at the output of the last layer of the model.

shape: `(batch_size, sequence_length, hidden_size)`

In [11]:
last_hidden_state[0]

tensor([[-0.1904, -0.2317, -0.4896,  ..., -0.4366,  0.5167,  0.5166],
        [ 0.7274,  0.4021, -0.4051,  ..., -0.9010,  1.4341,  0.2526],
        [-0.2553, -0.1450, -0.4257,  ..., -0.8199,  1.1043,  0.2408],
        ...,
        [ 0.1278,  0.1691, -0.8889,  ..., -0.6259,  0.1597,  0.8265],
        [-0.8941, -0.1419, -0.6008,  ..., -0.1861,  0.8240,  0.4556],
        [ 0.7600,  0.0256, -0.5482,  ...,  0.2768, -0.5651, -0.2150]],
       grad_fn=<SelectBackward>)

In [12]:
last_hidden_state[0].shape

torch.Size([9, 768])

`pooler_output` is the last-layer hidden state of the first token of the sequence (the classification token), further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.

shape: `(batch_size, hidden_size)`
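
The "Linear layer plus Tanh" step can be sketched in a few lines of plain Python. This is an illustration under toy assumptions, not the library's code: `pooler`, `weight` and `bias` are hypothetical names, the hidden size is 3 instead of 768, and the weights are an identity matrix rather than trained values.

```python
import math


def pooler(cls_hidden, weight, bias):
    """Dense layer followed by tanh, applied to the [CLS] hidden state."""
    return [
        math.tanh(sum(w * h for w, h in zip(row, cls_hidden)) + b)
        for row, b in zip(weight, bias)
    ]


# Toy values (assumption): hidden_size = 3, identity weights, zero bias.
cls_hidden = [0.5, -1.0, 2.0]
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]
bias = [0.0, 0.0, 0.0]

pooled = pooler(cls_hidden, weight, bias)
print(pooled)
```

Because of the tanh, every component of the pooled vector lies strictly between -1 and 1, and the output has one value per hidden dimension.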

In [13]:
len(pooler_output[0])

768

`hidden_states` is a tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer).

shape of each layer: `(batch_size, sequence_length, hidden_size)`

In [14]:
len(hidden_states)

13

Look at the last layer, which is the same tensor as `last_hidden_state`:

In [15]:
hidden_states[-1][0]

tensor([[-0.1904, -0.2317, -0.4896,  ..., -0.4366,  0.5167,  0.5166],
        [ 0.7274,  0.4021, -0.4051,  ..., -0.9010,  1.4341,  0.2526],
        [-0.2553, -0.1450, -0.4257,  ..., -0.8199,  1.1043,  0.2408],
        ...,
        [ 0.1278,  0.1691, -0.8889,  ..., -0.6259,  0.1597,  0.8265],
        [-0.8941, -0.1419, -0.6008,  ..., -0.1861,  0.8240,  0.4556],
        [ 0.7600,  0.0256, -0.5482,  ...,  0.2768, -0.5651, -0.2150]],
       grad_fn=<SelectBackward>)

Look at the embedding-layer output (index 0) for the first wordpiece (skipping [CLS]):

In [16]:
hidden_states[0][0][1]  # layer 0 (embeddings), batch item 0, token 1

tensor([ 5.5931e-01, -2.9349e-01,  1.5142e-01, -6.6894e-01, -3.0138e-01,
        -8.1427e-01,  2.7488e-01, -2.3421e-01, -6.0991e-01, -5.4946e-01,
        -7.7725e-01, -1.2423e+00, -4.5414e-01,  4.8284e-01, -5.9605e-01,
        -7.7930e-01,  3.9271e-01, -8.5450e-02, -5.2417e-01,  3.6583e-01,
         5.5155e-01, -9.3273e-01,  6.2493e-01,  1.1886e+00,  8.0808e-01,
        -1.8684e-01, -9.5162e-01,  1.0439e+00, -5.9416e-01,  6.8520e-01,
        -3.5158e-01, -2.2820e-01,  6.8480e-02, -2.8891e-01,  2.8974e-01,
        -8.4148e-01, -1.7565e+00, -4.3860e-01, -2.6000e-01,  1.2688e-01,
         1.1053e+00,  7.0817e-01,  1.7711e-01,  8.7551e-02,  2.4324e-02,
         3.2612e-01,  1.2759e-01, -6.7901e-01,  4.2167e-01, -1.1967e+00,
        -4.2347e-01,  3.7393e-01, -8.0793e-01, -2.8902e-02, -7.6317e-01,
        -5.8712e-01,  7.4728e-02, -4.7688e-01,  4.4490e-01, -5.4128e-01,
        -7.2543e-02,  4.3697e-01, -3.0546e-02, -1.8312e-01,  7.0248e-01,
         3.3365e-01, -5.8335e-01,  1.0221e+00,  2.6
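
The index order for `hidden_states` is layer first, then batch item, then token. A toy structure makes this easy to check, assuming each hidden vector is replaced by a `(layer, token)` label; `toy_hidden_states` is a hypothetical stand-in built from plain tuples and lists, not real model output.

```python
NUM_LAYERS = 12  # 13 entries in hidden_states = embeddings + 12 layers
SEQ_LEN = 9      # wordpieces in the example sentence, including [CLS] and [SEP]

# One entry per layer; each entry is a batch of size 1 holding SEQ_LEN "vectors".
# Every vector is replaced by a (layer, token) label so indexing is visible.
toy_hidden_states = tuple(
    [[(layer, token) for token in range(SEQ_LEN)]]
    for layer in range(NUM_LAYERS + 1)
)

print(len(toy_hidden_states))        # number of entries: embeddings + layers
print(toy_hidden_states[0][0][1])    # embedding output, first wordpiece
print(toy_hidden_states[-1][0][1])   # last layer, first wordpiece
print(toy_hidden_states[0][-1][1])   # batch index -1 equals 0 when batch size is 1
```

With a batch of one sentence, a batch index of `-1` and `0` select the same item, so the first index alone decides which layer is inspected.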

`attentions` is a tuple with one tensor per layer, holding the attention weights after the attention softmax, which are used to compute the weighted average in the self-attention heads.

shape: `(batch_size, num_heads, sequence_length, sequence_length)`
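
To see why each head yields a `sequence_length x sequence_length` matrix, here is a minimal sketch of the weights for a single head, assuming the standard scaled dot-product formulation. The function names and the 2-dimensional toy queries and keys are assumptions for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """One row of weights per query token, one column per key token."""
    d_k = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights.append(softmax(scores))
    return weights


# Toy queries and keys (assumption): 3 tokens, head dimension 2.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights = attention_weights(queries, keys)
for row in weights:
    print([round(w, 3) for w in row], "row sum:", round(sum(row), 6))
```

Row `i` says how much token `i` attends to every token in the sequence, so each row sums to 1.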

In [17]:
attentions[0].shape

torch.Size([1, 12, 9, 9])

Select the single sentence in the batch, which leaves all 12 heads of the first layer:

In [18]:
attentions[0][0].shape

torch.Size([12, 9, 9])

Attention matrix of the last head in the first layer:

In [19]:
attentions[0][0][-1].shape

torch.Size([9, 9])

In [20]:
attentions[0][0][-1]

tensor([[8.9986e-01, 8.9166e-03, 3.4554e-03, 8.5585e-04, 2.9731e-03, 4.7226e-03,
         3.5215e-02, 4.7056e-03, 3.9298e-02],
        [1.2391e-01, 8.4317e-02, 3.2262e-01, 1.5241e-01, 4.7131e-02, 6.6522e-02,
         5.5258e-02, 2.4437e-02, 1.2340e-01],
        [2.3238e-03, 9.2049e-01, 2.0366e-03, 4.4564e-02, 5.7512e-03, 9.3530e-04,
         1.4572e-02, 6.2075e-03, 3.1225e-03],
        [2.4987e-01, 2.4486e-02, 2.8669e-01, 1.5228e-02, 2.8864e-02, 4.3187e-02,
         5.5294e-02, 8.2267e-02, 2.1411e-01],
        [7.1093e-02, 8.7373e-02, 4.1576e-01, 2.2153e-01, 8.9381e-03, 1.1020e-02,
         3.6452e-02, 2.6599e-02, 1.2124e-01],
        [2.0167e-01, 1.2511e-01, 1.4100e-01, 4.7089e-02, 4.2112e-02, 1.3234e-01,
         9.5376e-02, 1.6339e-01, 5.1919e-02],
        [2.1234e-01, 1.6746e-02, 1.0519e-01, 1.7828e-02, 6.2450e-02, 2.4318e-01,
         5.8937e-02, 1.3822e-01, 1.4510e-01],
        [2.9630e-01, 3.4920e-02, 5.3029e-02, 1.3975e-01, 1.0845e-02, 1.8033e-02,
         9.1013e-02, 8.4441e-0

# Visualize model

In [21]:
!pip3 install bertviz

Collecting bertviz
Collecting sentencepiece
Collecting boto3
Collecting botocore<1.21.0,>=1.20.62

In [22]:
from bertviz import model_view

In [23]:
def show_model_view(model, tokenizer, sentence_a, sentence_b=None, hide_delimiter_attn=False, display_mode="dark"):
    # Encode one sentence or a sentence pair; special tokens are added.
    inputs = tokenizer.encode_plus(sentence_a, sentence_b, return_tensors='pt', add_special_tokens=True)
    input_ids = inputs['input_ids']
    if sentence_b:
        token_type_ids = inputs['token_type_ids']
        attention = model(input_ids, token_type_ids=token_type_ids, output_attentions=True).attentions
        # The first position with segment id 1 is where sentence B starts.
        sentence_b_start = token_type_ids[0].tolist().index(1)
    else:
        # Request the attentions explicitly; model(input_ids)[-1] is not the attentions.
        attention = model(input_ids, output_attentions=True).attentions
        sentence_b_start = None
    input_id_list = input_ids[0].tolist()  # Batch index 0
    tokens = tokenizer.convert_ids_to_tokens(input_id_list)
    if hide_delimiter_attn:
        # Zero the attention to and from the [CLS] and [SEP] delimiters.
        for i, t in enumerate(tokens):
            if t in ("[SEP]", "[CLS]"):
                for layer_attn in attention:
                    layer_attn[0, :, i, :] = 0
                    layer_attn[0, :, :, i] = 0
    model_view(attention, tokens, sentence_b_start, display_mode=display_mode)
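
Two bookkeeping steps in the function above can be tried without a model: finding where sentence B starts, and hiding attention that involves the delimiters. The sketch below uses plain lists; the token list and segment ids are assumed values for the sentence pair "the cat sat on the mat" / "the cat lay on the rug", and the uniform attention matrix is a placeholder.

```python
# Assumed tokenization of the sentence pair, with special tokens.
tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]",
          "the", "cat", "lay", "on", "the", "rug", "[SEP]"]
# Segment ids: 0 for [CLS] + sentence A + [SEP], 1 for sentence B + [SEP].
token_type_ids = [0] * 8 + [1] * 7

# The first position with segment id 1 marks the start of sentence B.
sentence_b_start = token_type_ids.index(1)
print(sentence_b_start, tokens[sentence_b_start])

# Placeholder attention for one head: every token attends uniformly.
n = len(tokens)
attn = [[1.0 / n] * n for _ in range(n)]

# Zero the row (attention from) and column (attention to) of each delimiter.
for i, t in enumerate(tokens):
    if t in ("[SEP]", "[CLS]"):
        attn[i] = [0.0] * n
        for row in attn:
            row[i] = 0.0

print(round(sum(attn[1]), 4))  # remaining mass in a non-delimiter row
```

After masking, the rows no longer sum to 1; the zeroing only serves the visualization, so that the usually heavy attention on delimiters does not drown out the rest.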

In [24]:
sentence_a = "The cat sat on the mat"
sentence_b = "The cat lay on the rug"
show_model_view(model, tokenizer, sentence_a, sentence_b, hide_delimiter_attn=False, display_mode="light")

In [25]:
tokenizer_zh = AutoTokenizer.from_pretrained('hfl/chinese-electra-180g-small-discriminator')
model_zh = AutoModel.from_pretrained('hfl/chinese-electra-180g-small-discriminator') 

In [26]:
wps_ids = tokenizer_zh.encode("生活的真谛是爱")
wordpieces = tokenizer_zh.convert_ids_to_tokens(wps_ids)

In [27]:
wordpieces

['[CLS]', '生', '活', '的', '真', '谛', '是', '爱', '[SEP]']

In [28]:
import torch
wps_tensor = torch.tensor([wps_ids])

In [29]:
outputs = model_zh(wps_tensor, output_hidden_states=True, output_attentions=True)

In [30]:
last_hidden_state = outputs.last_hidden_state
# pooler_output = outputs.pooler_output  # not available: the ELECTRA model output has no pooler_output
hidden_states = outputs.hidden_states
attentions = outputs.attentions

In [31]:
sentence_a_zh = "那只猫坐在席子上"
sentence_b_zh = "但有时候它喜欢躺在地毯上"
show_model_view(model_zh, tokenizer_zh, sentence_a_zh, sentence_b_zh, hide_delimiter_attn=True, display_mode="light")

# Explore similarity

Consider the pooler output for two sentences and compute their cosine similarity:

In [32]:
def pooler_similarity(sentence_a, sentence_b):
    tokens_a = tokenizer.encode_plus(sentence_a, return_tensors='pt', add_special_tokens=True)
    tokens_b = tokenizer.encode_plus(sentence_b, return_tensors='pt', add_special_tokens=True)
    outputs_a = model(tokens_a['input_ids'])
    outputs_b = model(tokens_b['input_ids'])
    return torch.cosine_similarity(outputs_a.pooler_output[0], outputs_b.pooler_output[0], dim=0)
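
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The plain-Python sketch below spells that out; `cosine_similarity` here is a hypothetical helper for illustration and, unlike a production implementation, does not guard against zero-length vectors.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # identical vectors
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))            # same direction, different length
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal vectors
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))           # opposite directions
```

The value depends only on the angle between the vectors, not on their lengths: it is 1 for the same direction, 0 for orthogonal vectors and -1 for opposite ones. Cosine distance is usually defined as 1 minus this value.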

In [33]:
pooler_similarity(sentence_a, sentence_b)

tensor(0.9177, grad_fn=<DivBackward0>)

In [34]:
pooler_similarity(sentence_a, sentence_a)

tensor(1., grad_fn=<DivBackward0>)

In [35]:
pooler_similarity("Who is Boris Johnson?", "The British prime minister.")

tensor(0.9899, grad_fn=<DivBackward0>)

In [36]:
pooler_similarity("Who is Boris Johnson?", "I don't know")

tensor(0.5651, grad_fn=<DivBackward0>)

However, adding a period to the second sentence changes the score substantially:

In [37]:
pooler_similarity("Who is Boris Johnson?", "I don't know.")

tensor(0.9862, grad_fn=<DivBackward0>)