<a href="https://colab.research.google.com/github/muhajirakbarhsb/NLP_class_2023/blob/main/transformer_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.35.0-py3-none-any.whl (7.9 MB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.19.0-py3-none-any.whl (311 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Co

In [3]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

In [4]:
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [6]:
model = AutoModelForSequenceClassification.from_pretrained(model_name)

In [7]:

### 1. Tokenization
# Tokenizer documentation: https://huggingface.co/docs/transformers/main_classes/tokenizer

text = 'I believe that the EU is trustworthy.'
print(f"Input text: '{text}'\n")

Input text: 'I believe that the EU is trustworthy.'



In [8]:

input_ids = tokenizer(text, truncation=True, return_tensors="pt")["input_ids"]
print(f"""The tokenizer splits the text string into separate tokens. A token is either an entire word,
a punctuation mark, or a 'sub-word unit' in the case of rare words.
The word 'trustworthy', for example, is split into two tokens: {tokenizer.tokenize("Trustworthy")}.
The main advantage of these sub-word units is that rare words are almost never out-of-vocabulary (an issue for other text-as-data approaches).
Transformer models typically have a vocabulary of around 30,000 to 250,000 tokens, learned from the training data.
For example, here is the vocabulary of DistilBERT: https://huggingface.co/distilbert-base-uncased/resolve/main/vocab.txt\n""")

The tokenizer splits the text string into separate tokens. A token is either an entire word,
a punctuation mark, or a 'sub-word unit' in the case of rare words.
The word 'trustworthy', for example, is split into two tokens: ['trust', '##worthy'].
The main advantage of these sub-word units is that rare words are almost never out-of-vocabulary (an issue for other text-as-data approaches).
Transformer models typically have a vocabulary of around 30,000 to 250,000 tokens, learned from the training data.
For example, here is the vocabulary of DistilBERT: https://huggingface.co/distilbert-base-uncased/resolve/main/vocab.txt
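
The sub-word splitting described above can be illustrated with a small greedy longest-match sketch. This is a simplified stand-in for WordPiece, not the tokenizer's actual implementation: the function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical, and the real tokenizer also lowercases and normalizes text first.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word (toy sketch)."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no known piece fits: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary; the real DistilBERT vocabulary has 30,522 entries.
toy_vocab = {"trust", "##worthy", "believe", "##s"}

print(wordpiece_tokenize("trustworthy", toy_vocab))  # ['trust', '##worthy']
print(wordpiece_tokenize("believes", toy_vocab))     # ['believe', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))          # ['[UNK]']
```

The last call shows why sub-word units make out-of-vocabulary cases rare rather than impossible: a word only falls back to the unknown token when no combination of known pieces covers it.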



In [9]:
input_ids[0].tolist()

[101, 1045, 2903, 2008, 1996, 7327, 2003, 3404, 13966, 1012, 102]

In [10]:
tokenizer(text, truncation=True, return_tensors="pt")

{'input_ids': tensor([[  101,  1045,  2903,  2008,  1996,  7327,  2003,  3404, 13966,  1012,
           102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
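
The `attention_mask` above is all ones because a single sentence needs no padding. A minimal sketch of how the mask would look for a batch of unequal lengths, assuming right-padding with pad ID 0 (the helper `pad_batch` is hypothetical, and the second sequence is built from IDs shown in this notebook):

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


long_ids = [101, 1045, 2903, 2008, 1996, 7327, 2003, 3404, 13966, 1012, 102]
short_ids = [101, 1045, 2903, 102]  # [CLS] i believe [SEP]

padded, mask = pad_batch([long_ids, short_ids])
print(padded[1])  # [101, 1045, 2903, 102, 0, 0, 0, 0, 0, 0, 0]
print(mask[1])    # [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
```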

In [11]:
tokenizer.tokenize(text)

['i', 'believe', 'that', 'the', 'eu', 'is', 'trust', '##worthy', '.']

In [12]:
print(f"The input text is split into the following tokens:\n{tokenizer.tokenize(text)}.")
print("The tokenizer then maps each token to the corresponding token ID in the model's vocabulary:")
print(input_ids[0].tolist()[1:-1])
print("Transformer models only understand these token IDs.\n")

The input text is split into the following tokens:
['i', 'believe', 'that', 'the', 'eu', 'is', 'trust', '##worthy', '.'].
The tokenizer then maps each token to the corresponding token ID in the model's vocabulary:
[1045, 2903, 2008, 1996, 7327, 2003, 3404, 13966, 1012]
Transformer models only understand these token IDs.
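
The token-to-ID mapping is essentially a dictionary lookup. The sketch below rebuilds it from the tokens and IDs printed above; `toy_vocab` is a hypothetical nine-entry excerpt, not the full vocabulary file.

```python
# Excerpt of the vocabulary, copied from the tokens and IDs printed above.
toy_vocab = {
    "i": 1045, "believe": 2903, "that": 2008, "the": 1996, "eu": 7327,
    "is": 2003, "trust": 3404, "##worthy": 13966, ".": 1012,
}
id_to_token = {token_id: token for token, token_id in toy_vocab.items()}

tokens = ["i", "believe", "that", "the", "eu", "is", "trust", "##worthy", "."]
token_ids = [toy_vocab[token] for token in tokens]       # tokens -> IDs
round_trip = [id_to_token[token_id] for token_id in token_ids]  # IDs -> tokens

print(token_ids)  # [1045, 2903, 2008, 1996, 7327, 2003, 3404, 13966, 1012]
print(round_trip == tokens)  # True
```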



In [13]:
print("""In addition, the tokenizer adds two special tokens:
 First, the [CLS] (classification) token is always added at the beginning.
        While individual tokens represent individual (sub)words, the [CLS] token represents the entire text.
        The [CLS] token "is used as the aggregate sequence representation for classification tasks" (Devlin et al. 2019: 4). Details: https://arxiv.org/pdf/1810.04805.pdf
 Second, the [SEP] token separates two texts. It is useful for tasks that require two text inputs, for example question answering.
        (With a single input text, as in our case, it simply marks the end of the text.)
\n""")

In addition, the tokenizer adds two special tokens:
 First, the [CLS] (classification) token is always added at the beginning.
        While individual tokens represent individual (sub)words, the [CLS] token represents the entire text.
        The [CLS] token "is used as the aggregate sequence representation for classification tasks" (Devlin et al. 2019: 4). Details: https://arxiv.org/pdf/1810.04805.pdf
 Second, the [SEP] token separates two texts. It is useful for tasks that require two text inputs, for example question answering.
        (With a single input text, as in our case, it simply marks the end of the text.)
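
As a rough illustration of where the special tokens go, the sketch below wraps one or two ID sequences in the BERT-style layout. The helper `add_special_tokens` is hypothetical; the IDs 101 and 102 are the first and last entries of the `input_ids` printed earlier.

```python
CLS_ID = 101  # [CLS]
SEP_ID = 102  # [SEP]


def add_special_tokens(ids_a, ids_b=None):
    """Single text: [CLS] A [SEP]. Text pair: [CLS] A [SEP] B [SEP]."""
    ids = [CLS_ID] + ids_a + [SEP_ID]
    if ids_b is not None:
        ids += ids_b + [SEP_ID]
    return ids


single = add_special_tokens([1045, 2903])          # "i believe"
pair = add_special_tokens([1045, 2903], [7327])    # "i believe" + "eu"

print(single)  # [101, 1045, 2903, 102]
print(pair)    # [101, 1045, 2903, 102, 7327, 102]
```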




In [14]:
print("""The final input for a BERT transformer model therefore looks like this:""")
token_strings = tokenizer.convert_ids_to_tokens(ids=input_ids[0])
#token_strings = tokenizer.tokenize(text)
for token_id, token_string in zip(input_ids[0].tolist(), token_strings):
  print(token_id, " == ", token_string)


# entire vocabulary: tokenizer.pretrained_vocab_files_map["vocab_file"]["distilbert-base-uncased"]
# https://huggingface.co/distilbert-base-uncased/resolve/main/vocab.txt

The final input for a BERT transformer model therefore looks like this:
101  ==  [CLS]
1045  ==  i
2903  ==  believe
2008  ==  that
1996  ==  the
7327  ==  eu
2003  ==  is
3404  ==  trust
13966  ==  ##worthy
1012  ==  .
102  ==  [SEP]


In [15]:
input_ids[0]

tensor([  101,  1045,  2903,  2008,  1996,  7327,  2003,  3404, 13966,  1012,
          102])

In [16]:
token_strings = tokenizer.convert_ids_to_tokens(ids=input_ids[0])
token_strings

['[CLS]',
 'i',
 'believe',
 'that',
 'the',
 'eu',
 'is',
 'trust',
 '##worthy',
 '.',
 '[SEP]']

In [17]:
for a,b in zip(input_ids[0].tolist(), token_strings):
    print(a,b)

101 [CLS]
1045 i
2903 believe
2008 that
1996 the
7327 eu
2003 is
3404 trust
13966 ##worthy
1012 .
102 [SEP]


In [18]:
text1 = "I learn python programming"
text2 = "the elephant is bitten by python snake"
#text1 = "I cook dinner everyday"
#text2 = "Mr cook went to sydney"
text3 = "I study python programming"
inp1 = tokenizer(text1, truncation=True, return_tensors="pt")["input_ids"]
inp2 = tokenizer(text2, truncation=True, return_tensors="pt")["input_ids"]
inp3 = tokenizer(text3, truncation=True, return_tensors="pt")["input_ids"]

In [None]:
tokenizer.tokenize(text1)

['i', 'learn', 'python', 'programming']

In [19]:
inp1

tensor([[  101,  1045,  4553, 18750,  4730,   102]])

In [20]:
inp2

tensor([[  101,  1996, 10777,  2003, 19026,  2011, 18750,  7488,   102]])

In [22]:
inp3

tensor([[  101,  1045,  2817, 18750,  4730,   102]])
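
The cells that follow index the 'python' token by hand (position 3 in `text1`, position 6 in `text2`). A small sketch of looking the position up instead, using the ID lists printed above; `token_position` is a hypothetical helper and 18750 is the ID that appears in all three sentences.

```python
PYTHON_ID = 18750  # ID shared by all three sentences above

inp1_ids = [101, 1045, 4553, 18750, 4730, 102]
inp2_ids = [101, 1996, 10777, 2003, 19026, 2011, 18750, 7488, 102]
inp3_ids = [101, 1045, 2817, 18750, 4730, 102]


def token_position(ids, token_id):
    """Index of the first occurrence of token_id (raises ValueError if absent)."""
    return ids.index(token_id)


positions = [token_position(ids, PYTHON_ID) for ids in (inp1_ids, inp2_ids, inp3_ids)]
print(positions)  # [3, 6, 3]
```

With the real tensors, the same lookup could be applied to `inp1[0].tolist()`.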

In [23]:
model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [24]:
model.distilbert.transformer.layer[0].attention

MultiHeadSelfAttention(
  (dropout): Dropout(p=0.1, inplace=False)
  (q_lin): Linear(in_features=768, out_features=768, bias=True)
  (k_lin): Linear(in_features=768, out_features=768, bias=True)
  (v_lin): Linear(in_features=768, out_features=768, bias=True)
  (out_lin): Linear(in_features=768, out_features=768, bias=True)
)

In [25]:
# Output of the embedding block: word + position embeddings, followed by LayerNorm and dropout
e1 = model.distilbert.embeddings(inp1)
e2 = model.distilbert.embeddings(inp2)
e3 = model.distilbert.embeddings(inp3)

In [26]:
e1[0,3][:20]  # first 20 dimensions of 'python' (index 3 in text1)

tensor([ 0.4765,  0.0791, -0.5994,  0.1990, -1.1323,  1.2059, -0.6152,  0.1050,
        -0.6633,  0.7164,  0.8947,  0.2140, -0.9706, -0.0527, -0.5499, -1.1334,
         0.3393,  0.1940, -0.4095, -0.6861], grad_fn=<SliceBackward0>)

In [27]:
e2[0,6][:20]  # first 20 dimensions of 'python' (index 6 in text2)

tensor([ 0.6138, -0.0411, -0.5975,  0.3634, -1.0803,  1.2622, -0.8302,  0.3590,
        -0.6277,  0.6380,  0.3281,  0.4025, -1.0925,  0.0264, -0.8139, -1.0278,
         0.8216,  0.2391, -0.3296, -0.6926], grad_fn=<SliceBackward0>)
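
The two vectors above belong to the same token ('python') but are not identical, because the embedding block adds a position embedding before normalizing. The sketch below mimics that with made-up 4-dimensional vectors. It is a simplification: `word_emb`, `pos_emb`, and `embed` are hypothetical, and the learned scale and bias of the real LayerNorm are left out.

```python
import math


def layer_norm(vec, eps=1e-12):
    """Normalize to zero mean and unit variance (no learned scale or bias)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


# Made-up values for illustration; the real embeddings have 768 dimensions.
word_emb = {"python": [0.5, -1.0, 0.25, 1.5]}
pos_emb = {3: [0.1, 0.0, -0.2, 0.3], 6: [-0.3, 0.2, 0.1, 0.0]}


def embed(token, position):
    """Word embedding + position embedding, then layer normalization."""
    summed = [w + p for w, p in zip(word_emb[token], pos_emb[position])]
    return layer_norm(summed)


at_3 = embed("python", 3)
at_6 = embed("python", 6)
print(at_3 != at_6)  # True: same word, different position, different vector
```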

In [28]:
out1 = model(inp1, output_hidden_states=True, output_attentions=False, return_dict=True)
out2 = model(inp2, output_hidden_states=True, output_attentions=False, return_dict=True)
out3 = model(inp3, output_hidden_states=True, output_attentions=False, return_dict=True)

In [29]:
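# hidden_states[0] is the embedding output; hidden_states[1] to [6] are the six transformer blocks
layer = 6
cosi = torch.nn.CosineSimilarity(dim=0)
# compare 'python' in text1 (index 3) with 'python' in text2 (index 6)
output = cosi(out1.hidden_states[layer][0][3], out2.hidden_states[layer][0][6])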
output

tensor(0.4203, grad_fn=<SumBackward1>)
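
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction only and ranges from -1 to 1. A plain-Python sketch of the same quantity that `torch.nn.CosineSimilarity` computes (the function name is hypothetical, and the small epsilon guard used by PyTorch is omitted):

```python
import math


def cosine_similarity(a, b):
    """dot(a, b) / (|a| * |b|) for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```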

In [30]:
# Word embeddings only: a pure table lookup, without position embeddings or LayerNorm
we1 = model.distilbert.embeddings.word_embeddings(inp1)
we2 = model.distilbert.embeddings.word_embeddings(inp2)
we3 = model.distilbert.embeddings.word_embeddings(inp3)

In [31]:
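# Embedding-layer output (hidden state 0).
# Note: index 2 is 'learn' in text1 and 'elephant' in text2; 'python' sits at indices 3 and 6.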
cosi = torch.nn.CosineSimilarity(dim=0)
output = cosi(e1[0][2], e2[0][2])
output

tensor(0.0313, grad_fn=<SumBackward1>)

In [32]:
t1 = "I like python programming"
tt1 = tokenizer(t1, return_tensors='pt')['input_ids']

In [33]:
tt1

tensor([[  101,  1045,  2066, 18750,  4730,   102]])

In [34]:
t2 = "python bites cats"
tt2 = tokenizer(t2, return_tensors='pt')['input_ids']

In [36]:
model.distilbert.embeddings

Embeddings(
  (word_embeddings): Embedding(30522, 768, padding_idx=0)
  (position_embeddings): Embedding(512, 768)
  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.1, inplace=False)
)

In [37]:
text1 = "I cook dinner everyday"
text2 = "Mr cook went to sydney"
#text1 = "I learn python programming"
#text2 = "the elephant is bitten by python snake"

inp1 = tokenizer(text1, truncation=True, return_tensors="pt")["input_ids"]
inp2 = tokenizer(text2, truncation=True, return_tensors="pt")["input_ids"]


In [38]:
inp1

tensor([[  101,  1045,  5660,  4596, 10126,   102]])

In [39]:
inp2

tensor([[ 101, 2720, 5660, 2253, 2000, 3994,  102]])

In [40]:
m1 = model(inp1, output_hidden_states=True, return_dict=True)
m2 = model(inp2, output_hidden_states=True, return_dict=True)

In [41]:
m1.hidden_states[0][0][2][:20]

tensor([-7.9772e-04, -1.3976e-01, -1.4937e-01, -3.5096e-01,  7.5117e-01,
        -7.1433e-01,  8.2337e-01, -7.5590e-01,  3.0664e-01, -1.1051e+00,
         4.1999e-01, -2.4543e-01,  3.9497e-01, -7.2123e-01,  5.4743e-01,
        -4.5326e-01,  5.1976e-02,  4.8805e-01, -1.1667e+00, -1.4705e-01],
       grad_fn=<SliceBackward0>)

In [42]:
m2.hidden_states[0][0][2][:20]

tensor([-7.9772e-04, -1.3976e-01, -1.4937e-01, -3.5096e-01,  7.5117e-01,
        -7.1433e-01,  8.2337e-01, -7.5590e-01,  3.0664e-01, -1.1051e+00,
         4.1999e-01, -2.4543e-01,  3.9497e-01, -7.2123e-01,  5.4743e-01,
        -4.5326e-01,  5.1976e-02,  4.8805e-01, -1.1667e+00, -1.4705e-01],
       grad_fn=<SliceBackward0>)
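
The two outputs above are identical because `hidden_states[0]` is the embedding output: both sentences have the same token ID (5660, 'cook') at the same position (2), and no context has been mixed in yet. Context enters through self-attention. The sketch below is a heavily simplified, single-head version without the learned query, key, and value projections; all vectors are made up and `self_attend` is a hypothetical helper.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attend(vectors, index):
    """Context vector for one token: attention-weighted average of all tokens."""
    query = vectors[index]
    scores = [sum(q * k for q, k in zip(query, key)) for key in vectors]
    weights = softmax(scores)
    dim = len(query)
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


cook = [1.0, 0.0, 1.0]  # same input vector in both sentences
sentence_a = [[0.2, 0.9, 0.1], cook, [0.8, 0.1, 0.7]]
sentence_b = [[0.9, 0.1, 0.0], cook, [0.0, 0.3, 0.9]]

ctx_a = self_attend(sentence_a, 1)
ctx_b = self_attend(sentence_b, 1)
print(sentence_a[1] == sentence_b[1])  # True: identical before attention
print(ctx_a != ctx_b)                  # True: different after mixing in the neighbours
```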

In [54]:
m1.hidden_states[layer][0][2]

tensor([ 6.0160e-01, -6.1138e-02, -4.8193e-02, -2.9084e-01,  1.1344e+00,
        -7.3096e-01,  3.9375e-01, -8.7785e-01,  2.5876e-01, -8.6359e-01,
         1.7759e-01,  8.2663e-02,  5.7305e-01, -6.6176e-01,  6.1738e-01,
        -5.1738e-01,  7.8323e-02, -1.4626e-01, -1.5110e+00, -5.3880e-01,
         4.5177e-01, -1.8879e-01,  1.5207e-01,  3.7827e-02,  4.7544e-01,
         6.9694e-01,  9.7524e-02,  5.7166e-01,  2.3009e-01, -8.7522e-01,
        -8.8629e-01,  5.8747e-01, -9.4186e-01, -3.0890e-01, -8.1571e-02,
        -3.5211e-01, -6.9077e-01,  1.1259e+00,  3.5043e-01,  2.5840e-01,
         7.9755e-01, -8.1070e-01,  1.7915e+00,  6.4750e-03,  1.0273e+00,
         1.1934e+00,  8.3038e-02,  1.6825e+00, -6.4924e-02,  2.9854e-01,
        -1.3540e-01, -6.4556e-02,  3.4877e-01,  4.1157e-01, -4.6068e-01,
        -3.3634e-01, -7.9444e-01, -6.8038e-01, -1.4153e-01, -1.5497e+00,
        -1.9911e-01, -2.7106e-01,  5.6309e-01, -2.6540e-01,  3.0148e-02,
        -2.5668e-01, -1.1686e-01, -1.9066e-01, -1.9

In [43]:
layer=1
output = cosi(m1.hidden_states[layer][0][2], m2.hidden_states[layer][0][2])
output

tensor(0.8369, grad_fn=<SumBackward1>)

In [56]:
layer=6
output = cosi(m1.hidden_states[layer][0][2], m2.hidden_states[layer][0][2])
output

tensor(0.6863, grad_fn=<SumBackward1>)
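
The similarity of the two 'cook' vectors drops from layer 1 to layer 6 as more context is mixed in. Rather than changing `layer` by hand, the comparison can be looped over every layer. The sketch below uses made-up nested lists in place of the real hidden states; `similarity_by_layer` is a hypothetical helper, and feeding it the real model output would require converting each layer's tensor to a list first, which is an assumption not shown in the notebook.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def similarity_by_layer(states_a, states_b, index_a, index_b):
    """states_*: one [seq_len][dim] nested list per layer, embedding output first."""
    return [
        cosine(layer_a[index_a], layer_b[index_b])
        for layer_a, layer_b in zip(states_a, states_b)
    ]


# Toy data: 3 "layers", 2 tokens, 2 dimensions. The token at index 1 starts
# out identical in both sentences and drifts apart in later layers.
states_a = [
    [[0.0, 1.0], [1.0, 0.0]],
    [[0.0, 1.0], [1.0, 0.5]],
    [[0.0, 1.0], [1.0, 1.0]],
]
states_b = [
    [[1.0, 1.0], [1.0, 0.0]],
    [[1.0, 1.0], [1.0, -0.5]],
    [[1.0, 1.0], [1.0, -1.0]],
]

sims = similarity_by_layer(states_a, states_b, 1, 1)
for layer_index, sim in enumerate(sims):
    print(layer_index, round(sim, 4))
```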
