
Convert sentences to tokens, and tokens to token IDs wrapped in tensor objects, so that they can be passed to the BERT model.


In [None]:
from transformers import BertModel, AutoTokenizer
from scipy.spatial.distance import cosine
import matplotlib.pyplot as plt


In [None]:
model_name = "bert-base-cased"

In [None]:
model = BertModel.from_pretrained(model_name)
print(model)

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(28996, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
print(tokenizer)

BertTokenizerFast(name_or_path='bert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}


In [None]:
text = "God is great"

Convert the sentence to tokens and token IDs, returned as tensor objects.

In [None]:
input_id = tokenizer(text, return_tensors='pt')
print(input_id)

{'input_ids': tensor([[ 101, 1875, 1110, 1632,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
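To read the encoding above, it can help to separate the special-token ids from the word ids. The following is a minimal sketch in plain Python that reuses the ids printed above and the special-token ids listed in the tokenizer printout. The names `SPECIAL_TOKENS` and `describe` are illustrative only, and the word labels assume that each of the three words maps to exactly one id, which the length of the encoding suggests.

```python
# Ids copied from the tokenizer output above for "God is great".
input_ids = [101, 1875, 1110, 1632, 102]

# Special-token ids as listed in the tokenizer printout (added_tokens_decoder).
SPECIAL_TOKENS = {0: "[PAD]", 100: "[UNK]", 101: "[CLS]", 102: "[SEP]", 103: "[MASK]"}


def describe(ids, words):
    """Pair each id with a label: a special-token name or the next word.

    Assumes one id per word (no sub-word splitting), which fits this
    example because 5 ids = 3 words + [CLS] + [SEP].
    """
    remaining = iter(words)
    return [
        (i, SPECIAL_TOKENS[i]) if i in SPECIAL_TOKENS else (i, next(remaining))
        for i in ids
    ]


labelled = describe(input_ids, "God is great".split())
for token_id, label in labelled:
    print(token_id, label)
```

With the real tokenizer, `tokenizer.convert_ids_to_tokens` performs this lookup against the full vocabulary, so no assumption about word boundaries is needed.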


The output has two parts: the last hidden state and the pooler output.


In [None]:
output = model(**input_id)
print(output)

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.4866, -0.1328,  0.2512,  ...,  0.0591,  0.4124, -0.1841],
         [ 0.4726, -0.3246, -0.0368,  ...,  0.7243,  0.3974, -0.0467],
         [ 0.2791, -0.3287,  0.3107,  ...,  0.3419,  0.3191, -0.1028],
         [ 0.5828, -0.2020,  0.2336,  ..., -0.0205,  0.1188,  0.1484],
         [ 1.0275,  0.0753,  0.1596,  ...,  0.1815,  0.6824, -0.7039]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-6.8822e-01,  4.9751e-01,  9.9984e-01, -9.8988e-01,  9.3112e-01,
          9.1341e-01,  9.6476e-01, -9.9421e-01, -9.3240e-01, -2.5851e-01,
          9.3501e-01,  9.9816e-01, -9.9889e-01, -9.9960e-01,  8.0194e-01,
         -9.4679e-01,  9.7670e-01, -5.4133e-01, -9.9995e-01, -9.2792e-01,
         -5.6978e-01, -9.9982e-01,  2.7821e-01,  9.7599e-01,  9.5293e-01,
          5.2357e-02,  9.8457e-01,  9.9995e-01,  7.3948e-01, -2.9107e-02,
          2.6309e-01, -9.8758e-01,  8.9891e-01, -9.9844e-01,  2.0297e-01,
    

Last hidden state: the final output of the model's main layers for each input token, i.e. the numerical representation of each token.

In [None]:
last_hidden_state = output.last_hidden_state

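Each token is represented by a vector of size 768 (the hidden width of the model). The shape is (batch size, number of tokens, hidden size); the 5 tokens are the 3 word tokens plus `[CLS]` and `[SEP]`.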

In [None]:
last_hidden_state.shape

torch.Size([1, 5, 768])

Pooler output: a tensor that represents the entire sentence rather than the individual tokens. In `BertModel` it is computed from the final hidden state of the `[CLS]` token, passed through a linear layer and a tanh activation, and it is typically used as the input to a classification head.


In [None]:
pooler_output = output.pooler_output

The first dimension is 1 because we passed a single sentence.

In [None]:
pooler_output.shape

torch.Size([1, 768])
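The pooler step can be illustrated with a small sketch. BERT's pooler takes the hidden state of the first token (`[CLS]`), applies a dense layer, and then a tanh, which is why every value in the printed `pooler_output` lies between -1 and 1. The code below is a plain-Python toy version with a hidden size of 3; the function name `pooler` and all weights and inputs are hypothetical values chosen for illustration, not the model's parameters.

```python
import math


def pooler(cls_hidden, weight, bias):
    """Compute tanh(W @ h + b) for one [CLS] hidden-state vector."""
    return [
        math.tanh(sum(w * h for w, h in zip(row, cls_hidden)) + b)
        for row, b in zip(weight, bias)
    ]


# Hypothetical toy values with hidden size 3 (the real model uses 768).
cls_hidden = [0.5, -1.0, 2.0]
weight = [
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.5, 0.5, 0.5],
]
bias = [0.0, 0.1, -0.2]

pooled = pooler(cls_hidden, weight, bias)
print(pooled)  # one value per hidden unit, each strictly between -1 and 1
```

Because the pooled vector depends only on the `[CLS]` position, its shape has no token dimension: one row per sentence in the batch.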

Let's take two sentences that use the same word with different meanings and compare its embeddings.

In [None]:
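def predict(text):
    # Tokenize the text and return the last hidden state (element 0 of the model output).
    inputs = tokenizer(text, return_tensors="pt")  # renamed to avoid shadowing the built-in input()
    return model(**inputs)[0]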

In [None]:
sent1 = "There was a fly on a coffee cup"
sent2 = "To become a pilot, he need to fly for 500 hours"

In [None]:
token1 = tokenizer.tokenize(sent1)
print(token1)


['There', 'was', 'a', 'fly', 'on', 'a', 'coffee', 'cup']


In [None]:
token2 = tokenizer.tokenize(sent2)
print(token2)

['To', 'become', 'a', 'pilot', ',', 'he', 'need', 'to', 'fly', 'for', '500', 'hours']


In [None]:
out1 = predict(sent1)
print(out1)

tensor([[[ 0.4762,  0.0516,  0.1692,  ..., -0.2786,  0.3217, -0.2870],
         [ 0.1351, -0.3264, -0.1947,  ...,  0.0986,  0.5370, -0.1072],
         [ 0.0652, -0.2587,  0.6802,  ...,  0.1937,  0.4434, -0.1388],
         ...,
         [ 0.7731, -0.1080, -0.0015,  ...,  0.0657, -0.4664,  0.3786],
         [ 0.2755,  0.0598, -0.1425,  ..., -0.5723, -0.1478, -0.0387],
         [ 1.3301,  0.3034,  0.3484,  ..., -0.6476,  0.6884, -0.2950]]],
       grad_fn=<NativeLayerNormBackward0>)


In [None]:
out2 = predict(sent2)
print(out2)

tensor([[[ 0.1934, -0.0088, -0.0990,  ..., -0.2259,  0.1217,  0.3573],
         [ 0.6226, -0.4217,  0.4586,  ...,  0.0511,  0.2413, -0.1141],
         [ 0.6629,  0.0174,  0.0491,  ...,  0.1873, -0.2785,  0.4566],
         ...,
         [ 0.4879,  1.0135,  0.2761,  ...,  0.0410, -0.3622,  0.6522],
         [-0.0250,  0.0829,  0.3396,  ..., -0.9074, -0.0831,  0.2675],
         [ 0.0339,  0.0400,  0.0243,  ...,  0.2462,  0.3739,  0.1165]]],
       grad_fn=<NativeLayerNormBackward0>)


Extract the embedding of the word "fly" from each of the two sentences.

In [None]:
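# Caution: tokenizer.tokenize() omits [CLS], but the model output has [CLS] at position 0,
# so token1.index("fly") selects the row just before "fly"; add 1 to select "fly" itself.
emb1 = out1[0:, token1.index("fly"), :].detach()
print(emb1)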

tensor([[ 1.0528e-01, -8.1940e-02,  6.2116e-02,  3.9492e-01,  4.7156e-01,
         -2.3815e-01,  3.0041e-01, -6.0466e-02, -3.2563e-01, -4.5995e-01,
         -3.1197e-01,  7.4247e-01, -4.7647e-01,  3.4849e-01, -2.3892e-01,
         -6.7528e-01, -2.1642e-02, -2.5098e-01, -9.0584e-02, -3.1111e-01,
         -2.1276e-02, -6.6329e-03, -5.4030e-01,  6.0799e-01,  3.1412e-01,
         -3.2511e-01,  3.5695e-02,  6.1401e-01,  3.2594e-02,  7.0858e-02,
         -5.7769e-01, -2.9438e-01,  3.3812e-01,  2.2628e-01,  8.4977e-02,
          2.4271e-01,  1.3990e-01,  2.8993e-01, -2.2161e-01, -1.8374e-01,
          1.9298e-01,  1.4201e-01, -2.9457e-01,  3.4267e-01,  2.0725e-01,
         -5.2862e-01,  4.0856e-01, -1.2432e-01, -3.5990e-01, -8.7775e-02,
         -1.2985e-01, -9.6557e-02, -3.8949e-01,  3.8721e-01, -9.5993e-02,
         -2.0809e-01,  3.2420e-01, -1.2602e-01,  3.1916e-01,  9.2371e-01,
         -3.0648e-02,  6.9947e-01,  3.1878e-01,  5.9327e-01,  3.7188e-01,
          2.8216e-02,  1.0667e-02, -3.

In [None]:
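# Caution: tokenizer.tokenize() omits [CLS], but the model output has [CLS] at position 0,
# so token2.index("fly") selects the row just before "fly"; add 1 to select "fly" itself.
emb2 = out2[0:, token2.index("fly"), :].detach()
print(emb2)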

tensor([[ 1.6473e-01, -1.3419e-01,  2.6311e-01, -3.9736e-01,  3.7144e-01,
         -4.9248e-03,  7.8637e-01,  6.6749e-01, -3.9301e-02,  3.5459e-01,
         -6.6649e-02,  5.7849e-01, -3.2773e-01,  1.9210e-01, -1.4740e-01,
         -4.6983e-01,  5.3565e-01,  3.2377e-01, -3.3581e-01, -1.8866e-01,
          3.6379e-01,  4.0006e-01, -3.2869e-01,  4.0224e-01, -3.3111e-01,
         -5.0390e-01, -4.3711e-01,  1.3815e-01, -3.1496e-01,  3.5972e-01,
          5.0461e-03,  6.5821e-02,  3.0410e-01,  1.8149e-02, -3.0836e-01,
          3.8311e-03, -8.8698e-02, -5.6506e-02,  1.8122e-01,  1.1235e-03,
          3.2058e-01,  2.3894e-02, -8.2447e-01,  5.1725e-01,  5.8181e-01,
          1.2271e-01, -5.1138e-01, -5.4091e-02, -1.4248e-01, -1.6675e-01,
         -1.2193e-01, -3.5262e-01, -1.2342e-01,  5.9026e-01,  2.8599e-01,
          5.6710e-02, -6.1247e-01, -9.3673e-01, -1.4431e-01,  2.3883e-01,
          5.0458e-02, -2.3872e-01,  4.3078e-02,  4.7828e-01,  4.7153e-01,
          2.6911e-01,  2.4046e-01,  2.

In [None]:
emb1.shape

torch.Size([1, 768])

In [None]:
emb2.shape

torch.Size([1, 768])
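One detail worth checking when indexing into the model output: `tokenizer.tokenize` returns the tokens without special tokens, while the encoded input passed to the model starts with `[CLS]` and ends with `[SEP]`. The sketch below, in plain Python, reuses the token lists printed earlier to show how the two positions differ; `model_position` is a hypothetical helper introduced only for this illustration.

```python
# Token lists copied from the tokenizer.tokenize() outputs above.
token1 = ['There', 'was', 'a', 'fly', 'on', 'a', 'coffee', 'cup']
token2 = ['To', 'become', 'a', 'pilot', ',', 'he', 'need', 'to', 'fly', 'for', '500', 'hours']


def model_position(tokens, word):
    """Position of `word` in the model input, which is [CLS] + tokens + [SEP]."""
    model_tokens = ["[CLS]"] + tokens + ["[SEP]"]
    return model_tokens.index(word)


for tokens in (token1, token2):
    # First number: index in the tokenize() list; second: row in the model output.
    print(tokens.index("fly"), model_position(tokens, "fly"))
# prints:
# 3 4
# 8 9
```

Under this reading, the row for "fly" in the first sentence would be `out1[0, model_position(token1, "fly")]`, one position to the right of `token1.index("fly")`.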

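Compare the embeddings using the cosine distance metric. `scipy.spatial.distance.cosine` returns the cosine distance, i.e. 1 minus the cosine similarity, so 0 means the vectors point in the same direction.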

In [None]:
# Flatten from shape (1, 768) to (768,), since cosine() expects 1-D vectors.
emb1 = emb1.reshape(-1)
emb2 = emb2.reshape(-1)

In [None]:
# cosine() returns the cosine distance (1 - cosine similarity), not the similarity itself.
distance = cosine(emb1, emb2)
print(distance)

0.41148829460144043
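Since `scipy.spatial.distance.cosine` reports a distance rather than a similarity, the number above is easier to interpret once converted. The following plain-Python sketch spells out the formula and converts the printed value; `cosine_distance` is an illustrative re-implementation for small lists, not the SciPy function.

```python
import math


def cosine_distance(u, v):
    """Return 1 - (u . v) / (|u| * |v|), the quantity scipy's cosine() reports."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors: distance 1.0
print(cosine_distance([1.0, 2.0], [2.0, 4.0]))  # same direction: distance close to 0

reported_distance = 0.41148829460144043  # value printed above
similarity = 1.0 - reported_distance
print(round(similarity, 4))  # 0.5885
```

A distance near 0 would mean the two vectors point in almost the same direction, while a distance near 1 would mean they are close to orthogonal.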
