# Contextual BERT embeddings

In this notebook we create contextual BERT embeddings for each word/token in an input sentence.

An outline of this notebook is as follows:
* Setup
* Create contextual embeddings
* Compare results versus non-contextual embeddings
* Visualize results

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/gdrive


# 1. Setup

In [None]:
import pandas as pd
import numpy as np
import torch

Next, we install the transformers package from Hugging Face, which gives us a PyTorch interface for working with BERT. We've selected the PyTorch interface because it strikes a nice balance between the high-level APIs (which are easy to use but don't provide insight into how things work) and TensorFlow code (which contains lots of details but often sidetracks us into lessons about TensorFlow, when the purpose here is BERT).



In [None]:
!pip install transformers



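Next, load the pre-trained BERT model and tokenizer. Passing `output_hidden_states=True` makes the model return the hidden states of every layer, which we need later to build the embeddings.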

In [None]:
from transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')




# 2. Create contextual embeddings

We have to put the input text into a specific format that BERT can read. Mainly, we add the `[CLS]` token to the beginning and the `[SEP]` token to the end of the input. Then we convert the tokenized input into tensors.

In [None]:
def bert_text_preparation(text, tokenizer):
  """
  Preprocesses text input in a way that BERT can interpret.
  """
  marked_text = "[CLS] " + text + " [SEP]"
  tokenized_text = tokenizer.tokenize(marked_text)
  indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
  segments_ids = [1]*len(indexed_tokens)

  # convert inputs to tensors
  tokens_tensor = torch.tensor([indexed_tokens])
  segments_tensor = torch.tensor([segments_ids])

  return tokenized_text, tokens_tensor, segments_tensor
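
To make the marking and id-lookup steps concrete, here is a minimal sketch with a toy vocabulary. `TOY_VOCAB` and `toy_prepare` are hypothetical names introduced only for illustration, and whitespace splitting stands in for BERT's WordPiece tokenizer, which behaves differently on real text.

```python
# Toy illustration only: TOY_VOCAB and toy_prepare are hypothetical,
# and real BERT tokenization uses WordPiece, not whitespace splitting.
TOY_VOCAB = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "the": 3, "river": 4, "bank": 5}

def toy_prepare(text, vocab):
    """Wrap text in [CLS]/[SEP], split on whitespace, and map tokens to ids."""
    marked_text = "[CLS] " + text + " [SEP]"
    tokens = marked_text.split()
    # unknown words fall back to the [UNK] id
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    return tokens, ids

tokens, ids = toy_prepare("the river bank", TOY_VOCAB)
print(tokens)  # ['[CLS]', 'the', 'river', 'bank', '[SEP]']
print(ids)     # [1, 3, 4, 5, 2]
```

The real function does the same two steps with the pre-trained tokenizer and then wraps the id list in a tensor with a batch dimension of 1.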

To obtain the actual BERT embeddings, we take the preprocessed input text, which is now represented by tensors, and pass it through our pre-trained BERT model.

Which vector works best as a contextualized embedding? It likely depends on the task. The original paper that proposed BERT examines six choices.

I go with one of the choices that worked well in their experiments: the sum of the last four layers of the model.

In [None]:
def get_bert_embeddings(tokens_tensor, segments_tensor, model):
    """
    Obtains BERT embeddings for tokens, in context of the given sentence.
    """
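    # gradient calculation is disabled (inference only)
    with torch.no_grad():
      # obtain hidden states
      # note: BertModel.forward takes attention_mask as its second positional
      # argument, so the all-ones segments_tensor acts as the attention mask
      # here and token_type_ids fall back to their default (all zeros)
      outputs = model(tokens_tensor, segments_tensor)
      # outputs[2] holds the hidden states: embedding output + 12 encoder layers
      hidden_states = outputs[2]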

    # concatenate the tensors for all layers
    # use "stack" to create new dimension in tensor
    token_embeddings = torch.stack(hidden_states, dim=0)

    # remove dimension 1, the "batches"
    token_embeddings = torch.squeeze(token_embeddings, dim=1)

    # swap dimensions 0 and 1 so we can loop over tokens
    token_embeddings = token_embeddings.permute(1,0,2)

    # initialize list to store embeddings
    token_vecs_sum = []

    # "token_embeddings" is a [Y x 13 x 768] tensor
    # where Y is the number of tokens in the sentence
    # (13 = embedding output + 12 encoder layers)

    # loop over tokens in sentence
    for token in token_embeddings:

        # "token" is a [13 x 768] tensor

        # sum the vectors from the last four layers
        sum_vec = torch.sum(token[-4:], dim=0)
        token_vecs_sum.append(sum_vec)

    return token_vecs_sum
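
The reduction itself is small enough to trace by hand. The sketch below uses plain Python lists instead of tensors so the indexing is easy to follow; `sum_last_layers` and the toy input are hypothetical and not part of the notebook's pipeline.

```python
def sum_last_layers(hidden_states, n_last=4):
    """hidden_states[layer][token] is a list of floats (one vector per token).

    Returns one vector per token: the element-wise sum over the last n_last layers.
    """
    last = hidden_states[-n_last:]
    n_tokens = len(last[0])
    return [
        # zip(*...) groups the same dimension across the selected layers
        [sum(values) for values in zip(*(layer[t] for layer in last))]
        for t in range(n_tokens)
    ]

# Hypothetical toy input: 13 "layers" (embedding output + 12 encoder layers),
# 2 tokens, 3 dimensions; every entry of layer k equals k.
toy_hidden_states = [[[float(k)] * 3 for _ in range(2)] for k in range(13)]
summed = sum_last_layers(toy_hidden_states)
print(summed)  # layers 9 + 10 + 11 + 12 = 42.0 in every position
```

With the actual tensors, the same reduction can presumably be written without a loop as `torch.stack(hidden_states)[-4:].sum(dim=0)`, which leaves a `[1 x Y x 768]` tensor that still carries the batch dimension.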

Now we can create contextual embeddings for a set of contexts.

In [None]:
sentences = ["bank",
         "he eventually sold the shares back to the bank at a premium.",
         "the bank strongly resisted cutting interest rates.",
         "the bank will supply and buy back foreign currency.",
         "the bank is pressing us for repayment of the loan.",
         "the bank left its lending rates unchanged.",
         "the river flowed over the bank.",
         "tall, luxuriant plants grew along the river bank.",
         "his soldiers were arrayed along the river bank.",
         "wild flowers adorned the river bank.",
         "two fox cubs romped playfully on the river bank.",
         "the jewels were kept in a bank vault.",
         "you can stow your jewellery away in the bank.",
         "most of the money was in storage in bank vaults.",
         "the diamonds are shut away in a bank vault somewhere.",
         "thieves broke into the bank vault.",
         "can I bank on your support?",
         "you can bank on him to hand you a reasonable bill for your services.",
         "don't bank on your friends to help you out of trouble.",
         "you can bank on me when you need money.",
         "i bank on your help."
         ]

In [None]:
# sentences = ["bank",
#          "he eventually sold the shares back to the bank at a premium.",
#          "the river flowed over the bank.",
#          "the jewels were kept in a bank vault.",
#          "can I bank on your support?",
#          ]

In [None]:
from collections import OrderedDict

context_embeddings = []
context_tokens = []

for sentence in sentences:
  tokenized_text, tokens_tensor, segments_tensors = bert_text_preparation(sentence, tokenizer)
  list_token_embeddings = get_bert_embeddings(tokens_tensor, segments_tensors, model)

  # make ordered dictionary to keep track of the position of each word
  tokens = OrderedDict()

  # loop over tokens in the sentence, skipping [CLS] and [SEP]
  for token in tokenized_text[1:-1]:
    # keep track of position of word and whether it occurs multiple times
    if token in tokens:
      tokens[token] += 1
    else:
      tokens[token] = 1

    # compute the position of the current token
    token_indices = [i for i, t in enumerate(tokenized_text) if t == token]
    current_index = token_indices[tokens[token]-1]

    # get the corresponding embedding
    token_vec = list_token_embeddings[current_index]

    # save values
    context_tokens.append(token)
    context_embeddings.append(token_vec)
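
The occurrence counting in the loop above exists to pick the right position when a token appears more than once in a sentence. The sketch below isolates that logic on a toy token list and compares it with a simpler `enumerate`-based lookup; both helper names are hypothetical, and the equivalence is only claimed for ordinary input where `[CLS]` and `[SEP]` appear once each at the ends.

```python
from collections import OrderedDict

def positions_by_counting(tokenized_text):
    """Mirror the loop above: resolve each token's index via its occurrence count."""
    seen = OrderedDict()
    positions = []
    for token in tokenized_text[1:-1]:
        seen[token] = seen.get(token, 0) + 1
        indices = [i for i, t in enumerate(tokenized_text) if t == token]
        positions.append((token, indices[seen[token] - 1]))
    return positions

def positions_by_enumerate(tokenized_text):
    """Simpler lookup: the k-th inner token sits at index k of the full list."""
    return [(tok, i) for i, tok in enumerate(tokenized_text[1:-1], start=1)]

toy_tokens = ["[CLS]", "the", "bank", "by", "the", "river", "bank", "[SEP]"]
print(positions_by_counting(toy_tokens))
# [('the', 1), ('bank', 2), ('by', 3), ('the', 4), ('river', 5), ('bank', 6)]
```

Both helpers return the same pairs here, so the second occurrence of "bank" maps to index 6 rather than index 2.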

# 3. Compare Results

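Now that we have the contextual embeddings for the word "bank", we can calculate its similarity with its polysemous siblings and with the embedding of "bank" on its own (the first entry in `sentences`), which stands in for a static embedding. Note that the `distance` column below actually holds cosine similarity (1 minus the cosine distance), so higher values mean the two uses are closer.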

In [None]:
from scipy.spatial.distance import cosine

# embeddings for the word 'bank'
token = 'bank'
indices = [i for i, t in enumerate(context_tokens) if t == token]
token_embeddings = [context_embeddings[i] for i in indices]

# compare 'bank' across the different contexts
# (the zip assumes 'bank' occurs exactly once in every sentence)
list_of_distances = []
for sentence_1, embed1 in zip(sentences, token_embeddings):
    for sentence_2, embed2 in zip(sentences, token_embeddings):
        # scipy's cosine() is a distance, so 1 - cosine() is the cosine similarity
        cos_dist = 1 - cosine(embed1, embed2)
        list_of_distances.append([sentence_1, sentence_2, cos_dist])

distances_df = pd.DataFrame(list_of_distances, columns=['sentence_1', 'sentence_2', 'distance'])
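
Since `1 - cosine(...)` is easy to misread, here is a minimal pure-Python sketch of the two quantities. The helper names are hypothetical; the notebook itself relies on `scipy.spatial.distance.cosine`, which returns the distance.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """One minus the similarity: 0 for same direction, 1 for orthogonal vectors."""
    return 1.0 - cosine_similarity(u, v)

a = [1.0, 0.0]
b = [0.0, 1.0]
c = [2.0, 0.0]
print(cosine_similarity(a, c))  # same direction, different length
print(cosine_similarity(a, b))  # orthogonal vectors
print(cosine_distance(a, b))
```

Vector length does not matter, only direction, which is why `a` and `c` count as identical under this measure.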

In [None]:
distances_df[distances_df.sentence_1 == "bank"]

| | sentence_1 | sentence_2 | distance |
|---|---|---|---|
| 0 | bank | bank | 1.0 |
| 1 | bank | he eventually sold the shares back to the bank... | 0.527946 |
| 2 | bank | the bank strongly resisted cutting interest ra... | 0.547514 |
| 3 | bank | the bank will supply and buy back foreign curr... | 0.542472 |
| 4 | bank | the bank is pressing us for repayment of the l... | 0.52758 |
| 5 | bank | the bank left its lending rates unchanged. | 0.552808 |
| 6 | bank | the river flowed over the bank. | 0.398155 |
| 7 | bank | tall, luxuriant plants grew along the river bank. | 0.356118 |
| 8 | bank | his soldiers were arrayed along the river bank. | 0.359705 |
| 9 | bank | wild flowers adorned the river bank. | 0.401465 |


In [None]:
distances_df[distances_df.sentence_1 == "he eventually sold the shares back to the bank at a premium."]

| | sentence_1 | sentence_2 | distance |
|---|---|---|---|
| 21 | he eventually sold the shares back to the bank... | bank | 0.527946 |
| 22 | he eventually sold the shares back to the bank... | he eventually sold the shares back to the bank... | 1.0 |
| 23 | he eventually sold the shares back to the bank... | the bank strongly resisted cutting interest ra... | 0.853815 |
| 24 | he eventually sold the shares back to the bank... | the bank will supply and buy back foreign curr... | 0.821346 |
| 25 | he eventually sold the shares back to the bank... | the bank is pressing us for repayment of the l... | 0.858945 |
| 26 | he eventually sold the shares back to the bank... | the bank left its lending rates unchanged. | 0.854651 |
| 27 | he eventually sold the shares back to the bank... | the river flowed over the bank. | 0.532426 |
| 28 | he eventually sold the shares back to the bank... | tall, luxuriant plants grew along the river bank. | 0.512192 |
| 29 | he eventually sold the shares back to the bank... | his soldiers were arrayed along the river bank. | 0.518633 |
| 30 | he eventually sold the shares back to the bank... | wild flowers adorned the river bank. | 0.525788 |
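
One way to summarize the first output table is to average the similarities per word sense. The sketch below copies rows 1 to 9 of that table by hand; the split into a "financial" and a "river" group is my reading of the sentences, not something the notebook computes.

```python
from statistics import mean

# Similarities between the isolated "bank" and "bank" in context,
# copied from rows 1-9 of the first output table above.
# The grouping by sense is an assumption based on reading the sentences.
financial = [0.527946, 0.547514, 0.542472, 0.52758, 0.552808]
river = [0.398155, 0.356118, 0.359705, 0.401465]

mean_financial = mean(financial)
mean_river = mean(river)
print(f"financial sense: {mean_financial:.3f}")
print(f"river sense:     {mean_river:.3f}")
```

On these nine rows the isolated token sits closer to the financial uses than to the river uses. With so few sentences this is suggestive rather than conclusive.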


# 4. Visualize

In [None]:
import os

filepath = os.path.join('./') #'drive/My Drive/'

In [None]:
name = 'metadata.tsv'

with open(os.path.join(filepath, name), 'w') as file_metadata:
  # one token per line, in the same order as the embeddings
  for token in context_tokens:
    file_metadata.write(token + '\n')

In [None]:
import csv

name = 'embeddings.tsv'

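# newline='' is what the csv module recommends, to avoid blank rows on Windows
with open(os.path.join(filepath, name), 'w', newline='') as tsvfile:
    writer = csv.writer(tsvfile, delimiter='\t')
    # one tab-separated row of values per token embedding
    for embedding in context_embeddings:
        writer.writerow(embedding.numpy())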
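
Before loading the two files into a viewer, it is worth checking that they line up row for row. The sketch below writes toy data in the same layout to a temporary directory and reads it back; `toy_tokens` and `toy_embeddings` are hypothetical stand-ins for `context_tokens` and `context_embeddings`. The two-file layout (vectors plus one label per line) matches what embedding viewers such as the TensorFlow Embedding Projector load, which may be the intended next step, though that is an assumption.

```python
import csv
import os
import tempfile

# Hypothetical stand-ins for context_tokens and context_embeddings.
toy_tokens = ["the", "river", "bank"]
toy_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

out_dir = tempfile.mkdtemp()
meta_path = os.path.join(out_dir, "metadata.tsv")
vec_path = os.path.join(out_dir, "embeddings.tsv")

# Write one label per line and one tab-separated vector per line.
with open(meta_path, "w", newline="") as f:
    for token in toy_tokens:
        f.write(token + "\n")

with open(vec_path, "w", newline="") as f:
    csv.writer(f, delimiter="\t", lineterminator="\n").writerows(toy_embeddings)

# Read both files back and check that they line up row for row.
with open(meta_path, newline="") as f:
    labels = f.read().splitlines()
with open(vec_path, newline="") as f:
    vectors = [[float(x) for x in row] for row in csv.reader(f, delimiter="\t")]

assert len(labels) == len(vectors)
print(len(labels), "rows,", len(vectors[0]), "dimensions")  # 3 rows, 2 dimensions
```

The same read-back check can be pointed at the real `metadata.tsv` and `embeddings.tsv` to confirm that every token has exactly one vector.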