
# Encoding text with BERT

In this notebook we will:

1.   Feed text into BERT to get encoded representations
2.   Explore how the output word embeddings are context-dependent.



In [None]:
# We'll use Huggingface's Transformers package
!pip install transformers
import transformers
from transformers import AutoModel, AutoTokenizer
import torch




**Preparing the input**

In [None]:
# This is the name of a pre-trained BERT model available online and downloadable using the transformers package
model_name = "bert-base-uncased"

# Use AutoModel to figure out which Transformer to use based on the name of the pretrained model
# Note: the first call downloads the pretrained weights, which is a substantial download.
# transformers caches the files locally, so later calls in the same environment reuse them.
model = AutoModel.from_pretrained(model_name)

# Get the tokenizer that will help in preprocessing
tokenizer = AutoTokenizer.from_pretrained(model_name)

The tokenizer breaks words down into smaller wordpieces. In the case of BERT, a wordpiece is prefixed with ## when it continues a word rather than starting one.

In [None]:
sentence = "Unaccountable words are handled with byte pair encoding."
tokenizer.tokenize(sentence)

['una',
 '##cco',
 '##unt',
 '##able',
 'words',
 'are',
 'handled',
 'with',
 'byte',
 'pair',
 'encoding',
 '.']
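
To make the splitting rule concrete, here is a minimal pure-Python sketch of greedy longest-match-first splitting, the strategy BERT's WordPiece tokenizer applies to each word. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are illustrative assumptions, not the library's code; the real tokenizer also lowercases and splits on punctuation first, and uses a vocabulary of roughly 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into wordpieces by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the ## prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No vocabulary entry fits here, so the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen only to reproduce the split shown above.
toy_vocab = {"una", "##cco", "##unt", "##able", "words", "un", "##a"}

print(wordpiece_tokenize("unaccountable", toy_vocab))
print(wordpiece_tokenize("words", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Although `"un"` is in the toy vocabulary, the longer match `"una"` wins, which is why the split starts with `'una'`.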

To fully prepare this sentence for input into the model:

*   Special [CLS] and [SEP] tokens must be added to the start and the end, respectively.
*   The tokens must be converted to their corresponding numerical IDs.



In [None]:
# Combining wordpieces with special tokens
pieces = tokenizer.tokenize(sentence)
pieces = [tokenizer.cls_token] + pieces + [tokenizer.sep_token]
print(pieces)

['[CLS]', 'una', '##cco', '##unt', '##able', 'words', 'are', 'handled', 'with', 'byte', 'pair', 'encoding', '.', '[SEP]']


In [None]:
# Converting all tokens to numerical IDs
tokenizer.convert_tokens_to_ids(pieces)

[101,
 14477,
 21408,
 16671,
 3085,
 2616,
 2024,
 8971,
 2007,
 24880,
 3940,
 17181,
 1012,
 102]

In most cases it's best to roll these into one step with tokenizer.encode().

In [None]:
tokenizer.encode(sentence)

[101,
 14477,
 21408,
 16671,
 3085,
 2616,
 2024,
 8971,
 2007,
 24880,
 3940,
 17181,
 1012,
 102]

In [None]:
" ".join(['[CLS]', 'boston', 'is', 'a', 'city', 'in', 'massachusetts', '.', 'it', 'is', 'not', 'only', 'the', 'largest', 'city', 'in', 'massachusetts', ',', 'but', 'the', 'capital', '.', '[SEP]'])


'[CLS] boston is a city in massachusetts . it is not only the largest city in massachusetts , but the capital . [SEP]'

When encoding multiple sequences together (say, for natural language inference tasks), separate them with the special [SEP] token.

In [None]:
# Special tokens are included that delimit sequences
sents = ["Boston is a city in Massachusetts. It is not only the largest city in Massachusetts, but the capital.",
         "Where is Boston?"]
tokens = ([tokenizer.cls_token] + tokenizer.tokenize(sents[0]) + 
          [tokenizer.sep_token] + tokenizer.tokenize(sents[1]) + 
          [tokenizer.sep_token])
ids = tokenizer.convert_tokens_to_ids(tokens)
list(zip(ids, tokens))

[(101, '[CLS]'),
 (3731, 'boston'),
 (2003, 'is'),
 (1037, 'a'),
 (2103, 'city'),
 (1999, 'in'),
 (4404, 'massachusetts'),
 (1012, '.'),
 (2009, 'it'),
 (2003, 'is'),
 (2025, 'not'),
 (2069, 'only'),
 (1996, 'the'),
 (2922, 'largest'),
 (2103, 'city'),
 (1999, 'in'),
 (4404, 'massachusetts'),
 (1010, ','),
 (2021, 'but'),
 (1996, 'the'),
 (3007, 'capital'),
 (1012, '.'),
 (102, '[SEP]'),
 (2073, 'where'),
 (2003, 'is'),
 (3731, 'boston'),
 (1029, '?'),
 (102, '[SEP]')]

This use case can also be simplified by using `tokenizer.encode_plus`, which also returns the token type IDs and the attention mask.

In [None]:
tokenizer.encode_plus(*sents, return_token_type_ids=True)

{'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1],
 'input_ids': [101,
  3731,
  2003,
  1037,
  2103,
  1999,
  4404,
  1012,
  2009,
  2003,
  2025,
  2069,
  1996,
  2922,
  2103,
  1999,
  4404,
  1010,
  2021,
  1996,
  3007,
  1012,
  102,
  2073,
  2003,
  3731,
  1029,
  102],
 'token_type_ids': [0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  1,
  1,
  1,
  1,
  1]}
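
The layout of these three lists can be sketched in plain Python. This is only an illustration of how the pieces line up, not the tokenizer's implementation; `build_pair_inputs` is a hypothetical helper name.

```python
def build_pair_inputs(tokens_a, tokens_b, cls="[CLS]", sep="[SEP]"):
    """Lay out two tokenized sequences the way a BERT sentence pair is arranged."""
    tokens = [cls] + tokens_a + [sep] + tokens_b + [sep]
    # Segment 0 covers [CLS], the first sequence, and its [SEP];
    # segment 1 covers the second sequence and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Every position holds a real token, so the mask is all ones (no padding yet).
    attention_mask = [1] * len(tokens)
    return tokens, token_type_ids, attention_mask


tokens, token_type_ids, attention_mask = build_pair_inputs(
    ["boston", "is", "a", "city", "."],
    ["where", "is", "boston", "?"],
)
for row in zip(tokens, token_type_ids, attention_mask):
    print(row)
```

With the 21 wordpieces of the first sentence above, the same rule gives 23 zeros followed by 5 ones, matching the `token_type_ids` in the output.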

To feed the text to the model, put the IDs for all sentences of one batch in a tensor, then call the model on it:

In [None]:
# Put the token IDs in a tensor of batch size 1
tokens_tensor = torch.tensor([ids])
tokens_tensor

tensor([[ 101, 3731, 2003, 1037, 2103, 1999, 4404, 1012, 2009, 2003, 2025, 2069,
         1996, 2922, 2103, 1999, 4404, 1010, 2021, 1996, 3007, 1012,  102, 2073,
         2003, 3731, 1029,  102]])
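
A batch with more than one sentence needs rows of equal length before it can become a tensor. The sketch below pads shorter rows and builds the matching attention mask in plain Python. The helper name `pad_batch` is hypothetical, and `pad_id=0` assumes the `[PAD]` ID of `bert-base-uncased` (check `tokenizer.pad_token_id` for other models).

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every ID list to the longest length and build the attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


# "[CLS] where is boston ? [SEP]" and "[CLS] boston [SEP]", using IDs shown earlier.
batch = [[101, 2073, 2003, 3731, 1029, 102], [101, 3731, 102]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

The padded lists can then be wrapped in tensors and passed as `model(torch.tensor(input_ids), attention_mask=torch.tensor(attention_mask))`.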

In [None]:
# Feed the batch to the model and get our hidden representations
# Index 0 of the model output holds the last layer's hidden states, shape (batch, tokens, hidden size)
batch_hiddens = model(tokens_tensor)[0]
hiddens = batch_hiddens[0]
hiddens

tensor([[-0.3978,  0.2074, -0.5156,  ..., -0.5493,  0.5735,  0.5814],
        [ 0.1739, -0.3144, -0.6636,  ..., -0.2763,  0.3885,  0.1835],
        [-0.8951, -0.5945, -0.0142,  ..., -0.4121,  0.7583,  0.5488],
        ...,
        [ 0.3861, -0.4660, -0.6078,  ..., -0.3762, -0.6090, -0.0122],
        [-0.5916, -1.0025, -1.0509,  ...,  0.3678, -0.0044,  0.1510],
        [-1.5023, -0.2680, -0.8075,  ...,  0.0321,  0.3901,  0.1777]],
       grad_fn=<SelectBackward>)

**Context dependence**

One of the key advantages of using transformer models like BERT is that the word embeddings they provide are context-dependent.

Let's get some of these word embeddings from different sentences, and see how the representation of the word "saw" changes depending on the context.

In [None]:
# Each sentence is paired with the index of the wordpiece we're interested in,
# counted in the output of tokenizer.tokenize(sentence)
sequences = [("I saw the car", 1),
             ("I cut the log with a saw", 6),
             ("She saw the dog", 1),
             ("I drilled a hole with a drill", 6),
             ("Use a screwdriver to drive the screw", 2)]
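
When picking indices for your own sentences, keep in mind that `tokenizer.tokenize()` returns no special tokens, while `tokenizer.encode()` prepends `[CLS]`. A wordpiece at index `i` of the `tokenize()` output therefore sits at row `i + 1` of the hidden states for the encoded sentence. The helper below is a small sketch of that bookkeeping; `find_positions` is a hypothetical name, and the wordpiece list is written out by hand rather than produced by the tokenizer.

```python
def find_positions(pieces, target, has_special_tokens=True):
    """Return the positions of `target` among the wordpieces.

    With has_special_tokens=True the positions are shifted by one to account
    for the [CLS] token that encode() puts in front of the sentence.
    """
    offset = 1 if has_special_tokens else 0
    return [i + offset for i, piece in enumerate(pieces) if piece == target]


# Wordpieces of "I cut the log with a saw", as tokenize() would list them.
pieces = ["i", "cut", "the", "log", "with", "a", "saw"]

print(find_positions(pieces, "saw", has_special_tokens=False))  # index into tokenize() output
print(find_positions(pieces, "saw"))                            # row in the encoded sequence
```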


In [None]:
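def get_embedding(sentence, loc):
  # loc indexes the output of tokenizer.tokenize(sentence), which has no special tokens
  seq_ids = tokenizer.encode(sentence)
  seq_ids_tensor = torch.tensor([seq_ids])
  # Index 0 of the model output holds the last layer's hidden states
  hiddens = model(seq_ids_tensor)[0]
  # encode() prepends [CLS], so the wordpiece at loc sits one row later
  return hiddens[0][loc + 1]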

In [None]:
# Create a mapping from sentences to the word embeddings of the relevant words
embeddings = {}
for sentence, loc in sequences:
  seq_ids = tokenizer.encode(sentence)
  seq_ids_tensor = torch.tensor([seq_ids])
  # Index 0 of the model output holds the last layer's hidden states
  hiddens = model(seq_ids_tensor)[0]
  # encode() prepends [CLS], so shift the tokenize() index by one
  embeddings[sentence] = hiddens[0][loc + 1]

Compare "saw" in "I saw the car" with words in other sentences. Sentences with the *seeing* sense have high similarity.




In [None]:
cos_sim = torch.nn.functional.cosine_similarity
ref, refloc = sequences[0]
print("\"{}\" in \"{}\" has a cosine similarity of...".format(tokenizer.tokenize(ref)[refloc], ref))
for sent, loc in sequences:
  print("{:3.2f} with \"{}\" in \"{}\"".format(cos_sim(get_embedding(ref, refloc), get_embedding(sent, loc), dim=0),
                                          tokenizer.tokenize(sent)[loc],
                                          sent))

"saw" in "I saw the car" has a cosine similarity of...
1.00 with "saw" in "I saw the car"
0.38 with "saw" in "I cut the log with a saw"
0.64 with "saw" in "She saw the dog"
0.39 with "drill" in "I drilled a hole with a drill"
0.25 with "screw" in "Use a screwdriver to drive the screw"
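
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on the angle between them and ranges from -1 to 1. The plain-Python version below is a sketch for intuition, using short made-up vectors instead of 768-dimensional BERT embeddings; `torch.nn.functional.cosine_similarity` computes the same quantity on tensors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
print(round(cosine_similarity([1.0, 2.0], [2.0, 4.0]), 2))  # scaling does not matter
```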


Compare "saw" in "I cut the log with a saw" with the other occurrences of "saw" and with the tools. The word is closer to the other tools than it is to the other sense of the word "saw".

In [None]:
cos_sim = torch.nn.functional.cosine_similarity
ref, refloc = sequences[1]
print("\"{}\" in \"{}\" has a cosine similarity of...".format(tokenizer.tokenize(ref)[refloc], ref))
for sent, loc in sequences:
  print("{:3.2f} with \"{}\" in \"{}\"".format(cos_sim(get_embedding(ref, refloc), get_embedding(sent, loc), dim=0),
                                          tokenizer.tokenize(sent)[loc],
                                          sent))

"saw" in "I cut the log with a saw" has a cosine similarity of...
0.38 with "saw" in "I saw the car"
1.00 with "saw" in "I cut the log with a saw"
0.36 with "saw" in "She saw the dog"
0.83 with "drill" in "I drilled a hole with a drill"
0.64 with "screw" in "Use a screwdriver to drive the screw"


Try experimenting with your own sentence and word choices!