# QUESTION-ANSWERING BERT

- The goal of this notebook is to explore and understand pre-trained qa models from the huggingface library. No fine-tuning is going to be done, we are going to use existing models from **<a href="https://huggingface.co/models">Hugging Face Models.</a>**

## 1. BERT Input format for QA

- To feed a QA task into BERT, we pack both the question and the reference text into the input.

- The two pieces of text are separated by the special [SEP] token.
- BERT also uses “Segment Embeddings” to differentiate the question from the reference text. These are simply two embeddings (for segments “A” and “B”) that BERT learned, and which it adds to the token embeddings before feeding them into the input layer.

<img src="img/bert_qa_input.png">

### 1.1 STAR & END TOKEN CLASSIFIER

- BERT needs to highlight a “span” of text containing the answer–this is represented as simply predicting which token marks the start of the answer, and which token marks the end

<img src="img/qa-bert-input-start.png"> 

- For every token in the text, we feed its final embedding into the start token classifier. The start token classifier only has a single set of weights (represented by the blue “start” rectangle in the above illustration) which it applies to every word.

- After taking the dot product between the output embeddings and the ‘start’ weights, we apply the softmax activation to produce a probability distribution over all of the words. Whichever word has the highest probability of being the start token is the one that we pick.

- We repeat this process for the end token–we have a separate weight vector this

<img src="img/qa-bert-input-end.png">

## 2. LOAD FINE-TUNED BERT MODEL

- We are going to use **<a href="https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad">bert-large-uncased-whole-word-masking-finetuned-squad</a>**

- BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way. It was trained with two objectives:
    1. **Masked language modeling (MLM)**: taking a sentence, the model randomly masks 15% of the words in the input then run the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence
    2. **Next sentence prediction (NSP):** the models concatenates two masked sentences as inputs during pretraining. Sometimes they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to predict if the two sentences were following each other or not.
    
 | Layers | Hidden State | Attention Heads | Parameters | Memory |
| --- | --- | --- | --- | --- |
| 12 | 1024 | 16 | 336M | 1.34GB |

**Preprocessing**
The texts are lowercased and tokenized using WordPiece and a vocabulary size of 30,000. The inputs of the model are then of the form:

<i>[CLS] Sentence A [SEP] Sentence B [SEP]</i>

 - With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus and in the other cases, it's another random sentence in the corpus. Note that what is considered a sentence here is a consecutive span of text usually longer than a single sentence. The only constrain is that the result with the two "sentences" has a combined length of less than 512 tokens.

- The details of the masking procedure for each sentence are the following:

1. 15% of the tokens are masked.
2. In 80% of the cases, the masked tokens are replaced by [MASK].
3. In 10% of the cases, the masked tokens are replaced by a random token (different) from the one they replace.
4. In the 10% remaining cases, the masked tokens are left as is.
**Pretraining**
- The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer used is Adam with a learning rate of 1e-4, β1=0.9 and β2=0.999, a weight decay of 0.01, learning rate warmup for 10,000 steps and linear decay of the learning rate after.

In [1]:
import torch
from transformers import BertTokenizer, BertForQuestionAnswering

In [15]:
BASE_MODEL = 'bert-large-uncased-whole-word-masking-finetuned-squad'
#BASE_MODEL = 'bert-base-uncased'

In [16]:
tokenizer = BertTokenizer.from_pretrained(BASE_MODEL)

In [None]:
model = BertForQuestionAnswering.from_pretrained(BASE_MODEL)

Downloading:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

In [5]:
model.num_parameters()

108893186

In [6]:
question = "How many parameters does BERT-large have?"
answer_text = "BERT-large is really big... it has 24-layers and an embedding size of 1,024, for a total of 340M parameters! Altogether it is 1.34GB, so expect it to take a couple minutes to download to your Colab instance."

In [7]:
encoded_dict = tokenizer(question, answer_text)

In [8]:
encoded_dict

{'input_ids': [101, 2129, 2116, 11709, 2515, 14324, 1011, 2312, 2031, 1029, 102, 14324, 1011, 2312, 2003, 2428, 2502, 1012, 1012, 1012, 2009, 2038, 2484, 1011, 9014, 1998, 2019, 7861, 8270, 4667, 2946, 1997, 1015, 1010, 6185, 2549, 1010, 2005, 1037, 2561, 1997, 16029, 2213, 11709, 999, 10462, 2009, 2003, 1015, 1012, 4090, 18259, 1010, 2061, 5987, 2009, 2000, 2202, 1037, 3232, 2781, 2000, 8816, 2000, 2115, 15270, 2497, 6013, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [9]:
input_ids = encoded_dict['input_ids']
token_type_ids = encoded_dict['token_type_ids']
attention_mask = encoded_dict['attention_mask']

In [10]:
##Prepare input for the model 
input_ids = tokenizer.encode(question, answer_text)
print(f"There are {len(input_ids)} input ids for the question + context")
#Let´s convert 
tokens = tokenizer.convert_ids_to_tokens(input_ids)

There are 70 input ids for the question + context


In [11]:
for input_id, token in zip(input_ids, tokens):
    print(f"{input_id} >> {token}")

101 >> [CLS]
2129 >> how
2116 >> many
11709 >> parameters
2515 >> does
14324 >> bert
1011 >> -
2312 >> large
2031 >> have
1029 >> ?
102 >> [SEP]
14324 >> bert
1011 >> -
2312 >> large
2003 >> is
2428 >> really
2502 >> big
1012 >> .
1012 >> .
1012 >> .
2009 >> it
2038 >> has
2484 >> 24
1011 >> -
9014 >> layers
1998 >> and
2019 >> an
7861 >> em
8270 >> ##bed
4667 >> ##ding
2946 >> size
1997 >> of
1015 >> 1
1010 >> ,
6185 >> 02
2549 >> ##4
1010 >> ,
2005 >> for
1037 >> a
2561 >> total
1997 >> of
16029 >> 340
2213 >> ##m
11709 >> parameters
999 >> !
10462 >> altogether
2009 >> it
2003 >> is
1015 >> 1
1012 >> .
4090 >> 34
18259 >> ##gb
1010 >> ,
2061 >> so
5987 >> expect
2009 >> it
2000 >> to
2202 >> take
1037 >> a
3232 >> couple
2781 >> minutes
2000 >> to
8816 >> download
2000 >> to
2115 >> your
15270 >> cola
2497 >> ##b
6013 >> instance
1012 >> .
102 >> [SEP]


- We’ve concatenated the question and answer_text together, but BERT still needs a way to distinguish them. BERT has two special “Segment” embeddings, one for segment “A” and one for segment “B”. Before the word embeddings go into the BERT layers, the segment A embedding needs to be added to the question tokens, and the segment B embedding needs to be added to each of the answer_text tokens.

- These additions are handled for us by the transformer library, and all we need to do is specify a ‘0’ or ‘1’ for each token.

- Note: In the transformers library, huggingface calls these token_type_ids

In [12]:
output = model(torch.LongTensor([input_ids]),
                torch.LongTensor([token_type_ids]))

In [13]:
start_logits = output.start_logits
end_logits = output.end_logits

In [14]:
print('Start logits Tensor')
print(start_logits)
print('\n')
print(f"Max value for start position {start_logits.max()}")
print(f"Max position value for start {start_logits.argmax()}")
print('\n')
print('End Logits Tensor')
print(end_logits)
print('\n')
print(f"Max value for start position {end_logits.max()}")
print(f"Max position value for start {end_logits.argmax()}")

Start logits Tensor
tensor([[ 0.3870,  0.3425,  0.2221, -0.1389, -0.0598,  0.0167,  0.2192, -0.0195,
          0.0358,  0.1572,  0.2594,  0.0026,  0.3877,  0.2388, -0.2559, -0.2483,
         -0.1877, -0.0095,  0.1381,  0.0192, -0.0996, -0.1723, -0.0579,  0.4340,
          0.2965,  0.0091, -0.0136, -0.0008,  0.2370,  0.1961, -0.3182, -0.0174,
         -0.0215, -0.2632,  0.3457, -0.4221,  0.2370,  0.0834,  0.0156, -0.4788,
         -0.0394, -0.0963,  0.3294, -0.0429,  0.0606, -0.0020, -0.1711, -0.3001,
         -0.4400, -0.3295, -0.2250, -0.2735,  0.2988, -0.1878, -0.0589, -0.0034,
         -0.0199, -0.3540, -0.1299, -0.4777, -0.1572, -0.1968, -0.1163,  0.2619,
         -0.1417, -0.2043, -0.0156, -0.2806,  0.3504,  0.2129]],
       grad_fn=<CopyBackwards>)


Max value for start position 0.43402576446533203
Max position value for start 23


End Logits Tensor
tensor([[ 0.2750,  0.3025,  0.1982,  0.2804,  0.0134,  0.1392, -0.3005,  0.1847,
          0.2117,  0.1993, -0.2266, -0.0302, -0.153