# COLX 561 Lab Assignment 3: Question-Answering with BERT
## Assignment Objectives

In this lab, you will implement and train a (distil)BERT model for question answering on a subset of the [SQuAD v2.0](https://rajpurkar.github.io/SQuAD-explorer/) dataset. Lab objectives include:

1. Convert the data to tensors using the BERT tokenizer
2. Train a model for Question-Answering by tuning on top of a pre-trained BERT model 
3. Optimize the choice of start and end indices

We use DistilBERT in this lab because it is significantly smaller and faster than BERT, but with very similar performance. Even though we are using DistilBERT, we will call it BERT throughout this lab.

If you do not have access to a GPU locally, you'll likely want to run this on Google Colab with a GPU backend. 

In [28]:
# !python3 -m pip install pulp
# !python3 -m pip install transformers

## Getting Started

Run the code below to access relevant modules (you can add to this as needed).

In [139]:
#provided code
import numpy as np
import torch
import pulp
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering
from torch.utils.data import Dataset, DataLoader
from sklearn.metrics import accuracy_score

In [140]:
print(torch.__version__)

1.13.1


For this lab, you'll be working with the SQuAD dataset. Download the SQuAD data from [Google Drive](https://drive.google.com/file/d/1jAXaGLyCllMoa6suFiZro4cuWf0Mnx9G/view?usp=sharing), unzip it into a directory outside of your lab repo, and change the path below. Later you will probably want to put the data on Google Drive and change this path so it points to your mounted data.

The question, context (also called the passage), and answer for a given set of QA training data are stored in separate files with corresponding line numbers. You should open up the data files to make sure you understand what they each represent.

In [141]:
#provided code
squad_path = '../Data/Lab3/'

```
==> train.question <==
what percentage of imperial 's staff was classified as world leading in 2008 ?
what paradox did sheptycki point out ?

==> train.answer <==
26 %
the harder policing agencies work to produce security , the greater are feelings of insecurity

==> train.span <==
6 7
166 180

==> train.context <==  **[[[ ANSWER ]]]** <--  I added
the 2008 research assessment exercise returned **[[[26 %]]]** of the 1225 staff submitted as being world-leading ( 4* ) and a further 47 % as being internationally excellent ( 3* ) . the 2008 research assessment exercise also showed five subjects – pure mathematics , epidemiology and public health , chemical engineering , civil engineering , and mechanical , aeronautical and manufacturing engineering – were assessed to be the best [ clarification needed ] in terms of the proportion of internationally recognised research quality .
studies of this kind outside of europe are even rarer , so it is difficult to make generalizations , but one small-scale study that compared transnational police information and intelligence sharing practices at specific cross-border locations in north america and europe confirmed that low visibility of police information and intelligence sharing was a common feature ( alain , 2001 ) . intelligence-led policing is now common practice in most advanced countries ( ratcliffe , 2007 ) and it is likely that police intelligence sharing and information exchange has a common morphology around the world ( ratcliffe , 2007 ) . james sheptycki has analyzed the effects of the new information technologies on the organization of policing-intelligence and suggests that a number of 'organizational pathologies ' have arisen that make the functioning of security-intelligence processes in transnational policing deeply problematic . he argues that transnational police information circuits help to " compose the panic scenes of the security-control society " . the paradoxical effect is that , **[[[the harder policing agencies work to produce security , the greater are feelings of insecurity]]]** .
```
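The span file appears to store inclusive token indices into the whitespace-tokenized context. The following is a minimal sketch of that reading, using a shortened copy of the first training context shown above; the helper name `span_to_answer` is hypothetical and not part of the provided code.

```python
def span_to_answer(context: str, span: str) -> str:
    """Recover the answer text from a context line and a 'start end' span line."""
    tokens = context.split()                  # the data files are already whitespace-tokenized
    start, end = map(int, span.split())
    return " ".join(tokens[start:end + 1])    # assumption: the end index is inclusive

context = "the 2008 research assessment exercise returned 26 % of the 1225 staff submitted"
answer = span_to_answer(context, "6 7")
print(answer)
```

If this reading is right, slicing each context with its span should reproduce the corresponding line of the answer file.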

## Tidy Submission
rubric={mechanics:1}

To get the marks for tidy submission:
- Submit the assignment by filling in this Jupyter notebook with your answers embedded
- Be sure to follow the instructions

## Exercise 1: Initial Data Processing

### Exercise 1.1
rubric={accuracy:2}

Your first task is to write a function, `convert_to_BERT_tensors`, which uses the built-in BERT tokenizer to create tensors for input to the BERT model. You should call the tokenizer directly; see the tokenizer [docs](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.__call__). This function should involve only two lines of code, but you need to get the arguments right. Your tokenization process must
* return pytorch tensors corresponding to the input_ids and attention masks (which prevent BERT from attending to padding)
* combine questions and contexts into a single input with a separator character
* truncate when the combined question and context are too long to work with BERT (longer than 512 tokens)
* add padding when the question and context is too short 
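To make the four requirements concrete, here is a rough pure-Python sketch of what packing, truncation, padding, and masking amount to for a single example. It is an illustration only, not the tokenizer's implementation: the helper `pack_pair` is hypothetical, it pads to a fixed `max_len` rather than to the longest example in the batch, it truncates only the context, and it assumes the question alone fits in the limit. The ids 101, 102, and 0 are the `[CLS]`, `[SEP]`, and padding ids that appear in the outputs in this notebook.

```python
def pack_pair(question_ids, context_ids, max_len=512, cls_id=101, sep_id=102, pad_id=0):
    """Build [CLS] question [SEP] context [SEP], truncated and padded to max_len."""
    # Room left for the context after [CLS], the question, and two [SEP] tokens.
    room = max_len - len(question_ids) - 3
    context_ids = context_ids[:max(room, 0)]
    ids = [cls_id] + question_ids + [sep_id] + context_ids + [sep_id]
    mask = [1] * len(ids)                     # 1 = real token, 0 = padding
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding

# Short example: one padding position is added and masked out.
short_ids, short_mask = pack_pair([2339, 1029], [1045, 2228], max_len=8)
print(short_ids)
print(short_mask)

# Long example: the context is cut so that the final [SEP] still fits.
long_ids, long_mask = pack_pair([2339, 1029], [7] * 1000)
print(len(long_ids), long_ids[-1])
```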

In [142]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

def convert_to_BERT_tensors(questions, contexts):
    '''takes parallel lists of question strings and context strings'''
    #your code here
    # use tokenize with `return_tensors='pt', truncation=True, padding=True` for `question` and `contexts`
    stuff = tokenizer(questions, contexts, return_tensors='pt', truncation=True, padding=True)
    return stuff['input_ids'], stuff['attention_mask']
    #your code here


In [143]:
test_questions = ["Why?", "How?"]
test_contexts = ["I think it is because we can bluminate", "It was done " + " ".join(["very"]*1000) + " well"]

ids, mask = convert_to_BERT_tensors(test_questions,test_contexts)
assert ids.shape == (2,512) # 512 because that's the max allowed
assert ids[0][3] == 102 # fourth token is separator
assert list(ids[0][-100:]) == [0]*100 # first row is mostly padding
assert list(ids[1][-100:]) != [0]*100 # second row is not
assert list(mask[0][-100:]) == [0]*100 # first row padding is masked
assert list(mask[1][-100:]) != [0]*100 # second row is not padding, no mask
print("Success!")

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Success!


In [11]:
print(tokenizer.vocab['[CLS]'], tokenizer.vocab['why'], 
      tokenizer.vocab['?'], tokenizer.vocab['[SEP]'], tokenizer.vocab['i'], tokenizer.vocab['think'], "...")

101 2339 1029 102 1045 2228 ...


In [13]:
ids[0][:50]

tensor([  101,  2339,  1029,   102,  1045,  2228,  2009,  2003,  2138,  2057,
         2064, 14154, 19269,   102,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0])

### Exercise 1.2
rubric={accuracy:3, efficiency:1}

As our target for training, we want tensors of indices which correspond to the beginning and end of the answer span. For example, given the question `Who wrote Hamlet ?`, the answer `W . Shakespeare` is given by the span `[5, 7]` in the context `Between 1599 and 1601 , W . Shakespeare wrote Hamlet`.

This gets a bit tricky because BERT will pack the question and context in the same vector and tokenize the input words:

```[CLS] Who wrote Ham ##let ? [SEP] Between 1599 and 16 ##01 , W . Shakes ##peare wrote Ham ##let [SEP]```

This means that the span will change from the original `[5,7]`, in the span file, to `[13,16]` in the BERT input.

You should implement the function `get_answer_span_tensor()` which takes strings corresponding to a question, context, and answer as input. It then identifies the correct span in the BERT input. To find the correct answer span, you can use the [tokenize()](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.tokenize) method of BERT's tokenizer which splits the input into sub-word units.

Start by forming the BERT input string (`[CLS] question [SEP] context [SEP]`). Remember to truncate the input to 512 tokens, which is the BERT maximum. We'll simply discard all tokens beyond that. Then, apply `tokenize()` both to the BERT input and the answer, and match the tokenized answer to the correct substring in the tokenized input. Return `torch.tensor([start,end])` for the identified span `[start,end]`.

**Note** because we're truncating, it may happen that the answer does not appear in the input (because it would have a start index >= 512). In that case, return the range `torch.tensor([0,0])`.  
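The matching step described above can be sketched on lists of tokens that have already been produced by `tokenize()`. This is a sketch under stated assumptions, not the reference solution: the helper `find_answer_span` is hypothetical, and starting the search after the first `[SEP]` (so that words in the question never match) is a design choice that the instructions do not require.

```python
def find_answer_span(input_tokens, answer_tokens, max_len=512):
    """Return (start, end) of the first match in the context part, else (0, 0)."""
    input_tokens = input_tokens[:max_len]     # discard everything beyond the limit
    n = len(answer_tokens)
    if n == 0:
        return (0, 0)
    # Assumption: search only after the first [SEP], i.e. inside the context.
    first = input_tokens.index("[SEP]") + 1 if "[SEP]" in input_tokens else 0
    for i in range(first, len(input_tokens) - n + 1):
        if input_tokens[i:i + n] == answer_tokens:
            return (i, i + n - 1)             # inclusive end index
    return (0, 0)                             # answer missing or truncated away

bert_input = ("[CLS] Who wrote Ham ##let ? [SEP] Between 1599 and 16 ##01 , "
              "W . Shakes ##peare wrote Ham ##let [SEP]").split()
span = find_answer_span(bert_input, "W . Shakes ##peare".split())
print(span)
```

For the assignment itself, the result would still need to be wrapped as `torch.tensor([start, end])`.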

In [144]:
def get_answer_span_tensor(question,context,answer):
    # your code here
    input_tokens = tokenizer.tokenize('[CLS] ' + question + ' [SEP] ' + context)
    answer_tokens = tokenizer.tokenize(answer)
    # print("input: ", input_tokens)
    # print("answer: ", answer_tokens)
    span_len = len(answer_tokens)
    # print("span length: ", span_len)
    for i in range(min(len(input_tokens) - span_len+1, 512 - span_len - 1)):
        if input_tokens[i:i+span_len] == answer_tokens:
            span = torch.tensor([i,i+span_len - 1])
            break
    else:
        span = torch.tensor([0,0])
        
    return span
    # your code here

In [24]:
test_question = "Why?"
test_context = "I think it is because we can bluminate"
test_answer = "because we can bluminate"
bad_answer  = "because we can fumiage"
span = get_answer_span_tensor(test_question,test_context,test_answer)
assert span.shape == (2,)
assert list(span) == [8,12]
span = get_answer_span_tensor(test_question,test_context,bad_answer)
assert list(span) == [0,0]
print('Success!')

input:  ['[CLS]', 'why', '?', '[SEP]', 'i', 'think', 'it', 'is', 'because', 'we', 'can', 'blu', '##minate']
answer:  ['because', 'we', 'can', 'blu', '##minate']
span length:  5
input:  ['[CLS]', 'why', '?', '[SEP]', 'i', 'think', 'it', 'is', 'because', 'we', 'can', 'blu', '##minate']
answer:  ['because', 'we', 'can', 'fu', '##mia', '##ge']
span length:  6
Success!


### Exercise 1.3
rubric={accuracy:2, quality:1}

Now write code that builds a `QAdataset` (defined below) and a corresponding dataloader for each of the train, dev, and test splits with the provided `batch_size`.

In [145]:
#provided code
batch_size = 16

class QAdataset(Dataset):
    '''A dataset for housing QA data, including input_data, output_data, and padding mask'''
    def __init__(self, input_data, output_data,mask):
        self.input_data = input_data
        self.output_data = output_data
        self.mask = mask
        
    def __len__(self):
        return len(self.input_data)
    
    def __getitem__(self, index):
        target = self.output_data[index]
        data_val = self.input_data[index]
        mask = self.mask[index]
        return data_val,target,mask 

We will now initialize data loaders for `QAdataset`. We start by writing a function `prepare_QA_dataset` which reads QA data from the directory `squad_path`.

The function takes a string `prefix` (either `train`, `dev` or `test`) as input, and reads the SQuAD files `prefix.question` and `prefix.context` in the directory `squad_path` into two lists: `questions` and `contexts` (these should simply be lists of strings corresponding to the lines in the original data files). 

The function then passes the `questions` and `contexts` lists to `convert_to_BERT_tensors` which returns a list of BERT input tensors `QA_input` and a list of masks `masks`. 

Next we will generate a list `spans` of answer spans. There are two cases.

1. If we're reading the test set (i.e. `prefix == "test"`), then the span for each example should be empty: `torch.tensor([0,0])`
1. If we're reading the training or development set (i.e. `prefix in "train dev".split()`), then start by reading the answers in the file `prefix.answer` into a list `answers`. Then call `get_answer_span_tensor` with the arguments `questions[i]`, `contexts[i]` and `answers[i]` to get the correct span for each example `i`.

Finally, return `QAdataset(QA_input, spans, masks)`.

You should then generate `train_dataset`, `dev_dataset` and `test_dataset` by calling `prepare_QA_dataset`. Use the datasets to initialize [data loaders](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) `train_dataloader`, `dev_dataloader` and `test_dataloader`. When initializing the data loaders, set `batch_size=batch_size` and `shuffle=False`. 

**Note** It can take several minutes to read the datasets because they are large. It can be a good idea to first complete the entire lab with small subsets of these datasets, e.g. the first 5000 examples from each dataset. When everything is working, you can then switch to the full datasets.
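One way to follow the advice about small subsets is to cap the number of lines read from each file. The sketch below is only an illustration: the helper `read_parallel_files` and its `limit` parameter are hypothetical, and the demonstration uses tiny made-up files in a temporary directory instead of the SQuAD data.

```python
import os
import tempfile

def read_parallel_files(directory, prefix, suffixes, limit=None):
    """Read prefix.suffix for each suffix into parallel lists of stripped lines."""
    columns = []
    for suffix in suffixes:
        path = os.path.join(directory, f"{prefix}.{suffix}")
        with open(path, encoding="utf-8") as f:
            lines = [line.rstrip("\n") for line in f]
        columns.append(lines[:limit])         # limit=None keeps every line
    # The files are aligned by line number, so their lengths must agree.
    assert len({len(col) for col in columns}) == 1, "files are not parallel"
    return columns

# Toy demonstration with two tiny files.
with tempfile.TemporaryDirectory() as tmp:
    toy_files = [("question", "why ?\nhow ?\nwho ?\n"), ("context", "c1\nc2\nc3\n")]
    for suffix, text in toy_files:
        with open(os.path.join(tmp, f"toy.{suffix}"), "w", encoding="utf-8") as f:
            f.write(text)
    questions, contexts = read_parallel_files(tmp, "toy", ["question", "context"], limit=2)

print(questions, contexts)
```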

In [164]:
def prepare_QA_dataset(split):
    
    '''for split in "train", "dev", "test", prepares a PyTorch dataset by reading the files and 
    converting the data to tensors. For test, provides dummy answers'''
    with open(squad_path + split + ".question", encoding="utf-8") as f:
        questions = f.readlines()
    with open(squad_path + split + ".context", encoding="utf-8") as f:
        contexts = f.readlines()    
    QA_input, masks = convert_to_BERT_tensors(questions, contexts)

    # only for train and dev; 
    if "train" == split or "dev" == split: 
        with open(squad_path + split + ".answer", encoding="utf-8") as f:
            answers = f.readlines()
            spans = []
            # call `get_answer_span_tensor` once per question
            for i in range(len(questions)):
                spans.append(get_answer_span_tensor(questions[i], contexts[i], answers[i]))
    else:
        spans = [torch.tensor([0,0])]*len(questions)
    return QAdataset(QA_input, spans, masks)


In [1]:
# DO NOT LOAD the train dataset before finishing Ex 3. 

# train_dataset = prepare_QA_dataset("train")
# train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=False)
dev_dataset = prepare_QA_dataset('dev')
dev_dataloader = DataLoader(dev_dataset, batch_size=batch_size, shuffle=False)
test_dataset = prepare_QA_dataset('test')
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

## Exercise 2: BERT Training 


### Exercise 2.1
rubric={accuracy:2}

We will now train a BERT model for QA. The Huggingface library has a [BERT QA model](https://huggingface.co/transformers/model_doc/distilbert.html#distilbertforquestionanswering) that includes the main pre-trained BERT model as well as the QA heads. We will use this model (the preinitialized `model` object below). 

Given an input `batch` of dimension `batch_size x 512` (where 512 is the BERT maximum input size) and input `masks`, the model forward function will return an object `output` with two members:

1. `output.start_logits`, a tensor of shape `batch_size x 512` holding unnormalized scores (logits) over the start position of the QA span. 
1. `output.end_logits`, a tensor of shape `batch_size x 512` holding unnormalized scores (logits) over the end position of the QA span.

----------

You should start by defining a loss function. Please use [`nn.CrossEntropyLoss`](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html). You should then define an [Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html) optimizer for your model parameters using learning rate 0.00003.

Then, iterate over the data using your `train_dataloader`:
```
for train_text_batch, train_span_batch, masks in train_dataloader:
   ... 
```

You should pass the inputs and masks to `model.forward`. This returns an object `output` which was explained above. Calculate the loss as a sum of the losses for `output.start_logits` and `output.end_logits` using `train_span_batch`. It is a `batch_size x 2` tensor giving the gold standard spans: `[start, end]` for each example in the training batch. 

Remember that the objective is to make the scores `output.start_logits[i,start]` and `output.end_logits[i,end]` as high as possible, relative to the other positions, for each example `i`.
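As a worked illustration of the summed loss for a single example, the sketch below computes the negative log-probability of the gold start plus that of the gold end from raw logits. It uses plain Python and made-up numbers; `span_loss` is a hypothetical helper, and `nn.CrossEntropyLoss` additionally averages over the batch by default.

```python
import math

def log_softmax(logits):
    """Log-softmax of one list of logits, shifted by the maximum for stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def span_loss(start_logits, end_logits, start, end):
    """Negative log-likelihood of the gold start plus that of the gold end."""
    return -log_softmax(start_logits)[start] - log_softmax(end_logits)[end]

# With uniform logits over 4 positions, each term equals log(4).
uniform = span_loss([0.0] * 4, [0.0] * 4, start=1, end=2)
# Raising the logits at the gold positions lowers the loss.
confident = span_loss([0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0], start=1, end=2)
print(uniform, confident)
```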

Print out the loss regularly. If everything is correct, you should see it drop rapidly. 

You only need to train your model for a single epoch here (you may want to increase this number later for the Kaggle competition).

**Note** If you're running on a GPU on Google Colab, you will need to copy your tensors and model over to the GPU (check practical work 4 for COLX 581 to see how this is done).

**Note** If this is too slow, run training only on the first 10,000 (or even the first 2,000) examples. This will reduce your accuracy but it's more important to be able to run the training in the first place.

In [9]:
model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased')

# #your code here
# loss_function = nn.CrossEntropyLoss()
# optimizer = torch.optim.Adam(model.parameters(), lr=0.00003)

# device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# epochs = 1

# model.to(device)
# for epoch in range(epochs):
#     epoch_loss = 0
#     batch_counter = 0
#     for train_text_batch, train_span_batch, masks in train_dataloader:       
#         model.zero_grad()
#         train_text_batch, train_span_batch, masks = train_text_batch.to(device), train_span_batch.to(device), masks.to(device)
#         output = model(train_text_batch,attention_mask=masks)
#         loss = loss_function(output.start_logits, train_span_batch[:,0])
#         loss += loss_function(output.end_logits, train_span_batch[:,1])
#         loss.backward()
#         optimizer.step()
#         batch_counter += 1
#         if batch_counter % 10 == 0:
#             print("Processed ", batch_counter*batch_size, "QA pairs of ", len(train_dataset))
#             print("Last loss:", loss.item())
#         epoch_loss += loss.item()
#     print('After epoch:', epoch, 'Loss is:', epoch_loss)

#your code here

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForQuestionAnswering: ['vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this mode

Processed  160 QA pairs of  77558
Last loss: 11.232280731201172
Processed  320 QA pairs of  77558
Last loss: 9.259109497070312
Processed  480 QA pairs of  77558
Last loss: 9.044548034667969
Processed  640 QA pairs of  77558
Last loss: 7.710631370544434
Processed  800 QA pairs of  77558
Last loss: 8.131200790405273
Processed  960 QA pairs of  77558
Last loss: 6.788171768188477
Processed  1120 QA pairs of  77558
Last loss: 7.019942283630371
Processed  1280 QA pairs of  77558
Last loss: 6.349470138549805
Processed  1440 QA pairs of  77558
Last loss: 6.955586910247803
Processed  1600 QA pairs of  77558
Last loss: 7.553208351135254
Processed  1760 QA pairs of  77558
Last loss: 6.197084426879883
Processed  1920 QA pairs of  77558
Last loss: 4.91841459274292
Processed  2080 QA pairs of  77558
Last loss: 4.939706802368164
Processed  2240 QA pairs of  77558
Last loss: 5.47674560546875
Processed  2400 QA pairs of  77558
Last loss: 5.493062973022461
Processed  2560 QA pairs of  77558
Last loss: 4

```
# After epoch: 0 Loss is: 44305.126106739044
# After epoch: 1 Loss is: 40180.46493291855
# After epoch: 2 Loss is: 43801.96783733368
# After epoch: 3 Loss is: 56590.43050909042
```

# Start here:

In [150]:
MODEL_PATH = "COLX_563_adv-semantics_lab3.bin" # download from Google Drive (see the github/jungyeul/labs site)
model.load_state_dict(torch.load(MODEL_PATH, map_location=torch.device('cpu'))) 

<All keys matched successfully>

### Exercise 2.2
rubric={accuracy:2}

Now run the trained classifier over the dev set and calculate the accuracy for each of the start and end predictions (independently). You can iterate over your `dev_dataloader`, pass `batch` and `masks` to the QA model and use argmax on `output.start_logits` and `output.end_logits`.

You should get close to 50% performance for both (if you trained on at least 10,000 examples). 

**Note** Don't forget to put the model in `eval` mode to disable dropout, and to run inference under `torch.no_grad()` so that no gradients are computed.
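The independent start and end accuracies can be illustrated on toy data with NumPy. The logits and gold spans below are made-up numbers, not model output; with real batches the same argmax would be taken over `output.start_logits` and `output.end_logits`.

```python
import numpy as np

# Toy logits for 3 examples over 5 token positions (made-up numbers).
start_logits = np.array([[0.1, 2.0, 0.3, 0.0, 0.0],
                         [1.5, 0.2, 0.1, 0.0, 0.0],
                         [0.0, 0.1, 0.2, 3.0, 0.0]])
end_logits = np.array([[0.0, 0.1, 2.5, 0.0, 0.0],
                       [0.2, 0.1, 0.0, 0.0, 1.0],
                       [0.0, 0.0, 0.1, 0.2, 4.0]])
gold_spans = np.array([[1, 2], [0, 1], [3, 4]])   # one [start, end] row per example

pred_starts = start_logits.argmax(axis=1)         # most likely start per example
pred_ends = end_logits.argmax(axis=1)             # most likely end per example
start_acc = float((pred_starts == gold_spans[:, 0]).mean())
end_acc = float((pred_ends == gold_spans[:, 1]).mean())
print(start_acc, end_acc)
```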

In [167]:
# takes 18mins... 

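predicted_starts = []
gold_starts = []
predicted_ends = []
gold_ends = []

# `device` is needed below; the training cell that defined it is commented out
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
with torch.no_grad():
    for dev_text_batch, dev_span_batch, masks in dev_dataloader:
        dev_text_batch, masks = dev_text_batch.to(device), masks.to(device)
        output = model(dev_text_batch,attention_mask=masks)

        # `start_scores` and `end_scores` from `start_logits` and `end_logits` of `output`
        start_scores = output.start_logits
        end_scores = output.end_logits

        # `targets` from `dev_span_batch`, shape (batch_size, 2)
        targets = dev_span_batch

        # `extend` `predicted_starts` and `predicted_ends` with the
        # argmax of start_scores and end_scores, converted to lists
        predicted_starts.extend(start_scores.argmax(dim=1).tolist())
        predicted_ends.extend(end_scores.argmax(dim=1).tolist())

        # and `gold_starts` and `gold_ends` with the
        # start and end columns of targets, converted to lists
        gold_starts.extend(targets[:, 0].tolist())
        gold_ends.extend(targets[:, 1].tolist())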

print("Starts accuracy")
print(accuracy_score(gold_starts,predicted_starts))
print("Ends accuracy")
print(accuracy_score(gold_ends,predicted_ends))

Starts accuracy
0.6228220020498805
Ends accuracy
0.6592073795695251


## Exercise 3: Discrete optimization of answer spans
rubric={accuracy:3, efficiency:2, quality:1}

The model from Exercise 2 independently predicts both start and end indices for the answer span. However, this is a case where there is a dependency between predictions that needs to be considered. In particular, it doesn't make sense to have the end index appear before the start index, or too long after it. You want to pick the highest probability pair that satisfies those basic constraints.

In this exercise, we will enforce these constraints using discrete optimization. You will write a function `select_best_answer_span` which takes two numpy matrices of equal size; each row of the first matrix corresponds to the log probabilities for start prediction, and each row of the second matrix is the corresponding log probabilities for the end prediction (i.e. these are the `start_logits` and `end_logits` from above, appropriately logsoftmaxed). The third argument to the function is an integer `distance` which indicates how soon after the start token the end token must appear (i.e. distance = 0 indicates that start and end must be the same token).
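The "appropriately logsoftmaxed" step can be sketched in NumPy as follows. This is one possible way to do it, with made-up logits; applying `torch.nn.functional.log_softmax(..., dim=1)` to the model output before converting to NumPy would be an equivalent route.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

logits = np.array([[2.0, 1.0, 0.0],
                   [0.0, 0.0, 0.0]])
log_probs = log_softmax(logits)
print(log_probs)
```

Because the values are log probabilities, adding a start score and an end score corresponds to multiplying the two probabilities, so the span with the highest sum is the most probable pair.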

An example is provided in the form of `select_best_answer_span_slow`. It solves this problem but it does so in a slow, fairly brute force way. You need to implement a better version. You have two choices: 

1. For each row of start/end probabilities provided, set up a PuLP problem (see Lecture 7 of DSCI 512). Your variable dict for each problem should consist of only two rows, with as many columns as you have tokens in your input. The first row corresponds to the choice of the start index, the second corresponds to the choice of end index. You'll need to constrain there to be only one 1 per row (i.e. only one index can be assigned for each start/end). The other set of constraints are trickier: one way to formulate things is that, for each possible index, the (single) corresponding element of the start row minus all the elements of the end row which are ends for that start should be less than or equal to zero; this means you can't have an index at a particular location in the start row without having an index at an allowed location in the end row (this is very similar to the "connectivity" constraint from the ILP problem in Lab 4 of DSCI 512, look back at that and adapt it if you're confused!). Make sure you do NOT set up constraints for every possible pair of start/end indices, since that would be no better than the provided solution. Note that in this case this solution won't *actually* be faster than `select_best_answer_span_slow` because of the overhead in setting up each PuLP problem (PuLP doesn't support solving the same problem with different inputs, so you need to recreate the problem from scratch each time), but it is still more elegant, and good linear programming practice!

2. Solve it in regular Python, but using a smarter approach. Instead of searching all possible pairings in a brute force fashion, you should try searching in such a way that you're always checking spans (pairs of start/end indices) which have high probability. There are a number of ways to solve this; one recommended way is to first use argsort (to get starts/ends which are individually high probability), and then apply an algorithm similar to Dijkstra's (DSCI 512 Lecture 8), i.e. keep a list of spans sorted by the sum of the probabilities of their end points, checking them to see if they satisfy your requirements. The tricky part is knowing when and how to add new spans to your sorted list: you should add more spans not only when your list is empty, but also when you can no longer be sure the highest probability span in your list is the highest probability span overall (excluding those you've already checked). But when is that? You obviously can't just add all spans, since that would be an $O(n^2)$ solution no better than the provided one. If you do this one correctly, it should be quite a bit faster than the provided slow solution. 

You can get one bonus point here if you code both kinds of solutions.

After you're done, use predictions from your model on the dev set to show that the `select_best_answer_span` you wrote always gives the same result as `select_best_answer_span_slow` (iterate over your batches and use an assert). You can use a max distance of 20 here, and in Exercise 4 (if you are participating in the Kaggle competition, you may want to consider this as a hyperparameter to be optimized).

1. Brute-force search (the approach taken by `select_best_answer_span_slow`)

```
def select_best_answer_span(start_probs, end_probs, distance):
    ...
    if j <= k <= j + distance:
        ...
    return output_spans
```

```
start_probs = np.array([0.1, 0.5, 0.2, 0.1, 0.1]) 
end_probs   = np.array([0.4, 0.1, 0.3, 0.1, 0.1]) 
distance    = 2
```

```
j     k     score
-----------------
0     0     0.5     <- update best 
0     1     0.2
0     2     0.4
1     1     0.6     <- update best
1     2     0.8     <- update best
1     3     0.6
2     2     0.5
2     3     0.3
2     4     0.3
3     3     0.2
3     4     0.2
4     4     0.2
```
where *j* is the start index, *k* is the end index, and the third column is `start_probs[j] + end_probs[k]`
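The table above can be reproduced with a loop that visits only the allowed end positions for each start, which costs roughly `n * (distance + 1)` steps per example instead of `n * n`. This is a sketch for a single example, not one of the two required solutions; `best_span_windowed` is a hypothetical name.

```python
import numpy as np

def best_span_windowed(start_scores, end_scores, distance):
    """Best (start, end) with start <= end <= start + distance for one example."""
    n = len(start_scores)
    best_score, best_span = -np.inf, None
    for j in range(n):
        # Only the allowed end positions for this start are visited.
        for k in range(j, min(j + distance + 1, n)):
            score = start_scores[j] + end_scores[k]
            if score > best_score:            # strict '>' keeps the first best pair
                best_score, best_span = score, (j, k)
    return best_span, best_score

start_probs = np.array([0.1, 0.5, 0.2, 0.1, 0.1])
end_probs = np.array([0.4, 0.1, 0.3, 0.1, 0.1])
span, score = best_span_windowed(start_probs, end_probs, distance=2)
print(span, score)
```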

2. Apply an algorithm similar to Dijkstra's

```
start_probs = np.array([0.1, 0.5, 0.2, 0.1, 0.1]) 
end_probs   = np.array([0.4, 0.1, 0.3, 0.1, 0.1]) 

best_starts = np.argsort(start_probs*-1) =>  [1 2 0 3 4]
best_ends  = np.argsort(end_probs*-1)    =>  [0 2 1 3 4]
```


```
----------
iter 0:
[(best_starts, best_ends)] = (start_probs, end_probs)
[(0, 0)] = (1,0)                  
starts = 1 & ends 0             => (1,0) 0.9  <-- NOT BEST because of starts > ends
----------
iter 1:
[(0, 1)] = (1,2)
[(1, 0)] = (2,0)                 
[(1, 1)] = (2,2)                 
starts = 1 & ends = 2           => (1,2) 0.8  <-- BEST  (we will stop HERE) 
starts = 2 & ends = 0           => (2,0) 0.6  <-- NOT BEST because of starts > ends
starts = 2 & ends = 2           => (2,2) 0.5  <-- not BEST
----------
iter 2:                         => not necessary because we "already" found BEST
[(0, 2), (1, 2)] = (1,1), (2,1)
[(2, 0), (2, 1)] = (0,0), (0,2)
[(2, 2)] = (0,1)
...
```


In [133]:
start_probs = test_start = np.array([0.1, 0.5, 0.2, 0.1, 0.1]) 
end_probs = test_ends  = np.array([0.4, 0.1, 0.3, 0.1, 0.1]) 

best_starts = np.argsort(start_probs*-1)            # axis = 1 if you have [[] [] ... []]  where len is # of batch;
best_ends = np.argsort(end_probs*-1)
output_spans = []
distance = 10

step = 0
found = False
sorted_spans = []
# bound = 0
while not found:
    print("iter ", step)
    for j in range(step + 1):
        print(j, step)
        print(step, j)
    sorted_spans.extend([(start_probs[best_starts[j]] + end_probs[best_ends[step]],         # start_probs[i, best_starts[i,j]] + end_probs[i,best_ends[i,step]]
                                                                                            # where i in `range(len(start_probs)` (iterate # of batch)
                                best_starts[j], best_ends[step]) for j in range(step + 1)])
    sorted_spans.extend([(start_probs[best_starts[step]] + end_probs[best_ends[j]], 
                                best_starts[step], best_ends[j]) for j in range(step + 1)])
    sorted_spans.sort()

    while not found:
        if len(sorted_spans) > 0:
            curr = sorted_spans.pop()
            print("curr", curr)
            if curr[1] <= curr[2] <= curr[1] + distance:
                found = (curr[1], curr[2])
        else:
            break
    step += 1

iter  0
0 0
0 0
curr (0.9, 1, 0)
curr (0.9, 1, 0)
iter  1
0 1
1 0
1 1
1 1
curr (0.8, 1, 2)
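The cell above accepts the first valid span it pops, which works for this toy input but does not check whether a span that has not been added yet could still score higher. One possible way to close that gap, sketched below for a single example, is to keep a heap of candidate spans and only accept a candidate whose score is at least an upper bound on everything not yet added. The function name `best_span_best_first` and the "shell" expansion order are assumptions of this sketch, not the reference solution, and ties may be broken differently than in the slow version.

```python
import heapq
import numpy as np

def best_span_best_first(start_scores, end_scores, distance):
    """Best-first search for the best valid (start, end) of one example."""
    n = len(start_scores)
    starts = np.argsort(-start_scores)        # start indices, best first
    ends = np.argsort(-end_scores)            # end indices, best first
    heap = []                                 # max-heap via negated scores
    for step in range(n):
        # Add every rank pair (a, b) with max(a, b) == step, each exactly once.
        for r in range(step + 1):
            for a, b in {(r, step), (step, r)}:
                j, k = int(starts[a]), int(ends[b])
                heapq.heappush(heap, (-(start_scores[j] + end_scores[k]), j, k))
        # Upper bound on the score of any pair that has not been added yet.
        if step + 1 < n:
            bound = max(start_scores[starts[step + 1]] + end_scores[ends[0]],
                        start_scores[starts[0]] + end_scores[ends[step + 1]])
        else:
            bound = -np.inf
        # Candidates at or above the bound cannot be beaten by unseen pairs.
        while heap and -heap[0][0] >= bound:
            neg_score, j, k = heapq.heappop(heap)
            if j <= k <= j + distance:
                return (j, k), -neg_score
    return None

start_probs = np.array([0.1, 0.5, 0.2, 0.1, 0.1])
end_probs = np.array([0.4, 0.1, 0.3, 0.1, 0.1])
span, score = best_span_best_first(start_probs, end_probs, distance=2)
print(span, score)
```

In the worst case this still examines every pair, but when the individually best starts and ends form a valid span it stops after a few steps.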


3. Set up a PuLP problem (see Lecture 7 of DSCI 512)

For each row of start/end probabilities provided, set up a PuLP problem (See a `pulp` example at https://coin-or.github.io/pulp/CaseStudies/a_transportation_problem.html)

- Creates the `problem` variable to contain the problem data
```
problem = pulp.LpProblem("Beer Distribution Problem", pulp.LpMinimize)     
```
**[Note]** In our case we are maximizing, so use `pulp.LpMaximize`:
```
problem = pulp.LpProblem("QASpanSelection", pulp.LpMaximize)
```

--

```
Warehouses = ["A", "B"]                         # warehouse has their beers
Bars = ["1", "2", "3", "4", "5"]                # bar requires beers
                                                # there is a cost w -> b 

vars = pulp.LpVariable.dicts("Route", (Warehouses, Bars), 0, None, pulp.LpInteger)

```
**[Note]** In our case the variable dict is indexed by `(locs, indicies)`, where `locs` plays the role of the warehouses and `indicies` the role of the bars, and the bounds are `0, 1` (instead of `0, None`)

- The objective function is added to 'problem' first

```
problem += (
    pulp.lpSum([vars[w][b] * costs[w][b] for (w, b) in Routes]),
    "Sum_of_Transporting_Costs",
)
```

**[Note]** In our case the costs are `[ start_probs[i], end_probs[i] ]`, where $i$ is the index of the current example (row);

- **constraint 1** each of the start/end rows has exactly one 1 (i.e. only one index can be assigned for each of start/end): 
    - `problem += pulp.lpSum(x[loc][index] ... ) == 1` for each `loc` (start/end)

- **constraint 2** if an index of the start row has a 1, only the indices from `index` to `index + distance` can have a 1 in the end row:
    - `problem += pulp.lpSum(starts - ends...) <= 0` for each `index1` in `indicies`
    - where `starts = x[0][index1]` and `ends` are the `x[1][index2]` for `index2` in `range(index1, min(index1 + distance+1, len(indicies)))`
 

Then, 
```
problem.solve()
for index in range(len(Bars)):      # indices
    if vars[0][index].value() == 1:                            
        start = index
    if vars[1][index].value() == 1:
        end = index
```
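To see why the two constraints described above admit exactly the valid spans, the sketch below checks them directly on 0/1 indicator rows in plain Python, without PuLP. The helpers `satisfies_constraints` and `one_hot` are hypothetical and only illustrate the logic; note that constraint 2 needs one inequality per index rather than one per start/end pair.

```python
def satisfies_constraints(x_start, x_end, distance):
    """Check the two ILP constraints for 0/1 indicator rows of equal length."""
    n = len(x_start)
    # Constraint 1: exactly one index is selected in each row.
    if sum(x_start) != 1 or sum(x_end) != 1:
        return False
    # Constraint 2: for every index, the start indicator minus the end
    # indicators inside its allowed window must be <= 0.
    for index1 in range(n):
        window = range(index1, min(index1 + distance + 1, n))
        if x_start[index1] - sum(x_end[index2] for index2 in window) > 0:
            return False
    return True

def one_hot(position, n):
    """Indicator row with a single 1 at the given position."""
    return [1 if i == position else 0 for i in range(n)]

# Spans over 5 tokens with distance 2: (1, 2) is allowed, (1, 0) and (1, 4) are not.
for start, end in [(1, 2), (1, 0), (1, 4)]:
    print((start, end), satisfies_constraints(one_hot(start, 5), one_hot(end, 5), 2))
```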

In [55]:
#provided code
def select_best_answer_span_slow(start_probs, end_probs, distance):
    ''' returns a list of spans corresponding to the highest probability QA solution which satisfy the restriction that the end index must
    be within distance after the start index'''
    output_spans = []
    for i in range(start_probs.shape[0]):
        best_indicies = None
        best_prob = -9999 # essentially zero probability in log space, could also use -np.inf
        for j in range(start_probs.shape[1]):
            for k in range(end_probs.shape[1]):
                if j <= k <= j + distance:
                    prob = start_probs[i,j] + end_probs[i,k]
                    if prob > best_prob:
                        best_prob = prob
                        best_indicies = (j,k)
        output_spans.append(best_indicies)
    return output_spans
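
The triple loop above makes the search easy to read but slow. As a point of comparison, here is a minimal NumPy sketch of the same search that builds the full start-by-end score matrix for each example and masks out invalid pairs. The function name `select_best_answer_span_banded` is hypothetical, and the sketch assumes `start_probs` and `end_probs` have the same shape; it is not one of the solutions the lab asks for.

```python
import numpy as np

def select_best_answer_span_banded(start_probs, end_probs, distance):
    """Hypothetical vectorized variant of the brute-force span search."""
    n = start_probs.shape[1]
    j = np.arange(n)[:, None]   # candidate start indices (rows)
    k = np.arange(n)[None, :]   # candidate end indices (columns)
    # A pair (j, k) is valid when the end is at or after the start
    # and at most `distance` positions later.
    valid = (j <= k) & (k <= j + distance)
    output_spans = []
    for starts, ends in zip(start_probs, end_probs):
        # scores[j, k] = starts[j] + ends[k]; invalid pairs are set to -inf.
        scores = np.where(valid, starts[:, None] + ends[None, :], -np.inf)
        # argmax over the flattened matrix returns the first maximum in
        # row-major order, the same tie-breaking as the strict '>' in the loop.
        flat = int(np.argmax(scores))
        output_spans.append(divmod(flat, n))
    return output_spans

test_starts = np.array([[0.1, 0.5, 0.2, 0.1, 0.1], [0.3, 0.2, 0.2, 0.1, 0.1]])
test_ends = np.array([[0.4, 0.1, 0.3, 0.1, 0.1], [0.1, 0.1, 0.1, 0.1, 0.6]])
print(select_best_answer_span_banded(test_starts, test_ends, 2))  # [(1, 2), (2, 4)]
```

This trades memory for speed: each example allocates an `n x n` matrix, which is manageable for the sequence lengths BERT accepts but would not scale to much longer inputs.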


In [182]:
predicted_starts = []
gold_starts = []
predicted_ends = []
gold_ends = []
model.eval()

with torch.no_grad():
    for dev_text_batch, dev_span_batch, masks in dev_dataloader:
        dev_text_batch, masks = dev_text_batch.to(device), masks.to(device)
        output = model(dev_text_batch,attention_mask=masks)

        # you can copy from Ex2.2; 
        # 

print("Starts accuracy")
print(accuracy_score(gold_starts,predicted_starts))
print("Ends accuracy")
print(accuracy_score(gold_ends,predicted_ends))

Starts accuracy
0.6253843525794328
Ends accuracy
0.6545951486163307


### Your solutions:

In [208]:
def select_best_answer_span_v1(start_probs, end_probs, distance):
    '''Given two matrices of probabilities that each index of a text is the
    start or the end of an answer span, respectively, solves one ILP per row
    with the total probability as the objective to maximize, under the
    restriction that the end index must be no more than distance after the
    start. Returns a list of (start index, end index) tuples, one for each
    row of start/end_probs.'''
    output_spans = []
    locs = [0,1]                                    # role of the warehouses: 0 = start, 1 = end
    indicies = list(range(start_probs.shape[1]))    # role of the bars: token positions 0..len(start_probs[0])-1 (columns of the log_softmax output)
    
    for i in range(start_probs.shape[0]):
        probs =                                     # list of [[start_probs] and [end_probs]]

        problem = pulp.LpProblem("QASpanSelection", pulp.LpMaximize)
        x = pulp.LpVariable.dicts(...)
        
        #objective function
        problem += pulp.lpSum(...)

        # constraint #1 each of start/end row has only one 1
        for loc in locs:
            problem += pulp.lpSum(...) == 1

        # constraint #2, if index of start row has 1, only span of index to index + 
        # distance can have 1 in end row
        for index1 in indicies:
            problem += pulp.lpSum(...) <= 0

        problem.solve()

        for index in indicies:
            if x[0][index].value() == 1:                            
                start = index
            if x[1][index].value() == 1:
                end = index
        output_spans.append((start,end))
    return output_spans
    

In [206]:
def select_best_answer_span_v2(start_probs, end_probs, distance):
    '''Given two matrices of probabilities that each index of a text is the
    start or the end of an answer span, respectively, finds the highest
    probability spans under the restriction that the end index must be no more
    than distance after the start. Returns a list of (start index, end index)
    2-tuples corresponding to the best solution for each row of start/end_probs.'''
    best_starts = np.argsort(start_probs*-1, axis=1)
    best_ends = np.argsort(end_probs*-1, axis=1)
    output_spans = []
    for i in range(len(start_probs)):
        step = 0
        found = False
        sorted_spans = []
        bound = 0
        while not found:
            #  ...

        output_spans.append(found)

    return output_spans
                    

In [3]:
test_starts = np.array([[0.1,0.5,0.2,0.1,0.1], [0.3,0.2,0.2,0.1,0.1]])
test_ends = np.array([[0.4,0.1,0.3,0.1,0.1], [0.1,0.1,0.1,0.1,0.6]])
# assert select_best_answer_span_v1(test_starts,test_ends,2) == [(1,2),(2,4)]
# select_best_answer_span_v1(test_starts,test_ends,2)
# print("Success!")

In [None]:
distance = 20
predicted_starts = []
gold_starts = []
predicted_ends = []
gold_ends = []
model.eval()
with torch.no_grad():
    for dev_text_batch, dev_span_batch, masks in dev_dataloader:
        dev_text_batch, masks = dev_text_batch.to(device), masks.to(device)
        output = model(dev_text_batch,attention_mask=masks)
        start_scores = output.start_logits.to('cpu').detach()
        end_scores = output.end_logits.to('cpu').detach()
        start_probs = F.log_softmax(start_scores,dim=1).numpy()
        end_probs = F.log_softmax(end_scores,dim=1).numpy()  
        spans = select_best_answer_span_slow(start_probs, end_probs, distance)
        # spans = select_best_answer_span_v1(start_probs, end_probs, distance)
        # spans = select_best_answer_span_v2(start_probs, end_probs, distance)
        assert spans == select_best_answer_span_slow(start_probs, end_probs, distance)
print("Success!")