# Machine Reading Comprehension (MRC)

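This notebook is organized as follows:

    I. Presentation: explains the MRC problem.
    II. Model: describes the model used.
    III. Realization: implements training, evaluation and inference.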

## I. Presentation

### 1. Definition
 

MRC is a class of problems in which a model answers questions based on a context given as input. More precisely, we input a context text and a question, and the model outputs an answer based on both.

The answers can take several forms:

 - filling in blanks
 - multiple choice
 - free-form generated answers
 - extraction of a passage (span) from the context

Example: question-answering

**context**: Mercury is the closest planet to the sun. As such, it circles the sun faster than all the other planets, which is why Romans named it after the swift-footed messenger god Mercury. Mercury was known since at least Sumerian times roughly 5,000 years ago, where it was often associated with Nabu, the god of writing. Mercury was also given separate names for its appearance as both a morning star and as an evening star. Greek astronomers knew, however, that the two names referred to the same body. Heraclitus believed that both Mercury and Venus orbited the Sun, not the Earth.

    **Question1**: Which is the closest planet to the sun?
    **Answer1**: Mercury
    **Question2**: Who is Nabu?
    **Answer2**: the god of writing
    **Question3**: What did Heraclitus believe?
    **Answer3**: both Mercury and Venus orbited the Sun, not the Earth

In this extractive setting, the labels are the start and end positions of each answer within the context.
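
As a minimal sketch of what such labels look like, the snippet below locates an answer inside a shortened context by character position. The shortened context string and the use of `str.find` are illustrative assumptions; the dataset used later in this notebook ships the positions directly.

```python
# Illustrative only: derive character-level start/end labels for an extractive answer.
context = "Mercury was often associated with Nabu, the god of writing."
answer = "the god of writing"

start = context.find(answer)   # index of the first character of the answer
end = start + len(answer)      # exclusive end index

print(start, end)
print(context[start:end])      # prints the answer text again
```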

### 2. Evaluation

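The metrics:

* Exact Match (EM): 1 if the prediction matches the ground truth exactly, 0 otherwise
* F1: the harmonic mean of token-level precision and recall between the prediction and the ground truth

Example: 
* prediction: Paris
* ground truth: Paris of France

* EM = 0
* precision = 1/1, recall = 1/3
* F1 = (2 * 1/1 * 1/3) / (1/1 + 1/3) = 0.5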
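
The arithmetic above can be reproduced with a short sketch. The function names `exact_match` and `f1_score` are hypothetical, and the whitespace tokenization and lowercasing are simplifying assumptions; the evaluation script used later in this notebook may normalize text differently.

```python
from collections import Counter

def exact_match(prediction: str, truth: str) -> int:
    """Return 1 if both strings are identical after trimming and lowercasing, else 0."""
    return int(prediction.strip().lower() == truth.strip().lower())

def f1_score(prediction: str, truth: str) -> float:
    """Token-level F1 between a prediction and a ground-truth answer."""
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()
    # multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "Paris of France"))  # 0
print(f1_score("Paris", "Paris of France"))     # 0.5
```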

### 3. Data processing

The question and the context are concatenated into a single input sequence:

        ______________________________________________________________________________
        |CLS|       Question       |SEP|                    context              |SEP|
        ------------------------------------------------------------------------------

If the context is too long, we can either:

* truncate it and ignore everything beyond the maximum length, which wastes data
* use a sliding window to split the text into overlapping sections, which is more complex to implement
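
The sliding-window option can be sketched in plain Python on a list of token ids. The helper `sliding_windows` is hypothetical, and `stride` is taken here to mean the number of tokens shared by two consecutive chunks, which is an assumption matching how the term is used later in this notebook. A real tokenizer additionally reserves room for the question and the special tokens in every chunk.

```python
def sliding_windows(tokens, window, stride):
    """Split `tokens` into chunks of at most `window` items.

    Consecutive chunks overlap by `stride` items, so text near a chunk
    boundary appears in two chunks instead of being cut in half.
    """
    if stride >= window:
        raise ValueError("stride must be smaller than window")
    step = window - stride
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks

for chunk in sliding_windows(list(range(10)), window=4, stride=2):
    print(chunk)
```

Plain truncation corresponds to keeping only the first chunk.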



## II. Model

The model is `AutoModelForQuestionAnswering`, loaded with the BERT base checkpoint.

It uses BERT to encode the input text, then scores every token against the classes we define (`num_labels`). The output therefore has shape [batch, seq_len, num_labels].

In the class's init function:

```python
    self.num_labels = config.num_labels
    self.bert = BertModel(config, add_pooling_layer=False)
    ...
    self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)
```
Here `num_labels` is 2: one logit for the answer start and one for the answer end. This is already the default value, so we do not need to set it.

In the forward function:

```python
    outputs = self.bert(
            input_ids,
            ...
        )
    sequence_output = outputs[0]

    logits = self.qa_outputs(sequence_output)           # [batch, seq_len, 2]
    start_logits, end_logits = logits.split(1, dim=-1)  # [batch, seq_len, 1] for each
    start_logits = start_logits.squeeze(-1).contiguous()
    end_logits = end_logits.squeeze(-1).contiguous()
```

The input is encoded by BERT. The output of BERT is then passed through a linear layer that projects each token's hidden state to two logits: one scoring the token as the answer start, the other as the answer end.
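
To make the projection concrete, here is a toy sketch in plain Python, without PyTorch. All numbers are made up, the hidden size is 2 instead of 768, and the helper `linear` is hypothetical. It only illustrates how one linear layer turns per-token hidden vectors into separate start and end logits.

```python
# Toy hidden states, shape [seq_len=3, hidden_size=2] (made-up values).
hidden = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]
# Toy linear layer, shape [num_labels=2, hidden_size=2] (made-up values).
weight = [
    [0.5, -1.0],   # row producing the start logit
    [-0.5, 2.0],   # row producing the end logit
]
bias = [0.0, 0.1]

def linear(x, w, b):
    """Apply y = W x + b to a single token vector x."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + bi for row, bi in zip(w, b)]

logits = [linear(h, weight, bias) for h in hidden]   # [seq_len, 2]
start_logits = [pair[0] for pair in logits]          # [seq_len]
end_logits = [pair[1] for pair in logits]            # [seq_len]

# The highest-scoring positions are the predicted answer boundaries.
pred_start = max(range(len(start_logits)), key=start_logits.__getitem__)
pred_end = max(range(len(end_logits)), key=end_logits.__getitem__)
print(pred_start, pred_end)
```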

## III. Realization

In [1]:
# select which GPU(s) to use
# I have 2 GPUs and only want to use one, so I set this explicitly.
# This cell must run first, before any library initializes CUDA.
# Skip it if you don't need it.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # or "0,1" for multiple GPUs
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [2]:
## define the repos for data and model

# data

ckp_data = "cjlovering/natural-questions-short"

# model

ckp = "google-bert/bert-base-uncased"

### 1. Import

In [3]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, TrainingArguments, Trainer, DefaultDataCollator


### 2. Load dataset

In [4]:
data = load_dataset(ckp_data, cache_dir="../tmp/mrc")
data

DatasetDict({
    train: Dataset({
        features: ['has_correct_context', 'id', 'questions', 'name', 'contexts', 'answers'],
        num_rows: 13933
    })
    validation: Dataset({
        features: ['has_correct_context', 'id', 'questions', 'name', 'contexts', 'answers'],
        num_rows: 871
    })
})

In [6]:
data["train"][0]

{'has_correct_context': True,
 'id': '5495190773098085777',
 'questions': [{'input_text': "when does stephanie die in grey's anatomy"}],
 'name': "Stephanie Edwards (Grey's Anatomy)",
 'contexts': "Dr. Stephanie Edwards Grey 's Anatomy character The Season 12 Promotional Photo of Jerrika Hinton as Stephanie Edwards First appearance Going , Going , Gone ( 9.01 ) September 27 , 2012 ( as recurring cast ) `` Seal Our Fate '' ( 10.01 ) September 26 , 2013 ( as series regular ) Last appearance `` Ring of Fire '' ( 13.24 ) May 18 , 2017 Created by Shonda Rhimes Portrayed by Jerrika Hinton Information Full name Stephanie Edwards Nickname ( s ) Grumpy Steph Dr. Lavender Title M.D. Significant other ( s ) Jackson Avery Kyle Diaz ( deceased )",
 'answers': [{'candidate_id': 0,
   'input_text': 'short',
   'span_end': 324,
   'span_start': 296,
   'span_text': "`` Ring of Fire '' ( 13.24 )"}]}

### 3. Split Data

The downloaded dataset is already split into train and validation sets, so we skip this step.

### 4. Tokenization

In [7]:
# load tokenizer

tokenizer = AutoTokenizer.from_pretrained(ckp)
tokenizer

BertTokenizerFast(name_or_path='google-bert/bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [8]:
# truncation="only_second" truncates only the second sequence (text_pair, here the context) when the input exceeds max_length

# token_type_ids indicate whether a token comes from the question (0) or the context (1)

ind = 0
sample = data["train"][ind]
tok = tokenizer(text=sample["questions"][0]['input_text'], 
                text_pair=sample["contexts"],
                max_length=128, truncation="only_second", padding="max_length")
tok

{'input_ids': [101, 2043, 2515, 11496, 3280, 1999, 4462, 1005, 1055, 13336, 102, 2852, 1012, 11496, 7380, 4462, 1005, 1055, 13336, 2839, 1996, 2161, 2260, 10319, 6302, 1997, 15333, 18752, 2912, 9374, 2239, 2004, 11496, 7380, 2034, 3311, 2183, 1010, 2183, 1010, 2908, 1006, 1023, 1012, 5890, 1007, 2244, 2676, 1010, 2262, 1006, 2004, 10694, 3459, 1007, 1036, 1036, 7744, 2256, 6580, 1005, 1005, 1006, 2184, 1012, 5890, 1007, 2244, 2656, 1010, 2286, 1006, 2004, 2186, 3180, 1007, 2197, 3311, 1036, 1036, 3614, 1997, 2543, 1005, 1005, 1006, 2410, 1012, 2484, 1007, 2089, 2324, 1010, 2418, 2580, 2011, 26822, 8943, 1054, 14341, 2229, 6791, 2011, 15333, 18752, 2912, 9374, 2239, 2592, 2440, 2171, 11496, 7380, 8367, 1006, 1055, 1007, 24665, 24237, 2100, 3357, 2232, 2852, 1012, 20920, 2516, 1049, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

In [8]:
# show some info

print(sample["questions"][0])
print(sample["contexts"])
print(tok["input_ids"])
print(len(tok["input_ids"]))

{'input_text': "when does stephanie die in grey's anatomy"}
Dr. Stephanie Edwards Grey 's Anatomy character The Season 12 Promotional Photo of Jerrika Hinton as Stephanie Edwards First appearance Going , Going , Gone ( 9.01 ) September 27 , 2012 ( as recurring cast ) `` Seal Our Fate '' ( 10.01 ) September 26 , 2013 ( as series regular ) Last appearance `` Ring of Fire '' ( 13.24 ) May 18 , 2017 Created by Shonda Rhimes Portrayed by Jerrika Hinton Information Full name Stephanie Edwards Nickname ( s ) Grumpy Steph Dr. Lavender Title M.D. Significant other ( s ) Jackson Avery Kyle Diaz ( deceased )
[101, 2043, 2515, 11496, 3280, 1999, 4462, 1005, 1055, 13336, 102, 2852, 1012, 11496, 7380, 4462, 1005, 1055, 13336, 2839, 1996, 2161, 2260, 10319, 6302, 1997, 15333, 18752, 2912, 9374, 2239, 2004, 11496, 7380, 2034, 3311, 2183, 1010, 2183, 1010, 2908, 1006, 1023, 1012, 5890, 1007, 2244, 2676, 1010, 2262, 1006, 2004, 10694, 3459, 1007, 1036, 1036, 7744, 2256, 6580, 1005, 1005, 1006, 2184, 101

In [9]:
# the tokenizer output contains 3 components:
# - input_ids: the combined token ids of the question and the context
# - token_type_ids: indicate whether a token belongs to the question (0) or the context (1)
# - attention_mask: 1 for real tokens, 0 for padding

tok.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [10]:
# to show which part of the tokens belongs to question and which to context
# the second term in each tuple designates whether it is question or context

print([(t, l) for t, l in zip(tok["input_ids"], tok["token_type_ids"])])

[(101, 0), (2043, 0), (2515, 0), (11496, 0), (3280, 0), (1999, 0), (4462, 0), (1005, 0), (1055, 0), (13336, 0), (102, 0), (2852, 1), (1012, 1), (11496, 1), (7380, 1), (4462, 1), (1005, 1), (1055, 1), (13336, 1), (2839, 1), (1996, 1), (2161, 1), (2260, 1), (10319, 1), (6302, 1), (1997, 1), (15333, 1), (18752, 1), (2912, 1), (9374, 1), (2239, 1), (2004, 1), (11496, 1), (7380, 1), (2034, 1), (3311, 1), (2183, 1), (1010, 1), (2183, 1), (1010, 1), (2908, 1), (1006, 1), (1023, 1), (1012, 1), (5890, 1), (1007, 1), (2244, 1), (2676, 1), (1010, 1), (2262, 1), (1006, 1), (2004, 1), (10694, 1), (3459, 1), (1007, 1), (1036, 1), (1036, 1), (7744, 1), (2256, 1), (6580, 1), (1005, 1), (1005, 1), (1006, 1), (2184, 1), (1012, 1), (5890, 1), (1007, 1), (2244, 1), (2656, 1), (1010, 1), (2286, 1), (1006, 1), (2004, 1), (2186, 1), (3180, 1), (1007, 1), (2197, 1), (3311, 1), (1036, 1), (1036, 1), (3614, 1), (1997, 1), (2543, 1), (1005, 1), (1005, 1), (1006, 1), (2410, 1), (1012, 1), (2484, 1), (1007, 1), (2

In [11]:
## offset mapping

samples = data["train"].select(range(5))

question = []
for i in range(5):
    question.append(samples[i]["questions"][0]["input_text"])


toks = tokenizer(text=question, 
                 text_pair=samples["contexts"],
                 max_length=128, 
                 truncation="only_second", 
                 padding="max_length", 
                 return_offsets_mapping=True
                )

In [12]:
toks["offset_mapping"][1]

[(0, 0),
 (0, 4),
 (5, 8),
 (9, 12),
 (13, 16),
 (17, 21),
 (22, 24),
 (25, 28),
 (29, 35),
 (36, 44),
 (0, 0),
 (0, 3),
 (4, 10),
 (11, 19),
 (20, 23),
 (24, 26),
 (27, 35),
 (36, 43),
 (44, 50),
 (51, 56),
 (57, 63),
 (64, 71),
 (72, 75),
 (76, 78),
 (79, 86),
 (87, 93),
 (94, 97),
 (97, 100),
 (100, 106),
 (107, 116),
 (117, 123),
 (124, 127),
 (128, 132),
 (133, 136),
 (137, 138),
 (139, 141),
 (142, 145),
 (146, 151),
 (152, 161),
 (162, 164),
 (165, 173),
 (174, 176),
 (177, 186),
 (187, 192),
 (193, 194),
 (194, 195),
 (196, 202),
 (203, 205),
 (206, 211),
 (212, 214),
 (215, 216),
 (217, 221),
 (222, 223),
 (224, 227),
 (228, 235),
 (236, 245),
 (246, 248),
 (249, 253),
 (254, 256),
 (257, 258),
 (259, 263),
 (264, 265),
 (266, 270),
 (271, 273),
 (274, 281),
 (282, 284),
 (285, 292),
 (293, 300),
 (301, 303),
 (304, 310),
 (311, 314),
 (315, 321),
 (322, 323),
 (324, 330),
 (331, 339),
 (340, 348),
 (349, 354),
 (355, 358),
 (359, 366),
 (367, 370),
 (371, 379),
 (380, 381),
 

In [13]:
## sequence id

# Nones are the special tokens, and 0s are the question, and 1s are the context

toks.sequence_ids(1)

[None,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 None,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 None]

#### 4.1. Simple truncate

As mentioned in the presentation section, there are two strategies for long contexts: truncation and a sliding window. We implement truncation first in section 4.1, then the sliding window in section 4.2.

In [14]:
# Convert the answer's character positions to token positions.
# The tokens are the concatenation of question + context.
# The printed values are the start and end token positions of the context and of the answer.

for ind, offset in enumerate(toks["offset_mapping"]):

    answer = samples[ind]["answers"][0]
    text = answer["span_text"]

    # the answer position in the context
    start = int(answer["span_start"])
    end = int(answer["span_end"])

    # find context pos in tokens
    context_start = toks.sequence_ids(ind).index(1) # get the starting index of context in the tokens
    context_end = toks.sequence_ids(ind).index(None, context_start) - 1 # get the ending index of context in the tokens

    # convert answer position to token pos
    if offset[context_start][1] > end or offset[context_end][0] < start:
        answer_start = 0
        answer_end = 0

    else:
        tok_id = context_start
        while tok_id <= context_end and offset[tok_id][0] < start:
            tok_id += 1
        answer_start = tok_id

        tok_id = context_end
        while tok_id >= context_start and offset[tok_id][1] > end:
            tok_id -= 1

        answer_end = tok_id
        
    print(text, start, end, context_start, context_end, answer_start, answer_end)

`` Ring of Fire '' ( 13.24 ) 296 324 11 126 78 89
to counter Soviet geopolitical expansion during the Cold War 76 136 11 126 23 33
Cara Black 37 47 19 82 25 26
vice president 530 544 12 126 106 107
Shemar Moore ( 1994 -- 2005 , 2014 ) Darius McCrary 98 149 12 126 28 42
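
The conversion in the cell above can be isolated into a small function and checked on hand-written offsets, without loading a tokenizer. The function name `char_span_to_token_span` and the toy offsets are hypothetical. The logic mirrors the loop above, including the `(0, 0)` label used when the answer does not overlap the visible context.

```python
def char_span_to_token_span(offsets, context_start, context_end, start_char, end_char):
    """Map a character span (end exclusive) to (start_token, end_token).

    `offsets` holds one (char_start, char_end) pair per token. Returns (0, 0),
    i.e. the [CLS] position, when the answer lies entirely outside the context tokens.
    """
    if offsets[context_start][1] > end_char or offsets[context_end][0] < start_char:
        return 0, 0
    tok = context_start
    while tok <= context_end and offsets[tok][0] < start_char:
        tok += 1                      # skip tokens that begin before the answer
    answer_start = tok
    tok = context_end
    while tok >= context_start and offsets[tok][1] > end_char:
        tok -= 1                      # skip tokens that end after the answer
    return answer_start, tok

# Toy input: [CLS] who wrote it [SEP] nabu is the god of writing [SEP]
offsets = [(0, 0), (0, 3), (4, 9), (10, 12), (0, 0),
           (0, 4), (5, 7), (8, 11), (12, 15), (16, 18), (19, 26), (0, 0)]

# "the god of writing" covers characters 8..26 of the context
print(char_span_to_token_span(offsets, 5, 10, 8, 26))    # (7, 10)
# an answer beyond the visible context gets the (0, 0) label
print(char_span_to_token_span(offsets, 5, 10, 40, 50))   # (0, 0)
```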


In [15]:
## define the data process

# put all previous operations together and define a function

def process_truncate(samples):

    # redefine column of question
    # this differs slightly from the previous tests: with batched=True, map passes
    # a dict of columns (feature name -> list of values), so we index by feature name
    # first and by row second

    new_column = []
    
    for sample in samples["questions"]:

        new_column.append(sample[0]["input_text"])
    
    # create new column of question

    samples["question"] = new_column

    # tokenization
    toks = tokenizer(text=samples["question"], 
                     text_pair=samples["contexts"],
                     max_length=128, truncation="only_second", padding="max_length", return_offsets_mapping=True)
    

    # convert the answer position to token positions
    start_pos = []
    end_pos = []

    for ind, offset in enumerate(toks["offset_mapping"]):

        answer = samples["answers"][ind][0]
        text = answer["span_text"]

        # text pos for answer
        start = int(answer["span_start"])
        end = int(answer["span_end"])

        # find context pos in tokens
        context_start = toks.sequence_ids(ind).index(1)
        context_end = toks.sequence_ids(ind).index(None, context_start) - 1

        # convert pos to token pos
        if offset[context_start][1] > end or offset[context_end][0] < start:

            answer_start = 0
            answer_end = 0

        else:

            tok_id = context_start
            while tok_id <= context_end and offset[tok_id][0] < start:
                tok_id += 1
            answer_start = tok_id

            tok_id = context_end
            while tok_id >= context_start and offset[tok_id][1] > end:
                tok_id -= 1

            answer_end = tok_id

        start_pos.append(answer_start)
        end_pos.append(answer_end)

    # the Trainer relies on the feature names of the dataset
    # for MRC, the start and end positions of the answers must be named
    # - start_positions
    # - end_positions
    # Otherwise training fails with: ValueError: The model did not return a loss from the inputs, 
    # only the following keys: logits. For reference, the inputs it received are input_ids,attention_mask.

    toks["start_positions"] = start_pos # be careful with feature names
    toks["end_positions"] = end_pos

    return toks  

In [16]:
# tokenization

tokenized_data_truncate = data.map(process_truncate, batched=True, remove_columns=data["train"].column_names)
tokenized_data_truncate

Map:   0%|          | 0/13933 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['question', 'input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'start_positions', 'end_positions'],
        num_rows: 13933
    })
    validation: Dataset({
        features: ['question', 'input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'start_positions', 'end_positions'],
        num_rows: 871
    })
})

#### 4.2. Sliding window

In the previous section, we truncated each input to the maximum length and discarded the rest, which wastes data.
Another way to handle long contexts is to cut them into smaller chunks that overlap, so that information near the chunk boundaries is preserved.

In [9]:
# add the parameters that enable overflowing tokens

samples = data["train"].select(range(5))
new_column = []
for i in range(5):
    new_column.append(samples[i]["questions"][0]["input_text"])

samples = samples.add_column("question", new_column)
toks = tokenizer(text=samples["question"], 
                    text_pair=samples["contexts"],
                    return_overflowing_tokens=True,     # to activate sliding truncation
                    stride=64,                          # define the overlapping length, default value is 0 (no overlap)
                    max_length=128, 
                    truncation="only_second", 
                    padding="max_length", 
                    return_offsets_mapping=True)

In [18]:
# there is an extra key named "overflow_to_sample_mapping"

toks.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'overflow_to_sample_mapping'])

In [19]:
# the extra field contains the index of the original sample for each chunk
# We selected 5 samples from the original dataset; after the overflowing process,
# we get 11 chunks, obtained by cutting the contexts into inputs of at most 128 tokens
# (question included) with an overlap (stride) of 64 tokens

toks["overflow_to_sample_mapping"]

[0, 0, 1, 1, 2, 3, 3, 3, 4, 4, 4]
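
As a small sketch of how this mapping can be used, the snippet below groups chunk indices by their original sample, reusing the list printed above. The variable name `chunks_by_sample` is hypothetical.

```python
from collections import defaultdict

# Copied from the output above: entry i is the original sample index of chunk i.
overflow_to_sample_mapping = [0, 0, 1, 1, 2, 3, 3, 3, 4, 4, 4]

chunks_by_sample = defaultdict(list)
for chunk_index, sample_index in enumerate(overflow_to_sample_mapping):
    chunks_by_sample[sample_index].append(chunk_index)

print(dict(chunks_by_sample))
# {0: [0, 1], 1: [2, 3], 2: [4], 3: [5, 6, 7], 4: [8, 9, 10]}
```

Sample 2 fits into a single chunk, while sample 3 is spread over three.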

In [20]:
# we can show the texts of the chunk to make sure that the cuts were correct

for text in tokenizer.batch_decode(toks["input_ids"][:3]):
    print(text)

[CLS] when does stephanie die in grey's anatomy [SEP] dr. stephanie edwards grey's anatomy character the season 12 promotional photo of jerrika hinton as stephanie edwards first appearance going, going, gone ( 9. 01 ) september 27, 2012 ( as recurring cast ) ` ` seal our fate'' ( 10. 01 ) september 26, 2013 ( as series regular ) last appearance ` ` ring of fire'' ( 13. 24 ) may 18, 2017 created by shonda rhimes portrayed by jerrika hinton information full name stephanie edwards nickname ( s ) grumpy steph dr. lavender title m [SEP]
[CLS] when does stephanie die in grey's anatomy [SEP] 10. 01 ) september 26, 2013 ( as series regular ) last appearance ` ` ring of fire'' ( 13. 24 ) may 18, 2017 created by shonda rhimes portrayed by jerrika hinton information full name stephanie edwards nickname ( s ) grumpy steph dr. lavender title m. d. significant other ( s ) jackson avery kyle diaz ( deceased ) [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [P

In [21]:
# Same process as before, except that we iterate over the chunks and use
# overflow_to_sample_mapping to look up the original sample of each chunk

for ind, _ in enumerate(toks["overflow_to_sample_mapping"]):

    answer = samples[toks["overflow_to_sample_mapping"][ind]]["answers"][0]
    text = answer["span_text"]

    # text pos for answer
    start = int(answer["span_start"])
    end = int(answer["span_end"])

    # find context pos in tokens
    context_start = toks.sequence_ids(ind).index(1)
    context_end = toks.sequence_ids(ind).index(None, context_start) - 1

    offset = toks.get("offset_mapping")[ind]

    # convert pos to token pos
    if offset[context_start][1] > end or offset[context_end][0] < start:
        answer_start = 0
        answer_end = 0

    else:
        tok_id = context_start
        while tok_id <= context_end and offset[tok_id][0] < start:
            tok_id += 1
        answer_start = tok_id
        tok_id = context_end
        while tok_id >= context_start and offset[tok_id][1] > end:
            tok_id -= 1

        answer_end = tok_id
        
    print(text, start, end, context_start, context_end, answer_start, answer_end)

`` Ring of Fire '' ( 13.24 ) 296 324 11 126 78 89
`` Ring of Fire '' ( 13.24 ) 296 324 11 89 26 37
to counter Soviet geopolitical expansion during the Cold War 76 136 11 126 23 33
to counter Soviet geopolitical expansion during the Cold War 76 136 11 113 0 0
Cara Black 37 47 19 82 25 26
vice president 530 544 12 126 106 107
vice president 530 544 12 126 55 56
vice president 530 544 12 105 0 0
Shemar Moore ( 1994 -- 2005 , 2014 ) Darius McCrary 98 149 12 126 28 42
Shemar Moore ( 1994 -- 2005 , 2014 ) Darius McCrary 98 149 12 126 0 0
Shemar Moore ( 1994 -- 2005 , 2014 ) Darius McCrary 98 149 12 82 0 0


In [10]:
# preprocessing that takes the sliding-window truncation into account

def process_overflow(samples):

    # redefine column of question

    question = []
    
    for sample in samples["questions"]:

        question.append(sample[0]["input_text"])

    # tokenization
    toks = tokenizer(text=question, 
                    text_pair=samples["contexts"],
                    return_overflowing_tokens=True, # to activate sliding truncation
                    stride=64, # define the overlapping length, default value is 0 (no overlap)
                    max_length=128, 
                    truncation="only_second", 
                    padding="max_length", 
                    return_offsets_mapping=True)
    
    start_pos = []
    end_pos = []
    ids = []

    sample_map = toks.pop("overflow_to_sample_mapping")

    for ind, _ in enumerate(sample_map):

        answer = samples["answers"][sample_map[ind]][0]
        
        text = answer["span_text"]

        # text pos for answer
        start = int(answer["span_start"])
        end = int(answer["span_end"])

        # find context pos in tokens
        context_start = toks.sequence_ids(ind).index(1)
        context_end = toks.sequence_ids(ind).index(None, context_start) - 1

        offset = toks.get("offset_mapping")[ind]
        

        # convert pos to token pos
        if offset[context_start][1] > end or offset[context_end][0] < start:

            answer_start = 0
            answer_end = 0

        else:

            tok_id = context_start
            while tok_id <= context_end and offset[tok_id][0] < start:
                tok_id += 1
            answer_start = tok_id

            tok_id = context_end
            while tok_id >= context_start and offset[tok_id][1] > end:
                tok_id -= 1

            answer_end = tok_id

        ids.append(samples["id"][sample_map[ind]]) # keep the id to identify which original sample each chunk came from
        start_pos.append(answer_start)
        end_pos.append(answer_end)

        toks["offset_mapping"][ind] = [
            v if toks.sequence_ids(ind)[k] == 1 else None
            for k, v in enumerate(toks["offset_mapping"][ind])
        ]

    toks["start_positions"] = start_pos # be careful with feature names
    toks["end_positions"] = end_pos
    toks["id"] = ids

    return toks  

In [11]:
# tokenization

tokenized_data_overflow = data.map(process_overflow, batched=True, remove_columns=data["train"].column_names)
tokenized_data_overflow

DatasetDict({
    train: Dataset({
        features: ['id', 'input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'start_positions', 'end_positions'],
        num_rows: 32010
    })
    validation: Dataset({
        features: ['id', 'input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'start_positions', 'end_positions'],
        num_rows: 2022
    })
})

### 5. Download model

In [12]:
# load the same base checkpoint as in NER, this time with a question-answering head
model = AutoModelForQuestionAnswering.from_pretrained(ckp)

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [25]:
model

BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, 

### 6. Define metrics

In [13]:
# extract predictions and the corresponding labels


import numpy as np
from collections import defaultdict

def result(start_logits, end_logits, samples, examples) :

    preds = {}
    refs = {}

    example_by_sample = defaultdict(list)

    for i, sample_id in enumerate(samples["id"]):

        example_by_sample[sample_id].append(i)


    n_best = 20
    max_answer_len = 30

    for sample in samples:

        sample_id = sample["id"]
        context = sample["contexts"]
        answers =[]

        for example_id in example_by_sample[sample_id]:

            start_logit = start_logits[example_id]
            end_logit = end_logits[example_id]

            offset = examples[example_id]["offset_mapping"]

            start_indices = np.argsort(start_logit)[::-1][:n_best].tolist()
            end_indices = np.argsort(end_logit)[::-1][:n_best].tolist()

            for s_ind in start_indices:
                for e_ind in end_indices:

                    if offset[s_ind] is None or offset[e_ind] is None:
                        continue
                    if e_ind < s_ind or e_ind - s_ind + 1 > max_answer_len:
                        continue

                    answers.append({
                        "text": context[offset[s_ind][0]: offset[e_ind][1]],
                        "score": start_logit[s_ind] + end_logit[e_ind]
                    })

        if len(answers) > 0:
            
            best_answer = max(answers, key=lambda x: x["score"])

            preds[sample_id] = best_answer["text"]
            
        else:
            preds[sample_id] = ""

        refs[sample_id] = sample["answers"][0]["span_text"]

    return preds, refs
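
The core of the search above, picking the best valid (start, end) pair among the top-scoring candidates, can be sketched on toy logits. The function `best_span` and the numbers are hypothetical. Unlike `result`, this sketch ignores offsets and works on a single chunk.

```python
def best_span(start_logits, end_logits, n_best=2, max_answer_len=3):
    """Return (start, end, score) of the best valid span, or None if there is none."""
    # keep only the n_best highest-scoring start and end positions
    top_starts = sorted(range(len(start_logits)),
                        key=lambda i: start_logits[i], reverse=True)[:n_best]
    top_ends = sorted(range(len(end_logits)),
                      key=lambda i: end_logits[i], reverse=True)[:n_best]
    best = None
    for s in top_starts:
        for e in top_ends:
            # discard spans that end before they start or that are too long
            if e < s or e - s + 1 > max_answer_len:
                continue
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best

start_logits = [0.1, 2.0, 0.3, 1.5]
end_logits = [0.2, 0.1, 1.8, 2.5]

print(best_span(start_logits, end_logits))                    # (1, 3, 4.5)
print(best_span(start_logits, end_logits, max_answer_len=2))  # (3, 3, 4.0)
```

Restricting the search to the `n_best` candidates keeps the pairwise loop small; the length limit rejects implausibly long answers.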

In [14]:
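from eval_cmrc import evaluate_cmrc

def metric(pred):

    # pred[0] holds the (start_logits, end_logits) pair returned by the model
    s_logits, e_logits = pred[0]

    # tokenized_data is assigned in section 8, before this function is called;
    # the logits must line up one-to-one with the tokenized validation chunks
    if s_logits.shape[0] != len(tokenized_data["validation"]):
        raise ValueError("predictions do not match the tokenized validation set")

    p, r = result(s_logits, e_logits, data["validation"], tokenized_data["validation"])

    return evaluate_cmrc(p, r)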

### 7. Training arguments

In [15]:
args = TrainingArguments(
        output_dir="../tmp/checkpoints",
        per_device_train_batch_size=64,
        per_device_eval_batch_size=128,
        eval_strategy="epoch",
        save_strategy="epoch",
        logging_steps=10,
)

### 8. Trainer

In [16]:
tokenized_data = tokenized_data_overflow

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["validation"],
    data_collator=DefaultDataCollator(),
    compute_metrics=metric
)

### 9. Train + eval

In [17]:
trainer.evaluate(eval_dataset=tokenized_data["validation"])

{'eval_loss': 4.901897430419922,
 'eval_avg': 0.2343407000742645,
 'eval_f1': 0.08980654366173216,
 'eval_em': 0.3788748564867968,
 'eval_total': 871,
 'eval_skip': 0,
 'eval_runtime': 24.4983,
 'eval_samples_per_second': 82.536,
 'eval_steps_per_second': 0.653}

In [18]:
trainer.train()

Epoch,Training Loss,Validation Loss,Avg,F1,Em,Total,Skip
1,0.9803,1.030865,0.154136,0.059134,0.249139,871,0
2,0.7023,1.029207,0.151903,0.062705,0.241102,871,0
3,0.3874,1.110168,0.152483,0.063864,0.241102,871,0


TrainOutput(global_step=1503, training_loss=0.8424471972865893, metrics={'train_runtime': 536.643, 'train_samples_per_second': 178.946, 'train_steps_per_second': 2.801, 'total_flos': 6273081887339520.0, 'train_loss': 0.8424471972865893, 'epoch': 3.0})

In [19]:
trainer.evaluate(eval_dataset=tokenized_data["validation"])

{'eval_loss': 1.1101675033569336,
 'eval_avg': 0.15248326279167984,
 'eval_f1': 0.06386434418267078,
 'eval_em': 0.24110218140068887,
 'eval_total': 871,
 'eval_skip': 0,
 'eval_runtime': 21.62,
 'eval_samples_per_second': 93.525,
 'eval_steps_per_second': 0.74,
 'epoch': 3.0}

### 10. Inference

In [20]:
from transformers import pipeline

In [21]:
pipe = pipeline("question-answering", model=model, tokenizer=tokenizer)
pipe

<transformers.pipelines.question_answering.QuestionAnsweringPipeline at 0x7fd400bfffa0>

In [22]:
pipe(question="who is john", context="John is a boxer who works in europ")

{'score': 0.2047954797744751,
 'start': 8,
 'end': 34,
 'answer': 'a boxer who works in europ'}
