In [None]:
# default_exp data.question_answering


In [None]:
# all_slow


In [None]:
#hide
%reload_ext autoreload
%autoreload 2
%matplotlib inline

# data.question_answering

> This module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data for question/answering tasks.

In [None]:
# export
import ast
from functools import reduce

from fastcore.all import *
from fastai.data.block import DataBlock, CategoryBlock, ColReader, ColSplitter
from fastai.imports import *
from fastai.losses import CrossEntropyLossFlat
from fastai.torch_core import *
from fastai.torch_imports import *
from transformers import AutoModelForQuestionAnswering, logging, PretrainedConfig, PreTrainedTokenizerBase, PreTrainedModel

from blurr.utils import BLURR
from blurr.data.core import HF_BaseInput, HF_AfterBatchTransform, HF_BeforeBatchTransform, first_blurr_tfm

logging.set_verbosity_error()


In [None]:
# hide_input
import pdb

from fastai.data.core import DataLoader, DataLoaders, TfmdDL
from fastai.data.external import untar_data, URLs
from fastai.data.transforms import *
from fastcore.test import *
from nbverbose.showdoc import show_doc

from blurr.utils import print_versions
from blurr.data.core import HF_TextBlock

os.environ["TOKENIZERS_PARALLELISM"] = "false"
print("What we're running with at the time this documentation was generated:")
print_versions("torch fastai transformers")


What we're running with at the time this documentation was generated:
torch: 1.7.1
fastai: 2.5.3
transformers: 4.13.0


In [None]:
# hide
# cuda
torch.cuda.set_device(1)
print(f"Using GPU #{torch.cuda.current_device()}: {torch.cuda.get_device_name()}")


Using GPU #1: GeForce GTX 1080 Ti


## Question/Answering tokenization, batch transform, and DataBlock methods

Question answering tasks require two text inputs: a question and a context that contains the answer. The objective is to predict the start and end tokens of the answer within the context.
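To make the prediction target concrete, here is a minimal sketch that maps a character-level answer span to start/end token indices. It assumes naive whitespace tokenization and a made-up context. Real tokenizers produce subword tokens, and the helper names `whitespace_offsets` and `char_span_to_token_span` are hypothetical, not part of blurr.

```python
def whitespace_offsets(text):
    """Return (token, char_start, char_end) triples using naive whitespace splitting."""
    triples, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        triples.append((tok, start, end))
        pos = end
    return triples


def char_span_to_token_span(text, ans_start, ans_text):
    """Map a character-level answer span to inclusive (start, end) token indices."""
    ans_end = ans_start + len(ans_text)  # exclusive character end
    # keep every token whose character range overlaps the answer's character range
    idxs = [i for i, (_, s, e) in enumerate(whitespace_offsets(text)) if s < ans_end and e > ans_start]
    return (idxs[0], idxs[-1]) if idxs else None


# toy data (an assumption for illustration, not taken from the dataset)
context = "She rose to fame in the late 1990s as lead singer."
answer = "in the late 1990s"
ans_start = context.index(answer)

span = char_span_to_token_span(context, ans_start, answer)
tokens = context.split()
print(span, tokens[span[0] : span[1] + 1])
```

A model for this task learns to output those two indices (one for the first answer token, one for the last) rather than generating the answer text.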

In [None]:
path = Path("./")
squad_df = pd.read_csv(path / "squad_sample.csv")
print(len(squad_df))
squad_df.head(2)


1000


Unnamed: 0,id,title,context,question,answers,ds_type,answer_text,is_impossible
0,56be85543aeaaa14008c9063,Beyoncé,"Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five G...",When did Beyonce start becoming popular?,"{'text': ['in the late 1990s'], 'answer_start': [269]}",train,in the late 1990s,False
1,56be85543aeaaa14008c9065,Beyoncé,"Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five G...",What areas did Beyonce compete in when she was growing up?,"{'text': ['singing and dancing'], 'answer_start': [207]}",train,singing and dancing,False


We've provided a small subset of a pre-processed SQuAD v2 dataset here purely for demonstration purposes. There is a lot that can be done to make this better and more fully functional; the idea is just to show how things can work for tasks beyond sequence classification.

In [None]:
model_cls = AutoModelForQuestionAnswering

pretrained_model_name = "roberta-base"  #'xlm-mlm-ende-1024'
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR.get_hf_objects(pretrained_model_name, model_cls=model_cls)

max_seq_len = 128
vocab = dict(enumerate(range(max_seq_len)))


With version 2 of blurr, you can either preprocess your raw data as before using the `pre_process_squad` method (or something similar), or let blurr handle everything, including documents longer than `max_len` or what the model allows, by splitting them into smaller chunks that fit within those constraints.
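The chunking idea can be sketched in plain Python as a sliding window with overlap. This is only an illustration under stated assumptions: the function name `chunk_with_overlap` is hypothetical, the tokens are stand-in integers, and a real pipeline would also attach the question and special tokens to every chunk.

```python
def chunk_with_overlap(tokens, max_len, overlap):
    """Split `tokens` into windows of at most `max_len` items.

    Consecutive windows share `overlap` items so an answer that straddles
    a boundary has a chance of appearing whole in one of them.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be >= 0 and smaller than max_len")

    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + max_len])
        # stop once a window has reached the end of the sequence
        if start + max_len >= len(tokens):
            break
    return chunks


context_tokens = list(range(10))  # stand-in for tokenized context
chunks = chunk_with_overlap(context_tokens, max_len=4, overlap=2)
for c in chunks:
    print(c)
```

A larger overlap produces more chunks per document but lowers the chance that an answer is cut in two by a chunk boundary.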

In [None]:
# export
def pre_process_squad(
    # A row in your pd.DataFrame
    row,
    # The abbreviation/name of your Hugging Face transformer architecture (e.g., bert, bart, etc.)
    hf_arch: str,
    # A Hugging Face tokenizer
    hf_tokenizer: PreTrainedTokenizerBase,
    # The attribute in your dataset that contains the context (where the answer is included) (default: 'context')
    ctx_attr: str = "context",
    # The attribute in your dataset that contains the question being asked (default: 'question')
    qst_attr: str = "question",
    # The attribute in your dataset that contains the actual answer (default: 'answer_text')
    ans_attr: str = "answer_text",
):
    context, qst, ans = row[ctx_attr], row[qst_attr], row[ans_attr]

    tok_kwargs = {}

    if hf_tokenizer.padding_side == "right":
        tok_input = hf_tokenizer.convert_ids_to_tokens(hf_tokenizer.encode(qst.lstrip(), context, **tok_kwargs))
    else:
        tok_input = hf_tokenizer.convert_ids_to_tokens(hf_tokenizer.encode(context, qst.lstrip(), **tok_kwargs))

    tok_ans = hf_tokenizer.tokenize(str(row[ans_attr]), **tok_kwargs)

    start_idx, end_idx = 0, 0
    for idx, tok in enumerate(tok_input):
        try:
            if tok == tok_ans[0] and tok_input[idx : idx + len(tok_ans)] == tok_ans:
                start_idx, end_idx = idx, idx + len(tok_ans)
                break
        except IndexError:
            # an empty tokenized answer has no first token to compare against
            pass

    row["tokenized_input"] = tok_input
    row["tokenized_input_len"] = len(tok_input)
    row["tok_answer_start"] = start_idx
    row["tok_answer_end"] = end_idx

    return row
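
The token search at the heart of `pre_process_squad` can be looked at in isolation. The sketch below is a simplified stand-in, not the function above: `find_sublist` is a hypothetical helper and the token lists are made up. Like the function above, it returns an exclusive end index and falls back to `(0, 0)` when the answer tokens are not found.

```python
def find_sublist(haystack, needle):
    """Return (start, end) of the first occurrence of `needle` in `haystack`.

    `end` is exclusive, so haystack[start:end] == needle.
    Returns (0, 0) when `needle` is empty or absent.
    """
    if not needle:
        return (0, 0)

    n = len(needle)
    for idx in range(len(haystack) - n + 1):
        if haystack[idx : idx + n] == needle:
            return (idx, idx + n)
    return (0, 0)


# toy token lists (assumptions for illustration only)
tok_input = ["<s>", "When", "?", "</s>", "rose", "to", "fame", "in", "the", "late", "1990s", "</s>"]
tok_ans = ["in", "the", "late", "1990s"]

start_idx, end_idx = find_sublist(tok_input, tok_ans)
print(start_idx, end_idx, tok_input[start_idx:end_idx])
```

Because `(0, 0)` doubles as the "not found" value, rows with those indices are worth inspecting before training.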


In [None]:
show_doc(pre_process_squad)


<h4 id="pre_process_squad" class="doc_header"><code>pre_process_squad</code><a href="__main__.py#L2" class="source_link" style="float:right">[source]</a></h4>

> <code>pre_process_squad</code>(**`row`**, **`hf_arch`**:`str`, **`hf_tokenizer`**:`PreTrainedTokenizerBase`, **`ctx_attr`**:`str`=*`'context'`*, **`qst_attr`**:`str`=*`'question'`*, **`ans_attr`**:`str`=*`'answer_text'`*)



**Parameters:**


 - **`row`** : *`<class 'inspect._empty'>`*	<p>A row in your pd.DataFrame</p>


 - **`hf_arch`** : *`<class 'str'>`*	<p>The abbreviation/name of your Hugging Face transformer architecture (e.g., bert, bart, etc.)</p>


 - **`hf_tokenizer`** : *`<class 'transformers.tokenization_utils_base.PreTrainedTokenizerBase'>`*	<p>A Hugging Face tokenizer</p>


 - **`ctx_attr`** : *`<class 'str'>`*, *optional*	<p>The attribute in your dataset that contains the context (where the answer is included) (default: 'context')</p>


 - **`qst_attr`** : *`<class 'str'>`*, *optional*	<p>The attribute in your dataset that contains the question being asked (default: 'question')</p>


 - **`ans_attr`** : *`<class 'str'>`*, *optional*	<p>The attribute in your dataset that contains the actual answer (default: 'answer_text')</p>



How to preprocess your data

In [None]:
proc_df = squad_df.apply(partial(pre_process_squad, hf_arch=hf_arch, hf_tokenizer=hf_tokenizer), axis=1)


In [None]:
# if you want to remove texts longer than your model will hold (and include only answerable contexts)
proc_df = proc_df[(proc_df.tok_answer_end < max_seq_len) & (proc_df.is_impossible == False)]
proc_df.head(2)


Unnamed: 0,id,title,context,question,answers,ds_type,answer_text,is_impossible,tokenized_input,tokenized_input_len,tok_answer_start,tok_answer_end
0,56be85543aeaaa14008c9063,Beyoncé,"Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five G...",When did Beyonce start becoming popular?,"{'text': ['in the late 1990s'], 'answer_start': [269]}",train,in the late 1990s,False,"[<s>, ĠWhen, Ġdid, ĠBeyon, ce, Ġstart, Ġbecoming, Ġpopular, ?, </s>, </s>, ĠBeyon, cÃ©, ĠG, is, elle, ĠKnow, les, -, Carter, Ġ(/, bi, Ë, Ĳ, ËĪ, j, É, Ĵ, n, se, É, ª, /, Ġbee, -, Y, ON, -, say, ), Ġ(, born, ĠSeptember, Ġ4, ,, Ġ1981, ), Ġis, Ġan, ĠAmerican, Ġsinger, ,, Ġsong, writer, ,, Ġrecord, Ġproducer, Ġand, Ġactress, ., ĠBorn, Ġand, Ġraised, Ġin, ĠHouston, ,, ĠTexas, ,, Ġshe, Ġperformed, Ġin, Ġvarious, Ġsinging, Ġand, Ġdancing, Ġcompetitions, Ġas, Ġa, Ġchild, ,, Ġand, Ġrose, Ġto, Ġfame, Ġin, Ġthe, Ġlate, Ġ1990, s, Ġas, Ġlead, Ġsinger, Ġof, ĠR, &, B, Ġgirl, -, group, ĠDestiny, ...]",185,84,89
1,56be85543aeaaa14008c9065,Beyoncé,"Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five G...",What areas did Beyonce compete in when she was growing up?,"{'text': ['singing and dancing'], 'answer_start': [207]}",train,singing and dancing,False,"[<s>, ĠWhat, Ġareas, Ġdid, ĠBeyon, ce, Ġcompete, Ġin, Ġwhen, Ġshe, Ġwas, Ġgrowing, Ġup, ?, </s>, </s>, ĠBeyon, cÃ©, ĠG, is, elle, ĠKnow, les, -, Carter, Ġ(/, bi, Ë, Ĳ, ËĪ, j, É, Ĵ, n, se, É, ª, /, Ġbee, -, Y, ON, -, say, ), Ġ(, born, ĠSeptember, Ġ4, ,, Ġ1981, ), Ġis, Ġan, ĠAmerican, Ġsinger, ,, Ġsong, writer, ,, Ġrecord, Ġproducer, Ġand, Ġactress, ., ĠBorn, Ġand, Ġraised, Ġin, ĠHouston, ,, ĠTexas, ,, Ġshe, Ġperformed, Ġin, Ġvarious, Ġsinging, Ġand, Ġdancing, Ġcompetitions, Ġas, Ġa, Ġchild, ,, Ġand, Ġrose, Ġto, Ġfame, Ġin, Ġthe, Ġlate, Ġ1990, s, Ġas, Ġlead, Ġsinger, Ġof, ĠR, &, ...]",190,77,80


### Mid-level API

In [None]:
# export
class HF_QuestionAnswerInput(HF_BaseInput):
    pass


We'll return a `HF_QuestionAnswerInput` from our custom `HF_BeforeBatchTransform` so that we can customize the show_batch/results methods for this task.
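
The reason an otherwise empty subclass is useful is that display functions can be selected by input type. As a rough analogy only, the sketch below uses the standard library's `functools.singledispatch`, which dispatches on the first argument, whereas fastai's `typedispatch` can consider several. The class and function names are stand-ins, not blurr's.

```python
from functools import singledispatch


class BaseInput(list):
    """Stand-in for a generic input type."""


class QuestionAnswerInput(BaseInput):
    """Stand-in for a task-specific subclass that adds no behavior of its own."""


@singledispatch
def show(x):
    raise NotImplementedError(f"no view registered for {type(x).__name__}")


@show.register
def _show_base(x: BaseInput):
    return "generic text view"


@show.register
def _show_qa(x: QuestionAnswerInput):
    # chosen only when the input carries the more specific type
    return "question/answer view"


print(show(BaseInput([1, 2])))
print(show(QuestionAnswerInput([1, 2])))
```

The data inside both objects is identical; only the type tag changes which view runs.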

In [None]:
# export
class HF_QABeforeBatchTransform(HF_BeforeBatchTransform):
   
    def encodes(self, samples):
        samples, batch_encoding = super().encodes(samples, return_batch_encoding=True)

        updated_samples = []
        for idx, s in enumerate(samples):
            # update the targets: is_found (s[1]), answer start token index (s[2]), and answer end token index (s[3])
            qst_mask = [i != 1 for i in batch_encoding.sequence_ids(idx)]
            start, end, has_ans = self.find_start_end(s[1], s[0]["input_ids"], s[0]["offset_mapping"], qst_mask)
            start_t, end_t, has_ans_t  = TensorCategory(start), TensorCategory(end), TensorCategory(has_ans)

            # cls_index: location of the CLS token (used by xlnet and xlm); `nonzero()` acts like list.index(value) for PyTorch tensors
            s[0]["cls_index"] = (s[0]["input_ids"] == self.hf_tokenizer.cls_token_id).nonzero()[0]
            # p_mask: mask with 1 for tokens that cannot be in the answer, else 0 (used by xlnet and xlm)
            s[0]["p_mask"] = s[0]["special_tokens_mask"]

            updated_samples.append((s[0], has_ans_t, start_t, end_t))

        return updated_samples

    def find_start_end(self, ans_data, input_ids, offset_mapping, qst_mask):
        # mask the question tokens so they aren't included in the search
        masked_offset_mapping = offset_mapping.clone()
        masked_offset_mapping[qst_mask] = tensor([-100, -100])

        # based on the character start/end index, see if we can find the span of tokens in the `offset_mapping`
        start = torch.where((masked_offset_mapping[:, 0] == ans_data[1]) | (masked_offset_mapping[:, 1] == ans_data[1]))[0]
        end = torch.where((masked_offset_mapping[:, 0] <= ans_data[2]) & (masked_offset_mapping[:, 1] >= ans_data[2]))[0]

        if len(start) > 0 and len(end) > 0:
            start = start[-1]
            end = end[-1]

            if end < len(masked_offset_mapping):
                return (start, end, tensor(1))

        # if either the start or end is not found, or the end token is not part of this chunk, consider the answer not found
        return (tensor(0), tensor(0), tensor(0))


By overriding `HF_BeforeBatchTransform` we can add other inputs to each example for this particular task.
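
The idea behind locating the answer through character offsets can be sketched without tensors. What follows is a simplified, pure-Python illustration rather than the transform's actual logic: `find_answer_tokens` is a hypothetical helper, the offsets are made up, and here the answer's character end is exclusive while the returned end token is inclusive.

```python
def find_answer_tokens(offsets, is_context, ans_start, ans_end):
    """Return (start_tok, end_tok, found) for a character span.

    offsets:    list of (char_start, char_end) per token
    is_context: list of bools, True where the token belongs to the context
    Only context tokens are searched, because question offsets restart at 0
    and could otherwise collide with the answer's character positions.
    """
    starts = [i for i, (s, _) in enumerate(offsets) if is_context[i] and s == ans_start]
    ends = [i for i, (s, e) in enumerate(offsets) if is_context[i] and s <= ans_end <= e]
    if starts and ends:
        return (starts[-1], ends[-1], 1)
    return (0, 0, 0)  # the answer is not in this chunk


# tokens:  <s>    When    ?     </s>    fame    in     the      late      1990s     as      </s>
offsets = [(0, 0), (0, 4), (5, 6), (0, 0), (0, 4), (5, 7), (8, 11), (12, 16), (17, 22), (23, 25), (0, 0)]
is_context = [False, False, False, False, True, True, True, True, True, True, False]

# "in the late 1990s" occupies characters 5..22 of the toy context "fame in the late 1990s as"
result = find_answer_tokens(offsets, is_context, ans_start=5, ans_end=22)
print(result)
```

Note that the question token `?` also starts at character 5. Masking out non-context tokens is what keeps it from being considered as a start position.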

In [None]:
before_batch_tfm = HF_QABeforeBatchTransform(
    hf_arch,
    hf_config,
    hf_tokenizer,
    hf_model,
    max_length=max_seq_len,
    truncation="only_second",
    padding="max_length",
    tok_kwargs={"return_special_tokens_mask": True, "return_overflowing_tokens": True, "return_offsets_mapping": True, "stride": 2},
)

blocks = (
    HF_TextBlock(before_batch_tfm=before_batch_tfm, input_return_type=HF_QuestionAnswerInput),
    None,
    CategoryBlock(vocab=vocab),
    CategoryBlock(vocab=vocab),
)


def get_ans(r):
    # parse the stringified dict safely instead of executing it with `eval`
    start = ast.literal_eval(r.answers)["answer_start"][0]
    end = start + len(r.answer_text) + 1
    return (r.answer_text, start, end)


def get_dummy_cat(r):
    return 0


dblock = DataBlock(
    blocks=blocks,
    get_x=lambda x: (x.question, x.context),
    get_y=[get_ans, get_dummy_cat, get_dummy_cat],
    splitter=RandomSplitter(),
    n_inp=1,
)


In [None]:
dls = dblock.dataloaders(squad_df, bs=4)
len(dls.train), len(dls.valid)

(200, 50)

In [None]:
b = dls.one_batch()
len(b), len(b[0]), len(b[1]), len(b[2])


(4, 7, 4, 4)

In [None]:
b[0]["input_ids"].shape, b[0]["attention_mask"].shape, b[1].shape, b[2].shape


(torch.Size([4, 128]), torch.Size([4, 128]), torch.Size([4]), torch.Size([4]))

In [None]:
# #hide
# for idx, b in enumerate(dls.valid):
#     for input_ids, start_idx, end_idx in zip(b[0]["input_ids"], b[2], b[3]):
#         print(hf_tokenizer.decode(input_ids))
#         if start_idx.item() != 0:
#             print(f"*** Answer: {hf_tokenizer.decode(input_ids[start_idx:end_idx])} ***\n")
#         else:
#             print("*** NO ANSWER ***")


In [None]:
# export
@typedispatch
def show_batch(
    # This typedispatched `show_batch` will be called for `HF_QuestionAnswerInput` typed inputs
    x: HF_QuestionAnswerInput,
    # Your targets
    y,
    # Your raw inputs/targets
    samples,
    # Your `DataLoaders`. This is required so as to get at the Hugging Face objects for
    # decoding them into something understandable
    dataloaders,
    # Your `show_batch` context
    ctxs=None,
    # The maximum number of items to show
    max_n=6,
    # Any truncation you want applied to your decoded inputs
    trunc_at=None,
    # Any other keyword arguments you want applied to `show_batch`
    **kwargs
):
    # grab our tokenizer
    tfm = first_blurr_tfm(dataloaders, HF_QABeforeBatchTransform)
    hf_tokenizer = tfm.hf_tokenizer

    res = L()
    for sample, input_ids, has_ans, start, end in zip(samples, x, *y):
        txt = hf_tokenizer.decode(sample[0], skip_special_tokens=True)[:trunc_at]
        found = has_ans.item() == 1
        ans_text = hf_tokenizer.decode(input_ids[start:end], skip_special_tokens=False)
        res.append((txt, found, (start.item(), end.item()), ans_text))

    display_df(pd.DataFrame(res, columns=["text", "found", "start/end", "answer"])[:max_n])
    return ctxs


The `show_batch` method above allows us to create a more interpretable view of our question/answer data.

In [None]:
dls.show_batch(dataloaders=dls, max_n=4, trunc_at=500)


Unnamed: 0,text,found,start/end,answer
0,"Which prominent star felt the 2009 Female Video of the Year award should have went to Beyoncé instead of Taylor Swift?, Beyoncé embarked on the I Am... World Tour, her second headlining worldwide concert tour, consisting of 108 shows, grossing $119.5 million.",False,"(0, 0)",
1,"Who beat out Beyonce for Best Female Video? On April 4, 2008, Beyoncé married Jay Z. She publicly revealed their marriage in a video montage at the listening party for her third studio album, I Am... Sasha Fierce, in Manhattan's Sony Club on October 22, 2008. I Am... Sasha Fierce was released on November 18, 2008 in the United States. The album formally introduces Beyoncé's alter ego Sasha Fierce, conceived during the making of her 2003 single ""Crazy in Love"", selling 482,000 copies in its firs",False,"(0, 0)",
2,"Who beat out Beyonce for Best Female Video?, and giving Beyoncé her third consecutive number-one album in the US. The album featured the number-one song ""Single Ladies (Put a Ring on It)"" and the top-five songs ""If I Were a Boy"" and ""Halo"". Achieving the accomplishment of becoming her longest-running Hot 100 single in her career, ""Halo""'s success in the US helped Beyoncé attain more top-ten singles on the list than any other woman during the 2000s. It also included the successful ""Sweet Dreams""",False,"(0, 0)",
3,"Who beat out Beyonce for Best Female Video?Diva"", ""Ego"", ""Broken-Hearted Girl"" and ""Video Phone"". The music video for ""Single Ladies"" has been parodied and imitated around the world, spawning the ""first major dance craze"" of the Internet age according to the Toronto Star. The video has won several awards, including Best Video at the 2009 MTV Europe Music Awards, the 2009 Scottish MOBO Awards, and the 2009 BET Awards. At the 2009 MTV Video Music Awards, the video was nominated for nine awards, u",False,"(0, 0)",


## Summary

This module includes all the low, mid, and high-level API bits for preparing data for extractive Q&A tasks.

In [None]:
# hide
from nbdev.export import notebook2script

notebook2script()


