# 03 Question Answering and Leaning into the fastai Community

Thanks to [nbdev](https://github.com/fastai/nbdev), it is easy for folks to write extensions for fastai that add useful functions and techniques while keeping the source library easy to maintain. We will be looking at one such extension today.


## Blurr

Blurr is a library made by fastai fellow [wgpubs](https://forums.fast.ai/u/wgpubs/summary) that helps integrate the HuggingFace library into fastai.

## Installing and importing the libraries:

Let's install `blurr`, `fastai`, and `torch`:

In [None]:
!pip install light-the-torch >> /.tmp
!ltt install torch >> /.tmp
!pip install fastai nbdev ohmeow-blurr --upgrade >> /.tmp

And now let's import them:

In [4]:
import torch
from transformers import *
from fastai.text.all import *

from blurr.data.all import *
from blurr.modeling.all import *

## Question and Answer (Q/A)

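We will be following along with the Blurr Q/A tutorial [here](https://ohmeow.github.io/blurr/modeling-question-answering/).

Q/A here means extractive question answering: given a question and a passage of context, we want our model to predict which span of the context answers the question. In practice the model predicts two token positions, the start and the end of the answer.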
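To make that concrete, here is a minimal sketch in plain Python of how a pair of per-token scores turns into an answer. The tokens and scores are made up for illustration, and `best_span` is a hypothetical helper, not something from blurr or HuggingFace. Note that the end index is inclusive in this sketch.

```python
# Toy illustration of extractive Q/A: pick the best-scoring (start, end) span.
# The tokens and scores below are invented for this example.

def best_span(start_scores, end_scores, max_answer_len=10):
    """Return (start, end) maximizing start + end score, with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the length limit.
        for e in range(s, min(s + max_answer_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

tokens = ["who", "wrote", "it", "?", "jane", "austen", "wrote", "it", "."]
start_scores = [0.1, 0.0, 0.0, 0.0, 5.2, 0.3, 0.1, 0.0, 0.0]
end_scores = [0.0, 0.0, 0.1, 0.0, 0.4, 4.9, 0.2, 0.1, 0.0]

start, end = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])  # end is inclusive here
print(start, end, answer)
```

A real model produces these scores as logits over every token position, which is why the targets later in this notebook are plain position indices.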

Let's download the dataset:

In [None]:
!wget 'https://raw.githubusercontent.com/ohmeow/blurr/master/nbs/squad_sample.csv'

And load it into `Pandas`:

In [2]:
df = pd.read_csv('squad_sample.csv'); df.head(2)

Unnamed: 0,id,title,context,question,answers,ds_type,answer_text,is_impossible
0,56be85543aeaaa14008c9063,Beyoncé,"Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five G...",When did Beyonce start becoming popular?,"{'text': ['in the late 1990s'], 'answer_start': [269]}",train,in the late 1990s,False
1,56be85543aeaaa14008c9065,Beyoncé,"Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five G...",What areas did Beyonce compete in when she was growing up?,"{'text': ['singing and dancing'], 'answer_start': [207]}",train,singing and dancing,False

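The `answers` column comes back from the CSV as a string that looks like a Python dict, and in SQuAD-style data `answer_start` is a character offset into `context`. Here is a small sketch of reading one such row; the row itself is invented for illustration rather than taken from `squad_sample.csv`.

```python
import ast

# A row shaped like the sample CSV; these values are made up for the example.
row = {
    "context": "George Lucas created Star Wars in 1977.",
    "answers": "{'text': ['Star Wars'], 'answer_start': [21]}",
}

# The dict was serialized as text, so parse it back safely with literal_eval.
answers = ast.literal_eval(row["answers"])
text = answers["text"][0]
start = answers["answer_start"][0]

# Slice the context at the character offset to recover the answer text.
span = row["context"][start:start + len(text)]
print(span)
```

These character offsets still need to be converted to token positions before training, which is what the preprocessing step later in this notebook takes care of.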

## Getting a Pretrained Model

We will use `bert` for our task. We first define the pretrained model name and the task-specific model class, then pass both to a helper function:

In [5]:
p_name = 'bert-large-uncased-whole-word-masking-finetuned-squad'
model_cls = BertForQuestionAnswering

Next we'll get the architecture, its configuration, HuggingFace's tokenizer, and the model:

In [7]:
arch, config, tokenizer, net = BLURR_MODEL_HELPER.get_hf_objects(p_name,
                                                                 model_cls=model_cls)

## Working with the Data

Next we need to apply some preprocessing:

In [8]:
df = df.apply(partial(pre_process_squad, hf_arch=arch, hf_tokenizer=tokenizer),
              axis=1)

Then we filter our rows by sequence length, also dropping any questions marked as impossible to answer:

In [9]:
seq_len=128

In [11]:
df = df[(df.tokenized_input_len < seq_len) & (df.is_impossible == False)]

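We'll then want to build our vocab. Our targets are the start and end token positions of the answer, so the classes are simply the positions `0` through `seq_len - 1`: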

In [13]:
vocab = list(range(seq_len))

Now let's integrate it into the `DataBlock` API. We'll add a batch transform specifically for HF:

In [16]:
trunc_strat = 'only_second' if (tokenizer.padding_side == 'right') else 'only_first'

In [17]:
hf_batch_tfm = HF_QABatchTransform(arch, tokenizer, 
                                   max_length=seq_len, 
                                   truncation=trunc_strat, 
                                   tok_kwargs={ 'return_special_tokens_mask': True })
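The truncation strategy decides which half of the (question, context) pair gets trimmed when the two together are too long. We want to keep the question whole and cut the context, so with right padding, where the question comes first, we truncate `only_second`. Here is a simplified sketch of that idea; `truncate_pair` is a hypothetical helper for illustration, it ignores special tokens, and it is not how the HuggingFace tokenizer is implemented.

```python
def truncate_pair(first, second, max_length, strategy):
    """Trim one side of a (first, second) token pair to fit in max_length.

    Simplified for illustration: special tokens like [CLS]/[SEP] are ignored.
    """
    overflow = len(first) + len(second) - max_length
    if overflow <= 0:
        return first, second  # already short enough
    if strategy == "only_second":
        return first, second[:max(len(second) - overflow, 0)]
    if strategy == "only_first":
        return first[:max(len(first) - overflow, 0)], second
    raise ValueError(f"unknown strategy: {strategy}")

question = ["who", "created", "star", "wars", "?"]
context = ["george", "lucas", "created", "star", "wars", "in", "1977", "."]

# Right padding puts the question first, so we trim the context (second).
q, c = truncate_pair(question, context, max_length=9, strategy="only_second")
print(q, c)
```

With left padding the order flips to (context, question), and `only_first` trims the context instead.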

Let's define our `blocks`:

In [18]:
blocks = (
    HF_TextBlock(hf_batch_tfm=hf_batch_tfm), 
    CategoryBlock(vocab=vocab),
    CategoryBlock(vocab=vocab)
)

Next we define how to get our inputs. The order of the question and context depends on the tokenizer's padding side:

In [22]:
def get_x(x):
    return (x.question, x.context) if (tokenizer.padding_side == 'right') else (x.context, x.question)

And finally we build the `DataBlock` and `DataLoaders`:

In [23]:
dblock = DataBlock(blocks=blocks, 
                   get_x=get_x,
                   get_y=[ColReader('tok_answer_start'), ColReader('tok_answer_end')],
                   splitter=RandomSplitter(),
                   n_inp=1)

In [24]:
dls = dblock.dataloaders(df, bs=4)

Let's look at a batch of data:

In [26]:
dls.show_batch(dataloaders=dls, max_n=2)

Unnamed: 0,text,start/end,answer
0,"what video game did beyonce back out of? the release of a video - game starpower : beyonce was cancelled after beyonce pulled out of a $ 100 million with gatefive who alleged the cancellation meant the sacking of 70 staff and millions of pounds lost in development. it was settled out of court by her lawyers in june 2013 who said that they had cancelled because gatefive had lost its financial backers. beyonce also has had deals with american express, nintendo ds and l'oreal since the age of 18.","(18, 22)",starpower : beyonce
1,"what french magazine did beyonce appear in wearing blackface and tribal makeup? in 2006, the animal rights organization people for the ethical treatment of animals ( peta ), criticized beyonce for wearing and using fur in her clothing line house of dereon. in 2011, she appeared on the cover of french fashion magazine l'officiel, in blackface and tribal makeup that drew criticism from the media. a statement released from a spokesperson for the magazine said that beyonce's look was "" far from the glamorous sasha fierce "" and that it was "" a return to her african roots "".","(63, 68)",l ' officiel


## Training

Next we'll make a `Learner` for our problem by first wrapping our `net` in a HF helper:

In [27]:
net = HF_BaseModelWrapper(net)

And then making the `Learner`:

In [28]:
learn = Learner(dls, net, cbs=[HF_QstAndAnsModelCallback],
                splitter=hf_splitter,
                loss_func=MultiTargetLoss())
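Since we have two targets (start and end), the loss has to combine two classification problems. The sketch below assumes the simplest version of that idea, one cross-entropy term per target added together. This is an assumption for illustration only; blurr's `MultiTargetLoss` may weight or reduce the terms differently, and `span_loss` is a hypothetical name.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax probability of the target class (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

def span_loss(start_logits, end_logits, start_target, end_target):
    """Assumed combination: sum of the start and end cross-entropy terms."""
    return (cross_entropy(start_logits, start_target)
            + cross_entropy(end_logits, end_target))

# Four token positions; the true answer spans positions 1 to 2.
uniform = span_loss([0.0] * 4, [0.0] * 4, 1, 2)
confident = span_loss([0.0, 8.0, 0.0, 0.0], [0.0, 0.0, 8.0, 0.0], 1, 2)
print(uniform, confident)
```

A model that has no idea scores each position equally and pays `log(4)` per target, while a model that puts its weight on the right positions pays close to zero.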

We'll create our optimizer state and freeze the model for transfer learning:

In [30]:
learn.create_opt()
learn.freeze()

And then we fine-tune:

In [31]:
learn.fine_tune(3, 1e-3)

epoch,train_loss,valid_loss,time
0,5.291303,2.841815,00:04


epoch,train_loss,valid_loss,time
0,1.636171,1.43889,00:06
1,1.369125,1.069108,00:06
2,0.94343,1.027662,00:06


Let's look at a few results:

In [32]:
learn.show_results(learner=learn, skip_special_tokens=True, max_n=2)

Unnamed: 0,text,start/end,answer,pred start/end,pred answer
0,"what language did chopin's father teach? in october 1810, six months after fryderyk's birth, the family moved to warsaw, where his father acquired a post teaching french at the warsaw lyceum, then housed in the saxon palace. fryderyk lived with his family in the palace grounds. the father played the flute and violin ; the mother played the piano and gave lessons to boys in the boarding house that the chopins kept. chopin was of slight build, and even in early childhood was prone to illnesses.","(38, 39)",french,"(38, 39)",french
1,"how much did beyonce initially contribute to the foundation? after hurricane katrina in 2005, beyonce and rowland founded the survivor foundation to provide transitional housing for victims in the houston area, to which beyonce contributed an initial $ 250, 000. the foundation has since expanded to work with other charities in the city, and also provided relief following hurricane ike three years later.","(42, 46)","$ 250 , 000","(42, 46)","$ 250 , 000"


We did pretty well! Let's try asking it a question:

In [33]:
inf_df = pd.DataFrame.from_dict([
    {'question': 'Who created Star Wars?', 
     'context': 'George Lucas created Star Wars in 1977. He directed and produced it.'}],
    orient='columns')

In [35]:
learn.blurr_predict(inf_df.iloc[0])

(('7', '9'),
 tensor([7]),
 tensor([[1.9917e-07, 1.1072e-07, 3.2708e-08, 2.9171e-08, 1.7883e-08, 1.4171e-08,
          1.9917e-07, 9.9925e-01, 7.3408e-04, 5.3861e-08, 1.3217e-07, 4.8066e-08,
          1.7595e-07, 5.1273e-06, 8.5625e-08, 7.5926e-06, 2.5976e-07, 2.7561e-08,
          1.9069e-07, 7.7642e-08, 4.0385e-08, 1.9912e-07]]))

In [36]:
inp_ids = tokenizer.encode('Who created Star Wars?',
                           'George Lucas created Star Wars in 1977. He directed and produced it.')

tokenizer.convert_ids_to_tokens(inp_ids, skip_special_tokens=False)[7:9]

['george', 'lucas']

Pretty cool!
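To tie the pieces of that prediction together, here is a small sketch that goes from the printed probabilities to the answer text. The probabilities are copied from the output above, and they appear to be the start-position probabilities since their peak matches the predicted start. The token list is what we would expect from an uncased BERT tokenizer for this pair, which is an assumption, and `join_wordpieces` is a hypothetical helper.

```python
# Probabilities copied from the prediction output above (22 token positions).
probs = [1.9917e-07, 1.1072e-07, 3.2708e-08, 2.9171e-08, 1.7883e-08, 1.4171e-08,
         1.9917e-07, 9.9925e-01, 7.3408e-04, 5.3861e-08, 1.3217e-07, 4.8066e-08,
         1.7595e-07, 5.1273e-06, 8.5625e-08, 7.5926e-06, 2.5976e-07, 2.7561e-08,
         1.9069e-07, 7.7642e-08, 4.0385e-08, 1.9912e-07]

# Assumed tokenization of the (question, context) pair, special tokens included.
tokens = ["[CLS]", "who", "created", "star", "wars", "?", "[SEP]",
          "george", "lucas", "created", "star", "wars", "in", "1977", ".",
          "he", "directed", "and", "produced", "it", ".", "[SEP]"]

def join_wordpieces(pieces):
    """Join tokens with spaces, gluing '##' continuation pieces to the previous word."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return " ".join(words)

# The most likely position is the index of the largest probability.
start = max(range(len(probs)), key=probs.__getitem__)
end = 9  # taken from the ('7', '9') prediction above; used as an exclusive end

answer = join_wordpieces(tokens[start:end])
print(start, answer)
```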