#Build a QA System using BERT and Hugging Face
Artificial-Intelligence • 12/30/2021

A chatbot is AI software that simulates a conversation (or chat) with a user through messaging applications, websites, mobile apps, or the telephone.
A chatbot is often described as one of the most advanced and promising forms of interaction between humans and machines. From a technological point of view, however, a chatbot is a natural evolution of a Question Answering system that leverages Natural Language Processing (NLP). Formulating responses to questions in natural language is one of the most typical examples of NLP applied in enterprise end-use applications.

source -
https://blog.jaysinha.me/build-your-first-qa-system-using-bert-and-hugging-face/

BERT (Bidirectional Encoder Representations from Transformers) started a revolution in NLP with state-of-the-art results on various tasks, including Question Answering and the GLUE benchmark. Some have even called this the ImageNet moment of NLP.

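Question Answering systems are built on pairs of questions and contexts: given a question and a passage of text (the context), the system extracts the part of the context that answers the question.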
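
To make the idea of a question/context pair concrete, here is a minimal sketch of one such record. The field names and the example text are illustrative assumptions, not data used elsewhere in this tutorial. The point is only that the answer is a span of the context, which can be described by a start and an end offset.

```python
# One illustrative question/context pair (made-up example, not from a real dataset).
example = {
    "question": "How many sides does a pentagon have?",
    "context": "In geometry, a pentagon is a polygon with five sides.",
    "answer_text": "five",
}

# An extractive QA model does not generate text; it points at a span of the context.
# Here we locate that span by character offset to show what the target looks like.
answer_start = example["context"].find(example["answer_text"])
answer_end = answer_start + len(example["answer_text"])

extracted = example["context"][answer_start:answer_end]
print(answer_start, answer_end, extracted)
```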

#The Tutorial:
In this tutorial, we will use a pre-trained BERT model from Hugging Face that was fine-tuned on the SQuAD 2.0 dataset. We will supply the questions; for context, we will use the first matching Wikipedia article, retrieved with the wikipedia package for Python. We will then tokenize the question and the article with AutoTokenizer so that the AutoModelForQuestionAnswering model can predict the span of tokens that forms our answer.
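
The sketch below illustrates how a question and a context are typically packed into a single BERT input: `[CLS] question [SEP] context [SEP]`, with segment ids (`token_type_ids`) of 0 for the question part and 1 for the context part. Whitespace splitting stands in for the real WordPiece tokenizer, and `pack_pair` is a hypothetical helper, so treat this as an illustration of the layout rather than of the tokenizer itself.

```python
def pack_pair(question, context):
    """Lay out a question/context pair the way BERT-style QA inputs are arranged.

    Whitespace splitting stands in for the real WordPiece tokenizer (assumption).
    """
    q_tokens = question.split()
    c_tokens = context.split()

    tokens = ["[CLS]"] + q_tokens + ["[SEP]"] + c_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the question, and the first [SEP];
    # segment 1 covers the context and the final [SEP].
    token_type_ids = [0] * (len(q_tokens) + 2) + [1] * (len(c_tokens) + 1)
    return tokens, token_type_ids


tokens, token_type_ids = pack_pair(
    "Where is Microsoft located?", "Microsoft is based in Redmond."
)
for tok, seg in zip(tokens, token_type_ids):
    print(seg, tok)
```

The segment ids are what make it possible to separate the question from the context again later, for example when a long article has to be split into several inputs.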

#A little background:
The underlying BERT model was originally pre-trained with masked language modeling: the researchers masked words in a huge corpus, and the model's task was to predict each masked word. For question answering, the pre-trained model is then fine-tuned to predict where the answer starts and ends within the context.
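
As a rough illustration of the masking idea, the sketch below hides a random subset of tokens and records which original token the model would have to recover at each masked position. It is deliberately simplified: the helper `mask_tokens`, the 15% rate used as a default, and the toy sentence are assumptions for this example, and the real pre-training recipe has further details that are not reproduced here.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK] and remember the originals."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = {}  # position -> original token the model would have to predict
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok
            masked[i] = "[MASK]"
    return masked, labels


sentence = "the quick brown fox jumps over the lazy dog near the river bank".split()
masked, labels = mask_tokens(sentence, mask_prob=0.15, seed=0)
print(" ".join(masked))
print(labels)
```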

Now, let's get into the tutorial.

First we will install and import the libraries we need, then create a class that bundles loading the model with tokenizing the question and the matched Wikipedia article.

In [2]:
# First, let's install the libraries we need
!pip install torch  torchvision -f https://download.pytorch.org/whl/torch_stable.html
!pip install transformers
!pip install wikipedia

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting transformers
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages:

In [3]:
# Next, let's begin our imports with a test import
try:
    from torch.utils.tensorboard import SummaryWriter
except ImportError:
    from tensorboardX import SummaryWriter

In [5]:
# If that passes, move on to the main imports.
# To make the following output more readable, turn off the token sequence length warning.
import logging
logging.getLogger("transformers.tokenization_utils").setLevel(logging.ERROR)
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import wikipedia as wiki
from collections import OrderedDict
import os

In [6]:
# Make the final import statements
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from collections import OrderedDict
import torch


In [7]:
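class QASystemWithBERT:

    def __init__(self, pretrained_model_name_or_path='bert-large-uncased'):
        # Name or path of the reader model; pass a checkpoint fine-tuned for QA
        self.READER_PATH = pretrained_model_name_or_path
        self.tokenizer = AutoTokenizer.from_pretrained(self.READER_PATH)
        self.model = AutoModelForQuestionAnswering.from_pretrained(self.READER_PATH)
        # Maximum number of tokens the model accepts in one forward pass
        self.max_len = self.model.config.max_position_embeddings
        # Set to True when the input had to be split into several chunks
        self.chunked = False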

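    def tokenize(self, question, text):
        # Reset the flag so a short input is not treated as chunked after a long one
        self.chunked = False
        self.inputs = self.tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt", return_token_type_ids=True)
        self.input_ids = self.inputs["input_ids"].tolist()[0]

        # Inputs longer than the model limit are split into model-sized chunks
        if len(self.input_ids) > self.max_len:
            self.inputs = self.chunkify()
            self.chunked = True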

    def chunkify(self):
        """ 
        Break up a long article into chunks that fit within the max token
        requirement for that Transformer model. 
        """
        # Question tokens ([CLS] question [SEP]) have token type 0
        qmask = self.inputs['token_type_ids'].lt(1)

        qt = torch.masked_select(self.inputs['input_ids'], qmask)
        # Reserve room for the question and one closing [SEP] token
        chunk_size = self.max_len - qt.size()[0] - 1

        chunked_input = OrderedDict()
        for k, v in self.inputs.items():
            q = torch.masked_select(v, qmask)
            c = torch.masked_select(v, ~qmask)
            chunks = torch.split(c, chunk_size)

            for i, chunk in enumerate(chunks):
                if i not in chunked_input:
                    chunked_input[i] = {}

                thing = torch.cat((q, chunk))
                # Every chunk except the last lacks a closing [SEP]; append one,
                # along with a matching token type and attention mask value of 1
                if i != len(chunks)-1:
                    if k == 'input_ids':
                        thing = torch.cat((thing, torch.tensor([self.tokenizer.sep_token_id])))
                    else:
                        thing = torch.cat((thing, torch.tensor([1])))

                chunked_input[i][k] = torch.unsqueeze(thing, dim=0)
        return chunked_input

    def get_answer(self):
        if self.chunked:
            answer = ''
            for k, chunk in self.inputs.items():
                answer_start_scores, answer_end_scores = self.model(**chunk)[:2]

                answer_start = torch.argmax(answer_start_scores)
                answer_end = torch.argmax(answer_end_scores) + 1

                ans = self.convert_ids_to_string(chunk['input_ids'][0][answer_start:answer_end])
                if ans != '[CLS]':
                    answer += ans + " / "
            return answer
        else:
            answer_start_scores, answer_end_scores = self.model(**self.inputs)[:2]

            answer_start = torch.argmax(answer_start_scores)  
            answer_end = torch.argmax(answer_end_scores) + 1  
        
            return self.convert_ids_to_string(self.inputs['input_ids'][0][
                                              answer_start:answer_end])

    def convert_ids_to_string(self, input_ids):
        return self.tokenizer.convert_tokens_to_string(self.tokenizer.convert_ids_to_tokens(input_ids))


        

Now we will pass in a list of questions and print the answer to each one:

In [8]:
import wikipedia as wiki
questions = [
    'Where is Microsoft Headquarters located?',
    'Who is the President of the United States of America?',
    'How many sides does a pentagon have?'
]


qas = QASystemWithBERT("deepset/bert-base-cased-squad2")


for question in questions:
    print(f"Question: {question}")
    results = wiki.search(question)

    page = wiki.page(results[0])
    print(f"Top wiki result: {page}")

    text = page.content

    qas.tokenize(question, text)
    print(f"Answer: {qas.get_answer()}")
    print()

Question: Where is Microsoft Headquarters located?


Token indices sequence length is longer than the specified maximum sequence length for this model (1390 > 512). Running this sequence through the model will result in indexing errors


Top wiki result: <WikipediaPage 'Microsoft Redmond campus'>
Answer: Redmond, Washington / Redmond / 

Question: Who is the President of the United States of America?
Top wiki result: <WikipediaPage 'List of presidents of the United States'>
Answer: the head of state and head of government of the United States / 

Question: How many sides does a pentagon have?
Top wiki result: <WikipediaPage 'The Pentagon'>
Answer: five / 
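
The output above joins the answers from every chunk with " / ". One possible alternative, sketched here under stated assumptions, is to score the best valid span in each chunk and keep only the highest-scoring one. The helpers `best_span` and `best_answer_across_chunks` are hypothetical, the scores are made-up numbers standing in for the model's start and end logits, and comparing raw scores across chunks is a heuristic rather than a calibrated ranking. The sketch also does not handle the "no answer" case that the class filters out via `[CLS]`.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (score, start, end) for the best span with start <= end."""
    best = (float("-inf"), 0, 0)
    for s, s_score in enumerate(start_scores):
        last = min(len(end_scores), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_scores[e]
            if score > best[0]:
                best = (score, s, e)
    return best


def best_answer_across_chunks(chunk_scores):
    """Pick the chunk whose best span has the highest combined score."""
    results = []
    for idx, (start_scores, end_scores) in enumerate(chunk_scores):
        score, s, e = best_span(start_scores, end_scores)
        results.append((score, idx, s, e))
    return max(results)


# Toy scores for two chunks (made-up numbers for illustration).
chunk_scores = [
    ([0.1, 2.0, 0.3, 0.0], [0.0, 0.2, 1.5, 0.1]),
    ([0.5, 0.1, 3.0, 0.2], [0.1, 0.0, 0.4, 2.5]),
]
score, chunk_idx, start, end = best_answer_across_chunks(chunk_scores)
print(chunk_idx, start, end, score)
```

Searching start and end jointly also avoids a case that two independent `argmax` calls can produce, where the predicted end falls before the predicted start and the extracted answer is empty.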



Try each question about 10 times, rewording it in different ways, and see how well the system answers the question in each case.
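
A small harness can make this kind of experiment easier to repeat. The sketch below asks several phrasings of the same question and tallies the answers. `compare_phrasings` is a hypothetical helper, and `fake_answer_fn` is a stand-in that returns canned strings; in practice you would replace it with a function that calls `qas.tokenize(question, text)` followed by `qas.get_answer()`.

```python
from collections import Counter


def compare_phrasings(answer_fn, phrasings):
    """Ask every phrasing once and tally the normalized answers."""
    answers = {q: answer_fn(q) for q in phrasings}
    tally = Counter(a.strip().lower() for a in answers.values())
    return answers, tally


def fake_answer_fn(question):
    # Hypothetical stand-in for: qas.tokenize(question, text); qas.get_answer()
    return "Redmond, Washington" if "headquarter" in question.lower() else "Redmond"


phrasings = [
    "Where is Microsoft Headquarters located?",
    "Where are the headquarters of Microsoft?",
    "In which city is Microsoft based?",
]
answers, tally = compare_phrasings(fake_answer_fn, phrasings)
for q, a in answers.items():
    print(f"{q} -> {a}")
print(tally.most_common(1))
```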

Main example notebook for reference -
https://www.kaggle.com/razor08/qa-system-with-bert/notebook 