# Basics of a Question Answering System

## Preamble

In this article, we'll learn two of the most basic methods of implementing a Question Answering system.

A [Question Answering (QA) system](https://en.wikipedia.org/wiki/Question_answering) is an information retrieval system in which a direct answer is expected in response to a submitted query, rather than a set of references that may contain the answer. QA systems aim to satisfy users who want an answer to a specific question asked in natural language.
A QA system works like a search engine but presents its results differently: a search engine returns a list of links to resources that may contain the answer, while a QA system gives a direct answer to the question.

The information-retrieval process in a QA system is broken down into three stages: question processing, ranking, and answer extraction. Question processing and ranking can be performed using algorithmic functions or with machine learning.

A QA system based on an approximate match function is as simple as:

![](Question-Answer-approximate.png)

A more complex system uses NLP techniques to understand natural language. [Natural Language Processing (NLP)](https://en.wikipedia.org/wiki/Natural_language_processing) is the ability of a computer program to understand human language as it is spoken. The pipeline of a QA system with a pre-trained NLP model includes two stages, data preparation and processing:

![](Question-Answer-NLP.png)

## Prerequisites

To run these examples, you need [Python 3](https://www.python.org/). Also install [Jupyter Lab](https://jupyter.org/install) and a few Python modules:

```bash
pip install jupyterlab
pip install pandas numpy
pip install python-Levenshtein
pip install bert-serving-server bert-serving-client
```

## Data

For demo purposes, we use an extremely small set of question-answer pairs in a CSV file. To build a high-quality QA system, you should use many question samples and a specialized data storage or database for fast lookup, and, as you'll see at the end, it's a good idea to master the training of NLP models. In our examples, we use the knowledge base as is, without any modification, but you are free to insert additional question samples to improve answer quality.
Let's load our data:

In [1]:
import pandas as pd
data = pd.read_csv('qa.csv')

# this function is used to get printable results
def getResults(questions, fn):
    def getResult(q):
        answer, score, prediction = fn(q)
        return [q, prediction, answer, score]

    return pd.DataFrame(list(map(getResult, questions)), columns=["Q", "Prediction", "A", "Score"])

test_data = [
    "What is the population of Egypt?",
    "What is the poulation of egypt",
    "How long is a leopard's tail?",
    "Do you know the length of leopard's tail?",
    "When polar bears can be invisible?",
    "Can I see arctic animals?",
    "some city in Finland"
]

data

Unnamed: 0,Question,Answer
0,Who determined the dependence of the boiling o...,Anders Celsius
1,Are beetles insects?,Yes
2,Are Canada 's two official languages English a...,yes
3,What is the population of Egypt?,more than 78 million
4,What is the biggest city in Finland?,Greater Helsinki
5,What is the national currency of Liechtenstein?,Swiss franc
6,Can polar bears be seen under infrared photogr...,Polar bears are nearly invisible under infrare...
7,When did Tesla demonstrate wireless communicat...,1893
8,What are violins made of?,different types of wood
9,How long is a leopard's tail?,60 to 110cm


## The simplest QA system

This is a very naive system, where the user's query needs to be equal to, or be a part of, some question in the knowledge base.

In [2]:
import re

def getNaiveAnswer(q):
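    # strip punctuation (keeping apostrophes) so that e.g. a trailing "?" is ignored;
    # regex=False makes str.contains treat the query as a plain substring
    pattern = re.sub(r"[^\w'\s]+", "", q)
    row = data.loc[data['Question'].str.contains(pattern, case=False, regex=False)]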
    if len(row) > 0:
        return row["Answer"].values[0], 1, row["Question"].values[0]
    return "Sorry, I didn't get you.", 0, ""

getResults(test_data, getNaiveAnswer)

Unnamed: 0,Q,Prediction,A,Score
0,What is the population of Egypt?,What is the population of Egypt?,more than 78 million,1
1,What is the poulation of egypt,,"Sorry, I didn't get you.",0
2,How long is a leopard's tail?,How long is a leopard's tail?,60 to 110cm,1
3,Do you know the length of leopard's tail?,,"Sorry, I didn't get you.",0
4,When polar bears can be invisible?,,"Sorry, I didn't get you.",0
5,Can I see arctic animals?,,"Sorry, I didn't get you.",0
6,some city in Finland,,"Sorry, I didn't get you.",0


This system has a notable drawback: it finds no match if the query contains a spelling mistake or is worded differently. Even if we pre-process the source and query texts, for example by removing punctuation and lowercasing, the result quality stays poor. This way of matching questions is very ineffective.
Let's improve it to tolerate errors with approximate string matching.
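
Before moving on, here is a minimal sketch of the kind of pre-processing mentioned above, to show why it is not enough. The `normalize` helper is hypothetical and not part of the article's code; it lowercases the text, drops punctuation (keeping apostrophes), and collapses whitespace.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, drop punctuation except apostrophes, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", "", text)
    return re.sub(r"\s+", " ", text).strip()

known = normalize("What is the population of Egypt?")

# Differences in case and punctuation disappear after normalization...
print(normalize("what is the population of EGYPT") == known)
# ...but a single typo ("poulation") still breaks an exact comparison.
print(normalize("What is the poulation of egypt") == known)
```

Normalization removes superficial differences, but any misspelling or rewording still defeats exact and substring matching.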

## Approximate-matching QA system

Let's use [approximate string matching](https://en.wikipedia.org/wiki/Approximate_string_matching) to make our system tolerate spelling mistakes and small differences in wording.
Computer science offers many methods for approximate string matching. For our demo, we use one of the metrics for fuzzy string searching, the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance). The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
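
To make the definition concrete, here is a small pure-Python sketch of the distance, using the classic dynamic-programming approach. It is an illustration, not the library's implementation, and the function names are made up for this example. The second function rests on an assumption about how the `ratio` function used later in this article normalizes the distance: a substitution is weighted as 2 (one deletion plus one insertion), and the result is scaled by the summed length of both strings. Under that assumption, the sketch gives 0.935484 for the misspelled Egypt question, the same score that appears in the result tables later in this article.

```python
def edit_distance(a: str, b: str, sub_cost: int = 1) -> int:
    """Minimum total cost of insertions and deletions (cost 1 each)
    and substitutions (cost sub_cost) that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                                   # delete ca
                curr[j - 1] + 1,                               # insert cb
                prev[j - 1] + (0 if ca == cb else sub_cost),   # match or substitute
            ))
        prev = curr
    return prev[-1]

def similarity_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]; assumes substitutions are weighted as 2."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    return (total - edit_distance(a, b, sub_cost=2)) / total

print(edit_distance("kitten", "sitting"))  # 3
print(round(similarity_ratio("What is the population of Egypt?",
                             "What is the poulation of egypt"), 6))  # 0.935484
```

A similarity scaled to the range from 0 to 1 is easier to compare against a fixed threshold than a raw edit count, because it does not grow with the length of the question.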

Let's implement our system with the [Levenshtein](https://github.com/ztane/python-Levenshtein) Python module. It contains a set of approximate string matching functions, so you can experiment with the others too.

In [3]:
from Levenshtein import ratio

def getApproximateAnswer(q):
    max_score = 0
    answer = ""
    prediction = ""
    for idx, row in data.iterrows():
        score = ratio(row["Question"], q)
        if score >= 0.9: # I'm sure, stop here
            return row["Answer"], score, row["Question"]
        elif score > max_score: # I'm unsure, continue
            max_score = score
            answer = row["Answer"]
            prediction = row["Question"]

    if max_score > 0.8:
        return answer, max_score, prediction
    return "Sorry, I didn't get you.", max_score, prediction

getResults(test_data, getApproximateAnswer)

Unnamed: 0,Q,Prediction,A,Score
0,What is the population of Egypt?,What is the population of Egypt?,more than 78 million,1.0
1,What is the poulation of egypt,What is the population of Egypt?,more than 78 million,0.935484
2,How long is a leopard's tail?,How long is a leopard's tail?,60 to 110cm,1.0
3,Do you know the length of leopard's tail?,How long is a leopard's tail?,"Sorry, I didn't get you.",0.657143
4,When polar bears can be invisible?,Can polar bears be seen under infrared photogr...,"Sorry, I didn't get you.",0.517647
5,Can I see arctic animals?,What is the biggest city in Finland?,"Sorry, I didn't get you.",0.42623
6,some city in Finland,What is the biggest city in Finland?,"Sorry, I didn't get you.",0.642857


As you can see, the second question, with its spelling mistakes, is now answered: its score is below 1.0 but acceptably high. Our system is better now and copes with misspellings, but it still has trouble with questions phrased freely in natural language.
Let's try lowering the acceptance threshold of our function to make it more tolerant.

In [4]:
from Levenshtein import ratio

def getApproximateAnswer2(q):
    max_score = 0
    answer = ""
    prediction = ""
    for idx, row in data.iterrows():
        score = ratio(row["Question"], q)
        if score >= 0.9: # I'm sure, stop here
            return row["Answer"], score, row["Question"]
        elif score > max_score: # I'm unsure, continue
            max_score = score
            answer = row["Answer"]
            prediction = row["Question"]

    if max_score > 0.3: # threshold is lowered
        return answer, max_score, prediction
    return "Sorry, I didn't get you.", max_score, prediction

getResults(test_data, getApproximateAnswer2)

Unnamed: 0,Q,Prediction,A,Score
0,What is the population of Egypt?,What is the population of Egypt?,more than 78 million,1.0
1,What is the poulation of egypt,What is the population of Egypt?,more than 78 million,0.935484
2,How long is a leopard's tail?,How long is a leopard's tail?,60 to 110cm,1.0
3,Do you know the length of leopard's tail?,How long is a leopard's tail?,60 to 110cm,0.657143
4,When polar bears can be invisible?,Can polar bears be seen under infrared photogr...,Polar bears are nearly invisible under infrare...,0.517647
5,Can I see arctic animals?,What is the biggest city in Finland?,Greater Helsinki,0.42623
6,some city in Finland,What is the biggest city in Finland?,Greater Helsinki,0.642857


Let's examine the results. Our system now gives more answers, including responses to questions worded differently. But look at the row with index 5 ("Can I see arctic animals?"). It is a false positive: the answer doesn't match the question semantically. So the threshold has to be balanced for each data set, trading language understanding against correctness. This code is very simple, but it is impractical on large volumes because it iterates over the whole data set for every query.
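
The trade-off can be made concrete with a small sketch. It reuses the best-match scores from the result tables above. The True/False flags are a manual reading of those tables (only the "Can I see arctic animals?" row is treated as a wrong match), not something measured in this article, and the `evaluate` helper is hypothetical.

```python
# (best-match score, does the best match actually fit the query?)
results = [
    (1.0, True),
    (0.935484, True),
    (1.0, True),
    (0.657143, True),
    (0.517647, True),
    (0.42623, False),   # "Can I see arctic animals?" matched the Finland question
    (0.642857, True),
]

def evaluate(threshold: float) -> tuple:
    """Return (number of answered queries, number of wrong answers)."""
    answered = [fits for score, fits in results if score > threshold]
    return len(answered), answered.count(False)

for threshold in (0.8, 0.6, 0.5, 0.3):
    answered, wrong = evaluate(threshold)
    print(f"threshold {threshold}: answered {answered} of {len(results)}, wrong {wrong}")
```

On this tiny sample, a threshold of 0.5 would answer six of the seven queries without the false positive, but with so few examples that value says little about other data sets.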

By now you have an idea of how to improve answer quality: tune the coefficients, insert more question samples, combine several matching functions, split sentences into words and match on the word level too, or avoid the full iteration with indexing and pre-processing. But you shouldn't. Advanced libraries, produced by internet giants like Google and Facebook, already deal with this subject, and they do it pretty well.
Let's go to the next level with NLP models.

## NLP Question Answering system

We use [bert-as-service](https://github.com/hanxiao/bert-as-service) to implement our next QA function. [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)), or Bidirectional Encoder Representations from Transformers, is a method of pre-training language representations [developed by Google](https://github.com/google-research/bert).
bert-as-service uses BERT as a sentence encoder, allowing you to map sentences into fixed-length representations in just a few lines of code.

### Installation

Install the server and client via pip (consult documentation for details):

```bash
pip install bert-serving-server bert-serving-client
```

Download a pre-trained BERT model. We use [BERT-Base, Cased](https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip), but you can try another model that fits your needs better. Download and unpack the archive.

Start the service, pointing `model_dir` to the folder with your downloaded model. You also need to set the maximum sequence length if the default value of 25 doesn't fit your texts:

```bash
bert-serving-start -model_dir /tmp/cased_L-12_H-768_A-12/ -num_worker=4 -max_seq_len=64
```

### Ranking (or pre-processing)

Before we use the service, we need to encode the questions of our knowledge base into BERT sentence vectors.

In [5]:
from bert_serving.client import BertClient
import numpy as np

def encode_questions():
    bc = BertClient()
    questions = data["Question"].values.tolist()
    print("Questions count", len(questions))
    print("Start to calculate encoder....")
    questions_encoder = bc.encode(questions)
    np.save("questions", questions_encoder)
    questions_encoder_len = np.sqrt(
        np.sum(questions_encoder * questions_encoder, axis=1)
    )
    np.save("questions_len", questions_encoder_len)
    print("Encoder ready")

encode_questions()

Questions count 10
Start to calculate encoder....
Encoder ready
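
Why are the vector norms saved next to the vectors? The lookup step in the next section ranks questions by cosine similarity, that is, the dot product of two vectors divided by the product of their norms, so the norms of the stored questions can be computed once in advance. Below is a minimal sketch of that ranking in plain Python with toy three-dimensional vectors. The vectors and helper names are made up for illustration; the article's own code does the same arithmetic with NumPy on the much longer vectors returned by BERT.

```python
import math

def norm(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))

def cosine_scores(query, vectors, norms):
    """Cosine similarity between the query and every stored vector."""
    query_norm = norm(query)
    return [
        sum(q * x for q, x in zip(query, vector)) / (vector_norm * query_norm)
        for vector, vector_norm in zip(vectors, norms)
    ]

# Toy stand-ins for encoded questions (hypothetical values).
vectors = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [3.0, 4.0, 0.0]]
norms = [norm(v) for v in vectors]  # computed once, like questions_len.npy

scores = cosine_scores([3.0, 4.0, 0.0], vectors, norms)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, scores)
```

Cosine similarity depends only on the direction of the vectors, not on their length, which is why a single threshold can be applied to all scores.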


### Run

In [6]:
from bert_serving.client import BertClient
import numpy as np

class BertAnswer():
    def __init__(self):
        self.bc = BertClient()
        self.q_data = data["Question"].values.tolist()
        self.a_data = data["Answer"].values.tolist()
        self.questions_encoder = np.load("questions.npy")
        self.questions_encoder_len = np.load("questions_len.npy")

    def get(self, q):
        query_vector = self.bc.encode([q])[0]
        score = np.sum((query_vector * self.questions_encoder), axis=1) / (
            self.questions_encoder_len * (np.sum(query_vector * query_vector) ** 0.5)
        )
        top_id = np.argsort(score)[::-1][0]
        if float(score[top_id]) > 0.94:
            return self.a_data[top_id], score[top_id], self.q_data[top_id]
        return "Sorry, I didn't get you.", score[top_id], self.q_data[top_id]

bm = BertAnswer()

def getBertAnswer(q):
    return bm.get(q)

getResults(test_data, getBertAnswer)

Unnamed: 0,Q,Prediction,A,Score
0,What is the population of Egypt?,What is the population of Egypt?,more than 78 million,1.0
1,What is the poulation of egypt,What is the population of Egypt?,more than 78 million,0.967848
2,How long is a leopard's tail?,How long is a leopard's tail?,60 to 110cm,1.0
3,Do you know the length of leopard's tail?,How long is a leopard's tail?,60 to 110cm,0.970769
4,When polar bears can be invisible?,Can polar bears be seen under infrared photogr...,Polar bears are nearly invisible under infrare...,0.975287
5,Can I see arctic animals?,Can polar bears be seen under infrared photogr...,Polar bears are nearly invisible under infrare...,0.964607
6,some city in Finland,What is the biggest city in Finland?,"Sorry, I didn't get you.",0.932894


Our function answered most questions correctly, but the question with index 6 is left unanswered. We can play with the score threshold and add more question samples to improve understanding in our case, but in general we need a better way: fine-tuning the model.

Let's try a fine-tuned BERT model in the next step.

## Fine-tuned NLP Question Answering system

Pre-trained BERT models often show quite good results on many tasks. However, to unleash the true power of BERT, fine-tuning on domain-specific data is necessary.

We follow the instructions in ["Sentence (and sentence-pair) classification tasks"](https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks). Clone the repository:

```bash
git clone https://github.com/google-research/bert.git
```

Download [this script](https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) and run it to download "GLUE data":

```bash
python download_glue_data.py
```

Then run the fine-tuning process:

```bash
export BERT_BASE_DIR=/tmp/cased_L-12_H-768_A-12
export GLUE_DIR=/tmp/glue_data

python run_classifier.py \
  --task_name=MRPC \
  --do_train=true \
  --do_eval=true \
  --data_dir=$GLUE_DIR/MRPC \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=/tmp/mrpc_output/ \
  --do_lower_case=False
```

The fine-tuned model is stored in `/tmp/mrpc_output/`. Look inside and find the fine-tuned model checkpoint, which is named like `model.ckpt-343`. Note the name; it is passed as a parameter to the bert-as-service server.

Now start the BERT server, putting three pieces together: the pre-trained model directory, the fine-tuned model directory, and the checkpoint name:
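
If you prefer not to search the folder by hand, the checkpoint name can be found with a few lines of Python. This is a sketch under one assumption: that each checkpoint is accompanied by an index file named like `model.ckpt-343.index`, as TensorFlow 1.x normally writes. The `latest_checkpoint_name` helper is hypothetical and not part of BERT or bert-as-service.

```python
import re
from pathlib import Path

def latest_checkpoint_name(output_dir):
    """Return the checkpoint name with the highest step number, or None."""
    steps = []
    for path in Path(output_dir).glob("model.ckpt-*.index"):
        match = re.fullmatch(r"model\.ckpt-(\d+)\.index", path.name)
        if match:
            steps.append(int(match.group(1)))
    return f"model.ckpt-{max(steps)}" if steps else None

# Prints None if the folder is missing or holds no checkpoints.
print(latest_checkpoint_name("/tmp/mrpc_output/"))
```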

```bash
bert-serving-start -model_dir /tmp/cased_L-12_H-768_A-12/ -num_worker=4 -max_seq_len=64 \
  -tuned_model_dir=/tmp/mrpc_output/ -ckpt_name=model.ckpt-343
```

After the server starts, you should find these lines in the log:

```bash
I:GRAPHOPT:checkpoint (override by the fine-tuned model): /tmp/mrpc_output/model.ckpt-343
...
I:VENTILATOR:all set, ready to serve request!
```

Now let's repeat the pre-processing and run steps:

In [7]:
from bert_serving.client import BertClient
import numpy as np

def encode_questions2():
    bc = BertClient()
    questions = data["Question"].values.tolist()
    print("Questions count", len(questions))
    print("Start to calculate encoder....")
    questions_encoder = bc.encode(questions)
    np.save("questions2", questions_encoder)
    questions_encoder_len = np.sqrt(
        np.sum(questions_encoder * questions_encoder, axis=1)
    )
    np.save("questions_len2", questions_encoder_len)
    print("Encoder ready")

encode_questions2()

Questions count 10
Start to calculate encoder....
Encoder ready


In [8]:
from bert_serving.client import BertClient
import numpy as np

class TunedBertAnswer():
    def __init__(self):
        self.bc = BertClient()
        self.q_data = data["Question"].values.tolist()
        self.a_data = data["Answer"].values.tolist()
        self.questions_encoder = np.load("questions2.npy")
        self.questions_encoder_len = np.load("questions_len2.npy")

    def get(self, q):
        query_vector = self.bc.encode([q])[0]
        score = np.sum((query_vector * self.questions_encoder), axis=1) / (
            self.questions_encoder_len * (np.sum(query_vector * query_vector) ** 0.5)
        )
        top_id = np.argsort(score)[::-1][0]
        if float(score[top_id]) > 0.94:
            return self.a_data[top_id], score[top_id], self.q_data[top_id]
        return "Sorry, I didn't get you.", score[top_id], self.q_data[top_id]

bm2 = TunedBertAnswer()

def getTunedBertAnswer(q):
    return bm2.get(q)

getResults(test_data, getTunedBertAnswer)

Unnamed: 0,Q,Prediction,A,Score
0,What is the population of Egypt?,What is the population of Egypt?,more than 78 million,1.0
1,What is the poulation of egypt,What is the population of Egypt?,more than 78 million,0.968978
2,How long is a leopard's tail?,How long is a leopard's tail?,60 to 110cm,1.0
3,Do you know the length of leopard's tail?,How long is a leopard's tail?,60 to 110cm,0.967964
4,When polar bears can be invisible?,Can polar bears be seen under infrared photogr...,Polar bears are nearly invisible under infrare...,0.960934
5,Can I see arctic animals?,Can polar bears be seen under infrared photogr...,Polar bears are nearly invisible under infrare...,0.964808
6,some city in Finland,What is the biggest city in Finland?,Greater Helsinki,0.946184


Now all questions are answered. Please note that this is just an example. On another question base
you may get better or worse results, so you need to evaluate the available technologies in each case.

## Conclusion

The quantity and content of question samples have a big influence on the results of a pre-trained model when the score is near the threshold. Try changing some test questions and
you'll get different results; even some right answers may disappear. So a pre-trained model can handle many input variants, but it
doesn't solve all possible cases. To make a good QA system, we need many question samples, trying to raise the prediction score close to 1 for most
possible input questions. Training on your own domain-specific data instead of general data may improve predictions too.

In the examples above, we demonstrated how the quality of a QA system depends on the technology,
from a single function to a pre-trained NLP model. We built a basic Question Answering system
with natural language understanding in literally a few lines of code. Such QA systems
can be used standalone for Frequently Asked Questions search, documentation search, etc.
A QA system can also greatly improve the quality of chatbots. A large number
of questions and answers can make chatbot training difficult, but a QA system
can handle this task quite well.

## Links

- https://en.wikipedia.org/wiki/Question_answering - Question answering article on Wikipedia
- https://github.com/ztane/python-Levenshtein - The Levenshtein Python C extension module
- https://github.com/hanxiao/bert-as-service - Sentence Encoding service based on BERT
- https://github.com/google-research/bert - TensorFlow pre-trained models for BERT developed by Google
- https://github.com/nghuyong/rasa-faq-bot - example integration of QA with chatbot
- https://cdqa-suite.github.io/cdQA-website/ - An End-To-End Closed Domain Question Answering System