In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

## Building an End-to-End Question-Answering System With BERT

In this notebook, we build a practical, end-to-end Question-Answering (QA) system with BERT in roughly 3 lines of code.  We will treat a corpus of text documents as a knowledge base that we can ask questions of and retrieve exact answers from using [BERT](https://arxiv.org/abs/1810.04805). This goes beyond simplistic keyword searches.

For this example, we will use the [20 Newsgroup dataset](http://qwone.com/~jason/20Newsgroups/) as the text corpus.  Because it is a collection of newsgroup postings full of opinions and debates, this corpus is not ideal as a knowledge base.  Fact-based documents such as Wikipedia articles or news articles work better.  However, this dataset will suffice for this example.

Let us begin by loading the dataset into an array using **scikit-learn** and importing *ktrain* modules.

In [2]:
# load the 20 Newsgroups dataset into a list of strings
from sklearn.datasets import fetch_20newsgroups
remove = ('headers', 'footers', 'quotes')
newsgroups_train = fetch_20newsgroups(subset='train', remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', remove=remove)
docs = newsgroups_train.data +  newsgroups_test.data

In [3]:
import ktrain
from ktrain import text

### STEP 1:  Index the Documents

We will first index the documents into a search engine that will be used to quickly retrieve documents that are likely to contain answers to a question. To do so, we must choose an index location, which must be a folder that does not already exist. 

Since the newsgroup postings are small and fit in memory, we will set `commit_every` to a large value to speed up the indexing process. This means results will not be written until the end.  If you experience issues, you can lower this value.

In [4]:
INDEXDIR = '/tmp/myindex'

In [5]:
text.SimpleQA.initialize_index(INDEXDIR)
text.SimpleQA.index_from_list(docs, INDEXDIR, commit_every=len(docs))

For document sets that are too large to be loaded into a Python list, you can use `SimpleQA.index_from_folder`, which will crawl a folder and index all plain text documents (e.g., `.txt` files).

By default, `index_from_list` and `index_from_folder` use a single processor (`procs=1`), with each processor using a maximum of 256MB of memory (`limitmb=256`) and merging results into a single segment (`multisegment=False`).  These values can be changed to speed up indexing by passing them as arguments to `index_from_list` or `index_from_folder`.  See the [whoosh documentation](https://whoosh.readthedocs.io/en/latest/batch.html) for more information on these parameters and how to use them to speed up indexing.
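As a rough sketch of how these arguments fit together, the helper below builds a keyword-argument dictionary based on the number of CPU cores available. The helper name `indexing_options` and the cap of 4 processes are hypothetical choices made for illustration; only `commit_every`, `procs`, `limitmb`, and `multisegment` come from the text above.

```python
import os

def indexing_options(n_docs, max_procs=4, limitmb=256):
    """Build keyword arguments for index_from_list / index_from_folder.

    Hypothetical helper: the cap of 4 processes is an arbitrary choice.
    """
    cores = os.cpu_count() or 1
    procs = max(1, min(max_procs, cores))
    return {
        "commit_every": n_docs,     # write results once, at the end
        "procs": procs,             # number of indexing processes
        "limitmb": limitmb,         # memory cap per process, in MB
        "multisegment": procs > 1,  # leave multiple segments only when running in parallel
    }

opts = indexing_options(n_docs=1000)
print(opts)
# Intended use (not run here):
# text.SimpleQA.index_from_list(docs, INDEXDIR, **opts)
```

Whether multiple processes actually help depends on the corpus and the machine, so it is worth timing both settings on a small sample first.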

Note that a small number of large documents will cause inferences in STEP 3 to be very slow.  If your dataset consists of large documents (e.g., books or long papers), we recommend breaking them up into pages (e.g., splitting the original PDF using something like `pdfseparate`) or splitting them into paragraphs.  The latter can be done with *ktrain* using:
```python
ktrain.text.textutils.paragraph_tokenize(document, join_sentences=True)
```
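To illustrate the idea of breaking a long document into smaller units before indexing, here is a minimal stand-in that uses only the standard library. It is not ktrain's implementation: the function name `split_into_paragraphs`, the blank-line splitting rule, and the `min_words` cutoff are all assumptions made for this sketch.

```python
import re

def split_into_paragraphs(document, min_words=5):
    """Split text on blank lines and drop very short fragments."""
    chunks = re.split(r"\n\s*\n", document)
    # Collapse internal line breaks and repeated spaces within each paragraph
    paragraphs = [" ".join(chunk.split()) for chunk in chunks]
    return [p for p in paragraphs if len(p.split()) >= min_words]

long_doc = (
    "Cassini is a joint NASA/ESA project.\nIt will explore Saturn.\n\n"
    "OK.\n\n"
    "Ranger 3 was intended to land on the Moon."
)
small_docs = split_into_paragraphs(long_doc)
print(len(small_docs))  # 2: the one-word fragment "OK." is dropped
```

Each resulting paragraph can then be indexed as its own entry, which keeps the text that BERT must read for any single candidate short.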

The above steps need to be performed only once. Once an index has been created, you can skip this step and proceed directly to **STEP 2** to begin using your system.

### STEP 2: Create a QA instance

Next, we create a QA instance.  This step will automatically download the BERT SQuAD model if it does not already exist on your system.

In [6]:
qa = text.SimpleQA(INDEXDIR)

That's it!  In roughly **3 lines of code**, we have built an end-to-end QA system that can now be used to generate answers to questions.  Let's ask our system some questions.

### STEP 3:  Ask Questions

We will invoke the `ask` method to issue questions to the text corpus we indexed and retrieve answers.  We will also use the `qa.display_answers` method to nicely display the top 5 results in this Jupyter notebook. The answers are inferred using a BERT model trained on the SQuAD dataset.  Since the model is combing through paragraphs and sentences to find an answer, it may take a minute or two to return results.

Note also that the 20 Newsgroup Dataset covers events in the early to mid 1990s, so references to recent events will not exist.

#### Space Question

In [7]:
answers = qa.ask('When did the Cassini probe launch?')
qa.display_answers(answers[:5])

| | Candidate Answer | Context | Confidence | Document Reference |
|---|---|---|---|---|
| 0 | in october of 1997 | cassini is scheduled for launch aboard a titan iv / centaur in october of 1997 . | 0.348675 | 59 |
| 1 | on january 26,1962 | ranger 3, launched on january 26,1962 , was intended to land an instrument capsule on the surface of the moon, but problems during the launch caused the probe to miss the moon and head into solar orbit. | 0.195161 | 8525 |
| 2 | on november 5,1964 | mariner 3, launched on november 5,1964 , was lost when its protective shroud failed to eject as the craft was placed into interplanetary space. | 0.162835 | 8525 |
| 3 | launched october 18,1962 | ranger 5, launched october 18,1962 and similar to ranger 3 and 4, lost all solar panel and battery power enroute and eventually missed the moon and drifted off into solar orbit. | 0.07781 | 8525 |
| 4 | 2001 | possible launch dates : 1996 for imaging orbiter, 2001 for rover. | 0.06974 | 59 |


As you can see, the top candidate answer indicates that the Cassini space probe was launched in October of 1997, which appears to be correct.  The correct answer will not always be the top answer, but it is in this case.  

Note that, since we used `index_from_list` to index documents, the last column shows the list index associated with the newsgroup posting containing the answer, which can be used to peruse the entire document containing the answer.  If using `index_from_folder` to index documents, the last column will show the relative path and filename of the document.
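The lookup described above can be sketched as follows. This example assumes that each entry returned by `qa.ask` is a dictionary with a `reference` key holding the list index; that key name is an assumption here, so inspect `answers[0].keys()` in your own session before relying on it. The toy data stands in for `docs` and `answers`, and the helper name `source_documents` is hypothetical.

```python
def source_documents(answers, docs):
    """Map each distinct document reference to its full text, in rank order."""
    seen = {}
    for ans in answers:
        ref = int(ans["reference"])  # list index when index_from_list was used
        if ref not in seen:
            seen[ref] = docs[ref]
    return seen

# Toy stand-ins for `docs` and for the output of `qa.ask`
toy_docs = ["posting about probes", "posting about rangers", "unrelated posting"]
toy_answers = [
    {"answer": "in october of 1997", "confidence": 0.35, "reference": 0},
    {"answer": "on january 26,1962", "confidence": 0.20, "reference": 1},
    {"answer": "2001", "confidence": 0.07, "reference": 0},
]
sources = source_documents(toy_answers, toy_docs)
print(list(sources))  # [0, 1]
```

Collecting the distinct references first avoids printing the same long posting several times when multiple candidate answers come from one document.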

In [8]:
print(docs[59])

Archive-name: space/new_probes
Last-modified: $Date: 93/04/01 14:39:17 $

UPCOMING PLANETARY PROBES - MISSIONS AND SCHEDULES

    Information on upcoming or currently active missions not mentioned below
    would be welcome. Sources: NASA fact sheets, Cassini Mission Design
    team, ISAS/NASDA launch schedules, press kits.


    ASUKA (ASTRO-D) - ISAS (Japan) X-ray astronomy satellite, launched into
    Earth orbit on 2/20/93. Equipped with large-area wide-wavelength (1-20
    Angstrom) X-ray telescope, X-ray CCD cameras, and imaging gas
    scintillation proportional counters.


    CASSINI - Saturn orbiter and Titan atmosphere probe. Cassini is a joint
    NASA/ESA project designed to accomplish an exploration of the Saturnian
    system with its Cassini Saturn Orbiter and Huygens Titan Probe. Cassini
    is scheduled for launch aboard a Titan IV/Centaur in October of 1997.
    After gravity assists of Venus, Earth and Jupiter in a VVEJGA
    trajectory, the spacecraft will arrive a

The 20 Newsgroup dataset contains lots of posts discussing and debating Christianity, as well.  Let's ask a question on this subject.

#### Religious Question

In [9]:
answers = qa.ask('Who was Jesus?')
qa.display_answers(answers[:5])

| | Candidate Answer | Context | Confidence | Document Reference |
|---|---|---|---|---|
| 0 | is god incarnate | jesus isn ' t god ? when jesus returns some people may miss him ? what version of the bible do you read mike ? jesus is god incarnate (in flesh). | 0.482224 | 6356 |
| 1 | jesus god only of the jews | which is more important : 1) the recorded word of jesus or 2) indications that you can deduce from the bible ? was jesus god only of the jews , or god of all humankind of all race and sex ? | 0.164358 | 7842 |
| 2 | was god in human form | first question is, if jesus was god in human form , how could he really be god ' s son ? if the holy ghost " planted the seed " in mary, so to speak, then it seems that jesus ' relationship to god would be the equivalent to the human father / son relationship. | 0.109961 | 11661 |
| 3 | was magus from the east | who acknowledged this fact ? on what basis ? are we extra biblical at this point ? why not also acknowledge that the bhagavad gita is the only relevant text for gentiles, after all we see in the bible that it was magus from the east who observed the star signs of jesus ? why bother with any texts at all ? why not just follow whatever the church has to say ? | 0.082453 | 7842 |
| 4 | the incarnation of the son | jesus is the incarnation of the son . | 0.065281 | 11661 |


Here, we see different views on who Jesus was as debated and discussed in this document set.

Finally, the 20 Newsgroup dataset also contains many groups about computing hardware and software.  Let's ask a technical support question.

#### Technical Question

In [10]:
answers = qa.ask('What causes computer images to be too dark?')
qa.display_answers(answers[:5])

| | Candidate Answer | Context | Confidence | Document Reference |
|---|---|---|---|---|
| 0 | that not all display programs do gamma correction | the problem is that not all display programs do gamma correction . | 0.848914 | 13873 |
| 1 | if your viewer does not do gamma correction | if your viewer does not do gamma correction , then linear images will look too dark, and gamma corrected images will ok. | 0.042701 | 13873 |
| 2 | altering the intensity in the hsv controls | altering the intensity in the hsv controls does not do the right thing, as it fails to take account of the effect gamma has on h and s. | 0.040876 | 13873 |
| 3 | is gamma correction | this, is gamma correction (or the lack of it). | 0.019417 | 13873 |
| 4 | if your viewer does not do gamma correction | if your viewer does not do gamma correction , then left hand ramp will have a long dark part and a short white part, and the point of equal brightness will be above the center. | 0.013624 | 13873 |


### Using `SimpleQA` as a Simple Search Engine
Once an index is created, `SimpleQA` can also be used as a conventional search engine to perform keyword searches using the `search` method:

```python
qa.search(' "solar orbit" AND "battery power" ') # find documents that contain both these phrases
```
See the [whoosh documentation](https://whoosh.readthedocs.io/en/latest/querylang.html) for more information on query syntax.
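When the phrases come from user input rather than being typed by hand, a query string in the same form as the one above can be assembled programmatically. The following is a small sketch; the helper name `all_phrases_query` is hypothetical, and it simply strips embedded double quotes rather than attempting to escape them.

```python
def all_phrases_query(phrases):
    """Join phrases into a query that requires every phrase to be present."""
    quoted = ['"{}"'.format(p.strip().replace('"', "")) for p in phrases if p.strip()]
    return " AND ".join(quoted)

query = all_phrases_query(["solar orbit", "battery power"])
print(query)  # "solar orbit" AND "battery power"
# Intended use (not run here): qa.search(query)
```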


### Deploying the QA System

To deploy this system, the only state that needs to be persisted is the search index we initialized and populated in **STEP 1**.  Once a search index is initialized and populated, one can simply re-run from **STEP 2**.
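A deployment script might decide at startup whether **STEP 1** still needs to run. The sketch below uses only the standard library and treats a non-empty index folder as a sign that indexing has already happened. This is a heuristic chosen for illustration, not a check that the index is valid, and the helper name `index_exists` is hypothetical.

```python
import os
import tempfile

def index_exists(index_dir):
    """Heuristic: a non-empty folder suggests the index was already built."""
    return os.path.isdir(index_dir) and bool(os.listdir(index_dir))

with tempfile.TemporaryDirectory() as tmp:
    index_dir = os.path.join(tmp, "myindex")
    before = index_exists(index_dir)  # folder not created yet
    os.makedirs(index_dir)
    # Placeholder file standing in for real index contents
    open(os.path.join(index_dir, "placeholder.seg"), "w").close()
    after = index_exists(index_dir)

print(before, after)  # False True
```

If the check returns `False`, run the indexing calls from **STEP 1**; otherwise go straight to creating the `SimpleQA` instance.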

