In [1]:
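# Clone the repository and install the package from source
!git clone https://github.com/nageshsinghc4/deepwrap.git
import os
# `!cd` only affects a temporary subshell, so change directory from Python instead
os.chdir('/content/deepwrap/')
os.getcwd()
!pip install .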

Cloning into 'deepwrap'...
remote: Enumerating objects: 279, done.
remote: Counting objects: 100% (279/279), done.
remote: Compressing objects: 100% (239/239), done.
remote: Total 279 (delta 40), reused 256 (delta 27), pack-reused 0
Receiving objects: 100% (279/279), 25.31 MiB | 27.51 MiB/s, done.
Resolving deltas: 100% (40/40), done.
Processing /content/deepwrap
Collecting scipy==1.5.2
  Downloading https://files.pythonhosted.org/packages/2b/a8/f4c66eb529bb252d50e83dbf2909c6502e2f857550f22571ed8556f62d95/scipy-1.5.2-cp36-cp36m-manylinux1_x86_64.whl (25.9MB)
Collecting keras_bert>=0.81.0
  Downloading https://files.pythonhosted.org/packages/e2/7f/95fabd29f4502924fa3f09ff6538c5a7d290dfef2c2fe076d3d1a16e08f0/keras-bert-0.86.0.tar.gz
Collecting langdetect
  Downloading https://files.pythonhosted.org/packages/56/a3/8407c1e62d5980188b4acc45ef3d94b933d14a2ebc9ef3505f22cf772570/langdetect-1.0.8.tar.gz (981kB)


#Building an End-to-End Question-Answering System With BERT

In this notebook, we are going to build a practical, end-to-end Question-Answering (QA) system with [BERT](https://arxiv.org/abs/1810.04805) in roughly 3 lines of code. We will treat a corpus of text documents as a knowledge base that we can query with questions, retrieving exact answers using BERT. This goes beyond simplistic keyword searches.

For this example, we will use the [20 Newsgroup dataset](http://qwone.com/~jason/20Newsgroups/) as the text corpus. As a collection of newsgroup postings that contains an abundance of opinions and debates, the corpus is not ideal as a knowledge base. It is better to use fact-based documents such as Wikipedia articles or even news articles. However, this dataset will suffice for this example.

Let us begin by loading the dataset into an array using scikit-learn and importing deepwrap modules.

In [2]:
# load the 20newsgroups dataset into a list
from sklearn.datasets import fetch_20newsgroups
remove = ('headers', 'footers', 'quotes')
newsgroups_train = fetch_20newsgroups(subset='train', remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', remove=remove)
docs = newsgroups_train.data + newsgroups_test.data

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [3]:
#Import libraries
import deepwrap
from deepwrap import text

####STEP 1: Index the Documents
We will first index the documents into a search engine that will be used to quickly retrieve documents that are likely to contain answers to a question. To do so, we must choose an index location, which must be a folder that does not already exist.

Since the newsgroup postings are small and fit in memory, we will set commit_every to a large value to speed up the indexing process. This means results will not be written until the end. If you experience issues, you can lower this value.

In [4]:
INDEXDIR = '/tmp/myindex'
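
Because the index location must not already exist, re-running the notebook against a leftover folder can fail. The following is a minimal sketch of a guard for that case. The helper `fresh_index_dir` and its `overwrite` flag are hypothetical names introduced here for illustration and are not part of deepwrap.

```python
import os
import shutil
import tempfile


def fresh_index_dir(path, overwrite=False):
    """Return `path` after making sure no folder exists there yet.

    Hypothetical helper for illustration; not part of deepwrap.
    """
    if os.path.exists(path):
        if not overwrite:
            raise FileExistsError(f"{path} already exists; choose another location")
        shutil.rmtree(path)  # discard the stale index folder
    return path


# Demonstration on a throwaway location rather than the real index folder.
demo_dir = os.path.join(tempfile.mkdtemp(), "myindex")
os.makedirs(demo_dir)  # simulate a leftover index from an earlier run
checked = fresh_index_dir(demo_dir, overwrite=True)
print(os.path.exists(checked))
```

With such a guard in place, one option would be to write `INDEXDIR = fresh_index_dir('/tmp/myindex', overwrite=True)` before initializing the index, at the cost of having to re-index the documents.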

In [5]:
text.SimpleQA.initialize_index(INDEXDIR)
text.SimpleQA.index_from_list(docs, INDEXDIR, commit_every=len(docs))

For document sets that are too large to be loaded into a Python list, you can use SimpleQA.index_from_folder, which will crawl a folder and index all plain text documents (e.g. .txt files).

By default, index_from_list and index_from_folder use a single processor (procs=1), with each processor using a maximum of 256MB of memory (limitmb=256) and merging results into a single segment (multisegment=False). These values can be passed as arguments to index_from_list or index_from_folder to speed up indexing. See the whoosh documentation for more information on these parameters and how to use them.
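
As a rough sketch of how these three parameters might be chosen together, the snippet below builds a dictionary of keyword arguments. The parameter names come from the paragraph above. The helper `indexing_options` and the specific numbers are assumptions for illustration and should be tuned to your machine.

```python
import os


def indexing_options(max_procs=4, limitmb=256):
    """Build keyword arguments for index_from_list / index_from_folder.

    Hypothetical helper: the values are illustrative, not recommendations.
    """
    # Never ask for more worker processes than the machine reports.
    procs = max(1, min(max_procs, os.cpu_count() or 1))
    return {
        "procs": procs,             # number of worker processes
        "limitmb": limitmb,         # memory cap per process, in MB
        "multisegment": procs > 1,  # leave several segments unmerged when parallel
    }


options = indexing_options()
print(options)
# Intended use (not run here, since it needs deepwrap and the corpus):
# text.SimpleQA.index_from_list(docs, INDEXDIR, commit_every=len(docs), **options)
```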

Note that a small number of large documents will cause inferences in STEP 3 to be very slow. If your dataset consists of large documents (e.g., books or long papers), we recommend breaking them up into pages (e.g., splitting the original PDF using something like pdfseparate) or splitting them into paragraphs. The latter can be done with deepwrap using:



```
# deepwrap.text.textutils.paragraph_tokenize(document, join_sentences=True)
```
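
To make the idea of paragraph splitting concrete without depending on deepwrap, here is a plain-Python sketch that splits text on blank lines. The function `split_into_paragraphs` is a hypothetical stand-in written for this example, and deepwrap's own paragraph_tokenize may behave differently.

```python
import re


def split_into_paragraphs(document, min_chars=1):
    """Split text on blank lines and collapse whitespace inside each paragraph.

    Hypothetical stand-in for illustration; not deepwrap's implementation.
    """
    chunks = re.split(r"\n\s*\n", document)
    paragraphs = [" ".join(chunk.split()) for chunk in chunks]
    # Drop empty or very short fragments.
    return [p for p in paragraphs if len(p) >= min_chars]


long_document = (
    "RSA is a public key cryptosystem.\nIt is asymmetric.\n\n"
    "\n"
    "RIPEM is a program for public key mail encryption."
)
paragraphs = split_into_paragraphs(long_document)
print(len(paragraphs))  # 2
```

Each resulting paragraph could then be indexed as its own document, so that the model reads short passages instead of whole books or papers.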
The above steps need to be performed only once. Once an index has been created, you can skip this step and proceed directly to **STEP 2** to begin using your system.






####STEP 2: Create a QA instance


Next, we create a QA instance. This step will automatically download the BERT SQuAD model if it does not already exist on your system.



In [6]:
qa = text.SimpleQA(INDEXDIR)


That's it! In roughly 3 lines of code, we have built an end-to-end QA system that can now be used to generate answers to questions. 

Let us ask our system some questions.

####STEP 3: Ask Questions


We will invoke the ask method to issue questions to the text corpus we indexed and retrieve answers. We will also use the qa.display_answers method to nicely display the top 5 results in this Jupyter notebook. The answers are inferred using a BERT model trained on the SQuAD dataset. Since the model combs through paragraphs and sentences to find an answer, it may take a minute or two to return results.

Note also that the 20 Newsgroup Dataset covers events in the early to mid 1990s, so references to recent events will not exist.



**Cryptography-related question**

In [7]:
answers = qa.ask('What is RSA?')
qa.display_answers(answers[:5])

| | Candidate Answer | Context | Confidence | Document Reference |
|---|---|---|---|---|
| 0 | is a crypto system which is asymmetric, or public key | rsa is a crypto system which is asymmetric, or public key . | 0.736525 | 10861 |
| 1 | is a crypto system which is asymmetric, or public key | rsa is a crypto system which is asymmetric, or public key . | 0.212858 | 10418 |
| 2 | is a public key cryptosystem | rsa is a public key cryptosystem defined by rivest, shamir, and adleman. | 0.01557 | 2104 |
| 3 | cryptographic communications system and method | cryptographic communications system and method (" rsa ")................................... | 0.01502 | 10418 |
| 4 | are a library called rsaref | most of the code is in the public domain, except for the rsa routines, which are a library called rsaref licensed from rsa data security inc. | 0.003415 | 10861 |


Note that, since we used **index_from_list** to index documents, the last column shows the list index associated with the newsgroup posting containing the answer, which can be used to peruse the entire document containing the answer. If using **index_from_folder** to index documents, the last column will show the relative path and filename of the document.
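
Because each row carries its own document reference, the same answer text can appear more than once, as in the first two rows above. The sketch below shows one way to collapse such duplicates and drop low-confidence rows. The records are copied from the table, but the dictionary keys (`answer`, `confidence`, `reference`) and the helper `distinct_answers` are hypothetical and may not match what qa.ask actually returns.

```python
# Stand-in records copied from the table above. The key names are assumptions.
sample_answers = [
    {"answer": "is a crypto system which is asymmetric, or public key",
     "confidence": 0.736525, "reference": 10861},
    {"answer": "is a crypto system which is asymmetric, or public key",
     "confidence": 0.212858, "reference": 10418},
    {"answer": "is a public key cryptosystem",
     "confidence": 0.01557, "reference": 2104},
    {"answer": "cryptographic communications system and method",
     "confidence": 0.01502, "reference": 10418},
    {"answer": "are a library called rsaref",
     "confidence": 0.003415, "reference": 10861},
]


def distinct_answers(answers, min_confidence=0.01):
    """Keep the highest-confidence record per answer text, above a threshold."""
    best = {}
    for record in answers:
        if record["confidence"] < min_confidence:
            continue
        key = record["answer"]
        if key not in best or record["confidence"] > best[key]["confidence"]:
            best[key] = record
    # Most confident answers first.
    return sorted(best.values(), key=lambda r: r["confidence"], reverse=True)


for record in distinct_answers(sample_answers):
    print(record["reference"], record["confidence"], record["answer"])
```

The threshold of 0.01 is arbitrary. On this sample it keeps three distinct answers and discards the last row.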

In [8]:
#Read the document reference
print(docs[10861])

Archive-name: ripem/faq
Last-update: Sun, 7 Mar 93 21:00:00 -0500

ABOUT THIS POSTING
------------------
This is a (still rather rough) listing of likely questions and
information about RIPEM, a program for public key mail encryption.  It
(this FAQ, not RIPEM) was written and will be maintained by Marc
VanHeyningen, <mvanheyn@whale.cs.indiana.edu>.  It will be posted to a
variety of newsgroups on a monthly basis; follow-up discussion specific
to RIPEM is redirected to the group alt.security.ripem.

This month, I have reformatted this posting in an attempt to comply
with the standards for HyperText FAQ formatting to allow easy
manipulation of this document over the World Wide Web.  Let me know
what you think.

DISCLAIMER
----------
Nothing in this FAQ should be considered legal advice, or anything
other than one person's opinion.  If you want real legal advice, talk
to a real lawyer.

QUESTIONS AND ANSWERS
---------------------

1)  What is RIPEM?

 RIPEM is a program which performs Pri

**Automobile-related question**

In [9]:
answers = qa.ask('What is the most sold motorcycle brand in the world?')
qa.display_answers(answers[:5])

| | Candidate Answer | Context | Confidence | Document Reference |
|---|---|---|---|---|
| 0 | about dodge shadow deleted ] what do you mean by " all models ", all models of cars, all chrysler | [ stuff about dodge shadow deleted ] what do you mean by " all models ", all models of cars, all chrysler models, all models that the fleet manager had bought ? because there is no way in hell that the shadow is the most reliable car of all models sold, not even chrysler ' s dept. | 0.493545 | 4178 |
| 1 | harleys | big fat hairy deal ! based on what i know, harleys tend to depreciate your monies far more than the initial depreciation of the bike itself when it comes to parts and service. | 0.401039 | 102 |
| 2 | than harleys | yeah, they depreciate faster than harleys for the first couple of years then they bottom out. | 0.049647 | 102 |
| 3 | / stafford | motorcycles / stafford @ vax2. | 0.036538 | 102 |
| 4 | that msw3. 1 | none of this changes the fact that msw3. 1 is objectively inferior to its competition. | 0.019231 | 18253 |


**Religion-related question**

In [None]:
answers = qa.ask('Who was Jesus?')
qa.display_answers(answers[:5])

| | Candidate Answer | Context | Confidence | Document Reference |
|---|---|---|---|---|
| 0 | is god incarnate | jesus isn ' t god ? when jesus returns some people may miss him ? what version of the bible do you read mike ? jesus is god incarnate (in flesh). | 0.482224 | 6356 |
| 1 | jesus god only of the jews | which is more important : 1) the recorded word of jesus or 2) indications that you can deduce from the bible ? was jesus god only of the jews , or god of all humankind of all race and sex ? | 0.164357 | 7842 |
| 2 | was god in human form | first question is, if jesus was god in human form , how could he really be god ' s son ? if the holy ghost " planted the seed " in mary, so to speak, then it seems that jesus ' relationship to god would be the equivalent to the human father / son relationship. | 0.109961 | 11661 |
| 3 | was magus from the east | who acknowledged this fact ? on what basis ? are we extra biblical at this point ? why not also acknowledge that the bhagavad gita is the only relevant text for gentiles, after all we see in the bible that it was magus from the east who observed the star signs of jesus ? why bother with any texts at all ? why not just follow whatever the church has to say ? | 0.082453 | 7842 |
| 4 | the incarnation of the son | jesus is the incarnation of the son . | 0.065282 | 11661 |


Here, we see different views on who Jesus was as debated and discussed in this document set.

**Atheism-related question**

In [None]:
answers = qa.ask('What about prayer in schools? If there is no God, why do you care if people pray?')
qa.display_answers(answers[:5])

| | Candidate Answer | Context | Confidence | Document Reference |
|---|---|---|---|---|
| 0 | that they want the public schools to teach what they cannot manage to teach | for what it ' s worth, i suspect that the coercion is not really targeted at the non christians-- it is yet another case of failure amongst christian parents in " making " their children prayerful, so that they want the public schools to teach what they cannot manage to teach , despite having all the opportunity in the world to do so. | 0.631494 | 11778 |
| 1 | " moment of silence | the problem with a " moment of silence " is that it is not an even handed way of " allowing " for religion amongst students in the public schools. | 0.136796 | 11778 |
| 2 | they do not need a moment of silence | if you have taught your children to pray, they do not need a moment of silence in school. | 0.062049 | 11778 |
| 3 | they want public prayers, the better to manipulate children | they want public prayers, the better to manipulate children . | 0.057068 | 17645 |
| 4 | a christian student may (and probably does) pray at innumerable times during the day, without anyone else knowing it | a christian student may (and probably does) pray at innumerable times during the day, without anyone else knowing it . | 0.052348 | 11778 |


####Using SimpleQA as a Simple Search Engine
Once an index is created, SimpleQA can also be used as a conventional search engine to perform keyword searches using the search method:


```
# find documents that contain both these phrases
qa.search(' "solar orbit" AND "battery power" ')
```


See the whoosh documentation for more information on query syntax.
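
When the phrases come from user input, it can be convenient to assemble the query string programmatically. The following is a small sketch that quotes each phrase and joins them with AND, mirroring the example query above. The helper `all_phrases_query` is a hypothetical name introduced here. It only builds the string and does not call deepwrap, and it does not escape quotation marks inside a phrase.

```python
def all_phrases_query(phrases):
    """Join phrases into a query string that requires every phrase to be present.

    Hypothetical helper for illustration; not part of deepwrap or whoosh.
    """
    cleaned = [p.strip() for p in phrases if p.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty phrase is required")
    return " AND ".join(f'"{p}"' for p in cleaned)


query = all_phrases_query(["solar orbit", "battery power"])
print(query)  # "solar orbit" AND "battery power"
```

The resulting string could then be passed to the search method in place of a hand-written query.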

