<a href="https://colab.research.google.com/github/pradeepram80/QA/blob/main/Basic_QA_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# QA System

Question answering (QA) is a fun AI problem. Let's try it out on a few public documents using the Haystack APIs.


### Prepare environment

#### Colab: Enable the GPU runtime
Make sure you enable the GPU runtime to experience decent speed in this tutorial.
**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**

In [2]:
# Make sure you have a GPU running
!nvidia-smi

Sun Sep 12 03:51:49 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   74C    P8    33W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [1]:
# Install the latest release of Haystack in your own environment 
#! pip install farm-haystack

# Install the latest master of Haystack
!pip install grpcio-tools==1.34.1
!pip install git+https://github.com/deepset-ai/haystack.git


Collecting git+https://github.com/deepset-ai/haystack.git
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-req-build-w09s1cr3
  Running command git clone -q https://github.com/deepset-ai/haystack.git /tmp/pip-req-build-w09s1cr3


In [2]:
from haystack.preprocessor.cleaning import clean_wiki_text
from haystack.preprocessor.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.utils import print_answers



## Document Store

Haystack finds answers to queries within the documents stored in a `DocumentStore`. The current implementations of `DocumentStore` include `ElasticsearchDocumentStore`, `FAISSDocumentStore`,  `SQLDocumentStore`, and `InMemoryDocumentStore`.

**Here:** We recommend Elasticsearch as it comes preloaded with features like [full-text queries](https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html), [BM25 retrieval](https://www.elastic.co/elasticon/conf/2016/sf/improved-text-scoring-with-bm25), and [vector storage for text embeddings](https://www.elastic.co/guide/en/elasticsearch/reference/7.6/dense-vector.html).

**Alternatives:** If you are unable to set up an Elasticsearch instance, follow [Tutorial 3](https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial3_Basic_QA_Pipeline_without_Elasticsearch.ipynb) to use the SQL or InMemory document stores instead.

### Start an Elasticsearch server
You can start Elasticsearch on your local machine using Docker. If Docker is not readily available in your environment (e.g., in Colab notebooks), you can manually download the Elasticsearch release archive and run it directly.

In [3]:
# Recommended: Start Elasticsearch using Docker via the Haystack utility function
from haystack.utils import launch_es

launch_es()

09/12/2021 03:57:03 - INFO - haystack.utils -   Starting Elasticsearch ...


In [4]:
# In Colab / no-Docker environments: Start Elasticsearch from the downloaded release archive
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.9.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30
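
A fixed `sleep 30` can be too short on a slow machine and needlessly long on a fast one. As a minimal sketch of an alternative, the helper below polls an HTTP endpoint until it responds. It assumes Elasticsearch listens on `http://localhost:9200`, as the connection log in this notebook shows; the name `wait_for_http` and its parameters are hypothetical and not part of Haystack.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url, timeout=60.0, interval=1.0):
    """Poll `url` until the server answers or `timeout` seconds have passed.

    Hypothetical helper for illustration. Returns True as soon as any HTTP
    response arrives, False if the deadline is reached first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True  # server answered with a success status
        except urllib.error.HTTPError:
            return True  # server is up, it just returned an error status
        except (urllib.error.URLError, OSError):
            time.sleep(interval)  # not reachable yet, retry after a pause
    return False
```

In this notebook, a call such as `wait_for_http("http://localhost:9200")` could stand in for the fixed sleep before connecting to the document store.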

In [5]:
# Connect to Elasticsearch

from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

09/12/2021 03:58:40 - INFO - elasticsearch -   HEAD http://localhost:9200/ [status:200 request:0.107s]
09/12/2021 03:58:40 - INFO - elasticsearch -   PUT http://localhost:9200/document [status:200 request:0.454s]
09/12/2021 03:58:40 - INFO - elasticsearch -   PUT http://localhost:9200/label [status:200 request:0.230s]


## Preprocessing of documents

Haystack provides a customizable pipeline for:
 - converting files into texts
 - cleaning texts
 - splitting texts
 - writing them to a Document Store

In this tutorial, we take a set of public Micron (MU) PDF documents, apply a basic cleaning function, and index them in Elasticsearch.
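
As a rough illustration of the cleaning and splitting steps, the sketch below defines a cleaning function with the `str -> str` shape that `convert_files_to_dicts` accepts as `clean_func`, and wraps each paragraph in the `{'text': ..., 'meta': {'name': ...}}` format used in this notebook. The names `clean_pdf_text` and `text_to_dicts` are hypothetical helpers, not Haystack functions, and the cleaning rules are assumptions chosen for PDF-extracted text.

```python
import re


def clean_pdf_text(text: str) -> str:
    """Hypothetical cleaning function: takes a str, returns a str."""
    text = text.replace("\x0c", "\n")  # form feeds mark PDF page breaks
    # Drop dot leaders and trailing page numbers from tables of contents
    text = re.sub(r"(?:[ \t]*\.){4,}[ \t]*\d*", "", text)
    text = re.sub(r"[ \t]+", " ", text)  # collapse runs of spaces and tabs
    return text.strip()


def text_to_dicts(text: str, name: str) -> list:
    """Split cleaned text on blank lines and wrap each paragraph in a dict."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", clean_pdf_text(text))]
    return [{"text": p, "meta": {"name": name}} for p in paragraphs if p]


raw = "Block organization . . . . . . 6\n\nThe   M29W320E is\ta recent addition.\x0cPage two."
for doc in text_to_dicts(raw, name="example.pdf"):
    print(doc)
```

A function like `clean_pdf_text` could be passed as `clean_func` in place of `clean_wiki_text` if the wiki-oriented cleaning does not suit your PDFs.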

In [6]:
# pdftotext must be installed to convert PDFs to text
!wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-linux-4.03.tar.gz && tar -xvf xpdf-tools-linux-4.03.tar.gz && sudo cp xpdf-tools-linux-4.03/bin64/pdftotext /usr/local/bin

# Point to the documents that we want to query
# Here: 27 PDFs from the Micron (MU) website, found via Google search (this cell assumes they are already in `mu_docs`)
doc_dir = "mu_docs"

# Convert files to dicts
# You can optionally supply a cleaning function that is applied to each doc (e.g. to remove footers)
# It must take a str as input, and return a str.
dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)

# We now have a list of dictionaries that we can write to our document store.
# If your texts come from a different source (e.g. a DB), you can of course skip convert_files_to_dicts() and create the dictionaries yourself.
# The default format here is:
# {
#    'text': "<DOCUMENT_TEXT_HERE>",
#    'meta': {'name': "<DOCUMENT_NAME_HERE>", ...}
# }
# (Optionally: you can also add more key-value-pairs here, that will be indexed as fields in Elasticsearch and
# can be accessed later for filtering or shown in the responses of the Pipeline)

# Let's have a look at the first 3 entries:
print(dicts[:3])

# Now, let's write the dicts containing documents to our DB.
document_store.write_documents(dicts)

--2021-09-12 03:58:46--  https://dl.xpdfreader.com/xpdf-tools-linux-4.03.tar.gz
Resolving dl.xpdfreader.com (dl.xpdfreader.com)... 45.79.72.155
Connecting to dl.xpdfreader.com (dl.xpdfreader.com)|45.79.72.155|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23024720 (22M) [application/x-gzip]
Saving to: ‘xpdf-tools-linux-4.03.tar.gz’


2021-09-12 03:58:49 (10.7 MB/s) - ‘xpdf-tools-linux-4.03.tar.gz’ saved [23024720/23024720]

xpdf-tools-linux-4.03/
xpdf-tools-linux-4.03/ANNOUNCE
xpdf-tools-linux-4.03/bin32/
xpdf-tools-linux-4.03/bin32/pdftotext
xpdf-tools-linux-4.03/bin32/pdfinfo
xpdf-tools-linux-4.03/bin32/pdftopng
xpdf-tools-linux-4.03/bin32/pdfimages
xpdf-tools-linux-4.03/bin32/pdftoppm
xpdf-tools-linux-4.03/bin32/pdftops
xpdf-tools-linux-4.03/bin32/pdfdetach
xpdf-tools-linux-4.03/bin32/pdffonts
xpdf-tools-linux-4.03/bin32/pdftohtml
xpdf-tools-linux-4.03/CHANGES
xpdf-tools-linux-4.03/bin64/
xpdf-tools-linux-4.03/bin64/pdftotext
xpdf-tools-linux-4.03/bin64/pd

09/12/2021 03:58:50 - INFO - haystack.preprocessor.utils -   Converting mu_docs/an2308_m29w320_differences_between_-m29w320eb.pdf
09/12/2021 03:58:50 - INFO - haystack.preprocessor.utils -   Converting mu_docs/2020 10-K As Filed with Certs.pdf
09/12/2021 03:58:51 - INFO - haystack.preprocessor.utils -   Converting mu_docs/tn1215_comparing_n25q_and_spn_s25fl_ja.pdf
09/12/2021 03:58:51 - INFO - haystack.preprocessor.utils -   Converting mu_docs/309014_s29gl_to_m29ew-sbc_an.pdf
09/12/2021 03:58:51 - INFO - haystack.preprocessor.utils -   Converting mu_docs/32gb_ddr4_x4x8_2cs_twindie.pdf
09/12/2021 03:58:51 - INFO - haystack.preprocessor.utils -   Converting mu_docs/tn2603_small-page_nand.pdf
09/12/2021 03:58:51 - INFO - haystack.preprocessor.utils -   Converting mu_docs/ddr3l-rs_8gb_x16_2cs_twindie_v80a.pdf
09/12/2021 03:58:51 - INFO - haystack.preprocessor.utils -   Converting mu_docs/tn1342_mg_s29gl_np_to_mt28ew_automotive.pdf
09/12/2021 03:58:51 - INFO - haystack.preprocessor.utils -  

[{'text': 'Differences between the M29W320DB/T and the M29W320EB/T Flash Memories\nThe purpose of this Application note is to highlight the differences between from the M29W320D and the M29W320E Flash memories. The M29W320D and M29W320E Flash memories are members of the family of industry standard Flash memories from Numonyx, and are suited for use in most applications. The M29W320E is a recent addition to the family and presents some new features compared to the M29W320D.\nMain Features of the M29W320D and M29W320E . . . . . . . . . . . . . . . . . 3\n1.1 Additional features of the M29W320E . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3\nComparing the M29W320D and the M29320E . . . . . . . . . . . . . . . . . . . . . 4\n2.1 Packages, pinout, and ballout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4\n2.2 Block organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6\n2.2.1 M29W320D . . . . . . . . . . . . . 

09/12/2021 03:58:54 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.689s]


## Initialize Retriever, Reader, & Pipeline

### Retriever

Retrievers narrow down the scope for the Reader to smaller units of text where a given question could be answered.
They use simple but fast algorithms.

**Here:** We use Elasticsearch's default BM25 algorithm.

**Alternatives:**

- Customize the `ElasticsearchRetriever` with custom queries (e.g. boosting) and filters
- Use `TfidfRetriever` in combination with a SQL or InMemory Document store for simple prototyping and debugging
- Use `EmbeddingRetriever` to find candidate documents based on the similarity of embeddings (e.g. created via Sentence-BERT)
- Use `DensePassageRetriever` to use different embedding models for passage and query (see Tutorial 6)
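
To make the BM25 ranking used here more concrete, the following is a minimal pure-Python sketch of Okapi BM25 scoring. It illustrates the formula only and is not Elasticsearch's implementation: tokenization is a naive lowercase whitespace split, and `k1=1.2` and `b=0.75` are assumed because they are commonly used default values.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates (k1) and is normalized by length (b)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "Micron Technology reserves the right to change products",
    "The M29W320E is a recent addition to the family",
    "Comparing small-page NAND and large-page NAND devices",
]
scores = bm25_scores("micron products", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print("best document:", docs[best])
```

Only the first document contains the query terms, so it is the only one with a non-zero score. The `top_k` setting of a retriever then simply keeps the highest-scoring documents.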

In [7]:
from haystack.retriever.sparse import ElasticsearchRetriever
retriever = ElasticsearchRetriever(document_store=document_store)

In [None]:
# Alternative: An in-memory TfidfRetriever based on Pandas dataframes for building quick prototypes with a SQLite document store.

# from haystack.retriever.sparse import TfidfRetriever
# retriever = TfidfRetriever(document_store=document_store)

### Reader

A Reader scans the texts returned by the Retriever in detail and extracts the k best answers. Readers are based
on powerful but slower deep learning models.
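
The idea of extracting the k best answers can be sketched as follows: an extractive QA model assigns each token a start score and an end score, and candidate spans are ranked by the sum of the two. This is a simplified illustration with made-up scores, not the actual FARM or Transformers implementation; the function name `top_k_spans` and the `max_answer_len` limit are assumptions.

```python
def top_k_spans(tokens, start_scores, end_scores, k=3, max_answer_len=5):
    """Return the k highest-scoring (score, text) spans."""
    candidates = []
    for start, start_score in enumerate(start_scores):
        # Only consider spans that end at or after the start and are not too long
        for end in range(start, min(start + max_answer_len, len(tokens))):
            score = start_score + end_scores[end]
            candidates.append((score, " ".join(tokens[start:end + 1])))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[:k]


tokens = ["Micron", "Technology", "Inc", "reserves", "the", "right"]
start_scores = [4.0, 1.0, 0.2, 0.1, 0.0, 0.0]  # made-up values
end_scores = [0.5, 2.0, 3.5, 0.3, 0.0, 0.1]    # made-up values
for score, text in top_k_spans(tokens, start_scores, end_scores, k=2):
    print(f"{score:.1f}  {text}")
```

Scoring every span this way is far more expensive than keyword matching, which is why the Reader only sees the few documents the Retriever lets through.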

Haystack currently supports Readers based on the frameworks FARM and Transformers.
With both you can either load a local model or one from Hugging Face's model hub (https://huggingface.co/models).

**Here:** a medium-sized RoBERTa QA model using a Reader based on FARM (https://huggingface.co/deepset/roberta-base-squad2)

**Alternatives (Reader):** TransformersReader (leveraging the `pipeline` of the Transformers package)

**Alternatives (Models):** e.g. "distilbert-base-uncased-distilled-squad" (fast) or "deepset/bert-large-uncased-whole-word-masking-squad2" (good accuracy)

**Hint:** You can adjust how readily the model returns "no answer possible" with the `no_ans_boost` parameter. Higher values mean the model prefers "no answer possible".
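
The effect of such a boost can be illustrated with a small sketch. It assumes the reader compares a "no answer" score with the best span score and adds the boost to the former. This is a simplified model of the idea with made-up numbers, not FARM's exact logic, and `choose_answer` is a hypothetical helper.

```python
def choose_answer(best_span, best_span_score, no_answer_score, no_ans_boost=0.0):
    """Return the span text, or None when "no answer" wins after boosting."""
    if no_answer_score + no_ans_boost > best_span_score:
        return None
    return best_span


# Made-up scores: the span narrowly beats "no answer" when no boost is applied
print(choose_answer("Micron Technology, Inc", 5.0, 4.0))
# A positive boost tips the decision towards "no answer"
print(choose_answer("Micron Technology, Inc", 5.0, 4.0, no_ans_boost=2.0))
# A negative boost makes "no answer" less likely
print(choose_answer("Micron Technology, Inc", 5.0, 4.0, no_ans_boost=-2.0))
```

Which boost value works well depends on the model and the documents, so it is worth checking the Haystack documentation for your version and trying a few values on real questions.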

#### FARMReader

In [8]:
# Load a local model or any of the QA models on
# Hugging Face's model hub (https://huggingface.co/models)

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

09/12/2021 03:59:24 - INFO - farm.utils -   Using device: CUDA 
09/12/2021 03:59:24 - INFO - farm.utils -   Number of GPUs: 1
09/12/2021 03:59:24 - INFO - farm.utils -   Distributed Training: False
09/12/2021 03:59:24 - INFO - farm.utils -   Automatic Mixed Precision: None
09/12/2021 03:59:25 - INFO - filelock -   Lock 140266050581072 acquired on /root/.cache/huggingface/transformers/c40d0abb589629c48763f271020d0b1f602f5208c432c0874d420491ed37e28b.122ed338b3591c07dba452777c59ff52330edb340d3d56d67aa9117ad9905673.lock


Downloading:   0%|          | 0.00/571 [00:00<?, ?B/s]

09/12/2021 03:59:26 - INFO - filelock -   Lock 140266050581072 released on /root/.cache/huggingface/transformers/c40d0abb589629c48763f271020d0b1f602f5208c432c0874d420491ed37e28b.122ed338b3591c07dba452777c59ff52330edb340d3d56d67aa9117ad9905673.lock
09/12/2021 03:59:28 - INFO - filelock -   Lock 140269188139152 acquired on /root/.cache/huggingface/transformers/eac3273a8097dda671e3bea1db32c616e74f36a306c65b4858171c98d6db83e9.084aa7284f3a51fa1c8f0641aa04c47d366fbd18711f29d0a995693cfdbc9c9e.lock


Downloading:   0%|          | 0.00/496M [00:00<?, ?B/s]

09/12/2021 03:59:46 - INFO - filelock -   Lock 140269188139152 released on /root/.cache/huggingface/transformers/eac3273a8097dda671e3bea1db32c616e74f36a306c65b4858171c98d6db83e9.084aa7284f3a51fa1c8f0641aa04c47d366fbd18711f29d0a995693cfdbc9c9e.lock
Some weights of the model checkpoint at deepset/roberta-base-squad2 were not used when initializing RobertaModel: ['qa_outputs.weight', 'qa_outputs.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at deepset/roberta-base-squad2 and are newly initialized: ['ro

Downloading:   0%|          | 0.00/899k [00:00<?, ?B/s]

09/12/2021 03:59:58 - INFO - filelock -   Lock 140265986296976 released on /root/.cache/huggingface/transformers/81c80edb4c6cefa5cae64ccfdb34b3b309ecaf60da99da7cd1c17e24a5d36eb5.647b4548b6d9ea817e82e7a9231a320231a1c9ea24053cc9e758f3fe68216f05.lock
09/12/2021 03:59:59 - INFO - filelock -   Lock 140265977777296 acquired on /root/.cache/huggingface/transformers/b87d46371731376b11768b7839b1a5938a4f77d6bd2d9b683f167df0026af432.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b.lock


Downloading:   0%|          | 0.00/456k [00:00<?, ?B/s]

09/12/2021 04:00:00 - INFO - filelock -   Lock 140265977777296 released on /root/.cache/huggingface/transformers/b87d46371731376b11768b7839b1a5938a4f77d6bd2d9b683f167df0026af432.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b.lock
09/12/2021 04:00:03 - INFO - filelock -   Lock 140265977145168 acquired on /root/.cache/huggingface/transformers/c9d2c178fac8d40234baa1833a3b1903d393729bf93ea34da247c07db24900d0.cb2244924ab24d706b02fd7fcedaea4531566537687a539ebb94db511fd122a0.lock


Downloading:   0%|          | 0.00/772 [00:00<?, ?B/s]

09/12/2021 04:00:04 - INFO - filelock -   Lock 140265977145168 released on /root/.cache/huggingface/transformers/c9d2c178fac8d40234baa1833a3b1903d393729bf93ea34da247c07db24900d0.cb2244924ab24d706b02fd7fcedaea4531566537687a539ebb94db511fd122a0.lock
09/12/2021 04:00:04 - INFO - filelock -   Lock 140265976904144 acquired on /root/.cache/huggingface/transformers/e8a600814b69e3ee74bb4a7398cc6fef9812475010f16a6c9f151b2c2772b089.451739a2f3b82c3375da0dfc6af295bedc4567373b171f514dd09a4cc4b31513.lock


Downloading:   0%|          | 0.00/79.0 [00:00<?, ?B/s]

09/12/2021 04:00:05 - INFO - filelock -   Lock 140265976904144 released on /root/.cache/huggingface/transformers/e8a600814b69e3ee74bb4a7398cc6fef9812475010f16a6c9f151b2c2772b089.451739a2f3b82c3375da0dfc6af295bedc4567373b171f514dd09a4cc4b31513.lock
09/12/2021 04:00:05 - INFO - farm.utils -   Using device: CUDA 
09/12/2021 04:00:05 - INFO - farm.utils -   Number of GPUs: 1
09/12/2021 04:00:05 - INFO - farm.utils -   Distributed Training: False
09/12/2021 04:00:05 - INFO - farm.utils -   Automatic Mixed Precision: None
09/12/2021 04:00:05 - INFO - farm.infer -   Got ya 2 parallel workers to do inference ...
09/12/2021 04:00:05 - INFO - farm.infer -    0    0 
09/12/2021 04:00:05 - INFO - farm.infer -   /w\  /w\
09/12/2021 04:00:05 - INFO - farm.infer -   /'\  / \
09/12/2021 04:00:05 - INFO - farm.infer -     
09/12/2021 04:00:05 - INFO - farm.utils -   Using device: CUDA 
09/12/2021 04:00:05 - INFO - farm.utils -   Number of GPUs: 1
09/12/2021 04:00:05 - INFO - farm.utils -   Distributed 

#### TransformersReader

In [9]:
# Alternative:
# reader = TransformersReader(model_name_or_path="distilbert-base-uncased-distilled-squad", tokenizer="distilbert-base-uncased", use_gpu=-1)

### Pipeline

With a Haystack `Pipeline` you can combine your building blocks into a search pipeline.
Under the hood, `Pipelines` are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases.
To speed things up, Haystack also comes with a few predefined Pipelines. One of them is the `ExtractiveQAPipeline` that combines a retriever and a reader to answer our questions.
You can learn more about `Pipelines` in the [docs](https://haystack.deepset.ai/docs/latest/pipelinesmd).
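
Conceptually, the retriever-reader pipeline is a two-node graph: the query flows into the retriever, and the retriever's documents flow into the reader. The sketch below mimics that flow with plain Python stand-ins. `ToyRetriever`, `ToyReader`, and `run_pipeline` are hypothetical and only illustrate the data flow and the idea of per-node `params`; they are not Haystack's implementation.

```python
import re


def words(text):
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


class ToyRetriever:
    """Stand-in retriever: ranks documents by word overlap with the query."""

    def __init__(self, docs):
        self.docs = docs

    def run(self, query, top_k=10):
        ranked = sorted(self.docs, key=lambda d: -len(words(query) & words(d)))
        return ranked[:top_k]


class ToyReader:
    """Stand-in reader: treats the text before the first period as the answer."""

    def run(self, query, documents, top_k=5):
        return [doc.split(".")[0] for doc in documents][:top_k]


def run_pipeline(query, retriever, reader, params):
    # Edge 1: Query -> Retriever
    documents = retriever.run(query, **params.get("Retriever", {}))
    # Edge 2: Retriever -> Reader
    answers = reader.run(query, documents, **params.get("Reader", {}))
    return {"query": query, "answers": answers}


docs = [
    "Micron Technology, Inc. reserves the right to change products.",
    "The M29W320E is a recent addition to the family.",
    "Small-page NAND differs from large-page NAND.",
]
result = run_pipeline(
    "micron technology",
    ToyRetriever(docs),
    ToyReader(),
    params={"Retriever": {"top_k": 2}, "Reader": {"top_k": 1}},
)
print(result)
```

The nested `params` dictionary follows the same shape as the one passed to `pipe.run` later in this notebook, where each key names a node and each value holds that node's settings.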

In [10]:
from haystack.pipeline import ExtractiveQAPipeline
pipe = ExtractiveQAPipeline(reader, retriever)

## Voilà! Ask a question!

In [49]:
# You can configure how many candidates the retriever and the reader return.
# A higher top_k tends to give better answers, but is also slower.
prediction = pipe.run(
    query="who is micron?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
)

09/12/2021 04:40:13 - INFO - elasticsearch -   POST http://localhost:9200/document/_search [status:200 request:0.018s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  3.02 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  2.53 Batches/s]
Inferencing Samples: 100%|██████████| 10/10 [00:15<00:00,  1.53s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:01<00:00,  1.50s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:01<00:00,  1.24s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:01<00:00,  1.20s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:01<00:00,  1.34s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:01<00:00,  1.23s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:01<00:00,  1.38s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:01<00:00,  1.49s/ Batches]


In [29]:
# prediction = pipe.run(query="Who created the Dothraki vocabulary?", params={"Reader": {"top_k": 5}})
# prediction = pipe.run(query="Who is the sister of Sansa?", params={"Reader": {"top_k": 5}})

In [50]:
print_answers(prediction, details="minimal")

[   {   'answer': 'Micron Technology, Inc',
        'context': 'invested. The performance was plotted using the following '
                   'data:\n'
                   'Micron Technology, Inc. S&P 500 Composite Index '
                   'Philadelphia Semiconductor Index (SOX)'},
    {   'answer': 'Micron Technology, Inc',
        'context': '5aef84ccb511 DDR3L-RS_8Gb_x16_2CS_TwinDie.pdf - Rev. D '
                   '05/13 EN\n'
                   'Micron Technology, Inc. reserves the right to change '
                   'products or specifications withou'},
    {   'answer': 'Micron Technology, Inc',
        'context': 'ight to change products or specifications without notice.  '
                   '2014 Micron Technology, Inc. All rights reserved.\n'
                   '\x0c'
                   'TN-1335: Migrating S29GL-N/P to MT28EW N'},
    {   'answer': 'Micron Technology, Inc',
        'context': ' AcceleratedTM, and other Micron trademarks are the '
                   'prop

## About us

This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany

We bring NLP to the industry via open source!  
Our focus: Industry specific language models & large scale QA systems.  
  
Some of our other work: 
- [German BERT](https://deepset.ai/german-bert)
- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)
- [FARM](https://github.com/deepset-ai/FARM)

Get in touch:
[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)
