<a href="https://colab.research.google.com/github/plaban1981/haystack/blob/master/Build_a_QA_System_Without_Elasticsearch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build a QA System Without Elasticsearch

Haystack provides alternatives to Elasticsearch for developing quick prototypes.

You can use an `InMemoryDocumentStore` or a `SQLDocumentStore` (with SQLite) as the document store.

If you need the richer feature set of Elasticsearch, refer to Tutorial 1.

Check whether the GPU runtime is enabled with the following command:

In [1]:
%%bash

nvidia-smi

Wed Nov 16 10:03:29 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   59C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### Install the latest release of Haystack with pip

In [2]:
%%bash

pip install --upgrade pip
pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pip
  Downloading pip-22.3.1-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 21.1.3
    Uninstalling pip-21.1.3:
      Successfully uninstalled pip-21.1.3
Successfully installed pip-22.3.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting farm-haystack[colab]
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-install-dueloh6t/farm-haystack_064ed8a24faa4fb9bd80c03ef483e79d
  Resolved https://github.com/deepset-ai/haystack.git to commit af78f8b431af06d7078bbc3231c3a6fba875a916
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Installing backend dependencies: started

  Running command git clone --filter=blob:none --quiet https://github.com/deepset-ai/haystack.git /tmp/pip-install-dueloh6t/farm-haystack_064ed8a24faa4fb9bd80c03ef483e79d
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
grpcio-status 1.48.2 requires grpcio>=1.48.2, but you have grpcio 1.47.0 which is incompatible.


## Logging

We configure how log messages are displayed and which log level is used before importing Haystack.
Example log message:
`INFO - haystack.utils.preprocessing -  Converting data/tutorial1/218_Olenna_Tyrell.txt`
The default log level in `basicConfig` is WARNING, so passing it explicitly is not necessary, but doing so makes the level easy to change:

In [3]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)
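To see why the two calls above interact the way they do, here is a small standard-library sketch. It writes to a string buffer instead of the console, and the logger name `some_other_library` and the message texts are made up for illustration. Loggers named `haystack.*` inherit the INFO level set on `haystack`, while every other logger falls back to the root level of WARNING.

```python
import io
import logging

# Capture log output in a buffer so the effect is easy to inspect.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s - %(name)s -  %(message)s"))

root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.WARNING)                        # threshold inherited by all loggers
logging.getLogger("haystack").setLevel(logging.INFO)  # lower threshold for haystack.* only

# A child of "haystack" inherits INFO, so this message is emitted.
logging.getLogger("haystack.utils.preprocessing").info("Converting example.txt")
# Hypothetical third-party logger: inherits WARNING from root, so INFO is dropped.
logging.getLogger("some_other_library").info("this message is dropped")

captured = buffer.getvalue()
print(captured)
```

Only the first message ends up in `captured`, which is why the rest of this notebook shows INFO lines from Haystack modules.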

## Document Store

In [4]:
# In-Memory Document Store
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

INFO:haystack.telemetry:Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by calling disable_telemetry() or by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems on the documentation page. More information at https://haystack.deepset.ai/guides/telemetry
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


In [5]:
# Alternatively, uncomment the following to use the SQLite Document Store:

# from haystack.document_stores import SQLDocumentStore
# document_store = SQLDocumentStore(url="sqlite:///qa.db")


## Preprocessing of documents

Haystack provides a customizable pipeline for:
 - converting files into texts
 - cleaning texts
 - splitting texts
 - writing them to a Document Store

In this tutorial, we download Wikipedia articles on Game of Thrones, apply a basic cleaning function, and index them in the document store.
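As a rough picture of the cleaning and splitting steps, the following standard-library sketch normalizes whitespace and splits a text into one dictionary per paragraph. `clean_text` and `split_into_docs` are hypothetical helpers written for this illustration; they are not Haystack functions and do far less than `clean_wiki_text`.

```python
import re
from typing import Dict, List


def clean_text(text: str) -> str:
    """Hypothetical cleaning function: takes a str and returns a str."""
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse runs of blank lines
    return text.strip()


def split_into_docs(name: str, text: str) -> List[Dict[str, str]]:
    """Clean the text, then emit one dict per non-empty paragraph."""
    paragraphs = [p.strip() for p in clean_text(text).split("\n\n")]
    return [{"name": name, "content": p} for p in paragraphs if p]


raw = "Arya  Stark is a character.\n\n\n\nShe travels with her father,\tEddard.\n"
toy_docs = split_into_docs("43_Arya_Stark.txt", raw)
for doc in toy_docs:
    print(doc)
```

Splitting by paragraph keeps each indexed unit short, which gives the retriever smaller candidates to rank.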

In [6]:
from haystack.utils import clean_wiki_text, convert_files_to_docs, fetch_archive_from_http


# Let's first get some documents that we want to query
# Here: 517 Wikipedia articles for Game of Thrones
doc_dir = "data/tutorial3"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt3.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

# Convert files to Document objects that can be indexed in our document store
# You can optionally supply a cleaning function that is applied to each doc (e.g. to remove footers)
# It must take a str as input, and return a str.
docs = convert_files_to_docs(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)

# We now have a list of Document objects that we can write to our document store.
# If your texts come from a different source (e.g. a DB), you can of course skip convert_files_to_docs() and create dictionaries yourself.
# The default format here is: {"name": "<some-document-name>", "content": "<the-actual-text>"}

# Let's have a look at the first 3 entries:
print(docs[:3])

# Now, let's write the docs to our DB.
document_store.write_documents(docs)

INFO:haystack.utils.import_utils:Fetching from https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt3.zip to 'data/tutorial3'
INFO:haystack.utils.preprocessing:Converting data/tutorial3/420_Blood_of_My_Blood.txt
INFO:haystack.utils.preprocessing:Converting data/tutorial3/460_Battle_of_the_Bastards.txt
INFO:haystack.utils.preprocessing:Converting data/tutorial3/311_Game_of_Thrones__season_7_.txt
INFO:haystack.utils.preprocessing:Converting data/tutorial3/74_The_Prince_of_Winterfell.txt
INFO:haystack.utils.preprocessing:Converting data/tutorial3/9_Game_of_Thrones_Tapestry.txt
INFO:haystack.utils.preprocessing:Converting data/tutorial3/12_Fire.txt
INFO:haystack.utils.preprocessing:Converting data/tutorial3/452_Fire_and_Blood__Game_of_Thrones_.txt
INFO:haystack.utils.preprocessing:Converting data/tutorial3/119_Walk_of_Punishment.txt
INFO:haystack.utils.preprocessing:Converting data/tutorial3/513_Oathbreaker__Game_of_Thrones_.txt
INFO:haystack.ut

[<Document: {'content': '"\'\'\'Blood of My Blood\'\'\'" is the sixth episode of the sixth season of HBO\'s fantasy television series \'\'Game of Thrones\'\', and the 56th overall. The episode was written by Bryan Cogman, and directed by Jack Bender.\nBran Stark and Meera Reed are rescued from the White Walkers by Benjen Stark. Samwell Tarly returns to his family\'s home in Horn Hill, accompanied by Gilly and little Sam; Jaime Lannister attempts to rescue the Queen, Margaery Tyrell; Arya Stark defies the Faceless Men; and Daenerys Targaryen rides on Drogon and emboldens her newly acquired khalasar.\n"Blood of My Blood" was positively received by critics who praised the return of several notable characters, including Benjen Stark, Walder Frey and Edmure Tully. Further praise was given to other plot points, such as Samwell\'s return to Horn Hill, and Arya\'s decision to return to being a Stark rather than a disciple of the Many-Faced God. The episode title is a reference to a famous Doth

INFO:haystack.document_stores.base:Duplicate Documents: Document with id '1906b2acc6c764b69e619e5eb2fa646f' already exists in index 'document'
INFO:haystack.document_stores.base:Duplicate Documents: Document with id 'c8b51f62e0fccac8361c4464cc2c8f70' already exists in index 'document'
INFO:haystack.document_stores.base:Duplicate Documents: Document with id '994b9df41876668d9a5cae9510915a24' already exists in index 'document'
INFO:haystack.document_stores.base:Duplicate Documents: Document with id '1906b2acc6c764b69e619e5eb2fa646f' already exists in index 'document'
INFO:haystack.document_stores.base:Duplicate Documents: Document with id '9e9a3181b6bc168b4a25429b641e8c86' already exists in index 'document'
INFO:haystack.document_stores.base:Duplicate Documents: Document with id '9e9a3181b6bc168b4a25429b641e8c86' already exists in index 'document'
INFO:haystack.document_stores.base:Duplicate Documents: Document with id '9e9a3181b6bc168b4a25429b641e8c86' already exists in index 'document'
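The `Duplicate Documents` messages above report documents whose ID already exists in the index. One way such collisions can arise is when IDs are derived from the document text, so that identical paragraphs map to the same ID. The sketch below assumes exactly that, using an MD5 hash chosen purely for illustration; `ToyStore` is hypothetical and is not how Haystack's document stores are implemented.

```python
import hashlib
from typing import Dict, List


class ToyStore:
    """Hypothetical in-memory store that keys documents by a content hash."""

    def __init__(self) -> None:
        self.docs: Dict[str, Dict[str, str]] = {}

    def write_documents(self, docs: List[Dict[str, str]]) -> int:
        """Store the documents and return how many duplicates were skipped."""
        skipped = 0
        for doc in docs:
            doc_id = hashlib.md5(doc["content"].encode("utf-8")).hexdigest()
            if doc_id in self.docs:
                skipped += 1  # same text already stored under this ID
                continue
            self.docs[doc_id] = doc
        return skipped


store = ToyStore()
skipped = store.write_documents(
    [
        {"name": "a.txt", "content": "See also"},
        {"name": "b.txt", "content": "Arya Stark is a character."},
        {"name": "c.txt", "content": "See also"},  # identical text, different file
    ]
)
print(f"stored={len(store.docs)} skipped={skipped}")
```

Short boilerplate paragraphs that recur across many articles are a plausible source of such duplicates in a paragraph-split corpus.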

## Initialize Retriever, Reader & Pipeline

### Retriever

Retrievers help narrow down the scope for the Reader to smaller units of text where a given question could be answered.

With `InMemoryDocumentStore` or `SQLDocumentStore`, you can use the `TfidfRetriever`. For more retrievers, refer to Tutorial 1.

In [9]:
# An in-memory TfidfRetriever based on Pandas dataframes
from haystack.nodes import TfidfRetriever

retriever = TfidfRetriever(document_store=document_store)

INFO:haystack.nodes.retriever.sparse:Found 2357 candidate paragraphs from 2357 docs in DB
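To illustrate the idea behind TF-IDF retrieval, here is a minimal pure-Python sketch. It is not the `TfidfRetriever` implementation: the `tokenize` and `tfidf_retrieve` helpers, the smoothed IDF formula, and the three sample sentences are assumptions made for this example. Each text becomes a vector of term counts weighted by inverse document frequency, and candidates are ranked by cosine similarity to the query.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_retrieve(query: str, docs: List[str], top_k: int = 2) -> List[Tuple[float, str]]:
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: rarer terms get a higher weight.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    def vectorize(tokens: List[str]) -> Dict[str, float]:
        # Terms unseen in the corpus carry no weight and are dropped.
        return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}

    def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
        return dot / norm if norm else 0.0

    q_vec = vectorize(tokenize(query))
    scored = [(cosine(q_vec, vectorize(tokens)), doc) for tokens, doc in zip(tokenized, docs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


sample_docs = [
    "Arya Stark travels with her father Eddard to King's Landing.",
    "The Dothraki vocabulary was created by David J. Peterson.",
    "Sansa Stark is the elder sister of Arya.",
]
results = tfidf_retrieve("Who created the Dothraki vocabulary?", sample_docs, top_k=2)
for score, doc in results:
    print(f"{score:.3f}  {doc}")
```

The retriever only ranks candidate passages; extracting the actual answer span is left to the Reader.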


### FARMReader

In [11]:
from haystack.nodes import FARMReader

# Load a local model or any of the QA models on
# Hugging Face's model hub (https://huggingface.co/models)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)


INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


Downloading config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

INFO:haystack.modeling.model.language_model: * LOADING MODEL: 'deepset/roberta-base-squad2' (Roberta)


Downloading pytorch_model.bin:   0%|          | 0.00/473M [00:00<?, ?B/s]

INFO:haystack.modeling.model.language_model:Auto-detected model language: english
INFO:haystack.modeling.model.language_model:Loaded 'deepset/roberta-base-squad2' (Roberta model) from model hub.


Downloading tokenizer_config.json:   0%|          | 0.00/79.0 [00:00<?, ?B/s]

Downloading vocab.json:   0%|          | 0.00/878k [00:00<?, ?B/s]

Downloading merges.txt:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading special_tokens_map.json:   0%|          | 0.00/772 [00:00<?, ?B/s]

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


### TransformersReader
Alternatively, we can use a Transformers reader:

In [8]:
# from haystack.nodes import TransformersReader
# reader = TransformersReader(model_name_or_path="distilbert-base-uncased-distilled-squad", tokenizer="distilbert-base-uncased", use_gpu=-1)


### Pipeline

With a Haystack `Pipeline`, you can combine your building blocks into a search pipeline.
Under the hood, `Pipelines` are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases.
To speed things up, Haystack also comes with a few predefined Pipelines. One of them is the `ExtractiveQAPipeline`, which combines a retriever and a reader to answer our questions.
You can learn more about `Pipelines` in the [docs](https://haystack.deepset.ai/docs/latest/pipelines).

In [12]:
from haystack.pipelines import ExtractiveQAPipeline
pipe = ExtractiveQAPipeline(reader, retriever)
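Conceptually, a retriever followed by a reader is the simplest possible DAG: a linear chain of two nodes. The sketch below mimics that shape in plain Python, including per-node parameters keyed by node name. `ToyPipeline`, `toy_retriever`, `toy_reader`, and the three-sentence corpus are all hypothetical stand-ins and say nothing about how Haystack implements `Pipeline`.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

CORPUS = [
    "Eddard is the father of Arya",
    "Sansa is the sister of Arya",
    "David J. Peterson created the Dothraki vocabulary",
]


def toy_retriever(state: Dict[str, Any], top_k: int = 2) -> Dict[str, Any]:
    """Rank corpus sentences by the number of words shared with the query."""
    words = set(state["query"].lower().rstrip("?").split())
    ranked = sorted(CORPUS, key=lambda d: len(words & set(d.lower().split())), reverse=True)
    return {"documents": ranked[:top_k]}


def toy_reader(state: Dict[str, Any], top_k: int = 1) -> Dict[str, Any]:
    """Placeholder reader: returns whole documents instead of extracting spans."""
    return {"answers": state["documents"][:top_k]}


class ToyPipeline:
    """Runs named nodes in order, passing each node its own parameters."""

    def __init__(self, nodes: List[Tuple[str, Callable[..., Dict[str, Any]]]]) -> None:
        self.nodes = nodes

    def run(self, query: str, params: Optional[Dict[str, Dict[str, Any]]] = None) -> Dict[str, Any]:
        params = params or {}
        state: Dict[str, Any] = {"query": query}
        for name, node in self.nodes:
            # Each node reads the shared state and contributes new keys to it.
            state.update(node(state, **params.get(name, {})))
        return state


toy_pipe = ToyPipeline([("Retriever", toy_retriever), ("Reader", toy_reader)])
result = toy_pipe.run(
    "Who is the father of Arya?",
    params={"Retriever": {"top_k": 2}, "Reader": {"top_k": 1}},
)
print(result["answers"])
```

The `params` dictionary follows the same node-name-to-arguments layout that the `pipe.run(...)` calls in this notebook use.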


## Ask a question

In [13]:
# You can configure how many candidates the Reader and the Retriever should return.
# The higher the Retriever's top_k, the better (but also the slower) your answers.
prediction = pipe.run(
    query="Who is the father of Arya Stark?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
)

Inferencing Samples: 100%|██████████| 1/1 [00:04<00:00,  4.18s/ Batches]


In [16]:
from pprint import pprint
pprint(prediction)

{'answers': [<Answer {'answer': 'Eddard', 'type': 'extractive', 'score': 0.9919580221176147, 'context': "s Nymeria after a legendary warrior queen. She travels with her father, Eddard, to King's Landing when he is made Hand of the King. Before she leaves,", 'offsets_in_document': [{'start': 147, 'end': 153}], 'offsets_in_context': [{'start': 72, 'end': 78}], 'document_id': 'ba2a8e87ddd95e380bec55983ee7d55f', 'meta': {'name': '43_Arya_Stark.txt'}}>,
             <Answer {'answer': 'Ned', 'type': 'extractive', 'score': 0.9767245054244995, 'context': "\n====Season 1====\nArya accompanies her father Ned and her sister Sansa to King's Landing. Before their departure, Arya's half-brother Jon Snow gifts A", 'offsets_in_document': [{'start': 46, 'end': 49}], 'offsets_in_context': [{'start': 46, 'end': 49}], 'document_id': '180c2a6b36369712b361a80842e79356', 'meta': {'name': '43_Arya_Stark.txt'}}>,
             <Answer {'answer': 'Robert Baratheon', 'type': 'extractive', 'score': 0.940885305404

In [17]:
prediction = pipe.run(query="Who created the Dothraki vocabulary?", params={"Reader": {"top_k": 5}})
pprint(prediction)

Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  3.11 Batches/s]

{'answers': [<Answer {'answer': 'David J. Peterson', 'type': 'extractive', 'score': 0.9532105922698975, 'context': "orld. The language was developed for the TV series by the linguist David J. Peterson, working off the Dothraki words and phrases in Martin's novels.\n,", 'offsets_in_document': [{'start': 329, 'end': 346}], 'offsets_in_context': [{'start': 67, 'end': 84}], 'document_id': '308dca876f94e5e839187f1463693015', 'meta': {'name': '214_Dothraki_language.txt'}}>,
             <Answer {'answer': 'David J. Peterson', 'type': 'extractive', 'score': 0.8687498569488525, 'context': "age for ''Game of Thrones''\nThe Dothraki vocabulary was created by David J. Peterson well in advance of the adaptation. HBO hired the Language Creatio", 'offsets_in_document': [{'start': 139, 'end': 156}], 'offsets_in_context': [{'start': 67, 'end': 84}], 'document_id': '27baa56e5aab6b04d38f19e97e078bc6', 'meta': {'name': '214_Dothraki_language.txt'}}>,
             <Answer {'answer': 'Daenerys', 'type': 'e




In [18]:
prediction = pipe.run(query="Who is the sister of Sansa?", params={"Reader": {"top_k": 5}})
pprint(prediction)

Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.44 Batches/s]

{'answers': [<Answer {'answer': 'Arya', 'type': 'extractive', 'score': 0.9512491226196289, 'context': 'n contrast to "her universally (and rightly) adored tomboy little sister Arya", stating that Sansa "arguably gets a disproportionate amount of fan hat', 'offsets_in_document': [{'start': 1402, 'end': 1406}], 'offsets_in_context': [{'start': 73, 'end': 77}], 'document_id': '2ba1320d39d9464f3a92d583106aaae1', 'meta': {'name': '332_Sansa_Stark.txt'}}>,
             <Answer {'answer': 'Arya', 'type': 'extractive', 'score': 0.9470417499542236, 'context': "denotes weakness. She doesn't have cool swordplay skills like her sister Arya; she isn't a smart seductress like Margaery Tyrell or a fierce queen lik", 'offsets_in_document': [{'start': 2601, 'end': 2605}], 'offsets_in_context': [{'start': 73, 'end': 77}], 'document_id': '2ba1320d39d9464f3a92d583106aaae1', 'meta': {'name': '332_Sansa_Stark.txt'}}>,
             <Answer {'answer': 'Brienne', 'type': 'extractive', 'score': 0.82025712728500




In [19]:
from haystack.utils import print_answers
# Change `minimum` to `medium` or `all` to control the level of detail
print_answers(prediction, details="minimum")


Query: Who is the sister of Sansa?
Answers:
[   {   'answer': 'Arya',
        'context': 'n contrast to "her universally (and rightly) adored tomboy '
                   'little sister Arya", stating that Sansa "arguably gets a '
                   'disproportionate amount of fan hat'},
    {   'answer': 'Arya',
        'context': "denotes weakness. She doesn't have cool swordplay skills "
                   "like her sister Arya; she isn't a smart seductress like "
                   'Margaery Tyrell or a fierce queen lik'},
    {   'answer': 'Brienne',
        'context': 'etter is made public. Sansa confides in Littlefinger, who '
                   'suggests that Brienne, sworn to serve both sisters, would '
                   'intervene if Arya acted against Sa'},
    {   'answer': 'Myrcella',
        'context': 'sa is present when the royal family bids farewell to '
                   "Joffrey's sister, Myrcella, on her departure to Dorne to "
                   'form an alliance
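The answer list for this query contains "Arya" twice because the same answer string was extracted from two different passages. If only distinct answers matter, a small post-processing step can keep the best-scoring occurrence of each. The sketch below works on plain dictionaries as stand-ins for the answer objects, with scores copied from the output of this query; `best_unique_answers` and the `min_score` threshold are hypothetical additions, not part of Haystack.

```python
from typing import Dict, List


def best_unique_answers(answers: List[Dict], min_score: float = 0.0) -> List[Dict]:
    """Keep the highest-scoring entry per answer string, sorted by score."""
    best: Dict[str, Dict] = {}
    for ans in answers:
        if ans["score"] < min_score:
            continue  # drop low-confidence candidates
        seen = best.get(ans["answer"])
        if seen is None or ans["score"] > seen["score"]:
            best[ans["answer"]] = ans
    return sorted(best.values(), key=lambda a: a["score"], reverse=True)


answers = [
    {"answer": "Arya", "score": 0.9512491226196289},
    {"answer": "Arya", "score": 0.9470417499542236},
    {"answer": "Brienne", "score": 0.8202571272850037},
]
unique = best_unique_answers(answers, min_score=0.5)
print([a["answer"] for a in unique])
```

Raising `min_score` trades recall for precision, since lower-ranked candidates such as "Brienne" are plainly wrong for this question despite a fairly high score.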

In [20]:
print_answers(prediction, details="all")


Query: Who is the sister of Sansa?
Answers:
[   <Answer {'answer': 'Arya', 'type': 'extractive', 'score': 0.9512491226196289, 'context': 'n contrast to "her universally (and rightly) adored tomboy little sister Arya", stating that Sansa "arguably gets a disproportionate amount of fan hat', 'offsets_in_document': [{'start': 1402, 'end': 1406}], 'offsets_in_context': [{'start': 73, 'end': 77}], 'document_id': '2ba1320d39d9464f3a92d583106aaae1', 'meta': {'name': '332_Sansa_Stark.txt'}}>,
    <Answer {'answer': 'Arya', 'type': 'extractive', 'score': 0.9470417499542236, 'context': "denotes weakness. She doesn't have cool swordplay skills like her sister Arya; she isn't a smart seductress like Margaery Tyrell or a fierce queen lik", 'offsets_in_document': [{'start': 2601, 'end': 2605}], 'offsets_in_context': [{'start': 73, 'end': 77}], 'document_id': '2ba1320d39d9464f3a92d583106aaae1', 'meta': {'name': '332_Sansa_Stark.txt'}}>,
    <Answer {'answer': 'Brienne', 'type': 'extractive', 'score'

In [21]:
print_answers(prediction, details="medium")


Query: Who is the sister of Sansa?
Answers:
[   {   'answer': 'Arya',
        'context': 'n contrast to "her universally (and rightly) adored tomboy '
                   'little sister Arya", stating that Sansa "arguably gets a '
                   'disproportionate amount of fan hat',
        'score': 0.9512491226196289},
    {   'answer': 'Arya',
        'context': "denotes weakness. She doesn't have cool swordplay skills "
                   "like her sister Arya; she isn't a smart seductress like "
                   'Margaery Tyrell or a fierce queen lik',
        'score': 0.9470417499542236},
    {   'answer': 'Brienne',
        'context': 'etter is made public. Sansa confides in Littlefinger, who '
                   'suggests that Brienne, sworn to serve both sisters, would '
                   'intervene if Arya acted against Sa',
        'score': 0.8202571272850037},
    {   'answer': 'Myrcella',
        'context': 'sa is present when the royal family bids farewell to '
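Judging from the three outputs above, the `details` levels differ in which fields are printed: `minimum` shows the answer and its context, `medium` adds the score, and `all` shows the full answer object. The sketch below reproduces that selection on a plain dictionary. The field lists are inferred from the printed output rather than taken from Haystack's source, and `select_fields` is a hypothetical helper.

```python
from typing import Any, Dict

# Fields per detail level, inferred from the printed output in this notebook.
FIELDS_BY_DETAIL = {
    "minimum": ["answer", "context"],
    "medium": ["answer", "context", "score"],
}


def select_fields(answer: Dict[str, Any], details: str = "minimum") -> Dict[str, Any]:
    """Return the subset of fields shown at the given detail level."""
    if details == "all":
        return dict(answer)
    return {key: answer[key] for key in FIELDS_BY_DETAIL[details]}


example = {
    "answer": "Arya",
    "context": "denotes weakness. She doesn't have cool swordplay skills like her sister Arya",
    "score": 0.9470417499542236,
    "document_id": "2ba1320d39d9464f3a92d583106aaae1",
}
for level in ("minimum", "medium", "all"):
    print(level, sorted(select_fields(example, level)))
```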
    