<a href="https://colab.research.google.com/github/ontologist/viba-project/blob/main/ScalableQASystemSample.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A Scalable QA System Implementation

## Install Haystack

To start, let's install the latest release of Haystack with `pip`:

In [2]:
%%bash

pip install --upgrade pip
pip install farm-haystack[colab]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pip
  Downloading pip-23.0.1-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.0.4
    Uninstalling pip-22.0.4:
      Successfully uninstalled pip-22.0.4
Successfully installed pip-23.0.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting farm-haystack[colab]
  Downloading farm_haystack-1.13.2-py3-none-any.whl (620 kB)
Collecting mlflow
  Downloading mlflow-2.1.1-py3-none-any.whl (16.7 MB)



## Set the Logging Level

Set the logging level to INFO for Haystack while keeping all other loggers at WARNING:

In [1]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)
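The configuration above relies on Python's logger hierarchy: a level set on the `haystack` logger also applies to its child loggers. The following minimal sketch illustrates this with the standard library only; the child logger name `haystack.nodes` is used purely as an example.

```python
import logging

logging.basicConfig(level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

# Child loggers such as "haystack.nodes" inherit the INFO level from "haystack".
child_level = logging.getLogger("haystack.nodes").getEffectiveLevel()
print(logging.getLevelName(child_level))
```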

## Initializing the ElasticsearchDocumentStore

A DocumentStore stores the Documents that the question answering system uses to find answers to your questions. Haystack offers several options, including the [`ElasticsearchDocumentStore`](https://docs.haystack.deepset.ai/reference/document-store-api#module-elasticsearch) and the [`FAISSDocumentStore`](https://docs.haystack.deepset.ai/reference/document-store-api#module-faiss). The `ElasticsearchDocumentStore` is a fast and scalable text-focused storage option, while the `FAISSDocumentStore` is designed for very large-scale, embedding-based dense Retrievers such as DPR. For testing purposes, this notebook uses the `ElasticsearchDocumentStore`, which connects to an Elasticsearch service that runs independently of Haystack and persists even after the Haystack program has finished running. DocumentStores support different types of external databases. For more information, see [DocumentStore](https://docs.haystack.deepset.ai/docs/document_store).

1.   Download, extract, and set the permissions for the Elasticsearch installation (Elasticsearch refuses to run as root, so the files are handed over to the `daemon` user):

In [3]:
%%bash


wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.9.2

# wget -c https://repo.continuum.io/archive/Anaconda3-2019.03-Linux-x86_64.sh
# chmod +x Anaconda3-2019.03-Linux-x86_64.sh

2.   Start the Elasticsearch server as a background process:

In [4]:
%%bash --bg

sudo -u daemon -- elasticsearch-7.9.2/bin/elasticsearch


3.    Wait 30 seconds for the server to fully start

In [5]:
import time
time.sleep(30)
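A fixed 30-second sleep works in Colab, but it can be too long or too short on other machines. As an alternative, here is a minimal sketch that polls a readiness check until it succeeds or a timeout expires. The helper names `wait_until` and `elasticsearch_is_up` are hypothetical, and the sketch assumes that Elasticsearch answers plain HTTP on `localhost:9200` (its default port) without authentication.

```python
import time
import urllib.error
import urllib.request


def wait_until(check, timeout=60.0, interval=1.0):
    """Call check() repeatedly until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def elasticsearch_is_up(url="http://localhost:9200"):
    """Hypothetical readiness check: True once the server answers an HTTP request."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

Calling `wait_until(elasticsearch_is_up, timeout=60)` would then return `True` as soon as the server responds, or `False` if it is still unreachable after 60 seconds.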

4.    Initialize the ElasticsearchDocumentStore:

In [6]:
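from haystack.utils import launch_es

# launch_es() tries to start Elasticsearch in a Docker container.
# It is redundant here because the server was already started in steps 1-3.
launch_es()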



In [7]:
import os
from haystack.document_stores import ElasticsearchDocumentStore

# Get the host where Elasticsearch is running, default to localhost
host = os.environ.get("ELASTICSEARCH_HOST", "localhost")

document_store = ElasticsearchDocumentStore(
    host=host,
    username="",
    password="",
    index="document"
)

INFO:haystack.telemetry:Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by calling disable_telemetry() or by manually setting the environment variable  HAYSTACK_TELEMETRY_ENABLED as described for different operating systems on the documentation page. More information at https://docs.haystack.deepset.ai/docs/telemetry


## Indexing Documents with a Pipeline

The next step is adding the files to the DocumentStore. The indexing pipeline turns your files into Document objects and writes them to the DocumentStore. Our indexing pipeline will have two processing nodes: `TextConverter`, which turns .txt files into Haystack Document objects, and `PreProcessor`, which cleans and splits the text within a Document. The DocumentStore itself is added as the final node so that the processed Documents are written to it.

Once we combine these nodes into a pipeline, the pipeline will ingest .txt file paths, preprocess them, and write them into the DocumentStore.



1. Download a curated set of laptop documents gathered from the [Sam's Club website](https://www.samsclub.com/), [PC Magazine reviews](https://www.pcmag.com/reviews), and the [Intel specs site](https://ark.intel.com/), among others. The archive is available at https://github.com/ontologist/viba-project/blob/main/samsclublaptopswithreviews.zip.

In [8]:
from haystack.utils import fetch_archive_from_http

doc_dir = "data/build_a_scalable_question_answering_system"

fetch_archive_from_http(
    url="https://github.com/ontologist/viba-project/raw/main/samsclublaptopswithreviews.zip", 
    output_dir=doc_dir
)

INFO:haystack.utils.import_utils:Fetching from https://github.com/ontologist/viba-project/raw/main/samsclublaptopswithreviews.zip to 'data/build_a_scalable_question_answering_system'


True

2.   Initialize the pipeline, TextConverter, and PreProcessor:

In [9]:
from haystack import Pipeline
from haystack.nodes import TextConverter, PreProcessor

indexing_pipeline = Pipeline()
text_converter = TextConverter()
preprocessor = PreProcessor(
    clean_whitespace=True,
    clean_header_footer=True,
    clean_empty_lines=True,
    split_by="word",
    split_length=200,
    split_overlap=20,
    split_respect_sentence_boundary=True,
)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


To learn more about the parameters of the `PreProcessor`, see the Usage section of the PreProcessor documentation. To understand why document splitting is important for your question answering system's performance, see the Document Length section of the Haystack optimization guide.
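To make the effect of `split_length` and `split_overlap` concrete, here is a simplified sketch of word-based splitting with overlap. It is not Haystack's implementation: the function name `split_words` is hypothetical, and the sketch ignores `split_respect_sentence_boundary` and all cleaning options.

```python
def split_words(text, split_length=200, split_overlap=20):
    """Split text into chunks of at most `split_length` words, where
    consecutive chunks share `split_overlap` words."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current chunk has reached the end of the text.
        if start + split_length >= len(words):
            break
    return chunks


# Ten words, chunks of four words, one word shared between neighbours.
sample = " ".join(f"w{i}" for i in range(10))
chunks = split_words(sample, split_length=4, split_overlap=1)
print(chunks)
```

The overlap means that a sentence cut at a chunk boundary still appears with some of its context in the next chunk, which gives the Reader a better chance of finding an answer that sits near a boundary.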

3.  Add the nodes to the indexing pipeline. For each node, provide the name or names of the preceding nodes as the `inputs` argument. Note that in an indexing pipeline, the input to the first node is `File`.

In [10]:
import os

indexing_pipeline.add_node(component=text_converter, name="TextConverter", inputs=["File"])
indexing_pipeline.add_node(component=preprocessor, name="PreProcessor", inputs=["TextConverter"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["PreProcessor"])
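The `inputs` argument is what chains the nodes together: each node consumes the output of the node it names. The toy class below sketches that idea for a strictly linear chain. `ToyPipeline` is a hypothetical stand-in written for illustration only; it is not how Haystack's `Pipeline` is implemented and it supports only one input per node.

```python
class ToyPipeline:
    """Illustrative stand-in for a linear pipeline; not Haystack's Pipeline."""

    def __init__(self):
        self.nodes = []  # (name, component, inputs) in insertion order

    def add_node(self, component, name, inputs):
        # An input must be the root "File" or a node that was added earlier.
        known = {"File"} | {node_name for node_name, _, _ in self.nodes}
        missing = [i for i in inputs if i not in known]
        if missing:
            raise ValueError(f"Unknown input node(s): {missing}")
        self.nodes.append((name, component, inputs))

    def run(self, data):
        outputs = {"File": data}
        for name, component, inputs in self.nodes:
            # Feed each node the output of the (single) node it lists as input.
            outputs[name] = component(outputs[inputs[0]])
        return outputs[self.nodes[-1][0]]


pipeline = ToyPipeline()
pipeline.add_node(component=str.strip, name="Cleaner", inputs=["File"])
pipeline.add_node(component=str.split, name="Splitter", inputs=["Cleaner"])
result = pipeline.run("  a scalable QA system  ")
print(result)
```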

In [12]:
# Remove the macOS metadata folder contained in the archive so that it is not indexed
!rm -rf data/build_a_scalable_question_answering_system/__MACOSX/

4.   Run the indexing pipeline to write the text data into the DocumentStore:

In [13]:
# Build the full path of every entry in the download directory
files_to_index = [os.path.join(doc_dir, f) for f in os.listdir(doc_dir)]
indexing_pipeline.run_batch(file_paths=files_to_index)

INFO:haystack.pipelines.base:It seems that an indexing Pipeline is run, so using the nodes' run method instead of run_batch.


Converting files:   0%|          | 0/25 [00:00<?, ?it/s]

Preprocessing:   0%|          | 0/25 [00:00<?, ?docs/s]



{'documents': [<Document: {'content': "\ufeffARK | Compare Intel® Products\n02/20/2023 07:04:55 PM\n,Intel® Pentium® Silver N5000 Processor  (4M Cache- up to 2.70 GHz)\n\nEssentials\nProduct Collection,Intel® Pentium® Silver Processor Series\nCode Name,Products formerly Gemini Lake\nVertical Segment,Mobile\nProcessor Number,N5000\nLithography,14 nm\n\nCPU Specifications\nTotal Cores,4\nTotal Threads,4\nBurst Frequency,2.70 GHz\nProcessor Base Frequency,1.10 GHz\nCache,4 MB\nScenario Design Power (SDP),4.8 W\nTDP,6 W\n\nSupplemental Information\nMarketing Status,Discontinued\nLaunch Date,Q4'17\nEmbedded Options Available,No\nProduct Brief,View now\n\nMemory Specifications\nMax Memory Size (dependent on memory type),8 GB\nMemory Types,DDR4/LPDDR4 upto 2400 MT/s\nMax # of Memory Channels,2\nECC Memory Supported   ‡,No\n\nProcessor Graphics\nProcessor Graphics ‡,Intel® UHD Graphics 605\nGraphics Base Frequency,200 MHz\nGraphics Burst Frequency,750 MHz\nGraphics Video Max Memory,8

Now that the Documents are in the DocumentStore, let's initialize the nodes we want to use in our query pipeline.

## Initializing the Retriever

Our query pipeline is going to use a Retriever, so we need to initialize it. A Retriever sifts through all the Documents and returns only those that are relevant to the question. This tutorial uses the BM25Retriever. This is the recommended Retriever for a question answering system like the one we're creating. For more Retriever options, see [Retriever](https://docs.haystack.deepset.ai/docs/retriever).

In [14]:
from haystack.nodes import BM25Retriever

retriever = BM25Retriever(document_store=document_store)
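To give an intuition for what a BM25 ranking does, the sketch below scores a few short example sentences against a query using a textbook form of the BM25 formula. It is an illustration under stated assumptions, not the code that runs in this notebook: the function name `bm25_scores`, the whitespace tokenization, the example sentences, and the parameter values `k1=1.2` and `b=0.75` are all choices made for this sketch, and the exact formula used by Elasticsearch may differ in detail.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Return one BM25 score per document for a whitespace-tokenized query."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates with k1 and is normalized by document length.
            norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
            score += idf * norm
        scores.append(score)
    return scores


docs = [
    "the laptop has a long battery life",
    "the keyboard feels good to type on",
    "battery life is short on this gaming laptop",
]
scores = bm25_scores("battery life", docs)
# Document indices ordered from most to least relevant.
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranked)
```

Because the scoring relies only on term statistics, it needs no model inference, which is why keyword retrieval is fast enough to run over the whole DocumentStore.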

## Initializing the Reader

Our query pipeline also needs a Reader, so we'll initialize it next. A Reader scans the texts it receives from the Retriever and extracts the top answer candidates. Readers are based on powerful deep learning models but are much slower than Retrievers at processing the same amount of text. This tutorial uses a FARMReader with a base-sized RoBERTa question answering model called [`deepset/roberta-base-squad2`](https://huggingface.co/deepset/roberta-base-squad2). It's a good all-round model to start with. To find a model that's best for your use case, see [Models](https://docs.haystack.deepset.ai/docs/reader#models).

In [15]:
from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


Downloading (…)lve/main/config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

INFO:haystack.modeling.model.language_model: * LOADING MODEL: 'deepset/roberta-base-squad2' (Roberta)


Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/496M [00:00<?, ?B/s]

INFO:haystack.modeling.model.language_model:Auto-detected model language: english
INFO:haystack.modeling.model.language_model:Loaded 'deepset/roberta-base-squad2' (Roberta model) from model hub.


Downloading (…)okenizer_config.json:   0%|          | 0.00/79.0 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/772 [00:00<?, ?B/s]

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


## Creating the Retriever-Reader Pipeline

You can combine the Reader and Retriever in a querying pipeline using the `Pipeline` class. The combination of the two speeds up processing because the Reader only processes the Documents that it received from the Retriever. 

Initialize the `Pipeline` object and add the Retriever and Reader as nodes. For each node, provide the name or names of the preceding nodes as the `inputs` argument. Note that in a querying pipeline, the input to the first node is `Query`.

In [16]:
from haystack import Pipeline

querying_pipeline = Pipeline()
querying_pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
querying_pipeline.add_node(component=reader, name="Reader", inputs=["Retriever"])
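The speed-up from combining the two nodes can be illustrated with a toy example: a cheap filter narrows 100 documents down to `top_k` candidates, so the slow step runs only 10 times. The functions `cheap_retrieve` and `expensive_read` and the example documents are hypothetical stand-ins, not Haystack components.

```python
def cheap_retrieve(query, documents, top_k):
    """Stand-in retriever: rank documents by how many query words they share."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


reader_calls = 0


def expensive_read(query, document):
    """Stand-in reader: counts its invocations instead of running a model."""
    global reader_calls
    reader_calls += 1
    return document


documents = [f"filler text number {i}" for i in range(98)]
documents += ["this laptop has great battery life", "battery life is average"]

query = "battery life"
candidates = cheap_retrieve(query, documents, top_k=10)
answers = [expensive_read(query, doc) for doc in candidates]
print(len(documents), "documents,", reader_calls, "reader calls")
```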

## Asking a Question

1. Use the pipeline's `run()` method to ask a question. The `query` argument is where you type your question. Additionally, you can set the number of documents you want the Reader and Retriever to return using the `top_k` parameter. To learn more about setting arguments, see [Arguments](https://docs.haystack.deepset.ai/docs/pipelines#arguments). To understand the importance of the `top_k` parameter, see [Choosing the Right top-k Values](https://docs.haystack.deepset.ai/docs/optimization#choosing-the-right-top-k-values).


In [23]:
prediction = querying_pipeline.run(
    # query="What is a good game computer?",
    # query="What is the most cost-effective laptop?",
    # query="What is a good student laptop?",
    # query="What laptop has the best CPU?",
    query="What laptop has the longest battery life?",
    params={
        "Retriever": {"top_k": 10},
        "Reader": {"top_k": 5}
    }
)

INFO:haystack.schema:Setting the ID manually. This might cause a mismatch with the ID that would be generated from the document content and id_hash_keys value.

Inferencing Samples:   0%|          | 0/1 [00:00<?, ? Batches/s]

Here are some questions you could try out:
- What is a good student laptop?
- What is the most cost-effective laptop?
- What laptop has the best CPU?

2. Print out the answers the pipeline returns:

In [24]:
from pprint import pprint

pprint(prediction)

{'answers': [<Answer {'answer': 'Acer Aspire 5', 'type': 'extractive', 'score': 0.7463506460189819, 'context': 'n can stay on\n\nfor an impressive 11 hours, scoring second behind the Acer Aspire 5 in battery life, which has a\n\nsmaller 15-inch display.\n\nVerdict: A ', 'offsets_in_document': [{'start': 717, 'end': 730}], 'offsets_in_context': [{'start': 69, 'end': 82}], 'document_id': '4b93e32d4872ba3f21d88398f0dccb91', 'meta': {'_split_id': 12}}>,
             <Answer {'answer': 'Gaming F17', 'type': 'extractive', 'score': 0.6970942616462708, 'context': " (candelas per\n\nsquare meter).\n\nI'm not exactly disappointed with the Gaming F17's battery life, because expectations aren't very\n\nhigh for big-screen", 'offsets_in_document': [{'start': 1058, 'end': 1068}], 'offsets_in_context': [{'start': 70, 'end': 80}], 'document_id': '8349cca596eb21ab19919792f6b8fae1', 'meta': {'_split_id': 12}}>,
             <Answer {'answer': 'HP', 'type': 'extractive', 'score': 0.5976345539093018, 'con
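The raw output above is dense. If you only need the answer text and its score, you can pick those two fields out of each answer. The sketch below uses `SimpleNamespace` stand-ins filled with the values shown above so that it runs without Haystack; the function name `confident_answers` and the `min_score` threshold of 0.6 are assumptions made for illustration.

```python
from types import SimpleNamespace

# Stand-ins for Haystack Answer objects, using values from the output above.
answers = [
    SimpleNamespace(answer="Acer Aspire 5", score=0.7463506460189819),
    SimpleNamespace(answer="Gaming F17", score=0.6970942616462708),
    SimpleNamespace(answer="HP", score=0.5976345539093018),
]


def confident_answers(answers, min_score=0.6):
    """Return (answer text, rounded score) pairs at or above `min_score`."""
    return [(a.answer, round(a.score, 2)) for a in answers if a.score >= min_score]


top_answers = confident_answers(answers)
print(top_answers)
```

With the real pipeline result, the same function could presumably be applied to `prediction["answers"]`, assuming each answer exposes `answer` and `score` attributes as the printed output suggests.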

3. Simplify the printed answers:

In [33]:
from haystack.utils import print_answers

print_answers(
    prediction,
    details="minimum"  # Choose from `minimum`, `medium` and `all`
)


Query: What laptop has the longest battery life?
Answers:
[   {   'answer': 'Acer Aspire 5',
        'context': 'n can stay on\n'
                   '\n'
                   'for an impressive 11 hours, scoring second behind the Acer '
                   'Aspire 5 in battery life, which has a\n'
                   '\n'
                   'smaller 15-inch display.\n'
                   '\n'
                   'Verdict: A '},
    {   'answer': 'Gaming F17',
        'context': ' (candelas per\n'
                   '\n'
                   'square meter).\n'
                   '\n'
                   "I'm not exactly disappointed with the Gaming F17's battery "
                   "life, because expectations aren't very\n"
                   '\n'
                   'high for big-screen'},
    {   'answer': 'HP',
        'context': "On the other hand, the HP's keyboard feels much better to "
                   'type on, its speakers are better, and it has a better '
                   'select