<a href="https://colab.research.google.com/github/ontologist/viba-project/blob/main/HaystackWithPDFFiles.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A Scalable QA System Implementation

## Installing Haystack

To start, let's install Haystack with `pip`, directly from the main branch of its GitHub repository, together with the xpdf tools that provide `pdftotext`:

In [None]:
%%bash

pip install --upgrade pip
pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab,ocr]

# For Colab/linux based machines:
wget https://dl.xpdfreader.com/xpdf-tools-linux-4.04.tar.gz
tar -xvf xpdf-tools-linux-4.04.tar.gz && sudo cp xpdf-tools-linux-4.04/bin64/pdftotext /usr/local/bin


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pip
  Downloading pip-23.0.1-py3-none-any.whl (2.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 24.4 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.0.4
    Uninstalling pip-22.0.4:
      Successfully uninstalled pip-22.0.4
Successfully installed pip-23.0.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting farm-haystack[colab,ocr]
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-install-ffp78bi0/farm-haystack_ad5bcc86253341d18ab411d498d01047
  Resolved https://github.com/deepset-ai/haystack.git to commit c4b98fcccc61023e43fe8dec3fd35bb377f845e6
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to 

DEPRECATION: git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab,ocr] contains an egg fragment with a non-PEP 508 name pip 25.0 will enforce this behaviour change. A possible replacement is to use the req @ url syntax, and remove the egg fragment. Discussion can be found at https://github.com/pypa/pip/issues/11617
  Running command git clone --filter=blob:none --quiet https://github.com/deepset-ai/haystack.git /tmp/pip-install-ffp78bi0/farm-haystack_ad5bcc86253341d18ab411d498d01047

Set the logging level to INFO:

In [None]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

## Initializing the ElasticsearchDocumentStore

A DocumentStore stores the Documents that the question answering system uses to find answers to your questions. For testing purposes, you can use either the [`ElasticsearchDocumentStore`](https://docs.haystack.deepset.ai/reference/document-store-api#module-elasticsearch) or the [`FAISSDocumentStore`](https://docs.haystack.deepset.ai/reference/document-store-api#module-faiss); this notebook uses the former. [`ElasticsearchDocumentStore`](https://docs.haystack.deepset.ai/reference/document-store-api#module-elasticsearch) is a fast and scalable text-focused storage option. [`FAISSDocumentStore`](https://docs.haystack.deepset.ai/reference/document-store-api#module-faiss) is a DocumentStore for very large-scale, embedding-based dense Retrievers, such as DPR. Elasticsearch runs as a service independently of Haystack and persists even after the Haystack program has finished running. Haystack's DocumentStores support different types of external databases. For more information, see [DocumentStore](https://docs.haystack.deepset.ai/docs/document_store).

1.   Download, extract, and set the permissions for the Elasticsearch installation 

In [None]:
%%bash


wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.9.2

# wget -c https://repo.continuum.io/archive/Anaconda3-2019.03-Linux-x86_64.sh
# chmod +x Anaconda3-2019.03-Linux-x86_64.sh

2.   Start the Elasticsearch server

In [None]:
%%bash --bg

sudo -u daemon -- elasticsearch-7.9.2/bin/elasticsearch


3.    Wait 30 seconds for the server to fully start

In [None]:
import time
time.sleep(30)
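
A fixed 30-second pause can be too short on a slow machine and needlessly long on a fast one. As a hedged alternative, the sketch below polls a TCP port until it accepts connections. The helper name `wait_for_port` and its parameters are assumptions made for illustration and are not part of Haystack.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

For example, `wait_for_port("localhost", 9200)` could replace the fixed sleep, assuming Elasticsearch listens on its default HTTP port 9200. An open port does not guarantee that the cluster is ready to serve requests, so a short extra pause may still be needed.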

4.    Initialize the ElasticsearchDocumentStore:

In [None]:
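from haystack.utils import launch_es

# launch_es() starts Elasticsearch in a Docker container. It is only useful where
# Docker is available and the server was not already started manually (step 2).
launch_es()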



In [None]:
import os
from haystack.document_stores import ElasticsearchDocumentStore

# Get the host where Elasticsearch is running, default to localhost
host = os.environ.get("ELASTICSEARCH_HOST", "localhost")

document_store = ElasticsearchDocumentStore(
    host=host,
    username="",
    password="",
    index="document"
)

INFO:haystack.telemetry:Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by calling disable_telemetry() or by manually setting the environment variable  HAYSTACK_TELEMETRY_ENABLED as described for different operating systems on the documentation page. More information at https://docs.haystack.deepset.ai/docs/telemetry


## Indexing Documents with a Pipeline

The next step is adding the files to the DocumentStore. The indexing pipeline turns your files into Document objects and writes them to the DocumentStore. Our indexing pipeline will have two nodes: PDFToTextOCRConverter, which turns PDF files into Haystack Document objects, and PreProcessor, which cleans and splits the text within a Document.

Once we combine these nodes into a pipeline, the pipeline will ingest PDF file paths, preprocess them, and write them into the DocumentStore.



1. Download several curated laptop documents from the [`Sam's Club Website`](https://www.samsclub.com/), [`PC Magazine Review`](https://www.pcmag.com/reviews), and the [Intel Specs site](https://ark.intel.com/), among others. You can find them in https://github.com/ontologist/viba-project/blob/main/samsclublaptopswithreviews.zip. The cell below fetches the `Laptop-dataset.zip` archive from the same repository.

In [None]:
from haystack.utils import fetch_archive_from_http

doc_dir = "data/scalable_question_answering_system_from_PDFs"

fetch_archive_from_http(
    url="https://github.com/ontologist/viba-project/raw/main/Laptop-dataset.zip", ## Using the MobileInsighte repo gives an error
    output_dir=doc_dir
)

INFO:haystack.utils.import_utils:Fetching from https://github.com/ontologist/viba-project/raw/main/Laptop-dataset.zip to 'data/scalable_question_answering_system_from_PDFs'


True

2.   Install the PDF and OCR dependencies, then initialize the pipeline, PDFToTextOCRConverter, and PreProcessor:

In [None]:
%%bash

pip install --upgrade pip
#pip install farm-haystack[OCR]
pip install pdf2image
pip install pymilvus==1.1.2

# For Colab/linux based machines:
wget https://dl.xpdfreader.com/xpdf-tools-linux-4.04.tar.gz
tar -xvf xpdf-tools-linux-4.04.tar.gz && sudo cp xpdf-tools-linux-4.04/bin64/pdftotext /usr/local/bin

In [None]:
!apt-get install poppler-utils

In [None]:
!pip install pdf2image

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/

In [None]:
!pip install pdf2text

In [None]:
!apt-get install tesseract-ocr libtesseract-dev poppler-utils

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-510
Use 'apt autoremove' to remove it.
The following additional packages will be installed:
  libarchive-dev libleptonica-dev tesseract-ocr-eng tesseract-ocr-osd
The following NEW packages will be installed:
  libarchive-dev libleptonica-dev libtesseract-dev poppler-utils tesseract-ocr
  tesseract-ocr-eng tesseract-ocr-osd
0 upgraded, 7 newly installed, 0 to remove and 21 not upgraded.
Need to get 8,367 kB of archives.
After this operation, 32.7 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 libarchive-dev amd64 3.4.0-2ubuntu1.2 [491 kB]
Get:2 http://archive.ubuntu.com/ubuntu focal/universe amd64 libleptonica-dev amd64 1.79.0-1 [1,389 kB]
Get:3 http://archive.ubuntu.com/ubuntu focal/universe amd64 libtesseract-dev amd64 4.1.1-2build2 [1,46

In [None]:
#from haystack.nodes.file_converter import PDFToTextOCRConverter
from haystack import Pipeline
from haystack.nodes import TextConverter, PDFToTextOCRConverter, DocxToTextConverter, PreProcessor

from haystack.utils import convert_files_to_docs
#from pdf2image import convert_from_path

indexing_pipeline = Pipeline()
#text_converter = TextConverter()
#pdf_converter = convert_files_to_docs(dir_path=doc_dir)
pdf_converter = PDFToTextOCRConverter()
preprocessor = PreProcessor(
    clean_whitespace=True,
    clean_header_footer=True,
    clean_empty_lines=True,
    split_by="word",
    split_length=200,
    split_overlap=20,
    split_respect_sentence_boundary=True,
)

To learn more about the parameters of the PreProcessor, see the Usage section of the PreProcessor documentation. To understand why document splitting is important for your question answering system's performance, see the Document Length section of the Haystack optimization guide.
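
To make `split_length=200` and `split_overlap=20` more concrete, here is a simplified sketch of word-based splitting with overlap. It is an illustration only, not Haystack's implementation: it ignores sentence boundaries (unlike `split_respect_sentence_boundary=True`), and the function name `split_by_words` is hypothetical.

```python
from typing import List


def split_by_words(text: str, split_length: int = 200, split_overlap: int = 20) -> List[str]:
    """Split text into chunks of at most split_length words that share split_overlap words."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the current chunk already reaches the end of the text
    return chunks
```

With these defaults, a 500-word text yields three chunks, and the last 20 words of each chunk reappear at the start of the next one. The overlap reduces the chance that an answer is cut in half at a chunk boundary.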

3.  Add the nodes into an indexing pipeline. You should provide the `name` or `name`s of preceding nodes as the `inputs` argument. Note that in an indexing pipeline, the input to the first node is `File`.

In [None]:
import os

indexing_pipeline.add_node(component=pdf_converter, name="PDFToTextOCRConverter", inputs=["File"])
indexing_pipeline.add_node(component=preprocessor, name="PreProcessor", inputs=["PDFToTextOCRConverter"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["PreProcessor"])
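
To illustrate how the `inputs` names wire one node to the next, the following toy sketch chains plain functions in the same way. `ToyPipeline` is a hypothetical, linear stand-in written for this example; it is not how Haystack's `Pipeline` is implemented, and the lambda "nodes" in the usage note only mimic a converter and a preprocessor.

```python
from typing import Any, Callable, Dict, List


class ToyPipeline:
    """A tiny linear stand-in for a pipeline; not Haystack's implementation."""

    def __init__(self, root: str) -> None:
        self.root = root
        self.nodes: Dict[str, Callable[[Any], Any]] = {}
        self.inputs: Dict[str, str] = {}
        self.order: List[str] = []

    def add_node(self, component: Callable[[Any], Any], name: str, inputs: List[str]) -> None:
        source = inputs[0]  # this toy version supports a single input per node
        if source != self.root and source not in self.nodes:
            raise ValueError(f"unknown input node: {source}")
        self.nodes[name] = component
        self.inputs[name] = source
        self.order.append(name)

    def run(self, data: Any) -> Any:
        outputs = {self.root: data}
        for name in self.order:
            # each node receives the output of the node named in its inputs
            outputs[name] = self.nodes[name](outputs[self.inputs[name]])
        return outputs[self.order[-1]] if self.order else data
```

For instance, adding a node named `"Converter"` with `inputs=["File"]` and a second node with `inputs=["Converter"]` makes `run("a.pdf")` pass the file path through both functions in order. Naming an input that was never added raises a `ValueError`.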

In [None]:
!rm -rf data/scalable_question_answering_system_from_PDFs/__MACOSX/

4.   Run the indexing pipeline to write the text data into the DocumentStore:

In [None]:
files_to_index = [doc_dir + "/" + f for f in os.listdir(doc_dir)]
indexing_pipeline.run_batch(file_paths=files_to_index)

INFO:haystack.pipelines.base:It seems that an indexing Pipeline is run, so using the nodes' run method instead of run_batch.


Converting files:   0%|          | 0/1 [00:00<?, ?it/s]

ERROR:haystack.nodes.file_converter.pdf:File data/scalable_question_answering_system_from_PDFs/Laptop-dataset has an error:
Unable to get page count.
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table



Preprocessing:   0%|          | 0/1 [00:00<?, ?docs/s]

{'documents': [],
 'root_node': 'File',
 'params': {},
 'file_paths': ['data/scalable_question_answering_system_from_PDFs/Laptop-dataset'],
 'node_id': 'DocumentStore'}
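
The error output above suggests that `os.listdir(doc_dir)` returned the extracted `Laptop-dataset` folder rather than individual PDF files, so the converter had nothing it could read and `documents` came back empty. A minimal sketch of one way to collect the actual PDF paths, assuming the archive unpacks into one or more subfolders, is shown below. The helper name `collect_pdf_paths` is hypothetical.

```python
from pathlib import Path
from typing import List


def collect_pdf_paths(root: str) -> List[str]:
    """Return sorted paths of all .pdf files below root, skipping __MACOSX folders."""
    paths = []
    for path in Path(root).rglob("*"):
        if "__MACOSX" in path.parts:
            continue  # macOS archive metadata, not real documents
        if path.is_file() and path.suffix.lower() == ".pdf":
            paths.append(str(path))
    return sorted(paths)
```

Under that assumption, `files_to_index = collect_pdf_paths(doc_dir)` could be passed to the indexing pipeline in place of the `os.listdir` result.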

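Once the indexing run reports a non-empty `documents` list, the Documents are in the DocumentStore. (The run recorded above returned none, because `files_to_index` held a folder rather than PDF files.) Next, let's initialize the nodes we want to use in our query pipeline.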

## Initializing the Retriever

Our query pipeline is going to use a Retriever, so we need to initialize it. A Retriever sifts through all the Documents and returns only those that are relevant to the question. This tutorial uses the BM25Retriever. This is the recommended Retriever for a question answering system like the one we're creating. For more Retriever options, see [Retriever](https://docs.haystack.deepset.ai/docs/retriever).

In [None]:
from haystack.nodes import BM25Retriever

retriever = BM25Retriever(document_store=document_store)
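
To give a feel for how a BM25-style Retriever ranks Documents, here is a simplified, self-contained sketch of the Okapi BM25 formula. It is not the code that Elasticsearch or `BM25Retriever` runs. The function name `bm25_scores` and the whitespace tokenizer are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each document against the query with a basic Okapi BM25 formula."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Fall back to 1.0 so that all-empty documents do not cause a division by zero
    avg_len = sum(len(toks) for toks in tokenized) / n_docs or 1.0
    scores = []
    for toks in tokenized:
        counts = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for other in tokenized if term in other)  # document frequency
            if df == 0:
                continue
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = counts[term]
            # Longer-than-average documents are penalized through the b term
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Scoring the query "gaming laptop" against a few short sentences ranks a sentence containing both words above one containing only "laptop", and gives a score of zero to a sentence containing neither.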

## Initializing the Reader

Our query pipeline also needs a Reader, so we'll initialize it next. A Reader scans the texts it received from the Retriever and extracts the top answer candidates. Readers are based on powerful deep learning models but are much slower than Retrievers at processing the same amount of text. This tutorial uses a FARMReader with a base-sized RoBERTa question answering model called [`deepset/roberta-base-squad2`](https://huggingface.co/deepset/roberta-base-squad2). It's a good all-round model to start with. To find a model that's best for your use case, see [Models](https://docs.haystack.deepset.ai/docs/reader#models).

In [None]:
from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


Downloading (…)lve/main/config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

INFO:haystack.modeling.model.language_model: * LOADING MODEL: 'deepset/roberta-base-squad2' (Roberta)


Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/496M [00:00<?, ?B/s]

INFO:haystack.modeling.model.language_model:Auto-detected model language: english
INFO:haystack.modeling.model.language_model:Loaded 'deepset/roberta-base-squad2' (Roberta model) from model hub.


Downloading (…)okenizer_config.json:   0%|          | 0.00/79.0 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/772 [00:00<?, ?B/s]

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


## Creating the Retriever-Reader Pipeline

You can combine the Reader and Retriever in a querying pipeline using the `Pipeline` class. The combination of the two speeds up processing because the Reader only processes the Documents that it received from the Retriever. 

Initialize the `Pipeline` object and add the Retriever and Reader as nodes. You should provide the `name` or `name`s of preceding nodes as the input argument. Note that in a querying pipeline, the input to the first node is `Query`.

In [None]:
from haystack import Pipeline

querying_pipeline = Pipeline()
querying_pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
querying_pipeline.add_node(component=reader, name="Reader", inputs=["Retriever"])

## Asking a Question

1. Use the pipeline's `run()` method to ask a question. The `query` argument is where you type your question. Additionally, you can set the number of documents you want the Retriever to return and the number of answers you want the Reader to return using each node's `top_k` parameter. To learn more about setting arguments, see [Arguments](https://docs.haystack.deepset.ai/docs/pipelines#arguments). To understand the importance of the `top_k` parameter, see [Choosing the Right top-k Values](https://docs.haystack.deepset.ai/docs/optimization#choosing-the-right-top-k-values).


In [None]:
prediction = querying_pipeline.run(
    query="What is a good game computer?",
    # query="What is the most cost-effective laptop?",
    # query="What is a good student laptop?",
    # query="What laptop has the best CPU?",
    # query="What laptop has the longest battery life?",
    params={
        "Retriever": {"top_k": 10},
        "Reader": {"top_k": 5}
    }
)

Here are some questions you could try out:
- What is a good student laptop?
- What is the most cost-effective laptop?
- What laptop has the best CPU?

2. Print out the answers the pipeline returns:

In [None]:
from pprint import pprint

pprint(prediction)

{'answers': [],
 'documents': [],
 'node_id': 'Reader',
 'params': {'Reader': {'top_k': 5}, 'Retriever': {'top_k': 10}},
 'query': 'What is a good game computer?',
 'root_node': 'Query'}


3. Simplify the printed answers:

In [None]:
from haystack.utils import print_answers

print_answers(
    prediction,
    details="minimum" ## Choose from `minimum`, `medium` and `all`
)


Query: What is a good game computer?
Answers:
[]
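
The empty list above means the Reader returned no answers, which is expected when the DocumentStore holds no Documents. As a hedged sketch, a small helper can turn a prediction into plain `(answer, score)` pairs and makes the empty case easy to check. The helper name `summarize_answers` is hypothetical, and the `SimpleNamespace` objects are stand-ins for Haystack `Answer` objects, which expose `answer` and `score` attributes.

```python
from types import SimpleNamespace
from typing import Any, Dict, List, Tuple


def summarize_answers(prediction: Dict[str, Any]) -> List[Tuple[str, float]]:
    """Return (answer text, score) pairs, or an empty list if nothing was found."""
    summary = []
    for ans in prediction.get("answers", []):
        # Read the attributes defensively so that stand-in objects also work
        text = getattr(ans, "answer", None)
        score = getattr(ans, "score", None)
        if text:
            summary.append((text, score))
    return summary


# Stand-in data for illustration only; real runs return haystack Answer objects.
fake_prediction = {"answers": [SimpleNamespace(answer="a hypothetical laptop", score=0.9)]}
empty_prediction = {"answers": [], "documents": []}
```

Calling `summarize_answers(prediction)` on the result recorded above would return an empty list, which is a quick signal to recheck the indexing step before tuning the query.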
