In [1]:
# Install the latest release of Haystack in your own environment 
#! pip install farm-haystack

# Install the latest master of Haystack
!pip install grpcio-tools==1.34.1
!pip install git+https://github.com/deepset-ai/haystack.git


Collecting grpcio-tools==1.34.1
  Downloading grpcio_tools-1.34.1-cp37-cp37m-manylinux2014_x86_64.whl (2.5 MB)
Installing collected packages: grpcio-tools
Successfully installed grpcio-tools-1.34.1
Collecting git+https://github.com/deepset-ai/haystack.git
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-req-build-xakbjjvn
  Running command git clone -q https://github.com/deepset-ai/haystack.git /tmp/pip-req-build-xakbjjvn
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
Collecting mlflow<=1.13.1
  Downloading mlflow-1.13.1-py3-none-any.whl (14.1 MB)
Collecting transformers==4.13.0
  Downloading transformers-4.13.0-py3-none-any.whl (3.3 MB)
Collecting fastapi
  Downloading fastapi-0.70.1-py3-none-any.whl (51 kB

In [1]:
from haystack.utils import clean_wiki_text, convert_files_to_dicts, fetch_archive_from_http, print_answers
from haystack.nodes import FARMReader, TransformersReader

## Document Store

Haystack finds answers to queries within the documents stored in a `DocumentStore`. The current implementations of `DocumentStore` include `ElasticsearchDocumentStore`, `FAISSDocumentStore`, `SQLDocumentStore`, and `InMemoryDocumentStore`.

**Here:** We recommend Elasticsearch as it comes preloaded with features like [full-text queries](https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html), [BM25 retrieval](https://www.elastic.co/elasticon/conf/2016/sf/improved-text-scoring-with-bm25), and [vector storage for text embeddings](https://www.elastic.co/guide/en/elasticsearch/reference/7.6/dense-vector.html).


### Start an Elasticsearch server
You can start Elasticsearch on your local machine using Docker. If Docker is not readily available in your environment (e.g. in Colab notebooks), you can manually download the Elasticsearch release archive and run it directly.

In [24]:
# In Colab / No Docker environments: Start Elasticsearch from source
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.9.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30
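
A fixed 30-second sleep can be too short on a slow machine and wastes time on a fast one. A minimal sketch of a readiness check follows, assuming Elasticsearch listens on its default HTTP port 9200 on `localhost`. The helper names `es_is_up` and `wait_until_ready` are made up for this example and are not part of Haystack.

```python
import time
import urllib.error
import urllib.request


def es_is_up(url="http://localhost:9200", timeout=2.0):
    """Return True if an HTTP GET to `url` succeeds (assumed default ES address)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused or timed out: the server is not ready yet.
        return False


def wait_until_ready(probe=es_is_up, max_wait=60.0, interval=2.0, sleep=time.sleep):
    """Call `probe` until it returns True or `max_wait` seconds have been slept."""
    waited = 0.0
    while True:
        if probe():
            return True
        if waited >= max_wait:
            return False
        sleep(interval)
        waited += interval
```

Calling `wait_until_ready()` in place of `! sleep 30` returns as soon as the server answers, or `False` if it never does within the limit.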

In [25]:
# Connect to Elasticsearch
from haystack.document_stores import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

# ***ToPDF***

In [4]:
# Here are the imports we need
from haystack.nodes import TextConverter, PDFToTextConverter, DocxToTextConverter, PreProcessor
from haystack.utils import convert_files_to_dicts, fetch_archive_from_http

The dataset used to develop this question-answering system consists of two openly available O'Reilly programming books: [Programming the Be Operating System](https://www.oreilly.com/openbook/beosprog/book/) and [Asterisk: The Future of Telephony](http://cdn.oreillystatic.com/books/9780596510480.pdf). The individual chapters were downloaded as PDF files from the O'Reilly website. Both books are made available under a Creative Commons license. We have created a combined PDF, which is available on Google Drive [here](https://drive.google.com/drive/u/2/folders/1-IAona91wwNKA0_Wm0ux5yCtH15m8Yv2). If you are using Colab, upload this PDF to your environment.

# **Converters**

Haystack's converter classes turn files on your computer into documents that the Haystack pipeline can process. There are file converters for txt, pdf, and docx files, as well as a converter powered by Apache Tika. The `valid_languages` parameter does not translate files into the target language; it checks whether the conversion worked as expected. When converting PDFs, try changing the encoding to UTF-8 if the conversion quality is poor.

In [5]:
!apt-get install poppler-utils 

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  poppler-utils
0 upgraded, 1 newly installed, 0 to remove and 37 not upgraded.
Need to get 154 kB of archives.
After this operation, 613 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 poppler-utils amd64 0.62.0-2ubuntu2.12 [154 kB]
Fetched 154 kB in 1s (230 kB/s)
Selecting previously unselected package poppler-utils.
(Reading database ... 155225 files and directories currently installed.)
Preparing to unpack .../poppler-utils_0.62.0-2ubuntu2.12_amd64.deb ...
Unpacking poppler-utils (0.62.0-2ubuntu2.12) ...
Setting up poppler-utils (0.62.0-2ubuntu2.12) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...


In [6]:
# Use the PDF converter to read PDF 
converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["en"])
doc_pdf = converter.convert(file_path="/content/combined_books.pdf", meta=None)[0]

In [7]:
doc_pdf

 'content_type': 'text',
 'meta': None}

In [8]:
content_in_doc_pdf = doc_pdf['content']
content_in_doc_pdf



In [9]:
print("Num words: ", len(content_in_doc_pdf.split(" ")))

Num words:  263456
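
The count above splits on single spaces only, so it is an approximation: words separated by a newline are counted as one token. A small sketch of the difference, using a fragment that appears in the extracted text:

```python
sample = "Virtual Memory\nTo accommodate"

# split(" ") breaks only on single spaces, so "Memory\nTo" stays one token.
space_tokens = sample.split(" ")
print(len(space_tokens))  # 3

# split() breaks on any run of whitespace, including newlines.
whitespace_tokens = sample.split()
print(len(whitespace_tokens))  # 4
```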


## Preprocessing of documents
Preprocessing splits the document into smaller documents. These smaller documents are then written to the Elasticsearch document store.

In [10]:
# This is a default usage of the PreProcessor.
# Here, it cleans up consecutive whitespace
# and splits a single large document into smaller documents.
# Each document is up to 100 words long, and document breaks cannot fall in the middle of sentences.
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=100,
    split_respect_sentence_boundary=True
)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [11]:
dict1 = preprocessor.process([doc_pdf])
print(f"n_docs_input: 1\nn_docs_output: {len(dict1)}")
dict1[:2]

100%|██████████| 1/1 [00:00<00:00,  1.48docs/s]

n_docs_input: 1
n_docs_output: 3043





[{'content': "\n\nAsterisk : The Future of Telephony\nTM\n\nOther resources from O'Reilly\nRelated titles\n\noreilly.com\n\nEthernet: The Definitive\nGuide\nSwitching to VoIP\nT1: A Survival Guide\n\nTCP/IP Network\nAdministration\nVoIP HacksTM\n\noreilly.com is more than a complete catalog of O'Reilly books. You'll also find links to news, events, articles, weblogs, sample\nchapters, and code examples. oreillynet.com is the essential portal for developers interested in\nopen and emerging technologies, including new platforms, programming languages, and operating systems. Conferences\n\nO'Reilly brings diverse innovators together to nurture the ideas\nthat spark revolutionary industries. We specialize in documenting the latest tools and systems, translating the innovator's\nknowledge into useful skills for those in the trenches. Please\nvisit conferences.oreilly.com for our upcoming events.",
  'content_type': 'text',
  'meta': {'_split_id': 0}},
 {'content': 'Safari Bookshelf (safari.

In [12]:
print("Num Words in first content: ", len(dict1[0]['content'].split(" ")))

Num Words in first content:  99
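
A 99-word first chunk is consistent with `split_length=100` combined with `split_respect_sentence_boundary=True`: whole sentences are added until the next one would exceed the limit. The sketch below illustrates that idea in plain Python. It is not Haystack's implementation; it assumes a naive regex sentence splitter where the `PreProcessor` relies on NLTK's punkt tokenizer (see the download log above), and both `split_by_words` and the sample text are made up for illustration.

```python
import re


def split_by_words(text, split_length=100):
    """Group whole sentences into chunks of at most `split_length` words.

    A single sentence longer than `split_length` is kept as its own chunk.
    """
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Close the current chunk if this sentence would push it over the limit.
        if current and current_len + n_words > split_length:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "Virtual memory extends RAM. It uses disk space. Programs see one address space."
for chunk in split_by_words(sample, split_length=8):
    print(len(chunk.split()), chunk)
```

With a limit of 8 words, the first two sentences (4 words each) share a chunk and the third starts a new one.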


## Initialize Retriever, Reader, & Pipeline

### Retriever

Retrievers narrow down the scope for the Reader to smaller units of text where a given question could be answered.
They use simple but fast algorithms.

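**Here:** We use the `TfidfRetriever`, which ranks the stored documents by TF-IDF similarity to the query. Elasticsearch's default BM25 ranking is available through `ElasticsearchRetriever`.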

**Alternatives:**

- Customize the `ElasticsearchRetriever` with custom queries (e.g. boosting) and filters
- Use `TfidfRetriever` in combination with a SQL or InMemory Document store for simple prototyping and debugging
- Use `EmbeddingRetriever` to find candidate documents based on the similarity of embeddings (e.g. created via Sentence-BERT)
- Use `DensePassageRetriever` to use different embedding models for passage and query (see Tutorial 6)
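
The code in this notebook instantiates a `TfidfRetriever`. As a rough illustration of what TF-IDF ranking does, here is a minimal pure-Python sketch. It is not the retriever's actual code: the smoothed IDF formula is one common variant chosen as an assumption, and `tfidf_rank` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def tfidf_rank(query, documents, top_k=2):
    """Rank `documents` against `query` by cosine similarity of TF-IDF vectors."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (one common variant).
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    def vectorize(tokens):
        counts = Counter(tokens)
        # Terms unseen in the corpus carry no weight.
        return {t: c * idf[t] for t, c in counts.items() if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
        return dot / norm if norm else 0.0

    query_vec = vectorize(tokenize(query))
    scored = [(cosine(query_vec, vectorize(tokens)), i) for i, tokens in enumerate(tokenized)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(documents[i], score) for score, i in scored[:top_k]]


toy_docs = [
    "virtual memory uses disk space",
    "asterisk is a telephony platform",
    "a window draws a view",
]
best_doc, best_score = tfidf_rank("virtual memory", toy_docs, top_k=1)[0]
print(best_doc)
```

Only the first toy document shares terms with the query, so it is the one returned.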

In [26]:
from haystack.nodes import ElasticsearchRetriever, TfidfRetriever
retriever = TfidfRetriever(document_store=document_store)

INFO - haystack.nodes.retriever.sparse -  Found 8978 candidate paragraphs from 2906 docs in DB


In [27]:
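# Now, let's write the dicts containing documents to our DB.
# Note: the TfidfRetriever reads the documents that are in the store when it is
# created (see the log line above), so on a fresh run create the retriever after this cell.
document_store.write_documents(dict1)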

In [28]:
retriever

<haystack.nodes.retriever.sparse.TfidfRetriever at 0x7f80d5723890>

### Reader

A Reader scans the texts returned by the retriever in detail and extracts the k best answers. Readers are based
on powerful but slower deep learning models.

Haystack currently supports Readers based on the frameworks FARM and Transformers.
With both you can either load a local model or one from Hugging Face's model hub (https://huggingface.co/models).

**Here:** a medium sized RoBERTa QA model using a Reader based on FARM (https://huggingface.co/deepset/roberta-base-squad2)

**Alternatives (Reader):** TransformersReader (leveraging the `pipeline` of the Transformers package)

**Alternatives (Models):** e.g. "distilbert-base-uncased-distilled-squad" (fast) or "deepset/bert-large-uncased-whole-word-masking-squad2" (good accuracy)

**Hint:** You can adjust how readily the model returns "no answer possible" with the `no_ans_boost` parameter. Higher values mean the model prefers "no answer possible".
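
The effect of such a boost can be pictured with a toy decision rule. This is a simplified sketch under stated assumptions, not the Reader's actual scoring code: `choose_answer` is a hypothetical helper and the scores are made-up numbers.

```python
def choose_answer(best_span, best_span_score, no_answer_score, no_ans_boost=0.0):
    """Return the span unless the boosted no-answer score beats it."""
    if no_answer_score + no_ans_boost > best_span_score:
        return None  # stands for "no answer possible"
    return best_span


# Made-up scores: the span wins without a boost...
print(choose_answer("virtual memory", best_span_score=2.0, no_answer_score=1.0))
# ...and loses once the no-answer option is boosted enough.
print(choose_answer("virtual memory", best_span_score=2.0, no_answer_score=1.0, no_ans_boost=1.5))
```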


### Question Answering using Roberta-Base-SQuAD 2

In [29]:
# Load a local model or any of the QA models on
# Hugging Face's model hub (https://huggingface.co/models)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find deepset/roberta-base-squad2 locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...
INFO - haystack.modeling.model.language_model -  Loaded deepset/roberta-base-squad2
INFO - haystack.modeling.logger -  ML Logging is turned off. No parameters, metrics or artifacts will be logged to MLFlow.
INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.infer -  Got ya 2 parallel workers to do inference ...
INFO - haystack.modeling.infer -   0    0 
INFO - haystack.modeling.infer -  /w\  /w\
INFO - haystack.modeling.infer -  /'\  / \


### Question Answering using BERT-Large-SQuAD 2

In [18]:
# Optional alternative: load a local model or any of the QA models on
# Hugging Face's model hub (https://huggingface.co/models).
# Running this cell replaces the FARMReader assigned to `reader` above.
reader = TransformersReader(model_name_or_path="deepset/bert-large-uncased-whole-word-masking-squad2", tokenizer="bert-base-uncased", use_gpu=-1)

INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1


Downloading:   0%|          | 0.00/540 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.25G [00:00<?, ?B/s]

KeyboardInterrupt: ignored

### Pipeline

With a Haystack `Pipeline` you can combine your building blocks into a search pipeline.
Under the hood, `Pipelines` are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases.
To speed things up, Haystack also comes with a few predefined Pipelines. One of them is the `ExtractiveQAPipeline` that combines a retriever and a reader to answer our questions.
You can learn more about `Pipelines` in the [docs](https://haystack.deepset.ai/docs/latest/pipelinesmd).

In [30]:
from haystack.pipelines import ExtractiveQAPipeline
pipe = ExtractiveQAPipeline(reader, retriever)
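
To make the idea of chained nodes concrete, here is a toy sketch of a linear pipeline (the simplest kind of DAG) that routes parameters to nodes by name, in the same `{"Retriever": {...}, "Reader": {...}}` shape used in the queries in this notebook. `ToyPipeline`, `toy_retriever`, `toy_reader`, and the three-sentence corpus are hypothetical stand-ins and do not reflect Haystack's internals.

```python
class ToyPipeline:
    """Runs named nodes in order, passing each node its own keyword parameters."""

    def __init__(self):
        self.nodes = []  # list of (name, callable) pairs

    def add_node(self, name, component):
        self.nodes.append((name, component))

    def run(self, query, params=None):
        params = params or {}
        state = {"query": query}
        for name, component in self.nodes:
            # Each node only sees the parameters filed under its own name.
            state = component(state, **params.get(name, {}))
        return state


def toy_retriever(state, top_k=10):
    # Hypothetical corpus; keep documents that share a word with the query.
    corpus = ["virtual memory uses disk space", "asterisk handles telephony", "memory is limited"]
    words = set(state["query"].lower().split())
    hits = [d for d in corpus if words & set(d.split())]
    return {**state, "documents": hits[:top_k]}


def toy_reader(state, top_k=3):
    # Stand-in for a neural reader: return the top documents as "answers".
    return {**state, "answers": state["documents"][:top_k]}


toy_pipe = ToyPipeline()
toy_pipe.add_node("Retriever", toy_retriever)
toy_pipe.add_node("Reader", toy_reader)
result = toy_pipe.run("virtual memory", params={"Retriever": {"top_k": 2}, "Reader": {"top_k": 1}})
print(result["answers"])
```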

## Ask a question!

#### Question 1

In [31]:
# You can configure how many candidates the Reader and Retriever return.
# A higher Retriever top_k tends to give better answers, but inference is slower.
prediction = pipe.run(
    query="What is Virtual Memory?",
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 3}},
)

  start_indices = flat_sorted_indices // max_seq_len
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.31 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 15.38 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 15.70 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 15.14 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 16.07 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 16.11 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 17.23 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 18.66 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 19.21 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 19.82 Batches/s]


In [32]:
# Now you can either print the object directly...
from pprint import pprint
pprint(prediction)

{'answers': [<Answer {'answer': 'To accommodate the simultaneous running of several applications', 'type': 'extractive', 'score': 0.7343046367168427, 'context': 's\nrunning BeOS rarely crash. Virtual Memory\nTo accommodate the simultaneous running of several applications, some operating systems use a memory schem', 'offsets_in_document': [{'start': 456, 'end': 519}], 'offsets_in_context': [{'start': 44, 'end': 107}], 'document_id': '49e15c4a0138f2e7d9e707203b5541eb', 'meta': {'_split_id': 1755}}>,
             <Answer {'answer': 'memory other than RAM that is devoted to holding application code and data', 'type': 'extractive', 'score': 0.5000361800193787, 'context': 'Virtual memory is\nmemory other than RAM that is devoted to holding application code and data. Typically, a system reserves hard drive space and uses t', 'offsets_in_document': [{'start': 18, 'end': 92}], 'offsets_in_context': [{'start': 18, 'end': 92}], 'document_id': 'e90c871864ed4a202d2874aa5a858de', 'meta': {'_split_id

In [33]:
# ...or use a util to simplify the output
# Change `minimum` to `medium` or `all` to raise the level of detail
print_answers(prediction, details="minimum")


Query: What is Virtual Memory?
Answers:
[   {   'answer': 'To accommodate the simultaneous running of several '
                  'applications',
        'context': 's\n'
                   'running BeOS rarely crash. Virtual Memory\n'
                   'To accommodate the simultaneous running of several '
                   'applications, some operating systems use a memory schem'},
    {   'answer': 'memory other than RAM that is devoted to holding '
                  'application code and data',
        'context': 'Virtual memory is\n'
                   'memory other than RAM that is devoted to holding '
                   'application code and data. Typically, a system reserves '
                   'hard drive space and uses t'},
    {   'answer': 'height',
        'context': 'virtual\n'
                   'virtual\n'
                   'virtual\n'
                   '...\n'
                   'virtual\n'
                   'virtual\n'
                   'height);\n'
           

#### Question 2

In [34]:
prediction = pipe.run(query="What is telephony adaptor?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 3}})

  start_indices = flat_sorted_indices // max_seq_len
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 14.37 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 13.43 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 15.32 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 13.40 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 15.73 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 18.16 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 18.07 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 18.50 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 19.31 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 19.92 Batches/s]


In [35]:
# Now you can either print the object directly...
from pprint import pprint
pprint(prediction)

{'answers': [<Answer {'answer': 'an end-user device that converts communications circuits from\none protocol to another', 'type': 'extractive', 'score': 0.5864536762237549, 'context': 'tor) can\nloosely be described as an end-user device that converts communications circuits from\none protocol to another. Most commonly, these devices a', 'offsets_in_document': [{'start': 127, 'end': 212}], 'offsets_in_context': [{'start': 33, 'end': 118}], 'document_id': 'e7da4ecbd263dd6e2e4e4c67cb76d67c', 'meta': {'_split_id': 214}}>,
             <Answer {'answer': 'Digital Circuit-Switched Telephone Network', 'type': 'extractive', 'score': 0.11177185922861099, 'context': '167\nAnalog Telephony\nDigital Telephony\nThe Digital Circuit-Switched Telephone Network', 'offsets_in_document': [{'start': 43, 'end': 85}], 'offsets_in_context': [{'start': 43, 'end': 85}], 'document_id': 'c2244ffe9fc6dd5fe5ba0454ced089ff', 'meta': {'_split_id': 10}}>,
             <Answer {'answer': 'Asterisk', 'type': 'extracti

In [36]:
print_answers(prediction, details="minimum")


Query: What is telephony adaptor?
Answers:
[   {   'answer': 'an end-user device that converts communications circuits '
                  'from\n'
                  'one protocol to another',
        'context': 'tor) can\n'
                   'loosely be described as an end-user device that converts '
                   'communications circuits from\n'
                   'one protocol to another. Most commonly, these devices a'},
    {   'answer': 'Digital Circuit-Switched Telephone Network',
        'context': '167\n'
                   'Analog Telephony\n'
                   'Digital Telephony\n'
                   'The Digital Circuit-Switched Telephone Network'},
    {'answer': 'Asterisk', 'context': 'Asterisk: The Future of Telephony'}]
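
The `offsets_in_context` values in the full output are character positions into the `context` string. A short check using the values printed for the first answer to Question 2, copied into plain Python variables for illustration:

```python
# Values copied from the first answer printed for Question 2.
answer = "an end-user device that converts communications circuits from\none protocol to another"
context = (
    "tor) can\nloosely be described as an end-user device that converts "
    "communications circuits from\none protocol to another. Most commonly, these devices a"
)
offsets_in_context = {"start": 33, "end": 118}

# Slicing the context with the offsets recovers the answer text.
span = context[offsets_in_context["start"]:offsets_in_context["end"]]
print(span == answer)  # True
```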


#### Question 3

In [43]:
prediction = pipe.run(query="Tell me more about message handling", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 3}})

  start_indices = flat_sorted_indices // max_seq_len
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.32 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  2.86 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 15.61 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 14.48 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 15.57 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 15.02 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 16.29 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 17.84 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 17.59 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 18.85 Batches/s]


In [44]:
# Now you can either print the object directly...
from pprint import pprint
pprint(prediction)

{'answers': [<Answer {'answer': 'message creation', 'type': 'extractive', 'score': 0.12980275601148605, 'context': 'f menu item handling to\ncontrol handling will serve well to cement in your mind the practice used in each\ncase: message creation and message handling.', 'offsets_in_document': [{'start': 132, 'end': 148}], 'offsets_in_context': [{'start': 112, 'end': 128}], 'document_id': '568658458d7252c03f937aabac9308d8', 'meta': {'_split_id': 2603}}>,
             <Answer {'answer': 'message to the affected BHandler object', 'type': 'extractive', 'score': 0.09160163253545761, 'context': 'ceives a system message, it is dispatched by sending the\nmessage to the affected BHandler object. That object then invokes a hook function--a function', 'offsets_in_document': [{'start': 102, 'end': 141}], 'offsets_in_context': [{'start': 56, 'end': 95}], 'document_id': 'b43e10dd82ca5f2b17488a107bf8e990', 'meta': {'_split_id': 2910}}>,
             <Answer {'answer': '#define', 'type': 'extractive',

In [45]:
print_answers(prediction, details="minimum")


Query: Tell me more about message handling
Answers:
[   {   'answer': 'message creation',
        'context': 'f menu item handling to\n'
                   'control handling will serve well to cement in your mind '
                   'the practice used in each\n'
                   'case: message creation and message handling.'},
    {   'answer': 'message to the affected BHandler object',
        'context': 'ceives a system message, it is dispatched by sending the\n'
                   'message to the affected BHandler object. That object then '
                   'invokes a hook function--a function'},
    {   'answer': '#define',
        'context': 'For instance, a program that implements message handling '
                   'through a menu item first defines a message constant:\n'
                   '#define'}]
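
The open-ended request in Question 3 yields low scores (about 0.13 and 0.09 for the first two answers), whereas the top answer to Question 2 scored about 0.59. One way to act on this is to drop answers below a score threshold. The sketch below assumes a threshold of 0.5, which is an arbitrary illustrative choice; `ToyAnswer` is a hypothetical stand-in for the answer objects, carrying only the `answer` and `score` fields shown in the output.

```python
from dataclasses import dataclass


@dataclass
class ToyAnswer:
    """Hypothetical stand-in exposing only the two fields used here."""
    answer: str
    score: float


def keep_confident(answers, min_score=0.5):
    """Return only the answers whose score is at least `min_score`."""
    return [a for a in answers if a.score >= min_score]


# Answers and scores copied from the Question 2 and Question 3 output above.
q2 = [
    ToyAnswer("an end-user device that converts communications circuits from\none protocol to another", 0.5864536762237549),
    ToyAnswer("Digital Circuit-Switched Telephone Network", 0.11177185922861099),
]
q3 = [
    ToyAnswer("message creation", 0.12980275601148605),
    ToyAnswer("message to the affected BHandler object", 0.09160163253545761),
]

print(len(keep_confident(q2)), len(keep_confident(q3)))
```

With this threshold only the telephony adaptor definition survives, and none of the listed Question 3 answers do.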
