<a href="https://colab.research.google.com/github/patzacher/extractive_qa/blob/main/extractive_qa_haystack.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extractive QA System Using Python and Haystack




## Overview

Haystack is an end-to-end open-source framework for creating Question-Answering models. Haystack has three primary components: the DocumentStore, Retriever, and Reader.

1. DocumentStore: Stores text documents and their metadata. Documents are typically split into smaller units (e.g., paragraphs) before indexing so that answers can be located more accurately and at a finer granularity.

2. Retriever: A fast, simple algorithm that identifies candidate passages in a large collection of documents and sends a set of top-k candidates to the Reader. The Retriever narrows the scope for the Reader, which then searches the top-k documents thoroughly for the best answer.

3. Reader: Takes passages of text as input and returns the top-k answers with their confidence scores (in the range 0 to 1). Readers are powerful models that search the selected documents in full to find the right answer.

The DocumentStore, Retriever, and Reader are connected using a querying pipeline. Querying pipelines are used to receive a query from the user and produce a result.
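
To make the division of labor concrete, here is a minimal sketch of the three components as plain Python functions. It is a toy illustration, not Haystack code: the sample documents, the word-overlap scoring, and the names `retrieve`, `read`, and `run_pipeline` are all assumptions made for this example.

```python
import re

# Toy stand-in for a DocumentStore: a plain list of short passages.
document_store = [
    "Haystack is an open-source framework for building question answering systems.",
    "Elasticsearch is a distributed search and analytics engine.",
    "A Reader extracts answer spans from passages of text.",
]


def tokenize(text):
    """Lowercase a string and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query, documents, top_k=2):
    """Toy Retriever: rank documents by the number of words shared with the query."""
    query_words = tokenize(query)
    ranked = sorted(documents, key=lambda doc: len(query_words & tokenize(doc)), reverse=True)
    return ranked[:top_k]


def read(query, passages):
    """Toy Reader: return the best passage with a crude score between 0 and 1."""
    query_words = tokenize(query)
    best = max(passages, key=lambda passage: len(query_words & tokenize(passage)))
    score = len(query_words & tokenize(best)) / max(len(query_words), 1)
    return {"answer": best, "score": score}


def run_pipeline(query, top_k=2):
    """Toy querying pipeline: the Retriever narrows the search, the Reader picks an answer."""
    candidates = retrieve(query, document_store, top_k=top_k)
    return read(query, candidates)


result = run_pipeline("What is Haystack?")
print(result)
```

A real Reader extracts an answer span with a transformer model rather than returning a whole passage; the sketch only shows how the pieces hand data to one another.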


## Preparing the Colab Environment

- [Enable GPU Runtime](https://docs.haystack.deepset.ai/docs/enabling-gpu-acceleration#enabling-the-gpu-in-colab)


## Install Packages

Install Haystack and other required packages:

In [None]:
%%bash

pip install --upgrade pip
pip install farm-haystack[colab,preprocessing,elasticsearch,inference]

Configure Haystack's logging level:

In [None]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.DEBUG)
logging.getLogger("haystack").setLevel(logging.DEBUG)

## Import Packages

In [1]:
from google.colab import files

import requests
import re
import os

## Initialize the ElasticsearchDocumentStore

A DocumentStore stores the documents that the question-answering system uses to find answers to questions. Here, we're using the [`ElasticsearchDocumentStore`](https://docs.haystack.deepset.ai/reference/document-store-api#module-elasticsearch), which connects to a running Elasticsearch service. It's a fast and scalable text-focused storage option. This service runs independently from Haystack and persists even after the Haystack program has finished running. To learn more about the DocumentStore and the different types of external databases that Haystack supports, see [DocumentStore](https://docs.haystack.deepset.ai/docs/document_store).

As an aside, Elasticsearch is an open-source, distributed search and analytics engine designed for scalability, real-time searching, and data analysis. Among other things, Elasticsearch can index and search large volumes of text data quickly and efficiently.

1. Download, extract, and set the permissions for the Elasticsearch installation image:

In [None]:
%%bash

# Use `wget` utility to quietly (-q) download the Elasticsearch archive from
# this URL
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q

# After downloading, use `tar` to extract contents (-x extract, -z archive is
# compressed with gzip, -f specifies file name)
tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz

# Change ownership and group ownership (`chown`) of extracted files and
# directories to `daemon`, a non-privileged user and group name for running
# Elasticsearch in a secure manner.
chown -R daemon:daemon elasticsearch-7.9.2

2. Start the server:

In [None]:
%%bash --bg

# Start Elasticsearch server as the `daemon` user.
sudo -u daemon -- elasticsearch-7.9.2/bin/elasticsearch

If Docker is available in your environment (Colab notebooks do not support Docker), you can also start Elasticsearch using Docker. You can do this manually or with Haystack's [`launch_es()`](https://docs.haystack.deepset.ai/reference/utils-api#module-doc_store) utility function.

In [None]:
# from haystack.utils import launch_es

# launch_es()

3. Wait 30 seconds for the server to fully start up:

In [None]:
import time

time.sleep(30)

4. Initialize the ElasticsearchDocumentStore:


In [None]:
import os
from haystack.document_stores import ElasticsearchDocumentStore

# Get the host where Elasticsearch is running, default to localhost
host = os.environ.get("ELASTICSEARCH_HOST", "localhost")

# Configure ElasticsearchDocumentStore for accessing and storing documents.
document_store = ElasticsearchDocumentStore(host=host, username="", password="", index="document")

INFO:haystack.telemetry:Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems in the [documentation page](https://docs.haystack.deepset.ai/docs/telemetry#how-can-i-opt-out). More information at [Telemetry](https://docs.haystack.deepset.ai/docs/telemetry).


ElasticsearchDocumentStore is up and running and ready to store the Documents.
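
The fixed 30-second pause in step 3 is a guess about startup time. As a rough alternative, one could poll the server's port until it accepts connections. The sketch below assumes Elasticsearch listens on its default HTTP port 9200 on `localhost`; `wait_for_port` is a hypothetical helper, not part of Haystack, and an open port does not by itself guarantee that the cluster is ready to serve requests.

```python
import socket
import time


def wait_for_port(host="localhost", port=9200, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Opening and immediately closing a connection is enough to
            # confirm that something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and retry.
            time.sleep(interval)
    return False
```

Calling `wait_for_port()` before step 4 and checking that it returned `True` would replace the `time.sleep(30)` call.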

## Upload Files


####**Option 1:** Upload Previously Scraped Data

Step 1.

If you have text data that you would like to use, upload the files here.

In [None]:
from google.colab import files

# Create a new directory for scraped content (no error if it already exists).
os.makedirs("/content/Data", exist_ok=True)

# Switch to new directory.
os.chdir("/content/Data")

files.upload()

Step 2.

If you have a metadata file, parse `meta_file.txt` into the `meta_file` list for use in the indexing pipeline. The file holds one block of `key: value` lines per document, with blocks separated by blank lines and listed in the same order as the documents in your `Data` directory. If this doesn't apply, skip this step, but remember to switch back to the parent directory (`/content/`).

In [None]:
meta_file_path = "/content/Data/meta_file.txt"

# Read meta_file.txt
with open(meta_file_path, "r") as file:
  meta_data = file.read()

# Split the contents into one item per document (items are separated by blank lines)
meta_items = meta_data.strip().split('\n\n')

# Initialize a list to store dictionaries
meta_file = []

# Iterate through the items and create the list of dictionaries contained in
# `meta_file`.
for item in meta_items:
    item_lines = item.split('\n')
    item_dict = {}

    for line in item_lines:
        key, value = line.split(': ', 1)
        item_dict[key] = value

    meta_file.append(item_dict)


# Set directory where documents are located.
doc_dir = "/content/Data"



In [None]:
# Switch back to parent directory.
os.chdir("/content/")
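
For reference, here is a small sketch of the `meta_file.txt` layout that the parsing code in Step 2 expects, together with the list of dictionaries it produces. The sample titles and links are placeholders, and `parse_meta` is a hypothetical helper that mirrors the cell above while skipping malformed lines.

```python
def parse_meta(text):
    """Parse blank-line-separated 'Key: value' blocks into a list of dicts."""
    records = []
    for block in text.strip().split("\n\n"):
        record = {}
        for line in block.splitlines():
            if ": " not in line:
                continue  # Skip malformed lines instead of raising ValueError.
            key, value = line.split(": ", 1)
            record[key.strip()] = value.strip()
        if record:
            records.append(record)
    return records


# Placeholder contents standing in for a real meta_file.txt.
sample = """Title: First Article
Link: https://example.com/first

Title: Second Article
Link: https://example.com/second
"""

meta_file = parse_meta(sample)
print(meta_file)
```

The `Title` and `Link` keys match the ones used later when sorting the metadata and printing answers.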

####**Option 2: Example Dataset from Haystack Tutorial**

Download 517 articles from the Game of Thrones Wikipedia. You can find them in *data/build_a_scalable_question_answering_system* as a set of *.txt* files.

In [None]:
from haystack.utils import fetch_archive_from_http

doc_dir = "data/build_a_scalable_question_answering_system"

fetch_archive_from_http(
    url="https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt3.zip",
    output_dir=doc_dir,
)

## Index Documents with a Pipeline

Indexing pipelines prepare the files for search. The main objective here is to convert files (.txt, in our case) into Haystack Documents, so they can be saved in a DocumentStore. Our indexing pipeline will have three nodes:

1. `TextConverter`, which turns `.txt` files into Haystack `Document` objects and sends them to the `PreProcessor`.
2. `PreProcessor`, which cleans and splits the text within a `Document` and sends the results to the `DocumentStore`.
3. `DocumentStore`, the database that stores text and metadata and provides them to the Retriever at query time. Our `ElasticsearchDocumentStore` has already been initialized.

Once we combine these nodes into a pipeline, the pipeline will ingest `.txt` file paths, preprocess them, and write them into the DocumentStore.

Note: More nodes can be added to the indexing pipeline as needed. For example, a `FileTypeClassifier` can be added as the first node to classify files into text, PDF, Markdown, docx, and HTML files and route them to the appropriate `FileConverter`. Also, a `DocumentClassifier` could be used to attach a classification label to each Document's metadata (e.g., sentiment labels like "positive" or "negative").


####**Option 1:**

Step 1.

Manually apply the converter(s) to each file. If we are only using one type of file (in this case, .txt), then we can specify the type of file converter we want to use.

Next, initialize the preprocessor with recommended values.

In [None]:
from haystack import Pipeline
from haystack.nodes import TextConverter, PreProcessor

indexing_pipeline = Pipeline() # Initialize the indexing pipeline
text_converter = TextConverter() # Reads text from .txt file
                                 # Sends to preprocessor
preprocessor = PreProcessor(
    clean_whitespace=True, # Remove whitespace at start/end of each line in text
    clean_header_footer=True, # Remove repeated header/footer
    clean_empty_lines=True, # Normalize 3+ empty lines to 2 empty lines
    split_by="word", # Unit to split document by
    split_length=100, # Max number of units per document (Recommended value)
    split_overlap=10, # Overlap between adjacent documents
    split_respect_sentence_boundary=True, # Doc boundaries preserve sentences
    max_chars_check = 1000 # Some docs have very long passages so we need to split them
)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


To learn more about the parameters of the `PreProcessor`, see [Usage](https://docs.haystack.deepset.ai/docs/preprocessor#usage).

[Document splitting](https://docs.haystack.deepset.ai/docs/optimization#document-length) is important for your question answering system's performance. If you halve the length of your documents, you will halve the workload placed on your Reader. The recommended maximum document length depends on the type of Retriever and ranges from about 100 to 500 words.

Our current pipeline uses a dense retriever, which places stricter limits on document length. We have to ensure that documents are not longer than the retriever's maximum input length (256 tokens). As such, decent performance has been found with documents around 100 words long (see [Optimization - Document Length](https://docs.haystack.deepset.ai/docs/optimization#document-length) for more details).
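
To illustrate what `split_length=100` and `split_overlap=10` mean, here is a simplified sketch of word-based splitting with overlap. It is not the `PreProcessor` implementation: it ignores sentence boundaries and cleaning, and `split_words` is a hypothetical helper written for this example.

```python
def split_words(text, split_length=100, split_overlap=10):
    """Split text into windows of split_length words that share split_overlap words."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # How far each new window advances.
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # This window already reaches the end of the text.
    return chunks


# A synthetic 250-word text: w0 w1 ... w249.
text = " ".join(f"w{i}" for i in range(250))
chunks = split_words(text)
print([len(chunk.split()) for chunk in chunks])
```

Each chunk starts 90 words after the previous one, so the last 10 words of one chunk reappear as the first 10 words of the next.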

Step 2.

Add the nodes to an indexing pipeline. Provide the `name` (or `name`s) of the preceding nodes as the `inputs` argument. Note that in an indexing pipeline, the input to the first node is `File`.

In [None]:
import os

indexing_pipeline.add_node(component=text_converter, name="TextConverter", inputs=["File"])
indexing_pipeline.add_node(component=preprocessor, name="PreProcessor", inputs=["TextConverter"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["PreProcessor"])

Step 3.

Run the indexing pipeline to write the text data into the DocumentStore. We can add metadata to our files using the `meta` argument in the `indexing_pipeline.run_batch` command. For example, we can include the title and URL of a document and return them for additional context when the user asks a question.

Note that `meta_file` must be a list of dictionaries with the same length as `files_to_index`. Both arguments are sorted alphabetically because the entries in `meta` must be in the same order as `file_paths`; this only lines up if the file names sort in the same order as the `Title` values. If you are not using a `meta_file`, remove that argument from the `run_batch` command.

In [None]:
# Specify the files we want to send to the DocumentStore.
files_to_index = [
    os.path.join(doc_dir, f)
    for f in os.listdir(doc_dir)
    if f != "meta_file.txt" # Don't include metadata file because we read that
]                           # separately.

# Run our indexing pipeline to convert files, preprocess, and store them.
indexing_pipeline.run_batch(file_paths=sorted(files_to_index),
                            meta=sorted(meta_file, key=lambda x: x['Title']))
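
Because a length mismatch or a mis-sorted list would silently attach the wrong metadata to a document, it can help to check the pairing before indexing. The sketch below uses a hypothetical helper, `align_meta`, and made-up file names; it applies the same sorting as the cell above and lets you inspect the resulting pairs.

```python
def align_meta(file_paths, meta_items, key="Title"):
    """Sort paths and metadata the same way as the indexing cell and check their lengths."""
    if len(file_paths) != len(meta_items):
        raise ValueError(f"{len(file_paths)} files but {len(meta_items)} metadata entries")
    sorted_paths = sorted(file_paths)
    sorted_meta = sorted(meta_items, key=lambda item: item[key])
    return sorted_paths, sorted_meta


# Made-up example inputs, deliberately out of order.
paths = ["/content/Data/b.txt", "/content/Data/a.txt"]
meta = [{"Title": "b"}, {"Title": "a"}]

sorted_paths, sorted_meta = align_meta(paths, meta)
pairs = list(zip(sorted_paths, sorted_meta))
for path, item in pairs:
    print(path, "->", item["Title"])
```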


####**Option 2:**
Haystack has a convenience function that automatically applies the right converter to each file in a directory, so you don't have to specify a converter per file type (e.g., pdf, docx, txt). The function returns a list of Documents; they are not written to the DocumentStore until you do so yourself. See the [Better Retrieval via Embedding Retrieval](https://haystack.deepset.ai/tutorials/06_better_retrieval_via_embedding_retrieval) tutorial for an example.

In [None]:
from haystack.utils import convert_files_to_docs

all_docs = convert_files_to_docs(dir_path=doc_dir)

## Initializing the Retriever

Now that the Documents are in the DocumentStore, let's initialize the nodes we want to use in our query pipeline. First, a Retriever.

In a query pipeline, the Retriever takes a query as input and checks it against the documents contained in the DocumentStore. It scores each document for its relevance to the query and returns the top candidates (top-k documents) to the Reader. The Reader then performs the more complex task of question-answering using a transformer-based language model.

Two (out of many) Retriever options are the **BM25Retriever** (no GPU needed) and an **EmbeddingRetriever** with Sentence Transformers models (recommended if we have a GPU available). The BM25Retriever is a *sparse* retriever while the EmbeddingRetriever is *dense*.

Sparse methods operate by looking for shared keywords between the document and the query. Dense approaches often perform better than their sparse counterparts, but are computationally more expensive. The models used by the EmbeddingRetriever are trained to embed similar sentences close to each other in a shared embedding space.

A good starting model for a dense Retriever is `multi-qa-mpnet-base-dot-v1`, as it was tuned for semantic search (i.e., given a query, it can find relevant passages). It was trained on a large and diverse set of question/answer pairs. The downside is that it is one of the larger models (420 MB); a smaller, similar option is `multi-qa-MiniLM-L6-cos-v1` (80 MB). Here we have to weigh model size against performance, as the smaller model generally performs worse.

For more Retriever options, see [Retriever](https://docs.haystack.deepset.ai/docs/retriever).

For model info, see [Pretrained Models](https://www.sbert.net/docs/pretrained_models.html#).


####**Option 1: EmbeddingRetriever**

Let's use the `multi-qa-mpnet-base-dot-v1` model.

In [None]:
from haystack.nodes import EmbeddingRetriever

retriever = EmbeddingRetriever(
    document_store=document_store, embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1"
)
# Important:
# Now that we initialized the Retriever, we need to call update_embeddings() to iterate over all
# previously indexed documents and update their embedding representation.
# While this can be a time consuming operation (depending on the corpus size), it only needs to be done once.
# At query time, we only need to embed the query and compare it to the existing document embeddings, which is very fast.
document_store.update_embeddings(retriever)
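
The query-time comparison described in the comments above can be pictured with a tiny example. The sketch below ranks precomputed document vectors by their dot product with a query vector, which is the similarity the `dot-v1` model is tuned for. The three-dimensional vectors and document names are made up; real embeddings have hundreds of dimensions and come from the model.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Made-up vectors standing in for embeddings stored by update_embeddings().
doc_embeddings = {
    "doc_about_dragons": [0.9, 0.1, 0.0],
    "doc_about_castles": [0.1, 0.8, 0.2],
    "doc_about_winter": [0.0, 0.2, 0.9],
}


def rank_by_dot_product(query_embedding, embeddings, top_k=2):
    """Return the top_k (doc_id, score) pairs with the highest dot product."""
    scored = [(doc_id, dot(query_embedding, vec)) for doc_id, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Only the query has to be embedded at query time; here it is also made up.
query_embedding = [0.8, 0.2, 0.1]
top = rank_by_dot_product(query_embedding, doc_embeddings)
print(top)
```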


####**Option 2: BM25 Retriever**

In [None]:
from haystack.nodes import BM25Retriever

retriever = BM25Retriever(document_store=document_store)
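
For intuition about what a sparse retriever computes, here is a compact sketch of BM25 scoring in plain Python. With an `ElasticsearchDocumentStore` the actual scoring happens inside Elasticsearch, so this is only an approximation for illustration: the tokenizer is a simple regular expression, the sample sentences are made up, and `k1=1.2` and `b=0.75` are the commonly used defaults.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with the BM25 formula."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates and is normalized by document length.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores


docs = [
    "Jon Snow is the Lord Commander of the Night's Watch.",
    "Daenerys Targaryen is the Mother of Dragons.",
    "Winterfell is the seat of House Stark.",
]
scores = bm25_scores("Who is the mother of dragons?", docs)
print(scores)
```

Common words such as "is" and "the" appear in every document and contribute little, while the rarer terms "mother" and "dragons" dominate the score of the second document.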

## Route Documents

Now that the Retriever has been initialized, we can move on to specifying our approach to routing documents. We can use the EmbeddingRetriever to retrieve both texts and tables. To do question-answering on these documents, we need to route the "text" documents to a FARMReader and the "table" documents to a TableReader. Then we need to join the answers coming from the two Readers into a single list of answers.

To read more about this process, see [Pipeline for QA on Combination of Text and Tables](https://haystack.deepset.ai/tutorials/15_tableqa) including how to evaluate the pipeline and how to add tables from PDFs.
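
The route-then-join idea can be sketched with plain dictionaries. This is a simplified illustration rather than the behavior of the `RouteDocuments` and `JoinAnswers` nodes themselves: it assumes each document carries a `content_type` field of either `"text"` or `"table"`, and the documents, answers, and scores are made up.

```python
def route_documents(documents):
    """Split documents into (text_docs, table_docs) by their content_type field."""
    text_docs = [doc for doc in documents if doc["content_type"] == "text"]
    table_docs = [doc for doc in documents if doc["content_type"] == "table"]
    return text_docs, table_docs


def join_answers(*answer_lists, top_k=None):
    """Merge several answer lists into one list sorted by descending score."""
    merged = [answer for answers in answer_lists for answer in answers]
    merged.sort(key=lambda answer: answer["score"], reverse=True)
    return merged if top_k is None else merged[:top_k]


retrieved = [
    {"id": "d1", "content_type": "text"},
    {"id": "t1", "content_type": "table"},
    {"id": "d2", "content_type": "text"},
]
text_docs, table_docs = route_documents(retrieved)

# Pretend each Reader produced these answers from its share of the documents.
text_answers = [{"answer": "from text", "score": 0.62}, {"answer": "also text", "score": 0.20}]
table_answers = [{"answer": "from table", "score": 0.81}]

joined = join_answers(text_answers, table_answers, top_k=2)
print(joined)
```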

## Initializing the Reader

Our query pipeline also needs a Reader, so we'll initialize it next. A Reader scans the texts it received from the Retriever and extracts the top answer candidates. Readers are based on powerful deep learning models but are much slower than Retrievers at processing the same amount of text. This is due to the model complexity (e.g., number of parameters), but also the difficulty of the task. Readers must process the text within the selected documents to extract the answer to a question, which involves fine-grained language understanding and reasoning.

We'll use a FARMReader with a base-sized RoBERTa question answering model called [`deepset/roberta-base-squad2`](https://huggingface.co/deepset/roberta-base-squad2). It's a good all-round model to start with and has been trained on QA pairs, including unanswerable questions, for the task of question-answering.

See [Models](https://docs.haystack.deepset.ai/docs/reader#models) for more options.

In [None]:
from haystack.nodes import FARMReader, TableReader, RouteDocuments, JoinAnswers

text_reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", context_window_size=300, use_gpu=True)
table_reader = TableReader(model_name_or_path="deepset/tapas-large-nq-hn-reader")
route_documents = RouteDocuments()
join_answers = JoinAnswers()

## Creating the Retriever-Reader Pipeline

You can combine the Reader and Retriever in a querying pipeline using the `Pipeline` class. The combination of the two speeds up processing because the Reader only processes the Documents that it received from the Retriever.

To speed things up, Haystack comes with a few predefined pipelines. One of them is the `ExtractiveQAPipeline` that combines a retriever and a reader to answer questions.

**Option 1: Manually Define a Pipeline**

*We'll use this option if we expect some answers to be contained within tables.*

Initialize the `Pipeline` object and add the Retriever and Reader as nodes. Provide the `name` (or `name`s) of the preceding nodes as the `inputs` argument. Note that in a querying pipeline, the input to the first node is `Query`.

In [None]:
from haystack import Pipeline

text_table_qa_pipeline = Pipeline()
text_table_qa_pipeline.add_node(component=retriever, name="EmbeddingRetriever", inputs=["Query"])
text_table_qa_pipeline.add_node(component=route_documents, name="RouteDocuments", inputs=["EmbeddingRetriever"])
text_table_qa_pipeline.add_node(component=text_reader, name="TextReader", inputs=["RouteDocuments.output_1"])
text_table_qa_pipeline.add_node(component=table_reader, name="TableReader", inputs=["RouteDocuments.output_2"])
text_table_qa_pipeline.add_node(component=join_answers, name="JoinAnswers", inputs=["TextReader", "TableReader"])

**Option 2: Predefined Pipeline**

In [None]:
from haystack.pipelines import ExtractiveQAPipeline

pipe = ExtractiveQAPipeline(text_reader, retriever)

That's it! The pipeline is ready to answer questions!

## Ask your Question

1. Use the pipeline's `run()` method to ask a question. The `query` argument is where you type your question. Additionally, you can set the number of results the Retriever and Reader return using their `top_k` parameters. Each `top_k` value is a trade-off between speed and accuracy. Specifically, the Retriever's `top_k` dictates how many retrieved documents are passed on to the Reader, while the Reader's `top_k` determines how many answer candidates to show. Haystack recommends a Retriever `top_k` of 10 for decent overall performance.

To learn more about setting arguments, see [Arguments](https://docs.haystack.deepset.ai/docs/pipelines#arguments).

To read more about the `top-k` parameter, see [Choosing the Right top-k Values](https://docs.haystack.deepset.ai/docs/optimization#choosing-the-right-top-k-values).
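
As a rough picture of how the two values interact, the sketch below applies a top-k cut twice: once to a made-up list of 50 scored documents and once to the candidate answers derived from the survivors. The scores and the helper `apply_top_k` are illustrative assumptions, not Haystack internals.

```python
def apply_top_k(scored_items, top_k):
    """Keep the top_k highest-scoring (item, score) pairs."""
    return sorted(scored_items, key=lambda pair: pair[1], reverse=True)[:top_k]


# Pretend the Retriever scored 50 documents; doc_0 is the most relevant.
retrieved = [(f"doc_{i}", 1.0 / (i + 1)) for i in range(50)]

# Retriever top_k: only these documents reach the slower Reader.
passed_to_reader = apply_top_k(retrieved, top_k=10)

# Pretend the Reader found one candidate answer per document it processed.
candidate_answers = [(f"answer from {doc_id}", score * 0.9) for doc_id, score in passed_to_reader]

# Reader top_k: only these answers are shown to the user.
shown_to_user = apply_top_k(candidate_answers, top_k=2)
print(len(passed_to_reader), shown_to_user)
```

Raising the Retriever's value gives the Reader more chances to find the answer but increases the amount of text it must process.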


In [None]:
question = "what is haystack?"

# Wrap the pipeline call in try/except so that a failed query does not stop
# the notebook; the next cell falls back to a default answer.
try:
  prediction = text_table_qa_pipeline.run(
          query = question,
          params = {"EmbeddingRetriever": {"top_k" : 10},
                    "TableReader": {"top_k" : 2},
                    "TextReader": {"top_k" : 2}}
          )
except Exception as err:
  print(f"Query failed: {err}") # Report the problem instead of hiding it.
  prediction = [] # If we run into an error, return an empty list.

## Filter by Score Threshold

First, check whether the pipeline returned a result (a `dict`). If it did, apply a score threshold so that only answers scoring above it are kept. If no answer meets the threshold, return a default answer instead.

If the prediction pipeline ran into an error and returned an empty list, use the default answer.

In [None]:
score_threshold = 0.1
default_answer = {"answer": "Sorry, I don't have an answer for that. Try asking your question in a different way.", "score": 0.0}

if isinstance(prediction, dict):
    # Keep only the answers whose score exceeds the threshold.
    filtered_documents = [doc for doc in prediction['answers'] if doc.score > score_threshold]

    # Fall back to the default answer if nothing passed the threshold.
    if not filtered_documents:
        filtered_documents = [default_answer]

else:
    # The pipeline failed and returned an empty list.
    filtered_documents = [default_answer]
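
The same filtering logic can be wrapped in a function so that it is easy to reuse and test without running the pipeline. In the sketch below, `ToyAnswer` is a hypothetical stand-in for an answer object with `answer` and `score` attributes, and `filter_answers` is an illustrative helper; neither is part of Haystack.

```python
from dataclasses import dataclass


@dataclass
class ToyAnswer:
    """Minimal stand-in for an answer object with `answer` and `score` attributes."""
    answer: str
    score: float


DEFAULT_ANSWER = {"answer": "Sorry, I don't have an answer for that.", "score": 0.0}


def filter_answers(prediction, score_threshold=0.1):
    """Return the answers scoring above the threshold, or the default answer."""
    if not isinstance(prediction, dict):
        # The pipeline failed and did not return a result dictionary.
        return [DEFAULT_ANSWER]
    kept = [
        answer
        for answer in prediction.get("answers", [])
        if answer.score is not None and answer.score > score_threshold
    ]
    return kept or [DEFAULT_ANSWER]


example = {"answers": [ToyAnswer("a confident answer", 0.9), ToyAnswer("a weak answer", 0.05)]}
results = filter_answers(example)
print(results)
```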

2. Print out the answers the pipeline returns:

In [None]:
from haystack.schema import Answer

# Set a hyperlink format for answers.
hyperlink_format = '<a href="{link}">{text}</a>'

# Check whether each item in filtered_documents is a Haystack Answer object.
# If so, print the answer with its details. If not, print the default
# answer.
for answer in filtered_documents:
    if isinstance(answer, Answer):
        print('The suggested answer is:',
              '"',
              answer.answer,
              '"',
              'with {} percent probability.'.format(round((answer.score)*100)),
              '\n\n',
              'See here for more information related to this answer: ',
              hyperlink_format.format(link = answer.meta['Link'], text = answer.meta['Title']),
              '\n\n',
              'Context for this answer: ',
              answer.context,
              '\n\n',
              'Document ID: ',
              answer.document_ids,
              '\n\n')

    else:
        print(answer['answer'])

#print(filtered_documents) # Print all filtered answers and related info

## Improvements/Extras

Improve the performance of the Reader by [fine-tuning](https://haystack.deepset.ai/tutorials/02_finetune_a_model_on_your_data).