This is a repository for Python code used to build an extractive Question-Answering model using Haystack.
Haystack is an end-to-end open-source framework for creating Question-Answering models. Haystack has three primary components: the DocumentStore, Retriever, and Reader.
- DocumentStore: This is exactly what it sounds like. The DocumentStore stores text documents and their metadata. Documents are typically split into smaller units (e.g., paragraphs) before indexing so that answers are more accurate and more granular. In this example, documents consist of web-scraped text from a list of URLs, each pointing either to an article or to a YouTube video. Rudimentary pre-processing of the text data is completed before Haystack's preprocessor is used.
- Retriever: Retrievers are fast, simple algorithms that identify candidate passages from a large collection of documents. The Retriever sends a set of top-k candidate documents to the Reader. In general, the Retriever narrows the scope for the Reader, which then performs a thorough search of the top-k documents for the best answer.
- Reader: Takes passages of text as input and returns the top-k answers with their corresponding confidence scores (range 0-1). Readers are powerful models that search the selected documents in full with the aim of finding the right answer.
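The paragraph-level splitting described for the DocumentStore can be sketched in plain Python. This is only an illustration of the idea, not Haystack's preprocessor: the function names `split_into_paragraphs` and `to_records`, the `min_chars` cutoff, and the `content`/`meta` record layout are assumptions made for this example.

```python
import re


def split_into_paragraphs(text, min_chars=20):
    """Split raw text on blank lines and drop very short fragments.

    min_chars is a hypothetical cutoff chosen for this sketch.
    """
    # A blank line (optionally containing whitespace) separates paragraphs.
    parts = re.split(r"\n\s*\n", text)
    paragraphs = []
    for part in parts:
        # Collapse runs of whitespace and trim the ends.
        cleaned = " ".join(part.split())
        if len(cleaned) >= min_chars:
            paragraphs.append(cleaned)
    return paragraphs


def to_records(text, source_url):
    """Wrap each paragraph in a dict holding its content and metadata."""
    return [
        {"content": para, "meta": {"source": source_url, "paragraph": i}}
        for i, para in enumerate(split_into_paragraphs(text))
    ]
```

Each record keeps the source URL next to the paragraph text, so an answer can later be traced back to the article or video it came from.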
The DocumentStore, Retriever, and Reader are connected using a querying pipeline. Querying pipelines are used to receive a query from the user and produce a result.
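To make the retrieve-then-read flow concrete, here is a toy querying pipeline in plain Python. It does not use Haystack and it stands in for the real components with simple term overlap: the functions `retrieve`, `read`, and `run_pipeline` and their scoring rules are assumptions for illustration only. A real Retriever (for example BM25) and a real Reader (a neural model) score very differently.

```python
import re


def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, documents, top_k=2):
    """Toy retriever: rank documents by how many query terms they share."""
    query_terms = set(tokenize(query))
    scored = []
    for doc in documents:
        overlap = len(query_terms & set(tokenize(doc)))
        scored.append((overlap, doc))
    # Highest overlap first; documents with no overlap are discarded.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for overlap, doc in scored[:top_k] if overlap > 0]


def read(query, candidates, top_k=1):
    """Toy reader: score each sentence by the fraction of query terms it covers."""
    query_terms = set(tokenize(query))
    answers = []
    for doc in candidates:
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            if not sentence:
                continue
            covered = len(query_terms & set(tokenize(sentence)))
            # The score stays in the range 0-1, like a confidence score.
            score = covered / len(query_terms) if query_terms else 0.0
            answers.append({"answer": sentence, "score": score})
    answers.sort(key=lambda a: a["score"], reverse=True)
    return answers[:top_k]


def run_pipeline(query, documents, retriever_top_k=2, reader_top_k=1):
    """Narrow the document set with the retriever, then read only the candidates."""
    candidates = retrieve(query, documents, top_k=retriever_top_k)
    return read(query, candidates, top_k=reader_top_k)
```

The two `top_k` arguments play the same role as the top-k parameters of the Retriever and Reader: the first limits how many documents reach the reader, the second limits how many answers come back.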
- Install Packages: Update and install required packages.
- Download, extract, and set permissions for the Elasticsearch installation image.
- Start Elasticsearch server.
- Upload your text data files and, if available, a metadata file.
- Download a dataset from the Game of Thrones Wikipedia.
- Index documents by converting them into Haystack Documents using an indexing pipeline.
- Initialize the Retriever to score and retrieve relevant documents.
- Use a model for semantic search.
- Use BM25Retriever for sparse retrieval.
- Route documents for reading text and tables using different Readers.
- Initialize a Reader that extracts the top answer candidates.
- Combine the Reader and Retriever in a querying pipeline.
- Query the pipeline to ask a question and set top-k parameters for the Retriever and Reader.
- Filter documents based on a score threshold and set a default answer.
- Print the filtered answers returned by the pipeline.
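The last two steps, filtering by a score threshold with a default answer and printing the result, can be sketched as follows. This is a minimal sketch that assumes each answer is a dict with `answer` and `score` keys. That layout, the function names, the `0.5` threshold, and the default message are all hypothetical and are not taken from the repository's code or from Haystack's answer objects.

```python
def filter_answers(answers, threshold=0.5,
                   default_answer="No confident answer found."):
    """Keep answers scoring at or above the threshold, best first.

    If nothing passes, return a single default answer with score 0.0.
    """
    kept = [a for a in answers if a["score"] >= threshold]
    if not kept:
        return [{"answer": default_answer, "score": 0.0}]
    return sorted(kept, key=lambda a: a["score"], reverse=True)


def print_answers(answers):
    """Print one ranked line per answer with its score to two decimals."""
    for rank, a in enumerate(answers, start=1):
        print(f"{rank}. {a['answer']} (score: {a['score']:.2f})")


if __name__ == "__main__":
    raw = [
        {"answer": "Eddard Stark", "score": 0.91},
        {"answer": "Winterfell", "score": 0.12},
    ]
    print_answers(filter_answers(raw, threshold=0.5))
```

Returning a default answer instead of an empty list means the caller always has something to display, even when no candidate is confident enough.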