<a href="https://colab.research.google.com/github/huckles-learning-lab/haystack/blob/main/haystack_test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Installing Haystack

To start, let’s install the latest release of Haystack with pip:

In [1]:
%%bash

pip install --upgrade pip
pip install farm-haystack[colab]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pip
  Downloading pip-23.0.1-py3-none-any.whl (2.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 22.6 MB/s eta 0:00:00
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.0.4
    Uninstalling pip-22.0.4:
      Successfully uninstalled pip-22.0.4
Successfully installed pip-23.0.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting farm-haystack[colab]
  Downloading farm_haystack-1.13.2-py3-none-any.whl (620 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 620.6/620.6 kB 10.5 MB/s eta 0:00:00
Collecting posthog
  Downloading posthog-2.3.1-py2.py3-none-any.whl (34 kB)
Collecting elasticsearch<8,>=7.7
  Downloading elasticsearch-7.17.9-py2.py3-none-any.whl (385 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 386.0/386.0 kB 28.9 MB/s eta 0:00:00
Collecting rank-



Set the logging level to INFO for Haystack’s loggers; all other loggers stay at WARNING:

In [2]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)
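
The cell above relies on Python's logger hierarchy: a logger without its own level takes the level of its nearest configured parent. The following is a minimal sketch of that behavior using only the standard library; `some_other_library` is a hypothetical logger name chosen for illustration.

```python
import logging

logging.getLogger().setLevel(logging.WARNING)         # root logger: warnings and above
logging.getLogger("haystack").setLevel(logging.INFO)  # Haystack loggers: info and above

# Child loggers without their own level inherit from the nearest configured parent.
haystack_child = logging.getLogger("haystack.pipelines.base")
other_library = logging.getLogger("some_other_library")  # hypothetical name

print(logging.getLevelName(haystack_child.getEffectiveLevel()))  # INFO
print(logging.getLevelName(other_library.getEffectiveLevel()))   # WARNING
```

This is why the INFO messages in the rest of this tutorial all come from `haystack.*` loggers.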


Initializing the DocumentStore

We’ll start creating our question answering system by initializing a DocumentStore. A DocumentStore stores the Documents that the question answering system uses to find answers to your questions. In this tutorial, we’re using the InMemoryDocumentStore, which is the simplest DocumentStore to get started with. It requires no external dependencies, and it’s a good option for smaller projects and debugging. However, it doesn’t scale well to larger Document collections, so it’s not a good choice for production systems. To learn more about the DocumentStore and the different types of external databases that we support, see DocumentStore.

Let’s initialize the DocumentStore:

In [3]:
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore(use_bm25=True)


INFO:haystack.telemetry:Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by calling disable_telemetry() or by manually setting the environment variable  HAYSTACK_TELEMETRY_ENABLED as described for different operating systems on the documentation page. More information at https://docs.haystack.deepset.ai/docs/telemetry
INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0


The DocumentStore is now ready. Now it’s time to fill it with some Documents.

Preparing Documents

Download 517 articles from the Game of Thrones Wikipedia. After the download, you can find them in data/build_your_first_question_answering_system as a set of .txt files.


In [4]:
from haystack.utils import fetch_archive_from_http

doc_dir = "data/build_your_first_question_answering_system"

fetch_archive_from_http(
    url="https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt1.zip",
    output_dir=doc_dir
)


INFO:haystack.utils.import_utils:Fetching from https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt1.zip to 'data/build_your_first_question_answering_system'


True

Use TextIndexingPipeline to convert the files you just downloaded into Haystack Document objects and write them into the DocumentStore:

In [5]:
import os
from haystack.pipelines.standard_pipelines import TextIndexingPipeline

files_to_index = [doc_dir + "/" + f for f in os.listdir(doc_dir)]
indexing_pipeline = TextIndexingPipeline(document_store)
indexing_pipeline.run_batch(file_paths=files_to_index)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
INFO:haystack.pipelines.base:It seems that an indexing Pipeline is run, so using the nodes' run method instead of run_batch.


Converting files:   0%|          | 0/183 [00:00<?, ?it/s]

Preprocessing:   0%|          | 0/183 [00:00<?, ?docs/s]



Updating BM25 representation...:   0%|          | 0/2356 [00:00<?, ? docs/s]

{'documents': [<Document: {'content': '\n\n"\'\'\'Mhysa\'\'\'" is the third season finale of the American medieval epic fantasy television series \'\'Game of Thrones\'\', and its 30th episode overall. Written by executive producers David Benioff and D. B. Weiss, and directed by David Nutter, it originally aired on  on HBO in the United States.\n\nThe episode revolves on the aftermath of the events instigated by "The Red Wedding", in which Tywin Lannister is revealed to be the mastermind behind the massacre — with Walder Frey and Roose Bolton having conspired with the Lannisters against the Starks. As a result, House Frey receives the Seat of Riverrun and Roose Bolton is appointed the new "Warden of the North". Elsewhere, House Greyjoy begins a new military campaign. In the North, Maester Aemon sends out ravens to alert the whole of Westeros about the arrival of the White Walkers. And across the narrow sea, the freed slaves of Yunkai hail Daenerys as their "mhysa", the Ghiscari language

The code in this tutorial uses the Game of Thrones data, but you can also supply your own .txt files and index them in the same way.

As an alternative, you can cast your text data into Document objects and write them into the DocumentStore using DocumentStore.write_documents().
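
As a rough sketch of that alternative, the helper below prepares plain Python dictionaries from your own texts. It assumes that `write_documents()` in Haystack 1.x also accepts dictionaries with a `content` key and an optional `meta` key. The helper name `to_document_dicts` and the sample texts are made up for illustration, and the actual write call is left as a comment so the snippet runs without Haystack installed.

```python
def to_document_dicts(named_texts):
    """Turn {name: text} pairs into dictionaries with 'content' and 'meta' keys."""
    return [
        {"content": text.strip(), "meta": {"name": name}}
        for name, text in named_texts.items()
        if text.strip()  # skip texts that are empty after stripping whitespace
    ]

# Hypothetical sample data; replace it with your own texts.
my_texts = {
    "arya.txt": "Arya accompanies her father Ned and her sister Sansa to King's Landing.",
    "empty.txt": "   ",
}
docs = to_document_dicts(my_texts)
print(docs)

# With the DocumentStore from this tutorial, the write step would then be:
# document_store.write_documents(docs)
```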

Initializing the Retriever

Our search system will use a Retriever, so we need to initialize it. A Retriever sifts through all the Documents and returns only the ones relevant to the question. This tutorial uses the BM25 algorithm. For more Retriever options, see Retriever.
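
To give an intuition for what BM25 rewards, here is a small, self-contained sketch of Okapi BM25 scoring in plain Python. It is not the code Haystack runs. The whitespace tokenization, the values `k1=1.5` and `b=0.75`, and the smoothed IDF variant are assumptions chosen for illustration, and the three sample sentences are made up.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue  # terms missing from the document contribute nothing
            # Smoothed IDF that stays non-negative (the "+ 1" variant).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates with k1; b controls length normalization.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "Arya Stark is the daughter of Eddard Stark",
    "Daenerys is hailed as mhysa by the freed slaves",
    "Tywin Lannister planned the massacre",
]
scores = bm25_scores("father of Arya Stark", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Rare query terms that occur often in a short document raise its score the most, which is why the first sentence wins here.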

Let’s initialize a BM25Retriever and make it use the InMemoryDocumentStore we initialized earlier in this tutorial:

In [6]:
from haystack.nodes import BM25Retriever

retriever = BM25Retriever(document_store=document_store)


The Retriever is ready, but we still need to initialize the Reader.

Initializing the Reader

A Reader scans the texts it receives from the Retriever and extracts the top answer candidates. Readers are based on powerful deep learning models but are much slower than Retrievers at processing the same amount of text. In this tutorial, we’re using a FARMReader with a base-sized RoBERTa question answering model called deepset/roberta-base-squad2. It’s a strong all-round model that’s good as a starting point. To find the best model for your use case, see Models.
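
To make "extracts answer candidates" more concrete: extractive question answering models of this kind typically assign each token a start score and an end score, and the answer is the span whose two scores add up to the highest total. The sketch below illustrates that selection step only. It is not FARMReader's implementation, the scores are invented for illustration, and `best_span` and `max_answer_len` are hypothetical names.

```python
def best_span(start_scores, end_scores, max_answer_len=5):
    """Return (score, start, end) of the highest-scoring span of limited length."""
    best = (float("-inf"), 0, 0)
    for start, start_score in enumerate(start_scores):
        # Only consider spans that end at or after `start` and stay short enough.
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            total = start_score + end_scores[end]
            if total > best[0]:
                best = (total, start, end)
    return best

tokens = ["She", "travels", "with", "her", "father", ",", "Eddard", ",", "to", "King's", "Landing"]
# Made-up scores; a real model would compute these from the question and the text.
start_scores = [0.1, 0.0, 0.0, 0.2, 0.5, 0.0, 4.0, 0.0, 0.0, 0.3, 0.1]
end_scores = [0.0, 0.1, 0.0, 0.0, 0.4, 0.0, 3.5, 0.1, 0.0, 0.2, 0.6]

score, start, end = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer, score)  # Eddard 7.5
```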

Let’s initialize the Reader:

In [7]:
from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)


INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0
INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0


Downloading (…)lve/main/config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

INFO:haystack.modeling.model.language_model: * LOADING MODEL: 'deepset/roberta-base-squad2' (Roberta)


Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/496M [00:00<?, ?B/s]

INFO:haystack.modeling.model.language_model:Auto-detected model language: english
INFO:haystack.modeling.model.language_model:Loaded 'deepset/roberta-base-squad2' (Roberta model) from model hub.


Downloading (…)okenizer_config.json:   0%|          | 0.00/79.0 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/772 [00:00<?, ?B/s]

INFO:haystack.modeling.utils:Using devices: CPU - Number of GPUs: 0


We’ve initialized all the components for our pipeline. We’re now ready to create the pipeline.

Creating the Retriever-Reader Pipeline

In this tutorial, we’re using a ready-made pipeline called ExtractiveQAPipeline. It connects the Reader and the Retriever. The combination of the two speeds up processing because the Reader only processes the Documents that the Retriever has passed on. To learn more about pipelines, see Pipelines.
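
The speed-up comes from the two-stage design: a cheap first stage narrows the collection down, and the expensive second stage only sees what is left. Here is a toy sketch of that idea in plain Python. The `retrieve` and `read` functions are simple stand-ins invented for this example, not Haystack components, and the sample sentences are made up.

```python
def retrieve(query, docs, top_k):
    """Cheap first stage: rank documents by the number of words shared with the query."""
    query_words = set(query.lower().split())

    def overlap(doc):
        return len(query_words & set(doc.lower().split()))

    return sorted(docs, key=overlap, reverse=True)[:top_k]

examined = []  # records every document the slow stage had to process

def read(query, doc):
    """Stand-in for the expensive second stage (a real Reader runs a neural model here)."""
    examined.append(doc)
    return doc

docs = [
    "Arya travels with her father Eddard",
    "Daenerys frees the slaves of Yunkai",
    "Roose Bolton becomes Warden of the North",
    "Ned is the father of Arya and Sansa",
]
query = "father of Arya"
candidates = retrieve(query, docs, top_k=2)
results = [read(query, doc) for doc in candidates]
print(len(docs), len(examined))  # 4 2
```

Only two of the four documents reach the slow stage, which is the effect the Retriever has on the Reader.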

To create the pipeline, run:

In [8]:
from haystack.pipelines import ExtractiveQAPipeline

pipe = ExtractiveQAPipeline(reader, retriever)


The pipeline’s ready. You can now go ahead and ask a question!

Asking a Question

Use the pipeline’s run() method to ask a question. The query argument is where you type your question. Additionally, you can use the top_k parameter to set how many Documents the Retriever returns and how many answers the Reader returns. To learn more about setting arguments, see Arguments. To understand the importance of the top_k parameter, see Choosing the Right top-k Values.


In [9]:
prediction = pipe.run(
    query="Who is the father of Arya Stark?",
    params={
        "Retriever": {"top_k": 10},
        "Reader": {"top_k": 5}
    }
)




Inferencing Samples:   0%|          | 0/1 [00:00<?, ? Batches/s]

Here are some questions you could try out:

    Who is the father of Arya Stark?
    Who created the Dothraki vocabulary?
    Who is the sister of Sansa?

Print out the answers the pipeline returned:


In [10]:
from pprint import pprint

pprint(prediction)


{'answers': [<Answer {'answer': 'Eddard', 'type': 'extractive', 'score': 0.9933727979660034, 'context': "s Nymeria after a legendary warrior queen. She travels with her father, Eddard, to King's Landing when he is made Hand of the King. Before she leaves,", 'offsets_in_document': [{'start': 207, 'end': 213}], 'offsets_in_context': [{'start': 72, 'end': 78}], 'document_id': '9e3c863097d66aeed9992e0b6bf1f2f4', 'meta': {'_split_id': 3}}>,
             <Answer {'answer': 'Ned', 'type': 'extractive', 'score': 0.975361168384552, 'context': "k in the television series.\n\n====Season 1====\nArya accompanies her father Ned and her sister Sansa to King's Landing. Before their departure, Arya's h", 'offsets_in_document': [{'start': 630, 'end': 633}], 'offsets_in_context': [{'start': 74, 'end': 77}], 'document_id': '7d3360fa29130e69ea6b2ba5c5a8f9c8', 'meta': {'_split_id': 10}}>,
             <Answer {'answer': 'Lord Eddard Stark', 'type': 'extractive', 'score': 0.9177318811416626, 'context': 'rk d

Simplify the printed answers:

In [11]:
from haystack.utils import print_answers

print_answers(
    prediction,
    details="minimum"  # Choose from `minimum`, `medium`, and `all`
)



Query: Who is the father of Arya Stark?
Answers:
[   {   'answer': 'Eddard',
        'context': 's Nymeria after a legendary warrior queen. She travels '
                   "with her father, Eddard, to King's Landing when he is made "
                   'Hand of the King. Before she leaves,'},
    {   'answer': 'Ned',
        'context': 'k in the television series.\n'
                   '\n'
                   '====Season 1====\n'
                   'Arya accompanies her father Ned and her sister Sansa to '
                   "King's Landing. Before their departure, Arya's h"},
    {   'answer': 'Lord Eddard Stark',
        'context': 'rk daughters.\n'
                   '\n'
                   'During the Tourney of the Hand to honour her father Lord '
                   'Eddard Stark, Sansa Stark is enchanted by the knights '
                   'performing in the event.'},
    {   'answer': 'Ned',
        'context': ' girl disguised as a boy all along and is surprised to '
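
Besides printing, you can also post-process the answers in code, for example to keep only the confident ones. The sketch below assumes that each Answer object exposes `answer` and `score` attributes, as the printed output earlier suggests. It uses `SimpleNamespace` stand-ins with the first three scores copied from that output so that it runs without Haystack, and `confident_answers` is a hypothetical helper name.

```python
from types import SimpleNamespace

# Stand-ins for the Answer objects in prediction["answers"].
answers = [
    SimpleNamespace(answer="Eddard", score=0.9933727979660034),
    SimpleNamespace(answer="Ned", score=0.975361168384552),
    SimpleNamespace(answer="Lord Eddard Stark", score=0.9177318811416626),
]

def confident_answers(answers, min_score):
    """Keep the answer strings whose score reaches the threshold."""
    return [a.answer for a in answers if a.score >= min_score]

print(confident_answers(answers, min_score=0.95))  # ['Eddard', 'Ned']
```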
         

And there you have it! Congratulations on building your first machine-learning-based question answering system!