<a href="https://colab.research.google.com/github/navneetkrc/Deep-Learning-Experiments-implemented-using-Google-Colab/blob/master/HayStackTutorial11_Pipelines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pipelines Tutorial

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial11_Pipelines.ipynb)

In this tutorial, you will learn how the `Pipeline` class acts as a connector between all the different
building blocks that are found in Haystack. Whether you are using a Reader, Generator, Summarizer
or Retriever (or two), the `Pipeline` class will help you build a Directed Acyclic Graph (DAG) that
determines how to route the output of one component into the input of another.
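
To make the routing idea concrete, here is a minimal pure-Python sketch of a pipeline as a graph of named nodes. `TinyPipeline`, `toy_retriever` and `toy_reader` are hypothetical names invented for this illustration; this is not how Haystack implements `Pipeline`, only a simplified picture of how each node receives the output of the nodes listed as its inputs.

```python
from typing import Callable, Dict, List

Node = Callable[[dict], dict]


class TinyPipeline:
    """Hypothetical stand-in for illustration; not Haystack's Pipeline."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        self.inputs: Dict[str, List[str]] = {}

    def add_node(self, component: Node, name: str, inputs: List[str]) -> None:
        # Inputs must already exist, so insertion order is a valid
        # topological order and the graph cannot contain a cycle.
        for src in inputs:
            if src != "Query" and src not in self.nodes:
                raise ValueError(f"Unknown input node: {src}")
        self.nodes[name] = component
        self.inputs[name] = inputs

    def run(self, query: str) -> dict:
        # "Query" is the root node; it only carries the user's query.
        outputs = {"Query": {"query": query}}
        last = "Query"
        for name, component in self.nodes.items():
            merged: dict = {}
            for src in self.inputs[name]:
                merged.update(outputs[src])
            outputs[name] = component(merged)
            last = name
        return outputs[last]


def toy_retriever(data: dict) -> dict:
    corpus = ["Eddard Stark is the father of Arya.", "Winter is coming."]
    words = data["query"].lower().rstrip("?").split()
    keyword = max(words, key=len)  # naive: search for the longest word
    hits = [doc for doc in corpus if keyword in doc.lower()]
    return {"query": data["query"], "documents": hits}


def toy_reader(data: dict) -> dict:
    # Pretend the answer is the first two words of the top document.
    top = data["documents"][0]
    return {"query": data["query"], "answers": [" ".join(top.split()[:2])]}


pipeline = TinyPipeline()
pipeline.add_node(component=toy_retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=toy_reader, name="Reader", inputs=["Retriever"])
result = pipeline.run("Who is the father of Arya Stark?")
print(result["answers"])
```

The `add_node(component=..., name=..., inputs=[...])` shape deliberately mirrors the calls used later in this tutorial, so the real examples should read the same way.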


## Setting Up the Environment

Let's start by making sure we have a GPU running, so that this tutorial runs at a decent speed.
In Google Colab, you can change to a GPU runtime in the menu:
- **Runtime -> Change Runtime type -> Hardware accelerator -> GPU**

In [1]:
# Make sure you have a GPU running
!nvidia-smi

Sun Apr 17 14:56:17 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

The following lines install Haystack through pip.

In [2]:
# Install the latest release of Haystack in your own environment
#! pip install farm-haystack

# Install the latest master of Haystack
!pip install --upgrade pip
!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]

# Install pygraphviz
!apt install libgraphviz-dev
!pip install pygraphviz

Collecting pip
  Downloading pip-22.0.4-py3-none-any.whl (2.1 MB)
     |████████████████████████████████| 2.1 MB 30.1 MB/s 
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 21.1.3
    Uninstalling pip-21.1.3:
      Successfully uninstalled pip-21.1.3
Successfully installed pip-22.0.4
Collecting farm-haystack[colab]
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-install-_w_hk3bo/farm-haystack_0a98b6a552054a62a59068ee5bab7db0
  Running command git clone --filter=blob:none --quiet https://github.com/deepset-ai/haystack.git /tmp/pip-install-_w_hk3bo/farm-haystack_0a98b6a552054a62a59068ee5bab7db0
  Resolved https://github.com/deepset-ai/haystack.git to commit 929c685cdad93a7315983f7f01d77e57a4235741
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting elasticsearch<=7.10,>=7.7
  Download

If you are running in Colab or in an environment without Docker, you will want to start Elasticsearch from source.

In [3]:
# In Colab / No Docker environments: Start Elasticsearch from source
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT

es_server = Popen(
    ["elasticsearch-7.9.2/bin/elasticsearch"], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1)  # as daemon
)
# wait until ES has started
! sleep 30
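
The fixed 30-second sleep above can be too short or needlessly long depending on the machine. One possible alternative is to poll until the server answers. The sketch below is only an illustration: `wait_until_ready` and `es_is_up` are hypothetical helpers, not part of Haystack, and the URL assumes Elasticsearch is listening on its default HTTP port (9200) on the local machine.

```python
import http.client
import time
import urllib.request
from typing import Callable


def wait_until_ready(probe: Callable[[], bool], timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Call `probe` repeatedly until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def es_is_up(url: str = "http://localhost:9200") -> bool:
    # Assumption: Elasticsearch serves HTTP on localhost:9200 (its default port).
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, and malformed responses all mean "not ready yet".
        return False


# In the notebook you could then write:
# if not wait_until_ready(es_is_up, timeout=60):
#     raise RuntimeError("Elasticsearch did not start in time")
```

Passing the probe in as a function keeps the waiting logic independent of Elasticsearch, so the same helper could wait for any other service.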

## Initialization

Then let's fetch some data (in this case, pages from the Game of Thrones wiki) and prepare it so that it can
be indexed into our `DocumentStore`.

In [4]:
from haystack.utils import (
    print_answers,
    print_documents,
    fetch_archive_from_http,
    convert_files_to_docs,
    clean_wiki_text,
)

# Download and prepare data - 517 Wikipedia articles for Game of Thrones
doc_dir = "data/tutorial11"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt11.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

# Convert the files to documents that can be indexed into our DocumentStore
got_docs = convert_files_to_docs(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)

INFO - haystack.modeling.model.optimization -  apex not found, won't use it. See https://nvidia.github.io/apex/
INFO - haystack.telemetry -  Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by calling disable_telemetry() or by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems on the documentation page. More information at https://haystack.deepset.ai/guides/telemetry
INFO - haystack.utils.import_utils -  Fetching from https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt11.zip to `data/tutorial11`
INFO - haystack.utils.preprocessing -  Converting data/tutorial11/121_The_Bear_and_the_Maiden_Fair.txt
INFO - haystack.utils.preprocessing -  Converting data/tutorial11/57_The_Laws_of_Gods_and_Men.txt
INFO - haystack.utils.preprocessing -  Converting data/tutorial11/201

Here we initialize the core components that we will be gluing together using the `Pipeline` class.
We have a `DocumentStore`, an `ElasticsearchRetriever` and a `FARMReader`.
These can be combined to create a classic Retriever-Reader pipeline that is designed
to perform Open Domain Question Answering.

In [5]:
from haystack import Pipeline
from haystack.utils import launch_es
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import ElasticsearchRetriever, EmbeddingRetriever, FARMReader


# Initialize DocumentStore and index documents
launch_es()
document_store = ElasticsearchDocumentStore()
document_store.delete_documents()
document_store.write_documents(got_docs)

# Initialize Sparse retriever
es_retriever = ElasticsearchRetriever(document_store=document_store)

# Initialize dense retriever
embedding_retriever = EmbeddingRetriever(
    document_store,
    model_format="sentence_transformers",
    embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
)
document_store.update_embeddings(embedding_retriever, update_existing_embeddings=False)

# Initialize reader
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")

INFO - haystack.modeling.utils -  Using devices: CUDA:0
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.nodes.retriever.dense -  Init retriever using embeddings of model sentence-transformers/multi-qa-mpnet-base-dot-v1


Downloading:   0%|          | 0.00/737 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/8.40k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/571 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/116 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/25.5k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/229 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/438M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/239 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/363 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/13.9k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

INFO - haystack.document_stores.elasticsearch -  Updating embeddings for 2357 docs without embeddings ...


Updating embeddings:   0%|          | 0/2357 [00:00<?, ? Docs/s]

Batches:   0%|          | 0/74 [00:00<?, ?it/s]

INFO - haystack.modeling.utils -  Using devices: CUDA:0
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find deepset/roberta-base-squad2 locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...


Downloading:   0%|          | 0.00/571 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/473M [00:00<?, ?B/s]

INFO - haystack.modeling.model.language_model -  Loaded deepset/roberta-base-squad2


Downloading:   0%|          | 0.00/878k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/772 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/79.0 [00:00<?, ?B/s]

INFO - haystack.modeling.logger -  ML Logging is turned off. No parameters, metrics or artifacts will be logged to MLFlow.
INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.infer -  Got ya 2 parallel workers to do inference ...
INFO - haystack.modeling.infer -   0     0  
INFO - haystack.modeling.infer -  /w\   /w\ 
INFO - haystack.modeling.infer -  /'\   / \ 


## Prebuilt Pipelines

Haystack features many prebuilt pipelines that cover common tasks.
Here we have an `ExtractiveQAPipeline` (the successor to the now deprecated `Finder` class).

In [6]:
from haystack.pipelines import ExtractiveQAPipeline

# Prebuilt pipeline
p_extractive_premade = ExtractiveQAPipeline(reader=reader, retriever=es_retriever)
res = p_extractive_premade.run(
    query="Who is the father of Arya Stark?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
)
print_answers(res, details="minimum")

  start_indices = flat_sorted_indices // max_seq_len
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  4.62 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 29.50 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  9.11 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 18.00 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 27.53 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 46.64 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 43.38 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 10.40 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 33.47 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 35.03 Batches/s]


Query: Who is the father of Arya Stark?
Answers:
[   {   'answer': 'Eddard',
        'context': 's Nymeria after a legendary warrior queen. She travels '
                   "with her father, Eddard, to King's Landing when he is made "
                   'Hand of the King. Before she leaves,'},
    {   'answer': 'Ned',
        'context': '\n'
                   '====Season 1====\n'
                   'Arya accompanies her father Ned and her sister Sansa to '
                   "King's Landing. Before their departure, Arya's "
                   'half-brother Jon Snow gifts A'},
    {   'answer': 'Lord Eddard Stark',
        'context': 'ark daughters.\n'
                   'During the Tourney of the Hand to honour her father Lord '
                   'Eddard Stark, Sansa Stark is enchanted by the knights '
                   'performing in the event.'},
    {   'answer': 'Joffrey',
        'context': 'laying with one of his wooden toys.\n'
                   "After Eddard discovers the 




If you only want to do the retrieval step, you can use a `DocumentSearchPipeline`:

In [7]:
from haystack.pipelines import DocumentSearchPipeline

p_retrieval = DocumentSearchPipeline(es_retriever)
res = p_retrieval.run(query="Who is the father of Arya Stark?", params={"Retriever": {"top_k": 10}})
print_documents(res, max_text_len=200)


Query: Who is the father of Arya Stark?

{   'content': '\n'
               '===In the Riverlands===\n'
               'The Stark army reaches the Twins, a bridge stronghold '
               'controlled by Walder Frey, who agrees to allow the army to '
               'cross the river and to commit his troops in return for Robb '
               'an...',
    'name': '450_Baelor.txt'}

{   'content': '\n'
               '===On the Kingsroad===\n'
               'City Watchmen search the caravan for Gendry but are turned '
               'away by Yoren. Gendry tells Arya Stark that he knows she is a '
               'girl, and she reveals she is actually Arya Stark after ...',
    'name': '224_The_Night_Lands.txt'}

{   'content': '\n'
               "===''A Game of Thrones''===\n"
               'Sansa Stark begins the novel by being betrothed to Crown '
               'Prince Joffrey Baratheon, believing Joffrey to be a gallant '
               'prince. While Joffrey and Sansa are walki

Or if you want to use a `Generator` instead of a `Reader`,
you can initialize a `GenerativeQAPipeline` like this:

In [8]:
from haystack.pipelines import GenerativeQAPipeline, FAQPipeline
from haystack.nodes import RAGenerator

# We set this to True so that the document store returns document embeddings with each document
# This is needed by the Generator
document_store.return_embedding = True

# Initialize generator
rag_generator = RAGenerator()

# Generative QA
p_generator = GenerativeQAPipeline(generator=rag_generator, retriever=embedding_retriever)
res = p_generator.run(query="Who is the father of Arya Stark?", params={"Retriever": {"top_k": 10}})
print_answers(res, details="minimum")

# We are setting this to False so that in later pipelines,
# we get a cleaner printout
document_store.return_embedding = False

INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1


Downloading:   0%|          | 0.00/4.49k [00:00<?, ?B/s]

  f"Please make sure the config includes `forced_bos_token_id={self.bos_token_id}` in future versions. "


Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizerFast'.


Downloading:   0%|          | 0.00/878k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/772 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'BartTokenizer'.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'BartTokenizerFast'.


Downloading:   0%|          | 0.00/1.92G [00:00<?, ?B/s]

Some weights of RagTokenForGeneration were not initialized from the model checkpoint at facebook/rag-token-nq and are newly initialized: ['rag.generator.lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Batches:   0%|          | 0/1 [00:00<?, ?it/s]




Query: Who is the father of Arya Stark?
Answers:
[{'answer': ' arya stark'}, {'answer': ' eddard and catelyn stark'}]


Haystack features prebuilt pipelines to do:
- just document search (`DocumentSearchPipeline`)
- document search with summarization (`SearchSummarizationPipeline`)
- generative QA (`GenerativeQAPipeline`)
- FAQ-style QA (`FAQPipeline`)
- translated search (`TranslationWrapperPipeline`)

To find out more about these pipelines, have a look at our [documentation](https://haystack.deepset.ai/docs/latest/pipelinesmd).


With any Pipeline, whether prebuilt or custom constructed,
you can save a diagram showing how all the components are connected.

![image](https://github.com/deepset-ai/haystack/blob/master/docs/img/retriever-reader-pipeline.png?raw=true)

In [9]:
p_extractive_premade.draw("pipeline_extractive_premade.png")
p_retrieval.draw("pipeline_retrieval.png")
p_generator.draw("pipeline_generator.png")

## Custom Pipelines

Now we are going to rebuild the `ExtractiveQAPipeline` using the generic `Pipeline` class.
We do this by adding the building blocks that we initialized as nodes in the graph.

In [10]:
# Custom built extractive QA pipeline
p_extractive = Pipeline()
p_extractive.add_node(component=es_retriever, name="Retriever", inputs=["Query"])
p_extractive.add_node(component=reader, name="Reader", inputs=["Retriever"])

# Now we can run it
res = p_extractive.run(
    query="Who is the father of Arya Stark?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
)
print_answers(res, details="minimum")
p_extractive.draw("pipeline_extractive.png")

  start_indices = flat_sorted_indices // max_seq_len
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  6.87 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 28.88 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 13.65 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 20.11 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 21.22 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 38.28 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 36.69 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 10.66 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 30.98 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 31.87 Batches/s]



Query: Who is the father of Arya Stark?
Answers:
[   {   'answer': 'Eddard',
        'context': 's Nymeria after a legendary warrior queen. She travels '
                   "with her father, Eddard, to King's Landing when he is made "
                   'Hand of the King. Before she leaves,'},
    {   'answer': 'Ned',
        'context': '\n'
                   '====Season 1====\n'
                   'Arya accompanies her father Ned and her sister Sansa to '
                   "King's Landing. Before their departure, Arya's "
                   'half-brother Jon Snow gifts A'},
    {   'answer': 'Lord Eddard Stark',
        'context': 'ark daughters.\n'
                   'During the Tourney of the Hand to honour her father Lord '
                   'Eddard Stark, Sansa Stark is enchanted by the knights '
                   'performing in the event.'},
    {   'answer': 'Joffrey',
        'context': 'laying with one of his wooden toys.\n'
                   "After Eddard discovers the 

Pipelines offer a very simple way to ensemble different components.
In this example, we are going to combine the power of an `EmbeddingRetriever`
with the keyword based `ElasticsearchRetriever`.
See our [documentation](https://haystack.deepset.ai/docs/latest/retrievermd) to understand why
we might want to combine a dense and sparse retriever.

![image](https://github.com/deepset-ai/haystack/blob/master/docs/img/tutorial11_custompipelines_pipeline_ensemble.png?raw=true)

Here we use a `JoinDocuments` node so that the predictions from each retriever can be merged together.
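
As a rough picture of what merging by concatenation means, the sketch below combines two ranked result lists and keeps one entry per document id. It is a simplified illustration only: `concatenate_results` and the dictionary layout are hypothetical, and Haystack's own `JoinDocuments` node may treat scores, ordering and duplicates differently.

```python
from typing import Dict, List


def concatenate_results(*result_lists: List[dict]) -> List[dict]:
    """Merge several result lists, keeping one entry per document id."""
    merged: Dict[str, dict] = {}
    for results in result_lists:
        for doc in results:
            # Keep the first occurrence; later duplicates are dropped.
            merged.setdefault(doc["id"], doc)
    return list(merged.values())


# Hypothetical retriever outputs; "doc2" is found by both retrievers.
sparse_hits = [
    {"id": "doc1", "content": "Arya accompanies her father Ned"},
    {"id": "doc2", "content": "her father Lord Eddard Stark"},
]
dense_hits = [
    {"id": "doc2", "content": "her father Lord Eddard Stark"},
    {"id": "doc3", "content": "daughter of Lord Eddard Stark"},
]

joined = concatenate_results(sparse_hits, dense_hits)
print([doc["id"] for doc in joined])
```

Because a document found by both retrievers appears only once, the Reader downstream does not process the same passage twice.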

In [11]:
from haystack.nodes import JoinDocuments

# Create ensembled pipeline
p_ensemble = Pipeline()
p_ensemble.add_node(component=es_retriever, name="ESRetriever", inputs=["Query"])
p_ensemble.add_node(component=embedding_retriever, name="EmbeddingRetriever", inputs=["Query"])
p_ensemble.add_node(
    component=JoinDocuments(join_mode="concatenate"), name="JoinResults", inputs=["ESRetriever", "EmbeddingRetriever"]
)
p_ensemble.add_node(component=reader, name="Reader", inputs=["JoinResults"])
p_ensemble.draw("pipeline_ensemble.png")

# Run pipeline
res = p_ensemble.run(
    query="Who is the father of Arya Stark?", params={"EmbeddingRetriever": {"top_k": 5}, "ESRetriever": {"top_k": 5}}
)
print_answers(res, details="minimum")

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

  start_indices = flat_sorted_indices // max_seq_len
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 13.58 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 28.23 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  8.37 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 31.10 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 35.63 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 38.02 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 35.81 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 32.61 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 31.49 Batches/s]


Query: Who is the father of Arya Stark?
Answers:
[   {   'answer': 'Ned',
        'context': '\n'
                   '====Season 1====\n'
                   'Arya accompanies her father Ned and her sister Sansa to '
                   "King's Landing. Before their departure, Arya's "
                   'half-brother Jon Snow gifts A'},
    {   'answer': 'Lord Eddard Stark',
        'context': 'ark daughters.\n'
                   'During the Tourney of the Hand to honour her father Lord '
                   'Eddard Stark, Sansa Stark is enchanted by the knights '
                   'performing in the event.'},
    {   'answer': 'Lord Eddard Stark',
        'context': "Game of Thrones'', Arya is the third child and younger "
                   'daughter of Lord Eddard Stark and his wife Lady Catelyn '
                   'Stark.  She is tomboyish, headstrong, f'},
    {   'answer': 'Eddard and Catelyn Stark',
        'context': 'Background ===\n'
                   'Arya is the third ch




In [15]:
# Run pipeline
res = p_ensemble.run(
    query="Who is the father of Arya Stark?", params={"EmbeddingRetriever": {"top_k": 5}}
)
print_answers(res, details="minimum")

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

  start_indices = flat_sorted_indices // max_seq_len
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  7.48 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 31.13 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  7.65 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  9.04 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 29.83 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 46.16 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  6.57 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 10.72 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 30.07 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 39.87 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 25.65 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 31.89 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 32.09 Batches/s


Query: Who is the father of Arya Stark?
Answers:
[   {   'answer': 'Eddard',
        'context': 's Nymeria after a legendary warrior queen. She travels '
                   "with her father, Eddard, to King's Landing when he is made "
                   'Hand of the King. Before she leaves,'},
    {   'answer': 'Ned',
        'context': '\n'
                   '====Season 1====\n'
                   'Arya accompanies her father Ned and her sister Sansa to '
                   "King's Landing. Before their departure, Arya's "
                   'half-brother Jon Snow gifts A'},
    {   'answer': 'Lord Eddard Stark',
        'context': 'ark daughters.\n'
                   'During the Tourney of the Hand to honour her father Lord '
                   'Eddard Stark, Sansa Stark is enchanted by the knights '
                   'performing in the event.'},
    {   'answer': 'Lord Eddard Stark',
        'context': "Game of Thrones'', Arya is the third child and younger "
                   




In [16]:
# Run pipeline
res = p_ensemble.run(
    query="Who is the father of Arya Stark?", params={"ESRetriever": {"top_k": 5}}
)
print_answers(res, details="minimum")

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

  start_indices = flat_sorted_indices // max_seq_len
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  5.98 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 33.80 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  6.36 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  6.79 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 29.47 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 11.70 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 10.09 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 16.00 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 18.80 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 26.25 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 28.06 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 48.78 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 21.75 Batches/s


Query: Who is the father of Arya Stark?
Answers:
[   {   'answer': 'Eddard',
        'context': 's Nymeria after a legendary warrior queen. She travels '
                   "with her father, Eddard, to King's Landing when he is made "
                   'Hand of the King. Before she leaves,'},
    {   'answer': 'Ned',
        'context': '\n'
                   '====Season 1====\n'
                   'Arya accompanies her father Ned and her sister Sansa to '
                   "King's Landing. Before their departure, Arya's "
                   'half-brother Jon Snow gifts A'},
    {   'answer': 'Lord Eddard Stark',
        'context': 'ark daughters.\n'
                   'During the Tourney of the Hand to honour her father Lord '
                   'Eddard Stark, Sansa Stark is enchanted by the knights '
                   'performing in the event.'},
    {   'answer': 'Lord Eddard Stark',
        'context': "Game of Thrones'', Arya is the third child and younger "
                   




## Custom Nodes

Nodes are relatively simple objects,
and we encourage our users to design their own if they don't see one that fits their use case.

The only requirements are:
- Create a class that inherits from `BaseComponent`.
- Add a `run()` method to your class, with the mandatory and optional arguments it needs to process. These arguments must be passed as input to the pipeline, inside `params`, or output by preceding nodes.
- Add processing logic inside `run()` (e.g. reformatting the query).
- Return a tuple that contains your output data (for the next node)
and the name of the outgoing edge (by default `"output_1"` for nodes that have one output).
- Add a class attribute `outgoing_edges = 1` that defines the number of output options from your node. You only need a higher number here if you have a decision node (see below).

Here we have a template for a Node:

In [12]:
from haystack import BaseComponent
from typing import Optional


class CustomNode(BaseComponent):
    outgoing_edges = 1

    def run(self, query: str, my_optional_param: Optional[int] = None):
        # process the inputs
        output = {"my_output": ...}
        return output, "output_1"
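
As a concrete instance of the template, here is a sketch of a node that tidies up the query before it reaches the next node. `QueryNormalizer` and its `max_length` parameter are hypothetical, and the small `BaseComponent` class below is only a stand-in so the snippet runs without Haystack installed; in the notebook you would inherit from the real `BaseComponent` imported above.

```python
from typing import Optional, Tuple


class BaseComponent:
    """Stand-in for illustration only; use Haystack's BaseComponent in the notebook."""

    outgoing_edges = 1


class QueryNormalizer(BaseComponent):
    # One output option, so results always leave through "output_1".
    outgoing_edges = 1

    def run(self, query: str, max_length: Optional[int] = None) -> Tuple[dict, str]:
        # Collapse repeated whitespace and lowercase the query.
        cleaned = " ".join(query.split()).lower()
        if max_length is not None:
            cleaned = cleaned[:max_length]
        # Return the output data plus the name of the outgoing edge.
        return {"query": cleaned}, "output_1"


output, edge = QueryNormalizer().run("  Who is   the father of Arya Stark? ")
print(output, edge)
```

The returned dictionary uses the key `query` on the assumption that the next node expects an argument with that name.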

## Decision Nodes

Decision Nodes help you route your data so that only certain branches of your `Pipeline` are run.
One popular use case for such query classifiers is routing keyword queries to Elasticsearch and questions to EmbeddingRetriever + Reader.
With this approach you keep optimal speed and simplicity for keywords while going deep with transformers when it's most helpful.

![image](https://github.com/deepset-ai/haystack/blob/master/docs/img/tutorial11_decision_nodes_pipeline_classifier.png?raw=true)

Though this looks very similar to the ensembled pipeline shown above,
the key difference is that only one of the retrievers is run for each request.
By contrast, both retrievers are always run in the ensembled approach.
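
The difference can be shown in plain Python by recording which branch actually runs. Everything in this sketch is hypothetical (the two toy retrievers, the `calls` log and the question-mark rule); it only illustrates the control flow, not Haystack's internals.

```python
from typing import List

calls: List[str] = []  # records which retrievers were executed


def sparse_retriever(query: str) -> List[str]:
    calls.append("sparse")
    return [f"sparse hit for {query!r}"]


def dense_retriever(query: str) -> List[str]:
    calls.append("dense")
    return [f"dense hit for {query!r}"]


def run_ensemble(query: str) -> List[str]:
    # Both branches always execute; their results are merged afterwards.
    return sparse_retriever(query) + dense_retriever(query)


def run_with_decision(query: str) -> List[str]:
    # Exactly one branch executes, chosen by a naive rule on the query text.
    branch = dense_retriever if "?" in query else sparse_retriever
    return branch(query)


run_ensemble("Arya Stark father")
ensemble_calls = list(calls)

calls.clear()
run_with_decision("Arya Stark father")
decision_calls = list(calls)

print(ensemble_calls, decision_calls)
```

For a keyword query the ensemble pays for both retrievers, while the decision variant runs only the sparse one, which is where the speed benefit comes from.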

Below, we define a very naive `QueryClassifier` and show how to use it:

In [13]:
class CustomQueryClassifier(BaseComponent):
    outgoing_edges = 2

    def run(self, query: str):
        if "?" in query:
            return {}, "output_2"
        else:
            return {}, "output_1"


# Here we build the pipeline
p_classifier = Pipeline()
p_classifier.add_node(component=CustomQueryClassifier(), name="QueryClassifier", inputs=["Query"])
p_classifier.add_node(component=es_retriever, name="ESRetriever", inputs=["QueryClassifier.output_1"])
p_classifier.add_node(component=embedding_retriever, name="EmbeddingRetriever", inputs=["QueryClassifier.output_2"])
p_classifier.add_node(component=reader, name="QAReader", inputs=["ESRetriever", "EmbeddingRetriever"])
p_classifier.draw("pipeline_classifier.png")

# Run only the dense retriever on the full sentence query
res_1 = p_classifier.run(query="Who is the father of Arya Stark?")
print("Embedding Retriever Results" + "\n" + "=" * 15)
print_answers(res_1)

# Run only the sparse retriever on a keyword based query
res_2 = p_classifier.run(query="Arya Stark father")
print("ES Results" + "\n" + "=" * 15)
print_answers(res_2)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

  start_indices = flat_sorted_indices // max_seq_len
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 10.46 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  6.65 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 28.20 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 10.50 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 27.46 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 22.40 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 27.20 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 47.07 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 20.85 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 28.61 Batches/s]


Embedding Retriever Results

Query: Who is the father of Arya Stark?
Answers:
[   <Answer {'answer': 'Eddard', 'type': 'extractive', 'score': 0.9919579923152924, 'context': "s Nymeria after a legendary warrior queen. She travels with her father, Eddard, to King's Landing when he is made Hand of the King. Before she leaves,", 'offsets_in_document': [{'start': 147, 'end': 153}], 'offsets_in_context': [{'start': 72, 'end': 78}], 'document_id': 'ba2a8e87ddd95e380bec55983ee7d55f', 'meta': {'name': '43_Arya_Stark.txt'}}>,
    <Answer {'answer': 'Ned', 'type': 'extractive', 'score': 0.9767242670059204, 'context': "\n====Season 1====\nArya accompanies her father Ned and her sister Sansa to King's Landing. Before their departure, Arya's half-brother Jon Snow gifts A", 'offsets_in_document': [{'start': 46, 'end': 49}], 'offsets_in_context': [{'start': 46, 'end': 49}], 'document_id': '180c2a6b36369712b361a80842e79356', 'meta': {'name': '43_Arya_Stark.txt'}}>,
    <Answer {'answer': 'Lord Eddard S

ES Results

Query: Arya Stark father
Answers:
[   <Answer {'answer': 'Eddard', 'type': 'extractive', 'score': 0.9085888862609863, 'context': "s Nymeria after a legendary warrior queen. She travels with her father, Eddard, to King's Landing when he is made Hand of the King. Before she leaves,", 'offsets_in_document': [{'start': 147, 'end': 153}], 'offsets_in_context': [{'start': 72, 'end': 78}], 'document_id': 'ba2a8e87ddd95e380bec55983ee7d55f', 'meta': {'name': '43_Arya_Stark.txt'}}>,
    <Answer {'answer': 'Ned', 'type': 'extractive', 'score': 0.7877896726131439, 'context': "\n====Season 1====\nArya accompanies her father Ned and her sister Sansa to King's Landing. Before their departure, Arya's half-brother Jon Snow gifts A", 'offsets_in_document': [{'start': 46, 'end': 49}], 'offsets_in_context': [{'start': 46, 'end': 49}], 'document_id': '180c2a6b36369712b361a80842e79356', 'meta': {'name': '43_Arya_Stark.txt'}}>,
    <Answer {'answer': 'Lord Eddard Stark', 'type': 'extractive', 'sc

## Evaluation Nodes

We have also designed a set of nodes that can be used to evaluate the performance of a system.
Have a look at our [tutorial](https://haystack.deepset.ai/docs/latest/tutorial5md) to get hands-on with the code and learn more about Evaluation Nodes!

## Debugging Pipelines

You can print out debug information from nodes in your pipelines in a few different ways.

In [14]:
# 1) You can set the `debug` attribute of a given node.
es_retriever.debug = True

# 2) You can provide `debug` as a parameter when running your pipeline
result = p_classifier.run(query="Who is the father of Arya Stark?", params={"ESRetriever": {"debug": True}})

# 3) You can provide the `debug` parameter to all nodes in your pipeline
result = p_classifier.run(query="Who is the father of Arya Stark?", params={"debug": True})

result["_debug"]

KeyError: ignored
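
The cell above ends in a `KeyError` when it indexes `result["_debug"]`. The output does not show why the key is missing, so the following is only a defensive sketch for inspecting a result dictionary without raising. The helper names `get_debug_info` and `summarize_debug` and the two sample dictionaries are made up for illustration; they are not part of Haystack and not real pipeline output.

```python
from typing import Any, Dict, List


def get_debug_info(result: Dict[str, Any]) -> Dict[str, Any]:
    """Return the debug section of a result dict, or an empty dict if absent."""
    debug = result.get("_debug")
    return debug if isinstance(debug, dict) else {}


def summarize_debug(result: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map each node name to the sorted list of debug fields it reported."""
    return {
        node: sorted(info.keys()) if isinstance(info, dict) else []
        for node, info in get_debug_info(result).items()
    }


# Hypothetical result dicts, for illustration only (assumed shape).
with_debug = {"answers": [], "_debug": {"ESRetriever": {"input": {}, "output": {}}}}
without_debug = {"answers": []}

print(summarize_debug(with_debug))     # {'ESRetriever': ['input', 'output']}
print(summarize_debug(without_debug))  # {}
```

Calling `summarize_debug(result)` on a real pipeline result would then show which nodes, if any, recorded debug information.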

## YAML Configs

A full `Pipeline` can be defined in a YAML file and loaded with a single call.
Keeping your pipeline in YAML is particularly useful
when you move between experimentation and production environments:
export the YAML from your notebook or IDE and import it into your production environment.
It also helps with version control of pipelines,
makes it easy to share a pipeline with colleagues,
and simplifies the configuration of pipeline parameters in production.

The YAML file consists of two main sections: you define all objects (e.g. a reader) in `components`
and then connect them into a pipeline in `pipelines`.
One component can serve as multiple nodes of a pipeline, or as a node in multiple pipelines.
It is loaded into memory only once, so reusing it does not consume additional resources.

The contents of a YAML file should look something like this:

```yaml
version: '0.7'
components:    # define all the building-blocks for Pipeline
- name: MyReader       # custom-name for the component; helpful for visualization & debugging
  type: FARMReader    # Haystack Class name for the component
  params:
    no_ans_boost: -10
    model_name_or_path: deepset/roberta-base-squad2
- name: MyESRetriever
  type: ElasticsearchRetriever
  params:
    document_store: MyDocumentStore    # params can reference other components defined in the YAML
    custom_query: null
- name: MyDocumentStore
  type: ElasticsearchDocumentStore
  params:
    index: haystack_test
pipelines:    # multiple Pipelines can be defined using the components from above
- name: my_query_pipeline    # a simple extractive-qa Pipeline
  nodes:
  - name: MyESRetriever
    inputs: [Query]
  - name: MyReader
    inputs: [MyESRetriever]
```
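
Because nodes refer to components by name, a typo in a name is an easy mistake to make. The sketch below mirrors the YAML example as a plain Python dictionary (roughly what a YAML parser would return) and checks that every node and input refers to a defined component. The function `find_undefined_references`, the `ROOT_NODES` set, and the `broken` config are assumptions made for illustration; this is not Haystack's own validation.

```python
from typing import Any, Dict, List

# Plain-dict mirror of the YAML example above.
config: Dict[str, Any] = {
    "version": "0.7",
    "components": [
        {"name": "MyReader", "type": "FARMReader",
         "params": {"no_ans_boost": -10,
                    "model_name_or_path": "deepset/roberta-base-squad2"}},
        {"name": "MyESRetriever", "type": "ElasticsearchRetriever",
         "params": {"document_store": "MyDocumentStore", "custom_query": None}},
        {"name": "MyDocumentStore", "type": "ElasticsearchDocumentStore",
         "params": {"index": "haystack_test"}},
    ],
    "pipelines": [
        {"name": "my_query_pipeline",
         "nodes": [
             {"name": "MyESRetriever", "inputs": ["Query"]},
             {"name": "MyReader", "inputs": ["MyESRetriever"]},
         ]},
    ],
}

# Assumed names of the built-in entry points of a pipeline.
ROOT_NODES = {"Query", "File"}


def find_undefined_references(cfg: Dict[str, Any]) -> List[str]:
    """Return one message per node or input that names an undefined component."""
    defined = {component["name"] for component in cfg["components"]}
    problems: List[str] = []
    for pipeline in cfg["pipelines"]:
        for node in pipeline["nodes"]:
            if node["name"] not in defined:
                problems.append(f"{pipeline['name']}: unknown node '{node['name']}'")
            for inp in node["inputs"]:
                # Inputs such as "QueryClassifier.output_1" name a node plus an output edge.
                base = inp.split(".")[0]
                if base not in defined and base not in ROOT_NODES:
                    problems.append(
                        f"{pipeline['name']}: node '{node['name']}' has unknown input '{inp}'"
                    )
    return problems


# A deliberately broken variant: the reader points at a retriever that is not defined.
broken = {**config, "pipelines": [
    {"name": "p", "nodes": [{"name": "MyReader", "inputs": ["MissingRetriever"]}]},
]}

print(find_undefined_references(config))  # []
print(find_undefined_references(broken))
```

A check like this can run before the file is handed to the loader, so that naming mistakes surface without starting any models.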

To load the pipeline, call the `load_from_yaml` class method:
``` python
from pathlib import Path
pipeline = Pipeline.load_from_yaml(Path("sample.yaml"))
```
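
The earlier note that a component used by several nodes or pipelines is loaded only once can be pictured with a small registry. This is a toy sketch of the idea, not Haystack's loader: `DummyComponent`, `build_pipelines`, and the pipeline names `query_a` and `query_b` are hypothetical.

```python
from typing import Dict, List


class DummyComponent:
    """Stand-in for an expensive component such as a reader (hypothetical)."""

    instances_created = 0  # counts how many objects were actually built

    def __init__(self, name: str) -> None:
        DummyComponent.instances_created += 1
        self.name = name


def build_pipelines(
    component_names: List[str],
    pipelines: Dict[str, List[str]],
) -> Dict[str, List[DummyComponent]]:
    """Create each component once and let every pipeline reference the same object."""
    registry = {name: DummyComponent(name) for name in component_names}
    # Pipelines hold references to the shared instances, not copies.
    return {
        pipeline_name: [registry[node] for node in nodes]
        for pipeline_name, nodes in pipelines.items()
    }


built = build_pipelines(
    ["MyReader", "MyESRetriever"],
    {
        "query_a": ["MyESRetriever", "MyReader"],
        "query_b": ["MyESRetriever", "MyReader"],
    },
)

print(DummyComponent.instances_created)             # 2
print(built["query_a"][1] is built["query_b"][1])   # True
```

Two pipelines with two nodes each end up with only two component objects, which is the memory saving the YAML format aims for.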

## Conclusion

The possibilities are endless with the `Pipeline` class, and we hope that this tutorial will inspire you
to build custom pipelines that really work for your use case!

## About us

This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany.

We bring NLP to the industry via open source!  
Our focus: industry-specific language models and large-scale QA systems.  
  
Some of our other work: 
- [German BERT](https://deepset.ai/german-bert)
- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)
- [FARM](https://github.com/deepset-ai/FARM)
