# Build LLM applications with **Haystack**

Haystack Concepts we will cover:

- Nodes
- Pipelines
- (Agents)
- (Document-Store)
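
To make the first two concepts concrete before touching Haystack itself, here is a minimal plain-Python sketch of the idea: a node is a callable, and a pipeline runs named nodes in order, feeding each one the output of the node named in `inputs`. `ToyPipeline`, `toy_retriever` and `toy_reader` are hypothetical names invented for this illustration; they are not part of Haystack and leave out everything a real pipeline does beyond chaining.

```python
from typing import Callable, Dict, List, Tuple


class ToyPipeline:
    """Hypothetical stand-in for a pipeline: named nodes, one upstream input each."""

    def __init__(self) -> None:
        # Each entry is (name, component, input_name), kept in insertion order.
        self.nodes: List[Tuple[str, Callable, str]] = []

    def add_node(self, component: Callable, name: str, inputs: List[str]) -> None:
        # This sketch supports exactly one input per node.
        self.nodes.append((name, component, inputs[0]))

    def run(self, query: str) -> Dict[str, object]:
        # "Query" is the name of the root input in this sketch.
        results: Dict[str, object] = {"Query": query}
        for name, component, input_name in self.nodes:
            results[name] = component(results[input_name])
        return results


TOY_CORPUS = ["Cross-Entropy is a loss function.", "Jon Snow is a character."]


def toy_retriever(query: str) -> List[str]:
    # Keep every document that shares at least one word with the query.
    words = query.lower().split()
    return [doc for doc in TOY_CORPUS if any(w in doc.lower() for w in words)]


def toy_reader(documents: List[str]):
    # Return the first retrieved document as the "answer", or None.
    return documents[0] if documents else None


toy_pipe = ToyPipeline()
toy_pipe.add_node(component=toy_retriever, name="Retriever", inputs=["Query"])
toy_pipe.add_node(component=toy_reader, name="Reader", inputs=["Retriever"])
toy_result = toy_pipe.run("loss functions")
print(toy_result["Reader"])
```

The `add_node(component=..., name=..., inputs=[...])` call shape mirrors the one used in the cells below, which is the only part of this sketch taken from the notebook.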

In [1]:
import pprint
pp = pprint.PrettyPrinter(indent=4)

## Creating our Knowledge-Base : Indexing Pipeline

In [2]:
!haystack --version

haystack, version 1.24.1


In [3]:
from haystack.pipelines import Pipeline
from haystack.nodes import Crawler, EmbeddingRetriever
from haystack.document_stores import ElasticsearchDocumentStore
from helper_functions.preprocessor import CustomPreProcessor

# Init documentstore with custom mapping
mapping = {
    "mappings": {
        "properties": {
            "embedding": {"type": "dense_vector", "dims": 384},
            "authors": {"type": "keyword"},
            "title": {"type": "keyword"},
            "date": {
                "type":   "date",
                "format": "dd.MM.yyyy"
            }
        }
    }
}

document_store = ElasticsearchDocumentStore(index="blogs_clean1", custom_mapping=mapping)

# Define nodes
crawler = Crawler(
    urls=["https://www.inovex.de/de/blog/perspective-dialogue-summarization-with-neural-networks/"],   # Websites to crawl
    filter_urls=["https://www.inovex.de/de/blog/"],
    crawler_depth=1,    # Depth of sublinks to follow (1 = also crawl links found on the start URL)
    output_dir="data/blogs_clean1",  # Directory for the crawled files; the files are not used further in this example
)

preprocessor = CustomPreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=True,
    split_by="word",
    split_length=200,
    split_respect_sentence_boundary=True,
    split_overlap=50,
)

retriever = EmbeddingRetriever(
        embedding_model="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        document_store=document_store,
)

# Define pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_node(component=crawler, name="crawler", inputs=['File'])
indexing_pipeline.add_node(component=preprocessor, name="preprocessor", inputs=['crawler'])
indexing_pipeline.add_node(component=retriever, name="EmbeddingRetriever", inputs=["preprocessor"])
indexing_pipeline.add_node(component=document_store, name="document_store", inputs=['EmbeddingRetriever'])

Cannot validate index for custom mappings. Skipping index validation.
Converting files: 100%|██████████| 183/183 [00:01<00:00, 166.41it/s]
Preprocessing:   0%|          | 0/183 [00:00<?, ?docs/s]We found one or more sentences whose split count is higher than the split length.
Preprocessing: 100%|██████████| 183/183 [00:01<00:00, 154.72docs/s]
Batches: 100%|██████████| 50/50 [00:35<00:00,  1.39it/s]


Filled documentstore


In [4]:
docs = indexing_pipeline.run()

Clean https://www.inovex.de/de/blog/perspective-dialogue-summarization-with-neural-networks/
Ignore https://www.inovex.de/de/blog/author/tnguyen/
Clean https://www.inovex.de/de/blog/ki-optimierung-in-der-industrie-intelligentes-service-ticket-system-fuer-die-wartung-teil-2/
Clean https://www.inovex.de/de/blog/explainable-ai-as-a-user-centered-design-approach/
Clean https://www.inovex.de/de/blog/verantwortung-in-ki-gemischten-teams-was-passiert-wenn-die-ki-mitarbeitet/
Ignore https://www.inovex.de/de/blog/


Preprocessing: 100%|██████████| 4/4 [00:00<00:00, 63.92docs/s]
Batches: 100%|██████████| 4/4 [00:03<00:00,  1.29it/s]


In [5]:
# Print all documents that have been created
pp.pprint(docs)

{   'documents': [   <Document: {'content': 'This article will elaborate a method for generating abstractive perspective dialogue summarization. Unlike regular dialogue summarization, perspective summarizations aim to outline the point of view of each participant within a dialogue. This work provides an approach to fit datasets intended for regular dialogue summarization to the task of perspective summarizations. It furthermore presents an architecture that can be a solid foundation for this task.\n\nIntroductionDefining summarizationMonologue summarizationDialogue summarizationPerspective dialogue summarizationEstablished dialogue summarization methodsData pre-processingDialogSum datasetAcquiring perspective summary annotationsCleaning and correcting the labelsAssigning the labels to the corresponding speakerArchitectureMulti-head encoderTrainingLoss functionSetupResultsDiscussion and future workChallengesFuture workConclusion\nIntroduction\nFor centuries humans have been living in a 

In [6]:
print("Document (Snippet) Count:", len(document_store.get_all_documents()))

Document (Snippet) Count: 1705
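
Each stored document is a snippet, not a whole page, because the preprocessor splits every page into word windows. The following is a rough sketch of what `split_by="word"`, `split_length=200` and `split_overlap=50` mean. It is not Haystack's implementation: the helper name `split_words` is made up for this example, and sentence boundaries (`split_respect_sentence_boundary`) are deliberately ignored.

```python
from typing import List


def split_words(text: str, split_length: int = 200, split_overlap: int = 50) -> List[str]:
    """Split text into windows of `split_length` words that share `split_overlap` words."""
    words = text.split()
    step = split_length - split_overlap
    if step <= 0:
        raise ValueError("split_overlap must be smaller than split_length")
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current window already reaches the end of the text.
        if start + split_length >= len(words):
            break
    return chunks


# Synthetic 500-word text, used only to show the window arithmetic.
sample_text = " ".join(f"w{i}" for i in range(500))
sample_chunks = split_words(sample_text)
print(len(sample_chunks), [len(c.split()) for c in sample_chunks])
```

With a step of 150 words, consecutive windows share their last and first 50 words, so a sentence near a boundary is still seen in full by at least one snippet.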


## Question Answering : Query Pipeline

In [7]:
from haystack.nodes import EmbeddingRetriever
from haystack import Pipeline
from haystack.document_stores import ElasticsearchDocumentStore

mapping = {
    "mappings": {
        "properties": {
            "embedding": {"type": "dense_vector", "dims": 384},
            "authors": {"type": "keyword"},
            "title": {"type": "keyword"},
            "date": {
                "type":   "date",
                "format": "dd.MM.yyyy"
            }
        }
    }
}

# Connect documentstore
document_store = ElasticsearchDocumentStore(index="blogs_clean1", custom_mapping=mapping)

# Define nodes
retriever = EmbeddingRetriever(
        embedding_model="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        document_store=document_store,
)

# Define pipeline
query_pipe = Pipeline()
query_pipe.add_node(component=retriever, name="Retriever", inputs=["Query"])    # Searches for relevant `documents`

Cannot validate index for custom mappings. Skipping index validation.


In [8]:
retrieved_docs = query_pipe.run(
    query="Tell me about Loss functions in Machine Learning?", params={"Retriever": {"top_k": 3}}
)

for idx, doc in enumerate(retrieved_docs["documents"]):
    print(f"{idx}. " + doc.content + "\n")

Batches: 100%|██████████| 1/1 [00:00<00:00,  2.86it/s]

0. The significant difference is that we receive two results from the neural network and have two labels for computing the loss. It is important to incorporate both outputs in the training process and to ensure that the outputs and the weights of both new encoder heads, given that the weight initialization is equal for all components, also differ from each other.
During training, we calculated the loss for each single encoder output and then used the maximum of both to punish the model for the worse performing head and thus making the learning process more challenging. We chose the Cross-Entropy loss function and obtained the following equations for the loss:
CE(Y, Ȳ) = –∑yi ·log ȳi
Lk = CE(Yk, Ȳk}),     k ∈ (1, 2)
LE = max(L1, L2)
where CE(Y, Ȳ) is the Cross-Entropy loss function, Lk denotes the loss that the k-th encoder head causes. Note that Lk includes the output from the decoder instead of only the encoder head. 

1. We chose the Cross-Entropy loss function and obtained the follo
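
The retriever ranks snippets by how close their embedding is to the query embedding. A minimal sketch of that ranking step with hand-made 3-dimensional vectors follows. Cosine similarity is an assumption made for this illustration (the similarity function actually used depends on how the document store is configured), and `cosine`, `top_k` and the toy vectors are invented for the example; the real embeddings have 384 dimensions, as declared in the mapping.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equally long vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: List[float], doc_vecs: Dict[str, List[float]], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k document ids with the highest similarity to the query vector."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy embeddings, chosen by hand for illustration only.
toy_vectors = {
    "loss": [0.9, 0.1, 0.0],
    "training": [0.7, 0.3, 0.1],
    "thrones": [0.0, 0.1, 0.9],
}
ranking = top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
print([doc_id for doc_id, _ in ranking])
```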




### Using a Reader model

In [9]:
from haystack.nodes import EmbeddingRetriever, FARMReader
from haystack import Pipeline
from haystack.utils import print_answers
from haystack.document_stores import ElasticsearchDocumentStore

mapping = {
    "mappings": {
        "properties": {
            "embedding": {"type": "dense_vector", "dims": 384},
            "authors": {"type": "keyword"},
            "title": {"type": "keyword"},
            "date": {
                "type":   "date",
                "format": "dd.MM.yyyy"
            }
        }
    }
}

# Connect documentstore
document_store = ElasticsearchDocumentStore(index="blogs_clean1", custom_mapping=mapping)

# Define nodes
retriever = EmbeddingRetriever(
        embedding_model="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        document_store=document_store,
)

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=False)

# Define pipeline
query_pipe = Pipeline()
query_pipe.add_node(component=retriever, name="Retriever", inputs=["Query"])    # Searches for relevant `documents`
query_pipe.add_node(component=reader, name="Reader", inputs=["Retriever"])      # Extract top answers from retrieved documents


Cannot validate index for custom mappings. Skipping index validation.


In [10]:
prediction = query_pipe.run(
    query="Tell me about Loss functions in nerual networks?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
)

print_answers(prediction, details="all")

Batches: 100%|██████████| 1/1 [00:00<00:00,  1.07it/s]
Inferencing Samples: 100%|██████████| 1/1 [00:03<00:00,  3.74s/ Batches]

'Query: Tell me about Loss functions in nerual networks?'
'Answers:'
[   <Answer {'answer': 'two labels', 'type': 'extractive', 'score': 0.3192610740661621, 'context': 'rence is that we receive two results from the neural network and have two labels for computing the loss. It is important to incorporate both outputs i', 'offsets_in_document': [{'start': 91, 'end': 101}], 'offsets_in_context': [{'start': 70, 'end': 80}], 'document_ids': ['433d4507782aed5e9e4a49c16e07df4c'], 'meta': {'url': 'https://www.inovex.de/de/blog/perspective-dialogue-summarization-with-neural-networks/', 'authors': ['Thien Quang Nguyen'], 'date': '14.09.2022', 'title': 'Perspective Dialogue Summarization with Neural Networks', '_split_id': 40, '_split_overlap': [{'doc_id': 'b705f8e514a3f127328414daea753d44', 'range': [0, 362]}, {'doc_id': '193b00b5b855711ca96fde93155c27b1', 'range': [573, 928]}]}}>,
    <Answer {'answer': 'Cross-Entropy loss function and obtained the following equations', 'type': 'extractive', 's




In [11]:
print_answers(prediction, details="minimum")

'Query: Tell me about Loss functions in nerual networks?'
'Answers:'
[   {   'answer': 'two labels',
        'context': 'rence is that we receive two results from the neural '
                   'network and have two labels for computing the loss. It is '
                   'important to incorporate both outputs i'},
    {   'answer': 'Cross-Entropy loss function and obtained the following '
                  'equations',
        'context': 'We chose the Cross-Entropy loss function and obtained the '
                   'following equations for the loss:\n'
                   'CE(Y, Ȳ) = –∑yi ·log ȳi\n'
                   'Lk = CE(Yk, Ȳk}),\xa0\xa0\xa0\xa0 k ∈ (1, 2)\n'
                   'L'},
    {   'answer': 'converging loss curves',
        'context': 'nimized loss on both training and validation sets, '
                   'resulting in converging loss curves that were '
                   'approximating 0. However, the validation and similari'},
    {   'answer': 'embeddings that 

### We can do better: Integrating GPT (replacing the Reader model with GPT)

In [12]:
import os
from dotenv import load_dotenv
from haystack.nodes import PromptModel, PromptNode

load_dotenv("./.env")

api_key = os.environ.get("AZURE_API_KEY")
deployment_name = os.environ.get("AZURE_DEPLOYMENT_NAME")
base_url = os.environ.get("AZURE_BASE_URL")

# Init Model - Connects to Azure
azure_model = PromptModel(
    model_name_or_path="gpt-35-turbo",
    api_key=api_key,
    model_kwargs={
        "azure_deployment_name": deployment_name,
        "azure_base_url": base_url,
    },
)

# Init PromptNode
prompt_node = PromptNode(model_name_or_path=azure_model)

In [13]:
# Example: Test PromptNode

# Construct Message
messages = [{"role": "system", "content": "You are a helpful assistant"}]
messages.append({"role": "user", "content": "Tell me 1 sentence about haystack by deepset?"})

# Call PromptNode -> Calls OpenAI/Azure API
result = prompt_node(messages)
result[0]

'Haystack by deepset is an open-source framework for building search algorithms, with a focus on natural language processing and deep learning.'

In [14]:
from haystack.pipelines import Pipeline
from haystack.nodes import PromptNode, PromptTemplate, AnswerParser
from haystack.nodes import BM25Retriever

document_store = ElasticsearchDocumentStore(index="blogs_clean1", custom_mapping=mapping)

# Create nodes
retriever = BM25Retriever(document_store=document_store)


# PromptTemplate adds additional context to PromptNode
qa_prompt = PromptTemplate(
    prompt="""Given the context, provide a short concise answer to the question.
                Context: {join(documents)};
                Question: {query};
                Answer:""",
    output_parser=AnswerParser(),
)

# Combine Azure Model with Prompt
prompt_node = PromptNode(
    model_name_or_path=azure_model,
    default_prompt_template=qa_prompt
)


# Create Pipeline
inovex_query_pipe = Pipeline()
inovex_query_pipe.add_node(component=retriever, name="Retriever", inputs=["Query"])
inovex_query_pipe.add_node(component=prompt_node, name="PromptNode", inputs=["Retriever"])

Cannot validate index for custom mappings. Skipping index validation.


In [15]:
# Execute pipeline

output = inovex_query_pipe.run(query="Tell me about Loss functions in nerual networks?", params={"Retriever": {"top_k": 3}})
# print(output["answers"][0].answer)
print_answers(output, details="minimum")

'Query: Tell me about Loss functions in nerual networks?'
'Answers:'
[   {   'answer': 'The Cross-Entropy loss function is commonly used in neural '
                  'networks, and it was also used in the training process '
                  'described in the given context. The loss function is used '
                  'to calculate the error between predicted and actual '
                  'outputs, and it is important to incorporate both outputs in '
                  'the training process. The maximum of both outputs is used '
                  'to punish the model for the worse performing head and make '
                  'the learning process more challenging.'}]
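
Conceptually, the prompt template turns the retrieved snippets and the query into a single string that is sent to the model. The sketch below imitates that with ordinary string formatting. It is a simplified stand-in, not Haystack's template engine: `build_prompt` is a hypothetical helper, and joining the snippets with a single space is an assumption.

```python
from typing import List

# Same wording as the prompt above, with a plain placeholder for the joined documents.
TEMPLATE = (
    "Given the context, provide a short concise answer to the question.\n"
    "Context: {context};\n"
    "Question: {query};\n"
    "Answer:"
)


def build_prompt(documents: List[str], query: str) -> str:
    """Join the retrieved snippets and fill them into the template together with the query."""
    context = " ".join(documents)  # assumption: snippets are joined with one space
    return TEMPLATE.format(context=context, query=query)


example_prompt = build_prompt(
    ["We chose the Cross-Entropy loss function.", "LE = max(L1, L2)"],
    "Which loss function was used?",
)
print(example_prompt)
```

Because all retrieved snippets end up in the prompt, the retriever's `top_k` directly controls how much context the model sees and how long the prompt becomes.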


## Multi-turn Conversations : Introducing Agents

### Setup Agent's Tools

- inovex_query_pipe (from before)
- download Game of Thrones dataset
- game_prompt_pipeline (pipeline that queries the 'Game of Thrones' documents)

In [16]:
from haystack.pipelines import Pipeline
import os
from haystack.nodes import Crawler, EmbeddingRetriever, TextConverter, PreProcessor
from haystack.document_stores import ElasticsearchDocumentStore
from helper_functions.preprocessor import CustomPreProcessor
from haystack.utils import fetch_archive_from_http

# Init documentstore with custom mapping
mapping = {
    "mappings": {
        "properties": {
            "embedding": {"type": "dense_vector", "dims": 384},
            "authors": {"type": "keyword"},
            "title": {"type": "keyword"},
            "date": {
                "type":   "date",
                "format": "dd.MM.yyyy"
            }
        }
    }
}

document_store = ElasticsearchDocumentStore(index="blogs_clean1", custom_mapping=mapping)

doc_dir = "data/build_your_first_question_answering_system"
fetch_archive_from_http(
    url="https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt1.zip",
    output_dir=doc_dir,
)

textConverter = TextConverter()
preProcessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=True,
    remove_substrings=None,
    split_by="word",
    split_length=300,                     # Split length depends on the model
    split_respect_sentence_boundary=True, # Don't cut in the middle of a sentence
    split_overlap=0,                      # Overlap between document splits (in words here)
    max_chars_check=15000,
)

retriever = EmbeddingRetriever(
        embedding_model="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        document_store=document_store,
)


# Define pipeline
got_indexing_pipeline = Pipeline()
got_indexing_pipeline.add_node(component=textConverter, name="TextConverter", inputs=["File"])            # .txt-File -> Document class
got_indexing_pipeline.add_node(component=preProcessor, name="PreProcessor", inputs=["TextConverter"])     # Cleans & Splits documents
got_indexing_pipeline.add_node(component=retriever, name="Retriever", inputs=["PreProcessor"])            # Creates Embeddings
got_indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["Retriever"])      # Stores documents


# Execute pipeline
files_to_index = [doc_dir + "/" + f for f in os.listdir(doc_dir)]
got_indexing_pipeline.run(file_paths=files_to_index)

print("Filled documentstore")

Cannot validate index for custom mappings. Skipping index validation.
Converting files: 100%|██████████| 183/183 [00:01<00:00, 164.96it/s]
Preprocessing:   0%|          | 0/183 [00:00<?, ?docs/s]We found one or more sentences whose split count is higher than the split length.
Preprocessing: 100%|██████████| 183/183 [00:01<00:00, 156.24docs/s]
Batches: 100%|██████████| 50/50 [00:34<00:00,  1.43it/s]


Filled documentstore


In [17]:
# Add PromptTemplate & Define Pipeline
from haystack.pipelines import Pipeline
from haystack.nodes import PromptNode, PromptTemplate, AnswerParser, BM25Retriever

# Init database
document_store = ElasticsearchDocumentStore(index="blogs_clean1", custom_mapping=mapping)

# Create nodes
# Create PromptTemplate that adds the retrieved context to the prompt sent to the PromptModel
qa_prompt = PromptTemplate(
    prompt="""Given the context, answer the question in 1 or 2 sentences.
                Context: {join(documents)};
                Question: {query};
                Answer:""",
    output_parser=AnswerParser(),
)

# Combine PromptModel & PromptTemplate
prompt_node = PromptNode(
    model_name_or_path=azure_model,
    default_prompt_template=qa_prompt
)

retriever = BM25Retriever(document_store=document_store)


# Create Pipeline
game_prompt_pipeline = Pipeline()
game_prompt_pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
game_prompt_pipeline.add_node(component=prompt_node, name="PromptNode", inputs=["Retriever"])

Cannot validate index for custom mappings. Skipping index validation.


### Initialize Agent

In [18]:
from haystack.agents.base import Tool
from haystack.agents.conversational import ConversationalAgent
from haystack.agents.memory import ConversationSummaryMemory

inovex_blog_crawler_tool = Tool(
    name="inovex_blog_crawler",
    pipeline_or_node=inovex_query_pipe,
    description="useful for when you need to find content from the inovex blog", # agent uses this for its decision!
    output_variable="answers",
)

got_qa_tool = Tool(
    name="games_of_thrones_QA",
    pipeline_or_node=game_prompt_pipeline,
    description="useful for when you need to answer questions about Game of Thrones",
    output_variable="answers",
)

tools = [inovex_blog_crawler_tool, got_qa_tool]

In [19]:

conversational_agent_prompt_node = PromptNode(
    model_name_or_path=azure_model,
    max_length=256,
    top_k=2,
    stop_words=["Observation:"], # ReAct: stop generating here so the tool, not the model, supplies the observation
    model_kwargs={"temperature": 0.5, "top_p": 0.9}
)

memory = ConversationSummaryMemory(conversational_agent_prompt_node, summary_frequency=2)

zero_shot_agent_template = PromptTemplate("deepset/zero-shot-react")

agent = ConversationalAgent(
    prompt_node=conversational_agent_prompt_node, prompt_template=zero_shot_agent_template, tools=tools, memory=memory
)




In [20]:
res_crawl = agent.run("What can you tell me about loss functions?")


Agent deepset/zero-shot-react started with {'query': 'What can you tell me about loss functions?', 'params': None}


The 'transcript' parameter is missing from the Agent's prompt template. All ReAct agents that go through multiple steps to reach a goal require this parameter. Please append {transcript} to the end of the Agent's prompt template to ensure its proper functioning. A temporary prompt template with {transcript} appended will be used for this run.


Tool: inovex_blog_crawler
Tool Input: "loss functions" 
Tool: inovex_blog_crawler
Tool Input: "loss functions" 
Observation: The chosen loss function is the Cross-Entropy loss function, and the loss is calculated separately for each encoder head. The maximum of the two losses is then used to compute the final loss. Additionally, a similarity score is calculated using cosine similarity and multiplied with a penalizing parameter to penalize the learning process if the outputs are too similar.
Thought: Final Answer: The Cross-Entropy loss function is used and the

In [21]:
pp.pprint(res_crawl)

{   'answers': [   <Answer {'answer': 'The Cross-Entropy loss function is used and the similarity score is penalized if the outputs are too similar.Final Answer: The Cross-Entropy loss function is used, and a similarity score is calculated using cosine similarity.', 'type': 'generative', 'score': None, 'context': None, 'offsets_in_document': None, 'offsets_in_context': None, 'document_ids': None, 'meta': {}}>],
    'query': 'What can you tell me about loss functions?',
    'transcript': 'Tool: inovex_blog_crawler\n'
                  'Tool Input: "loss functions" \n'
                  'Tool: inovex_blog_crawler\n'
                  'Tool Input: "loss functions"\n'
                  'Observation: The chosen loss function is the Cross-Entropy '
                  'loss function, and the loss is calculated separately for '
                  'each encoder head. The maximum of the two losses is then '
                  'used to compute the final loss. Additionally, a similarity '
           

In [22]:
res_crawl['answers'][0].answer

'The Cross-Entropy loss function is used and the similarity score is penalized if the outputs are too similar.Final Answer: The Cross-Entropy loss function is used, and a similarity score is calculated using cosine similarity.'
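
The transcript shows the text protocol the agent relies on: the model writes `Tool:` and `Tool Input:` lines, the tool result is added as `Observation:`, and the run ends with `Final Answer:`. The sketch below parses such a transcript and cleans up an answer that still contains the marker, as the one above does. The regular expression and the helper names `parse_tool_calls` and `last_final_answer` are assumptions made for this illustration; Haystack's own parsing may work differently.

```python
import re
from typing import List, Tuple

# Matches a "Tool: <name>" line followed by a 'Tool Input: "<text>"' line.
TOOL_CALL = re.compile(r'Tool:\s*(?P<tool>\S+)\s*\nTool Input:\s*"(?P<input>[^"]*)"')


def parse_tool_calls(transcript: str) -> List[Tuple[str, str]]:
    """Return (tool name, tool input) pairs in the order they appear."""
    return [(m.group("tool"), m.group("input")) for m in TOOL_CALL.finditer(transcript)]


def last_final_answer(text: str, marker: str = "Final Answer:") -> str:
    """Keep only the text after the last marker; return the text unchanged if it is absent."""
    return text.rsplit(marker, 1)[-1].strip()


# Transcript lines copied from the output above; the observation is abbreviated.
transcript = (
    'Tool: inovex_blog_crawler\n'
    'Tool Input: "loss functions" \n'
    'Tool: inovex_blog_crawler\n'
    'Tool Input: "loss functions"\n'
    'Observation: The chosen loss function is the Cross-Entropy loss function.'
)
raw_answer = (
    "The Cross-Entropy loss function is used and the similarity score is penalized "
    "if the outputs are too similar.Final Answer: The Cross-Entropy loss function is "
    "used, and a similarity score is calculated using cosine similarity."
)

tool_calls = parse_tool_calls(transcript)
clean_answer = last_final_answer(raw_answer)
print(tool_calls)
print(clean_answer)
```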

In [23]:
res_got = agent.run("Who is the Son of Eddard?")


Agent custom-at-query-time started with {'query': 'Who is the Son of Eddard?', 'params': None}
know which Eddard we are talking about.
Tool: games_of_thrones_QA
Tool Input: "Who is the father of Jon Snow?"
know whether this is a Game of Thrones question or not.
Tool: games_of_thrones_QA
Tool Input: "Who is the Son of Eddard?"
Observation: Eddard "Ned" Stark is the father of Jon Snow.
Thought: Now

In [24]:
res_got['answers'][0].answer

'Jon Snow.So the answer is Jon Snow.'