# Setup

In [4]:
import os
import json

from dotenv import load_dotenv

from IPython.display import Markdown as md

from haystack.nodes.prompt import PromptNode, PromptTemplate

In [5]:
load_dotenv()

True

In [6]:
openai_key = os.environ.get("OPENAI_KEY")
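
`os.environ.get` returns `None` when the variable is missing, so a forgotten `.env` entry only shows up later as a harder-to-trace error. A minimal sketch of a fail-fast lookup follows; the helper name `require_env` is made up for this example and is not part of Haystack or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        # Hypothetical helper: fail early instead of passing None further down.
        raise RuntimeError(f"Environment variable {name!r} is not set; add it to your .env file.")
    return value
```

With this in place, the cell above could read `openai_key = require_env("OPENAI_KEY")`.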

# What we'll build

We will build our personal assistant using OpenAI's `text-davinci-003` model.

We proceed step by step, only giving some code as a starter and then letting the model build up the whole system on its own.

In the end, we'll walk away with a retrieval-augmented text and code generation pipeline that you can easily customize to your own data.

Let's start!

# PromptNode

First, we'll initialize a `PromptNode` using the `text-davinci-003` model.

The `PromptNode` is Haystack's abstraction around large language models (LLMs).

It provides a unified interface to many open source and commercial LLMs.

Find the full `PromptNode` documentation here: [https://docs.haystack.deepset.ai/docs/prompt_node](https://docs.haystack.deepset.ai/docs/prompt_node)

In [100]:
model_name = "text-davinci-003"

# The prompt template controls how instructions (aka prompts) are passed to the PromptNode.
# Its placeholder syntax, similar to Python f-strings, lets us enrich the user's instructions with the outputs
# of other nodes that are used with the PromptNode in a pipeline.
# Here, we directly pass the user instruction (query) to the PromptNode without adding any other extra data.
# See https://docs.haystack.deepset.ai/docs/prompt_node#prompttemplates for detailed documentation on prompt templates
prompt_template = PromptTemplate("direct", "{query}")


prompt_node = PromptNode(model_name_or_path=model_name, api_key=openai_key, max_length=700, default_prompt_template=prompt_template)
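
To make the placeholder idea concrete, here is a plain-Python sketch of what filling a template amounts to. It uses `str.format` as a stand-in and is not how Haystack implements `PromptTemplate`; the function name `fill_template` and the second template are assumptions made up for illustration.

```python
def fill_template(prompt_text: str, **values: str) -> str:
    """Substitute {placeholders} in prompt_text with the given values."""
    # str.format is only a stand-in here; literal braces in the template
    # would need to be escaped as {{ and }}.
    return prompt_text.format(**values)


# The "direct" template from the cell above passes the query through unchanged.
direct = "{query}"
print(fill_template(direct, query="Write a haiku about search."))

# A hypothetical template that wraps the query in a fixed instruction.
brief = "Answer in one sentence: {query}"
print(fill_template(brief, query="What is a PromptNode?"))
```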

# Getting some initial context

As a first step, I've copied the documentation for our `Crawler` component.

Given a list of URLs, the `Crawler` goes and scrapes text from them.

We can then transform the text into `Document`s and use a `Pipeline` to store these documents in a `DocumentStore`.

A `DocumentStore` makes these documents accessible for later use. You can also embed the documents as vectors before writing them to the `DocumentStore`.

We have a range of `DocumentStore`s to choose from, including:

- `FAISSDocumentStore`
- `WeaviateDocumentStore`
- `QdrantDocumentStore`
- `ElasticsearchDocumentStore`

We will use a simple `InMemoryDocumentStore` for this demo.

In [101]:
crawler_help = """
Crawler
The Crawler scrapes the text from a website and creates a Document object out of it. For example, you can use the Crawler to turn the contents of a website into Documents to use for search.

Suggest Edits
Position in a Pipeline	At the very beginning of an indexing Pipeline
Input	Files
Output	Documents
Classes	Crawler
Usage
To use a Crawler on its own, run:

Python

from haystack.nodes import Crawler

crawler = Crawler(output_dir="crawled_files") # This tells the Crawler where to store the crawled files
docs = crawler.crawl(
    urls=["https://haystack.deepset.ai/docs/get-started"], # This tells the Crawler which URLs to crawl
    filter_urls=["haystack"], # Here, you can pass regular expressions that the crawled URLs must comply with
    crawler_depth=1 # This tells the Crawler to follow only the links that it finds on the initial URLs
)
Example Script
This script shows you how to use a Crawler in a pipeline.

Python

################################################################################
#                                                                              #
#             An Example of a Pipeline Using Crawler                           #
#                                                                              #
#  NOTE: You need a running Elasticsearch container for this to work.          #
#  If you don't have one, exchange ElasticsearchDocumentStore for another      #
#  document store, like SQLDocumentStore or InMemoryDocumentStore. Bear in     #
#  mind though that the code wasn't tested on them and you might encounter      #
#  errors.                                                                  #
#                                                                              #
################################################################################

from haystack.pipelines import Pipeline
from haystack.nodes import Crawler, PreProcessor, BM25Retriever, FARMReader
from haystack.document_stores import InMemoryDocumentStore


# Create the document store. You need it to:
#  1. Store the documents you crawled and preprocessed (with an indexing pipeline).
#  2. Extract the documents that contain the answer to your question (with a query pipeline).
document_store = InMemoryDocumentStore(use_bm25=True)


#
# Step 1: Get the data, clean it, and store it.
#

# NOTE: Run this code just once, every time you create a new Elasticsearch container. Comment it out afterwards.

# Let's create the indexing pipeline. It will contain:
#  1. A Crawler node that fetches text from a website.
#  2. A PreProcessor that makes the documents friendly to the Retriever
#  3. The DocumentStore that receives the documents and stores them.

crawler = Crawler(
    urls=["https://haystack.deepset.ai"],   # Websites to crawl
    crawler_depth=1,    # How many links to follow
    output_dir="crawled_files",  # The directory to store the crawled files, not very important, we don't use the files in this example
)
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=500,
    split_respect_sentence_boundary=True,
)
indexing_pipeline = Pipeline()
indexing_pipeline.add_node(component=crawler, name="crawler", inputs=['File'])
indexing_pipeline.add_node(component=preprocessor, name="preprocessor", inputs=['crawler'])
indexing_pipeline.add_node(component=document_store, name="document_store", inputs=['preprocessor'])

indexing_pipeline.run()


#
# Step 2: Use the data to answer questions.
#

# NOTE: You can run this code as many times as you like.

# Let's create a query pipeline. It will contain:
#  1. A Retriever that gets the relevant documents from the DocumentStore.
#  2. A Reader that locates the answers inside the documents.
retriever = BM25Retriever(document_store=document_store)
reader =  FARMReader(model_name_or_path="deepset/roberta-base-squad2-distilled")

query_pipeline = Pipeline()
query_pipeline.add_node(component=retriever, name="retriever", inputs=["Query"])
query_pipeline.add_node(component=reader, name="reader", inputs=["retriever"])

results = query_pipeline.run(query="What can I use Haystack for?")

print("\nQuestion: ", results["query"])
print("\nAnswers:")
for answer in results["answers"]:
    print("- ", answer.answer)
print("\n\n")
"""

In [102]:
urls = [
    'https://haystack.deepset.ai/tutorials/22_pipeline_with_promptnode',
    'https://docs.haystack.deepset.ai/docs/pipelines',
    'https://docs.haystack.deepset.ai/docs/ready_made_pipelines',
    'https://docs.haystack.deepset.ai/docs/retriever',
    'https://docs.haystack.deepset.ai/docs/agent',
    'https://docs.haystack.deepset.ai/docs/intro',
    'https://docs.haystack.deepset.ai/docs/prompt-engineering-guidelines',
    'https://docs.haystack.deepset.ai/docs/prompt_node'
]

# Our first instruction

We will now prompt our model for the first time. We pass the crawler documentation as context and ask it to write code that crawls additional Haystack documentation pages and indexes them into a document store.

In [103]:
query = f"Given this documentation: {crawler_help} ### Write a crawler node that crawls the following list of urls: {json.dumps(urls)}. Links should not be followed (depth 0). It should index these documents in an indexing pipeline. In the preprocessor, split it into 600 word chunks. The document store should use bm25. Do not use comments. Wrap the code in a code block like this: ```python\n\n"

In [104]:
out, _ = prompt_node.run(query=query)
generated_code = out["results"][0]

Let's inspect that generated code before we run it!

In [106]:
md(generated_code)

```python
from haystack.pipelines import Pipeline
from haystack.nodes import Crawler, PreProcessor, BM25Retriever, FARMReader
from haystack.document_stores import InMemoryDocumentStore


# Create the document store. You need it to:
#  1. Store the documents you crawled and preprocessed (with an indexing pipeline).
#  2. Extract the documents that contain the answer to your question (with a query pipeline).
document_store = InMemoryDocumentStore(use_bm25=True)


# Let's create the indexing pipeline. It will contain:
#  1. A Crawler node that fetches text from a website.
#  2. A PreProcessor that makes the documents friendly to the Retriever
#  3. The DocumentStore that receives the documents and stores them.

crawler = Crawler(
    urls=["https://haystack.deepset.ai/tutorials/22_pipeline_with_promptnode", 
          "https://docs.haystack.deepset.ai/docs/pipelines", 
          "https://docs.haystack.deepset.ai/docs/ready_made_pipelines", 
          "https://docs.haystack.deepset.ai/docs/retriever", 
          "https://docs.haystack.deepset.ai/docs/agent",
          "https://docs.haystack.deepset.ai/docs/intro",
          "https://docs.haystack.deepset.ai/docs/prompt-engineering-guidelines",
          "https://docs.haystack.deepset.ai/docs/prompt_node"], 
    crawler_depth=0,
    output_dir="crawled_files",
)
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=600,
    split_respect_sentence_boundary=True,
)
indexing_pipeline = Pipeline()
indexing_pipeline.add_node(component=crawler, name="crawler", inputs=['File'])
indexing_pipeline.add_node(component=preprocessor, name="preprocessor", inputs=['crawler'])
indexing_pipeline.add_node(component=document_store, name="document_store", inputs=['preprocessor'])

indexing_pipeline.run()
```
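Reading the code is the main safeguard, but a quick static check can back it up. The sketch below uses the standard library's `ast` module to confirm that the text parses as Python and to list the modules it imports, without executing anything. The function name `imported_modules` is an assumption for this example, and passing this check does not make `exec` safe.

```python
import ast


def imported_modules(source: str) -> set:
    """Parse `source` (raises SyntaxError if invalid) and return the modules it imports."""
    tree = ast.parse(source)
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module)
    return modules


sample = "from haystack.pipelines import Pipeline\nimport os\n"
print(sorted(imported_modules(sample)))
```

An unexpected entry in the result, such as `subprocess`, would be a good reason to read the generated code more closely before running it.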

# Code looks great!

`text-davinci-003` followed our instructions closely, although it kept the comments we asked it to omit. We can now run the code to crawl these additional documentation pages and index all of them.

In [107]:
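# Slice off the 9-character opening fence marker and the 3-character closing fence before executing.
exec(generated_code[9:-3])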



Current google-chrome version is 112.0.5615
Get LATEST chromedriver version for 112.0.5615 google-chrome
Driver [/Users/mathislucka/.wdm/drivers/chromedriver/mac64/112.0.5615.49/chromedriver] found in cache


Preprocessing:   0%|          | 0/8 [00:00<?, ?docs/s]

Document f7e9043805e9db17fda21984ff562d8a is 11133 characters long after preprocessing, where the maximum length should be 10000. Something might be wrong with the splitting, check the document affected to prevent issues at query time.
Document d6452a2a65003120e9fc6a157969b788 is 10870 characters long after preprocessing, where the maximum length should be 10000. Something might be wrong with the splitting, check the document affected to prevent issues at query time.
Document f32960edaf96d027fc02be5ca7a9bb61 is 10281 characters long after preprocessing, where the maximum length should be 10000. Something might be wrong with the splitting, check the document affected to prevent issues at query time.
Document 355e687bb55d3a8b5e316dc71e2ae5 is 10050 characters long after preprocessing, where the maximum length should be 10000. Something might be wrong with the splitting, check the document affected to prevent issues at query time.


Updating BM25 representation...:   0%|          | 0/32 [00:00<?, ? docs/s]
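
The warnings above report chunks longer than 10,000 characters even though the split length is given in words. One plausible reason is that word-based splitting bounds the number of words per chunk, not the number of characters, so chunks full of long tokens such as URLs or code can still be large. The sketch below shows plain word-based chunking; it ignores sentence boundaries, which the real `PreProcessor` is configured to respect, and the function name `split_by_words` is made up for illustration.

```python
def split_by_words(text: str, split_length: int) -> list:
    """Split text into chunks of at most `split_length` whitespace-separated words."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + split_length])
        for start in range(0, len(words), split_length)
    ]


text = " ".join(f"word{i}" for i in range(1400))
chunks = split_by_words(text, split_length=600)
print([len(chunk.split()) for chunk in chunks])  # [600, 600, 200]
```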

In [108]:
from haystack.nodes import BM25Retriever

retriever = BM25Retriever(document_store=document_store, top_k=2)
results = retriever.retrieve(query="How to use the PromptNode in a Pipeline?")
helper = results[0].content
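
For intuition about what a BM25 retriever ranks by, here is a small sketch of Okapi BM25 scoring in plain Python. It is not the implementation Haystack uses: the whitespace tokenization, the parameter defaults `k1=1.5` and `b=0.75`, and the function name `bm25_scores` are all assumptions for this example, and it expects a non-empty list of non-empty documents.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list, k1: float = 1.5, b: float = 0.75) -> list:
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    avg_len = sum(len(tokens) for tokens in tokenized) / len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (len(docs) - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "the promptnode can be used in a pipeline",
    "the crawler scrapes text from websites",
    "document stores hold documents for retrieval",
]
scores = bm25_scores("promptnode pipeline", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Taking the highest-scoring document mirrors what `results[0]` does in the cell above.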

In [133]:
query = f"Using this snippet from the documentation: {helper} ### Initialize a pipeline 'pipe_v2' with your own PromptTemplate 'template' and PromptNode 'pn'. The template will answer questions about technical documentation using the provided documents as context. Include the following in the template:\n- questions should be answered in the style of technical documentation\n- example code snippets should be provided if needed\n- code should be formatted as ```python <code> ```\n\n. Make sure none of the instructions are missing.\n\nUse the existing retriever named `retriever` in the pipeline. Initialize the PromptNode with the 'gpt-3.5-turbo' model. The `api_key` for the PromptNode is stored in `openai_key`. PromptNode should be initialized with a max_length of 700. ```python\n\n"

In [134]:
out, _ = prompt_node.run(query=query)
generated_code = out['results'][0]

In [135]:
md("```python\n\n" + generated_code)

```python

from haystack.nodes import PromptNode, PromptTemplate

# Define the template
template = PromptTemplate(
    name="technical_doc_prompt",
    prompt_text="""
Answer the given question using the provided context. Your answer should be in the style of technical documentation and should provide example code snippets if necessary. Format code as ```python <code> ```

Context: {join(documents)}\n\nQuestion: {query}\n\nAnswer:""",
)

# Initialize the PromptNode 
pn = PromptNode(model_name_or_path="gpt-3.5-turbo", default_prompt_template=template, api_key=openai_key, max_length=700)

# Initialize the pipeline
pipe_v2 = Pipeline()

# Add the retriever
pipe_v2.add_node(component=retriever, name="retriever", inputs=["Query"])

# Add the PromptNode
pipe_v2.add_node(component=pn, name="prompt_node", inputs=["retriever"])
```
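The generated template combines the retrieved documents and the question into one prompt. Conceptually, that step looks like the sketch below. This is a plain-Python illustration rather than Haystack's own logic; the function name `build_rag_prompt`, the single-space delimiter, and the sample documents are assumptions.

```python
def build_rag_prompt(query: str, documents: list, delimiter: str = " ") -> str:
    """Join retrieved document texts and insert them, with the query, into a prompt."""
    context = delimiter.join(documents)
    return f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"


# Hypothetical retrieved snippets standing in for the retriever's output.
retrieved = ["PromptNode wraps an LLM.", "A Pipeline connects nodes."]
print(build_rag_prompt("What is a PromptNode?", retrieved))
```

Because the retriever was created with `top_k=2`, at most two documents would end up in the context for each question.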

In [136]:
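# The prompt already ended with the opening fence, so only the 3-character closing fence is sliced off.
exec(generated_code[:-3])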
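
Both `exec` cells strip the code fence with hard-coded slices (`[9:-3]` and `[:-3]`), which break as soon as the model adds a trailing newline or omits the opening fence. A more forgiving alternative is sketched below. The helper name `strip_code_fence` is made up for this example, and it assumes an opening fence, if present, sits on its own line.

```python
import re

FENCE = "`" * 3  # three backticks, spelled this way so the example nests cleanly

# Optional opening fence with a language tag, on its own line at the start.
_OPENING = re.compile(r"^\s*" + FENCE + r"[\w-]*[ \t]*\n")
# Optional closing fence at the very end, followed only by whitespace.
_CLOSING = re.compile(r"\n?[ \t]*" + FENCE + r"\s*$")


def strip_code_fence(text: str) -> str:
    """Remove an optional opening fence (with language tag) and closing fence."""
    text = _OPENING.sub("", text, count=1)
    text = _CLOSING.sub("", text, count=1)
    return text.strip("\n")


print(strip_code_fence(FENCE + "python\nx = 1\n" + FENCE + "\n"))
```

With this helper, either cell could call `exec(strip_code_fence(generated_code))` regardless of whether the output starts with a fence.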

In [140]:
res = pipe_v2.run(query="What is haystack?")

generated_docs = res["results"][0]

md(generated_docs)

Haystack is a Python library that provides a modular and extensible interface for building end-to-end search pipelines. It includes components for document retrieval, question answering, summarization, and more. The library also provides a PromptNode class, which allows you to easily integrate natural language prompts into your pipeline and use various language models for generating answers. The library is designed to be flexible and customizable, so you can easily configure it to suit your specific use case. 

Example code snippets for using PromptNode and creating custom prompts are provided in the context section above.

In [147]:
res = pipe_v2.run(query="How do I use the prompt node with open source models?")

generated_docs = res["results"][0]

In [148]:
md(generated_docs)

To use the PromptNode with open source models, you can specify the model's name when initializing the PromptModel or PromptNode. For example, to use the flan t5 base model, you can initialize PromptNode like this:

```python
from haystack.nodes import PromptModel, PromptNode

prompt_model = PromptModel(model_name_or_path="google/flan-t5-base")
prompt_node = PromptNode(prompt_model)
```

You can replace "google/flan-t5-base" with the name of any open source model that you want to use. You can also specify an API key if necessary. For example, if you are using Hugging Face's API for models, you can initialize the model like this:

```python
prompt_model = PromptModel(model_name_or_path="model_name", api_key="your_api_key")
```