<a href="https://colab.research.google.com/github/nickprock/appunti_data_science/blob/master/semantic-search/advent-of-haystack/Advent_of_Haystack_Prompt_Engineering_Challenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advent of Haystack - Day 3
_Make a copy of this Colab to start!_

Here, you're given a nearly complete RAG pipeline that does question answering (QA) over the contents of a number of URLs. The aim is to create a [`PromptBuilder`](https://docs.haystack.deepset.ai/v2.0/docs/promptbuilder) whose template produces answers with references to where each answer comes from.

1. **Run the indexing pipeline:** This is already complete. Here, we write the contents of various Haystack documentation pages into an `InMemoryDocumentStore`. We also create embeddings for our documents with a `SentenceTransformersDocumentEmbedder`.
2. **Your task is to complete step 2 👇**

# Installation
**Note:** There is a known issue with Colab: a version conflict related to `llmx`, which comes preinstalled in Colab. You might get an `llmx` error. You can safely ignore it, or run `pip uninstall -y llmx`.

In [None]:
!pip install haystack-ai
!pip install boilerpy3
!pip install transformers accelerate bitsandbytes sentence_transformers



## 1) Write Documents to InMemoryDocumentStore

Here, we write the contents of a few URLs into an `InMemoryDocumentStore`.

In [None]:
from haystack import Pipeline
from haystack.document_stores import InMemoryDocumentStore
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.converters import HTMLToDocument
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter


document_store = InMemoryDocumentStore()

link_fetcher = LinkContentFetcher()
converter = HTMLToDocument()
splitter = DocumentSplitter(split_length=100, split_overlap=5)
embedder = SentenceTransformersDocumentEmbedder()
writer = DocumentWriter(document_store=document_store)

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("link_fetcher", link_fetcher)
indexing_pipeline.add_component("converter", converter)
indexing_pipeline.add_component("splitter", splitter)
indexing_pipeline.add_component("embedder", embedder)
indexing_pipeline.add_component("writer", writer)

indexing_pipeline.connect("link_fetcher", "converter")
indexing_pipeline.connect("converter", "splitter")
indexing_pipeline.connect("splitter", "embedder")
indexing_pipeline.connect("embedder", "writer")

In [None]:
indexing_pipeline.run(data={"link_fetcher":{"urls": ["https://docs.haystack.deepset.ai/v2.0/docs/sentencetransformerstextembedder", "https://docs.haystack.deepset.ai/v2.0/docs/openaidocumentembedder"]}})

{'writer': {'documents_written': 7}}
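
The splitter above is configured with `split_length=100` and `split_overlap=5`. As a rough illustration of what word windows with overlap look like, here is a minimal pure-Python sketch. The helper `split_words` is hypothetical and is not Haystack's implementation; `DocumentSplitter` may handle whitespace and edge cases differently.

```python
from typing import List


def split_words(text: str, split_length: int = 100, split_overlap: int = 5) -> List[str]:
    """Split text into windows of `split_length` words.

    Consecutive windows share `split_overlap` words.
    Hypothetical helper for illustration only.
    """
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current window already reaches the end of the text.
        if start + split_length >= len(words):
            break
    return chunks


# A synthetic 250-word text: w0 w1 ... w249
text = " ".join(f"w{i}" for i in range(250))
chunks = split_words(text)
print(len(chunks))               # number of chunks produced
print(chunks[1].split()[:5])     # first words of chunk 2 repeat the end of chunk 1
```

With a small overlap, a sentence that straddles a chunk boundary still appears partly in both neighbouring chunks, which can help retrieval.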

## 2) Build a RAG Pipeline
Here, we have provided a nearly complete RAG pipeline, but the `PromptBuilder` is missing. Create one and add it to the pipeline. Make sure your `PromptBuilder` is able to use the `url` from each document's metadata. That way, you can ask for a response that includes references!
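
To see why the URL has to appear in the prompt text itself, here is a minimal sketch of a rendered prompt in which each document is labelled with its source. It uses plain Python instead of a Jinja template, and the `Doc` class is a hypothetical stand-in for a Haystack document. The attribute name `meta` and the `"unknown"` fallback are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document: text plus metadata."""
    content: str
    meta: dict = field(default_factory=dict)


def build_prompt(documents, query):
    """Render a prompt in which every document is labelled with its source URL."""
    lines = [
        "Given these documents, answer the question.",
        "Cite the documents using Document[url] notation.",
        "Documents:",
    ]
    for doc in documents:
        # Fall back to a placeholder when a document carries no URL (assumption).
        url = doc.meta.get("url", "unknown")
        lines.append(f"Document[{url}]: {doc.content}")
    lines.append(f"Question: {query}")
    lines.append("Answer:")
    return "\n".join(lines)


docs = [
    Doc("Example text about embedders.", {"url": "https://example.com/a"}),
    Doc("Example text without a source."),
]
prompt = build_prompt(docs, "How do I use the embedder?")
print(prompt)
```

If the URL never reaches the prompt, the model has nothing to cite and may invent a reference.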


In [None]:
from getpass import getpass

api_key = getpass("Enter OpenAI Api key: ")

Enter OpenAI Api key: ··········


In [None]:
import torch

from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers import InMemoryEmbeddingRetriever
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.generators import GPTGenerator

######## Complete this section #############
# Each document is prefixed with its source URL (taken from the document's
# metadata) so that the model has a real URL to cite.
prompt_template = """
    Given these documents, answer the question. Cite the documents using Document[url] notation. If multiple documents contain the answer, cite those documents like ‘as stated in Document[url], Document[url], etc.’. \nDocuments:
    {% for doc in documents %}
        Document[{{ doc.meta['url'] }}]:
        {{ doc.content }}
    {% endfor %}

    \nQuestion: {{query}}
    \nAnswer:
    """
prompt_builder = PromptBuilder(prompt_template)
############################################
query_embedder = SentenceTransformersTextEmbedder()
retriever = InMemoryEmbeddingRetriever(document_store=document_store, top_k=2)
llm = GPTGenerator(api_key=api_key)

Haystack is model-agnostic, which means you can easily switch between different model providers. For example, instead of using an OpenAI model via an API, you can try an open source model running in this Colab notebook. You can replace the `llm` with the one below. This might take up more resources in Colab. You might notice that models don't perform the same way, which can mean you need to change your prompt. It's OK to change the task from referenced QA to something else. For example, we're also happy with a poem about the Haystack docs 🤗
```python
from haystack.components.generators import HuggingFaceLocalGenerator
llm = HuggingFaceLocalGenerator("HuggingFaceH4/zephyr-7b-beta",
                                 huggingface_pipeline_kwargs={"device_map":"auto",
                                               "model_kwargs":{"load_in_4bit":True,
                                                "bnb_4bit_use_double_quant":True,
                                                "bnb_4bit_quant_type":"nf4",
                                                "bnb_4bit_compute_dtype":torch.bfloat16}},
                                 generation_kwargs={"max_new_tokens": 350})
llm.warm_up()
```

In [None]:
pipeline = Pipeline()
pipeline.add_component(instance=query_embedder, name="query_embedder")
pipeline.add_component(instance=retriever, name="retriever")
pipeline.add_component(instance=prompt_builder, name="prompt_builder")
pipeline.add_component(instance=llm, name="llm")

pipeline.connect("query_embedder.embedding", "retriever.query_embedding")
pipeline.connect("retriever.documents", "prompt_builder.documents")
pipeline.connect("prompt_builder", "llm")


In [None]:
query = "How do I use the openai embedder?"
result = pipeline.run(data={"query_embedder": {"text": query}, "prompt_builder": {"query": query}})
print(result['llm']['replies'][0])

To use the OpenAI Embedder, you can follow these steps:

1. Import the necessary modules and classes in Python:

from haystack import Document
from haystack.components.embedders import OpenAIDocumentEmbedder

2. Create a Document object with the text you want to embed:

doc = Document(text="some text",
               metadata={"title": "relevant title",
                         "page number": 18})

3. Initialize the OpenAIDocumentEmbedder object and specify the metadata fields you want to embed (if any):

embedder = OpenAIDocumentEmbedder(metadata_fields_to_embed=["title"])

4. Use the run() method of the embedder to embed the document:

docs_w_embeddings = embedder.run(documents=[doc])["documents"]

You can find more information in Document[https://huggingface.co/prajjwal1/bert-tiny-random].
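
A model can cite a URL that was never among the retrieved documents, so it can be worth checking the references in a reply. The following is a minimal sketch, assuming the `Document[url]` notation requested in the prompt. The helpers `cited_urls` and `unsupported_citations` are hypothetical, and the list of source URLs is taken from the indexing step above. The URLs stored in the documents' metadata may differ, for example after redirects.

```python
import re


def cited_urls(reply):
    """Return the URLs cited as Document[url] in a reply, in order of appearance."""
    return re.findall(r"Document\[(https?://[^\]\s]+)\]", reply)


def unsupported_citations(reply, source_urls):
    """Return cited URLs that are not among the known source URLs."""
    known = set(source_urls)
    return [url for url in cited_urls(reply) if url not in known]


# URLs passed to the indexing pipeline in step 1.
sources = [
    "https://docs.haystack.deepset.ai/v2.0/docs/sentencetransformerstextembedder",
    "https://docs.haystack.deepset.ai/v2.0/docs/openaidocumentembedder",
]
reply = (
    "You can find more information in "
    "Document[https://huggingface.co/prajjwal1/bert-tiny-random]."
)
print(unsupported_citations(reply, sources))
```

Any URL this check returns does not match an indexed page and should be treated with caution.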
