<a href="https://colab.research.google.com/github/muffafa/advent-of-haystack-2024-2025-solutions/blob/main/SOLUTION_Advent_of_Haystack_Multi_Query_Retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advent of Haystack: Day 3

In this challenge, you must help Elf David build a system to answer questions over a BBC news dataset. However, to increase recall, your system should be able to generate questions similar to the ones asked.

For instance, if Santa asks, `"How are cybersecurity threats evolving with new technologies?"` the system should be able to generate similar questions like:

- `"What impact do emerging technologies like AI and IoT have on the landscape of cybersecurity threats?"`
- `"How are organizations adapting their cybersecurity strategies in response to the evolution of threats driven by technological advancements?"`
- `"In what ways are cybercriminals leveraging new technologies to enhance their attack methods and tactics?"`

These questions are similar to the original one but not identical. Querying with several phrasings increases recall, because each phrasing can retrieve documents that the original wording alone would miss but that could still contain the answer.
To do that, you will use a large language model (LLM) to generate alternative questions based on the original question.
Each alternative question will query a document store of news articles, and all the documents retrieved across these questions will be used to compose an answer to the original question.
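
As a rough illustration of this flow, here is a toy sketch in plain Python. It is not the challenge solution: `TOY_DOCS`, `keyword_retrieve`, and the hand-written query variations are hypothetical stand-ins for the document store, the embedding retriever, and the LLM output.

```python
from typing import Dict, List

# Hypothetical mini corpus standing in for the news articles.
TOY_DOCS: Dict[str, str] = {
    "d1": "spyware and phishing attacks target home users",
    "d2": "new technologies change how criminals build malware",
    "d3": "record sales rise for the music industry",
}


def keyword_retrieve(query: str, top_k: int = 2) -> List[str]:
    """Rank documents by the number of words shared with the query (toy scoring)."""
    query_words = set(query.lower().split())
    scored = []
    for doc_id, text in TOY_DOCS.items():
        overlap = len(query_words & set(text.split()))
        if overlap > 0:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties are broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]


def multi_query_retrieve(queries: List[str], top_k: int = 2) -> List[str]:
    """Run every query and collect the union of retrieved ids in first-seen order."""
    seen: List[str] = []
    for query in queries:
        for doc_id in keyword_retrieve(query, top_k):
            if doc_id not in seen:
                seen.append(doc_id)
    return seen


original = "how are threats evolving with new technologies"
# In the real system these variations would come from the LLM.
variations = [original, "what phishing and spyware attacks target users"]

single = keyword_retrieve(original)
combined = multi_query_retrieve(variations)
print(single)
print(combined)
```

In this toy setup the original wording only matches one document, while the second phrasing also surfaces a related one, which is the recall gain the challenge is after.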


### Components to use:

- [`InMemoryDocumentStore`](https://docs.haystack.deepset.ai/docs/inmemorydocumentstore): to store the news articles.
- [`InMemoryEmbeddingRetriever`](https://docs.haystack.deepset.ai/docs/inmemoryembeddingretriever): to retrieve the documents from the document store.
- [`OpenAIGenerator`](https://docs.haystack.deepset.ai/docs/openaigenerator): to instantiate the LLM to generate similar questions and compose an answer to the original question.
- [`PromptBuilder`](https://docs.haystack.deepset.ai/docs/promptbuilder): to build the prompts to query the LLM.
- [`AnswerBuilder`](https://docs.haystack.deepset.ai/docs/answerbuilder): (optional) to build the answers to the original question.
- [`SentenceTransformersTextEmbedder`](https://docs.haystack.deepset.ai/docs/sentencetransformerstextembedder): to embed the questions.
- [`SentenceTransformersDocumentEmbedder`](https://docs.haystack.deepset.ai/docs/sentencetransformersdocumentembedder): to embed the news articles.
- [`DocumentJoiner`](https://docs.haystack.deepset.ai/docs/documentjoiner): to join the documents retrieved by the similar questions.

### Your task is to build two custom components:

- `MultiQueryGenerator`: a custom component that uses an LLM to generate similar questions based on the original question.
- `MultiQueryHandler`: a custom component that queries the document store with a set of questions and collects all the retrieved documents.

**Note:** Feel free to change the models in this challenge and use different model providers.

### 1) Installation

In [None]:
!pip install haystack-ai
!pip install "sentence-transformers>=3.0.0"
!pip install lazy_imports
!wget https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv


### 2) Enter API keys for LLM
Enter your OpenAI API Key. If you don't have a key, [follow these instructions](https://help.openai.com/en/articles/4936850-where-do-i-find-my-api-key).

In [None]:
from getpass import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")

Enter OpenAI API key:··········


### 3) Parse the news dataset and index it

This step might take some time if you haven't enabled a GPU.

In [None]:
import csv
from typing import List

from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.writers import DocumentWriter
from haystack import Pipeline

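def read_documents(path: str) -> List[Document]:
    """Read the tab-separated BBC news file into Haystack Documents."""
    # Use a separate name for the file handle so it does not shadow the path argument.
    with open(path, "r") as file: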
        reader = csv.reader(file, delimiter="\t")
        next(reader, None)  # skip the headers
        docs = []
        for row in reader:
            category = row[0].strip()
            title = row[2].strip()
            text = row[3].strip()
            docs.append(Document(content=text, meta={"category": category, "title": title}))

    return docs


embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
doc_store = InMemoryDocumentStore()

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("embedder", SentenceTransformersDocumentEmbedder(model=embedding_model))
indexing_pipeline.add_component("writer", DocumentWriter(doc_store, policy=DuplicatePolicy.OVERWRITE))
indexing_pipeline.connect("embedder", "writer")

documents = read_documents("bbc-news-data.csv")
indexing_pipeline.run({"embedder":{"documents": documents}})


{'writer': {'documents_written': 2225}}
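
Once every article has an embedding, retrieval amounts to comparing a query vector with the stored vectors and keeping the closest ones. The sketch below shows that idea with cosine similarity over hand-made three-dimensional vectors. The vectors, the `toy_index` name, and the choice of cosine similarity are assumptions for illustration only; the similarity function actually used by the document store depends on its configuration.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings; real ones come from the embedding model.
toy_index: Dict[str, List[float]] = {
    "tech": [1.0, 0.0, 0.0],
    "sport": [0.0, 1.0, 0.0],
    "business": [0.0, 0.0, 1.0],
}


def top_k_by_cosine(
    query_vec: List[float], index: Dict[str, List[float]], top_k: int = 2
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the best top_k."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


hits = top_k_by_cosine([0.9, 0.1, 0.0], toy_index)
for doc_id, score in hits:
    print(doc_id, round(score, 3))
```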

### 4) Define the custom component to generate similar alternative questions

In [None]:
from haystack import component
from haystack.components.generators import OpenAIGenerator
from haystack.components.builders import PromptBuilder

@component
class MultiQueryGenerator:
    def __init__(self):
        self.generator = OpenAIGenerator(generation_kwargs={"temperature": 0.75, "max_tokens": 400})
        self.prompt_builder = PromptBuilder(
            template="""
            You are an AI language model assistant. Your task is to generate {{n_variations}} different versions of the
            given user question to retrieve relevant documents from a vector database.
            By generating multiple perspectives on the user question, your goal is to help the user overcome some of
            the limitations of distance-based similarity search. Provide these alternative questions separated by
            newlines.
            Original question: {{question}}
            """
        )

    @component.output_types(queries=List[str])
    def run(self, query: str, n_variations: int = 3):
        prompt = self.prompt_builder.run(question=query, n_variations=n_variations)
        result = self.generator.run(prompt=prompt['prompt'])
        queries = [query] + [q.strip() for q in result['replies'][0].split("\n") if q.strip()]
        return {"queries": queries}
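
The `run` method above splits the reply on newlines and keeps every non-empty line. Models sometimes return numbered or bulleted lists, or repeat a question, so a slightly more defensive parser can help. The following is a minimal sketch under that assumption; `parse_variations` and the sample reply are hypothetical and not part of the challenge code.

```python
import re
from typing import List

# Matches leading list markers such as "1. ", "2) ", "- ", or "* ".
_LIST_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+")


def parse_variations(reply: str, original: str) -> List[str]:
    """Return the original question followed by the cleaned, de-duplicated variations."""
    queries = [original]
    for line in reply.split("\n"):
        cleaned = _LIST_MARKER.sub("", line).strip()
        # Skip blank lines and anything already collected.
        if cleaned and cleaned not in queries:
            queries.append(cleaned)
    return queries


# Hypothetical LLM reply with mixed list markers, a blank line, and a repeat.
sample_reply = (
    "1. How is AI changing cyber attacks?\n"
    "\n"
    "- Which new threats come from IoT devices?\n"
    "2) How is AI changing cyber attacks?"
)
parsed = parse_variations(sample_reply, "How are cybersecurity threats evolving?")
for q in parsed:
    print(q)
```

Stripping the markers keeps list formatting out of the text that gets embedded, and dropping repeats avoids running the same retrieval twice.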

### 5) Define the custom component to query the document store with multiple questions and collect all the retrieved documents

In [None]:
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers import InMemoryEmbeddingRetriever

@component
class MultiQueryHandler:
    def __init__(self, document_store, embedding_model: str):
        self.embedder = SentenceTransformersTextEmbedder(model=embedding_model, progress_bar=False)
        self.embedding_retriever = InMemoryEmbeddingRetriever(document_store)

    @component.output_types(answers=List[Document])
    def run(self, queries: List[str], top_k: int = 3):
        self.embedder.warm_up()
        documents = []
        for query in queries:
            embedding = self.embedder.run(query)
            retrieved_docs = self.embedding_retriever.run(query_embedding=embedding['embedding'], top_k=top_k)
            documents.extend(retrieved_docs['documents'])
        return {"answers": documents}

### 6) Define the pipeline that, given a question, generates multiple similar questions, queries the document store, and collects all the retrieved documents

In [None]:
from haystack import component, Pipeline, Document
from haystack.components.builders import PromptBuilder, AnswerBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.components.joiners import DocumentJoiner
from haystack.components.retrievers import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy

template = """
You have to answer the following question based on the given context information only.
If the context is empty or just a '\\n' answer with None, example: "None".

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:
"""

pipeline = Pipeline()

# add components
pipeline.add_component("multi_query_generator", MultiQueryGenerator())
pipeline.add_component("multi_query_handler", MultiQueryHandler(document_store=doc_store,embedding_model=embedding_model))
pipeline.add_component("reranker", DocumentJoiner(join_mode="reciprocal_rank_fusion"))
pipeline.add_component("prompt_builder", PromptBuilder(template=template))
pipeline.add_component("llm", OpenAIGenerator())
pipeline.add_component("answer_builder", AnswerBuilder())

# connect components
pipeline.connect("multi_query_generator.queries", "multi_query_handler.queries")
pipeline.connect("multi_query_handler.answers", "reranker.documents")
pipeline.connect("reranker", "prompt_builder.documents")
pipeline.connect("prompt_builder", "llm")
pipeline.connect("llm.replies", "answer_builder.replies")
pipeline.connect("llm.meta", "answer_builder.meta")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7d9ce590fb20>
🚅 Components
  - multi_query_generator: MultiQueryGenerator
  - multi_query_handler: MultiQueryHandler
  - reranker: DocumentJoiner
  - prompt_builder: PromptBuilder
  - llm: OpenAIGenerator
  - answer_builder: AnswerBuilder
🛤️ Connections
  - multi_query_generator.queries -> multi_query_handler.queries (List[str])
  - multi_query_handler.answers -> reranker.documents (List[Document])
  - reranker.documents -> prompt_builder.documents (List[Document])
  - prompt_builder.prompt -> llm.prompt (str)
  - llm.replies -> answer_builder.replies (List[str])
  - llm.meta -> answer_builder.meta (List[Dict[str, Any]])
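
The pipeline joins the retrieved documents with `DocumentJoiner(join_mode="reciprocal_rank_fusion")`. As a rough sketch of the general idea behind reciprocal rank fusion (not Haystack's exact implementation), each document earns `1 / (k + rank)` from every ranked list it appears in, and the totals decide the final order. The function name, the toy rankings, and the constant `k = 60` are assumptions for illustration; 60 is a commonly used default and is not taken from this notebook.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(ranked_lists: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks start at 1, so the top document of a list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for three query variations.
rankings = [
    ["a", "b", "c"],
    ["b", "d", "a"],
    ["b", "a", "e"],
]
fused = reciprocal_rank_fusion(rankings)
print(fused)
```

Documents that several query variations agree on rise to the top, even if no single query ranked them first.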

In [None]:
# Example questions to try; only the last assignment takes effect.
question = "What suggestions do you have for Christmas presents? Please provide a variety of options."
question = "Ho ho ho! What steps does the UK take to keep those naughty pirates away from the music industry’s jingling tunes?"
question = "How are cybersecurity threats evolving with new technologies?"
n_variations = 3
top_k = 3

result = pipeline.run(
    {'multi_query_generator':{'query':question, 'n_variations':n_variations},
     'multi_query_handler':{'top_k':top_k},
     'prompt_builder': {'template_variables': {'question':question}},
     'answer_builder':{'query':question}
     }, include_outputs_from={"multi_query_generator"}
)

In [None]:
print("\n\nQuestions:\n")
for q in result['multi_query_generator']['queries']:
    print(q)
print("\n\nAnswer:\n")
print(result['answer_builder']['answers'][0].data)



Questions:

How are cybersecurity threats evolving with new technologies?
How are emerging technologies influencing the evolution of cybersecurity threats?
In what ways do advancements in technology contribute to the changing landscape of cybersecurity risks?
What trends are we seeing in cybersecurity threats as new technologies continue to develop?


Answer:

Cybersecurity threats are evolving as criminals increasingly leverage technology to perpetrate crimes for financial gain, leading to a shift from traditional viruses to more sophisticated and targeted forms of malware. The rise of spyware, phishing attacks, and bot nets reflects a trend towards leveraging existing vulnerabilities in systems and exploiting user behavior rather than creating flashy, mass-mailing viruses intended for notoriety. Criminals are using tried-and-tested techniques to infect machines, often concealing their activities to maximize profit while minimizing risk. Additionally, the categorization of threats h