<a href="https://colab.research.google.com/github/muffafa/advent-of-haystack-2024-2025-solutions/blob/main/Solution_Santa_Haystack_self_reflecting_Gift_agent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advent of Haystack: Day 9
In this challenge, you'll help Santa build a self-reflecting gift selection agent using Haystack and MongoDB Atlas! 🎅

The agent will help optimize gift selections based on children's wishlists and budget constraints, using MongoDB Atlas vector search for semantic matching and implementing self-reflection to ensure the best possible gift combinations.

**Components to use in this challenge:**
- [`OpenAITextEmbedder`](https://docs.haystack.deepset.ai/docs/openaitextembedder) for query embedding
- [`MongoDBAtlasEmbeddingRetriever`](https://docs.haystack.deepset.ai/docs/) for finding relevant gifts
- [`PromptBuilder`](https://docs.haystack.deepset.ai/docs/promptbuilder) for creating the prompt
- [`OpenAIGenerator`](https://docs.haystack.deepset.ai/docs/openaigenerator) for generating responses
- Custom `GiftChecker` component for self-reflection

In [None]:
# Install required packages
!pip install haystack-ai mongodb-atlas-haystack tiktoken datasets colorama

Collecting haystack-ai
  Downloading haystack_ai-2.8.0-py3-none-any.whl.metadata (13 kB)
Collecting mongodb-atlas-haystack
  Downloading mongodb_atlas_haystack-1.0.0-py3-none-any.whl.metadata (2.3 kB)
Collecting tiktoken
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting colorama
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Collecting haystack-experimental (from haystack-ai)
  Downloading haystack_experimental-0.4.0-py3-none-any.whl.metadata (16 kB)
Collecting lazy-imports (from haystack-ai)
  Downloading lazy_imports-0.3.1-py3-none-any.whl.metadata (10 kB)
Collecting posthog (from haystack-ai)
  Downloading posthog-3.7.4-py2.py3-none-any.whl.metadata (2.0 kB)
Collecting pymongo[srv] (from mongodb-atlas-haystack)
  Downloading pymongo-4.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)

## Configure Environment

- [OpenAI API Key](https://platform.openai.com/api-keys) if you'd like to use OpenAI embedding and text generation models
- [MongoDB Atlas project](https://www.mongodb.com/docs/atlas/getting-started/) with an Atlas cluster (free tier works). [Detailed Tutorial](https://www.mongodb.com/docs/guides/atlas/cluster/#create-a-cluster)
- Get your [connection string](https://www.mongodb.com/docs/atlas/tutorial/connect-to-your-cluster/#connect-to-your-atlas-cluster) and add `0.0.0.0/0` to your network access list so that connections from any IP address (such as a Colab runtime) are allowed.
- The connection string looks like this: `mongodb+srv://<db_username>:<db_password>@<clustername>.xxxxx.mongodb.net/?retryWrites=true...`

Set up your MongoDB Atlas and OpenAI credentials:

In [None]:
import os
import getpass
import re

conn_str = getpass.getpass("Enter your MongoDB connection string:")
# Set the appName parameter; [^&\s]* stops at '&' so query parameters after appName are kept
conn_str = (re.sub(r'appName=[^&\s]*', 'appName=devrel.ai.haystack_partner', conn_str)
            if 'appName=' in conn_str
            else conn_str + ('&' if '?' in conn_str else '?') + 'appName=devrel.ai.haystack_partner')
os.environ['MONGO_CONNECTION_STRING']=conn_str
os.environ['OPENAI_API_KEY'] = getpass.getpass("Enter your OpenAI API Key:")
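
The `appName` handling above can be pulled into a small helper that is easy to check without a real cluster. This is only a sketch: `with_app_name` and `APP_NAME` are hypothetical names, and the hostnames below are placeholders. The pattern stops at `&` so that parameters following `appName` survive the rewrite.

```python
import re

# Hypothetical constant; the value is the appName used in this notebook.
APP_NAME = "devrel.ai.haystack_partner"


def with_app_name(conn_str: str, app_name: str = APP_NAME) -> str:
    """Return conn_str with its appName query parameter set to app_name."""
    if "appName=" in conn_str:
        # Replace only the appName value; stop at '&' to keep later parameters.
        return re.sub(r"appName=[^&\s]*", f"appName={app_name}", conn_str)
    # No appName yet: append it with the right separator.
    separator = "&" if "?" in conn_str else "?"
    return f"{conn_str}{separator}appName={app_name}"


# Placeholder connection string, not a real cluster.
print(with_app_name("mongodb+srv://user:pw@cluster.example.mongodb.net/?appName=old&w=majority"))
```
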

## Create Sample Gift Dataset

Let's create a dataset of gifts with prices and categories:

In [None]:
dataset = {
    "train": [
        {
            "title": "LEGO Star Wars Set",
            "price": "$49.99",
            "description": "Build your own galaxy with this exciting LEGO Star Wars set",
            "category": "Toys",
            "age_range": "7-12"
        },
        {
            "title": "Remote Control Car",
            "price": "$29.99",
            "description": "Fast and fun RC car with full directional control",
            "category": "Toys",
            "age_range": "6-10"
        },
        {
            "title": "Art Set",
            "price": "$24.99",
            "description": "Complete art set with paints, brushes, and canvas",
            "category": "Arts & Crafts",
            "age_range": "5-15"
        },
        {
            "title": "Science Kit",
            "price": "$34.99",
            "description": "Educational science experiments kit",
            "category": "Educational",
            "age_range": "8-14"
        },
        {
            "title": "Dollhouse",
            "price": "$89.99",
            "description": "Beautiful wooden dollhouse with furniture",
            "category": "Toys",
            "age_range": "4-10"
        }
    ]
}
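
Prices and age ranges are stored as strings (`"$49.99"`, `"7-12"`), so any numeric comparison needs a parsing step first. The sketch below shows one way to do that. The helper names are hypothetical and the sample list is a shortened copy of the dataset above.

```python
from typing import Dict, List, Tuple


def parse_price(price: str) -> float:
    """Convert a price string such as '$49.99' to a float."""
    return float(price.replace("$", "").replace(",", ""))


def parse_age_range(age_range: str) -> Tuple[int, int]:
    """Convert an age range such as '7-12' to the tuple (7, 12)."""
    low, high = age_range.split("-")
    return int(low), int(high)


def gifts_for_age(gifts: List[Dict[str, str]], age: int) -> List[str]:
    """Return the titles of gifts whose age range includes the given age."""
    titles = []
    for gift in gifts:
        low, high = parse_age_range(gift["age_range"])
        if low <= age <= high:
            titles.append(gift["title"])
    return titles


sample = [
    {"title": "LEGO Star Wars Set", "price": "$49.99", "age_range": "7-12"},
    {"title": "Dollhouse", "price": "$89.99", "age_range": "4-10"},
    {"title": "Remote Control Car", "price": "$29.99", "age_range": "6-10"},
]
print(gifts_for_age(sample, 11))
print(parse_price(sample[0]["price"]))
```
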

## Initialize MongoDB Atlas

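First, we need to set up our MongoDB Atlas collection and create a vector search index on the `embedding` field. The retriever runs its similarity queries against this index, so it must exist and be queryable before we search. The index uses 1536 dimensions to match the embeddings produced by `text-embedding-3-small`: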

In [None]:
# Create collection gifts and add the vector index

from pymongo import MongoClient
from bson import json_util
from pymongo.operations import SearchIndexModel
import json
import time

client = MongoClient(os.environ['MONGO_CONNECTION_STRING'])
db = client['santa_workshop']
collection = db['gifts']

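# Create the collection only if it does not exist yet, so re-running this step does not fail here
if "gifts" not in db.list_collection_names():
    db.create_collection("gifts")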


# Create the vector search index
search_index_model = SearchIndexModel(
  definition={
    "fields": [
      {
        "type": "vector",
        "numDimensions": 1536,
        "path": "embedding",
        "similarity": "cosine"
      },
    ]
  },
  name="vector_index",
  type="vectorSearch",
)
result = collection.create_search_index(model=search_index_model)
print("New search index named " + result + " is building.")
# Wait for the initial sync to complete
print("Polling to check if the index is ready. This may take up to a minute.")
while True:
  indices = list(collection.list_search_indexes(result))
  # The index can serve queries once it reports queryable=True
  if indices and indices[0].get("queryable") is True:
    break
  time.sleep(5)
print(result + " is ready for querying.")
client.close()

New search index named vector_index is building.
Polling to check if the index is ready. This may take up to a minute.
vector_index is ready for querying.
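
The polling loop above waits indefinitely if the index never becomes queryable. A bounded variant might look like the sketch below. `wait_until_queryable` is a hypothetical helper that takes any zero-argument callable returning index descriptions, so it can be tried without a cluster.

```python
import time
from typing import Callable, Iterable, Mapping


def wait_until_queryable(
    list_indexes: Callable[[], Iterable[Mapping]],
    timeout_s: float = 120.0,
    interval_s: float = 5.0,
) -> bool:
    """Poll list_indexes() until the first index reports queryable=True.

    Returns True once the index is ready, or False if timeout_s elapses first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        indexes = list(list_indexes())
        if indexes and indexes[0].get("queryable") is True:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Stand-in for the real call: empty at first, then building, then ready.
states = iter([[], [{"queryable": False}], [{"queryable": True}]])
print(wait_until_queryable(lambda: next(states), timeout_s=5, interval_s=0))
```

With a live collection, the callable could be `lambda: collection.list_search_indexes("vector_index")`, mirroring the call used in the cell above.
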


## Initialize Document Store and Index Documents

Now let's set up the [MongoDBAtlasDocumentStore](https://docs.haystack.deepset.ai/docs/mongodbatlasdocumentstore) and index our gift data:

In [None]:
from haystack import Pipeline, Document
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.writers import DocumentWriter
from haystack.components.embedders import OpenAIDocumentEmbedder
from haystack_integrations.document_stores.mongodb_atlas import MongoDBAtlasDocumentStore
from bson import json_util

# Initialize document store
document_store = MongoDBAtlasDocumentStore(
    database_name="santa_workshop",
    collection_name="gifts",
    vector_search_index="vector_index",
)

# Convert dataset to documents
insert_data = []
for gift in dataset['train']:
    doc_gift = json_util.loads(json_util.dumps(gift))
    haystack_doc = Document(content=doc_gift['title'], meta=doc_gift)
    insert_data.append(haystack_doc)

# Create indexing pipeline
doc_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)
doc_embedder = OpenAIDocumentEmbedder(model="text-embedding-3-small", meta_fields_to_embed=["description"])

indexing_pipe = Pipeline()
indexing_pipe.add_component(instance=doc_embedder, name="doc_embedder")
indexing_pipe.add_component(instance=doc_writer, name="doc_writer")
indexing_pipe.connect("doc_embedder.documents", "doc_writer.documents")

# Index the documents
indexing_pipe.run({"doc_embedder": {"documents": insert_data}})

Calculating embeddings: 100%|██████████| 1/1 [00:00<00:00,  1.25it/s]


{'doc_embedder': {'meta': {'model': 'text-embedding-3-small',
   'usage': {'prompt_tokens': 54, 'total_tokens': 54}}},
 'doc_writer': {'documents_written': 5}}
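
Because the embedder is created with `meta_fields_to_embed=["description"]`, each embedding reflects the gift's description as well as its title. The sketch below approximates how the two are combined, assuming the listed meta values are joined in front of the content with a newline separator. `text_to_embed` is a hypothetical stand-in, not the library's code, and the exact whitespace handling may differ between Haystack versions.

```python
from typing import Dict, List


def text_to_embed(
    content: str,
    meta: Dict[str, str],
    meta_fields: List[str],
    separator: str = "\n",  # assumed default separator
) -> str:
    """Join the selected meta values and the content into one string."""
    values = [str(meta[field]) for field in meta_fields if meta.get(field) is not None]
    return separator.join(values + [content])


gift = {"title": "Science Kit", "description": "Educational science experiments kit"}
print(text_to_embed(gift["title"], gift, ["description"]))
```

This is why a query about "science experiments" can match the Science Kit even though the document content is only the title.
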

## TODO: Create Self-Reflecting Gift Selection Pipeline

Now comes the fun part! Create a pipeline that can:
1. Take a gift request query
2. Find relevant gifts using vector search
3. Self-reflect on selections to optimize for budget and preferences
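
Stripped of Haystack, the self-reflection in step 3 is a loop: generate a selection, check whether the model marked it `DONE`, and otherwise feed the draft back in. The sketch below illustrates that control flow with a stubbed model. `reflect` and `fake_llm` are hypothetical names, and returning the last draft when attempts run out is a choice made for this sketch only.

```python
from typing import Callable, Optional


def reflect(
    generate: Callable[[str, Optional[str]], str],
    query: str,
    max_runs: int = 5,
) -> str:
    """Call generate repeatedly, feeding back the previous draft until it says DONE."""
    previous: Optional[str] = None
    for _ in range(max_runs):
        reply = generate(query, previous)
        if "DONE" in reply:
            # The model accepted its selection: strip the marker and return it.
            return reply.replace("DONE", "").strip()
        previous = reply
    # Out of attempts: fall back to the last draft (assumption for this sketch).
    return previous or ""


def fake_llm(query: str, previous: Optional[str]) -> str:
    """Stub model: the first draft is rejected, the second is marked DONE."""
    if previous is None:
        return "Dollhouse"
    return "DONE Science Kit, LEGO Star Wars Set"


print(reflect(fake_llm, "Gifts for a 9-year-old. Budget: $100"))
```

In the pipeline, the same loop comes from connecting a checker component's output back into the prompt builder.
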

**HINT:** Learn how to write your component in [Docs: Creating Custom Components](https://docs.haystack.deepset.ai/docs/custom-components)

Here's the basic structure to get you started:

In [None]:
from haystack.components.generators import OpenAIGenerator
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack_integrations.components.retrievers.mongodb_atlas import MongoDBAtlasEmbeddingRetriever
from haystack.components.embedders import OpenAITextEmbedder
from colorama import Fore
from typing import List
from haystack import component

@component
class GiftChecker:
    @component.output_types(gifts_to_check=str, gifts=str)
    def run(self, replies: List[str]):
        if 'DONE' in replies[0]:
            return {"gifts": replies[0].replace('DONE', '')}
        else:
            print(Fore.RED + "Not optimized yet, could find better gift combinations")
            return {"gifts_to_check": replies[0]}

# Create prompt template
prompt_template = """
    You are Santa's gift selection assistant. Below you have a list of available gifts with their prices.
    Based on the child's wishlist and budget, suggest appropriate gifts that maximize joy while staying within budget.

    Available Gifts:
    {% for doc in documents %}
        Gift: {{ doc.content }}
        Price: {{ doc.meta['price']}}
        Age Range: {{ doc.meta['age_range']}}
    {% endfor %}

    Query: {{query}}
    {% if gifts_to_check %}
        Previous gift selection: {{ gifts_to_check }}
        Can we optimize this selection for better value within budget?
        If optimal, say 'DONE' and return the selection
        If not, suggest a better combination
    {% endif %}

    Gift Selection:
"""

# Create the pipeline
gift_pipeline = Pipeline(max_runs_per_component=5)
gift_pipeline.add_component("text_embedder", OpenAITextEmbedder(model="text-embedding-3-small"))
gift_pipeline.add_component(
    instance=MongoDBAtlasEmbeddingRetriever(document_store=document_store, top_k=5),
    name="retriever"
)
gift_pipeline.add_component(instance=PromptBuilder(template=prompt_template), name="prompt_builder")
gift_pipeline.add_component(instance=GiftChecker(), name="checker")
gift_pipeline.add_component(instance=OpenAIGenerator(model="gpt-4"), name="llm")

# Connect components
gift_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
gift_pipeline.connect("retriever.documents", "prompt_builder.documents")
gift_pipeline.connect("checker.gifts_to_check", "prompt_builder.gifts_to_check")
gift_pipeline.connect("prompt_builder", "llm")
gift_pipeline.connect("llm", "checker")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7d5853ba7160>
🚅 Components
  - text_embedder: OpenAITextEmbedder
  - retriever: MongoDBAtlasEmbeddingRetriever
  - prompt_builder: PromptBuilder
  - checker: GiftChecker
  - llm: OpenAIGenerator
🛤️ Connections
  - text_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - prompt_builder.prompt -> llm.prompt (str)
  - checker.gifts_to_check -> prompt_builder.gifts_to_check (str)
  - llm.replies -> checker.replies (List[str])

## Test the Gift Selection Pipeline

Let's test our pipeline with a sample query:

In [None]:
query = "Find gifts for a 9-year-old who loves science and building things. Budget: $100"

result = gift_pipeline.run(
    {
        "text_embedder": {"text": query},
        "prompt_builder": {"query": query}
    }
)

print(Fore.GREEN + result["checker"]["gifts"])

Not optimized yet, could find better gift combinations
Science Kit, LEGO Star Wars Set
    Total cost: $84.98
    This selection is under budget and suits the child's interest in science and building things.
    So, Santa says, ""!
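
Since the total is computed by the model, it is worth checking it against the catalog. The sketch below does this with plain Python, assuming a naive substring match between catalog titles and the reply text. `PRICES`, `selected_titles`, and `total_cost` are hypothetical names, and the prices are copied from the dataset above.

```python
from typing import Dict, List

# Prices copied from the sample dataset.
PRICES: Dict[str, float] = {
    "LEGO Star Wars Set": 49.99,
    "Remote Control Car": 29.99,
    "Art Set": 24.99,
    "Science Kit": 34.99,
    "Dollhouse": 89.99,
}


def selected_titles(reply: str, prices: Dict[str, float]) -> List[str]:
    """Return catalog titles that appear verbatim in the reply (naive match)."""
    return [title for title in prices if title in reply]


def total_cost(titles: List[str], prices: Dict[str, float]) -> float:
    """Sum the prices of the given titles, rounded to cents."""
    return round(sum(prices[title] for title in titles), 2)


reply = "Science Kit, LEGO Star Wars Set\nTotal cost: $84.98"
budget = 100.0
titles = selected_titles(reply, PRICES)
total = total_cost(titles, PRICES)
print(titles, total, total <= budget)
```

For the reply above, the recomputed total matches the reported $84.98 and stays within the $100 budget.
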
