<a href="https://colab.research.google.com/github/nickprock/appunti_data_science/blob/master/semantic-search/advent-of-haystack/Advent_of_Haystack_Metadata_Filtering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advent of Haystack - Day 5
_Make a copy of this Colab to start!_

Here, you'll be provided with a document store containing some documents and their metadata. Our aim is to create a querying pipeline that uses these metadata to filter documents when we run the pipeline.

1. **Write documents to a document store:** This is already complete. Here, we are writing documents and their metadata into an `InMemoryDocumentStore`.
2. **Your task is to complete step 2 👇**

**Useful documentation:**

- [Metadata Filtering](https://docs.haystack.deepset.ai/v2.0/docs/metadata-filtering)
- [`InMemoryBM25Retriever`](https://docs.haystack.deepset.ai/v2.0/docs/inmemorybm25retriever)
- [`Pipelines`](https://docs.haystack.deepset.ai/v2.0/docs/creating-pipelines)

# Installation
**Note:** Colab ships with `llmx`, which causes a known version conflict, so you might see an `llmx` error during installation. You can safely ignore it, or run `pip uninstall -y llmx`.

In [None]:
!pip install haystack-ai

Collecting haystack-ai
  Downloading haystack_ai-2.0.0b2-py3-none-any.whl (185 kB)
Collecting lazy-imports (from haystack-ai)
  Downloading lazy_imports-0.3.1-py3-none-any.whl (12 kB)
Collecting openai<1.0.0 (from haystack-ai)
  Downloading openai-0.28.1-py3-none-any.whl (76 kB)
Collecting posthog (from haystack-ai)
  Downloading posthog-3.1.0-py2.py3-none-any.whl (37 kB)
Collecting rank-bm25 (from haystack-ai)
  Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Collecting monotonic>=1.5 (from posthog->haystack-ai)
  Downloading monotonic-1.6-py2.py3-none-any.whl (8.2 kB)
Collecting backoff>=1.10.0 (from posthog->haystack-ai)
  Downloading backoff-2.2.1-py3-none-any.whl (15 kB)
Installing collected packages: monotonic, rank-bm25, lazy-imports, 

### Enabling Telemetry

Knowing that you're running this challenge helps us understand whether Advent of Haystack is helping people learn about Haystack 2.0-Beta. You can opt out at any time by commenting out the following line.

In [None]:
from haystack.telemetry import tutorial_running

tutorial_running("challenge_5")

## 1) Write Documents to InMemoryDocumentStore

Here, we write three short installation snippets, each tagged with a `version` in its metadata, into an `InMemoryDocumentStore`.

In [None]:
from haystack import Pipeline, Document
from haystack.document_stores import InMemoryDocumentStore
from haystack.components.retrievers import InMemoryBM25Retriever

documents = [Document(content="Use pip to install a basic version of Haystack's latest release: pip install farm-haystack. All the core Haystack components live in the haystack repo. But there's also the haystack-extras repo which contains components that are not as widely used, and you need to install them separately.",
                      meta={"version": "1.15"}),
             Document(content="Use pip to install a basic version of Haystack's latest release: pip install farm-haystack[inference]. All the core Haystack components live in the haystack repo. But there's also the haystack-extras repo which contains components that are not as widely used, and you need to install them separately.",
                      meta={"version": "1.22"}),
             Document(content="Use pip to install only the Haystack 2.0 code: pip install haystack-ai. The haystack-ai package is built on the main branch which is an unstable beta version, but it's useful if you want to try the new features as soon as they are merged.",
                      meta={"version": "2.0"}),
]
document_store = InMemoryDocumentStore()
document_store.write_documents(documents=documents)


3

## 2) Build a Querying Pipeline
Here, we have provided a nearly complete querying pipeline, but it doesn't use metadata for filtering yet. Make sure your `InMemoryBM25Retriever` filters documents based on their metadata when you send a query. That way, you can limit the list of retrieved documents to the ones that fulfill the filtering condition! In this example, let's retrieve documents that have a version of 2.0.
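As a rough sketch of what such a filter looks like, the snippet below builds the filter dictionary with a small helper. The helper name `version_filter` is hypothetical and not part of Haystack. The dictionary layout (a logical `operator` with a list of `conditions`, each naming a `field`, a comparison `operator`, and a `value`) mirrors the filter passed to the retriever later in this notebook.

```python
def version_filter(version: str) -> dict:
    """Build a filter that matches documents whose meta["version"] equals `version`.

    Hypothetical helper for illustration only; it is not part of Haystack.
    """
    return {
        "operator": "AND",
        "conditions": [
            # Metadata fields are addressed with a "meta." prefix.
            {"field": "meta.version", "operator": "==", "value": version},
        ],
    }


filters = version_filter("2.0")
print(filters)
```

The value is passed as the string `"2.0"` because the documents above store their version as a string, not as a number.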

In [None]:
pipeline = Pipeline()
pipeline.add_component(instance=InMemoryBM25Retriever(document_store=document_store), name="retriever")

In [None]:
query = "Haystack installation"
result = pipeline.run(
    data={
        "retriever": {
            "query": query,
            # Keep only documents whose metadata field `version` equals "2.0".
            "filters": {
                "operator": "AND",
                "conditions": [
                    {"field": "meta.version", "operator": "==", "value": "2.0"}
                ],
            },
        }
    }
)

Ranking by BM25...:   0%|          | 0/1 [00:00<?, ? docs/s]

In [None]:
result

{'retriever': {'documents': [Document(id=03311fb024425a57af746d1e75273a4376444e8c72e2a553d39a875714238dad, content: 'Use pip to install only the Haystack 2.0 code: pip install haystack-ai. The haystack-ai package is b...', meta: {'version': '2.0'}, score: -0.457755120278379)]}}
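
To make the filter semantics more concrete, here is a minimal pure-Python sketch that applies the same kind of filter to plain metadata dictionaries. It is not Haystack's implementation: `matches` is a hypothetical stand-in that supports only `AND`/`OR` and the `==`, `!=`, and `in` comparisons, and it assumes every field name starts with `meta.`.

```python
import operator

# Metadata of the three documents written to the store above.
metas = [{"version": "1.15"}, {"version": "1.22"}, {"version": "2.0"}]

# Comparison operators supported by this sketch (a subset chosen for illustration).
COMPARATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    "in": lambda value, options: value in options,
}


def matches(filters: dict, meta: dict) -> bool:
    """Return True if `meta` satisfies `filters` (hypothetical helper, not Haystack code)."""
    if "conditions" in filters:
        # Logical node: combine the results of the nested conditions.
        results = [matches(cond, meta) for cond in filters["conditions"]]
        if filters["operator"] == "AND":
            return all(results)
        if filters["operator"] == "OR":
            return any(results)
        raise ValueError(f"Unsupported logical operator: {filters['operator']}")
    # Comparison node: drop the "meta." prefix and compare the stored value.
    _, _, key = filters["field"].partition(".")
    return COMPARATORS[filters["operator"]](meta.get(key), filters["value"])


only_v2 = {
    "operator": "AND",
    "conditions": [{"field": "meta.version", "operator": "==", "value": "2.0"}],
}
v1_docs = {"field": "meta.version", "operator": "in", "value": ["1.15", "1.22"]}
as_float = {"field": "meta.version", "operator": "==", "value": 2.0}

print([m for m in metas if matches(only_v2, m)])   # [{'version': '2.0'}]
print([m for m in metas if matches(v1_docs, m)])   # [{'version': '1.15'}, {'version': '1.22'}]
print([m for m in metas if matches(as_float, m)])  # [] because "2.0" != 2.0
```

The last line shows why the filter value in this notebook is the string `"2.0"`: the documents store the version as a string, and in this sketch a string never compares equal to a float.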