## Documentation

To read more about how to apply pre-filtering with KNN search, visit the [docs](https://www.elastic.co/docs/reference/query-languages/query-dsl/query-dsl-knn-query#knn-query-filtering).



## Connect to Elasticsearch

In [None]:
from pprint import pprint
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
client_info = es.info()
print("Connected to Elasticsearch!")
pprint(client_info.body)

## Preparing the index

We (re)create the `apod` index with an `embedding` field of type `dense_vector` to store the embeddings.

In [None]:
es.indices.delete(index="apod", ignore_unavailable=True)
es.indices.create(
    index="apod",
    mappings={
        "properties": {
            "embedding": {
                "type": "dense_vector",
            }
        }
    },
)
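
The mapping above relies on Elasticsearch's defaults for the vector field. As a hedged sketch, the same mapping could be spelled out explicitly. The helper name `build_apod_mappings` is hypothetical. `dims=384` matches the embedding size of the model introduced in the next section, while the `cosine` similarity and the `integer` type for `year` are assumptions made for illustration, not settings taken from this notebook.

```python
def build_apod_mappings(dims: int = 384, similarity: str = "cosine") -> dict:
    """Return an explicit mapping for the `apod` index (illustrative sketch)."""
    return {
        "properties": {
            "embedding": {
                "type": "dense_vector",
                "dims": dims,              # must equal the model's embedding size
                "index": True,             # index the vectors so kNN search can use them
                "similarity": similarity,  # assumed metric, not taken from the notebook
            },
            # Assumed explicit type for the field used in the filter later on.
            "year": {"type": "integer"},
        }
    }


mappings = build_apod_mappings()
print(mappings["properties"]["embedding"])
```

The resulting dictionary could be passed as `es.indices.create(index="apod", mappings=mappings)` in place of the inline mapping.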

## Embedding model

I chose the `all-MiniLM-L6-v2` model because it is fast, compact, and works well as a general-purpose model. It produces `384`-dimensional embeddings and truncates input longer than `256` word pieces (tokens, not words). The model is very popular in the community, with almost `50M` downloads in one month.

To download and use this model, Hugging Face offers a Python package called `sentence-transformers`, which simplifies computing dense vector representations.

In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
model

In [None]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

In [None]:
model = model.to(device)
model

## Index documents

Let's use the `APOD` (Astronomy Picture of the Day) dataset in this notebook.

In [None]:
import json

with open("../data/apod.json") as f:
    documents = json.load(f)

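Let's use the embedding model to embed the `explanation` field of each `APOD` document. We also derive an integer `year` field from each document's `date`, which we will filter on later.

Then we use the `bulk` API to index the documents in the `apod` index.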

In [None]:
from tqdm import tqdm


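def get_embedding(text):
    # model.encode returns a NumPy array; convert it to a plain list
    # so the vector is JSON-serializable in any client setup.
    return model.encode(text).tolist()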


operations = []
for document in tqdm(documents, total=len(documents), desc="Indexing documents"):
    year = document["date"].split("-")[0]
    document["year"] = int(year)

    operations.append({"index": {"_index": "apod"}})
    operations.append(
        {
            **document,
            "embedding": get_embedding(document["explanation"]),
        }
    )

response = es.bulk(operations=operations)
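
The cell above sends every document in a single bulk request. For larger datasets that request can become very big, so here is a minimal sketch of splitting the `operations` list into batches. The helper `chunk_operations` and the batch size are hypothetical, and `sample_operations` is a tiny stand-in for the real list.

```python
from typing import Iterator


def chunk_operations(operations: list, docs_per_batch: int = 500) -> Iterator[list]:
    """Yield slices of a bulk `operations` list, keeping action/source pairs together."""
    step = docs_per_batch * 2  # each document occupies two entries: action + source
    for start in range(0, len(operations), step):
        yield operations[start : start + step]


# Tiny stand-in for the real `operations` list built above.
sample_operations = []
for i in range(5):
    sample_operations.append({"index": {"_index": "apod"}})
    sample_operations.append({"title": f"doc {i}"})

batches = list(chunk_operations(sample_operations, docs_per_batch=2))
print([len(batch) for batch in batches])  # [4, 4, 2]
```

Each batch could then be sent with its own `es.bulk(operations=batch)` call, checking the `errors` flag of every response.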

If the indexing is successful, you should see `response["errors"]` as `False`.

In [None]:
response["errors"]
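
When `errors` is `True`, the per-document details are in the `items` list of the bulk response. The following is a minimal sketch of collecting the failed items. The helper `failed_bulk_items` is hypothetical, and `sample_response` is a hand-written example that only imitates the shape of a bulk response; it is not output from this notebook.

```python
def failed_bulk_items(bulk_response: dict) -> list:
    """Collect the entries of a bulk response that carry an `error` object."""
    failures = []
    for item in bulk_response.get("items", []):
        # Each item is keyed by its action type, e.g. {"index": {...}}.
        for action, result in item.items():
            if "error" in result:
                failures.append(
                    {
                        "action": action,
                        "status": result.get("status"),
                        "error": result["error"],
                    }
                )
    return failures


# Hand-written sample with the same shape as a bulk response (values are made up).
sample_response = {
    "errors": True,
    "items": [
        {"index": {"_id": "1", "status": 201}},
        {"index": {"_id": "2", "status": 400, "error": {"type": "example_error", "reason": "example reason"}}},
    ],
}

failures = failed_bulk_items(sample_response)
print(len(failures), failures[0]["status"])  # 1 400
```

With the real client, the plain dictionary is available as `response.body`.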

## Pre-filtering with kNN Search

### Regular kNN search

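In a regular kNN search, we embed the query and ask Elasticsearch for the `k` documents whose embeddings are most similar to it. With the `knn` option used here, the search is approximate: rather than comparing the query against every document, Elasticsearch considers `num_candidates` nearest-neighbor candidates per shard and returns the top `k` of them.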

In [None]:
query = "What is a black hole?"
embedded_query = get_embedding(query)

result = es.search(
    index="apod",
    knn={
        "field": "embedding",
        "query_vector": embedded_query,
        "num_candidates": 20,
        "k": 10,
    },
)

number_of_documents = result.body["hits"]["total"]["value"]
print(f"Found {number_of_documents} documents")
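
As a mental model for what kNN ranking computes, here is an exact, brute-force version in plain Python. It is only a sketch: the `knn` option in Elasticsearch approximates this ranking instead of scoring every vector, and the functions `cosine_similarity` and `exact_knn`, the toy vectors, and the choice of cosine similarity are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_knn(query_vector, vectors, k):
    """Score every vector against the query and return the k best (id, score) pairs."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings" (made up for this example).
toy_vectors = {
    "black_hole": [0.9, 0.1, 0.0],
    "galaxy": [0.6, 0.6, 0.1],
    "comet": [0.0, 0.2, 0.9],
}

top = exact_knn([1.0, 0.0, 0.0], toy_vectors, k=2)
print([doc_id for doc_id, _ in top])  # ['black_hole', 'galaxy']
```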

Here we got the 10 documents most similar to the query "What is a black hole?". Let's print the first 3.

In [None]:
for hit in result.body["hits"]["hits"][:3]:
    print(f"Score: {hit['_score']}")
    print(f"Title: {hit['_source']['title']}")
    print(f"Explanation: {hit['_source']['explanation']}")
    print("-" * 80)

In [None]:
for hit in result.body["hits"]["hits"]:
    print(f"Year: {hit['_source']['year']}")

The cell above prints the year of each document returned by the regular kNN search. The results span different years, so let's see how pre-filtering can restrict them to a single year.

### Pre-filtering

Let's run the same query, but this time with pre-filtering so that only documents from the year 2024 are considered.

We do this by adding a `filter` clause to the kNN query. The `filter` clause is a regular Query DSL query. It is applied as part of the kNN search rather than to its results, so the search can still return up to `k` documents, all of which match the filter.

In [None]:
query = "What is a black hole?"
embedded_query = get_embedding(query)

result = es.search(
    index="apod",
    knn={
        "field": "embedding",
        "query_vector": embedded_query,
        "num_candidates": 20,
        "k": 10,
        "filter": {"term": {"year": 2024}},
    },
)

number_of_documents = result.body["hits"]["total"]["value"]
print(f"Found {number_of_documents} documents")

Printing the years confirms that every returned document is from 2024.

In [None]:
for hit in result.body["hits"]["hits"]:
    print(f"Year: {hit['_source']['year']}")
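
To see why the filter belongs inside the kNN query, the sketch below contrasts pre-filtering with filtering the top `k` results afterwards. It is a plain-Python illustration with made-up documents and scores, and the helpers `top_k`, `post_filter`, and `pre_filter` are hypothetical; it does not reproduce how Elasticsearch executes the search.

```python
def top_k(scored_docs, k):
    """Return the k highest-scoring documents."""
    return sorted(scored_docs, key=lambda doc: doc["score"], reverse=True)[:k]


def post_filter(scored_docs, k, year):
    """Take the top k first, then drop documents from other years."""
    return [doc for doc in top_k(scored_docs, k) if doc["year"] == year]


def pre_filter(scored_docs, k, year):
    """Keep only documents from the given year, then take the top k."""
    return top_k([doc for doc in scored_docs if doc["year"] == year], k)


# Made-up documents with precomputed similarity scores.
toy_docs = [
    {"id": "a", "year": 2019, "score": 0.95},
    {"id": "b", "year": 2024, "score": 0.90},
    {"id": "c", "year": 2021, "score": 0.85},
    {"id": "d", "year": 2024, "score": 0.70},
    {"id": "e", "year": 2024, "score": 0.60},
]

print([doc["id"] for doc in post_filter(toy_docs, k=3, year=2024)])  # ['b']
print([doc["id"] for doc in pre_filter(toy_docs, k=3, year=2024)])   # ['b', 'd', 'e']
```

Filtering afterwards leaves a single result because two of the top three documents are from other years, while pre-filtering still fills all `k` slots.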

Let's look at the first 3 documents returned by the kNN search to confirm that they are similar to the query.

In [None]:
for hit in result.body["hits"]["hits"][:3]:
    print(f"Score: {hit['_score']}")
    print(f"Title: {hit['_source']['title']}")
    print(f"Explanation: {hit['_source']['explanation']}")
    print("-" * 80)