## Documentation

To read more about kNN search, check out the [Elasticsearch kNN search docs](https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html).



## Connect to Elasticsearch

In [None]:
from pprint import pprint
from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')
client_info = es.info()
print('Connected to Elasticsearch!')
pprint(client_info.body)

## Preparing the index

We create the index with an `embedding` field of type `dense_vector`, which will store the embeddings.

In [None]:
es.indices.delete(index='my_index', ignore_unavailable=True)
es.indices.create(
    index="my_index",
    mappings={
        "properties": {
            "embedding": {
                "type": "dense_vector",
            }
        }
    },
)
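The mapping above leaves the vector size and similarity function to Elasticsearch. As a minimal sketch, assuming you would rather pin those values yourself, the same mapping can be built as a plain dictionary. `build_vector_mapping` is a hypothetical helper, and the values `384` and `"cosine"` are assumptions chosen to match the model used later in this notebook.

```python
def build_vector_mapping(field="embedding", dims=384, similarity="cosine"):
    """Return an index mapping with an explicitly sized dense_vector field."""
    return {
        "properties": {
            field: {
                "type": "dense_vector",
                "dims": dims,              # must match the model's output size
                "index": True,             # index the field so kNN search can use it
                "similarity": similarity,  # assumed metric for comparing vectors
            }
        }
    }


mappings = build_vector_mapping()
print(mappings)

# With a running cluster (not executed here):
# es.indices.create(index="my_index", mappings=mappings)
```

Pinning `dims` up front means a vector of the wrong size should be rejected at indexing time instead of silently defining the field.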

## Embedding model

I chose the `all-MiniLM-L6-v2` model because it is fast, compact, and works well as a general-purpose model. It produces embeddings of dimension `384` and truncates input longer than `256` word pieces (tokens, not whole words). The model is also very popular in the community, with almost `50M` downloads in one month.

To download and use this model, Hugging Face offers a Python package called `sentence-transformers`, which simplifies computing dense vector representations.

In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
model

In [None]:
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

In [None]:
model = model.to(device)
model
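What makes these vectors useful for search is that texts with similar meaning tend to point in similar directions, which is usually measured with cosine similarity. Here is a minimal sketch in pure Python, using toy 3-dimensional vectors as stand-ins for the real 384-dimensional embeddings. `cosine_similarity` is a hypothetical helper written for illustration and is not part of `sentence-transformers`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (assumptions for illustration, not real model output).
query = [1.0, 0.0, 1.0]
close = [2.0, 0.0, 2.0]   # same direction as the query, different length
far = [0.0, 1.0, 0.0]     # orthogonal to the query

print(cosine_similarity(query, close))
print(cosine_similarity(query, far))
```

Only the direction matters here: `close` is twice as long as `query` but still counts as the best possible match.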

## Load documents

In [None]:
import json

# Use a context manager so the file handle is closed after reading.
with open("../data/astronomy.json") as f:
    documents = json.load(f)
documents

## Embed documents

In [None]:
from tqdm import tqdm
from pprint import pprint


def get_embedding(text):
    # Convert the NumPy array to a plain list so it serializes cleanly to JSON.
    return model.encode(text).tolist()


operations = []
for document in tqdm(documents, total=len(documents)):
    operations.append({'index': {'_index': 'my_index'}})
    operations.append({
        **document,
        'embedding': get_embedding(document['content']),
    })

response = es.bulk(operations=operations)
pprint(response.body)
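The bulk API reports per-document failures inside the response body, so a successful call does not guarantee that every document was indexed. Here is a minimal sketch of building the operations list and checking the response. `build_bulk_operations`, `failed_items`, `fake_embed`, and the sample response are all hypothetical and written for illustration; the sample is hand-written, not real cluster output.

```python
def build_bulk_operations(documents, embed, index_name="my_index"):
    """Interleave action lines and document bodies, as the bulk API expects."""
    operations = []
    for document in documents:
        vector = embed(document["content"])
        # NumPy arrays expose tolist(); plain lists pass through unchanged.
        if hasattr(vector, "tolist"):
            vector = vector.tolist()
        operations.append({"index": {"_index": index_name}})
        operations.append({**document, "embedding": vector})
    return operations


def failed_items(bulk_response):
    """Return (action, error) pairs for every item that failed."""
    if not bulk_response.get("errors"):
        return []
    failures = []
    for item in bulk_response["items"]:
        for action, result in item.items():
            if "error" in result:
                failures.append((action, result["error"]))
    return failures


def fake_embed(text):
    # Stand-in for the real model: a 2-dimensional toy "embedding".
    return [float(len(text)), 0.0]


docs = [{"title": "A", "content": "abc"}, {"title": "B", "content": "hello"}]
ops = build_bulk_operations(docs, fake_embed)

# Hand-written sample shaped like a bulk response with one failure.
sample_response = {
    "errors": True,
    "items": [
        {"index": {"status": 201}},
        {"index": {"status": 400, "error": {"type": "mapper_parsing_exception"}}},
    ],
}

print(len(ops))
print(failed_items(sample_response))
```

In the notebook, the same check would be `failed_items(response.body)` right after the call to `es.bulk`.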

We indexed all documents with an additional field `embedding`. Let's retrieve the documents to verify that the text was converted to a dense vector.

In [None]:
response = es.search(
    index='my_index',
    # Pass the query as a keyword argument; `body` is deprecated in the 8.x client.
    query={'match_all': {}},
)

pprint(response["hits"]["hits"])

Awesome! We successfully inserted the documents with the additional `embedding` field. Now, let’s check the mapping to confirm that the dimension of the dense vector is 384.

In [None]:
response = es.indices.get_mapping(index='my_index')
pprint(response.body)
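Instead of reading the dimension off the printed mapping, the check can be made explicit. This is a minimal sketch, assuming the response has the usual `index -> mappings -> properties` nesting. `vector_dims` is a hypothetical helper, and `sample_mapping` is a hand-written illustration and not real output from the cluster.

```python
def vector_dims(mapping_response, index_name="my_index", field="embedding"):
    """Return the `dims` value of a dense_vector field, or None if absent."""
    properties = mapping_response[index_name]["mappings"]["properties"]
    return properties[field].get("dims")


# Hand-written sample shaped like a get_mapping response.
sample_mapping = {
    "my_index": {
        "mappings": {
            "properties": {
                "content": {"type": "text"},
                "embedding": {"type": "dense_vector", "dims": 384},
                "title": {"type": "text"},
            }
        }
    }
}

dims = vector_dims(sample_mapping)
print(dims)
```

In the notebook, `vector_dims(response.body)` could then be compared against `model.get_sentence_embedding_dimension()`.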

## kNN search

### 1. Query N°1

In [None]:
from pprint import pprint

query = "What is a black hole?"
embedded_query = get_embedding(query)

result = es.search(
    index='my_index',
    knn={
        "field": "embedding",
        "query_vector": embedded_query,
        "num_candidates": 5,
        "k": 3,
    }
)

n_documents = result.body["hits"]["total"]["value"]
print(f"Found {n_documents} documents")

In [None]:
hits = result.body["hits"]["hits"]
for hit in hits:
    print(f"Title  : {hit['_source']['title']}")
    print(f"Content: {hit['_source']['content']}")
    print(f"Score  : {hit['_score']}")
    print("*"*100)
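In the `knn` clause, `k` is the number of nearest neighbors returned, while `num_candidates` is how many candidates are considered on each shard before the best `k` are kept. Elasticsearch answers this with an approximate index, but the idea is easier to see in an exact, brute-force version. Here is a minimal sketch in pure Python; `exact_knn`, `cosine_similarity`, and the toy vectors are hypothetical and only illustrate the ranking step.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def exact_knn(query_vector, vectors, k):
    """Score every stored vector against the query and keep the best k."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings" (assumptions for illustration).
toy_vectors = {
    "black_holes": [0.9, 0.1, 0.0],
    "exoplanets": [0.1, 0.9, 0.0],
    "comets": [0.0, 0.2, 0.9],
}

top = exact_knn([1.0, 0.0, 0.0], toy_vectors, k=2)
for doc_id, score in top:
    print(doc_id, round(score, 3))
```

The brute-force version compares the query with every document, which is exactly the cost the approximate index avoids on large collections.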

### 2. Query N°2

In [None]:
query = "How do we find exoplanets?"
embedded_query = get_embedding(query)

result = es.search(
    index='my_index',
    knn={
        "field": "embedding",
        "query_vector": embedded_query,
        "num_candidates": 5,
        "k": 1,
    }
)

n_documents = result.body["hits"]["total"]["value"]
print(f"Found {n_documents} documents")

In [None]:
hits = result.body["hits"]["hits"]
for hit in hits:
    print(f"Title  : {hit['_source']['title']}")
    print(f"Content: {hit['_source']['content']}")
    print(f"Score  : {hit['_score']}")
    print("*"*100)

In both queries, the top-scoring document matches the query, and the other results returned by the kNN search for the first query are also relevant. To refine the results further, you can set a score threshold so that only documents scoring at or above it are returned, which excludes unrelated documents.
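The thresholding idea can be sketched on the client side. This assumes the field uses cosine similarity, for which Elasticsearch documents the score as `(1 + cosine) / 2`. `filter_hits` and `score_to_cosine` are hypothetical helpers, the sample hits are hand-written, and the cut-off of `0.75` is an arbitrary value chosen for illustration.

```python
def score_to_cosine(score):
    """Invert _score = (1 + cosine) / 2 to recover the raw cosine similarity."""
    return 2 * score - 1


def filter_hits(hits, min_score):
    """Keep only the hits whose _score reaches the threshold."""
    return [hit for hit in hits if hit["_score"] >= min_score]


# Hand-written sample shaped like result.body["hits"]["hits"].
sample_hits = [
    {"_score": 0.91, "_source": {"title": "Black Holes"}},
    {"_score": 0.62, "_source": {"title": "Comets"}},
]

kept = filter_hits(sample_hits, min_score=0.75)
for hit in kept:
    print(hit["_source"]["title"], hit["_score"], score_to_cosine(hit["_score"]))
```

Recent Elasticsearch versions also accept a `similarity` value inside the `knn` clause to apply a cut-off on the server; check the docs for your version before relying on it. A good threshold depends on the model and the data, so it is worth tuning on a handful of real queries.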