## Documentation

To read more about kNN search, check out the docs [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html).

![knn_search_docs](../images/knn_search_docs.png)

## Connect to Elasticsearch

In [None]:
from pprint import pprint
from elasticsearch import Elasticsearch

HOST = "http://localhost:9200"

es = Elasticsearch(hosts=HOST)
client_info = es.info()
print("Connected to Elasticsearch!")
pprint(client_info.body)

Connected to Elasticsearch!
{'cluster_name': 'docker-cluster',
 'cluster_uuid': '9W_0LZIIR4aoxR46wiPMVA',
 'name': '07d95b8d2eee',
 'tagline': 'You Know, for Search',
 'version': {'build_date': '2025-02-05T22:10:57.067596412Z',
             'build_flavor': 'default',
             'build_hash': '747663ddda3421467150de0e4301e8d4bc636b0c',
             'build_snapshot': False,
             'build_type': 'docker',
             'lucene_version': '9.12.0',
             'minimum_index_compatibility_version': '7.0.0',
             'minimum_wire_compatibility_version': '7.17.0',
             'number': '8.17.2'}}


## Preparing the index

We create the index with a field of type `dense_vector` to store the embeddings. Its `dims` value must match the embedding size of the model we use, which is 384 here.

In [35]:
INDEX = "my_index"

settings = {
    "index": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    }
}

mappings = {
    "properties": {
        "embedding": {
            "type": "dense_vector",
            "dims": 384
        }
    }
}

es.indices.delete(index=INDEX, ignore_unavailable=True)
es.indices.create(index=INDEX, mappings=mappings, settings=settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'my_index'})
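
The mapping above sets only `dims` and relies on defaults for everything else. As a hedged sketch, the helper below (the name `build_vector_mapping` is hypothetical, not part of the Elasticsearch client) spells out the `index` and `similarity` options explicitly. It assumes a recent 8.x release such as the 8.17.2 shown earlier, where a `dense_vector` field is indexed for kNN search and uses `cosine` similarity unless told otherwise.

```python
def build_vector_mapping(dims, similarity="cosine"):
    """Return index mappings with one dense_vector field named 'embedding'.

    Hypothetical helper for illustration only.
    """
    # Similarity metrics accepted by dense_vector fields in Elasticsearch 8.x.
    allowed = {"cosine", "dot_product", "l2_norm", "max_inner_product"}
    if similarity not in allowed:
        raise ValueError(f"Unsupported similarity: {similarity}")
    return {
        "properties": {
            "embedding": {
                "type": "dense_vector",
                "dims": dims,              # must match the model's embedding size
                "index": True,             # make the field searchable with kNN
                "similarity": similarity,  # metric used to rank neighbors
            }
        }
    }


explicit_mappings = build_vector_mapping(384)
print(explicit_mappings)
```

The resulting dictionary could be passed as `mappings=explicit_mappings` to `es.indices.create` in place of the shorter version.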

## Embedding model

![all-MiniLM-L6-v2_model](../images/all-MiniLM-L6-v2_model.png)

I chose the `all-MiniLM-L6-v2` model for its speed, compact size, and versatility as a general-purpose model. It produces embeddings of dimension `384` and truncates input longer than `256` word pieces (tokens, not words). The model is very popular in the community, with almost `50M` downloads in one month.

To download and use this model, Hugging Face offers a Python package called `sentence-transformers`, which simplifies computing dense vector representations of text.

In [36]:
import torch
from sentence_transformers import SentenceTransformer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = SentenceTransformer("all-MiniLM-L6-v2").to(device)
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
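
The last module in the printed model is `Normalize()`, which means the embeddings come out with unit length. The sketch below uses plain Python lists and made-up two-dimensional vectors (not real model output) to illustrate why that matters: for unit-length vectors, the dot product and the cosine similarity are the same number.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vector, vector))
    if norm == 0:
        raise ValueError("Cannot normalize a zero vector")
    return [x / norm for x in vector]


def cosine(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors chosen for illustration only.
a = [3.0, 4.0]
b = [1.0, 2.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# A normalized vector has length 1.
print(math.isclose(dot(a_unit, a_unit), 1.0))
# The dot product of the normalized vectors equals the cosine of the originals.
print(math.isclose(dot(a_unit, b_unit), cosine(a, b)))
```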

## Load documents

In [37]:
import json

# Use a context manager so the file handle is closed after loading
with open("../data/astronomy.json") as f:
    documents = json.load(f)
documents

[{'id': 1,
  'title': 'The Solar System',
  'content': 'The Solar System consists of the Sun and the objects that orbit it, including eight planets, their moons, dwarf planets, and countless small bodies like asteroids and comets.'},
 {'id': 2,
  'title': 'Black Holes',
  'content': 'A black hole is a region of space where the gravitational pull is so strong that nothing, not even light, can escape from it. They are formed when massive stars collapse under their own gravity.'},
 {'id': 3,
  'title': 'Galaxies',
  'content': 'Galaxies are vast systems that consist of stars, stellar remnants, interstellar gas, dust, and dark matter. The Milky Way is the galaxy that contains our Solar System.'},
 {'id': 4,
  'title': 'The Big Bang Theory',
  'content': 'The Big Bang Theory is the leading explanation about how the universe began. It suggests that the universe was once in an extremely hot and dense state and has been expanding ever since.'},
 {'id': 5,
  'title': 'Exoplanets',
  'content': 

## Embed documents

In [38]:
from tqdm import tqdm
from pprint import pprint

def get_embedding(text):
    """Encode text into a 384-dimensional vector with the loaded model."""
    return model.encode(text)


operations = []
for document in tqdm(documents, total=len(documents)):
    operations.append({"index": {"_index": INDEX}})
    operations.append({
        **document,
        "embedding": get_embedding(document["content"])
    })

response = es.bulk(operations=operations)
pprint(response.body)

100%|██████████| 10/10 [00:00<00:00, 54.24it/s]


{'errors': False,
 'items': [{'index': {'_id': 'FfRJKZUB1cKk6Stc5qAE',
                      '_index': 'my_index',
                      '_primary_term': 1,
                      '_seq_no': 0,
                      '_shards': {'failed': 0, 'successful': 1, 'total': 1},
                      '_version': 1,
                      'result': 'created',
                      'status': 201}},
           {'index': {'_id': 'FvRJKZUB1cKk6Stc5qAG',
                      '_index': 'my_index',
                      '_primary_term': 1,
                      '_seq_no': 1,
                      '_shards': {'failed': 0, 'successful': 1, 'total': 1},
                      '_version': 1,
                      'result': 'created',
                      'status': 201}},
           {'index': {'_id': 'F_RJKZUB1cKk6Stc5qAG',
                      '_index': 'my_index',
                      '_primary_term': 1,
                      '_seq_no': 2,
                      '_shards': {'failed': 0, 'successful': 1, '
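
The bulk response reports a top-level `errors` flag plus one entry per operation. Rather than reading the full output by eye, a small check like the one below can list only the failed items. This is a minimal sketch: `failed_bulk_items` is a hypothetical helper, and `sample_response` is a hand-written stand-in, not output from this notebook.

```python
def failed_bulk_items(bulk_response):
    """Return (position, error) pairs for items that failed in a bulk response."""
    failures = []
    for position, item in enumerate(bulk_response["items"]):
        # Each item is keyed by its action type (index, create, update, delete).
        for result in item.values():
            if "error" in result:
                failures.append((position, result["error"]))
    return failures


# Hand-written example with one success and one failure (illustrative only).
sample_response = {
    "errors": True,
    "items": [
        {"index": {"_id": "a", "status": 201, "result": "created"}},
        {"index": {"_id": "b", "status": 400,
                   "error": {"type": "mapper_parsing_exception",
                             "reason": "example failure"}}},
    ],
}

for position, error in failed_bulk_items(sample_response):
    print(f"Operation {position} failed: {error['type']}")
```

Calling `failed_bulk_items(response.body)` on the response above should return an empty list, since `errors` is `False`.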

We indexed all documents with an additional field `embedding`. Let's retrieve the documents to verify that the text was converted to a dense vector.

In [39]:
response = es.search(
    index=INDEX,
    query={
        "match_all": {}
    }
)

pprint(response.body)

{'_shards': {'failed': 0, 'skipped': 0, 'successful': 1, 'total': 1},
 'hits': {'hits': [{'_id': 'FfRJKZUB1cKk6Stc5qAE',
                    '_index': 'my_index',
                    '_score': 1.0,
                    '_source': {'content': 'The Solar System consists of the '
                                           'Sun and the objects that orbit it, '
                                           'including eight planets, their '
                                           'moons, dwarf planets, and '
                                           'countless small bodies like '
                                           'asteroids and comets.',
                                'embedding': [0.04063341021537781,
                                              -0.0025617973878979683,
                                              0.05483473464846611,
                                              0.009171051904559135,
                                              0.031219912692904472,
           

Awesome! We successfully inserted the documents with the additional `embedding` field. Now, let’s check the mapping to confirm that the dimension of the dense vector is 384.
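
One way to perform that check is to read the field definition out of the mapping response. The sketch below is hedged: `get_vector_dims` is a hypothetical helper, and `sample_mapping` is a simplified, hand-written stand-in for what `es.indices.get_mapping(index=INDEX).body` returns (a real response would carry more fields and options).

```python
def get_vector_dims(mapping_response, index, field):
    """Extract the 'dims' value of a dense_vector field from a mapping response."""
    properties = mapping_response[index]["mappings"]["properties"]
    return properties[field]["dims"]


# Simplified stand-in for es.indices.get_mapping(index=INDEX).body
sample_mapping = {
    "my_index": {
        "mappings": {
            "properties": {
                "content": {"type": "text"},
                "embedding": {"type": "dense_vector", "dims": 384},
            }
        }
    }
}

dims = get_vector_dims(sample_mapping, "my_index", "embedding")
print(f"Embedding dimension: {dims}")
```

Against the live cluster, the same helper would be called as `get_vector_dims(es.indices.get_mapping(index=INDEX).body, INDEX, "embedding")`.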

## kNN search

### 1. Query N°1

In [40]:
query = "What is a black hole?"
embedded_query = get_embedding(query)

result = es.search(
    index=INDEX,
    knn={
        "field": "embedding",
        "query_vector": embedded_query,
        "num_candidates": 5,
        "k": 3,
    }
)

n_documents = result["hits"]["total"]["value"]
print(f"Found {n_documents} documents")

Found 3 documents


In [41]:
hits = result.body["hits"]["hits"]
for hit in hits:
    print(f"Title  : {hit['_source']['title']}")
    print(f"Content: {hit['_source']['content']}")
    print(f"Score  : {hit['_score']}")
    print("*"*100)

Title  : Black Holes
Content: A black hole is a region of space where the gravitational pull is so strong that nothing, not even light, can escape from it. They are formed when massive stars collapse under their own gravity.
Score  : 0.88633347
****************************************************************************************************
Title  : Dark Matter
Content: Dark matter is a type of matter that does not emit light or energy. It cannot be observed directly but is believed to make up about 27% of the universe's total mass and energy.
Score  : 0.6618402
****************************************************************************************************
Title  : Galaxies
Content: Galaxies are vast systems that consist of stars, stellar remnants, interstellar gas, dust, and dark matter. The Milky Way is the galaxy that contains our Solar System.
Score  : 0.6431341
****************************************************************************************************
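
The scores above are not raw cosine values. Assuming the `embedding` field uses the `cosine` similarity (the default in recent 8.x releases; the mapping in this notebook does not set it explicitly), Elasticsearch reports `_score = (1 + cosine) / 2`, so that scores fall between 0 and 1. The sketch below inverts that formula for the three scores printed above; the helper names are hypothetical.

```python
def cosine_from_score(score):
    """Recover the raw cosine similarity from a kNN _score (cosine metric assumed)."""
    return 2 * score - 1


def score_from_cosine(cosine):
    """Map a cosine similarity in [-1, 1] to the _score range [0, 1]."""
    return (1 + cosine) / 2


# Titles and scores copied from the output of the first query.
results = [
    ("Black Holes", 0.88633347),
    ("Dark Matter", 0.6618402),
    ("Galaxies", 0.6431341),
]

for title, score in results:
    print(f"{title}: cosine similarity of about {cosine_from_score(score):.4f}")
```

Seen this way, the gap between the first hit and the other two is larger than the `_score` values alone suggest.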


### 2. Query N°2

In [42]:
query = "How do we find exoplanets?"
embedded_query = get_embedding(query)

result = es.search(
    index=INDEX,
    knn={
        "field": "embedding",
        "query_vector": embedded_query,
        "num_candidates": 5,
        "k": 1
    }
)

n_docs = result["hits"]["total"]["value"]
print(f"Found {n_docs} documents")

Found 1 documents


In [43]:
hits = result.body["hits"]["hits"]
for hit in hits:
    print(f"Title  : {hit['_source']['title']}")
    print(f"Content: {hit['_source']['content']}")
    print(f"Score  : {hit['_score']}")
    print("*"*100)

Title  : Exoplanets
Content: Exoplanets, or extrasolar planets, are planets that exist outside our solar system. They vary greatly in size and composition and are often found using methods like the transit method and radial velocity.
Score  : 0.85656023
****************************************************************************************************


In both queries, the top-scoring document is the one that answers the question. The other results returned by the k-nearest neighbors (kNN) search are related to the query but less relevant, as their lower scores show. To refine the results further, you can set a minimum score threshold so that only sufficiently similar documents are returned and unrelated ones are excluded.
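
Two ways to apply such a threshold are sketched below, both as assumptions rather than a definitive recipe. The first builds a `knn` clause with the optional `similarity` parameter, which in Elasticsearch 8.x sets a minimum on the raw similarity (not on `_score`). The second filters hits on the client side by `_score`. The helper names, the toy query vector, and the cutoff values are hypothetical; the sample hits reuse the titles and scores from the first query.

```python
def build_knn_clause(query_vector, k, num_candidates,
                     min_similarity=None, field="embedding"):
    """Build the knn clause for es.search, optionally with a similarity floor."""
    knn = {
        "field": field,
        "query_vector": query_vector,
        "k": k,
        "num_candidates": num_candidates,
    }
    if min_similarity is not None:
        # Applies to the raw similarity (e.g. cosine), not to the document _score.
        knn["similarity"] = min_similarity
    return knn


def filter_hits_by_score(hits, min_score):
    """Keep only hits whose _score reaches the threshold (client-side filter)."""
    return [hit for hit in hits if hit["_score"] >= min_score]


# Toy three-dimensional vector; a real query vector would have 384 values.
knn_clause = build_knn_clause([0.1, 0.2, 0.3], k=3, num_candidates=5,
                              min_similarity=0.5)
print(knn_clause)

# Hits shaped like result.body["hits"]["hits"], reduced to the relevant keys.
sample_hits = [
    {"_score": 0.88633347, "_source": {"title": "Black Holes"}},
    {"_score": 0.6618402, "_source": {"title": "Dark Matter"}},
    {"_score": 0.6431341, "_source": {"title": "Galaxies"}},
]
kept = filter_hits_by_score(sample_hits, min_score=0.7)
print([hit["_source"]["title"] for hit in kept])
```

With a live cluster, the clause would be passed as `es.search(index=INDEX, knn=knn_clause)`. A suitable cutoff depends on the model and the data, so it is worth tuning on real queries.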