# Tutorial: Elasticsearch & VectorDB

This notebook aims to introduce the basic concepts of **Elasticsearch** as well as direct applications using **VectorDB**.

Elasticsearch is an open-source project that you can install locally on your own machines. This has the advantage of not requiring an internet connection and keeping your data in an environment you control.  
However, it is often more cost-effective to use a service provided by Elasticsearch to host such a database.

For this tutorial, two options are available to run Elasticsearch and a data-visualization dashboard (Kibana):

 - **Option 1 (recommended)**: launch an Elasticsearch cluster with Kibana locally using the `.env` and `docker-compose.yml` files:

```shell
cd [this folder]
docker compose up
```
(If needed, see https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html)

 - **Option 2**: use the Elasticsearch serverless trial version, which allows you to quickly get an Elasticsearch database connected to Kibana. To do this, go to https://cloud.elastic.co/registration and create an account. Once your account is created, you can create a deployment, which will give you an endpoint (e.g. https://10468406azdad...:443) as well as an API key.


Recommended setup:
* This notebook uses some common libraries (`numpy`, `sklearn`, `plotly`, etc.) as well as less common ones (`elasticsearch`). It is recommended to use a virtual environment to run it via:

```bash
python3 -m venv venv
source venv/bin/activate
```

In [None]:
import urllib3
# to disable certificate warnings
urllib3.disable_warnings()

from elasticsearch import Elasticsearch
from openTSNE import TSNE

Whichever option you choose, set the endpoint and the API key in the `ENDPOINT` and `API_KEY` variables, respectively, to connect to your database:

In [None]:
ENDPOINT = "https://localhost:9200"
API_KEY = "TO_COMPLETE" # http://localhost:5601/app/management/security/api_keys/

In [None]:
# connection test
es = Elasticsearch(ENDPOINT, api_key=API_KEY, verify_certs=False)

# API key should have cluster monitor rights
es.info()

## I. What do we want to store in our database?

VectorDBs allow storing vectors. These vectors are most often the results of a document *embedding* step performed before indexing in Elasticsearch.

This *embedding* step is carried out by **dedicated models** (usually by extracting a layer from a deep neural network) and can be computationally expensive.

Here, we will first define some functions that will allow us to work with vectors using the standard libraries dedicated to vector manipulation.

As we saw earlier, certain algorithms are regularly used for indexing and similarity calculations of our documents.

The framework we are using is as follows:
* we want to index movies in order to **recommend** films that users might like
* we have a list of movies associated with a **pre-calculated 3-dimensional embedding** produced by a model

Note: in reality, embedding spaces are larger (e.g., 512 dimensions) to capture more information. In this lab, we will stick to 3 dimensions to make it easier to visualize the results.


### I.1 Dataset and visualization

In [None]:
import numpy as np

# our documents are stored in a dict
movies = {
    "Inception": np.array([0.12, 0.85, 0.34]),
    "Interstellar": np.array([0.75, 4.0, 2.0]),
    "Black sheep": np.array([-0.5, 0.5, -0.9]),
    "The Dark Knight": np.array([0.90, 0.70, 0.10]),
    "La grande vadrouille": np.array([-0.4, -0.87, 0.52]),
    "Sharknado 6": np.array([-0.6, 0.3, -0.7]),
}

To understand the data we are working with, let's define a function to visualize our data in 3D.


In [None]:
import numpy as np
import plotly.graph_objects as go

def plot_3d_vectors(vectors, labels=None, title="3D Vector Visualization"):
    """
        Visualizes 3D vectors with Plotly.

        :param vectors: List or NumPy array of 3D vectors (shape Nx3).  
        :param labels: List of labels for the vectors (optional).  
        :param title: Title of the plot.
    """
    if not isinstance(vectors, np.ndarray):
        vectors = np.array(vectors)

    if vectors.shape[1] != 3:
        raise ValueError("Each vector must have 3 dimensions.")

    fig = go.Figure()

    for i, vector in enumerate(vectors):
        fig.add_trace(go.Scatter3d(
            x=[0,vector[0]], y=[0, vector[1]], z=[0, vector[2]],
        ))

        # Add labels if requested
        if labels:
            fig.add_trace(go.Scatter3d(
                x=[vector[0]], y=[vector[1]], z=[vector[2]],
                mode='text',
                text=[labels[i]],
                textposition='top center',
                textfont=dict(color='red', size=12)
            ))

    # Axis conf
    max_val = np.max(np.abs(vectors)) * 1.2
    fig.update_layout(
        scene=dict(
            xaxis=dict(range=[-max_val, max_val], title="X"),
            yaxis=dict(range=[-max_val, max_val], title="Y"),
            zaxis=dict(range=[-max_val, max_val], title="Z"),
        ),
        title=title,
        margin=dict(l=0, r=0, b=0, t=50),
    )

    # plot
    fig.show()

movie_embeddings = np.array(list(movies.values()))
movie_names = list(movies.keys())
plot_3d_vectors(movie_embeddings, labels=movie_names, title="My top films")

### I.2 Definition of the most common algorithms

In this section, we define the most common algorithms and calculation methods used to estimate the similarity between two vectors:
* cosine similarity
* L2 norm
* k-NN
* hashing


The cosine similarity is defined as $$\cos(\theta_{\textbf{v},\textbf{w}}) = \frac{\textbf{v} \cdot \textbf{w}}{\Vert \textbf{v}\Vert_{2} \, \Vert \textbf{w}\Vert_{2}}$$

In [None]:
from numpy.linalg import norm

def cosine_similarity(vector_1, vector_2):
    return np.dot(vector_1, vector_2) / (norm(vector_1) * norm(vector_2))

# computation of the cosine similarity with a vector we query
query_vector = movies["Inception"]
# query_vector = np.array([0.47, -0.53, 0.95])
for movie_name, movie_embedding in zip(movie_names, movie_embeddings):
    cosine_results = cosine_similarity(query_vector, movie_embedding)
    print(f"Cosine Similarity with {movie_name}:", cosine_results)

*(Optional): How can we vectorize the calculation of cosine similarity?*
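
One possible answer, given as a minimal sketch rather than the reference solution: stack the embeddings in a matrix, compute all dot products with a single matrix-vector product, and divide by the row norms. The name `cosine_similarity_batch` and the `demo_embeddings` array are introduced here for illustration only; the values are copied from the `movies` dict so the block runs on its own.

```python
import numpy as np

def cosine_similarity_batch(query, embeddings):
    """Cosine similarity between one query (shape (d,)) and each row of embeddings (shape (n, d))."""
    query = np.asarray(query, dtype=float)
    embeddings = np.asarray(embeddings, dtype=float)
    dots = embeddings @ query                                   # all dot products at once, shape (n,)
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
    return dots / norms

# values copied from the movies dict (illustrative subset)
demo_embeddings = np.array([
    [0.12, 0.85, 0.34],   # Inception
    [0.75, 4.0, 2.0],     # Interstellar
    [0.90, 0.70, 0.10],   # The Dark Knight
])
similarities = cosine_similarity_batch(demo_embeddings[0], demo_embeddings)
print(similarities)
```

The loop over movies disappears: NumPy performs the `n` dot products and norms in compiled code, which matters once the database holds many vectors.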

#### k-NN: k-Nearest Neighbours

The goal is to retrieve the k nearest neighbors using the L2 norm:

In [None]:
def knn(query_embedding, embeddings, k=2, labels=None):
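    """
    Returns the k nearest neighbours of `query_embedding` (Euclidean distance).

    Args:
        query_embedding (np.ndarray): vector to query
        embeddings (np.ndarray): vectors already in the database (shape (n, 3))
        k (int, optional): number of neighbors. Defaults to 2.
        labels (list, optional): one label per row of `embeddings`. Defaults to empty strings.

    Returns:
        List: (label, distance) tuples for the k nearest neighbors, sorted by increasing distance
    """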
    if labels is None:
        labels = [""] * embeddings.shape[0]

    distances = []
    for embedding, title in zip(embeddings, labels):
        dist = np.linalg.norm(embedding - query_embedding)  # euclidean distance
        distances.append((title, dist))
    # sort by increasing distance and select the k nearest neighbours
    return sorted(distances, key=lambda x: x[1])[:k]

# find the 2 closest films
nearest_neighbors = knn(query_vector, movie_embeddings, k=2, labels=movie_names)
print("k-NN Results:", nearest_neighbors)

#### Locality Sensitive Hashing

When working with embeddings in high (or very high) dimensions, it can be useful to **represent them in a smaller space**.

The idea of LSH is to **assign similar hashes with high probability** to nearby embeddings.

To do this, we generate $k$ random hyperplanes and, for each vector in our database, we check on which side of each hyperplane it lies (positive or negative). This defines a hash composed of 1s and 0s.

Concretely, a hyperplane is defined by a normal vector $\textbf{n}$:  
$$H = \{\textbf{x} \in \mathbb{R}^d, \textbf{x} \cdot \textbf{n} = 0\}$$

Thus, for the $i$-th hyperplane (with $1 \leq i \leq k$), defined by the normal vector $\textbf{n}_i$, if $\textbf{x} \cdot \textbf{n}_i \geq 0$, we assign the value 1 to the $i$-th coordinate of the hash vector, and 0 otherwise.


In [None]:
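# 8 random hyperplanes in 3D: each column is a normal vector with components in [-1, 1)
random_hyperplanes_vectors = np.random.rand(3, 8) * 2 - 1

def get_embedding(vector, hyperplanes_vectors):
    """Returns the binary hash of `vector`: one bit per hyperplane."""
    # the sign of the dot product with each normal vector tells on which side the vector lies
    dot_product = vector @ hyperplanes_vectors
    return (dot_product >= 0).astype(int)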

for vector, movie_name in zip(movie_embeddings, movie_names):
    print(movie_name, get_embedding(vector, random_hyperplanes_vectors))
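
To see why these hashes preserve locality, the sketch below compares the fraction of differing bits between two hashes with the angle between the vectors. It assumes normals drawn from an isotropic Gaussian (a different draw from the uniform one used in the cell above); under that assumption, two vectors separated by an angle $\theta$ fall on opposite sides of a random hyperplane with probability $\theta / \pi$. The helper names `sign_hash` and `angle_between` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)                    # fixed seed so the sketch is reproducible
n_planes = 4096                                   # many planes, so the estimate is stable
normals = rng.standard_normal((3, n_planes))      # Gaussian normals (assumption, see above)

def sign_hash(vector, normals):
    """1 where the vector lies on the positive side of a hyperplane, else 0."""
    return (vector @ normals >= 0).astype(int)

def angle_between(v, w):
    """Angle in radians between two vectors."""
    cos = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
    return np.arccos(np.clip(cos, -1.0, 1.0))

# values copied from the movies dict
inception = np.array([0.12, 0.85, 0.34])
interstellar = np.array([0.75, 4.0, 2.0])
sharknado = np.array([-0.6, 0.3, -0.7])

for name, other in [("Interstellar", interstellar), ("Sharknado 6", sharknado)]:
    differing = np.mean(sign_hash(inception, normals) != sign_hash(other, normals))
    expected = angle_between(inception, other) / np.pi
    print(f"{name}: fraction of differing bits = {differing:.3f}, angle / pi = {expected:.3f}")
```

Vectors that point in nearly the same direction disagree on very few bits, so comparing short binary hashes gives a cheap proxy for angular similarity.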

In [None]:
import numpy as np
from collections import defaultdict

class LocalitySensitiveHashing:
    def __init__(self, num_hashes=10, dimensions=100):
        """
        Initializes the LSH algorithm.

        :param num_hashes: Number of hash functions to use.
        :param dimensions: Dimensions of the input vectors.
        """

        self.num_hashes = num_hashes
        self.dimensions = dimensions
        # generates random hyperplanes for the hash functions
        self.hash_planes = (np.random.rand(num_hashes, dimensions) * 2) - 1
        self.hash_tables = defaultdict(list)

    def hash_vector(self, vector):
        """
        Hashes a given vector according to the generated hyperplanes.

        :param vector: Input vector (numpy array).
        :return: A tuple representing the binary signature of the vector.
        """

        return tuple(
            (np.dot(vector, plane) >= 0).astype(int) for plane in self.hash_planes
        )

    def add_vector(self, vector, label):
        """
        Adds a vector to the hash tables.

        :param vector: Vector to index.
        :param label: Identifier or label associated with the vector.
        """

        hash_key = self.hash_vector(vector)
        self.hash_tables[hash_key].append(label)

    def query(self, vector):
        """
        Searches for approximate neighbors for a given vector.

        :param vector: Vector to search for.
        :return: List of labels associated with potential neighbors.
        """

        hash_key = self.hash_vector(vector)
        # .get avoids creating an empty bucket in the defaultdict when nothing matches
        return self.hash_tables.get(hash_key, [])


# Creates a LSH instance
lsh = LocalitySensitiveHashing(num_hashes=5, dimensions=3)

# add vectors with labels
vectors = {
    "A": np.array([1, 2, 3]),
    "B": np.array([4, 5, 6]),
    "C": np.array([-1, -2, -1]),
    "D": np.array([5, 5, 5]),
}

for label, vector in vectors.items():
    lsh.add_vector(vector, label)

# look for approximate nearest neighbours
# note: this reassigns `query_vector`, which the Elasticsearch queries in section II reuse
query_vector = np.array([1, 1, 1])
neighbors = lsh.query(query_vector)
print(f"Neighbors of {query_vector}: {neighbors}")

## II. Indexing in Elasticsearch

After reviewing the classic algorithms used for similarity calculation and document indexing, we can now see how to use them in practice with Elasticsearch.

First, we need to create an index that will store our documents. This index will contain the movie titles as well as the associated embeddings.

### II.1 Index with euclidean distance

In [None]:
# creation of an index
index_name = "movies"
es.options(ignore_status=400).indices.create(
    index=index_name,
    body={
        "mappings": {
            "properties": {
                "title": {"type": "text"},
                "vector": {
                    "type": "dense_vector",
                    "dims": 3,
                    "index": True,
                    "similarity": "l2_norm",
                },  # index is set to True to allow the use of knn search queries
            }
        }
    }
)

for title, vector in movies.items():
    es.index(index=index_name, document={"title": title, "vector": vector.tolist()})

You can verify that the index has been successfully created by going to the Kibana interface, navigating to `Stack Management` at the bottom of the left sidebar, and then to `Data => Index Management`.

Our index should be visible and contain our documents corresponding to the movies.

**Objective:** Given a vector corresponding to a movie that a user liked, how can we find, via an Elasticsearch query, the closest movie in terms of Euclidean distance in the embedding space?


In [None]:
# query definition
query = {
    "knn": {
        "field": "vector",
        "query_vector": query_vector.tolist(),
        "k": 3,
        "num_candidates": 3,
    },
    "_source": ["title"],
}

response = es.search(index=index_name, body=query)
print("Elasticsearch k-NN Results:", response["hits"]["hits"])

In [None]:
def display_nearest_neighbors(results):
    print("\n🔍 **Closest neighbours** 🔍\n")
    for i, result in enumerate(results, start=1):
        title = result['_source'].get('title', 'Unknown title')
        score = result.get('_score', 0)
        index = result.get('_index', 'Unknown index')
        doc_id = result.get('_id', 'Unknown ID')

        print(f"🎬 **Neighbour {i}**")
        print(f"   📁 Index : {index}")
        print(f"   📜 Document ID : {doc_id}")
        print(f"   📌 Title : {title}")
        print(f"   ⭐ Similarity score : {score:.6f}\n")

display_nearest_neighbors(response["hits"]["hits"])

### II.2 Index with Cosine Similarity

In the previous section, the index mapping set `"similarity": "l2_norm"`, so Elasticsearch compared vectors using Euclidean distance. The available similarity options are listed in the documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html#dense-vector-similarity.

Now, we want to run the same search using cosine similarity instead.

In [None]:
# creation of an index
index_name = "movies_cosine"
es.options(ignore_status=400).indices.create(
    index=index_name,
    body={
        "mappings": {
            "properties": {
                "title": {"type": "text"},
                "vector": {
                    "type": "dense_vector",
                    "dims": 3,
                    "index": True,
                    "similarity": "cosine",
                },  # index is set to True to allow the use of knn search queries
            }
        }
    }
)

for title, vector in movies.items():
    es.index(index=index_name, document={"title": title, "vector": vector.tolist()})

In [None]:
# NN using cosine similarity
query = {
    "knn": {
        "field": "vector",
        "query_vector": query_vector.tolist(),
        "k": 3,
        "num_candidates": 3,
    },
    "_source": ["title"],
}

response = es.search(index=index_name, body=query)
display_nearest_neighbors(response["hits"]["hits"])

In [None]:
plot_3d_vectors(np.concatenate([movie_embeddings, query_vector[None, :]]), labels=movie_names + ["query_vector"])

The results are not the same! Depending on the similarity calculation method, the returned documents differ, so it is important to know which embedding model produced the vectors in our database and which similarity measure it was designed for.

Here, the embeddings of the movies `Inception` and `Interstellar` are very close in terms of cosine similarity but far apart in Euclidean distance. It may therefore be appropriate to normalize the embeddings before inserting them.
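
The effect of normalization can be checked with a minimal NumPy sketch (independent of Elasticsearch). For unit-length vectors, $\Vert \textbf{v} - \textbf{w}\Vert_2^2 = 2 - 2\cos(\theta_{\textbf{v},\textbf{w}})$, so ranking by Euclidean distance and ranking by cosine similarity give the same order. The helper `l2_normalize` is a name chosen for this illustration, and the vectors are a subset copied from the `movies` dict, with `Inception` used as the query.

```python
import numpy as np

def l2_normalize(vectors):
    """Scale each row to unit Euclidean length."""
    vectors = np.asarray(vectors, dtype=float)
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

titles = ["Inception", "Interstellar", "The Dark Knight", "Sharknado 6"]
raw = np.array([
    [0.12, 0.85, 0.34],
    [0.75, 4.0, 2.0],
    [0.90, 0.70, 0.10],
    [-0.6, 0.3, -0.7],
])

# ranking on the raw vectors, by Euclidean distance to Inception
rank_raw_l2 = np.argsort(np.linalg.norm(raw - raw[0], axis=1))

# ranking on the normalized vectors
unit = l2_normalize(raw)
cosines = unit @ unit[0]                                   # rows are unit length, so dot product = cosine
l2_distances = np.linalg.norm(unit - unit[0], axis=1)
rank_by_cosine = np.argsort(-cosines)                      # most similar first
rank_by_l2 = np.argsort(l2_distances)                      # closest first

print("raw, Euclidean:       ", [titles[i] for i in rank_raw_l2])
print("normalized, cosine:   ", [titles[i] for i in rank_by_cosine])
print("normalized, Euclidean:", [titles[i] for i in rank_by_l2])
```

On the raw vectors, `Interstellar` is ranked last because of its large norm; after normalization both measures agree on the order.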

The similarity measures supported by Elasticsearch are documented here: https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html#dense-vector-similarity
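
That documentation page also describes how each similarity is turned into the `_score` displayed in the results: $1 / (1 + d^2)$ for `l2_norm`, where $d$ is the Euclidean distance, and $(1 + \cos\theta) / 2$ for `cosine`. The sketch below recomputes these values with NumPy so they can be compared with the scores printed earlier. The function names are hypothetical helpers, not part of the `elasticsearch` client, and small deviations are possible because the k-NN search is approximate; check the page for your Elasticsearch version.

```python
import numpy as np

def expected_l2_norm_score(query, vector):
    """Score documented for similarity="l2_norm": 1 / (1 + squared Euclidean distance)."""
    diff = np.asarray(query, dtype=float) - np.asarray(vector, dtype=float)
    return 1.0 / (1.0 + float(np.dot(diff, diff)))

def expected_cosine_score(query, vector):
    """Score documented for similarity="cosine": (1 + cosine similarity) / 2."""
    q = np.asarray(query, dtype=float)
    v = np.asarray(vector, dtype=float)
    cos = float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    return (1.0 + cos) / 2.0

# illustrative values: the query from the LSH demo and one movie embedding
demo_query = np.array([1.0, 1.0, 1.0])
dark_knight = np.array([0.90, 0.70, 0.10])

print("l2_norm score:", expected_l2_norm_score(demo_query, dark_knight))
print("cosine score: ", expected_cosine_score(demo_query, dark_knight))
```

Both scores lie in a range where higher means more similar, which is why the two indices can be queried with the same `knn` request even though the underlying measures differ.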

## III. And in reality...

We will now use a real embedding model (here, all-MiniLM-L6-v2) to convert sentences into 384-dimensional vectors, which we will then insert into Elasticsearch.  
To do this, you will need to create an account on Hugging Face (https://huggingface.co/) and then generate a token (https://huggingface.co/settings/tokens), which you will place in the `HF_TOKEN` variable.


In [None]:
MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
HF_TOKEN = "TO_COMPLETE" # https://huggingface.co/settings/tokens

import requests

api_url = f"https://router.huggingface.co/hf-inference/models/{MODEL_ID}/pipeline/feature-extraction"
headers = {"Authorization": f"Bearer {HF_TOKEN}"}

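def query(texts):
    """Returns one embedding (list of floats) per input text."""
    response = requests.post(api_url, headers=headers, json={"inputs": texts, "options": {"wait_for_model": True}})
    # fail early on HTTP errors (e.g. an invalid token) instead of returning an error payload
    response.raise_for_status()
    return response.json()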

texts = ["How do I get a replacement Medicare card?",
         "What is the monthly premium for Medicare Part B?",
         "How do I terminate my Medicare Part B (medical insurance)?",
         "How do I sign up for Medicare?",
         "Can I sign up for Medicare Part B if I am working and have health insurance through an employer?",
         "How do I sign up for Medicare Part B if I already have Part A?",
         "What are Medicare late enrollment penalties?",
         "What is Medicare and who can get it?",
         "How can I get help with my Medicare Part A and Part B premiums?",
         "What are the different parts of Medicare?",
         "Will my Medicare premiums be higher because of my higher income?",
         "What is TRICARE ?",
         "Should I sign up for Medicare Part B if I have Veterans' Benefits?"]

output = query(texts)

nb_dim = len(output[0])

We have obtained the vectors for the sentences using the embedding model. We can now insert them into Elasticsearch in a new index called "sentences".

In [None]:
# creation of an index
sentences = {sentence: vectors for sentence, vectors in zip(texts, output)}

index_name = "sentences"
es.options(ignore_status=400).indices.create(
    index=index_name,
    body={
        "mappings": {
            "properties": {
                "title": {"type": "text"},
                "vector": {
                    "type": "dense_vector",
                    "dims": nb_dim,
                    "index": True,
                    "similarity": "cosine",
                },  # index is set to True to allow the use of knn search queries
            }
        }
    }
)

# index the documents
for title, vector in sentences.items():
    es.index(index=index_name, document={"title": title, "vector": vector})

We can now write a "query" message, which, once converted into a vector, allows us to find the sentences that are semantically closest.

In [None]:
# vector similarity search
request = ["I am a veteran, how can I proceed to signup?"]
embedded_request = query(request)[0]

es_query = {
    "knn": {
        "field": "vector",
        "query_vector": embedded_request,
        "k": 3,
        "num_candidates": 3,
    },
    "_source": ["title"],
}

response = es.search(index=index_name, body=es_query)
display_nearest_neighbors(response["hits"]["hits"])

### 3D Visualization

It is possible to visualize high-dimensional vectors (here, 384 dimensions) in a way that humans can understand, provided they are reduced to 2 or 3 dimensions using a dimensionality reduction algorithm.

The t-SNE algorithm is a non-linear dimensionality reduction method; we use it here to reduce the vectors from 384 to 3 dimensions.


In [None]:
tsne = TSNE(n_components=3)
sentence_emb_3d = tsne.fit(np.concatenate([np.array(output), np.array([embedded_request])]))
sentence_emb_3d

In [None]:
plot_3d_vectors(sentence_emb_3d, labels=texts + [f"REQUEST: {request[0]}"])

## IV. Additional exercises

1) Design your own neural network to embed MNIST images in a smaller space.
- What could be a good basic architecture for this?
- Use PyTorch to implement it.
- Test your implementation by experimenting with your data.

2) To illustrate why graphs can be a useful tool for finding nearest neighbours, use a Delaunay triangulation to find nearest neighbours in 2D space.
- Design an experiment with 100,000 random 2D vectors
- Discover what a Delaunay triangulation is (understanding how it is related to a Voronoi diagram may help you)
- Compute a Delaunay triangulation on your random vectors
- Generate a new 2D point which we will use as a query
- Use either a spatial index or a walking algorithm to find the triangle containing this point
- Find the nearest neighbour in the triangle
- You can implement a BFS algorithm to retrieve the k nearest neighbours

Of course, this search is exact, but it should give you an overview of how graph-based ANN (approximate nearest neighbour) methods work in practice.
