## **Vector Search with Elasticsearch**

In [1]:
# Install necessary libraries
!pip install elasticsearch sentence-transformers scikit-learn



### **0. Setup Elasticsearch Index**
Before using Elasticsearch for vector search, we first need to create our index and load the provided CSV file (`monjoor_videos_data.csv`) into it. This step ensures that our data is correctly structured and available in Elasticsearch.
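
The index setup itself is not shown in this notebook, so here is a minimal sketch of what a mapping for vector search might look like. It assumes 384-dimensional embeddings (the output size of the `paraphrase-multilingual-MiniLM-L12-v2` model used later) and cosine similarity; the field names `title`, `video_id`, and `dense_vector_embedding` are taken from the queries further down, while the helper names `build_index_mappings` and `create_index_if_missing` are hypothetical.

```python
def build_index_mappings(dims=384, similarity="cosine"):
    """Return a mappings dict with a dense_vector field suitable for kNN search.

    Assumptions (not confirmed by the notebook): 384 dimensions and cosine
    similarity. Adjust both to match how the embeddings were produced.
    """
    return {
        "properties": {
            "title": {"type": "text"},
            "video_id": {"type": "keyword"},
            "dense_vector_embedding": {
                "type": "dense_vector",
                "dims": dims,
                "index": True,
                "similarity": similarity,
            },
        }
    }


def create_index_if_missing(client, index_name):
    """Create the index with the mappings above unless it already exists."""
    if not client.indices.exists(index=index_name):
        client.indices.create(index=index_name, mappings=build_index_mappings())


# Inspect the vector field definition without contacting Elasticsearch
mappings = build_index_mappings()
print(mappings["properties"]["dense_vector_embedding"])
```

Calling `create_index_if_missing(client, ELASTICSEARCH_INDEX_NAME)` once the client from the next step exists would apply this mapping; the CSV rows would then still need to be indexed separately.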

### **1. Setup Elasticsearch Connection**
We will establish a connection to our Elasticsearch instance, similar to what we did in the autocomplete example.

In [1]:
from elasticsearch import Elasticsearch
import os

# Connection parameters
ELASTICSEARCH_URL = "https://my-deployment-a0fcce.es.europe-west9.gcp.elastic-cloud.com"
ELASTICSEARCH_USER = "elastic"
# Read the password from the environment instead of hard-coding it in the notebook
ELASTICSEARCH_PASSWORD = os.environ["ELASTICSEARCH_PASSWORD"]
ELASTICSEARCH_INDEX_NAME = "monjoor-videos-index"

# Initialize Elasticsearch client
client = Elasticsearch(
    ELASTICSEARCH_URL,
    basic_auth=(ELASTICSEARCH_USER, ELASTICSEARCH_PASSWORD),
    verify_certs=False  # Skips TLS certificate verification; use only for demos
)

# Check connection
if not client.ping():
    print("Failed to connect to Elasticsearch.")
else:
    print("Successfully connected to Elasticsearch.")


Successfully connected to Elasticsearch.




### **2. Understanding Vector Search**
Instead of searching by keywords, we will use text embeddings to find similar content. 
For this, we will use the `sentence-transformers` library to generate sentence embeddings.

In [3]:
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import normalize

# Initialize the SentenceTransformer model
model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')

def generate_query_vector(query):
    """Generate a normalized vector embedding for the query"""
    embedding = model.encode(query)
    normalized_vector = normalize(embedding.reshape(1, -1))[0]
    return normalized_vector.tolist()

# Example query
query_text = "Comment apprendre le français rapidement ?"
query_vector = generate_query_vector(query_text)

print("Vector for query generated.")


Vector for query generated.


In [4]:
print("Query vector:", query_vector)

Query vector: [0.03366442397236824, 0.032618049532175064, -0.02173670381307602, -0.07629579305648804, -0.09351697564125061, -0.0447269007563591, -0.015092339366674423, -0.03182116523385048, 0.01970180682837963, 0.01358596421778202, -0.01480043213814497, 0.01662377081811428, 0.06201957166194916, -0.029805339872837067, -0.015435141511261463, -0.10301117599010468, 0.021160203963518143, 0.048832327127456665, -0.034069400280714035, -0.0762123316526413, -0.007653051987290382, 0.037223126739263535, 0.02218840830028057, 0.08640406280755997, 0.005123378708958626, 0.008747594431042671, 0.07989867031574249, 0.039716266095638275, 0.04616536572575569, -0.07540848106145859, 0.023965992033481598, 0.031206103041768074, -0.020098643377423286, -0.04776463657617569, -0.07599502056837082, 0.016761409118771553, -0.0005820008809678257, -0.09277081489562988, 0.049123410135507584, -0.029800990596413612, -0.023506173864006996, -0.0018425690941512585, -0.03609904646873474, 0.014629388228058815, 0.07286980003118
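
To see why the query vector is normalized, here is a small dependency-free sketch. Once two vectors have unit length, their dot product equals their cosine similarity, which is the quantity a vector search ranks by. The helper names `l2_normalize` and `dot` and the toy two-dimensional vectors are illustrative only; real embeddings from this model have 384 dimensions.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)  # A zero vector cannot be normalized; return it unchanged
    return [x / norm for x in vector]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])

print(dot(a, a))  # A unit vector compared with itself: approximately 1.0
print(dot(a, b))  # Cosine similarity between the two directions: approximately 0.96
```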

### **2.1 Converting the Embeddings from Text to Dense Vectors in Elasticsearch**
Up to this point, the embeddings we imported are stored as text: the import step converted them to strings automatically. To use them for vector search, we need to fetch every document, parse its embedding string into a list of floats, and write the result back to a dense vector field.

In [5]:
# convert the field transcript_embedding to a list of floats (currently a string)
def convert_embedding_to_list(embedding_str):
    # Remove the brackets and split by comma
    embedding_list = embedding_str.strip('[]').split(',')
    # Convert each element to float and return as a list
    return [float(x) for x in embedding_list]

# Fetch every document and parse its embedding, keyed by document ID so that
# each vector stays paired with the document it came from
def fetch_and_process_data(index_name):
    response = client.search(
        index=index_name,
        body={
            "query": {"match_all": {}},
            "_source": ["transcript_embedding"]
        },
        size=10000  # Adjust as needed; larger indices require pagination
    )

    embeddings = {}
    for hit in response['hits']['hits']:
        embedding_str = hit['_source'].get('transcript_embedding')
        if embedding_str:
            embeddings[hit['_id']] = convert_embedding_to_list(embedding_str)

    return embeddings

# Fetch and process data
embeddings = fetch_and_process_data(ELASTICSEARCH_INDEX_NAME)


# Write the parsed embedding back to the document as a list of floats
def update_document(index_name, doc_id, embedding):
    client.update(
        index=index_name,
        id=doc_id,
        body={"doc": {"dense_vector_embedding": embedding}}
    )

# Update each document using the ID it was fetched with
for doc_id, embedding in embeddings.items():
    update_document(ELASTICSEARCH_INDEX_NAME, doc_id, embedding)
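
The loop above sends one update request per document, which can be slow on larger indices. As a possible alternative, the sketch below batches the updates with `elasticsearch.helpers.bulk`. The names `build_update_actions`, `bulk_update_embeddings`, and `embeddings_by_id` (a dict mapping document IDs to embedding lists) are hypothetical and not part of the original notebook.

```python
def build_update_actions(index_name, embeddings_by_id):
    """Yield one bulk 'update' action per document.

    embeddings_by_id is a hypothetical dict: {document ID: list of floats}.
    """
    for doc_id, embedding in embeddings_by_id.items():
        yield {
            "_op_type": "update",
            "_index": index_name,
            "_id": doc_id,
            "doc": {"dense_vector_embedding": embedding},
        }


def bulk_update_embeddings(client, index_name, embeddings_by_id):
    """Send all updates in batches instead of one request per document."""
    # Imported here so the action builder above can be tried without the client library
    from elasticsearch import helpers

    return helpers.bulk(client, build_update_actions(index_name, embeddings_by_id))


# Inspect the generated actions without contacting Elasticsearch
sample = {"doc-1": [0.1, 0.2], "doc-2": [0.3, 0.4]}
actions = list(build_update_actions("monjoor-videos-index", sample))
print(actions[0])
```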




### **3. Performing a kNN Search**
We will now send a request to Elasticsearch to find similar items using the vector representation.
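
Conceptually, a kNN search scores every stored vector against the query vector and keeps the `k` best matches. Elasticsearch uses an approximate index to avoid scanning everything, with `num_candidates` controlling how many candidates are considered on each shard. The sketch below is an exact, brute-force version on toy data, meant only to illustrate the idea; `cosine_similarity`, `exact_knn`, and the sample documents are hypothetical.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot_product / (norm_a * norm_b)


def exact_knn(query_vector, documents, k):
    """Return the k (similarity, doc_id) pairs most similar to the query, best first."""
    scored = (
        (cosine_similarity(query_vector, vector), doc_id)
        for doc_id, vector in documents.items()
    )
    return heapq.nlargest(k, scored)


# Toy two-dimensional "embeddings"
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
top = exact_knn([1.0, 0.1], docs, k=2)
print(top)
```

Here the query points almost along the first axis, so document `a` ranks first, followed by `c`.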

In [9]:
# Function to perform vector search
def vector_search(query):
    query_vector = generate_query_vector(query)

    # Search request body
    body = {
        "_source": ["title", "video_id"],
        "query": {
            "knn": {
                "field": "dense_vector_embedding",  # Ensure this matches the correct field in your index mapping
                "query_vector": query_vector,
                "k": 20,  
                "num_candidates": 50  
            }
        },
        "size": 40
    }

    # Execute search
    response = client.search(index=ELASTICSEARCH_INDEX_NAME, body=body)

    # Process results
    search_results = {}
    for hit in response["hits"]["hits"]:
        source = hit.get("_source", {})
        score = hit.get("_score", 0)
        video_id = source.get("video_id")

        # Deduplicate by video_id, keeping highest-score entry
        if video_id not in search_results or search_results[video_id]["similarity_score"] < score:
            search_results[video_id] = {
                "title": source.get("title"),
                "video_id": video_id,
                "similarity_score": score
            }

    return list(search_results.values())

# Example usage
query_example = "Les bases de la grammaire française"
results = vector_search(query_example)

# Display first 5 results
import pandas as pd
df_results = pd.DataFrame(results)
display(df_results.head())



| | title | video_id | similarity_score |
|---|---|---|---|
| 0 | Règle 1 - Le secret pour améliorer ton franc... | G90N8wK0sS0 | 0.84091 |
| 1 | Le système éducatif français 👩‍🎓👨‍🎓 | 9LWAjeG_dKs | 0.836739 |
| 2 | Tu as un Niveau B2 en Français si tu connais c... | IqZGFKC-eZ8 | 0.818137 |
| 3 | Les Français décrivent leur week-end \| Françai... | unmu4yKfBg0 | 0.811468 |
| 4 | La routine parfaite pour apprendre le français | 2mi3NM4YX10 | 0.810684 |
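
To interpret these scores: when a `dense_vector` field uses `cosine` similarity, Elasticsearch reports `_score = (1 + cosine) / 2`, so scores fall between 0 and 1. The index mapping is not shown in this notebook, so treating the field as cosine-based is an assumption, and `score_to_cosine` is a hypothetical helper that simply inverts that formula.

```python
def score_to_cosine(score):
    """Convert an Elasticsearch cosine-similarity _score back to a raw cosine value.

    Assumes the field's similarity is 'cosine', where _score = (1 + cosine) / 2.
    """
    return 2.0 * score - 1.0


top_score = 0.84091  # Score of the first result for the French grammar query
print(round(score_to_cosine(top_score), 5))
```

Under that assumption, a score of 0.5 corresponds to a cosine of 0 (unrelated vectors), and a score of 1.0 to identical directions.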


Now try another query yourself:

In [7]:
query_example = "Le trail en haute-montagne"
results = vector_search(query_example)
display(pd.DataFrame(results).head())



| | title | video_id | similarity_score |
|---|---|---|---|
| 0 | Courir l’UTMB en 20h \| Mathieu Blanchard (carr... | lE9AhAaMje4 | 0.766115 |
| 1 | 4 MANIÈRES DE FAIRE UN CAMPEMENT AVEC 1 BÂCHE ! | QkTXVTqbDXo | 0.742637 |
| 2 | 🇨🇭 La Suisse : un pays riche en paysages et cu... | sOSboG1eKyM | 0.740606 |
| 3 | Double Ascension Record de l’Everest \| Kilian ... | N8-KKScr940 | 0.732122 |
| 4 | MATÉRIEL DE RANDONNÉE DECATHLON PAS CHER POUR ... | dgrasUlIM8k | 0.718271 |


You can also try a query in a different language (e.g., in English):

In [8]:
query_example = "Hiking mount everest"
results = vector_search(query_example)
display(pd.DataFrame(results).head())



| | title | video_id | similarity_score |
|---|---|---|---|
| 0 | Double Ascension Record de l’Everest \| Kilian ... | N8-KKScr940 | 0.808203 |
| 1 | Le Mont Blanc - une montagne pour débutants ? | a52TtUhudoM | 0.7791 |
| 2 | 4 MANIÈRES DE FAIRE UN CAMPEMENT AVEC 1 BÂCHE ! | QkTXVTqbDXo | 0.698431 |
| 3 | 🇯🇲 L'INCROYABLE HISTOIRE D'USAIN BOLT | bTa5P1AIbFo | 0.688747 |
| 4 | La Suisse, pays de randonneurs | k14yhfDqn0o | 0.687119 |


### **4. Summary**
In this section, we learned:
- How to generate embeddings from text using `sentence-transformers`.
- How to perform a kNN vector search in Elasticsearch.
- How to interpret and rank search results.

This method is powerful for **semantic search**, allowing us to retrieve relevant content even if the query does not contain exact words from the documents.