# Using Elasticsearch 8 as a Vector Store with dense_vector

### This is from the course "Elasticsearch 8 and the Elastic Stack - Hands On" by Frank Kane and Sundog Education. Available at https://sundog-education.com/get-es8

We'll start by installing and importing what we need. Be sure the version of the Elasticsearch client matches the server.

In [1]:
!pip install openai elasticsearch



Since we disabled security in Elasticsearch to make life easier in this course, we just have to initialize the client with an http link to the localhost port we have it running on. In a production setting, you would use https, include your SSL fingerprint and authentication credentials as parameters, and point to a remote service where you have an Elasticsearch cluster running.

Also, we have defined an OPENAI_API_KEY environment variable on our system, so we don't have to pass that in explicitly to the OpenAI constructor below.

In [2]:
from elasticsearch import Elasticsearch
from openai import OpenAI
openai_client = OpenAI()
es_client = Elasticsearch(
    "http://localhost:9200"
)
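
For reference, the production-style connection described above might look roughly like the sketch below. The environment variable names (ES_URL, ES_FINGERPRINT, ES_USER, ES_PASSWORD) and the build_client_kwargs helper are made up for illustration; ssl_assert_fingerprint and basic_auth are keyword arguments accepted by the 8.x Python client.

```python
import os

def build_client_kwargs(env=os.environ):
    """Collect connection settings for a secured cluster (hypothetical helper)."""
    # Fall back to a local HTTPS endpoint if no URL is configured (assumption)
    kwargs = {"hosts": env.get("ES_URL", "https://localhost:9200")}

    # Certificate fingerprint, if one was provided
    fingerprint = env.get("ES_FINGERPRINT")
    if fingerprint:
        kwargs["ssl_assert_fingerprint"] = fingerprint

    # Basic authentication credentials, only when both parts are present
    user, password = env.get("ES_USER"), env.get("ES_PASSWORD")
    if user and password:
        kwargs["basic_auth"] = (user, password)
    return kwargs

# Usage (needs the elasticsearch package and a reachable, secured cluster):
# from elasticsearch import Elasticsearch
# es_client = Elasticsearch(**build_client_kwargs())
```

Keeping credentials in environment variables mirrors what we do with OPENAI_API_KEY, and keeps secrets out of the notebook itself.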


Let's double check our version of the ES client. No promises this code works with anything other than what you see here, or with a server version that is any different.

In [3]:
!pip show elasticsearch

Name: elasticsearch
Version: 8.13.0
Summary: Python client for Elasticsearch
Home-page: https://github.com/elastic/elasticsearch-py
Author: Elastic Client Library Maintainers
Author-email: client-libs@elastic.co
License: Apache-2.0
Location: H:\anaconda3-fresh\Lib\site-packages
Requires: elastic-transport
Required-by: 


Let's define the name of the index we want to use in one place so we can't mistype it, and delete it in case it already exists from an earlier run.

In [4]:
index_name = 'movies-vectordb'  # Replace with your index name

# Delete the index
try:
    response = es_client.indices.delete(index=index_name)
    print(f"Index '{index_name}' deleted successfully.")
except Exception as e:
    print(f"Error deleting index '{index_name}': {e}")

Index 'movies-vectordb' deleted successfully.


As an example, we will set up an index for movies that we'll populate from the MovieLens dataset we use throughout our course.

Our mapping defines title, genre, and release year fields... and more interestingly, a "title_embedding" field of type "dense_vector." Elasticsearch will not compute embedding vectors for you; we'll use OpenAI for that. But we must take care to ensure the "dims" of our dense vector field matches the number of dimensions generated by the embedding model we intend to use (1536, in this case).

In [5]:
mapping = {
    "properties": {
        "title_embedding": {
            "type": "dense_vector",
            "dims": 1536
        },
        "title": {
            "type": "text"
        },
        "genre": {
            "type": "keyword"
        },
        "release_year": {
            "type": "integer"
        }
    }
}
es_client.indices.create(index=index_name, mappings=mapping)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'movies-vectordb'})
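
Since a mismatch between "dims" and the model's output is an easy mistake to make, a small sanity check along these lines could be run on one embedding before indexing anything. The check_embedding_dims helper and the example_mapping variable are hypothetical names used only for this sketch.

```python
def check_embedding_dims(mapping, field, vector):
    """Raise ValueError if the vector length differs from the field's 'dims'."""
    expected = mapping["properties"][field]["dims"]
    if len(vector) != expected:
        raise ValueError(
            f"{field}: expected {expected} dimensions, got {len(vector)}"
        )
    return True

# A cut-down copy of the mapping above, with a placeholder vector
example_mapping = {
    "properties": {"title_embedding": {"type": "dense_vector", "dims": 1536}}
}
check_embedding_dims(example_mapping, "title_embedding", [0.0] * 1536)
```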

This snippet of code loads up all 9,742 movies in the MovieLens ml-latest-small dataset, extracts the movie title, genres, and year, and populates a list called "movies" with the extracted information.

If you don't already have it, you can download the ml-latest-small dataset from https://grouplens.org/datasets/movielens/latest/

This isn't a Python coding course, so I'm not going to go into how this works too much. In short we use pandas to load up the movies.csv file, then iterate through every row of the resulting dataframe. The movie titles in the dataset are all of the form Movie Name (Year), so we have to extract the year from the title prior to storing it.

In [6]:
import pandas as pd

# Read the movies.csv file
movies_df = pd.read_csv('ml-latest-small/movies.csv')

# Extracting title, genres and release year
def extract_release_year(title):
    # Strip first: a trailing space would otherwise hide the closing ')'
    title = title.strip()
    if '(' in title and title[-1] == ')':
        year = title[-5:-1]
        if year.isdigit():
            return int(year)
    return None

# Creating the list of dictionaries
movies = []
for index, row in movies_df.iterrows():
    title = row['title'].strip()  # Normalize whitespace before slicing
    genre = row['genres']
    release_year = extract_release_year(title)
    if release_year:
        title = title[:-7].strip()  # Remove the " (YYYY)" suffix from the title
        movie_dict = {
            "title": title,
            "genre": genre,
            "release_year": release_year
        }
        movies.append(movie_dict)

# Print the first few entries to verify
print(movies[:5])


[{'title': 'Toy Story', 'genre': 'Adventure|Animation|Children|Comedy|Fantasy', 'release_year': 1995}, {'title': 'Jumanji', 'genre': 'Adventure|Children|Fantasy', 'release_year': 1995}, {'title': 'Grumpier Old Men', 'genre': 'Comedy|Romance', 'release_year': 1995}, {'title': 'Waiting to Exhale', 'genre': 'Comedy|Drama|Romance', 'release_year': 1995}, {'title': 'Father of the Bride Part II', 'genre': 'Comedy', 'release_year': 1995}]
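
If you prefer, the same "Movie Name (Year)" split can be sketched with a regular expression instead of string slicing. This is just an alternative under the same assumption about the title format; TITLE_YEAR and split_title_and_year are names made up for this example.

```python
import re

# Name, optional whitespace, then a four-digit year in parentheses at the end
TITLE_YEAR = re.compile(r"^(?P<name>.*\S)\s*\((?P<year>\d{4})\)\s*$")

def split_title_and_year(raw_title):
    """Return (name, year), or (stripped title, None) if no year is found."""
    match = TITLE_YEAR.match(raw_title)
    if match is None:
        return raw_title.strip(), None
    return match.group("name"), int(match.group("year"))

print(split_title_and_year("Toy Story (1995)"))
print(split_title_and_year("A Title With No Year"))
```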


We will use the extracted titles for generating embedding vectors using OpenAI. Embeddings are basically positions in a high-dimensional space that encode the semantic meaning of a set of words. Titles that are close to each other in this space should have similar semantic meaning. That's why vector databases are used for "semantic search," commonly in the context of retrieval-augmented generation (RAG) for extending large language models.
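
To make "close to each other" a bit more concrete, here is a toy sketch using cosine similarity, one common way to compare embeddings. The three-dimensional vectors below are invented for illustration; real embeddings from the model we use have 1536 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up "embeddings" for three imaginary titles
space_opera = [0.9, 0.1, 0.0]
space_saga = [0.8, 0.2, 0.1]
rom_com = [0.0, 0.2, 0.9]

# The two space titles should score much higher than the mismatched pair
print(cosine_similarity(space_opera, space_saga))
print(cosine_similarity(space_opera, rom_com))
```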

Now here is the heart of everything. This is a little more complicated than we might like, because the OpenAI API only permits generating embeddings in batches of 2048 or fewer items. So, we step through the "movies" array 2048 titles at a time. For each batch of 2048, we extract the titles and call the OpenAI embeddings API to generate embedding vectors for all of them at once, using an embedding model that produces the same number of dimensions as we set up in our dense_vector field. Once we have the embeddings for that batch, we then store them in our Elasticsearch index, also in a batch manner... the 'helpers' module makes that easy using helpers.bulk. We first zip together the movies and their embeddings so we can store them together as we build up a batch for Elasticsearch to index.

The code to do this one title at a time would be a LOT simpler, but it would also take a LOT more time, since every title would need its own API request. The batched version should finish in less than a minute.

This does cost money from OpenAI, but it should just be pennies, at least as of the pricing at this writing.
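
To put the batching described above in numbers, this little sketch counts the API requests needed with and without batching. The figure of 9,700 titles is a round illustrative number, not the exact size of the "movies" list, and both helper names are made up here.

```python
import math

def count_requests(num_items, batch_size):
    """Number of API calls needed to cover num_items."""
    return math.ceil(num_items / batch_size)

def batch_ranges(num_items, batch_size):
    """(start, end) slice bounds for each batch; the last one may be short."""
    return [
        (start, min(start + batch_size, num_items))
        for start in range(0, num_items, batch_size)
    ]

num_titles = 9700  # illustrative round number (assumption)
print(count_requests(num_titles, 2048))  # batched
print(count_requests(num_titles, 1))     # one title per request
print(batch_ranges(num_titles, 2048))
```

A handful of requests instead of thousands is where the time savings come from.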

In [7]:
from elasticsearch import Elasticsearch, helpers
import time

def get_embeddings(batch):
    """
    Function to get embeddings for a batch of titles from the OpenAI API.
    """

    # Extract the titles only
    titles = [movie["title"] for movie in batch]
    
    response = openai_client.embeddings.create(
        input=titles,
        model="text-embedding-3-small"  # Update with the appropriate model if necessary
    )
    return response.data

def index_to_elasticsearch(movies_batch, embeddings):
    """
    Function to index embeddings into Elasticsearch.
    """
    actions = []
    for movie, embedding_data in zip(movies_batch, embeddings):
        movie['title_embedding'] = embedding_data.embedding
        action = {
            "_index": index_name,
            "_source": movie
        }
        actions.append(action)
    
    helpers.bulk(es_client, actions)

# Processing in batches
batch_size = 2048
for i in range(0, len(movies), batch_size):
    batch = movies[i:i+batch_size]
    embeddings = get_embeddings(batch)
    index_to_elasticsearch(batch, embeddings)
    time.sleep(1)  # To avoid hitting rate limits

print("Finished indexing all movie titles.")


Finished indexing all movie titles.


So we have a vector store! All of our movies are indexed, together with a large vector representing the position of their titles in a semantic vector space. It's worth noting that the storage cost of those vectors exceeds the actual data we care about, by a lot. Nobody said AI was efficient.
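
Here is a rough back-of-the-envelope comparison, assuming each of the 1536 dimensions is stored as a 4-byte float and ignoring index overhead and the JSON copy of the vector kept in _source. The actual on-disk numbers will differ; this is only meant to show the order of magnitude.

```python
DIMS = 1536
BYTES_PER_FLOAT32 = 4  # assumption: 32-bit floats per dimension

def vector_bytes(dims=DIMS, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw size of one embedding vector in bytes."""
    return dims * bytes_per_value

title = "Star Wars: Episode IV - A New Hope"
title_bytes = len(title.encode("utf-8"))

print(vector_bytes())                # raw bytes per embedding
print(title_bytes)                   # bytes in the title itself
print(vector_bytes() / title_bytes)  # how many times larger the vector is
```

That works out to about 6 KB of vector per movie, for a title that is only a few dozen bytes long.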

Let's test it out! To do a vector (semantic) search, first we need to convert our search term to an embedding vector as well, using the same model used to create vectors in our index. So, let's convert "Star Wars" to an embedding vector, which we will use for the actual search.

In [8]:
vector_value = openai_client.embeddings.create(
        input=["Star Wars"], model='text-embedding-3-small'
    ).data[0].embedding

Our "search" is now a matter of finding the K items closest to our search vector in our vector space. Unlike traditional search, this is a "K-Nearest Neighbors" problem from the world of machine learning.

Our query string for vector search must include not only the field we are searching and the embedding vector we are searching for, but also "k" (the number of results we want closest to our search term in semantic space) and "num_candidates" (the number of approximate nearest-neighbor candidates gathered on each shard before narrowing them down to k; this is faster than doing a brute-force search of every vector in the index to find the closest ones).
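
For intuition, this is roughly what the brute-force alternative looks like: score every stored vector against the query and keep the best k. It is a toy sketch in plain Python with made-up two-dimensional vectors, not how Elasticsearch implements its search; the approximate approach exists precisely to avoid scanning everything like this.

```python
import heapq
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def exact_knn(query_vector, docs, k):
    """Brute force: compare the query to every (doc_id, vector) pair."""
    scored = (
        (cosine_similarity(query_vector, vector), doc_id)
        for doc_id, vector in docs
    )
    # nlargest returns (score, doc_id) tuples, best match first
    return heapq.nlargest(k, scored)

toy_docs = [
    ("a", [1.0, 0.0]),
    ("b", [0.9, 0.1]),
    ("c", [0.0, 1.0]),
    ("d", [-1.0, 0.0]),
]
print(exact_knn([1.0, 0.0], toy_docs, k=2))
```

With only four vectors this is instant, but the work grows with every document in the index, which is why a larger "num_candidates" costs more time in exchange for results closer to this exact answer.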

Note the syntax of the search call; we pass in our query string as a "knn" parameter, so Elasticsearch will know to treat this as a vector search.

Let's kick it off, and print out the raw results:

In [9]:
query_string = {
    "field": "title_embedding",
    "query_vector": vector_value,
    "k": 5,
    "num_candidates": 100
}

results = es_client.search(index=index_name, knn=query_string)
print(results)


{'took': 104, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 5, 'relation': 'eq'}, 'max_score': 0.8514196, 'hits': [{'_index': 'movies-vectordb', '_id': 'LJVDjY8B4AUWxdud_e0L', '_score': 0.8514196, '_source': {'title': 'Star Wars: The Clone Wars', 'genre': 'Action|Adventure|Animation|Sci-Fi', 'release_year': 2008, 'title_embedding': [-0.02022513374686241, 0.07458090782165527, -0.054367389529943466, -0.007922768592834473, -0.05780600756406784, 0.03661666810512543, 0.011012881062924862, 0.01220942847430706, 0.02544114924967289, 0.0451899878680706, 0.0036157798022031784, -0.015717752277851105, -0.045584965497255325, -0.023954179137945175, 0.008259660564363003, -0.011367199011147022, -0.029530320316553116, 0.006255734711885452, -0.024047113955020905, 0.05636550486087799, 0.026486676186323166, 0.013173636049032211, 0.02781100943684578, 0.028368623927235603, -0.05427445098757744, 0.018215399235486984, -0.013173636049032211

So you can see all the associated data for each movie, along with their embedding vectors... which is a ton of information you don't really need. Let's just print out the movie titles of our requested 5 nearest hits:

In [10]:
# Print just the title field from the results
for result in results['hits']['hits']:
    title = result['_source']['title']
    print(title)

Star Wars: The Clone Wars
Star Wars: Episode V - The Empire Strikes Back
Star Wars: Episode VI - Return of the Jedi
Star Trek
Star Wars: Episode IV - A New Hope
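
If you also want the score and year alongside each title, a small helper like the one below could flatten the hits while leaving the embeddings out. The summarize_hits name is made up, and sample_response is a hand-built stand-in shaped like the raw output above, with the vector shortened to two rounded values.

```python
def summarize_hits(response):
    """Return (title, release_year, score) tuples, skipping the embedding."""
    return [
        (hit["_source"]["title"], hit["_source"].get("release_year"), hit["_score"])
        for hit in response["hits"]["hits"]
    ]

# Stand-in for a real search response (vector shortened for the example)
sample_response = {
    "hits": {
        "hits": [
            {
                "_score": 0.8514196,
                "_source": {
                    "title": "Star Wars: The Clone Wars",
                    "release_year": 2008,
                    "title_embedding": [-0.0202, 0.0746],
                },
            }
        ]
    }
}
print(summarize_hits(sample_response))
```

Source filtering on the search call itself (for example, the client's source_excludes parameter) may be a way to keep the vectors out of the response in the first place, though that isn't tested in this notebook.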


So that is using vector search with Elasticsearch. Vector stores are very popular in the world of generative AI because they hold embeddings, produced by complex models, that encode the semantic meaning of phrases. But if you're thinking that's overkill for this specific problem... well, you're right. If simpler, more efficient methods of search meet your needs, you should opt for them instead.
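
For comparison, a traditional full-text search on the same index needs no embeddings at all. The sketch below only builds the query body; build_title_match_query is a made-up helper, and the commented-out call assumes the es_client and index_name defined earlier in this notebook.

```python
def build_title_match_query(text):
    """Query body for a plain full-text match on the 'title' field."""
    return {"match": {"title": text}}

print(build_title_match_query("Star Wars"))

# Usage against the index built above (needs a running cluster):
# results = es_client.search(index=index_name, query=build_title_match_query("Star Wars"))
# for result in results['hits']['hits']:
#     print(result['_source']['title'])
```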