# Aerospike

[Aerospike Vector Search](https://aerospike.com/docs/vector) (AVS) is an
extension to the Aerospike Database that enables vector similarity searches
across very large datasets stored in Aerospike. The service runs outside of the
Aerospike Database and builds an index to perform those searches.

This notebook showcases the functionality of the [LangChain Aerospike VectorStore
integration](https://github.com/aerospike/langchain-aerospike).

## Install AVS

Before using this notebook, we need to have a running AVS instance. Use one of
the [available installation methods](https://aerospike.com/docs/vector/install). 

When finished, store your AVS instance's IP address and port to use later
in this demo:

In [11]:
AVS_HOST = "<avs_ip>"
AVS_PORT = 5000
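
If you prefer not to hard-code the address in the notebook, one option is to read it from the environment. This is a minimal sketch, not part of the integration: the environment variable names `AVS_HOST` and `AVS_PORT`, the `localhost` fallback, and the helper `load_avs_address` are all assumptions made for illustration.

```python
import os


def load_avs_address(env=os.environ):
    """Return (host, port) for AVS from a mapping of environment variables.

    The variable names and defaults are hypothetical; adjust them to your setup.
    """
    host = env.get("AVS_HOST", "localhost")
    port = int(env.get("AVS_PORT", "5000"))  # environment values are strings
    return host, port


# Passing a plain dict makes the helper easy to try without touching the real environment.
print(load_avs_address({"AVS_HOST": "10.0.0.5", "AVS_PORT": "5001"}))
```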

## Install Dependencies

The `sentence-transformers` dependency is large. This step could take several minutes to complete.

In [5]:
!pip install --upgrade --quiet aerospike-vector-search==4.2.0 langchain-aerospike langchain-community sentence-transformers langchain




## Download Quotes Dataset

We will download a dataset of approximately 100,000 quotes and use a subset of those quotes for semantic search.

In [6]:
!wget https://github.com/aerospike/aerospike-vector-search-examples/raw/7dfab0fccca0852a511c6803aba46578729694b5/quote-semantic-search/container-volumes/quote-search/data/quotes.csv.tgz

--2025-05-07 21:06:30--  https://github.com/aerospike/aerospike-vector-search-examples/raw/7dfab0fccca0852a511c6803aba46578729694b5/quote-semantic-search/container-volumes/quote-search/data/quotes.csv.tgz
Resolving github.com (github.com)... 140.82.116.3
Connecting to github.com (github.com)|140.82.116.3|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://github.com/aerospike/aerospike-vector/raw/7dfab0fccca0852a511c6803aba46578729694b5/quote-semantic-search/container-volumes/quote-search/data/quotes.csv.tgz [following]
--2025-05-07 21:06:30--  https://github.com/aerospike/aerospike-vector/raw/7dfab0fccca0852a511c6803aba46578729694b5/quote-semantic-search/container-volumes/quote-search/data/quotes.csv.tgz
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/aerospike/aerospike-vector/7dfab0fccca0852a511c6803aba46578729694b5/quote-semantic-search/contain

## Load the Quotes Into Documents

We will load our quotes dataset using the `CSVLoader` document loader. Its `lazy_load` method returns an iterator, so quotes are read one at a time instead of being loaded into memory all at once. In this example, we load only the first 5,000 quotes.

In [7]:
import itertools
import os
import tarfile

from langchain_community.document_loaders.csv_loader import CSVLoader

filename = "./quotes.csv"

if not os.path.exists(filename) and os.path.exists(filename + ".tgz"):
    # Untar the file
    with tarfile.open(filename + ".tgz", "r:gz") as tar:
        tar.extractall(path=os.path.dirname(filename))

NUM_QUOTES = 5000
documents = CSVLoader(filename, metadata_columns=["author", "category"]).lazy_load()
# islice takes the first NUM_QUOTES items from the iterator without reading the rest
documents = list(itertools.islice(documents, NUM_QUOTES))

In [8]:
print(documents[0])

page_content='quote: I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you can't handle me at my worst, then you sure as hell don't deserve me at my best.' metadata={'source': './quotes.csv', 'row': 0, 'author': 'Marilyn Monroe', 'category': 'attributed-no-source, best, life, love, mistakes, out-of-control, truth, worst'}
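
To see why slicing a lazy iterator is cheap, the sketch below uses a small stand-in for the loader. `CountingRows` is a hypothetical class written for this illustration (it is not part of LangChain); it counts how many rows are actually produced when `itertools.islice` takes only the first few.

```python
import itertools


class CountingRows:
    """Hypothetical stand-in for a lazy loader; counts the rows it actually produces."""

    def __init__(self, total):
        self.total = total
        self.pulled = 0

    def __iter__(self):
        for i in range(self.total):
            self.pulled += 1
            yield f"quote number {i}"


source = CountingRows(100_000)
# islice stops after 5 items, so the remaining rows are never generated.
first_five = list(itertools.islice(source, 5))
print(len(first_five), source.pulled)
```

The same reasoning applies to the cell above: only the rows that are sliced off the iterator need to be read.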


## Create your Embedder

In this step, we use HuggingFaceEmbeddings and the "all-MiniLM-L6-v2" sentence transformer model to embed our documents so we can perform a vector search.

In [9]:
from aerospike_vector_search.types import VectorDistanceMetric
from langchain_community.embeddings import HuggingFaceEmbeddings

MODEL_DIM = 384
MODEL_DISTANCE_CALC = VectorDistanceMetric.COSINE
embedder = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
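
For intuition about the cosine metric selected above, here is the textbook definition in plain Python. It is only a sketch: the helper names are made up for this example, the vectors are tiny toy values rather than 384-dimensional embeddings, and how AVS computes or stores the metric internally is not covered here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Textbook cosine distance: 0 for the same direction, 1 for orthogonal vectors."""
    return 1.0 - cosine_similarity(a, b)


print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # same direction, different length
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

Because only the angle matters, scaling a vector does not change its cosine distance to another vector.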



## Create an Aerospike Index and Embed Documents

Before we add documents, we need to create an index in the Aerospike Database. The code below first checks whether the expected index already exists and creates it only if it is missing.

In [12]:
from aerospike_vector_search import Client, HostPort
from aerospike_vector_search.types import VectorDistanceMetric
from langchain_aerospike.vectorstores import Aerospike

# Here we are using the AVS host and port you configured earlier
seed = HostPort(host=AVS_HOST, port=AVS_PORT)

# The namespace in which to place our vectors. This should match the namespace configured in your docstore.conf file.
NAMESPACE = "test"

# The name of our new index.
INDEX_NAME = "quote-miniLM-L6-v2"

# AVS needs to know which metadata key contains our vector when creating the index and inserting documents.
VECTOR_KEY = "vector"

client = Client(seeds=seed)
index_exists = False

# Check if the index already exists. If not, create it
for index in client.index_list():
    if index["id"]["namespace"] == NAMESPACE and index["id"]["name"] == INDEX_NAME:
        index_exists = True
        print(f"{INDEX_NAME} already exists. Skipping creation")
        break

if not index_exists:
    print(f"{INDEX_NAME} does not exist. Creating index")
    client.index_create(
        namespace=NAMESPACE,
        name=INDEX_NAME,
        vector_field=VECTOR_KEY,
        vector_distance_metric=MODEL_DISTANCE_CALC,
        dimensions=MODEL_DIM,
        index_labels={
            "model": "miniLM-L6-v2",
            "date": "05/04/2024",
            "dim": str(MODEL_DIM),
            "distance": "cosine",
        },
    )

docstore = Aerospike.from_documents(
    documents,
    embedder,
    client=client,
    namespace=NAMESPACE,
    vector_key=VECTOR_KEY,
    index_name=INDEX_NAME,
    distance_strategy=MODEL_DISTANCE_CALC,
)

quote-miniLM-L6-v2 does not exist. Creating index
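
The existence check in the cell above can also be factored into a small helper. This is a sketch under one assumption: that each index entry can be read as `index["id"]["namespace"]` and `index["id"]["name"]`, exactly as the loop above does. The helper name `index_exists` and the sample data are invented for illustration.

```python
def index_exists(indexes, namespace, name):
    """Return True if any index entry matches the given namespace and name."""
    return any(
        index["id"]["namespace"] == namespace and index["id"]["name"] == name
        for index in indexes
    )


# Made-up sample shaped like the entries the loop above iterates over.
sample_indexes = [
    {"id": {"namespace": "test", "name": "quote-miniLM-L6-v2"}},
    {"id": {"namespace": "other", "name": "images"}},
]

print(index_exists(sample_indexes, "test", "quote-miniLM-L6-v2"))
print(index_exists(sample_indexes, "test", "images"))
```

With a live client, the first argument would be the result of `client.index_list()`.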


## Search the Documents

Now that we have embedded our documents, we can use vector search on our quotes.

In [13]:
query = "A quote about the beauty of the cosmos"
docs = docstore.similarity_search(
    query, k=5, index_name=INDEX_NAME, metadata_keys=["_id", "author"]
)


def print_documents(docs):
    for i, doc in enumerate(docs):
        print("~~~~ Document", i, "~~~~")
        print("auto-generated id:", doc.metadata["_id"])
        print("author: ", doc.metadata["author"])
        print(doc.page_content)
        print("~~~~~~~~~~~~~~~~~~~~\n")


print_documents(docs)

~~~~ Document 0 ~~~~
auto-generated id: 4984b472-8a32-4552-b3eb-f03b31b68031
author:  Carl Sagan, Cosmos
quote: The Cosmos is all that is or was or ever will be. Our feeblest contemplations of the Cosmos stir us -- there is a tingling in the spine, a catch in the voice, a faint sensation, as if a distant memory, of falling from a height. We know we are approaching the greatest of mysteries.
~~~~~~~~~~~~~~~~~~~~

~~~~ Document 1 ~~~~
auto-generated id: 486c8d87-8dd7-450d-9008-d7549e680ffb
author:  Renee Ahdieh, The Rose & the Dagger
quote: From the stars, to the stars.
~~~~~~~~~~~~~~~~~~~~

~~~~ Document 2 ~~~~
auto-generated id: 4b43b309-ce51-498c-b225-5254383b5b4a
author:  Elizabeth Gilbert
quote: The love that moves the sun and the other stars.
~~~~~~~~~~~~~~~~~~~~

~~~~ Document 3 ~~~~
auto-generated id: af784a10-f498-4570-bf81-2ffdca35440e
author:  Dante Alighieri, Paradiso
quote: Love, that moves the sun and the other stars
~~~~~~~~~~~~~~~~~~~~

~~~~ Document 4 ~~~~
auto-generated

## Embedding Additional Quotes as Text

We can use `add_texts` to add additional quotes.

In [14]:
docstore = Aerospike(
    client,
    embedder,
    NAMESPACE,
    index_name=INDEX_NAME,
    vector_key=VECTOR_KEY,
    distance_strategy=MODEL_DISTANCE_CALC,
)

ids = docstore.add_texts(
    [
        "quote: Rebellions are built on hope.",
        "quote: Logic is the beginning of wisdom, not the end.",
        "quote: If wishes were fishes, we’d all cast nets.",
    ],
    metadatas=[
        {"author": "Jyn Erso, Rogue One"},
        {"author": "Spock, Star Trek"},
        {"author": "Frank Herbert, Dune"},
    ],
)

print("New IDs")
print(ids)

New IDs
['adf8064e-9c0e-46e2-b193-169c36432f4c', 'cf65b5ed-a0f4-491a-86ad-dcacc23c2815', '2ef52efd-d9b7-4077-bc14-defdf0b7dd2f']


## Search Documents Using Max Marginal Relevance Search

We can use max marginal relevance (MMR) search to find vectors that are similar to our query but dissimilar to each other. In this example, we create a retriever object using `as_retriever`, but we could just as easily call `docstore.max_marginal_relevance_search` directly. The `fetch_k` search argument sets how many candidates are fetched before MMR re-ranks them, and `lambda_mult` determines the diversity of the results: 0 corresponds to maximum diversity and 1 to minimum diversity.

In [15]:
query = "A quote about our favorite four-legged pets"
retriever = docstore.as_retriever(
    search_type="mmr", search_kwargs={"fetch_k": 20, "lambda_mult": 0.7}
)
matched_docs = retriever.invoke(query)

print_documents(matched_docs)

~~~~ Document 0 ~~~~
auto-generated id: 91e77b39-a528-40c6-a58a-486ae85f991a
author:  John Grogan, Marley and Me: Life and Love With the World's Worst Dog
quote: Such short little lives our pets have to spend with us, and they spend most of it waiting for us to come home each day. It is amazing how much love and laughter they bring into our lives and even how much closer we become with each other because of them.
~~~~~~~~~~~~~~~~~~~~

~~~~ Document 1 ~~~~
auto-generated id: c585b4ec-92b5-4579-948c-0529373abc2a
author:  John Grogan, Marley and Me: Life and Love With the World's Worst Dog
quote: Dogs are great. Bad dogs, if you can really call them that, are perhaps the greatest of them all.
~~~~~~~~~~~~~~~~~~~~

~~~~ Document 2 ~~~~
auto-generated id: 5768b31c-fac4-4af7-84b4-fb11bbfcb590
author:  Colleen Houck, Tiger's Curse
quote: He then put both hands on the door on either side of my head and leaned in close, pinning me against it. I trembled like a downy rabbit caught in the clutche
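
To make the effect of `lambda_mult` concrete, here is a small pure-Python sketch of the commonly used greedy MMR formulation. It is an illustration, not the code the integration runs: the function names and the two-dimensional toy vectors are invented for this example, and the initial `fetch_k` candidate fetch is not modeled.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, candidates, k=2, lambda_mult=0.5):
    """Greedy MMR: return indices of k candidates, trading relevance against redundancy."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine_similarity(query_vec, candidates[i])
            # Similarity to the closest already-selected item (0 if nothing is selected yet).
            redundancy = max(
                (cosine_similarity(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query_vec = [1.0, 0.0]
# Candidate 1 is a near-duplicate of candidate 0; candidate 2 is less relevant but different.
candidates = [[1.0, 0.0], [0.9, 0.1], [0.6, 0.8]]

print(mmr(query_vec, candidates, k=2, lambda_mult=0.7))
print(mmr(query_vec, candidates, k=2, lambda_mult=0.3))
```

With the higher `lambda_mult`, the near-duplicate is picked second because relevance dominates; with the lower value, the more distinct candidate wins.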

## Search Documents with a Relevance Threshold

Another useful feature is similarity search with a relevance threshold. Often we want only the results whose relevance to the query meets some minimum score, rather than a fixed number of nearest matches. A relevance of 1 is most similar and a relevance of 0 is most dissimilar.

In [16]:
query = "A quote about stormy weather"
retriever = docstore.as_retriever(
    search_type="similarity_score_threshold",
    # A higher threshold returns fewer, more relevant results
    search_kwargs={"score_threshold": 0.4},
)
matched_docs = retriever.invoke(query)

print_documents(matched_docs)

~~~~ Document 0 ~~~~
auto-generated id: 6d9e67a6-0427-41e6-9e24-050518120d74
author:  Roy T. Bennett, The Light in the Heart
quote: Never lose hope. Storms make people stronger and never last forever.
~~~~~~~~~~~~~~~~~~~~

~~~~ Document 1 ~~~~
auto-generated id: 7d426e59-7935-4bcf-a676-cbe8dd4860e7
author:  Roy T. Bennett, The Light in the Heart
quote: Difficulties and adversities viciously force all their might on us and cause us to fall apart, but they are necessary elements of individual growth and reveal our true potential. We have got to endure and overcome them, and move forward. Never lose hope. Storms make people stronger and never last forever.
~~~~~~~~~~~~~~~~~~~~

~~~~ Document 2 ~~~~
auto-generated id: 6ec05e48-d162-440d-8819-001d2f3712f9
author:  Vincent van Gogh, The Letters of Vincent van Gogh
quote: There is peace even in the storm
~~~~~~~~~~~~~~~~~~~~

~~~~ Document 3 ~~~~
auto-generated id: d3c3de59-4da4-4ae6-8f6d-83ed905dd320
author:  Edwin Morgan, A Book of Lives
qu
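
Conceptually, the threshold acts as a filter over scored results. The sketch below shows that idea on made-up data: the helper `filter_by_relevance`, the document labels, and the scores are all hypothetical and are not taken from the search above.

```python
def filter_by_relevance(scored_docs, score_threshold):
    """Keep (doc, score) pairs at or above the threshold, most relevant first."""
    kept = [(doc, score) for doc, score in scored_docs if score >= score_threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Invented relevance scores in the range 0 (dissimilar) to 1 (most similar).
scored = [
    ("rain quote", 0.41),
    ("storm quote", 0.62),
    ("cooking quote", 0.12),
]

print(filter_by_relevance(scored, 0.4))
print(filter_by_relevance(scored, 0.5))
```

Raising the threshold from 0.4 to 0.5 drops the weaker match, which mirrors how a larger `score_threshold` narrows the retriever's results.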

## Clean up

We need to make sure we close our client to release resources and clean up threads.

In [17]:
client.close()
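
In a script, as opposed to a notebook, you may want the client closed even if a search raises an error. One way to sketch this is with `contextlib.closing` from the standard library, which calls `close()` on exit. `FakeClient` below is a hypothetical stand-in so the example runs without an AVS instance; with the real client you would wrap the `Client(...)` object the same way.

```python
from contextlib import closing


class FakeClient:
    """Hypothetical stand-in for a client object that exposes close()."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


fake = FakeClient()
try:
    with closing(fake) as c:
        # Simulate a failure in the middle of the work.
        raise RuntimeError("simulated failure mid-search")
except RuntimeError:
    pass

print(fake.closed)
```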

## Ready. Set. Search!

Now that you are up to speed with Aerospike Vector Search's LangChain integration, you have the power of the Aerospike Database and the LangChain ecosystem at your fingertips. Happy building!