# Aerospike

[Aerospike Vector Search](https://aerospike.com/docs/vector) (AVS) is a vector database built on
top of the performant and robust Aerospike database.

This notebook showcases the functionality of the LangChain Aerospike VectorStore
integration. Before we get started, we need a running AVS instance. Use one of the [available
installation methods](https://aerospike.com/docs/vector/install).

## Temporarily Install Langchain Community [DELETE THIS BEFORE PR]

In [68]:
#!git clone https://github.com/aerospike/langchain.git
# !git checkout VEC-131-add-aerospike
%cd langchain/libs/langchain
!pip install --quiet -e .
%cd ../../..
%cd langchain/libs/community
!pip install --quiet -e .
%cd ../../..


## Install Dependencies 

In [None]:
#!pip install --upgrade --quiet aerospike-vector-search langchain==0.1.16 sentence-transformers
!pip install --no-cache-dir --quiet aerospike-vector-search sentence-transformers

## Download Quotes Dataset

We will download a dataset of ~500k quotes to embed and insert into our vector store for later use.

In [None]:
!wget https://archive.org/download/quotes_20230625/quotes.csv


## Load the Quotes Into Documents
We will load our dataset of quotes using the document loader `CSVLoader`. Here, `lazy_load` returns an iterator, so quotes are read one at a time instead of all at once. In this example we load only the first 5000 quotes rather than all 500k.

In [65]:
from langchain_community.document_loaders.csv_loader import CSVLoader
import itertools

NUM_QUOTES = 5000
documents = CSVLoader('./quotes.csv', metadata_columns=["author", "category"]).lazy_load()
documents = list(itertools.islice(documents, NUM_QUOTES)) # Allows us to slice an iterator

In [66]:
print(documents[0])

page_content="quote: I'm selfish, impatient and a little insecure. I make mistakes, I am out of control and at times hard to handle. But if you can't handle me at my worst, then you sure as hell don't deserve me at my best." metadata={'source': './quotes.csv', 'row': 0, 'author': 'Marilyn Monroe', 'category': 'attributed-no-source, best, life, love, mistakes, out-of-control, truth, worst'}
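
To see why slicing a lazy iterator is cheap, here is a minimal sketch that does not touch the quotes file. `tracking_rows` is a hypothetical stand-in for a lazy loader, not part of LangChain; it records how many rows were actually produced.

```python
import itertools

consumed = []  # records every row the generator actually produced


def tracking_rows(total):
    """Hypothetical lazy loader: yields one row at a time and logs it."""
    for i in range(total):
        consumed.append(i)
        yield f"row {i}"


# Ask for 3 rows out of a possible 500,000.
first_three = list(itertools.islice(tracking_rows(500_000), 3))

print(first_three)    # ['row 0', 'row 1', 'row 2']
print(len(consumed))  # 3
```

Only the rows we asked for are ever generated, which is the same reason `itertools.islice` over `lazy_load()` avoids reading the whole CSV.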


## Create your Embedder
Here we use `HuggingFaceEmbeddings` with the sentence-transformer model `all-MiniLM-L6-v2` to embed our documents so that we can perform a vector search. This model produces 384-dimensional vectors, which is why `MODEL_DIM` is set to 384.

In [4]:
from langchain_community.embeddings import HuggingFaceEmbeddings
# from langchain_community.vectorstores.utils import DistanceStrategy
# distance_strategy = DistanceStrategy
MODEL_DIM = 384
embedder = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
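
As a rough illustration of what "vector search" compares, the sketch below computes cosine similarity between small hand-written vectors. The three-dimensional vectors are made up for the example (real embeddings from this model have 384 dimensions), and cosine similarity is assumed here as the comparison; it is not a description of how AVS computes distances internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" (hypothetical values, for illustration only).
query = [1.0, 0.0, 1.0]
close = [0.9, 0.1, 1.1]    # points in nearly the same direction as the query
far = [-1.0, 0.5, -0.8]    # points in roughly the opposite direction

print(cosine_similarity(query, close))
print(cosine_similarity(query, far))
```

The vector pointing in nearly the same direction as the query scores close to 1, while the opposing one scores below 0.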



## Create an Aerospike Index and Embed Documents

Before we add documents, we need to create an index. In this example, we have some convenience code that checks whether the expected index already exists and creates it only if it does not. Make sure to replace the `proximus` host in `HostPort` with the IP address or hostname of your AVS instance.

In [5]:
from langchain_community.vectorstores import Aerospike
from aerospike_vector_search import AdminClient, Client, HostPort

# Replace "proximus" with the IP address or hostname of the AVS instance you set up earlier
seed = HostPort(host="proximus", port=5002) 

# The namespace in which to store our vectors. This should match a namespace configured in your aerospike.conf file.
NAMESPACE = "test"

# The name of our new index.
INDEX_NAME = "quote-miniLM-L6-v2"

# AVS needs to know which metadata key contains our vector when creating the index and inserting documents.
VECTOR_KEY = "vector" 

client = Client(
    seeds=seed
)
admin_client = AdminClient(
    seeds=seed,
)
index_exists = False

# Check if the index already exists. If not, create it
for index in admin_client.index_list():
    print(index)
    if (
        index["id"]["namespace"] == NAMESPACE
        and index["id"]["name"] == INDEX_NAME
    ):
        index_exists = True
        break

if not index_exists:
    admin_client.index_create(
        namespace=NAMESPACE,
        name=INDEX_NAME,
        vector_field=VECTOR_KEY,
        dimensions=MODEL_DIM,
        index_meta_data={
            "model": "miniLM-L6-v2",
            "date": "05/04/2024",
            "dim": str(MODEL_DIM),
            "distance": "cosine",
        }
    )

aerospike = Aerospike.from_documents(
    documents,
    embedder,
    client,
    NAMESPACE,
    vector_key=VECTOR_KEY,
    index_name=INDEX_NAME,
)
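
The existence check above can also be written as a small helper. This is only a sketch: `has_index` is a hypothetical name, and the sample data simply mimics the `index["id"]["namespace"]` / `index["id"]["name"]` shape that the loop above reads, so it runs without an AVS instance.

```python
def has_index(indexes, namespace, name):
    """Return True if any index entry matches both namespace and name."""
    return any(
        idx["id"]["namespace"] == namespace and idx["id"]["name"] == name
        for idx in indexes
    )


# Hypothetical entries shaped like the ones the loop above inspects.
sample_indexes = [
    {"id": {"namespace": "test", "name": "other-index"}},
    {"id": {"namespace": "test", "name": "quote-miniLM-L6-v2"}},
]

print(has_index(sample_indexes, "test", "quote-miniLM-L6-v2"))  # True
print(has_index(sample_indexes, "test", "missing-index"))       # False
```

In the notebook, the first argument would be `admin_client.index_list()` and the other two would be `NAMESPACE` and `INDEX_NAME`.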

## Search the Documents
Now that we've inserted our vectors, we can use vector search on our quotes.

In [49]:
query = "A quote about the beauty of the cosmos"
docs = aerospike.similarity_search(query, k=5, index_name=INDEX_NAME, metadata_keys=["_id", "author"])

def print_documents(docs):
    for i, doc in enumerate(docs):
        print("~~~~ Document", i, "~~~~")
        print("auto-generated id:", doc.metadata["_id"])
        print("author: ",doc.metadata["author"])
        print(doc.page_content)
        print("~~~~~~~~~~~~~~~~~~~~\n")
    
print_documents(docs)

~~~~ Document 0 ~~~~
auto-generated id: 335e1605-871c-4b81-97b1-6b163f75ad23
author:  Renee Ahdieh, The Rose & the Dagger
quote: From the stars, to the stars.
~~~~~~~~~~~~~~~~~~~~

~~~~ Document 1 ~~~~
auto-generated id: 0a3dc619-0f31-49cf-b65a-18587c157ccc
author:  Elizabeth Gilbert
quote: The love that moves the sun and the other stars.
~~~~~~~~~~~~~~~~~~~~

~~~~ Document 2 ~~~~
auto-generated id: c8ed848e-cb68-4aa2-bae6-7855419b67c8
author:  Dante Alighieri, Paradiso
quote: Love, that moves the sun and the other stars
~~~~~~~~~~~~~~~~~~~~

~~~~ Document 3 ~~~~
auto-generated id: 972b226f-16c2-45fc-aae9-40805ec60fe1
author:  Thich Nhat Hanh, Teachings on Love
quote: Through my love for you, I want to express my love for the whole cosmos, the whole of humanity, and all beings. By living with you, I want to learn to love everyone and all species. If I succeed in loving you, I will be able to love everyone and all species on Earth... This is the real message of love.
~~~~~~~~~~~~~~~~~~~

## Embedding Additional Quotes as Text
We can also add more quotes as plain text by using `add_texts`.

In [19]:
aerospike = Aerospike(
    client,
    embedder,
    NAMESPACE,
    index_name=INDEX_NAME,
    vector_key=VECTOR_KEY,
)

ids = aerospike.add_texts(
    [
        "quote: Rebellions are built on hope.", 
        "quote: Logic is the beginning of wisdom, not the end.",
        "quote: If wishes were fishes, we’d all cast nets."
    ],
    metadatas=[
        {"author": "Jyn Erso, Rogue One"}, 
        {"author": "Spock, Star Trek"},
        {"author": "Frank Herbert, Dune"},
    ],
)

print("New IDs")
print(ids)

New IDs
['d38989e1-a533-4d7e-b0e2-ca1ccd2ab588', '8cfc06ff-5df0-4794-8195-2abd4eb3664f', '6f510616-5598-4c40-b5eb-1ad575714c67']


## Search Documents Using Max Marginal Relevance Search

We can use max marginal relevance (MMR) search to find vectors that are similar to our query but dissimilar to each other. In this example we create a retriever object using `as_retriever`, but this could be done just as easily by calling `aerospike.max_marginal_relevance_search` directly.
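
To make the trade-off concrete, here is a minimal pure-Python sketch of the MMR selection idea on toy two-dimensional vectors. It is an illustration under stated assumptions, not the code LangChain or AVS runs: the helper names and vectors are made up, and cosine similarity is assumed as the similarity measure. In the retriever call, `fetch_k` plays the role of the candidate pool size and `lambda_mult` weights relevance against diversity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, candidates, k=2, lambda_mult=0.7):
    """Greedily pick k candidate indices, balancing relevance and diversity."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, candidates[i])
            # Similarity to the closest already-selected item (0 if none yet).
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [
    [1.0, 0.0],   # 0: identical to the query
    [0.99, 0.1],  # 1: near-duplicate of candidate 0
    [0.6, 0.8],   # 2: less relevant, but different from candidate 0
]

print(mmr(query, candidates, k=2, lambda_mult=0.7))  # [0, 1]
print(mmr(query, candidates, k=2, lambda_mult=0.3))  # [0, 2]
```

With `lambda_mult=0.7` relevance dominates and the near-duplicate is kept; lowering it to `0.3` penalizes redundancy enough that the more distinct candidate is chosen instead.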

In [42]:
query = "A quote about our favorite four-legged pets"
retriever = aerospike.as_retriever(search_type="mmr", search_kwargs={"fetch_k": 20, "lambda_mult":0.7})
matched_docs = retriever.invoke(query)

print_documents(matched_docs)

~~~~ Document 0 ~~~~
auto-generated id: 4d764aae-9134-4cf6-b8f3-cc0a378f6a4e
score: 0.781421422958374
author:  John Grogan, Marley and Me: Life and Love With the World's Worst Dog
quote: Such short little lives our pets have to spend with us, and they spend most of it waiting for us to come home each day. It is amazing how much love and laughter they bring into our lives and even how much closer we become with each other because of them.
~~~~~~~~~~~~~~~~~~~~

~~~~ Document 1 ~~~~
auto-generated id: 28622daf-c70c-4591-990c-bc041c9c6b52
score: 0.781421422958374
author:  Colleen Houck, Tiger's Curse
quote: He then put both hands on the door on either side of my head and leaned in close, pinning me against it. I trembled like a downy rabbit caught in the clutches of a wolf. The wolf came closer. He bent his head and began nuzzling my cheek. The problem was…I wanted the wolf to devour me.
~~~~~~~~~~~~~~~~~~~~

~~~~ Document 2 ~~~~
auto-generated id: 3437bd07-d000-4649-b794-761494baea96
scor

## Search Documents with a Relevance Threshold

Another useful feature is similarity search with a relevance threshold. Generally, we want only the results most similar to our query, and only if they are similar enough to be useful. A relevance of 1 is most similar and a relevance of 0 is most dissimilar.

In [64]:
query = "A quote about stormy weather"
retriever = aerospike.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={'score_threshold': 0.4} # A greater value returns items with more relevance
)
matched_docs = retriever.invoke(query)

print_documents(matched_docs)

~~~~ Document 0 ~~~~
auto-generated id: 95c0538f-83bd-4ff4-baf4-c9497802344c
author:  William Shakespeare, The Complete Sonnets and Poems
quote: Love comforteth like sunshine after rain.
~~~~~~~~~~~~~~~~~~~~

~~~~ Document 1 ~~~~
auto-generated id: 2e052de1-fde6-4bda-ac27-0d85ebae59f8
author:  Christina Rossetti, Goblin Market and Other Poems
quote: For there is no friend like a sisterIn calm or stormy weather; To cheer one on the tedious way, To fetch one if one goes astray,To lift one if one totters down, To strengthen whilst one stands
~~~~~~~~~~~~~~~~~~~~
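
Conceptually, the threshold acts as a filter over scored results. The sketch below shows that idea on made-up `(text, relevance)` pairs; the scores are hypothetical and the helper is not part of the LangChain API.

```python
def filter_by_threshold(scored_docs, score_threshold):
    """Keep only (doc, score) pairs whose relevance meets the threshold."""
    return [(doc, score) for doc, score in scored_docs if score >= score_threshold]


# Hypothetical relevance scores in the range 0 (dissimilar) to 1 (similar).
scored = [
    ("sunshine after rain", 0.62),
    ("calm or stormy weather", 0.45),
    ("rebellions are built on hope", 0.12),
]

kept = filter_by_threshold(scored, score_threshold=0.4)
print(kept)
# [('sunshine after rain', 0.62), ('calm or stormy weather', 0.45)]
```

Raising the threshold shrinks the result set: with `score_threshold=0.5` only the first pair would remain.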

