# Visual Data Management System (VDMS)

>[VDMS](https://github.com/IntelLabs/vdms) is a storage solution for efficient access to big "visual" data. It aims to achieve cloud scale by searching for relevant visual data via visual metadata stored as a graph, and by enabling machine-friendly enhancements to visual data for faster access. VDMS is licensed under MIT.

It supports:
- Approximate nearest neighbor search
- Euclidean distance and cosine similarity
- Hybrid search combining vector and metadata searches
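
As a rough illustration of the two metrics named in the list above, here is a minimal pure-Python sketch. It is independent of VDMS; the helper names `euclidean_distance` and `cosine_similarity` are made up for this example and say nothing about how VDMS computes these values internally.

```python
import math

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_vec = [1.0, 0.0]
same_direction = [2.0, 0.0]
orthogonal = [0.0, 3.0]

print(euclidean_distance(query_vec, same_direction))  # 1.0
print(cosine_similarity(query_vec, same_direction))   # 1.0
print(cosine_similarity(query_vec, orthogonal))       # 0.0
```

Euclidean distance is sensitive to vector length, while cosine similarity only looks at direction, which is why the first two results differ in meaning even though both print `1.0`.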

This notebook shows how to use VDMS as a vector store (`VDMSVectorSearch`).


Install VDMS client with:

```sh
pip install vdms
```


In [1]:
%pip install vdms sentence-transformers

Note: you may need to restart the kernel to use updated packages.


## Basic Example (using the Docker Container)

In this basic example, we take the most recent State of the Union Address, split it into chunks, embed it using an open-source embedding model, load it into VDMS, and then query it.

You can run the VDMS Server in a Docker container separately, create a Client to connect to it, and then pass that to LangChain. 

VDMS has the ability to handle multiple `Collections` of documents, but the LangChain interface expects one, so we need to specify the collection (DescriptorSet) name. The default collection (DescriptorSet) name used by LangChain is "langchain".

### Load Document and Obtain Embedding Function

In [2]:
from langchain.document_loaders import TextLoader
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores.vdms import VDMSVectorSearch
import time

# load the document
raw_documents = TextLoader("../../modules/state_of_the_union.txt").load()

# split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(raw_documents)

# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
emb_dim = 384  # all-MiniLM-L6-v2 produces 384-dimensional embeddings
collection_name = "my_collection"

# connection details for the VDMS server
db_host = "localhost"
db_port = 55555
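
To give a feel for what the splitting step in the cell above does, here is a simplified sketch of separator-based chunking. It is not LangChain's implementation: `split_into_chunks` is a hypothetical helper that assumes a blank-line separator and greedily merges paragraphs up to `chunk_size` characters, with no overlap.

```python
def split_into_chunks(text, chunk_size=1000, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most
    chunk_size characters. A single piece longer than chunk_size is kept whole."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate          # piece still fits (or starts a chunk)
        else:
            chunks.append(current)       # close the current chunk
            current = piece              # start a new one with this piece
    if current:
        chunks.append(current)
    return chunks

sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
for i, chunk in enumerate(split_into_chunks(sample, chunk_size=40), start=1):
    print(i, repr(chunk))
```

With `chunk_size=40`, the first two paragraphs fit into one chunk and the third starts a new one.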

### Start VDMS Server
Here we start the VDMS server with default port 55555.

In [3]:
!docker run ${DOCKER_PROXY_RUN_ARGS} --rm -d -p 55555:55555 --name vdms_vs_nb intellabs/vdms:latest
time.sleep(5)


e880381753311fb933a5fe4ecc3680c89c4d79eda51138eb16d303e4aa79a889
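
The cell above waits a fixed five seconds for the container to come up. An alternative, sketched here under the assumption that an open TCP port is a good enough readiness signal, is to poll the port until it accepts a connection. `wait_for_port` is a hypothetical helper built on the standard library, not part of VDMS or LangChain, and an open port does not by itself prove the server is ready to answer queries.

```python
import socket
import time

def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds,
    or False if `timeout` seconds pass without one."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Opening and immediately closing a connection is enough here.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # not reachable yet; retry shortly
    return False

# Possible use in place of time.sleep(5):
# if not wait_for_port("localhost", 55555):
#     raise RuntimeError("VDMS server did not become reachable in time")
```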


In [4]:
db = VDMSVectorSearch.from_documents(
    docs,
    embedding_function=embedding_function,
    collection_name=collection_name,
    embedding_dimension=emb_dim,
    connection_args={"host": db_host, "port": db_port},
)

# Query it
query = "What did the president say about Ketanji Brown Jackson"
constraints = {"set": ["==", collection_name]}
docs = db.similarity_search(query, filter=None)  # pass filter=constraints to restrict results

# Print Results
print(docs[0].page_content)

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.


## Update and Delete

While building toward a real application, you will want to go beyond adding data and also update and delete it.

VDMS allows users to provide `ids` to simplify the bookkeeping here. `ids` can be the name of the file, or a combined hash like `filename_paragraphNumber`, etc.
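
One way to build ids of the `filename_paragraphNumber` kind is sketched below. `make_ids` is a hypothetical helper for illustration only; any scheme that yields unique strings would serve the same purpose.

```python
from pathlib import Path

def make_ids(source_path, num_chunks):
    """Build ids such as 'state_of_the_union_1' from a file name and a
    1-based paragraph (chunk) number."""
    stem = Path(source_path).stem  # file name without directory or extension
    return [f"{stem}_{i}" for i in range(1, num_chunks + 1)]

print(make_ids("../../modules/state_of_the_union.txt", 3))
# ['state_of_the_union_1', 'state_of_the_union_2', 'state_of_the_union_3']
```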

VDMS supports deletion of collections/descriptors but does not support updating this type of data in place. To update an entry, we must find it, delete it, and then add it again with new values.
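
The find, delete, and re-add pattern can be pictured with a plain dictionary standing in for the store. This is only a conceptual sketch: `replace_entry` and the dictionary layout are assumptions made for the example and do not reflect how VDMS stores descriptors.

```python
def replace_entry(store, entry_id, new_text, new_metadata):
    """Emulate an update on a store that only supports add and delete."""
    if entry_id not in store:                               # 1. find
        raise KeyError(f"no entry with id {entry_id!r}")
    del store[entry_id]                                     # 2. delete
    store[entry_id] = {"text": new_text, **new_metadata}    # 3. add again

store = {"1": {"text": "old text", "source": "a.txt"}}
replace_entry(store, "1", "old text", {"source": "a.txt", "new_value": "hello world"})
print(store["1"])
# {'text': 'old text', 'source': 'a.txt', 'new_value': 'hello world'}
```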

Here is a basic example showing how to do various operations:

In [5]:

# create simple ids
ids = [str(i) for i in range(1, len(docs) + 1)]

# add data
db.from_documents(
    docs,
    embedding_function=embedding_function,
    embedding_dimension=emb_dim,
    ids=ids,
    collection_name=collection_name,
    connection_args={"host": db_host, "port": db_port},
)
docs = db.similarity_search(query)
print(docs[0].metadata)

# update the metadata for a document
docs[0].metadata = {
    "source": "../../../modules/state_of_the_union.txt",
    "new_value": "hello world",
}
db.update_document(collection_name, ids[0], docs[0])
print(db.get(collection_name, constraints={"id": ["==", ids[0]]}))

# delete the last document
print("\nCount before deletion: ", db.count(collection_name))
id_to_remove = ids[-1]
db.delete_collection(collection_name, ids=[id_to_remove])
print(f"Count after removing id {id_to_remove}: ", db.count(collection_name))


{'source': '../../../modules/state_of_the_union.txt'}
([{'FindDescriptor': {'entities': [{'id': '1', 'new_value': 'hello world', 'source': '../../../modules/state_of_the_union.txt', 'text': 'Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.'}], 'returned':

***

## Other Information

### Similarity search by vector

In [6]:
# Similarity search by vector
embedding_vector = embedding_function.embed_query(query)
docs = db.similarity_search_by_vector(embedding_vector)

# Print Results
docs[0].page_content

'Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.'

### Similarity search with score

The returned distance score is cosine distance. Therefore, a lower score is better.

In [7]:
# Similarity search with score
docs_and_scores = db.similarity_search_with_score(query)
docs_and_scores[0]

(Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'new_value': 'hello world', 'source': '../../../modules/state_of_the_union.txt'}),
 1.1972049474716187)

### Retriever options

This section goes over different options for how to use VDMS as a retriever.

#### MMR

In addition to using similarity search in the retriever object, you can also use `mmr`.

In [8]:
# Maximal Marginal Relevance Search (MMR)
retriever = db.as_retriever(search_type="mmr")
retriever.get_relevant_documents(query)[0]

Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../../modules/state_of_the_union.txt'})
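
For intuition, Maximal Marginal Relevance picks results one at a time, trading relevance to the query against similarity to results already chosen. The sketch below is a generic pure-Python version of that idea, not LangChain's or VDMS's code; `mmr_select` and its `lambda_mult` weighting are assumptions made for this example, and the candidate list plays the role of the initially fetched documents.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(query_vec, candidate_vecs, k=2, lambda_mult=0.5):
    """Return indices of up to k candidates chosen by Maximal Marginal Relevance."""
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = cosine_similarity(query_vec, candidate_vecs[i])
            # Highest similarity to anything already selected (0.0 if none yet).
            redundancy = max(
                (cosine_similarity(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

query_vec = [1.0, 0.0]
candidates = [[1.0, 0.1], [1.0, 0.11], [0.8, -0.6]]
print(mmr_select(query_vec, candidates, k=2))                   # [0, 2]
print(mmr_select(query_vec, candidates, k=2, lambda_mult=1.0))  # [0, 1]
```

Candidates 0 and 1 are near-duplicates, so with `lambda_mult=0.5` the second pick is the less similar candidate 2. With `lambda_mult=1.0` the redundancy term vanishes and the selection reduces to plain relevance ranking.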

In [9]:

# Use max_marginal_relevance_search directly
db.max_marginal_relevance_search(query, k=2, fetch_k=10)


[Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../../modules/state_of_the_union.txt'}),
 Document(page_content='As Ohio Senator Sherrod Brown says, “It’s time to bury the label “Rust Belt.” \n\nIt’s time

### Filtering on metadata

It can be helpful to narrow down the collection before working with it.

For example, collections can be filtered on metadata using the `get` method.
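
A constraint such as `{"id": ["==", "2"]}` pairs a property name with an `[operator, value]` list. The sketch below mimics that behavior on plain dictionaries to show what such a filter selects. It is a simplified stand-in that assumes a single operator per property; `matches` and `OPS` are hypothetical names, and the real evaluation happens inside the VDMS server.

```python
import operator

# Comparison operators assumed for this sketch.
OPS = {"==": operator.eq, "!=": operator.ne, ">": operator.gt,
       ">=": operator.ge, "<": operator.lt, "<=": operator.le}

def matches(entity, constraints):
    """True if every property in `constraints` satisfies its [op, value] pair."""
    for prop, (op, value) in constraints.items():
        if prop not in entity or not OPS[op](entity[prop], value):
            return False
    return True

entities = [
    {"id": "1", "source": "a.txt"},
    {"id": "2", "source": "a.txt"},
]
print([e for e in entities if matches(e, {"id": ["==", "2"]})])
# [{'id': '2', 'source': 'a.txt'}]
```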

In [10]:
# filter collection for id
response, response_array = db.get(collection_name,
    constraints={
        "id": ["==", "2"]
    }
)
db._client.print_last_response()

[
    {
        "FindDescriptor": {
            "entities": [
                {
                    "id": "2",
                    "new_value": "Missing property",
                    "source": "../../../modules/state_of_the_union.txt",
                    "text": "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia\u2019s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world w

In [11]:
!docker kill vdms_vs_nb

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


vdms_vs_nb
