# Visual Data Management System (VDMS)

>[VDMS](https://github.com/IntelLabs/vdms) is a storage solution for efficient access of big-”visual”-data that aims to achieve cloud scale by searching for relevant visual data via visual metadata stored as a graph and enabling machine friendly enhancements to visual data for faster access. VDMS is licensed under MIT.

It supports:
- K nearest neighbor search
- Euclidean distance (L2) and inner product (IP)
- Vector and metadata searches

See the [installation instructions](https://github.com/IntelLabs/vdms/blob/master/INSTALL.md) and [docker image](https://hub.docker.com/r/intellabs/vdms).

This notebook shows how to use VDMS as a vector store (`VDMSVectorSearch`) using the docker image.


Install Python packages for VDMS client and Sentence Transformers:

In [1]:
# Pip install necessary package
%pip install --upgrade --quiet pip
%pip install --upgrade --quiet vdms
%pip install --upgrade --quiet sentence-transformers

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


## Basic Example (using the Docker Container)

In this basic example, we take the most recent State of the Union Address, split it into chunks, embed it using an open-source embedding model, load it into VDMS, and then query it.

You can run the VDMS Server in a Docker container separately to use with LangChain. 

VDMS has the ability to handle multiple `Collections` of documents, but the LangChain interface expects one, so we need to specify the collection (DescriptorSet) name. The default collection (DescriptorSet) name used by LangChain is "langchain".


### Load Document and Obtain Embedding Function

In [2]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores.vdms import VDMSVectorSearch
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
import time

# load the document and split it into chunks
document_path = "../../modules/state_of_the_union.txt"
raw_documents = TextLoader(document_path).load()

# split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(raw_documents)

# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2") #384 dimensions
emb_dim = 384
collection_name="my_collection"

# Configurations
db_host = "localhost"
db_port = 55555
connection_args={"host": db_host, "port": db_port}
collection_name = "my_collection"
distance_strategy = "L2"
engine = "FaissFlat"

### Start VDMS Server
Here we start the VDMS server with default port 55555.

In [3]:
!docker run ${DOCKER_PROXY_RUN_ARGS} --rm -d -p 55555:55555 --name vdms_vs_nb intellabs/vdms:latest
time.sleep(5)


d1ee8f09a6eb915f1982bda2b8c35f37a7dc8ccfe402563b3d7a557b7518761c


## Similarity Search using Faiss Flat and Euclidean Distance (Default)

Here we add the documents to VDMS and query it.

In [4]:
db = VDMSVectorSearch.from_documents(
    docs,
    collection_name=collection_name,
    embedding_function=embedding_function,
    engine=engine,
    distance_strategy=distance_strategy,
    connection_args=connection_args,
)

# Query
k = 3
query = "What did the president say about Ketanji Brown Jackson"
docs_with_score = db.similarity_search_with_score(query, k, filter=None)
print("-" * 50)
for doc, score in docs_with_score:
    print("Score: ", score)
    print(f"\nContent: \n{doc.page_content}")
    print("-" * 50)

--------------------------------------------------
Score:  1.1972054243087769

Content: 
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
--------------------------------------------------
Score:  1.6471163034439087

Content: 
Madam Speaker, Madam Vice Presiden

## Update and Delete

While building toward a real application, you want to go beyond adding data, and also update and delete data. 

VDMS allows users to provide `ids` to simplify the bookkeeping here. `ids` can be the name of the file, or a combined has like `filename_paragraphNumber`, etc.

VDMS supports deletion of collections/descriptors but does not support updating this type of data.  To update a collection, we must find it, delete it, then add it again with new values.

Here is a basic example showing how to do various operations:

In [5]:
# create simple ids
ids = [str(i) for i in range(1, len(docs) + 1)]

# add data
db.from_documents(
    docs,
    embedding_function=embedding_function,
    ids=ids,
    collection_name=collection_name,
    engine=engine,
    distance_strategy=distance_strategy,
    connection_args=connection_args,
)
docs = db.similarity_search(query)
print(f"original metadata: {docs[0].metadata}")

original metadata: {'source': '../../modules/state_of_the_union.txt'}


Now update the metadata for the first document and delete the last document

In [6]:
# update the metadata for a document
docs[0].metadata = {
    "source": document_path,
    "new_value": "hello world",
}
print(f"new metadata: {docs[0].metadata}\n")

# Update document in VDMS
db.update_document(collection_name, ids[0], docs[0])
response, response_array = db.get(collection_name, constraints={"id": ["==", ids[0]]})
print(f"Returned entry:")
for key, value in response[0]['FindDescriptor']['entities'][0].items():
    if value != 'Missing property':
        print(f"{key}: {value}")

# delete the last document
print("\nCount before deletion: ", db.count(collection_name))
id_to_remove = ids[-1]
db.delete_collection(collection_name, ids=[id_to_remove])
print(f"Count after removing id {id_to_remove}: ", db.count(collection_name))


new metadata: {'source': '../../modules/state_of_the_union.txt', 'new_value': 'hello world'}

Returned entry:
content: Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
id: 1
new_value: hello world
source: ../../modules/state_of_the_union.txt

Count before delet

***

## Other Information

### Similarity search by vector

In [7]:
# Similarity search by vector
embedding_vector = embedding_function.embed_query(query)
docs = db.similarity_search_by_vector(embedding_vector)

# Print Results
print(docs[0].page_content)

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.


### Similarity search with score

The returned distance score is cosine distance. Therefore, a lower score is better.

In [8]:
# Similarity ssearch with score
docs_and_scores = db.similarity_search_with_score(query)
print("-" * 50)
for doc, score in docs_with_score:
    print("Score: ", score)
    print(f"\nContent: \n{doc.page_content}")
    print("-" * 50)

--------------------------------------------------
Score:  1.1972054243087769

Content: 
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
--------------------------------------------------
Score:  1.6471163034439087

Content: 
Madam Speaker, Madam Vice Presiden

### Filtering on metadata

It can be helpful to narrow down the collection before working with it.

For example, collections can be filtered on metadata using the get method.

In [9]:
# filter collection for id
response, response_array = db.get(collection_name,
                                  limit=1,
                                  constraints={"id": ["==", "2"]}
)

print(f"Returned entry:")
for key, value in response[0]['FindDescriptor']['entities'][0].items():
    if value != 'Missing property':
        print(f"{key}: {value}")

Returned entry:
content: Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland. 

In this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight. 

Let each of us here tonight in this Chamber send an unmistakable signal to Ukraine and to the world. 

Please rise if you are able and show that, Yes, we the United States of America stand with the Ukrainian people. 

Throughout our history we’ve learned this lesson when dictators do not pay a price for their aggression they cause more chaos.   

They keep moving.   

And the costs and the threats to America and the world keep rising.   

That’s why the NATO Alliance was created to secure peace and stability in Europe after World War 2. 

The United States is a member along with 29 other nations. 

It matters. American diplomacy matters. Ameri

### Retriever options

This section goes over different options for how to use VDMS as a retriever.


#### Simiarity Search

Here we use similarity search in the retriever object.


In [10]:
retriever = db.as_retriever()
relevant_docs = retriever.get_relevant_documents(query)[0]

print(f"Content: \n{relevant_docs.page_content}")
print(f"\nMetadata: \n{relevant_docs.metadata}")

Content: 
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

Metadata: 
{'new_value': 'hello world', 'source': '../../modules/state_of_the_union.txt'}


#### MMR

In addition to using similarity search in the retriever object, you can also use `mmr`.

In [11]:
# Maximal Marginal Relevance Search (MMR)
retriever = db.as_retriever(search_type="mmr")
relevant_docs = retriever.get_relevant_documents(query)[0]

print(f"Content: \n{relevant_docs.page_content}")
print(f"\nMetadata: \n{relevant_docs.metadata}")

Content: 
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

Metadata: 
{'source': '../../modules/state_of_the_union.txt'}


We can also use MMR directly.

In [12]:

# Use max_marginal_relevance_search directly
mmr_resp = db.max_marginal_relevance_search_with_score(query, k=2, fetch_k=10)
print("-" * 50)
for doc, score in mmr_resp:
    print("Score: ", score)
    print(f"\nContent: \n{doc.page_content}")
    print(f"\nMetadata: \n{doc.metadata}")
    print("-" * 50)


--------------------------------------------------
Score:  1.1972054243087769

Content: 
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

Metadata: 
{'source': '../../modules/state_of_the_union.txt'}
--------------------------------------------------
Score:  1

## Question Answering with Sources

This section goes over how to do question-answering with sources over an Index. It does this by using the `RetrievalQAWithSourcesChain`, which does the lookup of the documents from an Index. 

In [13]:
# from langchain.chains import RetrievalQAWithSourcesChain
# from langchain_openai import ChatOpenAI

# retriever = db.as_retriever()
# chain = RetrievalQAWithSourcesChain.from_chain_type(
#     ChatOpenAI(temperature=0), chain_type="stuff", retriever=retriever
# )
# chain(
#     {"question": "What did the president say about Justice Breyer"},
#     return_only_outputs=True,
# )

### Stop VDMS Server

In [14]:
!docker kill vdms_vs_nb

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


vdms_vs_nb
