# Chroma

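This notebook shows how to use the Chroma vector database with LangChain: building an index from documents, running similarity searches, persisting the database to disk, and using it as a retriever.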

In [1]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

In [2]:
from langchain.document_loaders import TextLoader
import os

# Load the text file and split it into chunks
# Set the file path and encoding for the text file
loader = TextLoader(file_path='../../../state_of_the_union.txt', encoding="utf-8")

# Load the text file into a list of documents
documents = loader.load()

# Create a CharacterTextSplitter to split the text into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

# Split the documents into chunks
docs = text_splitter.split_documents(documents)

# Check that the OPENAI_API_KEY environment variable is set
if os.environ.get('OPENAI_API_KEY') is None:
    # Raise an error if the variable is not set
    raise ValueError("Please set OPENAI_API_KEY environment variable")

# Create an instance of OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
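
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal, dependency-free sketch of fixed-size chunking. It is not LangChain's implementation: `CharacterTextSplitter` splits on a separator and then merges the pieces up to the size limit, whereas the hypothetical `split_fixed` helper below simply slices by character count.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Slice text into windows of chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive windows share chunk_overlap characters.
    Hypothetical helper for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij"
no_overlap = split_fixed(sample, chunk_size=4)
with_overlap = split_fixed(sample, chunk_size=4, chunk_overlap=2)
print(no_overlap)
print(with_overlap)
```

With `chunk_overlap=0`, as in the cell above, no text is shared between neighbouring chunks; a positive overlap repeats the tail of one chunk at the head of the next.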


In [3]:
# Create a Chroma object from the docs and embeddings
db = Chroma.from_documents(docs, embeddings)

# Set the search query
query = "What did the president say about Ketanji Brown Jackson"
# Search for documents that are similar to the query.
# Store the hits under a new name so the chunk list in `docs` stays intact for later cells.
results = db.similarity_search(query)

Using embedded DuckDB without persistence: data will be transient


In [4]:
print(results[0].page_content)

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.


## Similarity search with score

In [5]:
# Similarity search with score returns the score along with the document
output = db.similarity_search_with_score(query)

In [6]:
output[0]

(Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt'}),
 0.3949804902076721)
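
The second element of the returned tuple is a distance, so a lower score means a closer match. The sketch below illustrates that ordering with hand-made 2-D vectors and squared Euclidean distance. Both the metric and the toy vectors are assumptions for illustration; the metric actually used depends on how the Chroma collection is configured.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_with_score(query_vec, items, k=2):
    """Return the k (text, distance) pairs closest to query_vec, nearest first."""
    scored = [(text, squared_l2(query_vec, vec)) for text, vec in items]
    scored.sort(key=lambda pair: pair[1])  # smaller distance = better match
    return scored[:k]


# Toy "embeddings" (hypothetical values, not produced by a real model)
items = [("far", [0.0, 1.0]), ("near", [0.9, 0.1]), ("exact", [1.0, 0.0])]
hits = search_with_score([1.0, 0.0], items)
for text, distance in hits:
    print(text, distance)
```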

## Persistence

The steps below show how to persist a Chroma database to disk and load it again.

### Initialize Persistent ChromaDB
Create embeddings for each chunk and insert them into the Chroma vector database. The `persist_directory` argument tells Chroma where to store the database when it is persisted.



In [7]:
import os
import tempfile

# Set up a persist_directory to store the embeddings on disk in a temporary directory
persist_directory = os.path.join(tempfile.gettempdir(), "db")
os.makedirs(persist_directory, exist_ok=True)

# Print the directory where embeddings will be stored
print("Persist directory: ", persist_directory)

# Create an instance of OpenAIEmbeddings and use it to create a Chroma vector database
embedding = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=docs, embedding=embedding, persist_directory=persist_directory)

Using embedded DuckDB with persistence: data will be stored in: C:\Users\ADMINI~1\AppData\Local\Temp\2\db


Persist directory:  C:\Users\ADMINI~1\AppData\Local\Temp\2\db


### Persist the Database
Call `persist()` to make sure the embeddings are written to disk.

In [8]:
# Save the database to disk
vectordb.persist()

# Drop the reference to the in-memory object; the database is reloaded from disk in the next step
del vectordb

### Load the Database from disk
Be sure to pass the same `persist_directory` and `embedding_function` that you used when you created the database.

In [9]:
# Load the persisted database from disk
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)

# Query the reloaded database for similar documents, with similarity scores
output = vectordb.similarity_search_with_score(query)

# Print the top result
print(output[0])

Using embedded DuckDB with persistence: data will be stored in: C:\Users\ADMINI~1\AppData\Local\Temp\2\db


(Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt'}), 0.3951101303100586)
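
The write, drop, and reopen cycle can be illustrated without Chroma or an API key. The sketch below uses a hypothetical `TinyStore` class that saves its records as JSON; it is a stand-in for the idea of a persist directory, not a description of how Chroma stores data.

```python
import json
import os
import tempfile


class TinyStore:
    """Hypothetical stand-in for a persistent vector store (not Chroma's API)."""

    def __init__(self, persist_directory):
        self.path = os.path.join(persist_directory, "vectors.json")
        self.records = []
        # Reload previously persisted records, if the file already exists
        if os.path.exists(self.path):
            with open(self.path, encoding="utf-8") as f:
                self.records = json.load(f)

    def add(self, text, vector):
        self.records.append({"text": text, "vector": vector})

    def persist(self):
        # Nothing reaches disk until persist() is called
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.records, f)


with tempfile.TemporaryDirectory() as persist_dir:
    store = TinyStore(persist_dir)
    store.add("hello", [0.1, 0.2])
    store.persist()
    del store  # drop the in-memory object

    # Reopen from the same directory: the records come back
    reloaded = TinyStore(persist_dir)
    restored = reloaded.records
    print(restored)
```

Opening the store with a different directory would yield an empty store, which is why the same `persist_directory` has to be passed when reloading.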


## Retriever options

This section goes over different options for how to use Chroma as a retriever.

### MMR

In addition to plain similarity search, the retriever can use maximal marginal relevance (`mmr`), which balances relevance to the query against diversity among the returned documents.

In [10]:
retriever = db.as_retriever(search_type="mmr")

In [11]:
retriever.get_relevant_documents(query)[0]

Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': '../../../state_of_the_union.txt'})
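
To show what `mmr` changes, here is a minimal sketch of maximal marginal relevance selection over toy 2-D vectors. It is an illustration under stated assumptions, not LangChain's or Chroma's code: the `mmr` and `cosine` helpers, the example vectors, and the `lambda_mult` weighting of 0.5 are all chosen for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of up to k documents chosen by maximal marginal relevance.

    Each step picks the candidate with the best trade-off between relevance
    to the query and similarity to the documents already selected.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query_vec = [1.0, 0.0]
# Documents 0 and 1 are near-duplicates; document 2 is less relevant but different
doc_vecs = [[0.9, 0.1], [0.9, 0.12], [0.7, -0.7]]

plain_top2 = sorted(
    range(len(doc_vecs)),
    key=lambda i: cosine(query_vec, doc_vecs[i]),
    reverse=True,
)[:2]
mmr_top2 = mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5)
print(plain_top2, mmr_top2)
```

Plain similarity ranking returns the two near-duplicates, while the MMR selection keeps the best match and then prefers the dissimilar document over the duplicate.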