## <b><font color='darkblue'>Meet ChromaDB for LLM Applications</font></b>
<b><font size='3ptx'>[ChromaDB](https://docs.trychroma.com/) is an open-source vector database designed specifically for LLM applications.</font></b>

<b>ChromaDB offers you both a user-friendly API and impressive performance, making it a great choice for many embedding applications</b>. To get started, activate your virtual environment and run the following command:
```shell
(venv) $ python -m pip install chromadb
```

<br/>

If you have any issues installing ChromaDB, take a look at [the troubleshooting guide](https://docs.trychroma.com/troubleshooting#build-error-when-running-pip-install-chromadb) for help.

### <b><font color='darkgreen'>Store documents</font></b>
<font size='3ptx'><b>Because you have a grasp on vectors and embeddings, and you understand the motivation behind vector databases, the best way to get started is with an example</b></font>.  

<b>For this example, you’ll store ten documents to search over.</b> To illustrate the power of embeddings and semantic search, each document covers a different topic, and you’ll see how well ChromaDB associates your queries with similar documents.

In [12]:
import chromadb
from chromadb.utils import embedding_functions

CHROMA_DATA_PATH = "chroma_data/"
EMBED_MODEL = "all-MiniLM-L6-v2"
COLLECTION_NAME = "demo_docs"

client = chromadb.PersistentClient(path=CHROMA_DATA_PATH)

In [8]:
collections = client.list_collections()

In [9]:
collections

[Collection(name=demo_docs)]

Next, you instantiate your embedding function and the ChromaDB collection to store your documents in:

In [14]:
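# Instantiate the embedding function up front so that both branches can use it.
embedding_func = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name=EMBED_MODEL
)

has_registered = False
if not collections:
    collection = client.create_collection(
        name=COLLECTION_NAME,
        embedding_function=embedding_func,
        metadata={"hnsw:space": "cosine"},
    )
else:
    has_registered = True
    # Look the collection up by name instead of relying on list order.
    collection = client.get_collection(
        name=COLLECTION_NAME,
        embedding_function=embedding_func,
    )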

print(f'has_registered is {has_registered}')

has_registered is True


You specify an embedding function from the [**SentenceTransformers**](https://sbert.net/) library. ChromaDB will use this to embed all your documents and queries. In this example, you’ll continue using the "`all-MiniLM-L6-v2`" model. You then create your first collection.

<b>A collection is the object that stores your embedded documents along with any associated metadata</b>. If you’re familiar with relational databases, then you can think of a collection as a table. In this example, your collection is named `demo_docs`, it uses the "`all-MiniLM-L6-v2`" embedding function that you instantiated, and it uses the cosine distance function, as specified by `metadata={"hnsw:space": "cosine"}`.

The last step in setting up your collection is to add documents and metadata:

In [13]:
documents = [
     "The latest iPhone model comes with impressive features and a powerful camera.",
     "Exploring the beautiful beaches and vibrant culture of Bali is a dream for many travelers.",
     "Einstein's theory of relativity revolutionized our understanding of space and time.",
     "Traditional Italian pizza is famous for its thin crust, fresh ingredients, and wood-fired ovens.",
     "The American Revolution had a profound impact on the birth of the United States as a nation.",
     "Regular exercise and a balanced diet are essential for maintaining good physical health.",
     "Leonardo da Vinci's Mona Lisa is considered one of the most iconic paintings in art history.",
     "Climate change poses a significant threat to the planet's ecosystems and biodiversity.",
     "Startup companies often face challenges in securing funding and scaling their operations.",
     "Beethoven's Symphony No. 9 is celebrated for its powerful choral finale, 'Ode to Joy.'",
]

genres = [
     "technology",
     "travel",
     "science",
     "food",
     "history",
     "fitness",
     "art",
     "climate change",
     "business",
     "music",
]

In [16]:
if not has_registered:
    collection.add(
        documents=documents,
        ids=[f"id{i}" for i in range(len(documents))],
        metadatas=[{"genre": g} for g in genres],
    )

<b>The `metadatas` argument is optional, but most of the time, it’s useful to store metadata with your embeddings. In this case, you define a single metadata field, "genre", that records the genre of each document</b>. When you query a document, metadata provides you with additional information that can be helpful to better understand the document’s contents. You can also filter on metadata fields, just like you would in a relational database query.
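Because `collection.add()` pairs documents, IDs, and metadata by position, it can help to check that the lists line up before you add them. The following is a minimal sketch in plain Python that uses shortened placeholder data; `build_records` is a hypothetical helper for illustration and isn't part of ChromaDB:

```python
def build_records(documents, genres):
    """Pair each document with a generated ID and a genre metadata dict.

    Hypothetical helper for illustration; not part of the ChromaDB API.
    """
    if len(documents) != len(genres):
        raise ValueError("documents and genres must have the same length")
    ids = [f"id{i}" for i in range(len(documents))]
    metadatas = [{"genre": g} for g in genres]
    return ids, metadatas


# Shortened placeholder data standing in for the ten documents above.
sample_documents = ["A document about phones.", "A document about beaches."]
sample_genres = ["technology", "travel"]

ids, metadatas = build_records(sample_documents, sample_genres)
print(ids)        # ['id0', 'id1']
print(metadatas)  # [{'genre': 'technology'}, {'genre': 'travel'}]
```

Note that the generated IDs start at `id0`, so the document at position `i` in the list gets the ID `id{i}`.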

### <b><font color='darkgreen'>Query the vector store</font></b>
<b><font size='3ptx'>With documents embedded and stored in a collection, you’re ready to run some semantic queries.</font></b>

The following snippet sends the query `Find me some delicious food!` and requests a single result:

In [17]:
query_results = collection.query(
    query_texts=["Find me some delicious food!"],
    n_results=1)

In [18]:
query_results.keys()

dict_keys(['ids', 'distances', 'metadatas', 'embeddings', 'documents', 'uris', 'data'])

In [19]:
query_results["documents"]

[['Traditional Italian pizza is famous for its thin crust, fresh ingredients, and wood-fired ovens.']]

In [20]:
query_results["ids"]

[['id3']]

In [21]:
query_results["distances"]

[[0.7638262063498773]]

In [22]:
query_results["metadatas"]

[[{'genre': 'food'}]]

As you can see, the embedding for `Traditional Italian pizza is famous for its thin crust, fresh ingredients, and wood-fired ovens` was most similar to the query `Find me some delicious food`. You probably agree that this document is the closest match. You can also see the ID, metadata, and distance associated with the matching document embedding. Here, you’re using **cosine distance**, which is one minus the cosine similarity between two embeddings.
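To make the definition concrete, here's a minimal sketch of cosine distance in plain Python. It uses toy three-dimensional vectors rather than real embeddings (the "`all-MiniLM-L6-v2`" model produces 384-dimensional vectors), and `cosine_distance` is an illustrative function, not the routine that ChromaDB runs internally:

```python
import math


def cosine_distance(a, b):
    """Return one minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors chosen to show the extremes of the distance range.
same_direction = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_distance([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_distance([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0])

print(same_direction, orthogonal, opposite)
```

Vectors that point in the same direction have a distance of roughly 0, orthogonal vectors have a distance of 1, and vectors that point in opposite directions have a distance of 2. Smaller distances therefore mean closer matches.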

With <font color='blue'>collection.query()</font>, you’re not limited to single queries or single results:

In [23]:
query_results = collection.query(
    query_texts=["Teach me about history",
                 "What's going on in the world?"],
    include=["documents", "distances"],
    n_results=2)

In [24]:
query_results["documents"][0]

["Einstein's theory of relativity revolutionized our understanding of space and time.",
 'The American Revolution had a profound impact on the birth of the United States as a nation.']

In [25]:
query_results["distances"][0]

[0.6265882893831924, 0.6904192480258038]

In [26]:
query_results["documents"][1]

["Climate change poses a significant threat to the planet's ecosystems and biodiversity.",
 "Einstein's theory of relativity revolutionized our understanding of space and time."]

In [27]:
query_results["distances"][1]

[0.8002942768712199, 0.8882106605401324]

For the second query, the two most similar documents weren’t as strong a match as those for the first query. Recall that cosine distance is one minus cosine similarity, so a cosine distance of 0.80 corresponds to a cosine similarity of 0.20.
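When you send several queries at once, each key in the results holds one inner list per query, in the same order as `query_texts`. The sketch below copies the values from the two-query example above into a plain dictionary and flattens them into rows, converting each distance to a similarity along the way. The `rows` layout is just one possible way to organize the output:

```python
# Result values copied from the two-query example above.
query_texts = ["Teach me about history", "What's going on in the world?"]
query_results = {
    "documents": [
        [
            "Einstein's theory of relativity revolutionized our understanding of space and time.",
            "The American Revolution had a profound impact on the birth of the United States as a nation.",
        ],
        [
            "Climate change poses a significant threat to the planet's ecosystems and biodiversity.",
            "Einstein's theory of relativity revolutionized our understanding of space and time.",
        ],
    ],
    "distances": [
        [0.6265882893831924, 0.6904192480258038],
        [0.8002942768712199, 0.8882106605401324],
    ],
}

rows = []
for query, docs, dists in zip(
    query_texts, query_results["documents"], query_results["distances"]
):
    for doc, dist in zip(docs, dists):
        # Cosine similarity is one minus cosine distance.
        rows.append((query, doc, round(1.0 - dist, 2)))

for query, doc, similarity in rows:
    print(f"{similarity:.2f}  {query!r} -> {doc[:40]}...")
```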

<b><font color='darkred'>Note:</font></b>
> <b>Keep in mind that so-called similar documents returned from a semantic search over embeddings may not actually be relevant to the task that you’re trying to solve</b>. The success of a semantic search is somewhat subjective, and you or your stakeholders might not agree on the quality of the results.
> <br/><br/>
> <b>If there are no relevant documents in your collection for a given query, or your embedding algorithm wasn’t trained on the right or enough data, then your results might be poor</b>. It’s up to you to understand your application, your stakeholders’ expectations, and the limitations of your embedding algorithm and document collection.
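
One way to act on this caution is to discard results whose distance exceeds a cutoff before you use them. The following is a rough sketch in plain Python. The `filter_by_distance` helper and the cutoff values are assumptions made for illustration, and a sensible threshold depends on your own data and embedding model:

```python
MAX_DISTANCE = 0.8  # Hypothetical cutoff; tune it for your own data and model.


def filter_by_distance(documents, distances, max_distance=MAX_DISTANCE):
    """Keep only (document, distance) pairs at or below the cutoff."""
    return [
        (doc, dist)
        for doc, dist in zip(documents, distances)
        if dist <= max_distance
    ]


# Documents and distances from the "Teach me about history" query above.
documents = [
    "Einstein's theory of relativity revolutionized our understanding of space and time.",
    "The American Revolution had a profound impact on the birth of the United States as a nation.",
]
distances = [0.6265882893831924, 0.6904192480258038]

# A stricter cutoff of 0.65 keeps only the closest document.
kept = filter_by_distance(documents, distances, max_distance=0.65)
print(kept)
```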

<b>Another awesome feature of ChromaDB is the ability to filter queries on metadata</b>. To motivate this, suppose you want to find the single document that’s most related to music history. You might run this query:

In [28]:
collection.query(
    query_texts=["Teach me about music history"],
    n_results=1)

{'ids': [['id2']],
 'distances': [[0.7625820970012711]],
 'metadatas': [[{'genre': 'science'}]],
 'embeddings': None,
 'documents': [["Einstein's theory of relativity revolutionized our understanding of space and time."]],
 'uris': None,
 'data': None}

Your query is `Teach me about music history`, and the most similar document is `Einstein’s theory of relativity revolutionized our understanding of space and time`. While Einstein is a historical figure who was also a musician and teacher, this isn’t quite the result that you’re looking for. Because you’re specifically interested in music history, you can filter on the "genre" metadata field to restrict the search to more relevant documents:

In [29]:
collection.query(
    query_texts=["Teach me about music history"],
    where={"genre": {"$eq": "music"}},
    n_results=1)

{'ids': [['id9']],
 'distances': [[0.8186328079302339]],
 'metadatas': [[{'genre': 'music'}]],
 'embeddings': None,
 'documents': [["Beethoven's Symphony No. 9 is celebrated for its powerful choral finale, 'Ode to Joy.'"]],
 'uris': None,
 'data': None}

As you can see, the document about `Beethoven’s Symphony No. 9` is the most similar document. Of course, for this example, there’s only one document with the `music` genre. To make it slightly more difficult, you could filter on both `history` and `music`:

In [30]:
query_results = collection.query(
    query_texts=["Teach me about music history"],
    where={"genre": {"$in": ["music", "history"]}},
    n_results=2,
)

In [31]:
query_results["documents"]

[["Beethoven's Symphony No. 9 is celebrated for its powerful choral finale, 'Ode to Joy.'",
  'The American Revolution had a profound impact on the birth of the United States as a nation.']]

In [32]:
query_results["distances"]

[[0.8186328079302339, 0.8200413485985653]]

This query restricts the search to documents that have either a `music` or `history` genre, as specified by `where={"genre": {"$in": ["music", "history"]}}`. As you can see, the Beethoven document is still the most similar, while the American Revolution document is a close second. These were straightforward filtering examples on a single metadata field, but ChromaDB also supports [**other filtering operations**](https://docs.trychroma.com/usage-guide#:~:text=Filtering%20metadata%20supports%20the%20following%20operators%3A) that you might need.
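To build intuition for how a `where` filter narrows the candidate documents, here's a small plain-Python model of the matching step. It's illustrative only: `matches` is a hypothetical function that handles the `$eq`, `$ne`, `$in`, and `$nin` operators on individual fields, and it doesn't reflect how ChromaDB evaluates filters internally:

```python
def matches(metadata, where):
    """Return True if a metadata dict satisfies a simple `where` filter.

    Illustrative only: handles the $eq, $ne, $in, and $nin operators on
    individual fields, and is not how ChromaDB evaluates filters internally.
    """
    operators = {
        "$eq": lambda value, target: value == target,
        "$ne": lambda value, target: value != target,
        "$in": lambda value, target: value in target,
        "$nin": lambda value, target: value not in target,
    }
    for field, condition in where.items():
        for op, target in condition.items():
            if not operators[op](metadata.get(field), target):
                return False
    return True


genres = ["technology", "travel", "science", "food", "history",
          "fitness", "art", "climate change", "business", "music"]
metadatas = [{"genre": g} for g in genres]

where = {"genre": {"$in": ["music", "history"]}}
selected = [m["genre"] for m in metadatas if matches(m, where)]
print(selected)  # ['history', 'music']
```

Only documents that pass the filter are candidates for the similarity ranking, which is why the Einstein document no longer appears in the filtered results.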

### <b><font color='darkgreen'>Update documents</font></b>
<font size='3ptx'><b>If you want to update existing documents, embeddings, or metadata, then you can use <font color='blue'>collection.update()</font>.</b></font>


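This requires you to know the IDs of the data that you want to update. In this example, you’ll update both the documents and metadata for "id1" and "id2". Note that the IDs start at "id0", so "id1" and "id2" originally held the Bali and Einstein documents, and the update replaces their text and metadata: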

In [33]:
collection.update(
    ids=["id1", "id2"],
    documents=[
        "The new iPhone is awesome!",
        "Bali has beautiful beaches"],
    metadatas=[{"genre": "tech"}, {"genre": "beaches"}]
)

In [34]:
query_results = collection.get(ids=["id1", "id2"])

In [35]:
query_results["documents"]

['The new iPhone is awesome!', 'Bali has beautiful beaches']

In [36]:
query_results["metadatas"]

[{'genre': 'tech'}, {'genre': 'beaches'}]

### <b><font color='darkgreen'>Delete documents</font></b>
<font size='3ptx'><b>Lastly, if you want to delete any items in the collection, then you can use <font color='blue'>collection.delete()</font>.</b></font>

In [37]:
print(f'Before deletion, we have {collection.count()} document(s)!')

Before deletion, we have 10 document(s)!


The following snippet deletes the two documents with the IDs `id1` and `id2`:

In [38]:
collection.delete(ids=["id1", "id2"])

In [39]:
collection.get(["id1", "id2"])

{'ids': [],
 'embeddings': None,
 'metadatas': [],
 'documents': [],
 'uris': None,
 'data': None}

In [40]:
print(f'After deletion, we have {collection.count()} document(s)!')

After deletion, we have 8 document(s)!


You’ve now seen many of ChromaDB’s main features, and you can learn more with the [**getting started guide**](https://docs.trychroma.com/getting-started) or [**API cheat sheet**](https://docs.trychroma.com/api-reference). You used a collection of ten hand-crafted documents that allowed you to get familiar with ChromaDB’s syntax and querying functionality, but this was by no means a realistic use case. <b>In the next section, you’ll see ChromaDB shine while you embed and query over thousands of real-world documents</b>!

## <b><font color='darkblue'>Practical Example: Add Context for a Large Language Model (LLM)</font></b>
<b><font size='3ptx'>Vector databases are capable of storing all types of embeddings, such as text, audio, and images. However, as you’ve learned, ChromaDB was initially designed with text embeddings in mind, and it’s most often used to build LLM applications.</font></b>


<b>In this section, you’ll get hands-on experience using ChromaDB to provide context to OpenAI’s ChatGPT LLM</b>. To set the scene, you’re a software engineer who works on a popular repo, **["bt_test_common"](https://github.com/johnklee/bt_test_common)** (common utilities for BT testing). You want to use an LLM to help external users learn more about this repo and how to use the APIs that it provides.

You’re responsible for designing and implementing the back-end logic that answers these users’ questions. You’ll take the following steps:
1. <b>Create a ChromaDB collection that stores documents</b> along with associated metadata.
2. <b>Create a system that accepts a query, finds semantically similar documents</b>, and uses the similar documents as context to an LLM. The LLM will use the documents to answer the question posed in the query.

<b>This process of retrieving relevant documents and using them as context for a generative model is known as [retrieval-augmented generation](https://cloud.google.com/use-cases/retrieval-augmented-generation?hl=en) (RAG)</b>. This allows LLMs to make inferences using information that wasn’t included in their training dataset, and this is the most common way to apply ChromaDB in LLM applications.
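The generation half of RAG comes down to placing the retrieved documents in the prompt next to the user’s question. The sketch below shows one possible way to assemble such a prompt in plain Python. The `build_prompt` helper, the prompt wording, and the placeholder documents are all assumptions made for illustration, and they aren’t part of the ChromaDB or OpenAI APIs:

```python
def build_prompt(question, context_documents):
    """Combine retrieved documents and a question into one prompt string.

    Hypothetical helper; the prompt wording is an assumption, not a
    ChromaDB or OpenAI API.
    """
    context = "\n".join(f"- {doc}" for doc in context_documents)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Placeholder documents standing in for the results of collection.query().
retrieved = [
    "Placeholder document one about the repo.",
    "Placeholder document two about the repo.",
]
prompt = build_prompt("How do I use this repo?", retrieved)
print(prompt)
```

In a full pipeline, `retrieved` would come from the `documents` key of a `collection.query()` result, and the resulting string would be sent to the LLM.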