<a href="https://colab.research.google.com/github/harnalashok/LLMs/blob/main/chromadbExpts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Last amended: 31st May, 2024
# Objective: a. IF possible, transform chromadb dataset to csv
#            b. Export chromadb datasets to huggingface

## References:
> a. [Huggingface](https://huggingface.co/chromadb)    
> b. [GitHub-1](https://github.com/chroma-core/chroma_datasets/tree/master)    
> c. [GitHub-2](https://github.com/johnnycode8/chromadb_quickstart/tree/main)      
> d. [Analytics Vidhya](https://www.analyticsvidhya.com/blog/2023/07/guide-to-chroma-db-a-vector-store-for-your-generative-ai-llms/)    
> e. [chromadb cookbook](https://cookbook.chromadb.dev/)

In [None]:
# 1.0 chroma db is installed as a python package
#       not as a ubuntu application:

! pip install chromadb

Collecting chromadb
  Downloading chromadb-0.5.0-py3-none-any.whl (526 kB)
Collecting chroma-hnswlib==0.7.3 (from chromadb)
  Downloading chroma_hnswlib-0.7.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB)
Collecting fastapi>=0.95.2 (from chromadb)
  Downloading fastapi-0.111.0-py3-none-any.whl (91 kB)
Collecting uvicorn[standard]>=0.18.3 (from chromadb)
  Downloading uvicorn-0.30.0-py3-none-any.whl (62 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-3.5.0-py2.

## Why chromadb?
Chroma DB is an open-source vector database: it stores documents together with their vector embeddings so that items can be retrieved by semantic similarity. In outline, it works as follows:

- **Data Structure**: Chroma DB organizes data into collections. Each item in a collection has a unique ID, a vector embedding, and optionally the source document and metadata.
- **Storage**: It stores metadata as well. Alongside each embedding, the database keeps the original document and any metadata supplied as key-value pairs (for example, `{"source": "Car Book"}`).
- **Indexing**: Embeddings are indexed for approximate nearest-neighbor search (the `chroma-hnswlib` package pulled in by the install above provides the HNSW index). Metadata is stored in a form that can be filtered on.
- **Querying**: A query can include metadata. Users pass query text or query embeddings and get back the most similar items, optionally restricted by a metadata filter.
- **Analysis**: Similarity search over embeddings supports applications such as semantic search, retrieval-augmented generation (RAG) for LLMs, and recommendation.
- **Optimization**: Because of the index, a query does not have to be compared against every stored vector, which keeps retrieval fast as a collection grows.
- **Integration**: It is installed as a Python package and can be used from LLM frameworks such as LangChain and LlamaIndex.
- **Continued Improvement**: Chroma DB is under active development, so its API can change between releases. This notebook was run with version 0.5.0.


Chroma has two main types of clients. `PersistentClient` runs Chroma inside the local Python process and saves its data to a directory on disk. `HttpClient` connects to a Chroma server, which may be running on a remote machine.

In [None]:
# 1.0
# chroma cookbook: https://cookbook.chromadb.dev/core/tenants-and-databases/
# chroma db has Tenants: A tenant is a logical grouping of databases.
# chromadb has Databases: A database is a logical grouping of collections (tables).

import chromadb
from chromadb.config import DEFAULT_TENANT, DEFAULT_DATABASE, Settings

In [None]:
# 1.1 Create a localclient:
# https://cookbook.chromadb.dev/core/clients/#persistent-client

myclient = chromadb.PersistentClient(
                                        path="/content/",      # Where data will be stored
                                        settings=Settings(),   # Default settings
                                        tenant=DEFAULT_TENANT, # Default
                                        database=DEFAULT_DATABASE,
                                      )


A collection in a vector store is where we save a set of a) vector embeddings, b) documents, and c) any metadata, if present. A collection in a vector database can be thought of as a table in a relational database.

In [None]:
# 1.2 We will now create an empty collection and add documents to it.

mycollection = myclient.create_collection("my_information")




We will now add documents to this collection. Here we add three documents, each of which is just a single sentence. The first document is about cars, the second is about dogs, and the third is about four-wheelers.    
We also supply metadata, one dictionary per document.     
Every document needs a unique ID, so we assign id1, id2, and id3.     
All of these are passed as arguments to the collection's add() function.

In [None]:
# 1.3
mycollection.add(
                documents=["This is a document containing car information",
                "This is a document containing information about dogs",
                "This document contains four wheeler catalogue"],
                metadatas=[{"source": "Car Book"},{"source": "Dog Book"},{'source':'Vechile Info'}],
                ids=["id1", "id2", "id3"]
               )

/root/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:03<00:00, 24.1MiB/s]


### Query a Vector Store

We will pass a query and fetch the documents that are most relevant to it.

To query a vector store, a collection provides the `query()` function, which searches the vector database for relevant documents. Here we pass two parameters:

- `query_texts`: a list of queries for which we want the relevant documents.
- `n_results`: how many top results the database should return for each query. In our case we want the collection to return the 2 most relevant documents.

When we submit a query, Chroma DB converts it into a vector embedding using the collection's embedding function. We did not specify one, so the default model (all-MiniLM-L6-v2, downloaded when we added the documents) is used. Chroma DB then runs a nearest-neighbor search of the query embedding against the stored document embeddings. The query "Car" is closest to documents id1 and id3, hence we get the following result.
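The nearest-neighbor step can be sketched in plain Python. This is only an illustration, not Chroma's implementation: the three-dimensional vectors are made up (real all-MiniLM-L6-v2 embeddings have 384 dimensions), squared Euclidean distance is assumed as the metric, and the helper names `squared_l2` and `nearest` are hypothetical.

```python
# Toy embeddings invented for illustration; real ones come from an embedding model.
toy_vectors = {
    "id1": [0.9, 0.1, 0.0],  # car information
    "id2": [0.0, 0.9, 0.1],  # dogs
    "id3": [0.7, 0.0, 0.3],  # four wheeler catalogue
}


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query_vector, vectors, n_results):
    """Return the n_results (id, distance) pairs closest to query_vector."""
    scored = sorted(
        (squared_l2(query_vector, vec), doc_id) for doc_id, vec in vectors.items()
    )
    return [(doc_id, dist) for dist, doc_id in scored[:n_results]]


query_vector = [1.0, 0.0, 0.0]  # stands in for the embedding of "Car"
top_two = nearest(query_vector, toy_vectors, n_results=2)
print(top_two)  # ids come back ordered from closest to farthest
```

A real vector store replaces the exhaustive loop with an approximate index, but the idea of ranking stored vectors by distance to the query vector is the same.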

In [None]:
# 1.4
results = mycollection.query(
                            query_texts=["Car"],
                            n_results=2
                            )


# 1.4.1
print(results)

{'ids': [['id1', 'id3']], 'distances': [[1.0784918423508782, 1.5855050680528366]], 'metadatas': [[{'source': 'Car Book'}, {'source': 'Vechile Info'}]], 'embeddings': None, 'documents': [['This is a document containing car information', 'This document contains four wheeler catalogue']], 'uris': None, 'data': None}
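The printed result is a dictionary of nested lists: the outer list has one entry per query text, and the inner list has one entry per hit. A minimal sketch of unpacking it into one record per hit follows. The dictionary is copied from the output above so that the snippet runs on its own, and the helper name `flatten_hits` is hypothetical.

```python
# Copied from the printed query result so the snippet is self-contained.
results = {
    "ids": [["id1", "id3"]],
    "distances": [[1.0784918423508782, 1.5855050680528366]],
    "metadatas": [[{"source": "Car Book"}, {"source": "Vechile Info"}]],
    "embeddings": None,
    "documents": [[
        "This is a document containing car information",
        "This document contains four wheeler catalogue",
    ]],
    "uris": None,
    "data": None,
}


def flatten_hits(results, query_index=0):
    """Turn the parallel lists for one query into a list of per-hit dicts."""
    return [
        {"id": doc_id, "distance": distance, "metadata": metadata, "document": document}
        for doc_id, distance, metadata, document in zip(
            results["ids"][query_index],
            results["distances"][query_index],
            results["metadatas"][query_index],
            results["documents"][query_index],
        )
    ]


hits = flatten_hits(results)
for hit in hits:
    print(hit["id"], round(hit["distance"], 3), hit["metadata"]["source"])
```

A smaller distance means a closer match, which is why id1 (the car document) is listed before id3.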


Later, as more data arrives, it may become necessary to update the documents and vector embeddings already present in the vector store. To update data in Chroma DB, we do the following.

Previously, the document associated with id2 was about dogs. We now change it to be about cats. To update it in the vector store, we pass the document's ID, the new document text, and the new metadata to the collection's update() function. After this call, id2 refers to the Cats document instead of the Dogs document.

In [None]:
# 1.5
mycollection.update(
                    ids=["id2"],
                    documents=["This is a document containing information about Cats"],
                    metadatas=[{"source": "Cat Book"}],
                  )

In [None]:
# 1.6 Query the database again:

results = mycollection.query(
                              query_texts=["Felines"],
                              n_results=1
                            )

# 1.6.1
print(results)

{'ids': [['id2']], 'distances': [[0.9560528333913512]], 'metadatas': [[{'source': 'Cat Book'}]], 'embeddings': None, 'documents': [['This is a document containing information about Cats']], 'uris': None, 'data': None}


Sometimes, to save space or to remove unnecessary or unwanted information, we might want to delete some documents from a collection in the vector store.

In [None]:
# 1.7
mycollection.delete(ids = ['id1'])

# 1.7.1
results = mycollection.query(
                            query_texts=["Car"],
                            n_results=2
                            )

# 1.7.2
print(results)

{'ids': [['id3', 'id2']], 'distances': [[1.5855050680528366, 1.7295208916874063]], 'metadatas': [[{'source': 'Vechile Info'}, {'source': 'Cat Book'}]], 'embeddings': None, 'documents': [['This document contains four wheeler catalogue', 'This is a document containing information about Cats']], 'uris': None, 'data': None}
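The output above shows that query() returns the `n_results` closest items even when one of them (the Cats document) has little to do with the query. One possible way to handle this, shown here as a sketch and not as something the notebook itself does, is to drop hits whose distance exceeds a cutoff. The cutoff of 1.6 is made up for this example, and the helper name `ids_within` is hypothetical.

```python
# ids and distances copied from the printed result after the delete.
results = {
    "ids": [["id3", "id2"]],
    "distances": [[1.5855050680528366, 1.7295208916874063]],
}

MAX_DISTANCE = 1.6  # hypothetical cutoff; a sensible value depends on the data and model


def ids_within(results, max_distance, query_index=0):
    """Keep only the ids whose distance to the query is at most max_distance."""
    return [
        doc_id
        for doc_id, distance in zip(
            results["ids"][query_index], results["distances"][query_index]
        )
        if distance <= max_distance
    ]


kept = ids_within(results, MAX_DISTANCE)
print(kept)
```

With this cutoff only id3 (the four-wheeler catalogue) is kept; the Cats document is filtered out.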


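`count()` and `get()` functions: `count()` returns the number of items in a collection, and `get()` retrieves items without running a similarity query.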

In [None]:
# 2.0
new_collections = myclient.create_collection("new_collection")


new_collections.add(
                    documents=["This is Python Documentation",
                              "This is a Javascript Documentation",
                              "This document contains Flast API Cheatsheet"],
                    metadatas=[{"source": "Python For Everyone"},
                    {"source": "JS Docs"},
                    {'source':'Everything Flask'}],
                    ids=["id1", "id2", "id3"]
                  )


print(new_collections.count())
print(new_collections.get())

3
{'ids': ['id1', 'id2', 'id3'], 'embeddings': None, 'metadatas': [{'source': 'Python For Everyone'}, {'source': 'JS Docs'}, {'source': 'Everything Flask'}], 'documents': ['This is Python Documentation', 'This is a Javascript Documentation', 'This document contains Flast API Cheatsheet'], 'uris': None, 'data': None}
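Since one objective of this notebook is to transform a chromadb dataset to csv, here is a minimal sketch of writing a `get()`-style result to CSV with the standard library. The dictionary is copied from the output above so that the snippet runs without chromadb. The helper name `records_to_csv` and the choice of columns are assumptions, and embeddings are not exported.

```python
import csv
import io

# Copied from the printed get() result so the snippet is self-contained.
records = {
    "ids": ["id1", "id2", "id3"],
    "embeddings": None,
    "metadatas": [
        {"source": "Python For Everyone"},
        {"source": "JS Docs"},
        {"source": "Everything Flask"},
    ],
    "documents": [
        "This is Python Documentation",
        "This is a Javascript Documentation",
        "This document contains Flast API Cheatsheet",
    ],
}


def records_to_csv(records):
    """Write one CSV row per item: id, document text, and the 'source' metadata."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["id", "document", "source"])
    for doc_id, document, metadata in zip(
        records["ids"], records["documents"], records["metadatas"]
    ):
        # Metadata may be missing for an item, so fall back to an empty dict.
        writer.writerow([doc_id, document, (metadata or {}).get("source", "")])
    return buffer.getvalue()


csv_text = records_to_csv(records)
print(csv_text)
```

To write a file instead of a string, the same writer can be pointed at a file opened with `open(path, "w", newline="")`.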


In [None]:
########## DONE #######