## Investigation of Chroma DB

**Credit to the ChromaDB usage guide**: This investigation is based on the ChromaDB user guide. Much of the content (descriptions and code) is copied directly from the [Chroma Usage Guide](https://docs.trychroma.com/guides). Following the guide lets me modify the examples and test functionality as needed.

- https://docs.trychroma.com/guides
- https://docs.trychroma.com/deployment/auth#static-api-token-authentication

### Initiating a persistent Chroma Client

In [1]:
import chromadb
import os

In [2]:
db_path = '/mnt/c/ML/DU/local_rag_llm/prototype/jeff/3-vectorDB/db'

# Create the directory (and any missing parents) if it does not already exist
os.makedirs(db_path, exist_ok=True)

# os.environ['STORAGE_PATH'] = db_path

OSError: [Errno 30] Read-only file system: '/mnt'
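
The `OSError` above means the chosen path could not be created on this machine. A minimal sketch of one way to handle that, assuming a temporary directory is acceptable as a fallback for experiments. The helper name `ensure_db_dir` is hypothetical and not part of Chroma:

```python
import os
import tempfile


def ensure_db_dir(preferred: str) -> str:
    """Return a usable directory: `preferred` if it can be created,
    otherwise a fresh temporary directory (hypothetical helper)."""
    try:
        # exist_ok=True makes this safe to call repeatedly
        os.makedirs(preferred, exist_ok=True)
        return preferred
    except OSError:
        # Covers read-only file systems, permission errors, bad parents, etc.
        return tempfile.mkdtemp(prefix="chroma_db_")


# Demo with a path that is known to be writable
demo_root = tempfile.mkdtemp()
demo_path = ensure_db_dir(os.path.join(demo_root, "db"))
print(demo_path)
```

Data written to a temporary fallback directory is not stored where the notebook originally intended, so this only suits throwaway experiments.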

In [None]:
# chromadb.PersistentClient is the documented way to create an on-disk client;
# no separate API implementation object is needed.
# (An earlier attempt to construct chromadb.api.fastapi.FastAPI() by hand failed with
# "TypeError: FastAPI.__init__() missing 1 required positional argument: 'system'".)
client = chromadb.PersistentClient(path=db_path)

The client object has a few useful convenience methods.

In [None]:
# returns a nanosecond heartbeat. Useful for making sure the client remains connected.
client.heartbeat()

NameError: name 'client' is not defined

In [None]:
# Empties and completely resets the database. WARNING: This is destructive and not reversible.

#client.reset() 

## Running Chroma in client/server mode

### Prerequisites

Note: instead of running the server from Python, we will use a Docker container. Run the following Docker command.

- expose the server on port `8200`

`docker run -d --name chromadb -v C:/ML/DU/local_rag_llm/db:/chroma/chroma -p 8200:8000 -e IS_PERSISTENT=TRUE -e ANONYMIZED_TELEMETRY=TRUE chromadb/chroma:latest`

Once the container is running, access the ChromaDB API documentation at: http://localhost:8200

**Note:** Chroma also provides an async HTTP client.  For more details: https://docs.trychroma.com/guides

### Docker options:

You can also create a `.chroma_env` file setting the required environment variables and pass it to the Docker container with the `--env-file` flag when running the container.  This will be useful when adding authentication.

`docker run -d --name chromadb -v C:/ML/DU/local_rag_llm/db:/chroma/chroma -p 8200:8000 --env-file ./.chroma_env chromadb/chroma:latest`

where the `.chroma_env` file contains:
- `IS_PERSISTENT=TRUE`
- `ANONYMIZED_TELEMETRY=TRUE`

Docker flags:
- `-v` mounts the local directory `C:/ML/DU/local_rag_llm/db` at the container directory `/chroma/chroma`
- `-p` publishes container port `8000` on host port `8200`
- `-d` runs the container detached (returns to the terminal prompt)
- `--name` sets the name of the container; when omitted, Docker assigns a random name

### Using the Python HTTP-only client

If you are running Chroma in client-server mode, you may not need the full Chroma library. Instead, you can use the lightweight client-only library. In this case, you can install the chromadb-client package. This package is a lightweight HTTP client for the server with a minimal dependency footprint.

Use `pip install chromadb-client` instead of `pip install chromadb`.

In [None]:
# Chroma's API will run in client-server mode with just this change.

# NOTE: Requires server to be running on port 8200 before running this command.

chroma_client = chromadb.HttpClient(host='localhost', port=8200)
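
Because the client cell requires the server to be up first, a small retry loop around the heartbeat call can help right after starting the container. This is a sketch only: `wait_until_ready` is a hypothetical helper, and the `_FlakyServer` class is a stand-in so the example runs without a server.

```python
import time
from typing import Callable


def wait_until_ready(heartbeat: Callable[[], int],
                     attempts: int = 5,
                     delay: float = 1.0) -> int:
    """Call `heartbeat` until it succeeds or `attempts` is exhausted."""
    last_error = None
    for attempt in range(attempts):
        try:
            return heartbeat()
        except Exception as exc:  # connection errors differ by HTTP library
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(delay)
    raise ConnectionError(
        f"server not reachable after {attempts} attempts"
    ) from last_error


class _FlakyServer:
    """Stand-in for a client whose first few heartbeats fail."""

    def __init__(self, failures: int):
        self.failures = failures
        self.calls = 0

    def heartbeat(self) -> int:
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("not up yet")
        return 1721915287858657712


fake = _FlakyServer(failures=2)
result = wait_until_ready(fake.heartbeat, attempts=5, delay=0.0)
print(result, fake.calls)
```

With the real client from the cell above, the call would be `wait_until_ready(chroma_client.heartbeat)`.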

In [None]:
# Try the heartbeat for the server.
chroma_client.heartbeat()

1721915287858657712
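
The heartbeat value is easier to read as a timestamp. A minimal sketch, assuming the number is nanoseconds since the Unix epoch (the guide describes it only as a "nanosecond heartbeat"); `heartbeat_to_datetime` is a hypothetical helper name:

```python
from datetime import datetime, timezone


def heartbeat_to_datetime(heartbeat_ns: int) -> datetime:
    """Convert a nanosecond epoch value to a timezone-aware UTC datetime."""
    seconds, nanos = divmod(heartbeat_ns, 1_000_000_000)
    # datetime only stores microseconds, so the last three digits are dropped
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(
        microsecond=nanos // 1000
    )


ts = heartbeat_to_datetime(1721915287858657712)
print(ts.isoformat())
```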

## Using Collections

Chroma lets you manage collections of embeddings, using the collection primitive.

Chroma uses collection names in the URL, so there are a few restrictions on naming them:

- The length of the name must be between 3 and 63 characters.
- The name must start and end with a lowercase letter or a digit, and it can contain dots, dashes, and underscores in between.
- The name must not contain two consecutive dots.
- The name must not be a valid IP address.
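
The naming rules above can be checked locally before calling the server. This is a sketch of the four listed rules only, not Chroma's own validation code. It assumes the interior characters are limited to lowercase letters, digits, dots, dashes, and underscores; the server's actual check may differ. `is_valid_collection_name` is a hypothetical helper.

```python
import ipaddress
import re

# First and last character: lowercase letter or digit; 3 to 63 characters total
_NAME_PATTERN = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")


def is_valid_collection_name(name: str) -> bool:
    """Apply the four naming rules listed above (assumed character set)."""
    if not _NAME_PATTERN.match(name):
        return False
    if ".." in name:  # no two consecutive dots
        return False
    try:
        ipaddress.IPv4Address(name)  # must not parse as an IP address
        return False
    except ValueError:
        return True


for candidate in ["my_collection", "ab", "a..b", "192.168.0.1"]:
    print(candidate, is_valid_collection_name(candidate))
```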

Chroma collections are created with a `name` and an optional `embedding function`. If you supply an embedding function, you must supply it every time you get the collection.

### Embeddings

The embedding function takes text as input and performs tokenization and embedding. If no embedding function is supplied, Chroma uses the Sentence Transformers `all-MiniLM-L6-v2` model to create embeddings. This model can create sentence and document embeddings that can be used for a wide variety of tasks. The embedding function runs locally on your machine and may require you to download the model files (this happens automatically).

See: https://docs.trychroma.com/guides/embeddings to use other sentence transformer embeddings or create a custom embedding.

Embedding functions can be linked to a collection and used whenever you call `add`, `update`, `upsert` or `query`. You can also use them directly which can be handy for debugging.
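
To make the idea of calling an embedding function directly concrete, here is a toy stand-in that runs without any model download. It assumes the interface is a callable that takes a list of strings (parameter named `input`) and returns one vector per string, which is how the embeddings guide describes custom functions. The class `HashingEmbeddingFunction` is hypothetical and its vectors carry no semantic meaning; it is only useful for inspecting shapes while debugging.

```python
import hashlib
import math
from typing import List


class HashingEmbeddingFunction:
    """Toy embedding: hash each token into a bucket, then L2-normalize."""

    def __init__(self, dimensions: int = 8):
        self.dimensions = dimensions

    def __call__(self, input: List[str]) -> List[List[float]]:
        return [self._embed(text) for text in input]

    def _embed(self, text: str) -> List[float]:
        vector = [0.0] * self.dimensions
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            vector[digest[0] % self.dimensions] += 1.0
        norm = math.sqrt(sum(v * v for v in vector))
        # Empty text has no tokens, so it stays the zero vector
        return [v / norm for v in vector] if norm else vector


toy_ef = HashingEmbeddingFunction(dimensions=8)
vectors = toy_ef(["reinforcement learning", "deep reinforcement learning"])
print(len(vectors), len(vectors[0]))
```

A real embedding function (such as the sentence transformer in the next cell) is called the same way, with a list of texts.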

In [None]:
from chromadb.utils import embedding_functions

# Use a different sentence transformer: all-mpnet-base-v2
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-mpnet-base-v2")

ValueError: The sentence_transformers python package is not installed. Please install it with `pip install sentence_transformers`

In [None]:
# Create a collection
#chroma_client.delete_collection(name="my_collection")
collection = chroma_client.create_collection(name="my_collection", embedding_function=sentence_transformer_ef)


In [None]:
# Get a collection
collection = chroma_client.get_collection(name="my_collection", embedding_function=sentence_transformer_ef)

### Collections have a few useful methods

Existing collections can be retrieved by name with `.get_collection`, and deleted with `.delete_collection`. You can also use `.get_or_create_collection` to get a collection if it exists, or create it if it doesn't.


In [None]:
collection = chroma_client.get_or_create_collection(name="test") # Get a collection object from an existing collection, by name. If it doesn't exist, create it.
collection = chroma_client.get_collection(name="test") # Get a collection object from an existing collection, by name. Will raise an exception if it's not found.
#chroma_client.delete_collection(name="my_collection") # Delete a collection and all associated embeddings, documents, and metadata. ⚠️ This is destructive and not reversible

Other useful Collection methods

In [None]:
# See this used at the last two cells of this notebook
col_list = collection.peek() # returns the first 10 items in the collection (as a dictionary)
col_num = collection.count() # returns the number of items in the collection
# collection.modify(name="new_name") # Rename the collection



## Test

- Read in a list of text documents
- Add them to the collection
- Check that the collection has 6 entries
- Query the collection for the 2 best-matching documents
- Retrieve those documents
- Print the results and note the structure

In [None]:
# If you are repeating this test, first reset the collection by uncommenting the following lines.
# `collection` currently refers to the "test" collection, which uses the default embedding function.
# delete_collection takes only the collection name.

#chroma_client.delete_collection(name="test")
#collection = chroma_client.create_collection(name="test")


## Read list of documents from directory

- Read a list of text documents

In [None]:
# read list of .txt files from directory
files = []
for x in os.listdir('data'):
    if x.endswith(".txt"):
        files.append(x)

files

['applying RL to build a binanace trading bot.txt',
 'reinforcement learning DQN part 1.txt',
 'reinforcement learning and introduction part 4.txt',
 'reinforcement learning and introduction part 3.txt',
 'reinforcement learning and introduction part 2.txt',
 'reinforcement learning and introduction part 1.txt']

In [None]:
# copy files contents to a list
content_list = []   # file contents
id_list = []        # IDs, text of your choice

for i, file in enumerate(files):
    # Build the path portably and read the file as UTF-8 text
    with open(os.path.join('data', file), "r", encoding="utf-8") as f:
        content_list.append(f.read())
    id_list.append('id' + str(i))

In [None]:
content_list

['Our objective is to make a trading bot that trade cryptocurrency using state-of the-art reinforcment learning. To create our RL agents we will use the following technologies:\n\nPython\nReinforcment Learning\nOPenAI gym\nBinance\nYou don’t need any background in ML to understand the following articles, knowledge of Python will be enough. However, if a part is unclear do not hesitate to contact me.\n\nHere are the different parts of the creation of our bot, each part will be one article:\n\nWe will use Binance data to generate our data that we will customize for our need. All you need is a Binance account, you can create one by clicking here. It is free and one of best trading platform to find cryptocurrency data.\n\nAll of the code for this article will be available on my GitHub.\n\nSteps for generating data:\n\nAdd our Binance API keys\nPulling data from Binance\nStandardize our data as we wish\nSave it in a csv file\nAdd our Binance API keys\nAfter cloning the repository, we will n

In [None]:
id_list

['id0', 'id1', 'id2', 'id3', 'id4', 'id5']
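
The file-reading cells above can be folded into one helper that returns the three parallel lists `collection.add` expects. A sketch under stated assumptions: `load_text_documents` is a hypothetical name, files are read as UTF-8, and paths are sorted so IDs are stable between runs (the cells above use the unsorted `os.listdir` order). The demo writes throwaway files to a temporary directory rather than the notebook's `data` folder.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple


def load_text_documents(
    directory: str,
) -> Tuple[List[str], List[Dict[str, str]], List[str]]:
    """Return (documents, metadatas, ids) for every .txt file in `directory`."""
    paths = sorted(Path(directory).glob("*.txt"))
    documents = [p.read_text(encoding="utf-8") for p in paths]
    metadatas = [{"source": p.name} for p in paths]  # file name as the source
    ids = [f"id{i}" for i in range(len(paths))]
    return documents, metadatas, ids


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "b.txt").write_text("second", encoding="utf-8")
    Path(tmp, "a.txt").write_text("first", encoding="utf-8")
    Path(tmp, "notes.md").write_text("ignored", encoding="utf-8")
    documents, metadatas, ids = load_text_documents(tmp)

print(ids, metadatas)
```

The three lists line up by position, so they can be passed as `documents=`, `metadatas=`, and `ids=` in a single `add` call.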

Add Documents to a collection

In [None]:
# Add documents to the collection
collection.add(
    documents=content_list,
    # One metadata dict per document; "source" holds the file name (source name) of the text being added to the DB
    metadatas=[{"source": file} for file in files],
    ids=id_list
)

/Users/ashwinikumar/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:13<00:00, 6.29MiB/s]


Verify there are 6 items in the collection


In [None]:
print(collection.count())

6


Query the Collection

In [None]:
results = collection.query(
    query_texts=[
        "This is a query about machine learning and data science"
    ],
    n_results=2
)

print(results)

{'ids': [['id1', 'id5']], 'distances': [[1.4940412090843298, 1.6182181858429838]], 'embeddings': None, 'metadatas': [[{'source': 'reinforcement learning DQN part 1.txt'}, {'source': 'reinforcement learning and introduction part 1.txt'}]], 'documents': [['Hi and welcome to the intro series on Reinforcement Learning, today’s topic will be about the DQN algorithm!\n\n\nThis blog post is part of a longer series about Reinforcement Learning (RL). If you are completely unfamiliar with RL, I suggest you read my previous blog posts first.\n\nPreviously we talked about policy gradient algorithms. Today we will have a look at a different family of RL algorithms: Q-learning algorithms. And more specifically, we will focus on the vanilla DQN-algorithm.\n\nThe topics for today’s blog post are:\n\nHistorical significance of DQN\nWhat is Q-Learning?\nDQN Explained\nQ-Learning VS. Policy Gradients\nThere will also be a part 2 for today’s blog post, which will include a basic implementation of DQN.\n\n

Pretty-print the result. Note that the data structure is a dictionary.

In [None]:
import pprint
pprint.pprint(results)

{'data': None,
 'distances': [[1.4940412090843298, 1.6182181858429838]],
 'documents': [['Hi and welcome to the intro series on Reinforcement Learning, '
                'today’s topic will be about the DQN algorithm!\n'
                '\n'
                '\n'
                'This blog post is part of a longer series about Reinforcement '
                'Learning (RL). If you are completely unfamiliar with RL, I '
                'suggest you read my previous blog posts first.\n'
                '\n'
                'Previously we talked about policy gradient algorithms. Today '
                'we will have a look at a different family of RL algorithms: '
                'Q-learning algorithms. And more specifically, we will focus '
                'on the vanilla DQN-algorithm.\n'
                '\n'
                'The topics for today’s blog post are:\n'
                '\n'
                'Historical significance of DQN\n'
                'What is Q-Learning?\n'
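
Each key in the result dictionary holds a list of lists, with one inner list per query text. A small sketch of flattening that into one row per hit, using the IDs, distances, and sources printed above as sample data. `flatten_query_results` is a hypothetical helper, not a Chroma method.

```python
from typing import Any, Dict, List


def flatten_query_results(results: Dict[str, Any],
                          query_index: int = 0) -> List[Dict[str, Any]]:
    """Turn the nested lists for one query into a list of row dicts."""
    ids = results["ids"][query_index]
    distances = results["distances"][query_index]
    metadatas = results["metadatas"][query_index]
    return [
        {"id": i, "distance": d, "source": (m or {}).get("source")}
        for i, d, m in zip(ids, distances, metadatas)
    ]


# Sample values copied from the printed query output above (documents omitted)
sample = {
    "ids": [["id1", "id5"]],
    "distances": [[1.4940412090843298, 1.6182181858429838]],
    "metadatas": [[
        {"source": "reinforcement learning DQN part 1.txt"},
        {"source": "reinforcement learning and introduction part 1.txt"},
    ]],
}

rows = flatten_query_results(sample)
for row in rows:
    print(row)
```

With the real query, `flatten_query_results(results)` would give the same shape for the first (and only) query text.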
           

Collections have a few useful convenience methods.

In [None]:
col_list = collection.peek() # returns the first 10 items in the collection (as a dictionary)
col_num = collection.count() # returns the number of items in the collection
# collection.modify(name="new_name") # Rename the collection

In [None]:
print(col_list)
print(f'\n{col_num}')


{'ids': ['id0', 'id1', 'id2', 'id3', 'id4', 'id5'], 'embeddings': [[-0.06983184069395065, -0.06972957402467728, -0.09745623916387558, 0.05306597054004669, -0.0515860840678215, -0.015564018860459328, 0.016165276989340782, -0.009612586349248886, -0.06355269253253937, 0.01641022600233555, 0.027449922636151314, -0.10778404027223587, 0.06536437571048737, -0.006359830964356661, 0.11691190302371979, 0.05337958037853241, 0.010832883417606354, 0.04060027748346329, -0.05769278109073639, -0.05000103637576103, 0.04764392971992493, -0.020905539393424988, 0.009941269643604755, -0.0009177944157272577, 0.005550786387175322, -0.014407220296561718, 0.03488421067595482, 0.00801693182438612, -0.015840968117117882, -0.023544764146208763, 0.06425171345472336, -0.0007424064096994698, 0.04046451672911644, 0.03580328822135925, 0.04099864140152931, 0.050649888813495636, 0.005459611304104328, 0.0006583072245121002, 0.025315532460808754, -0.00743169104680419, -0.024391919374465942, -0.05828355625271797, -0.065522