## Using Chroma DB

**Credit to the ChromaDB usage guide**: This investigation was built using the ChromaDB user guide. Much of the content (descriptions and code) is copied directly from the [Chroma Usage Guide](https://docs.trychroma.com/guides). Working from it lets me follow a good guide, make modifications, and test functionality as needed.

- https://docs.trychroma.com/guides
- https://docs.trychroma.com/deployment/auth#static-api-token-authentication
- https://cookbook.chromadb.dev/
  
`pip install chromadb`

### Embeddings

The embedding function takes text as input and performs tokenization and embedding. If no embedding function is supplied, Chroma uses the Sentence Transformers `all-MiniLM-L6-v2` model to create embeddings. This model can create sentence and document embeddings that are usable for a wide variety of tasks. The embedding function runs locally on your machine and may require you to download the model files (this happens automatically).

See: https://docs.trychroma.com/guides/embeddings to use other sentence transformer embeddings or create a custom embedding.

Embedding functions can be linked to a collection and used whenever you call `add`, `update`, `upsert` or `query`. You can also use them directly which can be handy for debugging.
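
As a minimal sketch of calling an embedding function directly, the toy class below mimics the shape of a custom embedding function: a callable that takes a list of strings and returns one vector per string. `ToyHashEmbeddingFunction` is a hypothetical name and its hashing scheme is only a stand-in for a real model; the parameter name `input` is an assumption based on Chroma's custom embedding interface, so check the embeddings guide linked above for the exact protocol.

```python
import hashlib
import math


class ToyHashEmbeddingFunction:
    """Hypothetical stand-in for a real embedding model (illustration only)."""

    def __init__(self, dim=8):
        self.dim = dim

    def __call__(self, input):
        # `input` is a list of strings; return one vector per string.
        vectors = []
        for text in input:
            vec = [0.0] * self.dim
            for token in text.lower().split():
                # hashlib gives the same bucket on every run (unlike built-in hash()).
                digest = hashlib.md5(token.encode("utf-8")).digest()
                vec[digest[0] % self.dim] += 1.0
            # L2-normalize so every non-empty text maps to a unit-length vector.
            norm = math.sqrt(sum(v * v for v in vec)) or 1.0
            vectors.append([v / norm for v in vec])
        return vectors


toy_ef = ToyHashEmbeddingFunction(dim=8)
vectors = toy_ef(["reinforcement learning", "vector databases store embeddings"])
print(len(vectors), len(vectors[0]))  # prints: 2 8
```

A real embedding function should be callable in the same way, which makes it easy to inspect the number of vectors and their dimension while debugging.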

#### Using another embedding method

In [1]:
from chromadb.utils import embedding_functions

modelPath = "/mnt/c/ML/DU/local_rag_llm/models/sentence-transformers/all-MiniLM-L6-v2"

# Load a sentence transformer from a local path (here, a local copy of all-MiniLM-L6-v2)
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=modelPath)



## Instantiating a Chroma Client

A ChromaDB client can either run Chroma locally or connect to a remote Chroma server over HTTP.

**NOTE:** Once the client is created, collections, embeddings, and the other client methods behave the same whether the client is local or connected to a ChromaDB HTTP server.

### Instantiating a local persistent Chroma Client


In [2]:
import chromadb
import os

In [3]:
db_path = '/mnt/c/ML/DU/local_rag_llm/prototype/jeff/3-vectorDB/db'

# Create the database directory if it does not already exist
os.makedirs(db_path, exist_ok=True)

# os.environ['STORAGE_PATH'] = db_path

In [4]:
chroma_client = chromadb.PersistentClient(path=db_path)

### Running a ChromaDB web server with an API

#### Prerequisites

Note: Instead of running the server from Python, we will use a Docker container. Run the following Docker command.

- expose the server on port `8200`

`docker run -d --name chromadb -v C:/ML/DU/local_rag_llm/db:/chroma/chroma -p 8200:8000 -e IS_PERSISTENT=TRUE -e ANONYMIZED_TELEMETRY=TRUE chromadb/chroma:latest`

Once the container is running, access the ChromaDB API documentation at: http://localhost:8200

**Note:** Chroma also provides an async HTTP client.  For more details: https://docs.trychroma.com/guides

### Instantiating ChromaDB web client using the Web Server API

If you are running Chroma in client-server mode, you may not need the full Chroma library. Instead, you can use the lightweight client-only library. In this case, you can install the chromadb-client package. This package is a lightweight HTTP client for the server with a minimal dependency footprint.

Use `pip install chromadb-client` instead of `pip install chromadb`.

In [5]:
# Chroma's API will run in client-server mode with just this change.

# NOTE: Requires server to be running on port 8200 before running this command.

import chromadb

client = chromadb.HttpClient(host='localhost', port=8200)

In [6]:
# Try the heartbeat for the server.
client.heartbeat()

1722447160594275564
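
The heartbeat value is a timestamp in nanoseconds since the Unix epoch, so it can be converted to a readable time. A minimal sketch using the value shown above (the variable names are introduced here for illustration):

```python
from datetime import datetime, timezone

heartbeat_ns = 1722447160594275564  # value returned by client.heartbeat() above

# Integer-divide to whole seconds to avoid float rounding on a 19-digit number.
heartbeat_time = datetime.fromtimestamp(heartbeat_ns // 1_000_000_000, tz=timezone.utc)
print(heartbeat_time.isoformat())  # prints: 2024-07-31T17:32:40+00:00
```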

In [7]:
client.get_version() 

'0.5.5'

In [8]:
client.list_collections() 

[Collection(id=dd394da5-2c20-4bce-8374-89a0ce611153, name=ML_doc_collection)]

## Using Collections

Chroma lets you manage collections of embeddings, using the collection primitive.

Chroma uses collection names in the URL, so there are a few restrictions on naming them:

- The length of the name must be between 3 and 63 characters.
- The name must start and end with a lowercase letter or a digit, and it can contain dots, dashes, and underscores in between.
- The name must not contain two consecutive dots.
- The name must not be a valid IP address.
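
The rules above can be sketched as a small pre-check before calling `create_collection`. `is_valid_collection_name` is a hypothetical helper, not part of the Chroma API, and it implements the rules only as they are written here. Chroma's own validation may differ: the `ML_doc_collection` name in the listing above starts with an uppercase letter.

```python
import ipaddress
import re

# Lowercase letter or digit at both ends; dots, dashes, and underscores allowed in between.
_NAME_PATTERN = re.compile(r"[a-z0-9][a-z0-9._-]*[a-z0-9]")


def is_valid_collection_name(name):
    """Check a name against the four rules listed above (hypothetical helper)."""
    if not 3 <= len(name) <= 63:
        return False
    if not _NAME_PATTERN.fullmatch(name):
        return False
    if ".." in name:
        return False
    try:
        ipaddress.ip_address(name)
    except ValueError:
        return True  # not an IP address, so the name passes every rule
    return False


for candidate in ["my_collection", "ab", "my..collection", "192.168.1.1", "_private"]:
    print(candidate, is_valid_collection_name(candidate))
```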

Chroma collections are created with a `name` and an optional `embedding function`. If you supply an embedding function, you must supply it every time you get the collection.

References:
- https://cookbook.chromadb.dev/core/collections/


### Creating a `collection`

When creating a collection, it is often best to check whether the collection already exists.

**If the goal is to start from a clean collection**, delete and recreate the collection.

In [9]:
# Function to check if a collection exists
def collection_exists(collections, collection_name):
    return any(collection.name == collection_name for collection in collections)

collections = client.list_collections()

# Delete the existing collection (if any) so that we start from a clean one
if collection_exists(collections, "my_collection"):
    client.delete_collection(name="my_collection")

collection = client.create_collection(
    name="my_collection",
    embedding_function=sentence_transformer_ef)

**If the goal is to use the existing collection**, or create one if it does not exist:

In [10]:
# Get a collection object from an existing collection, by name. If it doesn't exist, create it.
collection = client.get_or_create_collection(name="my_collection", embedding_function=sentence_transformer_ef) 

### Working with **all collections** on the client

- Additional operations are available here: [Collection Utilities](https://cookbook.chromadb.dev/core/collections/#collection-utilities)

In [11]:
# Count Collections
client.count_collections()

2

In [12]:
#List all Collections
collections = client.list_collections()

collections

[Collection(id=588ec510-0b68-4c99-884d-5738768d1710, name=my_collection),
 Collection(id=dd394da5-2c20-4bce-8374-89a0ce611153, name=ML_doc_collection)]

In [13]:
collections[0].name

'my_collection'

### Some useful collection methods

- `create_collection`: Create a collection. Parameters: `name` and an optional `embedding_function`
- `get_collection`:  Get a collection object from an existing collection, by name. Will raise an exception if it's not found
- `get_or_create_collection`: Get a collection object from an existing collection, by name. If it doesn't exist, create it
- `delete_collection`: Delete a collection and all associated embeddings, documents, and metadata. ⚠️ This is destructive and not reversible

In [14]:
# Get a collection object from an existing collection, by name. If it doesn't exist, create it.
collection = client.get_or_create_collection(name="my_collection", embedding_function=sentence_transformer_ef) 

# Get a collection object from an existing collection, by name. Will raise an exception if it's not found.
collection = client.get_collection(name="my_collection", embedding_function=sentence_transformer_ef) 

# Delete a collection and all associated embeddings, documents, and metadata. ⚠️ This is destructive and not reversible
#client.delete_collection(name="my_collection") #

Other useful methods

- `list_collections`: List all the collections on the client
- `peek`: Return the first 10 items in the collection
- `count`: Return the number of items in the collection
- `modify`: Change the collection properties (`name` and `metadata`), separately or together

In [15]:
# List all collections
collections = client.list_collections()

# returns a list of the first 10 items in the collection
col_list = collection.peek() 

# returns the number of items in the collection
col_num = collection.count() 

# Rename the collection and update its metadata
# collection.modify(name="new_name", metadata={"key": "value"} ) 

## Test

- Read in a list of text documents
- Add them to the collection
- Check the collection has 6 entries
- Query the collection for 2 best fit documents
- Retrieve those documents
- Print the results and note the structure

**If you are repeating this test, delete and recreate the collection**

In [16]:
# If you are repeating this test, reset the collection by uncommenting the following lines

#client.delete_collection(name="my_collection")
#collection = client.create_collection(name="my_collection", embedding_function=sentence_transformer_ef)


### Read list of documents from directory

- Read a list of text documents

In [17]:
# read list of .txt files from directory
files = []
for x in os.listdir('data-ai'):
    if x.endswith(".txt"):
        files.append(x)

files

['applying RL to build a binanace trading bot.txt',
 'reinforcement learning and introduction part 1.txt',
 'reinforcement learning and introduction part 2.txt',
 'reinforcement learning and introduction part 3.txt',
 'reinforcement learning and introduction part 4.txt',
 'reinforcement learning DQN part 1.txt']

In [18]:
# copy files contents to a list
content_list = []   # file contents
id_list = []        # IDs, text of your choice

for i, file in enumerate(files):
    with open('data-ai/'+file, "r") as f:
        content_list.append( f.read() )
    id_list.append( 'id' + str(i))

In [19]:
content_list

['Our objective is to make a trading bot that trade cryptocurrency using state-of the-art reinforcment learning. To create our RL agents we will use the following technologies:\n\nPython\nReinforcment Learning\nOPenAI gym\nBinance\nYou don’t need any background in ML to understand the following articles, knowledge of Python will be enough. However, if a part is unclear do not hesitate to contact me.\n\nHere are the different parts of the creation of our bot, each part will be one article:\n\nWe will use Binance data to generate our data that we will customize for our need. All you need is a Binance account, you can create one by clicking here. It is free and one of best trading platform to find cryptocurrency data.\n\nAll of the code for this article will be available on my GitHub.\n\nSteps for generating data:\n\nAdd our Binance API keys\nPulling data from Binance\nStandardize our data as we wish\nSave it in a csv file\nAdd our Binance API keys\nAfter cloning the repository, we will n

In [20]:
id_list

['id0', 'id1', 'id2', 'id3', 'id4', 'id5']
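
The two cells above can be folded into one reusable helper. This is a sketch under a few assumptions: `load_text_files` is a hypothetical name, the files are read as UTF-8, and the file names are sorted so that IDs are assigned in a repeatable order (which may differ from the `os.listdir` order shown above).

```python
import os


def load_text_files(directory):
    """Return (documents, metadatas, ids) for every .txt file in `directory`."""
    # Sort the names so the same file gets the same ID on every run.
    names = sorted(n for n in os.listdir(directory) if n.endswith(".txt"))
    documents, metadatas, ids = [], [], []
    for i, name in enumerate(names):
        with open(os.path.join(directory, name), "r", encoding="utf-8") as f:
            documents.append(f.read())
        metadatas.append({"source": name})  # keep the file name as the source
        ids.append(f"id{i}")
    return documents, metadatas, ids
```

With the directory used in this notebook, the call would be `content_list, metadata_list, id_list = load_text_files('data-ai')`, where `metadata_list` is a name introduced here.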

Add Documents to a collection

In [21]:
# Add documents to the collection
collection.add(
    documents=content_list,
    # Each metadata entry holds the file name (source name) for the text being added to the DB
    metadatas=[{"source": file} for file in files],
    ids=id_list
)

Verify there are 6 items in the collection


In [22]:
print(collection.count())

6


Query the Collection

In [23]:
results = collection.query(
    query_texts=[
        "This is a query about machine learning and data science"
    ],
    n_results=2
)

print(results)

{'ids': [['id5', 'id1']], 'distances': [[1.4940424082213024, 1.618234589226665]], 'embeddings': None, 'metadatas': [[{'source': 'reinforcement learning DQN part 1.txt'}, {'source': 'reinforcement learning and introduction part 1.txt'}]], 'documents': [['Hi and welcome to the intro series on Reinforcement Learning, today’s topic will be about the DQN algorithm!\n\n\nThis blog post is part of a longer series about Reinforcement Learning (RL). If you are completely unfamiliar with RL, I suggest you read my previous blog posts first.\n\nPreviously we talked about policy gradient algorithms. Today we will have a look at a different family of RL algorithms: Q-learning algorithms. And more specifically, we will focus on the vanilla DQN-algorithm.\n\nThe topics for today’s blog post are:\n\nHistorical significance of DQN\nWhat is Q-Learning?\nDQN Explained\nQ-Learning VS. Policy Gradients\nThere will also be a part 2 for today’s blog post, which will include a basic implementation of DQN.\n\nN

Pretty-print the result. Note that the data structure is a dictionary.

In [24]:
import pprint
pprint.pprint(results)

{'data': None,
 'distances': [[1.4940424082213024, 1.618234589226665]],
 'documents': [['Hi and welcome to the intro series on Reinforcement Learning, '
                'today’s topic will be about the DQN algorithm!\n'
                '\n'
                '\n'
                'This blog post is part of a longer series about Reinforcement '
                'Learning (RL). If you are completely unfamiliar with RL, I '
                'suggest you read my previous blog posts first.\n'
                '\n'
                'Previously we talked about policy gradient algorithms. Today '
                'we will have a look at a different family of RL algorithms: '
                'Q-learning algorithms. And more specifically, we will focus '
                'on the vanilla DQN-algorithm.\n'
                '\n'
                'The topics for today’s blog post are:\n'
                '\n'
                'Historical significance of DQN\n'
                'What is Q-Learning?\n'
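
The result dictionary is column-oriented: each key holds a list with one inner list per query text. A minimal sketch of turning it into one record per hit follows. `flatten_query_results` is a hypothetical helper, and `sample` is a shortened stand-in shaped like the output above, with placeholder document text.

```python
def flatten_query_results(results, query_index=0):
    """Turn the column-oriented query result into one dict per hit."""
    ids = results["ids"][query_index]
    distances = results["distances"][query_index]
    metadatas = results["metadatas"][query_index]
    documents = results["documents"][query_index]
    return [
        {"id": i, "distance": d, "source": m["source"], "preview": doc[:60]}
        for i, d, m, doc in zip(ids, distances, metadatas, documents)
    ]


# Sample shaped like the output above; the document text is placeholder data.
sample = {
    "ids": [["id5", "id1"]],
    "distances": [[1.4940424082213024, 1.618234589226665]],
    "metadatas": [[{"source": "reinforcement learning DQN part 1.txt"},
                   {"source": "reinforcement learning and introduction part 1.txt"}]],
    "documents": [["Hi and welcome to the intro series on Reinforcement Learning, ...",
                   "(text of part 1)"]],
}

for hit in flatten_query_results(sample):
    print(hit["id"], round(hit["distance"], 3), hit["source"])
```

In the real notebook the helper would be called as `flatten_query_results(results)`. A smaller distance means a closer match, so the first hit is the best fit.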
            

Collections have a few useful convenience methods.

In [25]:
col_list = collection.peek() # returns a list of the first 10 items in the collection
col_num = collection.count() # returns the number of items in the collection
# collection.modify(name="new_name") # Rename the collection

In [26]:
print(col_list)
print(f'\n{col_num}')


{'ids': ['id0', 'id1', 'id2', 'id3', 'id4', 'id5'], 'embeddings': [[-0.06983666121959686, -0.06973139196634293, -0.09745301306247711, 0.05306638777256012, -0.05158229544758797, -0.015561041422188282, 0.016167953610420227, -0.009614809416234493, -0.0635523647069931, 0.016410304233431816, 0.027450721710920334, -0.10778401792049408, 0.065365731716156, -0.006362806539982557, 0.11691081523895264, 0.05337867513298988, 0.01083291508257389, 0.040601953864097595, -0.057694628834724426, -0.05000334978103638, 0.047643452882766724, -0.02090711146593094, 0.009937060065567493, -0.0009169140830636024, 0.005553410854190588, -0.0144102917984128, 0.03488175943493843, 0.008013769052922726, -0.01584082469344139, -0.02354278229176998, 0.06425479799509048, -0.0007492083823308349, 0.04046362265944481, 0.03580661490559578, 0.0409976951777935, 0.050650931894779205, 0.005449227057397366, 0.0006532143452204764, 0.02531011775135994, -0.007431329693645239, -0.024389605969190598, -0.058280061930418015, -0.065518938