# Playground for learning about vector databases

I'm working to learn how large language models (LLMs) combined with retrieval-augmented generation (RAG) can improve existing processes and work pipelines by offloading repetitive tasks. Part of expanding my knowledge of LLMs and RAG is being able to generate a body of knowledge specific to the process or work pipeline I'm targeting at the time.

To stay focused on getting to a working RAG setup, and to avoid being swallowed by the vast amount of information on the topic or by alternatives like knowledge graphs, I've decided to begin my exploration with [Chroma](https://docs.trychroma.com/docs/overview/introduction). Chroma is an open-source (Apache 2.0) vector database that, from what I can tell, provides a simple Python SDK and CLI that I can use to build out my knowledge base.

## Getting Started

Before running any of the code in this notebook, make sure the following requirements are satisfied:

1. The required dependencies are installed (`poetry install`) inside a virtual environment (`pyenv`).
2. A Chroma instance is running (`chroma run`).

## Install and Import Dependencies

The code in this notebook only requires the `chromadb` package. To make the output easier to read, I've also decided to use the `json` module from the Python standard library.

In [None]:
%pip install chromadb

import chromadb
import json

## Connect to the running Chroma instance

After running the `chroma run` command in the virtual environment, an instance of Chroma should be available at http://localhost:8000.


In [6]:
chroma_client = chromadb.HttpClient(host='localhost', port=8000)
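
Before going further, it can help to confirm that the server is actually reachable. The following is a minimal sketch, assuming the client's `heartbeat()` method raises an exception when the server cannot be reached. The `wait_for_chroma` helper, its retry count, and its delay are my own illustrative choices and are not part of Chroma. Depending on the `chromadb` version, creating the client may itself fail if the server is down, so this check is mostly useful for re-verifying the connection later in the notebook.

```python
import time


def wait_for_chroma(client, attempts=5, delay_seconds=1.0):
    """Return True once client.heartbeat() succeeds, False if every attempt fails.

    `client` is expected to be the chromadb client created above. This helper
    is hypothetical and only relies on the client's heartbeat() method.
    """
    for attempt in range(1, attempts + 1):
        try:
            client.heartbeat()  # assumption: raises if the server is unreachable
            return True
        except Exception as error:  # broad on purpose: the exception type may vary
            print(f"Attempt {attempt}/{attempts} failed: {error}")
            if attempt < attempts:
                time.sleep(delay_seconds)  # wait before trying again
    return False
```

In a notebook cell this could be called as `wait_for_chroma(chroma_client)` right after creating the client.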

## Create a collection to store our documents

In [9]:
collection = chroma_client.create_collection(name="documents", get_or_create=True)

## Add static sample documents to the collection

In [13]:
# Remove any existing records with these IDs so the cell can be re-run cleanly
collection.delete(
    ids=["id1", "id2"],
)

collection.add(
    ids=["id1", "id2"],
    documents=[
        "Title of a book to be processed",
        "Title of another book to be processed"
    ]
)
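
The delete-then-add pattern above keeps the cell re-runnable. An alternative sketch uses `collection.upsert`, which adds new IDs and overwrites existing ones in a single call. The `seed_documents` helper and its dictionary input are hypothetical names I've chosen for illustration.

```python
def seed_documents(collection, documents_by_id):
    """Insert or overwrite documents so the cell can be re-run safely.

    `documents_by_id` maps an ID string to its document text (a hypothetical
    shape chosen for this sketch). Returns the list of IDs that were written.
    """
    ids = list(documents_by_id.keys())
    documents = [documents_by_id[doc_id] for doc_id in ids]
    # upsert adds records for new IDs and replaces records for existing IDs
    collection.upsert(ids=ids, documents=documents)
    return ids
```

With the sample data from this notebook, the call would look like `seed_documents(collection, {"id1": "Title of a book to be processed", "id2": "Title of another book to be processed"})`.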

## Query the documents collection for existing records

In [14]:
results = collection.query(
    query_texts=["This is a query document about hawaii"], # Chroma will embed this for you
    n_results=2 # how many results to return
)

json_results = json.dumps(results, indent=2)

print(json_results)

{
  "ids": [
    [
      "id1",
      "id2"
    ]
  ],
  "distances": [
    [
      1.6710149,
      1.7113707
    ]
  ],
  "embeddings": null,
  "metadatas": [
    [
      null,
      null
    ]
  ],
  "documents": [
    [
      "Title of a book to be processed",
      "Title of another book to be processed"
    ]
  ],
  "uris": null,
  "data": null,
  "included": [
    "metadatas",
    "documents",
    "distances"
  ]
}
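
The result is nested because `query` accepts several query texts at once: each top-level field holds one inner list per query text. Lower distances indicate closer matches. The sketch below flattens that structure into one row per match so it is easier to scan. The `flatten_query_results` helper is hypothetical, and it assumes `documents` and `distances` are present in the result, as they are in the output above.

```python
def flatten_query_results(results):
    """Turn Chroma's nested query result into a flat list of row dictionaries.

    Only uses the "ids", "documents", and "distances" fields shown in the
    output above. Each row records which query text it belongs to.
    """
    rows = []
    for query_index, ids in enumerate(results["ids"]):
        documents = results["documents"][query_index]
        distances = results["distances"][query_index]
        for doc_id, document, distance in zip(ids, documents, distances):
            rows.append(
                {
                    "query_index": query_index,
                    "id": doc_id,
                    "document": document,
                    "distance": distance,
                }
            )
    return rows
```

Calling `flatten_query_results(results)` on the output above would give two rows, one for `id1` and one for `id2`, both with `query_index` 0.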


## Viewing the collection in the Chroma CLI

The documents can also be viewed and queried via the Chroma CLI. Running `chroma browse --local documents` in a new terminal window should display the documents added in the earlier code cell.

In [None]:
##
#
# WARNING: I was unable to get the following code block to 
#          display the table I wanted it to.
#
##

import subprocess

cli_result = subprocess.run([
    "chroma",
    "browse",
    "--local",
    "documents"
], capture_output=True, text=True)

print(cli_result.stderr)
print(cli_result.stdout)
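
One possible reason the cell above does not show a table is that `chroma browse` is an interactive terminal UI, so its output may not be meaningful when captured from a subprocess. As a workaround sketch, the same records can be fetched with `collection.get()` and formatted as plain text. The `format_records_table` helper is hypothetical, and it assumes the returned dictionary has flat `ids` and `documents` lists.

```python
def format_records_table(records):
    """Format the dictionary returned by collection.get() as a plain-text table.

    Assumes `records` has a flat "ids" list and, optionally, a flat
    "documents" list of the same length (entries may be None).
    """
    ids = records["ids"]
    documents = records.get("documents") or [""] * len(ids)
    # Size the first column to fit the longest ID (or the header itself)
    id_width = max([len("id")] + [len(doc_id) for doc_id in ids])
    lines = [
        f"{'id'.ljust(id_width)} | document",
        f"{'-' * id_width}-+-{'-' * len('document')}",
    ]
    for doc_id, document in zip(ids, documents):
        lines.append(f"{doc_id.ljust(id_width)} | {document or ''}")
    return "\n".join(lines)
```

In the notebook, `print(format_records_table(collection.get()))` would print one line per stored document beneath the header.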
