Reference - https://www.datacamp.com/tutorial/chromadb-tutorial-step-by-step-guide

In [None]:
import sys
import os

# Use current working directory and go one level up
parent_dir = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.append(parent_dir)

# Now you can import your config
from config import api_key

In [None]:
import os

# Allow client.reset() to be called later in the notebook
os.environ["ALLOW_RESET"] = "TRUE"

In [None]:
import chromadb

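# Create a client that persists its data to the local "db/" directory
client = chromadb.PersistentClient(path="db/")

# Create a collection (roughly analogous to a table in a relational database)
collection = client.create_collection(name="Students")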

In [None]:
student_info = """
Alexandra Thompson, a 19-year-old computer science sophomore with a 3.7 GPA,
is a member of the programming and chess clubs who enjoys pizza, swimming, and hiking
in her free time in hopes of working at a tech company after graduating from the University of Washington.
"""

club_info = """
The university chess club provides an outlet for students to come together and enjoy playing
the classic strategy game of chess. Members of all skill levels are welcome, from beginners learning
the rules to experienced tournament players. The club typically meets a few times per week to play casual games,
participate in tournaments, analyze famous chess matches, and improve members' skills.
"""

university_info = """
The University of Washington, founded in 1861 in Seattle, is a public research university
with over 45,000 students across three campuses in Seattle, Tacoma, and Bothell.
As the flagship institution of the six public universities in Washington state,
UW encompasses over 500 buildings and 20 million square feet of space,
including one of the largest library systems in the world.
"""

Next, we use the `add()` function to add the text documents along with their metadata and unique IDs. Because no embeddings are supplied, Chroma falls back to its default embedding function: it downloads the all-MiniLM-L6-v2 model on first use, converts the text into embeddings, and stores them in the "Students" collection.
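
`add()` takes parallel lists: the i-th document, metadata dictionary, and ID all describe the same record. A minimal pre-flight check, assuming a hypothetical helper named `check_batch` (it is not part of Chroma), could catch mismatched lengths or repeated IDs before the call:

```python
def check_batch(documents, metadatas, ids):
    """Hypothetical helper (not a Chroma API): validate parallel lists for add()."""
    if not (len(documents) == len(metadatas) == len(ids)):
        raise ValueError(
            f"length mismatch: {len(documents)} documents, "
            f"{len(metadatas)} metadatas, {len(ids)} ids"
        )
    duplicates = sorted({i for i in ids if ids.count(i) > 1})
    if duplicates:
        raise ValueError(f"duplicate ids: {duplicates}")
    # Pair each ID with its document and metadata
    return list(zip(ids, documents, metadatas))


checked = check_batch(
    documents=["student text", "club text", "university text"],
    metadatas=[
        {"source": "student info"},
        {"source": "club info"},
        {"source": "university info"},
    ],
    ids=["id1", "id2", "id3"],
)
for record_id, document, metadata in checked:
    print(record_id, metadata["source"], "->", document)
```

The placeholder strings stand in for the longer texts defined above; the check itself only looks at list lengths and IDs.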

In [None]:
collection.add(
    documents = [student_info, club_info, university_info],
    metadatas = [{"source": "student info"},{"source": "club info"},{'source':'university info'}],
    ids = ["id1", "id2", "id3"]
)

To run a similarity search, use the `query()` function and ask a question in natural language. Chroma converts the query text into an embedding with the collection's embedding function, then returns the stored documents whose embeddings are closest to it. Here, `n_results=2` asks for the two closest matches.
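
To make the idea of "closest embeddings" concrete, here is a brute-force sketch using invented 3-dimensional vectors and cosine distance. It is only an illustration: the vectors are made up, and Chroma itself relies on an approximate nearest-neighbour index with a configurable distance metric, so this is not how the library computes results internally.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings"; real models produce hundreds of dimensions.
toy_embeddings = {
    "id1": [0.9, 0.1, 0.0],  # stands in for the student info
    "id2": [0.2, 0.8, 0.1],  # stands in for the club info
    "id3": [0.1, 0.2, 0.9],  # stands in for the university info
}
query_embedding = [0.8, 0.3, 0.1]  # stands in for the embedded question


def top_n(query, embeddings, n_results):
    """Brute-force search: score every record, keep the n closest."""
    scored = sorted(
        (cosine_distance(query, vector), record_id)
        for record_id, vector in embeddings.items()
    )
    return [(record_id, distance) for distance, record_id in scored[:n_results]]


nearest = top_n(query_embedding, toy_embeddings, n_results=2)
print(nearest)
```

With these toy vectors, `id1` ranks first because its vector points in nearly the same direction as the query vector.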

In [None]:
results = collection.query(
    query_texts=["What is the student name?"],
    n_results=2
)

results
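
`query()` returns a dictionary of parallel lists, with one inner list per query text. The sketch below uses a hand-written dictionary in that shape (the distance values are invented, not real output) and a hypothetical helper, `rows_for_query`, to pair each ID with its document, source, and distance:

```python
# Hand-written stand-in for a query() result; the distances are made up.
sample_results = {
    "ids": [["id1", "id2"]],
    "documents": [["Alexandra Thompson, a 19-year-old ...", "The university chess club ..."]],
    "metadatas": [[{"source": "student info"}, {"source": "club info"}]],
    "distances": [[1.29, 1.37]],
}


def rows_for_query(results, query_index=0):
    """Hypothetical helper: zip the parallel lists for one query into rows."""
    return [
        {"id": i, "document": d, "source": m["source"], "distance": dist}
        for i, d, m, dist in zip(
            results["ids"][query_index],
            results["documents"][query_index],
            results["metadatas"][query_index],
            results["distances"][query_index],
        )
    ]


for row in rows_for_query(sample_results):
    print(f'{row["id"]}  {row["distance"]:.2f}  {row["source"]}')
```

Passing the real `results` object instead of `sample_results` should work the same way, assuming the default fields are included in the response.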

In [None]:
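from chromadb.utils import embedding_functions

# Wrap OpenAI's embedding endpoint as a Chroma embedding function
# (needs the `openai` package and the API key imported from config above)
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=api_key,
    model_name="text-embedding-ada-002",
)

# Embed the three documents up front so the vectors can be reused later
students_embeddings = openai_ef([student_info, club_info, university_info])
print(students_embeddings)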

Instead of relying on the default embedding model, we can load the embeddings we have already created directly into a collection.

1. We use the `get_or_create_collection()` function to create a new collection called "Students2". Unlike `create_collection()`, it returns the existing collection if one with that name already exists and creates it otherwise.
2. We then add the embeddings, text documents, metadata, and IDs to the newly created collection.
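
The difference between the two creation functions can be sketched with a toy, dictionary-backed client. `ToyClient` is hypothetical and only mimics the naming behaviour; the exception type Chroma raises for a duplicate name varies by version, so the `ValueError` here is an assumption of the sketch.

```python
class ToyClient:
    """Toy stand-in for a client; it only tracks collections by name."""

    def __init__(self):
        self._collections = {}

    def create_collection(self, name):
        # Creating a name that already exists is treated as an error.
        if name in self._collections:
            raise ValueError(f"collection {name!r} already exists")
        self._collections[name] = {"name": name, "records": {}}
        return self._collections[name]

    def get_or_create_collection(self, name):
        # Return the existing collection, or create it on first use.
        if name not in self._collections:
            return self.create_collection(name)
        return self._collections[name]


toy_client = ToyClient()
first = toy_client.get_or_create_collection("Students2")
second = toy_client.get_or_create_collection("Students2")  # safe to repeat
print(first is second)

try:
    toy_client.create_collection("Students2")
except ValueError as error:
    print("create_collection failed:", error)
```

This is why `get_or_create_collection()` is the more convenient choice in a notebook whose cells may be re-run.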

In [None]:
collection2 = client.get_or_create_collection(name="Students2")

collection2.add(
    embeddings = students_embeddings,
    documents = [student_info, club_info, university_info],
    metadatas = [{"source": "student info"},{"source": "club info"},{'source':'university info'}],
    ids = ["id1", "id2", "id3"]
)

There is another, more straightforward method, too. You can add an OpenAI embedding function while creating or accessing the collection. Apart from OpenAI, you can use Cohere, Google PaLM, HuggingFace, and Instructor models.

With the embedding function attached to the collection, any text documents added to it are embedded with the OpenAI model instead of the default one.
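
Conceptually, an embedding function is a callable that takes a list of texts and returns one vector per text, and a collection calls it whenever documents arrive without embeddings. The sketch below shows that flow with two hypothetical toy classes. The hash-based vectors carry no semantic meaning, and the parameter name `input` follows the call signature used by recent Chroma versions, which is an assumption about your installed version.

```python
import hashlib


class ToyEmbeddingFunction:
    """Toy embedding function: maps each text to a small deterministic vector.

    The vectors carry no semantic meaning; they only show the call shape
    (a list of texts in, one vector per text out).
    """

    def __init__(self, dimensions=4):
        self.dimensions = dimensions

    def __call__(self, input):
        vectors = []
        for text in input:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            vectors.append([byte / 255 for byte in digest[: self.dimensions]])
        return vectors


class ToyCollection:
    """Toy collection that embeds documents itself when no vectors are supplied."""

    def __init__(self, embedding_function):
        self.embedding_function = embedding_function
        self.records = {}

    def add(self, ids, documents, embeddings=None):
        if embeddings is None:
            # No precomputed vectors: fall back to the attached function
            embeddings = self.embedding_function(documents)
        for record_id, document, vector in zip(ids, documents, embeddings):
            self.records[record_id] = (document, vector)


toy = ToyCollection(embedding_function=ToyEmbeddingFunction())
toy.add(ids=["id1", "id2"], documents=["student text", "club text"])
print({record_id: len(vector) for record_id, (_, vector) in toy.records.items()})
```

Swapping in a different embedding function changes how the vectors are produced without touching the code that adds documents.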

In [None]:
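collection2 = client.get_or_create_collection(name="Students2", embedding_function=openai_ef)

# "Students2" already holds id1-id3 from the previous cell, so upsert()
# overwrites those records (re-embedding them with openai_ef) rather than
# trying to add duplicate IDs.
collection2.upsert(
    documents = [student_info, club_info, university_info],
    metadatas = [{"source": "student info"},{"source": "club info"},{'source':'university info'}],
    ids = ["id1", "id2", "id3"]
)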

In [None]:
results = collection2.query(
    query_texts=['What is the student name?'],
    n_results=2)

results

As in a relational database, you can update or remove records in a collection. To update a record's text and metadata, pass its ID together with the new values.
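
The ID-keyed behaviour can be pictured with a plain dictionary. This is a toy sketch with hypothetical helpers (`update_record`, `delete_records`), not Chroma's implementation; in particular, raising `KeyError` for an unknown ID is a choice made here and not a claim about how Chroma handles that case.

```python
toy_store = {
    "id1": {
        "document": "Alexandra Thompson, a 19-year-old ...",
        "metadata": {"source": "student info"},
    },
    "id2": {
        "document": "The university chess club ...",
        "metadata": {"source": "club info"},
    },
}


def update_record(store, record_id, document=None, metadata=None):
    """Overwrite only the fields that are passed; leave the rest unchanged."""
    record = store[record_id]  # toy choice: unknown IDs raise KeyError
    if document is not None:
        record["document"] = document
    if metadata is not None:
        record["metadata"] = metadata


def delete_records(store, record_ids):
    """Remove the listed IDs from the store; missing IDs are ignored."""
    for record_id in record_ids:
        store.pop(record_id, None)


update_record(
    toy_store,
    "id1",
    document="Kristiane Carina, a 19-year-old computer science sophomore",
)
print(toy_store["id1"])

delete_records(toy_store, ["id1"])
print(list(toy_store))
```

In a real collection the embedding also has to stay in sync with the text, which is why changing a document normally means its embedding is recomputed.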

In [None]:
collection2.update(
    ids=["id1"],
    documents=["Kristiane Carina, a 19-year-old computer science sophomore with a 3.7 GPA"],
    metadatas=[{"source": "student info"}],
)

In [None]:
results = collection2.query(
    query_texts=["What is the student name?"],
    n_results=2
)

results

In [None]:
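# Remove the record with ID "id1"
collection2.delete(ids = ['id1'])

# Run the same query again; id1 should no longer appear in the results
results = collection2.query(
    query_texts=["What is the student name?"],
    n_results=2
)

results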

## Collection Management

In this section, we look at the collection utility functions that make managing collections easier.

We will create a new collection called "vectordb" and add the information about the Chroma DB cheat sheet, documentation, and JS API with metadata.

In [None]:
vector_collections = client.create_collection("vectordb")

vector_collections.add(
    documents=["This is Chroma DB CheatSheet",
               "This is Chroma DB Documentation",
               "This document Chroma JS API Docs"],
    metadatas=[{"source": "Chroma Cheatsheet"},
               {"source": "Chroma Doc"},
               {"source": "JS API Doc"}],
    ids=["id1", "id2", "id3"]
)

In [None]:
# Number of records in the collection
vector_collections.count()

In [None]:
# Retrieve all records (IDs, documents, and metadata)
vector_collections.get()

In [None]:
# Rename the collection from "vectordb" to "chroma_info"
vector_collections.modify(name="chroma_info")

# List all collections
client.list_collections()

In [None]:
# Look up an existing collection by its (new) name
vector_collections_new = client.get_collection(name="chroma_info")

In [None]:
# Delete the collection and everything stored in it
client.delete_collection(name="chroma_info")
client.list_collections()

In [None]:
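# Wipe the entire database: this removes every collection and cannot be undone.
# It only works because ALLOW_RESET was set to "TRUE" earlier in the notebook.
client.reset()
client.list_collections()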