# What is ChromaDB

Chroma is one of the easiest vector databases to get started with, but the most widely deployed solutions today tend to be Pinecone and Milvus, with Weaviate and pgvector also common in production. [dataaspirant](https://dataaspirant.com/popular-vector-databases/)

## What “most deployed” usually means

When people talk about the “most deployed” vector database, they usually look at a mix of:
- Managed SaaS adoption in enterprises (Pinecone). [celerdata](https://celerdata.com/glossary/best-vector-databases)
- Open‑source adoption signals like GitHub stars and Docker pulls (Milvus, Qdrant, Weaviate, Chroma, pgvector). [firecrawl](https://www.firecrawl.dev/blog/best-vector-databases-2025)
- How often they appear as “default choices” in recent 2025 roundups. [shakudo](https://www.shakudo.io/blog/top-9-vector-databases)

By those metrics:
- Pinecone is often cited as a leading hosted vector‑database service in enterprises. [researchandmarkets](https://www.researchandmarkets.com/reports/6216016/vector-database-global-market-insights)
- Milvus is usually the top open‑source option by community size and large‑scale deployments. [dataaspirant](https://dataaspirant.com/popular-vector-databases/)
- Weaviate and Qdrant are also heavily used, especially for RAG and multimodal search. [firecrawl](https://www.firecrawl.dev/blog/best-vector-databases-2025)
- pgvector is extremely common wherever teams already use Postgres and just add vector search. [dataaspirant](https://dataaspirant.com/popular-vector-databases/)

## Where Chroma fits

Chroma is frequently recommended when:
- You want a simple, embedded or self‑hosted DB for small–medium RAG apps. [liquidmetal](https://liquidmetal.ai/casesAndBlogs/vector-comparison/)
- You care more about **ease** and fast iteration than massive multi‑tenant scale. [risingwave](https://risingwave.com/blog/chroma-db-vs-pinecone-vs-faiss-vector-database-showdown/)

It’s very popular among individual developers and early‑stage projects, but Pinecone and Milvus are more often named as the most widely deployed in large, production environments. [datainsightsmarket](https://www.datainsightsmarket.com/reports/vector-database-1990919)

## Quick guide: choosing in practice

- If you want “most common SaaS in production”: Pinecone. [alphamatch](https://www.alphamatch.ai/blog/top-vector-databases-2025)
- If you want “most common large‑scale open source”: Milvus (with Qdrant/Weaviate close behind). [shakudo](https://www.shakudo.io/blog/top-9-vector-databases)
- If you want “easiest local / small RAG”: Chroma or pgvector. [zilliz](https://zilliz.com/blog/chroma-vs-neo4j-a-comprehensive-vector-database-comparison)

- https://cookbook.chromadb.dev/core/api/

## Best Vector DBs for Production RAG

There is no single universal “best” vector DB, but the commonly recommended production options are Milvus, Pinecone, Qdrant, Weaviate, pgvector, and Chroma (mainly for prototyping).

### Rule-of-thumb choices

- **Pinecone** – Managed SaaS, low-ops, good multi-region support; suits small–medium scale and teams that prioritize serverless operation and reliability over cost.
- **Milvus** – Strong for 100M–1B+ vectors, self-hosted or Kubernetes, high performance and rich features for large-scale workloads.
- **Qdrant** – Cost-effective, Rust-based, good latency and strong metadata filtering for 1M–100M vectors.
- **Weaviate** – Good for hybrid search (vector + keyword) and graph-like relations, with a GraphQL API for knowledge-heavy RAG.
- **pgvector** – Best when you are already on Postgres and have moderate scale; convenient but not the fastest at huge scale.
- **Chroma** – Great DX and simplicity for prototyping and small projects, less common for very large multi-tenant production.

### Quick decision table

| Situation / requirement                | Recommended DB(s)        |
| -------------------------------------- | ------------------------ |
| Managed, low DevOps, <100M vectors     | Pinecone                 |
| Massive scale (100M–1B+), self-hosted  | Milvus                   |
| Tight budget, need strong filters      | Qdrant                   |
| Hybrid keyword + vector, relationships | Weaviate                 |
| Already on Postgres                    | pgvector                 |
| Fast prototyping / PoC                 | Chroma (then migrate)    |

### What matters for production RAG

Focus on:
- Low and stable tail latency (p95).
- Predictable scaling behavior.
- Strong metadata filtering.
- An operational model (SaaS vs self-hosted) your team can realistically maintain.
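
Tail latency is easy to misjudge from averages, so a short sketch of how a p95 figure can be computed from raw timings may help. The `percentile` helper and the latency samples below are hypothetical and use the simple nearest-rank definition; monitoring tools may interpolate differently.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical query latencies in milliseconds: mostly fast, with two slow outliers
latencies_ms = [12, 14, 13, 15, 11, 13, 12, 14, 250, 13,
                12, 16, 14, 13, 12, 15, 13, 14, 12, 400]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(f"p50 = {p50} ms, p95 = {p95} ms")
```

The median stays low while the p95 is dominated by the outliers, which is why the tail, not the average, is the number to watch.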


# Basics of ChromaDB

In [1]:
import chromadb

# Create a client (this is your database connection)
client = chromadb.Client()

# Create a collection (think of it like a table)
collection = client.create_collection(name="my_first_collection")


# Add some documents
collection.add(
    documents=[
        "The cat sat on the mat",
        "The dog played in the park",
        "Python is a programming language"
    ],
    ids=["doc1", "doc2", "doc3"]
)

# Search for similar documents
results = collection.query(
    query_texts=["Tell me about animals"],
    n_results=2
)

print('Showing results')
print(results)

Showing results
{'ids': [['doc2', 'doc3']], 'embeddings': None, 'documents': [['The dog played in the park', 'Python is a programming language']], 'uris': None, 'included': ['metadatas', 'documents', 'distances'], 'data': None, 'metadatas': [[None, None]], 'distances': [[1.3784717321395874, 1.6124422550201416]]}
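
The dictionary printed above groups values by query: each top-level list holds one inner list per entry in `query_texts`. A minimal sketch of flattening the first query's hits into rows follows. `rows_for_query` is a hypothetical helper, not part of ChromaDB, and the dictionary is trimmed to the fields used here.

```python
# Result dict copied from the output above (trimmed to the fields used here)
results = {
    "ids": [["doc2", "doc3"]],
    "documents": [["The dog played in the park", "Python is a programming language"]],
    "metadatas": [[None, None]],
    "distances": [[1.3784717321395874, 1.6124422550201416]],
}

def rows_for_query(results, query_index=0):
    """Zip the parallel lists for one query into (id, document, distance) tuples."""
    return list(zip(
        results["ids"][query_index],
        results["documents"][query_index],
        results["distances"][query_index],
    ))

for doc_id, doc, dist in rows_for_query(results):
    print(f"{doc_id}: {dist:.4f} - {doc}")
```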


**The key insight**: ChromaDB automatically converted your text into numbers (called embeddings/vectors). The distance is calculated between these number arrays, not the text itself!

**Default settings**:
- **Embedding model**: all-MiniLM-L6-v2 (a sentence transformer model)
- **Distance metric**: Squared L2 distance (Euclidean distance squared)

By default, with `chromadb.Client()`, you're using an in-memory database - it disappears when your program ends!
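
To make the default metric concrete, here is a small sketch of squared L2 distance on toy vectors. It uses hand-picked 2-dimensional unit vectors rather than real embeddings, and the helper names are illustrative. For unit-length vectors, squared L2 equals `2 - 2 * cosine similarity`, so under that assumption the two measures rank results identically.

```python
import math

def squared_l2(a, b):
    """Sum of squared coordinate differences (no square root taken)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 2-dimensional unit vectors, chosen by hand for illustration
query = [1.0, 0.0]
doc_same = [1.0, 0.0]
doc_orthogonal = [0.0, 1.0]
doc_opposite = [-1.0, 0.0]

for name, doc in [("same", doc_same), ("orthogonal", doc_orthogonal), ("opposite", doc_opposite)]:
    d2 = squared_l2(query, doc)
    cos = cosine_similarity(query, doc)
    print(f"{name}: squared L2 = {d2:.1f}, cosine = {cos:.1f}, 2 - 2*cos = {2 - 2 * cos:.1f}")
```

Smaller distances mean more similar vectors: identical vectors score 0, and for unit vectors the largest possible value is 4.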

In [2]:
client = chromadb.Client()
collection = client.create_collection(name="test_embeddings")

# Add documents
collection.add(
    documents=["The cat sat on the mat"],
    ids=["doc1"]
)

# Let's peek at what ChromaDB actually stored
results = collection.get(
    ids=["doc1"],
    include=["embeddings", "documents"]
)

print("Document:", results['documents'])
print("\nEmbedding (first 10 numbers):", results['embeddings'][0][:10])
print("Embedding length:", len(results['embeddings'][0]))

Document: ['The cat sat on the mat']

Embedding (first 10 numbers): [ 0.13040181 -0.01187006 -0.02811697  0.05123863 -0.05597449  0.03019159
  0.03016133  0.02469834 -0.01837056  0.05876682]
Embedding length: 384


In [3]:
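import chromadb
import chromadb.api

# Reset Chroma's cached client state; this helps when switching from the
# in-memory Client() used earlier to a PersistentClient in the same process
chromadb.api.client.SharedSystemClient.clear_system_cache()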

# This creates a local database folder
client = chromadb.PersistentClient(path="./data/my_chroma_db")

collection = client.get_or_create_collection(name="persistent_collection")

collection.add(
    documents=["This will be saved to disk"],
    ids=["doc1"]
)

print("Data saved to ./data/my_chroma_db folder")


Data saved to ./data/my_chroma_db folder


# ChromaDB File Structure
```
./data/my_chroma_db/
├── chroma.sqlite3          # Metadata database
└── <hash-folder>/          # e.g., 4f2a3b1c-...
    ├── data_level0.bin     # Vector data
    ├── header.bin          # Index metadata
    ├── length.bin          # Link-list sizes (HNSW)
    └── link_lists.bin      # Graph connections (for HNSW)
```

## 1. `chroma.sqlite3` - The Metadata Store

This SQLite database stores:
- Collection names and settings
- Document IDs
- Document text (the actual strings you added)
- Metadata (any extra info you attached)
- Configuration (which embedding function, distance metric, etc.)

**Think of it as:** The "catalog" or "index card system" that keeps track of what you have
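
Because `chroma.sqlite3` is an ordinary SQLite file, it can be inspected with Python's built-in `sqlite3` module. The sketch below only lists table names through SQLite's own `sqlite_master` catalog. The table names themselves depend on the Chroma version, so none are assumed here. `list_tables` is a hypothetical helper, and the path is the one used in the `PersistentClient` example.

```python
import sqlite3
from pathlib import Path

def list_tables(db_path):
    """Return the table names in a SQLite file, sorted alphabetically."""
    # Open read-only so that inspecting the file cannot modify it
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]

# Path assumed from the PersistentClient example in this notebook
db_file = Path("./data/my_chroma_db/chroma.sqlite3")
if db_file.exists():
    print(list_tables(db_file))
else:
    print(f"{db_file} not found; run the PersistentClient cell first")
```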

## 2. The Hash Folder - The Vector Store

Each collection gets a folder named with a UUID. Inside are `.bin` files that store:

- **`data_level0.bin`**: Your actual embedding vectors (those arrays of floats)
- **`header.bin`**: Metadata about the index structure
- **`length.bin`**: Sizes of each element's link lists in the HNSW graph
- **`link_lists.bin`**: The HNSW (Hierarchical Navigable Small World) graph structure for fast similarity search

## Key Concepts

### Distance Calculation

✅ Distance is calculated **on-the-fly** during search  
✅ Distances are never stored; if no query runs, no distance is calculated for any document  
✅ Distance only exists **between two vectors** (query vector ↔ document vector)

### Summary Table

| What | Where Stored | When Created |
|------|--------------|--------------|
| **Document text** | `chroma.sqlite3` | When you `.add()` |
| **Document vectors** | `data_level0.bin` | When you `.add()` (via embedding function) |
| **Query vector** | Nowhere (temporary) | When you `.query()` (on-the-fly) |
| **Distance** | Nowhere | When you `.query()` (calculated on-the-fly) |

## Performance Optimization

The `.bin` files such as `link_lists.bin` speed up search rather than the distance calculation itself: they hold the HNSW (Hierarchical Navigable Small World) graph, which lets ChromaDB find close vectors without calculating distances to every stored vector.
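
The idea of skipping most distance calculations can be illustrated with a toy comparison. The sketch below is not ChromaDB's implementation: it is a greedy walk over a hand-built, single-layer neighbor graph, a much-simplified stand-in for HNSW, which adds multiple layers and candidate lists so that the walk is less likely to stop at a local minimum. All names and data are made up for illustration.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_nearest(vectors, query):
    """Compare the query against every vector; returns (best_index, comparisons)."""
    best = min(range(len(vectors)), key=lambda i: squared_l2(vectors[i], query))
    return best, len(vectors)

def greedy_graph_nearest(vectors, neighbors, query, entry=0):
    """Walk the graph, always moving to the neighbor closest to the query."""
    current = entry
    current_dist = squared_l2(vectors[current], query)
    comparisons = 1
    while True:
        best, best_dist = current, current_dist
        for n in neighbors[current]:
            d = squared_l2(vectors[n], query)
            comparisons += 1
            if d < best_dist:
                best, best_dist = n, d
        if best == current:  # no neighbor is closer: stop here
            return current, comparisons
        current, current_dist = best, best_dist

# Toy data: 100 one-dimensional points, each linked to adjacent and farther points
vectors = [[float(i)] for i in range(100)]
neighbors = {
    i: [j for j in (i - 10, i - 1, i + 1, i + 10) if 0 <= j < 100]
    for i in range(100)
}
query = [57.3]

exact, exact_cost = brute_force_nearest(vectors, query)
approx, approx_cost = greedy_graph_nearest(vectors, neighbors, query)
print(f"brute force: index {exact} after {exact_cost} distance calculations")
print(f"graph walk:  index {approx} after {approx_cost} distance calculations")
```

On this toy graph the walk reaches the same point as the exhaustive scan while evaluating fewer distances. On real data a graph search is approximate and can occasionally miss the true nearest neighbor.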

In [4]:
import chromadb

# This creates a local database folder
client = chromadb.PersistentClient(path="./data/my_chroma_db")

collection = client.get_or_create_collection(name="persistent_collection")

# Search for similar documents
results = collection.query(
    query_texts=["Tell me about animals"],
    n_results=2
)

print('Showing results')
print(results)

Showing results
{'ids': [['doc1']], 'embeddings': None, 'documents': [['This will be saved to disk']], 'uris': None, 'included': ['metadatas', 'documents', 'distances'], 'data': None, 'metadatas': [[None]], 'distances': [[1.9052947759628296]]}


In [5]:
import chromadb

client = chromadb.PersistentClient(path="./data/my_chroma_db")
collection = client.get_or_create_collection(name="persistent_collection")

# Let's see what's actually IN the collection
all_docs = collection.get()
print("Documents in collection:", all_docs['documents'])
print("Number of documents:", len(all_docs['documents']))

Documents in collection: ['This will be saved to disk']
Number of documents: 1


In [6]:
import chromadb

client = chromadb.Client()
collection = client.create_collection(name="animal_test")

collection.add(
    documents=[
        "The cat sat on the mat",
        "The dog played in the park",
        "Python is a programming language"
    ],
    ids=["doc1", "doc2", "doc3"]
)

# Let's try different queries
queries = [
    "Tell me about animals",
    "cat",
    "pets and animals"
]

for query in queries:
    results = collection.query(query_texts=[query], n_results=3)
    print(f"\n--- Query: '{query}' ---")
    for i, (doc, distance) in enumerate(zip(results['documents'][0], results['distances'][0])):
        print(f"{i+1}. Distance: {distance:.4f} - '{doc}'")


--- Query: 'Tell me about animals' ---
1. Distance: 1.3785 - 'The dog played in the park'
2. Distance: 1.6124 - 'Python is a programming language'
3. Distance: 1.8603 - 'The cat sat on the mat'

--- Query: 'cat' ---
1. Distance: 0.9752 - 'The cat sat on the mat'
2. Distance: 1.5271 - 'Python is a programming language'
3. Distance: 1.6214 - 'The dog played in the park'

--- Query: 'pets and animals' ---
1. Distance: 1.1869 - 'The dog played in the park'
2. Distance: 1.5547 - 'The cat sat on the mat'
3. Distance: 1.6321 - 'Python is a programming language'


In [7]:
import chromadb

client = chromadb.Client()
collection = client.create_collection(name="word_test")

# Let's test single words vs. sentences
collection.add(
    documents=[
        "cat",
        "dog", 
        "animal",
        "The cat sat on the mat",
        "Tell me about animals"
    ],
    ids=["word_cat", "word_dog", "word_animal", "sentence_cat", "question_animals"]
)

# Compare these queries
test_queries = ["animal", "animals", "cat"]

for query in test_queries:
    results = collection.query(query_texts=[query], n_results=5)
    print(f"\n--- Query: '{query}' ---")
    for doc, dist in zip(results['documents'][0], results['distances'][0]):
        print(f"  {dist:.4f} - '{doc}'")


--- Query: 'animal' ---
  0.0000 - 'animal'
  0.3688 - 'dog'
  0.6511 - 'cat'
  0.7057 - 'Tell me about animals'
  1.5166 - 'The cat sat on the mat'

--- Query: 'animals' ---
  0.3075 - 'animal'
  0.4693 - 'Tell me about animals'
  0.7699 - 'dog'
  0.9449 - 'cat'
  1.6997 - 'The cat sat on the mat'

--- Query: 'cat' ---
  0.0000 - 'cat'
  0.6511 - 'animal'
  0.6787 - 'dog'
  0.9752 - 'The cat sat on the mat'
  1.3497 - 'Tell me about animals'


In [8]:
from dotenv import load_dotenv
import os
from huggingface_hub import login # Or other HF libraries

load_dotenv() # Loads variables from .env
hf_token = os.getenv("HF_TOKEN")


In [9]:

# Now use the token with HF libraries, e.g.:
login(token=hf_token)
# Or directly in from_pretrained:
# model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-chat-hf", token=hf_token)


Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


In [10]:
import sentence_transformers
print(sentence_transformers.__version__)

5.2.2


In [11]:
import chromadb
from chromadb.utils import embedding_functions

# Example: Using OpenAI embeddings (you'll replace this with your own)
# openai_ef = embedding_functions.OpenAIEmbeddingFunction(
#     api_key="your-api-key",
#     model_name="text-embedding-ada-002"
# )

# Or using SentenceTransformers with a different model
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="paraphrase-MiniLM-L6-v2"
)

client = chromadb.Client()
collection = client.create_collection(
    name="custom_embeddings",
    embedding_function=sentence_transformer_ef
)

collection.add(
    documents=["The cat sat on the mat"],
    ids=["doc1"]
)

BertModel LOAD REPORT from: sentence-transformers/paraphrase-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.



In [12]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

model_name = 'gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

text = "The cat sat on the mat"

# Step 1: Tokenization
tokens = tokenizer.encode(text)
print("Tokens (IDs):", tokens)
print("Tokens (text):", [tokenizer.decode([t]) for t in tokens])

# Step 2: What happens inside the model?
input_ids = torch.tensor([tokens])
with torch.no_grad():
    outputs = model(input_ids, output_hidden_states=True)

print("\nWhat's inside the model:")
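# hidden_states has one entry for the embedding output plus one per transformer block (1 + 12 for gpt2)
print("- Number of hidden layers:", len(outputs.hidden_states))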
print("- Shape of last hidden state:", outputs.hidden_states[-1].shape)
print("- That means: [batch_size, sequence_length, embedding_dimension]")

GPT2LMHeadModel LOAD REPORT from: gpt2
Key                  | Status     |  | 
---------------------+------------+--+-
h.{0...11}.attn.bias | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Tokens (IDs): [464, 3797, 3332, 319, 262, 2603]
Tokens (text): ['The', ' cat', ' sat', ' on', ' the', ' mat']

What's inside the model:
- Number of hidden layers: 13
- Shape of last hidden state: torch.Size([1, 6, 768])
- That means: [batch_size, sequence_length, embedding_dimension]


In [14]:
import chromadb
from chromadb.utils import embedding_functions
import os
from dotenv import load_dotenv

load_dotenv() # Loads variables from .env
openai_token = os.getenv("OPENAI_API_KEY")
if not openai_token:
    raise ValueError("OPENAI_API_KEY environment variable is not set")

# Create OpenAI embedding function
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=openai_token,
    model_name="text-embedding-3-small"  # or "text-embedding-3-large"
)

# Create persistent client with OpenAI embeddings
client = chromadb.PersistentClient(path="./data/openai_chroma_db")

collection = client.get_or_create_collection(
    name="openai_collection",
    embedding_function=openai_ef
)

# Add documents
collection.add(
    documents=[
        "The cat sat on the mat",
        "The dog played in the park",
        "Python is a programming language"
    ],
    ids=["doc1", "doc2", "doc3"]
)

# Query
results = collection.query(
    query_texts=["Tell me about animals"],
    # n_results=2  (left commented out: the default of 10 returns all 3 documents)
)

print("Results:", results['documents'])
print("Distances:", results['distances'])

Results: [['The dog played in the park', 'The cat sat on the mat', 'Python is a programming language']]
Distances: [[0.7549513578414917, 0.77315354347229, 0.864280104637146]]


In [15]:
import chromadb

client = chromadb.PersistentClient(path="./data/test_db")
collection = client.get_or_create_collection(name="test")

collection.add(
    documents=["The cat sat on the mat", "The dog played"],
    ids=["doc1", "doc2"]
)

# Get the raw data
result = collection.get(ids=["doc1"], include=["embeddings"])
print("Vector for doc1 (first 5 numbers):", result['embeddings'][0][:5])
print("Vector length:", len(result['embeddings'][0]))

# Now query
query_result = collection.query(
    query_texts=["cat"],
    n_results=2,
    include=["embeddings", "distances"]
)
print("\nDistance to doc1:", query_result['distances'][0][0])
print("Distance to doc2:", query_result['distances'][0][1])

Vector for doc1 (first 5 numbers): [ 0.13040181 -0.01187006 -0.02811697  0.05123863 -0.05597449]
Vector length: 384

Distance to doc1: 0.9752493500709534
Distance to doc2: 1.3566614389419556


- https://claude.ai/share/c86070ea-4fa0-4c73-bcf2-f3f635f042f0

## End