# ChromaDB

> Chroma is an AI-native open-source vector database. It comes with everything you need to get started built-in, and runs on your machine.

Read more about it in the official [docs](https://docs.trychroma.com/docs/overview/getting-started).

In [1]:
# Install all the required modules for this notebook:
!pip install chromadb PyPDF2



In [2]:
# Import all modules used for this notebook activity:
import chromadb
import PyPDF2

## Algorithm:

1. Create a ChromaDB client
2. Create a collection
3. Add documents to the collection created in step 2
4. Query the desired document


In [19]:
# 1. Create a ChromaDB client (in-memory: the data is gone once the kernel stops)
client = chromadb.Client()
client

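# Read the PDF and concatenate the text of every page.
# The `with` block closes the file handle once reading is done.
with open('./neural_networks_stanford.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    pdf_text = ""
    for page in pdf_reader.pages:
        # Guard against pages with no extractable text
        pdf_text += page.extract_text() or ""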

# Break the text by both newline and period
chunks = []
for paragraph in pdf_text.split('\n'):
    chunks.extend(paragraph.split('.'))

# Filter out empty or whitespace-only chunks
chunks = [chunk.strip() for chunk in chunks if chunk.strip()]

len(chunks)

1802
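Splitting on every newline cuts sentences that wrap across PDF lines, so many chunks end mid-sentence. One possible alternative, sketched below, is to collapse the line breaks first and then split on sentence-ending punctuation. The function name `sentence_chunks` and the sample text are illustrative assumptions, not part of the notebook, and this approach would produce a different chunk count than the one shown above.

```python
import re


def sentence_chunks(text: str) -> list[str]:
    """Split text into sentence-like chunks, ignoring line wrapping."""
    # Collapse newlines and repeated whitespace into single spaces
    flat = re.sub(r"\s+", " ", text).strip()
    # Split after '.', '!' or '?' when followed by whitespace
    parts = re.split(r"(?<=[.!?])\s+", flat)
    # Drop empty pieces
    return [part.strip() for part in parts if part.strip()]


# Hypothetical sample with a sentence wrapped across two lines
sample = (
    "A neural network is a network of small\n"
    "computing units. Each unit takes a vector\n"
    "of inputs."
)
print(sentence_chunks(sample))
```

This keeps the trailing period on each chunk and treats abbreviations such as "e.g." as sentence ends, so it is only a rough heuristic.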

In [20]:
# 2. Create a collection
# This is the part where we store our `embeddings`
COLLECTION_NAME = "neural-networks"

# Reuse the collection if it already exists, otherwise create it.
# Checking only whether *any* collection exists would fail when a
# different collection is present but this one is not.
collection = client.get_or_create_collection(name=COLLECTION_NAME)
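As a rough mental model of what a collection of embeddings is for: each document is stored as a vector, and a query returns the stored vectors closest to the query vector. The sketch below does this by brute force in plain Python, assuming squared Euclidean distance as the metric (the metric a real collection uses is configurable). The helper names and the 2-dimensional vectors are made up for illustration; this is not how Chroma is implemented internally.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query_vec, stored, n_results=2):
    """Return the n_results (id, distance) pairs closest to query_vec."""
    scored = [(doc_id, squared_l2(query_vec, vec)) for doc_id, vec in stored.items()]
    # Smaller distance means more similar
    return sorted(scored, key=lambda pair: pair[1])[:n_results]


# Made-up toy "embeddings"; real ones have hundreds of dimensions
toy_store = {"doc_0": [0.0, 1.0], "doc_1": [1.0, 0.0], "doc_2": [0.9, 0.1]}
print(nearest([1.0, 0.0], toy_store))
```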

In [21]:
# 3. Add documents to the collection created on step 2
# Note: this cell does not clear the collection, so re-running it reuses the same IDs.

# Build one ID and one metadata entry per chunk, then add everything in a single call
chunk_ids = [f"doc_{i}" for i in range(len(chunks))]
metadatas = [{"source": "neural_networks_stanford.pdf"}] * len(chunks)

collection.add(
    documents=chunks,
    metadatas=metadatas,
    ids=chunk_ids
)

print(f"Added {len(chunks)} chunks to the collection.")

Added 1802 chunks to the collection.
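For larger documents, a single `add` call may become too big for the client to accept, so it can help to send the chunks in slices. The following is a minimal sketch of such slicing; `iter_batches` and the batch size of 500 are hypothetical choices, and the stand-in lists only mimic the shapes of `chunk_ids`, `chunks`, and `metadatas`.

```python
def iter_batches(ids, documents, metadatas, batch_size=500):
    """Yield aligned slices of ids, documents and metadatas."""
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        yield ids[start:end], documents[start:end], metadatas[start:end]


# Stand-in data; in the notebook these would be chunk_ids, chunks and metadatas
demo_docs = [f"chunk {i}" for i in range(1802)]
demo_ids = [f"doc_{i}" for i in range(len(demo_docs))]
demo_meta = [{"source": "neural_networks_stanford.pdf"} for _ in demo_docs]

sizes = [len(batch_ids) for batch_ids, _, _ in iter_batches(demo_ids, demo_docs, demo_meta)]
print(sizes)  # [500, 500, 500, 302]
```

Each yielded triple could then be passed to `collection.add(documents=..., metadatas=..., ids=...)` in a loop.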


In [22]:
# 4. Query the desired document
results = collection.query(
    query_texts=["What is a neural network?"],
    n_results=5
)
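`collection.query` returns a dictionary of parallel nested lists, with one inner list per entry in `query_texts`. A small helper can flatten one query's results into rows that are easier to read. The sketch below assumes the `ids`, `distances`, and `documents` keys are present; `rank_hits` is a hypothetical name and `sample_results` is hand-written data in the same shape, not real query output.

```python
def rank_hits(results, query_index=0):
    """Flatten one query's nested result lists into (id, distance, document) rows."""
    return list(zip(
        results["ids"][query_index],
        results["distances"][query_index],
        results["documents"][query_index],
    ))


# Hand-written sample in the same nested shape; not real query output
sample_results = {
    "ids": [["doc_a", "doc_b"]],
    "distances": [[0.25, 0.5]],
    "documents": [["first chunk", "second chunk"]],
}

for doc_id, distance, text in rank_hits(sample_results):
    print(f"{distance:.2f}  {doc_id}  {text}")
```

With the notebook's own `results` object, `rank_hits(results)` would give the hits for the single query text, ordered from smallest to largest distance.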

In [23]:
results

{'ids': [['doc_222', 'doc_13', 'doc_972', 'doc_1442', 'doc_313']],
 'embeddings': None,
 'documents': [['neural network, the feedforward network . A feedforward network is a multilayerfeedforward',
   'Instead, a modern neural network is a network of small computing units, each',
   '• Neural networks are built out of neural units , originally inspired by biological',
   '• Neural networks are built out of neural units , originally inspired by biological',
   'That means we can think of a neural network classiﬁer with one hidden layer']],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[{'source': 'neural_networks_stanford.pdf'},
   {'source': 'neural_networks_stanford.pdf'},
   {'source': 'neural_networks_stanford.pdf'},
   {'source': 'neural_networks_stanford.pdf'},
   {'source': 'neural_networks_stanford.pdf'}]],
 'distances': [[0.6134884357452393,
   0.6193888783454895,
   0.6197201609611511,
   0.6197201609611511,
   0.63339519500