## Introduction to ChromaDB and Its Role in Generative AI

### What is ChromaDB?
ChromaDB is an open-source vector database designed for storing and retrieving high-dimensional data efficiently. It is particularly useful for applications that work with embeddings of text, images, or audio. ChromaDB is optimized for semantic search, recommendation systems, and retrieval-augmented generation (RAG) in AI applications.

### Why Vector Databases?
Traditional databases store structured data like numbers and text, but vector databases store and index numerical representations (embeddings) of unstructured data. These embeddings enable similarity search based on meaning rather than exact matches.
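
To make "similarity based on meaning" concrete, here is a minimal sketch using cosine similarity, a measure commonly used to compare embeddings. The three-dimensional vectors and their labels are made-up toy values, not output from a real embedding model; real embeddings usually have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (illustrative values only).
embeddings = {
    "solar power": [0.9, 0.1, 0.0],
    "wind turbines": [0.8, 0.2, 0.1],
    "chocolate cake": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]  # pretend embedding of an energy-related query

# Rank the stored items from most to least similar to the query.
ranked = sorted(
    embeddings,
    key=lambda name: cosine_similarity(query, embeddings[name]),
    reverse=True,
)
print(ranked)
```

The two energy-related items rank above the unrelated one even though no strings are compared, which is the idea a vector database applies at scale.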

### Key Features of ChromaDB
- **Fast and Scalable:** Handles large-scale vector search efficiently.
- **Built-in Embedding Storage:** Can store embeddings from models like OpenAI, Sentence Transformers, etc.
- **Simple API:** Easy to integrate into Python applications.
- **Automatic Metadata Handling:** Supports additional filtering and retrieval based on structured metadata.


### How is ChromaDB Related to Generative AI?
Generative AI models, such as GPT (Generative Pre-trained Transformer), Stable Diffusion, and DALL·E, generate text, images, or code. However, they sometimes require external knowledge beyond their training data. ChromaDB enhances generative AI in several ways:

#### 1. Retrieval-Augmented Generation (RAG)
Large Language Models (LLMs) can struggle with factual consistency. RAG integrates retrieved knowledge from vector databases like ChromaDB to improve accuracy.
Example:
- A user asks about the latest medical research.
- The system retrieves relevant papers stored as embeddings in ChromaDB.
- The LLM generates a response based on retrieved data.
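
The three steps above can be sketched in plain Python. This is a simplified illustration, not how ChromaDB works internally: word overlap stands in for embedding similarity, the sample documents are invented, `retrieve` and `build_prompt` are hypothetical helper names, and the final LLM call is left out.

```python
import re

def words(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, documents, k=2):
    """Return the k documents sharing the most words with the query.

    Word overlap is a stand-in for vector similarity search.
    """
    query_words = words(query)
    ranked = sorted(documents, key=lambda doc: len(query_words & words(doc)), reverse=True)
    return ranked[:k]

def build_prompt(query, context_docs):
    """Combine the retrieved passages and the question into one prompt."""
    context = "\n\n".join(context_docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

# Invented sample documents for illustration.
documents = [
    "A recent trial tested a malaria vaccine.",
    "Stock markets closed higher on Friday.",
    "New research describes malaria drug resistance.",
]
query = "What is the latest malaria research?"
top_docs = retrieve(query, documents)
prompt = build_prompt(query, top_docs)
print(prompt)
```

In a real pipeline the prompt would then be sent to the LLM, which answers from the retrieved context rather than from its training data alone.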

#### 2. Semantic Search for AI Applications
ChromaDB enables AI models to find similar text/images/audio based on meaning.
Example:
- Searching for "renewable energy solutions" retrieves documents with similar concepts, even if they don’t contain the exact phrase.

#### 3. Personalized AI Assistants
ChromaDB allows AI assistants to remember past interactions and retrieve relevant data.
Example:
- A chatbot storing customer queries as embeddings can retrieve related past questions, improving responses.

#### 4. Content Recommendation Systems
AI models can use ChromaDB for real-time personalized recommendations.
Example:
- A music recommendation AI stores song embeddings and suggests similar songs based on past user preferences.

In [1]:
# First, install the required packages
# !pip install chromadb
# !pip install openai
# !pip install langchain
# !pip install tiktoken
# !pip install langchain-community
# !pip install langchain-openai
# !pip install langchain-chroma

In [2]:
# Check the list of packages installed
# !pip list

In [3]:
!pip show chromadb

Name: chromadb
Version: 0.6.3
Summary: Chroma.
Home-page: https://github.com/chroma-core/chroma
Author: 
Author-email: Jeff Huber <jeff@trychroma.com>, Anton Troynikov <anton@trychroma.com>
License: 
Location: C:\Users\ASUSK5~1\miniconda3\Lib\site-packages
Requires: bcrypt, build, chroma-hnswlib, fastapi, grpcio, httpx, importlib-resources, kubernetes, mmh3, numpy, onnxruntime, opentelemetry-api, opentelemetry-exporter-otlp-proto-grpc, opentelemetry-instrumentation-fastapi, opentelemetry-sdk, orjson, overrides, posthog, pydantic, pypika, PyYAML, rich, tenacity, tokenizers, tqdm, typer, typing_extensions, uvicorn
Required-by: langchain-chroma


In [4]:
# The zip file below is needed to follow along with this project.
# Download and unzip it either through the link or with the commands below.
# !wget -q https://www.dropbox.com/s/vs6ocyvpzzncvwh/new_articles.zip
# !unzip -q new_articles.zip -d new_articles
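
The `wget` and `unzip` shell commands are often unavailable on Windows, so here is a minimal sketch of the same two steps using only the Python standard library. `download_file` and `unzip` are hypothetical helper names, and whether this Dropbox link serves the raw archive to a plain HTTP client is an assumption, so the download call is left commented out.

```python
import urllib.request
import zipfile
from pathlib import Path

def download_file(url, destination):
    """Download url to destination unless the file is already present."""
    destination = Path(destination)
    if not destination.exists():
        urllib.request.urlretrieve(url, destination)
    return destination

def unzip(zip_path, target_dir):
    """Extract every member of zip_path into target_dir and return the member names."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()

# Example usage (commented out so the cell does not hit the network by accident):
# archive = download_file("https://www.dropbox.com/s/vs6ocyvpzzncvwh/new_articles.zip", "new_articles.zip")
# unzip(archive, "new_articles")
```

Whichever method you use, make sure the directory passed to the loader later in the notebook matches the folder the files were extracted into.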

In [5]:
!pip show wget

Name: wget
Version: 3.2
Summary: pure python download utility
Home-page: http://bitbucket.org/techtonik/python-wget/
Author: anatoly techtonik <techtonik@gmail.com>
Author-email: 
License: Public Domain
Location: C:\Users\ASUSK5~1\miniconda3\Lib\site-packages
Requires: 
Required-by: 


In [None]:
import os

# Set the OpenAI API key as an environment variable so that the OpenAI
# clients used below can read it without the key being passed explicitly.
os.environ["OPENAI_API_KEY"] = "ADD YOUR OPENAI API KEY HERE"
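
Hardcoding a key in a notebook makes it easy to leak when the file is shared or committed. As an alternative, here is a minimal sketch that reuses a key already present in the environment and only prompts for it when it is missing. `load_api_key` is a hypothetical helper name, not part of any library.

```python
import os
from getpass import getpass

def load_api_key(env_var="OPENAI_API_KEY", prompt=getpass):
    """Return the key from the environment, prompting for it only if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed value so it does not appear in the notebook output.
        key = prompt(f"Enter value for {env_var}: ")
        os.environ[env_var] = key
    return key
```

Calling `load_api_key()` once near the top of the notebook would replace the hardcoded assignment.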

In [7]:
# Vector store and OpenAI integrations (current package locations)
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
# Document loaders are provided by the langchain-community package
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

#### The first step is to load the data so that we can process it (split, embed, etc.).

In [8]:
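# Load all .txt files from the ./articles directory (adjust the path if you unzipped to new_articles).
# Assumption: the files are UTF-8 encoded; passing the encoding avoids garbled quotes on Windows.
loader = DirectoryLoader("./articles", glob="./*.txt", loader_cls=TextLoader,
                         loader_kwargs={"encoding": "utf-8"}, show_progress=True)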

In [9]:
documents = loader.load()
# documents

100%|████████████████████████████████████████████████████████████████████████████████| 21/21 [00:00<00:00, 1073.04it/s]


In [10]:
# Create a splitter and split data into chunks 
txt_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)

In [11]:
txt_chunks = txt_splitter.split_documents(documents)
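
To show what `chunk_size` and `chunk_overlap` control, here is a simplified sketch of fixed-size chunking with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses: that splitter also prefers to break at separators such as paragraph and line breaks, so its chunks are not fixed windows. `chunk_text` is a hypothetical helper for illustration only.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so text that straddles a boundary still appears intact in at least one chunk.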

In [12]:
txt_chunks[0].page_content

'Signaling that investments in the supply chain sector remain robust, Pando, a startup developing fulfillment management technologies, today announced that it raised $30 million in a Series B round, bringing its total raised to $45 million.\n\nIron Pillar and Uncorrelated Ventures led the round, with participation from existing investors Nexus Venture Partners, Chiratae Ventures and Next47. CEO and founder Nitin Jayakrishnan says that the new capital will be put toward expanding Pando’s global sales, marketing and delivery capabilities.\n\n“We will not expand into new industries or adjacent product areas,” he told TechCrunch in an email interview. “Great talent is the foundation of the business — we will continue to augment our teams at all levels of the organization. Pando is also open to exploring strategic partnerships and acquisitions with this round of funding.”'

In [13]:
len(txt_chunks)

228

#### Now that the data is loaded and prepared in chunks, everything is set to create the database.

**Initialize a Vector Database From Documents/Chunks**

When creating the Chroma vector store, provide the `persist_directory` parameter to ensure data is stored on disk.

In [24]:
# persist_directory
persist_directory = "./chroma_db"

# Create the embedding object
embedding = OpenAIEmbeddings()

if os.path.isdir(persist_directory):
    print("Database already exists!")
else:
    print("Directory does not exist. Creating the database...")
    chroma_vdb = Chroma.from_documents(documents=txt_chunks,
                                  embedding=embedding,
                                  persist_directory=persist_directory)
    print("Database created successfully.")

Database already exists!


`Chroma.from_documents()` creates a new Chroma vector store from the given list of documents (here, `txt_chunks`). It generates embeddings for all the documents using the object passed as `embedding`.
If `persist_directory` is provided, the data is stored on disk for later use. After the code above runs for the first time, a directory named `chroma_db` is created in the notebook's working directory.

**NOTE!** By specifying the `persist_directory`, Chroma will automatically handle data persistence, and there's no need to call a separate `persist()` method as in `chroma_vdb.persist()`.
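
One caveat with the `os.path.isdir` check used above: an empty `chroma_db` folder (for example, one left behind by an interrupted run) also counts as "already exists". Here is a minimal sketch of a slightly stricter check. `has_persisted_data` is a hypothetical helper, and it only tests that the folder is non-empty, not that its contents are a valid Chroma database.

```python
import os

def has_persisted_data(persist_directory):
    """Return True only if the directory exists and contains at least one entry."""
    return os.path.isdir(persist_directory) and len(os.listdir(persist_directory)) > 0

# Example usage in place of the plain isdir check:
# if has_persisted_data("./chroma_db"):
#     print("Database already exists!")
```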

**Initialize a Persisted Vector Store**

To load the persisted vector store in future sessions, you can initialize the Chroma instance with the same persist_directory as implemented below:

In [15]:
# load the persisted vector store
vector_store = Chroma(embedding_function=embedding,
                      persist_directory=persist_directory)  # Where the data is stored locally; remove if not needed

`Chroma()` loads an existing Chroma database from `persist_directory` (if it exists). No new embeddings are created at this step; it simply opens the previously stored vectors. If the directory is empty, the store is initialized without any data.

In summary, the two ways of obtaining a vector store compare as follows:

| Method                      | Creates New Embeddings? | Requires Existing Chroma DB? | Use Case                                      |
|-----------------------------|------------------------|-----------------------------|-----------------------------------------------|
| `Chroma.from_documents()`   | ✅ Yes                 | ❌ No                        | Creating a new vector store from documents   |
| `Chroma()`                  | ❌ No                  | ✅ Yes (or starts empty)     | Loading an existing vector store from disk   |


In [16]:
type(vector_store)

langchain_chroma.vectorstores.Chroma

#### What next... ?

Great! Now that you've stored your text chunks in a Chroma vector store, the next step is to retrieve relevant chunks based on a user query. This is where a retriever comes in. `vector_store.as_retriever()` will return a retriever that can be used to find documents based on similarity. 

In [17]:
retriever = vector_store.as_retriever()

You can also pass the argument `search_kwargs={"k": <NUMBER>}` to `vector_store.as_retriever()` to retrieve the top `<NUMBER>` results (the default is 4).

Once the retriever is created, you can query it like this:

In [18]:
query = "What is StarCoder?"
retrieved_docs = retriever.invoke(query)
retrieved_docs

[Document(id='21bd54d0-38a6-401b-8cdc-6a9ba5c96482', metadata={'source': 'articles\\05-04-hugging-face-and-servicenow-release-a-free-code-generating-model.txt'}, page_content='“One thing we learned from releases such as Stable Diffusion last year is the creativity and capability of the open-source community,” von Werra told TechCrunch in an email interview. “Within weeks of the release the community had built dozens of variants of the model as well as custom applications. Releasing a powerful code generation model allows anybody to fine-tune and adapt it to their own use-cases and will enable countless downstream applications.”\n\nBuilding a model\n\nStarCoder is a part of Hugging Face’s and ServiceNow’s over-600-person BigCode project, launched late last year, which aims to develop “state-of-the-art” AI systems for code in an “open and responsible” way. Hugging Face supplied an in-house compute cluster of 512 Nvidia V100 GPUs to train the StarCo

In [19]:
print(retrieved_docs[0].page_content)

“One thing we learned from releases such as Stable Diffusion last year is the creativity and capability of the open-source community,” von Werra told TechCrunch in an email interview. “Within weeks of the release the community had built dozens of variants of the model as well as custom applications. Releasing a powerful code generation model allows anybody to fine-tune and adapt it to their own use-cases and will enable countless downstream applications.”

Building a model

StarCoder is a part of Hugging Face’s and ServiceNow’s over-600-person BigCode project, launched late last year, which aims to develop “state-of-the-art” AI systems for code in an “open and responsible” way. Hugging Face supplied an in-house compute cluster of 512 Nvidia V100 GPUs to train the StarCoder model.


**You have a retriever now!** Therefore, you can pass the retrieved documents to a language model (e.g., OpenAI GPT) to generate **refined** answers using the retrieved information based on similarity search.

In [20]:
llm = ChatOpenAI(model_name="gpt-3.5-turbo")

qa_chain = RetrievalQA.from_chain_type(llm=llm,
                                       chain_type="stuff",
                                       retriever=retriever,
                                       return_source_documents=True)

response = qa_chain.invoke(query)

In [21]:
response

{'query': 'What is StarCoder?',
 'result': "StarCoder is a free alternative to code-generating AI systems like GitHub's Copilot. It is a part of the BigCode project and was developed by Hugging Face and ServiceNow Research. StarCoder is a 15-billion-parameter model trained on an open source dataset called The Stack, which contains over 19 million curated, permissively licensed repositories and more than six terabytes of code in over 350 programming languages. StarCoder allows fine-tuning and adaptation by the community and aims to develop state-of-the-art AI systems for code in an open and responsible way.",
 'source_documents': [Document(id='21bd54d0-38a6-401b-8cdc-6a9ba5c96482', metadata={'source': 'articles\\05-04-hugging-face-and-servicenow-release-a-free-code-generating-model.txt'}, page_content='“One thing we learned from releases such as Stable Diffusion last year is the creativity and capability of the open-source community,” von Werra told TechCrunch in an email inter

To add a bit of flavour to the work, let's define a function which enables us to extract the source documents from the retrieved response:

In [22]:
def extract_source(response):
    print("Resources:")
    for source in response["source_documents"]:
        print(source.metadata["source"])

In [23]:
extract_source(response)

Resources:
articles\05-04-hugging-face-and-servicenow-release-a-free-code-generating-model.txt
articles\05-04-hugging-face-and-servicenow-release-a-free-code-generating-model.txt
articles\05-04-hugging-face-and-servicenow-release-a-free-code-generating-model.txt
articles\05-04-hugging-face-and-servicenow-release-a-free-code-generating-model.txt
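
The same file is listed four times because each retrieved chunk came from that article. If you would rather see each source once, here is a minimal sketch of a deduplicating variant. `unique_sources` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that only mimic the `metadata` attribute of retrieved documents.

```python
from types import SimpleNamespace

def unique_sources(response):
    """Return each source path once, in the order it first appears."""
    sources = (doc.metadata["source"] for doc in response["source_documents"])
    # dict.fromkeys keeps the first occurrence of each key and preserves order.
    return list(dict.fromkeys(sources))

# Stand-in response for illustration (invented paths).
fake_response = {
    "source_documents": [
        SimpleNamespace(metadata={"source": "articles/a.txt"}),
        SimpleNamespace(metadata={"source": "articles/a.txt"}),
        SimpleNamespace(metadata={"source": "articles/b.txt"}),
    ]
}
print(unique_sources(fake_response))
# ['articles/a.txt', 'articles/b.txt']
```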


### Last Words
Below are links to documentation where you can learn more about the tools used here:

- [ChromaDB](https://docs.trychroma.com/docs/overview/introduction)
- [LangChain](https://python.langchain.com/docs/introduction/)
- [ChatOpenAI](https://python.langchain.com/api_reference/openai/chat_models/langchain_openai.chat_models.base.ChatOpenAI.html)
- [Retrieval QA](https://python.langchain.com/api_reference/langchain/chains/langchain.chains.retrieval_qa.base.RetrievalQA.html)

Good Luck!