## Introduction to ChromaDB and Its Role in Generative AI

### What is ChromaDB?
ChromaDB is an open-source vector database designed for storing and retrieving high-dimensional data efficiently. It is particularly useful for applications that work with embeddings of text, images, or audio. ChromaDB is optimized for semantic search, recommendation systems, and retrieval-augmented generation (RAG) in AI applications.

### Why Vector Databases?
Traditional databases store structured data like numbers and text, but vector databases store and index numerical representations (embeddings) of unstructured data. These embeddings enable similarity search based on meaning rather than exact matches.
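
To make "similarity based on meaning" concrete, here is a minimal sketch of ranking texts by cosine similarity. The three-dimensional vectors and the query vector are hypothetical stand-ins for real model embeddings (which usually have hundreds of dimensions), and the helper names are illustrative rather than part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings; a real embedding model would produce these.
toy_embeddings = {
    "solar panels for homes": [0.9, 0.1, 0.0],
    "wind turbine efficiency": [0.8, 0.3, 0.1],
    "chocolate cake recipe": [0.0, 0.2, 0.9],
}

def rank_by_similarity(query_vector, embeddings):
    """Return (text, score) pairs sorted from most to least similar."""
    scored = [(text, cosine_similarity(query_vector, vec))
              for text, vec in embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Stand-in for the embedding of a query such as "renewable energy".
query = [0.85, 0.2, 0.05]
ranking = rank_by_similarity(query, toy_embeddings)
for text, score in ranking:
    print(f"{score:.3f}  {text}")
```

A vector database performs the same kind of ranking, but over an index built so that it does not need to compare the query against every stored vector.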

### Key Features of ChromaDB
- **Fast and Scalable:** Uses an approximate nearest-neighbor index (HNSW, via the `chroma-hnswlib` dependency) so that vector search stays fast as collections grow.
- **Built-in Embedding Storage:** Can store embeddings from models like OpenAI, Sentence Transformers, etc.
- **Simple API:** Easy to integrate into Python applications.
- **Automatic Metadata Handling:** Supports additional filtering and retrieval based on structured metadata.
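
The metadata feature can be pictured as "filter first, then rank". The sketch below is a plain-Python illustration of that idea and does not use ChromaDB's API; `Record`, `matches`, and `filtered_search` are hypothetical names, and the vectors and metadata values are made up for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    doc_id: str
    vector: list
    metadata: dict = field(default_factory=dict)

def matches(metadata, where):
    """True when every key/value pair in `where` is present in `metadata`."""
    return all(metadata.get(key) == value for key, value in where.items())

def filtered_search(records, query_vector, where, top_k=2):
    """Keep records whose metadata matches, then rank them by dot product."""
    candidates = [r for r in records if matches(r.metadata, where)]

    def score(record):
        return sum(q * v for q, v in zip(query_vector, record.vector))

    return sorted(candidates, key=score, reverse=True)[:top_k]

# Hypothetical records: toy two-dimensional vectors with structured metadata.
records = [
    Record("a1", [0.9, 0.1], {"source": "techcrunch", "year": 2023}),
    Record("a2", [0.8, 0.2], {"source": "blog", "year": 2023}),
    Record("a3", [0.1, 0.9], {"source": "techcrunch", "year": 2022}),
]
hits = filtered_search(records, [1.0, 0.0], where={"source": "techcrunch"})
print([r.doc_id for r in hits])
```

Filtering before ranking keeps irrelevant sources out of the results even when their vectors happen to be close to the query.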


### How is ChromaDB Related to Generative AI?
Generative AI models, such as GPT (Generative Pre-trained Transformer), Stable Diffusion, and DALL·E, generate text, images, or code. However, they sometimes require external knowledge beyond their training data. ChromaDB enhances generative AI in several ways:

#### 1. Retrieval-Augmented Generation (RAG)
Large Language Models (LLMs) can struggle with factual consistency. RAG retrieves relevant knowledge from a vector database such as ChromaDB and passes it to the model to improve accuracy.

Example:
- A user asks about the latest medical research.
- The system retrieves relevant papers stored as embeddings in ChromaDB.
- The LLM generates a response based on retrieved data.
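
The last step of the flow above, handing retrieved text to the LLM, can be sketched as simple prompt assembly. This is one possible shape under stated assumptions: `build_rag_prompt`, the `max_chars` budget, and the placeholder chunks are all hypothetical, and the retrieval itself is assumed to have already happened.

```python
def build_rag_prompt(question, retrieved_chunks, max_chars=1500):
    """Assemble a prompt asking the model to answer from the given context only."""
    context_parts = []
    used = 0
    for index, chunk in enumerate(retrieved_chunks, start=1):
        entry = f"[{index}] {chunk.strip()}"
        if used + len(entry) > max_chars:
            break  # stay inside the (hypothetical) context budget
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Placeholder text standing in for chunks returned by a vector search.
chunks = [
    "Placeholder summary of the first retrieved paper.",
    "Placeholder summary of the second retrieved paper.",
]
prompt = build_rag_prompt("What does recent research say?", chunks)
print(prompt)
```

Numbering the chunks makes it possible to ask the model to cite which passage supports each claim.
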

#### 2. Semantic Search for AI Applications
ChromaDB enables AI models to find similar text/images/audio based on meaning.

Example:
- Searching for "renewable energy solutions" retrieves documents with similar concepts, even if they don’t contain the exact phrase.

#### 3. Personalized AI Assistants
ChromaDB allows AI assistants to remember past interactions and retrieve relevant data.

Example:
- A chatbot storing customer queries as embeddings can retrieve related past questions, improving responses.

#### 4. Content Recommendation Systems
AI models can use ChromaDB for real-time personalized recommendations.

Example:
- A music recommendation AI stores song embeddings and suggests similar songs based on past user preferences.
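
One simple way to turn past preferences into recommendations is to average the embeddings of liked items into a "taste" vector and rank the remaining items against it. The sketch below assumes this averaging approach (it is not something ChromaDB prescribes); the song titles, vectors, and function names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]

def recommend(song_vectors, liked_titles, top_k=2):
    """Rank songs the user has not liked yet against their average taste vector."""
    profile = mean_vector([song_vectors[title] for title in liked_titles])

    def score(title):
        return sum(p * v for p, v in zip(profile, song_vectors[title]))

    candidates = [title for title in song_vectors if title not in liked_titles]
    return sorted(candidates, key=score, reverse=True)[:top_k]

# Hypothetical two-dimensional song embeddings.
songs = {
    "acoustic ballad": [0.9, 0.1],
    "folk duet": [0.8, 0.2],
    "lo-fi piano": [0.7, 0.3],
    "drum and bass": [0.1, 0.9],
}
picks = recommend(songs, liked_titles={"acoustic ballad", "folk duet"}, top_k=2)
print(picks)
```

In a real system the ranking step would be a nearest-neighbor query against the vector database instead of a loop over every song.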

In [41]:
# First, install the required packages
# !pip install chromadb
# !pip install openai
# !pip install langchain
# !pip install tiktoken
# !pip install langchain-community
# !pip install langchain-openai
!pip install langchain-chroma

Collecting langchain-chroma
  Downloading langchain_chroma-0.2.2-py3-none-any.whl.metadata (1.3 kB)
Collecting numpy<2.0.0,>=1.26.2 (from langchain-chroma)
  Using cached numpy-1.26.4-cp312-cp312-win_amd64.whl.metadata (61 kB)
Downloading langchain_chroma-0.2.2-py3-none-any.whl (11 kB)
Using cached numpy-1.26.4-cp312-cp312-win_amd64.whl (15.5 MB)
Installing collected packages: numpy, langchain-chroma
  Attempting uninstall: numpy
    Found existing installation: numpy 2.2.3
    Uninstalling numpy-2.2.3:
      Successfully uninstalled numpy-2.2.3
Successfully installed langchain-chroma-0.2.2 numpy-1.26.4




In [2]:
# Check the list of packages installed
# !pip list

In [3]:
!pip show chromadb

Name: chromadb
Version: 0.6.3
Summary: Chroma.
Home-page: https://github.com/chroma-core/chroma
Author: 
Author-email: Jeff Huber <jeff@trychroma.com>, Anton Troynikov <anton@trychroma.com>
License: 
Location: C:\Users\ASUSK5~1\miniconda3\Lib\site-packages
Requires: bcrypt, build, chroma-hnswlib, fastapi, grpcio, httpx, importlib-resources, kubernetes, mmh3, numpy, onnxruntime, opentelemetry-api, opentelemetry-exporter-otlp-proto-grpc, opentelemetry-instrumentation-fastapi, opentelemetry-sdk, orjson, overrides, posthog, pydantic, pypika, PyYAML, rich, tenacity, tokenizers, tqdm, typer, typing_extensions, uvicorn
Required-by: 


In [4]:
# To follow along with this project, you need the zip file below.
# You can download and unzip it either through the link or with the commands below
# !wget -q https://www.dropbox.com/s/vs6ocyvpzzncvwh/new_articles.zip
# !unzip -q new_articles.zip -d new_articles

In [3]:
!pip show wget

Name: wget
Version: 3.2
Summary: pure python download utility
Home-page: http://bitbucket.org/techtonik/python-wget/
Author: anatoly techtonik <techtonik@gmail.com>
Author-email: 
License: Public Domain
Location: C:\Users\ASUSK5~1\miniconda3\Lib\site-packages
Requires: 
Required-by: 


In [None]:
import os

# Set the OpenAI API key
# The assignment below stores the key in an environment variable,
# so it does not have to be passed explicitly each time an OpenAI client is created.
os.environ["OPENAI_API_KEY"] = "ADD YOUR OPENAI API KEY"
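
Hard-coding a key in a notebook makes it easy to commit by accident. A minimal sketch of an alternative, assuming the key has been exported in the shell before the notebook starts: read it from the environment and fail early if it is missing or still the placeholder. `resolve_api_key` is a hypothetical helper, and the example key is a dummy value.

```python
import os

def resolve_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the key stored in `env`, rejecting missing or placeholder values."""
    value = env.get(name, "").strip()
    if not value or value.upper().startswith("ADD YOUR"):
        raise RuntimeError(f"{name} is not set; export it before starting the notebook.")
    return value

# A stand-in mapping is used here so the example does not touch the real environment.
fake_env = {"OPENAI_API_KEY": "sk-example-not-a-real-key"}
key = resolve_api_key(fake_env)
print(key[:3] + "...")
```

Called without arguments, the helper reads from `os.environ`, which is where the cell above stores the key.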

In [44]:
# Deprecated import paths, kept for reference:
# from langchain.vectorstores import Chroma
# from langchain.embeddings import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_openai import OpenAI
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

#### The first step is to load the data so that we can process it (split, embed, etc.).

In [20]:
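# Load data from all the text files in the new_articles directory.
# The files are read as UTF-8 explicitly so that the platform default codec
# (for example on Windows) does not garble curly quotes and dashes.
loader = DirectoryLoader("./new_articles", glob="./*.txt", loader_cls=TextLoader,
                         loader_kwargs={"encoding": "utf-8"}, show_progress=True)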

In [37]:
documents = loader.load()
# documents

100%|████████████████████████████████████████████████████████████████████████████████| 21/21 [00:00<00:00, 3501.51it/s]
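
Conceptually, the loader walks a directory, picks the files matching a glob pattern, and reads each one as text. The standard-library sketch below mirrors that behavior without LangChain; `load_text_files` is a hypothetical helper, and the temporary files exist only for the demonstration. It also shows why the encoding matters: UTF-8 bytes decoded with a different codec turn typographic quotes into sequences such as `â€™`.

```python
import tempfile
from pathlib import Path

def load_text_files(directory, pattern="*.txt", encoding="utf-8"):
    """Return {filename: text} for every matching file, decoded explicitly."""
    return {
        path.name: path.read_text(encoding=encoding)
        for path in sorted(Path(directory).glob(pattern))
    }

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("Pando\u2019s round", encoding="utf-8")
    Path(tmp, "notes.md").write_text("ignored by the *.txt pattern", encoding="utf-8")
    loaded = load_text_files(tmp)

# Decoding the same UTF-8 bytes with the wrong codec produces mojibake.
garbled = "Pando\u2019s".encode("utf-8").decode("cp1252")
print(loaded)
print(garbled)
```

Passing the encoding explicitly avoids depending on the operating system's default, which differs between machines.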


In [27]:
# Create a splitter and split data into chunks 
txt_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)

In [55]:
txt_chunks = txt_splitter.split_documents(documents)
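
To show what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraph and line breaks); `chunk_text` is a hypothetical function that only illustrates the two parameters.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=20):
    """Split `text` into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
```

The overlap means the end of one chunk is repeated at the start of the next, which reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.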

In [56]:
txt_chunks[0].page_content

'Signaling that investments in the supply chain sector remain robust, Pando, a startup developing fulfillment management technologies, today announced that it raised $30 million in a Series B round, bringing its total raised to $45 million.\n\nIron Pillar and Uncorrelated Ventures led the round, with participation from existing investors Nexus Venture Partners, Chiratae Ventures and Next47. CEO and founder Nitin Jayakrishnan says that the new capital will be put toward expanding Pando’s global sales, marketing and delivery capabilities.\n\n“We will not expand into new industries or adjacent product areas,” he told TechCrunch in an email interview. “Great talent is the foundation of the business — we will continue to augment our teams at all levels of the organization. Pando is also open to exploring strategic partnerships and acquisitions with this round of funding.”'

In [57]:
len(txt_chunks)

228

#### Now that the data is loaded and prepared in chunks, everything is set to create the database.

In [59]:
# persist_directory = "db"
persist_directory = "./chroma_db"

# Create the embedding object
embedding = OpenAIEmbeddings()

if os.path.isdir(persist_directory):
    print("Database already exists!")
else:
    print("Directory does not exist. Creating the database...")
    chroma_vdb = Chroma.from_documents(documents=txt_chunks,
                                       embedding=embedding,
                                       persist_directory=persist_directory)
    print("Database created successfully.")

Database already exists!


**NOTE!** By specifying the `persist_directory`, Chroma will automatically handle data persistence, and there's no need to call a separate `persist()` method as in `chroma_vdb.persist()`.
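
One caveat with the existence check used when creating the database: `os.path.isdir` also returns `True` for an empty directory, for example one left behind by an interrupted run. A slightly stricter check, sketched under the assumption that a populated directory means a usable database, could look like this. `database_needs_build` is a hypothetical helper and the file written below is only a stand-in.

```python
import tempfile
from pathlib import Path

def database_needs_build(persist_directory):
    """True when the directory is missing or contains no entries at all."""
    path = Path(persist_directory)
    return not path.is_dir() or not any(path.iterdir())

with tempfile.TemporaryDirectory() as tmp:
    missing = database_needs_build(Path(tmp, "chroma_db"))

    empty_dir = Path(tmp, "empty_db")
    empty_dir.mkdir()
    empty = database_needs_build(empty_dir)

    # Stand-in file; the real contents of a persisted database are not modeled here.
    (empty_dir / "placeholder.bin").write_text("data")
    populated = database_needs_build(empty_dir)

print(missing, empty, populated)
```

This still cannot tell a complete database from a partially written one, so it is a convenience check and not a guarantee.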

In [61]:
# Create vector store
vector_store = Chroma(
    embedding_function=embedding,
    persist_directory=persist_directory,  # Where to save data locally, remove if not necessary
)