# Exercise: Using Chroma as a Vector Database

In this exercise, you will store and retrieve text embeddings using Chroma, an open-source vector database.
We will split a text into chunks, embed the chunks, and run a similarity search through LangChain's integration with Chroma.


In [18]:
# Install necessary packages
!pip install langchain chromadb tiktoken openai langchain-community




## 1. Import Required Libraries
Import the necessary classes from LangChain, Chroma, and OpenAI.

In [19]:
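# The older langchain.* import paths are deprecated in recent LangChain releases;
# these packages are pulled in by the pip install above.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OpenAIEmbeddings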
import os

In [5]:
# Set up OpenAI API Key
os.environ["OPENAI_API_KEY"] = "<api key>"

## 2. Text Splitting
Embedding a long text as a single vector makes retrieval imprecise, so we use a text splitter to break the text into smaller chunks that are embedded and searched individually.

In [20]:
# Sample text to store
text = """
Vector databases like Chroma enable efficient semantic search by storing and managing high-dimensional vector embeddings. Unlike traditional keyword-based search, which relies on exact matches, semantic search retrieves data based on conceptual similarity, making it highly valuable in various AI and machine learning applications.

What Makes Vector Databases Powerful?

Vector databases store numerical representations (embeddings) of text, images, or other types of data. These embeddings capture the meaning and context of the data, allowing for fuzzy matching and contextual retrieval instead of relying solely on exact terms. This enables a more intuitive and relevant search experience across different domains.

For example, in natural language processing (NLP), a traditional search for “AI development” might return only documents containing that exact phrase. However, a vector search could retrieve documents discussing machine learning, neural networks, and deep learning, since these concepts are semantically related.

Use Cases of Chroma in AI-Powered Applications
	1.	Retrieval-Augmented Generation (RAG)
	•	Chroma plays a crucial role in enhancing Large Language Models (LLMs) by allowing them to fetch relevant external knowledge before generating responses.
	•	Instead of relying solely on pre-trained knowledge, models can retrieve real-time or domain-specific data to produce more accurate and up-to-date answers.
	•	Example: A legal AI assistant could use Chroma to search legal documents semantically rather than relying on strict keyword matches.
	2.	Recommendation Systems
	•	Chroma can improve recommendation algorithms by comparing user preferences with product embeddings.
	•	Instead of exact-item matching, vector search recommends conceptually similar products.
	•	Example: A movie streaming service can suggest films with similar themes, moods, or actors rather than only matching by genre.
	3.	Anomaly Detection
	•	Chroma helps detect patterns that deviate from normal behavior in industries like cybersecurity, fraud prevention, and predictive maintenance.
	•	Example: A financial institution can use Chroma to analyze transaction embeddings and identify fraudulent activities that don’t follow typical spending patterns.
	4.	Medical and Scientific Research
	•	In healthcare, vector databases assist in clinical decision support, where patient records are searched semantically.
	•	Example: A doctor searching for “rare autoimmune disorder” might retrieve cases related to similar symptoms and conditions even if the exact disease name is not mentioned.
	5.	Image and Video Similarity Search
	•	Vector embeddings allow searching for images based on content rather than file names or tags.
	•	Example: An e-commerce platform can let users upload a product image to find visually similar products without relying on manual tagging.
	6.	Chatbots and Conversational AI
	•	By storing and retrieving conversation history and external knowledge, Chroma enhances memory retention in AI-powered chat systems.
	•	Example: A customer service bot can retrieve previous interactions and related FAQs to provide personalized responses.
	7.	Real-Time Data Indexing
	•	Many traditional databases require periodic indexing, whereas Chroma supports real-time updates.
	•	This ensures that the latest information is immediately available for retrieval.
	•	Example: News aggregation platforms use vector search to provide contextually relevant articles as soon as they are published.

Why Chroma Over Other Vector Databases?

While there are multiple vector databases, Chroma stands out due to its ease of integration, efficiency, and support for real-time data retrieval. It is designed with a developer-friendly approach, making it ideal for scalable AI applications.
	•	Pinecone is a great alternative for fully managed vector search but may lack customization flexibility.
	•	FAISS (Facebook AI Similarity Search) is highly optimized for similarity search but requires more engineering effort to deploy.
	•	Milvus provides a strong open-source alternative for high-scale similarity search.
	•	Weaviate and Qdrant are other notable options with strong graph-based reasoning and search optimizations.

Conclusion: The Future of AI-Driven Search with Chroma

Chroma’s vector-based approach to search, recommendation, and retrieval is revolutionizing how businesses use AI. It enables more intuitive interactions, reduces manual tagging requirements, and improves AI decision-making capabilities across industries.

As AI and machine learning evolve, vector databases will become essential for building intelligent, context-aware applications, bridging the gap between static knowledge and real-time adaptive AI systems.
"""
# Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)
text_chunks = text_splitter.split_text(text)

print(f"Text split into {len(text_chunks)} chunks:")
for chunk in text_chunks:
    print(chunk)

Text split into 125 chunks:
Vector databases like Chroma enable efficient
efficient semantic search by storing and managing
managing high-dimensional vector embeddings.
Unlike traditional keyword-based search, which
which relies on exact matches, semantic search
search retrieves data based on conceptual
similarity, making it highly valuable in various
various AI and machine learning applications.
What Makes Vector Databases Powerful?
Vector databases store numerical representations
(embeddings) of text, images, or other types of
types of data. These embeddings capture the
the meaning and context of the data, allowing for
for fuzzy matching and contextual retrieval
retrieval instead of relying solely on exact
on exact terms. This enables a more intuitive and
and relevant search experience across different
different domains.
For example, in natural language processing
(NLP), a traditional search for “AI development”
might return only documents containing that exact
exact phrase. However,
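
The output above shows neighboring chunks sharing words such as "efficient" and "managing"; that is the effect of `chunk_overlap`. As a rough illustration only, the sketch below cuts a string into fixed-size character windows that overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph breaks, newlines, and spaces), and the function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows sharing `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "Vector databases like Chroma enable efficient semantic search."
chunks = sliding_window_chunks(sample, chunk_size=20, chunk_overlap=5)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk repeats the last five characters of the previous one, so a phrase that straddles a boundary still appears intact in at least one chunk as long as it is no longer than the overlap.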

## 3. Storing and Retrieving Embeddings with Chroma
We will now store the embeddings in Chroma and perform a similarity search.

In [23]:
# Initialize OpenAI Embeddings
embeddings = OpenAIEmbeddings()

# Store text chunks in Chroma.
# We don't persist the vector store in this exercise; in production you would
# (for example, by passing a persist_directory).
vector_store = Chroma.from_texts(texts=text_chunks, embedding=embeddings)

# Perform a similarity search
query = "chroma database"
results = vector_store.similarity_search(query, k=2)

print("Top matching text chunks:")
for res in results:
    print(res.page_content)

Top matching text chunks:
While there are multiple vector databases, Chroma
with Chroma
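
Conceptually, `similarity_search` embeds the query and returns the `k` stored chunks whose vectors are closest to it. The sketch below illustrates that idea with cosine similarity over hand-made three-dimensional vectors. The vectors, the `toy_store` list, and the helper names are invented for illustration; real embeddings have far more dimensions, and Chroma relies on an index rather than the brute-force scan shown here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the texts of the k stored vectors most similar to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up "embeddings": the first axis loosely stands for "about Chroma".
toy_store = [
    ("Chroma is a vector database", [0.9, 0.1, 0.0]),
    ("Movies with similar themes", [0.1, 0.9, 0.1]),
    ("Vector search with Chroma", [0.8, 0.2, 0.1]),
]
query_vector = [1.0, 0.0, 0.0]
print(top_k(query_vector, toy_store, k=2))
```

With very small chunks, such as the 50-character ones used above, a top match can be a fragment like "with Chroma": it is close to the query but carries little context, which is one reason to experiment with larger chunk sizes.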


## 4. Assignment: Extend the Vector Search
Modify the script to:
1. Load a larger text document and split it into chunks.
2. Store the embeddings in Chroma.
3. Perform similarity searches with different queries.
4. Try adjusting the chunk size and observe the impact on retrieval.

Research the text splitting methods described in the LangChain documentation. Use the FDA text as an example knowledge source. Your goal is to retrieve, for every question, the passages in the text that are relevant to it.
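
As a possible starting point for step 4, the sketch below tabulates how the number and average length of chunks change with chunk size. The helper names `summarize_chunking` and `naive_chunks` are hypothetical, and `naive_chunks` is only a stand-in splitter (fixed-size windows, no overlap) so the example runs without any external packages.

```python
from statistics import mean
from typing import Callable


def summarize_chunking(
    text: str,
    chunk_sizes: list[int],
    make_chunks: Callable[[str, int], list[str]],
) -> list[dict]:
    """For each chunk size, record how many chunks result and their average length."""
    rows = []
    for size in chunk_sizes:
        chunks = make_chunks(text, size)
        avg_len = round(mean(len(c) for c in chunks), 1) if chunks else 0.0
        rows.append({"chunk_size": size, "num_chunks": len(chunks), "avg_len": avg_len})
    return rows


def naive_chunks(text: str, size: int) -> list[str]:
    """Stand-in splitter: fixed-size character windows without overlap."""
    return [text[i:i + size] for i in range(0, len(text), size)]


demo_text = "Chroma stores embeddings. " * 20
for row in summarize_chunking(demo_text, [50, 100, 200], naive_chunks):
    print(row)
```

In the notebook, `make_chunks` could instead wrap `RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=...).split_text(text)`, after which each set of chunks can be stored in Chroma and queried with the same questions to compare the retrieved passages.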
