# Vector Stores in LangChain

Vector stores are databases that store documents alongside their vector embeddings and retrieve them by vector similarity. This enables **semantic search**: finding documents by meaning rather than by exact keyword match.
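
To make the idea concrete, here is a minimal sketch of ranking documents by embedding similarity. The three-dimensional vectors and the `cosine_similarity` and `rank` helpers are invented for illustration. A real embedding model produces far larger vectors, and a vector store would normally use an index rather than the linear scan shown here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical, hand-made "embeddings" (not produced by any real model)
toy_index = {
    "RAG combines retrieval with generation": [0.9, 0.1, 0.0],
    "Vector databases store embeddings": [0.2, 0.9, 0.1],
    "Agents can use tools": [0.0, 0.2, 0.9],
}

def rank(query_vector, index, k=2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]

# Pretend this vector is the embedding of the question "What is RAG?"
query_vector = [0.8, 0.3, 0.0]
top = rank(query_vector, toy_index, k=2)
for score, text in top:
    print(f"{score:.3f}  {text}")
```

The query shares no keyword matching logic with the documents; the ranking comes only from how closely the vectors point in the same direction.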

## Chroma Vector Store

Chroma is a popular open-source vector database that's:
- ✅ Easy to use and set up
- ✅ Runs locally (Chroma itself needs no API key; the Google embedding model used below does)
- ✅ Supports persistence to disk
- ✅ Great for development and prototyping

In [None]:
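# Install required packages (run once)
# langchain-community and pymupdf are needed for the PDF loader in section 7
# !pip install chromadb langchain-chroma langchain-google-genai langchain-community langchain-text-splitters pymupdf python-dotenv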

In [None]:
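from langchain_chroma import Chroma  # provided by the langchain-chroma package installed above
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_core.documents import Document
from dotenv import load_dotenv

# Load environment variables from a .env file; the Google embeddings client expects GOOGLE_API_KEY
load_dotenv()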

# Initialize embeddings (using Google's embedding model)
embeddings = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-001")

## 1. Create Vector Store from Documents

In [None]:
# Sample documents about AI topics
documents = [
    Document(page_content="LangChain is a framework for building LLM applications", metadata={"topic": "langchain", "type": "definition"}),
    Document(page_content="RAG stands for Retrieval-Augmented Generation. It combines retrieval with generation.", metadata={"topic": "rag", "type": "definition"}),
    Document(page_content="Vector databases store embeddings for semantic search", metadata={"topic": "vectordb", "type": "definition"}),
    Document(page_content="Agents can use tools to interact with external systems", metadata={"topic": "agents", "type": "definition"}),
    Document(page_content="Prompt engineering is the art of crafting effective prompts for LLMs", metadata={"topic": "prompts", "type": "definition"}),
    Document(page_content="Chroma is an open-source vector database that runs locally", metadata={"topic": "vectordb", "type": "tool"}),
    Document(page_content="Embeddings convert text into numerical vectors that capture semantic meaning", metadata={"topic": "embeddings", "type": "definition"}),
]

print(f"Created {len(documents)} documents")

In [None]:
# Create Chroma vector store (in-memory)
vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=embeddings,
    collection_name="ai_knowledge"
)

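# Note: _collection is a private attribute of the LangChain wrapper; it is used here only for a quick count
print(f"Vector store created with {vectorstore._collection.count()} documents")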

## 2. Similarity Search

Find documents most similar to a query based on semantic meaning.

In [None]:
# Basic similarity search
query = "What is RAG?"
results = vectorstore.similarity_search(query, k=3)

print(f"Query: {query}\n")
for i, doc in enumerate(results, 1):
    print(f"{i}. {doc.page_content}")
    print(f"   Metadata: {doc.metadata}\n")

In [None]:
# Similarity search with scores (Chroma returns a distance, so a lower score means more similar)
results_with_scores = vectorstore.similarity_search_with_score(query, k=3)

print(f"Query: {query}\n")
for doc, score in results_with_scores:
    print(f"Score: {score:.4f} | {doc.page_content}")
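
Why a lower score means a closer match can be seen in a small sketch of two common distance functions. The helpers and the two-dimensional vectors are illustrative only, and which metric a given collection actually uses depends on how it is configured.

```python
import math

def squared_l2_distance(a, b):
    """Sum of squared coordinate differences (0.0 = identical vectors)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 minus cosine similarity (0.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical toy vectors: one close to the query, one far from it
toy_query = [1.0, 0.0]
toy_near = [0.9, 0.1]
toy_far = [0.0, 1.0]

for name, vec in [("near", toy_near), ("far", toy_far)]:
    print(
        f"{name}: squared L2 = {squared_l2_distance(toy_query, vec):.3f}, "
        f"cosine distance = {cosine_distance(toy_query, vec):.3f}"
    )
```

Under both metrics the nearby vector gets the smaller number, which is why the results above are sorted in ascending score order.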

## 3. Filtering with Metadata

Filter search results based on document metadata.

In [None]:
# Search with metadata filter
results = vectorstore.similarity_search(
    "database",
    k=3,
    filter={"topic": "vectordb"}  # Only search in vectordb topic
)

print("Results filtered by topic='vectordb':\n")
for doc in results:
    print(f"- {doc.page_content}")
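
The observable effect of a metadata filter can be sketched in plain Python: only documents whose metadata matches are eligible for ranking. The `matches` and `filter_docs` helpers and the toy documents are hypothetical. Chroma has its own operator syntax for combining conditions, which this sketch does not model, and how the store applies the filter internally is not shown here.

```python
def matches(metadata, where):
    """True if every key in `where` is present in `metadata` with an equal value."""
    return all(metadata.get(key) == value for key, value in where.items())

def filter_docs(docs, where):
    """Keep only the documents whose metadata satisfies the filter."""
    return [doc for doc in docs if matches(doc["metadata"], where)]

# Hypothetical toy documents mirroring the metadata shape used above
toy_docs = [
    {"text": "Vector databases store embeddings", "metadata": {"topic": "vectordb", "type": "definition"}},
    {"text": "Chroma runs locally", "metadata": {"topic": "vectordb", "type": "tool"}},
    {"text": "Agents can use tools", "metadata": {"topic": "agents", "type": "definition"}},
]

vectordb_docs = filter_docs(toy_docs, {"topic": "vectordb"})
for doc in vectordb_docs:
    print(f"- {doc['text']}")
```

Documents outside the `vectordb` topic never reach the ranking step, no matter how similar their text is to the query.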

## 4. Persistent Vector Store

Save the vector store to disk so it persists between sessions.

In [None]:
# Create a persistent vector store (saves to disk)
# Note: no IDs are passed, so re-running this cell adds the same documents again under new IDs
persist_directory = "../data/chroma_db"

persistent_vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=embeddings,
    collection_name="ai_knowledge_persistent",
    persist_directory=persist_directory
)

print(f"Vector store saved to: {persist_directory}")
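
Because a persistent collection survives between runs, executing the creation cell twice can store every document twice. One possible way around this is to derive a stable ID from each document's content. The `stable_id` helper below is a hypothetical sketch, and passing its output through an `ids=` argument when creating or adding documents is an assumption to check against the installed `langchain-chroma` version.

```python
import hashlib

def stable_id(text, metadata=None):
    """Hypothetical helper: the same text and metadata always yield the same ID."""
    # Sort metadata items so that key order does not change the hash
    payload = text + "|" + repr(sorted((metadata or {}).items()))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

toy_texts = [
    "LangChain is a framework for building LLM applications",
    "Chroma is an open-source vector database that runs locally",
]
toy_ids = [stable_id(text) for text in toy_texts]
print(toy_ids)

# Assumption: these IDs would be supplied alongside the documents
# (for example via an ids= argument), so a second run reuses the
# same IDs instead of generating fresh ones.
```

Truncating the hash to 16 hex characters keeps the IDs readable; a longer prefix lowers the already small chance of collisions in large collections.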

In [None]:
# Load existing vector store from disk (in a new session)
loaded_vectorstore = Chroma(
    collection_name="ai_knowledge_persistent",
    embedding_function=embeddings,
    persist_directory=persist_directory
)

# Test the loaded vector store
results = loaded_vectorstore.similarity_search("What are AI agents?", k=2)
print("Results from loaded vector store:")
for doc in results:
    print(f"- {doc.page_content}")

## 5. Add & Delete Documents

In [None]:
# Add new documents to existing vector store
new_docs = [
    Document(page_content="Fine-tuning adapts a pre-trained model to specific tasks", metadata={"topic": "training"}),
    Document(page_content="Temperature controls randomness in LLM outputs", metadata={"topic": "llm"}),
]

# Add documents and get their IDs
ids = vectorstore.add_documents(new_docs)
print(f"Added {len(ids)} documents with IDs: {ids}")
print(f"Total documents now: {vectorstore._collection.count()}")

In [None]:
# Delete documents by ID
vectorstore.delete(ids=ids)
print(f"Deleted documents. Total now: {vectorstore._collection.count()}")

## 6. Use as Retriever

Convert the vector store to a retriever for use in RAG chains.

In [None]:
# Convert to retriever (for use in RAG chains)
retriever = vectorstore.as_retriever(
    search_type="similarity",  # or "mmr" for diversity
    search_kwargs={"k": 3}     # return top 3 results
)

# Use the retriever
docs = retriever.invoke("How do embeddings work?")
print("Retrieved documents:")
for doc in docs:
    print(f"- {doc.page_content}")
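
The `"mmr"` option mentioned in the comment above stands for maximal marginal relevance, which trades relevance against redundancy among the results already picked. The following is a simplified sketch of that idea, not LangChain's implementation. The vectors are hand-made, and the `lambda_mult` name is borrowed from LangChain's parameter naming on the assumption that it plays the same role there.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query_vec, candidates, k=2, lambda_mult=0.5):
    """Greedily pick k candidate indices, balancing relevance and diversity.

    lambda_mult = 1.0 means pure relevance; lower values penalize
    candidates that resemble ones already selected.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, candidates[i])
            # Highest similarity to anything already selected (0.0 if nothing is selected yet)
            redundancy = max((cosine(candidates[i], candidates[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

toy_query_vec = [1.0, 0.0]
toy_candidates = [
    [0.95, 0.05],  # index 0: very relevant
    [0.94, 0.06],  # index 1: near-duplicate of index 0
    [0.60, 0.80],  # index 2: less relevant, but different
]

picked_relevant = mmr(toy_query_vec, toy_candidates, k=2, lambda_mult=1.0)
picked_diverse = mmr(toy_query_vec, toy_candidates, k=2, lambda_mult=0.3)
print("Pure relevance:", picked_relevant)
print("With diversity:", picked_diverse)
```

With pure relevance the near-duplicate is returned as the second result; with a lower `lambda_mult` the sketch skips it in favor of the less similar candidate.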

## 7. Load PDFs into Vector Store (Real-World Example)

In [None]:
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load PDF
pdf_path = "../data/AI ML Engineer & Agentic AI Engineer.pdf"
loader = PyMuPDFLoader(pdf_path)
pdf_docs = loader.load()

# Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
chunks = text_splitter.split_documents(pdf_docs)

print(f"Split PDF into {len(chunks)} chunks")
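
What `chunk_size` and `chunk_overlap` mean is easiest to see on a tiny string. The `sliding_chunks` function below is a deliberately simplified, hypothetical fixed-window splitter. `RecursiveCharacterTextSplitter` instead tries to break on separators such as paragraphs and sentences, so its chunks will generally differ from these.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be at least 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

toy_text = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = sliding_chunks(toy_text, chunk_size=10, chunk_overlap=3)
for chunk in toy_chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one. The overlap gives a sentence that straddles a boundary a chance to appear intact in at least one chunk.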

In [None]:
# Create vector store from PDF chunks
pdf_vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection_name="pdf_knowledge"
)

# Search the PDF content
query = "What are the technical skills?"
results = pdf_vectorstore.similarity_search(query, k=2)

print(f"Query: {query}\n")
for i, doc in enumerate(results, 1):
    print(f"Result {i}:")
    print(doc.page_content)
    print("-" * 50)

## Summary

| Feature | Method |
|---------|--------|
| Create from docs | `Chroma.from_documents(docs, embedding)` |
| Similarity search | `vectorstore.similarity_search(query, k=3)` |
| Search with scores | `vectorstore.similarity_search_with_score(query)` |
| Filter by metadata | `vectorstore.similarity_search(query, filter={"key": "value"})` |
| Persist to disk | `Chroma.from_documents(docs, embedding, collection_name="name", persist_directory="./path")` |
| Load from disk | `Chroma(collection_name="name", embedding_function=emb, persist_directory="./path")` |
| Add documents | `vectorstore.add_documents(docs)` |
| Delete documents | `vectorstore.delete(ids=[...])` |
| Convert to retriever | `vectorstore.as_retriever()` |