## 🤔 What is a Vector Database?

A **vector database** is optimized for storing vectors (lists of numbers) and finding the stored vectors closest to a query vector.

```
Regular Database:              Vector Database:
┌────────────────────┐         ┌──────────────────────┐
│ Find exact match   │         │ Find similar items   │
│ WHERE name = 'Bob' │         │ closest to [0.2,...] │
└────────────────────┘         └──────────────────────┘
```

- **Regular DB:** Exact matching ("Find user with ID 123")
- **Vector DB:** Similarity search ("Find documents about X")
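
To make the contrast concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and the `items` list are made up for illustration; real embeddings come from a model and have hundreds of dimensions.

```python
import math

# Toy data: each item has a name and a made-up 3-dimensional vector.
items = [
    {"name": "Bob", "vector": [0.9, 0.1, 0.0]},
    {"name": "Alice", "vector": [0.2, 0.8, 0.1]},
    {"name": "Carol", "vector": [0.1, 0.9, 0.2]},
]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Regular database style: exact match on a field.
exact = [item for item in items if item["name"] == "Bob"]

# Vector database style: rank every item by similarity to a query vector.
query = [0.2, 0.7, 0.1]
ranked = sorted(
    items,
    key=lambda item: cosine_similarity(query, item["vector"]),
    reverse=True,
)

print("Exact match:", [item["name"] for item in exact])
print("Most similar first:", [item["name"] for item in ranked])
```

The exact match returns only the row whose field equals the search value, while the similarity ranking orders every item by how close its vector is to the query.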

## Why ChromaDB?

| Feature | ChromaDB | FAISS | Pinecone |
|---------|----------|-------|----------|
| Setup | Easy ✓ | Medium | Complex |
| Cost | Free | Free | Paid |
| Persistence | Auto-save | Manual | Cloud |
| Add/Delete | Easy | Rebuild | Easy |
| Best for | Learning & Prototyping | Production (huge data) | Enterprise |

ChromaDB runs locally, is free, and saves to disk for you, which makes it a good fit for learning and building prototypes!

---

## Step 1: Set Up (Load from previous notebooks)

```python
# Load all the components from previous notebooks
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
import os

# Set up paths
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
data_folder = os.path.join(project_root, 'data')

# Load PDFs
print("📚 Loading documents...")
all_pages = []
for pdf_name in os.listdir(data_folder):
    if pdf_name.endswith('.pdf'):
        loader = PyPDFLoader(os.path.join(data_folder, pdf_name))
        all_pages.extend(loader.load())
print(f"   ✅ Loaded {len(all_pages)} pages")

# Split into chunks
print("✂️ Creating chunks...")
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(all_pages)
print(f"   ✅ Created {len(chunks)} chunks")

# Load embeddings
print("🔢 Loading embeddings model...")
embeddings = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2",
    model_kwargs={'device': 'cpu'}
)
print("   ✅ Embeddings ready!")

print("\n🎉 All set! Ready to create vector store.")
```

---

## Step 2: Create ChromaDB Vector Store

```python
from langchain_community.vectorstores import Chroma
import shutil

# Where to save the database
chroma_path = os.path.join(os.getcwd(), 'chroma_db')

# Remove the old database if it exists (for a fresh start)
if os.path.exists(chroma_path):
    shutil.rmtree(chroma_path)
    print("🗑️ Removed old database")

print("\n🔄 Creating ChromaDB...")
print(f"   Processing {len(chunks)} chunks...")

# Create the vector store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory=chroma_path
)

print("\n✅ ChromaDB created!")
print(f"📁 Saved to: {chroma_path}")
```

### 💡 What Just Happened?

```
chunks (text)  →  embed each chunk  →  store in ChromaDB
                        ↓
                   [0.2, -0.5, ...]  →  saved to disk
```

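`Chroma.from_documents()`:
1. Took each text chunk
2. Passed it to the embedding model to get a vector
3. Stored the text, its metadata, and the embedding together
4. Saved everything under `persist_directory` (recent Chroma versions write to disk automatically)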
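
The same four steps can be sketched without any libraries. This is only an illustration of the idea, not how ChromaDB is implemented: `toy_embed`, `build_store`, the `VOCAB` list, and the JSON file are all hypothetical stand-ins for the real embedding model and database files.

```python
import json
import os
import tempfile

VOCAB = ["data", "coding", "interview", "research"]  # hypothetical toy vocabulary

def toy_embed(text):
    """Stand-in for a real embedding model: count vocabulary words."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def build_store(texts, path):
    """Embed each text, keep text and vector together, and write them to disk."""
    records = [{"text": text, "embedding": toy_embed(text)} for text in texts]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)
    return records

chunks_demo = ["qualitative data coding", "interview research data"]
store_path = os.path.join(tempfile.mkdtemp(), "toy_store.json")
records = build_store(chunks_demo, store_path)

# Reading the file back gives the same text + embedding pairs.
with open(store_path, encoding="utf-8") as f:
    reloaded = json.load(f)
print(reloaded[0])
```

Keeping each text next to its vector is what lets a later search return readable chunks rather than bare numbers.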

---

## Step 3: Search the Vector Store

```python
# Search for similar documents
question = "What is qualitative data coding?"

print(f"🔍 Searching for: '{question}'")
print("="*60 + "\n")

# Find the 3 most similar chunks
results = vectorstore.similarity_search(question, k=3)

print(f"✅ Found {len(results)} relevant chunks!\n")

for i, doc in enumerate(results, 1):
    # basename handles the path separator of the current operating system
    source = os.path.basename(doc.metadata.get('source', 'Unknown'))
    page = doc.metadata.get('page', '?')

    print(f"📄 Result {i}:")
    print(f"   Source: {source}, Page {page}")
    print(f"   Text: {doc.page_content[:200]}...")
    print()
```

### 💡 How similarity_search Works

```
1. Your question: "What is qualitative data coding?"
                            ↓
2. Convert to embedding: [0.2, -0.5, 0.8, ...]
                            ↓
3. Compare with stored embeddings (approximate nearest-neighbor index)
                            ↓
4. Return the top 3 most similar (k=3)
```

This is the **Retrieval** part of RAG!
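
Steps 3 and 4 can be sketched as a brute-force scan in plain Python. This is a simplification: ChromaDB uses an approximate index rather than checking every vector, and the two-dimensional vectors, `squared_l2`, and `top_k` below are invented for illustration.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance: lower means more similar."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query_vector, stored, k=3):
    """Return the k (distance, text) pairs closest to the query vector."""
    scored = ((squared_l2(query_vector, vec), text) for text, vec in stored)
    return heapq.nsmallest(k, scored)

# Made-up 2-dimensional embeddings for four chunks.
stored_demo = [
    ("coding qualitative data", [0.9, 0.1]),
    ("interview techniques", [0.1, 0.9]),
    ("thematic coding", [0.8, 0.3]),
    ("weather report", [-0.7, -0.6]),
]

query_vector = [1.0, 0.2]  # pretend this is the embedded question
for distance, text in top_k(query_vector, stored_demo, k=2):
    print(f"{distance:.2f}  {text}")
```

With `k=2`, only the two chunks nearest to the query vector come back; the unrelated "weather report" chunk is far away and is never returned.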

---

## Step 4: Search with Scores

```python
# Get similarity scores too
question = "interview techniques for research"

print(f"🔍 Searching for: '{question}'")
print("="*60 + "\n")

# Search with scores
results_with_scores = vectorstore.similarity_search_with_score(question, k=5)

for doc, score in results_with_scores:
    # Lower score = more similar (it's a distance)
    similarity = 1 / (1 + score)  # Convert distance to similarity
    bar = "█" * int(similarity * 20)

    print(f"Score: {bar} ({similarity:.0%})")
    print(f"   {doc.page_content[:100]}...")
    print()
```

### 💡 Understanding Scores

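ChromaDB returns a **distance** score (lower = closer/more similar). The ranges below are a rough guide; actual values depend on the embedding model and the distance metric (squared L2 by default):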

| Distance | Meaning |
|----------|----------|
| 0.0 - 0.5 | Very similar |
| 0.5 - 1.0 | Somewhat related |
| > 1.0 | Less related |

We can use this to filter out low-quality matches.
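
A minimal sketch of such a filter, assuming results arrive as `(item, distance)` pairs like those from `similarity_search_with_score`. The helper names and the `1.0` cutoff are assumptions for illustration; a sensible cutoff has to be tuned on your own data.

```python
def distance_to_similarity(distance):
    """Map a non-negative distance to a (0, 1] score; distance 0 gives 1.0."""
    return 1 / (1 + distance)

def keep_close_matches(scored_results, max_distance=1.0):
    """Keep only (item, distance) pairs at or below max_distance."""
    return [(item, d) for item, d in scored_results if d <= max_distance]

# Hypothetical (text, distance) pairs, with plain strings standing in
# for the document objects a real search would return.
scored_demo = [("chunk A", 0.3), ("chunk B", 0.9), ("chunk C", 1.6)]

for item, d in keep_close_matches(scored_demo, max_distance=1.0):
    print(f"{item}: distance={d}, similarity={distance_to_similarity(d):.0%}")
```

Filtering on the raw distance keeps the cutoff easy to compare against the table above; the `1 / (1 + distance)` conversion is only for display.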

---

## Step 5: Load Existing Database

```python
# How to load an existing database (no need to recreate!)
print("📂 Loading existing ChromaDB...")

loaded_store = Chroma(
    persist_directory=chroma_path,
    embedding_function=embeddings
)

print("✅ Database loaded!")

# Test it works
test_results = loaded_store.similarity_search("coding methods", k=1)
print(f"\n🔍 Quick test: Found {len(test_results)} result(s)")
if test_results:  # guard against an empty database
    print(f"   {test_results[0].page_content[:100]}...")
```

### 💡 Save vs Load

```python
# FIRST TIME: Create and save
vectorstore = Chroma.from_documents(docs, embeddings, persist_directory=path)

# LATER: Just load (much faster!)
vectorstore = Chroma(persist_directory=path, embedding_function=embeddings)
```

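Once created, you don't need to re-embed everything each time! Load the store with the same embedding model that built it, so query vectors are comparable to the stored ones.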
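
One way to combine the two paths is a small "load if present, otherwise create" helper. The sketch below shows only the control flow: `load_or_create`, `fake_create`, and `fake_load` are hypothetical names, and the fake functions stand in for `Chroma.from_documents(...)` and `Chroma(...)` so the example runs without a database.

```python
import os
import tempfile

def load_or_create(path, create, load):
    """Call load(path) if a non-empty directory exists there, else create(path)."""
    if os.path.isdir(path) and os.listdir(path):
        return load(path)
    return create(path)

# Stand-ins for the real calls: they only write or report a marker file
# so the control flow can be run anywhere.
def fake_create(path):
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "marker.txt"), "w", encoding="utf-8") as f:
        f.write("created")
    return "created"

def fake_load(path):
    return "loaded"

demo_path = os.path.join(tempfile.mkdtemp(), "chroma_db_demo")
first = load_or_create(demo_path, fake_create, fake_load)
second = load_or_create(demo_path, fake_create, fake_load)
print(first, second)
```

In the notebook, `create` would wrap the `Chroma.from_documents(...)` call and `load` would wrap `Chroma(persist_directory=..., embedding_function=embeddings)`.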

---

## Step 6: Filter by Metadata

```python
# Search within specific pages only
print("🔍 Searching only in first 3 pages...\n")

# Filter: only pages 0, 1, or 2
results = vectorstore.similarity_search(
    "research methods",
    k=3,
    filter={"page": {"$lt": 3}}  # page < 3
)

for doc in results:
    page = doc.metadata.get('page', '?')
    print(f"📄 Page {page}: {doc.page_content[:100]}...")
```

### 💡 Metadata Filtering

You can filter results by any metadata:

```python
# Filter by exact match
filter={"page": 5}

# Filter by comparison
filter={"page": {"$lt": 10}}  # Less than 10
filter={"page": {"$gt": 5}}   # Greater than 5

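# Filter by source file (metadata filters match whole values, not substrings;
# "research.pdf" is a placeholder file name)
filter={"source": os.path.join(data_folder, "research.pdf")}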
```

This is very useful when you have multiple document types!
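
To show what these filter dictionaries mean, here is a small evaluator in plain Python. It illustrates the semantics only and is not ChromaDB's implementation; `matches` and `OPERATORS` are names made up for this sketch, and it covers just a few comparison operators.

```python
import operator

# A subset of comparison operators in the style of the filter examples.
OPERATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$gt": operator.gt,
    "$gte": operator.ge,
}

def matches(metadata, where):
    """Return True if metadata satisfies every condition in the filter dict."""
    for field, condition in where.items():
        if field not in metadata:
            return False
        value = metadata[field]
        if isinstance(condition, dict):
            # Operator form, e.g. {"$lt": 3}
            if not all(OPERATORS[op](value, target) for op, target in condition.items()):
                return False
        elif value != condition:
            # Bare value form, e.g. {"page": 5}, means exact match
            return False
    return True

pages_demo = [{"page": 0}, {"page": 2}, {"page": 5}, {"page": 12}]
print([m["page"] for m in pages_demo if matches(m, {"page": {"$lt": 3}})])
print([m["page"] for m in pages_demo if matches(m, {"page": 5})])
```

Only chunks whose metadata passes the filter are considered, and the similarity ranking then applies to what remains.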

---

## 🧪 Experiment: Compare Different Questions

```python
# Try different questions and see what comes up
questions = [
    "What is qualitative coding?",
    "How do you analyze interview data?",
    "What are the challenges of research?",
    "What is the weather like?"  # Irrelevant question
]

print("🧪 Testing Different Questions:\n")
print("="*60)

for question in questions:
    results = vectorstore.similarity_search_with_score(question, k=1)
    doc, score = results[0]
    similarity = 1 / (1 + score)

    print(f"\n❓ '{question}'")
    print(f"   Best match similarity: {similarity:.0%}")

    if similarity < 0.5:
        print("   ⚠️ Low relevance - might not have good info for this question")
    else:
        print(f"   ✅ Found: {doc.page_content[:80]}...")
```

---

## ✅ Summary

In this notebook, you learned:

1. **What a vector database is** - Storage optimized for similarity search
2. **Why ChromaDB** - Easy to use, auto-saves, great for learning
3. **Creating a vector store** - `Chroma.from_documents()`
4. **Searching** - `similarity_search()` finds relevant chunks
5. **Loading existing databases** - Don't re-embed every time!
6. **Metadata filtering** - Search within specific sources/pages

## ➡️ Next Step

In **Notebook 5: Complete RAG**, you'll combine the vector store with an AI model to answer questions!

---

**Key Components:**
- `vectorstore` - Your ChromaDB instance
- `chroma_path` - Where the database is saved