# RAG System - Build Index and Test

This notebook demonstrates how to:
1. Set up the RAG configuration with a single project root
2. Build the vector database from .docx files
3. Test retrieval functionality

## 1. Setup and Configuration

In [9]:
# Standard library imports
import os
from datetime import datetime
from getpass import getpass
from pathlib import Path

# Third-party imports
from dotenv import load_dotenv

# Project imports
from rag.config_rag import RAGConfig
from rag.db_indexer import DBIndexer
from rag.vector_retriever import VectorRetriever

In [10]:
# Load environment variables
load_dotenv()

# Set up OpenAI API key
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API Key: ")

# Log timestamp
timestamp = datetime.now().strftime("%Y%m%d_%H%M")
print(f'Last ran at: {timestamp}')

Last ran at: 20260202_2200


## 2. Initialize Configuration

**This is the only place you need to specify paths!**

Set `project_root` to your project directory. All other paths are derived from it automatically:
- Documents will be loaded from `{project_root}/aurelia_capital_internal/docs/`
- Vector DB will be stored in `{project_root}/vector_dbs/chroma_db/`
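
To illustrate the idea of deriving every path from one root, here is a minimal sketch. `SketchConfig` is a hypothetical stand-in written for this example; it is not the project's `RAGConfig`, whose implementation is not shown in this notebook. Only the two sub-paths listed above are taken from the text.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for RAGConfig (assumption, not the real class)."""

    project_root: Path

    @property
    def docs_dir(self) -> Path:
        # Documents live under {project_root}/aurelia_capital_internal/docs/
        return self.project_root / "aurelia_capital_internal" / "docs"

    @property
    def persist_dir(self) -> Path:
        # The vector DB lives under {project_root}/vector_dbs/chroma_db/
        return self.project_root / "vector_dbs" / "chroma_db"


# "my_project" is a placeholder root used only for this illustration
sketch_cfg = SketchConfig(project_root=Path("my_project"))
print(sketch_cfg.docs_dir)
print(sketch_cfg.persist_dir)
```

Exposing the derived paths as read-only properties means changing the root is enough to relocate everything, which is the behaviour described above.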

In [11]:
# Set your project root - this is the ONLY path you need to configure.
# Path.cwd().parent assumes the notebook runs from a folder directly under the project root.
PROJECT_ROOT = Path.cwd().parent

# Create configuration - all paths derived from project_root
cfg = RAGConfig(project_root=PROJECT_ROOT)

# Verify paths
print(f"Project root:      {cfg.project_root.absolute()}")
print(f"Documents dir:     {cfg.docs_dir}")
print(f"Vector DB dir:     {cfg.persist_dir}")
print(f"Collection name:   {cfg.collection_name}")
print(f"Embedding model:   {cfg.embedding_model}")

Project root:      C:\Users\marcin.grzechowiak\Desktop\repos_learn\05_agent_dmo\business_glossary
Documents dir:     C:\Users\marcin.grzechowiak\Desktop\repos_learn\05_agent_dmo\business_glossary\aurelia_capital_internal\docs
Vector DB dir:     C:\Users\marcin.grzechowiak\Desktop\repos_learn\05_agent_dmo\business_glossary\vector_dbs\chroma_db
Collection name:   business_glossary
Embedding model:   text-embedding-3-small


## 3. Build Vector Database

Running the cell below will:
1. Load all .docx files from the docs directory
2. Split them into chunks
3. Create an embedding for each chunk
4. Store the embeddings in a Chroma vector database
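
The splitting step can be pictured with a small sketch. How `DBIndexer` actually splits documents is not shown in this notebook, so the function below is an assumption: a simple fixed-size character window with overlap. The name `split_into_chunks` and its parameters `chunk_size` and `overlap` are hypothetical.

```python
def split_into_chunks(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters.

    Illustrative only; the real indexer may split on paragraphs or tokens instead.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


print(split_into_chunks("abcdefghij", chunk_size=4, overlap=1))
# → ['abcd', 'defg', 'ghij']
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some text twice.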

In [4]:
# Build the vector database
indexer = DBIndexer(cfg)
vector_db = indexer.build(wipe=True)  # wipe=True to rebuild from scratch

Wiping existing database at C:\Users\marcin.grzechowiak\Desktop\repos_learn\05_agent_dmo\business_glossary\vector_dbs\chroma_db
Loading documents...
Found 5 .docx file(s) in C:\Users\marcin.grzechowiak\Desktop\repos_learn\05_agent_dmo\business_glossary\aurelia_capital_internal\docs
Loaded 5 document(s)
Splitting into chunks...
Created 88 chunk(s)
Creating embeddings using text-embedding-3-small...
Building Chroma database at C:\Users\marcin.grzechowiak\Desktop\repos_learn\05_agent_dmo\business_glossary\vector_dbs\chroma_db...
✓ Successfully built database with 88 chunks


## 4. Test Retrieval

Test that we can retrieve relevant chunks from the database.
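
Conceptually, retrieval ranks the stored chunk embeddings by similarity to the query embedding and returns the top `k`. The sketch below shows that idea with brute-force cosine similarity on toy 2-D vectors. It is an illustration under assumptions: `cosine_similarity` and `top_k` are hypothetical helpers, and a real vector database typically uses an approximate index and a configurable distance metric rather than this exhaustive scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 3) -> list[int]:
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]


# Toy embeddings, invented for this example
toy_chunks = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
print(top_k([1.0, 0.0], toy_chunks, k=2))
# → [1, 2]
```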

In [12]:
# Initialize retriever
retriever = VectorRetriever(cfg)

# Test query
query = "Who is coordinating gardening club?"
hits = retriever.retrieve(query, k=3)

# Display results
print(f"\n{'='*60}")
print(f"Query: {query}")
print(f"{'='*60}\n")

for i, doc in enumerate(hits, 1):
    print(f"Result {i}:")
    print(f"  Source: {doc.metadata.get('source', 'unknown')}")
    print(f"  Content: {doc.page_content[:80]}...")
    print()

Retrieving 3 chunk(s) for query: 'Who is coordinating gardening club?'
Loading vector database from C:\Users\marcin.grzechowiak\Desktop\repos_learn\05_agent_dmo\business_glossary\vector_dbs\chroma_db...
✓ Loaded database with 88 chunks
✓ Retrieved 3 chunk(s)

Query: Who is coordinating gardening club?

Result 1:
  Source: C:\Users\marcin.grzechowiak\Desktop\repos_learn\05_agent_dmo\business_glossary\aurelia_capital_internal\docs\office_gardening_club.docx
  Content: 2. Organization and Membership

The Gardening Club is coordinated by a small com...

Result 2:
  Source: C:\Users\marcin.grzechowiak\Desktop\repos_learn\05_agent_dmo\business_glossary\aurelia_capital_internal\docs\office_gardening_club.docx
  Content: The club is open to all employees, regardless of prior gardening experience. Mem...

Result 3:
  Source: C:\Users\marcin.grzechowiak\Desktop\repos_learn\05_agent_dmo\business_glossary\aurelia_capital_internal\docs\office_gardening_club.docx
  Content: Beyond environmental bene

In [13]:
hits

[Document(id='574c505b-e970-4473-bfc2-722dc46ea749', metadata={'source': 'C:\\Users\\marcin.grzechowiak\\Desktop\\repos_learn\\05_agent_dmo\\business_glossary\\aurelia_capital_internal\\docs\\office_gardening_club.docx'}, page_content='2. Organization and Membership\n\nThe Gardening Club is coordinated by a small committee of volunteers from different departments, ensuring cross-team collaboration. The committee typically includes:\n\nA club coordinator, responsible for planning activities and acting as the main point of contact'),
 Document(id='e68bfb60-c093-4254-806a-1541eb01e7d9', metadata={'source': 'C:\\Users\\marcin.grzechowiak\\Desktop\\repos_learn\\05_agent_dmo\\business_glossary\\aurelia_capital_internal\\docs\\office_gardening_club.docx'}, page_content='The club is open to all employees, regardless of prior gardening experience. Members range from experienced hobby gardeners to complete beginners who simply enjoy being around greenery. Participation is informal and flexible, 

## 5. Database Statistics

In [8]:
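# Get database stats
# Note: `_collection` is a private attribute (leading underscore), so it may change between library versions
collection = retriever.vector_db._collection
total_chunks = collection.count()

print("\nDatabase Statistics:")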
print(f"  Total chunks: {total_chunks}")
print(f"  Collection name: {cfg.collection_name}")
print(f"  Storage location: {cfg.persist_dir}")


Database Statistics:
  Total chunks: 88
  Collection name: business_glossary
  Storage location: C:\Users\marcin.grzechowiak\Desktop\repos_learn\05_agent_dmo\business_glossary\vector_dbs\chroma_db
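
The total above could also be broken down per source document, since each chunk carries a `source` entry in its metadata (as seen in the `hits` output earlier). The sketch below assumes you already have the chunk metadata as a list of dictionaries; how to fetch that list depends on the vector store and is not shown. `chunks_per_source` and the sample file names are hypothetical.

```python
from collections import Counter


def chunks_per_source(metadatas: list[dict]) -> Counter:
    """Count chunks per 'source' value; chunks without one are grouped as 'unknown'."""
    return Counter(meta.get("source", "unknown") for meta in metadatas)


# Sample metadata invented for this example
sample_metadatas = [
    {"source": "docs/a.docx"},
    {"source": "docs/a.docx"},
    {"source": "docs/b.docx"},
    {},
]

for source, count in chunks_per_source(sample_metadatas).most_common():
    print(f"  {source}: {count}")
```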
