# RAG System - Build Index and Test

This notebook demonstrates how to:
1. Set up the RAG configuration with a single project root
2. Build the vector database from .docx and .csv files
3. Test retrieval functionality

## 1. Setup and Configuration

In [1]:
# Standard library imports
import os
from datetime import datetime
from getpass import getpass
from pathlib import Path

# Third-party imports
from dotenv import load_dotenv

# Project imports
from rag.config_rag import RAGConfig
from rag.db_indexer import DBIndexer
from rag.vector_retriever import VectorRetriever

In [2]:
# Load environment variables
load_dotenv()

# Set up OpenAI API key
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API Key: ")

# Log timestamp
timestamp = datetime.now().strftime("%Y%m%d_%H%M")
print(f'Last ran at: {timestamp}')

Last ran at: 20260222_1919


## 2. Initialize Configuration

**This is the only place you need to specify paths!**

Set `project_root` to your project directory. All other paths will be automatically derived:
- Documents will be loaded from `{project_root}/data/docs/`
- Vector DB will be stored in `{project_root}/vector_dbs/chroma_db/`
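
The idea of deriving every path from one root can be sketched as follows. This is a minimal illustration, not the actual `RAGConfig` implementation: the class name `SketchConfig` and its properties are hypothetical, and the directory names simply follow the paths printed in this notebook's output.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for RAGConfig: only project_root is supplied."""

    project_root: Path
    collection_name: str = "business_glossary"

    @property
    def docs_dir(self) -> Path:
        # Assumed layout: source documents under <project_root>/data/docs
        return self.project_root / "data" / "docs"

    @property
    def persist_dir(self) -> Path:
        # Assumed layout: vector store under <project_root>/vector_dbs/chroma_db
        return self.project_root / "vector_dbs" / "chroma_db"


sketch_cfg = SketchConfig(project_root=Path("my_project"))
print(sketch_cfg.docs_dir)
print(sketch_cfg.persist_dir)
```

Because the derived paths are computed properties rather than stored fields, pointing the configuration at a different root updates all of them at once.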

In [3]:
# Set your project root - this is the ONLY path you need to configure
PROJECT_ROOT = Path.cwd().parent

# Create configuration - all paths derived from project_root
cfg = RAGConfig(project_root=PROJECT_ROOT)

# Verify paths
print(f"Project root:      {cfg.project_root.absolute()}")
print(f"Documents dir:     {cfg.docs_dir}")
print(f"Master BG dir:     {cfg.excel_dir}")
print(f"Vector DB dir:     {cfg.persist_dir}")
print(f"Collection name:   {cfg.collection_name}")
print(f"Embedding model:   {cfg.embedding_model}")

Project root:      C:\Users\marcin.grzechowiak\Desktop\repos_learn\05_agent_dmo\business_glossary
Documents dir:     C:\Users\marcin.grzechowiak\Desktop\repos_learn\05_agent_dmo\business_glossary\data\docs
Master BG dir:     C:\Users\marcin.grzechowiak\Desktop\repos_learn\05_agent_dmo\business_glossary\data\master_business_glossary
Vector DB dir:     C:\Users\marcin.grzechowiak\Desktop\repos_learn\05_agent_dmo\business_glossary\vector_dbs\chroma_db
Collection name:   business_glossary
Embedding model:   text-embedding-3-small


## 3. Build Vector Database

This will:
1. Load all .docx files from the docs directory and all .csv files from the master business glossary directory
2. Split them into chunks
3. Create embeddings
4. Store in a Chroma vector database
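
To make the chunking step more concrete, here is a minimal sketch of fixed-size splitting with overlap. It is only an illustration: the function `split_into_chunks` and its parameters are hypothetical, and the splitter that `DBIndexer` actually uses may work differently (for example, by splitting on separators rather than raw character counts).

```python
def split_into_chunks(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
chunks = split_into_chunks(sample, chunk_size=12, overlap=4)
for chunk in chunks:
    print(repr(chunk))
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, which tends to help retrieval at the cost of some duplicated storage.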

In [4]:
# Build the vector database
indexer = DBIndexer(cfg)
vector_db = indexer.build(wipe=True)  # wipe=True to rebuild from scratch

Wiping existing database at C:\Users\marcin.grzechowiak\Desktop\repos_learn\05_agent_dmo\business_glossary\vector_dbs\chroma_db
Loading documents...
Found 5 .docx file(s) in C:\Users\marcin.grzechowiak\Desktop\repos_learn\05_agent_dmo\business_glossary\data\docs
Found 1 .csv file(s) in C:\Users\marcin.grzechowiak\Desktop\repos_learn\05_agent_dmo\business_glossary\data\master_business_glossary
Loaded 6 document(s): {'docx': 5, 'csv': 1}
Splitting into chunks...
Created 124 chunk(s)
Creating embeddings using text-embedding-3-small...
Building Chroma database at C:\Users\marcin.grzechowiak\Desktop\repos_learn\05_agent_dmo\business_glossary\vector_dbs\chroma_db...
✓ Successfully built database with 124 chunks


## 4. Test Retrieval

Test that we can retrieve relevant chunks from the database.
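
Conceptually, retrieval ranks stored chunk embeddings by their similarity to the query embedding and returns the top `k`. The sketch below shows that idea with cosine similarity on toy two-dimensional vectors. The helper names and the vectors are made up for illustration, and the real vector store may use a different distance metric and an approximate index rather than a full scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 3):
    """Return the k (name, score) pairs most similar to the query, best first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings, invented for this example
toy_vectors = {
    "chunk_a": [1.0, 0.0],
    "chunk_b": [0.7, 0.7],
    "chunk_c": [0.0, 1.0],
}
results = top_k([1.0, 0.1], toy_vectors, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```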

In [5]:
# Initialize retriever
retriever = VectorRetriever(cfg)

# Test query
# query = "Who is coordinating gardening club?"
query = "What Business Domain Name is booking_entity related to?"
hits = retriever.retrieve(query, k=3)


# Display results
print(f"\n{'='*60}")
print(f"Query: {query}")
print(f"{'='*60}\n")

for i, doc in enumerate(hits, 1):
    print(f"Result {i}:")
    print(f"  Source: {doc.metadata.get('source', 'unknown')}")
    print(f"  Content: {doc.page_content[:300]}...")  # preview: first 300 characters
    print()

Retrieving 3 chunk(s) for query: 'What Business Domain Name is booking_entity related to?'
=== RAG ===
Loading vector database from C:\Users\marcin.grzechowiak\Desktop\repos_learn\05_agent_dmo\business_glossary\vector_dbs\chroma_db...
✅ Loaded database with 124 chunks
✅ Retrieved 3 chunk(s)

Query: What Business Domain Name is booking_entity related to?

Result 1:
  Source: C:\Users\marcin.grzechowiak\Desktop\repos_learn\05_agent_dmo\business_glossary\data\master_business_glossary\master_business_glossary_csv.csv
  Content: gs_bucket,client,client_account,booking_entity,Finance,Legal Entity Management,Booking Legal Entity,The legal entity where the trade or account is booked.,Aurelia Bank Poland,"Required for accounting, tax, and regulatory reporting.",Must exist in legal entity reference data,Victor...

Result 2:
  Source: C:\Users\marcin.grzechowiak\Desktop\repos_learn\05_agent_dmo\business_glossary\data\master_business_glossary\master_business_glossary_csv.csv
  Content: Bucket,Data

In [6]:
hits

[Document(id='0e2a25b5-b6d5-4339-bf64-6fafd103e03c', metadata={'source': 'C:\\Users\\marcin.grzechowiak\\Desktop\\repos_learn\\05_agent_dmo\\business_glossary\\data\\master_business_glossary\\master_business_glossary_csv.csv', 'file_type': 'csv'}, page_content='gs_bucket,client,client_account,booking_entity,Finance,Legal Entity Management,Booking Legal Entity,The legal entity where the trade or account is booked.,Aurelia Bank Poland,"Required for accounting, tax, and regulatory reporting.",Must exist in legal entity reference data,Victor'),
 Document(id='f1539374-1733-444f-b980-cb79bfbee837', metadata={'file_type': 'csv', 'source': 'C:\\Users\\marcin.grzechowiak\\Desktop\\repos_learn\\05_agent_dmo\\business_glossary\\data\\master_business_glossary\\master_business_glossary_csv.csv'}, page_content='Bucket,Dataset,Table Name,Column Name,Business Domain Name,Business Sub-Domain Name,Business Name,Column Description,Sample Values,Attribute related business rationale,Attribute logical busin

## 5. Database Statistics

In [12]:
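# Get database stats
# Note: `_collection` is a private attribute, so this access may break if the underlying library changes
collection = retriever.vector_db._collection
total_chunks = collection.count()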

print("\nDatabase Statistics:")
print(f"  Total chunks: {total_chunks}")
print(f"  Collection name: {cfg.collection_name}")
print(f"  Storage location: {cfg.persist_dir}")
print("  This vector DB was created at: ", collection.metadata)


Database Statistics:
  Total chunks: 124
  Collection name: business_glossary
  Storage location: C:\Users\marcin.grzechowiak\Desktop\repos_learn\05_agent_dmo\business_glossary\vector_dbs\chroma_db
  This vector DB was created at:  {'created_at': '20260222_1919'}
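
The `created_at` value appears to use the same `%Y%m%d_%H%M` format as the timestamp logged in the setup cell. Assuming that holds, it can be parsed back into a `datetime` for comparisons or friendlier display. The `metadata` dictionary below is copied from the output above rather than read from a live collection.

```python
from datetime import datetime

# Assumed to match the strftime format used in the setup cell
TIMESTAMP_FORMAT = "%Y%m%d_%H%M"

# Copied from the collection metadata printed above
metadata = {"created_at": "20260222_1919"}

created_at = datetime.strptime(metadata["created_at"], TIMESTAMP_FORMAT)
print(created_at.isoformat())  # 2026-02-22T19:19:00
```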
