# LlamaIndex RAG with OpenAI - Updated for Standard OpenAI API

## üîÑ Migration Notice
This notebook has been migrated from Azure OpenAI to **standard OpenAI API**.

### Required Environment Variables:
```bash
OPENAI_API_KEY=your-openai-api-key
OPENAI_MODEL=gpt-4.1  # or your preferred model
OPENAI_EMBEDDING_MODEL=text-embedding-3-small  # or text-embedding-3-large
```

### Key Changes:
- ‚úÖ Using `llama-index-embeddings-openai` instead of `llama-index-embeddings-azure-openai`
- ‚úÖ Using `llama-index-llms-openai` instead of `llama-index-llms-azure-openai`
- ‚úÖ Simplified configuration - no Azure endpoints, deployments, or API versions needed
- ‚úÖ All embedding operations now use `https://api.openai.com/v1/embeddings`

### Installation:
```bash
pip install llama-index-embeddings-openai llama-index-llms-openai
```

---

# LlamaIndex Kernel Crash Troubleshooting Guide

## üö® Kernel Crash at Step 3 (LlamaIndex Import)

Your kernel is crashing during the LlamaIndex import phase. This is a common issue that can be caused by:

### **Most Common Causes:**
1. **Package Version Conflicts** - Incompatible versions of LlamaIndex components
2. **Missing Dependencies** - Some required packages are not installed
3. **Memory Issues** - Insufficient memory for loading large models
4. **Environment Conflicts** - Multiple Python environments or conflicting packages

### **Troubleshooting Steps:**

#### **Step 1: Run Package Diagnostic**
Run the "PACKAGE DIAGNOSTIC & INSTALLATION CELL" below first.

#### **Step 2: Test Minimal Imports**
Run the "MINIMAL IMPORT TEST" to identify which specific import is failing.

#### **Step 3: Clean Installation (if needed)**
If imports still fail, run the "ALTERNATIVE LLAMAINDEX INSTALLATION" cell.

#### **Step 4: Restart Kernel**
After any package installation, **always restart your kernel**.

#### **Step 5: Use Kernel-Safe Query**
Finally, use the "KERNEL-SAFE LOCAL QUERY" cell which includes memory management.

### **Expected Flow:**
1. ‚úÖ Environment variables found (Step 2) - **COMPLETED**
2. ‚ùå LlamaIndex imports (Step 3) - **FAILING HERE**
3. ‚è∏Ô∏è Azure OpenAI configuration (Step 4) - Pending
4. ‚è∏Ô∏è ChromaDB connection (Step 5) - Pending
5. ‚è∏Ô∏è Query execution (Step 6) - Pending

---

In [29]:
# PACKAGE DIAGNOSTIC & INSTALLATION CELL
# Run this first to diagnose and fix LlamaIndex import issues

import sys
import subprocess
import importlib
import pkg_resources
from packaging import version

def check_and_install_packages():
    """
    Comprehensive package check and installation for LlamaIndex ecosystem
    """
    
    print("üîç Python Environment Information:")
    print(f"Python Version: {sys.version}")
    print(f"Python Executable: {sys.executable}")
    print(f"Platform: {sys.platform}")
    print("="*60)
    
    # Required packages with their correct names
    required_packages = {
        'llama-index': '0.11.20',  # Core package
        'llama-index-core': None,   # Core components
        'llama-index-embeddings-azure-openai': None,  # Azure OpenAI embeddings
        'llama-index-llms-azure-openai': None,  # Azure OpenAI LLM
        'llama-index-vector-stores-chroma': None,  # Chroma vector store
        'chromadb': '0.4.24',  # ChromaDB
        'openai': '1.3.0',  # OpenAI client
        'httpx': None,  # HTTP client
        'python-dotenv': None,  # Environment variables
    }
    
    print("üîç Checking installed packages...")
    installed_packages = {}
    missing_packages = []
    outdated_packages = []
    
    # Check what's currently installed
    for package_name, min_version in required_packages.items():
        try:
            # Try to get the installed version
            installed_version = pkg_resources.get_distribution(package_name).version
            installed_packages[package_name] = installed_version
            
            print(f"‚úÖ {package_name}: {installed_version}")
            
            # Check if version is adequate
            if min_version and version.parse(installed_version) < version.parse(min_version):
                outdated_packages.append((package_name, installed_version, min_version))
                
        except pkg_resources.DistributionNotFound:
            print(f"‚ùå {package_name}: NOT INSTALLED")
            missing_packages.append(package_name)
        except Exception as e:
            print(f"‚ö†Ô∏è {package_name}: Error checking - {e}")
            missing_packages.append(package_name)
    
    print("\n" + "="*60)
    
    # Report issues
    if missing_packages:
        print(f"‚ùå Missing packages: {missing_packages}")
    
    if outdated_packages:
        print(f"‚ö†Ô∏è Outdated packages: {outdated_packages}")
    
    if not missing_packages and not outdated_packages:
        print("‚úÖ All required packages are installed with adequate versions!")
        return True
    
    # Install/upgrade packages
    packages_to_install = missing_packages + [pkg[0] for pkg in outdated_packages]
    
    if packages_to_install:
        print(f"\nüöÄ Installing/upgrading packages: {packages_to_install}")
        
        for package in packages_to_install:
            try:
                print(f"Installing {package}...")
                subprocess.check_call([sys.executable, "-m", "pip", "install", "--upgrade", package])
                print(f"‚úÖ {package} installed successfully")
            except subprocess.CalledProcessError as e:
                print(f"‚ùå Failed to install {package}: {e}")
                return False
    
    return True

def test_imports():
    """
    Test critical imports one by one to identify the problem
    """
    print("\nüß™ Testing imports one by one...")
    
    imports_to_test = [
        ("chromadb", "import chromadb"),
        ("llama_index.core", "from llama_index.core import Settings"),
        ("llama_index.vector_stores.chroma", "from llama_index.vector_stores.chroma import ChromaVectorStore"),
        ("llama_index.core.storage", "from llama_index.core import StorageContext, VectorStoreIndex"),
        ("llama_index.embeddings.azure_openai", "from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding"),
        ("llama_index.llms.azure_openai", "from llama_index.llms.azure_openai import AzureOpenAI"),
    ]
    
    failed_imports = []
    
    for name, import_statement in imports_to_test:
        try:
            exec(import_statement)
            print(f"‚úÖ {name}: SUCCESS")
        except Exception as e:
            print(f"‚ùå {name}: FAILED - {e}")
            failed_imports.append((name, str(e)))
    
    if failed_imports:
        print(f"\n‚ùå Failed imports detected: {len(failed_imports)}")
        for name, error in failed_imports:
            print(f"  - {name}: {error}")
        return False
    else:
        print("\n‚úÖ All imports successful!")
        return True

# Run the diagnostics
print("üöÄ Starting Package Diagnostic...")
packages_ok = check_and_install_packages()

if packages_ok:
    print("\n" + "="*60)
    imports_ok = test_imports()
    
    if imports_ok:
        print("\nüéâ SUCCESS! All packages installed and imports working.")
        print("You can now proceed to run the local index query.")
    else:
        print("\n‚ùå Import issues detected. Check the error messages above.")
else:
    print("\n‚ùå Package installation issues detected.")

print("\nüîÑ Please restart your kernel after running this cell if packages were installed.")

üöÄ Starting Package Diagnostic...
üîç Python Environment Information:
Python Version: 3.12.0 (v3.12.0:0fb18b02c8, Oct  2 2023, 09:45:56) [Clang 13.0.0 (clang-1300.0.29.30)]
Python Executable: /Users/mzwandilemhlongo/Desktop/Data Science/PersonalProjects/ai-powered-analysis/text2sql/ai-analyst-agent/.venv/bin/python
Platform: darwin
üîç Checking installed packages...
‚úÖ llama-index: 0.14.6
‚úÖ llama-index-core: 0.14.6
‚úÖ llama-index-embeddings-azure-openai: 0.4.1
‚úÖ llama-index-llms-azure-openai: 0.4.2
‚úÖ llama-index-vector-stores-chroma: 0.5.3
‚úÖ chromadb: 1.3.0
‚úÖ openai: 1.109.1
‚úÖ httpx: 0.28.1
‚úÖ python-dotenv: 1.2.1

‚úÖ All required packages are installed with adequate versions!


üß™ Testing imports one by one...
‚úÖ chromadb: SUCCESS
‚úÖ llama_index.core: SUCCESS
‚úÖ llama_index.vector_stores.chroma: SUCCESS
‚úÖ llama_index.core.storage: SUCCESS
‚úÖ llama_index.embeddings.azure_openai: SUCCESS
‚úÖ llama_index.llms.azure_openai: SUCCESS

‚úÖ All imports successful!


In [30]:
# MINIMAL IMPORT TEST
# Run this to test imports without any complex logic

print("üß™ Testing minimal imports...")

try:
    print("Testing basic imports...")
    import os
    import sys
    print("‚úÖ Basic imports OK")
    
    print("Testing ChromaDB...")
    import chromadb
    print(f"‚úÖ ChromaDB version: {chromadb.__version__}")
    
    print("Testing LlamaIndex core...")
    from llama_index.core import Settings
    print("‚úÖ LlamaIndex core OK")
    
    print("Testing LlamaIndex vector store...")
    from llama_index.vector_stores.chroma import ChromaVectorStore
    print("‚úÖ ChromaVectorStore OK")
    
    print("Testing LlamaIndex storage...")
    from llama_index.core import StorageContext, VectorStoreIndex
    print("‚úÖ StorageContext and VectorStoreIndex OK")
    
    print("Testing Azure OpenAI embedding...")
    from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
    print("‚úÖ AzureOpenAIEmbedding OK")
    
    print("Testing Azure OpenAI LLM...")
    from llama_index.llms.azure_openai import AzureOpenAI
    print("‚úÖ AzureOpenAI LLM OK")
    
    print("\nüéâ ALL IMPORTS SUCCESSFUL!")
    print("You can now proceed with the local index query.")
    
except Exception as e:
    print(f"\n‚ùå IMPORT FAILED: {e}")
    print(f"Error type: {type(e).__name__}")
    import traceback
    print("\nFull traceback:")
    traceback.print_exc()
    
    print("\nüîß RECOMMENDATIONS:")
    print("1. Run the package diagnostic cell above")
    print("2. If that fails, run the clean installation cell")
    print("3. Restart your kernel after any installations")
    print("4. Check that you're using the correct Python environment")

üß™ Testing minimal imports...
Testing basic imports...
‚úÖ Basic imports OK
Testing ChromaDB...
‚úÖ ChromaDB version: 1.3.0
Testing LlamaIndex core...
‚úÖ LlamaIndex core OK
Testing LlamaIndex vector store...
‚úÖ ChromaVectorStore OK
Testing LlamaIndex storage...
‚úÖ StorageContext and VectorStoreIndex OK
Testing Azure OpenAI embedding...
‚úÖ AzureOpenAIEmbedding OK
Testing Azure OpenAI LLM...
‚úÖ AzureOpenAI LLM OK

üéâ ALL IMPORTS SUCCESSFUL!
You can now proceed with the local index query.


In [5]:
# Utility function to initialize LlamaIndex Settings
# Run this cell first before any local index querying

def initialize_llamaindex_settings():
    """
    Initialize LlamaIndex global settings with OpenAI configuration.
    This MUST be called before querying from local index.
    """
    import os
    from dotenv import load_dotenv, find_dotenv
    from llama_index.core import Settings
    from llama_index.embeddings.openai import OpenAIEmbedding
    from llama_index.llms.openai import OpenAI as LlamaIndexOpenAI
    import warnings
    
    warnings.filterwarnings("ignore")
    load_dotenv(find_dotenv())
    
    # OpenAI configuration
    OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
    OPENAI_MODEL = os.environ.get("OPENAI_MODEL", "gpt-4o")
    OPENAI_EMBEDDING_MODEL = os.environ.get("OPENAI_EMBEDDING_MODEL", "text-embedding-3-small")
    
    # Set up LLM with standard OpenAI
    llm = LlamaIndexOpenAI(
        api_key=OPENAI_API_KEY,
        model=OPENAI_MODEL,
    )
    
    # Set up embedding model with standard OpenAI
    embedding_model = OpenAIEmbedding(
        model=OPENAI_EMBEDDING_MODEL,
        api_key=OPENAI_API_KEY,
        api_base="https://api.openai.com/v1",
    )
    
    # Set global Settings
    Settings.llm = llm
    Settings.embed_model = embedding_model
    
    print("‚úÖ LlamaIndex Settings initialized successfully with OpenAI!")
    return llm, embedding_model

# Initialize settings
llm, embedding_model = initialize_llamaindex_settings()

‚úÖ LlamaIndex Settings initialized successfully with OpenAI!


# Metadata RAG

## Text Embeddings

## LlamaIndex RAG

In [31]:
import os
from dotenv import load_dotenv, find_dotenv
import warnings
warnings.filterwarnings("ignore")
load_dotenv(find_dotenv())

# Note: SSL/HTTP configuration not needed for standard OpenAI API

True

In [7]:
import os
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI as LlamaIndexOpenAI
from llama_index.core.query_engine import RetrieverQueryEngine
import logging
import sys

In [32]:
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
OPENAI_MODEL = os.environ.get("OPENAI_MODEL")
OPENAI_EMBEDDING_MODEL = os.environ.get("OPENAI_EMBEDDING_MODEL", "text-embedding-3-large")

# Set up LLM with standard OpenAI
llm = LlamaIndexOpenAI(
    api_key=OPENAI_API_KEY,
    model=OPENAI_MODEL,
)

# Set up embedding model with standard OpenAI
embedding_model = OpenAIEmbedding(
    model=OPENAI_EMBEDDING_MODEL,
    api_key=OPENAI_API_KEY,
    api_base="https://api.openai.com/v1",
)

Settings.llm = llm
Settings.embed_model = embedding_model

In [33]:
# Define file-specific metadata

doc_sample_file = r"/Users/mzwandilemhlongo/Desktop/Data Science/PersonalProjects/ai-powered-analysis/text2sql/ai-analyst-agent/table_metadata/sample_file.txt"

file_paths = [ doc_sample_file]

def get_metadata_for_files(file_paths):
    # Create a map of file path to custom metadata
    file_metadata_map = {
        
        doc_sample_file: {
            "category": "sample File ",
            "year": "2025-11-01",
            "department": "Finance", 
            "author": "Mzwandile Mhlongo",
            "confidentiality": "high",
            "description": "This file is only used for testing"
        },
    }
    
    # The function that SimpleDirectoryReader will call
    def file_metadata_func(file_path):
        # Get predefined metadata if available, otherwise return basic metadata
        if file_path in file_metadata_map:
            return file_metadata_map[file_path]
        else:
            return {
                "source": file_path,
                "file_type": os.path.splitext(file_path)[1],
                "confidentiality": "unknown"
            }
    
    return file_metadata_func

# Create reader with specific files and their metadata
documents = SimpleDirectoryReader(
    input_files=file_paths,
    file_metadata=get_metadata_for_files(file_paths)
).load_data()

# Create index and query engine as before
index = VectorStoreIndex.from_documents(documents)



2025-11-01 08:10:21,544 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


In [34]:
# Query From Local Index
import chromadb
import sys
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import VectorStoreIndex, StorageContext  # Added Settings import
                                
# initialize client
index_path = r"..\index\chroma_db"
db = chromadb.PersistentClient(path=index_path)

# get collection
chroma_collection = db.get_or_create_collection("sql_tables_metadata")

# assign chroma as the vector_store to the context
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# load your index from stored vectors
index_local = VectorStoreIndex.from_vector_store(
                                            vector_store = vector_store, 
                                            storage_context=storage_context
                                        )

# create a query engine
local_query_engine = index_local.as_query_engine(similarity_top_k=10)

def generate_response(query):
    answer = local_query_engine.query(query)
    print("\n**Query:**\n", query)
    print("\n**Answer:**\n", answer)
    print("\n**Source:**\n", answer.get_formatted_sources())
    
    # Optionally print metadata from sources to verify it's working
    print("\n**Source Metadata:**")
    for source_node in answer.source_nodes:
        print(f"- {source_node.node.metadata}")

generate_response("generate the full table details without intepreting or editing anything: customer information full table details")

2025-11-01 08:10:28,513 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"



**Query:**
 generate the full table details without intepreting or editing anything: customer information full table details

**Answer:**
 Empty Response

**Source:**
 

**Source Metadata:**


In [14]:
# Define file-specific metadata

doc_transactions_path = r"../table_metadata/metadata_transaction_history.txt"
doc_customer_info_path = r"../table_metadata/metadata_customer_information.txt"
doc_crs_accountreport_path = r"../table_metadata/metadata_crs_account_report.txt"
doc_crs_countrycode_path = r"../table_metadata/metadata_crs_countrycode.txt"
doc_crs_messagespec_path = r"../table_metadata/metadata_crs_messagespec.txt"

file_paths = [doc_transactions_path, 
            doc_customer_info_path, 
            doc_crs_accountreport_path, 
            doc_crs_countrycode_path, 
            doc_crs_messagespec_path]

def get_metadata_for_files(file_paths):
    # Create a map of file path to custom metadata
    file_metadata_map = {
        
        doc_transactions_path: {
            "category": "transaction history table",
            "year": "2025-07-20",
            "department": "Finance", 
            "author": "Mzwandile Mhlongo",
            "confidentiality": "high",
            "description": "Comprehensive transaction history table containing all customer financial transactions including deposits, withdrawals, transfers, payments, and purchases"
        },
        doc_customer_info_path: {
            "category": "customer information table",
            "year": "2025-07-20",
            "department": "Finance", 
            "author": "Mzwandile Mhlongo",
            "confidentiality": "high",
            "description": "Comprehensive customer information data table containing personal information, financial details, loan information, and product holdings for bank customers"
        },
        doc_crs_accountreport_path: {
            "category": "Common Reporting Standard (CRS) Account Reporting",
            "year": "2025-07-20",
            "department": "Finance",
            "author": "Mzwandile Mhlongo",
            "confidentiality": "low",
            "description": "Detailed financial account reporting data in accordance with **Common Reporting Standard (CRS)** requirements. This table captures comprehensive information about account holders and their financial accounts, crucial for international tax transparency"
        },
        doc_crs_countrycode_path: {
            "category": "Common Reporting Standard (CRS) Country Codes",
            "year": "2025-07-20",
            "department": "Finance",
            "author": "Mzwandile Mhlongo",
            "confidentiality": "low",
            "description": "Comprehensive country code reference for **Common Reporting Standard (CRS)** reporting. This table provides essential mappings between various country code formats, ensuring accurate and consistent country identification across CRS data. It is based on the **ISO 3166-1 alpha-2 standard"
        },
        doc_crs_messagespec_path: {
            "category": "Common Reporting Standard (CRS) Message Specification",
            "year": "2025-07-20",
            "department": "Finance",
            "author": "Mzwandile Mhlongo",
            "confidentiality": "low",
            "description": "This table stores the crucial **header and reporting entity information** for **Common Reporting Standard (CRS) messages"
        },
    }
    
    # The function that SimpleDirectoryReader will call
    def file_metadata_func(file_path):
        # Get predefined metadata if available, otherwise return basic metadata
        if file_path in file_metadata_map:
            return file_metadata_map[file_path]
        else:
            return {
                "source": file_path,
                "file_type": os.path.splitext(file_path)[1],
                "confidentiality": "unknown"
            }
    
    return file_metadata_func

# Create reader with specific files and their metadata
documents = SimpleDirectoryReader(
    input_files=file_paths,
    file_metadata=get_metadata_for_files(file_paths)
).load_data()

# Create index and query engine as before
index = VectorStoreIndex.from_documents(documents)


2025-10-31 06:22:18,276 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


In [15]:
# query engine
query_engine = index.as_query_engine(
    similarity_top_k=10
)

In [16]:
def generate_response(query):
    answer = query_engine.query(query)
    print("\n**Query:**\n", query)
    print("\n**Answer:**\n", answer)
    print("\n**Source:**\n", answer.get_formatted_sources())
    
    # Optionally print metadata from sources to verify it's working
    print("\n**Source Metadata:**")
    for source_node in answer.source_nodes:
        print(f"- {source_node.node.metadata}")

In [17]:
generate_response("generate the full table details without intepreting or editing anything: customer information full table details")

2025-10-31 06:22:34,044 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-10-31 06:23:17,973 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-31 06:23:17,973 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"



**Query:**
 generate the full table details without intepreting or editing anything: customer information full table details

**Answer:**
 Table: [dbo].[customer_information]
Database: SQL Server (master)
Schema: dbo
Primary Key: id
Description: Comprehensive customer data table containing personal information, financial details, loan information, and product holdings for bank customers

Columns:

- id (int)
  - Description: Unique customer identifier, 8-digit number
  - Range: 10000000 to 99999999
  - Examples: 10474206, 10962741, 13765547
  - Rules: Auto-generated unique identifier for each customer

- full_name (nvarchar)
  - Description: Customer's complete name (first and last name)
  - Examples: Rachel Benitez, Samuel Anderson, Austin Perkins
  - Rules: Required field, contains customer's legal name

- email (nvarchar)
  - Description: Customer's email address for communication
  - Examples: nelsoneddie@example.net, dillonjodi@example.net
  - Rules: Must be valid email format, u

## Persist the index to disk

In [19]:
import os
from dotenv import load_dotenv, find_dotenv
import warnings
warnings.filterwarnings("ignore")
load_dotenv(find_dotenv())

import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.query_engine import RetrieverQueryEngine

# Note: LLM and embedding settings should already be configured 
# via initialize_llamaindex_settings() function from earlier cells

In [20]:
# Define file-specific metadata

doc_transactions_path = r"../table_metadata/metadata_transaction_history.txt"
doc_customer_info_path = r"../table_metadata/metadata_customer_information.txt"
doc_crs_accountreport_path = r"../table_metadata/metadata_crs_account_report.txt"
doc_crs_countrycode_path = r"../table_metadata/metadata_crs_countrycode.txt"
doc_crs_messagespec_path = r"../table_metadata/metadata_crs_messagespec.txt"

file_paths = [  doc_transactions_path, doc_customer_info_path, doc_crs_accountreport_path, doc_crs_countrycode_path, doc_crs_messagespec_path]

def get_metadata_for_files(file_paths):
    # Create a map of file path to custom metadata
    file_metadata_map = {
        
        doc_transactions_path: {
            "category": "transaction history table",
            "year": "2025-07-20",
            "department": "Finance", 
            "author": "Mzwandile Mhlongo",
            "confidentiality": "high",
            "description": "Comprehensive transaction history table containing all customer financial transactions including deposits, withdrawals, transfers, payments, and purchases"
        },
        doc_customer_info_path: {
            "category": "customer information table",
            "year": "2025-07-20",
            "department": "Finance", 
            "author": "Mzwandile Mhlongo",
            "confidentiality": "high",
            "description": "Comprehensive customer information data table containing personal information, financial details, loan information, and product holdings for bank customers"
        },
        doc_crs_accountreport_path: {
            "category": "Common Reporting Standard (CRS) Account Reporting",
            "year": "2025-07-20",
            "department": "Finance",
            "author": "Mzwandile Mhlongo",
            "confidentiality": "low",
            "description": "Detailed financial account reporting data in accordance with **Common Reporting Standard (CRS)** requirements. This table captures comprehensive information about account holders and their financial accounts, crucial for international tax transparency"
        },
        doc_crs_countrycode_path: {
            "category": "Common Reporting Standard (CRS) Country Codes",
            "year": "2025-07-20",
            "department": "Finance",
            "author": "Mzwandile Mhlongo",
            "confidentiality": "low",
            "description": "Comprehensive country code reference for **Common Reporting Standard (CRS)** reporting. This table provides essential mappings between various country code formats, ensuring accurate and consistent country identification across CRS data. It is based on the **ISO 3166-1 alpha-2 standard"
        },
        doc_crs_messagespec_path: {
            "category": "Common Reporting Standard (CRS) Message Specification",
            "year": "2025-07-20",
            "department": "Finance",
            "author": "Mzwandile Mhlongo",
            "confidentiality": "low",
            "description": "This table stores the crucial **header and reporting entity information** for **Common Reporting Standard (CRS) messages"
        },
    }
    
    # The function that SimpleDirectoryReader will call
    def file_metadata_func(file_path):
        # Get predefined metadata if available, otherwise return basic metadata
        if file_path in file_metadata_map:
            return file_metadata_map[file_path]
        else:
            return {
                "source": file_path,
                "file_type": os.path.splitext(file_path)[1],
                "confidentiality": "unknown"
            }
    
    return file_metadata_func


# load some documents
# documents = SimpleDirectoryReader("./data").load_data()

# Create reader with specific files and their metadata
documents = SimpleDirectoryReader(
    input_files=file_paths,
    file_metadata=get_metadata_for_files(file_paths)
).load_data()

# initialize client, setting path to save data
index_path = r"../index/chroma_db"
db = chromadb.PersistentClient(path=index_path)

# create collection
chroma_collection = db.get_or_create_collection("sql_tables_metadata")

# assign chroma as the vector_store to the context
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# create your index
index_persisted = VectorStoreIndex.from_documents(
    documents, 
    storage_context=storage_context
    
)


2025-10-31 06:31:51,025 - INFO - Anonymized telemetry enabled. See                     https://docs.trychroma.com/telemetry for more information.
2025-10-31 06:31:51,706 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-10-31 06:31:51,706 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


In [21]:
# create a query engine and query
my_query_engine = index_persisted.as_query_engine(
    similarity_top_k=10
)

# Create index and query engine as before
# index = VectorStoreIndex.from_documents(documents)
# query_engine = index.as_query_engine(
#     similarity_top_k=10
# )


In [22]:
def generate_metadata(query):
    answer = my_query_engine.query(query)
    print("\n**Query:**\n", query)
    print("\n**Answer:**\n", answer)
    print("\n**Source:**\n", answer.get_formatted_sources())
    
    # Optionally print metadata from sources to verify it's working
    print("\n**Source Metadata:**")
    for source_node in answer.source_nodes:
        print(f"- {source_node.node.metadata}")

In [23]:
generate_metadata("whic table contains ClosedAccount field? Return the metadata for the table that contains ClosedAccount field")

2025-10-31 06:33:03,669 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-10-31 06:33:27,495 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-31 06:33:27,495 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"



**Query:**
 whic table contains ClosedAccount field? Return the metadata for the table that contains ClosedAccount field

**Answer:**
 Table: [DATA].[CRS_GH_AccountReport]

General
- Database: SQL Server (master)
- Schema: dbo
- Full Name: [DATA].[CRS_GH_AccountReport]
- Purpose: Detailed financial account reporting data compiled for Ghana in accordance with the Common Reporting Standard (CRS). Each record represents a single reportable account within a CRS message.

Keys & Relationships
- Primary Key: (ParentID, DocRefId3) ‚Äî the composite key uniquely identifies each specific account report within a given message.
- Foreign Key: ParentID ‚Üí [DATA].[CRS_GH_MessageSpec].[ParentID] (links account reports to their parent CRS message).

Selected Column Definitions (highlights)
- ParentID (varchar(255))
  - Unique identifier linking this account report to its parent CRS message.
  - Mandatory. Must correspond to an existing ParentID in the message spec table.

- DocTypeIndic2 (varchar(2

## Load the index from the local vectorDB

In [24]:
import chromadb
import sys
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext, SimpleDirectoryReader, VectorStoreIndex #, Settings
                                

In [25]:

# initialize client
index_path = r"../index/chroma_db"
db = chromadb.PersistentClient(path=index_path)

# get collection
chroma_collection = db.get_or_create_collection("sql_tables_metadata")

# assign chroma as the vector_store to the context
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# load your index from stored vectors
index = VectorStoreIndex.from_vector_store(
                                            vector_store, 
                                            storage_context=storage_context
                                        )

# create a query engine
query_engine = index.as_query_engine(similarity_top_k=10)


In [26]:
def generate_response(query):
    answer = query_engine.query(query)
    print("\n**Query:**\n", query)
    print("\n**Answer:**\n", answer)
    print("\n**Source:**\n", answer.get_formatted_sources())
    
    # Optionally print metadata from sources to verify it's working
    print("\n**Source Metadata:**")
    for source_node in answer.source_nodes:
        print(f"- {source_node.node.metadata}")

In [27]:
generate_response("which table contains ClosedAccount field? Return the metadata for the table that contains ClosedAccount field")

2025-10-31 06:38:17,928 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-10-31 06:38:44,109 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-31 06:38:44,109 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"



**Query:**
 which table contains ClosedAccount field? Return the metadata for the table that contains ClosedAccount field

**Answer:**
 Table: [DATA].[CRS_GH_AccountReport] (SQL Server; master; schema: dbo)

Primary key
- Composite: (ParentID, DocRefId3)

Foreign key
- ParentID ‚Üí [DATA].[CRS_GH_MessageSpec].[ParentID]

Purpose
- Detailed financial account reporting records (single reportable account per row) prepared for CRS reporting.

Selected column metadata (types, description, examples, rules)

- ParentID (varchar(255))
  - Description: Links this account report to its parent CRS message specification.
  - Examples: 110121c2-6227-4433-a802-54c0ed699405
  - Rules: Mandatory; must exist in [DATA].[CRS_GH_MessageSpec].[ParentID].

- DocTypeIndic2 (varchar(255))
  - Description: Type of document for the account report (OECD CRS XML).
  - Valid values: OECD1 (New Data), OECD2 (Corrected Data), OECD3 (Void Data)
  - Rules: Mandatory.

- DocRefId3 (varchar(255))
  - Description: Uniqu

In [28]:
generate_response("generate the full table details without intepreting or editing anything: customer information full table details")

2025-10-31 06:38:44,720 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-10-31 06:39:28,652 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-10-31 06:39:28,652 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"



**Query:**
 generate the full table details without intepreting or editing anything: customer information full table details

**Answer:**
 Database: SQL Server (master)
Schema: dbo
Table: customer_information
Description: Comprehensive customer data table containing personal information, financial details, loan information, and product holdings for bank customers
Primary Key: id
Full Name: [dbo].[customer_information]

Columns:

- id (int)
  - Description: Unique customer identifier, 8-digit number
  - Range: 10000000 to 99999999
  - Examples: 10474206, 10962741, 13765547
  - Rules: Auto-generated unique identifier for each customer

- full_name (nvarchar)
  - Description: Customer's complete name (first and last name)
  - Examples: Rachel Benitez, Samuel Anderson, Austin Perkins
  - Rules: Required field, contains customer's legal name

- email (nvarchar)
  - Description: Customer's email address for communication
  - Examples: nelsoneddie@example.net, dillonjodi@example.net
  - Rule

In [None]:
import json
import logging
import re
import json

import os
from dotenv import load_dotenv, find_dotenv
import warnings
warnings.filterwarnings("ignore")
load_dotenv(find_dotenv())

import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI as LlamaIndexOpenAI

# OpenAI Configuration
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
OPENAI_MODEL = os.environ.get("OPENAI_MODEL", "gpt-4o")
OPENAI_EMBEDDING_MODEL = os.environ.get("OPENAI_EMBEDDING_MODEL", "text-embedding-3-small")

# Set up LLM with standard OpenAI
llm = LlamaIndexOpenAI(
    api_key=OPENAI_API_KEY,
    model=OPENAI_MODEL,
)

# Set up embedding model with standard OpenAI
embedding_model = OpenAIEmbedding(
    model=OPENAI_EMBEDDING_MODEL,
    api_key=OPENAI_API_KEY,
    api_base="https://api.openai.com/v1",
)

Settings.llm = llm
Settings.embed_model = embedding_model

# Initialize ChromaDB client
index_path = r"C:\Users\A238737\OneDrive - Standard Bank\Documents\GroupFunctions\rag-systems\ai-analyst-demo\text_sql_analysis\index\chroma_db"
db = chromadb.PersistentClient(path=index_path)

# Get collection
chroma_collection = db.get_or_create_collection("sql_tables_metadata")
print(f"Chroma collection '{chroma_collection.name}' loaded successfully.")

# Assign chroma as the vector_store to the context
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Load your index from stored vectors
index = VectorStoreIndex.from_vector_store(
    vector_store, storage_context=storage_context
)

# Create a query engine
query_engine = index.as_query_engine(similarity_top_k=10)

def agent_table_rag(query):
    return query_engine.query("retrieve the full tables metadata without intepreting or editing anything for the following given tables: " + query)

def check_and_display_collection_info(client_db, collection_name_to_check):
    """
    Checks if a ChromaDB collection exists and displays its content if it does.
    Also lists all available collections.
    """
    print("\n--- ChromaDB Collection Information ---")

    # 1. List all collections
    print("Listing all collections:")
    all_collections = client_db.list_collections()
    if all_collections:
        for col in all_collections:
            print(f"- {col.name}")
    else:
        print("No collections found in the database.")

    # 2. Check if the specified collection exists
    print(f"\nChecking if collection '{collection_name_to_check}' exists...")
    try:
        target_collection = client_db.get_collection(collection_name_to_check)
        print(f"Collection '{collection_name_to_check}' exists!")

        # 3. Get and display some content from the index
        print(f"\nDisplaying first 5 items from '{collection_name_to_check}':")
        collection_content = target_collection.peek(limit=5)

        if collection_content and collection_content.get('documents'):
            for i, doc in enumerate(collection_content['documents']):
                print(f"--- Item {i+1} ---")
                print(f"ID: {collection_content['ids'][i]}")
                print(f"Document: {doc}")
                if collection_content.get('metadatas') and collection_content['metadatas'][i]:
                    print(f"Metadata: {json.dumps(collection_content['metadatas'][i], indent=2)}")
                print("-" * 20)
        else:
            print(f"Collection '{collection_name_to_check}' is empty or has no documents.")

    except chromadb.exceptions.CollectionNotFoundError:
        print(f"Collection '{collection_name_to_check}' does NOT exist.")
    except Exception as e:
        print(f"An error occurred while accessing collection '{collection_name_to_check}': {e}")
    print("-------------------------------------")


if __name__ == "__main__":
    # Call the new function to check and display collection info
    check_and_display_collection_info(db, "sql_tables_metadata")

    user_request = "customer information"
    try:
        print("\nRetrieving relevant tables for the request:", user_request)
        table_results = query_engine.query("retrieve the full tables metadata without intepreting or editing anything for the following given tables: " + user_request)
        print("Type of result:", type(table_results))
        print("Raw result:", repr(table_results))
        if not table_results:
            print("No results returned from query.")
        else:
            # If it's not a string, print its attributes
            if not isinstance(table_results, str):
                print("Result attributes:", dir(table_results))
                # Try to print a 'response' or 'text' attribute if present
                if hasattr(table_results, 'response'):
                    print("Response attribute:", table_results.response)
                if hasattr(table_results, 'text'):
                    print("Text attribute:", table_results.text)
            print("Relevant tables retrieved successfully!")
            print(table_results)
    except Exception as e:
        print(f"Error retrieving tables: {e}")

Chroma collection 'sql_tables_metadata' loaded successfully.

--- ChromaDB Collection Information ---
Listing all collections:
- sql_tables_metadata

Checking if collection 'sql_tables_metadata' exists...
Collection 'sql_tables_metadata' exists!

Displaying first 5 items from 'sql_tables_metadata':
--- Item 1 ---
ID: 7d83553b-a671-4803-964f-03d3545900d2
Document: <begin transaction history metadata>

# Database Schema Information

## Database Details
- **Database**: SQL Server (master)
- **Server**: localhost\SQLEXPRESS
- **Schema**: dbo

## Table: transaction_history
Comprehensive transaction history table containing all customer financial transactions including deposits, withdrawals, transfers, payments, and purchases

### Table Structure
**Full Name**: [dbo].[transaction_history]
**Records**: 5000+ transactions
**Primary Key**: transaction_id
**Foreign Keys**: customer_id
**Time Range**: Last 2 years of transaction data

### Column Definitions

**transaction_id** (bigint)
- Descript

In [None]:
import json

def agent_table_rag(query):
    return query_engine.query("retrieve the full tables metadata without intepreting or editing anything for the following given tables: " + query)

def check_and_display_collection_info(client_db, collection_name_to_check):
    """
    Checks if a ChromaDB collection exists and displays its content if it does.
    Also lists all available collections.
    """
    print("\n--- ChromaDB Collection Information ---")

    # 1. List all collections
    print("Listing all collections:")
    all_collections = client_db.list_collections()
    if all_collections:
        for col in all_collections:
            print(f"- {col.name}")
    else:
        print("No collections found in the database.")

    # 2. Check if the specified collection exists
    print(f"\nChecking if collection '{collection_name_to_check}' exists...")
    try:
        target_collection = client_db.get_collection(collection_name_to_check)
        print(f"Collection '{collection_name_to_check}' exists!")

        # 3. Get and display some content from the index
        print(f"\nDisplaying first 5 items from '{collection_name_to_check}':")
        # Use peek() to get a small sample of the collection's contents
        # It returns a dictionary with 'ids', 'embeddings', 'metadatas', 'documents'
        # We are primarily interested in 'documents' and 'metadatas' for content.
        collection_content = target_collection.peek(limit=5)

        if collection_content and collection_content.get('documents'):
            for i, doc in enumerate(collection_content['documents']):
                print(f"--- Item {i+1} ---")
                print(f"ID: {collection_content['ids'][i]}")
                print(f"Document: {doc}")
                if collection_content.get('metadatas') and collection_content['metadatas'][i]:
                    print(f"Metadata: {json.dumps(collection_content['metadatas'][i], indent=2)}")
                print("-" * 20)
        else:
            print(f"Collection '{collection_name_to_check}' is empty or has no documents.")

    except chromadb.exceptions.CollectionNotFoundError:
        print(f"Collection '{collection_name_to_check}' does NOT exist.")
    except Exception as e:
        print(f"An error occurred while accessing collection '{collection_name_to_check}': {e}")
    print("-------------------------------------")


if __name__ == "__main__":
    # Call the new function to check and display collection info
    check_and_display_collection_info(db, "sql_tables_metadata")
    # check_and_display_collection_info(db, "non_existent_collection") # Example for a non-existent collection

    user_request = "customer information"
    try:
        print("\nRetrieving relevant tables for the request:", user_request)
        table_results = agent_table_rag(user_request)
        print("Type of result:", type(table_results))
        print("Raw result:", repr(table_results))
        if not table_results:
            print("No results returned from query.")
        else:
            # If it's not a string, print its attributes
            if not isinstance(table_results, str):
                print("Result attributes:", dir(table_results))
                # Try to print a 'response' or 'text' attribute if present
                if hasattr(table_results, 'response'):
                    print("Response attribute:", table_results.response)
                if hasattr(table_results, 'text'):
                    print("Text attribute:", table_results.text)
            print("Relevant tables retrieved successfully!")
            print(table_results)
    except Exception as e:
        print(f"Error retrieving tables: {e}")


--- ChromaDB Collection Information ---
Listing all collections:
- sql_tables_metadata

Checking if collection 'sql_tables_metadata' exists...
Collection 'sql_tables_metadata' exists!

Displaying first 5 items from 'sql_tables_metadata':
--- Item 1 ---
ID: ffe71549-4d19-4da1-9813-8454a82b6dd4
Document: <begin transaction history metadata>

# Database Schema Information

## Database Details
- **Database**: SQL Server (master)
- **Server**: localhost\SQLEXPRESS
- **Schema**: dbo

## Table: transaction_history
Comprehensive transaction history table containing all customer financial transactions including deposits, withdrawals, transfers, payments, and purchases

### Table Structure
**Full Name**: [dbo].[transaction_history]
**Records**: 5000+ transactions
**Primary Key**: transaction_id
**Foreign Keys**: customer_id
**Time Range**: Last 2 years of transaction data

### Column Definitions

**transaction_id** (bigint)
- Description: Unique transaction identifier, 12-digit number
- Range: 

In [None]:
generate_response("generate the full table details without intepreting or editing anything: transaction history full table details")


**Query:**
 generate the full table details without intepreting or editing anything: transaction history full table details

**Answer:**
 Database: SQL Server (master)  
Server: localhost\SQLEXPRESS  
Schema: dbo  

Table: [dbo].[transaction_history]  
Description: Comprehensive transaction history table containing all customer financial transactions including deposits, withdrawals, transfers, payments, and purchases  
Records: 5000+ transactions  
Primary Key: transaction_id  
Foreign Keys: customer_id  
Time Range: Last 2 years of transaction data  

Column Definitions:

- transaction_id (bigint)  
  - Description: Unique transaction identifier, 12-digit number  
  - Range: 100000000000 to 999999999999  
  - Examples: 679551814302, 376513881618, 994709101726  
  - Rules: Auto-generated unique identifier for each transaction  

- customer_id (int)  
  - Description: Customer identifier linking to customer_information table  
  - Range: 10000000 to 99999999  
  - Examples: 10000001, 1

In [48]:
generate_response("generate the full metadata for the following tables without intepreting or editing anything: [transaction_history, customer_information ]")

2025-11-01 11:48:58,262 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"



**Query:**
 generate the full metadata for the following tables without intepreting or editing anything: [transaction_history, customer_information ]

**Answer:**
 Empty Response

**Source:**
 

**Source Metadata:**


# Index Bank Products Catalog
## Create ChromaDB Collection for Products

This section indexes the bank products catalog and persists it to ChromaDB with the collection name "products".

In [35]:
# Define file path for bank products catalog
products_file_path = r"../documents/bank_products_services.txt"

# Verify file exists
import os
if os.path.exists(products_file_path):
    print(f"‚úÖ Products file found: {products_file_path}")
    with open(products_file_path, 'r', encoding='utf-8') as f:
        content = f.read()
        print(f"üìÑ File size: {len(content)} characters")
        print(f"üìÑ Preview (first 500 chars):\n{content[:500]}...")
else:
    print(f"‚ùå Products file not found: {products_file_path}")
    print("Please ensure the bank_products_services.txt file exists in the documents folder")

‚úÖ Products file found: ../documents/bank_products_services.txt
üìÑ File size: 18596 characters
üìÑ Preview (first 500 chars):
# Bank Products and Services Catalog
## Comprehensive Guide to Our Financial Products

Last Updated: November 1, 2025
Department: Product Management
Classification: Public

---

## SAVINGS ACCOUNTS

### 1. Essential Savings Account
**Product Code:** SAV-001
**Description:** Basic savings account for everyday banking needs

**Eligibility Criteria:**
- Minimum Age: 18 years
- Income Requirement: None
- Credit Score: Not required
- Employment Status: Any (students, employed, self-employed, unemploy...


In [40]:
# Create metadata function for bank products
def get_products_metadata(file_path):
    """
    Define metadata for bank products catalog
    """
    return {
        "category": "bank_products_catalog",
        "year": "2025",
        "department": "Product Management",
        "author": "Product Team",
        "confidentiality": "public",
        "description": "Comprehensive bank products and services catalog including savings accounts, checking accounts, credit cards, loans, mortgages, and investment products",
        "product_categories": "savings_accounts, checking_accounts, credit_cards, personal_loans, home_loans, auto_loans, business_accounts, investment_products, specialty_accounts",
        "last_updated": "2025-11-01"
    }

# Load the products document with metadata
products_documents = SimpleDirectoryReader(
    input_files=[products_file_path],
    file_metadata=lambda fp: get_products_metadata(fp)
).load_data()

print(f"‚úÖ Loaded {len(products_documents)} document(s)")
print(f"üìä Document metadata: {products_documents[0].metadata if products_documents else 'N/A'}")

‚úÖ Loaded 1 document(s)
üìä Document metadata: {'category': 'bank_products_catalog', 'year': '2025', 'department': 'Product Management', 'author': 'Product Team', 'confidentiality': 'public', 'description': 'Comprehensive bank products and services catalog including savings accounts, checking accounts, credit cards, loans, mortgages, and investment products', 'product_categories': 'savings_accounts, checking_accounts, credit_cards, personal_loans, home_loans, auto_loans, business_accounts, investment_products, specialty_accounts', 'last_updated': '2025-11-01'}


In [41]:
# Initialize ChromaDB client for products collection
products_index_path = r"../index/chroma_db"
products_db = chromadb.PersistentClient(path=products_index_path)

# Create or get the products collection
print("üîÑ Creating 'products' collection...")
try:
    # Try to delete existing collection if it exists
    products_db.delete_collection("products")
    print("üóëÔ∏è  Deleted existing 'products' collection")
except:
    print("‚ÑπÔ∏è  No existing 'products' collection to delete")

# Create new products collection
products_collection = products_db.create_collection("products")
print(f"‚úÖ Created new 'products' collection")

# Set up vector store and storage context
products_vector_store = ChromaVectorStore(chroma_collection=products_collection)
products_storage_context = StorageContext.from_defaults(vector_store=products_vector_store)

print("‚úÖ ChromaDB vector store configured for products")

üîÑ Creating 'products' collection...
üóëÔ∏è  Deleted existing 'products' collection
‚úÖ Created new 'products' collection
‚úÖ ChromaDB vector store configured for products


In [42]:
# Create and persist the products index
print("üî® Building products vector index...")
print("‚è≥ This may take a few moments as we embed the products catalog...")

products_index = VectorStoreIndex.from_documents(
    products_documents,
    storage_context=products_storage_context,
    show_progress=True
)

print("‚úÖ Products index created and persisted successfully!")
print(f"üìÅ Index location: {products_index_path}")
print(f"üì¶ Collection name: products")
print(f"üî¢ Total documents indexed: {len(products_documents)}")

üî® Building products vector index...
‚è≥ This may take a few moments as we embed the products catalog...


Parsing nodes: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:00<00:00, 72.74it/s]
Parsing nodes: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [00:00<00:00, 72.74it/s]
Generating embeddings:   0%|          | 0/7 [00:00<?, ?it/s]2025-11-01 08:22:17,516 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-11-01 08:22:17,516 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Generating embeddings: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 7/7 [00:01<00:00,  5.00it/s]

‚úÖ Products index created and persisted successfully!
üìÅ Index location: ../index/chroma_db
üì¶ Collection name: products
üî¢ Total documents indexed: 1





In [43]:
# Verify the products collection was created
print("üîç Verifying products collection...")

# List all collections in the database
all_collections = products_db.list_collections()
print(f"\nüìã All collections in database:")
for col in all_collections:
    print(f"  - {col.name} ({col.count()} items)")

# Check products collection specifically
products_collection_check = products_db.get_collection("products")
print(f"\n‚úÖ Products collection verified:")
print(f"  - Name: {products_collection_check.name}")
print(f"  - Item count: {products_collection_check.count()}")

# Peek at first few items
if products_collection_check.count() > 0:
    sample = products_collection_check.peek(limit=3)
    print(f"\nüìÑ Sample from products collection (first 3 items):")
    for i, doc in enumerate(sample['documents'][:3]):
        print(f"\n  Item {i+1}:")
        print(f"    ID: {sample['ids'][i]}")
        print(f"    Document preview: {doc[:200]}...")
        if sample.get('metadatas') and sample['metadatas'][i]:
            print(f"    Metadata: {sample['metadatas'][i]}")
else:
    print("‚ö†Ô∏è Products collection is empty!")

üîç Verifying products collection...

üìã All collections in database:
  - sql_tables_metadata (14 items)
  - products (7 items)

‚úÖ Products collection verified:
  - Name: products
  - Item count: 7

üìÑ Sample from products collection (first 3 items):

  Item 1:
    ID: 75961986-86cd-459a-a2a8-bcc03a477eeb
    Document preview: # Bank Products and Services Catalog
## Comprehensive Guide to Our Financial Products

Last Updated: November 1, 2025
Department: Product Management
Classification: Public

---

## SAVINGS ACCOUNTS

#...
    Metadata: {'category': 'bank_products_catalog', 'department': 'Product Management', '_node_type': 'TextNode', 'document_id': 'de680805-6173-40c9-aeb4-ef440bffff8f', 'last_updated': '2025-11-01', 'product_categories': 'savings_accounts, checking_accounts, credit_cards, personal_loans, home_loans, auto_loans, business_accounts, investment_products, specialty_accounts', '_node_content': '{"id_": "75961986-86cd-459a-a2a8-bcc03a477eeb", "embedding": null, "

## Test Products Query Engine

Now let's test querying the products collection to ensure it works correctly.

In [44]:
# Create a query engine for products
products_query_engine = products_index.as_query_engine(similarity_top_k=5)

print("‚úÖ Products query engine created")
print("üîç Ready to query bank products!")

‚úÖ Products query engine created
üîç Ready to query bank products!


In [45]:
# Helper function to query products
def query_products(query_text):
    """
    Query the products collection and display results
    """
    print(f"\n{'='*80}")
    print(f"üîç QUERY: {query_text}")
    print(f"{'='*80}\n")
    
    response = products_query_engine.query(query_text)
    
    print("üìù ANSWER:")
    print(response)
    
    print(f"\n{'='*80}")
    print("üìö SOURCES:")
    print(response.get_formatted_sources())
    
    # Show source metadata
    if hasattr(response, 'source_nodes') and response.source_nodes:
        print(f"\n{'='*80}")
        print("üìä SOURCE METADATA & RELEVANCE:")
        for i, node in enumerate(response.source_nodes[:3], 1):
            print(f"\n  Source {i}:")
            print(f"    Relevance Score: {node.score:.4f}")
            if hasattr(node, 'metadata') and node.metadata:
                print(f"    Metadata: {node.metadata}")
            print(f"    Text Preview: {node.text[:200]}...")
    
    print(f"\n{'='*80}\n")
    
    return response

In [46]:
# Test Query 1: Savings account query
query_products("I make $45,000 per year and want to open a savings account with good interest rates. What options do I have?")


üîç QUERY: I make $45,000 per year and want to open a savings account with good interest rates. What options do I have?



2025-11-01 08:22:40,863 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-11-01 08:23:39,032 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-11-01 08:23:39,032 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


üìù ANSWER:
You have several savings options that fit a $45,000 annual income. Which is best depends on how much you‚Äôll deposit, whether you need frequent access to funds, and your credit score. Summary of the strongest options:

1. Premium Savings Account (SAV-002)
- Eligibility: age 21+, $30,000+ annual income, credit score 650+, employed or self‚Äëemployed with verifiable income.  
- Key points: $1,000 minimum to open; $2,500 minimum monthly balance; APY 4.25%; $15 monthly fee (waived with >$5,000); unlimited free ATM; premium features and quarterly interest bonuses for balances >$10,000.  
- Best if: you meet the credit/age requirements and want one of the highest APYs without locking funds.

2. Money Market Account (MMA-001)
- Eligibility: age 18+, no income or credit requirement.  
- Key points: $2,500 minimum opening and to earn interest; tiered APY:
  - $2,500‚Äì$9,999: 3.00%
  - $10,000‚Äì$24,999: 3.50%
  - $25,000‚Äì$99,999: 4.00%
  - $100,000+: 4.50%
  $12 monthly fee (wa

Response(response='You have several savings options that fit a $45,000 annual income. Which is best depends on how much you‚Äôll deposit, whether you need frequent access to funds, and your credit score. Summary of the strongest options:\n\n1. Premium Savings Account (SAV-002)\n- Eligibility: age 21+, $30,000+ annual income, credit score 650+, employed or self‚Äëemployed with verifiable income.  \n- Key points: $1,000 minimum to open; $2,500 minimum monthly balance; APY 4.25%; $15 monthly fee (waived with >$5,000); unlimited free ATM; premium features and quarterly interest bonuses for balances >$10,000.  \n- Best if: you meet the credit/age requirements and want one of the highest APYs without locking funds.\n\n2. Money Market Account (MMA-001)\n- Eligibility: age 18+, no income or credit requirement.  \n- Key points: $2,500 minimum opening and to earn interest; tiered APY:\n  - $2,500‚Äì$9,999: 3.00%\n  - $10,000‚Äì$24,999: 3.50%\n  - $25,000‚Äì$99,999: 4.00%\n  - $100,000+: 4.50%\n 

In [47]:
# Test Query 2: Credit card query
query_products("What credit cards are available for someone with a 680 credit score?")


üîç QUERY: What credit cards are available for someone with a 680 credit score?



2025-11-01 08:23:39,642 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-11-01 08:23:54,807 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-11-01 08:23:54,807 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


üìù ANSWER:
You qualify for the Rewards Credit Card (CC-002).

Key eligibility (high-level)
- Minimum age: 21
- Income: $35,000 annually
- Credit score: 670‚Äì739 (your 680 falls in this range)
- Employment: Employed or self‚Äëemployed

Key features
- Credit limit: $3,000‚Äì$15,000
- Annual fee: $95 (waived first year)
- APR: 15.99%‚Äì21.99% variable
- Cash advance fee: 5% or $10 (whichever is greater)
- Late payment fee: $40
- Foreign transaction fee: 0%
- Rewards: 3% cashback on groceries, 2% on gas, 1% on other purchases
- Sign‚Äëup bonus: $200 after spending $1,000 in first 3 months
- Purchase protection: 90 days; extended warranty: additional 1 year

Notes
- The Premium Travel Card (CC-003) requires a higher credit score (740+).
- The Starter Card (CC-001) is targeted at lower scores (580‚Äì669).

üìö SOURCES:
> Source (Doc id: c8eddfe1-e33b-4eb3-bf44-1f7c0d893ee5): Premium Travel Credit Card
**Product Code:** CC-003
**Description:** Elite travel rewards card wi...

> Source (Do

Response(response='You qualify for the Rewards Credit Card (CC-002).\n\nKey eligibility (high-level)\n- Minimum age: 21\n- Income: $35,000 annually\n- Credit score: 670‚Äì739 (your 680 falls in this range)\n- Employment: Employed or self‚Äëemployed\n\nKey features\n- Credit limit: $3,000‚Äì$15,000\n- Annual fee: $95 (waived first year)\n- APR: 15.99%‚Äì21.99% variable\n- Cash advance fee: 5% or $10 (whichever is greater)\n- Late payment fee: $40\n- Foreign transaction fee: 0%\n- Rewards: 3% cashback on groceries, 2% on gas, 1% on other purchases\n- Sign‚Äëup bonus: $200 after spending $1,000 in first 3 months\n- Purchase protection: 90 days; extended warranty: additional 1 year\n\nNotes\n- The Premium Travel Card (CC-003) requires a higher credit score (740+).\n- The Starter Card (CC-001) is targeted at lower scores (580‚Äì669).', source_nodes=[NodeWithScore(node=TextNode(id_='c8eddfe1-e33b-4eb3-bf44-1f7c0d893ee5', embedding=None, metadata={'category': 'bank_products_catalog', 'year': 

In [None]:
# Test Query 3: Business account query
query_products("I'm starting a small business and need a business checking account")

## Load Products from Persisted Index

This section shows how to load the products collection from the persisted ChromaDB index in future sessions.

In [None]:
# Load products index from persisted ChromaDB
# Use this in future sessions to query without re-indexing

# Initialize client
products_db_load = chromadb.PersistentClient(path=r"../index/chroma_db")

# Get the products collection
products_collection_load = products_db_load.get_collection("products")

print(f"‚úÖ Loaded products collection")
print(f"  - Collection name: {products_collection_load.name}")
print(f"  - Item count: {products_collection_load.count()}")

# Create vector store and load index
products_vector_store_load = ChromaVectorStore(chroma_collection=products_collection_load)
products_storage_context_load = StorageContext.from_defaults(vector_store=products_vector_store_load)

# Load index from vector store
products_index_load = VectorStoreIndex.from_vector_store(
    products_vector_store_load,
    storage_context=products_storage_context_load
)

# Create query engine
products_query_engine_load = products_index_load.as_query_engine(similarity_top_k=5)

print("‚úÖ Products query engine loaded from persisted index")
print("üîç Ready to query!")

In [None]:
# Test query from loaded index
response = products_query_engine_load.query("What mortgage options are available for first-time homebuyers?")

print("\nüîç Query: What mortgage options are available for first-time homebuyers?")
print("\nüìù Answer:")
print(response)