# LlamaIndex Kernel Crash Troubleshooting Guide

## 🚨 Kernel Crash at Step 3 (LlamaIndex Import)

Your kernel is crashing during the LlamaIndex import phase. This is a common issue that can be caused by:

### **Most Common Causes:**
1. **Package Version Conflicts** - Incompatible versions of LlamaIndex components
2. **Missing Dependencies** - Some required packages are not installed
3. **Memory Issues** - Insufficient memory for loading large models
4. **Environment Conflicts** - Multiple Python environments or conflicting packages
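
For the first cause in the list above, pip's own `check` command reports installed packages whose declared requirements are not satisfied. A minimal sketch, assuming pip is available in the kernel's environment (the function name `find_dependency_conflicts` is illustrative and not part of this notebook):

```python
import subprocess
import sys


def find_dependency_conflicts():
    """Run `pip check` with the interpreter that backs this kernel.

    Returns (ok, report): ok is True when pip exits with code 0,
    which means it found no broken requirements.
    """
    result = subprocess.run(
        [sys.executable, "-m", "pip", "check"],
        capture_output=True,
        text=True,
    )
    report = (result.stdout + result.stderr).strip()
    return result.returncode == 0, report


if __name__ == "__main__":
    ok, report = find_dependency_conflicts()
    print("No conflicts reported" if ok else report)
```

Using `sys.executable` rather than a bare `pip` keeps the check tied to the same environment the notebook kernel runs in, which matters for the fourth cause as well.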

### **Troubleshooting Steps:**

#### **Step 1: Run Package Diagnostic**
Run the "PACKAGE DIAGNOSTIC & INSTALLATION CELL" below first.

#### **Step 2: Test Minimal Imports**
Run the "MINIMAL IMPORT TEST" to identify which specific import is failing.

#### **Step 3: Clean Installation (if needed)**
If imports still fail, run the "ALTERNATIVE LLAMAINDEX INSTALLATION" cell.

#### **Step 4: Restart Kernel**
After any package installation, **always restart your kernel**.

#### **Step 5: Use Kernel-Safe Query**
Finally, use the "KERNEL-SAFE LOCAL QUERY" cell which includes memory management.
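
The restart in Step 4 matters because a module that is already imported stays in memory after pip replaces its files on disk. The sketch below compares the version a loaded module reports with the version currently installed. It assumes the module exposes `__version__` (as `chromadb` does in the minimal import test); the helper name `restart_needed` is hypothetical.

```python
import sys
from importlib import metadata


def restart_needed(module_name, distribution_name):
    """Return True when an already-imported module reports a different
    version than the distribution currently installed on disk."""
    module = sys.modules.get(module_name)
    if module is None:
        return False  # not imported yet, so nothing stale is loaded
    loaded = getattr(module, "__version__", None)
    if loaded is None:
        return False  # module exposes no version to compare against
    try:
        installed = metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return True  # loaded in memory but no longer installed
    return loaded != installed
```

For example, `restart_needed("chromadb", "chromadb")` returning `True` after an installation cell would suggest the kernel is still running the old code.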

### **Expected Flow:**
1. ✅ Environment variables found (Step 2) - **COMPLETED**
2. ❌ LlamaIndex imports (Step 3) - **FAILING HERE**
3. ⏸️ Azure OpenAI configuration (Step 4) - Pending
4. ⏸️ ChromaDB connection (Step 5) - Pending
5. ⏸️ Query execution (Step 6) - Pending
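
The environment-variable step in this flow can be checked without printing any secret values. A small sketch, using the variable names that the configuration cells in this notebook read; the helper `missing_env_vars` itself is an illustrative name:

```python
import os

# Variable names taken from the configuration cells in this notebook.
REQUIRED_VARS = [
    "AZURE_OPENAI_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
    "AZURE_OPENAI_VERSION",
    "AZURE_OPENAI_EMBEDDING_ENDPOINT",
    "AZURE_OPENAI_EMBEDDING_KEY",
    "AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME",
    "AZURE_OPENAI_EMBEDDING_API_VERSION",
]


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty, never their values."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]
```

Calling `missing_env_vars()` after `load_dotenv(find_dotenv())` gives a list that is empty when every setting is present.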

---

In [1]:
# PACKAGE DIAGNOSTIC & INSTALLATION CELL
# Run this first to diagnose and fix LlamaIndex import issues

import sys
import subprocess
from importlib import metadata
from packaging import version

def check_and_install_packages():
    """
    Comprehensive package check and installation for LlamaIndex ecosystem
    """
    
    print("🔍 Python Environment Information:")
    print(f"Python Version: {sys.version}")
    print(f"Python Executable: {sys.executable}")
    print(f"Platform: {sys.platform}")
    print("="*60)
    
    # Required packages mapped to a minimum version (None = any version)
    required_packages = {
        'llama-index': '0.11.20',  # Core package
        'llama-index-core': None,   # Core components
        'llama-index-embeddings-azure-openai': None,  # Azure OpenAI embeddings
        'llama-index-llms-azure-openai': None,  # Azure OpenAI LLM
        'llama-index-vector-stores-chroma': None,  # Chroma vector store
        'chromadb': '0.4.24',  # ChromaDB
        'openai': '1.3.0',  # OpenAI client
        'httpx': None,  # HTTP client
        'python-dotenv': None,  # Environment variables
    }
    
    print("🔍 Checking installed packages...")
    installed_packages = {}
    missing_packages = []
    outdated_packages = []
    
    # Check what's currently installed
    for package_name, min_version in required_packages.items():
        try:
            # importlib.metadata replaces the deprecated pkg_resources API
            installed_version = metadata.version(package_name)
            installed_packages[package_name] = installed_version
            
            print(f"✅ {package_name}: {installed_version}")
            
            # Check if version is adequate
            if min_version and version.parse(installed_version) < version.parse(min_version):
                outdated_packages.append((package_name, installed_version, min_version))
                
        except metadata.PackageNotFoundError:
            print(f"❌ {package_name}: NOT INSTALLED")
            missing_packages.append(package_name)
        except Exception as e:
            print(f"⚠️ {package_name}: Error checking - {e}")
            missing_packages.append(package_name)
    
    print("\n" + "="*60)
    
    # Report issues
    if missing_packages:
        print(f"❌ Missing packages: {missing_packages}")
    
    if outdated_packages:
        print(f"⚠️ Outdated packages: {outdated_packages}")
    
    if not missing_packages and not outdated_packages:
        print("✅ All required packages are installed with adequate versions!")
        return True
    
    # Install/upgrade packages
    packages_to_install = missing_packages + [pkg[0] for pkg in outdated_packages]
    
    if packages_to_install:
        print(f"\n🚀 Installing/upgrading packages: {packages_to_install}")
        
        for package in packages_to_install:
            try:
                print(f"Installing {package}...")
                subprocess.check_call([sys.executable, "-m", "pip", "install", "--upgrade", package])
                print(f"✅ {package} installed successfully")
            except subprocess.CalledProcessError as e:
                print(f"❌ Failed to install {package}: {e}")
                return False
    
    return True

def test_imports():
    """
    Test critical imports one by one to identify the problem
    """
    print("\n🧪 Testing imports one by one...")
    
    imports_to_test = [
        ("chromadb", "import chromadb"),
        ("llama_index.core", "from llama_index.core import Settings"),
        ("llama_index.vector_stores.chroma", "from llama_index.vector_stores.chroma import ChromaVectorStore"),
        ("llama_index.core.storage", "from llama_index.core import StorageContext, VectorStoreIndex"),
        ("llama_index.embeddings.azure_openai", "from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding"),
        ("llama_index.llms.azure_openai", "from llama_index.llms.azure_openai import AzureOpenAI"),
    ]
    
    failed_imports = []
    
    for name, import_statement in imports_to_test:
        try:
            exec(import_statement)
            print(f"✅ {name}: SUCCESS")
        except Exception as e:
            print(f"❌ {name}: FAILED - {e}")
            failed_imports.append((name, str(e)))
    
    if failed_imports:
        print(f"\n❌ Failed imports detected: {len(failed_imports)}")
        for name, error in failed_imports:
            print(f"  - {name}: {error}")
        return False
    else:
        print("\n✅ All imports successful!")
        return True

# Run the diagnostics
print("🚀 Starting Package Diagnostic...")
packages_ok = check_and_install_packages()

if packages_ok:
    print("\n" + "="*60)
    imports_ok = test_imports()
    
    if imports_ok:
        print("\n🎉 SUCCESS! All packages installed and imports working.")
        print("You can now proceed to run the local index query.")
    else:
        print("\n❌ Import issues detected. Check the error messages above.")
else:
    print("\n❌ Package installation issues detected.")

print("\n🔄 Please restart your kernel after running this cell if packages were installed.")

🚀 Starting Package Diagnostic...
🔍 Python Environment Information:
Python Version: 3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]
Python Executable: c:\Users\A238737\OneDrive - Standard Bank\Documents\GroupFunctions\rag-systems\ai-analyst-demo\venv\Scripts\python.exe
Platform: win32
🔍 Checking installed packages...
✅ llama-index: 0.12.50
✅ llama-index-core: 0.12.51
✅ llama-index-embeddings-azure-openai: 0.3.9
✅ llama-index-llms-azure-openai: 0.3.4
✅ llama-index-vector-stores-chroma: 0.4.2
✅ chromadb: 1.0.15
✅ openai: 1.97.0
✅ httpx: 0.28.1
✅ python-dotenv: 1.1.1

✅ All required packages are installed with adequate versions!

🧪 Testing imports one by one...
✅ chromadb: SUCCESS
✅ llama_index.core: SUCCESS
✅ llama_index.vector_stores.chroma: SUCCESS
✅ llama_index.core.storage: SUCCESS
✅ llama_index.embeddings.azure_openai: SUCCESS
✅ llama_index.llms.azure_openai: SUCCESS

✅ All imports successful!

🎉 SUCCESS! All packages installed and imports working.
You can now proceed to run the local index query.

🔄 Please restart your kernel after running this cell if packages were installed.


In [2]:
# MINIMAL IMPORT TEST
# Run this to test imports without any complex logic

print("🧪 Testing minimal imports...")

try:
    print("Testing basic imports...")
    import os
    import sys
    print("✅ Basic imports OK")
    
    print("Testing ChromaDB...")
    import chromadb
    print(f"✅ ChromaDB version: {chromadb.__version__}")
    
    print("Testing LlamaIndex core...")
    from llama_index.core import Settings
    print("✅ LlamaIndex core OK")
    
    print("Testing LlamaIndex vector store...")
    from llama_index.vector_stores.chroma import ChromaVectorStore
    print("✅ ChromaVectorStore OK")
    
    print("Testing LlamaIndex storage...")
    from llama_index.core import StorageContext, VectorStoreIndex
    print("✅ StorageContext and VectorStoreIndex OK")
    
    print("Testing Azure OpenAI embedding...")
    from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
    print("✅ AzureOpenAIEmbedding OK")
    
    print("Testing Azure OpenAI LLM...")
    from llama_index.llms.azure_openai import AzureOpenAI
    print("✅ AzureOpenAI LLM OK")
    
    print("\n🎉 ALL IMPORTS SUCCESSFUL!")
    print("You can now proceed with the local index query.")
    
except Exception as e:
    print(f"\n❌ IMPORT FAILED: {e}")
    print(f"Error type: {type(e).__name__}")
    import traceback
    print("\nFull traceback:")
    traceback.print_exc()
    
    print("\n🔧 RECOMMENDATIONS:")
    print("1. Run the package diagnostic cell above")
    print("2. If that fails, run the clean installation cell")
    print("3. Restart your kernel after any installations")
    print("4. Check that you're using the correct Python environment")

🧪 Testing minimal imports...
Testing basic imports...
✅ Basic imports OK
Testing ChromaDB...
✅ ChromaDB version: 1.0.15
Testing LlamaIndex core...
✅ LlamaIndex core OK
Testing LlamaIndex vector store...
✅ ChromaVectorStore OK
Testing LlamaIndex storage...
✅ StorageContext and VectorStoreIndex OK
Testing Azure OpenAI embedding...
✅ AzureOpenAIEmbedding OK
Testing Azure OpenAI LLM...
✅ AzureOpenAI LLM OK

🎉 ALL IMPORTS SUCCESSFUL!
You can now proceed with the local index query.
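
A `try`/`except` block can only report failures that raise a Python exception. If an import takes the whole process down, the kernel dies before any message is printed. One way to probe for that, sketched here with a hypothetical helper named `probe_import`, is to run each import statement in a separate interpreter and inspect its exit code:

```python
import subprocess
import sys


def probe_import(statement, timeout=120):
    """Run one import statement in a fresh interpreter.

    Returns (ok, detail). A crash that would kill the notebook kernel
    shows up here as a non-zero exit code of the child process instead.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", statement],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout}s"
    if result.returncode == 0:
        return True, "ok"
    stderr_lines = result.stderr.strip().splitlines()
    last_line = stderr_lines[-1] if stderr_lines else "no error output"
    return False, f"exit code {result.returncode}: {last_line}"
```

Passing the statements used elsewhere in this notebook, for example `probe_import("import ssl, urllib3, httpx")` or `probe_import("import chromadb")`, should narrow down which one terminates the process, if any does.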


In [None]:
# ALTERNATIVE LLAMAINDEX INSTALLATION
# Run this if the previous cell didn't resolve the issue

import sys
import subprocess

def install_llamaindex_from_scratch():
    """
    Clean installation of LlamaIndex ecosystem
    """
    print("🧹 Performing clean LlamaIndex installation...")
    
    # Uninstall existing LlamaIndex packages first
    packages_to_uninstall = [
        'llama-index',
        'llama-index-core',
        'llama-index-embeddings-azure-openai',
        'llama-index-llms-azure-openai',
        'llama-index-vector-stores-chroma'
    ]
    
    print("Step 1: Uninstalling existing packages...")
    for package in packages_to_uninstall:
        try:
            result = subprocess.run([sys.executable, "-m", "pip", "uninstall", "-y", package], 
                                    capture_output=True, text=True)
            # Only report success when pip exited with code 0
            if result.returncode == 0:
                print(f"✅ Uninstalled {package}")
            else:
                print(f"⚠️ Could not uninstall {package}: {result.stderr.strip()}")
        except Exception as e:
            print(f"⚠️ Could not uninstall {package}: {e}")
    
    # Install packages in the correct order
    print("\nStep 2: Installing packages in correct order...")
    packages_to_install = [
        'llama-index-core==0.11.20',
        'llama-index-embeddings-azure-openai',
        'llama-index-llms-azure-openai', 
        'llama-index-vector-stores-chroma',
        'llama-index==0.11.20',
        'chromadb==0.4.24'
    ]
    
    for package in packages_to_install:
        try:
            print(f"Installing {package}...")
            result = subprocess.run([sys.executable, "-m", "pip", "install", package], 
                                  capture_output=True, text=True, timeout=300)
            if result.returncode == 0:
                print(f"✅ {package} installed successfully")
            else:
                print(f"❌ Failed to install {package}:")
                print(result.stderr)
                return False
        except subprocess.TimeoutExpired:
            print(f"⏱️ Installation of {package} timed out")
            return False
        except Exception as e:
            print(f"❌ Error installing {package}: {e}")
            return False
    
    print("\n🎉 Clean installation completed!")
    return True

# Uncomment the line below to run clean installation
# install_llamaindex_from_scratch()

print("⚠️ To run clean installation, uncomment the last line and run this cell.")
print("⚠️ After installation, restart your kernel before proceeding.")
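
The cell above installs one package per pip call, so each call resolves dependencies on its own. A possible alternative, sketched below, is to pass all requirements to a single `pip install` so the resolver sees them together, and to preview the outcome first. This assumes a pip version that supports `--dry-run` (22.2 or later); `build_install_command` is an illustrative helper, not part of the notebook.

```python
import sys


def build_install_command(requirements, dry_run=True):
    """Build one `pip install` command covering every requirement.

    With dry_run=True pip only reports what it would install
    (requires pip 22.2 or later).
    """
    command = [sys.executable, "-m", "pip", "install"]
    if dry_run:
        command.append("--dry-run")
    return command + list(requirements)


# Example: preview a joint resolution of the packages used in this notebook.
# import subprocess
# subprocess.run(build_install_command([
#     "llama-index",
#     "llama-index-embeddings-azure-openai",
#     "llama-index-llms-azure-openai",
#     "llama-index-vector-stores-chroma",
#     "chromadb",
# ]))
```

Whether to pin versions in that list is a separate decision; the sketch leaves them unpinned on purpose.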

In [1]:
# KERNEL-SAFE LOCAL QUERY
# This version includes protection against kernel crashes

import gc
import time
from contextlib import contextmanager

@contextmanager
def safe_execution():
    """
    Context manager to safely execute code and clean up resources
    """
    try:
        yield
    except KeyboardInterrupt:
        print("⚠️ Execution interrupted by user")
        gc.collect()
        raise
    except MemoryError:
        # The error is re-raised; this context manager does not retry
        print("❌ Memory error - clearing memory before re-raising")
        gc.collect()
        time.sleep(1)
        raise
    except Exception as e:
        print(f"❌ Execution error: {e}")
        gc.collect()
        raise
    finally:
        # Always clean up
        gc.collect()

def kernel_safe_local_query(query_text):
    """
    Kernel-safe version of local query with memory management
    """
    
    with safe_execution():
        print("🛡️ Starting kernel-safe local query...")
        
        # Import modules safely
        print("📦 Importing modules...")
        import os
        from dotenv import load_dotenv, find_dotenv
        import warnings
        import ssl
        import urllib3
        import httpx
        
        warnings.filterwarnings("ignore")
        load_dotenv(find_dotenv())
        ssl._create_default_https_context = ssl._create_unverified_context
        urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
        
        # Import LlamaIndex components with error handling
        try:
            import chromadb
            from llama_index.vector_stores.chroma import ChromaVectorStore
            from llama_index.core import StorageContext, Settings, VectorStoreIndex
            from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
            from llama_index.llms.azure_openai import AzureOpenAI
            print("✅ LlamaIndex imports successful")
        except ImportError as e:
            print(f"❌ LlamaIndex import failed: {e}")
            print("Please run the package diagnostic cell first.")
            return None
        
        # Setup HTTP client
        http_client = httpx.Client(verify=False)
        
        # Azure OpenAI configuration
        API_KEY = os.environ.get("AZURE_OPENAI_KEY")
        API_ENDPOINT = os.environ.get("AZURE_OPENAI_ENDPOINT")
        AZURE_DEPLOYMENT = os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME")
        API_VERSION = os.environ.get("AZURE_OPENAI_VERSION")
        
        if not all([API_KEY, API_ENDPOINT, AZURE_DEPLOYMENT, API_VERSION]):
            print("❌ Missing Azure OpenAI configuration")
            return None
        
        # Configure models with memory management
        print("🤖 Configuring models...")
        try:
            llm = AzureOpenAI(
                default_headers={"Ocp-Apim-Subscription-Key": API_KEY},
                api_key=API_KEY,
                azure_endpoint=API_ENDPOINT,
                azure_deployment=AZURE_DEPLOYMENT,
                api_version=API_VERSION,
                model=AZURE_DEPLOYMENT,
                http_client=http_client
            )
            
            embeddings_endpoint = os.environ.get("AZURE_OPENAI_EMBEDDING_ENDPOINT")
            embeddings_api_subscription_key = os.environ.get("AZURE_OPENAI_EMBEDDING_KEY")
            embeddings_deployment = os.environ.get("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME")
            embeddings_api_version = os.environ.get("AZURE_OPENAI_EMBEDDING_API_VERSION")
            
            embedding_model = AzureOpenAIEmbedding(
                deployment_name=embeddings_deployment,
                api_key=embeddings_api_subscription_key,
                azure_endpoint=embeddings_endpoint,
                api_version=embeddings_api_version,
                http_client=http_client
            )
            
            # Set global settings
            Settings.llm = llm
            Settings.embed_model = embedding_model
            print("✅ Models configured")
            
        except Exception as e:
            print(f"❌ Model configuration failed: {e}")
            return None
        
        # Connect to ChromaDB
        print("💾 Connecting to ChromaDB...")
        try:
            index_path = r"C:\Users\A238737\OneDrive - Standard Bank\Documents\GroupFunctions\rag-systems\ai-analyst-demo\text_sql_analysis\index\chroma_db"
            
            if not os.path.exists(index_path):
                print(f"❌ Index path does not exist: {index_path}")
                return None
            
            db = chromadb.PersistentClient(path=index_path)
            chroma_collection = db.get_or_create_collection("sql_tables_metadata")
            
            if chroma_collection.count() == 0:
                print("⚠️ Collection is empty")
                return None
                
            print(f"✅ Connected to ChromaDB ({chroma_collection.count()} documents)")
            
        except Exception as e:
            print(f"❌ ChromaDB connection failed: {e}")
            return None
        
        # Create vector store and index
        print("🔗 Creating vector store...")
        try:
            vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
            storage_context = StorageContext.from_defaults(vector_store=vector_store)
            
            index_local = VectorStoreIndex.from_vector_store(
                vector_store=vector_store,
                storage_context=storage_context
            )
            
            local_query_engine = index_local.as_query_engine(similarity_top_k=5)  # Reduced for memory
            print("✅ Query engine created")
            
        except Exception as e:
            print(f"❌ Vector store creation failed: {e}")
            return None
        
        # Execute query
        print(f"🔍 Executing query: '{query_text[:50]}...'")
        try:
            answer = local_query_engine.query(query_text)
            
            print("\n" + "="*50)
            print("🎉 QUERY SUCCESSFUL!")
            print("="*50)
            print(f"\n**Query:**\n{query_text}")
            print(f"\n**Answer:**\n{answer}")
            print(f"\n**Source:**\n{answer.get_formatted_sources()}")
            
            print("\n**Source Metadata:**")
            for i, source_node in enumerate(answer.source_nodes):
                print(f"  {i+1}. {source_node.node.metadata}")
            
            return answer
            
        except Exception as e:
            print(f"❌ Query execution failed: {e}")
            return None
        
        finally:
            # Clean up resources
            try:
                http_client.close()
            except Exception:
                pass
            gc.collect()

# Test with a simple query
print("🚀 Testing kernel-safe local query...")
result = kernel_safe_local_query("generate the full table details without interpreting or editing anything: customer information full table details")

if result:
    print("\n✅ Query completed successfully!")
else:
    print("\n❌ Query failed - check the error messages above.")

🚀 Testing kernel-safe local query...
🛡️ Starting kernel-safe local query...
📦 Importing modules...
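
The output above stops right after the import message, with no Python traceback. When a process dies that way, the standard library's `faulthandler` module can write the Python stack to a file at the moment of the fatal signal. A minimal sketch, where the log file name `kernel_crash.log` and the helper `enable_crash_log` are assumptions chosen for illustration:

```python
import faulthandler
import os
import tempfile


def enable_crash_log(path=None):
    """Ask faulthandler to dump Python tracebacks to a file on fatal errors.

    Returns (path, log_file). The file object must stay open for as long
    as faulthandler is enabled, so the caller should keep a reference.
    """
    if path is None:
        path = os.path.join(tempfile.gettempdir(), "kernel_crash.log")
    log_file = open(path, "a", encoding="utf-8")
    faulthandler.enable(file=log_file, all_threads=True)
    return path, log_file
```

Running `crash_log_path, crash_log_file = enable_crash_log()` in the first cell and reading the file after the kernel restarts should show which frame was executing when the process died.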

In [None]:
# SIMPLE ENVIRONMENT TEST - Run this first
# This cell tests basic imports and environment setup

print("🔍 Testing basic imports and environment...")

try:
    # Test basic imports
    import os
    import sys
    print("✅ Basic Python imports successful")
    
    # Test dotenv
    from dotenv import load_dotenv, find_dotenv
    load_dotenv(find_dotenv())
    print("✅ dotenv loaded successfully")
    
    # Test environment variables
    azure_key = os.environ.get("AZURE_OPENAI_KEY")
    if azure_key:
        print(f"✅ AZURE_OPENAI_KEY found (length: {len(azure_key)})")
    else:
        print("❌ AZURE_OPENAI_KEY not found")
    
    # Test ChromaDB path
    index_path = r"C:\Users\A238737\OneDrive - Standard Bank\Documents\GroupFunctions\rag-systems\ai-analyst-demo\text_sql_analysis\index\chroma_db"
    if os.path.exists(index_path):
        print(f"✅ ChromaDB path exists: {index_path}")
    else:
        print(f"❌ ChromaDB path not found: {index_path}")
    
    # Test ChromaDB import
    import chromadb
    print("✅ ChromaDB import successful")
    
    # Test LlamaIndex core imports
    from llama_index.core import Settings
    print("✅ LlamaIndex core import successful")
    
    # Test Azure OpenAI imports
    from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
    from llama_index.llms.azure_openai import AzureOpenAI
    print("✅ Azure OpenAI imports successful")
    
    print("\n🎉 All basic tests passed! Environment looks good.")
    
except Exception as e:
    print(f"\n❌ Basic test failed: {e}")
    import traceback
    traceback.print_exc()

In [1]:
# Utility function to initialize LlamaIndex Settings
# Run this cell first before any local index querying

def initialize_llamaindex_settings():
    """
    Initialize LlamaIndex global settings with Azure OpenAI configuration.
    This MUST be called before querying from local index.
    """
    import os
    from dotenv import load_dotenv, find_dotenv
    from llama_index.core import Settings
    from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
    from llama_index.llms.azure_openai import AzureOpenAI
    import ssl
    import urllib3
    import httpx
    import warnings
    
    warnings.filterwarnings("ignore")
    load_dotenv(find_dotenv())
    
    # SSL and HTTP configuration
    ssl._create_default_https_context = ssl._create_unverified_context
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    http_client = httpx.Client(verify=False)
    
    # Azure OpenAI configuration
    API_KEY = os.environ.get("AZURE_OPENAI_KEY") 
    API_ENDPOINT = os.environ.get("AZURE_OPENAI_ENDPOINT")
    AZURE_DEPLOYMENT = os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME")
    API_VERSION = os.environ.get("AZURE_OPENAI_VERSION")
    
    # Set up LLM
    llm = AzureOpenAI(
        default_headers={"Ocp-Apim-Subscription-Key": API_KEY},
        api_key=API_KEY,
        azure_endpoint=API_ENDPOINT,
        azure_deployment=AZURE_DEPLOYMENT,
        api_version=API_VERSION, 
        model=AZURE_DEPLOYMENT,
        http_client=http_client
    )
    
    # Set up embedding model
    embeddings_endpoint = os.environ.get("AZURE_OPENAI_EMBEDDING_ENDPOINT")
    embeddings_api_subscription_key = os.environ.get("AZURE_OPENAI_EMBEDDING_KEY")
    embeddings_deployment = os.environ.get("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME")
    embeddings_api_version = os.environ.get("AZURE_OPENAI_EMBEDDING_API_VERSION")
    
    embedding_model = AzureOpenAIEmbedding(
        deployment_name=embeddings_deployment,
        api_key=embeddings_api_subscription_key,
        azure_endpoint=embeddings_endpoint,
        api_version=embeddings_api_version,
        http_client=http_client
    )
    
    # Set global Settings
    Settings.llm = llm
    Settings.embed_model = embedding_model
    
    print("✅ LlamaIndex Settings initialized successfully!")
    return llm, embedding_model

# Initialize settings
llm, embedding_model = initialize_llamaindex_settings()

✅ LlamaIndex Settings initialized successfully!


# Metadata RAG

## Text Embeddings

In [2]:
# from openai import AzureOpenAI
# import os
# from dotenv import load_dotenv, find_dotenv
# import warnings
# warnings.filterwarnings("ignore")
# load_dotenv(find_dotenv())


# import ssl
# import urllib3
# import httpx

# # Disable SSL certificate verification globally (for development only!)
# ssl._create_default_https_context = ssl._create_unverified_context

# # # For requests and urllib3, suppress warnings
# urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
# # Create httpx client with SSL verification disabled
# http_client = httpx.Client(verify=False)

# embeddings_endpoint = os.environ.get("AZURE_OPENAI_EMBEDDING_ENDPOINT")
# embeddings_api_subscription_key = os.environ.get("AZURE_OPENAI_EMBEDDING_KEY")
# embeddings_model_name = os.environ.get("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME")
# embeddings_deployment = os.environ.get("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME")
# embeddings_api_version = os.environ.get("AZURE_OPENAI_EMBEDDING_API_VERSION")

# embeddings_client = AzureOpenAI(
#     api_version=embeddings_api_version,
#     azure_endpoint=embeddings_endpoint,
#     api_key= embeddings_api_subscription_key, #AzureKeyCredential("<API_KEY>")
#     http_client=http_client
# )

# response = embeddings_client.embeddings.create(
#     input=["first phrase","second phrase","third phrase"],
#     model=embeddings_model_name
# )

# for item in response.data:
#     length = len(item.embedding)
#     print(
#         f"data[{item.index}]: length={length}, "
#         f"[{item.embedding[0]}, {item.embedding[1]}, "
#         f"..., {item.embedding[length-2]}, {item.embedding[length-1]}]"
#     )
# print(response.usage)

## LlamaIndex RAG

In [None]:
# from openai import AzureOpenAI
import os
from dotenv import load_dotenv, find_dotenv
import warnings
warnings.filterwarnings("ignore")
load_dotenv(find_dotenv())

import ssl
import urllib3
import httpx

# Disable SSL certificate verification globally (for development only!)
ssl._create_default_https_context = ssl._create_unverified_context

# For requests and urllib3, suppress warnings
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
# Create httpx client with SSL verification disabled
http_client = httpx.Client(verify=False)
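
The cell above turns certificate verification off and marks that as development-only. Outside development, one option is to keep verification on and trust an additional CA bundle instead. The sketch below uses only the standard library; the environment variable name `CORPORATE_CA_BUNDLE` and the helper `build_ssl_context` are hypothetical and not defined anywhere in this notebook.

```python
import os
import ssl


def build_ssl_context(ca_bundle_env="CORPORATE_CA_BUNDLE"):
    """Create a verifying SSL context, optionally trusting an extra CA file.

    ca_bundle_env names an environment variable (hypothetical) that holds
    the path to a PEM file with additional trusted certificates.
    """
    context = ssl.create_default_context()  # verification stays enabled
    ca_bundle = os.environ.get(ca_bundle_env)
    if ca_bundle:
        context.load_verify_locations(cafile=ca_bundle)
    return context


# httpx accepts an ssl.SSLContext for its `verify` argument, so the
# client above could become: httpx.Client(verify=build_ssl_context())
```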


In [2]:
import os
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.core.query_engine import RetrieverQueryEngine
import logging
import sys

In [3]:
API_KEY = os.environ.get("AZURE_OPENAI_KEY") 
API_ENDPOINT = os.environ.get("AZURE_OPENAI_ENDPOINT")
AZURE_DEPLOYMENT = os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME")
API_VERSION = os.environ.get("AZURE_OPENAI_VERSION")
MODEL = os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME")

client = AzureOpenAI(
  default_headers={"Ocp-Apim-Subscription-Key": API_KEY},
  api_key=API_KEY,
  azure_endpoint=API_ENDPOINT,
  azure_deployment= AZURE_DEPLOYMENT,
  api_version=API_VERSION, 
  model = AZURE_DEPLOYMENT,
  http_client=http_client
)

# set up an LLM
llm = AzureOpenAI(
  default_headers={"Ocp-Apim-Subscription-Key": API_KEY},
  api_key=API_KEY,
  azure_endpoint=API_ENDPOINT,
  azure_deployment= AZURE_DEPLOYMENT,
  api_version=API_VERSION, 
  model = AZURE_DEPLOYMENT,
  http_client=http_client
)
# Embeddings Model
embeddings_endpoint = os.environ.get("AZURE_OPENAI_EMBEDDING_ENDPOINT")
embeddings_api_subscription_key = os.environ.get("AZURE_OPENAI_EMBEDDING_KEY")
embeddings_model_name = os.environ.get("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME")
embeddings_deployment = os.environ.get("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME")
embeddings_api_version = os.environ.get("AZURE_OPENAI_EMBEDDING_API_VERSION")

# Set up embedding model
embedding_model = AzureOpenAIEmbedding(
    deployment_name=embeddings_deployment,
    api_key=embeddings_api_subscription_key,
    azure_endpoint=embeddings_endpoint,
    api_version=embeddings_api_version,
    http_client=http_client
)

Settings.llm = llm
Settings.embed_model = embedding_model

In [None]:
# Define file-specific metadata

doc_sample_file = r"C:\Users\A238737\OneDrive - Standard Bank\Documents\GroupFunctions\rag-systems\ai-analyst-demo\text_sql_analysis\table_metadata\sample_file.txt"

file_paths = [ doc_sample_file]

def get_metadata_for_files(file_paths):
    # Create a map of file path to custom metadata
    file_metadata_map = {
        
        doc_sample_file: {
            "category": "sample File ",
            "year": "2025-07-20",
            "department": "Finance", 
            "author": "Mzwandile Mhlongo",
            "confidentiality": "high",
            "description": "This file is only used for testing"
        },
    }
    
    # The function that SimpleDirectoryReader will call
    def file_metadata_func(file_path):
        # Get predefined metadata if available, otherwise return basic metadata
        if file_path in file_metadata_map:
            return file_metadata_map[file_path]
        else:
            return {
                "source": file_path,
                "file_type": os.path.splitext(file_path)[1],
                "confidentiality": "unknown"
            }
    
    return file_metadata_func

# Create reader with specific files and their metadata
documents = SimpleDirectoryReader(
    input_files=file_paths,
    file_metadata=get_metadata_for_files(file_paths)
).load_data()

# Create the index from the loaded documents
index = VectorStoreIndex.from_documents(documents)
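
The metadata lookup in this cell depends on the path string passed to the callback matching a dictionary key exactly, and a dictionary silently keeps only the last entry when two keys are the same path. The sketch below is one way to guard against both, assuming the callback receives a path that may differ in case or form from the key. `make_file_metadata_func` is an illustrative name, and the fallback dictionary mirrors the one used in this cell.

```python
import os


def _normalize(path):
    """Absolute, case-normalized form so equivalent paths compare equal."""
    return os.path.normcase(os.path.abspath(str(path)))


def make_file_metadata_func(metadata_by_path):
    """Build a file_metadata callback from (path, metadata) pairs.

    Raises ValueError if two pairs refer to the same file, which a plain
    dict literal would hide by overwriting the earlier entry.
    """
    normalized = {}
    for path, meta in metadata_by_path:
        key = _normalize(path)
        if key in normalized:
            raise ValueError(f"duplicate metadata entry for {path}")
        normalized[key] = meta

    def file_metadata_func(file_path):
        key = _normalize(file_path)
        if key in normalized:
            return normalized[key]
        # Same fallback as the cell above
        return {
            "source": str(file_path),
            "file_type": os.path.splitext(str(file_path))[1],
            "confidentiality": "unknown",
        }

    return file_metadata_func
```

The returned function can be passed as `file_metadata=` to `SimpleDirectoryReader` in the same way as `get_metadata_for_files(file_paths)`.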



In [5]:
# Query From Local Index
import chromadb
import sys
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import VectorStoreIndex, StorageContext
                                
# initialize client
index_path = r"C:\Users\A238737\OneDrive - Standard Bank\Documents\GroupFunctions\rag-systems\ai-analyst-demo\text_sql_analysis\index\chroma_db"
db = chromadb.PersistentClient(path=index_path)

# get collection
chroma_collection = db.get_or_create_collection("sql_tables_metadata")

# assign chroma as the vector_store to the context
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# load your index from stored vectors
index_local = VectorStoreIndex.from_vector_store(
                                            vector_store = vector_store, 
                                            storage_context=storage_context
                                        )

# create a query engine
local_query_engine = index_local.as_query_engine(similarity_top_k=10)

def generate_response(query):
    answer = local_query_engine.query(query)
    print("\n**Query:**\n", query)
    print("\n**Answer:**\n", answer)
    print("\n**Source:**\n", answer.get_formatted_sources())
    
    # Optionally print metadata from sources to verify it's working
    print("\n**Source Metadata:**")
    for source_node in answer.source_nodes:
        print(f"- {source_node.node.metadata}")

generate_response("generate the full table details without interpreting or editing anything: customer information full table details")


In [None]:
# Define file-specific metadata
import os

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

doc_transactions_path = r"C:\Users\A238737\OneDrive - Standard Bank\Documents\GroupFunctions\rag-systems\ai-analyst-demo\text_sql_analysis\table_metadata\metadata_transaction_history.txt"
doc_customer_info_path = r"C:\Users\A238737\OneDrive - Standard Bank\Documents\GroupFunctions\rag-systems\ai-analyst-demo\text_sql_analysis\table_metadata\metadata_customer_information.txt"
# NOTE: the three CRS variables below all point to the same file (metadata_crs.txt).
# Identical strings are the same dict key, so in file_metadata_map the last CRS entry
# overwrites the other two, and file_paths loads this file three times.
doc_crs_accountreport_path = r"C:\Users\A238737\OneDrive - Standard Bank\Documents\GroupFunctions\rag-systems\ai-analyst-demo\text_sql_analysis\table_metadata\metadata_crs.txt"
doc_crs_countrycode_path = r"C:\Users\A238737\OneDrive - Standard Bank\Documents\GroupFunctions\rag-systems\ai-analyst-demo\text_sql_analysis\table_metadata\metadata_crs.txt"
doc_crs_messagespec_path = r"C:\Users\A238737\OneDrive - Standard Bank\Documents\GroupFunctions\rag-systems\ai-analyst-demo\text_sql_analysis\table_metadata\metadata_crs.txt"

file_paths = [doc_transactions_path, 
            doc_customer_info_path, 
            doc_crs_accountreport_path, 
            doc_crs_countrycode_path, 
            doc_crs_messagespec_path]

def get_metadata_for_files(file_paths):
    # Create a map of file path to custom metadata
    file_metadata_map = {
        
        doc_transactions_path: {
            "category": "transaction history table",
            "year": "2025-07-20",
            "department": "Finance", 
            "author": "Mzwandile Mhlongo",
            "confidentiality": "high",
            "description": "Comprehensive transaction history table containing all customer financial transactions including deposits, withdrawals, transfers, payments, and purchases"
        },
        doc_customer_info_path: {
            "category": "customer information table",
            "year": "2025-07-20",
            "department": "Finance", 
            "author": "Mzwandile Mhlongo",
            "confidentiality": "high",
            "description": "Comprehensive customer information data table containing personal information, financial details, loan information, and product holdings for bank customers"
        },
        doc_crs_accountreport_path: {
            "category": "Common Reporting Standard (CRS) Account Reporting",
            "year": "2025-07-20",
            "department": "Finance",
            "author": "Mzwandile Mhlongo",
            "confidentiality": "low",
            "description": "Detailed financial account reporting data in accordance with **Common Reporting Standard (CRS)** requirements. This table captures comprehensive information about account holders and their financial accounts, crucial for international tax transparency"
        },
        doc_crs_countrycode_path: {
            "category": "Common Reporting Standard (CRS) Country Codes",
            "year": "2025-07-20",
            "department": "Finance",
            "author": "Mzwandile Mhlongo",
            "confidentiality": "low",
            "description": "Comprehensive country code reference for **Common Reporting Standard (CRS)** reporting. This table provides essential mappings between various country code formats, ensuring accurate and consistent country identification across CRS data. It is based on the **ISO 3166-1 alpha-2 standard**"
        },
        doc_crs_messagespec_path: {
            "category": "Common Reporting Standard (CRS) Message Specification",
            "year": "2025-07-20",
            "department": "Finance",
            "author": "Mzwandile Mhlongo",
            "confidentiality": "low",
            "description": "This table stores the crucial **header and reporting entity information** for **Common Reporting Standard (CRS)** messages"
        },
    }
    
    # The function that SimpleDirectoryReader will call
    def file_metadata_func(file_path):
        # Get predefined metadata if available, otherwise return basic metadata
        if file_path in file_metadata_map:
            return file_metadata_map[file_path]
        else:
            return {
                "source": file_path,
                "file_type": os.path.splitext(file_path)[1],
                "confidentiality": "unknown"
            }
    
    return file_metadata_func

# Create reader with specific files and their metadata
documents = SimpleDirectoryReader(
    input_files=file_paths,
    file_metadata=get_metadata_for_files(file_paths)
).load_data()

# Create index and query engine as before
index = VectorStoreIndex.from_documents(documents)
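
The metadata map above is keyed by file path, so two variables that hold the same path collapse into a single dictionary entry. The sketch below shows that effect and one way to detect repeated paths before loading. The paths and the `find_duplicate_paths` helper are hypothetical stand-ins, not part of the notebook.

```python
from collections import Counter


def find_duplicate_paths(paths):
    """Return the paths that appear more than once, in first-seen order."""
    counts = Counter(paths)
    return [p for p in dict.fromkeys(paths) if counts[p] > 1]


# Hypothetical stand-ins for the real metadata files.
accounts_path = r"C:\data\metadata_crs.txt"
countries_path = r"C:\data\metadata_crs.txt"  # same file as accounts_path
customers_path = r"C:\data\metadata_customer_information.txt"

metadata_map = {
    accounts_path: {"category": "accounts"},
    countries_path: {"category": "countries"},  # silently replaces the entry above
    customers_path: {"category": "customers"},
}

duplicates = find_duplicate_paths([accounts_path, countries_path, customers_path])
print("Duplicate paths:", duplicates)
print("Entries kept in the map:", len(metadata_map))
```

Running a check like this on `file_paths` before calling `SimpleDirectoryReader` would flag files that are about to be loaded more than once.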



In [5]:
# query engine
query_engine = index.as_query_engine(
    similarity_top_k=10
)

In [6]:
def generate_response(query):
    answer = query_engine.query(query)
    print("\n**Query:**\n", query)
    print("\n**Answer:**\n", answer)
    print("\n**Source:**\n", answer.get_formatted_sources())
    
    # Optionally print metadata from sources to verify it's working
    print("\n**Source Metadata:**")
    for source_node in answer.source_nodes:
        print(f"- {source_node.node.metadata}")

In [9]:
generate_response("generate the full table details without interpreting or editing anything: customer information full table details")


**Query:**
 generate the full table details without interpreting or editing anything: customer information full table details

**Answer:**
 **Full Name**: [dbo].[customer_information]  
**Primary Key**: id

### Column Definitions

- **id** (int)  
  - Description: Unique customer identifier, 8-digit number  
  - Range: 10000000 to 99999999  
  - Examples: 10474206, 10962741, 13765547  
  - Rules: Auto-generated unique identifier for each customer

- **full_name** (nvarchar)  
  - Description: Customer's complete name (first and last name)  
  - Examples: Rachel Benitez, Samuel Anderson, Austin Perkins  
  - Rules: Required field, contains customer's legal name

- **email** (nvarchar)  
  - Description: Customer's email address for communication  
  - Examples: nelsoneddie@example.net, dillonjodi@example.net  
  - Rules: Must be valid email format, used for notifications

- **phone_number** (nvarchar)  
  - Description: Customer's contact phone number  
  - Examples: +1-555-123-4567, (5

## Persist the index to disk

In [5]:
from openai import AzureOpenAI
import os
from dotenv import load_dotenv, find_dotenv
import warnings
warnings.filterwarnings("ignore")
load_dotenv(find_dotenv())

import ssl
import urllib3
import httpx

# Disable SSL certificate verification globally (for development only!)
ssl._create_default_https_context = ssl._create_unverified_context

# For requests and urllib3, suppress warnings
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
# Create httpx client with SSL verification disabled
http_client = httpx.Client(verify=False)

import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex #, Settings
from llama_index.core.query_engine import RetrieverQueryEngine
# from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
# from llama_index.llms.azure_openai import AzureOpenAI

# API_KEY = os.environ.get("AZURE_OPENAI_KEY") 
# API_ENDPOINT = os.environ.get("AZURE_OPENAI_ENDPOINT")
# AZURE_DEPLOYMENT = os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME")
# API_VERSION = os.environ.get("AZURE_OPENAI_VERSION")
# MODEL = os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME")


# # set up an LLM
# llm = AzureOpenAI(
#   default_headers={"Ocp-Apim-Subscription-Key": API_KEY},
#   api_key=API_KEY,
#   azure_endpoint=API_ENDPOINT,
#   azure_deployment= AZURE_DEPLOYMENT,
#   api_version=API_VERSION, 
#   model = AZURE_DEPLOYMENT,
#   http_client=http_client
# )

# embeddings_endpoint = os.environ.get("AZURE_OPENAI_EMBEDDING_ENDPOINT")
# embeddings_api_subscription_key = os.environ.get("AZURE_OPENAI_EMBEDDING_KEY")
# embeddings_model_name = os.environ.get("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME")
# embeddings_deployment = os.environ.get("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME")
# embeddings_api_version = os.environ.get("AZURE_OPENAI_EMBEDDING_API_VERSION")

# # Set up embedding model
# embedding_model = AzureOpenAIEmbedding(
#     deployment_name=embeddings_deployment,
#     api_key=embeddings_api_subscription_key,
#     azure_endpoint=embeddings_endpoint,
#     api_version=embeddings_api_version,
#     http_client=http_client
# )

# Settings.llm = llm
# Settings.embed_model = embedding_model
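
The cell above turns certificate verification off globally, which its own comment marks as development-only. A possible alternative, sketched below, keeps verification on and adds a corporate root certificate to the trusted set. `CORP_CA_BUNDLE` is a hypothetical environment variable and `build_ssl_context` a hypothetical helper; whether this fits depends on how the proxy in your network is set up.

```python
import os
import ssl


def build_ssl_context(ca_bundle_env="CORP_CA_BUNDLE"):
    """Create a verifying SSL context, optionally trusting an extra CA bundle.

    `CORP_CA_BUNDLE` is a hypothetical environment variable holding the path
    to a PEM file with the corporate proxy's root certificate.
    """
    context = ssl.create_default_context()  # verification stays enabled
    ca_bundle = os.environ.get(ca_bundle_env)
    if ca_bundle:
        context.load_verify_locations(cafile=ca_bundle)
    return context


ssl_context = build_ssl_context()
print("Certificates required:", ssl_context.verify_mode == ssl.CERT_REQUIRED)
print("Hostname checked:", ssl_context.check_hostname)
```

An `httpx.Client` accepts such a context through its `verify` argument, for example `httpx.Client(verify=ssl_context)`, in place of `verify=False`.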

In [6]:
# Define file-specific metadata

doc_transactions_path = r"C:\Users\A238737\OneDrive - Standard Bank\Documents\GroupFunctions\rag-systems\ai-analyst-demo\text_sql_analysis\table_metadata\metadata_transaction_history.txt"
doc_customer_info_path = r"C:\Users\A238737\OneDrive - Standard Bank\Documents\GroupFunctions\rag-systems\ai-analyst-demo\text_sql_analysis\table_metadata\metadata_customer_information.txt"
# NOTE: the three CRS variables below all point to the same file (metadata_crs.txt).
# Identical strings are the same dict key, so in file_metadata_map the last CRS entry
# overwrites the other two, and file_paths loads this file three times.
doc_crs_accountreport_path = r"C:\Users\A238737\OneDrive - Standard Bank\Documents\GroupFunctions\rag-systems\ai-analyst-demo\text_sql_analysis\table_metadata\metadata_crs.txt"
doc_crs_countrycode_path = r"C:\Users\A238737\OneDrive - Standard Bank\Documents\GroupFunctions\rag-systems\ai-analyst-demo\text_sql_analysis\table_metadata\metadata_crs.txt"
doc_crs_messagespec_path = r"C:\Users\A238737\OneDrive - Standard Bank\Documents\GroupFunctions\rag-systems\ai-analyst-demo\text_sql_analysis\table_metadata\metadata_crs.txt"

file_paths = [  doc_transactions_path, doc_customer_info_path, doc_crs_accountreport_path, doc_crs_countrycode_path, doc_crs_messagespec_path]

def get_metadata_for_files(file_paths):
    # Create a map of file path to custom metadata
    file_metadata_map = {
        
        doc_transactions_path: {
            "category": "transaction history table",
            "year": "2025-07-20",
            "department": "Finance", 
            "author": "Mzwandile Mhlongo",
            "confidentiality": "high",
            "description": "Comprehensive transaction history table containing all customer financial transactions including deposits, withdrawals, transfers, payments, and purchases"
        },
        doc_customer_info_path: {
            "category": "customer information table",
            "year": "2025-07-20",
            "department": "Finance", 
            "author": "Mzwandile Mhlongo",
            "confidentiality": "high",
            "description": "Comprehensive customer information data table containing personal information, financial details, loan information, and product holdings for bank customers"
        },
        doc_crs_accountreport_path: {
            "category": "Common Reporting Standard (CRS) Account Reporting",
            "year": "2025-07-20",
            "department": "Finance",
            "author": "Mzwandile Mhlongo",
            "confidentiality": "low",
            "description": "Detailed financial account reporting data in accordance with **Common Reporting Standard (CRS)** requirements. This table captures comprehensive information about account holders and their financial accounts, crucial for international tax transparency"
        },
        doc_crs_countrycode_path: {
            "category": "Common Reporting Standard (CRS) Country Codes",
            "year": "2025-07-20",
            "department": "Finance",
            "author": "Mzwandile Mhlongo",
            "confidentiality": "low",
            "description": "Comprehensive country code reference for **Common Reporting Standard (CRS)** reporting. This table provides essential mappings between various country code formats, ensuring accurate and consistent country identification across CRS data. It is based on the **ISO 3166-1 alpha-2 standard**"
        },
        doc_crs_messagespec_path: {
            "category": "Common Reporting Standard (CRS) Message Specification",
            "year": "2025-07-20",
            "department": "Finance",
            "author": "Mzwandile Mhlongo",
            "confidentiality": "low",
            "description": "This table stores the crucial **header and reporting entity information** for **Common Reporting Standard (CRS)** messages"
        },
    }
    
    # The function that SimpleDirectoryReader will call
    def file_metadata_func(file_path):
        # Get predefined metadata if available, otherwise return basic metadata
        if file_path in file_metadata_map:
            return file_metadata_map[file_path]
        else:
            return {
                "source": file_path,
                "file_type": os.path.splitext(file_path)[1],
                "confidentiality": "unknown"
            }
    
    return file_metadata_func


# load some documents
# documents = SimpleDirectoryReader("./data").load_data()

# Create reader with specific files and their metadata
documents = SimpleDirectoryReader(
    input_files=file_paths,
    file_metadata=get_metadata_for_files(file_paths)
).load_data()

# initialize client, setting path to save data
index_path = r"C:\Users\A238737\OneDrive - Standard Bank\Documents\GroupFunctions\rag-systems\ai-analyst-demo\text_sql_analysis\index\chroma_db"
db = chromadb.PersistentClient(path=index_path)

# create collection
chroma_collection = db.get_or_create_collection("sql_tables_metadata")

# assign chroma as the vector_store to the context
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# create your index
index_persisted = VectorStoreIndex.from_documents(
    documents, 
    storage_context=storage_context
    
)
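
Because the collection is persistent, re-running the cell above would likely embed and store the same files again on every run. One way to guard against that, sketched below, is to build the index only when the collection is empty and load it otherwise. `build_or_load_index` and `FakeCollection` are hypothetical names; the fake collection exists only so the sketch runs without a database.

```python
def build_or_load_index(collection, build_index, load_index):
    """Embed documents only when the collection is empty; otherwise reuse it.

    `collection` is any object with a `count()` method (a Chroma collection
    has one); `build_index` and `load_index` are hypothetical zero-argument
    callables wrapping the `from_documents` and `from_vector_store` calls.
    """
    if collection.count() == 0:
        return build_index()
    return load_index()


class FakeCollection:
    """Stand-in for a Chroma collection so the sketch runs without a database."""

    def __init__(self, n_items):
        self.n_items = n_items

    def count(self):
        return self.n_items


empty_result = build_or_load_index(FakeCollection(0), lambda: "built", lambda: "loaded")
full_result = build_or_load_index(FakeCollection(12), lambda: "built", lambda: "loaded")
print(empty_result, full_result)
```

In the notebook, `build_index` would wrap `VectorStoreIndex.from_documents(documents, storage_context=storage_context)` and `load_index` would wrap `VectorStoreIndex.from_vector_store(vector_store)`.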



In [12]:
# create a query engine and query
my_query_engine = index_persisted.as_query_engine(
    similarity_top_k=10
)

# Create index and query engine as before
# index = VectorStoreIndex.from_documents(documents)
# query_engine = index.as_query_engine(
#     similarity_top_k=10
# )


In [13]:
def generate_metadata(query):
    answer = my_query_engine.query(query)
    print("\n**Query:**\n", query)
    print("\n**Answer:**\n", answer)
    print("\n**Source:**\n", answer.get_formatted_sources())
    
    # Optionally print metadata from sources to verify it's working
    print("\n**Source Metadata:**")
    for source_node in answer.source_nodes:
        print(f"- {source_node.node.metadata}")

In [14]:
generate_metadata("which table contains ClosedAccount field? Return the metadata for the table that contains ClosedAccount field")


**Query:**
 which table contains ClosedAccount field? Return the metadata for the table that contains ClosedAccount field

**Answer:**
 The table that contains the ClosedAccount field is used for storing crucial header and reporting entity information for Common Reporting Standard (CRS) messages. Below is the metadata for this table:

- **Purpose**: Stores header and reporting entity information for CRS messages, including document type, unique document references, account identifiers, and key account status indicators.
- **Fields and Descriptions**:
  - **DocTypeIndic** (varchar(255)): Indicates the type of CRS message (`OECD1`, `OECD2`, `OECD3`). Mandatory.
  - **DocRefId3** (varchar(255)): Unique reference identifier for the specific account report document. Mandatory and unique within a CRS message.
  - **AccountNumber** (varchar(255)): Unique identifier for the financial account (IBAN, BBAN, or proprietary number). Mandatory.
  - **AccNumberType** (varchar(255)): Describes the for

## Load the index from the local vectorDB

In [7]:
import chromadb
import sys
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext, SimpleDirectoryReader, VectorStoreIndex #, Settings
                                

In [15]:
# import os
# import sys
# import logging
# from dotenv import load_dotenv, find_dotenv
# import warnings


# import os
# import chromadb
# from llama_index.vector_stores.chroma import ChromaVectorStore
# from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings
# from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
# from llama_index.llms.azure_openai import AzureOpenAI
# from llama_index.core.query_engine import RetrieverQueryEngine
# from llama_index.core import (  StorageContext,
#                                 SimpleDirectoryReader, 
#                                 VectorStoreIndex, 
#                                 Settings
#                                 )

# import ssl
# import urllib3
# import httpx

# # Disable SSL certificate verification globally (for development only!)
# ssl._create_default_https_context = ssl._create_unverified_context

# # # For requests and urllib3, suppress warnings
# urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
# # Create httpx client with SSL verification disabled

# http_client = httpx.Client(verify=False)
# warnings.filterwarnings("ignore")
# load_dotenv(find_dotenv())

# API_KEY = os.environ.get("AZURE_OPENAI_KEY") 
# API_ENDPOINT = os.environ.get("AZURE_OPENAI_ENDPOINT")
# AZURE_DEPLOYMENT = os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME")
# API_VERSION = os.environ.get("AZURE_OPENAI_VERSION")
# MODEL = os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME")

# client = AzureOpenAI(
#   default_headers={"Ocp-Apim-Subscription-Key": API_KEY},
#   api_key=API_KEY,
#   azure_endpoint=API_ENDPOINT,
#   azure_deployment= AZURE_DEPLOYMENT,
#   api_version=API_VERSION, 
#   model = AZURE_DEPLOYMENT,
#   http_client=http_client
# )

# # set up an LLM
# llm = AzureOpenAI(
#   default_headers={"Ocp-Apim-Subscription-Key": API_KEY},
#   api_key=API_KEY,
#   azure_endpoint=API_ENDPOINT,
#   azure_deployment= AZURE_DEPLOYMENT,
#   api_version=API_VERSION, 
#   model = AZURE_DEPLOYMENT,
#   http_client=http_client
# )
# # Embeddings Model
# embeddings_endpoint = os.environ.get("AZURE_OPENAI_EMBEDDING_ENDPOINT")
# embeddings_api_subscription_key = os.environ.get("AZURE_OPENAI_EMBEDDING_KEY")
# embeddings_model_name = os.environ.get("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME")
# embeddings_deployment = os.environ.get("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME")
# embeddings_api_version = os.environ.get("AZURE_OPENAI_EMBEDDING_API_VERSION")

# # Set up embedding model
# embedding_model = AzureOpenAIEmbedding(
#     deployment_name=embeddings_deployment,
#     api_key=embeddings_api_subscription_key,
#     azure_endpoint=embeddings_endpoint,
#     api_version=embeddings_api_version,
#     http_client=http_client
# )

# Settings.llm = llm
# Settings.embed_model = embedding_model

In [8]:

# initialize client
index_path = r"C:\Users\A238737\OneDrive - Standard Bank\Documents\GroupFunctions\rag-systems\ai-analyst-demo\text_sql_analysis\index\chroma_db"
db = chromadb.PersistentClient(path=index_path)

# get collection
chroma_collection = db.get_or_create_collection("sql_tables_metadata")

# assign chroma as the vector_store to the context
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# load your index from stored vectors
index = VectorStoreIndex.from_vector_store(
    vector_store,
    storage_context=storage_context,
)

# create a query engine
query_engine = index.as_query_engine(similarity_top_k=10)


In [9]:
def generate_response(query):
    answer = query_engine.query(query)
    print("\n**Query:**\n", query)
    print("\n**Answer:**\n", answer)
    print("\n**Source:**\n", answer.get_formatted_sources())
    
    # Optionally print metadata from sources to verify it's working
    print("\n**Source Metadata:**")
    for source_node in answer.source_nodes:
        print(f"- {source_node.node.metadata}")
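
The same printing helper is defined several times in this notebook, once per query engine. A small factory, sketched below, would let one definition serve any engine. `make_responder` and `FakeEngine` are hypothetical names; the fake engine only mimics the `query()` method and the `source_nodes` attribute used above so that the sketch runs without LlamaIndex.

```python
from types import SimpleNamespace


def make_responder(query_engine):
    """Return a function that queries `query_engine` and prints the result."""

    def respond(query):
        answer = query_engine.query(query)
        print("\n**Query:**\n", query)
        print("\n**Answer:**\n", answer)
        print("\n**Source Metadata:**")
        for source_node in answer.source_nodes:
            print(f"- {source_node.node.metadata}")
        return answer

    return respond


class FakeEngine:
    """Stand-in engine that echoes the query and returns one source node."""

    def query(self, query):
        node = SimpleNamespace(node=SimpleNamespace(metadata={"category": "demo"}))
        return SimpleNamespace(response=f"echo: {query}", source_nodes=[node])


respond = make_responder(FakeEngine())
result = respond("hello")
```

With the real objects, `make_responder(query_engine)` and `make_responder(my_query_engine)` would replace the separate `generate_response` and `generate_metadata` definitions.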

In [10]:
generate_response("which table contains ClosedAccount field? Return the metadata for the table that contains ClosedAccount field")


**Query:**
 which table contains ClosedAccount field? Return the metadata for the table that contains ClosedAccount field

**Answer:**
 The table stores crucial header and reporting entity information for Common Reporting Standard (CRS) messages. It includes the following metadata:

- **DocTypeIndic** (varchar(255)): Indicates the type of CRS message (`OECD1`, `OECD2`, `OECD3`). Mandatory.
- **DocRefId3** (varchar(255)): Unique reference identifier for each account report document. Mandatory and unique within a CRS message.
- **AccountNumber** (varchar(255)): Unique identifier for the financial account (IBAN, BBAN, or proprietary number). Mandatory.
- **AccNumberType** (varchar(255)): Describes the format or type of the account number (`IBAN`, `BBAN`, `OBAN`, `Other`, `OECD605`). Mandatory.
- **ClosedAccount** (bit): Boolean indicator (0 or 1) specifying if the account was closed during the reporting period. Mandatory.
- **DormantAccount** (bit): Boolean indicator (0 or 1) specifying 

In [None]:
generate_response("generate the full table details without interpreting or editing anything: customer information full table details")


**Query:**
 generate the full table details without interpreting or editing anything: customer information full table details

**Answer:**
 **Full Name**: [dbo].[customer_information]  
**Primary Key**: id

### Column Definitions

- **id** (int)  
  Description: Unique customer identifier, 8-digit number  
  Range: 10000000 to 99999999  
  Examples: 10474206, 10962741, 13765547  
  Rules: Auto-generated unique identifier for each customer

- **full_name** (nvarchar)  
  Description: Customer's complete name (first and last name)  
  Examples: Rachel Benitez, Samuel Anderson, Austin Perkins  
  Rules: Required field, contains customer's legal name

- **email** (nvarchar)  
  Description: Customer's email address for communication  
  Examples: nelsoneddie@example.net, dillonjodi@example.net  
  Rules: Must be valid email format, used for notifications

- **phone_number** (nvarchar)  
  Description: Customer's contact phone number  
  Examples: +1-555-123-4567, (555) 987-6543  
  Rules: 

In [20]:
# from pathlib import Path
# import sys
# src_dir = str(Path(__file__).parent.parent)
# if src_dir not in sys.path:
#     sys.path.append(src_dir)

import json
import logging
import re

# logger = logging.getLogger(__name__)

import os
from dotenv import load_dotenv, find_dotenv
import warnings
warnings.filterwarnings("ignore")
load_dotenv(find_dotenv())


import ssl
import urllib3
import httpx

# Disable SSL certificate verification globally (for development only!)
ssl._create_default_https_context = ssl._create_unverified_context

# For requests and urllib3, suppress warnings
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
# Create httpx client with SSL verification disabled
http_client = httpx.Client(verify=False)

import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.llms.azure_openai import AzureOpenAI

API_KEY = os.environ.get("AZURE_OPENAI_KEY")
API_ENDPOINT = os.environ.get("AZURE_OPENAI_ENDPOINT")
AZURE_DEPLOYMENT = os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME")
API_VERSION = os.environ.get("AZURE_OPENAI_VERSION")
MODEL = os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME")


# set up an LLM
llm = AzureOpenAI(
    default_headers={"Ocp-Apim-Subscription-Key": API_KEY},
    api_key=API_KEY,
    azure_endpoint=API_ENDPOINT,
    azure_deployment= AZURE_DEPLOYMENT,
    api_version=API_VERSION,
    model = AZURE_DEPLOYMENT,
    http_client=http_client
)

embeddings_endpoint = os.environ.get("AZURE_OPENAI_EMBEDDING_ENDPOINT")
embeddings_api_subscription_key = os.environ.get("AZURE_OPENAI_EMBEDDING_KEY")
embeddings_model_name = os.environ.get("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME")
embeddings_deployment = os.environ.get("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME")
embeddings_api_version = os.environ.get("AZURE_OPENAI_EMBEDDING_API_VERSION")

# Set up embedding model
embedding_model = AzureOpenAIEmbedding(
    deployment_name=embeddings_deployment,
    api_key=embeddings_api_subscription_key,
    azure_endpoint=embeddings_endpoint,
    api_version=embeddings_api_version,
    http_client=http_client
)

Settings.llm = llm
Settings.embed_model = embedding_model


# initialize client
index_path = r"C:\Users\A238737\OneDrive - Standard Bank\Documents\GroupFunctions\rag-systems\ai-analyst-demo\text_sql_analysis\index\chroma_db"
db = chromadb.PersistentClient(path=index_path)

# get collection
# Note: get_or_create_collection will create the collection if it doesn't exist.
# For checking existence without creation, use db.get_collection() with a try-except.
chroma_collection = db.get_or_create_collection("sql_tables_metadata")
print(f"Chroma collection '{chroma_collection.name}' loaded successfully.")
# assign chroma as the vector_store to the context
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# load your index from stored vectors
index = VectorStoreIndex.from_vector_store(
    vector_store, storage_context=storage_context
)

# create a query engine
query_engine = index.as_query_engine(similarity_top_k=10)

def agent_table_rag(query):
    return query_engine.query("retrieve the full tables metadata without interpreting or editing anything for the following given tables: " + query)

def check_and_display_collection_info(client_db, collection_name_to_check):
    """
    Checks if a ChromaDB collection exists and displays its content if it does.
    Also lists all available collections.
    """
    print("\n--- ChromaDB Collection Information ---")

    # 1. List all collections
    print("Listing all collections:")
    all_collections = client_db.list_collections()
    if all_collections:
        for col in all_collections:
            print(f"- {col.name}")
    else:
        print("No collections found in the database.")

    # 2. Check if the specified collection exists
    print(f"\nChecking if collection '{collection_name_to_check}' exists...")
    try:
        target_collection = client_db.get_collection(collection_name_to_check)
        print(f"Collection '{collection_name_to_check}' exists!")

        # 3. Get and display some content from the index
        print(f"\nDisplaying first 5 items from '{collection_name_to_check}':")
        # Use peek() to get a small sample of the collection's contents
        # It returns a dictionary with 'ids', 'embeddings', 'metadatas', 'documents'
        # We are primarily interested in 'documents' and 'metadatas' for content.
        collection_content = target_collection.peek(limit=5)

        if collection_content and collection_content.get('documents'):
            for i, doc in enumerate(collection_content['documents']):
                print(f"--- Item {i+1} ---")
                print(f"ID: {collection_content['ids'][i]}")
                print(f"Document: {doc}")
                if collection_content.get('metadatas') and collection_content['metadatas'][i]:
                    print(f"Metadata: {json.dumps(collection_content['metadatas'][i], indent=2)}")
                print("-" * 20)
        else:
            print(f"Collection '{collection_name_to_check}' is empty or has no documents.")

    except Exception as e:
        # The error raised for a missing collection differs between chromadb
        # releases, so catch broadly instead of naming one exception class.
        print(f"Could not access collection '{collection_name_to_check}' (it may not exist): {e}")
    print("-------------------------------------")


if __name__ == "__main__":
    # Call the new function to check and display collection info
    check_and_display_collection_info(db, "sql_tables_metadata")
    # check_and_display_collection_info(db, "non_existent_collection") # Example for a non-existent collection

    user_request = "customer information"
    try:
        print("\nRetrieving relevant tables for the request:", user_request)
        table_results = agent_table_rag(user_request)
        print("Type of result:", type(table_results))
        print("Raw result:", repr(table_results))
        if not table_results:
            print("No results returned from query.")
        else:
            # If it's not a string, print its attributes
            if not isinstance(table_results, str):
                print("Result attributes:", dir(table_results))
                # Try to print a 'response' or 'text' attribute if present
                if hasattr(table_results, 'response'):
                    print("Response attribute:", table_results.response)
                if hasattr(table_results, 'text'):
                    print("Text attribute:", table_results.text)
            print("Relevant tables retrieved successfully!")
            print(table_results)
    except Exception as e:
        print(f"Error retrieving tables: {e}")
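
The block above probes the query result with `dir()` and `hasattr()` to find the answer text. A smaller helper, sketched below, covers the two cases it checks for: a plain string, or an object with a `response` attribute such as the LlamaIndex response. `response_text` is a hypothetical name and the `SimpleNamespace` objects are stand-ins for real results.

```python
from types import SimpleNamespace


def response_text(result):
    """Return the answer text from a query result, or '' when there is none."""
    if isinstance(result, str):
        return result
    text = getattr(result, "response", None)
    return text if text is not None else ""


from_string = response_text("already text")
from_object = response_text(SimpleNamespace(response="table metadata"))
from_empty = response_text(SimpleNamespace(response=None))
print(repr(from_string), repr(from_object), repr(from_empty))
```

Calling `response_text(table_results)` would then give a string that can be checked for emptiness or passed on to later steps.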



Chroma collection 'sql_tables_metadata' loaded successfully.

--- ChromaDB Collection Information ---
Listing all collections:
- sql_tables_metadata

Checking if collection 'sql_tables_metadata' exists...
Collection 'sql_tables_metadata' exists!

Displaying first 5 items from 'sql_tables_metadata':
--- Item 1 ---
ID: 7d83553b-a671-4803-964f-03d3545900d2
Document: <begin transaction history metadata>

# Database Schema Information

## Database Details
- **Database**: SQL Server (master)
- **Server**: localhost\SQLEXPRESS
- **Schema**: dbo

## Table: transaction_history
Comprehensive transaction history table containing all customer financial transactions including deposits, withdrawals, transfers, payments, and purchases

### Table Structure
**Full Name**: [dbo].[transaction_history]
**Records**: 5000+ transactions
**Primary Key**: transaction_id
**Foreign Keys**: customer_id
**Time Range**: Last 2 years of transaction data

### Column Definitions

**transaction_id** (bigint)
- Descript

In [None]:
import json

def agent_table_rag(query):
    return query_engine.query("retrieve the full tables metadata without interpreting or editing anything for the following given tables: " + query)

def check_and_display_collection_info(client_db, collection_name_to_check):
    """
    Checks if a ChromaDB collection exists and displays its content if it does.
    Also lists all available collections.
    """
    print("\n--- ChromaDB Collection Information ---")

    # 1. List all collections
    print("Listing all collections:")
    all_collections = client_db.list_collections()
    if all_collections:
        for col in all_collections:
            print(f"- {col.name}")
    else:
        print("No collections found in the database.")

    # 2. Check if the specified collection exists
    print(f"\nChecking if collection '{collection_name_to_check}' exists...")
    try:
        target_collection = client_db.get_collection(collection_name_to_check)
        print(f"Collection '{collection_name_to_check}' exists!")

        # 3. Get and display some content from the index
        print(f"\nDisplaying first 5 items from '{collection_name_to_check}':")
        # Use peek() to get a small sample of the collection's contents
        # It returns a dictionary with 'ids', 'embeddings', 'metadatas', 'documents'
        # We are primarily interested in 'documents' and 'metadatas' for content.
        collection_content = target_collection.peek(limit=5)

        if collection_content and collection_content.get('documents'):
            for i, doc in enumerate(collection_content['documents']):
                print(f"--- Item {i+1} ---")
                print(f"ID: {collection_content['ids'][i]}")
                print(f"Document: {doc}")
                if collection_content.get('metadatas') and collection_content['metadatas'][i]:
                    print(f"Metadata: {json.dumps(collection_content['metadatas'][i], indent=2)}")
                print("-" * 20)
        else:
            print(f"Collection '{collection_name_to_check}' is empty or has no documents.")

    except Exception as e:
        # `chromadb` is not imported in this cell, and the exception type raised for a
        # missing collection differs between ChromaDB versions, so catch broadly here.
        print(f"Could not access collection '{collection_name_to_check}' (it may not exist): {e}")
    print("-------------------------------------")


if __name__ == "__main__":
    # Call the new function to check and display collection info
    check_and_display_collection_info(db, "sql_tables_metadata")
    # check_and_display_collection_info(db, "non_existent_collection") # Example for a non-existent collection

    user_request = "customer information"
    try:
        print("\nRetrieving relevant tables for the request:", user_request)
        table_results = agent_table_rag(user_request)
        print("Type of result:", type(table_results))
        print("Raw result:", repr(table_results))
        if not table_results:
            print("No results returned from query.")
        else:
            # If it's not a string, print its attributes
            if not isinstance(table_results, str):
                print("Result attributes:", dir(table_results))
                # Try to print a 'response' or 'text' attribute if present
                if hasattr(table_results, 'response'):
                    print("Response attribute:", table_results.response)
                if hasattr(table_results, 'text'):
                    print("Text attribute:", table_results.text)
            print("Relevant tables retrieved successfully!")
            print(table_results)
    except Exception as e:
        print(f"Error retrieving tables: {e}")


--- ChromaDB Collection Information ---
Listing all collections:
- sql_tables_metadata

Checking if collection 'sql_tables_metadata' exists...
Collection 'sql_tables_metadata' exists!

Displaying first 5 items from 'sql_tables_metadata':
--- Item 1 ---
ID: ffe71549-4d19-4da1-9813-8454a82b6dd4
Document: <begin transaction history metadata>

# Database Schema Information

## Database Details
- **Database**: SQL Server (master)
- **Server**: localhost\SQLEXPRESS
- **Schema**: dbo

## Table: transaction_history
Comprehensive transaction history table containing all customer financial transactions including deposits, withdrawals, transfers, payments, and purchases

### Table Structure
**Full Name**: [dbo].[transaction_history]
**Records**: 5000+ transactions
**Primary Key**: transaction_id
**Foreign Keys**: customer_id
**Time Range**: Last 2 years of transaction data

### Column Definitions

**transaction_id** (bigint)
- Description: Unique transaction identifier, 12-digit number
- Range: 
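
The cell above probes the query result for a `response` or `text` attribute before printing it. A minimal sketch of the same idea as a reusable helper follows. The name `response_to_text` and the `FakeResponse` class are hypothetical and only stand in for whatever object `query_engine.query` returns in this notebook.

```python
def response_to_text(result):
    """Return the text carried by a query result, whatever its type."""
    if result is None:
        return ""
    if isinstance(result, str):
        return result
    # Check the same attributes the cell above probes for, in the same order.
    for attr in ("response", "text"):
        value = getattr(result, attr, None)
        if isinstance(value, str) and value:
            return value
    # Fall back to the object's string form.
    return str(result)


class FakeResponse:
    """Hypothetical stand-in for a query-engine response object."""

    def __init__(self, response):
        self.response = response


print(response_to_text(FakeResponse("Table: transaction_history")))
print(response_to_text("plain string"))
```

With a helper like this, the caller can print or post-process the result without first branching on its type.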

In [None]:
generate_response("generate the full table details without interpreting or editing anything: transaction history full table details")


**Query:**
 generate the full table details without interpreting or editing anything: transaction history full table details

**Answer:**
 Database: SQL Server (master)  
Server: localhost\SQLEXPRESS  
Schema: dbo  

Table: [dbo].[transaction_history]  
Description: Comprehensive transaction history table containing all customer financial transactions including deposits, withdrawals, transfers, payments, and purchases  
Records: 5000+ transactions  
Primary Key: transaction_id  
Foreign Keys: customer_id  
Time Range: Last 2 years of transaction data  

Column Definitions:

- transaction_id (bigint)  
  - Description: Unique transaction identifier, 12-digit number  
  - Range: 100000000000 to 999999999999  
  - Examples: 679551814302, 376513881618, 994709101726  
  - Rules: Auto-generated unique identifier for each transaction  

- customer_id (int)  
  - Description: Customer identifier linking to customer_information table  
  - Range: 10000000 to 99999999  
  - Examples: 10000001, 1
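
The answer above drops the markdown markers that the stored document uses (for example `**Primary Key**` becomes `Primary Key`), so the model did edit the text despite the prompt. One rough way to detect this is to compare the answer with the stored document after collapsing whitespace. The sketch below uses the standard-library `difflib`. The function names are hypothetical and the two sample strings are shortened excerpts of the output shown in this notebook.

```python
import difflib


def normalize(text):
    """Collapse whitespace so line wrapping does not count as a difference."""
    return " ".join(text.split())


def verbatim_ratio(source, answer):
    """Similarity in [0, 1] between the stored document and the generated answer."""
    matcher = difflib.SequenceMatcher(None, normalize(source), normalize(answer))
    return matcher.ratio()


source = "**Primary Key**: transaction_id\n**Foreign Keys**: customer_id"
answer = "Primary Key: transaction_id  \nForeign Keys: customer_id  "

ratio = verbatim_ratio(source, answer)
print(f"similarity: {ratio:.2f}")
print("verbatim" if ratio == 1.0 else "answer was reformatted")
```

A ratio of exactly 1.0 means the normalized texts are identical. Anything lower indicates that the answer was reworded or reformatted.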

In [None]:
generate_response("generate the full metadata for the following tables without interpreting or editing anything: [transaction_history, customer_information]")


**Query:**
 generate the full metadata for the following tables without interpreting or editing anything: [transaction_history, customer_information]

**Answer:**
 <begin transaction history metadata>

# Database Schema Information

## Database Details
- **Database**: SQL Server (master)
- **Server**: localhost\SQLEXPRESS
- **Schema**: dbo

## Table: transaction_history
Comprehensive transaction history table containing all customer financial transactions including deposits, withdrawals, transfers, payments, and purchases

### Table Structure
**Full Name**: [dbo].[transaction_history]
**Records**: 5000+ transactions
**Primary Key**: transaction_id
**Foreign Keys**: customer_id
**Time Range**: Last 2 years of transaction data

### Column Definitions

**transaction_id** (bigint)
- Description: Unique transaction identifier, 12-digit number
- Range: 100000000000 to 999999999999
- Examples: 679551814302, 376513881618, 994709101726
- Rules: Auto-generated unique identifier for each transac
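
When the goal is the unedited metadata for several tables, reading the stored documents straight from the collection avoids the LLM step entirely. The sketch below assumes that each stored document mentions its table name in the text, as the `## Table: transaction_history` heading above suggests. `fetch_raw_metadata` and `FakeCollection` are hypothetical names, and the fake only mimics the `get(where_document=..., include=...)` call of a ChromaDB collection so the example runs on its own.

```python
def fetch_raw_metadata(collection, table_names):
    """Map each table name to the stored documents whose text mentions it."""
    found = {}
    for name in table_names:
        # `where_document` with `$contains` filters on a substring of the document text.
        result = collection.get(
            where_document={"$contains": name},
            include=["documents"],
        )
        found[name] = result.get("documents") or []
    return found


class FakeCollection:
    """Hypothetical in-memory stand-in so the sketch runs without ChromaDB."""

    def __init__(self, documents):
        self._documents = documents

    def get(self, where_document=None, include=None):
        needle = (where_document or {}).get("$contains", "")
        return {"documents": [d for d in self._documents if needle in d]}


collection = FakeCollection(["## Table: transaction_history"])
tables = ["transaction_history", "customer_information"]
for table, docs in fetch_raw_metadata(collection, tables).items():
    print(table, "->", len(docs), "document(s)")
```

In this notebook the real collection would presumably come from `db.get_collection("sql_tables_metadata")`. A table that returns no documents here was never indexed, which is worth ruling out before adjusting the prompt.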