# Data Preparation Notebook

This notebook handles document loading, preprocessing, and vector index creation for the Maverick RAG system.

## Features
- Document loading from local files or Unity Catalog volumes
- Text chunking and preprocessing
- FAISS index creation (local development)
- Databricks Vector Search setup (enterprise deployment)
- Delta Lake table creation
- Configuration validation
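
Text chunking, listed above, usually means splitting each document into overlapping windows before embedding. The sketch below is a minimal character-based illustration. `chunk_text`, `chunk_size`, and `overlap` are hypothetical names chosen for this example; the splitter actually used by `src.rag.ingest` is not shown in this notebook and may differ.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration only; not part of src.rag.ingest.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("abcdefghij", chunk_size=4, overlap=2):
        print(piece)
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk, at the cost of storing some text twice.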


In [None]:
#%pip install -qU databricks-sdk databricks-langchain databricks-agents mlflow[databricks] databricks-vectorsearch langchain langchain_core bs4 markdownify pydantic sentence-transformers pandas openpyxl langdetect
#dbutils.library.restartPython()

## Setup and Configuration


In [None]:
import sys
from importlib import metadata

print(f"Python Executable: {sys.executable}")
print(f"Python Version: {sys.version}")
print("Installed Libraries:")

# importlib.metadata replaces the deprecated pkg_resources API
try:
    sdk_version = metadata.version("databricks-sdk")
    print(f"  - databricks-sdk: {sdk_version}")
except metadata.PackageNotFoundError:
    print("  - databricks-sdk: Not installed")

import databricks
print(f"🐍 Databricks loaded from: {databricks.__file__}")

In [1]:
# Import required libraries
import os
import sys
from pathlib import Path
import pandas as pd

# Assumes the notebook is run from a directory directly under the project root
PROJECT_ROOT = Path.cwd().parent
sys.path.append(str(PROJECT_ROOT))

# Project modules are imported with absolute paths from the project root
from src.utils.config import config_manager
from src.rag.ingest import load_documents, build_index, build_databricks_artifacts

print("✅ Setup complete!")

✅ Setup complete!
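
`Path.cwd().parent` in the setup cell only works when the notebook is launched from a directory one level below the project root. A slightly more defensive alternative, sketched here with a hypothetical `find_project_root` helper, walks upward until it finds a directory that contains `src`. The marker name is an assumption based on the `src.` imports in the cell above.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "src") -> Optional[Path]:
    """Return the nearest ancestor of `start` (inclusive) that contains `marker`.

    Hypothetical helper; `marker` is assumed to be the package directory name.
    Returns None when no ancestor contains the marker directory.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return None


if __name__ == "__main__":
    root = find_project_root(Path.cwd())
    print(f"Project root: {root}")
```

If the helper returns `None`, it is safer to stop with a clear error than to append a guessed path to `sys.path`.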


## Configuration Validation


In [2]:
# Display current configuration
config_summary = config_manager.get_config_summary()
print("📋 Current Configuration:")
for section, config in config_summary.items():
    print(f"\n{section.upper()}:")
    for key, value in config.items():
        print(f"  {key}: {value}")

# Validate configuration
validation = config_manager.validate_config()
print(f"\n🔍 Configuration Validation: {'✅ Valid' if validation['valid'] else '❌ Invalid'}")

if validation['errors']:
    print("\n❌ Errors:")
    for error in validation['errors']:
        print(f"  - {error}")

if validation['warnings']:
    print("\n⚠️ Warnings:")
    for warning in validation['warnings']:
        print(f"  - {warning}")


📋 Current Configuration:

APP:
  server_name: 0.0.0.0
  server_port: 7860
  title: Maverick RAG

LLM:
  provider: ollama
  model: llama3.1:8b
  temperature: 0.2
  timeout: 60
  api_key: ********
  base_url: http://localhost:11434/v1

EMBEDDING:
  model_name: sentence-transformers/all-MiniLM-L6-v2
  provider: huggingface

DATA:
  docs_dir: D:\Work\E D C\poc_p1\data\docs\sample_data
  index_dir: D:\Work\E D C\poc_p1\data\index
  use_databricks: False
  volume_path: None

DATABRICKS:
  workspace_url: your_workspace_url.databricks.com
  access_token: ********
  catalog: main
  schema: default
  volume: source_data
  vector_search_endpoint: your_vector_search_endpoint
  model_serving_endpoint: your_model_serving_endpoint
  _client: None

🔍 Configuration Validation: ✅ Valid
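
The validation cell relies on `validate_config()` returning a dictionary with `valid`, `errors`, and `warnings` keys. The following sketch shows one way such a result could be assembled. `validate_settings` and its specific checks are hypothetical and are not the implementation in `src.utils.config`; only the shape of the returned dictionary is taken from the cell above.

```python
from typing import Any, Dict, List


def validate_settings(data: Dict[str, Any]) -> Dict[str, Any]:
    """Return {'valid', 'errors', 'warnings'} for a DATA-style settings dict.

    Hypothetical checks for illustration; the real rules may differ.
    """
    errors: List[str] = []
    warnings: List[str] = []

    if not data.get("docs_dir"):
        errors.append("data.docs_dir is not set")
    if data.get("use_databricks") and not data.get("volume_path"):
        errors.append("data.volume_path is required when use_databricks is True")
    if not data.get("use_databricks") and not data.get("index_dir"):
        warnings.append("data.index_dir is not set; a local index cannot be saved")

    # The configuration is valid only when no errors were collected
    return {"valid": not errors, "errors": errors, "warnings": warnings}


if __name__ == "__main__":
    result = validate_settings(
        {"docs_dir": "data/docs", "index_dir": "data/index",
         "use_databricks": False, "volume_path": None}
    )
    print(result)
```

Separating errors from warnings lets the notebook continue on a warning while still failing early on a setting that would break ingestion.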


## Document Loading


In [3]:
# Load documents
docs_dir = config_manager.data.docs_dir
print(f"📁 Loading documents from: {docs_dir}")

try:
    documents = load_documents(docs_dir)
    print(f"✅ Loaded {len(documents)} documents")
    
    # Display document information
    if documents:
        print("\n📄 Document Summary:")
        for i, doc in enumerate(documents[:5]):  # Show first 5
            source = doc.metadata.get('source', 'Unknown')
            content_preview = (doc.page_content[:100] + "...") if len(doc.page_content) > 100 else doc.page_content
            print(f"  {i+1}. Source: {source}")
            print(f"     Content: {content_preview}")
            print(f"     Length: {len(doc.page_content)} characters")
            print()
        
        if len(documents) > 5:
            print(f"  ... and {len(documents) - 5} more documents")
    else:
        print("⚠️ No documents found")
        
except Exception as e:
    print(f"❌ Error loading documents: {e}")


📁 Loading documents from: D:\Work\E D C\poc_p1\data\docs\sample_data


100%|██████████| 1/1 [00:03<00:00,  3.03s/it]


[Document(metadata={'source': 'D:\\Work\\E D C\\poc_p1\\data\\docs\\sample_data\\Project_Book_Databricks_Asset_Bundle.docx'}, page_content='Project Book — for GenAI & ML\n\nProduction-ready patterns for ML and LLM/GenAI on Azure Databricks\n\nVersion: 1.1 | Date: 2025-09-10\n\n1. Executive Summary\n\nThis Project Book defines a production-grade template and runbook for delivering machine learning (ML) and generative AI (GenAI) applications using Databricks Asset Bundles (DAB). It separates concerns between ML (data science, modeling, evaluation) and MLOps (platform, deployment, governance, observability), and codifies the repository structure, bundle configuration (databricks.yml), CI/CD, security, and operations.\n\n2. Scope & Audience\n\nAudience: ML Engineers, Data Scientists, Platform/MLOps Engineers, Security & Governance, Product Owners.\n\nScope: Applies to both traditional ML (forecasting, classification, etc.) and GenAI (RAG, agents, copilots).\n\nCloud/Platform: Azure Databri

100%|██████████| 1/1 [00:00<00:00,  5.90it/s]

✅ Loaded 1 documents

📄 Document Summary:
  1. Source: D:\Work\E D C\poc_p1\data\docs\sample_data\Project_Book_Databricks_Asset_Bundle.docx
     Content: Project Book — for GenAI & ML

Production-ready patterns for ML and LLM/GenAI on Azure Databricks

V...
     Length: 6753 characters






## Local FAISS Index Creation


In [4]:
# Create FAISS index for local development
if not config_manager.data.use_databricks:
    print("🏗️ Creating FAISS index for local development...")
    
    try:
        index_path = build_index(
            docs_dir=config_manager.data.docs_dir,
            index_dir=config_manager.data.index_dir,
            use_databricks=False
        )
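        # build_index returned None in the recorded run, so index_path is not a usable location;
        # the index files are written to index_dir, which FAISS.load_local reads below.
        print(f"✅ FAISS index created at: {index_path}")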
        
        # Test the index
        from langchain_community.vectorstores import FAISS
        from langchain_community.embeddings import HuggingFaceEmbeddings
        
        embeddings = HuggingFaceEmbeddings(
            model_name=config_manager.embedding.model_name
        )
        
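        # Part of a saved FAISS store is a pickle file, so load_local needs an explicit opt-in;
        # only enable it for index files you created yourself.
        store = FAISS.load_local(
            config_manager.data.index_dir,
            embeddings,
            allow_dangerous_deserialization=True
        )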
        
        print(f"📊 Index contains {store.index.ntotal} vectors")
        
        # Test similarity search
        test_query = "What is the project charter about?"
        results = store.similarity_search(test_query, k=3)
        
        print(f"\n🔍 Test query: '{test_query}'")
        print("📄 Top 3 results:")
        for i, result in enumerate(results):
            print(f"  {i+1}. {result.page_content[:150]}...")
            
    except Exception as e:
        print(f"❌ Error creating FAISS index: {e}")
else:
    print("⏭️ Skipping FAISS index creation (Databricks mode enabled)")


🏗️ Creating FAISS index for local development...


100%|██████████| 1/1 [00:00<00:00,  5.46it/s]


100%|██████████| 1/1 [00:00<00:00,  4.43it/s]


✅ FAISS index created at: None
📊 Index contains 9 vectors

🔍 Test query: 'What is the project charter about?'
📄 Top 3 results:
  1. targets allow environment-specific overrides (compute, permissions, schedules, variables).

6. Repository & Project Folder Structure

Use the followin...
  2. Register models & app artifacts in MLflow / Unity Catalog; author notebooks & src modules; tests.

4.2 MLOps (Platform/Operations scope)

Bundle confi...
  3. 7. Governance & Security

Unity Catalog: single source of truth for data, models, volumes; assign privileges to groups; avoid user-level grants.

Use ...
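
Conceptually, `similarity_search(test_query, k=3)` embeds the query and returns the `k` stored chunks whose vectors are closest to it. The sketch below reproduces that idea with an exhaustive L2 scan over toy vectors in plain Python. `top_k_l2` is a hypothetical helper and the vectors are made up for illustration; the index type and distance metric that `build_index` actually configures are not shown in this notebook.

```python
import math
from typing import List, Sequence, Tuple


def top_k_l2(
    query: Sequence[float],
    vectors: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (row_index, distance) for the k vectors nearest to `query`.

    Brute-force Euclidean search; a hypothetical stand-in for a vector index.
    """
    scored = [(i, math.dist(query, vec)) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


if __name__ == "__main__":
    toy_vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0], [0.0, 2.0]]
    for row, distance in top_k_l2([0.0, 0.0], toy_vectors, k=2):
        print(row, distance)
```

An exhaustive scan is exact but linear in the number of vectors, which is acceptable for the nine vectors in this index and becomes the bottleneck only at much larger scale.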
