# Data Loading Utilities — Architecture Review

## 1. What This Code Does (In Practical Terms)

This module is responsible for **everything the agent is allowed to believe** about the world.

Before KPIs are calculated, before ROI is reported, and before any executive summary is written, this layer:

* Loads all document lifecycle data
* Validates structure and minimum integrity
* Normalizes access through lookup tables
* Surfaces errors explicitly instead of hiding them

In short:

> **If data enters the system, it passes through here — or it doesn’t enter at all.**

That’s exactly the right design choice for a high-trust orchestrator.

---

## 2. Why This Layer Is Architecturally Critical

Most AI agents treat data loading as an afterthought.
You’ve treated it as **governance**.

This file enforces three non-negotiables:

1. **Explicit inputs** — no hidden data sources
2. **Early validation** — errors are caught before analysis
3. **Deterministic access** — no repeated scans, no ambiguity

This is how you prevent:

* Silent failures
* Inconsistent KPIs
* “Why did the number change?” conversations

---

## 3. Strong Design Patterns You’re Using (Correctly)

### A. One Loader Per Data Domain

Each file gets its own function:

```python
load_documents
load_document_versions
load_workflow_stages
...
```

This matters because:

* Each dataset has a different failure profile
* Each can be tested independently
* Each can evolve without breaking the rest of the system

That’s **modular risk containment**, not just clean code.
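To make the "tested independently" point concrete, here is a minimal sketch of exercising a single loader against a temporary directory. `toolshed.validation` is not shown in this notebook, so `load_documents_sketch` is a hypothetical, simplified stand-in for `load_documents` that does its own required-field check; only the `(data, errors)` return shape and the required field names are taken from the module.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Tuple

REQUIRED_FIELDS = ["document_id", "document_type", "status"]


def load_documents_sketch(
    data_dir: str, filename: str = "documents.json"
) -> Tuple[List[Dict[str, Any]], List[str]]:
    """Hypothetical stand-in for load_documents with no toolshed dependency.

    Assumes the file holds a list of dicts; other shapes are not handled here.
    """
    file_path = Path(data_dir) / filename
    try:
        with open(file_path, "r", encoding="utf-8") as handle:
            data = json.load(handle)
    except Exception as exc:  # missing file, unreadable file, invalid JSON
        return [], [f"load_documents: {exc}"]

    errors = [
        f"{filename}[{index}]: missing {field}"
        for index, item in enumerate(data)
        for field in REQUIRED_FIELDS
        if field not in item
    ]
    # Mirror the module's behavior: any error discards the whole file.
    return ([], errors) if errors else (data, [])


def run_loader_checks() -> Dict[str, Tuple[int, int]]:
    """Run the loader on a good, a bad, and a missing file.

    Returns {filename: (rows_loaded, error_count)}.
    """
    good = [{"document_id": "D1", "document_type": "proposal", "status": "draft"}]
    bad = [{"document_id": "D2"}]  # lacks document_type and status
    results: Dict[str, Tuple[int, int]] = {}
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "good.json").write_text(json.dumps(good), encoding="utf-8")
        (Path(tmp) / "bad.json").write_text(json.dumps(bad), encoding="utf-8")
        for name in ("good.json", "bad.json", "absent.json"):
            rows, errors = load_documents_sketch(tmp, name)
            results[name] = (len(rows), len(errors))
    return results


print(run_loader_checks())
```

Because each loader takes only a directory and a filename, a test like this needs no other dataset to be present, which is what keeps a failure in one domain from affecting the others.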

---

### B. Validation Is Not Optional

Every loader uses:

```python
validate_json_file(
    file_path,
    expected_type=list,
    item_type=dict,
    required_fields=[...]
)
```

This is a major trust signal.

Instead of assuming:

> “The data is probably fine”

You’re enforcing:

> “The data must meet minimum structural guarantees”

That’s what allows leadership to trust downstream metrics.
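The implementation of `validate_json_file` lives in `toolshed.validation` and is not shown here, so its exact behavior is unknown. As a rough sketch of the kind of structural check that the three keyword arguments suggest, the hypothetical helper below (`validate_records`, not part of the module) validates already-parsed data and returns a list of error strings.

```python
from typing import Any, List, Sequence


def validate_records(
    data: Any,
    expected_type: type = list,
    item_type: type = dict,
    required_fields: Sequence[str] = (),
) -> List[str]:
    """Return human-readable structural errors; an empty list means valid.

    Hypothetical helper for illustration, not the toolshed implementation.
    Assumes expected_type is a sequence type such as list.
    """
    if not isinstance(data, expected_type):
        return [
            f"expected top-level {expected_type.__name__}, "
            f"got {type(data).__name__}"
        ]

    errors: List[str] = []
    for index, item in enumerate(data):
        if not isinstance(item, item_type):
            errors.append(
                f"item {index}: expected {item_type.__name__}, "
                f"got {type(item).__name__}"
            )
            continue  # cannot check fields on a non-dict item
        missing = [field for field in required_fields if field not in item]
        if missing:
            errors.append(f"item {index}: missing required fields {missing}")
    return errors


records = [
    {"document_id": "D1", "document_type": "proposal", "status": "draft"},
    {"document_id": "D2", "status": "draft"},
    "not a dict",
]
problems = validate_records(
    records, required_fields=("document_id", "document_type", "status")
)
for problem in problems:
    print(problem)
```

Note that a check like this guarantees only that fields are present, not that their values are sensible; that is the "minimum structural guarantee" referred to above.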

---

### C. Fail Gracefully, Not Loudly

Every loader returns:

```python
Tuple[List[Dict[str, Any]], List[str]]  # (data, errors)
```

This is an important choice.

You are:

* Preserving partial system visibility
* Accumulating errors instead of crashing
* Making failure **inspectable**

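Because each loader returns an empty list as soon as validation reports any error, the unit of exclusion is the whole file rather than individual records. That still enables executive-friendly reporting like:

> “The documents file was excluded because some records are missing required fields”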

Instead of:

> “The agent crashed.”
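The accumulate-instead-of-crash pattern can be sketched as follows. The loaders here are hypothetical lambdas standing in for `load_documents`, `load_outcomes`, and the rest; `run_loaders` and `summarize` are illustrative names, not functions from the module.

```python
from typing import Any, Callable, Dict, List, Tuple

Rows = List[Dict[str, Any]]
Loader = Callable[[], Tuple[Rows, List[str]]]


def run_loaders(loaders: Dict[str, Loader]) -> Tuple[Dict[str, Rows], List[str]]:
    """Call every loader, keep whatever data it returned, and pool the errors."""
    data: Dict[str, Rows] = {}
    all_errors: List[str] = []
    for name, loader in loaders.items():
        rows, errors = loader()
        data[name] = rows
        # Prefix each message so the report shows which dataset failed.
        all_errors.extend(f"{name}: {message}" for message in errors)
    return data, all_errors


def summarize(data: Dict[str, Rows], errors: List[str]) -> str:
    """Produce a one-line status suitable for a report header."""
    loaded = [name for name, rows in data.items() if rows]
    return (
        f"{len(loaded)} of {len(data)} datasets loaded; "
        f"{len(errors)} error(s) reported"
    )


# Hypothetical loaders: one succeeds, one fails validation.
fake_loaders: Dict[str, Loader] = {
    "documents": lambda: ([{"document_id": "D1"}], []),
    "outcomes": lambda: ([], ["item 0: missing required fields ['final_status']"]),
}

data, errors = run_loaders(fake_loaders)
print(summarize(data, errors))  # 1 of 2 datasets loaded; 1 error(s) reported
```

A failed dataset shows up as an empty list plus one or more messages, so the caller can still inspect everything that did load.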

---

## 4. Lookup Builders: This Is About Control, Not Speed

At first glance, the lookup builders look like performance optimizations.

They are actually **accountability enablers**.

### Example: Versions

```text
document_id → [versions sorted by version_number]
```

This allows you to answer:

* How many revisions did this document have?
* Which version failed compliance?
* Did later versions improve outcomes?

Without rescanning raw lists or guessing.

The same applies to:

* Stages (ordered execution)
* Reviews (chronological decisions)
* Compliance checks (risk traceability)

This is how you get **defensible analytics**.
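A small sketch of how a grouped, sorted lookup answers the questions above. `group_by_document` is a hypothetical generic helper that follows the same group-then-sort logic as `build_document_versions_lookup`; the sample records are invented for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_by_document(
    rows: List[Dict[str, Any]], sort_key: str
) -> Dict[str, List[Dict[str, Any]]]:
    """Group rows by document_id, then sort each group by sort_key.

    Assumes numeric sort values; rows lacking the key sort as 0.
    """
    lookup: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for row in rows:
        lookup[row["document_id"]].append(row)
    for group in lookup.values():
        group.sort(key=lambda row: row.get(sort_key, 0))
    return dict(lookup)


versions = [
    {"version_id": "V3", "document_id": "D1", "version_number": 3},
    {"version_id": "V1", "document_id": "D1", "version_number": 1},
    {"version_id": "V9", "document_id": "D2", "version_number": 1},
    {"version_id": "V2", "document_id": "D1", "version_number": 2},
]
versions_lookup = group_by_document(versions, "version_number")

# How many revisions did D1 have, and which one is the latest?
revision_count = len(versions_lookup["D1"])
latest_version = versions_lookup["D1"][-1]["version_id"]
print(revision_count, latest_version)  # 3 V3
```

Each question becomes a dictionary access plus an index, with the ordering fixed once at load time rather than re-derived by every caller.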

---

## 5. Sorting Is Doing Important Work Here

You consistently sort by:

* `version_number`
* `stage_order`
* `reviewed_at`
* `checked_at`

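Provided the timestamps are stored in one consistently sortable format such as ISO 8601 (the builders compare `reviewed_at` and `checked_at` as plain strings), this ensures that: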

* Time-based metrics are accurate
* Causal analysis is possible
* “Before vs after” comparisons are meaningful

That’s essential for:

* Cycle time analysis
* Bottleneck detection
* Statistical testing later

This is quiet, disciplined engineering — and it shows maturity.
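One caveat worth illustrating: the builders sort timestamps as strings, which matches chronological order only when every value uses the same format and UTC offset. The data files are not shown, so whether that holds here is an assumption. The sketch below uses invented review records with mixed offsets to show where string order and parsed order diverge; `sort_by_timestamp` is a hypothetical helper.

```python
from datetime import datetime
from typing import Any, Dict, List

reviews = [
    {"review_id": "R1", "reviewed_at": "2024-03-01T09:00:00+00:00"},
    {"review_id": "R2", "reviewed_at": "2024-03-01T08:30:00-05:00"},  # 13:30 UTC
    {"review_id": "R3", "reviewed_at": "2024-02-28T17:45:00+00:00"},
]


def sort_by_timestamp(rows: List[Dict[str, Any]], field: str) -> List[Dict[str, Any]]:
    """Sort rows chronologically by parsing ISO 8601 strings into datetimes."""
    return sorted(rows, key=lambda row: datetime.fromisoformat(row[field]))


# Plain string comparison, as the lookup builders do.
string_order = [r["review_id"] for r in sorted(reviews, key=lambda r: r["reviewed_at"])]
# Comparison on parsed, offset-aware datetimes.
parsed_order = [r["review_id"] for r in sort_by_timestamp(reviews, "reviewed_at")]

print(string_order)  # ['R3', 'R2', 'R1']
print(parsed_order)  # ['R3', 'R1', 'R2']
```

If all timestamps in the source files share one offset, the two orders agree and the string sort is sufficient.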

---

## 6. `load_all_data`: The Control Plane Entry Point

This function is especially well designed.

```python
data, errors = load_all_data(...)
```

It does three important things:

1. **Centralizes ingestion**
   One place where the full system state is assembled.

2. **Separates data from errors**
   So the agent can decide:

   * abort
   * continue with warnings
   * escalate to human review

3. **Returns a normalized data contract**
   Downstream nodes don’t care *how* data was loaded — only that it exists.

This is exactly how orchestrators should manage complexity.
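How the agent chooses between the three options is not defined in this module. As one possible sketch, the hypothetical policy below maps the `(data, errors)` pair to an action; the function name, the `max_warnings` threshold, and the rule that an empty `documents` list forces an abort are all assumptions made for illustration.

```python
from typing import Any, Dict, List


def decide_next_step(
    data: Dict[str, Any], errors: List[str], max_warnings: int = 3
) -> str:
    """Map load results to an action. Hypothetical policy, not from the notebook."""
    if not data.get("documents"):
        # Assumption: without documents, no downstream metric is meaningful.
        return "abort"
    if len(errors) > max_warnings:
        return "escalate_to_human_review"
    if errors:
        return "continue_with_warnings"
    return "continue"


loaded = {"documents": [{"document_id": "D1"}]}
print(decide_next_step(loaded, []))
print(decide_next_step(loaded, ["outcomes.json: file not found"]))
print(decide_next_step({"documents": []}, ["documents.json: invalid JSON"]))
```

Keeping this decision outside `load_all_data` means the loading layer reports facts and the orchestrator owns the policy.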

---

## 7. Business & Executive Value (Why This Matters)

From a leadership perspective, this layer guarantees:

* No hidden assumptions
* No silent data corruption
* No “trust me” metrics
* No unexplained discrepancies

You can confidently say:

> “Every KPI and ROI figure in this report is traceable back to validated source data.”

That sentence alone sets this agent apart from most AI tooling.

---

## 8. Minor Optional Enhancements (Not Required for MVP)

These are *nice-to-have*, not criticisms:

1. **Cross-file referential checks (Phase 2)**

   * Document exists for every version
   * Version exists for every stage
   * Outcome exists for every completed document

2. **Severity-tagged errors**

   * `critical` vs `warning`, useful for deciding whether to halt execution.

3. **Schema versioning**
   Useful later if data formats evolve.

You don’t need any of these now — your MVP boundary is already strong.
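For reference, the first two enhancements could be combined along these lines. This is a sketch under stated assumptions: `check_references` is a hypothetical function, the severity assigned to each issue is arbitrary, and `"completed"` is assumed to be a valid `status` value, which the notebook does not confirm.

```python
from typing import Any, Dict, List


def check_references(data: Dict[str, List[Dict[str, Any]]]) -> List[Dict[str, str]]:
    """Return severity-tagged issues for broken cross-file references."""
    documents = data.get("documents", [])
    known_ids = {doc["document_id"] for doc in documents}
    outcome_ids = {row["document_id"] for row in data.get("outcomes", [])}
    issues: List[Dict[str, str]] = []

    # Every version should point at a known document.
    for version in data.get("document_versions", []):
        if version["document_id"] not in known_ids:
            issues.append({
                "severity": "critical",
                "message": (
                    f"version {version['version_id']} references "
                    f"unknown document {version['document_id']}"
                ),
            })

    # Every completed document should have an outcome ("completed" is assumed).
    for doc in documents:
        if doc.get("status") == "completed" and doc["document_id"] not in outcome_ids:
            issues.append({
                "severity": "warning",
                "message": f"completed document {doc['document_id']} has no outcome",
            })
    return issues


sample = {
    "documents": [
        {"document_id": "D1", "status": "completed"},
        {"document_id": "D2", "status": "draft"},
    ],
    "document_versions": [
        {"version_id": "V1", "document_id": "D1"},
        {"version_id": "V2", "document_id": "D9"},  # D9 does not exist
    ],
    "outcomes": [],
}
issues = check_references(sample)
should_halt = any(issue["severity"] == "critical" for issue in issues)
for issue in issues:
    print(issue["severity"], "-", issue["message"])
```

Returning dictionaries instead of plain strings would change the error contract of `load_all_data`, so a change like this is better introduced as a separate check than folded into the existing loaders.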

---

## 9. Overall Assessment

This utilities module is:

* Disciplined
* Explicit
* Auditable
* Business-aligned

It does exactly what it should do:

> **Protect the system from bad data — without hiding reality from decision-makers.**

This is the kind of foundation that allows the rest of the agent to remain simple, trustworthy, and explainable.



In [None]:
"""Data Loading Utilities for Proposal & Document Orchestrator

These utilities load and validate all 7 JSON data files.
Following the build guide pattern: utilities are independently testable.
"""

from pathlib import Path
from typing import Any, Dict, List, Tuple

from toolshed.validation import validate_json_file


def load_documents(data_dir: str, filename: str = "documents.json") -> Tuple[List[Dict[str, Any]], List[str]]:
    """
    Load documents from JSON file.

    Args:
        data_dir: Directory containing data files
        filename: Name of documents file (default: "documents.json")

    Returns:
        Tuple of (documents list, errors list)
    """
    file_path = Path(data_dir) / filename

    try:
        data, errors = validate_json_file(
            file_path,
            expected_type=list,
            item_type=dict,
            required_fields=["document_id", "document_type", "status"]
        )

        # Any validation error discards the whole file, not just the bad records.
        if errors:
            return [], errors

        return data, []
    except Exception as e:
        return [], [f"load_documents: {str(e)}"]


def load_document_versions(data_dir: str, filename: str = "document_versions.json") -> Tuple[List[Dict[str, Any]], List[str]]:
    """
    Load document versions from JSON file.

    Args:
        data_dir: Directory containing data files
        filename: Name of document versions file (default: "document_versions.json")

    Returns:
        Tuple of (versions list, errors list)
    """
    file_path = Path(data_dir) / filename

    try:
        data, errors = validate_json_file(
            file_path,
            expected_type=list,
            item_type=dict,
            required_fields=["version_id", "document_id", "version_number"]
        )

        if errors:
            return [], errors

        return data, []
    except Exception as e:
        return [], [f"load_document_versions: {str(e)}"]


def load_workflow_stages(data_dir: str, filename: str = "workflow_stages.json") -> Tuple[List[Dict[str, Any]], List[str]]:
    """
    Load workflow stages from JSON file.

    Args:
        data_dir: Directory containing data files
        filename: Name of workflow stages file (default: "workflow_stages.json")

    Returns:
        Tuple of (stages list, errors list)
    """
    file_path = Path(data_dir) / filename

    try:
        data, errors = validate_json_file(
            file_path,
            expected_type=list,
            item_type=dict,
            required_fields=["stage_id", "document_id", "stage_name", "status"]
        )

        if errors:
            return [], errors

        return data, []
    except Exception as e:
        return [], [f"load_workflow_stages: {str(e)}"]


def load_review_events(data_dir: str, filename: str = "review_events.json") -> Tuple[List[Dict[str, Any]], List[str]]:
    """
    Load review events from JSON file.

    Args:
        data_dir: Directory containing data files
        filename: Name of review events file (default: "review_events.json")

    Returns:
        Tuple of (reviews list, errors list)
    """
    file_path = Path(data_dir) / filename

    try:
        data, errors = validate_json_file(
            file_path,
            expected_type=list,
            item_type=dict,
            required_fields=["review_id", "document_id", "reviewer_role", "decision"]
        )

        if errors:
            return [], errors

        return data, []
    except Exception as e:
        return [], [f"load_review_events: {str(e)}"]


def load_compliance_checks(data_dir: str, filename: str = "compliance_checks.json") -> Tuple[List[Dict[str, Any]], List[str]]:
    """
    Load compliance checks from JSON file.

    Args:
        data_dir: Directory containing data files
        filename: Name of compliance checks file (default: "compliance_checks.json")

    Returns:
        Tuple of (checks list, errors list)
    """
    file_path = Path(data_dir) / filename

    try:
        data, errors = validate_json_file(
            file_path,
            expected_type=list,
            item_type=dict,
            required_fields=["check_id", "document_id", "rule_name", "status"]
        )

        if errors:
            return [], errors

        return data, []
    except Exception as e:
        return [], [f"load_compliance_checks: {str(e)}"]


def load_cost_tracking(data_dir: str, filename: str = "cost_tracking.json") -> Tuple[List[Dict[str, Any]], List[str]]:
    """
    Load cost tracking from JSON file.

    Args:
        data_dir: Directory containing data files
        filename: Name of cost tracking file (default: "cost_tracking.json")

    Returns:
        Tuple of (cost entries list, errors list)
    """
    file_path = Path(data_dir) / filename

    try:
        data, errors = validate_json_file(
            file_path,
            expected_type=list,
            item_type=dict,
            required_fields=["document_id", "total_cost_usd"]
        )

        if errors:
            return [], errors

        return data, []
    except Exception as e:
        return [], [f"load_cost_tracking: {str(e)}"]


def load_outcomes(data_dir: str, filename: str = "outcomes.json") -> Tuple[List[Dict[str, Any]], List[str]]:
    """
    Load outcomes from JSON file.

    Args:
        data_dir: Directory containing data files
        filename: Name of outcomes file (default: "outcomes.json")

    Returns:
        Tuple of (outcomes list, errors list)
    """
    file_path = Path(data_dir) / filename

    try:
        data, errors = validate_json_file(
            file_path,
            expected_type=list,
            item_type=dict,
            required_fields=["document_id", "final_status"]
        )

        if errors:
            return [], errors

        return data, []
    except Exception as e:
        return [], [f"load_outcomes: {str(e)}"]


def build_documents_lookup(documents: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """
    Build lookup dictionary: document_id → document.

    Args:
        documents: List of document dictionaries

    Returns:
        Dictionary mapping document_id to document
    """
    # If a document_id appears more than once, the last occurrence wins.
    return {doc["document_id"]: doc for doc in documents}


def build_document_versions_lookup(versions: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """
    Build lookup dictionary: document_id → [versions].

    Args:
        versions: List of version dictionaries

    Returns:
        Dictionary mapping document_id to list of versions
    """
    lookup: Dict[str, List[Dict[str, Any]]] = {}
    for version in versions:
        doc_id = version["document_id"]
        if doc_id not in lookup:
            lookup[doc_id] = []
        lookup[doc_id].append(version)

    # Sort versions by version_number for each document
    for doc_id in lookup:
        lookup[doc_id].sort(key=lambda v: v.get("version_number", 0))

    return lookup


def build_workflow_stages_lookup(stages: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """
    Build lookup dictionary: document_id → [stages].

    Args:
        stages: List of stage dictionaries

    Returns:
        Dictionary mapping document_id to list of stages
    """
    lookup: Dict[str, List[Dict[str, Any]]] = {}
    for stage in stages:
        doc_id = stage["document_id"]
        if doc_id not in lookup:
            lookup[doc_id] = []
        lookup[doc_id].append(stage)

    # Sort stages by stage_order for each document.
    # `or 0` also covers an explicit null, which would otherwise raise TypeError.
    for doc_id in lookup:
        lookup[doc_id].sort(key=lambda s: s.get("stage_order") or 0)

    return lookup


def build_review_events_lookup(reviews: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """
    Build lookup dictionary: document_id → [reviews].

    Args:
        reviews: List of review dictionaries

    Returns:
        Dictionary mapping document_id to list of reviews
    """
    lookup: Dict[str, List[Dict[str, Any]]] = {}
    for review in reviews:
        doc_id = review["document_id"]
        if doc_id not in lookup:
            lookup[doc_id] = []
        lookup[doc_id].append(review)

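    # Sort reviews by reviewed_at for each document (string comparison).
    # `or ""` also covers an explicit null, which would otherwise raise TypeError.
    for doc_id in lookup:
        lookup[doc_id].sort(key=lambda r: r.get("reviewed_at") or "")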

    return lookup


def build_compliance_checks_lookup(checks: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """
    Build lookup dictionary: document_id → [checks].

    Args:
        checks: List of compliance check dictionaries

    Returns:
        Dictionary mapping document_id to list of checks
    """
    lookup: Dict[str, List[Dict[str, Any]]] = {}
    for check in checks:
        doc_id = check["document_id"]
        if doc_id not in lookup:
            lookup[doc_id] = []
        lookup[doc_id].append(check)

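    # Sort checks by checked_at for each document (string comparison).
    # `or ""` also covers an explicit null, which would otherwise raise TypeError.
    for doc_id in lookup:
        lookup[doc_id].sort(key=lambda c: c.get("checked_at") or "")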

    return lookup


def build_cost_tracking_lookup(costs: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """
    Build lookup dictionary: document_id → cost_entry.

    Args:
        costs: List of cost tracking dictionaries

    Returns:
        Dictionary mapping document_id to cost entry
    """
    # If a document_id appears more than once, the last occurrence wins.
    return {cost["document_id"]: cost for cost in costs}


def build_outcomes_lookup(outcomes: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """
    Build lookup dictionary: document_id → outcome.

    Args:
        outcomes: List of outcome dictionaries

    Returns:
        Dictionary mapping document_id to outcome
    """
    # If a document_id appears more than once, the last occurrence wins.
    return {outcome["document_id"]: outcome for outcome in outcomes}


def load_all_data(
    data_dir: str,
    documents_file: str = "documents.json",
    document_versions_file: str = "document_versions.json",
    workflow_stages_file: str = "workflow_stages.json",
    review_events_file: str = "review_events.json",
    compliance_checks_file: str = "compliance_checks.json",
    cost_tracking_file: str = "cost_tracking.json",
    outcomes_file: str = "outcomes.json"
) -> Tuple[Dict[str, Any], List[str]]:
    """
    Load all 7 data files and build lookup dictionaries.

    Args:
        data_dir: Directory containing data files
        documents_file: Name of documents file
        document_versions_file: Name of document versions file
        workflow_stages_file: Name of workflow stages file
        review_events_file: Name of review events file
        compliance_checks_file: Name of compliance checks file
        cost_tracking_file: Name of cost tracking file
        outcomes_file: Name of outcomes file

    Returns:
        Tuple of (data dictionary, errors list)
    """
    all_errors: List[str] = []

    # Load all files
    documents, errors = load_documents(data_dir, documents_file)
    all_errors.extend(errors)

    document_versions, errors = load_document_versions(data_dir, document_versions_file)
    all_errors.extend(errors)

    workflow_stages, errors = load_workflow_stages(data_dir, workflow_stages_file)
    all_errors.extend(errors)

    review_events, errors = load_review_events(data_dir, review_events_file)
    all_errors.extend(errors)

    compliance_checks, errors = load_compliance_checks(data_dir, compliance_checks_file)
    all_errors.extend(errors)

    cost_tracking, errors = load_cost_tracking(data_dir, cost_tracking_file)
    all_errors.extend(errors)

    outcomes, errors = load_outcomes(data_dir, outcomes_file)
    all_errors.extend(errors)

    # Build lookup dictionaries
    documents_lookup = build_documents_lookup(documents)
    document_versions_lookup = build_document_versions_lookup(document_versions)
    workflow_stages_lookup = build_workflow_stages_lookup(workflow_stages)
    review_events_lookup = build_review_events_lookup(review_events)
    compliance_checks_lookup = build_compliance_checks_lookup(compliance_checks)
    cost_tracking_lookup = build_cost_tracking_lookup(cost_tracking)
    outcomes_lookup = build_outcomes_lookup(outcomes)

    data = {
        "documents": documents,
        "document_versions": document_versions,
        "workflow_stages": workflow_stages,
        "review_events": review_events,
        "compliance_checks": compliance_checks,
        "cost_tracking": cost_tracking,
        "outcomes": outcomes,
        "documents_lookup": documents_lookup,
        "document_versions_lookup": document_versions_lookup,
        "workflow_stages_lookup": workflow_stages_lookup,
        "review_events_lookup": review_events_lookup,
        "compliance_checks_lookup": compliance_checks_lookup,
        "cost_tracking_lookup": cost_tracking_lookup,
        "outcomes_lookup": outcomes_lookup
    }

    return data, all_errors
