# Data Loading Utilities — Trust, Integrity, and Determinism

## What This Module Does (In Plain Terms)

This module is responsible for **bringing reality into the system**.

Before the orchestrator can reason about risk, escalate decisions, or calculate ROI, it needs data it can **trust**. These utilities ensure that:

* every dataset is present
* every dataset is structurally valid (the top-level JSON value is a list)
* failures are explicit and immediate
* data access is deterministic and testable

This module is deliberately simple — and that’s what makes it powerful.

---

## Why “Pure Functions” Matter Here

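Every function in this module is:

* stateless
* deterministic for the same arguments and the same file contents
* independently testable
* free of side effects (the loaders only read files; the lookup builders and the filter do no I/O)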

That means:

* no hidden caching
* no implicit globals
* no database coupling
* no environment-specific behavior

This is exactly what you want in a **risk system**.
If a file is missing or malformed, the system fails loudly and early.

---

## 1. Explicit, Defensive Data Loading

Each `load_*` function follows the same disciplined pattern:

1. Build the file path explicitly
2. Fail if the file does not exist
3. Load JSON with UTF-8 encoding
4. Validate that the result is a list
5. Return raw data without mutation

Example pattern:

```python
# Step 2 of the pattern, as written in load_third_parties
if not file_path.exists():
    raise FileNotFoundError(f"Third parties file not found: {file_path}")
```
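
The five steps above can be sketched as one generic helper. This is an illustration rather than the module's own code: `load_json_list` and its `label` parameter are hypothetical names, and the module itself repeats the pattern inside each `load_*` function.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_json_list(data_dir: str, filename: str, label: str) -> List[Dict[str, Any]]:
    """Hypothetical helper showing the shared load pattern."""
    # 1. Build the file path explicitly
    file_path = Path(data_dir) / filename

    # 2. Fail if the file does not exist
    if not file_path.exists():
        raise FileNotFoundError(f"{label} file not found: {file_path}")

    # 3. Load JSON with UTF-8 encoding (invalid JSON raises json.JSONDecodeError)
    with open(file_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    # 4. Validate that the top-level value is a list
    if not isinstance(data, list):
        raise ValueError(f"Expected list in {filename}, got {type(data).__name__}")

    # 5. Return the parsed data unchanged
    return data
```

Note that the list check only inspects the top-level value. The contents of each record are left for later layers to interpret.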

### Why This Is Important

This enforces a critical principle:

> **No risk decision is made on partial or malformed data.**

If a dataset is missing or corrupted:

* the orchestrator stops
* the failure is traceable
* the issue is operational, not analytical

This prevents silent risk degradation — one of the most dangerous failure modes in governance systems.

---

## 2. Separation of Concerns: Loading ≠ Interpretation

Notice what these functions *do not* do:

* no validation of business logic
* no cross-file joins
* no assumptions about risk meaning
* no transformation of values

They answer only one question:

> “Is the data present and structurally sane?”

This clean separation allows:

* loaders to stay stable
* logic to evolve independently
* unit tests to stay simple
* audits to trace failures precisely

This is textbook **defensive system design**, applied correctly.
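
Because a loader only checks presence and shape, its tests need nothing more than a temporary directory. The sketch below assumes plain `assert`-style tests; the test function names are hypothetical, and `load_third_parties` is a condensed copy of the module's loader so the example runs on its own.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def load_third_parties(data_dir: str, filename: str = "third_parties.json") -> List[Dict[str, Any]]:
    # Condensed copy of the module's loader, repeated for self-containment
    file_path = Path(data_dir) / filename
    if not file_path.exists():
        raise FileNotFoundError(f"Third parties file not found: {file_path}")
    with open(file_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError(f"Expected list in {filename}, got {type(data).__name__}")
    return data


def test_loads_list_unchanged() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        records = [{"vendor_id": "VEND_001", "vendor_name": "Example Vendor"}]
        (Path(tmp) / "third_parties.json").write_text(json.dumps(records), encoding="utf-8")
        assert load_third_parties(tmp) == records


def test_missing_file_fails() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        try:
            load_third_parties(tmp)  # directory is empty
        except FileNotFoundError:
            return
        raise AssertionError("expected FileNotFoundError")


def test_non_list_json_fails() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        # Valid JSON, but the top-level value is an object rather than a list
        (Path(tmp) / "third_parties.json").write_text('{"vendor_id": "VEND_001"}', encoding="utf-8")
        try:
            load_third_parties(tmp)
        except ValueError:
            return
        raise AssertionError("expected ValueError")
```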

---

## 3. Lookup Builders: Determinism at Scale

### `build_vendor_lookup`

### `build_risk_domain_lookup`

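These functions turn lists into **explicit dictionaries** keyed by identifiers (`vendor_id` and `risk_domain`). Records with a missing or empty identifier are skipped, and if an identifier repeats, the last record replaces the earlier one.

Why this matters:

* lookups are O(1) on average, not an O(n) scan
* behavior is predictable
* repeated lookups always resolve to the same record
* results are reproducible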

More importantly, this prevents **implicit joins** from happening deep inside scoring logic — a common source of subtle bugs.

In risk systems, determinism is not an optimization — it’s a requirement.
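
A short example makes the builder's behavior concrete. `build_vendor_lookup` below repeats the module's logic so the snippet is self-contained, and the vendor records are made-up sample data. `build_vendor_lookup_strict` is a hypothetical variant, not part of the module, sketching what the builder could look like if skipped or overwritten records should raise instead.

```python
from typing import Any, Dict, List


def build_vendor_lookup(third_parties: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    # Same logic as the module's builder
    lookup = {}
    for vendor in third_parties:
        vendor_id = vendor.get("vendor_id")
        if vendor_id:
            lookup[vendor_id] = vendor
    return lookup


def build_vendor_lookup_strict(third_parties: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Hypothetical stricter variant: reject missing or duplicate IDs."""
    lookup: Dict[str, Dict[str, Any]] = {}
    for index, vendor in enumerate(third_parties):
        vendor_id = vendor.get("vendor_id")
        if not vendor_id:
            raise ValueError(f"Vendor at index {index} has no vendor_id")
        if vendor_id in lookup:
            raise ValueError(f"Duplicate vendor_id: {vendor_id}")
        lookup[vendor_id] = vendor
    return lookup


vendors = [
    {"vendor_id": "VEND_001", "vendor_name": "First"},
    {"vendor_name": "No ID"},                             # skipped: no vendor_id
    {"vendor_id": "VEND_001", "vendor_name": "Second"},   # replaces the first record
]

lookup = build_vendor_lookup(vendors)
print(lookup["VEND_001"]["vendor_name"])  # Second
print(len(lookup))                        # 1
```

Whether the lenient or the strict behavior is preferable depends on how much the upstream files are trusted; the module as written takes the lenient route.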

---

## 4. Controlled Scope: Filtering Without Guessing

### `filter_vendors_by_id`

This function enforces a simple but powerful rule:

* If a specific vendor is requested, it **must exist**
* If it doesn’t, the run fails immediately

There is no:

* silent fallback
* partial execution
* ambiguous behavior

This protects against:

* typos in vendor IDs
* accidental partial assessments
* misleading KPI results

In other words: **no false confidence**.
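
The rule is easiest to see with a mistyped ID. The function below is a copy of the module's `filter_vendors_by_id`; the two vendor records are made-up sample data.

```python
from typing import Any, Dict, List, Optional


def filter_vendors_by_id(
    third_parties: List[Dict[str, Any]],
    vendor_id: Optional[str]
) -> List[Dict[str, Any]]:
    # Copy of the module's function, repeated for self-containment
    if vendor_id is None:
        return third_parties
    filtered = [v for v in third_parties if v.get("vendor_id") == vendor_id]
    if not filtered:
        raise ValueError(f"Vendor {vendor_id} not found")
    return filtered


vendors = [{"vendor_id": "VEND_001"}, {"vendor_id": "VEND_002"}]

all_vendors = filter_vendors_by_id(vendors, None)        # no filter: every vendor
one_vendor = filter_vendors_by_id(vendors, "VEND_002")   # exact match only

message = ""
try:
    filter_vendors_by_id(vendors, "VEND_02")  # typo in the ID
except ValueError as exc:
    message = str(exc)

print(message)  # Vendor VEND_02 not found
```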

---

## 5. Why This Design Improves Accountability

Because of this module:

* every run has a clear data lineage
* every failure has a concrete cause
* every decision can be traced back to a file

When an executive asks:

> “What data did the system use for this decision?”

You can answer that question **without interpretation**.

That’s the difference between:

* an AI demo
* and an operational risk system

---

## 6. Why MVP-First Is the Right Choice Here

You deliberately avoided:

* databases
* ORMs
* schema registries
* complex validation frameworks

That’s a strength, not a weakness.

It means:

* the architecture is visible
* behavior is understandable
* learning velocity is high
* future migration paths are clean

You can always add complexity later — but you can’t easily remove it once trust is lost.

---

## How This Fits the Larger Orchestrator

This module sits **before** all intelligence.

If this layer fails:

* no scoring occurs
* no escalation happens
* no KPIs are reported

That’s intentional.

A system that reasons confidently on bad data is worse than a system that refuses to run.
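
One way to picture this gate is an entry point that loads every dataset before any other work begins. The sketch below is an assumption about how the orchestrator might call this layer, not its actual code: `load_all_datasets`, `run_assessment`, and `_load_list` are hypothetical names, while the filenames are the defaults from the module's `load_*` functions.

```python
import json
from pathlib import Path
from typing import Any, Dict, List

# Default filenames taken from the module's load_* functions
DATASET_FILES = {
    "third_parties": "third_parties.json",
    "risk_domains": "risk_domains.json",
    "vendor_controls": "vendor_controls.json",
    "external_signals": "external_signals.json",
    "vendor_performance": "vendor_performance.json",
    "assessment_history": "assessment_history.json",
}


def _load_list(file_path: Path) -> List[Dict[str, Any]]:
    # Stand-in for the module's individual load_* functions
    if not file_path.exists():
        raise FileNotFoundError(f"File not found: {file_path}")
    with open(file_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError(f"Expected list in {file_path.name}, got {type(data).__name__}")
    return data


def load_all_datasets(data_dir: str) -> Dict[str, List[Dict[str, Any]]]:
    """Hypothetical entry point: load every dataset or raise before returning anything."""
    return {
        name: _load_list(Path(data_dir) / filename)
        for name, filename in DATASET_FILES.items()
    }


def run_assessment(data_dir: str) -> int:
    """Hypothetical orchestrator step; returns the number of vendors it would score."""
    datasets = load_all_datasets(data_dir)  # any exception here stops the run
    # Scoring, escalation, and KPI reporting would start only after this point
    return len(datasets["third_parties"])
```

Because the exception propagates out of `run_assessment`, a missing or malformed file stops the run before any downstream step sees data.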

---

## Why This Is Executive-Grade Engineering

Executives don’t judge AI systems on cleverness — they judge them on:

* reliability
* predictability
* explainability
* failure behavior

This module directly supports all four.

It’s quiet. It’s boring. It’s strict.
And that’s exactly why the rest of your agent can be trusted.




In [None]:
"""Data loading utilities for Third-Party Risk Orchestrator

This module contains utilities to load and prepare all data sources.
All utilities are pure functions, independently testable.

Following MVP-first approach: Simple JSON file loading, no database dependencies.
"""

import json
from pathlib import Path
from typing import List, Dict, Any, Optional
from config import ThirdPartyRiskOrchestratorConfig


def load_third_parties(data_dir: str, filename: str = "third_parties.json") -> List[Dict[str, Any]]:
    """
    Load third-party vendor data from JSON file.

    Args:
        data_dir: Directory containing data files
        filename: Name of the JSON file (default: "third_parties.json")

    Returns:
        List of vendor dictionaries

    Raises:
        FileNotFoundError: If file doesn't exist
        json.JSONDecodeError: If file contains invalid JSON
        ValueError: If the top-level JSON value is not a list
    """
    file_path = Path(data_dir) / filename

    if not file_path.exists():
        raise FileNotFoundError(f"Third parties file not found: {file_path}")

    with open(file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)

    if not isinstance(data, list):
        raise ValueError(f"Expected list in {filename}, got {type(data).__name__}")

    return data


def load_risk_domains(data_dir: str, filename: str = "risk_domains.json") -> List[Dict[str, Any]]:
    """
    Load risk domain definitions from JSON file.

    Args:
        data_dir: Directory containing data files
        filename: Name of the JSON file (default: "risk_domains.json")

    Returns:
        List of risk domain dictionaries

    Raises:
        FileNotFoundError: If file doesn't exist
        json.JSONDecodeError: If file contains invalid JSON
        ValueError: If the top-level JSON value is not a list
    """
    file_path = Path(data_dir) / filename

    if not file_path.exists():
        raise FileNotFoundError(f"Risk domains file not found: {file_path}")

    with open(file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)

    if not isinstance(data, list):
        raise ValueError(f"Expected list in {filename}, got {type(data).__name__}")

    return data


def load_vendor_controls(data_dir: str, filename: str = "vendor_controls.json") -> List[Dict[str, Any]]:
    """
    Load vendor control evidence from JSON file.

    Args:
        data_dir: Directory containing data files
        filename: Name of the JSON file (default: "vendor_controls.json")

    Returns:
        List of vendor control dictionaries

    Raises:
        FileNotFoundError: If file doesn't exist
        json.JSONDecodeError: If file contains invalid JSON
        ValueError: If the top-level JSON value is not a list
    """
    file_path = Path(data_dir) / filename

    if not file_path.exists():
        raise FileNotFoundError(f"Vendor controls file not found: {file_path}")

    with open(file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)

    if not isinstance(data, list):
        raise ValueError(f"Expected list in {filename}, got {type(data).__name__}")

    return data


def load_external_signals(data_dir: str, filename: str = "external_signals.json") -> List[Dict[str, Any]]:
    """
    Load external risk signals from JSON file.

    Args:
        data_dir: Directory containing data files
        filename: Name of the JSON file (default: "external_signals.json")

    Returns:
        List of external signal dictionaries

    Raises:
        FileNotFoundError: If file doesn't exist
        json.JSONDecodeError: If file contains invalid JSON
        ValueError: If the top-level JSON value is not a list
    """
    file_path = Path(data_dir) / filename

    if not file_path.exists():
        raise FileNotFoundError(f"External signals file not found: {file_path}")

    with open(file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)

    if not isinstance(data, list):
        raise ValueError(f"Expected list in {filename}, got {type(data).__name__}")

    return data


def load_vendor_performance(data_dir: str, filename: str = "vendor_performance.json") -> List[Dict[str, Any]]:
    """
    Load vendor performance metrics from JSON file.

    Args:
        data_dir: Directory containing data files
        filename: Name of the JSON file (default: "vendor_performance.json")

    Returns:
        List of vendor performance dictionaries

    Raises:
        FileNotFoundError: If file doesn't exist
        json.JSONDecodeError: If file contains invalid JSON
        ValueError: If the top-level JSON value is not a list
    """
    file_path = Path(data_dir) / filename

    if not file_path.exists():
        raise FileNotFoundError(f"Vendor performance file not found: {file_path}")

    with open(file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)

    if not isinstance(data, list):
        raise ValueError(f"Expected list in {filename}, got {type(data).__name__}")

    return data


def load_assessment_history(data_dir: str, filename: str = "assessment_history.json") -> List[Dict[str, Any]]:
    """
    Load historical risk assessments from JSON file.

    Args:
        data_dir: Directory containing data files
        filename: Name of the JSON file (default: "assessment_history.json")

    Returns:
        List of historical assessment dictionaries

    Raises:
        FileNotFoundError: If file doesn't exist
        json.JSONDecodeError: If file contains invalid JSON
        ValueError: If the top-level JSON value is not a list
    """
    file_path = Path(data_dir) / filename

    if not file_path.exists():
        raise FileNotFoundError(f"Assessment history file not found: {file_path}")

    with open(file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)

    if not isinstance(data, list):
        raise ValueError(f"Expected list in {filename}, got {type(data).__name__}")

    return data


def build_vendor_lookup(third_parties: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """
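    Create fast lookup dictionary for vendors by vendor_id.

    Vendors with a missing or empty vendor_id are skipped. If a vendor_id
    appears more than once, the last record wins.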

    Args:
        third_parties: List of vendor dictionaries

    Returns:
        Dictionary mapping vendor_id to vendor data

    Example:
        lookup = build_vendor_lookup([{"vendor_id": "VEND_001", "vendor_name": "..."}])
        vendor = lookup["VEND_001"]  # Fast access
    """
    lookup = {}
    for vendor in third_parties:
        vendor_id = vendor.get("vendor_id")
        if vendor_id:
            lookup[vendor_id] = vendor
    return lookup


def build_risk_domain_lookup(risk_domains: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """
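    Create fast lookup dictionary for risk domains by risk_domain name.

    Domains with a missing or empty risk_domain are skipped. If a name
    appears more than once, the last record wins.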

    Args:
        risk_domains: List of risk domain dictionaries

    Returns:
        Dictionary mapping risk_domain name to domain definition

    Example:
        lookup = build_risk_domain_lookup([{"risk_domain": "Information Security", ...}])
        domain = lookup["Information Security"]  # Fast access
    """
    lookup = {}
    for domain in risk_domains:
        domain_name = domain.get("risk_domain")
        if domain_name:
            lookup[domain_name] = domain
    return lookup


def filter_vendors_by_id(
    third_parties: List[Dict[str, Any]],
    vendor_id: Optional[str]
) -> List[Dict[str, Any]]:
    """
    Filter vendors by vendor_id if specified, otherwise return all.

    Args:
        third_parties: List of all vendors
        vendor_id: Optional vendor ID to filter by (None = return all)

    Returns:
        Filtered list of vendors (the input list itself, not a copy, when vendor_id is None)

    Raises:
        ValueError: If vendor_id specified but not found
    """
    if vendor_id is None:
        return third_parties

    filtered = [v for v in third_parties if v.get("vendor_id") == vendor_id]

    if not filtered:
        raise ValueError(f"Vendor {vendor_id} not found")

    return filtered
