<a href="https://colab.research.google.com/github/micah-shull/AI_Agents/blob/main/295_HITL_AuditLogging_utils.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Audit logging utilities for HITL Orchestrator

In [None]:
"""Audit logging utilities for HITL Orchestrator"""

from typing import Dict, Any, List, Optional
from datetime import datetime


def create_audit_log(
    task_id: str,
    risk_level: str,
    confidence_score: float,
    routing_decision: str,
    human_involved: bool,
    final_decision: str,
    decision_source: str,
    latency_seconds: float,
    timestamp: Optional[str] = None
) -> Dict[str, Any]:
    """
    Create an audit log entry.

    Args:
        task_id: Task identifier
        risk_level: Task risk level
        confidence_score: Agent confidence score
        routing_decision: Routing decision made
        human_involved: Whether human was involved
        final_decision: Final decision outcome
        decision_source: "agent" or "human"
        latency_seconds: Time taken in seconds
        timestamp: ISO timestamp (defaults to now)

    Returns:
        Audit log dictionary
    """
    if timestamp is None:
        timestamp = datetime.now().isoformat()

    log_id = f"log_{task_id.split('_')[1]}"  # e.g., "log_001" from "task_001"

    return {
        "log_id": log_id,
        "task_id": task_id,
        "risk_level": risk_level,
        "confidence_score": confidence_score,
        "routing_decision": routing_decision,
        "human_involved": human_involved,
        "final_decision": final_decision,
        "decision_source": decision_source,
        "latency_seconds": latency_seconds,
        "timestamp": timestamp
    }


def calculate_summary_metrics(
    routing_decisions: List[Dict[str, Any]],
    final_decisions: List[Dict[str, Any]],
    audit_logs: List[Dict[str, Any]]
) -> Dict[str, Any]:
    """
    Calculate summary metrics from routing decisions and audit logs.

    Args:
        routing_decisions: List of routing decisions
        final_decisions: List of final decisions
        audit_logs: List of audit logs

    Returns:
        Summary metrics dictionary
    """
    total_tasks = len(routing_decisions)

    # Count routing decisions
    auto_approved_count = sum(
        1 for d in routing_decisions
        if d.get("routing_decision") == "auto_approve"
    )
    human_reviewed_count = sum(
        1 for d in routing_decisions
        if d.get("routing_decision") == "human_review"
    )
    escalated_count = sum(
        1 for d in routing_decisions
        if d.get("routing_decision") == "escalate"
    )

    # Calculate average confidence
    confidence_scores = [d.get("confidence_score", 0.0) for d in routing_decisions]
    average_confidence = sum(confidence_scores) / len(confidence_scores) if confidence_scores else 0.0

    # Calculate average latency
    latencies = [log.get("latency_seconds", 0.0) for log in audit_logs]
    average_latency = sum(latencies) / len(latencies) if latencies else 0.0

    # Count human overrides: human decisions that changed or replaced the agent's outcome
    human_override_count = sum(
        1 for d in final_decisions
        if d.get("decision_source") == "human"
        and d.get("final_decision") in ("override_approved", "modified_and_approved")
    )

    return {
        "total_tasks": total_tasks,
        "auto_approved_count": auto_approved_count,
        "human_reviewed_count": human_reviewed_count,
        "escalated_count": escalated_count,
        "average_confidence_score": round(average_confidence, 2),
        "average_latency_seconds": round(average_latency, 2),
        "human_override_count": human_override_count
    }




# 🧠 Big Picture: Why Audit Logging Exists

This code answers two questions that **humans care deeply about**:

1. **“What happened?”**
2. **“How is the system performing overall?”**

AI systems rarely fail only because they’re inaccurate.
They fail because **no one can explain them afterward**.

This module makes your agent **accountable**.

---

# Part 1: `create_audit_log`

## 🧾 “Write it down so we can’t lie later”

```python
def create_audit_log(...)
```

### What this function does (in plain English)

Every time a task finishes, this function creates a **receipt**.

Think of it like:

* a transaction record
* a bank statement
* a flight black box entry

It captures:

* what the task was
* how risky it was
* who decided
* what the outcome was
* how long it took

---

## ⏱ Timestamp logic

```python
if timestamp is None:
    timestamp = datetime.now().isoformat()
```

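Conceptually:

> If no one tells us when this happened, record “right now”.

This ensures:

* every decision is time-stamped
* events can be reconstructed later

Reliable timestamps are what make an audit trail trustworthy. Note that `datetime.now()` returns naive local time with no UTC offset, so entries written on machines in different time zones are not directly comparable.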
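One possible variation, not part of the module above, is to record the default timestamp in UTC with an explicit offset so that entries from different machines sort and compare consistently. The helper name `utc_timestamp` is hypothetical and only illustrates the idea.

```python
from datetime import datetime, timezone
from typing import Optional


def utc_timestamp(timestamp: Optional[str] = None) -> str:
    """Return the given timestamp, or the current UTC time in ISO 8601 form.

    Hypothetical helper: a sketch of a timezone-aware default, not the
    behavior of create_audit_log as written.
    """
    if timestamp is None:
        # Passing timezone.utc makes the datetime timezone-aware, so the
        # ISO string carries an explicit "+00:00" offset.
        timestamp = datetime.now(timezone.utc).isoformat()
    return timestamp


stamp = utc_timestamp()
parsed = datetime.fromisoformat(stamp)
print(stamp, parsed.tzinfo)
```

A caller-supplied timestamp is returned unchanged, which mirrors how the original function treats its `timestamp` argument.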

---

## 🆔 Log ID generation

```python
log_id = f"log_{task_id.split('_')[1]}"
```

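Plain English:

* Take `task_001`
* Turn it into `log_001`

This keeps logs:

* readable
* easy to trace
* consistent with your datasets

It’s about **human legibility**, not cleverness.

The expression assumes every task ID contains an underscore: an ID such as `task001` raises an `IndexError`, and anything after a second underscore is dropped.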
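If task IDs might not always follow the `task_001` pattern, a more defensive derivation could look like the sketch below. The function name `derive_log_id` and the fallback rule are assumptions made for illustration, and unlike the original expression it keeps everything after the first underscore.

```python
def derive_log_id(task_id: str) -> str:
    """Map "task_001" to "log_001", falling back to the full ID when needed.

    Hypothetical helper, not part of the original module.
    """
    # partition splits on the first "_" only and never raises.
    _prefix, separator, suffix = task_id.partition("_")
    if not separator or not suffix:
        # No usable suffix (e.g. "task42" or "task_"): keep the whole ID
        # so the log entry can still be traced back to its task.
        return f"log_{task_id}"
    return f"log_{suffix}"


print(derive_log_id("task_001"))    # log_001
print(derive_log_id("task42"))      # log_task42
print(derive_log_id("task_001_a"))  # log_001_a
```

Keeping the full suffix avoids two different tasks collapsing into the same log ID, at the cost of slightly longer IDs.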

---

## 📦 What gets returned

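The returned dictionary is meant to be a **complete, frozen record**.

Once written:

* it should never change
* it represents the official truth

This is the *single source of truth* for audits.

A plain Python `dict` is still mutable, so this immutability is a convention the calling code has to respect, not something the function enforces.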
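One way to make the "never change" rule harder to break by accident is to hand out a read-only view of each record. The sketch below uses `types.MappingProxyType` from the standard library; the wrapper name `freeze_record` is hypothetical, and the sample entry is illustrative data only.

```python
from types import MappingProxyType
from typing import Any, Dict, Mapping


def freeze_record(record: Dict[str, Any]) -> Mapping[str, Any]:
    """Return a read-only view over a private copy of the record.

    Hypothetical helper, not part of the original module.
    """
    # Copy first so later changes to the caller's dict cannot leak in.
    return MappingProxyType(dict(record))


entry = freeze_record({"log_id": "log_001", "final_decision": "approved"})

try:
    entry["final_decision"] = "rejected"
except TypeError as exc:
    # mappingproxy objects reject item assignment.
    print(f"Blocked: {exc}")
```

The protection is shallow: it covers the top-level keys, which is enough here because the audit fields are strings, numbers, and booleans.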

---

# Part 2: `calculate_summary_metrics`

## 📊 “Zoom out and see the system”

```python
def calculate_summary_metrics(...)
```

This function stops thinking about individual tasks and asks:

> **“How is the system behaving overall?”**

Executives, managers, and regulators rarely read raw logs;
they read **summaries**.

---

## 🔢 Total tasks

```python
total_tasks = len(routing_decisions)
```

This answers:

> “How much work did the system process?”

Simple, but essential.

---

## 🧮 Counting routing outcomes

```python
auto_approved_count
human_reviewed_count
escalated_count
```

These numbers answer **strategic questions**:

* Are we automating too aggressively?
* Are humans overloaded?
* Are we flagging too many things as high risk?

This is how **automation maturity** is measured.
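Raw counts are easier to compare across batches when expressed as shares of the total. The sketch below is one way to do that with `collections.Counter`; the function name `routing_breakdown` is hypothetical and the sample decisions are made-up illustrative data.

```python
from collections import Counter
from typing import Any, Dict, List


def routing_breakdown(routing_decisions: List[Dict[str, Any]]) -> Dict[str, float]:
    """Return the share of tasks per routing outcome (0.0 to 1.0).

    Hypothetical helper, not part of the original module.
    """
    # One pass over the data instead of one pass per outcome.
    counts = Counter(d.get("routing_decision") for d in routing_decisions)
    total = len(routing_decisions)
    outcomes = ("auto_approve", "human_review", "escalate")
    # Guard against an empty batch to avoid dividing by zero.
    return {o: (counts[o] / total if total else 0.0) for o in outcomes}


sample = [
    {"routing_decision": "auto_approve"},
    {"routing_decision": "auto_approve"},
    {"routing_decision": "human_review"},
    {"routing_decision": "escalate"},
]
print(routing_breakdown(sample))
# {'auto_approve': 0.5, 'human_review': 0.25, 'escalate': 0.25}
```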

---

## 📈 Average confidence score

```python
average_confidence
```

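This tells you:

* how confident the AI *thinks* it is
* whether confidence is trending up or down, once you compare the average across runs

A decision with no `confidence_score` counts as 0.0, which pulls the average down.

Later, you’ll compare this to **human overrides**.

That’s where learning happens.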
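The comparison between confidence and overrides can be sketched directly from audit log entries, since each entry carries `confidence_score`, `decision_source`, and `final_decision`. The function names below are hypothetical, the sample entries are made-up data, and `"approved"` is an assumed value for a non-override outcome.

```python
from statistics import mean
from typing import Any, Dict, List, Optional

# The two outcomes that calculate_summary_metrics counts as overrides.
OVERRIDE_OUTCOMES = {"override_approved", "modified_and_approved"}


def is_override(log: Dict[str, Any]) -> bool:
    """True when a human made the decision and it is an override outcome."""
    return (
        log.get("decision_source") == "human"
        and log.get("final_decision") in OVERRIDE_OUTCOMES
    )


def confidence_by_override(
    audit_logs: List[Dict[str, Any]]
) -> Dict[str, Optional[float]]:
    """Average agent confidence, split by whether a human overrode the result."""
    overridden = [log["confidence_score"] for log in audit_logs if is_override(log)]
    kept = [log["confidence_score"] for log in audit_logs if not is_override(log)]
    return {
        # None signals "no data" rather than a misleading 0.0 average.
        "overridden": round(mean(overridden), 2) if overridden else None,
        "not_overridden": round(mean(kept), 2) if kept else None,
    }


# Illustrative entries only; real ones would come from create_audit_log.
sample_logs = [
    {"confidence_score": 0.9, "decision_source": "human", "final_decision": "override_approved"},
    {"confidence_score": 0.95, "decision_source": "agent", "final_decision": "approved"},
    {"confidence_score": 0.65, "decision_source": "human", "final_decision": "approved"},
]
print(confidence_by_override(sample_logs))
```

If the average confidence on overridden tasks is as high as on the rest, that is a hint the confidence score is not tracking actual correctness.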

---

## ⏳ Average latency

```python
average_latency
```

This measures:

* system speed
* human bottlenecks
* operational friction

High latency ≠ bad AI.
High latency usually points to a process problem, such as a slow review queue.
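A single average mixes fast agent-only tasks with slow human-reviewed ones, so it can hide where the time goes. The sketch below splits latency by the `human_involved` flag that each audit log entry already records; the function name `latency_by_path` is hypothetical and the sample numbers are made up.

```python
from statistics import mean, median
from typing import Any, Dict, List


def latency_by_path(audit_logs: List[Dict[str, Any]]) -> Dict[str, Dict[str, float]]:
    """Summarize latency separately for agent-only and human-involved tasks.

    Hypothetical helper, not part of the original module.
    """
    groups: Dict[str, List[float]] = {"agent_only": [], "human_involved": []}
    for log in audit_logs:
        key = "human_involved" if log.get("human_involved") else "agent_only"
        groups[key].append(log.get("latency_seconds", 0.0))

    return {
        path: {
            "count": len(values),
            # Empty groups report 0.0, matching the module's own fallback.
            "mean": round(mean(values), 2) if values else 0.0,
            "median": round(median(values), 2) if values else 0.0,
        }
        for path, values in groups.items()
    }


# Illustrative entries only.
sample_logs = [
    {"human_involved": False, "latency_seconds": 1.0},
    {"human_involved": False, "latency_seconds": 2.0},
    {"human_involved": True, "latency_seconds": 120.0},
    {"human_involved": True, "latency_seconds": 300.0},
    {"human_involved": True, "latency_seconds": 180.0},
]
print(latency_by_path(sample_logs))
```

The median is included because a few very slow reviews can dominate the mean.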

---

## 🚨 Human override count

```python
human_override_count
```

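This is arguably the **most important metric** in the entire system.

It answers:

> “How often does a human change the agent’s decision?”

The count covers decisions whose source is `human` and whose outcome is `override_approved` or `modified_and_approved`.

High overrides can mean:

* confidence is miscalibrated
* rules are too permissive
* risk thresholds are wrong

This is your **early warning system**.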
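A raw count grows with volume, so an override rate is often easier to monitor. The sketch below divides overrides by the number of human-made decisions; the function name `human_override_rate` is hypothetical, the sample data is made up, and `"approved"` is an assumed value for a non-override outcome.

```python
from typing import Any, Dict, List

# The two outcomes that calculate_summary_metrics counts as overrides.
OVERRIDE_OUTCOMES = {"override_approved", "modified_and_approved"}


def human_override_rate(final_decisions: List[Dict[str, Any]]) -> float:
    """Share of human-made decisions that were overrides (0.0 to 1.0).

    Hypothetical helper, not part of the original module.
    """
    human_decisions = [
        d for d in final_decisions if d.get("decision_source") == "human"
    ]
    if not human_decisions:
        # No human decisions in this batch, so there is nothing to override.
        return 0.0
    overrides = sum(
        1 for d in human_decisions
        if d.get("final_decision") in OVERRIDE_OUTCOMES
    )
    return overrides / len(human_decisions)


# Illustrative entries only.
sample_decisions = [
    {"decision_source": "human", "final_decision": "override_approved"},
    {"decision_source": "human", "final_decision": "approved"},
    {"decision_source": "human", "final_decision": "approved"},
    {"decision_source": "human", "final_decision": "approved"},
    {"decision_source": "agent", "final_decision": "approved"},
]
print(human_override_rate(sample_decisions))  # 0.25
```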

---

# 🎯 Big Takeaway (Most Important)

This module exists to enforce a core rule:

> **If an AI decision cannot be audited, it should not be trusted.**

You are designing for:

* transparency
* accountability
* long-term adoption

Not just correctness.

---

## 🔁 How this fits into the whole agent

1. Routing decides *who should act*
2. Humans (sometimes) act
3. Final decisions are made
4. **Audit logs record the truth**
5. Metrics summarize behavior

This is how AI becomes *organizationally acceptable*.

