<a href="https://colab.research.google.com/github/micah-shull/AI_Agents/blob/main/252_Product_CustomerFitDiscoveryOrchestrator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data loading utilities for Product-Customer Fit Discovery Orchestrator

In [None]:
"""Data loading utilities for Product-Customer Fit Discovery Orchestrator"""

from pathlib import Path
from typing import List, Dict, Any
import pandas as pd


def load_customers_csv(file_path: str = "data/customers.csv") -> List[Dict[str, Any]]:
    """
    Load customers from CSV file.

    Args:
        file_path: Path to customers.csv file

    Returns:
        List of customer dictionaries

    Raises:
        FileNotFoundError: If file doesn't exist
        ValueError: If file is empty or invalid
    """
    path = Path(file_path)

    if not path.exists():
        raise FileNotFoundError(f"Customers file not found: {file_path}")

    df = pd.read_csv(path)

    if df.empty:
        raise ValueError(f"Customers file is empty: {file_path}")

    # Convert to list of dictionaries
    customers = df.to_dict('records')

    return customers


def load_transactions_csv(file_path: str = "data/transactions.csv") -> List[Dict[str, Any]]:
    """
    Load transactions from CSV file.

    Args:
        file_path: Path to transactions.csv file

    Returns:
        List of transaction dictionaries

    Raises:
        FileNotFoundError: If file doesn't exist
        ValueError: If file is empty or invalid
    """
    path = Path(file_path)

    if not path.exists():
        raise FileNotFoundError(f"Transactions file not found: {file_path}")

    df = pd.read_csv(path)

    if df.empty:
        raise ValueError(f"Transactions file is empty: {file_path}")

    # Convert Transaction_Date to datetime for easier processing
    df['Transaction_Date'] = pd.to_datetime(df['Transaction_Date'])

    # Convert to list of dictionaries
    transactions = df.to_dict('records')

    return transactions


def load_product_catalog_csv(file_path: str = "data/product_catalog.csv") -> List[Dict[str, Any]]:
    """
    Load product catalog from CSV file.

    Args:
        file_path: Path to product_catalog.csv file

    Returns:
        List of product dictionaries

    Raises:
        FileNotFoundError: If file doesn't exist
        ValueError: If file is empty or invalid
    """
    path = Path(file_path)

    if not path.exists():
        raise FileNotFoundError(f"Product catalog file not found: {file_path}")

    df = pd.read_csv(path)

    if df.empty:
        raise ValueError(f"Product catalog file is empty: {file_path}")

    # Convert to list of dictionaries
    products = df.to_dict('records')

    return products





## 🧠 Core Agent Architecture: Data Integrity & Standardization

The focus here is on **robustness, efficiency, and standardization**, ensuring that the data passed to the specialist agents (like the `clustering_agent` or `graph_motif_agent`) is always in a predictable and usable format.

### 🎯 What to Focus On

1.  **Robust Error Handling (The "Guardrails"):**
    * The functions use explicit **guard clauses** (`if not path.exists()` and `if df.empty`) that raise exceptions; they contain no `try`/`except` blocks themselves.
    * **Focus:** This is critical for making your orchestrator **reliable**. If a file is missing or empty, the utility raises a specific, controlled error (`FileNotFoundError` or `ValueError`) instead of failing later with an obscure one. The **Orchestrator** can catch that error and report it back to the user via the `errors` state object (as seen in the `goal_node` and `planning_node`).

2.  **Integration of Data Science Standards (`pandas`):**
    * The use of the `pandas` library (`import pandas as pd`) is a deliberate choice, not just a convenience: it provides fast, well-tested CSV parsing and type conversion.
    * **Focus:** The goal of the data ingestion node is to leverage specialized, non-LLM tools (like `pandas`) for tasks they do best: fast, high-volume data manipulation. This is what makes your system a **Hybrid Agent** architecture.

3.  **Standardized Output Format:**
    * All functions convert the raw CSV/DataFrame into a consistent Python native type: `List[Dict[str, Any]]` (a list of dictionaries).
    * **Focus:** This creates a **Data Contract** for the rest of the workflow. By standardizing on this format, the downstream pre-processing and specialist agents (e.g., the `clustering_agent`) don't have to worry about reading CSVs; they just know they will receive a list of simple Python objects.

4.  **Targeted Pre-processing:**
    * The `load_transactions_csv` function includes a specific line: `df['Transaction_Date'] = pd.to_datetime(df['Transaction_Date'])`.
    * **Focus:** This demonstrates a clean separation of concerns. It handles data type conversions immediately upon loading, which is necessary for proper subsequent analysis (like identifying sequential patterns).
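The data contract and the date conversion described above can be seen on a small in-memory sample. This is a minimal sketch, not the project's code: `records_from_csv_text` is a hypothetical helper, and the two sample rows are invented for illustration (only the column names come from the tests in this notebook).

```python
import io
from typing import Any, Dict, List

import pandas as pd

# Invented sample rows; only the column names are taken from the notebook's tests.
SAMPLE_CSV = """Transaction_ID,Customer_ID,Product_ID,Transaction_Date,Usage_Metric
T0001,C001,P01,2024-01-15,12.5
T0002,C002,P03,2024-02-01,7.0
"""


def records_from_csv_text(csv_text: str) -> List[Dict[str, Any]]:
    """Hypothetical helper: parse CSV text into the list-of-dicts contract."""
    df = pd.read_csv(io.StringIO(csv_text))

    # Guard clause: a header-only CSV yields an empty DataFrame.
    if df.empty:
        raise ValueError("CSV has a header but no rows")

    # Convert the date column once, at load time.
    df["Transaction_Date"] = pd.to_datetime(df["Transaction_Date"])

    # One dictionary per row, keyed by column name.
    return df.to_dict("records")


records = records_from_csv_text(SAMPLE_CSV)
print(type(records).__name__, type(records[0]).__name__)  # list dict
print(type(records[0]["Transaction_Date"]).__name__)      # Timestamp
```

Downstream code can then treat each record as a plain dictionary, while the date field is already a pandas `Timestamp` that supports comparisons and sorting.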

***

## ✨ Differentiation: The Hybrid Power of the Orchestrator

This seemingly simple loading node is actually one of the clearest examples of what makes your orchestrator more powerful than a simple LLM agent:

* **Simple Agent Limitation:** A simple LLM agent would often struggle or fail to reliably load, parse, and validate large CSV files. It might hallucinate file paths or fail to correctly handle data types.
* **Orchestrator/Hybrid Power:** Your architecture **delegates** the data handling task to a **specialized, reliable, non-LLM utility** (Python/Pandas). This is the hallmark of a powerful system: it recognizes that not every task needs an LLM; many are better handled by traditional, high-performance code.

The `data_ingestion` utility acts as the reliable **Input Gate** for the entire analytic **Data Pipeline**, ensuring the subsequent, more complex steps have clean, validated data.




# Tests for data loading utilities

In [None]:
"""Tests for data loading utilities"""

import pytest
from pathlib import Path
from tools.data_loading import (
    load_customers_csv,
    load_transactions_csv,
    load_product_catalog_csv
)


def test_load_customers_csv():
    """Test loading customers CSV"""
    customers = load_customers_csv("data/customers.csv")

    assert len(customers) > 0
    assert isinstance(customers, list)
    assert isinstance(customers[0], dict)
    assert "Customer_ID" in customers[0]
    assert "Age_Group" in customers[0]
    assert "Location_Tier" in customers[0]
    assert "Acquisition_Channel" in customers[0]


def test_load_customers_csv_has_expected_count():
    """Test customers CSV has expected number of records"""
    customers = load_customers_csv("data/customers.csv")

    # Should have 200 customers (C001-C200)
    assert len(customers) == 200
    assert customers[0]["Customer_ID"] == "C001"
    assert customers[-1]["Customer_ID"] == "C200"


def test_load_transactions_csv():
    """Test loading transactions CSV"""
    transactions = load_transactions_csv("data/transactions.csv")

    assert len(transactions) > 0
    assert isinstance(transactions, list)
    assert isinstance(transactions[0], dict)
    assert "Transaction_ID" in transactions[0]
    assert "Customer_ID" in transactions[0]
    assert "Product_ID" in transactions[0]
    assert "Transaction_Date" in transactions[0]
    assert "Usage_Metric" in transactions[0]


def test_load_transactions_csv_date_parsing():
    """Test transaction dates are parsed correctly"""
    import pandas as pd

    transactions = load_transactions_csv("data/transactions.csv")

    # Transaction_Date should be parsed into a pandas Timestamp, not left as a string
    date_value = transactions[0]["Transaction_Date"]

    assert isinstance(date_value, pd.Timestamp)


def test_load_product_catalog_csv():
    """Test loading product catalog CSV"""
    products = load_product_catalog_csv("data/product_catalog.csv")

    assert len(products) > 0
    assert isinstance(products, list)
    assert isinstance(products[0], dict)
    assert "Product_ID" in products[0]
    assert "Product_Type" in products[0]
    assert "Feature_Set" in products[0]
    assert "Monetization_Model" in products[0]


def test_load_product_catalog_csv_has_all_products():
    """Test product catalog has all 20 products"""
    products = load_product_catalog_csv("data/product_catalog.csv")

    # Should have 20 products (P01-P20)
    assert len(products) == 20
    assert products[0]["Product_ID"] == "P01"
    assert products[-1]["Product_ID"] == "P20"


def test_load_customers_csv_file_not_found():
    """Test error handling for missing file"""
    with pytest.raises(FileNotFoundError):
        load_customers_csv("data/nonexistent.csv")


def test_load_transactions_csv_file_not_found():
    """Test error handling for missing file"""
    with pytest.raises(FileNotFoundError):
        load_transactions_csv("data/nonexistent.csv")


def test_load_product_catalog_csv_file_not_found():
    """Test error handling for missing file"""
    with pytest.raises(FileNotFoundError):
        load_product_catalog_csv("data/nonexistent.csv")


def test_load_customers_csv_default_path():
    """Test default path works"""
    customers = load_customers_csv()

    assert len(customers) > 0
    assert customers[0]["Customer_ID"] == "C001"


def test_load_transactions_csv_default_path():
    """Test default path works"""
    transactions = load_transactions_csv()

    assert len(transactions) > 0


def test_load_product_catalog_csv_default_path():
    """Test default path works"""
    products = load_product_catalog_csv()

    assert len(products) > 0
    assert products[0]["Product_ID"] == "P01"



# Test Results

In [None]:
(.venv) micahshull@Micahs-iMac LG_Cursor_035_Product-CustomerFitDiscoveryOrchestrator % python3 -m pytest tests/test_data_loading.py -v
============================================================ test session starts ============================================================
platform darwin -- Python 3.13.7, pytest-9.0.1, pluggy-1.6.0 -- /Users/micahshull/Documents/AI_LangGraph/LG_Cursor_035_Product-CustomerFitDiscoveryOrchestrator/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /Users/micahshull/Documents/AI_LangGraph/LG_Cursor_035_Product-CustomerFitDiscoveryOrchestrator
plugins: langsmith-0.4.53, anyio-4.12.0, asyncio-1.3.0, cov-7.0.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 12 items

tests/test_data_loading.py::test_load_customers_csv PASSED                                                                            [  8%]
tests/test_data_loading.py::test_load_customers_csv_has_expected_count PASSED                                                         [ 16%]
tests/test_data_loading.py::test_load_transactions_csv PASSED                                                                         [ 25%]
tests/test_data_loading.py::test_load_transactions_csv_date_parsing PASSED                                                            [ 33%]
tests/test_data_loading.py::test_load_product_catalog_csv PASSED                                                                      [ 41%]
tests/test_data_loading.py::test_load_product_catalog_csv_has_all_products PASSED                                                     [ 50%]
tests/test_data_loading.py::test_load_customers_csv_file_not_found PASSED                                                             [ 58%]
tests/test_data_loading.py::test_load_transactions_csv_file_not_found PASSED                                                          [ 66%]
tests/test_data_loading.py::test_load_product_catalog_csv_file_not_found PASSED                                                       [ 75%]
tests/test_data_loading.py::test_load_customers_csv_default_path PASSED                                                               [ 83%]
tests/test_data_loading.py::test_load_transactions_csv_default_path PASSED                                                            [ 91%]
tests/test_data_loading.py::test_load_product_catalog_csv_default_path PASSED                                                         [100%]

============================================================ 12 passed in 0.41s =============================================================



All 12 utility tests passed. The next step adds the data ingestion node and its tests.


## Phase 2: Data Loading (complete)

### What we built

1. Data loading utilities (`tools/data_loading.py`)
   - `load_customers_csv()`: loads customers.csv
   - `load_transactions_csv()`: loads transactions.csv (with date parsing)
   - `load_product_catalog_csv()`: loads product_catalog.csv
   - All 12 utility tests passing

2. Data ingestion node (`data_ingestion_node`)
   - Orchestrates all three loading utilities
   - Uses config defaults or state file paths
   - Error handling for missing files
   - Populates `raw_customers`, `raw_transactions`, `raw_products`

3. Node tests (`tests/test_nodes_phase2.py`)
   - 9 tests covering structure, error handling, and integration

### Architecture pattern

- Utilities first, then node
- Utilities are independently testable
- Node is thin (orchestrates utilities)
- Error accumulation pattern maintained



# Data Ingestion Node

In [None]:
def data_ingestion_node(state: ProductCustomerFitState) -> Dict[str, Any]:
    """
    Data Ingestion Node: Orchestrate loading customer, transaction, and product data.

    Loads raw data from CSV files using data loading utilities.

    Args:
        state: Current orchestrator state

    Returns:
        Updated state with raw_customers, raw_transactions, raw_products
    """
    errors = state.get("errors", [])

    # Get file paths from state or use defaults from config
    config = ProductCustomerFitConfig()
    customers_file = state.get("customers_file") or config.customers_file
    transactions_file = state.get("transactions_file") or config.transactions_file
    products_file = state.get("products_file") or config.products_file

    try:
        # Load all data files
        raw_customers = load_customers_csv(customers_file)
        raw_transactions = load_transactions_csv(transactions_file)
        raw_products = load_product_catalog_csv(products_file)

        return {
            "raw_customers": raw_customers,
            "raw_transactions": raw_transactions,
            "raw_products": raw_products,
            "errors": errors
        }
    except FileNotFoundError as e:
        return {
            "errors": errors + [f"data_ingestion_node: File not found - {str(e)}"]
        }
    except ValueError as e:
        return {
            "errors": errors + [f"data_ingestion_node: Invalid data - {str(e)}"]
        }
    except Exception as e:
        return {
            "errors": errors + [f"data_ingestion_node: Unexpected error - {str(e)}"]
        }


This is the **`data_ingestion_node`**, the executive function for the first step of your DAG. It brings together the structure of the `planning_node` (the step itself) and the robustness of the utility functions (the execution logic).

This node is a perfect example of what a **proper orchestration layer** should do.

***

## 🧠 Core Agent Architecture: Configuration and Fault Tolerance

The primary job of this node is **flow control and error management**. It showcases best practices for building an agent system that is flexible and designed not to crash, but to fail gracefully.

### 🎯 What to Focus On

1.  **Configuration Management and Flexibility:**
    * **Focus:** Each path is resolved as `state.get("customers_file") or config.customers_file` (and likewise for `transactions_file` and `products_file`), which allows **runtime overrides**. You can run the same agent workflow on different datasets by passing new file paths in the initial `state`, without changing the default configuration. This is crucial for building a **reusable and testable** agent.

2.  **Orchestration of Specialized Workers (Decoupling):**
    * **Focus:** The body of the `try` block consists of three simple function calls. The `data_ingestion_node` does not contain any data reading logic itself. Its role is strictly to **orchestrate** the execution of the specialized utility functions (the "workers").
    * **The Power:** This **Decoupling** makes the system incredibly clean and maintainable. If you need to change how data is read (e.g., switch from CSV to a database query), you only modify the utility function, not the main workflow node.

3.  **Structured Error Reporting (Fault Tolerance):**
    * **Focus:** The `try...except` block is the core of the node's fault tolerance. It handles the two errors the utilities are documented to raise (`FileNotFoundError`, `ValueError`) and adds a generic `Exception` handler as a catch-all for anything unexpected.
    * **The Power:** Instead of terminating the program, the node **captures the error details** and appends them to the **`errors` list in the state**. This means the workflow doesn't just fail; it records *why* it failed and maintains a clean state. Later nodes (like a **Reporting Agent**) could be designed to inspect the `errors` list and generate a failure report automatically, a hallmark of **sophisticated fault tolerance**.
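The runtime-override rule from point 1 can be isolated in a few lines. The sketch below is illustrative only: `DemoConfig` and `resolve_paths` are hypothetical names, and the default paths are assumed to match the defaults of the loading utilities rather than taken from the real `ProductCustomerFitConfig`.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class DemoConfig:
    """Hypothetical stand-in for ProductCustomerFitConfig (defaults are assumed)."""
    customers_file: str = "data/customers.csv"
    transactions_file: str = "data/transactions.csv"
    products_file: str = "data/product_catalog.csv"


def resolve_paths(state: Dict[str, Any], config: DemoConfig) -> Dict[str, str]:
    """Prefer a path from state; fall back to config if the key is missing or falsy."""
    keys = ("customers_file", "transactions_file", "products_file")
    return {key: state.get(key) or getattr(config, key) for key in keys}


# Override one path at runtime; the other two come from the config defaults.
paths = resolve_paths({"customers_file": "data/customers_eu.csv"}, DemoConfig())
print(paths["customers_file"])     # data/customers_eu.csv
print(paths["transactions_file"])  # data/transactions.csv
```

Because the fallback uses `or`, an empty string or `None` in the state is treated the same as a missing key and resolves to the config default.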

***

## ✨ Differentiation: Graceful Failure and Debugging

A simple agent fails silently or crashes when an input file is missing. Your orchestrator does this:

1.  **Delegates Failure:** It delegates the I/O work to a robust utility.
2.  **Catches Failure:** It wraps the call in a `try/except`.
3.  **Records Failure:** It appends the specific, descriptive error message to the shared `state`.
4.  **Terminates Gracefully:** It returns the updated state, allowing the orchestration engine to stop the workflow in a controlled manner and use the error message for instant debugging or reporting.
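The four steps above can be reproduced without any files on disk by injecting a loader that always fails. This is a simplified sketch under stated assumptions: `ingest` and `missing_loader` are hypothetical names, and only one of the three data sources is modelled.

```python
from typing import Any, Callable, Dict, List

Records = List[Dict[str, Any]]


def ingest(state: Dict[str, Any], loader: Callable[[str], Records]) -> Dict[str, Any]:
    """Hypothetical single-source node: delegate, catch, record, return."""
    errors = list(state.get("errors", []))  # copy so the caller's list is untouched
    try:
        # Delegate the I/O work to the injected utility.
        return {"raw_customers": loader(state["customers_file"]), "errors": errors}
    except FileNotFoundError as exc:
        # Record the failure in the state update instead of raising.
        return {"errors": errors + [f"ingest: File not found - {exc}"]}
    except ValueError as exc:
        return {"errors": errors + [f"ingest: Invalid data - {exc}"]}


def missing_loader(path: str) -> Records:
    """Hypothetical loader that always fails, to exercise the error path."""
    raise FileNotFoundError(f"Customers file not found: {path}")


result = ingest(
    {"customers_file": "data/nonexistent.csv", "errors": ["earlier_error"]},
    missing_loader,
)
print(result["errors"])
```

The returned update carries the earlier error plus the new one and contains no `raw_customers` key, which is the same shape the missing-file test for the real node asserts on.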

This pattern demonstrates the necessary rigor to move from a working script to a **production-ready, reliable, self-monitoring agent system.**

# Tests for Phase 2: Data Ingestion Node

In [None]:
"""Tests for Phase 2: Data Ingestion Node"""

import pytest
from agents.product_customer_fit.nodes import data_ingestion_node
from config import ProductCustomerFitState


def test_data_ingestion_node_loads_all_data():
    """Test data ingestion node loads all three data sources"""
    state: ProductCustomerFitState = {
        "errors": []
    }

    result = data_ingestion_node(state)

    assert "raw_customers" in result
    assert "raw_transactions" in result
    assert "raw_products" in result
    assert len(result.get("errors", [])) == 0


def test_data_ingestion_node_customers_structure():
    """Test raw_customers has correct structure"""
    state: ProductCustomerFitState = {
        "errors": []
    }

    result = data_ingestion_node(state)
    customers = result["raw_customers"]

    assert len(customers) > 0
    assert isinstance(customers, list)
    assert "Customer_ID" in customers[0]
    assert "Age_Group" in customers[0]
    assert customers[0]["Customer_ID"] == "C001"


def test_data_ingestion_node_transactions_structure():
    """Test raw_transactions has correct structure"""
    state: ProductCustomerFitState = {
        "errors": []
    }

    result = data_ingestion_node(state)
    transactions = result["raw_transactions"]

    assert len(transactions) > 0
    assert isinstance(transactions, list)
    assert "Transaction_ID" in transactions[0]
    assert "Customer_ID" in transactions[0]
    assert "Product_ID" in transactions[0]
    assert "Transaction_Date" in transactions[0]
    assert "Usage_Metric" in transactions[0]


def test_data_ingestion_node_products_structure():
    """Test raw_products has correct structure"""
    state: ProductCustomerFitState = {
        "errors": []
    }

    result = data_ingestion_node(state)
    products = result["raw_products"]

    assert len(products) > 0
    assert isinstance(products, list)
    assert "Product_ID" in products[0]
    assert "Product_Type" in products[0]
    assert "Feature_Set" in products[0]
    assert "Monetization_Model" in products[0]
    assert products[0]["Product_ID"] == "P01"


def test_data_ingestion_node_uses_custom_paths():
    """Test data ingestion node uses custom file paths from state"""
    state: ProductCustomerFitState = {
        "customers_file": "data/customers.csv",
        "transactions_file": "data/transactions.csv",
        "products_file": "data/product_catalog.csv",
        "errors": []
    }

    result = data_ingestion_node(state)

    assert "raw_customers" in result
    assert "raw_transactions" in result
    assert "raw_products" in result
    assert len(result.get("errors", [])) == 0


def test_data_ingestion_node_handles_missing_file():
    """Test data ingestion node handles missing file gracefully"""
    state: ProductCustomerFitState = {
        "customers_file": "data/nonexistent.csv",
        "errors": []
    }

    result = data_ingestion_node(state)

    assert "raw_customers" not in result
    assert "errors" in result
    assert len(result["errors"]) > 0
    assert "File not found" in result["errors"][0]


def test_data_ingestion_node_preserves_errors():
    """Test data ingestion node preserves existing errors"""
    state: ProductCustomerFitState = {
        "errors": ["existing_error"]
    }

    result = data_ingestion_node(state)

    # Should have existing error plus any new ones (or just existing if successful)
    assert "errors" in result
    assert "existing_error" in result["errors"]


def test_data_ingestion_node_with_goal_and_planning():
    """Test data ingestion node works after goal and planning nodes"""
    from agents.product_customer_fit.nodes import goal_node, planning_node

    state: ProductCustomerFitState = {
        "errors": []
    }

    # Run goal and planning first, merging each node's update into the state
    state = {**state, **goal_node(state)}
    state = {**state, **planning_node(state)}

    # Then data ingestion, which returns only the keys it sets
    state = {**state, **data_ingestion_node(state)}

    assert "raw_customers" in state
    assert "raw_transactions" in state
    assert "raw_products" in state
    assert "goal" in state
    assert "plan" in state
    assert len(state.get("errors", [])) == 0


def test_data_ingestion_node_data_counts():
    """Test data ingestion loads expected number of records"""
    state: ProductCustomerFitState = {
        "errors": []
    }

    result = data_ingestion_node(state)

    # Verify expected counts
    assert len(result["raw_customers"]) == 200
    assert len(result["raw_products"]) == 20
    # Transactions should be at least 1800 (we added 9 for P20)
    assert len(result["raw_transactions"]) >= 1800



# Test Results

In [None]:
(.venv) micahshull@Micahs-iMac LG_Cursor_035_Product-CustomerFitDiscoveryOrchestrator % python3 -m pytest tests/test_nodes_phase2.py -v
============================================================ test session starts ============================================================
platform darwin -- Python 3.13.7, pytest-9.0.1, pluggy-1.6.0 -- /Users/micahshull/Documents/AI_LangGraph/LG_Cursor_035_Product-CustomerFitDiscoveryOrchestrator/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /Users/micahshull/Documents/AI_LangGraph/LG_Cursor_035_Product-CustomerFitDiscoveryOrchestrator
plugins: langsmith-0.4.53, anyio-4.12.0, asyncio-1.3.0, cov-7.0.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 9 items

tests/test_nodes_phase2.py::test_data_ingestion_node_loads_all_data PASSED                                                            [ 11%]
tests/test_nodes_phase2.py::test_data_ingestion_node_customers_structure PASSED                                                       [ 22%]
tests/test_nodes_phase2.py::test_data_ingestion_node_transactions_structure PASSED                                                    [ 33%]
tests/test_nodes_phase2.py::test_data_ingestion_node_products_structure PASSED                                                        [ 44%]
tests/test_nodes_phase2.py::test_data_ingestion_node_uses_custom_paths PASSED                                                         [ 55%]
tests/test_nodes_phase2.py::test_data_ingestion_node_handles_missing_file PASSED                                                      [ 66%]
tests/test_nodes_phase2.py::test_data_ingestion_node_preserves_errors PASSED                                                          [ 77%]
tests/test_nodes_phase2.py::test_data_ingestion_node_with_goal_and_planning PASSED                                                    [ 88%]
tests/test_nodes_phase2.py::test_data_ingestion_node_data_counts PASSED                                                               [100%]

============================================================= 9 passed in 0.35s =============================================================



## Phase 2 complete: all tests passing

### Summary

**Phase 1: Foundation**
- Goal Node ✓
- Planning Node ✓
- 7/7 tests passing

**Phase 2: Data Loading**
- Data Loading Utilities (3 functions) ✓
- Data Ingestion Node ✓
- 12 utility tests + 9 node tests = 21/21 passing

**Total Progress:** 28/28 tests passing across both phases

---

### What we've built

1. **Architecture pattern established:**
   - Utilities first, then nodes
   - Thin nodes (orchestration only)
   - State enrichment pattern
   - Error accumulation

2. **Working workflow:**
   - Goal → Planning → Data Ingestion
   - All nodes tested independently
   - Integration tests passing

3. **Data loaded:**
   - 200 customers
   - 1,824 transactions
   - 20 products
   - Ready for preprocessing
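The state enrichment pattern listed above can be sketched as a small runner that merges each node's partial update into a shared dictionary. This is a plain-Python illustration, not the project's orchestration engine: `run_pipeline` and the three `demo_*` nodes are hypothetical, and halting at the first reported error is an assumption about how the workflow might be driven.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Node = Callable[[State], State]


def run_pipeline(nodes: List[Node], state: State) -> State:
    """Run nodes in order, merging each partial update into the state."""
    for node in nodes:
        state = {**state, **node(state)}
        if state.get("errors"):
            break  # assumption: halt at the first node that reports an error
    return state


# Hypothetical stand-ins for the real goal, planning, and ingestion nodes.
def demo_goal(state: State) -> State:
    return {"goal": "discover product-customer fit"}


def demo_planning(state: State) -> State:
    return {"plan": ["data_ingestion"]}


def demo_ingestion(state: State) -> State:
    return {
        "raw_customers": [{"Customer_ID": "C001"}],
        "errors": state.get("errors", []),
    }


final_state = run_pipeline([demo_goal, demo_planning, demo_ingestion], {"errors": []})
print(sorted(final_state))  # ['errors', 'goal', 'plan', 'raw_customers']
```

Each node stays thin because it returns only the keys it owns; the runner is the single place where updates are merged and where the `errors` list decides whether to continue.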

---

## Next: Phase 3 (Data Preprocessing)

This phase will:
1. Parse Feature_Set strings into lists
2. Normalize usage metrics
3. Build graph structures (NetworkX)
4. Create derived features

