
# Clustering utilities for Product-Customer Fit Discovery Orchestrator

These utilities implement the **Clustering Agent** itself, **Step 3** in your DAG, which performs **unsupervised machine learning** to meet the "customer\_segmentation" and "product\_bundling" goals.

This section confirms your architecture's reliance on **specialized ML libraries** for tasks where numerical precision is paramount.

***

## 🧠 Core Agent Architecture: Autonomous Machine Learning

The primary focus here is on **feature engineering for algorithms**, **autonomous execution**, and **ML quality assurance**.

### 🎯 What to Focus On: Feature Preparation and Scaling

The most important step for reliable clustering is preparing the feature matrix, covered by `prepare_customer_features` and `prepare_product_features`.

1.  **Homogenization of Features (One-Hot Encoding):**
    * **Focus:** Clustering algorithms like K-means rely on calculating the distance between points. Categorical text data (like `Age_Group` or `Product_Type`) must be converted into a numerical format using **One-Hot Encoding** (a binary vector where $\text{1.0}$ means "this feature is present"). This is a fundamental requirement for distance-based ML.
2.  **Standardization vs. Normalization:**
    * **Standard Scaling (`StandardScaler`):** This is applied to the final feature matrix. It transforms each column to have a mean of $\text{0}$ and a standard deviation of $\text{1}$. The feature-preparation functions also divide raw counts by fixed constants (for example `txn_count / 100.0`), but a constant factor cancels out under standardization, so `StandardScaler` is what actually sets the final scale.
    * **Why it Matters:** Without scaling, features with naturally large ranges (like $\text{transaction\_count}$) would dominate the distance calculation, and the segments would mostly reflect that one feature. Scaling gives the $\text{Engagement Score}$ a weight comparable to the $\text{Transaction Count}$.
3.  **Feature Blend:** You are clustering based on a blend of **demographics** (age, location) and **behavioral/derived metrics** (diversity, engagement score). This creates segments that are both *describable* (demographics) and *predictive* (behavior).
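
The encoding and scaling steps above can be sketched on a toy example. This is a minimal illustration, not the notebook's own code: the three customer records, the shortened `AGE_GROUPS` list, and the `one_hot` helper are hypothetical and exist only to show why scaling matters before a distance-based algorithm runs.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical category list and customer records, for illustration only
AGE_GROUPS = ["18-24", "35-44", "55+"]


def one_hot(value, categories):
    """Return a binary vector with 1.0 at the position matching `value`."""
    return [1.0 if value == c else 0.0 for c in categories]


customers = [
    {"Age_Group": "18-24", "transaction_count": 120, "engagement_score": 0.9},
    {"Age_Group": "35-44", "transaction_count": 15, "engagement_score": 0.2},
    {"Age_Group": "55+", "transaction_count": 60, "engagement_score": 0.5},
]

# Columns: three one-hot age columns, then transaction_count, then engagement_score
raw = np.array([
    one_hot(c["Age_Group"], AGE_GROUPS)
    + [c["transaction_count"], c["engagement_score"]]
    for c in customers
])

# Per-feature gap between the first two customers, before scaling:
# the transaction_count column (index 3) is by far the largest term
raw_gap = np.abs(raw[0] - raw[1])
print("raw per-feature gap:", raw_gap)

# Standardize every column to mean 0 and standard deviation 1
scaled = StandardScaler().fit_transform(raw)
print("column means:", scaled.mean(axis=0).round(6))
print("column stds: ", scaled.std(axis=0).round(6))
```

After standardization no single column can swamp the Euclidean distance simply because it is measured in larger units.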

***

### ✨ Differentiation: Autonomous and Validated Clustering

Your agent is more powerful than a simple ML script because it automates complex decisions and validates its own work:

1.  **Automated Cluster Selection (`find_optimal_clusters`):**
    * **The Power:** This function provides **Autonomy**. It uses the **Silhouette Score**, a quantitative metric for measuring how similar an object is to its own cluster compared to other clusters (score closer to $\text{+1}$ is better).
    * By iterating from $\text{k=2}$ to $\text{max\_clusters}$ and choosing the $\text{k}$ with the best Silhouette Score, the agent eliminates the need for a human data scientist to manually determine the optimal number of segments, making the whole workflow faster and more objective.

2.  **The Final Translation (`analyze_cluster_characteristics`):**
    * **The Power:** This function acts as the **Analytic Interpreter**. The raw output of K-means is just a list of numbers (labels $\text{0, 1, 2, ...}$). This function takes those labels and translates them back into a strategic summary by calculating things like the **most common age group**, **top products**, and **average usage** for each cluster.
    * This **Strategic Summarization** provides the Synthesis Agent with the human-readable insights it needs to name the segments (e.g., "The Low-Engagement Seniors") and formulate the final business opportunities.
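
To make the silhouette-based selection concrete, here is a small sketch run on synthetic data. The function name `pick_k_by_silhouette` and the three-blob dataset are assumptions made for illustration; the loop follows the same idea as `find_optimal_clusters` but is not a copy of it.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def pick_k_by_silhouette(X, min_k=2, max_k=5):
    """Return (best_k, scores), where scores maps each k to its silhouette score."""
    X_scaled = StandardScaler().fit_transform(X)
    # silhouette_score is only defined for 2 <= k <= n_samples - 1
    max_k = min(max_k, len(X) - 1)
    scores = {}
    for k in range(min_k, max_k + 1):
        labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
        scores[k] = silhouette_score(X_scaled, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Synthetic data (hypothetical): three well-separated blobs of 20 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

best_k, scores = pick_k_by_silhouette(X)
print("best k:", best_k)
print({k: round(float(s), 3) for k, s in scores.items()})
```

On real customer data the scores are usually much closer together than on clean blobs like these, so it can be worth inspecting the whole `scores` dictionary rather than only the winning `k`.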

This robust framework ensures your agent's conclusions are not just LLM-generated guesses, but **data-validated, rigorously calculated machine learning results.**
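
The translation from raw labels to a readable summary can be sketched with `collections.Counter`. The records, the `summarize` helper, and the output keys below are hypothetical and much smaller than what `analyze_cluster_characteristics` returns; the point is only to show the grouping and counting pattern.

```python
from collections import Counter, defaultdict

# Hypothetical cluster labels and records, for illustration only
labels = [0, 0, 1, 1, 1]
customers = [
    {"Customer_ID": "C001", "Age_Group": "35-44"},
    {"Customer_ID": "C002", "Age_Group": "35-44"},
    {"Customer_ID": "C003", "Age_Group": "55+"},
    {"Customer_ID": "C004", "Age_Group": "55+"},
    {"Customer_ID": "C005", "Age_Group": "45-54"},
]
transactions = [
    {"Customer_ID": "C001", "Product_ID": "P01", "Usage_Metric": 50.0},
    {"Customer_ID": "C002", "Product_ID": "P01", "Usage_Metric": 70.0},
    {"Customer_ID": "C002", "Product_ID": "P02", "Usage_Metric": 30.0},
    {"Customer_ID": "C003", "Product_ID": "P03", "Usage_Metric": 10.0},
    {"Customer_ID": "C004", "Product_ID": "P03", "Usage_Metric": 20.0},
    {"Customer_ID": "C005", "Product_ID": "P02", "Usage_Metric": 30.0},
]


def summarize(labels, customers, transactions):
    """Group customers by cluster label and compute a small readable summary."""
    members = defaultdict(list)
    for label, customer in zip(labels, customers):
        members[label].append(customer)

    txns_by_customer = defaultdict(list)
    for txn in transactions:
        txns_by_customer[txn["Customer_ID"]].append(txn)

    summaries = {}
    for label, group in sorted(members.items()):
        group_txns = [t for c in group for t in txns_by_customer[c["Customer_ID"]]]
        usage = [t["Usage_Metric"] for t in group_txns]
        product_counts = Counter(t["Product_ID"] for t in group_txns)
        summaries[label] = {
            "size": len(group),
            "most_common_age_group": Counter(
                c["Age_Group"] for c in group
            ).most_common(1)[0][0],
            # Ranked by number of transactions, most frequent first
            "top_products": [p for p, _ in product_counts.most_common(2)],
            "avg_usage": sum(usage) / len(usage) if usage else 0.0,
        }
    return summaries


summaries = summarize(labels, customers, transactions)
for label, summary in summaries.items():
    print(label, summary)
```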

In [None]:
"""Clustering utilities for Product-Customer Fit Discovery Orchestrator"""

from typing import List, Dict, Any, Optional, Tuple
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from collections import Counter, defaultdict


def prepare_customer_features(
    customers: List[Dict[str, Any]],
    transactions: List[Dict[str, Any]],
    derived_features: Dict[str, Any]
) -> Tuple[np.ndarray, List[str]]:
    """
    Prepare feature matrix for customer clustering.

    Features include:
    - Demographics (age group, location tier, acquisition channel) - one-hot encoded
    - Behavioral (transaction count, product diversity, engagement score)

    Args:
        customers: List of customer dictionaries
        transactions: List of transaction dictionaries
        derived_features: Derived features from preprocessing

    Returns:
        Tuple of (feature_matrix, customer_ids)
    """
    customer_ids = []
    features = []

    # Get engagement metrics
    engagement = derived_features.get("customer_engagement", {})

    # Group transactions by customer
    customer_txn_count = defaultdict(int)
    customer_products = defaultdict(set)

    for txn in transactions:
        customer_id = txn.get("Customer_ID")
        product_id = txn.get("Product_ID")
        if customer_id:
            customer_txn_count[customer_id] += 1
            if product_id:
                customer_products[customer_id].add(product_id)

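    # Category vocabularies for one-hot encoding; a value that is not listed
    # here (or a missing field) encodes as an all-zero block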
    age_groups = ["18-24", "35-44", "45-54", "55+"]
    location_tiers = ["Tier 1 (High)", "Tier 2 (Medium)", "Tier 3 (Low)"]
    acquisition_channels = ["Email", "Referral", "Search", "Social"]

    for customer in customers:
        customer_id = customer.get("Customer_ID")
        if not customer_id:
            continue

        customer_ids.append(customer_id)
        feature_vector = []

        # Age group (one-hot)
        age_group = customer.get("Age_Group", "")
        for age in age_groups:
            feature_vector.append(1.0 if age_group == age else 0.0)

        # Location tier (one-hot)
        location = customer.get("Location_Tier", "")
        for tier in location_tiers:
            feature_vector.append(1.0 if location == tier else 0.0)

        # Acquisition channel (one-hot)
        channel = customer.get("Acquisition_Channel", "")
        for acq in acquisition_channels:
            feature_vector.append(1.0 if channel == acq else 0.0)

        # Behavioral features
        txn_count = customer_txn_count.get(customer_id, 0)
        product_diversity = len(customer_products.get(customer_id, set()))

        # Engagement score
        eng_data = engagement.get(customer_id, {})
        engagement_score = eng_data.get("engagement_score", 0.0)

        feature_vector.extend([
            txn_count / 100.0,  # Normalize
            product_diversity / 20.0,  # Normalize
            engagement_score
        ])

        features.append(feature_vector)

    return np.array(features), customer_ids


def prepare_product_features(
    products: List[Dict[str, Any]],
    transactions: List[Dict[str, Any]],
    derived_features: Dict[str, Any]
) -> Tuple[np.ndarray, List[str]]:
    """
    Prepare feature matrix for product clustering.

    Features include:
    - Product attributes (type, monetization model) - one-hot encoded
    - Feature set (which features A, B, C, D) - one-hot encoded
    - Behavioral (popularity score, customer count, transaction count)

    Args:
        products: List of product dictionaries
        transactions: List of transaction dictionaries
        derived_features: Derived features from preprocessing

    Returns:
        Tuple of (feature_matrix, product_ids)
    """
    product_ids = []
    features = []

    # Get popularity metrics
    popularity = derived_features.get("product_popularity", {})

    # Product type and monetization model mappings
    product_types = ["Hardware", "Software", "Service"]
    monetization_models = ["One-Time Purchase", "Freemium", "Subscription"]
    feature_letters = ["A", "B", "C", "D"]

    for product in products:
        product_id = product.get("Product_ID")
        if not product_id:
            continue

        product_ids.append(product_id)
        feature_vector = []

        # Product type (one-hot)
        ptype = product.get("Product_Type", "")
        for pt in product_types:
            feature_vector.append(1.0 if ptype == pt else 0.0)

        # Monetization model (one-hot)
        monet = product.get("Monetization_Model", "")
        for mm in monetization_models:
            feature_vector.append(1.0 if monet == mm else 0.0)

        # Feature set (one-hot for A, B, C, D)
        feature_list = product.get("feature_list", [])
        for feat in feature_letters:
            feature_vector.append(1.0 if feat in feature_list else 0.0)

        # Behavioral features
        pop_data = popularity.get(product_id, {})
        popularity_score = pop_data.get("popularity_score", 0.0)
        customer_count = pop_data.get("customer_count", 0)
        txn_count = pop_data.get("transaction_count", 0)

        feature_vector.extend([
            popularity_score,
            customer_count / 200.0,  # Normalize
            txn_count / 1000.0  # Normalize
        ])

        features.append(feature_vector)

    return np.array(features), product_ids


def find_optimal_clusters(
    feature_matrix: np.ndarray,
    max_clusters: int = 10,
    min_clusters: int = 2
) -> int:
    """
    Find the optimal number of clusters using the silhouette score.

    Args:
        feature_matrix: Feature matrix for clustering
        max_clusters: Maximum clusters to consider
        min_clusters: Minimum clusters to consider

    Returns:
        Optimal number of clusters
    """
    if len(feature_matrix) < min_clusters:
        return len(feature_matrix)

    max_clusters = min(max_clusters, len(feature_matrix) - 1)

    if max_clusters < min_clusters:
        return min_clusters

    # Scale features
    scaler = StandardScaler()
    scaled_features = scaler.fit_transform(feature_matrix)

    best_k = min_clusters
    best_silhouette = -1

    for k in range(min_clusters, max_clusters + 1):
        try:
            kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
            labels = kmeans.fit_predict(scaled_features)

            if len(set(labels)) > 1:  # Need at least 2 clusters
                silhouette = silhouette_score(scaled_features, labels)
                if silhouette > best_silhouette:
                    best_silhouette = silhouette
                    best_k = k
        except Exception:  # skip any k that KMeans or silhouette_score rejects
            continue

    return best_k


def cluster_customers(
    customers: List[Dict[str, Any]],
    transactions: List[Dict[str, Any]],
    derived_features: Dict[str, Any],
    num_clusters: Optional[int] = None,
    max_clusters: int = 10
) -> Tuple[List[int], Dict[str, Any]]:
    """
    Cluster customers using K-means.

    Args:
        customers: List of customer dictionaries
        transactions: List of transaction dictionaries
        derived_features: Derived features from preprocessing
        num_clusters: Number of clusters (None = auto-determine)
        max_clusters: Maximum clusters to consider if auto-determining

    Returns:
        Tuple of (cluster_labels, cluster_metadata)
    """
    # Prepare features
    feature_matrix, customer_ids = prepare_customer_features(
        customers, transactions, derived_features
    )

    if len(feature_matrix) == 0:
        return [], {}

    # Scale features
    scaler = StandardScaler()
    scaled_features = scaler.fit_transform(feature_matrix)

    # Determine number of clusters
    if num_clusters is None:
        num_clusters = find_optimal_clusters(feature_matrix, max_clusters)

    num_clusters = min(num_clusters, len(feature_matrix))

    # Perform clustering
    kmeans = KMeans(n_clusters=num_clusters, random_state=42, n_init=10)
    labels = kmeans.fit_predict(scaled_features)

    # Calculate silhouette score (only defined for 2 <= clusters <= n_samples - 1)
    n_found = len(set(labels))
    silhouette = silhouette_score(scaled_features, labels) if 1 < n_found < len(labels) else 0.0

    # Create metadata
    metadata = {
        "num_clusters": num_clusters,
        "silhouette_score": float(silhouette),
        "inertia": float(kmeans.inertia_),
        "customer_ids": customer_ids
    }

    return labels.tolist(), metadata


def cluster_products(
    products: List[Dict[str, Any]],
    transactions: List[Dict[str, Any]],
    derived_features: Dict[str, Any],
    num_clusters: Optional[int] = None,
    max_clusters: int = 10
) -> Tuple[List[int], Dict[str, Any]]:
    """
    Cluster products using K-means.

    Args:
        products: List of product dictionaries
        transactions: List of transaction dictionaries
        derived_features: Derived features from preprocessing
        num_clusters: Number of clusters (None = auto-determine)
        max_clusters: Maximum clusters to consider if auto-determining

    Returns:
        Tuple of (cluster_labels, cluster_metadata)
    """
    # Prepare features
    feature_matrix, product_ids = prepare_product_features(
        products, transactions, derived_features
    )

    if len(feature_matrix) == 0:
        return [], {}

    # Scale features
    scaler = StandardScaler()
    scaled_features = scaler.fit_transform(feature_matrix)

    # Determine number of clusters
    if num_clusters is None:
        num_clusters = find_optimal_clusters(feature_matrix, max_clusters)

    num_clusters = min(num_clusters, len(feature_matrix))

    # Perform clustering
    kmeans = KMeans(n_clusters=num_clusters, random_state=42, n_init=10)
    labels = kmeans.fit_predict(scaled_features)

    # Calculate silhouette score (only defined for 2 <= clusters <= n_samples - 1)
    n_found = len(set(labels))
    silhouette = silhouette_score(scaled_features, labels) if 1 < n_found < len(labels) else 0.0

    # Create metadata
    metadata = {
        "num_clusters": num_clusters,
        "silhouette_score": float(silhouette),
        "inertia": float(kmeans.inertia_),
        "product_ids": product_ids
    }

    return labels.tolist(), metadata


def analyze_cluster_characteristics(
    cluster_labels: List[int],
    entity_ids: List[str],
    entities: List[Dict[str, Any]],
    transactions: List[Dict[str, Any]],
    entity_type: str = "customer"
) -> List[Dict[str, Any]]:
    """
    Analyze characteristics of each cluster.

    Args:
        cluster_labels: Cluster assignment for each entity
        entity_ids: List of entity IDs
        entities: List of entity dictionaries
        transactions: List of transaction dictionaries
        entity_type: "customer" or "product"

    Returns:
        List of cluster analysis dictionaries
    """
    # Group entities by cluster
    clusters = defaultdict(list)
    for idx, (entity_id, label) in enumerate(zip(entity_ids, cluster_labels)):
        clusters[label].append((entity_id, idx))

    cluster_analyses = []

    # Create entity lookup
    entity_lookup = {e.get("Customer_ID" if entity_type == "customer" else "Product_ID"): e
                     for e in entities}

    # Group transactions by entity
    entity_transactions = defaultdict(list)
    for txn in transactions:
        entity_id = txn.get("Customer_ID" if entity_type == "customer" else "Product_ID")
        if entity_id:
            entity_transactions[entity_id].append(txn)

    for cluster_id, members in sorted(clusters.items()):
        member_ids = [m[0] for m in members]
        member_entities = [entity_lookup.get(eid) for eid in member_ids if eid in entity_lookup]

        if not member_entities:
            continue

        # Analyze characteristics
        if entity_type == "customer":
            # Customer cluster analysis
            age_groups = [e.get("Age_Group", "") for e in member_entities if e]
            location_tiers = [e.get("Location_Tier", "") for e in member_entities if e]
            channels = [e.get("Acquisition_Channel", "") for e in member_entities if e]

            # Product usage
            products_used = set()
            total_usage = 0.0
            usage_count = 0

            for member_id in member_ids:
                for txn in entity_transactions.get(member_id, []):
                    products_used.add(txn.get("Product_ID"))
                    usage = txn.get("Usage_Metric", 0)
                    if usage:
                        total_usage += usage
                        usage_count += 1

            avg_usage = total_usage / usage_count if usage_count > 0 else 0.0

            cluster_analysis = {
                "cluster_id": int(cluster_id),
                "cluster_label": f"Customer Segment {cluster_id + 1}",
                "entity_ids": member_ids,
                "size": len(member_ids),
                "characteristics": {
                    "avg_age_group": Counter(age_groups).most_common(1)[0][0] if age_groups else "",
                    "common_location_tiers": [tier for tier, count in Counter(location_tiers).most_common(2)],
                    "common_acquisition_channels": [ch for ch, count in Counter(channels).most_common(2)],
                    "avg_usage_metric": float(avg_usage),
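                    # Rank by transaction count; slicing a set would give an arbitrary order
                    "top_products": [
                        pid for pid, _ in Counter(
                            txn.get("Product_ID")
                            for member_id in member_ids
                            for txn in entity_transactions.get(member_id, [])
                            if txn.get("Product_ID")
                        ).most_common(5)
                    ],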
                    "product_diversity": float(len(products_used))
                },
                "underserved_products": [],  # Will be filled by synthesis
                "business_value": float(len(member_ids) * avg_usage)  # Simple estimate
            }
        else:
            # Product cluster analysis
            product_types = [e.get("Product_Type", "") for e in member_entities if e]
            monetization_models = [e.get("Monetization_Model", "") for e in member_entities if e]
            feature_sets = [e.get("feature_list", []) for e in member_entities if e]

            # Flatten feature sets
            all_features = []
            for fs in feature_sets:
                all_features.extend(fs)

            # Usage metrics
            total_usage = 0.0
            usage_count = 0
            customers_using = set()

            for member_id in member_ids:
                for txn in entity_transactions.get(member_id, []):
                    customers_using.add(txn.get("Customer_ID"))
                    usage = txn.get("Usage_Metric", 0)
                    if usage:
                        total_usage += usage
                        usage_count += 1

            avg_usage = total_usage / usage_count if usage_count > 0 else 0.0

            cluster_analysis = {
                "cluster_id": int(cluster_id),
                "cluster_label": f"Product Bundle {cluster_id + 1}",
                "entity_ids": member_ids,
                "size": len(member_ids),
                "characteristics": {
                    "common_features": list(set(all_features)),
                    "monetization_models": [mm for mm, count in Counter(monetization_models).most_common(3)],
                    "product_types": [pt for pt, count in Counter(product_types).most_common(3)],
                    "avg_usage_metric": float(avg_usage),
                    "customer_count": len(customers_using)
                },
                "bundle_potential": float(len(customers_using) / max(len(member_ids), 1))  # Simple estimate
            }

        cluster_analyses.append(cluster_analysis)

    return cluster_analyses



# Visualization utilities for Product-Customer Fit Discovery Orchestrator
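
The plotting functions in this section project the feature matrix to two dimensions with PCA before drawing it. The sketch below shows that projection step alone, without any plotting. The synthetic 6-feature matrix and its labels are hypothetical, and standardizing before PCA is an assumption made here so that the projection matches the scaled space K-means clusters in.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical 6-feature matrix: two groups of 15 points with different means
rng = np.random.default_rng(42)
group_a = rng.normal(loc=0.0, scale=1.0, size=(15, 6))
group_b = rng.normal(loc=5.0, scale=1.0, size=(15, 6))
X = np.vstack([group_a, group_b])
labels = np.array([0] * 15 + [1] * 15)

# Standardize, then project onto the two directions of highest variance
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2, random_state=42)
coords = pca.fit_transform(X_scaled)

print("projected shape:", coords.shape)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))

# Per-cluster centroid in the 2D projection
for cluster_id in np.unique(labels):
    centroid = coords[labels == cluster_id].mean(axis=0)
    print(f"cluster {cluster_id} centroid:", centroid.round(2))
```

The explained variance ratio indicates how much of the original spread the 2D picture retains; when it is low, clusters that look overlapping in the plot may still be well separated in the full feature space.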

In [None]:
"""Visualization utilities for Product-Customer Fit Discovery Orchestrator"""

from typing import List, Dict, Any, Optional, Tuple
import numpy as np
import matplotlib
matplotlib.use('Agg')  # Use non-interactive backend
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from pathlib import Path


def plot_customer_clusters(
    feature_matrix: np.ndarray,
    cluster_labels: List[int],
    customer_ids: List[str],
    output_path: str,
    title: str = "Customer Clusters"
) -> str:
    """
    Plot customer clusters using PCA for 2D visualization.

    Args:
        feature_matrix: Feature matrix used for clustering
        cluster_labels: Cluster assignment for each customer
        customer_ids: List of customer IDs
        output_path: Path to save the plot
        title: Plot title

    Returns:
        Path to saved plot file
    """
    if len(feature_matrix) == 0:
        return ""

    # Reduce to 2D using PCA
    if min(feature_matrix.shape) < 2:
        # PCA to 2 components needs at least 2 samples and 2 features; otherwise create simple plot
        fig, ax = plt.subplots(figsize=(10, 8))
        unique_clusters = sorted(set(cluster_labels))
        colors = plt.cm.tab10(np.linspace(0, 1, len(unique_clusters)))

        for i, cluster_id in enumerate(unique_clusters):
            mask = np.array(cluster_labels) == cluster_id
            ax.scatter([i] * mask.sum(), range(mask.sum()),
                      c=[colors[i]], label=f'Cluster {cluster_id + 1}',
                      alpha=0.6, s=50)

        ax.set_xlabel('Cluster')
        ax.set_ylabel('Customer Index')
        ax.set_title(title)
        ax.legend()
    else:
        pca = PCA(n_components=2, random_state=42)
        features_2d = pca.fit_transform(feature_matrix)

        # Create plot
        fig, ax = plt.subplots(figsize=(12, 10))

        unique_clusters = sorted(set(cluster_labels))
        colors = plt.cm.tab10(np.linspace(0, 1, len(unique_clusters)))

        for i, cluster_id in enumerate(unique_clusters):
            mask = np.array(cluster_labels) == cluster_id
            cluster_points = features_2d[mask]

            ax.scatter(cluster_points[:, 0], cluster_points[:, 1],
                      c=[colors[i]], label=f'Cluster {cluster_id + 1}',
                      alpha=0.6, s=50, edgecolors='black', linewidths=0.5)

        ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
        ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
        ax.set_title(title)
        ax.legend()
        ax.grid(True, alpha=0.3)

    plt.tight_layout()

    # Save plot
    output_file = Path(output_path)
    output_file.parent.mkdir(parents=True, exist_ok=True)
    plt.savefig(output_path, dpi=150, bbox_inches='tight')
    plt.close()

    return output_path


def plot_product_clusters(
    feature_matrix: np.ndarray,
    cluster_labels: List[int],
    product_ids: List[str],
    output_path: str,
    title: str = "Product Clusters"
) -> str:
    """
    Plot product clusters using PCA for 2D visualization.

    Args:
        feature_matrix: Feature matrix used for clustering
        cluster_labels: Cluster assignment for each product
        product_ids: List of product IDs
        output_path: Path to save the plot
        title: Plot title

    Returns:
        Path to saved plot file
    """
    if len(feature_matrix) == 0:
        return ""

    # Reduce to 2D using PCA
    if min(feature_matrix.shape) < 2:
        # PCA to 2 components needs at least 2 samples and 2 features; otherwise create simple plot
        fig, ax = plt.subplots(figsize=(10, 8))
        unique_clusters = sorted(set(cluster_labels))
        colors = plt.cm.tab10(np.linspace(0, 1, len(unique_clusters)))

        for i, cluster_id in enumerate(unique_clusters):
            mask = np.array(cluster_labels) == cluster_id
            ax.scatter([i] * mask.sum(), range(mask.sum()),
                      c=[colors[i]], label=f'Cluster {cluster_id + 1}',
                      alpha=0.6, s=50)

        ax.set_xlabel('Cluster')
        ax.set_ylabel('Product Index')
        ax.set_title(title)
        ax.legend()
    else:
        pca = PCA(n_components=2, random_state=42)
        features_2d = pca.fit_transform(feature_matrix)

        # Create plot
        fig, ax = plt.subplots(figsize=(12, 10))

        unique_clusters = sorted(set(cluster_labels))
        colors = plt.cm.tab10(np.linspace(0, 1, len(unique_clusters)))

        for i, cluster_id in enumerate(unique_clusters):
            mask = np.array(cluster_labels) == cluster_id
            cluster_points = features_2d[mask]

            ax.scatter(cluster_points[:, 0], cluster_points[:, 1],
                      c=[colors[i]], label=f'Cluster {cluster_id + 1}',
                      alpha=0.6, s=100, edgecolors='black', linewidths=0.5)

            # Annotate product IDs
            for j, (x, y) in enumerate(cluster_points):
                product_id = product_ids[np.where(mask)[0][j]]
                ax.annotate(product_id, (x, y), fontsize=8, alpha=0.7)

        ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
        ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
        ax.set_title(title)
        ax.legend()
        ax.grid(True, alpha=0.3)

    plt.tight_layout()

    # Save plot
    output_file = Path(output_path)
    output_file.parent.mkdir(parents=True, exist_ok=True)
    plt.savefig(output_path, dpi=150, bbox_inches='tight')
    plt.close()

    return output_path


def plot_cluster_summary(
    customer_clusters: List[Dict[str, Any]],
    product_clusters: List[Dict[str, Any]],
    output_path: str,
    title: str = "Cluster Summary"
) -> str:
    """
    Create a summary visualization showing cluster sizes and characteristics.

    Args:
        customer_clusters: List of customer cluster analyses
        product_clusters: List of product cluster analyses
        output_path: Path to save the plot
        title: Plot title

    Returns:
        Path to saved plot file
    """
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

    # Customer clusters
    if customer_clusters:
        cluster_ids = [c["cluster_id"] for c in customer_clusters]
        cluster_sizes = [c["size"] for c in customer_clusters]
        cluster_labels = [c["cluster_label"] for c in customer_clusters]

        colors = plt.cm.tab10(np.linspace(0, 1, len(cluster_ids)))
        ax1.bar(range(len(cluster_ids)), cluster_sizes, color=colors, alpha=0.7, edgecolor='black')
        ax1.set_xticks(range(len(cluster_ids)))
        ax1.set_xticklabels([f'C{cid+1}' for cid in cluster_ids], rotation=45, ha='right')
        ax1.set_ylabel('Number of Customers')
        ax1.set_title('Customer Cluster Sizes')
        ax1.grid(True, alpha=0.3, axis='y')

    # Product clusters
    if product_clusters:
        cluster_ids = [c["cluster_id"] for c in product_clusters]
        cluster_sizes = [c["size"] for c in product_clusters]

        colors = plt.cm.tab10(np.linspace(0, 1, len(cluster_ids)))
        ax2.bar(range(len(cluster_ids)), cluster_sizes, color=colors, alpha=0.7, edgecolor='black')
        ax2.set_xticks(range(len(cluster_ids)))
        ax2.set_xticklabels([f'P{cid+1}' for cid in cluster_ids], rotation=45, ha='right')
        ax2.set_ylabel('Number of Products')
        ax2.set_title('Product Cluster Sizes')
        ax2.grid(True, alpha=0.3, axis='y')

    plt.suptitle(title, fontsize=14, fontweight='bold')
    plt.tight_layout()

    # Save plot
    output_file = Path(output_path)
    output_file.parent.mkdir(parents=True, exist_ok=True)
    plt.savefig(output_path, dpi=150, bbox_inches='tight')
    plt.close()

    return output_path



# Tests for clustering utilities

In [None]:
"""Tests for clustering utilities"""

import pytest
import numpy as np
from tools.clustering import (
    prepare_customer_features,
    prepare_product_features,
    find_optimal_clusters,
    cluster_customers,
    cluster_products,
    analyze_cluster_characteristics
)


def test_prepare_customer_features():
    """Test customer feature preparation"""
    customers = [
        {"Customer_ID": "C001", "Age_Group": "35-44", "Location_Tier": "Tier 2 (Medium)", "Acquisition_Channel": "Search"},
        {"Customer_ID": "C002", "Age_Group": "45-54", "Location_Tier": "Tier 1 (High)", "Acquisition_Channel": "Social"}
    ]
    transactions = [
        {"Customer_ID": "C001", "Product_ID": "P01", "Usage_Metric": 50.0}
    ]
    derived_features = {
        "customer_engagement": {
            "C001": {"engagement_score": 0.7},
            "C002": {"engagement_score": 0.5}
        }
    }

    features, customer_ids = prepare_customer_features(customers, transactions, derived_features)

    assert len(features) == 2
    assert len(customer_ids) == 2
    assert customer_ids[0] == "C001"
    assert features.shape[1] > 0  # Should have features


def test_prepare_product_features():
    """Test product feature preparation"""
    products = [
        {"Product_ID": "P01", "Product_Type": "Hardware", "Monetization_Model": "One-Time Purchase", "feature_list": ["A", "B"]},
        {"Product_ID": "P02", "Product_Type": "Software", "Monetization_Model": "Subscription", "feature_list": ["C"]}
    ]
    transactions = [
        {"Product_ID": "P01", "Customer_ID": "C001"}
    ]
    derived_features = {
        "product_popularity": {
            "P01": {"popularity_score": 0.8, "customer_count": 10, "transaction_count": 50},
            "P02": {"popularity_score": 0.6, "customer_count": 5, "transaction_count": 20}
        }
    }

    features, product_ids = prepare_product_features(products, transactions, derived_features)

    assert len(features) == 2
    assert len(product_ids) == 2
    assert product_ids[0] == "P01"
    assert features.shape[1] > 0  # Should have features


def test_find_optimal_clusters():
    """Test optimal cluster finding"""
    # Create simple 2D data with clear clusters
    feature_matrix = np.array([
        [1, 1], [1, 2], [2, 1], [2, 2],  # Cluster 1
        [10, 10], [10, 11], [11, 10], [11, 11]  # Cluster 2
    ])

    optimal_k = find_optimal_clusters(feature_matrix, max_clusters=5, min_clusters=2)

    assert 2 <= optimal_k <= 5


def test_cluster_customers():
    """Test customer clustering"""
    customers = [
        {"Customer_ID": "C001", "Age_Group": "35-44", "Location_Tier": "Tier 2 (Medium)", "Acquisition_Channel": "Search"},
        {"Customer_ID": "C002", "Age_Group": "45-54", "Location_Tier": "Tier 1 (High)", "Acquisition_Channel": "Social"},
        {"Customer_ID": "C003", "Age_Group": "35-44", "Location_Tier": "Tier 2 (Medium)", "Acquisition_Channel": "Search"}
    ]
    transactions = [
        {"Customer_ID": "C001", "Product_ID": "P01", "Usage_Metric": 50.0},
        {"Customer_ID": "C002", "Product_ID": "P02", "Usage_Metric": 60.0},
        {"Customer_ID": "C003", "Product_ID": "P01", "Usage_Metric": 55.0}
    ]
    derived_features = {
        "customer_engagement": {
            "C001": {"engagement_score": 0.7},
            "C002": {"engagement_score": 0.5},
            "C003": {"engagement_score": 0.6}
        }
    }

    labels, metadata = cluster_customers(customers, transactions, derived_features, num_clusters=2)

    assert len(labels) == 3
    assert "num_clusters" in metadata
    assert "silhouette_score" in metadata
    assert metadata["num_clusters"] == 2


def test_cluster_products():
    """Test product clustering"""
    products = [
        {"Product_ID": "P01", "Product_Type": "Hardware", "Monetization_Model": "One-Time Purchase", "feature_list": ["A", "B"]},
        {"Product_ID": "P02", "Product_Type": "Software", "Monetization_Model": "Subscription", "feature_list": ["C"]},
        {"Product_ID": "P03", "Product_Type": "Hardware", "Monetization_Model": "One-Time Purchase", "feature_list": ["A", "B"]}
    ]
    transactions = [
        {"Product_ID": "P01", "Customer_ID": "C001"},
        {"Product_ID": "P02", "Customer_ID": "C002"},
        {"Product_ID": "P03", "Customer_ID": "C001"}
    ]
    derived_features = {
        "product_popularity": {
            "P01": {"popularity_score": 0.8, "customer_count": 10, "transaction_count": 50},
            "P02": {"popularity_score": 0.6, "customer_count": 5, "transaction_count": 20},
            "P03": {"popularity_score": 0.7, "customer_count": 8, "transaction_count": 40}
        }
    }

    labels, metadata = cluster_products(products, transactions, derived_features, num_clusters=2)

    assert len(labels) == 3
    assert "num_clusters" in metadata
    assert "silhouette_score" in metadata
    assert metadata["num_clusters"] == 2


def test_analyze_cluster_characteristics_customers():
    """Test customer cluster analysis"""
    cluster_labels = [0, 0, 1]
    customer_ids = ["C001", "C002", "C003"]
    customers = [
        {"Customer_ID": "C001", "Age_Group": "35-44", "Location_Tier": "Tier 2 (Medium)", "Acquisition_Channel": "Search"},
        {"Customer_ID": "C002", "Age_Group": "35-44", "Location_Tier": "Tier 2 (Medium)", "Acquisition_Channel": "Search"},
        {"Customer_ID": "C003", "Age_Group": "45-54", "Location_Tier": "Tier 1 (High)", "Acquisition_Channel": "Social"}
    ]
    transactions = [
        {"Customer_ID": "C001", "Product_ID": "P01", "Usage_Metric": 50.0},
        {"Customer_ID": "C002", "Product_ID": "P01", "Usage_Metric": 55.0},
        {"Customer_ID": "C003", "Product_ID": "P02", "Usage_Metric": 60.0}
    ]

    analyses = analyze_cluster_characteristics(
        cluster_labels, customer_ids, customers, transactions, entity_type="customer"
    )

    assert len(analyses) == 2  # Two clusters
    assert analyses[0]["cluster_id"] == 0
    assert "characteristics" in analyses[0]
    assert "size" in analyses[0]
    assert "avg_age_group" in analyses[0]["characteristics"]


def test_analyze_cluster_characteristics_products():
    """Test product cluster analysis"""
    cluster_labels = [0, 1, 0]
    product_ids = ["P01", "P02", "P03"]
    products = [
        {"Product_ID": "P01", "Product_Type": "Hardware", "Monetization_Model": "One-Time Purchase", "feature_list": ["A", "B"]},
        {"Product_ID": "P02", "Product_Type": "Software", "Monetization_Model": "Subscription", "feature_list": ["C"]},
        {"Product_ID": "P03", "Product_Type": "Hardware", "Monetization_Model": "One-Time Purchase", "feature_list": ["A", "B"]}
    ]
    transactions = [
        {"Product_ID": "P01", "Customer_ID": "C001", "Usage_Metric": 50.0},
        {"Product_ID": "P02", "Customer_ID": "C002", "Usage_Metric": 60.0},
        {"Product_ID": "P03", "Customer_ID": "C001", "Usage_Metric": 55.0}
    ]

    analyses = analyze_cluster_characteristics(
        cluster_labels, product_ids, products, transactions, entity_type="product"
    )

    assert len(analyses) == 2  # Two clusters
    assert analyses[0]["cluster_id"] == 0
    assert "characteristics" in analyses[0]
    assert "size" in analyses[0]
    assert "common_features" in analyses[0]["characteristics"]



# Test Results

(.venv) micahshull@Micahs-iMac LG_Cursor_035_Product-CustomerFitDiscoveryOrchestrator % python3 -m pytest tests/test_clustering.py -v
============================================================ test session starts ============================================================
platform darwin -- Python 3.13.7, pytest-9.0.1, pluggy-1.6.0 -- /Users/micahshull/Documents/AI_LangGraph/LG_Cursor_035_Product-CustomerFitDiscoveryOrchestrator/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /Users/micahshull/Documents/AI_LangGraph/LG_Cursor_035_Product-CustomerFitDiscoveryOrchestrator
plugins: langsmith-0.4.53, anyio-4.12.0, asyncio-1.3.0, cov-7.0.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 7 items

tests/test_clustering.py::test_prepare_customer_features PASSED                                                                       [ 14%]
tests/test_clustering.py::test_prepare_product_features PASSED                                                                        [ 28%]
tests/test_clustering.py::test_find_optimal_clusters PASSED                                                                           [ 42%]
tests/test_clustering.py::test_cluster_customers PASSED                                                                               [ 57%]
tests/test_clustering.py::test_cluster_products PASSED                                                                                [ 71%]
tests/test_clustering.py::test_analyze_cluster_characteristics_customers PASSED                                                       [ 85%]
tests/test_clustering.py::test_analyze_cluster_characteristics_products PASSED                                                        [100%]

============================================================ 7 passed in 16.70s =============================================================


(.venv) micahshull@Micahs-iMac LG_Cursor_035_Product-CustomerFitDiscoveryOrchestrator % python3 -m pytest tests/test_nodes_phase1.py tests/test_nodes_phase2.py tests/test_nodes_phase3.py -v
============================================================ test session starts ============================================================
platform darwin -- Python 3.13.7, pytest-9.0.1, pluggy-1.6.0 -- /Users/micahshull/Documents/AI_LangGraph/LG_Cursor_035_Product-CustomerFitDiscoveryOrchestrator/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /Users/micahshull/Documents/AI_LangGraph/LG_Cursor_035_Product-CustomerFitDiscoveryOrchestrator
plugins: langsmith-0.4.53, anyio-4.12.0, asyncio-1.3.0, cov-7.0.0
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 26 items

tests/test_nodes_phase1.py::test_goal_node_basic PASSED                                                                               [  3%]
tests/test_nodes_phase1.py::test_goal_node_preserves_errors PASSED                                                                    [  7%]
tests/test_nodes_phase1.py::test_planning_node_with_goal PASSED                                                                       [ 11%]
tests/test_nodes_phase1.py::test_planning_node_requires_goal PASSED                                                                   [ 15%]
tests/test_nodes_phase1.py::test_planning_node_plan_structure PASSED                                                                  [ 19%]
tests/test_nodes_phase1.py::test_planning_node_dependencies PASSED                                                                    [ 23%]
tests/test_nodes_phase1.py::test_goal_and_planning_together PASSED                                                                    [ 26%]
tests/test_nodes_phase2.py::test_data_ingestion_node_loads_all_data PASSED                                                            [ 30%]
tests/test_nodes_phase2.py::test_data_ingestion_node_customers_structure PASSED                                                       [ 34%]
tests/test_nodes_phase2.py::test_data_ingestion_node_transactions_structure PASSED                                                    [ 38%]
tests/test_nodes_phase2.py::test_data_ingestion_node_products_structure PASSED                                                        [ 42%]
tests/test_nodes_phase2.py::test_data_ingestion_node_uses_custom_paths PASSED                                                         [ 46%]
tests/test_nodes_phase2.py::test_data_ingestion_node_handles_missing_file PASSED                                                      [ 50%]
tests/test_nodes_phase2.py::test_data_ingestion_node_preserves_errors PASSED                                                          [ 53%]
tests/test_nodes_phase2.py::test_data_ingestion_node_with_goal_and_planning PASSED                                                    [ 57%]
tests/test_nodes_phase2.py::test_data_ingestion_node_data_counts PASSED                                                               [ 61%]
tests/test_nodes_phase3.py::test_data_preprocessing_node_basic PASSED                                                                 [ 65%]
tests/test_nodes_phase3.py::test_data_preprocessing_node_structure PASSED                                                             [ 69%]
tests/test_nodes_phase3.py::test_data_preprocessing_node_parses_feature_sets PASSED                                                   [ 73%]
tests/test_nodes_phase3.py::test_data_preprocessing_node_normalizes_usage PASSED                                                      [ 76%]
tests/test_nodes_phase3.py::test_data_preprocessing_node_builds_graphs PASSED                                                         [ 80%]
tests/test_nodes_phase3.py::test_data_preprocessing_node_creates_derived_features PASSED                                              [ 84%]
tests/test_nodes_phase3.py::test_data_preprocessing_node_data_quality_report PASSED                                                   [ 88%]
tests/test_nodes_phase3.py::test_data_preprocessing_node_requires_raw_data PASSED                                                     [ 92%]
tests/test_nodes_phase3.py::test_data_preprocessing_node_preserves_errors PASSED                                                      [ 96%]
tests/test_nodes_phase3.py::test_data_preprocessing_node_full_workflow PASSED                                                         [100%]

============================================================ 26 passed in 3.15s =============================================================
