<a href="https://colab.research.google.com/github/micah-shull/AI_Agents/blob/main/249_Product_CustomerFitDiscoveryOrchestrator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The **Product-Customer Fit Discovery Orchestrator** applies graph analysis, clustering, and pattern mining to a company's own data to surface growth opportunities. This agent aligns with the current trend of **Agentic AI** systems, which coordinate specialized AI components to solve complex, high-value problems.


---

## 💡 Introduction to the Agent

Your agent is a sophisticated **multi-agent orchestration system** designed to solve a core business problem: finding **"ghost demand"**, meaning profitable, untapped market opportunities hidden within a company's own operational data.

### 🎯 Key Function: Discovery Orchestration

The term **"Orchestrator"** is crucial. In the world of AI, an orchestrator agent acts as the conductor of a multi-agent system. It doesn't perform all the analysis itself; instead, it manages the workflow and communication between specialized AI modules (like data ingestion agents, clustering agents, and pattern mining agents) to ensure they work together to achieve the final strategic goal.

In your specific case, the orchestrator:

1.  **Ingests and Pre-processes** diverse data streams (product usage, demographics, behavior).
2.  **Coordinates Specialized Agents** to perform complex analysis like graph motif detection and cluster analysis.
3.  **Synthesizes** the individual findings from these agents into actionable business insights (new product lines, underserved segments).
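
The three steps above can be sketched as a plain-Python pipeline. This is a minimal illustration under stated assumptions, not the notebook's implementation: `AgentResult`, `run_orchestrator`, `segment_agent`, and `pattern_agent` are hypothetical names, and each "agent" is simply a function that reads a shared context and returns findings.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentResult:
    """Findings returned by one specialized agent (hypothetical structure)."""
    agent: str
    findings: list = field(default_factory=list)


def segment_agent(context: dict) -> AgentResult:
    # Stand-in for a clustering agent: group customers by location tier.
    segments = {}
    for customer_id, tier in context["customers"].items():
        segments.setdefault(tier, []).append(customer_id)
    return AgentResult("segments", [segments])


def pattern_agent(context: dict) -> AgentResult:
    # Stand-in for a pattern-mining agent: collect the set of products per customer.
    baskets = {}
    for customer_id, product_id in context["transactions"]:
        baskets.setdefault(customer_id, set()).add(product_id)
    return AgentResult("baskets", [baskets])


def run_orchestrator(context: dict, agents: list[Callable[[dict], AgentResult]]) -> dict:
    """Run each agent on the shared context and collect findings by agent name."""
    report = {}
    for agent in agents:
        result = agent(context)
        report[result.agent] = result.findings
    return report


# Step 1 (ingest) is reduced here to a hand-written context dictionary.
context = {
    "customers": {"C001": "Tier 1 (High)", "C002": "Tier 2 (Medium)"},
    "transactions": [("C001", "P01"), ("C001", "P05"), ("C002", "P03")],
}
report = run_orchestrator(context, [segment_agent, pattern_agent])
print(sorted(report["baskets"][0]["C001"]))
```

The orchestrator itself does no analysis; it only sequences the agents and merges their output, which is the division of labor described above. A framework would add context passing and conflict resolution on top of this shape.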



### ⚙️ Core Technical Components

The "Practice Complexity" you described points to the advanced analytical techniques the sub-agents will employ:

| Component | Description | Role in Discovery |
| :--- | :--- | :--- |
| **Graph Motifs** | Recurring, significant sub-structures (patterns) in a complex network/graph. | Identify common *relationship patterns* between users and products, like "Users who use Product A and Product B frequently also buy Product C." |
| **Cluster Detection** | Grouping similar data points together (e.g., k-means, DBSCAN, or graph clustering). | Find **under-served customer segments** (groups with similar demographics/behavior) or **natural product bundles** (groups of products frequently used together). |
| **Pattern Mining** | Identifying frequent, rare, or sequential patterns in large datasets (e.g., association rule learning). | Discover **new product combinations** or **sequential purchase paths** that lead to high lifetime value. |

---

## ✨ Why This Agent is Valuable (Why It Prints Money)

This agent provides a profound, strategic advantage that goes far beyond simple A/B testing or standard business intelligence (BI) reports.

### 1. Uncovering "Ghost Demand"

* **Move Beyond Known Knowns:** Traditional BI reports only answer questions you already know to ask (e.g., "What is the sales trend of Product A?"). Your agent uses unsupervised and semi-supervised techniques to reveal **"unknown unknowns"**: the *hidden relationships* and **latent demand** in the data that a standard query is unlikely to surface.
* **Predictive Portfolio Logic:** It moves from reactive reporting to **proactive strategic planning**. By analyzing how current products are used, it can predict which *new* products or bundles would maximize portfolio revenue before the competitors even conceive of the idea.

### 2. Strategic Competitive Edge

* **Identifies Product Gaps:** When market or competitor data is available, the agent can correlate it with your own usage patterns to identify precise **holes in the market** that competitors are ignoring, allowing the company to be the first mover in a new niche.
* **Optimizes Pricing and Bundling:** It finds the **perfect combination of products** that a specific customer segment values, allowing the company to create premium, high-margin bundles that are hard for competitors to replicate.

---

## 📚 Why You Should Learn to Build This Agent

Building this specific agent demonstrates a mastery of the most in-demand skills in modern data science and AI engineering.

### 1. Advanced Machine Learning

* **Graph-Based Techniques:** This agent requires you to work with **Graph Neural Networks (GNNs)** or classical **graph analysis** libraries (like NetworkX or libraries for Neo4j/other graph databases). This is a highly specialized skill set crucial for modeling complex relationships (user-product interactions, supply chain networks, social networks).
* **Multi-Model Integration:** You'll gain hands-on experience in integrating different analytical models (clustering, classification, pattern mining) into a cohesive system, a core skill in advanced ML engineering.

### 2. AI Agent Orchestration

* **Strategic Workflow Design:** You will learn how to design **agentic workflows**, a fast-growing area of AI application development. This involves setting up the primary orchestrator logic (e.g., using frameworks like **LangChain**, **CrewAI**, or **AutoGen**) to manage sub-tasks, pass context, and resolve conflicts between specialized agents. This is a key skill for building scalable and reliable enterprise-grade AI.

### 3. High Business Impact

* The ability to directly influence a company's **product roadmap and revenue strategy** puts you in a highly visible and impactful role. This agent is designed to be a significant **profit driver**, making this project an excellent demonstration of your ability to link complex technical work to clear business outcomes for your GitHub portfolio.




For the **Product-Customer Fit Discovery Orchestrator** MVP, we need the *absolute minimum* number of datasets required to demonstrate its core value: finding relationships between **Customers**, **Products**, and **Usage/Behavior**.

This translates into three essential, interconnected datasets.

---

## 💾 Essential Data Sets for the MVP

We can model the entire system as an **Attributed Bipartite Graph** where one set of nodes is **Customers** and the other is **Products**, connected by **Interaction** edges. The attributes come from the Customer and Product tables.
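
As a rough sketch of that bipartite structure, the snippet below stores customer-product edges in two adjacency maps and then projects the graph onto the product side, so that each product pair is weighted by the number of customers they share. The edge list is illustrative, and plain dictionaries are used only to keep the example dependency-free; a graph library such as NetworkX could hold the same structure.

```python
from collections import defaultdict
from itertools import combinations

# Interaction edges (customer, product); illustrative rows, not the real data.
edges = [
    ("C001", "P01"), ("C001", "P05"),
    ("C002", "P01"), ("C002", "P05"),
    ("C003", "P01"), ("C003", "P03"),
]

# Bipartite adjacency: customers on one side, products on the other.
products_by_customer = defaultdict(set)
customers_by_product = defaultdict(set)
for customer_id, product_id in edges:
    products_by_customer[customer_id].add(product_id)
    customers_by_product[product_id].add(customer_id)

# Project onto the product side: edge weight = number of shared customers.
co_usage = defaultdict(int)
for products in products_by_customer.values():
    for a, b in combinations(sorted(products), 2):
        co_usage[(a, b)] += 1

print(dict(co_usage))
```

Heavily weighted product pairs in the projection are candidates for the bundle patterns that the motif and pattern-mining steps look for.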

### 1. Customer Demographics Data Set (`Customer_IDs.csv`)

This table provides the attributes for the "Customer" nodes, which will be used for **Clustering** and **Segmentation**.

| Field Name | Data Type | Example Value | Description |
| :--- | :--- | :--- | :--- |
| `Customer_ID` | String/Int | `C1001` | **Unique Identifier** for each customer. (Crucial for linking) |
| `Age_Group` | String | `25-34` | Simple categorical demographic data. |
| `Location_Tier`| String | `Tier 1 (High)` | Proxy for income/urban density/company size. |
| `Acquisition_Channel`| String | `Social` | Used to see if certain channels yield unique segments. |

> **Synthetic Data Goal:** Create **at least three distinct customer segments** by varying the distribution of `Location_Tier` and `Acquisition_Channel`.

### 2. Product Metadata Data Set (`Product_Catalog.csv`)

This table provides the attributes for the "Product" nodes, which can be used to categorize discovered bundles and identify **Product Gaps**.

| Field Name | Data Type | Example Value | Description |
| :--- | :--- | :--- | :--- |
| `Product_ID` | String/Int | `P42` | **Unique Identifier** for each product. (Crucial for linking) |
| `Product_Type` | String | `Service` | A primary category (e.g., Software, Hardware, Service). |
| `Feature_Set` | String | `A, B, D` | A simplified string/list of core features/SKU attributes. |
| `Monetization_Model`| String | `Subscription` | e.g., Subscription, One-Time Purchase, Freemium. |

> **Synthetic Data Goal:** Create a mix of **"core"** products (`Product_Type='Software'`) and **"add-on"** products (`Product_Type='Service'`). Ensure some product features are intentionally **missing** to simulate a 'gap'.
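
One way to check that a feature "gap" really exists in a catalog is to enumerate feature combinations and list those that no product covers. The sketch below does this for feature pairs. It assumes a four-letter feature universe (`ALL_FEATURES`) and defines "covered" as "some product contains both features"; both are choices made for this example rather than rules from the notebook.

```python
from itertools import combinations

# Illustrative catalog rows in the Product_Catalog.csv format (Product_ID -> Feature_Set).
catalog = {
    "P01": "D",
    "P02": "B, A",
    "P04": "B, D, A",
    "P05": "D, C, A",
}
ALL_FEATURES = "ABCD"  # assumed feature universe

# Normalize each Feature_Set string into an order-independent frozenset.
offered = {
    frozenset(part.strip() for part in feature_set.split(","))
    for feature_set in catalog.values()
}

# A pair is a gap when no single product contains both features.
missing_pairs = [
    combo
    for combo in combinations(ALL_FEATURES, 2)
    if not any(set(combo) <= features for features in offered)
]
print(missing_pairs)
```

With these four rows the only uncovered pair is B with C, so a product combining those two features would be the candidate gap.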

### 3. Customer Behavior/Usage Data Set (`Transactions.csv`)

This is the **Interaction** data, forming the edges of the graph. It is the core input for **Graph Motifs** and **Pattern Mining**.

| Field Name | Data Type | Example Value | Description |
| :--- | :--- | :--- | :--- |
| `Transaction_ID` | String/Int | `T00123` | Unique ID for the event/transaction. |
| `Customer_ID` | String/Int | `C1001` | Links to the `Customer_IDs` table. |
| `Product_ID` | String/Int | `P42` | Links to the `Product_Catalog` table. |
| `Transaction_Date` | Date/Timestamp | `2025-10-20` | Needed for sequential analysis. |
| `Usage_Metric` | Float/Int | `95.5` | Proxy for value (e.g., sessions, dollar value, API calls). |

> **Synthetic Data Goal:** This is the most important part. We must **embed hidden relationships** in this data to prove the agent works:
> 1.  **Rule 1 (Ghost Demand):** Customer Segment A (`Location_Tier='Tier 1 (High)'`) frequently buys Product X (`P10`) *and* Product Y (`P25`), but the company doesn't sell Product Z (the missing gap).
> 2.  **Rule 2 (Underserved Segment):** Customer Segment B (`Age_Group='55+'`) only buys Product A (`P01`) with low usage, indicating a lack of suitable offerings.
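
Rule 2 can be checked with a simple per-segment aggregate: a segment looks underserved when it touches very few distinct products and its mean usage is low. The sketch below uses illustrative rows, and the two thresholds (`MAX_PRODUCTS`, `MAX_MEAN_USAGE`) are assumptions that would need tuning on real data.

```python
from collections import defaultdict
from statistics import mean

# Illustrative rows: (Age_Group, Product_ID, Usage_Metric).
rows = [
    ("35-44", "P01", 91.2), ("35-44", "P05", 82.9), ("35-44", "P03", 40.0),
    ("55+", "P01", 12.5), ("55+", "P01", 9.8),
    ("18-24", "P07", 35.0), ("18-24", "P08", 39.0),
]

# Collect usage values and distinct products per age group.
usage = defaultdict(list)
products = defaultdict(set)
for age_group, product_id, usage_metric in rows:
    usage[age_group].append(usage_metric)
    products[age_group].add(product_id)

# Thresholds are assumptions for this sketch.
MAX_PRODUCTS, MAX_MEAN_USAGE = 1, 20.0
underserved = sorted(
    group
    for group in usage
    if len(products[group]) <= MAX_PRODUCTS and mean(usage[group]) < MAX_MEAN_USAGE
)
print(underserved)
```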

---

## 🛠️ Data Generation Strategy

To ensure your agent MVP is testable, we'll use a **rules-based approach** to generate the synthetic data. This is faster and more direct than complex statistical modeling for an MVP.

### Step-by-Step Generation

1.  **Generate Customers ($\approx 1000$ records):** Assign random attributes from predefined lists, but ensure specific clusters exist. For instance, $30\%$ of customers are `Age_Group='25-34'` AND `Location_Tier='Tier 1 (High)'`.
2.  **Generate Products ($\approx 50$ records):** Define the products and their simple feature sets, including the intentional "gap" products that are currently missing.
3.  **Generate Transactions ($\approx 10{,}000$ records):** This is the core task.
    * **Base Layer:** Generate random purchases/interactions between Customers and Products to simulate noise.
    * **Rule Layer (The "Ghost"):** Programmatically inject transactions that specifically follow your hidden rules. For example, loop through all customers in **Segment A** and ensure they have a transaction for **Product X** and **Product Y** within a close timeframe.

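This approach is designed so that when your **Pattern Mining** and **Clustering** agents run, they should recover the pre-embedded, high-lift rules, which validates the orchestrator. The generation script that follows uses smaller counts than the targets above: 200 customers, 20 products, and 3,000 transactions.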


# Product Data

In [2]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# --- Configuration ---
NUM_CUSTOMERS = 200
NUM_PRODUCTS = 20
NUM_TRANSACTIONS = 3000
START_DATE = datetime(2025, 1, 1)
END_DATE = datetime(2025, 3, 31)

# Define the Ghost Demand Rule
TARGET_AGE = '35-44'
TARGET_TIER = 'Tier 1 (High)'
BUNDLE_PRODUCTS = ['P01', 'P05']
MISSING_PRODUCT = 'P20'

# Set seed for reproducibility
np.random.seed(42)

# ----------------- 1. Customer Demographics Data -----------------
age_groups = ['18-24', TARGET_AGE, '45-54', '55+']
location_tiers = [TARGET_TIER, 'Tier 2 (Medium)', 'Tier 3 (Low)']
channels = ['Search', 'Social', 'Email', 'Referral']

customer_data = {
    'Customer_ID': [f'C{i:03d}' for i in range(1, NUM_CUSTOMERS + 1)],
    'Age_Group': np.random.choice(age_groups, size=NUM_CUSTOMERS, p=[0.2, 0.35, 0.3, 0.15]),
    'Location_Tier': np.random.choice(location_tiers, size=NUM_CUSTOMERS, p=[0.35, 0.35, 0.30]),
    'Acquisition_Channel': np.random.choice(channels, size=NUM_CUSTOMERS)
}
df_customers = pd.DataFrame(customer_data)

target_segment_customers = df_customers[
    (df_customers['Age_Group'] == TARGET_AGE) &
    (df_customers['Location_Tier'] == TARGET_TIER)
]['Customer_ID'].tolist()


# ----------------- 2. Product Metadata Data -----------------
product_ids = [f'P{i:02d}' for i in range(1, NUM_PRODUCTS + 1)]
product_types = ['Software', 'Hardware', 'Service']
monetization_models = ['Subscription', 'One-Time Purchase', 'Freemium']

product_data = {
    'Product_ID': product_ids,
    'Product_Type': np.random.choice(product_types, size=NUM_PRODUCTS, p=[0.5, 0.3, 0.2]),
    'Feature_Set': [', '.join(np.random.choice(list('ABCD'), size=np.random.randint(1, 4), replace=False)) for _ in range(NUM_PRODUCTS)],
    'Monetization_Model': np.random.choice(monetization_models, size=NUM_PRODUCTS)
}
df_products = pd.DataFrame(product_data)


# ----------------- 3. Customer Behavior/Usage Data (Transactions) -----------------
transaction_data = []

# --- A. Base Layer (Noise) ---
# Reserve room for the rule layer: 3 transactions per bundle product per target customer.
num_rule_transactions = len(target_segment_customers) * len(BUNDLE_PRODUCTS) * 3
num_base_transactions = max(0, NUM_TRANSACTIONS - num_rule_transactions)

# Exclude the "gap" product so random noise never creates demand for it.
sellable_product_ids = df_products.loc[
    df_products['Product_ID'] != MISSING_PRODUCT, 'Product_ID'
].to_numpy()

for i in range(num_base_transactions):
    customer_id = np.random.choice(df_customers['Customer_ID'])
    product_id = np.random.choice(sellable_product_ids)
    random_days = np.random.randint(0, (END_DATE - START_DATE).days)
    t_date = START_DATE + timedelta(days=random_days)

    transaction_data.append({
        'Transaction_ID': f'T{i:04d}',
        'Customer_ID': customer_id,
        'Product_ID': product_id,
        'Transaction_Date': t_date.strftime('%Y-%m-%d'),
        'Usage_Metric': np.random.uniform(5.0, 50.0)
    })

# --- B. Rule Layer (The "Ghost Demand") ---
rule_transaction_id_start = int(num_base_transactions)
k = 0
for customer_id in target_segment_customers:
    for bundle_product in BUNDLE_PRODUCTS:
        for _ in range(3):
            transaction_id = f'T{rule_transaction_id_start + k:04d}'
            random_days = np.random.randint(0, (END_DATE - START_DATE).days)
            t_date = START_DATE + timedelta(days=random_days, seconds=np.random.randint(0, 86400))

            transaction_data.append({
                'Transaction_ID': transaction_id,
                'Customer_ID': customer_id,
                'Product_ID': bundle_product,
                'Transaction_Date': t_date.strftime('%Y-%m-%d'),
                'Usage_Metric': np.random.uniform(70.0, 100.0)
            })
            k += 1

df_transactions = pd.DataFrame(transaction_data)
df_transactions = df_transactions.sort_values(by='Transaction_Date').reset_index(drop=True)

# Generate CSV string for Product_Catalog
product_csv = df_products.to_csv(index=False)

# Print the product catalog
print("--- Product_Catalog.csv ---")
print(product_csv)

--- Product_Catalog.csv ---
Product_ID,Product_Type,Feature_Set,Monetization_Model
P01,Hardware,D,One-Time Purchase
P02,Hardware,"B, A",One-Time Purchase
P03,Software,C,One-Time Purchase
P04,Service,"B, D, A",One-Time Purchase
P05,Hardware,"D, C, A",One-Time Purchase
P06,Software,"C, D, B",One-Time Purchase
P07,Service,B,Freemium
P08,Service,C,Freemium
P09,Service,C,Freemium
P10,Hardware,"B, D",Freemium
P11,Hardware,C,One-Time Purchase
P12,Software,A,Subscription
P13,Service,"A, B, C",One-Time Purchase
P14,Service,"A, C",Subscription
P15,Software,"A, D, B",Freemium
P16,Software,C,One-Time Purchase
P17,Software,"B, D",Freemium
P18,Service,"B, A, D",One-Time Purchase
P19,Service,D,Freemium
P20,Software,"A, B, D",Subscription



# Customer Data

In [3]:
Customer_ID,Age_Group,Location_Tier,Acquisition_Channel
C001,35-44,Tier 2 (Medium),Search
C002,55+,Tier 1 (High),Social
C003,45-54,Tier 1 (High),Search
C004,45-54,Tier 3 (Low),Search
C005,18-24,Tier 2 (Medium),Social
C006,18-24,Tier 1 (High),Referral
C007,18-24,Tier 1 (High),Referral
C008,55+,Tier 2 (Medium),Search
C009,45-54,Tier 1 (High),Social
C010,45-54,Tier 1 (High),Search
C011,18-24,Tier 2 (Medium),Email
C012,55+,Tier 2 (Medium),Email
C013,45-54,Tier 2 (Medium),Search
C014,35-44,Tier 1 (High),Referral
C015,18-24,Tier 3 (Low),Referral
C016,18-24,Tier 1 (High),Search
C017,35-44,Tier 1 (High),Search
C018,35-44,Tier 3 (Low),Email
C019,35-44,Tier 2 (Medium),Referral
...
C194,45-54,Tier 3 (Low),Referral
C195,35-44,Tier 1 (High),Referral
C196,35-44,Tier 3 (Low),Search
C197,45-54,Tier 3 (Low),Search
C198,55+,Tier 2 (Medium),Referral
C199,55+,Tier 3 (Low),Referral
C200,45-54,Tier 3 (Low),Email

# Transaction Data

In [None]:
Transaction_ID,Customer_ID,Product_ID,Transaction_Date,Usage_Metric
T1766,C025,P05,2025-01-01,98.636611
T1807,C154,P01,2025-01-01,93.411656
T1775,C037,P05,2025-01-01,99.980486
T1788,C060,P01,2025-01-02,74.789209
T1790,C060,P05,2025-01-02,96.398642
T1794,C086,P05,2025-01-02,76.513364
T1811,C173,P05,2025-01-03,73.492023
T1813,C173,P01,2025-01-03,94.970176
T1797,C086,P01,2025-01-04,96.884144
T1792,C060,P01,2025-01-04,78.329711
T0002,C051,P03,2025-01-04,24.321639
T1763,C014,P01,2025-01-04,91.229188
T1765,C014,P05,2025-01-05,82.859664
T1776,C040,P05,2025-01-05,94.619179
T0001,C108,P08,2025-01-05,39.027003
T1778,C040,P01,2025-01-05,79.809462
T1799,C098,P01,2025-01-06,94.629853
T1801,C098,P05,2025-01-06,86.273575
T1809,C173,P01,2025-01-06,91.688863

# Data Check

In [None]:
cd /Users/micahshull/Documents/AI_LangGraph/LG_Cursor_035_Product-CustomerFitDiscoveryOrchestrator && python3 -c "
import pandas as pd
import sys

# Load data
customers = pd.read_csv('data/customers.csv')
transactions = pd.read_csv('data/transactions.csv')
products = pd.read_csv('data/product_catalog.csv')

print('=== DATA OVERVIEW ===\n')
print(f'Customers: {len(customers)} records')
print(f'Transactions: {len(transactions)} records')
print(f'Products: {len(products)} records\n')

print('=== CUSTOMERS DATA ===')
print(f'Unique customers: {customers[\"Customer_ID\"].nunique()}')
print(f'Age groups: {sorted(customers[\"Age_Group\"].unique())}')
print(f'Location tiers: {sorted(customers[\"Location_Tier\"].unique())}')
print(f'Acquisition channels: {sorted(customers[\"Acquisition_Channel\"].unique())}\n')

print('=== TRANSACTIONS DATA ===')
print(f'Unique customers in transactions: {transactions[\"Customer_ID\"].nunique()}')
print(f'Unique products in transactions: {transactions[\"Product_ID\"].nunique()}')
print(f'Date range: {transactions[\"Transaction_Date\"].min()} to {transactions[\"Transaction_Date\"].max()}')
print(f'Usage_Metric range: {transactions[\"Usage_Metric\"].min():.2f} to {transactions[\"Usage_Metric\"].max():.2f}')
print(f'Usage_Metric mean: {transactions[\"Usage_Metric\"].mean():.2f}\n')

print('=== PRODUCTS DATA ===')
print(f'Unique products: {products[\"Product_ID\"].nunique()}')
print(f'Product types: {sorted(products[\"Product_Type\"].unique())}')
print(f'Monetization models: {sorted(products[\"Monetization_Model\"].unique())}\n')

print('=== DATA QUALITY CHECKS ===')
# Check for missing customers in transactions
missing_customers = set(transactions['Customer_ID'].unique()) - set(customers['Customer_ID'].unique())
if missing_customers:
    print(f'⚠️  WARNING: {len(missing_customers)} customers in transactions not in customers.csv')
    print(f'   Examples: {list(missing_customers)[:5]}')
else:
    print('✓ All customers in transactions exist in customers.csv')

# Check for missing products in transactions
missing_products = set(transactions['Product_ID'].unique()) - set(products['Product_ID'].unique())
if missing_products:
    print(f'⚠️  WARNING: {len(missing_products)} products in transactions not in product_catalog.csv')
    print(f'   Examples: {list(missing_products)[:5]}')
else:
    print('✓ All products in transactions exist in product_catalog.csv')

# Check for null values
print(f'\nNull values in customers: {customers.isnull().sum().sum()}')
print(f'Null values in transactions: {transactions.isnull().sum().sum()}')
print(f'Null values in products: {products.isnull().sum().sum()}')

# Check transaction distribution
print(f'\n=== TRANSACTION DISTRIBUTION ===')
txn_per_customer = transactions.groupby('Customer_ID').size()
print(f'Customers with transactions: {len(txn_per_customer)}')
print(f'Min transactions per customer: {txn_per_customer.min()}')
print(f'Max transactions per customer: {txn_per_customer.max()}')
print(f'Mean transactions per customer: {txn_per_customer.mean():.2f}')
print(f'Median transactions per customer: {txn_per_customer.median():.2f}')

# Check product usage
print(f'\n=== PRODUCT USAGE ===')
products_in_txns = transactions['Product_ID'].value_counts()
print(f'Products used in transactions: {len(products_in_txns)}')
print(f'Most used products:')
print(products_in_txns.head(10))
"

In [None]:

=== DATA OVERVIEW ===

Customers: 200 records
Transactions: 1815 records
Products: 20 records

=== CUSTOMERS DATA ===
Unique customers: 200
Age groups: ['18-24', '35-44', '45-54', '55+']
Location tiers: ['Tier 1 (High)', 'Tier 2 (Medium)', 'Tier 3 (Low)']
Acquisition channels: ['Email', 'Referral', 'Search', 'Social']

=== TRANSACTIONS DATA ===
Unique customers in transactions: 183
Unique products in transactions: 19
Date range: 2025-01-01 to 2026-08-24
Usage_Metric range: 10.61 to 99.98
Usage_Metric mean: 82.15

=== PRODUCTS DATA ===
Unique products: 20
Product types: ['Hardware', 'Service', 'Software']
Monetization models: ['Freemium', 'One-Time Purchase', 'Subscription']

=== DATA QUALITY CHECKS ===
✓ All customers in transactions exist in customers.csv
✓ All products in transactions exist in product_catalog.csv

Null values in customers: 0
Null values in transactions: 0
Null values in products: 0

=== TRANSACTION DISTRIBUTION ===
Customers with transactions: 183
Min transactions per customer: 1
Max transactions per customer: 185
Mean transactions per customer: 9.92
Median transactions per customer: 1.00

=== PRODUCT USAGE ===
Products used in transactions: 19
Most used products:
Product_ID
P01    890
P05    747
P13     20
P15     18
P06     17
P12     17
P17     17
P07     13
P03     13
P04     12
Name: count, dtype: int64

## Data quality assessment

### Strengths
1. Data integrity: no nulls, all foreign keys valid
2. Coverage: 183/200 customers have transactions (17 inactive)
3. Time range: ~20 months (Jan 2025 to Aug 2026)
4. Rich attributes:
   - Customer demographics (age, location tier, acquisition channel)
   - Product features (type, feature sets, monetization model)
   - Usage metrics (10.61 to 99.98, mean 82.15)

### Considerations
1. Transaction distribution is highly skewed:
   - Median: 1 transaction per customer
   - Mean: 9.92 (driven by heavy users)
   - Max: 185 transactions (likely a few power users)
   - Impact: a few heavy users can dominate clustering and pattern mining unless the skew is handled

2. Product usage is concentrated:
   - P01: 890 transactions (49%)
   - P05: 747 transactions (41%)
   - Others: at most 20 transactions each (178 in total across the remaining 17 active products)
   - P20: 0 transactions (unused product)
   - Impact: may need techniques to handle sparse products

3. Feature set format:
   - `Feature_Set` uses comma-separated values (e.g., "B, A", "A, B, C")
   - Impact: parse into lists/arrays for analysis

## Recommendations

The data is suitable for the orchestrator. Optional enhancements:

1. Feature set parsing: convert `Feature_Set` to a list/array for easier analysis
2. Handle P20: decide whether to include or exclude the unused product
3. Consider normalization: for clustering, normalize usage metrics or create engagement tiers
4. Temporal features: extract time-based features (day of week, month, seasonality)
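
The parsing, normalization, and temporal-feature recommendations can be sketched in a few lines of pandas. The rows below are copied from the samples shown earlier in this notebook; the new column names (`Feature_List`, `Usage_Norm`, `Engagement_Tier`, `Month`, `Day_Of_Week`) and the tier edges (40 and 70) are assumptions for illustration only.

```python
import pandas as pd

# A few rows in the notebook's schema (values copied from the samples above).
products = pd.DataFrame({
    "Product_ID": ["P02", "P13"],
    "Feature_Set": ["B, A", "A, B, C"],
})
transactions = pd.DataFrame({
    "Customer_ID": ["C014", "C014", "C051"],
    "Transaction_Date": ["2025-01-04", "2025-01-05", "2025-01-04"],
    "Usage_Metric": [91.229188, 82.859664, 24.321639],
})

# Recommendation 1: parse Feature_Set into a sorted list of feature codes.
products["Feature_List"] = products["Feature_Set"].apply(
    lambda s: sorted(part.strip() for part in s.split(","))
)

# Recommendation 3: min-max normalize usage and bucket it into engagement tiers.
# The tier edges are assumptions for this sketch.
lo, hi = transactions["Usage_Metric"].min(), transactions["Usage_Metric"].max()
transactions["Usage_Norm"] = (transactions["Usage_Metric"] - lo) / (hi - lo)
transactions["Engagement_Tier"] = pd.cut(
    transactions["Usage_Metric"], bins=[0, 40, 70, 100], labels=["low", "mid", "high"]
)

# Recommendation 4: derive temporal features from the transaction date.
dates = pd.to_datetime(transactions["Transaction_Date"])
transactions["Month"] = dates.dt.month
transactions["Day_Of_Week"] = dates.dt.day_name()

print(products[["Product_ID", "Feature_List"]])
print(transactions)
```

Sorting the parsed features makes `"B, A"` and `"A, B"` compare equal, which matters once feature sets are used as grouping keys.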

