
# MVP2 Data Suggestions — Sales Enablement Orchestrator

This document suggests **additional and enhanced data** for your Sales Enablement Orchestrator MVP2. The goal is to make the agent more realistic and architecturally interesting **without** growing the dataset so much that it distracts from learning orchestration.

---

## What You Already Have (MVP1)

| Dataset | Records | Purpose |
|---------|---------|---------|
| **leads.json** | 20 | Prioritization, personalization, objection/needs inference |
| **sales_reps.json** | 4 | Nudging, load balancing, performance context |
| **interactions.json** | 12 | Follow-up orchestration, objection/sentiment hooks |
| **deals.json** | 15 | Forecasting, risk, executive reporting |
| **signals.json** | 20 | Orchestrator “brain fuel,” recommended actions, urgency |

Your schemas are solid and aligned with the agent_data_gen_plan. The main MVP2 opportunities are: **(1) a few small reference tables**, **(2) linking and consistency**, and **(3) one config file** so the orchestrator’s logic stays explicit and demo-friendly.

---

## Suggested Additions (by impact vs effort)

### 1. **content_assets.json** (NEW — high value, low effort)

**Why:** The orchestrator recommends “send pricing comparison,” “send case study,” “implementation doc.” Giving it a tiny catalog of assets makes those recommendations **concrete** and testable.

**Suggested schema (5–10 rows):**

```json
{
  "asset_id": "A-001",
  "name": "Pricing comparison one-pager",
  "type": "one_pager",
  "use_cases": ["pricing pushback", "competitor comparison"],
  "stage_fit": ["Qualification", "Proposal", "Negotiation"]
}
```

**Fields:** `asset_id`, `name`, `type` (one_pager, case_study, demo_video, implementation_guide, security_doc), `use_cases`, `stage_fit`.

**Volume:** 5–10 assets. Keep names and use_cases aligned with the `recommended_action` and `key_topics` you already use in signals/interactions.

**Orchestrator use:** Map `signals.recommended_action` or interaction `key_topics` → `content_assets`; e.g. “send pricing comparison” → filter assets where `use_cases` includes “pricing pushback.”
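
The lookup described above can be sketched as a plain filter. This is a minimal illustration, assuming the assets are loaded as a list of dicts shaped like the suggested schema. The function name `find_assets` and the two inline sample rows are hypothetical.

```python
# Sample rows shaped like the suggested schema (illustrative, not the full catalog).
content_assets = [
    {"asset_id": "A-001", "name": "Pricing comparison one-pager", "type": "one_pager",
     "use_cases": ["pricing pushback", "competitor comparison"],
     "stage_fit": ["Qualification", "Proposal", "Negotiation"]},
    {"asset_id": "A-006", "name": "Product overview", "type": "one_pager",
     "use_cases": ["business overview", "discovery"],
     "stage_fit": ["Discovery", "Qualification"]},
]


def find_assets(assets, use_case, stage=None):
    """Return assets whose use_cases include use_case, optionally limited to a stage."""
    matches = [a for a in assets if use_case in a["use_cases"]]
    if stage is not None:
        matches = [a for a in matches if stage in a["stage_fit"]]
    return matches


picked = find_assets(content_assets, "pricing pushback", stage="Proposal")
print([a["asset_id"] for a in picked])  # ['A-001']
```

Exact string matching keeps the behavior deterministic, which is why the `use_cases` values need to stay aligned with the action and topic strings used elsewhere.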

---

### 2. **objections.json** (NEW — high value, low effort)

**Why:** The intro doc stresses **objection detection** and tailoring messaging. A small reference table turns “objection → recommended response” into deterministic, demo-ready logic.

**Suggested schema (≈10 rows):**

```json
{
  "objection_id": "OBJ-001",
  "label": "Pricing too high",
  "theme": "pricing",
  "suggested_asset_types": ["one_pager", "case_study"],
  "suggested_actions": ["send pricing justification", "send ROI case study"]
}
```

**Volume:** 8–12 objections. Themes could match your existing `key_topics` and `risk_flags` (e.g. pricing, timeline, budget, competition, security).

**Orchestrator use:** From `interactions.key_topics` or `deals.risk_flags`, map to `objections` → get `suggested_asset_types` / `suggested_actions` for next-step logic.
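
One way to sketch this mapping is shown below. It assumes a naive rule: an objection matches when its `theme` appears as a substring of a topic (so “pricing concerns” matches the `pricing` theme). That matching rule and the helper names `match_objections` and `suggested_actions` are assumptions for illustration, not something the plan prescribes.

```python
# Two sample rows shaped like the suggested schema (illustrative subset).
objections = [
    {"objection_id": "OBJ-001", "label": "Pricing too high", "theme": "pricing",
     "suggested_asset_types": ["one_pager", "case_study"],
     "suggested_actions": ["send pricing justification", "send ROI case study"]},
    {"objection_id": "OBJ-005", "label": "Security and compliance questions",
     "theme": "security",
     "suggested_asset_types": ["security_doc"],
     "suggested_actions": ["send security documentation", "arrange security review"]},
]


def match_objections(topics, objections):
    """Return objections whose theme appears (case-insensitively) in any topic."""
    lowered = [t.lower() for t in topics]
    return [o for o in objections if any(o["theme"] in t for t in lowered)]


def suggested_actions(topics, objections):
    """Collect suggested actions for all matched objections, de-duplicated in order."""
    actions = []
    for o in match_objections(topics, objections):
        for action in o["suggested_actions"]:
            if action not in actions:
                actions.append(action)
    return actions


actions = suggested_actions(["pricing concerns", "budget approval"], objections)
print(actions)  # ['send pricing justification', 'send ROI case study']
```

If two objections share a theme, both will match under this rule, so the caller would need its own tie-breaking.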

---

### 3. **deal_id on interactions** (ENHANCE existing)

**Why:** Today, interactions point to `lead_id` and `rep_id` but not to a specific deal. Many leads map 1:1 to a deal, but the natural mental model is “this touch happened in the context of this deal.” An optional `deal_id` makes “last activity per deal” and “deal timeline” queries straightforward without changing the record count.

**Change:** Add optional `"deal_id": "D-001"` to each interaction row where that interaction clearly belongs to a deal. Leave it null or omitted for early-stage touches before a deal exists.

**Orchestrator use:** A “stalled deal” is a deal whose `days_in_stage` exceeds the stall threshold and that has no recent interaction linked via `deal_id` (or, where `deal_id` is missing, via `lead_id` within a time window).
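
A rough sketch of stall detection using `deal_id` follows. The sample rows and their `days_in_stage` values are made up, and the `quiet_days` parameter (how long without a touch counts as “no recent interaction”) is an assumption. Timestamps are assumed to be ISO 8601 with a trailing `Z`, as in interactions.json.

```python
from datetime import datetime, timedelta, timezone

# Illustrative rows only; the days_in_stage values are invented for this sketch.
deals = [
    {"deal_id": "D-001", "days_in_stage": 5},
    {"deal_id": "D-004", "days_in_stage": 25},
    {"deal_id": "D-007", "days_in_stage": 30},
]
interactions = [
    {"deal_id": "D-001", "datetime": "2025-11-24T16:30:00Z"},
    {"deal_id": "D-004", "datetime": "2025-11-27T09:30:00Z"},
    {"deal_id": "D-007", "datetime": "2025-11-19T14:30:00Z"},
]


def parse_ts(value):
    """Parse an ISO 8601 timestamp ending in 'Z' into a timezone-aware datetime."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def last_touch_by_deal(interactions):
    """Map each deal_id to the timestamp of its most recent linked interaction."""
    latest = {}
    for i in interactions:
        deal_id = i.get("deal_id")
        if not deal_id:
            continue  # early-stage touch with no deal yet
        ts = parse_ts(i["datetime"])
        if deal_id not in latest or ts > latest[deal_id]:
            latest[deal_id] = ts
    return latest


def stalled_deals(deals, interactions, now, max_days_in_stage=21, quiet_days=7):
    """Return deal_ids that are long in stage and have no touch within quiet_days."""
    latest = last_touch_by_deal(interactions)
    cutoff = now - timedelta(days=quiet_days)
    stalled = []
    for d in deals:
        if d["days_in_stage"] < max_days_in_stage:
            continue
        last = latest.get(d["deal_id"])
        if last is None or last < cutoff:
            stalled.append(d["deal_id"])
    return stalled


now = datetime(2025, 11, 28, tzinfo=timezone.utc)
stalled = stalled_deals(deals, interactions, now)
print(stalled)  # ['D-007']
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the check repeatable against a fixed demo dataset.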

---

### 4. **stage_actions.json** (NEW — medium value, low effort)

**Why:** “What should we do next for this deal?” is a core orchestrator question. A small playbook of stage → suggested actions gives the agent a clear, configurable rule set.

**Suggested schema (one row per stage, or per stage + scenario):**

```json
{
  "stage": "Proposal",
  "default_next_actions": ["follow up on proposal", "share case study", "offer exec briefing"],
  "if_stalled_days": 14,
  "stalled_actions": ["escalate to manager", "send re-engagement sequence"]
}
```

**Volume:** 4–6 rows, one per stage (Discovery, Qualification, Proposal, Negotiation, plus optional Closed Won/Lost for handoff messaging); a few more if you add per-scenario rows.

**Orchestrator use:** Given `deal.stage` and `deal.days_in_stage`, choose from `default_next_actions` or `stalled_actions` and combine with `content_assets` / `objections` when relevant.
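
The selection rule above might look like the following sketch. It assumes a deal counts as stalled when `days_in_stage` is greater than or equal to the stage's `if_stalled_days`, and that a `null` value means the stage never stalls. The helper name `next_actions` is hypothetical.

```python
# Two sample playbook rows shaped like the suggested schema.
stage_actions = [
    {"stage": "Proposal",
     "default_next_actions": ["follow up on proposal", "share case study",
                              "offer exec briefing"],
     "if_stalled_days": 14,
     "stalled_actions": ["escalate to manager", "send re-engagement sequence"]},
    {"stage": "Closed Won",
     "default_next_actions": ["handoff to onboarding"],
     "if_stalled_days": None,  # JSON null: closed stages never count as stalled
     "stalled_actions": []},
]


def next_actions(deal, playbook):
    """Pick stalled or default actions for a deal based on its stage row."""
    by_stage = {row["stage"]: row for row in playbook}
    row = by_stage.get(deal["stage"])
    if row is None:
        return []  # no playbook entry for this stage
    limit = row.get("if_stalled_days")
    if limit is not None and deal["days_in_stage"] >= limit:
        return row["stalled_actions"]
    return row["default_next_actions"]


print(next_actions({"stage": "Proposal", "days_in_stage": 20}, stage_actions))
print(next_actions({"stage": "Proposal", "days_in_stage": 3}, stage_actions))
```

The per-stage `if_stalled_days` (14 in the example) is separate from `stalled_deal_days_in_stage` in the thresholds example (21); which one takes precedence is a design decision left open here.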

---

### 5. **thresholds.json or config.json** (NEW — high value, minimal effort)

**Why:** The plan calls out “CEO-style insights” and “decision thresholds.” A single small config file makes “what counts as stalled?” or “what counts as high-intent?” explicit and editable, which is great for demos and for learning orchestration logic.

**Suggested content (example):**

```json
{
  "stalled_deal_days_in_stage": 21,
  "high_priority_intent_min": 0.80,
  "at_risk_engagement_max": 0.50,
  "urgent_deal_days_to_close_max": 14
}
```

**Orchestrator use:** Use these values when ranking leads, flagging at-risk deals, and deciding urgency. This keeps decision thresholds in data rather than hard-coded in the orchestrator.
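
As a sketch of how such a config could drive flags, the snippet below applies the four example thresholds to a single record. The field names `intent_score` and `days_to_close` are assumptions (only `engagement_score` and `days_in_stage` appear elsewhere in this plan), and the inclusive comparisons are one possible reading of “min” and “max”.

```python
import json

# Same values as the example config; parsed from a string to keep this self-contained.
thresholds = json.loads("""
{
  "stalled_deal_days_in_stage": 21,
  "high_priority_intent_min": 0.80,
  "at_risk_engagement_max": 0.50,
  "urgent_deal_days_to_close_max": 14
}
""")


def classify(record, thresholds):
    """Return a list of flags for a merged lead/deal/signal record."""
    flags = []
    # intent_score is an assumed field name for the intent signal.
    if record.get("intent_score", 0.0) >= thresholds["high_priority_intent_min"]:
        flags.append("high_priority")
    if record.get("engagement_score", 1.0) <= thresholds["at_risk_engagement_max"]:
        flags.append("at_risk")
    if record.get("days_in_stage", 0) >= thresholds["stalled_deal_days_in_stage"]:
        flags.append("stalled")
    # days_to_close is assumed to be derived from expected_close_date upstream.
    days = record.get("days_to_close")
    if days is not None and days <= thresholds["urgent_deal_days_to_close_max"]:
        flags.append("urgent")
    return flags


record = {"intent_score": 0.85, "engagement_score": 0.40,
          "days_in_stage": 10, "days_to_close": 30}
flags = classify(record, thresholds)
print(flags)  # ['high_priority', 'at_risk']
```

Editing a value in the config changes the flags without touching `classify`, which is the point of keeping the thresholds in data.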

---

### 6. **More interactions for a few key deals** (ENHANCE existing)

**Why:** You have 12 interactions across 20 leads; several deals (e.g. in Proposal/Negotiation) have zero or one interaction. Adding **5–8** more interactions, especially for 2–3 deals that are “hot” or “stalled,” makes follow-up and “next step” logic feel real.

**Targets:**

- 1–2 deals in Proposal/Negotiation with 2–3 touches each (e.g. proposal sent → follow-up call → pricing question).
- 1 deal that looks “stalled” (long `days_in_stage`, no recent touch).
- Optionally 1–2 early-stage leads (Discovery/Qualification) with a single call or email.

**Orchestrator use:** Richer “last contact” and “next step promised” behavior; more realistic nudge triggers.

---

### 7. **competitors.json** (NEW — optional, very small)

**Why:** Deals already have `competitors` (e.g. "VendorX", "VendorY"). A tiny reference table (3–5 rows) with “win theme” or “weakness” gives the orchestrator something to say when it recommends competitive battlecards or positioning.

**Suggested schema:**

```json
{
  "competitor_id": "VendorX",
  "display_name": "Vendor X",
  "common_weakness": "Implementation timeline",
  "win_theme": "Ease of deployment"
}
```

**Volume:** 3–5, matching the competitor strings you use in `deals.json`.

**Orchestrator use:** When `deal.competitors` is non-empty, look up each entry in `competitors` and suggest actions such as “highlight ease of deployment” or “share implementation comparison.”
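
A minimal sketch of that lookup, assuming `deal.competitors` is a list of strings that match `competitor_id` exactly. The function name `competitive_talking_points` and the output wording are illustrative only.

```python
# One sample reference row shaped like the suggested schema.
competitors = [
    {"competitor_id": "VendorX", "display_name": "Vendor X",
     "common_weakness": "Implementation timeline",
     "win_theme": "Ease of deployment"},
]


def competitive_talking_points(deal, competitors):
    """Build one talking point per known competitor listed on the deal."""
    by_id = {c["competitor_id"]: c for c in competitors}
    points = []
    for name in deal.get("competitors", []):
        ref = by_id.get(name)
        if ref is None:
            continue  # competitor string has no reference row; skip it
        points.append(f"vs {ref['display_name']}: highlight {ref['win_theme'].lower()}")
    return points


deal = {"deal_id": "D-003", "competitors": ["VendorX", "VendorQ"]}
points = competitive_talking_points(deal, competitors)
print(points)  # ['vs Vendor X: highlight ease of deployment']
```

Silently skipping unknown names is a choice made for the sketch; a validation pass could instead report competitor strings in `deals.json` that have no reference row.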

---

### 8. **Consistency and coverage (no new files)**

**Leads with deals but no interactions:**  
Consider giving every lead that has a deal at least one interaction (even a short email or call). That avoids “deal exists but we have no history” in demos.

**Signals ↔ deals:**  
Where a lead has a deal, ensure `signals.recommended_action` and `signals.urgency` are plausible for that deal’s `stage` and `risk_flags`. You don’t need to change schema—just a quick consistency pass so the orchestrator’s decisions look coherent.
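
The first coverage check above can be automated with a few lines. This sketch assumes each deal carries a `lead_id` and treats any interaction on that lead as history for the deal; the helper name `deals_without_history` and the sample rows are hypothetical.

```python
# Illustrative rows only.
deals = [
    {"deal_id": "D-001", "lead_id": "L-001"},
    {"deal_id": "D-005", "lead_id": "L-006"},
]
interactions = [
    {"interaction_id": "INT-001", "lead_id": "L-001"},
]


def deals_without_history(deals, interactions):
    """Return deal_ids whose lead has no recorded interaction at all."""
    touched = {i["lead_id"] for i in interactions}
    return sorted(d["deal_id"] for d in deals if d["lead_id"] not in touched)


missing = deals_without_history(deals, interactions)
print(missing)  # ['D-005']
```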

---

## What to Skip for MVP2 (keep focus on architecture)

- **Heavy CRM-style history:** Many more interactions or full email threads. Better to learn orchestration with the above additions than to maintain large timelines.
- **Full product catalog or pricing matrix:** A few content assets are enough; detailed product/SKU data can wait.
- **Territory/assignment rules:** Your existing `region` on leads and reps is enough for “assign by region” logic; a separate territories table doesn’t pay off yet.
- **User/tenant multi-tenancy:** Out of scope for MVP2.

---

## Recommended MVP2 build order

| Order | Item | Effort | Effect |
|-------|------|--------|--------|
| 1 | **thresholds.json** (or config) | ~5 min | Makes orchestration logic configurable and visible |
| 2 | **objections.json** | ~15 min | Connects topics/risk_flags → actions and assets |
| 3 | **content_assets.json** | ~15 min | Turns “send X” into concrete asset picks |
| 4 | **deal_id** on existing interactions | ~10 min | Enables deal-level timeline and stall detection |
| 5 | **stage_actions.json** | ~15 min | Stage-based “what to do next” playbook |
| 6 | **5–8 more interactions** | ~20 min | Richer follow-up and nudge behavior |
| 7 | **competitors.json** (optional) | ~5 min | Nice for competitive-deal recommendations |

---

## Summary

MVP2 stays small: **three small reference files** (content_assets, objections, stage_actions), **one config file** (thresholds), **optional competitors**, plus **linking (deal_id)** and **a few more interactions**. That’s enough to make prioritization, personalization, follow-up, and “what to do next” feel realistic while you focus on how the orchestrator composes agents and uses these tables—rather than on curating large datasets.




# objections.json

[
  {
    "objection_id": "OBJ-001",
    "label": "Pricing too high",
    "theme": "pricing",
    "suggested_asset_types": ["one_pager", "case_study"],
    "suggested_actions": ["send pricing justification", "send ROI case study", "send pricing comparison"]
  },
  {
    "objection_id": "OBJ-002",
    "label": "Budget not approved yet",
    "theme": "budget",
    "suggested_asset_types": ["case_study", "one_pager"],
    "suggested_actions": ["send ROI case study", "schedule exec briefing to align budget"]
  },
  {
    "objection_id": "OBJ-003",
    "label": "Implementation timeline concerns",
    "theme": "timeline",
    "suggested_asset_types": ["implementation_guide", "case_study"],
    "suggested_actions": ["send implementation guide", "share rollout case study"]
  },
  {
    "objection_id": "OBJ-004",
    "label": "Considering a competitor",
    "theme": "competition",
    "suggested_asset_types": ["one_pager", "case_study"],
    "suggested_actions": ["send competitive comparison", "schedule differentiation call"]
  },
  {
    "objection_id": "OBJ-005",
    "label": "Security and compliance questions",
    "theme": "security",
    "suggested_asset_types": ["security_doc"],
    "suggested_actions": ["send security documentation", "arrange security review"]
  },
  {
    "objection_id": "OBJ-006",
    "label": "Need more discount",
    "theme": "discount",
    "suggested_asset_types": ["one_pager", "case_study"],
    "suggested_actions": ["send pricing justification", "offer value-add instead of discount"]
  },
  {
    "objection_id": "OBJ-007",
    "label": "Not the right time / no decision",
    "theme": "timing",
    "suggested_asset_types": ["case_study"],
    "suggested_actions": ["light nurture sequence", "schedule follow-up in 30 days"]
  },
  {
    "objection_id": "OBJ-008",
    "label": "Legal or procurement review needed",
    "theme": "legal",
    "suggested_asset_types": ["security_doc", "one_pager"],
    "suggested_actions": ["send security docs", "send contract summary for legal"]
  },
  {
    "objection_id": "OBJ-009",
    "label": "Unclear ROI / need proof",
    "theme": "roi",
    "suggested_asset_types": ["case_study", "one_pager"],
    "suggested_actions": ["send ROI case study", "schedule reference customer call"]
  },
  {
    "objection_id": "OBJ-010",
    "label": "Budget uncertainty",
    "theme": "budget",
    "suggested_asset_types": ["case_study"],
    "suggested_actions": ["send flexible packaging options", "schedule discovery to refine scope"]
  }
]


# content_assets.json

[
  {
    "asset_id": "A-001",
    "name": "Pricing comparison one-pager",
    "type": "one_pager",
    "use_cases": ["pricing pushback", "competitor comparison", "pricing sensitivity"],
    "stage_fit": ["Qualification", "Proposal", "Negotiation"]
  },
  {
    "asset_id": "A-002",
    "name": "ROI and value justification",
    "type": "one_pager",
    "use_cases": ["pricing pushback", "ROI proof", "budget approval"],
    "stage_fit": ["Qualification", "Proposal", "Negotiation"]
  },
  {
    "asset_id": "A-003",
    "name": "Logistics rollout case study",
    "type": "case_study",
    "use_cases": ["timeline concerns", "implementation", "reference proof"],
    "stage_fit": ["Qualification", "Proposal", "Negotiation"]
  },
  {
    "asset_id": "A-004",
    "name": "Implementation guide",
    "type": "implementation_guide",
    "use_cases": ["implementation timeline", "rollout planning"],
    "stage_fit": ["Proposal", "Negotiation"]
  },
  {
    "asset_id": "A-005",
    "name": "Security and compliance overview",
    "type": "security_doc",
    "use_cases": ["security questions", "legal review"],
    "stage_fit": ["Proposal", "Negotiation"]
  },
  {
    "asset_id": "A-006",
    "name": "Product overview",
    "type": "one_pager",
    "use_cases": ["business overview", "discovery"],
    "stage_fit": ["Discovery", "Qualification"]
  },
  {
    "asset_id": "A-007",
    "name": "Competitive battlecard — VendorX",
    "type": "one_pager",
    "use_cases": ["competitor comparison", "VendorX"],
    "stage_fit": ["Qualification", "Proposal", "Negotiation"]
  },
  {
    "asset_id": "A-008",
    "name": "Competitive battlecard — VendorY",
    "type": "one_pager",
    "use_cases": ["competitor comparison", "VendorY"],
    "stage_fit": ["Qualification", "Proposal", "Negotiation"]
  },
  {
    "asset_id": "A-009",
    "name": "Executive demo video",
    "type": "demo_video",
    "use_cases": ["exec briefing", "high-level overview"],
    "stage_fit": ["Discovery", "Qualification", "Proposal"]
  }
]


# stage_actions.json

[
  {
    "stage": "Discovery",
    "default_next_actions": ["schedule discovery call", "send product overview", "share relevant case study"],
    "if_stalled_days": 14,
    "stalled_actions": ["re-engage sequence", "assign to different rep", "escalate to manager"]
  },
  {
    "stage": "Qualification",
    "default_next_actions": ["schedule demo", "send ROI one-pager", "qualify budget and timeline"],
    "if_stalled_days": 14,
    "stalled_actions": ["send re-engagement email", "schedule follow-up call", "offer exec briefing"]
  },
  {
    "stage": "Proposal",
    "default_next_actions": ["follow up on proposal", "share case study", "offer exec briefing"],
    "if_stalled_days": 14,
    "stalled_actions": ["escalate to manager", "send re-engagement sequence", "revise proposal scope"]
  },
  {
    "stage": "Negotiation",
    "default_next_actions": ["send pricing justification", "share security docs if needed", "schedule final terms call"],
    "if_stalled_days": 14,
    "stalled_actions": ["escalate to manager", "offer limited-time terms", "schedule decision-maker call"]
  },
  {
    "stage": "Closed Won",
    "default_next_actions": ["handoff to onboarding", "schedule kickoff", "send welcome materials"],
    "if_stalled_days": null,
    "stalled_actions": []
  },
  {
    "stage": "Closed Lost",
    "default_next_actions": ["document loss reasons", "schedule win-back nurture", "update competitive intel"],
    "if_stalled_days": null,
    "stalled_actions": []
  }
]


# competitors.json

[
  {
    "competitor_id": "VendorX",
    "display_name": "Vendor X",
    "common_weakness": "Implementation timeline and support",
    "win_theme": "Ease of deployment and faster time-to-value"
  },
  {
    "competitor_id": "VendorY",
    "display_name": "Vendor Y",
    "common_weakness": "Pricing and flexibility",
    "win_theme": "Transparent pricing and flexible packaging"
  },
  {
    "competitor_id": "VendorZ",
    "display_name": "Vendor Z",
    "common_weakness": "Enterprise security and compliance",
    "win_theme": "Security certifications and enterprise readiness"
  }
]


# interactions.json

[
  {
    "interaction_id": "INT-001",
    "lead_id": "L-001",
    "deal_id": "D-001",
    "rep_id": "SR-01",
    "type": "call",
    "datetime": "2025-11-22T15:00:00Z",
    "duration_minutes": 28,
    "sentiment": "positive",
    "key_topics": ["business overview", "forecasting needs"],
    "next_step_promised": "send product overview",
    "next_step_completed": true,
    "outcome": "qualified_interest"
  },
  {
    "interaction_id": "INT-002",
    "lead_id": "L-001",
    "deal_id": "D-001",
    "rep_id": "SR-01",
    "type": "email",
    "datetime": "2025-11-24T16:30:00Z",
    "duration_minutes": 0,
    "sentiment": "neutral",
    "key_topics": ["product overview"],
    "next_step_promised": "schedule demo",
    "next_step_completed": false,
    "outcome": "awaiting_response"
  },
  {
    "interaction_id": "INT-003",
    "lead_id": "L-003",
    "deal_id": "D-003",
    "rep_id": "SR-01",
    "type": "meeting",
    "datetime": "2025-11-18T14:00:00Z",
    "duration_minutes": 45,
    "sentiment": "positive",
    "key_topics": ["pricing", "ROI discussion"],
    "next_step_promised": "send proposal",
    "next_step_completed": true,
    "outcome": "proposal_requested"
  },
  {
    "interaction_id": "INT-004",
    "lead_id": "L-003",
    "deal_id": "D-003",
    "rep_id": "SR-01",
    "type": "email",
    "datetime": "2025-11-20T10:15:00Z",
    "duration_minutes": 0,
    "sentiment": "positive",
    "key_topics": ["proposal"],
    "next_step_promised": "review proposal internally",
    "next_step_completed": false,
    "outcome": "awaiting_decision"
  },
  {
    "interaction_id": "INT-005",
    "lead_id": "L-005",
    "deal_id": "D-004",
    "rep_id": "SR-03",
    "type": "demo",
    "datetime": "2025-11-12T17:00:00Z",
    "duration_minutes": 60,
    "sentiment": "neutral",
    "key_topics": ["features", "integration"],
    "next_step_promised": "discuss pricing",
    "next_step_completed": true,
    "outcome": "pricing_discussed"
  },
  {
    "interaction_id": "INT-006",
    "lead_id": "L-005",
    "deal_id": "D-004",
    "rep_id": "SR-03",
    "type": "call",
    "datetime": "2025-11-25T16:00:00Z",
    "duration_minutes": 22,
    "sentiment": "negative",
    "key_topics": ["pricing concerns", "budget approval"],
    "next_step_promised": "send pricing comparison",
    "next_step_completed": false,
    "outcome": "pricing_pushback"
  },
  {
    "interaction_id": "INT-007",
    "lead_id": "L-008",
    "deal_id": "D-007",
    "rep_id": "SR-04",
    "type": "call",
    "datetime": "2025-11-19T14:30:00Z",
    "duration_minutes": 26,
    "sentiment": "neutral",
    "key_topics": ["budget", "timeline"],
    "next_step_promised": "schedule follow-up",
    "next_step_completed": false,
    "outcome": "no_followup_scheduled"
  },
  {
    "interaction_id": "INT-008",
    "lead_id": "L-009",
    "deal_id": "D-008",
    "rep_id": "SR-04",
    "type": "email",
    "datetime": "2025-11-16T09:45:00Z",
    "duration_minutes": 0,
    "sentiment": "neutral",
    "key_topics": ["proposal delivery"],
    "next_step_promised": "review proposal",
    "next_step_completed": false,
    "outcome": "awaiting_response"
  },
  {
    "interaction_id": "INT-009",
    "lead_id": "L-011",
    "deal_id": "D-010",
    "rep_id": "SR-01",
    "type": "meeting",
    "datetime": "2025-10-28T15:00:00Z",
    "duration_minutes": 50,
    "sentiment": "positive",
    "key_topics": ["final approval", "contract terms"],
    "next_step_promised": "finalize contract",
    "next_step_completed": true,
    "outcome": "deal_closed"
  },
  {
    "interaction_id": "INT-010",
    "lead_id": "L-012",
    "deal_id": "D-011",
    "rep_id": "SR-03",
    "type": "call",
    "datetime": "2025-10-20T14:00:00Z",
    "duration_minutes": 30,
    "sentiment": "negative",
    "key_topics": ["pricing", "competition"],
    "next_step_promised": "revise pricing",
    "next_step_completed": false,
    "outcome": "deal_lost"
  },
  {
    "interaction_id": "INT-011",
    "lead_id": "L-014",
    "deal_id": "D-013",
    "rep_id": "SR-03",
    "type": "meeting",
    "datetime": "2025-10-30T16:00:00Z",
    "duration_minutes": 40,
    "sentiment": "positive",
    "key_topics": ["final terms", "implementation"],
    "next_step_promised": "kickoff planning",
    "next_step_completed": true,
    "outcome": "deal_closed"
  },
  {
    "interaction_id": "INT-012",
    "lead_id": "L-015",
    "deal_id": "D-014",
    "rep_id": "SR-03",
    "type": "call",
    "datetime": "2025-11-21T15:30:00Z",
    "duration_minutes": 35,
    "sentiment": "neutral",
    "key_topics": ["legal review", "security"],
    "next_step_promised": "send security docs",
    "next_step_completed": false,
    "outcome": "awaiting_legal"
  },
  {
    "interaction_id": "INT-013",
    "lead_id": "L-007",
    "deal_id": "D-006",
    "rep_id": "SR-02",
    "type": "call",
    "datetime": "2025-11-26T10:00:00Z",
    "duration_minutes": 25,
    "sentiment": "neutral",
    "key_topics": ["business overview", "pipeline visibility"],
    "next_step_promised": "send product overview",
    "next_step_completed": false,
    "outcome": "qualified_interest"
  },
  {
    "interaction_id": "INT-014",
    "lead_id": "L-002",
    "deal_id": "D-002",
    "rep_id": "SR-02",
    "type": "meeting",
    "datetime": "2025-11-24T14:00:00Z",
    "duration_minutes": 40,
    "sentiment": "positive",
    "key_topics": ["data silos", "reporting needs"],
    "next_step_promised": "send pricing comparison",
    "next_step_completed": false,
    "outcome": "pricing_discussed"
  },
  {
    "interaction_id": "INT-015",
    "lead_id": "L-003",
    "deal_id": "D-003",
    "rep_id": "SR-01",
    "type": "call",
    "datetime": "2025-11-26T11:00:00Z",
    "duration_minutes": 18,
    "sentiment": "neutral",
    "key_topics": ["proposal follow-up", "internal review"],
    "next_step_promised": "decision by end of week",
    "next_step_completed": false,
    "outcome": "awaiting_decision"
  },
  {
    "interaction_id": "INT-016",
    "lead_id": "L-005",
    "deal_id": "D-004",
    "rep_id": "SR-03",
    "type": "call",
    "datetime": "2025-11-27T09:30:00Z",
    "duration_minutes": 30,
    "sentiment": "neutral",
    "key_topics": ["discount request", "final terms"],
    "next_step_promised": "send revised proposal",
    "next_step_completed": false,
    "outcome": "negotiation_active"
  },
  {
    "interaction_id": "INT-017",
    "lead_id": "L-009",
    "deal_id": "D-008",
    "rep_id": "SR-04",
    "type": "email",
    "datetime": "2025-11-20T08:00:00Z",
    "duration_minutes": 0,
    "sentiment": "neutral",
    "key_topics": ["proposal follow-up"],
    "next_step_promised": "review proposal",
    "next_step_completed": false,
    "outcome": "awaiting_response"
  },
  {
    "interaction_id": "INT-018",
    "lead_id": "L-013",
    "deal_id": "D-012",
    "rep_id": "SR-02",
    "type": "call",
    "datetime": "2025-11-24T16:00:00Z",
    "duration_minutes": 22,
    "sentiment": "positive",
    "key_topics": ["utilization forecasting", "capacity planning"],
    "next_step_promised": "schedule demo",
    "next_step_completed": false,
    "outcome": "qualified_interest"
  }
]


# data validation

#!/usr/bin/env python3
"""Validate the sales enablement orchestrator data files."""

import json
import os
from collections import Counter

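# Run from this script's directory so relative paths work. In a notebook,
# __file__ is undefined, so fall back to the current working directory.
try:
    SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
except NameError:
    SCRIPT_DIR = os.getcwd()
os.chdir(SCRIPT_DIR)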

# Load core datasets
with open('leads.json') as f:
    leads = json.load(f)
with open('sales_reps.json') as f:
    reps = json.load(f)
with open('interactions.json') as f:
    interactions = json.load(f)
with open('deals.json') as f:
    deals = json.load(f)
with open('signals.json') as f:
    signals = json.load(f)

# Load MVP2 datasets (optional)
mvp2 = {}
for name, filename in [
    ('thresholds', 'thresholds.json'),
    ('objections', 'objections.json'),
    ('content_assets', 'content_assets.json'),
    ('stage_actions', 'stage_actions.json'),
    ('competitors', 'competitors.json'),
]:
    if os.path.exists(filename):
        with open(filename) as f:
            mvp2[name] = json.load(f)

# Extract IDs
lead_ids = {lead['lead_id'] for lead in leads}
rep_ids = {r['rep_id'] for r in reps}
deal_ids = {d['deal_id'] for d in deals}
signal_lead_ids = {s['lead_id'] for s in signals}

# Check interactions
interaction_lead_ids = {i['lead_id'] for i in interactions}
interaction_rep_ids = {i['rep_id'] for i in interactions}
interaction_deal_ids = {i['deal_id'] for i in interactions if i.get('deal_id')}

# Check deals
deal_lead_ids = {d['lead_id'] for d in deals}
deal_rep_ids = {d['rep_id'] for d in deals}

# Validation
print('=== DATA VALIDATION REPORT ===\n')
print(f'Leads: {len(leads)}')
print(f'Sales Reps: {len(reps)}')
print(f'Interactions: {len(interactions)}')
print(f'Deals: {len(deals)}')
print(f'Signals: {len(signals)}')
if mvp2:
    print('\n--- MVP2 datasets ---')
    for name, data in mvp2.items():
        n = len(data) if isinstance(data, list) else '(object)'
        print(f'  {name}: {n}')
print()

print('=== REFERENTIAL INTEGRITY ===')
# Check for orphaned references
orphaned_interaction_leads = interaction_lead_ids - lead_ids
orphaned_interaction_reps = interaction_rep_ids - rep_ids
orphaned_interaction_deals = interaction_deal_ids - deal_ids
orphaned_deal_leads = deal_lead_ids - lead_ids
orphaned_deal_reps = deal_rep_ids - rep_ids
missing_signals = lead_ids - signal_lead_ids

issues = []
if orphaned_interaction_leads:
    issues.append(f'⚠️  Interactions reference non-existent leads: {sorted(orphaned_interaction_leads)}')
if orphaned_interaction_reps:
    issues.append(f'⚠️  Interactions reference non-existent reps: {sorted(orphaned_interaction_reps)}')
if orphaned_interaction_deals:
    issues.append(f'⚠️  Interactions reference non-existent deals (deal_id): {sorted(orphaned_interaction_deals)}')
if orphaned_deal_leads:
    issues.append(f'⚠️  Deals reference non-existent leads: {sorted(orphaned_deal_leads)}')
if orphaned_deal_reps:
    issues.append(f'⚠️  Deals reference non-existent reps: {sorted(orphaned_deal_reps)}')
if missing_signals:
    issues.append(f'⚠️  Leads missing signals: {sorted(missing_signals)}')

if not issues:
    print('✅ All referential integrity checks passed\n')
else:
    for issue in issues:
        print(issue)
    print()

print('=== DATA DISTRIBUTION ===')
# Leads without deals
leads_without_deals = lead_ids - deal_lead_ids
print(f'Leads without deals: {len(leads_without_deals)} ({sorted(leads_without_deals)})')

# Leads without interactions
leads_without_interactions = lead_ids - interaction_lead_ids
print(f'Leads without interactions: {len(leads_without_interactions)} ({sorted(leads_without_interactions)})')

# Interactions with vs without deal_id
with_deal = sum(1 for i in interactions if i.get('deal_id'))
print(f'Interactions with deal_id: {with_deal} / {len(interactions)}')

# Deal stages
deal_stages = Counter(d['stage'] for d in deals)
print(f'\nDeal stages: {dict(deal_stages)}')

# Deal status
deal_status = Counter(d['status'] for d in deals)
print(f'Deal status: {dict(deal_status)}')

# Interaction types
interaction_types = Counter(i['type'] for i in interactions)
print(f'\nInteraction types: {dict(interaction_types)}')

# Rep deal distribution
rep_deal_counts = Counter(d['rep_id'] for d in deals if d['status'] == 'active')
print(f'\nActive deals per rep: {dict(rep_deal_counts)}')

# Check for required fields
print('\n=== SCHEMA VALIDATION ===')
required_lead_fields = ['lead_id', 'company_name', 'contact_email', 'contact_phone', 'created_date']
required_rep_fields = ['rep_id', 'name', 'email', 'quota_usd', 'year_to_date_revenue_usd']
required_interaction_fields = ['interaction_id', 'lead_id', 'rep_id', 'type', 'datetime', 'outcome']
required_deal_fields = ['deal_id', 'lead_id', 'rep_id', 'stage', 'created_date', 'expected_close_date', 'status']
required_signal_fields = ['lead_id', 'engagement_score', 'deal_risk_score', 'recommended_action']

schema_issues = []

for lead in leads:
    for field in required_lead_fields:
        if field not in lead:
            schema_issues.append(f'Lead {lead.get("lead_id", "unknown")} missing field: {field}')

for rep in reps:
    for field in required_rep_fields:
        if field not in rep:
            schema_issues.append(f'Rep {rep.get("rep_id", "unknown")} missing field: {field}')

for interaction in interactions:
    for field in required_interaction_fields:
        if field not in interaction:
            schema_issues.append(f'Interaction {interaction.get("interaction_id", "unknown")} missing field: {field}')

for deal in deals:
    for field in required_deal_fields:
        if field not in deal:
            schema_issues.append(f'Deal {deal.get("deal_id", "unknown")} missing field: {field}')

for signal in signals:
    for field in required_signal_fields:
        if field not in signal:
            schema_issues.append(f'Signal for lead {signal.get("lead_id", "unknown")} missing field: {field}')

# MVP2 schema checks when present
if 'objections' in mvp2:
    for obj in mvp2['objections']:
        for field in ['objection_id', 'label', 'theme', 'suggested_actions']:
            if field not in obj:
                schema_issues.append(f'Objection {obj.get("objection_id", "?")} missing field: {field}')
if 'content_assets' in mvp2:
    for a in mvp2['content_assets']:
        for field in ['asset_id', 'name', 'type', 'use_cases', 'stage_fit']:
            if field not in a:
                schema_issues.append(f'Asset {a.get("asset_id", "?")} missing field: {field}')
if 'stage_actions' in mvp2:
    for s in mvp2['stage_actions']:
        for field in ['stage', 'default_next_actions']:
            if field not in s:
                schema_issues.append(f'Stage action for {s.get("stage", "?")} missing field: {field}')
if 'competitors' in mvp2:
    for c in mvp2['competitors']:
        for field in ['competitor_id', 'display_name', 'win_theme']:
            if field not in c:
                schema_issues.append(f'Competitor {c.get("competitor_id", "?")} missing field: {field}')

if not schema_issues:
    print('✅ All required fields present\n')
else:
    for issue in schema_issues[:10]:  # Show first 10 issues
        print(issue)
    if len(schema_issues) > 10:
        print(f'... and {len(schema_issues) - 10} more issues')

print('\n=== SUMMARY ===')
if not issues and not schema_issues:
    print('✅ Data quality looks good! Ready for orchestrator development.')
else:
    print('⚠️  Some issues found. Review above before proceeding.')
