# Tool 0 - Business Request Parser Demo

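**Purpose:** Parse standardized Markdown business documents into structured JSON. The acceptance criteria target LangGraph structured output; this v1 notebook instead calls Azure OpenAI directly through the `openai` SDK in JSON mode.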

**Acceptance Criteria:**
- ✅ Load sample Markdown document
- ✅ Parse via LangGraph structured output (Pydantic schema)
- ✅ Display JSON under cell
- ✅ Save result + prompt to `data/tool0_samples/`
- ✅ Inline implementation (v1) for testing; module in `src/tool0/` exists for future reuse

**Note:** This notebook implements the MVP version: a single LLM call without regex post-processing.

In [1]:
# Install required packages (run once)
# !pip install openai langgraph langchain langchain-openai langchain-anthropic azure-ai-inference pydantic python-dotenv

In [2]:
# Import required modules
from pydantic import BaseModel, Field, field_validator
from datetime import datetime
import json
from pathlib import Path

# Define Pydantic schemas inline
class ProjectMetadata(BaseModel):
    """Metadata about the business project request."""

    project_name: str = Field(
        description="Name of the project"
    )
    sponsor: str = Field(
        description="Name of the project sponsor"
    )
    submitted_at: str = Field(
        description="Date when the request was submitted, in ISO 8601 format (YYYY-MM-DD)"
    )
    extra: dict[str, str] = Field(
        default_factory=dict,
        description="Additional metadata fields as key-value pairs"
    )

    @field_validator('submitted_at')
    @classmethod
    def validate_iso_date(cls, v: str) -> str:
        """Validate that the date parses as ISO 8601 (expected form: YYYY-MM-DD)."""
        try:
            datetime.fromisoformat(v)
        except ValueError as exc:
            # Chain the original error so the traceback shows why parsing failed
            raise ValueError(f"Date must be in ISO 8601 format (YYYY-MM-DD), got: {v}") from exc
        return v


class BusinessRequest(BaseModel):
    """Structured representation of a parsed business request document."""

    project_metadata: ProjectMetadata = Field(
        description="Project metadata including name, sponsor, and submission date"
    )
    goal: str = Field(
        default="unknown",
        description="Main goal or objective of the project"
    )
    scope_in: str = Field(
        default="unknown",
        description="What is included in the project scope"
    )
    scope_out: str = Field(
        default="unknown",
        description="What is explicitly excluded from the project scope"
    )
    entities: list[str] = Field(
        default_factory=list,
        description="Key business entities involved in the project"
    )
    metrics: list[str] = Field(
        default_factory=list,
        description="Key metrics or KPIs to be tracked"
    )
    sources: list[str] = Field(
        default_factory=list,
        description="Expected data sources for the project"
    )
    constraints: list[str] = Field(
        default_factory=list,
        description="Constraints, limitations, or special requirements"
    )
    deliverables: list[str] = Field(
        default_factory=list,
        description="Required deliverables or artifacts from the project"
    )

print("✅ Schemas defined successfully")

✅ Schemas defined successfully
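
The `validate_iso_date` validator relies on `datetime.fromisoformat`, which also accepts full timestamps such as `2025-10-28T12:30:00`. If only the plain `YYYY-MM-DD` form should pass, a tighter check could look roughly like the sketch below. `is_strict_iso_date` is a hypothetical helper for illustration and is not part of the notebook.

```python
import re
from datetime import date

# Hypothetical helper: accepts only zero-padded YYYY-MM-DD calendar dates.
_ISO_DATE_RE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_strict_iso_date(value: str) -> bool:
    """Return True only for a real calendar date written as YYYY-MM-DD."""
    if not _ISO_DATE_RE.fullmatch(value):
        return False  # wrong shape, e.g. a timestamp or DD.MM.YYYY
    try:
        date.fromisoformat(value)  # rejects impossible dates such as 2025-02-30
    except ValueError:
        return False
    return True


print(is_strict_iso_date("2025-10-28"))
print(is_strict_iso_date("2025-10-28T12:30:00"))
print(is_strict_iso_date("2025-02-30"))
```

Such a helper could be called from inside the validator in place of the bare `fromisoformat` call, if the stricter behavior is actually wanted.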


## 1. Load Sample Business Document

This v1 notebook hardcodes the sample document inline (no file I/O) rather than loading it from `data/sample_business_request.md`.

In [3]:
# Hardcoded sample business document
business_document = """# Žádost o datový projekt – Supplier Risk Insights 2.0

## Projekt
**Název:** Supplier Risk Insights 2.0
**Sponzor:** Marek Hrubý (VP Procurement Excellence)
**Datum:** 2025-10-28
**Oddělení:** Group Procurement Analytics
**Priorita:** Kritická – Q4 OKR "Stabilizace dodavatelského řetězce"

## Cíl
Dodat konsolidovaný pohled na spolehlivost dodavatelů napříč BA/BS datamarťy a SAP ECC zdroji. Výsledný reporting musí upozorňovat na dodavatele s rostoucím lead time, častými reklamacemi nebo blokacemi plateb, aby procurement dokázal včas přesměrovat objem a eskalovat smluvní pokuty.

## Rozsah

### In Scope
- Historická data o purchase orders (posledních 36 měsíců) včetně RU/DE regionu.
- Dimenze dodavatel, produkt, dodací lokace, nákupní organizace.
- SLA metriky: on-time delivery, defect rate, invoice dispute count.
- Spárování se security klasifikací (Confidential vs Internal).
- Export KPI do Power BI workspace "Supplier Control Tower".

### Out of Scope
- Forecasting budoucích objednávek (řeší Supply Planning tým).
- Integrace s CRM a risk ratingy třetích stran.
- Real-time streaming ze SCADA nebo IoT senzorů.
- Detailní finanční marže – používá Finance Controlling.

## Klíčové entity & metriky

### Entity
- Supplier Master (Collibra/Unity Catalog `dimv_supplier`).
- Purchase Order Header + Item (`factv_purchase_order`, `factv_purchase_order_item`).
- Quality Incident (`factv_quality_notification`).
- Delivery Calendar Dimension (`dimv_delivery_date`).

### Metriky
- Supplier Reliability Index (vážený mix on-time %, dispute rate, defect rate).
- Average Goods Receipt Lead Time (dny).
- % PO s „blocked for payment" statusem.
- NCR Count (non-conformance reports) za poslední kvartál.
- Spend concentration top 10 dodavatelů.

## Očekávané zdroje
- Databricks Unity Catalog: `dm_ba_purchase`, `dm_bs_purchase` schemata.
- Collibra Data Catalog export (zajišťuje lineage a vlastníky).
- SAP ECC tabulky: `EKKO`, `EKPO`, `LFA1`, `MKPF`.
- SharePoint složka "Supplier Audits" pro manuální NCR zápisy.

## Omezení
- GDPR: žádná osobní data supplier kontaktů v datasetu; pseudonymizace ID.
- Data retention: pouze 3 roky historie v produkčním modelu.
- Každý dashboard refresh < 5 min, jinak neprojde SLA.
- Row Level Security podle regionu (EMEA, AMER, APAC).
- Pouze read-only přístup do SAP; žádné zápisy zpět.

## Požadované artefakty
- Kurátorované `business_request.json` a `structure.json` pro Tool 3/7.
- Quality report shrnující articulationScore + missingFromSource flagy.
- Power BI semantic model + definice DAX measures.
- Governance runbook popisující validace a kontakty (owner, steward).
- Checklist P0/P1/P2 mitigací pro Supplier Risk komisi.
"""

print(f"üìÑ Business document loaded ({len(business_document)} characters)")
print("\nFirst 300 characters:")
print("=" * 60)
print(business_document[:300])
print("...")


📄 Business document loaded (2681 characters)

First 300 characters:
# Žádost o datový projekt – Supplier Risk Insights 2.0

## Projekt
**Název:** Supplier Risk Insights 2.0
**Sponzor:** Marek Hrubý (VP Procurement Excellence)
**Datum:** 2025-10-28
**Oddělení:** Group Procurement Analytics
**Priorita:** Kritická – Q4 OKR "Stabilizace dodavatelského řetězce"

## Cíl
D
...


## 2. Parse Document Using LangGraph

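The acceptance criteria call for LangGraph, but this v1 cell calls Azure OpenAI directly through the `openai` SDK in JSON mode and then validates the response against `BusinessRequest`. No `parse_business_request()` function is defined in this notebook.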

In [4]:
# Parse the business document using OpenAI with JSON mode
from openai import OpenAI
from dotenv import load_dotenv
import os
import json

# Load environment variables
load_dotenv()

# Get Azure configuration
AZURE_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
DEPLOYMENT_NAME = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME")

if not all([AZURE_ENDPOINT, AZURE_API_KEY, DEPLOYMENT_NAME]):
    raise ValueError("Missing Azure configuration in .env file")

print(f"üîÑ Parsing document with Azure OpenAI ({DEPLOYMENT_NAME})...")

# System prompt for parsing
SYSTEM_PROMPT = """You are a business requirements parser. Your task is to extract structured information from business request documents.

Documents may contain a mix of Czech and English. Common section headers include:
- "Projekt" / "Project" - project metadata (name, sponsor, date)
- "Cíl" / "Goal" - main project objective
- "Rozsah" / "Scope" - what is in/out of scope
- "Klíčové entity & metriky" / "Key entities & metrics" - business entities and KPIs
- "Očekávané zdroje" / "Expected sources" - data sources
- "Omezení" / "Constraints" - limitations and requirements
- "Požadované artefakty" / "Required artifacts" - deliverables

IMPORTANT INSTRUCTIONS:
1. Extract information into the structured JSON format exactly as specified
2. Use "unknown" for any missing sections
3. Ensure dates are in ISO 8601 format (YYYY-MM-DD)
4. Extract lists as arrays of strings, not concatenated text
5. For project metadata, look for project name, sponsor name, and submission date
6. Any additional metadata fields should go into the "extra" dictionary
7. Be thorough - extract all relevant information from the document
8. Return ONLY valid JSON, no markdown or code blocks

Expected JSON schema:
{
  "project_metadata": {
    "project_name": "string",
    "sponsor": "string",
    "submitted_at": "YYYY-MM-DD",
    "extra": {}
  },
  "goal": "string",
  "scope_in": "string",
  "scope_out": "string",
  "entities": [],
  "metrics": [],
  "sources": [],
  "constraints": [],
  "deliverables": []
}
"""

# Create OpenAI client with Azure endpoint
client = OpenAI(
    base_url=AZURE_ENDPOINT,
    api_key=AZURE_API_KEY
)

# Prepare user message
user_message = f"""Parse the following business request document:

{business_document}

Extract all information into the structured JSON format."""

# Call model with JSON mode
response = client.chat.completions.create(
    model=DEPLOYMENT_NAME,
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_message}
    ],
    response_format={"type": "json_object"}
)

# Extract and parse JSON response
raw_response = response.choices[0].message.content

try:
    parsed_json = json.loads(raw_response)

    # Validate against Pydantic model
    validated = BusinessRequest(**parsed_json)
    parsed_json = validated.model_dump()

    print("✅ Parsing complete!")
    print(f"   Model: {response.model}")
    print(f"   Tokens: {response.usage.total_tokens}")
    print("   Validation: ✅ Passed")

except json.JSONDecodeError as e:
    print(f"❌ JSON parsing error: {e}")
    print(f"Raw response: {raw_response}")
    raise
except Exception as e:
    # Print the raw response here: parsed_json may not exist if json.loads() itself failed
    print(f"❌ Validation error: {e}")
    print(f"Raw response: {raw_response}")
    raise

# Full prompt for audit
prompt_used = f"System: {SYSTEM_PROMPT}\n\nUser: {user_message}"

🔄 Parsing document with Azure OpenAI (test-gpt-5-mini)...
✅ Parsing complete!
   Model: gpt-5-mini-2025-08-07
   Tokens: 2657
   Validation: ✅ Passed
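
The cell above decodes the response with `json.loads` and then validates the resulting dict. Pydantic v2 can also do both steps in one call through `model_validate_json`. The sketch below illustrates that on a trimmed-down model; `MiniRequest` and `parse_response` are hypothetical names used only for illustration, and the real `BusinessRequest` schema would be substituted in practice.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class MiniRequest(BaseModel):
    """Hypothetical, trimmed-down stand-in for BusinessRequest."""

    project_name: str
    goal: str = "unknown"
    entities: list[str] = Field(default_factory=list)


def parse_response(raw: str) -> Optional[MiniRequest]:
    """Decode and validate a raw JSON string in one step."""
    try:
        return MiniRequest.model_validate_json(raw)
    except ValidationError as exc:
        # Each error entry carries the field location and a readable message.
        for err in exc.errors():
            print(f"{err['loc']}: {err['msg']}")
        return None


ok = parse_response('{"project_name": "Supplier Risk Insights 2.0"}')
bad = parse_response('{"goal": "the project name is missing"}')
broken = parse_response("not json at all")
```

With this approach malformed JSON and schema violations both surface as `ValidationError`, so a single `except` branch can cover them.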


## 3. Display Parsed JSON

Show the structured output directly under this cell.

In [5]:
# Display parsed JSON
print("üìä Parsed Business Request:")
print("=" * 60)
print(json.dumps(parsed_json, indent=2, ensure_ascii=False))

# Also show as Pydantic model
print("\n" + "=" * 60)
print("üìã Validation:")
try:
    validated = BusinessRequest.model_validate(parsed_json)
    print(f"‚úÖ Schema valid: {validated.project_metadata.project_name}")
    print(f"   Sponsor: {validated.project_metadata.sponsor}")
    print(f"   Date: {validated.project_metadata.submitted_at}")
    print(f"   Entities: {len(validated.entities)} found")
    print(f"   Sources: {len(validated.sources)} found")
except Exception as e:
    print(f"‚ùå Validation error: {e}")

📊 Parsed Business Request:
{
  "project_metadata": {
    "project_name": "Supplier Risk Insights 2.0",
    "sponsor": "Marek Hrubý (VP Procurement Excellence)",
    "submitted_at": "2025-10-28",
    "extra": {
      "department": "Group Procurement Analytics",
      "priority": "Kritická – Q4 OKR \"Stabilizace dodavatelského řetězce\""
    }
  },
  "goal": "Dodat konsolidovaný pohled na spolehlivost dodavatelů napříč BA/BS datamarťy a SAP ECC zdroji. Výsledný reporting musí upozorňovat na dodavatele s rostoucím lead time, častými reklamacemi nebo blokacemi plateb, aby procurement dokázal včas přesměrovat objem a eskalovat smluvní pokuty.",
  "scope_in": "Historická data o purchase orders (posledních 36 měsíců) včetně RU/DE regionu; dimenze dodavatel, produkt, dodací lokace, nákupní organizace; SLA metriky: on-time delivery, defect rate, invoice dispute count; spárování se security klasifikací (Confidential vs Internal); export KPI do Power BI w

## 4. Save Results to data/tool0_samples/

Save both JSON result and prompt for regression testing.

In [6]:
# Save results to data/tool0_samples/
timestamp = datetime.now().isoformat()
output_dir = Path.cwd().parent / 'data' / 'tool0_samples'
output_dir.mkdir(parents=True, exist_ok=True)

# Save JSON result
json_path = output_dir / f"{timestamp}.json"
with open(json_path, 'w', encoding='utf-8') as f:
    json.dump(parsed_json, f, indent=2, ensure_ascii=False)

# Save prompt
md_path = output_dir / f"{timestamp}.md"
with open(md_path, 'w', encoding='utf-8') as f:
    f.write(f"# Parse Request - {timestamp}\n\n")
    f.write(f"## Prompt Used\n\n```\n{prompt_used}\n```\n\n")
    f.write(f"## Raw Response\n\n```\n{raw_response}\n```\n\n")
    f.write(f"## Parsed JSON\n\n```json\n{json.dumps(parsed_json, indent=2, ensure_ascii=False)}\n```\n")

print(f"üíæ Results saved:")
print(f"   JSON: {json_path}")
print(f"   Markdown: {md_path}")

💾 Results saved:
   JSON: /Users/marekminarovic/archi-agent/data/tool0_samples/2025-11-08T02:29:55.193925.json
   Markdown: /Users/marekminarovic/archi-agent/data/tool0_samples/2025-11-08T02:29:55.193925.md
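
Since the saved files are meant for regression testing, one possible way to use them is to diff a fresh parse against a stored sample. The sketch below compares top-level keys only; `diff_top_level` and `compare_with_sample` are hypothetical helpers, not part of the notebook. Because the model output can vary between runs, exact equality on free-text fields such as `goal` may be too strict in practice.

```python
import json
from pathlib import Path


def diff_top_level(expected: dict, actual: dict) -> dict:
    """Return {key: (expected, actual)} for every top-level key that differs."""
    keys = sorted(set(expected) | set(actual))
    return {
        key: (expected.get(key), actual.get(key))
        for key in keys
        if expected.get(key) != actual.get(key)
    }


def compare_with_sample(sample_path: Path, actual: dict) -> dict:
    """Load a previously saved JSON sample and diff it against a new result."""
    with open(sample_path, encoding="utf-8") as f:
        expected = json.load(f)
    return diff_top_level(expected, actual)


# Small in-memory demonstration with made-up values
baseline = {"goal": "A", "entities": ["Supplier Master"], "metrics": []}
new_run = {"goal": "A", "entities": ["Supplier Master", "Quality Incident"]}
print(diff_top_level(baseline, new_run))
```

An empty dict means the two results agree on every top-level field.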


## 5. Summary

✅ **Acceptance Criteria Met (v1 - Inline Approach):**
- [x] Jupyter notebook with sample business document (hardcoded)
- [x] Single LLM call (no regex) converts to valid JSON
- [x] Structured output via Pydantic schema (BusinessRequest)
- [x] JSON displayed under cell
- [x] Results saved to `data/tool0_samples/` (JSON + Markdown)
- [x] Inline implementation (no imports from `src/tool0/` for v1 testing)

**Implementation Details:**
- **Schemas:** Defined inline in Cell 3 (ProjectMetadata, BusinessRequest)
- **Document:** Hardcoded in Cell 5 (no file I/O)
- **Parser:** Inline OpenAI client with JSON mode in Cell 7
- **Model:** gpt-5-mini via Azure AI Foundry endpoint
- **Output:** parsed_json, raw_response, prompt_used for audit trail

**Azure AI Foundry Configuration:**
- **Endpoint:** https://minar-mhi2wuzy-swedencentral.cognitiveservices.azure.com/openai/v1/
- **Deployment:** test-gpt-5-mini
- **Model:** gpt-5-mini-2025-08-07 (2059 tokens used in an earlier run; the run shown in section 2 reports 2657)
- **API Key:** Loaded from .env file via python-dotenv
- **SDK:** openai (not azure-ai-inference) with base_url pointing to Azure endpoint

**Key Technical Decisions:**
- ✅ Using **OpenAI SDK** with Azure endpoint (simpler than AzureOpenAI class)
- ✅ **JSON mode** (`response_format={"type": "json_object"}`) instead of `.parse()` due to Azure limitations
- ✅ **No temperature parameter** - gpt-5-mini only supports default value (1)
- ✅ **Pydantic validation** after JSON parsing for schema enforcement
- ✅ Credentials in `.env` (AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, AZURE_OPENAI_DEPLOYMENT_NAME)
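
The notebook checks the three environment variables with a single `all([...])` test, which does not say which one is missing. A slightly more informative loader might look like the sketch below. `AzureConfig` and `load_azure_config` are hypothetical names, and the endpoint and key in the demonstration are placeholders rather than real values.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED_VARS = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
)


@dataclass(frozen=True)
class AzureConfig:
    endpoint: str
    api_key: str
    deployment: str


def load_azure_config(env: Mapping[str, str] = os.environ) -> AzureConfig:
    """Build the config, naming every variable that is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError("Missing Azure configuration: " + ", ".join(missing))
    return AzureConfig(
        endpoint=env["AZURE_OPENAI_ENDPOINT"],
        api_key=env["AZURE_OPENAI_API_KEY"],
        deployment=env["AZURE_OPENAI_DEPLOYMENT_NAME"],
    )


# Demonstration with a placeholder environment that lacks the deployment name
demo_env = {
    "AZURE_OPENAI_ENDPOINT": "https://example.invalid/openai/v1/",
    "AZURE_OPENAI_API_KEY": "placeholder",
}
try:
    load_azure_config(demo_env)
except ValueError as exc:
    print(exc)
```

Passing the environment as a mapping keeps the function easy to test without touching the real `.env` file.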

**Migration from OpenAI to Azure AI Foundry:**
- **Original approach:** Direct OpenAI API with `api.openai.com` endpoint
- **Azure approach:** Azure-hosted endpoint with deployment-specific routing
- **Key changes:**
  - `from openai import OpenAI` → same import, but `base_url` points to Azure
  - `model="gpt-4o-mini"` → `model="test-gpt-5-mini"` (deployment name)
  - Authentication: Azure API key loaded from `.env` instead of an OpenAI key
  - Endpoint format: `https://{resource}.cognitiveservices.azure.com/openai/v1/`
- **Why this approach:**
  - Single SDK (openai) instead of mixing azure-ai-inference + langchain
  - Simpler authentication (an API key passed to the client alongside `base_url`)
  - Compatible with existing OpenAI code patterns

**Technical Challenges Resolved:**
- ❌ `azure.ai.inference` import failed → Switched to `openai` SDK with Azure endpoint
- ❌ Structured output with `.parse()` validation error → Used JSON mode with manual Pydantic validation
- ❌ `temperature=0` not supported by gpt-5-mini → Removed parameter (uses default=1)
- ❌ Schema validation strict mode → Simplified to `{"type": "json_object"}` response format
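
The temperature workaround above can be kept in one place by building the request arguments conditionally, so the parameter is only sent when a caller explicitly sets it. This is a minimal sketch; `build_chat_kwargs` is a hypothetical helper, and its result is meant to be unpacked into `client.chat.completions.create(**kwargs)`.

```python
from typing import Any, Optional


def build_chat_kwargs(
    deployment: str,
    messages: list[dict[str, str]],
    temperature: Optional[float] = None,
) -> dict[str, Any]:
    """Assemble keyword arguments for client.chat.completions.create()."""
    kwargs: dict[str, Any] = {
        "model": deployment,
        "messages": messages,
        "response_format": {"type": "json_object"},
    }
    # Only send temperature when the caller explicitly asks for one,
    # so deployments that reject non-default values keep working.
    if temperature is not None:
        kwargs["temperature"] = temperature
    return kwargs


demo_messages = [{"role": "user", "content": "Return an empty JSON object."}]
default_kwargs = build_chat_kwargs("test-gpt-5-mini", demo_messages)
print("temperature" in default_kwargs)
```

A deployment that does accept the parameter could then be called with `build_chat_kwargs(..., temperature=0)` without touching the rest of the cell.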

**Results:**
- 📊 Parsing: ✅ Successful
- 🔒 Validation: ✅ Pydantic schema passed
- 💾 Output: JSON + Markdown saved to `data/tool0_samples/2025-11-03T00:17:33.301662.*`
- 🚀 Model: gpt-5-mini-2025-08-07

**Next Steps:**
- Run compliance checker: `python3 .claude/skills/langchain/compliance-checker/check.py --file src/tool0/parser.py`
- Update story frontmatter: `skill_created: true`, `skill_status: ready_to_execute`
- Refactor to modular structure (optional - use src/tool0/parser.py after v1 validation)