# Datamodel Generation: PDF → M/TEXT Datamodel

Generate M/TEXT datamodel XML from PDF documents using OCR and Large Language Models.

## M/TEXT Context

In M/TEXT, datamodels define the structure of variable data used in document generation:
- **Node elements**: Define fields with data types (TEXT, NUMBER, DATETIME, BOOLEAN)
- **Validation rules**: Specify required fields, max lengths, dropdown values
- **Hierarchical structure**: Support nested and repeated data

**Traditional approach:** Manually analyze documents and hand-write the datamodel XML  
**AI approach:** OCR the PDF, then let an LLM extract the field structure automatically

## Who This Is For

**M/TEXT Developers**: Quickly scaffold datamodels from existing documents instead of manual XML authoring.

This notebook shows a portable 2-step pattern: OCR → Schema Generation.


## How It Works

The workflow has two stages:

### Stage 1: OCR (PDF → Markdown)
1. **Upload PDF**: Convert to base64 data URL
2. **Call Mistral OCR API**: Send PDF to `api.mistral.ai/v1/ocr`
3. **Extract Markdown**: Receive page-by-page markdown representation

### Stage 2: Schema Generation (Markdown → Datamodel)
4. **Prompt LLM**: Send markdown to Claude with instructions to identify business-relevant fields
5. **Parse JSON**: Extract structured variable list with types, validation rules, example values
6. **Generate XML**: Convert JSON to M/TEXT datamodel format
7. **Optional Artifacts**: Generate testcase XML and XSLT for testing

Total time: typically 15-45 seconds, depending on PDF size and complexity.


## Inputs and Requirements

**Required:**
- **PDF Document**: Invoice, form, letter, or any document with structured data
- **Mistral API Key**: For the OCR service (`MISTRAL_API_KEY`)
- **LLM API Key**: For schema generation (`ANTHROPIC_API_KEY` in the Anthropic example)

**Optional:**
- **Instructions**: Guidance for field selection (e.g., "Focus on invoice line items and totals")
- **Include Artifacts**: Generate testcase XML and XSLT alongside the datamodel

**Output:**
- `BusinessData.datamodel` XML (M/TEXT schema)
- Optionally: `BusinessData.xml` (testcase) and `BusinessData.xslt` (identity transform)


In [None]:
# Example: Expected JSON structure from the LLM

EXAMPLE_JSON = {
    "variables": [
        {
            "name": "PartnerID",
            "label": "Partner Number",
            "field_type": "Zahl",
            "data_type": "NUMBER",
            "is_required": True,
            "max_length": 10,
            "value": "12345"
        },
        {
            "name": "Firstname",
            "label": "First Name",
            "field_type": "Freitext",
            "data_type": "TEXT",
            "is_required": True,
            "max_length": 50,
            "value": "John"
        },
        {
            "name": "InvoiceDate",
            "label": "Invoice Date",
            "field_type": "Datum",
            "data_type": "DATETIME",
            "is_required": False,
            "max_length": 0,
            "value": "2024-01-15"
        }
    ]
}

INSTRUCTIONS = "Focus on partner data and invoice details. Ignore footer/header content."



## System Prompt: Field Extraction from OCR

This prompt teaches the LLM to extract business-relevant fields from OCR markdown and structure them as JSON.

### Key Instructions

- **JSON-only output**: No prose, no markdown fences
- **Field selection**: Focus on variable data (names, dates, IDs, amounts), ignore boilerplate
- **Type inference**: Map German field types (Datum, Zahl, Checkbox) to M/TEXT data types
- **Example values**: Extract actual values from the document when possible
- **Validation values**: For dropdown fields, provide available options
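
Because the model is asked for JSON only, it can help to check the shape of the reply before converting anything. The following is a minimal sketch, assuming the `variables` structure used in this notebook; `validate_spec` and `ALLOWED_DATA_TYPES` are hypothetical names introduced here for illustration, not part of M/TEXT or any library.

```python
ALLOWED_DATA_TYPES = {"TEXT", "NUMBER", "DATETIME", "BOOLEAN"}

def validate_spec(spec):
    """Return a list of problems found in the spec; an empty list means it looks usable."""
    variables = spec.get("variables") if isinstance(spec, dict) else None
    if not isinstance(variables, list):
        return ["top-level 'variables' must be a list"]

    problems = []
    seen_names = set()
    for i, var in enumerate(variables):
        if not isinstance(var, dict):
            problems.append(f"variables[{i}]: not an object")
            continue

        # Every variable needs a unique, non-empty name.
        name = var.get("name")
        if not isinstance(name, str) or not name.strip():
            problems.append(f"variables[{i}]: missing or empty 'name'")
        elif name in seen_names:
            problems.append(f"variables[{i}]: duplicate name '{name}'")
        else:
            seen_names.add(name)

        # data_type is optional, but if present it must be a known type.
        data_type = var.get("data_type")
        if data_type and data_type not in ALLOWED_DATA_TYPES:
            problems.append(f"variables[{i}]: unknown data_type '{data_type}'")

        if not isinstance(var.get("is_required", False), bool):
            problems.append(f"variables[{i}]: 'is_required' must be a boolean")

        # bool is a subclass of int in Python, so exclude it explicitly.
        max_length = var.get("max_length", 0)
        if isinstance(max_length, bool) or not isinstance(max_length, int) or max_length < 0:
            problems.append(f"variables[{i}]: 'max_length' must be a non-negative integer")

    return problems
```

If the returned list is non-empty, one option is to send the problems back to the model and ask for a corrected reply rather than patching the JSON by hand.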

### The Prompt (Copy-Paste Ready)


In [None]:
SYSTEM = """You analyze OCR markdown of letters/documents and return a JSON spec of business-relevant variables for a DataModel named BusinessData.

Rules:
- OUTPUT STRICT JSON ONLY. No prose, no markdown fences.
- JSON shape:
  {
    "variables": [
      {
        "name": string,                  // e.g., PartnerID, UserID, Firstname, Zip, City, InvoiceDate, TotalAmount
        "label": string,                 // optional human-friendly label
        "field_type": string,            // e.g., Checkbox, Datum/Date, Dropdown, Freitext, Zahl/Numeric
        "data_type": string,             // optional explicit type (TEXT, NUMBER, BOOLEAN, DATETIME)
        "is_required": boolean,          // default false
        "max_length": number,            // default 0
        "validation_values": [           // for dropdowns/combos
          { "content": string, "description": string, "valId": string }
        ],
        "value": string                  // example value extracted/inferred from OCR; empty string if unknown
      }
    ]
  }
- Focus on short, structured values that vary per recipient/case: dates, numbers, reference IDs, salutations, title, first/last name, address components, totals, currency.
- Prefer body content; ignore headers/footers where appropriate.
- Aim for 10–40 variables depending on detected content; choose sensible defaults when unsure.
"""



## User Prompt (OCR + Instructions)

The user prompt contains the OCR markdown (with page separators) and optional instructions for field selection.


In [None]:
# Example OCR markdown from Mistral (multi-page)
MARKDOWN = """
=== PAGE 1 ===
Invoice
Date: January 15, 2024
Invoice Number: INV-2024-001

Bill To:
John Doe
123 Main Street
Springfield, IL 62701

=== PAGE 2 ===
Items:
- Product A: $100.00
- Product B: $250.00

Subtotal: $350.00
Tax (8%): $28.00
Total: $378.00
"""

USER = f"""OCR Markdown (pages separated by === PAGE N ===):
{MARKDOWN}

Optional notes:
{INSTRUCTIONS}

Task: Return ONLY JSON with the described shape. Do NOT include XML or prose."""
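
Since the instructions are optional, the user prompt may be easier to reuse as a small function. This is a sketch only; `build_user_prompt` is a hypothetical helper that reproduces the template above and substitutes a `(none)` placeholder (an assumption, not something the original prompt specifies) when no instructions are given.

```python
def build_user_prompt(markdown, instructions=""):
    """Assemble the user prompt from OCR markdown and optional instructions."""
    # Fall back to a neutral placeholder so the "Optional notes" slot is never blank.
    notes = (instructions or "").strip() or "(none)"
    return (
        "OCR Markdown (pages separated by === PAGE N ===):\n"
        f"{markdown.strip()}\n\n"
        "Optional notes:\n"
        f"{notes}\n\n"
        "Task: Return ONLY JSON with the described shape. Do NOT include XML or prose."
    )
```

Usage: `USER = build_user_prompt(MARKDOWN, INSTRUCTIONS)`.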



## Implementation Pattern (Framework-Agnostic)

The two-stage pattern can be implemented with any LLM provider:

### Stage 1: OCR
```python
import base64
import os

import requests

MISTRAL_API_KEY = os.environ["MISTRAL_API_KEY"]

# Convert the PDF to a base64 data URL
with open("document.pdf", "rb") as f:
    pdf_b64 = base64.b64encode(f.read()).decode()
data_url = f"data:application/pdf;base64,{pdf_b64}"

# Call Mistral OCR
response = requests.post(
    "https://api.mistral.ai/v1/ocr",
    headers={
        "Authorization": f"Bearer {MISTRAL_API_KEY}",
        "Content-Type": "application/json"
    },
    json={
        "model": "mistral-ocr-latest",
        "document": {"type": "document_url", "document_url": data_url}
    },
    timeout=120,  # assumed value; large PDFs may need longer
)
response.raise_for_status()  # fail on HTTP errors instead of on a missing "pages" key

# Join the per-page markdown with page separators
pages = response.json()["pages"]
markdown = "\n\n".join(f"=== PAGE {i + 1} ===\n{p['markdown']}" for i, p in enumerate(pages))
```

### Stage 2: LLM (Anthropic shown; other providers such as OpenAI follow the same pattern)
```python
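# Example with Anthropic
import os

from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

response = client.messages.create(
    model="claude-sonnet-4-0",
    max_tokens=4096,
    system=SYSTEM,  # the system prompt has its own parameter, separate from the user turn
    messages=[
        {"role": "user", "content": USER}
    ]
)

# The reply text is in the first content block
raw_json = response.content[0].text

# Parse raw_json with json.loads, then generate XML (see next section)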
```
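
Models occasionally wrap JSON in markdown fences or add a sentence of prose despite the instructions, so parsing benefits from a tolerant fallback. The following is a sketch; `extract_json` is a hypothetical helper, and its input is assumed to be the reply text (with the Anthropic SDK, `response.content[0].text`).

```python
import json
import re

# Built from single backticks so this example contains no literal fence marker.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", re.DOTALL)

def extract_json(text):
    """Parse a JSON object from model output, tolerating fences and stray prose."""
    cleaned = text.strip()

    # Unwrap a surrounding markdown fence if the model added one.
    fenced = FENCED_BLOCK.match(cleaned)
    if fenced:
        cleaned = fenced.group(1)

    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Fall back to the span between the first "{" and the last "}".
        start, end = cleaned.find("{"), cleaned.rfind("}")
        if start == -1 or end <= start:
            raise
        return json.loads(cleaned[start:end + 1])
```

The fallback is deliberately simple: it recovers a single JSON object surrounded by prose, and re-raises the original error when no braces are found.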


## JSON → XML Conversion

After extracting the JSON response, convert it to M/TEXT datamodel XML format:



In [None]:
from xml.sax.saxutils import quoteattr

def map_field_type_to_data_type(field_type):
    """Map German/English field type labels to M/TEXT data types."""
    ft = (field_type or "").lower()
    if "checkbox" in ft:
        return "BOOLEAN"
    if "datum" in ft or "date" in ft:
        return "DATETIME"
    if "zahl" in ft or "numeric" in ft or "number" in ft:
        return "NUMBER"
    return "TEXT"

def create_node_xml(variable):
    """Convert one JSON variable to an M/TEXT <Node> element.

    Only the data type, name, label and required flag are emitted;
    max_length and validation_values are not used in this minimal example.
    """
    data_type = variable.get("data_type") or map_field_type_to_data_type(variable.get("field_type", ""))
    is_required = variable.get("is_required", False)
    # quoteattr escapes &, < and > and adds the surrounding quotes,
    # so OCR-derived text cannot break the XML.
    name = quoteattr(variable["name"])
    label = quoteattr(variable.get("label") or "")
    allow_empty = str(not is_required).lower()

    # Build validation XML
    validation = f'''    <Validation allow-empty-value="{allow_empty}"
                dialog-field=""
                label={label}
                operator="ANY"
                validation-type="ANY_VALUE">
      <Values/>
    </Validation>'''

    return f'''  <Node data-type="{data_type}" hierarchical="FLAT" multiple="false" name={name} searchable="true">
{validation}
    <Settings/>
  </Node>'''

# Example: Convert JSON to full datamodel XML
variables = EXAMPLE_JSON["variables"]
nodes_xml = "\n".join([create_node_xml(v) for v in variables])
datamodel_xml = f'''<?xml version="1.0" encoding="UTF-8"?>
<DataModel>
{nodes_xml}
</DataModel>'''

print(datamodel_xml)
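
The optional artifacts can be produced from the same variable list. This is a sketch only: the identity transform is the standard XSLT 1.0 copy template, but the testcase layout (a `BusinessData` root with one child element per variable) is an assumption, not a documented M/TEXT format. `IDENTITY_XSLT` and `build_testcase_xml` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Standard XSLT 1.0 identity transform: copies every node and attribute unchanged.
IDENTITY_XSLT = """<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
"""

def build_testcase_xml(variables, root_name="BusinessData"):
    """Build a testcase document with one child element per variable.

    Assumed layout: <BusinessData><Name>value</Name>...</BusinessData>.
    Variable names must be valid XML element names.
    """
    root = ET.Element(root_name)
    for var in variables:
        child = ET.SubElement(root, var["name"])
        # ElementTree escapes special characters in text when serializing.
        child.text = str(var.get("value", ""))
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")
```

Usage: `testcase_xml = build_testcase_xml(EXAMPLE_JSON["variables"])`, then write `testcase_xml` and `IDENTITY_XSLT` to `BusinessData.xml` and `BusinessData.xslt`.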

