# Model Generation AI Workflow (OCR + M/TEXT DOM)

This notebook documents a two-stage workflow: OCR a PDF into markdown or plain text, then prompt an LLM to generate M/TEXT DOM XML models from that text. It is instructional; the cells illustrate the inputs and prompts and are not meant to run end to end.


## Inputs and preparation

- PDF file (as base64 or reference to storage)
- Optional instructions for model generation
- OCR provider (Mistral OCR) and LLM provider (Anthropic)


## Step 1: OCR (Mistral)

Send the PDF to Mistral OCR as a data URL and capture the markdown or plain text returned for each page. Then concatenate the pages into a single string, with each page preceded by a separator such as `=== PAGE N ===`.
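The two local parts of this step, building the data URL and joining the per-page output, can be sketched without any network call. The helper names `pdf_to_data_url` and `collapse_pages` are hypothetical, and the page shape (a list of dicts with a `markdown` or `text` key) is an assumption about how the OCR response has been normalized. The actual Mistral OCR request is deliberately left out.

```python
import base64


def pdf_to_data_url(pdf_bytes: bytes) -> str:
    """Encode raw PDF bytes as a base64 data URL."""
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return f"data:application/pdf;base64,{encoded}"


def collapse_pages(pages: list[dict]) -> str:
    """Join per-page OCR output into one string with page separators.

    Assumes each page is a dict carrying its text under "markdown"
    (or "text" as a fallback); this shape is illustrative only.
    """
    parts = []
    for number, page in enumerate(pages, start=1):
        text = (page.get("markdown") or page.get("text") or "").strip()
        parts.append(f"=== PAGE {number} ===\n{text}")
    return "\n\n".join(parts)
```

Numbering the separators from 1 keeps them aligned with the `=== PAGE N ===` convention that the prompts in Step 2 refer to.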


In [None]:
# Illustrative placeholders
OCR_MARKDOWN = """
=== PAGE 1 ===
Sehr geehrte Kundin, ...

=== PAGE 2 ===
Weitere Details ...
""".strip()
INSTRUCTIONS = "Generate 1–2 models per page of main body content; omit headers and signatures."
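Before sending the collapsed text to the LLM, it can help to confirm that the page separators are well formed. The following is a minimal sketch that splits such a string back into pages; `split_pages` and `PAGE_MARKER` are hypothetical names, and the sketch assumes each separator sits on its own line.

```python
import re

# Matches a separator line such as "=== PAGE 3 ===" on its own line.
PAGE_MARKER = re.compile(r"^=== PAGE (\d+) ===[ \t]*$", re.MULTILINE)


def split_pages(ocr_text: str) -> dict[int, str]:
    """Map each page number to the text that follows its separator."""
    pages = {}
    matches = list(PAGE_MARKER.finditer(ocr_text))
    for i, match in enumerate(matches):
        # A page runs until the next separator, or to the end of the input.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(ocr_text)
        pages[int(match.group(1))] = ocr_text[match.end():end].strip()
    return pages


sample = "=== PAGE 1 ===\nSehr geehrte Kundin, ...\n\n=== PAGE 2 ===\nWeitere Details ..."
pages = split_pages(sample)
```

Text that appears before the first separator is ignored by this sketch, so an empty result signals that no separators were found.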



## Step 2: Exact prompts

The system and user messages below are the ones used by the server route.


In [None]:
SYSTEM = """
You convert markdown/plaintext into M/Text DOM XML models.

Page markers:
- The input may contain separators like "=== PAGE N ===". Treat each page independently.

Body-only selection (very important):
- INCLUDE only body content (the main letter/article content).
- EXCLUDE headers, footers, addresses, salutatory address blocks, contact blocks, policy/meta blocks, dates, and signatures.
- Drop lines/sections such as (examples):
  - Sender/return address, recipient address (street, ZIP+City), reference numbers
  - Kontakt:, Telefon:, Telefax:, E-Mail:, Internet:
  - Datum:
  - Postanschrift:
  - Versicherungsnehmer:, Versicherungsschein-Nr.:
  - salutatory address fields (e.g., Herrn, Frau as address lines)
  - signature lines and greetings (e.g., Mit freundlichen Grüßen)
- Heuristic for letters: start from first greeting (e.g., "Sehr geehrte...") and stop before closing/signature.

Output requirements:
- Output ONLY valid M/Text DOM XML snippets. No prose, no markdown fences.
- For each model, the root element MUST be <ContainerPart xmlns="urn:kwsoft:mtext:tonic:dom">.
- Inside, include <DataDefinition></DataDefinition> (empty is fine), followed by content.
- Represent paragraphs as:
  <Par><Span><Text>...text...</Text></Span></Par>
  - Preserve empty paragraphs using <Text></Text>.
  - If a clearly named style is implied, you MAY include <Style parentName="..."></Style> inside <Span>; otherwise omit.
- Represent tables as:
  <Table>
    <Headers><Header><Row><Cell><Container><Par><Span><Text>...</Text></Span></Par></Container></Cell>...</Row></Header></Headers>
    <Row>...cells as above...</Row>
    ... additional rows ...
  </Table>
- Do NOT invent business-specific tags (e.g., <ResponseForm>, <PolicyInfo>, etc.). Use ONLY these tags: ContainerPart, DataDefinition, Par, Span, Text, Style, Table, Headers, Header, Footers, Footer, Row, Cell, Container, Columns.

Splitting into models:
- Create 1–3 models per page of BODY content. If a page has no body, skip it.
- Each model must be a complete <ContainerPart>...</ContainerPart> block.

Example structure (truncated):
<ContainerPart xmlns="urn:kwsoft:mtext:tonic:dom">
  <DataDefinition></DataDefinition>
  <Par>
    <Span>
      <Text>Here is a paragraph</Text>
    </Span>
  </Par>
  <Par>
    <Span>
      <Style parentName="CompanyHighlight"></Style>
      <Text>Here is a paragraph with a style</Text>
    </Span>
  </Par>
  <Table>
    <Headers>
      <Header>
        <Row>
          <Cell><Container><Par><Span><Text></Text></Span></Par></Container></Cell>
          <Cell><Container><Par><Span><Text></Text></Span></Par></Container></Cell>
        </Row>
      </Header>
    </Headers>
    <Row>
      <Cell><Container><Par><Span><Text></Text></Span></Par></Container></Cell>
      <Cell><Container><Par><Span><Text></Text></Span></Par></Container></Cell>
    </Row>
  </Table>
</ContainerPart>
"""

USER = f"""
Markdown/Plaintext (pages separated by === PAGE N ===):
{OCR_MARKDOWN}

Optional instructions:
{INSTRUCTIONS}

Task: For each page, extract BODY content only (per rules) and produce 1–3 M/Text DOM XML models. Each model must be a complete <ContainerPart xmlns="urn:kwsoft:mtext:tonic:dom">...</ContainerPart> block. Do NOT include headers, addresses, contact lines, dates, policy/meta blocks, or signatures.
""".strip()
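The output requirements in the system prompt (root element, leading `DataDefinition`, fixed tag list) lend themselves to a mechanical check on each returned model. The sketch below uses the standard library's `xml.etree.ElementTree`. The function name `check_model` is hypothetical, and the check covers only the rules stated in the prompt, not the full M/Text DOM schema.

```python
import xml.etree.ElementTree as ET

NS = "urn:kwsoft:mtext:tonic:dom"
# Tag whitelist copied from the system prompt.
ALLOWED_TAGS = {
    "ContainerPart", "DataDefinition", "Par", "Span", "Text", "Style",
    "Table", "Headers", "Header", "Footers", "Footer", "Row", "Cell",
    "Container", "Columns",
}


def check_model(xml_snippet: str) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    try:
        root = ET.fromstring(xml_snippet)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    problems = []
    if root.tag != f"{{{NS}}}ContainerPart":
        problems.append(f"unexpected root element: {root.tag}")

    children = list(root)
    if not children or children[0].tag != f"{{{NS}}}DataDefinition":
        problems.append("first child is not DataDefinition")

    for element in root.iter():
        # ElementTree spells namespaced tags as "{namespace}LocalName".
        namespace, _, local = element.tag.rpartition("}")
        if namespace.lstrip("{") != NS or local not in ALLOWED_TAGS:
            problems.append(f"disallowed element: {element.tag}")
    return problems
```

Returning a list of problems instead of raising lets a caller decide whether to drop a single bad model or retry the whole generation.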



## Call sequence (abstract)

1. Load the PDF and convert it to a base64 data URL.
2. Send it to Mistral OCR and collect the resulting `markdown`.
3. Build `SYSTEM` and `USER` exactly as above.
4. Invoke the LLM with those messages, then extract each `<ContainerPart>...</ContainerPart>` block from the raw response.
5. Return `{ markdown, models }`.
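
Steps 4 and 5 can be sketched as follows. The names `extract_models` and `run_workflow` are hypothetical, and the LLM call is passed in as a plain callable (`invoke_llm`) so the sketch does not assume any particular provider SDK. The regular expression assumes `ContainerPart` elements are never nested.

```python
import re
from typing import Callable

# Non-greedy match of one complete ContainerPart block, across newlines.
CONTAINER_PART = re.compile(
    r"<ContainerPart\b[^>]*>.*?</ContainerPart>", re.DOTALL
)


def extract_models(raw: str) -> list[str]:
    """Pull every <ContainerPart>...</ContainerPart> block out of raw LLM text.

    Surrounding prose or markdown fences, if the model emits any despite
    the prompt, are discarded.
    """
    return [match.group(0).strip() for match in CONTAINER_PART.finditer(raw)]


def run_workflow(
    ocr_markdown: str,
    system: str,
    user: str,
    invoke_llm: Callable[[str, str], str],
) -> dict:
    """Call the LLM and return the { markdown, models } payload."""
    raw = invoke_llm(system, user)
    return {"markdown": ocr_markdown, "models": extract_models(raw)}
```

Extracting by pattern instead of parsing the whole response as XML keeps the step tolerant of stray text around the models.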
