# Monitorul Oficial (Part IV): S.R.L. Parsing Notebook

This notebook is a developer guide, runnable with minor adaptation, for parsing and structuring **Monitorul Oficial (Part IV)** documents, focusing on **S.R.L.** entries (limited liability companies). It includes two concrete examples, parsing skeletons (BeautifulSoup + regex), a categorization step, a Pydantic-style model of the target schema, and a placeholder for LLM-based extraction using **PydanticAI**. A mermaid diagram at the end explains the pipeline.

Save this notebook, run cells sequentially, and adapt selectors/parsers to your HTML corpus.


## Objectives

- Parse HTML files (iLegis and Lege5) into individual JSON `entries`.
- Categorize entries and focus on **S.R.L.** documents.
- Implement and evaluate *Heuristic* and *LLM (PydanticAI structured output)* approaches.
- Produce final JSON compatible with the provided exhaustive schema.
- Benchmark accuracy and throughput; manual / pattern-based validation only (no golden set).


## Concrete Examples

Two real examples from the uploaded scope document are embedded below as raw text. Use these for early development and unit tests.


In [None]:
# ROCCO example raw text
rocco_text = """
Societatea ROCCO & MIHA SWEET - S.R.L.

DECIZIA NR. 1

din data de 16.10.2025 a asociatului unic al S.C. ROCCO & MIHA SWEET - S.R.L.

Subsemnatul, PASTUKOV \u0218TEFAN, cet\u0103\u021bean rom\u00e2n, n\u0103scut la data de 28.11.1975 \u00een mun. Bucure\u0219ti, sectorul 6, domiciliat \u00een mun. Bucure\u0219ti, sectorul 6, bd. Timi\u0219oara nr. 44, bl. RATB, sc. 1, et. 4, ap. 102, posesor al CI seria RZ nr. 284063, eliberat\u0103 de SPCLEP Sector 6, la data de 22.12.2023, valabil\u0103 p\u00e2n\u0103 la data de 03.08.2031, \u00een calitate de asociat unic al firmei ROCCO & MIHA SWEET - S.R.L., cu sediul \u00een mun. Bucure\u0219ti, sectorul 2, strada Grigore Ionescu nr. 63, bl. T73, sc. 2, et. 4, ap. 42, camera 1, \u00eendeplinind dispozi\u021biile constitutive \u0219i legale, prin prezenta, decid;

Art. 1.Revocarea din func\u021bia de administrator a doamnei MICLE MIHAELA-FLORICA, n\u0103scut\u0103 la data de 18.04.1977 \u00een ora\u0219ul Baia-Mare, jud. Maramure\u0219, domiciliat\u0103 \u00een mun. Bucure\u0219ti, sectorul 6, bd. Timi\u0219oara nr. 44, bl. RATB, sc. 1, et. 4, ap. 102, legitimat\u0103 cu CI seria RZ nr. 284061, emis\u0103 la data de 22.12.2023, valabil\u0103 p\u00e2n\u0103 la 03.08.2031, cu men\u021binerea administratorului existent, PASTUKOV \u0218TEFAN, cu puteri depline de exercitare, reprezentare \u0219i administrare pentru o perioad\u0103 de timp de 60 de ani, de la data numirii, p\u00e2n\u0103 la data de 01.06.2082.

Art. 2. Actualizarea actului constitutiv cu modific\u0103rile din prezenta decizie.

Semnat\u0103 la Bucure\u0219ti ast\u0103zi, data de 16.10.2024.

(6/8.408.087)
"""
print(rocco_text[:800] + '...')

In [None]:
# MADEROS example raw text
maderos_text = """
Societatea MADEROS DEVELOPMENT - S.R.L.

ROM\u00c2NIA

MINISTERUL JUSTI\u021aIEI

OFICIUL NA\u021aIONAL AL REGISTRULUI COMER\u021aULUI

OFICIUL REGISTRULUI COMER\u021aULUI DE PE L\u00c2NG\u0102 TRIBUNALUL ILFOV

EXTRAS AL \u00ceNCHEIERII NR. 464529/29.05.2025

\u00een baza cererii nr. 2052358 din data de 27.05.2025 \u0219i a actelor doveditoare depuse, Emma Madalina Leonida -registrator de registrul comer\u021bului conform art. 107 alin.

( 1) din Legea nr. 265/2022 privind registrul comer\u021bului ... a dispus autorizarea constituirii, \u00eenmatricularea \u0219i \u00eenregistrarea profesionistului:

- denumire \u0219i form\u0103 juridic\u0103: MADEROS DEVELOPMENT- S.R.L.;

- cod unic de \u00eenregistrare: 51885179;

- identificator unic la nivel european (EUID): ROONRC.J2025038755001;

- num\u0103r de ordine \u00een registrul comer\u021bului: J2025038755001;

- sediul social: jud. Ilfov, ora\u0219ul Voluntari, \u0219oseaua Erou Iancu Nicolae nr. 84, scara B, etaj 6, ap. B.6.1;

- domeniul principal de activitate: grupa CAEN: 681 -Cump\u0103rarea \u0219i v\u00e2nzarea de bunuri imobiliare proprii \u0219i dezvoltare imobiliar\u0103;

- activitate principal\u0103: 6811 - Cump\u0103rarea \u0219i v\u00e2nzarea de bunuri imobiliare proprii;

- capital social: 100 lei, total p\u0103r\u021bi sociale: 10 a c\u00e2te 10 lei fiecare;

- fondator: Macadrai Rodica, cu domiciliul \u00een Rom\u00e2nia, Ilfov, ora\u0219ul Voluntari;

- administrator: Macadrai Rodica, cu domiciliul \u00een Rom\u00e2nia, Ilfov, ora\u0219ul Voluntari;

- durata de func\u021bionare: nedeterminat\u0103.

(70/8.687.902)
"""
print(maderos_text[:800] + '...')

## Target JSON Schema (exhaustive format)

Below is the target schema that parsed entries should conform to. Use this as the design for Pydantic models / validation.

In [None]:
schema = {
  "id": "string",
  "type": "string",
  "name": "string",
  "mainInfo": {
    "addresses": [
      {
        "fullAddress": "string",
        "country": "string",
        "county": "string",
        "city": "string"
      }
    ],
    "caen_principal": "string",
    "caen_secundar": "string",
    "cui": "string",
    "dateOfCreation": "string",
    "euid": "string",
    "capital": "string",
    "ownership": [
      {
        "founders": [
          {
            "name": "string",
            "birth_date": "date",
            "startDate": "date",
            "endDate": "date",
            "address": "string",
            "place_of_birth": "string",
            "citizenship": "string",
            "type_ID": "string",
            "series_ID": "string",
            "number_ID": "string",
            "cnp": "string"
          }
        ],
        "administrators": [
          {
            "name": "string",
            "birth_date": "date",
            "startDate": "date",
            "endDate": "date",
            "address": "string",
            "place_of_birth": "string",
            "citizenship": "string",
            "type_ID": "string",
            "series_ID": "string",
            "number_ID": "string",
            "cnp": "string"
          }
        ],
        "associates": [
          {
            "name": "string",
            "birth_date": "date",
            "startDate": "date",
            "endDate": "date",
            "address": "string",
            "place_of_birth": "string",
            "citizenship": "string",
            "type_ID": "string",
            "series_ID": "string",
            "number_ID": "string",
            "cnp": "string",
            "percentage_ownership": "string"
          }
        ]
      }
    ],
    "activityFieldDescription": "string",
    "fieldOfActivity": "string",
    "country": "string",
    "dataSource": [
      "string"
    ],
    "otherName": "string",
    "registrationNumber": "string"
  }
}
print('Schema keys:', list(schema.keys()))

### Example filled JSON (for testing validation & mapping)

In [None]:
example = {
  "id": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
  "type": "company",
  "name": "Innovatech Solutions SRL",
  "mainInfo": {
    "addresses": [
      {
        "fullAddress": "Strada Exemplu Nr. 123, Sector 1, Bucure\u0219ti, 010101",
        "country": "Romania",
        "county": "Bucure\u0219ti",
        "city": "Bucure\u0219ti"
      },
      {
        "fullAddress": "Bulevardul Tehnologiei Nr. 45, Cluj-Napoca, 400000",
        "country": "Romania",
        "county": "Cluj",
        "city": "Cluj-Napoca"
      }
    ],
    "caen_principal": "6201",
    "cui": "RO12345678",
    "dateOfCreation": "2018-05-15",
    "euid": "ROONRC.J40/12345/2018",
    "capital": "200 lei capital social",
    "ownership": [
      {
        "administrators": [
          {
            "name": "Popescu Andrei",
            "birth_date": "1988-01-01",
            "startDate": "2018-05-15",
            "endDate": None,
            "address": "str. Exemplu Nr. 321 etc etc",
            "place_of_birth": "jude\u021b D\u00e2mbovi\u021ba, Comuna Oricare",
            "citizenship": "cet\u0103\u021bean rom\u00e2n",
            "type_ID": "CI",
            "series_ID": "RT",
            "number_ID": "111222",
            "cnp": "1880303430088"
          }
        ],
        "associates": [
          {
            "name": "Marinescu Vasile",
            "birth_date": "1985-03-10",
            "cnp": "1850310345678",
            "percentage_ownership": "50",
            "startDate": "2018-05-15",
            "endDate": None
          }
        ]
      }
    ],
    "activityFieldDescription": "Custom software development activities ...",
    "fieldOfActivity": "Information Technology",
    "country": "Romania",
    "dataSource": [
      "Romanian Trade Register",
      "Official Gazette"
    ],
    "otherName": "Innovatech IT",
    "registrationNumber": "J40/12345/2018"
  }
}
print('Example name:', example['name'])
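
A filled entry can drift from the target schema through renamed or missing keys. One way to catch this early is a recursive key comparison. The sketch below is a minimal illustration, not a full validator: `key_diff`, `toy_schema`, and `toy_entry` are hypothetical names introduced here, and only key names are compared (value types are ignored).

```python
def key_diff(schema, data, path=""):
    """Return (missing, unexpected) key paths of `data` relative to `schema`."""
    missing, unexpected = [], []
    if isinstance(schema, dict) and isinstance(data, dict):
        for key in schema:
            child = f"{path}.{key}" if path else key
            if key not in data:
                missing.append(child)
            else:
                m, u = key_diff(schema[key], data[key], child)
                missing += m
                unexpected += u
        for key in data:
            if key not in schema:
                unexpected.append(f"{path}.{key}" if path else key)
    elif isinstance(schema, list) and isinstance(data, list) and schema:
        # The first schema item is treated as the template for every list element.
        for i, item in enumerate(data):
            m, u = key_diff(schema[0], item, f"{path}[{i}]")
            missing += m
            unexpected += u
    return missing, unexpected

# Toy data (hypothetical), shaped like a fragment of the target schema.
toy_schema = {"name": "string", "mainInfo": {"cui": "string", "caen_principal": "string"}}
toy_entry = {"name": "Demo SRL", "mainInfo": {"cui": "12345678", "caen": "6201"}}

missing, unexpected = key_diff(toy_schema, toy_entry)
print("missing:", missing)
print("unexpected:", unexpected)
```

The same function can be called as `key_diff(schema, example)` once both cells above have run.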

## Step 1 — HTML segmentation example (BeautifulSoup + heuristics)

In [None]:

# Parsing skeleton: segment an HTML "issue" into individual "entries".
# This cell demonstrates the approach using BeautifulSoup on a string.
from bs4 import BeautifulSoup
import re

def segment_issue_html(html_text):
    """Split one issue into entries using a generic heading heuristic."""
    soup = BeautifulSoup(html_text, 'html.parser')
    # Use a newline separator so text from separate elements stays on separate lines.
    text = soup.get_text("\n")
    # Heuristic: many entries start with "Societatea" or "Societatea comercială".
    # Split at the start of any line that begins with "Societatea"; the zero-width
    # lookahead keeps the heading inside the entry it opens (needs Python 3.7+).
    parts = re.split(r'(?m)^(?=[ \t]*Societatea\b)', text)
    return [p.strip() for p in parts if p.strip()]

# Quick demo using the ROCCO + MADEROS combined as if they were in one issue.
combined = rocco_text + "\n" + maderos_text
entries = segment_issue_html(combined)
print(f"Found {len(entries)} entries. First 80 chars of each:")
for i, e in enumerate(entries):
    print(i, repr(e[:80]))
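
Both sample entries end with a parenthesised reference line, `(6/8.408.087)` and `(70/8.687.902)`. If that holds across the corpus (an assumption to verify on real issues), the trailer gives a second segmentation signal that does not depend on the heading wording. A minimal sketch, where `TRAILER_RE` and `segment_by_trailer` are hypothetical names:

```python
import re

# Assumption: every entry ends with a line such as "(6/8.408.087)".
TRAILER_RE = re.compile(r'^\(\d+/[\d.]+\)[ \t]*$', re.MULTILINE)

def segment_by_trailer(text):
    """Split text into entries, each ending at (and including) a trailer line."""
    entries, start = [], 0
    for m in TRAILER_RE.finditer(text):
        chunk = text[start:m.end()].strip()
        if chunk:
            entries.append(chunk)
        start = m.end()
    tail = text[start:].strip()
    if tail:
        # Unterminated remainder: keep it so it can be reviewed, not silently lost.
        entries.append(tail)
    return entries

sample = (
    "Societatea ALFA - S.R.L.\nDECIZIA NR. 1\n(6/8.408.087)\n\n"
    "Societatea BETA - S.R.L.\nEXTRAS\n(70/8.687.902)\n"
)
trailer_entries = segment_by_trailer(sample)
for entry in trailer_entries:
    print(repr(entry))
```

Comparing the entry counts from the heading split and the trailer split is a cheap consistency check: a mismatch points at an issue that needs manual inspection.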


## Step 2 — Categorization (keep SRL only)

In [None]:

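# Categorization: detect company type (SRL, SA, PFA, etc.) from the entry text.
import re

# Match on the original casing: lowercase "sa" is a common Romanian word,
# so a case-insensitive "sa" pattern would produce false positives.
LEGAL_FORM_PATTERNS = [
    ("SRL", re.compile(r'\bS\.?\s?R\.?\s?L\b\.?')),
    ("SA", re.compile(r'\bS\.\s?A\.|\bSA\b')),
    ("PFA", re.compile(r'\bP\.?\s?F\.?\s?A\b\.?')),
]

def categorize_entry(text):
    """Return the legal form found in the entry heading, else in the full text."""
    stripped = text.strip()
    heading = stripped.splitlines()[0] if stripped else ""
    # The heading is checked first because the body may mention other companies.
    for scope in (heading, text):
        for label, pattern in LEGAL_FORM_PATTERNS:
            if pattern.search(scope):
                return label
    return "UNKNOWN"

# Demo:
for i, e in enumerate(entries):
    print(i, categorize_entry(e))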
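
Since only SRL entries are kept, it can help to read the company name and legal form from the heading line in one pass. The sketch below assumes the heading follows the layout `Societatea <NAME> - <LEGAL FORM>` seen in the samples; `HEADER_RE` and `parse_header` are hypothetical names, and "EXEMPLU HOLDING" is an invented company used only for the demo.

```python
import re

# Assumed heading layout: "Societatea <NAME> - <LEGAL FORM>" on a single line.
HEADER_RE = re.compile(
    r'^Societatea\s+(?P<name>.+?)\s*-\s*(?P<form>S\.?\s?R\.?\s?L\.?|S\.\s?A\.)\s*$',
    re.MULTILINE,
)

def parse_header(entry_text):
    """Return (company_name, legal_form), or (None, None) if no heading matches."""
    m = HEADER_RE.search(entry_text)
    if not m:
        return None, None
    # Normalise "S.R.L." / "S. R. L." / "SRL" to a dot-free uppercase label.
    form = re.sub(r'[.\s]', '', m.group('form')).upper()
    return m.group('name').strip(), form

sample_entries = [
    "Societatea ROCCO & MIHA SWEET - S.R.L.\n\nDECIZIA NR. 1",
    "Societatea MADEROS DEVELOPMENT- S.R.L.\n\nEXTRAS",
    "Societatea EXEMPLU HOLDING - S.A.\n\nDECIZIA NR. 2",  # invented, for contrast
]
srl_only = [e for e in sample_entries if parse_header(e)[1] == "SRL"]
print(len(srl_only), "of", len(sample_entries), "entries kept")
```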


## Step 3 — Heuristic extraction examples (CUI, CAEN, dates, names)

In [None]:

# Heuristic extraction: CUI, CAEN, dates, administrators
import re

def extract_cui(text):
    """Return the CUI digits, or None. Only labelled numbers are accepted."""
    m = re.search(r'cod unic de înregistrare[:\s\-]*(?:RO)?\s?([0-9]{2,10})\b', text, re.IGNORECASE)
    if m:
        return m.group(1)
    # Alternative label such as "CUI: RO12345678". The label is required so that
    # unrelated long numbers (request numbers, ID numbers) are never returned.
    m2 = re.search(r'\bCUI[:\s\-]*(?:RO)?\s?([0-9]{2,10})\b', text, re.IGNORECASE)
    if m2:
        return m2.group(1)
    return None

def extract_caen(text):
    """Return the 4-digit CAEN class if present, else the 3-digit group."""
    # Prefer the main activity line, e.g. "activitate principală: 6811 - ...".
    m = re.search(r'activitate principală[:\s\-]*([0-9]{4})\b', text, re.IGNORECASE)
    if m:
        return m.group(1)
    m = re.search(r'\bCAEN[:\s\-]*([0-9]{4})\b', text, re.IGNORECASE)
    if m:
        return m.group(1)
    m = re.search(r'grupa CAEN[:\s\-]*([0-9]{3})\b', text, re.IGNORECASE)
    if m:
        return m.group(1)
    return None

def extract_dates(text):
    # Find dates like dd.mm.yyyy or dd/mm/yyyy
    return re.findall(r'\b(\d{2}\.\d{2}\.\d{4}|\d{2}/\d{2}/\d{4})\b', text)

def extract_names_simple(text):
    # Very simple heuristic: capture the name after an "administrator" label
    # (any casing). The name must start with an uppercase letter and ends at a
    # comma, semicolon, or line break. Free-text mentions are not handled.
    names = re.findall(r'(?i:administrator)[:\s]+([A-ZĂÂÎȘȚ][^,;\r\n]+)', text)
    return [n.strip() for n in names]

print("ROCCO CUI:", extract_cui(rocco_text))
print("MADEROS CUI:", extract_cui(maderos_text))
print("MADEROS CAEN:", extract_caen(maderos_text))
print("ROCCO dates:", extract_dates(rocco_text))
print("MADEROS dates:", extract_dates(maderos_text))
print("ROCCO admins:", extract_names_simple(rocco_text))
print("MADEROS admins:", extract_names_simple(maderos_text))
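
Registry excerpts such as the MADEROS sample list their fields as `- label: value;` lines, so a single generic pass can collect all of them before any field-specific regex runs. This is a sketch under that layout assumption; `FIELD_RE` and `parse_labeled_fields` are hypothetical names, and free-text decisions such as the ROCCO sample are not covered by it.

```python
import re

# Assumption: each field sits on its own line as "- label: value;".
# The label ends at the first colon, so colons inside the value are preserved.
FIELD_RE = re.compile(
    r'^-[ \t]*(?P<label>[^:\n]+?)[ \t]*:[ \t]*(?P<value>[^\n]+?)[ \t]*;?[ \t]*$',
    re.MULTILINE,
)

def parse_labeled_fields(text):
    """Return {lowercased label: value}; the first occurrence of a label wins."""
    fields = {}
    for m in FIELD_RE.finditer(text):
        fields.setdefault(m.group('label').lower(), m.group('value'))
    return fields

sample = (
    "- cod unic de \u00eenregistrare: 51885179;\n"
    "- identificator unic la nivel european (EUID): ROONRC.J2025038755001;\n"
    "- capital social: 100 lei, total p\u0103r\u021bi sociale: 10 a c\u00e2te 10 lei fiecare;\n"
    "- administrator: Macadrai Rodica, cu domiciliul \u00een Rom\u00e2nia, Ilfov, ora\u0219ul Voluntari;\n"
)
fields = parse_labeled_fields(sample)
for label, value in fields.items():
    print(f"{label!r}: {value!r}")
```

Only a trailing semicolon is stripped; a trailing period is left in place because values may legitimately end in an abbreviation.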


## Step 4 — Target model (Pydantic / PydanticAI mapping)

Define models that mirror the exhaustive schema. If you use PydanticAI structured output, define a Pydantic `BaseModel` that matches this structure and pass it to the agent as its output type.

In [None]:

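# Pydantic models for the target schema (requires `pip install pydantic`).
# With PydanticAI, CompanyModel (or a subset of it) can serve as the output type.
from typing import List, Optional
from pydantic import BaseModel, Field

class Address(BaseModel):
    fullAddress: Optional[str] = None
    country: Optional[str] = None
    county: Optional[str] = None
    city: Optional[str] = None

class Person(BaseModel):
    # Shared shape of founders and administrators in the schema.
    name: Optional[str] = None
    birth_date: Optional[str] = None
    startDate: Optional[str] = None
    endDate: Optional[str] = None
    address: Optional[str] = None
    place_of_birth: Optional[str] = None
    citizenship: Optional[str] = None
    type_ID: Optional[str] = None
    series_ID: Optional[str] = None
    number_ID: Optional[str] = None
    cnp: Optional[str] = None

class Associate(Person):
    # Associates carry one extra field on top of the person fields.
    percentage_ownership: Optional[str] = None

class Ownership(BaseModel):
    founders: List[Person] = Field(default_factory=list)
    administrators: List[Person] = Field(default_factory=list)
    associates: List[Associate] = Field(default_factory=list)

class MainInfo(BaseModel):
    addresses: List[Address] = Field(default_factory=list)
    caen_principal: Optional[str] = None
    caen_secundar: Optional[str] = None
    cui: Optional[str] = None
    dateOfCreation: Optional[str] = None
    euid: Optional[str] = None
    capital: Optional[str] = None
    ownership: List[Ownership] = Field(default_factory=list)
    activityFieldDescription: Optional[str] = None
    fieldOfActivity: Optional[str] = None
    country: Optional[str] = None
    dataSource: List[str] = Field(default_factory=list)
    otherName: Optional[str] = None
    registrationNumber: Optional[str] = None

class CompanyModel(BaseModel):
    id: Optional[str] = None
    type: Optional[str] = None
    name: Optional[str] = None
    mainInfo: Optional[MainInfo] = None

print('Pydantic models defined:', [m.__name__ for m in (Address, Person, Associate, Ownership, MainInfo, CompanyModel)])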
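
The heuristic extractors return flat values, while the target schema is nested. A small assembly step can bridge the two. The sketch below uses only the standard library and plain dicts so that it runs without pydantic; `build_entry` and the keys of the `flat` dict are hypothetical, and the `"company"`, `"Romania"`, and `"Official Gazette"` defaults are taken from the example entry above.

```python
import json
import uuid

def build_entry(name, flat, source="Official Gazette"):
    """Assemble a nested dict in the target-schema layout from flat extracted fields."""
    return {
        "id": str(uuid.uuid4()),
        "type": "company",
        "name": name,
        "mainInfo": {
            "addresses": [{"fullAddress": flat["address"]}] if flat.get("address") else [],
            "caen_principal": flat.get("caen_principal"),
            "cui": flat.get("cui"),
            "euid": flat.get("euid"),
            "capital": flat.get("capital"),
            "ownership": [{
                "founders": [{"name": n} for n in flat.get("founders", [])],
                "administrators": [{"name": n} for n in flat.get("administrators", [])],
                "associates": [],
            }],
            "country": "Romania",
            "dataSource": [source],
            "registrationNumber": flat.get("registration_number"),
        },
    }

# Values copied from the MADEROS sample text.
flat = {
    "cui": "51885179",
    "euid": "ROONRC.J2025038755001",
    "registration_number": "J2025038755001",
    "caen_principal": "6811",
    "founders": ["Macadrai Rodica"],
    "administrators": ["Macadrai Rodica"],
}
entry = build_entry("MADEROS DEVELOPMENT", flat)
print(json.dumps(entry, indent=2, ensure_ascii=False))
```

When pydantic is available, the resulting dict can be passed to the company model for validation.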


## Step 5 — LLM-based structured extraction (PydanticAI) — placeholder

This step calls PydanticAI. The cell below is an illustrative sketch: it needs the `pydantic-ai` package, network access, and provider API credentials, and it will not run without them.

In [None]:

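# LLM (PydanticAI) integration sketch.
# Requires the `pydantic-ai` package, network access, and a provider API key.
# Names follow recent PydanticAI releases; older releases used `result_type=`
# and `result.data` instead of `output_type=` and `result.output`.
from typing import List, Optional
from pydantic import BaseModel, Field
from pydantic_ai import Agent

class CompanyData(BaseModel):
    # Optional fields let the model return null instead of inventing a value.
    name: Optional[str] = None
    cui: Optional[str] = None
    euid: Optional[str] = None
    registration_number: Optional[str] = None
    addresses: List[str] = Field(default_factory=list)
    caen_principal: Optional[str] = None
    caen_secundar: Optional[str] = None
    administrators: List[str] = Field(default_factory=list)
    associates: List[str] = Field(default_factory=list)

# Initialize the agent (replace with your provider/model).
agent = Agent("openai:gpt-4", output_type=CompanyData)

# Example run on the ROCCO text.
# If `run_sync` fails inside Jupyter because an event loop is already running,
# use `result = await agent.run(...)` instead.
result = agent.run_sync(
    f"Extract structured company data from the following text:\n\n{rocco_text}"
)

print(result.output)
# Important: implement hallucination detection by cross-checking LLM outputs with
# deterministic regex extractions; log mismatches for manual review.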
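
The cross-check mentioned above can be sketched without any LLM dependency. The idea is twofold: an extracted value should appear verbatim in the source text, and where a deterministic extractor also produced a value, the two should agree. `cross_check` and `normalize` are hypothetical helpers; the exact-substring rule is an assumption that will be too strict for fields the LLM legitimately reformats, such as dates.

```python
import re

def normalize(value):
    """Lowercase and collapse whitespace so comparisons ignore layout differences."""
    return re.sub(r'\s+', ' ', str(value)).strip().lower()

def cross_check(source_text, llm_fields, regex_fields):
    """Return (field, value, reason) tuples for LLM values that look suspicious."""
    haystack = normalize(source_text)
    issues = []
    for key, value in llm_fields.items():
        if value in (None, ""):
            continue  # nothing was claimed, so nothing to verify
        if normalize(value) not in haystack:
            issues.append((key, value, "not found verbatim in source"))
        expected = regex_fields.get(key)
        if expected is not None and normalize(expected) != normalize(value):
            issues.append((key, value, f"regex extracted {expected!r}"))
    return issues

source = (
    "- cod unic de inregistrare: 51885179;\n"
    "- administrator: Macadrai Rodica, cu domiciliul in Romania;\n"
)
llm_fields = {"cui": "51885197", "administrator": "Macadrai  Rodica", "euid": None}
regex_fields = {"cui": "51885179"}

issues = cross_check(source, llm_fields, regex_fields)
for issue in issues:
    print(issue)
```

In this demo the transposed CUI digits are flagged twice (absent from the source and different from the regex result), while the administrator name passes despite the extra space.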


## Step 6 — Benchmarking & Throughput (skeleton)

Provide real benchmarking by running the pipeline on a representative sample and measuring time, memory, and error rates. The cell below is a mock skeleton to adapt.

In [None]:

# Benchmarking skeleton (mock). Replace with actual batch processing code.
import random

def mock_process_batch(n, per_doc_range=(0.001, 0.005)):
    """Return the simulated total processing time in seconds for n documents."""
    # Simulate a variable per-document cost. The default range stands in for
    # the heuristic path; pass a larger range to simulate slower LLM calls.
    low, high = per_doc_range
    return sum(random.uniform(low, high) for _ in range(n))

n = 10000
elapsed = mock_process_batch(n)
throughput = n / elapsed
print(f"Mock throughput: {throughput:.1f} docs/sec (heuristic simulated). For 3M docs: approx {3_000_000/throughput/3600:.1f} hours")
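
To replace the mock with a real measurement, the processing function can be timed directly with `time.perf_counter`. The sketch below takes the best of several runs to reduce noise; `measure_throughput` and `estimate_hours` are hypothetical helpers, and the lambda stands in for a real extractor.

```python
import re
import time

def measure_throughput(process, docs, repeats=3):
    """Return the best-of-N throughput (docs/sec) of calling process(doc) on every doc."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for doc in docs:
            process(doc)
        best = min(best, time.perf_counter() - start)
    return len(docs) / best if best > 0 else float("inf")

def estimate_hours(total_docs, docs_per_sec):
    """Project the wall-clock hours needed for total_docs at a measured rate."""
    return total_docs / docs_per_sec / 3600

docs = ["- cod unic de inregistrare: 51885179;"] * 2000
rate = measure_throughput(lambda d: re.search(r'\d{8}', d), docs)
print(f"{rate:.0f} docs/sec; 3M docs would take about {estimate_hours(3_000_000, rate):.2f} hours")
```

Single-process timings like this ignore I/O and, for the LLM path, network latency and rate limits, so they give an upper bound on throughput.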



## Step 7 — Validation & Acceptance Checks

Because no golden dataset is available, use a combination of:
- Deterministic regex checks for critical fields (CUI, CAEN, dates, registration numbers).
- Manual review of random samples (stratified by subtype).
- Cross-check LLM outputs with heuristic extractions and flag mismatches.
- Keep track of precision/recall metrics on the manual samples.
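
Precision and recall on a manually reviewed sample can be computed per field from pairs of extracted and reviewer-confirmed values. The sketch below uses one common convention, stated here as an assumption: a wrong extraction counts as both a false positive and a false negative. `precision_recall` is a hypothetical helper.

```python
def precision_recall(reviews):
    """reviews: iterable of (extracted, truth) pairs; None means 'absent'."""
    tp = fp = fn = 0
    for extracted, truth in reviews:
        if extracted is not None and extracted == truth:
            tp += 1
        else:
            if extracted is not None:
                fp += 1  # something was extracted, but it is wrong or spurious
            if truth is not None:
                fn += 1  # a true value exists but was missed or extracted wrongly
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

reviews = [
    ("51885179", "51885179"),   # correct
    ("6811", "6811"),           # correct
    ("681", "6811"),            # wrong value: counts as FP and FN
    (None, "J2025038755001"),   # missed: FN
    ("RO1", None),              # spurious: FP
    (None, None),               # correctly absent: not counted
]
precision, recall = precision_recall(reviews)
print(f"precision={precision:.2f} recall={recall:.2f}")
```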

Example deterministic checks:
- CUI must be 2-10 digits (optionally preceded by 'RO'); both samples here use 8 digits.
- CAEN codes are 3-4 digits (3 for a group, 4 for a class).
- Dates must match dd.mm.yyyy in the source and yyyy-mm-dd after normalization.
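
These checks can be sketched as small validators. The CUI control-digit rule below (weights `753217532`, sum times 10, modulo 11, with 10 mapped to 0) is the commonly documented one and is not stated in the source material, so treat it as an assumption to confirm against official documentation; it does accept the CUI from the MADEROS sample. `valid_cui`, `valid_caen`, and `normalize_date` are hypothetical names.

```python
import re
from datetime import datetime

def valid_cui(cui):
    """Check format (optional RO prefix, 2-10 digits) and the control digit."""
    digits = re.sub(r'^RO\s*', '', str(cui).strip(), flags=re.IGNORECASE)
    if not re.fullmatch(r'\d{2,10}', digits):
        return False
    # Left-pad the body to 9 digits, weight each digit, and derive the control digit.
    body, control = digits[:-1].zfill(9), int(digits[-1])
    total = sum(int(d) * int(w) for d, w in zip(body, "753217532"))
    return (total * 10) % 11 % 10 == control

def valid_caen(code):
    """CAEN groups have 3 digits and classes have 4."""
    return re.fullmatch(r'\d{3,4}', str(code)) is not None

def normalize_date(value):
    """Return the date as yyyy-mm-dd, or None if it is not a real calendar date."""
    for fmt in ("%d.%m.%Y", "%d/%m/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

print(valid_cui("51885179"), valid_cui("RO51885179"), valid_cui("51885178"))
print(valid_caen("6811"), valid_caen("68"))
print(normalize_date("29.05.2025"), normalize_date("31.02.2025"))
```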


## Process Diagram

The diagram below shows the recommended pipeline. Rendered as mermaid in supporting viewers.


```mermaid
flowchart TD
    A[Raw HTML issues iLegis / Lege5] --> B[Run initial parser ]
    B --> C[Segment into entries]
    C --> D[Categorize entries SRL / SA / PFA / ...]
    D --> E{SRL?}
    E -- Yes --> F[Subtype detection]
    F --> G1[Heuristic parsers - modular per subtype]
    F --> G2[LLM structured output - PydanticAI]
    G1 --> H1[Deterministic validation - regex]
    G2 --> H2[Cross-check with heuristics & hallucination detection]
    H1 --> I[Benchmarking & throughput measurement]
    H2 --> I
    I --> J[Choose final pipeline & run at scale]
    E -- No --> K[Route to SA or other processing pipelines]
```
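
The subtype-detection node can start as a keyword lookup. The two rules below come from the two sample entries (a decision of the sole associate and a trade-register excerpt); the subtype labels, `SUBTYPE_RULES`, and `detect_subtype` are hypothetical, and a real corpus will need more rules.

```python
import re

# Hypothetical subtype labels; the keywords are taken from the two sample entries.
SUBTYPE_RULES = [
    ("registry_excerpt", re.compile(r'EXTRAS AL [ÎI]NCHEIERII', re.IGNORECASE)),
    ("associate_decision", re.compile(r'DECIZIA\s+NR\.', re.IGNORECASE)),
]

def detect_subtype(text):
    """Return the label of the first matching rule, or 'unknown'."""
    for label, pattern in SUBTYPE_RULES:
        if pattern.search(text):
            return label
    return "unknown"

print(detect_subtype("EXTRAS AL ÎNCHEIERII NR. 464529/29.05.2025"))
print(detect_subtype("DECIZIA NR. 1\n\ndin data de 16.10.2025"))
print(detect_subtype("text without a known marker"))
```

Entries that come back as `unknown` are good candidates for the manual review sample, since they reveal subtypes the rules do not cover yet.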


## Next steps

1. Replace selector heuristics with exact BeautifulSoup selectors for Lege5 HTML files.
2. Prepare a small validation sample for manual review (100-500 documents), stratified by subtype.
3. If using LLMs, create a mapping between the PydanticAI output model and the `schema` above.
4. Run benchmarks; collect throughput.
5. Iterate: if heuristics reach >80% accuracy on the manual validation sample for SRL entries, scale them; otherwise shift to the LLM approach, with hallucination detection in place and a cost-feasibility check.
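
The stratified validation sample from step 2 can be drawn with the standard library. A minimal sketch, assuming each document already carries a subtype label; `stratified_sample` and the toy documents are hypothetical, and the fixed seed only makes the draw repeatable.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(docs, key, per_stratum, seed=0):
    """Draw up to per_stratum documents from each group defined by key(doc)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for doc in docs:
        groups[key(doc)].append(doc)
    sample = []
    for label in sorted(groups):
        members = groups[label]
        # Small strata are taken whole instead of raising an error.
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample

# Toy corpus: 10 excerpts and 20 decisions.
docs = [{"id": i, "subtype": "decision" if i % 3 else "excerpt"} for i in range(30)]
sample = stratified_sample(docs, key=lambda d: d["subtype"], per_stratum=5)
counts = Counter(d["subtype"] for d in sample)
print(dict(counts))
```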
