# 🛫 Soji AI: Notebook Setup Guide

This notebook provides step-by-step setup for running the AD Recognition pipeline interactively.

> ⚠️ **Important:** Choose **only one** section below based on your hardware. Running both will cause dependency conflicts.

---

## 🔍 Step 0: Check Your Hardware

Run this cell first to detect your environment:

```python
import subprocess

def check_gpu():
    try:
        result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
        if result.returncode == 0:
            print("✅ NVIDIA GPU detected!")
            print(result.stdout)
            print("👉 Follow: (GPU Setup)")
            return True
        else:
            print("❌ No NVIDIA GPU found")
            print("👉 Follow: (CPU Setup) (⚠️ Note: OCR will be slow on CPU)")
            return False
    except FileNotFoundError:
        print("❌ nvidia-smi not found: no GPU available")
        print("👉 Follow: (CPU Setup) (⚠️ Note: OCR will be slow on CPU)")
        return False

HAS_GPU = check_gpu()
```

---

## GPU Setup

> **Requirements:** NVIDIA GPU with CUDA 13.0+ drivers

Sample output of the Step 0 check on a machine with an NVIDIA GPU:

✅ NVIDIA GPU detected!
Sat Feb 21 04:40:35 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.57                 Driver Version: 591.86         CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 3050 ...    On  |   00000000:01:00.0 Off |                  N/A |
| N/A   42C    P0             20W /  100W |       0MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+---------------------
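
The `CUDA Version` field in the `nvidia-smi` header above can be compared against the 13.0+ requirement programmatically. The following is a minimal sketch, assuming the header keeps the `CUDA Version: X.Y` format shown in the output; `parse_cuda_version` and `meets_cuda_requirement` are illustrative helper names, not part of the pipeline.

```python
import re
from typing import Optional, Tuple

# Assumption: nvidia-smi prints "CUDA Version: <major>.<minor>" in its header,
# as in the sample output above.
_CUDA_RE = re.compile(r"CUDA Version:\s*(\d+)\.(\d+)")


def parse_cuda_version(smi_output: str) -> Optional[Tuple[int, int]]:
    """Return (major, minor) parsed from nvidia-smi text, or None if absent."""
    match = _CUDA_RE.search(smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def meets_cuda_requirement(
    smi_output: str, minimum: Tuple[int, int] = (13, 0)
) -> bool:
    """True when the reported CUDA version is at least `minimum`."""
    version = parse_cuda_version(smi_output)
    return version is not None and version >= minimum


sample = "| NVIDIA-SMI 590.57   Driver Version: 591.86   CUDA Version: 13.1 |"
print(parse_cuda_version(sample))      # (13, 1)
print(meets_cuda_requirement(sample))  # True
```

In the notebook, `result.stdout` from `check_gpu()` could be passed in place of `sample`.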

# Airworthiness Directive (AD) Applicability Parser

Determining whether an aircraft is affected by an Airworthiness Directive (AD) requires engineers to manually cross-reference each AD document against fleet data: aircraft model, manufacturer serial number (MSN), embodied modifications, and incorporated service bulletins. This process is time-consuming, error-prone, and does not scale as the number of ADs and the fleet size grow.

A key challenge is that **AD document layouts are not standardized and evolve over time**. Different issuing authorities (EASA, FAA, etc.) use varying formats, and even within the same authority, the structure changes across revisions. This makes traditional rule-based extraction using regex or template matching impractical: any hard-coded parsing logic would break as soon as the layout shifts, requiring constant maintenance with no guarantee of reliability.

## Proposed Approaches

To address this, we leverage LLMs for the extraction layer, as they can interpret unstructured regulatory text regardless of layout changes. Two implementation approaches are proposed:

1. **Full LLM Extraction (Multimodal):** The AD document (PDF) is sent directly to a multimodal LLM that processes both text and visual layout. This is simpler to implement and preserves the original document structure, including tables and formatting that may carry semantic meaning.

2. **Local OCR + LLM (Text-Only):** The document is first processed through a local OCR pipeline (e.g., PaddleOCR) to extract raw text, which is then fed to a text-only LLM for structured extraction. This offers more control over preprocessing, reduces multimodal API costs, and can run partially offline.

Both approaches output structured JSON conforming to a Pydantic schema that captures the full AD structure: applicable models, MSN constraints, modification/SB exclusions, aircraft group definitions, and every required action paragraph with its compliance deadlines.
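
For orientation, a parsed AD might serialize along these lines. The field names come from the Pydantic schema defined in the Core Section of this notebook; every value is an illustrative placeholder borrowed from the examples in the schema's field descriptions, not data from a real AD. Groups and requirement paragraphs are left out to keep the sketch short.

```python
import json

# Hypothetical extraction result. Keys follow the schema in the Core Section;
# values are placeholders chosen for illustration only.
example_extraction = {
    "ad_number": "2025-0254R1",
    "issuing_authority": "EASA",
    "models": ["A320-211", "A320-212"],
    "msn_constraints": [
        {"all": True, "range": None, "specific_msns": None, "excluded": False},
    ],
    "modification_constraints": [
        {"modification_id": "mod 24591", "embodied": True, "excluded": True},
    ],
    "sb_constraints": [
        {
            "sb_identifier": "A320-57-1089",
            "revision": "Revision 04",
            "incorporated": True,
            "excluded": True,
        },
    ],
}

# Round-trip through JSON to confirm the structure serializes as-is.
payload = json.dumps(example_extraction, indent=2)
restored = json.loads(payload)
print(restored["modification_constraints"][0]["modification_id"])  # mod 24591
```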

## Applicability Engine

Once parsed into structured JSON, a deterministic rule-based engine evaluates each aircraft against the extracted data through three sequential checks:

1. **Model Check:** is the aircraft's model listed in the AD's applicability?
2. **MSN Check:** does the serial number satisfy the AD's MSN constraints (all-MSN, ranges, specific lists, exclusions)?
3. **Modification/SB Exclusion Check:** has the aircraft already embodied a modification or SB that exempts it?

The output is an augmented fleet DataFrame with status indicators: `✅ Affected`, `❌ Not applicable`, or `❌ Not Affected` (exempted).
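
The three checks can be sketched on plain dictionaries as follows. This is a simplified illustration, not the notebook's `compare_to_ad` implementation: the aircraft keys (`model`, `msn`, `mods`) are hypothetical, models are matched exactly, a failed MSN check is assumed to map to "Not applicable", and SB exclusions are omitted for brevity.

```python
from typing import Any, Dict

AFFECTED = "✅ Affected"
NOT_APPLICABLE = "❌ Not applicable"
NOT_AFFECTED = "❌ Not Affected"


def msn_matches(msn: int, constraint: Dict[str, Any]) -> bool:
    """True if the MSN falls inside one constraint, ignoring its excluded flag."""
    if constraint.get("all"):
        return True
    rng = constraint.get("range")
    if rng:
        start, end = rng.get("start"), rng.get("end")
        # A None bound means the range is open on that side.
        lower_ok = start is None or (
            msn >= start if rng.get("inclusive_start", True) else msn > start
        )
        upper_ok = end is None or (
            msn <= end if rng.get("inclusive_end", True) else msn < end
        )
        return lower_ok and upper_ok
    return msn in (constraint.get("specific_msns") or [])


def evaluate(aircraft: Dict[str, Any], ad: Dict[str, Any]) -> str:
    # (1) Model check (exact match in this sketch).
    if aircraft["model"] not in (ad.get("models") or []):
        return NOT_APPLICABLE
    # (2) MSN check: match an inclusion (if any exist) and no exclusion.
    constraints = ad.get("msn_constraints") or []
    included = [c for c in constraints if not c.get("excluded", False)]
    excluded = [c for c in constraints if c.get("excluded", False)]
    msn = aircraft["msn"]
    if included and not any(msn_matches(msn, c) for c in included):
        return NOT_APPLICABLE
    if any(msn_matches(msn, c) for c in excluded):
        return NOT_APPLICABLE
    # (3) Modification exclusion check.
    mods = set(aircraft.get("mods", []))
    for mc in ad.get("modification_constraints") or []:
        if mc.get("excluded") and mc["modification_id"] in mods:
            return NOT_AFFECTED
    return AFFECTED


# Hypothetical AD and fleet rows, for illustration only.
ad = {
    "models": ["A320-211"],
    "msn_constraints": [
        {"all": True, "excluded": False},
        {"range": {"start": 1, "end": 99}, "excluded": True},
    ],
    "modification_constraints": [
        {"modification_id": "mod 24591", "embodied": True, "excluded": True},
    ],
}
fleet = [
    {"model": "A320-211", "msn": 500, "mods": []},
    {"model": "A320-211", "msn": 500, "mods": ["mod 24591"]},
]
statuses = [evaluate(aircraft, ad) for aircraft in fleet]
```

Because each check returns early, the first failing check determines the status, which keeps every determination traceable to a single rule.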

## Why This Architecture

The LLM handles what regex cannot (understanding unstructured, evolving document formats), while the applicability logic remains fully deterministic and auditable. The "decision" layer never hallucinates; it only operates on structured data that can be reviewed before any determination is made. This separation ensures traceability, which is critical for aviation regulatory compliance.

## Core Section

In [1]:
## System Prompt
SYSTEM_PROMPT = """
You are an aviation regulatory document parser specialized in Airworthiness Directives (ADs).
Extract structured applicability and compliance information from the provided AD document.

EXTRACTION RULES:
- Extract only information explicitly stated in the document. Never infer or assume.
- Preserve all identifiers verbatim (model names, SB numbers, mod numbers, MSNs).
- If a field has no corresponding information in the document, set it to null.
- Output valid JSON only. No markdown, no explanation, no commentary.

CRITICAL DISTINCTIONS:
- Airbus modification numbers (e.g. "mod 24591") → always go in modification_constraints. Never in sb_constraints.
- Service Bulletin identifiers (e.g. "A320-57-1089") → always go in sb_constraints. Never in modification_constraints.
- If the AD states "all MSN" or "all manufacturer serial numbers", always set MSNConstraint(all=True, excluded=False). Never leave msn_constraints null when MSN applicability is mentioned.
- When multiple compliance limits use "whichever occurs first", list each as a separate ComplianceTime entry.
- Recurring intervals ("thereafter, at intervals not exceeding...") → is_interval=True.
- One-time thresholds ("before exceeding...") → is_interval=False.

OUTPUT: Valid JSON strictly following the provided schema.
"""

## Schemas
from enum import Enum
from typing import Optional, List
from pydantic import BaseModel, Field

class TimeUnit(str, Enum):
    FLIGHT_HOURS = "flight_hours"
    FLIGHT_CYCLES = "flight_cycles"
    DAYS = "days"
    MONTHS = "months"
    YEARS = "years"
    CALENDAR_DATE = "calendar_date"


class NumericRange(BaseModel):
    start: Optional[int] = Field(
        default=None,
        description=(
            "Lower bound of the MSN range (inclusive by default). "
            "Set to None if there is no lower bound."
        )
    )
    end: Optional[int] = Field(
        default=None,
        description=(
            "Upper bound of the MSN range (inclusive by default). "
            "Set to None if there is no upper bound."
        )
    )
    inclusive_start: bool = Field(
        default=True,
        description="True means >= (greater than or equal to start). False means > (strictly greater than)."
    )
    inclusive_end: bool = Field(
        default=True,
        description="True means <= (less than or equal to end). False means < (strictly less than)."
    )


class MSNConstraint(BaseModel):
    all: Optional[bool] = Field(
        default=None,
        description=(
            "Set to True when the AD explicitly states 'all manufacturer serial numbers (MSN)' or 'all MSN'. "
            "IMPORTANT: Never leave this None when the AD explicitly uses the word 'all' for MSN applicability; "
            "even if other exclusions apply, the 'all' inclusion must still be captured here. "
            "Leave None only when applicability is defined purely by a specific range or list."
        )
    )
    range: Optional[NumericRange] = Field(
        default=None,
        description=(
            "A continuous numeric range of MSNs this constraint covers. "
            "Use when the AD specifies a span like 'MSN 100 through MSN 500'. "
            "Do not use together with specific_msns."
        )
    )
    specific_msns: Optional[List[int]] = Field(
        default=None,
        description=(
            "An explicit list of individual MSN integers this constraint covers. "
            "Use when the AD names specific serial numbers, e.g. 'MSN 364 or MSN 385'. "
            "Do not use together with range."
        )
    )
    excluded: bool = Field(
        default=False,
        description=(
            "Set to True when these MSNs are EXCLUDED from applicability "
            "(AD language like 'except MSN...', 'excluding MSN...'). "
            "Set to False when these MSNs are positively INCLUDED in applicability. "
            "Default is False (inclusion)."
        )
    )


class ModificationConstraint(BaseModel):
    modification_id: str = Field(
        description=(
            "The exact modification identifier as written in the AD. "
            "Always an Airbus 'mod' number, e.g. 'mod 24591', 'mod 24977'. "
            "IMPORTANT: Modification numbers are never Service Bulletins; "
            "do not confuse with SB identifiers (e.g. 'A320-57-XXXX'). "
            "Copy the identifier verbatim from the AD text."
        )
    )
    embodied: Optional[bool] = Field(
        default=None,
        description=(
            "True = this modification IS embodied on the aircraft. "
            "False = this modification is NOT embodied on the aircraft. "
            "None = embodiment status is unspecified or not relevant to this constraint."
        )
    )
    excluded: bool = Field(
        default=False,
        description=(
            "Set to True when aircraft WITH this modification embodied are EXCLUDED from applicability "
            "(AD language like 'except those on which mod XXXXX has been embodied in production'). "
            "Set to False when this modification is a positive inclusion condition. "
            "Default is False (inclusion)."
        )
    )


class ServiceBulletinConstraint(BaseModel):
    sb_identifier: str = Field(
        description=(
            "The exact Service Bulletin identifier as written in the AD, "
            "e.g. 'A320-57-1089', 'A320-57-1100'. "
            "IMPORTANT: Only actual Airbus Service Bulletins belong here (format: 'AXXX-XX-XXXX'). "
            "Airbus modification numbers ('mod XXXXX') must NEVER be placed here; "
            "those belong exclusively in ModificationConstraint. "
            "Copy the identifier verbatim from the AD text, without the 'SB' prefix."
        )
    )
    revision: Optional[str] = Field(
        default=None,
        description=(
            "The revision qualifier for this SB constraint, exactly as stated in the AD. "
            "Examples: 'Revision 04', 'any revision lower than Revision 04', 'Revision 03 or later'. "
            "Leave None if no specific revision is mentioned and any revision applies."
        )
    )
    incorporated: Optional[bool] = Field(
        default=None,
        description=(
            "True = this SB HAS been incorporated on the aircraft. "
            "False = this SB has NOT been incorporated on the aircraft. "
            "None = incorporation status is unspecified or not relevant to this constraint."
        )
    )
    excluded: bool = Field(
        default=False,
        description=(
            "Set to True when aircraft on which this SB HAS been embodied are EXCLUDED from applicability "
            "(AD language like 'except those on which SB XXXX has been embodied'). "
            "Set to False when this SB is a positive inclusion or compliance condition. "
            "Default is False (inclusion)."
        )
    )


class AircraftGroup(BaseModel):
    group_id: str = Field(
        description=(
            "The group label exactly as defined in the AD's Groups section. "
            "Examples: 'Group 1', 'Group 2', 'Group A', 'Group B'. "
            "Use verbatim from the AD; do not invent or rename groups."
        )
    )
    models: Optional[List[str]] = Field(
        default=None,
        description=(
            "Aircraft model variants that belong to this group, "
            "derived from the group definition. "
            "Examples: ['A321-111', 'A321-112'] or ['A320']. "
            "Leave None if the group definition does not restrict by model "
            "(i.e. it applies to all models already listed in the top-level applicability)."
        )
    )
    msn_constraints: Optional[List[MSNConstraint]] = Field(
        default=None,
        description=(
            "MSN-based constraints that define or restrict membership in this group. "
            "Apply the same rules as top-level msn_constraints: "
            "if the group definition says 'all MSN', populate with MSNConstraint(all=True, excluded=False). "
            "If the group is defined by specific MSNs, list them in specific_msns. "
            "Leave None only if MSN is not a factor in this group's definition."
        )
    )
    modification_constraints: Optional[List[ModificationConstraint]] = Field(
        default=None,
        description=(
            "Modification-based constraints that define or exclude aircraft from this group. "
            "Only use ModificationConstraint here; never mix with SB identifiers. "
            "Examples: a group excluding aircraft with a specific mod embodied in production. "
            "Leave None if modifications are not a factor in this group's definition."
        )
    )
    sb_constraints: Optional[List[ServiceBulletinConstraint]] = Field(
        default=None,
        description=(
            "Service Bulletin constraints that define or exclude aircraft from this group. "
            "Only use actual SB identifiers here; never use mod numbers. "
            "Example: a group defined by aircraft on which a specific SB has NOT been embodied. "
            "Leave None if SBs are not a factor in this group's definition."
        )
    )
    description: Optional[str] = Field(
        default=None,
        description=(
            "Free-text fallback for group membership logic that cannot be fully expressed "
            "by the structured fields above. "
            "Transcribe the exact defining sentence from the AD. "
            "Always populate this field: it serves as a human-readable audit trail "
            "even when structured fields are also populated."
        )
    )


class ComplianceTime(BaseModel):
    value: Optional[int] = Field(
        default=None,
        description=(
            "The numeric value of this compliance time. Always a positive integer. "
            "Examples: 37300 for '37 300 flight hours', 24 for '24 months', 90 for '90 days'. "
            "Set to None only when a specific calendar_date is used instead of a relative time value."
        )
    )
    unit: Optional[TimeUnit] = Field(
        default=None,
        description=(
            "The unit of measurement corresponding to value. "
            "Must be one of the TimeUnit enum values. "
            "Set to None only when calendar_date is used instead of value+unit."
        )
    )
    reference: Optional[str] = Field(
        default=None,
        description=(
            "The reference point from which this time is measured, transcribed from the AD. "
            "Examples: 'since first flight of the aeroplane', "
            "'after the effective date of this AD', "
            "'since the last inspection', "
            "'from the effective date of this AD'. "
            "Leave None only if no reference point is stated and the context is self-evident."
        )
    )
    calendar_date: Optional[str] = Field(
        default=None,
        description=(
            "An absolute calendar deadline in ISO 8601 format (YYYY-MM-DD). "
            "Use only when the AD specifies a hard date rather than a relative time window. "
            "When populated, value and unit should be None. "
            "Example: '2026-06-01' for 'before 01 June 2026'."
        )
    )
    is_interval: bool = Field(
        default=False,
        description=(
            "Set to True for RECURRING intervals between repeated actions "
            "(AD language like 'thereafter, at intervals not exceeding X FH'). "
            "Set to False for one-time initial thresholds "
            "(AD language like 'before exceeding X FH since first flight'). "
            "Default is False."
        )
    )


class RequirementAction(BaseModel):
    paragraph_id: str = Field(
        description=(
            "The paragraph identifier exactly as numbered in the AD's Required Actions section. "
            "Examples: '(1)', '(5)', '(8)', '(12)'. "
            "Used to cross-reference paragraphs (e.g. corrective actions referencing their "
            "triggering inspection paragraph)."
        )
    )
    action_type: str = Field(
        description=(
            "The category of this required action. Use exactly one of the following values: "
            "'inspection': any DET, GVI, SDI, ESDI, or other inspection task; "
            "'modification': a structural, design, or configuration change to the aircraft; "
            "'corrective_action': a repair or follow-up action triggered by a finding during inspection; "
            "'terminating_action': an action whose accomplishment ends one or more repetitive requirements; "
            "'prohibition': an action that must NOT be accomplished (e.g. 'do not embody SB X below Rev Y'); "
            "'clarification': a paragraph that clarifies scope or interaction between other paragraphs "
            "without itself requiring a physical action (e.g. 'accomplishment of paragraph X does not "
            "terminate paragraph Y')."
        )
    )
    applies_to_groups: Optional[List[str]] = Field(
        default=None,
        description=(
            "List of group IDs, exactly as defined in the AD's Groups section, "
            "to which this requirement applies. "
            "Examples: ['Group 1'], ['Group 1', 'Group 4']. "
            "Leave None if the requirement is stated in terms of direct model references "
            "rather than group labels, or if it applies implicitly to all groups "
            "(e.g. clarification paragraphs)."
        )
    )
    applies_to_models: Optional[List[str]] = Field(
        default=None,
        description=(
            "Direct aircraft model references for requirements that do not use group labels. "
            "Examples: ['A320-211', 'A320-212']. "
            "Leave None when applies_to_groups is populated; do not duplicate the same "
            "applicability in both fields."
        )
    )
    additional_applicability_condition: Optional[str] = Field(
        default=None,
        description=(
            "Any further condition within the stated group or model scope that narrows "
            "which aircraft this paragraph applies to, transcribed verbatim from the AD. "
            "Use when the paragraph adds a qualifier beyond the group definition itself. "
            "Examples: "
            "'except aeroplanes modified in accordance with the instructions of Airbus SB A320-57-1100', "
            "'having embodied SB A320-57-1089 at any revision lower than Revision 04 (for Group 4 aeroplanes)'. "
            "Leave None if no additional condition is stated."
        )
    )
    description: str = Field(
        description=(
            "A concise, self-contained human-readable summary of what action must be performed. "
            "Include: the inspection method or action type (e.g. DET, GVI, modification), "
            "the area or component involved, and the reference document(s) to follow. "
            "Write in plain language suitable for a maintenance engineer to understand at a glance. "
            "Example: 'Accomplish a detailed inspection (DET) of the LH and RH wing inner rear spars "
            "at the MLG anchorage fitting attachment holes, per SB A320-57-1101 Revision 04.'"
        )
    )
    compliance_times: Optional[List[ComplianceTime]] = Field(
        default=None,
        description=(
            "One or more initial compliance thresholds by which this action must first be accomplished. "
            "When the AD states multiple limits with 'whichever occurs first', "
            "list each as a separate ComplianceTime entry; the whichever-first logic is implied "
            "by multiple entries in this list. "
            "Example: '37 300 FH or 20 000 FC whichever occurs first since first flight' → "
            "two ComplianceTime entries: one for 37300 FH and one for 20000 FC, "
            "both with reference 'since first flight of the aeroplane' and is_interval=False. "
            "Leave None for clarification paragraphs or terminating action notes with no time limit."
        )
    )
    interval: Optional[List[ComplianceTime]] = Field(
        default=None,
        description=(
            "One or more recurring intervals for repetitive requirements. "
            "Populate only when the AD states 'thereafter, at intervals not exceeding...'. "
            "As with compliance_times, list each limit as a separate ComplianceTime entry "
            "when multiple limits apply with 'whichever occurs first'. "
            "All entries must have is_interval=True. "
            "Leave None for one-time actions (modifications, one-time inspections, corrective actions)."
        )
    )
    reference_documents: Optional[List[str]] = Field(
        default=None,
        description=(
            "List of Airbus Service Bulletins or other technical documents whose instructions "
            "must be followed to accomplish this action. "
            "Include the revision where the AD specifies it. "
            "Examples: ['SB A320-57-1101 Revision 04', 'SB A320-57-1256']. "
            "Leave None for corrective actions where the repair instructions are obtained "
            "from Airbus on a case-by-case basis, or for clarification paragraphs."
        )
    )
    triggered_by_paragraph: Optional[str] = Field(
        default=None,
        description=(
            "For corrective_action paragraphs only: the paragraph_id of the inspection "
            "or action that triggers this corrective action when discrepancies are found. "
            "Example: '(1)' means this corrective action is triggered by findings during "
            "the inspection required by paragraph (1). "
            "Leave None for all non-corrective action types."
        )
    )
    terminating_action_for: Optional[List[str]] = Field(
        default=None,
        description=(
            "List of paragraph_ids whose repetitive requirements are permanently terminated "
            "upon accomplishment of this action. "
            "Example: ['(5)'] means completing this action ends the recurring inspections "
            "required by paragraph (5) for that aircraft. "
            "Leave None if this action has no terminating effect on other paragraphs. "
            "Note: also set is_terminating_action=True when this field is populated."
        )
    )
    is_terminating_action: bool = Field(
        default=False,
        description=(
            "Set to True if accomplishing this action permanently terminates one or more "
            "repetitive requirements in this AD. "
            "Must be True whenever terminating_action_for is populated. "
            "Default is False."
        )
    )


class ADApplicabilityExtraction(BaseModel):
    ad_number: str = Field(
        description=(
            "The full AD identifier including any revision suffix, exactly as it appears in the AD header. "
            "Examples: '2025-0254R1', '2023-0041', 'AD 2021-23-10'. "
            "Never omit the revision suffix if present."
        )
    )
    issuing_authority: Optional[str] = Field(
        default=None,
        description=(
            "The aviation authority that issued this AD. "
            "Examples: 'EASA', 'FAA', 'TCCA', 'CASA'. "
            "Taken from the AD header or introductory paragraph."
        )
    )
    effective_date: Optional[str] = Field(
        default=None,
        description=(
            "The effective date of this AD (or its most recent revision) in ISO 8601 format (YYYY-MM-DD). "
            "If multiple dates are listed (original issue and revision), use the revision's effective date. "
            "Example: '2025-12-08'."
        )
    )
    revision: Optional[str] = Field(
        default=None,
        description=(
            "The revision label of this AD exactly as stated in the document. "
            "Examples: 'Revision 01', 'R1', 'Amendment 2'. "
            "Leave None for original issue (no revision)."
        )
    )
    supersedes: Optional[List[str]] = Field(
        default=None,
        description=(
            "List of AD identifiers that this AD supersedes, replaces, or revises, "
            "taken from the Revision field or the Reason section. "
            "Include all superseded ADs, not just the immediate predecessor. "
            "Examples: ['2025-0254', '2007-0162', '2014-0169']. "
            "Leave None if this is a first-issue AD that supersedes nothing."
        )
    )
    models: Optional[List[str]] = Field(
        default=None,
        description=(
            "Complete list of every aircraft model variant explicitly named in the "
            "Applicability section of the AD. "
            "List each variant as a separate string, exactly as written. "
            "Examples: ['A320-211', 'A320-212', 'A320-214', 'A321-111', 'A321-112']. "
            "Do not collapse variants (e.g. do not write 'A320' if the AD lists 'A320-211', 'A320-212' etc.)."
        )
    )
    msn_constraints: Optional[List[MSNConstraint]] = Field(
        default=None,
        description=(
            "Top-level MSN constraints covering the entire AD applicability, before any group scoping. "
            "IMPORTANT: never leave this None when the AD mentions MSN applicability. "
            "If the AD says 'all manufacturer serial numbers (MSN)' or 'all MSN', "
            "always populate with at least one MSNConstraint(all=True, excluded=False). "
            "If specific MSN ranges or numbers are excluded (e.g. 'except MSN 001 to 099'), "
            "add a separate MSNConstraint with excluded=True for those. "
            "Only leave None if the AD makes absolutely no reference to MSN applicability."
        )
    )
    modification_constraints: Optional[List[ModificationConstraint]] = Field(
        default=None,
        description=(
            "Top-level Airbus modification constraints covering the entire AD applicability. "
            "IMPORTANT: Only 'mod XXXXX' numbers belong here, never SB identifiers. "
            "These are almost always exclusions: aircraft on which a specific mod has been "
            "embodied in production are excluded from the AD's scope. "
            "Capture each mod as a separate ModificationConstraint. "
            "Example: 'except those on which Airbus mod 24591 has been embodied in production' → "
            "ModificationConstraint(modification_id='mod 24591', embodied=True, excluded=True). "
            "Leave None only if no modification-based applicability constraints exist in this AD."
        )
    )
    sb_constraints: Optional[List[ServiceBulletinConstraint]] = Field(
        default=None,
        description=(
            "Top-level Service Bulletin constraints covering the entire AD applicability. "
            "IMPORTANT: Only actual Airbus SB identifiers (format 'AXXX-XX-XXXX') belong here. "
            "Airbus modification numbers ('mod XXXXX') must NEVER be placed here; "
            "those belong exclusively in modification_constraints. "
            "These are typically SB-based exclusions, e.g. aircraft on which a specific SB "
            "revision has been embodied are excluded from scope. "
            "Example: 'except those on which SB A320-57-1089 at Revision 04 has been embodied' → "
            "ServiceBulletinConstraint(sb_identifier='A320-57-1089', revision='Revision 04', "
            "incorporated=True, excluded=True). "
            "Leave None only if no SB-based applicability constraints exist in this AD."
        )
    )
    compliance_time: Optional[List[ComplianceTime]] = Field(
        default=None,
        description=(
            "Top-level summary of the most immediate compliance deadline(s) imposed by this AD as a whole. "
            "The intent is to surface the AD's urgency at a glance, without requiring a consumer "
            "to parse every RequirementAction. "
            "Populate with the most restrictive (shortest) initial deadline across all requirements. "
            "When the shortest deadline is expressed as 'X or Y whichever occurs first', "
            "list both as separate ComplianceTime entries. "
            "This field is a summary; full per-paragraph compliance times are still "
            "captured in each RequirementAction.compliance_times. "
            "Leave None only if this AD contains no time-limited requirements "
            "(e.g. a purely prohibitive AD with no deadline)."
        )
    )
    groups: Optional[List[AircraftGroup]] = Field(
        default=None,
        description=(
            "Definitions of all aircraft groups declared in the AD's Groups section, "
            "one AircraftGroup entry per defined group. "
            "Groups are internal AD constructs that partition applicable aircraft for "
            "the purpose of applying different requirements to different subsets. "
            "Preserve the exact group labels and definitions from the AD. "
            "Leave None only if the AD does not define any named groups."
        )
    )
    requirements: Optional[List[RequirementAction]] = Field(
        default=None,
        description=(
            "Complete list of all required actions, one RequirementAction per numbered paragraph "
            "in the AD's Required Actions section. "
            "This is the primary output of the extraction. "
            "Every paragraph must be captured: inspections, modifications, corrective actions, "
            "prohibitions, terminating actions, and clarification notes alike. "
            "Preserve paragraph numbering exactly as in the AD. "
            "Leave None only if the AD contains no required actions (which should never occur "
            "for a valid AD)."
        )
    )

## Utils
import re
import pandas as pd
from loguru import logger

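def compare_to_ad(df: pd.DataFrame, ad_file_dict: dict) -> pd.DataFrame:
    """Evaluate every fleet row in `df` against each parsed AD in `ad_file_dict`.

    `df` needs the columns `aircraft_model`, `msn`, and `modifications_applied`;
    `ad_file_dict` maps an AD label to its extracted applicability dict.
    """
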
    ad_columns = list(ad_file_dict.keys())
    ad_rows = []

    for _, item in df.iterrows():
        model = str(item["aircraft_model"])
        msn = int(item["msn"])

        raw_mod = item["modifications_applied"]
        if pd.isna(raw_mod) or str(raw_mod).strip().lower() in ("none", "n/a", ""):
            mods_applied = []
        else:
            mods_applied = [m.strip() for m in str(raw_mod).split(",")]

        logger.info(
            f"🔎 Checking AD status - model: {model}, MSN: {msn}, mods: {mods_applied}"
        )

        ad_status_rows = []

        for ad in ad_columns:
            logger.debug(f"   📋 Checking against: {ad}")
            ad_data = ad_file_dict[ad]

            # --- Model check ---
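            # Substring match, so a base designation also matches its variants.
            # `or []` guards against a missing or null `models` list.
            model_status = any(model in m for m in (ad_data.get("models") or []))
            if not model_status:
                ad_status_rows.append("❌ Not applicable")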
                continue

            # --- MSN check ---
            msn_constraints = ad_data.get("msn_constraints") or []

            if not msn_constraints:
                msn_status = True
            else:
                msn_status = False
                for msn_constraint in msn_constraints:
                    all_msn = msn_constraint.get("all")
                    range_data = msn_constraint.get("range")
                    specific = msn_constraint.get("specific_msns")
                    excluded = msn_constraint.get("excluded", False)

                    matched = False

                    if all_msn:
                        matched = True
                    elif range_data:
                        start = range_data.get("start")
                        end = range_data.get("end")
                        incl_start = range_data.get("inclusive_start", True)
                        incl_end = range_data.get("inclusive_end", True)
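                        # Assumption: a missing bound means the range is open-ended on
                        # that side (comparing against None would raise TypeError).
                        lower_ok = start is None or ((msn >= start) if incl_start else (msn > start))
                        upper_ok = end is None or ((msn <= end) if incl_end else (msn < end))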
                        matched = lower_ok and upper_ok
                    elif specific:
                        matched = msn in specific

                    if matched:
                        msn_status = not excluded
                        break

            if not msn_status:
                ad_status_rows.append("❌ Not applicable")
                continue

            # --- Modification / SB exclusion check ---
            if not mods_applied:
                ad_status_rows.append("✅ Affected")
                continue

            excluded_by_mod = False

            for mod_applied in mods_applied:
                if "mod" in mod_applied.lower():
                    mod_constraints = ad_data.get("modification_constraints") or []
                    for mod_constraint in mod_constraints:
                        mod_id = mod_constraint.get("modification_id", "")
                        is_excluded = mod_constraint.get("excluded", False)
                        # Skip empty identifiers: an empty pattern would match any text.
                        if mod_id and re.search(r"\b" + re.escape(mod_id) + r"\b", mod_applied):
                            if is_excluded:
                                excluded_by_mod = True
                            break
                else:
                    sb_constraints = ad_data.get("sb_constraints") or []
                    for sb_constraint in sb_constraints:
                        sb_id = sb_constraint.get("sb_identifier", "")
                        is_excluded = sb_constraint.get("excluded", False)
                        # Skip empty identifiers: an empty pattern would match any text.
                        if sb_id and re.search(r"\b" + re.escape(sb_id) + r"\b", mod_applied):
                            if is_excluded:
                                excluded_by_mod = True
                            break

                if excluded_by_mod:
                    break

            if excluded_by_mod:
                ad_status_rows.append("❌ Not Affected")
            else:
                ad_status_rows.append("✅ Affected")

        ad_rows.append(ad_status_rows)

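    # Reuse df's index so the status columns align row-for-row even when df
    # does not have a default RangeIndex (e.g. after filtering).
    ad_df = pd.DataFrame(ad_rows, columns=ad_columns, index=df.index)
    combined_df = pd.concat([df, ad_df], axis=1)
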
    return combined_df

### Component 1: LLM-Based AD Document Parser

The first component uses a large language model (LLM), guided by a carefully engineered system prompt, to extract structured data from unstructured AD documents. The system prompt enforces strict extraction rules: the model must extract only explicitly stated information, preserve all identifiers verbatim (model designations, Service Bulletin numbers, modification numbers, MSNs), and output valid JSON conforming to a predefined Pydantic schema. Critical distinctions are enforced at the prompt level: Airbus modification numbers (e.g., `mod 24591`) are always routed to `modification_constraints` and never confused with Service Bulletin identifiers (e.g., `A320-57-1089`), which are routed to `sb_constraints`. This separation is essential because modifications and service bulletins have fundamentally different implications for AD applicability.
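
The routing rule can be illustrated with a small classifier. This is a minimal sketch, not the prompt logic itself: the `classify_identifier` helper and both regular expressions are assumptions made for illustration, inferred only from the two example formats mentioned above.

```python
import re

# Hypothetical patterns, inferred from the two example formats in the text.
MOD_PATTERN = re.compile(r"^mod(?:ification)?\.?\s*\d+$", re.IGNORECASE)
SB_PATTERN = re.compile(r"^[A-Z]\d{3}-\d{2}-\d{4}$")

def classify_identifier(identifier: str) -> str:
    """Return the constraint list an identifier would be routed to."""
    text = identifier.strip()
    if MOD_PATTERN.match(text):
        return "modification_constraints"
    if SB_PATTERN.match(text):
        return "sb_constraints"
    return "unknown"

print(classify_identifier("mod 24591"))     # modification_constraints
print(classify_identifier("A320-57-1089"))  # sb_constraints
```

Real ADs may use identifier formats that these two patterns do not cover, which is one reason the pipeline delegates this decision to the LLM prompt rather than to fixed rules.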

The output schema (`ADApplicabilityExtraction`) captures the full structure of an AD, including: the AD identifier and metadata (issuing authority, effective date, revision history), the complete list of applicable aircraft models and MSN constraints, modification and service bulletin exclusions, aircraft group definitions (which partition the fleet into subsets with different compliance requirements), and every numbered required action paragraph, each annotated with its action type, compliance deadlines, recurring intervals, reference documents, and cross-references to other paragraphs (e.g., corrective actions triggered by inspection findings, or terminating actions that end repetitive requirements).
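
For orientation, the applicability-related part of one parsed AD might look like the dictionary below. The key names are the ones `compare_to_ad` reads; every value is made up for illustration and does not come from a real AD, and the exact form of each identifier depends on what the extraction returns.

```python
import json

# Illustrative values only; key names mirror what compare_to_ad() reads.
example_extraction = {
    "models": ["A320-211", "A320-212"],
    "msn_constraints": [
        {
            "all": False,
            "range": {
                "start": 100,
                "end": 500,
                "inclusive_start": True,
                "inclusive_end": True,
            },
            "specific_msns": None,
            "excluded": False,
        }
    ],
    # Identifier form ("24591" vs "mod 24591") is an assumption here.
    "modification_constraints": [
        {"modification_id": "24591", "excluded": True}
    ],
    "sb_constraints": [
        {"sb_identifier": "A320-57-1089", "excluded": True}
    ],
}

# The structure is plain JSON-serialisable data, so it survives a round trip.
round_tripped = json.loads(json.dumps(example_extraction))
print(sorted(round_tripped))
```

The full schema carries many more fields (metadata, groups, requirements); only the four keys shown here feed the applicability engine.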

### Component 2: Rule-Based Applicability Engine

The second component (`compare_to_ad`) takes a fleet inventory DataFrame (containing each aircraft's model, MSN, and applied modifications) and evaluates it against the parsed AD data. For each aircraft-AD pair, the engine performs a three-stage check:

1. **Model Check**: Verifies whether the aircraft's type certificate model appears in the AD's applicability list, using a substring match against each listed model designation.
2. **MSN Check**: Evaluates whether the aircraft's serial number falls within the AD's MSN constraints, supporting `all MSN` declarations, numeric ranges with configurable inclusivity bounds, specific MSN lists, and exclusion logic. The first constraint that matches decides the outcome.
3. **Modification/SB Exclusion Check**: Determines whether any modification or service bulletin already embodied on the aircraft exempts it from the AD's scope, using regex-based, word-boundary identifier matching against the parsed constraint data.
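
The MSN stage can be sketched in isolation as follows. This is a simplified, standalone restatement of the matching logic in `compare_to_ad`; the helper names `msn_matches` and `msn_applicable` are hypothetical, the constraint values are made up, and treating a missing bound as open-ended is an assumption of this sketch.

```python
from typing import Optional

def msn_matches(msn: int, constraint: dict) -> bool:
    """Return True if the MSN falls under one constraint entry."""
    if constraint.get("all"):
        return True
    range_data: Optional[dict] = constraint.get("range")
    if range_data:
        start = range_data.get("start")
        end = range_data.get("end")
        incl_start = range_data.get("inclusive_start", True)
        incl_end = range_data.get("inclusive_end", True)
        # Assumption: a missing bound means "unbounded" on that side.
        lower_ok = start is None or (msn >= start if incl_start else msn > start)
        upper_ok = end is None or (msn <= end if incl_end else msn < end)
        return lower_ok and upper_ok
    specific = constraint.get("specific_msns")
    return bool(specific) and msn in specific

def msn_applicable(msn: int, constraints: list) -> bool:
    """First matching constraint decides; no constraints means applicable."""
    if not constraints:
        return True
    for constraint in constraints:
        if msn_matches(msn, constraint):
            return not constraint.get("excluded", False)
    return False

constraints = [
    {"specific_msns": [250], "excluded": True},
    {"range": {"start": 100, "end": 500, "inclusive_end": False}},
]
print(msn_applicable(100, constraints))  # True: inside the range
print(msn_applicable(500, constraints))  # False: upper bound is exclusive
print(msn_applicable(250, constraints))  # False: explicitly excluded first
```

Because the first match wins, the order of constraints matters: an exclusion only takes effect if it appears before a broader range that would also match.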

The output is an augmented DataFrame in which each AD column contains a status indicator: `✅ Affected` (the aircraft is subject to the AD), `❌ Not applicable` (the aircraft does not meet the AD's applicability criteria), or `❌ Not Affected` (the aircraft originally fell within scope but is exempted by an already-embodied modification or service bulletin).
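
The exemption behind `❌ Not Affected` relies on word-boundary matching, so that a short identifier does not match inside a longer one. The sketch below isolates that idea; `is_excluded_by` is a hypothetical helper and the sample identifiers are illustrative.

```python
import re

def is_excluded_by(applied: str, constraints: list, id_key: str) -> bool:
    """Return True if an applied mod/SB matches a constraint flagged as excluded."""
    for constraint in constraints:
        identifier = constraint.get(id_key, "")
        if not identifier:
            continue  # an empty pattern would match every string
        pattern = r"\b" + re.escape(identifier) + r"\b"
        if re.search(pattern, applied):
            # The first matching constraint decides the outcome.
            return bool(constraint.get("excluded", False))
    return False

mods = [{"modification_id": "24591", "excluded": True}]
print(is_excluded_by("mod 24591", mods, "modification_id"))   # True
print(is_excluded_by("mod 245910", mods, "modification_id"))  # False
```

Without the `\b` anchors, `24591` would also match inside `245910` and wrongly exempt an aircraft from the AD.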

In [5]:
## Pipeline (With Full LLM)
import os
import json
import shutil
import pandas as pd
from datetime import datetime
from uuid import uuid4
from typing import Optional
from loguru import logger
from pydantic import BaseModel
from google import genai
from google.genai import types
from pdf2image import convert_from_bytes

class ADRecognitionFullLLM:
    def __init__(
        self,
        dpi: int,
        llm_model: str,
        llm_system_prompt: str,
        llm_temperature: float,
        llm_output_schema: type[BaseModel],
        temp_dir: Optional[str] = None,
    ):
        self.dpi = dpi
        self.llm_client = genai.Client(
            api_key=os.getenv("GOOGLE_API_KEY")
        )
        self.llm_model = llm_model
        self.llm_system_prompt = llm_system_prompt
        self.llm_temperature = llm_temperature
        self.llm_output_schema = llm_output_schema

        if not temp_dir:
            self.temp_dir = os.path.join(os.getcwd(), "tmp", "ad_recognition")
        else:
            self.temp_dir = temp_dir

        os.makedirs(self.temp_dir, exist_ok=True)
        self._run_dirs: list[str] = []  # track created run dirs for cleanup

    # ------------------------------------------------------------------ #
    #  Helper: Derive AD label from filename
    # ------------------------------------------------------------------ #
    @staticmethod
    def _label_from_path(pdf_path: str) -> str:
        return os.path.splitext(os.path.basename(pdf_path))[0]

    # ------------------------------------------------------------------ #
    #  Cleanup
    # ------------------------------------------------------------------ #
    def _cleanup_temp(self):
        """Remove all temporary run directories created during this session."""
        if not self._run_dirs:
            return

        logger.info(f"🧹 Cleaning up {len(self._run_dirs)} temp directories...")
        for run_dir in self._run_dirs:
            try:
                shutil.rmtree(run_dir)
                logger.debug(f"   🗑️  Removed: {run_dir}")
            except Exception as e:
                logger.warning(f"   ⚠️  Failed to remove {run_dir}: {e}")
        self._run_dirs.clear()

        # Remove parent temp dir if empty
        try:
            if os.path.exists(self.temp_dir) and not os.listdir(self.temp_dir):
                os.rmdir(self.temp_dir)
                logger.debug(f"   🗑️  Removed empty temp dir: {self.temp_dir}")
        except Exception:
            pass

        logger.info("✅ Cleanup complete")

    # ------------------------------------------------------------------ #
    #  Step 1: PDF -> Images
    # ------------------------------------------------------------------ #
    def _pdf_to_images(self, pdf_path: str, run_dir: str) -> list[str]:
        logger.info(f"📄 Converting PDF to images: {pdf_path} (dpi={self.dpi})")
        imgs_dir = os.path.join(run_dir, "pages")
        os.makedirs(imgs_dir, exist_ok=True)

        with open(pdf_path, "rb") as f:
            img_paths = convert_from_bytes(
                f.read(),
                output_folder=imgs_dir,
                fmt="png",
                paths_only=True,
                dpi=self.dpi,
            )
        logger.info(f"🖼️  Generated {len(img_paths)} page images")
        return img_paths

    # ------------------------------------------------------------------ #
    #  Step 2: Prepare LLM messages
    # ------------------------------------------------------------------ #
    def _prepare_messages(self, img_paths: list[str]) -> list:
        logger.info(f"📦 Preparing {len(img_paths)} images for LLM...")
        messages = ["Now, extract the following images!"]
        for img_path in img_paths:
            logger.debug(f"   🔗 Encoding: {os.path.basename(img_path)}")
            with open(img_path, "rb") as f:
                img_bytes = f.read()
            messages.append(
                types.Part.from_bytes(
                    data=img_bytes,
                    mime_type="image/png",
                )
            )
        logger.info("✅ All images encoded and ready")
        return messages

    # ------------------------------------------------------------------ #
    #  Step 3: Call Gemini for structured extraction
    # ------------------------------------------------------------------ #
    def _extract_with_llm(self, messages: list) -> dict:
        logger.info(f"🤖 Calling LLM model: {self.llm_model}")

        config = types.GenerateContentConfig(
            system_instruction=self.llm_system_prompt,
            temperature=self.llm_temperature,
            response_mime_type="application/json",
            response_json_schema=self.llm_output_schema.model_json_schema(),
        )

        response = self.llm_client.models.generate_content(
            model=self.llm_model,
            config=config,
            contents=messages,
        )

        parsed = self.llm_output_schema.model_validate_json(response.text)
        logger.info("🎯 LLM extraction completed successfully")
        return parsed.model_dump()

    # ------------------------------------------------------------------ #
    #  Step 4: Save extraction results
    # ------------------------------------------------------------------ #
    def _save_extraction(self, data: dict, run_dir: str, label: str) -> str:
        out_path = os.path.join(run_dir, f"{label}_extraction.json")
        # Explicit UTF-8: ensure_ascii=False may write non-ASCII characters.
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2, ensure_ascii=False)
        logger.info(f"💾 Saved extraction: {out_path}")
        return out_path

    # ------------------------------------------------------------------ #
    #  Step 5: Extract a single AD PDF
    # ------------------------------------------------------------------ #
    def extract_ad(self, pdf_path: str, label: Optional[str] = None) -> dict:
        if label is None:
            label = self._label_from_path(pdf_path)

        run_id = uuid4().hex
        run_dir = os.path.join(self.temp_dir, run_id)
        os.makedirs(run_dir, exist_ok=True)
        self._run_dirs.append(run_dir)
        logger.info(f"🚀 [{label}] Starting extraction - run_id={run_id}")

        img_paths = self._pdf_to_images(pdf_path, run_dir)
        messages = self._prepare_messages(img_paths)
        extraction = self._extract_with_llm(messages)
        self._save_extraction(extraction, run_dir, label)

        logger.info(f"✅ [{label}] Extraction complete!")
        return extraction

    # ------------------------------------------------------------------ #
    #  Step 6: Full pipeline
    # ------------------------------------------------------------------ #
    def run_analysis(
        self,
        test_data_path: str,
        ad_file_paths: list[str],
        save_dir: str,
        cleanup: bool = True,
    ) -> str:
        """
        Run the complete AD recognition and comparison pipeline.

        Args:
            test_data_path: Path to test CSV file.
            ad_file_paths: List of AD PDF file paths to extract and compare.
            save_dir: Directory to save final results.
            cleanup: Whether to delete temp directories after saving results.

        Returns:
            Path to the saved results CSV.
        """
        logger.info("🔰" + "=" * 58)
        logger.info(f"🛫 Starting AD Recognition Pipeline - {len(ad_file_paths)} AD(s)")
        logger.info("🔰" + "=" * 58)

        try:
            # --- Extract all AD PDFs ---
            ad_extractions: dict[str, dict] = {}
            for i, pdf_path in enumerate(ad_file_paths, 1):
                label = self._label_from_path(pdf_path)
                logger.info(f"📋 [{i}/{len(ad_file_paths)}] Processing: {label}")
                extraction = self.extract_ad(pdf_path, label=label)
                ad_extractions[label] = extraction

            # --- Load test data ---
            logger.info(f"📊 Loading test data: {test_data_path}")
            test_data = pd.read_csv(test_data_path, sep=",")
            logger.info(f"📐 Test data shape: {test_data.shape}")

            # --- Compare ---
            logger.info(f"⚙️  Running AD comparison against {len(ad_extractions)} AD(s)...")
            result_df = compare_to_ad(test_data, ad_file_dict=ad_extractions)
            logger.info(f"🏁 Comparison done - {len(result_df)} rows classified")

            # --- Present results ---
            print("========== RESULT ==========")
            print(result_df.to_markdown(index=False))
            print("============================")

            # --- Save results ---
            run_timestamp = datetime.now().strftime("%y%m%d")
            run_id = uuid4().hex[:8]
            run_output_dir = os.path.join(save_dir, f"{run_id}_{run_timestamp}")
            os.makedirs(run_output_dir, exist_ok=True)
            logger.info(f"📁 Run output directory: {run_output_dir}")

            result_path = os.path.join(run_output_dir, "ad_classification_results.csv")
            result_df.to_csv(result_path, index=False)
            logger.info(f"💾 Results saved: {result_path}")

            extractions_path = os.path.join(run_output_dir, "ad_extractions.json")
            with open(extractions_path, "w", encoding="utf-8") as f:
                json.dump(ad_extractions, f, indent=2, ensure_ascii=False)
            logger.info(f"💾 Extractions saved: {extractions_path}")

        finally:
            if cleanup:
                self._cleanup_temp()

        logger.info("🔰" + "=" * 58)
        logger.info("🎉 Pipeline complete!")
        logger.info("🔰" + "=" * 58)

        return result_path
    
## Pipeline (With OCR + LLM (text only))
import os
import json
import shutil
import numpy as np
import pandas as pd

from datetime import datetime
from uuid import uuid4
from typing import Optional, List, Dict, Any
from PIL import Image, ImageDraw, ImageFont
from loguru import logger
from pydantic import BaseModel
from google import genai
from google.genai import types
from pdf2image import convert_from_bytes
from paddleocr import PaddleOCR

class ADRecognitionOCR:
    def __init__(
        self,
        dpi: int,
        llm_model: str,
        llm_system_prompt: str,
        llm_temperature: float,
        llm_output_schema: type[BaseModel],
        ocr_device: str = "cpu",
        ocr_precision: str = "fp32",
        ocr_det_model: str = "PP-OCRv5_mobile_det",
        ocr_rec_model: str = "PP-OCRv5_mobile_rec",
        y_threshold: float = 15.0,
        save_ocr_viz: bool = True,
        cpu_threads: int = 8,
        temp_dir: Optional[str] = None,
    ):
        self.dpi = dpi
        self.y_threshold = y_threshold
        self.save_ocr_viz = save_ocr_viz

        # --- LLM ---
        self.llm_client = genai.Client(api_key=os.getenv("GOOGLE_API_KEY"))
        self.llm_model = llm_model
        self.llm_system_prompt = llm_system_prompt
        self.llm_temperature = llm_temperature
        self.llm_output_schema = llm_output_schema

        # --- OCR Engine ---
        is_cpu = ocr_device.lower() == "cpu"

        if is_cpu:
            logger.info(f"🔧 Initializing PaddleOCR engine on CPU with {cpu_threads} threads...")
            _precision = "fp32"
            _enable_mkldnn = False
        else:
            logger.info(f"🔧 Initializing PaddleOCR engine on {ocr_device}...")
            _precision = ocr_precision
            _enable_mkldnn = True

        self.ocr_engine = PaddleOCR(
            use_doc_orientation_classify=False,
            use_doc_unwarping=False,
            use_textline_orientation=False,
            device=ocr_device,
            precision=_precision,
            enable_mkldnn=_enable_mkldnn,
            text_detection_model_name=ocr_det_model,
            text_recognition_model_name=ocr_rec_model,
            cpu_threads=cpu_threads if is_cpu else None,
        )

        if is_cpu:
            logger.info(f"✅ PaddleOCR engine ready (CPU mode - {cpu_threads} threads, mkldnn=off, fp32)")
        else:
            logger.info(f"✅ PaddleOCR engine ready ({ocr_device}, {ocr_precision})")

        # --- Temp dir ---
        if not temp_dir:
            self.temp_dir = os.path.join(os.getcwd(), "tmp/ad_recognition_ocr")
        else:
            self.temp_dir = temp_dir
        os.makedirs(self.temp_dir, exist_ok=True)
        self._run_dirs: list[str] = []

    # ================================================================== #
    #  Helpers
    # ================================================================== #
    @staticmethod
    def _label_from_path(pdf_path: str) -> str:
        return os.path.splitext(os.path.basename(pdf_path))[0]

    def _cleanup_temp(self):
        """Remove all temporary run directories created during this session."""
        if not self._run_dirs:
            return

        logger.info(f"🧹 Cleaning up {len(self._run_dirs)} temp directories...")
        for run_dir in self._run_dirs:
            try:
                shutil.rmtree(run_dir)
                logger.debug(f"   🗑️  Removed: {run_dir}")
            except Exception as e:
                logger.warning(f"   ⚠️  Failed to remove {run_dir}: {e}")
        self._run_dirs.clear()

        try:
            if os.path.exists(self.temp_dir) and not os.listdir(self.temp_dir):
                os.rmdir(self.temp_dir)
                logger.debug(f"   🗑️  Removed empty temp dir: {self.temp_dir}")
        except Exception:
            pass

        logger.info("✅ Cleanup complete")

    # ================================================================== #
    #  Step 1: PDF -> Images
    # ================================================================== #
    def _pdf_to_images(self, pdf_path: str, run_dir: str) -> list[str]:
        logger.info(f"📄 Converting PDF to images: {pdf_path} (dpi={self.dpi})")
        imgs_dir = os.path.join(run_dir, "pages")
        os.makedirs(imgs_dir, exist_ok=True)

        with open(pdf_path, "rb") as f:
            img_paths = convert_from_bytes(
                f.read(),
                output_folder=imgs_dir,
                fmt="png",
                paths_only=True,
                dpi=self.dpi,
            )
        logger.info(f"🖼️  Generated {len(img_paths)} page images")
        return img_paths

    # ================================================================== #
    #  Step 2: OCR
    # ================================================================== #
    def _run_ocr(self, img_paths: list[str]) -> list[dict]:
        logger.info(f"🔍 Running OCR on {len(img_paths)} pages...")
        ocr_results = list(self.ocr_engine.predict(img_paths))
        logger.info(f"✅ OCR complete - {len(ocr_results)} pages processed")
        return ocr_results

    # ================================================================== #
    #  Step 3: OCR Postprocessing (sort + full text)
    # ================================================================== #
    @staticmethod
    def _sort_ocr_reading_order(
        texts: List[str],
        boxes: List[np.ndarray],
        y_threshold: float = 15.0,
    ) -> tuple[List[str], List[np.ndarray]]:
        """Sort OCR results in natural reading order (top-to-bottom, left-to-right)."""
        if not texts:
            return texts, boxes

        coords = []
        for i, box in enumerate(boxes):
            box = np.array(box)
            if box.shape == (4,):
                x_left = box[0]
                y_center = (box[1] + box[3]) / 2
            elif box.shape == (4, 2):
                x_left = box[:, 0].min()
                y_center = box[:, 1].mean()
            else:
                raise ValueError(f"Unexpected box shape: {box.shape}")
            coords.append((i, x_left, y_center))

        coords.sort(key=lambda c: c[2])

        lines = []
        current_line = [coords[0]]
        for item in coords[1:]:
            if abs(item[2] - current_line[0][2]) <= y_threshold:
                current_line.append(item)
            else:
                lines.append(current_line)
                current_line = [item]
        lines.append(current_line)

        sorted_indices = []
        for line in lines:
            line.sort(key=lambda c: c[1])
            sorted_indices.extend([item[0] for item in line])

        sorted_texts = [texts[i] for i in sorted_indices]
        sorted_boxes = [boxes[i] for i in sorted_indices]
        return sorted_texts, sorted_boxes

    def _get_full_text(self, ocr_results: List[Dict[str, Any]]) -> str:
        """Convert OCR results to full text in reading order with page headers."""
        all_pages_text = []
        total_pages = len(ocr_results)

        for page_idx, page in enumerate(ocr_results):
            texts = page.get("rec_texts", [])
            boxes = page.get("rec_boxes", [])

            if not texts:
                continue

            sorted_texts, sorted_boxes = self._sort_ocr_reading_order(
                texts, boxes, self.y_threshold
            )

            coords = []
            for i, box in enumerate(sorted_boxes):
                box = np.array(box)
                if box.shape == (4,):
                    y_center = (box[1] + box[3]) / 2
                else:
                    y_center = box[:, 1].mean()
                coords.append((i, y_center))

            lines_text = []
            current_line_texts = [sorted_texts[0]]
            current_y = coords[0][1]

            for idx in range(1, len(coords)):
                if abs(coords[idx][1] - current_y) <= self.y_threshold:
                    current_line_texts.append(sorted_texts[idx])
                else:
                    line = " ".join(t for t in current_line_texts if t.strip())
                    if line.strip():
                        lines_text.append(line)
                    current_line_texts = [sorted_texts[idx]]
                    current_y = coords[idx][1]

            line = " ".join(t for t in current_line_texts if t.strip())
            if line.strip():
                lines_text.append(line)

            page_num = page_idx + 1
            header = f"\n{'='*60}\n  PAGE {page_num} / {total_pages}\n{'='*60}\n"
            all_pages_text.append(header + "\n".join(lines_text))

        return "\n".join(all_pages_text)

    # ================================================================== #
    #  Step 4: Draw OCR bbox visualizations
    # ================================================================== #
    @staticmethod
    def _draw_ocr_bboxes(
        image_path: str,
        ocr_result: dict,
        output_path: str,
        use_polys: bool = True,
        box_color: str = "red",
        text_color: str = "blue",
        show_text: bool = False,
        font_size: int = 14,
    ) -> None:
        """Draw OCR bounding boxes on the original image and save."""
        img = Image.open(image_path).convert("RGB")
        draw = ImageDraw.Draw(img)

        texts = ocr_result.get("rec_texts", [])
        polys = ocr_result.get("rec_polys" if use_polys else "rec_boxes", [])

        try:
            font = ImageFont.truetype(
                "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf", font_size
            )
        except Exception:
            font = ImageFont.load_default()

        for i, poly in enumerate(polys):
            poly = np.array(poly)

            if poly.shape == (4,):
                x_min, y_min, x_max, y_max = poly
                draw.rectangle([x_min, y_min, x_max, y_max], outline=box_color, width=2)
                text_pos = (x_min, y_min - font_size - 2)
            elif poly.shape == (4, 2):
                points = [tuple(p) for p in poly.astype(int)]
                points.append(points[0])
                draw.line(points, fill=box_color, width=2)
                text_pos = (int(poly[:, 0].min()), int(poly[:, 1].min()) - font_size - 2)
            else:
                continue

            if show_text and i < len(texts) and texts[i].strip():
                draw.text(text_pos, texts[i], fill=text_color, font=font)

        img.save(output_path)

    def _save_ocr_visualizations(
        self,
        img_paths: list[str],
        ocr_results: list[dict],
        save_dir: str,
        label: str,
    ) -> list[str]:
        """Draw and save bbox visualizations for all pages."""
        viz_dir = os.path.join(save_dir, f"{label}_ocr_viz")
        os.makedirs(viz_dir, exist_ok=True)
        viz_paths = []

        logger.info(f"🎨 Drawing OCR visualizations for {len(img_paths)} pages...")
        for i, (img_path, ocr_result) in enumerate(zip(img_paths, ocr_results)):
            viz_path = os.path.join(viz_dir, f"page_{i+1}_ocr_viz.png")
            self._draw_ocr_bboxes(
                image_path=img_path,
                ocr_result=ocr_result,
                output_path=viz_path,
            )
            viz_paths.append(viz_path)
            logger.debug(f"   🖍️  Saved viz: page {i+1}")

        logger.info(f"✅ All OCR visualizations saved to: {viz_dir}")
        return viz_paths

    # ================================================================== #
    #  Step 5: LLM extraction (text-only input)
    # ================================================================== #
    def _extract_with_llm(self, full_text: str) -> dict:
        logger.info(f"🤖 Calling LLM model: {self.llm_model} (text-only mode)")

        config = types.GenerateContentConfig(
            system_instruction=self.llm_system_prompt,
            temperature=self.llm_temperature,
            response_mime_type="application/json",
            response_json_schema=self.llm_output_schema.model_json_schema(),
        )

        response = self.llm_client.models.generate_content(
            model=self.llm_model,
            config=config,
            contents=f"Now extract the following OCR'd text:\n\n{full_text}",
        )

        parsed = self.llm_output_schema.model_validate_json(response.text)
        logger.info("🎯 LLM extraction completed successfully")
        return parsed.model_dump()

    # ================================================================== #
    #  Step 6: Save extraction results
    # ================================================================== #
    def _save_extraction(self, data: dict, run_dir: str, label: str) -> str:
        out_path = os.path.join(run_dir, f"{label}_extraction.json")
        # Explicit UTF-8: ensure_ascii=False may write non-ASCII characters.
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2, ensure_ascii=False)
        logger.info(f"💾 Saved extraction: {out_path}")
        return out_path

    # ================================================================== #
    #  Step 7: Extract a single AD PDF (full OCR pipeline)
    # ================================================================== #
    def extract_ad(
        self, pdf_path: str, label: Optional[str] = None
    ) -> tuple[dict, list[str], list[dict]]:
        """
        Full OCR extraction pipeline for a single AD PDF.

        Returns:
            (extraction_dict, img_paths, ocr_results)
        """
        if label is None:
            label = self._label_from_path(pdf_path)

        run_id = uuid4().hex
        run_dir = os.path.join(self.temp_dir, run_id)
        os.makedirs(run_dir, exist_ok=True)
        self._run_dirs.append(run_dir)
        logger.info(f"🚀 [{label}] Starting OCR extraction - run_id={run_id}")

        # PDF -> Images
        img_paths = self._pdf_to_images(pdf_path, run_dir)

        # Images -> OCR
        ocr_results = self._run_ocr(img_paths)

        # OCR -> Sorted full text
        full_text = self._get_full_text(ocr_results)
        logger.info(f"📝 Full text extracted: {len(full_text)} characters")

        # Save raw OCR text for debugging
        text_path = os.path.join(run_dir, f"{label}_ocr_text.txt")
        with open(text_path, "w", encoding="utf-8") as f:
            f.write(full_text)
        logger.debug(f"   📄 Raw OCR text saved: {text_path}")

        # Text -> LLM structured extraction
        extraction = self._extract_with_llm(full_text)
        self._save_extraction(extraction, run_dir, label)

        logger.info(f"✅ [{label}] OCR extraction complete!")
        return extraction, img_paths, ocr_results

    # ================================================================== #
    #  Step 8: Full pipeline
    # ================================================================== #
    def run_analysis(
        self,
        test_data_path: str,
        ad_file_paths: list[str],
        save_dir: str,
        cleanup: bool = True,
    ) -> str:
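        """
        Run the complete OCR-based AD recognition and comparison pipeline.

        Args:
            test_data_path: Path to test CSV file.
            ad_file_paths: List of AD PDF file paths to extract and compare.
            save_dir: Directory to save final results.
            cleanup: Whether to delete temp directories after saving results.

        Returns:
            Path to the saved results CSV.
        """
        logger.info("🔰" + "=" * 58)
        logger.info(f"🛫 Starting AD Recognition Pipeline (OCR) - {len(ad_file_paths)} AD(s)")
        logger.info("🔰" + "=" * 58)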

        try:
            # --- Extract all AD PDFs via OCR ---
            ad_extractions: dict[str, dict] = {}
            ad_ocr_data: dict[str, tuple[list[str], list[dict]]] = {}

            for i, pdf_path in enumerate(ad_file_paths, 1):
                label = self._label_from_path(pdf_path)
                logger.info(f"📋 [{i}/{len(ad_file_paths)}] Processing: {label}")
                extraction, img_paths, ocr_results = self.extract_ad(pdf_path, label=label)
                ad_extractions[label] = extraction
                ad_ocr_data[label] = (img_paths, ocr_results)

            # --- Create a per-run output directory inside save_dir ---

            run_timestamp = datetime.now().strftime("%y%m%d")
            run_id = uuid4().hex[:8]
            run_output_dir = os.path.join(save_dir, f"{run_id}_{run_timestamp}")
            os.makedirs(run_output_dir, exist_ok=True)
            logger.info(f"📁 Run output directory: {run_output_dir}")

            # --- Save OCR visualizations ---
            if self.save_ocr_viz:
                for label, (img_paths, ocr_results) in ad_ocr_data.items():
                    self._save_ocr_visualizations(
                        img_paths, ocr_results, run_output_dir, label
                    )

            # --- Load test data ---
            logger.info(f"📊 Loading test data: {test_data_path}")
            test_data = pd.read_csv(test_data_path, sep=",")
            logger.info(f"📐 Test data shape: {test_data.shape}")

            # --- Compare ---
            logger.info(f"⚙️  Running AD comparison against {len(ad_extractions)} AD(s)...")
            result_df = compare_to_ad(test_data, ad_file_dict=ad_extractions)
            logger.info(f"🏁 Comparison done - {len(result_df)} rows classified")

            # --- Present results ---
            print("========== RESULT ==========")
            print(result_df.to_markdown(index=False))
            print("============================")

            # --- Save results ---
            result_path = os.path.join(run_output_dir, "ad_classification_results.csv")
            result_df.to_csv(result_path, index=False)
            logger.info(f"💾 Results saved: {result_path}")

            extractions_path = os.path.join(run_output_dir, "ad_extractions.json")
            # ensure_ascii=False writes non-ASCII characters verbatim, so pin the file encoding
            with open(extractions_path, "w", encoding="utf-8") as f:
                json.dump(ad_extractions, f, indent=2, ensure_ascii=False)
            logger.info(f"💾 Extractions saved: {extractions_path}")

        finally:
            if cleanup:
                self._cleanup_temp()

        logger.info("🔰" + "=" * 58)
        logger.info("🎉 Pipeline complete!")
        logger.info("🔰" + "=" * 58)

        return result_path




## Pipeline Section

### LLM ONLY SECTION

In [6]:
DPI = 300
LLM_MODEL = "gemini-2.5-flash"
LLM_TEMPERATURE = 0.1

# Initializing pipeline
pipeline = ADRecognitionFullLLM(
    dpi=DPI,
    llm_model=LLM_MODEL,
    llm_system_prompt=SYSTEM_PROMPT,
    llm_temperature=LLM_TEMPERATURE,
    llm_output_schema=ADApplicabilityExtraction,
)

In [None]:
import os
from pathlib import Path

current_path = Path(os.getcwd())  # Notebook path
root_project_path = current_path.parent

TEST_DATA_PATH = os.path.join(root_project_path, "test/ad_test_data.csv")  # CHANGE THIS BASED ON YOUR PATH
AD_FILE_DIR = os.path.join(root_project_path, "documents")  # CHANGE THIS BASED ON YOUR PATH
# Keep only PDFs so stray files in the folder are not sent to the pipeline
AD_FILE_PATHS = [
    os.path.join(AD_FILE_DIR, file_name)
    for file_name in os.listdir(AD_FILE_DIR)
    if file_name.lower().endswith(".pdf")
]
SAVE_DIR = os.path.join(root_project_path, "results") # CHANGE THIS BASED ON YOUR PATH
os.makedirs(SAVE_DIR, exist_ok=True)
CLEANUP = True

# Run Analysis
result_path = pipeline.run_analysis(
    test_data_path=TEST_DATA_PATH,
    ad_file_paths=AD_FILE_PATHS,
    save_dir=SAVE_DIR,
    cleanup=CLEANUP,
)

2026-02-21 05:24:21.858 | INFO     | __main__:run_analysis:193 - 🛫 Starting AD Recognition Pipeline - 2 AD(s)
2026-02-21 05:24:21.860 | INFO     | __main__:run_analysis:201 - 📋 [1/2] Processing: EASA_AD_US-2025-23-53_1
2026-02-21 05:24:21.862 | INFO     | __main__:extract_ad:160 - 🚀 [EASA_AD_US-2025-23-53_1] Starting extraction - run_id=d606e55fd0db40dca876850496a67dd3
2026-02-21 05:24:21.862 | INFO     | __main__:_pdf_to_images:82 - 📄 Converting PDF to images: /home/naufal/soji_ai/documents/EASA_AD_US-2025-23-53_1.pdf (dpi=300)
2026-02-21 05:24:26.382 | INFO     | __main__:_pdf_to_images:94 - 🖼️  Generated 7 page images
2026-02-21 05:24:26.383 | INFO     | __main__:_prepare_mess

| aircraft_model   |   msn | modifications_applied   | EASA_AD_US-2025-23-53_1   | EASA_AD_2025-0254R1_1   |
|:-----------------|------:|:------------------------|:--------------------------|:------------------------|
| MD-11            | 48123 | nan                     | ✅ Affected               | ❌ Not applicable       |
| DC-10-30F        | 47890 | nan                     | ✅ Affected               | ❌ Not applicable       |
| Boeing 737-800   | 30123 | nan                     | ❌ Not applicable         | ❌ Not applicable       |
| A320-214         |  5234 | nan                     | ❌ Not applicable         | ✅ Affected             |
| A320-232         |  6789 | mod 24591 (production)  | ❌ Not applicable         | ❌ Not Affected         |
| A320-214         |  7456 | SB A320-57-1089 Rev 04  | ❌ Not applicable         | ❌ Not Affected         |
| A321-111         |  8123 | nan                     | ❌ Not applicable         | ✅ Affected             |
| A32

### LOCAL OCR + LLM (TEXT ONLY) CPU ONLY SECTION

In [15]:
DPI = 300
LLM_MODEL = "gemini-2.5-flash"
LLM_TEMPERATURE = 0.1
OCR_DEVICE = "cpu" # DO NOT CHANGE THIS
OCR_PRECISION = "fp32" # DO NOT CHANGE THIS
OCR_DET_MODEL = "PP-OCRv5_mobile_det" # DO NOT CHANGE THIS
OCR_REC_MODEL = "PP-OCRv5_mobile_rec" # DO NOT CHANGE THIS
OCR_CPU_THREADS = 8 # ADJUST BASED ON NUMBER OF CPU THREADS
OCR_Y_THRESHOLD = 15.0 # RECOMMENDED 10-15
OCR_SAVE_VIZ = True # RECOMMENDED TO SAVE

# Initializing pipeline
pipeline = ADRecognitionOCR(
    dpi=DPI,
    llm_model=LLM_MODEL,
    llm_system_prompt=SYSTEM_PROMPT,
    llm_temperature=LLM_TEMPERATURE,
    llm_output_schema=ADApplicabilityExtraction,
    ocr_device=OCR_DEVICE,
    ocr_precision=OCR_PRECISION,
    ocr_det_model=OCR_DET_MODEL,
    ocr_rec_model=OCR_REC_MODEL,
    y_threshold=OCR_Y_THRESHOLD,
    save_ocr_viz=OCR_SAVE_VIZ,
    cpu_threads=OCR_CPU_THREADS,
)

2026-02-21 05:34:26.206 | INFO     | __main__:__init__:296 - 🔧 Initializing PaddleOCR engine on CPU with 8 threads...
Creating model: ('PP-OCRv5_mobile_det', None)
Model files already exist. Using cached files. To redownload, please delete the directory manually: `/home/naufal/.paddlex/official_models/PP-OCRv5_mobile_det`.
Creating model: ('PP-OCRv5_mobile_rec', None)
Model files already exist. Using cached files. To redownload, please delete the directory manually: `/home/naufal/.paddlex/official_models/PP-OCRv5_mobile_rec`.
2026-02-21 05:34:32.062 | INFO     | __main__:__init__:317 - ✅ PaddleOCR engine ready (CPU mode - 8 threads, mkldnn=off, fp32)


In [16]:
import os
from pathlib import Path

current_path = Path(os.getcwd())  # Notebook path
root_project_path = current_path.parent

TEST_DATA_PATH = os.path.join(root_project_path, "test/ad_test_data.csv")  # CHANGE THIS BASED ON YOUR PATH
AD_FILE_DIR = os.path.join(root_project_path, "documents")  # CHANGE THIS BASED ON YOUR PATH
# Keep only PDFs so stray files in the folder are not sent to the pipeline
AD_FILE_PATHS = [
    os.path.join(AD_FILE_DIR, file_name)
    for file_name in os.listdir(AD_FILE_DIR)
    if file_name.lower().endswith(".pdf")
]
SAVE_DIR = os.path.join(root_project_path, "results") # CHANGE THIS BASED ON YOUR PATH
os.makedirs(SAVE_DIR, exist_ok=True)
CLEANUP = True

# Run Analysis
result_path = pipeline.run_analysis(
    test_data_path=TEST_DATA_PATH,
    ad_file_paths=AD_FILE_PATHS,
    save_dir=SAVE_DIR,
    cleanup=CLEANUP,
)

2026-02-21 05:34:49.799 | INFO     | __main__:run_analysis:645 - 🛫 Starting AD Recognition Pipeline (OCR) - 2 AD(s)
2026-02-21 05:34:49.800 | INFO     | __main__:run_analysis:655 - 📋 [1/2] Processing: EASA_AD_US-2025-23-53_1
2026-02-21 05:34:49.801 | INFO     | __main__:extract_ad:609 - 🚀 [EASA_AD_US-2025-23-53_1] Starting OCR extraction - run_id=7d2b1090b8f443589f5905c6079152bb
2026-02-21 05:34:49.802 | INFO     | __main__:_pdf_to_images:363 - 📄 Converting PDF to images: /home/naufal/soji_ai/documents/EASA_AD_US-2025-23-53_1.pdf (dpi=300)
2026-02-21 05:34:54.106 | INFO     | __main__:_pdf_to_images:375 - 🖼️  Generated 7 page images
2026-02-21 05:34:54.106 | INFO     | __main__:_

| aircraft_model   |   msn | modifications_applied   | EASA_AD_US-2025-23-53_1   | EASA_AD_2025-0254R1_1   |
|:-----------------|------:|:------------------------|:--------------------------|:------------------------|
| MD-11            | 48123 | nan                     | ✅ Affected               | ❌ Not applicable       |
| DC-10-30F        | 47890 | nan                     | ✅ Affected               | ❌ Not applicable       |
| Boeing 737-800   | 30123 | nan                     | ❌ Not applicable         | ❌ Not applicable       |
| A320-214         |  5234 | nan                     | ❌ Not applicable         | ✅ Affected             |
| A320-232         |  6789 | mod 24591 (production)  | ❌ Not applicable         | ❌ Not Affected         |
| A320-214         |  7456 | SB A320-57-1089 Rev 04  | ❌ Not applicable         | ❌ Not Affected         |
| A321-111         |  8123 | nan                     | ❌ Not applicable         | ✅ Affected             |
| A32

### LOCAL OCR + LLM (TEXT ONLY) GPU ONLY SECTION

In [17]:
DPI = 300
LLM_MODEL = "gemini-2.5-flash"
LLM_TEMPERATURE = 0.1
OCR_DEVICE = "gpu:0" # DO NOT CHANGE THIS
OCR_PRECISION = "fp16" # RECOMMENDED FOR GPU
OCR_DET_MODEL = "PP-OCRv5_mobile_det" # DO NOT CHANGE THIS
OCR_REC_MODEL = "PP-OCRv5_mobile_rec" # DO NOT CHANGE THIS
OCR_CPU_THREADS = 8 # NOT PASSED TO THE PIPELINE BELOW (cpu_threads IS ONLY SET IN THE CPU SECTION)
OCR_Y_THRESHOLD = 15.0 # RECOMMENDED 10-15
OCR_SAVE_VIZ = True # RECOMMENDED TO SAVE

# Initializing pipeline
pipeline = ADRecognitionOCR(
    dpi=DPI,
    llm_model=LLM_MODEL,
    llm_system_prompt=SYSTEM_PROMPT,
    llm_temperature=LLM_TEMPERATURE,
    llm_output_schema=ADApplicabilityExtraction,
    ocr_device=OCR_DEVICE,
    ocr_precision=OCR_PRECISION,
    ocr_det_model=OCR_DET_MODEL,
    ocr_rec_model=OCR_REC_MODEL,
    y_threshold=OCR_Y_THRESHOLD,
    save_ocr_viz=OCR_SAVE_VIZ,
)

2026-02-21 05:55:35.832 | INFO     | __main__:__init__:300 - 🔧 Initializing PaddleOCR engine on gpu:0...
Creating model: ('PP-OCRv5_mobile_det', None)
Model files already exist. Using cached files. To redownload, please delete the directory manually: `/home/naufal/.paddlex/official_models/PP-OCRv5_mobile_det`.
Creating model: ('PP-OCRv5_mobile_rec', None)
Model files already exist. Using cached files. To redownload, please delete the directory manually: `/home/naufal/.paddlex/official_models/PP-OCRv5_mobile_rec`.
2026-02-21 05:55:38.171 | INFO     | __main__:__init__:319 - ✅ PaddleOCR engine ready (gpu:0, fp16)


In [18]:
import os
from pathlib import Path

current_path = Path(os.getcwd())  # Notebook path
root_project_path = current_path.parent

TEST_DATA_PATH = os.path.join(root_project_path, "test/ad_test_data.csv")  # CHANGE THIS BASED ON YOUR PATH
AD_FILE_DIR = os.path.join(root_project_path, "documents")  # CHANGE THIS BASED ON YOUR PATH
# Keep only PDFs so stray files in the folder are not sent to the pipeline
AD_FILE_PATHS = [
    os.path.join(AD_FILE_DIR, file_name)
    for file_name in os.listdir(AD_FILE_DIR)
    if file_name.lower().endswith(".pdf")
]
SAVE_DIR = os.path.join(root_project_path, "results") # CHANGE THIS BASED ON YOUR PATH
os.makedirs(SAVE_DIR, exist_ok=True)
CLEANUP = True

# Run Analysis
result_path = pipeline.run_analysis(
    test_data_path=TEST_DATA_PATH,
    ad_file_paths=AD_FILE_PATHS,
    save_dir=SAVE_DIR,
    cleanup=CLEANUP,
)

2026-02-21 05:56:18.032 | INFO     | __main__:run_analysis:645 - 🛫 Starting AD Recognition Pipeline (OCR) - 2 AD(s)
2026-02-21 05:56:18.034 | INFO     | __main__:run_analysis:655 - 📋 [1/2] Processing: EASA_AD_US-2025-23-53_1
2026-02-21 05:56:18.036 | INFO     | __main__:extract_ad:609 - 🚀 [EASA_AD_US-2025-23-53_1] Starting OCR extraction - run_id=f740392b2f274cc587c5dca134de7a0b
2026-02-21 05:56:18.037 | INFO     | __main__:_pdf_to_images:363 - 📄 Converting PDF to images: /home/naufal/soji_ai/documents/EASA_AD_US-2025-23-53_1.pdf (dpi=300)
2026-02-21 05:56:22.592 | INFO     | __main__:_pdf_to_images:375 - 🖼️  Generated 7 page images
2026-02-21 05:56:22.593 | INFO     | __main__:_

| aircraft_model   |   msn | modifications_applied   | EASA_AD_US-2025-23-53_1   | EASA_AD_2025-0254R1_1   |
|:-----------------|------:|:------------------------|:--------------------------|:------------------------|
| MD-11            | 48123 | nan                     | ✅ Affected               | ❌ Not applicable       |
| DC-10-30F        | 47890 | nan                     | ✅ Affected               | ❌ Not applicable       |
| Boeing 737-800   | 30123 | nan                     | ❌ Not applicable         | ❌ Not applicable       |
| A320-214         |  5234 | nan                     | ❌ Not applicable         | ✅ Affected             |
| A320-232         |  6789 | mod 24591 (production)  | ❌ Not applicable         | ❌ Not Affected         |
| A320-214         |  7456 | SB A320-57-1089 Rev 04  | ❌ Not applicable         | ❌ Not Affected         |
| A321-111         |  8123 | nan                     | ❌ Not applicable         | ✅ Affected             |
| A32

# Written Report: AD Document Extraction System

## Approach

My pipeline follows a two-stage architecture: **Local OCR (PaddleOCR) → LLM (text-only)** for extraction, followed by a **deterministic rule-based engine** for applicability evaluation.

The core idea is straightforward: OCR converts the PDF into raw text, the LLM parses that text into structured JSON, and a rule engine checks each aircraft against the parsed data. I chose this over a pure regex/template approach because AD document layouts are not standardized. They vary across issuing authorities, and even within EASA the formatting shifts between revisions. Any hard-coded parser would need constant patching, which makes it a maintenance trap.

I also considered a **full multimodal LLM approach** (sending the PDF directly to a vision-capable model), which would be simpler to implement. I opted against it for two reasons: multimodal inference is more expensive per request, and, more critically, vision models tend to underperform on small, dense text such as MSN lists, modification numbers, and SB identifiers. These are exactly the fields where precision matters most in AD compliance. A missed serial number or a misread modification ID can lead to an incorrect applicability determination, and in aviation that is not an acceptable error.

The OCR + LLM route gives me a layer of control between the document and the model. I can inspect, clean, and restructure the OCR output before the LLM ever sees it, which is not possible when the model reads the PDF as an image.

## Challenges

The hardest part was not the LLM extraction itself; it was everything around the OCR output.

**OCR post-processing** was the most time-consuming challenge. Raw PaddleOCR output is flat and unordered: it has no notion that a block of text is a table header, or that two columns should be read left-to-right rather than top-to-bottom. I had to build post-processing logic to reconstruct reading order, merge fragmented text blocks, and normalize common OCR artifacts (misread characters in identifiers, inconsistent whitespace, broken line continuations). Getting this right was essential because garbage in from OCR means garbage out from the LLM, no matter how good the prompt is.
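
To illustrate the reading-order step, here is a minimal sketch that groups OCR fragments into lines by vertical proximity and then reads each line left to right. It assumes every fragment has already been reduced to a text string, a left edge, and a vertical center; `OcrBox` and `group_into_lines` are hypothetical names for illustration, not PaddleOCR's API and not necessarily how the pipeline implements it. The default threshold mirrors the `y_threshold` setting used in the notebook cells.

```python
from dataclasses import dataclass


@dataclass
class OcrBox:
    """One recognized text fragment (hypothetical structure for this sketch)."""
    text: str
    x: float  # left edge of the bounding box, in pixels
    y: float  # vertical center of the bounding box, in pixels


def group_into_lines(boxes: list[OcrBox], y_threshold: float = 15.0) -> list[str]:
    """Cluster boxes into lines by vertical proximity, then read left to right."""
    lines: list[list[OcrBox]] = []
    for box in sorted(boxes, key=lambda b: b.y):
        # Join the current line if the box is within y_threshold of that line's first box
        if lines and abs(box.y - lines[-1][0].y) <= y_threshold:
            lines[-1].append(box)
        else:
            lines.append([box])
    return [" ".join(b.text for b in sorted(line, key=lambda b: b.x)) for line in lines]


# Fragments arrive in arbitrary order, as they might from an OCR engine
boxes = [
    OcrBox("MSN", x=300, y=102),
    OcrBox("Model", x=50, y=100),
    OcrBox("A320-214", x=50, y=140),
    OcrBox("5234", x=300, y=143),
]
print(group_into_lines(boxes))  # ['Model MSN', 'A320-214 5234']
```

A single threshold like this works for simple rows but does not separate side-by-side columns, which would need an additional horizontal split.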

**Schema design and evaluation** were another significant effort. I formulated the extraction output as a strict Pydantic schema (`ADApplicabilityExtraction`) with clearly separated fields for models, MSN constraints, modification constraints, SB constraints, aircraft groups, and required actions. The challenge was handling the many edge cases in AD language: "all MSN except...", "whichever occurs first", recurring vs. one-time compliance, and terminating actions that cancel other paragraphs. Each of these required careful schema modeling and explicit extraction rules in the system prompt to prevent the LLM from conflating similar-but-distinct concepts (e.g., Airbus `mod` numbers vs. Service Bulletin identifiers, which look similar but have completely different compliance implications).

**Building the deterministic applicability engine** required translating nuanced AD logic into boolean checks. The three-stage evaluation (model → MSN → modification/SB exclusion) sounds simple, but the devil is in the details: handling inclusive vs. exclusive range bounds, matching modification identifiers with regex while avoiding partial matches, and correctly implementing exclusion logic where an aircraft is initially in-scope but exempted by an already-embodied mod or SB.
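
The three stages can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's `compare_to_ad` implementation: `MsnRange`, `AdRule`, `has_identifier`, and `evaluate` are hypothetical names, the rule values are made up, and the status labels simply mirror the ones shown in the result tables.

```python
import re
from dataclasses import dataclass, field


@dataclass
class MsnRange:
    low: int
    high: int
    low_inclusive: bool = True
    high_inclusive: bool = True

    def contains(self, msn: int) -> bool:
        above = msn >= self.low if self.low_inclusive else msn > self.low
        below = msn <= self.high if self.high_inclusive else msn < self.high
        return above and below


@dataclass
class AdRule:
    models: set[str]
    msn_ranges: list[MsnRange] = field(default_factory=list)  # empty means all MSNs
    excluding_ids: list[str] = field(default_factory=list)    # mods/SBs that exempt an aircraft


def has_identifier(applied: str, identifier: str) -> bool:
    """True only for whole-token matches, so 'mod 24591' does not match 'mod 245910'."""
    pattern = rf"(?<![A-Za-z0-9]){re.escape(identifier)}(?![A-Za-z0-9])"
    return re.search(pattern, applied, flags=re.IGNORECASE) is not None


def evaluate(rule: AdRule, model: str, msn: int, applied: str = "") -> str:
    # Stage 1: model must be listed in the AD
    if model not in rule.models:
        return "Not applicable"
    # Stage 2: MSN must fall in at least one range, if any ranges are given
    if rule.msn_ranges and not any(r.contains(msn) for r in rule.msn_ranges):
        return "Not applicable"
    # Stage 3: an already-embodied mod or SB exempts an otherwise in-scope aircraft
    if any(has_identifier(applied, ident) for ident in rule.excluding_ids):
        return "Not Affected"
    return "Affected"


# Illustrative rule with made-up values, not taken from a real AD
rule = AdRule(
    models={"A320-214", "A320-232"},
    msn_ranges=[MsnRange(5000, 9000)],
    excluding_ids=["mod 24591"],
)
print(evaluate(rule, "A320-232", 6789, "mod 24591 (production)"))  # Not Affected
print(evaluate(rule, "A320-214", 5234))                            # Affected
```

The lookaround-based match is one way to avoid partial matches; it treats any letter or digit adjacent to the identifier as a sign that the token is a different identifier.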

## Limitations

There are several areas where this approach can fall short:

**Layout-dependent information loss.** When OCR flattens a PDF into text, spatial relationships (table structures, column alignments, indentation hierarchies) are partially lost. For most AD paragraphs this is manageable, but for complex multi-column tables, such as group definitions that map models to MSN ranges, the flattened text can be ambiguous. A multimodal model would handle these cases better since it can "see" the table structure visually.

**LLM extraction is not deterministic.** Even with a strict schema and a detailed prompt, the LLM can occasionally misclassify a field, hallucinate a value, or miss a constraint. The Pydantic validation layer mitigates this by rejecting malformed output, but subtle errors, such as a mod number placed in the SB field, can slip through if the prompt guardrails aren't specific enough.
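
One possible guard against the mod-in-SB-field kind of error is a format check applied after schema validation. The sketch below is an assumption-heavy illustration: the field names `sb_constraints` and `modification_constraints` are hypothetical, and the two patterns are inferred only from the sample values `mod 24591` and `SB A320-57-1089` in the test data, so real identifiers may need broader patterns.

```python
import re

# Assumed identifier shapes, based only on the sample rows in the test data
MOD_PATTERN = re.compile(r"^(?:mod(?:ification)?\.?\s*)?\d{4,6}$", re.IGNORECASE)
SB_PATTERN = re.compile(r"^(?:SB\s*)?[A-Z]\d{3}-\d{2}-\d{4}$", re.IGNORECASE)


def misplaced_fields(extraction: dict[str, list[str]]) -> list[str]:
    """Return warnings for values whose format suggests they sit in the wrong field."""
    warnings = []
    for value in extraction.get("sb_constraints", []):
        if MOD_PATTERN.match(value.strip()):
            warnings.append(f"sb_constraints contains mod-like value: {value!r}")
    for value in extraction.get("modification_constraints", []):
        if SB_PATTERN.match(value.strip()):
            warnings.append(f"modification_constraints contains SB-like value: {value!r}")
    return warnings


# A swapped pair triggers one warning per field
swapped = {
    "sb_constraints": ["mod 24591"],
    "modification_constraints": ["SB A320-57-1089"],
}
for warning in misplaced_fields(swapped):
    print(warning)
```

Warnings like these could be logged or routed to human review rather than used to reject the extraction outright.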

**GPU bottleneck in the current setup.** The OCR stage runs on a laptop GPU with 4 GB of VRAM, which is a practical limitation. Processing is slower than it would be on a dedicated GPU with more CUDA cores; in production, this would need to be addressed with better hardware or a batched processing queue.

**With more time, I would:**
- Add a confidence scoring layer: have the LLM output confidence levels per extracted field, then flag low-confidence extractions for human review.
- Build an automated evaluation pipeline that compares LLM extraction output against a ground-truth dataset of manually parsed ADs, measuring field-level precision and recall.
- Experiment with a hybrid approach: use OCR + LLM as the primary path, but fall back to multimodal extraction for documents where OCR post-processing detects likely table structures that would benefit from visual understanding.
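
As a rough idea of what the field-level evaluation could look like, the sketch below scores each extracted field as a set of values against a manually parsed ground truth. The function names, field names, and sample values are hypothetical, and treating an empty prediction or an empty ground truth as a score of 1.0 is a convention chosen for this sketch.

```python
def field_scores(predicted: set[str], expected: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one field, comparing values as sets."""
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(expected) if expected else 1.0
    return precision, recall


def evaluate_extraction(
    predicted: dict[str, list[str]],
    expected: dict[str, list[str]],
) -> dict[str, tuple[float, float]]:
    """Score every ground-truth field; fields missing from the prediction count as empty."""
    return {
        name: field_scores(set(predicted.get(name, [])), set(values))
        for name, values in expected.items()
    }


# Made-up example: one extra model predicted, one modification missed
predicted = {"models": ["A320-214", "A320-232", "A321-111"], "mods": ["24591"]}
expected = {"models": ["A320-214", "A321-111"], "mods": ["24591", "26000"]}
for name, (precision, recall) in evaluate_extraction(predicted, expected).items():
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

Exact string matching is strict; normalizing identifiers (case, whitespace, prefixes) before comparison would probably be needed on real extractions.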

## Trade-offs

**Why LLM?** Because AD documents are written in natural language with enough variation that deterministic parsing (regex, template matching) is fragile. The LLM absorbs the ambiguity: it can handle paraphrased compliance language, varying section orderings, and inconsistent formatting without breaking. The trade-off is cost and non-determinism, but for this use case the flexibility outweighs the risk, especially when paired with schema validation.

**Why OCR + text-only LLM over a full multimodal (VLM) approach?** Three reasons: cost, precision, and control. Text-only inference is cheaper. OCR engines are purpose-built for text recognition and tend to outperform vision models on small, dense identifiers. And the intermediate OCR step gives me a post-processing hook: I can clean, validate, and restructure the text before the LLM processes it. With a VLM, the model is a black box between PDF-in and JSON-out, so I have no opportunity to intervene when the document is messy.

That said, VLMs win on simplicity and layout understanding. If cost were not a constraint and the documents were consistently well-formatted, a multimodal approach would be a perfectly valid, and arguably simpler, choice. The right answer depends on the production context: how many ADs you're processing, how often, and how much tolerance you have for per-request cost vs. pipeline complexity.