# Ingestion Demo: Structural Parser
This notebook demonstrates `StructuralParser`, which parses a PDF with `LlamaParse` and applies the Context Walker logic to attach a section path to each extracted block.

In [9]:
%load_ext autoreload
%autoreload 2

import os
import sys
from pathlib import Path

# Add ../src to sys.path so the local venra package can be imported
sys.path.append(os.path.abspath("../src"))

from venra.ingestion import StructuralParser
from venra.logging_config import logger
import nest_asyncio

# Required for running async in notebook
nest_asyncio.apply()


In [10]:
PDF_PATH = "../data/10K_TD_test.pdf"
DOM_OUTPUT = "../data/processed/10k_test_dom.pkl"

parser = StructuralParser()
blocks = await parser.parse_pdf(PDF_PATH)

2026-01-30 17:51:22,427 - venra - INFO - Starting LlamaParse for: ../data/10K_TD_test.pdf
Started parsing the file under job_id 4b8adc99-4fc2-4306-b645-7eb4c1d8eb42


In [11]:
print(f"Extracted {len(blocks)} blocks.")

for i, block in enumerate(blocks[:10]):
    print(f"--- Block {i} ({block.block_type}) ---")
    print(f"Path: {block.section_path}")
    print(f"Content (first 100 chars): {block.content[:100]}...")
    print("\n")

Extracted 65 blocks.
--- Block 0 (BlockType.TEXT) ---
Path: ['☒ Annual Report Pursuant to Section 13 or 15(d) of the Securities Exchange Act of 1934']
Content (first 100 chars): For the fiscal year ended September 30, 2025...


--- Block 1 (BlockType.TEXT) ---
Path: ['☐ Transition Report pursuant to Section 13 or 15(d) of the Securities Exchange Act of 1934']
Content (first 100 chars): For the transition period from    to

Commission File Number 001-32833...


--- Block 2 (BlockType.TEXT) ---
Path: ['TransDigm Group Incorporated']
Content (first 100 chars): (Exact name of registrant as specified in its charter)...


--- Block 3 (BlockType.TEXT) ---
Path: ['Delaware']
Content (first 100 chars): (State or other jurisdiction of incorporation or organization)...


--- Block 4 (BlockType.TEXT) ---
Path: ['41-2101738']
Content (first 100 chars): (I.R.S. Employer Identification No.)

1350 Euclid Avenue, Suite 1600, Cleveland, Ohio 44115

(Addres...


--- Block 5 (BlockType.TEXT) ---
Path: ['(

  HITL (Human-in-the-Loop) Acceptance Analysis:
   1. Block Differentiation: The parser distinguishes TEXT blocks from TABLE blocks.
   2. Context Preservation: The Path attribute (e.g., ['TransDigm Group Incorporated'])
      shows that the "Context Walker" logic is capturing headers, which is critical
      for tying numbers to a specific company and document section.
   3. Content Integrity: The snippet of Block 6 (| Title of each class: | Trading Symbol:
      | ...) shows that LlamaParse preserves the table structure as Markdown, which
      the "Table Melter" in Stage 2 will need.
   4. Serialization: The log confirms that the DOM was saved to
      ../data/processed/10k_test_dom.pkl.
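
To make the Context Preservation point concrete, here is a minimal sketch of how a context walker could attach header paths to blocks. It is not the `venra` implementation: `Block`, `walk_markdown`, and the rule that a Markdown `#` heading opens a new section are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical stand-in for a parsed block (not venra's class)."""
    content: str
    section_path: List[str] = field(default_factory=list)


def walk_markdown(markdown: str) -> List[Block]:
    """Group body lines under the stack of headings that precede them."""
    stack: List[Tuple[int, str]] = []  # (level, title) for each open heading
    blocks: List[Block] = []
    buffer: List[str] = []

    def flush() -> None:
        # Emit the accumulated body text as one block under the current path.
        text = "\n".join(buffer).strip()
        if text:
            blocks.append(Block(text, [title for _, title in stack]))
        buffer.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            flush()
            level = len(stripped) - len(stripped.lstrip("#"))
            title = stripped.lstrip("#").strip()
            # Close headings at the same or a deeper level before opening this one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, title))
        else:
            buffer.append(line)
    flush()
    return blocks


sample = """# TransDigm Group Incorporated
(Exact name of registrant as specified in its charter)
# Delaware
(State or other jurisdiction of incorporation or organization)
"""

for block in walk_markdown(sample):
    print(block.section_path, "->", block.content)
```

Keeping a stack of open headings, rather than only the most recent one, means a block under a nested heading would carry its full ancestry in `section_path`.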

In [None]:
parser.save_dom(blocks, DOM_OUTPUT)

2026-01-30 17:51:27,452 - venra - INFO - DOM saved to ../data/processed/10k_test_dom.pkl
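
A quick way to check the saved file is to reload it and compare it with the in-memory blocks. The sketch below assumes that `save_dom` writes the list with the standard `pickle` module (suggested by the `.pkl` extension, but not confirmed here) and that each block exposes `block_type`; `load_dom` and `count_block_types` are hypothetical helpers, not part of `venra`.

```python
import pickle
from collections import Counter
from pathlib import Path


def load_dom(path):
    """Load a pickled block list. Assumes save_dom used the standard pickle module."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


def count_block_types(blocks):
    """Count blocks per block_type, keyed by the type's string form."""
    return Counter(str(block.block_type) for block in blocks)


# In the notebook, with venra importable so pickle can resolve its classes:
# reloaded = load_dom(DOM_OUTPUT)
# assert len(reloaded) == len(blocks)
# print(count_block_types(reloaded))
```

Unpickling executes code embedded in the file, so this check should only be run on files produced by your own pipeline.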
