# Markdown-Native RAG Demo

This notebook demonstrates how to ingest Markdown files generated by LlamaParse while preserving their structure (headers, sections).

## Key Ideas
1. **Hierarchical Splitting**: Use `MarkdownHeaderTextSplitter` to keep text attached to its headers (e.g., `# Audit Result > ## Financial Loss`).
2. **Context Preservation**: Header metadata travels with each chunk and acts as "breadcrumbs" that tell the LLM where the text sits in the document.
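
To make the breadcrumb idea concrete, here is a minimal pure-Python sketch of header-based splitting. It is a simplified stand-in for illustration, not the implementation of `MarkdownHeaderTextSplitter`; the `split_by_headers` helper and the sample text are assumptions made up for this example.

```python
import re

def split_by_headers(markdown_text, max_level=3):
    """Group body lines under their nearest headers and record the header path."""
    sections = []   # list of (breadcrumbs dict, section text)
    path = {}       # current header path, e.g. {"Header 1": "Audit Result"}
    buffer = []     # body lines collected under the current path

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            sections.append((dict(path), text))
        buffer.clear()

    for line in markdown_text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match and len(match.group(1)) <= max_level:
            flush()
            level = len(match.group(1))
            # A new header invalidates headers at the same or a deeper level.
            for key in [k for k in path if int(k.split()[-1]) >= level]:
                del path[key]
            path[f"Header {level}"] = match.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return sections

# Hypothetical sample document
sample = (
    "# Audit Result\nSummary text.\n"
    "## Financial Loss\nLoss details.\n"
    "## Remedy\nRemedy details."
)
sections = split_by_headers(sample)
for crumbs, text in sections:
    print(" > ".join(crumbs.values()), "|", text)
```

Each section keeps the full path of headers above it, so a chunk about losses still carries `Audit Result > Financial Loss` even after it is separated from the rest of the document.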

In [1]:
!pip install langchain-text-splitters



In [2]:
from langchain_text_splitters import MarkdownHeaderTextSplitter

# Load a single parsed file for testing
input_file = "2538-00_parsed.md"
with open(input_file, "r", encoding="utf-8") as f:
    markdown_text = f.read()

print(f"Loaded {len(markdown_text)} characters.")
# print(markdown_text[:500])

Loaded 59 characters.


### Step 1: Split by Headers
Define the header levels to track. LlamaParse usually emits standard `#` and `##` headers.

In [3]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_text)

# CLEANING STEP: Remove superscript-like citations (e.g., '플라이애시1)', 'Ａ16)')
import re
def clean_text(text):
    # Footnote markers look like digit(s) followed by ')', attached to a word.
    # Run the full-width-letter pattern first: 'Ａ' also counts as a letter, so the
    # generic pattern below would otherwise strip '16)' and leave a stray ' Ａ'.
    text = re.sub(r'\s[Ａ-Ｚ]+\d+\)', '', text)       # remove ' Ａ16)'
    # Require a letter (not a digit) before the marker so '(2021)' or '10)' stay intact.
    text = re.sub(r'(?<=[^\W\d_])\d+\)', '', text)  # '플라이애시1)' -> '플라이애시'
    text = re.sub(r'\.{3,}', '', text)              # remove TOC dot leaders (e.g., '......')
    return text

for doc in md_header_splits:
    doc.page_content = clean_text(doc.page_content)

print(f"Split into {len(md_header_splits)} Parent Sections based on headers (and cleaned).")

# --- PARENT-CHILD SIMULATION ---
# Let's say we define a 'Child Chunk' size. If a section is smaller, Child == Parent.
from langchain_text_splitters import RecursiveCharacterTextSplitter

child_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

print("\n--- Simulating Parent-Child Indexing ---")
documents_to_index = []

for p, parent_doc in enumerate(md_header_splits[:5]):  # look at the first 5 sections
    parent_text = parent_doc.page_content
    breadcrumbs = parent_doc.metadata
    
    # Split parent into children
    child_chunks = child_splitter.create_documents([parent_text])
    
    print(f"Parent Section: {breadcrumbs} (Length: {len(parent_text)}) -> {len(child_chunks)} Children")
    
    for i, child in enumerate(child_chunks):
        # This is what goes into Milvus
        record = {
            "vector": "(Embedding of Child Text)",  # Placeholder; no embedding is computed here
            "text": child.page_content,           # Child Text (Search Target)
            "parent_text": parent_text,           # Parent Text (LLM Context)
            "breadcrumbs": str(breadcrumbs),      # Context Path
            "idx": f"Section_{p}_{i}"             # ID: parent index + child index, unique across sections
        }
        documents_to_index.append(record)
        # if i == 0:
        #    print(f"  [Child 1 Preview]: {child.page_content[:50]}...")

Split into 1 Parent Sections based on headers (and cleaned).

--- Simulating Parent-Child Indexing ---
Parent Section: {'Header 1': '- 변상판정 청구사항 서면감사 -'} (Length: 16) -> 1 Children
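
At query time, records like these are searched by child text but answered with parent text. Below is a minimal sketch of that collapse step, assuming search hits are dicts shaped like the records above with an added `score` field. The `score` key, the sample hits, and the `retrieve_parents` helper are hypothetical and not part of Milvus or LangChain.

```python
def retrieve_parents(hits, top_k=3):
    """Collapse child-level hits into unique parent sections, best score first."""
    seen = set()
    parents = []
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        key = (hit["breadcrumbs"], hit["parent_text"])
        if key in seen:
            continue  # another child of the same parent already matched
        seen.add(key)
        parents.append({"breadcrumbs": hit["breadcrumbs"], "context": hit["parent_text"]})
        if len(parents) == top_k:
            break
    return parents

# Hypothetical hits: two children of parent A and one child of parent B
hits = [
    {"text": "child a2", "parent_text": "Parent A", "breadcrumbs": "{'Header 1': 'A'}", "score": 0.88},
    {"text": "child b1", "parent_text": "Parent B", "breadcrumbs": "{'Header 1': 'B'}", "score": 0.75},
    {"text": "child a1", "parent_text": "Parent A", "breadcrumbs": "{'Header 1': 'A'}", "score": 0.91},
]
parents = retrieve_parents(hits)
for parent in parents:
    print(parent["breadcrumbs"], "->", parent["context"])
```

Deduplicating on the parent keeps the LLM context from repeating the same section when several of its children match the query.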


### Step 2: Inspection
Print each parent chunk's metadata to see how it captures the document structure.

In [4]:
for i, doc in enumerate(md_header_splits[:5]):
    print(f"--- Parent Chunk {i} ---")
    print(f"Metadata: {doc.metadata}")
    # print(doc.page_content[:100]) # Commented out to reduce noise

--- Parent Chunk 0 ---
Metadata: {'Header 1': '- 변상판정 청구사항 서면감사 -'}
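
One way to use this metadata downstream is to turn it into a readable path that is prepended to the text sent to the LLM. The sketch below assumes metadata keys of the form `Header N`, as configured in `headers_to_split_on`; the `format_breadcrumbs` and `build_context` helpers and the `[Section: ...]` prefix are illustrative choices, not part of LangChain.

```python
def format_breadcrumbs(metadata, separator=" > "):
    """Join header metadata into a path ordered by header level."""
    keys = sorted(
        (k for k in metadata if k.startswith("Header ")),
        key=lambda k: int(k.split()[-1]),
    )
    return separator.join(metadata[k] for k in keys)

def build_context(metadata, parent_text):
    """Prefix the section text with its header path, if it has one."""
    path = format_breadcrumbs(metadata)
    return f"[Section: {path}]\n{parent_text}" if path else parent_text

# Hypothetical metadata, deliberately out of order to show the sorting
metadata = {"Header 2": "Financial Loss", "Header 1": "Audit Result"}
context = build_context(metadata, "Loss details.")
print(context)
```

Sorting by header level rather than relying on dictionary order keeps the path stable even if the metadata is rebuilt or deserialized in a different order.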
