# Policy Document Retrieval Keeps Missing Key Points? Building a High-Precision Vertical-Domain Retrieval System

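Does your policy document retrieval keep missing key information? In real enterprise workloads this is a common and frustrating problem.  
When the corpus consists of large, complex, strictly hierarchical policies, regulations, or financial reports, conventional retrieval-augmented generation (RAG) systems often fall short.

This article looks at how to combine the theoretical strengths of **HiRAG (Hierarchical RAG, retrieval-augmented generation with hierarchical knowledge)** with the hands-on tooling of **LlamaIndex** to build a high-precision retrieval system for a vertical domain. It targets the two problems that most often block RAG in commercial deployments: answers taken out of context and needle-in-a-haystack lookups.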

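## Limitations of Conventional RAG and the Challenges of Commercial Deployment

When handling complex, highly structured policy documents, regulations, or financial reports, conventional **retrieval-augmented generation (RAG)** systems expose several key problems that seriously limit their use in enterprise scenarios.

### Core Challenges

1. **Context Fragmentation**  
   Conventional RAG usually retrieves over paragraph- or sentence-level embeddings, which cuts the original document's context apart. The model then struggles to recover the full meaning, especially where long-range dependencies or multi-step reasoning are involved.

2. **Lack of Semantic Connection**  
   Concepts in a document are related across several levels and dimensions (in a policy: articles, sub-articles, appendices, implementation rules). Conventional RAG does not model these structures, so the results "see the trees but miss the forest".

3. **Local-Global Knowledge Disconnection**  
   Retrieval typically looks only at the local text fragments most similar to the query and ignores the document's overall structure and logic. This gap invites out-of-context quoting, which can have serious consequences in high-risk settings such as policy interpretation and compliance review.

A RAG system that understands a document's deep structure and fuses knowledge at multiple granularities is therefore essential.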
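The fragmentation problem from point 1 can be reproduced in a few lines. The following is a minimal sketch, assuming a naive fixed-width character splitter; the `naive_chunks` helper and the sample clause are hypothetical and do not come from any library.

```python
def naive_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive fixed-width character windows."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# Hypothetical clause used only for illustration.
clause = (
    "Article 4: Newly certified enterprises receive a one-time reward. "
    "The reward is capped at RMB 300,000 per enterprise."
)

chunks = naive_chunks(clause, size=60)

# The chunk that mentions the amount no longer says which article it belongs to.
amount_chunks = [c for c in chunks if "300,000" in c]
for c in amount_chunks:
    print(repr(c), "| mentions Article 4:", "Article 4" in c)
```

A retriever that returns only the second chunk hands the LLM an amount with no article attached, which is exactly the kind of out-of-context hit the rest of this article tries to avoid.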

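## HiRAG: Building a "Smart Table of Contents" and an "Expert Network"

By introducing a **hierarchical knowledge structure**, HiRAG aims to improve retrieval precision on complex policy documents and addresses the three pain points of conventional RAG described above: context fragmentation, missing semantic connections, and the split between local and global knowledge.

![HiRAG dataset conversion](./15/image.png)

Project repository: https://github.com/hhy-huang/HiRAG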


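### 1. Hierarchical Knowledge Indexing (HiIndex): Building the "Smart Table of Contents"

- Extract entities and relations from the documents and build a **multi-level knowledge graph**.
- Use semantic clustering and summary generation to bridge the semantic islands between low-level entities.
- Turn the document from "unstructured text" into a "structured table of contents".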
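The clustering-and-summary idea can be sketched with toy data. This is an illustration under stated assumptions, not the HiRAG implementation: the 2-D "embeddings" are made-up values, the greedy `cluster` function stands in for whatever clustering method the real system uses, and the joined entity names are a placeholder for an LLM-generated summary.

```python
import math

# Toy 2-D "embeddings" for extracted entities (hypothetical values).
entities = {
    "R&D subsidy": (0.9, 0.1),
    "one-time reward": (0.8, 0.2),
    "inspection": (0.1, 0.9),
    "violation handling": (0.2, 0.8),
}

def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def cluster(entities, threshold=0.9):
    """Greedy clustering: join the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is a list of names; the first item is the seed
    for name, vec in entities.items():
        for members in clusters:
            if cosine(vec, entities[members[0]]) >= threshold:
                members.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# Layer 0 holds the raw entities; layer 1 holds one summary node per cluster.
layer1 = [{"summary": " / ".join(m), "children": m} for m in cluster(entities)]
for node in layer1:
    print(node)
```

Repeating the same step on the layer-1 summaries would produce layer 2, and so on, which is how the "table of contents" gains its levels.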



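### 2. Hierarchical Knowledge Retrieval (HiRetrieval): A Three-Level Retrieval Mechanism

- **Local retrieval**: locate specific entities (for example, a particular policy article).
- **Global retrieval**: identify the macro-level background (for example, the regulatory framework the article belongs to).
- **Bridge retrieval**: build a semantic path between the local and global levels (for example, "cloud computing → AWS → Amazon").

> With the three-level mechanism, the LLM receives both the details and the background, which helps it avoid quoting out of context.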
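Bridge retrieval can be pictured as finding a path through the knowledge graph. The sketch below assumes a hand-written toy graph and uses plain breadth-first search; the graph contents and the `bridge_path` helper are hypothetical, and the real HiRAG path selection is more involved.

```python
from collections import deque

# Hypothetical toy graph: each key lists the nodes it is directly connected to.
graph = {
    "cloud computing": ["AWS", "Azure"],
    "AWS": ["cloud computing", "Amazon"],
    "Azure": ["cloud computing", "Microsoft"],
    "Amazon": ["AWS"],
    "Microsoft": ["Azure"],
}

def bridge_path(graph, start, goal):
    """Breadth-first search for the shortest path from start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # the two nodes are not connected

print(" -> ".join(bridge_path(graph, "cloud computing", "Amazon")))
```

The nodes along the returned path are the "bridge" facts that would be passed to the LLM together with the local and global hits.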



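### 3. Fusing a Knowledge Graph: Upgrading to an "Expert Network"

- **HiRAG**: focuses on precise retrieval inside a single document.
- **Knowledge graph**: connects the logical relations between documents (for example, supersedes, depends on, cites).

Combined, the two yield an "expert-level" retrieval system that understands both structure and logic, improving recall and comprehension in a vertical domain.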
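One lightweight way to represent the cross-document relations listed above is a list of typed edges. The following sketch assumes hypothetical document IDs and a hypothetical `related_documents` helper; it only shows the data shape, not a production graph store.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    source: str  # document that declares the relation
    kind: str    # "supersedes", "depends_on", or "cites"
    target: str  # document being pointed at

# Hypothetical document IDs, for illustration only.
relations = [
    Relation("policy_2024", "supersedes", "policy_2021"),
    Relation("policy_2024", "depends_on", "national_hte_regulation"),
    Relation("impl_rules_2024", "cites", "policy_2024"),
]

def related_documents(doc_id: str, relations: list[Relation]) -> dict[str, list[str]]:
    """Collect documents linked to doc_id, grouped by relation kind and direction."""
    out: dict[str, list[str]] = {}
    for r in relations:
        if r.source == doc_id:
            out.setdefault(r.kind, []).append(r.target)
        elif r.target == doc_id:
            out.setdefault(f"inverse_{r.kind}", []).append(r.source)
    return out

print(related_documents("policy_2024", relations))
```

After a hit inside one document, a lookup like this tells the pipeline which other documents to pull in, for example the regulation that a policy depends on.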

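## HiRAG in Practice: Implementation Under the LlamaIndex Framework

Among mainstream RAG frameworks, **LlamaIndex** offers one of the most direct routes to a HiRAG-style pipeline: its built-in **AutoMergingRetriever** component supports hierarchical (parent-child) retrieval out of the box.

### Implementing HiRAG in Three Steps

Whichever framework you use, the core HiRAG workflow consists of the following three steps:

### 1. Hierarchical Parsing

- **Goal**: preserve the document's natural structure (chapters, articles, paragraphs).
- **Method**: use `HierarchicalNodeParser` to split the document into **parent nodes** and **child nodes**.
- **Benefit**: avoids destroying the semantic hierarchy the way flat chunking does, and keeps context more complete.
- **Best practices**:
  - Derive the levels from structural cues such as headings, numbering, and indentation.
  - Set a different `chunk_sizes` value per level, for example chapters as parent nodes and individual articles as child nodes.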

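### 2. Two-Step Retrieval

- **Step 1: child-node retrieval**  
  Run vector search over the child nodes to find the most relevant **details**.

- **Step 2: parent-node association**  
  Automatically attach the matching parent nodes to supply the surrounding context and avoid quoting out of context.

- **Tooling**: `AutoMergingRetriever` automatically merges child nodes with their parent to build a complete context.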
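Before bringing in a vector store, the two steps can be shown with a toy relevance score. This is a minimal sketch, assuming word overlap as a stand-in for embedding similarity; the `overlap` and `two_step_retrieve` helpers and the sample store are hypothetical.

```python
def overlap(query: str, text: str) -> float:
    """Fraction of query words that also appear in text (toy relevance score)."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

# Hypothetical parent/child store: each child carries a pointer to its parent.
parents = {"ch2": "Chapter 2: High-Tech Enterprise Certification and Support"}
children = [
    {"id": "a3", "parent": "ch2", "text": "applications open from july to september"},
    {"id": "a4", "parent": "ch2", "text": "a one-time reward of rmb 300,000 is granted"},
]

def two_step_retrieve(query: str, k: int = 1):
    # Step 1: rank the child nodes only.
    hits = sorted(children, key=lambda c: overlap(query, c["text"]), reverse=True)[:k]
    # Step 2: attach each hit's parent as surrounding context.
    return [(parents[h["parent"]], h["text"]) for h in hits]

print(two_step_retrieve("how large is the one-time reward"))
```

Searching the small nodes keeps the match precise, while the parent lookup restores the chapter the match came from.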

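### 3. Augmented Generation

- **Input**: the precise child nodes plus the complete parent nodes, combined as context.
- **Output**: LLM answers that are more accurate, more complete, and more logically coherent.
- **Prompting tip**: design the prompt so that it steers the model to use the multi-level information when answering.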
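The prompting tip can be made concrete with a small template. The sketch below is one possible layout, not a prescribed format; the `build_prompt` helper and the section labels are assumptions chosen for illustration.

```python
def build_prompt(question: str, parent_text: str, child_texts: list[str]) -> str:
    """Assemble a prompt that separates broad context from specific evidence."""
    evidence = "\n".join(f"- {t.strip()}" for t in child_texts if t.strip())
    return (
        "Answer using ONLY the material below.\n"
        "Use the EVIDENCE section for specific figures and the CONTEXT section "
        "for scope and conditions.\n"
        "If the answer is not present, say you don't know.\n\n"
        f"[CONTEXT]\n{parent_text.strip()}\n\n"
        f"[EVIDENCE]\n{evidence}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    question="How large is the one-time reward?",
    parent_text="Chapter 2: High-Tech Enterprise Certification and Support",
    child_texts=[
        "Newly certified national high-tech enterprises will receive "
        "a one-time reward of RMB 300,000."
    ],
)
print(prompt)
```

Labeling the two levels separately lets the instructions tell the model what each level is for, instead of leaving it to guess from one undifferentiated block.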

### LlamaIndex Hands-On: Brief Steps

### 4. **Install Dependencies**

In [None]:
%pip install llama-index==0.12.44
%pip install pydantic
%pip install llama-index-llms-openai 
%pip install llama-index-embeddings-huggingface 
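%pip install torch transformers
# Packages imported by the LangChain/Weaviate cells below
%pip install langchain-core langchain-openai langchain-weaviate weaviate-client sentence-transformers python-dotenv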

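### 5. Prepare the Document and Parse It Hierarchically
This is the most important step. In LlamaIndex it is handled by `HierarchicalNodeParser`; the cell below reproduces the same parent/child split in LangChain with explicit regular expressions, treating chapters as parent nodes and articles as child nodes.
To simulate a policy document, we create a string containing several chapters.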

In [2]:
import re
from typing import List, Dict, Any
from langchain_core.documents import Document

# -----------------------------------------------------------------------------
# 1) Same policy text (English) wrapped as LangChain Documents
# -----------------------------------------------------------------------------

policy_text = """
# Chapter 1: General Provisions and Objectives

## Article 1: Policy Background and Objectives

This policy aims to thoroughly implement the national innovation-driven development strategy
and accelerate the construction of a globally influential science and technology innovation center.
Through a comprehensive set of fiscal, taxation, talent, and financial support measures,
the policy seeks to stimulate enterprise innovation, promote industrial structure optimization
and upgrading, and enhance the core competitiveness of the regional economy.
Special emphasis is placed on supporting **high-tech enterprises**, helping them achieve
breakthroughs in key core technologies.

## Article 2: Scope of Application

This policy applies to all types of enterprises that are legally registered within the administrative
region of City XX and possess independent legal entity status, as well as qualified high-level talent
working in City XX.
Priority support is given to enterprises operating in strategic emerging industries such as
new energy, artificial intelligence, biomedicine, and advanced manufacturing.

# Chapter 2: High-Tech Enterprise Certification and Support

## Article 3: Certification Standards and Procedures

Enterprises that meet the requirements of the national high-tech enterprise certification
management regulations may apply for certification in accordance with the prescribed procedures.
Certification criteria include, but are not limited to: ownership of core independent intellectual
property rights, products or services falling within the “National Key Supported High-Tech Fields,”
the proportion of R&D expenditure, and the proportion of revenue from high-tech products or services.
The centralized application period runs from July to September each year. Enterprises must submit
application materials through the official platform of the Municipal Science and Technology Bureau.

## Article 4: Financial Subsidies and Incentives

Newly certified national high-tech enterprises will receive a one-time reward of **RMB 300,000**.
Enterprises passing national high-tech certification for the first time may additionally receive
an R&D investment subsidy of up to **RMB 500,000**, with the specific amount determined based on
R&D intensity and output performance.
    * **4.1 R&D Investment Subsidy Details**: The subsidy is calculated as 10% of the enterprise’s
      R&D expenditure in the previous year, capped at RMB 500,000. It is primarily used for
      purchasing R&D equipment, paying R&D personnel salaries, and outsourced R&D activities.
    * **4.2 Disbursement of Incentive Funds**: Incentive funds will be disbursed to the enterprise’s
      designated account within 30 working days after the certification results are announced.
    * **4.3 Priority Loan Support**: Certified high-tech enterprises may access low-interest loans
      provided by partner banks, with a maximum credit line of RMB 50 million.

## Article 5: Talent Recruitment and Development Support

High-tech enterprises are encouraged to recruit high-level talent. Eligible individuals may receive
benefits such as housing subsidies and priority access to education for their children.
Special funds are established to encourage enterprises to conduct technical talent training and
enhance employees’ professional skills.

# Chapter 3: Supervision, Management, and Violations

## Article 6: Supervision and Inspection

The Municipal Science and Technology Bureau, the Finance Bureau, and other relevant departments will
conduct regular inspections of enterprises receiving policy support to ensure compliance and
maximize policy effectiveness.
Enterprises must cooperate with inspections and provide relevant materials truthfully.

## Article 7: Handling of Violations

Enterprises that falsely claim, fraudulently obtain, withhold, or misappropriate fiscal funds will
be required to return all subsidies in accordance with the law and will be disqualified from applying
for any municipal policy support for the next three years.
Serious cases will be referred to judicial authorities.
"""

docs = [Document(page_content=policy_text, metadata={"source": "policy_demo"})]

# -----------------------------------------------------------------------------
# 2) "Hierarchical parsing" in LangChain (simple, explicit, no special parser)
#    We create:
#      - Parent nodes: Chapters
#      - Child nodes: Articles within each chapter
# -----------------------------------------------------------------------------

CHAPTER_RE = re.compile(r"(?m)^#\s+(.*)$")
ARTICLE_RE = re.compile(r"(?m)^##\s+(.*)$")

def split_by_pattern(text: str, header_re: re.Pattern) -> List[Dict[str, Any]]:
    """
    Split text into sections by a markdown header regex.
    Returns a list of {"title": str, "content": str}.
    """
    matches = list(header_re.finditer(text))
    if not matches:
        return [{"title": "ROOT", "content": text}]

    sections = []
    for i, m in enumerate(matches):
        start = m.start()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        title = m.group(1).strip()
        content = text[start:end].strip()
        sections.append({"title": title, "content": content})
    return sections

parent_docs: List[Document] = []
child_docs: List[Document] = []

# Build parent (chapter) docs
chapters = split_by_pattern(policy_text, CHAPTER_RE)

for ci, ch in enumerate(chapters, 1):
    parent_id = f"chapter_{ci}"
    parent_doc = Document(
        page_content=ch["content"],
        metadata={
            "node_type": "parent",
            "parent_id": parent_id,
            "title": ch["title"],
            "level": 1,
        },
    )
    parent_docs.append(parent_doc)

    # Build child (article) docs within each chapter
    articles = split_by_pattern(ch["content"], ARTICLE_RE)

    # If no "##" headers, treat the whole chapter as one child
    if len(articles) == 1 and articles[0]["title"] == "ROOT":
        articles = [{"title": "Chapter body", "content": ch["content"]}]

    for ai, art in enumerate(articles, 1):
        child_id = f"{parent_id}_article_{ai}"
        child_doc = Document(
            page_content=art["content"],
            metadata={
                "node_type": "child",
                "child_id": child_id,
                "parent_id": parent_id,   # link child -> parent
                "title": art["title"],
                "level": 2,
            },
        )
        child_docs.append(child_doc)

print(f"Total parent nodes (chapters): {len(parent_docs)}")
print(f"Total child nodes (articles): {len(child_docs)}")

# -----------------------------------------------------------------------------
# 3) Print an example parent + its child nodes
# -----------------------------------------------------------------------------

# Build parent_id -> list(child_docs)
children_by_parent: Dict[str, List[Document]] = {}
for cd in child_docs:
    children_by_parent.setdefault(cd.metadata["parent_id"], []).append(cd)

# Print the first parent with children
for pd in parent_docs:
    pid = pd.metadata["parent_id"]
    kids = children_by_parent.get(pid, [])
    if kids:
        parent_preview = pd.page_content[:80].replace("\n", " ")

        print("\nExample parent node:")
        print(f"  parent_id={pid}, title={pd.metadata.get('title')}, text_len={len(pd.page_content)}")
        print(f"  parent preview: {parent_preview}...")

        print(f"  #children={len(kids)}")
        for i, cd in enumerate(kids[:5], 1):
            child_preview = cd.page_content[:80].replace("\n", " ")
            print(f"    child {i}: child_id={cd.metadata['child_id']}, title={cd.metadata.get('title')}, text_len={len(cd.page_content)}")
            print(f"      child preview: {child_preview}...")
        break


Total parent nodes (chapters): 3
Total child nodes (articles): 7

Example parent node:
  parent_id=chapter_1, title=Chapter 1: General Provisions and Objectives, text_len=1109
  parent preview: # Chapter 1: General Provisions and Objectives  ## Article 1: Policy Background ...
  #children=2
    child 1: child_id=chapter_1_article_1, title=Article 1: Policy Background and Objectives, text_len=630
      child preview: ## Article 1: Policy Background and Objectives  This policy aims to thoroughly i...
    child 2: child_id=chapter_1_article_2, title=Article 2: Scope of Application, text_len=429
      child preview: ## Article 2: Scope of Application  This policy applies to all types of enterpri...


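Code explanation:
- The cell above builds the hierarchy by hand. In LlamaIndex, the core component for hierarchical document parsing is `HierarchicalNodeParser`.
- Its key parameter is `chunk_sizes`, which defines the chunk size at each level from top to bottom. For example, `[2048, 512, 128]` first splits the document into chunks of at most 2048 tokens (the parent nodes), then splits each parent into chunks of at most 512 tokens (intermediate nodes), and finally splits each intermediate node into chunks of at most 128 tokens (leaf nodes). This layered split preserves the document's semantic context.
- `get_nodes_from_documents(docs)` performs the actual parsing and returns a list containing the nodes of every level together with their parent-child relationships.
- `get_leaf_nodes(nodes)` filters out the lowest-level leaf nodes. When the vector index is built later, usually only these finest-grained leaf nodes are embedded and indexed, because they hold the most direct answer fragments.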
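To show what a top-down `chunk_sizes` split produces, here is a simplified stand-in written in plain Python. It is an assumption-laden sketch: it counts characters instead of tokens and ignores sentence boundaries, and the `Node`, `hierarchical_split`, and `leaf_nodes` names are hypothetical rather than LlamaIndex APIs.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    node_id: int
    text: str
    level: int                      # 0 = largest chunks
    parent_id: Optional[int] = None
    child_ids: list[int] = field(default_factory=list)

def hierarchical_split(text: str, chunk_sizes: list[int]) -> list[Node]:
    """Split text top-down; each level re-splits the chunks of the level above."""
    nodes: list[Node] = []

    def split(segment: str, level: int, parent: Optional[Node]) -> None:
        size = chunk_sizes[level]
        for start in range(0, len(segment), size):
            node = Node(len(nodes), segment[start:start + size], level,
                        parent.node_id if parent else None)
            nodes.append(node)
            if parent:
                parent.child_ids.append(node.node_id)
            if level + 1 < len(chunk_sizes):
                split(node.text, level + 1, node)

    split(text, 0, None)
    return nodes

def leaf_nodes(nodes: list[Node]) -> list[Node]:
    """Leaves are the nodes without children."""
    return [n for n in nodes if not n.child_ids]

nodes = hierarchical_split("x" * 1000, chunk_sizes=[400, 100, 25])
leaves = leaf_nodes(nodes)
print(len(nodes), len(leaves))  # 53 nodes in total, 40 of them leaves
```

Every leaf keeps a `parent_id`, so a retriever that matches a leaf can walk back up to the larger chunk it was cut from.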

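### 6. Build the Index and Storage
The information for all nodes (parents and children) has to be stored. In LlamaIndex, `AutoMergingRetriever` reads the parent-child relationships from the storage context; in the LangChain version below, only the child nodes are embedded into Weaviate, and the parents are kept in a `parent_map` dictionary built in a later cell.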

In [4]:
import os
from dotenv import load_dotenv
load_dotenv()

True

In [5]:
# -----------------------------------------------------------------------------
# Cell 1: Vector index ONLY on leaf/child nodes (LangChain + Weaviate)
# -----------------------------------------------------------------------------

from langchain_openai import OpenAIEmbeddings
from langchain_weaviate import WeaviateVectorStore
import weaviate
from weaviate.connect import ConnectionParams


embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

client = weaviate.WeaviateClient(
    connection_params=ConnectionParams.from_params(
        http_host="127.0.0.1",
        http_port=8080,
        http_secure=False,
        grpc_host="127.0.0.1",
        grpc_port=50051,
        grpc_secure=False,
    )
)
client.connect()

# IMPORTANT: index only child_docs (leaf nodes)
vectorstore = WeaviateVectorStore.from_documents(
    documents=child_docs,          # <-- leaf nodes only
    embedding=embeddings,
    client=client,
    index_name="PolicyLeafNodes",
    text_key="text",
)

print("✅ Vector index built. Leaf/child nodes have been embedded and stored in Weaviate.")


✅ Vector index built. Leaf/child nodes have been embedded and stored in Weaviate.


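Code explanation:
- `ServiceContext` was the LlamaIndex object that bundled the core components of a RAG pipeline: the LLM, the embedding model, and the node parser. It has been deprecated since LlamaIndex 0.10 in favor of the global `Settings` object, so it should not be used with the `llama-index==0.12.44` version pinned above. In the cell above, the embedding model is passed to the vector store directly.
- `StorageContext` manages data storage in LlamaIndex. The call `storage_context.docstore.add_documents(nodes)` is essential there: it stores the nodes of every level (parent, intermediate, and leaf) in the document store, which `AutoMergingRetriever` later queries for parent-child relationships. The LangChain version covers this role with `parent_map`.
- `VectorStoreIndex(leaf_nodes, ...)`: in LlamaIndex, the index is initialized with `leaf_nodes` only, so only the finest-grained leaf nodes are converted to vectors and stored in the vector database. The cell above does the same by passing only `child_docs` to `WeaviateVectorStore.from_documents`. The rationale is that a user query usually matches the most specific text fragments (the leaf nodes) best.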

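### 7. Configure and Use AutoMergingRetriever
Now we replace the plain retriever with auto-merging retrieval. The cell below defines `auto_merge_retrieve`, a LangChain-style stand-in for LlamaIndex's `AutoMergingRetriever`.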

In [6]:
from typing import List, Dict
from langchain_core.documents import Document

def auto_merge_retrieve(
    query: str,
    vectorstore,
    parent_map: Dict[str, Document],
    top_k: int = 5,
    verbose: bool = True,
) -> List[Document]:
    """
    LangChain-style equivalent of LlamaIndex AutoMergingRetriever.

    1) Retrieve top_k leaf nodes from the vectorstore
    2) Group by parent_id
    3) If multiple children share a parent, return merged parent context
       (parent + concatenated child snippets)
    4) Otherwise return (parent + single child) merged context
    """
    base_retriever = vectorstore.as_retriever(search_kwargs={"k": top_k})
    leaf_hits: List[Document] = base_retriever.invoke(query)

    if verbose:
        print(f"[auto_merge] query='{query}'")
        print(f"[auto_merge] retrieved leaf_hits={len(leaf_hits)}")

    # Group retrieved leaf nodes by parent_id
    grouped: Dict[str, List[Document]] = {}
    for d in leaf_hits:
        pid = (d.metadata or {}).get("parent_id")
        grouped.setdefault(pid, []).append(d)

    merged_results: List[Document] = []

    for pid, children in grouped.items():
        parent = parent_map.get(pid)

        # Build merged text
        child_block = "\n\n".join(
            f"- {c.page_content.strip()}" for c in children if (c.page_content or "").strip()
        ).strip()

        if parent:
            merged_text = (
                f"[PARENT CONTEXT]\n{parent.page_content.strip()}\n\n"
                f"[CHILD EVIDENCE - {len(children)} hit(s)]\n{child_block}"
            )
        else:
            # If parent not found, fall back to only child evidence
            merged_text = f"[CHILD EVIDENCE - {len(children)} hit(s)]\n{child_block}"

        if verbose:
            title = parent.metadata.get("title") if parent and parent.metadata else None
            print(f"[auto_merge] parent_id={pid} title={title} children={len(children)}")

        merged_results.append(
            Document(
                page_content=merged_text,
                metadata={
                    "parent_id": pid,
                    "num_children_merged": len(children),
                    "merged_parent_found": bool(parent),
                },
            )
        )

    return merged_results

In [8]:
# -----------------------------------------------------------------------------
# Build parent_map (LangChain replacement for LlamaIndex docstore)
# -----------------------------------------------------------------------------

# parent_docs must already exist from your hierarchical parsing step
# Each parent Document must have metadata["parent_id"]

parent_map = {
    p.metadata["parent_id"]: p
    for p in parent_docs
}

query = "What is the one-time reward for newly certified high-tech enterprises?"

merged_docs = auto_merge_retrieve(
    query=query,
    vectorstore=vectorstore,
    parent_map=parent_map,
    top_k=5,
    verbose=True,
)

print("\n--- Merged results ---")
for i, d in enumerate(merged_docs, 1):
    print("-" * 80)
    print(f"[{i}] parent_id={d.metadata.get('parent_id')} children={d.metadata.get('num_children_merged')}")
    print(d.page_content[:600] + "...")

[auto_merge] query='What is the one-time reward for newly certified high-tech enterprises?'
[auto_merge] retrieved leaf_hits=5
[auto_merge] parent_id=chapter_2 title=Chapter 2: High-Tech Enterprise Certification and Support children=3
[auto_merge] parent_id=chapter_1 title=Chapter 1: General Provisions and Objectives children=2

--- Merged results ---
--------------------------------------------------------------------------------
[1] parent_id=chapter_2 children=3
[PARENT CONTEXT]
# Chapter 2: High-Tech Enterprise Certification and Support

## Article 3: Certification Standards and Procedures

Enterprises that meet the requirements of the national high-tech enterprise certification
management regulations may apply for certification in accordance with the prescribed procedures.
Certification criteria include, but are not limited to: ownership of core independent intellectual
property rights, products or services falling within the “National Key Supported High-Tech Fields,”
the proporti

In [10]:
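# 5. Build the query engine
# In LlamaIndex, RetrieverQueryEngine combines a retriever with optional post-processors
# and passes the retrieved results to the LLM for answer generation.
# Best practice: add a re-ranker.
# Even when vector search finds similar nodes, their relevance may still need re-ordering.
# SentenceTransformerRerank uses a separate model (e.g. BAAI/bge-reranker-base) to re-score
# the retrieved nodes and keep the top_n most relevant ones, which can improve precision.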

# -----------------------------------------------------------------------------
# Cell 1: Reranker + LLM (LangChain)
# -----------------------------------------------------------------------------

from typing import List, Tuple
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI
from sentence_transformers import CrossEncoder

# Cross-encoder reranker (BAAI/bge-reranker-base)
rerank_model = CrossEncoder("BAAI/bge-reranker-base")
TOP_N = 3

# LLM for answer generation
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

print("✅ CrossEncoder reranker + LLM ready")



✅ CrossEncoder reranker + LLM ready


In [11]:
# -----------------------------------------------------------------------------
# Cell 2: rerank + query_engine (LangChain equivalent of RetrieverQueryEngine)
# -----------------------------------------------------------------------------

def rerank_docs(query: str, docs: List[Document], top_n: int = TOP_N) -> List[Document]:
    """
    Re-score (query, doc) pairs with a cross-encoder and keep top_n docs.
    Stores score in doc.metadata["rerank_score"].
    """
    if not docs:
        return []

    pairs: List[Tuple[str, str]] = [(query, d.page_content or "") for d in docs]
    scores = rerank_model.predict(pairs)  # higher = more relevant for bge-reranker

    scored: List[Tuple[Document, float]] = []
    for d, s in zip(docs, scores):
        d.metadata = d.metadata or {}
        d.metadata["rerank_score"] = float(s)
        scored.append((d, float(s)))

    scored.sort(key=lambda x: x[1], reverse=True)
    return [d for d, _ in scored[:top_n]]


def build_context(docs: List[Document], max_chars: int = 6000) -> str:
    """
    Concatenate top docs into a single context string.
    """
    blocks = []
    total = 0
    for i, d in enumerate(docs, 1):
        text = (d.page_content or "").strip()
        if not text:
            continue
        block = f"[Doc {i}]\n{text}\n"
        if total + len(block) > max_chars:
            break
        blocks.append(block)
        total += len(block)
    return "\n".join(blocks)


def query_engine(query: str, retriever, retrieve_top_k: int = 5, rerank_top_n: int = TOP_N) -> str:
    """
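    LangChain-style query engine:
      1) retrieve documents
      2) rerank with CrossEncoder
      3) generate answer with ChatOpenAI

    Note: retrieve_top_k is currently unused; the retriever's own top_k
    decides how many leaf nodes are fetched.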
    """
    # 1) retrieve
    docs = retriever.invoke(query)

    # 2) rerank
    top_docs = rerank_docs(query, docs, top_n=rerank_top_n)

    # 3) generate
    context = build_context(top_docs)
    prompt = (
        "You are a precise assistant. Answer using ONLY the provided context.\n"
        "If the answer is not in the context, say you don't know.\n\n"
        f"Question:\n{query}\n\n"
        f"Context:\n{context}\n\n"
        "Answer:"
    )

    return llm.invoke(prompt).content


print("✅ rerank_docs() and query_engine() defined")

✅ rerank_docs() and query_engine() defined


In [12]:
from typing import Tuple, List

def query_engine_with_sources(
    query: str,
    retriever,
    retrieve_top_k: int = 5,
    rerank_top_n: int = TOP_N,
) -> Tuple[str, List[Document]]:
    # 1) retrieve (retrieve_top_k is currently unused; the retriever's own top_k applies)
    docs = retriever.invoke(query)

    # 2) rerank
    top_docs = rerank_docs(query, docs, top_n=rerank_top_n)

    # 3) generate
    context = build_context(top_docs)
    prompt = (
        "You are a precise assistant. Answer using ONLY the provided context.\n"
        "If the answer is not in the context, say you don't know.\n\n"
        f"Question:\n{query}\n\n"
        f"Context:\n{context}\n\n"
        "Answer:"
    )
    answer = llm.invoke(prompt).content
    return answer, top_docs


In [14]:
class MergedRetriever:
    def __init__(self, vectorstore, parent_map, top_k: int = 5, verbose: bool = True):
        self.vectorstore = vectorstore
        self.parent_map = parent_map
        self.top_k = top_k
        self.verbose = verbose

    def invoke(self, query: str):
        return auto_merge_retrieve(
            query=query,
            vectorstore=self.vectorstore,
            parent_map=self.parent_map,
            top_k=self.top_k,
            verbose=self.verbose,
        )

merged_retriever = MergedRetriever(
    vectorstore=vectorstore,
    parent_map=parent_map,
    top_k=5,
    verbose=True,
)

print("✅ merged_retriever is ready")

✅ merged_retriever is ready


In [15]:
query = "What is the exact subsidy amount and what conditions must be met?"
answer, source_docs = query_engine_with_sources(query, merged_retriever)

print(answer)
print("\n--- Sources ---")
for d in source_docs:
    print("parent_id:", d.metadata.get("parent_id"))
    print("rerank_score:", d.metadata.get("rerank_score"))
    print(d.page_content[:300], "...\n")


[auto_merge] query='What is the exact subsidy amount and what conditions must be met?'
[auto_merge] retrieved leaf_hits=5
[auto_merge] parent_id=chapter_2 title=Chapter 2: High-Tech Enterprise Certification and Support children=2
[auto_merge] parent_id=chapter_3 title=Chapter 3: Supervision, Management, and Violations children=2
[auto_merge] parent_id=chapter_1 title=Chapter 1: General Provisions and Objectives children=1
The exact subsidy amount is **RMB 300,000** for newly certified national high-tech enterprises, and an additional R&D investment subsidy of up to **RMB 500,000** may be available, determined by R&D intensity and output performance. The conditions that must be met include passing the national high-tech enterprise certification and having R&D expenditure in the previous year, as the R&D investment subsidy is calculated as 10% of that expenditure, capped at RMB 500,000.

--- Sources ---
parent_id: chapter_2
rerank_score: 0.04834342002868652
[PARENT CONTEXT]
# Chapter 2: 

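Code explanation:
- `AutoMergingRetriever` is the LlamaIndex component that implements the core HiRAG-style behavior. It takes a base retriever (a retriever over the vector index built from leaf nodes) and a `storage_context` that holds the parent-child relationships of all nodes.
- When `AutoMergingRetriever` runs, it first uses the underlying retriever to find relevant leaf nodes. It then checks their parents: if a large enough share of a parent's children was retrieved (controlled by the `simple_ratio_thresh` parameter), it "merges" them and returns the parent's content instead of the scattered leaf nodes. The `auto_merge_retrieve` function above is simpler: it always attaches the parent text to the retrieved children.
- `verbose=True` is very useful during development and debugging: it prints the internal merge log so you can see how nodes are combined.
- `SentenceTransformerRerank` is the LlamaIndex post-processor for re-ordering retrieved documents; the code above plays the same role with `CrossEncoder` from `sentence-transformers`. Vector similarity has already selected a batch of relevant documents, but they may not be in the best order. The re-ranker scores each (query, document) pair with a cross-encoder and keeps the documents most relevant to the query, which further improves precision.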
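The ratio-based merge rule can be sketched without any framework. The following is a simplified illustration, not LlamaIndex's code: `merge_by_ratio` and the article IDs are hypothetical, and the default of 0.5 reflects my understanding of `simple_ratio_thresh` at the time of writing.

```python
from collections import defaultdict

def merge_by_ratio(hit_ids, parent_of, children_of, ratio_thresh=0.5):
    """Replace sibling hits with their parent once enough siblings were retrieved."""
    by_parent = defaultdict(list)
    for cid in hit_ids:
        by_parent[parent_of[cid]].append(cid)

    result = []
    for pid, hits in by_parent.items():
        ratio = len(hits) / len(children_of[pid])
        if ratio > ratio_thresh:
            result.append(pid)    # enough coverage: return the parent instead
        else:
            result.extend(hits)   # too sparse: keep the individual leaves
    return result

# Hypothetical layout mirroring the demo policy: 2 + 3 + 2 articles.
children_of = {
    "chapter_1": ["art_1", "art_2"],
    "chapter_2": ["art_3", "art_4", "art_5"],
    "chapter_3": ["art_6", "art_7"],
}
parent_of = {c: p for p, cs in children_of.items() for c in cs}

print(merge_by_ratio(["art_3", "art_4", "art_1"], parent_of, children_of))
# → ['chapter_2', 'art_1']
```

Two of Chapter 2's three articles were hit (ratio 0.67), so the whole chapter is returned; Chapter 1 has only one of two (ratio 0.5, not above the threshold), so its single article stays as a leaf. A threshold like this keeps the context window from filling up with parents that were barely matched.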

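Comparison:
- Plain retriever: for the query "What is the exact subsidy amount for applying as a high-tech enterprise?", it may return only an isolated sentence such as "...the subsidy amount is RMB 500,000...". The reader is left wondering: for which city? Under which year's policy? Subject to what preconditions?
- HiRAG (AutoMergingRetriever): returns a larger block, for example "Chapter 2: High-Tech Enterprise Certification and Support ... Article 4: Financial Subsidies and Incentives ... Enterprises passing national high-tech certification for the first time may additionally receive an R&D investment subsidy of up to RMB 500,000 ... 4.1 R&D Investment Subsidy Details: The subsidy is calculated as 10% of the enterprise's R&D expenditure in the previous year, capped at RMB 500,000. This policy applies to ... legally registered within the administrative region of City XX ...". The context is visible at a glance, which directly addresses the "missed key points" problem: the answer now carries the amount, the calculation method, and the scope of application.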

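## Summary
The walkthrough above shows how the approach behind LlamaIndex's `AutoMergingRetriever` uses hierarchical node parsing and a merge strategy to combine low-level details with higher-level context. The result is better precision and contextual coherence in vertical-domain document retrieval, which addresses a core pain point of commercial RAG.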