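# 03 Keep Mixing Up Domain Terms? Build a Precise Glossary to Improve Retrieval Consistency

When building a RAG system, terminology confusion directly degrades both retrieval precision and the quality of the generated content.

It stems mainly from two sources:
- Vector representations: a general-purpose embedding model can place a term and its abbreviation far apart, or place unrelated but similar-looking terms close together
- Across industries, companies, and even teams within one organization, similar words can carry entirely different meanings

Together these factors pull retrieval results away from what the user meant and noticeably lower answer quality.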

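# 1. Building and Maintaining a Terminology Glossary (Glossary Management)
## 1.1 Why Terminology Confusion Arises

- Polysemy: one term with several meanings
- Synonyms and near-synonyms
- Differences between domains
- Company-specific terminology

## 1.2 Goal of the Glossary
The glossary is the core infrastructure of the whole terminology-consistency effort.

## 1.3 Glossary Construction Workflow
- Step 1: Collect term sources
- Step 2: Standardize terms
- Step 3: Build alias mappings
- Step 4: Add context information
- Step 5: Build the term index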


A well-rounded glossary should include the following key fields so that it stays structured and actionable:

| Field | Example content |
|-------|-----------------|
| Term | Neural Network |
| Synonyms | ["Artificial Neural Network", "NN"] |
| Definition | A neural network is a computational model that mimics the structure and function of biological neural networks… |
| Context Tags | ["Artificial Intelligence", "Deep Learning", "Computer Science"] |
| Domain | Artificial Intelligence |
| Usage Example | In the image recognition task, we used a convolutional neural network. |
| External Link | [Wikipedia](https://en.wikipedia.org/wiki/Artificial_neural_network) |
| Stop Words / Misleading Terms | ["Nervous System" (a different concept in medicine)] |
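
As a rough sketch, the fields above could be modeled as a small dataclass. The class name `GlossaryEntry`, the attribute names, and the `matches` helper are hypothetical and chosen only for illustration; they are not part of any library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GlossaryEntry:
    """One glossary record; attribute names mirror the table above (hypothetical schema)."""
    term: str
    definition: str
    domain: str
    synonyms: List[str] = field(default_factory=list)
    context_tags: List[str] = field(default_factory=list)
    usage_example: str = ""
    external_link: str = ""
    misleading_terms: List[str] = field(default_factory=list)

    def matches(self, surface_form: str) -> bool:
        """True if surface_form is the term or one of its synonyms (case-insensitive)."""
        known = {self.term.lower(), *(s.lower() for s in self.synonyms)}
        return surface_form.lower() in known


entry = GlossaryEntry(
    term="Neural Network",
    definition="A computational model that mimics biological neural networks.",
    domain="Artificial Intelligence",
    synonyms=["Artificial Neural Network", "NN"],
    context_tags=["Artificial Intelligence", "Deep Learning", "Computer Science"],
    misleading_terms=["Nervous System"],
)
print(entry.matches("nn"))              # alias lookup ignores case
print(entry.matches("Nervous System"))  # misleading terms are not treated as aliases
```

Keeping misleading terms in a separate list, rather than mixing them into the synonyms, lets downstream code reject them explicitly.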

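## 1.4 Integrating the Glossary with RAG

- Option 1: Replace terms during preprocessing
- Option 2: Retrieval augmentation
- Option 3: Re-ranking
- Option 4: Post-processing explanations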
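
To make the re-ranking option concrete, here is a minimal sketch that orders retrieved passages by how many glossary terms they share with the query. The function names, the scoring rule (count of shared terms), and the sample data are assumptions for illustration only.

```python
from typing import List, Set


def find_terms(text: str, terms: Set[str]) -> Set[str]:
    """Return the glossary terms that occur in text (case-insensitive substring match)."""
    lowered = text.lower()
    return {t for t in terms if t.lower() in lowered}


def rerank_by_term_overlap(query: str, passages: List[str], terms: Set[str]) -> List[str]:
    """Order passages by the number of glossary terms they share with the query."""
    query_terms = find_terms(query, terms)
    # sorted() is stable, so passages with equal overlap keep their retrieval order.
    return sorted(
        passages,
        key=lambda p: len(find_terms(p, terms) & query_terms),
        reverse=True,
    )


glossary_terms = {"Convolutional Neural Network", "Image Recognition"}
retrieved = [
    "Transformers were first applied to machine translation.",
    "A Convolutional Neural Network is widely used for Image Recognition.",
    "Image Recognition assigns labels to pictures.",
]
query = "How does a Convolutional Neural Network help Image Recognition?"
reranked = rerank_by_term_overlap(query, retrieved, glossary_terms)
for passage in reranked:
    print(passage)
```

In practice this overlap score would usually be combined with the retriever's own similarity score rather than replacing it.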

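## 1.5 Maintaining the Glossary
1. Design the glossary structure
This is the foundation: decide which fields the glossary needs and how they relate to each other.

2. Extract term candidates automatically
Use NLP tools to identify and extract potential terms from large volumes of text.

3. Expert review and refinement
Domain experts review, correct, and supplement the automatically extracted terms to ensure accuracy and domain fidelity.

4. Build a term relationship graph
Where needed, model the hierarchical and associative relations between terms as an ontology or knowledge graph to improve semantic understanding.

5. Set up version control and an update process
Version the glossary and update it on a regular schedule so that it stays current and authoritative as new terms appear and existing terms change meaning.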
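
One lightweight way to support the versioning step is to diff two glossary snapshots before publishing a new version. The sketch below assumes each snapshot is a dict mapping a term to its definition; `diff_glossaries` and the sample versions are hypothetical.

```python
from typing import Dict, List


def diff_glossaries(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two snapshots (term -> definition) and list added, removed, and changed terms."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(t for t in set(old) & set(new) if old[t] != new[t])
    return {"added": added, "removed": removed, "changed": changed}


v1 = {
    "CPU": "Central Processing Unit",
    "NN": "Neural Network",
}
v2 = {
    "CPU": "Central Processing Unit; in finance documents, Cost per Unit",
    "LLM": "Large Language Model",
}
changelog = diff_glossaries(v1, v2)
print(changelog)
```

A changelog like this gives reviewers a short list to approve and makes it clear which indexed documents may need re-processing.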

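| Stage | Techniques | Description |
|-------|------------|-------------|
| 1. Data preprocessing | Term extraction, normalization, context-aware chunking | Before raw documents and queries enter the RAG system, identify and extract domain terms, normalize them, and chunk text so that each term keeps its surrounding context. |
| 2. Glossary construction | Glossary design, term relationship modeling, version management | Build a structured glossary with fields such as term, aliases, definition, and context tags. Optionally model hierarchical or associative relations between terms (e.g. an ontology), and set up version control and an update process. |
| 3. Embedding and vectorization | Term vector index, fine-tuned domain embedding model | Convert canonical terms and aliases into vectors and build an efficient vector index (e.g. FAISS). In addition, adapt a general embedding model to the domain (e.g. with LoRA) so that it captures domain-specific concepts better. |
| 4. Retrieval augmentation | Query expansion, hybrid retrieval, re-ranking, metadata filtering | Expand user queries with glossary aliases and combine vector retrieval with keyword retrieval (hybrid retrieval). After recall, re-rank results by term match, or use terms as metadata for more precise filtering. |
| 5. Generation control | Prompt engineering, structured output, term verification | Design prompts that include glossary information to guide the LLM toward more accurate answers. At output time, require the canonical terms from the glossary and verify the generated text to catch confused or non-standard wording. |
| 6. Evaluation and feedback | Terminology-consistency metrics, LLM-as-a-Judge, user feedback | Define dedicated metrics for how consistently the RAG system uses terminology. Use an LLM as the evaluator (LLM-as-a-Judge) to check term usage, and collect user feedback to keep improving the glossary and the system. |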
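
As one possible terminology-consistency metric for stage 6, the sketch below measures what share of term mentions in a generated answer use the canonical form rather than an alias. The metric definition, the name `canonical_usage_rate`, and the alias table are assumptions for illustration; the table above does not prescribe a specific formula.

```python
import re
from typing import Dict, List


def count_mentions(text: str, phrase: str) -> int:
    """Count case-insensitive occurrences of phrase that are not part of a longer word."""
    pattern = r"(?<![A-Za-z])" + re.escape(phrase) + r"(?![A-Za-z])"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def canonical_usage_rate(text: str, aliases: Dict[str, List[str]]) -> float:
    """Share of term mentions written in canonical form (1.0 if there are no mentions)."""
    canonical = sum(count_mentions(text, term) for term in aliases)
    non_canonical = sum(
        count_mentions(text, alias) for names in aliases.values() for alias in names
    )
    total = canonical + non_canonical
    return canonical / total if total else 1.0


alias_table = {"Machine Learning": ["ML"], "Natural Language Processing": ["NLP"]}
answer = "Machine Learning powers many NLP systems, and ML models need data."
rate = canonical_usage_rate(answer, alias_table)
print(f"Canonical usage rate: {rate:.2f}")
```

This simple count assumes no alias is a substring of its own canonical term; an alias such as "natural language" would be counted inside "Natural Language Processing" and would need extra handling.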

### Toy Implementation

In [1]:
import re
from typing import List, Dict, Any

# 1. Define a terminology glossary (keep it updated, including context tags)
GLOSSARY = [
    {
        "term": "Convolutional Neural Network",
        "synonyms": ["CNN", "Convolution-based neural network"],
        "definition": "A computational model inspired by biological neural networks, especially well-suited for image processing.",
        "context_tags": ["image recognition", "deep learning"],
    },
    {
        "term": "Machine Learning",
        "synonyms": ["ML", "Machine learn", "AI algorithms"],
        "definition": "A field of artificial intelligence that enables computer systems to learn from data without being explicitly programmed.",
        "context_tags": ["artificial intelligence", "data science"],
    },
    {
        "term": "Natural Language Processing",
        "synonyms": ["NLP", "natural language"],
        "definition": "A field that studies the interaction between human language and computers.",
        "context_tags": ["artificial intelligence", "linguistics"],
    },
    {
        "term": "Central Processing Unit",
        "synonyms": ["CPU"],
        "definition": "The arithmetic, logic, and control unit of a computer.",
        "context_tags": ["computer hardware", "computer"],
    },
    {
        "term": "Cost per Unit",
        "synonyms": ["CPU"],
        "definition": "A business analytics metric that measures the cost per unit of a product or service.",
        "context_tags": ["business analytics", "financial management", "cost"],
    },
]


class TerminologyProcessor:
    def __init__(self, glossary: List[Dict[str, Any]]):
        self.glossary = glossary
        self.standard_term_map = {}
        self.alias_to_entries_map = {}
        self._build_mappings()

    def _build_mappings(self):
        """Build mappings; one alias may map to multiple terminology entries to handle ambiguity."""
        for entry in self.glossary:
            standard_term = entry["term"]
            self.standard_term_map[standard_term.lower()] = standard_term

            all_aliases = [standard_term] + entry.get("synonyms", [])
            for alias in all_aliases:
                alias_lower = alias.lower()
                if alias_lower not in self.alias_to_entries_map:
                    self.alias_to_entries_map[alias_lower] = []
                if entry not in self.alias_to_entries_map[alias_lower]:
                    self.alias_to_entries_map[alias_lower].append(entry)

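    def standardize_text(self, text: str, context_window: int = 30) -> str:
        """
        Context-aware terminology standardization using iteration + a replacement function.
        Dynamically generates the correct regex for each term type.

        context_window is the number of characters on each side of a match that are
        scanned for context clues when an alias maps to more than one entry.
        """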
        standardized_text = text
        sorted_keys = sorted(self.alias_to_entries_map.keys(), key=len, reverse=True)

        for key_lower in sorted_keys:
            possible_entries = self.alias_to_entries_map[key_lower]

            # --- Dynamically create the correct regex for each key ---
            pattern_str = ""
            # If key contains Latin letters, assume it's an abbreviation and enforce boundaries
            if re.search(r"[a-zA-Z]", key_lower):
                # Use lookarounds to avoid matching inside a larger word
                pattern_str = r"(?<![a-zA-Z])" + re.escape(key_lower) + r"(?![a-zA-Z])"
            else:
                # For Chinese (or non-Latin) terms, match exactly
                pattern_str = re.escape(key_lower)

            pattern = re.compile(pattern_str, flags=re.IGNORECASE)

            # Replacement function called for each match
            def replacer(match: re.Match) -> str:
                if len(possible_entries) == 1:
                    return possible_entries[0]["term"]
                else:
                    # --- Context-based disambiguation (case-insensitive) ---
                    context_snippet = standardized_text[
                        max(0, match.start() - context_window) : min(len(standardized_text), match.end() + context_window)
                    ].lower()
                    for entry in possible_entries:
                        clues = entry.get("context_tags", []) + [entry["term"]]
                        if any(clue.lower() in context_snippet for clue in clues):
                            return entry["term"]
                    # If no context clue is found, fall back to the first definition
                    return possible_entries[0]["term"]

            # Update text using the replacement function
            standardized_text = pattern.sub(replacer, standardized_text)

        return standardized_text

    def extract_terms(self, text: str) -> List[str]:
        """
        Extract known standardized terms from text
        """
        found_terms = set()
        text_lower = text.lower()

        for standard_term_lower, original_standard_term in self.standard_term_map.items():
            # Direct substring search; do not use \b
            if re.search(re.escape(standard_term_lower), text_lower):
                found_terms.add(original_standard_term)

        return sorted(list(found_terms))


# 1. Initialize the terminology processor with the glossary.
term_processor = TerminologyProcessor(GLOSSARY)

# 2. Data preprocessing: terminology standardization
print("--- 2. Data Preprocessing: Terminology Standardization ---")
user_query = "I want to learn about applications of ML models in image recognition, and also some NLP knowledge."
processed_query = term_processor.standardize_text(user_query)
print(f"Original query: {user_query}")
print(f"Standardized query: {processed_query}")

document_text = "Recently I studied CNN and AI algorithms, and found they perform well on big data, especially ML in certain scenarios."
processed_document = term_processor.standardize_text(document_text)
print(f"Original document: {document_text}")
print(f"Standardized document: {processed_document}")

# 3. Term extraction (for downstream vectorization or metadata tagging)
print("\n--- 3. Term Extraction ---")
extracted_terms_query = term_processor.extract_terms(processed_query)
print(f"Extracted terms from query: {extracted_terms_query}")

extracted_terms_document = term_processor.extract_terms(processed_document)
print(f"Extracted terms from document: {extracted_terms_document}")

# 4. Simulated vector storage and retrieval augmentation (conceptual)
print("\n--- 4. Simulated Vector Storage and Retrieval Augmentation (Conceptual) ---")
print("In a real application, we would use an embedding model (e.g., SentenceTransformers) to convert the standardized text and terms into vectors.")
print("These vectors would then be stored in a dedicated vector database (e.g., FAISS, Pinecone, or Weaviate) for efficient similarity search.")
print("During retrieval, the user query is first standardized and vectorized, then used to query the vector database to fetch relevant documents.")

# 5. Simulated retrieval augmentation: query expansion
def enhance_query_for_retrieval(query: str, processor: TerminologyProcessor) -> List[str]:
    """Expand query keywords using the terminology glossary to improve recall."""
    standardized_query = processor.standardize_text(query)
    query_terms = processor.extract_terms(standardized_query)

    expanded_keywords = set([standardized_query])
    for term in query_terms:
        expanded_keywords.add(term)
        for entry in processor.glossary:
            if entry["term"] == term:
                for synonym in entry.get("synonyms", []):
                    expanded_keywords.add(synonym)
                break
    return sorted(list(expanded_keywords))


print("\n--- 5. Simulated Retrieval Augmentation: Query Expansion ---")
original_query_for_retrieval = "I want to know what a CPU does in a computer, and what cost-per-unit CPU means?"
expanded_keywords = enhance_query_for_retrieval(original_query_for_retrieval, term_processor)
print(f"Original retrieval query: {original_query_for_retrieval}")
print(f"Expanded retrieval keyword list: {expanded_keywords}")

print("\nIn a production RAG system, these expanded keywords would drive a hybrid retrieval strategy, combining semantic (vector) search with keyword-based search for best results.")


--- 2. Data Preprocessing: Terminology Standardization ---
Original query: I want to learn about applications of ML models in image recognition, and also some NLP knowledge.
Standardized query: I want to learn about applications of Machine Learning models in image recognition, and also some Natural Language Processing knowledge.
Original document: Recently I studied CNN and AI algorithms, and found they perform well on big data, especially ML in certain scenarios.
Standardized document: Recently I studied Convolutional Neural Network and Machine Learning, and found they perform well on big data, especially Machine Learning in certain scenarios.

--- 3. Term Extraction ---
Extracted terms from query: ['Machine Learning', 'Natural Language Processing']
Extracted terms from document: ['Convolutional Neural Network', 'Machine Learning']

--- 4. Simulated Vector Storage and Retrieval Augmentation (Conceptual) ---
In a real application, we would use an embedding model (e.g., SentenceTransfor

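# 2. Data Preprocessing: Improving the Quality of Semantic Representations

This is the "first line of defense" for terminology consistency, and it directly affects the quality of every later stage.

| Technique | Description | How it helps terminology consistency |
|-----------|-------------|--------------------------------------|
| Term extraction (NER, TF-IDF, KeyBERT) | Automatically identify candidate terms in the corpus | Supplies the raw terms; the basis for building the glossary |
| Term normalization | Replace non-standard wording with one canonical term (e.g. "AI" → "Artificial Intelligence") | Removes input noise and keeps term usage consistent |
| Text cleaning and format unification | Remove meaningless content; unify case, punctuation, and similar details | Reduces interference and improves term recognition accuracy |
| Context-aware chunking (SemanticChunker) | Split text into chunks by semantic similarity | Keeps each term together with its context instead of cutting the meaning apart |

## 2.1 Environment Setup
First, make sure the required Python libraries are installed: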

In [3]:
# ! pip install transformers sentence-transformers faiss-cpu scikit-learn spacy
# ! python -m spacy download en_core_web_sm
# ! pip install faiss-cpu

### Step 1: Glossary Structure Design

In [4]:
# This is a structured JSON that defines the core of our knowledge system
term_glossary = {
    "Neural Network": {
        "synonyms": ["Artificial Neural Network", "NN"],
        "definition": "A computational model that mimics the structure and function of biological neural networks",
        "context_tags": ["Artificial Intelligence", "Deep Learning"],
        "domain": "Computer Science",
        "stop_words": ["Nervous System"]
    },
    "Convolutional Neural Network": {
        "synonyms": ["CNN", "ConvNet"],
        "definition": "A deep learning model that extracts local features through convolutional layers",
        "context_tags": ["Computer Vision", "Image Recognition"],
        "domain": "Artificial Intelligence",
        "stop_words": []
    },
    "Image Recognition": { 
        "synonyms": [],
        "definition": "The task of enabling computers to identify and interpret the content of images",
        "context_tags": ["Computer Vision"],
        "domain": "Artificial Intelligence",
        "stop_words": []
    },
    "Autonomous Driving": { 
        "synonyms": [],
        "definition": "Technology that enables vehicles to operate without human intervention",
        "context_tags": ["Artificial Intelligence", "Robotics"],
        "domain": "Computer Science",
        "stop_words": []
    },
    "Medical Imaging Diagnosis": {
        "synonyms": [],
        "definition": "The use of medical imaging to diagnose diseases",
        "context_tags": ["Healthcare", "Image Processing"],
        "domain": "Medicine",
        "stop_words": []
    }
}

### Step 2: 2.1 Term Extraction and Normalization

In [5]:
# Use spaCy's EntityRuler to customize entity recognition rules
# based on our terminology glossary
import spacy

def extract_terms_with_ruler(text, glossary):
    nlp = spacy.load("en_core_web_sm")
    
    # Create an EntityRuler pipeline and load all terms and their aliases
    # from the glossary
    ruler = nlp.add_pipe("entity_ruler", before="ner")
    patterns = []
    for term, data in glossary.items():
        patterns.append({"label": "TERM", "pattern": term})
        for syn in data.get("synonyms", []):
            patterns.append({"label": "TERM", "pattern": syn})
    ruler.add_patterns(patterns)
    
    # Process the text and extract entities recognized as "TERM"
    doc = nlp(text)
    candidates = {ent.text for ent in doc.ents if ent.label_ == "TERM"}
    return candidates

# Example
text = "Examples of CNN model applications in image recognition include autonomous driving and medical imaging diagnosis."
candidates = extract_terms_with_ruler(text, term_glossary)
print(f"Automatically extracted term candidates: {candidates}")

Automatically extracted term candidates: {'CNN'}
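
Only `'CNN'` is returned above. The likely reason is that the string patterns are matched against the exact token text, while the glossary keys are title-cased ("Image Recognition") and the sentence uses lower case. spaCy's `entity_ruler` accepts a `phrase_matcher_attr` setting (for example `"LOWER"`) for case-insensitive phrase matching. As a dependency-free illustration of the same idea, the sketch below matches terms case-insensitively with the standard library; `find_glossary_terms` and the reduced glossary are hypothetical.

```python
import re
from typing import Dict, List, Set


def find_glossary_terms(text: str, glossary: Dict[str, List[str]]) -> Set[str]:
    """Return canonical terms whose name or synonym appears in text, ignoring case."""
    found = set()
    for canonical, synonyms in glossary.items():
        for surface in [canonical, *synonyms]:
            # Lookarounds keep short aliases from matching inside longer words.
            pattern = r"(?<![A-Za-z])" + re.escape(surface) + r"(?![A-Za-z])"
            if re.search(pattern, text, flags=re.IGNORECASE):
                found.add(canonical)
                break
    return found


mini_glossary = {
    "Convolutional Neural Network": ["CNN", "ConvNet"],
    "Image Recognition": [],
    "Autonomous Driving": [],
    "Medical Imaging Diagnosis": [],
}
sentence = (
    "Examples of CNN model applications in image recognition include "
    "autonomous driving and medical imaging diagnosis."
)
matched = find_glossary_terms(sentence, mini_glossary)
print(sorted(matched))
```

Returning the canonical term instead of the surface form means extraction and normalization happen in one pass.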


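# 3. Embedding Construction and Vectorization (Embedding & Vectorization)

The core task is to turn the cleaned, standardized terms into dense vectors that a machine can compare and compute on, and to build an efficient retrieval index. This sets the upper bound on the system's semantic matching ability.

| Technique | Description | How it helps terminology consistency |
|-----------|-------------|--------------------------------------|
| Term embeddings and vector index (FAISS / Pinecone) | Convert terms and their aliases into vectors and build an index | Enables semantic matching and improves term recognition at retrieval time |
| Domain-specific embedding models (Legal-BERT, ChatLaw-Text2Vec) | Continue training a general model on domain corpora | Improves term understanding and strengthens semantic representations |
| Sentence Transformers + PEFT (LoRA) fine-tuning | Parameter-efficient fine-tuning of the embedding model | Further tunes term representations for a specific domain |

### Step 2: 2.2 Synonym Discovery via Vector Similarity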

In [6]:
from sentence_transformers import SentenceTransformer, util

# Load the model once at module level to avoid repeated loading overhead.
synonym_model = SentenceTransformer("paraphrase-MiniLM-L6-v2")

def map_synonyms_by_similarity(main_terms: list, candidates: list, threshold: float = 0.8) -> dict:
    """
    Map candidate terms to the closest standard terms by computing
    cosine similarity between embeddings.

    Args:
        main_terms (list): List of standard (canonical) terms.
        candidates (list): List of candidate synonyms to be matched.
        threshold (float): Similarity threshold above which a candidate
                           is considered a synonym.

    Returns:
        dict: A dictionary mapping each standard term to a list of
              successfully matched synonyms.
    """
    _matched_synonyms = {term: [] for term in main_terms}

    if not main_terms or not candidates:
        return _matched_synonyms

    # Encode standard terms and candidates in one batch for better efficiency
    embeddings = synonym_model.encode(main_terms + candidates, convert_to_tensor=True)
    term_embeddings = embeddings[:len(main_terms)]
    candidate_embeddings = embeddings[len(main_terms):]

    # Compute the cosine similarity matrix between standard terms and candidates
    similarity_matrix = util.cos_sim(term_embeddings, candidate_embeddings)

    for i, term in enumerate(main_terms):
        for j, candidate in enumerate(candidates):
            if similarity_matrix[i][j] > threshold:
                _matched_synonyms[term].append(candidate)

    return _matched_synonyms


# Example:
main_terms_to_map = ["Convolutional Neural Network", "Neural Network"]
all_possible_synonyms = [
    "CNN",
    "ConvNet",
    "Artificial Neural Network",
    "NN",
    "Nervous System",
    "Deep Learning Model",
]

optimized_mapped_synonyms = map_synonyms_by_similarity(
    main_terms_to_map,
    all_possible_synonyms
)
print("\nOptimized matched synonyms:", optimized_mapped_synonyms)



Optimized matched synonyms: {'Convolutional Neural Network': [], 'Neural Network': ['Artificial Neural Network']}
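
In the run above, abbreviations such as "CNN" and "NN" do not clear the 0.8 threshold, which suggests that similarity alone can miss short aliases. One common complement is to consult the curated alias table first and use similarity only as a fallback. The sketch below assumes a generic `similarity` callable; the `difflib`-based stand-in is only a placeholder so the example runs without an embedding model, and `resolve_term` is a hypothetical helper.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Optional


def resolve_term(
    candidate: str,
    alias_to_canonical: Dict[str, str],
    canonical_terms: List[str],
    similarity: Callable[[str, str], float],
    threshold: float = 0.8,
) -> Optional[str]:
    """Resolve candidate via the curated alias table first, then via similarity."""
    exact = alias_to_canonical.get(candidate.lower())
    if exact is not None:
        return exact
    best = max(canonical_terms, key=lambda term: similarity(candidate, term))
    return best if similarity(candidate, best) >= threshold else None


def string_similarity(a: str, b: str) -> float:
    """Placeholder for an embedding-based cosine similarity (character-level ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


aliases = {"cnn": "Convolutional Neural Network", "nn": "Neural Network"}
canonicals = ["Convolutional Neural Network", "Neural Network"]
print(resolve_term("CNN", aliases, canonicals, string_similarity))
print(resolve_term("Neural Networks", aliases, canonicals, string_similarity))
print(resolve_term("Nervous System", aliases, canonicals, string_similarity))
```

With a real embedding model, `similarity` would wrap the cosine computation shown earlier; the lookup order stays the same.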


### 3.2 Step 3: Build the Term Vector Index

In [7]:
import requests
s = requests.Session()
s.trust_env = False
print(s.get("https://huggingface.co", timeout=10).status_code)

200


In [8]:
import os
from sentence_transformers import SentenceTransformer

# (Optional but recommended) Explicit cache directory
os.environ.setdefault("HF_HOME", os.path.expanduser("~/.cache/huggingface"))
os.environ.setdefault("TRANSFORMERS_CACHE", os.path.expanduser("~/.cache/huggingface/transformers"))

model_name = 'paraphrase-MiniLM-L6-v2'

print(f"\nAttempting to load model '{model_name}'...")
model = SentenceTransformer(model_name)


Attempting to load model 'paraphrase-MiniLM-L6-v2'...


In [9]:
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from typing import Dict, Tuple, List


def build_term_vector_index(
    term_glossary: Dict[str, dict],
    model: SentenceTransformer,
    use_cosine: bool = True
) -> Tuple[faiss.Index, List[str]]:
    """
    Convert all terms and their synonyms into vector embeddings and build a FAISS index.

    Args:
        term_glossary (dict):
            A structured glossary where keys are canonical terms and values contain
            a 'synonyms' list.
        model (SentenceTransformer):
            A loaded SentenceTransformer model.
        use_cosine (bool):
            Whether to use cosine similarity (recommended for sentence embeddings).

    Returns:
        tuple:
            (faiss_index, indexed_terms)
            - faiss_index: FAISS index containing all embeddings
            - indexed_terms: list of terms aligned with index rows
    """
    terms_to_index: List[str] = []

    # Collect canonical terms and synonyms
    for canonical_term, info in term_glossary.items():
        terms_to_index.append(canonical_term)
        synonyms = info.get("synonyms", [])
        if isinstance(synonyms, list):
            terms_to_index.extend(synonyms)

    # Deduplicate while keeping deterministic order
    indexed_terms = sorted(set(terms_to_index))
    if not indexed_terms:
        raise ValueError("No terms found in glossary.")

    print("Generating term embeddings...")
    embeddings = model.encode(
        indexed_terms,
        convert_to_numpy=True,
        normalize_embeddings=use_cosine,
        show_progress_bar=True
    ).astype("float32")

    dim = embeddings.shape[1]

    # Choose FAISS index type
    if use_cosine:
        # Cosine similarity = inner product on normalized vectors
        index = faiss.IndexFlatIP(dim)
    else:
        index = faiss.IndexFlatL2(dim)

    index.add(embeddings)

    metric = "cosine similarity" if use_cosine else "L2 distance"
    print(f"FAISS index built successfully. "
          f"Vectors: {index.ntotal}, Dimension: {dim}, Metric: {metric}")

    return index, indexed_terms


def search_terms(
    query: str,
    model: SentenceTransformer,
    index: faiss.Index,
    indexed_terms: List[str],
    top_k: int = 5,
    use_cosine: bool = True
):
    """
    Search the FAISS index for the most similar terms to a query.

    Args:
        query (str): Input query text.
        model (SentenceTransformer): SentenceTransformer model.
        index (faiss.Index): FAISS index.
        indexed_terms (list): Terms aligned with index rows.
        top_k (int): Number of results to return.
        use_cosine (bool): Whether cosine similarity is used.

    Returns:
        list of (term, score) tuples.
    """
    query_embedding = model.encode(
        [query],
        convert_to_numpy=True,
        normalize_embeddings=use_cosine
    ).astype("float32")

    scores, indices = index.search(query_embedding, top_k)

    results = []
    for score, idx in zip(scores[0], indices[0]):
        if idx == -1:
            continue
        results.append((indexed_terms[idx], float(score)))

    return results


# ------------------------------------------------------------------
# 1. Load embedding model
# ------------------------------------------------------------------
model = SentenceTransformer(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)

# ------------------------------------------------------------------
# 2. Prepare term glossary (ALL IN ENGLISH)
# ------------------------------------------------------------------
term_glossary_example = {
    "Convolutional Neural Network": {
        "synonyms": ["CNN", "ConvNet"]
    },
    "Transformer": {
        "synonyms": ["transformer", "TRANSFORMER"]
    },
    "Image Recognition": {
        "synonyms": ["Image Classification", "Visual Recognition"]
    }
}

# ------------------------------------------------------------------
# 3. Build FAISS index
# ------------------------------------------------------------------
faiss_index, indexed_term_list = build_term_vector_index(
    term_glossary_example,
    model,
    use_cosine=True
)

# ------------------------------------------------------------------
# 4. Verify results
# ------------------------------------------------------------------
print("\n--- Index Build Successful ---")
print("Number of vectors in FAISS index:", faiss_index.ntotal)
print("Indexed terms:", indexed_term_list)

print("\n--- Query Tests ---")
print(search_terms("CNN", model, faiss_index, indexed_term_list))
print(search_terms("image classification", model, faiss_index, indexed_term_list))
print(search_terms("transformer model", model, faiss_index, indexed_term_list))


Generating term embeddings...


Batches: 100%|██████████| 1/1 [00:00<00:00,  1.67it/s]


FAISS index built successfully. Vectors: 9, Dimension: 384, Metric: cosine similarity

--- Index Build Successful ---
Number of vectors in FAISS index: 9
Indexed terms: ['CNN', 'ConvNet', 'Convolutional Neural Network', 'Image Classification', 'Image Recognition', 'TRANSFORMER', 'Transformer', 'Visual Recognition', 'transformer']

--- Query Tests ---
[('CNN', 1.0), ('ConvNet', 0.46894609928131104), ('Convolutional Neural Network', 0.4410366714000702), ('Image Classification', 0.3689947724342346), ('Visual Recognition', 0.3341491222381592)]
[('Image Classification', 0.9950429201126099), ('Image Recognition', 0.8483021855354309), ('Visual Recognition', 0.6152772307395935), ('CNN', 0.3614282011985779), ('Convolutional Neural Network', 0.35928869247436523)]
[('transformer', 0.8794118165969849), ('TRANSFORMER', 0.7784579992294312), ('Transformer', 0.7107824683189392), ('Image Classification', 0.31479692459106445), ('Image Recognition', 0.289703369140625)]
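
The index above relies on the fact that the inner product of L2-normalized vectors equals their cosine similarity, which is why `IndexFlatIP` is paired with `normalize_embeddings=True`. Here is a small pure-Python check of that identity, using made-up 3-dimensional vectors in place of real embeddings:

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def norm(a: List[float]) -> float:
    return math.sqrt(dot(a, a))


def l2_normalize(a: List[float]) -> List[float]:
    n = norm(a)
    return [x / n for x in a]


def cosine(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (norm(a) * norm(b))


query_vec = [1.0, 2.0, 2.0]   # toy stand-in for a query embedding (norm 3)
term_vec = [2.0, 0.0, 1.0]    # toy stand-in for an indexed term embedding

cos_sim = cosine(query_vec, term_vec)
ip_both_normalized = dot(l2_normalize(query_vec), l2_normalize(term_vec))
ip_query_raw = dot(query_vec, l2_normalize(term_vec))

print(f"cosine similarity:              {cos_sim:.6f}")
print(f"inner product, both normalized: {ip_both_normalized:.6f}")
print(f"inner product, raw query:       {ip_query_raw:.6f}")
```

The last value equals the cosine similarity multiplied by the query norm. If only the indexed vectors are normalized, the ranking is unchanged but the scores are no longer bounded by 1.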


An example of querying the index, using the index and term list built above:

In [10]:
# --- Part 2: Define our core retrieval function ---

def search_similar_terms(
    query_text: str,
    model: SentenceTransformer,
    index: faiss.Index,
    term_list: list,
    k: int = 5
):
    """
    Retrieve the top-k most similar terms to a query text from a FAISS index.

    Args:
        query_text (str): The user input query term/text.
        model (SentenceTransformer): The embedding model used to encode the query.
        index (faiss.Index): The FAISS index object.
        term_list (list): The term list aligned with the order of vectors in the FAISS index.
        k (int): The number of most similar results to return.
    """
    print(f"\n--- Running Retrieval ---")
    print(f"Query: '{query_text}'")

    # 1) Encode the query text into an embedding vector
    query_vector = model.encode([query_text])
    query_vector = query_vector.astype("float32")

    # 2) Search in the FAISS index
    # index.search returns two arrays: D (distances/scores) and I (indices)
    distances, indices = index.search(query_vector, k)

    # 3) Parse and print results
    # faiss_index was built as IndexFlatIP (use_cosine=True), so index.search returns
    # inner-product scores: larger = more similar. The query vector is not normalized
    # here, so each score is the cosine similarity scaled by the query vector's norm;
    # the ranking is unaffected. Pass normalize_embeddings=True to model.encode to get
    # true cosine values.
    print("Results:")
    for rank, (idx, score) in enumerate(zip(indices[0], distances[0]), start=1):
        if idx == -1:  # FAISS pads with -1 when fewer than k vectors are available
            continue
        print(f"  Top {rank}: term='{term_list[int(idx)]}', score={float(score):.4f} (larger = more similar)")


# 4) === Demo ===

# Case 1: Query using a synonym for the canonical term
# Goal: Test whether the system understands that "CNN" refers to "Convolutional Neural Network".
search_similar_terms(query_text="CNN", model=model, index=faiss_index, term_list=indexed_term_list, k=3)

# Case 2: Semantically similar query (core advantage)
# Goal: Query a term not explicitly in the glossary but semantically related: "Computer Vision".
# Expected: The system should find related terms like "Image Recognition" or "Visual Recognition".
search_similar_terms(query_text="Computer Vision", model=model, index=faiss_index, term_list=indexed_term_list, k=3)

# Case 3: Query with a broader term
# Goal: Query "Language Model" and see whether it matches more specific related terms (if present).
search_similar_terms(query_text="Language Model", model=model, index=faiss_index, term_list=indexed_term_list, k=3)

# Case 4: Tolerance to minor noise / paraphrases
# Goal: Query "Transformer model" (a paraphrase) and see if it matches "Transformer" / "transformer".
search_similar_terms(query_text="Transformer model", model=model, index=faiss_index, term_list=indexed_term_list, k=3)



--- Running Retrieval ---
Query: 'CNN'
Results:
  Top 1: term='CNN', score=6.2372 (larger = more similar)
  Top 2: term='ConvNet', score=2.9249 (larger = more similar)
  Top 3: term='Convolutional Neural Network', score=2.7508 (larger = more similar)

--- Running Retrieval ---
Query: 'Computer Vision'
Results:
  Top 1: term='Visual Recognition', score=3.8383 (larger = more similar)
  Top 2: term='Image Recognition', score=3.2715 (larger = more similar)
  Top 3: term='Image Classification', score=2.6665 (larger = more similar)

--- Running Retrieval ---
Query: 'Language Model'
Results:
  Top 1: term='TRANSFORMER', score=1.6813 (larger = more similar)
  Top 2: term='Convolutional Neural Network', score=1.3615 (larger = more similar)
  Top 3: term='CNN', score=1.2736 (larger = more similar)

--- Running Retrieval ---
Query: 'Transformer model'
Results:
  Top 1: term='Transformer', score=5.3101 (larger = more similar)
  Top 2: term='transformer', di

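# 4. Retrieval Augmentation

The core goal is to improve both the breadth and the precision of results on top of the initial recall. The preprocessing stage settled what the "standard" form of each term is; this stage focuses on putting that standardized knowledge to work during actual retrieval.

| Technique | Description | How it helps terminology consistency |
|-----------|-------------|--------------------------------------|
| Query expansion and rewriting (MultiQueryRetriever) | Use an LLM to generate several semantically equivalent query variants and merge their retrieval results. | Covers synonyms and related wording the user did not mention, improving recall across varied terminology. |
| HyDE (Hypothetical Document Embeddings) | Use an LLM to generate a hypothetical "ideal answer" document for the query, then retrieve with that document's embedding. | The context-rich hypothetical answer mitigates vague terms or missing information in the original query and improves retrieval relevance. |
| Hybrid retrieval (BM25 + FAISS) | Combine the strengths of keyword retrieval (e.g. BM25) and vector retrieval (e.g. FAISS). | Uses exact lexical matching and semantic matching together, so basic terms are not lost while semantically related content is still found. |
| Cross-encoder re-ranking (BGE-reranker) | Use a heavier cross-encoder model (e.g. BGE-reranker) to re-rank the recalled results. | Scores each query-document pair jointly, which improves how precisely results are ordered by term match. |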
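
To illustrate the hybrid retrieval row, the sketch below merges a keyword ranking and a vector ranking with Reciprocal Rank Fusion (RRF). RRF is one common fusion rule chosen here for illustration; the table does not say how the two result lists should be combined. The document IDs and the constant `k = 60` are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked ID lists: each document scores sum(1 / (k + rank)) over the lists."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_ranking = ["doc_cnn_intro", "doc_cnn_layers", "doc_transformer"]
vector_ranking = ["doc_cnn_layers", "doc_vit", "doc_cnn_intro"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Because RRF uses only ranks, it avoids having to put BM25 scores and cosine similarities on a common scale.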

In [11]:
# ! pip install langchain_community
# ! pip install langchain langchain-openai faiss-cpu sentence-transformers

### 4.1 Core Technique 1: Query Expansion and Rewriting

In [12]:
## Query Expansion and Rewriting
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader
import os

# --- Preparation: Set API Key and Create Vector Database ---

import dotenv
# Set your OpenAI API Key
# os.environ["OPENAI_API_KEY"] = "sk-..."
dotenv.load_dotenv()

# 1. Prepare sample documents
# We create some example text containing technical terminology
doc_text = """
Convolutional Neural Networks (CNNs) are a key model in deep learning,
especially effective in the field of image recognition.
Their core idea is to automatically extract local image features
through convolutional layers and pooling layers.

Unlike CNNs, Transformer models were originally applied to
natural language processing (NLP) tasks such as machine translation.
Today, they have also been successfully applied to computer vision,
known as Vision Transformers.

Large Language Models (LLMs) are a major focus of current AI research.
Based on the Transformer architecture, they are capable of
understanding and generating human-like text,
demonstrating strong reasoning capabilities.
"""

with open("sample_tech_doc.txt", "w", encoding="utf-8") as f:
    f.write(doc_text)

# 2. Load and split the document
loader = TextLoader("sample_tech_doc.txt", encoding="utf-8")
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20
)
docs = text_splitter.split_documents(documents)

# 3. Create the vector database
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embeddings)

# --- MultiQueryRetriever Implementation ---

# 4. Initialize the LLM and retriever
llm = ChatOpenAI(temperature=0)
retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=llm
)

# 5. Execute a query
query = "What is a CNN?"
retrieved_docs = retriever_from_llm.invoke(query)

# --- Result Analysis ---
print(f"Original query: {query}")

print("\n--- Query Variants Generated by MultiQueryRetriever ---")
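# MultiQueryRetriever logs the queries it generates at INFO level.
# To see the real variants, enable logging before calling invoke():
#   import logging
#   logging.basicConfig()
#   logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)
# The list below is a hard-coded illustration, not output captured from the retriever.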

generated_queries = [
    "What is the definition of a Convolutional Neural Network?",
    "What role does the CNN model play in deep learning?",
    "Can you explain Convolutional Neural Networks (CNNs)?"
]

for i, q in enumerate(generated_queries):
    print(f"Query variant {i+1}: {q}")

print("\n--- Final Retrieved Document Content ---")
for doc in retrieved_docs:
    print(doc.page_content)


Original query: What is a CNN?

--- Query Variants Generated by MultiQueryRetriever ---
Query variant 1: What is the definition of a Convolutional Neural Network?
Query variant 2: What role does the CNN model play in deep learning?
Query variant 3: Can you explain Convolutional Neural Networks (CNNs)?

--- Final Retrieved Document Content ---
Convolutional Neural Networks (CNNs) are a key model in deep learning,
Unlike CNNs, Transformer models were originally applied to
through convolutional layers and pooling layers.
Today, they have also been successfully applied to computer vision,
known as Vision Transformers.
especially effective in the field of image recognition.
Their core idea is to automatically extract local image features
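
The final list above contains more chunks than a single similarity search would return because the retriever runs every query variant and keeps the unique union of the hits. A minimal sketch of that merge step follows; `multi_query_retrieve`, `toy_search`, and `TOY_INDEX` are hypothetical stand-ins (a keyword lookup replaces the vector store), so only the merge logic should be read as representative.

```python
from typing import Callable, Dict, List


def multi_query_retrieve(queries: List[str],
                         search: Callable[[str], List[str]]) -> List[str]:
    """Run every query variant and return the unique union of hits, in first-seen order."""
    seen = set()
    merged: List[str] = []
    for q in queries:
        for chunk in search(q):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged


# Toy stand-in for a vector store: maps a lowercase keyword to the chunks that mention it
TOY_INDEX: Dict[str, List[str]] = {
    "cnn": ["Unlike CNNs, Transformer models were originally applied to"],
    "convolutional": [
        "Convolutional Neural Networks (CNNs) are a key model in deep learning,",
        "through convolutional layers and pooling layers.",
    ],
    "deep learning": ["Convolutional Neural Networks (CNNs) are a key model in deep learning,"],
}


def toy_search(query: str) -> List[str]:
    """Return the chunks of every index keyword that occurs in the query."""
    q = query.lower()
    hits: List[str] = []
    for keyword, chunks in TOY_INDEX.items():
        if keyword in q:
            hits.extend(chunks)
    return hits


variants = [
    "What is the definition of a Convolutional Neural Network?",
    "What role does the CNN model play in deep learning?",
]
results = multi_query_retrieve(variants, toy_search)
print(len(results))
for chunk in results:
    print(chunk)
```

The second variant hits a chunk that the first one already returned, and the `seen` set is what keeps it from appearing twice.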


### 4.2 Hybrid Retrieval (BM25 + Vector Search)

In [13]:
# ! pip install --upgrade langchain langchain-community langchain-openai rank_bm25 faiss-cpu

In [14]:
import os
from langchain.retrievers import MergerRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader

# --- 1. Preparation: Set API Key and Prepare Data ---

# os.environ["OPENAI_API_KEY"] = "sk-..."

# Prepare sample documents (slightly richer content for comparison)
doc_text = """
Part 1: About Convolutional Networks.
Convolutional Neural Networks (CNNs) are a key model in deep learning,
especially effective in image recognition.
Their core idea is to automatically extract local image features
through convolutional layers and pooling layers. This makes CNNs very efficient.

Part 2: About Transformers.
Unlike CNNs, Transformer models were originally used in
Natural Language Processing (NLP) tasks such as machine translation.
Today, a variant called Vision Transformer (ViT) has also been successfully applied
to the field of computer vision.

Part 3: About Large Models.
Large Language Models (LLMs) are a major focus of modern AI research.
They are typically based on the Transformer architecture,
and can understand and generate human-like text,
demonstrating strong reasoning capabilities.
"""

with open("hybrid_search_doc.txt", "w", encoding="utf-8") as f:
    f.write(doc_text)

# Load and split documents
loader = TextLoader("hybrid_search_doc.txt", encoding="utf-8")
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=120, chunk_overlap=20)
docs = text_splitter.split_documents(documents)

print(f"The document has been split into {len(docs)} chunks.")


# --- 2. Build two different retrievers ---

# Retriever 1: FAISS vector retriever (for semantic matching)
print("\nBuilding FAISS vector retriever...")
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embeddings)
faiss_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
print("FAISS retriever built successfully.")


# Retriever 2: BM25 keyword retriever (for exact matching)
print("\nBuilding BM25 keyword retriever...")
# BM25Retriever can be initialized directly from documents; it does not require embeddings
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 3
print("BM25 retriever built successfully.")


# --- 3. Merge using MergerRetriever ---

print("\nInitializing MergerRetriever...")
retriever_list = [bm25_retriever, faiss_retriever]

# MergerRetriever queries each retriever and interleaves their results round-robin;
# it does not remove duplicates on its own
merged_retriever = MergerRetriever(retrievers=retriever_list)
print("MergerRetriever initialized successfully.")


# --- 4. Run a query and compare results ---

query = "Technical details of ViT"
print(f"\n\n--- Running Hybrid Retrieval ---")
print(f"Query: '{query}'")


# For comparison, inspect each retriever’s results individually first
print("\n--- Individual Retriever Results ---")

bm25_results = bm25_retriever.invoke(query)
print(f"[BM25 Keyword Retrieval Results] (total {len(bm25_results)}):")
for doc in bm25_results:
    print(f"  - {doc.page_content[:50]}...")

faiss_results = faiss_retriever.invoke(query)
print(f"\n[FAISS Vector Retrieval Results] (total {len(faiss_results)}):")
for doc in faiss_results:
    print(f"  - {doc.page_content[:50]}...")


# Now inspect the merged (hybrid) results
print("\n--- MergerRetriever Hybrid Results ---")
merged_results = merged_retriever.invoke(query)
print(f"[Final Hybrid Results] (total {len(merged_results)}, duplicates not removed):")
for doc in merged_results:
    print(f"  - {doc.page_content[:50]}...")

The document has been split into 9 chunks.

Building FAISS vector retriever...
FAISS retriever built successfully.

Building BM25 keyword retriever...
BM25 retriever built successfully.

Initializing MergerRetriever...
MergerRetriever initialized successfully.


--- Running Hybrid Retrieval ---
Query: 'Technical details of ViT'

--- Individual Retriever Results ---
[BM25 Keyword Retrieval Results] (total 3):
  - Part 3: About Large Models.
Large Language Models ...
  - Today, a variant called Vision Transformer (ViT) h...
  - demonstrating strong reasoning capabilities....

[FAISS Vector Retrieval Results] (total 3):
  - Today, a variant called Vision Transformer (ViT) h...
  - They are typically based on the Transformer archit...
  - especially effective in image recognition.
Their c...

--- MergerRetriever Hybrid Results ---
[Final Hybrid Results] (total 6, duplicates not removed):
  - Part 3: About Large Models.
Large Language Models ...
  - Today, a variant called Vision Transformer (ViT) h.
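
The recorded run returns 3 + 3 = 6 documents even though the Vision Transformer chunk appears in both individual result lists, so some form of deduplication or rank fusion is still needed downstream. One common option is Reciprocal Rank Fusion (RRF). The sketch below is an illustration under assumptions rather than part of the pipeline above: `reciprocal_rank_fusion` is a hypothetical helper, `k=60` is the conventional smoothing constant, and the strings are shortened labels for the chunks shown in the output.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists: each document scores sum(1 / (k + rank)) over the lists."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Each document appears once in the result; ties keep first-seen order
    return sorted(scores, key=lambda doc: scores[doc], reverse=True)


# Shortened labels for the chunks in the recorded BM25 and FAISS results
bm25_hits = [
    "Part 3: About Large Models",
    "Vision Transformer (ViT)",
    "strong reasoning capabilities",
]
faiss_hits = [
    "Vision Transformer (ViT)",
    "based on the Transformer architecture",
    "effective in image recognition",
]

fused = reciprocal_rank_fusion([bm25_hits, faiss_hits])
for doc in fused:
    print(doc)
```

Because the chunk that both retrievers agree on accumulates two contributions, it moves to the top, and the fused list holds five unique entries instead of six.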

## 5. Generation Control and Output Validation

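| Technique | Description | Contribution to terminology consistency |
|-----------|-------------|------------------------------------------|
| Prompt Engineering | State explicit instructions in the prompt that steer the LLM toward standard terms and a specific style. | The most basic control: it directly shapes the LLM's word choice and guides it to follow the terminology standard. |
| Structured Output | Force the LLM to return an object that conforms to a predefined schema (such as Pydantic or JSON Schema). | Restricts free-form term usage at the source, so key information is emitted in a standard, controllable format. |
| Output Parsing and Repair (Output Parsers) | Use tools such as OutputFixingParser to attempt an automatic repair when the LLM output is malformed. | Makes structured output more robust and can correct minor formatting or spelling errors in terms automatically. |
| Post-processing and Content Enhancement | Automatically highlight terms in the answer text and add definition tooltips or reference links. | Improves the readability and professionalism of the final answer, and gives users instant term explanations and source traceability. |
| LLM-as-a-Judge | Use another LLM instance to score the generated result against preset criteria (such as terminology consistency). | Provides a scalable, automated way to assess output quality and terminology compliance. |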
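
As an illustration of the prompt-engineering row, glossary entries can be rendered into the system prompt so that the model sees both the standard terms and the aliases they replace. This is a sketch under assumptions: `build_terminology_prompt` and the dictionary layout are hypothetical and not taken from any specific library.

```python
from typing import Dict, List


def build_terminology_prompt(glossary: Dict[str, List[str]]) -> str:
    """Render a system prompt that lists each standard term and the aliases it replaces."""
    lines = [
        "You are a technical assistant. Follow these terminology rules:",
        "- Use the standard term on first mention; a listed alias may follow in parentheses.",
        "- Do not invent new names for glossary concepts.",
        "",
        "Glossary:",
    ]
    for term in sorted(glossary):
        aliases = ", ".join(glossary[term]) or "none"
        lines.append(f"- {term} (aliases: {aliases})")
    return "\n".join(lines)


# Hypothetical glossary: standard term -> accepted aliases
glossary = {
    "Convolutional Neural Network": ["CNN"],
    "Large Language Model": ["LLM"],
    "Transformer Model": ["Transformer"],
}
system_prompt = build_terminology_prompt(glossary)
print(system_prompt)
```

Generating the prompt from the glossary, instead of writing the rules by hand, keeps the instructions in step with the glossary as it is updated.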

### 5.1 Generation-Time Control: Structured Output

In [15]:
import os
from typing import List

# Import BaseModel and Field directly from pydantic (v2)
from pydantic import BaseModel, Field

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# --- 1. Preparation ---
# os.environ["OPENAI_API_KEY"] = "sk-..."
llm = ChatOpenAI(temperature=0, model="gpt-4o")

# --- 2. Define the expected output structure (using Pydantic v2) ---
class TerminologyInAnswer(BaseModel):
    """A structured model containing the main answer and the technical terms used."""
    answer: str = Field(description="A detailed and accurate answer to the user's question.")
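    # Pydantic v2 deprecates extra Field keyword arguments such as `example`,
    # so the sample value is given inside the description instead
    standard_terms_used: List[str] = Field(
        description=(
            "A list of standard technical terms from the official glossary that are "
            "explicitly used in the answer, e.g. ['Convolutional Neural Network', 'Image Recognition']."
        )
    )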

# --- 3. Create a structured output chain ---
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an AI expert with deep technical knowledge. Please provide a structured answer to the user's question."),
    ("human", "Please explain what a CNN is and describe its main application areas.")
])

# Bind the LLM to the Pydantic model; the chain then returns a TerminologyInAnswer instance.
# Which mechanism is used underneath (function calling or JSON schema) depends on the
# installed langchain-openai version.
structured_llm_chain = prompt | llm.with_structured_output(TerminologyInAnswer)

# --- 4. Execute the chain and inspect the result ---
print("--- Executing structured output chain ---")
structured_response = structured_llm_chain.invoke({})

print("\n--- Structured object returned by the LLM ---")
print(structured_response)

print("\n--- Result analysis ---")
print(f"Answer content: {structured_response.answer}")
print(f"Standard terms confirmed by the model: {structured_response.standard_terms_used}")

if "Convolutional Neural Network" in structured_response.standard_terms_used:
    print("Validation passed: The answer correctly uses the standard term 'Convolutional Neural Network'.")

# --- 5. Content enhancement: automatically add term definitions ---
print("\n" + "=" * 50)
print("--- Starting content enhancement ---")

# Assume we have a simplified terminology glossary
glossary = {
    "Convolutional Neural Network": "A specialized deep learning model that excels at processing image data.",
    "Deep Learning": "A subfield of machine learning based on artificial neural networks.",
    "Image Recognition": "A core task in computer vision that aims to identify and classify objects in images."
}

# Extract the answer text generated by the LLM
llm_answer_text = structured_response.answer

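import html
import re

def enhance_text_with_definitions(text: str, term_glossary: dict) -> str:
    """
    Find glossary terms in the text and wrap each one in an HTML <abbr> tag
    whose title holds the definition (shown as a hover tooltip).
    """
    if not term_glossary:
        return text
    # Match all terms in one pass, longest first, so that text inside an
    # inserted definition is never matched and wrapped a second time
    terms = sorted(term_glossary, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(term) for term in terms))

    def annotate(match):
        term = match.group(0)
        # Escape the definition so quotes or angle brackets cannot break the attribute
        return f'<abbr title="{html.escape(term_glossary[term])}">{term}</abbr>'

    return pattern.sub(annotate, text)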

# Perform content enhancement
final_output = enhance_text_with_definitions(llm_answer_text, glossary)

print("\n--- Final enhanced output (view in an HTML-enabled Markdown renderer) ---")
print(final_output)

--- Executing structured output chain ---

--- Structured object returned by the LLM ---
answer='A Convolutional Neural Network (CNN) is a type of deep learning model specifically designed to process data with a grid-like topology, such as images. CNNs are particularly effective for image recognition and classification tasks due to their ability to automatically and adaptively learn spatial hierarchies of features from input data. The architecture of a CNN typically includes layers such as convolutional layers, pooling layers, and fully connected layers. \n\n1. **Convolutional Layers**: These layers apply a convolution operation to the input, passing the result to the next layer. This operation helps in detecting features such as edges, textures, and patterns in the image.\n\n2. **Pooling Layers**: These layers reduce the spatial size of the representation, which decreases the number of parameters and computation in the network, and also helps control overfitting.\n\n3. **Fully Connect
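
The `standard_terms_used` field is self-reported by the model, so it is worth checking it against the glossary and the answer text before trusting it. A minimal sketch of such a check follows, assuming a hypothetical `check_reported_terms` helper and plain strings in place of the Pydantic object; the sample `answer` and `reported` values are made up for illustration.

```python
from typing import Dict, Iterable, List


def check_reported_terms(answer: str, reported: Iterable[str],
                         glossary_terms: Iterable[str]) -> Dict[str, List[str]]:
    """Split the reported terms into verified, not-in-glossary, and not-in-answer."""
    known = {t.lower() for t in glossary_terms}
    answer_lc = answer.lower()
    result: Dict[str, List[str]] = {
        "verified": [],
        "not_in_glossary": [],
        "not_in_answer": [],
    }
    for term in reported:
        if term.lower() not in known:
            result["not_in_glossary"].append(term)
        elif term.lower() not in answer_lc:
            result["not_in_answer"].append(term)
        else:
            result["verified"].append(term)
    return result


# Illustrative inputs (not taken from a real model run)
answer = "A Convolutional Neural Network (CNN) is a type of deep learning model."
reported = ["Convolutional Neural Network", "Image Recognition", "Conv Net"]
glossary_terms = ["Convolutional Neural Network", "Deep Learning", "Image Recognition"]

report = check_reported_terms(answer, reported, glossary_terms)
print(report)
```

Terms that land in `not_in_glossary` or `not_in_answer` are candidates for a retry or for review, depending on how strict the pipeline needs to be.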

### 5.2 Post-Generation Processing: Validation and Content Enhancement

In [16]:
# Assume we have a simplified terminology glossary
glossary = {
    "Convolutional Neural Network": "A specialized deep learning model that excels at processing image data.",
    "Transformer Architecture": "A neural network architecture based on self-attention that has achieved major success in NLP.",
    "Large Language Model": "A language model trained on massive datasets with a very large number of parameters."
}

# Assume this is the LLM-generated, already-validated answer text
llm_answer_text = (
    "Large Language Models are typically based on the Transformer Architecture, "
    "while Convolutional Neural Networks are the dominant approach for image processing."
)

import html
import re

def enhance_text_with_definitions(text: str, term_glossary: dict) -> str:
    """
    Find glossary terms in the text and wrap each one in an HTML <abbr> tag.
    In an HTML-enabled Markdown renderer, the definition appears as a tooltip on hover.
    """
    if not term_glossary:
        return text
    # Match all terms in one pass, longest first, so that text inside an
    # inserted definition is never matched and wrapped a second time
    terms = sorted(term_glossary, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(term) for term in terms))

    def annotate(match):
        term = match.group(0)
        # Escape the definition so quotes or angle brackets cannot break the attribute
        return f'<abbr title="{html.escape(term_glossary[term])}">{term}</abbr>'

    return pattern.sub(annotate, text)

# Perform content enhancement
final_output = enhance_text_with_definitions(llm_answer_text, glossary)

print("\n--- Enhanced output ---")
print(final_output)


--- Enhanced output ---
<abbr title="A language model trained on massive datasets with a very large number of parameters.">Large Language Model</abbr>s are typically based on the <abbr title="A neural network architecture based on self-attention that has achieved major success in NLP.">Transformer Architecture</abbr>, while <abbr title="A specialized deep learning model that excels at processing image data.">Convolutional Neural Network</abbr>s are the dominant approach for image processing.


## 6. Evaluation and Feedback Mechanisms

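Manual sampling cannot cover large volumes of generated content, and LLM-as-a-Judge offers an efficient, scalable, automated alternative. The core idea is to use one LLM's comprehension and reasoning ability to evaluate the output quality of another LLM (or of the entire RAG system).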

Building the evaluation chain with LCEL

In [17]:
import os
from typing import List
from pydantic import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser
from langchain_openai import ChatOpenAI

# --- 1. Preparation ---
# os.environ["OPENAI_API_KEY"] = "sk-..."
evaluator_llm = ChatOpenAI(temperature=0, model="gpt-4o")

# --- 2. Define a structured evaluation model ---
class TerminologyEvaluation(BaseModel):
    """A structured model for evaluating terminology consistency."""
    consistency_score: int = Field(description="A score from 1 to 5, where 5 means fully consistent and 1 means severely inconsistent.")
    is_consistent: bool = Field(description="Boolean indicating whether the answer is overall compliant with terminology standards.")
    reasoning: str = Field(description="A detailed explanation of the score, highlighting strengths and issues.")
    suggestions_for_improvement: List[str] = Field(description="Concrete suggestions to improve terminology usage in the answer.")

# --- 3. Build the evaluation chain ---
# Create a dedicated evaluation prompt
evaluation_prompt_template = """
You are a strict technical documentation quality evaluator for AI content. Your core task is to evaluate the consistency and correctness of terminology usage in a given answer.

**Evaluation Criteria:**
1.  **Accuracy**: Are standard terms used correctly?
2.  **Compliance**: Does the answer avoid unofficial or ambiguous aliases?
3.  **Completeness**: Does the answer use the most appropriate standard terms when needed?

**Authoritative Terminology Glossary (partial):**
- Convolutional Neural Network (alias: CNN)
- Transformer Model (alias: Transformer)
- Large Language Model (alias: LLM)

**Answer to Evaluate:**
{answer_text}

Based on the criteria and glossary above, output your evaluation result in JSON format.
"""

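prompt = ChatPromptTemplate.from_template(evaluation_prompt_template)
# Note: pydantic_object only makes the schema available through
# parser.get_format_instructions(). The prompt above never includes those
# instructions, so the model is free to choose its own JSON keys.
parser = JsonOutputParser(pydantic_object=TerminologyEvaluation)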

# Build the evaluation chain using LCEL
evaluation_chain = prompt | evaluator_llm | parser

# --- 4. Run evaluations ---

# Case 1: A terminology-compliant answer
good_answer = "A Large Language Model (LLM) is built on a Transformer Model, while a Convolutional Neural Network (CNN) is widely used in image-related domains."

# Case 2: An answer using non-standard terminology
bad_answer = "A big model is based on a transformer-style architecture, and a conv net is very strong at picture processing."

print("--- Evaluating [Good Answer] ---")
good_evaluation_result = evaluation_chain.invoke({"answer_text": good_answer})
print(good_evaluation_result)

print("\n--- Evaluating [Needs Improvement Answer] ---")
bad_evaluation_result = evaluation_chain.invoke({"answer_text": bad_answer})
print(bad_evaluation_result)


--- Evaluating [Good Answer] ---
{'Accuracy': "The terms 'Large Language Model (LLM)', 'Transformer Model', and 'Convolutional Neural Network (CNN)' are used correctly according to the glossary.", 'Compliance': 'The answer complies with the glossary by using the official terms and their aliases without introducing unofficial or ambiguous aliases.', 'Completeness': 'The answer uses the most appropriate standard terms as per the glossary, and all necessary terms are included.'}

--- Evaluating [Needs Improvement Answer] ---
{'Accuracy': {'status': 'Fail', 'issues': [{'term': 'big model', 'description': "The term 'big model' is not a standard term. The correct term is 'Large Language Model' or 'LLM' if applicable."}, {'term': 'transformer-style architecture', 'description': "The term 'transformer-style architecture' is not a standard term. The correct term is 'Transformer Model' or 'Transformer'."}, {'term': 'conv net', 'description': "The term 'conv net' is not a standard term. The corre
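
The dictionaries recorded above use keys such as `Accuracy` and `Compliance` rather than the fields declared in `TerminologyEvaluation`, so downstream code should not assume the schema was followed. The sketch below shows one way to check a judge result before using it. `validate_evaluation` is a hypothetical helper written with the standard library only, and both sample dictionaries are illustrative; inside the notebook, `TerminologyEvaluation.model_validate(result)` would be the more direct route.

```python
from typing import Any, Dict, List

# Expected field names and types, mirroring the TerminologyEvaluation model
EXPECTED_FIELDS = {
    "consistency_score": int,
    "is_consistent": bool,
    "reasoning": str,
    "suggestions_for_improvement": list,
}


def validate_evaluation(result: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the result matches the schema."""
    problems: List[str] = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in result:
            problems.append(f"missing field: {name}")
        elif type(result[name]) is not expected_type:
            # Exact type check, so a bool is not accepted where an int is expected
            problems.append(f"wrong type for {name}: {type(result[name]).__name__}")
    score = result.get("consistency_score")
    if type(score) is int and not 1 <= score <= 5:
        problems.append("consistency_score out of range 1-5")
    return problems


# Same top-level keys as the recorded output, with placeholder values
off_schema = {"Accuracy": "...", "Compliance": "...", "Completeness": "..."}
# Hypothetical result that follows the declared schema
on_schema = {
    "consistency_score": 5,
    "is_consistent": True,
    "reasoning": "All standard terms are used correctly.",
    "suggestions_for_improvement": [],
}
print(validate_evaluation(off_schema))
print(validate_evaluation(on_schema))
```

A non-empty problem list is a signal to retry the judge call or to add the parser's format instructions to the prompt, so that scores from different runs stay comparable.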

## Summary: A Roadmap for Terminology Consistency Optimization

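Across the stages above, from data preprocessing to final evaluation and feedback, we have assembled a complete technical stack for safeguarding terminology consistency.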

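To make the role and priority of each technique easier to see, the overall optimization strategy is summarized in the tiered roadmap below, as practical guidance for RAG systems at different stages of maturity.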

| Optimization Layer | Core Techniques & Solutions |
|--------------------|-----------------------------|
| Foundation | Terminology glossary construction, term extraction, preprocessing standardization, term embeddings and vector indexing |
| Key Enhancement | Hybrid retrieval (BM25 + vectors), query expansion (MultiQuery), hypothetical document embeddings (HyDE), cross-encoder re-ranking |
| Auxiliary Optimization | Domain-specific embedding fine-tuning, context-aware chunking |
| Generation Control | Prompt engineering, structured outputs, output parsing and repair |
| Long-term Assurance | LLM-as-a-Judge, user feedback loops, logging, auditing, and analytics |
