# Proposition Chunking for Enhanced RAG

In this notebook, I implement proposition chunking - an advanced technique to break down documents into atomic, factual statements for more accurate retrieval. Unlike traditional chunking that simply divides text by character count, proposition chunking preserves the semantic integrity of individual facts.

Proposition chunking delivers more precise retrieval by:

1. Breaking content into atomic, self-contained facts
2. Creating smaller, more granular units for retrieval  
3. Enabling more precise matching between queries and relevant content
4. Filtering out low-quality or incomplete propositions

Let's build a complete implementation without relying on LangChain or FAISS.


## Setting Up the Environment
We begin by importing necessary libraries.

In [1]:
pip install PymuPdf

Collecting PymuPdf
  Downloading pymupdf-1.26.1-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.1-cp39-abi3-manylinux_2_28_x86_64.whl (24.1 MB)
Installing collected packages: PymuPdf
Successfully installed PymuPdf-1.26.1


In [2]:
import os
import numpy as np
import json
import fitz
from openai import OpenAI
import re

## Extracting Text from a PDF File
To implement RAG, we first need a source of textual data. In this case, we extract text from a PDF file using the PyMuPDF library.

In [3]:
def extract_text_from_pdf(pdf_path):
    """
    Extracts the text of every page in a PDF file and returns it as one string.

    Args:
    pdf_path (str): Path to the PDF file.

    Returns:
    str: Extracted text from the PDF.
    """
    # Open the PDF file
    mypdf = fitz.open(pdf_path)
    all_text = ""  # Initialize an empty string to store the extracted text

    # Iterate through each page in the PDF
    for page_num in range(mypdf.page_count):
        page = mypdf[page_num]  # Get the page
        text = page.get_text("text")  # Extract text from the page
        all_text += text  # Append the extracted text to the all_text string

    return all_text  # Return the extracted text

## Chunking the Extracted Text
Once we have the extracted text, we divide it into smaller, overlapping chunks to improve retrieval accuracy.

In [4]:
def chunk_text(text, chunk_size=800, overlap=100):
    """
    Split text into overlapping chunks.

    Args:
        text (str): Input text to chunk
        chunk_size (int): Size of each chunk in characters
        overlap (int): Overlap between chunks in characters

    Returns:
        List[Dict]: List of chunk dictionaries with text and metadata
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    chunks = []  # Initialize an empty list to store the chunks

    # Iterate over the text in steps of (chunk_size - overlap) characters
    for i in range(0, len(text), chunk_size - overlap):
        chunk = text[i:i + chunk_size]  # Extract a chunk of the specified size
        if chunk:  # Ensure we don't add empty chunks
            chunks.append({
                "text": chunk,  # The chunk text
                "chunk_id": len(chunks) + 1,  # Unique ID for the chunk
                "start_char": i,  # Starting character index of the chunk
                "end_char": i + len(chunk)  # Ending character index of the chunk
            })

    print(f"Created {len(chunks)} text chunks")  # Print the number of created chunks
    return chunks  # Return the list of chunks
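
To see where the window boundaries fall, the sketch below reproduces only the index arithmetic of `chunk_text` for a hypothetical 1,000-character input. The helper name `chunk_spans` is an assumption made for illustration and is not part of the notebook.

```python
def chunk_spans(text_length, chunk_size=800, overlap=100):
    """Return (start, end) character offsets for each chunk (hypothetical helper)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # distance between consecutive chunk starts
    return [
        (start, min(start + chunk_size, text_length))
        for start in range(0, text_length, step)
    ]


spans = chunk_spans(1000)
print(spans)  # [(0, 800), (700, 1000)]
```

Consecutive chunks share 100 characters, and the final chunk is shorter because it is cut off at the end of the text.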

## Setting Up the OpenAI API Client
We initialize the OpenAI client to generate embeddings and responses.

In [5]:
client = OpenAI(
    base_url="http://xxxxx/v1/",
    api_key=os.getenv("OPENAI_API_KEY")  # Retrieve the API key from environment variables
)

## Simple Vector Store Implementation
We'll create a basic vector store to manage document chunks and their embeddings.

In [6]:
class SimpleVectorStore:
    """
    A simple vector store implementation using NumPy.
    """
    def __init__(self):
        # Initialize lists to store vectors, texts, and metadata
        self.vectors = []
        self.texts = []
        self.metadata = []

    def add_item(self, text, embedding, metadata=None):
        """
        Add an item to the vector store.

        Args:
            text (str): The text content
            embedding (List[float]): The embedding vector
            metadata (Dict, optional): Additional metadata
        """
        # Append the embedding, text, and metadata to their respective lists
        self.vectors.append(np.array(embedding))
        self.texts.append(text)
        self.metadata.append(metadata or {})

    def add_items(self, texts, embeddings, metadata_list=None):
        """
        Add multiple items to the vector store.

        Args:
            texts (List[str]): List of text contents
            embeddings (List[List[float]]): List of embedding vectors
            metadata_list (List[Dict], optional): List of metadata dictionaries
        """
        # If no metadata list is provided, create an empty dictionary for each text
        if metadata_list is None:
            metadata_list = [{} for _ in range(len(texts))]

        # Add each text, embedding, and metadata to the store
        for text, embedding, metadata in zip(texts, embeddings, metadata_list):
            self.add_item(text, embedding, metadata)

    def similarity_search(self, query_embedding, k=5):
        """
        Find the most similar items to a query embedding.

        Args:
            query_embedding (List[float]): Query embedding vector
            k (int): Number of results to return

        Returns:
            List[Dict]: Top k most similar items
        """
        # Return an empty list if there are no vectors in the store
        if not self.vectors:
            return []

        # Convert query embedding to a numpy array
        query_vector = np.array(query_embedding)

        # Calculate similarities using cosine similarity
        similarities = []
        for i, vector in enumerate(self.vectors):
            similarity = np.dot(query_vector, vector) / (np.linalg.norm(query_vector) * np.linalg.norm(vector))
            similarities.append((i, similarity))

        # Sort by similarity in descending order
        similarities.sort(key=lambda x: x[1], reverse=True)

        # Collect the top k results
        results = []
        for i in range(min(k, len(similarities))):
            idx, score = similarities[i]
            results.append({
                "text": self.texts[idx],
                "metadata": self.metadata[idx],
                "similarity": float(score)  # Convert to float for JSON serialization
            })

        return results

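Here is a detailed walkthrough of the `SimpleVectorStore` class:

## Overall Purpose

This is a **simple vector store implementation** that keeps texts alongside their vector representations and supports similarity search. It is a core component of a RAG (Retrieval-Augmented Generation) system.

---

## Class Structure

### 1. **Initializer `__init__`**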
```python
def __init__(self):
    self.vectors = []      # Stores the embedding vectors
    self.texts = []        # Stores the texts
    self.metadata = []     # Stores the metadata
```

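**What it does:**
- Creates three lists that hold the vectors, texts, and metadata separately
- Uses plain lists, which is simple but inefficient (production systems usually use a database)

---

### 2. **Adding a Single Item: `add_item`**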
```python
def add_item(self, text, embedding, metadata=None):
    self.vectors.append(np.array(embedding))
    self.texts.append(text)
    self.metadata.append(metadata or {})
```

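**What it does:**
- Adds a text, its vector, and its metadata to the store
- Converts the vector to a NumPy array for consistency
- Falls back to an empty dictionary when no metadata is given

**Parameters:**
- `text`: the text content
- `embedding`: the vector representation of the text
- `metadata`: optional metadata (document source, timestamp, and so on)

---

### 3. **Adding Items in Bulk: `add_items`**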
```python
def add_items(self, texts, embeddings, metadata_list=None):
    if metadata_list is None:
        metadata_list = [{} for _ in range(len(texts))]
    
    for text, embedding, metadata in zip(texts, embeddings, metadata_list):
        self.add_item(text, embedding, metadata)
```

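**What it does:**
- Adds several texts and their vectors at once
- Creates empty metadata for each text when no metadata list is provided
- Uses `zip` to walk the three lists in lockstep (note that `zip` stops at the shortest list, so mismatched lengths are silently truncated)

---

### 4. **Similarity Search: `similarity_search`**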
```python
def similarity_search(self, query_embedding, k=5):
    if not self.vectors:
        return []
    
    query_vector = np.array(query_embedding)
    
    # Compute cosine similarity against every stored vector
    similarities = []
    for i, vector in enumerate(self.vectors):
        similarity = np.dot(query_vector, vector) / (np.linalg.norm(query_vector) * np.linalg.norm(vector))
        similarities.append((i, similarity))
    
    # Sort by similarity, highest first
    similarities.sort(key=lambda x: x[1], reverse=True)
    
    # Return the top k results
    results = []
    for i in range(min(k, len(similarities))):
        idx, score = similarities[i]
        results.append({
            "text": self.texts[idx],
            "metadata": self.metadata[idx],
            "similarity": float(score)
        })
    
    return results
```

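**What it does:**
- Finds the k texts most similar to the query vector
- Uses **cosine similarity** to compare vectors
- Returns the results sorted by score, each with its text, metadata, and similarity

---

## Core Algorithm: Cosine Similarity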

```python
similarity = np.dot(query_vector, vector) / (np.linalg.norm(query_vector) * np.linalg.norm(vector))
```

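**The math:**
- Cosine similarity = dot product / (norm of vector A × norm of vector B)
- The result lies in [-1, 1]: 1 means the vectors point in the same direction, 0 means they are orthogonal (unrelated), and -1 means they point in opposite directions
- It suits semantic similarity because it compares direction rather than magnitude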
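
The three reference values can be checked with a small pure-Python sketch. The function `cosine_similarity` below is a hypothetical standalone helper, not the notebook's method, and its zero-vector guard is an assumption that `similarity_search` itself does not include.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (pure-Python sketch)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # exactly 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])      # close to -1
print(same_direction, orthogonal, opposite)
```

Scaling a vector does not change the score, which is why `[1.0, 2.0]` and `[2.0, 4.0]` count as maximally similar.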

---

## Usage Example

```python
# Create the vector store
vector_store = SimpleVectorStore()

# Add documents (toy 3-dimensional embeddings for illustration)
texts = ["Machine learning is fun", "Deep learning is a subset of AI", "Natural language processing is important"]
embeddings = [[0.1, 0.2, 0.3], [0.2, 0.3, 0.4], [0.3, 0.4, 0.5]]
metadata = [{"source": "doc1"}, {"source": "doc2"}, {"source": "doc3"}]

vector_store.add_items(texts, embeddings, metadata)

# Search for similar documents
query_embedding = [0.15, 0.25, 0.35]
results = vector_store.similarity_search(query_embedding, k=2)

# Print the results
for result in results:
    print(f"Text: {result['text']}")
    print(f"Similarity: {result['similarity']:.3f}")
    print(f"Metadata: {result['metadata']}")
    print("---")
```

---

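## Strengths and Weaknesses

### Strengths:
1. **Easy to follow**: the code is clearly structured and simple to understand
2. **Functionally complete**: covers the basics of adding and searching
3. **Flexible**: supports metadata storage

### Weaknesses:
1. **Performance**: every search scans all stored vectors
2. **Memory use**: all data lives in memory
3. **Poor scalability**: not suited to large datasets
4. **No optimization**: no index or approximate search

---

## Improvements for Real-World Use

A production RAG system would typically use:
- **A vector database** such as Pinecone, Weaviate, or Qdrant
- **Approximate nearest-neighbor search** with algorithms such as HNSW or IVF
- **Index optimization** to speed up search
- **Persistent storage** to support large datasets

`SimpleVectorStore` is a good teaching example that shows the basic principles of a vector store.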

## Creating Embeddings

In [7]:
def create_embeddings(texts, model="text-embedding-ada-002"):
    """
    Create embeddings for the given texts.

    Args:
        texts (str or List[str]): Input text(s)
        model (str): Embedding model name

    Returns:
        List[List[float]]: Embedding vector(s)
    """
    # Handle both string and list inputs
    input_texts = texts if isinstance(texts, list) else [texts]

    # Process in batches if needed (OpenAI API limits)
    batch_size = 100
    all_embeddings = []

    # Iterate over the input texts in batches
    for i in range(0, len(input_texts), batch_size):
        batch = input_texts[i:i + batch_size]  # Get the current batch of texts

        # Create embeddings for the current batch
        response = client.embeddings.create(
            model=model,
            input=batch
        )

        # Extract embeddings from the response
        batch_embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(batch_embeddings)  # Add the batch embeddings to the list

    # If input was a single string, return just the first embedding
    if isinstance(texts, str):
        return all_embeddings[0]

    # Otherwise, return all embeddings
    return all_embeddings
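
The batching loop can be looked at in isolation. The sketch below shows how 250 inputs would be split with the same slicing pattern; `iter_batches` is a hypothetical helper name, and no API call is made.

```python
def iter_batches(items, batch_size=100):
    """Yield consecutive slices of at most batch_size items (hypothetical helper)."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = [f"proposition {n}" for n in range(250)]
batch_lengths = [len(batch) for batch in iter_batches(texts)]
print(batch_lengths)  # [100, 100, 50]
```

Each batch would correspond to one embeddings request, so 250 texts need three requests rather than 250.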

## Proposition Generation

In [9]:
def generate_propositions(chunk):
    """
    Generate atomic, self-contained propositions from a text chunk.

    Args:
        chunk (Dict): Text chunk with content and metadata

    Returns:
        List[str]: List of generated propositions
    """
    # System prompt to instruct the AI on how to generate propositions
    system_prompt = """Please break down the following text into simple, self-contained propositions.
    Ensure that each proposition meets the following criteria:

    1. Express a Single Fact: Each proposition should state one specific fact or claim.
    2. Be Understandable Without Context: The proposition should be self-contained, meaning it can be understood without needing additional context.
    3. Use Full Names, Not Pronouns: Avoid pronouns or ambiguous references; use full entity names.
    4. Include Relevant Dates/Qualifiers: If applicable, include necessary dates, times, and qualifiers to make the fact precise.
    5. Contain One Subject-Predicate Relationship: Focus on a single subject and its corresponding action or attribute, without conjunctions or multiple clauses.

    Output ONLY the list of propositions without any additional text or explanations."""

    # User prompt containing the text chunk to be converted into propositions
    user_prompt = f"Text to convert into propositions:\n\n{chunk['text']}"

    # Generate response from the model
    response = client.chat.completions.create(
        model="o1",  # Using a stronger model for accurate proposition generation
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0
    )

    # Extract propositions from the response
    raw_propositions = response.choices[0].message.content.strip().split('\n')

    # Clean up propositions (remove numbering, bullets, etc.)
    clean_propositions = []
    for prop in raw_propositions:
        # Remove numbering (1., 2., etc.) and bullet points
        cleaned = re.sub(r'^\s*(\d+\.|\-|\*)\s*', '', prop).strip()
        if cleaned and len(cleaned) > 10:  # Simple filter for empty or very short propositions
            clean_propositions.append(cleaned)

    return clean_propositions

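Here is a detailed walkthrough of the `generate_propositions` function:

## Overall Purpose

This function **breaks a text chunk into atomic, self-contained propositions**. It is an advanced chunking technique for RAG systems that goes beyond splitting by length.

---

## Code Structure

### 1. **Function Definition and Docstring**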
```python
def generate_propositions(chunk):
    """
    Generate atomic, self-contained propositions from a text chunk.
    
    Args:
        chunk (Dict): Text chunk with content and metadata
        
    Returns:
        List[str]: List of generated propositions
    """
```

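**Summary:**
- Input: a dictionary containing the text content and metadata
- Output: a list of generated propositions
- Goal: break complex text into simple, independent propositions

---

### 2. **System Prompt**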
```python
system_prompt = """Please break down the following text into simple, self-contained propositions.
    Ensure that each proposition meets the following criteria:

    1. Express a Single Fact: Each proposition should state one specific fact or claim.
    2. Be Understandable Without Context: The proposition should be self-contained, meaning it can be understood without needing additional context.
    3. Use Full Names, Not Pronouns: Avoid pronouns or ambiguous references; use full entity names.
    4. Include Relevant Dates/Qualifiers: If applicable, include necessary dates, times, and qualifiers to make the fact precise.
    5. Contain One Subject-Predicate Relationship: Focus on a single subject and its corresponding action or attribute, without conjunctions or multiple clauses.

    Output ONLY the list of propositions without any additional text or explanations."""
```

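**Detailed breakdown:**

#### The five core criteria:

1. **Express a single fact**: each proposition states exactly one fact or claim
   - Example: ❌ "Apple released the iPhone 15, and it is expensive"
   - ✅ "Apple released the iPhone 15" and "The iPhone 15 is expensive"

2. **Be understandable without context**: the proposition must be self-contained
   - Example: ❌ "It is expensive" (the reader has to know what "it" refers to)
   - ✅ "The iPhone 15 is expensive"

3. **Use full names, not pronouns**: avoid ambiguous references
   - Example: ❌ "He invented the telephone"
   - ✅ "Alexander Graham Bell invented the telephone"

4. **Include relevant dates and qualifiers**: make the fact precise
   - Example: ✅ "Apple released the iPhone 15 in September 2023"

5. **Contain one subject-predicate relationship**: avoid complex clauses
   - Example: ❌ "Although it is expensive, the iPhone 15 is popular"
   - ✅ "The iPhone 15 is expensive" and "The iPhone 15 is popular"

---

### 3. **User Prompt**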
```python
user_prompt = f"Text to convert into propositions:\n\n{chunk['text']}"
```

**What it does:**
- Inserts the text chunk to be processed into the prompt
- Uses an f-string so the chunk text is passed through verbatim

---

### 4. **Calling the Language Model**
```python
response = client.chat.completions.create(
    model="o1",  # Using a stronger model for accurate proposition generation
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ],
    temperature=0
)
```

**Parameters:**
- **Model**: `o1`, the same model used in the notebook cell above
- **Message format**: the standard chat-completions format, with a system prompt and a user prompt
- **temperature=0**: reduces randomness so the output is more consistent

---

### 5. **Extracting and Cleaning the Propositions**
```python
raw_propositions = response.choices[0].message.content.strip().split('\n')

clean_propositions = []
for prop in raw_propositions:
    # Remove numbering (1., 2., etc.) and bullet points
    cleaned = re.sub(r'^\s*(\d+\.|\-|\*)\s*', '', prop).strip()
    if cleaned and len(cleaned) > 10:  # Simple filter for empty or very short propositions
        clean_propositions.append(cleaned)
```

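**Processing steps:**

1. **Extract the raw response**: read the content of the model output
2. **Split by line**: split the response on newlines into a list
3. **Clean with a regular expression**:
   - `r'^\s*(\d+\.|\-|\*)\s*'` matches a number, hyphen, or asterisk at the start of a line
   - These formatting markers are removed
4. **Filter for quality**:
   - Drop empty strings
   - Drop propositions that are too short (10 characters or fewer)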
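
The cleaning steps above can be run on their own without calling a model. The sketch below applies the same regular expression and length filter to a hypothetical response string; `raw_response` is invented for illustration.

```python
import re

# Hypothetical model response mixing numbered, bulleted, and blank lines.
raw_response = """1. Alexander Graham Bell patented the telephone in 1876.
- PyMuPDF extracts text from PDF files.

* Too short
3. Cosine similarity ranges from -1 to 1."""

clean_propositions = []
for line in raw_response.strip().split("\n"):
    # Strip a leading "1.", "-" or "*" marker plus surrounding whitespace
    cleaned = re.sub(r"^\s*(\d+\.|\-|\*)\s*", "", line).strip()
    if cleaned and len(cleaned) > 10:  # keep only lines longer than 10 characters
        clean_propositions.append(cleaned)

for proposition in clean_propositions:
    print(proposition)
```

The blank line and the nine-character "Too short" entry are dropped, leaving three propositions.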

---

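## Usage Example

```python
# Example input (illustrative sample text)
chunk = {
    "text": "Apple released the iPhone 15 in September 2023. The phone uses the A17 Pro chip and is 20% faster than the previous generation. The iPhone 15 starts at 799 US dollars, 100 dollars more than the iPhone 14.",
    "metadata": {"source": "tech_news", "date": "2023-09-12"}
}

# Call the function
propositions = generate_propositions(chunk)

# Expected output (illustrative; actual model output will vary)
# [
#     "Apple released the iPhone 15 in September 2023",
#     "The iPhone 15 uses the A17 Pro chip",
#     "The iPhone 15 is 20% faster than the previous generation",
#     "The iPhone 15 starts at 799 US dollars",
#     "The iPhone 15 costs 100 dollars more than the iPhone 14"
# ]
```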

---

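## Role in a RAG System

### 1. **Higher Retrieval Precision**
- Breaks complex documents into atomic facts
- Matches user queries more precisely

### 2. **Better Generation Quality**
- Supplies more specific, more relevant context
- Can reduce hallucinations and inaccurate information

### 3. **Better Explainability**
- Each proposition is an independent, verifiable fact
- Makes it easier to trace where information came from

---

## Strengths and Weaknesses

### Strengths:
1. **Semantic integrity**: each fact stays intact
2. **Precise retrieval**: atomic propositions are easier to match
3. **Explainability**: each proposition is a standalone fact

### Weaknesses:
1. **High compute cost**: requires language model calls
2. **Depends on model quality**: output quality is bounded by the model's ability
3. **Possible loss of context**: over-decomposition can drop important connections between facts

---

## Practical Use Cases

1. **Legal document analysis**: break statutes into individual provisions
2. **Academic paper retrieval**: break papers into individual findings
3. **News fact extraction**: break articles into individual facts
4. **Technical documentation indexing**: break documentation into individual features

This function is an advanced text-preprocessing technique for RAG systems that can improve both retrieval and generation quality.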

## Quality Checking for Propositions

In [10]:
def evaluate_proposition(proposition, original_text):
    """
    Evaluate a proposition's quality based on accuracy, clarity, completeness, and conciseness.

    Args:
        proposition (str): The proposition to evaluate
        original_text (str): The original text for comparison

    Returns:
        Dict: Scores for each evaluation dimension
    """
    # System prompt to instruct the AI on how to evaluate the proposition
    system_prompt = """You are an expert at evaluating the quality of propositions extracted from text.
    Rate the given proposition on the following criteria (scale 1-10):

    - Accuracy: How well the proposition reflects information in the original text
    - Clarity: How easy it is to understand the proposition without additional context
    - Completeness: Whether the proposition includes necessary details (dates, qualifiers, etc.)
    - Conciseness: Whether the proposition is concise without losing important information

    The response must be in valid JSON format with numerical scores for each criterion:
    {"accuracy": X, "clarity": X, "completeness": X, "conciseness": X}
    """

    # User prompt containing the proposition and the original text
    user_prompt = f"""Proposition: {proposition}

    Original Text: {original_text}

    Please provide your evaluation scores in JSON format."""

    # Generate response from the model
    response = client.chat.completions.create(
        model="o1",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        response_format={"type": "json_object"},
        temperature=0
    )

    # Parse the JSON response
    try:
        scores = json.loads(response.choices[0].message.content.strip())
        return scores
    except json.JSONDecodeError:
        # Fallback if JSON parsing fails
        return {
            "accuracy": 5,
            "clarity": 5,
            "completeness": 5,
            "conciseness": 5
        }

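Here is a detailed walkthrough of the `evaluate_proposition` function:

## Overall Purpose

This function **evaluates the quality of a proposition** by scoring it along several dimensions. It is an important quality-control component of a RAG system.

---

## Code Structure

### 1. **Function Definition and Docstring**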
```python
def evaluate_proposition(proposition, original_text):
    """
    Evaluate a proposition's quality based on accuracy, clarity, completeness, and conciseness.
    
    Args:
        proposition (str): The proposition to evaluate
        original_text (str): The original text for comparison
        
    Returns:
        Dict: Scores for each evaluation dimension
    """
```

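**Summary:**
- Input: the proposition to evaluate and the original text
- Output: a dictionary with a score for each of the four dimensions
- Goal: assess proposition quality from several angles

---

### 2. **System Prompt**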
```python
system_prompt = """You are an expert at evaluating the quality of propositions extracted from text.
    Rate the given proposition on the following criteria (scale 1-10):

    - Accuracy: How well the proposition reflects information in the original text
    - Clarity: How easy it is to understand the proposition without additional context
    - Completeness: Whether the proposition includes necessary details (dates, qualifiers, etc.)
    - Conciseness: Whether the proposition is concise without losing important information

    The response must be in valid JSON format with numerical scores for each criterion:
    {"accuracy": X, "clarity": X, "completeness": X, "conciseness": X}
    """
```

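**The four evaluation dimensions in detail:**

#### 1. **Accuracy (1-10)**
- **Definition**: how faithfully the proposition reflects the information in the original text
- **Scoring guide**:
  - 10: fully accurate, with no errors or distortion
  - 5: mostly accurate, with minor errors
  - 1: inaccurate, containing wrong information

**Example:**
- Original text: "Apple released the iPhone 15 in September 2023"
- ✅ Accurate: "Apple released the iPhone 15 in September 2023" (10)
- ❌ Inaccurate: "Apple released the iPhone 15 in 2024" (1)

#### 2. **Clarity (1-10)**
- **Definition**: whether the proposition is easy to understand without extra context
- **Scoring guide**:
  - 10: very clear; anyone can understand it
  - 5: mostly clear, but may need some background knowledge
  - 1: very vague and hard to understand

**Example:**
- ✅ Clear: "Apple released the iPhone 15 in September 2023" (10)
- ❌ Vague: "It released it" (1)

#### 3. **Completeness (1-10)**
- **Definition**: whether the proposition includes the necessary details (dates, qualifiers, and so on)
- **Scoring guide**:
  - 10: includes every necessary detail
  - 5: includes the basic information but misses some details
  - 1: seriously incomplete

**Example:**
- ✅ Complete: "Apple released the iPhone 15 in September 2023" (10)
- ❌ Incomplete: "Apple released the iPhone" (5, the date is missing)

#### 4. **Conciseness (1-10)**
- **Definition**: whether the proposition is concise without losing important information
- **Scoring guide**:
  - 10: very concise, with no redundant information
  - 5: mostly concise, with some redundancy
  - 1: overly long, with unnecessary information

**Example:**
- ✅ Concise: "Apple released the iPhone 15 in September 2023" (10)
- ❌ Wordy: "Apple is a multinational technology company headquartered in Cupertino, California, USA, and the company released the iPhone 15 smartphone in September 2023" (5, contains unnecessary information)

---

### 3. **User Prompt**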
```python
user_prompt = f"""Proposition: {proposition}

    Original Text: {original_text}

    Please provide your evaluation scores in JSON format."""
```

**What it does:**
- Passes the proposition to evaluate and the original text to the model
- Asks the model to return its scores as JSON

---

### 4. **Calling the Language Model**
```python
response = client.chat.completions.create(
    model="o1",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ],
    response_format={"type": "json_object"},
    temperature=0
)
```

**Key parameters:**
- **response_format={"type": "json_object"}**: asks the model to return a JSON object
- **temperature=0**: reduces randomness so the output is more consistent
- **Model**: `o1`, the same model used in the notebook cell above

---

### 5. **Parsing the Response and Handling Errors**
```python
try:
    scores = json.loads(response.choices[0].message.content.strip())
    return scores
except json.JSONDecodeError:
    # Fallback if JSON parsing fails
    return {
        "accuracy": 5,
        "clarity": 5,
        "completeness": 5,
        "conciseness": 5
    }
```

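**Error handling:**
- Tries to parse the response as JSON
- Returns a default mid-range score (5) for every dimension if parsing fails
- Keeps the function from crashing on a parse error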
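
The fallback above covers only invalid JSON. A response that parses but is missing a key or holds a non-numeric value would pass through unchanged. The sketch below is one possible stricter parser, offered as an assumption rather than the notebook's method; `parse_scores` and `METRICS` are hypothetical names.

```python
import json

METRICS = ("accuracy", "clarity", "completeness", "conciseness")


def parse_scores(raw_content, default=5):
    """Parse a JSON score object, substituting `default` for anything unusable."""
    try:
        data = json.loads(raw_content)
    except (json.JSONDecodeError, TypeError):
        data = {}
    if not isinstance(data, dict):
        data = {}

    scores = {}
    for metric in METRICS:
        value = data.get(metric)
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            scores[metric] = min(10, max(1, value))  # clamp to the 1-10 scale
        else:
            scores[metric] = default
    return scores


print(parse_scores('{"accuracy": 9, "clarity": 12, "completeness": "high"}'))
# {'accuracy': 9, 'clarity': 10, 'completeness': 5, 'conciseness': 5}
print(parse_scores("not json"))
# {'accuracy': 5, 'clarity': 5, 'completeness': 5, 'conciseness': 5}
```

With the default thresholds of 7 used later in the pipeline, a fallback score of 5 means that a proposition whose evaluation cannot be parsed is filtered out.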

---

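## Usage Example

```python
# Example input
proposition = "Apple released the iPhone 15 in September 2023"
original_text = "Apple released the iPhone 15 on September 12, 2023, in Cupertino, California. The new phone uses the A17 Pro chip and is 20% faster than the previous generation."

# Call the function
scores = evaluate_proposition(proposition, original_text)

# Possible output (illustrative; actual scores will vary)
# {
#     "accuracy": 8,      # Mostly accurate, but omits the exact date
#     "clarity": 10,      # Very clear
#     "completeness": 7,  # Missing the exact date (September 12)
#     "conciseness": 10   # Very concise
# }
```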

---

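## Role in a RAG System

### 1. **Quality Control**
- Filters out low-quality propositions
- Helps ensure that retrieved information is accurate and reliable

### 2. **Ranking**
- Propositions can be ranked by quality score
- Higher-quality propositions can be shown first

### 3. **System Improvement**
- Monitors the quality of proposition generation
- Provides feedback for model tuning

---

## Scoring Scale

### Scoring guide:

| Score | Rating | Meaning |
|------|------|------|
| 9-10 | Excellent | Fully meets the criterion; very high quality |
| 7-8 | Good | Largely meets the criterion, with minor issues |
| 5-6 | Fair | Partly meets the criterion, with clear issues |
| 3-4 | Poor | Largely fails the criterion |
| 1-2 | Very poor | Completely fails the criterion |

---

## Strengths and Weaknesses

### Strengths:
1. **Multi-dimensional evaluation**: assesses quality from four different angles
2. **Standardized scores**: a 1-10 scale makes comparison easy
3. **Error handling**: falls back to default scores when JSON parsing fails
4. **JSON output**: easy to process programmatically

### Weaknesses:
1. **Subjectivity**: the scores remain somewhat subjective
2. **Compute cost**: requires a language model call per proposition
3. **Model dependence**: score quality depends on the model's ability

---

## Practical Use Cases

1. **Content review**: assess the quality of generated content
2. **Information filtering**: filter out low-quality information
3. **Quality monitoring**: monitor the quality of system output
4. **Model evaluation**: compare how different models perform

This function is an important quality-control tool in a RAG system and helps keep low-quality or inaccurate propositions out of the results shown to users.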

## Complete Proposition Processing Pipeline

In [11]:
def process_document_into_propositions(pdf_path, chunk_size=800, chunk_overlap=100,
                                      quality_thresholds=None):
    """
    Process a document into quality-checked propositions.

    Args:
        pdf_path (str): Path to the PDF file
        chunk_size (int): Size of each chunk in characters
        chunk_overlap (int): Overlap between chunks in characters
        quality_thresholds (Dict): Threshold scores for proposition quality

    Returns:
        Tuple[List[Dict], List[Dict]]: Original chunks and proposition chunks
    """
    # Set default quality thresholds if not provided
    if quality_thresholds is None:
        quality_thresholds = {
            "accuracy": 7,
            "clarity": 7,
            "completeness": 7,
            "conciseness": 7
        }

    # Extract text from the PDF file
    text = extract_text_from_pdf(pdf_path)

    # Create chunks from the extracted text
    chunks = chunk_text(text, chunk_size, chunk_overlap)

    # Initialize a list to store all propositions
    all_propositions = []

    print("Generating propositions from chunks...")
    for i, chunk in enumerate(chunks):
        print(f"Processing chunk {i+1}/{len(chunks)}...")

        # Generate propositions for the current chunk
        chunk_propositions = generate_propositions(chunk)
        print(f"Generated {len(chunk_propositions)} propositions")

        # Process each generated proposition
        for prop in chunk_propositions:
            proposition_data = {
                "text": prop,
                "source_chunk_id": chunk["chunk_id"],
                "source_text": chunk["text"]
            }
            all_propositions.append(proposition_data)

    # Evaluate the quality of the generated propositions
    print("\nEvaluating proposition quality...")
    quality_propositions = []

    for i, prop in enumerate(all_propositions):
        if i % 10 == 0:  # Status update every 10 propositions
            print(f"Evaluating proposition {i+1}/{len(all_propositions)}...")

        # Evaluate the quality of the current proposition
        scores = evaluate_proposition(prop["text"], prop["source_text"])
        prop["quality_scores"] = scores

        # Check if the proposition passes the quality thresholds
        passes_quality = True
        for metric, threshold in quality_thresholds.items():
            if scores.get(metric, 0) < threshold:
                passes_quality = False
                break

        if passes_quality:
            quality_propositions.append(prop)
        else:
            print(f"Proposition failed quality check: {prop['text'][:50]}...")

    print(f"\nRetained {len(quality_propositions)}/{len(all_propositions)} propositions after quality filtering")

    return chunks, quality_propositions

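Here is a detailed walkthrough of the `process_document_into_propositions` function:

## Overall Purpose

This is a **document-processing pipeline** that turns a PDF document into quality-checked propositions. It is the core preprocessing component of the RAG system, covering the full path from raw document to a quality-filtered knowledge base.

---

## Function Parameters

### 1. **Input Parameters**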
```python
def process_document_into_propositions(pdf_path, chunk_size=800, chunk_overlap=100,
                                      quality_thresholds=None):
```

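**Parameter details:**
- **pdf_path**: path to the PDF file
- **chunk_size=800**: size of each text chunk, in characters
- **chunk_overlap=100**: number of characters shared by consecutive chunks
- **quality_thresholds=None**: dictionary of quality thresholds

---

## Code Structure

### 1. **Setting the Quality Thresholds**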
```python
if quality_thresholds is None:
    quality_thresholds = {
        "accuracy": 7,
        "clarity": 7,
        "completeness": 7,
        "conciseness": 7
    }
```

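**What it does:**
- Sets default quality thresholds on the 1-10 scale
- Uses 7 for every dimension, a fairly strict standard
- Keeps only propositions that meet or exceed every threshold

---

### 2. **Text Extraction and Chunking**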
```python
# Extract text from the PDF file
text = extract_text_from_pdf(pdf_path)

# Create chunks from the extracted text
chunks = chunk_text(text, chunk_size, chunk_overlap)
```

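**Processing flow:**
1. **PDF text extraction**: pull plain text out of the PDF file
2. **Chunking**: split the long text into smaller pieces that are easier to process
3. **Overlap**: reduces the chance that information is lost at a chunk boundary

**Chunking example:**
```
Source text: 2,000 characters
chunk_size=800, chunk_overlap=100 (a new chunk starts every 700 characters)

Chunk 1: characters 0-799
Chunk 2: characters 700-1499  (overlaps chunk 1 by 100 characters)
Chunk 3: characters 1400-1999 (overlaps chunk 2 by 100 characters)
```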

---

### 3. **Proposition Generation Loop**
```python
all_propositions = []

print("Generating propositions from chunks...")
for i, chunk in enumerate(chunks):
    print(f"Processing chunk {i+1}/{len(chunks)}...")
    
    # Generate propositions for the current chunk
    chunk_propositions = generate_propositions(chunk)
    print(f"Generated {len(chunk_propositions)} propositions")
    
    # Process each generated proposition
    for prop in chunk_propositions:
        proposition_data = {
            "text": prop,
            "source_chunk_id": chunk["chunk_id"],
            "source_text": chunk["text"]
        }
        all_propositions.append(proposition_data)
```

**Step by step:**

#### Step 1: Iterate over the text chunks
- Show processing progress
- Call `generate_propositions` to produce propositions

#### Step 2: Build the proposition record
```python
proposition_data = {
    "text": prop,                    # The proposition text
    "source_chunk_id": chunk["chunk_id"],  # ID of the source chunk
    "source_text": chunk["text"]     # Original chunk text (used for quality evaluation)
}
```

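**Why this structure helps:**
- **Traceability**: every proposition records which chunk it came from
- **Quality evaluation**: the original text is kept for the accuracy check
- **Debugging**: makes it easier to locate problems and tune the system

---

### 4. **Quality Evaluation and Filtering**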
```python
print("\nEvaluating proposition quality...")
quality_propositions = []

for i, prop in enumerate(all_propositions):
    if i % 10 == 0:  # Status update every 10 propositions
        print(f"Evaluating proposition {i+1}/{len(all_propositions)}...")
        
    # Evaluate the quality of the current proposition
    scores = evaluate_proposition(prop["text"], prop["source_text"])
    prop["quality_scores"] = scores
    
    # Check if the proposition passes the quality thresholds
    passes_quality = True
    for metric, threshold in quality_thresholds.items():
        if scores.get(metric, 0) < threshold:
            passes_quality = False
            break
    
    if passes_quality:
        quality_propositions.append(prop)
    else:
        print(f"Proposition failed quality check: {prop['text'][:50]}...")
```

**Evaluation flow:**

#### Step 1: Evaluate each proposition
- Print progress every 10 propositions
- Call `evaluate_proposition` to score quality

#### Step 2: Check the thresholds
```python
passes_quality = True
for metric, threshold in quality_thresholds.items():
    if scores.get(metric, 0) < threshold:
        passes_quality = False
        break
```

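**Logic:**
- Every dimension must reach its threshold
- A proposition is dropped as soon as any one dimension falls short
- `scores.get(metric, 0)` treats a missing score as 0, so a missing metric fails the check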
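
The same rule can be written as a single `all(...)` expression. The sketch below is an equivalent formulation under the same assumptions (a missing metric counts as 0); `passes_thresholds` is a hypothetical helper, and the score dictionaries are made up.

```python
def passes_thresholds(scores, thresholds):
    """True only if every thresholded metric is present and high enough."""
    return all(
        scores.get(metric, 0) >= minimum
        for metric, minimum in thresholds.items()
    )


thresholds = {"accuracy": 7, "clarity": 7, "completeness": 7, "conciseness": 7}

good = {"accuracy": 9, "clarity": 8, "completeness": 7, "conciseness": 10}
weak = {"accuracy": 9, "clarity": 6, "completeness": 9, "conciseness": 9}
partial = {"accuracy": 9, "clarity": 9}  # two metrics missing

print(passes_thresholds(good, thresholds))     # True
print(passes_thresholds(weak, thresholds))     # False
print(passes_thresholds(partial, thresholds))  # False
```

Like the explicit loop with `break`, `all` stops at the first metric that fails.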

#### Step 3: Classify the result
- **Passed**: appended to `quality_propositions`
- **Failed**: a failure message is printed to help with debugging

---

### 5. **Summary and Return Value**
```python
print(f"\nRetained {len(quality_propositions)}/{len(all_propositions)} propositions after quality filtering")

return chunks, quality_propositions
```

**Returned data:**
- **chunks**: the original text chunks (for debugging and reference, and used later to build the chunk-based vector store)
- **quality_propositions**: the propositions that passed quality filtering

---

## End-to-End Example

```python
# Input: a PDF document
pdf_path = "document.pdf"

# Run the pipeline
chunks, propositions = process_document_into_propositions(
    pdf_path,
    chunk_size=800,
    chunk_overlap=100,
    quality_thresholds={
        "accuracy": 8,      # Stricter accuracy requirement
        "clarity": 7,
        "completeness": 6,  # Slightly looser completeness requirement
        "conciseness": 7
    }
)

# Inspect the output
print(f"Original chunks: {len(chunks)}")
print(f"Propositions retained: {len(propositions)}")

# Look at the first proposition
if propositions:
    first_prop = propositions[0]
    print(f"Proposition text: {first_prop['text']}")
    print(f"Quality scores: {first_prop['quality_scores']}")
    print(f"Source chunk ID: {first_prop['source_chunk_id']}")
```

---

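## Role in a RAG System

### 1. **Knowledge Base Construction**
- Converts unstructured documents into structured knowledge
- Supplies quality-filtered data to the vector store

### 2. **Quality Control**
- Helps keep the information in the knowledge base accurate and reliable
- Filters out low-quality or incorrect information

### 3. **Traceability**
- Every proposition can be traced back to its source chunk
- Makes verification and debugging easier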
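
One way to use this traceability is to map retrieved propositions back to the chunks they came from. The sketch below assumes search hits carry `source_chunk_id` in their metadata, as the notebook does when it builds the proposition store. `source_chunks_for_hits` and the sample data are hypothetical.

```python
def source_chunks_for_hits(hits, chunks):
    """Return the distinct source chunks behind proposition hits, in hit order."""
    chunks_by_id = {chunk["chunk_id"]: chunk for chunk in chunks}
    seen = set()
    sources = []
    for hit in hits:
        chunk_id = hit["metadata"]["source_chunk_id"]
        if chunk_id in chunks_by_id and chunk_id not in seen:
            seen.add(chunk_id)
            sources.append(chunks_by_id[chunk_id])
    return sources


# Hypothetical data shaped like the notebook's chunks and search results
chunks = [
    {"chunk_id": 1, "text": "First chunk of the document."},
    {"chunk_id": 2, "text": "Second chunk of the document."},
]
hits = [
    {"text": "Proposition A", "metadata": {"source_chunk_id": 2}, "similarity": 0.91},
    {"text": "Proposition B", "metadata": {"source_chunk_id": 1}, "similarity": 0.84},
    {"text": "Proposition C", "metadata": {"source_chunk_id": 2}, "similarity": 0.80},
]

for chunk in source_chunks_for_hits(hits, chunks):
    print(chunk["chunk_id"], chunk["text"])
```

Chunk 2 is listed once even though two hits point to it, so the surrounding context is not repeated.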

---

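## Performance Optimization Ideas

### 1. **Parallel Processing**
```python
# Possible improvement: process chunks in parallel
from concurrent.futures import ThreadPoolExecutor

def process_chunk_parallel(chunk):
    # Placeholder: process each chunk in parallel
    pass
```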
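
A minimal sketch of the parallel idea, assuming that proposition generation is dominated by waiting on network calls, which is where threads help. `fake_generate_propositions` is a stand-in that splits on periods instead of calling a model, and `generate_all_parallel` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_generate_propositions(chunk):
    """Stand-in for an LLM call: one 'proposition' per sentence (assumption)."""
    return [s.strip() + "." for s in chunk["text"].split(".") if s.strip()]


def generate_all_parallel(chunks, worker, max_workers=4):
    """Run `worker` on every chunk concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(worker, chunks))


chunks = [
    {"chunk_id": 1, "text": "Chunk one has a fact. It has another fact."},
    {"chunk_id": 2, "text": "Chunk two has a single fact."},
]
results = generate_all_parallel(chunks, fake_generate_propositions)
print([len(props) for props in results])  # [2, 1]
```

In the real pipeline the worker would be `generate_propositions`, and `max_workers` would likely need to respect the API's rate limits.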

### 2. **批量评估**
```python
# 批量评估多个命题
def batch_evaluate_propositions(propositions):
    # 减少API调用次数
    pass
```

### 3. **Caching**
```python
# Cache documents that have already been processed
def get_cached_propositions(pdf_path):
    # Avoid reprocessing the same document
    pass
```
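
A minimal sketch of such a cache, assuming results are JSON-serializable and keyed by a hash of the file contents. The extra `process_fn` and `cache_dir` parameters, the `file_fingerprint` helper, and the dummy file are illustrative additions, not part of the notebook.

```python
import hashlib
import json
import os
import tempfile

def file_fingerprint(path):
    """Hash the file contents so the cache is invalidated when the document changes."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()

def get_cached_propositions(pdf_path, process_fn, cache_dir):
    """Return cached results for `pdf_path`, or compute them with `process_fn` and store them."""
    os.makedirs(cache_dir, exist_ok=True)
    cache_file = os.path.join(cache_dir, file_fingerprint(pdf_path) + ".json")
    if os.path.exists(cache_file):
        with open(cache_file, "r", encoding="utf-8") as f:
            return json.load(f)
    result = process_fn(pdf_path)
    with open(cache_file, "w", encoding="utf-8") as f:
        json.dump(result, f)
    return result

# Demonstration with a dummy file and a dummy processing function (both hypothetical)
calls = []
def fake_process(path):
    calls.append(path)
    return {"chunks": [{"chunk_id": 0, "text": "demo"}], "propositions": [{"text": "Demo fact."}]}

with tempfile.TemporaryDirectory() as tmp:
    doc = os.path.join(tmp, "document.pdf")
    with open(doc, "wb") as f:
        f.write(b"not a real PDF, just bytes to hash")
    first = get_cached_propositions(doc, fake_process, os.path.join(tmp, "cache"))
    second = get_cached_propositions(doc, fake_process, os.path.join(tmp, "cache"))

print(len(calls))  # 1: the second call was served from the cache
```

Because `process_document_into_propositions` returns a tuple, a wrapper that packs the chunks and propositions into a dict (as `fake_process` does) would be needed before plugging it in. The cache key would also need to include parameters such as chunk size and thresholds if those vary between runs.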

---

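## Strengths and Weaknesses

### Strengths:
1. **Complete pipeline**: end-to-end processing from PDF to high-quality propositions
2. **Quality control**: multi-dimensional quality evaluation and filtering
3. **Traceability**: complete source information is preserved
4. **Configurability**: supports custom parameters and thresholds

### Weaknesses:
1. **High compute cost**: requires many language-model calls
2. **Long processing time**: especially for large documents
3. **Dependence on external services**: requires a stable API service

This function is the core document-preprocessing component of the RAG system, and it provides a high-quality knowledge base for the retrieval and generation steps that follow.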

## Building Vector Stores for Both Approaches

In [12]:
def build_vector_stores(chunks, propositions):
    """
    Build vector stores for both chunk-based and proposition-based approaches.

    Args:
        chunks (List[Dict]): Original document chunks
        propositions (List[Dict]): Quality-filtered propositions

    Returns:
        Tuple[SimpleVectorStore, SimpleVectorStore]: Chunk and proposition vector stores
    """
    # Create vector store for chunks
    chunk_store = SimpleVectorStore()

    # Extract chunk texts and create embeddings
    chunk_texts = [chunk["text"] for chunk in chunks]
    print(f"Creating embeddings for {len(chunk_texts)} chunks...")
    chunk_embeddings = create_embeddings(chunk_texts)

    # Add chunks to vector store with metadata
    chunk_metadata = [{"chunk_id": chunk["chunk_id"], "type": "chunk"} for chunk in chunks]
    chunk_store.add_items(chunk_texts, chunk_embeddings, chunk_metadata)

    # Create vector store for propositions
    prop_store = SimpleVectorStore()

    # Extract proposition texts and create embeddings
    prop_texts = [prop["text"] for prop in propositions]
    print(f"Creating embeddings for {len(prop_texts)} propositions...")
    prop_embeddings = create_embeddings(prop_texts)

    # Add propositions to vector store with metadata
    prop_metadata = [
        {
            "type": "proposition",
            "source_chunk_id": prop["source_chunk_id"],
            "quality_scores": prop["quality_scores"]
        }
        for prop in propositions
    ]
    prop_store.add_items(prop_texts, prop_embeddings, prop_metadata)

    return chunk_store, prop_store

## Query and Retrieval Functions

In [13]:
def retrieve_from_store(query, vector_store, k=5):
    """
    Retrieve relevant items from a vector store based on query.

    Args:
        query (str): User query
        vector_store (SimpleVectorStore): Vector store to search
        k (int): Number of results to retrieve

    Returns:
        List[Dict]: Retrieved items with scores and metadata
    """
    # Create query embedding
    query_embedding = create_embeddings(query)

    # Search vector store for the top k most similar items
    results = vector_store.similarity_search(query_embedding, k=k)

    return results
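
`SimpleVectorStore.similarity_search` is defined elsewhere in the notebook. As a reminder of what a top-k cosine search typically looks like, here is a standalone NumPy sketch. The function name `top_k_cosine` and the hand-picked 2-D vectors are illustrative only, and the store's actual implementation may differ.

```python
import numpy as np

def top_k_cosine(query_embedding, embeddings, k=5):
    """Return (index, similarity) pairs for the k rows most similar to the query."""
    query = np.asarray(query_embedding, dtype=float)
    matrix = np.asarray(embeddings, dtype=float)
    # Cosine similarity = dot product divided by the product of the vector norms
    sims = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    order = np.argsort(sims)[::-1][:k]  # indices sorted by descending similarity
    return [(int(i), float(sims[i])) for i in order]

# Toy 2-D "embeddings" chosen by hand for illustration
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = top_k_cosine([1.0, 0.1], embeddings, k=2)
print(results[0][0], results[1][0])  # 0 2
```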

In [14]:
def compare_retrieval_approaches(query, chunk_store, prop_store, k=5):
    """
    Compare chunk-based and proposition-based retrieval for a query.

    Args:
        query (str): User query
        chunk_store (SimpleVectorStore): Chunk-based vector store
        prop_store (SimpleVectorStore): Proposition-based vector store
        k (int): Number of results to retrieve from each store

    Returns:
        Dict: Comparison results
    """
    print(f"\n=== Query: {query} ===")

    # Retrieve results from the proposition-based vector store
    print("\nRetrieving with proposition-based approach...")
    prop_results = retrieve_from_store(query, prop_store, k)

    # Retrieve results from the chunk-based vector store
    print("Retrieving with chunk-based approach...")
    chunk_results = retrieve_from_store(query, chunk_store, k)

    # Display proposition-based results
    print("\n=== Proposition-Based Results ===")
    for i, result in enumerate(prop_results):
        print(f"{i+1}) {result['text']} (Score: {result['similarity']:.4f})")

    # Display chunk-based results
    print("\n=== Chunk-Based Results ===")
    for i, result in enumerate(chunk_results):
        # Truncate text to keep the output manageable
        truncated_text = (result['text'][:150] + "...") if len(result['text']) > 150 else result['text']
        print(f"{i+1}) {truncated_text} (Score: {result['similarity']:.4f})")

    # Return the comparison results
    return {
        "query": query,
        "proposition_results": prop_results,
        "chunk_results": chunk_results
    }

## Response Generation and Evaluation

In [15]:
def generate_response(query, results, result_type="proposition"):
    """
    Generate a response based on retrieved results.

    Args:
        query (str): User query
        results (List[Dict]): Retrieved items
        result_type (str): Type of results ('proposition' or 'chunk')

    Returns:
        str: Generated response
    """
    # Combine retrieved texts into a single context string
    context = "\n\n".join([result["text"] for result in results])

    # System prompt to instruct the AI on how to generate the response
    system_prompt = f"""You are an AI assistant answering questions based on retrieved information.
Your answer should be based on the following {result_type}s that were retrieved from a knowledge base.
If the retrieved information doesn't answer the question, acknowledge this limitation."""

    # User prompt containing the query and the retrieved context
    user_prompt = f"""Query: {query}

Retrieved {result_type}s:
{context}

Please answer the query based on the retrieved information."""

    # Generate the response using the OpenAI client
    response = client.chat.completions.create(
        model="o1",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.2
    )

    # Return the generated response text
    return response.choices[0].message.content

In [16]:
def evaluate_responses(query, prop_response, chunk_response, reference_answer=None):
    """
    Evaluate and compare responses from both approaches.

    Args:
        query (str): User query
        prop_response (str): Response from proposition-based approach
        chunk_response (str): Response from chunk-based approach
        reference_answer (str, optional): Reference answer for comparison

    Returns:
        str: Evaluation analysis
    """
    # System prompt to instruct the AI on how to evaluate the responses
    system_prompt = """You are an expert evaluator of information retrieval systems.
    Compare the two responses to the same query, one generated from proposition-based retrieval
    and the other from chunk-based retrieval.

    Evaluate them based on:
    1. Accuracy: Which response provides more factually correct information?
    2. Relevance: Which response better addresses the specific query?
    3. Conciseness: Which response is more concise while maintaining completeness?
    4. Clarity: Which response is easier to understand?

    Be specific about the strengths and weaknesses of each approach."""

    # User prompt containing the query and the responses to be compared
    user_prompt = f"""Query: {query}

    Response from Proposition-Based Retrieval:
    {prop_response}

    Response from Chunk-Based Retrieval:
    {chunk_response}"""

    # If a reference answer is provided, include it in the user prompt for factual checking
    if reference_answer:
        user_prompt += f"""

    Reference Answer (for factual checking):
    {reference_answer}"""

    # Add the final instruction to the user prompt
    user_prompt += """
    Please provide a detailed comparison of these two responses, highlighting which approach performed better and why."""

    # Generate the evaluation analysis using the OpenAI client
    response = client.chat.completions.create(
        model="o1",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0
    )

    # Return the generated evaluation analysis
    return response.choices[0].message.content

## Complete End-to-End Evaluation Pipeline

In [17]:
def run_proposition_chunking_evaluation(pdf_path, test_queries, reference_answers=None):
    """
    Run a complete evaluation of proposition chunking vs standard chunking.

    Args:
        pdf_path (str): Path to the PDF file
        test_queries (List[str]): List of test queries
        reference_answers (List[str], optional): Reference answers for queries

    Returns:
        Dict: Evaluation results
    """
    print("=== Starting Proposition Chunking Evaluation ===\n")

    # Process document into propositions and chunks
    chunks, propositions = process_document_into_propositions(pdf_path)

    # Build vector stores for chunks and propositions
    chunk_store, prop_store = build_vector_stores(chunks, propositions)

    # Initialize a list to store results for each query
    results = []

    # Run tests for each query
    for i, query in enumerate(test_queries):
        print(f"\n\n=== Testing Query {i+1}/{len(test_queries)} ===")
        print(f"Query: {query}")

        # Get retrieval results from both chunk-based and proposition-based approaches
        retrieval_results = compare_retrieval_approaches(query, chunk_store, prop_store)

        # Generate responses based on the retrieved proposition-based results
        print("\nGenerating response from proposition-based results...")
        prop_response = generate_response(
            query,
            retrieval_results["proposition_results"],
            "proposition"
        )

        # Generate responses based on the retrieved chunk-based results
        print("Generating response from chunk-based results...")
        chunk_response = generate_response(
            query,
            retrieval_results["chunk_results"],
            "chunk"
        )

        # Get reference answer if available
        reference = None
        if reference_answers and i < len(reference_answers):
            reference = reference_answers[i]

        # Evaluate the generated responses
        print("\nEvaluating responses...")
        evaluation = evaluate_responses(query, prop_response, chunk_response, reference)

        # Compile results for the current query
        query_result = {
            "query": query,
            "proposition_results": retrieval_results["proposition_results"],
            "chunk_results": retrieval_results["chunk_results"],
            "proposition_response": prop_response,
            "chunk_response": chunk_response,
            "reference_answer": reference,
            "evaluation": evaluation
        }

        # Append the results to the overall results list
        results.append(query_result)

        # Print the responses and evaluation for the current query
        print("\n=== Proposition-Based Response ===")
        print(prop_response)

        print("\n=== Chunk-Based Response ===")
        print(chunk_response)

        print("\n=== Evaluation ===")
        print(evaluation)

    # Generate overall analysis of the evaluation
    print("\n\n=== Generating Overall Analysis ===")
    overall_analysis = generate_overall_analysis(results)
    print("\n" + overall_analysis)

    # Return the evaluation results, overall analysis, and counts of propositions and chunks
    return {
        "results": results,
        "overall_analysis": overall_analysis,
        "proposition_count": len(propositions),
        "chunk_count": len(chunks)
    }

In [19]:
def generate_overall_analysis(results):
    """
    Generate an overall analysis of proposition vs chunk approaches.

    Args:
        results (List[Dict]): Results from each test query

    Returns:
        str: Overall analysis
    """
    # System prompt to instruct the AI on how to generate the overall analysis
    system_prompt = """You are an expert at evaluating information retrieval systems.
    Based on multiple test queries, provide an overall analysis comparing proposition-based retrieval
    to chunk-based retrieval for RAG (Retrieval-Augmented Generation) systems.

    Focus on:
    1. When proposition-based retrieval performs better
    2. When chunk-based retrieval performs better
    3. The overall strengths and weaknesses of each approach
    4. Recommendations for when to use each approach"""

    # Create a summary of evaluations for each query
    evaluations_summary = ""
    for i, result in enumerate(results):
        evaluations_summary += f"Query {i+1}: {result['query']}\n"
        evaluations_summary += f"Evaluation Summary: {result['evaluation'][:200]}...\n\n"

    # User prompt containing the summary of evaluations
    user_prompt = f"""Based on the following evaluations of proposition-based vs chunk-based retrieval across {len(results)} queries,
    provide an overall analysis comparing these two approaches:

    {evaluations_summary}

    Please provide a comprehensive analysis on the relative strengths and weaknesses of proposition-based
    and chunk-based retrieval for RAG systems."""

    # Generate the overall analysis using the OpenAI client
    response = client.chat.completions.create(
        model="o1",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0
    )

    # Return the generated analysis text
    return response.choices[0].message.content

## Evaluation of Proposition Chunking

In [20]:
# Path to the AI information document that will be processed
pdf_path = "AI_Information.pdf"

# Define test queries covering different aspects of AI to evaluate proposition chunking
test_queries = [
    "What are the main ethical concerns in AI development?",
    # "How does explainable AI improve trust in AI systems?",
    # "What are the key challenges in developing fair AI systems?",
    # "What role does human oversight play in AI safety?"
]

# Reference answers for more thorough evaluation and comparison of results
# These provide a ground truth to measure the quality of generated responses
reference_answers = [
    "The main ethical concerns in AI development include bias and fairness, privacy, transparency, accountability, safety, and the potential for misuse or harmful applications.",
    # "Explainable AI improves trust by making AI decision-making processes transparent and understandable to users, helping them verify fairness, identify potential biases, and better understand AI limitations.",
    # "Key challenges in developing fair AI systems include addressing data bias, ensuring diverse representation in training data, creating transparent algorithms, defining fairness across different contexts, and balancing competing fairness criteria.",
    # "Human oversight plays a critical role in AI safety by monitoring system behavior, verifying outputs, intervening when necessary, setting ethical boundaries, and ensuring AI systems remain aligned with human values and intentions throughout their operation."
]

# Run the evaluation
evaluation_results = run_proposition_chunking_evaluation(
    pdf_path=pdf_path,
    test_queries=test_queries,
    reference_answers=reference_answers
)

# Print the overall analysis
print("\n\n=== Overall Analysis ===")
print(evaluation_results["overall_analysis"])

=== Starting Proposition Chunking Evaluation ===

Created 48 text chunks
Generating propositions from chunks...
Processing chunk 1/48...
Generated 10 propositions
Processing chunk 2/48...
Generated 18 propositions
Processing chunk 3/48...
Generated 17 propositions
Processing chunk 4/48...
Generated 13 propositions
Processing chunk 5/48...
Generated 13 propositions
Processing chunk 6/48...
Generated 19 propositions
Processing chunk 7/48...
Generated 21 propositions
Processing chunk 8/48...
Generated 22 propositions
Processing chunk 9/48...
Generated 23 propositions
Processing chunk 10/48...
Generated 21 propositions
Processing chunk 11/48...
Generated 15 propositions
Processing chunk 12/48...
Generated 16 propositions
Processing chunk 13/48...
Generated 13 propositions
Processing chunk 14/48...
Generated 15 propositions
Processing chunk 15/48...
Generated 14 propositions
Processing chunk 16/48...
Generated 21 propositions
Processing chunk 17/48...
Generated 25 propositions
Processing ch

KeyboardInterrupt: 