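# PDF Content Cleaning Analysis

This notebook analyzes text extracted from a PDF file and develops heuristic rules to filter out garbled and low-quality text blocks.

**Goal**: For the file `Journal of Field Robotics - 2017 - Paolillo - Autonomous car driving by a humanoid robot.pdf`, identify and filter out faulty extraction results such as `dy Woy papeo|umog`.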

In [1]:
from unstructured.partition.pdf import partition_pdf
import os
import re

print("Libraries imported.")

Libraries imported.


In [2]:
pdf_path = os.path.join("data", "embodia", "pdf", "Journal of Field Robotics - 2017 - Paolillo - Autonomous car driving by a humanoid robot.pdf")

print(f"Checking for file at: {pdf_path}")
if os.path.exists(pdf_path):
    print("✅ File found.")
else:
    print("❌ File not found! Please check the path.")

Checking for file at: data\embodia\pdf\Journal of Field Robotics - 2017 - Paolillo - Autonomous car driving by a humanoid robot.pdf
✅ File found.


In [3]:
# Use the hi_res strategy for better layout context and structure.
# Image extraction is disabled because we only care about text.
# Table structure inference is enabled.
print("Partitioning PDF... (this may take a while)")
elements = []
try:
    elements = partition_pdf(
        pdf_path,
        strategy="hi_res",
        extract_images_in_pdf=False,
        infer_table_structure=True,
        languages=["eng"],  # specify the language to improve parsing
    )
    print(f"✅ Successfully partitioned PDF into {len(elements)} elements.")
except Exception as e:
    print(f"❌ An error occurred during partitioning: {e}")

Partitioning PDF... (this may take a while)


The `max_size` parameter is deprecated and will be removed in v4.26. Please specify in `size['longest_edge'] instead`.


✅ Successfully partitioned PDF into 617 elements.


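## Step 1: Inspect the actual structure of the `element` objects

`partition_pdf` returns a list of `Element` objects. To avoid the `KeyError` encountered earlier, we first check how these objects are actually structured, using the `.to_dict()` method to view their contents safely.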
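Once an element has been converted with `.to_dict()`, nested keys can be read defensively so that a missing field does not raise `KeyError`. The sketch below is only an illustration: `element_dict` is a trimmed copy of the first element's dictionary printed in this notebook, `safe_get` is a hypothetical helper (not part of `unstructured`), and `text_as_html` is used simply as an example of a key that this header element does not carry.

```python
# Trimmed copy of the first element's to_dict() output from this notebook.
element_dict = {
    "type": "Header",
    "text": "Received: 10 February 2016",
    "metadata": {"page_number": 1, "languages": ["eng"]},
}


def safe_get(d, *keys, default=None):
    """Walk nested dict keys; return `default` instead of raising KeyError."""
    current = d
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# A key that exists is returned as-is.
print(safe_get(element_dict, "metadata", "page_number"))
# A key that is absent falls back to the default instead of failing.
print(safe_get(element_dict, "metadata", "text_as_html", default="<missing>"))
```

The same pattern works for any field that only some element types populate.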

In [4]:
print("--- Inspecting Element Structure ---")
if elements:
    # partition_pdf returns a list of Element objects.
    # Convert the first 3 elements to dictionaries to look at them.
    for i, element in enumerate(elements[:3]):
        print(f"--- Element {i+1} (Category: {element.category}) ---")
        # .to_dict() is a safe way to view an element's data.
        print(element.to_dict())
        print(" ")
else:
    print("No elements were extracted.")

--- Inspecting Element Structure ---
--- Element 1 (Category: Header) ---
{'type': 'Header', 'element_id': '4c77e1b61a4b29a53a2681e922f3f3e0', 'text': 'Received: 10 February 2016', 'metadata': {'detection_class_prob': 0.8243770599365234, 'is_extracted': 'true', 'coordinates': {'points': ((np.float64(129.86666666666667), np.float64(77.83703222222205)), (np.float64(129.86666666666667), np.float64(97.27418518066406)), (np.float64(372.77178955078125), np.float64(97.27418518066406)), (np.float64(372.77178955078125), np.float64(77.83703222222205))), 'system': 'PixelSpace', 'layout_width': 1654, 'layout_height': 2174}, 'last_modified': '2025-05-08T18:56:56', 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 1, 'file_directory': 'data\\embodia\\pdf', 'filename': 'Journal of Field Robotics - 2017 - Paolillo - Autonomous car driving by a humanoid robot.pdf'}}
 
--- Element 2 (Category: Header) ---
{'type': 'Header', 'element_id': '87d166be0a8b58cc6f88442bcc7a76d5', 'text': 'Rev

## Step 2: Define heuristic rules for data cleaning

Based on the garbled text we observed, we define functions that identify and flag low-quality text.

In [5]:
import re
print("--- Defining Cleaning Functions (with Reasons) ---")

def contains_garbled_patterns(text):
    """Check for specific, known garbled patterns."""
    patterns = [
        re.compile(r"dy Woy papeo"),
        re.compile(r"wOD AojLMW ATH"),
        # Add more garbled patterns here as you find them.
    ]
    for pattern in patterns:
        if pattern.search(text):
            return True
    return False

def is_low_quality(element, text_length_threshold=20, alphanum_ratio_threshold=0.6):
    """
    Apply a series of rules to decide whether a document element is low quality.

    Returns: (is_low_quality: bool, reason: str)
    """
    text = element.text

    # Rule 1: filter by element category; we only want meaningful text blocks.
    if element.category not in ['NarrativeText', 'Title', 'ListItem', 'Table']:
        return True, f"Rejected category: '{element.category}'"

    # Rule 2: the text is too short to be meaningful.
    if not text or len(text.strip()) < text_length_threshold:
        return True, f"Text shorter than threshold ({text_length_threshold} chars)"

    # Rule 3: the share of ASCII letters and digits is too low.
    # total_count includes whitespace; Rule 2 guarantees it is non-zero here.
    alphanum_count = len(re.findall(r'[a-zA-Z0-9]', text))
    total_count = len(text)
    ratio = alphanum_count / total_count
    if ratio < alphanum_ratio_threshold:
        return True, f"Alphanum ratio too low ({ratio:.2f} < {alphanum_ratio_threshold})"

    # Rule 4: the text contains a known garbled pattern.
    if contains_garbled_patterns(text):
        return True, "Contains known garbled pattern"

    return False, "High quality"

print("✅ Corrected cleaning functions defined.")

--- Defining Cleaning Functions (with Reasons) ---
✅ Corrected cleaning functions defined.
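To see how the length and ratio rules interact, it can help to compute the numbers for a few strings by hand. The sketch below re-implements only the ratio calculation as a standalone helper (`alphanum_ratio` is a name introduced here for illustration) and applies it to the garbled sample from the goal statement and to two texts that appear in this notebook's output.

```python
import re


def alphanum_ratio(text: str) -> float:
    """Share of ASCII letters and digits among all characters, whitespace included."""
    if not text:
        return 0.0
    return len(re.findall(r"[a-zA-Z0-9]", text)) / len(text)


samples = [
    "dy Woy papeo|umog",                           # garbled sample from the goal
    "F I E L D R E P O RT",                        # letter-spaced heading
    "Autonomous car driving by a humanoid robot",  # the paper title
]
for s in samples:
    print(f"{len(s):>3} chars  ratio={alphanum_ratio(s):.2f}  {s!r}")
```

The garbled sample consists mostly of letters (14 of its 17 characters), so its ratio of about 0.82 clears the 0.6 threshold. On its own it would be rejected by the length rule, and a longer block containing it would need the explicit pattern list. The letter-spaced heading is the opposite case: it falls below the ratio threshold because every other character is a space.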


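## Step 3: Analyze and filter the elements

We now iterate over all extracted elements, apply the cleaning function, and record which content is discarded.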

In [6]:
print("--- Analyzing and Filtering PDF Content ---")
print("Iterating through all elements to find and analyze low-quality text...")

kept_elements = []
discarded_elements = []

if elements:
    for i, element in enumerate(elements):
        is_bad, reason = is_low_quality(element)

        if is_bad:
            discarded_elements.append({
                "index": i,
                "text": element.text,
                "reason": reason,
                "category": element.category
            })
        else:
            kept_elements.append(element)

    print("--- Analysis Complete ---")
    print(f"Total elements processed: {len(elements)}")
    print(f"Elements kept: {len(kept_elements)}")
    print(f"Elements discarded: {len(discarded_elements)}")

else:
    print("Cannot analyze, no elements were extracted.")

--- Analyzing and Filtering PDF Content ---
Iterating through all elements to find and analyze low-quality text...
--- Analysis Complete ---
Total elements processed: 617
Elements kept: 257
Elements discarded: 360


## Step 4: Inspect a sample of the discarded content

This helps us verify whether the heuristics work as intended or are too strict.
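Alongside reading individual samples, a tally of discard reasons gives a quick view of which rule does most of the filtering. The following is a minimal sketch that assumes records shaped like the entries of `discarded_elements`; `discarded_sample` holds a few records copied from this notebook's output, and `summarize` is a hypothetical helper name.

```python
from collections import Counter

# A few records in the same shape as the entries of `discarded_elements`.
discarded_sample = [
    {"index": 0, "category": "Header", "reason": "Rejected category: 'Header'"},
    {"index": 4, "category": "UncategorizedText",
     "reason": "Rejected category: 'UncategorizedText'"},
    {"index": 5, "category": "Title",
     "reason": "Text shorter than threshold (20 chars)"},
    {"index": 11, "category": "Title",
     "reason": "Text shorter than threshold (20 chars)"},
]


def summarize(discarded):
    """Count discarded records by rejection reason and by element category."""
    by_reason = Counter(item["reason"] for item in discarded)
    by_category = Counter(item["category"] for item in discarded)
    return by_reason, by_category


by_reason, by_category = summarize(discarded_sample)
for reason, count in by_reason.most_common():
    print(f"{count:>4}  {reason}")
```

If one reason dominates the full list, that rule is the first candidate to review when the filter looks too strict.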

In [7]:
print("--- 20 Examples of Discarded Elements ---")
if discarded_elements:
    for item in discarded_elements[:20]:  # print the first 20 examples
        print(f"- Index: {item['index']}")
        print(f"  Category: {item['category']}")
        print(f"  Reason: {item['reason']}")
        print(f"  Text: {item['text'][:200].strip()}...")  # print a short excerpt
else:
    print("No elements were discarded.")

--- 20 Examples of Discarded Elements ---
- Index: 0
  Category: Header
  Reason: Rejected category: 'Header'
  Text: Received: 10 February 2016...
- Index: 1
  Category: Header
  Reason: Rejected category: 'Header'
  Text: Revised: 15 May 2017...
- Index: 2
  Category: Header
  Reason: Rejected category: 'Header'
  Text: Accepted: 24 May 2017...
- Index: 4
  Category: UncategorizedText
  Reason: Rejected category: 'UncategorizedText'
  Text: F I E L D R E P O RT...
- Index: 5
  Category: Title
  Reason: Text shorter than threshold (20 chars)
  Text: WILEY...
- Index: 9
  Category: Title
  Reason: Text shorter than threshold (20 chars)
  Text: Correspondence...
- Index: 11
  Category: Title
  Reason: Text shorter than threshold (20 chars)
  Text: Abstract...
- Index: 13
  Category: Title
  Reason: Text shorter than threshold (20 chars)
  Text: K E YW O R D S...
- Index: 15
  Category: UncategorizedText
  Reason: Rejected category: 'UncategorizedText'
  Text: 1...
- Index: 16
  Category

In [10]:
if kept_elements:
    print("\n\n--- Examples of KEPT Elements ---")
    for i, element in enumerate(kept_elements[:5]):  # show at most 5
        print("-" * 20)
        # Look up the element's index in the original list for comparison.
        # list.index() returns the first match by equality.
        original_index = -1
        try:
            original_index = elements.index(element)
        except ValueError:
            pass
        print(f"Original Index: {original_index} | Category: {element.category}")
        print(f"Text Snippet: {element.text[:300]}...")



--- Examples of KEPT Elements ---
--------------------
Original Index: 3 | Category: NarrativeText
Text Snippet: DOI: 10.1002/rob.21731...
--------------------
Original Index: 6 | Category: Title
Text Snippet: Autonomous car driving by a humanoid robot...
--------------------
Original Index: 7 | Category: Title
Text Snippet: Antonio Paolillo1,2 Pierre Gergondet2 Andrea Cherubini1 Marilena Vendittelli3 Abderrahmane Kheddar1,2...
--------------------
Original Index: 8 | Category: NarrativeText
Text Snippet: 1CNRS-UMLIRMM,Montpellier,France 2CNRS-AISTJRLUMI3218/RL,Tsukuba,Japan 3DIAG,SapienzaUniversitàdiRoma,Roma,Italy...
--------------------
Original Index: 10 | Category: NarrativeText
Text Snippet: AntonioPaolillo,CNRS-UMLIRMM161Rue Ada,34090Montpellier. Email:paolillo@lirmm.fr...
