# Processing Documents for AI Applications

## Using pypdf
pypdf is a pure-Python library that can split, merge, crop, and transform the pages of PDF files. Its text extraction capabilities turn binary PDF data into plain text strings. It works well for documents that are mostly text, but it handles tables, graphs, and images poorly.

In [1]:
import os
import re
from pypdf import PdfReader

fp_paper_pdf = "data/ResearchPaper.pdf"
fp_slides_pdf = "data/LectureSlides.pdf"
fp_slides_pptx = "data/SlideDeck"

output_dir = "cleaned_data/"
os.makedirs(output_dir, exist_ok=True)

In [2]:
# Basic script to examine quality of extracted text 
reader = PdfReader(fp_paper_pdf)
clean_pages = []

# for each page in document
for page in reader.pages:
    lines = (page.extract_text() or "").strip().splitlines()
    cleaned = [line for line in lines if len(line) >= 3] # clear junk lines
    clean_pages.append("\n".join(cleaned))

text = "\n".join(clean_pages)
with open(os.path.join(output_dir, "research_paper.txt"), "w", encoding="utf-8") as f:
    f.write(text)
# 0m46s

In [None]:
# More thorough example that retains semantic continuity for RAG applications
def clean_text_pypdf(page_text):
    if not page_text:
        return ""
    
    lines = page_text.splitlines()
    # drop noise lines (headers, footers, page numbers)
    if len(lines) > 1 and len(lines[0]) < 5:
        lines = lines[1:]
    if len(lines) > 1 and len(lines[-1]) < 5:
        lines = lines[:-1]

    text = '\n'.join(lines)
    # fix hyphenated text
    text = re.sub(r'-\n\s*', '', text)
    # merge broken lines but keep paragraph breaks
    text = re.sub(r'(?<!\n)\n(?!\n)', ' ', text)
    # clean extra whitespace without collapsing paragraph breaks
    text = re.sub(r'[ \t]+', ' ', text)
    text = re.sub(r'\n{3,}', '\n\n', text).strip()
    
    return text

reader = PdfReader(fp_paper_pdf)
processed_pages = []

for i, page in enumerate(reader.pages):
    raw_text = page.extract_text()
    clean_content = clean_text_pypdf(raw_text)
    # skip empty or near-empty pages (fewer than 40 characters)
    if len(clean_content) < 40:
        continue
    processed_pages.append({
        "source": fp_paper_pdf,
        "page_number": i + 1,
        "content": clean_content
    })

# save output for debugging or ingestion
with open("cleaned_output_pypdf.txt", "w", encoding="utf-8") as f:
    for page in processed_pages:
        f.write(f"--- Page {page['page_number']} ---\n")
        f.write(page['content'] + "\n\n")

print(f"[S] Processed {len(processed_pages)} pages.")
# 0m31s

[S] Processed 14 pages.
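
Before pointing the hyphen-repair and line-merging regexes at a real PDF, it can help to run them on a small made-up string. The sketch below is an illustration only: `normalize_page_text` and the sample text are hypothetical, and it assumes that blank lines mark paragraph breaks worth preserving.

```python
import re

def normalize_page_text(page_text: str) -> str:
    """Hypothetical helper: repair hyphenation and rejoin wrapped lines."""
    # rejoin words split across a line break: "pro-\nvides" -> "provides"
    text = re.sub(r"-\n\s*", "", page_text)
    # a single newline is treated as a soft wrap; blank lines are left alone
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # collapse runs of spaces and tabs without touching newlines
    text = re.sub(r"[ \t]+", " ", text)
    # cap consecutive newlines at one paragraph break
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

sample = (
    "Optic flow pro-\nvides cues about\nself-motion.\n\n"
    "Humans judge  heading\nwell."
)
print(normalize_page_text(sample))
```

One caveat with this approach: a genuinely hyphenated compound that happens to wrap at the hyphen (for example `self-` at the end of one line and `motion` on the next) would be joined into `selfmotion`, so the output is worth spot-checking.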


## Using Docling

In [4]:
from docling.document_converter import DocumentConverter

fp_paper_pdf = "data/ResearchPaper.pdf"
fp_slides_pdf = "data/LectureSlides.pdf"
fp_slides_pptx = "data/SlideDeck"

output_dir = "cleaned_data"
os.makedirs(output_dir, exist_ok=True)



In [None]:
# examine markdown representation of extracted info
converter = DocumentConverter()
result = converter.convert(fp_paper_pdf)

# save output for debugging or ingestion
with open("cleaned_output_docling.txt", "w", encoding="utf-8") as f:
    f.write(result.document.export_to_markdown())
# 3m27s

2025-12-20 19:42:46,989 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-20 19:42:47,173 - INFO - Going to convert document batch...
2025-12-20 19:42:47,174 - INFO - Initializing pipeline for StandardPdfPipeline with options hash e15bc6f248154cc62f8db15ef18a8ab7
2025-12-20 19:42:47,198 - INFO - Loading plugin 'docling_defaults'
2025-12-20 19:42:47,207 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-12-20 19:42:47,237 - INFO - Loading plugin 'docling_defaults'
2025-12-20 19:42:47,253 - INFO - Registered ocr engines: ['auto', 'easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
2025-12-20 19:42:47,258 - INFO - rapidocr cannot be used because onnxruntime is not installed.
2025-12-20 19:42:47,261 - INFO - easyocr cannot be used because it is not installed.
2025-12-20 19:42:47,989 - INFO - Accelerator device: 'cpu'
[INFO] 2025-12-20 19:42:48,040 [RapidOCR] base.py:22: Using engine_name: torch
[INFO] 2025-12-20 19:42:48,053 [RapidOCR] device_con

<!-- image -->

Contents lists available at ScienceDirect

## Neural Networks

journal homepage: www.elsevier.com/locate/neunet

## Estimating heading from optic flow: Comparing deep learning network and human performance ✩

Natalie Maus a , Oliver W. Layton b, ∗

- a Department of Computer Science, University of Pennsylvania, Philadelphia, 19104, PA, USA
- b Department of Computer Science, Colby College, Waterville, 04901, ME, USA

## a r t i c l e i n f o

Article history: Received 22 November 2021 Received in revised form 17 June 2022 Accepted 7 July 2022 Available online 25 July 2022

Keywords: Optic flow Deep learning Heading Vision Self-motion

## 1. Introduction

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) catalyzed historic developments in deep convolutional neural networks (CNNs) (Lecun et al., 2015; LeCun et al., 1995; Russakovsky et al., 2015). The challenge benchmarked the accuracy with which competing algorithms classified 1000 categories of natural imag

### Docling Pipeline with a PostgreSQL Database
Each chunk's embedding is stored in PostgreSQL using the `pgvector` extension, alongside the following metadata:
```json
{
  "page_numbers": [5],
  "headings": ["Q3 Performance", "Revenue Breakdown"],
  "origin": "annual_report.pdf"
}
```
Because the vectors live in a relational database, we can run hybrid searches that combine embedding similarity with ordinary SQL filters. For example, we can select the vectors most similar to "quarterly earnings" but only from chunks whose `headings` metadata includes "Q3 Performance".
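
A hybrid query of this kind might look like the sketch below. It only builds the SQL string and its parameters, so it runs without a database. The table name `chunks`, its columns (`content`, `metadata` as `jsonb`, `embedding` as `vector`), and the helper `build_hybrid_query` are assumptions for illustration, as are the `%s` placeholders, which follow the psycopg convention. The `<=>` operator is pgvector's cosine distance, and `@>` is PostgreSQL's `jsonb` containment operator.

```python
import json

def build_hybrid_query(query_embedding, heading, limit=5):
    """Return (sql, params) for a similarity search filtered by heading.

    Assumes a hypothetical table chunks(content text, metadata jsonb,
    embedding vector) in a database with the pgvector extension enabled.
    """
    sql = (
        "SELECT content, metadata, embedding <=> %s::vector AS distance "
        "FROM chunks "
        "WHERE metadata @> %s::jsonb "
        "ORDER BY distance "
        "LIMIT %s"
    )
    # pgvector accepts vectors as text literals such as '[0.1,0.2,0.3]'
    vector_literal = "[" + ",".join(str(float(x)) for x in query_embedding) + "]"
    # jsonb containment: matches rows whose headings array includes `heading`
    metadata_filter = json.dumps({"headings": [heading]})
    return sql, (vector_literal, metadata_filter, limit)

# placeholder 3-dimensional vector; a real one would come from the same
# embedding model that was used for the stored chunks
sql, params = build_hybrid_query([0.1, 0.2, 0.3], "Q3 Performance", limit=3)
print(sql)
print(params)
```

Passing the vector and the filter as parameters, rather than formatting them into the SQL text, keeps the query safe from injection when headings come from user input.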

In [None]:
attributes = dir(result)
print(attributes)
'''
'assembled', 'confidence', 'construct', 'copy', 'dict', 'document', 'errors', 'from_orm', 
'input', 'json', 'legacy_document', 'load', 'model_computed_fields', 'model_config', 
'model_construct', 'model_copy', 'model_dump', 'model_dump_json', 'model_extra', 'model_fields', 
'model_fields_set', 'model_json_schema', 'model_parametrized_name', 'model_post_init', 'model_rebuild', 
'model_validate', 'model_validate_json', 'model_validate_strings', 'pages', 'parse_file', 'parse_obj', 
'parse_raw', 'save', 'schema', 'schema_json', 'status', 'timestamp', 'timings', 'update_forward_refs', 
'validate', 'version'
'''
print(dir(result.document))
'''
'add_code', 'add_document', 'add_form', 'add_formula', 'add_group', 'add_heading', 'add_inline_group',
'add_key_values', 'add_list_group', 'add_list_item', 'add_node_items', 'add_ordered_list', 'add_page', 
'add_picture', 'add_table', 'add_table_cell', 'add_text', 'add_title', 'add_unordered_list', 
'append_child_item', 'body', 'check_version_is_compatible', 'concatenate', 'construct', 'copy', 
'delete_items', 'delete_items_range', 'dict', 'export_to_dict', 'export_to_doctags', 
'export_to_document_tokens', 'export_to_element_tree', 'export_to_html', 'export_to_markdown', 
'export_to_text', 'extract_items_range', 'filter', 'form_items', 'from_orm', 'furniture', 
'get_visualization', 'groups', 'insert_code', 'insert_document', 'insert_form', 'insert_formula', 
'insert_group', 'insert_heading', 'insert_inline_group', 'insert_item_after_sibling', 
'insert_item_before_sibling', 'insert_key_values', 'insert_list_group', 'insert_list_item', 
'insert_node_items', 'insert_picture', 'insert_table', 'insert_text', 'insert_title', 
'iterate_items', 'json', 'key_value_items', 'load_from_doctags', 'load_from_json', 'load_from_yaml', 
'model_computed_fields', 'model_config', 'model_construct', 'model_copy', 'model_dump', 
'model_dump_json', 'model_extra', 'model_fields', 'model_fields_set', 'model_json_schema', 
'model_parametrized_name', 'model_post_init', 'model_rebuild', 'model_validate', 'model_validate_json', 
'model_validate_strings', 'name', 'num_pages', 'origin', 'pages', 'parse_file', 'parse_obj', 
'parse_raw', 'pictures', 'print_element_tree', 'replace_item', 'save_as_doctags', 
'save_as_document_tokens', 'save_as_html', 'save_as_json', 'save_as_markdown', 'save_as_yaml', 
'schema', 'schema_json', 'schema_name', 'tables', 'texts', 'transform_to_content_layer', 
'update_forward_refs', 'validate', 'validate_document', 'validate_misplaced_list_items', 
'validate_tree', 'version'
'''

['__abstractmethods__', '__annotations__', '__class__', '__class_getitem__', '__class_vars__', '__copy__', '__deepcopy__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__fields__', '__fields_set__', '__format__', '__ge__', '__get_pydantic_core_schema__', '__get_pydantic_json_schema__', '__getattr__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__pretty__', '__private_attributes__', '__pydantic_complete__', '__pydantic_computed_fields__', '__pydantic_core_schema__', '__pydantic_custom_init__', '__pydantic_decorators__', '__pydantic_extra__', '__pydantic_fields__', '__pydantic_fields_set__', '__pydantic_generic_metadata__', '__pydantic_init_subclass__', '__pydantic_on_complete__', '__pydantic_parent_namespace__', '__pydantic_post_init__', '__pydantic_private__', '__pydantic_root_model__', '__pydantic_serializer__', '__pydantic_setattr_handlers__', '__pydantic_
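
Raw `dir()` output is dominated by dunder and private names, which makes the useful attributes hard to spot. A small filter keeps only the public ones. This is a minimal sketch: `public_attrs` and the `Example` class are hypothetical stand-ins, not part of Docling.

```python
def public_attrs(obj):
    """Return the names from dir(obj) that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]

class Example:
    """Hypothetical stand-in for a conversion result object."""
    status = "success"

    def export(self):
        return "..."

print(public_attrs(Example()))
```

The same helper could be applied to `result` or `result.document` to get a short list of public names without the dunder noise.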