## Data Extraction with Docling

In this notebook, we'll extract content from PDFs into structured formats:

- **Markdown**: Full document text, with a `<!-- page break -->` marker between pages for later chunking
- **Images**: A PNG of each page that contains a large chart or diagram (a picture wider and taller than 500 pixels)
- **Tables**: Each table saved with up to two preceding paragraphs of context and its page number

**Pipeline Overview:**
1. **This Notebook (06-01)**: Extract PDFs → Markdown, Images, Tables
2. **Next Notebook (06-02)**: Load into vector database with embeddings
3. **Notebook 07**: Intelligent search with filters and reranking

**Output Structure:**
```
data/rag-data/markdown/{company}/{document}.md
data/rag-data/images/{company}/{document}/page_5.png
data/rag-data/tables/{company}/{document}/table_1_page_5.md
```

### 1. Setup and Configuration

In [1]:
from pathlib import Path
from typing import List, Tuple

from docling_core.types.doc import PictureItem
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

In [2]:
# Directory paths
DATA_DIR = "data/rag-data/pdfs"
OUTPUT_MD_DIR = "data/rag-data/markdown"
OUTPUT_IMAGES_DIR = "data/rag-data/images"
OUTPUT_TABLES_DIR = "data/rag-data/tables"

### 2. Helper Functions

In [3]:
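def extract_metadata_from_filename(filename: str) -> dict:
    """
    Extract metadata from a filename.

    Expected format: CompanyName DocType [Quarter] Year[.pdf]
    Examples:
        - Amazon 10-Q Q1 2024.pdf
        - Microsoft 10-K 2023.pdf

    The company name must be a single word; the ".pdf" suffix is optional.
    """
    # Strip the extension so the last token is the bare year
    if filename.lower().endswith('.pdf'):
        filename = filename[:-4]
    parts = filename.split()

    return {
        'company_name': parts[0],
        'doc_type': parts[1],
        'fiscal_quarter': parts[2] if len(parts) == 4 else None,
        'fiscal_year': int(parts[-1])
    }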
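
The split-based parser above assumes a one-word company name and fails with an `IndexError` or a confusing `ValueError` on anything else. As a hedged alternative, here is a minimal regex-based sketch. The name `parse_filing_name`, the `FilingMetadata` type, and the pattern itself are illustrative assumptions, not part of the notebook; the pattern only covers form types shaped like `10-K`, `10-Q`, or `8-K`.

```python
import re
from typing import Optional, TypedDict


class FilingMetadata(TypedDict):
    company_name: str
    doc_type: str
    fiscal_quarter: Optional[str]
    fiscal_year: int


# Hypothetical pattern: company (one or more words), a form type such as 10-K,
# an optional quarter (Q1-Q4), a four-digit year, and an optional .pdf suffix.
FILENAME_PATTERN = re.compile(
    r"^(?P<company>.+?)\s+(?P<doc_type>\d+-[A-Za-z])"
    r"(?:\s+(?P<quarter>Q[1-4]))?\s+(?P<year>\d{4})(?:\.pdf)?$",
    re.IGNORECASE,
)


def parse_filing_name(filename: str) -> FilingMetadata:
    """Parse 'Company DocType [Quarter] Year[.pdf]' or raise ValueError."""
    match = FILENAME_PATTERN.match(filename.strip())
    if match is None:
        raise ValueError(f"Unexpected filename format: {filename!r}")
    return {
        "company_name": match["company"],
        "doc_type": match["doc_type"],
        "fiscal_quarter": match["quarter"],  # None when the quarter is absent
        "fiscal_year": int(match["year"]),
    }


print(parse_filing_name("Amazon 10-Q Q1 2024.pdf"))
print(parse_filing_name("Meta Platforms 10-K 2024"))
```

Raising a `ValueError` with the offending filename makes a misnamed file easy to spot before any expensive conversion starts.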

In [4]:
def extract_context_and_table(lines: List[str], table_index: int) -> Tuple[str, int]:
    """
    Extract context and table content at a specific position.
    
    Args:
        lines: All markdown lines
        table_index: Where the table starts
    
    Returns:
        (combined_content, next_line_index)
    """
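    # Get up to 2 paragraphs before the table
    context_lines = []
    para_count = 0
    in_paragraph = False
    j = table_index - 1

    while j >= 0 and para_count < 2:
        if lines[j].strip():
            if '<!-- page break -->' not in lines[j]:
                context_lines.insert(0, lines[j])
                in_paragraph = True
        elif in_paragraph:
            # The first blank line above a paragraph closes it; any further
            # consecutive blank lines are ignored instead of being counted
            para_count += 1
            in_paragraph = False
        j -= 1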
    
    # Get all table rows
    table_lines = []
    i = table_index
    
    while i < len(lines):
        if lines[i].strip().startswith('|'):
            table_lines.append(lines[i])
            i += 1
        elif not lines[i].strip():
            i += 1  # Skip empty lines within table
        else:
            break  # End of table
    
    # Combine them
    content = '\n'.join(context_lines) + '\n\n' + '\n'.join(table_lines)
    return content, i
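
The backward scan for context can be isolated into a small helper to make its behaviour easier to check. This is a minimal sketch under stated assumptions: `preceding_paragraphs` is a hypothetical name, it returns paragraphs as separate strings rather than one joined block, and it does not special-case the page-break placeholder.

```python
from typing import List


def preceding_paragraphs(lines: List[str], index: int, max_paragraphs: int = 2) -> List[str]:
    """Return up to `max_paragraphs` paragraphs that end before `lines[index]`."""
    paragraphs: List[str] = []
    current: List[str] = []

    # Walk upwards from the line just above `index`
    for line in reversed(lines[:index]):
        if line.strip():
            current.insert(0, line)  # still inside a paragraph
        elif current:
            # A blank line closes the paragraph collected so far
            paragraphs.insert(0, "\n".join(current))
            current = []
            if len(paragraphs) == max_paragraphs:
                break

    # The top of the document also closes a paragraph
    if current and len(paragraphs) < max_paragraphs:
        paragraphs.insert(0, "\n".join(current))
    return paragraphs


sample = [
    "# Title",
    "",
    "Revenue grew 10%.",
    "",
    "",
    "See the table below.",
    "",
    "| a | b |",
    "| 1 | 2 |",
]
print(preceding_paragraphs(sample, 7))
```

Grouping lines per paragraph means that runs of blank lines count once, so the helper returns two real paragraphs even when the markdown contains extra spacing.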

In [5]:
def extract_tables_with_context(markdown_text: str) -> List[Tuple[str, str, int]]:
    """
    Find all tables and extract them with context and page numbers.
    
    Returns:
        List of (content, table_name, page_number)
    """
    lines = markdown_text.split('\n')
    tables = []
    current_page = 1
    table_num = 1
    i = 0
    
    while i < len(lines):
        # Track page numbers
        if '<!-- page break -->' in lines[i]:
            current_page += 1
            i += 1
            continue
        
        # Found a table?
        if lines[i].strip().startswith('|') and lines[i].count('|') >= 2:
            content, next_i = extract_context_and_table(lines, i)
            tables.append((content, f"table_{table_num}", current_page))
            table_num += 1
            i = next_i
        else:
            i += 1
    
    return tables
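
Page numbers here are not read from Docling metadata; they are inferred by counting page-break placeholders in the exported markdown. The sketch below shows that counting step on its own. `table_start_pages` is a hypothetical helper, and it assumes, like the function above, that any line starting with `|` belongs to a table.

```python
from typing import List, Tuple

# Same placeholder string the notebook passes to export_to_markdown
PAGE_BREAK = "<!-- page break -->"


def table_start_pages(markdown_text: str) -> List[Tuple[int, int]]:
    """Return (line_index, page_number) for the first row of every pipe table."""
    starts = []
    page = 1
    in_table = False

    for idx, line in enumerate(markdown_text.split("\n")):
        stripped = line.strip()
        if PAGE_BREAK in line:
            page += 1          # everything after the marker is on the next page
            in_table = False
        elif stripped.startswith("|"):
            if not in_table:
                starts.append((idx, page))
            in_table = True
        elif stripped:
            in_table = False   # any other text ends the current table

    return starts


sample_md = "\n".join([
    "Intro",
    "",
    "| a | b |",
    "|---|---|",
    "| 1 | 2 |",
    "",
    "<!-- page break -->",
    "",
    "More text",
    "",
    "| c | d |",
    "|---|---|",
    "| 3 | 4 |",
])
print(table_start_pages(sample_md))
```

Because the page counter starts at 1 and only increases at a placeholder, a table that spans a page break is attributed to the page where it starts.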

In [6]:
def convert_pdf_to_docling(pdf_path: Path):
    """
    Convert PDF using Docling.
    
    Returns:
        Docling conversion result
    """
    pipeline_options = PdfPipelineOptions()
    pipeline_options.generate_picture_images = True
    pipeline_options.generate_page_images = True
    
    converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    )
    
    return converter.convert(pdf_path)
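
`convert_pdf_to_docling` builds a new `DocumentConverter` on every call. A possible variation, sketched below, builds the converter once and reuses it across PDFs. The helper names `get_pdf_converter` and `convert_pdf_cached` are hypothetical, and whether reuse saves a noticeable amount of time is an assumption that has not been measured here.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def get_pdf_converter():
    """Build the Docling converter once and reuse it for every PDF."""
    # Imported lazily so that defining this helper does not require Docling
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    # Same options as convert_pdf_to_docling above
    pipeline_options = PdfPipelineOptions()
    pipeline_options.generate_picture_images = True
    pipeline_options.generate_page_images = True

    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    )


def convert_pdf_cached(pdf_path):
    """Convert one PDF with the shared converter."""
    return get_pdf_converter().convert(pdf_path)
```

To try it, replace the call to `convert_pdf_to_docling(pdf_path)` with `convert_pdf_cached(pdf_path)` and compare the total run time on a few documents.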

In [7]:
def save_page_images(result, images_dir: Path):
    """
    Find and save pages with large images (>500x500 pixels).
    """
    pages_to_save = set()
    
    for item in result.document.iterate_items():
        element = item[0]
        
        if isinstance(element, PictureItem):
            image = element.get_image(result.document)
            # get_image() can return None when no image data is available
            if image is not None and image.size[0] > 500 and image.size[1] > 500:
                page_no = element.prov[0].page_no if element.prov else None
                if page_no:
                    pages_to_save.add(page_no)
    
    # Save images
    for page_no in pages_to_save:
        page = result.document.pages[page_no]
        page.image.pil_image.save(images_dir / f"page_{page_no}.png", "PNG")
    
    if pages_to_save:
        print(f"  ✓ Saved {len(pages_to_save)} page images")

In [8]:
def save_tables(markdown_text: str, tables_dir: Path):
    """
    Extract and save tables with context and page numbers.
    """
    tables = extract_tables_with_context(markdown_text)
    
    for table_content, table_name, page_num in tables:
        content_with_page = f"**Page:** {page_num}\n\n{table_content}"
        (tables_dir / f"{table_name}_page_{page_num}.md").write_text(content_with_page, encoding='utf-8')
    
    if tables:
        print(f"  ✓ Saved {len(tables)} tables")

### 3. Main Extraction Function

In [9]:
def extract_pdf_content(pdf_path: Path):
    """Extract PDF to markdown, images, and tables."""
    print(f"Processing: {pdf_path.name}")
    
    # Setup output directories
    filename_stem = pdf_path.stem
    metadata = extract_metadata_from_filename(filename_stem)
    company = metadata['company_name']
    
    md_dir = Path(OUTPUT_MD_DIR) / company
    images_dir = Path(OUTPUT_IMAGES_DIR) / company / filename_stem
    tables_dir = Path(OUTPUT_TABLES_DIR) / company / filename_stem
    
    for dir_path in [md_dir, images_dir, tables_dir]:
        dir_path.mkdir(parents=True, exist_ok=True)
    
    # Convert PDF with Docling
    result = convert_pdf_to_docling(pdf_path)
    
    # Save markdown
    markdown_text = result.document.export_to_markdown(page_break_placeholder="<!-- page break -->")
    (md_dir / f"{filename_stem}.md").write_text(markdown_text, encoding='utf-8')
    print(f"  ✓ Markdown saved")
    
    # Save page images
    save_page_images(result, images_dir)
    
    # Save tables
    save_tables(markdown_text, tables_dir)
    
    print(f"  Done!\n")

### 4. Process a Single PDF (Example)

In [10]:
# Find all PDF files
data_path = Path(DATA_DIR)
pdf_files = list(data_path.rglob("*.pdf"))
print(f"Found {len(pdf_files)} PDF files\n")

# Process one example first to see the output
if pdf_files:
    print("=== Processing Example PDF ===")
    extract_pdf_content(pdf_files[0])
    print("\nCheck the output folders to see extracted files!")
    print(f"- Markdown: {OUTPUT_MD_DIR}")
    print(f"- Images: {OUTPUT_IMAGES_DIR}")
    print(f"- Tables: {OUTPUT_TABLES_DIR}")



Found 28 PDF files

=== Processing Example PDF ===
Processing: amazon 10-k 2023.pdf



  ✓ Markdown saved
  ✓ Saved 61 tables
  Done!


Check the output folders to see extracted files!
- Markdown: data/rag-data/markdown
- Images: data/rag-data/images
- Tables: data/rag-data/tables


### 5. Process All PDFs
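
The cell in this section calls `extract_pdf_content` in a plain loop, so one PDF that raises an exception would stop the whole batch. A minimal sketch of a more tolerant loop follows. `process_all` is a hypothetical helper, and catching every `Exception` is a deliberate simplification for a notebook setting.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def process_all(
    pdf_paths: Iterable[Path],
    process_one: Callable[[Path], None],
) -> List[Tuple[Path, str]]:
    """Run `process_one` on every path and return (path, error message) for failures."""
    failures = []
    for pdf_path in pdf_paths:
        try:
            process_one(pdf_path)
        except Exception as exc:  # keep going; report the problem afterwards
            failures.append((pdf_path, f"{type(exc).__name__}: {exc}"))
    return failures


# Stand-in for extract_pdf_content so the sketch runs without Docling
def fake_extract(pdf_path: Path) -> None:
    if pdf_path.name == "bad.pdf":
        raise ValueError("boom")


failures = process_all([Path("good.pdf"), Path("bad.pdf"), Path("other.pdf")], fake_extract)
print(failures)
```

In the notebook, the equivalent call would be `failures = process_all(pdf_files, extract_pdf_content)`, followed by a look at `failures` once the run finishes.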

In [11]:
# Process all PDFs
print(f"\n=== Processing All {len(pdf_files)} PDFs ===\n")

for idx, pdf_path in enumerate(pdf_files, 1):
    print(f"[{idx}/{len(pdf_files)}]", end=" ")
    extract_pdf_content(pdf_path)

print("\n=== Extraction Complete ===")



=== Processing All 28 PDFs ===

[1/28] Processing: amazon 10-k 2023.pdf


2025-12-13 18:43:03,759 - INFO - Finished converting document amazon 10-k 2023.pdf in 39.28 sec.

  ✓ Markdown saved
  ✓ Saved 61 tables
  Done!

[2/28] Processing: amazon 10-k 2024.pdf


2025-12-13 18:43:39,433 - INFO - Finished converting document amazon 10-k 2024.pdf in 35.52 sec.

  ✓ Markdown saved
  ✓ Saved 62 tables
  Done!

[3/28] Processing: amazon 10-q q1 2024.pdf


2025-12-13 18:43:57,633 - INFO - Finished converting document amazon 10-q q1 2024.pdf in 17.99 sec.

  ✓ Markdown saved
  ✓ Saved 35 tables
  Done!

[4/28] Processing: amazon 10-q q1 2025.pdf


2025-12-13 18:44:14,545 - INFO - Finished converting document amazon 10-q q1 2025.pdf in 16.80 sec.

  ✓ Markdown saved
  ✓ Saved 35 tables
  Done!

[5/28] Processing: amazon 10-q q2 2024.pdf


2025-12-13 18:44:35,450 - INFO - Finished converting document amazon 10-q q2 2024.pdf in 20.79 sec.

  ✓ Markdown saved
  ✓ Saved 34 tables
  Done!

[6/28] Processing: amazon 10-q q2 2025.pdf


2025-12-13 18:44:55,029 - INFO - Finished converting document amazon 10-q q2 2025.pdf in 19.45 sec.

  ✓ Markdown saved
  ✓ Saved 34 tables
  Done!

[7/28] Processing: amazon 10-q q3 2024.pdf


2025-12-13 18:45:23,910 - INFO - Finished converting document amazon 10-q q3 2024.pdf in 28.74 sec.

  ✓ Markdown saved
  ✓ Saved 38 tables
  Done!

[8/28] Processing: apple 10-k 2023.pdf

2025-12-13 18:45:59,366 - INFO - Finished converting document apple 10-k 2023.pdf in 35.17 sec.

  ✓ Markdown saved
  ✓ Saved 53 tables
  Done!

[9/28] Processing: apple 10-k 2024.pdf


2025-12-13 18:46:35,999 - INFO - Finished converting document apple 10-k 2024.pdf in 36.45 sec.

  ✓ Markdown saved
  ✓ Saved 50 tables
  Done!

[10/28] Processing: apple 10-q q1 2024.pdf

2025-12-13 18:46:50,058 - INFO - Finished converting document apple 10-q q1 2024.pdf in 13.86 sec.

  ✓ Markdown saved
  ✓ Saved 26 tables
  Done!

[11/28] Processing: apple 10-q q2 2024.pdf


2025-12-13 18:47:04,531 - INFO - Finished converting document apple 10-q q2 2024.pdf in 14.38 sec.

  ✓ Markdown saved
  ✓ Saved 27 tables
  Done!

[12/28] Processing: apple 10-q q4 2023.pdf


2025-12-13 18:47:15,877 - INFO - Finished converting document apple 10-q q4 2023.pdf in 11.26 sec.

  ✓ Markdown saved
  ✓ Saved 26 tables
  Done!

[13/28] Processing: apple 8-k q4 2023.pdf


2025-12-13 18:47:20,972 - INFO - Finished converting document apple 8-k q4 2023.pdf in 5.00 sec.

  ✓ Markdown saved
  ✓ Saved 5 tables
  Done!

[14/28] Processing: google 10-k 2023.pdf


2025-12-13 18:48:02,986 - INFO - Finished converting document google 10-k 2023.pdf in 41.97 sec.

  ✓ Markdown saved
  ✓ Saved 70 tables
  Done!

[15/28] Processing: google 10-k 2024.pdf

2025-12-13 18:48:46,523 - INFO - Finished converting document google 10-k 2024.pdf in 43.27 sec.

  ✓ Markdown saved
  ✓ Saved 71 tables
  Done!

[16/28] Processing: google 10-q q1 2025.pdf

2025-12-13 18:49:09,839 - INFO - Finished converting document google 10-q q1 2025.pdf in 23.05 sec.

  ✓ Markdown saved
  ✓ Saved 57 tables
  Done!

[17/28] Processing: google 10-q q2 2024.pdf


2025-12-13 18:49:36,446 - INFO - Finished converting document google 10-q q2 2024.pdf in 26.49 sec.

  ✓ Markdown saved
  ✓ Saved 56 tables
  Done!

[18/28] Processing: google 10-q q2 2025.pdf


2025-12-13 18:50:05,599 - INFO - Finished converting document google 10-q q2 2025.pdf in 29.02 sec.

  ✓ Markdown saved
  ✓ Saved 63 tables
  Done!

[19/28] Processing: google 10-q q3 2024.pdf


2025-12-13 18:50:32,125 - INFO - Finished converting document google 10-q q3 2024.pdf in 26.38 sec.

  ✓ Markdown saved
  ✓ Saved 56 tables
  Done!

[20/28] Processing: meta 10-k 2024.pdf


2025-12-13 18:51:26,384 - INFO - Finished converting document meta 10-k 2024.pdf in 54.12 sec.

  ✓ Markdown saved
  ✓ Saved 64 tables
  Done!

[21/28] Processing: meta 10-q q1 2024.pdf


2025-12-13 18:51:26,899 - INFO - Auto OCR model selected rapidocr with onnxruntime.
2025-12-13 18:51:26,899 - INFO - Accelerator device: 'cuda:0'
2025-12-13 18:51:27,972 - INFO - Accelerator device: 'cuda:0'
2025-12-13 18:51:28,273 - INFO - Processing document meta 10-q q1 2024.pdf
2025-12-13 18:51:33,972 - INFO - Finished converting document meta 10-q q1 2024.pdf in 7.32 sec.
2025-12-13 18:51:34,193 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-13 18:51:34,197 - INFO - Going to convert document batch...
2025-12-13 18:51:34,198 - INFO - Initializing pipeline for StandardPdfPipeline with options hash afb4d61b52d512d984736b9faa45e3e9
2025-12-13 18:51:34,198 - INFO - Accelerator device: 'cuda:0'


  ✓ Markdown saved
  ✓ Saved 8 page images
  ✓ Saved 3 tables
  Done!

[22/28] Processing: meta 10-q q1 2025.pdf

  ✓ Markdown saved
  ✓ Saved 9 page images
  ✓ Saved 3 tables
  Done!

[23/28] Processing: meta 10-q q2 2024.pdf


2025-12-13 18:51:41,509 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-13 18:51:41,512 - INFO - Going to convert document batch...
2025-12-13 18:51:41,513 - INFO - Initializing pipeline for StandardPdfPipeline with options hash afb4d61b52d512d984736b9faa45e3e9
2025-12-13 18:51:41,513 - INFO - Accelerator device: 'cuda:0'

  ✓ Markdown saved
  ✓ Saved 9 page images
  ✓ Saved 3 tables
  Done!

[24/28] Processing: meta 10-q q2 2025.pdf


2025-12-13 18:51:48,975 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-13 18:51:48,979 - INFO - Going to convert document batch...
2025-12-13 18:51:48,980 - INFO - Initializing pipeline for StandardPdfPipeline with options hash afb4d61b52d512d984736b9faa45e3e9
2025-12-13 18:51:48,980 - INFO - Accelerator device: 'cuda:0'

  ✓ Markdown saved
  ✓ Saved 8 page images
  ✓ Saved 3 tables
  Done!

[25/28] Processing: meta 10-q q3 2024.pdf

  ✓ Markdown saved
  ✓ Saved 9 page images
  ✓ Saved 3 tables
  Done!

[26/28] Processing: meta 10-q q3 2025.pdf


2025-12-13 18:52:03,950 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-13 18:52:03,954 - INFO - Going to convert document batch...
2025-12-13 18:52:03,955 - INFO - Initializing pipeline for StandardPdfPipeline with options hash afb4d61b52d512d984736b9faa45e3e9
2025-12-13 18:52:03,955 - INFO - Accelerator device: 'cuda:0'

  ✓ Markdown saved
  ✓ Saved 8 page images
  ✓ Saved 3 tables
  Done!

[27/28] Processing: meta 10-q q4 2024.pdf

  ✓ Markdown saved
  ✓ Saved 9 page images
  ✓ Saved 3 tables
  Done!

[28/28] Processing: meta10-k 2023.pdf


2025-12-13 18:52:18,789 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-13 18:52:18,793 - INFO - Going to convert document batch...
2025-12-13 18:52:18,794 - INFO - Initializing pipeline for StandardPdfPipeline with options hash afb4d61b52d512d984736b9faa45e3e9
2025-12-13 18:52:18,794 - INFO - Accelerator device: 'cuda:0'

  ✓ Markdown saved
  ✓ Saved 67 tables
  Done!


=== Extraction Complete ===
