A simple PDF parsing pipeline that extracts text, tables, and images from PDF documents into structured JSON format.

phizzog-ai/pdf_parser


PDF Parser Pipeline

A robust PDF parsing pipeline that extracts text, tables, and images from PDF documents into structured JSON format. Designed as the first stage in a multimodal RAG (Retrieval-Augmented Generation) system.

Features

  • Multi-Strategy Table Extraction: Detects tables using multiple strategies including line-based, text-based, and chart data extraction
  • Chart Data Recognition: Automatically extracts data from bar charts and figures with percentage values
  • OCR Fallback: Uses Tesseract OCR when text extraction fails
  • Concurrent Processing: Processes pages in parallel, up to a configurable limit (PDF_MAX_CONCURRENT_PAGES)
  • Cross-Page Table Merging: Automatically detects and merges tables that span multiple pages
  • Comprehensive Output: Exports text, tables, and full-page images (the images as base64-encoded strings)
  • Error Resilience: Continues processing even when individual components fail

Installation

Prerequisites

  • Python 3.8+
  • Tesseract OCR (required for text extraction fallback)

System Dependencies

macOS

brew install tesseract

Ubuntu/Debian

sudo apt-get update
sudo apt-get install tesseract-ocr

Windows

Download and install from UB-Mannheim/tesseract

Python Installation

  1. Clone the repository:
git clone https://github.com/gaimplan/pdf-parser.git
cd pdf-parser
  2. Create a virtual environment (recommended):
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt

Usage

Basic Usage

python main.py /path/to/your/document.pdf

The parsed output will be saved to output/output_parsed.json by default.

Python API

import asyncio
from main import process_pdf

# Process a PDF file
asyncio.run(process_pdf("path/to/document.pdf"))

Configuration

The parser can be configured using environment variables:

# Image settings
PDF_IMAGE_DPI=144                    # DPI for image extraction
PDF_OCR_RESOLUTION=150               # Resolution for OCR processing
PDF_IMAGE_FORMAT=JPEG                # Output image format

# Table detection settings
PDF_TABLE_Y_TOLERANCE=3              # Tolerance for table row alignment
PDF_TABLE_BOTTOM_THRESHOLD=75        # Threshold for bottom detection
PDF_TABLE_HEIGHT_RATIO=0.2           # Minimum table height ratio
PDF_TABLE_BOTTOM_MARGIN=100          # Bottom margin for tables

# Processing settings
PDF_MAX_CONCURRENT_PAGES=4           # Max pages to process concurrently
PDF_ENABLE_OCR_FALLBACK=true         # Enable OCR fallback
PDF_ENABLE_CAMELOT_FALLBACK=true     # Enable Camelot fallback

# Output settings
PDF_OUTPUT_DIR=output                # Output directory
PDF_OUTPUT_FILENAME=output_parsed.json # Output filename
PDF_JSON_INDENT=2                    # JSON indentation
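
As an illustration of how such variables can be consumed, here is a minimal sketch that reads a few of them into a settings object. The ParserSettings class and the load_settings and _env_bool helpers are hypothetical names, not the project's config.py, and the sketch assumes the values listed above are the defaults.

```python
import os
from dataclasses import dataclass


def _env_bool(name: str, default: bool) -> bool:
    """Interpret common truthy strings; fall back to the default when unset."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ParserSettings:
    """Hypothetical settings holder; the real config.py may differ."""
    image_dpi: int
    max_concurrent_pages: int
    enable_ocr_fallback: bool
    output_dir: str
    output_filename: str


def load_settings() -> ParserSettings:
    # Defaults are assumed to match the values shown in the listing above.
    return ParserSettings(
        image_dpi=int(os.environ.get("PDF_IMAGE_DPI", "144")),
        max_concurrent_pages=int(os.environ.get("PDF_MAX_CONCURRENT_PAGES", "4")),
        enable_ocr_fallback=_env_bool("PDF_ENABLE_OCR_FALLBACK", True),
        output_dir=os.environ.get("PDF_OUTPUT_DIR", "output"),
        output_filename=os.environ.get("PDF_OUTPUT_FILENAME", "output_parsed.json"),
    )
```

Reading everything once into an immutable object keeps type conversion in one place, so a malformed value such as PDF_IMAGE_DPI=abc fails at startup rather than midway through a document.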

Output Format

The parser outputs a JSON structure with the following schema:

{
  "pdf_document": {
    "document_id": "doc_filename.pdf",
    "filename": "filename.pdf",
    "total_pages": 10,
    "metadata": {}
  },
  "pages": [
    {
      "page_id": "page_1",
      "pdf_title": "filename.pdf",
      "text": "Extracted text content...",
      "tables": [
        {
          "columns": ["Column1", "Column2"],
          "data": [
            ["Row1Col1", "Row1Col2"],
            ["Row2Col1", "Row2Col2"]
          ],
          "extends_to_bottom": false,
          "chart_data": true
        }
      ],
      "image_base64": ["base64_encoded_page_image"]
    }
  ]
}

The chart_data field indicates that a table's data was extracted from a chart.
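
A downstream consumer can walk this structure with the standard library alone. The sketch below uses a small inline sample in place of output/output_parsed.json; the sample values and the table_rows helper are made up for illustration.

```python
import base64
import json

# Small inline stand-in for output/output_parsed.json (values are made up).
SAMPLE = """
{
  "pdf_document": {"document_id": "doc_report.pdf", "filename": "report.pdf",
                   "total_pages": 1, "metadata": {}},
  "pages": [
    {"page_id": "page_1", "pdf_title": "report.pdf", "text": "Hello",
     "tables": [{"columns": ["Item", "Share"],
                 "data": [["A", "40%"], ["B", "60%"]],
                 "extends_to_bottom": false, "chart_data": true}],
     "image_base64": ["aGVsbG8="]}
  ]
}
"""


def table_rows(parsed: dict):
    """Yield (page_id, row_dict) for every table row in the parsed output."""
    for page in parsed["pages"]:
        for table in page["tables"]:
            for row in table["data"]:
                yield page["page_id"], dict(zip(table["columns"], row))


parsed = json.loads(SAMPLE)
rows = list(table_rows(parsed))

# Page images are base64 strings; decode them to get the raw image bytes.
first_image = base64.b64decode(parsed["pages"][0]["image_base64"][0])
```

To read a real run, replace json.loads(SAMPLE) with json.load on the opened output file.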

Architecture

Core Components

  • main.py: Entry point and orchestration
  • parsers.py: Content extraction logic with multi-strategy approach
  • models.py: Pydantic data models for validation
  • utils.py: Helper functions for image processing
  • config.py: Configuration management
  • exceptions.py: Custom exception hierarchy

Processing Flow

  1. PDF loaded with pdfplumber
  2. Pages processed concurrently via asyncio
  3. For each page:
    • Text extracted with OCR fallback
    • Tables detected using multiple strategies
    • Chart data extracted from figures
    • Full page rendered as base64 image
  4. Results aggregated into structured JSON
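
The concurrency in step 2 can be pictured as a semaphore-bounded asyncio.gather. This is a minimal sketch of the pattern, not the code in main.py: process_page and process_all are hypothetical stand-ins, and the limit of 4 mirrors PDF_MAX_CONCURRENT_PAGES.

```python
import asyncio


async def process_page(page_number: int) -> dict:
    """Stand-in for per-page work (text, tables, image rendering)."""
    await asyncio.sleep(0)  # placeholder for real extraction work
    return {"page_id": f"page_{page_number}", "text": "", "tables": []}


async def process_all(page_count: int, max_concurrent: int = 4) -> list:
    # The semaphore caps how many pages are in flight at the same time.
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(page_number: int) -> dict:
        async with semaphore:
            return await process_page(page_number)

    # gather preserves input order, so results stay in page order.
    return await asyncio.gather(*(bounded(n) for n in range(1, page_count + 1)))


results = asyncio.run(process_all(page_count=3))
```

Bounding the number of in-flight pages is what lets a lower PDF_MAX_CONCURRENT_PAGES reduce peak memory on large documents.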

Table Extraction Strategies

  1. Lines Strict: For well-formatted tables with clear borders
  2. Lines: For tables with partial borders
  3. Text-based: For borderless tables
  4. Chart Data: Extracts data from bar charts and figures
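
One way to combine the first three strategies is an ordered fallback: try the strictest setting first and stop at the first one that yields tables. The sketch below assumes that ordering, which the project may not follow exactly. The setting names mirror pdfplumber's vertical_strategy and horizontal_strategy options, and first_successful and fake_extract are hypothetical helpers.

```python
from typing import Callable, Dict, List, Optional

Table = List[List[str]]

# Ordered from strictest to loosest. Whether the project uses exactly these
# settings, in this order, is an assumption.
STRATEGIES: List[Dict[str, str]] = [
    {"vertical_strategy": "lines_strict", "horizontal_strategy": "lines_strict"},
    {"vertical_strategy": "lines", "horizontal_strategy": "lines"},
    {"vertical_strategy": "text", "horizontal_strategy": "text"},
]


def first_successful(
    extract: Callable[[Dict[str, str]], List[Table]]
) -> Optional[dict]:
    """Try each strategy in order and return the first one that finds tables."""
    for settings in STRATEGIES:
        tables = extract(settings)
        if tables:
            return {"strategy": settings["vertical_strategy"], "tables": tables}
    return None


def fake_extract(settings: Dict[str, str]) -> List[Table]:
    """Fake extractor that only 'finds' a table in text mode."""
    if settings["vertical_strategy"] == "text":
        return [[["Name", "Value"], ["a", "1"]]]
    return []


result = first_successful(fake_extract)
```

With pdfplumber installed, the extract callable could be something like lambda settings: page.extract_tables(table_settings=settings) for an open page object.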

Advanced Features

Chart Data Extraction

The parser can extract structured data from charts and figures that display percentage values. This is particularly useful for:

  • Bar charts
  • Pie charts with labeled segments
  • Infographics with percentage data
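
As a rough illustration of the idea, label and percentage pairs can be pulled out of a chart's text layer with a regular expression. The pattern and the extract_percentages helper below are assumptions made for this sketch, not the patterns used in parsers.py.

```python
import re

# One label followed by a percentage on the same line, for example
# "Email 45%" or "Social media: 30.5%". The pattern is illustrative only.
LINE = re.compile(
    r"^\s*(?P<label>.+?)\s*[:\-]?\s*(?P<value>\d+(?:\.\d+)?)\s*%\s*$"
)


def extract_percentages(text: str) -> dict:
    """Build a table in the output schema from lines that end in a percentage."""
    rows = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            rows.append([match.group("label"), match.group("value") + "%"])
    return {
        "columns": ["Label", "Percentage"],
        "data": rows,
        "extends_to_bottom": False,
        "chart_data": True,
    }


chart_text = "Channel usage\nEmail 45%\nSocial media: 30.5%\nOther - 24.5%"
table = extract_percentages(chart_text)
```

Lines without a trailing percentage, such as the chart title in the sample, are skipped.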

Cross-Page Table Detection

Tables that span multiple pages are automatically detected and merged based on:

  • Column header matching
  • Table position at page boundaries
  • Content continuity
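
The first two criteria can be sketched with the fields from the output schema: a table is treated as continuing when the previous page's last table has extends_to_bottom set and the next page's first table has identical columns. The merge_cross_page_tables function is a hypothetical simplification, and it does not attempt the content-continuity check.

```python
from typing import Dict, List, Optional


def merge_cross_page_tables(pages: List[Dict]) -> List[Dict]:
    """Merge a page's first table into the previous page's last table when
    that table reaches the page bottom and the column headers match."""
    merged: List[Dict] = []
    candidate: Optional[Dict] = None  # table that may continue on the next page
    for page in pages:
        last: Optional[Dict] = None
        for index, table in enumerate(page["tables"]):
            if (
                index == 0
                and candidate is not None
                and candidate["columns"] == table["columns"]
            ):
                # Continuation: append rows and adopt the new bottom flag.
                candidate["data"].extend(table["data"])
                candidate["extends_to_bottom"] = table["extends_to_bottom"]
                last = candidate
            else:
                # Copy so the input pages are left unmodified.
                last = {**table, "data": list(table["data"])}
                merged.append(last)
        # Only a table that reaches the page bottom can continue on the next page.
        candidate = last if last is not None and last["extends_to_bottom"] else None
    return merged


pages = [
    {"tables": [{"columns": ["A", "B"], "data": [["1", "2"]],
                 "extends_to_bottom": True}]},
    {"tables": [{"columns": ["A", "B"], "data": [["3", "4"]],
                 "extends_to_bottom": False},
                {"columns": ["X"], "data": [["9"]],
                 "extends_to_bottom": False}]},
]
merged_tables = merge_cross_page_tables(pages)
```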

Error Handling

The pipeline implements comprehensive error handling:

  • Each extraction component has independent error handling
  • Failed extractions are logged but don't stop processing
  • Partial results are returned even if some pages fail
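
The pattern described above can be sketched as a small wrapper that isolates each extraction step. The safe_extract helper and the logger name are hypothetical; the project's exceptions.py hierarchy is not reproduced here.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("pdf_parser_sketch")


def safe_extract(name: str, extractor: Callable[[], Any], default: Any) -> Any:
    """Run one extraction step; log and return a default if it raises."""
    try:
        return extractor()
    except Exception:
        # logger.exception records the traceback without stopping the pipeline.
        logger.exception("%s extraction failed; continuing with default", name)
        return default


def failing_tables() -> list:
    raise ValueError("malformed table")


# The text step succeeds and the table step fails, yet the page is still built.
page = {
    "text": safe_extract("text", lambda: "hello", default=""),
    "tables": safe_extract("tables", failing_tables, default=[]),
}
```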

Performance Optimization

  • Concurrent page processing with configurable limits
  • Parallel extraction of text, tables, and images per page
  • Efficient memory usage with streaming processing
  • Optimized regex patterns for chart data extraction

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

Development Setup

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Install development dependencies: pip install -r requirements-dev.txt
  4. Make your changes
  5. Run tests: pytest
  6. Commit your changes (git commit -m 'Add some AmazingFeature')
  7. Push to the branch (git push origin feature/AmazingFeature)
  8. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Troubleshooting

Common Issues

  1. OCR not working: Ensure Tesseract is installed and in your PATH
  2. Memory issues with large PDFs: Reduce PDF_MAX_CONCURRENT_PAGES
  3. Table extraction missing data: Check if the PDF contains actual tables or chart images

Debug Mode

Enable debug logging:

PDF_LOG_LEVEL=DEBUG python main.py document.pdf
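
For reference, a variable like PDF_LOG_LEVEL is typically mapped onto Python's logging module along these lines. This is a sketch under assumptions: configure_logging is a hypothetical function, and the INFO default is a guess, not the project's documented behavior.

```python
import logging
import os


def configure_logging() -> int:
    """Set the root log level from PDF_LOG_LEVEL and return the level used."""
    level_name = os.environ.get("PDF_LOG_LEVEL", "INFO").upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        level = logging.INFO  # unknown names fall back to the assumed default
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    return level
```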

Citation

If you use this parser in your research, please cite:

@software{pdf_parser_pipeline,
  title = {PDF Parser Pipeline},
  year = {2025},
  url = {https://github.com/gaimplan/pdf-parser}
}
