PDF Parser Pipeline

A PDF parsing pipeline that extracts text, tables, and images from PDF documents into structured JSON. It is designed as the first stage of a multimodal RAG (Retrieval-Augmented Generation) system.

Features

Multi-Strategy Table Extraction: Detects tables using multiple strategies including line-based, text-based, and chart data extraction
Chart Data Recognition: Automatically extracts data from bar charts and figures with percentage values
OCR Fallback: Uses Tesseract OCR when text extraction fails
Concurrent Processing: Processes pages concurrently, up to a configurable limit (PDF_MAX_CONCURRENT_PAGES)
Cross-Page Table Merging: Automatically detects and merges tables that span multiple pages
Comprehensive Output: Exports text and tables as structured JSON, with full-page images as base64-encoded strings
Error Resilience: Continues processing even when individual components fail

Installation

Prerequisites

Python 3.8+
Tesseract OCR (required for text extraction fallback)

System Dependencies

macOS

brew install tesseract

Ubuntu/Debian

sudo apt-get update
sudo apt-get install tesseract-ocr

Windows

Download and run the Windows installer from the UB-Mannheim/tesseract project, then make sure the install directory is on your PATH.

Python Installation

Clone the repository:

git clone https://github.com/gaimplan/pdf-parser.git
cd pdf-parser

Create a virtual environment (recommended):

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Usage

Basic Usage

python main.py /path/to/your/document.pdf

The parsed output will be saved to output/output_parsed.json by default.

Python API

import asyncio
from main import process_pdf

# Process a PDF file
asyncio.run(process_pdf("path/to/document.pdf"))

Configuration

The parser can be configured using environment variables:

# Image settings
PDF_IMAGE_DPI=144                    # DPI for image extraction
PDF_OCR_RESOLUTION=150               # Resolution for OCR processing
PDF_IMAGE_FORMAT=JPEG                # Output image format

# Table detection settings
PDF_TABLE_Y_TOLERANCE=3              # Tolerance for table row alignment
PDF_TABLE_BOTTOM_THRESHOLD=75        # Threshold for bottom detection
PDF_TABLE_HEIGHT_RATIO=0.2           # Minimum table height ratio
PDF_TABLE_BOTTOM_MARGIN=100          # Bottom margin for tables

# Processing settings
PDF_MAX_CONCURRENT_PAGES=4           # Max pages to process concurrently
PDF_ENABLE_OCR_FALLBACK=true         # Enable OCR fallback
PDF_ENABLE_CAMELOT_FALLBACK=true    # Enable Camelot fallback

# Output settings
PDF_OUTPUT_DIR=output                # Output directory
PDF_OUTPUT_FILENAME=output_parsed.json # Output filename
PDF_JSON_INDENT=2                    # JSON indentation
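
As a rough illustration of how such variables are typically consumed, the sketch below reads a few of them into a typed settings object. It assumes the values listed above are also the defaults; `ParserSettings`, `env_bool`, and `load_settings` are hypothetical names for this example and are not taken from config.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def env_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Interpret common truthy strings; fall back to the default if unset."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ParserSettings:
    # Hypothetical container; field names are illustrative only.
    image_dpi: int
    max_concurrent_pages: int
    enable_ocr_fallback: bool
    output_dir: str
    output_filename: str


def load_settings(env: Mapping[str, str] = os.environ) -> ParserSettings:
    """Read some of the documented PDF_* variables, assuming the listed values are defaults."""
    return ParserSettings(
        image_dpi=int(env.get("PDF_IMAGE_DPI", "144")),
        max_concurrent_pages=int(env.get("PDF_MAX_CONCURRENT_PAGES", "4")),
        enable_ocr_fallback=env_bool(env, "PDF_ENABLE_OCR_FALLBACK", True),
        output_dir=env.get("PDF_OUTPUT_DIR", "output"),
        output_filename=env.get("PDF_OUTPUT_FILENAME", "output_parsed.json"),
    )
```

Passing the environment as a mapping keeps the loader easy to test, for example `load_settings({"PDF_MAX_CONCURRENT_PAGES": "2"})`.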

Output Format

The parser outputs a JSON structure with the following schema:

{
  "pdf_document": {
    "document_id": "doc_filename.pdf",
    "filename": "filename.pdf",
    "total_pages": 10,
    "metadata": {}
  },
  "pages": [
    {
      "page_id": "page_1",
      "pdf_title": "filename.pdf",
      "text": "Extracted text content...",
      "tables": [
        {
          "columns": ["Column1", "Column2"],
          "data": [
            ["Row1Col1", "Row1Col2"],
            ["Row2Col1", "Row2Col2"]
          ],
          "extends_to_bottom": false,
          "chart_data": true
        }
      ],
      "image_base64": ["base64_encoded_page_image"]
    }
  ]
}

The chart_data flag indicates that a table's data was extracted from a chart.
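
For a downstream consumer, the schema above can be read back with the standard library alone. The following is a minimal sketch, assuming the output file follows the structure shown; `load_parsed`, `iter_table_rows`, and `decode_page_images` are hypothetical helper names and not part of the project.

```python
import base64
import json
from typing import Any, Dict, Iterator, List


def load_parsed(path: str = "output/output_parsed.json") -> Dict[str, Any]:
    """Load the parser's JSON output from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def iter_table_rows(parsed: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per table row, keyed by column name and tagged with its page_id."""
    for page in parsed.get("pages", []):
        for table in page.get("tables", []):
            columns = table.get("columns", [])
            for row in table.get("data", []):
                record = dict(zip(columns, row))
                record["_page_id"] = page.get("page_id")
                yield record


def decode_page_images(page: Dict[str, Any]) -> List[bytes]:
    """Decode the base64 page images back into raw image bytes."""
    return [base64.b64decode(encoded) for encoded in page.get("image_base64", [])]
```

Flattening rows into dicts keyed by column name is one convenient shape for indexing into a retrieval system; the `_page_id` key is added here only so each row can be traced back to its page.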

Architecture

Core Components

main.py: Entry point and orchestration
parsers.py: Content extraction logic with multi-strategy approach
models.py: Pydantic data models for validation
utils.py: Helper functions for image processing
config.py: Configuration management
exceptions.py: Custom exception hierarchy

Processing Flow

PDF loaded with pdfplumber
Pages processed concurrently via asyncio
For each page:
- Text extracted with OCR fallback
- Tables detected using multiple strategies
- Chart data extracted from figures
- Full page rendered as base64 image
Results aggregated into structured JSON
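
The concurrency step in the flow above can be sketched with an asyncio semaphore. This is an illustration of the general pattern, not the code in main.py; `process_pages` and its `process_page` callback are hypothetical names, and the default limit of 4 mirrors PDF_MAX_CONCURRENT_PAGES.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Sequence


async def process_pages(
    pages: Sequence[Any],
    process_page: Callable[[Any], Awaitable[dict]],
    max_concurrent: int = 4,
) -> List[dict]:
    """Run process_page over all pages, with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(page: Any) -> dict:
        # The semaphore blocks here once max_concurrent pages are running.
        async with semaphore:
            return await process_page(page)

    # gather preserves input order, so results line up with page order.
    results = await asyncio.gather(*(bounded(page) for page in pages))
    return list(results)
```

Because PDF libraries are generally blocking, a real `process_page` would likely hand the heavy work to a thread or process pool (for example via `loop.run_in_executor`) so that the event loop stays responsive.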

Table Extraction Strategies

Lines Strict: For well-formatted tables with clear borders
Lines: For tables with partial borders
Text-based: For borderless tables
Chart Data: Extracts data from bar charts and figures
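
The first three strategies correspond to values that pdfplumber accepts for `vertical_strategy` and `horizontal_strategy` in its table settings. The sketch below tries them from strictest to loosest and stops at the first that finds a table. The ordering and the `extract_tables_with_fallback` helper are assumptions for illustration; parsers.py may combine the strategies differently.

```python
from typing import Any, Dict, List, Optional, Tuple

# pdfplumber table_settings for each named strategy, tried in this order.
STRATEGIES: List[Tuple[str, Dict[str, str]]] = [
    ("lines_strict", {"vertical_strategy": "lines_strict", "horizontal_strategy": "lines_strict"}),
    ("lines", {"vertical_strategy": "lines", "horizontal_strategy": "lines"}),
    ("text", {"vertical_strategy": "text", "horizontal_strategy": "text"}),
]


def extract_tables_with_fallback(page: Any) -> Tuple[Optional[str], List[list]]:
    """Return (strategy_name, tables) from the first strategy that finds any table.

    `page` is expected to behave like a pdfplumber Page, i.e. to expose
    extract_tables(table_settings=...).
    """
    for name, settings in STRATEGIES:
        tables = page.extract_tables(table_settings=settings)
        if tables:
            return name, tables
    return None, []
```

With pdfplumber installed, this could be called as `extract_tables_with_fallback(pdf.pages[0])` inside a `with pdfplumber.open(path) as pdf:` block.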

Advanced Features

Chart Data Extraction

The parser can extract structured data from charts and figures that display percentage values. This is particularly useful for:

Bar charts
Pie charts with labeled segments
Infographics with percentage data
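
One plausible way to recover such data is to scan the page text for lines that pair a label with a percentage. The pattern below is a simplified guess at that idea, not the project's actual regex; `PERCENT_LINE` and `extract_percentages` are hypothetical names, and the result is shaped like a table entry in the output schema.

```python
import re
from typing import Any, Dict, List

# A label that does not start with a digit, an optional ":" or "-" separator,
# then a percentage at the end of the line, e.g. "Renewables 42.5%".
PERCENT_LINE = re.compile(
    r"^\s*(?P<label>[^\d\s].*?)\s*[:\-]?\s*(?P<value>\d+(?:\.\d+)?)\s*%\s*$"
)


def extract_percentages(text: str) -> Dict[str, Any]:
    """Collect label/percentage pairs from text into a chart-style table dict."""
    rows: List[List[str]] = []
    for line in text.splitlines():
        match = PERCENT_LINE.match(line)
        if match:
            rows.append([match.group("label").strip(), match.group("value") + "%"])
    return {"columns": ["Label", "Value"], "data": rows, "chart_data": True}
```

Lines that hold only a number, such as an axis tick reading "50%", are skipped because the label must begin with a non-digit character. Charts whose labels and values sit on separate lines would need positional matching instead.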

Cross-Page Table Detection

Tables that span multiple pages are automatically detected and merged based on:

Column header matching
Table position at page boundaries
Content continuity
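
The first two criteria can be sketched against the output schema: a table flagged with extends_to_bottom is joined to the first table on the next page when their column headers match. This is a simplified illustration that omits the content-continuity check; `merge_cross_page_tables` and the `page_ids` key are hypothetical and not taken from the project.

```python
from typing import Any, Dict, List, Optional


def merge_cross_page_tables(pages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return all tables in document order, merging those that continue across pages."""
    merged: List[Dict[str, Any]] = []
    open_table: Optional[Dict[str, Any]] = None  # table that may continue on the next page
    for page in pages:
        tables = page.get("tables", [])
        for index, table in enumerate(tables):
            continues = (
                index == 0  # only the first table on a page can be a continuation
                and open_table is not None
                and open_table["columns"] == table["columns"]
            )
            if continues:
                open_table["data"].extend(table["data"])
                open_table["page_ids"].append(page["page_id"])
                open_table["extends_to_bottom"] = table.get("extends_to_bottom", False)
            else:
                merged.append({
                    "columns": list(table["columns"]),
                    "data": list(table["data"]),  # copy so the input is not mutated
                    "extends_to_bottom": table.get("extends_to_bottom", False),
                    "page_ids": [page["page_id"]],
                })
        # Only the last table on a page can run on; a page without tables breaks the chain.
        last = merged[-1] if tables else None
        open_table = last if last is not None and last["extends_to_bottom"] else None
    return merged
```

Continuation pages that do not repeat the header row would not match under this rule, which is one reason a real implementation also needs a content-based check.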

Error Handling

The pipeline isolates failures so that one failing component or page does not abort the run:

Each extraction component has independent error handling
Failed extractions are logged but don't stop processing
Partial results are returned even if some pages fail
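
The behaviour described above amounts to wrapping each extractor so that a failure is logged and replaced by a default value. The sketch below shows that pattern in isolation; `safe_extract`, `extract_page`, and the logger name are hypothetical, and the project's own exception hierarchy in exceptions.py is not modelled here.

```python
import logging
from typing import Any, Callable, Dict, Tuple

logger = logging.getLogger("pdf_parser_sketch")  # hypothetical logger name


def safe_extract(name: str, extractor: Callable[[Any], Any], page: Any, default: Any) -> Any:
    """Run one extractor; on any failure, log the traceback and return the default."""
    try:
        return extractor(page)
    except Exception:
        logger.exception("%s extraction failed; using default", name)
        return default


def extract_page(
    page: Any,
    extractors: Dict[str, Tuple[Callable[[Any], Any], Any]],
) -> Dict[str, Any]:
    """Apply every extractor independently so one failure cannot mask the others."""
    return {
        name: safe_extract(name, extractor, page, default)
        for name, (extractor, default) in extractors.items()
    }
```

With this shape, a page whose table extraction raises still yields its text, and the failed field falls back to an empty list.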

Performance Optimization

Concurrent page processing with configurable limits
Parallel extraction of text, tables, and images per page
Efficient memory usage with streaming processing
Optimized regex patterns for chart data extraction

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

Development Setup

Fork the repository
Create your feature branch (git checkout -b feature/AmazingFeature)
Install development dependencies: pip install -r requirements-dev.txt
Make your changes
Run tests: pytest
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Built with pdfplumber for PDF parsing
Uses Tesseract OCR for text recognition
Uses pdf2image for page rendering

Troubleshooting

Common Issues

OCR not working: Ensure Tesseract is installed and in your PATH
Memory issues with large PDFs: Reduce PDF_MAX_CONCURRENT_PAGES
Table extraction missing data: Check if the PDF contains actual tables or chart images

Debug Mode

Enable debug logging:

PDF_LOG_LEVEL=DEBUG python main.py document.pdf
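
When the parser is driven from Python rather than the command line, the same variable can be mapped onto the standard logging module. This is a minimal sketch, assuming INFO is used when the variable is unset or invalid; `configure_logging` is a hypothetical helper, not a function exported by the project.

```python
import logging
import os
from typing import Mapping


def configure_logging(env: Mapping[str, str] = os.environ) -> int:
    """Set the root log level from PDF_LOG_LEVEL and return the level used."""
    level_name = env.get("PDF_LOG_LEVEL", "INFO").upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        level = logging.INFO  # assumed fallback for unknown level names
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    return level
```

Note that `logging.basicConfig` only takes effect the first time it configures the root logger, so this should be called once, before any other logging setup.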

Citation

If you use this parser in your research, please cite:

@software{pdf_parser_pipeline,
  title = {PDF Parser Pipeline},
  year = {2025},
  url = {https://github.com/gaimplan/pdf-parser}
}
