A robust PDF parsing pipeline that extracts text, tables, and images from PDF documents into structured JSON format. Designed as the first stage in a multimodal RAG (Retrieval-Augmented Generation) system.
- Multi-Strategy Table Extraction: Detects tables using multiple strategies including line-based, text-based, and chart data extraction
- Chart Data Recognition: Automatically extracts data from bar charts and figures with percentage values
- OCR Fallback: Uses Tesseract OCR when text extraction fails
- Concurrent Processing: Processes pages in parallel, up to a configurable limit (PDF_MAX_CONCURRENT_PAGES)
- Cross-Page Table Merging: Automatically detects and merges tables that span multiple pages
- Comprehensive Output: Exports text, tables, and full-page images as base64-encoded strings
- Error Resilience: Continues processing even when individual components fail
- Python 3.8+
- Tesseract OCR (required for text extraction fallback)
macOS:
brew install tesseract
Ubuntu/Debian:
sudo apt-get update
sudo apt-get install tesseract-ocr
Windows:
Download and install from UB-Mannheim/tesseract
- Clone the repository:
git clone https://github.com/gaimplan/pdf-parser.git
cd pdf-parser
- Create a virtual environment (recommended):
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
python main.py /path/to/your/document.pdf
The parsed output is saved to output/output_parsed.json by default.
import asyncio
from main import process_pdf
# Process a PDF file
asyncio.run(process_pdf("path/to/document.pdf"))
The parser can be configured using environment variables:
# Image settings
PDF_IMAGE_DPI=144 # DPI for image extraction
PDF_OCR_RESOLUTION=150 # Resolution for OCR processing
PDF_IMAGE_FORMAT=JPEG # Output image format
# Table detection settings
PDF_TABLE_Y_TOLERANCE=3 # Tolerance for table row alignment
PDF_TABLE_BOTTOM_THRESHOLD=75 # Threshold for bottom detection
PDF_TABLE_HEIGHT_RATIO=0.2 # Minimum table height ratio
PDF_TABLE_BOTTOM_MARGIN=100 # Bottom margin for tables
# Processing settings
PDF_MAX_CONCURRENT_PAGES=4 # Max pages to process concurrently
PDF_ENABLE_OCR_FALLBACK=true # Enable OCR fallback
PDF_ENABLE_CAMELOT_FALLBACK=true # Enable Camelot fallback
# Output settings
PDF_OUTPUT_DIR=output # Output directory
PDF_OUTPUT_FILENAME=output_parsed.json # Output filename
PDF_JSON_INDENT=2 # JSON indentation
The parser outputs a JSON structure with the following schema:
{
"pdf_document": {
"document_id": "doc_filename.pdf",
"filename": "filename.pdf",
"total_pages": 10,
"metadata": {}
},
"pages": [
{
"page_id": "page_1",
"pdf_title": "filename.pdf",
"text": "Extracted text content...",
"tables": [
{
"columns": ["Column1", "Column2"],
"data": [
["Row1Col1", "Row1Col2"],
["Row2Col1", "Row2Col2"]
],
"extends_to_bottom": false,
"chart_data": true // Indicates data extracted from charts
}
],
"image_base64": ["base64_encoded_page_image"]
}
]
}
- main.py: Entry point and orchestration
- parsers.py: Content extraction logic with multi-strategy approach
- models.py: Pydantic data models for validation
- utils.py: Helper functions for image processing
- config.py: Configuration management
- exceptions.py: Custom exception hierarchy
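As a rough illustration of the configuration management that config.py is described as providing, the sketch below reads a few of the environment variables listed earlier into a typed object. The class name `ParserConfig`, the helper `_env_bool`, and the set of accepted truthy strings are assumptions made for this example and are not taken from the project's code; the fallback values simply reuse the example values shown above.

```python
import os
from dataclasses import dataclass


def _env_bool(name: str, default: bool) -> bool:
    """Interpret common truthy strings; any other value counts as False."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ParserConfig:  # hypothetical name, not taken from config.py
    image_dpi: int
    max_concurrent_pages: int
    enable_ocr_fallback: bool
    output_dir: str
    output_filename: str

    @classmethod
    def from_env(cls) -> "ParserConfig":
        # Fallbacks reuse the example values from the README (assumed defaults).
        return cls(
            image_dpi=int(os.environ.get("PDF_IMAGE_DPI", "144")),
            max_concurrent_pages=int(os.environ.get("PDF_MAX_CONCURRENT_PAGES", "4")),
            enable_ocr_fallback=_env_bool("PDF_ENABLE_OCR_FALLBACK", True),
            output_dir=os.environ.get("PDF_OUTPUT_DIR", "output"),
            output_filename=os.environ.get("PDF_OUTPUT_FILENAME", "output_parsed.json"),
        )
```

Parsing booleans explicitly avoids the common pitfall where any non-empty string, including "false", is treated as true.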
- PDF loaded with pdfplumber
- Pages processed concurrently via asyncio
- For each page:
  - Text extracted with OCR fallback
  - Tables detected using multiple strategies
  - Chart data extracted from figures
  - Full page rendered as base64 image
- Results aggregated into structured JSON
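The flow above can be sketched with asyncio and a semaphore that caps how many pages are in flight at once. This is a minimal sketch under stated assumptions, not the project's implementation: `process_page` is a hypothetical stand-in for the real per-page work, and `process_all_pages` is a name invented for this example.

```python
import asyncio
from typing import Any, Dict, List


async def process_page(page_number: int) -> Dict[str, Any]:
    """Hypothetical stand-in for the real per-page extraction."""
    await asyncio.sleep(0)  # placeholder for text, table, and image work
    return {"page_id": f"page_{page_number}", "text": "", "tables": []}


async def process_all_pages(total_pages: int, max_concurrent: int = 4) -> List[Dict[str, Any]]:
    """Process every page, with at most `max_concurrent` running at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(page_number: int) -> Dict[str, Any]:
        async with semaphore:  # blocks while max_concurrent pages are active
            return await process_page(page_number)

    # gather returns results in the order the awaitables were passed,
    # so the output stays in page order regardless of completion order.
    return await asyncio.gather(*(bounded(n) for n in range(1, total_pages + 1)))


if __name__ == "__main__":
    pages = asyncio.run(process_all_pages(total_pages=3, max_concurrent=2))
    print([p["page_id"] for p in pages])
```

Lowering `max_concurrent` trades throughput for a smaller peak memory footprint, which is the same trade-off PDF_MAX_CONCURRENT_PAGES exposes.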
- Lines Strict: For well-formatted tables with clear borders
- Lines: For tables with partial borders
- Text-based: For borderless tables
- Chart Data: Extracts data from bar charts and figures
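One way to combine the table strategies listed above is to try them in order and keep the first that yields any tables. The sketch below shows that idea only; the strictest-first ordering and the function name `extract_with_fallback` are assumptions, and the parser's actual selection logic may differ. The settings dictionaries use pdfplumber's `vertical_strategy` and `horizontal_strategy` keys.

```python
from typing import Callable, Dict, List, Optional, Tuple

Table = List[List[Optional[str]]]

# Strategy names follow pdfplumber's table_settings values; the ordering
# (strictest first) is an assumption, not the project's confirmed logic.
STRATEGIES: List[Tuple[str, Dict[str, str]]] = [
    ("lines_strict", {"vertical_strategy": "lines_strict", "horizontal_strategy": "lines_strict"}),
    ("lines", {"vertical_strategy": "lines", "horizontal_strategy": "lines"}),
    ("text", {"vertical_strategy": "text", "horizontal_strategy": "text"}),
]


def extract_with_fallback(
    extract: Callable[[Dict[str, str]], List[Table]],
) -> Tuple[Optional[str], List[Table]]:
    """Try each strategy in order and return the first non-empty result."""
    for name, settings in STRATEGIES:
        tables = extract(settings)
        if tables:
            return name, tables
    return None, []
```

With pdfplumber, the `extract` callable could be `lambda s: page.extract_tables(table_settings=s)` for an open page object. Passing the extractor in as a callable keeps the fallback logic testable without a PDF on hand.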
The parser can extract structured data from charts and figures that display percentage values. This is particularly useful for:
- Bar charts
- Pie charts with labeled segments
- Infographics with percentage data
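To illustrate how labeled percentage values might be pulled out of the text surrounding a chart, here is a small regex-based sketch. The pattern and the function name `extract_percentages` are illustrative assumptions; the project's own patterns are not shown in this README and may be more elaborate. The returned dictionary mirrors the table shape from the output schema.

```python
import re
from typing import Any, Dict, List

# Matches "<label> <number>%" on one line, e.g. "Strongly agree 42.5%".
# Labels containing digits are deliberately not matched by this simple pattern.
PERCENT_LINE = re.compile(
    r"^\s*(?P<label>[^\d%\n]+?)\s*[:\-]?\s*(?P<value>\d+(?:\.\d+)?)\s*%\s*$"
)


def extract_percentages(text: str) -> Dict[str, Any]:
    """Collect label/percentage pairs into a table-shaped dictionary."""
    rows: List[List[str]] = []
    for line in text.splitlines():
        match = PERCENT_LINE.match(line)
        if match:
            rows.append([match.group("label").strip(), match.group("value") + "%"])
    return {
        "columns": ["Label", "Value"],
        "data": rows,
        "extends_to_bottom": False,
        "chart_data": True,  # marks rows as chart-derived, as in the schema
    }
```

Compiling the pattern once at module level avoids recompiling it for every page.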
Tables that span multiple pages are automatically detected and merged based on:
- Column header matching
- Table position at page boundaries
- Content continuity
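The first two criteria can be sketched against the output schema: a table is treated as continuing when the previous page's last table has `extends_to_bottom` set and the next page's first table has identical columns. This is a simplified sketch that assumes exact header equality and does not attempt the content-continuity check; `merge_cross_page_tables` is a hypothetical name.

```python
from typing import Any, Dict, List, Optional


def merge_cross_page_tables(pages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Fold a page's first table into the previous page's last table when
    the earlier table reaches the page bottom and the headers match.
    Pages are modified in place and also returned."""
    open_table: Optional[Dict[str, Any]] = None  # table that may continue
    for page in pages:
        tables = page["tables"]
        merged = False
        if open_table is not None and tables and tables[0]["columns"] == open_table["columns"]:
            continuation = tables.pop(0)
            open_table["data"].extend(continuation["data"])
            # The merged table now ends wherever its continuation ended.
            open_table["extends_to_bottom"] = continuation["extends_to_bottom"]
            merged = True
        if tables:
            candidate = tables[-1]
        elif merged:
            candidate = open_table  # the continuation was the page's only table
        else:
            candidate = None  # a page without tables breaks the chain
        if candidate is not None and candidate["extends_to_bottom"]:
            open_table = candidate
        else:
            open_table = None
    return pages
```

Because a merged table inherits the continuation's `extends_to_bottom` flag, a table spanning three or more pages is folded together one page at a time.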
Errors are handled per component, so a failure in one place does not abort the run:
- Each extraction component has independent error handling
- Failed extractions are logged but don't stop processing
- Partial results are returned even if some pages fail
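The pattern described above could look roughly like the following, where each extraction step is wrapped so that a failure is logged and replaced by an empty default. The helper names `safe_extract` and `extract_page`, the logger name, and the choice of defaults are assumptions made for this sketch rather than the project's actual code.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger("pdf_parser")  # logger name is an assumption


def safe_extract(name: str, extractor: Callable[[], Any], default: Any) -> Any:
    """Run one extraction step; log the error and return `default` if it raises."""
    try:
        return extractor()
    except Exception:
        logger.exception("%s extraction failed; continuing", name)
        return default


def extract_page(page_number: int, extractors: Dict[str, Callable[[], Any]]) -> Dict[str, Any]:
    """Build a page record in which each field can fail independently."""
    return {
        "page_id": f"page_{page_number}",
        "text": safe_extract("text", extractors["text"], ""),
        "tables": safe_extract("tables", extractors["tables"], []),
        "image_base64": safe_extract("image", extractors["image"], []),
    }
```

For example, if the table extractor raises, the page record still carries its text and image, with `tables` left as an empty list.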
- Concurrent page processing with configurable limits
- Parallel extraction of text, tables, and images per page
- Efficient memory usage with streaming processing
- Optimized regex patterns for chart data extraction
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Install development dependencies: pip install -r requirements-dev.txt
- Make your changes
- Run tests: pytest
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Built with pdfplumber for PDF parsing
- Uses Tesseract OCR for text recognition
- pdf2image for page rendering
- OCR not working: Ensure Tesseract is installed and in your PATH
- Memory issues with large PDFs: Reduce PDF_MAX_CONCURRENT_PAGES
- Table extraction missing data: Check if the PDF contains actual tables or chart images
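For the OCR item above, a quick way to confirm that Tesseract is reachable from Python is to look it up on the PATH and ask it for its version. This is a small diagnostic sketch, not part of the parser; the function names are made up for this example.

```python
import shutil
import subprocess
from typing import Optional


def find_tesseract(binary: str = "tesseract") -> Optional[str]:
    """Return the resolved path of the binary, or None if it is not on PATH."""
    return shutil.which(binary)


def tesseract_version(binary: str = "tesseract") -> Optional[str]:
    """Return the first line of `tesseract --version`, or None if unavailable."""
    path = find_tesseract(binary)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True, check=False)
    output = result.stdout or result.stderr  # some builds print the version to stderr
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    print(tesseract_version() or "Tesseract not found on PATH")
```

If this reports that Tesseract is missing even though it is installed, the install directory most likely needs to be added to the PATH.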
Enable debug logging:
PDF_LOG_LEVEL=DEBUG python main.py document.pdf
If you use this parser in your research, please cite:
@software{pdf_parser_pipeline,
title = {PDF Parser Pipeline},
year = {2025},
url = {https://github.com/gaimplan/pdf-parser}
}