PrecisionDoc - Medical Precision Document Processing Tool

This project processes medical guideline PDF files, especially treatment guidelines from CSCO (Chinese Society of Clinical Oncology). It can:

Process PDF files in a specified folder
Split PDF files into individual pages
Analyze each page using AI (OpenAI or Alibaba Cloud Qwen)
Extract precision medicine evidence related to drug efficacy
Save analysis results in JSON and Excel formats
Generate Word reports containing precision medicine evidence

Installation

From Source

Clone this repository
Install dependencies:

pip install -r requirements.txt

Using pip

pip install precisiondoc

Configuration

Create a .env file (refer to env.example) and set API keys:

OPENAI_API_KEY=your_openai_api_key
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_MODEL=gpt-4

QWEN_API_KEY=your_qwen_api_key
QWEN_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
QWEN_TEXT_MODEL=qwen-max
QWEN_MULTIMODAL_MODEL=qwen-vl-max

LOG_LEVEL=INFO
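
As a rough illustration of how the OPENAI_* entries above could feed the ai_settings dictionary used later in this README, here is a minimal sketch. The helper name load_openai_settings and the fallback values are assumptions made for this example; they are not part of the precisiondoc API, and the package may read its configuration differently.

```python
import os
from typing import Mapping, Optional


def load_openai_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build an ai_settings dict from the OPENAI_* variables.

    Hypothetical helper for illustration; not part of precisiondoc.
    """
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise ValueError("OPENAI_API_KEY is not set")
    return {
        "api_key": api_key,
        # Fallbacks mirror the sample .env values (an assumption).
        "base_url": env.get("OPENAI_BASE_URL", "https://api.openai.com/v1"),
        "model": env.get("OPENAI_MODEL", "gpt-4"),
    }


# Passing a plain dict keeps the example independent of the real environment.
settings = load_openai_settings({"OPENAI_API_KEY": "your_openai_api_key"})
print(settings["model"])
```

With python-dotenv installed, calling load_dotenv() before this helper would copy the .env entries into os.environ, so the helper could then be called without arguments.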

Dependencies

The project requires the following main dependencies:

PyMuPDF: PDF processing
openai: OpenAI API client
pandas and openpyxl: Data processing and Excel file handling
python-docx: Word document generation
python-dotenv: Environment variable management
numpy: Numerical operations
requests: HTTP requests
tqdm: Progress bars

All dependencies are listed in requirements.txt.

Usage

Command Line Interface

After installation, you can use the precisiondoc command:

# Process PDF files
precisiondoc process-pdf --folder /path/to/pdfs --output-folder ./output

# Convert Excel to Word
precisiondoc excel-to-word --excel-file /path/to/evidence.xlsx --multi-line --show-borders

Python API

You can also use PrecisionDoc as a Python package:

# Import the package
from precisiondoc import process_pdf, excel_to_word, process_single_pdf

# Process PDF files
results = process_pdf(
    folder_path="/path/to/pdfs",
    output_folder="./output",
    ai_settings={
        "api_key": "your-api-key",
        "base_url": "https://api.example.com/v1",
        "model": "gpt-4"
    }
)

# Process a single PDF file
results = process_single_pdf(
    pdf_path="/path/to/document.pdf",
    doc_type="DocumentName",  # Optional, will use filename if not provided
    output_folder="./output",  # Optional
    ai_settings={
        "api_key": "your-api-key",
        "base_url": "https://api.example.com/v1",
        "model": "gpt-4"
    },
    multi_line_text=True,  # Optional
    show_borders=True,  # Optional
    page_settings={  # Optional, controls Word document page layout
        'orientation': 'landscape',  # 'landscape' or 'portrait'
        'margins': {  # Optional custom margins in inches
            'left': 0.75,
            'right': 0.5,
            'top': 0.5,
            'bottom': 0.75
        }
    }
)

# Convert Excel evidence to Word
word_file = excel_to_word(
    excel_file="/path/to/evidence.xlsx",
    word_file="/path/to/output.docx",  # Optional
    multi_line_text=True,  # Optional
    show_borders=True,  # Optional
    exclude_columns=["column1", "column2"]  # Optional
)

Advanced Usage

For more advanced usage, you can directly use the classes provided by the package:

from precisiondoc import PDFProcessor, WordUtils, DataUtils

# Create a PDF processor
processor = PDFProcessor(
    folder_path="/path/to/pdfs",
    output_folder="./output",
    ai_settings={
        "api_key": "your-api-key",
        "base_url": "https://api.example.com/v1",
        "model": "gpt-4"
    }
)

# Process all PDFs
results = processor.process_all()

# Save results
processor.save_consolidated_results(results)

# Work with data utilities
data_utils = DataUtils()
df = data_utils.load_excel_file("/path/to/evidence.xlsx")

# Export to Word with custom formatting
WordUtils.export_evidence_to_word(
    excel_file=df,
    word_file="/path/to/output.docx",
    multi_line_text=True,
    show_borders=False,
    exclude_columns=["column1", "column2"]
)

Environment Variables

The package uses the following environment variables:

API_KEY: API key for AI service
BASE_URL: Base URL for API endpoint
TEXT_MODEL: Model name for text processing
MULTIMODAL_MODEL: Model name for image processing
LOG_LEVEL: Logging level (default: INFO)

You can set these variables in a .env file or directly in your environment.
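
For example, the LOG_LEVEL variable with its INFO default could be applied roughly as follows. This is a sketch using only the standard logging module; the function name configure_logging and the fallback to INFO for unknown values are assumptions, not the package's actual behavior.

```python
import logging
import os

# Level names accepted by this sketch; anything else falls back to INFO.
_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def configure_logging(env=None) -> int:
    """Set the root logger level from LOG_LEVEL and return the numeric level."""
    env = os.environ if env is None else env
    name = env.get("LOG_LEVEL", "INFO").upper()
    level = _LEVELS.get(name, logging.INFO)
    logging.getLogger().setLevel(level)
    return level


level = configure_logging({"LOG_LEVEL": "debug"})
print(logging.getLevelName(level))
```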

Parameters

Command Line Parameters

--folder: Path to the folder containing PDF files (required)
--api-key: API key for OpenAI or Qwen (if not provided, will be read from environment variables)
--use-qwen: Use Qwen API instead of OpenAI (optional)
--output-folder: Output folder path (optional, default: "./output")
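
To make the defaults above concrete, the following argparse sketch declares the same four options for a process-pdf subcommand. It only mirrors the documented flags; the real precisiondoc parser may be organized differently, and build_parser is a name chosen for this example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented process-pdf options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="precisiondoc")
    sub = parser.add_subparsers(dest="command", required=True)

    process = sub.add_parser("process-pdf", help="Process PDF files in a folder")
    process.add_argument("--folder", required=True,
                         help="Folder containing PDF files")
    process.add_argument("--api-key", default=None,
                         help="API key; read from the environment if omitted")
    process.add_argument("--use-qwen", action="store_true",
                         help="Use the Qwen API instead of OpenAI")
    process.add_argument("--output-folder", default="./output",
                         help="Output folder path")
    return parser


args = build_parser().parse_args(["process-pdf", "--folder", "/path/to/pdfs"])
print(args.output_folder, args.use_qwen)
```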

Excel to Word Parameters

--excel-file: Path to Excel file with evidence data (required)
--word-file: Path to output Word file (optional)
--output-folder: Output folder path, used to find images (optional)
--multi-line: Use multi-line text format (default: True)
--show-borders: Show table borders (default: True)
--exclude-columns: Columns to exclude from evidence text (optional)

Output

The program creates the following in the output directory:

pages/: Contains split single-page PDF files
images/: (When using Qwen) Contains PDF page image files
json/: JSON files with structured data and AI processing results
excel/: Excel files with flattened analysis results
word/: Word files with extracted precision medicine evidence reports
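
The layout above can be reproduced with pathlib, which is handy when preparing a folder for tests or inspecting results. This is a sketch under the assumption that images/ is only created when Qwen is used; create_output_layout is a hypothetical helper, not a precisiondoc function.

```python
import tempfile
from pathlib import Path

SUBFOLDERS = ("pages", "images", "json", "excel", "word")


def create_output_layout(output_folder, use_qwen=False) -> dict:
    """Create the documented subfolders and return them keyed by name."""
    root = Path(output_folder)
    created = {}
    for name in SUBFOLDERS:
        # images/ is documented as Qwen-only, so skip it otherwise.
        if name == "images" and not use_qwen:
            continue
        path = root / name
        path.mkdir(parents=True, exist_ok=True)
        created[name] = path
    return created


with tempfile.TemporaryDirectory() as tmp:
    layout = create_output_layout(tmp, use_qwen=True)
    layout_names = sorted(layout)
    print(layout_names)
```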

Word Export Features

The Word export functionality includes several advanced formatting options:

Enhanced Table Layout:
- Left side displays multiple rows of text fields (one field per row)
- Right side shows images in a single vertically merged cell
- Customizable table borders (can be shown or hidden)
- Table continuation across pages for long evidence items
Page Formatting:
- Automatic page numbering in "Page X of Y" format
- Support for both portrait and landscape orientations
- Table continuation across page breaks
Text Formatting:
- Support for multi-line text display
- Consistent font styling
Image Handling:
- Automatic resizing and centering
- Fallback mechanism for missing images
Customization Parameters:
- multi_line_text: Controls text formatting in the left cell
  - True: Creates multiple rows, one for each key-value pair
  - False: Creates a single row with JSON-style dictionary
- show_borders: Controls table border visibility
  - True: Shows all table borders
  - False: Hides table borders for a cleaner look
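
The effect of multi_line_text on the left cell can be pictured with a small pure-Python sketch. It only models the text that would go into the rows, not the Word table itself; evidence_rows and the sample field names are hypothetical.

```python
import json


def evidence_rows(record: dict, multi_line_text: bool = True) -> list:
    """Return the text rows for one evidence item.

    True:  one "key: value" row per field.
    False: a single row holding a JSON-style dictionary.
    """
    if multi_line_text:
        return [f"{key}: {value}" for key, value in record.items()]
    return [json.dumps(record, ensure_ascii=False)]


# Placeholder record; the field names are invented for this example.
record = {"drug": "example-drug", "evidence_level": "1A"}
print(evidence_rows(record, multi_line_text=True))
print(evidence_rows(record, multi_line_text=False))
```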

Latest Features

Version 0.1.4+

Parameter Validation with Pydantic

PrecisionDoc uses Pydantic for robust parameter validation:

Type Safety: All parameters are validated for correct types and formats
Default Values: Sensible defaults are provided for optional parameters
Validation Rules: Business rules are enforced (e.g., valid margin ranges)
Error Messages: Clear error messages when invalid parameters are provided
Nested Validation: Complex nested structures like page settings are fully validated

Example of page settings validation:

# Valid page settings
page_settings = {
    "orientation": "landscape",  # must be 'landscape' or 'portrait'
    "margins": {
        "left": 0.75,  # in inches
        "right": 0.5,
        "top": 0.5,
        "bottom": 0.75
    }
}

# These will be validated automatically when passed to any function
results = process_single_pdf(
    pdf_path="/path/to/document.pdf",
    page_settings=page_settings
)
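
To show what nested validation of page settings involves, here is a sketch written with standard-library dataclasses as a stand-in for Pydantic, so it runs without extra dependencies. The class names, the default values (portrait, 1.0 inch margins), and the rule that margins must be non-negative are all assumptions; the package's own Pydantic models are not shown in this README.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Margins:
    left: float = 1.0
    right: float = 1.0
    top: float = 1.0
    bottom: float = 1.0

    def __post_init__(self):
        for name in ("left", "right", "top", "bottom"):
            value = getattr(self, name)
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise TypeError(f"margin '{name}' must be a number")
            if value < 0:
                raise ValueError(f"margin '{name}' must not be negative")


@dataclass(frozen=True)
class PageSettings:
    orientation: str = "portrait"
    margins: Margins = field(default_factory=Margins)

    def __post_init__(self):
        if self.orientation not in ("landscape", "portrait"):
            raise ValueError("orientation must be 'landscape' or 'portrait'")

    @classmethod
    def from_dict(cls, data: dict) -> "PageSettings":
        # Validate the nested margins first, then the outer settings.
        margins = Margins(**data.get("margins", {}))
        return cls(orientation=data.get("orientation", "portrait"),
                   margins=margins)


checked = PageSettings.from_dict({
    "orientation": "landscape",
    "margins": {"left": 0.75, "right": 0.5, "top": 0.5, "bottom": 0.75},
})
print(checked.orientation, checked.margins.left)
```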

API Simplification

The API has been simplified:

Consolidated AI Parameters: Individual parameters (api_key, base_url, model) have been consolidated into a single ai_settings dictionary
Backward Compatibility: Legacy parameters are still supported but deprecated and will be removed in a future version
Cleaner Interface: Reduces parameter redundancy and improves code organization

Example of new API usage:

# New style (recommended)
results = process_pdf(
    folder_path="/path/to/pdfs",
    ai_settings={
        "api_key": "your-api-key",
        "base_url": "https://api.example.com/v1",
        "model": "gpt-4"
    }
)

# Legacy style (deprecated, will be removed in a future version)
results = process_pdf(
    folder_path="/path/to/pdfs",
    api_key="your-api-key",
    base_url="https://api.example.com/v1",
    model="gpt-4"
)
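
One common way to keep legacy keyword arguments working while steering callers to a consolidated dictionary is to merge them and emit a DeprecationWarning. The sketch below illustrates that pattern; resolve_ai_settings is a hypothetical name, and the choice that ai_settings wins over legacy values is an assumption rather than documented behavior.

```python
import warnings


def resolve_ai_settings(ai_settings=None, api_key=None, base_url=None,
                        model=None) -> dict:
    """Merge legacy keyword arguments into a single ai_settings dict."""
    settings = dict(ai_settings or {})
    legacy = {"api_key": api_key, "base_url": base_url, "model": model}
    used = {key: value for key, value in legacy.items() if value is not None}
    if used:
        warnings.warn(
            "api_key, base_url and model are deprecated; "
            "pass ai_settings instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # Values already present in ai_settings take precedence (assumption).
        for key, value in used.items():
            settings.setdefault(key, value)
    return settings


print(resolve_ai_settings({"api_key": "your-api-key", "model": "gpt-4"}))
```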

1:1 PDF Processing Mapping

PrecisionDoc now ensures a strict 1:1 mapping between original PDF files and their output files (JSON, Excel, Word). This means:

Each original PDF generates exactly one output file of each type
Output files are initialized at the start of processing each PDF
No redundant data accumulation on repeated runs
Improved data organization and traceability
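
The idea can be sketched as deriving one output path per type from the PDF's name and resetting the file when processing starts, so a repeated run overwrites rather than appends. The file-naming scheme and both helper names below are assumptions for illustration only.

```python
import json
import tempfile
from pathlib import Path


def output_paths_for(pdf_path, output_folder) -> dict:
    """Map one PDF to exactly one JSON, Excel, and Word path."""
    stem = Path(pdf_path).stem
    root = Path(output_folder)
    return {
        "json": root / "json" / f"{stem}.json",
        "excel": root / "excel" / f"{stem}.xlsx",
        "word": root / "word" / f"{stem}.docx",
    }


def initialize_json(path: Path) -> None:
    """Start from an empty result list, discarding data from earlier runs."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("[]", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    paths = output_paths_for("/path/to/guideline.pdf", tmp)
    initialize_json(paths["json"])
    print(paths["json"].name, json.loads(paths["json"].read_text(encoding="utf-8")))
```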

Page Metadata Enhancement

Each processed page now includes additional metadata:

Current page number
Total page count in the document
Original PDF filename

These fields enrich the JSON output with pagination context for easier organization and reference.
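
A page record carrying this metadata might be assembled as in the sketch below. The key names page_number, total_pages, and source_file are placeholders chosen for this example; the actual JSON keys written by precisiondoc are not listed in this README.

```python
def with_page_metadata(result: dict, page_number: int, total_pages: int,
                       pdf_filename: str) -> dict:
    """Return a copy of a page result with pagination metadata attached."""
    if not 1 <= page_number <= total_pages:
        raise ValueError("page_number must be between 1 and total_pages")
    enriched = dict(result)  # leave the caller's dict untouched
    enriched.update({
        "page_number": page_number,
        "total_pages": total_pages,
        "source_file": pdf_filename,
    })
    return enriched


page = with_page_metadata({"evidence": []}, 3, 120, "guideline.pdf")
print(page)
```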

Modular PDF Processing

The PDF processing pipeline has been refactored into smaller, more maintainable functions:

_initialize_output_files: Handles initialization of JSON, Excel, and Word output files
_process_pdf_pages: Processes individual PDF pages and saves intermediate results
_save_final_results: Saves final results to JSON, Excel, and Word files

Single PDF Processing

PrecisionDoc now supports processing individual PDF files directly:

Process a specific PDF file without needing to place it in a dedicated folder
Generate the same comprehensive outputs (JSON, Excel, Word) as with folder processing
Maintain the same high-quality analysis and evidence extraction
Useful for targeted processing of individual documents

Direct Excel-to-Word Conversion

Users can now convert Excel files to formatted Word documents without needing to process PDF files first:

Supports various formatting options including multi-line text vs. JSON format
Provides table borders control and column exclusion options
Accessible via both command line and Python API

Future Plans

Add support for additional PDF processing libraries for better handling of complex layouts
Implement batch processing with multi-threading to improve performance
Create a web-based user interface for easier interaction
Add support for more languages and document types
Enhance evidence extraction with more detailed categorization
Improve image handling and OCR capabilities
Add support for custom templates for Word export

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

OpenAI and Alibaba Cloud for providing the AI APIs
The open-source community for the various libraries used in this project
