This project processes medical guideline PDF files, especially treatment guidelines from CSCO (Chinese Society of Clinical Oncology). It can:
- Process PDF files in a specified folder
- Split PDF files into individual pages
- Analyze each page using AI (OpenAI or Alibaba Cloud Qwen)
- Extract precision medicine evidence related to drug efficacy
- Save analysis results in JSON and Excel formats
- Generate Word reports containing precision medicine evidence
- Clone this repository
- Install dependencies:
pip install -r requirements.txt
pip install precisiondoc
- Create a .env file (refer to env.example) and set API keys:
OPENAI_API_KEY=your_openai_api_key
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_MODEL=gpt-4
QWEN_API_KEY=your_qwen_api_key
QWEN_BASES_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
QWEN_TEXT_MODEL=qwen-max
QWEN_MULTIMODAL_MODEL=qwen-vl-max
LOG_LEVEL=INFO
The project requires the following main dependencies:
- PyMuPDF: PDF processing
- openai: OpenAI API client
- pandas and openpyxl: Data processing and Excel file handling
- python-docx: Word document generation
- python-dotenv: Environment variable management
- numpy: Numerical operations
- requests: HTTP requests
- tqdm: Progress bars
All dependencies are listed in requirements.txt.
After installation, you can use the precisiondoc command:
# Process PDF files
precisiondoc process-pdf --folder /path/to/pdfs --output-folder ./output
# Convert Excel to Word
precisiondoc excel-to-word --excel-file /path/to/evidence.xlsx --multi-line --show-borders
You can also use PrecisionDoc as a Python package:
# Import the package
from precisiondoc import process_pdf, excel_to_word, process_single_pdf
# Process PDF files
results = process_pdf(
    folder_path="/path/to/pdfs",
    output_folder="./output",
    ai_settings={
        "api_key": "your-api-key",
        "base_url": "https://api.example.com/v1",
        "model": "gpt-4"
    }
)
# Process a single PDF file
results = process_single_pdf(
    pdf_path="/path/to/document.pdf",
    doc_type="DocumentName",  # Optional, will use filename if not provided
    output_folder="./output",  # Optional
    ai_settings={
        "api_key": "your-api-key",
        "base_url": "https://api.example.com/v1",
        "model": "gpt-4"
    },
    multi_line_text=True,  # Optional
    show_borders=True,  # Optional
    page_settings={  # Optional, controls Word document page layout
        'orientation': 'landscape',  # 'landscape' or 'portrait'
        'margins': {  # Optional custom margins in inches
            'left': 0.75,
            'right': 0.5,
            'top': 0.5,
            'bottom': 0.75
        }
    }
)
# Convert Excel evidence to Word
word_file = excel_to_word(
    excel_file="/path/to/evidence.xlsx",
    word_file="/path/to/output.docx",  # Optional
    multi_line_text=True,  # Optional
    show_borders=True,  # Optional
    exclude_columns=["column1", "column2"]  # Optional
)
For more advanced usage, you can directly use the classes provided by the package:
from precisiondoc import PDFProcessor, WordUtils, DataUtils
# Create a PDF processor
processor = PDFProcessor(
    folder_path="/path/to/pdfs",
    output_folder="./output",
    ai_settings={
        "api_key": "your-api-key",
        "base_url": "https://api.example.com/v1",
        "model": "gpt-4"
    }
)
# Process all PDFs
results = processor.process_all()
# Save results
processor.save_consolidated_results(results)
# Work with data utilities
data_utils = DataUtils()
df = data_utils.load_excel_file("/path/to/evidence.xlsx")
# Export to Word with custom formatting
WordUtils.export_evidence_to_word(
    excel_file=df,
    word_file="/path/to/output.docx",
    multi_line_text=True,
    show_borders=False,
    exclude_columns=["column1", "column2"]
)
The package uses the following environment variables:
- API_KEY: API key for AI service
- BASE_URL: Base URL for API endpoint
- TEXT_MODEL: Model name for text processing
- MULTIMODAL_MODEL: Model name for image processing
- LOG_LEVEL: Logging level (default: INFO)
You can set these variables in a .env file or directly in your environment.
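As a minimal sketch of how these variables could feed the ai_settings dictionary used in the examples above, the helper below reads them from a mapping. The function name ai_settings_from_env and the choice to map TEXT_MODEL onto the "model" key are assumptions for illustration; the package's own loading logic may differ.

```python
import os
from typing import Mapping, Optional


def ai_settings_from_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build an ai_settings dictionary from environment variables.

    Hypothetical helper: the variable names come from the list above, but
    mapping TEXT_MODEL to "model" is an assumption.
    """
    env = os.environ if env is None else env
    api_key = env.get("API_KEY")
    if not api_key:
        raise ValueError("API_KEY is not set")
    settings = {"api_key": api_key}
    # Only include optional values that are actually set.
    if env.get("BASE_URL"):
        settings["base_url"] = env["BASE_URL"]
    if env.get("TEXT_MODEL"):
        settings["model"] = env["TEXT_MODEL"]
    return settings
```

Passing an explicit mapping instead of relying on os.environ keeps the helper easy to test; calling it with no argument reads the real environment.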
Options for the process-pdf command:
- --folder: Path to the folder containing PDF files (required)
- --api-key: API key for OpenAI or Qwen (if not provided, it is read from environment variables)
- --use-qwen: Use the Qwen API instead of OpenAI (optional)
- --output-folder: Output folder path (optional, default: "./output")
Options for the excel-to-word command:
- --excel-file: Path to Excel file with evidence data (required)
- --word-file: Path to output Word file (optional)
- --output-folder: Output folder path, used to find images (optional)
- --multi-line: Use multi-line text format (default: True)
- --show-borders: Show table borders (default: True)
- --exclude-columns: Columns to exclude from evidence text (optional)
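To make the option lists concrete, here is a rough sketch of how such a command line could be declared with argparse. It is not the package's actual parser: the function name build_parser is hypothetical, and using BooleanOptionalAction (Python 3.9+), which also creates --no-multi-line and --no-show-borders, is an assumption about how flags that default to True might be switched off.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(prog="precisiondoc")
    sub = parser.add_subparsers(dest="command", required=True)

    pdf = sub.add_parser("process-pdf", help="Process PDF files")
    pdf.add_argument("--folder", required=True)
    pdf.add_argument("--api-key", default=None)
    pdf.add_argument("--use-qwen", action="store_true")
    pdf.add_argument("--output-folder", default="./output")

    word = sub.add_parser("excel-to-word", help="Convert Excel to Word")
    word.add_argument("--excel-file", required=True)
    word.add_argument("--word-file", default=None)
    word.add_argument("--output-folder", default=None)
    # Assumption: flags that default to True get a --no-... counterpart.
    word.add_argument("--multi-line", action=argparse.BooleanOptionalAction, default=True)
    word.add_argument("--show-borders", action=argparse.BooleanOptionalAction, default=True)
    word.add_argument("--exclude-columns", nargs="*", default=[])
    return parser
```

With this sketch, `build_parser().parse_args(["process-pdf", "--folder", "/path/to/pdfs"])` yields a namespace whose output_folder falls back to "./output".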
The program creates the following in the output directory:
- pages/: Contains split single-page PDF files
- images/: (When using Qwen) Contains PDF page image files
- json/: JSON files with structured data and AI processing results
- excel/: Excel files with flattened analysis results
- word/: Word files with extracted precision medicine evidence reports
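After a run, a quick way to check that layout is to count the files in each subfolder. The sketch below uses only the standard library; summarize_output is a hypothetical helper, not part of the package, and it assumes the subfolder names listed above.

```python
from pathlib import Path

# Subfolder names taken from the output layout described above.
EXPECTED_SUBFOLDERS = ("pages", "images", "json", "excel", "word")


def summarize_output(output_dir: str) -> dict:
    """Count the files in each expected subfolder; missing folders count as 0."""
    root = Path(output_dir)
    summary = {}
    for name in EXPECTED_SUBFOLDERS:
        folder = root / name
        files = [p for p in folder.iterdir() if p.is_file()] if folder.is_dir() else []
        summary[name] = len(files)
    return summary
```

A count of 0 for images/ is expected when Qwen is not used, since that folder is only populated in that mode.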
The Word export functionality includes several advanced formatting options:
- Enhanced Table Layout:
  - Left side displays multiple rows of text fields (one field per row)
  - Right side shows images in a single vertically merged cell
  - Customizable table borders (can be shown or hidden)
  - Table continuation across pages for long evidence items
- Page Formatting:
  - Automatic page numbering in "Page X of Y" format
  - Support for both portrait and landscape orientations
  - Table continuation across page breaks
- Text Formatting:
  - Support for multi-line text display
  - Consistent font styling
- Image Handling:
  - Automatic resizing and centering
  - Fallback mechanism for missing images
- Customization Parameters:
  - multi_line_text: Controls text formatting in the left cell
    - True: Creates multiple rows, one for each key-value pair
    - False: Creates a single row with a JSON-style dictionary
  - show_borders: Controls table border visibility
    - True: Shows all table borders
    - False: Hides table borders for a cleaner look
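The difference between the two multi_line_text modes can be illustrated with a small sketch. The function evidence_rows and the sample field names are made up for illustration; the package's real rendering code builds Word table rows and may format values differently.

```python
import json


def evidence_rows(evidence: dict, multi_line_text: bool = True) -> list:
    """Return the text rows for the left side of an evidence table (illustrative)."""
    if multi_line_text:
        # One row per key-value pair.
        return [f"{key}: {value}" for key, value in evidence.items()]
    # A single row holding the whole dictionary as JSON-style text.
    return [json.dumps(evidence, ensure_ascii=False)]
```

For an evidence item with two fields, the first mode gives two rows and the second gives one row containing both fields.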
PrecisionDoc uses Pydantic for robust parameter validation:
- Type Safety: All parameters are validated for correct types and formats
- Default Values: Sensible defaults are provided for optional parameters
- Validation Rules: Business rules are enforced (e.g., valid margin ranges)
- Error Messages: Clear error messages when invalid parameters are provided
- Nested Validation: Complex nested structures like page settings are fully validated
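The kind of checks listed above can be sketched without third-party packages. The example below deliberately uses standard-library dataclasses rather than Pydantic, so it is not the package's actual model; the class names, the default margin of 1.0 inch, and the 0 to 3 inch range are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

MAX_MARGIN_INCHES = 3.0  # assumed upper bound; the real limit is not documented here


@dataclass
class Margins:
    left: float = 1.0
    right: float = 1.0
    top: float = 1.0
    bottom: float = 1.0

    def __post_init__(self):
        for name in ("left", "right", "top", "bottom"):
            value = getattr(self, name)
            # Type check (bool is excluded even though it subclasses int).
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise TypeError(f"margin '{name}' must be a number")
            # Range check (the business rule).
            if not 0 <= value <= MAX_MARGIN_INCHES:
                raise ValueError(f"margin '{name}' must be between 0 and {MAX_MARGIN_INCHES} inches")


@dataclass
class PageSettings:
    orientation: str = "portrait"
    margins: Margins = field(default_factory=Margins)

    def __post_init__(self):
        if self.orientation not in ("portrait", "landscape"):
            raise ValueError("orientation must be 'portrait' or 'landscape'")
        # Nested validation: accept a plain dict and validate it as Margins.
        if isinstance(self.margins, dict):
            self.margins = Margins(**self.margins)
```

Pydantic provides the same guarantees declaratively and adds type coercion and structured error reports, which is presumably why the package uses it.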
Example of page settings validation:
# Valid page settings
page_settings = {
    "orientation": "landscape",  # must be 'landscape' or 'portrait'
    "margins": {
        "left": 0.75,  # in inches
        "right": 0.5,
        "top": 0.5,
        "bottom": 0.75
    }
}
# These will be validated automatically when passed to any function
results = process_single_pdf(
    pdf_path="/path/to/document.pdf",
    page_settings=page_settings
)
The API has been simplified:
- Consolidated AI Parameters: Individual parameters (api_key, base_url, model) have been consolidated into a single ai_settings dictionary
- Backward Compatibility: Legacy parameters are still supported but deprecated and will be removed in a future version
- Cleaner Interface: Reduces parameter redundancy and improves code organization
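One common way to keep deprecated keyword arguments working alongside a consolidated dictionary is sketched below. This is an assumption about the general pattern, not the package's code: resolve_ai_settings is a hypothetical helper, and letting ai_settings win over legacy values is a design choice made here for illustration.

```python
import warnings
from typing import Optional

LEGACY_KEYS = ("api_key", "base_url", "model")


def resolve_ai_settings(ai_settings: Optional[dict] = None, **legacy) -> dict:
    """Merge deprecated keyword arguments into a single ai_settings dict."""
    unknown = set(legacy) - set(LEGACY_KEYS)
    if unknown:
        raise TypeError(f"unexpected arguments: {sorted(unknown)}")
    merged = {k: v for k, v in legacy.items() if v is not None}
    if merged:
        warnings.warn(
            "api_key, base_url and model are deprecated; pass ai_settings instead",
            DeprecationWarning,
            stacklevel=2,
        )
    # Assumption: values given in ai_settings take precedence over legacy arguments.
    merged.update(ai_settings or {})
    return merged
```

Callers that pass only ai_settings see no warning, while callers still using the individual arguments get a DeprecationWarning and the same resulting dictionary shape.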
Example of new API usage:
# New style (recommended)
results = process_pdf(
    folder_path="/path/to/pdfs",
    ai_settings={
        "api_key": "your-api-key",
        "base_url": "https://api.example.com/v1",
        "model": "gpt-4"
    }
)
# Legacy style (deprecated, will be removed in future)
results = process_pdf(
    folder_path="/path/to/pdfs",
    api_key="your-api-key",
    base_url="https://api.example.com/v1",
    model="gpt-4"
)
PrecisionDoc now ensures a strict 1:1 mapping between original PDF files and their output files (JSON, Excel, Word). This means:
- Each original PDF generates exactly one output file of each type
- Output files are initialized at the start of processing each PDF
- No redundant data accumulation on repeated runs
- Improved data organization and traceability
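A 1:1 mapping of this kind is typically achieved by deriving every output path from the PDF's own filename, so repeated runs overwrite the same files instead of accumulating new ones. The sketch below shows that idea; output_paths is a hypothetical helper, and the exact naming scheme (stem plus extension inside json/, excel/, and word/) is an assumption rather than the package's documented behavior.

```python
from pathlib import Path


def output_paths(pdf_path: str, output_folder: str = "./output") -> dict:
    """Derive one JSON, one Excel, and one Word path from a PDF filename."""
    stem = Path(pdf_path).stem
    root = Path(output_folder)
    return {
        "json": root / "json" / f"{stem}.json",
        "excel": root / "excel" / f"{stem}.xlsx",
        "word": root / "word" / f"{stem}.docx",
    }
```

Because the function is deterministic, processing the same PDF twice resolves to the same three paths.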
Each processed page now includes additional metadata:
- Current page number
- Total page count in the document
- Original PDF filename
This enriches the JSON output with useful pagination context for better organization and reference.
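As a rough illustration of what such a per-page record might look like, the helper below assembles the three fields. The function name and the key names (page_number, total_pages, source_file) are hypothetical, as is the assumption that pages are numbered from 1; check the generated JSON for the actual keys.

```python
from pathlib import Path


def page_metadata(pdf_path: str, page_number: int, total_pages: int) -> dict:
    """Build the pagination metadata attached to one processed page (illustrative)."""
    # Assumption: page numbers are 1-based.
    if not 1 <= page_number <= total_pages:
        raise ValueError("page_number must be between 1 and total_pages")
    return {
        "page_number": page_number,
        "total_pages": total_pages,
        "source_file": Path(pdf_path).name,
    }
```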
The PDF processing pipeline has been refactored into smaller, more maintainable functions:
- _initialize_output_files: Handles initialization of JSON, Excel, and Word output files
- _process_pdf_pages: Processes individual PDF pages and saves intermediate results
- _save_final_results: Saves final results to JSON, Excel, and Word files
PrecisionDoc now supports processing individual PDF files directly:
- Process a specific PDF file without needing to place it in a dedicated folder
- Generate the same comprehensive outputs (JSON, Excel, Word) as with folder processing
- Maintain the same high-quality analysis and evidence extraction
- Useful for targeted processing of individual documents
Users can now convert Excel files to formatted Word documents without needing to process PDF files first:
- Supports various formatting options including multi-line text vs. JSON format
- Provides table borders control and column exclusion options
- Accessible via both command line and Python API
- Add support for additional PDF processing libraries for better handling of complex layouts
- Implement batch processing with multi-threading to improve performance
- Create a web-based user interface for easier interaction
- Add support for more languages and document types
- Enhance evidence extraction with more detailed categorization
- Improve image handling and OCR capabilities
- Add support for custom templates for Word export
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI and Alibaba Cloud for providing the AI APIs
- The open-source community for the various libraries used in this project