⚠️ Notice: This Code & README was generated by AI and may contain errors or inaccuracies. Please verify all information, especially installation instructions and technical specifications, before use.
A comprehensive Python tool for analyzing PDF documents for accessibility features and compliance with international standards.
This tool performs detailed accessibility analysis of PDF documents, checking for compliance with PDF/UA (ISO 14289) and WCAG 2.2 standards. It generates both human-readable text reports and machine-readable JSON output, making it suitable for both manual review and automated accessibility testing pipelines.
- Tagged PDF Detection: Identifies whether the PDF has a logical structure tree
- Bookmark Analysis: Counts and extracts document outline/bookmark hierarchy
- Heading Structure: Detects and validates heading tags (H1-H6) and hierarchy
- Document Language: Checks for proper language declaration
- Metadata Validation: Verifies document title, author, and other metadata
- Figure Analysis: Counts figures and checks for alternative text (alt text)
- Table Structure: Validates tables have proper header cells
- Text Extraction Quality: Validates that text can be properly extracted (real text, not image-based)
- Font Embedding: Checks that all fonts are embedded in the PDF (PDF/UA requirement)
- Reading Order: Validates that logical structure matches visual reading order
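As a rough illustration of the first few checks in the list above, the sketch below looks in a PDF's document catalog and info dictionary for a structure tree, a language entry, and a title. It is an assumption about how such checks could be written with pikepdf, not this tool's actual code; the function names `summarize_catalog` and `inspect_pdf` are hypothetical.

```python
from typing import Any, Mapping, Optional


def _text(value: Any) -> Optional[str]:
    """Convert a PDF string object (or plain str) to str, keeping None as None."""
    return None if value is None else str(value)


def summarize_catalog(root: Mapping[str, Any], info: Mapping[str, Any]) -> dict:
    """Summarize basic accessibility markers.

    Accepts any mapping keyed by PDF names such as "/Lang", so plain dicts
    are enough for experimenting without a real PDF.
    """
    return {
        # A tagged PDF carries a /StructTreeRoot entry in its catalog.
        "tagged": "/StructTreeRoot" in root,
        "language": _text(root.get("/Lang")),
        "title": _text(info.get("/Title")),
    }


def inspect_pdf(path: str) -> dict:
    """Open a PDF with pikepdf (imported lazily) and summarize it."""
    import pikepdf  # third-party dependency

    with pikepdf.open(path) as pdf:
        return summarize_catalog(pdf.Root, pdf.docinfo)
```

A missing `/StructTreeRoot` corresponds to the "Not Tagged" result in the report, and a missing `/Lang` to "Language Not Declared".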
Validates against accessibility standards using veraPDF:
- PDF/UA-1 (ISO 14289-1:2014) - Universal Accessibility standard
- WCAG 2.2 Complete - Web Content Accessibility Guidelines machine-testable criteria
The tool combines both standards into a single compliance check: passes if either PDF/UA-1 or WCAG 2.2 passes.
- Text Report: Human-readable formatted report with pass/fail indicators
- JSON Report: Structured data for programmatic analysis and integration
- Console Output: Summary displayed in terminal
- Batch Reports: Aggregate statistics and validation summaries for multiple PDFs
pip install pikepdf PyMuPDF

- pikepdf: PDF manipulation and structure analysis
- PyMuPDF (fitz): Additional PDF analysis capabilities (optional but recommended)
- veraPDF: External validator (must be installed separately and available in PATH)
Download and install veraPDF from: https://verapdf.org/
Ensure the verapdf command is available in your system PATH.
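One quick way to confirm the PATH requirement is to look the command up from Python before running an analysis. This is a minimal sketch using only the standard library; the helper name `find_verapdf` is hypothetical and not part of this tool.

```python
import shutil
from typing import Optional


def find_verapdf(command: str = "verapdf") -> Optional[str]:
    """Return the full path of the veraPDF executable, or None if it is not on PATH."""
    return shutil.which(command)


if __name__ == "__main__":
    location = find_verapdf()
    if location is None:
        print("verapdf not found on PATH; install it before running compliance validation")
    else:
        print(f"verapdf found at {location}")
```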
- Clone or download this repository
- Install Python dependencies:
pip install pikepdf PyMuPDF
- Install veraPDF (see above)
python3 analyze_pdf.py --input /path/to/document.pdf

This will create two files:
- document_analysis.txt: Human-readable report
- document_analysis.json: Machine-readable data
python3 analyze_pdf.py --input document.pdf --output reports/my_analysis

Creates:
- reports/my_analysis.txt
- reports/my_analysis.json
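The naming rule behind these examples can be sketched as follows: the base path defaults to the input file's stem plus `_analysis`, and the two extensions are appended to it. The helper `report_paths` is hypothetical, and placing the default output relative to the current directory is an assumption; the script may resolve it differently.

```python
from pathlib import Path
from typing import Optional, Tuple


def report_paths(input_pdf: str, output_base: Optional[str] = None) -> Tuple[Path, Path]:
    """Return the (.txt, .json) report paths for a given input PDF."""
    if output_base is None:
        # Default base: "<input stem>_analysis"
        base = f"{Path(input_pdf).stem}_analysis"
    else:
        base = output_base
    # Append the extension rather than using with_suffix(), so a base name
    # that contains dots (e.g. "v1.2_report") is not truncated.
    return Path(f"{base}.txt"), Path(f"{base}.json")
```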
python3 analyze_pdf.py --input document.pdf --verbose

Shows detailed debug information during analysis.
For large documents, you can control reading order analysis:
# Check all pages (default)
python3 analyze_pdf.py --input document.pdf --reading-order-sample all
# Sample strategy: first 15 pages + 5 random pages
python3 analyze_pdf.py --input document.pdf --reading-order-sample standard

Analyze multiple PDFs in a directory using batch_analyze.py:

python3 batch_analyze.py --input-dir /path/to/pdfs --output-dir batch_results

This will:
- Analyze all PDFs in the input directory
- Create individual analysis reports (.txt and .json) for each PDF
- Generate a comprehensive batch summary with:
  - Aggregate statistics (total pages, min/max/average)
  - Validation statistics (PDF/UA-1 and WCAG 2.2 pass/fail counts)
  - Page count distribution
  - Per-file validation status
- Create batch_summary.json (machine-readable) and batch_summary.txt (human-readable)
Example batch summary output:
================================================================================
PDF BATCH ANALYSIS RESULTS
================================================================================
Total PDFs analyzed: 50
Total pages across all PDFs: 12,450
AGGREGATE PAGE STATISTICS:
Minimum pages: 25
Maximum pages: 600
Average pages: 249.00
VALIDATION STATISTICS:
PDF/UA-1:
Pass: 12
Fail: 38
WCAG 2.2 (Complete):
Pass: 18
Fail: 32
================================================================================
Page count distribution (top 20):
1. 250 pages: 15 PDFs
2. 200 pages: 10 PDFs
...
================================================================================
FILES ANALYZED:
--------------------------------------------------------------------------------
document1.pdf 600 pages PDF/UA-1: FAIL WCAG: PASS
document2.pdf 450 pages PDF/UA-1: PASS WCAG: PASS
...
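The aggregate figures in the sample summary above (totals, min/max/average, and the page count distribution) can be derived from a plain list of per-file page counts. The following is a minimal sketch of that arithmetic; `summarize_page_counts` and its result keys are hypothetical and may not match the fields in batch_summary.json.

```python
from collections import Counter
from statistics import mean
from typing import Dict, List


def summarize_page_counts(page_counts: List[int], top_n: int = 20) -> Dict[str, object]:
    """Compute aggregate page statistics for a batch of PDFs."""
    if not page_counts:
        raise ValueError("page_counts must not be empty")
    return {
        "total_pdfs": len(page_counts),
        "total_pages": sum(page_counts),
        "min_pages": min(page_counts),
        "max_pages": max(page_counts),
        "average_pages": round(mean(page_counts), 2),
        # (page count, number of PDFs with that count), most common first
        "distribution": Counter(page_counts).most_common(top_n),
    }
```

With the sample numbers, 12,450 total pages across 50 PDFs gives the reported average of 249.00.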
Options for analyze_pdf.py:

| Option | Description | Default |
|---|---|---|
| --input PATH | Path to PDF file to analyze (required) | - |
| --output PATH | Base path for output files (without extension) | {input_stem}_analysis |
| --verbose | Enable verbose debug logging | False |
| --reading-order-sample {all\|standard} | Reading order sampling strategy | all |
Options for batch_analyze.py:

| Option | Description | Default |
|---|---|---|
| --input-dir PATH | Directory containing PDF files to analyze (required) | - |
| --output-dir PATH | Directory to store output files (required) | - |
| --analyze-script PATH | Path to analyze_pdf.py script | ./analyze_pdf.py |
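To make the relationship between the two scripts concrete, the sketch below builds one analyze_pdf.py command line per PDF in a directory, using only the `--input` and `--output` options documented above. It illustrates the idea rather than reproducing batch_analyze.py: `build_commands` is a hypothetical helper, and the non-recursive, case-sensitive `*.pdf` match is an assumption.

```python
import sys
from pathlib import Path
from typing import List


def build_commands(input_dir: str, output_dir: str,
                   analyze_script: str = "./analyze_pdf.py") -> List[List[str]]:
    """Build one analyze_pdf.py command line per PDF found in input_dir."""
    out = Path(output_dir)
    commands = []
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        # Each PDF gets its own report base path inside the output directory.
        base = out / f"{pdf.stem}_analysis"
        commands.append([
            sys.executable, analyze_script,
            "--input", str(pdf),
            "--output", str(base),
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd)`; keeping the arguments as a list avoids shell quoting problems with file names that contain spaces.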
The text report includes:
- Document statistics (pages, bookmarks, tagging status, language)
- Heading structure and hierarchy validation
- Figure analysis with alt text quality assessment
- Table structure validation
- Text extraction quality metrics
- Font embedding analysis
- Internal cross-references and links
- Document metadata
- Reading order validation results
- veraPDF compliance results (PDF/UA-1 and WCAG 2.2) with detailed rule failures
- Overall accessibility summary with pass/fail checklist
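For the heading hierarchy item in the list above, one common rule is that a heading should not jump more than one level deeper than the heading before it (for example, H1 followed directly by H3). The sketch below applies that rule to a list of heading levels. Whether the tool uses exactly this rule is an assumption, and `find_heading_skips` is a hypothetical name.

```python
from typing import List


def find_heading_skips(levels: List[int]) -> List[str]:
    """Report headings that jump more than one level deeper than the previous one."""
    problems = []
    previous = 0  # level 0 stands for the document start, so the first heading should be H1
    for index, level in enumerate(levels):
        if level > previous + 1:
            after = f"H{previous}" if previous else "the document start"
            problems.append(f"heading {index + 1}: H{level} follows {after}")
        previous = level
    return problems
```

Moving back up the hierarchy (H3 followed by H2) is not reported, since only skipped levels on the way down break the outline.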
The JSON report contains structured data with the same information, suitable for:
- Automated testing pipelines
- Data analysis and reporting
- Integration with other tools
- Historical tracking of accessibility metrics
======================================================================
PDF ACCESSIBILITY ANALYSIS
======================================================================
File: document.pdf
Path: /path/to/document.pdf
DOCUMENT STATISTICS
----------------------------------------------------------------------
Pages: 281
Bookmarks: 0
Tagged: No ✗
Language: Not declared ✗
TEXT EXTRACTION QUALITY
----------------------------------------------------------------------
Total characters analyzed: 27,750
Unmapped Unicode chars: None ✓
Replacement chars: None ✓
FONT EMBEDDING
----------------------------------------------------------------------
PDF/UA 7.21.4.1 requires all fonts be embedded for consistent rendering.
(Special placeholder fonts like GlyphLessFont are excluded from this check)
Unique fonts found: 28
• 1 embedded ✓
• 27 not embedded ✗
(96.4% of distinct fonts are not embedded)
Font uses across document: 1450 total
• 50 uses of embedded fonts
• 1400 uses of non-embedded fonts
(96.6% of font uses in document are non-embedded)
READING ORDER
----------------------------------------------------------------------
Pages sampled: 204
Pages with reading order issues: 57
Pages without issues: 147 (72.1%)
Reading order issues: ✗
VERAPDF VALIDATION
----------------------------------------------------------------------
PDF/UA-1: ✗ FAIL
Passed rules: 96/106
Failed rules: 10
WCAG 2.2 (Complete): ✗ FAIL
Passed rules: 153/155
Failed rules: 2
ACCESSIBILITY SUMMARY
----------------------------------------------------------------------
✗ PDF is not tagged
✗ Document language not declared
✓ Text is properly extractable (real text, not images)
✗ 27/28 fonts not embedded
(96.6% of font uses in document)
✓ Document title in metadata
✗ Reading order issues (27.9% of pages)
✗ Standards compliance: Both PDF/UA-1 and WCAG 2.2 fail
Overall: 3/11 checks passed (27%)
- ✓ Checkmark: Feature passes accessibility check
- ✗ X mark: Feature fails accessibility check
- Not Tagged: PDF lacks logical structure tree (required for accessibility)
- Language Not Declared: Screen readers need language info for proper pronunciation
- Missing Alt Text: Images/figures without alternative text descriptions
- Reading Order Issues: Content structure doesn't match visual reading order
- Fonts Not Embedded: Non-embedded fonts can cause rendering inconsistencies and violate PDF/UA 7.21.4.1
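The sample report gives two different font percentages: one over distinct fonts and one over font uses. The sketch below shows how both could be computed from per-font records. The input shape (name, embedded flag, use count) and the name `font_embedding_stats` are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def font_embedding_stats(fonts: List[Tuple[str, bool, int]]) -> Dict[str, float]:
    """Summarize embedding by distinct font and by number of uses.

    Each entry is (font name, is_embedded, use_count).
    """
    total_fonts = len(fonts)
    total_uses = sum(uses for _, _, uses in fonts)
    missing_fonts = sum(1 for _, embedded, _ in fonts if not embedded)
    missing_uses = sum(uses for _, embedded, uses in fonts if not embedded)
    return {
        "pct_fonts_not_embedded": (
            round(100 * missing_fonts / total_fonts, 1) if total_fonts else 0.0
        ),
        "pct_uses_not_embedded": (
            round(100 * missing_uses / total_uses, 1) if total_uses else 0.0
        ),
    }
```

With 27 of 28 fonts missing and 1400 of 1450 uses affected, this yields 96.4% and 96.6%, matching the sample report.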
The tool validates against PDF/UA-1 and WCAG 2.2 and combines them into a single check:
- Pass: If either standard passes
- Fail: Only if both standards fail
This approach recognizes that documents may meet accessibility goals through different standard paths.
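Expressed as code, the combined check is a logical OR of the two results. The function name below is hypothetical; only the pass/fail rule comes from the description above.

```python
def combined_compliance(pdfua1_passed: bool, wcag22_passed: bool) -> bool:
    """Pass when at least one standard passes; fail only when both fail."""
    return pdfua1_passed or wcag22_passed


if __name__ == "__main__":
    # Print the full truth table for the combined check.
    for ua in (True, False):
        for wcag in (True, False):
            verdict = "PASS" if combined_compliance(ua, wcag) else "FAIL"
            print(f"PDF/UA-1={ua!s:<5} WCAG 2.2={wcag!s:<5} -> {verdict}")
```

Under this rule, a file that fails PDF/UA-1 but passes WCAG 2.2 still counts as compliant overall.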
The veraPDF WCAG 2.2 validation covers machine-testable success criteria (Level A and AA). Full WCAG 2.2 conformance requires additional manual testing of:
- Content quality and clarity
- Cognitive accessibility
- User experience factors
- Context-specific requirements
- Document Remediation: Identify accessibility issues before publishing
- Compliance Testing: Verify PDF/UA and WCAG compliance
- Quality Assurance: Automated accessibility checks in CI/CD pipelines
- Accessibility Audits: Generate detailed reports for stakeholders
- Batch Processing: Analyze multiple documents systematically
- Requires veraPDF to be installed for compliance validation
- Reading order analysis uses a weak heuristic: it compares PyMuPDF's text extraction order (tag-tree order for tagged PDFs) against visual positioning. The fuzzy matching approach generates:
  - False negatives for multi-column layouts (which may be correctly tagged but appear out of order visually)
  - False positives from short repeated strings (headers, footers) that match incorrectly
- Results should be treated as a rough indicator only, not weighted heavily in scoring
- Manual verification strongly recommended for any flagged reading order issues
- Reading order analysis checks all pages by default (use --reading-order-sample standard for faster sampling on large documents)
- Does not perform subjective quality checks (e.g., alt text quality, content clarity)
- Some advanced structure elements may require manual verification
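The `standard` sampling strategy mentioned above (first 15 pages plus 5 random pages) could be implemented along the following lines. This is a sketch under assumptions: the random pages are drawn from the pages after the first 15, short documents are checked in full, and the names `pages_to_check`, `first_n`, `random_n`, and `seed` are hypothetical.

```python
import random
from typing import List, Optional


def pages_to_check(page_count: int, strategy: str = "all",
                   first_n: int = 15, random_n: int = 5,
                   seed: Optional[int] = None) -> List[int]:
    """Return zero-based page indices to analyze for the given strategy."""
    if strategy not in ("all", "standard"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    if strategy == "all" or page_count <= first_n:
        return list(range(page_count))
    head = list(range(first_n))
    remaining = range(first_n, page_count)
    # A seeded generator makes the sample reproducible between runs.
    rng = random.Random(seed)
    extra = rng.sample(remaining, min(random_n, len(remaining)))
    return head + sorted(extra)
```

Passing a fixed `seed` keeps the sampled pages stable, which helps when comparing reports for the same document over time.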
[Specify license here]
[Contribution guidelines if applicable]
For issues or questions, please [specify contact method or issue tracker].