PDF Accessibility Analyzer

⚠️ Notice: This code and README were generated by AI and may contain errors or inaccuracies. Verify all information, especially installation instructions and technical specifications, before use.

A comprehensive Python tool for analyzing PDF documents for accessibility features and compliance with international standards.

Overview

This tool performs detailed accessibility analysis of PDF documents, checking for compliance with PDF/UA (ISO 14289) and WCAG 2.2 standards. It generates both human-readable text reports and machine-readable JSON output, making it suitable for both manual review and automated accessibility testing pipelines.

Features

Document Structure Analysis

Tagged PDF Detection: Identifies whether the PDF has a logical structure tree
Bookmark Analysis: Counts and extracts document outline/bookmark hierarchy
Heading Structure: Detects and validates heading tags (H1-H6) and hierarchy
Document Language: Checks for proper language declaration
Metadata Validation: Verifies document title, author, and other metadata
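The README does not spell out what "validates hierarchy" means for headings. A common rule is that heading levels must not skip downward (H1 followed directly by H3). The sketch below assumes that rule; the function name `find_skipped_heading_levels` is illustrative and not taken from analyze_pdf.py.

```python
from typing import List, Tuple


def find_skipped_heading_levels(levels: List[int]) -> List[Tuple[int, int, int]]:
    """Return (index, previous_level, level) for each heading that skips a level.

    `levels` holds heading levels in document order, e.g. [1, 2, 3] for H1, H2, H3.
    """
    problems = []
    previous = 0  # document start counts as level 0, so the first heading should be H1
    for index, level in enumerate(levels):
        if not 1 <= level <= 6:
            raise ValueError(f"heading level out of range: {level}")
        # Going deeper by more than one level (H1 -> H3) is a skip;
        # moving back up by any number of levels (H3 -> H1) is allowed.
        if level > previous + 1:
            problems.append((index, previous, level))
        previous = level
    return problems


print(find_skipped_heading_levels([1, 2, 3, 2, 3]))  # well-formed outline
print(find_skipped_heading_levels([1, 3, 2, 4]))     # two skipped levels
```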

Content Accessibility

Figure Analysis: Counts figures and checks for alternative text (alt text)
Table Structure: Validates tables have proper header cells
Text Extraction Quality: Validates that text can be properly extracted (real text, not image-based)
Font Embedding: Checks that all fonts are embedded in the PDF (PDF/UA requirement)
Reading Order: Validates that logical structure matches visual reading order
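As a rough illustration of the text extraction quality check, the sketch below counts Unicode replacement characters (U+FFFD) and characters with no usable mapping in already-extracted text. Treating private-use and unassigned code points as "unmapped" is an assumption made for this example; the tool's own definition may differ.

```python
import unicodedata

REPLACEMENT_CHAR = "\ufffd"


def text_quality_counts(text: str) -> dict:
    """Count non-whitespace characters, replacement characters, and unmapped characters."""
    total = replacement = unmapped = 0
    for ch in text:
        if ch.isspace():
            continue
        total += 1
        if ch == REPLACEMENT_CHAR:
            replacement += 1
        # Assumption: private-use (Co) and unassigned (Cn) code points count as unmapped.
        elif unicodedata.category(ch) in ("Co", "Cn"):
            unmapped += 1
    return {"total": total, "replacement": replacement, "unmapped": unmapped}


print(text_quality_counts("Hello \ufffd\ue000"))
```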

Standards Compliance

Validates against accessibility standards using veraPDF:

PDF/UA-1 (ISO 14289-1:2014) - Universal Accessibility standard
WCAG 2.2 Complete - Web Content Accessibility Guidelines machine-testable criteria

The tool combines both standards into a single compliance check, which passes if either the PDF/UA-1 or the WCAG 2.2 validation passes.

Output Formats

Text Report: Human-readable formatted report with pass/fail indicators
JSON Report: Structured data for programmatic analysis and integration
Console Output: Summary displayed in terminal
Batch Reports: Aggregate statistics and validation summaries for multiple PDFs

Requirements

Dependencies

```
pip install pikepdf PyMuPDF
```

pikepdf: PDF manipulation and structure analysis
PyMuPDF (fitz): Additional PDF analysis capabilities (optional but recommended)
veraPDF: External validator (must be installed separately and available in PATH)

Installing veraPDF

Download and install veraPDF from: https://verapdf.org/

Ensure the verapdf command is available in your system PATH.
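One way to confirm the PATH requirement before running an analysis is to look the command up with Python's standard `shutil.which`. The helper name `find_verapdf` is illustrative.

```python
import shutil
from typing import Optional


def find_verapdf(command: str = "verapdf") -> Optional[str]:
    """Return the full path of the veraPDF launcher, or None if it is not on PATH."""
    return shutil.which(command)


path = find_verapdf()
if path is None:
    print("verapdf not found on PATH; compliance validation needs it")
else:
    print(f"verapdf found at {path}")
```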

Installation

Clone or download this repository
Install Python dependencies:
```
pip install pikepdf PyMuPDF
```
Install veraPDF (see above)

Usage

Basic Usage

```
python3 analyze_pdf.py --input /path/to/document.pdf
```

This will create two files:

document_analysis.txt - Human-readable report
document_analysis.json - Machine-readable data

Specify Output Location

```
python3 analyze_pdf.py --input document.pdf --output reports/my_analysis
```

Creates:

reports/my_analysis.txt
reports/my_analysis.json
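The naming rule for the two report files can be sketched as follows. This assumes the default base name is `{input_stem}_analysis` resolved relative to the current directory, which the README does not state explicitly; `report_paths` is a hypothetical helper, not a function from the script.

```python
from pathlib import Path
from typing import Optional, Tuple


def report_paths(input_pdf: str, output_base: Optional[str] = None) -> Tuple[Path, Path]:
    """Return the (.txt, .json) report paths for an input PDF."""
    if output_base is None:
        # Default base: the input file name without its extension, plus "_analysis".
        output_base = f"{Path(input_pdf).stem}_analysis"
    # Append the extension instead of using Path.with_suffix(), so a base
    # such as "v1.2_report" keeps everything after its dot.
    return Path(output_base + ".txt"), Path(output_base + ".json")


print(report_paths("/path/to/document.pdf"))
print(report_paths("document.pdf", "reports/my_analysis"))
```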

Verbose Logging

```
python3 analyze_pdf.py --input document.pdf --verbose
```

Shows detailed debug information during analysis.

Reading Order Sampling

For large documents, you can control reading order analysis:

```
# Check all pages (default)
python3 analyze_pdf.py --input document.pdf --reading-order-sample all

# Sample strategy: first 15 pages + 5 random pages
python3 analyze_pdf.py --input document.pdf --reading-order-sample standard
```
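The two strategies could select pages roughly as sketched below. The counts (first 15 pages plus 5 random pages) come from the description above. The function name, the `seed` parameter, and the handling of short documents are assumptions made for illustration.

```python
import random
from typing import List, Optional


def pages_to_check(page_count: int, strategy: str = "all",
                   first_n: int = 15, random_n: int = 5,
                   seed: Optional[int] = None) -> List[int]:
    """Return zero-based page indices to analyze for the given strategy."""
    if page_count < 0:
        raise ValueError("page_count must not be negative")
    if strategy == "all":
        return list(range(page_count))
    if strategy != "standard":
        raise ValueError(f"unknown strategy: {strategy!r}")
    head = list(range(min(first_n, page_count)))
    remaining = range(len(head), page_count)
    rng = random.Random(seed)  # pass a seed for reproducible samples
    # Assumption: documents with fewer remaining pages than random_n get all of them.
    extra = rng.sample(remaining, min(random_n, len(remaining)))
    return head + sorted(extra)


print(len(pages_to_check(281, "all")))
print(pages_to_check(281, "standard", seed=0))
```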

Batch Processing

Analyze multiple PDFs in a directory using batch_analyze.py:

```
python3 batch_analyze.py --input-dir /path/to/pdfs --output-dir batch_results
```

This will:

Analyze all PDFs in the input directory
Create individual analysis reports (.txt and .json) for each PDF
Generate a comprehensive batch summary with:
- Aggregate statistics (total pages, min/max/average)
- Validation statistics (PDF/UA-1 and WCAG 2.2 pass/fail counts)
- Page count distribution
- Per-file validation status
Write batch_summary.json (machine-readable) and batch_summary.txt (human-readable)

Example batch summary output:

```
================================================================================
PDF BATCH ANALYSIS RESULTS
================================================================================

Total PDFs analyzed: 50
Total pages across all PDFs: 12,450

AGGREGATE PAGE STATISTICS:
  Minimum pages: 25
  Maximum pages: 600
  Average pages: 249.00

VALIDATION STATISTICS:
  PDF/UA-1:
    Pass: 12
    Fail: 38
  WCAG 2.2 (Complete):
    Pass: 18
    Fail: 32

================================================================================

Page count distribution (top 20):
   1.  250 pages:  15 PDFs
   2.  200 pages:  10 PDFs
   ...

================================================================================

FILES ANALYZED:
--------------------------------------------------------------------------------
  document1.pdf                                        600 pages  PDF/UA-1: FAIL  WCAG: PASS
  document2.pdf                                        450 pages  PDF/UA-1: PASS  WCAG: PASS
  ...
```
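The aggregate figures in a summary like the one above can be derived from per-file results with a few standard-library calls. The record keys used here (`pages`, `pdfua1_pass`, `wcag_pass`) are hypothetical placeholders, not the actual JSON schema written by analyze_pdf.py, and the third sample file is invented for the demonstration.

```python
from collections import Counter
from statistics import mean
from typing import Dict, Iterable, List


def pass_fail(flags: Iterable[bool]) -> Dict[str, int]:
    """Count how many flags are truthy (pass) and how many are not (fail)."""
    flags = list(flags)
    passed = sum(1 for flag in flags if flag)
    return {"pass": passed, "fail": len(flags) - passed}


def summarize_batch(results: List[Dict]) -> Dict:
    """Aggregate page statistics and validation counts over per-file results."""
    if not results:
        raise ValueError("no results to summarize")
    pages = [r["pages"] for r in results]
    return {
        "total_pdfs": len(results),
        "total_pages": sum(pages),
        "min_pages": min(pages),
        "max_pages": max(pages),
        "average_pages": round(mean(pages), 2),
        "pdfua1": pass_fail(r["pdfua1_pass"] for r in results),
        "wcag": pass_fail(r["wcag_pass"] for r in results),
        # Most frequent page counts first, limited to the top 20.
        "page_distribution": Counter(pages).most_common(20),
    }


results = [
    {"file": "document1.pdf", "pages": 600, "pdfua1_pass": False, "wcag_pass": True},
    {"file": "document2.pdf", "pages": 450, "pdfua1_pass": True, "wcag_pass": True},
    {"file": "document3.pdf", "pages": 450, "pdfua1_pass": False, "wcag_pass": False},
]
summary = summarize_batch(results)
print(summary)
```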

Command-Line Options

analyze_pdf.py

| Option | Description | Default |
|---|---|---|
| `--input PATH` | Path to PDF file to analyze (required) | - |
| `--output PATH` | Base path for output files (without extension) | `{input_stem}_analysis` |
| `--verbose` | Enable verbose debug logging | `False` |
| `--reading-order-sample {all\|standard}` | Reading order sampling strategy | `all` |

batch_analyze.py

| Option | Description | Default |
|---|---|---|
| `--input-dir PATH` | Directory containing PDF files to analyze (required) | - |
| `--output-dir PATH` | Directory to store output files (required) | - |
| `--analyze-script PATH` | Path to analyze_pdf.py script | `./analyze_pdf.py` |

Output Structure

Text Report

The text report includes:

Document statistics (pages, bookmarks, tagging status, language)
Heading structure and hierarchy validation
Figure analysis with alt text quality assessment
Table structure validation
Text extraction quality metrics
Font embedding analysis
Internal cross-references and links
Document metadata
Reading order validation results
veraPDF compliance results (PDF/UA-1 and WCAG 2.2) with detailed rule failures
Overall accessibility summary with pass/fail checklist

JSON Report

The JSON report contains structured data with the same information, suitable for:

Automated testing pipelines
Data analysis and reporting
Integration with other tools
Historical tracking of accessibility metrics
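For pipeline use, a small gate script can load the JSON report and list the checks that did not pass. The README does not document the JSON schema, so this sketch takes the key paths as a parameter, and the sample keys (`document.tagged`, `metadata.has_title`) are invented purely for the demonstration. Inspect a real report to find the actual key names.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, List


def lookup(report: dict, dotted_key: str) -> Any:
    """Follow a dotted key path such as 'a.b' through nested dicts (KeyError if absent)."""
    value: Any = report
    for part in dotted_key.split("."):
        value = value[part]
    return value


def failed_checks(report_path: Path, required_true: List[str]) -> List[str]:
    """Return the key paths whose value in the report is not exactly True."""
    report = json.loads(Path(report_path).read_text(encoding="utf-8"))
    return [key for key in required_true if lookup(report, key) is not True]


# Demonstration with an invented report; the keys are hypothetical.
sample = {"document": {"tagged": False, "language": None}, "metadata": {"has_title": True}}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "document_analysis.json"
    path.write_text(json.dumps(sample), encoding="utf-8")
    failures = failed_checks(path, ["document.tagged", "metadata.has_title"])

print(failures)
```

In a CI job, a non-empty `failures` list would typically be turned into a non-zero exit code.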

Example Output

```
======================================================================
PDF ACCESSIBILITY ANALYSIS
======================================================================

File: document.pdf
Path: /path/to/document.pdf

DOCUMENT STATISTICS
----------------------------------------------------------------------
  Pages: 281
  Bookmarks: 0
  Tagged: No ✗
  Language: Not declared ✗

TEXT EXTRACTION QUALITY
----------------------------------------------------------------------
  Total characters analyzed: 27,750
  Unmapped Unicode chars: None ✓
  Replacement chars: None ✓

FONT EMBEDDING
----------------------------------------------------------------------
  PDF/UA 7.21.4.1 requires all fonts be embedded for consistent rendering.
  (Special placeholder fonts like GlyphLessFont are excluded from this check)

  Unique fonts found: 28
    • 1 embedded ✓
    • 27 not embedded ✗
    (96.4% of distinct fonts are not embedded)

  Font uses across document: 1450 total
    • 50 uses of embedded fonts
    • 1400 uses of non-embedded fonts
    (96.6% of font uses in document are non-embedded)

READING ORDER
----------------------------------------------------------------------
  Pages sampled: 204
  Pages with reading order issues: 57
  Pages without issues: 147 (72.1%)
  Reading order issues: ✗

VERAPDF VALIDATION
----------------------------------------------------------------------
  PDF/UA-1: ✗ FAIL
    Passed rules: 96/106
    Failed rules: 10

  WCAG 2.2 (Complete): ✗ FAIL
    Passed rules: 153/155
    Failed rules: 2

ACCESSIBILITY SUMMARY
----------------------------------------------------------------------
  ✗ PDF is not tagged
  ✗ Document language not declared
  ✓ Text is properly extractable (real text, not images)
  ✗ 27/28 fonts not embedded
    (96.6% of font uses in document)
  ✓ Document title in metadata
  ✗ Reading order issues (27.9% of pages)
  ✗ Standards compliance: Both PDF/UA-1 and WCAG 2.2 fail

Overall: 3/11 checks passed (27%)
```
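The font section of the example report gives two different percentages: the share of distinct fonts that are not embedded, and the share of font uses that rely on a non-embedded font. The sketch below reproduces both from the counts shown in the example. The font names and the per-font split of the 1400 uses are invented so that the totals match.

```python
from typing import Dict, List, Tuple

FontUse = Tuple[str, bool, int]  # (font name, is embedded, number of uses)


def font_embedding_stats(fonts: List[FontUse]) -> Dict[str, float]:
    """Compute non-embedded shares by distinct font and by font use."""
    total_fonts = len(fonts)
    total_uses = sum(uses for _, _, uses in fonts)
    missing_fonts = sum(1 for _, embedded, _ in fonts if not embedded)
    missing_uses = sum(uses for _, embedded, uses in fonts if not embedded)
    return {
        "fonts_not_embedded": missing_fonts,
        "pct_fonts_not_embedded": round(100 * missing_fonts / total_fonts, 1) if total_fonts else 0.0,
        "uses_not_embedded": missing_uses,
        "pct_uses_not_embedded": round(100 * missing_uses / total_uses, 1) if total_uses else 0.0,
    }


# 1 embedded font with 50 uses, 27 non-embedded fonts with 1400 uses in total.
fonts: List[FontUse] = [("Embedded-0", True, 50)]
fonts += [(f"Missing-{i}", False, 50) for i in range(26)]
fonts.append(("Missing-26", False, 100))

stats = font_embedding_stats(fonts)
print(stats)
```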

Understanding Results

Pass/Fail Indicators

✓ Checkmark: Feature passes accessibility check
✗ X mark: Feature fails accessibility check

Common Issues

Not Tagged: PDF lacks logical structure tree (required for accessibility)
Language Not Declared: Screen readers need language info for proper pronunciation
Missing Alt Text: Images/figures without alternative text descriptions
Reading Order Issues: Content structure doesn't match visual reading order
Fonts Not Embedded: Non-embedded fonts can cause rendering inconsistencies and violate PDF/UA 7.21.4.1

Standards Compliance

The tool validates against PDF/UA-1 and WCAG 2.2 and combines them into a single check:

Pass: If either standard passes
Fail: Only if both standards fail

This approach recognizes that documents may meet accessibility goals through different standard paths.
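Expressed as code, the combined check is a logical OR of the two validation results. The function name is illustrative.

```python
def standards_compliance_passes(pdfua1_pass: bool, wcag22_pass: bool) -> bool:
    """Combined check: pass if either standard passes, fail only if both fail."""
    return pdfua1_pass or wcag22_pass


for pdfua1 in (True, False):
    for wcag in (True, False):
        verdict = "PASS" if standards_compliance_passes(pdfua1, wcag) else "FAIL"
        print(f"PDF/UA-1={pdfua1!s:5}  WCAG 2.2={wcag!s:5}  ->  {verdict}")
```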

WCAG Compliance Note

The veraPDF WCAG 2.2 validation covers machine-testable success criteria (Level A and AA). Full WCAG 2.2 conformance requires additional manual testing of:

Content quality and clarity
Cognitive accessibility
User experience factors
Context-specific requirements

Use Cases

Document Remediation: Identify accessibility issues before publishing
Compliance Testing: Verify PDF/UA and WCAG compliance
Quality Assurance: Automated accessibility checks in CI/CD pipelines
Accessibility Audits: Generate detailed reports for stakeholders
Batch Processing: Analyze multiple documents systematically

Limitations

Requires veraPDF to be installed for compliance validation
Reading order analysis uses a weak heuristic: it compares PyMuPDF's text extraction order (tag-tree order for tagged PDFs) against visual positioning using fuzzy matching. Known failure modes:
- False negatives for multi-column layouts (which may be correctly tagged but appear out of order visually)
- False positives from short repeated strings (headers, footers) that match incorrectly
Treat the results as a rough indicator only and do not weight them heavily in scoring. Manual verification is strongly recommended for any flagged reading order issue.
Reading order analysis checks all pages by default (use --reading-order-sample standard for faster sampling on large documents)
Does not perform subjective quality checks (e.g., alt text quality, content clarity)
Some advanced structure elements may require manual verification
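To see why the reading order heuristic struggles with multi-column pages, consider a simplified stand-in: sort text blocks top-to-bottom and left-to-right, then compare that sequence with the extraction order using `difflib.SequenceMatcher`. This is not the tool's actual algorithm. The block tuple layout, the 0.8 threshold, and the function names are all assumptions for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Block = Tuple[str, float, float]  # (text, x, y), with y increasing down the page


def visual_order(blocks: List[Block]) -> List[str]:
    """Naive visual order: top-to-bottom, then left-to-right."""
    return [text for text, _, _ in sorted(blocks, key=lambda b: (b[2], b[1]))]


def order_similarity(blocks: List[Block]) -> float:
    """Similarity (0..1) between extraction order and naive visual order."""
    extracted = [text for text, _, _ in blocks]
    return SequenceMatcher(None, extracted, visual_order(blocks)).ratio()


def has_reading_order_issue(blocks: List[Block], threshold: float = 0.8) -> bool:
    """Flag a page when the two orders diverge beyond the (assumed) threshold."""
    return order_similarity(blocks) < threshold


# Blocks are listed in extraction order.
single_column = [("A", 0, 0), ("B", 0, 10), ("C", 0, 20), ("D", 0, 30)]
# Correctly ordered two-column page: whole left column first, then the right column.
two_columns = [("L1", 0, 0), ("L2", 0, 10), ("L3", 0, 20),
               ("R1", 300, 0), ("R2", 300, 10), ("R3", 300, 20)]

print(has_reading_order_issue(single_column))
print(has_reading_order_issue(two_columns))
```

The two-column page is flagged even though its extraction order is the correct reading order, because the naive visual sort interleaves the columns row by row. This is the kind of misclassification described above.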

License

[Specify license here]

Contributing

[Contribution guidelines if applicable]

Support

For issues or questions, please [specify contact method or issue tracker].
