mlibrary/analyze_pdf

PDF Accessibility Analyzer

⚠️ Notice: This code and README were generated by AI and may contain errors or inaccuracies. Please verify all information, especially installation instructions and technical specifications, before use.

A comprehensive Python tool for analyzing PDF documents for accessibility features and compliance with international standards.

Overview

This tool performs detailed accessibility analysis of PDF documents, checking for compliance with PDF/UA (ISO 14289) and WCAG 2.2 standards. It generates both human-readable text reports and machine-readable JSON output, making it suitable for both manual review and automated accessibility testing pipelines.

Features

Document Structure Analysis

  • Tagged PDF Detection: Identifies whether the PDF has a logical structure tree
  • Bookmark Analysis: Counts and extracts document outline/bookmark hierarchy
  • Heading Structure: Detects and validates heading tags (H1-H6) and hierarchy
  • Document Language: Checks for proper language declaration
  • Metadata Validation: Verifies document title, author, and other metadata
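
As an illustration of what a heading hierarchy check can look like, here is a minimal sketch. It assumes the heading levels have already been read from the structure tree into a list of integers; the function name and the "no skipped levels" rule are assumptions for illustration, not necessarily the rule analyze_pdf.py applies.

```python
def find_heading_skips(levels):
    """Return (index, previous_level, level) for each heading that skips a level.

    `levels` is a list of heading levels in document order, e.g. [1, 2, 3].
    Hypothetical helper: the real tool may validate headings differently.
    """
    skips = []
    previous = 0  # treat the document start as level 0, so the first heading should be H1
    for index, level in enumerate(levels):
        # Going deeper by more than one level (e.g. H2 -> H4) counts as a skip.
        # Moving back up (e.g. H3 -> H2) is always allowed.
        if level > previous + 1:
            skips.append((index, previous, level))
        previous = level
    return skips


print(find_heading_skips([1, 2, 3, 2, 4]))  # [(4, 2, 4)]
```

An empty result means no level was skipped under this rule.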

Content Accessibility

  • Figure Analysis: Counts figures and checks for alternative text (alt text)
  • Table Structure: Validates tables have proper header cells
  • Text Extraction Quality: Validates that text can be properly extracted (real text, not image-based)
  • Font Embedding: Checks that all fonts are embedded in the PDF (PDF/UA requirement)
  • Reading Order: Validates that logical structure matches visual reading order
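
The text extraction check can be pictured as counting suspicious characters in the extracted text. The sketch below assumes that "replacement characters" means U+FFFD and that "unmapped" characters are private-use or unassigned code points; both definitions are assumptions, and the function name is hypothetical.

```python
import unicodedata

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, emitted when a glyph cannot be decoded


def text_quality(text):
    """Count characters that suggest a broken glyph-to-Unicode mapping.

    Assumption: "unmapped" is approximated by the Unicode categories
    Co (private use) and Cn (unassigned). The real tool may differ.
    """
    replacement = text.count(REPLACEMENT_CHAR)
    unmapped = sum(1 for ch in text if unicodedata.category(ch) in ("Co", "Cn"))
    return {"total": len(text), "replacement": replacement, "unmapped": unmapped}


print(text_quality("ab\ufffd\ue000"))
# {'total': 4, 'replacement': 1, 'unmapped': 1}
```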

Standards Compliance

Validates against accessibility standards using veraPDF:

  • PDF/UA-1 (ISO 14289-1:2014) - Universal Accessibility standard
  • WCAG 2.2 Complete - Web Content Accessibility Guidelines machine-testable criteria

The tool combines both standards into a single compliance check, which passes if either the PDF/UA-1 or the WCAG 2.2 validation passes.

Output Formats

  • Text Report: Human-readable formatted report with pass/fail indicators
  • JSON Report: Structured data for programmatic analysis and integration
  • Console Output: Summary displayed in terminal
  • Batch Reports: Aggregate statistics and validation summaries for multiple PDFs

Requirements

Dependencies

pip install pikepdf PyMuPDF
  • pikepdf: PDF manipulation and structure analysis
  • PyMuPDF (fitz): Additional PDF analysis capabilities (optional but recommended)
  • veraPDF: External validator (must be installed separately and available in PATH)

Installing veraPDF

Download and install veraPDF from: https://verapdf.org/

Ensure the verapdf command is available in your system PATH.
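
A quick way to confirm that the command is reachable is to look it up on PATH from Python. This is a small sketch using only the standard library; the helper names are illustrative and are not part of this repository.

```python
import shutil


def find_tool(name="verapdf"):
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


def require_tool(name="verapdf"):
    """Return the tool's path, or raise a clear error if it is not installed."""
    path = find_tool(name)
    if path is None:
        raise RuntimeError(f"{name!r} was not found on PATH; install it and retry")
    return path


print(find_tool() or "verapdf not found on PATH")
```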

Installation

  1. Clone or download this repository
  2. Install Python dependencies:
    pip install pikepdf PyMuPDF
  3. Install veraPDF (see above)

Usage

Basic Usage

python3 analyze_pdf.py --input /path/to/document.pdf

This will create two files:

  • document_analysis.txt - Human-readable report
  • document_analysis.json - Machine-readable data

Specify Output Location

python3 analyze_pdf.py --input document.pdf --output reports/my_analysis

Creates:

  • reports/my_analysis.txt
  • reports/my_analysis.json
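
The naming rule for the two report files can be sketched as follows. The function name is hypothetical, and the sketch assumes the default base name is resolved relative to the current working directory, which this README does not state.

```python
from pathlib import Path


def report_paths(input_pdf, output_base=None):
    """Return the (text, json) report paths for one analysis run.

    Mirrors the documented default of {input_stem}_analysis. Whether the
    real script writes next to the input or into the working directory
    is an assumption here.
    """
    if output_base is None:
        output_base = f"{Path(input_pdf).stem}_analysis"
    base = str(output_base)
    # Append the extensions instead of using with_suffix(), so a base
    # name that already contains a dot is left intact.
    return Path(base + ".txt"), Path(base + ".json")


print(report_paths("/path/to/document.pdf"))
print(report_paths("document.pdf", "reports/my_analysis"))
```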

Verbose Logging

python3 analyze_pdf.py --input document.pdf --verbose

Shows detailed debug information during analysis.

Reading Order Sampling

For large documents, you can limit how many pages the reading order analysis checks:

# Check all pages (default)
python3 analyze_pdf.py --input document.pdf --reading-order-sample all

# Sample strategy: first 15 pages + 5 random pages
python3 analyze_pdf.py --input document.pdf --reading-order-sample standard
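
The two strategies can be sketched as a page-selection function. The numbers (first 15 pages plus 5 random pages) come from the comment above; the function name, the zero-based page indices, and the seed parameter are assumptions for illustration.

```python
import random


def pages_to_check(page_count, strategy="all", first=15, extra=5, seed=None):
    """Return zero-based page indices to analyze for a given strategy."""
    if strategy == "all":
        return list(range(page_count))
    if strategy != "standard":
        raise ValueError(f"unknown strategy: {strategy!r}")
    # Always include the leading pages, then draw a few more at random
    # from whatever remains (fewer if the document is short).
    head = list(range(min(first, page_count)))
    rest = range(len(head), page_count)
    rng = random.Random(seed)  # pass a seed for reproducible samples
    tail = rng.sample(rest, min(extra, len(rest)))
    return head + sorted(tail)


print(len(pages_to_check(281, "standard", seed=0)))  # 20
```

For documents with 20 pages or fewer, this sketch ends up selecting every page.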

Batch Processing

Analyze multiple PDFs in a directory using batch_analyze.py:

python3 batch_analyze.py --input-dir /path/to/pdfs --output-dir batch_results

This will:

  • Analyze all PDFs in the input directory
  • Create individual analysis reports (.txt and .json) for each PDF
  • Generate a comprehensive batch summary with:
    • Aggregate statistics (total pages, min/max/average)
    • Validation statistics (PDF/UA-1 and WCAG 2.2 pass/fail counts)
    • Page count distribution
    • Per-file validation status
  • Write the summary to batch_summary.json (machine-readable) and batch_summary.txt (human-readable)

Example batch summary output:

================================================================================
PDF BATCH ANALYSIS RESULTS
================================================================================

Total PDFs analyzed: 50
Total pages across all PDFs: 12,450

AGGREGATE PAGE STATISTICS:
  Minimum pages: 25
  Maximum pages: 600
  Average pages: 249.00

VALIDATION STATISTICS:
  PDF/UA-1:
    Pass: 12
    Fail: 38
  WCAG 2.2 (Complete):
    Pass: 18
    Fail: 32

================================================================================

Page count distribution (top 20):
   1.  250 pages:  15 PDFs
   2.  200 pages:  10 PDFs
   ...

================================================================================

FILES ANALYZED:
--------------------------------------------------------------------------------
  document1.pdf                                        600 pages  PDF/UA-1: FAIL  WCAG: PASS
  document2.pdf                                        450 pages  PDF/UA-1: PASS  WCAG: PASS
  ...
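
The aggregate figures in a summary like this can be derived from one record per file. The sketch below uses a hypothetical record shape (pages, pdfua_pass, wcag_pass); the actual structure of batch_summary.json may differ.

```python
from collections import Counter
from statistics import mean


def summarize(results):
    """Aggregate per-file results into batch statistics.

    Each record is assumed to look like
    {"file": str, "pages": int, "pdfua_pass": bool, "wcag_pass": bool}.
    These keys are illustrative, not the tool's documented schema.
    """
    if not results:
        raise ValueError("no results to summarize")
    pages = [r["pages"] for r in results]
    pdfua_pass = sum(r["pdfua_pass"] for r in results)  # True counts as 1
    wcag_pass = sum(r["wcag_pass"] for r in results)
    return {
        "total_pdfs": len(results),
        "total_pages": sum(pages),
        "min_pages": min(pages),
        "max_pages": max(pages),
        "average_pages": round(mean(pages), 2),
        "pdfua": {"pass": pdfua_pass, "fail": len(results) - pdfua_pass},
        "wcag": {"pass": wcag_pass, "fail": len(results) - wcag_pass},
        # Most frequent page counts first, capped at 20 entries.
        "page_distribution": Counter(pages).most_common(20),
    }


records = [
    {"file": "document1.pdf", "pages": 600, "pdfua_pass": False, "wcag_pass": True},
    {"file": "document2.pdf", "pages": 450, "pdfua_pass": True, "wcag_pass": True},
    {"file": "document3.pdf", "pages": 450, "pdfua_pass": False, "wcag_pass": False},
]
print(summarize(records))
```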

Command-Line Options

analyze_pdf.py

Option                                 Description                                     Default
--input PATH                           Path to PDF file to analyze (required)          -
--output PATH                          Base path for output files (without extension)  {input_stem}_analysis
--verbose                              Enable verbose debug logging                    False
--reading-order-sample {all|standard}  Reading order sampling strategy                 all
batch_analyze.py

Option                 Description                                           Default
--input-dir PATH       Directory containing PDF files to analyze (required)  -
--output-dir PATH      Directory to store output files (required)            -
--analyze-script PATH  Path to analyze_pdf.py script                         ./analyze_pdf.py
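
The --analyze-script option suggests that the batch script runs analyze_pdf.py once per file. Under that assumption, the per-file commands could be assembled as sketched below. The function name and the per-file output naming are hypothetical; only the --input and --output flags come from the option list above.

```python
import sys
from pathlib import Path


def build_commands(input_dir, output_dir, analyze_script="./analyze_pdf.py"):
    """Build one analyze_pdf.py command line per PDF in input_dir.

    Assumption: each report is written to output_dir/{stem}_analysis.
    Note that "*.pdf" is case-sensitive on most platforms.
    """
    commands = []
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        output_base = Path(output_dir) / f"{pdf.stem}_analysis"
        commands.append([
            sys.executable, str(analyze_script),
            "--input", str(pdf),
            "--output", str(output_base),
        ])
    return commands


# Each entry can be passed to subprocess.run(command, check=False),
# so one failing PDF does not stop the rest of the batch.
```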

Output Structure

Text Report

The text report includes:

  • Document statistics (pages, bookmarks, tagging status, language)
  • Heading structure and hierarchy validation
  • Figure analysis with alt text quality assessment
  • Table structure validation
  • Text extraction quality metrics
  • Font embedding analysis
  • Internal cross-references and links
  • Document metadata
  • Reading order validation results
  • veraPDF compliance results (PDF/UA-1 and WCAG 2.2) with detailed rule failures
  • Overall accessibility summary with pass/fail checklist

JSON Report

The JSON report contains structured data with the same information, suitable for:

  • Automated testing pipelines
  • Data analysis and reporting
  • Integration with other tools
  • Historical tracking of accessibility metrics
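
As a sketch of how a pipeline might consume the JSON report, the example below turns a parsed report into a list of blocking problems. The key names (tagged, verapdf, failed_rules) are hypothetical, since the JSON schema is not documented here; check a real report and adjust them before relying on this.

```python
import json


def gate(report, max_failed_rules=0):
    """Return a list of human-readable problems; an empty list means pass.

    All key names are assumptions about the report layout.
    """
    problems = []
    if not report.get("tagged", False):
        problems.append("PDF is not tagged")
    for standard, result in report.get("verapdf", {}).items():
        failed = result.get("failed_rules", 0)
        if failed > max_failed_rules:
            problems.append(f"{standard}: {failed} failed rules")
    return problems


# Inline sample using the figures from the example output in this README.
sample = json.loads(
    '{"tagged": false, "verapdf": {"PDF/UA-1": {"failed_rules": 10},'
    ' "WCAG 2.2": {"failed_rules": 2}}}'
)
for problem in gate(sample):
    print(problem)
```

In a CI job, a non-empty list would typically be turned into a non-zero exit code.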

Example Output

======================================================================
PDF ACCESSIBILITY ANALYSIS
======================================================================

File: document.pdf
Path: /path/to/document.pdf

DOCUMENT STATISTICS
----------------------------------------------------------------------
  Pages: 281
  Bookmarks: 0
  Tagged: No ✗
  Language: Not declared ✗

TEXT EXTRACTION QUALITY
----------------------------------------------------------------------
  Total characters analyzed: 27,750
  Unmapped Unicode chars: None ✓
  Replacement chars: None ✓

FONT EMBEDDING
----------------------------------------------------------------------
  PDF/UA 7.21.4.1 requires all fonts be embedded for consistent rendering.
  (Special placeholder fonts like GlyphLessFont are excluded from this check)

  Unique fonts found: 28
    • 1 embedded ✓
    • 27 not embedded ✗
    (96.4% of distinct fonts are not embedded)

  Font uses across document: 1450 total
    • 50 uses of embedded fonts
    • 1400 uses of non-embedded fonts
    (96.6% of font uses in document are non-embedded)

READING ORDER
----------------------------------------------------------------------
  Pages sampled: 204
  Pages with reading order issues: 57
  Pages without issues: 147 (72.1%)
  Reading order issues: ✗

VERAPDF VALIDATION
----------------------------------------------------------------------
  PDF/UA-1: ✗ FAIL
    Passed rules: 96/106
    Failed rules: 10

  WCAG 2.2 (Complete): ✗ FAIL
    Passed rules: 153/155
    Failed rules: 2

ACCESSIBILITY SUMMARY
----------------------------------------------------------------------
  ✗ PDF is not tagged
  ✗ Document language not declared
  ✓ Text is properly extractable (real text, not images)
  ✗ 27/28 fonts not embedded
    (96.6% of font uses in document)
  ✓ Document title in metadata
  ✗ Reading order issues (27.9% of pages)
  ✗ Standards compliance: Both PDF/UA-1 and WCAG 2.2 fail

Overall: 3/11 checks passed (27%)
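
The percentages in the example report follow directly from the counts shown. This short check recomputes them; the helper name is illustrative.

```python
def pct(part, whole, digits=1):
    """Percentage of part in whole, rounded to the given number of digits."""
    return round(100 * part / whole, digits)


print(pct(27, 28))      # 96.4  distinct fonts not embedded
print(pct(1400, 1450))  # 96.6  font uses that are non-embedded
print(pct(147, 204))    # 72.1  sampled pages without reading order issues
print(pct(57, 204))     # 27.9  sampled pages with reading order issues
print(pct(3, 11, 0))    # 27.0  overall checks passed
```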

Understanding Results

Pass/Fail Indicators

  • ✓ (checkmark): Feature passes accessibility check
  • ✗ (X mark): Feature fails accessibility check

Common Issues

  • Not Tagged: PDF lacks logical structure tree (required for accessibility)
  • Language Not Declared: Screen readers need language info for proper pronunciation
  • Missing Alt Text: Images/figures without alternative text descriptions
  • Reading Order Issues: Content structure doesn't match visual reading order
  • Fonts Not Embedded: Non-embedded fonts can cause rendering inconsistencies and violate PDF/UA 7.21.4.1

Standards Compliance

The tool validates against PDF/UA-1 and WCAG 2.2 and combines them into a single check:

  • Pass: If either standard passes
  • Fail: Only if both standards fail

This approach recognizes that documents may meet accessibility goals through different standard paths.
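
Expressed in code, the combined check is a logical OR of the two validation results. The function name is illustrative.

```python
def standards_compliant(pdfua_pass, wcag_pass):
    """Combined check: pass if either standard passes, fail only if both fail."""
    return bool(pdfua_pass or wcag_pass)


# A file that fails PDF/UA-1 but passes WCAG 2.2 still passes the combined check.
print(standards_compliant(False, True))   # True
print(standards_compliant(False, False))  # False
```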

WCAG Compliance Note

The veraPDF WCAG 2.2 validation covers machine-testable success criteria (Level A and AA). Full WCAG 2.2 conformance requires additional manual testing of:

  • Content quality and clarity
  • Cognitive accessibility
  • User experience factors
  • Context-specific requirements

Use Cases

  • Document Remediation: Identify accessibility issues before publishing
  • Compliance Testing: Verify PDF/UA and WCAG compliance
  • Quality Assurance: Automated accessibility checks in CI/CD pipelines
  • Accessibility Audits: Generate detailed reports for stakeholders
  • Batch Processing: Analyze multiple documents systematically

Limitations

  • Requires veraPDF to be installed for compliance validation
  • Reading order analysis uses a weak heuristic: it compares PyMuPDF's text extraction order (tag-tree order for tagged PDFs) against visual positioning using fuzzy matching. Known weaknesses:
    • Spurious failures for multi-column layouts, which may be correctly tagged but appear out of order when compared against visual position
    • Short repeated strings (headers, footers) can be matched to the wrong occurrence, which skews the comparison
    • Results should be treated as a rough indicator only, not weighted heavily in scoring
    • Manual verification is strongly recommended for any flagged reading order issues
  • Reading order analysis checks all pages by default (can use --reading-order-sample standard for faster sampling on large documents)
  • Does not judge subjective quality (e.g., whether alt text accurately describes an image, or whether content is clear)
  • Some advanced structure elements may require manual verification
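
To make the multi-column limitation concrete, here is a simplified stand-in for an order-versus-position comparison. It is not the tool's actual algorithm: the block format, the naive top-to-bottom sort, and the use of difflib are all assumptions chosen to show why a correctly ordered two-column page can still score below a perfect match.

```python
from difflib import SequenceMatcher


def visual_order(blocks):
    """Sort blocks top-to-bottom, then left-to-right (assumes y grows downward)."""
    return sorted(blocks, key=lambda b: (round(b["y"]), b["x"]))


def order_similarity(blocks):
    """Similarity in [0, 1] between extraction order and naive visual order."""
    extracted = [b["text"] for b in blocks]
    visual = [b["text"] for b in visual_order(blocks)]
    return SequenceMatcher(None, extracted, visual, autojunk=False).ratio()


single_column = [
    {"text": "Title", "x": 50, "y": 50},
    {"text": "Para 1", "x": 50, "y": 100},
    {"text": "Para 2", "x": 50, "y": 150},
]
# Correct logical order for two columns: read the left column fully, then the right.
two_columns = [
    {"text": "Left 1", "x": 50, "y": 100},
    {"text": "Left 2", "x": 50, "y": 200},
    {"text": "Right 1", "x": 300, "y": 100},
    {"text": "Right 2", "x": 300, "y": 200},
]

print(order_similarity(single_column))       # 1.0
print(order_similarity(two_columns) < 1.0)   # True, although the order is correct
```

A column-aware sort would avoid this particular mismatch, but detecting columns reliably is a separate problem.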

License

[Specify license here]

Contributing

[Contribution guidelines if applicable]

Support

For issues or questions, please [specify contact method or issue tracker].

About

Analyzes PDF for accessibility features, starting with tags and tests for compliance with standards.
