Make PDF-related dependencies optional #261

Merged
AdrianDAlessandro merged 7 commits into main from make-pdf-deps-optional on May 21, 2025

@alexdewar
Collaborator

@alexdewar alexdewar commented May 21, 2025

Description

This PR makes the marker-pdf dependency optional for AC, so that users who don't want to process PDF files can avoid installing a large set of extra dependencies (about 5.5 GB in total on my machine). This does mean that they now have to opt in to PDF support, e.g. by running pip install autocorpus[pdf] rather than just pip install autocorpus. If users try to process PDF files without PDF support installed, they'll be prompted to install the additional dependencies.
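For illustration, the "prompt to install" behaviour could look roughly like the sketch below. This is not the code from this PR: the helper name require_optional and the wording of the message are hypothetical, and the only assumption is that the optional package may or may not be importable at runtime.

```python
import importlib
from types import ModuleType


def require_optional(
    module_name: str, extra: str, package: str = "autocorpus"
) -> ModuleType:
    """Import an optional dependency, or raise an error with an install hint.

    Hypothetical helper for illustration only; not part of this PR.
    """
    try:
        # Deferred import: the heavy dependency is only loaded on demand.
        return importlib.import_module(module_name)
    except ModuleNotFoundError as err:
        # Re-raise with a message telling the user how to opt in.
        raise ModuleNotFoundError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install {package}[{extra}]"
        ) from err
```

A caller would invoke something like require_optional("marker", "pdf") at the point where a PDF is first processed (assuming marker-pdf is importable as marker), so that importing the main package never touches the optional dependency.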

It also means developers need to explicitly install PDF support too if they want to develop the PDF functionality or run the PDF regression test, with:

poetry install --extras pdf

Unfortunately, with the way I've done it currently, pytest will fail on the PDF regression test if marker-pdf isn't installed, which is annoying. It might be better to just skip the test in this case -- and if we did that, we could also use this as an alternative workaround for the test not working on the macOS GitHub runner (i.e. by not installing it for the macOS runner). Another option would be to add marker-pdf to the dev group, which would ensure that all developers have it installed. I think the first option is a little cleaner, personally.
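A minimal sketch of the "skip the test" option, assuming marker-pdf is importable as marker; the marker name requires_pdf and the test function are hypothetical stand-ins, not the actual regression test:

```python
import importlib.util

import pytest

# Assumption: the marker-pdf distribution provides a top-level `marker` module.
# find_spec checks availability without actually importing the heavy package.
PDF_SUPPORT = importlib.util.find_spec("marker") is not None

# Reusable marker: tests decorated with it are reported as skipped, not failed,
# when PDF support is missing.
requires_pdf = pytest.mark.skipif(
    not PDF_SUPPORT,
    reason="PDF support not installed; run: poetry install --extras pdf",
)


@requires_pdf
def test_pdf_regression_placeholder():
    # Hypothetical stand-in for the real PDF regression test.
    import marker  # noqa: F401
```

With this in place, a runner that doesn't install the pdf extra (e.g. the macOS one) would report the test as skipped. pytest.importorskip("marker") at the top of the test module is a shorter alternative with the same effect.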

I've also moved a couple of *-stubs packages to the dev group, because that's where they belong (otherwise end users will get them needlessly when they run pip install).

I assume there will be some conflicts with #260 (I haven't looked at it yet), but hopefully not too many. I did rearrange things a bit, so that we could isolate the parts of the code that import marker-pdf and make AC work without it. PDF-related functionality was split between the autocorpus and the bioc_supplementary modules, but really it deserves its own module, so I've made a new pdf module for this purpose. Longer term, I think the functionality for processing different SI file types should be in submodules, i.e. it could be in supplementary/pdf.py (or something), with the common SI functionality moved to supplementary/__init__.py.
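To make the longer-term idea a bit more concrete, here is a rough sketch of how per-file-type handlers could be dispatched from common SI code. Everything here is hypothetical (the registry, the function names and the second file type); it only illustrates keeping each file type's code, and its heavy imports, in its own place.

```python
from pathlib import Path
from typing import Callable

# Hypothetical registry mapping a file suffix to its handler. In the proposed
# layout, each handler would live in its own submodule (e.g. supplementary/pdf.py).
Handler = Callable[[Path], str]
_HANDLERS: dict[str, Handler] = {}


def register(suffix: str) -> Callable[[Handler], Handler]:
    """Decorator that registers a handler for the given file suffix."""

    def decorator(func: Handler) -> Handler:
        _HANDLERS[suffix.lower()] = func
        return func

    return decorator


@register(".pdf")
def process_pdf(path: Path) -> str:
    # The optional PDF dependency would be imported here, inside the function,
    # so it is only needed when a PDF is actually processed.
    return f"pdf:{path.name}"


@register(".csv")
def process_table(path: Path) -> str:
    # Placeholder for a second, hypothetical SI file type.
    return f"table:{path.name}"


def process_supplementary(path: Path) -> str:
    """Common entry point: pick the handler based on the file suffix."""
    try:
        handler = _HANDLERS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"Unsupported file type: {path.suffix!r}") from None
    return handler(path)
```

The point of the design is that the common entry point knows nothing about PDF internals, so the PDF handler can be missing its dependencies without affecting the other file types.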

Closes #258.

@alexdewar alexdewar requested a review from Copilot May 21, 2025 10:53
Contributor

Copilot AI left a comment

Pull Request Overview

This PR makes the PDF-related dependency optional and refactors PDF processing into its own module. Key changes include updating pyproject.toml to mark marker-pdf as optional and moving stub packages to the dev group, adding a new autocorpus/pdf.py module for PDF functionality, and updating the autocorpus/autocorpus.py file along with documentation and CI configuration to reflect these changes.

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| pyproject.toml | Updated dependency configuration for optional PDF support |
| autocorpus/pdf.py | Introduced new module to encapsulate PDF processing functionality |
| autocorpus/bioc_supplementary.py | Removed redundant PDF extraction functions |
| autocorpus/autocorpus.py | Refactored to use the new PDF module with improved error logging |
| README.md | Expanded installation instructions for PDF support |
| .github/actions/setup/action.yml | Updated CI workflow to install PDF extras |

Comments suppressed due to low confidence (1)

autocorpus/autocorpus.py:285

  • When PDF dependencies are missing, catching ModuleNotFoundError and then re-raising disrupts the test flow. Consider handling this case gracefully (for example, by skipping the PDF test) to avoid abrupt failures during testing.
self.__extract_pdf_content(file)

@codecov

codecov Bot commented May 21, 2025

Codecov Report

Attention: Patch coverage is 82.55814% with 15 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| autocorpus/pdf.py | 85.00% | 7 Missing and 5 partials ⚠️ |
| autocorpus/autocorpus.py | 50.00% | 3 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
| --- | --- |
| autocorpus/bioc_supplementary.py | 70.94% <ø> (-4.84%) ⬇️ |
| autocorpus/autocorpus.py | 44.09% <50.00%> (-4.29%) ⬇️ |
| autocorpus/pdf.py | 85.00% <85.00%> (ø) |

Collaborator

@AdrianDAlessandro AdrianDAlessandro left a comment

LGTM! As for the failing test, it's probably fine; I think developers should always have all extras installed anyway.

Comment thread autocorpus/pdf.py
Comment thread README.md Outdated
Comment thread .github/actions/setup/action.yml Outdated
@AdrianDAlessandro AdrianDAlessandro marked this pull request as ready for review May 21, 2025 12:38
@AdrianDAlessandro AdrianDAlessandro merged commit 8f95271 into main May 21, 2025
16 checks passed
@AdrianDAlessandro AdrianDAlessandro deleted the make-pdf-deps-optional branch May 21, 2025 12:47
