Make PDF-related dependencies optional by alexdewar · Pull Request #261 · omicsNLP/Auto-CORPus

alexdewar · 2025-05-21T10:53:19Z

Description

This PR makes the marker-pdf dependency optional for AC, so that users who don't want to process PDF files can avoid installing loads of extra dependencies (the total is ~5.5 GB on my machine). This does mean that they have to opt in to PDF support now, e.g. by running pip install autocorpus[pdf] rather than just pip install autocorpus. If users try to process PDF files without PDF support installed, they'll be prompted to install the additional dependencies.
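For illustration, the opt-in prompt described above could look roughly like the sketch below. This is not the code from the PR: the function name load_pdf_backend and the message text are hypothetical, and the import name "marker" for the marker-pdf package is an assumption. The idea is to import the backend lazily and turn a missing module into an actionable error.

```python
import importlib
from types import ModuleType

# Hypothetical message; the "pdf" extra name comes from the PR description.
INSTALL_HINT = (
    "PDF support is not installed. Enable it with: pip install autocorpus[pdf]"
)


def load_pdf_backend(module_name: str = "marker") -> ModuleType:
    """Import the optional PDF backend lazily (import name is an assumption)."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # Re-raise with a message that tells the user how to opt in.
        raise ModuleNotFoundError(INSTALL_HINT) from exc
```

Deferring the import to the point of use means that importing the rest of the package never touches the optional dependency.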

It also means developers need to explicitly install PDF support too if they want to develop the PDF functionality or run the PDF regression test, with:

poetry install --extras pdf

Unfortunately, with the way I've done it currently, pytest will fail on the PDF regression test if marker-pdf isn't installed, which is annoying. It might be better to just skip the test in this case -- and if we did that, we could also use this as an alternative workaround for the test not working on the macOS GitHub runner (i.e. by not installing it for the macOS runner). Another option would be to add marker-pdf to the dev group, which would ensure that all developers have it installed. I think the first option is a little cleaner, personally.
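A minimal sketch of the "skip the test" option, using only the standard library so that it runs anywhere. The test class and helper names are hypothetical, and the import name "marker" for marker-pdf is an assumption; pytest collects unittest-style tests, so the skip would show up in its report.

```python
import importlib.util
import unittest


def has_module(name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


# Assumption: the marker-pdf package is importable as "marker".
PDF_AVAILABLE = has_module("marker")


class TestPdfRegression(unittest.TestCase):
    @unittest.skipUnless(PDF_AVAILABLE, "marker-pdf not installed (pdf extra)")
    def test_pdf_regression(self):
        # Placeholder: the real regression test body would go here.
        self.assertTrue(PDF_AVAILABLE)
```

In a pytest-native test module, calling pytest.importorskip("marker") at the top of the file should achieve the same effect with less code.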

I've also moved a couple of *-stubs packages to the dev group, because that's where they belong (otherwise end users will get them needlessly when they run pip install).

I assume there will be some conflicts with #260 (I haven't looked at it yet), but hopefully not too many. I did rearrange things a bit, so that we could isolate the parts of the code that import marker-pdf and make AC work without it. PDF-related functionality was split between the autocorpus and the bioc_supplementary modules, but really it deserves its own module, so I've made a new pdf module for this purpose. Longer term, I think the functionality for processing different SI file types should be in submodules, i.e. it could be in supplementary/pdf.py (or something), with the common SI functionality moved to supplementary/__init__.py.

Closes #258.

Copilot

Pull Request Overview

This PR makes the PDF-related dependency optional and refactors PDF processing into its own module. Key changes include updating pyproject.toml to mark marker-pdf as optional and moving stub packages to the dev group, adding a new autocorpus/pdf.py module for PDF functionality, and updating the autocorpus/autocorpus.py file along with documentation and CI configuration to reflect these changes.

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.


File	Description
pyproject.toml	Updated dependency configuration for optional PDF support
autocorpus/pdf.py	Introduced new module to encapsulate PDF processing functionality
autocorpus/bioc_supplementary.py	Removed redundant PDF extraction functions
autocorpus/autocorpus.py	Refactored to use the new PDF module with improved error logging
README.md	Expanded installation instructions for PDF support
.github/actions/setup/action.yml	Updated CI workflow to install PDF extras

Comments suppressed due to low confidence (1)

autocorpus/autocorpus.py:285

When PDF dependencies are missing, catching ModuleNotFoundError and then re-raising disrupts the test flow. Consider handling this case gracefully (for example, by skipping the PDF test) to avoid abrupt failures during testing.

self.__extract_pdf_content(file)


codecov · 2025-05-21T11:33:52Z

Codecov Report

Attention: Patch coverage is 82.55814% with 15 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
autocorpus/pdf.py	85.00%	7 Missing and 5 partials ⚠️
autocorpus/autocorpus.py	50.00%	3 Missing ⚠️

Files with missing lines	Coverage Δ
autocorpus/bioc_supplementary.py	70.94% <ø> (-4.84%) ⬇️
autocorpus/autocorpus.py	44.09% <50.00%> (-4.29%) ⬇️
autocorpus/pdf.py	85.00% <85.00%> (ø)


AdrianDAlessandro

LGTM! As for the failing test, it's probably fine; I think developers should always have all extras installed anyway.

alexdewar requested a review from Copilot May 21, 2025 10:53

Copilot AI reviewed May 21, 2025


alexdewar added 5 commits May 21, 2025 12:29

pyproject.toml: Move pandas-stubs and lxml-stubs to dev group

6a0ae52

Make code robust to absence of marker-pdf package

a60b4a5

Move other PDF-related functionality to pdf module

f145ce8

Make marker-pdf an optional dependency

81c1594

Closes #258.

Update readme with instructions for enabling PDF support

a78349e

alexdewar force-pushed the make-pdf-deps-optional branch from c6444b7 to a78349e Compare May 21, 2025 11:29

alexdewar requested review from AdrianDAlessandro and Thomas-Rowlands May 21, 2025 11:29

AdrianDAlessandro approved these changes May 21, 2025


Comment thread autocorpus/pdf.py

Comment thread README.md Outdated

Comment thread .github/actions/setup/action.yml Outdated

AdrianDAlessandro mentioned this pull request May 21, 2025

Word test additions and old .doc document conversion #260

Merged

10 tasks

AdrianDAlessandro added 2 commits May 21, 2025 13:22

Update .github/actions/setup/action.yml

51e5f71

Suggest --all-extras in README for development

afe1a16

AdrianDAlessandro marked this pull request as ready for review May 21, 2025 12:38

AdrianDAlessandro enabled auto-merge May 21, 2025 12:38

AdrianDAlessandro merged commit 8f95271 into main May 21, 2025
16 checks passed

AdrianDAlessandro deleted the make-pdf-deps-optional branch May 21, 2025 12:47


AdrianDAlessandro merged 7 commits into main from make-pdf-deps-optional
