Make PDF-related dependencies optional#261
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR makes the PDF-related dependency optional and refactors PDF processing into its own module. Key changes include updating pyproject.toml to mark marker-pdf as optional and moving stub packages to the dev group, adding a new autocorpus/pdf.py module for PDF functionality, and updating the autocorpus/autocorpus.py file along with documentation and CI configuration to reflect these changes.
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| pyproject.toml | Updated dependency configuration for optional PDF support |
| autocorpus/pdf.py | Introduced new module to encapsulate PDF processing functionality |
| autocorpus/bioc_supplementary.py | Removed redundant PDF extraction functions |
| autocorpus/autocorpus.py | Refactored to use the new PDF module with improved error logging |
| README.md | Expanded installation instructions for PDF support |
| .github/actions/setup/action.yml | Updated CI workflow to install PDF extras |
Comments suppressed due to low confidence (1)
autocorpus/autocorpus.py:285
- When PDF dependencies are missing, catching ModuleNotFoundError and then re-raising disrupts the test flow. Consider handling this case gracefully (for example, by skipping the PDF test) to avoid abrupt failures during testing.
self.__extract_pdf_content(file)
c6444b7 to
a78349e
Compare
Codecov ReportAttention: Patch coverage is
🚀 New features to boost your workflow:
|
AdrianDAlessandro
left a comment
There was a problem hiding this comment.
LGTM! For the test failing, it's probably fine, I think developers should always have all extras anyway
Description
This PR makes the
marker-pdfdependency optional for AC, so that users who don't want to process PDF files can avoid installing loads of extra dependencies (the total is ~5.5.GB on my machine). This does mean that they have to opt in to PDF support now, e.g. by runningpip install autocorpus[pdf]rather than justpip install autocorpus. If users try to process PDF files without PDF support installed, they'll be prompted to install the additional dependencies.It also means developers need to explicitly install PDF support too if they want to develop the PDF functionality or run the PDF regression test, with:
Unfortunately, with the way I've done it currently,
pytestwill fail on the PDF regression test ifmarker-pdfisn't installed, which is annoying. It might be better to just skip the test in this case -- and if we did that, we could also use this as an alternative workaround for the test not working on the macOS GitHub runner (i.e. by not installing it for the macOS runner). Another option would be to addmarker-pdfto thedevgroup, which would ensure that all developers have it installed. I think the first option is a little cleaner, personally.I've also moved a couple of
*-stubspackages to thedevgroup, because that's where they belong (otherwise end users will get them needlessly when they runpip install).I assume there will be some conflicts with #260 (I haven't looked at it yet), but hopefully not too many. I did rearrange things a bit, so that we could isolate the parts of the code that import
marker-pdfand make AC work without it. PDF-related functionality was split between theautocorpusand thebioc_supplementarymodules, but really it deserves its own module, so I've made a newpdfmodule for this purpose. Longer term, I think the functionality for processing different SI file types should be in submodules, i.e. it could be insupplementary/pdf.py(or something), with the common SI functionality moved tosupplementary/__init__.py.Closes #258.