Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pdf2txt: clean up construction of LAParams from arguments #682

Merged
merged 13 commits into from Jan 25, 2022

Conversation

0xabu
Copy link
Contributor

@0xabu 0xabu commented Oct 14, 2021

This PR cleans up the way pdf2txt constructs the LAParams object it passes to extract_text_to_fp.

Specifically:

  1. Rather than hardcoding the default values of the parameters in two places, have pdf2txt fetch them from the object's existing defaults (all were identical, presumably intentionally so).
  2. Rather than passing the parameters as arguments to an intermediate function, construct LAParams closer to the argument parser.
  3. Call the LAParams constructor rather than setting attributes after construction, so that LAParams._validate runs to validate the chosen parameters.

As a deliberate side-effect of point 2 above, this fixes the following crash:

$ pdf2txt.py --boxes-flow=disabled test.pdf
Traceback (most recent call last):
  File "tools/pdf2txt.py", line 204, in <module>
    sys.exit(main())
  File "tools/pdf2txt.py", line 198, in main
    outfp = extract_text(**vars(A))
  File "tools/pdf2txt.py", line 66, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "pdfminer/high_level.py", line 85, in extract_text_to_fp
    interpreter.process_page(page)
  File "pdfminer/pdfinterp.py", line 896, in process_page
    self.device.end_page(page)
  File "pdfminer/converter.py", line 51, in end_page
    self.cur_item.analyze(self.laparams)
  File "pdfminer/layout.py", line 822, in analyze
    group.analyze(laparams)
  File "pdfminer/layout.py", line 575, in analyze
    LTTextGroup.analyze(self, laparams)
  File "pdfminer/layout.py", line 362, in analyze
    obj.analyze(laparams)
  File "pdfminer/layout.py", line 575, in analyze
    LTTextGroup.analyze(self, laparams)
  File "pdfminer/layout.py", line 362, in analyze
    obj.analyze(laparams)
  File "pdfminer/layout.py", line 575, in analyze
    LTTextGroup.analyze(self, laparams)
  File "pdfminer/layout.py", line 362, in analyze
    obj.analyze(laparams)
  File "pdfminer/layout.py", line 577, in analyze
    self._objs.sort(
  File "pdfminer/layout.py", line 578, in <lambda>
    key=lambda obj: (1 - laparams.boxes_flow) * obj.x0
TypeError: unsupported operand type(s) for -: 'int' and 'str'

Thus, this supersedes PR #657 (which was a minimal fix for just that crash).

Related: Issue #477, PR #479

Fix #540

How Has This Been Tested?

  • I ran pdf2txt --boxes_flow=disabled, and verified that (a) there was no crash, and (b) the alternate layout algorithm was used in LTLayoutContainer.analyze.
  • I verified that changing layout parameters alters the output.
  • I verified that passing invalid layout parameters now raises an error:
$ tools/pdf2txt.py --boxes-flow 42 samples/simple1.pdf
Traceback (most recent call last):
 File "tools/pdf2txt.py", line 217, in <module>
   sys.exit(main())
 File "tools/pdf2txt.py", line 199, in main
   A = parse_args(args)
 File "tools/pdf2txt.py", line 189, in parse_args
   parsed_args.laparams = LAParams(**kwargs)
 File "pdfminer/layout.py", line 89, in __init__
   self._validate()
 File "pdfminer/layout.py", line 100, in _validate
   raise ValueError(boxes_flow_err_msg)
ValueError: LAParam boxes_flow should be None, or a number between -1 and +1

Checklist

  • I have added tests that prove my fix is effective or that my feature
    works
  • I have added docstrings to newly created methods and classes
  • I have optimized the code at least one time after creating the initial
    version
  • I have updated the README.md or I have verified that this
    is not necessary
  • I have updated the readthedocs documentation or I
    verified that this is not necessary
  • I have added a concise human-readable description of the change to
    CHANGELOG.md

0xabu and others added 9 commits August 16, 2021 08:33
Fixes:
```
$ pdf2txt.py --boxes-flow=disabled test.pdf
Traceback (most recent call last):
  File "tools/pdf2txt.py", line 204, in <module>
    sys.exit(main())
  File "tools/pdf2txt.py", line 198, in main
    outfp = extract_text(**vars(A))
  File "tools/pdf2txt.py", line 66, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "pdfminer/high_level.py", line 85, in extract_text_to_fp
    interpreter.process_page(page)
  File "pdfminer/pdfinterp.py", line 896, in process_page
    self.device.end_page(page)
  File "pdfminer/converter.py", line 51, in end_page
    self.cur_item.analyze(self.laparams)
  File "pdfminer/layout.py", line 822, in analyze
    group.analyze(laparams)
  File "pdfminer/layout.py", line 575, in analyze
    LTTextGroup.analyze(self, laparams)
  File "pdfminer/layout.py", line 362, in analyze
    obj.analyze(laparams)
  File "pdfminer/layout.py", line 575, in analyze
    LTTextGroup.analyze(self, laparams)
  File "pdfminer/layout.py", line 362, in analyze
    obj.analyze(laparams)
  File "pdfminer/layout.py", line 575, in analyze
    LTTextGroup.analyze(self, laparams)
  File "pdfminer/layout.py", line 362, in analyze
    obj.analyze(laparams)
  File "pdfminer/layout.py", line 577, in analyze
    self._objs.sort(
  File "pdfminer/layout.py", line 578, in <lambda>
    key=lambda obj: (1 - laparams.boxes_flow) * obj.x0
TypeError: unsupported operand type(s) for -: 'int' and 'str'
```

Related: Issue pdfminer#477, PR pdfminer#479
 * avoid specifying default values twice
 * construct LAParams earlier, rather than passing its components around
 * fix crash with --boxes_flow=disabled
Copy link
Member

@pietermarsman pietermarsman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much! Again a very nice improvement!

Will do some small changes myself and then merge it.

tools/pdf2txt.py Show resolved Hide resolved
tools/pdf2txt.py Outdated Show resolved Hide resolved
@pietermarsman
Copy link
Member

I did some more changes:

  • Not using locals() or getattr() anymore when creating LAParams.
  • Moved all post-processing of the parsed args to the parse_args() method.
  • Also used the default values from LAParams for --detect-vertical and --all-texts. Presumably you omited those by accident?
  • Add cli argument for --line-overlap as well.

@pietermarsman pietermarsman merged commit d87bd02 into pdfminer:develop Jan 25, 2022
@0xabu
Copy link
Contributor Author

0xabu commented Jan 25, 2022

LGTM, thanks.

@0xabu 0xabu deleted the pdf2txt_laparams branch January 25, 2022 21:09
Beants added a commit to HiTalentAlgorithms/pdfminer.six that referenced this pull request Feb 14, 2022
* develop:
  Check blackness in github actions (pdfminer#711)
  Changed `log.info` to  `log.debug` in six files (pdfminer#690)
  Update README.md batch for Continuous integration
  Update actions.yml so that it will run for all PR's
  Update development tools: travis ci to github actions, tox to nox, nose to pytest (pdfminer#704)
  Added feature: page labels (pdfminer#680)
  Remove obsolete returns (pdfminer#707)
  Revert "Remove obsolete returns"
  Remove obsolete returns
  Only use xref fallback if `PDFNoValidXRef` is raised and `fallback` is True (pdfminer#684)
  Use logger.warn instead of warnings.warn if warning cannot be prevented by user (pdfminer#673)
  Change log.info into log.debug to make pdfinterp.py less verbose
  Fix regression in page layout that sometimes returned text lines out of order (pdfminer#659)
  export type annotations in package (pdfminer#679)
  fix typos in PR template (pdfminer#681)
  pdf2txt: clean up construction of LAParams from arguments (pdfminer#682)
  Fixes jbig2 writer to write valid jb2 files
  Add support for JPEG2000 image encoding
  Added test case for CCITTFaxDecoder (pdfminer#700)
  Attempt to handle decompression error on some broken PDF files (pdfminer#637)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Argument boxes_flow doesn't work for me (pdf2txt tool)
2 participants