Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot pass None as parameter for --boxes-flow #477

Closed
a-franck opened this issue Aug 24, 2020 · 1 comment · Fixed by #479
Closed

Cannot pass None as parameter for --boxes-flow #477

a-franck opened this issue Aug 24, 2020 · 1 comment · Fixed by #479
Assignees

Comments

@a-franck
Copy link

I cannot pass the value None to the parameter --boxes-flow when executing pdf2txt.py from the command line:

pdf2txt.py: error: argument --boxes-flow/-F: invalid float value: 'None'

According to the help, this should be possible:

--boxes-flow BOXES_FLOW, -F BOXES_FLOW
Specifies how much a horizontal and vertical position
of a text matters when determining the order of lines.
The value should be within the range of -1.0 (only
horizontal position matters) to +1.0 (only vertical
position matters). You can also pass None to disable
advanced layout analysis, and instead return text
based on the position of the bottom left corner of the
text box.

@jstockwin jstockwin added this to new in pdfminer.six via automation Aug 25, 2020
@jstockwin jstockwin moved this from new to in progress in pdfminer.six Aug 25, 2020
@jstockwin jstockwin self-assigned this Aug 25, 2020
@jstockwin
Copy link
Member

@a-franck thanks for reporting this. This was me not being careful enough when I enabled allowing boxes-flow to be None. I agree it's an issue and will fix it ASAP!

pdfminer.six automation moved this from in progress to done Oct 10, 2020
0xabu added a commit to 0xabu/pdfminer that referenced this issue Aug 16, 2021
Fixes:
```
$ pdf2txt.py --boxes-flow=disabled test.pdf
Traceback (most recent call last):
  File "tools/pdf2txt.py", line 204, in <module>
    sys.exit(main())
  File "tools/pdf2txt.py", line 198, in main
    outfp = extract_text(**vars(A))
  File "tools/pdf2txt.py", line 66, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "pdfminer/high_level.py", line 85, in extract_text_to_fp
    interpreter.process_page(page)
  File "pdfminer/pdfinterp.py", line 896, in process_page
    self.device.end_page(page)
  File "pdfminer/converter.py", line 51, in end_page
    self.cur_item.analyze(self.laparams)
  File "pdfminer/layout.py", line 822, in analyze
    group.analyze(laparams)
  File "pdfminer/layout.py", line 575, in analyze
    LTTextGroup.analyze(self, laparams)
  File "pdfminer/layout.py", line 362, in analyze
    obj.analyze(laparams)
  File "pdfminer/layout.py", line 575, in analyze
    LTTextGroup.analyze(self, laparams)
  File "pdfminer/layout.py", line 362, in analyze
    obj.analyze(laparams)
  File "pdfminer/layout.py", line 575, in analyze
    LTTextGroup.analyze(self, laparams)
  File "pdfminer/layout.py", line 362, in analyze
    obj.analyze(laparams)
  File "pdfminer/layout.py", line 577, in analyze
    self._objs.sort(
  File "pdfminer/layout.py", line 578, in <lambda>
    key=lambda obj: (1 - laparams.boxes_flow) * obj.x0
TypeError: unsupported operand type(s) for -: 'int' and 'str'
```

Related: Issue pdfminer#477, PR pdfminer#479
pietermarsman added a commit that referenced this issue Jan 25, 2022
* Fix pdf2txt --boxes-flow=disabled

Fixes:
```
$ pdf2txt.py --boxes-flow=disabled test.pdf
Traceback (most recent call last):
  File "tools/pdf2txt.py", line 204, in <module>
    sys.exit(main())
  File "tools/pdf2txt.py", line 198, in main
    outfp = extract_text(**vars(A))
  File "tools/pdf2txt.py", line 66, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "pdfminer/high_level.py", line 85, in extract_text_to_fp
    interpreter.process_page(page)
  File "pdfminer/pdfinterp.py", line 896, in process_page
    self.device.end_page(page)
  File "pdfminer/converter.py", line 51, in end_page
    self.cur_item.analyze(self.laparams)
  File "pdfminer/layout.py", line 822, in analyze
    group.analyze(laparams)
  File "pdfminer/layout.py", line 575, in analyze
    LTTextGroup.analyze(self, laparams)
  File "pdfminer/layout.py", line 362, in analyze
    obj.analyze(laparams)
  File "pdfminer/layout.py", line 575, in analyze
    LTTextGroup.analyze(self, laparams)
  File "pdfminer/layout.py", line 362, in analyze
    obj.analyze(laparams)
  File "pdfminer/layout.py", line 575, in analyze
    LTTextGroup.analyze(self, laparams)
  File "pdfminer/layout.py", line 362, in analyze
    obj.analyze(laparams)
  File "pdfminer/layout.py", line 577, in analyze
    self._objs.sort(
  File "pdfminer/layout.py", line 578, in <lambda>
    key=lambda obj: (1 - laparams.boxes_flow) * obj.x0
TypeError: unsupported operand type(s) for -: 'int' and 'str'
```

Related: Issue #477, PR #479

* update CHANGELOG

* merge CHANGELOG

* pdf2txt: clean up handling of layout parameter arguments
 * avoid specifying default values twice
 * construct LAParams earlier, rather than passing its components around
 * fix crash with --boxes_flow=disabled

* update CHANGELOG

* construct new LAParams, so _validate runs

* Improve readability of setting LAParams by explicitly copying them from parsed_args into init of LAParams. And move all parsed_args post processing to the parse_args() method.

* Add cli argument for line_overlap

* Also use default values from LAParams for --detect-vertical and --all-texts

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
pdfminer.six
  
done
Development

Successfully merging a pull request may close this issue.

2 participants