Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Functionality is added to be able to separate words with more than one space #468

Closed
wants to merge 5 commits into from

Conversation

Ev2geny
Copy link

@Ev2geny Ev2geny commented Aug 11, 2020

Pull request

Functionality is added to be able to separate words with more than one space.

For this a new parameter is added to LAParams
qnt_spaces - quantity of spaces, which will be inserted between words for readability. More than one space can be used, to separate words, which already have spaces in them. (e.g. to separate numbers, which have space as a thousand separator)

Default value for qnt_spaces = 1, so change is backwards compatible and all existing tests pass

Example in case space is used as thousand separator:

Original PDF:

30 123.17               60 456.87

text, converted with pdfminer.six without this fix (one space is used as a word separator, and it is not changeable):

30 123.17 60 456.87

text, converted with pdfminer.six with this fix (possible to optionally add more than one space between words):

30 123.17        60 456.87

This resolves this issue: #366

How Has This Been Tested?

Test is added to test this new functionality

test_highlevel_extracttext.py ==> test_simple1_2_spaces_as_word_separator

Checklist

  • I have added tests that prove my fix is effective or that my feature
      works
  • I have added docstrings to newly created methods and classes
  • I have optimized the code at least one time after creating the initial
      version
  • I have updated the README.md or I am verified that this
      is not necessary
  • I have updated the readthedocs documentation or I
      verified that this is not necessary
  • I have added a consice human-readable description of the change to
      CHANGELOG.md

@Ev2geny
Copy link
Author

Ev2geny commented Aug 11, 2020

I will close this pull request and create a new one, which does not generate errors.

@Ev2geny Ev2geny closed this Aug 11, 2020
@Ev2geny Ev2geny reopened this Aug 11, 2020
@Ev2geny
Copy link
Author

Ev2geny commented Aug 11, 2020

OK, all checks pass now

@pietermarsman
Copy link
Member

Hi @Ev2geny. I cannot accept this PR because this is not the way this feature should be implemented. LTAnno is a representation of the concept of a space or enter in the PDF. It is not meant for formatting the output. Formatting of the output should be implemented using an (extension of) PDFDevice.

If you want to discuss this feature further, create (or reopen) an issue.

@Ev2geny
Copy link
Author

Ev2geny commented Sep 10, 2020

Hi @pietermarsman , I continue discussion in #366 (comment), but I guess I do not have permissions to re-open the issue. Can you do it for me please?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants