improved ExtractText(3) #969

pubpub-zz · 2022-06-10T20:34:27Z

New corrections for extract_text()
fixes extraction in cmap
#953
#431
#242
#591 / #954 should be good, but I have doubts about Arabic

TODO: add some missing encodings

PyPDF2/_cmap.py

MartinThoma · 2022-06-10T21:07:24Z

There are two minor Flake8 issues:

./tests/test_utils.py:7:1: F401 'PyPDF2._utils.read_block_backwards' imported but unused
./tests/test_utils.py:7:1: F401 'PyPDF2._utils.read_previous_line' imported but unused

Do you prefer to fix them yourself or should I do it? (also as a general question)

pubpub-zz · 2022-06-10T21:09:47Z

@MartinThoma,
I need your help!! 😥
I have an issue in test_utils.py: my changes work on tag 2.1.0, but I get regressions on main.

Can you have a look, please?

MartinThoma · 2022-06-10T21:29:42Z

@pubpub-zz I might be sleepy-dumb, but I don't see what you mean. I think you only have two minor stylistic / mypy adjustments to make: #971

MartinThoma · 2022-06-10T21:30:12Z

I'll have a more detailed look tomorrow at all the goodness you're bringing to PyPDF2 this time :-)

MartinThoma · 2022-06-10T21:32:11Z

Oh, if you worry about the code coverage: That's not so bad. It's certainly not a blocker for getting your improvements merged.

I will run various tests (especially https://github.com/py-pdf/benchmarks ) to check that things have improved. I can live with coverage dropping a bit (and I will have a more detailed look at the places which are not covered).

pubpub-zz · 2022-06-10T21:37:59Z

@MartinThoma
If you look at the changed files, I had to revert drastically in test_utils.py as I had major issues with it. Give me 5 min and I will confirm or rule out my issue.

pubpub-zz · 2022-06-10T21:42:33Z

@MartinThoma
My problem is with this section of code:


@pytest.mark.parametrize(
    ("dat", "pos", "expected", "expected_pos"),
    [
        (b"abc", 1, b"a", 0),
        (b"abc", 2, b"ab", 0),
        (b"abc", 3, b"abc", 0),
        (b"abc\n", 3, b"abc", 0),
        (b"abc\n", 4, b"", 3),
        (b"abc\n\r", 4, b"", 3),
        (b"abc\nd", 5, b"d", 3),
        # Skip over multiple CR/LF bytes
        (b"abc\n\r\ndef", 9, b"def", 3),
        # Include a block full of newlines...
        (
            b"abc" + b"\n" * (2 * io.DEFAULT_BUFFER_SIZE) + b"d",
            2 * io.DEFAULT_BUFFER_SIZE + 4,
            b"d",
            3,
        ),
        # Include a block full of non-newline characters
        (
            b"abc\n" + b"d" * (2 * io.DEFAULT_BUFFER_SIZE),
            2 * io.DEFAULT_BUFFER_SIZE + 4,
            b"d" * (2 * io.DEFAULT_BUFFER_SIZE),
            3,
        ),
        # Both
        (
            b"abcxyz"
            + b"\n" * (2 * io.DEFAULT_BUFFER_SIZE)
            + b"d" * (2 * io.DEFAULT_BUFFER_SIZE),
            4 * io.DEFAULT_BUFFER_SIZE + 6,
            b"d" * (2 * io.DEFAULT_BUFFER_SIZE),
            6,
        ),
    ],
)
def test_read_previous_line(dat, pos, expected, expected_pos):
    s = io.BytesIO(dat)
    s.seek(pos)
    assert read_previous_line(s) == expected
    assert s.tell() == expected_pos
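
What the parameter table above expects can be read off a deliberately naive reference implementation. The sketch below is not PyPDF2's `read_previous_line`: `read_previous_line_simple` is a hypothetical name, and it loads everything before the current position into memory rather than reading backwards in blocks.

```python
import io


def read_previous_line_simple(stream: io.BytesIO) -> bytes:
    """Return the line that ends at the current position (hypothetical helper).

    Afterwards the stream is positioned at the start of the CR/LF run that
    precedes that line, or at 0 if there is no such run.
    """
    end = stream.tell()
    stream.seek(0)
    prefix = stream.read(end)

    i = len(prefix)
    # Walk back to the nearest CR/LF: everything after it is the line.
    while i > 0 and prefix[i - 1] not in b"\r\n":
        i -= 1
    line = prefix[i:]
    # Skip the whole run of CR/LF bytes in front of the line.
    while i > 0 and prefix[i - 1] in b"\r\n":
        i -= 1
    stream.seek(i)
    return line


s = io.BytesIO(b"abc\n\r\ndef")
s.seek(9)
print(read_previous_line_simple(s), s.tell())  # b'def' 3
```

Starting directly after a newline (for example position 4 in `b"abc\n"`) yields an empty line and leaves the stream at 3, which matches the `(b"abc\n", 4, b"", 3)` case.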

MartinThoma · 2022-06-10T21:48:35Z

Oh damn. That sounds as if it's related to #646

I'll have a closer look tomorrow

pubpub-zz · 2022-06-10T21:52:17Z

I still have some work to do to fix text extraction for rotated pages.
Chinese, Russian, etc. are working.
I have some doubts about Arabic, as the text is written right to left.

codecov · 2022-06-10T22:04:17Z

Codecov Report

Merging #969 (2aea3e9) into main (9c4e7f5) will increase coverage by 0.16%.
The diff coverage is 86.43%.

@@            Coverage Diff             @@
##             main     #969      +/-   ##
==========================================
+ Coverage   84.25%   84.42%   +0.16%     
==========================================
  Files          18       18              
  Lines        4115     4179      +64     
  Branches      868      887      +19     
==========================================
+ Hits         3467     3528      +61     
- Misses        465      468       +3     
  Partials      183      183

Impacted Files        Coverage Δ
PyPDF2/_page.py       82.65% <ø> (+1.10%)        ⬆️
PyPDF2/_cmap.py       76.43% <76.43%> (+3.70%)   ⬆️
PyPDF2/generic.py     89.70% <81.81%> (-0.13%)   ⬇️
PyPDF2/_utils.py      98.03% <98.03%> (ø)
PyPDF2/__init__.py    100.00% <100.00%> (ø)
PyPDF2/filters.py     81.81% <0.00%> (+0.64%)    ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9c4e7f5...2aea3e9. Read the comment docs.

MartinThoma · 2022-06-11T06:10:42Z

@pubpub-zz I've added the test back, without any adjustment. It works: #971
Did you maybe solve the issue in between?

MartinThoma · 2022-06-12T10:23:42Z

@pubpub-zz I cannot answer that. It would be ok for me to have the behavior as in the specs, which also means that crazyones.txt needs to get adjusted.

I just re-ran the benchmark and the results look very similar.

pubpub-zz · 2022-06-12T13:31:46Z

> @pubpub-zz I cannot answer that. It would be ok for me to have the behavior as in the specs, which also means that crazyones.txt needs to get adjusted.
>
> I just re-ran the benchmark and the results look very similar.

@MartinThoma,
I did more research and I think I've got the solution. It should be available by dinner.

MartinThoma · 2022-06-12T13:43:46Z

@pubpub-zz I love how committed you are to improving PyPDF2, but please don't feel pressured because I said that I wanted to make a release today. It's unpaid, so it should be fun. If it doesn't work today or even for some weeks, it would be fine 🤗

pubpub-zz · 2022-06-12T17:38:48Z

Delivered just before dinner, and I've mowed the lawn 😁

pubpub-zz · 2022-06-12T17:44:20Z

@MartinThoma, can you have a look? I do not understand: there is an error in test_utils, but I did not change it,
and it's working fine locally.

MartinThoma · 2022-06-12T19:28:16Z

That is a merge with main that went wrong. You need to adjust the "ids" parameter to match the number of tests. I think there is currently an "11", but it should be "8" (in the range function).

MartinThoma · 2022-06-12T19:29:04Z

I've set the ids because the auto-generated ID just takes all of the parameters, which was extremely long.
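
One way to avoid this kind of mismatch is to derive the ids from the case list instead of hard-coding the count. The sketch below only illustrates that idea: `CASES` and `IDS` are hypothetical names, the cases are a shortened copy of the ones quoted earlier in the thread, and the decorator is shown as a comment so the snippet runs without pytest installed.

```python
from typing import List, Tuple

# A shortened copy of the cases from the thread:
# (data, start position, expected line, expected end position)
CASES: List[Tuple[bytes, int, bytes, int]] = [
    (b"abc", 1, b"a", 0),
    (b"abc\n", 4, b"", 3),
    (b"abc\nd", 5, b"d", 3),
]

# Derive one short id per case from the list itself, so a merge that adds or
# removes cases cannot leave a stale hard-coded range(...) behind.
IDS: List[str] = [f"case-{i}" for i in range(len(CASES))]

# With pytest, both lists would be passed to the decorator:
#
#   @pytest.mark.parametrize(
#       ("dat", "pos", "expected", "expected_pos"), CASES, ids=IDS
#   )
#   def test_read_previous_line(dat, pos, expected, expected_pos): ...

print(IDS)
```

The short ids keep the test report readable, which was the reason for setting them in the first place.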

MartinThoma · 2022-06-12T19:30:39Z

> Delivered just before dinner and I've mowed the lawn

Good job 😁👍 I was just making burgers for my girlfriend and we will now have a relaxed evening 😊

tests/test_utils.py

MartinThoma · 2022-06-13T11:16:42Z

@pubpub-zz I've updated the PR so that the tests run. It was weird that they didn't succeed ... apparently, the tests ran on code as if it was already having the automatic merge. The automatic merge didn't adjust the ids range: 0ba91aa

I'll try to go through the PR this evening / tonight :-)

PyPDF2/_cmap.py

MartinThoma · 2022-06-13T16:28:32Z

@pubpub-zz Looks good to me! I would squash-commit with the following text:

ENH: Text Extraction improvements

- Improvements around /Encoding / /ToUnicode
- Extraction of CMaps improved
- Fallback for font def missing
- Support for /Identity-H and /Identity-V: utf-16-be
- Support for /GB-EUC-H / /GB-EUC-V: gbk
- Support for /GBpc-EUC-H / /GBpc-EUC-V : gb2312
- Store default font space width for 18 commonly used fonts to improve
  whitespace extraction

Does that represent the changes well to users?
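
To make the encoding pairs in that message concrete, here is a minimal sketch of a name-to-codec lookup. `ENCODING_TO_CODEC` and `decode_with_cmap_name` are hypothetical names used for illustration only; the table simply mirrors the pairs listed above and is not taken from PyPDF2's source.

```python
from typing import Dict

# Hypothetical lookup table mirroring the pairs listed in the commit message;
# PyPDF2's internal table may be structured differently.
ENCODING_TO_CODEC: Dict[str, str] = {
    "/Identity-H": "utf-16-be",
    "/Identity-V": "utf-16-be",
    "/GB-EUC-H": "gbk",
    "/GB-EUC-V": "gbk",
    "/GBpc-EUC-H": "gb2312",
    "/GBpc-EUC-V": "gb2312",
}


def decode_with_cmap_name(raw: bytes, encoding_name: str) -> str:
    """Decode raw string bytes with the Python codec paired with a CMap name."""
    codec = ENCODING_TO_CODEC[encoding_name]
    return raw.decode(codec)


# Two bytes per character under utf-16-be.
print(decode_with_cmap_name(b"\x00A\x00B", "/Identity-H"))  # AB

# Round trip through gbk for two CJK characters (written as escapes).
sample = "\u4e2d\u6587"
print(decode_with_cmap_name(sample.encode("gbk"), "/GB-EUC-H") == sample)  # True
```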

MartinThoma · 2022-06-13T16:29:08Z

Besides the two typos I've just commented on, there is one robustness change I would make: the .decode("utf-16-be") fails 167x for 22847 PDF files (0.7% of my dataset, so not too wild) with:

    return self._extract_text(self, self.pdf, space_width, PG.CONTENTS)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1125, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_cmap.py", line 21, in build_char_map
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_cmap.py", line 221, in parse_to_unicode
    ] = unhexlify(sq).decode("utf-16-be")
  File "/home/moose/.pyenv/versions/3.10.2/lib/python3.10/encodings/utf_16_be.py", line 16, in decode
    return codecs.utf_16_be_decode(input, errors, True)
UnicodeDecodeError: 'utf-16-be' codec can't decode bytes in position 0-1: unexpected end of data

I would just wrap it in a try-except UnicodeDecodeError:

import logging
logger = logging.getLogger(__name__)

...

                while a <= b:
                    sq = fmt2 % c
                    key = unhexlify(fmt % a).decode(
                        "charmap" if map_dict[-1] == 1 else "utf-16-be"
                    )
                    unhexlified = unhexlify(sq)
                    try:
                        decoded = unhexlified.decode("utf-16-be")
                    except UnicodeDecodeError:
                        # Skip this entry, but keep both counters moving
                        logger.warning("UnicodeDecodeError when parsing cmap")
                        a += 1
                        c += 1
                        continue
                    map_dict[key] = decoded
                    int_entry.append(a)
                    a += 1
                    c += 1

Co-authored-by: Martin Thoma <info@martin-thoma.de>

pubpub-zz · 2022-06-13T16:53:03Z

under analysis

pubpub-zz · 2022-06-13T16:56:36Z

ENH: Text Extraction improvements

- Improvements around /Encoding / /ToUnicode
- Extraction of CMaps improved
- Fallback for font def missing
- Support for /Identity-H and /Identity-V: utf-16-be
- Support for /GB-EUC-H / /GB-EUC-V / /GBpc-EUC-H / /GBpc-EUC-V (beta release for evaluation)
- Arabic (for evaluation)
- Whitespace extraction improvement

…end of data use surrogatepass in _cmap and _page

pubpub-zz · 2022-06-13T19:21:24Z

@MartinThoma
This latest mod fixes the "'utf-16-be' codec can't decode bytes in position 0-1: unexpected end of data" error.
This should close the issue for the remaining 0.7%.
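
For readers unfamiliar with the `surrogatepass` error handler named in the commit title, here is a small standalone sketch. It assumes the failing bytes were an unpaired UTF-16 surrogate, which is one input that makes strict decoding fail; the actual change in `_cmap` and `_page` is not reproduced here.

```python
from binascii import unhexlify

# A lone UTF-16 high surrogate: two bytes, but the second half of the pair
# is missing (assumed example input, not taken from a real PDF).
raw = unhexlify("d800")

try:
    raw.decode("utf-16-be")
except UnicodeDecodeError as exc:
    print("strict decoding failed:", exc.reason)

# "surrogatepass" keeps the unpaired surrogate as a code point instead of raising.
decoded = raw.decode("utf-16-be", "surrogatepass")
print(len(decoded), hex(ord(decoded)))
```

The resulting string contains a lone surrogate, so encoding it back to UTF-8 or UTF-16 later needs the same error handler.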

The 2.2.0 release improves text extraction again via #969:

- Improvements around /Encoding / /ToUnicode
- Extraction of CMaps improved
- Fallback for font def missing
- Support for /Identity-H and /Identity-V: utf-16-be
- Support for /GB-EUC-H / /GB-EUC-V / /GBpc-EUC-H / /GBpc-EUC-V (beta release for evaluation)
- Arabic (for evaluation)
- Whitespace extraction improvements

Those changes should mainly improve the text extraction for non-ASCII alphabets, e.g. Russian / Chinese / Japanese / Korean / Arabic.

Full Changelog: 2.1.1...2.2.0

pubpub-zz added 9 commits June 10, 2022 22:14

Relative import

48421df

improve TextExtraction

c7829d8

TODO : add some encodings missing

Extend testing

7a9c22c

improve readability of BooleanObjects

b0a7736

Apply Black

d7f84d0

fix early mypy

59504ec

fix mypy2

58bd0e5

attempt fix iss with test_utils

941461a

Merge branch 'main' into ExtractText

e4c37cb

MartinThoma reviewed Jun 10, 2022

PyPDF2/_cmap.py

MartinThoma added 2 commits June 10, 2022 23:21

Minor flake8 fix

39e94f9

Adjust mypy types

9763868

Merge branch 'pubpub-zz-ExtractText' into origin/ExtractText

53294f2

pubpub-zz added 2 commits June 10, 2022 23:43

revert in test_utils

744464f

paste error

0ed4d9a

pubpub-zz added 2 commits June 10, 2022 23:53

flake 8

5b96216

flake8

b2830e9

Add 'test_previous_line' back

1223d75

MartinThoma mentioned this pull request Jun 11, 2022

Pubpub zz extract text #971

Closed

fix Encoding / ToUnicode at the same time

534a8bb

Merge branch 'main' into ExtractText

d92597a

MartinThoma reviewed Jun 13, 2022

tests/test_utils.py

Apply suggestions from code review

0ba91aa

MartinThoma reviewed Jun 13, 2022

PyPDF2/_cmap.py

MartinThoma reviewed Jun 13, 2022

PyPDF2/_cmap.py

pubpub-zz and others added 2 commits June 13, 2022 18:35

typo

88f1298

Co-authored-by: Martin Thoma <info@martin-thoma.de>

typoUpdate PyPDF2/_cmap.py

de7ddc0

Co-authored-by: Martin Thoma <info@martin-thoma.de>

fix 'utf-16-be' codec can't decode bytes in position 0-1: unexpected …

2aea3e9

…end of data use surrogatepass in _cmap and _page

MartinThoma merged commit 72fcaae into py-pdf:main Jun 13, 2022

pubpub-zz mentioned this pull request Jun 13, 2022

File causes loop method call between functions extract_xform_text and _extract_text #966

Closed

pubpub-zz deleted the ExtractText branch June 14, 2022 18:04

MartinThoma mentioned this pull request Jun 16, 2022

process_operation raises "TypeError: a bytes-like object is required, not 'dict'" #953

Closed

DL6ER mentioned this pull request Aug 28, 2022

UnicodeDecodeError: 'utf-16-be' codec can't decode byte 0x45 in position 0: truncated data #1293

Closed

geo-ghci-test bot mentioned this pull request Apr 12, 2024

Geo GHCI test Dashboard sbrunner/scan-to-paperless#1314

Open

3 tasks
