MAINT: Text extraction improvements #1126

MartinThoma · 2022-07-17T12:21:18Z

Credits to pubpub-zz, see #1118 (comment)

Co-authored-by: pubpub-zz <4083478+pubpub-zz@users.noreply.github.com>

codecov · 2022-07-17T12:24:51Z

Codecov Report

Merging #1126 (40925ed) into main (0b693e1) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main    #1126   +/-   ##
=======================================
  Coverage   92.02%   92.02%           
=======================================
  Files          24       24           
  Lines        4667     4667           
  Branches      964      964           
=======================================
  Hits         4295     4295           
  Misses        227      227           
  Partials      145      145

Impacted Files	Coverage Δ
PyPDF2/_page.py	`92.60% <ø> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0b693e1...40925ed.

dkg · 2022-07-17T12:36:19Z

Note that the modification to parse_to_unicode here attempts to clean up some part of ae0ff49, but doesn't seem to account for the earlier modification in that commit, where the null dstString was mapped to `.`. If you are going to go with this approach, you should avoid mapping the null dstString to `.` as well (that is, revert the first hunk of ae0ff49).

But see also my comment over on #1118 about why I think this approach is less correct than the approach that you've already merged.
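
For readers unfamiliar with the case under discussion, here is a minimal sketch of a `beginbfchar` entry whose dstString is empty (`<0003> <>`). It maps the source code to an empty string rather than to a placeholder character. `parse_bfchar_line` is a hypothetical helper written for illustration, not PyPDF2's `parse_to_unicode`, and the codec choice is an assumption.

```python
from binascii import unhexlify
from typing import Tuple


def parse_bfchar_line(line: bytes) -> Tuple[str, str]:
    """Hypothetical helper: parse one '<src> <dst>' entry of a beginbfchar block."""
    # Drop the angle brackets and split on whitespace; an empty '<>' leaves no token.
    tokens = line.replace(b"<", b" ").replace(b">", b" ").split()
    src_hex = tokens[0]
    dst_hex = tokens[1] if len(tokens) > 1 else b""
    # Assumption: one-byte source codes are decoded with charmap, longer ones as UTF-16-BE.
    src_codec = "charmap" if len(src_hex) // 2 == 1 else "utf-16-be"
    src = unhexlify(src_hex).decode(src_codec, "surrogatepass")
    # An empty dstString decodes to "", so no placeholder such as "." is introduced.
    dst = unhexlify(dst_hex).decode("utf-16-be", "surrogatepass")
    return src, dst


empty_entry = parse_bfchar_line(b"<0003> <>")
normal_entry = parse_bfchar_line(b"<0041> <0041>")
```

Whether an empty mapping or some other handling is the more correct behaviour is the open question in this thread; the sketch only shows what "not mapping to `.`" could look like.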

dkg · 2022-07-17T12:50:21Z

I've opened py-pdf/sample-files#13 to put habibi.pdf in the sample-files repo. I recommend including a test for it before merging this.

MartinThoma · 2022-07-17T18:46:04Z

With

        elif process_char:
            lst = [x for x in l.split(b" ") if x]
            map_dict[-1] = len(lst[0]) // 2
            if len(lst) == 1:       # some case where the 2nd param is empty (seems not IAW pdfspec)
                map_dict[
                    unhexlify(lst[0]).decode(
                        "charmap" if map_dict[-1] == 1 else "utf-16-be", "surrogatepass"
                    )
                ] = ""
            else:
                while len(lst) > 0:
                    map_dict[
                        unhexlify(lst[0]).decode(
                            "charmap" if map_dict[-1] == 1 else "utf-16-be", "surrogatepass"
                        )
                    ] = unhexlify(lst[1]).decode(
                        "utf-16-be", "surrogatepass"
                    )  # join is here as some cases where the code was split
                    int_entry.append(int(lst[0], 16))
                    lst = lst[2:]

I get

>                       ] = unhexlify(lst[1]).decode(
                            "utf-16-be", "surrogatepass"
                        )  # join is here as some cases where the code was split
E                       binascii.Error: Odd-length string
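
The failure mode itself is easy to reproduce in isolation: `binascii.unhexlify` rejects any input with an odd number of hex digits. The sketch below shows this, together with one possible guard. `safe_unhexlify` is a hypothetical helper, and padding with a trailing `0` follows the PDF rule for hexadecimal strings with a missing final digit. The traceback does not show which token was odd-length, so whether padding would be the right fix here (as opposed to the token having been split incorrectly) is an open assumption.

```python
from binascii import Error as BinasciiError, unhexlify


def safe_unhexlify(hex_token: bytes) -> bytes:
    """Hypothetical helper: decode a hex token, tolerating an odd number of digits."""
    if len(hex_token) % 2:
        # Assumption: treat the missing final digit as 0, as PDF does for hex strings.
        hex_token = hex_token + b"0"
    return unhexlify(hex_token)


# Plain unhexlify raises binascii.Error on an odd-length token.
raised = False
try:
    unhexlify(b"D83")
except BinasciiError:
    raised = True

padded = safe_unhexlify(b"D83")  # decoded as if the token were b"D830"
```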

New Features (ENH):
- Add color and font_format to PdfReader.outlines[i] (#1104)
- Extract Text Enhancement (whitespaces) (#1084)

Bug Fixes (BUG):
- Use `build_destination` for named destination outlines (#1128)
- Avoid a crash when a ToUnicode CMap has an empty dstString in beginbfchar (#1118)
- Prevent deduplication of PageObject (#1105)
- None-check in DictionaryObject.read_from_stream (#1113)
- Avoid IndexError in _cmap.parse_to_unicode (#1110)

Documentation (DOC):
- Explanation for git submodule
- Watermark and stamp (#1095)

Maintenance (MAINT):
- Text extraction improvements (#1126)
- Destination.color returns ArrayObject instead of tuple as fallback (#1119)
- Use add_bookmark_destination in add_bookmark (#1100)
- Use add_bookmark_destination in add_bookmark_dict (#1099)

Testing (TST):
- Remove xfail from test_outline_title_issue_1121
- Add test for arab text (#1127)
- Add xfail for decryption fail (#1125)
- Add xfail test for IndexError when extracting text (#1124)
- Add MCVE showing outline title issue (#1123)

Code Style (STY):
- Apply black and isort
- Use IntFlag for permissions_flag / update_page_form_field_values (#1094)
- Simplify code (#1101)

Full Changelog: 2.5.0...2.6.0

MAINT: Text extraction improvements

7740a6e

MartinThoma force-pushed the text-extraction-impr branch from c08fa8f to 7740a6e Compare July 17, 2022 12:25

Modify loop structure

2da249d

Merge branch 'main' into text-extraction-impr

74f47b7

Undo 'elif process_char' part

40925ed

MartinThoma merged commit e24b0a0 into main Jul 17, 2022

MartinThoma deleted the text-extraction-impr branch July 17, 2022 18:53

pubpub-zz mentioned this pull request Feb 8, 2023

text_extraction invalid for habibi.pdf #1619

Closed

MartinThoma mentioned this pull request Feb 10, 2023

BUG: Text extraction not working with one glyph to char sequence #1620

Merged
