Add `normalize_unicode=False/True` parameter to text extraction methods #905

jsvine · 2023-06-13T15:22:28Z

Per @petermr's suggestion in #904 (comment), I think it's a good idea to add such a parameter/option, using `unicodedata.normalize(...)`, in a similar vein to the `expand_ligatures` parameter added in v0.9.0. I'll look into this.

Some useful reference links, as a note-to-self:


agusluques · 2024-07-10T14:54:16Z

Hi @jsvine, is there a workaround for this in the meantime?

Can I manually apply a normalize function to all text in the PDF?

jsvine · 2024-07-15T22:46:38Z

Hi @agusluques, and thanks for checking. There have not been any updates on this, but there may still be a solution for certain use-cases. What's your particular use-case?

agusluques · 2024-07-16T13:23:26Z

@jsvine thanks for the answer. Basically, I am trying to split text on `;` (U+003B), but the PDF seems to use a different, visually identical character (U+037E). I am doing a manual replacement for now, but it would be great to have this happen at the moment of reading the PDF, so there is no risk of forgetting to include the cleaning logic.
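The manual workaround described above can be sketched with the standard library alone. This is a minimal sketch, not pdfplumber's API: `normalize_text` is a hypothetical helper name, and the sample string stands in for text returned by an extraction call. U+037E (the Greek question mark) is canonically equivalent to U+003B, so normalization maps it to the ordinary semicolon.

```python
import unicodedata


def normalize_text(text, form="NFC"):
    """Hypothetical helper: normalize extracted text before further processing."""
    return unicodedata.normalize(form, text)


# Stand-in for text extracted from a PDF. "\u037e" is the Greek question mark,
# which looks like a semicolon but is a different code point.
raw = "alpha\u037e beta\u037e gamma"

# Splitting on the ASCII semicolon finds no separator in the raw text...
print(raw.split(";"))

# ...but works once the text has been normalized.
fields = [part.strip() for part in normalize_text(raw).split(";")]
print(fields)
```

Wrapping the call in one helper keeps the cleaning step in a single place, though it still has to be applied to every extracted string by hand.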

petermr · 2024-07-16T15:47:02Z

The definitive rules are defined in the Unicode spec (https://unicode.org/reports/tr15/). It needs careful reading ("Taken step-by-step, the Unicode Normalization Algorithm is fairly complex"). It specifically discusses the Greek question mark. There are different formal approaches:

>>

The four Unicode Normalization Forms are summarized in *Table 1*.

Table 1. Normalization Forms <https://unicode.org/reports/tr15/#Normalization_Forms_Table>

| Form | Description |
| --- | --- |
| Normalization Form D (NFD) | Canonical Decomposition |
| Normalization Form C (NFC) | Canonical Decomposition, followed by Canonical Composition |
| Normalization Form KD (NFKD) | Compatibility Decomposition |
| Normalization Form KC (NFKC) | Compatibility Decomposition, followed by Canonical Composition |

=====

10 Respecting Canonical Equivalence <https://unicode.org/reports/tr15/#Canonical_Equivalence>

This section describes the relationship of normalization to respecting (or preserving) canonical equivalence. A process (or function) *respects* canonical equivalence when canonical-equivalent inputs always produce canonical-equivalent outputs. For a function that transforms one string into another, this may also be called *preserving* canonical equivalence. There are a number of important aspects to this concept:

1. The outputs are *not* required to be identical, only canonically equivalent.
2. *Not* all processes are required to respect canonical equivalence. For example:
   - A function that collects a set of the General_Category values present in a string will and should produce a different value for <*angstrom sign, semicolon*> than for <*A, combining ring above, greek question mark*>, even though they are canonically equivalent.
   - A function that does a binary comparison of strings will also find these two sequences different.
3. Higher-level processes that transform or compare strings, or that perform other higher-level functions, must respect canonical equivalence or problems will result.

<<<

It's important we adhere precisely to Unicode terminology and philosophy. For me (a crystallographer) it's the equivalence between Aring and Angstrom (which are frequently misused).

Note that Aring is further complicated and may have to be normalised: 0041 (A) + 030A (combining ring) => 00C5 (Aring). The problems frequently arise when authors pick symbols from menus without realising what character results. There are a lot of further illiteracies which probably can't be dealt with, e.g. em-dash for minus.


-- Peter Murray-Rust Founder ContentMine.org and Reader Emeritus in Molecular Informatics Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
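The points above about the Angstrom sign, Aring, and the four normalization forms can be checked with Python's standard `unicodedata` module. This is only an illustrative sketch; the constant names and the ligature example are assumptions added here, not part of the thread.

```python
import unicodedata

ANGSTROM_SIGN = "\u212b"   # ANGSTROM SIGN
A_RING = "\u00c5"          # LATIN CAPITAL LETTER A WITH RING ABOVE
A_PLUS_RING = "A\u030a"    # "A" followed by COMBINING RING ABOVE
LIGATURE_FI = "\ufb01"     # LATIN SMALL LIGATURE FI (a compatibility character)
EM_DASH = "\u2014"         # EM DASH, sometimes typed in place of a minus sign

spellings = (ANGSTROM_SIGN, A_RING, A_PLUS_RING)

# Canonical composition (NFC) maps all three spellings to the single code point U+00C5.
nfc_results = {unicodedata.normalize("NFC", s) for s in spellings}

# Canonical decomposition (NFD) maps all three to "A" + combining ring.
nfd_results = {unicodedata.normalize("NFD", s) for s in spellings}

# The compatibility forms (NFKC, NFKD) also fold compatibility characters
# such as the "fi" ligature; the canonical forms leave it untouched.
fi_nfc = unicodedata.normalize("NFC", LIGATURE_FI)
fi_nfkc = unicodedata.normalize("NFKC", LIGATURE_FI)

# The em dash has no decomposition, so no normalization form turns it into a minus.
dash_results = {unicodedata.normalize(form, EM_DASH) for form in ("NFC", "NFD", "NFKC", "NFKD")}

print(nfc_results, nfd_results, fi_nfc, fi_nfkc, dash_results)
```

Misuse such as an em dash standing in for a minus sign is therefore outside what normalization can repair and would need separate, explicit replacement rules.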


jsvine · 2024-08-04T18:14:45Z

Feature now added in 03a477f

On the develop branch, you should be able to run `pdfplumber.open(..., unicode_norm="NFC")`, where that latter argument can be any of the abbreviations for the four normalization forms.

Give it a whirl and let me know if it suits your needs / meets your expectations?
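A minimal usage sketch of the new argument, assuming a pdfplumber build that includes commit 03a477f. The wrapper name `extract_pages_normalized`, its up-front validation, and the file name in the usage note are hypothetical additions for illustration; only `pdfplumber.open(..., unicode_norm=...)` comes from the comment above.

```python
NORMALIZATION_FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def extract_pages_normalized(path, form="NFC"):
    """Hypothetical wrapper: return each page's text, normalized when the PDF is read."""
    if form not in NORMALIZATION_FORMS:
        raise ValueError(
            f"form must be one of {NORMALIZATION_FORMS}, got {form!r}"
        )

    # Imported lazily so the check above runs even where pdfplumber is not installed.
    import pdfplumber

    # unicode_norm is the argument described in this thread (commit 03a477f).
    with pdfplumber.open(path, unicode_norm=form) as pdf:
        return [page.extract_text() for page in pdf.pages]
```

Calling `extract_pages_normalized("example.pdf", form="NFKC")` would then return one string per page, with characters such as U+037E already mapped to their normalized equivalents.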

jsvine added the feature-request label Jun 13, 2023

jsvine self-assigned this Jun 13, 2023

jsvine added a commit that referenced this issue Aug 4, 2024

Add pdfplumber.open(unicode_norm=...)

03a477f

Allows user to pre-normalize Unicode characters. h/t @petermr + @agusluques in #905
