Add normalize_unicode=False/True parameter to text extraction methods #905

Open
jsvine opened this issue Jun 13, 2023 · 5 comments
Labels
feature-request All feature requests receive this label initially, can be upgraded to "enhancement"

@jsvine
Owner

jsvine commented Jun 13, 2023

Per @petermr's suggestion in #904 (comment), I think it's a good idea to add such a parameter/option, using unicodedata.normalize(...) — in a similar vein to the expand_ligatures parameter added in v0.9.0. I'll look into this.
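
For reference, here is a minimal sketch of how the standard library's `unicodedata.normalize` treats the canonical forms versus the compatibility forms. The sample strings are illustrative only and are not taken from any particular PDF.

```python
import unicodedata

# Canonical forms (NFC/NFD) only compose or decompose canonically
# equivalent sequences.
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

# Compatibility forms (NFKC/NFKD) additionally fold presentation variants,
# such as the "fi" ligature U+FB01, which NFC leaves untouched.
ligature = "\ufb01"
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)

print(len(decomposed), len(composed))
print(repr(nfc_ligature), repr(nfkc_ligature))
```

This is why the choice of form matters: the compatibility forms overlap with what a ligature-expanding option does, while the canonical forms do not.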

Some useful reference links, as a note-to-self:

@jsvine jsvine added the feature-request All feature requests receive this label initially, can be upgraded to "enhancement" label Jun 13, 2023
@jsvine jsvine self-assigned this Jun 13, 2023
@agusluques

Hi @jsvine, is there a workaround for this in the meantime?

Can I manually apply a normalize function to all text in the PDF?

@jsvine
Owner Author

jsvine commented Jul 15, 2024

Hi @agusluques, and thanks for checking. There have not been any updates on this, but there may still be a solution for certain use-cases. What's your particular use-case?

@agusluques

@jsvine thanks for the answer. Basically, I am trying to split text on ; (U+003B, SEMICOLON), but the PDF seems to use a different character that looks the same: ; (U+037E, GREEK QUESTION MARK). I am doing a manual replacement for now, but it would be great to have this applied when the PDF is read, so there is no risk of forgetting to include the cleaning logic.
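
A possible interim workaround, sketched under the assumption that the text has already been extracted into a plain Python string: pass it through `unicodedata.normalize` before splitting. U+037E has a canonical mapping to U+003B, so any of the four normalization forms replaces it. The helper name `normalize_text` and the sample string are hypothetical.

```python
import unicodedata


def normalize_text(text, form="NFC"):
    """Return `text` normalized to the given Unicode normalization form."""
    return unicodedata.normalize(form, text)


# Hypothetical extracted text containing U+037E (GREEK QUESTION MARK),
# which renders like a semicolon but is a different code point.
raw = "alpha\u037e beta\u037e gamma"

# Splitting the raw text on an ASCII semicolon finds nothing to split on.
unsplit = raw.split(";")

# After normalization, U+037E has become U+003B and the split works.
parts = normalize_text(raw).split(";")
print(parts)
```

Routing every extracted string through one helper like this keeps the cleaning step in a single place, which reduces the risk of forgetting it.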

@petermr

petermr commented Jul 16, 2024 via email

jsvine added a commit that referenced this issue Aug 4, 2024
Allows user to pre-normalize Unicode characters.

h/t @petermr + @agusluques in #905
@jsvine
Owner Author

jsvine commented Aug 4, 2024

Feature now added in 03a477f

On the develop branch, you should be able to run pdfplumber.open(..., unicode_norm="NFC"), where unicode_norm can be any of the abbreviations for the four Unicode normalization forms: "NFC", "NFD", "NFKC", or "NFKD".

Give it a whirl and let me know if it suits your needs / meets your expectations?
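
A minimal sketch of how the new argument might be used, assuming an installed pdfplumber build that includes the commit above. The `unicode_norm` keyword is taken from the comment above; the wrapper `extract_normalized_text`, its argument check, and the file name are hypothetical.

```python
VALID_FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def extract_normalized_text(path, form="NFC"):
    """Return the extracted text of each page, normalized at read time."""
    if form not in VALID_FORMS:
        raise ValueError(f"form must be one of {VALID_FORMS}, got {form!r}")

    # Imported here so the argument check above runs even where
    # pdfplumber is not installed.
    import pdfplumber

    # `unicode_norm` is the keyword described in the comment above.
    with pdfplumber.open(path, unicode_norm=form) as pdf:
        return [page.extract_text() for page in pdf.pages]


# Hypothetical usage; "example.pdf" is a placeholder path:
# for page_text in extract_normalized_text("example.pdf", "NFKC"):
#     print(page_text.split(";"))
```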
