can not decode afii characters (ISO 10036) #1381

Open
pubpub-zz opened this issue Oct 5, 2022 · 0 comments
Labels: help wanted, workflow-arabic-text-extraction

Comments

Collaborator

pubpub-zz commented Oct 5, 2022

Extracted from #1379:
PS: in the extraction result, the Arabic characters are replaced with /afiinnnn. This is because the data uses the ISO 10036 standard, and I have not been able to find any free information on how to transcode it.
File: 02voc.pdf
Test code:

import PyPDF2

# Page index 2 is the third page of the file.
print(PyPDF2.PdfReader("e:/02voc.pdf").pages[2].extract_text())
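
One possible workaround, as a rough sketch rather than anything PyPDF2 does today: the afii names are glyph names, and the publicly available Adobe Glyph List maps many of them to Unicode code points. The snippet below assumes the extracted text contains literal tokens such as /afii57415. The helper name `replace_afii_tokens` is hypothetical. The four table entries are believed to match the Adobe Glyph List but should be checked against the published list, and a real fix would need the full table.

```python
import re

# Partial table, ASSUMED to match the Adobe Glyph List entries for these
# glyph names; verify against the published list before relying on it.
AFII_TO_UNICODE = {
    "afii57415": "\u0627",  # ARABIC LETTER ALEF
    "afii57416": "\u0628",  # ARABIC LETTER BEH
    "afii57444": "\u0644",  # ARABIC LETTER LAM
    "afii57445": "\u0645",  # ARABIC LETTER MEEM
}

# Matches tokens of the form /afii followed by digits, e.g. /afii57415.
AFII_TOKEN = re.compile(r"/(afii\d+)")


def replace_afii_tokens(text: str) -> str:
    """Replace known /afiiNNNNN tokens with their Unicode characters.

    Hypothetical helper: tokens missing from the table are left unchanged
    so that nothing is silently dropped.
    """
    def substitute(match: re.Match) -> str:
        return AFII_TO_UNICODE.get(match.group(1), match.group(0))

    return AFII_TOKEN.sub(substitute, text)


if __name__ == "__main__":
    sample = "/afii57415/afii57416 /afii99999"
    print(replace_afii_tokens(sample))
```

Leaving unknown tokens in place makes gaps in the table easy to spot in the output. Character order and contextual shaping of the Arabic text are not handled here.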

Originally posted by @pubpub-zz in #1379 (comment)

pubpub-zz added the help wanted label Oct 5, 2022
MartinThoma added the workflow-arabic-text-extraction label Jan 31, 2023