Issue parsing PDFs - UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 4422: character maps to <undefined> #417

Closed
timalamenciak opened this issue Jul 26, 2024 · 5 comments · Fixed by #422
Labels
bug Something isn't working

@timalamenciak
Contributor

timalamenciak commented Jul 26, 2024

Extracting from the PDF of this article throws the error below: https://onlinelibrary.wiley.com/doi/10.1002/eco.1705

Other PDFs I have tested fail with the same error.

ontogpt -vvv extract -t trek_2.yaml -i test1.pdf
INFO:root:Logger root set to level 10
INFO:root:Input file: test1.pdf
Traceback (most recent call last):
  File "C:\Users\Tim Alamenciak\AppData\Local\pypoetry\Cache\virtualenvs\ontogpt-UsRDAP_3-py3.12\Scripts\\ontogpt", line 6, in <module>
    sys.exit(main())
             ^^^^^^
  File "C:\Users\Tim Alamenciak\AppData\Local\pypoetry\Cache\virtualenvs\ontogpt-UsRDAP_3-py3.12\Lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tim Alamenciak\AppData\Local\pypoetry\Cache\virtualenvs\ontogpt-UsRDAP_3-py3.12\Lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "C:\Users\Tim Alamenciak\AppData\Local\pypoetry\Cache\virtualenvs\ontogpt-UsRDAP_3-py3.12\Lib\site-packages\click\core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tim Alamenciak\AppData\Local\pypoetry\Cache\virtualenvs\ontogpt-UsRDAP_3-py3.12\Lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tim Alamenciak\AppData\Local\pypoetry\Cache\virtualenvs\ontogpt-UsRDAP_3-py3.12\Lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tim Alamenciak\Documents\Coding\TReK-OntoGPT\ontogpt\src\ontogpt\cli.py", line 329, in extract
    text = open(inputfile, "r").read()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Tim Alamenciak\AppData\Local\Programs\Python\Python312\Lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 4422: character maps to <undefined>
@caufieldjh caufieldjh added the bug Something isn't working label Jul 30, 2024
@timalamenciak
Contributor Author

Update on this - the error cropped up again when I copied and pasted text from a PDF, so I dug into the code. This block appears to be the cause (lines 324-329 of cli.py):


        if use_textract:
            import textract

            text = textract.process(inputfile).decode("utf-8")
        else:
            text = open(inputfile, "r").read()

In my own copy, I added errors="ignore" to the open() call. This skips any bytes that cannot be decoded, which may lose data, but I think that won't be crippling for this package's use case.


        if use_textract:
            import textract

            text = textract.process(inputfile).decode("utf-8")
        else:
            text = open(inputfile, "r", errors="ignore").read()
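
For anyone who wants to reproduce this without a real PDF, here is a minimal sketch. It assumes the failure comes from open() falling back to cp1252 (as the traceback suggests), where byte 0x8f has no mapping. The read_text helper and the sample bytes are hypothetical and not part of ontogpt.

```python
import os
import tempfile

# Made-up bytes standing in for binary PDF content; 0x8f is undefined in cp1252.
raw = b"%PDF-1.4 \x8f binary data"


def read_text(path, encoding="cp1252", errors="strict"):
    """Read a file as text with an explicit encoding and error policy."""
    with open(path, "r", encoding=encoding, errors=errors) as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.pdf")
    with open(path, "wb") as handle:
        handle.write(raw)

    # Strict decoding raises the same kind of error as in the traceback.
    try:
        read_text(path)
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True

    # errors="ignore" silently drops the undecodable byte.
    ignored = read_text(path, errors="ignore")

    # errors="replace" keeps a visible U+FFFD marker where data was lost.
    replaced = read_text(path, errors="replace")
```

With errors="ignore" the bad byte simply disappears, so errors="replace" may be preferable when it matters where data was lost.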

Textract is still not working.

@caufieldjh
Member

Might just fix this with #421.
In the meantime, I'll have a fix here shortly along the lines of what you suggest - though I don't recommend parsing entire PDFs with it unless you want to get a lot of unreadable characters.
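
As a rough illustration of why reading a whole PDF as plain text gives poor results: a PDF is a binary container, and it normally starts with the bytes %PDF-. The sketch below checks for that header before treating a file as text. The looks_like_pdf helper is hypothetical and not something ontogpt provides.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file starts with the usual PDF header bytes."""
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"


with tempfile.TemporaryDirectory() as tmp:
    pdf_path = os.path.join(tmp, "paper.pdf")
    txt_path = os.path.join(tmp, "notes.txt")

    # Made-up file contents for the demonstration.
    with open(pdf_path, "wb") as handle:
        handle.write(b"%PDF-1.7\n\x8f\x90 compressed stream")
    with open(txt_path, "wb") as handle:
        handle.write(b"Plain text abstract.")

    pdf_result = looks_like_pdf(pdf_path)
    txt_result = looks_like_pdf(txt_path)
```

A check like this could be used to warn before decoding a binary file with errors="ignore", since most of what survives will be unreadable.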

@caufieldjh caufieldjh linked a pull request Aug 1, 2024 that will close this issue
caufieldjh added a commit that referenced this issue Aug 2, 2024
Encoding errors will be ignored when parsing text from files.
@caufieldjh
Member

Hi @timalamenciak - give PDF parsing a try in v1.0.2 (just released) - it now uses the option --use-pdf instead of --use-textract

@timalamenciak
Contributor Author

Thrilling! That worked.

@timalamenciak
Contributor Author

Thanks @caufieldjh !
