Error while using hocr-pdf file #121

Closed
shekarnode opened this issue May 10, 2018 · 11 comments

Comments

@shekarnode

shekarnode commented May 10, 2018

When I run the command below, I get a character-encoding error.
Please help.

hocr-pdf . > out.pdf
Traceback (most recent call last):
  File "C:\Python36\Scripts\hocr-pdf.py", line 143, in <module>
    export_pdf(args.imgdir, 300)
  File "C:\Python36\Scripts\hocr-pdf.py", line 70, in export_pdf
    pdf.save()
  File "c:\python36\lib\site-packages\reportlab\pdfgen\canvas.py", line 1237, in save
    self._doc.SaveToFile(self._filename, self)
  File "c:\python36\lib\site-packages\reportlab\pdfbase\pdfdoc.py", line 224, in SaveToFile
    f.write(data)
  File "C:\Python36\Scripts\hocr-pdf.py", line 47, in write
    sys.stdout.write(data)
  File "c:\python36\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 11-14: character maps to <undefined>
@stweil
Collaborator

stweil commented May 10, 2018

Can you provide a hOCR file which causes this error? How did you create it?

@shekarnode
Author

shekarnode commented May 11, 2018

I used Tesseract 4.0.0 to generate the hOCR file:
Hocr File

This is the image from which the hOCR file above was generated:
e3_out

@shekarnode
Author

Is there any other way to extract a table from hOCR data?

@zuphilip
Collaborator

This works for me as well, after renaming the image and converting it to a JPG file.

  1. Do you also have the JPG file in your directory?
  2. What is your environment? Linux or Windows?
  3. Which Python version do you use? python -V
  4. Which encoding does the shell that runs Python use?

@shekarnode
Author

shekarnode commented May 16, 2018

@zuphilip

  1. I was using a PNG image for the conversion; I have now replaced it with a JPG.
  2. Environment: Windows
  3. Python 3.6.4
  4. I was using cmd to get the output. When I tried Git Bash instead, I got a PDF, but it was just a normal PDF, i.e. not searchable.

Are you able to generate a searchable PDF?

@amitdo
Contributor

amitdo commented May 16, 2018

Tesseract has an option to output to PDF. Did you try it?
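
For reference, a minimal sketch of that option driven from Python. It assumes the `tesseract` executable is on the PATH and relies on the `tesseract <image> <outputbase> pdf` form of the command line; the file names `e3_out.jpg` and `out` are hypothetical placeholders, not taken from this thread.

```python
import os
import shutil
import subprocess


def tesseract_pdf_command(image_path, output_base):
    """Build the command line that asks Tesseract for PDF output."""
    # The trailing "pdf" selects Tesseract's PDF output;
    # Tesseract itself appends ".pdf" to output_base.
    return ["tesseract", image_path, output_base, "pdf"]


if __name__ == "__main__":
    image = "e3_out.jpg"  # hypothetical input image
    cmd = tesseract_pdf_command(image, "out")  # would produce out.pdf
    if shutil.which("tesseract") and os.path.exists(image):
        subprocess.run(cmd, check=True)
    else:
        print("Not running; the command would be:", " ".join(cmd))
```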

@zuphilip
Collaborator

> are you able to generate searchable pdf ?

Yes, I see a searchable PDF, but I am working on Linux.

On Windows, the terminal encoding can be a problem. You can check which encoding Python uses in the Windows terminal by starting python and typing:

>>> import sys
>>> sys.stdout.encoding

If that is not UTF-8, you can try running the command with PYTHONIOENCODING=UTF-8 in front, i.e.

PYTHONIOENCODING=UTF-8 hocr-pdf . > out.pdf
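
The `VAR=value command` prefix form is shell syntax that works in bash (including Git Bash); in cmd the variable would be set first with `set PYTHONIOENCODING=UTF-8`. As a rough illustration of why the setting matters, the sketch below encodes a sample string once with cp1252 and once with UTF-8. The sample string is an assumption chosen for illustration (it is not extracted from the failing run); it contains control characters that have no cp1252 mapping.

```python
# Assumed sample text: the four characters after "%" have no cp1252 mapping.
sample = "%PDF-1.3\r\n%\x93\x8c\x8b\x9e"


def try_encode(text, encoding):
    """Return the encoded bytes, or the error message if encoding fails."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        return str(exc)


# cp1252 is the codec named in the traceback; encoding fails with a 'charmap' error.
print(try_encode(sample, "cp1252"))

# UTF-8 can encode every code point, so no exception is raised.
print(try_encode(sample, "utf-8"))
```

Note that UTF-8 only avoids the exception: each of these characters is written as two bytes rather than one, so the output is not byte-identical to what a single-byte encoding would produce.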

> i got pdf as output but it was just a normal pdf i.e. not in searchable format.

This is with Git Bash on Windows, right? Can you upload your result here?

@shekarnode
Author

@zuphilip
out.pdf
This is the PDF file that was generated.

@amitdo
I have also tried generating a searchable PDF with Tesseract, using the commands provided over here.
The output is still not searchable; it is just a plain PDF containing the image.

@zuphilip
Collaborator

@shekarnode There is text in your generated PDF and I can search for text as well.

@shekarnode
Author

I had been using Adobe Reader the whole time and could not search the text. When I opened the PDF in a browser, I found that it was searchable.

Thanks @zuphilip for helping out.

@amitdo
Contributor

amitdo commented May 16, 2018

The pdf produced by Tesseract is also searchable.
