Error while using hocr-pdf file #121

shekarnode · 2018-05-10T12:22:04Z

When I run the command below, I get an error related to character encoding.
Please help.

hocr-pdf . > out.pdf
Traceback (most recent call last):
  File "C:\Python36\Scripts\hocr-pdf.py", line 143, in <module>
    export_pdf(args.imgdir, 300)
  File "C:\Python36\Scripts\hocr-pdf.py", line 70, in export_pdf
    pdf.save()
  File "c:\python36\lib\site-packages\reportlab\pdfgen\canvas.py", line 1237, in save
    self._doc.SaveToFile(self._filename, self)
  File "c:\python36\lib\site-packages\reportlab\pdfbase\pdfdoc.py", line 224, in SaveToFile
    f.write(data)
  File "C:\Python36\Scripts\hocr-pdf.py", line 47, in write
    sys.stdout.write(data)
  File "c:\python36\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 11-14: character maps to <undefined>


stweil · 2018-05-10T16:19:03Z

Can you provide a hOCR file which causes this error? How did you create it?

shekarnode · 2018-05-11T05:43:10Z

I used Tesseract 4.0.0 to generate the hOCR.
Hocr File

This is the image for the hOCR generated above.

shekarnode · 2018-05-15T12:33:26Z

Is there any other solution for getting a table from hOCR data?

zuphilip · 2018-05-15T15:44:41Z

This works for me as well, after I renamed the image and converted it to a jpg file.

Do you have the jpg file also in your directory?
What is your environment? Linux or Windows?
What Python version do you use? python -V
What is the encoding of the shell that Python runs in?

shekarnode · 2018-05-16T06:28:34Z

@zuphilip

I was using a png image for the conversion; I have now replaced it with a jpg.
Environment - Windows
Python 3.6.4
I was using cmd to get the output. When I tried with git bash, I got a PDF as output, but it was just a normal PDF, i.e. not in searchable format.

Are you able to generate a searchable PDF?

amitdo · 2018-05-16T06:45:38Z

Tesseract has an option to output to PDF. Did you try it?

zuphilip · 2018-05-16T07:27:40Z

> are you able to generate searchable pdf ?

Yes, I see a searchable PDF, but I am working on Linux.

In the Windows terminal the encoding can be a problem. You can check the encoding Python uses in the Windows terminal by starting python and then typing

>>> import sys
>>> sys.stdout.encoding

If that is not UTF-8, then you can try to run the command with PYTHONIOENCODING=UTF-8 in front, i.e.

PYTHONIOENCODING=UTF-8 hocr-pdf . > out.pdf
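
To illustrate the failure mode in the traceback, here is a minimal sketch that checks whether a piece of text can be encoded with a given codec. The helper `can_encode` and the sample string are hypothetical and not part of hocr-pdf; the sample just contains one character that cp1252 has no mapping for.

```python
import sys


def can_encode(text, encoding):
    """Return True if every character in text has a mapping in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical sample text: U+0928 has no mapping in cp1252.
sample = "hOCR \u0928"

# The encoding Python uses when writing text to this terminal.
print("stdout encoding:", sys.stdout.encoding)
print("cp1252 can encode sample:", can_encode(sample, "cp1252"))
print("utf-8 can encode sample:", can_encode(sample, "utf-8"))
```

Note that the `VAR=value command` prefix form works in bash-like shells such as git bash. In cmd, the variable has to be set first with `set PYTHONIOENCODING=UTF-8` before running the command.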

> i got pdf as output but it was just a normal pdf i.e. not in searchable format.

This is with the git bash on windows, right? Can you upload your result here?

shekarnode · 2018-05-16T07:46:37Z

@zuphilip
out.pdf
This is the PDF file being generated.

@amitdo
I have also tried generating a searchable PDF from Tesseract.
I used the commands provided over here.
Still, the output is not in searchable format; it is just a simple PDF with an image.

zuphilip · 2018-05-16T07:49:42Z

@shekarnode There is text in your generated PDF and I can search for text as well.

shekarnode · 2018-05-16T08:21:21Z

I was using Adobe Reader and was not able to search the whole time. Now that I have opened the PDF in a browser, I found that it is searchable.

Thanks @zuphilip for helping out.

amitdo · 2018-05-16T13:25:28Z

The pdf produced by Tesseract is also searchable.

shekarnode closed this as completed May 16, 2018

zuphilip mentioned this issue May 16, 2018

hocr-pdf: Use UTF-8 for PDF output independently of terminal settings #124

Closed
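
One way to make output independent of terminal settings, as the title of #124 suggests, is to write bytes to the binary layer of stdout rather than text through the terminal codec. The sketch below only illustrates that idea and is not the actual change made for #124; the function name `write_pdf_bytes` and the sample data are hypothetical.

```python
import io
import sys


def write_pdf_bytes(data, stream=None):
    """Write data as raw bytes so the terminal's text encoding is never used."""
    if stream is None:
        # sys.stdout.buffer is the underlying binary stream of stdout.
        stream = sys.stdout.buffer
    if isinstance(data, str):
        # Assumption for this sketch: text is encoded as UTF-8.
        data = data.encode("utf-8")
    stream.write(data)


# Usage with an in-memory stream standing in for redirected stdout.
buf = io.BytesIO()
write_pdf_bytes("%PDF-1.3 \u0928\n", buf)
```

Because the function never passes text through `sys.stdout.write`, a cp1252 console cannot raise `UnicodeEncodeError` here, whatever characters the data contains.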
