can not ocr a unicode '€' #71

Closed
xqp opened this issue Feb 17, 2022 · 2 comments

xqp commented Feb 17, 2022

Hey

Unfortunately, it is not possible to OCR a PDF that contains a '€'.

PDF for OCR: 6b04b3384c6ab4108a020e931b3e31e6.pdf

  File "/opt/homebrew/lib/python3.9/site-packages/PIL/ImageDraw.py", line 428, in draw_text
    mask, offset = font.getmask2(
AttributeError: 'ImageFont' object has no attribute 'getmask2'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/patrikfiedler/simple-cups-backend/luca_connect.py", line 75, in <module>
    doc = PDF.loads(pdf_file_handle, [l])
  File "/opt/homebrew/lib/python3.9/site-packages/borb/pdf/pdf.py", line 54, in loads
    return ReadAnyObjectTransformer().transform(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/io/read/any_object_transformer.py", line 100, in transform
    return super().transform(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/io/read/transformer.py", line 123, in transform
    out = h.transform(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/io/read/reference/xref_transformer.py", line 139, in transform
    trailer = self.get_root_transformer().transform(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/io/read/any_object_transformer.py", line 100, in transform
    return super().transform(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/io/read/transformer.py", line 123, in transform
    out = h.transform(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/io/read/object/dictionary_transformer.py", line 46, in transform
    v = self.get_root_transformer().transform(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/io/read/any_object_transformer.py", line 100, in transform
    return super().transform(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/io/read/transformer.py", line 123, in transform
    out = h.transform(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/io/read/reference/reference_transformer.py", line 103, in transform
    transformed_referenced_object = self.get_root_transformer().transform(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/io/read/any_object_transformer.py", line 100, in transform
    return super().transform(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/io/read/transformer.py", line 123, in transform
    out = h.transform(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/io/read/page/root_dictionary_transformer.py", line 84, in transform
    transformed_root_dictionary = t.transform(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/io/read/object/dictionary_transformer.py", line 46, in transform
    v = self.get_root_transformer().transform(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/io/read/any_object_transformer.py", line 100, in transform
    return super().transform(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/io/read/transformer.py", line 123, in transform
    out = h.transform(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/io/read/reference/reference_transformer.py", line 103, in transform
    transformed_referenced_object = self.get_root_transformer().transform(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/io/read/any_object_transformer.py", line 100, in transform
    return super().transform(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/io/read/transformer.py", line 123, in transform
    out = h.transform(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/io/read/object/dictionary_transformer.py", line 46, in transform
    v = self.get_root_transformer().transform(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/io/read/any_object_transformer.py", line 100, in transform
    return super().transform(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/io/read/transformer.py", line 123, in transform
    out = h.transform(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/io/read/object/array_transformer.py", line 46, in transform
    object_to_transform[i] = self.get_root_transformer().transform(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/io/read/any_object_transformer.py", line 100, in transform
    return super().transform(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/io/read/transformer.py", line 123, in transform
    out = h.transform(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/io/read/reference/reference_transformer.py", line 103, in transform
    transformed_referenced_object = self.get_root_transformer().transform(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/io/read/any_object_transformer.py", line 100, in transform
    return super().transform(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/io/read/transformer.py", line 123, in transform
    out = h.transform(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/io/read/page/page_dictionary_transformer.py", line 100, in transform
    CanvasStreamProcessor(page_out, canvas, []).read(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/pdf/canvas/canvas_stream_processor.py", line 290, in read
    raise e
  File "/opt/homebrew/lib/python3.9/site-packages/borb/pdf/canvas/canvas_stream_processor.py", line 284, in read
    operator.invoke(self, operands, event_listeners)
  File "/opt/homebrew/lib/python3.9/site-packages/borb/pdf/canvas/operator/xobject/do.py", line 57, in invoke
    l._event_occurred(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/toolkit/ocr/ocr_as_optional_content_group.py", line 133, in _event_occurred
    super(OCRAsOptionalContentGroup, self)._event_occurred(event)
  File "/opt/homebrew/lib/python3.9/site-packages/borb/toolkit/ocr/ocr_image_render_event_listener.py", line 176, in _event_occurred
    font_color: RGBColor = self._get_font_color(
  File "/opt/homebrew/lib/python3.9/site-packages/borb/toolkit/ocr/ocr_image_render_event_listener.py", line 252, in _get_font_color
    text_image_draw.text((0, 0), text, fill=(0, 0, 0))
  File "/opt/homebrew/lib/python3.9/site-packages/PIL/ImageDraw.py", line 483, in text
    draw_text(ink)
  File "/opt/homebrew/lib/python3.9/site-packages/PIL/ImageDraw.py", line 443, in draw_text
    mask = font.getmask(
  File "/opt/homebrew/lib/python3.9/site-packages/PIL/ImageFont.py", line 148, in getmask
    return self.font.getmask(text, mode)
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256)

Here is a GitHub issue for the PIL package that addresses this problem.
I'm not very familiar with Python. Can anyone help me fix this?

Thanks

xqp changed the title from "can not our a unicode '€'" to "can not ocr a unicode '€'" on Feb 17, 2022
jorisschellekens (Owner) commented Feb 17, 2022

borb is trying to figure out what the font_color should be for a given piece of OCR-ed text.
To do this, borb builds an Image with a white background and draws the text on it (in this case the euro symbol) in black.
It then measures what fraction of the pixels are black.

That ratio (black / total) should be roughly the same as the share of text-colored pixels in the original image.
For example, if 8% of all pixels are black in the generated image and the original image has 8% yellow pixels, borb assumes the euro symbol is drawn in yellow.
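
To make the matching step concrete, here is a minimal sketch in plain Python. It is not borb's actual code: the function name `guess_font_color` and the idea of passing in a precomputed color histogram are assumptions made for illustration. It picks the color whose share of the original image is closest to the black-pixel ratio of the rendered text.

```python
from typing import Dict, Tuple

RGB = Tuple[int, int, int]


def guess_font_color(glyph_ratio: float, color_counts: Dict[RGB, int]) -> RGB:
    """Return the color whose pixel share is closest to glyph_ratio.

    glyph_ratio  -- black / total pixels in the rendered reference image
    color_counts -- hypothetical histogram of the original image (color -> pixel count)
    """
    total = sum(color_counts.values())
    if total == 0:
        raise ValueError("color_counts must describe at least one pixel")
    # Compare each color's share of the image with the reference ratio.
    return min(color_counts, key=lambda c: abs(color_counts[c] / total - glyph_ratio))


# The example from the comment: 8% black in the rendering, 8% yellow in the original.
counts = {(255, 255, 255): 920, (255, 255, 0): 80}
print(guess_font_color(0.08, counts))  # (255, 255, 0)
```

A real implementation would also have to deal with anti-aliasing and near-identical shades, which this sketch ignores.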

This also means that borb will try to render every character in the document in a PIL/Pillow Image.
That is why you are getting this error: not every font can render every character. In this case, the default font for rendering text in Pillow is unable to render '€'.
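
The last line of the traceback shows the underlying limitation: the font used here encodes text as latin-1, and '€' (U+20AC) has no latin-1 encoding. The sketch below reproduces that check in plain Python and shows one possible guard. The helper names and the placeholder strategy are assumptions for illustration, not necessarily how the fix in borb works.

```python
def renderable_with_latin1_font(text: str) -> bool:
    """True if every character in text can be encoded as latin-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


def safe_text(text: str, placeholder: str = "?") -> str:
    """Replace characters a latin-1-only font cannot draw with a placeholder."""
    return "".join(
        ch if renderable_with_latin1_font(ch) else placeholder for ch in text
    )


print(renderable_with_latin1_font("€"))  # False
print(safe_text("5 €"))                  # 5 ?
```

Since only the ratio of dark pixels matters for the color estimate, substituting a placeholder glyph changes the estimate slightly but avoids the crash. Loading a TrueType font that covers the needed characters is another option.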

jorisschellekens (Owner) commented

This issue has been fixed and will be available in the next release.
