UTF Encoding Fix for Annotation Comments #21

theotheo · 2023-10-23T18:15:59Z

Hello,

I encountered an issue when using Cyrillic characters in annotations: during export, the text was transformed into an unreadable set of characters. My knowledge of PDF is not sufficient to confidently pinpoint the exact cause of the problem. However, some code experiments helped me find a simple solution that appears to resolve the issue.

I will illustrate the problem with a specially created PDF file with annotations: Example.pdf. For maximum clarity, I will also provide a screenshot.

The screenshot shows a PDF with two lines of text and two highlight annotations whose comments combine Latin and Cyrillic characters.

The export of this document looks as follows:

[
    {
        "annotatedText": "text",
        "color": "#ffff00",
        "colorCategory": "Yellow",
        "comment": "This is :\u003e\u003c\u003c5=B0@89",
        "date": "2023-10-23T20:33:08+03:00",
        "id": "highlight-p1x90y719",
        "page": 1,
        "pageLabel": "1",
        "type": "highlight",
        "x": 90.67,
        "y": 719.27
    },
    {
        "annotatedText": "текст",
        "color": "#00ff00",
        "colorCategory": "Green",
        "comment": "-B\u003e a comment",
        "date": "2023-10-23T20:33:35+03:00",
        "id": "highlight-p1x90y696",
        "page": 1,
        "pageLabel": "1",
        "type": "highlight",
        "x": 90.67,
        "y": 696.66
    }
]

As you can see, the Cyrillic parts of the "comment" fields come out as unreadable characters, while the Latin parts are intact.
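The garbling is not random. The following is a minimal sketch of one explanation that matches the output exactly, under the assumption that the comment is stored in the PDF as UTF-16BE and that only the low byte of each 16-bit code unit survives the export. This is an inference from the JSON above, not something confirmed against the exporter's code, and the helper name drop_high_bytes is hypothetical. The \u003e and \u003c sequences in the export are ordinary JSON escapes for > and <.

```python
import json


def drop_high_bytes(text: str) -> str:
    """Simulate the suspected bug: keep only the low byte of each UTF-16BE code unit."""
    raw = text.encode("utf-16-be")      # two bytes per character for Latin and Cyrillic
    return raw[1::2].decode("latin-1")  # every second byte is the low byte


# The first "comment" value from the broken export, with its JSON escapes resolved.
exported = json.loads(r'"This is :\u003e\u003c\u003c5=B0@89"')

print(drop_high_bytes("This is комментарий") == exported)        # True
print(drop_high_bytes("Это a comment") == "-B> a comment")       # True
```

Latin letters survive because their high byte is 0x00; the Cyrillic letters used here all have the high byte 0x04, so dropping it maps each of them onto an unrelated ASCII character.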

The updated version produces results with correctly encoded characters:

[
    {
        "annotatedText": "text",
        "color": "#ffff00",
        "colorCategory": "Yellow",
        "comment": "This is комментарий",
        "date": "2023-10-23T20:33:08+03:00",
        "id": "highlight-p1x90y719",
        "page": 1,
        "pageLabel": "1",
        "type": "highlight",
        "x": 90.67,
        "y": 719.27
    },
    {
        "annotatedText": "текст",
        "color": "#00ff00",
        "colorCategory": "Green",
        "comment": "Это a comment",
        "date": "2023-10-23T20:33:35+03:00",
        "id": "highlight-p1x90y696",
        "page": 1,
        "pageLabel": "1",
        "type": "highlight",
        "x": 90.67,
        "y": 696.66
    }
]

This resolves the issue with unreadable characters in the comments.
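For illustration only, here is a sketch of how a PDF text string can be decoded so that Cyrillic survives. It is not the code from the commits in this pull request. It assumes the raw bytes of the annotation's contents are available, checks for the UTF-16BE byte order mark (FE FF) that PDF uses to mark Unicode text strings, and otherwise falls back to Latin-1 as a rough stand-in for PDFDocEncoding, which differs from Latin-1 for some code points. The function name decode_pdf_text_string is hypothetical.

```python
def decode_pdf_text_string(raw: bytes) -> str:
    """Decode the raw bytes of a PDF text string (sketch, see assumptions above)."""
    if raw.startswith(b"\xfe\xff"):
        # A leading FE FF byte order mark means the rest is UTF-16BE.
        return raw[2:].decode("utf-16-be")
    # Approximation: treat everything else as Latin-1 instead of PDFDocEncoding.
    return raw.decode("latin-1")


# Bytes as they might be stored for the second annotation's comment.
stored = b"\xfe\xff" + "Это a comment".encode("utf-16-be")
print(decode_pdf_text_string(stored))  # Это a comment
```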

P.S. This may not be crucial, but I'd like to mention that I'm using this project through your Obsidian-Zotero-Integrator. I should also note that I use Okular for annotation.

theotheo added 2 commits October 23, 2023 21:02

Fix: UTF encoding for comments in annotations

ab55d44

chore: update forgotten version number line

8d5f9ae
