UTF Encoding Fix for Annotation Comments #21

Open
wants to merge 2 commits into
base: main

theotheo

Hello,

I encountered an issue when using Cyrillic characters in annotations: during export, the text was transformed into an unreadable set of characters. My knowledge of PDF is not sufficient to confidently pinpoint the exact cause of the problem. However, some code experiments helped me find a simple solution that appears to resolve the issue.

I will illustrate the problem with a PDF file created specifically for this purpose, with annotations: Example.pdf. For maximum clarity, I will also provide a screenshot:
image
The screenshot shows a PDF with 2 lines of text and 2 highlight annotations whose comments combine Latin and Cyrillic characters.

The export of this document looks as follows:

[
    {
        "annotatedText": "text",
        "color": "#ffff00",
        "colorCategory": "Yellow",
        "comment": "This is :\u003e\u003c\u003c5=B0@89",
        "date": "2023-10-23T20:33:08+03:00",
        "id": "highlight-p1x90y719",
        "page": 1,
        "pageLabel": "1",
        "type": "highlight",
        "x": 90.67,
        "y": 719.27
    },
    {
        "annotatedText": "текст",
        "color": "#00ff00",
        "colorCategory": "Green",
        "comment": "-B\u003e a comment",
        "date": "2023-10-23T20:33:35+03:00",
        "id": "highlight-p1x90y696",
        "page": 1,
        "pageLabel": "1",
        "type": "highlight",
        "x": 90.67,
        "y": 696.66
    }
]

As you can see, the "comment" fields contain unreadable characters (in which Unicode escape sequences can be recognized).
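
One possible reading of the garbled values, offered as a hedged sketch rather than a confirmed diagnosis: they match what is left when only the low byte of each UTF-16 code unit of the original comment is kept. The helper name `low_bytes_of_utf16` is hypothetical and not part of the project. The `\u003e` and `\u003c` sequences in the JSON are simply the escaped forms of `>` and `<`.

```python
import json


def low_bytes_of_utf16(text: str) -> str:
    """Hypothetical helper: keep only the low byte of each UTF-16BE code unit.

    This imitates a decoder that treats UTF-16 text as single-byte text and
    loses the high bytes. It only illustrates the symptom; it is not the
    project's code.
    """
    data = text.encode("utf-16-be")
    # Every code unit is 2 bytes (high, low); take every second byte.
    return data[1::2].decode("latin-1")


# The first exported comment, unescaped from its JSON form.
exported = json.loads('"This is :\\u003e\\u003c\\u003c5=B0@89"')

print(low_bytes_of_utf16("This is комментарий") == exported)
print(low_bytes_of_utf16("Это a comment") == "-B> a comment")
```

Both comparisons print `True`, which suggests the comments are stored as UTF-16 and that the Cyrillic high byte (`0x04`) is lost somewhere during export, while Latin characters survive because their high byte is zero.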

The updated version produces results with correctly encoded characters:

[
    {
        "annotatedText": "text",
        "color": "#ffff00",
        "colorCategory": "Yellow",
        "comment": "This is комментарий",
        "date": "2023-10-23T20:33:08+03:00",
        "id": "highlight-p1x90y719",
        "page": 1,
        "pageLabel": "1",
        "type": "highlight",
        "x": 90.67,
        "y": 719.27
    },
    {
        "annotatedText": "текст",
        "color": "#00ff00",
        "colorCategory": "Green",
        "comment": "Это a comment",
        "date": "2023-10-23T20:33:35+03:00",
        "id": "highlight-p1x90y696",
        "page": 1,
        "pageLabel": "1",
        "type": "highlight",
        "x": 90.67,
        "y": 696.66
    }
]

This resolves the issue with unreadable characters in the comments.
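
The patch itself is not reproduced in this description. As a rough sketch of the general technique only, assuming the comment is stored as a PDF text string: such strings are either in PDFDocEncoding or in UTF-16BE prefixed with the byte order mark `FE FF`, so a decoder needs to check for that marker before choosing an encoding. The function name `decode_pdf_text_string` is hypothetical, and Latin-1 is used here as an approximation of PDFDocEncoding.

```python
def decode_pdf_text_string(raw: bytes) -> str:
    """Hypothetical sketch: decode the raw bytes of a PDF text string."""
    # A leading FE FF byte order mark signals UTF-16BE.
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be")
    # Assumption: Latin-1 stands in for PDFDocEncoding, which differs
    # from it for a number of code points.
    return raw.decode("latin-1")


# Build the bytes a PDF writer would plausibly store for the second comment.
raw_comment = b"\xfe\xff" + "Это a comment".encode("utf-16-be")
print(decode_pdf_text_string(raw_comment))
```

A string without the marker falls through to the single-byte branch, so plain Latin comments keep working unchanged.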

P.S. This may not be crucial, but I'd like to mention that I'm using this project through your Obsidian-Zotero-Integrator. I should also note that I use Okular for annotation.
