Read/write/import/export highlights from calibre #849

Open
johnfactotum opened this issue Jan 14, 2022 · 6 comments
Labels
enhancement New feature or request

Comments

@johnfactotum
Owner

Just a quick note about how highlighting works in calibre. It doesn't seem to be documented anywhere.

If "Keep a copy of annotations/bookmarks in the e-book file, for easy sharing" is checked in "Preferences" > "Miscellaneous" in the e-book viewer, calibre will store a copy of annotation data in META-INF/calibre_bookmarks.txt, which is a base64 encoded JSON file.

Here is a sample:

encoding=json+base64:
W3sicG9zIjogImVwdWJjZmkoLzE2LzIvNC82LzE6MjIpIiwgInBvc190eXBlIjogImVwdWJjZmkiLCAi
dGltZXN0YW1wIjogIjIwMjItMDEtMTRUMTY6MzI6MDEuMDI4OTcyKzAwOjAwIiwgInR5cGUiOiAibGFz
dC1yZWFkIn0sIHsiZW5kX2NmaSI6ICIvMi80LzYvMTozMTEiLCAiaGlnaGxpZ2h0ZWRfdGV4dCI6ICJP
ZnRlbiB0aGUgaW5mb3JtYXRpb24gaXMgaW5jb21wbGV0ZSBvciBqdXN0IHBsYWluIHdyb25nLiBEb27i
gJl0IHdvcnJ5IOKAkyBjYWxpYnJlIG1ha2VzIGl0IGVhc3kgdG8gZml4IHRoaXMuIiwgInNwaW5lX2lu
ZGV4IjogNywgInNwaW5lX25hbWUiOiAidGV4dC90YXNrXzFfb3JnYW5pemluZy54aHRtbCIsICJzdGFy
dF9jZmkiOiAiLzIvNC82LzE6MjA2IiwgInN0eWxlIjogeyJraW5kIjogImRlY29yYXRpb24iLCAidHlw
ZSI6ICJidWlsdGluIiwgIndoaWNoIjogIndhdnkifSwgInRpbWVzdGFtcCI6ICIyMDIyLTAxLTE0VDE2
OjI1OjI0LjcwMFoiLCAidG9jX2ZhbWlseV90aXRsZXMiOiBbIkNvbW1vbiBUYXNrcyIsICJUYXNrIDE6
IE9yZ2FuaXppbmciXSwgInR5cGUiOiAiaGlnaGxpZ2h0IiwgInV1aWQiOiAieWZOazlLZnpPRm02T2o0
OWZsRDVYUSJ9LCB7ImVuZF9jZmkiOiAiLzIvNC82LzE6NzYiLCAiaGlnaGxpZ2h0ZWRfdGV4dCI6ICJE
dXJpbmcgYW4gZS1ib29rIGltcG9ydCwgY2FsaWJyZSB0cmllcyB0byByZWFkIHRoZSBtZXRhZGF0YSBm
cm9tIHRoZSBlLWJvb2suIiwgInNwaW5lX2luZGV4IjogNywgInNwaW5lX25hbWUiOiAidGV4dC90YXNr
XzFfb3JnYW5pemluZy54aHRtbCIsICJzdGFydF9jZmkiOiAiLzIvNC82LzE6MCIsICJzdHlsZSI6IHsi
a2luZCI6ICJjb2xvciIsICJ0eXBlIjogImJ1aWx0aW4iLCAid2hpY2giOiAieWVsbG93In0sICJ0aW1l
c3RhbXAiOiAiMjAyMi0wMS0xNFQxNjozMTo1Ny4zNzhaIiwgInRvY19mYW1pbHlfdGl0bGVzIjogWyJD
b21tb24gVGFza3MiLCAiVGFzayAxOiBPcmdhbml6aW5nIl0sICJ0eXBlIjogImhpZ2hsaWdodCIsICJ1
dWlkIjogIkF1VkVEMEd2b1N6eWVwbDNNVXpVUFEifV0=

Which decodes to,

[
    {
        "pos": "epubcfi(/16/2/4/6/1:22)",
        "pos_type": "epubcfi",
        "timestamp": "2022-01-14T16:32:01.028972+00:00",
        "type": "last-read"
    },
    {
        "end_cfi": "/2/4/6/1:311",
        "highlighted_text": "Often the information is incomplete or just plain wrong. Don’t worry – calibre makes it easy to fix this.",
        "spine_index": 7,
        "spine_name": "text/task_1_organizing.xhtml",
        "start_cfi": "/2/4/6/1:206",
        "style": {"kind":"decoration","type":"builtin","which":"wavy"},
        "timestamp": "2022-01-14T16:25:24.700Z",
        "toc_family_titles": ["Common Tasks","Task 1: Organizing"],
        "type": "highlight",
        "uuid": "yfNk9KfzOFm6Oj49flD5XQ"
    },
    {
        "end_cfi": "/2/4/6/1:76",
        "highlighted_text": "During an e-book import, calibre tries to read the metadata from the e-book.",
        "spine_index": 7,
        "spine_name": "text/task_1_organizing.xhtml",
        "start_cfi": "/2/4/6/1:0",
        "style": {"kind":"color","type":"builtin","which":"yellow"},
        "timestamp": "2022-01-14T16:31:57.378Z",
        "toc_family_titles": ["Common Tasks","Task 1: Organizing"],
        "type": "highlight",
        "uuid": "AuVED0GvoSzyepl3MUzUPQ"
    }
]
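
The decoding step above can be sketched in a few lines of Python. This is only a minimal sketch based on the sample in this comment: it assumes the file always starts with the encoding=json+base64: header and that the payload is UTF-8 JSON. The function names are made up for illustration and are not part of calibre.

```python
import base64
import json
import zipfile

HEADER = "encoding=json+base64:"
BOOKMARKS_PATH = "META-INF/calibre_bookmarks.txt"


def decode_calibre_bookmarks(text: str) -> list:
    """Decode the contents of calibre_bookmarks.txt into a list of annotations."""
    text = text.strip()
    if not text.startswith(HEADER):
        # Assumption: only the json+base64 encoding seen in the sample is handled.
        raise ValueError("unexpected header; only json+base64 is handled here")
    # The base64 payload is wrapped across lines, so drop all whitespace first.
    payload = "".join(text[len(HEADER):].split())
    return json.loads(base64.b64decode(payload).decode("utf-8"))


def read_annotations(epub_path: str) -> list:
    """Read annotations embedded in an EPUB, or return [] if there are none."""
    with zipfile.ZipFile(epub_path) as book:
        if BOOKMARKS_PATH not in book.namelist():
            return []
        raw = book.read(BOOKMARKS_PATH).decode("utf-8")
    return decode_calibre_bookmarks(raw)
```

Calling read_annotations("book.epub") on a file saved with the preference enabled should return a list shaped like the decoded sample above.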

Remarks:

  • It uses non-standard CFI. No idea about the format used for last read position (which is also used for bookmarks, not included in this sample). It seems to reference the spine in some weird way. The highlights reference spine by both href and index.
  • Headings from TOC are stored with the highlight.
  • Highlights are stored unsorted; they are sorted in the application.

Also, not included in the sample above, but notes are stored with the notes key in each highlight.
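
Since the highlights are stored unsorted, a third-party tool would have to sort them itself. Here is a rough sketch of one way to do it, using the spine_index and start_cfi fields from the sample. It is a simplification and not calibre's actual sorting code: it compares the numeric steps and the character offset only, and ignores the finer points of the CFI sorting rules. The function names are hypothetical.

```python
import re

# One CFI step: a slash, a number, and an optional [id] assertion.
_STEP = re.compile(r"/(\d+)(?:\[[^\]]*\])?")


def cfi_sort_key(cfi: str) -> tuple:
    """Turn a path such as '/2/4/2[chapter-32]/34/1:7' into a sortable tuple."""
    # Simplification: assumes ':' only appears before the character offset.
    path, _, offset = cfi.partition(":")
    steps = tuple(int(number) for number in _STEP.findall(path))
    return steps, int(offset or 0)


def sort_highlights(highlights: list) -> list:
    """Order highlights by spine position, then by where they start in the file."""
    return sorted(
        highlights,
        key=lambda h: (h["spine_index"], cfi_sort_key(h["start_cfi"])),
    )
```

With the two highlights from the sample, the one starting at /2/4/6/1:0 sorts before the one starting at /2/4/6/1:206.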

Also if you choose "Export", it will produce JSON in the following format:

{
  "highlights": [ /* highlights here */ ],
  "type": "calibre_highlights",
  "version": 1
}

It does not, however, support importing highlights (see https://www.mobileread.com/forums/showpost.php?s=e73c5ef33e1b606d66d59b001fb5a7dd&p=4171284&postcount=6).
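
For a third-party tool, writing and reading that export envelope could look roughly like this. It is a sketch based only on the three keys shown above; the function names are hypothetical, and rejecting any version other than 1 is my own assumption.

```python
import json


def to_calibre_export(highlights: list) -> str:
    """Wrap a list of highlight objects in the export envelope shown above."""
    envelope = {
        "highlights": highlights,
        "type": "calibre_highlights",
        "version": 1,
    }
    return json.dumps(envelope, indent=2, ensure_ascii=False)


def from_calibre_export(text: str) -> list:
    """Return the highlights from an exported file, checking the envelope first."""
    data = json.loads(text)
    # Assumption: anything other than type "calibre_highlights", version 1 is rejected.
    if data.get("type") != "calibre_highlights" or data.get("version") != 1:
        raise ValueError("not a version 1 calibre_highlights export")
    return data["highlights"]
```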

@johnfactotum johnfactotum added the enhancement New feature or request label Jan 14, 2022
@Moonbase59

Moonbase59 commented Jan 15, 2022

Didn’t know you could actually switch that off in Calibre’s Reader—I’ve always disliked readers that actually modify the original EPUBs. So thanks for finding that one!

On the other hand, I always liked that both Calibre reader and yours seem at least to use CFI—might be a step forward and I always hoped these two might somehow be able to interchange highlights/notes/bookmarks. Somehow.

It is a pity that reading status, bookmarks, highlights and notes (and their exchange) aren’t really standardized. Sadly, now everybody cooks their own.

@johnfactotum
Owner Author

> It is a pity that reading status, bookmarks, highlights and notes (and their exchange) aren’t really standardized.

Well, for better or worse, calibre sort of is the standard (or a standard, in the free/open-source world, at least).

> On the other hand, I always liked that both Calibre reader and yours seem at least to use CFI

Well, there are really only so many ways to reference document fragments, whether it's a point reference for bookmarks or a range for highlights. Most reading systems probably use one or more of the following:

  • DOM-based selectors, including XPath and CFI.
  • Offset based. This can either be the raw byte offset or the decoded string character offset.
  • Text based, based on the extracted text of the fragment.

DOM-based references aren't too hard to implement for any browser-based reading system, and they should be more or less interoperable (barring any implementation bugs). The main advantage of using CFI, apart from it being the standard, is that it's designed to be more robust against modifications made to the document source code (i.e. updated versions of the book), and it can be sorted without having to look at the source document.

The offset-based method is used by Mobipocket (and possibly Kindle) in its proprietary format. It's brittle and hard to implement, though probably more performant on a low-powered machine, depending on how the book is rendered. It is possible to convert between offset-based and DOM-based references, but it's not straightforward at all.

The text-based method is the slowest and least reliable. It should probably only be used as a last resort, or as an additional aid to other kinds of selectors, or simply as part of the annotation when exported.
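
To illustrate why the text-based method is best kept as a fallback, here is a tiny sketch that looks up a highlight's text in the extracted plain text of a chapter. The function name and the idea of returning every match are my own choices for illustration, and how the chapter text gets extracted is left out entirely.

```python
def find_text_matches(chapter_text: str, highlighted_text: str) -> list:
    """Return every character offset at which highlighted_text occurs."""
    if not highlighted_text:
        raise ValueError("highlighted_text must not be empty")
    matches = []
    start = chapter_text.find(highlighted_text)
    while start != -1:
        matches.append(start)
        # Continue one character later so overlapping matches are found too.
        start = chapter_text.find(highlighted_text, start + 1)
    return matches
```

An empty result means the text has changed or was extracted differently (whitespace, quotes, and so on), and more than one result means the text alone cannot say which occurrence was highlighted. Either way another kind of selector is needed to settle it.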

A slightly unfortunate thing here is that the CFI used by calibre is not standard.

"pos": "epubcfi(/16/2/4/6/1:22)",

So it appears that the first step, /16, is an otherwise standard reference to spine item 7 (counting from zero), except that it is not preceded by a step for the <spine> element itself and is not followed by an indirection (!).

The next step, /2, is a non-standard reference to the root element, which is always /2.

(The / is called a "Step Reference to Child Element or Character Data" (emphasis added). As the root element is not a child element of anything, logically it cannot and should not be referenced in the CFI.)

The equivalent standard CFI would be

epubcfi(/6/16!/4/6/1:22)
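
The conversion described above could be sketched like this. It follows my reading of the sample only, so treat it as a guess at the format rather than a specification. In particular, it assumes the <spine> element is at step /6 of the package document, as in the example; in general that step has to be looked up in the OPF. The function names are hypothetical.

```python
import re


def calibre_pos_to_standard(pos: str, spine_step: int = 6) -> str:
    """Convert a calibre 'pos' value such as epubcfi(/16/2/4/6/1:22)."""
    # First step: the spine item. Second step: the root element, always /2.
    match = re.fullmatch(r"epubcfi\(/(\d+)/2(/.*)?\)", pos)
    if match is None:
        raise ValueError("not in the expected calibre form: " + pos)
    item_step = match.group(1)
    local_path = match.group(2) or ""
    # Assumption: spine_step is the step of <spine> within the package document.
    return f"epubcfi(/{spine_step}/{item_step}!{local_path})"


def calibre_highlight_to_standard(spine_index: int, cfi: str, spine_step: int = 6) -> str:
    """Convert a highlight's start_cfi or end_cfi using its spine_index."""
    if not cfi.startswith("/2/"):
        raise ValueError("expected the path to start with the root step /2")
    # Element children get even CFI steps, so zero-based item 7 becomes /16.
    item_step = (spine_index + 1) * 2
    return f"epubcfi(/{spine_step}/{item_step}!{cfi[2:]})"
```

For the sample, calibre_pos_to_standard("epubcfi(/16/2/4/6/1:22)") gives the standard CFI shown above.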

@Moonbase59

I wonder how much sense it’d make if you talked directly to Kovid Goyal about that. Maybe some overall more standard solution could be found, to the benefit of users?

Personally, I don’t see Calibre and Foliate as competition, but complementing each other: The more involved user will use Calibre anyway, so an exchange might be nice, but there are many people that just want to read an e-book on the desktop once in a while, without the wish or need for a fully-fledged solution like Calibre.

Lacking a common reading status/bookmark/highlight/annotation standard, the biggest hurdles for users (and e-book editors) are nowadays:

  • syncing between devices (like for continuing reading/editing on another device)
  • the inability to access highlights/annotations made on a reading device for further processing, say in academic use
  • missing reliable (and accepted) citations in academic work
  • the inability for editors to highlight and annotate errors on a desktop or real device in a way that is easily transportable, allows for counter-annotations and/or acceptance/rejection with a click.
  • for a user, to mark and annotate errors in a published book and give feedback to the publisher.

@johnfactotum
Owner Author

IMO the biggest problem isn't the lack of standards. It's the lack of documentation. Most apps don't even document how annotations are stored internally. Simply figuring out how they are stored would go a long way towards interoperability and compatibility by allowing other apps or third-party tools to convert between different formats.

But allowing exchange without an explicit import/export step is going to be difficult. For example, there isn't even a standardized way to determine whether two files are the same book. While there is a standard way to give unique identifiers to books, the spec actually recommends against relying completely on this identifier, and many reading systems (unlike Foliate) don't use this identifier at all (see #435). So there isn't even a reliable way to associate annotations with a book.

> syncing between devices (like for continuing reading/editing on another device)

Syncing anything between devices is itself often a hard enough problem. There are many different kinds of technologies and designs, each with their advantages and limitations.

Embedding data in the file would nicely sidestep both the problem of uniquely identifying books and any issues with transporting the annotations. But it requires modifying the file. Some people like it, but for others it's a dealbreaker. (Though I suspect that people might be less resistant to the idea if calibre had made things clearer.)

So here we can observe that it would be hard to find a design that suits different, often conflicting needs well. There's always a trade-off, but there are never any clear answers.

> missing reliable (and accepted) citations in academic work

Academic citing is really a different problem. First there's the problem of compatibility with printed works. This is more or less a solved problem. It's just that Foliate doesn't support page-list (yet), but I think most reading systems do support it (as long as it's provided by the book, of course).

Then there's the problem of how to reference pages if no page-list is available. Many reading systems have artificial pages or "locations". A request to standardize this has been declined. And for publishers, it's unclear whether or how to provide a page-list when there's no print version.

This is, however, mostly an issue when presenting the book or annotations for human consumption. It has very little to do with how the annotations are stored for the internal consumption of reading systems.

@Moonbase59

Looks like still some way to go. For everyone.

johnfactotum added a commit to johnfactotum/foliate-js that referenced this issue Nov 4, 2022
@johnfactotum
Owner Author

It seems Calibre also produces incorrect CFIs in some situations. To reproduce,

  1. Download and open the "advanced epub" file from https://standardebooks.org/ebooks/herman-melville/moby-dick
  2. Navigate to Chapter XXXII, Cetology.
  3. Highlight the text "According to magnitude I divide the whales into three primary books (subdivisible into chapters),"
  4. Export highlights from Calibre.

Actual result:

Calibre creates the following highlight data.

      {
      "end_cfi": "/2/4/2[chapter-32]/34/1:91",
      "highlighted_text": "According to magnitude I divide the whales into three primary books (subdivisible into chapters),",
      "spine_index": 37,
      "spine_name": "epub/text/chapter-32.xhtml",
      "start_cfi": "/2/4/2[chapter-32]/34/1:7",
      "style": {
        "kind": "decoration",
        "type": "builtin",
        "which": "wavy"
      }

Here 1:91 in the end CFI is wrong: 91 is out of range for the 1st node. The words "books" and "chapters" are in their own nodes, so the 1st node is only 69 characters long.

Expected result:

The end of the highlighted text, "),", is in the 5th node, so the correct CFI should end in 5:2.
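
The arithmetic can be checked with a small sketch that walks the children of the paragraph and finds where the selection ends. This models the paragraph from the description above, not the actual markup of the book: the seven characters before the highlight and the text after it are placeholders, and "books" and "chapters" are assumed to be child elements. It also assumes text and element children strictly alternate, starting with text, so that a child's CFI step is simply its position plus one.

```python
from typing import List, Tuple

Chunk = Tuple[str, str]  # ("text" or "elem", text content)


def locate_end(children: List[Chunk], start_offset: int, length: int) -> Tuple[int, int]:
    """Return (CFI step, offset) for the end of a selection that starts at
    start_offset in the first child and spans length characters."""
    remaining = start_offset + length  # measured from the start of child 0
    for position, (kind, text) in enumerate(children):
        if remaining <= len(text):
            if kind != "text":
                raise ValueError("end falls inside a child element")
            # Alternating children: text nodes get odd steps, elements even.
            return position + 1, remaining
        remaining -= len(text)
    raise ValueError("selection runs past the end of the parent element")


highlighted = ("According to magnitude I divide the whales into three "
               "primary books (subdivisible into chapters),")
children = [
    # Placeholder: seven unknown characters precede the highlight.
    ("text", "_" * 7 + "According to magnitude I divide the whales into three primary "),
    ("elem", "books"),
    ("text", " (subdivisible into "),
    ("elem", "chapters"),
    ("text", "), placeholder for the rest of the paragraph"),
]

print(len(children[0][1]))                          # length of the 1st node
print(locate_end(children, 7, len(highlighted)))    # where the highlight ends
```

Under these assumptions the first node comes out 69 characters long and the end lands in step 5 at offset 2, which matches the expected result.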
