calibre sax json parser #11922

Merged: 9 commits into koreader:master, Jun 5, 2024

Conversation

@pazos (Member) commented May 29, 2024

Add a proper SAX parser for "metadata.calibre" files, using lunajson.

Requires koreader/koreader-base#1801

Fixes #11611
Fixes #11215
Fixes #9016


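(For context: lunajson exposes a SAX-style interface where you hand the parser a table of event callbacks and it streams the document through them, so the whole DOM never has to live in memory at once. A minimal sketch, assuming lunajson is on the Lua path as provided by the koreader-base PR above; the handler names match the parser discussed later in this thread.)

-- Minimal lunajson SAX sketch: print every event the parser emits.
local lunajson = require("lunajson")

local sax = {
    startobject = function() print("startobject") end,
    endobject   = function() print("endobject") end,
    startarray  = function() print("startarray") end,
    endarray    = function() print("endarray") end,
    key         = function(s) print("key", s) end,
    string      = function(s) print("string", s) end,
    number      = function(n) print("number", n) end,
    boolean     = function(b) print("boolean", b) end,
    null        = function() print("null") end,
}

-- newfileparser() reads the file in chunks; run() drives it to completion.
local parser = lunajson.newfileparser("metadata.calibre", sax)
parser.run()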

@pazos (Member, Author) commented May 29, 2024

@Frenzie: what is the proper magic to exclude lunajson files from CI?

@pazos (Member, Author) commented May 29, 2024

@NiLuJe: thanks for the review but I didn't expect one on foreign code :)

@Frenzie (Member) commented May 29, 2024

@pazos

diff --git a/kodev b/kodev
index eb2e52d0f..cec1342de 100755
--- a/kodev
+++ b/kodev
@@ -962,7 +962,7 @@ function kodev-check() {
         exit_code=1
     fi
 
-    tab_detected=$(grep -P "\\t" --include \*.lua --exclude={dateparser.lua,xml.lua} --recursive {reader,setupkoenv,datastorage}.lua frontend plugins spec || true)
+    tab_detected=$(grep -P "\\t" --include \*.lua --exclude={dateparser.lua,lunajson.lua,xml.lua} --exclude-dir=lunajson --recursive {reader,setupkoenv,datastorage}.lua frontend plugins spec || true)
     if [ "${tab_detected}" ]; then
         echo -e "\\n${ANSI_RED}Warning: tab character detected. Please use spaces."
         echo "${tab_detected}"

@pazos (Member, Author) commented May 31, 2024

The implementation was broken, see https://www.mobileread.com/forums/showthread.php?p=4427793#post4427793

Now it does the proper thing, AFAICT.

@pazos changed the title from "wip: calibre sax json parser" to "calibre sax json parser" on May 31, 2024
@Frenzie (Member) commented May 31, 2024 via email

Frenzie pushed a commit to koreader/koreader-base that referenced this pull request May 31, 2024
@pazos (Member, Author) commented May 31, 2024

If some of you (wink, wink @NiLuJe) have a big on-device metadata file with a bunch of custom columns, I wouldn't mind a benchmark (or at least the number of books / size in bytes of the document).

FWIW, here 150 naked books got me a metadata.calibre of ~620 KiB, so less than 5 KiB per book.

I'm assuming it could grow to 50+ KiB per book. (On MobileRead an advanced user said ~4000 books weigh 40 MiB, so maybe 50 is the worst-case scenario.)

Even then, a MAX_JSON_FILESIZE of 10 MiB could handle 2000 books without extra metadata, or 200 books with too much metadata.

So it seems like a nice threshold for switching to the SAX parser.
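(A rough sketch of what such a size-based switch could look like; the module name "saxparser", the helper names, and the exact threshold handling are illustrative assumptions rather than the plugin's actual code.)

-- Illustrative sketch only: "saxparser" stands in for the lunajson-based
-- streaming module from this PR; rapidjson.load() is the DOM loader.
local rapidjson = require("rapidjson")
local saxparser = require("saxparser")

local MAX_JSON_FILESIZE = 10 * 1024 * 1024 -- the 10 MiB threshold discussed above

-- Get the file size with plain io, to keep the sketch dependency-free.
local function getFileSize(path)
    local f = io.open(path, "rb")
    if not f then return nil end
    local size = f:seek("end")
    f:close()
    return size
end

local function parseMetadata(path)
    local size = getFileSize(path)
    if size and size <= MAX_JSON_FILESIZE then
        -- Small enough to build the whole DOM in memory: fast rapidjson path.
        return rapidjson.load(path)
    end
    -- Too big (or size unknown): stream it with the SAX parser instead.
    return saxparser.parseFile(path)
end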

@benoit-pierre (Contributor) commented:

Is there an advantage in having 2 code paths? Would always using the sax parser make a noticeable (negative) impact on performance?

@pazos (Member, Author) commented May 31, 2024

Is there an advantage in having 2 code paths?

Very good question. That's why I would like a benchmark from somebody who uses calibre heavily and doesn't use wireless transfers to push their books.

Would always using the sax parser make a noticeable (negative) impact on performance?

Not on a desktop with a few hundred books. Rapidjson is faster with all unneeded fields stripped and not slower with default calibre fields.

No idea about real devices since my metadata there is already stripped.

Anyhow, in my quite limited tests both produce results in the same order of magnitude, but rapidjson has fewer outliers.

@NiLuJe (Member) commented Jun 1, 2024

I don't really have a fancy columns setup, but I'll try to give you some numbers ;).

@NiLuJe (Member) commented Jun 1, 2024

Yup, it's considerably slower even on a fairly modest library (the metadata file is 7.2 MB) and on a relatively speedy SoC (that was a Sage; it boosts its A7 cores fairly high, so much so that they might still be faster than the MTK's A53 in terms of single-threaded performance).

Anyway:

RapidJSON:

06/01/24-20:18:31 INFO  calibre info loaded from disk (search) in 472.000 milliseconds: 876 books 
06/01/24-20:18:31 INFO  metadata: 876 books imported from calibre in 499.000 milliseconds 
06/01/24-20:18:31 INFO  search done in 60.000 milliseconds (series, case sensitive: false, title: true, authors: true, series: false)

LunaJSON:

06/01/24-20:22:22 INFO  calibre info loaded from disk (search) in 1794.000 milliseconds: 876 books 
06/01/24-20:22:22 INFO  metadata: 876 books imported from calibre in 1820.000 milliseconds 
06/01/24-20:22:22 INFO  search done in 39.000 milliseconds (series, case sensitive: false, title: true, authors: true, series: false)

In terms of UX, that translates into going from barely being able to notice the parsing to really noticing a stall ^^.

@NiLuJe (Member) commented Jun 1, 2024

On the upside, at a very quick glance, the results look correct, so It Works(TM) ;).

@pazos (Member, Author) commented Jun 1, 2024

Yup, it's considerably slower even on a fairly modest library

Ok, thanks. That's much worse than I expected. And it doesn't seem like it is going to improve on very big libraries.

I see a few things that can be improved in this PR; I'm going to take a look.


@NiLuJe (Member) commented Jun 1, 2024

I see a few things that can be improved in this PR; I'm going to take a look.

Give me a shout when you're ready for me to re-test ;).

@pazos (Member, Author) commented Jun 1, 2024

I'm not able to improve the thing too much, but I at least caught a bug that would crash the program when trying to re-parse :)

Also improved the tests a bit, so they measure what they're expected to measure:

All calibre fields

# lunajson
Parsed in 0.0116 milliseconds
Parsed in 0.0101 milliseconds
Parsed in 0.0099 milliseconds
Parsed in 0.0094 milliseconds
Parsed in 0.0089 milliseconds
Parsed in 0.0092 milliseconds
Parsed in 0.0089 milliseconds
Parsed in 0.0093 milliseconds
Parsed in 0.0089 milliseconds
Parsed in 0.0091 milliseconds

# rapidjson
Parsed in 0.0015 milliseconds
Parsed in 0.0016 milliseconds
Parsed in 0.0015 milliseconds
Parsed in 0.0014 milliseconds
Parsed in 0.0015 milliseconds
Parsed in 0.0015 milliseconds
Parsed in 0.0015 milliseconds
Parsed in 0.0018 milliseconds
Parsed in 0.0014 milliseconds
Parsed in 0.0014 milliseconds

Removed fields

# lunajson
Parsed in 0.0057 milliseconds
Parsed in 0.0052 milliseconds
Parsed in 0.0051 milliseconds
Parsed in 0.0046 milliseconds
Parsed in 0.0047 milliseconds
Parsed in 0.0046 milliseconds
Parsed in 0.0046 milliseconds
Parsed in 0.0045 milliseconds
Parsed in 0.0046 milliseconds
Parsed in 0.0046 milliseconds

# rapidjson
Parsed in 0.0009 milliseconds
Parsed in 0.0010 milliseconds
Parsed in 0.0010 milliseconds
Parsed in 0.0009 milliseconds
Parsed in 0.0009 milliseconds
Parsed in 0.0008 milliseconds
Parsed in 0.0009 milliseconds
Parsed in 0.0008 milliseconds
Parsed in 0.0008 milliseconds
Parsed in 0.0009 milliseconds
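(For reference, a minimal sketch of the kind of timing loop that produces numbers like the above; this is not the repo's actual test code, and os.clock() measures CPU time, so on-device numbers would differ.)

-- Micro-benchmark sketch (illustrative only).
local lunajson = require("lunajson")

-- A tiny stand-in document; the real tests parse a full metadata.calibre dump.
local sample = '[{"title":"A Book","authors":["Someone"],"size":12345,"tags":[]}]'

for _ = 1, 10 do
    local start = os.clock()
    lunajson.decode(sample)
    local elapsed_ms = (os.clock() - start) * 1000
    print(string.format("Parsed in %.4f milliseconds", elapsed_ms))
end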

@NiLuJe (Member) commented Jun 2, 2024

FWIW, something based on hashmap hitchecks instead of ipairs seems ever-so-slightly faster over here, and possibly reads a tiny bit better?

-- parse "metadata.calibre" files
local lj = require("lunajson")

local array_fields = {
    authors = true,
    tags = true,
    series = true,
}

local required_fields = {
    authors = true,
    last_modified = true,
    lpath = true,
    series = true,
    series_index = true,
    size = true,
    tags = true,
    title = true,
    uuid = true,
}

local field
local t = {}
local function append(v)
    -- These *may* be arrays, so we need the extra check to confirm that startarray ran
    if array_fields[field] and t[field] then
        table.insert(t[field], v)
    else
        t[field] = v
        field = nil
    end
end

local depth = 0
local result = {}
local sax = {
    startobject = function()
        depth = depth + 1
    end,
    endobject = function()
        if depth == 1 then
            table.insert(result, t)
            t = {}
        end
        depth = depth - 1
    end,
    startarray = function()
        if array_fields[field] then
            t[field] = {}
        end
    end,
    endarray = function()
        if field then
            field = nil
        end
    end,
    key = function(s)
        if required_fields[s] then
            field = s
        end
    end,
    string = function(s)
        if field then
            append(s)
        end
    end,
    number = function(n)
        if field then
            append(n)
        end
    end,
    boolean = function(b)
        if field then
            append(b)
        end
    end,
    null = function()
        if field then
            append(nil)
        end
    end,
}

local parser = {}
function parser.parseFile(file)
    result = {}
    local p = lj.newfileparser(file, sax)
    p.run()
    field = nil
    return result
end

return parser

(hashmap)

06/02/24-07:59:30 INFO  calibre info loaded from disk (search) in 1661.000 milliseconds: 876 books 
06/02/24-07:59:30 INFO  metadata: 876 books imported from calibre in 1686.000 milliseconds 
06/02/24-07:59:30 INFO  search done in 40.000 milliseconds (series, case sensitive: false, title: true, authors: true, series: false) 

vs.

(arrays + ipairs checks)

06/02/24-07:56:05 INFO  calibre info loaded from disk (search) in 1710.000 milliseconds: 876 books 
06/02/24-07:56:05 INFO  metadata: 876 books imported from calibre in 1736.000 milliseconds 
06/02/24-07:56:05 INFO  search done in 39.000 milliseconds (series, case sensitive: false, title: true, authors: true, series: false)

@NiLuJe (Member) commented Jun 2, 2024

Eeeeeh, most of this might simply come from not calling append at all for unwanted keys...
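(For illustration, the two lookup strategies being compared, an ipairs scan versus a hashmap hit-check, look roughly like this in isolation; this is a generic sketch rather than the earlier revision's exact code, and as noted above the bigger win is that with the hit-check unwanted keys never reach append() at all.)

-- Generic sketch of the two lookup strategies, outside the SAX handlers.
local wanted_list = { "authors", "lpath", "series", "tags", "title", "uuid" }
local wanted_set = {}
for _, name in ipairs(wanted_list) do
    wanted_set[name] = true
end

-- Linear scan: O(n) comparisons per key event.
local function isWantedScan(s)
    for _, name in ipairs(wanted_list) do
        if s == name then return true end
    end
    return false
end

-- Hashmap hit-check: a single table lookup per key event.
local function isWantedHit(s)
    return wanted_set[s] ~= nil
end

assert(isWantedScan("title") and isWantedHit("title"))
assert(not isWantedScan("description") and not isWantedHit("description"))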

@NiLuJe (Member) commented Jun 2, 2024

Fixed the above only using the first value of an array; I broke it at some point during the edits ;p.

@pazos (Member, Author) commented Jun 2, 2024

Indeed, it looks much better :)

Eeeeeh, most of this might simply come from not calling append at all for unwanted keys...

Yup, but that's the best we can get :)

Is there an advantage in having 2 code paths? Would always using the sax parser make a noticeable (negative) impact on performance?

Based on the tests, yeah, we want two code paths. We also want to enforce rapidjson whenever possible. We could add a UI option to switch between implementations (keeping auto as the default), so people could change to lunajson if they get an OOM error with the default settings, or force rapidjson beyond the threshold if they know their device has enough RAM to handle their particular json files.

@pazos (Member, Author) commented Jun 2, 2024

We could add a UI option to switch between implementations

For example:
[screenshot of the proposed menu option]
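(A rough sketch of what such a sub-menu could look like in KOReader's menu-table style; the setting key "calibre_json_parser", the labels, and the use of G_reader_settings are illustrative assumptions, not the PR's final code.)

-- Illustrative sketch only: setting key and labels are made up.
local _ = require("gettext")

local choices = {
    { label = _("Automatic (size-based)"), value = "auto" },
    { label = _("Always rapidjson (fast)"), value = "rapidjson" },
    { label = _("Always lunajson (low memory)"), value = "lunajson" },
}

local sub_item_table = {}
for _i, choice in ipairs(choices) do
    table.insert(sub_item_table, {
        text = choice.label,
        checked_func = function()
            local current = G_reader_settings:readSetting("calibre_json_parser") or "auto"
            return current == choice.value
        end,
        callback = function()
            G_reader_settings:saveSetting("calibre_json_parser", choice.value)
        end,
    })
end

local menu_item = {
    text = _("Metadata parser"),
    sub_item_table = sub_item_table,
}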

@pazos (Member, Author) commented Jun 5, 2024

I can't really test the OOM scenario

You don't need to test it.
There's no OOM scenario as long as the C++ does the same thing as the Lua version.
Just massive speed gains :).

it actually looks like it's the sheer amount of k,v pairs that just balloons to stupid amounts

Actually... large string values could explain some of the overhead (e.g., the most common one being description).

Yup, too much to store the whole DOM in memory. The VM just dies.

@pazos (Member, Author) commented Jun 5, 2024

I'd rather this make it in, actually.

Although, perhaps with slightly different menu labels/config keys. I'm thinking something along the lines of:

Fast (RapidJSON via load_calibre)
Safe (LunaJSON)
Legacy (RapidJSON via load)
(And, for now, until we have more data from insane libraries, keep the size check, but perhaps default to fast in order to exercise the codepath and find potential issues with it).

The option to skip the UI and go reckless on the fast rapidjson codepath is out of the equation?

@NiLuJe (Member) commented Jun 5, 2024

The option to skip the UI and go reckless on the fast rapidjson codepath is out of the equation?

I mean, eventually, maybe? Right now, I'd keep it in just to be safe ;p.

@pazos marked this pull request as ready for review on June 5, 2024 at 18:29
@pazos (Member, Author) left a comment

Good!

Feel free to bump base and merge when ready :)

pazos and others added 9 commits on June 5, 2024 at 20:57
Co-authored-by: NiLuJe <ninuje@gmail.com>
Keep the file-size check for now, until this gets some more mileage.

Ultimately, we'll *probably* want to always use fast, but for now,
leave some options on the table in case things go kablooey ;).
@NiLuJe merged commit 79c13be into koreader:master on Jun 5, 2024. 3 of 4 checks passed.
@benoit-pierre (Contributor) commented:

The rapidjson.load_calibre variant crashes on my test data; here is the first entry:

    {
        "series_index": null,
        "size": 2520019,
        "series": null,
        "last_modified": "2023-10-12T08:41:34+00:00",
        "authors": [
            "Stephen King"
        ],
        "tags": {},
        "lpath": "Shining, The - Stephen King (2279).epub",
        "title": "The Shining"
    },

I don't know how and/or why it ended up with "tags": {} instead of "tags": [], but this causes the crash.

@pazos (Member, Author) commented Jun 6, 2024

I don't know how and/or why it ended up with "tags": {} instead of "tags": [], but this causes the crash.

Do you use the wireless client? It rewrites the json file using the current json parser, so any issue with the parser ends up in the original json file.

If that's true, then the metadata itself isn't valid anymore. During my tests with the Lua parser I went the following route to avoid this kind of issue:

  1. Delete json files (and books)
  2. Push books and json files using calibre's connect to folder
  3. Verify the metadata can be parsed and searches work ok with any field
  4. Disconnect from folder, start the wireless client
  5. Verify the client comm with the server is ok
  6. Send a few books wirelessly
  7. Verify the metadata is still correct.

@benoit-pierre (Contributor) commented:

I had previously connected to calibre with the wireless client, but I don't know if the bad data is from before or after updating to test the new code.

@benoit-pierre (Contributor) commented:

Anyway, it should not crash.

@benoit-pierre (Contributor) commented:

And crash means a segfault, not an exception in the Lua code.

@NiLuJe (Member) commented Jun 6, 2024

but may not take too kindly to malformed input files

Famous last words ;p.


Yeah, that probably unbalances the push/pop on the Lua stack, trashing the actual stack.

I'll see if I can come up with something not too awful so it doesn't horribly implode ;).

@NiLuJe (Member) commented Jun 6, 2024

@pazos: Do you remember if the frontend code can deal with potential array fields being nil (instead of an empty table)?

(i.e., should I push a nil or a {} for these broken "arrays"?).

@NiLuJe (Member) commented Jun 6, 2024

Alternatively, how does it deal with a missing required field? (as that's my other potential approach ;p).

@pazos (Member, Author) commented Jun 6, 2024

@NiLuJe: There's https://github.com/koreader/koreader/blob/master/plugins/calibre.koplugin/metadata.lua#L41-L53, which should turn nils into whatever value calibre expects.

The only values that are required are the strings, which should never be nil (calibre will fill in bogus values if it can't figure out the proper ones, e.g. last_modified: None).
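(A rough illustration of that kind of normalization, i.e. filling nil fields with values the rest of the code expects; this is a guess for illustration only, the real code at the linked lines may differ, and it assumes lua-rapidjson's rapidjson.array() and rapidjson.null helpers.)

-- Illustrative guess only; see plugins/calibre.koplugin/metadata.lua#L41-L53
-- for the real normalization.
local rapidjson = require("rapidjson")

local function normalizeBook(book)
    -- Array fields should round-trip as JSON arrays, even when empty.
    book.authors = book.authors or rapidjson.array({})
    book.tags = book.tags or rapidjson.array({})
    -- calibre emits null for books without a series (see the sample entry above).
    if book.series == nil then book.series = rapidjson.null end
    if book.series_index == nil then book.series_index = rapidjson.null end
    return book
end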

@NiLuJe (Member) commented Jun 6, 2024

I don't know how and/or why it ended up with "tags": {} instead of "tags": []

That's because, when dumping back to json, for tables that have lost their metatable array/object flag, it assumes by default that empty tables are objects, not arrays (cf. the empty_table_as_array encoding option, which defaults to false).

And the aforementioned slim method, when confronted with a missing field, sets arrays to an untagged empty table ;).

On the same subject, I'd removed those tags from the load_calibre variant, on the assumption that these could never end up being dumped back to json, but that might not have been a great idea ;).
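(A short sketch of the encoding behaviour described here, assuming the lua-rapidjson API bundled in koreader-base:)

-- Sketch: how an empty table ends up as "{}" unless it is tagged as an array
-- or the empty_table_as_array option is enabled.
local rapidjson = require("rapidjson")

print(rapidjson.encode({}))                                  --> {}
print(rapidjson.encode(rapidjson.array({})))                 --> []
print(rapidjson.encode({}, { empty_table_as_array = true })) --> []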

NiLuJe added a commit to NiLuJe/koreader-base that referenced this pull request Jun 6, 2024
In order to deal with metadata.calibre files we've mistakenly mangled in
the past ;).

Re: koreader/koreader#11922 (comment) & koreader/koreader#11922 (comment)
NiLuJe added a commit to koreader/koreader-base that referenced this pull request Jun 7, 2024
In order to deal with metadata.calibre files we've mistakenly mangled in
the past ;).

Re: koreader/koreader#11922 (comment) & koreader/koreader#11922 (comment)