calibre sax json parser #11922

Merged: 9 commits into koreader:master, Jun 5, 2024

Conversation

@pazos (Member) commented May 29, 2024

Add a proper SAX parser for "metadata.calibre" files, using lunajson.

Requires koreader/koreader-base#1801

Fixes #11611
Fixes #11215
Fixes #9016


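(For context: lunajson exposes a SAX-style interface where you hand the parser a table of event callbacks and it streams the document through them, so the whole DOM never has to live in memory at once. A minimal sketch, assuming lunajson is on the Lua path as provided by the koreader-base PR above; the handler names match the parser discussed later in this thread.)

-- Minimal lunajson SAX sketch: print every event the parser emits.
local lunajson = require("lunajson")

local sax = {
    startobject = function() print("startobject") end,
    endobject   = function() print("endobject") end,
    startarray  = function() print("startarray") end,
    endarray    = function() print("endarray") end,
    key         = function(s) print("key", s) end,
    string      = function(s) print("string", s) end,
    number      = function(n) print("number", n) end,
    boolean     = function(b) print("boolean", b) end,
    null        = function() print("null") end,
}

-- newfileparser() reads the file in chunks; run() drives it to completion.
local parser = lunajson.newfileparser("metadata.calibre", sax)
parser.run()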

@pazos (Member, Author) commented May 29, 2024

@Frenzie: what is the proper magic to exclude lunajson files from CI?

@pazos (Member, Author) commented May 29, 2024

@NiLuJe: thanks for the review but I didn't expect one on foreign code :)

@Frenzie (Member) commented May 29, 2024

@pazos

diff --git a/kodev b/kodev
index eb2e52d0f..cec1342de 100755
--- a/kodev
+++ b/kodev
@@ -962,7 +962,7 @@ function kodev-check() {
         exit_code=1
     fi
 
-    tab_detected=$(grep -P "\\t" --include \*.lua --exclude={dateparser.lua,xml.lua} --recursive {reader,setupkoenv,datastorage}.lua frontend plugins spec || true)
+    tab_detected=$(grep -P "\\t" --include \*.lua --exclude={dateparser.lua,lunajson.lua,xml.lua} --exclude-dir=lunajson --recursive {reader,setupkoenv,datastorage}.lua frontend plugins spec || true)
     if [ "${tab_detected}" ]; then
         echo -e "\\n${ANSI_RED}Warning: tab character detected. Please use spaces."
         echo "${tab_detected}"

@pazos (Member, Author) commented May 31, 2024

The implementation was broken, see https://www.mobileread.com/forums/showthread.php?p=4427793#post4427793

Now it does the proper thing, AFAICT.

@pazos changed the title from "wip: calibre sax json parser" to "calibre sax json parser" on May 31, 2024
@Frenzie (Member) commented May 31, 2024 via email

Frenzie pushed a commit to koreader/koreader-base that referenced this pull request May 31, 2024
@pazos (Member, Author) commented May 31, 2024

If some of you (wink, wink @NiLuJe) have a big on-device metadata file with a bunch of custom columns, I wouldn't mind a benchmark (or at least the number of books / size in bytes of the document).

FWIW, here 150 naked books got me a metadata.calibre of ~620 KiB, so less than 5 KiB per book.

I'm assuming it could grow to 50+ KiB per book. (On MobileRead an advanced user said ~4000 books weigh 40 MiB, so maybe 50 is the worst-case scenario.)

Even then, a MAX_JSON_FILESIZE of 10 MiB could handle 2000 books without extra metadata, or 200 books with too much metadata.

So it seems like a nice threshold for switching to the SAX parser.
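(A rough sketch of what such a size-based switch could look like; the module name "saxparser", the helper names, and the exact threshold handling are illustrative assumptions rather than the plugin's actual code.)

-- Illustrative sketch only: "saxparser" stands in for the lunajson-based
-- streaming module from this PR; rapidjson.load() is the DOM loader.
local rapidjson = require("rapidjson")
local saxparser = require("saxparser")

local MAX_JSON_FILESIZE = 10 * 1024 * 1024 -- the 10 MiB threshold discussed above

-- Get the file size with plain io, to keep the sketch dependency-free.
local function getFileSize(path)
    local f = io.open(path, "rb")
    if not f then return nil end
    local size = f:seek("end")
    f:close()
    return size
end

local function parseMetadata(path)
    local size = getFileSize(path)
    if size and size <= MAX_JSON_FILESIZE then
        -- Small enough to build the whole DOM in memory: fast rapidjson path.
        return rapidjson.load(path)
    end
    -- Too big (or size unknown): stream it with the SAX parser instead.
    return saxparser.parseFile(path)
end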

@benoit-pierre (Contributor) commented:

Is there an advantage in having 2 code paths? Would always using the sax parser make a noticeable (negative) impact on performance?

@pazos (Member, Author) commented May 31, 2024

Is there an advantage in having 2 code paths?

Very good question. That's why I would like a benchmark from somebody who uses calibre heavily and doesn't use wireless transfers to push their books.

Would always using the sax parser make a noticeable (negative) impact on performance?

Not on a desktop with a few hundred books. Rapidjson is faster with all unneeded fields stripped and not slower with default calibre fields.

No idea about real devices since my metadata there is already stripped.

Anyhow, in my quite limited tests both produce results in the same order of magnitude, but rapidjson has fewer outliers.

@NiLuJe (Member) commented Jun 1, 2024

I don't really have a fancy columns setup, but I'll try to give you some numbers ;).

@NiLuJe (Member) commented Jun 1, 2024

Yup, it's considerably slower even on a fairly modest library (the metadata file is 7.2 MB) and on a relatively speedy SoC (that was a Sage; it boosts its A7 cores fairly high, so much so that they might still be faster than the MTK's A53 in terms of single-threaded performance).

Anyway:

RapidJSON:

06/01/24-20:18:31 INFO  calibre info loaded from disk (search) in 472.000 milliseconds: 876 books 
06/01/24-20:18:31 INFO  metadata: 876 books imported from calibre in 499.000 milliseconds 
06/01/24-20:18:31 INFO  search done in 60.000 milliseconds (series, case sensitive: false, title: true, authors: true, series: false)

LunaJSON:

06/01/24-20:22:22 INFO  calibre info loaded from disk (search) in 1794.000 milliseconds: 876 books 
06/01/24-20:22:22 INFO  metadata: 876 books imported from calibre in 1820.000 milliseconds 
06/01/24-20:22:22 INFO  search done in 39.000 milliseconds (series, case sensitive: false, title: true, authors: true, series: false)

In terms of UX, that translates into going from barely being able to notice the parsing to really noticing a stall ^^.

@NiLuJe (Member) commented Jun 1, 2024

On the upside, at a very quick glance, the results look correct, so It Works(TM) ;).

@pazos (Member, Author) commented Jun 1, 2024

Yup, it's considerably slower even on a fairly modest library

Ok, thanks. That's much worse than I expected. And it doesn't seem like it is going to improve on very big libraries.

I see a few things that can be improved in this PR; I'm going to take a look.


@NiLuJe (Member) commented Jun 1, 2024

I see a few things that can be improved in this PR; I'm going to take a look.

Give me a shout when you're ready for me to re-test ;).

@pazos (Member, Author) commented Jun 1, 2024

I'm not able to improve the thing too much, but I at least caught a bug that would crash the program when trying to re-parse :)

Also improved the tests a bit, so they measure what they're expected to measure:

All calibre fields

# lunajson
Parsed in 0.0116 milliseconds
Parsed in 0.0101 milliseconds
Parsed in 0.0099 milliseconds
Parsed in 0.0094 milliseconds
Parsed in 0.0089 milliseconds
Parsed in 0.0092 milliseconds
Parsed in 0.0089 milliseconds
Parsed in 0.0093 milliseconds
Parsed in 0.0089 milliseconds
Parsed in 0.0091 milliseconds

# rapidjson
Parsed in 0.0015 milliseconds
Parsed in 0.0016 milliseconds
Parsed in 0.0015 milliseconds
Parsed in 0.0014 milliseconds
Parsed in 0.0015 milliseconds
Parsed in 0.0015 milliseconds
Parsed in 0.0015 milliseconds
Parsed in 0.0018 milliseconds
Parsed in 0.0014 milliseconds
Parsed in 0.0014 milliseconds

Removed fields

# lunajson
Parsed in 0.0057 milliseconds
Parsed in 0.0052 milliseconds
Parsed in 0.0051 milliseconds
Parsed in 0.0046 milliseconds
Parsed in 0.0047 milliseconds
Parsed in 0.0046 milliseconds
Parsed in 0.0046 milliseconds
Parsed in 0.0045 milliseconds
Parsed in 0.0046 milliseconds
Parsed in 0.0046 milliseconds

# rapidjson
Parsed in 0.0009 milliseconds
Parsed in 0.0010 milliseconds
Parsed in 0.0010 milliseconds
Parsed in 0.0009 milliseconds
Parsed in 0.0009 milliseconds
Parsed in 0.0008 milliseconds
Parsed in 0.0009 milliseconds
Parsed in 0.0008 milliseconds
Parsed in 0.0008 milliseconds
Parsed in 0.0009 milliseconds
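(For reference, a minimal sketch of the kind of timing loop that produces numbers like the above; this is not the repo's actual test code, and os.clock() measures CPU time, so on-device numbers would differ.)

-- Micro-benchmark sketch (illustrative only).
local lunajson = require("lunajson")

-- A tiny stand-in document; the real tests parse a full metadata.calibre dump.
local sample = '[{"title":"A Book","authors":["Someone"],"size":12345,"tags":[]}]'

for _ = 1, 10 do
    local start = os.clock()
    lunajson.decode(sample)
    local elapsed_ms = (os.clock() - start) * 1000
    print(string.format("Parsed in %.4f milliseconds", elapsed_ms))
end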

@NiLuJe (Member) commented Jun 2, 2024

FWIW, something based on hashmap hitchecks instead of ipairs seems ever-so-slightly faster over here, and possibly reads a tiny bit better?

-- parse "metadata.calibre" files
local lj = require("lunajson")

local array_fields = {
    authors = true,
    tags = true,
    series = true,
}

local required_fields = {
    authors = true,
    last_modified = true,
    lpath = true,
    series = true,
    series_index = true,
    size = true,
    tags = true,
    title = true,
    uuid = true,
}

local field
local t = {}
local function append(v)
    -- These *may* be arrays, so we need the extra check to confirm that startarray ran
    if array_fields[field] and t[field] then
        table.insert(t[field], v)
    else
        t[field] = v
        field = nil
    end
end

local depth = 0
local result = {}
local sax = {
    startobject = function()
        depth = depth + 1
    end,
    endobject = function()
        if depth == 1 then
            table.insert(result, t)
            t = {}
        end
        depth = depth - 1
    end,
    startarray = function()
        if array_fields[field] then
            t[field] = {}
        end
    end,
    endarray = function()
        if field then
            field = nil
        end
    end,
    key = function(s)
        if required_fields[s] then
            field = s
        end
    end,
    string = function(s)
        if field then
            append(s)
        end
    end,
    number = function(n)
        if field then
            append(n)
        end
    end,
    boolean = function(b)
        if field then
            append(b)
        end
    end,
    null = function()
        if field then
            append(nil)
        end
    end,
}

local parser = {}
function parser.parseFile(file)
    result = {}
    local p = lj.newfileparser(file, sax)
    p.run()
    field = nil
    return result
end

return parser

(hashmap)

06/02/24-07:59:30 INFO  calibre info loaded from disk (search) in 1661.000 milliseconds: 876 books 
06/02/24-07:59:30 INFO  metadata: 876 books imported from calibre in 1686.000 milliseconds 
06/02/24-07:59:30 INFO  search done in 40.000 milliseconds (series, case sensitive: false, title: true, authors: true, series: false) 

vs.

(arrays + ipairs checks)

06/02/24-07:56:05 INFO  calibre info loaded from disk (search) in 1710.000 milliseconds: 876 books 
06/02/24-07:56:05 INFO  metadata: 876 books imported from calibre in 1736.000 milliseconds 
06/02/24-07:56:05 INFO  search done in 39.000 milliseconds (series, case sensitive: false, title: true, authors: true, series: false)

@NiLuJe (Member) commented Jun 2, 2024

Eeeeeh, most of this might simply come from not calling append at all for unwanted keys...
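(For illustration, the two lookup strategies being compared, an ipairs scan versus a hashmap hit-check, look roughly like this in isolation; this is a generic sketch rather than the earlier revision's exact code, and as noted above the bigger win is that with the hit-check unwanted keys never reach append() at all.)

-- Generic sketch of the two lookup strategies, outside the SAX handlers.
local wanted_list = { "authors", "lpath", "series", "tags", "title", "uuid" }
local wanted_set = {}
for _, name in ipairs(wanted_list) do
    wanted_set[name] = true
end

-- Linear scan: O(n) comparisons per key event.
local function isWantedScan(s)
    for _, name in ipairs(wanted_list) do
        if s == name then return true end
    end
    return false
end

-- Hashmap hit-check: a single table lookup per key event.
local function isWantedHit(s)
    return wanted_set[s] ~= nil
end

assert(isWantedScan("title") and isWantedHit("title"))
assert(not isWantedScan("description") and not isWantedHit("description"))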

@NiLuJe (Member) commented Jun 2, 2024

Fixed the above only using the first value of an array; I broke it at some point during the edits ;p.

@pazos (Member, Author) commented Jun 2, 2024

Indeed, it looks much better :)

Eeeeeh, most of this might simply come from not calling append at all for unwanted keys...

Yup, but that's the best we can get :)

Is there an advantage in having 2 code paths? Would always using the sax parser make a noticeable (negative) impact on performance?

Based on the tests, yeah, we want two code paths. We also want to enforce rapidjson whenever possible. We could add a UI option to switch between implementations (keeping auto as the default), so people could change to lunajson if they get an OOM error with the default settings, or force rapidjson beyond the threshold if they know their device has enough RAM to handle their particular json files.

@pazos (Member, Author) commented Jun 2, 2024

We could add a UI option to switch between implementations

For example:
[screenshot of the proposed menu option]
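(A rough sketch of what such a sub-menu could look like in KOReader's menu-table style; the setting key "calibre_json_parser", the labels, and the use of G_reader_settings are illustrative assumptions, not the PR's final code.)

-- Illustrative sketch only: setting key and labels are made up.
local _ = require("gettext")

local choices = {
    { label = _("Automatic (size-based)"), value = "auto" },
    { label = _("Always rapidjson (fast)"), value = "rapidjson" },
    { label = _("Always lunajson (low memory)"), value = "lunajson" },
}

local sub_item_table = {}
for _i, choice in ipairs(choices) do
    table.insert(sub_item_table, {
        text = choice.label,
        checked_func = function()
            local current = G_reader_settings:readSetting("calibre_json_parser") or "auto"
            return current == choice.value
        end,
        callback = function()
            G_reader_settings:saveSetting("calibre_json_parser", choice.value)
        end,
    })
end

local menu_item = {
    text = _("Metadata parser"),
    sub_item_table = sub_item_table,
}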

@pazos (Member, Author) commented Jun 5, 2024

I can't really test the OOM scenario

You don't need to test it.
There's no OOM scenario as long as the C++ does the same thing as the Lua version.
Just massive speed gains :).

it actually looks like it's the sheer amount of k,v pairs that just balloons to stupid amounts

Actually... large string values could explain some of the overhead (e.g., the most common one being description).

Yup, too much to store the whole DOM in memory. The VM just dies.

@pazos (Member, Author) commented Jun 5, 2024

I'd rather this make it in, actually.

Although, perhaps with slightly different menu labels/config keys. I'm thinking something along the lines of:

Fast (RapidJSON via load_calibre)
Safe (LunaJSON)
Legacy (RapidJSON via load)
(And, for now, until we have more data from insane libraries, keep the size check, but perhaps default to fast in order to exercise the codepath and find potential issues with it).

The option to skip the UI and go reckless on the fast rapidjson codepath is out of the equation?

@NiLuJe (Member) commented Jun 5, 2024

The option to skip the UI and go reckless on the fast rapidjson codepath is out of the equation?

I mean, eventually, maybe? Right now, I'd keep it in just to be safe ;p.

@pazos marked this pull request as ready for review on June 5, 2024 at 18:29
@pazos (Member, Author) left a comment

Good!

Feel free to bump base and merge when ready :)

pazos and others added 9 commits on June 5, 2024 at 20:57
Co-authored-by: NiLuJe <ninuje@gmail.com>
Keep the file-size check for now, until this gets some more mileage.

Ultimately, we'll *probably* want to always use fast, but for now,
leave some options on the table in case things go kablooey ;).
@NiLuJe merged commit 79c13be into koreader:master on Jun 5, 2024. 3 of 4 checks passed.
@benoit-pierre (Contributor) commented:

The rapidjson.load_calibre variant crashes on my test data; here is the first entry:

    {
        "series_index": null,
        "size": 2520019,
        "series": null,
        "last_modified": "2023-10-12T08:41:34+00:00",
        "authors": [
            "Stephen King"
        ],
        "tags": {},
        "lpath": "Shining, The - Stephen King (2279).epub",
        "title": "The Shining"
    },

I don't know how and/or why it ended up with "tags": {} instead of "tags": [], but this causes the crash.

@pazos (Member, Author) commented Jun 6, 2024

I don't know how and/or why it ended up with "tags": {} instead of "tags": [], but this causes the crash.

Do you use the wireless client? It rewrites the json file using the current json parser, so any issue with the parser ends up in the original json file.

If that's true, then the metadata itself isn't valid anymore. During my tests with the Lua parser I went the following route to avoid this kind of issue:

  1. Delete json files (and books)
  2. Push books and json files using calibre's connect to folder
  3. Verify the metadata can be parsed and searches work ok with any field
  4. Disconnect from folder, start the wireless client
  5. Verify the client comm with the server is ok
  6. Send a few books wirelessly
  7. Verify the metadata is still correct.

@benoit-pierre (Contributor) commented:

I had previously connected to calibre with the wireless client, but I don't know if the bad data is from before or after updating to test the new code.

@benoit-pierre (Contributor) commented:

Anyway, it should not crash.

@benoit-pierre (Contributor) commented:

And crash means a segfault, not an exception in the Lua code.

@NiLuJe (Member) commented Jun 6, 2024

but may not take too kindly to malformed input files

Famous last words ;p.


Yeah, that probably unbalances the push/pop on the Lua stack, trashing the actual stack.

I'll see if I can come up with something not too awful so it doesn't horribly implode ;).

@NiLuJe (Member) commented Jun 6, 2024

@pazos: Do you remember if the frontend code can deal with potential array fields being nil (instead of an empty table)?

(i.e., should I push a nil or a {} for these broken "arrays"?).

@NiLuJe (Member) commented Jun 6, 2024

Alternatively, how does it deal with a missing required field? (as that's my other potential approach ;p).

@pazos (Member, Author) commented Jun 6, 2024

@NiLuJe: There's https://github.com/koreader/koreader/blob/master/plugins/calibre.koplugin/metadata.lua#L41-L53, which should turn nils into whatever value calibre expects.

The only values that are required are the strings, which should never be nil (calibre will fill in bogus values if it can't figure out the proper ones, e.g. last_modified: None).
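(A rough illustration of that kind of normalization, i.e. filling nil fields with values the rest of the code expects; this is a guess for illustration only, the real code at the linked lines may differ, and it assumes lua-rapidjson's rapidjson.array() and rapidjson.null helpers.)

-- Illustrative guess only; see plugins/calibre.koplugin/metadata.lua#L41-L53
-- for the real normalization.
local rapidjson = require("rapidjson")

local function normalizeBook(book)
    -- Array fields should round-trip as JSON arrays, even when empty.
    book.authors = book.authors or rapidjson.array({})
    book.tags = book.tags or rapidjson.array({})
    -- calibre emits null for books without a series (see the sample entry above).
    if book.series == nil then book.series = rapidjson.null end
    if book.series_index == nil then book.series_index = rapidjson.null end
    return book
end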

@NiLuJe (Member) commented Jun 6, 2024

I don't know how and/or why it ended up with "tags": {} instead of "tags": []

That's because, when dumping back to json, for tables that have lost their metatable array/object flag, it assumes by default that empty tables are objects, not arrays (cf. the empty_table_as_array encoding option, which defaults to false).

And the aforementioned slim method, when confronted with a missing field, sets arrays to an untagged empty table ;).

On the same subject, I'd removed those tags from the load_calibre variant, on the assumption that these could never end up being dumped back to json, but that might not have been a great idea ;).
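(A short sketch of the encoding behaviour described here, assuming the lua-rapidjson API bundled in koreader-base:)

-- Sketch: how an empty table ends up as "{}" unless it is tagged as an array
-- or the empty_table_as_array option is enabled.
local rapidjson = require("rapidjson")

print(rapidjson.encode({}))                                  --> {}
print(rapidjson.encode(rapidjson.array({})))                 --> []
print(rapidjson.encode({}, { empty_table_as_array = true })) --> []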

NiLuJe added a commit to NiLuJe/koreader-base that referenced this pull request Jun 6, 2024
In order to deal with metadata.calibre files we've mistakenly mangled in
the past ;).

Re: koreader/koreader#11922 (comment) & koreader/koreader#11922 (comment)
NiLuJe added a commit to koreader/koreader-base that referenced this pull request Jun 7, 2024
In order to deal with metadata.calibre files we've mistakenly mangled in
the past ;).

Re: koreader/koreader#11922 (comment) & koreader/koreader#11922 (comment)