calibre sax json parser #11922
Conversation
@Frenzie: what is the proper magic to exclude lunajson files from CI?
Force-pushed from bb0d7a5 to f1a7584.
@NiLuJe: thanks for the review, but I didn't expect one on foreign code :)
```diff
diff --git a/kodev b/kodev
index eb2e52d0f..cec1342de 100755
--- a/kodev
+++ b/kodev
@@ -962,7 +962,7 @@ function kodev-check() {
         exit_code=1
     fi
-    tab_detected=$(grep -P "\t" --include \*.lua --exclude={dateparser.lua,xml.lua} --recursive {reader,setupkoenv,datastorage}.lua frontend plugins spec || true)
+    tab_detected=$(grep -P "\t" --include \*.lua --exclude={dateparser.lua,lunajson.lua,xml.lua} --exclude-dir=lunajson --recursive {reader,setupkoenv,datastorage}.lua frontend plugins spec || true)
     if [ "${tab_detected}" ]; then
         echo -e "\n${ANSI_RED}Warning: tab character detected. Please use spaces."
         echo "${tab_detected}" |
```
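The key change is `--exclude-dir=lunajson`, which skips the whole vendored directory, whereas `--exclude` only filters individual file names. A minimal sketch of the difference (the temp directory layout below is invented for the demo, and GNU grep is assumed for `-P`):

```shell
#!/bin/sh
# Sketch: --exclude-dir skips an entire vendored tree, while --exclude
# only filters by file name. The layout here is made up for the demo.
tmp=$(mktemp -d)
mkdir -p "$tmp/frontend/lunajson"
printf 'local a = 1\n' > "$tmp/frontend/ok.lua"
printf '\tlocal b = 2\n' > "$tmp/frontend/lunajson/sax.lua"

# Without the exclusion, the vendored file trips the tab check:
grep -P "\t" --include \*.lua --recursive "$tmp/frontend" || true

# With --exclude-dir=lunajson, the vendored tree is ignored entirely:
grep -P "\t" --include \*.lua --exclude-dir=lunajson --recursive "$tmp/frontend" || true

rm -rf "$tmp"
```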
Force-pushed from 6c1ea14 to 1519cfe.
The implementation was broken, see https://www.mobileread.com/forums/showthread.php?p=4427793#post4427793. Now it is the proper thing, AFAICT.
Alright, sounds good.
On Fri, May 31, 2024, 20:37, Martín Fernández commented on this pull request:
In plugins/calibre.koplugin/metadata.lua
<#11922 (comment)>:
```diff
@@ -54,7 +54,7 @@ end
 -- this is the max file size we attempt to decode using json. For larger
 -- files we want to attempt to manually parse the file to avoid OOM errors
-local MAX_JSON_FILESIZE = 30 * 1000 * 1000
+local MAX_JSON_FILESIZE = 30
```
I don't know. I think I would like to keep a fixed size. Most of the bug reports I get are from mobileread users, and it seems easier for them to tell me the size of their metadata.calibre files than to attach logs.
But, more important than that:
- the current 30MB size is too big for rapidjson to handle on some devices/configs, for instance #11215 ***
- a sax parser, even in plain lua, should be faster than a dom parser in a scenario where we care about less than 1-10% of the file. ****

***: most likely a setup with few books and a huge amount of user metadata per book.
****: this holds true in my tests with a small library of ~400 books and no fancy/extra metadata per book. Even after I start a wireless connection, which strips all the unused metadata from the json file, I usually got reads in the same order of magnitude using the lunajson sax parser vs the rapidjson dom parser.
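The size-threshold scheme discussed above is easy to sketch. The names below are hypothetical (this is not KOReader's actual API); only the routing idea and the 30 MB figure come from the discussion:

```python
import os

# Sketch: route files below a size threshold to a fast DOM parser
# (rapidjson in the PR's case) and larger files to a streaming SAX
# parser to avoid OOM. Helper names are invented for illustration.
MAX_JSON_FILESIZE = 30 * 1000 * 1000  # the 30 MB figure discussed above

def pick_parser(path, max_size=MAX_JSON_FILESIZE):
    """Return which code path a loader following this scheme would take."""
    return "dom" if os.path.getsize(path) <= max_size else "sax"
```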
Will be used by koreader/koreader#11922
If some of you (wink, wink @NiLuJe) have a big on-device metadata file with a bunch of custom columns, I wouldn't mind a benchmark (or at least the number of books / size in bytes of the document). FWIW, here 150 naked books got me a … I'm assuming it could grow to +50 KiB per book. (On mobileread an advanced user said ~4000 books weigh 40MiB, so maybe 50 is the worst-case scenario.) Even then a … So it seems a nice threshold for switching to the sax parser.
Is there an advantage in having 2 code paths? Would always using the sax parser have a noticeable (negative) impact on performance?
Very good question. That's why I would like a benchmark from somebody that uses calibre heavily and doesn't use wireless transfers to push their books.
Not on a desktop with a few hundred books. Rapidjson is faster with all unneeded fields stripped and not slower with default calibre fields. No idea about real devices, since my metadata there is already stripped. Anyhow, in my quite limited tests both produce results in the same order of magnitude, but rapidjson has fewer outliers.
I don't really have any fancy columns set up, but I'll try to give you some numbers ;).
Yup, it's considerably slower, even on a fairly modest library (metadata file is 7.2MB) and a relatively speedy SoC (that was a Sage; it boosts its A7 cores fairly high, so much so that they might still be faster than the MTK's A53 in terms of single-threaded performance). Anyway:
RapidJSON: …
LunaJSON: …
In terms of UX, that translates into going from barely being able to notice the parsing to really noticing a stall ^^.
On the upside, at a very quick glance, the results look correct, so It Works(TM) ;).
Ok, thanks. That's much worse than I expected, and it doesn't seem like it is going to improve on very big libraries. I see a few things that can be improved in this PR; I'm going to take a look.
Give me a shout when you're ready for me to re-test ;).
I'm not able to improve things too much, but at least I caught a bug that would crash the program when trying to re-parse :) Also improved the tests a bit, so they measure what they're expected to measure:
All calibre fields: …
Removed fields: …
FWIW, something based on hashmap hit-checks instead of arrays + ipairs checks:

```lua
-- parse "metadata.calibre" files
local lj = require("lunajson")
local array_fields = {
    authors = true,
    tags = true,
    series = true,
}
local required_fields = {
    authors = true,
    last_modified = true,
    lpath = true,
    series = true,
    series_index = true,
    size = true,
    tags = true,
    title = true,
    uuid = true,
}
local field
local t = {}
local function append(v)
    -- These *may* be arrays, so we need the extra check to confirm that startarray ran
    if array_fields[field] and t[field] then
        table.insert(t[field], v)
    else
        t[field] = v
        field = nil
    end
end
local depth = 0
local result = {}
local sax = {
    startobject = function()
        depth = depth + 1
    end,
    endobject = function()
        if depth == 1 then
            table.insert(result, t)
            t = {}
        end
        depth = depth - 1
    end,
    startarray = function()
        if array_fields[field] then
            t[field] = {}
        end
    end,
    endarray = function()
        if field then
            field = nil
        end
    end,
    key = function(s)
        if required_fields[s] then
            field = s
        end
    end,
    string = function(s)
        if field then
            append(s)
        end
    end,
    number = function(n)
        if field then
            append(n)
        end
    end,
    boolean = function(b)
        if field then
            append(b)
        end
    end,
    null = function()
        if field then
            append(nil)
        end
    end,
}
local parser = {}
function parser.parseFile(file)
    result = {}
    local p = lj.newfileparser(file, sax)
    p.run()
    field = nil
    return result
end
return parser
```

(hashmap)
vs. (arrays + ipairs checks)
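The core of the hashmap hit-check approach is that membership in a hash table costs O(1) per key, versus an O(n) scan of a list of wanted field names. A Python sketch of the same whitelist filtering (the helper name is invented; the field list is the one from the Lua snippet above):

```python
# Sketch of the whitelist filtering above: set membership is O(1) per
# key, vs an O(n) scan of a list of wanted field names.
REQUIRED_FIELDS = {
    "authors", "last_modified", "lpath", "series",
    "series_index", "size", "tags", "title", "uuid",
}

def filter_book(entry):
    """Keep only the whitelisted calibre fields of one book entry."""
    return {k: v for k, v in entry.items() if k in REQUIRED_FIELDS}
```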
Eeeeeh, most of this might simply come from not calling …
Fixed only using the first value of an array in the above; I broke it at some point during the edits ;p.
Indeed, it looks much better :)
Yup, but that's the best we can get :)
Based on the tests, yeah, we want two code paths. We also want to enforce …
Force-pushed from 1ecba54 to 0347cb8.
You don't need to test it.
Yup, too much to store the whole DOM in memory. The VM just dies.
Is the option to skip the UI and go reckless on the fast rapidjson codepath out of the equation?
I mean, eventually, maybe? Right now, I'd keep it in just to be safe ;p.
Good!
Feel free to bump base and merge when ready :)
Co-authored-by: NiLuJe <ninuje@gmail.com>
Keep the file-size check for now, until this gets some more mileage. Ultimately, we'll *probably* want to always use fast, but for now, leave some options on the table in case things go kablooey ;).
Force-pushed from c9c45be to 23cf411.
The … entry ends up as:

```json
{
    "series_index": null,
    "size": 2520019,
    "series": null,
    "last_modified": "2023-10-12T08:41:34+00:00",
    "authors": [
        "Stephen King"
    ],
    "tags": {},
    "lpath": "Shining, The - Stephen King (2279).epub",
    "title": "The Shining"
}
```

I don't know how and/or why it ended up with `{}` for `tags`.
Do you use the wireless client? Because it rewrites the json file based on the current json parser, so any issue with the parser ends up in the original json file. If that's true, then the metadata itself isn't valid anymore. During my tests with the lua parser I went the following route to avoid these kinds of issues: …
I had previously connected to calibre with the wireless client, but I don't know if the bad data is from before or after updating to test the new code.
Anyway, it should not crash.
And crash means a segfault, not an exception in the Lua code.
Famous last words ;p. Yeah, that probably unbalances the push/pop on the Lua stack, trashing the actual stack. I'll see if I can come up with something not too awful so it doesn't horribly implode ;).
@pazos: Do you remember if the frontend code can deal with potential array fields being … (i.e., should I push a …)?
Alternatively, how does it deal with a missing required field? (As that's my other potential approach ;p.)
@NiLuJe: There's https://github.com/koreader/koreader/blob/master/plugins/calibre.koplugin/metadata.lua#L41-L53, which should turn nils into whatever value calibre expects. The only values that are required are the strings, which should never be nil (calibre will fill in bogus values if it can't figure out the proper ones, e.g. last_modified: None).
That's because, when dumping back to json, for tables that have lost their metatable array/object flag, by default it assumes empty tables are objects, not arrays (c.f. the …). And the aforementioned … On the same subject, I'd removed those tags from the load_calibre variant, on the assumption that these could never end up being dumped back to json, but that might not have been a great idea ;).
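The ambiguity is easy to see from outside Lua, too: an empty table carries no array-vs-object flag, so a serializer has to pick a default. A Python sketch of the workaround of carrying an explicit kind tag (the wrapper class and helper are invented for illustration, not anything in KOReader):

```python
import json

# Sketch: an empty Lua table has no array/object flag, so a dumper must
# guess one. Carrying an explicit kind tag removes the ambiguity.
# Names here are invented for illustration.
class EmptyContainer:
    def __init__(self, kind):
        assert kind in ("array", "object")
        self.kind = kind

def dump(value):
    """Serialize, honoring the explicit flag for empty containers."""
    if isinstance(value, EmptyContainer):
        return "[]" if value.kind == "array" else "{}"
    return json.dumps(value)
```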
In order to deal with metadata.calibre files we've mistakenly mangled in the past ;). Re: koreader/koreader#11922 (comment) & koreader/koreader#11922 (comment)
Add a proper sax parser for "calibre.metadata" files, using lunajson.
Requires koreader/koreader-base#1801
Fixes #11611
Fixes #11215
Fixes #9016