Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Table persistence #7120

Merged
merged 3 commits into from Jan 13, 2021
Merged

Table persistence #7120

merged 3 commits into from Jan 13, 2021

Conversation

pazos
Copy link
Member

@pazos pazos commented Jan 8, 2021

  • table dumps is what we're using now to get table persistence. It is pretty printed, uncompressed.
  • binary dumps using bitser is needed for tables where human readibility doesn't matter. It is faster encoding, decoding and produces smaller cache files.

A test to show the difference in encoding/decoding speed and the result cache size

encoding table 1: bitser: 1161KB [84ms]| dump: 9118KB [429ms](7.85x size)
encoding table 2: bitser: 2332KB [179ms]| dump: 18259KB [975ms](7.83x size)
encoding table 3: bitser: 3524KB [291ms]| dump: 27439KB [1442ms](7.79x size)
encoding table 4: bitser: 4735KB [377ms]| dump: 36658KB [2092ms](7.74x size)
encoding table 5: bitser: 5946KB [503ms]| dump: 45876KB [2531ms](7.72x size)
encoding table 6: bitser: 7157KB [592ms]| dump: 55095KB [3163ms](7.70x size)
encoding table 7: bitser: 8368KB [696ms]| dump: 64314KB [3715ms](7.69x size)
encoding table 8: bitser: 9579KB [813ms]| dump: 73533KB [4799ms](7.68x size)
decoding table 1: bitser: [38ms]| dump: [96ms]
decoding table 2: bitser: [65ms]| dump: [221ms]
decoding table 3: bitser: [103ms]| dump: [748ms]
decoding table 4: bitser: [132ms]| dump: [488ms]
decoding table 5: bitser: [311ms]| dump: [773ms]
decoding table 6: bitser: [239ms]| dump: [894ms]
decoding table 7: bitser: [401ms]| dump: [1049ms]
decoding table 8: bitser: [373ms]| dump: [1154ms]

This change is Reviewable

@pazos
Copy link
Member Author

pazos commented Jan 8, 2021

In any case the idea is implement loadFile() and saveFile(t) in all objects inherited from BaseCache and keep BaseCache encoding/decoding agnostic, while implementing there common methods for all cache files.

I've implemented a few just to showcase: exists(), size().
timestamp() is left unimplemented for now but makes sense to me. Any other suggestion?

@pazos pazos marked this pull request as draft January 8, 2021 16:16
@poire-z
Copy link
Contributor

poire-z commented Jan 8, 2021

Looks alright, no real comment.
May be about the naming of the containing dir: frontend/cache/* - which I guess won't cause any issue/doubt with Lua with the fact we also have frontend/cache.lua - but in python I would have worried a bit :)
Might still be a bit confusing to a human when reading local mycache = require("frontend/cache"), dunno.

@Frenzie
Copy link
Member

Frenzie commented Jan 8, 2021

Looks sane at a glance.

@pazos
Copy link
Member Author

pazos commented Jan 9, 2021

Looks alright, no real comment.
May be about the naming of the containing dir: frontend/cache/* - which I guess won't cause any issue/doubt with Lua with the fact we also have frontend/cache.lua - but in python I would have worried a bit :)
Might still be a bit confusing to a human when reading local mycache = require("frontend/cache"), dunno.

I think cache is not a good name after all. This is non-atomic, all in one, table persistence on disk.
The reason for the folder is: don't pollute frontend/ with a few 30 lines scripts that implement an specific method of table persistance.

Played a bit with the code yesterday to see how to handle bytecode dumps and ended up with a single file that does what we want.

local bitser = require("bitser")
local dump = require("dump")

local codecs = {
    -- bitser: binary form, fast encode/decode, low size. Not human readable.
    bitser = {
        id = "Bitser serializer",
        serialize = function(t, file)
            print(t, file)
            local ok, str = pcall(bitser.dumps, t)
            if not ok then
                return nil, "cannot serialize " .. tostring(t)
            end
            return str
        end,
        deserialize = function(file)
            local f, err, str
            f, err = io.open(file, "rb")
            if not f then
                return nil, err
            end
            str, err = f:read("*a")
            f:close()
            if not str then
                return nil, err
            end
            local ok, t = pcall(bitser.loads, str)
            if not ok then
                return nil, "malformed serialized data"
            end
            return t
        end,
    },
    -- dump: human readable, pretty printed, fast enough for most user cases.
    dump = {
        id = "Dump serializer",
        -- by default we store plain text, but we can store lua bytecode instead
        serialize = function(t, file, as_bytecode)
            local content, err = dump(t)
            if not content then
                return nil, string.format("cannot serialize table %s: %s", t, err)
            end
            local str
            if as_bytecode then
                str, err =  load("return " .. content)
                if not str then
                    print("cannot convert table to bytecode: %s, ignoring", err)
                else
                    print(string.format("file %s will contain bytecode", file))
                    str = string.dump(str, true)
                end
            end

            if not str then
                print(string.format("file %s will contain plain text", file))
                str = "return " .. content
            end
            return str
        end,
        deserialize = function(file)
            local ok, t, err = pcall(dofile, file)
            if not ok then
                return nil, err
            end
            return t
        end,
    }
}

local Persistence = {}

function Persistence:new(o)
    o = o or {}
    setmetatable(o, self)
    self.__index = self
    return o:init(o.path, o.codec)
end

function Persistence:init(path, codec)
    if type(path) ~= "string" then
        return nil, "path is required"
    end
    self.path = path
    self.codec = codec or "dump"
    return self
end

function Persistence:stats()
    local lfs = require("lfs")
    local size = lfs.attributes(self.path, "size") or -1
    local exists = size ~= -1
    local timestamp = "NYI"
    return self.path, self.codec.id, exists, size, timestamp
end

function Persistence:loadFile()
    return codecs[self.codec]:deserialize(self.path) or {}
end

function Persistence:saveFile(t, as_bytecode)
    local key = self.codec
    local str = codecs[key].serialize(t, self.path, as_bytecode)
    self:writeFile(str)
end

function Persistence:writeFile(str)
    local f, err = io.open(self.path, "wb")
    if not f then
        return nil, err
    end
    f:write(str)
    f:close()
    return true
end


return Persistence

Still sounds a bit weak, because I would prefeer to name the file "cache" than "persistance". Anyhow, that's how it works:

-- binary persistance
local bin_cache = require("persistance"):new{
    path = "myfile.dat",
    codec = "bitser",
}

bin_cache:saveFile(table)
print(bin_cache:stats())
local t = bin_cache:loadFile()

-- plain text persistance
local plain_cache = require("persistance"):new{
    path = "otherfile.what",
}

-- write plain text
plain_cache:saveFile(table)

-- try bytecode first, it will fallback to plain text if it fails
plain_cache:saveFile(table, true)

@NiLuJe
Copy link
Member

NiLuJe commented Jan 9, 2021

As far as naming goes, something along ser/deser would be fine with me (e.g., I kinda like the name of rust's https://github.com/serde-rs/serde).

And I'm strongly refraining from the obvious "desser(t)" pun ;p.

@poire-z
Copy link
Contributor

poire-z commented Jan 9, 2021

I like this one-file module.
No better idea about the name. Just ser/deser is a bit too vague, as this module would serialize to a file on disk (and not to some arbitrary string we could use elsewhere, that ser/deser could imply).
(Python names modules doing that like pickle or shelve - can't think of any funny name like that atm.)

@Frenzie
Copy link
Member

Frenzie commented Jan 9, 2021

ferment
brine
lactic acid bacteria

@Frenzie
Copy link
Member

Frenzie commented Jan 9, 2021

Here's the function's new mascot:
lactid-acid

@poire-z
Copy link
Contributor

poire-z commented Jan 9, 2021

Or a smaller persist.lua ?
Other words given by the various synonym engines that I used so may times when looking for words for variables :) :
endure, perdure, keep, linger, stick

@NiLuJe
Copy link
Member

NiLuJe commented Jan 9, 2021

Well, here's my shitty skeuomorphic take: trestle, because it can be used to prop tables up and down and move them around?

@Frenzie
Copy link
Member

Frenzie commented Jan 9, 2021

lacticAcidify()    -- ferment
sulfurDioxidize()  -- inhibit fermentation

@pazos
Copy link
Member Author

pazos commented Jan 9, 2021

I see it's going to be hard to reach name consensus 😄

I like persist.lua the most but I would be ok with lacticAcidBacterium.lua as long as we keep the mascot.

end
end

function Persist:writeFile(str)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this one be public ?
If not, either rename it to Persist:_writeFile() so we know it is not - or inline it in :saveFile() )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is a legacy workaround while I played with bytecode. No longer needed. I'm going to inline it in saveFile

@pazos pazos marked this pull request as ready for review January 10, 2021 12:53
@pazos
Copy link
Member Author

pazos commented Jan 10, 2021

Tried to move frontend/readhistory.lua to persist but that doesn't work nice. No idea why as I didn't played with it. I will leave that for the future.

Both calibre and the font list work as fine as before, but now we're able to load a table with the metadata of hundred of thousands of books in a few hundred milliseconds.

@poire-z
Copy link
Contributor

poire-z commented Jan 10, 2021

(Beware with the settings and history files, they have some security features to avoid losing them, like "rename previous to .old and use it if current is unreadable" and ffiutil.fsyncOpenedFile() - so, I think it's better to leave them untouched :) (Unless you want to add these security features as options to Persist :) but yes, later.)

frontend/fontlist.lua Outdated Show resolved Hide resolved

local t, err = cache:loadFile()
if not t then
print("error loading from cache", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logger.info(cache.path, err, ": initializing new one") or something like that

frontend/persist.lua Outdated Show resolved Hide resolved
frontend/persist.lua Outdated Show resolved Hide resolved
frontend/persist.lua Outdated Show resolved Hide resolved
Copy link
Member

@Frenzie Frenzie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing to add to @poire-z's comments right now.

frontend/fontlist.lua Outdated Show resolved Hide resolved
frontend/persist.lua Outdated Show resolved Hide resolved
frontend/persist.lua Outdated Show resolved Hide resolved
@pazos
Copy link
Member Author

pazos commented Jan 11, 2021

@poire-z: I think I addressed all your comments.

Except this one:

No better idea about the name. Just ser/deser is a bit too vague, as this module would serialize to a file on disk (and not to some arbitrary string we could use elsewhere, that ser/deser could imply).

Does make sense to have functions for table to string serialize and string to table deserialize?

a la Persist:serialize(t, codec) and Persist:deserialize(codec)

Or maybe a single function that returns the table with a "codec" and doesn't need to create a new object, to be used like:

local ser = Persist:getCodec("bitser")
local str = ser.serialize(t)
local nt = ser.deserialize(str)

or

local binarifier = Persist:getCodec("bitser").serialize
print(binarifier(t))

str, err = load("return " .. content)
if not str then
local bytecode, err = load("return " .. dump(t))
if not bytecode then
print("cannot convert table to bytecode: %s, ignoring", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logger.warn ? or is that just fine because it will never happen?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will never happen in the context of fontlist but it will happen if the table to serialize has + 65536 nested tables.
Done 👍

@poire-z
Copy link
Contributor

poire-z commented Jan 11, 2021

Except this one:

No better idea about the name

This one was solved by naming it persist.lua :) I wasn't asking for a serializer/deserializer. But if you're at it:

Or maybe a single function that returns the table with a "codec" and doesn't need to create a new object

Yes, that's just perfect, and allows flexibility for the caller.

@pazos
Copy link
Member Author

pazos commented Jan 12, 2021

It turns out that the dump "deserializer" was pretty hardcoded to be used from a file, as it was using dofile. Now the behaviour is try the string as a file, with loadfile. If that fails fallback to loadstring.

@poire-z
Copy link
Contributor

poire-z commented Jan 13, 2021

Is this ready ? and not to risky for 2021.01 ?
If not, and if I bump crengine for 2021.01, I'll be bringing base/bitset with, is that ok?

@pazos
Copy link
Member Author

pazos commented Jan 13, 2021

Is this ready?

Yeah (as long as it passes your reviews)

and not to risky for 2021.01 ?

Nope, but I would like to merge it at least a couple of nightlies before release.

If not, and if I bump crengine for 2021.01, I'll be bringing base/bitset with, is that ok?

It is safe to bump base even without merging this. base/ffi/bitser.lua will be unused.

Copy link
Contributor

@poire-z poire-z left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No more comments, looks fine at the API level.
(I must say I skipped a bit the local ok, str = and local t, err = checks and flows - not fun to get into :) so trusting you on this.)

@Frenzie Frenzie added this to the 2021.01 milestone Jan 13, 2021
@Frenzie Frenzie changed the title [RFC]: table persistence Table persistence Jan 13, 2021
@Frenzie Frenzie merged commit 1148e46 into koreader:master Jan 13, 2021
@Frenzie
Copy link
Member

Frenzie commented Jan 13, 2021

Looks sane enough at a glance, so merging to get in a few days of testing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants