Use ORM for "trusted notebook" db ? #1782

Closed
Carreau opened this Issue Sep 22, 2016 · 20 comments

6 participants
@Carreau
Contributor

Carreau commented Sep 22, 2016

We are currently using SQLite, which does not like being on NFS.
It's slightly annoying for cloud deployment.

I would like to investigate/document making that pluggable so that large-scale deployments can have a customisable solution.

@mej

mej commented Sep 22, 2016

For users with superuser access, at least on Linux, adding the option local_lock=all to the /etc/fstab file or /sbin/mount command line should provide relief by bypassing server-side locking and doing all locking on the client. This should ONLY be done in situations where it's guaranteed with absolute certainty that only 1 client will ever write to that SQLite database at a time; otherwise, corruption will occur.
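The mount option above would appear in an fstab entry roughly like this (server name and paths are hypothetical; the warning about single-writer access applies):

```shell
# /etc/fstab -- NFS mount doing all locking client-side via local_lock=all.
# WARNING: only safe if exactly one client ever writes the SQLite file;
# with multiple writers this bypasses server locking and corrupts the db.
nfsserver:/export/home  /home  nfs  defaults,local_lock=all  0  0
```

The same option can be applied to an existing mount with `mount -o remount,local_lock=all /home`.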

Another possible option is SQLite proxy locking; see https://github.com/aosm/SQLite/blob/master/Notes/ProxyLocking.txt for details.

@minrk

Member

minrk commented Sep 23, 2016

Proxy locking sounds interesting. I'm reluctant to pull in sqlalchemy or another ORM as a dependency for such a small thing as nbformat, especially when our SQL demands are so light.

The first thing I would look at is supporting any dbapi2 implementation, in which case all we should need is the right config for exposing the connection function and arguments. That should get us psycopg2 for postgres, mysql-python for mysql, etc.
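A minimal sketch of what that config could look like (the `connection_factory` / `connection_args` names are hypothetical, not actual notebook config options): any DB-API 2.0 `connect` callable plus its arguments, with sqlite3 as the stand-in here.

```python
import sqlite3

# Hypothetical config: any DB-API 2.0 connect callable and its arguments.
# Swapping backends would mean e.g. psycopg2.connect with a DSN instead.
connection_factory = sqlite3.connect
connection_args = (":memory:",)
connection_kwargs = {}

db = connection_factory(*connection_args, **connection_kwargs)
cursor = db.cursor()
cursor.execute("CREATE TABLE signatures (signature TEXT PRIMARY KEY)")
cursor.execute("INSERT INTO signatures VALUES (?)", ("abc123",))
db.commit()
```

One wrinkle even at this level: DB-API 2.0 does not fix the parameter placeholder, so the `?` above would be `%s` under psycopg2.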

@Carreau

Contributor

Carreau commented Sep 23, 2016

That seems reasonable. It should be pretty easy to make a swappable db manager as well, if that's not already possible.

@minrk

Member

minrk commented Sep 26, 2016

I got dbapi2 just about working, but found that even the most basic things like 'AUTOINCREMENT' aren't common to SQLite and postgres, so we would need to know what the common subset is, or deal with the inconsistencies ourselves. That suggests sqlalchemy might be a better fit, but it seems too heavyweight a dependency.
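To illustrate the divergence: the SQLite DDL below is valid, but PostgreSQL rejects `AUTOINCREMENT` and spells the same idea `id SERIAL PRIMARY KEY` (the table name here is illustrative, not necessarily what nbformat uses).

```python
import sqlite3

db = sqlite3.connect(":memory:")
# SQLite dialect; PostgreSQL would need "id SERIAL PRIMARY KEY" instead.
db.execute(
    "CREATE TABLE nbsignatures ("
    "  id INTEGER PRIMARY KEY AUTOINCREMENT,"
    "  signature TEXT"
    ")"
)
db.execute("INSERT INTO nbsignatures (signature) VALUES (?)", ("deadbeef",))
row = db.execute("SELECT id, signature FROM nbsignatures").fetchone()
# The id column was assigned automatically by SQLite.
```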

jupyter/nbformat#50

@takluyver

Member

takluyver commented Sep 26, 2016

If we do want an ORM, I believe peewee is a lighter dependency than sqlalchemy. However, abstracting over multiple SQL databases in order to store a set of hashes feels like something has gone wrong.

@minrk

Member

minrk commented Sep 26, 2016

@takluyver agreed. If you have a better tool for looking up a hash that supports concurrent access, fast lookup without reading the whole thing in, etc., that would be great. If only SQLite+NFS weren't such a mess, I wouldn't bother putting any energy into supporting anything other than sqlite.

We also want the same thing for IPython history, where search, etc. come in. It's not actually the ORM that we want, it's just the SQL abstraction.

@takluyver

Member

takluyver commented Sep 26, 2016

It seems like a key-value database would be a better fit than SQL for the signatures, though that's less clear for history. There are several embedded k-v databases, like dbm, Berkeley DB, Kyoto Cabinet. Most seem to have copyleft licenses; UnQLite is an exception, but probably has the same issues with NFS that SQLite does: https://twitter.com/unqlite_db/status/642479750368370688 . And I have no evidence that any of the other options doesn't fall over with NFS.

The only embeddable thing I've found that might work is the approach used in cdb - to add an entry, rebuild the database and rename it over the old file, relying on rename to be atomic. But our database gets big enough and changes often enough that we probably don't want to be rewriting it over NFS on every addition.
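The rewrite-and-rename idea can be sketched in a few lines (`rewrite_store` is a hypothetical helper; note that rename atomicity is itself a filesystem guarantee we would be trusting, which is the thread's whole concern on NFS):

```python
import json
import os
import tempfile

def rewrite_store(path, entries):
    """Rebuild the whole store in a temp file, then rename it over the old one.

    The cdb approach: readers always see either the complete old file or the
    complete new file, never a partial write -- at the cost of rewriting
    everything on each addition.
    """
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(entries, f)
        os.replace(tmp, path)  # atomic rename on POSIX filesystems
    except BaseException:
        os.remove(tmp)  # clean up the temp file if anything failed
        raise
```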

So if we can't trust the filesystem, we're back to database server to maintain consistency. I still suspect it's worth looking at key-value stores over SQL databases, but I don't have the patience to wade through all the NoSQL hype and marketing to figure out which one would actually be a good fit.

Did you look into SQLite proxy locking? I had a brief skim, and it sounded relevant, but I didn't go through thoroughly enough to be sure.

@takluyver

Member

takluyver commented Sep 26, 2016

(Of course, we probably need to keep an embedded option like sqlite, because I don't know of any database server we can express as a dependency, and we can't expect users to set up a separate database package)

@minrk

Member

minrk commented Oct 7, 2016

@takluyver sure. I'm mainly thinking of having a usable alternative for those on NFS systems. If there's an easy option that doesn't have NFS issues, we can just pick that up, but otherwise having a decent alternative seems good enough.

simplekv seems like an easy option for a key-value store, and works on the filesystem by default, with support for memcache, redis, etc. I don't see a mechanism for culling to trim old entries, which is the only relational bit we do here.

I'll also peek at peewee, which seems like it might be a simple solution.

@takluyver

Member

takluyver commented Oct 7, 2016

Nice. Digging into the code of simplekv gives me a "why didn't we think of that before?" moment: its FilesystemStore creates files in a directory, named with the keys in question.

That does leave a couple of concerns, though:

  • Do we need to work on any filesystems which may have small limits on filename length?
  • Can all filesystems quickly check for the existence of a file in a directory containing thousands of files?
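The file-per-key idea is small enough to sketch directly; `FileSignatureStore` below is a hypothetical illustration of what simplekv's FilesystemStore does, not its actual API: each stored signature is an empty file named by its hex digest, so lookup is just an existence check.

```python
import os

class FileSignatureStore:
    """Hypothetical sketch: one empty file per signature, named by digest."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, digest):
        return os.path.join(self.directory, digest)

    def store(self, digest):
        # "touch" the file; its existence is the stored signature
        open(self._path(digest), "a").close()

    def check(self, digest):
        return os.path.exists(self._path(digest))

    def remove(self, digest):
        try:
            os.remove(self._path(digest))
        except FileNotFoundError:
            pass
```
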
@mej

mej commented Oct 7, 2016

There are numerous well-documented problems on many filesystems with large numbers of files in a single directory. (Start here, for example.) The definition of "large" has been getting better over time, but it's still something to consider carefully. And unless the filesystem is B-tree-based or similar, finding a file in a directory is typically a linear search (i.e., O(n)). Fortunately, most modern filesystems are tree/hash-based by default (e.g., ext4's dir_index).

@takluyver

Member

takluyver commented Oct 7, 2016

sigh so we probably can't rely on that. Thanks for the info.

We could maybe write an implementation that used a nested directory structure based on chunks of a hex digest - so instead of a49cd3... you might have a4/9c/d3.... But that starts to sound complex and fiddly.
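The path computation for that nesting is only a couple of lines (`fanout_path` is a hypothetical helper name):

```python
import os

def fanout_path(root, digest, depth=3, width=2):
    """Split a hex digest into nested directories: a49cd3... -> a4/9c/d3/...

    With width=2, no directory level holds more than 256 subdirectories,
    keeping each directory small enough for linear-scan filesystems.
    """
    parts = [digest[i * width:(i + 1) * width] for i in range(depth)]
    return os.path.join(root, *parts, digest[depth * width:])
```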

@mej

mej commented Oct 7, 2016

That actually isn't unreasonable. Lots of existing stuff uses that, including Git. :-)

@minrk

Member

minrk commented Oct 9, 2016

Culling old items seems pretty expensive without the relational part, since it seems to necessarily mean reading every file in the store (or stat if we want to use/trust mtime rather than contents). That's the only real argument I see against it right now.
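Concretely, mtime-based culling over a file-per-signature directory would look something like this (`cull_old_signatures` is a hypothetical helper; the point is that every entry must be stat-ed, since there is no `DELETE ... WHERE` to lean on):

```python
import os
import time

def cull_old_signatures(directory, max_age_days=30):
    """Remove signature files whose mtime is older than the cutoff.

    Without the relational part, culling means a stat() per entry --
    the cost noted above -- rather than a single indexed DELETE.
    """
    cutoff = time.time() - max_age_days * 86400
    removed = 0
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.stat(path).st_mtime < cutoff:
            os.remove(path)
            removed += 1
    return removed
```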

Pip uses the same scheme (a/b/c/d/e/f123...) for its cache directory. I don't think it's too complex to deal with.

@michaelaye

michaelaye commented Aug 15, 2017

Considering that jupyter/nbformat#71 is done, isn't this done as well?

@takluyver

Member

takluyver commented Aug 16, 2017

Yup, depending on how you interpret this it's either done, or we've decided not to do it. Instead of using an ORM to allow different databases, we have defined the SignatureStore API, for which someone could implement a Postgres backend, or an SQLAlchemy backend, or a directory full of hashes backend, and so on.

Unfortunately, we haven't found anything better to use as the default than the SQLite backend.
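A custom backend along those lines might look like the in-memory sketch below. The method names follow the store/check/remove interface added in jupyter/nbformat#71 as I understand it; treat the exact signatures as an assumption and check the real base class in nbformat.sign before implementing one.

```python
class MemorySignatureStore:
    """Sketch of a SignatureStore-style backend kept in memory.

    A real backend (Postgres, a directory of hashes, ...) would implement
    the same methods against durable storage.
    """

    def __init__(self):
        self.data = set()

    def store_signature(self, digest, algorithm):
        self.data.add((algorithm, digest))

    def check_signature(self, digest, algorithm):
        return (algorithm, digest) in self.data

    def remove_signature(self, digest, algorithm):
        self.data.discard((algorithm, digest))

    def close(self):
        self.data.clear()
```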

@takluyver takluyver closed this Aug 16, 2017

@takluyver takluyver modified the milestones: Not Notebook, Backlog Aug 16, 2017

@michaelaye

michaelaye commented Aug 16, 2017

To summarize, for my understanding:

  • There is a SignatureStore API
  • Any DB-like mechanism could be used to implement a backend that manages the signatures
  • The default SignatureStore implementation uses SQLite, which has trouble over NFS

A last question: Did the team have a resource problem to implement other solutions, or is the performance overhead so large with other DBs that SQLite is technically still the best solution, despite the NFS issue?

If I am using notebooks in a shared Dropbox folder, and I would like the signatures to stay valid on two machines accessing these notebooks sequentially (so not concurrently), would that work? In particular, I sometimes forget to close the notebooks on one machine, and then I get SQLite clashes in the history file, though usually no other problems. I'm afraid these kinds of issues will cause trouble for signatures as well, right?

@takluyver

Member

takluyver commented Aug 16, 2017

A last question: Did the team have a resource problem to implement other solutions, or is the performance overhead so large with other DBs that SQLite is technically still the best solution, despite the NFS issue?

I haven't tried, but I think other DBs would be fine. I haven't implemented any because I don't have a need for them myself, and we can't realistically make them the default, because that would depend on a database server.

If I am using notebooks on a shared Dropbox folder, and I would like the signature to stay valid on 2 machines accessing these notebooks subsequently (so not concurrently), would that work?

If you go to the trouble of setting up a shared database server and configuring both notebook servers to use that, it certainly should work (it probably won't at first, because it's new ;-). I don't know how Dropbox behaves with SQLite database files.

@michaelaye

michaelaye commented Aug 16, 2017

Is a database server the only way to do this? Why couldn't the notebook read a serialized dictionary for the signature? Or is that unsafe? (Pardon my ignorance).

@takluyver

Member

takluyver commented Aug 16, 2017

A serialised dictionary written by an atexit handler would mostly work fine. The limitations are:

  • If the notebook server crashes, or you terminate it with a signal it doesn't handle, any signatures it has added won't be written.
  • If you're running more than one notebook server at once, the last one you shut down will overwrite the signatures added by the other one.

There are ways you could attempt to work around those issues, but before long you're reinventing a database.
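The serialised-dictionary approach is a few lines of Python (names here are hypothetical, not notebook code), and the comment marks exactly where the first limitation bites:

```python
import atexit
import json

SIGNATURES = {}  # e.g. hex digest -> algorithm

def save_signatures(path="signatures.json"):
    # Runs only at clean interpreter exit: a crash or an unhandled
    # signal loses every signature added during this session.
    with open(path, "w") as f:
        json.dump(SIGNATURES, f)

atexit.register(save_signatures)
```

The second limitation is visible too: each process holds its own `SIGNATURES` dict, so with two servers the last one to exit overwrites the other's additions.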
