New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use ORM for "trusted notebook" db ? #1782
Comments
For users with superuser access, at least on Linux, adding the option Another possible option is SQLite proxy locking; see https://github.com/aosm/SQLite/blob/master/Notes/ProxyLocking.txt for details. |
Proxy locking sounds interesting. I'm reticent to invoke sqlalchemy or another ORM as a dependency for such a small thing as nbformat, especially when our SQL demands are so incredibly light. The first thing I would look at is supporting any dbapi2 implementation, in which case all we should need is the right config for exposing the connection function and arguments. That should get us psycopg2 for postgres, mysql-python for mysql, etc.. |
That seem reasonable. It should be pretty easy to make a swapable db manager as well. If not already possible. |
I got dbapi2 just about working, but found that even the most basic things like 'AUTOINCREMENT' aren't in common with postgres, so we would need to know what the common subset is, or deal with the relative inconsistencies. That tells me that sqlalchemy might be better, but that seems like it might be a too-heavyweight dependency. |
If we do want an ORM, I believe peewee is a lighter dependency than sqlalchemy. However, abstracting over multiple SQL databases in order to store a set of hashes feels like something has gone wrong. |
@takluyver agreed. If you have a better tool for looking up a hash that supports concurrent access, fast lookup without reading the whole thing in, etc., that would be great. If only SQLite+NFS weren't such a mess, I wouldn't bother putting any energy into supporting anything other than sqlite. We also want the same thing for IPython history, where search, etc. come in. It's not actually the ORM that we want, it's just the SQL abstraction. |
It seems like a key-value database would be a better fit than SQL for the signatures, though that's less clear for history. There are several embedded k-v databases, like dbm, Berkeley DB, Kyoto Cabinet. Most seem to have copyleft licenses; UnQLite is an exception, but probably has the same issues with NFS that SQLite does: https://twitter.com/unqlite_db/status/642479750368370688 . And I have no evidence that any of the other options doesn't fall over with NFS. The only embeddable thing I've found that might work is the approach used in cdb - to add an entry, rebuild the database and rename it over the old file, relying on rename to be atomic. But our database gets big enough and changes often enough that we probably don't want to be rewriting it over NFS on every addition. So if we can't trust the filesystem, we're back to database server to maintain consistency. I still suspect it's worth looking at key-value stores over SQL databases, but I don't have the patience to wade through all the NoSQL hype and marketing to figure out which one would actually be a good fit. Did you look into SQLite proxy locking? I had a brief skim, and it sounded relevant, but I didn't go through thoroughly enough to be sure. |
(Of course, we probably need to keep an embedded option like sqlite, because I don't know of any database server we can express as a dependency, and we can't expect users to set up a separate database package) |
@takluyver sure. I'm mainly thinking of having a usable alternative for those on NFS systems. If there's an easy option that doesn't have NFS issues, we can just pick that up, but otherwise having a decent alternative seems good enough. simplekv seems like an easy option for key-value store, and works on the filesystem by default, with support for memcache, redis, etc. I don't see a mechanism for the culling to trim the entries, which is the only relational bit that we do here. I'll also peek at peewee, which seems like it might be a simple solution. |
Nice. Digging into the code of simplekv gives me a "why didn't we think of that before?" moment: its FilesystemStore creates files in a directory, named with the keys in question. That does leave a couple of concerns, though:
|
There are numerous well-documented problems on many filesystems with large numbers of files in a single directory. (Start here, for example.) The definition of "large" has been getting better over time, but it's still something to consider carefully. And unless the filesystem is B-tree-based or similar, finding a file in a directory is typically a linear search (i.e., O(n)). Fortunately, most modern filesystems are tree/hash-based by default (e.g., ext4's dir_index). |
sigh so we probably can't rely on that. Thanks for the info. We could maybe write an implementation that used a nested directory structure based on chunks of a hex digest - so instead of |
That actually isn't unreasonable. Lots of existing stuff uses that, including Git. :-) |
Culling old items seems pretty expensive without the relational part, since it seems to necessarily mean reading every file in the store (or stat if we want to use/trust mtime rather than contents). That's the only real argument I see against it right now. Pip uses the same scheme (a/b/c/d/e/f123...) for its cache directory. I don't think it's too complex to deal with. |
Considering that jupyter/nbformat#71 is done, isn't this done as well? |
Yup, depending on how you interpret this it's either done, or we've decided not to do it. Instead of using an ORM to allow different databases, we have defined the SignatureStore API, for which someone could implement a Postgres backend, or an SQLAlchemy backend, or a directory full of hashes backend, and so on. Unfortunately, we haven't found anything better to use as the default than the SQLite backend. |
To summarize, for my understanding:
A last question: Did the team have a resource problem to implement other solutions, or is the performance overhead so large with other DBs that SQLite is technically still the best solution, despite the NFS issue? If I am using notebooks on a shared Dropbox folder, and I would like the signature to stay valid on 2 machines accessing these notebooks subsequently (so not concurrently), would that work? I'm having especially the issue that sometimes I don't close the notebooks on one machine, and then I get SQLite clashes in the history file, but usually no other problems. But I'm afraid that these kinds of issues will create trouble as well, trying to use it for signatures, right? |
I haven't tried, but I think other DBs would be fine. I haven't implemented any because I don't have a need for them myself, and we can't realistically make them the default, because that would depend on a database server.
If you go to the trouble of setting up a shared database server and configuring both notebook servers to use that, it certainly should work (it probably won't at first, because it's new ;-). I don't know how Dropbox behaves with SQLite database files. |
Is a database server the only way to do this? Why couldn't the notebook read a serialized dictionary for the signature? Or is that unsafe? (Pardon my ignorance). |
A serialised dictionary written by an atexit handler would mostly work fine. The limitations are:
There are ways you could attempt to work around those issues, but before long you're reinventing a database. |
@Carreau @willingc FYI I came to this issue because it's linked from here in the last paragraph, saying that changing the DB is "currently being worked on". However this issue is closed. So I suspect the last paragraph needs to be updated to specify what the result was, or it should link to some new issue that's open. |
We are currently using SQLite, which does not like being on NFS.
It's slightly annoying for cloud deployment.
I would like to investigate/document making that plugable in order for large scale deployment to have a customisable solution.
The text was updated successfully, but these errors were encountered: