[Proposal] Non-Sequential IDs in SQLite3 Database #91
Conversation
File IDs no longer reflect the order of insertion in the database, so delete that test.
This request passes all tests as of commit 15694e6 and should work seamlessly with new and pre-existing databases. This pull request doesn't touch
Hi, random ids cannot fully resolve the problem, because a collision is still possible (at very low probability, of course) if the two databases happen to pick the same id.
@ghlpo For this use case random integers should be sufficient. Keep in mind that git commit hashes are essentially random, but we don't worry about collisions there either. Anyway, the reason I can't just use an offset on the ids for one database is that I'm not looking to merely merge databases; I'm trying to keep the same database in sync across multiple machines. In case I forget to sync but add two separate files on two separate machines, having non-colliding ids allows a seamless merge. Let me know if my intent is unclear, I tried to explain it in the "Use Case" description above, but perhaps I didn't do a good job. My hope is that this could be combined nicely with something like git-annex and be useful to someone other than just me.
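To put rough numbers on the collision risk: the standard birthday bound gives the chance of any collision among n randomly drawn ids. A back-of-envelope sketch (my own, not tmsu code):

```python
import math

def collision_probability(n_ids: int, id_bits: int = 63) -> float:
    """Birthday-bound approximation: P(collision) ~ 1 - exp(-n(n-1)/(2N)),
    where N is the size of the id space."""
    space = 2 ** id_bits
    return 1.0 - math.exp(-n_ids * (n_ids - 1) / (2 * space))

# Even a large library is comfortably safe with 63-bit random ids:
p = collision_probability(100_000)
print(f"P(collision) among 100k random 63-bit ids ~= {p:.1e}")
```

For 100,000 files this comes out around 10^-10, which supports the "we don't worry about collisions" intuition, though it never reaches zero.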
If you want to be able to merge easily, wouldn't hash-based IDs work better? For example, you have a database D (the 'original database'). Now you have two databases that you need to merge, D[A] and D[B]. Under the 'random' scheme, merging results in two "different" tags T. Under a hash-based scheme, the tags T are independently (and, AFAICS, correctly) assigned the same id -> no duplication. A separate issue:
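A hash-based scheme along these lines could derive each tag's id deterministically from its name, so two databases that independently create the same tag assign it the same id. A hypothetical sketch (`tag_id_from_name` is my own name, not tmsu's API):

```python
import hashlib

def tag_id_from_name(name: str, id_bits: int = 63) -> int:
    """Derive a stable integer id from a tag's name: same name, same id,
    on any machine, with no coordination needed."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (2 ** id_bits)

# Two databases independently creating "holiday" get the same id:
print(tag_id_from_name("holiday") == tag_id_from_name("holiday"))
print(tag_id_from_name("holiday") != tag_id_from_name("work"))
```

Renaming a tag would change its derived id under this scheme, which is one complication random ids don't have.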
@0ion9 More than anything, this pull request is intended to start a discussion about the possibility of making tmsu usable in decentralized/distributed setups. For this pull request, I opted simply for easy implementation as a proof of concept kind of thing. Really just think of this as an 'issue' with example code attached.
What qualms do you have? In what ways are you thinking that sequential keys are, or could be, usefully leveraged? As far as I can tell, the user cannot tell any difference between
You always have to do a global scan when merging anyway, as silent corruption of data is not acceptable and cannot be avoided under any id assignment scheme. (git notably 'solves' this problem by reducing the chance of a collision to 1 in (2**159), by using a 160-bit hash [sha1]. Maybe possible for file ids, but utter overkill IMO)
I don't understand this objection. File table data can always be out of date (renames, moves, modifications, as you say), but I don't see how that changes things: the file table is still tmsu's 'best guess' of what is going on, and if it's wrong, a correct merge cannot be done in any case. Can you clarify please?
There are two:
I think at the time I was a bit mixed up: I was thinking about your presumption that it was possible to do a quick (non-scanning) merge. Since this isn't true [i.e. collisions must, in any case, be handled], my objections are wholly covered in the above two points.
Perhaps you're right. My patch may work just as well by replacing the random integer for tags and values. Maybe submitting this as a pull request was the wrong option, as it gives the impression that this is intended to be some kind of optimal solution to a problem? In my previous comment, I tried to emphasize that this is simply intended as a proof-of-concept and discussion-starter. I'm hoping the discussion here is about the main idea rather than the particulars of my quick hack. Keep in mind that what I really want is for my multiple (let's say) rsynced folders to keep consistent tags on the files no matter where I tag from. I could just let rsync copy around the database, but that gives rise to problems when two separate locations add different tags, etc. Is that something others may be interested in doing? I certainly think so.
My initial feeling here is that this is somewhat of a minor quibble, but I looked into it further to get some real numbers. An empty database is about 100 kB. My database with around 1000 entries and random

That said, this is a pretty tiny database. I've already got a hundred files in it, so to grow the database to anything appreciable, say about 100 MB, I'd have to be managing something like 100,000 files in it. These seem like pretty reasonable numbers to me. And I think having cool decentralized support in tmsu would be worth a few MB on my hard drive. The whole point of this discussion, though, is to see what everyone else thinks! =)
Huh. I'm surprised you use this. Anyway, at even present
In case I haven't been clear: I'm not against supporting this, I just think the premise 'we can assign ids such that we can just

Consider the problem of merging two
The above process is actually fairly simple/efficient, IMO: examine both the tag and value tables and generate two temporary lookup tables P.id -> O.final_id. Update the final

Then update the P.file_tag tag and value ids according to the lookup tables. Then comes the slow part: update the file_tags table with all items from P.file_tags that are not already present. I believe this is a minimal description. There are other complications to be addressed, such as:
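The lookup-table step described above can be sketched with toy in-memory tables (a hypothetical layout, not tmsu's actual schema; `merge_tags` is my own helper name):

```python
# O is the original database's tag table, P the one being merged in.
# Toy id -> name maps stand in for real sqlite tables.
O_tags = {1: "holiday", 2: "work"}
P_tags = {1: "work", 2: "music"}          # "work" exists in both databases

def merge_tags(o_tags, p_tags):
    """Build the P.id -> O.final_id lookup table, adding missing tags to O."""
    by_name = {name: tid for tid, name in o_tags.items()}
    lookup, next_id = {}, max(o_tags) + 1
    for pid, name in p_tags.items():
        if name in by_name:               # same tag: reuse O's existing id
            lookup[pid] = by_name[name]
        else:                             # new tag: allocate a fresh id in O
            o_tags[next_id] = name
            lookup[pid] = next_id
            next_id += 1
    return lookup

lookup = merge_tags(O_tags, P_tags)
P_file_tags = [(10, 1), (11, 2)]          # (file_id, tag_id) rows from P
merged = [(f, lookup[t]) for f, t in P_file_tags]
print(O_tags, lookup, merged)
```

The same remapping would be repeated for the value table, and the final `merged` rows inserted into O.file_tags only where not already present.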
Doubling of size is interesting -- it suggests that SQLite's storage of small values could be more efficient, as typical tag/value/file ids should be 8-16 bit, while your generated IDs seem to be 64-bit [i.e. I would have expected a factor of more than 2]. A few more sample points:
Actually I made a mistake, sorting taggings is not dependent on ID at all, only on-disk insertion order (which is still an implementation detail of sqlite, admittedly; I don't know whether VACUUMing the database would modify this order, etc). I don't sort tags or values chronologically, although I can see some use for doing so. Anyway, the insertion-order trick can be applied here, so this is independent of ID assignment too AFAIK.
Stoked that you're thinking about this. And yeah, this current patch is super naive. Currently, I just bolted tmsu and git together by keeping an SQL dump of the database checked into a git repo, letting git handle all the merging via git-hooks, since it has solved that problem pretty well. Anyway, the idea I really have in my head is something akin to a bunch of tmsu databases that know about each other, with an example workflow something like this:
And then the tags on

Anyway, appreciate your input!
Do your git-hooks translate id numbers in file_tag in diffs to human readable values? If so, would you consider publishing them?
rqlite / raft seems like it does the right logging stuff to achieve the desired end. However, purely from a maintenance standpoint, either approach requires serious justification; I think I'd want to know what @oniony thinks about them before proceeding further down that track.
https://github.com/xelxebar/tmsu-sql-filters It's a clean and smudge filter for the SQL database dump. Since it depends on database internals, it's probably pretty fragile, though. Oh, also, just so you know, these scripts throw several things in
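For illustration, the "clean" direction of such a filter (database to text) can be sketched in a few lines. This is my own sketch using Python's `sqlite3.Connection.iterdump`, not the linked scripts, and the toy `tag` table stands in for tmsu's real schema:

```python
import sqlite3

def dump_sql(conn: sqlite3.Connection) -> str:
    """Return the database as a textual SQL dump (schema plus INSERT rows),
    suitable for git to store, diff, and merge as plain text."""
    return "\n".join(conn.iterdump())

# Demo on a throwaway in-memory database:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO tag VALUES (1, 'holiday')")
dump = dump_sql(conn)
print(dump)
```

The smudge direction would run the dump back through sqlite to reconstruct the binary database. As the comment above notes, anything depending on dump internals like this is fragile across sqlite versions.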
Definitely.
Proposal

Instead of using sequential integer `id`s (for the `file`, `tag` and `value` tables, etc.), using non-sequential (e.g. random) integers allows things like database merging.

Use Case
My intended use case is to maintain a consistent tag database across multiple machines. Currently, I just naively sync the sqlite3 db file across machines. However, this is dangerous as local database copies can get out of sync if I'm not careful to always pull the latest version before adding a file.
Ideally, if my databases get out of sync, I want to be able to seamlessly merge the changes a la git. My initial attempt at a solution was to dump the tmsu database into an SQL file and maintain that in a git repo. That way I could merge the SQL files with git and then just rebuild the database from the merged version.
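The rebuild half of that workflow can be sketched like this (a toy schema and a hypothetical `rebuild` helper, not tmsu's actual layout):

```python
import sqlite3

def rebuild(sql_text: str, db_path: str = ":memory:") -> sqlite3.Connection:
    """Execute a full SQL dump against a fresh database, recreating it
    from the git-merged text."""
    conn = sqlite3.connect(db_path)
    conn.executescript(sql_text)
    return conn

# A merged dump, as it might look after git resolves both sides:
merged_dump = """
CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO tag VALUES (42, 'holiday');
"""
conn = rebuild(merged_dump)
name = conn.execute("SELECT name FROM tag WHERE id = 42").fetchone()[0]
print(name)
```

This round trip only works cleanly when the two sides' INSERT statements don't collide on ids, which is exactly the problem the proposal addresses.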
However, since IDs are sequential, merging these SQL files causes file/tag/value `id`s to collide. Since it doesn't seem that tmsu is making use of the sequential nature of the current `id`s, changing these to random integers would allow use cases like my merging above without otherwise affecting the user-facing side.
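As a sketch of what the proposal amounts to (not the patch itself; toy schema), SQLite happily accepts an explicitly supplied random value for an `INTEGER PRIMARY KEY` instead of its own sequential assignment:

```python
import secrets
import sqlite3

def random_id(id_bits: int = 63) -> int:
    """A non-sequential row id: a random integer that fits in SQLite's
    signed 64-bit INTEGER PRIMARY KEY."""
    return secrets.randbits(id_bits)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file (id INTEGER PRIMARY KEY, path TEXT)")
fid = random_id()
conn.execute("INSERT INTO file VALUES (?, ?)", (fid, "/photos/a.jpg"))
stored = conn.execute("SELECT id FROM file").fetchone()[0]
print(stored == fid)
```

Queries and foreign-key references behave the same either way; only id assignment changes.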