[Discussion] Metadata - How, Which & Where #1336
This issue is to discuss the specifics of how and which metadata to save.
So I think bolt should be the better choice, but I have never worked with any of the above, so I'm open to suggestions.
As for which data to save -
I think this is the bare minimum we need to save for anything useful. I'll think of a proper schema and upload it here once it is a bit more finalized, but I think some relation between remotes might be useful.
I haven't done the schema yet, but having the hash as one of the keys will probably be very useful. This means we need to make sure it is unique, and though SHA1 might be good enough, I would much rather go with SHA256. That might be a bit expensive to store and calculate, though, especially for large files (hashes calculated for verifying the remote file would still be saved, both for the cache and for verifying that the remote file hasn't changed).
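To make the cost concern about large files concrete, here is a minimal sketch of hashing in fixed-size chunks so that memory use stays constant regardless of file size. It is written in Python purely for brevity (rclone itself is written in Go); the function name `file_sha256` and the 1 MiB chunk size are illustrative assumptions, not anything taken from rclone.

```python
import hashlib
import io


def file_sha256(stream, chunk_size=1024 * 1024):
    """Return the hex SHA-256 of a binary stream, reading it in chunks."""
    digest = hashlib.sha256()
    # iter() with a sentinel keeps calling read() until it returns b"".
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Usage: any binary file object works, e.g. open(path, "rb").
example = io.BytesIO(b"hello rclone")
print(file_sha256(example))
```

The CPU cost still grows linearly with file size, which is why caching the result alongside the path is attractive.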
When storing the file path we still need to take into consideration that files might move on the local filesystem/remote #1331
We need to see how we handle dedupe files, since they are obviously going to have the same hash but might change independently in future runs.
Versioning for db/records: it might be useful to version the metadata so that if we have any future schema changes, migration won't be an issue.
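One way the versioning idea could look, as a rough Python sketch (Python only for brevity; rclone is written in Go): each record carries a version number, and a migration function upgrades old records step by step when they are read. The field names (`version`, `path`, `hash`, `size`) and the specific v1 to v2 change are hypothetical, chosen only to illustrate the mechanism.

```python
import json

CURRENT_VERSION = 2  # hypothetical schema version


def migrate(record):
    """Upgrade a record dict to CURRENT_VERSION, one step at a time."""
    record = dict(record)  # work on a copy, leave the caller's dict alone
    version = record.get("version", 1)
    if version == 1:
        # Hypothetical change: v2 added a size field; unknown until re-checked.
        record.setdefault("size", None)
        version = 2
    if version != CURRENT_VERSION:
        raise ValueError(f"unsupported record version: {version}")
    record["version"] = version
    return record


old = json.loads('{"version": 1, "path": "a/b.txt", "hash": "abc"}')
print(migrate(old))
```

Records written by a newer, unknown schema are rejected rather than silently misread.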
We need to see where to store the database file itself. The rclone config path is probably the best place, but we might want to offer a command-line flag to change the default.
There are probably some more considerations I'm forgetting.
referenced this issue
Apr 14, 2017
Btw it would not be a bad idea to keep the metadata database directly on the cloud so it could be read from multiple servers without making API calls. The only thing to consider is database locks when multiple mounts write at the same time.
Each file list is one API call, while download is not.
p.s. The database could also keep stuff like ACD hard links to files.
@ajkis Thanks for the tips.
I was initially against saving the metadata on the remotes, as having a single db as the source of truth might make things simpler. There are some advantages to having the remote metadata saved on the remote, but I'm not sure the added complexity is worth it -
I do think we can eventually implement a few modes where metadata can save API calls, but I prefer to first implement a simple version and try not to make it too complex. So for now rclone will still do the same verification as it did before, but it can rely on cached checksums and can use the metadata to know if a file has been removed between runs.
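Detecting files removed between runs could be as simple as a set difference between the paths recorded in the metadata and the paths seen in the current listing. A minimal Python sketch (Python only for brevity; the function name is hypothetical and this is not how rclone necessarily does it):

```python
def removed_since_last_run(cached_paths, listed_paths):
    """Return paths present in the cached metadata but absent from the new listing."""
    return sorted(set(cached_paths) - set(listed_paths))


previous = ["docs/a.txt", "docs/b.txt", "img/c.png"]
current = ["docs/a.txt", "img/c.png", "img/d.png"]
print(removed_since_last_run(previous, current))
```

Note that a moved file would show up here as a removal plus an addition, so this alone does not address the move case.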
After looking a bit more closely at the source, I think we'll probably want to save size as well, since it is used to check if a file has changed before calculating the hash.
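The size-before-hash ordering can be sketched as follows: compare the cheap value first and only compute the expensive one when the sizes match. This is an illustrative Python sketch, not rclone's code; the record layout (a dict with `size` and `hash` keys) and the function name are assumptions.

```python
def has_changed(cached, size, compute_hash):
    """Return True if the file differs from its cached record.

    cached is a dict with "size" and "hash" keys (a hypothetical record
    layout); compute_hash is a zero-argument callable, so the expensive
    hash is only computed when the sizes match.
    """
    if cached["size"] != size:
        return True  # size differs: changed, no need to hash
    return cached["hash"] != compute_hash()


cached = {"size": 5, "hash": "abc"}
print(has_changed(cached, 7, lambda: "abc"))  # size mismatch, hash not computed
print(has_changed(cached, 5, lambda: "abc"))  # same size and same hash
```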
@DTrace001 I think the issue you referenced relates more to limiting the fields in the metadata rclone fetches from Google, and has only a minor impact on which metadata is saved locally.
Please save all stats so nothing needs to be queried.
I think it might be worth exploring what this DB would be used for exactly.
I'd like something which would help with
My original thinking was to cache serialized versions of the
Then when rclone does a list operation, it can query the db instead of the remote as required, unserializing objects as it goes along.
[ I also have a plan to remove the root of the Fs, and have this be handled by a higher layer. This would help with this - the
The DB design above has quite a few more things in - what issues are they relevant to?
I guess we can serialize the whole Object. I thought limiting the data to the specific stuff we need would keep the db minimal and make using it simple, but serializing the Object is probably easier.
It seems that way we need each FS to handle serialization, and I would prefer something a bit more generic where all the logic, including serialization, is handled in a single module, but that might be hard since some remotes will need very specific metadata.
So if I understand you correctly, we just need a bucket per remote where Path -> Serialized object.
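A rough sketch of that layout, using an in-memory dict in place of a real key-value store and JSON in place of whatever serialization the objects would actually use. It is in Python only for brevity; the class name `MetadataCache`, its methods, and the object fields are all hypothetical.

```python
import json


class MetadataCache:
    """One bucket per remote; each bucket maps path -> serialized object."""

    def __init__(self):
        self._buckets = {}

    def put(self, remote, path, obj):
        bucket = self._buckets.setdefault(remote, {})
        bucket[path] = json.dumps(obj, sort_keys=True).encode("utf-8")

    def get(self, remote, path):
        raw = self._buckets.get(remote, {}).get(path)
        return None if raw is None else json.loads(raw)

    def list(self, remote, prefix=""):
        """Answer a listing from the cache instead of querying the remote."""
        bucket = self._buckets.get(remote, {})
        return [json.loads(bucket[p]) for p in sorted(bucket) if p.startswith(prefix)]


cache = MetadataCache()
cache.put("drive", "docs/a.txt", {"path": "docs/a.txt", "size": 5})
cache.put("drive", "img/c.png", {"path": "img/c.png", "size": 9})
print(cache.list("drive", prefix="docs/"))
```

Keeping serialization inside one class like this is the "single module" option; remote-specific fields would simply travel inside the serialized object.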