[Discussion] Metadata - How, Which & Where #1336

Open
yonjah opened this Issue Apr 14, 2017 · 13 comments

Contributor

yonjah commented Apr 14, 2017

This issue is to discuss the specifics of how, and which, metadata to save.
In a previous issue (#949) @ncw mentioned the use of BoltDB, and from what I could find looking at embedded databases, Bolt seems like the best option. Since we are looking for a pure Go implementation there are not many other options: goleveldb and tiedot would probably work, but leveldb feels like it would require more bootstrapping, and tiedot doesn't seem as mature (both also appear to be a lot less popular than Bolt).

So I think Bolt is the better choice, but I have never worked with any of the above, so I'm open to suggestions.
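For concreteness, here is a minimal sketch of what a Bolt-backed metadata store could look like. The db path, the "files" bucket name, and the record layout are all hypothetical, not a settled design:

```go
package main

import (
	"log"
	"time"

	"github.com/boltdb/bolt"
)

func main() {
	// Open (or create) the metadata db; the path is a placeholder.
	db, err := bolt.Open("metadata.db", 0600, &bolt.Options{Timeout: time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Store one record keyed by absolute path in a "files" bucket.
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("files"))
		if err != nil {
			return err
		}
		return b.Put([]byte("/data/example.bin"),
			[]byte(`{"hash":"<sha1>","modtime":"2017-04-14T00:00:00Z"}`))
	})
	if err != nil {
		log.Fatal(err)
	}

	// Read it back; Get returns nil if the key is absent.
	err = db.View(func(tx *bolt.Tx) error {
		v := tx.Bucket([]byte("files")).Get([]byte("/data/example.bin"))
		log.Printf("record: %s", v)
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```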

As for which data to save:

  • hash
  • modification time
  • file name (absolute path)

I think this is the bare minimum we need to save for anything useful. I'll think of a proper schema and upload it here once I have it a bit more finalized, but I think some relation between remotes might be useful.

I haven't done the schema yet, but having the hash as one of the keys will probably be very useful. This means we need to make sure it is unique, and though SHA1 might be good enough I would much rather go with SHA256, though that might be a bit expensive to store and calculate, especially for large files (hashes calculated for verifying the remote file would still be saved, both as a cache and to verify that the remote file hasn't changed).
If SHA256 is much less common than SHA1 among the remotes, it might be worth using SHA1 and accepting the almost non-existent risk of collision.
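On the cost question: both digests can be computed in a single read pass, so SHA256's extra cost over SHA1 is CPU only, not extra I/O. A small sketch (the file name is just an example):

```go
package main

import (
	"crypto/sha1"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"log"
	"os"
)

func main() {
	f, err := os.Open("100M.file") // any local file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	h1 := sha1.New()
	h256 := sha256.New()
	// One read pass feeds both hashes via MultiWriter.
	if _, err := io.Copy(io.MultiWriter(h1, h256), f); err != nil {
		log.Fatal(err)
	}
	fmt.Println("sha1:  ", hex.EncodeToString(h1.Sum(nil)))   // 20 bytes to store
	fmt.Println("sha256:", hex.EncodeToString(h256.Sum(nil))) // 32 bytes to store
}
```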

When storing the file path we still need to take into consideration that files might move on the local filesystem/remote (#1331).

We need to decide how to handle deduped files, since they are obviously going to have the same hash but might change independently in future runs.

Versioning for the db/records: it might be useful to version the metadata, so that if we have any future schema changes migration won't be an issue. One way of doing this is sketched below.
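A minimal sketch of such versioning, assuming a Bolt db as above; the "meta" bucket and "version" key names are hypothetical:

```go
package meta

import "github.com/boltdb/bolt"

// schemaVersion is bumped whenever the record layout changes.
const schemaVersion = 1

// checkSchema stamps a fresh db with the current version, and is the
// hook where migrations for older dbs would run.
func checkSchema(db *bolt.DB) error {
	return db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("meta"))
		if err != nil {
			return err
		}
		v := b.Get([]byte("version"))
		if v == nil {
			// Fresh db: stamp the current version.
			return b.Put([]byte("version"), []byte{schemaVersion})
		}
		if v[0] < schemaVersion {
			// Future schema changes: migrate records here, then restamp.
			return b.Put([]byte("version"), []byte{schemaVersion})
		}
		return nil
	})
}
```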

We need to decide where to store the database file itself. The rclone config path is probably the best place, but we might want to offer a command line flag to change the default (see the sketch below).
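A sketch of what such a flag could look like using the standard flag package; the --metadata-db name and the default location are made up for illustration, not existing rclone flags:

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Default to a file near the rclone config; the exact location
	// shown here is illustrative only.
	home, err := os.UserHomeDir()
	if err != nil {
		home = "."
	}
	def := filepath.Join(home, ".rclone", "metadata.db")
	dbPath := flag.String("metadata-db", def, "path to the metadata database")
	flag.Parse()
	fmt.Println("using metadata db at", *dbPath)
}
```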

There are probably some more considerations I'm forgetting.


ajkis commented Apr 14, 2017

Did you check how/what https://github.com/astrada/google-drive-ocamlfuse is storing, since reports are that it works the best?

P.S. Someone who is using it could upload the sqlite db that is being used.


ajkis commented Apr 15, 2017

Btw, it would not be a bad idea to keep the metadata database directly on the cloud, so it could be read from multiple servers without making API calls. The only thing to consider is database locks when multiple mounts write at the same time.

Each file listing is one API call, while a download is not.

P.S. The database could also keep stuff like ACD hard links to files.


Contributor

yonjah commented Apr 15, 2017

@ajkis Thanks for the tips.
I looked into google-drive-ocamlfuse and they save a lot of metadata. I don't think we need so much info, but we'll see how it goes; if the db footprint is not very big, it might be worth adding some more stuff in the future.
One thing that might be useful is the remote file id, since I assume this can save a request when verifying the file, but that might be different for each remote.

I was initially against saving the metadata on the remotes, as having a single db as the source of truth might make things simpler. There are some advantages to having the remote's metadata saved on the remote, but I'm not sure the added complexity is worth it.

Advantages:

  1. easy to know when the remote has changed from a different machine
  2. no need to query the remote for individual changes

Complexity:

  1. We need to work on a local version of the db even for a remote, so downloading and updating it is necessary
  2. We need to keep a list of deleted files, whereas with a single local db we just delete the record as we remove the file from the local/remote
  3. We might actually need to verify that the data on the remote (adv. 2) matches the metadata, since updates are not necessarily done by rclone, or the metadata might fail to update (if there is some connection error in the middle of an rclone run and it exits)

I do think we can eventually implement a few modes where metadata can save API calls, but I prefer to first implement a simple version and try not to make it too complex. So for now rclone will still do the same verification as it did before, but it can rely on cached checksums and can use the metadata to know if a file has been removed between runs, as in the sketch below.
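A sketch of that removed-file check against a Bolt db, assuming a "files" bucket keyed by path as in the earlier sketch; the listing parameter stands in for whatever rclone's list operation returns:

```go
package meta

import "github.com/boltdb/bolt"

// removedSince returns paths present in the db's "files" bucket but
// missing from a fresh listing, i.e. files deleted between runs.
func removedSince(db *bolt.DB, listing map[string]bool) ([]string, error) {
	var removed []string
	err := db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("files"))
		if b == nil {
			return nil // nothing cached yet
		}
		return b.ForEach(func(k, v []byte) error {
			if !listing[string(k)] {
				removed = append(removed, string(k))
			}
			return nil
		})
	})
	return removed, err
}
```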


ajkis commented Apr 15, 2017

Yeah, I agree. The best way at the beginning would be to make a local version with minimal data in the DB, and then add more if there are some major benefits.


Contributor

yonjah commented Apr 16, 2017

So I was looking into the actual schema, and I think the following should be enough:

[metadata schema diagram]

This design should allow us to find files by remote/path or by hash.
In theory the mod time could be saved per file rather than per remote, but since some remotes don't support changing times it might cause issues.
The hash is saved multiple times. This is not a major issue with SHA1, but with SHA256 or anything that takes more space we might benefit from having the hash as a separate index and making references using generated ids (see the sketch below).
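A sketch of that interning idea: each distinct hash is stored once and file records reference a compact generated id. The bucket names are hypothetical:

```go
package meta

import (
	"encoding/binary"

	"github.com/boltdb/bolt"
)

// internHash stores each distinct hash once and returns an 8-byte id
// that file records can reference instead of the full digest.
func internHash(tx *bolt.Tx, hash []byte) ([]byte, error) {
	byHash, err := tx.CreateBucketIfNotExists([]byte("hash_to_id"))
	if err != nil {
		return nil, err
	}
	byID, err := tx.CreateBucketIfNotExists([]byte("id_to_hash"))
	if err != nil {
		return nil, err
	}
	if id := byHash.Get(hash); id != nil {
		return id, nil // already interned
	}
	seq, err := byID.NextSequence() // per-bucket auto-increment
	if err != nil {
		return nil, err
	}
	id := make([]byte, 8)
	binary.BigEndian.PutUint64(id, seq)
	if err := byHash.Put(hash, id); err != nil {
		return nil, err
	}
	return id, byID.Put(id, hash)
}
```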

I also added a schema for resume uploads:

[uploads schema diagram]

But after looking at it, it doesn't benefit much from being saved in a proper database, so I think for now it might be worth just saving this data separately in JSON format, along the lines of the sketch below.
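A sketch of what that standalone JSON state could look like; the uploadState fields are guesses at what a resumable upload needs, not a settled format:

```go
package main

import (
	"encoding/json"
	"log"
	"os"
	"time"
)

// uploadState is a hypothetical record for resuming an interrupted upload.
type uploadState struct {
	Remote    string    `json:"remote"`
	Path      string    `json:"path"`
	UploadID  string    `json:"upload_id"`
	BytesDone int64     `json:"bytes_done"`
	Started   time.Time `json:"started"`
}

func main() {
	s := uploadState{
		Remote:    "gdrive",
		Path:      "/big.iso",
		UploadID:  "abc123",
		BytesDone: 1 << 20,
		Started:   time.Now(),
	}
	data, err := json.MarshalIndent(s, "", "  ")
	if err != nil {
		log.Fatal(err)
	}
	// One small file per in-flight upload would be enough here.
	if err := os.WriteFile("uploads.json", data, 0600); err != nil {
		log.Fatal(err)
	}
}
```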


ajkis commented Apr 16, 2017

Looks good. Btw, now with db support a hash for crypt could be added as well.


Contributor

yonjah commented Apr 16, 2017

Yeah, I'm thinking there might be some secondary data added for some remotes.
Since the schema doesn't have to be fixed or identical, I think I'll try to have an API which will allow each remote to add custom metadata if it needs it; a rough sketch of the shape follows.
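One possible shape for that API: an optional interface a backend can implement, checked with a type assertion. This is purely a sketch, not rclone's actual Fs interface:

```go
package meta

// CustomMetadataer is a hypothetical optional interface a remote can
// implement to attach backend-specific metadata to a record.
type CustomMetadataer interface {
	// CustomMetadata returns extra key/value pairs to persist.
	CustomMetadata() map[string]string
}

// collect merges the common fields with any custom ones the remote offers.
func collect(common map[string]string, o interface{}) map[string]string {
	if cm, ok := o.(CustomMetadataer); ok {
		for k, v := range cm.CustomMetadata() {
			common["x-"+k] = v // prefix to avoid clashing with core fields
		}
	}
	return common
}
```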


EnorMOZ commented Apr 17, 2017

This might help as well, and would probably change what fields/info is gathered: #1346


Contributor

yonjah commented Apr 17, 2017

After looking a bit more closely at the source, I think we'll probably want to save the size as well, since it is used to check whether a file has changed before calculating the hash; something like the check sketched below.

@DTrace001 I think the issue you referenced relates more to limiting the fields in the metadata rclone fetches from Google, and has little impact on which metadata is saved locally.
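The change check mentioned above could look something like this; the cached struct is a stand-in for whatever record the db ends up storing:

```go
package meta

import "os"

// cached is a hypothetical stored record for one local file.
type cached struct {
	Size    int64
	ModTime int64 // unix nanoseconds
	SHA1    string
}

// needsRehash reports whether the file must be rehashed: a size or
// modtime mismatch means the cached hash can no longer be trusted.
// A stat failure is treated as "changed" to stay on the safe side.
func needsRehash(path string, c cached) (bool, error) {
	fi, err := os.Stat(path)
	if err != nil {
		return true, err
	}
	return fi.Size() != c.Size || fi.ModTime().UnixNano() != c.ModTime, nil
}
```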


ajkis commented Apr 17, 2017

Please save all the stats so nothing needs to be queried:

    $ stat 100M.file
      File: '100M.file'
      Size: 104857600   Blocks: 204800   IO Block: 4096   regular file
    Device: 2bh/43d   Inode: 4536129523114675729   Links: 1
    Access: (0644/-rw-r--r--)   Uid: ( 1000/ plex)   Gid: ( 1000/ plex)
    Access: 2017-02-23 12:16:44.209000000 +0100
    Modify: 2017-02-23 12:16:44.209000000 +0100
    Change: 2017-02-23 12:16:44.209000000 +0100
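For reference, all of those fields are reachable from Go via syscall.Stat_t on Linux; a sketch (field names are the linux/amd64 ones):

```go
//go:build linux

package main

import (
	"fmt"
	"log"
	"syscall"
)

func main() {
	var st syscall.Stat_t
	if err := syscall.Stat("100M.file", &st); err != nil {
		log.Fatal(err)
	}
	// The fields below cover everything in the stat(1) output above.
	fmt.Println("size:   ", st.Size)
	fmt.Println("blocks: ", st.Blocks)
	fmt.Println("blksize:", st.Blksize)
	fmt.Println("inode:  ", st.Ino)
	fmt.Println("links:  ", st.Nlink)
	fmt.Printf("mode:    %o\n", st.Mode)
	fmt.Println("uid/gid:", st.Uid, st.Gid)
	fmt.Println("atime:  ", st.Atim.Sec, st.Atim.Nsec)
	fmt.Println("mtime:  ", st.Mtim.Sec, st.Mtim.Nsec)
	fmt.Println("ctime:  ", st.Ctim.Sec, st.Ctim.Nsec)
}
```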


Owner

ncw commented Apr 18, 2017

I think it might be worth exploring what this DB would be used for exactly.

I'd like something which would help with:

  • #949 - local caching of hashes
  • #897 - cache a remote hierarchy so it could then be kept up to date with the changes API

My original thinking was to cache serialized versions of the Object keyed on path. This would mean adding some new methods to the Fs to serialize and unserialize them. This is essentially "caching the remote file id" as you mention above, just in a more generic fashion.

Then when rclone does a list operation, it can query the db instead of the remote as required, unserializing objects as it goes along.

[ I also have a plan to remove the root of the Fs, and have this be handled by a higher layer. This would help with this - the remote:path stuff will need care with the path. ]

The DB design above has quite a few more things in it - what issues are they relevant to?
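A sketch of what those serialize/unserialize hooks might look like as Go interfaces; rclone's real Fs and Object types have many more methods than shown here:

```go
package meta

// Object is a pared-down stand-in for rclone's fs.Object.
type Object interface {
	Remote() string // path of the object on the remote
}

// Serializer is a hypothetical pair of methods an Fs could gain so
// the metadata db can store and rebuild its Objects.
type Serializer interface {
	// SerializeObject turns an Object into bytes the db can store.
	SerializeObject(o Object) ([]byte, error)
	// UnserializeObject rebuilds an Object from stored bytes, so a
	// list operation can be answered from the db without a remote call.
	UnserializeObject(data []byte) (Object, error)
}
```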


Contributor

yonjah commented Apr 22, 2017

> My original thinking was to cache serialized versions of the Object keyed on path. This would mean adding some new methods to the Fs to serialize and unserialize them. This is essentially "caching the remote file id" as you mention above, just in a more generic fashion.

I guess we can serialize the whole Object. I thought limiting the data to the specific stuff we need would keep the db minimal and make using it simple, but serializing the Object is probably easier.

It seems like that way we'd need each Fs to handle serialization, and I would prefer to have it a bit more generic, where all the logic including serialization is handled in a single module, but that might be hard since some remotes will need very specific metadata.

So if I understand you correctly, we just need a bucket per remote mapping Path -> serialized object, along the lines of the sketch below.
Having a file bucket with hashes as keys could help us find dedupes and files that have been renamed or copied, but I guess we can scrap it for now and create it once we have an actual use case.
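A sketch of that layout with Bolt: one bucket per remote, paths as keys; the names are illustrative:

```go
package meta

import "github.com/boltdb/bolt"

// putObject stores a serialized object under its path, in a bucket
// named after the remote (e.g. "gdrive:").
func putObject(db *bolt.DB, remote, path string, data []byte) error {
	return db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte(remote))
		if err != nil {
			return err
		}
		return b.Put([]byte(path), data)
	})
}

// getObject looks a path up in the remote's bucket; nil means not cached.
func getObject(db *bolt.DB, remote, path string) ([]byte, error) {
	var data []byte
	err := db.View(func(tx *bolt.Tx) error {
		if b := tx.Bucket([]byte(remote)); b != nil {
			if v := b.Get([]byte(path)); v != nil {
				// Copy: Bolt values are only valid inside the transaction.
				data = append([]byte(nil), v...)
			}
		}
		return nil
	})
	return data, err
}
```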


Contributor

sweharris commented Apr 24, 2017

Could this also be used to handle the "save symlink" scenario? E.g. if I have ln -s foo1 foo2, then it would be good to have foo2's metadata indicate it's a symlink pointing to foo1.
