[Discussion] rclone + Metadata features and implementation #1337

Open
yonjah opened this Issue Apr 14, 2017 · 18 comments

5 participants
@yonjah
Contributor

yonjah commented Apr 14, 2017

There has been a lot of discussion and many requests for features that require rclone to save metadata (#977 discusses and references a bunch of other issues). This issue is here to track discussion and progress on implementing both the metadata layer and the relevant features.

For now I'm only going to try to implement two features, which I hope should be relatively simple.
Since the main issue is implementing the metadata layer, having it in place should also make other features easier to implement.

Stuff that needs discussion -

  • How, where and which metadata is going to be saved #1336
  • Keeping metadata fresh and purging data of removed files #1335
  • Handling collisions and discrepancies #1334

Features to implement -

  • Two-way Data sync #118

Resume uploads #1333 (there is an ACD issue #87, but I'm going to focus on OneDrive for now) has been removed from this list, since the feature doesn't necessarily require a database.

@pedrosimao

pedrosimao commented Apr 20, 2017

This is a great initiative, but I was thinking of an alternative solution.
Instead of creating separate metadata files extracted from the original files, we could simply compress all files with LZ4 (https://github.com/lz4/lz4) before copying them to the backup server. LZ4 is super fast (up to 400MB/s), which means it wouldn't make rclone slower. The compressed files would preserve all metadata, with the benefit of making some files smaller. This would actually reduce bandwidth needs and upload times. When copying from the backup server back to the source, rclone would automatically decompress everything. What do you think of this approach? It seems relatively simple to implement and would solve another big problem (bandwidth limitations).
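
As a rough sketch of what the compress-before-copy step could look like with the Go LZ4 port linked further down in this thread (github.com/pierrec/lz4); the file names and the idea of staging a .lz4 copy before upload are assumptions for illustration, not anything rclone does today:

```go
// Sketch only: compress a local file to an .lz4 copy before handing it to the upload path.
package main

import (
	"io"
	"log"
	"os"

	"github.com/pierrec/lz4"
)

func compressForUpload(srcPath, dstPath string) error {
	src, err := os.Open(srcPath)
	if err != nil {
		return err
	}
	defer src.Close()

	dst, err := os.Create(dstPath)
	if err != nil {
		return err
	}
	defer dst.Close()

	zw := lz4.NewWriter(dst) // wrap the destination in an LZ4 frame writer
	if _, err := io.Copy(zw, src); err != nil {
		return err
	}
	return zw.Close() // flush the final LZ4 frame
}

func main() {
	// hypothetical file names, just to show the call
	if err := compressForUpload("video.mkv", "video.mkv.lz4"); err != nil {
		log.Fatal(err)
	}
	// the .lz4 file would then be uploaded instead of the original
}
```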

@pedrosimao

pedrosimao commented Apr 20, 2017

I think the challenge here would be keeping the ability to detect file changes when comparing the original files with the ones in the backup, since the destination files would be compressed. In that sense, using extracted metadata might be a solution. We could build a database of all the metadata of all files, save it in a file separate from the compressed files, and compare this metadata with the metadata on the source server. What do you think? Does that sound reasonable?
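
As an illustration of that comparison, assuming a hypothetical per-file metadata record (the Meta fields and the Changed helper below are invented for this sketch, not an agreed format):

```go
package main

import (
	"fmt"
	"time"
)

// Meta is a hypothetical per-file record that could be stored alongside the
// compressed copies and compared against the source side.
type Meta struct {
	Size    int64
	ModTime time.Time
	Hash    string // hash of the *uncompressed* content
}

// Changed reports whether the source file differs from what the metadata
// database says was last uploaded.
func Changed(src, cached Meta) bool {
	if src.Size != cached.Size {
		return true
	}
	if !src.ModTime.Equal(cached.ModTime) {
		return true
	}
	// Fall back to the hash only when both sides actually have one.
	return src.Hash != "" && cached.Hash != "" && src.Hash != cached.Hash
}

func main() {
	cached := Meta{Size: 1024, ModTime: time.Unix(1492646400, 0), Hash: "abc"}
	current := Meta{Size: 2048, ModTime: time.Now(), Hash: "def"}
	fmt.Println("needs re-upload:", Changed(current, cached)) // true
}
```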

@pedrosimao

pedrosimao commented Apr 20, 2017

Here is the LZ4 source in Go:
https://github.com/pierrec/lz4
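
For completeness, reading data back with that package is just a matter of wrapping the stream in an lz4.Reader; a minimal sketch (the file names are made up, and in practice the source would be the download stream):

```go
// Sketch only: decompress an LZ4 stream on the fly while reading it back.
package main

import (
	"io"
	"log"
	"os"

	"github.com/pierrec/lz4"
)

func restore(srcPath, dstPath string) error {
	src, err := os.Open(srcPath) // in practice this would be the download stream
	if err != nil {
		return err
	}
	defer src.Close()

	dst, err := os.Create(dstPath)
	if err != nil {
		return err
	}
	defer dst.Close()

	zr := lz4.NewReader(src) // decompresses transparently as we read
	_, err = io.Copy(dst, zr)
	return err
}

func main() {
	if err := restore("video.mkv.lz4", "video.mkv"); err != nil {
		log.Fatal(err)
	}
}
```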

@ajkis

ajkis commented Apr 20, 2017

This is fine for backup purposes; however, rclone has also become the main tool for everyone keeping their media files in the cloud for streaming (Plex, Kodi, etc.).

@pedrosimao

pedrosimao commented Apr 20, 2017

Sure... but if you are reading or streaming your files via rclone, that would not be a problem, since LZ4 decompression would be done on the fly.
However, if you want the files to be readable without rclone, that could be a problem in this specific case.

@ajkis

ajkis commented Apr 20, 2017

As long as compression is optional it's good; the thing is, for streaming you want as little overhead as possible.

@pedrosimao

pedrosimao commented Apr 20, 2017

I agree, but decompression with LZ4 is amazingly fast. Take a look at the benchmarks here: http://lz4.github.io/lz4/
While very high quality videos will rarely go over 12Mbps, LZ4 on an i5 processor can decompress things as fast as 125Mbps. By the way, for video LZ4 compression wouldn't affect anything, since H264 is already a form of compression; in that case LZ4 would skip compression while still preserving the metadata. You could run a test on your own computer to see how fast you can decompress an H264 movie, for example.

@sweharris

Contributor

sweharris commented Apr 24, 2017

Compression must be optional, because today I can rclone copy Pictures Amazon:Pictures and then view the pictures from the web page and order physical copies of them. Similarly, I could use this to copy files to my Google Drive and have the content viewable on my Chromebook.

We cannot assume that rclone is the only tool used to access this data.

@zenjabba

zenjabba commented Apr 25, 2017

Space used isn't that big of an issue anymore, so why would you bother with compression? I prefer it to be unencrypted and open so anything can read the storage pools without needing to modify anything.

@yonjah

Contributor

yonjah commented Apr 27, 2017

@pedrosimao enabling LZ4 compression seems like a great feature, but as others mentioned it has to be optional.
It's also somewhat out of scope for this issue, since the metadata we are looking at has to be available locally without querying the remote.
Among other things, we are looking at caching hashes so we won't have to recalculate them, and at knowing whether previously uploaded files have been deleted on the remote so we can delete them locally. Both of these use cases need a dedicated local cache.
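
To make the "dedicated local cache" concrete, here is a minimal sketch of a path-to-hash cache on top of an embedded key/value store (Bolt, which comes up later in the thread); the bucket name, key layout and database file name are assumptions for illustration only:

```go
package main

import (
	"fmt"
	"log"

	"github.com/boltdb/bolt"
)

var bucket = []byte("hashes") // assumed bucket name; one key per remote path

// putHash remembers the hash that was last computed for a path.
func putHash(db *bolt.DB, path, hash string) error {
	return db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists(bucket)
		if err != nil {
			return err
		}
		return b.Put([]byte(path), []byte(hash))
	})
}

// getHash returns the cached hash, or "" if the path has never been seen.
func getHash(db *bolt.DB, path string) (string, error) {
	var hash string
	err := db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket(bucket)
		if b == nil {
			return nil
		}
		hash = string(b.Get([]byte(path)))
		return nil
	})
	return hash, err
}

func main() {
	db, err := bolt.Open("rclone-cache.db", 0600, nil) // assumed cache file name
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if err := putHash(db, "backup/photo.jpg", "d41d8cd98f00b204e9800998ecf8427e"); err != nil {
		log.Fatal(err)
	}
	h, _ := getHash(db, "backup/photo.jpg")
	fmt.Println("cached hash:", h)
}
```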

@ajkis

ajkis commented Apr 29, 2017

@yonjah, are you working on it?

@yonjah

Contributor

yonjah commented Apr 29, 2017

@ajkis I plan to.
I started playing with Bolt a bit, but I haven't had much time over the last couple of weeks.

@ajkis

ajkis commented Apr 29, 2017

@yonjah great, I wasn't sure if you were just collecting info.

P.S. Did you join rclone's Slack?

@ajkis

ajkis commented May 19, 2017

@yonjah we need you :) Amazon revoked rclone's access, and Google mounts will cause bans.
https://forum.rclone.org/t/rclone-has-been-banned-from-amazon-drive/2314/10?u=ajki

@pedrosimao

pedrosimao commented May 19, 2017

What? That's totally absurd. On what basis did they issue the ban?

@ajkis

ajkis commented May 19, 2017

Amazon is a lame service provider. It started with Amazon revoking acd_cli's access; then, for a few days, some of those users were using rclone's API secret keys (it's not confirmed that this had anything to do with the ban).

We can only assume Amazon wants to get rid of "heavy" users by revoking the third-party apps that generated most of the traffic and used most of the storage. It's an easy way to get rid of probably 90% of the space taken there, as I really doubt people using their crappy Win/OSX Drive app have any significant data.

P.S. It's best to move this conversation to the forums, where ncw posted about the rclone Amazon ban.

@yonjah

Contributor

yonjah commented May 19, 2017

@ajkis I was following the ACD fiasco, and it's really a shame that a company like Amazon is offering such a crappy service. I can only hope that the large amount of backlash coming from the community will force them to find some kind of solution, even if it is only temporary.

As for implementing the cache, I'm sorry I can't offer you any quick solution. I'm still learning Go and the rclone internals, and I have roughly a day per week to dedicate to this. I had some trouble deciding where to best integrate the cache. I thought it might be possible to do it as part of the listing process (so the remote fs will list both the remote and the cache and merge them together), but I'm not sure this won't end up being too messy. Now I'm thinking of maybe creating a layer fs that wraps another remote and adds the caching as it goes through (probably similar to what crypt does, but you won't have to create a new remote to use it), but I still haven't had the chance to explore the code and see whether this will actually be the best design.
My original deadline for a basic cache was the end of June (since after that I'm not going to have time to work on this for a while). I really hope I'll be able to make it, but so far things have been slower than I expected.
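
To show the shape of that wrapping idea, here is a very rough sketch; the Remote interface and the in-memory cache below are hypothetical stand-ins, not rclone's real Fs interface, and a real implementation would persist the cache (e.g. in Bolt) rather than keep it in a map:

```go
package main

import "fmt"

// Entry is a minimal stand-in for a remote object's metadata.
type Entry struct {
	Path string
	Size int64
	Hash string
}

// Remote is a hypothetical, stripped-down remote interface; rclone's real
// Fs interface is richer, this only illustrates the wrapping idea.
type Remote interface {
	List(dir string) ([]Entry, error)
}

// cachingRemote wraps another Remote and records metadata as calls pass
// through, similar in spirit to how crypt wraps another remote.
type cachingRemote struct {
	inner Remote
	cache map[string]Entry // in-memory here; a real cache would be persistent
}

func newCachingRemote(inner Remote) *cachingRemote {
	return &cachingRemote{inner: inner, cache: make(map[string]Entry)}
}

func (c *cachingRemote) List(dir string) ([]Entry, error) {
	entries, err := c.inner.List(dir)
	if err != nil {
		return nil, err
	}
	for _, e := range entries {
		c.cache[e.Path] = e // remember what was seen on the remote
	}
	return entries, nil
}

// fakeRemote exists only so the sketch compiles and runs.
type fakeRemote struct{}

func (fakeRemote) List(dir string) ([]Entry, error) {
	return []Entry{{Path: dir + "/a.txt", Size: 3, Hash: "abc"}}, nil
}

func main() {
	r := newCachingRemote(fakeRemote{})
	entries, _ := r.List("backup")
	fmt.Println(entries, "cached entries:", len(r.cache))
}
```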

@ajkis

ajkis commented May 19, 2017

@yonjah yeah, I understand. It would be great if @ncw could offer some additional support for it.
