Not an issue, but a FYI #41

Open
fangchin opened this issue Mar 25, 2021 · 2 comments

Comments

@fangchin

Hi all,

Please review this HPCwire "Off the Wire" article: DOE Technical Report: "When to Use rsync?"
March 25, 2021 https://bit.ly/2OZqKV7

Regards

@hjmangalam
Owner

hjmangalam commented Mar 26, 2021 via email

@fangchin
Author

fangchin commented Mar 26, 2021

Hi Harry,

> Thanks for this note. I'll be reading the paper and responding in detail.
> If you'd like the conversation to continue on github, do nothing. If you'd
> like to continue it in private, my email is widely available.

Thanks for responding. Happy to continue the discussion right here on GitHub.

First of all, as we pointed out in our report (Test environment, p. 4), we had very little time for the investigation and highly constrained access to the two testbeds we used; other projects were waiting for them. Nevertheless, the methodology is precisely described in Test methodology, p. 4; the testers are freely available to the public at https://github.com/fangchin/test_rsync; and we are confident in the rigor, comprehensiveness, automation, and fairness of the investigation.

> As it turns out, I'm working on the multihost version right now and I hope to push it to github in a week or two.

It's our view that any multi-host application must show the linear scalability efficiency defined in the report (A glance at two PDDMs, p. 14). Also, by "multi-host", did you mean "scale-out" (i.e. a multi-node cluster)? If so, then HA, automatic load sharing, etc. among multiple instances running on different cluster nodes should be intrinsic. We do hope that our work spurs similar discussions and investigations for other data movers.
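
To make the scalability point concrete, here is a minimal sketch of one common way to express scaling efficiency: measured aggregate throughput divided by ideal linear scaling from a single host. This is an assumption for illustration, not necessarily the definition given on p. 14 of the report; the function name and the throughput numbers are hypothetical.

```python
def scaling_efficiency(throughput_n: float, throughput_1: float, n_hosts: int) -> float:
    """Measured aggregate throughput relative to ideal linear scaling.

    Assumed definition (not taken from the report):
        efficiency = throughput_n / (n_hosts * throughput_1)
    A value of 1.0 means perfectly linear scaling.
    """
    if n_hosts < 1 or throughput_1 <= 0:
        raise ValueError("need n_hosts >= 1 and throughput_1 > 0")
    return throughput_n / (n_hosts * throughput_1)


# Hypothetical measurements (hosts -> aggregate throughput, same unit throughout).
measurements = {1: 10.0, 2: 19.0, 4: 34.0}

for hosts, throughput in measurements.items():
    eff = scaling_efficiency(throughput, measurements[1], hosts)
    print(f"{hosts} host(s): efficiency = {eff:.2f}")
```

Under this assumed definition, a tool that scales linearly keeps the ratio close to 1.0 as hosts are added; a ratio that drops steadily indicates that extra hosts contribute less and less.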

> I'm surprised you didn't include fpsync, a similar rsync wrapper by Ganael
> LaPlanche which supports multihosts already (and who wrote the fpart file
> chunker that parsyncfp uses to allow transport to start before the full
> file recursion is done.)

I am afraid that different rsync "wrappers" cannot change the intrinsic limitations of rsync in tackling LOSF (lots of small files), really large files (e.g. hundreds of GBs, multiple TBs), and large RTT values.

In addition, a monograph usually focuses on a single subject. As the title of the report indicates, it focuses on rsync and a single selected rsync-based tool, parsyncfp (we didn't even have time to evaluate rsync-ssl!). As you alluded, it would be great to include more tools; besides fpsync, bbcp would be a good one to evaluate, for example. Nevertheless, try as we might, we only have 24 hours a day and we have other business to take care of :)

Best Regards,

Chin Fang, Zettar Inc.
