Large memory utilisation when syncing many files #441
Since the order is only used to determine whether an object is only in the source, only in the destination, or in both, partitioning the listings into files may be easier and more performant than external sorting. Apply some hash to each key and write it to the partition file matching the result; then, if an object is present only in the right set, it will appear in the file for its hash in the right set and not in the matching file for the left set. You can build a set in memory per partition file and discard it once you're done with it. It would need to be a different hash than whatever Go uses for maps though, otherwise your set will be just a list 🤔
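The partitioning idea above can be sketched in Go. This is a hypothetical illustration (the bucket count, function names, and in-memory buckets are my own, not s5cmd code); a real implementation would spill each bucket to a temp file instead of a slice, so peak memory is roughly (total keys / number of buckets):

```go
// Hash-partitioned set difference: matching keys from both listings
// always land in the same bucket, so buckets can be compared pairwise.
package main

import (
	"fmt"
	"hash/fnv"
)

const numBuckets = 4 // illustrative; a real tool would size this to fit RAM

// bucket returns the partition index for a key. FNV-1a is stable across
// runs, unlike Go's randomized map hash, so both listings agree on it.
func bucket(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % numBuckets)
}

// partition groups keys by bucket (in a real implementation each bucket
// would be a temp file on disk, not an in-memory slice).
func partition(keys []string) [][]string {
	out := make([][]string, numBuckets)
	for _, k := range keys {
		b := bucket(k)
		out[b] = append(out[b], k)
	}
	return out
}

// diffBuckets reports keys present only in src, comparing one bucket
// pair at a time and discarding each per-bucket set when done.
func diffBuckets(src, dst []string) []string {
	var onlySrc []string
	sp, dp := partition(src), partition(dst)
	for b := 0; b < numBuckets; b++ {
		seen := make(map[string]bool, len(dp[b]))
		for _, k := range dp[b] {
			seen[k] = true
		}
		for _, k := range sp[b] {
			if !seen[k] {
				onlySrc = append(onlySrc, k)
			}
		}
		// seen goes out of scope here and can be garbage-collected
		// before the next bucket is loaded.
	}
	return onlySrc
}

func main() {
	src := []string{"a.txt", "b.txt", "c.txt"}
	dst := []string{"b.txt", "c.txt"}
	fmt.Println(diffBuckets(src, dst)) // → [a.txt]
}
```

The same pass can report only-in-destination keys by swapping the arguments; "in both" falls out of either pass.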
Thanks a lot to both of you!

Current logic of sync

To determine which files are to be copied, the sync command needs to identify which files are present only in the source, only in the destination, and in both. The straightforward (and good) solution is:
Problem

To sort, we need the full list of objects, so we must bring all of them into memory. This might be tolerable while there is enough memory to hold them (in my tests, 1 million objects needed around 7 GB RAM at peak), but once the number of files becomes huge, the process is killed.

Solution Candidates

Obtain the keys in sorted order rather than sorting after obtaining

The goal is to ensure that storage.List sends objects in order, so that we won't need to fetch the full list and sort it. We can then just read from the two channels until both are closed, and in the process send the objects to the appropriate channels.

S3: the server returns objects sorted in ascending order of the respective key names.

Local: the List method (with wildcards) returns an almost sorted order.

Drawbacks
External Sort

Use external sorting, as @misuto proposed. There are some libraries on GitHub that might be worth trying; lanrat/extsort in particular seems promising, since it uses channels for input and output.

Drawback (?)

I've personally never implemented an external sorting algorithm, and from my point of view there are a lot of unclear things. Still, it might be worth trying.

Use more efficient data structures (Trie)

At that point we have the full keys and full addresses (URL.Absolute). So we might use only the absolute path instead of a URL object in our sort code/algorithm.

Other tools

rclone: uses in-memory sorting. Note that their implementation was the reference source for s5cmd sync.

MC: does not sort the full list. For remote, it relies on the order assumption; for local, it uses a custom algorithm that returns objects in sorted order. That algorithm also handles the problem I mentioned earlier by appending a "/" to the end of directory names in its Less method.

Conclusion
Extra commentary

There are also some places where objects are not released (references weren't set to nil) immediately after they're used. I guess @Oppen meant some of those places. Even if we fix them, they probably won't reduce peak memory usage, so that alone won't solve the out-of-memory error or fix #447. Nevertheless, I think those improvements are worth making.
For reference, the GCS docs say objects in the list are ordered lexicographically by name: https://cloud.google.com/storage/docs/xml-api/get-bucket-list
Problem
I have a problem when I try to sync a large number of files, approximately 400,000. When I do this, around 11 GB of memory is used. It works for now, but it will become problematic when I try to move this process to a weaker VM.
I did some debugging and concluded that the problem is that the metadata of the files (type Object) is fetched and stored in memory in the function getSourceAndDestinationObjects.

Proposed solution
The solution I propose is to implement a new flag that switches to external sorting instead of in-memory sorting. This should be an optional flag, since performance will be lower in this mode. I do not have any experience with external sorting, but I would like to help by working on this once we have decided how it should be solved. I am, of course, open to other ideas for solving this problem.
Version:
v2.0.0-beta.2