Large memory utilisation when syncing many files #441

Closed · misuto opened this issue May 16, 2022 · 3 comments · Fixed by #483

Comments

misuto commented May 16, 2022

Problem

I have a problem when I try to sync a large number of files, approximately 400,000. When I do this, around 11 GB of memory is used. It works for now, but it will become problematic when I move this process to a weaker VM.
I did some debugging and concluded that the problem is that the metadata of the files (type Object) is fetched and stored in memory in the function getSourceAndDestinationObjects.

Proposed solution

The solution I propose is to implement a new flag that uses external sorting instead of in-memory sorting. This should be an optional flag, since performance will be lower in this mode. I do not have any experience with external sorting, but I would like to help by working on this once we have decided how it should be solved. I am of course open to other ideas for solving this problem.

Version: v2.0.0-beta.2

Oppen commented Jun 10, 2022

Since the order is only used to determine whether an object is only in the source, only in the destination, or in both, maybe partitioning the objects into files is easier and more performant than external sorting. Apply some hash to each key and write the key to the partition file the hash selects; an object that exists on only one side will appear in that side's file for its hash and be missing from the other side's file for the same hash, so you only ever need to compare matching partitions. You can build a set in memory per file and discard it after you're done with it. It would need to be a different hash than whatever Go uses for maps though, otherwise your set will be just a list 🤔
You can of course instead create a list with the contents of each file, sort it, and then use the same old algorithm.
In any case, I see that many temporary allocations could come from the appends used liberally in that part of the code, and from the fact that it gathers everything together (which is not completely avoidable).
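A minimal sketch of the partitioning idea above (not s5cmd code; the package name, function names, and bucket count are all hypothetical):

```go
// Keys from each side are hashed into a fixed number of bucket files. A key
// always lands in the same bucket index on both sides, so each matching
// source/destination bucket pair can later be compared independently with a
// small in-memory set.
package partition

import (
	"bufio"
	"fmt"
	"hash/fnv"
	"os"
)

const numBuckets = 64

// bucketFor picks a partition index for a key using FNV-1a (deliberately
// not Go's internal map hash, per the caveat above).
func bucketFor(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % numBuckets)
}

// Partition appends each incoming key to the bucket file it hashes to,
// e.g. prefix "src" produces src-00.txt ... src-63.txt.
func Partition(keys <-chan string, prefix string) error {
	writers := make([]*bufio.Writer, numBuckets)
	for i := range writers {
		f, err := os.Create(fmt.Sprintf("%s-%02d.txt", prefix, i))
		if err != nil {
			return err
		}
		defer f.Close()
		writers[i] = bufio.NewWriter(f)
	}
	for key := range keys {
		fmt.Fprintln(writers[bucketFor(key)], key)
	}
	for _, w := range writers {
		if err := w.Flush(); err != nil {
			return err
		}
	}
	return nil
}
```

Comparing then means, for each index i, loading src-i into a map[string]struct{} and streaming dst-i against it, so only one bucket's keys are ever in memory at a time.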

kucukaslan (Contributor) commented Aug 1, 2022

Thanks a lot to both of you!

Current logic of sync

To determine which files are to be copied, the sync command needs to identify which files are present only in the source, only in the destination, and in both. The straightforward (and good) solution is:

  1. Obtain the lists of files in both the source and the destination:

func (s Sync) getSourceAndDestinationObjects(ctx context.Context, srcurl, dsturl *url.URL) ([]*storage.Object, []*storage.Object, error) {

  2. Sort both lists (s5cmd/command/sync.go, lines 223 to 228 at 123b1c7):

sort.SliceStable(sourceObjects, func(i, j int) bool {
	return sourceObjects[i].URL.Relative() < sourceObjects[j].URL.Relative()
})
sort.SliceStable(destObjects, func(i, j int) bool {
	return destObjects[i].URL.Relative() < destObjects[j].URL.Relative()
})

  3. Compare them by taking advantage of the order (a rough sketch follows this list).
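A rough sketch of step 3 (not the actual s5cmd code; classify and the result slices are illustrative names): since both slices are sorted by URL.Relative(), a single two-pointer pass classifies every object.

```go
// classify splits two slices, each sorted by URL.Relative(), into
// source-only, destination-only, and common objects in one linear pass.
func classify(sourceObjects, destObjects []*storage.Object) (onlySource, onlyDest []*storage.Object, common [][2]*storage.Object) {
	i, j := 0, 0
	for i < len(sourceObjects) && j < len(destObjects) {
		src := sourceObjects[i].URL.Relative()
		dst := destObjects[j].URL.Relative()
		switch {
		case src < dst: // present only in the source
			onlySource = append(onlySource, sourceObjects[i])
			i++
		case src > dst: // present only in the destination
			onlyDest = append(onlyDest, destObjects[j])
			j++
		default: // present in both
			common = append(common, [2]*storage.Object{sourceObjects[i], destObjects[j]})
			i++
			j++
		}
	}
	// Whatever remains on either side is unique to that side.
	onlySource = append(onlySource, sourceObjects[i:]...)
	onlyDest = append(onlyDest, destObjects[j:]...)
	return onlySource, onlyDest, common
}
```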

Problem

To sort, we need the full list of objects, so we have to bring all of them into memory. That might be tolerable as long as there is enough memory to hold them (in my tests, 1 million objects needed around 7 GB of RAM at peak). Once the number of files becomes huge, the process is killed with fatal error: runtime: out of memory.

Solution Candidates

Obtain the keys in sorted order rather than sorting after obtaining

The goal is to ensure that storage.List sends the objects in order, so that we won't need to fetch the full list and sort it [1]. Then we can just read from the two channels until both are closed, and in the process send the objects to the appropriate output channels [2].
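If both listings arrive in key order, footnotes 1 and 2 amount to lifting the same two-pointer walk onto channels. A hypothetical sketch (compareStreams is not an existing s5cmd function):

```go
// compareStreams consumes two channels that emit objects in ascending key
// order and routes every object to the matching output channel, holding
// only two objects in memory at any time.
func compareStreams(src, dst <-chan *storage.Object, onlySrc, onlyDst, both chan<- *storage.Object) {
	s, sOK := <-src
	d, dOK := <-dst
	for sOK && dOK {
		switch {
		case s.URL.Relative() < d.URL.Relative():
			onlySrc <- s
			s, sOK = <-src
		case s.URL.Relative() > d.URL.Relative():
			onlyDst <- d
			d, dOK = <-dst
		default:
			both <- s // its destination twin d would be paired here
			s, sOK = <-src
			d, dOK = <-dst
		}
	}
	for ; sOK; s, sOK = <-src { // drain whichever side is left
		onlySrc <- s
	}
	for ; dOK; d, dOK = <-dst {
		onlyDst <- d
	}
}
```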

S3 Server

Objects are returned sorted in ascending order of the respective key names [3].

Local

The List method (with wildcards) returns results in almost sorted order, since filepath.Glob eventually sorts them. So, apparently, does karrick/godirwalk.

Drawbacks

  • S3: Not all S3-compatible servers return objects in sorted order.
  • Local: The output is only almost sorted. Consider three files (where / is the directory separator):
    • a.b
    • a/b
    • ab
      Normally . < / < b, so we expect them in the order written above. But when the aforementioned sort operation is called, the strings being compared are
    • a.b
    • a
    • ab
      since the name of the directory entry is a [4]. So the output order becomes a/b < a.b < ab. To fix this we might need to write our own walker (a small demonstration follows this list).
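A tiny self-contained demonstration of the pitfall, together with the fix used by mc (cited further below): appending "/" to directory names before comparing restores the expected walk order.

```go
package main

import (
	"fmt"
	"sort"
)

func main() {
	// Entry name -> is it a directory?
	isDir := map[string]bool{"a.b": false, "a": true, "ab": false}

	names := []string{"a.b", "a", "ab"}
	sort.Strings(names)
	fmt.Println(names) // [a a.b ab]: dir "a" sorts first, so a/b is walked before a.b

	sort.Slice(names, func(i, j int) bool {
		ki, kj := names[i], names[j]
		if isDir[ki] {
			ki += "/"
		}
		if isDir[kj] {
			kj += "/"
		}
		return ki < kj
	})
	fmt.Println(names) // [a.b a ab]: "a/" now sorts after "a.b", matching the walk order
}
```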

External Sort

Use external sorting, as @misuto proposed. There are some libraries on GitHub that might be worth trying; lanrat/extsort in particular seems promising, since it uses channels for input and output.
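For illustration, a minimal self-contained sketch of the external-sort idea itself (this is neither lanrat/extsort's API nor s5cmd code; all names are hypothetical): keys stream in on a channel, are sorted in fixed-size chunks spilled to temp files, and are merged back out in global order with a heap.

```go
package extsortsketch // a hypothetical sketch, not lanrat/extsort

import (
	"bufio"
	"container/heap"
	"os"
	"sort"
)

const chunkSize = 100_000 // maximum number of keys held in memory at once

// SpillChunks drains keys, sorting each chunk in memory and writing it to
// its own temp file; it returns the paths of the sorted runs.
func SpillChunks(keys <-chan string) ([]string, error) {
	var paths []string
	chunk := make([]string, 0, chunkSize)
	flush := func() error {
		if len(chunk) == 0 {
			return nil
		}
		sort.Strings(chunk)
		f, err := os.CreateTemp("", "sync-sort-*")
		if err != nil {
			return err
		}
		w := bufio.NewWriter(f)
		for _, k := range chunk {
			w.WriteString(k + "\n")
		}
		if err := w.Flush(); err != nil {
			return err
		}
		if err := f.Close(); err != nil {
			return err
		}
		paths = append(paths, f.Name())
		chunk = chunk[:0]
		return nil
	}
	for k := range keys {
		chunk = append(chunk, k)
		if len(chunk) == chunkSize {
			if err := flush(); err != nil {
				return nil, err
			}
		}
	}
	if err := flush(); err != nil {
		return nil, err
	}
	return paths, nil
}

// runHead tracks the next unread key of one sorted run.
type runHead struct {
	key string
	sc  *bufio.Scanner
}

// runHeap is a min-heap ordered by the head key of each run.
type runHeap []runHead

func (h runHeap) Len() int           { return len(h) }
func (h runHeap) Less(i, j int) bool { return h[i].key < h[j].key }
func (h runHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *runHeap) Push(x any)        { *h = append(*h, x.(runHead)) }
func (h *runHeap) Pop() any {
	old := *h
	it := old[len(old)-1]
	*h = old[:len(old)-1]
	return it
}

// MergeRuns streams the globally sorted keys out of the runs.
func MergeRuns(paths []string, out chan<- string) error {
	defer close(out)
	h := &runHeap{}
	for _, p := range paths {
		f, err := os.Open(p)
		if err != nil {
			return err
		}
		defer f.Close()
		sc := bufio.NewScanner(f)
		if sc.Scan() {
			heap.Push(h, runHead{sc.Text(), sc})
		}
	}
	for h.Len() > 0 {
		it := heap.Pop(h).(runHead)
		out <- it.key
		if it.sc.Scan() {
			heap.Push(h, runHead{it.sc.Text(), it.sc})
		}
	}
	return nil
}
```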

Drawback (?)

I've personally never implemented an external sorting algorithm, and from my point of view there are a lot of unclear details. But it still might be worth trying.

Use more efficient data structures (Trie)

At that point we have the full keys and full addresses (URL.Absolute). So we might [5] use only the absolute path, instead of a whole URL object, in our sort code/algorithm.
Additionally, we can use a trie to further reduce memory usage.
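A minimal sketch of the trie idea (illustrative only, not s5cmd code): keys are split on / and stored segment by segment, so a deep directory prefix shared by thousands of objects is held in memory once rather than once per key.

```go
package pathtrie // hypothetical package for illustration

import "strings"

// Trie stores keys one path segment per node; shared prefixes share nodes.
type Trie struct {
	children map[string]*Trie
	terminal bool // true if a full key ends at this node
}

// New returns an empty trie root.
func New() *Trie {
	return &Trie{children: map[string]*Trie{}}
}

// Insert adds one key, creating nodes only for unseen segments.
func (t *Trie) Insert(key string) {
	cur := t
	for _, seg := range strings.Split(key, "/") {
		child, ok := cur.children[seg]
		if !ok {
			child = New()
			cur.children[seg] = child
		}
		cur = child
	}
	cur.terminal = true
}

// Contains reports whether the exact key was previously inserted.
func (t *Trie) Contains(key string) bool {
	cur := t
	for _, seg := range strings.Split(key, "/") {
		child, ok := cur.children[seg]
		if !ok {
			return false
		}
		cur = child
	}
	return cur.terminal
}
```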

Other tools

rclone

rclone uses in-memory sorting. Note that its implementation was the reference for s5cmd's sync:
https://github.com/rclone/rclone/blob/HEAD/fs/march/march.go#L304

MC

mc does not sort the whole list. For remote storage it relies on the ordering assumption; for local storage it uses a custom algorithm that returns objects in sorted order. That algorithm also handles the problem I mentioned earlier, by appending a "/" to the end of directory names (in its Less method):
https://github.com/minio/mc/blob/fe5b2f97e3d22324474e742edd15d9562467ef75/client-fs.go#L285-L297

Conclusion

We intend to use, or at least try, the third option (with the trie). But, unfortunately, we will delay working on this issue at least until #475 and #478 are completed.

Extra commentary

There are also some places where objects are not released (references aren't set to nil) immediately after they're used; I guess @Oppen meant some of those places. But even if we fix them, they probably won't reduce the peak memory usage, so they won't solve the out-of-memory error or fix #447. Nevertheless, I think those improvements should still be made.

Footnotes

  1. Modify getSourceAndDestinationObjects (return channels) and remove the sorting code.

  2. Modify compareObjects (receive and return channels).

  3. https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html

  4. and the objects in that directory are not known yet

  5. TBH, I suspect that we might need the URL.relative and URL.base fields too, but I'm not sure; I haven't thought it through thoroughly yet.

igungor (Member) commented Sep 14, 2022

For reference, the GCS doc says objects in the list are ordered lexicographically by name.

https://cloud.google.com/storage/docs/xml-api/get-bucket-list

igungor pushed a commit that referenced this issue Jun 16, 2023
It uses external sort instead of in-memory sort to dramatically reduce memory usage at the expense of speed. It uses the encoding/gob format to store objects to disk and restore them from there.

Fixes #441
Fixes #447
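A minimal sketch of the gob round trip the fix describes (SyncObject is an illustrative stand-in, not the real s5cmd type): metadata records are streamed to disk with gob.Encoder and read back in order with gob.Decoder, so only one record has to live in memory at a time.

```go
package gobsketch // hypothetical package for illustration

import (
	"encoding/gob"
	"io"
	"os"
)

// SyncObject is a hypothetical, trimmed-down metadata record.
type SyncObject struct {
	Key  string
	Size int64
}

// WriteAll encodes records one by one to a file.
func WriteAll(path string, objects <-chan SyncObject) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	enc := gob.NewEncoder(f)
	for obj := range objects {
		if err := enc.Encode(obj); err != nil {
			return err
		}
	}
	return nil
}

// ReadAll streams the records back until EOF.
func ReadAll(path string, out chan<- SyncObject) error {
	defer close(out)
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	dec := gob.NewDecoder(f)
	for {
		var obj SyncObject
		if err := dec.Decode(&obj); err == io.EOF {
			return nil
		} else if err != nil {
			return err
		}
		out <- obj
	}
}
```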