Large memory utilisation when syncing many files #441

Closed · misuto opened this issue May 16, 2022 · 3 comments · Fixed by #483

Comments

misuto commented May 16, 2022

Problem

I have a problem when I try to sync a large number of files, approximately 400,000. When I do this, around 11 GB of memory is used. It works for now, but it will become problematic when I move this process to a weaker VM.
I did some debugging and concluded that the problem is that the metadata of the files (type Object) is fetched and stored in memory in the function getSourceAndDestinationObjects.

Proposed solution

The solution I propose is to implement a new flag that uses external sorting instead of in-memory sorting. This should be an optional flag, since performance will be lower in this mode. I do not have any experience with external sorting, but I would like to help by working on this once we have decided how it should be solved. I am of course open to other ideas for solving this problem.

Version: v2.0.0-beta.2

Oppen commented Jun 10, 2022

Since the order is only used to determine whether an object is only in the source, only in the destination, or in both, maybe partitioning the objects into files is easier and more performant than external sorting. Apply some hash to each key and write the key to the partition file the hash selects; an object that exists on only one side will appear in that side's file for its hash and be missing from the other side's file for the same hash, so you only ever need to compare matching partitions. You can build a set in memory per file and discard it after you're done with it. It would need to be a different hash than whatever Go uses for maps though, otherwise your set will be just a list 🤔
You can of course instead create a list with the contents of each file, sort it, and then use the same old algorithm.
In any case, I see that many temporary allocations could come from the appends used liberally in that part of the code, and from the fact that it gathers everything together (which is not completely avoidable).
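A minimal sketch of the partitioning idea above (not s5cmd code; the package name, function names, and bucket count are all hypothetical):

```go
// Keys from each side are hashed into a fixed number of bucket files. A key
// always lands in the same bucket index on both sides, so each matching
// source/destination bucket pair can later be compared independently with a
// small in-memory set.
package partition

import (
	"bufio"
	"fmt"
	"hash/fnv"
	"os"
)

const numBuckets = 64

// bucketFor picks a partition index for a key using FNV-1a (deliberately
// not Go's internal map hash, per the caveat above).
func bucketFor(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % numBuckets)
}

// Partition appends each incoming key to the bucket file it hashes to,
// e.g. prefix "src" produces src-00.txt ... src-63.txt.
func Partition(keys <-chan string, prefix string) error {
	writers := make([]*bufio.Writer, numBuckets)
	for i := range writers {
		f, err := os.Create(fmt.Sprintf("%s-%02d.txt", prefix, i))
		if err != nil {
			return err
		}
		defer f.Close()
		writers[i] = bufio.NewWriter(f)
	}
	for key := range keys {
		fmt.Fprintln(writers[bucketFor(key)], key)
	}
	for _, w := range writers {
		if err := w.Flush(); err != nil {
			return err
		}
	}
	return nil
}
```

Comparing then means, for each index i, loading src-i into a map[string]struct{} and streaming dst-i against it, so only one bucket's keys are ever in memory at a time.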

kucukaslan (Contributor) commented Aug 1, 2022

Thanks a lot to both of you!

Current logic of sync

To determine which files are to be copied, the sync command needs to identify which files are present only in the source, only in the destination, and in both. The straightforward (and good) solution is:

  1. Obtain the lists of files in both the source and the destination:

func (s Sync) getSourceAndDestinationObjects(ctx context.Context, srcurl, dsturl *url.URL) ([]*storage.Object, []*storage.Object, error) {

  2. Sort both lists (s5cmd/command/sync.go, lines 223 to 228 at 123b1c7):

sort.SliceStable(sourceObjects, func(i, j int) bool {
	return sourceObjects[i].URL.Relative() < sourceObjects[j].URL.Relative()
})
sort.SliceStable(destObjects, func(i, j int) bool {
	return destObjects[i].URL.Relative() < destObjects[j].URL.Relative()
})

  3. Compare them by taking advantage of the order (a rough sketch follows this list).
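A rough sketch of step 3 (not the actual s5cmd code; classify and the result slices are illustrative names): since both slices are sorted by URL.Relative(), a single two-pointer pass classifies every object.

```go
// classify splits two slices, each sorted by URL.Relative(), into
// source-only, destination-only, and common objects in one linear pass.
func classify(sourceObjects, destObjects []*storage.Object) (onlySource, onlyDest []*storage.Object, common [][2]*storage.Object) {
	i, j := 0, 0
	for i < len(sourceObjects) && j < len(destObjects) {
		src := sourceObjects[i].URL.Relative()
		dst := destObjects[j].URL.Relative()
		switch {
		case src < dst: // present only in the source
			onlySource = append(onlySource, sourceObjects[i])
			i++
		case src > dst: // present only in the destination
			onlyDest = append(onlyDest, destObjects[j])
			j++
		default: // present in both
			common = append(common, [2]*storage.Object{sourceObjects[i], destObjects[j]})
			i++
			j++
		}
	}
	// Whatever remains on either side is unique to that side.
	onlySource = append(onlySource, sourceObjects[i:]...)
	onlyDest = append(onlyDest, destObjects[j:]...)
	return onlySource, onlyDest, common
}
```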

Problem

To sort, we need the full list of objects, so we have to bring all of them into memory. That might be tolerable as long as there is enough memory to hold them (in my tests, 1 million objects needed around 7 GB of RAM at peak). Once the number of files becomes huge, the process is killed with fatal error: runtime: out of memory.

Solution Candidates

Obtain the keys in sorted order rather than sorting after obtaining

The goal is to ensure that storage.List sends the objects in order, so that we won't need to fetch the full list and sort it [1]. Then we can just read from the two channels until both are closed, and in the process send the objects to the appropriate output channels [2].
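If both listings arrive in key order, footnotes 1 and 2 amount to lifting the same two-pointer walk onto channels. A hypothetical sketch (compareStreams is not an existing s5cmd function):

```go
// compareStreams consumes two channels that emit objects in ascending key
// order and routes every object to the matching output channel, holding
// only two objects in memory at any time.
func compareStreams(src, dst <-chan *storage.Object, onlySrc, onlyDst, both chan<- *storage.Object) {
	s, sOK := <-src
	d, dOK := <-dst
	for sOK && dOK {
		switch {
		case s.URL.Relative() < d.URL.Relative():
			onlySrc <- s
			s, sOK = <-src
		case s.URL.Relative() > d.URL.Relative():
			onlyDst <- d
			d, dOK = <-dst
		default:
			both <- s // its destination twin d would be paired here
			s, sOK = <-src
			d, dOK = <-dst
		}
	}
	for ; sOK; s, sOK = <-src { // drain whichever side is left
		onlySrc <- s
	}
	for ; dOK; d, dOK = <-dst {
		onlyDst <- d
	}
}
```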

S3 Server

Objects are returned sorted in ascending order of the respective key names [3].

Local

The List method (with wildcards) returns results in almost sorted order, since filepath.Glob eventually sorts them. So, apparently, does karrick/godirwalk.

Drawbacks

  • S3: Not all S3-compatible servers return objects in sorted order.
  • Local: The output is only almost sorted. Consider three files (where / is the directory separator):
    • a.b
    • a/b
    • ab
      Normally . < / < b, so we expect them in the order written above. But when the aforementioned sort operation is called, the strings being compared are
    • a.b
    • a
    • ab
      since the name of the directory entry is a [4]. So the output order becomes a/b < a.b < ab. To fix this we might need to write our own walker (a small demonstration follows this list).
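A tiny self-contained demonstration of the pitfall, together with the fix used by mc (cited further below): appending "/" to directory names before comparing restores the expected walk order.

```go
package main

import (
	"fmt"
	"sort"
)

func main() {
	// Entry name -> is it a directory?
	isDir := map[string]bool{"a.b": false, "a": true, "ab": false}

	names := []string{"a.b", "a", "ab"}
	sort.Strings(names)
	fmt.Println(names) // [a a.b ab]: dir "a" sorts first, so a/b is walked before a.b

	sort.Slice(names, func(i, j int) bool {
		ki, kj := names[i], names[j]
		if isDir[ki] {
			ki += "/"
		}
		if isDir[kj] {
			kj += "/"
		}
		return ki < kj
	})
	fmt.Println(names) // [a.b a ab]: "a/" now sorts after "a.b", matching the walk order
}
```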

External Sort

Use external sorting, as @misuto proposed. There are some libraries on GitHub that might be worth trying; lanrat/extsort in particular seems promising, since it uses channels for input and output.
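For illustration, a minimal self-contained sketch of the external-sort idea itself (this is neither lanrat/extsort's API nor s5cmd code; all names are hypothetical): keys stream in on a channel, are sorted in fixed-size chunks spilled to temp files, and are merged back out in global order with a heap.

```go
package extsortsketch // a hypothetical sketch, not lanrat/extsort

import (
	"bufio"
	"container/heap"
	"os"
	"sort"
)

const chunkSize = 100_000 // maximum number of keys held in memory at once

// SpillChunks drains keys, sorting each chunk in memory and writing it to
// its own temp file; it returns the paths of the sorted runs.
func SpillChunks(keys <-chan string) ([]string, error) {
	var paths []string
	chunk := make([]string, 0, chunkSize)
	flush := func() error {
		if len(chunk) == 0 {
			return nil
		}
		sort.Strings(chunk)
		f, err := os.CreateTemp("", "sync-sort-*")
		if err != nil {
			return err
		}
		w := bufio.NewWriter(f)
		for _, k := range chunk {
			w.WriteString(k + "\n")
		}
		if err := w.Flush(); err != nil {
			return err
		}
		if err := f.Close(); err != nil {
			return err
		}
		paths = append(paths, f.Name())
		chunk = chunk[:0]
		return nil
	}
	for k := range keys {
		chunk = append(chunk, k)
		if len(chunk) == chunkSize {
			if err := flush(); err != nil {
				return nil, err
			}
		}
	}
	if err := flush(); err != nil {
		return nil, err
	}
	return paths, nil
}

// runHead tracks the next unread key of one sorted run.
type runHead struct {
	key string
	sc  *bufio.Scanner
}

// runHeap is a min-heap ordered by the head key of each run.
type runHeap []runHead

func (h runHeap) Len() int           { return len(h) }
func (h runHeap) Less(i, j int) bool { return h[i].key < h[j].key }
func (h runHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *runHeap) Push(x any)        { *h = append(*h, x.(runHead)) }
func (h *runHeap) Pop() any {
	old := *h
	it := old[len(old)-1]
	*h = old[:len(old)-1]
	return it
}

// MergeRuns streams the globally sorted keys out of the runs.
func MergeRuns(paths []string, out chan<- string) error {
	defer close(out)
	h := &runHeap{}
	for _, p := range paths {
		f, err := os.Open(p)
		if err != nil {
			return err
		}
		defer f.Close()
		sc := bufio.NewScanner(f)
		if sc.Scan() {
			heap.Push(h, runHead{sc.Text(), sc})
		}
	}
	for h.Len() > 0 {
		it := heap.Pop(h).(runHead)
		out <- it.key
		if it.sc.Scan() {
			heap.Push(h, runHead{it.sc.Text(), it.sc})
		}
	}
	return nil
}
```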

Drawback (?)

I've personally never implemented an external sorting algorithm, and from my point of view there are a lot of unclear details. But it still might be worth trying.

Use more efficient data structures (Trie)

At that point we have the full keys and full addresses (URL.Absolute). So we might [5] use only the absolute path, instead of a whole URL object, in our sort code/algorithm.
Additionally, we can use a trie to further reduce memory usage.
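A minimal sketch of the trie idea (illustrative only, not s5cmd code): keys are split on / and stored segment by segment, so a deep directory prefix shared by thousands of objects is held in memory once rather than once per key.

```go
package pathtrie // hypothetical package for illustration

import "strings"

// Trie stores keys one path segment per node; shared prefixes share nodes.
type Trie struct {
	children map[string]*Trie
	terminal bool // true if a full key ends at this node
}

// New returns an empty trie root.
func New() *Trie {
	return &Trie{children: map[string]*Trie{}}
}

// Insert adds one key, creating nodes only for unseen segments.
func (t *Trie) Insert(key string) {
	cur := t
	for _, seg := range strings.Split(key, "/") {
		child, ok := cur.children[seg]
		if !ok {
			child = New()
			cur.children[seg] = child
		}
		cur = child
	}
	cur.terminal = true
}

// Contains reports whether the exact key was previously inserted.
func (t *Trie) Contains(key string) bool {
	cur := t
	for _, seg := range strings.Split(key, "/") {
		child, ok := cur.children[seg]
		if !ok {
			return false
		}
		cur = child
	}
	return cur.terminal
}
```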

Other tools

rclone

rclone uses in-memory sorting. Note that its implementation was the reference for s5cmd's sync:
https://github.com/rclone/rclone/blob/HEAD/fs/march/march.go#L304

MC

mc does not sort the whole list. For remote storage it relies on the ordering assumption; for local storage it uses a custom algorithm that returns objects in sorted order. That algorithm also handles the problem I mentioned earlier, by appending a "/" to the end of directory names (in its Less method):
https://github.com/minio/mc/blob/fe5b2f97e3d22324474e742edd15d9562467ef75/client-fs.go#L285-L297

Conclusion

We intend to use, or at least try, the third option (with the trie). But, unfortunately, we will delay working on this issue at least until #475 and #478 are completed.

Extra commentary

There are also some places where objects are not released (references aren't set to nil) immediately after they're used; I guess @Oppen meant some of those places. But even if we fix them, they probably won't reduce the peak memory usage, so they won't solve the out-of-memory error or fix #447. Nevertheless, I think those improvements should still be made.

Footnotes

  1. Modify getSourceAndDestinationObjects (return channels) and remove the sorting code.

  2. Modify compareObjects (receive and return channels).

  3. https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html

  4. and the objects in that directory are not known yet

  5. TBH, I suspect that we might need the URL.relative and URL.base fields too, but I'm not sure; I haven't thought it through thoroughly yet.

igungor (Member) commented Sep 14, 2022

For reference, the GCS doc says objects in the list are ordered lexicographically by name.

https://cloud.google.com/storage/docs/xml-api/get-bucket-list

igungor pushed a commit that referenced this issue Jun 16, 2023
It uses external sort instead of in-memory sort to dramatically reduce memory usage at the expense of speed. It uses the encoding/gob format to store objects to disk and restore them from there.

Fixes #441
Fixes #447
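A minimal sketch of the gob round trip the fix describes (SyncObject is an illustrative stand-in, not the real s5cmd type): metadata records are streamed to disk with gob.Encoder and read back in order with gob.Decoder, so only one record has to live in memory at a time.

```go
package gobsketch // hypothetical package for illustration

import (
	"encoding/gob"
	"io"
	"os"
)

// SyncObject is a hypothetical, trimmed-down metadata record.
type SyncObject struct {
	Key  string
	Size int64
}

// WriteAll encodes records one by one to a file.
func WriteAll(path string, objects <-chan SyncObject) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	enc := gob.NewEncoder(f)
	for obj := range objects {
		if err := enc.Encode(obj); err != nil {
			return err
		}
	}
	return nil
}

// ReadAll streams the records back until EOF.
func ReadAll(path string, out chan<- SyncObject) error {
	defer close(out)
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	dec := gob.NewDecoder(f)
	for {
		var obj SyncObject
		if err := dec.Decode(&obj); err == io.EOF {
			return nil
		} else if err != nil {
			return err
		}
		out <- obj
	}
}
```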