runtime: increased memory usage in 1.23 with AzCopy #71308
Comments
I take it this is split out from #69590. From the information given, it doesn't look very actionable (no reproducer, no measurement of runtime memory classes).
Please find the memory profiling below. Let me know if you need any additional information.
Steps to Reproduce the Issue
Observation:
@seankhliao I have added the required information. Can you please take a look?
Your heap profile shows a heap size of ~3GB. How does this compare to the memory usage reported by the OS? Is it close (within ~50%) or way off? I'm trying to get a sense of whether we are actually seeing most of the memory in the heap profile. The other thing that would help to see is the breakdown of all of the /memory/classes/... metrics in runtime/metrics.
You also said that memory use grows unbounded. It would be helpful to collect a memory profile and the metrics at two points in time so we can compare them and see what is growing.
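For concreteness, a minimal sketch of how heap profiles could be written from inside the process at two points in time; the file names and the 1-minute/5-minute timing are illustrative, not AzCopy's actual instrumentation:

```go
// Sketch: write heap profiles at two points during a long-running job so the
// dominant allocation sites early and late in the run can be compared.
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

func writeHeapProfile(path string) {
	f, err := os.Create(path)
	if err != nil {
		log.Printf("create %s: %v", path, err)
		return
	}
	defer f.Close()
	runtime.GC() // get up-to-date heap statistics before dumping
	if err := pprof.WriteHeapProfile(f); err != nil {
		log.Printf("write %s: %v", path, err)
	}
}

func main() {
	// ... start the long-lived workload here ...

	time.Sleep(1 * time.Minute)
	writeHeapProfile("heap-1min.pprof")
	time.Sleep(4 * time.Minute)
	writeHeapProfile("heap-5min.pprof")
}
```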
Hello, I’ve attached the zip file containing two profile files as requested:
Please let me know if you need any additional information or if there are any other details I can provide.
Also the before and after of all of the /memory/classes/... metrics in https://pkg.go.dev/runtime/metrics.
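A minimal sketch of dumping all of the /memory/classes/... metrics with the runtime/metrics package (when and how it gets invoked inside the application is up to the reporter; this only shows the reading side):

```go
// Sketch: read and print every /memory/classes/... metric; logging this before
// and after the growth shows which memory class is increasing.
package main

import (
	"fmt"
	"runtime/metrics"
	"strings"
)

func dumpMemoryClasses() {
	var samples []metrics.Sample
	for _, d := range metrics.All() {
		if strings.HasPrefix(d.Name, "/memory/classes/") {
			samples = append(samples, metrics.Sample{Name: d.Name})
		}
	}
	metrics.Read(samples)
	for _, s := range samples {
		fmt.Printf("%s: %d\n", s.Name, s.Value.Uint64())
	}
}

func main() {
	dumpMemoryClasses()
}
```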
Ah, your (1) profile is too early; that is right at init. I was thinking that if memory usage increases steadily for, say, 5 minutes, you could collect a profile after 1 minute and after 5 minutes. We still want to see the main allocations in the program.
Hello,

Metrics at 89% Memory Usage:
/memory/classes/heap/free:bytes: 636829696

Metrics at 98% Memory Usage:
/memory/classes/heap/free:bytes: 3295191040

I hope this helps. Let me know if you need any further information or clarification.
Thanks! My takeaways from this are:
You say that the same program on 1.22 stabilizes at 54%. There are a few possible causes I can think of:
You said that the input to this program is ~1.4M files. Do you start 1.4M goroutines concurrently to process each of these files? A good next step might be to collect goroutine profiles from the 1.22 and 1.23 versions to see if they have similar numbers of goroutines.

Edit: I see that there is quite complicated goroutine pool management (see the worker goroutine and its creator + channel interactions). So an additional possibility is that something in 1.23 is causing that pool management to launch way more goroutines.
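A minimal sketch of writing the goroutine count plus a full stack dump from inside the process; debug=2 includes the "created by" frames, which show which code launched each goroutine (the output path is arbitrary):

```go
// Sketch: write the current goroutine count and a full stack dump so the
// 1.22 and 1.23 runs can be compared goroutine by goroutine.
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
)

func dumpGoroutines(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	fmt.Fprintf(f, "goroutines: %d\n", runtime.NumGoroutine())
	// debug=2 prints full stacks, including "created by" lines.
	return pprof.Lookup("goroutine").WriteTo(f, 2)
}

func main() {
	if err := dumpGoroutines("goroutines.txt"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```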
Hello, I will debug it from our end and let you know if I find anything. Thanks.
Hello, I have captured the goroutine counts at different timestamps by switching between Go versions while keeping the AzCopy codebase unchanged. Below are my observations:

With Go 1.23.1:
With Go 1.22.7:
From the captured goroutine counts at various timestamps, we observe a significant difference in behavior between Go 1.23.1 and Go 1.22.7, despite keeping the AzCopy codebase unchanged.
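One way per-timestamp counts like these can be collected is by periodically sampling runtime.NumGoroutine(); a minimal sketch (the 30-second interval is an assumption, not necessarily what was used here):

```go
// Sketch: log the goroutine count on a fixed interval so its growth over time
// can be compared across Go versions.
package main

import (
	"log"
	"runtime"
	"time"
)

func sampleGoroutines(interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		log.Printf("goroutines=%d", runtime.NumGoroutine())
	}
}

func main() {
	go sampleGoroutines(30 * time.Second)

	// ... run the workload here; keep the process alive so the sampler runs ...
	time.Sleep(10 * time.Minute)
}
```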
I also tried collecting the goroutine profile while running the AzCopy application, and below is the resulting dump:
Additionally, while running the application in debug mode, I observed numerous entries in the call stack, which seem to be related to the goroutine behavior.
Thanks, definitely seems like there is a goroutine leak for some reason. Just the pprof top output isn't enough to tell what these goroutines were doing. Given you have some complex worker pool size logic (https://github.com/Azure/azure-storage-azcopy/blob/1b3cc0c26c6a2f1bc1fd926c01188507ab4c86ae/ste/mgr-JobMgr.go#L919), I suggest taking a look at that to see if it is misbehaving and creating too many goroutines.
@prattmic, we can take a look. In the meantime, can the Go team take a look at the changes that were made in Go 1.23? Nothing in the release notes indicated there were changes that would affect our memory profile in this way.
We need more information, but we are making progress. We have determined that this problem is not related to memory profiles directly. The problem is that there are far more goroutines (which in turn use more memory).

To look into potential problems in 1.23, I need more context. What are these goroutines doing? What created them? e.g., are these internal net/http goroutines stuck somewhere? Are these your worker goroutines that were intentionally created by your pool manager? If so, why did it create so many, or why are they stuck and not exiting?

Given this is a network-heavy application, my initial guess would be something related to net or net/http getting stuck, but that's really just a guess.
Not sure whether it's related to
Hi @prattmic, in the AzCopy code, explicitly setting runtime.GOMAXPROCS(2) when using Go 1.23.1 stabilizes memory usage to levels similar to those observed with Go 1.22.7. This could be considered a workaround. However, it's important to note that the time to complete the job with Go 1.23.1 and runtime.GOMAXPROCS(2) is longer than with Go 1.22.7. Specifically:
We can't accept this trade-off in latency. This might be a naive question, but what happens if we set GOMAXPROCS to 4?
From the comments above, the increase in memory usage appears to be due to a large increase in the number of goroutines. Setting
@ianlancetaylor, agreed, we ran our tests again with
Yes. If a mutex profile is able to identify a problem, that should lead us to the root cause.
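A minimal sketch of enabling mutex contention profiling and writing the profile out after the workload has run for a while (the sampling rate and the 5-minute wait are assumptions):

```go
// Sketch: turn on mutex contention sampling, let the workload run, then write
// the mutex profile for inspection with `go tool pprof`.
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

func main() {
	runtime.SetMutexProfileFraction(5) // sample roughly 1 in 5 contention events
	defer runtime.SetMutexProfileFraction(0)

	// ... run the workload here ...
	time.Sleep(5 * time.Minute)

	f, err := os.Create("mutex.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := pprof.Lookup("mutex").WriteTo(f, 0); err != nil {
		log.Fatal(err)
	}
}
```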
Another idea: the AzCopy goroutine pool has a handful of

You can disable that improvement by commenting out this line from your GOROOT and building again:

Line 427 in ab44565
We are spawning multiple goroutines at different points in our application to schedule transfers in chunks, which makes it somewhat difficult to pinpoint exactly where all these goroutines are being created and what their specific tasks are.

However, as part of the investigation, I have observed that in one particular function, each goroutine acquires a semaphore while establishing an HTTP client connection. We have set the maximum connection count to 200 as a rate limit. The issue I have seen with Go 1.23.1 is that the acquisition and release of the semaphore take progressively longer over time, which is not something I observed in previous versions.

To further analyze this, I added a logging statement at https://github.com/Azure/azure-storage-azcopy/blob/main/ste/mgr-JobPartMgr.go#L123 in the relevant function. Through this logging, we can confirm that the semaphore Acquire() and Release() operations are incrementally taking much longer than expected. As a result, the spawned goroutines pile up, causing a significant increase in memory consumption.

Interestingly, when I tested this behavior with Go 1.22.7, the semaphore Acquire() and Release() operations completed within a few milliseconds, and no such goroutine pile-up occurred. For reference, I've attached an output file that demonstrates this behavior in detail, showing the difference between Go 1.22.7 and Go 1.23.1 in terms of semaphore performance.

Based on these observations, it appears that the semaphore behavior in Go 1.23.1 is contributing to the goroutine pile-up and increased memory usage, especially when compared to Go 1.22.7.

Reference: the debug code added to capture the Acquire() and Release() time.
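As an illustration only (the reporter's actual debug code is attached separately), a minimal sketch of timing semaphore Acquire()/Release() around connection-bound work, assuming golang.org/x/sync/semaphore and a limit of 200; the real AzCopy code may use a different semaphore implementation:

```go
// Sketch: measure how long semaphore Acquire() and Release() take around a
// unit of connection-bound work, with concurrency capped at 200.
package main

import (
	"context"
	"log"
	"time"

	"golang.org/x/sync/semaphore"
)

var maxConns = semaphore.NewWeighted(200) // cap concurrent connections at 200

func doWithConn(ctx context.Context, work func()) error {
	start := time.Now()
	if err := maxConns.Acquire(ctx, 1); err != nil {
		return err
	}
	log.Printf("semaphore Acquire took %v", time.Since(start))

	defer func() {
		relStart := time.Now()
		maxConns.Release(1)
		log.Printf("semaphore Release took %v", time.Since(relStart))
	}()

	work()
	return nil
}

func main() {
	_ = doWithConn(context.Background(), func() { time.Sleep(10 * time.Millisecond) })
}
```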
Looking at the code,
We've observed that this memory usage issue is occurring only on Windows systems. On Linux, the application is working as expected without any memory-related problems. The issue seems to have started occurring with Go version 1.22.8 and onwards on Windows. Prior to this version, there were no such memory usage issues. This suggests that the issue may be related to changes introduced in Go 1.22.8 or later specifically affecting the Windows environment.
CC @golang/windows |
@dphulkar-msft we are also facing the memory issue in azure-fileshare-csi-driver, which runs on Linux, so I am not sure this issue is specific to Windows systems. + @andyzhangx
@mayankagg9722 the original issue is increased memory usage, and we don't have data showing that memory usage is increased in the Azure file CSI driver. When there are lots of files to restore, azcopy creates more parallel jobs to perform the file copy, so a memory usage increase is expected in that case.
Interesting that this occurs in 1.22.8, but not 1.22.7. There are only 3 new commits in 1.22.8 (full commit log):

https://go.dev/cl/611297 cmd/cgo: correct padding required by alignment

The last one is a test-only change, so that only leaves 2 candidate CLs. To be honest, neither of these stands out as a likely candidate for this problem, but since there are so few commits, you can just test each one directly, i.e., clone https://go.googlesource.com/go and bisect between tags go1.22.7 and go1.22.8.
I've just managed to reproduce the issue with 1.23, but not with 1.22.8. Calling out 1.22.8 has probably been a confusion.

I haven't found the root cause yet, but I'm sure it is related to how

Looking at the

Dials when requests are canceled: https://go-review.googlesource.com/c/go/+/576555

I'm on a trip this week so I don't think I can work out a small reproducer. Will try to do it next week. I leave these thoughts here just in case they ring a bell. @neild
Go version
go 1.23.1
Output of go env in your module/workspace:

What did you do?
Used AzCopy version 10.27.0, built with Go version 1.23.1, to copy a dataset with the following characteristics:
Observed memory usage behavior during the operation.
Customers also reported similar memory issues when using AzCopy versions built with Go 1.23.1: Issue #2901
To identify the root cause, multiple experiments were conducted:
What did you see happen?
Observations:
What did you expect to see?