runtime: increased memory usage in 1.23 with AzCopy #71308

Open · dphulkar-msft opened this issue Jan 17, 2025 · 33 comments

Labels: BugReport, NeedsInvestigation, Performance

@dphulkar-msft commented Jan 17, 2025

Go version

go 1.23.1

Output of go env in your module/workspace:

go env
set GO111MODULE=
set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\dphulkar\AppData\Local\go-build
set GOENV=C:\Users\dphulkar\AppData\Roaming\go\env
set GOEXE=.exe
set GOEXPERIMENT=
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOINSECURE=
set GOMODCACHE=C:\Users\dphulkar\go\pkg\mod
set GONOPROXY=
set GONOSUMDB=
set GOOS=windows
set GOPATH=C:\Users\dphulkar\go
set GOPRIVATE=
set GOPROXY=https://proxy.golang.org,direct
set GOROOT=C:\Users\dphulkar\go\pkg\mod\golang.org\toolchain@v0.0.1-go1.23.1.windows-amd64
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLCHAIN=auto
set GOTOOLDIR=C:\Users\dphulkar\go\pkg\mod\golang.org\toolchain@v0.0.1-go1.23.1.windows-amd64\pkg\tool\windows_amd64  
set GOVCS=
set GOVERSION=go1.23.1
set GODEBUG=
set GOTELEMETRY=local
set GOTELEMETRYDIR=C:\Users\dphulkar\AppData\Roaming\go\telemetry
set GCCGO=gccgo
set GOAMD64=v1
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=0
set GOMOD=C:\Users\dphulkar\azure-storage-azcopy\go.mod
set GOWORK=
set CGO_CFLAGS=-O2 -g
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-O2 -g
set CGO_FFLAGS=-O2 -g
set CGO_LDFLAGS=-O2 -g
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -fno-caret-diagnostics -Qunused-arguments -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=C:\Users\dphulkar\AppData\Local\Temp\go-build1895105759=/tmp/go-build -gno-record-gcc-switches

What did you do?

  • Used AzCopy version 10.27.0, built with Go version 1.23.1, to copy a dataset with the following characteristics:

    1. Total Data Size: 567 GB
    2. Total Files: 1,415,230 files
  • Observed memory usage behavior during the operation.

  • Customers also reported similar memory issues when using AzCopy versions built with Go 1.23.1: Issue #2901

  • To identify the root cause, multiple experiments were conducted:

    1. AzCopy 10.26.0 (built with Go 1.22.x) showed stable memory usage (~54% RAM).
    2. AzCopy 10.27.0 (built with Go 1.23.1) exhibited a gradual memory increase, peaking at 96%.
    3. Downgrading the Go runtime in AzCopy 10.27.x from 1.23.1 to 1.22.5/1.22.7 mitigated the issue, with memory usage stabilizing (~54%).
    4. Applying a workaround by setting runtime.MemoryProfileRate=0 in AzCopy 10.27.0 (built with Go 1.23.1) failed to resolve the problem, as memory usage still peaked at 99% (a minimal sketch of this and the next workaround follows this list).
    5. Tried setting the environment variable GODEBUG=profstackdepth=32 to limit profiling stack depth, but memory usage remained high, peaking at 97% RAM.
    6. Also experimented with lowering the profstackdepth value further, but it did not resolve the issue.
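
For reference, here is a minimal sketch (not AzCopy's actual code) of how workarounds 4 and 5 above are typically applied; placing them at the start of main is an assumption for illustration:

package main

import "runtime"

func main() {
	// Workaround 4: disable heap profiling entirely. This must run as early
	// as possible, before the program allocates significantly.
	runtime.MemoryProfileRate = 0

	// Workaround 5 is an environment setting rather than code; it is set
	// before launching the process, e.g. on Windows:
	//   set GODEBUG=profstackdepth=32
	// and only has an effect on Go 1.23 and later.

	// ... start the copy job ...
}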

What did you see happen?

  • Memory Usage: Gradual increase, peaking at 96% RAM.
  • CPU Usage: Fluctuating between 51% and 100%.
  • Memory Profile: Did not stabilize, leading to high resource consumption over time.

Observations:

  1. Significant memory usage difference compared to prior versions (e.g., AzCopy 10.26).
  2. Downgrading the Go runtime from version 1.23.1 to 1.22.5 or 1.22.7 mitigated the issue, with memory usage stabilizing around 54%.
  3. The workaround of setting runtime.MemoryProfileRate=0 did not resolve the issue, as memory usage still reached 99% RAM.

What did you expect to see?

  • Consistent memory usage profile, similar to AzCopy 10.26 or when using Go 1.22.5/1.22.7.
  • Stabilized memory usage without a gradual increase over time.
@gabyhelp added the BugReport label Jan 17, 2025
@dphulkar-msft changed the title from "import/path: Increased Memory Usage with Go 1.23.1 in AzCopy" to "Increased Memory Usage with Go 1.23.1 in AzCopy" Jan 17, 2025
@seankhliao changed the title from "Increased Memory Usage with Go 1.23.1 in AzCopy" to "runtime: increased memory usage in 1.23 with AzCopy" Jan 17, 2025
@gopherbot added the compiler/runtime label Jan 17, 2025
@seankhliao added the Performance and NeedsInvestigation labels and removed the compiler/runtime label Jan 17, 2025
@seankhliao (Member)

I take it this is split out from #69590

From the information given, it doesn't look very actionable (no reproducer, no measurement of runtime memory classes).

@seankhliao added the WaitingForInfo label Jan 17, 2025
@dphulkar-msft (Author) commented Jan 20, 2025

Please find the memory profiling below. Let me know if you need any additional information.

$ go tool pprof -text heap.pb.gz
File: azcopy.exe
Build ID: C:\Users\dphulkar\azure-storage-azcopy\azcopy.exe2025-01-20 10:46:13.2153836 +0530 IST
Type: inuse_space
Time: Jan 20, 2025 at 11:24am (IST)
Showing nodes accounting for 2705.27MB, 92.89% of 2912.47MB total
Dropped 253 nodes (cum <= 14.56MB)
      flat  flat%   sum%        cum   cum%
  783.62MB 26.91% 26.91%   959.64MB 32.95%  net/http.(*Transport).dialConn
  366.66MB 12.59% 39.49%   366.66MB 12.59%  runtime.malg
  284.04MB  9.75% 49.25%   387.94MB 13.32%  net/http.(*Transport).getConn
  177.05MB  6.08% 55.33%   177.55MB  6.10%  github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.JoinPaths
  148.51MB  5.10% 60.43%   785.48MB 26.97%  github.com/Azure/azure-storage-azcopy/v10/ste.logPolicy.Do
  139.01MB  4.77% 65.20%   139.01MB  4.77%  context.(*cancelCtx).Done
  122.51MB  4.21% 69.40%   162.51MB  5.58%  golang.org/x/sync/semaphore.(*Weighted).Acquire
  111.01MB  3.81% 73.22%   211.30MB  7.26%  context.withCancel (inline)
   96.51MB  3.31% 76.53%   147.52MB  5.06%  context.WithDeadlineCause
   77.50MB  2.66% 79.19%    77.50MB  2.66%  context.WithValue
   69.52MB  2.39% 81.58%   114.53MB  3.93%  github.com/Azure/azure-storage-azcopy/v10/ste.(*jobPartTransferMgr).Info
      43MB  1.48% 83.05%       43MB  1.48%  net/http.(*persistConn).roundTrip
   40.50MB  1.39% 84.45%    40.50MB  1.39%  github.com/Azure/azure-storage-azcopy/v10/common.GenerateFullPath
      40MB  1.37% 85.82%       40MB  1.37%  container/list.(*List).insertValue (inline)
      34MB  1.17% 86.99%       34MB  1.17%  net.JoinHostPort (inline)
   27.51MB  0.94% 87.93%   202.55MB  6.95%  github.com/Azure/azure-storage-azcopy/v10/ste.newURLToBlockBlobCopier
      22MB  0.76% 88.69%   232.82MB  7.99%  github.com/Azure/azure-storage-azcopy/v10/ste.(*jobPartMgr).ScheduleTransfers
      17MB  0.58% 89.27%   137.29MB  4.71%  context.WithCancel
   15.50MB  0.53% 89.80%    15.50MB  0.53%  context.WithoutCancel (inline)
      15MB  0.52% 90.32%       15MB  0.52%  github.com/Azure/azure-sdk-for-go/sdk/azcore.(*Client).WithClientName
      15MB  0.52% 90.83%       15MB  0.52%  strings.(*Builder).WriteString (inline)
   12.28MB  0.42% 91.25%   150.30MB  5.16%  context.(*cancelCtx).propagateCancel
   11.50MB  0.39% 91.65%    45.50MB  1.56%  github.com/Azure/azure-storage-azcopy/v10/ste.withPipelineNetworkStats (inline)
   10.02MB  0.34% 91.99%    23.03MB  0.79%  runtime.allocm
    7.50MB  0.26% 92.25%       19MB  0.65%  github.com/Azure/azure-storage-azcopy/v10/ste.(*urlToBlockBlobCopier).generateStartPutBlobFromURL (inline)
    5.50MB  0.19% 92.44%   293.47MB 10.08%  github.com/Azure/azure-storage-azcopy/v10/ste.anyToRemote_file
       4MB  0.14% 92.58%      956MB 32.82%  github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.(*Request).Next
       3MB   0.1% 92.68%   175.04MB  6.01%  github.com/Azure/azure-storage-azcopy/v10/ste.newBlockBlobSenderBase
       3MB   0.1% 92.78%    53.51MB  1.84%  github.com/Azure/azure-storage-azcopy/v10/ste.newBlobSourceInfoProvider
       2MB 0.069% 92.85%    17.50MB   0.6%  net/url.Values.Encode
    0.50MB 0.017% 92.87%   557.45MB 19.14%  net/http.(*Transport).roundTrip
    0.50MB 0.017% 92.89%      952MB 32.69%  github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.(*retryPolicy).Do
         0     0% 92.89%       40MB  1.37%  container/list.(*List).PushBack (inline)
         0     0% 92.89%    91.01MB  3.12%  context.WithCancelCause
         0     0% 92.89%   147.52MB  5.06%  context.WithDeadline (inline)
         0     0% 92.89%   147.02MB  5.05%  context.WithTimeout
         0     0% 92.89%      956MB 32.82%  github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.Pipeline.Do
         0     0% 92.89%      956MB 32.82%  github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.PolicyFunc.Do
         0     0% 92.89%   560.45MB 19.24%  github.com/Azure/azure-sdk-for-go/sdk/azcore/internal/exported.transportPolicy.Do
         0     0% 92.89%   561.45MB 19.28%  github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.(*httpTracePolicy).Do
         0     0% 92.89%   560.95MB 19.26%  github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.(*logPolicy).Do
         0     0% 92.89%      954MB 32.76%  github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.(*requestIDPolicy).Do
         0     0% 92.89%   560.95MB 19.26%  github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.bodyDownloadPolicy
         0     0% 92.89%   561.95MB 19.29%  github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.httpHeaderPolicy
         0     0% 92.89%      956MB 32.82%  github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.includeResponsePolicy
         0     0% 92.89%      956MB 32.82%  github.com/Azure/azure-sdk-for-go/sdk/azcore/runtime.telemetryPolicy.Do
         0     0% 92.89%   466.09MB 16.00%  github.com/Azure/azure-sdk-for-go/sdk/storage/azblob/blob.(*Client).GetProperties
         0     0% 92.89%   466.09MB 16.00%  github.com/Azure/azure-sdk-for-go/sdk/storage/azblob/blockblob.(*Client).GetProperties
         0     0% 92.89%   503.91MB 17.30%  github.com/Azure/azure-sdk-for-go/sdk/storage/azblob/blockblob.(*Client).UploadBlobFromURL
         0     0% 92.89%    41.51MB  1.43%  github.com/Azure/azure-sdk-for-go/sdk/storage/azblob/container.(*Client).NewBlobClient
         0     0% 92.89%   164.54MB  5.65%  github.com/Azure/azure-sdk-for-go/sdk/storage/azblob/container.(*Client).NewBlockBlobClient
         0     0% 92.89%   466.09MB 16.00%  github.com/Azure/azure-sdk-for-go/sdk/storage/azblob/internal/generated.(*BlobClient).GetProperties
         0     0% 92.89%   502.91MB 17.27%  github.com/Azure/azure-sdk-for-go/sdk/storage/azblob/internal/generated.(*BlockBlobClient).PutBlobFromURL
         0     0% 92.89%    40.50MB  1.39%  github.com/Azure/azure-storage-azcopy/v10/common.GenerateFullPathWithQuery
         0     0% 92.89%    40.50MB  1.39%  github.com/Azure/azure-storage-azcopy/v10/ste.(*JobPartPlanHeader).TransferSrcDstStrings
         0     0% 92.89%   466.09MB 16.00%  github.com/Azure/azure-storage-azcopy/v10/ste.(*blockBlobSenderBase).GetDestinationLength
         0     0% 92.89%   173.01MB  5.94%  github.com/Azure/azure-storage-azcopy/v10/ste.(*dialRateLimiter).DialContext
         0     0% 92.89%      952MB 32.69%  github.com/Azure/azure-storage-azcopy/v10/ste.(*fileUploadRangeFromURLFixPolicy).Do
         0     0% 92.89%   978.50MB 33.60%  github.com/Azure/azure-storage-azcopy/v10/ste.(*jobMgr).chunkProcessor
         0     0% 92.89%   232.82MB  7.99%  github.com/Azure/azure-storage-azcopy/v10/ste.(*jobMgr).scheduleJobParts
         0     0% 92.89%   295.97MB 10.16%  github.com/Azure/azure-storage-azcopy/v10/ste.(*jobMgr).transferProcessor
         0     0% 92.89%   295.97MB 10.16%  github.com/Azure/azure-storage-azcopy/v10/ste.(*jobMgr).transferProcessor.func1
         0     0% 92.89%   294.97MB 10.13%  github.com/Azure/azure-storage-azcopy/v10/ste.(*jobPartMgr).StartJobXfer
         0     0% 92.89%   467.59MB 16.05%  github.com/Azure/azure-storage-azcopy/v10/ste.(*jobPartTransferMgr).ReportChunkDone
         0     0% 92.89%   294.97MB 10.13%  github.com/Azure/azure-storage-azcopy/v10/ste.(*jobPartTransferMgr).StartJobXfer
         0     0% 92.89%   467.59MB 16.05%  github.com/Azure/azure-storage-azcopy/v10/ste.(*jobPartTransferMgr).runActionAfterLastChunk (inline)
         0     0% 92.89%   785.98MB 26.99%  github.com/Azure/azure-storage-azcopy/v10/ste.(*retryNotificationPolicy).Do
         0     0% 92.89%       19MB  0.65%  github.com/Azure/azure-storage-azcopy/v10/ste.(*urlToBlockBlobCopier).GenerateCopyFunc
         0     0% 92.89%   978.50MB 33.60%  github.com/Azure/azure-storage-azcopy/v10/ste.(*urlToBlockBlobCopier).GenerateCopyFunc.(*urlToBlockBlobCopier).generateStartPutBlobFromURL.createSendToRemoteChunkFunc.createChunkFunc.func3
         0     0% 92.89%   507.91MB 17.44%  github.com/Azure/azure-storage-azcopy/v10/ste.(*urlToBlockBlobCopier).GenerateCopyFunc.(*urlToBlockBlobCopier).generateStartPutBlobFromURL.func1
         0     0% 92.89%   952.50MB 32.70%  github.com/Azure/azure-storage-azcopy/v10/ste.(*versionPolicy).Do
         0     0% 92.89%   294.97MB 10.13%  github.com/Azure/azure-storage-azcopy/v10/ste.anyToRemote
         0     0% 92.89%   467.59MB 16.05%  github.com/Azure/azure-storage-azcopy/v10/ste.anyToRemote_file.func1
         0     0% 92.89%   294.97MB 10.13%  github.com/Azure/azure-storage-azcopy/v10/ste.computeJobXfer.parameterizeSend.func5
         0     0% 92.89%   467.59MB 16.05%  github.com/Azure/azure-storage-azcopy/v10/ste.epilogueWithCleanupSendToRemote
         0     0% 92.89%   202.55MB  6.95%  github.com/Azure/azure-storage-azcopy/v10/ste.newURLToBlobCopier
         0     0% 92.89%    20.50MB   0.7%  github.com/Azure/azure-storage-azcopy/v10/ste.scheduleSendChunks
         0     0% 92.89%   562.45MB 19.31%  github.com/Azure/azure-storage-azcopy/v10/ste.statsPolicy.Do
         0     0% 92.89%   560.45MB 19.24%  net/http.(*Client).Do
         0     0% 92.89%   560.45MB 19.24%  net/http.(*Client).do
         0     0% 92.89%   557.45MB 19.14%  net/http.(*Client).send
         0     0% 92.89%   557.45MB 19.14%  net/http.(*Transport).RoundTrip
         0     0% 92.89%       34MB  1.17%  net/http.(*Transport).connectMethodForRequest
         0     0% 92.89%   173.01MB  5.94%  net/http.(*Transport).dial
         0     0% 92.89%   959.64MB 32.95%  net/http.(*Transport).dialConnFor
         0     0% 92.89%   959.64MB 32.95%  net/http.(*Transport).startDialConnForLocked.func1
         0     0% 92.89%       34MB  1.17%  net/http.canonicalAddr
         0     0% 92.89%   557.45MB 19.14%  net/http.send
         0     0% 92.89%    43.50MB  1.49%  net/http/httptrace.WithClientTrace
         0     0% 92.89%    22.53MB  0.77%  runtime.handoffp
         0     0% 92.89%    15.86MB  0.54%  runtime.main
         0     0% 92.89%    22.53MB  0.77%  runtime.mstart
         0     0% 92.89%    22.53MB  0.77%  runtime.mstart0
         0     0% 92.89%    22.53MB  0.77%  runtime.mstart1
         0     0% 92.89%    23.03MB  0.79%  runtime.newm
         0     0% 92.89%   372.59MB 12.79%  runtime.newproc.func1
         0     0% 92.89%   372.59MB 12.79%  runtime.newproc1
         0     0% 92.89%    22.53MB  0.77%  runtime.retake
         0     0% 92.89%    23.03MB  0.79%  runtime.startm
         0     0% 92.89%    22.53MB  0.77%  runtime.sysmon
         0     0% 92.89%   373.09MB 12.81%  runtime.systemstack
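
For reference, a minimal sketch of one way such an inuse_space heap profile can be written from inside a Go program (illustrative only; this is not necessarily how AzCopy produced heap.pb.gz):

package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

func writeHeapProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	runtime.GC() // get up-to-date live-object statistics before dumping
	// debug=0 writes the gzipped protobuf format read by `go tool pprof`.
	return pprof.Lookup("heap").WriteTo(f, 0)
}

func main() {
	if err := writeHeapProfile("heap.pb.gz"); err != nil {
		log.Fatal(err)
	}
}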

@dphulkar-msft (Author) commented Jan 20, 2025

Steps to Reproduce the Issue

  • Test with AzCopy 10.26.0:

    1. Download and use AzCopy 10.26.0 to perform a service-to-service copy operation with a large dataset.
    2. Record the memory usage during the operation.
  • Test with AzCopy 10.27.1:

    1. Download and use AzCopy 10.27.1 to perform the same service-to-service copy operation with the same large dataset.
    2. Observe and compare the memory usage. A significant increase in memory usage is noticed with version 10.27.1.
  • Test by Downgrading the Go Runtime:

    1. Downgrade the Go runtime to 1.22.7 or 1.22.5.
    2. Perform the same operation using AzCopy and note that the memory usage is almost identical to what is observed with AzCopy 10.26.0.

Observation:

  1. AzCopy 10.27.1 exhibits significantly higher memory usage compared to AzCopy 10.26.0.
  2. Downgrading the Go runtime version resolves the memory usage issue, aligning the behavior with AzCopy 10.26.0.

@dphulkar-msft (Author)

@seankhliao I have added the required information. Can you please take a look?

@seanmcc-msft commented Jan 22, 2025

@mknyszek and @nsrip-dd, could you take a look? Please let us know what additional information you need to reproduce. We need to resolve this issue or find a workaround in the next few weeks, as we cannot keep AzCopy on Go 1.22.x forever due to potential CVEs.

@prattmic (Member)

Your heap profile shows a heap size of ~3GB. How does this compare to the memory usage reported by the OS? Is it close (within ~50%) or way off? I'm trying to get a sense of whether we are actually seeing most of the memory in the heap profile.

The other thing that would help to see is the breakdown of all of the /memory/classes/... metrics in https://pkg.go.dev/runtime/metrics.
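
A minimal sketch (not AzCopy code) of dumping all /memory/classes/... metrics with the runtime/metrics package, in case it helps with the request above:

package main

import (
	"fmt"
	"runtime/metrics"
	"strings"
)

func dumpMemoryClasses() {
	// Collect every metric whose name starts with /memory/classes/.
	var samples []metrics.Sample
	for _, d := range metrics.All() {
		if strings.HasPrefix(d.Name, "/memory/classes/") {
			samples = append(samples, metrics.Sample{Name: d.Name})
		}
	}
	metrics.Read(samples)
	for _, s := range samples {
		if s.Value.Kind() == metrics.KindUint64 { // all /memory/classes metrics are byte counts
			fmt.Printf("%s: %d\n", s.Name, s.Value.Uint64())
		}
	}
}

func main() { dumpMemoryClasses() }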

@prattmic (Member) commented Jan 22, 2025

You also said that memory use grows unbounded. It would be helpful to collect a memory profile and /memory/classes/... dump twice: once near the beginning (before it is using too much memory) and again when memory usage has grown significantly. Then we can see which aspects are growing.

@dphulkar-msft (Author)

Hello,

I’ve attached the zip file containing two profile files as requested:

  1. One profile captured near the beginning of the process.
  2. The other profile captured after memory usage has grown significantly.

Please let me know if you need any additional information or if there are any other details I can provide.

memory_growth_profile.zip

@prattmic (Member)

Also the before and after of all of the /memory/classes/... metrics in https://pkg.go.dev/runtime/metrics.

@prattmic (Member)

Ah, your (1) profile is too early; that is right at init. What I meant was: if memory usage increases constantly for, say, 5 minutes, collect a profile after 1 minute and again after 5 minutes. We still want to see the main allocations in the program.

@dphulkar-msft (Author) commented Jan 27, 2025

Hello,
As requested, I have attached the zip file containing the profile files. Below are the /memory/classes metrics for the azcopy process at 89% and 98% memory usage on Windows:

Metrics at 89% Memory Usage:

/memory/classes/heap/free:bytes: 636829696
/memory/classes/heap/objects:bytes: 5764946744
/memory/classes/heap/released:bytes: 7217152
/memory/classes/heap/stacks:bytes: 6534758400
/memory/classes/heap/unused:bytes: 389940424
/memory/classes/metadata/mcache/free:bytes: 9344
/memory/classes/metadata/mcache/inuse:bytes: 23360
/memory/classes/metadata/mspan/free:bytes: 11288480
/memory/classes/metadata/mspan/inuse:bytes: 149120800
/memory/classes/metadata/other:bytes: 82315312
/memory/classes/os-stacks:bytes: 0
/memory/classes/other:bytes: 19243470
/memory/classes/profiling/buckets:bytes: 2037642
/memory/classes/total:bytes: 13597730824

Metrics at 98% Memory Usage:

/memory/classes/heap/free:bytes: 3295191040
/memory/classes/heap/objects:bytes: 3969905328
/memory/classes/heap/released:bytes: 4284416
/memory/classes/heap/stacks:bytes: 8311242752
/memory/classes/heap/unused:bytes: 1783795024
/memory/classes/metadata/mcache/free:bytes: 9344
/memory/classes/metadata/mcache/inuse:bytes: 23360
/memory/classes/metadata/mspan/free:bytes: 59887680
/memory/classes/metadata/mspan/inuse:bytes: 151782720
/memory/classes/metadata/other:bytes: 100771792
/memory/classes/os-stacks:bytes: 0
/memory/classes/other:bytes: 26663342
/memory/classes/profiling/buckets:bytes: 2044114
/memory/classes/total:bytes: 17705600912

mem_growth_profile.zip

I hope this helps. Let me know if you need any further information or clarification.

@prattmic (Member) commented Jan 27, 2025

Thanks! My takeaways from this are:

  • You seem to have a lot of goroutines, and keep adding more and more. Is this expected?
    • Goroutine stacks (/memory/classes/heap/stacks:bytes) are ~50% of total memory (~8GB / ~17GB). That is way higher than normal.
    • In the profiles, total live size of goroutine objects (runtime.malg) is nearly double from 1min -> 5min. The overall proportions of both profiles are similar, so it doesn't look like there is a leak in a specific area, just that there is more of everything.

You say that the same program on 1.22 stabilizes at 54%. There are a few possible causes I can think of:

  • 1.22 and 1.23 have a similar number of goroutines, but some change in 1.23 has significantly increased the stack usage of each goroutine. i.e., in 1.22 there are still tons of goroutines, but they each use less memory.
  • Alternatively, 1.22 has far fewer goroutines. Something in 1.23 has caused goroutines to get stuck (maybe in net/http?) in a way they don't in 1.22.

You said that the input to this program is ~1.4M files. Do you start 1.4M goroutines concurrently to process each of these files?

A good next step might be to collect goroutine profiles from the 1.22 and 1.23 versions to see if they have similar numbers of goroutines. The /sched/goroutines:goroutines runtime metric will also give a goroutine count, though you won't be able to see where they are.

Edit: I see that there is quite complicated goroutine pool management. (See worker goroutine and its creator + channel interactions). So an additional possibility is that something in 1.23 is causing that pool management to launch way more goroutines.
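
A minimal sketch (illustrative, not AzCopy code) of periodically sampling the /sched/goroutines:goroutines metric mentioned above; the 30-second interval is an arbitrary choice, and runtime.NumGoroutine() would give the same count:

package main

import (
	"fmt"
	"runtime/metrics"
	"time"
)

func main() {
	sample := []metrics.Sample{{Name: "/sched/goroutines:goroutines"}}
	for {
		metrics.Read(sample)
		fmt.Println("live goroutines:", sample[0].Value.Uint64())
		time.Sleep(30 * time.Second)
	}
}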

@dphulkar-msft (Author)

Hello,

I will debug it from our end and let you know if I find anything.

Thanks

@dphulkar-msft (Author)

Hello,

I have captured the goroutine count at different timestamps by switching between Go versions while keeping the AzCopy codebase unchanged. Below are my observations:

With Go 1.23.1:

  • /sched/goroutines:goroutines: 21,378
  • /sched/goroutines:goroutines: 534,291
  • /sched/goroutines:goroutines: 869,009
  • /sched/goroutines:goroutines: 924,700
  • /sched/goroutines:goroutines: 1,055,401
  • /sched/goroutines:goroutines: 1,130,102
  • /sched/goroutines:goroutines: 1,539,766

With Go 1.22.7:

  • /sched/goroutines:goroutines: 1,310
  • /sched/goroutines:goroutines: 1,386
  • /sched/goroutines:goroutines: 1,564

From the captured goroutine counts at various timestamps, we observe a significant difference in behavior between Go 1.23.1 and Go 1.22.7, despite keeping the AzCopy codebase unchanged.

  • With Go 1.23.1, the number of goroutines increases rapidly over time, reaching a peak of 1,539,766 at the last recorded timestamp.
  • In contrast, with Go 1.22.7, the goroutine count remains much lower and increases at a more gradual rate, reaching a maximum of 1,564 by the last recorded timestamp.

I also tried collecting the goroutine profile while running the AzCopy application, and below is the resulting dump:

(pprof) top 
Showing nodes accounting for 87484, 100% of 87490 total
Dropped 215 nodes (cum <= 437)
Showing top 10 nodes out of 25
      flat  flat%   sum%        cum   cum%
     82259 94.02% 94.02%      82259 94.02%  runtime.gopark
      5225  5.97%   100%       5225  5.97%  runtime.cgocall
         0     0%   100%       5119  5.85%  crypto/tls.(*Conn).HandshakeContext     
         0     0%   100%       5119  5.85%  crypto/tls.(*Conn).clientHandshake      
         0     0%   100%       5119  5.85%  crypto/tls.(*Conn).handshakeContext     
         0     0%   100%       5119  5.85%  crypto/tls.(*Conn).handshakeContext.func2
         0     0%   100%       5119  5.85%  crypto/tls.(*Conn).verifyServerCertificate
         0     0%   100%       5119  5.85%  crypto/tls.(*clientHandshakeStateTLS13).handshake
         0     0%   100%       5119  5.85%  crypto/tls.(*clientHandshakeStateTLS13).readServerCertificate
         0     0%   100%       5119  5.85%  crypto/x509.(*Certificate).Verify       
(pprof) quit

Additionally, while running the application in debug mode, I observed numerous entries in the call stack, which seem to be related to the goroutine behavior.

[Image: screenshot of the debugger call stack]

@prattmic (Member)

Thanks, definitely seems like there is a goroutine leak for some reason.

Just the pprof top output isn't enough to tell what these goroutines were doing. runtime.gopark means most goroutines are blocked, but we can't see on what. Looking at the graph pprof -svg (also on pprof -http) or flame graph (on pprof -http) should show what these goroutines are.

Given you have some complex worker pool size logic (https://github.com/Azure/azure-storage-azcopy/blob/1b3cc0c26c6a2f1bc1fd926c01188507ab4c86ae/ste/mgr-JobMgr.go#L919), I suggest taking a look at that to see if it is misbehaving and creating too many goroutines.
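
A minimal sketch of writing a goroutine profile to disk so it can be opened with go tool pprof -http (illustrative only; AzCopy may already expose profiles another way):

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func dumpGoroutines(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	// debug=0 writes the protobuf format; `go tool pprof -http=:8080 <file>`
	// then shows the graph and flame-graph views mentioned above.
	return pprof.Lookup("goroutine").WriteTo(f, 0)
}

func main() {
	if err := dumpGoroutines("goroutine.pb.gz"); err != nil {
		log.Fatal(err)
	}
}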

@seanmcc-msft

@prattmic, we can take a look. In the meantime, can the Go team take a look at the changes that were made in Go 1.23? Nothing in the release notes indicated there were changes that would affect our memory profile in this way.

@prattmic (Member)

We need more information, but are making progress. We have determined that this problem is not related to memory profiles directly. The problem is that there are far more goroutines (which in turn use more memory).

To look into potential problems in 1.23, I need more context. What are these goroutines doing? What created them? e.g., are these internal net/http goroutines stuck somewhere? Are these your worker goroutines that were intentionally created by your pool manager? If so, why did it create so many, or why are they stuck and not exiting?

Given this is a network heavy application, my initial guess would be something related to net or net/http getting stuck, but that's really just a guess.

@andyzhangx

Not sure whether it's related to the runtime.GOMAXPROCS setting; I think that's worth a try. Similar fix:
kubernetes-sigs/azuredisk-csi-driver#2789

@vibhansa-msft

GOMAXPROCS is set to 2 in the AzCopy code. For the goroutines, we are doing some more investigation and will provide more details here once we have interesting findings. Meanwhile, we tried Go 1.23.5 as well and the behavior is the same (leaking goroutines). For some reason, exactly the same code works fine if we downgrade to 1.22.x. If the AzCopy code itself had the leak, it should have shown up on 1.22.x as well, but that doesn't appear to be the case.

@dphulkar-msft (Author)

Hi @prattmic ,

In the AzCopy code, explicitly setting runtime.GOMAXPROCS(2) when using Go 1.23.1 stabilizes memory usage to levels similar to those observed with Go 1.22.7. This could be considered a workaround. However, it's important to note that the time to complete the job with Go 1.23.1 and runtime.GOMAXPROCS(2) is longer compared to using Go 1.22.7.

Specifically:

  • With Go 1.23.1 and runtime.GOMAXPROCS(2), the job completion time is 163.62 minutes.
  • With Go 1.22.7, the job completion time is 55.00 minutes.
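
For clarity, the workaround described above amounts to something like the following at process start (sketch only; the exact location in AzCopy's code is not shown here):

package main

import "runtime"

func main() {
	// Cap the number of Ps at 2; in this experiment it stabilizes memory on
	// Go 1.23.1, at the cost of a much longer job completion time.
	runtime.GOMAXPROCS(2)

	// ... start the copy job ...
}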

@seanmcc-msft

With Go 1.23.1 and runtime.GOMAXPROCS(2), the job completion time is 163.62 minutes.
With Go 1.22.7, the job completion time is 55.00 minutes.

We can't accept this trade-off in latency. This might be a naive question, but what happens if we set GOMAXPROCS to 4?

@ianlancetaylor (Member)

From the comments above, the increase in memory usage appears to be due to a large increase in the number of goroutines. Setting GOMAXPROCS isn't going to stop goroutines from leaking. So if setting GOMAXPROCS does seem to help, that suggests that there is a highly contended lock somewhere, one that has so much contention that the program is unable to make progress. And that suggests that it might be worth looking at a mutex profile. It may not provide any useful information, but perhaps it will. See https://pkg.go.dev/runtime/pprof#hdr-Mutex_profile-Profile.
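
A minimal sketch of enabling and collecting the mutex profile suggested above (illustrative; the sampling rate of 100 and the output file name are arbitrary choices):

package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	// Record roughly 1 out of every 100 mutex contention events (0 disables).
	runtime.SetMutexProfileFraction(100)

	// ... run the workload ...

	f, err := os.Create("mutex.pb.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := pprof.Lookup("mutex").WriteTo(f, 0); err != nil {
		log.Fatal(err)
	}
}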

@seanmcc-msft commented Jan 30, 2025

@ianlancetaylor, agreed, we ran our tests again with GOMAXPROCS=6 and also had a large latency increase. I think the key question is why, with no other change to our codebase, we had significantly more goroutines created with Go 1.23.x vs 1.22.x.

@ianlancetaylor (Member)

Yes. If a mutex profile is able to identify a problem, that should lead us to the root cause.

@qmuntal (Member) commented Jan 30, 2025

Another idea: the AzCopy goroutine pool has a handful of time.Sleep calls on the order of milliseconds, and those sleeps seem to control how often goroutines are spawned. I landed a timer resolution improvement in CL 488675 (https://go-review.googlesource.com/c/go/+/488675) which could be making those sleep calls (correctly) wake up more often than in Go 1.22.

You can disable that improvement by commenting out this line in your GOROOT and building again:

haveHighResSleep = _NtCreateWaitCompletionPacket != nil

@dphulkar-msft (Author) commented Feb 3, 2025

debug_output.zip

We are spawning multiple goroutines at different points in our application to schedule transfers in chunks, which makes it somewhat difficult to pinpoint exactly where all these goroutines are being created and what their specific tasks are.

However, as part of the investigation, I have observed that in one particular function, each goroutine acquires a semaphore while establishing an HTTP client connection. We have set the maximum connection count to 200 for rate limiting. The issue I have seen with Go 1.23.1 is that acquiring and releasing the semaphore has started to take progressively longer over time, which is not something I observed in previous versions.

To further analyze this, I added a logging statement at https://github.com/Azure/azure-storage-azcopy/blob/main/ste/mgr-JobPartMgr.go#L123 in the relevant function. Through this logging, we can confirm that the semaphore Acquire() and Release() operations are taking incrementally longer than expected. As a result, the spawned goroutines pile up, causing a significant increase in memory consumption.

Interestingly, when I tested this behavior with Go 1.22.7, I found that the semaphore Acquire() and Release() operations completed within a few milliseconds, and no such goroutine pile-up occurred.

For reference, I’ve attached an output file that demonstrates this behavior in detail, showing the difference between Go 1.22.7 and Go 1.23.1 in terms of semaphore performance.

Based on these observations, it appears that the semaphore behavior in Go 1.23.1 is contributing to the goroutine pile-up and increased memory usage, especially when compared to Go 1.22.7.

Reference: The debug code added to capture the Acquire() and Release() time.

var cnt int64

func (d *dialRateLimiter) DialContext(ctx context.Context, network, address string) (net.Conn, error) {
	entryTime := time.Now()
	// Count dials; use the value returned by AddInt64 rather than writing it
	// back to the shared counter (which would race with other goroutines).
	n := atomic.AddInt64(&cnt, 1)
	err := d.sem.Acquire(context.Background(), 1)
	if err != nil {
		return nil, err
	}
	startTime := time.Since(entryTime) // time spent waiting to acquire the semaphore
	entryTime = time.Now()
	defer func() {
		fmt.Println("Exiting go routine no", n, "StartDuration", startTime, "EndDuration", time.Since(entryTime))
		d.sem.Release(1)
	}()
	return d.dialer.DialContext(ctx, network, address)
}

@ianlancetaylor (Member)

Looking at the code, d.sem is type *semaphore.Weighted from golang.org/x/sync/semaphore. I compared the x/sync/semaphore benchmarks from 1.22 and HEAD, but nothing jumped out. Geomean was 0.3% higher.

@dphulkar-msft (Author)

We've observed that this memory usage issue is occurring only on Windows systems. On Linux, the application is working as expected without any memory-related problems.

The issue seems to have started occurring with Go version 1.22.8 and onwards on Windows. Prior to this version, there were no such memory usage issues. This suggests that the issue may be related to changes introduced in Go 1.22.8 or later specifically affecting the Windows environment.

@ianlancetaylor (Member)

CC @golang/windows

@mayankagg9722 commented Feb 5, 2025

We've observed that this memory usage issue is occurring only on Windows systems. On Linux, the application is working as expected without any memory-related problems.

The issue seems to have started occurring with Go version 1.22.8 and onwards on Windows. Prior to this version, there were no such memory usage issues. This suggests that the issue may be related to changes introduced in Go 1.22.8 or later specifically affecting the Windows environment.

@dphulkar-msft we are also facing the memory issue in azure-fileshare-csi-driver, which runs on Linux, so I'm not sure this issue is specific to Windows. + @andyzhangx

@andyzhangx

We've observed that this memory usage issue is occurring only on Windows systems. On Linux, the application is working as expected without any memory-related problems.
The issue seems to have started occurring with Go version 1.22.8 and onwards on Windows. Prior to this version, there were no such memory usage issues. This suggests that the issue may be related to changes introduced in Go 1.22.8 or later specifically affecting the Windows environment.

@dphulkar-msft we are also facing the memory issue in azure-fileshare-csi-driver which runs on linux. So not sure if this issue is specific to windows system. + @andyzhangx .

@mayankagg9722 the original issue here is increased memory usage. We don't have data showing that memory usage has increased in the Azure File CSI driver; when there are lots of files to restore, AzCopy creates more parallel jobs to perform the copy, so some increase in memory usage is expected in that case.

@qmuntal self-assigned this Feb 6, 2025
@prattmic (Member)

Interesting that this occurs in 1.22.8, but not 1.22.7.

There are only 3 new commits in 1.22.8 (full commit log):

https://go.dev/cl/611297 cmd/cgo: correct padding required by alignment
https://go.dev/cl/610238 runtime: size maps.Clone destination bucket array safely
https://go.dev/cl/612475 syscall: skip TestAmbientCapsUserns when restricted, document

The last one is a test-only change, so that only leaves 2 candidate CLs.

To be honest, neither of these stand out as likely candidates for this problem, but since there are so few commits, you can just test each one directly. i.e., clone https://go.googlesource.com/go and bisect between tags go1.22.7 and go1.22.8.

@qmuntal (Member) commented Feb 10, 2025

Interesting that this occurs in 1.22.8, but not 1.22.7.

I've just managed to reproduce the issue with 1.23, but not with 1.22.8. Calling out 1.22.8 was probably a mix-up.

I haven't found the root cause yet, but I'm sure it is related to how http.Transport handles idle connections. If I change the AzCopy source code to set http.Transport.MaxConnsPerHost to a high number (in here), e.g. 200, instead of using the default value of 2, then the goroutine creation rate goes back to the Go 1.22 pattern.
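
A minimal sketch of the experiment described above (the field names are from net/http; the surrounding values are placeholders, not AzCopy's actual transport settings):

package main

import (
	"net/http"
	"time"
)

func newTransport() *http.Transport {
	return &http.Transport{
		// Raising this from 2 to 200 restored the Go 1.22 goroutine pattern
		// in the experiment above; 0 would mean "no limit".
		MaxConnsPerHost:     200,
		MaxIdleConnsPerHost: 100,
		IdleConnTimeout:     90 * time.Second,
	}
}

func main() {
	client := &http.Client{Transport: newTransport()}
	_ = client // use the client for requests as usual
}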

Looking at the net/http git history, I can see several commits that touched idle connection management:

Dials when requests are canceled | https://go-review.googlesource.com/c/go/+/576555
net/http: remove persistConn reference from wantConn | https://go-review.googlesource.com/c/go/+/522095

I'm on a trip this week, so I don't think I can work out a small reproducer; I'll try to do it next week. I'm leaving these thoughts here just in case they ring a bell. @neild

@qmuntal removed the WaitingForInfo label Feb 10, 2025