
Memory leak with Application Insights + ASP.NET Core 3.1 + Linux containers #1678

Closed
SidShetye opened this issue Feb 12, 2020 · 28 comments

@SidShetye

SidShetye commented Feb 12, 2020

FYI, this is also being tracked by Azure support as case 119121121001290

Repro Steps

Please see my personal repo here for both the sample code and the pre-built containers. Also note that the restricted way Azure App Service spawns containers prevents you from running dotnet dump on Azure (issue with the diagnostics team), so this is reproduced outside Azure with the standard VS2019 ASP.NET Core 3.1 template.
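For reference, a minimal sketch of collecting such a dump outside Azure with the standard dotnet-dump CLI (the PID and output path below are placeholders):

# install the diagnostics tool once inside (or alongside) the container
dotnet tool install --global dotnet-dump

# list managed processes, then capture a full dump of the leaking one
dotnet-dump ps
dotnet-dump collect -p <pid> --type Full -o /tmp/leak.dmp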

Actual Behavior

The user-visible behavior is that Azure App Service recycles the container with the log entry:

Stoping site <sitename> because it exceeded swap usage limits.

What happens under the hood is that the memory leak takes between 1 and 4 days to exhaust about 1.5 GB of memory. This shows up as high used and low available memory as seen by the free -w -h Linux tool (which reads /proc/meminfo from the container host).

              total        used        free      shared     buffers       cache   available
Mem:           1789        1258          81           5           8         441         378
Swap:             0           0           0

Eventually the container is killed (exit code 137, low memory).

Disabling Application Insights by not setting the APPINSIGHTS_INSTRUMENTATIONKEY environment variable prevents this leak.
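To illustrate the two configurations with plain Docker (the instrumentation key and image name are placeholders, not the actual repro image):

# leaking configuration: instrumentation key present, Application Insights active
docker run -e APPINSIGHTS_INSTRUMENTATIONKEY=<your-ikey> -p 8080:80 <repro-image>

# non-leaking configuration: key omitted, Application Insights effectively disabled
docker run -p 8080:80 <repro-image>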

Expected Behavior

No memory leak

Version Info

SDK Version :

  <ItemGroup>
    <PackageReference Include="Microsoft.ApplicationInsights.AspNetCore" Version="2.12.1" />
    <PackageReference Include="Microsoft.ApplicationInsights.Profiler.AspNetCore" Version="2.1.0-beta1" />
    <PackageReference Include="Microsoft.Extensions.Logging.ApplicationInsights" Version="2.12.1" />
  </ItemGroup>

.NET Version: .NET Core 3.1
How Application was onboarded with SDK: csproj integration
Hosting Info (IIS/Azure WebApps/etc.): Azure App Service Linux containers, as well as locally as plain Linux containers

@pharring
Member

pharring commented Feb 13, 2020

@SidShetye For the purposes of elimination (or confirmation), please can you repeat the test, but without the "Profiler.AspNetCore" package?

@SidShetye
Author

We've spent a ton of time on this to date (see case 119121121001290 and its master case) and can't really sponsor any additional time. Since we've identified the product causing this and even have a repro repo, the idea is that you (the vendor) debug it from this point on.

The leak could be coming from anywhere in Application Insights, but what I can add off the top of my head is that we saw an almost 4x memory footprint with the profiler enabled within just 30 seconds of launch (not in containers or on Linux).

[image: screenshot showing the ~4x memory footprint with the profiler enabled]

@SidShetye
Author

@pharring I should mention, the support case 119121121001290's secure file exchange area has memory dumps of this situation. There is a reference dump taken a few minutes after starting and another after 2 days, when memory consumption is at 1.4 GB. The container eventually died after a total of 3 days of uptime, at which point even a simple ps wouldn't fork.

root@c9aa866c3d4c:/storage# ps aux
-bash: fork: Cannot allocate memory

@SidShetye
Author

@pharring I'm uploading some of the prior work from our support ticket here for easy access. It's based on the VS2019 template repro repo above (no proprietary bits), so it's safe for public distribution.

Memory dumps: Uploaded to Azure Storage here. It should be accessible for the next 7 days for you to download.

Video of the low-memory situation: not a terribly exciting video, but it's only 90 seconds. https://www.youtube.com/watch?v=4JIufEk2hw8

@pharring
Member

@xiaomi7732 can you take a look?

@noahfalk
Member

noahfalk commented Feb 25, 2020

> Memory dumps: Uploaded to Azure Storage here. It should be accessible for the next 7 days for you to download.

@SidShetye Any chance we can get these dumps back with a longer expiration time? Folks on the AppInsights team asked if I could take a look but at the moment I have access to neither the dumps nor a valid ikey to make progress. If possible the dumps are likely to be a much faster way to get useful info than waiting to get the repro set up and leaking again.

@SidShetye
Author

> Memory dumps: Uploaded to Azure Storage here. It should be accessible for the next 7 days for you to download.
>
> @SidShetye Any chance we can get these dumps back with a longer expiration time? Folks on the AppInsights team asked if I could take a look but at the moment I have access to neither the dumps nor a valid ikey to make progress. If possible the dumps are likely to be a much faster way to get useful info than waiting to get the repro set up and leaking again.

Traveling and on mobile, but this should work:
https://crypterondiag.blob.core.windows.net/reported-issues/appinsight-memory-blowup/2020-02-12_09_14_00%20generic%20aspnetcore3.1%20AppInsights%20memoryleak.zip?sp=r&st=2020-01-02T03:47:31Z&se=2020-05-01T10:47:31Z&spr=https&sv=2019-02-02&sr=b&sig=rv1HBvhY%2FP8EcY%2FFV4XdKff7S6mmLe8u0rNx6Q09YlQ%3D

@noahfalk
Member

Thanks @SidShetye ! A dev on my team started looking at the dumps today and ruled a few things out but didn't get to a root cause yet. We'll update once we learn more.

@evilpilaf

This looks related to a similar issue I'm having with a .NET Core 3.1 web API running in Azure App Services. The memory slowly builds up throughout the day, maxing out the instances and causing the pools to be recycled. Attached is the memory dump we got from production, where we can see that the issue comes from the AddApplicationInsightsTelemetry() method.
[image: memory dump analysis pointing at AddApplicationInsightsTelemetry()]

@SidShetye
Author

@noahfalk Any update over the last 3 weeks?

@noahfalk
Member

Hi @SidShetye, one of my coworkers has been working on this for considerable time these last weeks and we are continuing to do so - I apologize it is taking a long time to solve. We are seeing that native memory is leaking, which is much more time-consuming to track down than a managed memory leak. We attempted to use several different memory leak analysis tools on Linux but thus far they've either failed to run or the data they produced has made no sense. We are now at the point where we are making custom modifications to a tool called libleak and we've identified potentially problematic allocation stacks, but the callstack unwind/symbolication from the tool is unusable. We are trying a next set of tool modifications that will let us stop the app under a debugger on the apparently problematic allocations so that we can gain access to higher-quality stack unwinds.

@SidShetye
Author

thanks @noahfalk

On a side note, it wouldn't hurt to talk to the .NET diagnostics tool team so any custom tool enhancements at your end can flow back to the broader .NET community. It's only a matter of time before other folks also have to debug managed + unmanaged leaks on Linux. Perhaps you can save them that frustration.

@noahfalk
Member

> it wouldn't hurt to talk to the .NET diagnostics tool team

I think the team you are referring to is my team : ) My immediate coworkers and I own tools such as dotnet-dump, dotnet-trace, and dotnet-counters, as well as all the low-level diagnostic features implemented inside the .NET runtime. @pharring above is on the AppInsights team and he asked if we could help them out investigating this.

> so any custom tool enhancements at your end can flow back to the broader .NET community

The modifications we are hacking together would be pretty rough for someone else to try repurposing as-is, but I definitely agree with the broader sentiment. I want to stay focused on getting this bug solved for you first, but after that we can take stock of what code/knowledge we gained in the exercise and how we can leverage it for broader community benefit.
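For anyone following along, a minimal sketch of watching a process's memory live with one of those tools, dotnet-counters (the PID is a placeholder):

# stream GC heap size, working set, and allocation rate for the target process
dotnet tool install --global dotnet-counters
dotnet-counters monitor --process-id <pid> System.Runtime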

@SidShetye
Author

@noahfalk Ok, got it. Thanks for those tools, hopefully we can open dotnet-dumps within VS/VSCode's GUI some day! A topic for another day though.

Back to this issue: I wanted to add that the leaks we observed were seen on 2+ core systems with server GC. On a single-core server, or with workstation GC, we didn't observe the leak. We had mentioned this on the internal support ticket but just noticed that the GitHub issue doesn't mention it.
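For completeness, a minimal sketch of pinning the GC mode either way to compare the two configurations (the image name is a placeholder; the same switch can also be made with the ServerGarbageCollection property in the csproj):

# workstation GC (the configuration where we did not observe the leak)
docker run -e COMPlus_gcServer=0 -p 8080:80 <repro-image>

# server GC on a multi-core host (the leaking configuration)
docker run -e COMPlus_gcServer=1 -p 8080:80 <repro-image>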

@noahfalk
Member

Hi @SidShetye, a quick update, and thanks for your continued patience on this long bug. The custom tooling work we did is paying dividends and we have isolated two callstacks with high suspicion of leaking memory. As of a few days ago we switched modes to root-causing, creating tentative fixes, and doing measurements to assess the impact of the changes. Those tasks are still in progress but it is very satisfying to have some leads now.

@noahfalk
Member

noahfalk commented Apr 8, 2020

Hi @SidShetye, just wanted to give another update on progress. Due to the tireless efforts of my coworker @sywhang we found a few issues that had varying contributions, but the one that appears most problematic plays out like this:
[chart: committed memory (MB) over time, baseline vs. tentative runtime fix]

The yellow line is a baseline scenario that we modified slightly from the one you provided so that the Application Insights service profiler captures traces much more frequently. The blue line is after we applied a tentative fix in the coreclr runtime. The x-axis is seconds, the y-axis is committed memory in MB, and we ran both test cases long enough that we thought the trend was apparent.

Interestingly, we thought the native CRT heap was growing because something was calling malloc() and forgetting to call free(), but that wasn't the culprit. Instead the malloc/free calls were correctly balanced, but there was a bad interaction between the runtime allocation pattern and the glibc heap policies that decide when unused CRT heap virtual memory should be unmapped and where new allocations should be placed. We didn't determine why specifically our pattern was causing so much trouble for glibc, but the tentative fix uses a new custom allocator that gets a pool of memory directly from the OS with mmap().

This particular issue is in the coreclr runtime (Service profiler is using a runtime feature to create and buffer the trace) so we'd need to address it via a runtime fix. Our current tentative fix still has worse throughput than the original so we are profiling it and refining it in hopes that we don't fix one problem only to create a new one. We'll also need to meet with our servicing team to figure out what options we have for distributing a fix once we have it.

I'll continue to follow up here as we get more info. In the meantime we found that adjusting the environment variable MALLOC_ARENA_MAX to a low value appeared to substantially mitigate the memory usage so that is a potential option for you to experiment with if you are trying to get this working as soon as possible. Here are a few links to other software projects that were dealing with similar issues to add a little context:
prestodb/presto#8993
cloudfoundry/java-buildpack#320
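For anyone wanting to try that mitigation, a minimal sketch of setting the variable (the image, app, and resource group names are placeholders; App Service app settings are surfaced to the container as environment variables):

# plain container: cap the number of glibc malloc arenas before the app starts
docker run -e MALLOC_ARENA_MAX=2 -p 8080:80 <repro-image>

# Azure App Service (Linux): set the same value as an app setting
az webapp config appsettings set --name <app-name> --resource-group <rg> --settings MALLOC_ARENA_MAX=2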

@SidShetye
Author

Thanks for the update and, wow, this runs deep! Perhaps a write-up on Microsoft devblogs would be great, especially as an advanced follow-up to this piece. The folks root-causing this deserve the credit/recognition.

ank3it assigned ank3it and noahfalk and unassigned ank3it on Apr 21, 2020
noahfalk removed their assignment on Apr 21, 2020
@noahfalk
Member

@ank3it can you assign @sywhang? It appears I can remove my own assignment but can't reassign to anyone else (odd). @sywhang has been doing the lion's share of the work, I've just been writing up a little progress here on his behalf.

@sywhang

sywhang commented May 7, 2020

Hi @SidShetye,

Sorry it took quite a while for me to try out various approaches to fixing the underlying issue and decide on a reasonable fix in terms of complexity and performance impact, but the fix finally got merged to the master branch of the dotnet/runtime repository in dotnet/runtime#35924.

Would it be possible for you to try out the latest branch and confirm that we've fixed your issue? I'm now trying to backport the fix to our servicing release of .NET Core 3.1, and having validation from you as the first customer who reported this issue would help me go through the backport process for the .NET servicing release.

Thank you!

@SidShetye
Author

@sywhang I really appreciate the efforts in fixing this, but we have already stripped Application Insights out of our stack. Is there a write-up on this? It seems like it was a very tricky issue, so a post-analysis write-up could help others too.

@sywhang

sywhang commented May 8, 2020

Thanks @SidShetye for the update! I'll try to get the backport approved regardless so that others won't hit the same issue.

Regarding a write-up, I don't have anything written up on this yet because we're so close to the .NET 5 release snap date and I have a huge stack of work items to go through, but I am definitely planning to share some of the tools I wrote as part of this investigation, as well as a write-up on the investigation, as soon as I have some extra cycles : )

@forik

forik commented May 10, 2020

We are also encountering memory issues and suspect AI. Below is a screenshot of a diff between two dotnet-gcdump files taken an hour apart. It seems to be a DiagnosticListener leak somewhere inside AI.
[image: dotnet-gcdump diff showing DiagnosticListener-related growth]
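For context, a minimal sketch of producing such a diff with dotnet-gcdump (the PID is a placeholder; the resulting .gcdump files can be compared in Visual Studio or PerfView):

# capture a managed heap snapshot now, and another one an hour later
dotnet tool install --global dotnet-gcdump
dotnet-gcdump collect -p <pid> -o before.gcdump
# ... roughly one hour later ...
dotnet-gcdump collect -p <pid> -o after.gcdump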

@noahfalk
Member

@forik I'd recommend opening a new issue with the information on your bug so it can be better tracked. The new issue template will ask for info specific to your setup and from what you posted so far it appears unlikely that your issue is a duplicate of Sid's issue.

@sywhang

sywhang commented May 14, 2020

The fix for this issue has been merged to both the master branch and the .NET Core 3.1 servicing branch (targeting the 3.1.5 servicing release) of the runtime repo. I don't have write access to this repository but I believe the issue can be closed now.

@SidShetye
Author

Can’t test for confirmation but I’ll go ahead and close it.

@yigitgursoy

Still happening in .NET 6.0


@saveveltri

I can confirm that it's still happening in .NET 6. Does the MALLOC_ARENA_MAX workaround still work well?

@noahfalk
Member

noahfalk commented Mar 5, 2024

If you are seeing unbounded memory growth in App Insights profiler running on .NET 6+ then it is probably going to be a bug in a different code path, and potentially a different root cause. I'd recommend creating a new issue so the App Insights team can investigate it.

> Does the MALLOC_ARENA_MAX workaround still work well?

I don't know whether MALLOC_ARENA_MAX will affect the behavior you are seeing. It's possible that it will help if your leak is coming from some other code path that continues to use malloc(). The allocations that were problematic in the scenario we investigated for this bug years ago were converted to use mmap(). I wouldn't expect those specific allocations to be leaking or to be impacted by MALLOC_ARENA_MAX any longer.
