This repository has been archived by the owner on Apr 3, 2024. It is now read-only.

Potential Memory Leak #811

Closed
tbadlov opened this issue Jan 10, 2020 · 57 comments · Fixed by #957 or #975
Assignees
Labels
  • api: clouddebugger - Issues related to the googleapis/cloud-debug-nodejs API.
  • priority: p2 - Moderately-important priority. Fix may not be included in next release.
  • 🚨 - This issue needs some love.
  • type: bug - Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@tbadlov

tbadlov commented Jan 10, 2020


Environment details

  • OS: Google App Engine
  • Node.js version: 12.14.x
  • npm version: 6.13
  • @google-cloud/debug-agent version: 4.2.1

Steps to reproduce

We found that the package causes a memory leak. The image below shows our application's memory usage with the package and without it.

[Screenshot: application memory usage with vs. without the debug agent]

We are running on Google App Engine with 4 nodes. Any help is appreciated.

Thanks!

@yoshi-automation yoshi-automation added the triage me label Jan 11, 2020
@bcoe bcoe added the type: bug, priority: p2, and needs more info labels and removed the needs more info label Jan 14, 2020
@yoshi-automation yoshi-automation removed the triage me label Jan 14, 2020
@DominicKramer
Contributor

It looks like Node issue #28420 is affecting the version of Node you are using. I suspect it could be the root cause.

@google-cloud-label-sync google-cloud-label-sync bot added the api: clouddebugger label Jan 30, 2020
@legopin

legopin commented Feb 19, 2020

@DominicKramer Upgrading the Node version didn't resolve the issue.

We are observing a similar issue on the upgraded Node version.
Memory usage gradually climbs until the process is OOM-killed when the debug agent is enabled; after disabling it, memory usage is stable.

[Screenshot: memory usage climbing steadily with the debug agent enabled, then flat after it was disabled]

Running on GKE (Google Kubernetes Engine) pod

  • OS: official Node Docker image
  • Node.js version: 12.16.0
  • npm version: 6.13.4
  • @google-cloud/debug-agent version: 4.2.2

@DominicKramer
Contributor

Thank you for the update. Have you seen similar problems on GCP services other than GKE?

@legopin

legopin commented Feb 20, 2020

That's the only GCP service we use. We have all our services running under Kubernetes.

@tbadlov
Author

tbadlov commented Feb 20, 2020

Ours is on App Engine.

@DominicKramer
Contributor

@legopin and @tbadlov thanks for the update.

@soldair
Contributor

soldair commented Feb 21, 2020

If anyone has a repro gist or repository we can run that leaks this way, that would help us get to the bottom of this a little more quickly. It also helps confirm that we're solving the problem you're hitting.

@legopin

legopin commented Feb 24, 2020

@soldair
Here is a reproducible code snippet:
This was run in GKE as a single pod; we continued to observe the memory leak even when no requests were received.

[Screenshot: pod memory usage climbing steadily with no incoming requests]

index.js

require("@google-cloud/debug-agent").start({
  projectId: 'ID',
  allowExpressions: true,
  serviceContext: {
    service: 'debug-leak',
  },
});

const Koa = require("koa");
const app = new Koa();

// response
app.use(ctx => {
  ctx.body = "Hello Koa";
});

app.listen(3000, ()=>{console.log('server started')});

Dockerfile

FROM node:12.16.0

COPY package.json package-lock.json /tmp/

RUN mkdir /opt/app \
    && mv /tmp/package.json /tmp/package-lock.json /opt/app \
    && cd /opt/app && npm ci

WORKDIR /opt/app
COPY src /opt/app/src

EXPOSE 3000

ENV NODE_ENV production

CMD ["node", "src"]

@soldair soldair assigned soldair and DominicKramer and unassigned soldair Feb 24, 2020
@legopin

legopin commented Mar 19, 2020

Hello, are there any updates on this issue? Were you able to reproduce it?

@kilianc

kilianc commented Apr 11, 2020

We are experiencing this as well and had to shut down the debugger. The image below shows the same app deployed to two clusters, one with the debugger on and one with the debugger off.

The debugger is a big piece of our workflow so this is pretty dramatic :)

@kilianc

kilianc commented Apr 12, 2020

I just tested the same code with 12.16 and got the same results. The library in its current state is not usable. How is this not a P0 on an LTS version of Node?!

@bcoe
Contributor

bcoe commented Apr 13, 2020

@kilianc If I understand correctly, this issue isn't happening on 10.x?

@kilianc

kilianc commented Apr 13, 2020

@bcoe Can't confirm; we're running on 12.x. It's very easy to repro with the example provided (just change the FROM to 10.x to check).

How can I help?

@escapedoc

Are there any updates on the issue?
We are running on GCE:

  • node v14.1.0
  • @google-cloud/debug-agent: ^4.2.2

Memory leaks occur whenever we have any logpoints. Within 2 hours our VM exceeds its memory with only a single logpoint set and dies. Only a stop and start helps (SSH doesn't work either).

@ankero

ankero commented Jul 7, 2020

I know this is already well reported. Just wanted to pitch in that we are experiencing the same thing in our GAE Flex environment, in all services that use this library. We tested this by removing npm libraries one by one and stress-testing our application to watch memory consumption. Only the builds that include debug-agent are affected by this leak.

It is also important to note that the fewer log lines there are, the slower the memory leak, so this seems to be related to logging somehow.

I would even say this is a critical issue, as it makes it impossible to use this library.

@DominicKramer DominicKramer removed their assignment Jul 7, 2020
@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Jul 8, 2020
@andrewg010

andrewg010 commented Jul 22, 2020

Our services on GKE that use this library are also affected by this issue. It seems to be a problem on any Node version from 12.11.0 onwards with debug-agent 5.1.0.

@Louis-Ye Louis-Ye added and removed the type: bug label Jun 9, 2021
@Louis-Ye Louis-Ye self-assigned this Jun 9, 2021
@Louis-Ye
Contributor

Louis-Ye commented Jun 9, 2021

Hi, thanks everyone for providing the information! For the memory leak without active breakpoints, I am not sure what the cause is, since it is hard to reproduce on my side, but I have a feeling this is still about V8. So I am currently cooking up another PR that makes the cloud debugger attach to the V8 debugger only when there are active breakpoints. If this works (when there are no active breakpoints), then the periodic reset plus this lazy V8 attaching should sufficiently solve the memory leak problem. If the lazy V8 attaching does not work, then we will have to dig in another direction.
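
For anyone curious what this lazy V8 attaching could look like in practice, here is a minimal sketch using Node's built-in inspector module. It only illustrates the idea and is not the agent's actual internals; the names activeBreakpoints, attachIfNeeded, and detachIfIdle are made up for this example.

lazy-attach-sketch.js

// Open an inspector session only while at least one breakpoint is active,
// so the V8 debugger (and the memory it retains) is not always attached.
const inspector = require('inspector');

const activeBreakpoints = new Set();
let session = null;

function attachIfNeeded() {
  if (session || activeBreakpoints.size === 0) return;
  session = new inspector.Session();
  session.connect();
  session.post('Debugger.enable');
}

function detachIfIdle() {
  // Drop the session as soon as the last breakpoint is removed.
  if (session && activeBreakpoints.size === 0) {
    session.post('Debugger.disable');
    session.disconnect();
    session = null;
  }
}

function addBreakpoint(id) {
  activeBreakpoints.add(id);
  attachIfNeeded();
}

function removeBreakpoint(id) {
  activeBreakpoints.delete(id);
  detachIfIdle();
}

module.exports = { addBreakpoint, removeBreakpoint };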

@Louis-Ye
Contributor

Louis-Ye commented Jun 18, 2021

The lazy-V8-attach patch is available in version 5.2.4. Please let me know if the memory leak problem still exists.

@somewhatabstract

@Louis-Ye Thanks! I'll carve out some time this week to try it out and let you know.

@somewhatabstract

@Louis-Ye I tried out version v5.2.5 and the memory leak is still occurring - there does not appear to be any improvement. 😞

@Louis-Ye
Contributor

Louis-Ye commented Jul 6, 2021

@somewhatabstract Thanks for trying the new patch! And sorry to hear that it doesn't fix your problem. I went ahead and cloned https://github.com/Khan/render-gateway to my local machine and started the example application with the cloud debugger enabled, and I did not notice a memory leak there. Can you share how you enable the debugger (e.g., on which line you put require('@google-cloud/debug-agent').start(...)) and what configuration/deployment method you use to create the memory-leak situation? Thanks again for providing the information!

@somewhatabstract

somewhatabstract commented Jul 6, 2021

Can you share how you enable the debugger (e.g., on which line you put require('@google-cloud/debug-agent').start(...)) and what configuration/deployment method you use to create the memory-leak situation? Thanks again for providing the information!

Yeah, the example probably doesn't do much of what may cause the issue (I'm not sure it even enables the debug agent).

To use the debug agent, the call to runServer (from import {runServer} from "render-gateway";) must pass options that specify debugAgent: true in the cloudOptions:

runServer({
    cloudOptions: {
        debugAgent: true,
        profiler: true,
    },
    // ...other options...
});

In addition, the renderEnvironment option provides an environment that obtains JS files from our CDN for rendering our frontend. These are downloaded dynamically and executed inside a JSDOM environment.

When we pass debugAgent: false, disabling the Cloud Debug agent, no memory leak occurs. Pass true and it leaks.

The code that the debugAgent option controls is here:

https://github.com/Khan/render-gateway/blob/master/src/shared/setup-stackdriver.js#L15-L18

(Stackdriver, because that's what the service was branded as when the code was first written.)

That is invoked here:

https://github.com/Khan/render-gateway/blob/master/src/shared/start-gateway.js#L65-L66

Let me know if you need more info. I may be able to carve out some time to give a working example that includes the leak, but I'll have to see what my schedule looks like for that.

@Louis-Ye
Contributor

Louis-Ye commented Jul 13, 2021

@somewhatabstract Thanks for the information! I modified the runServer call in https://github.com/Khan/render-gateway/blob/ee04f6ddd49f68e97e498b4b1bb5940df3a17675/examples/simple/run.js with the parameter you provided (cloudOptions: {debugAgent: true}), then ran yarn start to start the application on a GCE instance and left it running for a day. The memory usage I saw increased from 45 MB to 50 MB over that day. The extra 5 MB is too small to tell whether there is a memory leak or not. Do you happen to remember the rate of your leak?

I may be able to carve out some time to give a working example that includes the leak, but I'll have to see what my schedule looks like for that.

That would be great! We really want to solve this issue, and the largest barrier right now is reproducing it.

@pascaldelange

Hi,

I have been observing the same issue with the debug agent and the Spanner library on an App Engine service.

Running on App Engine Flex with the Node.js runtime, I have a minimal working app with only liveness/readiness checks that do a SELECT 1 query on a Spanner instance (called automatically several times per minute by App Engine's internal health checks).

The issue started after I accidentally relaxed the node version set in package.json from ^10 to >=10.

The memory usage increases at approximately the following speed:

  • only Spanner lib used: slow increase (~10 MB/hour)
  • Spanner lib + debug agent used: rapid increase (~90 MB/hour)
  • no Spanner lib and no debug agent (the liveness check resolves a hardcoded "ok"): no increase
  • no Spanner used, with debug agent: very slow to no increase (maybe 1-2 MB/hour, but I didn't run it for very long)

I observed the same for all sorts of combinations of Spanner and debug-agent versions, including the most recent ones (5.12 and 5.2.7 respectively), as I tried playing with the versions before I realized the Node version was the cause.

You can find a repo with the code used to reproduce this issue at the following link: https://github.com/shinetools/memory-leak-test
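
For context on the kind of app involved, here is a minimal sketch of a health-check endpoint that runs SELECT 1 against Spanner on every probe, in the spirit of the repro earlier in this thread. It is an illustration under assumptions, not code from the linked repo: the project/instance/database IDs and the /healthz path are placeholders.

health-check-sketch.js

require('@google-cloud/debug-agent').start({
  serviceContext: { service: 'memory-leak-test' },
});

const Koa = require('koa');
const { Spanner } = require('@google-cloud/spanner');

const spanner = new Spanner({ projectId: 'my-project' });
const database = spanner.instance('my-instance').database('my-database');

const app = new Koa();

app.use(async ctx => {
  if (ctx.path === '/healthz') {
    // Hit several times per minute by the platform's health checks.
    const [rows] = await database.run({ sql: 'SELECT 1' });
    ctx.body = rows.length === 1 ? 'ok' : 'error';
  } else {
    ctx.body = 'ok';
  }
});

app.listen(8080, () => console.log('listening on 8080'));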

@foxted

foxted commented Aug 11, 2021

Hi,

Faced the same issue recently: our pods kept crashing in a loop because memory usage constantly increased until the pods reached their limits.

We updated our implementation to activate only Cloud Profiler and Cloud Trace while deactivating Cloud Debugger, and the issue is gone.
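
For anyone wanting to do the same, a minimal sketch of that kind of setup is below. The ENABLE_DEBUG_AGENT environment variable is just an illustrative name, not something these libraries define, and the service name and version are placeholders.

tracing-setup-sketch.js

// Start the trace agent first so it can instrument subsequent requires.
require('@google-cloud/trace-agent').start();

require('@google-cloud/profiler').start({
  serviceContext: { service: 'my-service', version: '1.0.0' },
});

// Gate the debug agent behind an env var so it stays off in production
// while the leak described in this issue is unresolved.
if (process.env.ENABLE_DEBUG_AGENT === 'true') {
  require('@google-cloud/debug-agent').start({
    serviceContext: { service: 'my-service' },
  });
}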

@somewhatabstract

somewhatabstract commented Aug 11, 2021

@somewhatabstract Thanks for the information! I modified the runServer call in https://github.com/Khan/render-gateway/blob/ee04f6ddd49f68e97e498b4b1bb5940df3a17675/examples/simple/run.js with the parameter you provided (cloudOptions: {debugAgent: true}), then ran yarn start to start the application on a GCE instance and left it running for a day. The memory usage I saw increased from 45 MB to 50 MB over that day. The extra 5 MB is too small to tell whether there is a memory leak or not. Do you happen to remember the rate of your leak?

I may be able to carve out some time to give a working example that includes the leak, but I'll have to see what my schedule looks like for that.

That would be great! We really want to solve this issue, and the largest barrier right now is reproducing it.

Sorry for the delay @Louis-Ye. I haven't forgotten, I just haven't had an opportunity to look at this yet.

As for the rate of the leak, it was quick: something like 25-40 requests. However, each request was using the JSDOM environment and loading a lot of JavaScript. I imagine the examples just aren't loading very much on each request.
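
To give a rough sense of the per-request work described above, each request amounts to something like downloading a bundle and executing it inside a JSDOM window. This is not the render-gateway implementation, just a sketch of the shape of the workload; the CDN URL is a placeholder.

render-workload-sketch.js

const https = require('https');
const { JSDOM } = require('jsdom');

// Download a script from a URL and resolve with its source text.
function fetchScript(url) {
  return new Promise((resolve, reject) => {
    https
      .get(url, res => {
        let body = '';
        res.on('data', chunk => (body += chunk));
        res.on('end', () => resolve(body));
      })
      .on('error', reject);
  });
}

// Fetch a frontend bundle, evaluate it inside a fresh JSDOM window,
// then serialize whatever it rendered.
async function renderOnce(bundleUrl) {
  const code = await fetchScript(bundleUrl);
  const dom = new JSDOM('<!DOCTYPE html><div id="root"></div>', {
    runScripts: 'outside-only',
  });
  dom.window.eval(code);
  const html = dom.serialize();
  dom.window.close();
  return html;
}

// Example: renderOnce('https://cdn.example.com/frontend-bundle.js');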

@adriaanmeuris

Can confirm: we were regularly seeing 503 errors on our Cloud Run instances with @google-cloud/debug-agent enabled. After disabling it, memory usage was constant and the issue was resolved:

[Screenshot: Cloud Run memory usage before and after disabling the debug agent]

We're running node:16.6-alpine3.14 containers on Cloud Run with @google-cloud/debug-agent version 5.2.8.

@Shabirmean
Member

Shabirmean commented Nov 26, 2021

We are facing this issue as well with the GoogleCloudPlatform/microservices-demo sample application.

The currencyservice and the paymentsservice are both Node.js applications. The pods running them keep growing in memory for around 10 hours until they get killed and rescheduled.

[Screenshot: pod memory usage climbing for ~10 hours until the pods are killed and rescheduled]

The Cloud Profiler shows that the request-retry package used by google-cloud/common is what's taking up a lot of the memory.

[Screenshot: Cloud Profiler heap profile]

@Shabirmean
Member

Here is a heap comparison between two instances, one with the debugger enabled and one with it disabled.

With debug agent

[Screenshot: heap profile with the debug agent enabled]

Without debug agent

[Screenshot: heap profile with the debug agent disabled]

@bcoe
Contributor

bcoe commented Dec 21, 2021

Hey @Shabirmean, it's great news that we have an internal reproduction. Perhaps you could work with @Louis-Ye so we can squash this once and for all?

@MattGson

MattGson commented May 6, 2022

Two years later and this library is still unusable in production (its only real purpose). This is pretty embarrassing for an official Google product.

@jcrowdertealbook

Stumbled across this ticket yesterday, after a months-long adventure hunting down a memory leak in two production applications. We removed the debug-agent today, and the memory leaks are gone. So happy I found this issue. I would suggest increasing the priority, and possibly noting this ticket in the project's README.

Environment: node12 on GAE Flex

@wieringen

wieringen commented May 15, 2022

I stopped using the debugger in production for the same reason. It took me a while to figure out what the problem was. It would be a good idea to add a big warning in the README.

Environment: node16 on Cloud Run

@joshualyon

joshualyon commented May 23, 2022

Like others, we've identified @google-cloud/debug-agent as the culprit behind our memory leak. We updated our Node version from v10 to >v10 and started experiencing memory leaks (the issue appears to occur on 12, 14, and 16, but not on 10).

For the time being, we've disabled the debug agent and memory usage has become much more stable.

With the recent emails that went out about the discontinuation of Cloud Debugger by May 31, 2023, should we expect this leak to get resolved, or will the issue be abandoned? Reading between the lines, it seems the @google-cloud/debug-agent library will continue to act as the agent (e.g., it's being updated to support reporting to Firebase) and the snapshot-debugger project will act as the tool for setting breakpoints (snapshots/logpoints).

@yonirab

yonirab commented May 26, 2022

Unbelievable. I just stumbled across this thread and can also now confirm that @google-cloud/debug-agent appears to have been the cause of a horrible memory leak in our GKE production workloads. Disabling it seems to have solved the problem.

I was actually very happy with the functionality of the Google Cloud Debugger service, and was rather dismayed to read the deprecation notice. Little did I realize until today that, besides providing lots of nice debugging capabilities, it was also leaking memory horribly.

Environment: node 16 on GKE

@rednebmas

rednebmas commented Feb 22, 2023

This was causing a memory leak in my production application (the error was "Exceeded hard memory limit") and crashing the server. Disabling the debug-agent solved the issue.

I was using Google Cloud Platform App Engine Standard, node 16.

This package should come with a huge warning!! Would the maintainers be open to a PR that adds a large warning to the README?

For reference, here is the Stack Overflow question I posted: https://stackoverflow.com/questions/75454436/debugging-an-out-of-memory-error-in-node-js/

@mctavish
Contributor

mctavish commented Feb 1, 2024

I have been unable to reproduce this issue since switching to the Firebase backend. With the imminent archival, I'm going to go ahead and close this issue, but the reference to the potential issue still remains in the README.
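
For reference, pointing the agent at the Firebase backend looks roughly like the snippet below. The useFirebase and firebaseDbUrl option names are my reading of the later agent releases and the snapshot-debugger docs, so verify them against the README; the database URL and service name are placeholders.

// Sketch of starting the agent against the Firebase backend used by the
// snapshot-debugger. Option names should be checked against the README.
require('@google-cloud/debug-agent').start({
  useFirebase: true,
  firebaseDbUrl: 'https://my-project-cdbg.firebaseio.com',
  serviceContext: { service: 'my-service', version: '1.0.0' },
});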

@mctavish mctavish closed this as completed Feb 1, 2024