[1.6.1] 100% CPU on Ubuntu server, caused by client? #9796

Open
KoenLav opened this Issue Apr 4, 2018 · 107 comments

KoenLav commented Apr 4, 2018

We are experiencing consistent 100% CPU usage on one of our Meteor deployments. This issue seems to have appeared out of nowhere (not after deploying a new version).

What we already tried: creating a new server with the same specs and deploying to that server. When we switch our DNS over to the new server, at first all is well; but (we think) when a particular client connects, the Node process starts using 100% CPU.

We're deploying Meteor to an Ubuntu host using MUP (which instantiates a Docker container based on the meteord base image plus our bundle). The image has Node.js 8.9.4 and npm 5.6.0.

Any pointers as to how to pinpoint this issue would be greatly appreciated!

We believe this is the most interesting portion of the V8 profiler we ran on the logs (https://paste2.org/zPsHbDya):

   ticks parent  name
  2420054   92.5%  /opt/nodejs/bin/node
  219101    9.1%    /opt/nodejs/bin/node
  165262   75.4%      Builtin: ArrayMap
  57108   34.6%        LazyCompile: *changeField /bundle/bundle/programs/server/packages/ddp-server.js:287:25
  57067   99.9%          LazyCompile: *added /bundle/bundle/programs/server/packages/mongo.js:3650:23
  57042  100.0%            LazyCompile: *<anonymous> packages/meteor.js:1231:19
  48280   29.2%        LazyCompile: *EJSON.clone.v /bundle/bundle/programs/server/packages/ejson.js:646:15
  47783   99.0%          Function: ~Object.keys.forEach.key /bundle/bundle/programs/server/packages/ejson.js:697:26
  47783  100.0%            Builtin: ArrayForEach
  23827   14.4%        LazyCompile: *Object.keys.forEach.key /bundle/bundle/programs/server/packages/ejson.js:320:28
  23616   99.1%          LazyCompile: *added /bundle/bundle/programs/server/packages/mongo.js:3650:23
  23616  100.0%            LazyCompile: *<anonymous> packages/meteor.js:1231:19
  22206   13.4%        LazyCompile: *Object.keys.forEach.key /bundle/bundle/programs/server/packages/ejson.js:697:26
  22205  100.0%          Builtin: ArrayForEach
  15641   70.4%            LazyCompile: *_sendAdds /bundle/bundle/programs/server/packages/mongo.js:1913:23
   3405   15.3%            LazyCompile: *EJSON.clone.v /bundle/bundle/programs/server/packages/ejson.js:646:15
   2442   11.0%            LazyCompile: *<anonymous> /bundle/bundle/programs/server/packages/mongo.js:1782:34
   6566    4.0%        Builtin: ArrayForEach
   4420   67.3%          LazyCompile: *_sendAdds /bundle/bundle/programs/server/packages/mongo.js:1913:23
   4420  100.0%            Function: ~<anonymous> /bundle/bundle/programs/server/packages/mongo.js:1782:34
   2082   31.7%          LazyCompile: *<anonymous> /bundle/bundle/programs/server/packages/mongo.js:1782:34
   2082  100.0%            LazyCompile: *<anonymous> packages/meteor.js:1231:19
   3827    2.3%        RegExp: ([!#\\$%&'\\*\\+\\-\\.\\^_`\\|~0-9a-z]+)(?:=(?:([!#\\$%&'\\*\\+\\-\\.\\^_`\\|~0-9a-z]+)|"((?:\\\\[\\x00-\\x7f]|[^\\x00-\\x08\\x0a-\\x1f\\x7f"])*)"))? {9}
   3812   99.6%          Function: ~Object.keys.forEach.key /bundle/bundle/programs/server/packages/ejson.js:697:26
   3812  100.0%            Builtin: ArrayForEach
  25601   11.7%      LazyCompile: *ObserveHandle.stop /bundle/bundle/programs/server/packages/mongo.js:1955:41
  21756   85.0%        Function: ~<anonymous> /bundle/bundle/programs/server/packages/mongo.js:3663:25
  21756  100.0%          LazyCompile: *<anonymous> packages/meteor.js:1231:19
  19776   90.9%            Function: ~<anonymous> /bundle/bundle/programs/server/packages/ddp-server.js:1298:32
   1978    9.1%            LazyCompile: *<anonymous> /bundle/bundle/programs/server/packages/ddp-server.js:1298:32
   2805   11.0%        LazyCompile: *baseUniq /bundle/bundle/programs/server/npm/node_modules/lodash.union/index.js:742:18
   2805  100.0%          LazyCompile: *<anonymous> packages/meteor.js:1231:19
   2805  100.0%            Function: ~<anonymous> /bundle/bundle/programs/server/packages/ddp-server.js:1298:32
   1040    4.1%        LazyCompile: *<anonymous> /bundle/bundle/programs/server/packages/mongo.js:3663:25
   1040  100.0%          LazyCompile: *<anonymous> packages/meteor.js:1231:19
    535   51.4%            Function: ~<anonymous> /bundle/bundle/programs/server/packages/ddp-server.js:1298:32
    505   48.6%            LazyCompile: *<anonymous> /bundle/bundle/programs/server/packages/ddp-server.js:1298:32
   2944    1.3%      LazyCompile: *v.map.value /bundle/bundle/programs/server/packages/ejson.js:678:18
   2607   88.6%        Function: ~Socket._writeGeneric net.js:708:42
   2607  100.0%          LazyCompile: *Socket._write net.js:785:35
   1686   64.7%            LazyCompile: *ondata internal/streams/legacy.js:14:18
    495   19.0%            LazyCompile: *Connection.write /bundle/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/connection/connection.js:527:38
    419   16.1%            LazyCompile: *clearBuffer _stream_writable.js:469:21
    179    6.1%        Builtin: ArrayMap
     57   31.8%          LazyCompile: *EJSON.clone.v /bundle/bundle/programs/server/packages/ejson.js:646:15
     52   91.2%            Function: ~Object.keys.forEach.key /bundle/bundle/programs/server/packages/ejson.js:697:26
      4    7.0%            Function: ~<anonymous> /bundle/bundle/programs/server/packages/mongo.js:1879:36
      1    1.8%            Function: ~changeField /bundle/bundle/programs/server/packages/ddp-server.js:287:25
     55   30.7%          LazyCompile: *changeField /bundle/bundle/programs/server/packages/ddp-server.js:287:25
     55  100.0%            LazyCompile: *added /bundle/bundle/programs/server/packages/mongo.js:3650:23
     31   17.3%          LazyCompile: *Object.keys.forEach.key /bundle/bundle/programs/server/packages/ejson.js:697:26
     31  100.0%            Builtin: ArrayForEach
     20   11.2%          LazyCompile: *Object.keys.forEach.key /bundle/bundle/programs/server/packages/ejson.js:320:28
     20  100.0%            LazyCompile: *added /bundle/bundle/programs/server/packages/mongo.js:3650:23
      5    2.8%          Builtin: ArrayForEach
      5  100.0%            LazyCompile: *_sendAdds /bundle/bundle/programs/server/packages/mongo.js:1913:23
      4    2.2%          RegExp: ([!#\\$%&'\\*\\+\\-\\.\\^_`\\|~0-9a-z]+)(?:=(?:([!#\\$%&'\\*\\+\\-\\.\\^_`\\|~0-9a-z]+)|"((?:\\\\[\\x00-\\x7f]|[^\\x00-\\x08\\x0a-\\x1f\\x7f"])*)"))? {9}
      4  100.0%            Function: ~Object.keys.forEach.key /bundle/bundle/programs/server/packages/ejson.js:697:26
      4    2.2%          LazyCompile: *<anonymous> /bundle/bundle/programs/server/packages/mongo.js:1879:36
      4  100.0%            Function: ~SQp._run packages/meteor.js:851:21
      3    1.7%          LazyCompile: *callback zlib.js:447:20
      3  100.0%            Builtin: ArrayForEach
     60    2.0%        Function: ~<anonymous> /bundle/bundle/programs/server/packages/ddp-server.js:740:22
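
For anyone who wants to capture a similar tick profile, Node's built-in V8 profiler can produce one; a rough sketch (the entry-point name is just an example and depends on how the container starts the app):

# start the server with V8 profiling enabled; this writes an isolate-*.log file while the app runs
node --prof main.js
# reproduce the CPU spike, stop the process, then post-process the log:
node --prof-process isolate-*.log > profile.txt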

alanning commented Apr 6, 2018

Not sure about the cause, but perhaps try testing with a Galaxy deploy and see if you can recreate the problem there. That would add some extra data to help triangulate the root cause.

From the logs it looks like mongo communication is involved somehow so perhaps check the metrics on your mongo cluster and see if anything looks suspicious.

When you say "when a particular client connects" did you mean one particular customer or "one client" more like, "our desktop client"? If one particular client, have you been able to reproduce it in a test environment using their credentials/configuration?

KoenLav commented Apr 8, 2018

@alanning our server setup consists of multiple microservices (backoffice, operational, scheduled tasks, api, customer, etc.) and is hosted on DigitalOcean. Our database instance is hosted with MongoDB Atlas.

We managed to resolve the issue by either (1) shutting down ALL our production servers which were connected to MongoDB (at the same time) or (2) logging out all connected users (by updating all users). At this point we were throwing possible solutions at the wall, hoping something would stick.

Before doing so we tried several other options like shutting down our services one by one, but once the container which was experiencing the issue spun up again the issue reappeared (even when we deployed an instance on a completely new server). In addition to that we assumed some kind of infinite loop in our code (resulting from poorly written Collection Hooks) could be the issue, but none of these changes did any good.

I did not mean a specific type of client (like Cordova/browser/desktop) but rather one specific session on one device. We assumed this because the issue reappeared x seconds after restarting the instance, where x was longer if we left the microservice down for longer (resulting in a longer timeout between reconnect attempts).

Judging from the logs we concluded there would have to be a MongoDB-related issue as well; we checked the logs and found some (limited, non-recurring) duplicate key errors which appeared right before the issue started, but we have not been able to reproduce this behavior.

As the issue is solved for us now I'm going to leave it at this, but it might be worthwhile to investigate which issues could result from a client continuously writing objects with duplicate keys to MongoDB (as this is the most probable scenario judging from all the details).

@KoenLav KoenLav closed this Apr 8, 2018

KoenLav commented Apr 10, 2018

Today the issue re-appeared; we would really like to find the source of this issue.

Any thoughts on finding the cause would be appreciated!

@KoenLav KoenLav reopened this Apr 10, 2018

alanning commented Apr 11, 2018

What version of Meteor are you running?

I would work towards reproducing the issue by replicating that problem client's data/session in your test environment. Once you can reproduce it then you can start isolating different parts to find the culprit.

Does MongoDB Atlas provide any relevant metrics for the timeframe in question?

KoenLav commented Apr 11, 2018

Meteor 1.6.1.

We too would like to replicate the issue, but beyond suspecting that it has something to do with MongoDB, we don't even know how to reproduce it in production. We have a specific timestamp for when the issue (at least, the high CPU load resulting from it) started, so we might be able to get some more information from the logs.

To that effect I searched all our Collections looking for objects which were created/updated within 3 minutes of the issue appearing (the CPU usage increases from 5% to 100% in a matter of seconds), but to no avail.

In addition we looked through the MongoDB Atlas metrics, but they didn't show any significant increase in CPU, bandwidth, or connections whatsoever. This makes us think that, while the problem might be triggered by some specific connection/data, the effect is only measurable within the code. Could it be that Meteor thinks a document is updated all the time and attempts to publish changes to all clients, when it actually isn't?

Are there any other profilers we could run or logging we could enable to further narrow down the issue?

smeijer commented Apr 11, 2018

I would like to join in this investigation. I too am experiencing a similar issue. Production servers jump to 100% at random moments, but always during working hours.

First I thought I had a memory leak, or indeed an infinite loop somewhere. But after testing & debugging for a week, I still haven't found a thing.

Even putting the server under load with artillery doesn't bring it to its knees.

Are there any other profilers we could run or logging we could enable to further narrow down the issue?

Any pointers you can give us would be highly appreciated. Running Meteor 1.6.1 on Ubuntu via PM2; pm2 restart all "fixes" the issue for some time.

hwillson commented Apr 11, 2018

I have a sneaking suspicion that the issues listed here are caused by Fiber spikes. See #7747 for more details. This is a known issue and still happens with current day Meteor (the repro in that issue still shows the Fiber spike after updating to 1.6.1.1). That issue was closed due to inactivity, but if anyone is interested in digging into this further, the details in that issue should help.

KoenLav commented Apr 11, 2018

@hwillson interesting!

In our particular app clients are subscribed to roughly 15 collections, and an insert into one collection causes cascading inserts and updates on other collections.

Is there some way to easily try out the patch which sets the fiber pool size?

Also, could you take a peek at the logs (or is your suspicion already based on those?) and share what you make of them?

To me it looks like Meteor thinks that the data is constantly changing, while in actuality this is not the case.
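
From what I can tell the knob in question is Fiber.poolSize from node-fibers (default 120), so presumably trying it out would look something like this in our server startup code (just a sketch; the value is an arbitrary guess, not an official recommendation):

import Fiber from 'fibers';

// node-fibers keeps up to Fiber.poolSize fibers around for reuse (default 120);
// raising it means excess fibers are recycled instead of repeatedly created and torn down.
Fiber.poolSize = 10000;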

KoenLav commented Apr 11, 2018

@smeijer just to make sure we are experiencing the same issue: the CPU load increases to 100% within a matter of seconds and remains there.

The issue disappeared upon restarting Meteor (but in our initial scenario it resurfaced right away, whereas now it has been gone for a day).

Do you also make use of quite a few subscriptions, or update a significant number of documents in a collection to which some clients are subscribed at some point?

smeijer commented Apr 11, 2018

@KoenLav,

Yes. The CPU increases to 100% within a matter of seconds. Below you'll find a CPU chart of the past two weeks. I've tried letting it run to see if it would come down by itself, but that didn't happen.

(50% is 100% of 1 of the 2 cores)
[CPU usage chart of the past two weeks]

There is no other process running besides the Meteor bundle and pm2, and you can see on the chart that there is enough headroom for normal operations.

Yes, the issue disappeared upon restarting Meteor with pm2 restart x (where x is 0 or 1, depending on which instance is stuck). I'm running a total of 6 instances divided over 3 servers. I've never had two servers "crash" at the same time, but after a restart of the stuck one, there is a bigger chance that another instance gets stuck in the next couple of hours. This isn't a given, though.

Regarding the subscriptions: I'm in the process of migrating away from publish/subscribe/methods towards full GraphQL. So there aren't a lot of subscriptions in my application, but the ones that are still there use quite a bit of data (GeoJSON transport).

I'm not sure what you consider significant, but I wouldn't say so. I have a write-heavy application, and yes, these are the "busy hours". But you can see on my CPU chart that there should be enough headroom.

smeijer commented Apr 11, 2018

Is there a way I can count / log the number of active fibers, to see if / when I reach the 120 mentioned in the Fiber thread?

hwillson commented Apr 11, 2018

@KoenLav It's just a hunch - I'll take a closer look at the logs shortly.

@smeijer Try logging Fiber.fibersCreated:

import Fiber from 'fibers';
console.log(`Current fiber count: ${Fiber.fibersCreated}`);

KoenLav commented Apr 11, 2018

@hwillson thanks! One question: does this log the number of active fibers or the number of fibers created (all time)?

I believe it to be the latter, according to this: laverdet/node-fibers#85

But I'm not entirely sure how Meteor handles Fibers locally, so these numbers might in fact be the same thing.

@smeijer this might also be interesting: laverdet/node-fibers#91

KoenLav commented Apr 11, 2018

Added:

import Fiber from 'fibers';

setInterval(() => {
  console.log(`Current fiber count: ${Fiber.fibersCreated}`);
}, 60 * 1000);

To our /server/main.js file in production. Let's see what happens.

smeijer commented Apr 11, 2018

I'll read those two issues tomorrow, but to share some results already. After a cold start, and just clicking away in my local dev environment:

fibersCreated 331 
poolsize 120

heapUsed 108.173 MiB 
heapTotal 118.438 MiB 
rss 148.547 MiB 

I don't see fibersCreated go down after closing my browser session, so at first sight it seems to be an all-time counter. But it got there so fast that it could still be a problem; I have no idea. A restart, and opening a specific page directly, resulted in 271 fibers under the fibersCreated property (still not saying anything about whether they are being disposed/reused correctly).

I do use async/await heavily in the GraphQL part, and trigger 'fire and forget' actions there (Mongo updates without waiting for the results).

import numeral from 'numeral';
import Fiber from 'fibers';

import logger from '../logger';

// Log memory usage and fiber statistics every 5 seconds.
setInterval(() => {
  const { rss, heapTotal, heapUsed } = process.memoryUsage();
  const { fibersCreated, poolSize } = Fiber;

  logger.trace(
    { rss, heapTotal, heapUsed, fibersCreated, poolSize },
    'heapUsed %s heapTotal %s rss %s fibers %d poolSize %d',
    numeral(heapUsed).format('0.000 ib'),
    numeral(heapTotal).format('0.000 ib'),
    numeral(rss).format('0.000 ib'),
    fibersCreated,
    poolSize,
  );
}, 5000).unref();

xet7 commented Apr 11, 2018

I'm using Meteor 1.6.0.1, having same problems, adding this comment to follow this issue.

KoenLav commented Apr 11, 2018

@xet7 does your problem also occur randomly and do you suspect it to be related to subscriptions/observing changes?

Do you have any more information that is relevant from the other thread(s)?

xet7 commented Apr 11, 2018

@KoenLav

The setup is as mentioned at wekan/wekan-mongodb#2 (comment).

Also, when tested with 4 EC2 instances without Docker, the result is the same: node has 100% CPU usage.

Oplog should be set up correctly, but I'm not sure.

The problem occurs with Wekan (https://wekan.github.io) and all our other Meteor.js projects, both on powerful servers and on small Raspberry Pi projects, so all new projects are started with something other than Meteor.js because of these CPU usage problems.

KoenLav commented Apr 12, 2018

@hwillson could you provide some more information on what sort of scenarios could cause such a Fiber spike?

This could help us find the offending code portions in our application code.

xet7 commented Apr 12, 2018

Is this still the same fiber issue that can be reproduced with this?
#7747 (comment)

smeijer commented Apr 12, 2018

So I guess this is a way to track the currently active fibers:

import Fiber from 'fibers';

const runFiber = Fiber.prototype.run;

// Wrap Fiber.prototype.run to keep a rough count of fibers that have been
// started but have not yet finished.
Fiber.fibersActive = 0;
Fiber.prototype.run = function runWithCount(...args) {
  const wasStarted = this.started;

  try {
    runFiber.apply(this, args);
  } finally {
    Fiber.fibersActive +=
      wasStarted && !this.started ? -1 : !wasStarted && this.started ? +1 : 0;
  }
};

setInterval(() => {
  const { fibersActive, fibersCreated, poolSize } = Fiber;

  console.log(`fibersActive ${fibersActive}/${poolSize} fibersCreated ${fibersCreated}`);
}, 5000).unref();

Logging statements like:

fibersActive 1/120 fibersCreated 3
fibersActive 5/120 fibersCreated 8
fibersActive 10/120 fibersCreated 17
fibersActive 13/120 fibersCreated 17
fibersActive 21/120 fibersCreated 264
fibersActive 21/120 fibersCreated 386

I had a case where fibersActive was -1, but I guess there were already active fibers before the wrap was applied. Not a big issue for now, as it still gives useful insights, just with an offset of 1 or 2.

xet7 commented Apr 12, 2018

I tried the repro code here:
https://github.com/jparyani/meteor-fiber-repro

  1. Upgraded to Meteor 1.6.0.1:
     meteor update --release 1.6.0.1
  2. Added babel:
     meteor npm install --save babel-runtime
  3. Started it:
     meteor run
  4. Web browser at http://localhost:3000 shows this web interface:
     [repro-app screenshot]
  5. Clicked the left button, then the right button; these are visible as the left and right spikes in the orange CPU usage area:
     [repro-cpu screenshot]
  6. On the console there is this log:
$ meteor run
[[[[[ ~/repos/meteor-fiber-repro ]]]]]        

=> Started proxy.                             
=> A patch (Meteor 1.6.1.1) for your current release is available!
   Update this project now with 'meteor update --patch'.
=> Started MongoDB.                           
I20180412-11:05:12.832(3)? 4                  
=> Started your app.

=> App running at: http://localhost:3000/
I20180412-11:05:17.832(3)? 5
I20180412-11:05:22.833(3)? 5
I20180412-11:05:27.834(3)? 5
I20180412-11:05:32.835(3)? 5
I20180412-11:05:37.835(3)? 5
I20180412-11:05:42.835(3)? 5
I20180412-11:05:47.835(3)? 5
I20180412-11:05:52.835(3)? 5
I20180412-11:05:57.836(3)? 5
I20180412-11:06:02.837(3)? 5
I20180412-11:06:07.837(3)? 5
I20180412-11:06:12.837(3)? 5
I20180412-11:06:17.837(3)? 5
I20180412-11:06:22.837(3)? 5
I20180412-11:06:27.838(3)? 5
I20180412-11:06:32.838(3)? 5
I20180412-11:06:37.838(3)? 5
I20180412-11:06:42.841(3)? 5
I20180412-11:06:47.842(3)? 5
I20180412-11:06:52.843(3)? 5
I20180412-11:06:57.844(3)? 5
I20180412-11:07:02.845(3)? 5
I20180412-11:07:07.845(3)? 5
I20180412-11:07:12.846(3)? 5
I20180412-11:07:17.847(3)? 5
I20180412-11:07:22.848(3)? 5
I20180412-11:07:27.850(3)? 5
I20180412-11:07:32.850(3)? 5
I20180412-11:07:38.680(3)? 5
I20180412-11:07:43.681(3)? 5
I20180412-11:07:48.683(3)? 5
I20180412-11:07:53.684(3)? 5
I20180412-11:07:58.685(3)? 5
I20180412-11:08:03.815(3)? 540
I20180412-11:08:12.340(3)? 4420
I20180412-11:08:15.584(3)? set complete
I20180412-11:08:17.384(3)? 9155
I20180412-11:08:22.397(3)? 9155
I20180412-11:08:27.399(3)? 9155
I20180412-11:08:32.398(3)? 9155
I20180412-11:08:34.261(3)? unset complete
I20180412-11:08:37.398(3)? 9155
I20180412-11:08:42.399(3)? 9155
I20180412-11:08:47.399(3)? 9155
I20180412-11:08:52.399(3)? 9155
I20180412-11:08:57.400(3)? 9155
I20180412-11:08:58.500(3)? set complete
I20180412-11:09:03.110(3)? 12354
I20180412-11:09:08.499(3)? 17928
I20180412-11:09:13.500(3)? 17959
I20180412-11:09:18.502(3)? 17959
I20180412-11:09:23.501(3)? 17959
I20180412-11:09:28.502(3)? 17959
I20180412-11:09:33.503(3)? 17959
I20180412-11:09:38.504(3)? 17959
I20180412-11:09:43.505(3)? 17959

smeijer commented Apr 12, 2018

Running artillery on my local environment, I'm simulating the HTTP GET request to load an initial page, which uses onPageLoad to preload a Redux state (only fetching the current user by login token), and then fetching data through the matching GraphQL endpoint. Something like:

- http.get /blog/$id
- graphql.query blog(id: $id)
- http.get /blog/$id/comments
- graphql.query comments(blogId: $id)

It turns out the number of fibers goes through the roof. Although it does return below the pool size, the used memory does not.

[chart of fibersCreated and heap usage during the artillery run]

I've scaled the fibersCreated down by a factor of 1000 to keep the chart readable.

Worth mentioning: I killed artillery before it finished because everything started to become terribly slow. I'll try tuning the numbers down, but I think it will be useful to run this code in production to get some real data.

smeijer commented Apr 13, 2018

So I've been logging production for a day now, and the number of active fibers has settled around ~25 (based on the code posted above).

In the meantime, I've also scaled one server up. I had been running the same number of instances as the server has cores (pm2 instances: max), and I've now scaled one server to keep 1 spare CPU (pm2 instances: -1).

I've been playing a bit with the --max_old_space_size flag on my dev machine, and when it's too low, I get the same symptoms. So it looks like garbage collection simply becomes too much for the CPU to handle.
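
For anyone who wants to try the same experiment under pm2, the flag can go in the ecosystem file; a rough sketch (the script name and the 2048 MiB value are just examples, not recommendations):

// ecosystem.config.js
module.exports = {
  apps: [{
    name: 'app',
    script: 'main.js',
    node_args: '--max_old_space_size=2048', // size of the old-generation heap in MiB
    instances: -1,                          // all cores minus one, as described above
    exec_mode: 'cluster',
  }],
};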

That also fits another scenario. To debug this issue, I've implemented a pmx trigger to create a heap snapshot and a heap profile on request. But when requesting a heap snapshot, there is a big chance that the CPU gets stuck. Not all the time, but 1 out of 4 for sure.

As I've been restarting the services quite often these days to add debugging options/logging statements, the services haven't gotten stuck in the last 24 hours. I guess I have to wait now.

KoenLav commented Apr 13, 2018

Stable at 37 or 38 fibers here.

KoenLav commented Jun 25, 2018

FYI:

The V8 patch is on its way to Node.js 8.x:

nodejs/node#21529

jimrandomh commented Aug 1, 2018

I've opened an issue for updating Meteor's default to Node 8.12 as soon as they've made a release: #10117. In the meantime, you can make yourself an unofficial Node binary and put it in your .meteor directory.

dongtanhuy commented Aug 8, 2018

@jimrandomh How can you change the version of Meteor's internal Node.js? Can you show me the instructions?

xet7 commented Aug 8, 2018

@dongtanhuy

Meteor does not have an internal node; first you install node, then Meteor.

There is a prebuilt in-progress version of Node v8.12:
nodejs/node#21593 (comment)

For example, you can unarchive it and replace the node binary with the new one.

Look at where node is installed:

which node

Then replace it:

sudo su
cd /usr/bin
# check that it's a binary, not a symlink, with a directory listing
ls -lah | grep node
mv /home/user/Downloads/node node
chown root:root node
chmod a+x node

smeijer commented Aug 8, 2018

Just wanted to let you know that I haven't experienced this issue anymore since I replaced a lot of Meteor publications with Apollo/GraphQL requests.

The reactivity is thereby lost, but in my case a 30-second poll is also fine. In the places where I do require reactivity, I only publish a single timestamp. This timestamp is passed through to Apollo, which triggers a refetch when the timestamp changes (sketched below).

The discussion here has also been helpful for improving performance here and there.
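
For anyone who wants to copy the timestamp trick, a rough sketch of the idea (the collection, publication and query names are made up for illustration):

import { Meteor } from 'meteor/meteor';
import { Tracker } from 'meteor/tracker';
// ChangeMarkers, apolloClient and REPORTS_QUERY are assumed to exist elsewhere in the app.

// server: publish only a tiny "last changed" marker instead of the heavy documents
Meteor.publish('reports.lastChanged', function () {
  return ChangeMarkers.find({ _id: 'reports' }, { fields: { updatedAt: 1 } });
});

// client: re-run the GraphQL query whenever the published timestamp changes
Tracker.autorun(() => {
  Meteor.subscribe('reports.lastChanged');
  const marker = ChangeMarkers.findOne('reports');
  if (marker) {
    apolloClient.query({
      query: REPORTS_QUERY,
      variables: { since: marker.updatedAt },
      fetchPolicy: 'network-only', // bypass the cache so the data is actually refetched
    });
  }
});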

KoenLav commented Aug 8, 2018

@smeijer we also haven't experienced this issue anymore after rewriting some of our (long-running) publications (or actually the subscriptions); we made all clients resubscribe to a specific publication (which receives a lot of updates) every 6 hours. A rough sketch of that workaround is below.
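
On the client it looks roughly like this (a sketch; 'orders.open' is a made-up publication name and error handling is omitted):

import { Meteor } from 'meteor/meteor';

let handle = Meteor.subscribe('orders.open');

// every 6 hours, stop the long-running observer on the server and start a fresh one
setInterval(() => {
  handle.stop();
  handle = Meteor.subscribe('orders.open');
}, 6 * 60 * 60 * 1000);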

jimrandomh commented Aug 8, 2018

@dongtanhuy For my development system, I found that meteor was not using the system install of node, but one that it kept somewhere in ~/.meteor. I found it with find ~/.meteor -name node and replaced it with a pre-release node 8.12. This is not an ideal way to do it, but it works well enough as a stopgap until node 8.12 is released.

dongtanhuy commented Aug 9, 2018

@jimrandomh Can you explain in more detail?
I searched the .meteor folder with find ~/.meteor -name node and the result is a very long list of packages containing folders named node. What should I do next?
Here is a part of my search result:

/Users/juliandong/.meteor/packages/babel-compiler/.6.14.1.14hmmwc++os+web.browser+web.cordova/npm/node_modules/babel-generator/lib/node
/Users/juliandong/.meteor/packages/babel-compiler/.6.14.1.14hmmwc++os+web.browser+web.cordova/npm/node_modules/reify/node
/Users/juliandong/.meteor/packages/babel-compiler/.7.0.6.lb792n.450y++os+web.browser+web.cordova/npm/node_modules/@babel/generator/lib/node
/Users/juliandong/.meteor/packages/babel-compiler/.7.0.6.lb792n.450y++os+web.browser+web.cordova/npm/node_modules/reify/node
/Users/juliandong/.meteor/packages/babel-compiler/.7.1.1.12085lg.zffz++os+web.browser+web.browser.legacy+web.cordova/npm/node_modules/@babel/generator/lib/node

jimrandomh commented Aug 10, 2018

@dongtanhuy On my system (Linux) it was in ~/.meteor/packages/meteor-tool/.1.5.2.ecibw9++os.linux.x86_64+web.browser+web.cordova/mt-os.linux.x86_64/dev_bundle/bin/node. I'm not sure where it would be on MacOS.

stelund commented Oct 2, 2018

I have a different suggestion for the CPU hogging issue. At the bottom of the description of https://www.npmjs.com/package/fibers there is a note about the GC trying to unwind a fiber and causing an endless loop. In doc_fetcher.js there is a while loop that does an ejson.clone and then calls the callback, which swallows errors.

I have some small evidence for this from using gdb to take core dumps a few times on CPU-hogging instances. I then load them into llnode and look at the stack. Every time, it has been busy in the ejson clone call inside the doc fetcher.

Evidence that this loop could occur is issue #10032. To my knowledge, however, use of custom fields is not the cause of my infinite loop.
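
(For anyone who wants to repeat this, the workflow is roughly the following sketch; the PID and paths are placeholders and llnode has to be installed separately.)

# take a core dump of the busy Node process without killing it
gdb -p <node-pid> --batch -ex 'generate-core-file /tmp/node.core'

# open the dump with llnode and print the mixed JS/native backtrace
llnode /usr/bin/node -c /tmp/node.core
(llnode) v8 bt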

AndrewMorsillo commented Nov 27, 2018

@kentonv Now that the fiber fix has landed in Node 8.12.0 is it still necessary to increase the pool size for meteor?

kentonv commented Nov 27, 2018

@AndrewMorsillo Increasing the pool size was to work around a different issue, involving a memory leak when the number of fibers goes over the pool size. I think that was fixed a long time ago. So no, it shouldn't be necessary to increase the pool size, although it may still improve performance by a small amount if you commonly have more fibers than the default pool size.

KoenLav commented Dec 1, 2018

Are others still experiencing this specific issue, or something like it?

=====

Just to inform everyone following this issue about our specific problem: it hasn't surfaced anymore since we refactored a long-running subscription which receives a lot of updates.

Previously we subscribed to this collection at startup, and clients could remain running that same subscription for weeks if not months. Now we refresh the subscription every 6 hours (by stopping it and resubscribing).

We've only noticed CPU hogging once more: when a client fired off a method to generate a report on the server based on a huge amount of data. Restarting the server wouldn't do any good; the client would simply reconnect and re-run the method for which it had not yet received any data. Luckily we managed to isolate the client which performed this action relatively quickly, and the issue disappeared the second I closed the browser window on their machine.

Since then we have isolated the reporting capabilities in a separate micro-service to allow us to scale these instances independently and isolate them from our operational instances.

alanning commented Dec 2, 2018

Since then we have isolated the reporting capabilities in a separate micro-service to allow us to scale these instances independently and isolate them from our operational instances.

This is great advice, independent of Meteor. It's not really related to this GitHub issue, but in the past my company also didn't isolate our reporting properly and had a case of a report doing a SELECT N+1, which pegged the CPU for whichever Meteor web server got the request. We identified it via APM (Kadira). The exhibited symptoms exactly matched the ones @KoenLav mentioned. For us it was exacerbated by Meteor's behavior of auto-wrapping async calls to be compatible with the fiber context, but the underlying cause was just poor query design.
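
To make the shape of that concrete, the anti-pattern looked roughly like this (hypothetical collections, not our real code):

// one query for the parent documents...
const orders = Orders.find({}, { limit: 1000 }).fetch();

// ...then one extra query per document: the classic N+1
const report = orders.map(order => ({
  ...order,
  customer: Customers.findOne(order.customerId),
}));

// batching the lookup into a single query avoids the per-document round trips
const customersById = new Map(
  Customers.find({ _id: { $in: orders.map(o => o.customerId) } })
    .fetch()
    .map(c => [c._id, c])
);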

Summary: identify the core service your app is providing and make sure it is running independently. Put everything non-critical, including reporting, on some other system.

evolross commented Dec 3, 2018

Are there any good tutorials or guides on how to create separate services or micro-services for Meteor?

Presently my app has two components, an admin app and a client app, and they both share the same server collection code so they can both work with all the required collections, call Meteor Methods, etc.

But how does one go about writing a server-only service? Does it just use the server collection code it requires and omit any of the client packages? Any advice would be great. Sorry to post this here, but I was following the conversation.

benjamn commented Dec 6, 2018

Can folks try updating to the latest Meteor 1.8.1 beta to see if this is fixed?

meteor update --release 1.8.1-beta.8

Note that this version uses Node 8.14.0 and promise@0.11.2 (the Meteor package), which uses meteor-promise@0.8.7 (an npm package), which includes this commit by @VeselyT, which should allow excess (non-recycled) fibers to be garbage collected.

See also #10359.

KoenLav commented Dec 7, 2018

@benjamn the logic being that the issue in this thread also creates a lot of Fibers which are never returned to the pool.

Maybe a nice start to verify this would be running the reproduction listed in the older Fiber spike issue mentioned by hwillson?

vinaykevadia commented Jan 4, 2019

@KoenLav thanks for your effort on this.
I am running Meteor on EC2 (with the MUP Docker deployment tool) and MongoDB Atlas.
I am facing the same problem: an unexpected CPU hang on EC2, and the only option is to reboot to bring it back to a normal state.
Did you guys find any solution? I was going through this discussion and was not able to find any particular solution for this issue.

KoenLav commented Jan 4, 2019

@vinaykevadia we managed to mitigate the issue in one of our services by limiting the amount of 'hard calculation' we are performing there.

We are however currently experiencing this issue again, in another service where reports are generated based on a decent amount of data. I'm working on profiling this specific issue now.

The only thing that has helped us so far is throwing solutions at the wall and hoping something sticks.

smeijer commented Jan 4, 2019

Did you guys find any solution?

I solved it by removing almost all publications and replacing them with polling. Try to find your most expensive publication and get rid of that one first.

#9796 (comment)
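
In practice the replacement looks roughly like this (a sketch with made-up names; the 30-second interval matches what I mentioned earlier):

import { Meteor } from 'meteor/meteor';
import { check } from 'meteor/check';

// server: a plain method instead of a publication (Shipments is a hypothetical collection)
Meteor.methods({
  'shipments.since'(since) {
    check(since, Date);
    return Shipments.find({ updatedAt: { $gte: since } }).fetch();
  },
});

// client: poll every 30 seconds instead of keeping a live observer open on the server
let lastSync = new Date(0);
setInterval(() => {
  Meteor.call('shipments.since', lastSync, (error, docs) => {
    if (!error) {
      lastSync = new Date();
      // merge `docs` into local state (Redux store, local collection, etc.)
    }
  });
}, 30 * 1000);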

smeijer commented Jan 4, 2019

@KoenLav, I am a bit curious about your situation though.

Previously we subscribed to this collection at startup, and clients could remain running that same subscription for weeks if not months. Now we refresh the subscription every 6 hours (by stopping it and resubscribing).
...
Luckily we managed to isolate the client which performed this action relatively quickly, and the issue disappeared the second I closed the browser window on their machine.

Do I understand correctly that your users (/clients) keep their browser open for several months? No reboot, or even a simple F5 browser refresh?

I think that if something like this can indeed be a problem, Meteor should add (optional) refreshing of subscriptions on a given interval to the core, assuming I understand it correctly and that is indeed what fixed your problem.

KoenLav commented Jan 6, 2019

@smeijer we provide a cash register system for hospitality establishments based on a combination of Meteor, React, Electron and Cordova (slowly migrating to React Native). Some of these cash registers (behind a bar, for instance) will just be "always on". Hence subscriptions are never refreshed (unless we do so manually).

We used to have a monolithic application, so reports and the like were also generated in the same application (which was just poor design), and this may have been the cause of the issue there (rather than the long-standing publications).

Now we have migrated these reports to the backoffice service, which is accessed via the browser (rather than Electron/Cordova), and the problem reappeared. Hence it's unlikely to have anything to do with (long-standing) publications (in this specific case). We have isolated the cause of the issue, which is generating a big (but not huge) report based on sales data for the past year, and have instrumented the application to allow for easy profiling using the qualialabs:profile package by @veered.

In my opinion the issue isn't that the CPU spikes to 100% (I mean, that can just happen, especially on computationally heavy tasks), but rather the fact that it remains at 100% CPU even after completing the task (which seems to be happening right now).

I intend to provide a profile of the application while it is performing the heavy computations (which I don't think will yield many, if any, interesting results) and one after it has finished the heavy computations (which I expect will show some kind of endless loop in Meteor/Fiber/Node/V8 internals).

evolross commented Jan 7, 2019

This isn't meant to be a solution to the problem, but as a work-around and perhaps a more optimized approach to generating these reports, have you considered using Meteor Methods to gather and return the data? Why use pub-sub and "real-time" data for a static report that is pulling a year's worth of data?

KoenLav commented Jan 7, 2019

@evolross this may not have been clear from my earlier comment, but generating the report is the issue (not subscribing to it afterwards).

The issue also appears when we only generate and don't even subscribe at a later point in time.

Your point is valid nonetheless, from an architectural perspective, but not relevant for this specific issue.

Besides, the end result of aggregated data is still less than 128 KB, so subscribing shouldn't be an issue (although I agree it's also not necessary).
